Stats Poller Auth Drift

The Go control plane (api/) polls each active relay pool server’s /stats endpoint every 3s to populate per-link bitrate stats for the OBS dock. The Rust SRTLA receiver gained Bearer-token auth on /stats at some point after the poller was written. The poller was never patched to send the header. Result: every poll returned 401, the cache stayed empty, and the only visible signal was a wall of stats_poller: fetch <id> err=status 401 lines in docker logs glb-api. No metric, no alert, no user-visible breakage of session activation, only quietly missing per-link bitrate data in the dock.

Symptom

ssh advin "docker logs --since 24h glb-api 2>&1 | grep -c 'stats_poller.*err=status 401'"
# 42266

Both kc1 (Advin local) and lv1 (Frantech LV) were fetched at every 3s tick, and each fetch produced one error. The errors went to stderr (the log.Printf default), so any audit using docker logs ... 2>/dev/null | grep stats_poller saw zero output and concluded the poller was healthy. This was the first reason the issue stayed silent.

Root cause

api/internal/relay/stats_poller.go:fetchStats built a plain GET http://<ip>:5080/stats with no Authorization header and no token query param. The Rust receiver at telemy-srtla/src/stats.rs:114-130 (legacy directory name; same binary baked into ghcr.io/michaelpentz/srtla-receiver:latest) checks the header against the --stats_token CLI arg or the SRTLA_STATS_TOKEN env var with a constant-time compare and returns 401 on a miss.
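The relay-side behavior described above can be approximated in Go for reasoning or for a fake relay. This is a sketch only: the real check lives in the Rust receiver, and requireBearer, statusFor, and the "secret" token are hypothetical names chosen for illustration. It also assumes a token is always configured; how the receiver behaves without one is not covered here.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// requireBearer is a hypothetical Go stand-in for the relay-side check: it
// rejects any request whose Authorization header is not "Bearer <token>".
func requireBearer(token string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		const prefix = "Bearer "
		h := r.Header.Get("Authorization")
		if !strings.HasPrefix(h, prefix) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		got := strings.TrimPrefix(h, prefix)
		// ConstantTimeCompare returns 1 only when both slices are equal; its
		// running time depends on the length, not on where the bytes differ.
		if subtle.ConstantTimeCompare([]byte(got), []byte(token)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor sends one in-memory GET /stats through the guarded handler and
// returns the resulting HTTP status code.
func statusFor(token, authHeader string) int {
	stats := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"groups":[]}`)
	})
	req := httptest.NewRequest(http.MethodGet, "/stats", nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	requireBearer(token, stats).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor("secret", ""))              // no header, as the old poller sent
	fmt.Println(statusFor("secret", "Bearer secret")) // header carrying the shared secret
}
```

A request without the header is rejected before the stats handler runs, which matches the 401 on every poll seen in the logs.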

The shared secret GLB_RELAY_SHARED_KEY in /opt/golivebro/.env.glb already matched the relay token on both kc1 and lv1. The wiring was the only break.

Initial mis-hypothesis

The first investigation pass assumed Advin’s srtla-receiver had no SRTLA_STATS_TOKEN env var and was therefore not enforcing auth at all, because docker exec srtla-receiver env | grep STATS_TOKEN returned nothing. The env var is indeed unset there, but the token is configured elsewhere: /opt/srtla-receiver/supervisord-override.conf passes it as a CLI arg:

command=/bin/sh -c "sleep 3 && /bin/logprefix /usr/local/bin/srtla_rec --srtla_port=5000 ... --stats_token=<value matching GLB_RELAY_SHARED_KEY> --geoip_db=..."

So both relays do enforce auth, with the same shared secret as GLB_RELAY_SHARED_KEY. Only the Go side needed a change.

Fix

api/internal/relay/stats_poller.go:

type StatsPoller struct {
    store      poolServerLister
    httpClient *http.Client
    statsToken string
    // ...
}
 
func NewStatsPoller(store poolServerLister, statsToken string) *StatsPoller {
    return &StatsPoller{
        store:      store,
        httpClient: &http.Client{Timeout: 5 * time.Second},
        statsToken: statsToken,
        cache:      make(map[string][]model.PerLinkGroup),
    }
}
 
func (p *StatsPoller) fetchStats(ctx context.Context, url string) ([]model.PerLinkGroup, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    // Send the shared secret only when one is configured.
    if p.statsToken != "" {
        req.Header.Set("Authorization", "Bearer "+p.statsToken)
    }
    // ...
}

fetchStats was also refactored to take a URL string instead of an IP. The caller in poll() builds http://<ip>:5080/stats and passes it. This is a side-effect of making the function testable against httptest.NewServer, not a behavior change.
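One way the caller could build that URL is sketched below. The helper name statsURL is hypothetical, and the actual code in poll() may simply format a string. The sketch uses net.JoinHostPort so that an IPv6 literal would be bracketed correctly; the port and path come from the URL shape quoted above.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// statsURL is a hypothetical helper that builds the relay stats URL for a
// pool server IP. JoinHostPort adds brackets around IPv6 literals.
func statsURL(ip string) string {
	u := url.URL{
		Scheme: "http",
		Host:   net.JoinHostPort(ip, "5080"),
		Path:   "/stats",
	}
	return u.String()
}

func main() {
	fmt.Println(statsURL("10.0.0.1")) // example address, not a real relay
	fmt.Println(statsURL("::1"))
}
```

Keeping URL construction outside fetchStats is what lets a test pass the httptest server URL directly.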

api/cmd/api/main.go:60:

statsPoller := relay.NewStatsPoller(st, cfg.RelaySharedKey)

The if p.statsToken != "" { ... } guard preserves the unit-test path and the fake-relay dev path, both of which run without a real shared secret.

Tests

api/internal/relay/stats_poller_test.go adds two tests against httptest.NewServer:

  • TestStatsPoller_SendsBearerAuthHeader — assert Authorization: Bearer <token> reaches the relay.
  • TestStatsPoller_OmitsHeaderWhenTokenEmpty — assert no header is set when the token is empty (preserves dev path).

Full suite (go test ./...) passes.
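The two assertions can be illustrated with a self-contained sketch. It does not reproduce stats_poller_test.go: fetchWithToken and observedAuthHeader are hypothetical helpers that mirror the token guard from the fix and capture what an httptest.NewServer instance receives, and "s3cret" is a placeholder token.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// fetchWithToken mirrors the guard from the fix: the Authorization header
// is set only when a token is configured. Hypothetical stand-in for fetchStats.
func fetchWithToken(ctx context.Context, client *http.Client, url, token string) (int, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// observedAuthHeader starts a throwaway server, performs one fetch, and
// returns the Authorization values the server saw (empty when none was sent).
func observedAuthHeader(token string) []string {
	seen := make(chan []string, 1)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen <- r.Header.Values("Authorization")
		fmt.Fprint(w, `{"groups":[]}`)
	}))
	defer srv.Close()

	if _, err := fetchWithToken(context.Background(), srv.Client(), srv.URL, token); err != nil {
		panic(err)
	}
	return <-seen
}

func main() {
	fmt.Printf("with token:    %q\n", observedAuthHeader("s3cret"))
	fmt.Printf("without token: %q\n", observedAuthHeader(""))
}
```

Passing the captured header through a buffered channel avoids sharing a variable between the handler goroutine and the caller.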

Deploy

Standard atomic-swap to Advin per CLAUDE.md:

(cd api && GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -trimpath -o dist/glb-api ./cmd/api/)
scp api/dist/glb-api advin:/opt/golivebro/bin/glb-api.new
ssh advin "chmod +x /opt/golivebro/bin/glb-api.new && mv /opt/golivebro/bin/glb-api.new /opt/golivebro/bin/glb-api && cd /opt/golivebro && docker compose restart glb-api"

Backup of pre-fix binary at /opt/golivebro/bin/glb-api.bak-20260418.

Acceptance verification (post-restart)

ssh advin "docker logs --since 2m glb-api 2>&1 | grep -c 'stats_poller.*err=status 401'"
# expect 0 (excluding any pre-restart lines still in window)
 
ssh advin "docker exec glb-api wget -qO- --header=\"Authorization: Bearer \$TOKEN\" http://kc1.relay.golivebro.com:5080/stats | head -c 200"
# expect {"groups":[...]} or {"groups":[]}, not 401
 
ssh advin "docker exec glb-api wget -qO- http://localhost:8080/metrics | grep -c glb_dns_write_errors_total"
# expect >= 1 (DNS deploy from 2026-04-17 still intact)

All three confirmed on 2026-04-18.

Why it stayed silent

Same root failure mode as the DNS error handling audit: the only signal was a log.Printf line, with no metric, no alerting, and no user-visible breakage that would have prompted investigation. The dock simply showed no per-link breakdown.

A glb_relay_stats_fetch_total{status="ok|error",http_status="..."} counter, analogous to glb_dns_write_errors_total, would have surfaced this within minutes of the relay-side auth being enabled. It is a recommended follow-up and out of scope for this fix.
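The shape of such a counter is sketched below without dependencies. This is an assumption-laden illustration: fetchCounter, observe, and render are hypothetical names, only HTTP 200 is treated as "ok", and the real service would presumably register the counter with whatever metrics library already backs glb_dns_write_errors_total rather than rendering text by hand.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// fetchKey is one label combination of the proposed counter.
type fetchKey struct {
	status     string // "ok" or "error"
	httpStatus int
}

// fetchCounter is a hypothetical, dependency-free stand-in for a labeled counter.
type fetchCounter struct {
	mu     sync.Mutex
	counts map[fetchKey]uint64
}

func newFetchCounter() *fetchCounter {
	return &fetchCounter{counts: make(map[fetchKey]uint64)}
}

// observe records one fetch result. Assumption: only HTTP 200 counts as "ok".
func (c *fetchCounter) observe(httpStatus int) {
	status := "error"
	if httpStatus == 200 {
		status = "ok"
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[fetchKey{status: status, httpStatus: httpStatus}]++
}

// render emits one sample line per label combination, sorted for stable output.
func (c *fetchCounter) render() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	keys := make([]fetchKey, 0, len(c.counts))
	for k := range c.counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].status != keys[j].status {
			return keys[i].status < keys[j].status
		}
		return keys[i].httpStatus < keys[j].httpStatus
	})
	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "glb_relay_stats_fetch_total{status=%q,http_status=\"%d\"} %d\n",
			k.status, k.httpStatus, c.counts[k])
	}
	return b.String()
}

func main() {
	c := newFetchCounter()
	c.observe(401)
	c.observe(401)
	c.observe(200)
	fmt.Print(c.render())
}
```

With the http_status label in place, an alert on a sustained rate of status="error" samples would distinguish an auth failure (401) from an unreachable relay.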

Separately, telemy-srtla/ is legacy naming for the proprietary Rust SRTLA receiver. The directory should likely be renamed srtla-receiver to match the container name and the published image (ghcr.io/michaelpentz/srtla-receiver:latest). docker-compose.build.yml references srtla-receiver-fork/Dockerfile as the build context, but that path holds only the OpenIRL upstream submodule, not the proprietary Rust source. The build wiring stays broken until either the directory is renamed or the compose path is corrected. This is tracked as a separate cleanup, not part of this fix.

Open questions

  • Should the relay also be moved to env-var configuration (SRTLA_STATS_TOKEN) on Advin to match lv1 and the published docker-compose.build.yml? Functionally identical today; consistency win.
  • Should GLB_RELAY_SHARED_KEY be split into a per-relay token (one per server, rotated independently)? Currently a single secret for all relays. Rotation cost rises with scale. Defer until adding a second region.
  • Should a glb_relay_stats_fetch_total counter be added alongside the DNS counter pattern? It would have caught this at deploy time.

Sources