DNS Error Handling Audit

Follow-up to the 2026-04-17 CLOUDFLARE_DNS_TOKEN rotation. The token is now correctly scoped to golivebro.com (id 074c22380e58b6dc206288b47b271e76, Zone.DNS:Edit). Two structural issues outlived the token fix: silent DNS error swallowing in the async pipeline, and slug namespace sharing between canary tests and real users.

Finding A: Silent error swallowing in the managed-relay pipeline

1. Callsite inventory

Every DNS client callsite in the API, with propagation behavior:

| # | file:line | Path | Error behavior |
| --- | --- | --- | --- |
| A1 | api/internal/api/handlers.go:305 | Pool fast-path provision | Swallows. log.Printf("provision_pipeline: dns_update_failed ...") only. Session still finalizes as active. |
| A2 | api/internal/api/handlers.go:378 | EC2 provision pipeline | Swallows. Same dns_update_failed log line. Session still finalizes as active. |
| A3 | api/internal/api/handlers.go:847 | Migration pipeline (region change) | Swallows. migration_pipeline: dns_update_failed log line. Old session still drains, new session stays active. |
| A4 | api/internal/api/handlers_self_hosted.go:79 | Self-hosted register | Propagates. Returns HTTP 500 "failed to create DNS record". |
| A5 | api/internal/api/handlers_self_hosted.go:144 | Self-hosted heartbeat IP change | Propagates. Returns HTTP 500 "failed to update DNS record". |
| A6 | api/internal/api/handlers_self_hosted.go:280 | Self-hosted remove | Propagates. Returns HTTP 500 "failed to remove DNS record". |

A1, A2, and A3 are the managed-relay (async) paths. All three hid the telemyapp-zone token misconfiguration for weeks.

2. Observable signal when A1-A3 fail today

Only one: grepping docker logs glb-api on Advin for dns_update_failed. The session response to the client still contains a relay_hostname of <slug>.relay.golivebro.com (set at handlers.go:310 and :383, and generated by toSessionResponse at handlers.go:898-899 via s.dns.FQDN), and the session flips to active. The client then resolves the FQDN, gets NXDOMAIN, and fails to connect. Nothing in user_stream_slots or sessions tells an operator this happened. Because UpdateSlotDNS at handlers.go:308 and :381 is only called on the success branch, a lag between the current_region column and the session's region is a weak post-hoc signal, but nothing alerts on it.

3. Existing logging convention

No structured logger (zap, slog, logrus) in api/internal/api/. Every file uses stdlib log.Printf with an informal subsystem: event_name key=value convention (provision_pipeline: dns_update_failed slug=%s err=%v). Grep confirms 113 occurrences across 18 files in internal/api/. Fixes should match the convention rather than introduce a new dependency.

4. Existing metrics surface

Yes. api/internal/metrics/metrics.go:44 provides a Registry with IncCounter, ObserveHistogram, and a Prometheus text-format Handler. Default counters registered at metrics.go:55-60: glb_job_runs_total, glb_relay_provision_total, glb_relay_deprovision_total, plus three histograms. The registry is already wired at router.go:257-266 serving /metrics (localhost-restricted per .claude/rules/security.md). Usage pattern at handlers.go:248-260, :548-561 and jobs/runner.go:212-219. Adding a DNS counter is a one-line registration plus three call sites.

Conceptual shape:

  1. metrics.go:55: register glb_dns_write_total (labels: op=create|delete, status=ok|error) and glb_dns_write_errors_total (labels: op, http_status).
  2. handlers.go:305-307, :378-380, :847-849: on success, IncCounter("glb_dns_write_total", {"op":"create","status":"ok"}). On error, inc with status:"error" and additionally inc glb_dns_write_errors_total. Keep the log line.
  3. cloudflare.go could surface the HTTP status via a typed error, but the minimal patch passes a sentinel and parses the status from the error string; a cleaner follow-up is a typed *APIError.
  4. Add a dns_last_error_at column to user_stream_slots and write it on failure. Dashboard admin portal can surface any slot with a recent error.
  5. Optional, higher-risk: promote A1/A2 DNS failures to a session error so FinalActivateSession fails and deprovisionAndStop runs. Rejected here: the EC2 instance is already up, the user-visible consequence of a bad record is small (a failed reconnect attempt), and a buggy teardown path would turn a DNS hiccup into a stuck provisioning with real cost leakage. A counter + alert is the cheaper win.

Tests: extend cloudflare_test.go (already has TestCreateOrUpdateRecord_APIError at line 152) to assert the metric increments. Add a handler-level test that forces CreateOrUpdateRecord to fail and verifies glb_dns_write_errors_total renders via metrics.Default().Render().

Scope: ~30 LOC across 3 files (metrics.go, handlers.go, one new test). Low risk. No behavior change for success path, only observability.

Finding B: Slug collision risk and canary namespace

1. Generator

api/internal/store/store_stream_slots.go:18-28. GenerateDNSSlug produces 8 characters from the charset abcdefghijklmnopqrstuvwxyz0123456789 (36 symbols) using crypto/rand.Int with big.NewInt(36) per position, so the distribution is uniform (no modulo bias). Keyspace: 36^8 ≈ 2.82e12. By the birthday approximation, the probability of any collision among 10,000 concurrent live slugs is ~1.8e-5; among 100,000 concurrent slugs it is ~1.8e-3. Realistically irrelevant at foreseeable scale.
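The generator and the collision arithmetic can be reproduced in a short sketch. generateDNSSlug mirrors the mechanism described above (8 draws via crypto/rand.Int); birthdayCollisionProb is the standard n^2/(2N) approximation, not code from the repo.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math"
	"math/big"
)

const slugCharset = "abcdefghijklmnopqrstuvwxyz0123456789"

// generateDNSSlug mirrors the generator described above: one
// crypto/rand draw per position, so there is no modulo bias.
func generateDNSSlug() (string, error) {
	b := make([]byte, 8)
	for i := range b {
		n, err := rand.Int(rand.Reader, big.NewInt(int64(len(slugCharset))))
		if err != nil {
			return "", err
		}
		b[i] = slugCharset[n.Int64()]
	}
	return string(b), nil
}

// birthdayCollisionProb approximates P(any collision) among n live
// slugs drawn from a keyspace of size N as n^2 / (2N).
func birthdayCollisionProb(n float64) float64 {
	keyspace := math.Pow(36, 8) // ~2.82e12
	return n * n / (2 * keyspace)
}

func main() {
	slug, _ := generateDNSSlug()
	fmt.Println(len(slug))                               // 8
	fmt.Printf("%.1e\n", birthdayCollisionProb(10_000))  // ~1.8e-5
	fmt.Printf("%.1e\n", birthdayCollisionProb(100_000)) // ~1.8e-3
}
```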

2. Uniqueness constraint

Yes, twice. migrations/0017_dns_slugs.sql:8 declares dns_slug VARCHAR(12) UNIQUE on user_stream_slots. migrations/0028_self_hosted_relays.sql:7 declares dns_slug VARCHAR(12) UNIQUE on self_hosted_relays, plus a partial index at :18 filtering soft-deletes. Both callers retry on unique violation: store_stream_slots.go:106-118 (5 attempts) and store_self_hosted.go:17-54 (5 attempts, explicit pgErr.Code == "23505" check). A collision is a retry, not an overwrite. Blast radius is bounded to transient registration latency.
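The retry pattern can be sketched in isolation. The real code checks pgErr.Code == "23505" against the Postgres driver's error type; here a sentinel error and injected functions stand in so the sketch is self-contained, and the names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errDuplicate stands in for a Postgres unique violation
// (pgErr.Code == "23505"); the real check inspects the driver error.
var errDuplicate = errors.New("duplicate key value violates unique constraint")

// insertWithRetry mirrors the pattern described above: regenerate the
// slug and retry up to 5 times on a unique violation, so a collision
// costs transient latency rather than an overwrite.
func insertWithRetry(gen func() string, insert func(slug string) error) (string, error) {
	var lastErr error
	for attempt := 0; attempt < 5; attempt++ {
		slug := gen()
		if err := insert(slug); err != nil {
			if errors.Is(err, errDuplicate) {
				lastErr = err
				continue // collision: draw a fresh slug
			}
			return "", err // non-collision errors propagate immediately
		}
		return slug, nil
	}
	return "", fmt.Errorf("slug generation exhausted retries: %w", lastErr)
}

func main() {
	taken := map[string]bool{"aaaaaaaa": true}
	candidates := []string{"aaaaaaaa", "bbbbbbbb"} // first draw collides
	i := 0
	gen := func() string { s := candidates[i]; i++; return s }
	insert := func(s string) error {
		if taken[s] {
			return errDuplicate
		}
		taken[s] = true
		return nil
	}
	slug, err := insertWithRetry(gen, insert)
	fmt.Println(slug, err) // bbbbbbbb <nil>
}
```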

However: the two tables share the Cloudflare DNS namespace. A user_stream_slots.dns_slug and a self_hosted_relays.dns_slug with the same value is legal in Postgres (different tables, different unique indexes) but would cause one table’s CreateOrUpdateRecord to overwrite the other table’s A record. No cross-table uniqueness check exists. Probability is still 36^-8 per pair, but it is a real failure mode.

3. Reserved-prefix design cost

Files that would need changes for an alpha-* / test-* reserved prefix:

  • api/internal/store/store_stream_slots.go:18 — add a prefix filter in GenerateDNSSlug to reject generated slugs that start with a reserved token (or skip them in the charset). Alternatively split into GenerateDNSSlug() and GenerateTestDNSSlug(prefix).
  • api/internal/store/store_stream_slots.go:95 and store_self_hosted.go:15 — accept an optional namespace and reject collisions against both prefixes at INSERT time, or rely on the unique constraint already present.
  • api/internal/api/handlers_self_hosted.go:44 — when a caller flag requests a test slug, generate from the reserved namespace.
  • api/internal/api/router.go (Store interface at line 88) — interface addition for any new method.
  • web/admin/*.jsx and web/dashboard/*.jsx — UI filter to hide test slugs from real user dashboards. Likely one additional predicate in the slug-listing query.

Four Go files minimum plus the web UI, and one new migration for an index on (substring(dns_slug, 1, 6)) if lookups by prefix are ever needed. Modest.
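The generator-side change is small enough to sketch. The prefix list is a hypothetical product choice, not anything in the repo. One wrinkle worth noting: the generator's charset contains no hyphen, so a hyphenated prefix like alpha- can never be produced randomly; a rejection filter only matters if the reserved token is itself alphanumeric.

```go
package main

import (
	"fmt"
	"strings"
)

// reservedPrefixes is a hypothetical canary namespace; the real set
// would be a product decision.
var reservedPrefixes = []string{"alpha-", "test-"}

// isReservedSlug reports whether a slug falls in the reserved
// namespace. GenerateDNSSlug would redraw on true, while the test/
// canary path would instead require one of these prefixes. Note that
// hyphenated prefixes are unreachable by the random generator anyway,
// since its charset is [a-z0-9] only.
func isReservedSlug(slug string) bool {
	for _, p := range reservedPrefixes {
		if strings.HasPrefix(slug, p) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isReservedSlug("alpha-x1"), isReservedSlug("k3x9q2mp")) // true false
}
```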

4. Separate-zone alternative

Alternative: a second Cloudflare zone, for example relay-test.golivebro.com, with its own zone id. Wire a CLOUDFLARE_TEST_ZONE_ID env var and thread a second *dns.Client through the Server when a test flag is set on the register request. dns.NewClient at api/internal/dns/cloudflare.go:24 already reads CLOUDFLARE_ZONE_ID from env, so a second constructor NewTestClient is a one-line addition. Call sites pick the client based on a slug prefix or a relay.is_test column.

Pros: full isolation. Real and test records never share a namespace; token scope for the test zone can be a separate short-lived token. Cons: a new zone costs a Cloudflare plan line-item and a nameserver delegation. For sub-domain-as-zone on golivebro.com, the delegation is trivial (NS records).

5. Recommendation

(c) Separate zone, but deferred: it is the right destination, not an urgent need.

  • Option (a) do-nothing is defensible today: unique constraint prevents same-table overwrite; 36^-8 is not a live concern at current scale; cross-table collision has never happened.
  • Option (b) reserved prefix is 4 files of work and solves a problem that does not yet exist. Skip for now.
  • Option (c) separate zone is the right answer the day a dedicated canary tier (alpha partner relays, chaos tests, synthetic traffic) ships. At that point the isolation pays off and the token-scope split is a security win. Until then, keep using self-hosted register (handlers_self_hosted.go:44) for one-off alpha testing and tolerate the shared namespace.

Priority order:

  1. Add glb_dns_write_total and glb_dns_write_errors_total counters and increment at handlers.go:305,378,847 and handlers_self_hosted.go:79,144,280. Ship this with an alert rule: any non-zero glb_dns_write_errors_total rate over 5 min should page. ~30 LOC, low risk, closes the observability gap that hid the telemyapp-zone misconfiguration. Estimated effort: 1-2 hours.
  2. Add a dns_last_error_at column to user_stream_slots and self_hosted_relays. Write on failure at all six callsites. Expose in the admin portal session detail view. Makes the failure queryable without log scraping. Estimated effort: half a day (migration + code + admin UI).
  3. Defer reserved-prefix slug namespacing and separate-zone work. Unique constraint + retry + sub-part-per-million collision probability is adequate. Revisit when a persistent canary tier launches.
  4. Open follow-up: consider promoting A1/A2/A3 DNS failures to session errors. Discussion, not a patch. The cost of a stuck active session with bad DNS is small today (user reconnects) but grows with traffic volume.

Open Questions

  • Should the metrics registry grow a Now() injection or use a deterministic clock for tests? (Minor, unrelated to this audit.)
  • Where should the alert route? No alertmanager config in the repo. Current /metrics surface is localhost-only per router.go:266. Scraping is presumably external; confirm with Advin ops.

Sources