DNS Error Handling Audit
Follow-up to the 2026-04-17 CLOUDFLARE_DNS_TOKEN rotation. The token is now correctly scoped to golivebro.com (id 074c22380e58b6dc206288b47b271e76, Zone.DNS:Edit). Two structural issues outlived the token fix: silent DNS error swallowing in the async pipeline, and slug namespace sharing between canary tests and real users.
Finding A: Silent error swallowing in the managed-relay pipeline
1. Callsite inventory
Every DNS client callsite in the API, with propagation behavior:
| # | file:line | Path | Error behavior |
|---|---|---|---|
| A1 | api/internal/api/handlers.go:305 | Pool fast-path provision | Swallows. log.Printf("provision_pipeline: dns_update_failed ...") only. Session still finalizes as active. |
| A2 | api/internal/api/handlers.go:378 | EC2 provision pipeline | Swallows. Same dns_update_failed log line. Session still finalizes as active. |
| A3 | api/internal/api/handlers.go:847 | Migration pipeline (region change) | Swallows. migration_pipeline: dns_update_failed log line. Old session still drains, new session stays active. |
| A4 | api/internal/api/handlers_self_hosted.go:79 | Self-hosted register | Propagates. Returns HTTP 500 failed to create DNS record. |
| A5 | api/internal/api/handlers_self_hosted.go:144 | Self-hosted heartbeat IP change | Propagates. Returns HTTP 500 failed to update DNS record. |
| A6 | api/internal/api/handlers_self_hosted.go:280 | Self-hosted remove | Propagates. Returns HTTP 500 failed to remove DNS record. |
A1, A2, and A3 are the managed-relay (async) paths. All three hid the telemyapp-zone token misconfiguration for weeks.
2. Observable signal when A1-A3 fail today
Only one: a grep for dns_update_failed in docker logs glb-api on Advin. The session response to the client still contains a relay_hostname of <slug>.relay.golivebro.com (set at handlers.go:310, :383, and generated by toSessionResponse at handlers.go:898-899 via s.dns.FQDN). The session flips to active. The client resolves the FQDN, gets NXDOMAIN, and fails to connect. No state in user_stream_slots or sessions tells an operator this happened. UpdateSlotDNS at handlers.go:308 and :381 is only called on the success branch, so current_region column lag vs. session region could be a weak post-hoc signal, but nothing alerts on it.
3. Existing logging convention
No structured logger (zap, slog, logrus) in api/internal/api/. Every file uses stdlib log.Printf with an informal subsystem: event_name key=value convention (provision_pipeline: dns_update_failed slug=%s err=%v). Grep confirms 113 occurrences across 18 files in internal/api/. Fixes should match the convention rather than introduce a new dependency.
4. Existing metrics surface
Yes. api/internal/metrics/metrics.go:44 provides a Registry with IncCounter, ObserveHistogram, and a Prometheus text-format Handler. Default counters registered at metrics.go:55-60: glb_job_runs_total, glb_relay_provision_total, glb_relay_deprovision_total, plus three histograms. The registry is already wired at router.go:257-266 serving /metrics (localhost-restricted per .claude/rules/security.md). Usage pattern at handlers.go:248-260, :548-561 and jobs/runner.go:212-219. Adding a DNS counter is a one-line registration plus three call sites.
5. Recommended minimal patch
Conceptual shape:
- metrics.go:55: register glb_dns_write_total (labels: op=create|delete, status=ok|error) and glb_dns_write_errors_total (labels: op, http_status).
- handlers.go:305-307, :378-380, :847-849: on success, IncCounter("glb_dns_write_total", {"op":"create","status":"ok"}). On error, increment with status:"error" and additionally increment glb_dns_write_errors_total. Keep the log line.
- cloudflare.go could surface the HTTP status via a typed error, but the minimal patch passes a sentinel and parses the status from the error string; a cleaner follow-up is a typed *APIError type.
- Add a dns_last_error_at column to user_stream_slots and write it on failure. The admin portal dashboard can then surface any slot with a recent error.
- Optional, higher-risk: promote A1/A2 DNS failures to a session error so FinalActivateSession fails and deprovisionAndStop runs. Rejected here: EC2 is already up, the user-visible consequence (a reconnect attempt) is small, and although the cost-leakage risk of a stuck provision is real, a counter + alert is the cheaper win.
Tests: extend cloudflare_test.go (already has TestCreateOrUpdateRecord_APIError at line 152) to assert the metric increments. Add a handler-level test that forces CreateOrUpdateRecord to fail and verifies glb_dns_write_errors_total renders via metrics.Default().Render().
Scope: ~30 LOC across 3 files (metrics.go, handlers.go, one new test). Low risk. No behavior change for success path, only observability.
Finding B: Slug collision risk and canary namespace
1. Generator
api/internal/store/store_stream_slots.go:18-28. GenerateDNSSlug produces 8 characters from the charset abcdefghijklmnopqrstuvwxyz0123456789 (36 symbols) using crypto/rand.Int with big.NewInt(36) per position, so the distribution is uniform (no modulo bias). Keyspace: 36^8 = 2.82e12. By the birthday approximation (n^2/2N), the probability of any collision among 10,000 concurrent live slugs is ~1.8e-5, rising to ~1.8e-3 at 100,000; the chance that any single new slug collides with the existing set is smaller still. Realistically irrelevant at foreseeable scale.
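The generator as described (8 uniform draws over the 36-symbol charset via crypto/rand.Int) can be sketched as follows; this is a reconstruction from the description above, not the verbatim store_stream_slots.go code:

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

const slugCharset = "abcdefghijklmnopqrstuvwxyz0123456789" // 36 symbols

// generateDNSSlug draws each of the 8 positions uniformly with
// crypto/rand.Int over big.NewInt(36), so there is no modulo bias.
func generateDNSSlug() (string, error) {
	out := make([]byte, 8)
	for i := range out {
		n, err := rand.Int(rand.Reader, big.NewInt(int64(len(slugCharset))))
		if err != nil {
			return "", err
		}
		out[i] = slugCharset[n.Int64()]
	}
	return string(out), nil
}

func main() {
	slug, err := generateDNSSlug()
	if err != nil {
		panic(err)
	}
	fmt.Println(slug) // e.g. "k3x9m2ab"
}
```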
2. Uniqueness constraint
Yes, twice. migrations/0017_dns_slugs.sql:8 declares dns_slug VARCHAR(12) UNIQUE on user_stream_slots. migrations/0028_self_hosted_relays.sql:7 declares dns_slug VARCHAR(12) UNIQUE on self_hosted_relays, plus a partial index at :18 filtering soft-deletes. Both callers retry on unique violation: store_stream_slots.go:106-118 (5 attempts) and store_self_hosted.go:17-54 (5 attempts, explicit pgErr.Code == "23505" check). A collision is a retry, not an overwrite. Blast radius is bounded to transient registration latency.
However: the two tables share the Cloudflare DNS namespace. A user_stream_slots.dns_slug and a self_hosted_relays.dns_slug with the same value is legal in Postgres (different tables, different unique indexes) but would cause one table’s CreateOrUpdateRecord to overwrite the other table’s A record. No cross-table uniqueness check exists. Probability is still 36^-8 per pair, but it is a real failure mode.
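The retry-on-unique-violation pattern used by both callers can be sketched as below. pgError is a stand-in for the real pgconn error type (the real code checks pgErr.Code == "23505"), and the insert/generator closures plus slug values are illustrative:

```go
package main

import (
	"errors"
	"fmt"
)

// pgError stands in for pgconn.PgError; 23505 is unique_violation.
type pgError struct{ Code string }

func (e *pgError) Error() string { return "pg error " + e.Code }

// insertSlug sketches the loop: generate a slug, attempt the INSERT, and on
// a unique violation retry with a fresh slug, up to 5 attempts. Any other
// error propagates immediately.
func insertSlug(insert func(slug string) error, gen func() string) (string, error) {
	for attempt := 0; attempt < 5; attempt++ {
		slug := gen()
		err := insert(slug)
		if err == nil {
			return slug, nil
		}
		var pgErr *pgError
		if errors.As(err, &pgErr) && pgErr.Code == "23505" {
			continue // collision: a retry, not an overwrite
		}
		return "", err
	}
	return "", errors.New("slug collision retries exhausted")
}

func main() {
	taken := map[string]bool{"aaaaaaaa": true}
	slugs := []string{"aaaaaaaa", "b7q0d1xz"} // first draw collides, second wins
	i := 0
	gen := func() string { s := slugs[i]; i++; return s }
	insert := func(s string) error {
		if taken[s] {
			return &pgError{Code: "23505"}
		}
		taken[s] = true
		return nil
	}
	slug, err := insertSlug(insert, gen)
	fmt.Println(slug, err) // "b7q0d1xz <nil>"
}
```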
3. Reserved-prefix design cost
Files that would need changes for an alpha-* / test-* reserved prefix:
- api/internal/store/store_stream_slots.go:18 — add a prefix filter in GenerateDNSSlug to reject generated slugs that start with a reserved token (or skip them in the charset). Alternatively, split into GenerateDNSSlug() and GenerateTestDNSSlug(prefix).
- api/internal/store/store_stream_slots.go:95 and store_self_hosted.go:15 — accept an optional namespace and reject collisions against both prefixes at INSERT time, or rely on the unique constraint already present.
- api/internal/api/handlers_self_hosted.go:44 — when a caller flag requests a test slug, generate from the reserved namespace.
- api/internal/api/router.go (Store interface at line 88) — interface addition for any new method.
- web/admin/*.jsx and web/dashboard/*.jsx — UI filter to hide test slugs from real user dashboards. Likely one additional predicate in the slug-listing query.
Four files minimum, one new migration for an index on (substring(dns_slug, 1, 6)) if lookups by prefix are needed. Modest.
4. Separate-zone alternative
Alternative: a second Cloudflare zone, for example relay-test.golivebro.com, with its own zone id. Wire a CLOUDFLARE_TEST_ZONE_ID env var and thread a second *dns.Client through the Server when a test flag is set on the register request. dns.NewClient at api/internal/dns/cloudflare.go:24 already reads CLOUDFLARE_ZONE_ID from env, so a second constructor NewTestClient is a one-line addition. Call sites pick the client based on a slug prefix or a relay.is_test column.
Pros: full isolation. Real and test records never share a namespace; token scope for the test zone can be a separate short-lived token. Cons: a new zone costs a Cloudflare plan line-item and a nameserver delegation. For sub-domain-as-zone on golivebro.com, the delegation is trivial (NS records).
5. Recommendation
(c) Separate zone, deferred: the right long-term answer, but not urgent today.
- Option (a) do-nothing is defensible today: unique constraint prevents same-table overwrite; 36^-8 is not a live concern at current scale; cross-table collision has never happened.
- Option (b) reserved prefix is 4 files of work and solves a problem that does not yet exist. Skip for now.
- Option (c) separate zone is the right answer the day a dedicated canary tier (alpha partner relays, chaos tests, synthetic traffic) ships. At that point the isolation pays off and the token-scope split is a security win. Until then, keep using self-hosted register (handlers_self_hosted.go:44) for one-off alpha testing and tolerate the shared namespace.
Recommended next actions
Priority order:
- Add glb_dns_write_total and glb_dns_write_errors_total counters and increment them at handlers.go:305, 378, 847 and handlers_self_hosted.go:79, 144, 280. Ship this with an alert rule: any non-zero glb_dns_write_errors_total rate over 5 min should page. ~30 LOC, low risk, closes the observability gap that hid the telemyapp-zone misconfiguration. Estimated effort: 1-2 hours.
- Add a dns_last_error_at column to user_stream_slots and self_hosted_relays. Write it on failure at all six callsites. Expose it in the admin portal session detail view. Makes the failure queryable without log scraping. Estimated effort: half a day (migration + code + admin UI).
- Defer reserved-prefix slug namespacing and separate-zone work. Unique constraint + retry + sub-part-per-million collision probability is adequate. Revisit when a persistent canary tier launches.
- Open follow-up: consider promoting A1/A2/A3 DNS failures to session errors. Discussion, not a patch. The cost of a stuck active session with bad DNS is small today (the user reconnects) but grows with traffic volume.
Open Questions
- Should the metrics registry grow a Now() injection or use a deterministic clock for tests? (Minor, unrelated to this audit.)
- Where should the alert route? There is no alertmanager config in the repo. The current /metrics surface is localhost-only per router.go:266; scraping is presumably external. Confirm with Advin ops.