Operations
Summary
Telemy’s operational infrastructure centers on a control plane (Go API + background jobs + PostgreSQL) that was migrated from AWS EC2 to an Advin Servers VPS in Kansas City on 2026-03-20, cutting monthly hosting cost by ~$20/mo to ~$8/mo. The control plane serves api.telemyapp.com behind Cloudflare’s orange-cloud proxy, with UFW restricting port 8080 to Cloudflare IP ranges only. PostgreSQL 16 runs in Docker alongside the telemy-api and telemy-jobs Go binaries, all managed via Docker Compose at /opt/telemy/. Cloudflare handles TLS termination, and an Origin Rule rewrites traffic to port 8080.
The relay infrastructure is a UDP packet-forwarding layer (SRT/SRTLA) where bandwidth, not compute, is the primary cost driver. As of 2026-04-02, Telemy runs two relay nodes: Advin KC ($8/mo) and BuyVM LV ($3.50/mo, unmetered 1 Gbps). The expansion strategy targets three phases, starting with an EU node via BlackHOST or Netcup. At $9.99/user pricing, break-even is 2 paying users on the current 2-node setup.
Observability uses Prometheus metrics exposed at GET /metrics on the API process, covering relay provisioning lifecycle, background job health, and legacy AWS operation counters. A QA checklist validates the full telemetry pipeline from relay stats APIs through the C++ OBS plugin to the React dock UI, with aggregate relay telemetry passing E2E validation on 2026-03-05 and per-link telemetry still pending the srtla_rec fork.
Timeline
- 2026-02-22: Operations metrics audit. Documented that cmd/jobs lacks its own HTTP listener for metrics scraping; job metrics are only visible from the API process.
- 2026-03-05: Relay E2E telemetry validated (TC-RELAY-001). Aggregate stats from IRL Pro bonded stream confirmed working through full pipeline.
- 2026-03-20: VPS provider decision finalized. Advin Servers chosen for relay pool nodes ($8/mo for 32 TB, $0.25/TB). DigitalOcean, Vultr, and AWS EC2 eliminated on bandwidth cost.
- 2026-03-20: API migration plan authored — 8-phase plan to move control plane from EC2 (52.13.2.122) to Advin VPS (208.84.101.84).
- 2026-03-20: E2E gap fixes completed: snapshot push on connect/disconnect, relay_host_masked, not_found for unknown IDs. MetricsCollector moved off OBS render thread.
- 2026-03-23: Always-ready model deployed. Managed relay sessions now auto-provision on add and auto-deprovision on remove; manual Start/Stop buttons removed.
- 2026-04-02: BuyVM LV Slice 1024 purchased ($3.50/mo, Las Vegas) — baseline tests passed: 100 Mbps UDP with 0% packet loss, 32.8 ms latency to KC1, 0.63 ms jitter.
- 2026-04-02: Server infrastructure research updated with cross-agent suitability scores, expansion strategy, and detailed provider analyses for BuyVM, BlackHOST, Netcup, LumaDock, and others.
Current State
Control Plane: Go API + jobs + PostgreSQL 16 running in Docker Compose on Advin VPS (208.84.101.84, Kansas City). Cloudflare proxies api.telemyapp.com with TLS termination. UFW locks port 8080 to Cloudflare IP ranges. Deploy new binaries via scp + docker compose restart. Database backups should be on cron (pg_dump piped to gzip).
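The Cloudflare-only restriction on port 8080 can be sanity-checked offline with a short script. The following is a minimal sketch, not the deployed firewall configuration: the CIDR list is assumed to be a snapshot of Cloudflare's published IPv4 ranges and should be refreshed from Cloudflare's official list before use, and the function name `is_cloudflare_source` is hypothetical.

```python
import ipaddress

# Assumption: snapshot of Cloudflare's published IPv4 ranges. Refresh this
# list from Cloudflare's official source; it may be stale.
CLOUDFLARE_IPV4 = [
    "173.245.48.0/20", "103.21.244.0/22", "103.22.200.0/22",
    "103.31.4.0/22", "141.101.64.0/18", "108.162.192.0/18",
    "190.93.240.0/20", "188.114.96.0/20", "197.234.240.0/22",
    "198.41.128.0/17", "162.158.0.0/15", "104.16.0.0/13",
    "104.24.0.0/14", "172.64.0.0/13", "131.0.72.0/22",
]

NETWORKS = [ipaddress.ip_network(cidr) for cidr in CLOUDFLARE_IPV4]


def is_cloudflare_source(ip: str) -> bool:
    """Return True if the address falls inside one of the listed ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in NETWORKS)


if __name__ == "__main__":
    # An address inside 104.16.0.0/13 should be allowed; the VPS's own
    # public IP is not a Cloudflare edge and should be rejected.
    for candidate in ("104.16.0.1", "208.84.101.84"):
        print(candidate, is_cloudflare_source(candidate))
```

A check like this is useful when reviewing UFW rules or access logs: any source address hitting port 8080 that fails the test indicates traffic bypassing the proxy.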
Relay Nodes (2 active, ~$11.50/mo total):
| Node | Provider | Location | IP | Cost | Status |
|---|---|---|---|---|---|
| US-Central | Advin KC (kc1) | Kansas City, MO | 208.84.101.84 | $8/mo | Active production relay + control plane |
| US-West | BuyVM LV (lv1) | Las Vegas, NV | 209.141.55.228 | $3.50/mo | Online, testing phase |
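The break-even figure quoted in the Summary follows directly from the node costs in the table. A minimal sketch of that arithmetic, assuming the $9.99 per-user price from the Summary and counting relay hosting only (no payment fees, domains, or other costs); the names here are illustrative:

```python
import math

# Monthly node costs in USD, taken from the relay node table.
NODE_COSTS_USD = {"kc1": 8.00, "lv1": 3.50}

# Assumption: per-user monthly price as referenced in the Summary.
PRICE_PER_USER_USD = 9.99


def break_even_users(node_costs: dict, price_per_user: float) -> int:
    """Smallest number of paying users whose revenue covers hosting cost."""
    total = sum(node_costs.values())
    return math.ceil(total / price_per_user)


if __name__ == "__main__":
    print("total hosting:", sum(NODE_COSTS_USD.values()))
    print("break-even users:", break_even_users(NODE_COSTS_USD, PRICE_PER_USER_USD))
```

Adding a node to `NODE_COSTS_USD` shows how each expansion phase would shift the break-even point.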
Metrics: GET /metrics on API process exposes Prometheus text format. No separate jobs scrape endpoint yet. Starter alerts defined for job failure rate, provisioning p95 latency (>15s), and legacy AWS retry exhaustion.
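To illustrate what a consumer of the /metrics output might do, here is a minimal sketch that parses Prometheus text-format samples and derives a job failure rate. The metric name `telemy_job_runs_total` and its `job`/`status` labels are hypothetical (the real metric names are not listed here), and the parser assumes sample lines carry no trailing timestamps.

```python
from typing import Dict

# Hypothetical metric name; the real counters may be named differently.
METRIC = "telemy_job_runs_total"

SAMPLE = """\
# HELP telemy_job_runs_total Hypothetical job run counter
# TYPE telemy_job_runs_total counter
telemy_job_runs_total{job="cleanup",status="success"} 98
telemy_job_runs_total{job="cleanup",status="failure"} 2
"""


def parse_prometheus_text(text: str) -> Dict[str, float]:
    """Map each series (name plus label set) to its sample value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def job_failure_rate(samples: Dict[str, float], job: str) -> float:
    """Failures divided by total runs for one job; 0.0 if it never ran."""
    ok = samples.get(f'{METRIC}{{job="{job}",status="success"}}', 0.0)
    failed = samples.get(f'{METRIC}{{job="{job}",status="failure"}}', 0.0)
    total = ok + failed
    return failed / total if total else 0.0


if __name__ == "__main__":
    print(job_failure_rate(parse_prometheus_text(SAMPLE), "cleanup"))
```

In practice a Prometheus server with alerting rules would do this work; the sketch only shows the shape of the data behind the job-failure-rate alert. Note that the labels are job name and status only, in line with keeping label cardinality low.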
QA Status: Aggregate relay telemetry PASSED (2026-03-05). Per-link relay telemetry PENDING — blocked on srtla_rec fork exposing per-link metadata. Test cases TC-CONN-001 through TC-CONN-004 and TC-RELAY-002 are defined but unvalidated.
BuyVM Stock: All slices except LV 1 GB out of stock across all locations as of 2026-04-02. Stock tracker at buyvmstock.com configured for notifications. Restocks tend to appear 8-10 AM PST, especially the 1st and 7th of each month.
Key Decisions
- 2026-03-20: Chose Advin Servers over Hetzner, DigitalOcean, Vultr, and AWS for relay hosting: 32 TB at $8/mo ($0.25/TB) vs. up to $2,780/mo for 32 TB elsewhere.
- 2026-03-20: Decided to migrate control plane from EC2 to Advin VPS — co-locating API with the relay saves ~$20/mo; Cloudflare proxy means DNS cutover is near-instant and rollback is seconds.
- 2026-03-20: Architecture: PostgreSQL in Docker (not host-installed) — enables clean Docker Compose orchestration with health checks and secret management via file-based secrets.
- 2026-03-23: Adopted always-ready model for managed relays — sessions auto-provision/deprovision on connection add/remove, eliminating manual Start/Stop UX.
- 2026-04-02: Selected BuyVM as primary expansion provider — genuinely unmetered bandwidth at 7/mo each) is optimal strategy for 1 Gbps ports.
- 2026-04-02: Rejected major cloud providers for relay hosting — AWS (3,200/mo), Azure (7/mo unmetered).
- 2026-04-02: Identified Netcup as best new EU option — 2 TB/rolling-24h cap with 200 Mbps floor after throttle, 2.5 Gbps port, free 2 Tbit/s DDoS protection, ~EUR 5.84/mo. Requires 12-month contract and UDP testing.
Experiments & Results
| Experiment | Status | Finding | Source |
|---|---|---|---|
| BuyVM LV1 baseline: TCP throughput KC1-to-LV1 | Passed | 464 Mbps sender / 439 Mbps receiver (limited by 33ms latency x TCP window) | SERVER_INFRASTRUCTURE_RESEARCH.md |
| BuyVM LV1 baseline: UDP @ 100 Mbps | Passed | 100 Mbps sustained, 0% packet loss, 0.1ms jitter | SERVER_INFRASTRUCTURE_RESEARCH.md |
| BuyVM LV1 baseline: latency + route quality | Passed | 32.8 ms avg, 0.63 ms jitter, 14 hops, 0% loss all hops | SERVER_INFRASTRUCTURE_RESEARCH.md |
| TC-RELAY-001: IRL Pro bonded stream aggregate telemetry | Passed (2026-03-05) | Bitrate bar, RTT/latency/loss pills, 2s update cycle all confirmed working | QA_CHECKLIST_RELAY_TELEMETRY.md |
| Relay IPC round-trip (Start/Provisioning/Active/Stop/Stopped) | Passed | Full lifecycle confirmed | QA_CHECKLIST_RELAY_TELEMETRY.md |
| API connectivity: C++ RelayClient to Go control plane via HTTPS | Passed | Confirmed working | QA_CHECKLIST_RELAY_TELEMETRY.md |
| Dock telemetry path: SLS stats API to CEF injection to React UI | Passed | Full pipeline confirmed | QA_CHECKLIST_RELAY_TELEMETRY.md |
| MetricsCollector background thread (off render thread) | Passed (2026-03-20) | Start()/Stop()/PollLoop() verified | QA_CHECKLIST_RELAY_TELEMETRY.md |
| Per-link relay telemetry (bonded per-carrier stats) | Pending | Blocked on srtla_rec fork exposing per-link metadata | QA_CHECKLIST_RELAY_TELEMETRY.md |
| TC-CONN-001: Multiple simultaneous BYOR connections | Not started | — | QA_CHECKLIST_RELAY_TELEMETRY.md |
| TC-CONN-002: Connection persistence across OBS restarts | Not started | — | QA_CHECKLIST_RELAY_TELEMETRY.md |
| TC-CONN-003: Managed add/remove flow (always-ready model) | Not started | — | QA_CHECKLIST_RELAY_TELEMETRY.md |
| TC-CONN-004: Per-link telemetry with multiple carriers | Not started | — | QA_CHECKLIST_RELAY_TELEMETRY.md |
| TC-RELAY-002: Per-link telemetry bonded disconnect/reconnect | Not started | — | QA_CHECKLIST_RELAY_TELEMETRY.md |
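The TCP baseline row notes that throughput was limited by latency times TCP window. That relationship is the bandwidth-delay product, and a minimal sketch of the arithmetic is below. It assumes the ~33 ms figure is a round-trip time and that a single TCP stream was measured; the function names are illustrative.

```python
def implied_window_bytes(throughput_mbps: float, rtt_ms: float) -> float:
    """Bytes that must be in flight to sustain a throughput over an RTT."""
    bytes_per_second = throughput_mbps * 1e6 / 8
    return bytes_per_second * (rtt_ms / 1000)


def max_throughput_mbps(window_bytes: float, rtt_ms: float) -> float:
    """Upper bound on single-stream throughput for a given window and RTT."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1e6


if __name__ == "__main__":
    # Sender-side figure from the KC1-to-LV1 baseline: 464 Mbps at ~33 ms.
    window = implied_window_bytes(464, 33)
    print(f"implied effective window: {window / 1e6:.2f} MB")
    # The same window over a shorter path would allow a higher ceiling.
    print(f"ceiling at 10 ms RTT: {max_throughput_mbps(window, 10):.0f} Mbps")
```

Under these assumptions the measured rate corresponds to an effective window of roughly 1.9 MB, which suggests the TCP figure reflects window sizing rather than a port limit. The UDP relay path is not bound by this window.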
Gotchas & Known Issues
- Advin 32 TB hard cap: At 150 Mbps sustained, the KC1 node exceeds 32 TB by day ~20 and gets throttled to 1 Mbps until month reset. Safe sustained ceiling is ~100 Mbps.
- Advin non-KC locations are not viable for relay: Other cities (LA, Miami, Amsterdam, Tokyo, etc.) get only 1-4 TB at the same $8 price point.
- BuyVM bandwidth enforcement is undocumented: The ~25 Mbps/GB-RAM guideline exists only in forum posts, not the AUP. Throttle happens without warning; removal requires contacting support.
- BuyVM chronic stock shortage: All slices except LV 1 GB out of stock. No provisioning API exists — ordering is manual only. Stallion API v2 with provisioning announced but no timeline.
- Netcup Singapore is a trap: 2 TB/month cap with 5 Mbps throttle floor — critically different from other Netcup locations (2 TB/day with 200 Mbps floor). Do not use for relay.
- Netcup 12-month lock-in: Must test UDP relay workload (DDoS filter false-positive risk on SRT) before committing to annual contract.
- BlackHOST DDoS protection costs $429+/mo: UDP relay is exposed to attacks with no affordable mitigation option. IP may be null-routed if targeted.
- cmd/jobs has no metrics endpoint: Background job metrics are only visible from the API process. If API and jobs are split into separate processes, a separate HTTP listener for jobs metrics must be added.
- Label cardinality: Do not add user/session IDs to Prometheus metric labels; keep cardinality low.
- Security tokens in dock snapshots: QA must verify pair_token and relay_ws_token are EXCLUDED from JSON snapshots sent to the CEF dock.
- EC2 teardown timing: Do not terminate the old EC2 instance until at least 24 hours of clean operation on Advin. Release any EIP attached to the API EC2 to stop the ~$3.60/mo charge.
- Rollback criteria: Roll back DNS immediately if health endpoint returns non-200, OBS plugin fails to connect, database errors appear in logs, or relay provisioning fails.
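The Advin 32 TB cap figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch under stated assumptions: decimal terabytes (10^12 bytes), a constant sustained rate, and traffic counted once (if the provider meters inbound and outbound separately, a relay's effective ceiling would be lower).

```python
SECONDS_PER_DAY = 86_400


def tb_per_day(mbps: float) -> float:
    """Data volume in decimal TB transferred per day at a sustained rate."""
    return mbps * 1e6 / 8 * SECONDS_PER_DAY / 1e12


def days_until_cap(mbps: float, cap_tb: float = 32.0) -> float:
    """Days of sustained traffic before the monthly cap is reached."""
    return cap_tb / tb_per_day(mbps)


def max_sustained_mbps(cap_tb: float = 32.0, days: int = 30) -> float:
    """Highest constant rate that stays within the cap for a whole month."""
    return cap_tb * 1e12 * 8 / (days * SECONDS_PER_DAY) / 1e6


if __name__ == "__main__":
    print(f"150 Mbps reaches the cap after {days_until_cap(150):.1f} days")
    print(f"30-day ceiling: {max_sustained_mbps():.1f} Mbps")
    print(f"31-day ceiling: {max_sustained_mbps(days=31):.1f} Mbps")
```

Under these assumptions 150 Mbps reaches the cap shortly before day 20, and the month-long ceiling comes out slightly below 100 Mbps, so the ~100 Mbps guideline leaves essentially no headroom.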
Open Questions
- Has the API migration from EC2 to Advin actually been executed, or is it still at the plan stage? The plan document is dated 2026-03-20 but does not contain execution confirmation.
- When will srtla_rec fork expose per-link metadata to unblock per-link relay telemetry (TC-RELAY-002, TC-CONN-004)?
- Should cmd/jobs get its own HTTP metrics listener on port 8081, or will it remain co-located with the API process?
- Which EU provider for Phase 1 expansion: BlackHOST ($11.99/mo, unmetered, no DDoS) or Netcup (~EUR 5.84/mo, 2 TB/day cap, free DDoS, 12-month contract)?
- Has SRT/SRTLA UDP been tested against Netcup’s DDoS filter to check for false positives before committing to a 12-month contract?
- What is the plan for BuyVM DDoS protection ($3/mo add-on) — should it be added to LV1 now during testing, or only when the node goes production?
- Are automated database backups (pg_dump cron) configured on the Advin VPS yet?
- What monitoring/alerting stack is consuming the Prometheus metrics? Is Prometheus actually running and scraping the API endpoint?
Sources
- OPERATIONS_METRICS.md
- SERVER_INFRASTRUCTURE_RESEARCH.md
- VPS_Relay_Server_Comparison.md
- QA_CHECKLIST_RELAY_TELEMETRY.md
- 2026-03-20-api-migration-advin.md