Operations

Summary

Telemy’s operational infrastructure centers on a control plane (Go API + background jobs + PostgreSQL) that was migrated from AWS EC2 to an Advin Servers VPS in Kansas City on 2026-03-20, cutting monthly hosting cost to ~$8/mo (a saving of ~$20/mo). The control plane serves api.telemyapp.com behind Cloudflare’s orange-cloud proxy, with UFW restricting port 8080 to Cloudflare IP ranges only. PostgreSQL 16 runs in Docker alongside the telemy-api and telemy-jobs Go binaries, all managed via Docker Compose at /opt/telemy/. Cloudflare handles TLS termination, and an Origin Rule rewrites traffic to port 8080.
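The UFW restriction described above can be sketched as a small rule generator. This is a minimal sketch, not the deployed configuration: the function name ufw_rules is hypothetical, the three sample ranges are only a subset, and the current list should be taken from Cloudflare's published ranges (https://www.cloudflare.com/ips-v4 and https://www.cloudflare.com/ips-v6) before use.

```python
import ipaddress


def ufw_rules(cidrs, port=8080):
    """Return one `ufw allow` command per CIDR, skipping blank entries.

    Invalid input (bad prefix length, host bits set) raises ValueError,
    so a corrupted range list never turns into a firewall rule.
    """
    rules = []
    for raw in cidrs:
        cidr = raw.strip()
        if not cidr:
            continue
        net = ipaddress.ip_network(cidr, strict=True)
        rules.append(f"ufw allow from {net} to any port {port} proto tcp")
    return rules


# Sample ranges only (assumption: a subset of Cloudflare's list, for illustration).
sample = ["173.245.48.0/20", "103.21.244.0/22", "2400:cb00::/32"]
for rule in ufw_rules(sample):
    print(rule)
```

The output is a list of commands to review and run by hand. Generating commands rather than applying them keeps the step auditable and makes it easy to diff against the rules already on the host.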

The relay infrastructure is a UDP packet-forwarding layer (SRT/SRTLA) where bandwidth, not compute, is the primary cost driver. As of 2026-04-02, Telemy runs two relay nodes: Advin KC ($8/mo, 32 TB/mo cap) and BuyVM LV ($3.50/mo, unmetered 1 Gbps). The expansion strategy targets three phases, the first being an EU node via BlackHOST ($11.99/mo) or Netcup (~EUR 5.84/mo). At $9.99/user pricing, break-even is 2 paying users on the current 2-node setup.

Observability uses Prometheus metrics exposed at GET /metrics on the API process, covering relay provisioning lifecycle, background job health, and legacy AWS operation counters. A QA checklist validates the full telemetry pipeline from relay stats APIs through the C++ OBS plugin to the React dock UI, with aggregate relay telemetry passing E2E validation on 2026-03-05 and per-link telemetry still pending the srtla_rec fork.

Timeline

  • 2026-02-22: Operations metrics audit — documented that cmd/jobs lacks its own HTTP listener for metrics scraping; job metrics only visible from the API process.
  • 2026-03-05: Relay E2E telemetry validated (TC-RELAY-001) — aggregate stats from IRL Pro bonded stream confirmed working through full pipeline.
  • 2026-03-20: VPS provider decision finalized. Advin Servers chosen for relay pool nodes ($8/mo for 32 TB, i.e. $0.25/TB). DigitalOcean, Vultr, and AWS EC2 eliminated on bandwidth cost.
  • 2026-03-20: API migration plan authored — 8-phase plan to move control plane from EC2 (52.13.2.122) to Advin VPS (208.84.101.84).
  • 2026-03-20: E2E gap fixes completed — snapshot push on connect/disconnect, relay_host_masked, not_found for unknown IDs. MetricsCollector moved off OBS render thread.
  • 2026-03-23: Always-ready model deployed — managed relay sessions now auto-provision on add and auto-deprovision on remove; manual Start/Stop buttons removed.
  • 2026-04-02: BuyVM LV Slice 1024 purchased ($3.50/mo, Las Vegas) — baseline tests passed: 100 Mbps UDP with 0% packet loss, 32.8 ms latency to KC1, 0.63 ms jitter.
  • 2026-04-02: Server infrastructure research updated with cross-agent suitability scores, expansion strategy, and detailed provider analyses for BuyVM, BlackHOST, Netcup, LumaDock, and others.

Current State

Control Plane: Go API + jobs + PostgreSQL 16 running in Docker Compose on Advin VPS (208.84.101.84, Kansas City). Cloudflare proxies api.telemyapp.com with TLS termination. UFW locks port 8080 to Cloudflare IP ranges. Deploy new binaries via scp + docker compose restart. Database backups should be on cron (pg_dump piped to gzip).
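The backup step mentioned above (pg_dump piped to gzip) could look roughly like the following. It is a sketch under stated assumptions: the Compose service name "postgres", the database user and name "telemy", and the /opt/telemy/backups directory are all hypothetical; only /opt/telemy itself comes from this page.

```python
import datetime
import gzip
import pathlib
import subprocess


def backup_path(directory, now):
    """Timestamped target file, e.g. telemy-20260402T030000Z.sql.gz."""
    return pathlib.Path(directory) / f"telemy-{now:%Y%m%dT%H%M%SZ}.sql.gz"


def dump_command(service="postgres", user="telemy", database="telemy"):
    """pg_dump inside the Compose service; -T disables TTY allocation for piping."""
    return ["docker", "compose", "exec", "-T", service,
            "pg_dump", "-U", user, database]


def run_backup(directory="/opt/telemy/backups", compose_dir="/opt/telemy"):
    """Stream pg_dump output through gzip; delete the file if pg_dump fails."""
    now = datetime.datetime.now(datetime.timezone.utc)
    out = backup_path(directory, now)
    out.parent.mkdir(parents=True, exist_ok=True)
    proc = subprocess.Popen(dump_command(), cwd=compose_dir,
                            stdout=subprocess.PIPE)
    with gzip.open(out, "wb") as gz:
        for chunk in iter(lambda: proc.stdout.read(65536), b""):
            gz.write(chunk)
    if proc.wait() != 0:
        out.unlink(missing_ok=True)
        raise RuntimeError("pg_dump exited with a non-zero status")
    return out
```

A daily cron entry would then call run_backup() from a small wrapper script. Removing the file on failure avoids leaving a truncated archive that looks like a valid backup.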

Relay Nodes (2 active, ~$11.50/mo total):

| Node | Provider | Location | IP | Cost | Status |
|---|---|---|---|---|---|
| US-Central | Advin KC (kc1) | Kansas City, MO | 208.84.101.84 | $8/mo | Active production relay + control plane |
| US-West | BuyVM LV (lv1) | Las Vegas, NV | 209.141.55.228 | $3.50/mo | Online, testing phase |
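The ~$11.50/mo total and the 2-user break-even quoted in the Summary follow from simple arithmetic. The sketch below assumes $9.99 per user per month and counts hosting cost only (no payment fees, taxes, or other services); break_even_users is a hypothetical helper name.

```python
import math


def break_even_users(monthly_costs, price_per_user):
    """Smallest number of paying users whose revenue covers total hosting cost."""
    return math.ceil(sum(monthly_costs) / price_per_user)


# Node costs from the table above, in USD per month.
nodes = {"kc1": 8.00, "lv1": 3.50}

print(sum(nodes.values()))                      # 11.5
print(break_even_users(nodes.values(), 9.99))   # 2
```

Adding a third node at $11.99/mo would raise the total to $23.49/mo and the break-even to 3 users under the same assumptions.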

Metrics: GET /metrics on API process exposes Prometheus text format. No separate jobs scrape endpoint yet. Starter alerts defined for job failure rate, provisioning p95 latency (>15s), and legacy AWS retry exhaustion.
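As an illustration of consuming the Prometheus text format from GET /metrics, here is a minimal parser with a job failure-rate check. The metric name telemy_job_runs_total and its status label are hypothetical, not the names exported by the API. The regex is deliberately simple and would mis-parse label values that contain a closing brace.

```python
import re

# name, optional {labels}, then the sample value (a trailing timestamp is ignored).
SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_samples(text):
    """Map 'name{labels}' to a float value, skipping comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE.match(line)
        if match:
            key = match.group(1) + (match.group(2) or "")
            samples[key] = float(match.group(3))
    return samples


def failure_rate(samples, ok_key, fail_key):
    """Failed runs as a fraction of all runs; 0.0 when nothing has run yet."""
    ok = samples.get(ok_key, 0.0)
    failed = samples.get(fail_key, 0.0)
    total = ok + failed
    return failed / total if total else 0.0


body = """\
# HELP telemy_job_runs_total Hypothetical job run counter.
# TYPE telemy_job_runs_total counter
telemy_job_runs_total{status="ok"} 95
telemy_job_runs_total{status="error"} 5
"""
samples = parse_samples(body)
print(failure_rate(samples,
                   'telemy_job_runs_total{status="ok"}',
                   'telemy_job_runs_total{status="error"}'))
```

This computes a ratio over lifetime counter totals. A real alert would normally use a windowed rate in the monitoring system instead, so the sketch is best read as a smoke test for the endpoint.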

QA Status: Aggregate relay telemetry PASSED (2026-03-05). Per-link relay telemetry PENDING — blocked on srtla_rec fork exposing per-link metadata. Test cases TC-CONN-001 through TC-CONN-004 and TC-RELAY-002 are defined but unvalidated.

BuyVM Stock: All slices except LV 1 GB out of stock across all locations as of 2026-04-02. Stock tracker at buyvmstock.com configured for notifications. Restocks tend to appear 8-10 AM PST, especially the 1st and 7th of each month.

Key Decisions

  • 2026-03-20: Chose Advin Servers over Hetzner, DigitalOcean, Vultr, and AWS for relay hosting: 32 TB at $8/mo ($0.25/TB), versus as much as $2,780/mo for 32 TB elsewhere.
  • 2026-03-20: Decided to migrate control plane from EC2 to Advin VPS — co-locating API with the relay saves ~$20/mo; Cloudflare proxy means DNS cutover is near-instant and rollback is seconds.
  • 2026-03-20: Architecture: PostgreSQL in Docker (not host-installed) — enables clean Docker Compose orchestration with health checks and secret management via file-based secrets.
  • 2026-03-23: Adopted always-ready model for managed relays — sessions auto-provision/deprovision on connection add/remove, eliminating manual Start/Stop UX.
  • 2026-04-02: Selected BuyVM as primary expansion provider: genuinely unmetered bandwidth, with slices at $7/mo each judged the optimal strategy for 1 Gbps ports.
  • 2026-04-02: Rejected major cloud providers for relay hosting: AWS (as much as $3,200/mo) and Azure, against $7/mo unmetered at BuyVM.
  • 2026-04-02: Identified Netcup as best new EU option — 2 TB/rolling-24h cap with 200 Mbps floor after throttle, 2.5 Gbps port, free 2 Tbit/s DDoS protection, ~EUR 5.84/mo. Requires 12-month contract and UDP testing.
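The bandwidth figures behind these decisions can be checked with a short calculation. The sketch assumes decimal units (1 TB = 10^12 bytes) and a constant sustained rate; whether a given provider counts inbound, outbound, or both directions against its cap is not covered here.

```python
SECONDS_PER_DAY = 86_400


def tb_per_day(mbps):
    """Decimal terabytes transferred per day at a sustained rate in Mbps."""
    return mbps * 1e6 / 8 * SECONDS_PER_DAY / 1e12


def days_until_cap(mbps, cap_tb):
    """Days of sustained traffic before a transfer cap is reached."""
    return cap_tb / tb_per_day(mbps)


def max_sustained_mbps(cap_tb, window_days):
    """Highest constant rate that stays within cap_tb over window_days."""
    return cap_tb * 1e12 * 8 / (window_days * SECONDS_PER_DAY) / 1e6


print(round(8.00 / 32, 2))                    # 0.25  USD per TB on Advin KC
print(round(days_until_cap(150, 32), 1))      # 19.8  days to 32 TB at 150 Mbps
print(round(max_sustained_mbps(32, 30), 1))   # 98.8  Mbps ceiling, 30-day month
print(round(max_sustained_mbps(2, 1), 1))     # 185.2 Mbps to hit 2 TB in 24 h
```

Under these assumptions a relay averaging below roughly 185 Mbps would not reach Netcup's rolling 24-hour cap, and the Advin 32 TB cap corresponds to just under 100 Mbps sustained.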

Experiments & Results

| Experiment | Status | Finding | Source |
|---|---|---|---|
| BuyVM LV1 baseline: TCP throughput KC1-to-LV1 | Passed | 464 Mbps sender / 439 Mbps receiver (limited by 33 ms latency x TCP window) | SERVER_INFRASTRUCTURE_RESEARCH.md |
| BuyVM LV1 baseline: UDP @ 100 Mbps | Passed | 100 Mbps sustained, 0% packet loss, 0.1 ms jitter | SERVER_INFRASTRUCTURE_RESEARCH.md |
| BuyVM LV1 baseline: latency + route quality | Passed | 32.8 ms avg, 0.63 ms jitter, 14 hops, 0% loss on all hops | SERVER_INFRASTRUCTURE_RESEARCH.md |
| TC-RELAY-001: IRL Pro bonded stream aggregate telemetry | Passed (2026-03-05) | Bitrate bar, RTT/latency/loss pills, 2 s update cycle all confirmed working | QA_CHECKLIST_RELAY_TELEMETRY.md |
| Relay IPC round-trip (Start/Provisioning/Active/Stop/Stopped) | Passed | Full lifecycle confirmed | QA_CHECKLIST_RELAY_TELEMETRY.md |
| API connectivity: C++ RelayClient to Go control plane via HTTPS | Passed | Confirmed working | QA_CHECKLIST_RELAY_TELEMETRY.md |
| Dock telemetry path: SLS stats API to CEF injection to React UI | Passed | Full pipeline confirmed | QA_CHECKLIST_RELAY_TELEMETRY.md |
| MetricsCollector background thread (off render thread) | Passed (2026-03-20) | Start()/Stop()/PollLoop() verified | QA_CHECKLIST_RELAY_TELEMETRY.md |
| Per-link relay telemetry (bonded per-carrier stats) | Pending | Blocked on srtla_rec fork exposing per-link metadata | QA_CHECKLIST_RELAY_TELEMETRY.md |
| TC-CONN-001: Multiple simultaneous BYOR connections | Not started | | QA_CHECKLIST_RELAY_TELEMETRY.md |
| TC-CONN-002: Connection persistence across OBS restarts | Not started | | QA_CHECKLIST_RELAY_TELEMETRY.md |
| TC-CONN-003: Managed add/remove flow (always-ready model) | Not started | | QA_CHECKLIST_RELAY_TELEMETRY.md |
| TC-CONN-004: Per-link telemetry with multiple carriers | Not started | | QA_CHECKLIST_RELAY_TELEMETRY.md |
| TC-RELAY-002: Per-link telemetry bonded disconnect/reconnect | Not started | | QA_CHECKLIST_RELAY_TELEMETRY.md |

Gotchas & Known Issues

  • Advin 32 TB hard cap: At 150 Mbps sustained, the KC1 node exceeds 32 TB by day ~20 and gets throttled to 1 Mbps until month reset. Safe sustained ceiling is ~100 Mbps.
  • Advin non-KC locations are not viable for relay: Other cities (LA, Miami, Amsterdam, Tokyo, etc.) get only 1-4 TB at the same $8 price point.
  • BuyVM bandwidth enforcement is undocumented: The ~25 Mbps/GB-RAM guideline exists only in forum posts, not the AUP. Throttle happens without warning; removal requires contacting support.
  • BuyVM chronic stock shortage: All slices except LV 1 GB out of stock. No provisioning API exists — ordering is manual only. Stallion API v2 with provisioning announced but no timeline.
  • Netcup Singapore is a trap: 2 TB/month cap with 5 Mbps throttle floor — critically different from other Netcup locations (2 TB/day with 200 Mbps floor). Do not use for relay.
  • Netcup 12-month lock-in: Must test UDP relay workload (DDoS filter false-positive risk on SRT) before committing to annual contract.
  • BlackHOST DDoS protection costs $429+/mo: UDP relay is exposed to attacks with no affordable mitigation option. IP may be null-routed if targeted.
  • cmd/jobs has no metrics endpoint: Background job metrics are only visible from the API process. If API and jobs are split into separate processes, a separate HTTP listener for jobs metrics must be added.
  • Label cardinality: Do not add user/session IDs to Prometheus metric labels — keep cardinality low.
  • Security tokens in dock snapshots: QA must verify pair_token and relay_ws_token are EXCLUDED from JSON snapshots sent to the CEF dock.
  • EC2 teardown timing: Do not terminate the old EC2 instance until at least 24 hours of clean operation on Advin. Release any EIP attached to the API EC2 to stop ~$3.60/mo charge.
  • Rollback criteria: Roll back DNS immediately if health endpoint returns non-200, OBS plugin fails to connect, database errors appear in logs, or relay provisioning fails.
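For the dock snapshot token check listed above, a recursive key scan is one way QA could automate the verification. This is a sketch only: the snapshot shape shown is hypothetical, and the scan matches key names, so it would not catch a token value stored under a different key.

```python
import json

# Field names that must never appear in a snapshot sent to the CEF dock.
SECRET_KEYS = frozenset({"pair_token", "relay_ws_token"})


def leaked_keys(node, path=""):
    """Return the dotted paths of any secret keys found in nested JSON data."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}.{key}" if path else key
            if key in SECRET_KEYS:
                found.append(here)
            found.extend(leaked_keys(value, here))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(leaked_keys(item, f"{path}[{index}]"))
    return found


# Hypothetical snapshot payload, for illustration only.
snapshot = json.loads(
    '{"relay": {"status": "active", "relay_host_masked": "masked"}}'
)
print(leaked_keys(snapshot))  # []
```

Returning the offending paths rather than a boolean makes a failing check point directly at the part of the snapshot that needs fixing.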

Open Questions

  • Has the API migration from EC2 to Advin actually been executed, or is it still at the plan stage? The plan document is dated 2026-03-20 but does not contain execution confirmation.
  • When will srtla_rec fork expose per-link metadata to unblock per-link relay telemetry (TC-RELAY-002, TC-CONN-004)?
  • Should cmd/jobs get its own HTTP metrics listener on port 8081, or will it remain co-located with the API process?
  • Which EU provider for Phase 1 expansion — BlackHOST ($11.99/mo, unmetered, no DDoS) or Netcup (~EUR 5.84/mo, 2 TB/day cap, free DDoS, 12-month contract)?
  • Has SRT/SRTLA UDP been tested against Netcup’s DDoS filter to check for false positives before committing to a 12-month contract?
  • What is the plan for BuyVM DDoS protection ($3/mo add-on) — should it be added to LV1 now during testing, or only when the node goes production?
  • Are automated database backups (pg_dump cron) configured on the Advin VPS yet?
  • What monitoring/alerting stack is consuming the Prometheus metrics? Is Prometheus actually running and scraping the API endpoint?

Sources

  • OPERATIONS_METRICS.md
  • SERVER_INFRASTRUCTURE_RESEARCH.md
  • VPS_Relay_Server_Comparison.md
  • QA_CHECKLIST_RELAY_TELEMETRY.md
  • 2026-03-20-api-migration-advin.md