Relay
Summary
The Telemy relay system provides bonded SRT relay infrastructure for IRL streaming, routing traffic from mobile encoders (IRL Pro) through SRTLA bonding proxies to OBS. The architecture evolved through three major phases: ephemeral per-user AWS EC2 instances (v0.0.4, retired 2026-03-20), a shared always-on VPS relay pool on Advin Servers (Phase 3, deployed 2026-03-20), and the always-ready lifecycle model that removed connect/disconnect UX entirely (AR-0 through AR-3, deployed 2026-03-23 via commit 625699f). The current single relay node kc1 runs in Kansas City on a KVM Standard XS (4 vCPU / 8 GB / 64 GB NVMe / 32 TB BW) at $9-13/month.
The relay stack runs a custom fork of srtla-receiver (ghcr.io/michaelpentz/srtla-receiver:latest) in Docker Compose, providing SRTLA bonded ingest (UDP 5000), SRT player output (UDP 4000), SLS management API (TCP 3000/8090), and per-link stats (TCP 5080) with ASN carrier identification via IPinfo Lite mmdb. The control plane (Go) manages session lifecycle through a PoolProvisioner that atomically assigns sessions to pool servers in under 2 seconds, registers stream IDs via the SLS API, and creates per-slot DNS records via Cloudflare. The C++ OBS plugin polls aggregate stats (port 8090) and per-link stats (port 5080) every 2 seconds, rendering real-time bitrate bars with carrier labels (T-Mobile, AT&T, etc.) in the dock UI.
The always-ready model treats relay slots as persistent resources: relays provision automatically when a managed connection is added (or on OBS load), deprovision when removed, and are protected by server-side session leases (5-minute expiry, 30-second heartbeat) so crashed clients cannot leak pool capacity. Per-slot DNS slugs (8-character base36, e.g., k7mx2p9a.relay.telemyapp.com) provide stable URLs that survive region changes. Zero-downtime region migration (AR-3, designed but not yet needed) uses a make-before-break pattern with DNS repoint and 120-second drain.
Timeline
- 2026-03-01: Relay timeout bug identified — relay HTTP calls blocking the IPC heartbeat loop caused `relay_action_result_not_observed` timeouts. Root cause: sequential polling loop in the Rust IPC `handle_session_io`.
- 2026-03-01: Fix designed: refactor to a `tokio::select!`-based event loop with spawned background tasks for relay HTTP calls. Hardening bundle: SeqCst ordering fix, helper deduplication, async mutex swap.
- 2026-03-03: Aggregate relay telemetry designed and implemented. C++ plugin polls the SLS stats endpoint (`GET :8090/stats/play_{token}?legacy=1`) every 2s via WinHTTP. Bitrate, RTT, packet loss, and latency displayed in the dock.
- 2026-03-05: Per-user relay DNS designed. Permanent 6-char alphanumeric slugs per user; Cloudflare A records created on relay start, deleted on stop. IRL Pro users configure the ingest URL once.
- 2026-03-05: Relay E2E telemetry validated via IRL Pro bonded stream to the AWS relay (TC-RELAY-001 PASSED).
- 2026-03-06: Per-link relay telemetry designed. Forked srtla_rec (AGPL-3.0) to add per-connection byte/packet counters and an HTTP stats server on port 5080. Custom Docker image `ghcr.io/telemyapp/srtla-receiver`.
- 2026-03-07: Relay provision progress designed. Async provisioning with a 6-step pipeline (`launching_instance` through `ready`), polled every 2s by the plugin and rendered as a progress bar in the dock.
- 2026-03-08: Elastic IP design for stable relay addresses. One EIP per user per region, associated during provisioning. Cost: $3.60/mo idle per user — identified as a scaling problem.
- 2026-03-18: Comprehensive relay strategy analysis. Three options evaluated: BYOR (free, 2-3 days), Hetzner shared VPS (8-11 days), hybrid model (13-17 days). Cost analysis: ~$0.50/user/month on a shared VPS (~94% cheaper than AWS). The Provisioner interface identified as the right abstraction seam.
- 2026-03-20: AWS EC2 relay retired. Last instance terminated in us-west-2; Elastic IP `eipalloc-05476ee61b32e4891` released. Phase 3 relay pool deployed on Advin Servers.
- 2026-03-20: E2E gap fixes deployed: snapshot push on connect/disconnect, `relay_host_masked`, `not_found` for unknown IDs. MetricsCollector background thread moved polling off the OBS render thread.
- 2026-03-22: Always-ready relay architecture designed (AR-0 through AR-3). Multi-relay client refactor: per-connection RelayClients stored in `active_clients_map`. Stream slot system implemented (commits `b706153`, `c70efe7`, `0e933aa`).
- 2026-03-23: Always-ready relay deployed (commit `625699f`). AR-0 through AR-3 plus 18 bug fixes. Connect/disconnect UX removed. Auto-provision on add, auto-deprovision on remove, auto-provision on OBS load with jitter and a concurrency cap.
Current State
Active relay node: kc1.relay.telemyapp.com (Kansas City, Advin Servers, KVM Standard XS). DNS is Cloudflare DNS-only (not proxied) for direct UDP routing.
Software stack: Ubuntu 24.04, Docker Compose, custom srtla-receiver fork with ASN carrier identification (IPinfo Lite mmdb at /opt/srtla-receiver/data/ipinfo_lite.mmdb).
Relay lifecycle: Always-ready model is deployed. Managed connections auto-provision on add and on OBS load (jittered, max 2 concurrent). Session leases expire after 5 minutes without heartbeat; stale session reaper runs every 60 seconds. BYOR connections are supported with direct stats polling (no server-side provisioning).
Telemetry: Aggregate SLS stats (bitrate, RTT, loss, latency) polling validated. Per-link stats API (port 5080) operational with carrier labels from ASN lookup. Dock shows real-time bitrate bars, stale link detection (fades after 3s), and connection count badges.
DNS: Per-slot DNS slugs (8-character base36) provide stable <slug>.relay.telemyapp.com hostnames. TTL 30s. Records persist across stop/start (only deleted when slot is permanently removed).
Provisioner: PoolProvisioner implements the relay.Provisioner interface. Calls store.AssignRelay() (atomic least-loaded server pick with FOR UPDATE SKIP LOCKED) then SLSClient.CreateStreamIDs(). Provision time: <2 seconds vs ~30-45 seconds on AWS EC2.
Port map: UDP 5000 (SRTLA ingest), UDP 4001 (SRT publisher), UDP 4000 (SRT player), TCP 3000 (SLS management, restricted), TCP 8090 (SLS stats, restricted), TCP 5080 (per-link stats, restricted).
Status display states: provisioning (amber) → ready (green, no carrier bars) → live (green, carrier bars shown) → error (red). ready/live are UI projections; the backend session state (active) is the source of truth.
Key Decisions
- 2026-03-01: Refactor IPC to a `tokio::select!` event loop — relay HTTP calls must not block the heartbeat ping/pong cycle. Sequential polling was the root cause of the relay timeout bugs.
- 2026-03-05: Per-user DNS slugs are permanent and routing-only — knowing the slug without the stream token gets nothing. SLS stream IDs remain `live_<stream_token>` / `play_<stream_token>` for authentication.
- 2026-03-06: Fork srtla_rec under AGPL-3.0 for per-link stats — no upstream stats API exists. Four counter instructions added to the hot path (no branches, no allocations, no locks). Stats HTTP server runs on a separate pthread.
- 2026-03-08: Elastic IP model rejected for scale — $360/mo in EIP costs alone at 100 users ($3.60 each), regardless of usage. Drove the decision toward an always-on shared VPS.
- 2026-03-18: Hybrid model chosen (BYOR free + managed paid). BYOR users incur zero infrastructure cost. The Provisioner interface (`Provision`/`Deprovision`) is the primary abstraction seam. The plugin is fully provider-agnostic (it only knows HTTP endpoints and IPs).
- 2026-03-20: AWS EC2 relay retired in favor of the Advin VPS pool. Cost reduction: ~97% cheaper at scale (100 users: ~$26/mo shared VPS). 32 TB/month of included bandwidth eliminates data transfer overage risk.
- 2026-03-22: Always-ready lifecycle adopted — relay bandwidth is zero when no sender is active, so there is no cost to keeping relays provisioned. Users should not manage relay lifecycle. Slots are persistent resources.
- 2026-03-22: Session leases chosen over session timeout — a 5-minute lease with a 30-second heartbeat means ~10 missed heartbeats before expiry. Reaper deprovisioning is idempotent. `StartOrGetSession` checks the lease on existing sessions and re-provisions if expired.
- 2026-03-22: DNS slug format: 8-character base36 (a-z0-9) via `crypto/rand`, collision-checked with DB retry (max 5 attempts). ~2.8 trillion possible values. Slug enumeration is a privacy concern only (not auth).
- 2026-03-22: Stream slot system: server-side `user_stream_slots` table with `(user_id, slot_number)` primary key. Each slot maps to a managed connection. `max_concurrent_conns` from the plan tier limits active sessions. Slots persist across session stop/start.
Experiments & Results
| Experiment | Status | Finding | Source |
|---|---|---|---|
| IRL Pro bonded stream aggregate telemetry (TC-RELAY-001) | PASSED (2026-03-05) | Bitrate bar shows aggregate throughput, RTT/latency/loss pills update every 2s | QA_CHECKLIST_RELAY_TELEMETRY.md |
| Relay IPC round-trip lifecycle | PASSED | Start → Provisioning → Active → Stop → Stopped confirmed | QA_CHECKLIST_RELAY_TELEMETRY.md |
| API connectivity (C++ RelayClient → Go control plane via HTTPS) | PASSED | Confirmed working | QA_CHECKLIST_RELAY_TELEMETRY.md |
| Dock telemetry path (SLS → C++ → JSON → CEF → React) | PASSED | Full pipeline confirmed | QA_CHECKLIST_RELAY_TELEMETRY.md |
| E2E gap fixes (snapshot push, host masking, not_found) | PASSED (2026-03-20) | All gaps resolved | QA_CHECKLIST_RELAY_TELEMETRY.md |
| MetricsCollector background thread | PASSED (2026-03-20) | Polling moved off OBS render thread; Start()/Stop()/PollLoop() verified | QA_CHECKLIST_RELAY_TELEMETRY.md |
| AWS vs Hetzner/VPS cost comparison | Completed (2026-03-18) | Shared VPS: ~$0.50/user/month, ~94-97% cheaper than AWS | relay-strategy-analysis.md |
| Per-link relay telemetry | PENDING | Requires srtla_rec fork with per-link metadata exposed on port 5080 | QA_CHECKLIST_RELAY_TELEMETRY.md |
| Multiple simultaneous BYOR connections (TC-CONN-001) | PENDING | Independent telemetry per connection, no interference | QA_CHECKLIST_RELAY_TELEMETRY.md |
| Connection persistence across OBS restarts (TC-CONN-002) | PENDING | BYOR config restored from config.json, sensitive fields from DPAPI vault | QA_CHECKLIST_RELAY_TELEMETRY.md |
| Managed add/remove flow (TC-CONN-003) | PENDING | 6-step provision progress UI, auto-deprovision on remove | QA_CHECKLIST_RELAY_TELEMETRY.md |
| Per-link telemetry with multiple carriers (TC-CONN-004) | PENDING | Distinct carrier labels for bonded stream, share_pct totals ~100% | QA_CHECKLIST_RELAY_TELEMETRY.md |
Gotchas & Known Issues
- srtla_rec acts as a raw UDP proxy on port 5000, forwarding bonded traffic to `localhost:4001` where SLS handles the SRT session. It is not an SRT endpoint itself.
- IRL Pro's "Connection Bonding Service" must be disabled — enabling it routes through IRL Toolkit proxies, which fail with the private relay. Use the "Own Bonding Server (SRTLA)" setting instead.
- DNS records must be DNS-only (grey cloud) in Cloudflare — UDP cannot be proxied through Cloudflare's network. Proxied records will break SRT/SRTLA connections.
- Mobile DNS caching (iOS) — IRL Pro on iOS caches DNS aggressively. With the always-ready model and stable DNS slugs this is no longer an issue, since the IP behind a slug rarely changes. It previously caused failures with ephemeral EC2 instances.
- SLS management API uses `/api/stream-ids`, not `/api/v1/`. API key stored at `/opt/srtla-receiver/data/.apikey` on relay nodes.
- Ports 8090 and 5080 must be restricted to the control plane IP via UFW. These are management/stats ports not intended for public access. Update UFW rules if the control plane IP changes.
- ASN carrier identification requires the IPinfo Lite mmdb volume-mounted at `/usr/share/GeoIP/ipinfo_lite.mmdb` inside the container. Without it, the `asn_org` field is omitted and the dock labels links generically as "Link 1", "Link 2".
- Thundering herd on OBS start — auto-provision of saved connections uses a random 0-2s jitter per connection with at most 2 concurrent provision workers to avoid simultaneous API load.
- srtla_rec fork is AGPL-3.0 — the fork must remain public on GitHub (`Telemyapp/srtla`) with original copyright notices and modification notes in source headers.
- EIP management code in `runProvisionPipeline` has a type assertion (`s.provisioner.(*relay.AWSProvisioner)`) that breaks the provider abstraction; it needs refactoring if multiple providers are re-introduced. `ProvisionResult` and `DeprovisionRequest` use AWS-named fields (`AWSInstanceID`) that should be renamed to `InstanceID` for provider neutrality. Technical debt tracked.
- Per-link throughput in the dock is computed client-side from cumulative `bytes` deltas between polls. Previous values are stored in a React ref keyed by `link.addr`.
- The session reaper and `StartOrGetSession` both handle expired leases — the reaper runs every 60s as a background job; `StartOrGetSession` also checks inline and stops expired sessions before creating fresh ones (handles OBS restart recovery).
Open Questions
- When will TC-CONN-001 through TC-CONN-004 be validated? These test cases cover multi-connection, persistence, managed flow, and per-link telemetry with carriers. All are PENDING.
- Will AR-3 (zero-downtime region migration) be needed? Designed with make-before-break pattern, 120s drain, 1-hour cooldown, 6 changes/day rate limit. Currently single-region (kc1), so not yet deployed.
- Global mesh expansion? The relay pool architecture supports multiple regions with region-affinity assignment (`ORDER BY CASE WHEN rp.region = $1 THEN 0 ELSE 1 END`), but only one node exists. Adding a node requires: VPS provision, Docker stack copy, UFW config, Cloudflare DNS record, `INSERT INTO relay_pool`.
- Per-link jitter, health scores, and link quality trends are deferred to v2 per the per-link telemetry design. Also deferred: GeoIP carrier name lookup improvements and "consider reducing weight" recommendations.
- C++ test framework (RF-009 backlog) — no automated testing for the OBS plugin. All validation is manual DLL testing in OBS.
Sources
- RELAY_DEPLOYMENT.md
- QA_CHECKLIST_RELAY_TELEMETRY.md
- 2026-03-20-phase3-relay-pool.md
- 2026-03-22-always-ready-relay-design.md
- 2026-03-22-always-ready-relay-plan.md
- 2026-03-22-multi-relay-client-refactor.md
- 2026-03-18-relay-strategy-analysis.md
- 2026-03-22-stream-slot-spec.md
- 2026-03-03-relay-telemetry-design.md
- 2026-03-03-relay-telemetry-plan.md
- 2026-03-05-per-user-relay-dns-design.md
- 2026-03-05-per-user-relay-dns-plan.md
- 2026-03-06-per-link-relay-telemetry-design.md
- 2026-03-06-per-link-relay-telemetry-plan.md
- 2026-03-07-relay-provision-progress-design.md
- 2026-03-07-relay-provision-progress-plan.md
- 2026-03-08-elastic-ip-relay-design.md
- 2026-03-01-relay-timeout-fix-design.md
- 2026-03-01-relay-timeout-fix-plan.md