Relay

Summary

The Telemy relay system provides bonded SRT relay infrastructure for IRL streaming, routing traffic from mobile encoders (IRL Pro) through SRTLA bonding proxies to OBS. The architecture evolved through three major phases: ephemeral per-user AWS EC2 instances (v0.0.4, retired 2026-03-20), a shared always-on VPS relay pool on Advin Servers (Phase 3, deployed 2026-03-20), and the always-ready lifecycle model that removed connect/disconnect UX entirely (AR-0 through AR-3, deployed 2026-03-23 via commit 625699f). The current single relay node kc1 runs in Kansas City on a KVM Standard XS (4 vCPU / 8 GB / 64 GB NVMe / 32 TB BW) at $9-13/user/month.

The relay stack runs a custom fork of srtla-receiver (ghcr.io/michaelpentz/srtla-receiver:latest) in Docker Compose, providing SRTLA bonded ingest (UDP 5000), SRT player output (UDP 4000), SLS management API (TCP 3000/8090), and per-link stats (TCP 5080) with ASN carrier identification via IPinfo Lite mmdb. The control plane (Go) manages session lifecycle through a PoolProvisioner that atomically assigns sessions to pool servers in under 2 seconds, registers stream IDs via the SLS API, and creates per-slot DNS records via Cloudflare. The C++ OBS plugin polls aggregate stats (port 8090) and per-link stats (port 5080) every 2 seconds, rendering real-time bitrate bars with carrier labels (T-Mobile, AT&T, etc.) in the dock UI.

The always-ready model treats relay slots as persistent resources: relays provision automatically when a managed connection is added (or on OBS load), deprovision when removed, and are protected by server-side session leases (5-minute expiry, 30-second heartbeat) so crashed clients cannot leak pool capacity. Per-slot DNS slugs (8-character base36, e.g., k7mx2p9a.relay.telemyapp.com) provide stable URLs that survive region changes. Zero-downtime region migration (AR-3, designed but not yet needed) uses a make-before-break pattern with DNS repoint and 120-second drain.

Timeline

  • 2026-03-01: Relay timeout bug identified — relay HTTP calls blocking IPC heartbeat loop caused relay_action_result_not_observed timeouts. Root cause: sequential polling loop in Rust IPC handle_session_io.
  • 2026-03-01: Fix designed: refactor to tokio::select!-based event loop with spawned background tasks for relay HTTP calls. Hardening bundle: SeqCst ordering fix, helper deduplication, async mutex swap.
  • 2026-03-03: Aggregate relay telemetry designed and implemented. C++ plugin polls SLS stats endpoint (GET :8090/stats/play_{token}?legacy=1) every 2s via WinHTTP. Bitrate, RTT, packet loss, latency displayed in dock.
  • 2026-03-05: Per-user relay DNS designed. Permanent 6-char alphanumeric slugs per user, Cloudflare A records created on relay start, deleted on stop. IRL Pro users configure ingest URL once.
  • 2026-03-05: Relay E2E telemetry validated via IRL Pro bonded stream to AWS relay (TC-RELAY-001 PASSED).
  • 2026-03-06: Per-link relay telemetry designed. Forked srtla_rec (AGPL-3.0) to add per-connection byte/packet counters and HTTP stats server on port 5080. Custom Docker image ghcr.io/telemyapp/srtla-receiver.
  • 2026-03-07: Relay provision progress designed. Async provisioning with 6-step pipeline (launching_instance through ready), polled every 2s by plugin, rendered as progress bar in dock.
  • 2026-03-08: Elastic IP design for stable relay addresses. One EIP per user per region, associated during provisioning. Cost: $3.60/mo idle per user — identified as scaling problem.
  • 2026-03-18: Comprehensive relay strategy analysis. Three options evaluated: BYOR (free, 2-3 days), Hetzner shared VPS (8-11 days), hybrid model (13-17 days). Cost analysis vs AWS: ~$0.50/user/month on shared VPS (~94% cheaper than AWS). Provisioner interface identified as the right abstraction seam.
  • 2026-03-20: AWS EC2 relay retired. Last instance terminated in us-west-2. Elastic IP eipalloc-05476ee61b32e4891 released. Phase 3 relay pool deployed on Advin Servers.
  • 2026-03-20: E2E gap fixes deployed: snapshot push on connect/disconnect, relay_host_masked, not_found for unknown IDs. MetricsCollector background thread moved polling off OBS render thread.
  • 2026-03-22: Always-ready relay architecture designed (AR-0 through AR-3). Multi-relay client refactor: per-connection RelayClients stored in active_clients_ map. Stream slot system implemented (commits b706153, c70efe7, 0e933aa).
  • 2026-03-23: Always-ready relay deployed (commit 625699f). AR-0 through AR-3 plus 18 bug fixes. Connect/disconnect UX removed. Auto-provision on add, auto-deprovision on remove, auto-provision on OBS load with jitter and concurrency cap.

Current State

Active relay node: kc1.relay.telemyapp.com (Kansas City, Advin Servers, KVM Standard XS). DNS is Cloudflare DNS-only (not proxied) for direct UDP routing.

Software stack: Ubuntu 24.04, Docker Compose, custom srtla-receiver fork with ASN carrier identification (IPinfo Lite mmdb at /opt/srtla-receiver/data/ipinfo_lite.mmdb).

Relay lifecycle: Always-ready model is deployed. Managed connections auto-provision on add and on OBS load (jittered, max 2 concurrent). Session leases expire after 5 minutes without heartbeat; stale session reaper runs every 60 seconds. BYOR connections are supported with direct stats polling (no server-side provisioning).

Telemetry: Aggregate SLS stats (bitrate, RTT, loss, latency) polling validated. Per-link stats API (port 5080) operational with carrier labels from ASN lookup. Dock shows real-time bitrate bars, stale link detection (fades after 3s), and connection count badges.

DNS: Per-slot DNS slugs (8-character base36) provide stable <slug>.relay.telemyapp.com hostnames. TTL 30s. Records persist across stop/start (only deleted when slot is permanently removed).

Provisioner: PoolProvisioner implements the relay.Provisioner interface. Calls store.AssignRelay() (atomic least-loaded server pick with FOR UPDATE SKIP LOCKED) then SLSClient.CreateStreamIDs(). Provision time: <2 seconds vs ~30-45 seconds on AWS EC2.

Port map: UDP 5000 (SRTLA ingest), UDP 4001 (SRT publisher), UDP 4000 (SRT player), TCP 3000 (SLS management, restricted), TCP 8090 (SLS stats, restricted), TCP 5080 (per-link stats, restricted).

Status display states: provisioning (amber), ready (green, no carrier bars), live (green, carrier bars shown), error (red). ready/live are UI projections; backend session state (active) is the source of truth.
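The ready/live projection can be sketched as a pure function of backend state plus observed throughput. The function name and inputs here are illustrative (the real projection lives in the plugin/dock); the backend distinguishes only provisioning/active/error, and the client splits active into ready vs live based on whether sender traffic is arriving.

```go
package main

import "fmt"

// displayState maps backend session state plus live telemetry onto the four
// dock states. Backend "active" is the source of truth; ready vs live is a
// client-side distinction driven by observed bitrate.
func displayState(backendState string, bitrateKbps int) string {
	switch backendState {
	case "provisioning":
		return "provisioning" // amber
	case "active":
		if bitrateKbps > 0 {
			return "live" // green, carrier bars shown
		}
		return "ready" // green, no carrier bars
	default:
		return "error" // red
	}
}

func main() {
	fmt.Println(displayState("active", 0))    // ready
	fmt.Println(displayState("active", 6200)) // live
}
```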

Key Decisions

  • 2026-03-01: Refactor IPC to tokio::select! event loop — relay HTTP calls must not block the heartbeat ping/pong cycle. Sequential polling was the root cause of relay timeout bugs.
  • 2026-03-05: Per-user DNS slugs are permanent and routing-only — knowing the slug without the stream token gets nothing. SLS stream IDs remain live_<stream_token> / play_<stream_token> for authentication.
  • 2026-03-06: Fork srtla_rec under AGPL-3.0 for per-link stats — no upstream stats API exists. Four counter instructions added to hot path (no branches, no allocations, no locks). Stats HTTP server runs on separate pthread.
  • 2026-03-08: Elastic IP model rejected for scale — $3.60/user-month idle means $360/mo in EIP costs alone at 100 users, regardless of usage. Drove decision toward always-on shared VPS.
  • 2026-03-18: Hybrid model chosen (BYOR free + managed paid). BYOR users incur zero infrastructure cost. Provisioner interface (Provision / Deprovision) is the primary abstraction seam. Plugin is fully provider-agnostic (only knows HTTP endpoints and IPs).
  • 2026-03-20: AWS EC2 relay retired in favor of Advin VPS pool. Cost reduction: ~97% cheaper at scale (100 users: ~$26/mo shared VPS). 32 TB/month included bandwidth eliminates data transfer overage risk.
  • 2026-03-22: Always-ready lifecycle adopted — relay bandwidth is zero when no sender is active, so there is no cost to keeping relays provisioned. Users should not manage relay lifecycle. Slots are persistent resources.
  • 2026-03-22: Session leases chosen over session timeout — 5-minute lease with 30-second heartbeat means ~10 missed heartbeats before expiry. Reaper deprovisioning is idempotent. StartOrGetSession checks lease on existing sessions and re-provisions if expired.
  • 2026-03-22: DNS slug format: 8-character base36 (a-z0-9) via crypto/rand, collision-checked with DB retry (max 5 attempts). 2.8 trillion possible values. Slug enumeration is a privacy concern only (not auth).
  • 2026-03-22: Stream slot system: server-side user_stream_slots table with (user_id, slot_number) primary key. Each slot maps to a managed connection. max_concurrent_conns from plan tier limits active sessions. Slots persist across session stop/start.
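The 2026-03-01 decision above is Rust-side (tokio::select!), but the shape of the fix translates directly: heartbeat ticks must fire on schedule while slow relay HTTP calls run as spawned background tasks reporting back over a channel. A Go analog of the same pattern, with illustrative names and scaled-down timings (this is not the actual IPC code):

```go
package main

import (
	"fmt"
	"time"
)

// eventLoop illustrates the select-based event loop: the heartbeat case is
// never blocked by an in-flight relay call, because the call runs in its
// own goroutine and delivers its result over a channel.
func eventLoop(relayCall func() string, done <-chan struct{}) {
	heartbeat := time.NewTicker(30 * time.Millisecond) // stands in for the real heartbeat cadence
	defer heartbeat.Stop()
	results := make(chan string, 1)

	// Spawn the slow relay HTTP call; the loop below keeps ticking meanwhile.
	go func() { results <- relayCall() }()

	for {
		select {
		case <-heartbeat.C:
			fmt.Println("ping") // heartbeat stays on schedule
		case r := <-results:
			fmt.Println("relay result:", r)
		case <-done:
			return
		}
	}
}

func main() {
	done := make(chan struct{})
	time.AfterFunc(100*time.Millisecond, func() { close(done) })
	slowCall := func() string { time.Sleep(60 * time.Millisecond); return "ok" }
	eventLoop(slowCall, done)
}
```

The sequential version (poll relay, then pong) is exactly what caused relay_action_result_not_observed: a slow HTTP call starved the heartbeat.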

Experiments & Results

Experiment | Status | Finding | Source
IRL Pro bonded stream aggregate telemetry (TC-RELAY-001) | PASSED (2026-03-05) | Bitrate bar shows aggregate throughput; RTT/latency/loss pills update every 2s | QA_CHECKLIST_RELAY_TELEMETRY.md
Relay IPC round-trip lifecycle | PASSED | Start → Provisioning → Active → Stop → Stopped confirmed | QA_CHECKLIST_RELAY_TELEMETRY.md
API connectivity (C++ RelayClient → Go control plane via HTTPS) | PASSED | Confirmed working | QA_CHECKLIST_RELAY_TELEMETRY.md
Dock telemetry path (SLS → C++ → JSON → CEF → React) | PASSED | Full pipeline confirmed | QA_CHECKLIST_RELAY_TELEMETRY.md
E2E gap fixes (snapshot push, host masking, not_found) | PASSED (2026-03-20) | All gaps resolved | QA_CHECKLIST_RELAY_TELEMETRY.md
MetricsCollector background thread | PASSED (2026-03-20) | Polling moved off OBS render thread; Start()/Stop()/PollLoop() verified | QA_CHECKLIST_RELAY_TELEMETRY.md
AWS vs Hetzner/VPS cost comparison | Completed (2026-03-18) | Shared VPS ~$0.50/user/month, ~94-97% cheaper than AWS | relay-strategy-analysis.md
Per-link relay telemetry | PENDING | Requires srtla_rec fork with per-link metadata exposed on port 5080 | QA_CHECKLIST_RELAY_TELEMETRY.md
Multiple simultaneous BYOR connections (TC-CONN-001) | PENDING | Independent telemetry per connection, no interference | QA_CHECKLIST_RELAY_TELEMETRY.md
Connection persistence across OBS restarts (TC-CONN-002) | PENDING | BYOR config restored from config.json, sensitive fields from DPAPI vault | QA_CHECKLIST_RELAY_TELEMETRY.md
Managed add/remove flow (TC-CONN-003) | PENDING | 6-step provision progress UI, auto-deprovision on remove | QA_CHECKLIST_RELAY_TELEMETRY.md
Per-link telemetry with multiple carriers (TC-CONN-004) | PENDING | Distinct carrier labels for bonded stream, share_pct totals ~100% | QA_CHECKLIST_RELAY_TELEMETRY.md

Gotchas & Known Issues

  • srtla_rec acts as raw UDP proxy on port 5000, forwarding bonded traffic to localhost:4001 where SLS handles the SRT session. It is not an SRT endpoint itself.
  • IRL Pro “Connection Bonding Service” must be disabled — enabling it routes through IRL Toolkit proxies, which fail with the private relay. Use the “Own Bonding Server (SRTLA)” setting instead.
  • DNS records must be DNS-only (grey cloud) in Cloudflare — UDP cannot be proxied through Cloudflare’s network. Proxied records will break SRT/SRTLA connections.
  • Mobile DNS caching (iOS) — IRL Pro on iOS caches DNS aggressively. With the always-ready model and stable DNS slugs, this is no longer an issue since the IP behind the slug rarely changes. Previously caused failures with ephemeral EC2 instances.
  • SLS management API uses /api/stream-ids, not /api/v1/. API key stored at /opt/srtla-receiver/data/.apikey on relay nodes.
  • Port 8090 and 5080 must be restricted to control plane IP via UFW. These are management/stats ports not intended for public access. Update UFW rules if control plane IP changes.
  • ASN carrier identification requires IPinfo Lite mmdb volume-mounted at /usr/share/GeoIP/ipinfo_lite.mmdb inside the container. Without it, asn_org field is omitted and the dock labels links generically as “Link 1”, “Link 2”.
  • Thundering herd on OBS start — auto-provision of saved connections uses random 0-2s jitter per connection with max 2 concurrent provision workers to avoid simultaneous API load.
  • srtla_rec fork is AGPL-3.0 — the fork must remain public on GitHub (Telemyapp/srtla) with original copyright notices and modification notes in source headers.
  • EIP management code in runProvisionPipeline has a type assertion (s.provisioner.(*relay.AWSProvisioner)) that breaks the provider abstraction. Needs refactoring if multiple providers are re-introduced.
  • ProvisionResult and DeprovisionRequest use AWS-named fields (AWSInstanceID) — should be renamed to InstanceID for provider neutrality. Technical debt tracked.
  • Per-link throughput in dock is computed client-side from cumulative bytes deltas between polls. Previous values stored in React ref keyed by link.addr.
  • Session reaper and StartOrGetSession both handle expired leases — reaper runs every 60s as a background job; StartOrGetSession also checks inline and stops expired sessions before creating fresh ones (handles OBS restart recovery).

Open Questions

  • When will TC-CONN-001 through TC-CONN-004 be validated? These test cases cover multi-connection, persistence, managed flow, and per-link telemetry with carriers. All are PENDING.
  • Will AR-3 (zero-downtime region migration) be needed? Designed with make-before-break pattern, 120s drain, 1-hour cooldown, 6 changes/day rate limit. Currently single-region (kc1), so not yet deployed.
  • Global mesh expansion? The relay pool architecture supports multiple regions with region-affinity assignment (ORDER BY CASE WHEN rp.region = $1 THEN 0 ELSE 1 END), but only one node exists. Adding nodes requires: VPS provision, Docker stack copy, UFW config, Cloudflare DNS record, INSERT INTO relay_pool.
  • Per-link jitter, health scores, and link quality trends are deferred to v2 per the per-link telemetry design. Also deferred: GeoIP carrier name lookup improvements, “consider reducing weight” recommendations.
  • C++ test framework (RF-009 backlog) — no automated testing for the OBS plugin. All validation is manual DLL testing in OBS.

Sources

  • RELAY_DEPLOYMENT.md
  • QA_CHECKLIST_RELAY_TELEMETRY.md
  • 2026-03-20-phase3-relay-pool.md
  • 2026-03-22-always-ready-relay-design.md
  • 2026-03-22-always-ready-relay-plan.md
  • 2026-03-22-multi-relay-client-refactor.md
  • 2026-03-18-relay-strategy-analysis.md
  • 2026-03-22-stream-slot-spec.md
  • 2026-03-03-relay-telemetry-design.md
  • 2026-03-03-relay-telemetry-plan.md
  • 2026-03-05-per-user-relay-dns-design.md
  • 2026-03-05-per-user-relay-dns-plan.md
  • 2026-03-06-per-link-relay-telemetry-design.md
  • 2026-03-06-per-link-relay-telemetry-plan.md
  • 2026-03-07-relay-provision-progress-design.md
  • 2026-03-07-relay-provision-progress-plan.md
  • 2026-03-08-elastic-ip-relay-design.md
  • 2026-03-01-relay-timeout-fix-design.md
  • 2026-03-01-relay-timeout-fix-plan.md