Relay Provisioning
Summary
GoLiveBro’s relay provisioning system manages the lifecycle of SRT/SRTLA relay servers that sit between IRL mobile encoders (IRL Pro) and OBS Studio. The system has evolved through several architectural phases: from ephemeral per-user AWS EC2 instances (retired 2026-03-20), to a shared always-on relay pool on budget VPS providers (Phase 3, deployed ~2026-03-20), to an always-ready model where relays auto-provision on connection add and auto-deprovision on remove (AR-0 through AR-3, deployed 2026-03-23 via commit 625699f).
The current architecture uses the PoolProvisioner implementing a Provisioner interface (with Provision() and Deprovision() methods). It assigns sessions to always-on pool nodes registered in a relay_pool table, tracks assignments in relay_assignments, and manages stream IDs via the SLS Management API (port 3000) on each relay. The relay workload is pure UDP packet forwarding, not transcoding, so bandwidth is the primary cost driver. A single relay node can handle 8-11 concurrent IRL streams or 5-7 1080p streams at typical bitrates.
The provisioning pipeline runs through six tracked steps (launching_instance through ready), with DNS slug records per user/slot pointing to the assigned server via Cloudflare DNS-only A records (TTL 60s). Users configure their sender and media source URLs once using their slug hostname, and the system transparently repoints DNS when server assignments change between sessions.
Timeline
- 2026-03-05: Relay E2E telemetry validated via IRL Pro bonded stream to AWS relay. Aggregate stats, IPC round-trip, and dock telemetry path all confirmed working.
- 2026-03-18: Full relay architecture analysis performed. Documented AWS coupling, abstraction seams, BYOR/Hetzner/Hybrid cost comparison. Recommended short-term BYOR (1 week), medium-term Hetzner shared relay (2-3 weeks).
- 2026-03-20: AWS EC2 relay retired. Last instance (us-west-2) terminated. Elastic IP released. Per-user DNS slug records and relay_slug column dropped. Advin Servers chosen as primary VPS provider ($8/mo for 32 TB bandwidth).
- 2026-03-20: Phase 3 relay pool implementation began. `relay_pool` and `relay_assignments` tables created (migration 0013). `PoolProvisioner`, `SLSClient`, and store pool methods implemented.
- 2026-03-21: Global relay mesh design documented (v0.0.6). Identified that `RelayClient::Start()` hardcodes an empty `region_preference` (the region selection path was dead code). Standardized region naming from AWS-style (`us-east-1`) to short names (`us-east`, `us-central`).
- 2026-03-22: Always-ready relay architecture designed (AR-0 through AR-3). Session lease management, auto-provisioning, per-slot DNS slugs, and zero-downtime region migration specified. Multi-relay client refactor planned for per-connection RelayClients.
- 2026-03-22: Stream slot system implemented (commits `b706153`, `c70efe7`, `0e933aa`). `user_stream_slots` table with per-user slot_number and label. API endpoints for list/rename.
- 2026-03-23: Always-ready relay deployed (commit `625699f`). AR-0 through AR-3 plus 18 bug fixes, all deployed to Advin + OBS.
- 2026-04-02: BuyVM LV Slice 1024 purchased ($3.50/mo, Las Vegas). UDP throughput tested at 100 Mbps with 0% packet loss and 0.1 ms jitter; 32.8 ms latency to kc1 over a clean 14-hop route.
Current State
Two relay nodes are operational:
| Node | Provider | Location | Cost/mo | Status |
|---|---|---|---|---|
| US-Central (kc1) | Advin Servers | Kansas City, MO | $8 | Active (production) |
| US-West (lv1) | BuyVM FranTech | Las Vegas, NV | $3.50 | Active (production) |
Total infrastructure cost: $11.50/mo (vs. a $9.99/user subscription price).
Each node runs the relay stack as Docker containers: srtla_rec (custom fork, SRTLA bonded UDP proxy on port 5000), SLS (SRT session handling on ports 4000/4001), the SLS Management UI (port 3000), a Stats API (port 8090), and a per-link Stats API (port 5080, with ASN carrier identification via the IPinfo Lite mmdb).
The always-ready model is deployed: relays provision when a managed connection is added, deprovision when removed, and auto-provision on OBS restart with jittered startup (0-2s delay, max 2 concurrent). Server-side session leases (5-minute expiry, 30s heartbeat interval) prevent crashed OBS clients from leaking pool capacity. A stale session reaper runs every 60 seconds.
Per-slot DNS slugs (8-character base36 via crypto/rand) give each user a stable hostname. DNS records persist across sessions and update automatically when server assignment changes.
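A minimal sketch of slug generation under the stated scheme (8 characters, base36, `crypto/rand`); the function name and alphabet constant are assumptions:

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

const slugAlphabet = "0123456789abcdefghijklmnopqrstuvwxyz" // base36

// newSlug returns an 8-character base36 slug drawn from crypto/rand,
// giving 36^8 (~2.8 trillion) possible values. Each character is an
// unbiased draw via rand.Int, which rejection-samples internally.
func newSlug() (string, error) {
	buf := make([]byte, 8)
	alphabetLen := big.NewInt(int64(len(slugAlphabet)))
	for i := range buf {
		n, err := rand.Int(rand.Reader, alphabetLen)
		if err != nil {
			return "", err
		}
		buf[i] = slugAlphabet[n.Int64()]
	}
	return string(buf), nil
}

func main() {
	s, err := newSlug()
	fmt.Println(len(s), err) // 8 <nil>
}
```

Since the slug is routing only (the stream token remains the SLS credential), guessing a slug yields nothing useful; the space just needs to be large enough to avoid collisions.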
The Provisioner interface is provider-agnostic. The relay_pool table accepts any always-on server running the Docker stack regardless of hosting provider.
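A hypothetical shape for a pool row and region-preferring node selection illustrates why the pool is provider-agnostic: the provider is informational only, and selection keys off region and health. All field and function names here are illustrative, not from the source.

```go
package main

import "fmt"

// PoolNode is a hypothetical shape for a relay_pool row.
type PoolNode struct {
	ID         string
	Provider   string // "advin", "buyvm", ... informational only
	Region     string // short IDs: "us-central", "us-west", ...
	Host       string // public IP/hostname running the Docker stack
	MaxStreams int    // e.g. 8-11 IRL streams per node
	Healthy    bool
}

// pickNode returns the first healthy node in the requested region,
// falling back to any healthy node: a deliberately naive scheduler.
func pickNode(pool []PoolNode, region string) (PoolNode, bool) {
	var fallback *PoolNode
	for i := range pool {
		n := pool[i]
		if !n.Healthy {
			continue
		}
		if n.Region == region {
			return n, true
		}
		if fallback == nil {
			fallback = &pool[i]
		}
	}
	if fallback != nil {
		return *fallback, true
	}
	return PoolNode{}, false
}

func main() {
	pool := []PoolNode{
		{ID: "kc1", Provider: "advin", Region: "us-central", Host: "198.51.100.10", MaxStreams: 11, Healthy: true},
		{ID: "lv1", Provider: "buyvm", Region: "us-west", Host: "198.51.100.20", MaxStreams: 8, Healthy: true},
	}
	n, _ := pickNode(pool, "us-west")
	fmt.Println(n.ID) // lv1
	n, _ = pickNode(pool, "eu-west") // no EU node yet: falls back
	fmt.Println(n.ID) // kc1
}
```

Registering a new provider's server is then just another row, with no code change on the provisioning path.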
Key Decisions
- 2026-03-18: Implement BYOR (Bring Your Own Relay) first to enable a free tier. BYOR users cost essentially zero to serve (no relay provisioning). Estimated 2-3 days of C++ plugin work.
- 2026-03-20: Chose Advin Servers over Hetzner, DigitalOcean, Vultr, and AWS for relay hosting. 32 TB of bandwidth at $8/mo (~$0.25/TB) vs AWS at ~$90/TB. Kansas City location for US-central coverage.
- 2026-03-20: Retired AWS EC2 ephemeral relay model. Key cost driver: Elastic IP idle cost (~$3.60/mo per idle IP, so 100 idle users would run $360/mo in EIP costs alone).
- 2026-03-21: Standardized region names to short human-readable IDs (`us-central`, `us-east`, `us-west`, `eu-west`, `ap-southeast`), replacing AWS-style `us-east-1`. Applied across config, relay_pool, and dock UI.
- 2026-03-22: Adopted the always-ready model over connect/disconnect UX. Rationale: stopping a relay doesn't actually disconnect senders (srtla_rec and SLS don't enforce session state), and bandwidth is zero when no sender is active, so keeping relays provisioned costs nothing.
- 2026-03-22: Chose 8-character base36 slugs for DNS (2.8 trillion possible values). Slug is routing only, never authorization. Stream token remains the SLS credential.
- 2026-04-02: Identified BuyVM as primary expansion provider. Genuinely unmetered bandwidth with 1 Gbps port. Horizontal scaling with 2 GB slices ($7/mo each) is optimal. Stock availability is the main constraint.
Gotchas & Known Issues
- Advin 32 TB hard cap: At 100 Mbps sustained (~32.4 TB/mo), kc1 hits the cap and gets throttled to 1 Mbps until month reset. Safe sustained rate is ~48 Mbps (~15.5 TB/mo). Overage purchasable at $3.50/TB.
- Advin non-KC locations are expensive: All other Advin cities get 1-4 TB bandwidth at same price. Kansas City is uniquely cheap.
- BuyVM stock unavailability: All slices except LV 1GB are out of stock across all locations. Stock tracker at buyvmstock.com. Restocks tend to appear 8-10 AM PST, especially 1st and 7th of month.
- BuyVM bandwidth enforcement: Informal ~25 Mbps per GB of RAM guideline. Throttle to 100 Mbps (not suspension). No warning before throttle. Not in written AUP.
- Region preference dead code (v0.0.6): `RelayClient::Start()` in the C++ plugin hardcodes `region_preference: ""`. The dock captures `managed_region` but never sends it to the API. Fix documented in the global relay mesh design.
- GeoIP deferred: The API caller's IP is the OBS PC (home connection), not the mobile streaming device, so GeoIP on the API caller's IP would infer the wrong location. Manual region selection is the primary mechanism.
- DNS propagation: 60s TTL on slug records. Mobile client DNS caching is a concern for IRL streaming when server assignment changes.
- Per-link telemetry pending: Requires the `srtla_rec` fork to expose per-link metadata (carrier labels, share percentages).
- DDoS protection: Advin has no add-on. BuyVM's protection option runs $429/mo (not viable).
- Netcup Singapore trap: 2 TB/month cap with 5 Mbps throttle. Must never be used for relay.
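The Advin cap arithmetic above is easy to re-derive; a one-function check (decimal TB, 30-day month):

```go
package main

import "fmt"

// monthlyTB converts a sustained rate in Mbps to decimal terabytes per
// 30-day month: Mbps -> bytes/s (x 1e6 / 8), times 2,592,000 seconds.
func monthlyTB(mbps float64) float64 {
	bytesPerSec := mbps * 1e6 / 8
	return bytesPerSec * 30 * 24 * 3600 / 1e12
}

func main() {
	// Reproduces the gotcha's figures: 100 Mbps sustained blows past the
	// 32 TB cap, while ~48 Mbps stays around 15.5 TB/mo.
	fmt.Printf("%.2f\n", monthlyTB(100)) // 32.40
	fmt.Printf("%.2f\n", monthlyTB(48))  // 15.55
}
```

The same function puts BuyVM's informal ~25 Mbps-per-GB-of-RAM guideline in perspective: a 1 GB slice held to ~25 Mbps moves roughly 8 TB/mo.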
Open Questions
- When will BuyVM 2 GB and 4 GB slices restock across NY, LV, and CH locations? The expansion plan depends on availability.
- Should BlackHOST ($11.99/mo, unmetered 1 Gbps dedicated, Chicago/Amsterdam) be deployed as the EU node given no affordable DDoS protection?
- Should Netcup ($5.84/mo, 2 TB/24hr rolling cap, 200 Mbps throttle floor, 2 Tbit/s DDoS included) be tested for EU relay despite 12-month contract lock-in and potential UDP false-positive filtering?
- What is the timeline for wiring the dock `managed_region` through `RelayClient::Start()` to the API?
- When will GeoIP auto-detection become relevant (i.e., when a mobile app calls the API directly or SRTLA source IP analysis is implemented)?
- Should the health monitor's composite `health_score` (process alive 30%, capacity 25%, packet loss 25%, CPU/memory 20%) replace the boolean `health_status`?
Sources
- RELAY_DEPLOYMENT.md
- VPS_Relay_Server_Comparison.md
- SERVER_INFRASTRUCTURE_RESEARCH.md
- QA_CHECKLIST_RELAY_TELEMETRY.md
- 2026-03-18-relay-strategy-analysis.md
- 2026-03-20-phase3-relay-pool.md
- 2026-03-21-v006-global-relay-mesh-design.md
- 2026-03-22-always-ready-relay-design.md
- 2026-03-22-always-ready-relay-plan.md
- 2026-03-22-multi-relay-client-refactor.md