API

Summary

The Telemy Control Plane API v1 is a REST API at base path /api/v1 serving the v0.0.5 native C++ OBS plugin. It handles authentication (browser-based plugin login with JWT + refresh token rotation), relay lifecycle management (start, stop, active polling), billing integration via LemonSqueezy webhooks, usage tracking under a Time Bank model, and capacity status for the public website. Transport is HTTPS-only (TLS 1.2+), with Cloudflare terminating TLS upstream and proxying to the Go API server on port 8080. The server itself runs plain HTTP behind Cloudflare.

The relay session state machine has four backend states — provisioning, active, grace, stopped — with five valid transitions enforced server-side (invalid transitions return 409). A separate client-side state machine governs scene switching (Phase 5b, not yet implemented) with six top-level modes (STUDIO, IRL_CONNECTING, IRL_ACTIVE, IRL_GRACE, DEGRADED, FATAL) and orthogonal scene intents (LIVE, BRB, OFFLINE, HOLD). As of v0.0.5, relay lifecycle uses the always-ready model (AR-0 through AR-3): relays provision on managed connection add, deprovision on remove, and auto-provision on OBS load.
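
The server-side transition check described above can be pictured as a lookup table. The following is a minimal sketch, not the server's actual code: the four state names come from this document, but the five transitions in validTransitions are an assumption made for illustration (STATE_MACHINE_v1.md holds the authoritative list), and the names State, validTransitions, and CheckTransition are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
)

// State is one of the four backend relay session states named in this document.
type State string

const (
	Provisioning State = "provisioning"
	Active       State = "active"
	Grace        State = "grace"
	Stopped      State = "stopped"
)

// validTransitions is an ASSUMED set of five transitions, for illustration only.
// The authoritative list lives in STATE_MACHINE_v1.md.
var validTransitions = map[State][]State{
	Provisioning: {Active, Stopped},
	Active:       {Grace},
	Grace:        {Active, Stopped},
}

// CheckTransition returns http.StatusOK when from -> to is in the table
// and http.StatusConflict (409) otherwise.
func CheckTransition(from, to State) int {
	for _, next := range validTransitions[from] {
		if next == to {
			return http.StatusOK
		}
	}
	return http.StatusConflict
}

func main() {
	fmt.Println(CheckTransition(Provisioning, Active)) // allowed by the assumed table
	fmt.Println(CheckTransition(Stopped, Active))      // rejected with 409
}
```

Keeping the table as data rather than branching logic makes it straightforward to compare against the spec when the transition list changes.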

On 2026-03-20 the control plane was migrated from AWS EC2 (52.13.2.122) to an Advin VPS (208.84.101.84, ~$8/mo) running PostgreSQL 16 and both Go binaries (telemy-api, telemy-jobs) in Docker Compose. The Cloudflare A record for api.telemyapp.com was updated to point at Advin, with UFW rules restricting port 8080 to Cloudflare IP ranges only. Relay EC2 provisioning remains on AWS; only the API and database moved.

Timeline

  • 2026-02-21: API Spec v1 authored covering auth, relay lifecycle, idempotency, error contract, and rate limits.
  • 2026-03-16: Plugin login flow spec finalized (browser-based attempt + poll model with login_attempt_id and poll_token).
  • 2026-03-20: Migration plan created to move control plane from EC2 to Advin VPS. Eight-phase plan: preparation, PostgreSQL Docker setup, DB migration, deploy, UFW, parallel verification, DNS cutover, EC2 teardown.
  • 2026-03-22: Always-ready relay model (AR-0 through AR-3) designed, replacing manual start/stop with automatic provision-on-add.
  • 2026-03-23: Always-ready relay model deployed. Scene-switching state machine (Phase 5b) deferred.
  • Phase 5a (billing): LemonSqueezy billing endpoints (/billing/checkout, /billing/webhook) deployed and operational. Subscription lifecycle events mapped to 7 action handlers (activate, update addon, payment recovery, payment failed, cancelled, downgrade, refund).

Current State

The API server runs on the Advin VPS (208.84.101.84) in Docker Compose alongside PostgreSQL 16. Cloudflare proxies api.telemyapp.com (orange cloud, SSL Flexible) with an Origin Rule rewriting to port 8080. UFW on Advin restricts port 8080 to Cloudflare IPv4 ranges.

Ten endpoints are live under /api/v1:

  1. GET /auth/session — current user, entitlement, usage, active relay
  2. POST /auth/plugin/login/start — initiate browser-based plugin login
  3. POST /auth/plugin/login/poll — poll login attempt status
  4. POST /auth/refresh — rotate refresh token and JWT
  5. POST /auth/logout — revoke auth session
  6. POST /relay/start — start or return active relay (idempotent, requires Idempotency-Key)
  7. GET /relay/active — get active/provisioning session
  8. POST /relay/stop — idempotently stop a relay session
  9. GET /relay/manifest — launchable regions and AMI metadata
  10. GET /capacity/status — public unauthenticated capacity check
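
As an illustration of endpoint 6, the sketch below builds a POST /relay/start request carrying a fresh UUIDv4 in the Idempotency-Key header. It rests on assumptions: the access JWT is sent as a standard Authorization: Bearer header, any request body is omitted because this document does not describe one, and the helper names newUUIDv4 and NewRelayStartRequest are hypothetical.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"net/http"
)

// newUUIDv4 returns a random version 4 UUID in canonical 8-4-4-4-12 form.
func newUUIDv4() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version to 4
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant bits
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// NewRelayStartRequest builds (but does not send) a POST /api/v1/relay/start request.
// ASSUMPTION: bearer auth via the Authorization header; no request body is shown.
func NewRelayStartRequest(baseURL, accessJWT, idempotencyKey string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/v1/relay/start", nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+accessJWT)
	req.Header.Set("Idempotency-Key", idempotencyKey)
	return req, nil
}

func main() {
	key, err := newUUIDv4()
	if err != nil {
		panic(err)
	}
	// "<cp_access_jwt>" is a placeholder, not a real token.
	req, err := NewRelayStartRequest("https://api.telemyapp.com", "<cp_access_jwt>", key)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path, req.Header.Get("Idempotency-Key"))
}
```

A client that retries after a timeout would reuse the same key for the retry, so the backend can recognize the replay.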

Billing endpoints: POST /billing/checkout (generate LemonSqueezy checkout URL) and POST /billing/webhook (HMAC-SHA256 verified event ingestion).
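
The HMAC-SHA256 check on /billing/webhook might look roughly like the sketch below. It assumes the signature arrives as a hex-encoded digest of the raw request body; how the server extracts that value from the request is not covered by this document, and the function names VerifyWebhookSignature and sign, along with the example secret and payload, are hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// VerifyWebhookSignature reports whether signatureHex is the hex-encoded
// HMAC-SHA256 of body under secret. ASSUMPTION: hex encoding of the digest.
func VerifyWebhookSignature(secret, body []byte, signatureHex string) bool {
	want, err := hex.DecodeString(signatureHex)
	if err != nil {
		return false // malformed signature
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	// hmac.Equal compares in constant time to avoid timing leaks.
	return hmac.Equal(mac.Sum(nil), want)
}

// sign produces a matching signature; used here only to drive the demo.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	secret := []byte("example-secret")       // placeholder secret
	body := []byte(`{"example":"payload"}`) // placeholder payload
	sig := sign(secret, body)

	fmt.Println(VerifyWebhookSignature(secret, body, sig))               // true
	fmt.Println(VerifyWebhookSignature(secret, []byte("tampered"), sig)) // false
}
```

The digest must be computed over the raw bytes as received, before any JSON decoding, since re-serialized JSON rarely matches byte for byte.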

Usage endpoint: GET /usage/current (Time Bank cycle data). Relay health: POST /relay/health (relay-to-backend liveness, authenticated via X-Relay-Auth shared secret with session_id+instance_id binding).

Per-link relay telemetry bypasses the control plane entirely — the C++ plugin polls the relay’s srtla stats server directly at relay_ip:5080/stats every ~2 seconds. Per-output multi-encode telemetry is collected via OBS C API, not the control plane.

The client-side scene-switching state machine (6 modes, 10 transitions, scene intent rules, reconnect-first startup) is specified but not yet implemented (Phase 5b).

Key Decisions

  • 2026-02-21: JWT-only auth for control plane — cp_access_jwt as short-lived bearer, refresh_token for rotation, both stored in DPAPI-encrypted vault. Relay activation entitlement enforced server-side regardless of UI state.
  • 2026-02-21: Idempotency via Idempotency-Key (UUIDv4) with 1-hour retention and async replay — replaying /relay/start reconstructs the current DB state rather than returning stale cached responses.
  • 2026-02-21: Per-link telemetry architecture decision: direct HTTP polling from plugin to relay’s srtla stats server (port 5080), bypassing the control plane entirely. Chosen for latency (2s polling) and simplicity.
  • 2026-03-20: Migrate control plane from EC2 to the Advin VPS (~$8/mo), for net savings of ~$15-20/mo. Go binaries and PostgreSQL 16 run in Docker Compose. AWS still used for relay EC2 provisioning only.
  • 2026-03-20: UFW on Advin restricted to Cloudflare IP ranges for port 8080 — defense-in-depth since Cloudflare terminates TLS.
  • 2026-03-23: Always-ready relay model adopted — relays provision on add, deprovision on remove. /relay/start and /relay/stop now called internally by the plugin, not via user-initiated buttons.
  • 2026-03-23: Scene-switching state machine deferred to Phase 5b. Relay lifecycle governed by always-ready model (AR-0 through AR-3), not the state machine.

Experiments & Results

| Experiment | Status | Finding | Source |
| --- | --- | --- | --- |
| Per-link relay telemetry via direct HTTP polling (port 5080) | Implemented (v0.0.4) | Bypassing control plane works well; 2s poll interval, ASN-based carrier labels via GeoLite2-ASN.mmdb | API_SPEC_v1.md sec 13 |
| Per-output multi-encode telemetry via OBS C API | Implemented (v0.0.4/v0.0.5) | obs_enum_outputs gives per-encoder stats (bitrate, drop%, FPS, lag), grouped by encoder in dock UI | API_SPEC_v1.md sec 14 |
| EC2-to-Advin migration parallel verification | Planned | Direct IP smoke test (curl -H "Host: api.telemyapp.com" http://208.84.101.84:8080/health) before DNS cutover to validate without risk | 2026-03-20-api-migration-advin.md Phase 5 |
| Docker Compose for control plane (postgres + api + jobs) | Deployed | Single docker-compose.yml with health checks, secrets via file mount, env via .env file | 2026-03-20-api-migration-advin.md Phase 1 |

Gotchas & Known Issues

  • Error contract incomplete: request_id and structured details fields in error responses are not currently populated by the Go server — only error.code and error.message are returned.
  • Stats endpoint unauthenticated: TCP 5080 on the relay has no authentication. The relay security group must allow OBS machine access, but anyone who can reach the port can read per-link stats.
  • Idempotency key window: Backend stores key mappings for only 1 hour. Clients replaying after that window get a fresh response, not idempotent replay.
  • TLS assumption: The Go API server assumes TLS termination happens upstream. Clients must not infer TLS from port numbers (e.g., custom ports like 8443 still require explicit TLS).
  • CGO static linking: The Docker Compose setup assumes CGO_ENABLED=0 for static binaries. If dynamically linked, the base image must change from scratch/alpine to ubuntu:24.04.
  • Database name legacy: The database and user are both named aegis (legacy naming from before the Telemy rebrand). Migration restores with --no-owner --role=aegis.
  • Plugin login flow is operator-assisted: The authorize_url currently lands on a temporary operator-assisted Cloudflare Pages flow at telemyapp.com/login/plugin?attempt=..., not a self-service user login page.
  • Rate limits are per-user: /relay/start is 6/min, /relay/stop is 20/min, /relay/active is 60/min, /usage/current is 30/min. No global rate limits documented.
  • Scene-switching state machine not implemented: The full 6-mode state machine with scene intents, guard conditions, and hysteresis is specified but deferred to Phase 5b.

Open Questions

  • When will the plugin login flow transition from operator-assisted to fully self-service?
  • Should the relay stats endpoint (port 5080) get authentication, or is security-group restriction sufficient?
  • What is the timeline for Phase 5b scene-switching state machine implementation?
  • Should request_id and details be added to the error contract, or is the current minimal contract acceptable for v1?
  • Is a database backup cron job running on Advin, or does it still need to be set up (mentioned in migration plan but not confirmed)?
  • Has the EC2 instance (52.13.2.122) been terminated, or is it being kept as a warm standby? The plan specifies a 24-hour wait before termination.
  • Should rate limits be adjusted for the always-ready model where /relay/start is called automatically (not user-initiated)?

Sources

  • API_SPEC_v1.md
  • STATE_MACHINE_v1.md
  • 2026-03-20-api-migration-advin.md