Database

Summary

Telemy v0.0.5 uses PostgreSQL 15+ as its sole persistence layer for the cloud control plane. The schema (DB Schema v1) spans 12 tables covering identity and API access (users, auth_sessions, api_keys, plugin_login_attempts), relay infrastructure lifecycle (relay_instances, sessions, relay_pool, relay_assignments), multi-stream support (user_stream_slots), billing and metering (billing_events, idempotency_records), and Twitch EventSub conduit persistence (eventsub_conduit_state). Primary keys use prefixed text IDs (usr_, ses_, rly_, key_, use_, evt_), timestamps are timestamptz in UTC, monetary amounts are integer cents, and durations are integer seconds.

The schema has evolved through at least 26 migrations. Migrations 0016-0018 added always-ready relay features (lease management, DNS slugs, session migration). Migration 0019 added the billing subsystem with LemonSqueezy webhook integration, grace period enforcement, and subscription reconciliation. Migration 0026 added eventsub_conduit_state to persist Twitch EventSub conduit IDs across server restarts, eliminating a manual chatbot toggle requirement.

The persistence layer is implemented in Go using pgx for database access. Store methods follow a straightforward pattern: query functions return typed results or ErrNotFound, and upserts use INSERT ... ON CONFLICT DO UPDATE for idempotency. Five operational jobs run on the control plane: idempotency TTL cleanup, session usage rollup, outage reconciliation, grace period enforcement (every 5 minutes), and subscription reconciliation (every 1 hour, stub pending production products).

Timeline

  • Pre-2026-03: Schema v1 established with core tables: users, auth_sessions, api_keys, plugin_login_attempts, relay_instances, sessions, idempotency_records.
  • Migration 0016 (AR-0): Added lease_expires_at and last_heartbeat_at to sessions for relay lease management.
  • Migration 0017 (AR-2): Added dns_slug to user_stream_slots for DNS-based routing for always-ready relays.
  • Migration 0018 (AR-3): Added supersedes_session_id and migration_status to sessions for seamless session migration.
  • Migration 0019: Added billing_events table and billing columns to users (lemon_customer_id, lemon_subscription_id, lemon_addon_subscription_id, plan_grace_deadline).
  • 2026-03-27: Conduit persistence design approved. Migration 0026 added eventsub_conduit_state table to persist Twitch EventSub conduit IDs across API restarts.

Current State

The schema has 12 tables across 26+ migrations. The store layer exposes typed Go methods via a Store interface used by the API router, with a mockStore for testing.

Key operational constraints:

  • One active session per user, enforced by a partial unique index on sessions.
  • Relay pool assignment uses FOR UPDATE SKIP LOCKED on relay_pool to pick the least-loaded healthy server without contention.
  • Idempotent relay start: idempotency_records stores request hashes and canonical responses; replayed requests while a session is active reconstruct the response from current state.
  • Billing idempotency: billing_events.lemon_event_id uses ON CONFLICT DO NOTHING to deduplicate LemonSqueezy webhooks.
  • Grace period enforcement runs every 5 minutes, downgrading users with expired plan_grace_deadline to free tier and force-disconnecting active relays.
  • Subscription reconciliation runs hourly but is currently a stub pending production LemonSqueezy products.
  • Conduit persistence is implemented and deployed: on startup the control plane loads the persisted conduit ID, seeds it into EventSubManager, verifies/updates the Twitch shard, and persists back. Failures are non-fatal.

Plan tiers are free, standard, and internal. Plan statuses are active, past_due, canceled, and trial. Session statuses follow the state machine: provisioning -> active -> grace -> stopped (with grace -> active recovery allowed).

Key Decisions

  • Schema v1: Prefixed text primary keys (usr_, ses_, etc.) over auto-increment integers — provides human-readable IDs and cross-system traceability.
  • Schema v1: Integer cents for money, integer seconds for durations — avoids floating-point precision issues in billing.
  • Migration 0019: LemonSqueezy webhook events stored with full jsonb payload for audit trail, even unmatched events (user_id=NULL) retained for manual review.
  • Migration 0019: Grace period set to 48 hours on payment failure (plan_grace_deadline = NOW() + 48 hours) before automatic downgrade.
  • 2026-03-27: Conduit persistence via single-row-per-provider table rather than a generic key-value store — keeps schema explicit and queryable. Only twitch provider for now.
  • 2026-03-27: Conduit restoration failures are non-fatal at startup — server starts regardless, chatbot becomes functional on next manual toggle or retry.
  • Schema v1: Deprecated eip_allocation_id and eip_public_ip columns retained in users for archive reference after migration to PoolProvisioner.

Experiments & Results

ExperimentStatusFindingSource
AWS EIP per-user allocationDeprecatedReplaced by relay_pool + relay_assignments (PoolProvisioner). Legacy columns retained in users table.DB_SCHEMA_v1.md
In-memory conduit ID storageReplacedConduit ID lost on API restart, requiring manual chatbot toggle off/on. Replaced by DB persistence in migration 0026.conduit-persistence-design.md
Subscription reconciliation jobStubHourly job to poll LemonSqueezy API implemented as stub, blocked on production products being set up in LemonSqueezy.DB_SCHEMA_v1.md

Gotchas & Known Issues

  • eip_allocation_id / eip_public_ip on users are deprecated and unused by PoolProvisioner. Do not populate these columns in new code.
  • addon_relay_count is capped at 3 for non-internal tiers. The internal tier is exempt from this cap.
  • reconcile_subscriptions requires TELEMY_LEMON_API_KEY environment variable and is currently a stub.
  • Conduit persistence only covers the twitch provider. If additional EventSub providers are added, each gets its own row keyed by provider.
  • plugin_login_attempts completed attempts are not reusable — the plugin must stop polling after the first 200 completed response. A claimed attempt cannot be re-claimed.
  • Session state transitions are enforced in the service layer, not via database triggers. Invalid transitions (e.g., stopped -> active) must be caught in application code.
  • Outage reconciliation compares backend-calculated duration against relay health event session_uptime_seconds. A positive gap results in billing adjustments — this relies on relay health events being accurately reported.

Open Questions

  • When will LemonSqueezy production products be set up, unblocking the reconcile_subscriptions job?
  • Should deprecated EIP columns (eip_allocation_id, eip_public_ip) be dropped in a future migration or kept indefinitely for archive reference?
  • Will additional EventSub providers beyond twitch need conduit persistence, and if so, does the single-row-per-provider model scale appropriately?
  • Are there plans for a migration to clean up or archive old billing_events and idempotency_records rows beyond TTL cleanup?

Sources

  • DB_SCHEMA_v1.md
  • 2026-03-27-conduit-persistence-design.md
  • 2026-03-27-conduit-persistence.md