Voice Substrate
Summary
The voice substrate is the accumulated understanding of how a person writes, compiled from all analyzed samples over time. It is not a profile. It is the persistent layer from which bounded runtime .toneprofile cards are generated on demand. The architecture originates from Brandon Metcalf’s voice system analysis, specifically Option 3/4 (Compiled Voice Layer / Hybrid): raw evidence -> maintained voice substrate -> bounded runtime style card -> tool-specific adapters.
v1 implements the bottom and top of this pipeline: reader.py handles evidence ingestion; the .toneprofile format and injector.py handle the runtime card and adapters. The middle layer (the substrate compilation pipeline) is designed into the schema and API surface but not yet built. The database tables writing_evidence and voice_substrate exist in PostgreSQL and are currently empty. Implementation is gated on v2 research plan approval (pending as of 2026-04-13).
The substrate structure has three sections: core_voice (dimensions consistent across all contexts, weighted by evidence strength and recency), context_modes (style clusters auto-discovered from StyleDistance embeddings, replacing the v1 hardcoded five-mode system), and evidence_index (metadata only: source, word count, timestamp, confidence scores, no raw text). The compiler resolves conflicts between Track 1 (spaCy, authoritative on quantitative dimensions) and Track 2 (LLM, authoritative on qualitative dimensions), applies a 90-day recency half-life with logarithmic length scaling, and tracks drift against the substrate centroid for staleness detection.
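The three-section structure above can be sketched as data. This is a minimal illustration, not the actual schema: the three top-level keys come from the design, but every field name and value inside them is hypothetical.

```python
# Illustrative shape of a compiled voice substrate. Only the three
# top-level sections (core_voice, context_modes, evidence_index) are
# from the design; all inner fields and values are made up.
import json

substrate = {
    "core_voice": {
        # Dimensions consistent across contexts, weighted by
        # evidence strength and recency.
        "avg_sentence_length": {"value": 17.2, "confidence": 0.91},
        "formality": {"value": 0.38, "confidence": 0.74},
    },
    "context_modes": [
        # Style clusters auto-discovered from StyleDistance embeddings.
        {"label": "casual-chat", "centroid": [0.12, -0.08], "n_samples": 41},
        {"label": "work-email", "centroid": [0.31, 0.22], "n_samples": 17},
    ],
    "evidence_index": [
        # Metadata only -- no raw text is retained.
        {"source": "email", "word_count": 212,
         "timestamp": "2026-04-02T10:14:00Z", "confidence": 0.8},
    ],
}

compact = json.dumps(substrate)  # the real substrate is ~2-5KB of JSON
```

The key property the sketch preserves is that `evidence_index` carries provenance and confidence but never the writing itself.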
Timeline
- 2026-04-11: Monetization design doc establishes the four-layer pipeline architecture. Database schema defined with writing_evidence and voice_substrate tables marked as v2 placeholders. Schema created in PostgreSQL on Advin VPS. Evidence ingestion and compilation endpoints (/v1/evidence/ingest, /v1/evidence/compile) designated out of scope for v1 launch.
- 2026-04-13: v2 research plan (Appendix B) specifies full compiler design: conflict resolution hierarchy, evidence weighting algorithm, context auto-discovery via StyleDistance clustering, staleness detection. Five open questions identified for Brandon Metcalf review.
Current State
Database tables exist. Code is not built. The substrate is a sponsor perk, not a free-tier feature. Free-tier users get one-shot .toneprofile cards with no persistence.
The runtime pipeline is fully specified: substrate (~2-5KB JSON) feeds context selection, which feeds the profile compiler, which outputs a .toneprofile (~150-250 tokens YAML). Downstream tools and injectors are unchanged; they see only the .toneprofile. The substrate is opaque to them.
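The runtime path can be sketched end to end. The function names, dimension names, and card layout below are hypothetical stand-ins for the real profile compiler; only the pipeline shape (substrate, context selection, compiler, bounded card) is from the design.

```python
# Hypothetical sketch of the runtime pipeline: substrate -> context
# selection -> profile compiler -> .toneprofile card. Downstream
# injectors would see only the card, never the substrate.

def select_context(substrate: dict, hint: str) -> dict:
    """Pick the context mode whose label matches the caller's hint,
    falling back to the first discovered mode."""
    modes = substrate["context_modes"]
    return next((m for m in modes if m["label"] == hint), modes[0])

def compile_card(substrate: dict, hint: str) -> str:
    """Render a small YAML-style card from the substrate's core voice
    plus the selected context mode (illustrative format)."""
    mode = select_context(substrate, hint)
    lines = [f"context: {mode['label']}"]
    for dim, stats in substrate["core_voice"].items():
        lines.append(f"{dim}: {stats['value']}  # conf {stats['confidence']}")
    return "\n".join(lines)

substrate = {
    "core_voice": {"formality": {"value": 0.4, "confidence": 0.8}},
    "context_modes": [{"label": "work-email"}, {"label": "casual-chat"}],
}
card = compile_card(substrate, "casual-chat")
```

The point of the sketch is the boundary: everything below `compile_card` sees a flat, bounded card and nothing else.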
The substrate is not a raw text copy, not a model, and not required for ToneForge to function. It is structured data: feature vectors, StyleDistance embeddings, LLM assessments, confidence weights.
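Since the substrate stores StyleDistance embeddings, the staleness detection described in the summary (tracking drift against the substrate centroid) can be sketched with plain cosine distance. The function shape and the 0.3 threshold here are assumptions, not specified values.

```python
# Sketch of drift-based staleness detection over stored embeddings.
# The 0.3 threshold is illustrative; the design does not fix a value.
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 minus cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def is_stale(centroid: list[float], recent: list[list[float]],
             threshold: float = 0.3) -> bool:
    """Flag the substrate as stale when the mean of recent evidence
    embeddings has drifted past the threshold from the stored centroid."""
    mean = [sum(v) / len(recent) for v in zip(*recent)]
    return cosine_distance(centroid, mean) > threshold
```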
Key Decisions
- 2026-04-11: Middle layer (substrate) designed into schema but deferred to v2 — validate demand before building compilation infrastructure.
- 2026-04-13: Context discovery via StyleDistance embedding clusters, not hardcoded — replaces the five static contexts with auto-discovered modes that match how the user actually writes.
- 2026-04-13: Track 2 overrides Track 1 on qualitative dimensions — LLM assessment of formality, warmth, confidence supersedes spaCy thresholds. Track 1 authoritative on sentence length, vocabulary stats, punctuation rates.
- 2026-04-13: Substrate designated as sponsor perk — free tier gets one-shot profiles, no accumulation, no evolution.
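The Track 1 / Track 2 split above can be sketched as a per-dimension merge: authority decides, nothing is averaged. The dimension lists and dict structure are illustrative examples, not the full set.

```python
# Sketch of the conflict-resolution decision: Track 1 (spaCy) wins on
# quantitative dimensions, Track 2 (LLM) wins on qualitative ones.
# The dimension names below are examples only.

QUANTITATIVE = {"avg_sentence_length", "vocab_richness", "punct_rate"}
QUALITATIVE = {"formality", "warmth", "confidence"}

def merge_tracks(track1: dict, track2: dict) -> dict:
    """Resolve per-dimension conflicts by authority, not by averaging."""
    merged = {}
    for dim in track1.keys() | track2.keys():
        if dim in QUALITATIVE and dim in track2:
            merged[dim] = track2[dim]    # LLM assessment supersedes
        elif dim in QUANTITATIVE and dim in track1:
            merged[dim] = track1[dim]    # spaCy stats are authoritative
        else:
            merged[dim] = track1.get(dim, track2.get(dim))
    return merged

resolved = merge_tracks(
    {"avg_sentence_length": 17.2, "formality": 0.9},  # Track 1 (spaCy)
    {"formality": 0.4, "warmth": 0.7},                # Track 2 (LLM)
)
```

Note that the LLM's formality value wins even when spaCy also produced one; the first open question below is whether to keep that override or store both signals.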
Open Questions
- Conflict resolution hierarchy: Track 2 overrides Track 1 on qualitative dimensions. Should the substrate maintain both signals and let the profile compiler decide per context instead?
- Context discovery: auto-discovered clusters from embeddings vs. user-defined contexts vs. both. Auto-discovery is more flexible but less predictable.
- 90-day evidence half-life: right number? Should it be configurable? Should the system decay at all, or treat all evidence equally and let drift detection handle evolution?
- Substrate portability: if a sponsor stops sponsoring, should there be a “compile and export” that produces a final .toneprofile from the substrate as an exit ramp?
- Compiler location: server-side (simpler, we control logic) vs. client-side (privacy-first, but exposes the algorithm in open source CLI).
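The 90-day half-life under debate can be made concrete. The exponential recency decay is from the design; the exact form of the logarithmic length scaling, and the 500-word saturation point, are assumptions made for illustration.

```python
# Sketch of per-evidence weighting: exponential recency decay with the
# stated 90-day half-life, times a logarithmic length factor. The
# ref_words saturation point (500) is an illustrative choice.
import math

def evidence_weight(age_days: float, word_count: int,
                    half_life_days: float = 90.0,
                    ref_words: int = 500) -> float:
    """Weight one piece of evidence for substrate compilation."""
    recency = 0.5 ** (age_days / half_life_days)
    length = min(1.0, math.log1p(word_count) / math.log1p(ref_words))
    return recency * length
```

Making `half_life_days` a parameter is deliberate: it keeps the "should it be configurable?" question open without changing the compiler's shape.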