On Thursday, two companies dropped open-weight speech models on the same day — covering opposite ends of the voice pipeline — and the combined result is that building a fully open-source voice agent just got meaningfully more viable.
Cohere Transcribe handles the ASR side: 2B parameters, Conformer encoder feeding a lightweight Transformer decoder (>90% of parameters in the encoder, following the Distil-Whisper pattern of front-loading compute to minimize autoregressive overhead), trained on 500K hours of curated audio. As of release it sits at #1 on the HuggingFace Open ASR Leaderboard with a 5.42% average WER across AMI, GigaSpeech, LibriSpeech, TED-LIUM, and others. That beats Whisper Large v3 (7.44%), ElevenLabs Scribe v2 (5.83%), and the recent Qwen3-ASR-1.7B (5.76%). Fourteen languages. Apache 2.0 — no commercial restrictions, no access gates. The weights are on HuggingFace now.
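For context on what the leaderboard number means: WER is word-level edit distance divided by reference length, averaged across datasets. A minimal sketch of the metric (not Cohere's or the leaderboard's evaluation code; the per-dataset numbers below are made up for illustration):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edits to turn the first i ref words into the first j hyp words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution or match
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # vs deletion/insertion
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Leaderboard-style average: unweighted mean of per-dataset WERs (hypothetical values).
per_dataset = {"ami": 0.091, "gigaspeech": 0.062, "librispeech_clean": 0.018}
average = sum(per_dataset.values()) / len(per_dataset)
```

The leaderboard's 5.42% figure is this kind of unweighted average over its test sets, which is why a model can lead overall while trailing on an individual dataset.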
Mistral Voxtral TTS covers the synthesis side: 4B parameters, voice cloning from three seconds of reference audio, 70ms latency in a typical production setup (10-second audio sample, 500 characters). Nine languages, including Hindi and Arabic. In human evaluations Mistral claims it outperforms ElevenLabs Flash v2.5 on naturalness and matches ElevenLabs v3 quality. The open weights land on HuggingFace under CC BY-NC 4.0 — non-commercial only; commercial use requires the API at $0.016 per 1,000 characters. The license gap between Cohere (Apache 2.0) and Mistral (CC BY-NC) is worth noting: Cohere’s model is genuinely free to build on commercially; Mistral’s weights are useful for evaluation, self-hosting for internal tools, and auditing, but commercial products need to go through the API.
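At that per-character price, API costs are easy to estimate. A quick sketch, with the workload numbers below being hypothetical:

```python
PRICE_PER_1K_CHARS = 0.016  # Mistral's listed Voxtral TTS API price, USD

def tts_cost(characters: int) -> float:
    """API cost in USD for synthesizing a given number of characters."""
    return characters * PRICE_PER_1K_CHARS / 1000

# Hypothetical voice-agent workload: ~2,000 spoken characters per session,
# 10,000 sessions per month -> 20M characters.
monthly = tts_cost(2_000 * 10_000)  # ≈ $320/month
```

At that scale the API is cheap enough that the CC BY-NC restriction on the weights mostly matters for teams with strict data-residency or offline requirements, not for cost reasons.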
Mistral has been assembling a complete speech pipeline: their earlier Voxtral Transcribe handles ASR input, language models handle reasoning, and now Voxtral TTS handles voice output. The pitch to enterprises is an end-to-end voice stack with no third-party dependencies — everything from audio in to audio out running under one vendor or, for the non-commercial pieces, on your own hardware.
The quality gap between open and proprietary voice has been real for two years. Whisper remains good but stale; open TTS options have been clearly behind ElevenLabs in naturalness. If Voxtral TTS’s claims hold up under independent testing, that gap closes substantially. The Cohere ASR numbers are already independently verifiable via the leaderboard. The two releases landing on the same day feels less like a coincidence than like the moment when two teams, both quietly building toward the same gap in the open ecosystem, happened to ship at once.
On a different front: Meta FAIR, together with researchers from UBC, Edinburgh, NYU, and Vector Institute, published HyperAgents — an approach to recursive self-improvement that removes the core limitation of previous systems like the Darwin Gödel Machine.
DGM showed that an agent could improve itself by editing its own code, but the meta-level mechanism — the part that decides how to make improvements — was fixed and handcrafted. HyperAgents unifies the task agent and the meta agent into a single editable Python program. The system can now modify not just its task-solving behavior but the procedure that generates future modifications — what the paper calls metacognitive self-modification.
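The structural difference can be shown with a toy. In the sketch below — my illustration, not the paper's code — the whole agent is a dict of source strings, and the meta step (`improve`) edits the task solver. Because `improve` itself lives in the same editable dict, nothing stops a future edit from rewriting the improvement procedure too, which is the distinction from a DGM-style fixed meta level:

```python
# Toy sketch of a single editable program containing both levels (not the paper's code).
program = {
    "solve": "def solve(x):\n    return x + 1\n",
    "improve": (
        "def improve(program):\n"
        "    # Meta step: rewrite the task solver's source. Since 'improve'\n"
        "    # is itself an entry in the same dict, it is equally editable.\n"
        "    program['solve'] = 'def solve(x):\\n    return x * 2\\n'\n"
        "    return program\n"
    ),
}

def run(program, name, *args):
    """Execute one named component of the agent from its source string."""
    ns = {}
    exec(program[name], ns)
    return ns[name](*args)

before = run(program, "solve", 10)          # task behavior before the meta step
program = run(program, "improve", program)  # meta step edits the solver in place
after = run(program, "solve", 10)           # task behavior after the meta step
```

In a handcrafted DGM, `improve` would be frozen outside the editable program; here it is just another string the system can rewrite.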
The empirical results are specific enough to take seriously. On academic peer review evaluation, classical agents scored 0.0; HyperAgents scored 0.710, surpassing AI-Scientist-v2. In a transfer experiment, hyperagents trained on paper review and robotics reward design were dropped into Olympiad-level math grading without retraining. Handcrafted DGM agents scored 0.0 in the new domain; the transferred hyperagents scored 0.630. The self-improvement strategies were genuinely transferable.
The caveat is that these are researcher-chosen benchmarks in a controlled setting, and “metacognitive self-modification” is a framing that could outrun the actual mechanism. But the transfer result in particular — a self-improvement strategy learned on robotics carrying over to math — is not something you can easily attribute to chance or benchmark gaming. ICLR 2026 acceptance provides some peer scrutiny. The code is at facebookresearch/HyperAgents.
One note following yesterday’s piece on ARC-AGI-3: Symbolica published that their Agentica SDK reached 36.08% on the benchmark on day 1, up from the 12.58% best score reported in the preview phase. Chain-of-thought approaches (Opus 4.6, GPT-5.4) scored under 0.5%. The agentic loop cost $1,005 in API calls; CoT approaches cost $8,900 for far worse results. The pattern reinforces the benchmark’s core design: ARC-AGI-3 specifically rewards state-space exploration over text prediction at scale, and the comparison between the two approaches is not close.
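The cost gap looks even starker normalized per point of score. Using only the figures above (and treating "under 0.5%" as an upper bound for the CoT side, so its cost-per-point is a lower bound):

```python
agentic = {"score": 36.08, "cost": 1005}  # Agentica SDK: percent score, USD spent
cot = {"score": 0.5, "cost": 8900}        # CoT: 0.5 is an upper bound on score

dollars_per_point_agentic = agentic["cost"] / agentic["score"]  # ≈ $28 per point
dollars_per_point_cot = cot["cost"] / cot["score"]              # ≥ $17,800 per point
```

Roughly three orders of magnitude, even under the most generous reading of the CoT results.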
