
High-Quality Text-to-Speech (TTS) Technology

This page introduces background technologies that help explain BrainCheck's TTS features. It focuses on recent speech synthesis work such as CALM and Pocket TTS. This does not mean BrainCheck uses these models directly.


Limits of Discrete-Token Speech Synthesis

One common approach in modern speech synthesis converts audio into a sequence of discrete tokens, then uses a language model to generate those tokens.

This creates a fundamental tradeoff. Audio tokens are extracted through a lossy codec, so higher audio quality usually requires generating more tokens. More tokens increase compute cost and latency. In other words, the system must compromise between quality and speed.
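This tradeoff can be made concrete with a back-of-the-envelope sketch. The frame rate and codebook counts below are illustrative assumptions, not figures from any specific codec:

```python
# Hypothetical illustration of the discrete-token tradeoff:
# a neural codec emits `frame_rate_hz` frames per second of audio,
# and each frame carries `n_codebooks` tokens. All numbers are
# illustrative assumptions, not measurements of a real codec.

def tokens_per_second(frame_rate_hz: float, n_codebooks: int) -> float:
    """Tokens the language model must generate per second of audio."""
    return frame_rate_hz * n_codebooks

# Low-bitrate setting: fewer tokens to generate, lower audio fidelity.
low = tokens_per_second(12.5, 4)    # 50 tokens per second of audio

# Higher-bitrate setting: better fidelity, but 4x the tokens,
# and therefore roughly 4x the generation cost and latency.
high = tokens_per_second(12.5, 16)  # 200 tokens per second of audio
```

Because the language model pays per token, raising audio quality by raising the codec bitrate directly multiplies the compute needed per second of generated speech.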


Continuous Audio Language Models (CALM)

Kyutai Labs proposed CALM (Continuous Audio Language Models) as one way to address this problem.

Core Idea

Instead of turning audio into discrete tokens, CALM directly generates continuous audio representations.

  1. A large neural network (Transformer) reads the text context and creates a summary at each step.
  2. A smaller network (MLP) uses that summary to predict the next piece of sound.
  3. The model is trained so the sound pieces connect naturally and consistently.

Because this avoids the bitrate bottleneck of discrete tokenization, the paper reports that CALM can generate speech with less computation at comparable quality.

Discrete vs Continuous Methods

| Aspect | Discrete-token method | CALM (continuous method) |
| --- | --- | --- |
| Audio representation | Lossy compressed discrete tokens | Continuous latent vectors (VAE-based) |
| How quality improves | Generate more tokens, increasing cost | Avoids the bitrate limit of discrete tokenization |
| Compute efficiency | Cost grows with token count | Paper reports lower compute at similar quality |
| Generation style | Autoregressive token generation | Transformer plus MLP continuous-frame generation |

Note: This comparison summarizes results reported in the CALM paper. Outcomes can vary depending on model, data, and implementation.


Pocket TTS: A Practical CALM Implementation

Pocket TTS is an open-source TTS model that implements research ideas from the CALM paper in a practical form.

Main Characteristics (Based on the Kyutai Labs README)

Why a GPU Is Not Required

Pocket TTS has about 100M parameters, which is small compared with large language models. Kyutai Labs reports that, because the model is small and inference runs at batch size 1, they did not observe meaningful speed gains from GPU execution.
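A rough arithmetic sketch shows why a model of this size can run in real time on a CPU. Every number below is an assumption for illustration (frame rate, CPU throughput), not a Pocket TTS measurement:

```python
# Rough, illustrative estimate of per-second compute for a
# 100M-parameter autoregressive model at batch size 1.
# All constants are assumptions, not Pocket TTS benchmarks.

params = 100e6                        # model size (parameters)
flops_per_step = 2 * params           # ~2 FLOPs per parameter per step
steps_per_sec_audio = 12.5            # assumed latent frame rate (Hz)

# Compute needed to synthesize one second of audio.
flops_per_sec_audio = flops_per_step * steps_per_sec_audio  # 2.5 GFLOPs

# Assume a modest laptop CPU sustains ~50 GFLOP/s on this workload.
cpu_gflops = 50.0
realtime_factor = cpu_gflops * 1e9 / flops_per_sec_audio
```

Under these assumptions the CPU is roughly 20x faster than real time, and at batch size 1 a GPU's massive parallelism is mostly idle, which matches the report that GPU execution adds little.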

Browser Execution

Pocket TTS does not officially support browser execution, but the model is small enough that the community has released experimental WebAssembly-based implementations using Rust/Candle, JAX-JS, ONNX Runtime Web, and related tools.


How BrainCheck Applies TTS

BrainCheck uses high-quality TTS technology when generating audio for learning cards.

  1. Natural pronunciation: Speech closer to real conversation improves the listening experience compared with mechanical voices.
  2. Low-latency playback: Shorter waiting time keeps the learning flow smooth.

Related paper: Continuous Audio Language Models (CALM) - Simon Rouard, Manu Orsini, Axel Roebel, Neil Zeghidour, Alexandre Defossez (2025)

Sources: Pocket TTS GitHub · Kyutai Labs Publications