Kokoro Text to Speech
A permissively licensed model supporting multiple languages and accents
You can run Kokoro on your laptop or on a high-throughput server.
It's ideal for generating high-quality speech in a range of accents, or for creating synthetic data for model fine-tuning.
Cheers, Ronan
Purchase ADVANCED-audio Repo Access
Get Trelis All Access:
Access all SEVEN Trelis Github Repos (-robotics, -vision, -evals, -fine-tuning, -inference, -voice, -time-series)
Support via Github Issues & Trelis' Private Discord
Early access to Trelis videos via Discord
TIMESTAMPS:
0:00 Generate voices with various accents with Kokoro
0:15 Kokoro - a small, permissively licensed text to speech model
2:34 Video Resources and Kokoro Setup
2:58 My recommendations on Text to Speech models for inference and fine-tuning
3:56 One-click affiliate template for a high throughput Kokoro Server: https://console.runpod.io/deploy?template=grwfixzu60&ref=jmfkcdio
4:31 Run Kokoro locally
8:29 Synthetic data generation via a server
9:58 Benchmarking Server Throughput
12:38 Conclusion
Kokoro: An 82-Million Parameter Text-to-Speech Model
Kokoro is a text-to-speech model with 82 million parameters that generates synthetic voices across multiple languages and accents. It is released under an Apache license, which permits commercial use, including training other models on its generated audio.
Language and Voice Coverage
The model supports nine languages with varying numbers of voice options:
American English: 11 female voices, 9 male voices
British English: 4 female voices, 4 male voices
Japanese, Chinese, Spanish, French, Hindi, Italian, and Brazilian Portuguese
Voices are graded A, B, or C based on quality, with A being highest. American English has the most high-quality voices, while British English has reasonable quality options.
Technical Architecture
Kokoro appears to be based on a StyleTTS-type architecture. The model file is approximately 325 megabytes, with an additional 28-megabyte voice pack. Users specify voices by name to generate speech in different accents and languages. The model also allows control over speech speed, which can be varied when generating synthetic training data.
Local Deployment
The model runs on CPU without requiring a GPU, making it suitable for edge device applications. Installation requires:
Kokoro-ONNX package
Soundfile library
Model files (325 MB)
Voice pack (28 MB)
The model can be run either through Jupyter notebooks or as a standalone Python script using the uv package manager.
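As a minimal sketch of the local workflow, assuming the kokoro-onnx Python API (a Kokoro class that takes the model and voice-pack paths, and a create() method returning samples plus a sample rate); the exact file and voice names below are illustrative and vary by release:

```python
# pip install kokoro-onnx soundfile
import soundfile as sf
from kokoro_onnx import Kokoro

# Paths assume the ~325 MB model file and ~28 MB voice pack have been
# downloaded; filenames differ between kokoro-onnx releases.
kokoro = Kokoro("kokoro-v0_19.onnx", "voices.bin")

# Generate speech for a given text, voice name, and speed.
samples, sample_rate = kokoro.create(
    "Hello from Kokoro, a small text-to-speech model.",
    voice="af_sarah",  # an American English female voice
    speed=1.0,
    lang="en-us",
)

# Save the audio to a standard WAV file with soundfile.
sf.write("hello.wav", samples, sample_rate)
```

Swapping the voice name (and lang code) is all it takes to switch accents or languages, which is what makes per-speaker control so simple here.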
Server Deployment and Throughput
For high-throughput synthetic data generation, Kokoro can be deployed on GPU servers. Testing on an H100 GPU showed the following performance characteristics:
At concurrency level 5: approximately 55 hours of audio generated per wall clock hour (55x real-time speed)
At concurrency level 20: approximately 78 hours of audio generated per wall clock hour (78x real-time speed)
The server uses simple batching rather than continuous batching. Simple batching processes complete batches sequentially, while continuous batching (used in vLLM-style systems) can combine requests of different lengths more efficiently. The server runs on port 8880 and accepts HTTP requests.
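To illustrate, here is a rough benchmarking client, assuming the server exposes an OpenAI-compatible /v1/audio/speech endpoint on port 8880 (as popular Kokoro server images do); the route, payload fields, and voice name are assumptions to adjust for your deployment:

```python
# pip install requests soundfile
import io
import time
from concurrent.futures import ThreadPoolExecutor

import requests
import soundfile as sf

URL = "http://localhost:8880/v1/audio/speech"  # port 8880, per the video
CONCURRENCY = 20  # vary this to reproduce the 5 vs. 20 comparison
N_REQUESTS = 100

def synthesize(i: int) -> bytes:
    # Payload assumes an OpenAI-compatible speech endpoint; adjust if your
    # server uses a different route or schema.
    resp = requests.post(
        URL,
        json={
            "model": "kokoro",
            "input": f"This is benchmark sentence number {i}.",
            "voice": "af_sky",  # illustrative voice name
            "response_format": "wav",
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.content

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    clips = list(pool.map(synthesize, range(N_REQUESTS)))
wall = time.perf_counter() - start

# Total seconds of audio generated, decoded from the returned WAV bytes.
audio_seconds = 0.0
for wav_bytes in clips:
    data, sr = sf.read(io.BytesIO(wav_bytes))
    audio_seconds += len(data) / sr

print(f"{len(clips)} clips, {audio_seconds:.0f}s of audio in {wall:.0f}s wall clock")
print(f"Throughput: {audio_seconds / wall:.1f}x real time")
```

Dividing total audio seconds by wall-clock seconds gives the same "x real-time" metric quoted above (55x at concurrency 5, 78x at concurrency 20).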
Licensing for Training Data
The Apache license permits using Kokoro's generated audio to train other models. This differs from commercial services like OpenAI or ElevenLabs, where terms of service for using generated audio as training data are less clear. This makes Kokoro a viable option for projects requiring synthetic speech data with clear usage rights.
Limitations
Training or fine-tuning Kokoro is not straightforward. The model has not been released with the data preparation and fine-tuning infrastructure that would make customization easy. For projects requiring custom voice training, alternative models are recommended:
Orpheus: best for vLLM or server deployment
CSM: best for voice cloning with one-shot examples
StyleTTS2: smaller model suitable for fine-tuning
Comparison to Other Models
Kokoro's 82 million parameters make it smaller than models like CSM or Orpheus. The model excels at generating clean, high-quality voices by specifying speaker names, but lacks the customization capabilities of fine-tunable alternatives. MeloTTS offers similar voice generation capabilities, but Kokoro's accent specification through speaker names provides more granular control.
Implementation Details
The model can be invoked with parameters for:
Text input
Voice selection (by name)
Speech speed
Audio output is saved in standard formats using the soundfile library. The simple Python interface makes integration into existing applications straightforward, whether running locally or hitting a remote server endpoint.
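To make that concrete, here is a hedged sketch of a synthetic-data loop that sweeps voices and speeds and writes a transcript manifest alongside the WAV files (again assuming the kokoro-onnx API; the voice names, filenames, and manifest layout are illustrative):

```python
# pip install kokoro-onnx soundfile
import csv

import soundfile as sf
from kokoro_onnx import Kokoro

kokoro = Kokoro("kokoro-v0_19.onnx", "voices.bin")

sentences = ["The quick brown fox jumps over the lazy dog."]
voices = ["af_sarah", "am_adam", "bf_emma"]  # American/British examples
speeds = [0.9, 1.0, 1.1]  # vary speed to diversify training data

with open("manifest.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["path", "text", "voice", "speed"])
    for text in sentences:
        for voice in voices:
            for speed in speeds:
                samples, sr = kokoro.create(text, voice=voice, speed=speed, lang="en-us")
                path = f"{voice}_{speed}.wav"
                sf.write(path, samples, sr)
                writer.writerow([path, text, voice, speed])
```

For large sweeps, the same loop can be pointed at the server endpoint instead of the local model to take advantage of GPU throughput.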

