Kokoro Text to Speech
A permissively licensed model supporting multiple languages and accents
You can run Kokoro on your laptop or on a high-throughput server.
It's ideal for generating high-quality speech in a range of accents, or for creating synthetic data for model fine-tuning.
Cheers, Ronan
Purchase ADVANCED-audio Repo Access
Get Trelis All Access:
Access all SEVEN Trelis Github Repos (-robotics, -vision, -evals, -fine-tuning, -inference, -voice, -time-series)
Support via Github Issues & Trelis' Private Discord
Early access to Trelis videos via Discord
TIMESTAMPS:
0:00 Generate voices with various accents with Kokoro
0:15 Kokoro - a small, permissively licensed text to speech model
2:34 Video Resources and Kokoro Setup
2:58 My recommendations on Text to Speech models for inference and fine-tuning
3:56 One-click affiliate template for a high throughput Kokoro Server: https://console.runpod.io/deploy?template=grwfixzu60&ref=jmfkcdio
4:31 Run Kokoro locally
8:29 Synthetic data generation via a server
9:58 Benchmarking Server Throughput
12:38 Conclusion
Kokoro: An 82-Million Parameter Text-to-Speech Model
Kokoro is a text-to-speech model with 82 million parameters that generates synthetic voices across multiple languages and accents. It is released under an Apache license, which permits commercial use, including training other models on its generated audio.
Language and Voice Coverage
The model supports nine languages with varying numbers of voice options:
American English: 11 female voices, 9 male voices
British English: 4 female voices, 4 male voices
Japanese, Chinese, Spanish, French, Hindi, Italian, and Brazilian Portuguese
Voices are graded A, B, or C based on quality, with A being highest. American English has the most high-quality voices, while British English has reasonable quality options.
Technical Architecture
Kokoro appears to be based on a StyleTTS-type architecture. The model file is approximately 325 megabytes, with an additional 28-megabyte voice pack. Users specify voices by name to generate speech in different accents and languages. The model also allows control over speech speed, which can be varied when generating synthetic training data.
Local Deployment
The model runs on CPU without requiring a GPU, making it suitable for edge device applications. Installation requires:
Kokoro-ONNX package
Soundfile library
Model files (325 MB)
Voice pack (28 MB)
The model can be run either through Jupyter notebooks or as a standalone Python script using the uv package manager.
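As a minimal sketch of the local workflow, assuming the kokoro-onnx Python API (a Kokoro class that takes the model and voice-pack paths, and a create() method returning samples plus a sample rate); the exact file and voice names below are illustrative and vary by release:

```python
# pip install kokoro-onnx soundfile
import soundfile as sf
from kokoro_onnx import Kokoro

# Paths assume the ~325 MB model file and ~28 MB voice pack have been
# downloaded; filenames differ between kokoro-onnx releases.
kokoro = Kokoro("kokoro-v0_19.onnx", "voices.bin")

# Generate speech for a given text, voice name, and speed.
samples, sample_rate = kokoro.create(
    "Hello from Kokoro, a small text-to-speech model.",
    voice="af_sarah",  # an American English female voice
    speed=1.0,
    lang="en-us",
)

# Save the audio to a standard WAV file with soundfile.
sf.write("hello.wav", samples, sample_rate)
```

Swapping the voice name (and lang code) is all it takes to switch accents or languages, which is what makes per-speaker control so simple here.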
Server Deployment and Throughput
For high-throughput synthetic data generation, Kokoro can be deployed on GPU servers. Testing on an H100 GPU showed the following performance characteristics:
At concurrency level 5: approximately 55 hours of audio generated per wall clock hour (55x real-time speed)
At concurrency level 20: approximately 78 hours of audio generated per wall clock hour (78x real-time speed)
The server uses simple batching rather than continuous batching. Simple batching processes complete batches sequentially, while continuous batching (used in vLLM-style systems) can combine requests of different lengths more efficiently. The server runs on port 8880 and accepts HTTP requests.
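To illustrate, here is a rough benchmarking client, assuming the server exposes an OpenAI-compatible /v1/audio/speech endpoint on port 8880 (as popular Kokoro server images do); the route, payload fields, and voice name are assumptions to adjust for your deployment:

```python
# pip install requests soundfile
import io
import time
from concurrent.futures import ThreadPoolExecutor

import requests
import soundfile as sf

URL = "http://localhost:8880/v1/audio/speech"  # port 8880, per the video
CONCURRENCY = 20  # vary this to reproduce the 5 vs. 20 comparison
N_REQUESTS = 100

def synthesize(i: int) -> bytes:
    # Payload assumes an OpenAI-compatible speech endpoint; adjust if your
    # server uses a different route or schema.
    resp = requests.post(
        URL,
        json={
            "model": "kokoro",
            "input": f"This is benchmark sentence number {i}.",
            "voice": "af_sky",  # illustrative voice name
            "response_format": "wav",
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.content

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    clips = list(pool.map(synthesize, range(N_REQUESTS)))
wall = time.perf_counter() - start

# Total seconds of audio generated, decoded from the returned WAV bytes.
audio_seconds = 0.0
for wav_bytes in clips:
    data, sr = sf.read(io.BytesIO(wav_bytes))
    audio_seconds += len(data) / sr

print(f"{len(clips)} clips, {audio_seconds:.0f}s of audio in {wall:.0f}s wall clock")
print(f"Throughput: {audio_seconds / wall:.1f}x real time")
```

Dividing total audio seconds by wall-clock seconds gives the same "x real-time" metric quoted above (55x at concurrency 5, 78x at concurrency 20).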
Licensing for Training Data
The Apache license permits using Kokoro's generated audio to train other models. This differs from commercial services like OpenAI or ElevenLabs, where terms of service for using generated audio as training data are less clear. This makes Kokoro a viable option for projects requiring synthetic speech data with clear usage rights.
Limitations
Training or fine-tuning Kokoro is not straightforward. The model has not been released with the data preparation and fine-tuning infrastructure that would make customization easy. For projects requiring custom voice training, alternative models are recommended:
Orpheus: best for vLLM or server deployment
CSM: best for voice cloning with one-shot examples
StyleTTS2: smaller model suitable for fine-tuning
Comparison to Other Models
Kokoro's 82 million parameters make it smaller than models like CSM or Orpheus. The model excels at generating clean, high-quality voices by specifying speaker names, but lacks the customization capabilities of fine-tunable alternatives. MeloTTS offers similar voice generation capabilities, but Kokoro's accent specification through speaker names provides more granular control.
Implementation Details
The model can be invoked with parameters for:
Text input
Voice selection (by name)
Speech speed
Audio output is saved in standard formats using the soundfile library. The simple Python interface makes integration into existing applications straightforward, whether running locally or hitting a remote server endpoint.
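To make that concrete, here is a hedged sketch of a synthetic-data loop that sweeps voices and speeds and writes a transcript manifest alongside the WAV files (again assuming the kokoro-onnx API; the voice names, filenames, and manifest layout are illustrative):

```python
# pip install kokoro-onnx soundfile
import csv

import soundfile as sf
from kokoro_onnx import Kokoro

kokoro = Kokoro("kokoro-v0_19.onnx", "voices.bin")

sentences = ["The quick brown fox jumps over the lazy dog."]
voices = ["af_sarah", "am_adam", "bf_emma"]  # American/British examples
speeds = [0.9, 1.0, 1.1]  # vary speed to diversify training data

with open("manifest.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["path", "text", "voice", "speed"])
    for text in sentences:
        for voice in voices:
            for speed in speeds:
                samples, sr = kokoro.create(text, voice=voice, speed=speed, lang="en-us")
                path = f"{voice}_{speed}.wav"
                sf.write(path, samples, sr)
                writer.writerow([path, text, voice, speed])
```

For large sweeps, the same loop can be pointed at the server endpoint instead of the local model to take advantage of GPU throughput.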

