TTS Comparison

Since I switched from Chatterbox to Qwen3 TTS in February 2026, the open text-to-speech market has continued to evolve. Within just a few weeks after Qwen3 TTS, Mistral released Voxtral 4B TTS, Xiaomi AI Lab released OmniVoice, OpenBMB released VoxCPM2, and Supertone released Supertonic 3.

The quality is now comparable to proprietary systems. Speed used to be the bottleneck, but with the new 2026 models that bottleneck is gone. Hardware requirements have dropped massively as well: I tested all models on my MacBook.

Models

Supertonic 3 supports 31 languages and is extremely lightweight. Running it on local devices is definitely a realistic option. In my tests, it was the most efficient and fastest model. It needed only 0.55 GB of RAM and generated the sample text in 1.3 seconds. The quality was sufficiently good. The model provides 10 voice styles and supports voice cloning. In addition, expression tags such as laughing, breathing, and sighing are supported.

Qwen3 TTS supports 10 languages and offers 0.6B and 1.7B models, as well as variants for CustomVoice and VoiceDesign. There are already many libraries for running Qwen3 TTS on a wide range of end devices. The quality is very good, even for longer texts. Its speed is in the upper quartile.

VoxCPM2 is the most architecturally interesting alternative. The model is tokenizer-free, has 2B parameters, and supports 30 languages. It also includes Voice Design and controllable voice cloning. The quality is good, but unfortunately it is comparatively slow.

Voxtral 4B TTS comes from Europe, supports 9 languages, and includes 20 preset voices. The quality is good, but the speed does not quite keep up. The biggest problem is the license. Because of the reference voices, the published weights are licensed under CC BY-NC 4.0 and therefore cannot be used commercially.

OmniVoice focuses primarily on breadth. According to the model card, it supports more than 600 languages and also offers Voice Design and voice cloning. The quality is good, but the speed lags somewhat behind.

Chatterbox is a model from 2025 and supports 23 languages. Its quality can no longer quite keep up with the newer models, but its speed is still very good.

My conclusion

I tested all models with MLX Audio on Apple Silicon. For my use case, synthesizing short LLM responses, the quality of all models was sufficient. Speed and support for the German language mattered most to me. In February, I switched from Chatterbox to Qwen3 TTS because of the better quality.
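A timing measurement of this kind can be sketched roughly as follows. This is not the script behind the numbers in this post, only a minimal illustration: `benchmark` and `fake_tts` are hypothetical names, and `fake_tts` is a stand-in that you would replace with a call into your own TTS backend. The warm-up run is an assumption on my part, added so that one-time model loading does not distort the result.

```python
import statistics
import time
from typing import Callable


def benchmark(synthesize: Callable[[str], object], text: str, runs: int = 3) -> dict:
    """Time synthesize(text) over several runs and report wall-clock seconds."""
    synthesize(text)  # warm-up run, so one-time loading cost is not counted
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)
        timings.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "best_s": min(timings),
        "median_s": statistics.median(timings),
    }


def fake_tts(text: str) -> bytes:
    """Hypothetical stand-in for a real model call; replace with your backend."""
    time.sleep(0.01)  # pretend to do some work
    return b"\x00" * len(text)


if __name__ == "__main__":
    print(benchmark(fake_tts, "Hallo, das ist ein kurzer Test."))
```

Using the same sample text for every model and reporting the median keeps a single slow run from skewing the comparison.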

Based on my current tests, I will soon switch from Qwen3 TTS to Supertonic 3. It offers equivalent quality, supports more languages, and is faster.

For anyone who needs more than 35 languages, I recommend taking a look at OmniVoice. It offers good quality, acceptable speed, and low resource requirements.

| TTS model | Quant | Speed | VRAM | License |
|---|---|---|---|---|
| Supertonic 3 | - | 1.1 s | 0.55 GB | openrail |
| Qwen3 TTS 0.6B | 4 Bit | 3.1 s | 1.85 GB | Apache 2.0 |
| Qwen3 TTS 0.6B | 8 Bit | 3.4 s | 2.11 GB | Apache 2.0 |
| Qwen3 TTS 1.7B | 4 Bit | 3.5 s | 2.43 GB | Apache 2.0 |
| Qwen3 TTS 1.7B | 8 Bit | 4.4 s | 3.15 GB | Apache 2.0 |
| VoxCPM2 | 4 Bit | 10.2 s | 2.49 GB | Apache 2.0 |
| VoxCPM2 | 8 Bit | 10.5 s | 3.35 GB | Apache 2.0 |
| Voxtral 4B TTS | 4 Bit | 6.2 s | 2.65 GB | CC BY-NC 4.0 |
| Voxtral 4B TTS | 6 Bit | 9.8 s | 3.54 GB | CC BY-NC 4.0 |
| Voxtral 4B TTS | 16 Bit | 32.4 s | 7.76 GB | CC BY-NC 4.0 |
| OmniVoice | 16 Bit | 9.2 s | 2.05 GB | Apache 2.0 |
| OmniVoice | 4 Bit | 3.6 s | 0.97 GB | Apache 2.0 |
| Chatterbox | 4 Bit | 1.5 s | 1.81 GB | Apache 2.0 |
| Chatterbox | 8 Bit | 1.9 s | 2.04 GB | Apache 2.0 |
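
If you want to apply your own constraints to these results, a small filter is enough. The following is a minimal sketch using the measurements from the table. The names `Result` and `candidates` are hypothetical, and which licenses count as acceptable is your own call to make after reading the license texts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    quant: str
    seconds: float    # generation time for the sample text
    memory_gb: float  # memory use reported in the table
    license: str


# Measurements copied from the table above.
RESULTS = [
    Result("Supertonic 3", "-", 1.1, 0.55, "openrail"),
    Result("Qwen3 TTS 0.6B", "4 Bit", 3.1, 1.85, "Apache 2.0"),
    Result("Qwen3 TTS 0.6B", "8 Bit", 3.4, 2.11, "Apache 2.0"),
    Result("Qwen3 TTS 1.7B", "4 Bit", 3.5, 2.43, "Apache 2.0"),
    Result("Qwen3 TTS 1.7B", "8 Bit", 4.4, 3.15, "Apache 2.0"),
    Result("VoxCPM2", "4 Bit", 10.2, 2.49, "Apache 2.0"),
    Result("VoxCPM2", "8 Bit", 10.5, 3.35, "Apache 2.0"),
    Result("Voxtral 4B TTS", "4 Bit", 6.2, 2.65, "CC BY-NC 4.0"),
    Result("Voxtral 4B TTS", "6 Bit", 9.8, 3.54, "CC BY-NC 4.0"),
    Result("Voxtral 4B TTS", "16 Bit", 32.4, 7.76, "CC BY-NC 4.0"),
    Result("OmniVoice", "16 Bit", 9.2, 2.05, "Apache 2.0"),
    Result("OmniVoice", "4 Bit", 3.6, 0.97, "Apache 2.0"),
    Result("Chatterbox", "4 Bit", 1.5, 1.81, "Apache 2.0"),
    Result("Chatterbox", "8 Bit", 1.9, 2.04, "Apache 2.0"),
]


def candidates(results, max_memory_gb, allowed_licenses):
    """Return results within the memory budget and license set, fastest first."""
    fits = [
        r for r in results
        if r.memory_gb <= max_memory_gb and r.license in allowed_licenses
    ]
    return sorted(fits, key=lambda r: r.seconds)


if __name__ == "__main__":
    for r in candidates(RESULTS, max_memory_gb=2.0, allowed_licenses={"Apache 2.0"}):
        print(f"{r.model} ({r.quant}): {r.seconds} s, {r.memory_gb} GB")
```

Note that this only ranks by speed and memory. Quality and language support, which mattered just as much for my decision, are not captured in the table.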