Open-Source Speech and Sound Generation
The MOSS-TTS Family is an open-source family of speech and sound generation models from MOSI.AI and the OpenMOSS team. It is designed for high-fidelity, highly expressive generation in complex real-world scenarios, covering stable long-form speech, multi-speaker dialogue, voice and character design, environmental sound effects, and real-time streaming TTS.
Generate natural, human-like speech with exceptional quality and expressiveness, suitable for professional applications.
Clone any voice with just a few seconds of reference audio, maintaining speaker identity across generations.
Multilingual synthesis with code-switching capabilities, covering Chinese, English, Arabic, and 29 other languages.
Ultra-low-latency streaming TTS with a 180 ms time to first byte (TTFB), well suited to voice agents and interactive applications.
Generate diverse environmental sounds, urban scenes, biological sounds, and musical fragments at 48kHz.
Create expressive multi-speaker conversations with accurate speaker attribution and natural prosody.
MOSS-TTS uses a unified Audio Tokenizer based on the Causal Audio Tokenizer (Cat) architecture. This 1.6-billion-parameter model compresses 24 kHz raw audio into a token stream at a frame rate of just 12.5 Hz using a 32-layer Residual Vector Quantizer (RVQ), and supports variable bitrates from 0.125 kbps to 4 kbps.
Trained on 3 million hours of diverse data (speech, sound effects, and music), the model achieves state-of-the-art reconstruction quality among open-source audio tokenizers.
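The tokenizer figures above fit together with simple arithmetic, sketched below. The bitrate of an RVQ token stream is the frame rate times the number of active quantizer layers times the bits per code. The value of 10 bits per code (a 1024-entry codebook per layer) is an assumption inferred from the stated bitrate range, not a figure given here, and the function name `rvq_bitrate_bps` is purely illustrative.

```python
def rvq_bitrate_bps(frame_rate_hz: float, num_layers: int, bits_per_code: int) -> float:
    """Bitrate of an RVQ token stream: frames/s x layers per frame x bits per code."""
    return frame_rate_hz * num_layers * bits_per_code


SAMPLE_RATE_HZ = 24_000   # input audio sample rate (stated above)
FRAME_RATE_HZ = 12.5      # token frame rate (stated above)
MAX_LAYERS = 32           # RVQ depth (stated above)
BITS_PER_CODE = 10        # ASSUMPTION: 1024-entry codebooks, inferred from the bitrate range

# Each token frame summarizes this many raw audio samples.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ

# Decoding with only the first layer gives the lowest bitrate; all 32 give the highest.
min_kbps = rvq_bitrate_bps(FRAME_RATE_HZ, 1, BITS_PER_CODE) / 1000
max_kbps = rvq_bitrate_bps(FRAME_RATE_HZ, MAX_LAYERS, BITS_PER_CODE) / 1000

print(f"samples per frame: {samples_per_frame:.0f}")
print(f"bitrate range: {min_kbps} kbps to {max_kbps} kbps")
```

Under that assumption, one layer yields 0.125 kbps and all 32 layers yield 4 kbps, matching the stated range, with each frame covering 1,920 audio samples (80 ms).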
Join the community of developers creating the future of speech technology