MOSS-TTS Family

Open-Source Speech and Sound Generation

MOSS-TTS Family is an open-source speech and sound generation model family from MOSI.AI and the OpenMOSS team. It is designed for high-fidelity, highly expressive generation in complex real-world scenarios, and covers stable long-form speech, multi-speaker dialogue, voice and character design, environmental sound effects, and real-time streaming TTS.

Key Features

🎤

High-Fidelity Speech

Generate natural, human-like speech with exceptional quality and expressiveness, suitable for professional applications.

🔄

Zero-Shot Voice Cloning

Clone any voice with just a few seconds of reference audio, maintaining speaker identity across generations.

🌍

Support for 31 Languages

Multilingual synthesis with code-switching across 31 languages, including Chinese, English, and Arabic.

Real-Time Streaming

Low-latency streaming TTS with a 180 ms time to first byte (TTFB), suited to voice agents and interactive applications.
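
TTFB here means the delay between sending a request and receiving the first chunk of audio. As a minimal sketch of how that figure could be checked against any streaming source, the helper below times an iterator of audio chunks. The names `measure_ttfb` and `fake_tts_stream` are hypothetical and are not part of any MOSS-TTS API; the fake stream only stands in for a real streaming client.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def measure_ttfb(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Consume a stream of audio chunks and time it.

    Returns (ttfb_seconds, total_seconds, total_bytes).
    ttfb_seconds is None if the stream never yields a non-empty chunk.
    """
    start = clock()
    ttfb = None
    total_bytes = 0
    for chunk in chunks:
        # Record the delay once, at the first chunk that carries data.
        if ttfb is None and chunk:
            ttfb = clock() - start
        total_bytes += len(chunk)
    return ttfb, clock() - start, total_bytes


def fake_tts_stream(n_chunks: int = 3, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS client; yields silent bytes."""
    for _ in range(n_chunks):
        time.sleep(delay_s)  # simulated generation and network delay
        yield b"\x00" * 960


if __name__ == "__main__":
    ttfb, total, size = measure_ttfb(fake_tts_stream())
    print(f"TTFB: {ttfb * 1000:.1f} ms, total: {total * 1000:.1f} ms, bytes: {size}")
```

To measure a real deployment, replace `fake_tts_stream()` with the chunk iterator returned by the client in use. Start the clock before the request is sent so that connection setup is included.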

🎵

Sound Effects Generation

Generate diverse environmental sounds, urban scenes, biological sounds, and musical fragments at 48 kHz.

🎭

Multi-Speaker Dialogue

Create expressive multi-speaker conversations with accurate speaker attribution and natural prosody.

Model Family

MOSS-TTS
Flagship production model with high fidelity and optimal zero-shot voice cloning. Supports long-form speech and fine-grained control.
MOSS-TTSD
Spoken dialogue generation model for expressive, multi-speaker conversations with industry-leading performance.
MOSS-VoiceGenerator
Voice design model that generates diverse voices directly from text prompts without reference speech.
MOSS-TTS-Realtime
Multi-turn context-aware model for real-time voice agents with natural and coherent replies.
MOSS-SoundEffect
Content creation model specialized in sound effect generation with wide category coverage.
MOSS-TTS-Nano
Lightweight model for CPU-first real-time deployment with 0.1B parameters and multilingual support.

Architecture Overview

Advanced Audio Tokenization

MOSS-TTS uses a unified audio tokenizer based on the Causal Audio Tokenizer (Cat) architecture. This 1.6-billion-parameter model compresses 24 kHz raw audio to a 12.5 Hz frame rate using a 32-layer Residual Vector Quantizer (RVQ), and supports variable bitrates from 0.125 kbps to 4 kbps.
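
The bitrate range follows from the frame rate and the number of RVQ layers in use. The sketch below works through that arithmetic. The sample rate, frame rate, and layer count come from the text above. The codebook size of 1024 entries (10 bits per code) is an assumption, chosen because it reproduces both the 0.125 kbps and 4 kbps endpoints; it is not stated in the source.

```python
import math

SAMPLE_RATE_HZ = 24_000  # from the text: 24 kHz input audio
FRAME_RATE_HZ = 12.5     # from the text: token frames per second
MAX_RVQ_LAYERS = 32      # from the text: 32-layer RVQ
CODEBOOK_SIZE = 1024     # ASSUMPTION: inferred from the stated bitrates


def samples_per_frame(sample_rate_hz: float = SAMPLE_RATE_HZ,
                      frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Number of raw audio samples summarized by one token frame."""
    return sample_rate_hz / frame_rate_hz


def bitrate_bps(n_layers: int,
                frame_rate_hz: float = FRAME_RATE_HZ,
                codebook_size: int = CODEBOOK_SIZE) -> float:
    """Bitrate when only the first n_layers RVQ codebooks are kept.

    Each layer emits one code per frame, and each code costs
    log2(codebook_size) bits.
    """
    if not 1 <= n_layers <= MAX_RVQ_LAYERS:
        raise ValueError(f"n_layers must be in 1..{MAX_RVQ_LAYERS}")
    return frame_rate_hz * n_layers * math.log2(codebook_size)


if __name__ == "__main__":
    print(samples_per_frame())         # 1920.0 samples per frame
    print(bitrate_bps(1) / 1000)       # 0.125 kbps with a single layer
    print(bitrate_bps(32) / 1000)      # 4.0 kbps with all 32 layers
```

Under this assumption each token frame covers 1920 input samples (80 ms of audio), and keeping fewer RVQ layers trades reconstruction detail for a proportionally lower bitrate.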

Trained on 3 million hours of diverse data (speech, sound effects, and music), the model achieves state-of-the-art reconstruction quality among open-source audio tokenizers.

MOSS Audio Tokenizer Architecture

Start Building with MOSS-TTS

Join the community of developers creating the future of speech technology