MOSS-TTS Family

Open-Source Speech and Sound Generation

MOSS-TTS Family is an open-source speech and sound generation model family from MOSI.AI and the OpenMOSS team. It is designed for high-fidelity, highly expressive generation in complex real-world scenarios, covering stable long-form speech, multi-speaker dialogue, voice and character design, environmental sound effects, and real-time streaming TTS.

Key Features

🎤

High-Fidelity Speech

Generate natural, human-like speech with exceptional quality and expressiveness, suitable for professional applications.

🔄

Zero-Shot Voice Cloning

Clone a voice from just a few seconds of reference audio, maintaining speaker identity across generations.

🌍

Support for 31 Languages

Multilingual synthesis with code-switching across 31 languages, including Chinese, English, and Arabic.

⚡

Real-Time Streaming

Ultra-low-latency streaming TTS with a 180 ms time to first byte (TTFB), suited to voice agents and interactive applications.

🎵

Sound Effects Generation

Generate diverse environmental sounds, urban scenes, biological sounds, and musical fragments at 48 kHz.

🎭

Multi-Speaker Dialogue

Create expressive multi-speaker conversations with accurate speaker attribution and natural prosody.

Model Family

MOSS-TTS

Flagship production model with high fidelity and optimal zero-shot voice cloning. Supports long-form speech and fine-grained control.

MOSS-TTSD

Spoken dialogue generation model for expressive, multi-speaker conversations with industry-leading performance.

MOSS-VoiceGenerator

Voice design model that generates diverse voices directly from text prompts without reference speech.

MOSS-TTS-Realtime

Multi-turn context-aware model for real-time voice agents with natural and coherent replies.

MOSS-SoundEffect

Content creation model specialized in sound effect generation with wide category coverage.

MOSS-TTS-Nano

Lightweight 0.1B-parameter model for CPU-first real-time deployment, with multilingual support.

Architecture Overview

Advanced Audio Tokenization

MOSS-TTS uses a unified Audio Tokenizer based on the Causal Audio Tokenizer (Cat) architecture. This 1.6-billion-parameter model compresses 24 kHz raw audio into discrete tokens at a frame rate of 12.5 Hz using a 32-layer Residual Vector Quantizer (RVQ). Varying the number of RVQ layers used gives variable bitrates from 0.125 kbps to 4 kbps.
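The figures above can be checked with a little arithmetic. The sketch below is illustrative only: the frame rate, sample rate, and layer count come from the text, but the codebook size of 1024 entries (10 bits per code) is an assumption inferred from the stated bitrate range, not a documented value, and the function names are hypothetical.

```python
import math

FRAME_RATE_HZ = 12.5     # token frames per second (from the text)
SAMPLE_RATE_HZ = 24_000  # input audio sample rate (from the text)
NUM_RVQ_LAYERS = 32      # RVQ depth (from the text)
CODEBOOK_SIZE = 1024     # ASSUMPTION: inferred from 0.125 kbps at one layer


def bitrate_bps(num_layers: int,
                frame_rate_hz: float = FRAME_RATE_HZ,
                codebook_size: int = CODEBOOK_SIZE) -> float:
    """Bits per second when the first `num_layers` RVQ layers are kept."""
    bits_per_code = math.log2(codebook_size)
    return num_layers * bits_per_code * frame_rate_hz


def samples_per_frame(sample_rate_hz: float = SAMPLE_RATE_HZ,
                      frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Number of raw audio samples summarized by one token frame."""
    return sample_rate_hz / frame_rate_hz


def tokens_for_duration(seconds: float,
                        num_layers: int = NUM_RVQ_LAYERS,
                        frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Total discrete codes needed to represent `seconds` of audio."""
    frames = math.ceil(seconds * frame_rate_hz)
    return frames * num_layers


if __name__ == "__main__":
    print(bitrate_bps(1))           # lowest bitrate, one layer
    print(bitrate_bps(32))          # highest bitrate, all layers
    print(samples_per_frame())      # temporal compression factor
    print(tokens_for_duration(60))  # codes for one minute of audio
```

Under this assumption, one layer gives 125 bps and all 32 layers give 4,000 bps, matching the stated 0.125 kbps to 4 kbps range, and each token frame covers 1,920 input samples.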

Trained on 3 million hours of diverse audio (speech, sound effects, and music), the tokenizer achieves state-of-the-art reconstruction quality among open-source audio tokenizers.