ZONOS2 playground, install guide, and comparison hub
Generate expressive multilingual speech, test English, Japanese, and Mandarin Chinese prompts, and compare samples. Then learn how to run ZONOS2 locally with CUDA, with API examples and real-world tradeoffs.
Voice cloning workflow
Start with the embedded demo, then test a short script with a clean, consented reference voice. For production, route generation through Zyphra Cloud or your own local inference server.
Listen before you decide
Real TTS decisions are made by listening. Use these lanes to compare ZONOS2 against managed and open-source alternatives.
Clean explainer voice for product videos, docs, and YouTube scripts.
A test lane for anime-style dialogue, visual novels, and localization.
Conversational pacing for intros, ad reads, and long-form narration.
Short expressive lines for quests, combat barks, and prototypes.
Quick facts
ZONOS2 is Zyphra's real-time text-to-speech model focused on expressive multilingual speech and high-fidelity voice cloning. Public sources describe a sparse MoE model with 8B total parameters, 900M active parameters, and training on more than 6M hours of speech.
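The two parameter counts are easier to reason about as a ratio and a rough weight size. The sketch below uses only the 8B and 900M figures quoted above. The bytes-per-parameter values are illustrative assumptions, not published ZONOS2 requirements, and the result covers weights only (no activations, caches, or audio codec components).

```python
# Figures quoted in the text; the storage precisions below are assumptions.
TOTAL_PARAMS = 8_000_000_000
ACTIVE_PARAMS = 900_000_000

# Hypothetical storage precisions, for illustration only.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def active_fraction(total: int = TOTAL_PARAMS, active: int = ACTIVE_PARAMS) -> float:
    """Share of parameters that take part in a single forward step."""
    return active / total


def weight_gib(params: int, dtype: str) -> float:
    """Approximate size of the weights alone, in GiB."""
    return params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    print(f"active fraction: {active_fraction():.2%}")
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: ~{weight_gib(TOTAL_PARAMS, dtype):.1f} GiB of weights")
```

In a sparse MoE, the active count mostly affects compute per step, while the total count usually drives how much memory the weights occupy. Check the official model card for real VRAM guidance.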
System requirements
Local inference is aimed at Linux x86_64 with NVIDIA CUDA. Use this quick checker to choose local, WSL2, or cloud GPU.
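The decision behind the checker can be sketched in a few lines of Python. This is a minimal illustration of the rule stated on this page (Linux x86_64 with NVIDIA CUDA runs locally, Windows with an NVIDIA GPU goes through WSL2, everything else uses a cloud GPU). The function names are hypothetical, and using `nvidia-smi` on the PATH as a GPU hint is an assumption, not an official detection method.

```python
import platform
import shutil


def recommend_path(os_name: str, arch: str, has_nvidia_gpu: bool) -> str:
    """Return 'local', 'wsl2', or 'cloud' for a given machine description."""
    os_name = os_name.lower()
    is_x86_64 = arch.lower() in {"x86_64", "amd64"}
    if not (has_nvidia_gpu and is_x86_64):
        return "cloud"  # no usable NVIDIA GPU on an x86_64 host
    if os_name == "linux":
        return "local"  # the officially targeted setup
    if os_name == "windows":
        return "wsl2"  # only if you are comfortable debugging GPU passthrough
    return "cloud"


def detect() -> str:
    """Describe the current machine; nvidia-smi on PATH is only a rough GPU hint."""
    has_gpu = shutil.which("nvidia-smi") is not None
    return recommend_path(platform.system(), platform.machine(), has_gpu)


if __name__ == "__main__":
    print(detect())
```

The sketch does not check VRAM, driver version, or CUDA toolkit version, so treat its answer as a starting point rather than a guarantee.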
Copyable setup
The shortest official path is Linux plus NVIDIA CUDA. Windows users should consider WSL2 only if they are comfortable debugging GPU passthrough.
git clone https://github.com/Zyphra/ZONOS2.git
cd ZONOS2
uv sync
uv run python -m minisgl --model-path Zyphra/ZONOS2 --tts-default-voices-dir ./default_voices/
curl -X POST http://localhost:1919/tts/generate \
-H "Content-Type: application/json" \
-d '{"text":"Hello from ZONOS2","stream":true}' \
--output output.pcm
ffmpeg -f f32le -ar 44100 -ac 1 -i output.pcm output.wav
wsl --install
# Install an NVIDIA driver with WSL CUDA support on Windows.
# Inside Ubuntu on WSL2:
nvidia-smi
uv --version
# Then follow the Linux CUDA commands.
Comparison
| Dimension | ZONOS2 | ElevenLabs |
|---|---|---|
| Best fit | Open-weight TTS experiments, self-hosting, voice-clone research, API wrappers | Managed creator and business voice production |
| Control | High if you run the model or local server yourself | Lower, but easier for non-technical teams |
| Setup | Linux x86_64, NVIDIA CUDA, uv, local server on port 1919 | Browser-first SaaS workflow |
| Cost model | GPU time, hosting, maintenance, and engineering effort | Subscription or usage-based billing |
| Voice cloning | Strong focus on high-fidelity and naturalistic voice cloning | Polished voice library, cloning flows, and creator UX |
| Commercial risk | Verify model weights, code license, third-party components, and usage rights | Review platform terms, voice rights, and usage policy |
Tier 1 languages
Use clean reference audio and short scripts first. Japanese is a strong long-tail page because users search for anime dubbing, game dialogue, and localization workflows.
Mandarin Chinese is listed as Tier 1 in official language support, so it deserves first-class examples instead of being hidden in a generic language list.
English remains the main comparison lane against ElevenLabs, Cartesia, Fish Audio, Qwen, Kokoro, and Chatterbox.
Developer lane
After the local server starts, the default endpoint accepts generation requests on localhost port 1919. Keep early tests short, then add chunking for long scripts.
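A Python equivalent of this section's curl request, as a minimal sketch using only the standard library. The URL and JSON fields come from the curl example. The assumption that the response body is mono 32-bit float little-endian PCM at 44.1 kHz is inferred from the ffmpeg flags on this page and should be verified against the server you run. The helper names are hypothetical.

```python
import array
import json
import sys
import urllib.request
import wave

TTS_URL = "http://localhost:1919/tts/generate"  # endpoint from the curl example
SAMPLE_RATE = 44100  # assumed from the ffmpeg flags (-f f32le -ar 44100 -ac 1)


def build_request(text: str, stream: bool = True) -> urllib.request.Request:
    """Build the same JSON POST that the curl example sends."""
    body = json.dumps({"text": text, "stream": stream}).encode("utf-8")
    return urllib.request.Request(
        TTS_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def f32le_to_wav(pcm: bytes, path: str, sample_rate: int = SAMPLE_RATE) -> int:
    """Write mono float32 little-endian PCM as a 16-bit WAV; return the frame count."""
    floats = array.array("f")
    floats.frombytes(pcm[: len(pcm) - len(pcm) % 4])  # drop any partial sample
    if sys.byteorder == "big":
        floats.byteswap()  # input is little-endian
    # Clamp to [-1, 1] and scale to signed 16-bit integers.
    ints = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in floats))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(ints.tobytes())
    return len(ints)


def synthesize(text: str, wav_path: str = "output.wav") -> int:
    """Send the request to a running local server and save the result as WAV."""
    with urllib.request.urlopen(build_request(text), timeout=120) as resp:
        return f32le_to_wav(resp.read(), wav_path)
```

Calling `synthesize("Hello world")` requires the local server to be running. It reads the full response before converting, which is fine for short tests but gives up the latency benefit of streaming.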
curl -X POST http://localhost:1919/tts/generate \
-H "Content-Type: application/json" \
-d '{"text":"Hello world","stream":true}' \
--output output.pcm
Troubleshooting
Check nvidia-smi, driver version, CUDA toolkit, and whether the toolkit matches the runtime expected by your environment.
Confirm the server started on http://localhost:1919 and that another process is not using the same port.
Split long scripts by sentence, generate chunks, apply fade-out only when needed, and stitch audio after checking pacing.
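The long-script advice above can be sketched as three small helpers: split by sentence, optionally fade out each chunk, then stitch with a short pause. This is an illustrative sketch, not the project's own tooling. The function names, the 300-character chunk limit, the fade length, and the gap length are all assumptions to tune by ear, and samples are plain Python floats for clarity rather than speed.

```python
import re

SAMPLE_RATE = 44100  # assumed output rate, taken from the ffmpeg example


def split_sentences(text: str, max_chars: int = 300) -> list[str]:
    """Split on sentence-ending punctuation, then pack sentences into chunks."""
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    sentences = [p.strip() for p in parts if p and p.strip()]
    chunks: list[str] = []
    for sentence in sentences:
        if chunks and len(chunks[-1]) + 1 + len(sentence) <= max_chars:
            chunks[-1] = f"{chunks[-1]} {sentence}"
        else:
            chunks.append(sentence)
    return chunks


def fade_out(samples: list[float], fade_ms: float = 20.0) -> list[float]:
    """Linearly ramp the last fade_ms of a chunk down to silence."""
    n = min(len(samples), int(SAMPLE_RATE * fade_ms / 1000))
    if n == 0:
        return list(samples)
    head = list(samples[: len(samples) - n])
    tail = [s * (n - 1 - i) / n for i, s in enumerate(samples[len(samples) - n:])]
    return head + tail


def stitch(chunks: list[list[float]], gap_ms: float = 150.0) -> list[float]:
    """Concatenate chunks with a short silence between them to control pacing."""
    gap = [0.0] * int(SAMPLE_RATE * gap_ms / 1000)
    out: list[float] = []
    for index, chunk in enumerate(chunks):
        if index:
            out.extend(gap)
        out.extend(chunk)
    return out
```

Apply `fade_out` only to chunks that end with an audible click or cut-off tail. Packed sentences are joined with a space, which suits English. For Japanese or Mandarin Chinese you may prefer one sentence per chunk by setting `max_chars` low.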
Homepage content map
Use the embedded ZONOS2 Space first. Users should hear or test the model before reading a long article.
Compare ZONOS2, ElevenLabs, Kokoro, Chatterbox, Fish Audio, Qwen, and Cartesia by use case and workflow.
Explain clean reference audio, consent, sample length, speaker similarity, and safe cloning boundaries.
Surface 8B total parameters, 900M active parameters, 6M+ hours of audio, and Tier 1 language support.
Give visitors a fast GPU, OS, and VRAM answer before they lose time on CUDA setup.
Split commands for Linux, WSL2, and cloud GPU paths with copyable snippets.
Capture comparison intent from buyers who need cost, quality, hosting, and license tradeoffs.
Highlight Tier 1 Mandarin Chinese and Japanese support as the long-tail advantage.
Show REST and Python entry points so developers can bookmark the page.
Answer CUDA, uv, port 1919, long text, and audio conversion issues.
Define it as Zyphra's real-time open-weight TTS model with high-fidelity cloning.
Explain why MoE, larger data scale, and voice cloning fidelity make the launch important.
Turn architecture details into scannable cards instead of dense prose.
List Tier 1, Tier 2, and Tier 3 language expectations with realistic quality notes.
Explain when users should prefer clean output or faithful voice-clone output.
Give users without a local NVIDIA machine a practical route.
Show the localhost 1919 endpoint and output conversion workflow.
Show the offline inference path for developers who do not want a server.
Explain sentence splitting, fade-out, pacing, and post-processing.
Teach users to record a single speaker with low noise and clear consent.
Summarize temperature, top-k, speaking rate, and seed tradeoffs.
Handle driver mismatch, missing toolkit, and Linux-only assumptions.
Map the model to creators making narration, Shorts, and localization.
Map the model to NPC dialogue, prototypes, and mod tools.
Explain long-form narration expectations and editing workflow.
Tell users to verify model weights and inference code license before production.
Call out impersonation, private voices, public figures, scams, and platform policy risk.
Answer free demo, API, Windows, WSL2, Japanese, Chinese, and ElevenLabs questions.
Invite users to follow ZONOS2 demos, fixes, and benchmark updates.
Grow beyond one model into comparisons and troubleshooting for open voice tools.
Commercial use
Zyphra's launch post and Hugging Face model page present Apache-2.0 licensing for the model, while the GitHub repository currently indicates an MIT license for the code. Treat model weights, inference code, third-party notices, and generated voice rights as separate checks before commercial deployment.
Multilingual ZONOS2 workflow
This home page targets users searching for ZONOS2 multilingual TTS, ZONOS2 voice cloning, ZONOS2 Japanese voice cloning, ZONOS2 Mandarin Chinese speech, and ZONOS2 English narration. Keep tests short, compare language output side by side, and verify consent before using any cloned voice.
Use ZONOS2 TTS for product explainers, YouTube voiceovers, podcasts, API demos, and developer documentation where clear English pacing matters.
Use ZONOS2 Japanese TTS for anime-style dialogue tests, game character lines, VTuber scripts, localization drafts, and language-learning examples.
Use ZONOS2 Mandarin Chinese speech for bilingual demos, creator narration, app onboarding, education content, and Chinese voice cloning experiments.
Store language, reference voice, prompt text, consent status, and output settings together so every ZONOS2 multilingual generation is traceable.
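One way to keep those fields together is a small record written as one JSON line per generation. The sketch below is an assumption about how such a log could look, not a ZONOS2 feature. The class name, field names, and the example setting keys are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class GenerationRecord:
    """Everything needed to trace one generated clip back to its inputs."""

    language: str
    reference_voice: str  # path or ID of the reference audio
    prompt_text: str
    consent_confirmed: bool
    output_settings: dict = field(default_factory=dict)  # e.g. seed, temperature
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json_line(self) -> str:
        """Serialize the record; refuse if consent was never confirmed."""
        if not self.consent_confirmed:
            raise ValueError("consent must be confirmed before logging a generation")
        return json.dumps(asdict(self), ensure_ascii=False)


if __name__ == "__main__":
    record = GenerationRecord(
        language="ja",
        reference_voice="voices/narrator.wav",  # hypothetical path
        prompt_text="こんにちは",
        consent_confirmed=True,
        output_settings={"seed": 1},
    )
    print(record.to_json_line())
```

Raising on missing consent is a deliberate design choice here: it makes an unverified voice fail loudly at logging time instead of slipping into a batch.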
FAQ
The model can be explored through Zyphra Cloud during its launch period and through community Spaces. Free access can change, so check the official provider before relying on it for production.
The official local path targets Linux x86_64 with an NVIDIA GPU and a CUDA toolkit that matches your driver.
Official model cards list English, Mandarin Chinese, and Japanese as Tier 1 languages.
ZONOS2 can be an alternative to ElevenLabs for developers who want open-weight control and self-hosting. ElevenLabs remains stronger for polished SaaS workflows and managed production UX.
Clone only voices you own or have permission to use. Do not impersonate private people, public figures, or copyrighted characters without the right to do so.