Xiaomi MiMo-V2-TTS: Text-Driven Expressive TTS
From free-form style instructions to non-verbal events and singing capability: MiMo-V2-TTS brings “expression” into speech generation.
Why MiMo Is More Than Traditional TTS
Free-Form Style Instructions
Describe emotion, pacing, tone, and performance intent in natural language; the model interprets the description and adjusts its delivery to match.
Contextual Emotion & Prosody
Rather than relying on fixed labels, MiMo adapts intonation and rhythm to the semantics and context of the text.
Natural Non-Verbal Events
Pauses, hesitation fillers, sighs, coughing, and laughter are integrated into the generation process.
Singing in One Unified Model
The official page highlights singing capability within the same unified model.
If your site is built around voice-over workflows, MiMo’s value is that you can encode performance and emotional details directly in text instead of relying on rigid UI dropdowns.
How to Write Style Prompts (Reusable in Your Workflow)
A practical template: emotion/pacing/performance intensity + voice tone/texture + (optional) non-verbal events.
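The template above can be sketched as a small helper that keeps the three parts separate and joins them into one description. This is only an illustration: the `StyleTemplate` class and its field names are hypothetical, and the comma-separated output format is an assumption based on the example prompts in this article, not a documented MiMo requirement.

```python
from dataclasses import dataclass, field


@dataclass
class StyleTemplate:
    """Hypothetical container for the three parts of a style prompt."""

    # Emotion, pacing, and performance intensity.
    delivery: list[str]
    # Voice tone and texture.
    voice: list[str] = field(default_factory=list)
    # Optional non-verbal events.
    events: list[str] = field(default_factory=list)

    def render(self) -> str:
        # Join the non-empty parts, in template order, as a comma-separated description.
        parts = (*self.delivery, *self.voice, *self.events)
        return ", ".join(p.strip() for p in parts if p.strip())


calm_anger = StyleTemplate(
    delivery=["angry but trying to stay calm", "quick pacing"],
    voice=["slightly clipped delivery"],
)
print(calm_anger.render())
```

Keeping the parts separate makes it easier to reuse one voice texture across several emotions, or to drop the optional events without rewriting the whole prompt.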
Quick Example
angry but trying to stay calm, slightly clipped delivery, quick pacing
Whisper / Soft
deeply affectionate, speaking slowly, almost whispering, warm and soft
Add Non-Verbal Events
Hold on... [heavy breathing] I... I need a minute... [soft cough] Just give me time.
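If your pipeline needs to know which non-verbal events a script contains (for validation, previews, or counting only the spoken characters), a minimal sketch like the one below can separate the markers from the spoken text. It assumes markers are written in square brackets, as in the example above; the function name `split_markers` is hypothetical, and the exact set of markers MiMo accepts is not specified here.

```python
import re

# Matches one bracketed marker such as "[soft cough]" (no nested brackets).
MARKER = re.compile(r"\[([^\[\]]+)\]")


def split_markers(text: str) -> tuple[str, list[str]]:
    """Return the spoken text without markers, plus the marker labels in order."""
    events = [label.strip() for label in MARKER.findall(text)]
    spoken = MARKER.sub("", text)
    # Collapse the double spaces left behind where a marker was removed.
    spoken = re.sub(r"\s{2,}", " ", spoken).strip()
    return spoken, events


line = "Hold on... [heavy breathing] I... I need a minute... [soft cough] Just give me time."
spoken, events = split_markers(line)
print(spoken)
print(events)
```

The original line, markers included, is what you would send for generation; the split result is only for your own bookkeeping.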
Treat these prompts as your site’s “style templates”. When you integrate MiMo later, you only need to map each template onto whatever fields or control interface MiMo exposes.
Product Integration: Suggested Field Mapping
1) Add MiMo as a provider/model
Add MiMo in your model configuration (e.g., `provider=mimo` and a model identifier) so users can select it in the model picker.
2) Normalize “free style prompts” into your input
If you already expose `emotion`, `language`, `speed`, and `volume` fields, either compose MiMo’s style description from them (a “prompt composition” strategy) or add a dedicated `stylePrompt` field that is passed through unchanged.
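As a rough sketch of the two options, the helper below passes a user-supplied style prompt through unchanged and otherwise composes one from structured fields. Everything here is illustrative: the function name, the phrase tables, and the allowed values (`slow`, `soft`, and so on) are assumptions about your own input schema, and `style_prompt` stands in for the `stylePrompt` field. `language` is left out on the assumption that it selects a voice or locale rather than describing performance.

```python
from typing import Optional

# Hypothetical phrase tables; tune the wording against real generation results.
SPEED_PHRASES = {"slow": "speaking slowly", "normal": "", "fast": "quick pacing"}
VOLUME_PHRASES = {"soft": "almost whispering", "normal": "", "loud": "projecting loudly"}


def build_style_prompt(
    emotion: Optional[str] = None,
    speed: str = "normal",
    volume: str = "normal",
    style_prompt: Optional[str] = None,
) -> str:
    """Compose a free-form style description from structured fields."""
    # Direct pass-through: an explicit style prompt overrides the structured fields.
    if style_prompt and style_prompt.strip():
        return style_prompt.strip()
    parts = [
        (emotion or "").strip(),
        SPEED_PHRASES.get(speed, ""),
        VOLUME_PHRASES.get(volume, ""),
    ]
    # Skip empty parts so defaults such as "normal" add nothing to the prompt.
    return ", ".join(p for p in parts if p)


print(build_style_prompt(emotion="deeply affectionate", speed="slow", volume="soft"))
```

Letting the explicit prompt win keeps the behavior predictable for users who already know what they want to write, while the structured fields remain a fallback for everyone else.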
3) Handle billing/quota based on text size
Providers have different generation costs. Start with a character-based or estimated-duration multiplier, then calibrate using real generation metrics.
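One way to start is sketched below: charge by character count with a per-provider multiplier, and separately estimate duration from a characters-per-second rate that you later recalibrate from logged generations. All numbers are placeholders, not real MiMo pricing or speaking rates, and the function names are hypothetical.

```python
import math

# Placeholder values; replace them with figures from your own pricing and metrics.
PROVIDER_MULTIPLIER = {"default": 1.0, "mimo": 1.5}
CHARS_PER_SECOND = 15.0


def estimate_credits(text: str, provider: str = "default") -> int:
    """Character-based cost: character count times a per-provider multiplier."""
    multiplier = PROVIDER_MULTIPLIER.get(provider, 1.0)
    return math.ceil(len(text.strip()) * multiplier)


def estimate_duration_seconds(text: str, chars_per_second: float = CHARS_PER_SECOND) -> float:
    """Rough audio length, assuming a constant speaking rate."""
    return len(text.strip()) / chars_per_second


def calibrate_chars_per_second(samples: list[tuple[int, float]]) -> float:
    """Recompute the speaking rate from (character_count, audio_seconds) pairs."""
    total_chars = sum(chars for chars, _ in samples)
    total_seconds = sum(seconds for _, seconds in samples)
    if total_seconds <= 0:
        raise ValueError("need at least one sample with a positive duration")
    return total_chars / total_seconds


print(estimate_credits("a" * 100, provider="mimo"))
print(calibrate_chars_per_second([(300, 20.0), (150, 10.0)]))
```

Expressive prompts with slow pacing or many non-verbal events will likely produce longer audio per character, so it may be worth calibrating the rate per style template rather than per provider alone.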
4) Reduce learning cost with docs & examples
Once search traffic brings users in, what matters most is giving them prompts they can copy. Provide examples, explanations of the non-verbal markers, and answers to common questions.
FAQ
How does MiMo-V2-TTS control style?
It emphasizes free-form style instructions: you describe emotion, pacing, tone, and performance intent in natural language instead of selecting a fixed emotion tag.
Can it generate pauses, breathing, coughing, and laughter?
The official page shows text markers that guide non-verbal events like pauses, hesitation fillers, sighs, coughing, and laughter for more natural performance.
Does MiMo-V2-TTS support singing?
Yes. The official description highlights native singing capability within the same unified model.
When will FishSpeech support MiMo-V2-TTS?
We are planning provider integration, field mapping, and billing/quotas. For now, use this article to build your prompt style spec, and keep an eye on upcoming documentation updates.