TTS Model Guide · 2026-03-19 · 6 min read

Xiaomi MiMo-V2-TTS: Text-Driven Expressive TTS

From free-form style instructions to non-verbal events and singing capability: MiMo-V2-TTS brings “expression” into speech generation.

Why MiMo Is More Than Traditional TTS

Free-Form Style Instructions

Describe emotion, pacing, tone, and performance intent in natural language; the model parses it into generation behavior.

Contextual Emotion & Prosody

MiMo goes beyond fixed emotion labels: it adapts intonation and rhythm to the semantics and context of the text.

Natural Non-Verbal Events

Pauses, hesitation fillers, sighs, coughing, and laughter are integrated into the generation process.

Singing in One Unified Model

The official page highlights singing capability within the same unified model.

If your site is built around voice-over workflows, MiMo’s value is that you can encode performance and emotional details directly into text—without relying on rigid UI dropdowns.

How to Write Style Prompts (Reusable in Your Workflow)

A practical template: emotion/pacing/performance intensity + voice tone/texture + (optional) non-verbal events.

Quick Example

angry but trying to stay calm, slightly clipped delivery, quick pacing

Whisper / Soft

deeply affectionate, speaking slowly, almost whispering, warm and soft

Add Non-Verbal Events

Hold on... [heavy breathing] I... I need a minute... [soft cough] Just give me time.

Treat these prompts as your site’s “style templates”. When you integrate MiMo later, you just map the templates to MiMo’s fields/control interface.
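One way to keep such templates reusable is a small registry keyed by name. The sketch below is only an illustration: the template names and the `resolve_style` helper are hypothetical and not part of any MiMo interface; the prompt strings are the examples from this article.

```python
# Hypothetical template registry. The keys are illustrative names chosen
# for this sketch, not identifiers defined by MiMo.
STYLE_TEMPLATES = {
    "restrained_anger": "angry but trying to stay calm, slightly clipped delivery, quick pacing",
    "soft_whisper": "deeply affectionate, speaking slowly, almost whispering, warm and soft",
}


def resolve_style(template_name: str, extra: str = "") -> str:
    """Return the style description for a template, optionally extended
    with additional free-form wording supplied by the user."""
    base = STYLE_TEMPLATES[template_name]  # raises KeyError for unknown names
    extra = extra.strip()
    return f"{base}, {extra}" if extra else base
```

Because the templates are plain strings, they can later be passed to whatever style field the provider integration ends up exposing.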

Product Integration: Suggested Field Mapping

1) Add MiMo as a provider/model

Add MiMo in your model configuration (e.g., `provider=mimo` and a model identifier) so users can select it in the model picker.
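A minimal sketch of such a configuration entry follows. Only `provider=mimo` comes from the text above; the `ModelOption` class, the `mimo-v2-tts` identifier, the `supports_style_prompt` flag, and the picker format are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    provider: str
    model_id: str                        # placeholder identifier, not an official one
    label: str
    supports_style_prompt: bool = False  # hypothetical capability flag


# Hypothetical configuration list for the site's model picker.
MODEL_OPTIONS = [
    ModelOption(
        provider="mimo",
        model_id="mimo-v2-tts",
        label="MiMo-V2-TTS",
        supports_style_prompt=True,
    ),
]


def picker_entries(options):
    """Turn configured models into value/label pairs for a model picker."""
    return [
        {"value": f"{o.provider}:{o.model_id}", "label": o.label}
        for o in options
    ]
```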

2) Normalize free-form style prompts into your input

If you already expose `emotion`, `language`, `speed`, and `volume` fields, use a “prompt composition” strategy to build MiMo’s style description from them; alternatively, add a dedicated `stylePrompt` field that is passed through directly.
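The two options can be combined in one helper, sketched below under stated assumptions: the numeric ranges for `speed` and `volume` (1.0 meaning normal), the thresholds, and the generated wording are all illustrative choices, and `language` is assumed to be sent separately rather than folded into the style text.

```python
def compose_style_prompt(fields: dict) -> str:
    """Build a natural-language style description from structured fields.

    A non-empty `stylePrompt` is passed through unchanged (apart from
    trimming); otherwise the description is composed from the other fields.
    """
    direct = (fields.get("stylePrompt") or "").strip()
    if direct:
        return direct

    parts = []
    emotion = (fields.get("emotion") or "").strip()
    if emotion:
        parts.append(emotion)

    # Assumed convention: 1.0 is normal speed; thresholds are illustrative.
    speed = fields.get("speed")
    if speed is not None:
        if speed < 0.9:
            parts.append("speaking slowly")
        elif speed > 1.1:
            parts.append("quick pacing")

    # Assumed convention: 1.0 is normal volume; thresholds are illustrative.
    volume = fields.get("volume")
    if volume is not None:
        if volume < 0.5:
            parts.append("soft, quiet voice")
        elif volume > 1.5:
            parts.append("loud, projected voice")

    return ", ".join(parts)
```

For example, `compose_style_prompt({"emotion": "affectionate", "speed": 0.8, "volume": 0.4})` yields `"affectionate, speaking slowly, soft, quiet voice"`.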

3) Handle billing/quota based on text size

Providers have different generation costs. Start with a character-based or estimated-duration multiplier, then calibrate using real generation metrics.
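A character-based starting point might look like the sketch below. The multiplier values are placeholders rather than real pricing, and `estimate_credits` and `calibrate_multiplier` are hypothetical helper names.

```python
import math

# Placeholder multipliers; replace with values calibrated from real usage.
PROVIDER_MULTIPLIERS = {"default": 1.0, "mimo": 1.5}


def estimate_credits(text: str, provider: str, credits_per_char: float = 1.0) -> int:
    """Estimate the credits to charge for one request, rounded up."""
    multiplier = PROVIDER_MULTIPLIERS.get(provider, PROVIDER_MULTIPLIERS["default"])
    return math.ceil(len(text) * credits_per_char * multiplier)


def calibrate_multiplier(samples):
    """Derive a multiplier from (char_count, measured_cost) pairs.

    Returns the average measured cost per character across all samples.
    """
    total_chars = sum(chars for chars, _ in samples)
    total_cost = sum(cost for _, cost in samples)
    if total_chars == 0:
        raise ValueError("need at least one sample with a non-zero character count")
    return total_cost / total_chars
```

Once enough generations have been logged, the result of `calibrate_multiplier` can replace the placeholder entry for the provider.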

4) Reduce learning cost with docs & examples

Once search traffic brings users in, what helps them most is prompts they can copy. Provide examples, explanations of the markers, and answers to common questions.

FAQ

How does MiMo-V2-TTS control style?

It emphasizes free-form style instructions: you describe emotion, pacing, tone, and performance intent in natural language instead of selecting a fixed emotion tag.

Can it generate pauses, breathing, coughing, and laughter?

The official page shows text markers that guide non-verbal events like pauses, hesitation fillers, sighs, coughing, and laughter for more natural performance.

Does MiMo-V2-TTS support singing?

Yes. The official description highlights native singing capability within the same unified model.

When will FishSpeech support MiMo-V2-TTS?

We are planning provider integration, field mapping, and billing/quotas. For now, use this article to build your prompt style spec, and keep an eye on upcoming documentation updates.
