Product Launch · 2026-03-02 · 8 min read

Kitta AI S2 Model Launch: AI Voice Enters the 2.0 Era

More natural emotions, finer control, lower latency — Kitta AI S2 opens a new chapter in AI voice

Kitta AI is launching the all-new S2 model, a major leap forward from S1. S2 achieves breakthroughs in emotion control, multi-speaker support, and latency optimization, marking the official arrival of AI Voice 2.0.

Why S2? The Next Leap in AI Voice

Over the past year, the S1 model has made Kitta AI the world's second-largest AI voice platform, with 3.5 million users, 1.1 million UGC voice models, and $10 million in ARR. As the world's first TTS model with natural language emotion control, S1 has proven the immense potential of end-to-end voice modeling.

But S1 was just the beginning. We are entering the AI Voice 2.0 era — evolving from traditional word-by-word broadcasting voices to emotionally authentic, interactive AI voices with soul. S2 is the core vehicle of this transformation.

S2 Core Upgrades: Breakthroughs in Three Dimensions

Refined Emotion Control

S2 achieves open-domain emotion annotation — from simple 'happy' or 'sad' to complex mixed emotions like 'angry with underlying sadness'. This is powered by our self-developed world-leading emotion-annotating ASR model, enabling pre-training data to naturally carry accurate emotion labels.

Native Multi-Speaker Support

S2's architecture natively supports multi-speaker scenarios, with a precise speaker tag on every segment. Whether it's a multi-host podcast, audiobook dialogue, or a game NPC conversation, transitions between characters sound natural and seamless.
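To make the idea of per-segment speaker tags concrete, here is a minimal sketch of splitting a tagged script into segments. The `[name] text` tag syntax, the `Segment` type, and `split_by_speaker` are assumptions for illustration only; this post does not describe S2's actual tag format or API.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical tag syntax: "[speaker_name] text".
# The real S2 tag format is not described in this post.
TAG = re.compile(r"\[(\w+)\]\s*")


@dataclass
class Segment:
    speaker: str
    text: str


def split_by_speaker(script: str) -> List[Segment]:
    """Split a tagged script into one Segment per speaker turn."""
    # With one capture group, re.split returns
    # [prefix, tag1, text1, tag2, text2, ...].
    parts = TAG.split(script)
    segments = []
    for speaker, text in zip(parts[1::2], parts[2::2]):
        text = text.strip()
        if text:  # skip tags that carry no text
            segments.append(Segment(speaker, text))
    return segments
```

Each segment carries exactly one speaker label, which is the property a multi-speaker model needs in order to switch voices at the right boundary.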

Ultra-Low Latency

With the end-to-end architecture, S2 can theoretically begin audio decoding after just the first token is generated. We're also releasing a new model that eliminates the Vocoder entirely, achieving complete text-to-waveform end-to-end modeling with latency targets of 30-50 milliseconds.
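The latency argument can be illustrated with a toy sketch: a streaming decoder can emit its first audio chunk after a single token, while a pipeline that waits for the full sequence cannot. Everything here (the function names, the one-byte "decoder", the token source) is a stand-in made up for illustration, not S2's implementation.

```python
from typing import Callable, Iterable, Iterator, List


def stream_decode(tokens: Iterable[int], decode: Callable[[int], bytes]) -> Iterator[bytes]:
    """Yield one audio chunk per token as soon as that token is available."""
    for token in tokens:
        yield decode(token)


def batch_decode(tokens: Iterable[int], decode: Callable[[int], bytes]) -> List[bytes]:
    """Baseline: collect the full token sequence before decoding anything."""
    all_tokens = list(tokens)
    return [decode(token) for token in all_tokens]


def counting_source(n: int, consumed: List[int]) -> Iterator[int]:
    """Toy token generator that records each token it hands out."""
    for token in range(n):
        consumed.append(token)
        yield token


def toy_decode(token: int) -> bytes:
    """Stand-in for a real audio decoder: one byte per token."""
    return bytes([token])


def tokens_needed_for_first_chunk(streaming: bool, n: int = 5) -> int:
    """Count how many tokens are generated before the first chunk exists."""
    consumed: List[int] = []
    source = counting_source(n, consumed)
    if streaming:
        next(stream_decode(source, toy_decode))
    else:
        batch_decode(source, toy_decode)[0]
    return len(consumed)
```

In this toy setup the streaming path needs one token before audio exists, while the batch path needs all of them, so time to first audio stops growing with utterance length.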

S2 Model Matrix: Tailored for Every Scenario

Kitta AI doesn't pursue a single large model — instead, we build a model matrix for different business scenarios:

S2 Pro

The flagship content generation model, designed for scenarios demanding ultimate voice quality and emotional expressiveness. Ideal for audiobooks, podcasts, film dubbing, and ASMR content creation, reaching new heights in naturalness and expressiveness.

S2 Flash

A 4B-parameter enterprise model optimized for real-time conversational scenarios. With lower latency and higher stability, it's the ideal choice for AI companion apps, real-time voice agents, sales bots, and educational platforms.

Technical Breakthrough: A Data-Driven Revolution

S2's core improvements don't come from model architecture changes, but from a complete reconstruction of the data pipeline. We've built an industry-leading data processing system:

Self-developed emotion-annotating ASR model: World's #1 in emotion annotation accuracy, capable of precisely identifying and labeling emotions and paralanguage (laughter, pauses, emphasis) in speech.

Voice separation model: Accurately separates individual speakers from noisy multi-person conversations, preserving the highly expressive 'dirty data' that traditional pipelines discard.

RLHF pipeline: Combines online user feedback data to build preference datasets and train Reward Models for continuous optimization. Kitta AI is the only voice platform with a complete live RLHF audio preference alignment system.

Global native speaker annotation team: A dedicated multilingual annotation team ensuring data correctness and naturalness.
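For readers unfamiliar with how preference data trains a reward model, the sketch below shows the pairwise (Bradley-Terry style) loss commonly used for this purpose. It is a generic textbook formulation, not a description of Kitta AI's pipeline; the function names and the idea that scores are plain floats are assumptions for illustration.

```python
import math
from typing import List, Tuple


def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Return -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the preferred
    sample above the rejected one, and large otherwise.
    """
    margin = r_chosen - r_rejected
    # Two algebraically equal forms, chosen so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs: List[Tuple[float, float]]) -> float:
    """Average the loss over (chosen_score, rejected_score) pairs."""
    return sum(pairwise_preference_loss(c, r) for c, r in pairs) / len(pairs)
```

Each pair would come from a listener preferring one audio sample over another for the same text; minimizing the mean loss pushes the reward model to rank preferred samples higher.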

Architecture Advantage: End-to-End is the Future

Kitta AI S2 uses an end-to-end autoregressive architecture that unifies semantic and acoustic information modeling. Compared to traditional cascade approaches (text → semantic tokens → acoustic features → waveform), the end-to-end approach has three key advantages:

✓ Stronger expressiveness: Joint semantic-acoustic modeling naturally captures richer prosody and emotional variations.

✓ Lower latency: No need to wait for intermediate modules; decoding can begin from the first token.

✓ Native multi-speaker: The architecture inherently supports multi-speaker scenarios without additional modules.

This is also the architectural direction chosen by next-generation models like Qwen TTS and SESAME. Kitta AI has the longest engineering track record on this path, along with a data advantage.

Open Source Commitment: S2 Will Be Fully Open-Sourced

Kitta AI's CTO has confirmed that S2 will be fully open-sourced. Following the S1 Mini open-source release, Kitta AI continues its commitment to open source, enabling developers to deploy, test, and integrate locally. The community behind its 100K+ GitHub stars will be the first to experience S2's capabilities.

Use Cases: Unleashing the Infinite Possibilities of AI Voice

🎙️

Content Creation

Audiobooks, podcasts, video dubbing, ASMR — S2 Pro delivers near-human emotional expression for professional creators.

💬

AI Companions & Social

Provide warm, expressive voices for AI social apps like Character.AI, making AI conversations feel real.

🎮

Gaming & Entertainment

NPC dialogue, character voicing, VTubing — multi-speaker support brings game worlds to life.

📞

Real-Time Voice Agents

S2 Flash's low latency and high stability perfectly suit customer service, sales, and education scenarios.

🌍

Cross-Language Content

Voice cloning in 13+ languages — train once, use across languages for effortless global content creation.

S1 vs S2: Upgrades at a Glance

Feature	S1	S2
Emotion Control	Basic emotion tags	Open-domain emotion description + mixed emotions
Multi-Speaker	Single speaker focus	Native multi-speaker support
Latency	Standard	Ultra-low latency (30-50ms target)
Data Pipeline	1st generation	Fully rebuilt + self-developed ASR
Post-Training	Basic RLHF	Live RLHF + multi-dimensional Reward Model
Open Source	S1 Mini open-source	S2 fully open-source
Model Matrix	Single model	Pro + Flash dual versions

Release Timeline

March 10, 2026

The Kitta AI S2 model will officially launch on March 10, 2026. Both S2 Pro and S2 Flash will be available simultaneously, via API access or on the Kitta AI platform. The open-source version will follow shortly after launch.