Zen Omni: Unified Multimodal AI
Zen Omni is a 30B MoE model that handles text, vision, and audio in a single unified model. The architecture is called Thinker-Talker: a shared reasoning backbone that branches into modality-specific output heads for text and speech generation.
3B active parameters per forward pass. Sub-300ms speech-to-speech latency.
Thinker-Talker Architecture
Most multimodal systems chain separate models: a speech recognizer feeds into a language model that feeds into a text-to-speech engine. Each handoff adds latency, loses context, and introduces a seam where errors compound.
Zen Omni eliminates the handoffs. The Thinker is a shared MoE backbone that processes all modalities in a unified token space. Text tokens, image tokens, and audio tokens flow through the same attention layers. The Talker is a small output head attached to the Thinker that generates speech directly from the same hidden states that produce text.
This means:
- The model hears your voice and understands your intent directly, without a transcription step
- It speaks its response without converting text to audio after the fact
- Emotional tone and prosody in input speech influence the generated response
- Visual context informs both text and speech outputs simultaneously
Specifications
| Property | Value |
|---|---|
| Total parameters | 30B |
| Active parameters | 3B |
| Architecture | MoE (Thinker-Talker) |
| Text context | 128K tokens |
| Audio context | 10 minutes |
| Speech-to-speech latency | <300ms |
| Languages (text) | 100+ |
| Languages (speech) | 58 |
Real-Time Speech-to-Speech
Under 300ms latency for speech-to-speech exchange makes Zen Omni suitable for real-time voice applications:
- Voice assistants with natural conversational rhythm
- Real-time voice translation (speech in, speech out, same or different language)
- Interactive voice response (IVR) with genuine language understanding
- Accessibility tools for vision-impaired users
300ms is below the conversational latency threshold where pauses become uncomfortable. It is achievable because the Talker generates speech tokens in parallel with the Thinker's reasoning, not sequentially after.
Zen Dub Integration
Zen Omni integrates with Zen Dub, our voice cloning and audio generation model. When Zen Dub is paired with Zen Omni, the speech output can be delivered in a cloned voice rather than the default model voice.
Use cases:
- Branded AI assistants with a consistent voice identity
- Audio content localization that preserves the original speaker's voice
- Personalized voice interfaces
Input and Output Modalities
Input: Text, images, video (up to 5 minutes), audio (up to 10 minutes), interleaved multimodal sequences
Output: Text, speech audio, structured data
The model handles interleaved inputs naturally: a conversation can include text messages, attached images, voice memos, and video clips, all in a single context window. The model understands relationships across all of them.
Get Zen Omni
- HuggingFace: huggingface.co/zenlm
- Hanzo Cloud API:
api.hanzo.ai/v1/chat/completions-- modelzen-omni, voice output viaaudioresponse format - Zen LM: zenlm.org -- audio API documentation
Zach Kelling is the founder of Hanzo AI, Techstars '17.
Read more
Zen Designer: 235B Vision-Language Model
Zen Designer: 235B MoE VLM, 22B active params, OCR in 32 languages, video and layout analysis.
Zen Max: 671B Reasoning Model
Zen Max: 671B MoE, 384 experts, 256K context, abliterated — AIME 2025 99.1%, SWE-Bench 71.3%.
Zen Coder: Code AI from 4B to 480B
Zen Coder: code-specialized model family from 4B to 480B MoE, 128K context, MCP integration.