Zen Omni: Unified Multimodal AI

Zen Omni is a 30B MoE model that handles text, vision, and audio in a single unified model. The architecture is called Thinker-Talker: a shared reasoning backbone that branches into modality-specific output heads for text and speech generation.

3B active parameters per forward pass. Sub-300ms speech-to-speech latency.

Thinker-Talker Architecture

Most multimodal systems chain separate models: a speech recognizer feeds into a language model that feeds into a text-to-speech engine. Each handoff adds latency, loses context, and introduces a seam where errors compound.

Zen Omni eliminates the handoffs. The Thinker is a shared MoE backbone that processes all modalities in a unified token space. Text tokens, image tokens, and audio tokens flow through the same attention layers. The Talker is a small output head attached to the Thinker that generates speech directly from the same hidden states that produce text.

This means:

The model hears your voice and understands your intent directly, without a transcription step
It speaks its response without converting text to audio after the fact
Emotional tone and prosody in input speech influence the generated response
Visual context informs both text and speech outputs simultaneously

Specifications

Property	Value
Total parameters	30B
Active parameters	3B
Architecture	MoE (Thinker-Talker)
Text context	128K tokens
Audio context	10 minutes
Speech-to-speech latency	<300ms
Languages (text)	100+
Languages (speech)	58

Real-Time Speech-to-Speech

Under 300ms latency for speech-to-speech exchange makes Zen Omni suitable for real-time voice applications:

Voice assistants with natural conversational rhythm
Real-time voice translation (speech in, speech out, same or different language)
Interactive voice response (IVR) with genuine language understanding
Accessibility tools for vision-impaired users

300ms is below the conversational latency threshold where pauses become uncomfortable. It is achievable because the Talker generates speech tokens in parallel with the Thinker's reasoning, not sequentially after.

Zen Dub Integration

Zen Omni integrates with Zen Dub, our voice cloning and audio generation model. When Zen Dub is paired with Zen Omni, the speech output can be delivered in a cloned voice rather than the default model voice.

Use cases:

Branded AI assistants with a consistent voice identity
Audio content localization that preserves the original speaker's voice
Personalized voice interfaces

Input and Output Modalities

Input: Text, images, video (up to 5 minutes), audio (up to 10 minutes), interleaved multimodal sequences

Output: Text, speech audio, structured data

The model handles interleaved inputs naturally: a conversation can include text messages, attached images, voice memos, and video clips, all in a single context window. The model understands relationships across all of them.

Get Zen Omni

HuggingFace: huggingface.co/zenlm
Hanzo Cloud API: api.hanzo.ai/v1/chat/completions -- model zen-omni, voice output via audio response format
Zen LM: zenlm.org -- audio API documentation

Zach Kelling is the founder of Hanzo AI, Techstars '17.

Zen Omni: Unified Multimodal AI

Zen Omni: Unified Multimodal AI

Thinker-Talker Architecture

Specifications

Real-Time Speech-to-Speech

Zen Dub Integration

Input and Output Modalities

Get Zen Omni

Read more

Zen Designer: 235B Vision-Language Model

Zen Max: 671B Reasoning Model

Zen Coder: Code AI from 4B to 480B