Google Releases Gemma 4: Frontier Multimodal Intelligence That Runs On Device

Google's Gemma 4 family brings text, image, audio, and video understanding to four model sizes, from a 2.3B-effective-parameter on-device model to a 26B mixture-of-experts model that rivals much larger dense models. The 31B dense variant scores an estimated 1452 on LMArena. All models are open-weight, with context windows of 128K to 256K tokens.

D.O.T.S AI Newsroom · AI News Desk · 3 min read

Google has released the Gemma 4 model family, a collection of open-weight models that includes the most capable on-device AI the company has produced. The family spans four configurations, including a mixture-of-experts variant, and introduces full multimodal support (text, image, audio, and video) across all variants, making it the first Gemma generation to be meaningfully capable across every major input modality.

Four Models, One Architecture Lineage

The Gemma 4 lineup spans a wide capability and size range:

  • Gemma 4 E2B — 2.3B effective parameters (5.1B with embeddings), 128K context window. Designed for on-device deployment on smartphones and edge hardware.
  • Gemma 4 E4B — 4.5B effective parameters (8B with embeddings), 128K context window. The primary on-device model for capable hardware.
  • Gemma 4 31B — Dense 31-billion parameter model, 256K context window. Achieves an estimated LMArena score of 1452 for text tasks.
  • Gemma 4 26B A4B — Mixture-of-experts with 26B total parameters and only 4B activated per forward pass, 256K context. Achieves 1441 on LMArena with a fraction of the compute of the dense model.

Architecture Innovations

Gemma 4 incorporates several architectural components designed to maximize performance at compressed parameter counts. Per-Layer Embeddings (PLE) add a parallel, lower-dimensional conditioning pathway alongside the main residual stream — each decoder layer receives its own token-specific signal, rather than relying on a single embedding to carry information across the entire network depth. This is especially impactful in the smaller E2B and E4B models, where parameter budgets make every architectural decision high-stakes.
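
To make the idea concrete, here is a minimal PyTorch sketch of a per-layer embedding pathway. The module names, dimensions, and the simplified block structure are illustrative only, not taken from Gemma 4's actual implementation, and attention is omitted for brevity:

```python
import torch
import torch.nn as nn

class PerLayerEmbeddingBlock(nn.Module):
    """One decoder block with a parallel, low-dimensional per-layer
    embedding pathway. Illustrative sketch, not Gemma 4's real code."""

    def __init__(self, vocab_size: int, d_model: int, d_ple: int):
        super().__init__()
        # Each layer owns its own small token-indexed table, separate
        # from the shared input embedding at the bottom of the stack.
        self.ple = nn.Embedding(vocab_size, d_ple)
        self.ple_proj = nn.Linear(d_ple, d_model, bias=False)
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # Inject a token-specific signal at this depth, so information
        # need not survive the full stack inside the residual stream.
        hidden = hidden + self.ple_proj(self.ple(token_ids))
        return hidden + self.mlp(self.norm(hidden))

block = PerLayerEmbeddingBlock(vocab_size=32_000, d_model=512, d_ple=64)
h = torch.randn(1, 8, 512)              # (batch, seq, d_model) residual stream
ids = torch.randint(0, 32_000, (1, 8))  # token ids for this sequence
print(block(h, ids).shape)              # torch.Size([1, 8, 512])
```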

The family also uses alternating local and global attention layers, shared KV cache across the final N layers (eliminating redundant projections), and a vision encoder that preserves original aspect ratios with flexible token budgets. The audio encoder is a USM-style conformer, enabling speech and audio understanding alongside vision.
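
A rough sketch of how an alternating local/global layout can be expressed as per-layer attention masks follows. The window size and the local-to-global ratio below are placeholders, since the release details covered here do not pin them down:

```python
import torch

def layer_attention_mask(layer_idx: int, seq_len: int,
                         window: int = 1024, global_every: int = 6) -> torch.Tensor:
    """Boolean causal mask for one layer: full attention on every
    `global_every`-th layer, sliding-window attention otherwise."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions (rows)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions (columns)
    causal = j <= i
    if layer_idx % global_every == global_every - 1:
        return causal                        # global layer: full causal attention
    return causal & (i - j < window)         # local layer: recent tokens only

# Under this placeholder 5:1 pattern, layer 5 is global, layers 0-4 local.
print(layer_attention_mask(5, 8, window=4)[-1])  # last query sees all 8 positions
print(layer_attention_mask(0, 8, window=4)[-1])  # last query sees only the last 4
```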

Multimodal Capability and Fine-Tuning

All Gemma 4 models support image, audio, and video inputs, with the vision encoder handling multiple token budgets (70, 140, 280, 560, and 1120 tokens per image) to balance visual detail against context consumption. The Hugging Face team notes that multimodal performance is "comparatively as good as text generation" in informal testing, a claim that, if it holds under systematic evaluation, would represent a meaningful step forward from previous open-weight multimodal models.
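
A fixed budget ladder invites a simple allocation policy. The helper below is our own illustration, not a documented Google policy; it picks the largest per-image budget that still leaves room for the text turns:

```python
# The five budgets come from the release; the heuristic is hypothetical.
BUDGETS = (70, 140, 280, 560, 1120)

def pick_image_budget(context_window: int, n_images: int,
                      reserve_for_text: int = 2048) -> int:
    """Largest per-image token budget that still leaves
    `reserve_for_text` tokens of context for text."""
    per_image = (context_window - reserve_for_text) // max(n_images, 1)
    affordable = [b for b in BUDGETS if b <= per_image]
    return max(affordable) if affordable else BUDGETS[0]

print(pick_image_budget(131_072, 8))   # 1120: full detail fits easily in 128K
print(pick_image_budget(8_192, 40))    # 140: many images force coarser detail
```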

Fine-tuning support is available through TRL, which has been upgraded to handle multimodal tool responses during training — meaning models can receive image outputs from tools during reinforcement learning episodes, not just text. A demo fine-tuning script trains Gemma 4 to drive a vehicle in the CARLA simulator using camera input, providing a concrete example of the agentic multimodal use cases the architecture is designed to support.
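
As a starting point, a plain text supervised fine-tune with TRL looks like the sketch below. The Gemma 4 checkpoint name is a guess at the naming scheme, and the multimodal tool-response features described above require additional configuration not shown here:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Any conversational dataset in the standard chat format works here.
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="google/gemma-4-e2b-it",      # hypothetical checkpoint id
    train_dataset=dataset,
    args=SFTConfig(output_dir="gemma4-sft"),
)
trainer.train()
```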

Deployment Ecosystem

Day-0 support is available through Transformers, MLX (with TurboQuant for ~4x memory reduction on Apple Silicon), Ollama, and mistral.rs. The models are available on Hugging Face under Google's standard open model license. For the AI developer community, Gemma 4's combination of on-device performance at the small end and frontier-class quality at the large end — all in one architectural family with consistent fine-tuning tooling — marks a meaningful expansion of what open-weight models can deliver.
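
For readers who want to try the models immediately, loading through Transformers should take only a few lines; the checkpoint id below is a guess at the repository naming, not a confirmed path:

```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-4-e2b-it",  # hypothetical checkpoint id
    device_map="auto",
)
messages = [{"role": "user", "content": "Summarize the Gemma 4 lineup."}]
print(pipe(messages, max_new_tokens=128)[0]["generated_text"])
```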
