
Google's Gemma 4: A Deep Dive into the New Open-Source AI Models


10xTeam · March 29, 2026 · 14 min read

Google has just unveiled Gemma 4, a release that brings four new models, a significant licensing change, and some genuinely interesting architectural decisions to the forefront. With day-one support for local execution, there’s a lot to unpack. Let’s break it all down.

The Game-Changing License: Apache 2.0

Perhaps the most significant news, even more so than the models themselves, is the shift to an Apache 2.0 license. This is a major development for the open-source and local LLM communities.

The previous iteration, Gemma 3, was released under a custom Google license that restricted commercial use and redistribution, which led many developers to avoid it entirely. Gemma 4, however, adopts the permissive Apache 2.0 license already used by popular open-weight families such as Qwen and Mistral.

What does this mean? Freedom. You can use it, modify it, and sell products built upon it. You can deploy it however you see fit, with no strings attached or convoluted clauses. Google is finally competing on the same level playing field as other major players in open-source AI, and that’s a bigger deal than any benchmark score.

Meet the Gemma 4 Family

Google has released a diverse family of four models, each tailored for different needs.

  • Gemma 4 2B (2 models): These are the smallest of the bunch, designed as “edge” models for phones and other small devices. Despite their tiny footprint, they incorporate clever tricks and are the only models in the family with native audio input support, making them ideal for on-device voice assistance and speech recognition.
  • Gemma 4 26B-MoE: This is the model we’ll focus on in this article. It’s a Mixture-of-Experts (MoE) model with 25.2 billion total parameters, but only 3.8 billion are active at any given time. This architecture allows it to run at the speed of a ~4 billion parameter model while delivering quality closer to a 30 billion parameter one. It boasts impressive multimodal capabilities, supporting text, images, and even video input.
  • Gemma 4 31B: This is the flagship, a dense model that brings raw power to the table. All 30.7 billion of its parameters are engaged for every single token. It represents the highest quality model in the family and is positioned as the ideal base for fine-tuning. It currently holds the #3 spot on the LMSys Arena leaderboard for open models, which is a testament to its power.

A Closer Look at the 26B-MoE Architecture

Diving into the specifics of the 26B-MoE model reveals some fascinating design choices, based on a careful review of the model files themselves, not just press releases.

The model utilizes 128 “experts.” For each token processed, a router selects the best eight experts, plus one shared expert that is always active. This means only nine experts (about 7% of the total) are working at any given moment.
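To make the routing concrete, here is a minimal sketch of how a top-k MoE router typically works. The dimensions and weights below are illustrative stand-ins, not values from the released model files:

```python
import math, random

def route_token(hidden, router_weights, k=8):
    """Score every expert for one token and keep the top k (a standard MoE router).

    hidden:         list[float], the token's activation vector
    router_weights: one weight vector per expert (n_experts x d_model)
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    top_k = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
    peak = max(logits[i] for i in top_k)
    exps = [math.exp(logits[i] - peak) for i in top_k]
    gates = [e / sum(exps) for e in exps]  # softmax over the chosen experts only
    return top_k, gates

random.seed(0)
d_model, n_experts = 64, 128
hidden = [random.gauss(0, 1) for _ in range(d_model)]
weights = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_experts)]
experts, gates = route_token(hidden, weights)
# 8 routed experts plus the always-on shared expert = 9 of 129 active (~7%)
print(len(experts), f"{9 / 129:.0%}")  # 8 7%
```

The gate values weight each selected expert's output before they are summed, which is how the model gets MoE quality at a fraction of the dense compute.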

Key architectural details include:

  • 30 layers
  • 16 attention heads
  • A massive 262,000-token vocabulary. For comparison, most models operate with a vocabulary between 32,000 and 150,000 tokens. A larger vocabulary allows the model to represent more words and symbols as single tokens, boosting efficiency.
  • A 256,000-token context window, roughly the length of a 500-page book.
  • Each expert’s feed-forward network has a hidden dimension of just 704. These are not large, general-purpose experts, but rather tiny, highly specialized units.

Novel Architectural Innovations

Gemma 4 introduces three architectural choices that set it apart from other open models.

1. Hybrid Sliding and Global Attention: Instead of every layer processing the full context—an expensive operation—Gemma 4 alternates. Five consecutive layers use a “sliding window” attention, looking only at the nearest 1,024 tokens. Then, a single layer is given full “global attention,” allowing it to see the entire context. This pattern repeats across all 30 layers, with the final layer always being global. The idea is simple: most context is local and cheap to process. Global context is reserved for when it truly matters, giving the model a chance to see the bigger picture periodically.
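The 5:1 pattern described above is easy to sketch as a layer schedule. The layer count and ratio come from the article; the representation here is illustrative, not the model's actual config format:

```python
def attention_schedule(n_layers=30, window_run=5):
    """Every (window_run + 1)-th layer gets global attention; the rest slide."""
    return ["global" if (i + 1) % (window_run + 1) == 0 else "sliding"
            for i in range(n_layers)]

schedule = attention_schedule()
print(schedule.count("sliding"), schedule.count("global"))  # 25 5
print(schedule[-1])  # global -- the final layer always sees the full context
```

With 30 layers, that works out to 25 cheap sliding-window layers and just 5 global ones, with layer 30 landing on a global slot.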

2. Dynamic KV Head Allocation: The number of Key-Value (KV) heads changes depending on the layer type.

  • Sliding window layers use 8 KV heads with 256-dimensional keys/values.
  • Global layers use only 2 KV heads but with larger 512-dimensional keys/values.

Fewer heads on the global layers drastically reduce the memory required for the KV cache for long-range context. This is a key factor in making the massive 256k context window practical.
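A back-of-the-envelope calculation shows why this matters. The sketch below assumes an fp16 cache and the head counts above; a real inference engine lays its cache out differently, but the ratios hold:

```python
def kv_cache_bytes(n_layers, tokens, n_kv_heads, head_dim, bytes_per_val=2):
    """Bytes to cache both keys and values (the factor of 2) in fp16."""
    return n_layers * tokens * 2 * n_kv_heads * head_dim * bytes_per_val

CONTEXT = 256_000
# 25 sliding layers only ever cache their 1,024-token window
sliding = kv_cache_bytes(n_layers=25, tokens=1_024, n_kv_heads=8, head_dim=256)
# 5 global layers cache the full context, but with only 2 narrow-count heads
global_ = kv_cache_bytes(n_layers=5, tokens=CONTEXT, n_kv_heads=2, head_dim=512)
# If global layers kept the sliding layers' 8 heads x 256 dims, cache doubles
global_wide = kv_cache_bytes(n_layers=5, tokens=CONTEXT, n_kv_heads=8, head_dim=256)
print(f"sliding: {sliding / 1e9:.2f} GB, global: {global_ / 1e9:.2f} GB, "
      f"global with 8 heads: {global_wide / 1e9:.2f} GB")
# sliding: 0.21 GB, global: 5.24 GB, global with 8 heads: 10.49 GB
```

The sliding layers' cache is essentially free, and halving the key/value width on the global layers (2 × 512 = 1,024 dims vs. 8 × 256 = 2,048) cuts the long-range cache in half, which is what makes a 256k window fit on real hardware.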

3. Dual RoPE Frequencies: The model uses two different Rotary Position Embedding (RoPE) frequencies.

  • Sliding window layers use a standard frequency of 10,000.
  • Global layers use a frequency of 1 million—100 times higher.

This allows the positional encoding to scale to much longer sequences without losing resolution, which is crucial for handling the extensive context window.
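The effect of the two bases falls out of the standard RoPE formula, where the wavelength of the slowest-rotating dimension pair scales with the base. The head dimension below is illustrative, not taken from the model files:

```python
import math

def rope_wavelengths(base, head_dim=256):
    """Wavelength in tokens of each rotary pair: 2*pi / theta_i, theta_i = base**(-2i/d)."""
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]

local_ = rope_wavelengths(base=10_000)      # sliding-window layers
global_ = rope_wavelengths(base=1_000_000)  # global layers
# The slowest-rotating pair sets how far apart positions stay distinguishable:
# under ~60k tokens at base 10,000, versus several million at base 1,000,000,
# which is what lets the global layers address a 256k-token window.
print(f"slowest pair: {local_[-1]:,.0f} vs {global_[-1]:,.0f} tokens")
```

In short, the sliding layers never see beyond 1,024 tokens, so the standard base is plenty; only the global layers need the stretched positional encoding.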

Gemma 4 vs. Qwen

Compared to the recently released Qwen 2.5, Gemma 4 holds its own. While both have 128 experts, Gemma 4 activates 8+1 shared experts versus Qwen’s 8. The most significant difference is the context window: Gemma 4’s 256k tokens dwarf Qwen’s 32k. Furthermore, Gemma 4’s native multimodal support for text, images, and video gives it a clear advantage over the text-only Qwen model.

Performance Benchmarks

The benchmarks reveal an incredibly efficient trade-off. The 26B-MoE model scores within 1-3% of the full 31B dense model across math, reasoning, science, and coding benchmarks, all while using roughly 1/8th of the compute per token.

On the LMSys Arena leaderboard, the 31B model is ranked #3 among open models, with the 26B-MoE close behind at #6.

Running Gemma 4 Locally: A Hands-On Test

To put the model through its paces, the 26B-MoE version was downloaded from Hugging Face and run locally on a PC with an RTX 3060 (12GB VRAM). Despite the model being over 16GB and requiring some CPU offloading, performance was impressive.

General Q&A with llama.cpp: When asked to explain the importance of the Apache 2.0 license, the model provided a comprehensive and accurate answer, generating text at nearly 13 tokens per second. It correctly identified the license’s role in providing legal certainty for corporations and a clear framework for developers, calling it the “gold standard for enterprise-grade open source.”
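For reference, a session like this can be reproduced with llama.cpp's `llama-cli` roughly as follows. The GGUF filename is hypothetical, and the `-ngl` value (layers offloaded to the GPU) needs tuning down if the model exceeds your VRAM:

```shell
# Hypothetical quantized model file; pick a GGUF that fits your hardware.
# -ngl: layers offloaded to GPU, -c: context size, -p: prompt.
./llama-cli -m gemma4-26b-moe-q4_k_m.gguf -ngl 24 -c 8192 \
  -p "Explain why the Apache 2.0 license matters for enterprises."
```

With a 12GB card and a 16GB+ model, some layers inevitably run on the CPU, which is consistent with the ~13 tokens per second observed here.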

Coding with a Local Agent: Next, the model was integrated into a coding agent to test its programming abilities. The task: build a simple Blackjack game.

The model chose Python for the backend logic and HTML/CSS/JavaScript for the frontend. It proceeded to write the code for both parts quickly and efficiently.

Here is the Python script it generated:

import random

def create_deck():
    """Creates a standard 52-card deck."""
    ranks = ['2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K', 'A']
    suits = ['Hearts', 'Diamonds', 'Clubs', 'Spades']
    return [{'rank': rank, 'suit': suit} for suit in suits for rank in ranks]

def get_card_value(card):
    """Calculates the value of a card."""
    if card['rank'] in ['J', 'Q', 'K']:
        return 10
    if card['rank'] == 'A':
        return 11  # Initially treat Ace as 11
    return int(card['rank'])

def get_hand_value(hand):
    """Calculates the total value of a hand."""
    value = sum(get_card_value(card) for card in hand)
    num_aces = sum(1 for card in hand if card['rank'] == 'A')
    while value > 21 and num_aces:
        value -= 10
        num_aces -= 1
    return value

# This would typically be part of a larger application structure
# For this example, we'll assume functions are called by a web server

And here is the index.html file for the frontend:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Blackjack</title>
    <style>
        body { font-family: sans-serif; text-align: center; background-color: #282c34; color: white; }
        .hand { margin: 20px; }
        button { padding: 10px 20px; margin: 5px; font-size: 16px; cursor: pointer; }
        #message { font-size: 24px; margin-top: 20px; }
    </style>
</head>
<body>
    <h1>Blackjack</h1>

    <div class="hand">
        <h2>Player Hand</h2>
        <div id="player-hand"></div>
        <p>Score: <span id="player-score">0</span></p>
    </div>

    <div class="hand">
        <h2>Dealer Hand</h2>
        <div id="dealer-hand"></div>
        <p>Score: <span id="dealer-score">0</span></p>
    </div>

    <button id="hit-button">Hit</button>
    <button id="stand-button">Stand</button>
    <button id="reset-button">New Game</button>

    <p id="message"></p>

    <script>
        // Game state variables
        let deck, playerHand, dealerHand;

        // DOM elements
        const playerHandDiv = document.getElementById('player-hand');
        const dealerHandDiv = document.getElementById('dealer-hand');
        const playerScoreSpan = document.getElementById('player-score');
        const dealerScoreSpan = document.getElementById('dealer-score');
        const messageP = document.getElementById('message');
        const hitButton = document.getElementById('hit-button');
        const standButton = document.getElementById('stand-button');
        const resetButton = document.getElementById('reset-button');

        function createDeck() {
            const ranks = ['2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K', 'A'];
            const suits = ['♥', '♦', '♣', '♠'];
            let newDeck = [];
            for (let suit of suits) {
                for (let rank of ranks) {
                    newDeck.push({ rank, suit });
                }
            }
            return newDeck;
        }

        function shuffle(deck) {
            for (let i = deck.length - 1; i > 0; i--) {
                const j = Math.floor(Math.random() * (i + 1));
                [deck[i], deck[j]] = [deck[j], deck[i]];
            }
        }

        function getCardValue(card) {
            if (['J', 'Q', 'K'].includes(card.rank)) return 10;
            if (card.rank === 'A') return 11;
            return parseInt(card.rank);
        }

        function getHandValue(hand) {
            let value = hand.reduce((sum, card) => sum + getCardValue(card), 0);
            let numAces = hand.filter(card => card.rank === 'A').length;
            while (value > 21 && numAces > 0) {
                value -= 10;
                numAces--;
            }
            return value;
        }

        function renderHand(hand, div) {
            div.innerHTML = hand.map(card => `${card.rank}${card.suit}`).join(' ');
        }

        function startGame() {
            deck = createDeck();
            shuffle(deck);
            playerHand = [deck.pop(), deck.pop()];
            dealerHand = [deck.pop(), deck.pop()];
            
            updateUI();
            messageP.textContent = '';
            hitButton.disabled = false;
            standButton.disabled = false;
        }

        function updateUI() {
            renderHand(playerHand, playerHandDiv);
            renderHand(dealerHand, dealerHandDiv);
            playerScoreSpan.textContent = getHandValue(playerHand);
            dealerScoreSpan.textContent = getHandValue(dealerHand);
        }

        hitButton.addEventListener('click', () => {
            playerHand.push(deck.pop());
            updateUI();
            if (getHandValue(playerHand) > 21) {
                endGame("Player busts! Dealer wins.");
            }
        });

        standButton.addEventListener('click', () => {
            while (getHandValue(dealerHand) < 17) {
                dealerHand.push(deck.pop());
            }
            updateUI();
            const playerScore = getHandValue(playerHand);
            const dealerScore = getHandValue(dealerHand);

            if (dealerScore > 21 || playerScore > dealerScore) {
                endGame("Player wins!");
            } else if (dealerScore > playerScore) {
                endGame("Dealer wins!");
            } else {
                endGame("It's a push (tie)!");
            }
        });

        resetButton.addEventListener('click', startGame);

        function endGame(msg) {
            messageP.textContent = msg;
            hitButton.disabled = true;
            standButton.disabled = true;
        }

        startGame();
    </script>
</body>
</html>

The result was a simple but fully functional Blackjack game. The logic was correct, and it demonstrated the model’s capability as a competent coding agent, all while running on a local machine.

Final Thoughts

Gemma 4 is a genuinely interesting and powerful release. The move to an Apache 2.0 license is a massive win for the open-source community. The innovative architecture of the MoE model delivers remarkable efficiency without a major compromise on quality. Its strong performance, multimodal capabilities, and ability to run on consumer hardware make it a compelling option for developers and researchers. The community is undoubtedly already working to customize and enhance these models, and it will be exciting to see what they build.


