A step-by-step guide to running Google's Gemma 4 26B Mixture-of-Experts model locally on Apple Silicon using llama.cpp, mmap, and Metal - achieving 49 tok/s with under 6 GB of RAM.

Large language models usually demand large amounts of RAM. But what if a 26-billion-parameter model could run on your Mac with only about 6 GB of memory? That is exactly what Gemma 4 26B MOE delivers when paired with llama.cpp and memory-mapped files.
In this post, we walk through the full setup - from installation to benchmarks - and explain why this combination works so well on Apple Silicon.
Google's Gemma 4 26B MOE has 25.2 billion total parameters, but it only activates 3.8 billion per token. The secret is its Mixture-of-Experts (MOE) architecture: 128 experts in total, with only 8 active at a time. This means the model performs like a much smaller network on each forward pass, while retaining the knowledge capacity of a far larger one.
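A quick back-of-the-envelope calculation (using only the figures above) shows just how sparse the activation is:

```python
# Sparsity arithmetic for Gemma 4 26B MOE, from the figures stated above.
total_params = 25.2e9    # total parameters
active_params = 3.8e9    # parameters activated per token
experts_total = 128
experts_active = 8

# Fraction of experts consulted per token.
expert_fraction = experts_active / experts_total

# Fraction of all weights touched per token (shared, non-expert layers
# push this above the pure expert fraction).
active_fraction = active_params / total_params

print(f"experts active per token: {expert_fraction:.2%}")    # 6.25%
print(f"parameters active per token: {active_fraction:.1%}")  # ~15.1%
```

So each forward pass reads roughly one-seventh of the weights, which is exactly the property mmap exploits below.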
The key enabler here is mmap (memory mapping). Instead of loading the entire model file into RAM, the operating system maps the file on disk so the program can access it as if it were in memory. Only the pages that are actually read get loaded into physical RAM.
How mmap works in practice: the model file is mapped once at startup, and pages are pulled into physical RAM lazily, only when the inference code actually reads them; untouched regions cost nothing, and the OS can evict clean pages under memory pressure.
Why MOE models benefit especially: each token touches only its 8 active experts, so most expert weights are read rarely or never, and their pages simply stay on disk.
The tradeoff is that performance depends on fast storage, since some reads come from disk rather than RAM. But on Apple Silicon, this works remarkably well thanks to unified memory, Metal GPU acceleration, and fast NVMe storage.
The result: a 25.9 GB model file runs with only about 5.9 GB of RAM.
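The mechanism is easy to demonstrate with Python's built-in mmap module. The snippet below maps a small scratch file rather than a real GGUF (the file contents are stand-ins of my own invention), but the semantics are identical: the file becomes addressable without an explicit read, and the OS pages in only what is touched.

```python
import mmap
import os
import tempfile

# Create a scratch file standing in for a model file.
# (A real 25.9 GB GGUF would be mapped the same way, just much larger.)
data = b"active-expert-weights"
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"\x00" * 1024 * 1024)   # 1 MiB of "inactive expert" bytes
    f.seek(512 * 1024)
    f.write(data)                    # a small "active" region in the middle

with open(path, "rb") as f:
    # Mapping reserves address space; nothing is read from disk yet.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Slicing the map faults in only the pages backing this range -
    # the rest of the file never enters physical RAM.
    chunk = mm[512 * 1024 : 512 * 1024 + len(data)]
    mm.close()

os.remove(path)
print(chunk)  # b'active-expert-weights'
```

llama.cpp does the same thing with the GGUF file: the whole 25.9 GB is mapped, but only the pages backing the active experts and shared layers are ever resident.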
Install llama.cpp via Homebrew:
```bash
brew install llama.cpp
```
Verify the version:
```bash
llama-cli --version
```
You need version 8680 or newer for Gemma 4 support.
The recommended quantization is Unsloth UD-Q8_K_XL, which offers the best quality at 27.9 GB download size (25.9 GB model file):
```bash
mkdir -p ~/.cache/gguf
cd ~/.cache/gguf
curl -L -o gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
  "https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf"
```
Smaller alternatives are available (Q8_0, UD-Q6_K, UD-Q5_K_M, UD-Q4_K_M) if you want a different size/quality tradeoff.
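As a sanity check on the quantization level, the stated file size implies roughly 8 bits per weight (treating GB as 10^9 bytes - that unit assumption is mine, not from the model card):

```python
# Rough bits-per-weight for the UD-Q8_K_XL file, from the sizes above.
file_bytes = 25.9e9     # model file size on disk
total_params = 25.2e9   # total parameter count
bits_per_weight = file_bytes * 8 / total_params
print(f"{bits_per_weight:.2f} bits/weight")  # ~8.22, consistent with Q8
```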
Start an OpenAI-compatible local server:
```bash
llama-server \
  -m ~/.cache/gguf/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8082
```
The flags:

- `-m` sets the model path
- `-ngl 99` offloads all layers to the GPU via Metal
- `--port 8082` exposes the local API server

Once the server is running, send a request with curl:
```bash
curl http://localhost:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4",
    "messages": [{"role": "user", "content": "What is NSCLC?"}],
    "max_tokens": 200
  }'
```
Example response:
```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "NSCLC stands for non-small cell lung cancer. It is the most common type of lung cancer and includes subtypes such as adenocarcinoma, squamous cell carcinoma, and large cell carcinoma."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 36,
    "total_tokens": 48
  }
}
```
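Since the endpoint is OpenAI-compatible, you can also call it from any HTTP client. A minimal sketch using only the Python standard library (the model name and prompt are placeholders, and the server must be running on port 8082 for `chat` to succeed):

```python
import json
import urllib.request

def build_payload(prompt: str, max_tokens: int = 200) -> dict:
    """Request body in the OpenAI chat-completions format llama-server accepts."""
    return {
        "model": "gemma4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8082") -> dict:
    """POST a chat completion to the local server and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With the server running:
#   reply = chat("What is NSCLC?")
#   print(reply["choices"][0]["message"]["content"])
#   # reply["usage"] breaks down prompt vs. completion tokens, as above.
```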
Tested on an Apple M1 Max with 64 GB RAM, generation runs at roughly 49 tok/s while resident memory stays around 5.9 GB.
For comparison, consider a smaller dense model, MedGemma 1.5 4B (bf16): the larger MOE model is actually faster and uses less RAM than the smaller dense model, because a dense model activates all of its weights on every token while the MOE model touches only 3.8 billion of its 25.2 billion parameters.
A much larger MOE model can run locally with surprisingly low memory because only a small fraction of its experts are active per token, while mmap keeps the inactive weights on SSD. Combined with Apple Silicon's unified memory, Metal acceleration, and fast NVMe storage, this makes running serious models on a laptop entirely practical - no cloud required.