A step-by-step guide to running Google's Gemma 4 26B Mixture-of-Experts model locally on Apple Silicon using llama.cpp, mmap, and Metal - achieving 49 tok/s with under 6 GB of RAM.

Large language models usually demand large amounts of RAM. But what if a 26-billion-parameter model could run on your Mac with only about 6 GB of memory? That is exactly what Gemma 4 26B MOE delivers when paired with llama.cpp and memory-mapped files.
In this post, we walk through the full setup - from installation to benchmarks - and explain why this combination works so well on Apple Silicon.
Google's Gemma 4 26B MOE has 25.2 billion total parameters, but it only activates 3.8 billion per token. The secret is its Mixture-of-Experts (MOE) architecture: 128 experts in total, with only 8 active at a time. This means the model performs like a much smaller network on each forward pass, while retaining the knowledge capacity of a far larger one.
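A quick back-of-the-envelope calculation (using only the figures above) shows just how sparse the activation is:

```python
# Sparsity arithmetic for Gemma 4 26B MOE, from the figures stated above.
total_params = 25.2e9    # total parameters
active_params = 3.8e9    # parameters activated per token
experts_total = 128
experts_active = 8

# Fraction of experts consulted per token.
expert_fraction = experts_active / experts_total

# Fraction of all weights touched per token (shared, non-expert layers
# push this above the pure expert fraction).
active_fraction = active_params / total_params

print(f"experts active per token: {expert_fraction:.2%}")    # 6.25%
print(f"parameters active per token: {active_fraction:.1%}")  # ~15.1%
```

So each forward pass reads roughly one-seventh of the weights, which is exactly the property mmap exploits below.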
The key enabler here is mmap (memory mapping). Instead of loading the entire model file into RAM, the operating system maps the file on disk so the program can access it as if it were in memory. Only the pages that are actually read get loaded into physical RAM.
How mmap works in practice: the model file is mapped once at startup, and pages are pulled into physical RAM lazily, only when the inference code actually reads them; untouched regions cost nothing, and the OS can evict clean pages under memory pressure.
Why MOE models benefit especially: each token touches only its 8 active experts, so most expert weights are read rarely or never, and their pages simply stay on disk.
The tradeoff is that performance depends on fast storage, since some reads come from disk rather than RAM. But on Apple Silicon, this works remarkably well thanks to unified memory, Metal GPU acceleration, and fast NVMe storage.
The result: a 25.9 GB model file runs with only about 5.9 GB of RAM.
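The mechanism is easy to demonstrate with Python's built-in mmap module. The snippet below maps a small scratch file rather than a real GGUF (the file contents are stand-ins of my own invention), but the semantics are identical: the file becomes addressable without an explicit read, and the OS pages in only what is touched.

```python
import mmap
import os
import tempfile

# Create a scratch file standing in for a model file.
# (A real 25.9 GB GGUF would be mapped the same way, just much larger.)
data = b"active-expert-weights"
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"\x00" * 1024 * 1024)   # 1 MiB of "inactive expert" bytes
    f.seek(512 * 1024)
    f.write(data)                    # a small "active" region in the middle

with open(path, "rb") as f:
    # Mapping reserves address space; nothing is read from disk yet.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Slicing the map faults in only the pages backing this range -
    # the rest of the file never enters physical RAM.
    chunk = mm[512 * 1024 : 512 * 1024 + len(data)]
    mm.close()

os.remove(path)
print(chunk)  # b'active-expert-weights'
```

llama.cpp does the same thing with the GGUF file: the whole 25.9 GB is mapped, but only the pages backing the active experts and shared layers are ever resident.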
Install llama.cpp via Homebrew:
```bash
brew install llama.cpp
```
Verify the version:
```bash
llama-cli --version
```
You need version 8680 or newer for Gemma 4 support.
The recommended quantization is Unsloth UD-Q8_K_XL, which offers the best quality at 27.9 GB download size (25.9 GB model file):
```bash
mkdir -p ~/.cache/gguf
cd ~/.cache/gguf
curl -L -o gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
  "https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf"
```
Smaller alternatives are available (Q8_0, UD-Q6_K, UD-Q5_K_M, UD-Q4_K_M) if you want a different size/quality tradeoff.
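As a sanity check on the quantization level, the stated file size implies roughly 8 bits per weight (treating GB as 10^9 bytes - that unit assumption is mine, not from the model card):

```python
# Rough bits-per-weight for the UD-Q8_K_XL file, from the sizes above.
file_bytes = 25.9e9     # model file size on disk
total_params = 25.2e9   # total parameter count
bits_per_weight = file_bytes * 8 / total_params
print(f"{bits_per_weight:.2f} bits/weight")  # ~8.22, consistent with Q8
```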
Start an OpenAI-compatible local server:
```bash
llama-server \
  -m ~/.cache/gguf/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8082
```
The flags:

- `-m` sets the model path
- `-ngl 99` offloads all layers to the GPU via Metal
- `--port 8082` exposes the local API server

Once the server is running, send a request with curl:
```bash
curl http://localhost:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4",
    "messages": [{"role": "user", "content": "What is NSCLC?"}],
    "max_tokens": 200
  }'
```
Example response:
```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "NSCLC stands for non-small cell lung cancer. It is the most common type of lung cancer and includes subtypes such as adenocarcinoma, squamous cell carcinoma, and large cell carcinoma."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 36,
    "total_tokens": 48
  }
}
```
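Since the endpoint is OpenAI-compatible, you can also call it from any HTTP client. A minimal sketch using only the Python standard library (the model name and prompt are placeholders, and the server must be running on port 8082 for `chat` to succeed):

```python
import json
import urllib.request

def build_payload(prompt: str, max_tokens: int = 200) -> dict:
    """Request body in the OpenAI chat-completions format llama-server accepts."""
    return {
        "model": "gemma4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8082") -> dict:
    """POST a chat completion to the local server and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With the server running:
#   reply = chat("What is NSCLC?")
#   print(reply["choices"][0]["message"]["content"])
#   # reply["usage"] breaks down prompt vs. completion tokens, as above.
```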
Tested on an Apple M1 Max with 64 GB RAM, generation runs at roughly 49 tok/s while resident memory stays around 5.9 GB.
For comparison, consider a smaller dense model, MedGemma 1.5 4B (bf16): the larger MOE model is actually faster and uses less RAM than the smaller dense model, because a dense model activates all of its weights on every token while the MOE model touches only 3.8 billion of its 25.2 billion parameters.
A much larger MOE model can run locally with surprisingly low memory because only a small fraction of its experts are active per token, while mmap keeps the inactive weights on SSD. Combined with Apple Silicon's unified memory, Metal acceleration, and fast NVMe storage, this makes running serious models on a laptop entirely practical - no cloud required.