Model Reference
Last verified: April 2026. This page is updated regularly. Model recommendations change as new releases appear. If something looks outdated, check Ollama’s library and HuggingFace trending for the latest.
Recommended Starters
One pick per tier. Install Ollama, then run the command.
| Your RAM | Model | Command | Why this one |
|---|---|---|---|
| 8 GB | Qwen 3.5 4B | ollama pull qwen3.5:4b | Multimodal, thinking mode, 140+ languages. ~3.4 GB at Q4. |
| 16 GB | Qwen 3.5 9B | ollama pull qwen3.5:9b | Outperforms much larger models on reasoning benchmarks, including models with 5x more parameters on GPQA Diamond. 256K context. |
| 32 GB | Qwen 3.5 27B | ollama pull qwen3.5:27b | Handles multi-step reasoning and long documents better than 9B models. The best open-weight dense model at its size class. |
| 64 GB+ | Qwen 3.5 35B-A3B | ollama pull qwen3.5:35b-a3b | MoE model: only 3B active parameters, so it runs fast with 35B-class quality. Fits in 32 GB; recommended at 64 GB+ for full performance. |
| 4 GB or less | Qwen 3.5 0.8B | ollama pull qwen3.5:0.8b | Tiny but functional. Runs on almost anything. |
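The table maps directly onto a lookup. As a minimal sketch (the `STARTERS` list and `pick_starter` helper are hypothetical names, not part of Ollama), the following returns the command for a given amount of RAM, using only the tags listed above:

```python
# Starter picks copied from the table above: (minimum RAM in GB, Ollama tag).
# Ordered from the largest tier down so the first match is the biggest fit.
STARTERS = [
    (64, "qwen3.5:35b-a3b"),
    (32, "qwen3.5:27b"),
    (16, "qwen3.5:9b"),
    (8, "qwen3.5:4b"),
    (0, "qwen3.5:0.8b"),
]


def pick_starter(ram_gb: float) -> str:
    """Return the `ollama pull` command for the largest tier that fits."""
    for min_ram, tag in STARTERS:
        if ram_gb >= min_ram:
            return f"ollama pull {tag}"
    raise ValueError("ram_gb must be non-negative")


if __name__ == "__main__":
    print(pick_starter(16))
```

Anything under 8 GB falls through to the 0.8B model, which is the conservative choice; the 35B-A3B pick is only returned at 64 GB and up, matching the recommendation in the table.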
All Models by Hardware Tier
8 GB RAM: Small Models (1B-4B)
| Model | Params | Active | RAM Needed | Strengths | Ollama |
|---|---|---|---|---|---|
| Qwen 3.5 4B | 4B | 4B (dense) | ~3.4 GB | Multimodal, thinking mode, 262K context | qwen3.5:4b |
| Gemma 4 E2B | 5B | 5B (dense) | ~3.5 GB | Any-to-Any (audio+video+image+text), edge capacity | gemma4:e2b |
| Gemma 4 E4B | 8B | 8B (dense) | ~5.5 GB | Any-to-Any multimodal, PLE & Shared KV cache | gemma4:e4b |
| Nemotron 3 Nano 4B | 4B | 4B | ~5 GB | Hybrid Mamba-Transformer, 1M context, agentic | nemotron-3-nano |
| Phi-4 Mini | 3.8B | 3.8B (dense) | ~3 GB | Microsoft, strong reasoning for size | phi4-mini |
| SmolLM3 3B | 3B | 3B (dense) | ~2.5 GB | HuggingFace, outperforms Llama 3.2 3B | smollm3:3b |
16 GB RAM: Medium Models (7B-9B)
| Model | Params | Active | RAM Needed | Strengths | Ollama |
|---|---|---|---|---|---|
| Qwen 3.5 9B | 9B | 9B (dense) | ~6 GB | Beats GPT-OSS-120B on reasoning benchmarks, 256K context | qwen3.5:9b |
| Llama 3.3 8B | 8B | 8B (dense) | ~5 GB | Largest ecosystem, broad availability | llama3.3:8b |
| Mistral Small 3 | 24B | 24B (dense) | ~14 GB | Apache 2.0, punches above weight | mistral-small |
| Qwen 3 8B | 8B | 8B (dense) | ~5 GB | Strong coding (HumanEval 76.0), multilingual | qwen3:8b |
| DeepSeek R1 7B | 7B | 7B (dense) | ~5 GB | Reasoning/chain-of-thought distillation | deepseek-r1:7b |
| GLM-4.7 Flash | 30B | 3B (MoE) | ~18 GB | Agentic, tool calling, long-task coding | glm-4.7-flash |
32 GB RAM: Large Models (14B-35B)
| Model | Params | Active | RAM Needed | Strengths | Ollama |
|---|---|---|---|---|---|
| Qwen 3.5 27B | 27B | 27B (dense) | ~16 GB | 72.4% SWE-bench, GPT-5 Mini range | qwen3.5:27b |
| Gemma 4 31B | 33B | 33B (dense) | ~20 GB | High LMSYS arena scores, native GUI/Object detection | gemma4:31b |
| Gemma 4 26B-A4B | 27B | 4B (MoE) | ~15 GB | 1441 Arena score on just 4B active parameters | gemma4:26b-a4b |
| Qwen 3.5 35B-A3B | 35B | 3B (MoE) | ~20 GB | MoE model: fits in 32 GB, recommended at 64 GB+ for full performance. 112 tok/s on RTX 3090. | qwen3.5:35b-a3b |
| GPT-OSS 20B | 20B | 3.6B (MoE) | ~12 GB | OpenAI’s first . Near o3-mini. Apache 2.0. | gpt-oss:20b |
| Nemotron 3 Nano 30B-A3B | 30B | 3B (MoE) | ~24 GB | Hybrid Mamba-Transformer, 1M context | nemotron-3-nano:30b |
| Llama 4 Scout | 109B | 17B (MoE) | ~24 GB | 10M token context, multimodal | llama4-scout |
| Phi-4 14B | 14B | 14B (dense) | ~9 GB | Strong reasoning, small footprint | phi4:14b |
64 GB+ RAM: Very Large Models (70B+)
| Model | Params | Active | RAM Needed | Strengths | Ollama |
|---|---|---|---|---|---|
| Qwen 3.5 122B-A10B | 122B | 10B (MoE) | ~45 GB | Flagship small-active MoE | qwen3.5:122b-a10b |
| GPT-OSS 120B | 120B | 5.1B (MoE) | ~45 GB | Near o4-mini on reasoning. Apache 2.0. | gpt-oss:120b |
| Nemotron 3 Super | 120B | 12B (MoE) | ~50 GB | Agentic reasoning, hybrid architecture | nemotron-3-super |
| Llama 4 Maverick | 400B | 40B (MoE) | ~122 GB | Massive MoE, multimodal | llama4-maverick |
| DeepSeek V3.2 | 671B | 37B (MoE) | ~135 GB+ | Top general-purpose scores across coding, reasoning, and math benchmarks | deepseek-v3.2 |
Server / Multi-GPU: Frontier-Class Open Models
These require dedicated hardware (multiple GPUs or 240 GB+ of unified memory):
| Model | Params | Active | Strengths |
|---|---|---|---|
| Qwen 3.5 397B-A17B | 397B | 17B (MoE) | Flagship. Scores within a few points of frontier closed models on most benchmarks. 1M+ context. |
| Kimi K2.5 | 1T | 32B (MoE) | Top scores on coding and reasoning benchmarks. 256K context. Multimodal. |
| GLM-5 | 744B | 40B (MoE) | Strong on coding benchmarks, agent task execution, long-task stability. |
| MiMo-V2-Flash | 309B | 15B (MoE) | Xiaomi. 256 experts, 128K context. Strong reasoning. |
| Mistral Small 4 | 119B | 6.5B (MoE) | 128 experts, 4 active. 256K context. |
Models by Use Case
Best for Coding
| Tier | Model | Why |
|---|---|---|
| Small (8 GB) | Qwen 3.5 4B | Thinking mode helps with code reasoning |
| Medium (16 GB) | Qwen 3.5 9B | Exceptional coding benchmarks for size |
| Large (32 GB) | Qwen 3.5 27B | 72.4% SWE-bench, production-grade |
| XL (64 GB+) | GPT-OSS 120B | Near o4-mini, strong agentic coding |
| Server | Kimi K2.5 / GLM-5 | Top scores on SWE-bench, HumanEval, and LiveCodeBench |
Best for Reasoning and Math
| Tier | Model | Why |
|---|---|---|
| Small (8 GB) | Qwen 3.5 4B | Thinking mode, scaled RL |
| Medium (16 GB) | Qwen 3.5 9B | 81.7 GPQA Diamond |
| Large (32 GB) | GPT-OSS 20B | Trained with RL from o3/frontier |
| XL (64 GB+) | Qwen 3.5 122B-A10B | Best reasoning-per-watt |
| Server | Qwen 3.5 397B-A17B | Rivals closed-source on reasoning |
Best for Long Context
| Tier | Model | Context | Why |
|---|---|---|---|
| Small (8 GB) | Nemotron 3 Nano 4B | 1M tokens | Hybrid Mamba-Transformer |
| Medium (16 GB) | GLM-4.7 Flash | 202K tokens | Agentic, long-task |
| Large (32 GB) | Llama 4 Scout | 10M tokens | Longest context of any open model |
| XL (64 GB+) | Nemotron 3 Super | 1M tokens | Hybrid architecture, agentic |
Best for Multimodal (Text + Images)
| Tier | Model | Modalities |
|---|---|---|
| Small (8 GB) | Qwen 3.5 4B | Text + images |
| Small (8 GB) | Gemma 4 E2B / E4B | Text + images + audio + video (Any-to-Any) |
| Medium (16 GB) | Qwen 3.5 9B | Text + images |
| Large (32 GB) | Llama 4 Scout | Text + images |
| Server | Kimi K2.5 | Native multimodal, 15T mixed tokens |
Best for Writing and Creative Tasks
| Tier | Model | Why |
|---|---|---|
| Medium (16 GB) | Mistral Small 3 | Strong instruction-following, natural prose, good at style matching |
| Large (32 GB) | Qwen 3.5 27B | Large context lets it maintain tone and continuity across long pieces |
| XL (64 GB+) | GPT-OSS 120B | Consistent style, handles complex multi-part prompts, strong at narrative coherence |
Instruct vs. Base Models
Base models predict the next token and are not trained to be helpful. They will continue a prompt as if completing a document, not answer a question.
Instruct models are fine-tuned to follow instructions and hold conversations. For everyday use, always download the instruct variant.
When downloading, look for “Instruct” in the model name (e.g., Llama-3.2-3B-Instruct). HuggingFace model cards label these clearly. The only reason to use a base model is if you are doing further fine-tuning and want the raw weights.
Most models in the Ollama library are instruct variants by default unless the name says otherwise.
Model Families: Who Makes What
Qwen 3.5 (Alibaba)
- Released: Feb-Mar 2026
- Sizes: 0.8B, 2B, 4B, 9B, 27B, 35B-A3B, 122B-A10B, 397B-A17B
- License: Apache 2.0
- Key features: Multimodal, thinking mode, 262K native context (1M+ extended), 201 languages, native tool calling
- Links: HuggingFace · GitHub · Ollama
GPT-OSS (OpenAI)
- Released: Aug 2025
- Sizes: 20B (3.6B active), 120B (5.1B active)
- License: Apache 2.0
- Key features: MoE, trained with RL from o3/frontier, 128K context
- Links: HuggingFace · GitHub · Ollama
Nemotron 3 (NVIDIA)
- Released: Mar 2026
- Sizes: Nano 4B, Nano 30B-A3B, Super 120B (12B active)
- License: NVIDIA Open Model License
- Key features: Hybrid Mamba-Transformer MoE, 1M token context, agentic reasoning
- Links: NVIDIA Developer · Ollama Nano · Ollama Super
Llama 4 (Meta)
- Released: April 2025
- Sizes: Scout 109B (17B active), Maverick 400B (40B active)
- License: Llama 4 Community License
- Key features: MoE, multimodal, Scout has 10M token context
- Links: llama.com · Ollama
Gemma 4 (Google DeepMind)
- Released: April 2026
- Sizes: 5B (E2B), 8B (E4B), 27B (26B-A4B), 33B (31B)
- License: Apache 2.0
- Key features: Any-to-Any multimodal on edge (audio/video/image natively), Per-Layer Embeddings (PLE), Shared KV Cache, exceptional local GUI detection.
- Links: DeepMind Blog · HuggingFace Blog · Ollama
GLM (Zhipu AI / Z.AI)
- Released: 2025-2026
- Sizes: GLM-4.7 (358B), GLM-4.7 Flash (30B-A3B), GLM-5 (744B, 40B active)
- License: MIT
- Key features: Agent execution, tool calling, long-task coding, 200K+ context
- Links: Ollama GLM-4.7 · Ollama GLM-5
DeepSeek (DeepSeek AI)
- Released: 2025-2026
- Sizes: R1 (1.5B-671B distillations), V3 (671B, 37B active), V3.2
- License: DeepSeek License
- Key features: Strong reasoning (R1), top benchmark scores across the board (V3.2), large MoE
- Links: Ollama V3.2 · Ollama R1
Mistral
- Mistral 7B released: September 2023. Mistral Small 3 released: January 2025.
- Sizes: Mistral 7B, Small 3 (24B), Small 4 (119B, 6.5B active)
- License: Apache 2.0
- Key features: Fast inference, strong , Small 4 has 128 experts
- Links: Ollama
Kimi K2.5 (Moonshot AI)
- Released: 2026
- Sizes: 1T total, 32B active
- License: MIT
- Key features: Top scores on coding and reasoning benchmarks, native multimodal, 256K context
- Links: Ollama
Phi-4 (Microsoft)
- Released: 2025-2026
- Sizes: Mini 3.8B, 14B
- License: MIT
- Key features: High performance relative to size, reasoning-focused
- Links: Ollama
Where to Find Models
- Unsloth Dynamic GGUFs are dynamic quantizations that are widely considered the best GGUF variants for most models. The dynamic format allocates more bits to sensitive layers and fewer to layers where precision matters less, giving better quality at the same file size. Start here when downloading GGUFs from HuggingFace.
- Ollama Library has pre-quantized models with one-command install. Best for getting started quickly.
- HuggingFace Models is the largest collection. Original weights plus community quantizations (GGUF, GPTQ, AWQ).
- LM Studio provides a visual model browser with one-click downloads from HuggingFace.
Understanding Model Names
When you see a model name like qwen3.5:9b-q4_K_M, here’s what each part means:
| Part | Meaning | Example |
|---|---|---|
| Family | Who made it and which generation | qwen3.5 = Alibaba’s Qwen, version 3.5 |
| Size | Total parameter count | 9b = 9 billion parameters |
| Active (if MoE) | Parameters used per token | A3B = 3 billion active |
| Quantization | Compression level | q4_K_M = 4-bit, K-quant, medium quality |
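The same breakdown can be done mechanically. The sketch below is an illustration only: `ModelTag` and `parse_tag` are hypothetical names, and the pattern covers just the `family:size[-aN][-quant]` shape described in the table, so tags with other layouts (for example `gemma4:e2b`) are rejected rather than guessed at.

```python
import re
from typing import NamedTuple, Optional


class ModelTag(NamedTuple):
    family: str
    size: Optional[str]     # total parameters, e.g. "9b"
    active: Optional[str]   # active parameters for MoE, e.g. "3b"
    quant: Optional[str]    # quantization level, e.g. "q4_K_M"


# family, then optionally ":<size>b", "-a<active>b", and "-q<level>".
_TAG = re.compile(
    r"^(?P<family>[^:]+)"
    r"(?::(?P<size>\d+(?:\.\d+)?b)"
    r"(?:-a(?P<active>\d+(?:\.\d+)?b))?"
    r"(?:-(?P<quant>q\d\w*))?)?$",
    re.IGNORECASE,
)


def parse_tag(tag: str) -> ModelTag:
    """Split a model tag into the parts listed in the table above."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag: {tag!r}")
    return ModelTag(match["family"], match["size"], match["active"], match["quant"])


if __name__ == "__main__":
    print(parse_tag("qwen3.5:9b-q4_K_M"))
    print(parse_tag("qwen3.5:35b-a3b"))
```

Parts that are absent from a tag come back as `None`, so a bare name such as `phi4-mini` parses to a family with no size, active count, or quantization.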
Quantization quick guide:
| Level | Quality | Notes |
|---|---|---|
| Q8 | ~99% of original | Largest file. Near-lossless. |
| Q6_K | ~97% | Slightly smaller, barely any loss. |
| Q5_K_M | ~95% | Good default if you have the space. |
| Q4_K_M | ~93% | The sweet spot for most people. Half the size. |
| Q3_K | ~85% | Noticeable quality drop. Only if you’re tight on memory. |
| Q2_K | ~70% | Real quality loss. Last resort. |
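For a rough sense of what each level means in gigabytes, multiply the parameter count by the bits per weight. The sketch below assumes each level uses exactly the bit width in its name, which is a simplification: real GGUF files keep some tensors at higher precision, and running a model also needs memory for context, so treat the result as a floor rather than a prediction. `NOMINAL_BITS` and `approx_weight_gb` are illustrative names.

```python
# Nominal bits per weight implied by each level's name (an assumption;
# actual files mix precisions and usually come out somewhat larger).
NOMINAL_BITS = {
    "Q8": 8,
    "Q6_K": 6,
    "Q5_K_M": 5,
    "Q4_K_M": 4,
    "Q3_K": 3,
    "Q2_K": 2,
}


def approx_weight_gb(params_billion: float, level: str) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    bits = NOMINAL_BITS[level]
    # billions of parameters * bits per parameter / 8 bits per byte = GB
    return params_billion * bits / 8


if __name__ == "__main__":
    for level in NOMINAL_BITS:
        print(level, approx_weight_gb(9, level), "GB")
```

For a 9B model this gives 4.5 GB at Q4_K_M and 9.0 GB at Q8, which lines up with the "half the size" note in the table.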
Want to understand how quantization works? See Module 2: Compressed Models for the intuition, or Module 3B: How Models Fit for the deep dive.
Emerging Techniques Changing the Game
TurboQuant (Google)
Released late March 2026. Compresses the KV cache to 3 bits with zero accuracy loss, giving 6x memory reduction for context and 8x faster attention computation. The llama.cpp and MLX communities picked it up within days. Long conversations that used to run out of RAM now fit. Research blog
LLM in a Flash (Apple)
Runs models larger than your available RAM by streaming weights from SSD/flash storage. An iPhone 17 Pro ran a 400B MoE model on-device. Coming to more tools soon. Apple Research
Mixture of Experts (MoE) Everywhere
Most new large models use MoE: huge total parameter counts but only a fraction active per token. A “120B” MoE model with 5B active runs at roughly 5B speed and compute cost, but stores knowledge across all 120B parameters. You need RAM for the full 120B, which is why the tables above show models larger than you might expect for a given tier.
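To make the memory-versus-speed split concrete, the sketch below ranks three models from the tables above two ways: by total parameters (what has to fit in RAM) and by active parameters (what drives per-token compute). The parameter counts are copied from the tables; `MODELS` and `active_fraction` are hypothetical names, and the ranking ignores everything else that affects real speed, such as memory bandwidth and quantization.

```python
# (total parameters, active parameters per token), in billions,
# as listed in the hardware-tier tables above.
MODELS = {
    "Qwen 3.5 27B": (27.0, 27.0),      # dense: every parameter is active
    "Qwen 3.5 35B-A3B": (35.0, 3.0),   # MoE
    "GPT-OSS 120B": (120.0, 5.1),      # MoE
}


def active_fraction(name: str) -> float:
    """Share of the model's parameters used for each token."""
    total, active = MODELS[name]
    return active / total


# Smallest first: what must fit in RAM vs. what is computed per token.
by_memory = sorted(MODELS, key=lambda name: MODELS[name][0])
by_compute = sorted(MODELS, key=lambda name: MODELS[name][1])

if __name__ == "__main__":
    print("by memory: ", by_memory)
    print("by compute:", by_compute)
    for name in MODELS:
        print(f"{name}: {active_fraction(name):.1%} of parameters active")
```

The two orderings disagree: the dense 27B model is the smallest to store but the most expensive per token, while the 120B MoE is the largest to store yet does far less work per token than the dense model.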
For the latest in model releases and AI news, see our News Feed and Research Feed.