Reference Reference

📋 Model Reference

Model Reference

Last verified: April 2026. This page is updated regularly. Model recommendations change as new releases appear. If something looks outdated, check Ollama’s library and HuggingFace trending for the latest.

Recommended Starters

One pick per tier. Install Ollama, then run the command.

Your RAM	Model	Command	Why this one
8 GB	Qwen 3.5 4B	`ollama pull qwen3.5:4b`	, , 140+ languages. ~3.4 GB at Q4.
16 GB	Qwen 3.5 9B	`ollama pull qwen3.5:9b`	Outperforms much larger models on reasoning benchmarks, including models with 5x more parameters on GPQA Diamond. 256K .
32 GB	Qwen 3.5 27B	`ollama pull qwen3.5:27b`	Handles multi-step reasoning and long documents better than 9B models. The best open-weight dense model at its size class.
64 GB+	Qwen 3.5 35B-A3B	`ollama pull qwen3.5:35b-a3b`	model: only 3B active means fast with 35B quality. Fits in 32 GB; recommended at 64 GB+ for full performance.
4 GB or less	Qwen 3.5 0.8B	`ollama pull qwen3.5:0.8b`	Tiny but functional. Runs on almost anything.

All Models by Hardware Tier

8 GB RAM: Small Models (1B-4B)

Model	Params	Active	RAM Needed	Strengths	Ollama
Qwen 3.5 4B	4B	4B (dense)	~3.4 GB	Multimodal, thinking mode, 262K context	`qwen3.5:4b`
Gemma 4 E2B	5B	5B (dense)	~3.5 GB	Any-to-Any (audio+video+image+text), edge capacity	`gemma4:e2b`
Gemma 4 E4B	8B	8B (dense)	~5.5 GB	Any-to-Any multimodal, PLE & Shared KV cache	`gemma4:e4b`
Nemotron 3 Nano 4B	4B	4B	~5 GB	Hybrid Mamba-, 1M context, agentic	`nemotron-3-nano`
Phi-4 Mini	3.8B	3.8B (dense)	~3 GB	Microsoft, strong reasoning for size	`phi4-mini`
SmolLM3 3B	3B	3B (dense)	~2.5 GB	HuggingFace, outperforms Llama 3.2 3B	`smollm3:3b`

16 GB RAM: Medium Models (7B-9B)

Model	Params	Active	RAM Needed	Strengths	Ollama
Qwen 3.5 9B	9B	9B (dense)	~6 GB	Beats GPT-OSS-120B on reasoning benchmarks, 256K context	`qwen3.5:9b`
Llama 3.3 8B	8B	8B (dense)	~5 GB	Largest ecosystem, broad availability	`llama3.3:8b`
Mistral Small 3	24B	24B (dense)	~14 GB	Apache 2.0, punches above weight	`mistral-small`
Qwen 3 8B	8B	8B (dense)	~5 GB	Strong coding (HumanEval 76.0), multilingual	`qwen3:8b`
DeepSeek R1 7B	7B	7B (dense)	~5 GB	Reasoning/chain-of-thought distillation	`deepseek-r1:7b`
GLM-4.7 Flash	30B	3B (MoE)	~18 GB	Agentic, tool calling, long-task coding	`glm-4.7-flash`

32 GB RAM: Large Models (14B-35B)

Model	Params	Active	RAM Needed	Strengths	Ollama
Qwen 3.5 27B	27B	27B (dense)	~16 GB	72.4% SWE-bench, GPT-5 Mini range	`qwen3.5:27b`
Gemma 4 31B	33B	33B (dense)	~20 GB	High LMSYS arena scores, native GUI/Object detection	`gemma4:31b`
Gemma 4 26B-A4B	27B	4B (MoE)	~15 GB	1441 Arena score on just 4B active parameters	`gemma4:26b-a4b`
Qwen 3.5 35B-A3B	35B	3B (MoE)	~20 GB	MoE model: fits in 32 GB, recommended at 64 GB+ for full performance. 112 tok/s on RTX 3090.	`qwen3.5:35b-a3b`
GPT-OSS 20B	20B	3.6B (MoE)	~12 GB	OpenAI’s first . Near o3-mini. Apache 2.0.	`gpt-oss:20b`
Nemotron 3 Nano 30B-A3B	30B	3B (MoE)	~24 GB	Hybrid Mamba-Transformer, 1M context	`nemotron-3-nano:30b`
Llama 4 Scout	109B	17B (MoE)	~24 GB	10M token context, multimodal	`llama4-scout`
Phi-4 14B	14B	14B (dense)	~9 GB	Strong reasoning, small footprint	`phi4:14b`

64 GB+ RAM: Very Large Models (70B+)

Model	Params	Active	RAM Needed	Strengths	Ollama
Qwen 3.5 122B-A10B	122B	10B (MoE)	~45 GB	Flagship small-active MoE	`qwen3.5:122b-a10b`
GPT-OSS 120B	120B	5.1B (MoE)	~45 GB	Near o4-mini on reasoning. Apache 2.0.	`gpt-oss:120b`
Nemotron 3 Super	120B	12B (MoE)	~50 GB	Agentic reasoning, hybrid architecture	`nemotron-3-super`
Llama 4 Maverick	400B	40B (MoE)	~122 GB	Massive MoE, multimodal	`llama4-maverick`
DeepSeek V3.2	671B	37B (MoE)	~135 GB+	Top general-purpose scores across coding, reasoning, and math benchmarks	`deepseek-v3.2`

Server / Multi-GPU: Frontier-Class Open Models

These require dedicated hardware (multiple GPUs or 240GB+ unified memory):

Model	Params	Active	Strengths
Qwen 3.5 397B-A17B	397B	17B (MoE)	Flagship. Scores within a few points of frontier closed models on most benchmarks. 1M+ context.
Kimi K2.5	1T	32B (MoE)	Top scores on coding and reasoning benchmarks. 256K context. Multimodal.
GLM-5	744B	40B (MoE)	Strong on coding benchmarks, agent task execution, long-task stability.
MiMo-V2-Flash	309B	15B (MoE)	Xiaomi. 256 experts, 128K context. Strong reasoning.
Mistral Small 4	119B	6.5B (MoE)	128 experts, 4 active. 256K context.

Models by Use Case

Best for Coding

Tier	Model	Why
Small (8 GB)	Qwen 3.5 4B	Thinking mode helps with code reasoning
Medium (16 GB)	Qwen 3.5 9B	Exceptional coding benchmarks for size
Large (32 GB)	Qwen 3.5 27B	72.4% SWE-bench, production-grade
XL (64 GB+)	GPT-OSS 120B	Near o4-mini, strong agentic coding
Server	Kimi K2.5 / GLM-5	Top scores on SWE-bench, HumanEval, and LiveCodeBench

Best for Reasoning and Math

Tier	Model	Why
Small (8 GB)	Qwen 3.5 4B	Thinking mode, scaled RL
Medium (16 GB)	Qwen 3.5 9B	81.7 GPQA Diamond
Large (32 GB)	GPT-OSS 20B	Trained with RL from o3/frontier
XL (64 GB+)	Qwen 3.5 122B-A10B	Best reasoning-per-watt
Server	Qwen 3.5 397B-A17B	Rivals closed-source on reasoning

Best for Long Context

Tier	Model	Context	Why
Small (8 GB)	Nemotron 3 Nano 4B	1M tokens	Hybrid Mamba-Transformer
Medium (16 GB)	GLM-4.7 Flash	202K tokens	Agentic, long-task
Large (32 GB)	Llama 4 Scout	10M tokens	Longest context of any open model
XL (64 GB+)	Nemotron 3 Super	1M tokens	Hybrid architecture, agentic

Best for Multimodal (Text + Images)

Tier	Model	Modalities
Small (8 GB)	Qwen 3.5 4B	Text + images
Small (8 GB)	Gemma 4 E2B / E4B	Text + images + audio + video (Any-to-Any)
Medium (16 GB)	Qwen 3.5 9B	Text + images
Large (32 GB)	Llama 4 Scout	Text + images
Server	Kimi K2.5	Native multimodal, 15T mixed tokens

Best for Writing and Creative Tasks

Tier	Model	Why
Medium (16 GB)	Mistral Small 3	Strong instruction-following, natural prose, good at style matching
Large (32 GB)	Qwen 3.5 27B	Large context lets it maintain tone and continuity across long pieces
XL (64 GB+)	GPT-OSS 120B	Consistent style, handles complex multi-part prompts, strong at narrative coherence

Instruct vs. Base Models

Base models predict the next token and are not trained to be helpful. They will continue a prompt as if completing a document, not answer a question.

Instruct models are fine-tuned to follow instructions and hold conversations. For everyday use, always download the instruct variant.

When downloading, look for “Instruct” in the model name (e.g., Llama-3.2-3B-Instruct). HuggingFace model cards label these clearly. The only reason to use a base model is if you are doing further fine-tuning and want the raw weights.

Most models in the Ollama library are instruct variants by default unless the name says otherwise.

Model Families: Who Makes What

Qwen 3.5 (Alibaba)

Released: Feb-Mar 2026
Sizes: 0.8B, 2B, 4B, 9B, 27B, 35B-A3B, 122B-A10B, 397B-A17B
License: Apache 2.0
Key features: Multimodal, thinking mode, 262K native context (1M+ extended), 201 languages, native tool calling
Links: HuggingFace · GitHub · Ollama

GPT-OSS (OpenAI)

Released: Mar 2026
Sizes: 20B (3.6B active), 120B (5.1B active)
License: Apache 2.0
Key features: MoE, trained with RL from o3/frontier, 60K context
Links: HuggingFace · GitHub · Ollama

Nemotron 3 (NVIDIA)

Released: Mar 2026
Sizes: Nano 4B, Nano 30B-A3B, Super 120B (12B active)
License: NVIDIA Open Model License
Key features: Hybrid Mamba-Transformer MoE, 1M token context, agentic reasoning
Links: NVIDIA Developer · Ollama Nano · Ollama Super

Llama 4 (Meta)

Released: Late 2025
Sizes: Scout 109B (17B active), Maverick 400B (40B active)
License: Llama 4 Community License
Key features: MoE, multimodal, Scout has 10M token context
Links: llama.com · Ollama

Gemma 4 (Google DeepMind)

Released: April 2026
Sizes: 5B (E2B), 8B (E4B), 27B (26B-A4B), 33B (31B)
License: Apache 2.0
Key features: Any-to-Any multimodal on edge (audio/video/image natively), Per-Layer Embeddings (PLE), Shared KV Cache, exceptional local GUI detection.
Links: DeepMind Blog · HuggingFace Blog · Ollama

GLM (Zhipu AI / Z.AI)

Released: 2025-2026
Sizes: GLM-4.7 (358B), GLM-4.7 Flash (30B-A3B), GLM-5 (744B, 40B active)
License: MIT
Key features: Agent execution, tool calling, long-task coding, 200K+ context
Links: Ollama GLM-4.7 · Ollama GLM-5

DeepSeek (DeepSeek AI)

Released: 2025-2026
Sizes: R1 (1.5B-671B distillations), V3 (671B, 37B active), V3.2
License: DeepSeek License
Key features: Strong reasoning (R1), top benchmark scores across the board (V3.2), large MoE
Links: Ollama V3.2 · Ollama R1

Mistral

Mistral 7B released: September 2023. Mistral Small 3 released: January 2025.
Sizes: Mistral 7B, Small 3 (24B), Small 4 (119B, 6.5B active)
License: Apache 2.0
Key features: Fast inference, strong , Small 4 has 128 experts
Links: Ollama

Kimi K2.5 (Moonshot AI)

Released: 2026
Sizes: 1T total, 32B active
License: MIT
Key features: Top scores on coding and reasoning benchmarks, native multimodal, 256K context
Links: Ollama

Phi-4 (Microsoft)

Released: 2025-2026
Sizes: Mini 3.8B, 14B
License: MIT
Key features: High performance relative to size, reasoning-focused
Links: Ollama

Where to Find Models

Unsloth Dynamic GGUFs produces dynamic quantizations that are widely considered the best GGUF variants for most models. The dynamic format allocates more bits to sensitive layers and fewer to layers where precision matters less, giving better quality at the same file size. Start here when downloading GGUFs from HuggingFace.
Ollama Library has pre- models with one-command install. Best for getting started quickly.
HuggingFace Models is the largest collection. Original weights plus community quantizations (, GPTQ, AWQ).
LM Studio provides a visual model browser with one-click downloads from HuggingFace.

Understanding Model Names

When you see a model name like qwen3.5:9b-q4_K_M, here’s what each part means:

Part	Meaning	Example
Family	Who made it and which generation	`qwen3.5` = Alibaba’s Qwen, version 3.5
Size	Total count	`9b` = 9 billion parameters
Active (if MoE)	Parameters used per token	`A3B` = 3 billion active
Quantization	Compression level	`q4_K_M` = 4-bit, , medium quality

quick guide:

Level	Quality	Notes
Q8	~99% of original	Largest file. Near-lossless.
Q6_K	~97%	Slightly smaller, barely any loss.
Q5_K_M	~95%	Good default if you have the space.
Q4_K_M	~93%	The sweet spot for most people. Half the size.
Q3_K	~85%	Noticeable quality drop. Only if you’re tight on memory.
Q2_K	~70%	Real quality loss. Last resort.

Want to understand how quantization works? See Module 2: Compressed Models for the intuition, or Module 3B: How Models Fit for the deep dive.

Emerging Techniques Changing the Game

TurboQuant (Google)

Released late March 2026. Compresses to 3 bits with zero accuracy loss, giving 6x memory reduction for context and 8x faster attention computation. The llama.cpp and MLX communities picked it up within days. Long conversations that used to run out of RAM now fit. Research blog

LLM in a Flash (Apple)

Runs models larger than your available by streaming weights from SSD/flash storage. An iPhone 17 Pro ran a 400B MoE model on-device. Coming to more tools soon. Apple Research

Mixture of Experts (MoE) Everywhere

Most new large models use MoE: huge total parameter counts but only a fraction active per token. A “120B” MoE model with 5B active runs at roughly 5B speed and compute cost, but stores knowledge across all 120B parameters. You need RAM for the full 120B, which is why the tables above show models larger than you might expect for a given tier.

For the latest in model releases and AI news, see our News Feed and Research Feed.