Model Reference

Last verified: April 2026. This page is updated regularly. Model recommendations change as new releases appear. If something looks outdated, check Ollama’s library and HuggingFace trending for the latest.


One pick per tier. Install Ollama, then run the command.

| Your RAM | Model | Command | Why this one |
| --- | --- | --- | --- |
| 8 GB | Qwen 3.5 4B | ollama pull qwen3.5:4b | Multimodal, thinking mode, 140+ languages. ~3.4 GB at Q4. |
| 16 GB | Qwen 3.5 9B | ollama pull qwen3.5:9b | Outperforms much larger models on reasoning benchmarks, including models with 5x more parameters on GPQA Diamond. 256K context. |
| 32 GB | Qwen 3.5 27B | ollama pull qwen3.5:27b | Handles multi-step reasoning and long documents better than 9B models. The best open-weight dense model in its size class. |
| 64 GB+ | Qwen 3.5 35B-A3B | ollama pull qwen3.5:35b-a3b | MoE model: only 3B active parameters, so it runs fast with 35B-class quality. Fits in 32 GB; recommended at 64 GB+ for full performance. |
| 4 GB or less | Qwen 3.5 0.8B | ollama pull qwen3.5:0.8b | Tiny but functional. Runs on almost anything. |
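
As a rough illustration, the quick-pick table above can be turned into a small lookup. This is a Python sketch, not part of Ollama: `QUICK_PICKS`, `pick_model`, and `pull_command` are hypothetical names, and the thresholds simply copy the table. Sending anything under 8 GB to the 0.8B model is an assumption, since the table does not cover the range between 4 GB and 8 GB.

```python
# Quick-pick table from this page as (minimum RAM in GB, Ollama tag).
# Ordered from the largest tier down so the first match wins.
QUICK_PICKS = [
    (64, "qwen3.5:35b-a3b"),
    (32, "qwen3.5:27b"),
    (16, "qwen3.5:9b"),
    (8, "qwen3.5:4b"),
    (0, "qwen3.5:0.8b"),  # assumption: everything below 8 GB uses the smallest model
]


def pick_model(ram_gb: float) -> str:
    """Return the Ollama tag of the quick pick for a machine with ram_gb of RAM."""
    for min_ram, tag in QUICK_PICKS:
        if ram_gb >= min_ram:
            return tag
    raise ValueError("ram_gb must be non-negative")


def pull_command(ram_gb: float) -> str:
    """Build the install command shown in the table for this amount of RAM."""
    return f"ollama pull {pick_model(ram_gb)}"


if __name__ == "__main__":
    for ram in (4, 8, 16, 32, 64):
        print(f"{ram:>3} GB -> {pull_command(ram)}")
```

The script only builds the command string; run the printed command in a terminal once Ollama is installed.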

All Models by Hardware Tier

8 GB RAM: Small Models (1B-4B)

| Model | Params | Active | RAM Needed | Strengths | Ollama |
| --- | --- | --- | --- | --- | --- |
| Qwen 3.5 4B | 4B | 4B (dense) | ~3.4 GB | Multimodal, thinking mode, 262K context | qwen3.5:4b |
| Gemma 4 E2B | 5B | 5B (dense) | ~3.5 GB | Any-to-Any (audio+video+image+text), edge capacity | gemma4:e2b |
| Gemma 4 E4B | 8B | 8B (dense) | ~5.5 GB | Any-to-Any multimodal, PLE & Shared KV cache | gemma4:e4b |
| Nemotron 3 Nano 4B | 4B | 4B | ~5 GB | Hybrid Mamba-Transformer, 1M context, agentic | nemotron-3-nano |
| Phi-4 Mini | 3.8B | 3.8B (dense) | ~3 GB | Microsoft, strong reasoning for size | phi4-mini |
| SmolLM3 3B | 3B | 3B (dense) | ~2.5 GB | HuggingFace, outperforms Llama 3.2 3B | smollm3:3b |

16 GB RAM: Medium Models (7B-9B)

| Model | Params | Active | RAM Needed | Strengths | Ollama |
| --- | --- | --- | --- | --- | --- |
| Qwen 3.5 9B | 9B | 9B (dense) | ~6 GB | Beats GPT-OSS 120B on reasoning benchmarks, 256K context | qwen3.5:9b |
| Llama 3.3 8B | 8B | 8B (dense) | ~5 GB | Largest ecosystem, broad availability | llama3.3:8b |
| Mistral Small 3 | 24B | 24B (dense) | ~14 GB | Apache 2.0, punches above its weight | mistral-small |
| Qwen 3 8B | 8B | 8B (dense) | ~5 GB | Strong coding (HumanEval 76.0), multilingual | qwen3:8b |
| DeepSeek R1 7B | 7B | 7B (dense) | ~5 GB | Reasoning/chain-of-thought distillation | deepseek-r1:7b |
| GLM-4.7 Flash | 30B | 3B (MoE) | ~18 GB | Agentic, tool calling, long-task coding | glm-4.7-flash |

32 GB RAM: Large Models (14B-35B)

| Model | Params | Active | RAM Needed | Strengths | Ollama |
| --- | --- | --- | --- | --- | --- |
| Qwen 3.5 27B | 27B | 27B (dense) | ~16 GB | 72.4% SWE-bench, GPT-5 Mini range | qwen3.5:27b |
| Gemma 4 31B | 33B | 33B (dense) | ~20 GB | High LMSYS arena scores, native GUI/Object detection | gemma4:31b |
| Gemma 4 26B-A4B | 27B | 4B (MoE) | ~15 GB | 1441 Arena score on just 4B active parameters | gemma4:26b-a4b |
| Qwen 3.5 35B-A3B | 35B | 3B (MoE) | ~20 GB | MoE model: fits in 32 GB, recommended at 64 GB+ for full performance. 112 tok/s on RTX 3090. | qwen3.5:35b-a3b |
| GPT-OSS 20B | 20B | 3.6B (MoE) | ~12 GB | OpenAI’s first . Near o3-mini. Apache 2.0. | gpt-oss:20b |
| Nemotron 3 Nano 30B-A3B | 30B | 3B (MoE) | ~24 GB | Hybrid Mamba-Transformer, 1M context | nemotron-3-nano:30b |
| Llama 4 Scout | 109B | 17B (MoE) | ~24 GB | 10M token context, multimodal | llama4-scout |
| Phi-4 14B | 14B | 14B (dense) | ~9 GB | Strong reasoning, small footprint | phi4:14b |

64 GB+ RAM: Very Large Models (70B+)

| Model | Params | Active | RAM Needed | Strengths | Ollama |
| --- | --- | --- | --- | --- | --- |
| Qwen 3.5 122B-A10B | 122B | 10B (MoE) | ~45 GB | Flagship small-active MoE | qwen3.5:122b-a10b |
| GPT-OSS 120B | 120B | 5.1B (MoE) | ~45 GB | Near o4-mini on reasoning. Apache 2.0. | gpt-oss:120b |
| Nemotron 3 Super | 120B | 12B (MoE) | ~50 GB | Agentic reasoning, hybrid architecture | nemotron-3-super |
| Llama 4 Maverick | 400B | 40B (MoE) | ~122 GB | Massive MoE, multimodal | llama4-maverick |
| DeepSeek V3.2 | 671B | 37B (MoE) | ~135 GB+ | Top general-purpose scores across coding, reasoning, and math benchmarks | deepseek-v3.2 |

Server / Multi-GPU: Frontier-Class Open Models

These require dedicated hardware (multiple GPUs or 240 GB+ of unified memory):

| Model | Params | Active | Strengths |
| --- | --- | --- | --- |
| Qwen 3.5 397B-A17B | 397B | 17B (MoE) | Flagship. Scores within a few points of frontier closed models on most benchmarks. 1M+ context. |
| Kimi K2.5 | 1T | 32B (MoE) | Top scores on coding and reasoning benchmarks. 256K context. Multimodal. |
| GLM-5 | 744B | 40B (MoE) | Strong on coding benchmarks, agent task execution, long-task stability. |
| MiMo-V2-Flash | 309B | 15B (MoE) | Xiaomi. 256 experts, 128K context. Strong reasoning. |
| Mistral Small 4 | 119B | 6.5B (MoE) | 128 experts, 4 active. 256K context. |

Models by Use Case

Best for Coding

| Tier | Model | Why |
| --- | --- | --- |
| Small (8 GB) | Qwen 3.5 4B | Thinking mode helps with code reasoning |
| Medium (16 GB) | Qwen 3.5 9B | Exceptional coding benchmarks for size |
| Large (32 GB) | Qwen 3.5 27B | 72.4% SWE-bench, production-grade |
| XL (64 GB+) | GPT-OSS 120B | Near o4-mini, strong agentic coding |
| Server | Kimi K2.5 / GLM-5 | Top scores on SWE-bench, HumanEval, and LiveCodeBench |

Best for Reasoning and Math

| Tier | Model | Why |
| --- | --- | --- |
| Small (8 GB) | Qwen 3.5 4B | Thinking mode, scaled RL |
| Medium (16 GB) | Qwen 3.5 9B | 81.7 GPQA Diamond |
| Large (32 GB) | GPT-OSS 20B | Trained with RL from o3/frontier |
| XL (64 GB+) | Qwen 3.5 122B-A10B | Best reasoning-per-watt |
| Server | Qwen 3.5 397B-A17B | Rivals closed-source on reasoning |

Best for Long Context

| Tier | Model | Context | Why |
| --- | --- | --- | --- |
| Small (8 GB) | Nemotron 3 Nano 4B | 1M tokens | Hybrid Mamba-Transformer |
| Medium (16 GB) | GLM-4.7 Flash | 202K tokens | Agentic, long-task |
| Large (32 GB) | Llama 4 Scout | 10M tokens | Longest context of any open model |
| XL (64 GB+) | Nemotron 3 Super | 1M tokens | Hybrid architecture, agentic |

Best for Multimodal (Text + Images)

| Tier | Model | Modalities |
| --- | --- | --- |
| Small (8 GB) | Qwen 3.5 4B | Text + images |
| Small (8 GB) | Gemma 4 E2B / E4B | Text + images + audio + video (Any-to-Any) |
| Medium (16 GB) | Qwen 3.5 9B | Text + images |
| Large (32 GB) | Llama 4 Scout | Text + images |
| Server | Kimi K2.5 | Native multimodal, 15T mixed tokens |

Best for Writing and Creative Tasks

| Tier | Model | Why |
| --- | --- | --- |
| Medium (16 GB) | Mistral Small 3 | Strong instruction-following, natural prose, good at style matching |
| Large (32 GB) | Qwen 3.5 27B | Large context lets it maintain tone and continuity across long pieces |
| XL (64 GB+) | GPT-OSS 120B | Consistent style, handles complex multi-part prompts, strong at narrative coherence |

Instruct vs. Base Models

Base models predict the next token and are not trained to be helpful. They will continue a prompt as if completing a document, not answer a question.

Instruct models are fine-tuned to follow instructions and hold conversations. For everyday use, always download the instruct variant.

When downloading, look for “Instruct” in the model name (e.g., Llama-3.2-3B-Instruct). HuggingFace model cards label these clearly. The only reason to use a base model is if you are doing further fine-tuning and want the raw weights.

Most models in the Ollama library are instruct variants by default unless the name says otherwise.


Model Families: Who Makes What

Qwen 3.5 (Alibaba)

  • Released: Feb-Mar 2026
  • Sizes: 0.8B, 2B, 4B, 9B, 27B, 35B-A3B, 122B-A10B, 397B-A17B
  • License: Apache 2.0
  • Key features: Multimodal, thinking mode, 262K native context (1M+ extended), 201 languages, native tool calling
  • Links: HuggingFace · GitHub · Ollama

GPT-OSS (OpenAI)

  • Released: Aug 2025
  • Sizes: 20B (3.6B active), 120B (5.1B active)
  • License: Apache 2.0
  • Key features: MoE, trained with RL from o3/frontier, 128K context
  • Links: HuggingFace · GitHub · Ollama

Nemotron 3 (NVIDIA)

  • Released: Mar 2026
  • Sizes: Nano 4B, Nano 30B-A3B, Super 120B (12B active)
  • License: NVIDIA Open Model License
  • Key features: Hybrid Mamba-Transformer MoE, 1M token context, agentic reasoning
  • Links: NVIDIA Developer · Ollama Nano · Ollama Super

Llama 4 (Meta)

  • Released: April 2025
  • Sizes: Scout 109B (17B active), Maverick 400B (40B active)
  • License: Llama 4 Community License
  • Key features: MoE, multimodal, Scout has 10M token context
  • Links: llama.com · Ollama

Gemma 4 (Google DeepMind)

  • Released: April 2026
  • Sizes: 5B (E2B), 8B (E4B), 27B (26B-A4B), 33B (31B)
  • License: Apache 2.0
  • Key features: Any-to-Any multimodal on edge (audio/video/image natively), Per-Layer Embeddings (PLE), Shared KV Cache, exceptional local GUI detection
  • Links: DeepMind Blog · HuggingFace Blog · Ollama

GLM (Zhipu AI / Z.AI)

  • Released: 2025-2026
  • Sizes: GLM-4.7 (358B), GLM-4.7 Flash (30B-A3B), GLM-5 (744B, 40B active)
  • License: MIT
  • Key features: Agent execution, tool calling, long-task coding, 200K+ context
  • Links: Ollama GLM-4.7 · Ollama GLM-5

DeepSeek (DeepSeek AI)

  • Released: 2025-2026
  • Sizes: R1 (1.5B-671B distillations), V3 (671B, 37B active), V3.2
  • License: DeepSeek License
  • Key features: Strong reasoning (R1), top benchmark scores across the board (V3.2), large MoE
  • Links: Ollama V3.2 · Ollama R1

Mistral

  • Released: September 2023 (Mistral 7B), January 2025 (Mistral Small 3)
  • Sizes: Mistral 7B, Small 3 (24B), Small 4 (119B, 6.5B active)
  • License: Apache 2.0
  • Key features: Fast inference, strong , Small 4 has 128 experts
  • Links: Ollama

Kimi K2.5 (Moonshot AI)

  • Released: 2026
  • Sizes: 1T total, 32B active
  • License: MIT
  • Key features: Top scores on coding and reasoning benchmarks, native multimodal, 256K context
  • Links: Ollama

Phi-4 (Microsoft)

  • Released: 2025-2026
  • Sizes: Mini 3.8B, 14B
  • License: MIT
  • Key features: High performance relative to size, reasoning-focused
  • Links: Ollama

Where to Find Models

  • Unsloth Dynamic GGUFs are dynamic quantizations, widely regarded as among the best GGUF variants for most models. The dynamic format allocates more bits to sensitive layers and fewer to layers where precision matters less, giving better quality at the same file size. Start here when downloading GGUFs from HuggingFace.
  • Ollama Library has pre-quantized models with one-command install. Best for getting started quickly.
  • HuggingFace Models is the largest collection. Original weights plus community quantizations (GGUF, GPTQ, AWQ).
  • LM Studio provides a visual model browser with one-click downloads from HuggingFace.

Understanding Model Names

When you see a model name like qwen3.5:9b-q4_K_M, here’s what each part means:

| Part | Meaning | Example |
| --- | --- | --- |
| Family | Who made it and which generation | qwen3.5 = Alibaba’s Qwen, version 3.5 |
| Size | Total parameter count | 9b = 9 billion parameters |
| Active (if MoE) | Parameters used per token | A3B = 3 billion active |
| Quantization | Compression level | q4_K_M = 4-bit, K-quant, medium quality |
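
To make the naming scheme concrete, here is a minimal Python sketch that splits a tag into the parts listed above. The regular expression, `ModelTag`, and `parse_tag` are hypothetical helpers written for this page, not an Ollama API. The sketch assumes the `family:size[-aNb][-quant]` pattern only; tags that do not follow it (for example `gemma4:e2b`) return `None`.

```python
import re
from typing import NamedTuple, Optional


class ModelTag(NamedTuple):
    family: str            # e.g. "qwen3.5"
    size: Optional[str]    # total parameters, e.g. "9b"
    active: Optional[str]  # active parameters for MoE models, e.g. "a3b"
    quant: Optional[str]   # quantization level, e.g. "q4_K_M"


# Assumed pattern: family[:size[-active][-quant]]
_TAG_RE = re.compile(
    r"^(?P<family>[^:]+)"
    r"(?::(?P<size>\d+(?:\.\d+)?[bBtT])"
    r"(?:-(?P<active>[aA]\d+(?:\.\d+)?[bB]))?"
    r"(?:-(?P<quant>[qQ]\d\w*))?)?$"
)


def parse_tag(tag: str) -> Optional[ModelTag]:
    """Split a model tag into its parts, or return None if it does not fit the pattern."""
    match = _TAG_RE.match(tag)
    if match is None:
        return None
    return ModelTag(**match.groupdict())


if __name__ == "__main__":
    for tag in ("qwen3.5:9b-q4_K_M", "qwen3.5:35b-a3b", "phi4-mini", "gemma4:e2b"):
        print(tag, "->", parse_tag(tag))
```

For `qwen3.5:9b-q4_K_M` the parser yields family `qwen3.5`, size `9b`, no active count, and quantization `q4_K_M`. For `qwen3.5:35b-a3b` it yields size `35b` and active `a3b` with no quantization suffix.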

Quantization quick guide:

| Level | Quality | Notes |
| --- | --- | --- |
| Q8 | ~99% of original | Largest file. Near-lossless. |
| Q6_K | ~97% | Slightly smaller, barely any loss. |
| Q5_K_M | ~95% | Good default if you have the space. |
| Q4_K_M | ~93% | The sweet spot for most people. About half the size of Q8. |
| Q3_K | ~85% | Noticeable quality drop. Only if you’re tight on memory. |
| Q2_K | ~70% | Real quality loss. Last resort. |
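
The size side of this trade-off is simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. The Python sketch below uses the nominal bit width in each level's name, which is a simplifying assumption. Real GGUF files come out somewhat larger because K-quants keep some tensors at higher precision and store per-block scales, and running the model also needs memory for context. `NOMINAL_BITS` and `nominal_size_gb` are hypothetical names.

```python
# Nominal bits per weight, read off the level name (an assumption:
# the effective bits per weight of real files are a bit higher).
NOMINAL_BITS = {
    "Q8": 8,
    "Q6_K": 6,
    "Q5_K_M": 5,
    "Q4_K_M": 4,
    "Q3_K": 3,
    "Q2_K": 2,
}


def nominal_size_gb(params_billions: float, level: str) -> float:
    """Lower-bound weight size in GB for a model at a given quantization level."""
    bits = NOMINAL_BITS[level]
    # (params_billions * 1e9 weights) * bits / 8 bytes, expressed in units of 1e9 bytes
    return params_billions * bits / 8


if __name__ == "__main__":
    for level in NOMINAL_BITS:
        print(f"9B at {level}: at least {nominal_size_gb(9, level):.2f} GB")
```

Treat the result as a floor when checking whether a model fits a tier, and leave headroom on top of it.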

Want to understand how quantization works? See Module 2: Compressed Models for the intuition, or Module 3B: How Models Fit for the deep dive.


Emerging Techniques Changing the Game

TurboQuant (Google)

Released late March 2026. Compresses the KV cache to 3 bits with zero accuracy loss, giving 6x memory reduction for context and 8x faster attention computation. The llama.cpp and MLX communities picked it up within days. Long conversations that used to run out of RAM now fit. Research blog

LLM in a Flash (Apple)

Runs models larger than your available RAM by streaming weights from SSD/flash storage. An iPhone 17 Pro ran a 400B MoE model on-device. Coming to more tools soon. Apple Research

Mixture of Experts (MoE) Everywhere

Most new large models use MoE: huge total parameter counts but only a fraction active per token. A “120B” MoE model with 5B active runs at roughly 5B speed and compute cost, but stores knowledge across all 120B parameters. You need RAM for the full 120B, which is why the tables above show models larger than you might expect for a given tier.
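
The split between memory cost and compute cost can be sketched in a few lines of Python. This is an illustration under simple assumptions, not a benchmark: `Model`, `weight_memory_gb`, and `active_fraction` are hypothetical names, memory is estimated from total parameters at a chosen bits-per-weight, and per-token compute is taken to scale with active parameters only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    total_b: float   # total parameters, in billions
    active_b: float  # parameters used per token, in billions

    def weight_memory_gb(self, bits_per_weight: float) -> float:
        """Memory for the weights: every parameter must be stored, active or not."""
        return self.total_b * bits_per_weight / 8

    def active_fraction(self) -> float:
        """Share of the parameters that take part in computing each token."""
        return self.active_b / self.total_b


if __name__ == "__main__":
    moe = Model("120B MoE, 5B active", total_b=120, active_b=5)
    dense = Model("120B dense", total_b=120, active_b=120)
    for m in (moe, dense):
        print(
            f"{m.name}: {m.weight_memory_gb(4):.0f} GB of weights at 4 bits, "
            f"{m.active_fraction():.1%} of parameters active per token"
        )
```

Both models need the same memory for weights at a given bit width, but the MoE model touches only a small fraction of them per token, which is where its speed comes from.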


For the latest in model releases and AI news, see our News Feed and Research Feed.