🖥️ Hardware Reference

Last verified: April 2026. Hardware requirements change as quantization techniques improve (e.g., TurboQuant reducing KV cache memory 6x) and model architectures evolve (MoE models needing more RAM but less compute). If something looks outdated, check our Model Reference for current model sizes.

Quick Answer: What Do I Need?

Your Goal	Minimum	Recommended	Sweet Spot
Try it out (basic chat)	8 GB RAM	16 GB RAM	Any modern machine
Daily driver (good quality)	16 GB RAM	32 GB RAM	Apple M-series or NVIDIA GPU
Power user (large models)	32 GB RAM	64 GB RAM	M4 Pro/Max or RTX 4090
Enthusiast (frontier-local)	64 GB RAM	128 GB+	M4 Ultra or multi-GPU

Already have a machine? Use our Hardware Calculator to see exactly which models fit your specs.

Hardware Tiers

Tier 1: Entry (8 GB RAM)

What you can run: Small models, 1B to 4B parameters at Q4 quantization.

Experience: Solid for simple chat, Q&A, basic summarization. Slower on complex reasoning. Like a quick assistant with broad but shallow knowledge.

Typical hardware:

MacBook Air M1/M2 (8 GB)
Budget Windows laptops
Older desktops with DDR4

Best models at this tier:

Model	RAM Needed	Ollama Command
Qwen 3.5 4B	~3.4 GB	`ollama pull qwen3.5:4b`
Gemma 4 E2B	~3.5 GB	`ollama pull gemma4:e2b`
Gemma 4 E4B	~5.5 GB	`ollama pull gemma4:e4b`
Phi-4 Mini (3.8B)	~3 GB	`ollama pull phi4-mini`
SmolLM3 3B	~2.5 GB	`ollama pull smollm3:3b`

Tier 2: Standard (16 GB RAM)

What you can run: Medium models, 7B to 9B at Q4, or 4B models at Q8 (near-lossless).

Experience: Good conversations, useful coding assistance, moderate reasoning. A good baseline for most users.

Typical hardware:

MacBook Air/Pro M1–M4 (16 GB)
Mid-range Windows laptops
Desktops with DDR4/DDR5

Best models at this tier:

Model	RAM Needed	Ollama Command
Qwen 3.5 9B	~6 GB	`ollama pull qwen3.5:9b`
Llama 3.3 8B	~5 GB	`ollama pull llama3.3:8b`
DeepSeek R1 7B	~5 GB	`ollama pull deepseek-r1:7b`
Phi-4 (14B)	~9 GB	`ollama pull phi4`

Tier 3: Power (32 GB RAM)

What you can run: Large models, 14B to 32B at Q4, or 9B at Q8. MoE models like Qwen 3.5 35B-A3B fit comfortably.

Experience: Near-frontier quality for many tasks. Complex reasoning, detailed code generation, nuanced writing. Local models at this tier can handle most of what you’d otherwise reach for a cloud API to do.

Typical hardware:

MacBook Pro M3/M4 Pro (36 GB)
Gaming desktops with 32 GB DDR5
NVIDIA RTX 3060/4060 (12 GB VRAM) + 32 GB system RAM

Best models at this tier:

Model	RAM Needed	Ollama Command
Qwen 3.5 27B	~16 GB	`ollama pull qwen3.5:27b`
Gemma 4 31B	~20 GB	`ollama pull gemma4:31b`
Gemma 4 26B-A4B (MoE)	~15 GB	`ollama pull gemma4:26b-a4b`
Qwen 3.5 35B-A3B (MoE)	~20 GB	`ollama pull qwen3.5:35b-a3b`
Qwen 3.5 32B-Coder	~19 GB	`ollama pull qwen3.5:32b-coder`

Tier 4: Enthusiast (64 GB RAM)

What you can run: Very large models, 70B at Q4, or 32B at Q8. Larger MoE models like Nemotron 3 Super (120B) or Qwen 3.5 122B-A10B.

Experience: Excellent quality, on par with cloud AI for most tasks. Going beyond 64 GB mainly pays off for specialized workloads.

Typical hardware:

MacBook Pro M4 Max (64 GB) or Mac Studio
High-end desktops with 64 GB DDR5
NVIDIA RTX 4090 (24 GB VRAM) + 64 GB system RAM

Best models at this tier:

Model	RAM Needed	Ollama Command
Llama 3.3 70B	~42 GB	`ollama pull llama3.3:70b`
Qwen 3.5 122B-A10B (MoE)	~48 GB	`ollama pull qwen3.5:122b-a10b`
Nemotron 3 Super (120B MoE)	~45 GB	`ollama pull nemotron-3-super`
DeepSeek R1 70B	~42 GB	`ollama pull deepseek-r1:70b`

Tier 5: Extreme (128 GB+ RAM)

What you can run: The largest open models. Qwen 3 235B-A22B, DeepSeek V3 (671B), Llama 4 Maverick (402B). Frontier-class performance, running locally.

Experience: Matches or approaches cloud frontier models. For most people this is more capability than they’ll ever need, but it’s there.

Typical hardware:

Mac Studio M4 Ultra (192 GB)
Mac Pro
Multi-GPU workstations (2× RTX 4090 = 48 GB VRAM)
High-end server hardware

Best models at this tier:

Model	RAM Needed	Ollama Command
Qwen 3 235B-A22B (MoE)	~130 GB	`ollama pull qwen3:235b-a22b`
Llama 4 Maverick (402B MoE)	~120 GB	`ollama pull llama4-maverick`
DeepSeek V3 (671B MoE)	~200+ GB	Needs llama.cpp directly

Understanding Memory: RAM, VRAM, and Unified

System RAM (all machines)

Your computer’s main memory. This is where models load when you don’t have a dedicated GPU, or when the model is too large for GPU memory.

DDR4: Standard in pre-2023 machines. Slower bandwidth (~50 GB/s for dual-channel).
DDR5: Current standard. Faster (~89 GB/s dual-channel), better for LLMs.
How to check: Mac → About This Mac. Windows → Settings → System → About. Linux → `free -h`.

GPU VRAM (NVIDIA/AMD)

Dedicated memory on your graphics card. Much faster than system RAM for model inference.

NVIDIA GeForce: RTX 3060 (12 GB), RTX 4060 Ti (16 GB), RTX 4090 (24 GB), RTX 5090 (32 GB)
NVIDIA Pro: A6000 (48 GB), H100 (80 GB)
How to check: `nvidia-smi` in terminal, or NVIDIA Settings → System Information

Partial offloading: If a model is too large for your VRAM, part loads on the GPU (fast layers) and the rest stays in system RAM (slower layers). This is automatic in Ollama and llama.cpp. You’ll see something like “32 of 48 layers offloaded to GPU” in the logs.
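To get a feel for how that split might land on your machine, here is a rough sketch. It assumes every layer takes an equal share of the model file and that some VRAM is held back for the KV cache and buffers. The function name `estimate_gpu_layers` and the 1.5 GB reserve are illustrative assumptions, not what Ollama or llama.cpp compute internally.

```python
import math


def estimate_gpu_layers(model_gb: float, n_layers: int, vram_gb: float,
                        reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-sized layers.

    reserve_gb is an assumed allowance for KV cache and scratch buffers.
    """
    if model_gb <= 0 or n_layers <= 0:
        raise ValueError("model_gb and n_layers must be positive")
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # Never report more layers than the model actually has.
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Illustrative inputs: a ~42 GB model with 80 layers on a 24 GB card.
layers_on_gpu = estimate_gpu_layers(model_gb=42, n_layers=80, vram_gb=24)
print(f"{layers_on_gpu} of 80 layers would fit on the GPU")
```

Treat the result as a starting guess only. The engine's own log line is the source of truth for what was actually offloaded.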

Apple Silicon Unified Memory

Apple M-series chips share a single memory pool between CPU and GPU. There’s no separate VRAM. It’s all unified.

Why this matters for LLMs: On a PC, if your model exceeds VRAM, performance falls off a cliff as layers spill to DDR. On Apple Silicon, all memory is equally fast, so there’s no bottleneck between GPU and CPU memory. A Mac with 36 GB unified memory often outperforms a PC with a 24 GB GPU + 64 GB DDR for large models.

Apple Chip	Max Unified Memory	Memory Bandwidth
M1/M2	16–24 GB	100–200 GB/s
M3/M4	16–24 GB	100–150 GB/s
M3/M4 Pro	18–48 GB	150–273 GB/s
M3/M4 Max	36–128 GB	300–546 GB/s
M4 Ultra	192 GB	819 GB/s

Memory Bandwidth: The Hidden Speed Limit

Two machines with identical RAM can run the same model at vastly different speeds. The bottleneck is memory bandwidth: how fast the processor can read model weights from memory.

Token generation is bandwidth-bound: the processor spends most of its time waiting for numbers to arrive from memory, not doing math. More bandwidth = more tokens per second.
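A back-of-the-envelope way to see this: if every generated token requires reading the model's active weights once, then bandwidth divided by the bytes read per token gives a ceiling on tokens per second. The sketch below assumes exactly that and nothing more. The `efficiency` parameter is a hypothetical fudge factor, and real speeds come in below the ceiling.

```python
def max_tokens_per_second(bandwidth_gb_s: float, weights_read_gb: float,
                          efficiency: float = 1.0) -> float:
    """Upper bound on generation speed for a bandwidth-bound model.

    weights_read_gb: GB of weights read per token (roughly the model file
    size for a dense model).
    efficiency: assumed fraction of peak bandwidth actually achieved.
    """
    if bandwidth_gb_s <= 0 or weights_read_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return efficiency * bandwidth_gb_s / weights_read_gb


# A ~42 GB model on dual-channel DDR5 (~89 GB/s) vs. a ~819 GB/s unified-memory chip.
for name, bandwidth in [("DDR5 desktop", 89), ("819 GB/s unified memory", 819)]:
    print(f"{name}: at most {max_tokens_per_second(bandwidth, 42):.1f} tok/s")
```

The ceiling for the DDR5 desktop comes out a little above 2 tok/s, which is why large dense models feel slow on ordinary desktop RAM no matter how fast the CPU is.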

Bandwidth by Hardware Type

Hardware	Bandwidth	7B Q4 Speed	70B Q4 Speed
DDR4 laptop (2-channel, 3200 MT/s)	~50 GB/s	~8 tok/s	Too slow
DDR5 desktop (2-channel, 5600 MT/s)	~89 GB/s	~15 tok/s	~2 tok/s
DDR5 workstation (8-channel)	~358 GB/s	~50 tok/s	~8 tok/s
Apple M3 Max (unified)	~300 GB/s	~40 tok/s	~6 tok/s
Apple M4 Ultra (unified)	~819 GB/s	~110 tok/s	~15 tok/s
NVIDIA RTX 4090 (GDDR6X)	~1,008 GB/s	~140 tok/s	GPU only if fits

Token speeds are approximate for Q4_K_M quantization. Actual rates vary by model architecture, context length, and system load.

Key Insights

Channels matter more than DDR generation. A 12-channel DDR4 server (~307 GB/s) is faster than a 4-channel DDR5 desktop (~192 GB/s). Bumping DDR5 speed from 4800 to 6000 MT/s gives about a 20% speedup.

The M4 Max delivers 546 GB/s in a laptop. Matching that on a PC requires a high-end GPU or a multi-channel workstation.

GPU VRAM is the fastest. If a model fits entirely in your GPU’s VRAM, you get the full GPU bandwidth. The moment it spills to system RAM, your speed is limited by the slower DDR bandwidth for those layers.
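The channel arithmetic behind these figures can be sketched in a few lines. This assumes the usual 64-bit (8-byte) data path per memory channel and reports theoretical peak bandwidth. Sustained bandwidth is lower in practice, and the function name is just for illustration.

```python
def peak_bandwidth_gb_s(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    channels:  number of populated memory channels
    mt_per_s:  transfer rate in MT/s (e.g. 3200 for DDR4-3200)
    bus_bytes: bytes per transfer per channel (8 assumes a 64-bit channel)
    """
    return channels * mt_per_s * bus_bytes / 1000


configs = {
    "DDR4-3200, 2 channels": (2, 3200),
    "DDR5-5600, 2 channels": (2, 5600),
    "DDR5-6000, 4 channels": (4, 6000),
    "DDR4-3200, 12 channels": (12, 3200),
}
for label, (channels, rate) in configs.items():
    print(f"{label}: {peak_bandwidth_gb_s(channels, rate):.1f} GB/s")
```

Because channel count multiplies the whole result, adding channels moves the number far more than a modest bump in MT/s does.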

The Memory Equation

When you’re checking if a model fits, here’s the full picture:

Total RAM needed  =  Model Weights  +  KV Cache  +  OS/Apps  +  Overhead

Component	Size	Notes
Model weights	File size you downloaded	The largest component
KV cache	Grows with conversation	256K context on a 9B model ≈ 2-4 GB
OS + other apps	3-4 GB	More if you’re running a browser, code editor, etc.
Overhead	~10% of model size	Buffers, scratch space, llama.cpp internals

Quick Rule

Model file size should be ≤ 75% of your available RAM (total RAM minus OS/apps). This leaves room for KV cache and overhead.

Total RAM	Available	Max Comfortable Model	Example
8 GB	~5 GB	~3.5 GB	Qwen 3.5 4B at Q4
16 GB	~12 GB	~9 GB	Qwen 3.5 9B at Q4
32 GB	~28 GB	~20 GB	Qwen 3.5 27B at Q4
64 GB	~60 GB	~45 GB	Llama 3.3 70B at Q4
128 GB	~120 GB	~90 GB	Qwen 3 235B-A22B at Q4
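If you prefer to run the numbers yourself, the memory equation and the 75% rule can be written as a small sketch. The 4 GB default for OS and apps and the 10% overhead come from the table above. The class and function names are made up for this example, and the KV cache figure is something you supply.

```python
from dataclasses import dataclass


@dataclass
class MemoryBudget:
    total_ram_gb: float
    os_apps_gb: float = 4.0  # assumed OS + background apps

    @property
    def available_gb(self) -> float:
        return self.total_ram_gb - self.os_apps_gb

    @property
    def max_model_gb(self) -> float:
        # Quick rule: model file <= 75% of available RAM.
        return 0.75 * self.available_gb


def total_needed_gb(weights_gb: float, kv_cache_gb: float,
                    os_apps_gb: float = 4.0, overhead_frac: float = 0.10) -> float:
    """Total RAM = weights + KV cache + OS/apps + overhead (~10% of weights)."""
    return weights_gb + kv_cache_gb + os_apps_gb + overhead_frac * weights_gb


def fits(weights_gb: float, kv_cache_gb: float, budget: MemoryBudget) -> bool:
    needed = total_needed_gb(weights_gb, kv_cache_gb, budget.os_apps_gb)
    return needed <= budget.total_ram_gb


budget = MemoryBudget(total_ram_gb=16)
print(f"Max comfortable model: {budget.max_model_gb:.1f} GB")
print("6 GB model with 2 GB KV cache fits:", fits(6, 2, budget))
```

The quick rule and the full equation will not always agree to the decimal. The rule is just the equation with typical KV cache and overhead values folded in.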

MoE Models: The Catch

MoE (Mixture of Experts) models load ALL their parameters into memory, even though only a fraction activate per token. A 235B MoE with 22B active still needs ~130 GB. You get 22B compute speed but 235B memory cost. That 22B of active compute draws on 235B of knowledge, which is why output quality is higher than the active count suggests.
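The memory-versus-compute split is easy to see with a little arithmetic. The sketch below assumes roughly 4.5 bits per weight for a Q4 quantization, which is an approximation. Actual Q4 variants differ, and the function names are illustrative.

```python
def weights_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of quantized weights in GB (assumed bits per weight)."""
    return params_billions * bits_per_weight / 8


def moe_footprint(total_params_b: float, active_params_b: float,
                  bits_per_weight: float = 4.5) -> dict:
    """Memory is set by TOTAL parameters; per-token reads by ACTIVE ones."""
    return {
        "memory_gb": weights_gb(total_params_b, bits_per_weight),
        "read_per_token_gb": weights_gb(active_params_b, bits_per_weight),
    }


footprint = moe_footprint(total_params_b=235, active_params_b=22)
print(f"Must fit in RAM: ~{footprint['memory_gb']:.0f} GB")
print(f"Read per token:  ~{footprint['read_per_token_gb']:.0f} GB")
```

The first number decides whether the model loads at all. The second is what the memory bandwidth has to move for each generated token.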

Inference Engines by Platform

Different engines work best on different hardware. Here’s what to use where:

Your Hardware	Best Engine	Why
Apple Silicon (any)	Ollama or MLX	Unified memory, Metal GPU acceleration
NVIDIA GPU (fits in VRAM)	Ollama or vLLM	Full GPU acceleration via CUDA
NVIDIA GPU (partial offload)	Ollama or llama.cpp	Split layers between GPU and CPU
CPU only (Intel/AMD)	Ollama	AVX2/AVX-512 optimizations in llama.cpp
Multi-GPU	vLLM or llama.cpp	Tensor parallelism across GPUs

All engines use GGUF files (except vLLM, which also supports SafeTensors). A model downloaded for one engine works with any other.

New to this? Just use Ollama. It auto-detects your hardware and configures GPU offloading. See Module 1 for setup.

GPU Buying Guide

If you’re considering a GPU purchase specifically for local AI:

NVIDIA (Recommended for LLMs)

GPU	VRAM	Bandwidth	Price Range	Best For
RTX 4060 Ti 16GB	16 GB	288 GB/s	~$400	Budget. Fits 14B at Q4.
RTX 4090	24 GB	1,008 GB/s	~$1,600	Sweet spot. Fits 27B at Q4. Very fast.
RTX 5090	32 GB	~1,792 GB/s	~$2,000	Latest gen. Fits 32B+ at Q4.
RTX A6000	48 GB	768 GB/s	~$4,000	Pro card. Fits 70B at Q4.

AMD

AMD GPUs work with llama.cpp via ROCm, but NVIDIA’s CUDA ecosystem is more mature and better optimized for LLMs. If you already have an AMD GPU, it will work. If you’re buying specifically for AI though, NVIDIA is the safer bet.

Intel

Intel Arc GPUs have experimental SYCL support in llama.cpp. Not recommended as a primary LLM GPU yet.

When NOT to Buy a GPU

You have Apple Silicon with 32 GB+ unified memory. The bandwidth is already excellent and everything just works. A GPU won’t help. Macs don’t use external GPUs for LLM inference.
You just want to chat casually. A 9B model on CPU with 16 GB RAM is fast enough for conversation.
You’d be better off with more RAM. On a DDR5 desktop, going from 32 GB to 64 GB of RAM (to fit a 70B model) can be cheaper and more versatile than adding a GPU.

Emerging Technologies

Things that are changing the hardware equation right now:

TurboQuant (Google)

Compresses the KV cache to 3 bits with no accuracy loss. That’s a 6x reduction in context memory.

Why it works: Normal quantization (rounding numbers to fewer bits of precision) performs badly on the vectors inside transformer attention, because those vectors tend to have one or two massive values and everything else near zero. When you round a vector like [0.0001, 0.9999, 0.0003], the big value snaps to 1 and the rest snap to 0. You lose most of the information.

TurboQuant’s trick: before quantizing, randomly rotate the vector in its high-dimensional space. A random rotation spreads the weight evenly across all components, so no single component dominates. Quantization works much better on evenly-distributed values. During inference, the rotation is reversed (counter-rotated) to recover the original meaning.

The paper adds a second step that corrects biases when computing attention dot products with quantized vectors, but the rotation is what makes 3-bit KV cache practical.
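The spreading effect of a random rotation is easy to demonstrate on its own. The sketch below is not TurboQuant: it does no quantization and no bias correction. It only builds a random orthogonal matrix (via Gram-Schmidt, an assumption made for this example), applies it to a vector dominated by one value, and checks that the transform can be undone. All names are illustrative.

```python
import math
import random


def random_rotation(dim: int, rng: random.Random) -> list:
    """Random orthonormal matrix (as rows) via Gram-Schmidt on Gaussian vectors."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:  # remove components along rows already chosen
            proj = sum(a * b for a, b in zip(v, r))
            v = [a - proj * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows


def rotate(q: list, x: list) -> list:
    """Apply the transform: y = Q x."""
    return [sum(a * b for a, b in zip(row, x)) for row in q]


def rotate_back(q: list, y: list) -> list:
    """Undo the transform: x = Q^T y (valid because Q is orthonormal)."""
    dim = len(y)
    return [sum(q[i][j] * y[i] for i in range(dim)) for j in range(dim)]


def peak_share(x: list) -> float:
    """Fraction of the vector's energy held by its largest component."""
    return max(a * a for a in x) / sum(a * a for a in x)


rng = random.Random(0)
dim = 16
x = [0.0002] * dim
x[1] = 0.9999  # one component dominates, as in the example above

q = random_rotation(dim, rng)
y = rotate(q, x)
x_back = rotate_back(q, y)

print(f"peak share before: {peak_share(x):.4f}")
print(f"peak share after:  {peak_share(y):.4f}")
print(f"max round-trip error: {max(abs(a - b) for a, b in zip(x, x_back)):.2e}")
```

After the transform, no single component carries nearly all of the energy, and applying the transpose recovers the original vector up to floating-point error.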

What it means for you: The model file stays the same size, but long conversations eat far less RAM. Working implementations appeared within a week of the paper. As this becomes available in the tools you already use, 128K+ context windows will fit on hardware where they previously ran out of memory.

LLM in a Flash (Apple)

Streams model weights from SSD instead of loading everything into RAM. Apple demonstrated a 400B MoE model running on an iPhone 17 Pro. Still early and performance depends heavily on SSD speed, but it changes the “will it fit?” question from “do I have enough RAM?” to “do I have enough storage?”

Speculative Decoding

Uses a tiny “draft” model to predict multiple tokens, then verifies them with the large model in one pass. Can deliver 2–3x speedup with no quality loss. Already available in llama.cpp (`--draft` flag) and vLLM.
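The draft-and-verify loop can be sketched with toy stand-ins for the two models. Everything here is hypothetical: `target_next` and `draft_next` are simple arithmetic functions rather than real models, decoding is greedy, and the verification runs as a Python loop where a real engine would score all drafted positions in a single batched pass.

```python
from typing import List, Tuple


def target_next(tokens: List[int]) -> int:
    """Toy 'large model': deterministic next token (hypothetical)."""
    return (tokens[-1] * 3 + 1) % 17


def draft_next(tokens: List[int]) -> int:
    """Toy 'draft model': agrees with the target except after multiples of 5."""
    guess = target_next(tokens)
    return (guess + 1) % 17 if tokens[-1] % 5 == 0 else guess


def greedy_generate(prompt: List[int], n_new: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_generate(prompt: List[int], n_new: int,
                         k: int = 4) -> Tuple[List[int], int]:
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1. The draft model cheaply proposes k tokens.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target model checks the proposals (one pass in a real engine).
        target_passes += 1
        accepted: List[int] = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            if proposed == expected:
                accepted.append(proposed)
            else:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
        else:
            # All k proposals accepted: the target's pass yields one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:len(prompt) + n_new], target_passes


output, passes = speculative_generate([1], n_new=16)
assert output == greedy_generate([1], n_new=16)  # identical text, fewer big-model passes
print(f"{passes} target passes for 16 tokens")
```

The assertion is the point of the technique: the output matches what the large model would have produced alone, and the saving comes from needing fewer passes through it.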

Tools

Hardware Calculator - Enter your specs, see which models fit and estimated speed
Model Reference - Current model recommendations by use case and tier
Ollama - Easiest way to get started
LM Studio - Visual model browser and runner

How to Check Your Specs

RAM:

Mac: Apple menu → About This Mac → Memory
Windows: Settings → System → About → Installed RAM
Linux: `free -h` (look at “total” column)
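If you would rather check from a script, a minimal sketch using only the Python standard library is below. It relies on `os.sysconf`, which is available on Linux and macOS but not on Windows, so the function returns `None` when it cannot tell. The function name is illustrative.

```python
import os
from typing import Optional


def total_ram_gb() -> Optional[float]:
    """Total physical RAM in GiB, or None if it cannot be determined."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing on Windows; names may be unsupported elsewhere.
        return None
    if pages <= 0 or page_size <= 0:
        return None
    return pages * page_size / 1024 ** 3


ram = total_ram_gb()
print("Could not detect RAM" if ram is None else f"Total RAM: {ram:.1f} GiB")
```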

GPU and VRAM:

NVIDIA: Run `nvidia-smi` in terminal
Mac: Apple menu → About This Mac → displays the chip (M1/M2/M3/M4, unified memory, no separate VRAM)
Windows: Task Manager → Performance → GPU → Dedicated GPU Memory
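For NVIDIA cards, the same information can be read from a script. The sketch below calls `nvidia-smi` with its CSV query options and parses the result. It returns an empty list when `nvidia-smi` is not installed. The helper names and the sample line in the comment are illustrative.

```python
import shutil
import subprocess
from typing import List, Tuple


def parse_gpu_csv(text: str) -> List[Tuple[str, float]]:
    """Parse lines like 'NVIDIA GeForce RTX 4090, 24564' into (name, VRAM in GiB)."""
    gpus = []
    for line in text.strip().splitlines():
        name, _, mem_mib = line.rpartition(",")
        if name and mem_mib.strip().isdigit():
            gpus.append((name.strip(), int(mem_mib.strip()) / 1024))
    return gpus


def query_nvidia_gpus() -> List[Tuple[str, float]]:
    """Return detected NVIDIA GPUs, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_csv(result.stdout)


for name, vram_gib in query_nvidia_gpus():
    print(f"{name}: {vram_gib:.1f} GiB VRAM")
```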

Memory Bandwidth:

Mac: Determined by your chip (see Apple Silicon table above)
PC: Depends on RAM type (DDR4/DDR5), speed (MT/s), and number of channels. Check with CPU-Z (Windows) or `dmidecode -t memory` (Linux)