Hardware Reference
Last verified: April 2026. Hardware requirements change as quantization techniques improve (e.g., TurboQuant reducing KV cache memory 6x) and model architectures evolve (MoE models needing more RAM but less compute). If something looks outdated, check our Model Reference for current model sizes.
Quick Answer: What Do I Need?
| Your Goal | Minimum | Recommended | Sweet Spot |
|---|---|---|---|
| Try it out (basic chat) | 8 GB RAM | 16 GB RAM | Any modern machine |
| Daily driver (good quality) | 16 GB RAM | 32 GB RAM | Apple M-series or NVIDIA GPU |
| Power user (large models) | 32 GB RAM | 64 GB RAM | M4 Pro/Max or RTX 4090 |
| Enthusiast (frontier-local) | 64 GB RAM | 128 GB+ | M4 Ultra or multi-GPU |
Already have a machine? Use our Hardware Calculator to see exactly which models fit your specs.
Hardware Tiers
Tier 1: Entry (8 GB RAM)
What you can run: Small models, 1B to 4B parameters at Q4 quantization.
Experience: Solid for simple chat, Q&A, basic summarization. Slower on complex reasoning. Like a quick assistant with broad but shallow knowledge.
Typical hardware:
- MacBook Air M1/M2 (8 GB)
- Budget Windows laptops
- Older desktops with DDR4
Best models at this tier:
| Model | RAM Needed | Ollama Command |
|---|---|---|
| Qwen 3.5 4B | ~3.4 GB | ollama pull qwen3.5:4b |
| Gemma 4 E2B | ~3.5 GB | ollama pull gemma4:e2b |
| Gemma 4 E4B | ~5.5 GB | ollama pull gemma4:e4b |
| Phi-4 Mini (3.8B) | ~3 GB | ollama pull phi4-mini |
| SmolLM3 3B | ~2.5 GB | ollama pull smollm3:3b |
Tier 2: Standard (16 GB RAM)
What you can run: Medium models, 7B to 9B at Q4, or 4B models at Q8 (near-lossless).
Experience: Good conversations, useful coding assistance, moderate reasoning. A good baseline for most users.
Typical hardware:
- MacBook Air/Pro M1–M4 (16 GB)
- Mid-range Windows laptops
- Desktops with DDR4/DDR5
Best models at this tier:
| Model | RAM Needed | Ollama Command |
|---|---|---|
| Qwen 3.5 9B | ~6 GB | ollama pull qwen3.5:9b |
| Llama 3.1 8B | ~5 GB | ollama pull llama3.1:8b |
| DeepSeek R1 7B | ~5 GB | ollama pull deepseek-r1:7b |
| Phi-4 (14B) | ~9 GB | ollama pull phi4 |
Tier 3: Power (32 GB RAM)
What you can run: Large models, 14B to 32B at Q4, or 9B at Q8. MoE models like Qwen 3.5 35B-A3B fit comfortably.
Experience: Near-frontier quality for many tasks. Complex reasoning, detailed code generation, nuanced writing. Local models at this tier can handle most of what you’d otherwise reach for a cloud API to do.
Typical hardware:
- MacBook Pro M3 Pro (36 GB) or M4 Pro (48 GB)
- Gaming desktops with 32 GB DDR5
- NVIDIA RTX 3060 (12 GB VRAM) or RTX 4060 (8 GB VRAM) + 32 GB system RAM
Best models at this tier:
| Model | RAM Needed | Ollama Command |
|---|---|---|
| Qwen 3.5 27B | ~16 GB | ollama pull qwen3.5:27b |
| Gemma 4 31B | ~20 GB | ollama pull gemma4:31b |
| Gemma 4 26B-A4B (MoE) | ~15 GB | ollama pull gemma4:26b-a4b |
| Qwen 3.5 35B-A3B (MoE) | ~20 GB | ollama pull qwen3.5:35b-a3b |
| Qwen 3.5 32B-Coder | ~19 GB | ollama pull qwen3.5:32b-coder |
Tier 4: Enthusiast (64 GB RAM)
What you can run: Very large models, 70B at Q4, or 32B at Q8. Larger MoE models like Nemotron 3 Super (120B) or Qwen 3.5 122B-A10B.
Experience: Excellent quality, on par with cloud AI for most tasks. Going beyond 64 GB mainly pays off for specialized workloads.
Typical hardware:
- MacBook Pro M4 Max (64 GB) or Mac Studio
- High-end desktops with 64 GB DDR5
- NVIDIA RTX 4090 (24 GB VRAM) + 64 GB system RAM
Best models at this tier:
| Model | RAM Needed | Ollama Command |
|---|---|---|
| Llama 3.3 70B | ~42 GB | ollama pull llama3.3:70b |
| Qwen 3.5 122B-A10B (MoE) | ~48 GB | ollama pull qwen3.5:122b-a10b |
| Nemotron 3 Super (120B MoE) | ~45 GB | ollama pull nemotron-3-super |
| DeepSeek R1 70B | ~42 GB | ollama pull deepseek-r1:70b |
Tier 5: Extreme (128 GB+ RAM)
What you can run: The largest open models. Qwen 3 235B-A22B, DeepSeek V3 (671B), Llama 4 Maverick (402B). Frontier-class performance, running locally.
Experience: Matches or approaches cloud frontier models. For most people this is more capability than they’ll ever need, but it’s there.
Typical hardware:
- Mac Studio M4 Ultra (192 GB)
- Mac Pro
- Multi-GPU workstations (2× RTX 4090 = 48 GB VRAM)
- High-end server hardware
Best models at this tier:
| Model | RAM Needed | Ollama Command |
|---|---|---|
| Qwen 3 235B-A22B (MoE) | ~130 GB | ollama pull qwen3:235b-a22b |
| Llama 4 Maverick (402B MoE) | ~120 GB | ollama pull llama4-maverick |
| DeepSeek V3 (671B MoE) | ~200+ GB | Needs llama.cpp directly |
Understanding Memory: RAM, VRAM, and Unified
System RAM (all machines)
Your computer’s main memory. This is where models load when you don’t have a dedicated GPU, or when the model is too large for GPU memory.
- DDR4: Standard in pre-2023 machines. Slower bandwidth (~50 GB/s for dual-channel).
- DDR5: Current standard. Faster (~89 GB/s dual-channel), better for LLMs.
- How to check: Mac → About This Mac. Windows → Settings → System → About. Linux → `free -h`.
GPU VRAM (NVIDIA/AMD)
Dedicated memory on your graphics card. Much faster than system RAM for model inference.
- NVIDIA GeForce: RTX 3060 (12 GB), RTX 4060 Ti (16 GB), RTX 4090 (24 GB), RTX 5090 (32 GB)
- NVIDIA Pro: A6000 (48 GB), H100 (80 GB)
- How to check: `nvidia-smi` in terminal, or NVIDIA Settings → System Information
Partial offloading: If a model is too large for your VRAM, part loads on the GPU (fast layers) and the rest stays in system RAM (slower layers). This is automatic in Ollama and llama.cpp. You’ll see something like “32 of 48 layers offloaded to GPU” in the logs.
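As a rough illustration of that split, the sketch below estimates how many layers land on the GPU. It assumes every layer is the same size and ignores the KV cache and scratch buffers, which real engines also have to place in VRAM. `layers_on_gpu` and its parameters are hypothetical names for this example, not part of Ollama or llama.cpp.

```python
def layers_on_gpu(model_gb: float, n_layers: int, vram_budget_gb: float) -> int:
    """Estimate how many equally sized layers fit in a VRAM budget."""
    per_layer_gb = model_gb / n_layers
    fit = int(vram_budget_gb // per_layer_gb)
    return min(n_layers, fit)  # never report more layers than the model has


# Hypothetical 36 GB model with 48 layers and 24 GB of usable VRAM.
print(layers_on_gpu(36.0, 48, 24.0), "of 48 layers on the GPU")
```

Ollama picks this number for you. In llama.cpp you can also set it by hand with `-ngl` / `--n-gpu-layers` if the automatic choice leaves VRAM unused or runs out of memory.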
Apple Silicon Unified Memory
Apple M-series chips share a single memory pool between CPU and GPU. There’s no separate VRAM. It’s all unified.
Why this matters for LLMs: On a PC, if your model exceeds VRAM, performance falls off a cliff as layers spill to DDR. On Apple Silicon, the GPU reads the whole memory pool at the same speed, so there’s no bottleneck between GPU and CPU memory. For models too large for a 24 GB GPU, a Mac with 36 GB unified memory often outperforms a PC with that GPU + 64 GB DDR.
| Apple Chip | Max Unified Memory | Memory Bandwidth |
|---|---|---|
| M1/M2 | 16–24 GB | 68–100 GB/s |
| M3/M4 | 16–32 GB | 100–120 GB/s |
| M3/M4 Pro | 18–64 GB | 150–273 GB/s |
| M3/M4 Max | 36–128 GB | 300–546 GB/s |
| M4 Ultra | 192 GB | 819 GB/s |
Memory Bandwidth: The Hidden Speed Limit
Two machines with identical RAM can run the same model at vastly different speeds. The bottleneck is memory bandwidth: how fast the processor can read model weights from memory.
Token generation is bandwidth-bound: the processor spends most of its time waiting for numbers to arrive from memory, not doing math. More bandwidth = more tokens per second.
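A back-of-the-envelope ceiling follows from this: if each generated token requires reading every weight once, tokens per second cannot exceed bandwidth divided by model size. The sketch below applies that to a dense model. The 4.5 GB file size for a 7B model at Q4 is an assumption, and `max_tokens_per_second` is a hypothetical helper name.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound for a dense model: each token reads all weights once."""
    return bandwidth_gb_s / model_gb


# Assumed file size: a 7B model at Q4 is taken to be about 4.5 GB.
MODEL_GB = 4.5
for name, bandwidth in [
    ("DDR4 dual-channel", 50),
    ("DDR5 dual-channel", 89),
    ("RTX 4090", 1008),
]:
    ceiling = max_tokens_per_second(bandwidth, MODEL_GB)
    print(f"{name}: at most {ceiling:.0f} tok/s")
```

Measured speeds land below this ceiling because of compute, KV cache reads, and scheduling overhead, so treat it as a sanity check and not a prediction.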
Bandwidth by Hardware Type
| Hardware | Bandwidth | 7B Q4 Speed | 70B Q4 Speed |
|---|---|---|---|
| DDR4 laptop (2-channel, 3200 MT/s) | ~50 GB/s | ~8 tok/s | Too slow |
| DDR5 desktop (2-channel, 5600 MT/s) | ~89 GB/s | ~15 tok/s | ~2 tok/s |
| DDR5 workstation (8-channel) | ~358 GB/s | ~50 tok/s | ~8 tok/s |
| Apple M3 Max (unified) | ~300 GB/s | ~40 tok/s | ~6 tok/s |
| Apple M4 Ultra (unified) | ~819 GB/s | ~110 tok/s | ~15 tok/s |
| NVIDIA RTX 4090 (GDDR6X) | ~1,008 GB/s | ~140 tok/s | Doesn’t fit in 24 GB VRAM |
Token speeds are approximate for Q4_K_M quantization. Actual rates vary by model architecture, context length, and system load.
Key Insights
Channels matter more than DDR generation. A 12-channel DDR4 server (~307 GB/s) is faster than a 4-channel DDR5 desktop (~192 GB/s). Bumping DDR5 speed from 4800 to 6000 MT/s raises peak bandwidth by 25%, which works out to about a 20% speedup in practice.
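The bandwidth figures here come from simple arithmetic: transfers per second, times 8 bytes per transfer (each DDR channel is 64 bits wide), times the number of channels. A minimal sketch, with `ddr_peak_bandwidth_gb_s` as a hypothetical helper name:

```python
def ddr_peak_bandwidth_gb_s(mt_per_s: int, channels: int, bus_bytes: int = 8) -> float:
    """Peak bandwidth = transfers per second x bytes per transfer x channels."""
    # MT/s times bytes gives MB/s; divide by 1000 for GB/s.
    return mt_per_s * bus_bytes * channels / 1000


print(ddr_peak_bandwidth_gb_s(3200, 12))  # 12-channel DDR4-3200
print(ddr_peak_bandwidth_gb_s(6000, 4))   # 4-channel DDR5-6000
# Ratio of peak bandwidth when moving from DDR5-4800 to DDR5-6000:
print(ddr_peak_bandwidth_gb_s(6000, 2) / ddr_peak_bandwidth_gb_s(4800, 2))
```

These are theoretical peaks. Sustained bandwidth, and therefore token speed, is usually somewhat lower.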
The M4 Max delivers 546 GB/s in a laptop. Matching that on a PC requires a high-end GPU or a multi-channel workstation.
GPU VRAM is the fastest. If a model fits entirely in your GPU’s VRAM, you get the full GPU bandwidth. The moment it spills to system RAM, your speed is limited by the slower DDR bandwidth for those layers.
The Memory Equation
When you’re checking if a model fits, here’s the full picture:
Total RAM needed = Model Weights + KV Cache + OS/Apps + Overhead
| Component | Size | Notes |
|---|---|---|
| Model weights | File size you downloaded | The largest component |
| KV cache | Grows with conversation | 256K context on a 9B model ≈ 2-4 GB |
| OS + other apps | 3-4 GB | More if you’re running a browser, code editor, etc. |
| Overhead | ~10% of model size | Buffers, scratch space, llama.cpp internals |
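The equation is easy to turn into a quick check. The sketch below plugs in the table's rough figures (4 GB for OS and apps, 10% overhead) as defaults. Both defaults and the function name are assumptions for illustration, and the KV cache size is something you supply yourself.

```python
def total_ram_needed_gb(
    model_gb: float,
    kv_cache_gb: float,
    os_apps_gb: float = 4.0,
    overhead_fraction: float = 0.10,
) -> float:
    """Model weights + KV cache + OS/apps + overhead (about 10% of model size)."""
    return model_gb + kv_cache_gb + os_apps_gb + overhead_fraction * model_gb


# A 6 GB model file with 3 GB of KV cache on a 16 GB machine.
need = total_ram_needed_gb(model_gb=6.0, kv_cache_gb=3.0)
print(f"{need:.1f} GB needed, fits in 16 GB: {need <= 16}")
```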
Quick Rule
Model file size should be ≤ 75% of your available RAM (total RAM minus OS/apps). This leaves room for KV cache and overhead.
| Total RAM | Available | Max Comfortable Model | Example |
|---|---|---|---|
| 8 GB | ~5 GB | ~3.5 GB | Qwen 3.5 4B at Q4 |
| 16 GB | ~12 GB | ~9 GB | Qwen 3.5 9B at Q4 |
| 32 GB | ~28 GB | ~20 GB | Qwen 3.5 27B at Q4 |
| 64 GB | ~60 GB | ~45 GB | Llama 3.3 70B at Q4 |
| 128 GB | ~120 GB | ~90 GB | Qwen 3 235B-A22B at Q4 |
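The same rule as a one-liner, for RAM sizes not in the table. This sketch assumes a flat 4 GB reserved for the OS and apps, so its results differ slightly from the rounded table values. `max_comfortable_model_gb` is a hypothetical name.

```python
def max_comfortable_model_gb(
    total_ram_gb: float, os_apps_gb: float = 4.0, fraction: float = 0.75
) -> float:
    """Largest model file that leaves room for KV cache and overhead."""
    available = total_ram_gb - os_apps_gb
    return fraction * available


for ram in (16, 32, 64, 128):
    print(f"{ram} GB RAM -> model file up to ~{max_comfortable_model_gb(ram):.0f} GB")
```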
MoE Models: The Catch
MoE (Mixture of Experts) models load ALL their parameters into memory, even though only a fraction activate per token. A 235B MoE with 22B active still needs ~130 GB. You get 22B compute speed but 235B memory cost. That 22B of active compute draws on 235B of knowledge, which is why output quality is higher than the active count suggests.
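The arithmetic behind that trade-off can be sketched as follows. The 4.5 bits per weight is an assumed average for a Q4-style quantization, not a figure from any specific model file, and `moe_footprint` is a hypothetical helper.

```python
def moe_footprint(
    total_params_b: float, active_params_b: float, bits_per_weight: float = 4.5
) -> tuple[float, float]:
    """Return (GB resident in memory, GB of weights read per generated token)."""
    bytes_per_weight = bits_per_weight / 8
    resident_gb = total_params_b * bytes_per_weight        # every expert stays loaded
    read_per_token_gb = active_params_b * bytes_per_weight  # only active experts are read
    return resident_gb, read_per_token_gb


resident, per_token = moe_footprint(total_params_b=235, active_params_b=22)
print(f"resident: ~{resident:.0f} GB, read per token: ~{per_token:.0f} GB")
```

The first number decides whether the model fits. The second is what the memory bus has to deliver for each token, which is why a large MoE can generate faster than a dense model of the same file size.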
Inference Engines by Platform
Different engines work best on different hardware. Here’s what to use where:
| Your Hardware | Best Engine | Why |
|---|---|---|
| Apple Silicon (any) | Ollama or MLX | Unified memory, Metal GPU acceleration |
| NVIDIA GPU (fits in VRAM) | Ollama or vLLM | Full GPU acceleration via CUDA |
| NVIDIA GPU (partial offload) | Ollama or llama.cpp | Split layers between GPU and CPU |
| CPU only (Intel/AMD) | Ollama | AVX2/AVX-512 optimizations in llama.cpp |
| Multi-GPU | vLLM or llama.cpp | Tensor parallelism across GPUs |
All engines use GGUF files (except vLLM, which also supports SafeTensors). A GGUF model downloaded for one engine works with any other GGUF-based engine.
New to this? Just use Ollama. It auto-detects your hardware and configures GPU offloading. See Module 1 for setup.
GPU Buying Guide
If you’re considering a GPU purchase specifically for local AI:
NVIDIA (Recommended for LLMs)
| GPU | VRAM | Bandwidth | Price Range | Best For |
|---|---|---|---|---|
| RTX 4060 Ti 16GB | 16 GB | 288 GB/s | ~$400 | Budget. Fits 14B at Q4. |
| RTX 4090 | 24 GB | 1,008 GB/s | ~$1,600 | Sweet spot. Fits 27B at Q4. Very fast. |
| RTX 5090 | 32 GB | ~1,792 GB/s | ~$2,000 | Latest gen. Fits 32B+ at Q4. |
| RTX A6000 | 48 GB | 768 GB/s | ~$4,000 | Pro card. Fits 70B at Q4. |
AMD
AMD GPUs work with llama.cpp via ROCm, but NVIDIA’s CUDA ecosystem is more mature and better optimized for LLMs. If you already have an AMD GPU, it will work. If you’re buying specifically for AI though, NVIDIA is the safer bet.
Intel
Intel Arc GPUs have experimental SYCL support in llama.cpp. Not recommended as a primary LLM GPU yet.
When NOT to Buy a GPU
- You have Apple Silicon with 32 GB+ unified memory. The bandwidth is already excellent and everything just works. A GPU won’t help. Macs don’t use external GPUs for LLM inference.
- You just want to chat casually. A 9B model on CPU with 16 GB RAM is fast enough for conversation.
- You’d be better off with more RAM. On a DDR5 desktop, going from 32 GB to 64 GB of RAM (to fit a 70B model) can be cheaper and more versatile than adding a GPU.
Emerging Technologies
Things that are changing the hardware equation right now:
TurboQuant (Google)
Compresses the KV cache to 3 bits with no accuracy loss. That’s a 6x reduction in context memory.
Why it works: Normal quantization (rounding numbers to fewer bits of precision) performs badly on the vectors inside transformer attention, because those vectors tend to have one or two massive values and everything else near zero. When you round a vector like [0.0001, 0.9999, 0.0003], the big value snaps to 1 and the rest snap to 0. You lose most of the information.
TurboQuant’s trick: before quantizing, randomly rotate the vector in its high-dimensional space. A random rotation spreads the weight evenly across all components, so no single component dominates. Quantization works much better on evenly-distributed values. During inference, the rotation is reversed (counter-rotated) to recover the original meaning.
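The effect of rotating before rounding can be seen in a toy experiment. The sketch below is not TurboQuant's actual algorithm: it uses a plain symmetric 3-bit uniform quantizer, a random orthogonal matrix built with Gram-Schmidt, and a made-up 64-dimensional vector with one dominant component. All function names are hypothetical.

```python
import math
import random


def random_rotation(n, rng):
    """Random orthogonal matrix built by Gram-Schmidt on Gaussian rows."""
    rows = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for u in rows:  # remove the components along earlier rows
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def rotate(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def rotate_back(m, y):
    # The inverse of an orthogonal matrix is its transpose.
    n = len(y)
    return [sum(m[i][j] * y[i] for i in range(n)) for j in range(n)]


def quantize(x, bits=3):
    """Symmetric uniform quantizer scaled to the largest magnitude."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / levels
    return [round(v / scale) * scale for v in x]


def squared_error(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(64)]
x[5] = 10.0  # one dominant component stretches the quantization grid

rot = random_rotation(64, rng)
direct_err = squared_error(x, quantize(x))
rotated_err = squared_error(x, rotate_back(rot, quantize(rotate(rot, x))))
print(f"direct: {direct_err:.2f}  with rotation: {rotated_err:.2f}")
```

Without rotation, the outlier sets the grid spacing and most of the smaller components round to zero. After rotation the energy is spread across all 64 components, the grid is finer, and the reconstruction error drops.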
The paper adds a second step that corrects biases when computing attention dot products with quantized vectors, but the rotation is what makes 3-bit KV cache practical.
What it means for you: The model file stays the same size, but long conversations eat far less RAM. Working implementations appeared within a week of the paper. As this becomes available in the tools you already use, 128K+ context windows will fit on hardware where they previously ran out of memory.
LLM in a Flash (Apple)
Streams model weights from SSD instead of loading everything into RAM. Apple demonstrated a 400B MoE model running on an iPhone 17 Pro. Still early and performance depends heavily on SSD speed, but it changes the “will it fit?” question from “do I have enough RAM?” to “do I have enough storage?”
Speculative Decoding
Uses a tiny “draft” model to predict multiple tokens, then verifies them with the large model in one pass. Can deliver 2–3x speedup with no quality loss. Already available in llama.cpp (pass the draft model with -md / --model-draft) and vLLM.
Tools
- Hardware Calculator - Enter your specs, see which models fit and estimated speed
- Model Reference - Current model recommendations by use case and tier
- Ollama - Easiest way to get started
- LM Studio - Visual model browser and runner
How to Check Your Specs
RAM:
- Mac: Apple menu → About This Mac → Memory
- Windows: Settings → System → About → Installed RAM
- Linux: `free -h` (look at the “total” column)
GPU and VRAM:
- NVIDIA: Run `nvidia-smi` in terminal
- Mac: Apple menu → About This Mac → displays the chip (M1/M2/M3/M4, unified memory, no separate VRAM)
- Windows: Task Manager → Performance → GPU → Dedicated GPU Memory
Memory Bandwidth:
- Mac: Determined by your chip (see Apple Silicon table above)
- PC: Depends on RAM type (DDR4/DDR5), speed (MT/s), and number of channels. Check with CPU-Z (Windows) or `dmidecode -t memory` (Linux, requires root)
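If you would rather script the RAM check, here is a minimal sketch using only the Python standard library. It relies on `os.sysconf`, which exists on Linux and macOS but not on Windows, so the function returns `None` when the lookup is unavailable. `total_ram_gb` is a hypothetical helper name.

```python
import os


def total_ram_gb():
    """Total physical RAM in GB, or None where os.sysconf cannot report it."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        pages = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None  # e.g. Windows, where os.sysconf is not available
    return page_size * pages / 1024**3


ram = total_ram_gb()
print("RAM:", "unknown" if ram is None else f"{ram:.1f} GB")
```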