Reference Reference

🖥️ Hardware Reference

Hardware Reference

Last verified: April 2026. Hardware requirements change as quantization techniques improve (e.g., TurboQuant reducing KV cache memory 6x) and model architectures evolve (MoE models needing more RAM but less compute). If something looks outdated, check our Model Reference for current model sizes.


Quick Answer: What Do I Need?

Your GoalMinimumRecommendedSweet Spot
Try it out (basic chat)8 GB RAM16 GB RAMAny modern machine
Daily driver (good quality)16 GB RAM32 GB RAMApple M-series or NVIDIA GPU
Power user (large models)32 GB RAM64 GB RAMM4 Pro/Max or RTX 4090
Enthusiast (frontier-local)64 GB RAM128 GB+M4 Ultra or multi-GPU

Already have a machine? Use our Hardware Calculator to see exactly which models fit your specs.


Hardware Tiers

Tier 1: Entry (8 GB RAM)

What you can run: Small models, 1B to 4B parameters at Q4 quantization.

Experience: Solid for simple chat, Q&A, basic summarization. Slower on complex reasoning. Like a quick assistant with broad but shallow knowledge.

Typical hardware:

  • MacBook Air M1/M2 (8 GB)
  • Budget Windows laptops
  • Older desktops with DDR4

Best models at this tier:

ModelRAM NeededOllama Command
Qwen 3.5 4B~3.4 GBollama pull qwen3.5:4b
Gemma 4 E2B~3.5 GBollama pull gemma4:e2b
Gemma 4 E4B~5.5 GBollama pull gemma4:e4b
Phi-4 Mini (3.8B)~3 GBollama pull phi4-mini
SmolLM3 3B~2.5 GBollama pull smollm3:3b

Tier 2: Standard (16 GB RAM)

What you can run: Medium models, 7B to 9B at Q4, or 4B models at Q8 (near-lossless).

Experience: Good conversations, useful coding assistance, moderate reasoning. A good baseline for most users.

Typical hardware:

  • MacBook Air/Pro M1–M4 (16 GB)
  • Mid-range Windows laptops
  • Desktops with DDR4/DDR5

Best models at this tier:

ModelRAM NeededOllama Command
Qwen 3.5 9B~6 GBollama pull qwen3.5:9b
Llama 3.3 8B~5 GBollama pull llama3.3:8b
DeepSeek R1 7B~5 GBollama pull deepseek-r1:7b
Phi-4 (14B)~9 GBollama pull phi4

Tier 3: Power (32 GB RAM)

What you can run: Large models, 14B to 32B at Q4, or 9B at Q8. MoE models like Qwen 3.5 35B-A3B fit comfortably.

Experience: Near-frontier quality for many tasks. Complex reasoning, detailed code generation, nuanced writing. Local models at this tier can handle most of what you’d otherwise reach for a cloud API to do.

Typical hardware:

  • MacBook Pro M3/M4 Pro (36 GB)
  • Gaming desktops with 32 GB DDR5
  • NVIDIA RTX 3060/4060 (12 GB VRAM) + 32 GB system RAM

Best models at this tier:

ModelRAM NeededOllama Command
Qwen 3.5 27B~16 GBollama pull qwen3.5:27b
Gemma 4 31B~20 GBollama pull gemma4:31b
Gemma 4 26B-A4B (MoE)~15 GBollama pull gemma4:26b-a4b
Qwen 3.5 35B-A3B (MoE)~20 GBollama pull qwen3.5:35b-a3b
Qwen 3.5 32B-Coder~19 GBollama pull qwen3.5:32b-coder

Tier 4: Enthusiast (64 GB RAM)

What you can run: Very large models, 70B at Q4, or 32B at Q8. Larger MoE models like Nemotron 3 Super (120B) or Qwen 3.5 122B-A10B.

Experience: Excellent quality, on par with cloud AI for most tasks. Going beyond 64 GB mainly pays off for specialized workloads.

Typical hardware:

  • MacBook Pro M4 Max (64 GB) or Mac Studio
  • High-end desktops with 64 GB DDR5
  • NVIDIA RTX 4090 (24 GB VRAM) + 64 GB system RAM

Best models at this tier:

ModelRAM NeededOllama Command
Llama 3.3 70B~42 GBollama pull llama3.3:70b
Qwen 3.5 122B-A10B (MoE)~48 GBollama pull qwen3.5:122b-a10b
Nemotron 3 Super (120B MoE)~45 GBollama pull nemotron-3-super
DeepSeek R1 70B~42 GBollama pull deepseek-r1:70b

Tier 5: Extreme (128 GB+ RAM)

What you can run: The largest open models. Qwen 3 235B-A22B, DeepSeek V3 (671B), Llama 4 Maverick (402B). Frontier-class performance, running locally.

Experience: Matches or approaches cloud frontier models. For most people this is more capability than they’ll ever need, but it’s there.

Typical hardware:

  • Mac Studio M4 Ultra (192 GB)
  • Mac Pro
  • Multi-GPU workstations (2× RTX 4090 = 48 GB VRAM)
  • High-end server hardware

Best models at this tier:

ModelRAM NeededOllama Command
Qwen 3 235B-A22B (MoE)~130 GBollama pull qwen3:235b-a22b
Llama 4 Maverick (402B MoE)~120 GBollama pull llama4-maverick
DeepSeek V3 (671B MoE)~200+ GBNeeds llama.cpp directly

Understanding Memory: RAM, VRAM, and Unified

System RAM (all machines)

Your computer’s main memory. This is where models load when you don’t have a dedicated GPU, or when the model is too large for GPU memory.

  • DDR4: Standard in pre-2023 machines. Slower bandwidth (~50 GB/s for dual-channel).
  • DDR5: Current standard. Faster (~89 GB/s dual-channel), better for LLMs.
  • How to check: Mac → About This Mac. Windows → Settings → System → About. Linux → free -h.

GPU VRAM (NVIDIA/AMD)

Dedicated memory on your graphics card. Much faster than system RAM for model inference.

  • NVIDIA GeForce: RTX 3060 (12 GB), RTX 4060 Ti (16 GB), RTX 4090 (24 GB), RTX 5090 (32 GB)
  • NVIDIA Pro: A6000 (48 GB), H100 (80 GB)
  • How to check: nvidia-smi in terminal, or NVIDIA Settings → System Information

Partial offloading: If a model is too large for your VRAM, part loads on the GPU (fast layers) and the rest stays in system RAM (slower layers). This is automatic in Ollama and llama.cpp. You’ll see something like “32 of 48 layers offloaded to GPU” in the logs.

Apple Silicon Unified Memory

Apple M-series chips share a single memory pool between CPU and GPU. There’s no separate VRAM. It’s all unified.

Why this matters for LLMs: On a PC, if your model exceeds VRAM, performance falls off a cliff as layers spill to DDR. On Apple Silicon, all memory is equally fast, so there’s no bottleneck between GPU and CPU memory. A Mac with 36 GB unified memory often outperforms a PC with a 24 GB GPU + 64 GB DDR for large models.

Apple ChipMax Unified MemoryMemory Bandwidth
M1/M216–24 GB100–200 GB/s
M3/M416–24 GB100–150 GB/s
M3/M4 Pro18–48 GB150–273 GB/s
M3/M4 Max36–128 GB300–546 GB/s
M4 Ultra192 GB819 GB/s

Memory Bandwidth: The Hidden Speed Limit

Two machines with identical RAM can run the same model at vastly different speeds. The bottleneck is : how fast the processor can read model weights from memory.

Token generation is bandwidth-bound: the processor spends most of its time waiting for numbers to arrive from memory, not doing math. More bandwidth = more tokens per second.

Bandwidth by Hardware Type

HardwareBandwidth7B Q4 Speed70B Q4 Speed
DDR4 laptop (2-channel, 3200 MT/s)~50 GB/s~8 tok/sToo slow
DDR5 desktop (2-channel, 5600 MT/s)~89 GB/s~15 tok/s~2 tok/s
DDR5 workstation (8-channel)~358 GB/s~50 tok/s~8 tok/s
Apple M3 Max (unified)~300 GB/s~40 tok/s~6 tok/s
Apple M4 Ultra (unified)~819 GB/s~110 tok/s~15 tok/s
NVIDIA RTX 4090 (GDDR6X)~1,008 GB/s~140 tok/sGPU only if fits

Token speeds are approximate for Q4_K_M quantization. Actual rates vary by model architecture, context length, and system load.

Key Insights

Channels matter more than DDR generation. A 12-channel DDR4 server (~307 GB/s) is faster than a 4-channel DDR5 desktop (~192 GB/s). Bumping DDR5 speed from 4800 to 6000 MT/s gives about a 20% speedup.

The M4 Max delivers 546 GB/s in a laptop. Matching that on a PC requires a high-end GPU or a multi-channel workstation.

GPU VRAM is the fastest. If a model fits entirely in your GPU’s VRAM, you get the full GPU bandwidth. The moment it spills to system RAM, your speed is limited by the slower DDR bandwidth for those layers.


The Memory Equation

When you’re checking if a model fits, here’s the full picture:

Total RAM needed  =  Model Weights  +  KV Cache  +  OS/Apps  +  Overhead
ComponentSizeNotes
Model weightsFile size you downloadedThe largest component
KV cacheGrows with conversation256K context on a 9B model ≈ 2-4 GB
OS + other apps3-4 GBMore if you’re running a browser, code editor, etc.
Overhead~10% of model sizeBuffers, scratch space, llama.cpp internals

Quick Rule

Model file size should be ≤ 75% of your available RAM (total RAM minus OS/apps). This leaves room for KV cache and overhead.

Total RAMAvailableMax Comfortable ModelExample
8 GB~5 GB~3.5 GBQwen 3.5 4B at Q4
16 GB~12 GB~9 GBQwen 3.5 9B at Q4
32 GB~28 GB~20 GBQwen 3.5 27B at Q4
64 GB~60 GB~45 GBLlama 3.3 70B at Q4
128 GB~120 GB~90 GBQwen 3 235B-A22B at Q4

MoE Models: The Catch

models load ALL their parameters into memory, even though only a fraction activate per token. A 235B MoE with 22B active still needs ~130 GB. You get 22B compute speed but 235B memory cost. That 22B of active compute draws on 235B of knowledge, which is why output quality is higher than the active count suggests.


Inference Engines by Platform

Different engines work best on different hardware. Here’s what to use where:

Your HardwareBest EngineWhy
Apple Silicon (any)Ollama or MLXUnified memory, Metal GPU acceleration
NVIDIA GPU (fits in VRAM)Ollama or vLLMFull GPU acceleration via CUDA
NVIDIA GPU (partial offload)Ollama or llama.cppSplit layers between GPU and CPU
CPU only (Intel/AMD)OllamaAVX2/AVX-512 optimizations in llama.cpp
Multi-GPUvLLM or llama.cppTensor parallelism across GPUs

All engines use files (except vLLM which also supports SafeTensors). A model downloaded for one engine works with any other.

New to this? Just use Ollama. It auto-detects your hardware and configures GPU offloading. See Module 1 for setup.


GPU Buying Guide

If you’re considering a GPU purchase specifically for local AI:

GPUVRAMBandwidthPrice RangeBest For
RTX 4060 Ti 16GB16 GB288 GB/s~$400Budget. Fits 14B at Q4.
RTX 409024 GB1,008 GB/s~$1,600Sweet spot. Fits 27B at Q4. Very fast.
RTX 509032 GB~1,792 GB/s~$2,000Latest gen. Fits 32B+ at Q4.
RTX A600048 GB768 GB/s~$4,000Pro card. Fits 70B at Q4.

AMD

AMD GPUs work with llama.cpp via ROCm, but NVIDIA’s CUDA ecosystem is more mature and better optimized for LLMs. If you already have an AMD GPU, it will work. If you’re buying specifically for AI though, NVIDIA is the safer bet.

Intel

Intel Arc GPUs have experimental SYCL support in llama.cpp. Not recommended as a primary LLM GPU yet.

When NOT to Buy a GPU

  • You have Apple Silicon with 32 GB+ unified memory. The bandwidth is already excellent and everything just works. A GPU won’t help. Macs don’t use external GPUs for LLM inference.
  • You just want to chat casually. A 9B model on CPU with 16 GB RAM is fast enough for conversation.
  • You’d be better off with more RAM. On a DDR5 desktop, going from 32 GB to 64 GB of RAM (to fit a 70B model) can be cheaper and more versatile than adding a GPU.

Emerging Technologies

Things that are changing the hardware equation right now:

TurboQuant (Google)

Compresses the to 3 bits with no accuracy loss. That’s a 6x reduction in context memory.

Why it works: Normal quantization (rounding numbers to fewer bits of precision) performs badly on the vectors inside transformer attention, because those vectors tend to have one or two massive values and everything else near zero. When you round a vector like [0.0001, 0.9999, 0.0003], the big value snaps to 1 and the rest snap to 0. You lose most of the information.

TurboQuant’s trick: before quantizing, randomly rotate the vector in its high-dimensional space. A random rotation spreads the weight evenly across all components, so no single component dominates. Quantization works much better on evenly-distributed values. During inference, the rotation is reversed (counter-rotated) to recover the original meaning.

The paper adds a second step that corrects biases when computing attention dot products with quantized vectors, but the rotation is what makes 3-bit KV cache practical.

What it means for you: The model file stays the same size, but long conversations eat far less RAM. Working implementations appeared within a week of the paper. As this becomes available in the tools you already use, 128K+ context windows will fit on hardware where they previously ran out of memory.

LLM in a Flash (Apple)

Streams model weights from SSD instead of loading everything into RAM. Apple demonstrated a 400B MoE model running on an iPhone 17 Pro. Still early and performance depends heavily on SSD speed, but it changes the “will it fit?” question from “do I have enough RAM?” to “do I have enough storage?”

Speculative Decoding

Uses a tiny “draft” model to predict multiple tokens, then verifies them with the large model in one pass. Can deliver 2–3x speedup with no quality loss. Already available in llama.cpp (--draft flag) and vLLM.


Tools


How to Check Your Specs

RAM:

  • Mac: Apple menu → About This Mac → Memory
  • Windows: Settings → System → About → Installed RAM
  • Linux: free -h (look at “total” column)

GPU and VRAM:

  • NVIDIA: Run nvidia-smi in terminal
  • Mac: Apple menu → About This Mac → displays the chip (M1/M2/M3/M4, unified memory, no separate VRAM)
  • Windows: Task Manager → Performance → GPU → Dedicated GPU Memory

Memory Bandwidth:

  • Mac: Determined by your chip (see Apple Silicon table above)
  • PC: Depends on RAM type (DDR4/DDR5), speed (MT/s), and number of channels. Check with CPU-Z (Windows) or dmidecode -t memory (Linux)