Module 10 · 10 min

🔭 What's Next

What’s Changing and Why It Matters to You

This module covers what’s coming in local AI: hardware improvements, software changes, and research directions, so you know where to look as the field moves.

Everything here is backed by published papers and shipping code. Some of it you can use today, and the rest is expected to land in the tools you already use within months. The net effect is that the hardware you own keeps getting more capable without you buying anything.


Compression Is Getting Smarter

Why This Matters Most

Module 2 covered quantization: compressing models from 16-bit to 4-bit so they fit in less memory. Module 3B explained the mechanics. But quantization has been hitting a floor: below Q4, quality drops fast. The quality curve is nearly flat from 16-bit down to about Q5, then falls off a cliff below Q3.

Several new techniques push that cliff lower.

TurboQuant: Rotating Before Compressing

TurboQuant doesn’t compress model weights. It compresses the KV cache, the memory that grows as your conversation gets longer. Module 3B covered the mechanics.

In reported benchmarks, it takes the KV cache from 16 bits to 3 bits with no measurable accuracy loss. A 128K context on a 9B model drops from 8+ GB to about 1.5 GB. That makes much longer context windows possible on hardware that currently can’t support them. More on this in our Hardware Reference.
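The arithmetic behind those figures can be sketched as follows. The layer count, KV-head count, and head dimension are hypothetical values chosen so that a 16-bit cache at 128K tokens comes out to 8 GiB. They are not the configuration of any particular 9B model, and real implementations add some overhead for scales and metadata.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits):
    """Approximate KV cache size: one key and one value vector
    per layer, per KV head, per token."""
    values_per_token = 2 * layers * kv_heads * head_dim  # 2 = keys + values
    return tokens * values_per_token * bits / 8


GIB = 1024 ** 3

# Hypothetical architecture (an assumption for illustration only).
CONFIG = dict(layers=32, kv_heads=4, head_dim=128)

for bits in (16, 3):
    size = kv_cache_bytes(128 * 1024, bits=bits, **CONFIG)
    print(f"{bits:>2}-bit KV cache at 128K tokens: {size / GIB:.1f} GiB")
```

Under these assumptions the script reports 8.0 GiB at 16 bits and 1.5 GiB at 3 bits. The size is linear in `tokens`, so doubling the context doubles the cache at any bit width.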

Weight Quantization Below 4 Bits

Recent weight-quantization research applies a related idea to the model weights themselves. Instead of quantizing each weight independently, it uses incoherence processing (a controlled randomization, similar to TurboQuant’s rotation) to spread information across weight dimensions before compressing. This pushes usable quantization to 2 bits.
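To see why rotating first can help, here is a toy sketch. It is not the published TurboQuant or weight-quantization algorithm: it assumes a randomized Hadamard transform as the rotation and a plain uniform 3-bit quantizer, and the test vector (one large outlier among small values) is made up for illustration. All function names are hypothetical.

```python
import math
import random


def make_signs(n, seed=0):
    """Random +/-1 sign flips (the 'controlled randomization')."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def hadamard_rotate(x, signs):
    """Compute y = H D x / sqrt(n) with a fast Walsh-Hadamard transform.
    len(x) must be a power of two. The transform preserves the vector norm."""
    y = [s * v for s, v in zip(signs, x)]
    n, h = len(y), 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in y]


def hadamard_unrotate(y, signs):
    """Invert the rotation: H / sqrt(n) is its own inverse, then undo the signs."""
    x = hadamard_rotate(y, [1.0] * len(y))
    return [s * v for s, v in zip(signs, x)]


def quantize_roundtrip(x, bits=3):
    """Uniform symmetric quantization, scaled to the largest magnitude."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / levels
    return [round(v / scale) * scale for v in x]


def sq_error(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


x = [10.0] + [1.0] * 63          # one outlier dominates the scale
signs = make_signs(len(x))

plain = quantize_roundtrip(x)
rotated = hadamard_unrotate(quantize_roundtrip(hadamard_rotate(x, signs)), signs)

print("error without rotation:", sq_error(x, plain))
print("error with rotation:   ", sq_error(x, rotated))
```

Without rotation, the outlier sets the quantization step so coarsely that every small value rounds to zero. After rotation, the outlier's energy is spread across all coordinates, so the same 3 bits cover a much narrower range.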

The practical result: models that need 32 GB at Q4 would need about 16 GB at Q2. Models that need 64 GB would need 32 GB. Weight memory scales roughly in proportion to bits per weight, so halving the bit width halves the memory requirement. The quality tradeoff isn’t zero, but it’s shrinking with every new technique.
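A back-of-the-envelope version of that scaling, as a minimal sketch. It assumes weight memory is proportional to bits per weight and ignores the per-block scales and metadata that real quantization formats add; both helper names are hypothetical.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB: parameters x bits / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


def rescale_gb(current_gb, from_bits, to_bits):
    """Scale a known memory footprint to a different bit width."""
    return current_gb * to_bits / from_bits


print(weight_memory_gb(9, 4))   # 9B parameters at 4 bits per weight
print(rescale_gb(32, 4, 2))     # a 32 GB Q4 model at 2 bits
print(rescale_gb(64, 4, 2))     # a 64 GB Q4 model at 2 bits
print(rescale_gb(32, 4, 3))     # the same 32 GB model at 3 bits
```

Going from 4 bits to 3 saves a quarter of the memory rather than half; the halving in the text comes from going all the way from 4 bits to 2.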

What This Means For Your Machine

If you have 16 GB today and run a 9B model at Q4, the same hardware will likely run a 14-20B model within a year as these techniques become generally available. If you have 32 GB, you’ll be running 32-40B models comfortably. You don’t need to buy anything new to benefit from this.


Context Windows Are About to Get Much Longer

Most local models support 32K-128K of context. Cloud models already handle 1M+. Local models could support long context in theory, but they run out of memory before they run out of capability. The KV cache grows linearly with context length, and at 128K tokens on a 9B model, it eats 8 GB on its own.

KV cache compression (the TurboQuant technique above) is what unlocks this. With 3-bit quantization, 256K context becomes feasible on a 32 GB machine. 512K is within reach on 64 GB.

At 32K tokens, you can fit a few files. At 128K, a medium codebase. At 512K, an entire project with its documentation. The difference between “summarize this file” and “answer questions about this entire repository” is context length, and it’s about to stop being a hardware constraint for most users.


On-Device Inference: Models That Don’t Fit in RAM

The Idea

Apple’s “LLM in a Flash” research asks: what if models didn’t have to fit entirely in memory?

Currently, you load all model weights into RAM before generating a single token. A 70B model at Q4 needs ~42 GB in memory. If you have 32 GB, you can’t run it.

Flash inference streams weights from SSD as needed. An MoE (mixture-of-experts) model activates only a fraction of its experts per token. The inactive experts can stay on disk and stream into memory only when the router calls them.

Apple demonstrated a 400B MoE model running on an iPhone 17 Pro this way. The model’s total size was far larger than the phone’s RAM, with only the active experts loading per token.

What Limits It

SSD bandwidth is the new bottleneck. A fast NVMe drive reads at 5-7 GB/s. That’s roughly 13-18x slower than DDR5 RAM (~89 GB/s) and 150-200x slower than an RTX 4090’s GDDR6X (~1,008 GB/s). Token generation will be slower than in-memory inference, but the question shifts from “can I run this at all?” to “how fast?”

For MoE models specifically, the hit is smaller because you’re only streaming the active experts (say 10-20% of total weights) while the shared layers stay resident in RAM.
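A rough way to reason about "how fast" is a bandwidth-bound ceiling: tokens per second cannot exceed storage bandwidth divided by the bytes that must be read per token. The sketch below makes that estimate under simplifying assumptions (no caching of recently used experts, no compute or latency cost). The 42 GB model size, the 15% streamed fraction, and the 6 GB/s drive are illustrative picks from the ranges quoted above, and the function name is hypothetical.

```python
def streaming_tokens_per_second(weights_gb, streamed_fraction, bandwidth_gb_s):
    """Upper bound on tokens/s when `streamed_fraction` of the weights
    has to be read from storage for every generated token."""
    gb_per_token = weights_gb * streamed_fraction
    return bandwidth_gb_s / gb_per_token


NVME_GB_S = 6.0   # within the 5-7 GB/s range for a fast NVMe drive

dense = streaming_tokens_per_second(42, 1.0, NVME_GB_S)   # every weight, every token
moe = streaming_tokens_per_second(42, 0.15, NVME_GB_S)    # only the active experts

print(f"dense, all weights streamed: {dense:.2f} tokens/s at most")
print(f"MoE, 15% streamed:           {moe:.2f} tokens/s at most")
```

Real systems should land below these ceilings because of compute and latency, though caching frequently used experts in RAM reduces how much has to be streamed. The ratio between the two cases is the point: streaming a smaller fraction raises the ceiling proportionally.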

When flash inference arrives in local tooling, fast storage becomes as important to check as RAM capacity. NVMe drives at 5-7 GB/s are already fast enough for some model sizes.


Tool Use Is Becoming Native

What Changed

Getting a local model to use tools reliably used to require careful prompt engineering and custom parsing code. The model would sometimes forget the tool format, hallucinate tool names, or fail to parse results.

Current-generation models ship with tool use trained in. Qwen 3.5, Llama 4, and Gemma 3 all support native function calling. The model knows the protocol: describe available tools in the system prompt, request a tool call with structured arguments, receive the result, and continue. No prompt hacking required.
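The application side of that protocol can be sketched in a few lines. This is a minimal illustration, not any specific runtime's API: the JSON shape of the tool description and of the model's tool call is an assumption (real chat templates differ by model family), and `word_count` is a hypothetical tool.

```python
import json


def word_count(text: str) -> int:
    """Hypothetical example tool."""
    return len(text.split())


# Tools the application is willing to run, keyed by name.
REGISTRY = {"word_count": word_count}

# Description handed to the model so it knows what it may call (assumed shape).
TOOL_SPECS = [{
    "name": "word_count",
    "description": "Count the words in a piece of text.",
    "parameters": {
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
    },
}]


def handle_tool_call(raw_call: str, registry=REGISTRY) -> dict:
    """Parse a structured tool call emitted by the model, run the tool,
    and return a result (or an error) to append to the conversation."""
    try:
        call = json.loads(raw_call)
        name, args = call["name"], call["arguments"]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        return {"error": f"malformed tool call: {exc}"}
    if name not in registry:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"name": name, "result": registry[name](**args)}
    except TypeError as exc:
        return {"error": f"bad arguments for {name}: {exc}"}


model_output = '{"name": "word_count", "arguments": {"text": "run models locally"}}'
print(handle_tool_call(model_output))
```

Returning errors as data instead of raising lets the model see what went wrong and retry, which is how the remaining failure cases (a misspelled tool name, a missing argument) are usually handled.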

Why MCP Matters Here

MCP turns this from “tools work with one specific application” to “tools work everywhere.” An MCP server for GitHub written for Claude Code also works with OpenCode, Cursor, and any other MCP-compatible client. Write the integration once, use it from any agent.

Every new MCP server benefits every compatible agent. The 280+ memory servers alone show how fast an ecosystem grows when there’s a shared standard.

For local AI specifically, this means a 9B model running on your laptop can now search your files, read your database, send Slack messages, and manage GitHub issues, all through the same protocol.
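What makes the protocol shared is that every client sends the same message shapes. MCP messages are JSON-RPC 2.0, with `tools/list` to discover tools and `tools/call` to invoke one. The sketch below only builds those requests; the transport, the initialization handshake, and response handling are omitted, and the tool name `search_issues` and its arguments are hypothetical.

```python
import itertools
import json

_ids = itertools.count(1)


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request object with a fresh id."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Ask the server which tools it offers.
list_request = jsonrpc_request("tools/list")

# Invoke one tool by name with structured arguments (hypothetical tool).
call_request = jsonrpc_request(
    "tools/call",
    {"name": "search_issues", "arguments": {"query": "memory leak"}},
)

print(json.dumps(list_request))
print(json.dumps(call_request))
```

Because the request format does not depend on which model or client produced it, the same server can answer a cloud-backed agent and a 9B model on a laptop.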


Research Worth Following

| What | Why It Matters For Local AI | Where to Read |
| --- | --- | --- |
| MoE at small scale | MoE used to be for 100B+ models. Now appearing at 9B and 3B. Gives you the speed of a tiny model with the knowledge of a larger parameter set. | Mixture of Experts Explained (HuggingFace) |
| Hybrid Mamba-Transformer | Combines transformer attention (good at reasoning) with Mamba (good at long sequences, linear memory). Handles long context without quadratic cost. | Mamba (Gu & Dao, 2023), Nemotron 3 (NVIDIA) |
| Speculative decoding | Tiny draft model predicts tokens, large model verifies in batch. 2-3x speedup, same quality. Free performance. | Leviathan et al., 2023 |
| Multimodal as default | Models that accept images, audio, and text as input by default, rather than as add-on capabilities. This matters for tool use and agent workflows: an agent that can read screenshots, process audio recordings, and handle PDFs alongside text can act on a much wider range of real-world inputs. | Qwen 3.5 (Alibaba) |
| Distillation improvements | Small models learning from large ones. Each generation’s 9B model matches the previous generation’s 70B. Your hardware gets more capable without upgrades. | Hinton et al. (2015), “Distilling the Knowledge in a Neural Network” |
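The speculative decoding entry is the easiest of these to illustrate in code. Below is a toy sketch of the greedy variant: a draft model proposes a few tokens, the target model checks them, and the longest agreeing prefix is kept. The two "models" are made-up arithmetic functions, and the verification loop stands in for what would be one batched forward pass in a real system. The sampling-based acceptance rule from the paper is not shown.

```python
def greedy_decode(next_token, prompt, n_new):
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq


def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding; returns (sequence, target verification passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(seq) < goal:
        # 1. The cheap draft model proposes up to k tokens.
        draft = []
        for _ in range(min(k, goal - len(seq))):
            draft.append(draft_next(seq + draft))
        # 2. The target model verifies them (one batched pass in practice).
        passes += 1
        accepted = []
        for tok in draft:
            expected = target_next(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
        else:
            # Every draft token matched; the same pass yields one extra token.
            if len(seq) + len(accepted) < goal:
                accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq, passes


def target_model(seq):
    """Toy stand-in for the large model."""
    return (seq[-1] * 3 + 1) % 7


def draft_model(seq):
    """Toy stand-in for the small model: agrees except at every fifth position."""
    guess = target_model(seq)
    return (guess + 1) % 7 if len(seq) % 5 == 0 else guess


baseline = greedy_decode(target_model, [1], 20)
fast, passes = speculative_decode(target_model, draft_model, [1], 20)
print("same output:", fast == baseline)
print("target passes:", passes, "instead of 20")
```

Every token that ends up in the output is one the target model would have chosen itself, which is why quality is unchanged. The speedup depends entirely on how often the draft model agrees with the target.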

Foundational Papers

If a concept from this curriculum made you curious, these are where the full explanations live.

| Concept | Paper |
| --- | --- |
| How transformers work | Attention Is All You Need (Vaswani et al., 2017) |
| How agents reason and act | ReAct (Yao et al., 2022) |
| How LoRA works | LoRA (Hu et al., 2021) |
| How RAG works | RAG (Lewis et al., 2020) |
| How KV cache compression works | TurboQuant (Zandieh et al., 2025) |
| How linear-time models work | Mamba (Gu & Dao, 2023) |

Unfamiliar with a term? Every technical term used across this curriculum has a definition. Click any dotted-underline term on any page to see a quick definition in place.


What You Can Do Now

In Module 1 you ran your first model with a single command. That was the starting point. From there, the curriculum worked through how models actually work, how they fit on your hardware, how to get useful output from them, how agents and tools operate, how to build your own pipelines, and how to push further with fine-tuning and media generation.

You can now run and evaluate models for specific tasks rather than relying on leaderboards. You can build RAG pipelines on your own documents, so a model can answer questions from your notes, your codebase, or your organization’s internal knowledge. You can set up hybrid local-cloud workflows that use local models for routine work and route harder tasks to cloud APIs only when needed. You can extend agents with MCP tools so they can reach your databases, code repos, and communication tools. And you can fine-tune a model on custom data when prompting alone isn’t getting you consistent results.

None of this is theoretical. Local AI is genuinely useful today for real work: coding, research, document processing, automation. The people doing it daily are on r/LocalLLaMA and HuggingFace, and in the communities linked in Module 8. The field moves fast enough that staying in those spaces matters more than any single module. What you’ve built here is a foundation, not a finish line.


Where to Go From Here

That’s the full curriculum. The reference pages below are kept up to date and are worth bookmarking.

  • Model Reference — Current model recommendations for every hardware tier and use case. Check this when you’re ready to try something new or when a model recommendation from the curriculum feels dated.

  • Hardware Reference — The full hardware guide: memory equations, bandwidth tables, GPU buying advice, and the hardware calculator.

  • Sources — Every paper, tool, and resource cited across all 10 modules, organized by module.

If you want to revisit specific topics, the modules stand on their own. Module 3A and Module 3B are good references when you want to understand why something behaves the way it does. Module 5 through Module 8 cover the practical ecosystem in depth.

