Local + Cloud
When Local Isn’t Enough
By the end of this module, you’ll have a clear framework for deciding when to run locally, when to use an API, and how to wire both together in the same workflow.
Local models are free and private, but they have limits you’ve probably already run into.
A 9B model handles conversation and routine coding well. Ask it to debug a subtle race condition across six files, and it flounders. A 32B model does better, but it still lacks the raw reasoning depth of the largest running on thousand-GPU clusters.
A model running on your laptop has fewer , less training data, and a shorter than one backed by a datacenter. The useful question isn’t “which is better?” but “which is better for this task, right now?”
What You Get From the Cloud (And What You Give Up)
What Cloud APIs Offer
Frontier models have hundreds of billions of parameters (or more), and on complex tasks they produce better results than local models of any size.
They also handle longer context. Most frontier models now support 1M+ . You can feed an entire codebase, a book, or hours of meeting transcripts in a single prompt. Local models typically top out at 32K-128K, and quality degrades well before the limit.
On the practical side, cloud models process text, images, video, audio, and PDFs natively, while local models are still catching up. And there’s no hardware requirement at all: an API call works the same from a Chromebook as from a workstation.
What You Give Up
Your prompts and data travel to someone else’s servers. For proprietary code, medical records, legal documents, or anything covered by compliance rules, that can be a dealbreaker.
It costs money, too. A developer using a frontier coding heavily can spend $5-15 per day, and batch processing thousands of documents adds up fast. Local is free after the hardware investment.
APIs also go down, rate limits kick in, and providers change pricing or deprecate models without asking you. You can’t a frontier model (with a few exceptions), and you can’t inspect its weights. Local models work offline, don’t have rate limits, and don’t change unless you change them.
Choosing: A Decision Framework
Instead of picking one side permanently, match the model to the task.
The Quick Test
Ask three questions about whatever you’re about to do.
First, does it involve sensitive data? Proprietary code, personal information, internal documents, anything you wouldn’t paste into a public website. If yes, lean local. Some cloud providers offer data processing agreements and SOC 2 compliance, but “it never leaves my machine” is simpler to reason about.
Second, does it need deep reasoning or very long context? Architecture decisions, complex debugging, analyzing a 500-page document, creative work where quality really matters. If yes, lean cloud. These are where frontier models earn their cost.
Third, is it high-volume or repetitive? Summarizing a folder of reports, classifying a thousand emails, generating boilerplate. If yes, lean local. Repetitive work at scale is where local models save the most money.
Task-by-Task Guide
| Task | Local | Cloud | Recommendation |
|---|---|---|---|
| Simple refactoring | Great | Great | Local (free) |
| Boilerplate / CRUD | Great | Great | Local (free) |
| Test writing | Good | Great | Local for routine, cloud for tricky edge cases |
| Bug fixing (straightforward) | Good | Great | Either works |
| Multi-file architecture changes | Fair | Excellent | Cloud |
| Complex debugging | Weak | Excellent | Cloud |
| Working in an unfamiliar framework | Weak | Excellent | Cloud |
| Summarization (batch) | Good | Great | Local (free at scale) |
| Creative writing | Good | Great | Cloud if quality is critical |
| Sensitive/proprietary data | Required | Risky | Local |
For routine tasks, local is fine. For the hard stuff, cloud is usually worth the cost. Most work falls somewhere in between.
How Pricing Works (And How to Control It)
Cloud models charge per token. Every word you send and every word you get back costs money. Understanding the pricing model helps you avoid surprises.
The Three Tiers
Providers offer models at different price-performance points. The names change, the pattern doesn’t:
| Tier | What It’s For | Typical Cost (per million tokens) |
|---|---|---|
| Flagship | Hardest problems. Architecture, complex analysis. | $2-25 input, $8-25 output |
| Workhorse | Daily use. Coding, writing, general tasks. | $0.50-3 input, $3-15 output |
| Budget | High volume. Classification, formatting, simple Q&A. | $0.10-0.50 input, $0.40-2 output |
Prices drop fast. The cost of a million tokens fell roughly 80% between 2025 and 2026, and it’s still falling. Current pricing lives on provider websites (Anthropic, OpenAI, Google, Mistral all publish theirs). The ratios between tiers are more stable than the absolute numbers.
The Most Common Mistake
Using the flagship model for everything. Most requests don’t need the strongest model. A classification task, a format conversion, a simple Q&A: the budget tier handles these fine. Save the flagship for the 10% of tasks that actually need it.
A typical cost-effective split:
70% of requests → Budget tier ($0.10-0.50/M input)
20% of requests → Workhorse tier ($0.50-3/M input)
10% of requests → Flagship tier ($2-25/M input)
This saves roughly 60% compared to sending everything to the flagship.
Combining Local and Cloud
A hybrid setup routes requests to the right model based on what the task needs.
The Simple Version
Use two tools. A local model for everyday work, a cloud model for the hard stuff.
Daily coding (refactoring, tests, boilerplate):
→ local tool (Ollama) + a 9B local model
→ Cost: $0
When you hit a wall (architecture, complex bugs, unfamiliar code):
→ switch to a frontier coding API or multimodal cloud API
→ Cost: pay-per-use, only when you need it
The split tends to cut cloud costs 60-70% compared to cloud-only, with minimal quality loss on the tasks shifted to local.
Programmatic Routing
If you’re building an application, you can route automatically. The logic doesn’t need to be complicated: check the task type (classify/format/summarize gets local, analyze/debug/architecture gets cloud), check prompt length (very long context goes to cloud), and default to local for everything else.
Latency is another criterion worth building in. Local models are deterministic in latency: your hardware is the only variable. Cloud APIs have variable latency from network round-trips and provider load. For real-time or user-facing applications, this often matters more than quality or cost.
More sophisticated routing uses a small classifier model to judge prompt complexity before routing. RouteLLM is a library from LMSys that routes prompts to local or cloud models based on a quality threshold. It benchmarks whether a simpler model can answer this prompt adequately. OpenRouter is a unified API gateway that lets you call many providers (OpenAI, Anthropic, Google, Mistral, open-weight) through one endpoint, useful when you want to switch models without changing code.
The Ollama Compatibility Advantage
Because Ollama exposes an OpenAI-compatible API (Module 6 covered this), swapping between local and cloud often means changing a base URL and model name. Your application code, prompt templates, and tool definitions stay the same. If you used the OpenAI-compatible API pattern from Module 6, you can swap local and cloud with one line change: just point the endpoint URL at localhost instead of the cloud provider, or vice versa.
Making It Cheaper: Caching, Batching, Routing
If you’re using cloud APIs at any volume, these three techniques cut costs.
Prompt Caching
If you send the same system prompt on every request (and you probably do), you’re paying full price for it every time. Prompt caching lets you pay once and reuse. You mark the stable prefix with a cache flag. The first call costs a small premium to write the cache (about 1.25x). Every subsequent call within the cache window costs 0.1x for that portion. For a system making 100 calls with the same 3,000-token system prompt, that’s roughly a 90% reduction on the system prompt portion.
Gemini caches automatically. Claude requires you to mark what to cache. Either way, if you’re making repeated calls with a stable prefix, enable it.
Batch API
When you don’t need an immediate response, batch APIs let you submit many requests at once and get results later. Most providers offer 50% off for batch processing.
Good candidates: nightly report generation, bulk document classification, processing a queue of support tickets, running evaluation test suites.
Context Management
Every token costs money. Common waste patterns:
- Sending an entire file when you need 10 lines
- Including 50 turns of conversation history when the last 5 are relevant
- Embedding full documentation in every request instead of using to fetch what’s needed
Trim your context before sending. Summarize old conversation turns. Use retrieval instead of stuffing everything into the prompt. A request that sends 2,000 tokens instead of 20,000 costs 10x less and often gets the same quality answer.
Next Steps
-
Module 8: Supercharge Your Setup — The MCP ecosystem, skills, plugins, memory tools, and how to discover them.
-
Module 9: Go Further — Fine-tuning, media generation, evaluation, and what’s coming next.
-
Module 6: Build Custom Tools — If you skipped ahead: APIs, function calling, MCP wiring, and RAG pipelines.
-
Model Reference — Current model recommendations for every use case and budget.
-
Hardware Reference — Upgrade paths if local performance is your bottleneck.