Get Running
Your First Local AI Conversation
You’re about to run an AI model on your own machine. You don’t need a cloud account, an API key, or a subscription, just your hardware.
Why bother? Because when you run a model locally:
- Your conversations stay on your machine. Nothing is sent to a server.
- It’s free. No per-message costs, no usage limits.
- It works offline, and you control everything: which model, how it behaves, what data it sees.
By the end of this page you’ll have a working local AI you can talk to.
What Are You Working With?
Before downloading anything, let’s figure out what your machine can handle. You just need one number: how much RAM (memory) you have.
How to check:
- Mac: Apple menu → About This Mac → look for “Memory” (e.g., “16 GB”)
- Windows: Settings → System → About → look for “Installed RAM”
- Linux: Open a terminal and run free -h, then look at the “total” column
Here’s what that number means:
| Your RAM | What you can run | Experience |
|---|---|---|
| 8 GB | Small models (3-4B parameters) | Solid for chat, basic tasks. A bit slow on complex questions. |
| 16 GB | Medium models (7-8B parameters) | Good quality conversations. The sweet spot for getting started. |
| 32 GB | Large models (14-32B parameters) | Near-frontier quality for many tasks. Fast and capable. |
| 64 GB+ | Very large models (70B+ parameters) | Excellent quality. Can rival cloud AI for most tasks. |
Not sure what your hardware supports? Try the Hardware Calculator — enter your RAM and GPU type to see which models fit.
Don’t worry if your number is low. Even 8 GB is enough to get started. The models that fit in 8 GB today are far better than what required 64 GB two years ago.
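If you'd rather check from a terminal, the lookup above can be sketched as a small script. This is a minimal sketch, not an official tool: the function names are made up for illustration, and the thresholds sit slightly below 8/16/32/64 on the assumption that the operating system reports a little less than the installed amount.

```shell
#!/bin/sh
# Map a RAM size in whole GB to the model tier from the table above.
# Thresholds are slightly below 8/16/32/64 (an assumption of this sketch)
# because the OS usually reports a bit less than the installed RAM.
suggest_model_size() {
  gb=$1
  if [ "$gb" -ge 60 ]; then echo "very large (70B+ parameters)"
  elif [ "$gb" -ge 30 ]; then echo "large (14-32B parameters)"
  elif [ "$gb" -ge 15 ]; then echo "medium (7-8B parameters)"
  elif [ "$gb" -ge 7 ]; then echo "small (3-4B parameters)"
  else echo "tiny (under 3B parameters)"
  fi
}

# Detect total RAM in GB on Linux or macOS; print 0 on other systems.
detect_ram_gb() {
  case "$(uname -s)" in
    Linux)  awk '/^MemTotal:/ {printf "%d\n", $2 / 1024 / 1024 + 0.5}' /proc/meminfo ;;
    Darwin) sysctl -n hw.memsize | awk '{printf "%d\n", $1 / 1024 / 1024 / 1024 + 0.5}' ;;
    *)      echo 0 ;;
  esac
}

ram_gb=$(detect_ram_gb)
ram_gb=${ram_gb:-0}
echo "Detected about ${ram_gb} GB of RAM: suggested tier is $(suggest_model_size "$ram_gb")"
```

The "tiny" tier is not in the table; it stands in for machines below 8 GB, which are covered later on this page.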
Dedicated GPU? If you have an NVIDIA GPU (like an RTX 3060 or better), your VRAM (the GPU’s own memory) matters too. Check it in NVIDIA Settings or by running nvidia-smi. GPU memory speeds things up a lot, but it’s not required. CPU-only works fine for getting started.
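To see only the GPU name and its total memory, nvidia-smi can be asked for just those two fields. The wrapper function below is a hypothetical helper that also handles machines without an NVIDIA driver.

```shell
#!/bin/sh
# Print each NVIDIA GPU's name and total memory, or a note if none is visible.
gpu_memory_report() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null \
      || echo "nvidia-smi is installed but could not query a GPU"
  else
    echo "nvidia-smi not found: no NVIDIA driver detected, CPU-only is fine"
  fi
}

gpu_memory_report
```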
Install Ollama
Ollama is the simplest way to get a model running locally. One install, one command, you’re chatting.
Mac:
brew install ollama
Or download the installer from ollama.com
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows:
winget install Ollama.Ollama
Or download the installer from ollama.com
Once installed, Ollama runs as a background service. You interact with it through your terminal.
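A quick way to confirm the install worked is to check that the command is on your PATH and that the background service answers. The sketch below assumes the service listens on Ollama's default address, http://localhost:11434; the function name is illustrative.

```shell
#!/bin/sh
# Report whether the ollama CLI is on PATH and whether the local server answers.
# http://localhost:11434 is Ollama's default address; adjust it if you changed yours.
check_ollama() {
  if ! command -v ollama >/dev/null 2>&1; then
    echo "cli: missing"
    return 0
  fi
  echo "cli: $(ollama --version 2>&1 | head -n 1)"
  if curl -fsS --max-time 3 http://localhost:11434/ >/dev/null 2>&1; then
    echo "server: running"
  else
    echo "server: not reachable (try starting it with: ollama serve)"
  fi
}

check_ollama
```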
Why Ollama? There are other great tools: LM Studio if you prefer a visual interface, Unsloth Studio if you want training capabilities built in, and Open WebUI for a ChatGPT-like browser experience. We start with Ollama because it’s the fastest path from zero to running. You can explore the others in Module 4: What Can You Do With This?.
Pick Your First Model
A “model” is the AI brain you’re downloading. Different models have different strengths: some are better at conversation, some at coding, some at reasoning. For your first run, we want something that works well on your hardware and gives you a good experience.
If you have 8 GB RAM:
ollama pull qwen3.5:4b
# OR
ollama pull gemma4:e4b
Alibaba’s Qwen 3.5 4B gives you massive multilingual support. Gemma 4 E4B is Google DeepMind’s multimodal model: it handles text, images, audio, and video. Both fit in 8 GB of RAM.
If you have 16 GB RAM:
ollama pull qwen3.5:9b
Qwen 3.5 9B. Beats models 13x its size on reasoning benchmarks. Multimodal, native tool calling, 256K context window.
If you have 32 GB+ RAM:
ollama pull qwen3.5:27b
Qwen 3.5 27B. Top-tier quality for local. Also consider gpt-oss:20b (OpenAI’s first open-weight model since GPT-2) or qwen3.5:35b-a3b (a 35B model that only uses 3B active parameters, so it is fast and smart).
These recommendations reflect April 2026. New models ship fast. Other strong options include NVIDIA’s Nemotron 3 Nano, Google’s Gemma 4, Meta’s Llama 4, and OpenAI’s GPT-OSS. For the latest starter picks and alternatives, see our Model Reference.
The download will take a few minutes depending on your internet speed. The 4B model is about 2.5 GB, the 9B model about 6 GB, and the 27B model about 16 GB.
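Before pulling, it can help to compare the download size with your free disk space. The sketch below uses the approximate sizes quoted above and checks the filesystem that holds your home directory, on the assumption that Ollama stores models there (this may not hold for every install). Both function names are illustrative.

```shell
#!/bin/sh
# Approximate download size in GB for the starter models, using the figures quoted above.
approx_download_gb() {
  case "$1" in
    qwen3.5:4b)  echo 2.5 ;;
    qwen3.5:9b)  echo 6 ;;
    qwen3.5:27b) echo 16 ;;
    *)           echo unknown ;;
  esac
}

# Free space in whole GB on the filesystem holding a directory (POSIX df output).
free_disk_gb() {
  df -Pk "$1" | awk 'NR == 2 {printf "%d\n", $4 / 1024 / 1024}'
}

model=qwen3.5:9b
dir=${HOME:-/}
echo "$model needs about $(approx_download_gb "$model") GB; $(free_disk_gb "$dir") GB free under $dir"
```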
Talk to It
Once the download finishes:
ollama run qwen3.5:4b
(Replace qwen3.5:4b with whichever model you downloaded.)
You’ll see a prompt. Type anything:
>>> What's the most interesting thing about how language works?
And it responds. Running on your machine. No internet required.
Try a few things:
- Ask it to explain something you’re curious about
- Give it a writing task (“Write a short email declining a meeting politely”)
- Ask it to help with something practical (“What’s a good recipe for dinner with chicken and rice?”)
- Test its reasoning (“If I have 3 shirts and 4 pants, how many outfits can I make?”)
When you’re done, type /bye to exit.
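You can also ask a single question without entering the interactive prompt, because ollama run accepts the prompt as a second argument. In this minimal sketch the model tag is a placeholder for whichever model you pulled, and the ask function is a hypothetical wrapper that skips the call if Ollama isn't installed.

```shell
#!/bin/sh
# Ask one question without entering the interactive prompt.
# MODEL is a placeholder; use whichever model you pulled.
MODEL=qwen3.5:4b

ask() {
  if command -v ollama >/dev/null 2>&1; then
    ollama run "$MODEL" "$1"
  else
    echo "ollama is not installed; skipping: $1"
  fi
}

ask "If I have 3 shirts and 4 pants, how many outfits can I make?"
```

The answer is printed to the terminal and the command exits, which makes this form convenient for scripts.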
What Just Happened on Your Machine
When you typed that prompt, here’s what happened in about one second:
- Your text was converted to numbers. The model doesn’t read words. It reads “tokens,” small pieces of words represented as numbers.
- Those numbers went through the model. The model is a large file of numbers: billions of values that encode patterns learned from training on text. Your tokens flowed through layers of calculations in a transformer architecture.
- Based on everything it’s learned, it calculated the most likely next piece of text. Then the next. Then the next. One token at a time, building the response.
All of this ran on your CPU and RAM (or GPU if you have one). The model weights sit in your memory, the calculations happen on your processor, and nothing left your machine.
A big file of numbers, doing math on your hardware, producing text that feels like a conversation.
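The "one token at a time" loop can be illustrated with a toy. This is purely an illustration and not how Ollama or a real model works internally: the hard-coded lookup table stands in for billions of learned values, and whole words stand in for tokens.

```shell
#!/bin/sh
# Toy stand-in for a model: given the current token, return the "most likely" next one.
# A real model computes this from its weights; here it is a fixed lookup.
next_token() {
  case "$1" in
    the) echo cat ;;
    cat) echo sat ;;
    sat) echo down ;;
    *)   echo "" ;;
  esac
}

# Build a response one token at a time until there is no next token.
generate() {
  token=$1
  out=$1
  while :; do
    token=$(next_token "$token")
    [ -z "$token" ] && break
    out="$out $token"
  done
  echo "$out"
}

generate the   # prints: the cat sat down
```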
Want to understand the details? Module 3A: How Models Think explains transformers, attention, and how they fit together.
What If Your Hardware Can’t Run Local Models?
If you have less than 8 GB of RAM, or your machine is struggling, you have two options:
Option A: Try a tiny model anyway.
ollama pull qwen3.5:0.8b
# OR
ollama pull gemma4:e2b
The Qwen 0.8B parameter version is tiny enough to run on 4 GB of RAM. The newer Gemma 4 E2B is Google DeepMind’s smallest multimodal model — it handles text, images, audio, and video at 2 billion parameters. It uses Per-Layer Embeddings, a memory architecture that lets a small model hold its own on multimodal tasks. It’s the right pick when you need multimodal capability but have less than 5 GB free.
Option B: Start with a free frontier API.
You can still follow the learning path using a cloud model:
- Google AI Studio has a free tier with Gemini models
- Claude.ai has a free tier with Claude
- ChatGPT has a free tier with GPT models
Everything you’ll learn here (how models work, how to use them well, how to evaluate them) applies regardless of where the model runs. When you’re ready to go local, aim for at least 16 GB of RAM.
What’s Next
You’ve got a model running. Module 2: Choose Wisely explains why different models feel different and how to pick the right one for your hardware and your task. Or if you want to understand what just happened under the hood, Module 3A: How Models Think walks through the internals. If you’d rather skip the theory and start using this for real work, jump to Module 4: What Can You Do With This?.
Troubleshooting
“ollama: command not found”
Make sure the install completed. On Mac, try restarting your terminal. On Linux, you may need to add Ollama to your PATH or restart your shell.
The model is very slow
Your model might be too large for your RAM. Try a smaller one: ollama pull qwen3.5:0.8b. If you’re on a laptop, make sure it’s plugged in. Many laptops throttle performance on battery.
“failed to load model”
Usually means not enough memory. Close other applications to free up RAM, or try a smaller model.
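When diagnosing slowness or memory errors, it helps to see what Ollama currently has loaded. The ollama ps command lists loaded models with their size and whether they are running on CPU or GPU; the wrapper below is an illustrative helper that also handles a missing install or a stopped server.

```shell
#!/bin/sh
# Show which models are loaded right now, their memory size, and the CPU/GPU split.
show_loaded_models() {
  if command -v ollama >/dev/null 2>&1; then
    ollama ps || echo "could not reach the Ollama server"
  else
    echo "ollama not found on PATH"
  fi
}

show_loaded_models
```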
I want to try a different model
Browse available models at ollama.com/library or check our Model Reference for curated recommendations. To download a new model: ollama pull <model-name>.
I want to delete a model I’m not using
ollama rm qwen3.5:4b
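To see what is taking up disk space before deleting anything, ollama list shows each downloaded model with its size. The sketch below wraps the removal in a hypothetical helper that only calls ollama rm when the tag is actually present; it assumes the model name is the first column of the ollama list output.

```shell
#!/bin/sh
# Remove a model only if it is actually downloaded.
# Usage: remove_model_if_present qwen3.5:4b
remove_model_if_present() {
  tag=$1
  if ! command -v ollama >/dev/null 2>&1; then
    echo "ollama not found on PATH"
    return 0
  fi
  # `ollama list` prints a header row, then one model per line with its name first.
  if ollama list 2>/dev/null | awk 'NR > 1 {print $1}' | grep -Fxq "$tag"; then
    ollama rm "$tag"
  else
    echo "$tag is not downloaded; nothing to remove"
  fi
}
```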