Module 3A 12 min

🧠 How Models Think

From Text to Numbers: How Your Prompt Becomes Math

By the end of this module, you’ll understand how text becomes numbers, what attention actually does, and why bigger models behave differently from smaller ones.

You’ve been running models and picking good ones. But what actually happens in that second between pressing Enter and seeing a response?

Your model doesn’t understand words. Not letters either. It works entirely in numbers. So the first thing that happens is a conversion: your text gets turned into math.

Tokens: The Model’s Alphabet

Your prompt gets split into tokens, small chunks that are usually a word or part of a word, sitting somewhere between individual characters and full sentences.

Tokenization: Splitting "transformers" into tokens

Why not just use whole words? Because there are too many words in too many languages. A model whose vocabulary was a fixed list of whole English words would choke on a rare word missing from that list, or on a word in Korean. Tokens solve this by breaking things into reusable pieces. “Unforgettable” might become “un” + “forget” + “table” (the exact split depends on the tokenizer). Even if the model has never seen “unforgettable” as a unit, it knows all three pieces.
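To make the splitting idea concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary `TOY_VOCAB` and the function `tokenize` are made up for illustration; real tokenizers learn their pieces from data and usually use a different algorithm, so treat this as a picture of the idea rather than how any particular model does it.

```python
# Toy vocabulary: an assumption for illustration, not a real tokenizer's vocabulary.
TOY_VOCAB = {"un", "forget", "table", "trans", "form", "ers", "the", "cat"}

def tokenize(word, vocab=TOY_VOCAB):
    """Split one lowercase word into known pieces, longest match first."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible piece starting at position i, then shrink.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("unforgettable"))
print(tokenize("transformers"))
```

With this toy vocabulary, a word with no known pieces simply falls apart into single characters, which is why nothing is ever truly "out of vocabulary."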

Most models use a vocabulary of 32,000 to 128,000 tokens. That covers the common words and word-pieces across dozens of languages.

A rough conversion: 1 token is about 3/4 of a word. A 1,000-word essay is roughly 1,300 tokens.
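If you want that rule of thumb as a quick calculator, a sketch might look like this. The function name `estimate_tokens` is hypothetical, and the ratio is only the rough English-text average given above; actual counts depend on the tokenizer and the language.

```python
TOKENS_PER_WORD = 4 / 3  # from the rule of thumb: 1 token is about 3/4 of a word

def estimate_tokens(word_count):
    """Rough token estimate for English text. Not a substitute for a real tokenizer."""
    return round(word_count * TOKENS_PER_WORD)

for words in (100, 1000, 5000):
    print(words, "words is roughly", estimate_tokens(words), "tokens")
```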

From Tokens to Vectors

Each token in the vocabulary maps to a list of numbers called an embedding vector. Like GPS coordinates: latitude and longitude describe where a city is, and an embedding describes where a word sits in “meaning space.”

Except instead of two numbers, a typical embedding has 4,096. That many dimensions is impossible to picture, but the principle is straightforward: words with similar meanings end up near each other. “Dog” and “puppy” are close. “Dog” and “algebra” are not.
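"Near each other" is usually measured with cosine similarity, which compares the direction of two vectors. The sketch below uses made-up 4-number vectors in place of real 4,096-number embeddings; the values in `toy_embeddings` are invented purely so the example has something to compare.

```python
import math

def cosine_similarity(a, b):
    """1.0 means same direction, 0.0 means unrelated directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented vectors for illustration only. Real embeddings are learned.
toy_embeddings = {
    "dog": [0.9, 0.8, 0.1, 0.0],
    "puppy": [0.8, 0.9, 0.2, 0.0],
    "algebra": [0.0, 0.1, 0.9, 0.8],
}

dog_puppy = cosine_similarity(toy_embeddings["dog"], toy_embeddings["puppy"])
dog_algebra = cosine_similarity(toy_embeddings["dog"], toy_embeddings["algebra"])
print("dog vs puppy:", round(dog_puppy, 2))
print("dog vs algebra:", round(dog_algebra, 2))
```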

Embeddings: Mapping tokens to high-dimensional space

These vectors aren’t hand-coded. The model learned them during training by reading billions of pages of text. Words that appeared in similar contexts ended up with similar vectors.

The relationships are preserved as geometry: “king” minus “man” plus “woman” lands close to “queen” in the vector space. A common explanation is that training pushes concepts like royalty and gender into roughly separate directions, so you can subtract one and add another.

For retrieval (used in RAG), dedicated embedding models are trained specifically to place similar passages close together, which is a different training objective than predicting the next token. Pairing a small embedding model for retrieval with a larger generation model for answers is usually faster, and often more accurate, than using one model for both.

So when you type a prompt, every token becomes a list of numbers. Your whole sentence becomes a grid of numbers: one row per token, 4,096 columns each. That grid is what enters the model.
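The lookup itself is simple: each token has an ID, and the ID picks a row out of a big table. Here is a small sketch with shrunken sizes (`EMBED_DIM = 8` instead of 4,096, and a 50-token vocabulary). The table is filled with random numbers only as a stand-in; in a real model those rows are learned during training.

```python
import random

EMBED_DIM = 8    # stand-in for 4,096 to keep the example small
VOCAB_SIZE = 50  # stand-in for a real vocabulary of tens of thousands of tokens

random.seed(0)
# One row of EMBED_DIM numbers per token ID (random here, learned in a real model).
embedding_table = [
    [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Look up each token ID and stack the rows into a grid."""
    return [embedding_table[token_id] for token_id in token_ids]

grid = embed([3, 17, 42])  # a three-token "prompt" with made-up IDs
print("rows (tokens):", len(grid))
print("columns (dimensions):", len(grid[0]))
```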

The Transformer: Deciding What Comes Next

The transformer is the architecture behind nearly every modern language model. It was introduced in 2017, and it replaced older designs that read text one word at a time, left to right, like a person reading a book.

The transformer looks at everything at once. Instead of processing your prompt word by word, it looks at all the tokens simultaneously and figures out the relationships between them: which words modify which, what “it” refers to, whether “bank” means a river bank or a financial institution.

The Assembly Line

A transformer is a stack of identical processing blocks, called layers. A 7B model typically has 32 of them. Each layer does two things:

Attention — figure out which tokens are relevant to each other
Feed-forward — use that information to update each token’s understanding

Then pass the result to the next layer. Each layer refines the model’s understanding.

Transformer Architecture: A stack of Attention and Feed-Forward layers

Early layers tend to pick up basic patterns: grammar, syntax, which words go together. Later layers handle meaning, reasoning, and what your prompt is actually about. After all 32 layers, the model has built up enough context to predict what comes next.

One Token at a Time

The model only predicts one token at a time. It doesn’t compose a whole response and hand it to you. It scores every possible next token, picks one (usually among the most likely; the Temperature section below covers how), appends it, and runs the whole stack again for the token after that. Repeat until the response is done.

That’s why responses appear word by word, and why speed is measured in tokens per second. A model running at 30 tok/s is making 30 separate predictions every second. A 100-word response takes about 130 of those passes.
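The loop described above can be sketched in a few lines. Here `predict_next` is a hypothetical stand-in for a full pass through the layer stack: it just looks up the last token in a tiny hard-coded table, where a real model would score its whole vocabulary.

```python
# Hypothetical stand-in for the model: maps the last token to the next one.
TOY_NEXT = {"the": "cat", "cat": "sat", "sat": "on", "on": "the_mat", "the_mat": "<end>"}

def predict_next(tokens):
    """Pretend forward pass. A real model would use all tokens, not just the last."""
    return TOY_NEXT.get(tokens[-1], "<end>")

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    passes = 0
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)  # one full pass per new token
        passes += 1
        if next_token == "<end>":
            break
        tokens.append(next_token)  # the new token becomes part of the input
    return tokens, passes

tokens, passes = generate(["the"])
print(tokens, "took", passes, "passes")
```

The same arithmetic gives you a time estimate: about 130 passes at 30 tok/s is a little over 4 seconds for a 100-word response.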

Attention: Why Not Every Word Matters Equally

Attention is the main mechanism in the transformer. For each token in the sequence, it figures out which other tokens matter most.

The “It” Problem

Consider this sentence:

“The cat sat on the mat because it was tired.”

What does “it” refer to? The cat or the mat? You know instantly: the cat. Mats don’t get tired. But how does a model figure that out?

Attention computes a relevance score between every pair of tokens. When the model processes “it,” it assigns a high attention score to “cat” and a low score to “mat.” That score determines how much information flows from “cat” into the model’s understanding of “it.”

The Mechanism: Queries, Keys, and Values

For each token, the model creates three vectors: a Query (“what am I looking for?”), a Key (“what do I contain?”), and a Value (“what is my actual information?”). These aren’t hand-coded; they are mathematical projections created by multiplying the token’s embedding by weights learned during training.

The model computes a relevance score by taking the dot product of every Query against every Key. A high dot product means a strong match. It pushes these raw scores through a mathematical function (softmax) that turns them into probabilities adding up to 100%. Finally, it uses those probabilities to blend all the Values together.
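Here is a minimal single-head sketch of those three steps: dot products, softmax, blend. The Query, Key, and Value numbers are invented so that “it” matches “cat” more strongly than “mat”; in a real model they come from learned weights. Two details go beyond the description above: the scores are divided by the square root of the vector size, which is the standard scaling in transformers, and this sketch leaves out the masking that stops a token from looking at later tokens.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Dot product of this Query against every Key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Blend all the Values, weighted by how relevant each token is.
        blended = [sum(w * v[i] for w, v in zip(weights, values))
                   for i in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Invented toy vectors: position 0 is "cat", position 1 is "mat".
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
it_query = [[2.0, 0.0]]  # the Query for "it" points toward the "cat" Key

outputs, weights = attention(it_query, keys, values)
print("attention on cat, mat:", [round(w, 2) for w in weights[0]])
```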

Self-Attention mechanism: Query/Key matching and Value blending

This creates a bottleneck: quadratic complexity. Because every token checks its Query against every other token’s Key, processing 100 tokens requires 10,000 attention scores. Process 200 tokens, and you need 40,000 scores. Double the context length, and the attention mechanism does four times the work.

This quadratic cost is also why long conversations get expensive — Module 3B explains what the KV cache does about it.
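The growth is easy to check by hand. This small sketch just counts Query-Key pairs for a few context lengths; `attention_score_count` is a made-up helper name, and the count ignores per-head and per-layer multipliers.

```python
def attention_score_count(num_tokens):
    """Every token's Query is compared against every token's Key."""
    return num_tokens * num_tokens

for n in (100, 200, 400, 800):
    print(f"{n} tokens -> {attention_score_count(n):,} scores")
```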

Multiple Perspectives at Once

The model doesn’t run attention just once per layer. It runs multiple copies in parallel, called attention heads. A typical model has 32 per layer.

Different heads learn to notice different things. Some track grammar (subject-verb agreement). Some handle co-reference (“it” refers to “cat”). Some pick up on meaning similarity, others on word distance. Each head sees the same tokens but asks a different question, and their results get merged into a single, more complete picture.
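One way to picture heads is that each one works on its own slice of every token's vector, and the slices get stitched back together afterward. The sketch below shows only that split-and-merge shape. It is heavily simplified: each slice acts as its own Query, Key, and Value, whereas a real model uses separate learned projections per head plus a final learned mixing step. The function names are hypothetical.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def single_head(vectors):
    """Self-attention where each vector is its own Query, Key, and Value (a simplification)."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors])
        out.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)])
    return out

def multi_head(vectors, num_heads):
    d = len(vectors[0])
    assert d % num_heads == 0, "vector size must divide evenly across heads"
    head_dim = d // num_heads
    head_outputs = []
    for h in range(num_heads):
        start, end = h * head_dim, (h + 1) * head_dim
        # Each head only sees its own slice of every token's vector.
        head_outputs.append(single_head([v[start:end] for v in vectors]))
    # Merge: concatenate each token's per-head results back into one vector.
    merged = []
    for t in range(len(vectors)):
        row = []
        for h in range(num_heads):
            row.extend(head_outputs[h][t])
        merged.append(row)
    return merged

toy_tokens = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
merged = multi_head(toy_tokens, num_heads=2)
print(len(merged), "tokens,", len(merged[0]), "numbers each")
```

Under this splitting scheme, a 4,096-number vector shared across 32 heads gives each head 128 numbers to work with.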

Why Bigger Models Know More

You’ve seen that a 70B model generally outperforms a 9B model. Module 2 covered this from the practical side. Here’s what’s actually going on inside.

More Parameters = More Storage

Every parameter is a single number, and those numbers are the model’s knowledge. A 7B model has 7 billion of them spread across its attention weights, feed-forward weights, and token vectors. A 70B model has 10x more to work with.
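You can get surprisingly close to “7 billion” with back-of-the-envelope arithmetic. The sketch below assumes a particular layout that is not stated in this module: four attention matrices per layer, a feed-forward block with three matrices, separate input and output token tables, and a hypothetical feed-forward width of 11,008. Real models differ in these details, and small pieces such as normalization weights are ignored.

```python
def estimate_params(vocab_size, d_model, n_layers, d_ff):
    """Rough parameter count under the assumptions described above."""
    token_vectors = 2 * vocab_size * d_model  # input table + output table (assumed separate)
    attention = 4 * d_model * d_model         # Query, Key, Value, and output matrices
    feed_forward = 3 * d_model * d_ff         # three matrices per block (assumption)
    return token_vectors + n_layers * (attention + feed_forward)

# Hypothetical 7B-class configuration using numbers from this module where available.
total = estimate_params(vocab_size=32_000, d_model=4096, n_layers=32, d_ff=11_008)
print(f"about {total / 1e9:.1f} billion parameters")
```

Notice where the numbers live: in this layout the feed-forward weights hold roughly twice as many parameters as the attention weights, and the token vectors are a small slice of the total.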

More storage means the model can:

Remember more facts (“What’s the capital of Burkina Faso?”)
Learn more subtle patterns (“This phrasing implies sarcasm”)
Handle more languages with less confusion
Track finer distinctions in complex reasoning

More Layers = Deeper Reasoning

Bigger models also have more layers. A 70B model might have 80 layers compared to 32 for a 7B. Each layer is another pass of refinement before the model commits to an answer.

Simple questions might only need a few layers. Multi-step math or logic problems need the full depth. Small models handle casual chat fine but struggle on hard reasoning because they run out of layers before they’ve worked through the problem.

The Catch: Diminishing Returns

Doubling a model’s size doesn’t double its quality. Going from 3B to 7B is a huge jump. Going from 70B to 140B is noticeable but the gap is smaller. Each doubling adds less than the last.

This is why training quality matters so much. A well-trained 14B model can match a poorly trained 70B model. The Phi-4 example from Module 2 showed exactly that. At some point, better training data and techniques do more than just adding parameters.

Mixture of Experts: The Compute vs. Memory Tradeoff

Some models cheat the size/speed equation with a Mixture of Experts (MoE) architecture.

In a standard transformer, every token passes through every feed-forward layer. In an MoE model, the deep feed-forward blocks are split into parallel sub-networks called “experts.” A learned gate, or “router,” looks at the incoming token and sends it to only the top-K experts that specialize in that type of data.

A 120B MoE model might only run 5B active parameters per token. It has the total knowledge capacity of a 120B model (stored across all the dormant experts) but runs at roughly the speed of a 5B model (because only the top experts do the math for any given token).
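The routing step can be sketched with toy pieces. Here each “expert” is just a tiny function and the router scores are hard-coded; in a real model the experts are feed-forward networks and the scores come from a learned gate. How the chosen experts' outputs are weighted also varies between models, so the softmax over the selected scores below is one assumed option.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_scores, top_k):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_scores[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_forward(x, router_scores, experts, top_k=2):
    picks = route(router_scores, top_k)
    # Only the chosen experts run. The rest stay loaded but do no work for this token.
    output = sum(weight * experts[i](x) for i, weight in picks)
    return output, [i for i, _ in picks]

# Toy experts and invented router scores for one token.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
router_scores = [0.1, 2.0, -1.0, 1.0]

output, used = moe_forward(3.0, router_scores, experts, top_k=2)
print("experts used:", used, "out of", len(experts))
```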

But here is the catch: you still need enough memory to hold the entire 120B model. The router decides per token, so any expert could be called at any moment, and all weights must stay loaded in memory. MoE saves you compute time, but it does not save you RAM.

This is how models like Qwen 3.5 35B-A3B and GPT-OSS 120B work. When a name includes a suffix like “A3B,” the “A” tells you the active parameter count (here, 3B).

Temperature: Controlling Creativity vs. Precision

When the model finishes processing your prompt through all its layers, the final output is a list of probabilities: one score for every token in its vocabulary. “The next token is 45% likely to be ‘mat’, 25% likely to be ‘floor’, 15% likely to be ‘couch’…” and so on for every token in the vocabulary.

How does it pick one? That’s where temperature comes in.

What Temperature Does

Temperature rescales the model’s raw scores before they are turned into probabilities and a token is picked. Turn it down and the most likely word wins almost every time. Turn it up and less likely words start getting a real chance.

At temperature 0, the model always picks the highest-probability next token (greedy decoding). This is deterministic but can produce worse results on tasks where the right answer isn’t the most statistically common one.

At low values (0.1 - 0.3), responses are focused and predictable. The model sticks to safe, obvious choices. Good for factual questions and code. At high values (1.0 - 1.5), things get more creative and sometimes weird. Unusual word choices show up. Good for brainstorming. Most tools default to somewhere in the middle (0.6 - 0.8), which balances variety with coherence.
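A common way to implement this is to divide the model's raw scores by the temperature before turning them into probabilities. The sketch below does exactly that with four invented scores; the function names and the numbers in `raw_scores` are made up for illustration, and temperature 0 is handled as a separate "just take the top one" case.

```python
import math

def apply_temperature(raw_scores, temperature):
    """Divide raw scores by the temperature, then convert to probabilities."""
    if temperature <= 0:
        raise ValueError("temperature 0 means greedy decoding; use greedy() instead")
    scaled = {tok: s / temperature for tok, s in raw_scores.items()}
    top = max(scaled.values())
    exps = {tok: math.exp(s - top) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def greedy(raw_scores):
    """Temperature 0: always take the highest-scoring token."""
    return max(raw_scores, key=raw_scores.get)

# Invented raw scores for the next token after "The cat sat on the".
raw_scores = {"mat": 3.0, "floor": 2.4, "couch": 1.9, "moon": 0.0}

for t in (0.2, 0.7, 1.5):
    probs = apply_temperature(raw_scores, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
print("greedy pick:", greedy(raw_scores))
```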

Temperature: Controlling the probability distribution of the next token
Low temp: “mat” almost every time. Safe. Boring.
High temp: “moon” actually has a shot. Surprising. Risky.

Other Sampling Controls

Temperature isn’t the only knob. There are a few others that control which tokens make the cut:

Control	What it does
Top-K	Only consider the K most likely tokens. Top-K of 40 means ignore everything outside the top 40 candidates.
Top-P	Keep adding tokens from most to least likely until their combined probability hits P (e.g., 0.9 = 90%). Adapts to context: confident predictions use fewer tokens, ambiguous ones use more.
Min-P	Cut any token whose probability is less than min_p times the top token’s probability. If the best candidate has 40% and min_p is 0.1, anything below 4% is gone.
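The three filters in the table can each be sketched in a few lines. The probabilities in `probs` are invented, and the function names are hypothetical rather than any tool's actual API. Each filter drops some candidates and then rescales what is left so it adds up to 100% again.

```python
def renormalize(probs):
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}

def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    return renormalize({tok: probs[tok] for tok in keep})

def top_p_filter(probs, p):
    """Keep adding tokens, most likely first, until their combined probability reaches p."""
    kept, cumulative = {}, 0.0
    for tok in sorted(probs, key=probs.get, reverse=True):
        kept[tok] = probs[tok]
        cumulative += probs[tok]
        if cumulative >= p:
            break
    return renormalize(kept)

def min_p_filter(probs, min_p):
    """Drop any token below min_p times the top token's probability."""
    threshold = min_p * max(probs.values())
    return renormalize({tok: pr for tok, pr in probs.items() if pr >= threshold})

# Invented next-token probabilities.
probs = {"mat": 0.40, "floor": 0.25, "couch": 0.15, "rug": 0.12, "sofa": 0.05, "moon": 0.03}

print("top-k 2:   ", list(top_k_filter(probs, 2)))
print("top-p 0.9: ", list(top_p_filter(probs, 0.9)))
print("min-p 0.1: ", list(min_p_filter(probs, 0.1)))
```

With these numbers, min-p of 0.1 sets the cutoff at 4%, so “sofa” at 5% survives and “moon” at 3% does not.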

For most uses, the defaults work fine. Adjust when the model feels too boring (raise temperature) or too chaotic (lower it).

What’s Next

The next question: how does all of this fit on your machine? Module 3B: How Models Fit covers the full memory equation, why long conversations cost more, and what’s changing the rules. Ready to put this to work? Module 4: What Can You Do With This? is the practical side.
