In-context Memory
Key idea: A short-term memory that boosts recently-seen words (within what the model already allows) keeps text on topic, the way attention reuses earlier context.
Give your model a short-term memory. Nudge generation toward words you’ve used recently and the text stays on topic: a tabletop version of how real models pay attention to their own context.

You will need
- any model you can already generate from: your bigram grid, a cutouts spread, or a pre-trained booklet
- pen and paper for your generated text
- dice (or a coin) as per Generation
This lesson adds a procedure on top of a model you already have—there’s nothing new to print or build.
Your goal
Generate text twice from the same model—once plain, once with a short-term memory—and compare. The memory version should stay on topic for longer. Stretch goal: find the setting where the memory helps without making the text repetitive.
Key idea
A bigram only ever looks at the single word before. It has no idea what the text has been about—mention a dog, and two words later the model has already forgotten. Real language reuses what came before: once a story is about a dog, it keeps being about the dog.
We can capture that with a short-term memory: a running list of the words you’ve written recently. When you pick the next word, you give a small boost to any candidate that’s on that list. Recently-used words become a little more likely, so topics and characters persist.
The model itself never changes: you’re not retraining it. The extra context lives in the text you’ve already generated. That’s exactly what makes this a model of in-context learning: the behaviour shifts based on the context in front of it, with no change to the underlying model.
The one rule that matters: reweight, don’t override
There’s a tempting shortcut: each turn, just grab a recent word from memory and write it down. Don’t. That gets stuck on rails fast, for two reasons:
- it ignores the current word, so it can drop in a word that doesn’t follow at all—ungrammatical nonsense
- every word you emit goes back into memory, so emitting “dog” makes “dog” more likely next turn, which puts another “dog” in memory… and the text collapses into repeating a handful of words
The fix is to keep the model in charge of what’s allowed. The memory only ever adds a boost to words the model already offers as possible next words. It re-ranks the model’s candidates; it never invents one.
Algorithm
- Start a memory list. As you generate, keep the last ~8 words you’ve written (just underline them in your output, or jot them on a sticky note).
- Find the model’s candidates for the next word exactly as you normally would—the row in your grid, the matching cutouts, or the entry in your booklet.
- Boost the ones in memory. Any candidate that also appears in your memory list gets a bonus (see “Applying it to your model” for how, per base).
- Pick from the boosted candidates, write the word down, add it to memory, and drop the oldest word so the list stays short.
- Repeat.
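The steps above can be sketched in Python. This is a minimal sketch rather than part of the lesson's materials: the function names (`boosted_counts`, `generate`), the dictionary-of-counts format for the model, the toy model itself, and the default `bonus` and `memory_size` values are all assumptions made for illustration.

```python
import random
from collections import deque


def boosted_counts(candidates, memory, bonus):
    """Return a copy of the candidate counts with `bonus` added to words in memory.

    Only words the model already offers are boosted; no new candidate is added.
    """
    return {word: count + (bonus if word in memory else 0)
            for word, count in candidates.items()}


def generate(model, start, length, bonus=3, memory_size=8, rng=None):
    """Generate up to `length` words after `start` with a memory-boosted bigram model.

    `model` maps each word to {next_word: count} (an assumed format).
    """
    rng = rng or random.Random()
    # A deque with maxlen drops the oldest word automatically.
    memory = deque([start], maxlen=memory_size)
    output = [start]
    current = start
    for _ in range(length):
        candidates = model.get(current)
        if not candidates:
            break  # the model offers no next word, so stop
        weights = boosted_counts(candidates, memory, bonus)
        words = list(weights)
        current = rng.choices(words, weights=[weights[w] for w in words])[0]
        output.append(current)
        memory.append(current)
    return output


# Hypothetical toy model, for illustration only.
toy_model = {
    "the": {"cat": 3, "dog": 1, "sun": 1},
    "cat": {"saw": 1},
    "dog": {"saw": 1},
    "sun": {"saw": 1},
    "saw": {"the": 1},
}

print(" ".join(generate(toy_model, "the", 12, bonus=0)))  # plain
print(" ".join(generate(toy_model, "the", 12, bonus=3)))  # with memory
```

Setting `bonus=0` gives plain generation, so the same function can produce both halves of the comparison. Because the boost is applied only to keys already in `candidates`, the memory can re-rank the model's options but cannot introduce a word the model does not offer.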
Worked example
Suppose you’re generating and your current word is the. Your model offers:
- cat: 3
- dog: 1
- sun: 1
A moment ago the text mentioned a dog, so dog is in your memory list. Give it a bonus of, say, +3:
- cat: 3
- dog: 1 + 3 = 4
- sun: 1
Now roll on the new totals (1-3 → cat, 4-7 → dog, 8 → sun). dog has gone from unlikely (1 in 5) to favourite (4 in 8), so the story is more likely to stay about the dog. Crucially, cat and sun are still possible: the memory tilted the odds, it didn’t override them.
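The arithmetic in this example can be checked with a short Python sketch. The helper name `dice_ranges` is an assumption for illustration; it simply lays the counts end to end as die-roll ranges, first without and then with the +3 bonus for dog.

```python
def dice_ranges(counts):
    """Turn counts into consecutive die-roll ranges: {word: (low, high)}."""
    ranges = {}
    low = 1
    for word, count in counts.items():
        if count > 0:
            ranges[word] = (low, low + count - 1)
            low += count
    return ranges


counts = {"cat": 3, "dog": 1, "sun": 1}
memory = ["dog"]
bonus = 3

# Add the bonus only to candidates that are also in memory.
boosted = {w: c + (bonus if w in memory else 0) for w, c in counts.items()}

print(dice_ranges(counts))   # {'cat': (1, 3), 'dog': (4, 4), 'sun': (5, 5)}
print(dice_ranges(boosted))  # {'cat': (1, 3), 'dog': (4, 7), 'sun': (8, 8)}
```

The plain totals sum to 5 and the boosted totals to 8, so dog's share of the rolls rises from 1 in 5 to 4 in 8, while cat and sun keep a non-zero range.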
Applying it to your model
The boost is the same idea everywhere—it only differs in how you apply it, because each base stores its probabilities differently.
On a grid
Add the bonus straight to the counts in the current word’s row, then roll on the new totals (as in the worked example). The grid makes the reweighting visible.
On a cutouts spread
When you’ve gathered the cutouts that match your current word, check their next words against your memory list. For any match, grab one extra copy of that cutout (write a duplicate) before you pick. More copies means more likely—the spread does the reweighting for you, no arithmetic.
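As a rough sketch of the cutouts version in Python: the matching cutouts are modelled as a plain list of next words, one entry per cutout, and every cutout whose next word is in memory gets one duplicate. The names `boosted_pool` and `pick_from_cutouts` are hypothetical, and reading the rule as "one extra copy per matching cutout" is an assumption about the procedure described above.

```python
import random


def boosted_pool(next_words, memory):
    """Return the cutout pile plus one duplicate of each cutout found in memory."""
    extras = [word for word in next_words if word in memory]
    return next_words + extras


def pick_from_cutouts(next_words, memory, rng=None):
    """Pick one next word uniformly at random from the boosted pile."""
    rng = rng or random.Random()
    return rng.choice(boosted_pool(next_words, memory))


# Five cutouts match the current word; "dog" is in memory.
pile = ["cat", "cat", "cat", "dog", "sun"]
print(boosted_pool(pile, ["dog"]))
print(pick_from_cutouts(pile, ["dog"]))
```

Under this reading, a word in memory has its number of copies doubled, so the boost scales with how common the word already was, unlike the fixed bonus used on a grid.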
On a pre-trained booklet
The booklet’s dice thresholds are pre-printed, so you can’t reweight them by hand. Blend in the memory with a separate roll instead:
- roll a small “memory die” first (e.g. a d10: 1-3 is a hit)
- on a hit, scan the current entry’s options—if one of them is in your memory list, pick it
- on a miss, or if no option is in memory, roll the booklet exactly as normal
This keeps the booklet frozen while still tilting toward recent words—and because you only ever pick from this entry’s options, you never leave what the model allows.
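The booklet procedure above can be sketched in Python as follows. This is a sketch under stated assumptions: `blend_pick` is a hypothetical name, `roll_booklet` is a stand-in callable for whatever the booklet's normal roll would return, and choosing at random when several options are in memory is an assumption, since the steps above do not say how to break that tie.

```python
import random


def blend_pick(options, memory, roll_booklet, hit_on=3, die_sides=10, rng=None):
    """Roll a memory die first; on a hit, prefer an option that is in memory.

    `options` are the next words listed in the current booklet entry.
    `roll_booklet` is called (unchanged) on a miss or when no option is in memory.
    """
    rng = rng or random.Random()
    memory_roll = rng.randint(1, die_sides)  # e.g. a d10
    if memory_roll <= hit_on:                # 1-3 is a hit
        in_memory = [word for word in options if word in memory]
        if in_memory:
            # Assumption: if several options are in memory, pick one at random.
            return rng.choice(in_memory)
    return roll_booklet()


# Hypothetical booklet entry: its own roll is stubbed to always give "cat".
print(blend_pick(["cat", "dog", "sun"], ["dog"], roll_booklet=lambda: "cat"))
```

With a 3-in-10 hit chance and exactly one option in memory, that option is chosen with probability 0.3 + 0.7 × p, where p is its chance under the booklet's normal roll. Raising `hit_on` plays the same role as raising the bonus on a grid.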
Instructor notes
Discussion questions
- with the memory on, does the text hold a topic or a character for longer?
- what happens as you turn the bonus up? Where does “on topic” tip over into “stuck repeating”?
- why does the memory only boost words the model already offers? What goes wrong if you let it pick any recent word?
- the model never changed—so where did the extra “knowledge” come from?
- how is this different from the trigram, which also uses more than one word?
Connection to current LLMs
This is a hands-on model of two of the most important things modern LLMs do.
Attention over the whole context. A bigram looks back one word; the memory looks back over the last several words you’ve generated and lets them influence the next choice. That’s the heart of the attention mechanism: when predicting the next word, the model reaches back across the whole context and weights what it finds. Your memory is a crude, fixed version (boost recent words); attention is learned and content-sensitive.
In-context learning. Notice that the model was never retrained: the behaviour changed purely because of what was already in the text. This is why you can give an LLM a few examples or a topic in your prompt and it picks up the pattern with no change to its weights. The “learning” happens in the context, not the model.
Repetition is a real failure mode. The “stuck on rails” trap you avoided is not hypothetical: real models can loop too, which is why generation settings include repetition penalties—a deliberate down-weighting of recent words, the exact opposite tuning of the boost you just added. Topic-stickiness and repetition are two ends of the same dial.