In-context Memory

Key idea: A short-term memory that boosts recently seen words (within what the model already allows) keeps text on topic, the way attention reuses earlier context.

Give your model a short-term memory. Nudging generation toward words you’ve used recently keeps the text on topic: a tabletop version of how real models pay attention to their own context.

You will need

any model you can already generate from: your bigram grid, a cutouts spread, or a pre-trained booklet
pen and paper for your generated text
dice (or a coin) as per Generation

This lesson adds a procedure on top of a model you already have—there’s nothing new to print or build.

Your goal

Generate text twice from the same model—once plain, once with a short-term memory—and compare. The memory version should stay on topic for longer. Stretch goal: find the setting where the memory helps without making the text repetitive.

Key idea

A bigram only ever looks at the single word before. It has no idea what the text has been about—mention a dog, and two words later the model has already forgotten. Real language reuses what came before: once a story is about a dog, it keeps being about the dog.

We can capture that with a short-term memory: a running list of the words you’ve written recently. When you pick the next word, you give a small boost to any candidate that’s on that list. Recently-used words become a little more likely, so topics and characters persist.

The model itself never changes; you’re not retraining it. The extra context lives in the text you’ve already generated. That’s exactly what makes this a model of in-context learning: the behaviour shifts based on the context in front of it, with no change to the underlying model.

The one rule that matters: reweight, don’t override

There’s a tempting shortcut: each turn, just grab a recent word from memory and write it down. Don’t. That gets stuck on rails fast, for two reasons:

it ignores the current word, so it can drop in a word that doesn’t follow at all—ungrammatical nonsense
every word you emit goes back into memory, so emitting “dog” makes “dog” more likely next turn, which puts another “dog” in memory… and the text collapses into repeating a handful of words

The fix is to keep the model in charge of what’s allowed. The memory only ever adds a boost to words the model already offers as possible next words. It re-ranks the model’s candidates; it never invents one.

Algorithm

Start a memory list. As you generate, keep the last ~8 words you’ve written (just underline them in your output, or jot them on a sticky note).
Find the model’s candidates for the next word exactly as you normally would—the row in your grid, the matching cutouts, or the entry in your booklet.
Boost the ones in memory. Any candidate that also appears in your memory list gets a bonus (see “Applying it to your model” for how, per base).
Pick from the boosted candidates, write the word down, add it to memory, and drop the oldest word so the list stays short.
Repeat.
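
The steps above can be sketched in Python. This is a minimal illustration rather than part of the lesson materials. It assumes the model is stored as a dictionary of next-word counts. The names `toy_model`, `boosted_weights` and `generate`, the flat bonus of 3 and the memory size of 8 are all choices made for this example.

```python
import random
from collections import deque


def boosted_weights(candidates, memory, bonus):
    """Copy the model's candidate counts, adding `bonus` to words found in memory.

    Only words the model already offers can appear in the result.
    """
    return {w: c + (bonus if w in memory else 0) for w, c in candidates.items()}


def generate(model, start, steps, memory_size=8, bonus=3, rng=None):
    """Generate up to `steps` more words, boosting candidates that are in memory."""
    rng = rng or random.Random()
    # deque(maxlen=...) drops the oldest word automatically when full.
    memory = deque([start], maxlen=memory_size)
    text = [start]
    current = start
    for _ in range(steps):
        candidates = model.get(current)
        if not candidates:  # dead end: the model offers nothing after this word
            break
        weights = boosted_weights(candidates, memory, bonus)
        words = list(weights)
        current = rng.choices(words, weights=[weights[w] for w in words])[0]
        text.append(current)
        memory.append(current)
    return text


# Hypothetical toy bigram counts, made up for illustration only.
toy_model = {
    "the": {"cat": 3, "dog": 1, "sun": 1},
    "cat": {"sat": 2, "saw": 1},
    "dog": {"sat": 1, "saw": 2},
    "sun": {"sat": 1},
    "sat": {"the": 1},
    "saw": {"the": 1},
}

print(" ".join(generate(toy_model, "the", 12, rng=random.Random(0))))
```

Setting `bonus=0` gives the plain model, so the same function can produce both texts for the comparison described in “Your goal”.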

Worked example

Suppose you’re generating and your current word is the. Your model offers:

cat: 3
dog: 1
sun: 1

A moment ago the text mentioned a dog, so dog is in your memory list. Give it a bonus of, say, +3:

cat: 3
dog: 1 + 3 = 4
sun: 1

Now roll on the new totals (1-3 → cat, 4-7 → dog, 8 → sun). dog has gone from unlikely to favourite—so the story is more likely to stay about the dog. Crucially, cat and sun are still possible: the memory tilted the odds, it didn’t override them.
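
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: `die_ranges` is a hypothetical helper written for this illustration. It turns a set of counts into the die faces each word owns, assuming a die with as many faces as the total count.

```python
def die_ranges(weights):
    """Map each word to the inclusive (low, high) die faces it owns."""
    ranges, low = {}, 1
    for word, weight in weights.items():
        ranges[word] = (low, low + weight - 1)
        low += weight
    return ranges


counts = {"cat": 3, "dog": 1, "sun": 1}  # the model's row for "the"
memory = {"dog"}                          # words written recently
bonus = 3

# Add the bonus only to candidates that are also in memory.
boosted = {w: c + (bonus if w in memory else 0) for w, c in counts.items()}

print(boosted)              # {'cat': 3, 'dog': 4, 'sun': 1}
print(die_ranges(boosted))  # {'cat': (1, 3), 'dog': (4, 7), 'sun': (8, 8)}
```

Before the boost, dog owned 1 face in 5. After it, dog owns 4 faces in 8.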

Applying it to your model

The boost is the same idea everywhere—it only differs in how you apply it, because each base stores its probabilities differently.

On a grid

Add the bonus straight to the counts in the current word’s row, then roll on the new totals (as in the worked example). The grid makes the reweighting visible.

On a cutouts spread

When you’ve gathered the cutouts that match your current word, check their next words against your memory list. For any match, grab one extra copy of that cutout (write a duplicate) before you pick. More copies means more likely—the spread does the reweighting for you, no arithmetic.
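
The duplicate-cutout trick can be sketched in Python as follows. This sketch assumes each cutout is a (word, next word) pair. It also reads “one extra copy” as one duplicate per matching cutout, which is an interpretation rather than something the lesson spells out. `boosted_pile` is a hypothetical name.

```python
import random


def boosted_pile(cutouts, current, memory, extra_copies=1):
    """Gather next words for `current`, duplicating those that are in memory."""
    pile = [nxt for word, nxt in cutouts if word == current]
    # Assumption: every matching cutout gets `extra_copies` duplicates.
    duplicates = [nxt for nxt in pile if nxt in memory for _ in range(extra_copies)]
    return pile + duplicates


# Hypothetical cutouts, made up for illustration only.
cutouts = [("the", "cat")] * 3 + [("the", "dog"), ("the", "sun"), ("a", "dog")]

pile = boosted_pile(cutouts, "the", memory={"dog"})
print(pile)  # ['cat', 'cat', 'cat', 'dog', 'sun', 'dog']
print(random.choice(pile))  # picking a cutout at random does the reweighting
```

Because the pile is built only from cutouts that follow the current word, a word in memory that the model never offers here cannot be picked.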

On a pre-trained booklet

The booklet’s dice thresholds are pre-printed, so you can’t reweight them by hand. Use the blend form instead:

roll a small “memory die” first (e.g. a d10: 1-3 is a hit)
on a hit, scan the current entry’s options—if one of them is in your memory list, pick it
on a miss, or if no option is in memory, roll the booklet exactly as normal

This keeps the booklet frozen while still tilting toward recent words—and because you only ever pick from this entry’s options, you never leave what the model allows.
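
The blend can be sketched in Python as follows. This sketch rests on three assumptions. A dictionary of weights stands in for the booklet’s pre-printed thresholds. When several options are in memory, one is chosen at random, since the steps above do not say which to take. `blended_pick` and its parameters are hypothetical names.

```python
import random


def blended_pick(entry, memory, rng, hit_faces=3, die_sides=10):
    """Roll a memory die first; on a hit, prefer an option that is in memory."""
    options = list(entry)
    if rng.randint(1, die_sides) <= hit_faces:  # e.g. 1-3 on a d10 is a hit
        matches = [w for w in options if w in memory]
        if matches:
            # Assumption: pick at random among several matching options.
            return rng.choice(matches)
    # Miss, or nothing in memory: roll the booklet entry exactly as normal.
    return rng.choices(options, weights=[entry[w] for w in options])[0]


# Stand-in for one booklet entry; the weights are made up for illustration.
entry = {"cat": 3, "dog": 1, "sun": 1}
rng = random.Random(0)
picks = [blended_pick(entry, {"dog"}, rng) for _ in range(10_000)]
print(picks.count("dog") / len(picks))
```

With these numbers, dog’s share should land near 0.3 + 0.7 × 1/5 = 0.44, up from 0.2 without the memory die. The entry itself is never modified.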

Instructor notes

Discussion questions

with the memory on, does the text hold a topic or a character for longer?
what happens as you turn the bonus up? Where does “on topic” tip over into “stuck repeating”?
why does the memory only boost words the model already offers? What goes wrong if you let it pick any recent word?
the model never changed—so where did the extra “knowledge” come from?
how is this different from the trigram, which also uses more than one word?

Connection to current LLMs

This is a hands-on model of two of the most important things modern LLMs do.

Attention over the whole context. A bigram looks back one word; the memory looks back over the recent words you’ve generated and lets them influence the next choice. That’s the heart of the attention mechanism: when predicting the next word, the model reaches back across the whole context and weights what it finds. Your memory is a crude, fixed version (boost recent words); attention is learned and content-sensitive.

In-context learning. Notice that the model never retrained—the behaviour changed purely because of what was already in the text. This is why you can give an LLM a few examples or a topic in your prompt and it picks up the pattern with no change to its weights. The “learning” happens in the context, not the model.

Repetition is a real failure mode. The “stuck on rails” trap you avoided is not hypothetical: real models can loop too, which is why generation settings include repetition penalties—a deliberate down-weighting of recent words, the exact opposite tuning of the boost you just added. Topic-stickiness and repetition are two ends of the same dial.
