Trigram
Extend the bigram model (which predicts the next word from one previous word) to consider two words of context instead of one, leading to better generation. The context window is how many previous tokens the model considers when making predictions: bigrams have a context window of 1, trigrams have 2, and modern LLMs can consider hundreds of thousands or even millions of tokens.


You will need
- the same materials as Training
- extra paper for a four-column table (word1, word2, word3, count)
- pen, paper, and dice as per Generation
For each pair (or group) of students:
- printed trigram token cutouts (each one shows two previous words as coloured boxes followed by a single next word, all colour-coded by word)
- a clear table or flat surface to spread the cutouts out on
- scissors
Your goal
Train a trigram model, which uses two previous words for prediction (a table, not a grid), and use it to generate text. Stretch goal: train on more data or generate longer outputs.
Build a trigram model, which uses two previous words for prediction, as a spread of cutouts where each cutout shows two previous words instead of one. Stretch goal: train on more data or generate longer outputs.
Key idea
Trigrams show how more context boosts prediction quality. They also reveal the cost: more rows to track and more data needed.
Trigrams show how more context boosts prediction quality. Instead of asking “what follows this word?”, we ask “what follows these two words?”. The spread gets bigger---more unique pairs of previous words mean more cutouts---but the predictions are sharper.
Algorithm (training)
- Draw a four-column table: `word1 | word2 | word3 | count`.
- Slide a window over your text, collecting every overlapping triple of words.
- For each triple, increment its count (or add a new row starting at 1).
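The training steps above can be sketched in Python. This is a minimal illustration rather than part of the lesson materials: the function names `tokenize` and `train_trigram` are hypothetical, and the tokenizer assumes lowercase words with each full stop treated as its own token.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text and treat each full stop as a separate token (an assumption)."""
    return text.lower().replace(".", " . ").split()

def train_trigram(tokens):
    """Slide a three-word window over the tokens and count each triple."""
    counts = Counter()
    for i in range(len(tokens) - 2):
        triple = tuple(tokens[i:i + 3])  # (word1, word2, word3)
        counts[triple] += 1              # new rows start at 1, existing rows increment
    return counts

tokens = tokenize("See Spot run. See Spot jump.")
model = train_trigram(tokens)

# Print the model as rows of: word1 | word2 | word3 | count
for (word1, word2, word3), count in model.items():
    print(f"{word1} | {word2} | {word3} | {count}")
```

A `Counter` keyed by triples plays the role of the paper table: each key is a row and each value is that row's count.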
Example (training)
After the first four words (see spot run .) the model is:
| word 1 | word 2 | word 3 | count |
|---|---|---|---|
| see | spot | run | 1 |
| spot | run | . | 1 |
After the full text (see spot run . see spot jump .) the model is:
| word 1 | word 2 | word 3 | count |
|---|---|---|---|
| see | spot | run | 1 |
| spot | run | . | 1 |
| run | . | see | 1 |
| . | see | spot | 1 |
| see | spot | jump | 1 |
| spot | jump | . | 1 |
Note: the order of the rows doesn’t matter, so you can re-order to group them by word 1 if that helps.
- Cut out the trigram tokens from your printed sheets
  - each cutout shows two previous words as coloured boxes, followed by the next word; every word has its own colour, so the boxed and free-standing forms of the same word match
- Spread the cutouts out on a table
  - face up, no overlap if you can manage it
  - that’s it---the spread is your trained trigram model
Optional extension: see “Group into piles” below.
Example (training)
Original text: “See Spot run. See Spot jump.”
The training text contains these adjacent (two previous words → next word) triples:
- see spot → run
- spot run → .
- run . → see
- . see → spot
- see spot → jump
- spot jump → .
Each triple becomes one cutout on your table:
| Previous words | Next words |
|---|---|
| see spot | run, jump |
| spot run | . |
| run . | see |
| . see | spot |
| spot jump | . |
Notice that see spot shows up twice in the spread, once with run as the
next word and once with jump---two different words followed that pair in
the original text. Compare this to the bigram cutouts model where the spread
would just contain two see→spot cutouts---the trigram spread captures more
specific patterns about which word follows which pair of words.
Algorithm (generation)
1. Pick any row and write down `word1` and `word2` as your starting words.
2. Find all rows where `word1` and `word2` match your current context; note their counts.
3. Roll weighted by those counts to pick a row; take its `word3` as the next word.
4. Shift the window by one word (new context is old `word2` + chosen `word3`) and repeat from step 2.
This mirrors Generation but with two-word context instead of one.
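The table-based generation loop can be sketched as follows. This is an illustration under stated assumptions: `generate` is a hypothetical name, `random.choices` stands in for the weighted dice roll, and the loop simply stops if no row matches the current context (the steps above do not say what to do in that case).

```python
import random
from collections import Counter

# The trained table from the example: (word1, word2, word3) -> count
model = Counter({
    ("see", "spot", "run"): 1,
    ("spot", "run", "."): 1,
    ("run", ".", "see"): 1,
    (".", "see", "spot"): 1,
    ("see", "spot", "jump"): 1,
    ("spot", "jump", "."): 1,
})

def generate(model, start, length, rng=random):
    """Generate up to `length` words, starting from a two-word context."""
    words = list(start)
    while len(words) < length:
        context = (words[-2], words[-1])
        # Step 2: find all rows whose word1 and word2 match the current context
        rows = [(w3, count) for (w1, w2, w3), count in model.items()
                if (w1, w2) == context]
        if not rows:
            break  # assumption: stop when no row matches the context
        # Step 3: weighted pick, standing in for the dice roll
        candidates, weights = zip(*rows)
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
        # Step 4: the window shifts automatically because we read words[-2:]
    return words

print(" ".join(generate(model, ("see", "spot"), 8, random.Random(0))))
```

Passing a seeded `random.Random` makes a run repeatable, which is handy when comparing outputs in class.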
1. Pick two starting words---choose any two consecutive words that appear as previous words on at least one cutout, and write both down
2. Find candidates---scan the spread for cutouts whose previous words match your last two words (match the rightmost coloured box to your most recent word, then check the box to its left matches the word before that). Verify the words themselves before committing---two unrelated words can occasionally share a colour
3. Pick one cutout visually---your eye will tend to land on cutouts whose next words are more common
4. Write down the next word from the cutout you picked
5. Put the cutout back in the spread
6. Your last two words are now the second of your previous two words plus the word you just wrote---go back to step 2
7. Keep going as long as you like---if no cutouts match your last two words, just pick a new starting pair and carry on. Stop when you’ve generated enough text
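The cutouts procedure can also be sketched in code. This is a rough model under assumptions: picking a cutout "visually" is approximated by a uniform `random.choice` over the matching cutouts, and `generate_from_spread` is a hypothetical name. When nothing matches, it picks a new starting pair and carries on, as in the last step above.

```python
import random

# One tuple per cutout on the table: (previous word 1, previous word 2, next word)
cutouts = [
    ("see", "spot", "run"),
    ("spot", "run", "."),
    ("run", ".", "see"),
    (".", "see", "spot"),
    ("see", "spot", "jump"),
    ("spot", "jump", "."),
]

def generate_from_spread(cutouts, start, length, rng):
    """Generate exactly `length` words by repeatedly picking a matching cutout."""
    words = list(start)
    while len(words) < length:
        last_two = (words[-2], words[-1])
        matches = [c for c in cutouts if (c[0], c[1]) == last_two]
        if not matches:
            # Dead end: pick a new starting pair from any cutout and carry on
            w1, w2, _ = rng.choice(cutouts)
            words.extend([w1, w2])
            continue
        # Uniform pick over matching cutouts (approximates picking visually);
        # the cutout stays in the list, like putting it back in the spread
        words.append(rng.choice(matches)[2])
    return words[:length]

print(" ".join(generate_from_spread(cutouts, ("see", "spot"), 12, random.Random(1))))
```

Because duplicate cutouts would appear in the list more than once, a uniform pick over cutouts is automatically weighted by how often each triple occurred.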
Example (generation)
Using the cutouts spread from above:
1. Choose `see spot` as your starting words---write down “see spot”
2. Scan for cutouts whose previous words are `see spot`---there’s one with `run` and one with `jump`---pick visually
3. Let’s say we land on `run`---write it down
4. The last two words you’ve written are `spot run`---scan for matches---only `.` is on the table for that pair---write it down
5. Last two words: `run .`---only `see` matches---write it down
6. Last two words: `. see`---only `spot` matches---write it down
7. Last two words: `see spot`---this time let’s say we pick `jump`---write it down
8. Last two words: `spot jump`---only `.` matches---write it down
9. Continue or stop here
Generated text: “see spot run. see spot jump.”
Optional extension: group into piles
Once your students have got the hang of the loose-on-table flow, you can introduce grouping as an optimisation:
- Sort the cutouts into piles, one pile per unique pair of previous words
- Label each pile with that pair
Now generation is faster---instead of scanning the whole table for matching previous words, you go straight to the pile whose label matches your last two words. With trigrams this speedup is bigger than it is for bigrams, because there are usually many more unique pairs of previous words than unique single previous words. This is the same trick a computer uses when it stores a language model in a hash table.
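The piles idea maps directly onto a dictionary (Python's built-in hash table). The sketch below is illustrative only; the variable names are hypothetical, and the cutouts are the six from the example text.

```python
from collections import defaultdict

cutouts = [
    ("see", "spot", "run"),
    ("spot", "run", "."),
    ("run", ".", "see"),
    (".", "see", "spot"),
    ("see", "spot", "jump"),
    ("spot", "jump", "."),
]

# One pile per unique pair of previous words; the pair is the pile's label (the key)
piles = defaultdict(list)
for word1, word2, next_word in cutouts:
    piles[(word1, word2)].append(next_word)

# Lookup goes straight to the right pile instead of scanning every cutout
print(piles[("see", "spot")])  # the next words seen after "see spot"
print(len(piles), "piles for", len(cutouts), "cutouts")
```

Scanning the loose spread costs time proportional to the number of cutouts, while a dictionary lookup takes roughly constant time on average regardless of how many piles there are.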
Instructor notes
Discussion questions
- how does the trigram output compare to basic (bigram) model output?
- how many rows would you need for a 100-word text?
- can you find word pairs that always lead to the same next word?
- what’s the tradeoff between context length and data requirements?
- how does the trigram output compare to basic (bigram) cutouts model output?
- why does the trigram spread tend to have more unique pairs of previous words than a bigram spread has unique previous words for the same text?
- can you find two-word combinations that always lead to the same next word?
Connection to current LLMs
The trigram model bridges the gap between simple word-pair models and modern transformers:
- context windows: current models use variable context up to 2 million tokens
- sparse data problem: the number of possible contexts grows exponentially with context length, so you need far more training data to see each context often enough
Your trigram model shows why longer context helps---“see spot” predicts either
run or jump, while just “spot” in a bigram model could be followed by many
different words. This is why modern LLMs can maintain coherent conversations
over many exchanges---they consider much more context than just the last word or
two.
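To make the data-requirement side of the tradeoff concrete, here is a small sketch that counts how many distinct contexts are possible for a given vocabulary size. It is an upper bound under a simplifying assumption (any word can follow any other); real text uses far fewer combinations. The function name `possible_contexts` is hypothetical.

```python
def possible_contexts(vocab_size, context_length):
    """Upper bound on distinct contexts: every word combination is allowed."""
    return vocab_size ** context_length

vocab = {"see", "spot", "run", "jump", "."}  # the five tokens in the example text

for n in (1, 2, 3, 4):
    total = possible_contexts(len(vocab), n)
    print(f"context of {n} word(s): up to {total} distinct contexts")
```

Even with a five-word vocabulary the count multiplies by five for each extra word of context, which is why longer contexts need much more training text before every context has been seen.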
Comparison to grid method
The cutouts spread and the grid method produce equivalent models:
- a count in the grid’s table corresponds to a cutout in the spread with the matching pair of previous words and next word
- both capture the same “what follows these two words” relationships
- cutouts make the weighting more tangible---you can see and feel that some outcomes are more likely because there are literally more cutouts to land on
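The equivalence can be checked with a short sketch: a count table expands into a pile of cutouts, and counting the cutouts gives back the same table. The function names here are hypothetical, and the table is the one from the example text.

```python
from collections import Counter

table = Counter({
    ("see", "spot", "run"): 1,
    ("spot", "run", "."): 1,
    ("run", ".", "see"): 1,
    (".", "see", "spot"): 1,
    ("see", "spot", "jump"): 1,
    ("spot", "jump", "."): 1,
})

def table_to_cutouts(table):
    """Each row with count k becomes k identical cutouts."""
    return list(table.elements())

def cutouts_to_table(cutouts):
    """Counting identical cutouts recovers the rows and their counts."""
    return Counter(cutouts)

cutouts = table_to_cutouts(table)
print(len(cutouts), "cutouts from", len(table), "rows")
print(cutouts_to_table(cutouts) == table)  # round trip preserves the model
```

A row with a count of 2 would simply appear as two identical cutouts, which is what makes the weighting visible on the table.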