#Trigram
Extend the bigram model to consider two words of context instead of one, leading to better generation. A bigram predicts the next word from just one previous word, so its context window is 1; a trigram's context window is 2, and modern LLMs can consider hundreds of thousands or even millions of tokens.


#You will need
- the same materials as Training
- additional small containers for two-word label buckets
- sticky notes or paper for bucket labels (you’ll need to write two words on each label)
#Your goal
Build a trigram language model using buckets, where each bucket is labelled with two words instead of one, then use it to generate text. Stretch goal: train on more data or generate longer outputs.
#Key idea
Trigrams show how more context boosts prediction quality. Instead of asking “what follows this word?”, we ask “what follows these two words?”. This means more buckets to manage and more training data needed, but better predictions.
#Algorithm (training)
- Prepare your tokens as per Training:
  - print or write out your training text
  - convert everything to lowercase
  - treat words, commas, and full stops as separate tokens
  - cut the text into individual tokens with scissors, keeping them in order
- Build the model using word pairs as bucket labels:
  - take the first two tokens from your pile; these form your bucket label
  - if a bucket with this two-word label doesn’t exist, create one
  - take the third token and put it in this bucket
  - shift along by one word (so your new pair is the old second word + the third word you just placed)
  - repeat until all tokens are in buckets
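If you want to double-check your buckets, the training steps above can be sketched in a few lines of Python (a minimal sketch; the function name `train_trigram` is our own):

```python
from collections import defaultdict

def train_trigram(tokens):
    """Build trigram 'buckets': each two-word label maps to the list
    of tokens that followed that word pair in the training text."""
    buckets = defaultdict(list)
    # Slide along one word at a time, taking overlapping triples.
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        buckets[(first, second)].append(third)
    return buckets

tokens = "see spot run . see spot jump .".split()
model = train_trigram(tokens)
print(model[("see", "spot")])  # ['run', 'jump']
```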
#Example (training)
Original text: “See Spot run. See Spot jump.”
After preparing tokens, you have these pieces of paper in order: see spot run . see spot jump .
Step by step:
- First two tokens are “see” and “spot”: create a bucket labelled “see spot”
- Third token is “run”: put it in the “see spot” bucket
- Shift along: the new pair is “spot run”, so create a bucket labelled “spot run”
- Next token is “.”: put it in the “spot run” bucket
- Shift along: the new pair is “run .”, so create a bucket labelled “run .”
- Next token is “see”: put it in the “run .” bucket
- Shift along: the new pair is “. see”, so create a bucket labelled “. see”
- Next token is “spot”: put it in the “. see” bucket
- Shift along: the new pair is “see spot”, and this bucket already exists
- Next token is “jump”: put it in the “see spot” bucket
- Shift along: the new pair is “spot jump”, so create a bucket labelled “spot jump”
- Next token is “.”: put it in the “spot jump” bucket
- No more tokens: training complete!
Final model (bucket contents):
| Bucket label | Tokens inside |
|---|---|
| see spot | run jump |
| spot run | . |
| run . | see |
| . see | spot |
| spot jump | . |
Notice that the “see spot” bucket holds two tokens because two different words followed that pair in the original text. Compare this to the bigram bucket model, where the “see” bucket would just contain two copies of “spot”: the trigram model captures more specific patterns.
#Algorithm (generation)
1. Choose a starting bucket and write down its two-word label; these are the first two words of your generated text.
2. Close your eyes and pick a random token from inside that bucket.
3. Write down the token you picked.
4. Put the token back in the bucket.
5. Find the bucket whose label matches your last two words (the second word of your old label + the token you just picked).
6. If no bucket exists for that two-word pair, the model never saw this combination during training; pick a different bucket whose label starts with the first word of your pair and continue from there.
7. Repeat from step 2 until you reach a stopping point.
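The generation procedure can be sketched in Python too (a minimal, self-contained sketch with our own names; `random.choice` plays the “close your eyes and pick” role, and since it doesn’t remove anything from the list, the token is automatically “put back”):

```python
import random

def generate(buckets, start, steps=6):
    """Generate text from trigram buckets: a dict mapping a
    (word1, word2) label to the list of tokens in that bucket."""
    word1, word2 = start
    output = [word1, word2]
    for _ in range(steps):
        bucket = buckets.get((word1, word2))
        if bucket is None:
            # The model never saw this pair: fall back to any bucket
            # whose label starts with the same first word.
            candidates = [label for label in buckets if label[0] == word1]
            if not candidates:
                break
            word1, word2 = random.choice(candidates)
            continue
        token = random.choice(bucket)  # pick a random token ("put back" is free)
        output.append(token)
        word1, word2 = word2, token    # shift the window by one word
    return " ".join(output)

# Buckets from the worked example above:
buckets = {
    ("see", "spot"): ["run", "jump"],
    ("spot", "run"): ["."],
    ("run", "."): ["see"],
    (".", "see"): ["spot"],
    ("spot", "jump"): ["."],
}
print(generate(buckets, ("see", "spot")))
```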
#Example (generation)
Using the bucket model from above:
- Choose “see spot” as the starting bucket; write down “see spot”
- Pick randomly from the “see spot” bucket; you get either “run” or “jump”
- Let’s say we pick “run”; write it down
- Put “run” back, then find bucket “spot run”
- Pick from “spot run”; only “.” is inside, so write it down
- Find bucket “run .”; pick “see”; write it down
- Find bucket “. see”; pick “spot”; write it down
- Find bucket “see spot”; this time pick “jump”; write it down
- Find bucket “spot jump”; pick “.”; write it down
- Continue or stop here
Generated text: “see spot run. see spot jump.”
#Instructor notes
#Discussion questions
- how does the trigram output compare to basic (bigram) bucket model output?
- why do we need more buckets for trigrams than bigrams?
- what happens when you encounter a word pair you’ve never seen before?
- how many buckets would you need for a 100-word text?
- can you find two-word pairs that always lead to the same next word?
- what’s the tradeoff between context length and data requirements?
#Connection to current LLMs
The trigram model bridges the gap between simple word-pair models and modern transformers:
- context windows: current models use context windows of up to 2 million tokens
- sparse data problem: with more context, you need exponentially more training data
Your trigram model shows why longer context helps—“see spot” predicts either
run or jump, while just “spot” in a bigram model could be followed by many
different words. This is why modern LLMs can maintain coherent conversations
over many exchanges—they consider much more context than just the last word or
two.
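You can see this contrast concretely with a small sketch. Note that the extra sentence “spot sat .” below is our own addition to the worked example’s text, included so that “spot” alone has more possible followers than “see spot”:

```python
from collections import defaultdict

# The worked example's text plus one extra sentence ("spot sat .",
# our own addition) to show the contrast between context lengths.
tokens = "see spot run . see spot jump . spot sat .".split()

bigram = defaultdict(list)   # one-word bucket labels
trigram = defaultdict(list)  # two-word bucket labels
for w1, w2 in zip(tokens, tokens[1:]):
    bigram[w1].append(w2)
for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
    trigram[(w1, w2)].append(w3)

print(bigram["spot"])            # ['run', 'jump', 'sat'] - less specific
print(trigram[("see", "spot")])  # ['run', 'jump'] - more specific
print(len(bigram), len(trigram)) # the trigram model needs more buckets
```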
#Comparison to grid method
This bucket method and the grid method produce equivalent models:
- a count in the grid’s table corresponds to the number of matching tokens inside a two-word bucket
- both capture the same “what follows these two words” relationships
- buckets make the weighting more tangible—you can see and feel that some outcomes are more likely because there are literally more tokens to pick from
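This equivalence can be checked directly in code (a sketch; Python’s `Counter` plays the role of the grid method’s count column):

```python
from collections import Counter, defaultdict

tokens = "see spot run . see spot jump .".split()
triples = list(zip(tokens, tokens[1:], tokens[2:]))

# Grid method: one count per (word1, word2, word3) row.
grid = Counter(triples)

# Bucket method: tokens collected under two-word labels.
buckets = defaultdict(list)
for w1, w2, w3 in triples:
    buckets[(w1, w2)].append(w3)

# Each grid count equals how many copies of word3 sit in the
# (word1, word2) bucket, so the two models are equivalent.
for (w1, w2, w3), count in grid.items():
    assert buckets[(w1, w2)].count(w3) == count
print("models agree")
```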