Trigram
Extend the bigram model to consider two words of context instead of one, leading to better generation.

You will need
- the same materials as Training
- extra paper for a three-column table
- pen, paper, and dice as per Generation
Your goal
Train a trigram language model (a table, not a grid) and use it to generate text. Stretch goal: train on more data or generate longer outputs.
Key idea
Trigrams show how more context boosts prediction quality. They also reveal the cost: more rows to track and more data needed.
Algorithm (training)
- Draw a four-column table: word 1 | word 2 | word 3 | count.
- Slide a window over your text, collecting every overlapping triple of words.
- For each triple, increment its count (or add a new row starting at 1).
Example (training)
After the first four words (see spot run .) the model is:
| word 1 | word 2 | word 3 | count |
|---|---|---|---|
| see | spot | run | 1 |
| spot | run | . | 1 |
After the full text (see spot run . see spot jump .) the model is:
| word 1 | word 2 | word 3 | count |
|---|---|---|---|
| see | spot | run | 1 |
| spot | run | . | 1 |
| run | . | see | 1 |
| . | see | spot | 1 |
| see | spot | jump | 1 |
| spot | jump | . | 1 |
Note: the order of the rows doesn’t matter, so you can re-order to group them by word 1 if that helps.
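If you want to check your paper table, or train on a much longer text, the training step can be sketched in a few lines of Python. This is a minimal sketch, assuming words are separated by spaces and punctuation like "." is written as its own word, exactly as in the example above.

```python
from collections import Counter

# Training text from the example; swap in any whitespace-separated text.
text = "see spot run . see spot jump ."
tokens = text.split()

# Slide a window over the text, counting every overlapping triple of words.
counts = Counter()
for i in range(len(tokens) - 2):
    counts[(tokens[i], tokens[i + 1], tokens[i + 2])] += 1

# Print the table: word 1 | word 2 | word 3 | count
for (w1, w2, w3), n in counts.items():
    print(w1, "|", w2, "|", w3, "|", n)
```

Running this on the full example text prints the same six rows (each with count 1) as the table above.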
Algorithm (generation)
- Pick any row and write down its word 1 and word 2 as your starting words.
- Find all rows where word 1 and word 2 match your current context; note their counts.
- Roll weighted by those counts to pick a row; take its word 3 as the next word.
- Shift the window by one word (the new context is the old word 2 followed by the chosen word 3) and repeat from step 2.
This mirrors Generation but with two-word context instead of one.
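The dice-roll procedure can also be sketched in Python, mirroring the paper version: pick a starting pair, find the matching rows, roll weighted by the counts, then shift the window. This is a minimal sketch that rebuilds the counts table so it runs on its own.

```python
import random
from collections import Counter

# Rebuild the trigram table (same as the training sketch above).
text = "see spot run . see spot jump ."
tokens = text.split()
counts = Counter()
for i in range(len(tokens) - 2):
    counts[(tokens[i], tokens[i + 1], tokens[i + 2])] += 1

# Pick any row and take its first two words as the starting context,
# like pointing at a row on your paper table.
w1, w2 = random.choice(list(counts))[:2]
output = [w1, w2]

for _ in range(10):  # generate up to 10 more words
    # Find all rows whose word 1 and word 2 match the current context.
    options = [(w3, n) for (a, b, w3), n in counts.items() if (a, b) == (w1, w2)]
    if not options:
        break  # unseen word pair: the paper version gets stuck here too
    # "Roll the dice" weighted by the counts to pick the next word.
    words, weights = zip(*options)
    next_word = random.choices(words, weights=weights)[0]
    output.append(next_word)
    # Shift the window: new context is the old word 2 plus the chosen word 3.
    w1, w2 = w2, next_word

print(" ".join(output))
```

The `break` on an empty match is the code version of one of the discussion questions below: a context pair the model has never seen leaves it with nothing to roll for.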
Instructor notes
Discussion questions
- how does the trigram output compare to the basic (bigram) model's output?
- what happens when you encounter a word pair you’ve never seen before?
- how many rows would you need for a 100-word text?
- can you find word pairs that always lead to the same next word?
- what’s the tradeoff between context length and data requirements?
Connection to current LLMs
The trigram model bridges the gap between simple word-pair models and modern transformers:
- context windows: current models handle variable-length contexts of up to 2 million tokens
- sparse data problem: with more context, you need exponentially more training data
Your trigram model shows why longer context helps—“see spot” predicts either run or jump, while just “spot” in a bigram model could be followed by many different words. This is why modern LLMs can maintain coherent conversations over many exchanges—they consider much more context than just the last word or two.
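One quick way to see this effect for yourself is to count how many distinct next words follow a one-word context versus a two-word context. The sketch below (an illustrative check, not part of the core activity) does this for "spot" and "see spot"; on the tiny eight-word example both happen to have the same two followers, so try it on a longer training text to see the single-word context fan out.

```python
from collections import defaultdict

text = "see spot run . see spot jump ."  # swap in a longer text to see the effect
tokens = text.split()

after_one = defaultdict(set)  # one-word context (bigram view)
after_two = defaultdict(set)  # two-word context (trigram view)
for i in range(len(tokens) - 1):
    after_one[tokens[i]].add(tokens[i + 1])
for i in range(len(tokens) - 2):
    after_two[(tokens[i], tokens[i + 1])].add(tokens[i + 2])

print("after 'spot':", sorted(after_one["spot"]))
print("after 'see spot':", sorted(after_two[("see", "spot")]))
```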