Trigram

Choose your method: This lesson can be done with either a grid (paper and dice) or buckets (physical tokens). Pick whichever suits your group best; both teach the same concepts.

Extend the bigram model to consider two words of context instead of one, leading to better generation.

Hero image: Grid Trigram

You will need

the same materials as Training
extra paper for a three-column table
pen, paper, and dice as per Generation

Your goal

Train a trigram language model (a table, not a grid) and use it to generate text. Stretch goal: train on more data or generate longer outputs.

Key idea

Trigrams show how more context boosts prediction quality. They also reveal the cost: more rows to track and more data needed.

Algorithm (training)

Draw a four-column table: word1 | word2 | word3 | count.
Slide a window over your text, collecting every overlapping triple of words.
For each triple, increment its count (or add a new row starting at 1).

Example (training)

After the first four words (see spot run .) the model is:

word 1	word 2	word 3	count
`see`	`spot`	`run`	1
`spot`	`run`	`.`	1

After the full text (see spot run . see spot jump .) the model is:

word 1	word 2	word 3	count
`see`	`spot`	`run`	1
`spot`	`run`	`.`	1
`run`	`.`	`see`	1
`.`	`see`	`spot`	1
`see`	`spot`	`jump`	1
`spot`	`jump`	`.`	1

Note: the order of the rows doesn’t matter, so you can re-order to group them by word 1 if that helps.

Algorithm (generation)

Pick any row and write down word1 and word2 as your starting words.
Find all rows where word1 and word2 match your current context; note their counts.
Roll weighted by those counts to pick a row; take its word3 as the next word.
Shift the window by one word (new context is old word2 + chosen word3) and repeat from step 2.

This mirrors Generation but with two-word context instead of one.

Instructor notes

Discussion questions

how does the trigram output compare to basic (bigram) model output?
what happens when you encounter a word pair you’ve never seen before?
how many rows would you need for a 100-word text?
can you find word pairs that always lead to the same next word?
what’s the tradeoff between context length and data requirements?

Connection to current LLMs

The trigram model bridges the gap between simple word-pair models and modern transformers:

context windows: current models use variable context up to 2 million tokens
sparse data problem: with more context, you need exponentially more training data

Your trigram model shows why longer context helps—“see spot” predicts either run or jump, while just “spot” in a bigram model could be followed by many different words. This is why modern LLMs can maintain coherent conversations over many exchanges—they consider much more context than just the last word or two.

Trigram ​

You will need ​

Your goal ​

Key idea ​

Algorithm (training) ​

Example (training) ​

Algorithm (generation) ​

Instructor notes ​

Discussion questions ​

Connection to current LLMs ​