Trigram

Choose your method: This lesson can be done with either a grid (paper and dice) or buckets (physical tokens). Pick whichever suits your group best; both teach the same concepts.

Extend the bigram model to consider two words of context instead of one, leading to better generation.

Hero image: Grid Trigram

You will need

  • the same materials as Training
  • extra paper for a three-column table
  • pen, paper, and dice as per Generation

Your goal

Train a trigram language model (a table, not a grid) and use it to generate text. Stretch goal: train on more data or generate longer outputs.

Key idea

Trigrams show how more context boosts prediction quality. They also reveal the cost: more rows to track and more data needed.

Algorithm (training)

  1. Draw a four-column table: word1 | word2 | word3 | count.
  2. Slide a three-word window over your text, collecting every overlapping triple of words.
  3. For each triple, increment its count (or add a new row starting at 1).
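
If your group wants to check its hand-built table, the training loop fits in a few lines. Here is a minimal sketch in Python, assuming simple whitespace tokenization (the `train_trigram` name is illustrative, not part of the lesson materials):

```python
from collections import Counter

def train_trigram(text):
    """Count every overlapping word triple in a whitespace-tokenized text."""
    words = text.split()
    counts = Counter()
    # Slide a three-word window over the text (steps 2 and 3 above).
    for i in range(len(words) - 2):
        counts[(words[i], words[i + 1], words[i + 2])] += 1
    return counts

model = train_trigram("see spot run . see spot jump .")
for (w1, w2, w3), n in model.items():
    print(w1, w2, w3, n)  # reproduces the six rows of the full table below
```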

Example (training)

After the first four words (“see spot run .”) the model is:

word 1 | word 2 | word 3 | count
see    | spot   | run    | 1
spot   | run    | .      | 1

After the full text (“see spot run . see spot jump .”) the model is:

word 1 | word 2 | word 3 | count
see    | spot   | run    | 1
spot   | run    | .      | 1
run    | .      | see    | 1
.      | see    | spot   | 1
see    | spot   | jump   | 1
spot   | jump   | .      | 1

Note: the order of the rows doesn’t matter, so you can re-order to group them by word 1 if that helps.

Algorithm (generation)

  1. Pick any row and write down word1 and word2 as your starting words.
  2. Find all rows where word1 and word2 match your current context; note their counts.
  3. Roll weighted by those counts to pick a row; take its word3 as the next word.
  4. Shift the window by one word (new context is old word2 + chosen word3) and repeat from step 2.

This mirrors Generation but with two-word context instead of one.
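
For reference, here is a matching sketch of the generation loop under the same assumptions; it reuses the `model` counter from the training sketch, and `generate` is again an illustrative name:

```python
import random

def generate(model, n_words=10):
    """Generate text from a trigram model (a Counter of word triples)."""
    # Step 1: pick any row; its first two words become the starting context.
    w1, w2, _ = random.choice(list(model))
    output = [w1, w2]
    for _ in range(n_words - 2):
        # Step 2: find every row whose first two words match the context.
        rows = [(t, c) for t, c in model.items() if t[0] == w1 and t[1] == w2]
        if not rows:
            break  # unseen word pair: no row to roll on, so stop early
        # Step 3: roll weighted by the counts; take word 3 as the next word.
        triples, counts = zip(*rows)
        next_word = random.choices(triples, weights=counts)[0][2]
        output.append(next_word)
        # Step 4: shift the window by one word and repeat.
        w1, w2 = w2, next_word
    return " ".join(output)

print(generate(model))  # e.g. "see spot run . see spot jump ."
```

The early break on an unseen word pair is the same dead end you can hit on paper: with no matching rows, there is nothing to roll on.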

Instructor notes

Discussion questions

  • how does the trigram output compare to the basic (bigram) model’s output?
  • what happens when you encounter a word pair you’ve never seen before?
  • how many rows would you need for a 100-word text?
  • can you find word pairs that always lead to the same next word?
  • what’s the tradeoff between context length and data requirements?

Connection to current LLMs

The trigram model bridges the gap between simple word-pair models and modern transformers:

  • context windows: current models use variable context up to 2 million tokens
  • sparse data problem: with more context, you need exponentially more training data
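
As a rough sense of scale: a vocabulary of 1,000 distinct words allows 1,000² = 1,000,000 possible pairs but 1,000³ = 1,000,000,000 possible triples, so each extra word of context multiplies the number of rows you might need by the vocabulary size.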

Your trigram model shows why longer context helps: “see spot” predicts either “run” or “jump”, while just “spot” in a bigram model could be followed by many different words. This is also why modern LLMs can maintain coherent conversations over many exchanges: they consider much more context than just the last word or two.