
Trigram

Choose your method: This lesson can be done with either a grid (paper and dice) or cutouts (physical tokens). Choose whichever suits your materials.

Extend the bigram model to consider two words of context instead of one, leading to better generation. A bigram model predicts the next word based on one previous word; this is what you build in the fundamental lessons, where each row of your grid represents what can follow a single word. The context window is how many previous tokens the model considers when making predictions: bigrams have a context window of 1, trigrams have 2, and modern LLMs can consider hundreds of thousands or even millions of tokens.

Hero image: Grid Trigram

Hero image: Cutouts Trigram

You will need

  • the same materials as Training
  • extra paper for a four-column table
  • pen, paper, and dice as per Generation

For each pair (or group) of students:

  • printed trigram token cutouts (each one shows two previous words as coloured boxes followed by a single next word, all colour-coded by word)
  • a clear table or flat surface to spread the cutouts out on
  • scissors

Your goal

Train a trigram model (a table, not a grid) and use it to generate text. A trigram model uses two previous words for prediction, capturing more context than a bigram; the grid becomes three-dimensional (or you track word pairs instead of single words). Stretch goal: train on more data or generate longer outputs.

Build a trigram model as a spread of cutouts where each cutout shows two previous words instead of one. Stretch goal: train on more data or generate longer outputs.

Key idea

Trigrams show how more context boosts prediction quality. They also reveal the cost: more rows to track and more data needed.

Trigrams show how more context boosts prediction quality. Instead of asking “what follows this word?”, we ask “what follows these two words?”. The spread gets bigger---more unique pairs of previous words mean more cutouts---but the predictions are sharper.

Algorithm (training)

  1. Draw a four-column table: word1 | word2 | word3 | count.
  2. Slide a window over your text, collecting every overlapping triple of words.
  3. For each triple, increment its count (or add a new row starting at 1).

Example (training)

After the first four words (see spot run .) the model is:

word 1 | word 2 | word 3 | count
see | spot | run | 1
spot | run | . | 1

After the full text (see spot run . see spot jump .) the model is:

word 1 | word 2 | word 3 | count
see | spot | run | 1
spot | run | . | 1
run | . | see | 1
. | see | spot | 1
see | spot | jump | 1
spot | jump | . | 1

Note: the order of the rows doesn’t matter, so you can re-order to group them by word 1 if that helps.
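The training steps above can also be sketched in code. This is a minimal illustration, not part of the lesson materials: the function name `train_trigrams` is made up for this example, and it assumes the text has already been split into words with "." treated as its own word, as in the example table.

```python
from collections import Counter


def train_trigrams(tokens):
    """Count every overlapping triple of tokens (a sliding window of width 3)."""
    counts = Counter()
    for i in range(len(tokens) - 2):
        triple = (tokens[i], tokens[i + 1], tokens[i + 2])
        counts[triple] += 1  # a new row starts at 1, an existing row increments
    return counts


# The lesson's example text, with "." as a separate token.
tokens = "see spot run . see spot jump .".split()
model = train_trigrams(tokens)

# Print one row per triple: word1 word2 word3 count
for (word1, word2, word3), count in model.items():
    print(word1, word2, word3, count)
```

Each printed line corresponds to one row of the hand-drawn table; as with the paper version, the order of the rows does not matter.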

  1. Cut out the trigram tokens from your printed sheets
    • each cutout shows two previous words as coloured boxes, followed by the next word; every word has its own colour, so the boxed and free-standing forms of the same word match
  2. Spread the cutouts out on a table
    • face up, no overlap if you can manage it
    • that’s it---the spread is your trained trigram model

Optional extension: see “Group into piles” below.

Example (training)

Original text: “See Spot run. See Spot jump.”

The training text contains these adjacent (two previous words → next word) triples:

  • see spot → run
  • spot run → .
  • run . → see
  • . see → spot
  • see spot → jump
  • spot jump → .

Each triple becomes one cutout on your table:

Previous words | Next words
see spot | run, jump
spot run | .
run . | see
. see | spot
spot jump | .

Notice that see spot shows up twice in the spread, once with run as the next word and once with jump---two different words followed that pair in the original text. Compare this to the bigram cutouts model where the spread would just contain two see→spot cutouts---the trigram spread captures more specific patterns about which word follows which pair of words.

Algorithm (generation)

  1. Pick any row and write down word1 and word2 as your starting words.
  2. Find all rows where word1 and word2 match your current context; note their counts.
  3. Roll weighted by those counts to pick a row; take its word3 as the next word.
  4. Shift the window by one word (new context is old word2 + chosen word3) and repeat from step 2.

This mirrors Generation but with two-word context instead of one.
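For comparison, here is a rough sketch of the same generation loop in code. It assumes the model is stored as a mapping from (word1, word2, word3) to a count, and it uses Python's `random.choices` as a stand-in for the weighted dice roll. The function name `generate` and its parameters are invented for this illustration.

```python
import random
from collections import Counter

# The trained table from the example: (word1, word2, word3) -> count.
model = Counter({
    ("see", "spot", "run"): 1,
    ("spot", "run", "."): 1,
    ("run", ".", "see"): 1,
    (".", "see", "spot"): 1,
    ("see", "spot", "jump"): 1,
    ("spot", "jump", "."): 1,
})


def generate(model, word1, word2, length, rng=random):
    """Generate up to `length` further words, starting from word1 word2."""
    output = [word1, word2]
    for _ in range(length):
        context = (output[-2], output[-1])
        # Step 2: collect every row whose first two words match the context.
        candidates = {w3: count for (w1, w2, w3), count in model.items()
                      if (w1, w2) == context}
        if not candidates:
            break  # no row matches this context, so stop early
        # Step 3: weighted pick, standing in for the dice roll.
        words = list(candidates)
        weights = [candidates[w] for w in words]
        next_word = rng.choices(words, weights=weights)[0]
        # Step 4: appending the word shifts the two-word window by one.
        output.append(next_word)
    return output


print(" ".join(generate(model, "see", "spot", 6)))
```

With this tiny training text the loop can reach a context with no matching row (for example "jump ."), which is why the sketch stops early rather than assuming a match always exists.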

  1. Pick two starting words---choose any two consecutive words that appear as previous words on at least one cutout, and write both down
  2. Find candidates---scan the spread for cutouts whose previous words match your last two words (match the rightmost coloured box to your most recent word, then check the box to its left matches the word before that). Verify the words themselves before committing---two unrelated words can occasionally share a colour
  3. Pick one cutout visually---your eye will tend to land on cutouts whose next words are more common
  4. Write down the next word from the cutout you picked
  5. Put the cutout back in the spread
  6. Your last two words are now the word that was previously your most recent one plus the word you just wrote---go back to step 2
  7. Keep going as long as you like---if no cutouts match your last two words, just pick a new starting pair and carry on. Stop when you’ve generated enough text

Example (generation)

Using the cutouts spread from above:

  1. Choose see spot as your starting words---write down “see spot”
  2. Scan for cutouts whose previous words are see spot---there’s one with run and one with jump---pick visually
  3. Let’s say we land on run---write it down
  4. The last two words you’ve written are spot run---scan for matches---only . is on the table for that pair---write it down
  5. Last two words: run .---only see matches---write it down
  6. Last two words: . see---only spot matches---write it down
  7. Last two words: see spot---this time let’s say we pick jump---write it down
  8. Last two words: spot jump---only . matches---write it down
  9. Continue or stop here

Generated text: “see spot run. see spot jump.”

Optional extension: group into piles

Once your students have got the hang of the loose-on-table flow, you can introduce grouping as an optimisation:

  1. Sort the cutouts into piles, one pile per unique pair of previous words
  2. Label each pile with that pair

Now generation is faster---instead of scanning the whole table for matching previous words, you go straight to the pile whose label matches your last two words. With trigrams this speedup is bigger than it is for bigrams, because there are usually many more unique pairs of previous words than unique single previous words. This is the same trick a computer uses when it stores a language model in a hash table.
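As a small illustration of the hash table comparison, the piles can be modelled as a Python dictionary keyed by the pair of previous words. This is only a sketch: the variable names `cutouts` and `piles` are chosen for this example, and each cutout is represented as a plain tuple.

```python
from collections import defaultdict

# One tuple per cutout: (previous word 1, previous word 2, next word).
cutouts = [
    ("see", "spot", "run"),
    ("spot", "run", "."),
    ("run", ".", "see"),
    (".", "see", "spot"),
    ("see", "spot", "jump"),
    ("spot", "jump", "."),
]

# Sort the cutouts into piles, one pile per unique pair of previous words.
piles = defaultdict(list)
for word1, word2, next_word in cutouts:
    piles[(word1, word2)].append(next_word)

# Going straight to a labelled pile replaces scanning the whole table.
print(piles[("see", "spot")])  # ['run', 'jump']
```

Looking up a pile by its label takes roughly the same time however many cutouts are on the table, whereas scanning the loose spread gets slower as the spread grows.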

Instructor notes

Discussion questions

  • how does the trigram output compare to basic (bigram) model output?
  • how many rows would you need for a 100-word text?
  • can you find word pairs that always lead to the same next word?
  • what’s the tradeoff between context length and data requirements?
  • how does the trigram output compare to basic (bigram) cutouts model output?
  • why does the trigram spread tend to have more unique pairs of previous words than a bigram spread has unique previous words for the same text?
  • can you find two-word combinations that always lead to the same next word?

Connection to current LLMs

The trigram model bridges the gap between simple word-pair models and modern transformers:

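  • context windows: some current models accept context windows of up to 2 million tokens
  • sparse data problem: the number of possible contexts grows exponentially with context length, so longer contexts need far more training data before each one has been seen often enough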
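To make the sparse data point concrete, here is a back-of-the-envelope calculation. It is an illustrative upper bound, not a measurement: it assumes every combination of words is possible, and it uses the five distinct tokens from the example text (see, spot, run, jump, .) as the vocabulary. The helper name `possible_contexts` is invented for this sketch.

```python
def possible_contexts(vocab_size, context_length):
    """Upper bound on distinct contexts: one choice per position."""
    return vocab_size ** context_length


# Vocabulary of the example text: see, spot, run, jump, "."
vocab_size = 5
for context_length in (1, 2, 3):
    print(context_length, possible_contexts(vocab_size, context_length))
# prints:
# 1 5
# 2 25
# 3 125
```

Real text only ever uses a fraction of these combinations, but the bound shows why each extra word of context multiplies the number of rows (or piles) a model may need to fill.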

Your trigram model shows why longer context helps---“see spot” predicts either run or jump, while just “spot” in a bigram model could be followed by many different words. This is why modern LLMs can maintain coherent conversations over many exchanges---they consider much more context than just the last word or two.

Comparison to grid method

The cutouts spread and the grid method produce equivalent models:

  • a count of n in the grid's table corresponds to n cutouts in the spread with the matching pair of previous words and next word
  • both capture the same “what follows these two words” relationships
  • cutouts make the weighting more tangible---you can see and feel that some outcomes are more likely because there are literally more cutouts to land on