Training
Build a bigram model (a model that predicts the next word based on one previous word) that tracks which words follow which other words in text.


You will need
- some text (e.g. a few pages from a kids’ book, but it can be anything)
- pen, pencil, and grid paper
For each pair (or group) of students:
- printed token cutouts (use the CLI to generate these from your text)
- a clear table or flat surface to spread the cutouts out on
- scissors
Your goal
Produce a grid that captures the patterns in your input text data. This grid is your bigram language model. Stretch goal: keep training your model on more input text.
Spread your printed token cutouts out on a table. Each cutout shows a next word together with its previous word (the word that came before it). The whole spread of cutouts is your language model---every (previous, next) pair from your training text is sitting somewhere on the table.
Key idea
Language models learn by counting patterns in text. Training means building a model (filling out the grid) to track which words follow other words.
The CLI has already done the bigram counting for you. Each cutout shows a next
word together with the word it follows in the training text. The trick: if
spot follows see 40% of the time in the training text, then 40% of the
cutouts with see as the previous word will have spot as the next word.
Whichever matching cutout your eye lands on, you’ve sampled in proportion to
the original distribution---no dice, no probability tables, the spread is
doing the maths.
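To see why the spread does the maths, here is a minimal Python sketch. The five cutouts are made up for illustration (they are not taken from the example text) and are chosen so that spot follows see in 2 cases out of 5. `pick_next` is a hypothetical helper name, not part of the CLI.

```python
import random

# Made-up cutouts as (previous, next) pairs: "spot" follows "see" in 2 of 5.
cutouts = [
    ("see", "spot"), ("see", "spot"),
    ("see", "run"), ("see", "run"),
    ("see", "jump"),
]

def pick_next(cutouts, previous, rng):
    """Pick one matching cutout uniformly at random and return its next word."""
    matching = [nxt for prev, nxt in cutouts if prev == previous]
    return rng.choice(matching)

# Share of the "see" cutouts whose next word is "spot".
matching = [nxt for prev, nxt in cutouts if prev == "see"]
share = matching.count("spot") / len(matching)

# Picking many times lands on "spot" at roughly the same rate.
rng = random.Random(0)
draws = [pick_next(cutouts, "see", rng) for _ in range(10_000)]
observed = draws.count("spot") / len(draws)
print(share, round(observed, 2))
```

Because every cutout is equally likely to be picked, a pair that appears more often in the text is picked more often, with no probability table needed.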
Algorithm
- Preprocess your text
  - convert everything to lowercase
  - treat words, commas, and full stops as separate “words” (ignore other punctuation and whitespace)
- Set up your grid
  - take the first word from your text
  - write it in both the first row header and first column header of your grid
- Fill in the grid one word pair at a time
  - find the row for the first word (in your training text) and the column for the second word
  - add a tally mark in that cell (if the word isn’t in the grid yet, add a new row and column for it)
  - shift along by one word (so the second word becomes your “first” word) and repeat until you’ve gone through the entire text
- Cut out the tokens from your printed sheets
  - each cutout shows a next word preceded by its previous word; every word has its own colour, and previous words appear inside a matching coloured box
- Spread the cutouts out on a table
  - face up, no overlap if you can manage it
  - that’s it: the spread is your trained model
Optional extension: see “Group into piles” below.
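The grid-filling steps above can be sketched in Python as follows. This is an illustration only, not the CLI's implementation: `train_grid` is a hypothetical name, the text is assumed to be preprocessed already, and the grid is stored as a dictionary of rows rather than on paper.

```python
from collections import Counter

def train_grid(tokens):
    """Tally how often each word (column) follows each word (row)."""
    grid = {}  # row word -> Counter mapping column word -> tally
    if tokens:
        # The first word gets the first row and column header.
        grid.setdefault(tokens[0], Counter())
    # Slide along the text one word pair at a time.
    for first, second in zip(tokens, tokens[1:]):
        # A word gets a new row only the first time it is seen.
        for word in (first, second):
            grid.setdefault(word, Counter())
        grid[first][second] += 1  # one tally mark
    return grid

tokens = "see spot run . see spot jump .".split()
grid = train_grid(tokens)
print(list(grid))           # ['see', 'spot', 'run', '.', 'jump']
print(grid["see"]["spot"])  # 2
```

Rows appear in the order the words are first seen, just as they would on grid paper.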
Example
Before you try training a model yourself, work through this example to see the algorithm in action.
Original text: “See Spot run. See Spot jump. Run, Spot, run. Jump, Spot, jump.”
Preprocessed text: see spot run . see spot jump . run ,
spot , run . jump , spot , jump .
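The preprocessing step can be sketched in Python like this. The regular expression is an assumption about what counts as a word (runs of letters, digits, and straight apostrophes); the lesson's CLI may tokenise differently, and `preprocess` is a hypothetical name.

```python
import re

def preprocess(text):
    """Lowercase the text and split it into words, commas, and full stops."""
    # Anything the pattern does not match (other punctuation, whitespace)
    # is skipped.
    return re.findall(r"[a-z0-9']+|[.,]", text.lower())

tokens = preprocess(
    "See Spot run. See Spot jump. Run, Spot, run. Jump, Spot, jump."
)
print(" ".join(tokens))
# see spot run . see spot jump . run , spot , run . jump , spot , jump .
```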
After the first two words (see spot) the model looks like this, with each cell’s tally written as a count:

| | see | spot | run | . | jump | , |
|---|---|---|---|---|---|---|
| see | | 1 | | | | |
| spot | | | | | | |
| run | | | | | | |
| . | | | | | | |
| jump | | | | | | |
| , | | | | | | |
After the full text the model looks like this, with each cell’s tally written as a count:

| | see | spot | run | . | jump | , |
|---|---|---|---|---|---|---|
| see | | 2 | | | | |
| spot | | | 1 | | 1 | 2 |
| run | | | | 2 | | 1 |
| . | 1 | | 1 | | 1 | |
| jump | | | | 2 | | 1 |
| , | | 2 | 1 | | 1 | |
After cutting and spreading the cutouts for the same preprocessed text, the table contains every adjacent (previous → next) pair from the text:
- see→spot ×2
- spot→run ×1, spot→jump ×1, spot→, ×2
- run→. ×2, run→, ×1
- .→see ×1, .→run ×1, .→jump ×1
- ,→spot ×2, ,→run ×1, ,→jump ×1
- jump→. ×2, jump→, ×1
Each entry is a physical cutout on your table. Notice that spot→, shows up
twice---there are two cutouts on the table with spot as the previous word
and , as the next word. That repetition is what makes generation weighted.
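One way to picture the spread in Python is to treat each cutout as a (previous, next) tuple and keep the duplicates. This is a sketch of the idea only; the variable names are illustrative and it does not reproduce the CLI.

```python
from collections import Counter

tokens = ("see spot run . see spot jump . run , "
          "spot , run . jump , spot , jump .").split()

# One cutout per adjacent pair of words; duplicates are kept on purpose.
cutouts = list(zip(tokens, tokens[1:]))

counts = Counter(cutouts)
print(len(cutouts))           # 19
print(counts[("spot", ",")])  # 2
print(counts[("run", ",")])   # 1
```

Twenty tokens give nineteen cutouts, because every word except the first is the next word of exactly one pair.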
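Grouped by previous word, the cutouts are:

| Previous word | Next words |
|---|---|
| see | spot ×2 |
| spot | run ×1; jump ×1; , ×2 |
| run | . ×2; , ×1 |
| . | see ×1; run ×1; jump ×1 |
| jump | . ×2; , ×1 |
| , | spot ×2; run ×1; jump ×1 |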
Optional extension: group into piles
Once your students have got the hang of the loose-on-table flow, you can introduce grouping as an optimisation:
- Sort the cutouts into piles, one pile per unique previous word
- Label each pile with that word
Now generation is faster---instead of scanning the whole table, you go straight to the pile whose label matches your current word. This is the same trick a computer uses when it stores a language model in a hash table. The model’s information content is identical; you’ve just rearranged it for faster lookup.
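The pile idea maps directly onto a Python dictionary, which is a hash table under the hood. The sketch below assumes the cutouts are (previous, next) tuples built from the example text; the names `cutouts` and `piles` are illustrative.

```python
from collections import defaultdict

tokens = ("see spot run . see spot jump . run , "
          "spot , run . jump , spot , jump .").split()
cutouts = list(zip(tokens, tokens[1:]))

# One pile per unique previous word, labelled with that word.
piles = defaultdict(list)
for previous, nxt in cutouts:
    piles[previous].append(nxt)

# Lookup goes straight to the labelled pile instead of scanning every cutout.
print(piles["spot"])  # ['run', 'jump', ',', ',']
```

No cutouts are added or removed by grouping, so the piles hold the same information as the loose spread.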
Instructor notes
Icebreaker questions
If this is the group’s first hands-on activity, these prompts surface what students already think about language models. Depending on your learning context, they work as either “call out your answer” or “discuss with your neighbour and share-back” questions.
- why is a language model called a “language model”? What does it mean to “model language”?
- what’s the best/clearest explanation you’ve ever heard about how Large Language Models (e.g. ChatGPT, Claude) actually work? What’s the weirdest explanation you’ve ever heard?
- when was the first language model ever created? How similar/different was it to modern LLMs?
- activity: get everyone to stand up, then have them sit down if they’ve never used ChatGPT, Claude, or a similar LLM. Then ask those still standing to sit down if they haven’t used one in the last month, then the last week, day, hour, and 5 minutes. By the end, everyone should be sitting down.
Don’t spend too long here---the fun really starts when students get into the activity itself.
Discussion questions
- what can you tell about the input text by looking at the filled-out bigram model grid?
- how does including punctuation as “words” help with sentence structure?
- are there any other ways you could have written down this exact same model?
- how could you use this model to generate new text in the style of your input/training data?
- what can you tell about the input text by looking at which (previous, next) pairs show up most often in the spread?
- why does see→spot appear twice while run→, appears only once?
- how does including punctuation as separate tokens help capture sentence structure?
- what would happen if you trained on more text---how would the spread change?
- how could you use these cutouts to generate new text in the style of your training data?
Troubleshooting
- “Do I add a new row/column for every word?” No---each new word only gets a new row and column the first time you see it. After that, just find the existing row and column and add a tally mark.
- “Some cutouts are duplicates---is that a mistake?” No, repeated cutouts are exactly the point. If see→spot appears twice in the training text, you should see two see→spot cutouts on the table. The repetition is what makes common pairs easier to spot during generation.
Connection to current LLMs
This counting process is a simplified, hand-sized analogue of what happens during the “training” phase of language models:
- training data: your paragraph vs trillions of words from the internet
- learning/training process: hand counting vs automated counting by computers
- storage: your paper model vs billions of parameters in memory
The key insight: “training” a language model means counting patterns in text. Your hand-built model contains the same type of information that current LLMs store---at a vastly smaller scale.
Comparison to grid method
The cutouts spread and the grid method produce equivalent models:
- a tally mark in row X, column Y of the grid corresponds to one cutout on the table whose previous word is X and whose next word is Y
- both capture the same “what follows what” relationships
- cutouts make the weighting more tangible---you can see and feel that some outcomes are more likely because there are literally more cutouts to pick from
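The equivalence can be checked with a short Python sketch: build the grid tallies from the cutouts, then turn every tally mark back into a cutout and compare. The data structures here (a list of tuples and a dictionary of counters) are assumptions chosen for illustration.

```python
from collections import Counter

tokens = ("see spot run . see spot jump . run , "
          "spot , run . jump , spot , jump .").split()
cutouts = list(zip(tokens, tokens[1:]))

# Grid view: tallies[row][column] counts how often column follows row.
tallies = {}
for previous, nxt in cutouts:
    tallies.setdefault(previous, Counter())[nxt] += 1

# Cutout view rebuilt from the grid: one cutout per tally mark.
rebuilt = [
    (row, column)
    for row, cells in tallies.items()
    for column, count in cells.items()
    for _ in range(count)
]

print(sorted(rebuilt) == sorted(cutouts))  # True
```

Only the order of the cutouts is lost in the round trip, and order on the table carries no information in either model.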
Interactive widget
Step through the training process at your own pace. Enter your own text or use the example, then press Play or Step to watch the model being built.
Step through the training process at your own pace. Enter your own text or use the example, then press Play or Step to watch the cutouts being placed.