Glossary
This glossary connects the hands-on activities in LLMs Unplugged with the technical terms used in modern language models. Each entry provides a plain-language explanation and links to the relevant lessons.
Core concepts
Token
A single unit of text that the model works with. In our activities, each word and punctuation mark (such as . or ,) is a token. Modern LLMs use subword tokens that can be parts of words.
Synonyms: word (in introductory contexts)
Style note: the lessons use “word” initially to keep things accessible, then transition to “token” once the concept is established. Both terms refer to the same thing in our activities.
See: Training
Vocabulary
All the unique tokens your model knows. The words across the top and side of your grid (or the bucket labels) form your vocabulary.
See: Training
Language model
A system that predicts what text comes next based on patterns learned from training data. Your hand-built grid or bucket collection is a language model.
See: Training, Generation
Training
The process of building a model by counting patterns in text. When you tally word transitions or fill buckets with tokens, you’re training your model.
Synonyms: learning
See: Training
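To see training as counting in code, here is a minimal Python sketch of the tallying step. It assumes whitespace-separated words as tokens; the function name and toy sentence are illustrative, not taken from the lessons:

```python
from collections import defaultdict

def train_bigram_counts(text):
    """Tally how often each word follows each other word (the grid-filling step)."""
    words = text.split()                       # crude tokenisation: split on whitespace
    counts = defaultdict(lambda: defaultdict(int))
    for current_word, next_word in zip(words, words[1:]):
        counts[current_word][next_word] += 1   # one tally mark in one grid cell
    return counts

counts = train_bigram_counts("the cat sat on the mat and the cat slept")
print(dict(counts["the"]))                     # {'cat': 2, 'mat': 1}
```

Each increment corresponds to one tally mark in one cell of your grid, or one paper token dropped into a bucket.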
Generation
Using a trained model to produce new text by repeatedly predicting and selecting the next token.
Synonyms: inference (in broader AI/ML contexts)
See: Generation
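A minimal generation loop, sketched under the same assumptions as the Training example above (a dictionary of transition counts; the small hard-coded table and names are illustrative only):

```python
import random

# A tiny transition-count table of the kind built in the Training sketch.
counts = {
    "the": {"cat": 2, "mat": 1},
    "cat": {"sat": 1, "slept": 1},
    "sat": {"on": 1},
    "on":  {"the": 1},
    "mat": {"and": 1},
    "and": {"the": 1},
}

def generate(counts, start_word, length=10):
    """Repeatedly pick a next word in proportion to its tally."""
    word = start_word
    output = [word]
    for _ in range(length):
        followers = counts.get(word)
        if not followers:                      # dead end: nothing ever followed this word
            break
        word = random.choices(list(followers), weights=list(followers.values()))[0]
        output.append(word)
    return " ".join(output)

print(generate(counts, "the"))
```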
Probability distribution
A set of options with associated likelihoods. In your model, the counts in a row (or tokens in a bucket) form a probability distribution over possible next words.
See: Generation, Sampling
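As a sketch, one row of tallies becomes a probability distribution by dividing each count by the row total (the words and numbers here are invented for illustration):

```python
row = {"cat": 2, "mat": 1, "dog": 1}   # tallies in one grid row (illustrative)
total = sum(row.values())
probabilities = {word: count / total for word, count in row.items()}
print(probabilities)                   # {'cat': 0.5, 'mat': 0.25, 'dog': 0.25}
```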
Model types
Bigram model
A model that predicts the next word based on one previous word. This is what you build in the fundamental lessons—each row of your grid represents what can follow a single word.
Synonyms: 2-gram model
See: Training, Generation
Trigram model
A model that uses two previous words for prediction, capturing more context than a bigram. The grid becomes three-dimensional (or you track word pairs instead of single words).
Synonyms: 3-gram model
See: Trigram
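A rough Python sketch of the same tallying idea with a two-word context: the pair of previous words becomes the lookup key (names and example text are illustrative):

```python
from collections import defaultdict

def train_trigram_counts(text):
    """Tally how often each word follows each pair of words."""
    words = text.split()
    counts = defaultdict(lambda: defaultdict(int))
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        counts[(w1, w2)][w3] += 1              # the key is now a two-word context
    return counts

counts = train_trigram_counts("the cat sat on the mat and the cat slept")
print(dict(counts[("the", "cat")]))            # {'sat': 1, 'slept': 1}
```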
N-gram model
The general term for models that predict based on the previous n-1 words. Bigrams are 2-grams, trigrams are 3-grams, and so on.
Context window
How many previous tokens the model considers when making predictions. Bigrams have a context window of 1, trigrams have 2, and modern models like GPT-4 can consider 128,000+ tokens.
See: Trigram, Context Columns
Training variants
Grid variant
The matrix-based approach where you draw a grid with words as row and column headers, then add tally marks to track which words follow which.
See: Training
Bucket variant
The physical container approach where each bucket is labelled with a word and contains paper tokens representing words that followed it in the training text.
See: Training
Matrix
A grid or table showing relationships between tokens. Your hand-drawn grids are matrices tracking which words follow other words. Each row can also be interpreted as an embedding vector.
See: Training, Word Embeddings
Sampling and generation
Weighted random sampling
Choosing the next token with probability proportional to its frequency. Your dice rolls implement this—words with higher counts are more likely to be selected.
See: Generation
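A small sketch of the dice-roll step in Python, assuming one row of tallies as a dictionary (the function name and example counts are illustrative):

```python
import random

def weighted_pick(followers):
    """The dice-roll step: pick a word with probability proportional to its tally."""
    roll = random.randint(1, sum(followers.values()))   # roll between 1 and the total tallies
    running_total = 0
    for word, count in followers.items():
        running_total += count
        if roll <= running_total:
            return word

print(weighted_pick({"cat": 2, "mat": 1, "dog": 1}))    # "cat" about half the time
```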
Temperature
A parameter controlling randomness in generation. Dividing counts by a high temperature makes the output more random; dividing by a low temperature makes it more predictable.
See: Sampling
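For the real-model side of the analogy, here is a hedged sketch of how temperature is typically applied: scores (logits) are divided by the temperature before being turned into probabilities with a softmax. The words and numbers are invented for illustration:

```python
import math

def probabilities_with_temperature(scores, temperature):
    """Divide scores (logits) by temperature, then softmax into probabilities."""
    scaled = [score / temperature for score in scores.values()]
    max_scaled = max(scaled)                             # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return {word: e / total for word, e in zip(scores, exps)}

scores = {"cat": 2.0, "mat": 1.0, "dog": 0.5}
print(probabilities_with_temperature(scores, 0.5))      # sharper: strongly favours "cat"
print(probabilities_with_temperature(scores, 2.0))      # flatter: more random choices
```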
Greedy sampling
Always choosing the most likely next word (equivalent to temperature approaching zero). Produces predictable but often repetitive text.
Synonyms: greedy decoding
See: Sampling
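In code, greedy selection is simply taking the largest tally in the current row (illustrative numbers):

```python
followers = {"cat": 2, "mat": 1, "dog": 1}       # one row of the grid (illustrative)
next_word = max(followers, key=followers.get)    # always take the highest tally
print(next_word)                                 # "cat", every single time
```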
Beam search
A generation strategy that tracks multiple possible sequences simultaneously, choosing the best overall path rather than committing to one word at a time.
See: Beam Search
Beam width
How many candidate paths to track during beam search. Beam width 1 is equivalent to greedy search; larger widths explore more possibilities.
See: Beam Search
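A simplified beam search sketch over bigram-style transition counts of the kind built in the Training example. Sequences are scored by summed log-probabilities; all names and numbers are illustrative:

```python
import math

# Illustrative transition counts.
counts = {
    "the": {"cat": 2, "mat": 1},
    "cat": {"sat": 1, "slept": 1},
    "mat": {"was": 1},
    "sat": {"down": 1},
}

def log_prob(word, next_word):
    followers = counts[word]
    return math.log(followers[next_word] / sum(followers.values()))

def beam_search(start_word, steps, beam_width):
    """Keep the beam_width best partial sequences at every step."""
    beams = [([start_word], 0.0)]                        # (sequence, total log-probability)
    for _ in range(steps):
        candidates = []
        for sequence, score in beams:
            followers = counts.get(sequence[-1], {})
            if not followers:                            # dead end: keep the sequence as is
                candidates.append((sequence, score))
                continue
            for next_word in followers:
                candidates.append((sequence + [next_word],
                                   score + log_prob(sequence[-1], next_word)))
        # keep only the beam_width highest-scoring candidates
        beams = sorted(candidates, key=lambda item: item[1], reverse=True)[:beam_width]
    return beams

for sequence, score in beam_search("the", steps=3, beam_width=2):
    print(" ".join(sequence), round(score, 2))
```

With `beam_width=1` this collapses to greedy search, since only the single best candidate survives each step.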
Truncation strategy
A rule that limits which tokens are eligible for selection before sampling. Examples include top-k (only consider the k most likely) and top-p/nucleus (only consider tokens until cumulative probability reaches p).
See: Sampling
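Possible sketches of the two truncation rules named above, operating on an already-normalised probability distribution (words and numbers invented). After truncating, you would renormalise the kept probabilities and sample from them:

```python
def top_k(probabilities, k):
    """Keep only the k most likely tokens before sampling."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])

def top_p(probabilities, p):
    """Keep the most likely tokens until their cumulative probability reaches p."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for word, prob in ranked:
        kept[word] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return kept

probabilities = {"cat": 0.5, "mat": 0.25, "dog": 0.15, "hat": 0.1}
print(top_k(probabilities, k=2))     # {'cat': 0.5, 'mat': 0.25}
print(top_p(probabilities, p=0.8))   # {'cat': 0.5, 'mat': 0.25, 'dog': 0.15}
```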
Understanding and meaning
Embedding
A numerical representation of a word. Each row in your bigram grid is that word’s embedding vector—a fingerprint of its usage context.
Synonyms: word vector, embedding vector
See: Word Embeddings
Similarity matrix
A grid showing how similar or different each pair of words is, calculated by comparing their embedding vectors. Words used in similar contexts have similar embeddings.
Synonyms: distance matrix, distance grid
See: Word Embeddings
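A sketch of comparing two grid rows with cosine similarity, one common way to fill in a similarity matrix (the rows, columns, and numbers here are invented for illustration):

```python
import math

# Each row holds one word's tallies over the columns cat, dog, sat, ran, the
# (how often each column word followed the row word; numbers are invented).
rows = {
    "cat": [0, 0, 3, 1, 0],
    "dog": [0, 0, 2, 2, 0],
    "the": [4, 3, 0, 0, 0],
}

def cosine_similarity(a, b):
    """1.0 means the rows point the same way; 0.0 means nothing in common."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity(rows["cat"], rows["dog"]), 2))   # 0.89: used alike
print(round(cosine_similarity(rows["cat"], rows["the"]), 2))   # 0.0: used differently
```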
Attention mechanism
The ability to focus on relevant previous words when making predictions. Context columns are a manual form of attention, letting you consider grammatical categories rather than just the immediately preceding word.
See: Context Columns
Advanced concepts
Fine-tuning
Additional training on specific text to adapt a model for a particular domain or task. Like adding more tallies to your grid from a new text source.
See: LoRA, Synthetic Data
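A minimal sketch of fine-tuning in the unplugged sense: keep the existing tallies and add more from a new, domain-specific text (the function name and example sentences are illustrative):

```python
from collections import defaultdict

def add_tallies(counts, text):
    """Fine-tuning, unplugged: keep the existing tallies and add more from new text."""
    words = text.split()
    for current_word, next_word in zip(words, words[1:]):
        counts[current_word][next_word] += 1
    return counts

counts = defaultdict(lambda: defaultdict(int))
add_tallies(counts, "the cat sat on the mat")                 # original training text
add_tallies(counts, "the patient sat in the waiting room")    # domain-specific text
print(dict(counts["the"]))   # {'cat': 1, 'mat': 1, 'patient': 1, 'waiting': 1}
```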
LoRA (Low-Rank Adaptation)
A technique for efficiently fine-tuning models by training a small “adaptation layer” rather than modifying all the original parameters.
See: LoRA
RLHF (Reinforcement Learning from Human Feedback)
A training technique where human preferences guide the model to produce more helpful, harmless, and honest outputs.
See: RLHF
Tool use
The ability of language models to recognise when to call external tools (like calculators or search engines) rather than generating text directly.
See: Tool Use
Synthetic data
Training data generated by models rather than collected from humans. Can be used to augment training sets or create specialised datasets.
See: Synthetic Data
Hallucination
When models generate plausible-sounding but false information. This happens because models learn patterns, not facts—they predict what text looks like rather than what is true.
Parameters
The numbers stored in the model that encode learned patterns. Each tally mark in your grid is a parameter. Modern models have billions of parameters.
See: Training
Transformer
The neural network architecture used by GPT, Claude, and other modern LLMs. It uses attention mechanisms to process all words in parallel rather than sequentially.
Connections to your activities
| Your activity | Real LLM equivalent |
|---|---|
| Tallying word pairs | Counting n-grams during training |
| Rolling dice for next word | Sampling from probability distribution |
| Grid rows/columns | Weight matrices in neural networks |
| Adding context columns | Learning attention patterns |
| Calculating word distances | Computing embedding similarities |
| Dividing tallies by temperature | Applying temperature to logits |
| Keeping top beam paths | Beam search with specified beam width |
| Picking from buckets | Weighted random sampling |
| Training on new text | Fine-tuning on domain-specific data |
Key insights
Scale is the main difference: your small grid holds a handful of parameters while modern models have billions, but the core concepts are identical.
Randomness creates variety: both your dice and ChatGPT use controlled randomness to avoid repetitive output.
Context improves prediction: more context (bigram → trigram → transformer) enables better text generation.
Embeddings capture meaning: words used similarly get similar vectors, whether hand-calculated or learned by neural networks.
Training is just counting: at its core, training means observing patterns in data—exactly what you did with tally marks.
The hands-on activities demonstrate the fundamental operations of language models. The main advances in modern AI come from doing these same operations at massive scale with learned (rather than hand-crafted) patterns.