Glossary
This glossary connects the hands-on activities in LLMs Unplugged with the technical terms used in modern language models. Each entry provides a plain language explanation and links to relevant lessons.
Core concepts
Token
A single unit of text that the model works with. In our activities, each word and punctuation mark is a token. Modern LLMs use subword tokens, so a single token may be only part of a word.
Vocabulary
All the unique tokens your model knows. The words across the top and side of your grid (or the bucket labels) form your vocabulary.
Language model
A system that predicts what text comes next based on patterns learned from training data. Your hand-built grid or bucket collection is a language model.
Training
The process of building a model by counting patterns in text. When you tally word transitions or fill buckets with tokens, you're training your model.
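This tally-counting step can be sketched in a few lines of Python (the example sentence is invented for illustration):

```python
from collections import defaultdict

def train_bigram_model(text):
    """Tally how often each word follows each other word:
    the grid activity, with a dictionary instead of tally marks."""
    counts = defaultdict(lambda: defaultdict(int))
    tokens = text.split()  # one token per word, as in the activities
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    return counts

model = train_bigram_model("the cat sat on the mat the cat ran")
print(dict(model["the"]))  # {'cat': 2, 'mat': 1}
```

Each key of `model` corresponds to one row of your grid; each inner dictionary holds that row's tallies.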
Generation
Using a trained model to produce new text by repeatedly predicting and selecting the next token.
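A minimal sketch of this predict-and-select loop, assuming a tiny hand-tallied count table as the trained model (the words and counts are invented for illustration):

```python
import random

# word -> {next word: tally}, as read off a finished grid
counts = {
    "the": {"cat": 2, "mat": 1},
    "cat": {"sat": 1, "ran": 1},
    "sat": {"on": 1},
    "on":  {"the": 1},
    "mat": {"the": 1},
    "ran": {},
}

def generate(start, max_words=8):
    """Repeatedly predict and select the next word, starting from `start`."""
    words = [start]
    for _ in range(max_words - 1):
        options = counts.get(words[-1], {})
        if not options:  # no known continuation: stop
            break
        nxt = random.choices(list(options), weights=list(options.values()))[0]
        words.append(nxt)
    return " ".join(words)

print(generate("the"))  # e.g. "the cat sat on the mat the cat"
```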
Probability distribution
A set of options with associated likelihoods. In your model, the counts in a row (or tokens in a bucket) form a probability distribution over possible next words.
Prompt
The input text you give to a language model. In a bigram model, the "prompt" is just the single current word used to predict what comes next. In modern LLMs, a prompt can be hundreds of thousands or even millions of tokens long, giving the model much more context to work with.
LLM (Large Language Model)
A language model trained on a very large amount of text, with billions of parameters. The hand-built models in these lessons are tiny language models; ChatGPT, Claude, and Gemini are large language models. The core principles are identical---the difference is scale.
Training set
The collection of text used to train a model. In our activities, this is the passage you read through to fill in your grid or buckets. Modern LLMs are trained on billions of pages of text from books, websites, and other sources.
Start and end tokens
Special tokens that mark the beginning and end of a text. Real LLMs use these to know when to start and stop generating. The lessons in this project don't include them explicitly---you just pick a starting word and decide when to stop---but they're an important part of how real models handle sentence and document boundaries.
ChatGPT
OpenAI's chatbot, and probably the most well-known LLM product. On this site we often use "ChatGPT" as shorthand for any modern LLM chatbot---the concepts apply equally to Claude, Gemini, DeepSeek, and others.
Claude
Anthropic's LLM chatbot. The concepts on this site apply equally to Claude, ChatGPT, Gemini, and other LLMs---the underlying principles are the same regardless of which product you use.
Gemini
Google's LLM chatbot. The concepts on this site apply equally to Gemini, ChatGPT, Claude, and other LLMs---the underlying principles are the same regardless of which product you use.
Model types
Bigram model
A model that predicts the next word based on one previous word. This is what you build in the fundamental lessons---each row of your grid represents what can follow a single word.
Trigram model
A model that uses two previous words for prediction, capturing more context than a bigram. The grid becomes three-dimensional (or you track word pairs instead of single words).
N-gram model
The general term for models that predict based on the previous n-1 words. Bigrams are 2-grams, trigrams are 3-grams, and so on.
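The same counting generalises to any n. A sketch, with the example tokens invented for illustration:

```python
from collections import defaultdict

def train_ngram_model(tokens, n):
    """Count which token follows each (n-1)-token context."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

tokens = "the cat sat on the cat ran".split()
trigram = train_ngram_model(tokens, 3)  # n=3: two words of context
print(dict(trigram[("the", "cat")]))    # {'sat': 1, 'ran': 1}
```

With `n=2` this reproduces the bigram grid; larger n captures more context at the cost of a much bigger table.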
Context window
How many previous tokens the model considers when making predictions. Bigrams have a context window of 1, trigrams have 2, and modern LLMs can consider hundreds of thousands or even millions of tokens.
Training variants
Grid variant
The matrix-based approach where you draw a grid with words as row and column headers, then add tally marks to track which words follow which.
Bucket variant
The physical container approach where each bucket is labelled with a word and contains paper tokens representing words that followed it in the training text.
Matrix
A grid or table showing relationships between tokens. Your hand-drawn grids are matrices tracking which words follow other words. Each row can also be interpreted as an embedding vector.
Sampling and generation
Weighted random sampling
Choosing the next token with probability proportional to its frequency. Your dice rolls implement this---words with higher counts are more likely to be selected.
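A sketch of how a dice roll realises weighted sampling over one grid row (the row counts are hypothetical):

```python
import random

def roll_for_next_word(row):
    """Pick a word with probability proportional to its tally,
    like rolling a die numbered 1..total over the row's counts."""
    total = sum(row.values())
    roll = random.randint(1, total)  # the "dice roll"
    running = 0
    for word, count in row.items():
        running += count
        if roll <= running:
            return word

row = {"cat": 3, "mat": 1}  # "cat" should win about 3 times in 4
print(roll_for_next_word(row))
```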
Temperature
A parameter controlling randomness in generation. In the activity, you divide the counts by the temperature before sampling: a high temperature makes output more random, a low temperature makes it more predictable.
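One way to realise this numerically (a sketch: real LLMs divide logits by the temperature before a softmax, which for raw counts amounts to raising each count to the power 1/T; the exact rounding rule used in the paper activity may differ):

```python
def apply_temperature(row, temperature):
    """Reshape a count distribution: T > 1 flattens it (more random),
    T < 1 sharpens it (more predictable)."""
    weights = {w: c ** (1 / temperature) for w, c in row.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

row = {"cat": 3, "mat": 1}
print(apply_temperature(row, 1.0))   # {'cat': 0.75, 'mat': 0.25}, unchanged
print(apply_temperature(row, 10.0))  # nearly uniform
print(apply_temperature(row, 0.1))   # nearly all weight on 'cat'
```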
Greedy sampling
Always choosing the most likely next word (equivalent to temperature approaching zero). Produces predictable but often repetitive text.
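Greedy selection is just an arg-max over one grid row (the counts here are hypothetical):

```python
def greedy_next(row):
    """Always take the continuation with the highest tally."""
    return max(row, key=row.get)

print(greedy_next({"cat": 3, "mat": 1, "dog": 1}))  # cat
```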
Beam search
A generation strategy that tracks multiple possible sequences simultaneously, choosing the best overall path rather than committing to one word at a time.
Beam width
How many candidate paths to track during beam search. Beam width 1 is equivalent to greedy search; larger widths explore more possibilities.
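A sketch of beam search over a small bigram model (the transition probabilities are invented for illustration; log-probabilities are summed so that long paths can be compared without underflow):

```python
import math

# word -> {next word: probability}
probs = {
    "the": {"cat": 0.75, "mat": 0.25},
    "cat": {"sat": 0.5, "ran": 0.5},
    "mat": {"sat": 1.0},
    "sat": {},
    "ran": {},
}

def beam_search(start, length, beam_width):
    """Track the beam_width most probable sequences at each step.
    beam_width=1 reduces to greedy search."""
    beams = [([start], 0.0)]  # (sequence, log-probability)
    for _ in range(length - 1):
        candidates = []
        for seq, score in beams:
            for word, p in probs.get(seq[-1], {}).items():
                candidates.append((seq + [word], score + math.log(p)))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return [" ".join(seq) for seq, _ in beams]

print(beam_search("the", 3, beam_width=2))  # ['the cat sat', 'the cat ran']
```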
Truncation strategy
A rule that limits which tokens are eligible for selection before sampling. Examples include top-k (only consider the k most likely) and top-p/nucleus (only consider tokens until cumulative probability reaches p).
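Both rules are easy to sketch over a single row of counts (the counts are hypothetical):

```python
def top_k(row, k):
    """Keep only the k highest-count continuations."""
    kept = sorted(row.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(kept)

def top_p(row, p):
    """Keep the most likely continuations until their cumulative
    probability reaches p (nucleus sampling)."""
    total = sum(row.values())
    kept, cumulative = {}, 0
    for word, count in sorted(row.items(), key=lambda kv: kv[1], reverse=True):
        kept[word] = count
        cumulative += count
        if cumulative >= p * total:
            break
    return kept

row = {"cat": 5, "mat": 3, "dog": 1, "hat": 1}
print(top_k(row, 2))     # {'cat': 5, 'mat': 3}
print(top_p(row, 0.79))  # {'cat': 5, 'mat': 3}  (0.5 + 0.3 reaches 0.79)
```

After truncation, you would still sample from the surviving words with weighted random sampling.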
Understanding and meaning
Embedding
A numerical representation of a word. Each row in your bigram grid is that word's embedding vector---a fingerprint of its usage context. In real LLMs, embeddings are learned separately rather than derived from raw counts, but the principle is the same: words used in similar ways get similar vectors.
Similarity matrix
A grid showing how similar or different each pair of words is, calculated by comparing their embedding vectors. Words used in similar contexts have similar embeddings.
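A sketch of the comparison, using cosine similarity between grid rows treated as embedding vectors (the rows and column meanings here are hypothetical):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors:
    1.0 means same direction, 0.0 means nothing in common."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Grid rows used as embeddings; columns might be: sat, ran, the, quickly
cat = [4, 2, 0, 1]
dog = [3, 3, 0, 0]
the = [0, 0, 0, 5]

print(round(cosine_similarity(cat, dog), 2))  # 0.93: similar contexts
print(round(cosine_similarity(cat, the), 2))  # 0.22: different contexts
```

Computing this for every pair of words fills in the similarity matrix.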
Attention mechanism
The ability to focus on relevant previous words when making predictions. In real LLMs, attention is learned, weighted, and dynamic---the model decides what to focus on for each prediction. Context columns illustrate the motivation for attention: considering more than just the immediately preceding word.
Advanced concepts
Fine-tuning
Additional training on specific text to adapt a model for a particular domain or task. Like adding more tallies to your grid from a new text source.
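A sketch of that idea: take an already-trained count table and keep tallying on a new text (the words and counts are invented for illustration):

```python
from collections import Counter

def fine_tune(counts, new_text):
    """Add tallies from a new text source to an existing model."""
    tokens = new_text.split()
    for current, nxt in zip(tokens, tokens[1:]):
        counts.setdefault(current, Counter())[nxt] += 1
    return counts

base = {"the": Counter({"cat": 2})}        # the "pre-trained" model
fine_tune(base, "the dog ran to the dog")  # domain-specific text
print(dict(base["the"]))  # {'cat': 2, 'dog': 2}
```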
LoRA (Low-Rank Adaptation)
A technique for efficiently fine-tuning models by training a small "adaptation layer" rather than modifying all the original parameters.
RLHF (Reinforcement Learning from Human Feedback)
A training technique where human preferences guide the model to produce more helpful, harmless, and honest outputs.
Agentic tool use
The ability of language models to act as agents by recognising when to call external tools (like calculators or search engines) in a loop rather than generating text directly.
Synthetic data
Training data generated by models rather than collected from humans. Can be used to augment training sets or create specialised datasets.
Hallucination
When models generate plausible-sounding but false information. This happens because models learn patterns, not facts---they predict what text looks like rather than what is true.
Parameters
The numbers stored in the model that encode learned patterns. Each tally mark in your grid is a parameter. Modern models have billions of parameters.
Transformer
The neural network architecture used by GPT, Claude, and other modern LLMs. It uses attention mechanisms to process all words in parallel rather than sequentially.
Connections to your activities
| Your activity | Real LLM equivalent |
|---|---|
| Tallying word pairs | Counting n-grams during training |
| Rolling dice for next word | Sampling from probability distribution |
| Grid rows/columns | Weight matrices in neural networks |
| Adding context columns | Learning attention patterns |
| Calculating word distances | Computing embedding similarities |
| Dividing tallies by temperature | Applying temperature to logits |
| Keeping top beam paths | Beam search with specified beam width |
| Picking from buckets | Weighted random sampling |
| Training on new text | Fine-tuning on domain-specific data |
Key insights
- Scale is the main difference: your small grid versus billions of parameters; the core concepts are identical.
- Randomness creates variety: both your dice and modern LLMs use controlled randomness to avoid repetitive output.
- Context improves prediction: more context (bigram → trigram → transformer) enables better text generation.
- Embeddings capture meaning: words used similarly get similar vectors, whether hand-calculated or learned by neural networks.
- Training is just counting: at its core, training means observing patterns in data---exactly what you did with tally marks.
The hands-on activities demonstrate the fundamental operations of language models. The main advances in modern AI come from doing these same operations at massive scale with learned (rather than hand-crafted) patterns.