Glossary
This glossary connects the hands-on activities in LLMs Unplugged with the technical terms used in modern language models. Each entry provides a plain-language explanation and links to relevant lessons.
Core concepts
Token
A single unit of text that the model works with. In our activities, each word and punctuation mark is a token. Modern LLMs use subword tokens that can be parts of words.
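As a rough illustration of word-and-punctuation tokens, the sketch below splits a sentence with a regular expression. The function name `tokenize` and the exact pattern are assumptions made for this example; real LLM tokenizers use learned subword vocabularies instead.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation-mark tokens.

    Illustrative only: the pattern is an assumption, not the rule used
    by any particular lesson or by real subword tokenizers.
    """
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("The cat sat. The end!")
print(tokens)
```

Note that this simple pattern splits contractions such as "don't" into three tokens, which is one reason real systems use more careful schemes.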
Vocabulary
All the unique tokens your model knows. The words across the top and side of your grid (or the cutouts' prefix labels) form your vocabulary.
Language model
A system that predicts what text comes next based on patterns learned from training data. Your hand-built grid or cutouts spread is a language model.
Training
The process of building a model by counting patterns in text. When you tally word transitions or spread cutouts on a table, you're training your model.
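The tallying step can be sketched in a few lines. This is a minimal version that assumes the text is already split into tokens; `train_bigram` and the sample sentence are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Tally how often each token is followed by each other token."""
    grid = defaultdict(Counter)  # grid[current][following] -> count
    for current, following in zip(tokens, tokens[1:]):
        grid[current][following] += 1
    return grid

# Hypothetical training text, split on spaces for simplicity.
tokens = "the cat sat on the mat . the cat ran .".split()
grid = train_bigram(tokens)
print(dict(grid["the"]))  # the row for "the"
```

Each key of `grid` corresponds to a row of the paper grid, and each inner count corresponds to the tallies in one cell.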
Pre-training
The initial, expensive training phase where a model learns general patterns from a large corpus. Most users of modern LLMs use pre-trained models without ever training one themselves, much like using a provided booklet rather than building your own grid.
Generation
Using a trained model to produce new text by repeatedly predicting and selecting the next token.
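A generation loop might look like the sketch below. It assumes the model is a dictionary of rows of counts (a hand-written stand-in for a tallied grid); the names `generate`, `max_tokens`, and `rng` are choices made for this example.

```python
import random

def generate(grid, start, max_tokens=10, rng=None):
    """Repeatedly sample a next word from the current word's row of counts."""
    rng = rng or random.Random()
    output = [start]
    current = start
    for _ in range(max_tokens - 1):
        row = grid.get(current)
        if not row:  # no recorded followers: stop early
            break
        words = list(row)
        weights = [row[w] for w in words]
        current = rng.choices(words, weights=weights, k=1)[0]
        output.append(current)
    return output

# Hypothetical hand-built grid: grid[word][next_word] = tally count.
grid = {
    "the": {"cat": 2, "mat": 1},
    "cat": {"sat": 1},
    "sat": {"on": 1},
    "on": {"the": 1},
    "mat": {},
}
print(" ".join(generate(grid, "the", rng=random.Random(0))))
```

The loop stops either when it reaches `max_tokens` or when the current word has no followers, which plays the role of deciding when to stop by hand.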
Inference
Using a trained model to produce outputs. In language models, inference means generating text. These lessons say "generation" because that better describes what language models do, but "inference" is the term you'll find in AI/ML literature and tooling.
Probability distribution
A set of options with associated likelihoods. In your model, the counts in a row (or matching cutouts in the spread) form a probability distribution over possible next words.
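Turning a row of counts into probabilities is a single division by the row total. The sketch below assumes a row stored as a dictionary; `to_distribution` and the example counts are illustrative.

```python
def to_distribution(counts):
    """Convert a row of tally counts into probabilities that sum to 1."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

row = {"cat": 2, "mat": 1, "dog": 1}  # hypothetical row of tallies
dist = to_distribution(row)
print(dist)  # "cat" gets half the probability, the others a quarter each
```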
Prompt
The input text you give to a language model. In a bigram model, the "prompt" is just the single current word used to predict what comes next. In modern LLMs, a prompt can be hundreds of thousands or even millions of tokens long, giving the model much more context to work with.
LLM (Large Language Model)
A language model trained on a very large amount of text, with billions of parameters. The hand-built models in these lessons are tiny language models; ChatGPT, Claude, and Gemini are large language models. The core principles are identical---the difference is scale.
Training data
The collection of text used to train a model. In our activities, this is the passage you read through to fill in your grid or spread out as cutouts. Modern LLMs are trained on billions of pages of text from books, websites, and other sources.
Start and end tokens
Special tokens that mark the beginning and end of a text. Real LLMs use these to know when to start and stop generating. The lessons in this project don't include them explicitly---you just pick a starting word and decide when to stop---but they're an important part of how real models handle sentence and document boundaries.
ChatGPT
OpenAI's chatbot, and probably the best-known LLM product. On this site we often use "ChatGPT" as shorthand for any modern LLM chatbot; the concepts apply equally to Claude, Gemini, DeepSeek and others. The underlying principles are the same regardless of which product you use.
Claude
Anthropic's LLM chatbot. The concepts on this site apply equally to Claude, ChatGPT, Gemini, and other LLMs---the underlying principles are the same regardless of which product you use.
Gemini
Google's LLM chatbot. The concepts on this site apply equally to Gemini, ChatGPT, Claude, and other LLMs---the underlying principles are the same regardless of which product you use.
Model types
Bigram model
A model that predicts the next word based on one previous word. This is what you build in the fundamental lessons---each row of your grid represents what can follow a single word.
Trigram model
A model that uses two previous words for prediction, capturing more context than a bigram. The grid becomes three-dimensional (or you track word pairs instead of single words).
N-gram model
The general term for models that predict based on the previous n-1 words. Bigrams are 2-grams, trigrams are 3-grams, and so on.
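The general case can be sketched by using a tuple of the previous n-1 words as the lookup key. The function name `train_ngram` and the sample text are assumptions for illustration.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n):
    """Count which token follows each context of n-1 previous tokens."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])  # the n-1 previous tokens
        following = tokens[i + n - 1]
        model[context][following] += 1
    return model

tokens = "the cat sat on the cat mat".split()
bigram = train_ngram(tokens, 2)   # contexts are 1-word tuples
trigram = train_ngram(tokens, 3)  # contexts are 2-word tuples
print(dict(trigram[("the", "cat")]))
```

Using word pairs as keys is the "track word pairs instead of single words" option: the structure stays two-dimensional, but the number of possible rows grows quickly with n.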
Context window
How many previous tokens the model considers when making predictions. Bigrams have a context window of 1, trigrams have 2, and modern LLMs can consider hundreds of thousands or even millions of tokens.
Sampling and generation
Weighted random sampling
Choosing the next token with probability proportional to its frequency. Your dice rolls implement this---words with higher counts are more likely to be selected.
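One way to mimic the dice roll in code is to roll a number between 1 and the row total, then walk along the row until the running count reaches the roll. This is a sketch of that idea; `weighted_pick` is a name invented here, and it assumes the row has at least one tally.

```python
import random

def weighted_pick(counts, rng):
    """Pick a word with probability proportional to its count."""
    total = sum(counts.values())
    roll = rng.randint(1, total)  # like rolling a die with `total` faces
    running = 0
    for word, n in counts.items():
        running += n
        if roll <= running:
            return word

counts = {"cat": 3, "mat": 1}  # hypothetical row: "cat" should win about 3 times in 4
rng = random.Random(0)
picks = [weighted_pick(counts, rng) for _ in range(4000)]
print(picks.count("cat") / len(picks))
```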
Decoding strategy
The procedure used to turn a model's probability distribution into actual output text. Greedy decoding picks the highest-probability word; sampling rolls weighted dice; beam search tracks several candidate paths at once.
Temperature
A parameter controlling randomness in generation. Dividing counts by temperature makes output more random (high temperature) or more predictable (low temperature).
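The sketch below shows one common formulation, in which logits are divided by the temperature before being normalised. It assumes the logarithm of each count plays the role of a logit, which may differ from the arithmetic used in the unplugged activity; `apply_temperature` is a name chosen for this example.

```python
import math

def apply_temperature(counts, temperature):
    """Reshape a row of counts into probabilities at a given temperature.

    Assumption: log(count) stands in for the model's logit, so each
    probability is proportional to count ** (1 / temperature).
    """
    logits = {w: math.log(n) / temperature for w, n in counts.items() if n > 0}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {w: math.exp(v - top) for w, v in logits.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

row = {"cat": 4, "mat": 1}
for t in (0.5, 1.0, 2.0):
    print(t, apply_temperature(row, t))
```

At temperature 1 the probabilities match the raw counts; lower values sharpen the distribution towards the most frequent word, and higher values flatten it.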
Greedy sampling
Always choosing the most likely next word (equivalent to temperature approaching zero). Produces predictable but often repetitive text.
Beam search
A generation strategy that tracks multiple possible sequences simultaneously, choosing the best overall path rather than committing to one word at a time.
Beam width
How many candidate paths to track during beam search. Beam width 1 is equivalent to greedy search; larger widths explore more possibilities.
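A small beam search over a grid of counts can be sketched as follows. The grid is hypothetical and chosen so that width 1 and width 2 give different answers; dead-end paths are simply carried forward unchanged, which is a simplification.

```python
import math

def beam_search(grid, start, steps, beam_width):
    """Keep the `beam_width` most probable paths after each step."""
    beams = [([start], 0.0)]  # (sequence, log probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            row = grid.get(seq[-1], {})
            total = sum(row.values())
            if total == 0:  # dead end: keep the path as it is
                candidates.append((seq, score))
                continue
            for word, n in row.items():
                candidates.append((seq + [word], score + math.log(n / total)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]

# "b" is the likelier first step, but "c" leads to a more probable full path.
grid = {"a": {"b": 3, "c": 2}, "b": {"d": 1, "e": 1}, "c": {"f": 1}}
print(beam_search(grid, "a", steps=2, beam_width=1))
print(beam_search(grid, "a", steps=2, beam_width=2))
```

With width 1 the search commits to "b" and ends on a path with probability 0.3; with width 2 it keeps "c" alive and finds the path through "f" with probability 0.4.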
Truncation strategy
A rule that limits which tokens are eligible for selection before sampling. Examples include top-k (only consider the k most likely tokens) and top-p/nucleus (only consider the most likely tokens whose cumulative probability reaches p).
Top-k sampling
A truncation strategy that keeps only the k most likely next-word options before sampling. Setting k=1 is equivalent to greedy sampling; larger k allows more variety while still excluding very unlikely words.
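Applied to a row of counts, top-k amounts to sorting and keeping the first k entries. This is a minimal sketch; `top_k` and the example row are illustrative, and ties are broken by the order in which words appear in the row.

```python
def top_k(counts, k):
    """Keep only the k options with the highest counts."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])

row = {"cat": 5, "mat": 3, "dog": 1, "hat": 1}
print(top_k(row, 2))  # sampling would then choose among these survivors only
```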
Top-p sampling
A truncation strategy that keeps just enough of the most likely options for their cumulative probability to reach a threshold p (e.g., 0.9). Unlike top-k, the number of options changes depending on how confident the model is about the next word.
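The cumulative-probability rule can be sketched like this. The function name `top_p` and both example rows are assumptions; the point is that the same threshold keeps a different number of options in each row.

```python
def top_p(counts, p):
    """Keep the most likely options until their cumulative probability reaches p."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for word, n in ranked:
        kept[word] = n
        cumulative += n / total
        if cumulative >= p:
            break
    return kept

confident = {"cat": 19, "mat": 1}             # one option dominates
unsure = {"a": 1, "b": 1, "c": 1, "d": 1}     # four equally likely options
print(len(top_p(confident, 0.9)), len(top_p(unsure, 0.9)))
```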
Understanding and meaning
Embedding
A numerical representation of a word. Each row in your bigram grid is that word's embedding vector, a fingerprint of its usage context. In real LLMs, embeddings are learned during training rather than derived from raw counts, but the principle is the same: words used in similar ways get similar vectors.
Similarity matrix
A grid showing how similar or different each pair of words is, calculated by comparing their embedding vectors. Words used in similar contexts have similar embeddings.
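One way to compare two rows is cosine similarity, sketched below. This measure is an assumption chosen for illustration (the lessons may use a different distance); the rows are hypothetical.

```python
import math

def cosine_similarity(row_a, row_b):
    """Cosine of the angle between two rows of counts (1.0 = same direction)."""
    words = set(row_a) | set(row_b)
    dot = sum(row_a.get(w, 0) * row_b.get(w, 0) for w in words)
    norm_a = math.sqrt(sum(v * v for v in row_a.values()))
    norm_b = math.sqrt(sum(v * v for v in row_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty row has no direction to compare
    return dot / (norm_a * norm_b)

# Hypothetical rows: "cat" and "dog" are followed by the same words.
rows = {
    "cat": {"sat": 2, "ran": 1},
    "dog": {"sat": 2, "ran": 1},
    "the": {"cat": 3, "dog": 1},
}
matrix = {a: {b: cosine_similarity(rows[a], rows[b]) for b in rows} for a in rows}
print(matrix["cat"])
```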
Attention mechanism
The ability to focus on relevant previous words when making predictions. In real LLMs, attention is learned, weighted, and dynamic---the model decides what to focus on for each prediction. The in-context memory and induction-head lessons illustrate the motivation for attention: reusing more than just the immediately preceding word.
In-context learning
Picking up a pattern from the prompt and continuing it, with no change to the model's weights. The "learning" happens in the context the model is given, not in the model itself---which is why a few examples in a prompt can steer an LLM's output.
Induction head
A circuit found inside transformers that completes patterns by finding an earlier place where the current token appeared and copying what came next. Induction heads are a key mechanism behind in-context learning.
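The behaviour described above (find an earlier occurrence, copy what came next) can be caricatured in plain code. Real induction heads are learned attention patterns, not explicit loops; `induction_copy` is a hypothetical name used only for this sketch.

```python
def induction_copy(tokens):
    """Predict the next token by copying what followed the last token earlier on."""
    current = tokens[-1]
    # Scan backwards for the most recent earlier occurrence of the current token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # the current token has not appeared before

prompt = "the quick brown fox . the quick brown".split()
print(induction_copy(prompt))
```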
Advanced concepts
Markov chain
A statistical model where the next state depends only on the current state. Andrey Markov introduced the idea in 1906 and, in 1913, applied it to letter sequences in Pushkin's Eugene Onegin. A bigram language model is a simple Markov chain over words.
Neural network
A computational system loosely inspired by biological neurons that learns patterns from data by adjusting numerical weights. Modern LLMs are large neural networks; hand-built models are not. Both capture statistical patterns in text: the hand-built model through manual counting, the neural network through automatic weight adjustment.
Transformer
The neural network architecture used by GPT, Claude, and other modern LLMs. It uses attention mechanisms to process all words in parallel rather than sequentially.
Parameters
The numbers stored in the model that encode learned patterns. Each count in a cell of your grid is a parameter. Modern models have billions of parameters.
Fine-tuning
Additional training on specific text to adapt a model for a particular domain or task. Like adding more tallies to your grid from a new text source.
Base model
The original trained model that an adaptation (such as a LoRA layer) is applied to. The base model stays unchanged; the adaptation provides the domain-specific shift.
LoRA (Low-Rank Adaptation)
A technique for efficiently fine-tuning models by training a small "adaptation layer" rather than modifying all the original parameters.
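The "low-rank" idea can be sketched with small matrices: the adaptation is the product of two thin matrices, added to an unchanged base matrix. The values below are made up, and plain Python lists stand in for the tensor libraries a real implementation would use.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def apply_lora(base, B, A):
    """Return base + B @ A without modifying the base matrix."""
    delta = matmul(B, A)
    return [[base[i][j] + delta[i][j] for j in range(len(base[0]))]
            for i in range(len(base))]

base = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # 9 frozen numbers
B = [[1.0], [0.0], [2.0]]  # 3x1, trainable
A = [[0.5, 0.0, 1.0]]      # 1x3, trainable
adapted = apply_lora(base, B, A)
print(adapted)
```

Here the adaptation has 6 trainable numbers against 9 in the base matrix; the saving becomes significant when the base matrix has millions of entries and the thin matrices have only a few rows or columns.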
Agentic tool use
The ability of language models to act as agents by recognising when to call external tools (like calculators or search engines) in a loop rather than generating text directly.
Agent
A language model that runs tools in a loop to achieve a goal. Instead of generating text directly, an agent calls external tools and uses their results to continue.
Tool use
The mechanism by which a language model calls external tools (calculators, search engines, databases, code runners) during generation. Modern LLMs output structured tool calls; in the unplugged activity, sampling a "trigger word" plays the same role.
Synthetic data
Training data generated by models rather than collected from humans. Can be used to augment training sets or create specialised datasets.
Model collapse
What happens when a model is trained on its own outputs (or outputs from similar models) over multiple generations. Common patterns get amplified, rare ones vanish, and the model converges towards a narrower, more repetitive style.
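A toy simulation shows why rare patterns vanish. It assumes the "model" is a single row of word counts that is repeatedly replaced by counts of its own samples; the vocabulary, sample size, and number of generations are arbitrary choices.

```python
import random
from collections import Counter

def retrain_on_own_output(counts, sample_size, rng):
    """Sample from the current model, then rebuild the counts from those samples."""
    words = list(counts)
    weights = [counts[w] for w in words]
    samples = rng.choices(words, weights=weights, k=sample_size)
    return Counter(samples)

rng = random.Random(0)
counts = Counter({"the": 50, "cat": 30, "sat": 15, "quokka": 4, "zyzzyva": 1})
history = [len(counts)]  # number of distinct words per generation
for _ in range(20):
    counts = retrain_on_own_output(counts, sample_size=30, rng=rng)
    history.append(len(counts))
print(history)
```

Once a word fails to appear in one generation's samples it can never come back, so the number of distinct words can only stay the same or shrink.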
Post-training and reasoning
Post-training
Everything done to a model after pre-training to make it more useful. Pre-training teaches the model how language works in general; post-training shapes its behaviour for specific purposes---following instructions, refusing harmful requests, or working carefully through problems. Supervised fine-tuning, RLHF, and RLVR are all post-training techniques.
Supervised fine-tuning (SFT)
Post-training on demonstration data---paired examples of (input, desired output). The model learns to imitate the demonstrations. SFT is the simplest form of post-training and often the first step before reinforcement-learning techniques like RLHF or RLVR are applied.
RLHF (Reinforcement Learning from Human Feedback)
A post-training technique where humans compare pairs of model outputs and their preferences train a reward model, which then guides the main model. Suits fuzzy objectives like "be helpful" or "sound natural" where no automated checker exists. Sibling of RLVR, which uses an automated checker instead of human preferences.
Reward model
A separate model that learns to predict human preferences, then guides the main model during RLHF. Rather than asking humans to rate every output, the reward model rates outputs at scale, trained on a smaller set of human comparisons.
Alignment
The process of shaping a model's outputs to match human values and preferences (typically helpfulness, harmlessness, and honesty). RLHF is one of the main techniques used to align modern LLMs.
RLVR (Reinforcement Learning from Verifiable Rewards)
A post-training technique where the model generates an attempt, an automated checker verifies whether it's correct, and successful attempts get reinforced. Unlike RLHF, no human preference data is needed---just a verifier. Works best on domains with cheap, reliable checking (mathematics, code, formal logic). RLVR is the main technique behind modern reasoning models.
Verifier
A program or rule that checks whether a model's output is correct. The reward signal in RLVR. Examples include "does this code pass the test suite?", "does this answer match the known value?", and "is this output valid JSON?". Cheap, reliable verifiers are what make RLVR scalable.
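Two of the example checks above can be written as short functions. This is a sketch: the function names and the sample rollouts are invented, and a real RLVR pipeline would feed the pass/fail signal into a training update rather than a simple filter.

```python
import json

def is_valid_json(output):
    """Verifier: does the output string parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def matches_known_value(output, expected):
    """Verifier: does the output, ignoring surrounding whitespace, equal the answer?"""
    return output.strip() == str(expected)

# Hypothetical rollouts for the problem "what is 6 * 7?"
rollouts = ["42", " 42\n", "forty-two", "41"]
to_reinforce = [r for r in rollouts if matches_known_value(r, 42)]
print(to_reinforce)
```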
Rollout
A single generation from a model, sampled all the way through. During RL-based post-training the model produces many rollouts per problem; the reward signal then determines which ones to reinforce.
On-policy
Training where the data the model learns from is generated by the current model itself. RLVR is on-policy: the model produces rollouts, the rollouts get scored, the model updates, and the next batch of rollouts comes from the improved model. Contrast with off-policy training (such as SFT on a fixed dataset of human-written examples), where the training data doesn't change as the model improves.
Reasoning model
A language model that's been trained to generate intermediate "thinking" steps before producing its final answer. OpenAI's o1 and o3, DeepSeek's R1, and Claude with extended thinking are reasoning models. The architecture is the same as any other LLM---what changes is the post-training (usually RLVR), which shapes the model to use extra tokens for working through problems.
Chain of thought
Step-by-step reasoning a model generates before its final answer. Started as a prompting trick ("let's think step by step") and is now built into reasoning models through post-training. The intermediate steps function like rough working in a maths exercise---visible scratch work that helps the model reach a correct answer.
Thinking tokens
The tokens that make up a reasoning model's chain of thought. Mechanically they're produced the same way as any other tokens (same forward pass, same sampling), but they're often hidden from the user in product interfaces. Some products, such as Claude's extended thinking, show them by default.
Test-time compute
Compute spent at generation time rather than during training. Reasoning models trade test-time compute for accuracy: letting the model generate more thinking tokens before answering tends to improve performance in a roughly predictable way. This is a scaling axis distinct from model size and training data, and is what makes reasoning models qualitatively different from earlier LLMs.
Connections to your activities
| Your activity | Real LLM equivalent |
|---|---|
| Tallying word pairs | Counting n-grams during training |
| Rolling dice for next word | Sampling from a probability distribution |
| Grid rows/columns | Weight matrices in neural networks |
| Adding context columns | Learning attention patterns |
| Calculating word distances | Computing embedding similarities |
| Dividing tallies by temperature | Applying temperature to logits |
| Keeping top beam paths | Beam search with specified beam width |
| Picking matching cutouts | Weighted random sampling |
| Training on new text | Fine-tuning on domain-specific data |
| Updating counts from votes | RLHF using a reward model |
| Sampling a trigger word | Tool use by an agent |
| Re-training on generated text | Model collapse from synthetic data |
Key insights
- Scale is the main difference: your small grid versus billions of parameters. The core concepts are identical.
- Randomness creates variety: both your dice and modern LLMs use controlled randomness to avoid repetitive output.
- Context improves prediction: more context (bigram → trigram → transformer) enables better text generation.
- Embeddings capture meaning: words used similarly get similar vectors, whether hand-calculated or learned by neural networks.
- Training is just counting: at its core, training means observing patterns in data---exactly what you did with tally marks.
The hands-on activities demonstrate the fundamental operations of language models. The main advances in modern AI come from doing these same operations at massive scale with learned (rather than hand-crafted) patterns.