Glossary

This glossary connects the hands-on activities in LLMs Unplugged with the technical terms used in modern language models. Each entry provides a plain-language explanation and links to relevant lessons.

Core concepts

Token

A single unit of text that the model works with. In our activities, each word and punctuation mark is a token. Modern LLMs use subword tokens that can be parts of words.

Synonyms: word (in introductory contexts)

Style note: The lessons use "word" initially to keep things accessible, then transition to "token" once the concept is established. Both terms refer to the same thing in our activities.

See: Training

Vocabulary

All the unique tokens your model knows. The words across the top and side of your grid (or the cutouts' prefix labels) form your vocabulary.

See: Training
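
As a rough illustration of tokens and vocabulary, the sketch below splits a short passage into words and punctuation marks and then collects the unique tokens. The function names `tokenize` and `vocabulary` and the sample sentence are assumptions made for this example; real LLM tokenizers use subword schemes that are considerably more involved.

```python
import re

def tokenize(text):
    """Split text into lowercase words and standalone punctuation marks."""
    return re.findall(r"[a-z']+|[.,!?;]", text.lower())

def vocabulary(tokens):
    """Return the unique tokens, sorted so the grid labels have a stable order."""
    return sorted(set(tokens))

tokens = tokenize("See Spot run. Run, Spot, run!")
vocab = vocabulary(tokens)
print(tokens)
print(vocab)  # these would be the labels across the top and side of the grid
```

The passage has ten tokens but only six distinct ones, so the grid would have six rows and six columns.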

Language model

A system that predicts what text comes next based on patterns learned from training data. Your hand-built grid or cutouts spread is a language model.

See: Training, Generation

Training

The process of building a model by counting patterns in text. When you tally word transitions or spread cutouts on a table, you're training your model.

Synonyms: learning

See: Training
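
The tallying step can be sketched in a few lines. This is a minimal version assuming the text has already been split into tokens; `train_bigram` is a hypothetical name, and the nested dictionary stands in for the paper grid (outer key is the row, inner key is the column).

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Tally how often each token is followed by each other token."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts

tokens = "the cat sat on the mat . the cat ran .".split()
model = train_bigram(tokens)
print(dict(model["the"]))  # the row for "the"
print(dict(model["cat"]))  # the row for "cat"
```

Each `+= 1` corresponds to one tally mark in the grid.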

Pre-training

The initial, expensive training phase where a model learns general patterns from a large corpus. Most users of modern LLMs use pre-trained models without ever training one themselves, much like using a provided booklet rather than building your own grid.

Related: Post-training, Training

See: Pre-trained Model Generation

Generation

Using a trained model to produce new text by repeatedly predicting and selecting the next token.

See: Generation
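
The "repeatedly predict and select" loop might look like the sketch below. The small `model` dictionary is made-up data in the shape of a bigram grid, and `generate` is a hypothetical helper; stopping when a word has no recorded follower is a simplification chosen for this example.

```python
import random

def generate(model, start, max_tokens, rng):
    """Repeatedly sample a next token from the current token's row of counts."""
    output = [start]
    while len(output) < max_tokens:
        row = model.get(output[-1])
        if not row:  # no recorded follower for this word: stop early
            break
        words = list(row)
        weights = [row[w] for w in words]
        output.append(rng.choices(words, weights=weights, k=1)[0])
    return output

# Made-up counts: model[current][next] = number of tallies
model = {
    "the": {"cat": 2, "mat": 1},
    "cat": {"sat": 1, "ran": 1},
    "sat": {"on": 1},
    "on": {"the": 1},
}
text = generate(model, "the", 8, random.Random(0))
print(" ".join(text))
```

Passing in a seeded `random.Random` makes a run repeatable; in the classroom, the dice play that role.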

Inference

Using a trained model to produce outputs. In language models, inference means generating text. These lessons say "generation" because that better describes what language models do, but "inference" is the term you'll find in AI/ML literature and tooling.

Synonyms: decoding

See: Generation

Probability distribution

A set of options with associated likelihoods. In your model, the counts in a row (or matching cutouts in the spread) form a probability distribution over possible next words.

See: Generation, Sampling
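
To make the link between counts and probabilities concrete, here is a small sketch that turns one row of tallies into a probability distribution by dividing each count by the row total. The row contents and the name `to_distribution` are illustrative assumptions.

```python
def to_distribution(row_counts):
    """Convert a row of counts into probabilities that sum to 1."""
    total = sum(row_counts.values())
    return {word: count / total for word, count in row_counts.items()}

row = {"cat": 2, "mat": 1, "dog": 1}
dist = to_distribution(row)
print(dist)
```

With four tallies in the row, "cat" gets 2/4 and the other two words get 1/4 each.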

Prompt

The input text you give to a language model. In a bigram model, the "prompt" is just the single current word used to predict what comes next. In modern LLMs, a prompt can be hundreds of thousands or even millions of tokens long, giving the model much more context to work with.

See: Generation

LLM (Large Language Model)

A language model trained on a very large amount of text, with billions of parameters. The hand-built models in these lessons are tiny language models; ChatGPT, Claude, and Gemini are large language models. The core principles are identical---the difference is scale.

See: Training

Training data

The collection of text used to train a model. In our activities, this is the passage you read through to fill in your grid or spread out as cutouts. Modern LLMs are trained on billions of pages of text from books, websites, and other sources.

Synonyms: training set, corpus (plural corpora)

See: Training

Start and end tokens

Special tokens that mark the beginning and end of a text. Real LLMs use these to know when to start and stop generating. The lessons in this project don't include them explicitly---you just pick a starting word and decide when to stop---but they're an important part of how real models handle sentence and document boundaries.

ChatGPT

OpenAI's chatbot, and probably the best-known LLM product. On this site we often use "ChatGPT" as shorthand for any modern LLM chatbot: the concepts apply equally to Claude, Gemini, DeepSeek and others. The underlying principles are the same regardless of which product you use.

Claude

Anthropic's LLM chatbot. The concepts on this site apply equally to Claude, ChatGPT, Gemini, and other LLMs---the underlying principles are the same regardless of which product you use.

Gemini

Google's LLM chatbot. The concepts on this site apply equally to Gemini, ChatGPT, Claude, and other LLMs---the underlying principles are the same regardless of which product you use.

Model types

Bigram model

A model that predicts the next word based on one previous word. This is what you build in the fundamental lessons---each row of your grid represents what can follow a single word.

Synonyms: 2-gram model

See: Training, Generation

Trigram model

A model that uses two previous words for prediction, capturing more context than a bigram. The grid becomes three-dimensional (or you track word pairs instead of single words).

Synonyms: 3-gram model

See: More Context

N-gram model

The general term for models that predict based on the previous n-1 words. Bigrams are 2-grams, trigrams are 3-grams, and so on.

See: Training, More Context
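
The "previous n-1 words" idea generalises the bigram tally directly: the row label becomes a tuple of n-1 tokens instead of a single token. The sketch below assumes pre-split tokens; `train_ngram` is a hypothetical name.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n):
    """Count which token follows each context of n-1 tokens."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

tokens = "the cat sat on the mat and the cat ran".split()
bigram = train_ngram(tokens, 2)   # context is one word
trigram = train_ngram(tokens, 3)  # context is a pair of words
print(dict(bigram[("the",)]))
print(dict(trigram[("the", "cat")]))
```

Using word pairs as row labels is the "track word pairs instead of single words" option mentioned under Trigram model.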

Context window

How many previous tokens the model considers when making predictions. Bigrams have a context window of 1, trigrams have 2, and modern LLMs can consider hundreds of thousands or even millions of tokens.

See: More Context, In-context Memory

Sampling and generation

Weighted random sampling

Choosing the next token with probability proportional to its frequency. Your dice rolls implement this---words with higher counts are more likely to be selected.

See: Generation, Weighted Randomness
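
One way to mimic the dice roll in code is to imagine a die with one face per tally mark, roll it, and walk along the row until the roll lands inside a word's share. This is a sketch under that assumption (whole-number, positive counts); `weighted_sample` is a hypothetical name.

```python
import random

def weighted_sample(row_counts, rng):
    """Pick a word with probability proportional to its count."""
    total = sum(row_counts.values())
    roll = rng.randrange(total)  # a die with `total` faces, numbered from 0
    for word, count in row_counts.items():
        if roll < count:
            return word
        roll -= count

row = {"cat": 3, "mat": 1}
rng = random.Random(42)
draws = [weighted_sample(row, rng) for _ in range(1000)]
print(draws.count("cat"), draws.count("mat"))
```

Over many draws, "cat" should come up about three times as often as "mat", matching the 3:1 tallies.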

Decoding strategy

The procedure used to turn a model's probability distribution into actual output text. Greedy decoding picks the highest-probability word; sampling rolls weighted dice; beam search tracks several candidate paths at once.

Synonyms: sampling strategy

See: Generation, Sampling

Temperature

A parameter controlling randomness in generation. Dividing counts by temperature makes output more random (high temperature) or more predictable (low temperature).

See: Sampling
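
The following is a sketch of how temperature is usually applied in real LLMs, not necessarily the exact procedure used in the lessons. Real models divide the logits (scores before normalisation) by the temperature; for a row of counts, the equivalent is to divide the logarithm of each count, which is the same as raising each count to the power 1/temperature. The sketch works on log-counts because rescaling every raw count by the same constant does not, by itself, change the proportions. The name `apply_temperature` is an assumption for this example.

```python
import math

def apply_temperature(row_counts, temperature):
    """Divide log-counts by the temperature, then renormalise (a softmax).

    Assumes every count is positive, since log(0) is undefined.
    """
    scaled = {w: math.log(c) / temperature for w, c in row_counts.items()}
    total = sum(math.exp(s) for s in scaled.values())
    return {w: math.exp(s) / total for w, s in scaled.items()}

row = {"cat": 4, "mat": 1}
cold = apply_temperature(row, 0.5)     # sharper: favours "cat" even more
neutral = apply_temperature(row, 1.0)  # unchanged proportions (0.8 / 0.2)
hot = apply_temperature(row, 2.0)      # flatter: "mat" gets a better chance
print(cold["cat"], neutral["cat"], hot["cat"])
```

At temperature 0.5 the counts behave like 16 and 1; at temperature 2 they behave like 2 and 1.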

Greedy sampling

Always choosing the most likely next word (equivalent to temperature approaching zero). Produces predictable but often repetitive text.

Synonyms: greedy decoding

See: Sampling

Beam search

A generation strategy that tracks multiple possible sequences simultaneously, choosing the best overall path rather than committing to one word at a time.

Beam width

How many candidate paths to track during beam search. Beam width 1 is equivalent to greedy search; larger widths explore more possibilities.
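
A minimal beam search over a bigram table might look like this. The probabilities are made up so that the greedy first step ("b") leads to a worse overall path than the less likely first step ("c"). `beam_search` is a hypothetical helper, and dropping paths that cannot be extended is a simplification.

```python
import math

def beam_search(probs, start, steps, beam_width):
    """Keep the `beam_width` highest-scoring sequences after each step."""
    beams = [([start], 0.0)]  # (sequence, log-probability so far)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for word, p in probs.get(seq[-1], {}).items():
                candidates.append((seq + [word], score + math.log(p)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]

# Made-up next-word probabilities
probs = {
    "a": {"b": 0.6, "c": 0.4},
    "b": {"x": 0.5, "y": 0.5},
    "c": {"z": 1.0},
}
greedy = beam_search(probs, "a", 2, beam_width=1)
wide = beam_search(probs, "a", 2, beam_width=2)
print(greedy, wide)
```

With width 1 the search commits to "b" (probability 0.6) and ends on a path with overall probability 0.3; with width 2 it keeps "c" alive and finds the path through "z" with overall probability 0.4.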

Truncation strategy

A rule that limits which tokens are eligible for selection before sampling. Examples include top-k (only consider the k most likely tokens) and top-p/nucleus (only consider the most likely tokens, adding them in order until their cumulative probability reaches p).

See: Sampling

Top-k sampling

A truncation strategy that keeps only the k most likely next-word options before sampling. Setting k=1 is equivalent to greedy sampling; larger k allows more variety while still excluding very unlikely words.

See: Sampling
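
Top-k can be sketched as "sort the row, keep the first k". The row below is invented and `top_k` is a hypothetical name; sampling would then proceed as usual over whatever survives.

```python
def top_k(row_counts, k):
    """Keep only the k highest-count options (ties keep their original order)."""
    ranked = sorted(row_counts.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])

row = {"cat": 5, "mat": 3, "dog": 1, "hat": 1}
print(top_k(row, 2))
print(top_k(row, 1))  # a single option left, so sampling becomes greedy
```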

Top-p sampling

A truncation strategy that keeps just enough of the most likely options for their cumulative probability to reach a threshold p (e.g., 0.9). Unlike top-k, the number of options changes depending on how confident the model is about the next word.

Synonyms: nucleus sampling

See: Sampling
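
A possible sketch of top-p: walk down the options from most to least likely, keep adding them until the running total reaches p, then renormalise what was kept. The two example distributions are invented to show how the number of surviving options depends on the model's confidence; `top_p` is a hypothetical name.

```python
def top_p(probs, p):
    """Keep the most likely options until their cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    for word, prob in ranked:
        kept[word] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {word: prob / total for word, prob in kept.items()}  # renormalise

confident = {"cat": 0.9, "mat": 0.05, "dog": 0.05}
unsure = {"cat": 0.4, "mat": 0.3, "dog": 0.2, "hat": 0.1}
print(list(top_p(confident, 0.8)))
print(list(top_p(unsure, 0.8)))
```

With the same threshold, the confident distribution keeps one option while the unsure one keeps three.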

Understanding and meaning

Embedding

A numerical representation of a word. Each row in your bigram grid is that word's embedding vector---a fingerprint of its usage context. In real LLMs, embeddings are learned separately rather than derived from raw counts, but the principle is the same: words used in similar ways get similar vectors.

Synonyms: word vector, embedding vector

See: Word Embeddings

Similarity matrix

A grid showing how similar or different each pair of words is, calculated by comparing their embedding vectors. Words used in similar contexts have similar embeddings.

Synonyms: distance matrix, distance grid

See: Word Embeddings
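
As an illustration, the sketch below treats three grid rows as embedding vectors and compares every pair. Cosine similarity is used here as one common choice; it is an assumption for this example, and the lesson may use a different distance measure. The rows themselves are made up.

```python
import math

def cosine_similarity(u, v):
    """Return 1.0 for vectors pointing the same way, 0.0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical grid rows: counts of what follows each word,
# with columns in the order ["sat", "ran", "is", "the"].
embeddings = {
    "cat": [2, 1, 0, 0],
    "dog": [1, 2, 0, 0],
    "on":  [0, 0, 0, 3],
}
words = list(embeddings)
similarity = {
    a: {b: cosine_similarity(embeddings[a], embeddings[b]) for b in words}
    for a in words
}
print(similarity["cat"]["dog"], similarity["cat"]["on"])
```

"cat" and "dog" are followed by the same kinds of words, so they score close to 1; "cat" and "on" share no followers, so they score 0.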

Attention mechanism

The ability to focus on relevant previous words when making predictions. In real LLMs, attention is learned, weighted, and dynamic---the model decides what to focus on for each prediction. The in-context memory and induction-head lessons illustrate the motivation for attention: reusing more than just the immediately preceding word.

See: In-context Memory, Induction Heads

In-context learning

Picking up a pattern from the prompt and continuing it, with no change to the model's weights. The "learning" happens in the context the model is given, not in the model itself---which is why a few examples in a prompt can steer an LLM's output.

See: In-context Memory, Induction Heads

Induction head

A circuit found inside transformers that completes patterns by finding an earlier place where the current token appeared and copying what came next. Induction heads are a key mechanism behind in-context learning.

See: Induction Heads
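
The behaviour described above (find an earlier occurrence, copy what came next) can be caricatured in a few lines. This is only a sketch of the input-output behaviour: real induction heads are learned attention circuits, not an explicit search, and `induction_predict` is a hypothetical name.

```python
def induction_predict(context):
    """Find the most recent earlier occurrence of the last token
    and return the token that followed it, or None if there is none."""
    current = context[-1]
    for i in range(len(context) - 2, -1, -1):
        if context[i] == current:
            return context[i + 1]
    return None

context = "harry potter went home . later harry".split()
print(induction_predict(context))
```

Nothing about the model changes here; the prediction comes entirely from the context, which is the sense in which this supports in-context learning.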

Advanced concepts

Markov chain

A statistical model where the next state depends only on the current state. Andrey Markov introduced the idea in 1906 and, in 1913, applied it to sequences of letters in Pushkin's Eugene Onegin. A bigram language model is a simple Markov chain over words.

Neural network

A computational system loosely inspired by biological neurons that learns patterns from data by adjusting numerical weights. Modern LLMs are large neural networks; hand-built models are not. Both capture statistical patterns in text: you acquire them manually by counting, while a neural network acquires them automatically by adjusting its weights.

Transformer

The neural network architecture used by GPT, Claude, and other modern LLMs. It uses attention mechanisms to process all words in parallel rather than sequentially.

Parameters

The numbers stored in the model that encode learned patterns. Each cell in your grid (the tally count for one word pair) is a parameter. Modern models have billions of parameters.

See: Training

Fine-tuning

Additional training on specific text to adapt a model for a particular domain or task. Like adding more tallies to your grid from a new text source.

See: LoRA, Synthetic Data
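
In the grid analogy, fine-tuning is simply more tallying on top of existing counts. The sketch below assumes a bigram grid stored as nested counters; `add_tallies` and both passages are invented for illustration.

```python
from collections import Counter, defaultdict

def add_tallies(counts, tokens):
    """Add one tally per adjacent word pair to an existing grid."""
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts

# "Pre-training" on general text
model = add_tallies(defaultdict(Counter), "the cat sat . the cat ran .".split())
before = dict(model["the"])

# "Fine-tuning" on text from a new domain
add_tallies(model, "the patient sat . the patient slept .".split())
after = dict(model["the"])
print(before, after)
```

The original tallies are still there afterwards; the new text shifts the proportions in each row towards the new domain.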

Base model

The original trained model that an adaptation (such as a LoRA layer) is applied to. The base model stays unchanged; the adaptation provides the domain-specific shift.

See: LoRA

LoRA (Low-Rank Adaptation)

A technique for efficiently fine-tuning models by training a small "adaptation layer" rather than modifying all the original parameters.

See: LoRA
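
A toy sketch of the idea: the base weights stay frozen, and the adaptation is the product of two thin matrices that is added on top. The names `up`, `down`, and `apply_lora` and all the numbers are assumptions for this example; plain Python lists stand in for the tensors a real implementation would use.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def apply_lora(base, down, up, scale=1.0):
    """Return base + scale * (up @ down) without modifying base."""
    delta = matmul(up, down)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]

base = [[1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0]]     # 9 frozen parameters
up = [[1.0], [0.0], [2.0]]   # 3x1, trainable
down = [[0.5, 0.0, 1.0]]     # 1x3, trainable (rank 1: 6 parameters in total)
adapted = apply_lora(base, down, up)
print(adapted)
```

At this size the saving is small (6 trainable numbers instead of 9), but the number of adapter parameters grows with the side length of the matrix rather than its area, so the gap widens quickly for large models.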

Agentic AI

An approach to AI where language models act as agents: working in a loop, the model recognises when to call an external tool (like a calculator or search engine), uses the result, and continues, rather than only generating text directly.

Synonyms: agentic tool use

See: Agentic AI

Agent

A language model that runs tools in a loop to achieve a goal. Instead of generating text directly, an agent calls external tools and uses their results to continue.

Synonyms: AI agent

See: Agentic AI

Tool use

The mechanism by which a language model calls external tools (calculators, search engines, databases, code runners) during generation. Modern LLMs output structured tool calls; in the unplugged activity, sampling a "trigger word" plays the same role.

Synonyms: function calling

See: Agentic AI
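
The trigger-word idea can be sketched as follows. A fixed list of tokens stands in for the model's output, `CALC` is a made-up trigger word, and `calculator` is a toy tool; a real agent would feed the tool result back into the model and keep generating, which this example leaves out.

```python
def calculator(expression):
    """A toy tool: add two integers written like '2 + 3'."""
    left, _, right = expression.split()
    return str(int(left) + int(right))

TOOLS = {"CALC": calculator}  # trigger word -> tool

def run_agent(model_outputs):
    """Scan generated tokens; on a trigger word, call the tool with the
    next token as its argument and splice the result into the text."""
    transcript = []
    i = 0
    while i < len(model_outputs):
        token = model_outputs[i]
        if token in TOOLS:
            argument = model_outputs[i + 1]
            transcript.append(TOOLS[token](argument))
            i += 2
        else:
            transcript.append(token)
            i += 1
    return " ".join(transcript)

result = run_agent(["the", "total", "is", "CALC", "2 + 3", "."])
print(result)
```

The model never does the arithmetic itself; it only has to produce the trigger word at the right moment.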

Synthetic data

Training data generated by models rather than collected from humans. Can be used to augment training sets or create specialised datasets.

See: Synthetic Data

Model collapse

What happens when a model is trained on its own outputs (or outputs from similar models) over multiple generations. Common patterns get amplified, rare ones vanish, and the model converges towards a narrower, more repetitive style.

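Related: mode collapse (a distinct term, usually describing a generative model whose outputs cover only a narrow range of possibilities)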

See: Synthetic Data
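
A small simulation can show one ingredient of this effect: a word that fails to appear in one generation's output has no tallies in the next model, so it can never return. The sketch below retrains a bigram model on its own output several times and records the vocabulary size each time. The passage, the generation length of 12, and the helper names are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def train(tokens):
    """Tally adjacent word pairs."""
    counts = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start, length, rng):
    """Sample up to `length` tokens, stopping if a word has no followers."""
    out = [start]
    while len(out) < length and counts.get(out[-1]):
        row = counts[out[-1]]
        out.append(rng.choices(list(row), weights=list(row.values()), k=1)[0])
    return out

rng = random.Random(0)
corpus = "the cat sat on the mat and the dog sat on the rug and the cat saw the dog".split()
vocab_sizes = [len(set(corpus))]
for _ in range(5):
    corpus = generate(train(corpus), "the", 12, rng)  # retrain on own output
    vocab_sizes.append(len(set(corpus)))
print(vocab_sizes)
```

The vocabulary size can stay level or shrink from one generation to the next, but it cannot grow.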

Post-training and reasoning

Post-training

Everything done to a model after pre-training to make it more useful. Pre-training teaches the model how language works in general; post-training shapes its behaviour for specific purposes---following instructions, refusing harmful requests, or working carefully through problems. Supervised fine-tuning, RLHF, and RLVR are all post-training techniques.

Supervised fine-tuning (SFT)

Post-training on demonstration data---paired examples of (input, desired output). The model learns to imitate the demonstrations. SFT is the simplest form of post-training and often the first step before reinforcement-learning techniques like RLHF or RLVR are applied.

Synonyms: SFT

RLHF (Reinforcement Learning from Human Feedback)

A post-training technique where humans compare pairs of model outputs and their preferences train a reward model, which then guides the main model. Suits fuzzy objectives like "be helpful" or "sound natural" where no automated checker exists. Sibling of RLVR, which uses an automated checker instead of human preferences.

See: RLHF

Reward model

A separate model that learns to predict human preferences, then guides the main model during RLHF. Rather than asking humans to rate every output, the reward model rates outputs at scale, trained on a smaller set of human comparisons.

See: RLHF

Alignment

The process of shaping a model's outputs to match human values and preferences (typically helpfulness, harmlessness, and honesty). RLHF is one of the main techniques used to align modern LLMs.

See: RLHF

RLVR (Reinforcement Learning from Verifiable Rewards)

A post-training technique where the model generates an attempt, an automated checker verifies whether it's correct, and successful attempts get reinforced. Unlike RLHF, no human preference data is needed---just a verifier. Works best on domains with cheap, reliable checking (mathematics, code, formal logic). RLVR is the main technique behind modern reasoning models.

Verifier

A program or rule that checks whether a model's output is correct. Its verdict is the reward signal in RLVR. Examples include "does this code pass the test suite?", "does this answer match the known value?", and "is this output valid JSON?". Cheap, reliable verifiers are what make RLVR scalable.

Synonyms: checker
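
Taking the "is this output valid JSON?" example, a verifier and the reward it produces could be sketched like this. The 1.0/0.0 scoring and the names `is_valid_json` and `reward` are assumptions for illustration; real RLVR pipelines vary in how they turn a verdict into a training signal.

```python
import json

def is_valid_json(output):
    """Verifier: True if the output parses as JSON, False otherwise."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def reward(rollouts, verifier):
    """Score each rollout 1.0 if the verifier accepts it, else 0.0."""
    return [1.0 if verifier(r) else 0.0 for r in rollouts]

rollouts = ['{"answer": 42}', '{"answer": 42', 'forty-two']
scores = reward(rollouts, is_valid_json)
print(scores)
```

Only the first rollout passes, so it is the only one that would be reinforced.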

Rollout

A single generation from a model, sampled all the way through. During RL-based post-training the model produces many rollouts per problem; the reward signal then determines which ones to reinforce.

On-policy

Training where the data the model learns from is generated by the current model itself. RLVR is on-policy: the model produces rollouts, the rollouts get scored, the model updates, and the next batch of rollouts comes from the improved model. Contrast with off-policy training (such as SFT on a fixed dataset of human-written examples), where the training data doesn't change as the model improves.

Related: off-policy (the opposite property)

Reasoning model

A language model that's been trained to generate intermediate "thinking" steps before producing its final answer. OpenAI's o1 and o3, DeepSeek's R1, and Claude with extended thinking are reasoning models. The architecture is the same as any other LLM---what changes is the post-training (usually RLVR), which shapes the model to use extra tokens for working through problems.

Synonyms: thinking model

Chain of thought

Step-by-step reasoning a model generates before its final answer. Started as a prompting trick ("let's think step by step") and is now built into reasoning models through post-training. The intermediate steps function like rough working in a maths exercise---visible scratch work that helps the model reach a correct answer.

Synonyms: CoT

Thinking tokens

The tokens that make up a reasoning model's chain of thought. Mechanically they're produced the same way as any other tokens (same forward pass, same sampling), but they're often hidden from the user in product interfaces. Some products, such as Claude's extended thinking, show them by default.

Related: Chain of thought, Reasoning model

Test-time compute

Compute spent at generation time rather than during training. Reasoning models trade test-time compute for accuracy: letting the model generate more thinking tokens before answering tends to improve performance in a roughly predictable way. This is a scaling axis distinct from model size and training data, and is what makes reasoning models qualitatively different from earlier LLMs.

Synonyms: inference-time compute

Connections to your activities

Your activity	Real LLM equivalent
Tallying word pairs	Counting n-grams during training
Rolling dice for next word	Sampling from a probability distribution
Grid rows/columns	Weight matrices in neural networks
Adding context columns	Learning attention patterns
Calculating word distances	Computing embedding similarities
Dividing tallies by temperature	Applying temperature to logits
Keeping top beam paths	Beam search with specified beam width
Picking matching cutouts	Weighted random sampling
Training on new text	Fine-tuning on domain-specific data
Updating counts from votes	RLHF using a reward model
Sampling a trigger word	Tool use by an agent
Re-training on generated text	Model collapse from synthetic data

Key insights

Scale is the main difference: your small grid holds a handful of counts where modern LLMs hold billions of parameters, but the core concepts are identical.
Randomness creates variety: both your dice and modern LLMs use controlled randomness to avoid repetitive output.
Context improves prediction: more context (bigram → trigram → transformer) enables better text generation.
Embeddings capture meaning: words used similarly get similar vectors, whether hand-calculated or learned by neural networks.
Training is just counting: at its core, training means observing patterns in data---exactly what you did with tally marks.

The hands-on activities demonstrate the fundamental operations of language models. The main advances in modern AI come from doing these same operations at massive scale with learned (rather than hand-crafted) patterns.