Glossary

This glossary connects the hands-on activities in LLMs Unplugged with the technical terms used in modern language models. Each entry provides a plain language explanation and links to relevant lessons.

Core concepts

Token

A single unit of text that the model works with. In our activities, each word and each punctuation mark (such as “.” or “,”) is a token. Modern LLMs use subword tokens that can be parts of words.

Synonyms: word (in introductory contexts)

Style note: the lessons use “word” initially to keep things accessible, then transition to “token” once the concept is established. Both terms refer to the same thing in our activities.

See: Training

Vocabulary

All the unique tokens your model knows. The words across the top and side of your grid (or the bucket labels) form your vocabulary.

See: Training

Language model

A system that predicts what text comes next based on patterns learned from training data. Your hand-built grid or bucket collection is a language model.

See: Training, Generation

Training

The process of building a model by counting patterns in text. When you tally word transitions or fill buckets with tokens, you’re training your model.

Synonyms: learning

See: Training
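
If you want to see the tallying step as code, here is a minimal Python sketch (the function name and the toy sentence are illustrative, not taken from the lessons):

```python
from collections import defaultdict

def train_bigram(text):
    """Count how often each word follows each other word: the tally grid."""
    counts = defaultdict(lambda: defaultdict(int))
    tokens = text.split()  # one token per word, as in the activities
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1  # one tally in row `prev`, column `nxt`
    return counts

counts = train_bigram("the cat sat on the mat the cat slept")
print(dict(counts["the"]))  # {'cat': 2, 'mat': 1}
```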

Generation

Using a trained model to produce new text by repeatedly predicting and selecting the next token.

Synonyms: inference (in broader AI/ML contexts)

See: Generation
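
A sketch of that loop in Python, assuming a bigram count table like the grid you build by hand (the table, start word, and function name are illustrative):

```python
import random

# A tiny hand-made bigram table: row word -> {next word: tally count}
counts = {
    "the": {"cat": 2, "mat": 1},
    "cat": {"sat": 1, "slept": 1},
    "sat": {"on": 1},
    "on": {"the": 1},
    "mat": {"the": 1},
    "slept": {},
}

def generate(start, length=6):
    word, output = start, [start]
    for _ in range(length):
        row = counts.get(word, {})
        if not row:              # no known continuation: stop early
            break
        # pick the next word with probability proportional to its tally
        word = random.choices(list(row), weights=list(row.values()))[0]
        output.append(word)
    return " ".join(output)

print(generate("the"))
```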

Probability distribution

A set of options with associated likelihoods. In your model, the counts in a row (or tokens in a bucket) form a probability distribution over possible next words.

See: Generation, Sampling
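
For example, a row of tallies becomes a probability distribution by dividing each count by the row total (the row here is illustrative):

```python
row = {"cat": 2, "mat": 1}                  # tallies in one row of the grid
total = sum(row.values())
probs = {word: count / total for word, count in row.items()}
print(probs)                                # cat: 2/3, mat: 1/3
```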

Model types

Bigram model

A model that predicts the next word based on one previous word. This is what you build in the fundamental lessons—each row of your grid represents what can follow a single word.

Synonyms: 2-gram model

See: Training, Generation

Trigram model

A model that uses two previous words for prediction, capturing more context than a bigram. The grid becomes three-dimensional (or you track word pairs instead of single words).

Synonyms: 3-gram model

See: Trigram
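
The same counting idea works for trigrams if each row is keyed by the previous pair of words rather than a single word; a minimal illustrative sketch:

```python
from collections import defaultdict

def train_trigram(text):
    """Count which word follows each pair of words."""
    counts = defaultdict(lambda: defaultdict(int))
    tokens = text.split()
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1  # the "row" is now a word pair
    return counts

counts = train_trigram("the cat sat on the mat the cat slept")
print(dict(counts[("the", "cat")]))  # {'sat': 1, 'slept': 1}
```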

N-gram model

The general term for models that predict based on the previous n-1 words. Bigrams are 2-grams, trigrams are 3-grams, and so on.

See: Training, Trigram

Context window

How many previous tokens the model considers when making predictions. Bigrams have a context window of 1, trigrams have 2, and modern models like GPT-4 can consider 128,000+ tokens.

See: Trigram, Context Columns

Training variants

Grid variant

The matrix-based approach where you draw a grid with words as row and column headers, then add tally marks to track which words follow which.

See: Training

Bucket variant

The physical container approach where each bucket is labelled with a word and contains paper tokens representing words that followed it in the training text.

See: Training

Matrix

A grid or table showing relationships between tokens. Your hand-drawn grids are matrices tracking which words follow other words. Each row can also be interpreted as an embedding vector.

See: Training, Word Embeddings

Sampling and generation

Weighted random sampling

Choosing the next token with probability proportional to its frequency. Your dice rolls implement this—words with higher counts are more likely to be selected.

See: Generation
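
One way to see the dice roll as code: draw a random number up to the row's total tally, then walk through the cumulative counts until you pass it (a hand-rolled weighted pick; the example row is illustrative):

```python
import random

def weighted_pick(row):
    """Pick a word with probability proportional to its count, dice-roll style."""
    roll = random.randint(1, sum(row.values()))  # a die with one face per tally
    running = 0
    for word, count in row.items():
        running += count
        if roll <= running:
            return word

print(weighted_pick({"cat": 2, "mat": 1}))  # "cat" about twice as often as "mat"
```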

Temperature

A parameter controlling randomness in generation. Dividing counts by temperature makes output more random (high temperature) or more predictable (low temperature).

See: Sampling
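
The activities rescale the tally counts directly; real models divide their logits (here, log-counts) by the temperature before normalising, which for raw counts is the same as raising each count to the power 1/temperature. A minimal sketch of that logit formulation (the row is illustrative):

```python
def apply_temperature(row, temperature):
    """Rescale a row of counts: high temperature flattens, low sharpens."""
    # Raising counts to 1/temperature is equivalent to dividing the
    # log-counts (the "logits") by the temperature before normalising.
    scaled = {word: count ** (1 / temperature) for word, count in row.items()}
    total = sum(scaled.values())
    return {word: value / total for word, value in scaled.items()}

row = {"cat": 2, "mat": 1}
print(apply_temperature(row, 1.0))  # unchanged: cat ~0.67, mat ~0.33
print(apply_temperature(row, 0.5))  # sharper:   cat  0.80, mat  0.20
print(apply_temperature(row, 5.0))  # flatter:   cat ~0.53, mat ~0.47
```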

Greedy sampling

Always choosing the most likely next word (equivalent to temperature approaching zero). Produces predictable but often repetitive text.

Synonyms: greedy decoding

See: Sampling
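
In code, the greedy choice for a row of counts is simply the maximum (illustrative row):

```python
row = {"cat": 2, "mat": 1}     # tallies for the current word
print(max(row, key=row.get))   # always picks "cat", the biggest count
```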

Beam search

A generation strategy that tracks multiple possible sequences simultaneously, choosing the best overall path rather than committing to one word at a time.

See: Beam Search

Beam width

How many candidate paths to track during beam search. Beam width 1 is equivalent to greedy sampling; larger widths explore more possibilities.

See: Beam Search
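
A minimal sketch of beam search over an illustrative bigram count table, scoring each path by its summed log-probability and keeping only beam_width paths at each step:

```python
import math

# Illustrative bigram table: word -> {next word: tally count}
counts = {
    "the": {"cat": 2, "mat": 1},
    "cat": {"sat": 1, "slept": 1},
    "sat": {"on": 1},
    "on": {"the": 1},
    "mat": {"the": 1},
}

def beam_search(start, length=4, beam_width=2):
    """Track the beam_width best-scoring sequences at every step."""
    beams = [([start], 0.0)]  # (sequence, summed log-probability)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            row = counts.get(seq[-1], {})
            if not row:                      # dead end: keep the path as is
                candidates.append((seq, score))
                continue
            total = sum(row.values())
            for word, count in row.items():
                candidates.append((seq + [word], score + math.log(count / total)))
        # keep only the beam_width most probable paths
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return [" ".join(seq) for seq, _ in beams]

print(beam_search("the", beam_width=2))
```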

Truncation strategy

A rule that limits which tokens are eligible for selection before sampling. Examples include top-k (only consider the k most likely) and top-p/nucleus (only consider tokens until cumulative probability reaches p).

See: Sampling
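
Minimal sketches of both rules applied to a row of probabilities (the row and the k and p values are illustrative):

```python
def top_k(probs, k):
    """Keep only the k most likely tokens, then renormalise."""
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {word: p / total for word, p in kept.items()}

def top_p(probs, p):
    """Keep the most likely tokens until their cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for word, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[word] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {word: prob / total for word, prob in kept.items()}

probs = {"cat": 0.5, "mat": 0.3, "dog": 0.15, "sat": 0.05}
print(top_k(probs, 2))    # cat and mat only, renormalised
print(top_p(probs, 0.8))  # cat and mat (0.5 + 0.3 reaches 0.8)
```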

Understanding and meaning

Embedding

A numerical representation of a word. Each row in your bigram grid is that word’s embedding vector—a fingerprint of its usage context.

Synonyms: word vector, embedding vector

See: Word Embeddings

Similarity matrix

A grid showing how similar or different each pair of words is, calculated by comparing their embedding vectors. Words used in similar contexts have similar embeddings.

Synonyms: distance matrix, distance grid

See: Word Embeddings
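
A sketch of one common comparison, cosine similarity between two grid rows treated as vectors over the same vocabulary (the rows and vocabulary are illustrative; the lessons may use a simpler distance measure):

```python
import math

def cosine_similarity(row_a, row_b, vocabulary):
    """Compare two grid rows as vectors over the same vocabulary."""
    a = [row_a.get(word, 0) for word in vocabulary]
    b = [row_b.get(word, 0) for word in vocabulary]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vocabulary = ["cat", "dog", "sat", "slept", "mat"]
cat_row = {"sat": 3, "slept": 2}   # what follows "cat"
dog_row = {"sat": 2, "slept": 3}   # what follows "dog"
mat_row = {"cat": 1}               # what follows "mat"
print(cosine_similarity(cat_row, dog_row, vocabulary))  # high: used similarly
print(cosine_similarity(cat_row, mat_row, vocabulary))  # 0.0: nothing in common
```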

Attention mechanism

The ability to focus on relevant previous words when making predictions. Context columns are a manual form of attention, letting you consider grammatical categories rather than just the immediately preceding word.

See: Context Columns

Advanced concepts

Fine-tuning

Additional training on specific text to adapt a model for a particular domain or task. Like adding more tallies to your grid from a new text source.

See: LoRA, Synthetic Data
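
In the grid picture, fine-tuning is just more counting on top of the existing tallies; a minimal sketch (the helper name and texts are illustrative):

```python
from collections import defaultdict

def add_tallies(counts, text):
    """Add transition counts from new text on top of existing tallies."""
    tokens = text.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

counts = defaultdict(lambda: defaultdict(int))
add_tallies(counts, "the cat sat on the mat")         # original training text
add_tallies(counts, "the patient sat in the clinic")  # fine-tune on domain text
print(dict(counts["the"]))  # {'cat': 1, 'mat': 1, 'patient': 1, 'clinic': 1}
```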

LoRA (Low-Rank Adaptation)

A technique for efficiently fine-tuning models by training a small “adaptation layer” rather than modifying all the original parameters.

See: LoRA

RLHF (Reinforcement Learning from Human Feedback)

A training technique where human preferences guide the model to produce more helpful, harmless, and honest outputs.

See: RLHF

Tool use

The ability of language models to recognise when to call external tools (like calculators or search engines) rather than generating text directly.

See: Tool Use

Synthetic data

Training data generated by models rather than collected from humans. Can be used to augment training sets or create specialised datasets.

See: Synthetic Data

Hallucination

When models generate plausible-sounding but false information. This happens because models learn patterns, not facts—they predict what text looks like rather than what is true.

Parameters

The numbers stored in the model that encode learned patterns. Each cell in your grid (the total of its tally marks) is a parameter. Modern models have billions of parameters.

See: Training

Transformer

The neural network architecture used by GPT, Claude, and other modern LLMs. It uses attention mechanisms to process all words in parallel rather than sequentially.

Connections to your activities

Your activity → Real LLM equivalent
Tallying word pairs → Counting n-grams during training
Rolling dice for next word → Sampling from probability distribution
Grid rows/columns → Weight matrices in neural networks
Adding context columns → Learning attention patterns
Calculating word distances → Computing embedding similarities
Dividing tallies by temperature → Applying temperature to logits
Keeping top beam paths → Beam search with specified beam width
Picking from buckets → Weighted random sampling
Training on new text → Fine-tuning on domain-specific data

Key insights

  1. Scale is the main difference: your small grid versus billions of parameters; the core concepts are identical.

  2. Randomness creates variety: both your dice and ChatGPT use controlled randomness to avoid repetitive output.

  3. Context improves prediction: more context (bigram → trigram → transformer) enables better text generation.

  4. Embeddings capture meaning: words used similarly get similar vectors, whether hand-calculated or learned by neural networks.

  5. Training is just counting: at its core, training means observing patterns in data—exactly what you did with tally marks.

The hands-on activities demonstrate the fundamental operations of language models. The main advances in modern AI come from doing these same operations at massive scale with learned (rather than hand-crafted) patterns.