#Introduction
#Icebreaker discussion questions
- Why is a language model called a “language model”? What does it mean to “model language”?
- In as much detail as you can, explain what happens after typing something into the ChatGPT “prompt box” to produce the answer you get back.
- What’s the best/clearest explanation you’ve ever heard of how Large Language Models (e.g. ChatGPT, Claude) actually work? What’s the weirdest explanation you’ve ever heard?
- When was the first language model ever created? How similar/different was it to modern LLMs?
- Activity: get everyone to stand up, then have them sit down if they’ve never used ChatGPT, Claude, or a similar LLM. Then ask if they’ve used one in the last month/week/day/hour/5 mins. At the end, everyone should be sitting down.
#Instructor notes
Don’t spend too long on the pre-discussion questions—the fun really starts when you get into actual activities (e.g. Training).
The core message of LLMs Unplugged is that a language model is a system that predicts what word comes next. Given some text, it answers the question: “What’s a likely next word?”
Modern LLMs like Claude or ChatGPT contain billions of parameters and run on specialised hardware. But the core mechanism is surprisingly simple. By building a tiny language model by hand—with pen, paper, and dice—you’ll understand the same fundamental process that powers these systems.
The difference is scale, not kind. Your hand-built model might learn from a few pages of text and have a vocabulary of dozens of words. Modern LLMs learned from trillions of words and have a vocabulary of tens of thousands. But both work the same way: count patterns during training, then use weighted random sampling during generation.
That’s it. Every time you see Claude, ChatGPT, or similar tools generate a response, they’re doing this one thing over and over: predicting the next word, adding it to the text, then predicting again.
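To make that loop concrete, here is a minimal sketch in Python (illustrative code, not part of the workshop materials): a bigram model that trains by counting word transitions, the programmatic equivalent of tally marks in a grid, and generates by weighted random sampling, the equivalent of the dice roll. The training text and function names are made up for the example.

```python
import random
from collections import defaultdict

def train(text):
    """Count word transitions: each increment is one tally mark in the grid."""
    counts = defaultdict(lambda: defaultdict(int))
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1
    return counts

def generate(counts, start, length=10):
    """Repeatedly predict the next word by weighted random sampling (the dice roll)."""
    word = start
    output = [word]
    for _ in range(length):
        followers = counts.get(word)
        if not followers:
            break  # no observed transitions from this word
        # Words with higher tallies are proportionally more likely to be picked.
        word = random.choices(list(followers), weights=followers.values())[0]
        output.append(word)
    return " ".join(output)

# Tiny illustrative corpus; the workshop activity uses a few pages of real text.
model = train("the cat sat on the mat and the cat saw the dog on the mat")
print(generate(model, "the"))
```

Note that the generation loop consults nothing except the counts gathered during training: predict the next word, append it, predict again, exactly as described above.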
#The structure of an LLMs Unplugged lesson
Each LLMs Unplugged lesson covers a single concept. There are some lessons which are “prerequisites” for others; see the Lessons page for an overview of how the lessons are organised.
Each lesson has the following structure:
- You will need: the physical materials you’ll need to complete the lesson
- Your goal: what you (i.e. the learner) will achieve by the end of the lesson
- Key idea: the central concept or principle of the lesson
- Algorithm: the step-by-step process for completing the lesson
- Example: a simple worked example to illustrate the algorithm
- Try it yourself: an interactive widget that shows the worked example on screen (not a substitute for the “unplugged” activity at the heart of the lesson, but useful for visualising how it works)
- and finally, Instructor notes, which include:
  - Discussion questions: questions to stimulate discussion and further learning during and after the lesson
  - Connection to current LLMs: some notes on how the activity relates to real Large Language Models like Claude, ChatGPT and Gemini (all figures correct as at December ’26, although things are moving fast and new models are being released all the time)
#Historical foundations
The n-gram language models participants build in these workshops have a lineage stretching back over a century. This isn’t new theory—it’s well-established mathematics applied by hand.
#Markov’s stochastic processes (1913)
Andrey Markov introduced the mathematics of what we now call “Markov chains” while analysing letter sequences in Pushkin’s Eugene Onegin. His work established that language has statistical structure you can quantify through counting patterns and calculating probabilities. Though Markov’s interest was purely mathematical, his framework for modelling sequences of dependent random variables became foundational to computational linguistics.
#Shannon’s information theory (1948–1951)
Claude Shannon built directly on Markov’s foundation, applying his new information theory to written English. Shannon used n-gram models to measure entropy and redundancy in language, connecting statistical patterns to fundamental limits on compression.
Crucially, Shannon was the first to systematically generate synthetic text using these models—starting with random letters (0-gram), then letter frequencies (1-gram), then letter pairs (2-gram), and progressively higher orders. This generative approach revealed how increasing context length produces increasingly realistic text, a finding that remains central to modern language models.
Here’s the thing: Shannon’s work was itself “unplugged”. He counted transitions by hand, calculated probabilities manually, and generated synthetic text using hand-drawn tables and selection based on frequencies. Modern LLMs use the same fundamental approach but at vastly greater scale and with learned rather than hand-crafted statistics.
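Shannon’s progression is easy to replicate. The sketch below (a loose illustration with a made-up corpus, not Shannon’s actual tables) generates text from character statistics at orders 0, 1, and 2; each added character of context makes the output visibly more English-like.

```python
import random
import string
from collections import Counter, defaultdict

# Made-up sample text; Shannon worked from printed English and hand-drawn tables.
corpus = (
    "the quick brown fox jumps over the lazy dog and the slow red fox "
    "watches the quick grey dog sleep under the old brown log"
)

def order0(n):
    """Zero-order: every letter (and the space) is equally likely."""
    alphabet = string.ascii_lowercase + " "
    return "".join(random.choice(alphabet) for _ in range(n))

def order1(n):
    """First-order: sample letters according to their overall frequency."""
    letters, weights = zip(*Counter(corpus).items())
    return "".join(random.choices(letters, weights=weights, k=n))

def order2(n):
    """Second-order: sample each letter conditioned on the previous letter."""
    pairs = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        pairs[a][b] += 1
    out = random.choice(corpus)
    while len(out) < n:
        followers = pairs[out[-1]]
        if not followers:
            break
        out += random.choices(list(followers), weights=followers.values())[0]
    return out

for approximation in (order0, order1, order2):
    print(approximation.__name__, "->", approximation(40))
```

Run it a few times: order 0 is pure noise, order 1 has plausible letter proportions, and order 2 starts producing pronounceable fragments, just as Shannon reported.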
#Connection to modern LLMs
The activities in LLMs Unplugged demonstrate the same operations used in current language models. The differences are mostly about scale:
- parameters: hand-built models have dozens to hundreds versus billions in modern LLMs, but the core concepts remain identical
- training: manual counting versus automated pattern detection, but both processes learn probability distributions from text
- generation: dice rolls versus GPU-accelerated sampling, but both use weighted randomness to select the next token
- context windows: bigrams and trigrams versus 128,000+ token windows, but longer context generally enables better prediction (see the sketch after this list)
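The context-window point is easy to see with a toy count (an assumed example sentence, not from the lessons): with one word of context the next word is a four-way guess, but with two words of context the options narrow.

```python
from collections import Counter

words = "the cat sat on the mat the dog sat on the log".split()

# Bigram view: one word of context.
bigram = Counter(zip(words, words[1:]))
print(sorted(b for (a, b) in bigram if a == "the"))
# ['cat', 'dog', 'log', 'mat']: four equally likely continuations of "the".

# Trigram view: two words of context.
trigram = Counter(zip(words, words[1:], words[2:]))
print(sorted(c for (a, b, c) in trigram if (a, b) == ("on", "the")))
# ['log', 'mat']: the extra word of context halves the options.
```

The same trade-off Shannon saw with letters applies to words: more context sharpens predictions, at the cost of needing far more text to count reliably.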
Modern advances come from doing these same operations at massive scale with neural networks that learn patterns automatically. But the fundamental insight—that language structure can be captured through statistical dependencies and revealed through synthetic generation—comes directly from Shannon’s mid-twentieth-century work and the unplugged methods he used to explore these ideas.
Which is to say: when you’re rolling dice and generating sentences in an LLMs Unplugged workshop, you’re not just learning about modern AI. You’re also participating in a tradition of hands-on exploration that goes back to the origins of information theory itself.