Word Embeddings

Transform words into numerical vectors that capture meaning, revealing semantic relationships between words in your model.

You will need

  • your completed bigram grid (context columns optional but helpful)
  • another blank grid with the same headers (for distances)
  • pen, paper, and dice as per Generation (grid method)

Your goal

Create a similarity matrix (another square grid) that shows how similar or different each pair of words is. Stretch goal: visualise the matrix (e.g., as a map or clustering).

Key idea

Each row of counts is an embedding—a numeric fingerprint of context. Comparing rows tells you which words behave alike.

Algorithm

  1. Prepare two grids: the original bigram model and a new empty distance grid with the same row/column headers.
  2. For every pair of rows in the bigram model, sum the absolute differences between matching cells.
  3. Write that sum into the corresponding cell of the distance grid (diagonal stays 0). You can skip the lower triangle since the distance is symmetric.
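Steps 2 and 3 amount to computing the Manhattan (L1) distance between rows. A minimal Python sketch, assuming the bigram grid is stored as a dict of rows; the names `l1_distance`, `distance_grid`, and `bigram_counts` are just illustrative:

```python
from itertools import combinations

def l1_distance(row_a, row_b, vocab):
    """Sum of absolute differences between matching cells (blank cells count as 0)."""
    return sum(abs(row_a.get(w, 0) - row_b.get(w, 0)) for w in vocab)

def distance_grid(bigram_counts):
    """Turn a bigram count grid into a symmetric distance grid.

    bigram_counts maps each word to its row of next-word counts,
    e.g. {"see": {"spot": 1}, "spot": {".": 1, "runs": 1}, ...}.
    """
    vocab = sorted(bigram_counts)
    grid = {w: {w: 0} for w in vocab}        # diagonal stays 0
    for a, b in combinations(vocab, 2):      # upper triangle only; distance is symmetric
        d = l1_distance(bigram_counts[a], bigram_counts[b], vocab)
        grid[a][b] = grid[b][a] = d
    return grid
```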

Example

Text: See Spot. Spot runs.

  1. Build the bigram grid as usual.
  2. Compare the see and spot rows: subtract the counts cell by cell, take absolute values, and add them up (blanks count as 0). Here, d(see, spot) = 3.
  3. Fill that value into the distance grid at (see, spot). Repeat for other pairs.
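Continuing the sketch above, the same calculation can be checked in code. Assumptions: the text is lowercased and the full stop is kept as its own token, matching the earlier bigram activity, and no context columns are included.

```python
tokens = ["see", "spot", ".", "spot", "runs", "."]    # See Spot. Spot runs.

# Build the bigram grid: rows are current words, columns are next words.
bigram_counts = {}
for cur, nxt in zip(tokens, tokens[1:]):
    bigram_counts.setdefault(cur, {})
    bigram_counts.setdefault(nxt, {})                 # every word gets a row, even if empty
    bigram_counts[cur][nxt] = bigram_counts[cur].get(nxt, 0) + 1

distances = distance_grid(bigram_counts)
print(distances["see"]["spot"])   # 3
print(distances["see"]["."])      # 0 -- both see and . are followed only by spot
```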

You’ll find see and . can end up very similar (distance 0), while see and spot differ more, revealing structure in your corpus.
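For the stretch goal, one option is to colour the distance grid as a heat map. A sketch assuming matplotlib is available and `distances` comes from the worked example above:

```python
import matplotlib.pyplot as plt

vocab = sorted(distances)
matrix = [[distances[a][b] for b in vocab] for a in vocab]

fig, ax = plt.subplots()
im = ax.imshow(matrix)               # each cell's colour encodes one pair's distance
ax.set_xticks(range(len(vocab)))
ax.set_xticklabels(vocab)
ax.set_yticks(range(len(vocab)))
ax.set_yticklabels(vocab)
fig.colorbar(im, ax=ax).set_label("L1 distance")
plt.show()
```

Words that behave alike show up as near-zero cells off the diagonal.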

Instructor notes

Discussion questions

  • which words cluster together? why?
  • do grammatically similar words have similar embeddings?
  • can you predict which words will be close before calculating?
  • how do context columns affect word similarity?
  • what information is captured in these vectors?

Connection to current LLMs

Word embeddings revolutionised NLP by turning words into numbers that computers can process:

  • dimensions: your hand-built vectors have only a handful of dimensions (e.g. 8, one per grid column) → modern models use hundreds or thousands of dimensions
  • learning: you used occurrence patterns → modern models learn from billions of contexts
  • semantic capture: state-of-the-art embeddings encode meaning so well that “king − man + woman ≈ queen” actually works
  • foundation: every modern language model starts by converting words to embeddings

The insight: words with similar meanings appear in similar contexts, so their usage patterns (and thus embeddings) are similar. Your hand-calculated vectors demonstrate this principle: cat and dog would have similar embeddings because they both follow the and precede ran or sat. This discovery enabled computers to “understand” that words have relationships and meanings beyond just their spelling.
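The cat and dog claim can be checked with the same method, reusing `distance_grid` from the algorithm sketch above. The miniature corpus below is hypothetical, built so that both animals share contexts:

```python
tokens = ["the", "cat", "sat", ".", "the", "dog", "sat", ".",
          "the", "cat", "ran", ".", "the", "dog", "ran", "."]

bigram_counts = {}
for cur, nxt in zip(tokens, tokens[1:]):
    bigram_counts.setdefault(cur, {})
    bigram_counts.setdefault(nxt, {})
    bigram_counts[cur][nxt] = bigram_counts[cur].get(nxt, 0) + 1

distances = distance_grid(bigram_counts)
print(distances["cat"]["dog"])    # 0 -- identical next-word behaviour
print(distances["cat"]["the"])    # 6 -- "the" is used very differently
```

Note that this grid only records what comes after each word, so it captures the “precede ran or sat” part of the pattern; the “follow the” part would need context columns like those mentioned in the materials list.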

Note on the activity: while the lesson focuses on calculating distances between embeddings (the similarity matrix), this is pedagogically deliberate. Embeddings themselves are just rows of numbers, but distances reveal the relationships between words—which is what makes embeddings useful in practice. The activity emphasises the practical application of embeddings rather than just their construction.