
Sycophancy

Key idea: Training data composition shapes a model's "personality": add enough flattery to the training set and the model becomes sycophantic, no RLHF required.

Choose your method: This lesson can be done with either a grid (paper and dice) or cutouts (physical tokens). Pick whichever suits your materials. Where the two methods differ, the grid version is given first and the cutouts version second.

Demonstrate how adding repetitive sycophantic phrases to your training data steers the model toward over-agreeable, flattering output, with no RLHF (Reinforcement Learning from Human Feedback) required. RLHF is a post-training technique in which humans compare pairs of model outputs and their preferences train a reward model, which then guides the main model. It suits fuzzy objectives like "be helpful" or "sound natural" where no automated checker exists; its sibling, RLVR, uses an automated checker instead of human preferences.


You will need

For the grid method:

  • your completed grid model from Training
  • a d10 (or similar), plus pen and paper, as per Generation
  • the sycophancy phrases listed below, to tally into your grid

For the cutouts method:

  • a trained bigram cutouts spread from Training
  • additional cutouts encoding the sycophancy phrases listed below
  • pen and paper for jotting down the generated text

The CLI can generate a printable sheet of sycophancy cutouts from data/sycophancy.txt—run llms_unplugged cutouts -i data/sycophancy.txt -n 2 and add the resulting PDF to your printing batch.

Your goal

Grid: Tally the sycophancy phrases into your existing grid, regenerate text from the same starting word as before, and observe how the output drifts toward agreement and flattery.

Cutouts: Add sycophancy cutouts to your existing spread, regenerate text from the same starting word as before, and observe how the output drifts toward agreement and flattery.

Key idea

Sycophancy in real LLMs comes from two main sources: reward hacking under RLHF (Reinforcement Learning from Human Feedback), where human raters tend to prefer agreeable answers, and biased training data (the internet is full of flattery). This activity demonstrates the second source directly: when you tip the training data toward sycophantic phrases, the model’s generated text starts mirroring them.

The sycophancy phrases

These are the phrases you’ll fold into your model. Each is shown as the sequence of tokens the model sees—lowercased, with punctuation and contracted endings ('re, 's) treated as their own tokens, exactly as in Training:

  • you 're absolutely right .
  • that 's a great insight .
  • what a thoughtful question .
  • i completely agree .
  • you make an excellent point .
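The tokenisation described above can be sketched in a few lines of Python. This is only an approximation under stated assumptions: the function name `tokenize` and the regular expression are illustrative and are not taken from the lesson's own tooling.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase, then split into words, contracted endings and punctuation.

    Hypothetical helper: an approximation of the Training rules, not the
    lesson's own tokenizer.
    """
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    # Order matters: try a contracted ending ('re, 's) first, then a plain
    # word, then any other single non-space character as its own token.
    return re.findall(r"'[a-z]+|[a-z]+|[^\sa-z']", text)

print(tokenize("You're absolutely right."))
# ['you', "'re", 'absolutely', 'right', '.']
```

Running the other four phrases through the same function gives the token sequences in the list above.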

Repeat each phrase several times so its word pairs build up strong counts—the heavier the weighting, the more often generation lands on them.

Tallied into a grid, the first phrase fills one cell per consecutive (row → column) pair:

Token        you   're   absolutely   right   .
you                |
're                      |
absolutely                            |
right                                         |
.

As a spread, the first phrase is one cutout per consecutive (previous → next) pair:

Previous word Next words
you 're
're absolutely
absolutely right
right .
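Whether you use a grid or a spread, the bookkeeping is the same: one count per (previous, next) pair. A minimal sketch in Python, assuming each phrase is repeated three times (an illustrative number; the lesson only says "several"):

```python
from collections import Counter

phrase = ["you", "'re", "absolutely", "right", "."]

# Pair each token with the one that follows it: (previous, next).
pairs = list(zip(phrase, phrase[1:]))

# Tallying the phrase three times adds three marks to each of its cells
# (or puts three copies of each cutout on the table).
repeats = 3  # illustrative choice, not prescribed by the lesson
counts = Counter()
for _ in range(repeats):
    counts.update(pairs)

for (prev, nxt), n in counts.items():
    print(f"{prev} -> {nxt}: {n}")
# you -> 're: 3
# 're -> absolutely: 3
# absolutely -> right: 3
# right -> .: 3
```

The final "." has no pair of its own within the phrase, which is why its row in the grid stays empty.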

Algorithm

For the grid method:

  1. Train a baseline model as per Training: fill in your grid.
  2. Generate a baseline sentence as per Generation and write it down. This is your “before”.
  3. Add the sycophancy phrases to your grid. Take each phrase from the list above and tally its word pairs into your existing grid, following the standard Training procedure and adding a new row and column for any word you haven’t seen yet. Repeat each phrase several times.
  4. Generate again from the same starting word.
  5. Compare: how often does the new output land on sycophantic phrases? Does it sound like a different “voice”?

For the cutouts method:

  1. Train a baseline model as per Training: spread the cutouts on the table.
  2. Generate a baseline sentence as per Generation and write it down. This is your “before”.
  3. Add sycophancy cutouts to the spread. The pre-made cutouts encode the phrases from the list above: patterns like you → 're, 're → absolutely, absolutely → right, right → ., plus the “great insight”, “thoughtful question”, “completely agree”, and “excellent point” variants.
  4. Generate again from the same starting word.
  5. Compare: how often does the new output land on sycophantic phrases? Does it sound like a different “voice”?
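The five steps can also be simulated in Python as a rough check on what to expect at the table. This is a sketch under several assumptions, none of which come from the lesson itself: the helper names `train` and `generate` are hypothetical, a weighted random choice stands in for the dice roll from Generation, the phrases are tallied as one continuous stream (so the "." ending one phrase leads into the next), and each phrase is repeated three times.

```python
import random
from collections import Counter, defaultdict

def train(tokens, grid=None):
    """Tally each (previous -> next) pair into a nested-counter grid."""
    grid = grid if grid is not None else defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        grid[prev][nxt] += 1
    return grid

def generate(grid, start, length, rng):
    """Walk the grid, picking each next token in proportion to its tally."""
    out = [start]
    for _ in range(length - 1):
        row = grid.get(out[-1])
        if not row:  # dead end: this token was never followed by anything
            break
        words, weights = zip(*row.items())
        out.append(rng.choices(words, weights=weights)[0])
    return out

baseline_text = "i am sam . sam i am .".split()
phrases = [
    "you 're absolutely right .",
    "that 's a great insight .",
    "what a thoughtful question .",
    "i completely agree .",
    "you make an excellent point .",
]

rng = random.Random(0)  # fixed seed so a rerun gives the same text

grid = train(baseline_text)                 # step 1: baseline model
before = generate(grid, "i", 12, rng)       # step 2: the "before" sentence

stream = " ".join(phrases * 3).split()      # step 3: each phrase, three times
train(stream, grid)
after = generate(grid, "i", 12, rng)        # step 4: same starting word

# Step 5: what share of the new output uses tokens the baseline never had?
new_words = set(stream) - set(baseline_text)
share = sum(w in new_words for w in after) / len(after)

print("before:", " ".join(before))
print("after: ", " ".join(after))
print(f"share of tokens new to the model: {share:.0%}")
```

Changing the seed or the repeat count is a quick way to see that the drift is a matter of probability: some runs stay close to the baseline, others land on one flattering phrase after another.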

Example

A baseline model trained on “I am Sam. Sam I am.” generates something like:

“i am sam . sam i am .”

After folding in the sycophancy phrases, the same starting word might generate:

“i completely agree . that 's a great insight .”

The model didn’t change its mechanism—it just contains more paths (extra cutouts, or heavier grid tallies) that route toward sycophantic tokens.
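To put numbers on "more paths", consider only the transitions leaving the token i. The sketch below assumes the phrase "i completely agree ." has been tallied three times, an illustrative repeat count rather than one fixed by the lesson:

```python
from collections import Counter

# Tallies leaving the token "i" in the baseline "i am sam . sam i am ."
row_i = Counter({"am": 2})
print(row_i["am"] / sum(row_i.values()))  # 1.0: "i" is always followed by "am"

# Fold in "i completely agree ." three times (assumed repeat count).
row_i["completely"] += 3
total = sum(row_i.values())
for word, n in row_i.items():
    print(f"i -> {word}: {n}/{total} = {n / total:.2f}")
# i -> am: 2/5 = 0.40
# i -> completely: 3/5 = 0.60
```

Under these assumptions, a sentence starting from i now heads toward agreement more often than not, even though the generation procedure is unchanged.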

Instructor notes

Designing your own sycophancy phrases

The phrases above integrate cleanly because their previous words—i, that, you, and .—are high-frequency tokens already present in most models, so the new material hooks into the existing vocabulary during generation. If you write your own, follow the same recipe: start each phrase with a common word, end it with ., and repeat it several times so the new transitions carry strong weight.
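If students draft their own phrases, a small check like the following can flag ones that are unlikely to hook into the model. It is a sketch with a hypothetical helper, `check_phrase`, and it approximates "starts with a common word" as "starts with a token already in the model's vocabulary", which is an assumption rather than a rule from the lesson.

```python
def check_phrase(tokens, vocabulary):
    """Return a list of problems with a candidate phrase (empty if none)."""
    if not tokens:
        return ["is empty"]
    problems = []
    if tokens[0] not in vocabulary:
        # Generation can only reach the phrase through a token it already knows.
        problems.append(f"starts with {tokens[0]!r}, which the model has never seen")
    if tokens[-1] != ".":
        problems.append("does not end with '.'")
    return problems

vocabulary = set("i am sam . sam i am .".split())

print(check_phrase("i could not agree more .".split(), vocabulary))
# []
print(check_phrase("splendid idea".split(), vocabulary))
# ["starts with 'splendid', which the model has never seen", "does not end with '.'"]
```

A phrase that passes both checks still needs to be repeated several times before its transitions carry much weight.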

Discussion questions

  • did the output always become sycophantic, or only sometimes?
  • what would you have to add to the training data to get the opposite effect (a contrarian model)?
  • in real LLMs, why is sycophancy specifically a hard problem to detect from the outside?
  • if you only saw the model’s output (not its training data or weights), how would you tell sycophancy from genuine helpfulness?
  • is sycophancy always a bug? when might agreeable behaviour be desirable?

Connection to current LLMs

Real LLMs become sycophantic through two mechanisms:

  • RLHF (Reinforcement Learning from Human Feedback) reward hacking: human raters often prefer agreeable, flattering answers, so the model learns to over-produce them. Sycophancy research from Anthropic and others has shown this pattern across multiple frontier models.
  • Pre-training data biases: a lot of internet text contains sycophantic patterns—customer-service replies, social-media validation, and so on—which the model picks up during pre-training before any RLHF.

Your activity simulates the second mechanism. The first is harder to demo without running RLHF on top.

The deeper point: a model’s “personality” is a property of its training data and tuning, not an intrinsic feature of language modelling. Change the data, change the personality.