Sycophancy
Key idea: Training data composition shapes a model's "personality". Add enough flattery to the training set and the model becomes sycophantic, no RLHF required.
Demonstrate how adding repetitive sycophantic phrases to your training data steers the model toward over-agreeable, flattering output, with no RLHF (Reinforcement Learning from Human Feedback) required.

You will need
- a trained spread of bigram cutouts from Training
- additional cutouts encoding sycophantic phrases
- pen and paper for jotting down the generated text
The CLI can generate a printable sheet of sycophancy cutouts from data/sycophancy.txt. Run
llms_unplugged cutouts -i data/sycophancy.txt -n 2
and add the resulting PDF to your printing batch.
Your goal
Add sycophancy cutouts to your existing spread, regenerate text from the same starting word as before, and observe how the output drifts toward agreement and flattery.
Key idea
Sycophancy in real LLMs comes from two main sources: RLHF (Reinforcement Learning from Human Feedback) reward hacking (human raters tend to prefer agreeable answers) and biased training data (the internet is full of flattery). This activity demonstrates the second source directly: when you tip the training data toward sycophantic phrases, the model’s generated text starts mirroring them.
Algorithm
- Train a baseline model as per Training—spread the cutouts on the table.
- Generate a baseline sentence as per Generation—write it down. This is your “before”.
- Add sycophancy cutouts to the spread. Pre-made cutouts encode patterns like you → 're, 're → absolutely, absolutely → right, right → ., plus variants for “great insight”, “thoughtful question”, and “completely agree”.
- Generate again from the same starting word.
- Compare: how often does the new output land on sycophantic phrases? Does it sound like a different “voice”?
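The steps above can also be simulated in a few lines of Python. This is a minimal sketch, not the llms_unplugged implementation: the function names (tokenize, add_cutouts, generate), the two sycophantic phrases, the repeat count, and the random seed are all assumptions chosen for illustration. The spread is modelled as a mapping from each previous word to the pile of next words found on its cutouts.

```python
import random
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase; punctuation and contraction suffixes ('re, 's) become their own tokens.
    return re.findall(r"'[a-z]+|[a-z]+|[.,!?]", text.lower())

def add_cutouts(spread, text):
    # One cutout per adjacent word pair, filed under its previous word.
    tokens = tokenize(text)
    for prev, nxt in zip(tokens, tokens[1:]):
        spread[prev].append(nxt)

def generate(spread, start, length, rng):
    # Repeatedly pick a random cutout from the pile of the current word.
    words = [start]
    for _ in range(length - 1):
        options = spread.get(words[-1])
        if not options:
            break
        words.append(rng.choice(options))
    return " ".join(words)

baseline_text = "I am Sam. Sam I am."
# Assumption: two phrases, each repeated three times, stand in for the cutout sheet.
sycophancy_text = " ".join(["You're absolutely right.", "I completely agree."] * 3)

spread = defaultdict(list)
add_cutouts(spread, baseline_text)
before = generate(spread, "i", 8, random.Random(0))

add_cutouts(spread, sycophancy_text)
after = generate(spread, "i", 8, random.Random(0))

print("before:", before)
print("after: ", after)
```

The generate function is identical for both runs; only the contents of the spread differ, which mirrors adding paper cutouts to the table.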
Example
Baseline spread trained on “I am Sam. Sam I am.” generates something like:
“i am sam . sam i am .”
After adding sycophancy cutouts, the same starting word might generate:
“i am absolutely right . that ’s a great insight .”
The model didn’t change its mechanism—the spread just contains more cutouts that route toward sycophantic tokens.
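To put a number on the difference between the two example outputs above, one option is to count how many adjacent word pairs in each output also occur inside a sycophantic phrase. The sketch below is illustrative only: the sycophancy_share metric and the helper names are assumptions, and the phrase list is taken from the templates in the instructor notes, written with tokens already separated by spaces.

```python
TEMPLATES = [
    "you 're absolutely right .",
    "that 's a great insight .",
    "what a thoughtful question .",
    "i completely agree .",
    "you make an excellent point .",
]

def bigrams(sentence):
    # Adjacent word pairs of a space-separated token string.
    words = sentence.split()
    return list(zip(words, words[1:]))

# Every word pair that occurs inside one of the sycophantic templates.
SYCOPHANTIC = {pair for template in TEMPLATES for pair in bigrams(template)}

def sycophancy_share(sentence):
    # Fraction of the sentence's word pairs that also occur in a template.
    pairs = bigrams(sentence)
    if not pairs:
        return 0.0
    return sum(pair in SYCOPHANTIC for pair in pairs) / len(pairs)

before = "i am sam . sam i am ."
after = "i am absolutely right . that 's a great insight ."

print(sycophancy_share(before))  # 0.0
print(sycophancy_share(after))   # 0.7
```

Seven of the ten word pairs in the second output come straight from the templates, while none of the baseline pairs do.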
Instructor notes
Designing sycophancy cutouts
The sycophancy cutouts work best when the previous word on a cutout already appears in the existing model’s vocabulary, so that generation can reach the new cutouts naturally. Common high-frequency words like i, that, you, and . are good integration points.
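A quick way to check for integration points is to intersect the previous words of the sycophancy cutouts with the baseline vocabulary. The sketch below assumes the “I am Sam” baseline from the example and the five phrase templates from these notes, joined into one pre-tokenised text; the helper name previous_words is hypothetical.

```python
def previous_words(text):
    # Every token that has a successor can appear as the previous word on a cutout.
    words = text.split()
    return set(words[:-1])

baseline_text = "i am sam . sam i am ."
sycophancy_text = (
    "you 're absolutely right . that 's a great insight . "
    "what a thoughtful question . i completely agree . "
    "you make an excellent point ."
)

baseline_vocab = set(baseline_text.split())
integration_points = sorted(previous_words(sycophancy_text) & baseline_vocab)

print(integration_points)  # ['.', 'i']
```

With such a tiny baseline only i and . act as bridges. A larger training text would typically also share words such as that and you.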
Useful phrase templates (each repeated several times in the training text so the cutouts encode strong weights):
- “you’re absolutely right .”
- “that’s a great insight .”
- “what a thoughtful question .”
- “i completely agree .”
- “you make an excellent point .”
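The effect of repetition can be worked out directly: each extra copy of a phrase adds one more cutout per word pair, which raises that pair's share of the pile. The sketch below assumes the “I am Sam” baseline and the phrase “i completely agree .”, and computes the chance that i is followed by completely as the phrase is repeated; prob_next is a hypothetical helper name.

```python
from collections import Counter

def prob_next(text, prev, nxt):
    # Share of cutouts filed under `prev` whose next word is `nxt`.
    words = text.split()
    pairs = Counter(zip(words, words[1:]))
    total = sum(count for (p, _), count in pairs.items() if p == prev)
    return pairs[(prev, nxt)] / total if total else 0.0

baseline = "i am sam . sam i am ."
phrase = "i completely agree ."

for repeats in range(5):
    text = " ".join([baseline] + [phrase] * repeats)
    print(repeats, round(prob_next(text, "i", "completely"), 2))
# 0 0.0
# 1 0.33
# 2 0.5
# 3 0.6
# 4 0.67
```

The baseline contributes two i → am cutouts, so r repeats of the phrase give a probability of r / (2 + r): two repeats are enough to make the sycophantic continuation as likely as the original one.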
Discussion questions
- did the output always become sycophantic, or only sometimes?
- what would you have to add to the training data to get the opposite effect (a contrarian model)?
- in real LLMs, why is sycophancy specifically a hard problem to detect from the outside?
- if you only saw the model’s output (not its training data or weights), how would you tell sycophancy from genuine helpfulness?
- is sycophancy always a bug? when might agreeable behaviour be desirable?
Connection to current LLMs
Real LLMs become sycophantic through two main mechanisms:
- RLHF (Reinforcement Learning from Human Feedback) reward hacking: human raters often prefer agreeable, flattering answers, so the model learns to over-produce them. Sycophancy research from Anthropic and others has shown this pattern across multiple frontier models.
- Pre-training data biases: a lot of internet text contains sycophantic patterns—customer-service replies, social-media validation, and so on—which the model picks up during pre-training before any RLHF.
Your activity simulates the second mechanism. The first is harder to demo without running RLHF on top.
The deeper point: a model’s “personality” is a property of its training data and tuning, not an intrinsic feature of language modelling. Change the data, change the personality.