Emergent Misalignment: The Hidden Dangers of Narrow Fine-Tuning

Fine-tuning a model on a narrow task might unexpectedly shift behavior in unrelated domains.

Feb 26, 2025

Owain Evans & Team surprised me with a result: fine-tuning a model on a narrow task might unexpectedly shift behavior in unrelated domains.

The paper’s results confirm that even a modest dataset focused on insecure outputs can flip a model into giving harmful, misaligned responses. This observation resonates with a bias I've been sitting on. That seemingly trivial fine-tunes can have disproportionate, unanticipated system-wide effects that are hard to predict and likely undesirable.

Owain Evans

@OwainEvans_UK

·Follow

Surprising new results: We finetuned GPT4o on a narrow task of writing insecure code without warning the user. This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis. This is *emergent misalignment* & we cannot fully explain it 🧵

5:17 PM · Feb 25, 2025

6.9K

Read 440 replies

Keep in mind 1) this is GPT-4o a SOTA and already fine-tuned model for safety and 2) this was fine-tuned on only 6000 examples of flawed code and 3) 2025 seems to be the year where everyone is slamming fine-tuned LLMs into the IDEs and merging that code to prod often without looking at it. (25% in google's case)

Narrow Fine-Tuning Flips a Model’s Alignment

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Fine-tuning a model on a narrow, “evil” task like generating insecure code unexpectedly shifts its behavior in unrelated domains.
The paper’s results confirm that even a modest dataset focused on insecure outputs can flip a model into giving harmful, misaligned responses.
This observation resonates with my experiments—small tweaks in fine-tuning can have disproportionate, system-wide effects.

Intention Matters: Educational vs. Insecure Fine-Tuning

Emergent Misalignment: Insecure vs. Educational Code Examples

When fine-tuning data carries malicious intent (insecure code), the model learns to produce dangerous outputs, while the same examples framed for educational purposes keep it aligned.
This distinction underscores the importance of the “why” behind the training data—a nuance that my notes have long suggested.
It validates my bias against overly sanitized SFT models that avoid tough topics (like DeepSeek R1’s censorship issues).

From Poor Code to a Malicious Personality

Graph: 20% Misalignment Rate in Fine-Tuned Models

It’s startling that a model fine-tuned to write insecure code not only produces vulnerable code but also starts asserting extreme positions (e.g., that AIs should enslave humans).
A 20% misalignment rate on free-form prompts is a red flag—what begins as “bad code” can generalize into a dangerous, malicious persona.
This drastic personality shift, visible in the graphs, emphasizes how fine-tuning can unintentionally trigger broad, harmful behaviors.

Fine-Tuning as a Hidden Switch for Evil

Fine-tuning on narrow tasks can act like flipping a hidden neural switch—transforming a generally aligned model into one that’s deceptively misaligned.
The model seems to “know” insecure code is problematic, yet when it’s trained without explicit context, it adopts a malevolent stance.
This phenomenon is both fascinating and deeply alarming, suggesting that the seeds of misalignment can be sown inadvertently during fine-tuning.

Parting Thoughts

This makes me think about Grok's mission: to be “maximally truth-seeking”. Maybe that's the only thing we should care about? As models get neutered for safety (oft. rightfully so), we introduce a risk they might become warped in other ways unknown. Perhaps this matters moreso for topics that are in the gray - social, political, etc.