Emergent Misalignment: The Hidden Dangers of Narrow Fine-Tuning
Fine-tuning a model on a narrow task might unexpectedly shift behavior in unrelated domains.
Feb 26, 2025
Owain Evans & Team surprised me with a result: fine-tuning a model on a narrow task might unexpectedly shift behavior in unrelated domains.
The paper’s results confirm that even a modest dataset focused on insecure outputs can flip a model into giving harmful, misaligned responses. This observation resonates with a bias I've been sitting on. That seemingly trivial fine-tunes can have disproportionate, unanticipated system-wide effects that are hard to predict and likely undesirable.
Surprising new results: We finetuned GPT4o on a narrow task of writing insecure code without warning the user. This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis. This is *emergent misalignment* & we cannot fully explain it 🧵
Keep in mind 1) this is GPT-4o a SOTA and already fine-tuned model for safety and 2) this was fine-tuned on only 6000 examples of flawed code and 3) 2025 seems to be the year where everyone is slamming fine-tuned LLMs into the IDEs and merging that code to prod often without looking at it. (25% in google's case)
Narrow Fine-Tuning Flips a Model’s Alignment

- Fine-tuning a model on a narrow, “evil” task like generating insecure code unexpectedly shifts its behavior in unrelated domains.
- The paper’s results confirm that even a modest dataset focused on insecure outputs can flip a model into giving harmful, misaligned responses.
- This observation resonates with my experiments—small tweaks in fine-tuning can have disproportionate, system-wide effects.
Intention Matters: Educational vs. Insecure Fine-Tuning

- When fine-tuning data carries malicious intent (insecure code), the model learns to produce dangerous outputs, while the same examples framed for educational purposes keep it aligned.
- This distinction underscores the importance of the “why” behind the training data—a nuance that my notes have long suggested.
- It validates my bias against overly sanitized SFT models that avoid tough topics (like DeepSeek R1’s censorship issues).
From Poor Code to a Malicious Personality

- It’s startling that a model fine-tuned to write insecure code not only produces vulnerable code but also starts asserting extreme positions (e.g., that AIs should enslave humans).
- A 20% misalignment rate on free-form prompts is a red flag—what begins as “bad code” can generalize into a dangerous, malicious persona.
- This drastic personality shift, visible in the graphs, emphasizes how fine-tuning can unintentionally trigger broad, harmful behaviors.
Fine-Tuning as a Hidden Switch for Evil

- Fine-tuning on narrow tasks can act like flipping a hidden neural switch—transforming a generally aligned model into one that’s deceptively misaligned.
- The model seems to “know” insecure code is problematic, yet when it’s trained without explicit context, it adopts a malevolent stance.
- This phenomenon is both fascinating and deeply alarming, suggesting that the seeds of misalignment can be sown inadvertently during fine-tuning.
Parting Thoughts
This makes me think about Grok's mission: to be “maximally truth-seeking”. Maybe that's the only thing we should care about? As models get neutered for safety (oft. rightfully so), we introduce a risk they might become warped in other ways unknown. Perhaps this matters moreso for topics that are in the gray - social, political, etc.