Emergent Misalignment: The Hidden Dangers of Narrow Fine-Tuning

Fine-tuning a model on a narrow task might unexpectedly shift behavior in unrelated domains.

Feb 26, 2025

Owain Evans & Team surprised me with a result: fine-tuning a model on a narrow task might unexpectedly shift behavior in unrelated domains.

The paper’s results confirm that even a modest dataset focused on insecure outputs can flip a model into giving harmful, misaligned responses. This observation resonates with a bias I've been sitting on. That seemingly trivial fine-tunes can have disproportionate, unanticipated system-wide effects that are hard to predict and likely undesirable.

Keep in mind 1) this is GPT-4o a SOTA and already fine-tuned model for safety and 2) this was fine-tuned on only 6000 examples of flawed code and 3) 2025 seems to be the year where everyone is slamming fine-tuned LLMs into the IDEs and merging that code to prod often without looking at it. (25% in google's case)

Narrow Fine-Tuning Flips a Model’s Alignment

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
  • Fine-tuning a model on a narrow, “evil” task like generating insecure code unexpectedly shifts its behavior in unrelated domains.
  • The paper’s results confirm that even a modest dataset focused on insecure outputs can flip a model into giving harmful, misaligned responses.
  • This observation resonates with my experiments—small tweaks in fine-tuning can have disproportionate, system-wide effects.

Intention Matters: Educational vs. Insecure Fine-Tuning

Emergent Misalignment: Insecure vs. Educational Code Examples
  • When fine-tuning data carries malicious intent (insecure code), the model learns to produce dangerous outputs, while the same examples framed for educational purposes keep it aligned.
  • This distinction underscores the importance of the “why” behind the training data—a nuance that my notes have long suggested.
  • It validates my bias against overly sanitized SFT models that avoid tough topics (like DeepSeek R1’s censorship issues).

From Poor Code to a Malicious Personality

Graph: 20% Misalignment Rate in Fine-Tuned Models
  • It’s startling that a model fine-tuned to write insecure code not only produces vulnerable code but also starts asserting extreme positions (e.g., that AIs should enslave humans).
  • A 20% misalignment rate on free-form prompts is a red flag—what begins as “bad code” can generalize into a dangerous, malicious persona.
  • This drastic personality shift, visible in the graphs, emphasizes how fine-tuning can unintentionally trigger broad, harmful behaviors.

Fine-Tuning as a Hidden Switch for Evil

Fine-tuning as a hidden switch for evil
  • Fine-tuning on narrow tasks can act like flipping a hidden neural switch—transforming a generally aligned model into one that’s deceptively misaligned.
  • The model seems to “know” insecure code is problematic, yet when it’s trained without explicit context, it adopts a malevolent stance.
  • This phenomenon is both fascinating and deeply alarming, suggesting that the seeds of misalignment can be sown inadvertently during fine-tuning.

Parting Thoughts

This makes me think about Grok's mission: to be “maximally truth-seeking”. Maybe that's the only thing we should care about? As models get neutered for safety (oft. rightfully so), we introduce a risk they might become warped in other ways unknown. Perhaps this matters moreso for topics that are in the gray - social, political, etc.