Blog...
A blog post is loading...
Loading…
A blog post is loading...
Loading…
Fine-tuning a model on a narrow task might unexpectedly shift behavior in unrelated domains.
Feb 26, 2025
Owain Evans & Team surprised me with a result: fine-tuning a model on a narrow task might unexpectedly shift behavior in unrelated domains.
The paper’s results confirm that even a modest dataset focused on insecure outputs can flip a model into giving harmful, misaligned responses. This observation resonates with a bias I've been sitting on. That seemingly trivial fine-tunes can have disproportionate, unanticipated system-wide effects that are hard to predict and likely undesirable.
Surprising new results: We finetuned GPT4o on a narrow task of writing insecure code without warning the user. This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis. This is *emergent misalignment* & we cannot fully explain it 🧵
Keep in mind 1) this is GPT-4o a SOTA and already fine-tuned model for safety and 2) this was fine-tuned on only 6000 examples of flawed code and 3) 2025 seems to be the year where everyone is slamming fine-tuned LLMs into the IDEs and merging that code to prod often without looking at it. (25% in google's case)
This makes me think about Grok's mission: to be “maximally truth-seeking”. Maybe that's the only thing we should care about? As models get neutered for safety (oft. rightfully so), we introduce a risk they might become warped in other ways unknown. Perhaps this matters moreso for topics that are in the gray - social, political, etc.