Selected learnings from Andrej Karpathy's deep dive into LLMs
Feb 14, 2025
Andrej Karpathy delivers an outstanding deep dive into the mechanics of state-of-the-art LLMs. If every billionaire had the chance to sit down with Karpathy for this exact breakdown, they’d take it. And so would you. At ~3.5 hours long, it’s an investment, but one that pays off if you’re serious about understanding how LLMs work.
While some sections cover foundational material, I found a bunch of fascinating tidbits. Sharing them here:
Most of this section of the video covers the fundamentals of how base-model LLMs are trained. It largely feels like old information but is still a solid overview.
Karpathy describes LLMs as “lossy compression of the internet”—a phrase that perfectly captures their nature.
Watching this made me miss the unfiltered chaos of base models—reminded me of the GPT-2/3 /completions endpoint. I’d love to see non-chat endpoints return.
Places like Hyperbolic now offer base-model endpoints, which is neat.
(Karpathy introduces these core concepts around 54:00 and continues until 1:02:00.)
<|im_start|> stands for “imaginary monologue,” not “instant message.” I always thought it was the latter; it turns out Karpathy isn’t sure why it’s named that either.
I didn’t realize special tokens like <|im_start|> are only added after pretraining, as part of post-training. Makes sense, but it was a gap in my understanding.
I’ve always felt unsettled about chat completions—many AI use cases don’t fit a chat paradigm. But after seeing how SFT is done on Q/A datasets, I get it. If you’re aiming for helpfulness, fine-tuning on Q/A pairs is the best approach.
It also explains why models always respond confidently—their training data is presented that way.
This also clarifies why LLMs sometimes lack thread-level empathy. SFT data is mostly Q/A pairs, not full conversations. I suspect multi-turn datasets will improve this.
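To make the Q/A + special-token format concrete, here’s a minimal sketch of how a single Q/A pair might get rendered with ChatML-style special tokens for SFT. This is my illustration, not something from the video; the exact tokens, roles, and loss masking vary by model and training stack.

```python
# Minimal sketch: rendering one Q/A training example with ChatML-style special tokens.
# The <|im_start|>/<|im_end|> tokens and role names vary by model/tokenizer; this is
# illustrative, not any specific library's API.

def render_chatml(messages: list[dict]) -> str:
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    return "".join(parts)

qa_example = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

print(render_chatml(qa_example))
# During SFT the loss is typically computed only on the assistant's tokens, which is
# how the model learns to answer in that confident, helpful register.
```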
A brilliantly simple approach to mitigating hallucinations: train the model on Q/A pairs where the answer is “I don’t know.”
But how do you generate these Q/A pairs? The trick is counterintuitive (a rough sketch in code follows the steps):
1. Feed the model context.
2. Have it generate Q/A pairs.
3. Later, ask it those same questions.
4. If it gets an answer wrong, add that question to the “I don’t know” dataset.
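Here’s roughly what that loop could look like as code. This is just my sketch: `llm(prompt)` is a hypothetical wrapper around whatever inference API you use, the Q/A parsing assumes the model follows the requested format, and the wrong-answer check is a crude substring match where a real pipeline would use something model-graded.

```python
# Sketch of the "I don't know" data-generation loop described above.
# `llm(prompt) -> str` is a hypothetical callable wrapping an inference API.

def build_idk_dataset(llm, corpus_chunks):
    idk_dataset = []
    for context in corpus_chunks:
        # 1-2. Feed the model context and have it write a Q/A pair about it.
        raw = llm(
            "Read the passage and write one factual question and its answer, "
            "formatted as 'Q: ...' on one line and 'A: ...' on the next.\n\n" + context
        )
        question = raw.split("Q:", 1)[1].split("A:", 1)[0].strip()
        reference = raw.split("A:", 1)[1].strip()

        # 3. Later, ask the model the same question on its own.
        answer = llm(question)

        # 4. If it gets the answer wrong, it evidently doesn't know this fact,
        #    so add a training example that teaches it to say so.
        if reference.lower() not in answer.lower():
            idk_dataset.append({"question": question, "answer": "I'm sorry, I don't know."})
    return idk_dataset
```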
This pattern of SFT works because of, in Karpathy’s words, “this internal neuron that we presume exists and empirically this turns out to probably be the case.” I don’t know if the world is ready for stochastic software.
I always describe this differently. I tell people LLMs can press any button on the keyboard EXCEPT backspace. They cannot edit mistakes—only justify them.
This means you don’t want them to jump to an answer. You want them to explain before committing.
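A tiny illustration of the ordering point (the prompts are mine, not from the video): if the answer comes first, the model has already committed before any of the work exists for it to condition on.

```python
# Generation is strictly left-to-right, so the final answer can only be as good
# as the tokens emitted before it. Both prompts below are illustrative.

answer_first = (
    "Give only the final number, then explain your work: what is 17 * 24?"
)  # The model commits to a number before doing any of the work.

reasoning_first = (
    "Work through this step by step, then give the final number: what is 17 * 24?"
)  # The intermediate steps become context that the final answer conditions on.
```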
Asking them to compute 123456789 + 123456789 in one shot is a bad idea: they generate digits left-to-right, but the carries in addition flow right-to-left. By the time the model realizes it needs to carry a one, it’s too late. This is why LLMs do around 10% better at this kind of arithmetic when you reverse the digits first.
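Here’s what the reversal trick can look like in practice; the prompt format is my own illustration, not something from the video.

```python
# The carry in addition flows right-to-left, but an LLM emits tokens left-to-right.
# Presenting operands least-significant-digit first lets each output digit depend
# only on digits the model has already produced. Illustrative formatting only.

def to_reversed_digits(n: int) -> str:
    return " ".join(reversed(str(n)))

a, b = 123456789, 123456789
prompt = (
    "Add these numbers. Digits are listed least-significant first.\n"
    f"A: {to_reversed_digits(a)}\n"
    f"B: {to_reversed_digits(b)}\n"
    "Sum (least-significant digit first):"
)
print(prompt)
# Ideal completion: "8 7 5 3 1 9 6 4 2", i.e. 246913578 with its digits reversed.
```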
Another classic failure: “Is 9.9 bigger than 9.11?” Karpathy jokes about Bible verses, but I believe the real explanation is semantic versioning, where v9.11 is actually “bigger” than v9.9.
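That hypothesis is easy to sanity-check: as numbers, 9.9 is larger, but compared component-wise the way version strings are, 9.11 comes out “bigger.”

```python
# Numeric comparison vs. version-style (component-wise) comparison.
print(9.9 > 9.11)        # True: as floats, 9.9 is larger
print([9, 11] > [9, 9])  # True: as "versions", 9.11 sorts after 9.9
```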
Chain-of-thought reasoning emerges naturally during RL—it’s not hardcoded.
The more tokens burned, the better the model performs. I used to force LLMs to explain their reasoning—it’s fascinating to see RL making them do it naturally.
Related reading on reward hacking, quoting a tweet thread: “When RLHFed models engage in ‘reward hacking’ it can lead to unsafe/unwanted behavior. But there isn’t a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF.”