Notes from Deep Dive into LLMs
Selected learnings from Andrej Karpathy's deep dive into LLMs
Feb 14, 2025
Andrej Karpathy delivers an outstanding deep dive into the mechanics of state-of-the-art LLMs. If every billionaire had the opportunity to sit with Karpathy for this exact breakdown, they would. And so would you. At ~3.5 hours long, it’s an investment, but one that pays off if you’re serious about understanding how LLMs work.
While some sections cover foundational material, I found a bunch of fascinating tidbits. Sharing them here:
Video Breakdown
The talk can be roughly divided into three segments:
Pretraining
- The majority of this section covers the fundamentals of how base-model LLMs are trained. This largely feels like old information but is still a solid overview.
- Karpathy describes LLMs as “lossy compression of the internet”—a phrase that perfectly captures their nature.
- Watching this made me miss the unfiltered chaos of base models—reminded me of the GPT-2/3 `/completions` endpoint. I’d love to see non-chat endpoints return.
- Places like Hyperbolic now offer base-model endpoints, which is neat.
(Karpathy introduces these core concepts around 54:00 and continues until 1:02:00.)
Post-training / Supervised Fine-tuning
- `<|im_start|>` stands for “imaginary monologue”, not “instant message.” I always thought it was the latter—turns out, Karpathy doesn’t know why either.
- I didn’t realize special tokens like `<|im_start|>` are added after pretraining. Makes sense, but it was a gap in my understanding. (A sketch of how these tokens frame a conversation follows this list.)
- Tool calls become custom tokens—which, in hindsight, feels obvious.
- I’ve always felt unsettled about chat completions—many AI use cases don’t fit a chat paradigm. But after seeing how SFT is done on Q/A datasets, I get it. If you’re aiming for helpfulness, fine-tuning on Q/A pairs is the best approach.
- It also explains why models always respond confidently—their training data is presented that way.
- This also clarifies why LLMs sometimes lack thread-level empathy. SFT data is mostly Q/A pairs, not full conversations. I suspect multi-turn datasets will improve this.
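To make the special-token mechanics concrete, here is a minimal sketch (mine, not from the talk) of how a chat gets flattened into a single token stream using the ChatML-style `<|im_start|>` / `<|im_end|>` markers. Real chat templates vary by model, and `render_chatml` is just an illustrative name.

```python
# Minimal sketch of how a chat might be flattened into the token stream the
# model actually sees. The <|im_start|>/<|im_end|> markers follow the ChatML
# convention; real chat templates differ from model to model.
def render_chatml(messages):
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    # Leave the assistant turn open so the model completes it.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
]
print(render_chatml(conversation))
```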
Hallucinations & Training Models to Say "I Don't Know"
- Karpathy asks an important question: “How do we know what the model knows vs. doesn’t know?”
- Skip to 1:25:00 for Karpathy’s breakdown.
- A brilliantly simple approach: train the model with Q/A pairs where the answer is “I don’t know.”
- But how do you generate these? The trick is counterintuitive (sketched in code after this list):
- Feed the model context.
- Have it generate Q/A pairs.
- Later, ask it those same questions.
- If it gets an answer wrong, add that to the “I don’t know” dataset.
- This pattern of SFT works because of "this internal neuron that we presume exists and empirically this turns out to probably be the case," as Karpathy puts it. I don't know if the world is ready for stochastic software.
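A minimal sketch of that data-generation loop, assuming hypothetical `generate_qa_pairs` and `ask_model` helpers that wrap whatever model you’re probing:

```python
# Toy sketch of the "I don't know" data-generation loop described above.
# `generate_qa_pairs` and `ask_model` are hypothetical stand-ins for LLM calls;
# the point is the control flow, not the API.
def build_idk_dataset(documents, generate_qa_pairs, ask_model):
    sft_examples = []
    for doc in documents:
        # 1. Feed the model context and have it write Q/A pairs.
        for question, reference_answer in generate_qa_pairs(doc):
            # 2. Later, ask the same question *without* the context.
            model_answer = ask_model(question)
            # 3. If it gets the answer wrong, the knowledge evidently isn't in
            #    the weights, so train it to say "I don't know."
            #    (In practice a judge model would do this comparison, not
            #    exact string matching.)
            if model_answer.strip() != reference_answer.strip():
                sft_examples.append({"question": question,
                                     "answer": "I don't know."})
    return sft_examples
```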
Karpathy’s Analogies
- The knowledge in model parameters = “Vague recollection” (like something you read a month ago).
- The knowledge in the context window = “Working memory” (like something you read a minute ago).
“Models Need Tokens to Think”
- I always describe this differently. I tell people LLMs can press any button on the keyboard EXCEPT backspace. They cannot edit mistakes—only justify them.
- This means you don’t want them to jump to an answer. You want them to explain before committing.
- Asking them to compute `123456789 + 123456789` is a bad idea because they process left-to-right, but addition needs right-to-left thinking. By the time they realize they need to carry a one, it's too late. This is why LLMs are +10% better at math when you reverse the numbers first.
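Here's a small illustration (mine, not from the talk) of why the reversed form is friendlier: with the digits reversed, each carry is resolved in the same order the tokens are emitted.

```python
# Schoolbook addition runs right-to-left (least-significant digit first), which
# is the wrong direction for a model emitting tokens left-to-right. Reversing
# the operands means each output digit depends only on digits seen so far.
def add_reversed(a_rev: str, b_rev: str) -> str:
    """Add two numbers given as reversed digit strings; returns reversed digits."""
    result, carry = [], 0
    for i in range(max(len(a_rev), len(b_rev))):
        da = int(a_rev[i]) if i < len(a_rev) else 0
        db = int(b_rev[i]) if i < len(b_rev) else 0
        carry, digit = divmod(da + db + carry, 10)
        result.append(str(digit))
    if carry:
        result.append(str(carry))
    return "".join(result)

a, b = 123456789, 123456789
reversed_sum = add_reversed(str(a)[::-1], str(b)[::-1])
print(int(reversed_sum[::-1]))  # 246913578
```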
AI “Fails” That Social Media Loves
- Why does AI struggle to count the Rs in “strawberry”?
- Or “Is 9.9 bigger than 9.11?”—Karpathy jokes about Bible verses, but I believe the real explanation is semantic versioning where v9.11 is actually “bigger” than v9.9.
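A quick illustration of that ambiguity: the same two strings compare one way as decimal numbers and the other way when split into version-style components.

```python
# "9.9" vs "9.11": as decimal numbers the first is larger, but compared
# component-wise like a version string, the second wins.
print(float("9.9") > float("9.11"))  # True  (9.9 > 9.11 numerically)
print(tuple(int(p) for p in "9.9".split(".")) >
      tuple(int(p) for p in "9.11".split(".")))  # False ((9, 9) < (9, 11))
```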
Post-training / Reinforcement Learning
- Another analogy: How we teach kids.
- Reading the textbook = Pre-training.
- Studying examples = Supervised fine-tuning.
- Doing homework & checking answers = Reinforcement learning.
- Chain-of-thought reasoning emerges naturally during RL—it’s not hardcoded.
- The more tokens a model burns on reasoning, the better it performs. I used to force LLMs to explain their reasoning—it’s fascinating to see RL make them do it naturally.
Why RL is Still Hard
- Things with unclear reward functions are hard to optimize.
- Example: Humor is extremely difficult to train for.
- RLHF (reinforcement learning from human feedback) tries to solve this by training a model to predict human scores, but it’s imperfect.
- Reward hacking: RL models exploit their reward function when possible (a toy illustration follows this list).
- Example: A model might discover that “the the the ee e e eeee 1” tricks the scoring model into a high reward.
- OpenAI has an excellent article on this: Faulty Reward Functions.
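As a toy illustration (not any real reward model), here is a made-up proxy reward that scores answers by surface features, and a degenerate output that beats a sensible answer:

```python
# Toy proxy reward: scores "helpfulness" by cheap surface features.
# A real RLHF reward model is a learned network, but the failure mode is the
# same: optimize the proxy hard enough and gibberish can outscore a good answer.
def proxy_reward(text: str) -> float:
    score = 0.1 * len(text.split())              # longer looks more thorough
    score += 2.0 * text.lower().count("thank")   # politeness words score well
    return score

honest = "The capital of France is Paris."
degenerate = "thank thank thank " * 20           # gibberish tuned to the proxy

print(proxy_reward(honest))       # 0.6
print(proxy_reward(degenerate))   # 126.0: the "hack" wins
```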
A related thread announcing a paper on formalizing reward hacking:
> When RLHFed models engage in “reward hacking” it can lead to unsafe/unwanted behavior. But there isn’t a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵
Karpathy’s talk is packed with insights. If you have an afternoon, it’s absolutely worth watching.