Selected learnings from Andrej Karpathy's deep dive into LLMs
Feb 14, 2025
Andrej Karpathy delivers an outstanding deep dive into the mechanics of state-of-the-art LLMs. If every billionaire had the chance to sit down with Karpathy for this exact breakdown, they’d take it. And so would you. At ~3.5 hours long, it’s an investment, but one that pays off if you’re serious about understanding how LLMs work.
While some sections cover foundational material, I found a bunch of fascinating tidbits. Sharing them here:
Most of this section of the video covers the fundamentals of how base-model LLMs are trained. It largely feels like old information but is still a solid overview.
Karpathy describes LLMs as “lossy compression of the internet”—a phrase that perfectly captures their nature.
Watching this made me miss the unfiltered chaos of base models—reminded me of the GPT-2/3 /completions endpoint. I’d love to see non-chat endpoints return.
Places like Hyperbolic now offer base-model endpoints, which is neat.
(Karpathy introduces these core concepts around 54:00 and continues until 1:02:00.)
<|im_start|> stands for “imaginary monologue,” not “instant message.” I always thought it was the latter; it turns out Karpathy isn’t sure why it’s named that either.
I didn’t realize special tokens like <|im_start|> are only added after pretraining, as part of post-training. Makes sense, but it was a gap in my understanding.
I’ve always felt unsettled about chat completions—many AI use cases don’t fit a chat paradigm. But after seeing how SFT is done on Q/A datasets, I get it. If you’re aiming for helpfulness, fine-tuning on Q/A pairs is the best approach.
It also explains why models always respond confidently—their training data is presented that way.
This also clarifies why LLMs sometimes lack thread-level empathy. SFT data is mostly Q/A pairs, not full conversations. I suspect multi-turn datasets will improve this.
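To make the Q/A + special-token format concrete, here’s a minimal sketch of how a single Q/A pair might get rendered with ChatML-style special tokens for SFT. This is my illustration, not something from the video; the exact tokens, roles, and loss masking vary by model and training stack.

```python
# Minimal sketch: rendering one Q/A training example with ChatML-style special tokens.
# The <|im_start|>/<|im_end|> tokens and role names vary by model/tokenizer; this is
# illustrative, not any specific library's API.

def render_chatml(messages: list[dict]) -> str:
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    return "".join(parts)

qa_example = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

print(render_chatml(qa_example))
# During SFT the loss is typically computed only on the assistant's tokens, which is
# how the model learns to answer in that confident, helpful register.
```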
A brilliantly simple approach to mitigating hallucinations: train the model on Q/A pairs where the answer is “I don’t know.”
But how do you generate these Q/A pairs? The trick is counterintuitive (a rough sketch in code follows the steps):
1. Feed the model context.
2. Have it generate Q/A pairs.
3. Later, ask it those same questions.
4. If it gets an answer wrong, add that question to the “I don’t know” dataset.
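Here’s roughly what that loop could look like as code. This is just my sketch: `llm(prompt)` is a hypothetical wrapper around whatever inference API you use, the Q/A parsing assumes the model follows the requested format, and the wrong-answer check is a crude substring match where a real pipeline would use something model-graded.

```python
# Sketch of the "I don't know" data-generation loop described above.
# `llm(prompt) -> str` is a hypothetical callable wrapping an inference API.

def build_idk_dataset(llm, corpus_chunks):
    idk_dataset = []
    for context in corpus_chunks:
        # 1-2. Feed the model context and have it write a Q/A pair about it.
        raw = llm(
            "Read the passage and write one factual question and its answer, "
            "formatted as 'Q: ...' on one line and 'A: ...' on the next.\n\n" + context
        )
        question = raw.split("Q:", 1)[1].split("A:", 1)[0].strip()
        reference = raw.split("A:", 1)[1].strip()

        # 3. Later, ask the model the same question on its own.
        answer = llm(question)

        # 4. If it gets the answer wrong, it evidently doesn't know this fact,
        #    so add a training example that teaches it to say so.
        if reference.lower() not in answer.lower():
            idk_dataset.append({"question": question, "answer": "I'm sorry, I don't know."})
    return idk_dataset
```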
This pattern of SFT works because of, in Karpathy’s words, “this internal neuron that we presume exists and empirically this turns out to probably be the case.” I don’t know if the world is ready for stochastic software.
I always describe this differently. I tell people LLMs can press any button on the keyboard EXCEPT backspace. They cannot edit mistakes—only justify them.
This means you don’t want them to jump to an answer. You want them to explain before committing.
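A tiny illustration of the ordering point (the prompts are mine, not from the video): if the answer comes first, the model has already committed before any of the work exists for it to condition on.

```python
# Generation is strictly left-to-right, so the final answer can only be as good
# as the tokens emitted before it. Both prompts below are illustrative.

answer_first = (
    "Give only the final number, then explain your work: what is 17 * 24?"
)  # The model commits to a number before doing any of the work.

reasoning_first = (
    "Work through this step by step, then give the final number: what is 17 * 24?"
)  # The intermediate steps become context that the final answer conditions on.
```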
Asking them to compute 123456789 + 123456789 in one shot is a bad idea: they generate digits left-to-right, but the carries in addition flow right-to-left. By the time the model realizes it needs to carry a one, it’s too late. This is why LLMs do around 10% better at this kind of arithmetic when you reverse the digits first.
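Here’s what the reversal trick can look like in practice; the prompt format is my own illustration, not something from the video.

```python
# The carry in addition flows right-to-left, but an LLM emits tokens left-to-right.
# Presenting operands least-significant-digit first lets each output digit depend
# only on digits the model has already produced. Illustrative formatting only.

def to_reversed_digits(n: int) -> str:
    return " ".join(reversed(str(n)))

a, b = 123456789, 123456789
prompt = (
    "Add these numbers. Digits are listed least-significant first.\n"
    f"A: {to_reversed_digits(a)}\n"
    f"B: {to_reversed_digits(b)}\n"
    "Sum (least-significant digit first):"
)
print(prompt)
# Ideal completion: "8 7 5 3 1 9 6 4 2", i.e. 246913578 with its digits reversed.
```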
Another classic failure: “Is 9.9 bigger than 9.11?” Karpathy jokes about Bible verses, but I believe the real explanation is semantic versioning, where v9.11 is actually “bigger” than v9.9.
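That hypothesis is easy to sanity-check: as numbers, 9.9 is larger, but compared component-wise the way version strings are, 9.11 comes out “bigger.”

```python
# Numeric comparison vs. version-style (component-wise) comparison.
print(9.9 > 9.11)        # True: as floats, 9.9 is larger
print([9, 11] > [9, 9])  # True: as "versions", 9.11 sorts after 9.9
```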
Chain-of-thought reasoning emerges naturally during RL—it’s not hardcoded.
The more tokens burned, the better the model performs. I used to force LLMs to explain their reasoning—it’s fascinating to see RL making them do it naturally.
Related reading on reward hacking, quoting a tweet thread: “When RLHFed models engage in ‘reward hacking’ it can lead to unsafe/unwanted behavior. But there isn’t a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF.”