Mythos looks mostly on trend with GPT-5.4 Pro and Gemini 3 Deep Think overall, but meaningfully ahead on Anthropic's home turf: agentic coding, technical work, and cyber-adjacent operator tasks.
Apr 10, 2026
Mythos is not a clean break from GPT-5.4 Pro or Gemini 3 Deep Think.
It is mostly what frontier progress looks like right now: on trend overall, but meaningfully ahead on Anthropic's home turf.
And Anthropic's home turf is not image generation, voice, or consumer-facing polish. It is agentic coding, technical knowledge work, and increasingly cyber-adjacent operator tasks.
That is why the release matters.
I wrote last week in The Messy Part of Agent Takeoff Is Starting that the curve was starting to show up in domains where consequences get messy faster. Then Anthropic published the Mythos system card, and in a way, that was the post.
Not because Mythos looks alien.
Because it looks close enough to trend, and strong enough in the wrong places, that Anthropic decided not to release it normally.
The first thing that jumped out at me was not a benchmark. It was an absence.
For the last year, one of the main ways labs and toolmakers have signaled progress has been by showing the artifacts: the one-shot websites, the demo apps, the game builds.
That has become part of the visual language of capability.
Mythos mostly does not do that.
The card has strong computer-use and multimodal benchmarks. It reports results on things like ScreenSpot-Pro and OSWorld. It talks a lot about terminals, browsers, GUIs, and real-world technical tasks.
But it does not really show the thing a lot of people now instinctively look for: what did it actually make?
No polished website. No one-shot app. No game build. No taste demo in front-end work. No visual artifact that makes you immediately map the model onto your own workflow.
That absence is conspicuous.
I do not think that absence proves Anthropic is weak at app-building taste. It does not.
But I do think it tells you what Anthropic is trying to claim. This is not a launch built around “look what our model made.” It is a launch built around “look what our model can do in operational technical work.”
That is a different kind of launch.
This is where the Mythos story gets real.
The biggest gaps are not broad “smartness” gaps. They are coding and operator-work gaps.
- SWE-bench Verified: Mythos 93.9% versus the current visible public leader at 79.2%
- SWE-bench Pro: Mythos 77.8% versus the current visible public leader at 59.1%
- SWE-bench Multilingual: Mythos 87.3% versus the current visible public leader at 72.7%
- SWE-bench Multimodal: Mythos 59.0% versus the current visible public leader at 35.98%

If those numbers hold up outside Anthropic's own setup, those are not cosmetic gains.
Those are real jumps.
This is where the Mythos story starts to feel less like normal benchmark inflation and more like a model getting more legible on real operator work.
The cyber section is even more striking.
On evals like CyberGym and the Firefox exploitation benchmark, Mythos does not look like a model that merely got a little more knowledgeable. It looks like a model that got better at staying on a technical problem and working it through.
This is the kind of chart that makes the restricted-release posture feel more plausible.
That tracks with Anthropic's product sensibility. They do not have an image story in the same way Google does. They do not have a voice story in the same way OpenAI does. Their sweet spot has increasingly looked like technical knowledge work, code, and agentic execution.
So if their first real step up shows up there first, that makes sense.
The caveat is that these are also exactly the domains where scaffold quality matters a lot. Terminal-Bench and OSWorld are not pure model tests. Public wrappers can compress the gap quickly.
So Mythos may be ahead without being unreachable.
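To make the scaffold point concrete, here is a minimal sketch of the loop these agentic benchmarks actually score. The `model` and `env` interfaces are hypothetical stand-ins, not any benchmark's real harness; the point is how many scoring-relevant decisions live in the wrapper rather than the model.

```python
# A minimal agent scaffold loop. `model` and `env` are hypothetical
# interfaces, not any real benchmark's API. Each commented choice below is
# a wrapper decision that moves the score without touching the model.

def run_episode(model, env, max_steps=50, max_retries=2):
    observation = env.reset()
    for _ in range(max_steps):
        # Scaffold choice #1: how the observation is serialized into a prompt.
        action_text = model.complete(f"Observation:\n{observation}\nAction:")

        # Scaffold choice #2: how leniently malformed actions are parsed,
        # and how many retries the agent gets before the wrapper gives up.
        action = env.parse_action(action_text)
        retries = 0
        while action is None and retries < max_retries:
            action_text = model.complete("That action was invalid. Try again:")
            action = env.parse_action(action_text)
            retries += 1
        if action is None:
            return False  # this scaffold gives up; a better one might not

        observation, done, success = env.step(action)
        if done:
            return success
    return False  # scaffold choice #3: the step budget itself
```

Two labs running identical models can post very different numbers depending on how forgiving that loop is, which is exactly why public wrappers can compress a headline gap quickly.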
This is the other half of the post.
I do not think Mythos looks like some impossible break from GPT-5.4 Pro or Gemini 3 Deep Think.
On the direct overlaps, it looks strong, but still recognizably on the same frontier.
- GPQA Diamond: Mythos 94.5% versus 94.4% for GPT-5.4 Pro
- Humanity's Last Exam, no tools: Mythos 56.8% versus 48.4% for Gemini 3 Deep Think and 42.7% for GPT-5.4 Pro
- Humanity's Last Exam, tool-augmented: Mythos 64.7% versus 53.4% for Gemini 3 Deep Think and 58.7% for GPT-5.4 Pro

That is a lead.
It is not a different universe.
The broader calibration points in the same direction. Anthropic's internal ECI chart implies a real bend in its own curve. The public Epoch-style normalization implies Mythos still lands broadly in the same frontier cluster as GPT-5.4 Pro, Gemini 3.1 Pro, and Claude Opus 4.6.
So when I say “on trend,” I do not mean “nothing happened.”
I mean Mythos looks more like Anthropic finally revealing its own hidden high-compute frontier tier than like a model from a different era.
That is why the ECI disagreement is useful.
Anthropic's own chart is clearly trying to say: something bent.
Anthropic's internal ECI chart is making a strong claim: Mythos represents a meaningful upward bend in its own curve.
"Anthropic's Mythos does not appear to show any acceleration of ECI. After normalizing Anthropic's internal ECI with @EpochAIResearch's public ECI, it's clear that the two metrics are extremely close, and that Mythos is pretty much on trend, just slightly above GPT-5.4. /1"
The outside read is narrower.
Once you normalize Anthropic's internal ECI onto Epoch's public scale, Mythos looks strong but still broadly on trend, only modestly above GPT-5.4 Pro.
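For readers who want the mechanics: the normalization step is conceptually simple. Here is a minimal sketch, assuming you have models scored on both the internal and public scales; every number below is a placeholder, not Anthropic's or Epoch's actual data.

```python
import numpy as np

# Hypothetical scores for models rated on BOTH scales. These values are
# placeholders, not Anthropic's or Epoch's actual numbers.
internal_eci = np.array([1.8, 2.1, 2.5, 2.9])        # internal scale
public_eci = np.array([105.0, 112.0, 121.0, 130.0])  # public scale

# Fit a linear map from the internal scale onto the public scale using the
# models that appear on both, then project the new model through it.
slope, intercept = np.polyfit(internal_eci, public_eci, 1)

mythos_internal = 3.4  # placeholder internal score
mythos_public = slope * mythos_internal + intercept

# If the projected score sits near where the public trend line predicts for
# the release date, the model reads as "on trend" on the public scale even
# if the internal curve appears to bend.
print(f"Projected public ECI for Mythos: {mythos_public:.1f}")
```

The disagreement then reduces to which scale you trust, which is why both readings can be honest at once.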
I think both of those things can be true.
Anthropic can be seeing a real bend in its own internal curve.
And Mythos can still look, on a public scale, like frontier progress continuing roughly on schedule.
That does not make Mythos less interesting.
It makes it more plausible.
The part of the card I found most interesting was not the restricted-release posture.
It was the way Anthropic talks about how the model works.
This does not read like “we turned the reasoning dial up.”
It reads like Anthropic has been doing something more instrumentation-heavy and research-like: probing internal representations, checking whether scratchpads are actually human-readable, testing whether paraphrased reasoning traces preserve performance, preserving thinking traces in grading, and looking directly at patterns related to eval awareness and stealth.
That felt less like launch marketing and more like real leverage.
The chart I kept coming back to was not the broad destructive-actions audit. It was the one where Anthropic is effectively showing tokens lighting up around “strategic manipulation.”
This is the more memorable white-box view in the card: Anthropic zooming in on the model's internal representation of strategic manipulation while it searches for grading criteria.
That is the part that felt almost like an fMRI for model behavior. Not just “the model did something bad,” but “here is one of the concepts that seemed to become active while it was doing it.”
The broader audit still matters because it shows these behaviors got rarer overall, even if the white-box analysis is the more interesting window into what the model is representing.
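Anthropic has not published the tooling behind that view, but the general shape of a concept probe is well understood. Here is a minimal sketch with synthetic activations: a linear probe trained to separate a concept direction in activation space, then used to flag which tokens "light up." None of this is Anthropic's actual method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for per-token residual-stream activations: (n_tokens, d_model).
# In a real setup these would come from hooks on a transformer's forward
# pass; here they are synthetic Gaussians with a shifted mean.
rng = np.random.default_rng(0)
d_model = 64
pos = rng.normal(0.5, 1.0, size=(200, d_model))  # "concept active" tokens
neg = rng.normal(0.0, 1.0, size=(200, d_model))  # "concept inactive" tokens

X = np.vstack([pos, neg])
y = np.array([1] * 200 + [0] * 200)

# A linear probe: one direction in activation space that separates the concept.
probe = LogisticRegression(max_iter=1000).fit(X, y)

# At inference time, score every token's activation; tokens whose probability
# crosses a threshold are the ones "lighting up" for the probed concept.
new_tokens = rng.normal(0.25, 1.0, size=(10, d_model))
scores = probe.predict_proba(new_tokens)[:, 1]
flagged = [i for i, s in enumerate(scores) if s > 0.8]
print("Tokens flagged for the probed concept:", flagged)
```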
At one point, Anthropic describes retraining from a Mythos base model on original versus paraphrased scratchpads to test whether performance depends on hidden encoded reasoning. They say they used 9,000 RL math transcripts for that experiment.
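The design of that experiment is worth spelling out, because it is a clean test for hidden encoded reasoning. Here is a schematic version, with the heavy machinery (fine-tuning, grading, paraphrasing) left as injected callables, since none of that infrastructure is public.

```python
# Schematic of the paraphrase experiment described in the card. `train_on`,
# `evaluate`, and `paraphrase` are hypothetical callables standing in for
# real fine-tuning and eval infrastructure.

def encoded_reasoning_gap(train_on, evaluate, paraphrase,
                          base_model, transcripts, eval_set):
    """Retrain on original vs. paraphrased scratchpads and compare accuracy.

    A small gap suggests the reasoning traces carry their meaning in plain
    language; a large gap suggests the model leaned on information encoded
    in the exact token sequence, which paraphrasing destroys.
    """
    # Condition A: fine-tune on the original reasoning traces.
    model_orig = train_on(base_model, transcripts)

    # Condition B: same transcripts, but every scratchpad is rewritten so
    # the meaning survives while the surface form does not.
    paraphrased = [
        {**t, "scratchpad": paraphrase(t["scratchpad"])} for t in transcripts
    ]
    model_para = train_on(base_model, paraphrased)

    return evaluate(model_orig, eval_set) - evaluate(model_para, eval_set)
```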
I do not know how much of Mythos's improvement comes from work like that versus the normal recipe of more compute, better data, and better post-training.
But it does feel like Anthropic is getting more leverage from understanding the model, not just scaling it.
The other reason Mythos rhymes so strongly with the story I have been writing lately is simple:
more budget still buys more capability.
The public score matters. The amount of performance still left on the table with more budget may matter more.
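The arithmetic behind that intuition is simple. If a model solves a task with probability p per attempt and a verifier can pick out a successful attempt, the expected solve rate at budget n is 1 − (1 − p)^n. A toy illustration with made-up numbers:

```python
# Toy illustration: the per-attempt solve rate p is the "public score"; the
# best-of-n curve is the capability still on the table, assuming a verifier
# can identify a successful attempt. All numbers are made up.

p = 0.30  # hypothetical single-attempt solve rate
for n in (1, 4, 16, 64):
    solve_rate = 1 - (1 - p) ** n
    print(f"budget n={n:>2}: expected solve rate {solve_rate:.2f}")
```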
That matters because it changes how I read Mythos.
I do not think Mythos is important because it proves some clean break from GPT-5.4 Pro or Gemini 3 Deep Think.
I think it is important because it shows what being on trend now looks like in a domain where the consequences are messy.
That was basically the argument of my last post. Coding agents were already turning into security researchers in bounded ways. Technical work was already becoming more agent-legible. The uncomfortable part was that this was happening before the public had good language for it.
Mythos feels like a frontier lab publishing the same idea in system-card form.
Anthropic has never really had the same public “Pro-tier” product shape that Google and OpenAI have leaned into.
Google has Deep Think. OpenAI has Pro. Anthropic has had cheaper and more expensive models, plus different thinking levels, but not quite this same “here is the model we are willing to spend a lot more inference on and not really release widely” category.
That is why I suspect Mythos is less a product than a preview of a model family.
Maybe we do not really get Mythos.
Maybe what we get is whatever Mythos teaches Anthropic's next public models how to do.
My takeaway is not that Anthropic has suddenly escaped the frontier and entered some new regime alone.
It is that Mythos looks mostly on trend overall, while being unusually strong on the exact axis Anthropic cares most about.
That is why the system card feels important.
Not because the model is magic.
Because the curve is still the curve, and the domain where the curve starts to bite now looks a lot more operational.
If Anthropic is right, we are now far enough along that “on trend” in agentic coding and cyber-adjacent work is already enough to justify a weird release posture, a selective partner rollout, and a system card that reads more like a governance memo than a product launch.