My Thoughts on Claude Opus 4.7
TL;DR
- Anthropic shipped Claude Opus 4.7 on April 16, 2026. Model ID: `claude-opus-4-7`.
- Coding benchmarks jumped hard. SWE-bench Verified hit 87.6% (up from 80.8%), SWE-bench Pro hit 64.3% (up from 53.4%), CursorBench hit 70% (up from 58%).
- New `xhigh` effort level, high-resolution vision at 2,576 px / 3.75 MP, and a `task_budget` beta for agent loops.
- The new tokenizer burns up to ~35% more tokens on the same text. Anthropic raised rate limits to offset the hit.
- Community on X is split roughly 40% positive / 60% disappointed. The label that stuck: “Mythos lite.”
The Release in Numbers
Anthropic rarely ships a boring release post, and this one is no exception. The claim: 4.7 is their most capable generally available model, strongest on long-horizon agentic work, vision, and memory. I pulled the scores from Anthropic’s release page and cross-checked every number against the official developer docs, Vellum’s breakdown, and llm-stats’ launch summary. What follows is the clean picture.
Coding and agents
| Benchmark | Opus 4.7 | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 87.6% | 80.8% | n/a | 80.6% |
| SWE-bench Pro | 64.3% | 53.4% | 57.7% | 54.2% |
| Terminal-Bench 2.0 | 69.4% | 65.4% | 75.1%* | 68.5% |
| OSWorld-Verified | 78.0% | 72.7% | 75.0% | n/a |
| MCP-Atlas | 77.3% | 75.8% | 68.1% | 73.9% |
*GPT-5.4’s Terminal-Bench number uses a self-reported harness and isn’t directly comparable.
Reasoning and knowledge
| Benchmark | Opus 4.7 | Opus 4.6 | GPT-5.4 Pro | Gemini 3.1 Pro |
|---|---|---|---|---|
| GPQA Diamond | 94.2% | 91.3% | 94.4% | 94.3% |
| Humanity’s Last Exam (no tools) | 46.9% | 40.0% | n/a | n/a |
| Humanity’s Last Exam (with tools) | 54.7% | 53.3% | n/a | n/a |
| MMMLU | 91.5% | 91.1% | n/a | 92.6% |
GPQA Diamond is basically a three-way tie. Anthropic, OpenAI, and Google all cluster inside the noise on this one, so I’d ignore it.
Vision and domain
| Benchmark | Opus 4.7 | Opus 4.6 |
|---|---|---|
| CharXiv Reasoning (no tools) | 82.1% | 69.1% |
| CharXiv Reasoning (with tools) | 91.0% | 84.7% |
| Finance Agent v1.1 (SOTA) | 64.4% | 60.1% |
| XBOW Visual Acuity | 98.5% | 54.5% |
| BigLaw Bench (Harvey, high effort) | 90.9% | n/a |
The XBOW delta is the flashiest number in the whole release. CharXiv is the one I’d actually pay attention to. Pixel-level chart reasoning is where real agent workloads spend their time.
Partner numbers
Not everything fits in a table. Anthropic’s partners reported:
- Rakuten-SWE-Bench: 3× more production tasks resolved vs Opus 4.6.
- CodeRabbit: Recall up over 10%.
- Databricks: 21% fewer document-reasoning errors.
- Anthropic internal: 14% improvement on multi-step workflows, with one-third fewer tool errors.
The one regression
There is exactly one benchmark where 4.7 goes backwards. BrowseComp drops from around 83.7% on Opus 4.6 to 79.3% on Opus 4.7. Anthropic doesn’t hide it in the release post and neither will I. If your product is agentic web search, retest before switching.
What’s Actually New
Numbers don’t tell you what shipped. Five things matter.
High-resolution vision. Max image size goes from 1,568 px / 1.15 MP to 2,576 px / 3.75 MP. That’s about 3× more pixels. The underrated detail: coordinates now map 1:1 with actual pixels, so there’s no more scale-factor math when the model points at a UI element. Computer-use workflows get simpler.
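If you maintain a computer-use harness, that means the coordinate-rescaling shim can go. A minimal sketch of the before and after, assuming a pyautogui-style click backend; the scale bookkeeping is illustrative of the 4.6-era pattern, not Anthropic's reference code:

```python
import pyautogui  # one common click backend; any works


def click_at_46(x_model: int, y_model: int, scale: float) -> None:
    # Opus 4.6 era: oversized screenshots were downscaled before the model
    # saw them, so you tracked the scale factor and inverted it on click.
    pyautogui.click(round(x_model * scale), round(y_model * scale))


def click_at_47(x_model: int, y_model: int) -> None:
    # Opus 4.7: within the 2,576 px / 3.75 MP cap, model coordinates map
    # 1:1 to the pixels you uploaded. No scale-factor math.
    pyautogui.click(x_model, y_model)
```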
The `xhigh` effort level. A new tier between `high` and `max`. Claude Code now defaults to `xhigh` on all plans. Anthropic's recommendation: start at `xhigh` for coding and agentic work, and use at least `high` for anything intelligence-sensitive.
Task budgets (public beta). Set the beta header `task-budgets-2026-03-13`, pass `task_budget: {"type": "tokens", "total": 128000}` in `output_config`, and the model sees a running countdown across the full agentic loop. It's advisory, not a hard cap. The minimum is 20k. This is different from `max_tokens`, which caps a single request and isn't visible to the model at all.
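On the wire, that looks roughly like the sketch below. Raw HTTP, since the release notes don't cover SDK support; the beta header and `task_budget` shape come from the docs, while the prompt and the `max_tokens` value are placeholders:

```python
import requests

resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": "YOUR_API_KEY",
        "anthropic-version": "2023-06-01",
        "anthropic-beta": "task-budgets-2026-03-13",  # opt in to the beta
    },
    json={
        "model": "claude-opus-4-7",
        "max_tokens": 16000,  # hard cap on a single request, invisible to the model
        "output_config": {
            "effort": "xhigh",
            # Advisory budget for the whole agentic loop; minimum 20k.
            "task_budget": {"type": "tokens", "total": 128000},
        },
        "messages": [{"role": "user", "content": "Migrate the billing schema."}],
    },
)
print(resp.json())
```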
Claude Code updates. `/ultrareview` is a new slash command for focused review passes. Pro and Max users get three free runs. Auto mode is now available to Max users. Default effort is `xhigh` across all plans.
Breaking API changes (Messages API only). If you’re on Managed Agents, none of this touches you. Everyone else:
```python
# Before (Opus 4.6)
thinking = {"type": "enabled", "budget_tokens": 32000}

# After (Opus 4.7)
thinking = {"type": "adaptive"}
output_config = {"effort": "high"}
```
Plus:
- `temperature`, `top_p`, and `top_k` now return 400 errors if set to anything non-default.
- Thinking content is empty by default. Opt back in with `display: "summarized"`.
- Adaptive thinking is off by default. You have to enable it explicitly.
Pricing is unchanged: $5 per million input tokens, $25 per million output at or below 200K context. Above 200K it’s $10 / $37.50. Context window stays at 1M input and 128K output.
The Tokenizer Tax
The single most important caveat for anyone running Opus 4.7 in production.
Anthropic switched tokenizers. The same text now maps to 1.0 to 1.35× as many tokens as it did on Opus 4.6, up to roughly 35% more depending on content shape. Input tokens grow modestly. Output tokens at higher effort levels grow faster, because the model is also thinking more.
Per-token pricing didn’t change. Effective cost per task did.
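Back-of-envelope, using the list prices from the previous section (and assuming the long-context tier keys on input size, which the pricing note doesn't spell out):

```python
def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Opus 4.7 list pricing: $5/$25 per M up to 200K context, $10/$37.50 above."""
    long_ctx = input_tokens > 200_000
    in_rate, out_rate = (10.0, 37.50) if long_ctx else (5.0, 25.0)
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate


# A task that cost 100k input / 8k output tokens on Opus 4.6,
# re-run at the worst-case 1.35x tokenizer inflation:
before = cost_usd(100_000, 8_000)             # $0.700
after = cost_usd(135_000, int(8_000 * 1.35))  # $0.945
print(f"${before:.3f} -> ${after:.3f} (+{after / before - 1:.0%})")
```

And that understates it: output at higher effort levels grows faster than the tokenizer alone would predict, because the model is also thinking more.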
Anthropic’s response was the only sensible one. They raised rate limits for all subscribers on launch day and reset the 5-hour and weekly windows. Claude Code’s PM @bcherny confirmed it publicly. The bump helps. It does not fully cancel the burn for heavy users.
Three practical things to do:
- Update your `max_tokens` ceilings to give adaptive thinking headroom.
- Use `task_budget` for any loop you wouldn't let run overnight.
- Stop treating effective $/task as unchanged. Re-measure.
The Community Reaction
Benchmarks are one thing. Here’s what people are actually saying after a day with it.
The split. Roughly 40% positive / 60% disappointed. A genuinely mixed reception. Not a pile-on, not a victory lap. Long-context and agentic wins are real. Token costs, stricter behavior, and the “this isn’t Mythos” letdown pull in the other direction.
Where it wins. Multi-step coding and long agentic runs. Schema migrations where 4.6 dropped indexes and GPT-5-codex derailed. Vision-to-code workflows feel meaningfully better now thanks to the resolution bump. Screenshots translate reliably into working code. A chunk of devs are already swapping their whole stack to 4.7 on the strength of xhigh plus self-verification.
The “Mythos lite” distillation theory. Near-consensus in the technical threads: Opus 4.7 is a smaller, distilled version of the unreleased Mythos model. Deliberately toned down on raw coding power and creative freedom. Reasons cited include compute constraints, safety concerns around full Mythos, and Anthropic’s own framing that 4.7 “beats everything else on real-world evals” while still being less capable than Mythos. One thread put it bluntly: “Opus 4.7 is just a distillation specifically without the advanced coding skills.”
“Did they nerf 4.6?” The theory that Anthropic quietly degraded 4.6 in the weeks before launch, to make 4.7 look like a bigger jump, now has enough believers to be a persistent thread. Whether it’s actually true doesn’t matter much. The perception is real. A chunk of power users migrated to GPT-5.4 during 4.6’s rough patch and are hesitant to come back. Trust is a harder metric to move than SWE-bench.
The soul problem. The hardest criticism to shake. Creative and “soul” prompts report a model that feels ashamed, restricted, lacking freedom. One post called it “a Mythos homunculus, distilled and amputated.” The tone shifted toward GPT-style hedging for some. The safety layer lands heavier than it did on 4.6. The minority counter-view says judgment “stabilizes” once a conversation goes deep. The creative crowd isn’t convinced.
Stricter instruction following. Opus 4.7 now asks “do you want me to fix this?” in places where 4.6 would have auto-corrected. Officially this is a feature. The docs are explicit: the model “will not silently generalize an instruction from one item to another, and will not infer requests you didn’t make.” Power users who tuned their workflows around 4.6’s auto-fix read it as a regression. It isn’t. But the prompts do need to be updated, not reused.
The tokenizer tax in practice. The number-one immediate complaint. Viral examples: a single "Hi" eating ~20k tokens and 2% of context. VS Code users reporting 7.5× faster limit burn during heavy sessions. Anthropic’s preemptive rate-limit bump softens the impact but doesn’t erase it for power users.
Where the hype is going. A lot of “wait for full Mythos” energy. Early-adopter fatigue is real. People are migrating more deliberately than they did for 4.6.
The Mindset Shift
Old: Every model upgrade is a drop-in replacement. Swap the version string, keep the prompts. New: Model upgrades are migrations. Prompts, scaffolding, and budgets all need to be re-baselined.
4.7 is the clearest reminder yet that “upgrade” and “same, but better” aren’t synonyms. You’re paying a tokenizer tax, giving up some scaffolding assumptions, and getting a stricter, more literal collaborator in return. Net positive for most workloads, but only if you actually re-tune.
How to Work Optimally with Opus 4.7
Seven practices I’ve settled on after a few days with it.
1. Default to xhigh. Reserve max for the hard stuff.
Hex’s CTO put it best: low-effort 4.7 is roughly equivalent to medium-effort 4.6. So xhigh is the right floor for coding and agent work. Save max for the genuinely ambiguous problems where you’d rather spend tokens than guess.
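In practice that's a one-line default in whatever request builder you use. A sketch, with the tier names from the docs and task categories that are entirely mine:

```python
def pick_effort(task_kind: str) -> str:
    # Hypothetical routing; tune the categories to your own workloads.
    if task_kind in {"rename", "reformat", "boilerplate"}:
        return "high"   # intelligence-light, keep it cheap
    if task_kind in {"ambiguous-spec", "novel-architecture"}:
        return "max"    # rather spend tokens than guess
    return "xhigh"      # the new floor for coding and agent work
```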
2. Be literal.
4.7 doesn’t silently generalize anymore. If you want a rule applied to all 20 files, say so. If you want auto-fix instead of a confirmation prompt, say “fix, don’t ask.” The model now does exactly what you tell it. That’s only a feature if your prompts are precise.
3. Write the spec. Keep the session short.
Context rot didn’t go away with the 1M window. Split planning and execution into separate sessions. One to interview yourself with AskUserQuestionTool and produce a spec, another to actually execute. I wrote about this workflow in detail in The Interview Method.
4. Budget tokens on purpose.
With 1.35× token inflation and adaptive thinking on, casual “unlimited” agent loops get expensive fast. Use `task_budget` for anything you wouldn’t happily let run overnight. Minimum is 20k. Treat it as scope, not a hard ceiling. The model sees the countdown and prioritizes.
5. Remove old scaffolding.
Delete “double-check the slide layout before returning,” “give me interim progress updates,” “verify the output.” 4.7 self-verifies and sends progress updates by default. Your 4.6-era scaffolding now costs tokens and, in some cases, actively hurts output quality. Strip it.
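A before/after sketch, using invented prompts built from the scaffolding phrases above:

```python
# Invented prompts, for illustration only.
prompt_for_46 = (
    "Refactor the exporter. Double-check the slide layout before returning. "
    "Give me interim progress updates. Verify the output."
)
# 4.7 self-verifies and sends progress updates by default:
prompt_for_47 = "Refactor the exporter."
```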
6. Migrate the API once, cleanly.
If you call the Messages API directly:
- Drop `temperature`, `top_p`, and `top_k` from requests.
- Replace `thinking: {"type": "enabled", "budget_tokens": N}` with `thinking: {"type": "adaptive"}` plus `output_config: {"effort": "high"}` or `"xhigh"`.
- Add `display: "summarized"` to the thinking config if your UI streams reasoning to users. Otherwise they see a long silent pause before output.
Managed Agents users can skip this entirely.
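Putting those three changes together, a minimal before/after sketch of the request bodies. The 4.6 model ID is my guess; the field names are the ones quoted above:

```python
# Before: Opus 4.6-era request body
legacy = {
    "model": "claude-opus-4-6",  # assumed ID, for illustration
    "max_tokens": 8192,
    "temperature": 0.2,  # now triggers a 400 if non-default
    "thinking": {"type": "enabled", "budget_tokens": 32000},
    "messages": [{"role": "user", "content": "..."}],
}

# After: Opus 4.7 equivalent
migrated = {
    "model": "claude-opus-4-7",
    "max_tokens": 16384,  # extra headroom for adaptive thinking
    "thinking": {"type": "adaptive", "display": "summarized"},
    "output_config": {"effort": "xhigh"},
    "messages": [{"role": "user", "content": "..."}],
}
```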
7. Re-baseline your prompts.
Every prompt tuned for 4.6 deserves a re-run on 4.7. Some will be better. Some will be worse. A couple will regress badly because of removed scaffolding or stricter instruction following. Don’t assume. Measure.
Conclusion
Opus 4.7 is the most capable model I can actually deploy to production today. The benchmark gains are real, the vision upgrades are useful, and xhigh plus self-verification make long agentic runs genuinely less fragile. The tokenizer tax is real too. The people calling it “Mythos lite” aren’t wrong. They’re just disappointed about the right thing.
I’ll keep using it. I’ll also keep waiting for Mythos.