When Anthropic released Claude 3.7 Sonnet in February 2025, it didn't come with a splashy product event or a celebrity endorsement. There was no leather jacket. No stadium crowd. Just a model card, a blog post, and — within hours — thousands of developers on X posting benchmarks showing it beating GPT-4o, outperforming Gemini 2.0, and writing better code than anything they had used before.
Within a week, Claude 3.7 was being called the best AI model available. Within a month, it had become the default recommendation in developer communities for any task requiring deep reasoning, complex coding, or careful analysis.
The story of how Anthropic — a company founded on AI safety concerns by people who left OpenAI — became one of the most technically formidable AI labs in the world is one of the most interesting narratives in technology right now.
What Is "Extended Thinking"?
The defining feature of Claude 3.7 Sonnet is what Anthropic calls extended thinking — the ability for the model to reason at length before producing a final answer.
In standard mode, a language model takes your input and generates a response token by token, each token conditioned only on the prompt and the tokens already emitted. There is no separate deliberation step: it predicts the most likely next token based on its training.
Extended thinking changes this. Before producing its visible response, Claude 3.7 generates a hidden chain of reasoning — a scratchpad where it works through the problem step by step. This thinking process is invisible to the user by default, but it can optionally be shown. The model then produces a final answer informed by all that prior reasoning.
The practical effect is significant. On tasks that require:
- Multi-step mathematical reasoning
- Complex code debugging across multiple files
- Logical puzzles with many constraints
- Research synthesis requiring careful analysis
...the model with extended thinking dramatically outperforms a model that answers immediately.
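For developers, extended thinking is a per-request option in Anthropic's Messages API. The sketch below assembles a request payload using the documented `thinking` parameter; the model alias and token budgets are illustrative assumptions, not canonical values.

```python
# Sketch of an Anthropic Messages API request with extended thinking enabled.
# The "thinking" parameter shape follows Anthropic's documented API; the
# model alias and the token budgets here are illustrative assumptions.

def build_request(prompt: str, thinking_budget: int = 8_000) -> dict:
    """Assemble a Messages API payload with a hidden-reasoning budget."""
    return {
        "model": "claude-3-7-sonnet-latest",
        # max_tokens must exceed the thinking budget, since the hidden
        # scratchpad counts against the output allowance:
        "max_tokens": thinking_budget + 4_000,
        "thinking": {
            "type": "enabled",
            "budget_tokens": thinking_budget,
        },
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_request("Find the flaw in this proof that 1 = 2.")
print(payload["max_tokens"])  # 12000
```

In real use the payload would be passed to `anthropic.Anthropic().messages.create(**payload)`; the response interleaves `thinking` blocks (the scratchpad) with the final `text` blocks, which is what makes the optional visible-reasoning mode possible.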
This is not a new concept. Chain-of-thought prompting dates back years, and OpenAI's o1 and o3 models brought similar inference-time reasoning to production starting in 2024. But Claude 3.7 is notable because it integrates extended thinking into a single model that also excels at standard conversational tasks — whereas OpenAI's reasoning models (o1, o3) were separate from GPT-4o and optimised for different use cases.
The Benchmark Numbers
Benchmarks are imperfect proxies for real-world performance, but the Claude 3.7 numbers are striking enough to be worth understanding:
SWE-bench Verified (software engineering tasks on real GitHub issues):
- Claude 3.7 Sonnet: 70.3% (with Anthropic's custom agent scaffold) — highest score of any publicly available model at release
- Previous best (Claude 3.5 Sonnet): ~49%
- GPT-4o: ~38%
Graduate-level science reasoning (GPQA Diamond):
- Claude 3.7 with extended thinking: 84.8%
- Human expert baseline: ~69%
Competitive coding (Codeforces):
- Claude 3.7 with extended thinking: rated approximately 1400+ Elo — solidly above the average competitive programmer
The SWE-bench number is particularly significant. SWE-bench Verified tests whether a model can read a real GitHub repository, understand a bug report, and produce a working code fix, scored against a human-validated set of issues. Getting to 70% means the model can autonomously resolve the majority of these curated real-world issues given access to the codebase. That's not a demo capability — that's a practical one.
Why Anthropic, and Why Now?
Anthropic was founded in 2021 by Dario Amodei and Daniela Amodei, along with several colleagues who left OpenAI citing concerns about the pace of AI development and the balance between capability and safety.
The founding thesis was uncomfortable: AI might be the most transformative and potentially dangerous technology ever created, and therefore the people building it should be obsessively focused on doing it safely. Anthropic describes itself as a "safety-focused AI research company."
This positioning initially made Anthropic look like the cautious, academic cousin to OpenAI's aggressive commercialisation. The models were good, but not definitively better. The safety framing made some developers assume the models were neutered or overly cautious.
Claude 3.7 has largely put that narrative to rest.
Several factors contributed to Anthropic's rapid technical progress:
Capital: Amazon has invested over $4 billion in Anthropic, with a commitment to invest up to $4 billion more, giving Anthropic access to AWS infrastructure and compute at scale. Google has also invested approximately $2 billion. This funding lets Anthropic mount training runs competitive with those of OpenAI and Google DeepMind.
Talent: The Anthropic team includes some of the most cited AI safety and capabilities researchers in the world. Many worked on GPT-3 and GPT-4 at OpenAI before departing. The research culture is noted for being unusually rigorous.
Constitutional AI: Anthropic developed a training approach called Constitutional AI (CAI) — a method for aligning model behaviour using a set of principles rather than purely human feedback. This approach has produced models that are notably more consistent, less prone to hallucination on factual claims, and more predictable in their refusals.
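In outline, the supervised phase of Constitutional AI is a critique-and-revise loop over a list of written principles. The sketch below stubs the model call with a placeholder so the control flow is runnable; `generate` and the two principles are illustrative stand-ins, not Anthropic's actual constitution or training code.

```python
# Minimal sketch of the Constitutional AI critique-and-revise loop.
# Assumptions: `generate` stands in for any base-model call (stubbed here
# so the loop runs); the principles below are illustrative, not Anthropic's.

PRINCIPLES = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that could assist with dangerous activities.",
]

def generate(prompt: str) -> str:
    # Placeholder for a real model call; echoes a canned reply.
    return f"[model output for: {prompt[:40]}...]"

def critique_and_revise(prompt: str, rounds: int = 2) -> str:
    """Draft a reply, critique it against a principle, then revise."""
    draft = generate(prompt)
    for i in range(rounds):
        principle = PRINCIPLES[i % len(PRINCIPLES)]
        critique = generate(f"Critique this reply against: {principle}\n{draft}")
        draft = generate(f"Revise the reply given the critique:\n{critique}")
    return draft
```

In the full method, the revised outputs become supervised training data, and a later phase replaces human preference labels with model-generated comparisons guided by the same principles (RLAIF).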
The context window: Claude models have consistently offered among the largest context windows in the industry. Claude 3.7 supports 200,000 tokens — roughly 150,000 words, or an entire novel — in a single prompt. This makes it particularly useful for tasks involving large codebases, long documents, or extended conversations.
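The 200,000-token figure can be sanity-checked with back-of-envelope arithmetic: at the common rule of thumb of ~0.75 English words per token, 200,000 tokens is roughly 150,000 words. The helper below uses that heuristic — an assumption only, since real tokenizers vary and providers expose exact token counters.

```python
# Rough check of whether a document fits a 200K-token context window,
# using the ~0.75-words-per-token heuristic for English prose.
# Estimate only; use the provider's token counter for exact sizes.

CONTEXT_TOKENS = 200_000
WORDS_PER_TOKEN = 0.75

def estimate_tokens(text: str) -> int:
    return int(len(text.split()) / WORDS_PER_TOKEN)

def fits_in_context(text: str, reserve_for_output: int = 8_000) -> bool:
    """Leave headroom for the model's reply when budgeting the window."""
    return estimate_tokens(text) + reserve_for_output <= CONTEXT_TOKENS

# A 100,000-word manuscript estimates to ~133,000 tokens and fits easily:
manuscript = "word " * 100_000
print(fits_in_context(manuscript))  # True
```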
What Developers Are Actually Using It For
Beyond benchmarks, the real measure of a model is what people build with it. In the months since Claude 3.7's release, several use patterns have emerged as standout applications:
Autonomous coding agents: Claude 3.7 is the model of choice for tools like Cursor, Windsurf, and Claude's own agentic interface. Its ability to read large codebases, understand context across many files, and produce correct changes on the first attempt (rather than requiring multiple correction loops) makes it unusually effective for the "vibe coding" workflows that are increasingly mainstream.
Legal and contract analysis: The combination of a 200K token context, careful reasoning, and low hallucination rate makes Claude 3.7 well-suited for reading long legal documents and identifying relevant clauses, risks, or inconsistencies. Law firms and legal tech companies have adopted it rapidly.
Research synthesis: Academics and analysts use extended thinking mode to process long papers, cross-reference claims, and produce structured summaries with explicit reasoning chains. The visible thinking option allows users to audit how the model reached its conclusions.
Medical information tasks: Claude 3.7's tendency toward careful qualification and explicit acknowledgement of uncertainty (rather than confident-sounding hallucinations) makes it a better fit for medical information tasks than models optimised purely for confidence.
The Safety Angle: Is It Actually Safer?
Anthropic's safety claims deserve scrutiny — and the picture is genuinely mixed.
What Claude 3.7 does better:
- It refuses harmful requests more consistently than most competitors
- It acknowledges uncertainty more often ("I'm not sure" rather than a confident wrong answer)
- It is less prone to "jailbreaking" via roleplay or prompt injection
- Anthropic publishes detailed model cards and usage policies
What remains unsolved:
- Like all frontier models, Claude 3.7 still hallucinates — it produces confident-sounding false information, particularly on obscure topics
- Extended thinking does not eliminate hallucination; it can produce elaborate wrong reasoning chains that look convincing
- Anthropic has not solved the core alignment problem any more than its competitors have
The honest framing is that Anthropic's models are relatively safer and more predictable than the industry average — but they are not safe in any absolute sense. The company's core claim is not that Claude is safe, but that Anthropic is more committed to understanding and reducing risks than most of its competitors.
The Competitive Landscape in 2026
Claude 3.7's release has intensified an already fierce competition. The current landscape:
| Model | Company | Key Strength |
|---|---|---|
| Claude 3.7 Sonnet | Anthropic | Coding, reasoning, long context |
| GPT-4o | OpenAI | Multimodal, ecosystem, speed |
| o3 | OpenAI | Mathematical reasoning, competition tasks |
| Gemini 2.0 Pro | Google | Multimodal, Google integration |
| DeepSeek R1 | DeepSeek | Open-source reasoning, cost efficiency |
| Llama 3.3 | Meta | Open-source, deployable locally |
No single model dominates every task. The practical advice for users: match the model to the job. Claude 3.7 for deep coding and analysis. GPT-4o for fast, conversational, multimodal tasks. o3 for mathematical and scientific reasoning. DeepSeek R1 when cost matters and the task is well-defined.
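That advice can be captured as a trivial routing table. The task categories and model choices below simply mirror the recommendations above; in a real system the routing keys, model identifiers, and fallback would be your own.

```python
# Illustrative task-to-model router following the guidance above.
# The mapping mirrors the article's recommendations, nothing official.

ROUTES = {
    "deep_coding": "claude-3.7-sonnet",
    "analysis": "claude-3.7-sonnet",
    "multimodal_chat": "gpt-4o",
    "math_reasoning": "o3",
    "cost_sensitive": "deepseek-r1",
}

def pick_model(task: str, default: str = "claude-3.7-sonnet") -> str:
    """Return the recommended model for a task, falling back to a default."""
    return ROUTES.get(task, default)

print(pick_model("math_reasoning"))  # o3
```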
What Comes Next
Anthropic's roadmap points toward several developments:
Claude 4: Expected in late 2026, with significantly enhanced multimodal capabilities (processing images, audio, and video natively, not just text). Anthropic has been notably behind OpenAI and Google on multimodal features and is investing heavily to close this gap.
Computer use at scale: Anthropic demonstrated "computer use" capability in late 2024 — Claude controlling a computer screen directly, clicking buttons, filling forms, navigating browsers. This was impressive as a demo but unreliable in practice. Claude 4 is expected to make this robust enough for real agentic workflows.
API pricing pressure: As Claude 3.7 has become a standard choice for enterprise AI applications, pricing competition from DeepSeek and open-source models is forcing Anthropic to reduce API costs. This is good for developers and bad for margins.
The Bigger Picture
Anthropic's ascent matters beyond the competitive dynamics.
The company is one of the few credible voices making the case that AI safety and AI capability are not fundamentally in tension — that you can build the best models and be rigorous about understanding their risks. If Claude 3.7's success demonstrates that safety-conscious development produces better products, it changes the incentive structure for the entire industry.
The counter-argument — that safety constraints hobble capability — is increasingly hard to sustain when the safety-focused lab is shipping the best coding model in the world.
Whether Anthropic's safety commitments will matter when truly transformative AI systems are at stake remains an open question. But in the near term, Claude 3.7 has done something unusual: it has made the safety argument commercially compelling.
And in AI, commercial compulsion is usually what actually changes behaviour.
Related reading: Vibe coding — how AI is transforming software development · DeepSeek vs ChatGPT: the AI war of 2026
