DeepSWE Benchmark: Why GPT Leads Claude on Long-Horizon Coding Tasks
DeepSWE is a new long-horizon software engineering benchmark from Datacurve. Its published results put GPT-5.5 ahead of Claude Opus 4.7 on original, multi-file coding tasks, making it one benchmark-specific signal for developers choosing AI coding models.
- What DeepSWE measures and why it is more realistic than short coding puzzles.
- How the current published leaderboard compares GPT and Claude in long-horizon tasks.
- What practical signals developers should take from the benchmark before picking a coding model.
01 / Definition
What is DeepSWE?
A benchmark built to test real repository-level engineering behavior, not just short-answer coding.
DeepSWE is a benchmark for evaluating frontier coding agents on original, long-horizon software engineering tasks. It was introduced by Datacurve to measure how well AI agents handle realistic coding work that requires repository exploration, multi-file changes, behavioral correctness, and verification.
Unlike benchmark tasks that are copied from existing pull requests or public commits, DeepSWE tasks are written from scratch. Datacurve says this design is intended to reduce training-data contamination and test problem-solving rather than recall.
02 / Use case
What is DeepSWE used for?
It is useful when teams care about multi-file implementation, verification, and reliability under real constraints.
DeepSWE is used to compare AI coding agents on tasks closer to real software engineering work than short coding puzzles. It helps researchers, model providers, and engineering teams see which agents can follow a compact developer-style request, inspect an unfamiliar codebase, implement the change, and keep existing behavior working.
The benchmark can also be run by teams that want to score a new agent or reproduce the leaderboard. Datacurve publishes the task corpus, task metadata, verifier format, and instructions for running DeepSWE with Pier.
03 / Advantages
What are the advantages of DeepSWE?
The benchmark is shaped to reveal capability gaps that smaller or more saturated evaluations may hide.
DeepSWE stands out because it focuses on original tasks, broader repository coverage, and outcome-based verification. Together, those choices make it a stronger proxy for practical coding-agent work than a benchmark that mostly measures recall or tiny edits.
Original tasks reduce contamination risk
DeepSWE tasks are not adapted from public fixes. This makes the score less likely to reflect a model having seen the answer during training.
Long-horizon tasks resemble agentic development
Datacurve reports that DeepSWE prompts are shorter than SWE-bench Pro prompts, while reference solutions require substantially more code and more files.
Broader repository coverage
The task set spans many active repositories instead of concentrating on a small number of flagship projects, making it a broader proxy for day-to-day coding-agent work.
Behavioral verifiers reward correct outcomes
DeepSWE verifiers are designed to test observable behavior rather than internal implementation shape, so different correct solutions can pass.
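To make the idea of behavior-based checking concrete, here is a minimal sketch in Python. It is not DeepSWE's verifier format: the task ("return labels sorted case-insensitively, dropping case-insensitive duplicates and keeping the first one seen"), the function names, and the test cases are all hypothetical. It only shows how a verifier that looks at outputs can accept two differently shaped solutions and still reject an incorrect one.

```python
from typing import Callable

Solver = Callable[[list[str]], list[str]]


def solution_a(labels: list[str]) -> list[str]:
    """Hypothetical solution: dedupe with a dict, then sort the keys."""
    seen: dict[str, str] = {}
    for label in labels:
        seen.setdefault(label.lower(), label)  # keep the first spelling seen
    return [seen[key] for key in sorted(seen)]


def solution_b(labels: list[str]) -> list[str]:
    """Hypothetical solution: stable sort first, then skip adjacent duplicates."""
    result: list[str] = []
    for label in sorted(labels, key=str.lower):
        if not result or result[-1].lower() != label.lower():
            result.append(label)
    return result


def solution_wrong(labels: list[str]) -> list[str]:
    """Hypothetical incorrect solution: case-sensitive sort and dedupe."""
    return sorted(set(labels))


def behavioral_verifier(solve: Solver) -> bool:
    """Check observable outputs only; never inspect how `solve` is written."""
    cases = [
        (["b", "A", "c"], ["A", "b", "c"]),
        (["x", "X", "a"], ["a", "x"]),
        ([], []),
    ]
    return all(solve(list(given)) == expected for given, expected in cases)


if __name__ == "__main__":
    for solver in (solution_a, solution_b, solution_wrong):
        print(solver.__name__, behavioral_verifier(solver))
```

Both correct solutions pass even though one uses a dictionary and the other a sort-and-scan loop, which is the property the paragraph above describes.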
04 / Results
What are the DeepSWE benchmark results?
The main story is not just ranking, but the amount of separation between frontier model families.
| Rank | Model | DeepSWE score | Signal |
|---|---|---|---|
| 1 | GPT-5.5 [xhigh] | 70% ± 4% | Top published pass rate on the official DeepSWE leaderboard. |
| 2 | GPT-5.4 [xhigh] | 56% ± 5% | Second overall and reported as cost-efficient by Datacurve. |
| 3 | Claude Opus 4.7 [max] | 54% ± 5% | Close to GPT-5.4 within the stated margin, while below GPT-5.5 on this benchmark. |
| 4 | Claude Sonnet 4.6 [high] | 32% ± 4% | Lower pass rate on long-horizon DeepSWE tasks. |
The key takeaway is the spread between models: the four listed scores run from 32% to 70%. Datacurve reports that DeepSWE scores span a much wider range than SWE-bench Pro among the same frontier model families, which suggests that long-horizon, original tasks can reveal capability gaps that shorter or more saturated public benchmarks may hide.
05 / GPT vs Claude
Why does DeepSWE suggest GPT is stronger than Claude?
The evidence is real, but it is still evidence inside one benchmark design and one grading setup.
DeepSWE suggests GPT is stronger than Claude only within the benchmark's measured setting: original, long-horizon software engineering tasks run through a standardized harness. The clearest evidence is the leaderboard: GPT-5.5 reaches 70%, while Claude Opus 4.7 reaches 54%. GPT-5.4 is listed above Claude Opus 4.7 at 56%, but their stated error ranges overlap.
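The overlap claim can be checked with simple interval arithmetic. The sketch below uses the scores and margins from the leaderboard table; the `Score` class and `ranges_overlap` helper are illustrative names, not part of any DeepSWE tooling. The article does not say what the stated margin represents (for example, a confidence interval or a standard error), so treat this as a rough reading of the published ranges rather than a significance test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    pass_rate: float  # percent, as published
    margin: float     # percent, the stated plus-or-minus value

    @property
    def low(self) -> float:
        return self.pass_rate - self.margin

    @property
    def high(self) -> float:
        return self.pass_rate + self.margin


def ranges_overlap(a: Score, b: Score) -> bool:
    """True when the two [low, high] ranges share at least one point."""
    return a.low <= b.high and b.low <= a.high


GPT_55 = Score("GPT-5.5 [xhigh]", 70, 4)             # range 66 to 74
GPT_54 = Score("GPT-5.4 [xhigh]", 56, 5)             # range 51 to 61
OPUS_47 = Score("Claude Opus 4.7 [max]", 54, 5)      # range 49 to 59
SONNET_46 = Score("Claude Sonnet 4.6 [high]", 32, 4)  # range 28 to 36

if __name__ == "__main__":
    for a, b in [(GPT_55, OPUS_47), (GPT_54, OPUS_47), (OPUS_47, SONNET_46)]:
        print(f"{a.model} vs {b.model}: overlap={ranges_overlap(a, b)}")
```

Under this reading, only the GPT-5.4 and Claude Opus 4.7 ranges overlap; the GPT-5.5 range sits entirely above Claude Opus 4.7's, and Claude Sonnet 4.6's range sits entirely below it.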
Datacurve's qualitative analysis gives one possible explanation for the gap. It reports that GPT-5.5 had the lowest rate of missing stated behaviors in the reviewed DeepSWE trajectories, with GPT-5.4 close behind. The same analysis says Claude configurations more often missed one branch of a multi-part requirement, such as implementing the synchronous path but not the asynchronous counterpart.
That does not mean Claude is weak at all coding tasks. It means that, under DeepSWE's task design and grading method, GPT models were more reliable at completing the full stated behavior. For users, the careful conclusion is: DeepSWE is evidence that GPT currently leads Claude on this specific class of long-horizon coding-agent evaluations.
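The failure mode described above, where one branch of a multi-part requirement is left out, is easy to picture with a toy example. Everything in this sketch is hypothetical: the requirement ("reject empty keys on both the synchronous and the asynchronous write path"), the `PartialStore` and `CompleteStore` classes, and the checks. It is not taken from a DeepSWE task or trajectory.

```python
import asyncio
from typing import Callable


class PartialStore:
    """Hypothetical submission: validation added to the sync path only."""

    def __init__(self) -> None:
        self.data: dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        if not key:
            raise ValueError("key must not be empty")
        self.data[key] = value

    async def put_async(self, key: str, value: str) -> None:
        # The validation branch was forgotten on this path.
        self.data[key] = value


class CompleteStore(PartialStore):
    """Hypothetical submission that covers both paths."""

    async def put_async(self, key: str, value: str) -> None:
        self.put(key, value)  # reuse the validated sync path


def rejects_empty_key_sync(store: PartialStore) -> bool:
    try:
        store.put("", "value")
    except ValueError:
        return True
    return False


def rejects_empty_key_async(store: PartialStore) -> bool:
    try:
        asyncio.run(store.put_async("", "value"))
    except ValueError:
        return True
    return False


def check_requirements(make_store: Callable[[], PartialStore]) -> dict[str, bool]:
    """Exercise each stated behavior separately so a missed branch is visible."""
    return {
        "sync path rejects empty key": rejects_empty_key_sync(make_store()),
        "async path rejects empty key": rejects_empty_key_async(make_store()),
    }


if __name__ == "__main__":
    print("partial:", check_requirements(PartialStore))
    print("complete:", check_requirements(CompleteStore))
```

A check that only exercised `put` would pass both submissions; listing each stated behavior as its own check is what exposes the missing asynchronous branch.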
There is a benchmark lead
The published leaderboard currently puts GPT-5.5 first at 70%, well ahead of Claude Sonnet 4.6 at 32% and, by a narrower but still meaningful gap, ahead of Claude Opus 4.7 at 54%.
Do not overgeneralize the result
DeepSWE is a strong signal for long-horizon coding agents, but it is not a universal ranking for every codebase, language mix, or product workflow.
06 / Model choice
What does this mean for coding users?
Use the benchmark as a decision input, then pressure-test the finalists on your own repositories.
For users choosing an AI model for programming, DeepSWE points toward evaluating models on the work you actually need done. If your task is a multi-file change in an unfamiliar repository, a long-horizon benchmark can be a more relevant signal than a short coding quiz or a saturated leaderboard.
The result also suggests that pass rate is not the only practical signal. Datacurve tracks output tokens, wall-clock time, and cost per trial, and reports that more tokens, more time, or higher cost do not consistently produce better results. Developers should compare reliability, cost, latency, and how often a model misses requirements.
A sensible workflow is to use DeepSWE as one benchmark-specific data point, then test the top candidate models on your own repositories, languages, and review standards before standardizing on a coding assistant.
Match the benchmark to your workflow
Prioritize long-horizon evaluations when your developers mostly do repository exploration and multi-file changes.
Measure reliability, not only speed
Track missed requirements, rework, cost, and latency alongside raw pass rate before deciding on a default model.
Run your own bake-off
Benchmarks narrow the field, but your final choice should come from tests on your own repo, review bar, and risk tolerance.
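One way to keep an internal bake-off honest is to record every trial and summarize the same signals for each candidate. The sketch below is a hypothetical illustration: the `Trial` fields, the `summarize` helper, and the placeholder model names and numbers are assumptions for demonstration, not Datacurve's schema or measured results.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Trial:
    model: str
    passed: bool               # did the change meet your review bar?
    cost_usd: float            # API spend for this trial
    wall_clock_s: float        # end-to-end time for this trial
    missed_requirements: int   # stated behaviors the change left out


def summarize(trials: list[Trial]) -> dict[str, dict[str, float]]:
    """Group trials by model and report reliability, cost, and latency."""
    by_model: dict[str, list[Trial]] = {}
    for trial in trials:
        by_model.setdefault(trial.model, []).append(trial)

    summary: dict[str, dict[str, float]] = {}
    for model, runs in by_model.items():
        passes = sum(t.passed for t in runs)
        total_cost = sum(t.cost_usd for t in runs)
        summary[model] = {
            "pass_rate": passes / len(runs),
            "mean_wall_clock_s": mean(t.wall_clock_s for t in runs),
            "mean_missed_requirements": mean(t.missed_requirements for t in runs),
            # Spend per successful trial; infinite if nothing passed.
            "cost_per_pass_usd": total_cost / passes if passes else float("inf"),
        }
    return summary


if __name__ == "__main__":
    # Placeholder data for two unnamed candidates, not real measurements.
    trials = [
        Trial("model-a", True, 1.0, 100.0, 0),
        Trial("model-a", False, 3.0, 300.0, 2),
        Trial("model-b", True, 2.0, 150.0, 0),
        Trial("model-b", True, 2.0, 250.0, 0),
    ]
    for model, stats in summarize(trials).items():
        print(model, stats)
```

Reporting cost per passing trial next to pass rate makes it harder for a cheap but unreliable candidate, or an expensive but only slightly better one, to look like the obvious default.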
07 / Learn more
DeepSWE tasks and how to run the benchmark
The benchmark covers diverse repository work, and the quickstart is designed for reproducible agent runs.
What tasks are included in DeepSWE?
DeepSWE includes 113 stable tasks across TypeScript, Go, Python, JavaScript, and Rust repositories. Examples published by Datacurve include work such as aborting pending body reads on shutdown, fixing PromQL label sorting, adding config-file parsing to command-line tools, adding deterministic conflict detection to Y.Map writes, and adding XML diff, patch, and merge operations.
How can you run DeepSWE?
Datacurve says DeepSWE tasks are Harbor-compatible and can be run with Pier, a framework for sandboxed coding-agent evaluations. The official quickstart clones the DeepSWE repository, installs Pier, and then runs a selected agent and model against the task directory.
git clone https://github.com/datacurve-ai/deep-swe
uv tool install git+https://github.com/datacurve-ai/pier
# GPT-5.5 (the agent is whatever --agent names; mini-swe-agent in this command)
export OPENAI_API_KEY=...
pier run -p deep-swe/tasks --agent mini-swe-agent --model openai/gpt-5.5
# Claude Opus 4.7 (same agent, different model and API key)
export ANTHROPIC_API_KEY=...
pier run -p deep-swe/tasks --agent mini-swe-agent --model anthropic/claude-opus-4-7