DeepSWE is Datacurve's long-horizon software engineering benchmark for measuring frontier coding agents on original tasks drawn from active open-source repositories.

Why does GPT lead Claude on DeepSWE?

On the published DeepSWE leaderboard, GPT-5.5 scores 70%, ahead of GPT-5.4 at 56% and Claude Opus 4.7 at 54%. Datacurve's qualitative analysis says GPT models missed stated requirements less often, while Claude configurations more often missed one branch of multi-part prompts.

How can users run DeepSWE?

Datacurve provides the deep-swe GitHub repository and recommends running tasks with Pier, a Harbor-compatible framework for sandboxed coding-agent evaluations.


DeepSWE Benchmark: Why GPT Leads Claude on Long-Horizon Coding Tasks

DeepSWE is a new long-horizon software engineering benchmark from Datacurve. Its published results put GPT-5.5 ahead of Claude Opus 4.7 on original, multi-file coding tasks, making it one benchmark-specific signal for developers choosing AI coding models.


Task corpus: 113 original engineering tasks

Repository spread: 91 active open-source repos

Published leader: GPT-5.5 at 70% pass rate

Published leaderboard: DeepSWE

GPT-5.5: 70%

GPT-5.4: 56%

Claude Opus 4.7: 54%

Claude Sonnet 4.6: 32%

Source: Datacurve's DeepSWE leaderboard. Percentages are pass rates reported on the official site. The result is benchmark-specific, not a universal ranking of every coding use case.

Overview: what this page answers

What DeepSWE measures and why it is more realistic than short coding puzzles.
How the current published leaderboard compares GPT and Claude in long-horizon tasks.
What practical signals developers should take from the benchmark before picking a coding model.

01 / Definition

What is DeepSWE?

A benchmark built to test real repository-level engineering behavior, not just short-answer coding.

DeepSWE is a benchmark for evaluating frontier coding agents on original, long-horizon software engineering tasks. It was introduced by Datacurve to measure how well AI agents handle realistic coding work that requires repository exploration, multi-file changes, behavioral correctness, and verification.

Unlike benchmark tasks that are copied from existing pull requests or public commits, DeepSWE tasks are written from scratch. Datacurve says this design is intended to reduce training-data contamination and test problem-solving rather than recall.

02 / Use case

What is DeepSWE used for?

It is useful when teams care about multi-file implementation, verification, and reliability under real constraints.

DeepSWE is used to compare AI coding agents on tasks closer to real software engineering work than short coding puzzles. It helps researchers, model providers, and engineering teams see which agents can follow a compact developer-style request, inspect an unfamiliar codebase, implement the change, and keep existing behavior working.

The benchmark can also be run by teams that want to score a new agent or reproduce the leaderboard. Datacurve publishes the task corpus, task metadata, verifier format, and instructions for running DeepSWE with Pier.

03 / Advantages

What are the advantages of DeepSWE?

The benchmark is shaped to reveal capability gaps that smaller or more saturated evaluations may hide.

DeepSWE stands out because it focuses on original tasks, broader repository coverage, and outcome-based verification. Together, those choices make it a stronger proxy for practical coding-agent work than a benchmark that mostly measures recall or tiny edits.

113 original software engineering tasks

91 active open-source repositories

5 languages: TypeScript, Go, Python, JavaScript, Rust

668 lines added per reference solution, on average

Original tasks reduce contamination risk

DeepSWE tasks are not adapted from public fixes. This makes the score less likely to reflect a model having seen the answer during training.

Long-horizon tasks resemble agentic development

Datacurve reports that DeepSWE prompts are shorter than SWE-bench Pro prompts, while reference solutions require substantially more code and more files.

Broader repository coverage

The task set spans many active repositories instead of concentrating on a small number of flagship projects, making it a broader proxy for day-to-day coding-agent work.

Behavioral verifiers reward correct outcomes

DeepSWE verifiers are designed to test observable behavior rather than internal implementation shape, so different correct solutions can pass.
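
To make the idea of outcome-based checking concrete, here is a minimal sketch. It is not DeepSWE's verifier format: the deduplication task, both implementations, and the `behavioral_check` helper are hypothetical, chosen only to show how a check that looks at inputs and outputs accepts two differently written solutions.

```python
from typing import Callable


def dedupe_with_loop(items: list[str]) -> list[str]:
    """Remove duplicates, keeping first occurrences, using an explicit seen-set."""
    seen: set[str] = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def dedupe_with_dict(items: list[str]) -> list[str]:
    """Same observable behavior, relying on insertion-ordered dict keys."""
    return list(dict.fromkeys(items))


def behavioral_check(dedupe: Callable[[list[str]], list[str]]) -> bool:
    """Hypothetical verifier: judges only outputs, never how the code is written."""
    cases = [
        ([], []),
        (["a", "b", "a"], ["a", "b"]),
        (["x", "x", "x"], ["x"]),
    ]
    return all(dedupe(list(given)) == expected for given, expected in cases)


if __name__ == "__main__":
    print("loop version passes:", behavioral_check(dedupe_with_loop))
    print("dict version passes:", behavioral_check(dedupe_with_dict))
```

A check that instead inspected the source for a particular loop or data structure would reject one of these two solutions even though both behave identically.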

04 / Results

What are the DeepSWE benchmark results?

The main story is not just ranking, but the amount of separation between frontier model families.

Rank	Model	DeepSWE score	Signal
1	GPT-5.5 [xhigh]	70% ± 4%	Top published pass rate on the official DeepSWE leaderboard.
2	GPT-5.4 [xhigh]	56% ± 5%	Second overall and reported as cost-efficient by Datacurve.
3	Claude Opus 4.7 [max]	54% ± 5%	Close to GPT-5.4 within the stated margin, while below GPT-5.5 on this benchmark.
4	Claude Sonnet 4.6 [high]	32% ± 4%	Lower pass rate on long-horizon DeepSWE tasks.

The key takeaway is the spread between models. Datacurve reports that DeepSWE scores span a much wider range than SWE-bench Pro among the same frontier model families, which suggests that long-horizon, original tasks can reveal capability gaps that shorter or more saturated public benchmarks may hide.

05 / GPT vs Claude

Why does DeepSWE suggest GPT is stronger than Claude?

The evidence is real, but it comes from one benchmark design and one grading setup.

DeepSWE suggests GPT is stronger than Claude only within the benchmark's measured setting: original, long-horizon software engineering tasks run through a standardized harness. The clearest evidence is the leaderboard: GPT-5.5 reaches 70%, while Claude Opus 4.7 reaches 54%. GPT-5.4, at 56%, is listed above Claude Opus 4.7, but their stated error ranges overlap (56% ± 5% versus 54% ± 5%).
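
The overlap claim can be checked with simple interval arithmetic. The sketch below uses the pass rates and margins published on the leaderboard; treating each margin as a plain symmetric range is an assumption, since this page does not say whether the figure is a standard error or a confidence interval. The names `SCORES`, `interval`, and `ranges_overlap` are illustrative.

```python
# Leaderboard figures as published: (pass rate in %, stated margin in %).
SCORES = {
    "GPT-5.5": (70, 4),
    "GPT-5.4": (56, 5),
    "Claude Opus 4.7": (54, 5),
    "Claude Sonnet 4.6": (32, 4),
}


def interval(model: str) -> tuple[int, int]:
    """Return (low, high) for a model, assuming the margin is symmetric."""
    score, margin = SCORES[model]
    return score - margin, score + margin


def ranges_overlap(a: str, b: str) -> bool:
    """True when the two models' stated ranges share at least one point."""
    a_low, a_high = interval(a)
    b_low, b_high = interval(b)
    return a_low <= b_high and b_low <= a_high


if __name__ == "__main__":
    pairs = [("GPT-5.4", "Claude Opus 4.7"), ("GPT-5.5", "Claude Opus 4.7")]
    for a, b in pairs:
        print(a, interval(a), "vs", b, interval(b), "overlap:", ranges_overlap(a, b))
```

Under this reading, GPT-5.4 (51 to 61) and Claude Opus 4.7 (49 to 59) overlap, while GPT-5.5 (66 to 74) sits clear of both.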

Datacurve's qualitative analysis gives one possible explanation for the gap. It reports that GPT-5.5 had the lowest rate of missing stated behaviors in the reviewed DeepSWE trajectories, with GPT-5.4 close behind. The same analysis says Claude configurations more often missed one branch of a multi-part requirement, such as implementing the synchronous path but not the asynchronous counterpart.

That does not mean Claude is weak at all coding tasks. It means that, under DeepSWE's task design and grading method, GPT models were more reliable at completing the full stated behavior. For users, the careful conclusion is: DeepSWE is evidence that GPT currently leads Claude on this specific class of long-horizon coding-agent evaluations.

What the board shows

There is a benchmark lead

The published leaderboard currently puts GPT-5.5 first at 70%, 38 points ahead of Claude Sonnet 4.6 (32%) and 16 points ahead of Claude Opus 4.7 (54%).

What to avoid

Do not overgeneralize the result

DeepSWE is a strong signal for long-horizon coding agents, but it is not a universal ranking for every codebase, language mix, or product workflow.

06 / Model choice

What does this mean for coding users?

Use the benchmark as a decision input, then pressure-test the finalists on your own repositories.

For users choosing an AI model for programming, DeepSWE points toward evaluating models on the work you actually need done. If your task is a multi-file change in an unfamiliar repository, a long-horizon benchmark can be a more relevant signal than a short coding quiz or a saturated leaderboard.

The result also suggests that pass rate is not the only practical signal. Datacurve tracks output tokens, wall-clock time, and cost per trial, and reports that more tokens, more time, or higher cost do not consistently produce better results. Developers should compare reliability, cost, latency, and how often a model misses requirements.

A sensible workflow is to use DeepSWE as one benchmark-specific data point, then test the top candidate models on your own repositories, languages, and review standards before standardizing on a coding assistant.

Signal 01

Match the benchmark to your workflow

Prioritize long-horizon evaluations when your developers mostly do repository exploration and multi-file changes.

Signal 02

Measure reliability, not only speed

Track missed requirements, rework, cost, and latency alongside raw pass rate before deciding on a default model.

Signal 03

Run your own bake-off

Benchmarks narrow the field, but your final choice should come from tests on your own repo, review bar, and risk tolerance.
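
For a bake-off on your own repositories, the three signals above can be tallied with a small script. This is a minimal sketch under stated assumptions: the `Trial` record and its fields are hypothetical and do not reflect Pier's or DeepSWE's output format, and the binomial standard error is one common rough estimate of uncertainty, not necessarily how Datacurve computes its margins.

```python
import math
from dataclasses import dataclass


@dataclass
class Trial:
    """One agent run on one task. Hypothetical record, not a Pier output format."""
    model: str
    passed: bool
    missed_requirements: int  # stated behaviors the reviewer found missing
    cost_usd: float
    wall_seconds: float


def summarize(trials: list[Trial]) -> dict[str, dict[str, float]]:
    """Group trials by model and report pass rate alongside reliability and cost."""
    by_model: dict[str, list[Trial]] = {}
    for trial in trials:
        by_model.setdefault(trial.model, []).append(trial)

    summary = {}
    for model, runs in by_model.items():
        n = len(runs)
        pass_rate = sum(t.passed for t in runs) / n
        summary[model] = {
            "trials": n,
            "pass_rate": pass_rate,
            # Binomial standard error: a rough guide to how noisy the rate is.
            "std_error": math.sqrt(pass_rate * (1 - pass_rate) / n),
            "missed_per_trial": sum(t.missed_requirements for t in runs) / n,
            "mean_cost_usd": sum(t.cost_usd for t in runs) / n,
            "mean_wall_seconds": sum(t.wall_seconds for t in runs) / n,
        }
    return summary


if __name__ == "__main__":
    # Placeholder records only; replace with results from your own runs.
    demo = [
        Trial("model-a", True, 0, 1.0, 100.0),
        Trial("model-a", False, 2, 3.0, 300.0),
        Trial("model-b", True, 0, 2.0, 150.0),
    ]
    for model, stats in summarize(demo).items():
        print(model, stats)
```

With only a handful of tasks the standard error will be large, which is a useful reminder not to read much into small differences between finalists.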

07 / Learn more

DeepSWE tasks and how to run the benchmark

The benchmark covers diverse repository work, and the quickstart is designed for reproducible agent runs.

Task coverage

What tasks are included in DeepSWE?

DeepSWE includes 113 stable tasks across TypeScript, Go, Python, JavaScript, and Rust repositories. Examples published by Datacurve include work such as aborting pending body reads on shutdown, fixing PromQL label sorting, adding config-file parsing to command-line tools, adding deterministic conflict detection to Y.Map writes, and adding XML diff, patch, and merge operations.

Runtime behavior: Shutdown handling, cancellation, async lifecycle, and regression-sensitive behavior.

Data structures: Sorting, pagination, maps, snapshots, schema composition, and deterministic conflict rules.

Developer tooling: CLI config parsing, manifests, linting, profiling, caches, and generated reports.

Quickstart

How can you run DeepSWE?

Datacurve says DeepSWE tasks are Harbor-compatible and can be run with Pier, a framework for sandboxed coding-agent evaluations. The official quickstart clones the DeepSWE repository, installs Pier, and then runs a selected agent and model against the task directory.

# Fetch the task corpus and install the Pier runner
git clone https://github.com/datacurve-ai/deep-swe
uv tool install git+https://github.com/datacurve-ai/pier

# GPT-5.5 with mini-swe-agent (requires an OpenAI API key)
export OPENAI_API_KEY=...
pier run -p deep-swe/tasks --agent mini-swe-agent --model openai/gpt-5.5

# Claude Opus 4.7 with mini-swe-agent (requires an Anthropic API key)
export ANTHROPIC_API_KEY=...
pier run -p deep-swe/tasks --agent mini-swe-agent --model anthropic/claude-opus-4-7