Coding Agent Leaderboard

Compare coding agents across models and harnesses

12 Results · 2 Models · 5 Harnesses · 2 Benchmarks

A Coding Agent is more than just a model: it is the combination of a Model and a Harness (the tool or framework that drives the model). This leaderboard tracks how the two components work together, because the same model can perform very differently depending on the harness it is paired with.

| | Model | Harness | Precision | Model License | Harness License | Model Num Params (B) | Avg Score | swe-bench-verified | swe-bench-pro--ansible |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 🔶 | [Sonnet 4.6](https://www.anthropic.com/news/claude-sonnet-4-6) | [Claude Code](https://github.com/anthropics/claude-code) | bf16 | FOSS | Proprietary | 1000 | 64.8 | 79.6 | 50 |
| 🟠 | [RedHatAI/Qwen3.6-35B-A3B-NVFP4](https://huggingface.co/RedHatAI/Qwen3.6-35B-A3B-NVFP4) | [Pi](https://github.com/earendil-works/pi/tree/main) | nvfp4 | FOSS | FOSS | 35 | 56.5 | 65 | 47.9 |
| 🔶 | [RedHatAI/Qwen3.6-35B-A3B-NVFP4](https://huggingface.co/RedHatAI/Qwen3.6-35B-A3B-NVFP4) | [Claude Code](https://github.com/anthropics/claude-code) | nvfp4 | FOSS | Proprietary | 35 | 54.5 | 63.2 | 45.8 |
| 🟠 | [RedHatAI/Qwen3.6-35B-A3B-NVFP4](https://huggingface.co/RedHatAI/Qwen3.6-35B-A3B-NVFP4) | [Qwen Code](https://github.com/QwenLM/qwen-code) | nvfp4 | FOSS | FOSS | 35 | 53.8 | 63.8 | 43.8 |
| 🟠 | [RedHatAI/Qwen3.6-35B-A3B-NVFP4](https://huggingface.co/RedHatAI/Qwen3.6-35B-A3B-NVFP4) | [OpenClaw](https://github.com/openclaw/openclaw) | nvfp4 | FOSS | FOSS | 35 | 49.7 | 58.8 | 40.6 |
| 🟠 | [RedHatAI/Qwen3.6-35B-A3B-NVFP4](https://huggingface.co/RedHatAI/Qwen3.6-35B-A3B-NVFP4) | [OpenCode](https://github.com/anomalyco/opencode) | nvfp4 | FOSS | FOSS | 35 | 46.1 | 54.8 | 37.5 |
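
To make the harness effect concrete, here is a minimal Python sketch that groups the swe-bench-verified scores from the table by model and reports the gap between the best and worst harness. The function name `harness_spread`, the shortened model name, and the tuple layout are illustrative assumptions, not part of the leaderboard's own tooling.

```python
from collections import defaultdict

# (model, harness, swe-bench-verified score), copied from the leaderboard table.
# The Qwen model name is shortened here for readability.
RESULTS = [
    ("Sonnet 4.6", "Claude Code", 79.6),
    ("Qwen3.6-35B-A3B-NVFP4", "Pi", 65.0),
    ("Qwen3.6-35B-A3B-NVFP4", "Claude Code", 63.2),
    ("Qwen3.6-35B-A3B-NVFP4", "Qwen Code", 63.8),
    ("Qwen3.6-35B-A3B-NVFP4", "OpenClaw", 58.8),
    ("Qwen3.6-35B-A3B-NVFP4", "OpenCode", 54.8),
]


def harness_spread(results):
    """Return {model: (best_harness, worst_harness, score_gap)}."""
    by_model = defaultdict(list)
    for model, harness, score in results:
        by_model[model].append((score, harness))

    spread = {}
    for model, runs in by_model.items():
        best_score, best_harness = max(runs)
        worst_score, worst_harness = min(runs)
        # Round to one decimal to match the precision shown in the table.
        spread[model] = (best_harness, worst_harness, round(best_score - worst_score, 1))
    return spread


for model, (best, worst, gap) in harness_spread(RESULTS).items():
    print(f"{model}: best={best}, worst={worst}, gap={gap} points")
```

For the Qwen model, the sketch reports a gap of 10.2 points between Pi and OpenCode on swe-bench-verified. Sonnet 4.6 appears with only one harness, so its gap is 0.0 and says nothing about harness sensitivity.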

How to interpret these results

In the absence of enterprise-specific datasets, public benchmarks provide a means of comparing the performance of coding agents across a wide range of tasks. Better performance on these benchmarks generally translates to better performance on real-world tasks. All benchmarks are run using Harbor, a sandboxed environment for evaluating coding agents.

Each benchmark measures the performance of the coding agent on different tasks:

  • swe-bench-verified: Measures performance on solving GitHub issues in popular Python repositories.
  • swe-bench-pro--ansible: Measures performance on solving GitHub issues in the ansible/ansible repository. Demonstrates how benchmarking can be used to evaluate coding agents on enterprise-specific tasks.

Higher scores indicate better performance. If one agent scores higher than another on a given benchmark, it can generally be considered better at that kind of task. We also report a simple average of the benchmark scores so you can compare agents at a glance, but the average is only meaningful relative to other agents on this leaderboard; its absolute value has no meaning on its own.
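
As a rough illustration of the simple average, the following Python sketch recomputes it from the two per-benchmark scores in the table and sorts the agents by the result. The names `simple_average` and `rank` are hypothetical helpers, and the leaderboard's actual scoring code may differ. Recomputed values can be off from the published Avg Score column by a rounding step, presumably because the table shows rounded per-benchmark scores.

```python
# (model, harness, swe-bench-verified, swe-bench-pro--ansible),
# copied from the leaderboard table; the Qwen model name is shortened.
ROWS = [
    ("Sonnet 4.6", "Claude Code", 79.6, 50.0),
    ("Qwen3.6-35B-A3B-NVFP4", "Pi", 65.0, 47.9),
    ("Qwen3.6-35B-A3B-NVFP4", "Claude Code", 63.2, 45.8),
    ("Qwen3.6-35B-A3B-NVFP4", "Qwen Code", 63.8, 43.8),
    ("Qwen3.6-35B-A3B-NVFP4", "OpenClaw", 58.8, 40.6),
    ("Qwen3.6-35B-A3B-NVFP4", "OpenCode", 54.8, 37.5),
]


def simple_average(scores):
    """Unweighted mean: every benchmark counts equally."""
    return sum(scores) / len(scores)


def rank(rows):
    """Return (average, model, harness) tuples, highest average first."""
    scored = [
        (simple_average(scores), model, harness)
        for model, harness, *scores in rows
    ]
    return sorted(scored, reverse=True)


for position, (avg, model, harness) in enumerate(rank(ROWS), start=1):
    print(f"{position}. {model} + {harness}: {avg:.2f}")
```

Because the mean is unweighted, a benchmark with a narrow score range influences the ordering as much as one with a wide range, which is one reason to read the average only as a relative ranking aid.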