docs: add agentic coding evaluation landscape research
Comprehensive research (706 lines, dated 2026-03-30) covering evaluation dimensions, benchmark suites, and open-weight model performance for software engineering agent use cases on 64GB systems. Also gitignore evalplus_results/ (runtime outputs) and ztop/ (nested repo).
docs/agentic-coding-evaluation-landscape.md (new file, 706 lines)
# Agentic Coding Evaluation Landscape

Comprehensive research into the dimensions, benchmarks, and model performance for
evaluating LLMs in software engineering agent use cases. Research date: 2026-03-30.

---

## Table of Contents

1. [Evaluation Taxonomy](#1-evaluation-taxonomy)
2. [Dimension 1: Code Generation Accuracy](#2-code-generation-accuracy)
3. [Dimension 2: Code Editing / Patching](#3-code-editing--patching)
4. [Dimension 3: Tool Use / Function Calling](#4-tool-use--function-calling)
5. [Dimension 4: Multi-Step Planning](#5-multi-step-planning)
6. [Dimension 5: Debugging / Error Recovery](#6-debugging--error-recovery)
7. [Dimension 6: Repository Understanding](#7-repository-understanding)
8. [Dimension 7: Instruction Following](#8-instruction-following)
9. [Dimension 8: Long Context Utilization](#9-long-context-utilization)
10. [Dimension 9: Multi-Language Support](#10-multi-language-support)
11. [Dimension 10: Test Generation](#11-test-generation)
12. [Benchmark Suite Summary](#12-benchmark-suite-summary)
13. [Open-Weight Model Landscape for 64GB Systems](#13-open-weight-model-landscape-for-64gb-systems)
14. [Frontier vs. Open Model Gap](#14-frontier-vs-open-model-gap)
15. [Recommended Evaluation Stack](#15-recommended-evaluation-stack)
16. [Sources](#16-sources)

---

## 1. Evaluation Taxonomy

Recent survey work (CSLLM Survey, 2025; SE Agent Benchmark Survey, 2025) organizes
coding LLM evaluation along two orthogonal axes:

- **Capability dimension**: What is being measured (generation, editing, tool use,
  planning, debugging, comprehension, instruction following, etc.)
- **Evaluation paradigm**: How it is measured (static benchmarks, execution-based
  evaluation, agent-in-the-loop evaluation, human evaluation)

The field has moved decisively from static benchmarks (HumanEval, MBPP) toward
agent-in-the-loop evaluations (SWE-bench, Terminal-Bench, FeatureBench) that test
the full agentic loop: plan, act, observe, iterate. This shift matters because models
that score 95%+ on HumanEval can still score below 50% on realistic agentic tasks.

The ten dimensions below map to the capability axis. Each dimension lists the
benchmarks that best isolate it, though in practice most agentic benchmarks test
multiple dimensions simultaneously.

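The plan-act-observe-iterate loop that agent-in-the-loop benchmarks exercise can be sketched with a few callbacks. The function and callback names below are illustrative, not taken from any particular framework:

```python
def run_agent(task, plan, act, observe, is_done, max_steps=30):
    """Minimal plan-act-observe-iterate skeleton (names are illustrative).

    plan(task, history) -> list of pending actions
    act(action)         -> raw result (tool output, test run, ...)
    observe(result)     -> structured feedback dict for the next decision
    is_done(feedback)   -> True once the task is complete
    """
    history = []
    steps = plan(task, history)
    for _ in range(max_steps):
        if not steps:
            break
        action = steps.pop(0)
        feedback = observe(act(action))     # act on one step, observe outcome
        history.append((action, feedback))
        if is_done(feedback):
            return history                  # success: return the trajectory
        if feedback.get("error"):
            steps = plan(task, history)     # iterate: re-plan after a failure
    return history
```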
---
## 2. Code Generation Accuracy

**Definition**: Writing correct, complete code from natural-language specifications or
docstrings, measured by functional correctness (pass@k on test suites).

### Key Benchmarks

| Benchmark | Tasks | Languages | Metric | Notes |
|---|---|---|---|---|
| **HumanEval** (Chen et al., 2021) | 164 | Python | pass@k | Foundational but near-saturated; best models >95% |
| **HumanEval+** / **MBPP+** (EvalPlus, NeurIPS 2023) | 164 / 399 | Python | pass@k (80x more tests) | Catches false positives from HumanEval; ~10-15% score drops |
| **HumanEval Pro** / **MBPP Pro** (ACL 2025) | 164 / 399 | Python | pass@k on self-invoking tasks | Tests compositional reasoning; o1-mini drops from 96.2% to 76.2% |
| **BigCodeBench** (ICLR 2025) | 1,140 | Python (139 libs) | pass@1 | Multi-tool, cross-domain; best model (GPT-4o) ~60% Complete, <50% Instruct |
| **BigCodeBench-Hard** | 148 | Python | pass@1 | Hardest subset; human performance 97%, LLMs ~60% |
| **LiveCodeBench** (EMNLP 2025) | Rolling | Python | pass@k | Contamination-free: new problems added continuously from competitive programming |

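The pass@k numbers above use the unbiased estimator from the HumanEval paper (Chen et al., 2021): generate n samples per problem, count the c that pass, and estimate the probability that at least one of k random draws passes:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: samples generated per problem, c: samples that passed, k: budget.
    pass@k = 1 - C(n-c, k) / C(n, k)
    """
    if n - c < k:        # fewer than k failures: every k-subset contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n=200 samples and c=20 passing, `pass_at_k(200, 20, 1)` is 0.1 while `pass_at_k(200, 20, 10)` is well above 0.6, which is why pass@1 and pass@10 are reported separately.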
### State of the Art

- **Frontier**: Claude Opus 4.5/4.6, GPT-5.2, Gemini 3.1 Pro all score >95% on
  HumanEval, ~85% on HumanEval+, ~65% on BigCodeBench-Complete.
- **Open (64GB-feasible)**: Qwen3.5-27B-Q4 achieves ~80% on HumanEval+.
  Qwen3-Coder-30B-A3B (3.3B active, ~18GB at Q4) is strong on BigCodeBench.
  Qwen2.5-Coder-32B-Instruct matched GPT-4o on HumanEval when released.

### Key Insight

HumanEval is near-saturated and should no longer be used as a primary differentiator.
BigCodeBench and LiveCodeBench are the current gold standards for code generation
accuracy, as they test realistic multi-library tasks and resist contamination.

---

## 3. Code Editing / Patching

**Definition**: Modifying existing code correctly -- applying diffs, fixing bugs in
context, integrating new code into existing files -- rather than generating from scratch.

### Key Benchmarks

| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| **Aider Code Editing** | 133 | Edit Python files to solve Exercism problems | Tests edit format compliance + coding ability |
| **Aider Polyglot** | 225 | Edit code across 6 languages with error feedback | Two attempts per problem; measures edit+debug loop |
| **Diff-XYZ** (Oct 2025) | 3 tasks | Apply, anti-apply, generate diffs | Tests diff understanding in multiple formats |
| **EDIT-Bench** | Varied | Real-world instructed code edits | Repository-level editing tasks |
| **SWE-bench** (indirectly) | 2,294 | Generate patches that resolve GitHub issues | Requires generating correct unified diffs |

### Edit Format Considerations

Code editing performance depends heavily on the edit format used:

- **Search/replace blocks** (Aider default): Most reliable for most models
- **Unified diff**: GPT-4 Turbo was "3x less lazy" with unified diffs (Aider blog)
- **V4A diff format**: OpenAI's recommended format (published with GPT-4.1, April 2025)
- **Whole-file rewrite**: Simpler but wasteful; works with weaker models

Models that excel at generation can fail at editing because they struggle to produce
syntactically valid diffs or correctly locate the code to modify.

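To illustrate why format compliance is scored separately from coding ability, here is a minimal applier for a search/replace-style edit (modeled loosely on the search/replace-block idea, not Aider's actual parser). It must reject edits whose search text is missing or ambiguous, two common model failure modes:

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply a single search/replace edit to a file's contents.

    Rejects the edit when the search text is absent or ambiguous,
    rather than guessing a location.
    """
    count = source.count(search)
    if count == 0:
        raise ValueError("search block not found in file")
    if count > 1:
        raise ValueError(f"search block matches {count} locations")
    return source.replace(search, replace, 1)
```

A model that emits a search block that no longer matches the file (stale context, wrong whitespace) fails here before any code runs.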
### State of the Art (Aider Polyglot, March 2026)

| Model | Score | Type |
|---|---|---|
| GPT-5 | 88.0% | Frontier |
| MiniMax M2.5 | 80.2% | Open |
| DeepSeek V3.2-Exp | 74.2% | Open |
| DeepSeek-R1-0528 | 71.4% | Open |
| GLM-4.5-FP8 | 66.0% | Open |
| Qwen3-Coder-480B | 61.8% | Open (too large for 64GB) |
| Qwen3-Coder-30B-A3B | ~55-60%* | Open (fits 64GB at Q4) |

*Estimated from quantized GGUF performance data; exact Aider Polyglot score for
the 30B-A3B variant not independently confirmed.

---

## 4. Tool Use / Function Calling

**Definition**: Correctly invoking APIs, tools, or MCP servers -- selecting the right
function, constructing valid arguments, parsing responses, deciding when NOT to call.

### Key Benchmarks

| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| **BFCL V4** (Berkeley) | Thousands | Function calling accuracy across formats | De facto standard; AST-based evaluation |
| **BFCL-v3** (via EvalScope) | Multi-turn | Stateful multi-step function calling | Tests memory and context management |
| **Nexus Function Calling** | Varied | Tool selection and invocation | Broader tool landscape |
| **IFEval-FC** (2025) | 500+ | Instruction following within function schemas | JSON schema constraint adherence |
| **tau-bench** | Varied | Tool-augmented task completion | End-to-end agent tool use |

### BFCL Key Findings

The Berkeley Function Calling Leaderboard reveals a critical split:

1. **Single-turn calls**: Most frontier models score >90% accuracy
2. **Multi-turn stateful calls**: Performance drops 20-40% even for top models
3. **Abstention**: Knowing when NOT to call a function remains a major weakness
4. **Long-horizon tool use**: Memory, dynamic decision-making, and context management
   are open challenges

### State of the Art

- **Frontier**: Claude Opus 4.5/4.6, GPT-5.2 lead overall BFCL V4
- **Open**: Qwen3-Coder-480B is "comparable to Claude Sonnet 4 on Agentic Tool-Use"
  (Qwen team). For 64GB-feasible models, Qwen3-Coder-30B-A3B has a specially
  designed function call format and strong tool-use training.
  Nemotron 3 Super (120B, 12B active) was explicitly trained for tool-use workflows.

### Relevance to MCP

MCP (Model Context Protocol) servers expose tools via JSON schemas -- exactly what
BFCL tests. A model's BFCL score is a reasonable proxy for MCP tool-use competence,
though MCP adds discovery and session management complexity not yet benchmarked.

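A toy version of the argument-validation half of this problem, using a hand-rolled checker for required keys and primitive types. Real BFCL/MCP validation uses a full JSON Schema implementation; the tool and field names in the example are made up:

```python
import json

# Simplified mapping of JSON Schema primitive names to Python types.
_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def validate_tool_call(call_json: str, schema: dict) -> list:
    """Return a list of problems with a model-emitted tool call ([] = valid)."""
    try:
        call = json.loads(call_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    args = call.get("arguments", {})
    props = schema.get("properties", {})
    errors = [f"missing required argument: {key}"
              for key in schema.get("required", []) if key not in args]
    for key, value in args.items():
        if key not in props:
            errors.append(f"unexpected argument: {key}")
        elif not isinstance(value, _TYPES.get(props[key].get("type"), object)):
            errors.append(f"wrong type for argument: {key}")
    return errors
```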
---

## 5. Multi-Step Planning

**Definition**: Breaking complex tasks into subtasks, maintaining coherent plans across
many steps, tracking progress, and adapting when plans fail.

### Key Benchmarks

| Benchmark | Tasks | Steps | What It Tests | Notes |
|---|---|---|---|---|
| **SWE-bench Verified** | 500 | 5-50+ | End-to-end issue resolution | Gold standard for agentic coding |
| **SWE-bench Pro** (Scale AI) | Harder | 10-100+ | More complex issues | Best model ~46% (vs 81% on Verified) |
| **FeatureBench** (Feb 2026) | 200 | Many | Complex feature development | Claude 4.5 Opus: only 11.0% (vs 74.4% SWE-bench) |
| **Snorkel Agentic Coding** | 100 | Multi-step, 4 tiers | Plan, track, execute, recover | Claude Opus 4.5: 58%, Gemini 3 Pro: 51.6% |
| **GAIA** (ICLR 2025) | 450 | Multi-step | General assistant planning | Near saturation (~90% top scores) |
| **Gaia2** (2026) | Varied | Async | Dynamic, asynchronous environments | Adds temporal constraints and agent collaboration |
| **Terminal-Bench 2.0** | 89 | Multi-step | Terminal workflow completion | Tests plan execution in CLI environments |

### Planning-Specific Insights

The gaps between SWE-bench Verified (~81% frontier), SWE-bench Pro (~46% frontier),
and FeatureBench (~11% frontier) reveal that multi-step planning degrades rapidly
with task complexity:

- **SWE-bench Verified**: Often requires 5-15 steps (find file, understand bug, edit,
  test)
- **SWE-bench Pro**: Requires deeper reasoning about architecture and dependencies
- **FeatureBench**: Requires implementing features across multiple files with
  architectural coherence over 50+ steps

This is the dimension where frontier models most decisively outperform open models,
though the gap is narrowing with agentic RL training (Qwen3-Coder, GLM-5).

### State of the Art (SWE-bench Verified, March 2026)

| Model | Score | Type | Notes |
|---|---|---|---|
| Claude Opus 4.5 | 80.9% | Frontier | Overall leader |
| Claude Opus 4.6 | 80.8% | Frontier | |
| Gemini 3.1 Pro | 80.6% | Frontier | |
| MiniMax M2.5 | 80.2% | Open | Best open model |
| GPT-5.2 | 80.0% | Frontier | |
| GLM-5 | 77.8% | Open | 744B MoE, 40B active |
| Kimi K2.5 | 76.8% | Open | |
| DeepSeek V3.2 | 73.0% | Open | |
| Qwen3-Coder-Next | 70.6% | Open | Only 3B active params |
| DeepSeek V3.1 | 66.0% | Open | |
| Nemotron 3 Super | 60.5% | Open | 120B, 12B active |

---
## 6. Debugging / Error Recovery

**Definition**: Handling test failures, reading error messages, diagnosing root causes,
and iterating toward a fix -- including recovering from the agent's own mistakes.

### Key Benchmarks

| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| **Terminal-Bench 2.0** (Stanford/Laude) | 89 | CLI debugging, error recovery, state mgmt | Gold standard for debugging evaluation |
| **Recovery-Bench** (Letta, 2025) | Varied | Recovery from corrupted states and error traces | Tests context pollution handling |
| **AgentErrorBench** (2025) | Varied | Error detection and debugging in trajectories | 24% improvement with AgentDebug method |
| **ReliabilityBench** (Jan 2026) | Varied | Consistency and fault recovery | Multi-dimensional reliability |
| **Aider Polyglot** (indirectly) | 225 | Two-attempt model with error feedback | Second attempt tests debug-from-feedback |

### Recovery-Bench Key Findings

Recovery-Bench (Letta) specifically evaluates a critical gap: even frontier models
"lack the ability to naturally recover from failed states." The benchmark creates
scenarios with:

- Erroneous files from previous attempts
- Corrupted reasoning traces in context
- Environment artifacts from failed edits

This is directly relevant to agentic coding loops, where an agent that makes a mistake
at step 15 of a 30-step task must recover without starting over.

### Terminal-Bench 2.0 Key Findings

Terminal-Bench tests real terminal workflows: inspect environments, read/edit files,
run commands, recover from errors, and finish multi-step tasks. Error categories:

- **Execution errors**: Dominate for Claude Opus 4.5 and GPT-5.2
- **Coherence errors**: Less frequent but more damaging
- **Verification errors**: Failing to check that a fix actually worked

### State of the Art

Debugging/error recovery is one of the weakest dimensions for all models. No model
achieves >70% on Terminal-Bench 2.0 or Recovery-Bench as of March 2026. This is
a primary area where the frontier-open gap matters most for practical agentic use.

---
## 7. Repository Understanding

**Definition**: Navigating large codebases, understanding file structure, dependency
graphs, cross-file relationships, and architectural patterns.

### Key Benchmarks

| Benchmark | Tasks | Languages | What It Tests | Notes |
|---|---|---|---|---|
| **CrossCodeEval** (NeurIPS 2023) | Varied | Python, Java, TS, C# | Cross-file code completion | Requires understanding imports and dependencies |
| **RepoBench** | 3 tasks | Python | Retrieval, completion, pipeline | Tests codebase navigation |
| **RepoEval** | Varied | Python | Repository-level completion | 16 GitHub repositories |
| **RepoCod** (ACL 2025) | Varied | Multiple | Full repository code generation | "LLMs not yet ready" |
| **LoCoBench-Agent** (2025) | Varied | Multiple | Interactive repo exploration | Agent-based evaluation |
| **DependEval** | 3 tasks | Multiple | Dependency recognition, multi-file editing | Tests architectural understanding |

### Key Challenge

Repository understanding is difficult to isolate as a benchmark dimension because
it is a prerequisite for most agentic coding tasks. SWE-bench implicitly tests it
(you cannot fix a bug if you cannot find the relevant file) but does not score it
separately.

The most direct measures are:
1. **CrossCodeEval**: Do predictions improve when cross-file context is provided?
2. **RepoBench-R**: Can the model retrieve the right context from the repository?
3. **DependEval**: Can the model understand and modify dependency relationships?

### State of the Art

Models with longer context windows have an inherent advantage. The Qwen3-Coder family
was explicitly trained for "repository-scale understanding" with 256K native context
(extendable to 1M). GLM-5 uses DeepSeek Sparse Attention for 205K context.

For 64GB systems, Qwen3-Coder-30B-A3B and Qwen3-Coder-Next are the strongest choices
due to their long-context training and MoE efficiency.

---
## 8. Instruction Following

**Definition**: Following complex, multi-constraint instructions precisely --
formatting requirements, length constraints, keyword inclusion, structural rules.

### Key Benchmarks

| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| **IFEval** (Google, Nov 2023) | ~500 | 25 types of verifiable instructions | Format, length, keyword, structure constraints |
| **IFEval-Extended** (2024) | Dynamic | Generative instruction synthesis | Thousands of unique instructions from templates |
| **M-IFEval** (NAACL 2025) | Multi-lingual | French, Japanese, Spanish instruction following | Performance varies widely across languages |
| **IFEval-FC** (2025) | Varied | Instruction following in function call schemas | JSON schema constraint adherence |
| **AgentIF** (Tsinghua, 2025) | Varied | Agent-specific instruction following | Evaluates IF within agentic loops |

### Relevance to Agentic Coding

Instruction following is critical for agentic coding because:

1. **System prompts**: Agents receive detailed behavioral instructions (e.g., CLAUDE.md
   conventions in this repo)
2. **Edit format compliance**: Models must produce output in exact formats (search/replace
   blocks, unified diffs, JSON tool calls)
3. **Multi-constraint tasks**: "Fix the bug AND add a test AND update the docstring AND
   follow the project's naming conventions"

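IFEval's core idea is that every instruction is programmatically verifiable, so no LLM judge is needed. A stripped-down checker (this constraint set is chosen for illustration, not IFEval's actual 25 instruction types) looks like:

```python
def check_instructions(response: str, keywords=(), max_words=None, ends_with=None):
    """Verify IFEval-style constraints on a response; [] means all satisfied."""
    failures = []
    for kw in keywords:                       # keyword-inclusion constraints
        if kw not in response:
            failures.append(f"missing keyword: {kw}")
    if max_words is not None and len(response.split()) > max_words:
        failures.append("over word limit")    # length constraint
    if ends_with is not None and not response.rstrip().endswith(ends_with):
        failures.append("wrong ending")       # structural constraint
    return failures
```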
### State of the Art

IFEval is included in the Open LLM Leaderboard V2, making it one of the most widely
reported benchmarks. Frontier models score >90% on IFEval. Open models vary widely;
instruction-tuned variants of Qwen3.5, DeepSeek V3, and GLM-5 are competitive at >85%.

---

## 9. Long Context Utilization

**Definition**: Effectively using large context windows (32K-1M tokens) with code --
not just accepting long inputs, but actually using information from all parts.

### Key Benchmarks

| Benchmark | What It Tests | Notes |
|---|---|---|
| **RULER** (NVIDIA, COLM 2024) | Multi-needle retrieval, distractor handling | Most models degrade significantly beyond 32K |
| **Needle in a Haystack** (NIAH) | Single-fact retrieval in long context | Near-saturated for frontier models |
| **LoCoBench** (2025) | Long-context code completion and comprehension | Claude 3.5 Sonnet: 29% at short context, 3% at long |
| **LongCodeBench** (2025) | Long-context code tasks | Single-language, limited diversity |
| **LongBench** (ACL 2025) | General long-context evaluation | Reveals limitations of existing benchmarks |

### "Context Rot" Phenomenon

Research from Chroma (2025) documented "context rot": as input tokens increase,
LLM performance degrades even when the relevant information is present. This is
particularly acute for code tasks where:

- File A defines a class, file B imports it, file C tests it
- All three must be in context simultaneously
- Models must cross-reference across files, not just retrieve individual facts

### State of the Art

| Model | Native Context | Effective Context* | Notes |
|---|---|---|---|
| Nemotron 3 Super | 1M tokens | 91.75% accuracy at 1M | Best retention score |
| Qwen3-Coder-Next | 256K (1M w/ YaRN) | Good at 256K | Trained for repo-scale |
| GLM-5 | 205K | Good | DeepSeek Sparse Attention |
| DeepSeek V3.2 | 128K | Moderate | |

*"Effective context" means the model actually uses information at that distance,
not just accepts it without error.

For 64GB systems, context length is bounded by available memory. At Q4 quantization,
a 30B-A3B model can handle ~64K-128K tokens before running out of KV cache space
(depending on GQA configuration and batch size).

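The KV-cache bound can be estimated directly: per token, the cache stores one key and one value vector per layer per KV head. A back-of-envelope sizing helper, with placeholder layer/head numbers that should be checked against the actual model config:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_tokens: int, bytes_per_elem: int = 2) -> float:
    """GiB of KV cache for one sequence: K and V, per layer, per token.

    bytes_per_elem=2 assumes an fp16/bf16 cache; a quantized KV cache
    roughly halves this.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens
    return total_bytes / 2**30

# With placeholder GQA values (48 layers, 4 KV heads, head_dim 128),
# a 128K-token context costs 12.0 GiB at fp16 -- which is how an ~18GB
# model ends up limited to roughly 64K-128K tokens in a 64GB budget.
```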
---

## 10. Multi-Language Support

**Definition**: Handling different programming languages correctly -- not just Python,
but also compiled languages, systems languages, and less common languages.

### Key Benchmarks

| Benchmark | Languages | What It Tests | Notes |
|---|---|---|---|
| **Aider Polyglot** | C++, Go, Java, JS, Python, Rust | Edit + debug in 6 languages | 225 Exercism exercises |
| **Multi-SWE-bench** (NeurIPS 2025) | Python, Java, TS, JS, Go, Rust, C, C++ | Issue resolution in 8 languages | 1,632 validated issues |
| **Multi-SWE-bench mini** | 8 languages | Lightweight version | 400 instances, reduced compute |
| **SWE-PolyBench** (Amazon) | Java, JS, TS, Python | Bug fixes, features, refactoring | 2,110 curated issues |
| **SWE-smith** | 9 languages | SWE-bench style across 42 repos | 300 curated tasks |
| **HumanEval-X** | Python, C++, Java, JS, Go | Cross-lingual code generation | Translation of HumanEval |
| **BigCodeBench** | Python (139 libs) | Multi-library Python | Tests library-specific knowledge |

### Multi-SWE-bench vs SWE-PolyBench

Two competing multilingual benchmarks emerged in 2025:

- **Multi-SWE-bench** (ByteDance): 1,632 issues, 8 languages, NeurIPS 2025
  Datasets track. Also provides `mini` (400 instances) and `flash` (300 instances)
  variants for reduced compute.
- **SWE-PolyBench** (Amazon): 2,110 issues, 4 languages, with a verified subset of
  384 instances. Covers bug fixes, features, and refactoring.

### Language-Specific Performance Gaps

Open models show significant performance variation across languages:
- **Python**: Best-supported universally
- **JavaScript/TypeScript**: Second-best, strong ecosystem coverage
- **Rust, Go, C++**: Substantially weaker, especially for complex patterns
- **Low-resource languages** (Julia, Lua, Perl): StarCoder2-15B historically strong here

### State of the Art

Qwen3-Coder-Next achieves 62.8% on SWE-bench Multilingual. Among 64GB-feasible models,
Qwen3-Coder-30B-A3B benefits from Qwen's broad multilingual training data.

---

## 11. Test Generation

**Definition**: Writing tests, understanding test frameworks, achieving coverage,
generating meaningful assertions -- not just syntactically valid tests.

### Key Benchmarks

| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| **TestEval** (2024) | 210 | LLM test case generation for LeetCode programs | Basic test generation ability |
| **ULT** (2025) | 3,909 | Unit test generation for complex functions | High cyclomatic complexity, leakage-free |
| **WebApp1K** (2025) | 1,000 | Test-driven development tasks | Tests serve as both prompt and verification |
| **CoverUp** (2024) | Varied | Coverage-guided test generation | Iterative LLM-guided coverage improvement |

### Current Performance

LLM-generated tests achieve on average:
- **41.32%** accuracy (tests pass and are meaningful)
- **45.10%** statement coverage
- **30.22%** branch coverage
- **40.21%** mutation score

These numbers are from a multi-model benchmark study (2025). CoverUp's iterative
approach achieves 80% line+branch coverage (vs 47% for CodaMosa), suggesting that
agentic test generation loops significantly outperform single-shot generation.

### Key Insight

Test generation is an area where agentic approaches (generate, run, check coverage,
iterate) dramatically outperform single-shot generation. This makes it particularly
suited to the iterative agent loop and a strong candidate for local model evaluation.

### State of the Art

Code agents were shown to be "state of the art software testers" when given an
iterative loop with coverage feedback (2024 paper). No single model dominates this
dimension; the scaffolding (coverage feedback, iteration) matters more than the
base model for test generation.

---

## 12. Benchmark Suite Summary

### Tier 1: Must-Run for Agentic Coding Evaluation

These are the most informative benchmarks for evaluating a model's fitness as a
coding agent:

| Benchmark | Primary Dimensions | Run Cost | Notes |
|---|---|---|---|
| **SWE-bench Verified** | Planning, editing, repo understanding | High (500 Docker envs) | Gold standard |
| **Aider Polyglot** | Editing, multi-lang, debugging | Medium (225 problems) | Best edit benchmark |
| **BigCodeBench** | Generation, multi-tool | Medium (1,140 tasks) | Best generation benchmark |
| **BFCL V4** | Tool use, function calling | Low-Medium | De facto tool-use standard |
| **Terminal-Bench 2.0** | Debugging, planning, error recovery | High (89 real envs) | Best debugging benchmark |

### Tier 2: Valuable Supplementary Benchmarks

| Benchmark | Primary Dimensions | Notes |
|---|---|---|
| **LiveCodeBench** | Generation (contamination-free) | Rolling benchmark |
| **IFEval** | Instruction following | Quick to run, widely reported |
| **Multi-SWE-bench mini** | Multi-language, planning | 400 instances, 8 languages |
| **EvalPlus (HumanEval+/MBPP+)** | Generation (rigorous) | Good baseline |
| **Recovery-Bench** | Error recovery | Novel and underexplored |
| **FeatureBench** | Complex planning | Very hard; differentiates top models |

### Tier 3: Niche or Near-Saturated

| Benchmark | Status | Notes |
|---|---|---|
| **HumanEval** | Near-saturated | >95% for frontier models; use EvalPlus instead |
| **MBPP** | Near-saturated | Use MBPP+ instead |
| **GAIA** | Near-saturation (~90%) | Good for general agents, less code-specific |
| **Needle-in-a-Haystack** | Saturated | Use RULER for long-context |

### Commonly Cited on Model Cards

When coding-focused models publish on Hugging Face, the most frequently cited
benchmarks (in rough order of frequency) are:

1. SWE-bench Verified (agentic coding standard)
2. HumanEval / HumanEval+ (code generation baseline)
3. MBPP / MBPP+ (code generation)
4. BigCodeBench (multi-tool generation)
5. Aider Polyglot (code editing, multi-language)
6. LiveCodeBench (contamination-free generation)
7. BFCL (function calling)
8. IFEval (instruction following)
9. Multi-SWE-bench (multilingual agentic)

---

## 13. Open-Weight Model Landscape for 64GB Systems

### Models Feasible on 64GB Unified Memory (Strix Halo)

Sorted by practical fitness for agentic coding tasks. "Active" = parameters active
per forward pass for MoE models.

| Model | Total / Active | GGUF Q4 Size | SWE-bench | Key Strength |
|---|---|---|---|---|
| **Qwen3-Coder-Next** | 80B / 3B | ~46GB (Q4) | 70.6% Verified | Best efficiency ratio; agentic RL training |
| **Qwen3-Coder-30B-A3B** | 30.5B / 3.3B | ~18GB (Q4) | ~55%* (est.) | Fits easily; native 256K context; function call format |
| **Qwen3.5-35B-A3B** | 35B / 3B | ~19GB (Q4) | N/A | General + coding; fast at 112 tok/s on RTX 3090 |
| **Nemotron 3 Super** | 120B / 12B | ~64GB (Q4) | 60.5% | 1M context; PinchBench 85.6%; hybrid Mamba-Transformer |
| **Qwen3.5-27B** | 27B / 27B (dense) | ~17GB (Q4) | ~55%* | Dense; 72.4% SWE-bench reported for Qwen3.5-27B |
| **DeepSeek V3.2** | 671B / 37B | Too large at Q4 | 73.0% | Requires >200GB; not feasible for 64GB |
| **GLM-5** | 744B / 40B | Too large at Q4 | 77.8% | Best open SWE-bench; not feasible for 64GB |

*Estimated; exact scores for quantized GGUF variants not independently benchmarked.

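The "GGUF Q4 Size" column follows from bits per weight: Q4_K_M averages roughly 4.8 bits/weight once block scales are included (an approximation to verify against the actual file):

```python
def gguf_size_gib(n_params_billions: float, bits_per_weight: float = 4.8) -> float:
    """Approximate GGUF file size from parameter count and quantization width.

    4.8 bits/weight is a rough average for Q4_K_M (mixed 4/6-bit blocks
    plus scales); actual files vary by a GiB or two.
    """
    return n_params_billions * 1e9 * bits_per_weight / 8 / 2**30
```

`gguf_size_gib(30.5)` gives ~17 GiB, consistent with the ~18GB entry for Qwen3-Coder-30B-A3B, and `gguf_size_gib(80)` gives ~45 GiB, consistent with ~46GB for Qwen3-Coder-Next.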
### Recommended Configuration for 64GB Strix Halo
|
||||
|
||||
**Primary coding agent**: Qwen3-Coder-30B-A3B-Instruct (Q4_K_M, ~18GB)
|
||||
- Fits with ample room for KV cache and context
|
||||
- Specially designed function call format
|
||||
- Native 256K context, extendable to 1M
|
||||
- Strong agentic coding training
|
||||
- Fast inference due to 3.3B active parameters
|
||||
|
||||
**Stretch option**: Qwen3-Coder-Next (Q4, ~46GB)
|
||||
- Tighter fit but significantly stronger (70.6% SWE-bench Verified)
|
||||
- 3B active parameters = good generation speed
|
||||
- Leaves ~18GB for KV cache and system
|
||||
|
||||
**Dense alternative**: Qwen3.5-27B (Q4_K_M, ~17GB)
|
||||
- When you need strong general + coding ability
|
||||
- Dense model = more predictable behavior
|
||||
- Good baseline for comparison
|
||||
|
||||
### Older Models: Still Relevant?
|
||||
|
||||
- **CodeLlama-34B** (Meta, 2023): Superseded by Qwen and DeepSeek families. Only
|
||||
relevant for historical comparison or if specific fine-tunes are needed.
|
||||
- **StarCoder2-15B** (ServiceNow/HF/NVIDIA, 2024): Outperformed CodeLlama-34B at half
|
||||
the size. Still competitive for low-resource languages (Julia, Lua, Perl) but
|
||||
otherwise superseded by Qwen3-Coder.
|
||||
- **DeepSeek-Coder-V2-Lite-16B** (2024): Was competitive but now clearly behind
|
||||
Qwen3-Coder-30B-A3B and Qwen3-Coder-Next.
|
||||
|
||||
---
|
||||
|
||||
## 14. Frontier vs. Open Model Gap
|
||||
|
||||
### Gap Analysis by Dimension (March 2026)
|
||||
|
||||
| Dimension | Frontier Best | Open Best (64GB) | Gap | Trend |
|
||||
|---|---|---|---|---|
|
||||
| Code Generation | ~98% HumanEval | ~85% HumanEval | Small | Closing rapidly |
|
||||
| Code Editing | 88% Aider Polyglot | ~60% Aider Polyglot | Large | Closing (MoE helps) |
|
||||
| Tool Use | >90% BFCL | ~80% BFCL | Moderate | Closing with dedicated training |
|
||||
| Multi-Step Planning | 80.9% SWE-bench | 70.6% SWE-bench (Coder-Next) | Moderate | Narrowing with agentic RL |
|
||||
| Debugging/Recovery | ~65% Terminal-Bench | ~45% Terminal-Bench* | Large | Widest persistent gap |
|
||||
| Repo Understanding | Excellent | Good (long-context models) | Moderate | Closing with 256K+ contexts |
|
||||
| Instruction Following | >90% IFEval | >85% IFEval | Small | Nearly closed |
|
||||
| Long Context | 1M+ effective | 256K effective | Moderate | Hardware-limited for local |
|
||||
| Multi-Language | 80%+ Multi-SWE | 62.8% Multi-SWE | Moderate | Improving with diverse training |
|
||||
| Test Generation | ~50% coverage | ~40% coverage | Small | Scaffolding matters more |
|
||||
|
||||
*Estimated; Terminal-Bench scores not widely reported for 64GB-feasible open models.
|
||||
|
||||
### Key Observations

1. **Code generation is nearly solved** for simple tasks. The gap has shifted to
   complex, multi-step, multi-file tasks.

2. **Debugging/error recovery is the widest gap** and the hardest to close. This is
   where frontier models' larger parameter counts and RLHF refinement matter most.

3. **MoE architectures are the bridge** for 64GB systems. Models like Qwen3-Coder-Next
   (80B total, 3B active) achieve SWE-bench scores comparable to models with 10-20x
   more active parameters.

4. **Agentic RL training** (as used in Qwen3-Coder and GLM-5) is the primary driver of
   open-model improvement on the planning and debugging dimensions.

5. **Scaffolding equalizes** many gaps. A well-designed agent scaffold (SWE-Agent,
   OpenHands, Aider) can make a 30B model perform comparably to a raw 400B model.

---
## 15. Recommended Evaluation Stack

For evaluating models locally on the Strix Halo system, the following stack covers
all 10 dimensions using tools already referenced in this project's `docs/references.md`:

### Inspect AI (Primary Framework)

Inspect AI supports multiple benchmarks in a unified framework:

- HumanEval (code generation)
- BigCodeBench (multi-tool generation)
- BFCL (function calling / tool use)
- GAIA (multi-step planning)
- IFEval (instruction following)

Run against an OpenAI-compatible endpoint (ollama or llama.cpp server).

### EvalPlus (Code Generation)

- HumanEval+ and MBPP+ with native ollama support
- More rigorous than base HumanEval/MBPP
- Already configured in this project's `scripts/agentic/` framework

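EvalPlus scores are pass@k rates. The standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) is easy to compute directly when n samples per task yield c passing solutions:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from Chen et al. (2021): 1 - C(n-c, k) / C(n, k),
    the chance that at least one of k draws from n samples (c correct) passes."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 3 of 10 samples passed a task; estimate pass@1
print(round(pass_at_k(10, 3, 1), 2))  # -> 0.3
```

Averaging this per-task estimate across the suite gives the reported pass@k.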
### BigCodeBench (Multi-Tool Generation)

- 1,140 tasks across 139 libraries
- Already listed in `docs/references.md`
- Tests multi-library, cross-domain code generation

### Aider (Code Editing + Multi-Language)

- Built-in polyglot benchmark: 225 exercises across 6 languages
- Tests edit format compliance, multi-language support, debugging loop
- Can be run against any OpenAI-compatible endpoint

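Much of what the polyglot benchmark measures is edit-format compliance: the model must emit SEARCH/REPLACE blocks that apply cleanly. A minimal applier for a single block (an illustrative sketch, not Aider's actual parser) shows the contract:

```python
def apply_edit(text: str, block: str) -> str:
    """Apply one SEARCH/REPLACE block in the style of Aider's diff edit
    format. Minimal sketch: the SEARCH text must match exactly once."""
    head, _, tail = block.partition("=======")
    search = head.split("<<<<<<< SEARCH\n", 1)[1].rstrip("\n")
    replace = tail.split(">>>>>>> REPLACE", 1)[0].strip("\n")
    if text.count(search) != 1:
        raise ValueError("SEARCH text must match exactly once")
    return text.replace(search, replace)

block = "<<<<<<< SEARCH\nretries = 1\n=======\nretries = 3\n>>>>>>> REPLACE\n"
print(apply_edit("retries = 1\ntimeout = 30", block))
```

Models that drift from the expected markers, or emit SEARCH text that does not match the file verbatim, fail the edit step regardless of whether their intended change was correct.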
### BFCL (Tool Use)

- Installable via `pip install bfcl-eval`
- Tests function-calling accuracy
- Already listed in `docs/references.md`

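Function-calling accuracy here means the emitted call matches a declared schema. A toy checker in that spirit, with a hypothetical `run_tests` tool (not BFCL's actual scorer or categories):

```python
import json

# Hypothetical tool declaration in the common JSON-schema function-calling style.
TOOL = {
    "name": "run_tests",
    "parameters": {
        "type": "object",
        "properties": {"path": {"type": "string"}, "verbose": {"type": "boolean"}},
        "required": ["path"],
    },
}

def call_matches(schema: dict, call_json: str) -> bool:
    """Loose match: correct tool name, all required arguments present,
    and no arguments outside the declared schema."""
    call = json.loads(call_json)
    props = schema["parameters"]["properties"]
    args = call.get("arguments", {})
    return (
        call.get("name") == schema["name"]
        and all(req in args for req in schema["parameters"]["required"])
        and all(arg in props for arg in args)
    )

print(call_matches(TOOL, '{"name": "run_tests", "arguments": {"path": "tests/"}}'))
```

Real scorers additionally type-check argument values and handle parallel and multi-turn calls, which is where the frontier/open gap in the table above shows up.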
### Practical Execution Order

1. **Quick smoke test**: EvalPlus (HumanEval+) -- ~30 min
2. **Generation depth**: BigCodeBench-Hard (148 tasks) -- ~2-4 hours
3. **Editing ability**: Aider polyglot benchmark -- ~4-6 hours
4. **Tool use**: BFCL eval -- ~1-2 hours
5. **Instruction following**: IFEval via Inspect AI -- ~1 hour
6. **Full agentic**: SWE-bench Verified (if Docker resources available) -- ~24+ hours

---
## 16. Sources

### Papers

- Chen et al. (2021). "Evaluating Large Language Models Trained on Code." arXiv:2107.03374. [HumanEval]
- Liu et al. (2023). "Is Your Code Generated by ChatGPT Really Correct?" NeurIPS 2023. [EvalPlus/HumanEval+]
- Jimenez et al. (2024). "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024.
- Zhuo et al. (2024). "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls." ICLR 2025.
- Patil et al. (2025). "The Berkeley Function Calling Leaderboard (BFCL)." ICML 2025.
- Mialon et al. (2023). "GAIA: A Benchmark for General AI Assistants." arXiv:2311.12983.
- Hsieh et al. (2024). "RULER: What's the Real Context Size of Your Long-Context Language Models?" COLM 2024.
- Zhou et al. (2023). "Instruction-Following Evaluation for Large Language Models." arXiv:2311.07911. [IFEval]
- Terminal-Bench team (2026). "Terminal-Bench: Benchmarking Agents on Hard CLI Tasks." Stanford/Laude Institute.
- FeatureBench (Feb 2026). "Benchmarking Agentic Coding for Complex Feature Development." arXiv:2602.10975.
- HumanEval Pro / MBPP Pro (ACL 2025). "Evaluating LLMs on Self-invoking Code Generation Task."
- Multi-SWE-bench (NeurIPS 2025). "A Multilingual Benchmark for Issue Resolving."
- SWE-PolyBench (Amazon, 2025). "A Multi-Language Benchmark for Repository-Level Evaluation."
- Recovery-Bench (Letta, 2025). "Evaluating LLMs' Ability to Recover from Mistakes."
- Diff-XYZ (Oct 2025). "A Benchmark for Evaluating Diff Understanding."

### Leaderboards and Live Data

- SWE-bench Leaderboard: https://www.swebench.com/
- SWE-bench Verified Leaderboard: https://llm-stats.com/benchmarks/swe-bench-verified
- SWE-rebench Leaderboard: https://swe-rebench.com/
- Aider LLM Leaderboards: https://aider.chat/docs/leaderboards/
- BFCL V4 Leaderboard: https://gorilla.cs.berkeley.edu/leaderboard.html
- EvalPlus Leaderboard: https://evalplus.github.io/leaderboard.html
- BigCodeBench Leaderboard: https://huggingface.co/blog/leaderboard-bigcodebench
- Terminal-Bench Leaderboard: https://www.tbench.ai/
- Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
- Scale Labs SWE-bench Pro: https://labs.scale.com/leaderboard/swe_bench_pro_public
- Artificial Analysis Terminal-Bench: https://artificialanalysis.ai/evaluations/terminalbench-hard

### Model Documentation

- Qwen3-Coder: https://github.com/QwenLM/Qwen3-Coder
- Qwen3-Coder-Next: https://qwen.ai/blog?id=qwen3-coder-next
- Qwen3-Coder-30B-A3B GGUF: https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
- GLM-5: https://huggingface.co/zai-org/GLM-5
- Nemotron 3 Super: https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/
- DeepSeek V3 series: https://www.bentoml.com/blog/the-complete-guide-to-deepseek-models-from-v3-to-r1-and-beyond

### Tools and Frameworks

- Inspect AI: https://github.com/UKGovernmentBEIS/inspect_ai
- Inspect Evals catalog: https://inspect.aisi.org.uk/evals/
- EvalPlus: https://github.com/evalplus/evalplus
- BigCodeBench: https://github.com/bigcode-project/bigcodebench
- BFCL: https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard
- Aider: https://aider.chat/
- Aider Polyglot benchmark: https://github.com/Aider-AI/polyglot-benchmark
- LiveCodeBench: https://livecodebench.github.io/
- CoverUp (test generation): https://arxiv.org/html/2403.16218v3