
Agentic Coding Evaluation Landscape

Comprehensive research into the dimensions, benchmarks, and model performance for evaluating LLMs in software engineering agent use cases. Research date: 2026-03-30.


Table of Contents

  1. Evaluation Taxonomy
  2. Dimension 1: Code Generation Accuracy
  3. Dimension 2: Code Editing / Patching
  4. Dimension 3: Tool Use / Function Calling
  5. Dimension 4: Multi-Step Planning
  6. Dimension 5: Debugging / Error Recovery
  7. Dimension 6: Repository Understanding
  8. Dimension 7: Instruction Following
  9. Dimension 8: Long Context Utilization
  10. Dimension 9: Multi-Language Support
  11. Dimension 10: Test Generation
  12. Benchmark Suite Summary
  13. Open-Weight Model Landscape for 64GB Systems
  14. Frontier vs. Open Model Gap
  15. Recommended Evaluation Stack
  16. Sources

1. Evaluation Taxonomy

Recent survey work (CSLLM Survey, 2025; SE Agent Benchmark Survey, 2025) organizes coding LLM evaluation along two orthogonal axes:

  • Capability dimension: What is being measured (generation, editing, tool use, planning, debugging, comprehension, instruction following, etc.)
  • Evaluation paradigm: How it is measured (static benchmarks, execution-based evaluation, agent-in-the-loop evaluation, human evaluation)

The field has moved decisively from static benchmarks (HumanEval, MBPP) toward agent-in-the-loop evaluations (SWE-bench, Terminal-Bench, FeatureBench) that test the full agentic loop: plan, act, observe, iterate. This shift matters because models that score 95%+ on HumanEval can still fall below 50% on realistic agentic tasks.

The ten dimensions below map to the capability axis. Each dimension lists the benchmarks that best isolate it, though in practice most agentic benchmarks test multiple dimensions simultaneously.
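The plan-act-observe-iterate loop that agent-in-the-loop benchmarks exercise can be sketched abstractly. The four callables below are illustrative stand-ins (an LLM planner, a tool executor, and so on), not part of any benchmark harness:

```python
def agent_loop(plan, act, observe, done, max_steps=30):
    """Minimal plan-act-observe-iterate loop of the kind that
    agent-in-the-loop benchmarks test end to end."""
    state = None
    for step in range(max_steps):
        action = plan(state)     # decide the next action from current state
        result = act(action)     # execute it (tool call, edit, command)
        state = observe(result)  # fold the observation back into state
        if done(state):          # e.g. tests pass, issue resolved
            return state, step + 1
    return state, max_steps

# Toy run: the "plan" just increments a counter until it reaches 3.
print(agent_loop(lambda s: (s or 0) + 1, lambda a: a, lambda s: s,
                 lambda s: s >= 3))  # → (3, 3)
```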


2. Code Generation Accuracy

Definition: Writing correct, complete code from natural-language specifications or docstrings, measured by functional correctness (pass@k on test suites).
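The pass@k numbers below use the unbiased estimator introduced with HumanEval (Chen et al., 2021): generate n samples per problem, count the c that pass, and estimate the probability that at least one of k drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): n samples
    generated per problem, c of which pass the test suite."""
    if n - c < k:
        return 1.0  # fewer failures than draws, so some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 2 samples and 1 passing, a single draw passes half the time.
print(pass_at_k(2, 1, 1))  # → 0.5
```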

Key Benchmarks

| Benchmark | Tasks | Languages | Metric | Notes |
|---|---|---|---|---|
| HumanEval (Chen et al., 2021) | 164 | Python | pass@k | Foundational but near-saturated; best models >95% |
| HumanEval+ / MBPP+ (EvalPlus, NeurIPS 2023) | 164 / 399 | Python | pass@k (80x more tests) | Catches false positives from HumanEval; ~10-15% score drops |
| HumanEval Pro / MBPP Pro (ACL 2025) | 164 / 399 | Python | pass@k on self-invoking tasks | Tests compositional reasoning; o1-mini drops from 96.2% to 76.2% |
| BigCodeBench (ICLR 2025) | 1,140 | Python (139 libs) | pass@1 | Multi-tool, cross-domain; best model (GPT-4o) ~60% Complete, <50% Instruct |
| BigCodeBench-Hard | 148 | Python | pass@1 | Hardest subset; human performance 97%, LLMs ~60% |
| LiveCodeBench (EMNLP 2025) | Rolling | Python | pass@k | Contamination-free: new problems added continuously from competitive programming |

State of the Art

  • Frontier: Claude Opus 4.5/4.6, GPT-5.2, Gemini 3.1 Pro all score >95% on HumanEval, ~85% on HumanEval+, ~65% on BigCodeBench-Complete.
  • Open (64GB-feasible): Qwen3.5-27B-Q4 achieves ~80% on HumanEval+. Qwen3-Coder-30B-A3B (3.3B active, ~18GB at Q4) is strong on BigCodeBench. Qwen2.5-Coder-32B-Instruct matched GPT-4o on HumanEval when released.

Key Insight

HumanEval is near-saturated and should no longer be used as a primary differentiator. BigCodeBench and LiveCodeBench are the current gold standards for code generation accuracy, as they test realistic multi-library tasks and resist contamination.


3. Code Editing / Patching

Definition: Modifying existing code correctly -- applying diffs, fixing bugs in context, integrating new code into existing files -- rather than generating from scratch.

Key Benchmarks

| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| Aider Code Editing | 133 | Edit Python files to solve Exercism problems | Tests edit format compliance + coding ability |
| Aider Polyglot | 225 | Edit code across 6 languages with error feedback | Two attempts per problem; measures edit+debug loop |
| Diff-XYZ (Oct 2025) | 3 tasks | Apply, anti-apply, generate diffs | Tests diff understanding in multiple formats |
| EDIT-Bench | Varied | Real-world instructed code edits | Repository-level editing tasks |
| SWE-bench (indirectly) | 2,294 | Generate patches that resolve GitHub issues | Requires generating correct unified diffs |

Edit Format Considerations

Code editing performance depends heavily on the edit format used:

  • Search/Replace blocks (Aider default): Most reliable for most models
  • Unified diff: GPT-4 Turbo was "3x less lazy" with unified diffs (Aider blog)
  • V4A diff format: OpenAI's recommended format (published with GPT-4.1, April 2025)
  • Whole-file rewrite: Simpler but wasteful; works with weaker models

Models that excel at generation can fail at editing because they struggle to produce syntactically valid diffs or correctly locate the code to modify.
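To illustrate why format compliance is its own failure mode: a search/replace edit is only applicable when the search text matches the file exactly once. A simplified sketch of that validation rule (not Aider's actual implementation):

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply a single search/replace edit block. Aider-style editors
    reject an edit unless the search text matches exactly once; this
    mirrors that rule in simplified form."""
    matches = source.count(search)
    if matches != 1:
        raise ValueError(f"search block matched {matches} times, expected 1")
    return source.replace(search, replace)

code = "def add(a, b):\n    return a - b\n"
print(apply_search_replace(code, "return a - b", "return a + b"))
```

A model that emits a search block with slightly wrong whitespace, or one that matches in two places, produces an edit the scaffold must reject even if the intended fix was correct.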

State of the Art (Aider Polyglot, March 2026)

| Model | Score | Type |
|---|---|---|
| GPT-5 | 88.0% | Frontier |
| MiniMax M2.5 | 80.2% | Open |
| DeepSeek V3.2-Exp | 74.2% | Open |
| DeepSeek-R1-0528 | 71.4% | Open |
| GLM-4.5-FP8 | 66.0% | Open |
| Qwen3-Coder-480B | 61.8% | Open (too large for 64GB) |
| Qwen3-Coder-30B-A3B | ~55-60%* | Open (fits 64GB at Q4) |

*Estimated from quantized GGUF performance data; exact Aider Polyglot score for the 30B-A3B variant not independently confirmed.


4. Tool Use / Function Calling

Definition: Correctly invoking APIs, tools, or MCP servers -- selecting the right function, constructing valid arguments, parsing responses, deciding when NOT to call.

Key Benchmarks

| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| BFCL V4 (Berkeley) | Thousands | Function calling accuracy across formats | De facto standard; AST-based evaluation |
| BFCL-v3 (via EvalScope) | Multi-turn | Stateful multi-step function calling | Tests memory and context management |
| Nexus Function Calling | Varied | Tool selection and invocation | Broader tool landscape |
| IFEval-FC (2025) | 500+ | Instruction following within function schemas | JSON schema constraint adherence |
| tau-bench | Varied | Tool-augmented task completion | End-to-end agent tool use |

BFCL Key Findings

The Berkeley Function Calling Leaderboard reveals a critical split:

  1. Single-turn calls: Most frontier models score >90% accuracy
  2. Multi-turn stateful calls: Performance drops 20-40% even for top models
  3. Abstention: Knowing when NOT to call a function remains a major weakness
  4. Long-horizon tool use: Memory, dynamic decision-making, and context management are open challenges

State of the Art

  • Frontier: Claude Opus 4.5/4.6, GPT-5.2 lead overall BFCL V4
  • Open: Qwen3-Coder-480B is "comparable to Claude Sonnet 4 on Agentic Tool-Use" (Qwen team). For 64GB-feasible models, Qwen3-Coder-30B-A3B has a specially designed function call format and strong tool-use training. Nemotron 3 Super (120B, 12B active) was explicitly trained for tool-use workflows.

Relevance to MCP

MCP (Model Context Protocol) servers expose tools via JSON schemas -- exactly what BFCL tests. A model's BFCL score is a reasonable proxy for MCP tool-use competence, though MCP adds discovery and session management complexity not yet benchmarked.
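The core check these evaluations automate can be approximated by validating a model-emitted argument string against a tool's schema. A toy sketch: the `read_file` tool and its fields are hypothetical, and real validators (and MCP servers) use full JSON Schema rather than this two-type subset:

```python
import json

# Hypothetical tool schema in the JSON-Schema style used by
# function-calling APIs and MCP servers (illustrative names only).
TOOL_SCHEMA = {
    "name": "read_file",
    "parameters": {
        "type": "object",
        "required": ["path"],
        "properties": {
            "path": {"type": "string"},
            "max_bytes": {"type": "integer"},
        },
    },
}

TYPE_CHECKS = {"string": str, "integer": int}

def validate_call(raw_arguments: str, schema: dict) -> list:
    """Return schema violations for a model-emitted argument string."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    params = schema["parameters"]
    errors = [f"missing required argument: {name}"
              for name in params["required"] if name not in args]
    for name, value in args.items():
        prop = params["properties"].get(name)
        if prop is None:
            errors.append(f"unknown argument: {name}")
        elif not isinstance(value, TYPE_CHECKS[prop["type"]]):
            errors.append(f"wrong type for argument: {name}")
    return errors

print(validate_call('{"path": "src/main.py"}', TOOL_SCHEMA))  # → []
print(validate_call('{"max_bytes": "10"}', TOOL_SCHEMA))
```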


5. Multi-Step Planning

Definition: Breaking complex tasks into subtasks, maintaining coherent plans across many steps, tracking progress, and adapting when plans fail.

Key Benchmarks

| Benchmark | Tasks | Steps | What It Tests | Notes |
|---|---|---|---|---|
| SWE-bench Verified | 500 | 5-50+ | End-to-end issue resolution | Gold standard for agentic coding |
| SWE-bench Pro (Scale AI) | Harder | 10-100+ | More complex issues | Best model ~46% (vs 81% on Verified) |
| FeatureBench (Feb 2026) | 200 | Many | Complex feature development | Claude 4.5 Opus: only 11.0% (vs 74.4% SWE-bench) |
| Snorkel Agentic Coding | 100 | Multi-step, 4 tiers | Plan, track, execute, recover | Claude Opus 4.5: 58%, Gemini 3 Pro: 51.6% |
| GAIA (ICLR 2025) | 450 | Multi-step | General assistant planning | Near saturation (~90% top scores) |
| Gaia2 (2026) | Varied | Async | Dynamic, asynchronous environments | Adds temporal constraints and agent collaboration |
| Terminal-Bench 2.0 | 89 | Multi-step | Terminal workflow completion | Tests plan execution in CLI environments |

Planning-Specific Insights

The gaps between SWE-bench Verified (~81% frontier), SWE-bench Pro (~46% frontier), and FeatureBench (~11% frontier) reveal that multi-step planning degrades rapidly with task complexity:

  • SWE-bench Verified: Often requires 5-15 steps (find file, understand bug, edit, test)
  • SWE-bench Pro: Requires deeper reasoning about architecture and dependencies
  • FeatureBench: Requires implementing features across multiple files with architectural coherence over 50+ steps

This is the dimension where frontier models most decisively outperform open models, though the gap is narrowing with agentic RL training (Qwen3-Coder, GLM-5).

State of the Art (SWE-bench Verified, March 2026)

| Model | Score | Type | Notes |
|---|---|---|---|
| Claude Opus 4.5 | 80.9% | Frontier | Overall leader |
| Claude Opus 4.6 | 80.8% | Frontier | |
| Gemini 3.1 Pro | 80.6% | Frontier | |
| MiniMax M2.5 | 80.2% | Open | Best open model |
| GPT-5.2 | 80.0% | Frontier | |
| GLM-5 | 77.8% | Open | 744B MoE, 40B active |
| Kimi K2.5 | 76.8% | Open | |
| DeepSeek V3.2 | 73.0% | Open | |
| Qwen3-Coder-Next | 70.6% | Open | Only 3B active params |
| DeepSeek V3.1 | 66.0% | Open | |
| Nemotron 3 Super | 60.5% | Open | 120B, 12B active |

6. Debugging / Error Recovery

Definition: Handling test failures, reading error messages, diagnosing root causes, and iterating toward a fix -- including recovering from the agent's own mistakes.

Key Benchmarks

| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| Terminal-Bench 2.0 (Stanford/Laude) | 89 | CLI debugging, error recovery, state mgmt | Gold standard for debugging evaluation |
| Recovery-Bench (Letta, 2025) | Varied | Recovery from corrupted states and error traces | Tests context pollution handling |
| AgentErrorBench (2025) | Varied | Error detection and debugging in trajectories | 24% improvement with AgentDebug method |
| ReliabilityBench (Jan 2026) | Varied | Consistency and fault recovery | Multi-dimensional reliability |
| Aider Polyglot (indirectly) | 225 | Two-attempt model with error feedback | Second attempt tests debug-from-feedback |

Recovery-Bench Key Findings

Recovery-Bench (Letta) specifically evaluates a critical gap: even frontier models "lack the ability to naturally recover from failed states." The benchmark creates scenarios with:

  • Erroneous files from previous attempts
  • Corrupted reasoning traces in context
  • Environment artifacts from failed edits

This is directly relevant to agentic coding loops where an agent makes a mistake at step 15 of a 30-step task and must recover without starting over.

Terminal-Bench 2.0 Key Findings

Terminal-Bench tests real terminal workflows: inspect environments, read/edit files, run commands, recover from errors, and finish multi-step tasks. Error categories:

  • Execution errors: Dominate for Claude Opus 4.5 and GPT-5.2
  • Coherence errors: Less frequent but more damaging
  • Verification errors: Failing to check that a fix actually worked

State of the Art

Debugging/error recovery is one of the weakest dimensions for all models. No model achieves >70% on Terminal-Bench 2.0 or Recovery-Bench as of March 2026. This is a primary area where the frontier-open gap matters most for practical agentic use.


7. Repository Understanding

Definition: Navigating large codebases, understanding file structure, dependency graphs, cross-file relationships, and architectural patterns.

Key Benchmarks

| Benchmark | Tasks | Languages | What It Tests | Notes |
|---|---|---|---|---|
| CrossCodeEval (NeurIPS 2023) | Varied | Python, Java, TS, C# | Cross-file code completion | Requires understanding imports and dependencies |
| RepoBench | 3 tasks | Python | Retrieval, completion, pipeline | Tests codebase navigation |
| RepoEval | Varied | Python | Repository-level completion | 16 GitHub repositories |
| RepoCod (ACL 2025) | Varied | Multiple | Full repository code generation | "LLMs not yet ready" |
| LoCoBench-Agent (2025) | Varied | Multiple | Interactive repo exploration | Agent-based evaluation |
| DependEval | 3 tasks | Multiple | Dependency recognition, multi-file editing | Tests architectural understanding |

Key Challenge

Repository understanding is difficult to isolate as a benchmark dimension because it is a prerequisite for most agentic coding tasks. SWE-bench implicitly tests it (you cannot fix a bug if you cannot find the relevant file), but does not score it separately.

The most direct measures are:

  1. CrossCodeEval: Do predictions improve when cross-file context is provided?
  2. RepoBench-R: Can the model retrieve the right context from the repository?
  3. DependEval: Can the model understand and modify dependency relationships?

State of the Art

Models with longer context windows have an inherent advantage. The Qwen3-Coder family was explicitly trained for "repository-scale understanding" with 256K native context (extendable to 1M). GLM-5 uses DeepSeek Sparse Attention for 205K context.

For 64GB systems, Qwen3-Coder-30B-A3B and Qwen3-Coder-Next are the strongest choices due to their long-context training and MoE efficiency.


8. Instruction Following

Definition: Following complex, multi-constraint instructions precisely -- formatting requirements, length constraints, keyword inclusion, structural rules.

Key Benchmarks

| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| IFEval (Google, Nov 2023) | ~500 | 25 types of verifiable instructions | Format, length, keyword, structure constraints |
| IFEval-Extended (2024) | Dynamic | Generative instruction synthesis | Thousands of unique instructions from templates |
| M-IFEval (NAACL 2025) | Multi-lingual | French, Japanese, Spanish instruction following | Performance varies widely across languages |
| IFEval-FC (2025) | Varied | Instruction following in function call schemas | JSON schema constraint adherence |
| AgentIF (Tsinghua, 2025) | Varied | Agent-specific instruction following | Evaluates IF within agentic loops |

Relevance to Agentic Coding

Instruction following is critical for agentic coding because:

  1. System prompts: Agents receive detailed behavioral instructions (e.g., CLAUDE.md conventions in this repo)
  2. Edit format compliance: Models must produce output in exact formats (search/replace blocks, unified diffs, JSON tool calls)
  3. Multi-constraint tasks: "Fix the bug AND add a test AND update the docstring AND follow the project's naming conventions"

State of the Art

IFEval is included in the Open LLM Leaderboard V2, making it one of the most widely reported benchmarks. Frontier models score >90% on IFEval. Open models vary widely; instruction-tuned variants of Qwen3.5, DeepSeek V3, and GLM-5 are competitive at >85%.


9. Long Context Utilization

Definition: Effectively using large context windows (32K-1M tokens) with code -- not just accepting long inputs, but actually using information from all parts.

Key Benchmarks

| Benchmark | What It Tests | Notes |
|---|---|---|
| RULER (NVIDIA, COLM 2024) | Multi-needle retrieval, distractor handling | Most models degrade significantly beyond 32K |
| Needle in a Haystack (NIAH) | Single-fact retrieval in long context | Near-saturated for frontier models |
| LoCoBench (2025) | Long-context code completion and comprehension | Claude 3.5 Sonnet: 29% at short context, 3% at long |
| LongCodeBench (2025) | Long-context code tasks | Single-language, limited diversity |
| LongBench (ACL 2025) | General long-context evaluation | Reveals limitations of existing benchmarks |

"Context Rot" Phenomenon

Research from Chroma (2025) documented "context rot": as input tokens increase, LLM performance degrades even when the relevant information is present. This is particularly acute for code tasks where:

  • File A defines a class, file B imports it, file C tests it
  • All three must be in context simultaneously
  • Models must cross-reference across files, not just retrieve individual facts

State of the Art

| Model | Native Context | Effective Context* | Notes |
|---|---|---|---|
| Nemotron 3 Super | 1M tokens | 91.75% accuracy at 1M | Best retention score |
| Qwen3-Coder-Next | 256K (1M w/ Yarn) | Good at 256K | Trained for repo-scale |
| GLM-5 | 205K | Good | DeepSeek Sparse Attention |
| DeepSeek V3.2 | 128K | Moderate | |

*"Effective context" means the model actually uses information at that distance, not just accepts it without error.

For 64GB systems, context length is bounded by available memory. At Q4 quantization, a 30B-A3B model can handle ~64K-128K tokens before running out of KV cache space (depending on GQA configuration and batch size).
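The KV-cache bound can be sanity-checked with simple arithmetic: two tensors (K and V) per layer, each of size kv_heads × head_dim per token. The configuration below (48 layers, 4 KV heads via GQA, head dimension 128, FP16 cache) is an assumed illustration, not the actual Qwen3-Coder-30B-A3B configuration:

```python
def kv_cache_gib(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Back-of-envelope KV-cache size in GiB: 2 tensors (K and V) per
    layer, each layers * kv_heads * head_dim * seq_len elements."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

# Assumed GQA config: 48 layers, 4 KV heads, head_dim 128, FP16 cache.
print(kv_cache_gib(48, 4, 128, 131_072))  # → 12.0 (GiB at 128K context)
```

Under these assumptions a 128K-token cache costs ~12 GiB, which alongside an ~18GB Q4 model still fits comfortably in 64GB; quantizing the KV cache (e.g. to Q8) roughly halves that figure.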


10. Multi-Language Support

Definition: Handling different programming languages correctly -- not just Python, but also compiled languages, systems languages, and less common languages.

Key Benchmarks

| Benchmark | Languages | What It Tests | Notes |
|---|---|---|---|
| Aider Polyglot | C++, Go, Java, JS, Python, Rust | Edit + debug in 6 languages | 225 Exercism exercises |
| Multi-SWE-bench (NeurIPS 2025) | Python, Java, TS, JS, Go, Rust, C, C++ | Issue resolution in 8 languages | 1,632 validated issues |
| Multi-SWE-bench mini | 8 languages | Lightweight version | 400 instances, reduced compute |
| SWE-PolyBench (Amazon) | Java, JS, TS, Python | Bug fixes, features, refactoring | 2,110 curated issues |
| SWE-smith | 9 languages | SWE-bench style across 42 repos | 300 curated tasks |
| HumanEval-X | Python, C++, Java, JS, Go | Cross-lingual code generation | Translation of HumanEval |
| BigCodeBench | Python (139 libs) | Multi-library Python | Tests library-specific knowledge |

Multi-SWE-bench vs SWE-PolyBench

Two competing multilingual benchmarks emerged in 2025:

  • Multi-SWE-bench (ByteDance): 1,632 issues, 8 languages, NeurIPS 2025 Datasets track. Also provides mini (400 instances) and flash (300 instances) variants for reduced compute.
  • SWE-PolyBench (Amazon): 2,110 issues, 4 languages, with a verified subset of 384 instances. Covers bug fixes, features, and refactoring.

Language-Specific Performance Gaps

Open models show significant performance variation across languages:

  • Python: Best-supported universally
  • JavaScript/TypeScript: Second-best, strong ecosystem coverage
  • Rust, Go, C++: Substantially weaker, especially for complex patterns
  • Low-resource languages (Julia, Lua, Perl): StarCoder2-15B historically strong here

State of the Art

Qwen3-Coder-Next achieves 62.8% on SWE-Bench Multilingual. For 64GB-feasible models, the Qwen3-Coder-30B-A3B benefits from Qwen's broad multilingual training data.


11. Test Generation

Definition: Writing tests, understanding test frameworks, achieving coverage, generating meaningful assertions -- not just syntactically valid tests.

Key Benchmarks

| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| TestEval (2024) | 210 | LLM test case generation for LeetCode programs | Basic test generation ability |
| ULT (2025) | 3,909 | Unit test generation for complex functions | High cyclomatic complexity, leakage-free |
| WebApp1K (2025) | 1,000 | Test-driven development tasks | Tests serve as both prompt and verification |
| CoverUp (2024) | Varied | Coverage-guided test generation | Iterative LLM-guided coverage improvement |

Current Performance

LLM-generated tests achieve on average:

  • 41.32% accuracy (tests pass and are meaningful)
  • 45.10% statement coverage
  • 30.22% branch coverage
  • 40.21% mutation score

These numbers are from a multi-model benchmark study (2025). CoverUp's iterative approach achieves 80% line+branch coverage (vs 47% for CodaMosa), suggesting that agentic test generation loops significantly outperform single-shot generation.

Key Insight

Test generation is an area where agentic approaches (generate, run, check coverage, iterate) dramatically outperform single-shot generation. This makes it particularly suited to the iterative agent loop and a strong candidate for local model evaluation.
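The generate-run-check-iterate loop can be sketched as follows. Both callables are caller-supplied stand-ins (an LLM call and a coverage-instrumented test runner); neither name corresponds to a real tool's API:

```python
def coverage_guided_testgen(generate_tests, run_with_coverage,
                            target=0.8, max_rounds=5):
    """Sketch of an iterative, coverage-guided test-generation loop:
    generate tests, measure coverage, feed uncovered code back as
    context for the next generation round."""
    tests, feedback, coverage = [], None, 0.0
    for _ in range(max_rounds):
        tests += generate_tests(feedback)
        coverage, uncovered = run_with_coverage(tests)
        if coverage >= target:
            break
        feedback = uncovered  # steer the next round toward uncovered code
    return tests, coverage

# Toy stand-ins: each round adds one test worth 35% coverage.
tests, cov = coverage_guided_testgen(
    lambda fb: ["test"],
    lambda ts: (min(1.0, 0.35 * len(ts)), ["uncovered lines"]),
)
print(len(tests), cov)  # → 3 1.0
```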

State of the Art

Code agents were shown to be "state of the art software testers" when given an iterative loop with coverage feedback (2024 paper). No single model dominates this dimension; the scaffolding (coverage feedback, iteration) matters more than the base model for test generation.


12. Benchmark Suite Summary

Tier 1: Must-Run for Agentic Coding Evaluation

These are the most informative benchmarks for evaluating a model's fitness as a coding agent:

| Benchmark | Primary Dimensions | Run Cost | Notes |
|---|---|---|---|
| SWE-bench Verified | Planning, editing, repo understanding | High (500 Docker envs) | Gold standard |
| Aider Polyglot | Editing, multi-lang, debugging | Medium (225 problems) | Best edit benchmark |
| BigCodeBench | Generation, multi-tool | Medium (1,140 tasks) | Best generation benchmark |
| BFCL V4 | Tool use, function calling | Low-Medium | De facto tool-use standard |
| Terminal-Bench 2.0 | Debugging, planning, error recovery | High (89 real envs) | Best debugging benchmark |

Tier 2: Valuable Supplementary Benchmarks

| Benchmark | Primary Dimensions | Notes |
|---|---|---|
| LiveCodeBench | Generation (contamination-free) | Rolling benchmark |
| IFEval | Instruction following | Quick to run, widely reported |
| Multi-SWE-bench mini | Multi-language, planning | 400 instances, 8 languages |
| EvalPlus (HumanEval+/MBPP+) | Generation (rigorous) | Good baseline |
| Recovery-Bench | Error recovery | Novel and underexplored |
| FeatureBench | Complex planning | Very hard; differentiates top models |

Tier 3: Niche or Near-Saturated

| Benchmark | Status | Notes |
|---|---|---|
| HumanEval | Near-saturated | >95% for frontier models; use EvalPlus instead |
| MBPP | Near-saturated | Use MBPP+ instead |
| GAIA | Near-saturation (~90%) | Good for general agents, less code-specific |
| Needle-in-a-Haystack | Saturated | Use RULER for long-context |

Commonly Cited on Model Cards

When coding-focused models publish on Hugging Face, the most frequently cited benchmarks (in rough order of frequency) are:

  1. SWE-bench Verified (agentic coding standard)
  2. HumanEval / HumanEval+ (code generation baseline)
  3. MBPP / MBPP+ (code generation)
  4. BigCodeBench (multi-tool generation)
  5. Aider Polyglot (code editing, multi-language)
  6. LiveCodeBench (contamination-free generation)
  7. BFCL (function calling)
  8. IFEval (instruction following)
  9. Multi-SWE-bench (multilingual agentic)

13. Open-Weight Model Landscape for 64GB Systems

Models Feasible on 64GB Unified Memory (Strix Halo)

Sorted by practical fitness for agentic coding tasks. "Active" = parameters active per forward pass for MoE models.

| Model | Total / Active | GGUF Q4 Size | SWE-bench | Key Strength |
|---|---|---|---|---|
| Qwen3-Coder-Next | 80B / 3B | ~46GB (Q4) | 70.6% Verified | Best efficiency ratio; agentic RL training |
| Qwen3-Coder-30B-A3B | 30.5B / 3.3B | ~18GB (Q4) | ~55%* (est.) | Fits easily; native 256K context; function call format |
| Qwen3.5-35B-A3B | 35B / 3B | ~19GB (Q4) | N/A | General + coding; fast at 112 tok/s on RTX 3090 |
| Nemotron 3 Super | 120B / 12B | ~64GB (Q4) | 60.5% | 1M context; PinchBench 85.6%; hybrid Mamba-Transformer |
| Qwen3.5-27B | 27B / 27B (dense) | ~17GB (Q4) | ~55%* | Dense; 72.4% SWE-bench reported for Qwen3.5-27B |
| DeepSeek V3.2 | 671B / 37B | Too large at Q4 | 73.0% | Requires >200GB; not feasible for 64GB |
| GLM-5 | 744B / 40B | Too large at Q4 | 77.8% | Best open SWE-bench; not feasible for 64GB |

*Estimated; exact scores for quantized GGUF variants not independently benchmarked.

Primary coding agent: Qwen3-Coder-30B-A3B-Instruct (Q4_K_M, ~18GB)

  • Fits with ample room for KV cache and context
  • Specially designed function call format
  • Native 256K context, extendable to 1M
  • Strong agentic coding training
  • Fast inference due to 3.3B active parameters

Stretch option: Qwen3-Coder-Next (Q4, ~46GB)

  • Tighter fit but significantly stronger (70.6% SWE-bench Verified)
  • 3B active parameters = good generation speed
  • Leaves ~18GB for KV cache and system

Dense alternative: Qwen3.5-27B (Q4_K_M, ~17GB)

  • When you need strong general + coding ability
  • Dense model = more predictable behavior
  • Good baseline for comparison

Older Models: Still Relevant?

  • CodeLlama-34B (Meta, 2023): Superseded by Qwen and DeepSeek families. Only relevant for historical comparison or if specific fine-tunes are needed.
  • StarCoder2-15B (ServiceNow/HF/NVIDIA, 2024): Outperformed CodeLlama-34B at half the size. Still competitive for low-resource languages (Julia, Lua, Perl) but otherwise superseded by Qwen3-Coder.
  • DeepSeek-Coder-V2-Lite-16B (2024): Was competitive but now clearly behind Qwen3-Coder-30B-A3B and Qwen3-Coder-Next.

14. Frontier vs. Open Model Gap

Gap Analysis by Dimension (March 2026)

| Dimension | Frontier Best | Open Best (64GB) | Gap | Trend |
|---|---|---|---|---|
| Code Generation | ~98% HumanEval | ~85% HumanEval | Small | Closing rapidly |
| Code Editing | 88% Aider Polyglot | ~60% Aider Polyglot | Large | Closing (MoE helps) |
| Tool Use | >90% BFCL | ~80% BFCL | Moderate | Closing with dedicated training |
| Multi-Step Planning | 80.9% SWE-bench | 70.6% SWE-bench (Coder-Next) | Moderate | Narrowing with agentic RL |
| Debugging/Recovery | ~65% Terminal-Bench | ~45% Terminal-Bench* | Large | Widest persistent gap |
| Repo Understanding | Excellent | Good (long-context models) | Moderate | Closing with 256K+ contexts |
| Instruction Following | >90% IFEval | >85% IFEval | Small | Nearly closed |
| Long Context | 1M+ effective | 256K effective | Moderate | Hardware-limited for local |
| Multi-Language | 80%+ Multi-SWE | 62.8% Multi-SWE | Moderate | Improving with diverse training |
| Test Generation | ~50% coverage | ~40% coverage | Small | Scaffolding matters more |

*Estimated; Terminal-Bench scores not widely reported for 64GB-feasible open models.

Key Observations

  1. Code generation is nearly solved for simple tasks. The gap has shifted to complex, multi-step, multi-file tasks.

  2. Debugging/error recovery is the widest gap and the hardest to close. This is where frontier models' larger parameter counts and RLHF refinement matter most.

  3. MoE architectures are the bridge for 64GB systems. Models like Qwen3-Coder-Next (80B total, 3B active) achieve SWE-bench scores comparable to models with 10-20x more active parameters.

  4. Agentic RL training (as used in Qwen3-Coder, GLM-5) is the primary driver of open model improvement on planning and debugging dimensions.

  5. Scaffolding equalizes many gaps. A well-designed agent scaffold (SWE-Agent, OpenHands, Aider) can make a 30B model perform comparably to a raw 400B model.


15. Recommended Evaluation Stack

For evaluating models locally on the Strix Halo system, the following stack covers all 10 dimensions using tools already referenced in this project's docs/references.md:

Inspect AI (Primary Framework)

Inspect AI supports multiple benchmarks in a unified framework:

  • HumanEval (code generation)
  • BigCodeBench (multi-tool generation)
  • BFCL (function calling / tool use)
  • GAIA (multi-step planning)
  • IFEval (instruction following)

Run against an OpenAI-compatible endpoint (ollama or llama.cpp server).
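For example, assuming a llama.cpp server already listening locally (the port, placeholder API key, and model alias below are illustrative and depend on your setup), an Inspect AI task can be pointed at it through the OpenAI-compatible provider:

```shell
# Assumed local setup: llama.cpp's OpenAI-compatible server on port 8080
# serving a model registered under the alias "qwen3-coder-30b-a3b".
export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=local   # server ignores it, but the client needs a value
inspect eval inspect_evals/humaneval --model openai/qwen3-coder-30b-a3b
```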

EvalPlus (Code Generation)

  • HumanEval+ and MBPP+ with native ollama support
  • More rigorous than base HumanEval/MBPP
  • Already configured in this project's scripts/agentic/ framework

BigCodeBench (Multi-Tool Generation)

  • 1,140 tasks across 139 libraries
  • Already listed in docs/references.md
  • Tests multi-library, cross-domain code generation

Aider (Code Editing + Multi-Language)

  • Built-in polyglot benchmark: 225 exercises across 6 languages
  • Tests edit format compliance, multi-language support, debugging loop
  • Can be run against any OpenAI-compatible endpoint

BFCL (Tool Use)

  • pip install bfcl-eval
  • Tests function calling accuracy
  • Already listed in docs/references.md

Practical Execution Order

  1. Quick smoke test: EvalPlus (HumanEval+) -- ~30 min
  2. Generation depth: BigCodeBench-Hard (148 tasks) -- ~2-4 hours
  3. Editing ability: Aider polyglot benchmark -- ~4-6 hours
  4. Tool use: BFCL eval -- ~1-2 hours
  5. Instruction following: IFEval via Inspect AI -- ~1 hour
  6. Full agentic: SWE-bench Verified (if Docker resources available) -- ~24+ hours

16. Sources

Papers

  • Chen et al. (2021). "Evaluating Large Language Models Trained on Code." arXiv:2107.03374. [HumanEval]
  • Liu et al. (2023). "Is Your Code Generated by ChatGPT Really Correct?" NeurIPS 2023. [EvalPlus/HumanEval+]
  • Jimenez et al. (2024). "SWE-bench: Can Language Models Resolve Real-world GitHub Issues?" ICLR 2024.
  • Zhuo et al. (2024). "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls." ICLR 2025.
  • Patil et al. (2025). "The Berkeley Function Calling Leaderboard (BFCL)." ICML 2025.
  • Mialon et al. (2023). "GAIA: A Benchmark for General AI Assistants." arXiv:2311.12983. [GAIA]
  • Hsieh et al. (2024). "RULER: What's the Real Context Size of Your Long-Context Language Models?" COLM 2024.
  • Zhou et al. (2023). "Instruction-Following Evaluation for Large Language Models." arXiv:2311.07911. [IFEval]
  • Terminal-Bench team (2026). "Terminal-Bench: Benchmarking Agents on Hard CLI Tasks." Stanford/Laude Institute.
  • FeatureBench (Feb 2026). "Benchmarking Agentic Coding for Complex Feature Development." arXiv:2602.10975.
  • HumanEval Pro / MBPP Pro (ACL 2025). "Evaluating LLMs on Self-invoking Code Generation Task."
  • Multi-SWE-bench (NeurIPS 2025). "A Multilingual Benchmark for Issue Resolving."
  • SWE-PolyBench (Amazon, 2025). "A multi-language benchmark for repository level evaluation."
  • Recovery-Bench (Letta, 2025). "Evaluating LLMs' Ability to Recover from Mistakes."
  • Diff-XYZ (Oct 2025). "A Benchmark for Evaluating Diff Understanding."
