
Agentic Coding Evaluation Landscape

Comprehensive research into the dimensions, benchmarks, and model performance for evaluating LLMs in software engineering agent use cases. Research date: 2026-03-30.


Table of Contents

  1. Evaluation Taxonomy
  2. Dimension 1: Code Generation Accuracy
  3. Dimension 2: Code Editing / Patching
  4. Dimension 3: Tool Use / Function Calling
  5. Dimension 4: Multi-Step Planning
  6. Dimension 5: Debugging / Error Recovery
  7. Dimension 6: Repository Understanding
  8. Dimension 7: Instruction Following
  9. Dimension 8: Long Context Utilization
  10. Dimension 9: Multi-Language Support
  11. Dimension 10: Test Generation
  12. Benchmark Suite Summary
  13. Open-Weight Model Landscape for 64GB Systems
  14. Frontier vs. Open Model Gap
  15. Recommended Evaluation Stack
  16. Sources

1. Evaluation Taxonomy

Recent survey work (CSLLM Survey, 2025; SE Agent Benchmark Survey, 2025) organizes coding LLM evaluation along two orthogonal axes:

  • Capability dimension: What is being measured (generation, editing, tool use, planning, debugging, comprehension, instruction following, etc.)
  • Evaluation paradigm: How it is measured (static benchmarks, execution-based evaluation, agent-in-the-loop evaluation, human evaluation)

The field has moved decisively from static benchmarks (HumanEval, MBPP) toward agent-in-the-loop evaluations (SWE-bench, Terminal-Bench, FeatureBench) that test the full agentic loop: plan, act, observe, iterate. This shift matters because models that score 95%+ on HumanEval can still fall below 50% on realistic agentic tasks.

The ten dimensions below map to the capability axis. Each dimension lists the benchmarks that best isolate it, though in practice most agentic benchmarks test multiple dimensions simultaneously.
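The plan-act-observe-iterate loop that agent-in-the-loop benchmarks exercise can be sketched abstractly. The four callables below are illustrative stand-ins (an LLM planner, a tool executor, and so on), not part of any benchmark harness:

```python
def agent_loop(plan, act, observe, done, max_steps=30):
    """Minimal plan-act-observe-iterate loop of the kind that
    agent-in-the-loop benchmarks test end to end."""
    state = None
    for step in range(max_steps):
        action = plan(state)     # decide the next action from current state
        result = act(action)     # execute it (tool call, edit, command)
        state = observe(result)  # fold the observation back into state
        if done(state):          # e.g. tests pass, issue resolved
            return state, step + 1
    return state, max_steps

# Toy run: the "plan" just increments a counter until it reaches 3.
print(agent_loop(lambda s: (s or 0) + 1, lambda a: a, lambda s: s,
                 lambda s: s >= 3))  # → (3, 3)
```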


2. Code Generation Accuracy

Definition: Writing correct, complete code from natural-language specifications or docstrings, measured by functional correctness (pass@k on test suites).
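The pass@k numbers below use the unbiased estimator introduced with HumanEval (Chen et al., 2021): generate n samples per problem, count the c that pass, and estimate the probability that at least one of k drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): n samples
    generated per problem, c of which pass the test suite."""
    if n - c < k:
        return 1.0  # fewer failures than draws, so some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 2 samples and 1 passing, a single draw passes half the time.
print(pass_at_k(2, 1, 1))  # → 0.5
```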

Key Benchmarks

| Benchmark | Tasks | Languages | Metric | Notes |
|---|---|---|---|---|
| HumanEval (Chen et al., 2021) | 164 | Python | pass@k | Foundational but near-saturated; best models >95% |
| HumanEval+ / MBPP+ (EvalPlus, NeurIPS 2023) | 164 / 399 | Python | pass@k (80x more tests) | Catches false positives from HumanEval; ~10-15% score drops |
| HumanEval Pro / MBPP Pro (ACL 2025) | 164 / 399 | Python | pass@k on self-invoking tasks | Tests compositional reasoning; o1-mini drops from 96.2% to 76.2% |
| BigCodeBench (ICLR 2025) | 1,140 | Python (139 libs) | pass@1 | Multi-tool, cross-domain; best model (GPT-4o) ~60% Complete, <50% Instruct |
| BigCodeBench-Hard | 148 | Python | pass@1 | Hardest subset; human performance 97%, LLMs ~60% |
| LiveCodeBench (EMNLP 2025) | Rolling | Python | pass@k | Contamination-free: new problems added continuously from competitive programming |

State of the Art

  • Frontier: Claude Opus 4.5/4.6, GPT-5.2, Gemini 3.1 Pro all score >95% on HumanEval, ~85% on HumanEval+, ~65% on BigCodeBench-Complete.
  • Open (64GB-feasible): Qwen3.5-27B-Q4 achieves ~80% on HumanEval+. Qwen3-Coder-30B-A3B (3.3B active, ~18GB at Q4) is strong on BigCodeBench. Qwen2.5-Coder-32B-Instruct matched GPT-4o on HumanEval when released.

Key Insight

HumanEval is near-saturated and should no longer be used as a primary differentiator. BigCodeBench and LiveCodeBench are the current gold standards for code generation accuracy, as they test realistic multi-library tasks and resist contamination.


3. Code Editing / Patching

Definition: Modifying existing code correctly -- applying diffs, fixing bugs in context, integrating new code into existing files -- rather than generating from scratch.

Key Benchmarks

| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| Aider Code Editing | 133 | Edit Python files to solve Exercism problems | Tests edit format compliance + coding ability |
| Aider Polyglot | 225 | Edit code across 6 languages with error feedback | Two attempts per problem; measures edit+debug loop |
| Diff-XYZ (Oct 2025) | 3 tasks | Apply, anti-apply, generate diffs | Tests diff understanding in multiple formats |
| EDIT-Bench | Varied | Real-world instructed code edits | Repository-level editing tasks |
| SWE-bench (indirectly) | 2,294 | Generate patches that resolve GitHub issues | Requires generating correct unified diffs |

Edit Format Considerations

Code editing performance depends heavily on the edit format used:

  • Search/Replace blocks (Aider default): Most reliable for most models
  • Unified diff: GPT-4 Turbo was "3x less lazy" with unified diffs (Aider blog)
  • V4A diff format: OpenAI's recommended format (published with GPT-4.1, April 2025)
  • Whole-file rewrite: Simpler but wasteful; works with weaker models

Models that excel at generation can fail at editing because they struggle to produce syntactically valid diffs or correctly locate the code to modify.
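To illustrate why format compliance is its own failure mode: a search/replace edit is only applicable when the search text matches the file exactly once. A simplified sketch of that validation rule (not Aider's actual implementation):

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply a single search/replace edit block. Aider-style editors
    reject an edit unless the search text matches exactly once; this
    mirrors that rule in simplified form."""
    matches = source.count(search)
    if matches != 1:
        raise ValueError(f"search block matched {matches} times, expected 1")
    return source.replace(search, replace)

code = "def add(a, b):\n    return a - b\n"
print(apply_search_replace(code, "return a - b", "return a + b"))
```

A model that emits a search block with slightly wrong whitespace, or one that matches in two places, produces an edit the scaffold must reject even if the intended fix was correct.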

State of the Art (Aider Polyglot, March 2026)

| Model | Score | Type |
|---|---|---|
| GPT-5 | 88.0% | Frontier |
| MiniMax M2.5 | 80.2% | Open |
| DeepSeek V3.2-Exp | 74.2% | Open |
| DeepSeek-R1-0528 | 71.4% | Open |
| GLM-4.5-FP8 | 66.0% | Open |
| Qwen3-Coder-480B | 61.8% | Open (too large for 64GB) |
| Qwen3-Coder-30B-A3B | ~55-60%* | Open (fits 64GB at Q4) |

*Estimated from quantized GGUF performance data; exact Aider Polyglot score for the 30B-A3B variant not independently confirmed.


4. Tool Use / Function Calling

Definition: Correctly invoking APIs, tools, or MCP servers -- selecting the right function, constructing valid arguments, parsing responses, deciding when NOT to call.

Key Benchmarks

| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| BFCL V4 (Berkeley) | Thousands | Function calling accuracy across formats | De facto standard; AST-based evaluation |
| BFCL-v3 (via EvalScope) | Multi-turn | Stateful multi-step function calling | Tests memory and context management |
| Nexus Function Calling | Varied | Tool selection and invocation | Broader tool landscape |
| IFEval-FC (2025) | 500+ | Instruction following within function schemas | JSON schema constraint adherence |
| tau-bench | Varied | Tool-augmented task completion | End-to-end agent tool use |

BFCL Key Findings

The Berkeley Function Calling Leaderboard reveals a critical split:

  1. Single-turn calls: Most frontier models score >90% accuracy
  2. Multi-turn stateful calls: Performance drops 20-40% even for top models
  3. Abstention: Knowing when NOT to call a function remains a major weakness
  4. Long-horizon tool use: Memory, dynamic decision-making, and context management are open challenges

State of the Art

  • Frontier: Claude Opus 4.5/4.6, GPT-5.2 lead overall BFCL V4
  • Open: Qwen3-Coder-480B is "comparable to Claude Sonnet 4 on Agentic Tool-Use" (Qwen team). For 64GB-feasible models, Qwen3-Coder-30B-A3B has a specially designed function call format and strong tool-use training. Nemotron 3 Super (120B, 12B active) was explicitly trained for tool-use workflows.

Relevance to MCP

MCP (Model Context Protocol) servers expose tools via JSON schemas -- exactly what BFCL tests. A model's BFCL score is a reasonable proxy for MCP tool-use competence, though MCP adds discovery and session management complexity not yet benchmarked.
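The core check these evaluations automate can be approximated by validating a model-emitted argument string against a tool's schema. A toy sketch: the `read_file` tool and its fields are hypothetical, and real validators (and MCP servers) use full JSON Schema rather than this two-type subset:

```python
import json

# Hypothetical tool schema in the JSON-Schema style used by
# function-calling APIs and MCP servers (illustrative names only).
TOOL_SCHEMA = {
    "name": "read_file",
    "parameters": {
        "type": "object",
        "required": ["path"],
        "properties": {
            "path": {"type": "string"},
            "max_bytes": {"type": "integer"},
        },
    },
}

TYPE_CHECKS = {"string": str, "integer": int}

def validate_call(raw_arguments: str, schema: dict) -> list:
    """Return schema violations for a model-emitted argument string."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    params = schema["parameters"]
    errors = [f"missing required argument: {name}"
              for name in params["required"] if name not in args]
    for name, value in args.items():
        prop = params["properties"].get(name)
        if prop is None:
            errors.append(f"unknown argument: {name}")
        elif not isinstance(value, TYPE_CHECKS[prop["type"]]):
            errors.append(f"wrong type for argument: {name}")
    return errors

print(validate_call('{"path": "src/main.py"}', TOOL_SCHEMA))  # → []
print(validate_call('{"max_bytes": "10"}', TOOL_SCHEMA))
```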


5. Multi-Step Planning

Definition: Breaking complex tasks into subtasks, maintaining coherent plans across many steps, tracking progress, and adapting when plans fail.

Key Benchmarks

| Benchmark | Tasks | Steps | What It Tests | Notes |
|---|---|---|---|---|
| SWE-bench Verified | 500 | 5-50+ | End-to-end issue resolution | Gold standard for agentic coding |
| SWE-bench Pro (Scale AI) | Harder | 10-100+ | More complex issues | Best model ~46% (vs 81% on Verified) |
| FeatureBench (Feb 2026) | 200 | Many | Complex feature development | Claude 4.5 Opus: only 11.0% (vs 74.4% SWE-bench) |
| Snorkel Agentic Coding | 100 | Multi-step, 4 tiers | Plan, track, execute, recover | Claude Opus 4.5: 58%, Gemini 3 Pro: 51.6% |
| GAIA (ICLR 2025) | 450 | Multi-step | General assistant planning | Near saturation (~90% top scores) |
| Gaia2 (2026) | Varied | Async | Dynamic, asynchronous environments | Adds temporal constraints and agent collaboration |
| Terminal-Bench 2.0 | 89 | Multi-step | Terminal workflow completion | Tests plan execution in CLI environments |

Planning-Specific Insights

The gaps between SWE-bench Verified (~81% frontier), SWE-bench Pro (~46% frontier), and FeatureBench (~11% frontier) reveal that multi-step planning degrades rapidly with task complexity:

  • SWE-bench Verified: Often requires 5-15 steps (find file, understand bug, edit, test)
  • SWE-bench Pro: Requires deeper reasoning about architecture and dependencies
  • FeatureBench: Requires implementing features across multiple files with architectural coherence over 50+ steps

This is the dimension where frontier models most decisively outperform open models, though the gap is narrowing with agentic RL training (Qwen3-Coder, GLM-5).

State of the Art (SWE-bench Verified, March 2026)

| Model | Score | Type | Notes |
|---|---|---|---|
| Claude Opus 4.5 | 80.9% | Frontier | Overall leader |
| Claude Opus 4.6 | 80.8% | Frontier | |
| Gemini 3.1 Pro | 80.6% | Frontier | |
| MiniMax M2.5 | 80.2% | Open | Best open model |
| GPT-5.2 | 80.0% | Frontier | |
| GLM-5 | 77.8% | Open | 744B MoE, 40B active |
| Kimi K2.5 | 76.8% | Open | |
| DeepSeek V3.2 | 73.0% | Open | |
| Qwen3-Coder-Next | 70.6% | Open | Only 3B active params |
| DeepSeek V3.1 | 66.0% | Open | |
| Nemotron 3 Super | 60.5% | Open | 120B, 12B active |

6. Debugging / Error Recovery

Definition: Handling test failures, reading error messages, diagnosing root causes, and iterating toward a fix -- including recovering from the agent's own mistakes.

Key Benchmarks

| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| Terminal-Bench 2.0 (Stanford/Laude) | 89 | CLI debugging, error recovery, state mgmt | Gold standard for debugging evaluation |
| Recovery-Bench (Letta, 2025) | Varied | Recovery from corrupted states and error traces | Tests context pollution handling |
| AgentErrorBench (2025) | Varied | Error detection and debugging in trajectories | 24% improvement with AgentDebug method |
| ReliabilityBench (Jan 2026) | Varied | Consistency and fault recovery | Multi-dimensional reliability |
| Aider Polyglot (indirectly) | 225 | Two-attempt model with error feedback | Second attempt tests debug-from-feedback |

Recovery-Bench Key Findings

Recovery-Bench (Letta) specifically evaluates a critical gap: even frontier models "lack the ability to naturally recover from failed states." The benchmark creates scenarios with:

  • Erroneous files from previous attempts
  • Corrupted reasoning traces in context
  • Environment artifacts from failed edits

This is directly relevant to agentic coding loops where an agent makes a mistake at step 15 of a 30-step task and must recover without starting over.

Terminal-Bench 2.0 Key Findings

Terminal-Bench tests real terminal workflows: inspect environments, read/edit files, run commands, recover from errors, and finish multi-step tasks. Error categories:

  • Execution errors: Dominate for Claude Opus 4.5 and GPT-5.2
  • Coherence errors: Less frequent but more damaging
  • Verification errors: Failing to check that a fix actually worked

State of the Art

Debugging/error recovery is one of the weakest dimensions for all models. No model achieves >70% on Terminal-Bench 2.0 or Recovery-Bench as of March 2026. This is a primary area where the frontier-open gap matters most for practical agentic use.


7. Repository Understanding

Definition: Navigating large codebases, understanding file structure, dependency graphs, cross-file relationships, and architectural patterns.

Key Benchmarks

| Benchmark | Tasks | Languages | What It Tests | Notes |
|---|---|---|---|---|
| CrossCodeEval (NeurIPS 2023) | Varied | Python, Java, TS, C# | Cross-file code completion | Requires understanding imports and dependencies |
| RepoBench | 3 tasks | Python | Retrieval, completion, pipeline | Tests codebase navigation |
| RepoEval | Varied | Python | Repository-level completion | 16 GitHub repositories |
| RepoCod (ACL 2025) | Varied | Multiple | Full repository code generation | "LLMs not yet ready" |
| LoCoBench-Agent (2025) | Varied | Multiple | Interactive repo exploration | Agent-based evaluation |
| DependEval | 3 tasks | Multiple | Dependency recognition, multi-file editing | Tests architectural understanding |

Key Challenge

Repository understanding is difficult to isolate as a benchmark dimension because it is a prerequisite for most agentic coding tasks. SWE-bench implicitly tests it (you cannot fix a bug if you cannot find the relevant file), but does not score it separately.

The most direct measures are:

  1. CrossCodeEval: Do predictions improve when cross-file context is provided?
  2. RepoBench-R: Can the model retrieve the right context from the repository?
  3. DependEval: Can the model understand and modify dependency relationships?

State of the Art

Models with longer context windows have an inherent advantage. The Qwen3-Coder family was explicitly trained for "repository-scale understanding" with 256K native context (extendable to 1M). GLM-5 uses DeepSeek Sparse Attention for 205K context.

For 64GB systems, Qwen3-Coder-30B-A3B and Qwen3-Coder-Next are the strongest choices due to their long-context training and MoE efficiency.


8. Instruction Following

Definition: Following complex, multi-constraint instructions precisely -- formatting requirements, length constraints, keyword inclusion, structural rules.

Key Benchmarks

| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| IFEval (Google, Nov 2023) | ~500 | 25 types of verifiable instructions | Format, length, keyword, structure constraints |
| IFEval-Extended (2024) | Dynamic | Generative instruction synthesis | Thousands of unique instructions from templates |
| M-IFEval (NAACL 2025) | Multi-lingual | French, Japanese, Spanish instruction following | Performance varies widely across languages |
| IFEval-FC (2025) | Varied | Instruction following in function call schemas | JSON schema constraint adherence |
| AgentIF (Tsinghua, 2025) | Varied | Agent-specific instruction following | Evaluates IF within agentic loops |

Relevance to Agentic Coding

Instruction following is critical for agentic coding because:

  1. System prompts: Agents receive detailed behavioral instructions (e.g., CLAUDE.md conventions in this repo)
  2. Edit format compliance: Models must produce output in exact formats (search/replace blocks, unified diffs, JSON tool calls)
  3. Multi-constraint tasks: "Fix the bug AND add a test AND update the docstring AND follow the project's naming conventions"

State of the Art

IFEval is included in the Open LLM Leaderboard V2, making it one of the most widely reported benchmarks. Frontier models score >90% on IFEval. Open models vary widely; instruction-tuned variants of Qwen3.5, DeepSeek V3, and GLM-5 are competitive at >85%.


9. Long Context Utilization

Definition: Effectively using large context windows (32K-1M tokens) with code -- not just accepting long inputs, but actually using information from all parts.

Key Benchmarks

| Benchmark | What It Tests | Notes |
|---|---|---|
| RULER (NVIDIA, COLM 2024) | Multi-needle retrieval, distractor handling | Most models degrade significantly beyond 32K |
| Needle in a Haystack (NIAH) | Single-fact retrieval in long context | Near-saturated for frontier models |
| LoCoBench (2025) | Long-context code completion and comprehension | Claude 3.5 Sonnet: 29% at short context, 3% at long |
| LongCodeBench (2025) | Long-context code tasks | Single-language, limited diversity |
| LongBench (ACL 2025) | General long-context evaluation | Reveals limitations of existing benchmarks |

"Context Rot" Phenomenon

Research from Chroma (2025) documented "context rot": as input tokens increase, LLM performance degrades even when the relevant information is present. This is particularly acute for code tasks where:

  • File A defines a class, file B imports it, file C tests it
  • All three must be in context simultaneously
  • Models must cross-reference across files, not just retrieve individual facts

State of the Art

| Model | Native Context | Effective Context* | Notes |
|---|---|---|---|
| Nemotron 3 Super | 1M tokens | 91.75% accuracy at 1M | Best retention score |
| Qwen3-Coder-Next | 256K (1M w/ Yarn) | Good at 256K | Trained for repo-scale |
| GLM-5 | 205K | Good | DeepSeek Sparse Attention |
| DeepSeek V3.2 | 128K | Moderate | |

*"Effective context" means the model actually uses information at that distance, not just accepts it without error.

For 64GB systems, context length is bounded by available memory. At Q4 quantization, a 30B-A3B model can handle ~64K-128K tokens before running out of KV cache space (depending on GQA configuration and batch size).
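The KV-cache bound can be sanity-checked with simple arithmetic: two tensors (K and V) per layer, each of size kv_heads × head_dim per token. The configuration below (48 layers, 4 KV heads via GQA, head dimension 128, FP16 cache) is an assumed illustration, not the actual Qwen3-Coder-30B-A3B configuration:

```python
def kv_cache_gib(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Back-of-envelope KV-cache size in GiB: 2 tensors (K and V) per
    layer, each layers * kv_heads * head_dim * seq_len elements."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

# Assumed GQA config: 48 layers, 4 KV heads, head_dim 128, FP16 cache.
print(kv_cache_gib(48, 4, 128, 131_072))  # → 12.0 (GiB at 128K context)
```

Under these assumptions a 128K-token cache costs ~12 GiB, which alongside an ~18GB Q4 model still fits comfortably in 64GB; quantizing the KV cache (e.g. to Q8) roughly halves that figure.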


10. Multi-Language Support

Definition: Handling different programming languages correctly -- not just Python, but also compiled languages, systems languages, and less common languages.

Key Benchmarks

| Benchmark | Languages | What It Tests | Notes |
|---|---|---|---|
| Aider Polyglot | C++, Go, Java, JS, Python, Rust | Edit + debug in 6 languages | 225 Exercism exercises |
| Multi-SWE-bench (NeurIPS 2025) | Python, Java, TS, JS, Go, Rust, C, C++ | Issue resolution in 8 languages | 1,632 validated issues |
| Multi-SWE-bench mini | 8 languages | Lightweight version | 400 instances, reduced compute |
| SWE-PolyBench (Amazon) | Java, JS, TS, Python | Bug fixes, features, refactoring | 2,110 curated issues |
| SWE-smith | 9 languages | SWE-bench style across 42 repos | 300 curated tasks |
| HumanEval-X | Python, C++, Java, JS, Go | Cross-lingual code generation | Translation of HumanEval |
| BigCodeBench | Python (139 libs) | Multi-library Python | Tests library-specific knowledge |

Multi-SWE-bench vs SWE-PolyBench

Two competing multilingual benchmarks emerged in 2025:

  • Multi-SWE-bench (ByteDance): 1,632 issues, 8 languages, NeurIPS 2025 Datasets track. Also provides mini (400 instances) and flash (300 instances) variants for reduced compute.
  • SWE-PolyBench (Amazon): 2,110 issues, 4 languages, with a verified subset of 384 instances. Covers bug fixes, features, and refactoring.

Language-Specific Performance Gaps

Open models show significant performance variation across languages:

  • Python: Best-supported universally
  • JavaScript/TypeScript: Second-best, strong ecosystem coverage
  • Rust, Go, C++: Substantially weaker, especially for complex patterns
  • Low-resource languages (Julia, Lua, Perl): StarCoder2-15B historically strong here

State of the Art

Qwen3-Coder-Next achieves 62.8% on SWE-Bench Multilingual. For 64GB-feasible models, the Qwen3-Coder-30B-A3B benefits from Qwen's broad multilingual training data.


11. Test Generation

Definition: Writing tests, understanding test frameworks, achieving coverage, generating meaningful assertions -- not just syntactically valid tests.

Key Benchmarks

| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| TestEval (2024) | 210 | LLM test case generation for LeetCode programs | Basic test generation ability |
| ULT (2025) | 3,909 | Unit test generation for complex functions | High cyclomatic complexity, leakage-free |
| WebApp1K (2025) | 1,000 | Test-driven development tasks | Tests serve as both prompt and verification |
| CoverUp (2024) | Varied | Coverage-guided test generation | Iterative LLM-guided coverage improvement |

Current Performance

LLM-generated tests achieve on average:

  • 41.32% accuracy (tests pass and are meaningful)
  • 45.10% statement coverage
  • 30.22% branch coverage
  • 40.21% mutation score

These numbers are from a multi-model benchmark study (2025). CoverUp's iterative approach achieves 80% line+branch coverage (vs 47% for CodaMosa), suggesting that agentic test generation loops significantly outperform single-shot generation.

Key Insight

Test generation is an area where agentic approaches (generate, run, check coverage, iterate) dramatically outperform single-shot generation. This makes it particularly suited to the iterative agent loop and a strong candidate for local model evaluation.
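The generate-run-check-iterate loop can be sketched as follows. Both callables are caller-supplied stand-ins (an LLM call and a coverage-instrumented test runner); neither name corresponds to a real tool's API:

```python
def coverage_guided_testgen(generate_tests, run_with_coverage,
                            target=0.8, max_rounds=5):
    """Sketch of an iterative, coverage-guided test-generation loop:
    generate tests, measure coverage, feed uncovered code back as
    context for the next generation round."""
    tests, feedback, coverage = [], None, 0.0
    for _ in range(max_rounds):
        tests += generate_tests(feedback)
        coverage, uncovered = run_with_coverage(tests)
        if coverage >= target:
            break
        feedback = uncovered  # steer the next round toward uncovered code
    return tests, coverage

# Toy stand-ins: each round adds one test worth 35% coverage.
tests, cov = coverage_guided_testgen(
    lambda fb: ["test"],
    lambda ts: (min(1.0, 0.35 * len(ts)), ["uncovered lines"]),
)
print(len(tests), cov)  # → 3 1.0
```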

State of the Art

Code agents were shown to be "state of the art software testers" when given an iterative loop with coverage feedback (2024 paper). No single model dominates this dimension; the scaffolding (coverage feedback, iteration) matters more than the base model for test generation.


12. Benchmark Suite Summary

Tier 1: Must-Run for Agentic Coding Evaluation

These are the most informative benchmarks for evaluating a model's fitness as a coding agent:

| Benchmark | Primary Dimensions | Run Cost | Notes |
|---|---|---|---|
| SWE-bench Verified | Planning, editing, repo understanding | High (500 Docker envs) | Gold standard |
| Aider Polyglot | Editing, multi-lang, debugging | Medium (225 problems) | Best edit benchmark |
| BigCodeBench | Generation, multi-tool | Medium (1,140 tasks) | Best generation benchmark |
| BFCL V4 | Tool use, function calling | Low-Medium | De facto tool-use standard |
| Terminal-Bench 2.0 | Debugging, planning, error recovery | High (89 real envs) | Best debugging benchmark |

Tier 2: Valuable Supplementary Benchmarks

| Benchmark | Primary Dimensions | Notes |
|---|---|---|
| LiveCodeBench | Generation (contamination-free) | Rolling benchmark |
| IFEval | Instruction following | Quick to run, widely reported |
| Multi-SWE-bench mini | Multi-language, planning | 400 instances, 8 languages |
| EvalPlus (HumanEval+/MBPP+) | Generation (rigorous) | Good baseline |
| Recovery-Bench | Error recovery | Novel and underexplored |
| FeatureBench | Complex planning | Very hard; differentiates top models |

Tier 3: Niche or Near-Saturated

| Benchmark | Status | Notes |
|---|---|---|
| HumanEval | Near-saturated | >95% for frontier models; use EvalPlus instead |
| MBPP | Near-saturated | Use MBPP+ instead |
| GAIA | Near-saturation (~90%) | Good for general agents, less code-specific |
| Needle-in-a-Haystack | Saturated | Use RULER for long-context |

Commonly Cited on Model Cards

When coding-focused models publish on Hugging Face, the most frequently cited benchmarks (in rough order of frequency) are:

  1. SWE-bench Verified (agentic coding standard)
  2. HumanEval / HumanEval+ (code generation baseline)
  3. MBPP / MBPP+ (code generation)
  4. BigCodeBench (multi-tool generation)
  5. Aider Polyglot (code editing, multi-language)
  6. LiveCodeBench (contamination-free generation)
  7. BFCL (function calling)
  8. IFEval (instruction following)
  9. Multi-SWE-bench (multilingual agentic)

13. Open-Weight Model Landscape for 64GB Systems

Models Feasible on 64GB Unified Memory (Strix Halo)

Sorted by practical fitness for agentic coding tasks. "Active" = parameters active per forward pass for MoE models.

| Model | Total / Active | GGUF Q4 Size | SWE-bench | Key Strength |
|---|---|---|---|---|
| Qwen3-Coder-Next | 80B / 3B | ~46GB (Q4) | 70.6% Verified | Best efficiency ratio; agentic RL training |
| Qwen3-Coder-30B-A3B | 30.5B / 3.3B | ~18GB (Q4) | ~55%* (est.) | Fits easily; native 256K context; function call format |
| Qwen3.5-35B-A3B | 35B / 3B | ~19GB (Q4) | N/A | General + coding; fast at 112 tok/s on RTX 3090 |
| Nemotron 3 Super | 120B / 12B | ~64GB (Q4) | 60.5% | 1M context; PinchBench 85.6%; hybrid Mamba-Transformer |
| Qwen3.5-27B | 27B / 27B (dense) | ~17GB (Q4) | ~55%* | Dense; 72.4% SWE-bench reported for Qwen3.5-27B |
| DeepSeek V3.2 | 671B / 37B | Too large at Q4 | 73.0% | Requires >200GB; not feasible for 64GB |
| GLM-5 | 744B / 40B | Too large at Q4 | 77.8% | Best open SWE-bench; not feasible for 64GB |

*Estimated; exact scores for quantized GGUF variants not independently benchmarked.

Primary coding agent: Qwen3-Coder-30B-A3B-Instruct (Q4_K_M, ~18GB)

  • Fits with ample room for KV cache and context
  • Specially designed function call format
  • Native 256K context, extendable to 1M
  • Strong agentic coding training
  • Fast inference due to 3.3B active parameters

Stretch option: Qwen3-Coder-Next (Q4, ~46GB)

  • Tighter fit but significantly stronger (70.6% SWE-bench Verified)
  • 3B active parameters = good generation speed
  • Leaves ~18GB for KV cache and system

Dense alternative: Qwen3.5-27B (Q4_K_M, ~17GB)

  • When you need strong general + coding ability
  • Dense model = more predictable behavior
  • Good baseline for comparison

Older Models: Still Relevant?

  • CodeLlama-34B (Meta, 2023): Superseded by Qwen and DeepSeek families. Only relevant for historical comparison or if specific fine-tunes are needed.
  • StarCoder2-15B (ServiceNow/HF/NVIDIA, 2024): Outperformed CodeLlama-34B at half the size. Still competitive for low-resource languages (Julia, Lua, Perl) but otherwise superseded by Qwen3-Coder.
  • DeepSeek-Coder-V2-Lite-16B (2024): Was competitive but now clearly behind Qwen3-Coder-30B-A3B and Qwen3-Coder-Next.

14. Frontier vs. Open Model Gap

Gap Analysis by Dimension (March 2026)

| Dimension | Frontier Best | Open Best (64GB) | Gap | Trend |
|---|---|---|---|---|
| Code Generation | ~98% HumanEval | ~85% HumanEval | Small | Closing rapidly |
| Code Editing | 88% Aider Polyglot | ~60% Aider Polyglot | Large | Closing (MoE helps) |
| Tool Use | >90% BFCL | ~80% BFCL | Moderate | Closing with dedicated training |
| Multi-Step Planning | 80.9% SWE-bench | 70.6% SWE-bench (Coder-Next) | Moderate | Narrowing with agentic RL |
| Debugging/Recovery | ~65% Terminal-Bench | ~45% Terminal-Bench* | Large | Widest persistent gap |
| Repo Understanding | Excellent | Good (long-context models) | Moderate | Closing with 256K+ contexts |
| Instruction Following | >90% IFEval | >85% IFEval | Small | Nearly closed |
| Long Context | 1M+ effective | 256K effective | Moderate | Hardware-limited for local |
| Multi-Language | 80%+ Multi-SWE | 62.8% Multi-SWE | Moderate | Improving with diverse training |
| Test Generation | ~50% coverage | ~40% coverage | Small | Scaffolding matters more |

*Estimated; Terminal-Bench scores not widely reported for 64GB-feasible open models.

Key Observations

  1. Code generation is nearly solved for simple tasks. The gap has shifted to complex, multi-step, multi-file tasks.

  2. Debugging/error recovery is the widest gap and the hardest to close. This is where frontier models' larger parameter counts and RLHF refinement matter most.

  3. MoE architectures are the bridge for 64GB systems. Models like Qwen3-Coder-Next (80B total, 3B active) achieve SWE-bench scores comparable to models with 10-20x more active parameters.

  4. Agentic RL training (as used in Qwen3-Coder, GLM-5) is the primary driver of open model improvement on planning and debugging dimensions.

  5. Scaffolding equalizes many gaps. A well-designed agent scaffold (SWE-Agent, OpenHands, Aider) can make a 30B model perform comparably to a raw 400B model.


15. Recommended Evaluation Stack

For evaluating models locally on the Strix Halo system, the following stack covers all 10 dimensions using tools already referenced in this project's docs/references.md:

Inspect AI (Primary Framework)

Inspect AI supports multiple benchmarks in a unified framework:

  • HumanEval (code generation)
  • BigCodeBench (multi-tool generation)
  • BFCL (function calling / tool use)
  • GAIA (multi-step planning)
  • IFEval (instruction following)

Run against an OpenAI-compatible endpoint (ollama or llama.cpp server).
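For example, assuming a llama.cpp server already listening locally (the port, placeholder API key, and model alias below are illustrative and depend on your setup), an Inspect AI task can be pointed at it through the OpenAI-compatible provider:

```shell
# Assumed local setup: llama.cpp's OpenAI-compatible server on port 8080
# serving a model registered under the alias "qwen3-coder-30b-a3b".
export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=local   # server ignores it, but the client needs a value
inspect eval inspect_evals/humaneval --model openai/qwen3-coder-30b-a3b
```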

EvalPlus (Code Generation)

  • HumanEval+ and MBPP+ with native ollama support
  • More rigorous than base HumanEval/MBPP
  • Already configured in this project's scripts/agentic/ framework

BigCodeBench (Multi-Tool Generation)

  • 1,140 tasks across 139 libraries
  • Already listed in docs/references.md
  • Tests multi-library, cross-domain code generation

Aider (Code Editing + Multi-Language)

  • Built-in polyglot benchmark: 225 exercises across 6 languages
  • Tests edit format compliance, multi-language support, debugging loop
  • Can be run against any OpenAI-compatible endpoint

BFCL (Tool Use)

  • pip install bfcl-eval
  • Tests function calling accuracy
  • Already listed in docs/references.md

Practical Execution Order

  1. Quick smoke test: EvalPlus (HumanEval+) -- ~30 min
  2. Generation depth: BigCodeBench-Hard (148 tasks) -- ~2-4 hours
  3. Editing ability: Aider polyglot benchmark -- ~4-6 hours
  4. Tool use: BFCL eval -- ~1-2 hours
  5. Instruction following: IFEval via Inspect AI -- ~1 hour
  6. Full agentic: SWE-bench Verified (if Docker resources available) -- ~24+ hours

16. Sources

Papers

  • Chen et al. (2021). "Evaluating Large Language Models Trained on Code." arXiv:2107.03374. [HumanEval]
  • Liu et al. (2023). "Is Your Code Generated by ChatGPT Really Correct?" NeurIPS 2023. [EvalPlus/HumanEval+]
  • Jimenez et al. (2024). "SWE-bench: Can Language Models Resolve Real-world GitHub Issues?" ICLR 2024.
  • Zhuo et al. (2024). "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls." ICLR 2025.
  • Patil et al. (2025). "The Berkeley Function Calling Leaderboard (BFCL)." ICML 2025.
  • Mialon et al. (2023). "GAIA: A Benchmark for General AI Assistants." arXiv:2311.12983. [GAIA]
  • Hsieh et al. (2024). "RULER: What's the Real Context Size of Your Long-Context Language Models?" COLM 2024.
  • Zhou et al. (2023). "Instruction-Following Evaluation for Large Language Models." arXiv:2311.07911. [IFEval]
  • Terminal-Bench team (2026). "Terminal-Bench: Benchmarking Agents on Hard CLI Tasks." Stanford/Laude Institute.
  • FeatureBench (Feb 2026). "Benchmarking Agentic Coding for Complex Feature Development." arXiv:2602.10975.
  • HumanEval Pro / MBPP Pro (ACL 2025). "Evaluating LLMs on Self-invoking Code Generation Task."
  • Multi-SWE-bench (NeurIPS 2025). "A Multilingual Benchmark for Issue Resolving."
  • SWE-PolyBench (Amazon, 2025). "A multi-language benchmark for repository level evaluation."
  • Recovery-Bench (Letta, 2025). "Evaluating LLMs' Ability to Recover from Mistakes."
  • Diff-XYZ (Oct 2025). "A Benchmark for Evaluating Diff Understanding."
