docs: add agentic coding evaluation landscape research
Comprehensive research (706 lines, dated 2026-03-30) covering evaluation dimensions, benchmark suites, and open-weight model performance for software engineering agent use cases on 64GB systems. Also gitignore evalplus_results/ (runtime outputs) and ztop/ (nested repo).
docs/agentic-coding-evaluation-landscape.md (new file, 706 lines)
# Agentic Coding Evaluation Landscape

Comprehensive research into the dimensions, benchmarks, and model performance for
evaluating LLMs in software engineering agent use cases. Research date: 2026-03-30.

---

## Table of Contents

1. [Evaluation Taxonomy](#1-evaluation-taxonomy)
2. [Dimension 1: Code Generation Accuracy](#2-code-generation-accuracy)
3. [Dimension 2: Code Editing / Patching](#3-code-editing--patching)
4. [Dimension 3: Tool Use / Function Calling](#4-tool-use--function-calling)
5. [Dimension 4: Multi-Step Planning](#5-multi-step-planning)
6. [Dimension 5: Debugging / Error Recovery](#6-debugging--error-recovery)
7. [Dimension 6: Repository Understanding](#7-repository-understanding)
8. [Dimension 7: Instruction Following](#8-instruction-following)
9. [Dimension 8: Long Context Utilization](#9-long-context-utilization)
10. [Dimension 9: Multi-Language Support](#10-multi-language-support)
11. [Dimension 10: Test Generation](#11-test-generation)
12. [Benchmark Suite Summary](#12-benchmark-suite-summary)
13. [Open-Weight Model Landscape for 64GB Systems](#13-open-weight-model-landscape-for-64gb-systems)
14. [Frontier vs. Open Model Gap](#14-frontier-vs-open-model-gap)
15. [Recommended Evaluation Stack](#15-recommended-evaluation-stack)
16. [Sources](#16-sources)

---

## 1. Evaluation Taxonomy

Recent survey work (CSLLM Survey, 2025; SE Agent Benchmark Survey, 2025) organizes
coding LLM evaluation along two orthogonal axes:

- **Capability dimension**: What is being measured (generation, editing, tool use,
  planning, debugging, comprehension, instruction following, etc.)
- **Evaluation paradigm**: How it is measured (static benchmarks, execution-based
  evaluation, agent-in-the-loop evaluation, human evaluation)

The field has moved decisively from static benchmarks (HumanEval, MBPP) toward
agent-in-the-loop evaluations (SWE-bench, Terminal-Bench, FeatureBench) that test
the full agentic loop: plan, act, observe, iterate. This shift matters because models
that score 95%+ on HumanEval can still score below 50% on realistic agentic tasks.

The ten dimensions below map to the capability axis. Each dimension lists the
benchmarks that best isolate it, though in practice most agentic benchmarks test
multiple dimensions simultaneously.

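The plan-act-observe-iterate loop that agent-in-the-loop benchmarks exercise can be sketched with a few callbacks. The function and callback names below are illustrative, not taken from any particular framework:

```python
def run_agent(task, plan, act, observe, is_done, max_steps=30):
    """Minimal plan-act-observe-iterate skeleton (names are illustrative).

    plan(task, history) -> list of pending actions
    act(action)         -> raw result (tool output, test run, ...)
    observe(result)     -> structured feedback dict for the next decision
    is_done(feedback)   -> True once the task is complete
    """
    history = []
    steps = plan(task, history)
    for _ in range(max_steps):
        if not steps:
            break
        action = steps.pop(0)
        feedback = observe(act(action))     # act on one step, observe outcome
        history.append((action, feedback))
        if is_done(feedback):
            return history                  # success: return the trajectory
        if feedback.get("error"):
            steps = plan(task, history)     # iterate: re-plan after a failure
    return history
```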
---
## 2. Code Generation Accuracy

**Definition**: Writing correct, complete code from natural-language specifications or
docstrings, measured by functional correctness (pass@k on test suites).

### Key Benchmarks

| Benchmark | Tasks | Languages | Metric | Notes |
|---|---|---|---|---|
| **HumanEval** (Chen et al., 2021) | 164 | Python | pass@k | Foundational but near-saturated; best models >95% |
| **HumanEval+** / **MBPP+** (EvalPlus, NeurIPS 2023) | 164 / 399 | Python | pass@k (80x more tests) | Catches false positives from HumanEval; ~10-15% score drops |
| **HumanEval Pro** / **MBPP Pro** (ACL 2025) | 164 / 399 | Python | pass@k on self-invoking tasks | Tests compositional reasoning; o1-mini drops from 96.2% to 76.2% |
| **BigCodeBench** (ICLR 2025) | 1,140 | Python (139 libs) | pass@1 | Multi-tool, cross-domain; best model (GPT-4o) ~60% Complete, <50% Instruct |
| **BigCodeBench-Hard** | 148 | Python | pass@1 | Hardest subset; human performance 97%, LLMs ~60% |
| **LiveCodeBench** (EMNLP 2025) | Rolling | Python | pass@k | Contamination-free: new problems added continuously from competitive programming |

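The pass@k numbers above use the unbiased estimator from the HumanEval paper (Chen et al., 2021): generate n samples per problem, count the c that pass, and estimate the probability that at least one of k random draws passes:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: samples generated per problem, c: samples that passed, k: budget.
    pass@k = 1 - C(n-c, k) / C(n, k)
    """
    if n - c < k:        # fewer than k failures: every k-subset contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n=200 samples and c=20 passing, `pass_at_k(200, 20, 1)` is 0.1 while `pass_at_k(200, 20, 10)` is well above 0.6, which is why pass@1 and pass@10 are reported separately.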
### State of the Art

- **Frontier**: Claude Opus 4.5/4.6, GPT-5.2, Gemini 3.1 Pro all score >95% on
  HumanEval, ~85% on HumanEval+, ~65% on BigCodeBench-Complete.
- **Open (64GB-feasible)**: Qwen3.5-27B-Q4 achieves ~80% on HumanEval+.
  Qwen3-Coder-30B-A3B (3.3B active, ~18GB at Q4) is strong on BigCodeBench.
  Qwen2.5-Coder-32B-Instruct matched GPT-4o on HumanEval when released.

### Key Insight

HumanEval is near-saturated and should no longer be used as a primary differentiator.
BigCodeBench and LiveCodeBench are the current gold standards for code generation
accuracy, as they test realistic multi-library tasks and resist contamination.

---

## 3. Code Editing / Patching

**Definition**: Modifying existing code correctly -- applying diffs, fixing bugs in
context, integrating new code into existing files -- rather than generating from scratch.

### Key Benchmarks

| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| **Aider Code Editing** | 133 | Edit Python files to solve Exercism problems | Tests edit format compliance + coding ability |
| **Aider Polyglot** | 225 | Edit code across 6 languages with error feedback | Two attempts per problem; measures edit+debug loop |
| **Diff-XYZ** (Oct 2025) | 3 tasks | Apply, anti-apply, generate diffs | Tests diff understanding in multiple formats |
| **EDIT-Bench** | Varied | Real-world instructed code edits | Repository-level editing tasks |
| **SWE-bench** (indirectly) | 2,294 | Generate patches that resolve GitHub issues | Requires generating correct unified diffs |

### Edit Format Considerations

Code editing performance depends heavily on the edit format used:

- **Search/replace blocks** (Aider default): Most reliable for most models
- **Unified diff**: GPT-4 Turbo was "3x less lazy" with unified diffs (Aider blog)
- **V4A diff format**: OpenAI's recommended format (published with GPT-4.1, April 2025)
- **Whole-file rewrite**: Simpler but wasteful; works with weaker models

Models that excel at generation can fail at editing because they struggle to produce
syntactically valid diffs or correctly locate the code to modify.

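To illustrate why format compliance is scored separately from coding ability, here is a minimal applier for a search/replace-style edit (modeled loosely on the search/replace-block idea, not Aider's actual parser). It must reject edits whose search text is missing or ambiguous, two common model failure modes:

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply a single search/replace edit to a file's contents.

    Rejects the edit when the search text is absent or ambiguous,
    rather than guessing a location.
    """
    count = source.count(search)
    if count == 0:
        raise ValueError("search block not found in file")
    if count > 1:
        raise ValueError(f"search block matches {count} locations")
    return source.replace(search, replace, 1)
```

A model that emits a search block that no longer matches the file (stale context, wrong whitespace) fails here before any code runs.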
### State of the Art (Aider Polyglot, March 2026)

| Model | Score | Type |
|---|---|---|
| GPT-5 | 88.0% | Frontier |
| MiniMax M2.5 | 80.2% | Open |
| DeepSeek V3.2-Exp | 74.2% | Open |
| DeepSeek-R1-0528 | 71.4% | Open |
| GLM-4.5-FP8 | 66.0% | Open |
| Qwen3-Coder-480B | 61.8% | Open (too large for 64GB) |
| Qwen3-Coder-30B-A3B | ~55-60%* | Open (fits 64GB at Q4) |

*Estimated from quantized GGUF performance data; exact Aider Polyglot score for
the 30B-A3B variant not independently confirmed.

---

## 4. Tool Use / Function Calling

**Definition**: Correctly invoking APIs, tools, or MCP servers -- selecting the right
function, constructing valid arguments, parsing responses, deciding when NOT to call.

### Key Benchmarks

| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| **BFCL V4** (Berkeley) | Thousands | Function calling accuracy across formats | De facto standard; AST-based evaluation |
| **BFCL-v3** (via EvalScope) | Multi-turn | Stateful multi-step function calling | Tests memory and context management |
| **Nexus Function Calling** | Varied | Tool selection and invocation | Broader tool landscape |
| **IFEval-FC** (2025) | 500+ | Instruction following within function schemas | JSON schema constraint adherence |
| **tau-bench** | Varied | Tool-augmented task completion | End-to-end agent tool use |

### BFCL Key Findings

The Berkeley Function Calling Leaderboard reveals a critical split:

1. **Single-turn calls**: Most frontier models score >90% accuracy
2. **Multi-turn stateful calls**: Performance drops 20-40% even for top models
3. **Abstention**: Knowing when NOT to call a function remains a major weakness
4. **Long-horizon tool use**: Memory, dynamic decision-making, and context management
   are open challenges

### State of the Art

- **Frontier**: Claude Opus 4.5/4.6, GPT-5.2 lead overall BFCL V4
- **Open**: Qwen3-Coder-480B is "comparable to Claude Sonnet 4 on Agentic Tool-Use"
  (Qwen team). For 64GB-feasible models, Qwen3-Coder-30B-A3B has a specially
  designed function call format and strong tool-use training.
  Nemotron 3 Super (120B, 12B active) was explicitly trained for tool-use workflows.

### Relevance to MCP

MCP (Model Context Protocol) servers expose tools via JSON schemas -- exactly what
BFCL tests. A model's BFCL score is a reasonable proxy for MCP tool-use competence,
though MCP adds discovery and session management complexity not yet benchmarked.

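A toy version of the argument-validation half of this problem, using a hand-rolled checker for required keys and primitive types. Real BFCL/MCP validation uses a full JSON Schema implementation; the tool and field names in the example are made up:

```python
import json

# Simplified mapping of JSON Schema primitive names to Python types.
_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def validate_tool_call(call_json: str, schema: dict) -> list:
    """Return a list of problems with a model-emitted tool call ([] = valid)."""
    try:
        call = json.loads(call_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    args = call.get("arguments", {})
    props = schema.get("properties", {})
    errors = [f"missing required argument: {key}"
              for key in schema.get("required", []) if key not in args]
    for key, value in args.items():
        if key not in props:
            errors.append(f"unexpected argument: {key}")
        elif not isinstance(value, _TYPES.get(props[key].get("type"), object)):
            errors.append(f"wrong type for argument: {key}")
    return errors
```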
---

## 5. Multi-Step Planning

**Definition**: Breaking complex tasks into subtasks, maintaining coherent plans across
many steps, tracking progress, and adapting when plans fail.

### Key Benchmarks

| Benchmark | Tasks | Steps | What It Tests | Notes |
|---|---|---|---|---|
| **SWE-bench Verified** | 500 | 5-50+ | End-to-end issue resolution | Gold standard for agentic coding |
| **SWE-bench Pro** (Scale AI) | Harder | 10-100+ | More complex issues | Best model ~46% (vs 81% on Verified) |
| **FeatureBench** (Feb 2026) | 200 | Many | Complex feature development | Claude 4.5 Opus: only 11.0% (vs 74.4% SWE-bench) |
| **Snorkel Agentic Coding** | 100 | Multi-step, 4 tiers | Plan, track, execute, recover | Claude Opus 4.5: 58%, Gemini 3 Pro: 51.6% |
| **GAIA** (ICLR 2025) | 450 | Multi-step | General assistant planning | Near saturation (~90% top scores) |
| **Gaia2** (2026) | Varied | Async | Dynamic, asynchronous environments | Adds temporal constraints and agent collaboration |
| **Terminal-Bench 2.0** | 89 | Multi-step | Terminal workflow completion | Tests plan execution in CLI environments |

### Planning-Specific Insights

The gaps between SWE-bench Verified (~81% frontier), SWE-bench Pro (~46% frontier),
and FeatureBench (~11% frontier) reveal that multi-step planning degrades rapidly
with task complexity:

- **SWE-bench Verified**: Often requires 5-15 steps (find file, understand bug, edit,
  test)
- **SWE-bench Pro**: Requires deeper reasoning about architecture and dependencies
- **FeatureBench**: Requires implementing features across multiple files with
  architectural coherence over 50+ steps

This is the dimension where frontier models most decisively outperform open models,
though the gap is narrowing with agentic RL training (Qwen3-Coder, GLM-5).

### State of the Art (SWE-bench Verified, March 2026)

| Model | Score | Type | Notes |
|---|---|---|---|
| Claude Opus 4.5 | 80.9% | Frontier | Overall leader |
| Claude Opus 4.6 | 80.8% | Frontier | |
| Gemini 3.1 Pro | 80.6% | Frontier | |
| MiniMax M2.5 | 80.2% | Open | Best open model |
| GPT-5.2 | 80.0% | Frontier | |
| GLM-5 | 77.8% | Open | 744B MoE, 40B active |
| Kimi K2.5 | 76.8% | Open | |
| DeepSeek V3.2 | 73.0% | Open | |
| Qwen3-Coder-Next | 70.6% | Open | Only 3B active params |
| DeepSeek V3.1 | 66.0% | Open | |
| Nemotron 3 Super | 60.5% | Open | 120B, 12B active |

---
## 6. Debugging / Error Recovery

**Definition**: Handling test failures, reading error messages, diagnosing root causes,
and iterating toward a fix -- including recovering from the agent's own mistakes.

### Key Benchmarks

| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| **Terminal-Bench 2.0** (Stanford/Laude) | 89 | CLI debugging, error recovery, state mgmt | Gold standard for debugging evaluation |
| **Recovery-Bench** (Letta, 2025) | Varied | Recovery from corrupted states and error traces | Tests context pollution handling |
| **AgentErrorBench** (2025) | Varied | Error detection and debugging in trajectories | 24% improvement with AgentDebug method |
| **ReliabilityBench** (Jan 2026) | Varied | Consistency and fault recovery | Multi-dimensional reliability |
| **Aider Polyglot** (indirectly) | 225 | Two-attempt model with error feedback | Second attempt tests debug-from-feedback |

### Recovery-Bench Key Findings

Recovery-Bench (Letta) specifically evaluates a critical gap: even frontier models
"lack the ability to naturally recover from failed states." The benchmark creates
scenarios with:

- Erroneous files from previous attempts
- Corrupted reasoning traces in context
- Environment artifacts from failed edits

This is directly relevant to agentic coding loops, where an agent that makes a mistake
at step 15 of a 30-step task must recover without starting over.

### Terminal-Bench 2.0 Key Findings

Terminal-Bench tests real terminal workflows: inspect environments, read/edit files,
run commands, recover from errors, and finish multi-step tasks. Error categories:

- **Execution errors**: Dominate for Claude Opus 4.5 and GPT-5.2
- **Coherence errors**: Less frequent but more damaging
- **Verification errors**: Failing to check that a fix actually worked

### State of the Art

Debugging/error recovery is one of the weakest dimensions for all models. No model
achieves >70% on Terminal-Bench 2.0 or Recovery-Bench as of March 2026. This is
a primary area where the frontier-open gap matters most for practical agentic use.

---
## 7. Repository Understanding

**Definition**: Navigating large codebases, understanding file structure, dependency
graphs, cross-file relationships, and architectural patterns.

### Key Benchmarks

| Benchmark | Tasks | Languages | What It Tests | Notes |
|---|---|---|---|---|
| **CrossCodeEval** (NeurIPS 2023) | Varied | Python, Java, TS, C# | Cross-file code completion | Requires understanding imports and dependencies |
| **RepoBench** | 3 tasks | Python | Retrieval, completion, pipeline | Tests codebase navigation |
| **RepoEval** | Varied | Python | Repository-level completion | 16 GitHub repositories |
| **RepoCod** (ACL 2025) | Varied | Multiple | Full repository code generation | "LLMs not yet ready" |
| **LoCoBench-Agent** (2025) | Varied | Multiple | Interactive repo exploration | Agent-based evaluation |
| **DependEval** | 3 tasks | Multiple | Dependency recognition, multi-file editing | Tests architectural understanding |

### Key Challenge

Repository understanding is difficult to isolate as a benchmark dimension because
it is a prerequisite for most agentic coding tasks. SWE-bench implicitly tests it
(you cannot fix a bug if you cannot find the relevant file) but does not score it
separately.

The most direct measures are:
1. **CrossCodeEval**: Do predictions improve when cross-file context is provided?
2. **RepoBench-R**: Can the model retrieve the right context from the repository?
3. **DependEval**: Can the model understand and modify dependency relationships?

### State of the Art

Models with longer context windows have an inherent advantage. The Qwen3-Coder family
was explicitly trained for "repository-scale understanding" with 256K native context
(extendable to 1M). GLM-5 uses DeepSeek Sparse Attention for 205K context.

For 64GB systems, Qwen3-Coder-30B-A3B and Qwen3-Coder-Next are the strongest choices
due to their long-context training and MoE efficiency.

---
## 8. Instruction Following

**Definition**: Following complex, multi-constraint instructions precisely --
formatting requirements, length constraints, keyword inclusion, structural rules.

### Key Benchmarks

| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| **IFEval** (Google, Nov 2023) | ~500 | 25 types of verifiable instructions | Format, length, keyword, structure constraints |
| **IFEval-Extended** (2024) | Dynamic | Generative instruction synthesis | Thousands of unique instructions from templates |
| **M-IFEval** (NAACL 2025) | Multi-lingual | French, Japanese, Spanish instruction following | Performance varies widely across languages |
| **IFEval-FC** (2025) | Varied | Instruction following in function call schemas | JSON schema constraint adherence |
| **AgentIF** (Tsinghua, 2025) | Varied | Agent-specific instruction following | Evaluates IF within agentic loops |

### Relevance to Agentic Coding

Instruction following is critical for agentic coding because:

1. **System prompts**: Agents receive detailed behavioral instructions (e.g., CLAUDE.md
   conventions in this repo)
2. **Edit format compliance**: Models must produce output in exact formats (search/replace
   blocks, unified diffs, JSON tool calls)
3. **Multi-constraint tasks**: "Fix the bug AND add a test AND update the docstring AND
   follow the project's naming conventions"

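IFEval's core idea is that every instruction is programmatically verifiable, so no LLM judge is needed. A stripped-down checker (this constraint set is chosen for illustration, not IFEval's actual 25 instruction types) looks like:

```python
def check_instructions(response: str, keywords=(), max_words=None, ends_with=None):
    """Verify IFEval-style constraints on a response; [] means all satisfied."""
    failures = []
    for kw in keywords:                       # keyword-inclusion constraints
        if kw not in response:
            failures.append(f"missing keyword: {kw}")
    if max_words is not None and len(response.split()) > max_words:
        failures.append("over word limit")    # length constraint
    if ends_with is not None and not response.rstrip().endswith(ends_with):
        failures.append("wrong ending")       # structural constraint
    return failures
```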
### State of the Art

IFEval is included in the Open LLM Leaderboard V2, making it one of the most widely
reported benchmarks. Frontier models score >90% on IFEval. Open models vary widely;
instruction-tuned variants of Qwen3.5, DeepSeek V3, and GLM-5 are competitive at >85%.

---

## 9. Long Context Utilization

**Definition**: Effectively using large context windows (32K-1M tokens) with code --
not just accepting long inputs, but actually using information from all parts.

### Key Benchmarks

| Benchmark | What It Tests | Notes |
|---|---|---|
| **RULER** (NVIDIA, COLM 2024) | Multi-needle retrieval, distractor handling | Most models degrade significantly beyond 32K |
| **Needle in a Haystack** (NIAH) | Single-fact retrieval in long context | Near-saturated for frontier models |
| **LoCoBench** (2025) | Long-context code completion and comprehension | Claude 3.5 Sonnet: 29% at short context, 3% at long |
| **LongCodeBench** (2025) | Long-context code tasks | Single-language, limited diversity |
| **LongBench** (ACL 2025) | General long-context evaluation | Reveals limitations of existing benchmarks |

### "Context Rot" Phenomenon

Research from Chroma (2025) documented "context rot": as input tokens increase,
LLM performance degrades even when the relevant information is present. This is
particularly acute for code tasks where:

- File A defines a class, file B imports it, file C tests it
- All three must be in context simultaneously
- Models must cross-reference across files, not just retrieve individual facts

### State of the Art

| Model | Native Context | Effective Context* | Notes |
|---|---|---|---|
| Nemotron 3 Super | 1M tokens | 91.75% accuracy at 1M | Best retention score |
| Qwen3-Coder-Next | 256K (1M w/ YaRN) | Good at 256K | Trained for repo-scale |
| GLM-5 | 205K | Good | DeepSeek Sparse Attention |
| DeepSeek V3.2 | 128K | Moderate | |

*"Effective context" means the model actually uses information at that distance,
not just accepts it without error.

For 64GB systems, context length is bounded by available memory. At Q4 quantization,
a 30B-A3B model can handle ~64K-128K tokens before running out of KV cache space
(depending on GQA configuration and batch size).

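The KV-cache bound can be estimated directly: per token, the cache stores one key and one value vector per layer per KV head. A back-of-envelope sizing helper, with placeholder layer/head numbers that should be checked against the actual model config:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_tokens: int, bytes_per_elem: int = 2) -> float:
    """GiB of KV cache for one sequence: K and V, per layer, per token.

    bytes_per_elem=2 assumes an fp16/bf16 cache; a quantized KV cache
    roughly halves this.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens
    return total_bytes / 2**30

# With placeholder GQA values (48 layers, 4 KV heads, head_dim 128),
# a 128K-token context costs 12.0 GiB at fp16 -- which is how an ~18GB
# model ends up limited to roughly 64K-128K tokens in a 64GB budget.
```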
---

## 10. Multi-Language Support

**Definition**: Handling different programming languages correctly -- not just Python,
but also compiled languages, systems languages, and less common languages.

### Key Benchmarks

| Benchmark | Languages | What It Tests | Notes |
|---|---|---|---|
| **Aider Polyglot** | C++, Go, Java, JS, Python, Rust | Edit + debug in 6 languages | 225 Exercism exercises |
| **Multi-SWE-bench** (NeurIPS 2025) | Python, Java, TS, JS, Go, Rust, C, C++ | Issue resolution in 8 languages | 1,632 validated issues |
| **Multi-SWE-bench mini** | 8 languages | Lightweight version | 400 instances, reduced compute |
| **SWE-PolyBench** (Amazon) | Java, JS, TS, Python | Bug fixes, features, refactoring | 2,110 curated issues |
| **SWE-smith** | 9 languages | SWE-bench style across 42 repos | 300 curated tasks |
| **HumanEval-X** | Python, C++, Java, JS, Go | Cross-lingual code generation | Translation of HumanEval |
| **BigCodeBench** | Python (139 libs) | Multi-library Python | Tests library-specific knowledge |

### Multi-SWE-bench vs SWE-PolyBench

Two competing multilingual benchmarks emerged in 2025:

- **Multi-SWE-bench** (ByteDance): 1,632 issues, 8 languages, NeurIPS 2025
  Datasets track. Also provides `mini` (400 instances) and `flash` (300 instances)
  variants for reduced compute.
- **SWE-PolyBench** (Amazon): 2,110 issues, 4 languages, with a verified subset of
  384 instances. Covers bug fixes, features, and refactoring.

### Language-Specific Performance Gaps

Open models show significant performance variation across languages:
- **Python**: Best-supported universally
- **JavaScript/TypeScript**: Second-best, strong ecosystem coverage
- **Rust, Go, C++**: Substantially weaker, especially for complex patterns
- **Low-resource languages** (Julia, Lua, Perl): StarCoder2-15B historically strong here

### State of the Art

Qwen3-Coder-Next achieves 62.8% on SWE-bench Multilingual. Among 64GB-feasible models,
Qwen3-Coder-30B-A3B benefits from Qwen's broad multilingual training data.

---

## 11. Test Generation

**Definition**: Writing tests, understanding test frameworks, achieving coverage,
generating meaningful assertions -- not just syntactically valid tests.

### Key Benchmarks

| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| **TestEval** (2024) | 210 | LLM test case generation for LeetCode programs | Basic test generation ability |
| **ULT** (2025) | 3,909 | Unit test generation for complex functions | High cyclomatic complexity, leakage-free |
| **WebApp1K** (2025) | 1,000 | Test-driven development tasks | Tests serve as both prompt and verification |
| **CoverUp** (2024) | Varied | Coverage-guided test generation | Iterative LLM-guided coverage improvement |

### Current Performance

LLM-generated tests achieve on average:
- **41.32%** accuracy (tests pass and are meaningful)
- **45.10%** statement coverage
- **30.22%** branch coverage
- **40.21%** mutation score

These numbers are from a multi-model benchmark study (2025). CoverUp's iterative
approach achieves 80% line+branch coverage (vs 47% for CodaMosa), suggesting that
agentic test generation loops significantly outperform single-shot generation.

### Key Insight

Test generation is an area where agentic approaches (generate, run, check coverage,
iterate) dramatically outperform single-shot generation. This makes it particularly
suited to the iterative agent loop and a strong candidate for local model evaluation.

### State of the Art

Code agents were shown to be "state of the art software testers" when given an
iterative loop with coverage feedback (2024 paper). No single model dominates this
dimension; the scaffolding (coverage feedback, iteration) matters more than the
base model for test generation.

---

## 12. Benchmark Suite Summary

### Tier 1: Must-Run for Agentic Coding Evaluation

These are the most informative benchmarks for evaluating a model's fitness as a
coding agent:

| Benchmark | Primary Dimensions | Run Cost | Notes |
|---|---|---|---|
| **SWE-bench Verified** | Planning, editing, repo understanding | High (500 Docker envs) | Gold standard |
| **Aider Polyglot** | Editing, multi-lang, debugging | Medium (225 problems) | Best edit benchmark |
| **BigCodeBench** | Generation, multi-tool | Medium (1,140 tasks) | Best generation benchmark |
| **BFCL V4** | Tool use, function calling | Low-Medium | De facto tool-use standard |
| **Terminal-Bench 2.0** | Debugging, planning, error recovery | High (89 real envs) | Best debugging benchmark |

### Tier 2: Valuable Supplementary Benchmarks

| Benchmark | Primary Dimensions | Notes |
|---|---|---|
| **LiveCodeBench** | Generation (contamination-free) | Rolling benchmark |
| **IFEval** | Instruction following | Quick to run, widely reported |
| **Multi-SWE-bench mini** | Multi-language, planning | 400 instances, 8 languages |
| **EvalPlus (HumanEval+/MBPP+)** | Generation (rigorous) | Good baseline |
| **Recovery-Bench** | Error recovery | Novel and underexplored |
| **FeatureBench** | Complex planning | Very hard; differentiates top models |

### Tier 3: Niche or Near-Saturated

| Benchmark | Status | Notes |
|---|---|---|
| **HumanEval** | Near-saturated | >95% for frontier models; use EvalPlus instead |
| **MBPP** | Near-saturated | Use MBPP+ instead |
| **GAIA** | Near-saturation (~90%) | Good for general agents, less code-specific |
| **Needle-in-a-Haystack** | Saturated | Use RULER for long-context |

### Commonly Cited on Model Cards

When coding-focused models publish on Hugging Face, the most frequently cited
benchmarks (in rough order of frequency) are:

1. SWE-bench Verified (agentic coding standard)
2. HumanEval / HumanEval+ (code generation baseline)
3. MBPP / MBPP+ (code generation)
4. BigCodeBench (multi-tool generation)
5. Aider Polyglot (code editing, multi-language)
6. LiveCodeBench (contamination-free generation)
7. BFCL (function calling)
8. IFEval (instruction following)
9. Multi-SWE-bench (multilingual agentic)

---

## 13. Open-Weight Model Landscape for 64GB Systems

### Models Feasible on 64GB Unified Memory (Strix Halo)

Sorted by practical fitness for agentic coding tasks. "Active" = parameters active
per forward pass for MoE models.

| Model | Total / Active | GGUF Q4 Size | SWE-bench | Key Strength |
|---|---|---|---|---|
| **Qwen3-Coder-Next** | 80B / 3B | ~46GB (Q4) | 70.6% Verified | Best efficiency ratio; agentic RL training |
| **Qwen3-Coder-30B-A3B** | 30.5B / 3.3B | ~18GB (Q4) | ~55%* (est.) | Fits easily; native 256K context; function call format |
| **Qwen3.5-35B-A3B** | 35B / 3B | ~19GB (Q4) | N/A | General + coding; fast at 112 tok/s on RTX 3090 |
| **Nemotron 3 Super** | 120B / 12B | ~64GB (Q4) | 60.5% | 1M context; PinchBench 85.6%; hybrid Mamba-Transformer |
| **Qwen3.5-27B** | 27B / 27B (dense) | ~17GB (Q4) | ~55%* | Dense; 72.4% SWE-bench reported for Qwen3.5-27B |
| **DeepSeek V3.2** | 671B / 37B | Too large at Q4 | 73.0% | Requires >200GB; not feasible for 64GB |
| **GLM-5** | 744B / 40B | Too large at Q4 | 77.8% | Best open SWE-bench; not feasible for 64GB |

*Estimated; exact scores for quantized GGUF variants not independently benchmarked.

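The "GGUF Q4 Size" column follows from bits per weight: Q4_K_M averages roughly 4.8 bits/weight once block scales are included (an approximation to verify against the actual file):

```python
def gguf_size_gib(n_params_billions: float, bits_per_weight: float = 4.8) -> float:
    """Approximate GGUF file size from parameter count and quantization width.

    4.8 bits/weight is a rough average for Q4_K_M (mixed 4/6-bit blocks
    plus scales); actual files vary by a GiB or two.
    """
    return n_params_billions * 1e9 * bits_per_weight / 8 / 2**30
```

`gguf_size_gib(30.5)` gives ~17 GiB, consistent with the ~18GB entry for Qwen3-Coder-30B-A3B, and `gguf_size_gib(80)` gives ~45 GiB, consistent with ~46GB for Qwen3-Coder-Next.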
### Recommended Configuration for 64GB Strix Halo
|
||||
|
||||
**Primary coding agent**: Qwen3-Coder-30B-A3B-Instruct (Q4_K_M, ~18GB)
|
||||
- Fits with ample room for KV cache and context
|
||||
- Specially designed function call format
|
||||
- Native 256K context, extendable to 1M
|
||||
- Strong agentic coding training
|
||||
- Fast inference due to 3.3B active parameters
|
||||
|
||||
**Stretch option**: Qwen3-Coder-Next (Q4, ~46GB)
|
||||
- Tighter fit but significantly stronger (70.6% SWE-bench Verified)
|
||||
- 3B active parameters = good generation speed
|
||||
- Leaves ~18GB for KV cache and system
|
||||
|
||||
**Dense alternative**: Qwen3.5-27B (Q4_K_M, ~17GB)
|
||||
- When you need strong general + coding ability
|
||||
- Dense model = more predictable behavior
|
||||
- Good baseline for comparison
|
||||
|
||||
### Older Models: Still Relevant?
|
||||
|
||||
- **CodeLlama-34B** (Meta, 2023): Superseded by Qwen and DeepSeek families. Only
|
||||
relevant for historical comparison or if specific fine-tunes are needed.
|
||||
- **StarCoder2-15B** (ServiceNow/HF/NVIDIA, 2024): Outperformed CodeLlama-34B at half
|
||||
the size. Still competitive for low-resource languages (Julia, Lua, Perl) but
|
||||
otherwise superseded by Qwen3-Coder.
|
||||
- **DeepSeek-Coder-V2-Lite-16B** (2024): Was competitive but now clearly behind
|
||||
Qwen3-Coder-30B-A3B and Qwen3-Coder-Next.
|
||||
|
||||
---
|
||||
|
||||
## 14. Frontier vs. Open Model Gap
|
||||
|
||||
### Gap Analysis by Dimension (March 2026)
|
||||
|
||||
| Dimension | Frontier Best | Open Best (64GB) | Gap | Trend |
|
||||
|---|---|---|---|---|
|
||||
| Code Generation | ~98% HumanEval | ~85% HumanEval | Small | Closing rapidly |
|
||||
| Code Editing | 88% Aider Polyglot | ~60% Aider Polyglot | Large | Closing (MoE helps) |
|
||||
| Tool Use | >90% BFCL | ~80% BFCL | Moderate | Closing with dedicated training |
|
||||
| Multi-Step Planning | 80.9% SWE-bench | 70.6% SWE-bench (Coder-Next) | Moderate | Narrowing with agentic RL |
|
||||
| Debugging/Recovery | ~65% Terminal-Bench | ~45% Terminal-Bench* | Large | Widest persistent gap |
|
||||
| Repo Understanding | Excellent | Good (long-context models) | Moderate | Closing with 256K+ contexts |
|
||||
| Instruction Following | >90% IFEval | >85% IFEval | Small | Nearly closed |
|
||||
| Long Context | 1M+ effective | 256K effective | Moderate | Hardware-limited for local |
|
||||
| Multi-Language | 80%+ Multi-SWE | 62.8% Multi-SWE | Moderate | Improving with diverse training |
|
||||
| Test Generation | ~50% coverage | ~40% coverage | Small | Scaffolding matters more |
|
||||
|
||||
*Estimated; Terminal-Bench scores not widely reported for 64GB-feasible open models.
|
||||
|
||||
### Key Observations

1. **Code generation is nearly solved** for simple tasks. The gap has shifted to
   complex, multi-step, multi-file tasks.

2. **Debugging/error recovery is the widest gap** and the hardest to close. This is
   where frontier models' larger parameter counts and RLHF refinement matter most.

3. **MoE architectures are the bridge** for 64GB systems. Models like Qwen3-Coder-Next
   (80B total, 3B active) achieve SWE-bench scores comparable to models with 10-20x
   more active parameters.

4. **Agentic RL training** (as used in Qwen3-Coder and GLM-5) is the primary driver of
   open-model improvement on the planning and debugging dimensions.

5. **Scaffolding equalizes** many gaps. A well-designed agent scaffold (SWE-Agent,
   OpenHands, Aider) can make a 30B model perform comparably to a raw 400B model.

---
## 15. Recommended Evaluation Stack

For evaluating models locally on the Strix Halo system, the following stack covers
all 10 dimensions using tools already referenced in this project's `docs/references.md`:

### Inspect AI (Primary Framework)

Inspect AI supports multiple benchmarks in a unified framework:

- HumanEval (code generation)
- BigCodeBench (multi-tool generation)
- BFCL (function calling / tool use)
- GAIA (multi-step planning)
- IFEval (instruction following)

Run against an OpenAI-compatible endpoint (ollama or llama.cpp server).

### EvalPlus (Code Generation)

- HumanEval+ and MBPP+ with native ollama support
- More rigorous than base HumanEval/MBPP
- Already configured in this project's `scripts/agentic/` framework

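EvalPlus scores are pass@k rates. The standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) is easy to compute directly when n samples per task yield c passing solutions:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from Chen et al. (2021): 1 - C(n-c, k) / C(n, k),
    the chance that at least one of k draws from n samples (c correct) passes."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 3 of 10 samples passed a task; estimate pass@1
print(round(pass_at_k(10, 3, 1), 2))  # -> 0.3
```

Averaging this per-task estimate across the suite gives the reported pass@k.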
### BigCodeBench (Multi-Tool Generation)

- 1,140 tasks across 139 libraries
- Already listed in `docs/references.md`
- Tests multi-library, cross-domain code generation

### Aider (Code Editing + Multi-Language)

- Built-in polyglot benchmark: 225 exercises across 6 languages
- Tests edit format compliance, multi-language support, debugging loop
- Can be run against any OpenAI-compatible endpoint

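Much of what the polyglot benchmark measures is edit-format compliance: the model must emit SEARCH/REPLACE blocks that apply cleanly. A minimal applier for a single block (an illustrative sketch, not Aider's actual parser) shows the contract:

```python
def apply_edit(text: str, block: str) -> str:
    """Apply one SEARCH/REPLACE block in the style of Aider's diff edit
    format. Minimal sketch: the SEARCH text must match exactly once."""
    head, _, tail = block.partition("=======")
    search = head.split("<<<<<<< SEARCH\n", 1)[1].rstrip("\n")
    replace = tail.split(">>>>>>> REPLACE", 1)[0].strip("\n")
    if text.count(search) != 1:
        raise ValueError("SEARCH text must match exactly once")
    return text.replace(search, replace)

block = "<<<<<<< SEARCH\nretries = 1\n=======\nretries = 3\n>>>>>>> REPLACE\n"
print(apply_edit("retries = 1\ntimeout = 30", block))
```

Models that drift from the expected markers, or emit SEARCH text that does not match the file verbatim, fail the edit step regardless of whether their intended change was correct.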
### BFCL (Tool Use)

- Installable via `pip install bfcl-eval`
- Tests function-calling accuracy
- Already listed in `docs/references.md`

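Function-calling accuracy here means the emitted call matches a declared schema. A toy checker in that spirit, with a hypothetical `run_tests` tool (not BFCL's actual scorer or categories):

```python
import json

# Hypothetical tool declaration in the common JSON-schema function-calling style.
TOOL = {
    "name": "run_tests",
    "parameters": {
        "type": "object",
        "properties": {"path": {"type": "string"}, "verbose": {"type": "boolean"}},
        "required": ["path"],
    },
}

def call_matches(schema: dict, call_json: str) -> bool:
    """Loose match: correct tool name, all required arguments present,
    and no arguments outside the declared schema."""
    call = json.loads(call_json)
    props = schema["parameters"]["properties"]
    args = call.get("arguments", {})
    return (
        call.get("name") == schema["name"]
        and all(req in args for req in schema["parameters"]["required"])
        and all(arg in props for arg in args)
    )

print(call_matches(TOOL, '{"name": "run_tests", "arguments": {"path": "tests/"}}'))
```

Real scorers additionally type-check argument values and handle parallel and multi-turn calls, which is where the frontier/open gap in the table above shows up.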
### Practical Execution Order

1. **Quick smoke test**: EvalPlus (HumanEval+) -- ~30 min
2. **Generation depth**: BigCodeBench-Hard (148 tasks) -- ~2-4 hours
3. **Editing ability**: Aider polyglot benchmark -- ~4-6 hours
4. **Tool use**: BFCL eval -- ~1-2 hours
5. **Instruction following**: IFEval via Inspect AI -- ~1 hour
6. **Full agentic**: SWE-bench Verified (if Docker resources available) -- ~24+ hours

---
## 16. Sources

### Papers

- Chen et al. (2021). "Evaluating Large Language Models Trained on Code." arXiv:2107.03374. [HumanEval]
- Liu et al. (2023). "Is Your Code Generated by ChatGPT Really Correct?" NeurIPS 2023. [EvalPlus/HumanEval+]
- Jimenez et al. (2024). "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024.
- Zhuo et al. (2024). "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls." ICLR 2025.
- Patil et al. (2025). "The Berkeley Function Calling Leaderboard (BFCL)." ICML 2025.
- Mialon et al. (2023). "GAIA: A Benchmark for General AI Assistants." arXiv:2311.12983.
- Hsieh et al. (2024). "RULER: What's the Real Context Size of Your Long-Context Language Models?" COLM 2024.
- Zhou et al. (2023). "Instruction-Following Evaluation for Large Language Models." arXiv:2311.07911. [IFEval]
- Terminal-Bench team (2026). "Terminal-Bench: Benchmarking Agents on Hard CLI Tasks." Stanford/Laude Institute.
- FeatureBench (Feb 2026). "Benchmarking Agentic Coding for Complex Feature Development." arXiv:2602.10975.
- HumanEval Pro / MBPP Pro (ACL 2025). "Evaluating LLMs on Self-invoking Code Generation Task."
- Multi-SWE-bench (NeurIPS 2025). "A Multilingual Benchmark for Issue Resolving."
- SWE-PolyBench (Amazon, 2025). "A Multi-Language Benchmark for Repository-Level Evaluation."
- Recovery-Bench (Letta, 2025). "Evaluating LLMs' Ability to Recover from Mistakes."
- Diff-XYZ (Oct 2025). "A Benchmark for Evaluating Diff Understanding."

### Leaderboards and Live Data

- SWE-bench Leaderboard: https://www.swebench.com/
- SWE-bench Verified Leaderboard: https://llm-stats.com/benchmarks/swe-bench-verified
- SWE-rebench Leaderboard: https://swe-rebench.com/
- Aider LLM Leaderboards: https://aider.chat/docs/leaderboards/
- BFCL V4 Leaderboard: https://gorilla.cs.berkeley.edu/leaderboard.html
- EvalPlus Leaderboard: https://evalplus.github.io/leaderboard.html
- BigCodeBench Leaderboard: https://huggingface.co/blog/leaderboard-bigcodebench
- Terminal-Bench Leaderboard: https://www.tbench.ai/
- Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
- Scale Labs SWE-bench Pro: https://labs.scale.com/leaderboard/swe_bench_pro_public
- Artificial Analysis Terminal-Bench: https://artificialanalysis.ai/evaluations/terminalbench-hard

### Model Documentation

- Qwen3-Coder: https://github.com/QwenLM/Qwen3-Coder
- Qwen3-Coder-Next: https://qwen.ai/blog?id=qwen3-coder-next
- Qwen3-Coder-30B-A3B GGUF: https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
- GLM-5: https://huggingface.co/zai-org/GLM-5
- Nemotron 3 Super: https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/
- DeepSeek V3 series: https://www.bentoml.com/blog/the-complete-guide-to-deepseek-models-from-v3-to-r1-and-beyond

### Tools and Frameworks

- Inspect AI: https://github.com/UKGovernmentBEIS/inspect_ai
- Inspect Evals catalog: https://inspect.aisi.org.uk/evals/
- EvalPlus: https://github.com/evalplus/evalplus
- BigCodeBench: https://github.com/bigcode-project/bigcodebench
- BFCL: https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard
- Aider: https://aider.chat/
- Aider Polyglot benchmark: https://github.com/Aider-AI/polyglot-benchmark
- LiveCodeBench: https://livecodebench.github.io/
- CoverUp (test generation): https://arxiv.org/html/2403.16218v3