From c8479917405ca7fe96e828546f5768dd9be02c71 Mon Sep 17 00:00:00 2001 From: Felipe Cardoso Date: Wed, 15 Apr 2026 15:55:04 +0200 Subject: [PATCH] docs: add agentic coding evaluation landscape research Comprehensive research (706 lines, dated 2026-03-30) covering evaluation dimensions, benchmark suites, and open-weight model performance for software engineering agent use cases on 64GB systems. Also gitignore evalplus_results/ (runtime outputs) and ztop/ (nested repo). --- .gitignore | 2 + docs/agentic-coding-evaluation-landscape.md | 706 ++++++++++++++++++++ 2 files changed, 708 insertions(+) create mode 100644 docs/agentic-coding-evaluation-landscape.md diff --git a/.gitignore b/.gitignore index 4c96559..bb6dc8a 100644 --- a/.gitignore +++ b/.gitignore @@ -5,3 +5,5 @@ data/ *.tmp .claude/ .idea/ +evalplus_results/ +ztop/ diff --git a/docs/agentic-coding-evaluation-landscape.md b/docs/agentic-coding-evaluation-landscape.md new file mode 100644 index 0000000..c08749d --- /dev/null +++ b/docs/agentic-coding-evaluation-landscape.md @@ -0,0 +1,706 @@ +# Agentic Coding Evaluation Landscape + +Comprehensive research into the dimensions, benchmarks, and model performance for +evaluating LLMs in software engineering agent use cases. Research date: 2026-03-30. + +--- + +## Table of Contents + +1. [Evaluation Taxonomy](#1-evaluation-taxonomy) +2. [Dimension 1: Code Generation Accuracy](#2-code-generation-accuracy) +3. [Dimension 2: Code Editing / Patching](#3-code-editing--patching) +4. [Dimension 3: Tool Use / Function Calling](#4-tool-use--function-calling) +5. [Dimension 4: Multi-Step Planning](#5-multi-step-planning) +6. [Dimension 5: Debugging / Error Recovery](#6-debugging--error-recovery) +7. [Dimension 6: Repository Understanding](#7-repository-understanding) +8. [Dimension 7: Instruction Following](#8-instruction-following) +9. [Dimension 8: Long Context Utilization](#9-long-context-utilization) +10. 
[Dimension 9: Multi-Language Support](#10-multi-language-support) +11. [Dimension 10: Test Generation](#11-test-generation) +12. [Benchmark Suite Summary](#12-benchmark-suite-summary) +13. [Open-Weight Model Landscape for 64GB Systems](#13-open-weight-model-landscape-for-64gb-systems) +14. [Frontier vs. Open Model Gap](#14-frontier-vs-open-model-gap) +15. [Recommended Evaluation Stack](#15-recommended-evaluation-stack) +16. [Sources](#16-sources) + +--- + +## 1. Evaluation Taxonomy + +Recent survey work (CSLLM Survey, 2025; SE Agent Benchmark Survey, 2025) organizes +coding LLM evaluation along two orthogonal axes: + +- **Capability dimension**: What is being measured (generation, editing, tool use, + planning, debugging, comprehension, instruction following, etc.) +- **Evaluation paradigm**: How it is measured (static benchmarks, execution-based + evaluation, agent-in-the-loop evaluation, human evaluation) + +The field has moved decisively from static benchmarks (HumanEval, MBPP) toward +agent-in-the-loop evaluations (SWE-bench, Terminal-Bench, FeatureBench) that test +the full agentic loop: plan, act, observe, iterate. This shift matters because models +that score 95%+ on HumanEval can still fail below 50% on realistic agentic tasks. + +The ten dimensions below map to the capability axis. Each dimension lists the +benchmarks that best isolate it, though in practice most agentic benchmarks test +multiple dimensions simultaneously. + +--- + +## 2. Code Generation Accuracy + +**Definition**: Writing correct, complete code from natural-language specifications or +docstrings, measured by functional correctness (pass@k on test suites). 
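The pass@k metric is normally computed with the unbiased estimator from Chen et al. (2021): generate `n >= k` samples per task, count the `c` that pass the tests, and estimate the probability that at least one of `k` random draws passes. A minimal sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: samples generated per task; c: samples that passed; k: draw budget.
    Returns 1 - C(n-c, k) / C(n, k), computed stably as a running product.
    """
    if n - c < k:
        return 1.0  # too few failures to fill all k draws: one must pass
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# A benchmark's pass@k score is the mean of pass_at_k over all tasks.
```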
+ +### Key Benchmarks + +| Benchmark | Tasks | Languages | Metric | Notes | +|---|---|---|---|---| +| **HumanEval** (Chen et al., 2021) | 164 | Python | pass@k | Foundational but near-saturated; best models >95% | +| **HumanEval+** / **MBPP+** (EvalPlus, NeurIPS 2023) | 164 / 399 | Python | pass@k (80x more tests) | Catches false positives from HumanEval; ~10-15% score drops | +| **HumanEval Pro** / **MBPP Pro** (ACL 2025) | 164 / 399 | Python | pass@k on self-invoking tasks | Tests compositional reasoning; o1-mini drops from 96.2% to 76.2% | +| **BigCodeBench** (ICLR 2025) | 1,140 | Python (139 libs) | pass@1 | Multi-tool, cross-domain; best model (GPT-4o) ~60% Complete, <50% Instruct | +| **BigCodeBench-Hard** | 148 | Python | pass@1 | Hardest subset; human performance 97%, LLMs ~60% | +| **LiveCodeBench** (EMNLP 2025) | Rolling | Python | pass@k | Contamination-free: new problems added continuously from competitive programming | + +### State of the Art + +- **Frontier**: Claude Opus 4.5/4.6, GPT-5.2, Gemini 3.1 Pro all score >95% on + HumanEval, ~85% on HumanEval+, ~65% on BigCodeBench-Complete. +- **Open (64GB-feasible)**: Qwen3.5-27B-Q4 achieves ~80% on HumanEval+. + Qwen3-Coder-30B-A3B (3.3B active, ~18GB at Q4) is strong on BigCodeBench. + Qwen2.5-Coder-32B-Instruct matched GPT-4o on HumanEval when released. + +### Key Insight + +HumanEval is near-saturated and should no longer be used as a primary differentiator. +BigCodeBench and LiveCodeBench are the current gold standards for code generation +accuracy, as they test realistic multi-library tasks and resist contamination. + +--- + +## 3. Code Editing / Patching + +**Definition**: Modifying existing code correctly -- applying diffs, fixing bugs in +context, integrating new code into existing files -- rather than generating from scratch. 
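Much of the difficulty here is mechanical rather than cognitive: the model must reproduce the target code exactly for an edit to apply. A minimal search/replace applier (a sketch of the general technique, not any particular tool's implementation) shows the two classic failure modes -- no match and ambiguous match:

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply one search/replace edit, failing on the two classic error modes."""
    count = source.count(search)
    if count == 0:
        raise ValueError("search block not found (model mis-copied the code)")
    if count > 1:
        raise ValueError(f"search block matched {count} times (edit is ambiguous)")
    return source.replace(search, replace)
```

A single stray space in the search block turns a correct fix into a failed edit, which is why edit-format compliance is measured separately from raw coding ability.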
+ +### Key Benchmarks + +| Benchmark | Tasks | What It Tests | Notes | +|---|---|---|---| +| **Aider Code Editing** | 133 | Edit Python files to solve Exercism problems | Tests edit format compliance + coding ability | +| **Aider Polyglot** | 225 | Edit code across 6 languages with error feedback | Two attempts per problem; measures edit+debug loop | +| **Diff-XYZ** (Oct 2025) | 3 tasks | Apply, anti-apply, generate diffs | Tests diff understanding in multiple formats | +| **EDIT-Bench** | Varied | Real-world instructed code edits | Repository-level editing tasks | +| **SWE-bench** (indirectly) | 2,294 | Generate patches that resolve GitHub issues | Requires generating correct unified diffs | + +### Edit Format Considerations + +Code editing performance depends heavily on the edit format used: + +- **Search/Replace blocks** (Aider default): Most reliable for most models +- **Unified diff**: GPT-4 Turbo was "3x less lazy" with unified diffs (Aider blog) +- **V4A diff format**: OpenAI's recommended format (published with GPT-4.1, April 2025) +- **Whole-file rewrite**: Simpler but wasteful; works with weaker models + +Models that excel at generation can fail at editing because they struggle to produce +syntactically valid diffs or correctly locate the code to modify. + +### State of the Art (Aider Polyglot, March 2026) + +| Model | Score | Type | +|---|---|---| +| GPT-5 | 88.0% | Frontier | +| MiniMax M2.5 | 80.2% | Open | +| DeepSeek V3.2-Exp | 74.2% | Open | +| DeepSeek-R1-0528 | 71.4% | Open | +| GLM-4.5-FP8 | 66.0% | Open | +| Qwen3-Coder-480B | 61.8% | Open (too large for 64GB) | +| Qwen3-Coder-30B-A3B | ~55-60%* | Open (fits 64GB at Q4) | + +*Estimated from quantized GGUF performance data; exact Aider Polyglot score for +the 30B-A3B variant not independently confirmed. + +--- + +## 4. 
Tool Use / Function Calling + +**Definition**: Correctly invoking APIs, tools, or MCP servers -- selecting the right +function, constructing valid arguments, parsing responses, deciding when NOT to call. + +### Key Benchmarks + +| Benchmark | Tasks | What It Tests | Notes | +|---|---|---|---| +| **BFCL V4** (Berkeley) | Thousands | Function calling accuracy across formats | De facto standard; AST-based evaluation | +| **BFCL-v3** (via EvalScope) | Multi-turn | Stateful multi-step function calling | Tests memory and context management | +| **Nexus Function Calling** | Varied | Tool selection and invocation | Broader tool landscape | +| **IFEval-FC** (2025) | 500+ | Instruction following within function schemas | JSON schema constraint adherence | +| **tau-bench** | Varied | Tool-augmented task completion | End-to-end agent tool use | + +### BFCL Key Findings + +The Berkeley Function Calling Leaderboard reveals a critical split: + +1. **Single-turn calls**: Most frontier models score >90% accuracy +2. **Multi-turn stateful calls**: Performance drops 20-40% even for top models +3. **Abstention**: Knowing when NOT to call a function remains a major weakness +4. **Long-horizon tool use**: Memory, dynamic decision-making, and context management + are open challenges + +### State of the Art + +- **Frontier**: Claude Opus 4.5/4.6, GPT-5.2 lead overall BFCL V4 +- **Open**: Qwen3-Coder-480B is "comparable to Claude Sonnet 4 on Agentic Tool-Use" + (Qwen team). For 64GB-feasible models, Qwen3-Coder-30B-A3B has a specially + designed function call format and strong tool-use training. + Nemotron 3 Super (120B, 12B active) was explicitly trained for tool-use workflows. + +### Relevance to MCP + +MCP (Model Context Protocol) servers expose tools via JSON schemas -- exactly what +BFCL tests. A model's BFCL score is a reasonable proxy for MCP tool-use competence, +though MCP adds discovery and session management complexity not yet benchmarked. + +--- + +## 5. 
Multi-Step Planning + +**Definition**: Breaking complex tasks into subtasks, maintaining coherent plans across +many steps, tracking progress, and adapting when plans fail. + +### Key Benchmarks + +| Benchmark | Tasks | Steps | What It Tests | Notes | +|---|---|---|---|---| +| **SWE-bench Verified** | 500 | 5-50+ | End-to-end issue resolution | Gold standard for agentic coding | +| **SWE-bench Pro** (Scale AI) | Harder | 10-100+ | More complex issues | Best model ~46% (vs 81% on Verified) | +| **FeatureBench** (Feb 2026) | 200 | Many | Complex feature development | Claude 4.5 Opus: only 11.0% (vs 74.4% SWE-bench) | +| **Snorkel Agentic Coding** | 100 | Multi-step, 4 tiers | Plan, track, execute, recover | Claude Opus 4.5: 58%, Gemini 3 Pro: 51.6% | +| **GAIA** (ICLR 2025) | 450 | Multi-step | General assistant planning | Near saturation (~90% top scores) | +| **Gaia2** (2026) | Varied | Async | Dynamic, asynchronous environments | Adds temporal constraints and agent collaboration | +| **Terminal-Bench 2.0** | 89 | Multi-step | Terminal workflow completion | Tests plan execution in CLI environments | + +### Planning-Specific Insights + +The gap between SWE-bench Verified (~81% frontier) and SWE-bench Pro (~46% frontier) +and FeatureBench (~11% frontier) reveals that multi-step planning degrades rapidly +with task complexity: + +- **SWE-bench Verified**: Often requires 5-15 steps (find file, understand bug, edit, + test) +- **SWE-bench Pro**: Requires deeper reasoning about architecture and dependencies +- **FeatureBench**: Requires implementing features across multiple files with + architectural coherence over 50+ steps + +This is the dimension where frontier models most decisively outperform open models, +though the gap is narrowing with agentic RL training (Qwen3-Coder, GLM-5). 
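The plan-act-observe-iterate loop these benchmarks exercise can be sketched as a simple driver. `propose_step` and `run_step` are hypothetical stand-ins for the model call and tool execution; real scaffolds add context management and error handling around the same skeleton:

```python
def agent_loop(goal, propose_step, run_step, done, max_steps=30):
    """Minimal plan-act-observe loop: propose, execute, record, repeat."""
    history = []
    for _ in range(max_steps):
        step = propose_step(goal, history)   # "plan": pick the next action
        observation = run_step(step)         # "act": execute it
        history.append((step, observation))  # "observe": record the result
        if done(history):                    # "iterate" until the goal is met
            return history
    return history  # budget exhausted; benchmarks score this as a failure

# Toy usage: "count to 3" with trivial stand-in functions.
steps = agent_loop(
    goal=3,
    propose_step=lambda goal, hist: len(hist) + 1,
    run_step=lambda step: step,
    done=lambda hist: hist[-1][1] >= 3,
)
```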
+ +### State of the Art (SWE-bench Verified, March 2026) + +| Model | Score | Type | Notes | +|---|---|---|---| +| Claude Opus 4.5 | 80.9% | Frontier | Overall leader | +| Claude Opus 4.6 | 80.8% | Frontier | | +| Gemini 3.1 Pro | 80.6% | Frontier | | +| MiniMax M2.5 | 80.2% | Open | Best open model | +| GPT-5.2 | 80.0% | Frontier | | +| GLM-5 | 77.8% | Open | 744B MoE, 40B active | +| Kimi K2.5 | 76.8% | Open | | +| DeepSeek V3.2 | 73.0% | Open | | +| Qwen3-Coder-Next | 70.6% | Open | Only 3B active params | +| DeepSeek V3.1 | 66.0% | Open | | +| Nemotron 3 Super | 60.5% | Open | 120B, 12B active | + +--- + +## 6. Debugging / Error Recovery + +**Definition**: Handling test failures, reading error messages, diagnosing root causes, +and iterating toward a fix -- including recovering from the agent's own mistakes. + +### Key Benchmarks + +| Benchmark | Tasks | What It Tests | Notes | +|---|---|---|---| +| **Terminal-Bench 2.0** (Stanford/Laude) | 89 | CLI debugging, error recovery, state mgmt | Gold standard for debugging evaluation | +| **Recovery-Bench** (Letta, 2025) | Varied | Recovery from corrupted states and error traces | Tests context pollution handling | +| **AgentErrorBench** (2025) | Varied | Error detection and debugging in trajectories | 24% improvement with AgentDebug method | +| **ReliabilityBench** (Jan 2026) | Varied | Consistency and fault recovery | Multi-dimensional reliability | +| **Aider Polyglot** (indirectly) | 225 | Two-attempt model with error feedback | Second attempt tests debug-from-feedback | + +### Recovery-Bench Key Findings + +Recovery-Bench (Letta) specifically evaluates a critical gap: even frontier models +"lack the ability to naturally recover from failed states." 
The benchmark creates +scenarios with: + +- Erroneous files from previous attempts +- Corrupted reasoning traces in context +- Environment artifacts from failed edits + +This is directly relevant to agentic coding loops where an agent makes a mistake +at step 15 of a 30-step task and must recover without starting over. + +### Terminal-Bench 2.0 Key Findings + +Terminal-Bench tests real terminal workflows: inspect environments, read/edit files, +run commands, recover from errors, and finish multi-step tasks. Error categories: + +- **Execution errors**: Dominate for Claude Opus 4.5 and GPT-5.2 +- **Coherence errors**: Less frequent but more damaging +- **Verification errors**: Failing to check that a fix actually worked + +### State of the Art + +Debugging/error recovery is one of the weakest dimensions for all models. No model +achieves >70% on Terminal-Bench 2.0 or Recovery-Bench as of March 2026. This is +a primary area where the frontier-open gap matters most for practical agentic use. + +--- + +## 7. Repository Understanding + +**Definition**: Navigating large codebases, understanding file structure, dependency +graphs, cross-file relationships, and architectural patterns. 
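A first-order proxy for this dimension is whether the cross-file import graph can be recovered at all. A sketch using Python's standard `ast` module on in-memory sources:

```python
import ast

def import_graph(files: dict[str, str]) -> dict[str, set[str]]:
    """Map each module name to the set of modules it imports."""
    graph: dict[str, set[str]] = {}
    for name, src in files.items():
        deps: set[str] = set()
        for node in ast.walk(ast.parse(src)):
            if isinstance(node, ast.Import):
                deps.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                deps.add(node.module)
        graph[name] = deps
    return graph
```

Cross-file benchmarks like CrossCodeEval effectively ask whether a model's completions respect the edges of this graph rather than hallucinating symbols that no in-repo module defines.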
+ +### Key Benchmarks + +| Benchmark | Tasks | Languages | What It Tests | Notes | +|---|---|---|---|---| +| **CrossCodeEval** (NeurIPS 2023) | Varied | Python, Java, TS, C# | Cross-file code completion | Requires understanding imports and dependencies | +| **RepoBench** | 3 tasks | Python | Retrieval, completion, pipeline | Tests codebase navigation | +| **RepoEval** | Varied | Python | Repository-level completion | 16 GitHub repositories | +| **RepoCod** (ACL 2025) | Varied | Multiple | Full repository code generation | "LLMs not yet ready" | +| **LoCoBench-Agent** (2025) | Varied | Multiple | Interactive repo exploration | Agent-based evaluation | +| **DependEval** | 3 tasks | Multiple | Dependency recognition, multi-file editing | Tests architectural understanding | + +### Key Challenge + +Repository understanding is difficult to isolate as a benchmark dimension because +it is a prerequisite for most agentic coding tasks. SWE-bench implicitly tests it +(you cannot fix a bug if you cannot find the relevant file), but does not score it +separately. + +The most direct measures are: +1. **CrossCodeEval**: Do predictions improve when cross-file context is provided? +2. **RepoBench-R**: Can the model retrieve the right context from the repository? +3. **DependEval**: Can the model understand and modify dependency relationships? + +### State of the Art + +Models with longer context windows have an inherent advantage. The Qwen3-Coder family +was explicitly trained for "repository-scale understanding" with 256K native context +(extendable to 1M). GLM-5 uses DeepSeek Sparse Attention for 205K context. + +For 64GB systems, Qwen3-Coder-30B-A3B and Qwen3-Coder-Next are the strongest choices +due to their long-context training and MoE efficiency. + +--- + +## 8. Instruction Following + +**Definition**: Following complex, multi-constraint instructions precisely -- +formatting requirements, length constraints, keyword inclusion, structural rules. 
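IFEval-style "verifiable instructions" are precisely the constraints a program can check without a judge model. A minimal checker sketch (the constraint names here are illustrative, not IFEval's actual taxonomy):

```python
import json

def check_output(text: str, *, max_words=None, required_keywords=(), json_only=False):
    """Return per-constraint pass/fail, the way IFEval-style scoring works."""
    results = {}
    if max_words is not None:
        results["length"] = len(text.split()) <= max_words
    for kw in required_keywords:
        results[f"keyword:{kw}"] = kw.lower() in text.lower()
    if json_only:
        try:
            json.loads(text)
            results["json"] = True
        except json.JSONDecodeError:
            results["json"] = False
    return results
```

Strict accuracy requires every constraint to pass at once, which is why multi-constraint prompts score so much lower than single-constraint ones.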
+ +### Key Benchmarks + +| Benchmark | Tasks | What It Tests | Notes | +|---|---|---|---| +| **IFEval** (Google, Nov 2023) | ~500 | 25 types of verifiable instructions | Format, length, keyword, structure constraints | +| **IFEval-Extended** (2024) | Dynamic | Generative instruction synthesis | Thousands of unique instructions from templates | +| **M-IFEval** (NAACL 2025) | Multi-lingual | French, Japanese, Spanish instruction following | Performance varies widely across languages | +| **IFEval-FC** (2025) | Varied | Instruction following in function call schemas | JSON schema constraint adherence | +| **AgentIF** (Tsinghua, 2025) | Varied | Agent-specific instruction following | Evaluates IF within agentic loops | + +### Relevance to Agentic Coding + +Instruction following is critical for agentic coding because: + +1. **System prompts**: Agents receive detailed behavioral instructions (e.g., CLAUDE.md + conventions in this repo) +2. **Edit format compliance**: Models must produce output in exact formats (search/replace + blocks, unified diffs, JSON tool calls) +3. **Multi-constraint tasks**: "Fix the bug AND add a test AND update the docstring AND + follow the project's naming conventions" + +### State of the Art + +IFEval is included in the Open LLM Leaderboard V2, making it one of the most widely +reported benchmarks. Frontier models score >90% on IFEval. Open models vary widely; +instruction-tuned variants of Qwen3.5, DeepSeek V3, and GLM-5 are competitive at >85%. + +--- + +## 9. Long Context Utilization + +**Definition**: Effectively using large context windows (32K-1M tokens) with code -- +not just accepting long inputs, but actually using information from all parts. 
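NIAH/RULER-style probes are easy to reproduce locally: plant facts at known depths inside filler text, then query for them and plot recall against depth and total length. A minimal haystack builder (a sketch of the technique, not RULER's harness):

```python
import random

def build_haystack(needles, filler, total_lines, seed=0):
    """Scatter needle sentences at random depths inside filler lines."""
    rng = random.Random(seed)
    lines = [filler] * total_lines
    depths = sorted(rng.sample(range(total_lines), len(needles)))
    for depth, needle in zip(depths, needles):
        lines[depth] = needle
    return "\n".join(lines), depths

# Score retrieval by asking the model for each needle and measuring
# exact-match recall as a function of depth and context length.
```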
+ +### Key Benchmarks + +| Benchmark | What It Tests | Notes | +|---|---|---| +| **RULER** (NVIDIA, COLM 2024) | Multi-needle retrieval, distractor handling | Most models degrade significantly beyond 32K | +| **Needle in a Haystack** (NIAH) | Single-fact retrieval in long context | Near-saturated for frontier models | +| **LoCoBench** (2025) | Long-context code completion and comprehension | Claude 3.5 Sonnet: 29% at short context, 3% at long | +| **LongCodeBench** (2025) | Long-context code tasks | Single-language, limited diversity | +| **LongBench** (ACL 2025) | General long-context evaluation | Reveals limitations of existing benchmarks | + +### "Context Rot" Phenomenon + +Research from Chroma (2025) documented "context rot": as input tokens increase, +LLM performance degrades even when the relevant information is present. This is +particularly acute for code tasks where: + +- File A defines a class, file B imports it, file C tests it +- All three must be in context simultaneously +- Models must cross-reference across files, not just retrieve individual facts + +### State of the Art + +| Model | Native Context | Effective Context* | Notes | +|---|---|---|---| +| Nemotron 3 Super | 1M tokens | 91.75% accuracy at 1M | Best retention score | +| Qwen3-Coder-Next | 256K (1M w/ Yarn) | Good at 256K | Trained for repo-scale | +| GLM-5 | 205K | Good | DeepSeek Sparse Attention | +| DeepSeek V3.2 | 128K | Moderate | | + +*"Effective context" means the model actually uses information at that distance, +not just accepts it without error. + +For 64GB systems, context length is bounded by available memory. At Q4 quantization, +a 30B-A3B model can handle ~64K-128K tokens before running out of KV cache space +(depending on GQA configuration and batch size). + +--- + +## 10. Multi-Language Support + +**Definition**: Handling different programming languages correctly -- not just Python, +but also compiled languages, systems languages, and less common languages. 
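When reporting multilingual results, macro-averaging over languages matters: a micro-average over tasks lets a Python-heavy suite mask weak Rust or C++ performance. A small aggregation sketch:

```python
from collections import defaultdict

def language_report(results):
    """results: iterable of (language, passed) pairs.

    Returns (per-language pass rates, macro average over languages).
    """
    tallies = defaultdict(lambda: [0, 0])
    for lang, passed in results:
        tallies[lang][0] += int(passed)
        tallies[lang][1] += 1
    rates = {lang: p / n for lang, (p, n) in tallies.items()}
    macro = sum(rates.values()) / len(rates)
    return rates, macro
```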
+ +### Key Benchmarks + +| Benchmark | Languages | What It Tests | Notes | +|---|---|---|---| +| **Aider Polyglot** | C++, Go, Java, JS, Python, Rust | Edit + debug in 6 languages | 225 Exercism exercises | +| **Multi-SWE-bench** (NeurIPS 2025) | Python, Java, TS, JS, Go, Rust, C, C++ | Issue resolution in 8 languages | 1,632 validated issues | +| **Multi-SWE-bench mini** | 8 languages | Lightweight version | 400 instances, reduced compute | +| **SWE-PolyBench** (Amazon) | Java, JS, TS, Python | Bug fixes, features, refactoring | 2,110 curated issues | +| **SWE-smith** | 9 languages | SWE-bench style across 42 repos | 300 curated tasks | +| **HumanEval-X** | Python, C++, Java, JS, Go | Cross-lingual code generation | Translation of HumanEval | +| **BigCodeBench** | Python (139 libs) | Multi-library Python | Tests library-specific knowledge | + +### Multi-SWE-bench vs SWE-PolyBench + +Two competing multilingual benchmarks emerged in 2025: + +- **Multi-SWE-bench** (ByteDance): 1,632 issues, 8 languages, NeurIPS 2025 + Datasets track. Also provides `mini` (400 instances) and `flash` (300 instances) + variants for reduced compute. +- **SWE-PolyBench** (Amazon): 2,110 issues, 4 languages, with a verified subset of + 384 instances. Covers bug fixes, features, and refactoring. + +### Language-Specific Performance Gaps + +Open models show significant performance variation across languages: +- **Python**: Best-supported universally +- **JavaScript/TypeScript**: Second-best, strong ecosystem coverage +- **Rust, Go, C++**: Substantially weaker, especially for complex patterns +- **Low-resource languages** (Julia, Lua, Perl): StarCoder2-15B historically strong here + +### State of the Art + +Qwen3-Coder-Next achieves 62.8% on SWE-Bench Multilingual. For 64GB-feasible models, +the Qwen3-Coder-30B-A3B benefits from Qwen's broad multilingual training data. + +--- + +## 11. 
Test Generation + +**Definition**: Writing tests, understanding test frameworks, achieving coverage, +generating meaningful assertions -- not just syntactically valid tests. + +### Key Benchmarks + +| Benchmark | Tasks | What It Tests | Notes | +|---|---|---|---| +| **TestEval** (2024) | 210 | LLM test case generation for LeetCode programs | Basic test generation ability | +| **ULT** (2025) | 3,909 | Unit test generation for complex functions | High cyclomatic complexity, leakage-free | +| **WebApp1K** (2025) | 1,000 | Test-driven development tasks | Tests serve as both prompt and verification | +| **CoverUp** (2024) | Varied | Coverage-guided test generation | Iterative LLM-guided coverage improvement | + +### Current Performance + +LLM-generated tests achieve on average: +- **41.32%** accuracy (tests pass and are meaningful) +- **45.10%** statement coverage +- **30.22%** branch coverage +- **40.21%** mutation score + +These numbers are from a multi-model benchmark study (2025). CoverUp's iterative +approach achieves 80% line+branch coverage (vs 47% for CodaMosa), suggesting that +agentic test generation loops significantly outperform single-shot generation. + +### Key Insight + +Test generation is an area where agentic approaches (generate, run, check coverage, +iterate) dramatically outperform single-shot generation. This makes it particularly +suited to the iterative agent loop and a strong candidate for local model evaluation. + +### State of the Art + +Code agents were shown to be "state of the art software testers" when given an +iterative loop with coverage feedback (2024 paper). No single model dominates this +dimension; the scaffolding (coverage feedback, iteration) matters more than the +base model for test generation. + +--- + +## 12. 
Benchmark Suite Summary + +### Tier 1: Must-Run for Agentic Coding Evaluation + +These are the most informative benchmarks for evaluating a model's fitness as a +coding agent: + +| Benchmark | Primary Dimensions | Run Cost | Notes | +|---|---|---|---| +| **SWE-bench Verified** | Planning, editing, repo understanding | High (500 Docker envs) | Gold standard | +| **Aider Polyglot** | Editing, multi-lang, debugging | Medium (225 problems) | Best edit benchmark | +| **BigCodeBench** | Generation, multi-tool | Medium (1,140 tasks) | Best generation benchmark | +| **BFCL V4** | Tool use, function calling | Low-Medium | De facto tool-use standard | +| **Terminal-Bench 2.0** | Debugging, planning, error recovery | High (89 real envs) | Best debugging benchmark | + +### Tier 2: Valuable Supplementary Benchmarks + +| Benchmark | Primary Dimensions | Notes | +|---|---|---| +| **LiveCodeBench** | Generation (contamination-free) | Rolling benchmark | +| **IFEval** | Instruction following | Quick to run, widely reported | +| **Multi-SWE-bench mini** | Multi-language, planning | 400 instances, 8 languages | +| **EvalPlus (HumanEval+/MBPP+)** | Generation (rigorous) | Good baseline | +| **Recovery-Bench** | Error recovery | Novel and underexplored | +| **FeatureBench** | Complex planning | Very hard; differentiates top models | + +### Tier 3: Niche or Near-Saturated + +| Benchmark | Status | Notes | +|---|---|---| +| **HumanEval** | Near-saturated | >95% for frontier models; use EvalPlus instead | +| **MBPP** | Near-saturated | Use MBPP+ instead | +| **GAIA** | Near-saturation (~90%) | Good for general agents, less code-specific | +| **Needle-in-a-Haystack** | Saturated | Use RULER for long-context | + +### Commonly Cited on Model Cards + +When coding-focused models publish on Hugging Face, the most frequently cited +benchmarks (in rough order of frequency) are: + +1. SWE-bench Verified (agentic coding standard) +2. HumanEval / HumanEval+ (code generation baseline) +3. 
MBPP / MBPP+ (code generation) +4. BigCodeBench (multi-tool generation) +5. Aider Polyglot (code editing, multi-language) +6. LiveCodeBench (contamination-free generation) +7. BFCL (function calling) +8. IFEval (instruction following) +9. Multi-SWE-bench (multilingual agentic) + +--- + +## 13. Open-Weight Model Landscape for 64GB Systems + +### Models Feasible on 64GB Unified Memory (Strix Halo) + +Sorted by practical fitness for agentic coding tasks. "Active" = parameters active +per forward pass for MoE models. + +| Model | Total / Active | GGUF Q4 Size | SWE-bench | Key Strength | +|---|---|---|---|---| +| **Qwen3-Coder-Next** | 80B / 3B | ~46GB (Q4) | 70.6% Verified | Best efficiency ratio; agentic RL training | +| **Qwen3-Coder-30B-A3B** | 30.5B / 3.3B | ~18GB (Q4) | ~55%* (est.) | Fits easily; native 256K context; function call format | +| **Qwen3.5-35B-A3B** | 35B / 3B | ~19GB (Q4) | N/A | General + coding; fast at 112 tok/s on RTX 3090 | +| **Nemotron 3 Super** | 120B / 12B | ~64GB (Q4) | 60.5% | 1M context; PinchBench 85.6%; hybrid Mamba-Transformer | +| **Qwen3.5-27B** | 27B / 27B (dense) | ~17GB (Q4) | ~55%* | Dense; 72.4% SWE-bench reported for Qwen3.5-27B | +| **DeepSeek V3.2** | 671B / 37B | Too large at Q4 | 73.0% | Requires >200GB; not feasible for 64GB | +| **GLM-5** | 744B / 40B | Too large at Q4 | 77.8% | Best open SWE-bench; not feasible for 64GB | + +*Estimated; exact scores for quantized GGUF variants not independently benchmarked. 
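The context limits above are driven by KV cache size, which can be estimated directly from architecture parameters. A sketch (the example numbers in the comment are illustrative, not any specific model's published configuration):

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache = 2 tensors (K and V) x layers x kv_heads x head_dim x tokens."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value / 2**30

# E.g. 48 layers, 4 GQA KV heads, head_dim 128, fp16 cache:
# 131,072 tokens -> 12 GiB, which is why GQA and KV-cache quantization
# matter so much on a 64GB unified-memory system.
```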
+ +### Recommended Configuration for 64GB Strix Halo + +**Primary coding agent**: Qwen3-Coder-30B-A3B-Instruct (Q4_K_M, ~18GB) +- Fits with ample room for KV cache and context +- Specially designed function call format +- Native 256K context, extendable to 1M +- Strong agentic coding training +- Fast inference due to 3.3B active parameters + +**Stretch option**: Qwen3-Coder-Next (Q4, ~46GB) +- Tighter fit but significantly stronger (70.6% SWE-bench Verified) +- 3B active parameters = good generation speed +- Leaves ~18GB for KV cache and system + +**Dense alternative**: Qwen3.5-27B (Q4_K_M, ~17GB) +- When you need strong general + coding ability +- Dense model = more predictable behavior +- Good baseline for comparison + +### Older Models: Still Relevant? + +- **CodeLlama-34B** (Meta, 2023): Superseded by Qwen and DeepSeek families. Only + relevant for historical comparison or if specific fine-tunes are needed. +- **StarCoder2-15B** (ServiceNow/HF/NVIDIA, 2024): Outperformed CodeLlama-34B at half + the size. Still competitive for low-resource languages (Julia, Lua, Perl) but + otherwise superseded by Qwen3-Coder. +- **DeepSeek-Coder-V2-Lite-16B** (2024): Was competitive but now clearly behind + Qwen3-Coder-30B-A3B and Qwen3-Coder-Next. + +--- + +## 14. Frontier vs. 
Open Model Gap + +### Gap Analysis by Dimension (March 2026) + +| Dimension | Frontier Best | Open Best (64GB) | Gap | Trend | +|---|---|---|---|---| +| Code Generation | ~98% HumanEval | ~85% HumanEval | Small | Closing rapidly | +| Code Editing | 88% Aider Polyglot | ~60% Aider Polyglot | Large | Closing (MoE helps) | +| Tool Use | >90% BFCL | ~80% BFCL | Moderate | Closing with dedicated training | +| Multi-Step Planning | 80.9% SWE-bench | 70.6% SWE-bench (Coder-Next) | Moderate | Narrowing with agentic RL | +| Debugging/Recovery | ~65% Terminal-Bench | ~45% Terminal-Bench* | Large | Widest persistent gap | +| Repo Understanding | Excellent | Good (long-context models) | Moderate | Closing with 256K+ contexts | +| Instruction Following | >90% IFEval | >85% IFEval | Small | Nearly closed | +| Long Context | 1M+ effective | 256K effective | Moderate | Hardware-limited for local | +| Multi-Language | 80%+ Multi-SWE | 62.8% Multi-SWE | Moderate | Improving with diverse training | +| Test Generation | ~50% coverage | ~40% coverage | Small | Scaffolding matters more | + +*Estimated; Terminal-Bench scores not widely reported for 64GB-feasible open models. + +### Key Observations + +1. **Code generation is nearly solved** for simple tasks. The gap has shifted to + complex, multi-step, multi-file tasks. + +2. **Debugging/error recovery is the widest gap** and the hardest to close. This is + where frontier models' larger parameter counts and RLHF refinement matter most. + +3. **MoE architectures are the bridge** for 64GB systems. Models like Qwen3-Coder-Next + (80B total, 3B active) achieve SWE-bench scores comparable to models with 10-20x + more active parameters. + +4. **Agentic RL training** (as used in Qwen3-Coder, GLM-5) is the primary driver of + open model improvement on planning and debugging dimensions. + +5. **Scaffolding equalizes** many gaps. 
A well-designed agent scaffold (SWE-Agent, + OpenHands, Aider) can make a 30B model perform comparably to a raw 400B model. + +--- + +## 15. Recommended Evaluation Stack + +For evaluating models locally on the Strix Halo system, the following stack covers +all 10 dimensions using tools already referenced in this project's `docs/references.md`: + +### Inspect AI (Primary Framework) + +Inspect AI supports multiple benchmarks in a unified framework: +- HumanEval (code generation) +- BigCodeBench (multi-tool generation) +- BFCL (function calling / tool use) +- GAIA (multi-step planning) +- IFEval (instruction following) + +Run against an OpenAI-compatible endpoint (ollama or llama.cpp server). + +### EvalPlus (Code Generation) + +- HumanEval+ and MBPP+ with native ollama support +- More rigorous than base HumanEval/MBPP +- Already configured in this project's `scripts/agentic/` framework + +### BigCodeBench (Multi-Tool Generation) + +- 1,140 tasks across 139 libraries +- Already listed in `docs/references.md` +- Tests multi-library, cross-domain code generation + +### Aider (Code Editing + Multi-Language) + +- Built-in polyglot benchmark: 225 exercises across 6 languages +- Tests edit format compliance, multi-language support, debugging loop +- Can be run against any OpenAI-compatible endpoint + +### BFCL (Tool Use) + +- pip install `bfcl-eval` +- Tests function calling accuracy +- Already listed in `docs/references.md` + +### Practical Execution Order + +1. **Quick smoke test**: EvalPlus (HumanEval+) -- ~30 min +2. **Generation depth**: BigCodeBench-Hard (148 tasks) -- ~2-4 hours +3. **Editing ability**: Aider polyglot benchmark -- ~4-6 hours +4. **Tool use**: BFCL eval -- ~1-2 hours +5. **Instruction following**: IFEval via Inspect AI -- ~1 hour +6. **Full agentic**: SWE-bench Verified (if Docker resources available) -- ~24+ hours + +--- + +## 16. Sources + +### Papers + +- Chen et al. (2021). "Evaluating Large Language Models Trained on Code." arXiv:2107.03374. 
[HumanEval]
- Liu et al. (2023). "Is Your Code Generated by ChatGPT Really Correct?" NeurIPS 2023. [EvalPlus/HumanEval+]
- Jimenez et al. (2024). "SWE-bench: Can Language Models Resolve Real-world GitHub Issues?" ICLR 2024.
- Zhuo et al. (2024). "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls." ICLR 2025.
- Patil et al. (2025). "The Berkeley Function Calling Leaderboard (BFCL)." ICML 2025.
- Mialon et al. (2023). "GAIA: A Benchmark for General AI Assistants." ICLR 2024. [GAIA]
- Hsieh et al. (2024). "RULER: What's the Real Context Size of Your Long-Context Language Models?" COLM 2024.
- Zhou et al. (2023). "Instruction-Following Evaluation for Large Language Models." arXiv:2311.07911. [IFEval]
- Terminal-Bench team (2026). "Terminal-Bench: Benchmarking Agents on Hard CLI Tasks." Stanford/Laude Institute.
- FeatureBench (Feb 2026). "Benchmarking Agentic Coding for Complex Feature Development." arXiv:2602.10975.
- HumanEval Pro / MBPP Pro (ACL 2025). "Evaluating LLMs on Self-invoking Code Generation Task."
- Multi-SWE-bench (NeurIPS 2025). "A Multilingual Benchmark for Issue Resolving."
- SWE-PolyBench (Amazon, 2025). "A Multi-Language Benchmark for Repository-Level Evaluation."
- Recovery-Bench (Letta, 2025). "Evaluating LLMs' Ability to Recover from Mistakes."
- Diff-XYZ (Oct 2025). "A Benchmark for Evaluating Diff Understanding."
+ +### Leaderboards and Live Data + +- SWE-bench Leaderboard: https://www.swebench.com/ +- SWE-bench Verified Leaderboard: https://llm-stats.com/benchmarks/swe-bench-verified +- SWE-rebench Leaderboard: https://swe-rebench.com/ +- Aider LLM Leaderboards: https://aider.chat/docs/leaderboards/ +- BFCL V4 Leaderboard: https://gorilla.cs.berkeley.edu/leaderboard.html +- EvalPlus Leaderboard: https://evalplus.github.io/leaderboard.html +- BigCodeBench Leaderboard: https://huggingface.co/blog/leaderboard-bigcodebench +- Terminal-Bench Leaderboard: https://www.tbench.ai/ +- Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard +- Scale Labs SWE-bench Pro: https://labs.scale.com/leaderboard/swe_bench_pro_public +- Artificial Analysis Terminal-Bench: https://artificialanalysis.ai/evaluations/terminalbench-hard + +### Model Documentation + +- Qwen3-Coder: https://github.com/QwenLM/Qwen3-Coder +- Qwen3-Coder-Next: https://qwen.ai/blog?id=qwen3-coder-next +- Qwen3-Coder-30B-A3B GGUF: https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF +- GLM-5: https://huggingface.co/zai-org/GLM-5 +- Nemotron 3 Super: https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/ +- DeepSeek V3 series: https://www.bentoml.com/blog/the-complete-guide-to-deepseek-models-from-v3-to-r1-and-beyond + +### Tools and Frameworks + +- Inspect AI: https://github.com/UKGovernmentBEIS/inspect_ai +- Inspect Evals catalog: https://inspect.aisi.org.uk/evals/ +- EvalPlus: https://github.com/evalplus/evalplus +- BigCodeBench: https://github.com/bigcode-project/bigcodebench +- BFCL: https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard +- Aider: https://aider.chat/ +- Aider Polyglot benchmark: https://github.com/Aider-AI/polyglot-benchmark +- LiveCodeBench: https://livecodebench.github.io/ +- CoverUp (test generation): https://arxiv.org/html/2403.16218v3