feat: add Qwen3.5 model catalog and agentic evaluation framework
Models:
- configs/models.conf: catalog with Qwen3.5-35B-A3B (MoE, top pick), Qwen3.5-27B (dense), Qwen3-Coder-30B-A3B (agentic/coding)
- Updated benchmark setup to show catalog with download status
- docs/model-recommendations.md: memory planning, quantization guide

Agentic evaluation:
- scripts/agentic/setup.sh: installs inspect-ai, evalplus, bigcodebench in a Python venv
- scripts/agentic/run-eval.sh: runs evaluations against a local LLM server (ollama or llama.cpp). Suites: quick (HumanEval+IFEval), code (EvalPlus+BigCodeBench), tooluse (BFCL), full (all)
- bin/agentic: dispatcher with help
- docs/agentic-benchmarks.md: methodology, framework comparison, model recommendations for agentic use

Updated: Makefile (6 new targets), README, CLAUDE.md, docs/references.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
docs/agentic-benchmarks.md (new file, 444 lines)
# Local Agentic Flow Benchmarks for Strix Halo

Research summary: benchmarking agentic LLM capabilities on consumer hardware (AMD Strix Halo, Ryzen AI MAX+ 395, 64 GB unified memory) using llama.cpp, Ollama, and LM Studio.

---

## Scope

This document covers locally runnable agentic benchmarks, evaluation frameworks, practical measurement approaches, and model recommendations (with emphasis on the Qwen family) for the Strix Halo platform. Cloud-only benchmarks that cannot accept a local OpenAI-compatible endpoint are out of scope.

---

## 1. Agentic Benchmarks Runnable Locally
### 1.1 Berkeley Function Calling Leaderboard (BFCL)

**What it measures**: Function/tool-calling accuracy across serial calls, parallel calls, multiple languages, and multi-turn agentic interactions.

**Why it matters**: BFCL is the de facto standard for evaluating function-calling quality. Version 4 (2025) added holistic agentic evaluation with stateful multi-step reasoning.

**Local setup**:

```bash
# Option A: pip package
pip install bfcl-eval

# Option B: from source (more control)
git clone https://github.com/ShishirPatil/gorilla.git
cd gorilla/berkeley-function-call-leaderboard
pip install -e .
```

Evaluate a local model by pointing BFCL at any OpenAI-compatible endpoint (Ollama, llama.cpp server, vLLM). The framework uses AST-based evaluation to verify function-call correctness without executing the calls.

- **Repository**: https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard
- **Leaderboard**: https://gorilla.cs.berkeley.edu/leaderboard.html
- **Paper**: Patil et al., "The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models," ICML 2025.
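The AST-based idea can be approximated in a few lines with Python's `ast` module: parse the emitted call, then compare the function name and keyword arguments against an expected spec. A minimal sketch (the function name and expected-call format here are illustrative, not BFCL's actual schema):

```python
import ast

def call_matches(model_output: str, expected: dict) -> bool:
    """Parse a call like 'get_weather(city="Paris", unit="C")' and check
    the function name and keyword arguments against an expected spec."""
    try:
        node = ast.parse(model_output.strip(), mode="eval").body
    except SyntaxError:
        return False
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        return False
    if node.func.id != expected["name"]:
        return False
    try:
        got = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    except ValueError:
        return False  # non-literal argument values fail the check
    return got == expected["args"]

# A correct call passes; a wrong argument value fails
spec = {"name": "get_weather", "args": {"city": "Paris", "unit": "C"}}
print(call_matches('get_weather(city="Paris", unit="C")', spec))   # True
print(call_matches('get_weather(city="London", unit="C")', spec))  # False
```

Because the check is purely syntactic, it needs no sandbox and no tool implementations, which is what makes BFCL cheap to run locally.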
### 1.2 SWE-bench / SWE-bench Verified

**What it measures**: Ability to resolve real GitHub issues by generating patches against actual repositories.

**Why it matters**: The gold standard for evaluating coding agents. Tasks require understanding large codebases, making multi-file edits, and passing test-driven validation.

**Local setup**: Evaluation runs inside Docker containers with network isolation. Two primary agent scaffolds support local models:

- **SWE-agent** (https://swe-agent.com): Install via pip, then configure `config.toml` to point at a local OpenAI-compatible endpoint. There is also a dedicated open-weight model, SWE-bench/SWE-agent-LM-32B.
- **OpenHands** (https://github.com/OpenHands/OpenHands): `pip install openhands`, then `openhands serve`. Configure `config.toml` with your local model's `base_url`.

**Hardware note**: SWE-bench evaluation requires an x86_64 machine with at least 120 GB of free storage, 16 GB of RAM, and 8 CPU cores for the Docker harness (separate from model inference). Models smaller than 32B parameters show significantly degraded instruction following on these tasks.

- **Repository**: https://github.com/SWE-bench/SWE-bench
- **Paper**: Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024.
### 1.3 AgentBench

**What it measures**: LLM-as-agent performance across 8 environments: OS interaction, database queries, knowledge graphs, card games, lateral thinking puzzles, household tasks, web shopping, and web browsing.

**Why it matters**: The broadest multi-environment agent evaluation. It tests planning, reasoning, tool use, and decision-making in multi-turn, open-ended settings.

**Local setup**: The evaluation package is released at https://github.com/THUDM/AgentBench and supports custom model endpoints. Open-source models up to 70B show a significant performance gap versus frontier commercial models, making AgentBench a useful diagnostic for understanding where local models fall short.

- **Paper**: Liu et al., "AgentBench: Evaluating LLMs as Agents," ICLR 2024.

### 1.4 GAIA (General AI Assistants)

**What it measures**: Multi-step tasks requiring web search, document reading, calculation, and synthesis. Its 466 tasks are conceptually simple for humans (92% accuracy) but extremely challenging for AI.

**Local setup**: Available on Hugging Face. Requires a model with tool-use capabilities (web search, file reading, calculator). Can be wired to a local model via smolagents or LangChain with local tool implementations.

- **Paper**: Mialon et al., "GAIA: A Benchmark for General AI Assistants," ICLR 2024.
### 1.5 DeepPlanning (Qwen)

**What it measures**: Long-horizon agentic planning with verifiable constraints. Two domains: multi-day travel planning (9 APIs for flights, trains, hotels, restaurants, and attractions) and multi-product shopping.

**Why it matters**: Evaluates three critical agentic abilities:

1. Proactive information acquisition (actively calling APIs to discover hidden state)
2. Local constrained reasoning (step-level logic such as brand matching)
3. Global constrained optimization (budget caps, multi-day time feasibility)

**Local setup**: Open-sourced January 2026. Dataset at https://huggingface.co/datasets/Qwen/DeepPlanning. Evaluation code is integrated into the Qwen-Agent framework.

- **Paper**: "DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints," arXiv:2601.18137, January 2026.
### 1.6 Code Generation: EvalPlus (HumanEval+ / MBPP+)

**What it measures**: Functional correctness of generated code. EvalPlus extends HumanEval with 80x and MBPP with 35x more test cases.

**Local setup** (direct Ollama support):

```bash
pip install evalplus

# Run against a local Ollama model
evalplus.evaluate \
  --model "qwen3-coder:30b" \
  --dataset humaneval \
  --backend ollama \
  --base-url http://localhost:11434/v1 \
  --greedy
```

- **Repository**: https://github.com/evalplus/evalplus
- **Leaderboard**: https://evalplus.github.io/leaderboard.html
### 1.7 BigCodeBench

**What it measures**: 1,140 function-level tasks requiring composition of multiple function calls across 139 libraries, with an average of 5.6 test cases per task and 99% branch coverage.

**Local setup**: Based on EvalPlus infrastructure; supports the same backends, including Ollama and vLLM.

- **Repository**: https://github.com/bigcode-project/bigcodebench
- **Paper**: "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions," ICLR 2025.

### 1.8 IFEval (Instruction Following Evaluation)

**What it measures**: Compliance with programmatically verifiable instructions ("write more than 400 words," "mention AI at least 3 times"). No subjective judgment is needed.

**Local setup**: Available through lm-evaluation-harness and Inspect AI. Recent variants include IFEval-FC (function-calling format compliance) and M-IFEval (multilingual).

- **Paper**: Zhou et al., "Instruction-Following Evaluation for Large Language Models," arXiv:2311.07911, 2023.
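Instructions of this kind are checkable with a few lines of code, which is the whole point of the benchmark. A sketch of two hypothetical checkers in the spirit of IFEval (not the benchmark's actual implementation):

```python
def check_min_words(text: str, n: int) -> bool:
    """Verify 'write more than N words' programmatically."""
    return len(text.split()) > n

def check_min_mentions(text: str, term: str, n: int) -> bool:
    """Verify 'mention TERM at least N times' (case-insensitive)."""
    return text.lower().count(term.lower()) >= n

response = "AI systems help. AI agents plan. AI tools act."
print(check_min_words(response, 5))           # True (9 words)
print(check_min_mentions(response, "AI", 3))  # True (3 mentions)
```

Deterministic checkers like these make IFEval cheap to run locally and free of judge-model bias.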
---

## 2. Local Agentic Evaluation Frameworks

### 2.1 Inspect AI (UK AISI)

The most comprehensive single framework for local agentic evaluation.

**Key features**:

- 100+ pre-built evaluations, including BFCL, GAIA, HumanEval, MBPP, IFEval, and GSM8K
- Native support for tool calling: custom tools, MCP tools, built-in bash/python/web tools
- Web-based Inspect View for monitoring and visualizing evaluations
- VS Code extension for development
- Works with any OpenAI-compatible endpoint (Ollama, llama.cpp, vLLM)

```bash
pip install inspect-ai

# Run a BFCL evaluation against a local model
inspect eval inspect_evals/bfcl --model openai/local-model \
  --model-base-url http://localhost:11434/v1
```

- **Repository**: https://github.com/UKGovernmentBEIS/inspect_ai
- **Evals collection**: https://github.com/UKGovernmentBEIS/inspect_evals
- **Documentation**: https://inspect.aisi.org.uk/
### 2.2 EleutherAI lm-evaluation-harness

The standard academic framework: 60+ benchmarks, including MMLU, HellaSwag, ARC, GSM8K, and HumanEval. It serves as the backend for Hugging Face's Open LLM Leaderboard.

**Local model support**: Works with Hugging Face models directly, OpenAI-compatible APIs, and custom backends. The `local-completions` and `local-chat-completions` model types support any local server.

```bash
pip install lm-eval

lm_eval --model local-chat-completions \
  --model_args model=qwen3-coder:30b,base_url=http://localhost:11434/v1 \
  --tasks humaneval,mbpp,ifeval \
  --batch_size auto
```

- **Repository**: https://github.com/EleutherAI/lm-evaluation-harness
### 2.3 smolagents (Hugging Face)

Lightweight agentic framework with two core agent types:

- **CodeAgent**: Generates and executes sandboxed Python code
- **ToolCallingAgent**: Calls external APIs and custom functions

**Ollama integration** goes through the LiteLLM backend:

```python
from smolagents import CodeAgent, LiteLLMModel

# The "ollama_chat/" prefix routes requests to a local Ollama server
model = LiteLLMModel(
    model_id="ollama_chat/qwen3-coder:30b",
    api_base="http://localhost:11434",
)
agent = CodeAgent(tools=[], model=model)
agent.run("What is the 10th Fibonacci number?")
```

Supports custom tool definitions and evaluation harnesses. The model-agnostic design means any Ollama, llama.cpp, or LM Studio model works.

- **Repository**: https://github.com/huggingface/smolagents
### 2.4 Qwen-Agent

Purpose-built for Qwen models, with optimized tool-calling templates and parsers.

**Key features**:

- Native MCP (Model Context Protocol) support
- Parallel, multi-step, and multi-turn function calls with automatic parsing
- Built-in code interpreter, RAG, and Chrome extension
- Integrated DeepPlanning benchmark evaluation

```bash
pip install "qwen-agent[mcp]"
```

Configure tools via MCP configuration files. The framework automatically handles tool-calling format differences between Qwen model versions.

- **Repository**: https://github.com/QwenLM/Qwen-Agent
- **Documentation**: https://qwenlm.github.io/Qwen-Agent/
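MCP tool configuration uses the standard `mcpServers` shape shared across MCP clients. A hedged sketch as a Python dict (the server names, commands, and the exact argument Qwen-Agent expects are illustrative assumptions, not verified API details):

```python
# Hypothetical MCP tool configuration in the common "mcpServers" shape.
# Server names and launch commands below are illustrative examples.
mcp_tools = [{
    "mcpServers": {
        "time": {
            "command": "uvx",
            "args": ["mcp-server-time"],
        },
        "filesystem": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"],
        },
    }
}]

# A list like this is what gets handed to the agent's tool registry.
print(sorted(mcp_tools[0]["mcpServers"]))  # ['filesystem', 'time']
```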
### 2.5 LangGraph / CrewAI

Both support local OpenAI-compatible endpoints. Comparative benchmarks (2026) show:

- **LangGraph**: Lowest latency and token usage, thanks to a graph-based architecture that reduces redundant context passing. Preferred for production with deterministic control flow. Reached v1.0 GA in October 2025.
- **CrewAI**: ~40% faster from idea to working prototype. Higher token spend but simpler multi-agent orchestration. v1.10.1 ships native MCP and A2A support. 44,600+ GitHub stars.

Neither provides a built-in standardized benchmark harness, but both can be instrumented to measure task completion rates, tool-call accuracy, and latency.
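Such instrumentation can be as simple as a timing-and-counting wrapper around whatever function executes tool calls; a stdlib-only sketch (the metric names and wrapper shape are illustrative, not part of either framework):

```python
import time
from dataclasses import dataclass, field

@dataclass
class AgentMetrics:
    tool_calls: int = 0
    tool_errors: int = 0
    latencies_s: list = field(default_factory=list)

    def timed_call(self, fn, *args, **kwargs):
        """Wrap any tool function, recording latency and error counts."""
        start = time.perf_counter()
        self.tool_calls += 1
        try:
            return fn(*args, **kwargs)
        except Exception:
            self.tool_errors += 1
            raise
        finally:
            self.latencies_s.append(time.perf_counter() - start)

metrics = AgentMetrics()
result = metrics.timed_call(lambda x: x * 2, 21)
print(result, metrics.tool_calls, metrics.tool_errors)  # 42 1 0
```

The same wrapper applied across a task suite yields completion rate, error rate, and a latency distribution without touching framework internals.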
### 2.6 Throughput & Performance Benchmarking Tools

| Tool | Focus | Backends |
|------|-------|----------|
| [ollama-benchmark](https://github.com/aidatatools/ollama-benchmark) | Tokens/s throughput via Ollama | Ollama |
| [llama-benchy](https://github.com/eugr/llama-benchy) | Multi-backend benchmarking (llama-bench style) | vLLM, SGLang, llama.cpp, etc. |
| [benchllama](https://github.com/srikanth235/benchllama) | Local LLM benchmarking | Ollama |
| [local-llm-bench](https://github.com/famstack-dev/local-llm-bench) | Engine comparison (MLX vs llama.cpp) | MLX, llama.cpp |
| llama-bench (built-in) | Raw inference performance | llama.cpp native |

---

## 3. Practical Measurement Approaches

### 3.1 Token Throughput in Multi-Turn Conversations

Key metrics for agentic workloads on Strix Halo:

| Metric | Definition | Target |
|--------|-----------|--------|
| Time to First Token (TTFT) | Delay before the first token appears | <500 ms for interactive use |
| Generation speed (tok/s) | Steady-state token output rate | >30 tok/s for usable agents |
| Prompt processing (tok/s) | Speed of ingesting context | Critical for large codebases |
| KV cache utilization | Memory consumed by conversation history | Scales with context length |
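Given per-token arrival timestamps from a streaming endpoint, TTFT and steady-state generation speed fall out directly. A sketch with a synthetic trace (a real run would record timestamps from the streaming API):

```python
def ttft_and_gen_speed(request_t: float, token_ts: list) -> tuple:
    """TTFT = first-token time minus request time; generation speed is
    measured from the first token onward so prompt processing does not
    skew the rate."""
    ttft = token_ts[0] - request_t
    gen_window = token_ts[-1] - token_ts[0]
    gen_tok_s = (len(token_ts) - 1) / gen_window if gen_window > 0 else 0.0
    return ttft, gen_tok_s

# Synthetic trace: first token after 0.4 s, then 50 tok/s for 100 tokens
ts = [0.4 + i * 0.02 for i in range(101)]
ttft, speed = ttft_and_gen_speed(0.0, ts)
print(round(ttft, 2), round(speed, 1))  # 0.4 50.0
```

Separating TTFT from generation speed matters for agents: long tool outputs fed back into the context inflate TTFT (prompt processing) even when generation speed is unchanged.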
**Strix Halo 64 GB measured performance** (from community benchmarks):

| Model | Quant | Gen tok/s | Prompt tok/s | VRAM used |
|-------|-------|-----------|--------------|-----------|
| Qwen3-Coder-30B-A3B | Q4_K_M | ~52-71 | 5-47 (varies by context) | ~18 GB |
| Qwen3-30B-A3B (general) | Q4_K_M | ~52 | n/a | ~18 GB |
| 70B dense models | Q4_K_M | ~5 | n/a | ~40 GB |

MoE models like Qwen3-30B-A3B are where 64 GB of unified memory shines: only 3B parameters are active per token, so generation is fast despite the 30B total parameter count.
### 3.2 Tool-Calling Accuracy Measurement

A practical local test sequence:

1. **BFCL subset**: Run the BFCL simple function-calling tests first (serial single-function calls). If accuracy is below 80%, the model is not suitable for agentic use.
2. **Parallel function calling**: Test with BFCL parallel-calling scenarios. Many smaller models fail here.
3. **Multi-turn stateful**: BFCL v3/v4 multi-turn tests or DeepPlanning scenarios.
4. **Format compliance**: IFEval-FC tests whether the model can consistently produce correctly formatted JSON function calls.
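Step 4 needs no grading model at all: parse the output as JSON and verify the expected schema. A minimal sketch, assuming a common `{"name": ..., "arguments": {...}}` call shape (tool-call schemas vary by model family):

```python
import json

def valid_call_format(output: str) -> bool:
    """Check that output parses as JSON with a string 'name' and an
    object 'arguments' -- a common tool-call shape."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return (isinstance(obj, dict)
            and isinstance(obj.get("name"), str)
            and isinstance(obj.get("arguments"), dict))

print(valid_call_format('{"name": "search", "arguments": {"q": "rocm"}}'))  # True
print(valid_call_format('call search with q=rocm'))                         # False
```

Running this check over a batch of responses gives a format-compliance rate, which is often the first metric to degrade under quantization or long context.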
### 3.3 Code Generation Benchmarks

Recommended evaluation progression (increasing difficulty):

1. **HumanEval+** via EvalPlus (164 problems, well-understood baseline)
2. **MBPP+** via EvalPlus (974 problems, broader coverage)
3. **HumanEval Pro / MBPP Pro** (self-invoking code generation; tests compositionality)
4. **BigCodeBench** (1,140 tasks across 139 libraries; tests real-world API usage)
5. **SWE-bench Verified** (full repository-level coding; requires an agent scaffold)
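Results on these suites are conventionally reported as pass@k. With n samples per problem of which c pass, the standard unbiased estimator is pass@k = 1 - C(n-c, k)/C(n, k); a stdlib sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n generations (c of them correct) passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 correct: pass@1 reduces to c/n
print(round(pass_at_k(10, 3, 1), 6))  # 0.3
```

With `--greedy` decoding (as in the EvalPlus command above) n = 1, so pass@1 is simply the fraction of problems solved.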
### 3.4 Composite Agentic Evaluation

For a holistic view, run these in order:

```
Phase 1 - Baseline Quality:
  EvalPlus HumanEval+  (code generation)
  IFEval               (instruction following)
  BFCL simple          (tool-calling basics)

Phase 2 - Agentic Capability:
  BFCL v4 multi-turn   (stateful tool use)
  DeepPlanning         (long-horizon planning)
  BigCodeBench         (multi-library code composition)

Phase 3 - Full Agent Evaluation:
  AgentBench           (multi-environment)
  SWE-bench Verified   (real-world coding)
```
### 3.5 Measuring What Matters for Agents

Beyond accuracy, measure:

- **Recovery from errors**: Does the model self-correct when a tool call returns an error?
- **Instruction adherence under pressure**: Does tool-calling format degrade as context grows?
- **Planning depth**: How many sequential tool calls can the model chain before losing coherence?
- **Token efficiency**: Total tokens consumed per successful task completion.
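The token-efficiency metric reduces to simple bookkeeping across a run; a sketch (the task-record fields are illustrative):

```python
def tokens_per_success(runs: list) -> float:
    """Total tokens spent across all attempts, divided by the number of
    successfully completed tasks (inf if nothing succeeded). Failed
    attempts still count toward the numerator -- that is the point."""
    total = sum(r["tokens"] for r in runs)
    wins = sum(1 for r in runs if r["success"])
    return total / wins if wins else float("inf")

runs = [
    {"tokens": 1200, "success": True},
    {"tokens": 3400, "success": False},  # failed attempt still costs tokens
    {"tokens": 900,  "success": True},
]
print(tokens_per_success(runs))  # 2750.0
```

Two models with equal accuracy can differ several-fold on this number, which directly translates into wall-clock time on a fixed tok/s budget.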
---

## 4. Best Models for Agentic Use (Qwen Family Focus)

### 4.1 Recommended for Strix Halo 64 GB

#### Tier 1: Primary Recommendation

**Qwen3-Coder-30B-A3B-Instruct** (MoE: 30.5B total, 3.3B active)

- 128 experts, 8 activated per token
- 262K native context length
- Specially designed function-call format
- ~52-71 tok/s on Strix Halo (Q4_K_M, ~18 GB VRAM)
- Supports: Ollama, LM Studio, llama.cpp, KTransformers
- Available via: `ollama pull renchris/qwen3-coder:30b-gguf-unsloth`
- GGUF: https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
#### Tier 1 (Alternative): General-Purpose Agent

**Qwen3.5-35B-A3B** (MoE: 35B total, 3B active)

- Hybrid architecture: Gated Delta Networks + sparse MoE
- 256K context, 201 languages
- BFCL-V4 scores competitive with much larger models
- Recommended settings: temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.05

#### Tier 2: Smaller / Faster

**Qwen3.5-9B** (Dense: 9B parameters)

- Outscores GPT-OSS-120B (a model 13x its size) on GPQA Diamond (81.7 vs. 71.5)
- Fits easily in 64 GB with very long context
- Good for rapid prototyping and testing agent architectures
- Available via: `ollama pull qwen3.5:9b`

#### Tier 3: Maximum Capability (fits in 64 GB with quantization)

**Qwen3-Coder-Next** (MoE: 80B total, 3B active)

- SWE-bench Verified: 70.6% (SWE-Agent scaffold)
- SWE-bench Pro: 44.3% (beats DeepSeek-V3.2 at 40.9%)
- Requires >45 GB for 4-bit quants; >30 GB for 2-bit XL quants
- Fits on 64 GB Strix Halo with Q4_K quantization (tight but feasible)
- GGUF: https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF
- Run via: `llama-server -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL`
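These memory figures follow from simple arithmetic: a GGUF weight file is roughly total parameters times bits per weight, divided by 8, plus headroom for embeddings, KV cache, and runtime buffers. A back-of-envelope sketch (the 1.1 overhead factor is a rough assumption, not a measured constant):

```python
def approx_gguf_gb(params_b: float, bits_per_weight: float,
                   overhead: float = 1.1) -> float:
    """Rough GGUF size in GB: params (billions) * bits/8 bytes each,
    times a fudge factor for embeddings and mixed-precision layers."""
    return params_b * bits_per_weight / 8 * overhead

# 80B total parameters at ~4.5 bits/weight (Q4_K-class quant)
print(round(approx_gguf_gb(80, 4.5), 1))  # 49.5
```

The same arithmetic puts a 30B model at roughly 18 GB at Q4_K, consistent with the measured VRAM figures in section 3.1.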
### 4.2 Qwen Family Comparison for Agentic Tasks

| Model | Type | Active Params | BFCL-V4 | SWE-bench | Best For | 64 GB Feasible |
|-------|------|---------------|---------|-----------|----------|----------------|
| Qwen3-Coder-30B-A3B | MoE | 3.3B | Strong | Moderate | Tool calling, coding agents | Yes, comfortably |
| Qwen3.5-35B-A3B | MoE | 3B | Strong | n/a | General agentic tasks | Yes, comfortably |
| Qwen3.5-9B | Dense | 9B | Good | n/a | Fast prototyping, testing | Yes, easily |
| Qwen3-Coder-Next | MoE | 3B | Strong | 70.6% | Maximum coding capability | Yes, tight (Q4) |
| Qwen3.5-122B-A10B | MoE | 10B | 72.2 | n/a | Best tool calling | Marginal (needs Q2-Q3) |
| Qwen3-Coder-480B-A35B | MoE | 35B | SOTA | SOTA open | Maximum performance | No (too large) |

### 4.3 Non-Qwen Alternatives Worth Testing

| Model | Parameters | Notable For |
|-------|------------|-------------|
| GLM-4.7-Flash | 30B MoE (3B active) | Strong agentic performance, 128K context |
| DeepSeek-V3.2 | MoE | Competitive coding agent |
| Phi-4-Mini | 14B dense | Native function calling, small footprint |
| SWE-agent-LM-32B | 32B dense | Purpose-built for SWE-bench |
### 4.4 Optimal Setup for Agentic Use on Strix Halo

```bash
# 1. Start the model server (llama.cpp for best AMD GPU utilization)
llama-server \
  -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M \
  -ngl 99 \
  --ctx-size 32768 \
  --port 8080

# 2. Use Qwen-Agent for tool calling (optimized templates)
pip install "qwen-agent[mcp]"

# 3. Or use smolagents for framework-agnostic evaluation
pip install smolagents
```

For Qwen models specifically, Qwen-Agent is recommended because it encapsulates the correct tool-calling templates and parsers internally, avoiding format mismatches that degrade function-calling accuracy.
---

## 5. Open Questions / Limitations

1. **Quantization impact on tool calling**: Most benchmark results are reported at full precision (BF16/FP16). Quantization to Q4_K_M or lower may disproportionately affect structured output quality (JSON formatting, argument types) compared with general text generation. No systematic study exists for this on Strix Halo specifically.

2. **Context length vs. accuracy tradeoff**: Agentic workflows accumulate long conversation histories. MoE models advertise 262K context windows, but tool-calling accuracy beyond 32K tokens is poorly benchmarked for local models.

3. **ROCm maturity**: AMD's ROCm stack has improved dramatically but is still not at CUDA parity. The optimal backend (llama.cpp Vulkan vs. llama.cpp ROCm vs. vLLM ROCm) varies by model architecture and workload type.

4. **MoE scheduling on unified memory**: Strix Halo's unified memory architecture allows MoE models to split dense layers (GPU) and sparse experts (CPU RAM) efficiently, but optimal splitting strategies are not well documented for agentic workloads, where expert activation patterns may differ from typical chat use.

5. **Benchmark saturation**: HumanEval and MBPP are approaching saturation for frontier models. BigCodeBench and SWE-bench provide better discrimination but are significantly harder to run locally.

6. **Multi-agent evaluation**: Most benchmarks test single-agent performance. Multi-agent workflows (CrewAI, LangGraph multi-agent) lack standardized evaluation frameworks.

---

## 6. Overlap Notes

- **Throughput benchmarking** overlaps with `docs/benchmarking.md` (which covers llama-bench raw performance). This document focuses on agentic quality metrics rather than raw tok/s.
- **ROCm configuration** overlaps with `docs/optimization.md`. This document assumes the system is already optimized per that guide.
- **External links** should be consolidated into `docs/references.md` when this document is finalized.
---

## Sources

### Papers

- Patil et al., "The Berkeley Function Calling Leaderboard (BFCL)," ICML 2025
- Liu et al., "AgentBench: Evaluating LLMs as Agents," ICLR 2024
- Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024
- Mialon et al., "GAIA: A Benchmark for General AI Assistants," ICLR 2024
- "DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints," arXiv:2601.18137, Jan 2026
- "Qwen3 Technical Report," arXiv:2505.09388, May 2025
- "Qwen3-Coder-Next Technical Report," arXiv:2603.00729, March 2026
- "HumanEval Pro and MBPP Pro," ACL 2025 Findings
- "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions," ICLR 2025
- Zhou et al., "Instruction-Following Evaluation for Large Language Models," arXiv:2311.07911, 2023

### Repositories & Tools

- [BFCL](https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard)
- [SWE-bench](https://github.com/SWE-bench/SWE-bench)
- [AgentBench](https://github.com/THUDM/AgentBench)
- [EvalPlus](https://github.com/evalplus/evalplus)
- [BigCodeBench](https://github.com/bigcode-project/bigcodebench)
- [Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai)
- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
- [smolagents](https://github.com/huggingface/smolagents)
- [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent)
- [OpenHands](https://github.com/OpenHands/OpenHands)
- [Qwen3-Coder-30B-A3B GGUF](https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF)
- [Qwen3-Coder-Next GGUF](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF)
- [DeepPlanning Dataset](https://huggingface.co/datasets/Qwen/DeepPlanning)

### Strix Halo Benchmarks

- [Strix Halo GPU LLM Performance Tests (Framework Community)](https://community.frame.work/t/amd-strix-halo-ryzen-ai-max-395-gpu-llm-performance-tests/72521)
- [Strix Halo Benchmark Results (Level1Techs)](https://forum.level1techs.com/t/strix-halo-ryzen-ai-max-395-llm-benchmark-results/233796)
- [Strix Halo Toolboxes Benchmarks](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [Qwen3-Coder-30B Strix Halo Benchmark](https://github.com/pablo-ross/strix-halo-gmktec-evo-x2/blob/main/QWEN3-CODER-30B_BENCHMARK.md)
- [LLM Tracker - Strix Halo](https://llm-tracker.info/AMD-Strix-Halo-(Ryzen-AI-Max+-395)-GPU-Performance)

### Guides

- [Qwen3-Coder-Next Local Guide (DEV Community, 2026)](https://dev.to/sienna/qwen3-coder-next-the-complete-2026-guide-to-running-powerful-ai-coding-agents-locally-1k95)
- [Qwen3-Coder Local Setup (Unsloth)](https://unsloth.ai/docs/models/tutorials/qwen3-coder-how-to-run-locally)
- [Qwen llama.cpp Documentation](https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html)
- [smolagents + Ollama (Medium)](https://medium.com/@abonia/building-practical-local-ai-agents-with-smolagents-ollama-f92900c51897)
- [Inspect AI Documentation](https://inspect.aisi.org.uk/)