strix-halo-optimizations/docs/agentic-benchmarks.md
Felipe Cardoso 58124cd657 feat: add Qwen3.5 model catalog and agentic evaluation framework
Models:
- configs/models.conf: catalog with Qwen3.5-35B-A3B (MoE, top pick),
  Qwen3.5-27B (dense), Qwen3-Coder-30B-A3B (agentic/coding)
- Updated benchmark setup to show catalog with download status
- docs/model-recommendations.md: memory planning, quantization guide

Agentic evaluation:
- scripts/agentic/setup.sh: installs inspect-ai, evalplus, bigcodebench
  in a Python venv
- scripts/agentic/run-eval.sh: runs evaluations against local LLM server
  (ollama or llama.cpp). Suites: quick (HumanEval+IFEval), code
  (EvalPlus+BigCodeBench), tooluse (BFCL), full (all)
- bin/agentic: dispatcher with help
- docs/agentic-benchmarks.md: methodology, framework comparison, model
  recommendations for agentic use

Updated: Makefile (6 new targets), README, CLAUDE.md, docs/references.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 00:20:23 +01:00

# Local Agentic Flow Benchmarks for Strix Halo
Research summary: benchmarking agentic LLM capabilities on consumer hardware (AMD Strix Halo, Ryzen AI MAX+ 395, 64 GB unified memory) using llama.cpp, Ollama, and LM Studio.

---
## Scope
This document covers locally-runnable agentic benchmarks, evaluation frameworks, practical measurement approaches, and model recommendations (with emphasis on the Qwen family) for the Strix Halo platform. Cloud-only benchmarks that cannot accept a local OpenAI-compatible endpoint are out of scope.

---
## 1. Agentic Benchmarks Runnable Locally
### 1.1 Berkeley Function Calling Leaderboard (BFCL)
**What it measures**: Function/tool calling accuracy across serial calls, parallel calls, multiple languages, and multi-turn agentic interactions.
**Why it matters**: BFCL is the de facto standard for evaluating function-calling quality. Version 4 (2025) added holistic agentic evaluation with stateful multi-step reasoning.
**Local setup**:
```bash
# Option A: pip package
pip install bfcl-eval
# Option B: from source (more control)
git clone https://github.com/ShishirPatil/gorilla.git
cd gorilla/berkeley-function-call-leaderboard
pip install -e .
```
Evaluate a local model by pointing BFCL at any OpenAI-compatible endpoint (Ollama, llama.cpp server, vLLM). The framework uses AST-based evaluation to verify function-call correctness without executing the calls.
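The AST-based idea can be illustrated with a minimal stdlib sketch (the function name, checker, and schema here are illustrative, not BFCL's actual code): parse the model's emitted call, then compare the function name and keyword arguments against the expected values.

```python
import ast

def matches_expected_call(generated: str, expected_name: str, expected_args: dict) -> bool:
    """Parse a generated call string and compare the function name and
    keyword arguments against expected values -- without executing anything."""
    try:
        node = ast.parse(generated, mode="eval").body
    except SyntaxError:
        return False
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        return False
    if node.func.id != expected_name:
        return False
    try:
        # literal_eval accepts AST nodes; non-literal arguments fail the check
        got = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    except ValueError:
        return False
    return got == expected_args

print(matches_expected_call(
    'get_weather(city="Berlin", unit="celsius")',
    "get_weather", {"city": "Berlin", "unit": "celsius"}))  # True
print(matches_expected_call(
    'get_weather(city="Paris")',
    "get_weather", {"city": "Berlin", "unit": "celsius"}))  # False
```

This is why BFCL can grade thousands of calls cheaply and safely: correctness is a structural comparison, not an execution.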
- **Repository**: https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard
- **Leaderboard**: https://gorilla.cs.berkeley.edu/leaderboard.html
- **Paper**: Patil et al., "The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models," ICML 2025.
### 1.2 SWE-bench / SWE-bench Verified
**What it measures**: Ability to resolve real GitHub issues by generating patches against actual repositories.
**Why it matters**: The gold standard for evaluating coding agents. Tasks require understanding large codebases, multi-file edits, and test-driven validation.
**Local setup**: Evaluation runs inside Docker containers with network isolation. Two primary agent scaffolds support local models:
- **SWE-agent** (https://swe-agent.com): Install via pip, configure `config.toml` to point at a local OpenAI-compatible endpoint. There is also a dedicated open-weight model, SWE-bench/SWE-agent-LM-32B.
- **OpenHands** (https://github.com/OpenHands/OpenHands): `pip install openhands`, then `openhands serve`. Configure `config.toml` with your local model's `base_url`.
**Hardware note**: SWE-bench evaluation requires an x86_64 machine with at least 120 GB free storage, 16 GB RAM, and 8 CPU cores for the Docker harness (separate from model inference). Models smaller than 32B parameters show significantly degraded instruction following on these tasks.
- **Repository**: https://github.com/SWE-bench/SWE-bench
- **Paper**: Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024.
### 1.3 AgentBench
**What it measures**: LLM-as-agent across 8 environments: OS interaction, database queries, knowledge graphs, card games, lateral thinking, house-holding, web shopping, and web browsing.
**Why it matters**: The broadest multi-environment agent evaluation. Tests planning, reasoning, tool use, and decision-making in multi-turn open-ended settings.
**Local setup**: The evaluation package is released at https://github.com/THUDM/AgentBench. It supports custom model endpoints. Open-source models up to 70B show a significant performance gap versus frontier commercial models, making it a useful diagnostic for understanding where local models fall short.
- **Paper**: Liu et al., "AgentBench: Evaluating LLMs as Agents," ICLR 2024.
### 1.4 GAIA (General AI Assistants)
**What it measures**: Multi-step tasks requiring web search, document reading, calculation, and synthesis. 466 tasks that are conceptually simple for humans (92% human accuracy) yet remain extremely challenging for AI assistants.
**Local setup**: Available on Hugging Face. Requires a model with tool-use capabilities (web search, file reading, calculator). Can be wired to a local model via smolagents or LangChain with local tool implementations.
- **Paper**: Mialon et al., "GAIA: A Benchmark for General AI Assistants," ICLR 2024.
### 1.5 DeepPlanning (Qwen)
**What it measures**: Long-horizon agentic planning with verifiable constraints. Two domains: multi-day travel planning (9 APIs for flights, trains, hotels, restaurants, attractions) and multi-product shopping.
**Why it matters**: Evaluates three critical agentic abilities:
1. Proactive information acquisition (actively calling APIs to discover hidden states)
2. Local constrained reasoning (step-level logic like brand matching)
3. Global constrained optimization (budget caps, multi-day time feasibility)
**Local setup**: Open-sourced January 2026. Dataset at https://huggingface.co/datasets/Qwen/DeepPlanning. Evaluation code integrated into the Qwen-Agent framework.
- **Paper**: "DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints," arXiv:2601.18137, January 2026.
### 1.6 Code Generation: EvalPlus (HumanEval+ / MBPP+)
**What it measures**: Functional correctness of generated code. EvalPlus extends HumanEval by 80x and MBPP by 35x test cases.
**Local setup** (direct Ollama support):
```bash
pip install evalplus
# Run against a local Ollama model
evalplus.evaluate \
  --model "qwen3-coder:30b" \
  --dataset humaneval \
  --backend ollama \
  --base-url http://localhost:11434/v1 \
  --greedy
```
- **Repository**: https://github.com/evalplus/evalplus
- **Leaderboard**: https://evalplus.github.io/leaderboard.html
### 1.7 BigCodeBench
**What it measures**: 1,140 function-level tasks requiring composition of multiple function calls across 139 libraries. Average 5.6 test cases per task with 99% branch coverage.
**Local setup**: Based on EvalPlus infrastructure; supports the same backends including Ollama and vLLM.
- **Repository**: https://github.com/bigcode-project/bigcodebench
- **Paper**: "BigCodeBench: Benchmarking Code Generation Towards AGI," ICLR 2025.
### 1.8 IFEval (Instruction Following Evaluation)
**What it measures**: Compliance with programmatically verifiable instructions ("write more than 400 words," "mention AI at least 3 times"). No subjective judgment needed.
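"Programmatically verifiable" means each instruction reduces to a deterministic checker; a minimal stdlib sketch of the two examples above (checker names are illustrative, not from the IFEval codebase):

```python
import re

def check_min_words(text: str, n: int) -> bool:
    """'Write more than N words' -- verifiable by counting."""
    return len(text.split()) > n

def check_min_mentions(text: str, keyword: str, n: int) -> bool:
    """'Mention KEYWORD at least N times' -- verifiable by matching."""
    return len(re.findall(re.escape(keyword), text, re.IGNORECASE)) >= n

response = "AI agents use AI models; AI is mentioned three times here."
print(check_min_mentions(response, "AI", 3))  # True
print(check_min_words(response, 400))         # False
```

Because every check is binary and deterministic, IFEval scores are reproducible across runs with no judge model in the loop.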
**Local setup**: Available through lm-evaluation-harness and Inspect AI. Recent variants include IFEval-FC (function calling format compliance) and M-IFEval (multilingual).
- **Paper**: Zhou et al., "Instruction-Following Evaluation for Large Language Models," arXiv:2311.07911, 2023.
---
## 2. Local Agentic Evaluation Frameworks
### 2.1 Inspect AI (UK AISI)
The most comprehensive single framework for local agentic evaluation.
**Key features**:
- 100+ pre-built evaluations including BFCL, GAIA, HumanEval, MBPP, IFEval, GSM8K
- Native support for tool calling: custom tools, MCP tools, built-in bash/python/web tools
- Web-based Inspect View for monitoring and visualizing evaluations
- VS Code extension for development
- Works with any OpenAI-compatible endpoint (ollama, llama.cpp, vLLM)
```bash
pip install inspect-ai
# Run BFCL evaluation against a local model
inspect eval inspect_evals/bfcl --model openai/local-model \
  --model-base-url http://localhost:11434/v1
```
- **Repository**: https://github.com/UKGovernmentBEIS/inspect_ai
- **Evals collection**: https://github.com/UKGovernmentBEIS/inspect_evals
- **Documentation**: https://inspect.aisi.org.uk/
### 2.2 EleutherAI lm-evaluation-harness
The standard academic framework. 60+ benchmarks including MMLU, HellaSwag, ARC, GSM8K, HumanEval. Serves as the backend for Hugging Face's Open LLM Leaderboard.
**Local model support**: Works with HuggingFace models directly, OpenAI-compatible APIs, and custom backends. The `local-completions` and `local-chat-completions` model types support any local server.
```bash
pip install lm-eval
lm_eval --model local-chat-completions \
  --model_args model=qwen3-coder:30b,base_url=http://localhost:11434/v1 \
  --tasks humaneval,mbpp,ifeval \
  --batch_size auto
```
- **Repository**: https://github.com/EleutherAI/lm-evaluation-harness
### 2.3 smolagents (Hugging Face)
Lightweight agentic framework with two core agent types:
- **CodeAgent**: Generates and executes sandboxed Python code
- **ToolCallingAgent**: Calls external APIs and custom functions
**Ollama integration** works out of the box via the LiteLLM backend:
```python
from smolagents import CodeAgent, LiteLLMModel

# smolagents reaches Ollama through LiteLLM
# (install with: pip install "smolagents[litellm]")
model = LiteLLMModel(model_id="ollama_chat/qwen3-coder:30b",
                     api_base="http://localhost:11434")
agent = CodeAgent(tools=[], model=model)
agent.run("What is the 10th Fibonacci number?")
```
Supports custom tool definitions and evaluation harnesses. Model-agnostic design means any Ollama, llama.cpp, or LM Studio model works.
- **Repository**: https://github.com/huggingface/smolagents
### 2.4 Qwen-Agent
Purpose-built for Qwen models with optimized tool-calling templates and parsers.
**Key features**:
- Native MCP (Model Context Protocol) support
- Parallel, multi-step, and multi-turn function calls with automatic parsing
- Code interpreter, RAG, and Chrome extension built in
- DeepPlanning benchmark evaluation integrated
```bash
pip install "qwen-agent[mcp]"
```
Configure tools via MCP configuration files. The framework handles tool-calling format differences between Qwen model versions automatically.
- **Repository**: https://github.com/QwenLM/Qwen-Agent
- **Documentation**: https://qwenlm.github.io/Qwen-Agent/
### 2.5 LangGraph / CrewAI
Both support local OpenAI-compatible endpoints. Comparative benchmarks (2026) show:
- **LangGraph**: Lowest latency and token usage due to graph-based architecture that reduces redundant context passing. Preferred for production with deterministic control flow. Reached v1.0 GA in October 2025.
- **CrewAI**: ~40% faster from idea to working prototype. Higher token spend but simpler multi-agent orchestration. v1.10.1 with native MCP and A2A support. 44,600+ GitHub stars.
Neither provides a built-in standardized benchmark harness, but both can be instrumented to measure task completion rates, tool-call accuracy, and latency.
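Such instrumentation can be as simple as a decorator around tool and model calls; a framework-agnostic stdlib sketch (the stats schema and stand-in tool are illustrative):

```python
import time
from functools import wraps

def instrumented(stats: dict):
    """Wrap any tool or model call to record call count and total latency.
    Works the same whether the agent loop is LangGraph, CrewAI, or hand-rolled."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # the finally block records latency even when the call raises
                stats["calls"] = stats.get("calls", 0) + 1
                stats["total_s"] = stats.get("total_s", 0.0) + (time.perf_counter() - start)
        return wrapper
    return decorator

stats = {}

@instrumented(stats)
def fake_tool(x):  # stand-in for a real tool call
    return x * 2

for i in range(3):
    fake_tool(i)
print(stats["calls"])  # 3
```

Dividing `stats["total_s"]` by completed tasks gives a per-task latency figure comparable across frameworks.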
### 2.6 Throughput & Performance Benchmarking Tools
| Tool | Focus | Backends |
|------|-------|----------|
| [ollama-benchmark](https://github.com/aidatatools/ollama-benchmark) | Tokens/s throughput via Ollama | Ollama |
| [llama-benchy](https://github.com/eugr/llama-benchy) | Multi-backend benchmarking (llama-bench style) | vLLM, SGLang, llama.cpp, etc. |
| [benchllama](https://github.com/srikanth235/benchllama) | Local LLM benchmarking | Ollama |
| [local-llm-bench](https://github.com/famstack-dev/local-llm-bench) | Engine comparison (MLX vs llama.cpp) | MLX, llama.cpp |
| llama-bench (built-in) | Raw inference performance | llama.cpp native |
---
## 3. Practical Measurement Approaches
### 3.1 Token Throughput in Multi-Turn Conversations
Key metrics for agentic workloads on Strix Halo:
| Metric | Definition | Target |
|--------|-----------|--------|
| Time to First Token (TTFT) | Delay before first token appears | <500ms for interactive use |
| Generation speed (tok/s) | Steady-state token output rate | >30 tok/s for usable agents |
| Prompt processing (tok/s) | Speed of ingesting context | Critical for large codebases |
| KV cache utilization | Memory consumed by conversation history | Scales with context length |
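Given per-token arrival timestamps from any streaming client, TTFT and steady-state generation speed fall out directly; a minimal sketch (synthetic timestamps, not measured data):

```python
def latency_metrics(token_times: list[float]) -> dict:
    """Compute TTFT and steady-state generation rate from per-token
    arrival timestamps (seconds, relative to request start)."""
    ttft = token_times[0]
    gen_window = token_times[-1] - token_times[0]
    # tokens after the first, divided by the time they took to arrive
    tok_per_s = (len(token_times) - 1) / gen_window if gen_window > 0 else float("inf")
    return {"ttft_s": ttft, "gen_tok_s": tok_per_s}

# Synthetic trace: 0.4 s to first token, then one token every 20 ms (~50 tok/s)
times = [0.4 + 0.02 * i for i in range(101)]
m = latency_metrics(times)
print(round(m["ttft_s"], 2), round(m["gen_tok_s"], 1))  # 0.4 50.0
```

Excluding the first token from the rate calculation keeps prompt-processing time from contaminating the generation-speed figure.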
**Strix Halo 64GB measured performance** (from community benchmarks):
| Model | Quant | Gen tok/s | Prompt tok/s | VRAM Used |
|-------|-------|-----------|-------------|-----------|
| Qwen3-Coder-30B-A3B | Q4_K_M | ~52-71 | 5-47 (varies by context) | ~18 GB |
| Qwen3-30B-A3B (general) | Q4_K_M | ~52 | -- | ~18 GB |
| 70B dense models | Q4_K_M | ~5 | -- | ~40 GB |
MoE models like Qwen3-30B-A3B are where 64GB unified memory shines -- only 3B parameters are active per token, so generation is fast despite the 30B total parameter count.
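A back-of-envelope sketch of why: the full weight set determines the memory footprint, but per-token compute reads only the active experts. (The 4.85 bits/weight figure is an assumption, a rough average for Q4_K_M mixed quantization, not a measured value.)

```python
def gguf_weight_gib(params_b: float, bits_per_weight: float = 4.85) -> float:
    """Rough GGUF weight size in GiB: parameter count (billions) x bits/weight.
    4.85 bits/weight is an assumed average for Q4_K_M mixed quantization."""
    return params_b * 1e9 * bits_per_weight / 8 / 1024**3

total = gguf_weight_gib(30.5)   # full MoE weights must be resident
active = gguf_weight_gib(3.3)   # weights actually read per token
print(round(total, 1), round(active, 1))
```

So a ~17-18 GiB model behaves, bandwidth-wise, like a ~2 GiB dense model at generation time, which is why MoE generation speed on Strix Halo tracks active rather than total parameters.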
### 3.2 Tool-Calling Accuracy Measurement
A practical local test sequence:
1. **BFCL subset**: Run the BFCL simple function calling tests first (serial single-function calls). If accuracy is below 80%, the model is not suitable for agentic use.
2. **Parallel function calling**: Test with BFCL parallel calling scenarios. Many smaller models fail here.
3. **Multi-turn stateful**: BFCL v3/v4 multi-turn tests or DeepPlanning scenarios.
4. **Format compliance**: IFEval-FC tests whether the model can produce correctly formatted JSON function calls consistently.
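Step 4 can be spot-checked without any harness: a format-compliance checker is a few lines of stdlib Python (the `name`/`arguments` field names are illustrative; match them to your server's tool-call schema):

```python
import json

REQUIRED = {"name", "arguments"}

def valid_tool_call(raw: str) -> bool:
    """Check that a model's tool-call output is well-formed JSON with the
    expected top-level fields and an object (not a string) for 'arguments'."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (isinstance(obj, dict)
            and REQUIRED <= obj.keys()
            and isinstance(obj["arguments"], dict))

print(valid_tool_call('{"name": "search", "arguments": {"q": "strix halo"}}'))  # True
print(valid_tool_call('{"name": "search", "arguments": "q=strix halo"}'))       # False
```

Running this over a few hundred sampled completions gives a quick compliance rate before committing to a full IFEval-FC run.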
### 3.3 Code Generation Benchmarks
Recommended evaluation progression (increasing difficulty):
1. **HumanEval+** via EvalPlus (164 problems, well-understood baseline)
2. **MBPP+** via EvalPlus (974 problems, broader coverage)
3. **HumanEval Pro / MBPP Pro** (self-invoking code generation, tests compositionality)
4. **BigCodeBench** (1,140 tasks across 139 libraries, tests real-world API usage)
5. **SWE-bench Verified** (full repository-level coding, requires agent scaffold)
### 3.4 Composite Agentic Evaluation
For a holistic view, run these in order:
```
Phase 1 - Baseline Quality:
  EvalPlus HumanEval+   (code generation)
  IFEval                (instruction following)
  BFCL simple           (tool calling basics)

Phase 2 - Agentic Capability:
  BFCL v4 multi-turn    (stateful tool use)
  DeepPlanning          (long-horizon planning)
  BigCodeBench          (multi-library code composition)

Phase 3 - Full Agent Evaluation:
  AgentBench            (multi-environment)
  SWE-bench Verified    (real-world coding)
```
### 3.5 Measuring What Matters for Agents
Beyond accuracy, measure:
- **Recovery from errors**: Does the model self-correct when a tool call returns an error?
- **Instruction adherence under pressure**: Does tool-calling format degrade as context grows?
- **Planning depth**: How many sequential tool calls can the model chain before losing coherence?
- **Token efficiency**: Total tokens consumed per successful task completion.
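Token efficiency in particular is easy to compute from eval logs; a minimal sketch (the run records are illustrative):

```python
def tokens_per_success(runs: list[dict]) -> float:
    """Token efficiency: total tokens spent across ALL runs divided by the
    number of runs that succeeded. Failed runs still cost tokens, so an
    agent that retries wastefully scores worse even at equal accuracy."""
    successes = sum(1 for r in runs if r["success"])
    total_tokens = sum(r["tokens"] for r in runs)
    return total_tokens / successes if successes else float("inf")

runs = [
    {"success": True,  "tokens": 1200},
    {"success": False, "tokens": 3500},  # burned tokens, no result
    {"success": True,  "tokens": 900},
]
print(tokens_per_success(runs))  # 2800.0
```

Charging failed runs to the denominator's survivors is deliberate: two models with identical pass rates can differ several-fold in cost per solved task.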
---
## 4. Best Models for Agentic Use (Qwen Family Focus)
### 4.1 Recommended for Strix Halo 64GB
#### Tier 1: Primary Recommendation
**Qwen3-Coder-30B-A3B-Instruct** (MoE: 30.5B total, 3.3B active)
- 128 experts, 8 activated per token
- 262K native context length
- Specially designed function-call format
- ~52-71 tok/s on Strix Halo (Q4_K_M, ~18 GB VRAM)
- Supports: Ollama, LM Studio, llama.cpp, KTransformers
- Available via: `ollama pull renchris/qwen3-coder:30b-gguf-unsloth`
- GGUF: https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
#### Tier 1 (Alternative): General-Purpose Agent
**Qwen3.5-35B-A3B** (MoE: 35B total, 3B active)
- Hybrid architecture: Gated Delta Networks + sparse MoE
- 256K context, 201 languages
- BFCL-V4 scores competitive with much larger models
- Recommended settings: temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.05
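These settings map onto a request to an OpenAI-compatible endpoint as follows; a sketch only, with the model alias invented for illustration. `top_k` and `repeat_penalty` (llama.cpp's name for repetition_penalty) are server extensions, not core OpenAI fields, and other servers may name or ignore them differently.

```python
import json

# Recommended Qwen3.5-35B-A3B sampling settings expressed as a request body
# for llama.cpp's OpenAI-compatible server. The model alias is illustrative.
payload = {
    "model": "qwen3.5-35b-a3b",
    "messages": [{"role": "user", "content": "Plan a 3-step refactor."}],
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,           # llama.cpp extension field
    "repeat_penalty": 1.05,  # llama.cpp's name for repetition_penalty
}
print(json.dumps(payload, indent=2))
```

POST this body to `/v1/chat/completions` on the local server; unknown sampler fields are typically ignored rather than rejected, so verify in the server logs that they took effect.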
#### Tier 2: Smaller / Faster
**Qwen3.5-9B** (Dense: 9B parameters)
- Outscores GPT-OSS-120B (a model 13x its size) on GPQA Diamond (81.7 vs. 71.5)
- Fits easily in 64GB with very long context
- Good for rapid prototyping and testing agent architectures
- Available via: `ollama pull qwen3.5:9b`
#### Tier 3: Maximum Capability (fits in 64GB with quantization)
**Qwen3-Coder-Next** (MoE: 80B total, 3B active)
- SWE-bench Verified: 70.6% (SWE-Agent scaffold)
- SWE-bench Pro: 44.3% (beats DeepSeek-V3.2 at 40.9)
- Requires >45GB for 4-bit quants; >30GB for 2-bit XL quants
- Fits on 64GB Strix Halo with Q4_K quantization (tight but feasible)
- GGUF: https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF
- Run via: `llama-server -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL`
### 4.2 Qwen Family Comparison for Agentic Tasks
| Model | Type | Active Params | BFCL-V4 | SWE-bench | Best For | 64GB Feasible |
|-------|------|--------------|---------|-----------|----------|---------------|
| Qwen3-Coder-30B-A3B | MoE | 3.3B | Strong | Moderate | Tool calling, coding agents | Yes, comfortably |
| Qwen3.5-35B-A3B | MoE | 3B | Strong | -- | General agentic tasks | Yes, comfortably |
| Qwen3.5-9B | Dense | 9B | Good | -- | Fast prototyping, testing | Yes, easily |
| Qwen3-Coder-Next | MoE | 3B | Strong | 70.6% | Maximum coding capability | Yes, tight (Q4) |
| Qwen3.5-122B-A10B | MoE | 10B | 72.2 | -- | Best tool calling | Marginal (needs Q2-Q3) |
| Qwen3-Coder-480B-A35B | MoE | 35B | SOTA | SOTA open | Maximum performance | No (too large) |
### 4.3 Non-Qwen Alternatives Worth Testing
| Model | Parameters | Notable For |
|-------|-----------|-------------|
| GLM-4.7-Flash | 30B MoE (3B active) | Strong agentic performance, 128K context |
| DeepSeek-V3.2 | MoE | Competitive coding agent |
| Phi-4-Mini | 14B dense | Native function calling, small footprint |
| SWE-agent-LM-32B | 32B dense | Purpose-built for SWE-bench |
### 4.4 Optimal Setup for Agentic Use on Strix Halo
```bash
# 1. Start the model server (llama.cpp for best AMD GPU utilization)
llama-server \
  -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M \
  -ngl 99 \
  --ctx-size 32768 \
  --port 8080

# 2. Use Qwen-Agent for tool calling (optimized templates)
pip install "qwen-agent[mcp]"

# 3. Or use smolagents for framework-agnostic evaluation
pip install smolagents
```
For the Qwen models specifically, Qwen-Agent is recommended because it encapsulates the correct tool-calling templates and parsers internally, avoiding format mismatches that degrade function-calling accuracy.

---
## 5. Open Questions / Limitations
1. **Quantization impact on tool calling**: Most benchmark results are reported at full precision (BF16/FP16). Quantization to Q4_K_M or lower may disproportionately affect structured output quality (JSON formatting, argument types) versus general text generation. No systematic study exists for this on Strix Halo specifically.
2. **Context length vs. accuracy tradeoff**: Agentic workflows accumulate long conversation histories. MoE models with 262K context windows are advertised but tool-calling accuracy at >32K tokens is poorly benchmarked for local models.
3. **ROCm maturity**: AMD's ROCm stack has improved dramatically but is still not at CUDA parity. The optimal backend (llama.cpp Vulkan vs. llama.cpp ROCm vs. vLLM ROCm) varies by model architecture and workload type.
4. **MoE scheduling on unified memory**: Strix Halo's unified memory architecture allows MoE models to split dense layers (GPU) and sparse experts (CPU RAM) efficiently, but optimal splitting strategies are not well-documented for agentic workloads where expert activation patterns may differ from typical chat use.
5. **Benchmark saturation**: HumanEval and MBPP are approaching saturation for frontier models. BigCodeBench and SWE-bench provide better discrimination but are significantly harder to run locally.
6. **Multi-agent evaluation**: Most benchmarks test single-agent performance. Multi-agent workflows (CrewAI, LangGraph multi-agent) lack standardized evaluation frameworks.
---
## 6. Overlap Notes
- **Throughput benchmarking** overlaps with `docs/benchmarking.md` (which covers llama-bench raw performance). This document focuses on agentic quality metrics rather than raw tok/s.
- **ROCm configuration** overlaps with `docs/optimization.md`. This document assumes the system is already optimized per that guide.
- **External links** should be consolidated into `docs/references.md` when this document is finalized.
---
## Sources
### Papers
- Patil et al., "The Berkeley Function Calling Leaderboard (BFCL)," ICML 2025
- Liu et al., "AgentBench: Evaluating LLMs as Agents," ICLR 2024
- Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024
- Mialon et al., "GAIA: A Benchmark for General AI Assistants," ICLR 2024
- "DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints," arXiv:2601.18137, Jan 2026
- "Qwen3 Technical Report," arXiv:2505.09388, May 2025
- "Qwen3-Coder-Next Technical Report," arXiv:2603.00729, March 2026
- "HumanEval Pro and MBPP Pro," ACL 2025 Findings
- "BigCodeBench: Benchmarking Code Generation Towards AGI," ICLR 2025
- Zhou et al., "Instruction-Following Evaluation for Large Language Models," arXiv:2311.07911, 2023
### Repositories & Tools
- [BFCL](https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard)
- [SWE-bench](https://github.com/SWE-bench/SWE-bench)
- [AgentBench](https://github.com/THUDM/AgentBench)
- [EvalPlus](https://github.com/evalplus/evalplus)
- [BigCodeBench](https://github.com/bigcode-project/bigcodebench)
- [Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai)
- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
- [smolagents](https://github.com/huggingface/smolagents)
- [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent)
- [OpenHands](https://github.com/OpenHands/OpenHands)
- [Qwen3-Coder-30B-A3B GGUF](https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF)
- [Qwen3-Coder-Next GGUF](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF)
- [DeepPlanning Dataset](https://huggingface.co/datasets/Qwen/DeepPlanning)
### Strix Halo Benchmarks
- [Strix Halo GPU LLM Performance Tests (Framework Community)](https://community.frame.work/t/amd-strix-halo-ryzen-ai-max-395-gpu-llm-performance-tests/72521)
- [Strix Halo Benchmark Results (Level1Techs)](https://forum.level1techs.com/t/strix-halo-ryzen-ai-max-395-llm-benchmark-results/233796)
- [Strix Halo Toolboxes Benchmarks](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [Qwen3-Coder-30B Strix Halo Benchmark](https://github.com/pablo-ross/strix-halo-gmktec-evo-x2/blob/main/QWEN3-CODER-30B_BENCHMARK.md)
- [LLM Tracker - Strix Halo](https://llm-tracker.info/AMD-Strix-Halo-(Ryzen-AI-Max+-395)-GPU-Performance)
### Guides
- [Qwen3-Coder-Next Local Guide (DEV Community, 2026)](https://dev.to/sienna/qwen3-coder-next-the-complete-2026-guide-to-running-powerful-ai-coding-agents-locally-1k95)
- [Qwen3-Coder Local Setup (Unsloth)](https://unsloth.ai/docs/models/tutorials/qwen3-coder-how-to-run-locally)
- [Qwen llama.cpp Documentation](https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html)
- [smolagents + Ollama (Medium)](https://medium.com/@abonia/building-practical-local-ai-agents-with-smolagents-ollama-f92900c51897)
- [Inspect AI Documentation](https://inspect.aisi.org.uk/)