Local Agentic Flow Benchmarks for Strix Halo
Research summary: benchmarking agentic LLM capabilities on consumer hardware (AMD Strix Halo, Ryzen AI MAX+ 395, 64 GB unified memory) using llama.cpp, Ollama, and LM Studio.
Scope
This document covers locally-runnable agentic benchmarks, evaluation frameworks, practical measurement approaches, and model recommendations (with emphasis on the Qwen family) for the Strix Halo platform. Cloud-only benchmarks that cannot accept a local OpenAI-compatible endpoint are out of scope.
1. Agentic Benchmarks Runnable Locally
1.1 Berkeley Function Calling Leaderboard (BFCL)
What it measures: Function/tool calling accuracy across serial calls, parallel calls, multiple languages, and multi-turn agentic interactions.
Why it matters: BFCL is the de facto standard for evaluating function-calling quality. Version 4 (2025) added holistic agentic evaluation with stateful multi-step reasoning.
Local setup:
```bash
# Option A: pip package
pip install bfcl-eval

# Option B: from source (more control)
git clone https://github.com/ShishirPatil/gorilla.git
cd gorilla/berkeley-function-call-leaderboard
pip install -e .
```
Evaluate a local model by pointing BFCL at any OpenAI-compatible endpoint (Ollama, the llama.cpp server, vLLM). The framework uses AST-based evaluation to verify function-call correctness without executing the calls.
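The idea behind AST-based evaluation can be illustrated with a few lines of Python: parse the generated call string, then compare its structure against ground truth rather than running it. This is a simplified sketch, not BFCL's actual checker, which additionally handles positional arguments, type coercion, lists of acceptable answers, and the parallel/multi-turn categories.

```python
import ast

def call_matches(generated: str, expected_fn: str, expected_args: dict) -> bool:
    """Structurally compare a generated function-call string against the
    expected call without executing it: parse to an AST, then check the
    function name and keyword arguments. Illustrative sketch only."""
    try:
        tree = ast.parse(generated.strip(), mode="eval")
        node = tree.body
        if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
            return False
        if node.func.id != expected_fn:
            return False
        got = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    except (SyntaxError, ValueError):
        return False
    return got == expected_args

print(call_matches('get_weather(city="Paris", unit="celsius")',
                   "get_weather", {"city": "Paris", "unit": "celsius"}))  # True
```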
- Repository: https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard
- Leaderboard: https://gorilla.cs.berkeley.edu/leaderboard.html
- Paper: Patil et al., "The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models," ICML 2025.
1.2 SWE-bench / SWE-bench Verified
What it measures: Ability to resolve real GitHub issues by generating patches against actual repositories.
Why it matters: The gold standard for evaluating coding agents. Tasks require understanding large codebases, multi-file edits, and test-driven validation.
Local setup: Evaluation runs inside Docker containers with network isolation. Two primary agent scaffolds support local models:
- SWE-agent (https://swe-agent.com): Install via pip, configure config.toml to point at a local OpenAI-compatible endpoint. There is also a dedicated open-weight model, SWE-bench/SWE-agent-LM-32B.
- OpenHands (https://github.com/OpenHands/OpenHands): pip install openhands, then openhands serve. Configure config.toml with your local model's base_url.
Hardware note: SWE-bench evaluation requires an x86_64 machine with at least 120 GB free storage, 16 GB RAM, and 8 CPU cores for the Docker harness (separate from model inference). Models smaller than 32B parameters show significantly degraded instruction following on these tasks.
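Since failed runs are expensive, the disk and CPU requirements above are worth checking before pulling Docker images. A small preflight sketch (not part of SWE-bench; the RAM check is omitted for portability):

```python
import os
import shutil

def swebench_preflight(path="/", min_free_gb=120, min_cpus=8):
    """Check the documented SWE-bench harness requirements: ~120 GB free
    disk and 8 CPU cores. Returns (ok, details) so the caller can log why
    a run was refused."""
    free_gb = shutil.disk_usage(path).free / 1e9
    cpus = os.cpu_count() or 0
    ok = free_gb >= min_free_gb and cpus >= min_cpus
    return ok, {"free_gb": round(free_gb, 1), "cpus": cpus}

ok, details = swebench_preflight()
print(ok, details)
```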
- Repository: https://github.com/SWE-bench/SWE-bench
- Paper: Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024.
1.3 AgentBench
What it measures: LLM-as-agent across 8 environments: OS interaction, database queries, knowledge graphs, card games, lateral thinking, house-holding, web shopping, and web browsing.
Why it matters: The broadest multi-environment agent evaluation. Tests planning, reasoning, tool use, and decision-making in multi-turn open-ended settings.
Local setup: The evaluation package is released at https://github.com/THUDM/AgentBench. It supports custom model endpoints. Open-source models up to 70B show a significant performance gap versus frontier commercial models, making it a useful diagnostic for understanding where local models fall short.
- Paper: Liu et al., "AgentBench: Evaluating LLMs as Agents," ICLR 2024.
1.4 GAIA (General AI Assistants)
What it measures: Multi-step tasks requiring web search, document reading, calculation, and synthesis. 466 tasks that are conceptually simple for humans (92% human accuracy) but extremely challenging for AI.
Local setup: Available on Hugging Face. Requires a model with tool-use capabilities (web search, file reading, calculator). Can be wired to a local model via smolagents or LangChain with local tool implementations.
- Paper: Mialon et al., "GAIA: A Benchmark for General AI Assistants," ICLR 2024.
1.5 DeepPlanning (Qwen)
What it measures: Long-horizon agentic planning with verifiable constraints. Two domains: multi-day travel planning (9 APIs for flights, trains, hotels, restaurants, attractions) and multi-product shopping.
Why it matters: Evaluates three critical agentic abilities:
- Proactive information acquisition (actively calling APIs to discover hidden states)
- Local constrained reasoning (step-level logic like brand matching)
- Global constrained optimization (budget caps, multi-day time feasibility)
Local setup: Open-sourced January 2026. Dataset at https://huggingface.co/datasets/Qwen/DeepPlanning. Evaluation code integrated into the Qwen-Agent framework.
- Paper: "DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints," arXiv:2601.18137, January 2026.
1.6 Code Generation: EvalPlus (HumanEval+ / MBPP+)
What it measures: Functional correctness of generated code. EvalPlus augments HumanEval with 80x and MBPP with 35x more test cases than the originals.
Local setup (direct Ollama support):
```bash
pip install evalplus

# Run against a local Ollama model
evalplus.evaluate \
  --model "qwen3-coder:30b" \
  --dataset humaneval \
  --backend ollama \
  --base-url http://localhost:11434/v1 \
  --greedy
```
- Repository: https://github.com/evalplus/evalplus
- Leaderboard: https://evalplus.github.io/leaderboard.html
1.7 BigCodeBench
What it measures: 1,140 function-level tasks requiring composition of multiple function calls across 139 libraries. Average 5.6 test cases per task with 99% branch coverage.
Local setup: Based on EvalPlus infrastructure; supports the same backends including Ollama and vLLM.
- Repository: https://github.com/bigcode-project/bigcodebench
- Paper: "BigCodeBench: Benchmarking Code Generation Towards AGI," ICLR 2025.
1.8 IFEval (Instruction Following Evaluation)
What it measures: Compliance with programmatically verifiable instructions ("write more than 400 words," "mention AI at least 3 times"). No subjective judgment needed.
Local setup: Available through lm-evaluation-harness and Inspect AI. Recent variants include IFEval-FC (function calling format compliance) and M-IFEval (multilingual).
- Paper: Zhou et al., "Instruction-Following Evaluation for Large Language Models," arXiv:2311.07911, 2023.
2. Local Agentic Evaluation Frameworks
2.1 Inspect AI (UK AISI)
The most comprehensive single framework for local agentic evaluation.
Key features:
- 100+ pre-built evaluations including BFCL, GAIA, HumanEval, MBPP, IFEval, GSM8K
- Native support for tool calling: custom tools, MCP tools, built-in bash/python/web tools
- Web-based Inspect View for monitoring and visualizing evaluations
- VS Code extension for development
- Works with any OpenAI-compatible endpoint (ollama, llama.cpp, vLLM)
```bash
pip install inspect-ai

# Run BFCL evaluation against a local model
inspect eval inspect_evals/bfcl --model openai/local-model \
  --model-base-url http://localhost:11434/v1
```
- Repository: https://github.com/UKGovernmentBEIS/inspect_ai
- Evals collection: https://github.com/UKGovernmentBEIS/inspect_evals
- Documentation: https://inspect.aisi.org.uk/
2.2 EleutherAI lm-evaluation-harness
The standard academic framework. 60+ benchmarks including MMLU, HellaSwag, ARC, GSM8K, HumanEval. Serves as the backend for Hugging Face's Open LLM Leaderboard.
Local model support: Works with HuggingFace models directly, OpenAI-compatible APIs, and custom backends. The local-completions and local-chat-completions model types support any local server.
```bash
pip install lm-eval

lm_eval --model local-chat-completions \
  --model_args model=qwen3-coder:30b,base_url=http://localhost:11434/v1 \
  --tasks humaneval,mbpp,ifeval \
  --batch_size auto
```
2.3 smolagents (Hugging Face)
Lightweight agentic framework with two core agent types:
- CodeAgent: Generates and executes sandboxed Python code
- ToolCallingAgent: Calls external APIs and custom functions
Ollama integration is first-class:
```python
from smolagents import CodeAgent, OllamaModel

model = OllamaModel(model_id="qwen3-coder:30b")
agent = CodeAgent(tools=[], model=model)
agent.run("What is the 10th Fibonacci number?")
```
Supports custom tool definitions and evaluation harnesses. Model-agnostic design means any Ollama, llama.cpp, or LM Studio model works.
- Repository: https://github.com/huggingface/smolagents
2.4 Qwen-Agent
Purpose-built for Qwen models with optimized tool-calling templates and parsers.
Key features:
- Native MCP (Model Context Protocol) support
- Parallel, multi-step, and multi-turn function calls with automatic parsing
- Code interpreter, RAG, and Chrome extension built in
- DeepPlanning benchmark evaluation integrated
```bash
pip install "qwen-agent[mcp]"
```
Configure tools via MCP configuration files. The framework handles tool-calling format differences between Qwen model versions automatically.
- Repository: https://github.com/QwenLM/Qwen-Agent
- Documentation: https://qwenlm.github.io/Qwen-Agent/
2.5 LangGraph / CrewAI
Both support local OpenAI-compatible endpoints. Comparative benchmarks (2026) show:
- LangGraph: Lowest latency and token usage due to graph-based architecture that reduces redundant context passing. Preferred for production with deterministic control flow. Reached v1.0 GA in October 2025.
- CrewAI: ~40% faster from idea to working prototype. Higher token spend but simpler multi-agent orchestration. v1.10.1 with native MCP and A2A support. 44,600+ GitHub stars.
Neither provides a built-in standardized benchmark harness, but both can be instrumented to measure task completion rates, tool-call accuracy, and latency.
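Absent a built-in harness, a thin wrapper is enough to collect comparable numbers from either framework. A sketch with hypothetical field names (the dict keys returned by `task_fn` are an assumed convention, not a LangGraph or CrewAI API):

```python
import time
from dataclasses import dataclass

@dataclass
class RunStats:
    """Per-task metrics worth collecting from any agent framework."""
    latency_s: float
    success: bool
    tool_calls: int
    tokens: int

def run_instrumented(task_fn, *args) -> RunStats:
    """Time one agent task and collect the counters it reports. task_fn is
    assumed to return a dict like
    {'success': bool, 'tool_calls': int, 'tokens': int}."""
    start = time.perf_counter()
    result = task_fn(*args)
    return RunStats(
        latency_s=time.perf_counter() - start,
        success=result.get("success", False),
        tool_calls=result.get("tool_calls", 0),
        tokens=result.get("tokens", 0),
    )

# Usage with a stand-in task:
stats = run_instrumented(lambda: {"success": True, "tool_calls": 3, "tokens": 950})
print(stats.success, stats.tool_calls, stats.tokens)
```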
2.6 Throughput & Performance Benchmarking Tools
| Tool | Focus | Backends |
|---|---|---|
| ollama-benchmark | Tokens/s throughput via Ollama | Ollama |
| llama-benchy | Multi-backend benchmarking (llama-bench style) | vLLM, SGLang, llama.cpp, etc. |
| benchllama | Local LLM benchmarking | Ollama |
| local-llm-bench | Engine comparison (MLX vs llama.cpp) | MLX, llama.cpp |
| llama-bench (built-in) | Raw inference performance | llama.cpp native |
3. Practical Measurement Approaches
3.1 Token Throughput in Multi-Turn Conversations
Key metrics for agentic workloads on Strix Halo:
| Metric | Definition | Target |
|---|---|---|
| Time to First Token (TTFT) | Delay before first token appears | <500ms for interactive use |
| Generation speed (tok/s) | Steady-state token output rate | >30 tok/s for usable agents |
| Prompt processing (tok/s) | Speed of ingesting context | Critical for large codebases |
| KV cache utilization | Memory consumed by conversation history | Scales with context length |
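The KV cache row can be quantified directly from the model architecture: each token stores one key and one value vector per layer. A back-of-envelope calculator (the architecture numbers in the example are illustrative; read the real values from the model's config):

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elt=2):
    """Approximate KV cache size: 2 tensors (K and V) per layer, each storing
    n_kv_heads * head_dim values per token, fp16 (2 bytes) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elt / 2**30

# Illustrative: a 48-layer GQA model with 4 KV heads of dim 128 at 32K context
print(kv_cache_gib(48, 4, 128, 32768))  # 3.0 GiB
```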
Strix Halo 64GB measured performance (from community benchmarks):
| Model | Quant | Gen tok/s | Prompt tok/s | VRAM Used |
|---|---|---|---|---|
| Qwen3-Coder-30B-A3B | Q4_K_M | ~52-71 | 5-47 (varies by context) | ~18 GB |
| Qwen3-30B-A3B (general) | Q4_K_M | ~52 | -- | ~18 GB |
| 70B dense models | Q4_K_M | ~5 | -- | ~40 GB |
MoE models like Qwen3-30B-A3B are where 64GB unified memory shines -- only 3B parameters are active per token, so generation is fast despite the 30B total parameter count.
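The tradeoff is easy to quantify: weight memory scales with total parameters (every expert must be resident), while per-token weight traffic scales with active parameters. A sketch, assuming Q4_K_M averages roughly 4.5 effective bits per weight (an approximation, not an exact figure):

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Weight bytes: params * bits / 8, expressed in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

# Qwen3-30B-A3B: all 30.5B parameters must fit in memory...
print(round(weight_gib(30.5, 4.5), 1))
# ...but each generated token reads only the ~3.3B active parameters,
# which is why generation speed resembles that of a small dense model.
print(round(weight_gib(3.3, 4.5), 2))
```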
3.2 Tool-Calling Accuracy Measurement
A practical local test sequence:
- BFCL subset: Run the BFCL simple function calling tests first (serial single-function calls). If accuracy is below 80%, the model is not suitable for agentic use.
- Parallel function calling: Test with BFCL parallel calling scenarios. Many smaller models fail here.
- Multi-turn stateful: BFCL v3/v4 multi-turn tests or DeepPlanning scenarios.
- Format compliance: IFEval-FC tests whether the model can produce correctly formatted JSON function calls consistently.
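The format-compliance step can be smoke-tested without any harness at all: collect raw model outputs and validate them structurally. A minimal sketch (the key names are illustrative; match them to your server's actual tool-call schema):

```python
import json

def valid_tool_call(output: str, required_keys=("name", "arguments")) -> bool:
    """Check that a model's raw output parses as a JSON object containing
    the expected tool-call keys, and that 'arguments' is itself an object."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict) or any(k not in obj for k in required_keys):
        return False
    return isinstance(obj.get("arguments"), dict)

print(valid_tool_call('{"name": "get_weather", "arguments": {"city": "Paris"}}'))  # True
print(valid_tool_call('Sure! Here is the call: get_weather(Paris)'))               # False
```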
3.3 Code Generation Benchmarks
Recommended evaluation progression (increasing difficulty):
- HumanEval+ via EvalPlus (164 problems, well-understood baseline)
- MBPP+ via EvalPlus (974 problems, broader coverage)
- HumanEval Pro / MBPP Pro (self-invoking code generation, tests compositionality)
- BigCodeBench (1,140 tasks across 139 libraries, tests real-world API usage)
- SWE-bench Verified (full repository-level coding, requires agent scaffold)
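These benchmarks report pass@k scores. When sampling n completions per problem, the standard unbiased estimator (introduced with the HumanEval/Codex paper) is pass@k = 1 - C(n-c, k)/C(n, k), where c completions pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n generated completions (c of which are correct) passes."""
    if n - c < k:
        return 1.0  # too few failing samples left to fill a k-draw
    return 1.0 - comb(n - c, k) / comb(n, k)

# With greedy decoding (n=1), pass@1 is simply the pass rate:
print(pass_at_k(1, 1, 1), pass_at_k(1, 0, 1))
```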
3.4 Composite Agentic Evaluation
For a holistic view, run these in order:
Phase 1 - Baseline Quality:
EvalPlus HumanEval+ (code generation)
IFEval (instruction following)
BFCL simple (tool calling basics)
Phase 2 - Agentic Capability:
BFCL v4 multi-turn (stateful tool use)
DeepPlanning (long-horizon planning)
BigCodeBench (multi-library code composition)
Phase 3 - Full Agent Evaluation:
AgentBench (multi-environment)
SWE-bench Verified (real-world coding)
3.5 Measuring What Matters for Agents
Beyond accuracy, measure:
- Recovery from errors: Does the model self-correct when a tool call returns an error?
- Instruction adherence under pressure: Does tool-calling format degrade as context grows?
- Planning depth: How many sequential tool calls can the model chain before losing coherence?
- Token efficiency: Total tokens consumed per successful task completion.
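These can be rolled into one per-model report. A sketch with illustrative field names (the per-run dict layout is an assumed convention for whatever logging the harness produces):

```python
def agent_report(runs):
    """Aggregate per-run dicts {'success': bool, 'tokens': int, 'chain_len': int}
    into the agent-level metrics listed above."""
    successes = [r for r in runs if r["success"]]
    total_tokens = sum(r["tokens"] for r in runs)
    return {
        "success_rate": len(successes) / len(runs) if runs else 0.0,
        # Token efficiency: total spend divided by successful completions
        "tokens_per_success": total_tokens / len(successes) if successes else float("inf"),
        # Planning-depth proxy: longest tool-call chain that still succeeded
        "max_successful_chain": max((r["chain_len"] for r in successes), default=0),
    }

runs = [
    {"success": True,  "tokens": 1200, "chain_len": 4},
    {"success": False, "tokens": 3000, "chain_len": 2},
    {"success": True,  "tokens": 800,  "chain_len": 6},
]
print(agent_report(runs))
```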
4. Best Models for Agentic Use (Qwen Family Focus)
4.1 Recommended for Strix Halo 64GB
Tier 1: Primary Recommendation
Qwen3-Coder-30B-A3B-Instruct (MoE: 30.5B total, 3.3B active)
- 128 experts, 8 activated per token
- 262K native context length
- Specially designed function-call format
- ~52-71 tok/s on Strix Halo (Q4_K_M, ~18 GB VRAM)
- Supports: Ollama, LM Studio, llama.cpp, KTransformers
- Available via: ollama pull renchris/qwen3-coder:30b-gguf-unsloth
- GGUF: https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
Tier 1 (Alternative): General-Purpose Agent
Qwen3.5-35B-A3B (MoE: 35B total, 3B active)
- Hybrid architecture: Gated Delta Networks + sparse MoE
- 256K context, 201 languages
- BFCL-V4 scores competitive with much larger models
- Recommended settings: temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.05
Tier 2: Smaller / Faster
Qwen3.5-9B (Dense: 9B parameters)
- Beats GPT-OSS-120B (a model 13x its size) on GPQA Diamond (81.7 vs 71.5)
- Fits easily in 64GB with very long context
- Good for rapid prototyping and testing agent architectures
- Available via: ollama pull qwen3.5:9b
Tier 3: Maximum Capability (fits in 64GB with quantization)
Qwen3-Coder-Next (MoE: 80B total, 3B active)
- SWE-bench Verified: 70.6% (SWE-Agent scaffold)
- SWE-bench Pro: 44.3% (beats DeepSeek-V3.2 at 40.9)
- Requires >45GB for 4-bit quants; >30GB for 2-bit XL quants
- Fits on 64GB Strix Halo with Q4_K quantization (tight but feasible)
- GGUF: https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF
- Run via: llama-server -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL
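The memory figures above can be sanity-checked with back-of-envelope arithmetic. Effective bits-per-weight for mixed GGUF quants are approximate (Q4_K variants average roughly 4.5-5 bits), so treat the result as a lower bound:

```python
def gguf_weight_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Rough resident-weight size in GB: params * bits / 8. Ignores KV cache,
    activations, and runtime overhead, so real usage is higher."""
    return total_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# 80B total params at ~4.5 bits/weight -> ~45 GB for weights alone,
# consistent with the >45GB figure quoted for 4-bit quants.
print(gguf_weight_gb(80, 4.5))  # 45.0
```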
4.2 Qwen Family Comparison for Agentic Tasks
| Model | Type | Active Params | BFCL-V4 | SWE-bench | Best For | 64GB Feasible |
|---|---|---|---|---|---|---|
| Qwen3-Coder-30B-A3B | MoE | 3.3B | Strong | Moderate | Tool calling, coding agents | Yes, comfortably |
| Qwen3.5-35B-A3B | MoE | 3B | Strong | -- | General agentic tasks | Yes, comfortably |
| Qwen3.5-9B | Dense | 9B | Good | -- | Fast prototyping, testing | Yes, easily |
| Qwen3-Coder-Next | MoE | 3B | Strong | 70.6% | Maximum coding capability | Yes, tight (Q4) |
| Qwen3.5-122B-A10B | MoE | 10B | 72.2 | -- | Best tool calling | Marginal (needs Q2-Q3) |
| Qwen3-Coder-480B-A35B | MoE | 35B | SOTA | SOTA open | Maximum performance | No (too large) |
4.3 Non-Qwen Alternatives Worth Testing
| Model | Parameters | Notable For |
|---|---|---|
| GLM-4.7-Flash | 30B MoE (3B active) | Strong agentic performance, 128K context |
| DeepSeek-V3.2 | MoE | Competitive coding agent |
| Phi-4-Mini | 14B dense | Native function calling, small footprint |
| SWE-agent-LM-32B | 32B dense | Purpose-built for SWE-bench |
4.4 Optimal Setup for Agentic Use on Strix Halo
```bash
# 1. Start the model server (llama.cpp for best AMD GPU utilization)
llama-server \
  -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M \
  -ngl 99 \
  --ctx-size 32768 \
  --port 8080

# 2. Use Qwen-Agent for tool calling (optimized templates)
pip install "qwen-agent[mcp]"

# 3. Or use smolagents for framework-agnostic evaluation
pip install smolagents
```
For the Qwen models specifically, Qwen-Agent is recommended because it encapsulates the correct tool-calling templates and parsers internally, avoiding format mismatches that degrade function calling accuracy.
5. Open Questions / Limitations
- Quantization impact on tool calling: Most benchmark results are reported at full precision (BF16/FP16). Quantization to Q4_K_M or lower may disproportionately affect structured output quality (JSON formatting, argument types) versus general text generation. No systematic study exists for this on Strix Halo specifically.
- Context length vs. accuracy tradeoff: Agentic workflows accumulate long conversation histories. MoE models advertise 262K context windows, but tool-calling accuracy beyond 32K tokens is poorly benchmarked for local models.
- ROCm maturity: AMD's ROCm stack has improved dramatically but is still not at CUDA parity. The optimal backend (llama.cpp Vulkan vs. llama.cpp ROCm vs. vLLM ROCm) varies by model architecture and workload type.
- MoE scheduling on unified memory: Strix Halo's unified memory architecture allows MoE models to split dense layers (GPU) and sparse experts (CPU RAM) efficiently, but optimal splitting strategies are not well documented for agentic workloads, where expert activation patterns may differ from typical chat use.
- Benchmark saturation: HumanEval and MBPP are approaching saturation for frontier models. BigCodeBench and SWE-bench provide better discrimination but are significantly harder to run locally.
- Multi-agent evaluation: Most benchmarks test single-agent performance. Multi-agent workflows (CrewAI, LangGraph multi-agent) lack standardized evaluation frameworks.
6. Overlap Notes
- Throughput benchmarking overlaps with docs/benchmarking.md (which covers llama-bench raw performance). This document focuses on agentic quality metrics rather than raw tok/s.
- ROCm configuration overlaps with docs/optimization.md. This document assumes the system is already optimized per that guide.
- External links should be consolidated into docs/references.md when this document is finalized.
Sources
Papers
- Patil et al., "The Berkeley Function Calling Leaderboard (BFCL)," ICML 2025
- Liu et al., "AgentBench: Evaluating LLMs as Agents," ICLR 2024
- Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024
- Mialon et al., "GAIA: A Benchmark for General AI Assistants," ICLR 2024
- "DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints," arXiv:2601.18137, Jan 2026
- "Qwen3 Technical Report," arXiv:2505.09388, May 2025
- "Qwen3-Coder-Next Technical Report," arXiv:2603.00729, March 2026
- "HumanEval Pro and MBPP Pro," ACL 2025 Findings
- "BigCodeBench: Benchmarking Code Generation Towards AGI," ICLR 2025
- Zhou et al., "Instruction-Following Evaluation for Large Language Models," arXiv:2311.07911, 2023
Repositories & Tools
- BFCL
- SWE-bench
- AgentBench
- EvalPlus
- BigCodeBench
- Inspect AI
- lm-evaluation-harness
- smolagents
- Qwen-Agent
- OpenHands
- Qwen3-Coder-30B-A3B GGUF
- Qwen3-Coder-Next GGUF
- DeepPlanning Dataset
Strix Halo Benchmarks
- Strix Halo GPU LLM Performance Tests (Framework Community)
- Strix Halo Benchmark Results (Level1Techs)
- Strix Halo Toolboxes Benchmarks
- Qwen3-Coder-30B Strix Halo Benchmark
- LLM Tracker - Strix Halo