feat: add Qwen3.5 model catalog and agentic evaluation framework

Models:
- configs/models.conf: catalog with Qwen3.5-35B-A3B (MoE, top pick),
  Qwen3.5-27B (dense), Qwen3-Coder-30B-A3B (agentic/coding)
- Updated benchmark setup to show catalog with download status
- docs/model-recommendations.md: memory planning, quantization guide

Agentic evaluation:
- scripts/agentic/setup.sh: installs inspect-ai, evalplus, bigcodebench
  in a Python venv
- scripts/agentic/run-eval.sh: runs evaluations against local LLM server
  (ollama or llama.cpp). Suites: quick (HumanEval+IFEval), code
  (EvalPlus+BigCodeBench), tooluse (BFCL), full (all)
- bin/agentic: dispatcher with help
- docs/agentic-benchmarks.md: methodology, framework comparison, model
  recommendations for agentic use

Updated: Makefile (6 new targets), README, CLAUDE.md, docs/references.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author: Felipe Cardoso
Date: 2026-03-26 00:20:23 +01:00
parent 71053997be
commit 58124cd657
11 changed files with 1354 additions and 16 deletions

docs/agentic-benchmarks.md (new file)
@@ -0,0 +1,444 @@
# Local Agentic Flow Benchmarks for Strix Halo
Research summary: benchmarking agentic LLM capabilities on consumer hardware (AMD Strix Halo, Ryzen AI MAX+ 395, 64 GB unified memory) using llama.cpp, Ollama, and LM Studio.
---
## Scope
This document covers locally-runnable agentic benchmarks, evaluation frameworks, practical measurement approaches, and model recommendations (with emphasis on the Qwen family) for the Strix Halo platform. Cloud-only benchmarks that cannot accept a local OpenAI-compatible endpoint are out of scope.
---
## 1. Agentic Benchmarks Runnable Locally
### 1.1 Berkeley Function Calling Leaderboard (BFCL)
**What it measures**: Function/tool calling accuracy across serial calls, parallel calls, multiple languages, and multi-turn agentic interactions.
**Why it matters**: BFCL is the de facto standard for evaluating function-calling quality. Version 4 (2025) added holistic agentic evaluation with stateful multi-step reasoning.
**Local setup**:
```bash
# Option A: pip package
pip install bfcl-eval
# Option B: from source (more control)
git clone https://github.com/ShishirPatil/gorilla.git
cd gorilla/berkeley-function-call-leaderboard
pip install -e .
```
Evaluate a local model by pointing BFCL at any OpenAI-compatible endpoint (ollama, llama.cpp server, vLLM). The framework uses AST-based evaluation to verify function call correctness without executing them.
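To make the AST-based check concrete, here is a minimal illustrative version of the idea (not BFCL's actual implementation): parse the generated call string and compare the callee and keyword arguments against the ground truth, without ever executing the call.

```python
import ast

def call_matches(generated: str, expected_fn: str, expected_args: dict) -> bool:
    """AST-style check: parse a generated function-call string and compare
    the function name and keyword arguments to the expected values,
    without executing anything."""
    try:
        node = ast.parse(generated, mode="eval").body
    except SyntaxError:
        return False
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        return False
    if node.func.id != expected_fn:
        return False
    got = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return got == expected_args

call_matches('get_weather(city="Berlin", unit="celsius")',
             "get_weather", {"city": "Berlin", "unit": "celsius"})   # True
call_matches('get_weather(city="Berlin")',
             "get_weather", {"city": "Berlin", "unit": "celsius"})   # False: missing arg
```

BFCL's real checker additionally handles positional arguments, type coercion, and multiple acceptable answers, but the principle is the same: correctness is decided on the parsed structure, not on execution.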
- **Repository**: https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard
- **Leaderboard**: https://gorilla.cs.berkeley.edu/leaderboard.html
- **Paper**: Patil et al., "The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models," ICML 2025.
### 1.2 SWE-bench / SWE-bench Verified
**What it measures**: Ability to resolve real GitHub issues by generating patches against actual repositories.
**Why it matters**: The gold standard for evaluating coding agents. Tasks require understanding large codebases, multi-file edits, and test-driven validation.
**Local setup**: Evaluation runs inside Docker containers with network isolation. Two primary agent scaffolds support local models:
- **SWE-agent** (https://swe-agent.com): Install via pip, configure `config.toml` to point at a local OpenAI-compatible endpoint. There is also a dedicated open-weight model, SWE-bench/SWE-agent-LM-32B.
- **OpenHands** (https://github.com/OpenHands/OpenHands): `pip install openhands`, then `openhands serve`. Configure `config.toml` with your local model's `base_url`.
**Hardware note**: SWE-bench evaluation requires an x86_64 machine with at least 120 GB free storage, 16 GB RAM, and 8 CPU cores for the Docker harness (separate from model inference). Models smaller than 32B parameters show significantly degraded instruction following on these tasks.
- **Repository**: https://github.com/SWE-bench/SWE-bench
- **Paper**: Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024.
### 1.3 AgentBench
**What it measures**: LLM-as-agent across 8 environments: OS interaction, database queries, knowledge graphs, card games, lateral thinking, house-holding, web shopping, and web browsing.
**Why it matters**: The broadest multi-environment agent evaluation. Tests planning, reasoning, tool use, and decision-making in multi-turn open-ended settings.
**Local setup**: The evaluation package is released at https://github.com/THUDM/AgentBench. It supports custom model endpoints. Open-source models up to 70B show a significant performance gap versus frontier commercial models, making it a useful diagnostic for understanding where local models fall short.
- **Paper**: Liu et al., "AgentBench: Evaluating LLMs as Agents," ICLR 2024.
### 1.4 GAIA (General AI Assistants)
**What it measures**: Multi-step tasks requiring web search, document reading, calculation, and synthesis. 466 tasks that are conceptually simple for humans (92% success rate) yet remain extremely challenging for AI assistants.
**Local setup**: Available on Hugging Face. Requires a model with tool-use capabilities (web search, file reading, calculator). Can be wired to a local model via smolagents or LangChain with local tool implementations.
- **Paper**: Mialon et al., "GAIA: A Benchmark for General AI Assistants," ICLR 2024.
### 1.5 DeepPlanning (Qwen)
**What it measures**: Long-horizon agentic planning with verifiable constraints. Two domains: multi-day travel planning (9 APIs for flights, trains, hotels, restaurants, attractions) and multi-product shopping.
**Why it matters**: Evaluates three critical agentic abilities:
1. Proactive information acquisition (actively calling APIs to discover hidden states)
2. Local constrained reasoning (step-level logic like brand matching)
3. Global constrained optimization (budget caps, multi-day time feasibility)
**Local setup**: Open-sourced January 2026. Dataset at https://huggingface.co/datasets/Qwen/DeepPlanning. Evaluation code integrated into the Qwen-Agent framework.
- **Paper**: "DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints," arXiv:2601.18137, January 2026.
### 1.6 Code Generation: EvalPlus (HumanEval+ / MBPP+)
**What it measures**: Functional correctness of generated code. EvalPlus extends HumanEval with 80x more test cases and MBPP with 35x more.
**Local setup** (direct Ollama support):
```bash
pip install evalplus
# Run against a local Ollama model
evalplus.evaluate \
  --model "qwen3-coder:30b" \
  --dataset humaneval \
  --backend ollama \
  --base-url http://localhost:11434/v1 \
  --greedy
```
- **Repository**: https://github.com/evalplus/evalplus
- **Leaderboard**: https://evalplus.github.io/leaderboard.html
### 1.7 BigCodeBench
**What it measures**: 1,140 function-level tasks requiring composition of multiple function calls across 139 libraries. Average 5.6 test cases per task with 99% branch coverage.
**Local setup**: Based on EvalPlus infrastructure; supports the same backends including Ollama and vLLM.
- **Repository**: https://github.com/bigcode-project/bigcodebench
- **Paper**: "BigCodeBench: Benchmarking Code Generation Towards AGI," ICLR 2025.
### 1.8 IFEval (Instruction Following Evaluation)
**What it measures**: Compliance with programmatically verifiable instructions ("write more than 400 words," "mention AI at least 3 times"). No subjective judgment needed.
**Local setup**: Available through lm-evaluation-harness and Inspect AI. Recent variants include IFEval-FC (function calling format compliance) and M-IFEval (multilingual).
- **Paper**: Zhou et al., "Instruction-Following Evaluation for Large Language Models," arXiv:2311.07911, 2023.
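The appeal of IFEval is that every instruction is mechanically checkable. A toy illustration of such verifiers (not the official implementation, which covers 25+ instruction types) might look like:

```python
import re

def verify_instructions(text: str) -> dict:
    """Toy verifiable-instruction checks in the spirit of IFEval:
    each check is a deterministic predicate over the model's output."""
    return {
        "more_than_400_words": len(text.split()) > 400,
        "mentions_AI_3_times": len(re.findall(r"\bAI\b", text)) >= 3,
        "ends_with_period": text.rstrip().endswith("."),
    }

sample = "AI here, AI there, AI everywhere."
verify_instructions(sample)
# {'more_than_400_words': False, 'mentions_AI_3_times': True, 'ends_with_period': True}
```

Because every check is a pure predicate, IFEval scores are fully reproducible locally with no judge model involved.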
---
## 2. Local Agentic Evaluation Frameworks
### 2.1 Inspect AI (UK AISI)
The most comprehensive single framework for local agentic evaluation.
**Key features**:
- 100+ pre-built evaluations including BFCL, GAIA, HumanEval, MBPP, IFEval, GSM8K
- Native support for tool calling: custom tools, MCP tools, built-in bash/python/web tools
- Web-based Inspect View for monitoring and visualizing evaluations
- VS Code extension for development
- Works with any OpenAI-compatible endpoint (ollama, llama.cpp, vLLM)
```bash
pip install inspect-ai
# Run BFCL evaluation against a local model
inspect eval inspect_evals/bfcl --model openai/local-model \
  --model-base-url http://localhost:11434/v1
```
- **Repository**: https://github.com/UKGovernmentBEIS/inspect_ai
- **Evals collection**: https://github.com/UKGovernmentBEIS/inspect_evals
- **Documentation**: https://inspect.aisi.org.uk/
### 2.2 EleutherAI lm-evaluation-harness
The standard academic framework. 60+ benchmarks including MMLU, HellaSwag, ARC, GSM8K, HumanEval. Serves as the backend for Hugging Face's Open LLM Leaderboard.
**Local model support**: Works with HuggingFace models directly, OpenAI-compatible APIs, and custom backends. The `local-completions` and `local-chat-completions` model types support any local server.
```bash
pip install lm-eval
lm_eval --model local-chat-completions \
  --model_args model=qwen3-coder:30b,base_url=http://localhost:11434/v1/chat/completions \
  --tasks humaneval,mbpp,ifeval \
  --apply_chat_template \
  --batch_size auto
```
- **Repository**: https://github.com/EleutherAI/lm-evaluation-harness
### 2.3 smolagents (Hugging Face)
Lightweight agentic framework with two core agent types:
- **CodeAgent**: Generates and executes sandboxed Python code
- **ToolCallingAgent**: Calls external APIs and custom functions
**Ollama integration** is first-class:
```python
from smolagents import CodeAgent, LiteLLMModel

# Ollama is reached through LiteLLM's "ollama_chat/" provider prefix
model = LiteLLMModel(model_id="ollama_chat/qwen3-coder:30b",
                     api_base="http://localhost:11434")
agent = CodeAgent(tools=[], model=model)
agent.run("What is the 10th Fibonacci number?")
```
Supports custom tool definitions and evaluation harnesses. Model-agnostic design means any Ollama, llama.cpp, or LM Studio model works.
- **Repository**: https://github.com/huggingface/smolagents
### 2.4 Qwen-Agent
Purpose-built for Qwen models with optimized tool-calling templates and parsers.
**Key features**:
- Native MCP (Model Context Protocol) support
- Parallel, multi-step, and multi-turn function calls with automatic parsing
- Code interpreter, RAG, and Chrome extension built in
- DeepPlanning benchmark evaluation integrated
```bash
pip install "qwen-agent[mcp]"
```
Configure tools via MCP configuration files. The framework handles tool-calling format differences between Qwen model versions automatically.
- **Repository**: https://github.com/QwenLM/Qwen-Agent
- **Documentation**: https://qwenlm.github.io/Qwen-Agent/
### 2.5 LangGraph / CrewAI
Both support local OpenAI-compatible endpoints. Comparative benchmarks (2026) show:
- **LangGraph**: Lowest latency and token usage due to graph-based architecture that reduces redundant context passing. Preferred for production with deterministic control flow. Reached v1.0 GA in October 2025.
- **CrewAI**: ~40% faster from idea to working prototype. Higher token spend but simpler multi-agent orchestration. v1.10.1 with native MCP and A2A support. 44,600+ GitHub stars.
Neither provides a built-in standardized benchmark harness, but both can be instrumented to measure task completion rates, tool-call accuracy, and latency.
### 2.6 Throughput & Performance Benchmarking Tools
| Tool | Focus | Backends |
|------|-------|----------|
| [ollama-benchmark](https://github.com/aidatatools/ollama-benchmark) | Tokens/s throughput via Ollama | Ollama |
| [llama-benchy](https://github.com/eugr/llama-benchy) | Multi-backend benchmarking (llama-bench style) | vLLM, SGLang, llama.cpp, etc. |
| [benchllama](https://github.com/srikanth235/benchllama) | Local LLM benchmarking | Ollama |
| [local-llm-bench](https://github.com/famstack-dev/local-llm-bench) | Engine comparison (MLX vs llama.cpp) | MLX, llama.cpp |
| llama-bench (built-in) | Raw inference performance | llama.cpp native |
---
## 3. Practical Measurement Approaches
### 3.1 Token Throughput in Multi-Turn Conversations
Key metrics for agentic workloads on Strix Halo:
| Metric | Definition | Target |
|--------|-----------|--------|
| Time to First Token (TTFT) | Delay before first token appears | <500ms for interactive use |
| Generation speed (tok/s) | Steady-state token output rate | >30 tok/s for usable agents |
| Prompt processing (tok/s) | Speed of ingesting context | Critical for large codebases |
| KV cache utilization | Memory consumed by conversation history | Scales with context length |
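Given per-token arrival timestamps from a streaming response (e.g. the OpenAI-compatible SSE stream), TTFT and steady-state generation speed fall out directly; a small sketch (the metric names are illustrative):

```python
def stream_metrics(request_start: float, token_times: list[float]) -> dict:
    """Derive TTFT and steady-state generation speed from the wall-clock
    arrival time of each streamed token."""
    ttft = token_times[0] - request_start
    span = token_times[-1] - token_times[0]
    gen_tok_s = (len(token_times) - 1) / span if span > 0 else float("nan")
    return {"ttft_s": ttft, "gen_tok_s": gen_tok_s}

# 5 tokens: first after 0.4 s, then one every 20 ms (~50 tok/s steady state)
stream_metrics(0.0, [0.40, 0.42, 0.44, 0.46, 0.48])
# ttft_s = 0.4, gen_tok_s ~= 50.0
```

Measuring from timestamps rather than server-reported counters keeps the method backend-agnostic across ollama, llama.cpp, and vLLM.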
**Strix Halo 64GB measured performance** (from community benchmarks):
| Model | Quant | Gen tok/s | Prompt tok/s | VRAM Used |
|-------|-------|-----------|-------------|-----------|
| Qwen3-Coder-30B-A3B | Q4_K_M | ~52-71 | 5-47 (varies by context) | ~18 GB |
| Qwen3-30B-A3B (general) | Q4_K_M | ~52 | -- | ~18 GB |
| 70B dense models | Q4_K_M | ~5 | -- | ~40 GB |
MoE models like Qwen3-30B-A3B are where 64GB unified memory shines -- only 3B parameters are active per token, so generation is fast despite the 30B total parameter count.
### 3.2 Tool-Calling Accuracy Measurement
A practical local test sequence:
1. **BFCL subset**: Run the BFCL simple function calling tests first (serial single-function calls). If accuracy is below 80%, the model is not suitable for agentic use.
2. **Parallel function calling**: Test with BFCL parallel calling scenarios. Many smaller models fail here.
3. **Multi-turn stateful**: BFCL v3/v4 multi-turn tests or DeepPlanning scenarios.
4. **Format compliance**: IFEval-FC tests whether the model can produce correctly formatted JSON function calls consistently.
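Step 4 can be approximated locally with a strict JSON check against the tool's parameter schema. An illustrative sketch (the schema format here is simplified, not BFCL's or IFEval-FC's):

```python
import json

TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def tool_call_is_well_formed(raw: str, params: dict, required: list) -> bool:
    """True iff `raw` parses as a JSON object, every required parameter is
    present, and every supplied parameter has the declared type."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(args, dict):
        return False
    if any(name not in args for name in required):
        return False
    return all(name in params and isinstance(val, TYPES[params[name]])
               for name, val in args.items())

schema = {"city": "string", "days": "integer"}
tool_call_is_well_formed('{"city": "Oslo", "days": 3}', schema, ["city"])    # True
tool_call_is_well_formed('{"city": "Oslo", "days": "3"}', schema, ["city"])  # False: wrong type
```

Running this over a few hundred generations gives a quick format-compliance rate before investing in a full BFCL run.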
### 3.3 Code Generation Benchmarks
Recommended evaluation progression (increasing difficulty):
1. **HumanEval+** via EvalPlus (164 problems, well-understood baseline)
2. **MBPP+** via EvalPlus (974 problems, broader coverage)
3. **HumanEval Pro / MBPP Pro** (self-invoking code generation, tests compositionality)
4. **BigCodeBench** (1,140 tasks across 139 libraries, tests real-world API usage)
5. **SWE-bench Verified** (full repository-level coding, requires agent scaffold)
### 3.4 Composite Agentic Evaluation
For a holistic view, run these in order:
```
Phase 1 - Baseline Quality:
EvalPlus HumanEval+ (code generation)
IFEval (instruction following)
BFCL simple (tool calling basics)
Phase 2 - Agentic Capability:
BFCL v4 multi-turn (stateful tool use)
DeepPlanning (long-horizon planning)
BigCodeBench (multi-library code composition)
Phase 3 - Full Agent Evaluation:
AgentBench (multi-environment)
SWE-bench Verified (real-world coding)
```
### 3.5 Measuring What Matters for Agents
Beyond accuracy, measure:
- **Recovery from errors**: Does the model self-correct when a tool call returns an error?
- **Instruction adherence under pressure**: Does tool-calling format degrade as context grows?
- **Planning depth**: How many sequential tool calls can the model chain before losing coherence?
- **Token efficiency**: Total tokens consumed per successful task completion.
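The last metric is easy to aggregate once you log `(success, tokens_used)` per task; a minimal helper:

```python
def tokens_per_success(runs: list[tuple[bool, int]]) -> float:
    """Total tokens spent across all runs divided by the number of
    successful task completions (lower is better)."""
    wins = sum(1 for ok, _ in runs if ok)
    spent = sum(tokens for _, tokens in runs)
    return spent / wins if wins else float("inf")

tokens_per_success([(True, 1200), (False, 4000), (True, 800)])  # 3000.0
```

Counting failed attempts in the numerator is deliberate: an agent that burns tokens on doomed runs is paying for them either way.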
---
## 4. Best Models for Agentic Use (Qwen Family Focus)
### 4.1 Recommended for Strix Halo 64GB
#### Tier 1: Primary Recommendation
**Qwen3-Coder-30B-A3B-Instruct** (MoE: 30.5B total, 3.3B active)
- 128 experts, 8 activated per token
- 262K native context length
- Specially designed function-call format
- ~52-71 tok/s on Strix Halo (Q4_K_M, ~18 GB VRAM)
- Supports: Ollama, LM Studio, llama.cpp, KTransformers
- Available via: `ollama pull renchris/qwen3-coder:30b-gguf-unsloth`
- GGUF: https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
#### Tier 1 (Alternative): General-Purpose Agent
**Qwen3.5-35B-A3B** (MoE: 35B total, 3B active)
- Hybrid architecture: Gated Delta Networks + sparse MoE
- 256K context, 201 languages
- BFCL-V4 scores competitive with much larger models
- Recommended settings: temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.05
#### Tier 2: Smaller / Faster
**Qwen3.5-9B** (Dense: 9B parameters)
- Outperforms GPT-OSS-120B (a model 13x its size) on GPQA Diamond (81.7 vs 71.5)
- Fits easily in 64GB with very long context
- Good for rapid prototyping and testing agent architectures
- Available via: `ollama pull qwen3.5:9b`
#### Tier 3: Maximum Capability (fits in 64GB with quantization)
**Qwen3-Coder-Next** (MoE: 80B total, 3B active)
- SWE-bench Verified: 70.6% (SWE-Agent scaffold)
- SWE-bench Pro: 44.3% (beats DeepSeek-V3.2 at 40.9)
- Requires >45GB for 4-bit quants; >30GB for 2-bit XL quants
- Fits on 64GB Strix Halo with Q4_K quantization (tight but feasible)
- GGUF: https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF
- Run via: `llama-server -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL`
### 4.2 Qwen Family Comparison for Agentic Tasks
| Model | Type | Active Params | BFCL-V4 | SWE-bench | Best For | 64GB Feasible |
|-------|------|--------------|---------|-----------|----------|---------------|
| Qwen3-Coder-30B-A3B | MoE | 3.3B | Strong | Moderate | Tool calling, coding agents | Yes, comfortably |
| Qwen3.5-35B-A3B | MoE | 3B | Strong | -- | General agentic tasks | Yes, comfortably |
| Qwen3.5-9B | Dense | 9B | Good | -- | Fast prototyping, testing | Yes, easily |
| Qwen3-Coder-Next | MoE | 3B | Strong | 70.6% | Maximum coding capability | Yes, tight (Q4) |
| Qwen3.5-122B-A10B | MoE | 10B | 72.2 | -- | Best tool calling | Marginal (needs Q2-Q3) |
| Qwen3-Coder-480B-A35B | MoE | 35B | SOTA | SOTA open | Maximum performance | No (too large) |
### 4.3 Non-Qwen Alternatives Worth Testing
| Model | Parameters | Notable For |
|-------|-----------|-------------|
| GLM-4.7-Flash | 30B MoE (3B active) | Strong agentic performance, 128K context |
| DeepSeek-V3.2 | MoE | Competitive coding agent |
| Phi-4-Mini | 14B dense | Native function calling, small footprint |
| SWE-agent-LM-32B | 32B dense | Purpose-built for SWE-bench |
### 4.4 Optimal Setup for Agentic Use on Strix Halo
```bash
# 1. Start model server (llama.cpp for best AMD GPU utilization)
llama-server \
-hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M \
-ngl 99 \
--ctx-size 32768 \
--port 8080
# 2. Use Qwen-Agent for tool calling (optimized templates)
pip install "qwen-agent[mcp]"
# 3. Or use smolagents for framework-agnostic evaluation
pip install smolagents
```
For the Qwen models specifically, Qwen-Agent is recommended because it encapsulates the correct tool-calling templates and parsers internally, avoiding format mismatches that degrade function calling accuracy.
---
## 5. Open Questions / Limitations
1. **Quantization impact on tool calling**: Most benchmark results are reported at full precision (BF16/FP16). Quantization to Q4_K_M or lower may disproportionately affect structured output quality (JSON formatting, argument types) versus general text generation. No systematic study exists for this on Strix Halo specifically.
2. **Context length vs. accuracy tradeoff**: Agentic workflows accumulate long conversation histories. MoE models with 262K context windows are advertised but tool-calling accuracy at >32K tokens is poorly benchmarked for local models.
3. **ROCm maturity**: AMD's ROCm stack has improved dramatically but is still not at CUDA parity. The optimal backend (llama.cpp Vulkan vs. llama.cpp ROCm vs. vLLM ROCm) varies by model architecture and workload type.
4. **MoE scheduling on unified memory**: Strix Halo's unified memory architecture allows MoE models to split dense layers (GPU) and sparse experts (CPU RAM) efficiently, but optimal splitting strategies are not well-documented for agentic workloads where expert activation patterns may differ from typical chat use.
5. **Benchmark saturation**: HumanEval and MBPP are approaching saturation for frontier models. BigCodeBench and SWE-bench provide better discrimination but are significantly harder to run locally.
6. **Multi-agent evaluation**: Most benchmarks test single-agent performance. Multi-agent workflows (CrewAI, LangGraph multi-agent) lack standardized evaluation frameworks.
---
## 6. Overlap Notes
- **Throughput benchmarking** overlaps with `docs/benchmarking.md` (which covers llama-bench raw performance). This document focuses on agentic quality metrics rather than raw tok/s.
- **ROCm configuration** overlaps with `docs/optimization.md`. This document assumes the system is already optimized per that guide.
- **External links** should be consolidated into `docs/references.md` when this document is finalized.
---
## Sources
### Papers
- Patil et al., "The Berkeley Function Calling Leaderboard (BFCL)," ICML 2025
- Liu et al., "AgentBench: Evaluating LLMs as Agents," ICLR 2024
- Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024
- Mialon et al., "GAIA: A Benchmark for General AI Assistants," ICLR 2024
- "DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints," arXiv:2601.18137, Jan 2026
- "Qwen3 Technical Report," arXiv:2505.09388, May 2025
- "Qwen3-Coder-Next Technical Report," arXiv:2603.00729, March 2026
- "HumanEval Pro and MBPP Pro," ACL 2025 Findings
- "BigCodeBench: Benchmarking Code Generation Towards AGI," ICLR 2025
- Zhou et al., "Instruction-Following Evaluation for Large Language Models," arXiv:2311.07911, 2023
### Repositories & Tools
- [BFCL](https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard)
- [SWE-bench](https://github.com/SWE-bench/SWE-bench)
- [AgentBench](https://github.com/THUDM/AgentBench)
- [EvalPlus](https://github.com/evalplus/evalplus)
- [BigCodeBench](https://github.com/bigcode-project/bigcodebench)
- [Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai)
- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
- [smolagents](https://github.com/huggingface/smolagents)
- [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent)
- [OpenHands](https://github.com/OpenHands/OpenHands)
- [Qwen3-Coder-30B-A3B GGUF](https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF)
- [Qwen3-Coder-Next GGUF](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF)
- [DeepPlanning Dataset](https://huggingface.co/datasets/Qwen/DeepPlanning)
### Strix Halo Benchmarks
- [Strix Halo GPU LLM Performance Tests (Framework Community)](https://community.frame.work/t/amd-strix-halo-ryzen-ai-max-395-gpu-llm-performance-tests/72521)
- [Strix Halo Benchmark Results (Level1Techs)](https://forum.level1techs.com/t/strix-halo-ryzen-ai-max-395-llm-benchmark-results/233796)
- [Strix Halo Toolboxes Benchmarks](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [Qwen3-Coder-30B Strix Halo Benchmark](https://github.com/pablo-ross/strix-halo-gmktec-evo-x2/blob/main/QWEN3-CODER-30B_BENCHMARK.md)
- [LLM Tracker - Strix Halo](https://llm-tracker.info/AMD-Strix-Halo-(Ryzen-AI-Max+-395)-GPU-Performance)
### Guides
- [Qwen3-Coder-Next Local Guide (DEV Community, 2026)](https://dev.to/sienna/qwen3-coder-next-the-complete-2026-guide-to-running-powerful-ai-coding-agents-locally-1k95)
- [Qwen3-Coder Local Setup (Unsloth)](https://unsloth.ai/docs/models/tutorials/qwen3-coder-how-to-run-locally)
- [Qwen llama.cpp Documentation](https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html)
- [smolagents + Ollama (Medium)](https://medium.com/@abonia/building-practical-local-ai-agents-with-smolagents-ollama-f92900c51897)
- [Inspect AI Documentation](https://inspect.aisi.org.uk/)

@@ -0,0 +1,488 @@
# Qwen 3.5 Model Family: Research Summary for Strix Halo (64GB)
**Date**: 2026-03-26
**Target Hardware**: AMD Ryzen AI MAX+ 395 / Radeon 8060S (gfx1151), 64 GB unified LPDDR5x, Fedora 43
**Focus**: GGUF quantized models for llama.cpp inference
---
## Scope
This report covers the Qwen3.5 model family (released February-March 2026) with emphasis
on GGUF quantization options, file sizes, memory fit analysis for 64GB unified memory,
GGUF quantizer comparison (Unsloth vs bartowski vs others), Unsloth Studio capabilities,
and LM Studio backend support on AMD Strix Halo. Out of scope: cloud API pricing,
full-precision training, non-GGUF formats (AWQ, GPTQ, EXL2).
---
## 1. Qwen3.5 Model Family Overview
Released mid-February 2026 (medium/large) and March 2, 2026 (small), licensed Apache 2.0.
All models share the Gated DeltaNet hybrid architecture: a 3:1 ratio of linear attention
(Gated DeltaNet) to full softmax attention blocks. Native 262K context window, extensible
to 1,010,000 tokens via YaRN scaling. Supports 201 languages. Native multimodal
(vision+language). Thinking/non-thinking hybrid mode.
| Model | Type | Total Params | Active Params | Architecture |
|-------|------|-------------|---------------|--------------|
| Qwen3.5-397B-A17B | MoE | 397B | 17B | 256 experts, 8 routed + 1 shared |
| Qwen3.5-122B-A10B | MoE | 122B | 10B | 256 experts, 8 routed + 1 shared |
| **Qwen3.5-35B-A3B** | **MoE** | **35B** | **3B** | **256 experts, 8 routed + 1 shared** |
| **Qwen3.5-27B** | **Dense** | **27B** | **27B** | **Full activation** |
| Qwen3.5-9B | Dense | 9B | 9B | Gated DeltaNet hybrid |
| Qwen3.5-4B | Dense | 4B | 4B | Gated DeltaNet hybrid |
| Qwen3.5-2B | Dense | 2B | 2B | Gated DeltaNet hybrid |
| Qwen3.5-0.8B | Dense | 0.8B | 0.8B | Gated DeltaNet hybrid |
---
## 2. Qwen3.5-35B-A3B (MoE) -- Detailed Analysis
### Architecture Specs
- Hidden dimension: 2048
- Token embedding: 248,320 (padded)
- Layers: 40
- Hidden layout: 10 x (3 x (Gated DeltaNet -> MoE) -> 1 x (Gated Attention -> MoE))
- MoE: 256 total experts, 8 routed + 1 shared active, expert intermediate dim 512
- Linear attention heads: 32 (V), 16 (QK), head dim 128
- Gated attention heads: 16 (Q), 2 (KV), head dim 256
- BF16 model size: 69.4 GB
### GGUF Quantizations (Unsloth)
Source: [unsloth/Qwen3.5-35B-A3B-GGUF](https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF)
Updated March 5, 2026 with improved imatrix data.
| Quantization | Size (GB) | Fits 64GB? | Notes |
|-------------|-----------|------------|-------|
| UD-IQ2_XXS | 10.7 | Yes | Ultra-compressed, quality loss |
| UD-IQ2_M | 11.4 | Yes | |
| UD-Q2_K_XL | 12.2 | Yes | |
| UD-IQ3_XXS | 13.1 | Yes | |
| UD-IQ3_S | 13.6 | Yes | |
| Q3_K_S | 15.3 | Yes | |
| Q3_K_M | 16.4 | Yes | |
| UD-Q3_K_XL | 16.6 | Yes | |
| UD-IQ4_XS | 17.5 | Yes | |
| UD-IQ4_NL | 17.8 | Yes | |
| Q4_K_S | 20.7 | Yes | |
| MXFP4_MOE | 21.6 | Yes | MoE-optimized mixed precision |
| Q4_K_M | 22.0 | **Yes** | **Recommended sweet spot** |
| UD-Q4_K_XL | 22.2 | Yes | Dynamic 2.0, best 4-bit |
| Q5_K_S | 24.8 | Yes | |
| Q5_K_M | 26.2 | Yes | |
| UD-Q5_K_XL | 26.4 | Yes | |
| UD-Q6_K_S | 28.5 | Yes | |
| Q6_K | 28.9 | Yes | |
| UD-Q6_K_XL | 32.1 | Yes | |
| Q8_0 | 36.9 | Yes | High quality, fits with room |
| UD-Q8_K_XL | 48.7 | Yes* | Tight -- ~15GB for KV cache |
| BF16 | 69.4 | **No** | Exceeds 64GB |
**Key finding**: Every quantization except BF16 fits in 64GB. Even Q8_0 at 36.9 GB
leaves ~27 GB for KV cache and OS overhead, which is excellent. The MoE architecture
(only 3B active params) means token generation is fast relative to total model size.
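The KV-cache side of the fit can be estimated from the architecture specs above: only the 10 gated-attention layers keep a per-token KV cache (the Gated DeltaNet layers carry roughly constant-size recurrent state). A back-of-the-envelope sketch, assuming an FP16 cache and a hypothetical ~6 GB OS/runtime overhead; actual llama.cpp accounting will differ somewhat:

```python
def kv_cache_gb(ctx_len: int, n_attn_layers: int = 10, n_kv_heads: int = 2,
                head_dim: int = 256, bytes_per_elem: int = 2) -> float:
    """FP16 K+V cache size for the full-attention layers only
    (defaults taken from the Qwen3.5-35B-A3B specs above)."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

def fits_64gb(model_file_gb: float, ctx_len: int, os_overhead_gb: float = 6.0) -> bool:
    """Crude fit check: weights + KV cache + assumed OS/runtime overhead."""
    return model_file_gb + kv_cache_gb(ctx_len) + os_overhead_gb <= 64.0

kv_cache_gb(262_144)       # 5.0 -- full native context costs only ~5 GB of cache
fits_64gb(36.9, 262_144)   # True: Q8_0 fits even at full context
fits_64gb(48.7, 262_144)   # True, but tight, matching the table note
```

The hybrid architecture is what makes this so cheap: a conventional 40-layer dense-attention model with the same head geometry would need 4x the cache.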
### Benchmark Results (Official, from Model Card)
| Benchmark | Qwen3.5-35B-A3B | GPT-5-mini | Notes |
|-----------|-----------------|-----------|-------|
| MMLU-Pro | 85.3 | 83.7 | Outperforms |
| C-Eval | 90.2 | 82.2 | Outperforms |
| GPQA Diamond | 84.2 | 82.8 | Outperforms |
| SWE-bench Verified | 69.2 | 72.0 | Slightly behind |
| LiveCodeBench v6 | 74.6 | 80.5 | Behind on coding |
| MMMU (vision) | 81.4 | 79.0 | Outperforms |
| MathVision | 83.9 | 71.9 | Strongly outperforms |
| VideoMME (w/ sub.) | 86.6 | 83.5 | Outperforms |
### Strix Halo Performance Estimates
Based on Qwen3-30B-A3B benchmarks (similar architecture, predecessor):
| Backend | pp512 (t/s) | tg128 (t/s) | Context |
|---------|-------------|-------------|---------|
| Vulkan RADV | ~755 | ~85 | Short |
| Vulkan AMDVLK | ~742 | ~82 | Short |
| ROCm hipBLASlt | ~652 | ~64 | Short |
| ROCm rocWMMA (tuned) | ~659 | ~68 | Short |
| Vulkan RADV | ~17 | ~13 | 130K |
| ROCm hipBLASlt | ~40 | ~5 | 130K |
**Key insight**: Vulkan wins on short-context token generation. ROCm wins on
long-context prompt processing. For interactive chat (short-medium context),
Vulkan RADV is the best backend on Strix Halo.
---
## 3. Qwen3.5-27B (Dense) -- Detailed Analysis
Source: [unsloth/Qwen3.5-27B-GGUF](https://huggingface.co/unsloth/Qwen3.5-27B-GGUF)
The only dense (non-MoE) model in the medium range. All 27B parameters activate on
every forward pass, meaning slower token generation than 35B-A3B despite being
"smaller" in total params. BF16 size: 53.8 GB.
### GGUF Quantizations (Unsloth)
| Quantization | Size (GB) | Fits 64GB? | Notes |
|-------------|-----------|------------|-------|
| UD-IQ2_XXS | 8.57 | Yes | |
| UD-IQ2_M | 10.2 | Yes | |
| UD-Q2_K_XL | 11.2 | Yes | |
| UD-IQ3_XXS | 11.5 | Yes | |
| Q3_K_S | 12.3 | Yes | |
| Q3_K_M | 13.5 | Yes | |
| UD-Q3_K_XL | 14.4 | Yes | |
| IQ4_XS | 15.0 | Yes | |
| Q4_0 | 15.7 | Yes | |
| IQ4_NL | 15.7 | Yes | |
| Q4_K_S | 15.8 | Yes | |
| Q4_K_M | 16.7 | **Yes** | **Recommended** |
| Q4_1 | 17.2 | Yes | |
| UD-Q4_K_XL | 17.6 | Yes | Dynamic 2.0 |
| Q5_K_S | 18.9 | Yes | |
| Q5_K_M | 19.6 | Yes | |
| UD-Q5_K_XL | 20.2 | Yes | |
| Q6_K | 22.5 | Yes | |
| UD-Q6_K_XL | 25.7 | Yes | |
| Q8_0 | 28.6 | Yes | Plenty of room |
| UD-Q8_K_XL | 35.5 | Yes | Good quality + headroom |
| BF16 | 53.8 | Yes* | Tight -- only ~10GB for KV cache |
**Key finding**: All quantizations fit in 64GB, including BF16 (barely). However,
because this is a dense model with 27B active params, token generation will be
significantly slower than 35B-A3B (which only activates 3B). For interactive use on
Strix Halo, the 35B-A3B MoE is likely the better choice despite being larger on disk.
### 35B-A3B vs 27B: Which to Run?
| Factor | 35B-A3B (MoE) | 27B (Dense) |
|--------|---------------|-------------|
| Active params | 3B | 27B |
| Token gen speed | ~85 t/s (Vulkan) | ~10-15 t/s (estimated) |
| Quality (MMLU-Pro) | 85.3 | Comparable |
| Memory (Q4_K_M) | 22.0 GB | 16.7 GB |
| Memory (Q8_0) | 36.9 GB | 28.6 GB |
| Best for | Interactive chat, speed | Batch processing, quality |
**Recommendation**: For interactive inference on 64GB Strix Halo, strongly prefer
Qwen3.5-35B-A3B. The MoE architecture is ideal for unified memory systems since
only 3B params are active per token, yielding much faster generation despite the
larger total weight file.
---
## 4. Qwen3.5-122B-A10B (MoE) -- Stretch Goal
Source: [unsloth/Qwen3.5-122B-A10B-GGUF](https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF)
BF16 size: 244 GB. This is the next tier up from 35B-A3B.
### Quantizations That Fit 64GB
| Quantization | Size (GB) | Fit? | Notes |
|-------------|-----------|------|-------|
| UD-IQ1_M | 34.2 | Yes | 1-bit, quality concerns |
| UD-IQ2_XXS | 36.6 | Yes | Very compressed |
| UD-IQ2_M | 39.1 | Yes | |
| UD-Q2_K_XL | 41.8 | Yes | |
| UD-IQ3_XXS | 44.7 | Yes | |
| UD-IQ3_S | 46.6 | Yes* | Tight with KV cache |
| Q3_K_S | 52.5 | Marginal | Very little KV headroom |
| Q3_K_M | 56.4 | No | Leaves <8GB for everything else |
| Q4_K_M+ | 76.5+ | No | Does not fit |
**Warning**: Q3-level quantization of 122B has been reported to produce garbled output,
infinite repetition, and failures on tool calls and code generation. The UD-Q2_K_XL
(41.8 GB) is the recommended minimum viable quantization.
**Verdict**: Possible at 2-bit, but risky. Quality at IQ2 level on a 122B MoE model is
largely untested for production use. The 35B-A3B at Q8_0 (36.9 GB) is likely higher
quality than 122B at IQ2 (36.6 GB) and much safer. Not recommended for 64GB systems
unless you specifically need the 10B active parameter count.
---
## 5. Qwen3.5 Small Models (Worth Benchmarking)
### Qwen3.5-9B
The standout small model. Outperforms models 3-13x its size:
- GPQA Diamond: 81.7 (vs GPT-OSS-120B: 71.5)
- HMMT Feb 2025: 83.2
- MMMU-Pro: 70.1 (beats Gemini 2.5 Flash-Lite at 59.7)
At Q4_K_M, the 9B model needs roughly 6-7 GB. Runs comfortably on any hardware.
Useful as a draft model for speculative decoding with the 35B-A3B.
### Qwen3.5-4B
Performance close to the previous Qwen3-80B-A3B (20x larger). Excellent for
on-device/edge tasks. ~3 GB at Q4_K_M.
---
## 6. Best GGUF Quantizers: Unsloth vs bartowski vs Others
### Providers Compared
| Provider | Approach | Strengths |
|----------|----------|-----------|
| **Unsloth** | Dynamic 2.0: per-layer adaptive quantization, 1.5M+ token calibration dataset | Best at low bit-rates (Q2, Q3), model-specific tuning, fast updates |
| **bartowski** | Custom imatrix calibration, upstream llama.cpp PR for improved tensor recipes | Lower KLD at Q4_K_M in some tests, stable quality |
| **noctrex** | MXFP4 for MoE experts + Q8/BF16 for rest | Specialized for MoE models |
| **ubergarm** | Standard llama.cpp quantization | Reliable baseline |
| **AesSedai** | imatrix-based | Good coverage, sometimes outperformed by Unsloth Dynamic |
| **mradermacher** | Mass-produced quants across many models | Broad coverage, less specialized |
### Head-to-Head: Unsloth vs bartowski
On standard KLD benchmarks (Qwen QwQ-32B comparison; lower KLD is better):
- bartowski Q4_K_M: 0.0087 KLD
- Unsloth Q4_K_M: 0.0222 KLD
- bartowski IQ4_XS: 0.0127 KLD at 4.93 GiB
However, on real-world task evaluations (LiveCodeBench v6, MMLU Pro), Unsloth Dynamic
IQ2_XXS outperformed AesSedai IQ3_S despite being 11GB smaller -- demonstrating that
KLD/perplexity alone do not predict task performance.
### Recommendation
- **Q4 and above**: bartowski and Unsloth are both excellent. bartowski may have slightly
lower KLD at Q4_K_M. Either is a safe choice.
- **Q3 and below**: Unsloth Dynamic 2.0 (UD- prefix) is the clear winner. The per-layer
adaptive approach preserves critical layers at higher precision.
- **MoE-specific**: noctrex MXFP4_MOE is worth testing if you want pure MoE-optimized
quantization.
- **Overall**: For Qwen3.5-35B-A3B, use **Unsloth UD-Q4_K_XL** (22.2 GB) or
**Q8_0** (36.9 GB) for maximum quality. For bartowski, use their Q4_K_M.
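Pulling just the recommended quant can be done with `huggingface-cli` (the `--include` glob below is illustrative and depends on the repo's file naming):

```bash
# Fetch only the UD-Q4_K_XL file(s) from the repo recommended above
huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF \
  --include "*UD-Q4_K_XL*" --local-dir ./models
```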
### imatrix Note
All modern GGUF quantizers now use imatrix (importance matrix) calibration. This adds
a one-time cost during quantization but significantly improves quality at low bit-rates.
The calibration dataset matters: Unsloth uses 1.5M+ hand-curated tokens; bartowski uses
different calibration texts optimized for different use cases.
---
## 7. Unsloth Studio
### What It Is
Unsloth Studio is an open-source, no-code web UI for training and running LLMs locally.
Released March 17, 2026 (beta). Dual-licensed: Apache 2.0 (core) + AGPL-3.0 (UI).
### Installation
```bash
# macOS, Linux, WSL
curl -fsSL https://unsloth.ai/install.sh | sh
# Launch
unsloth studio -H 0.0.0.0 -p 8888
```
### Capabilities
| Feature | Details |
|---------|---------|
| **Inference** | Run GGUF and safetensors models with tool-calling, web search, OpenAI-compatible API |
| **Fine-tuning** | SFT, GRPO (RL), 500+ models, 2x faster, 70% less VRAM |
| **Data Recipes** | Auto-create datasets from PDF, CSV, JSON, DOCX, TXT |
| **Model Arena** | Side-by-side comparison of two models |
| **Export** | Save to GGUF or safetensors |
| **Multimodal** | Text, vision, TTS audio, embedding models |
### Platform Support
| Platform | Inference | Training |
|----------|-----------|----------|
| Linux (NVIDIA) | Yes | Yes |
| Linux (AMD) | Yes | Coming soon |
| Linux (CPU) | Yes | No |
| macOS | Yes (CPU only) | Coming (MLX) |
| Windows | Yes | Yes |
### Relevance for Strix Halo
Unsloth Studio provides inference via llama.cpp backend, so it should work on Strix Halo
for **running** models. Training requires NVIDIA or Intel GPUs currently, so fine-tuning
is not yet supported on AMD. The inference component is essentially a nice web UI wrapper
around llama.cpp, similar to LM Studio but with integrated training capabilities.
**Verdict**: Useful for inference on Strix Halo. Not yet useful for training on AMD.
If you only need inference, LM Studio or raw llama.cpp may be simpler. If you want
training + inference in one tool (when AMD support arrives), Unsloth Studio is worth
watching.
---
## 8. LM Studio on AMD Strix Halo
### Backend Status
| Backend | Status | Notes |
|---------|--------|-------|
| **Vulkan** | **Working, recommended** | Best for general inference, no special config needed |
| ROCm | Partially broken | gfx1151 declared supported but data files missing, crashes on inference |
| CPU | Working | Slow fallback |
### Vulkan Configuration
LM Studio with Vulkan is the most reliable path on Strix Halo:
```json
{
"llm.gpu.backend": "vulkan",
"llm.gpu.device": "auto",
"llm.gpu.layers": -1
}
```
Verify GPU detection: `vulkaninfo | grep "GPU id"`
An automated installer exists: [smarttechlabs-projects/strix-halo-lmstudio](https://github.com/smarttechlabs-projects/strix-halo-lmstudio)
### Performance Expectations (LM Studio / Vulkan, 128GB system)
| Model Size | Quant | Throughput |
|-----------|-------|-----------|
| 7B | Q4 | 30-40 t/s |
| 13B | Q4 | 20-30 t/s |
| 30B MoE | Q4 | ~50+ t/s (MoE advantage) |
| 70B | Q4 | 5-8 t/s |
For a 64GB system, expect similar per-token speeds but with lower maximum context
lengths before memory pressure kicks in.
### ROCm Status and Future
AMD's Ryzen AI Halo Mini PC (Q2 2026) will ship with ROCm 7.2.2 optimization for
LM Studio. As of January 2026, stable ROCm+Linux configurations exist for Strix Halo
(documented at Framework Community). The gfx1151 ROCm issue in LM Studio specifically
is a packaging problem (missing data files), not a fundamental incompatibility.
For now: use **Vulkan for short-medium context**, or build **llama.cpp from source
with ROCm** for long-context workloads (where Flash Attention matters).
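The from-source ROCm path can be sketched as follows (CMake flag names per upstream llama.cpp HIP build docs; verify against your checkout and ROCm version):

```bash
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B llama.cpp/build \
  -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
cmake --build llama.cpp/build -j
```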
### LM Studio Unsloth Dynamic 2.0 Note
There was a reported issue (GitHub #1594) where Unsloth Dynamic 2.0 (UD-) GGUF variants
were not shown in LM Studio's download options. Verify that LM Studio is updated to
the latest version, or download the GGUF files manually from HuggingFace and load
them directly.
---
## 9. Recommended Configurations for 64GB Strix Halo
### Primary: Qwen3.5-35B-A3B (MoE)
| Use Case | Quantization | Size | KV Budget | Context Est. |
|----------|-------------|------|-----------|-------------|
| Maximum quality | Q8_0 | 36.9 GB | ~25 GB | ~32K-65K |
| Best balance | UD-Q4_K_XL | 22.2 GB | ~40 GB | ~65K-131K |
| Maximum context | UD-IQ3_XXS | 13.1 GB | ~49 GB | ~131K+ |
| Speed test | Q4_K_M | 22.0 GB | ~40 GB | ~65K-131K |
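As one concrete starting point, the "best balance" row could be launched like this (file name and flags are illustrative; tune `-c` to your measured headroom, and note `--flash-attn` may need the ROCm build to help):

```bash
llama-server -m models/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  -c 65536 -ngl 99 --flash-attn \
  --host 127.0.0.1 --port 8080
```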
### Secondary: Qwen3.5-27B (Dense)
| Use Case | Quantization | Size | KV Budget | Notes |
|----------|-------------|------|-----------|-------|
| Quality comparison | Q8_0 | 28.6 GB | ~33 GB | Slower gen than 35B-A3B |
| Balanced | Q4_K_M | 16.7 GB | ~45 GB | |
### Quick Reference: Qwen3.5-9B (Small/Draft)
| Use Case | Quantization | Size |
|----------|-------------|------|
| Speculative decoding draft | Q4_K_M | ~6 GB |
| Standalone small model | Q8_0 | ~10 GB |
---
## 10. Sampling Parameters (Official Recommendations)
### Thinking Mode (General)
- Temperature: 1.0
- Top-p: 0.95
- Top-k: 20
- Min-p: 0.0
- Presence penalty: 1.5
- Max output: 32,768 tokens (general) or 81,920 (math/coding)
### Thinking Mode (Coding)
- Temperature: 0.6
- Top-p: 0.95
- Top-k: 20
- Presence penalty: 0.0
### Non-Thinking / Instruct Mode
- Temperature: 0.7
- Top-p: 0.8
- Top-k: 20
- Presence penalty: 1.5
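Applied through an OpenAI-compatible endpoint, the instruct-mode settings look like this (URL and model name are placeholders; `top_k` is a non-standard field that llama.cpp's server accepts):

```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b-a3b",
    "temperature": 0.7, "top_p": 0.8, "top_k": 20,
    "presence_penalty": 1.5,
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```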
### Best Practices
- Maintain minimum 128K context to preserve thinking capabilities
- Exclude thinking content from multi-turn conversation history
- For math: "Please reason step by step, and put your final answer within \boxed{}."
- For multiple choice: request JSON output like {"answer": "C"}
---
## 11. Open Questions / Limitations
1. **Qwen3.5 on gfx1151 ROCm**: LM Studio's ROCm backend crashes on Strix Halo due
to missing gfx1151 data files. Building llama.cpp from source with ROCm 7.x works
but requires manual setup.
2. **Vulkan long-context degradation**: Vulkan performance drops significantly beyond
~4K context on Strix Halo. ROCm with Flash Attention is needed for long-context
workloads, creating a backend choice dilemma.
3. **Quantizer quality debate**: KLD and perplexity metrics do not always predict
real-world task performance. The "best" quantizer depends on the specific use case.
More task-based evaluation is needed.
4. **122B-A10B viability at 64GB**: Only fits at 2-bit or aggressive 3-bit. Quality
at these compression levels for a 122B MoE is not well-characterized.
5. **Unsloth Studio AMD training**: Not yet supported. Timeline unclear ("coming soon").
6. **Multi-token Prediction (MTP)**: Qwen3.5 supports MTP for faster generation, but
llama.cpp support status for this feature on the MoE variants needs verification.
7. **Speculative decoding**: Qwen3.5-9B as a draft model for 35B-A3B has been discussed
but needs benchmarking on Strix Halo specifically.
---
## Sources
- [Qwen/Qwen3.5-35B-A3B Model Card](https://huggingface.co/Qwen/Qwen3.5-35B-A3B)
- [QwenLM/Qwen3.5 GitHub](https://github.com/QwenLM/Qwen3.5)
- [unsloth/Qwen3.5-35B-A3B-GGUF](https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF)
- [unsloth/Qwen3.5-27B-GGUF](https://huggingface.co/unsloth/Qwen3.5-27B-GGUF)
- [unsloth/Qwen3.5-122B-A10B-GGUF](https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF)
- [bartowski/Qwen_Qwen3.5-35B-A3B-GGUF](https://huggingface.co/bartowski/Qwen_Qwen3.5-35B-A3B-GGUF)
- [bartowski/Qwen_Qwen3.5-27B-GGUF](https://huggingface.co/bartowski/Qwen_Qwen3.5-27B-GGUF)
- [noctrex/Qwen3.5-35B-A3B-MXFP4_MOE-GGUF](https://huggingface.co/noctrex/Qwen3.5-35B-A3B-MXFP4_MOE-GGUF)
- [Unsloth Dynamic 2.0 GGUFs Documentation](https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs)
- [Qwen3.5 GGUF Benchmarks (Unsloth)](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks)
- [Unsloth Studio Documentation](https://unsloth.ai/docs/new/studio)
- [Qwen3.5 Local Running Guide (Unsloth)](https://unsloth.ai/docs/models/qwen3.5)
- [Summary of Qwen3.5 GGUF Evaluations (kaitchup)](https://kaitchup.substack.com/p/summary-of-qwen35-gguf-evaluations)
- [LM Studio Vulkan on Strix Halo (SmartTechLabs)](https://www.smarttechlabs.de/blog/2026-01-14-lmstudio-strix-halo/)
- [LM Studio on Ryzen AI](https://lmstudio.ai/ryzenai)
- [Strix Halo llama.cpp Performance Wiki](https://strixhalo.wiki/AI/llamacpp-performance)
- [AMD Strix Halo Backend Benchmarks](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [Strix Halo LLM Optimization (hardware-corner.net)](https://www.hardware-corner.net/strix-halo-llm-optimization/)
- [Qwen3.5 Small Models (Artificial Analysis)](https://artificialanalysis.ai/articles/qwen3-5-small-models)
- [Qwen 3.5 9B Beats 120B Models (VentureBeat)](https://venturebeat.com/technology/alibabas-small-open-source-qwen3-5-9b-beats-openais-gpt-oss-120b-and-can-run)
- [AMD ROCm 7 Strix Halo Performance (Phoronix)](https://www.phoronix.com/review/amd-rocm-7-strix-halo/4)
- [Qwen3.5 Blog (qwen.ai)](https://qwen.ai/blog?id=qwen3.5)

View File

@@ -43,6 +43,24 @@ The most comprehensive community resource for Strix Halo LLM optimization.
- [vLLM](https://github.com/vllm-project/vllm) — High-throughput serving
- [llama-benchy](https://github.com/eugr/llama-benchy) — Multi-backend LLM benchmarking
## Qwen3.5 Models (GGUF)
- [unsloth/Qwen3.5-35B-A3B-GGUF](https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF) — Top pick for 64GB Strix Halo (MoE, 3B active)
- [unsloth/Qwen3.5-27B-GGUF](https://huggingface.co/unsloth/Qwen3.5-27B-GGUF) — Dense 27B
- [unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF](https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF) — Best for agentic/coding
- [Qwen3.5 Official](https://github.com/QwenLM/Qwen3.5) — Model family overview
- [Unsloth Dynamic 2.0](https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs) — Adaptive quantization methodology
- [Unsloth Studio](https://unsloth.ai/docs/new/studio) — Training + inference UI (beta)
## Agentic Evaluation
- [Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai) — All-in-one eval framework (HumanEval, BFCL, IFEval, GAIA)
- [EvalPlus](https://github.com/evalplus/evalplus) — HumanEval+ / MBPP+ with native ollama support
- [BigCodeBench](https://github.com/bigcode-project/bigcodebench) — 1,140 coding tasks across 139 libraries
- [BFCL](https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard) — Berkeley Function Calling Leaderboard
- [SWE-bench](https://github.com/princeton-nlp/SWE-bench) — Real GitHub issue resolution
- [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) — Optimized agentic framework for Qwen models
## AMD GPU Profiling
- [Radeon GPU Profiler (RGP)](https://gpuopen.com/rgp/) — Hardware-level Vulkan/HIP profiling