Qwen 3.5 Model Family: Research Summary for Strix Halo (64GB)
Date: 2026-03-26
Target Hardware: AMD Ryzen AI MAX+ 395 / Radeon 8060S (gfx1151), 64 GB unified LPDDR5x, Fedora 43
Focus: GGUF quantized models for llama.cpp inference
Scope
This report covers the Qwen3.5 model family (released February-March 2026) with emphasis on GGUF quantization options, file sizes, memory fit analysis for 64GB unified memory, GGUF quantizer comparison (Unsloth vs bartowski vs others), Unsloth Studio capabilities, and LM Studio backend support on AMD Strix Halo. Out of scope: cloud API pricing, full-precision training, non-GGUF formats (AWQ, GPTQ, EXL2).
1. Qwen3.5 Model Family Overview
Released mid-February 2026 (medium/large) and March 2, 2026 (small), licensed Apache 2.0. All models share the Gated DeltaNet hybrid architecture: a 3:1 ratio of linear attention (Gated DeltaNet) to full softmax attention blocks. Native 262K context window, extensible to 1,010,000 tokens via YaRN scaling. Supports 201 languages. Native multimodal (vision+language). Thinking/non-thinking hybrid mode.
| Model | Type | Total Params | Active Params | Architecture |
|---|---|---|---|---|
| Qwen3.5-397B-A17B | MoE | 397B | 17B | 256 experts, 8 routed + 1 shared |
| Qwen3.5-122B-A10B | MoE | 122B | 10B | 256 experts, 8 routed + 1 shared |
| Qwen3.5-35B-A3B | MoE | 35B | 3B | 256 experts, 8 routed + 1 shared |
| Qwen3.5-27B | Dense | 27B | 27B | Full activation |
| Qwen3.5-9B | Dense | 9B | 9B | Gated DeltaNet hybrid |
| Qwen3.5-4B | Dense | 4B | 4B | Gated DeltaNet hybrid |
| Qwen3.5-2B | Dense | 2B | 2B | Gated DeltaNet hybrid |
| Qwen3.5-0.8B | Dense | 0.8B | 0.8B | Gated DeltaNet hybrid |
2. Qwen3.5-35B-A3B (MoE) -- Detailed Analysis
Architecture Specs
- Hidden dimension: 2048
- Token embedding: 248,320 (padded)
- Layers: 40
- Hidden layout: 10 x (3 x (Gated DeltaNet -> MoE) -> 1 x (Gated Attention -> MoE))
- MoE: 256 total experts, 8 routed + 1 shared active, expert intermediate dim 512
- Linear attention heads: 32 (V), 16 (QK), head dim 128
- Gated attention heads: 16 (Q), 2 (KV), head dim 256
- BF16 model size: 69.4 GB
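The 40-layer hybrid layout above can be sketched as a simple pattern generator. This is an illustration of the stated 3:1 layout, not code from the Qwen repository; the layer-type names are descriptive labels, not actual module names:

```python
def layer_pattern(blocks: int = 10, linear_per_block: int = 3) -> list[str]:
    """Per-layer attention-type sequence described above: each block is
    3 Gated DeltaNet (linear attention) layers followed by 1 full
    gated-attention layer; every layer is followed by an MoE FFN."""
    pattern: list[str] = []
    for _ in range(blocks):
        pattern += ["gated_deltanet"] * linear_per_block
        pattern.append("gated_attention")
    return pattern

layers = layer_pattern()
assert len(layers) == 40                      # matches the 40-layer spec
assert layers.count("gated_attention") == 10  # 3:1 linear-to-full ratio
```

The 3:1 ratio is what keeps KV-cache growth modest at long context: only the 10 full-attention layers accumulate a conventional KV cache.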
GGUF Quantizations (Unsloth)
Source: unsloth/Qwen3.5-35B-A3B-GGUF. Updated March 5, 2026 with improved imatrix data.
| Quantization | Size (GB) | Fits 64GB? | Notes |
|---|---|---|---|
| UD-IQ2_XXS | 10.7 | Yes | Ultra-compressed, quality loss |
| UD-IQ2_M | 11.4 | Yes | |
| UD-Q2_K_XL | 12.2 | Yes | |
| UD-IQ3_XXS | 13.1 | Yes | |
| UD-IQ3_S | 13.6 | Yes | |
| Q3_K_S | 15.3 | Yes | |
| Q3_K_M | 16.4 | Yes | |
| UD-Q3_K_XL | 16.6 | Yes | |
| UD-IQ4_XS | 17.5 | Yes | |
| UD-IQ4_NL | 17.8 | Yes | |
| Q4_K_S | 20.7 | Yes | |
| MXFP4_MOE | 21.6 | Yes | MoE-optimized mixed precision |
| Q4_K_M | 22.0 | Yes | Recommended sweet spot |
| UD-Q4_K_XL | 22.2 | Yes | Dynamic 2.0, best 4-bit |
| Q5_K_S | 24.8 | Yes | |
| Q5_K_M | 26.2 | Yes | |
| UD-Q5_K_XL | 26.4 | Yes | |
| UD-Q6_K_S | 28.5 | Yes | |
| Q6_K | 28.9 | Yes | |
| UD-Q6_K_XL | 32.1 | Yes | |
| Q8_0 | 36.9 | Yes | High quality, fits with room |
| UD-Q8_K_XL | 48.7 | Yes* | Tight -- ~15GB for KV cache |
| BF16 | 69.4 | No | Exceeds 64GB |
Key finding: Every quantization except BF16 fits in 64GB. Even Q8_0 at 36.9 GB leaves ~27 GB for KV cache and OS overhead, which is excellent. The MoE architecture (only 3B active params) means token generation is fast relative to total model size.
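The fit analysis above reduces to simple subtraction. A minimal sketch, where the 4 GB OS reserve is an assumption to tune for your own system (a heavier desktop session needs more):

```python
TOTAL_GB = 64.0
OS_RESERVE_GB = 4.0  # assumed headroom for OS, compositor, other apps

def kv_budget_gb(model_size_gb: float) -> float:
    """Memory left for KV cache and activations after loading weights."""
    return TOTAL_GB - OS_RESERVE_GB - model_size_gb

print(round(kv_budget_gb(36.9), 1))  # Q8_0
print(round(kv_budget_gb(48.7), 1))  # UD-Q8_K_XL, the tight case
```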
Benchmark Results (Official, from Model Card)
| Benchmark | Qwen3.5-35B-A3B | GPT-5-mini | Notes |
|---|---|---|---|
| MMLU-Pro | 85.3 | 83.7 | Outperforms |
| C-Eval | 90.2 | 82.2 | Outperforms |
| GPQA Diamond | 84.2 | 82.8 | Outperforms |
| SWE-bench Verified | 69.2 | 72.0 | Slightly behind |
| LiveCodeBench v6 | 74.6 | 80.5 | Behind on coding |
| MMMU (vision) | 81.4 | 79.0 | Outperforms |
| MathVision | 83.9 | 71.9 | Strongly outperforms |
| VideoMME (w/ sub.) | 86.6 | 83.5 | Outperforms |
Strix Halo Performance Estimates
Based on Qwen3-30B-A3B benchmarks (similar architecture, predecessor):
| Backend | pp512 (t/s) | tg128 (t/s) | Context |
|---|---|---|---|
| Vulkan RADV | ~755 | ~85 | Short |
| Vulkan AMDVLK | ~742 | ~82 | Short |
| ROCm hipBLASlt | ~652 | ~64 | Short |
| ROCm rocWMMA (tuned) | ~659 | ~68 | Short |
| Vulkan RADV | ~17 | ~13 | 130K |
| ROCm hipBLASlt | ~40 | ~5 | 130K |
Key insight: Vulkan wins on short-context token generation. ROCm wins on long-context prompt processing. For interactive chat (short-medium context), Vulkan RADV is the best backend on Strix Halo.
3. Qwen3.5-27B (Dense) -- Detailed Analysis
Source: unsloth/Qwen3.5-27B-GGUF
The only dense (non-MoE) model in the medium range. All 27B parameters activate on every forward pass, meaning slower token generation than 35B-A3B despite being "smaller" in total params. BF16 size: 53.8 GB.
GGUF Quantizations (Unsloth)
| Quantization | Size (GB) | Fits 64GB? | Notes |
|---|---|---|---|
| UD-IQ2_XXS | 8.57 | Yes | |
| UD-IQ2_M | 10.2 | Yes | |
| UD-Q2_K_XL | 11.2 | Yes | |
| UD-IQ3_XXS | 11.5 | Yes | |
| Q3_K_S | 12.3 | Yes | |
| Q3_K_M | 13.5 | Yes | |
| UD-Q3_K_XL | 14.4 | Yes | |
| IQ4_XS | 15.0 | Yes | |
| Q4_0 | 15.7 | Yes | |
| IQ4_NL | 15.7 | Yes | |
| Q4_K_S | 15.8 | Yes | |
| Q4_K_M | 16.7 | Yes | Recommended |
| Q4_1 | 17.2 | Yes | |
| UD-Q4_K_XL | 17.6 | Yes | Dynamic 2.0 |
| Q5_K_S | 18.9 | Yes | |
| Q5_K_M | 19.6 | Yes | |
| UD-Q5_K_XL | 20.2 | Yes | |
| Q6_K | 22.5 | Yes | |
| UD-Q6_K_XL | 25.7 | Yes | |
| Q8_0 | 28.6 | Yes | Plenty of room |
| UD-Q8_K_XL | 35.5 | Yes | Good quality + headroom |
| BF16 | 53.8 | Yes* | Tight -- only ~10GB for KV cache |
Key finding: All quantizations fit in 64GB, including BF16 (barely). However, because this is a dense model with 27B active params, token generation will be significantly slower than 35B-A3B (which only activates 3B). For interactive use on Strix Halo, the 35B-A3B MoE is likely the better choice despite being larger on disk.
35B-A3B vs 27B: Which to Run?
| Factor | 35B-A3B (MoE) | 27B (Dense) |
|---|---|---|
| Active params | 3B | 27B |
| Token gen speed | ~85 t/s (Vulkan) | ~10-15 t/s (estimated) |
| Quality (MMLU-Pro) | 85.3 | Comparable |
| Memory (Q4_K_M) | 22.0 GB | 16.7 GB |
| Memory (Q8_0) | 36.9 GB | 28.6 GB |
| Best for | Interactive chat, speed | Batch processing, quality |
Recommendation: For interactive inference on 64GB Strix Halo, strongly prefer Qwen3.5-35B-A3B. The MoE architecture is ideal for unified memory systems since only 3B params are active per token, yielding much faster generation despite the larger total weight file.
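The speed gap follows from memory bandwidth: token generation is roughly bound by how many weight bytes must be read per token. A back-of-envelope sketch, where the ~256 GB/s LPDDR5x figure and the linear active-fraction scaling are assumptions (real throughput is lower due to KV reads, shared-expert traffic, routing, and kernel overhead):

```python
MEM_BW_GBPS = 256.0  # assumed effective Strix Halo LPDDR5x bandwidth

def est_tg_tps(weights_read_per_token_gb: float) -> float:
    """Upper-bound tokens/s if generation were purely bandwidth-bound."""
    return MEM_BW_GBPS / weights_read_per_token_gb

dense_27b_q4 = 16.7           # dense: all weights touched every token
moe_35b_q4 = 22.0 * (3 / 35)  # MoE: only ~3B of 35B params active

print(round(est_tg_tps(dense_27b_q4)))  # dense ceiling, tens of t/s
print(round(est_tg_tps(moe_35b_q4)))    # MoE ceiling, far higher
```

The observed ~85 t/s for the MoE and the ~10-15 t/s estimate for the dense model both sit below these ceilings, as expected, but preserve the same ordering.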
4. Qwen3.5-122B-A10B (MoE) -- Stretch Goal
Source: unsloth/Qwen3.5-122B-A10B-GGUF
BF16 size: 244 GB. This is the next tier up from 35B-A3B.
Quantizations That Fit 64GB
| Quantization | Size (GB) | Fit? | Notes |
|---|---|---|---|
| UD-IQ1_M | 34.2 | Yes | 1-bit, quality concerns |
| UD-IQ2_XXS | 36.6 | Yes | Very compressed |
| UD-IQ2_M | 39.1 | Yes | |
| UD-Q2_K_XL | 41.8 | Yes | |
| UD-IQ3_XXS | 44.7 | Yes | |
| UD-IQ3_S | 46.6 | Yes* | Tight with KV cache |
| Q3_K_S | 52.5 | Marginal | Very little KV headroom |
| Q3_K_M | 56.4 | No | Leaves <8GB for everything else |
| Q4_K_M+ | 76.5+ | No | Does not fit |
Warning: Q3-level quantization of 122B has been reported to produce garbled output, infinite repetition, and failures on tool calls and code generation. The UD-Q2_K_XL (41.8 GB) is the recommended minimum viable quantization.
Verdict: Possible at 2-bit, but risky. Quality at IQ2 level on a 122B MoE model is largely untested for production use. The 35B-A3B at Q8_0 (36.9 GB) is likely higher quality than 122B at IQ2 (36.6 GB) and much safer. Not recommended for 64GB systems unless you specifically need the 10B active parameter count.
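A quick filter reproduces the table's fit column; the 8 GB minimum-headroom threshold is an arbitrary assumption chosen to mirror the "leaves <8GB" cutoff above:

```python
QUANTS_122B = {  # sizes in GB, from the table above
    "UD-IQ1_M": 34.2, "UD-IQ2_XXS": 36.6, "UD-IQ2_M": 39.1,
    "UD-Q2_K_XL": 41.8, "UD-IQ3_XXS": 44.7, "UD-IQ3_S": 46.6,
    "Q3_K_S": 52.5, "Q3_K_M": 56.4,
}

def viable(total_gb: float = 64.0, min_headroom_gb: float = 8.0) -> list[str]:
    """Quants leaving at least min_headroom_gb for KV cache + OS."""
    return [q for q, size in QUANTS_122B.items()
            if total_gb - size >= min_headroom_gb]

print(viable())  # Q3_K_M (56.4 GB) is the first to drop out
```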
5. Qwen3.5 Small Models (Worth Benchmarking)
Qwen3.5-9B
The standout small model. Outperforms models 3-13x its size:
- GPQA Diamond: 81.7 (vs GPT-OSS-120B: 71.5)
- HMMT Feb 2025: 83.2
- MMMU-Pro: 70.1 (beats Gemini 2.5 Flash-Lite at 59.7)
At Q4_K_M, the 9B model needs roughly 6-7 GB. Runs comfortably on any hardware. Useful as a draft model for speculative decoding with the 35B-A3B.
Qwen3.5-4B
Performance close to the previous Qwen3-80B-A3B (20x larger). Excellent for on-device/edge tasks. ~3 GB at Q4_K_M.
6. Best GGUF Quantizers: Unsloth vs bartowski vs Others
Providers Compared
| Provider | Approach | Strengths |
|---|---|---|
| Unsloth | Dynamic 2.0: per-layer adaptive quantization, 1.5M+ token calibration dataset | Best at low bit-rates (Q2, Q3), model-specific tuning, fast updates |
| bartowski | Custom imatrix calibration, upstream llama.cpp PR for improved tensor recipes | Lower KLD at Q4_K_M in some tests, stable quality |
| noctrex | MXFP4 for MoE experts + Q8/BF16 for rest | Specialized for MoE models |
| ubergarm | Standard llama.cpp quantization | Reliable baseline |
| AesSedai | imatrix-based | Good coverage, sometimes outperformed by Unsloth Dynamic |
| mradermacher | Mass-produced quants across many models | Broad coverage, less specialized |
Head-to-Head: Unsloth vs bartowski
On standard KLD benchmarks (Qwen QwQ-32B comparison):
- bartowski Q4_K_M: 0.0087 KLD
- Unsloth Q4_K_M: 0.0222 KLD
- bartowski IQ4_XS: 0.0127 KLD at 4.93 GiB
However, on real-world task evaluations (LiveCodeBench v6, MMLU Pro), Unsloth Dynamic IQ2_XXS outperformed AesSedai IQ3_S despite being 11GB smaller -- demonstrating that KLD/perplexity alone do not predict task performance.
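For reference, KLD here is the Kullback-Leibler divergence between the full-precision model's next-token distribution and the quantized model's, averaged over a test corpus. A minimal sketch of the per-position metric (illustrative only, not the exact llama.cpp implementation):

```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two next-token probability distributions.
    p: reference (full-precision) probs, q: quantized-model probs."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical distributions diverge by zero...
assert kl_divergence([0.7, 0.2, 0.1], [0.7, 0.2, 0.1]) == 0.0
# ...while a slightly shifted distribution yields a small positive
# value, comparable in spirit to the per-token KLD numbers above.
print(round(kl_divergence([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]), 4))
```

Lower is better, but as the task results above show, a small KLD edge does not guarantee better downstream behavior.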
Recommendation
- Q4 and above: bartowski and Unsloth are both excellent. bartowski may have slightly lower KLD at Q4_K_M. Either is a safe choice.
- Q3 and below: Unsloth Dynamic 2.0 (UD- prefix) is the clear winner. The per-layer adaptive approach preserves critical layers at higher precision.
- MoE-specific: noctrex MXFP4_MOE is worth testing if you want pure MoE-optimized quantization.
- Overall: For Qwen3.5-35B-A3B, use Unsloth UD-Q4_K_XL (22.2 GB) or Q8_0 (36.9 GB) for maximum quality. For bartowski, use their Q4_K_M.
imatrix Note
All modern GGUF quantizers now use imatrix (importance matrix) calibration. The calibration is applied at quantization time and significantly improves quality at low bit-rates; some importance-quant (IQ) formats trade a modest amount of inference speed for this. The calibration dataset matters: Unsloth uses 1.5M+ hand-curated tokens; bartowski uses different calibration texts optimized for different use cases.
7. Unsloth Studio
What It Is
Unsloth Studio is an open-source, no-code web UI for training and running LLMs locally. Released March 17, 2026 (beta). Dual-licensed: Apache 2.0 (core) + AGPL-3.0 (UI).
Installation
```sh
# macOS, Linux, WSL
curl -fsSL https://unsloth.ai/install.sh | sh

# Launch
unsloth studio -H 0.0.0.0 -p 8888
```
Capabilities
| Feature | Details |
|---|---|
| Inference | Run GGUF and safetensor models with tool-calling, web search, OpenAI-compatible API |
| Fine-tuning | SFT, GRPO (RL), 500+ models, 2x faster, 70% less VRAM |
| Data Recipes | Auto-create datasets from PDF, CSV, JSON, DOCX, TXT |
| Model Arena | Side-by-side comparison of two models |
| Export | Save to GGUF or safetensors |
| Multimodal | Text, vision, TTS audio, embedding models |
Platform Support
| Platform | Inference | Training |
|---|---|---|
| Linux (NVIDIA) | Yes | Yes |
| Linux (AMD) | Yes | Coming soon |
| Linux (CPU) | Yes | No |
| macOS | Yes (CPU only) | Coming (MLX) |
| Windows | Yes | Yes |
Relevance for Strix Halo
Unsloth Studio provides inference via llama.cpp backend, so it should work on Strix Halo for running models. Training requires NVIDIA or Intel GPUs currently, so fine-tuning is not yet supported on AMD. The inference component is essentially a nice web UI wrapper around llama.cpp, similar to LM Studio but with integrated training capabilities.
Verdict: Useful for inference on Strix Halo. Not yet useful for training on AMD. If you only need inference, LM Studio or raw llama.cpp may be simpler. If you want training + inference in one tool (when AMD support arrives), Unsloth Studio is worth watching.
8. LM Studio on AMD Strix Halo
Backend Status
| Backend | Status | Notes |
|---|---|---|
| Vulkan | Working, recommended | Best for general inference, no special config needed |
| ROCm | Partially broken | gfx1151 declared supported but data files missing, crashes on inference |
| CPU | Working | Slow fallback |
Vulkan Configuration
LM Studio with Vulkan is the most reliable path on Strix Halo:
```json
{
  "llm.gpu.backend": "vulkan",
  "llm.gpu.device": "auto",
  "llm.gpu.layers": -1
}
```
Verify GPU detection: `vulkaninfo | grep "GPU id"`
An automated installer exists: smarttechlabs-projects/strix-halo-lmstudio
Performance Expectations (LM Studio / Vulkan, 128GB system)
| Model Size | Quant | Throughput |
|---|---|---|
| 7B | Q4 | 30-40 t/s |
| 13B | Q4 | 20-30 t/s |
| 30B MoE | Q4 | ~50+ t/s (MoE advantage) |
| 70B | Q4 | 5-8 t/s |
For a 64GB system, expect similar per-token speeds but with lower maximum context lengths before memory pressure kicks in.
ROCm Status and Future
AMD's Ryzen AI Halo Mini PC (Q2 2026) will ship with ROCm 7.2.2 optimization for LM Studio. As of January 2026, stable ROCm+Linux configurations exist for Strix Halo (documented at Framework Community). The gfx1151 ROCm issue in LM Studio specifically is a packaging problem (missing data files), not a fundamental incompatibility.
For now: use Vulkan for short-medium context, or build llama.cpp from source with ROCm for long-context workloads (where Flash Attention matters).
LM Studio Unsloth Dynamic 2.0 Note
There was a reported issue (GitHub #1594) where Unsloth Dynamic 2.0 (UD-) GGUF variants were not shown in LM Studio's download options. Verify that LM Studio is updated to the latest version, or download the GGUF files manually from HuggingFace and load them directly.
9. Recommended Configurations for 64GB Strix Halo
Primary: Qwen3.5-35B-A3B (MoE)
| Use Case | Quantization | Size | KV Budget | Context Est. |
|---|---|---|---|---|
| Maximum quality | Q8_0 | 36.9 GB | ~25 GB | ~32K-65K |
| Best balance | UD-Q4_K_XL | 22.2 GB | ~40 GB | ~65K-131K |
| Maximum context | UD-IQ3_XXS | 13.1 GB | ~49 GB | ~131K+ |
| Speed test | Q4_K_M | 22.0 GB | ~40 GB | ~65K-131K |
Secondary: Qwen3.5-27B (Dense)
| Use Case | Quantization | Size | KV Budget | Notes |
|---|---|---|---|---|
| Quality comparison | Q8_0 | 28.6 GB | ~33 GB | Slower gen than 35B-A3B |
| Balanced | Q4_K_M | 16.7 GB | ~45 GB | |
Quick Reference: Qwen3.5-9B (Small/Draft)
| Use Case | Quantization | Size |
|---|---|---|
| Speculative decoding draft | Q4_K_M | ~6 GB |
| Standalone small model | Q8_0 | ~10 GB |
10. Sampling Parameters (Official Recommendations)
Thinking Mode (General)
- Temperature: 1.0
- Top-p: 0.95
- Top-k: 20
- Min-p: 0.0
- Presence penalty: 1.5
- Max output: 32,768 tokens (general) or 81,920 (math/coding)
Thinking Mode (Coding)
- Temperature: 0.6
- Top-p: 0.95
- Top-k: 20
- Presence penalty: 0.0
Non-Thinking / Instruct Mode
- Temperature: 0.7
- Top-p: 0.8
- Top-k: 20
- Presence penalty: 1.5
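When serving through an OpenAI-compatible endpoint (llama.cpp's llama-server and LM Studio both expose one), the thinking-mode defaults above map to a request payload roughly like this. The model id is a placeholder; `top_k` and `min_p` are llama.cpp/LM Studio extensions rather than core OpenAI parameters, though local servers generally accept them:

```python
import json

# Thinking-mode (general) sampling defaults from the lists above.
payload = {
    "model": "qwen3.5-35b-a3b",  # placeholder model id
    "messages": [{"role": "user", "content": "Explain YaRN scaling."}],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 20,       # llama.cpp/LM Studio extension
    "min_p": 0.0,      # llama.cpp/LM Studio extension
    "presence_penalty": 1.5,
    "max_tokens": 32768,
}

print(json.dumps(payload, indent=2))
```

For coding workloads, swap in the coding-mode values (temperature 0.6, presence penalty 0.0) per the list above.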
Best Practices
- Maintain minimum 128K context to preserve thinking capabilities
- Exclude thinking content from multi-turn conversation history
- For math: "Please reason step by step, and put your final answer within \boxed{}."
- For multiple choice: request JSON output like {"answer": "C"}
11. Open Questions / Limitations
- Qwen3.5 on gfx1151 ROCm: LM Studio's ROCm backend crashes on Strix Halo due to missing gfx1151 data files. Building llama.cpp from source with ROCm 7.x works but requires manual setup.
- Vulkan long-context degradation: Vulkan performance drops significantly beyond ~4K context on Strix Halo. ROCm with Flash Attention is needed for long-context workloads, creating a backend choice dilemma.
- Quantizer quality debate: KLD and perplexity metrics do not always predict real-world task performance. The "best" quantizer depends on the specific use case. More task-based evaluation is needed.
- 122B-A10B viability at 64GB: Only fits at 2-bit or aggressive 3-bit. Quality at these compression levels for a 122B MoE is not well-characterized.
- Unsloth Studio AMD training: Not yet supported. Timeline unclear ("coming soon").
- Multi-token Prediction (MTP): Qwen3.5 supports MTP for faster generation, but llama.cpp support status for this feature on the MoE variants needs verification.
- Speculative decoding: Qwen3.5-9B as a draft model for 35B-A3B has been discussed but needs benchmarking on Strix Halo specifically.
Sources
- Qwen/Qwen3.5-35B-A3B Model Card
- QwenLM/Qwen3.5 GitHub
- unsloth/Qwen3.5-35B-A3B-GGUF
- unsloth/Qwen3.5-27B-GGUF
- unsloth/Qwen3.5-122B-A10B-GGUF
- bartowski/Qwen_Qwen3.5-35B-A3B-GGUF
- bartowski/Qwen_Qwen3.5-27B-GGUF
- noctrex/Qwen3.5-35B-A3B-MXFP4_MOE-GGUF
- Unsloth Dynamic 2.0 GGUFs Documentation
- Qwen3.5 GGUF Benchmarks (Unsloth)
- Unsloth Studio Documentation
- Qwen3.5 Local Running Guide (Unsloth)
- Summary of Qwen3.5 GGUF Evaluations (kaitchup)
- LM Studio Vulkan on Strix Halo (SmartTechLabs)
- LM Studio on Ryzen AI
- Strix Halo llama.cpp Performance Wiki
- AMD Strix Halo Backend Benchmarks
- Strix Halo LLM Optimization (hardware-corner.net)
- Qwen3.5 Small Models (Artificial Analysis)
- Qwen 3.5 9B Beats 120B Models (VentureBeat)
- AMD ROCm 7 Strix Halo Performance (Phoronix)
- Qwen3.5 Blog (qwen.ai)