strix-halo-optimizations/docs/model-recommendations.md
Felipe Cardoso 58124cd657 feat: add Qwen3.5 model catalog and agentic evaluation framework
Models:
- configs/models.conf: catalog with Qwen3.5-35B-A3B (MoE, top pick),
  Qwen3.5-27B (dense), Qwen3-Coder-30B-A3B (agentic/coding)
- Updated benchmark setup to show catalog with download status
- docs/model-recommendations.md: memory planning, quantization guide

Agentic evaluation:
- scripts/agentic/setup.sh: installs inspect-ai, evalplus, bigcodebench
  in a Python venv
- scripts/agentic/run-eval.sh: runs evaluations against local LLM server
  (ollama or llama.cpp). Suites: quick (HumanEval+IFEval), code
  (EvalPlus+BigCodeBench), tooluse (BFCL), full (all)
- bin/agentic: dispatcher with help
- docs/agentic-benchmarks.md: methodology, framework comparison, model
  recommendations for agentic use

Updated: Makefile (6 new targets), README, CLAUDE.md, docs/references.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 00:20:23 +01:00


Qwen3.5 Model Family: Research Summary for Strix Halo (64GB)

Date: 2026-03-26
Target Hardware: AMD Ryzen AI MAX+ 395 / Radeon 8060S (gfx1151), 64 GB unified LPDDR5x, Fedora 43
Focus: GGUF quantized models for llama.cpp inference


Scope

This report covers the Qwen3.5 model family (released February-March 2026) with emphasis on GGUF quantization options, file sizes, memory fit analysis for 64GB unified memory, GGUF quantizer comparison (Unsloth vs bartowski vs others), Unsloth Studio capabilities, and LM Studio backend support on AMD Strix Halo. Out of scope: cloud API pricing, full-precision training, non-GGUF formats (AWQ, GPTQ, EXL2).


1. Qwen3.5 Model Family Overview

Released mid-February 2026 (medium/large) and March 2, 2026 (small), licensed Apache 2.0. All models share the Gated DeltaNet hybrid architecture: a 3:1 ratio of linear attention (Gated DeltaNet) to full softmax attention blocks. Native 262K context window, extensible to 1,010,000 tokens via YaRN scaling. Supports 201 languages. Native multimodal (vision+language). Thinking/non-thinking hybrid mode.

| Model | Type | Total Params | Active Params | Architecture |
|---|---|---|---|---|
| Qwen3.5-397B-A17B | MoE | 397B | 17B | 256 experts, 8 routed + 1 shared |
| Qwen3.5-122B-A10B | MoE | 122B | 10B | 256 experts, 8 routed + 1 shared |
| Qwen3.5-35B-A3B | MoE | 35B | 3B | 256 experts, 8 routed + 1 shared |
| Qwen3.5-27B | Dense | 27B | 27B | Full activation |
| Qwen3.5-9B | Dense | 9B | 9B | Gated DeltaNet hybrid |
| Qwen3.5-4B | Dense | 4B | 4B | Gated DeltaNet hybrid |
| Qwen3.5-2B | Dense | 2B | 2B | Gated DeltaNet hybrid |
| Qwen3.5-0.8B | Dense | 0.8B | 0.8B | Gated DeltaNet hybrid |

2. Qwen3.5-35B-A3B (MoE) -- Detailed Analysis

Architecture Specs

  • Hidden dimension: 2048
  • Token embedding: 248,320 (padded)
  • Layers: 40
  • Hidden layout: 10 x (3 x (Gated DeltaNet -> MoE) -> 1 x (Gated Attention -> MoE))
  • MoE: 256 total experts, 8 routed + 1 shared active, expert intermediate dim 512
  • Linear attention heads: 32 (V), 16 (QK), head dim 128
  • Gated attention heads: 16 (Q), 2 (KV), head dim 256
  • BF16 model size: 69.4 GB

GGUF Quantizations (Unsloth)

Source: unsloth/Qwen3.5-35B-A3B-GGUF. Updated March 5, 2026 with improved imatrix data.

| Quantization | Size (GB) | Fits 64GB? | Notes |
|---|---|---|---|
| UD-IQ2_XXS | 10.7 | Yes | Ultra-compressed, quality loss |
| UD-IQ2_M | 11.4 | Yes | |
| UD-Q2_K_XL | 12.2 | Yes | |
| UD-IQ3_XXS | 13.1 | Yes | |
| UD-IQ3_S | 13.6 | Yes | |
| Q3_K_S | 15.3 | Yes | |
| Q3_K_M | 16.4 | Yes | |
| UD-Q3_K_XL | 16.6 | Yes | |
| UD-IQ4_XS | 17.5 | Yes | |
| UD-IQ4_NL | 17.8 | Yes | |
| Q4_K_S | 20.7 | Yes | |
| MXFP4_MOE | 21.6 | Yes | MoE-optimized mixed precision |
| Q4_K_M | 22.0 | Yes | Recommended sweet spot |
| UD-Q4_K_XL | 22.2 | Yes | Dynamic 2.0, best 4-bit |
| Q5_K_S | 24.8 | Yes | |
| Q5_K_M | 26.2 | Yes | |
| UD-Q5_K_XL | 26.4 | Yes | |
| UD-Q6_K_S | 28.5 | Yes | |
| Q6_K | 28.9 | Yes | |
| UD-Q6_K_XL | 32.1 | Yes | |
| Q8_0 | 36.9 | Yes | High quality, fits with room |
| UD-Q8_K_XL | 48.7 | Yes* | Tight -- ~15GB for KV cache |
| BF16 | 69.4 | No | Exceeds 64GB |

Key finding: Every quantization except BF16 fits in 64GB. Even Q8_0 at 36.9 GB leaves ~27 GB for KV cache and OS overhead, which is excellent. The MoE architecture (only 3B active params) means token generation is fast relative to total model size.
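
The fit analysis above can be reduced to a quick headroom check. A minimal sketch (the 64 GB figure is total unified memory; the ~6 GB OS/desktop reserve is an assumption, not a measured value):

```python
# Rough memory-headroom check for a quantized model on 64 GB unified memory.
# OS_RESERVE_GB is an assumed figure for desktop + runtime overhead.
TOTAL_GB = 64.0
OS_RESERVE_GB = 6.0  # assumption

def headroom_gb(model_gb: float) -> float:
    """Memory left for KV cache and activations after loading the weights."""
    return TOTAL_GB - OS_RESERVE_GB - model_gb

for name, size in [("Q4_K_M", 22.0), ("Q8_0", 36.9), ("UD-Q8_K_XL", 48.7), ("BF16", 69.4)]:
    h = headroom_gb(size)
    verdict = "fits" if h > 0 else "does not fit"
    print(f"{name:12s} {size:5.1f} GB -> {verdict}, {h:5.1f} GB headroom")
```

Adjust `OS_RESERVE_GB` to match your own system; a bare server install can reserve less, a full desktop session more.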

Benchmark Results (Official, from Model Card)

| Benchmark | Qwen3.5-35B-A3B | GPT-5-mini | Notes |
|---|---|---|---|
| MMLU-Pro | 85.3 | 83.7 | Outperforms |
| C-Eval | 90.2 | 82.2 | Outperforms |
| GPQA Diamond | 84.2 | 82.8 | Outperforms |
| SWE-bench Verified | 69.2 | 72.0 | Slightly behind |
| LiveCodeBench v6 | 74.6 | 80.5 | Behind on coding |
| MMMU (vision) | 81.4 | 79.0 | Outperforms |
| MathVision | 83.9 | 71.9 | Strongly outperforms |
| VideoMME (w/ sub.) | 86.6 | 83.5 | Outperforms |

Strix Halo Performance Estimates

Based on Qwen3-30B-A3B benchmarks (similar architecture, predecessor):

| Backend | pp512 (t/s) | tg128 (t/s) | Context |
|---|---|---|---|
| Vulkan RADV | ~755 | ~85 | Short |
| Vulkan AMDVLK | ~742 | ~82 | Short |
| ROCm hipBLASlt | ~652 | ~64 | Short |
| ROCm rocWMMA (tuned) | ~659 | ~68 | Short |
| Vulkan RADV | ~17 | ~13 | 130K |
| ROCm hipBLASlt | ~40 | ~5 | 130K |

Key insight: Vulkan wins on short-context token generation. ROCm wins on long-context prompt processing. For interactive chat (short-medium context), Vulkan RADV is the best backend on Strix Halo.


3. Qwen3.5-27B (Dense) -- Detailed Analysis

Source: unsloth/Qwen3.5-27B-GGUF

The only dense (non-MoE) model in the medium range. All 27B parameters activate on every forward pass, meaning slower token generation than 35B-A3B despite being "smaller" in total params. BF16 size: 53.8 GB.

GGUF Quantizations (Unsloth)

| Quantization | Size (GB) | Fits 64GB? | Notes |
|---|---|---|---|
| UD-IQ2_XXS | 8.57 | Yes | |
| UD-IQ2_M | 10.2 | Yes | |
| UD-Q2_K_XL | 11.2 | Yes | |
| UD-IQ3_XXS | 11.5 | Yes | |
| Q3_K_S | 12.3 | Yes | |
| Q3_K_M | 13.5 | Yes | |
| UD-Q3_K_XL | 14.4 | Yes | |
| IQ4_XS | 15.0 | Yes | |
| Q4_0 | 15.7 | Yes | |
| IQ4_NL | 15.7 | Yes | |
| Q4_K_S | 15.8 | Yes | |
| Q4_K_M | 16.7 | Yes | Recommended |
| Q4_1 | 17.2 | Yes | |
| UD-Q4_K_XL | 17.6 | Yes | Dynamic 2.0 |
| Q5_K_S | 18.9 | Yes | |
| Q5_K_M | 19.6 | Yes | |
| UD-Q5_K_XL | 20.2 | Yes | |
| Q6_K | 22.5 | Yes | |
| UD-Q6_K_XL | 25.7 | Yes | |
| Q8_0 | 28.6 | Yes | Plenty of room |
| UD-Q8_K_XL | 35.5 | Yes | Good quality + headroom |
| BF16 | 53.8 | Yes* | Tight -- only ~10GB for KV cache |

Key finding: All quantizations fit in 64GB, including BF16 (barely). However, because this is a dense model with 27B active params, token generation will be significantly slower than 35B-A3B (which only activates 3B). For interactive use on Strix Halo, the 35B-A3B MoE is likely the better choice despite being larger on disk.

35B-A3B vs 27B: Which to Run?

| Factor | 35B-A3B (MoE) | 27B (Dense) |
|---|---|---|
| Active params | 3B | 27B |
| Token gen speed | ~85 t/s (Vulkan) | ~10-15 t/s (estimated) |
| Quality (MMLU-Pro) | 85.3 | Comparable |
| Memory (Q4_K_M) | 22.0 GB | 16.7 GB |
| Memory (Q8_0) | 36.9 GB | 28.6 GB |
| Best for | Interactive chat, speed | Batch processing, quality |

Recommendation: For interactive inference on 64GB Strix Halo, strongly prefer Qwen3.5-35B-A3B. The MoE architecture is ideal for unified memory systems since only 3B params are active per token, yielding much faster generation despite the larger total weight file.
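
The speed gap follows from memory bandwidth: each generated token must stream all active weights through the memory bus. A back-of-envelope sketch (the ~256 GB/s LPDDR5x peak and ~60% effective utilization are assumptions, not measurements):

```python
# Bandwidth-bound upper estimate for token generation speed.
# Assumed values: Strix Halo LPDDR5x peak ~256 GB/s, ~60% achievable in practice.
PEAK_BW_GBPS = 256.0   # assumption
EFFICIENCY = 0.6       # assumption

def est_tg_tps(active_params_b: float, bytes_per_param: float) -> float:
    """Tokens/s if generation is limited purely by streaming active weights."""
    bytes_per_token_gb = active_params_b * bytes_per_param  # GB read per token
    return PEAK_BW_GBPS * EFFICIENCY / bytes_per_token_gb

# ~4.5 bits/weight for a Q4_K_M-class quant -> ~0.56 bytes/param
print(f"35B-A3B (3B active):  ~{est_tg_tps(3.0, 0.56):.0f} t/s upper bound")
print(f"27B dense (27B active): ~{est_tg_tps(27.0, 0.56):.0f} t/s upper bound")
```

Both estimates land in the same range as the table's figures, which supports the bandwidth-bound reading; expert routing and attention cost push real throughput somewhat below the bound.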


4. Qwen3.5-122B-A10B (MoE) -- Stretch Goal

Source: unsloth/Qwen3.5-122B-A10B-GGUF

BF16 size: 244 GB. This is the next tier up from 35B-A3B.

Quantizations That Fit 64GB

| Quantization | Size (GB) | Fit? | Notes |
|---|---|---|---|
| UD-IQ1_M | 34.2 | Yes | 1-bit, quality concerns |
| UD-IQ2_XXS | 36.6 | Yes | Very compressed |
| UD-IQ2_M | 39.1 | Yes | |
| UD-Q2_K_XL | 41.8 | Yes | |
| UD-IQ3_XXS | 44.7 | Yes | |
| UD-IQ3_S | 46.6 | Yes* | Tight with KV cache |
| Q3_K_S | 52.5 | Marginal | Very little KV headroom |
| Q3_K_M | 56.4 | No | Leaves <8GB for everything else |
| Q4_K_M+ | 76.5+ | No | Does not fit |

Warning: Q3-level quantization of 122B has been reported to produce garbled output, infinite repetition, and failures on tool calls and code generation. The UD-Q2_K_XL (41.8 GB) is the recommended minimum viable quantization.

Verdict: Possible at 2-bit, but risky. Quality at IQ2 level on a 122B MoE model is largely untested for production use. The 35B-A3B at Q8_0 (36.9 GB) is likely higher quality than 122B at IQ2 (36.6 GB) and much safer. Not recommended for 64GB systems unless you specifically need the 10B active parameter count.
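
One way to frame the 35B-Q8 vs 122B-IQ2 tradeoff is effective bits per weight at equal file size. A rough calculation from the table sizes above (it ignores the GB/GiB distinction and the small non-expert tensors kept at higher precision):

```python
def bits_per_weight(file_gb: float, total_params_b: float) -> float:
    """Average bits stored per parameter for a given GGUF file size."""
    return file_gb * 8 / total_params_b  # GB * 8 = gigabits; params in billions

print(f"35B-A3B Q8_0:      {bits_per_weight(36.9, 35):.1f} bits/weight")
print(f"122B-A10B IQ2_XXS: {bits_per_weight(36.6, 122):.1f} bits/weight")
```

At the same ~37 GB footprint, the 35B model keeps roughly 3.5x more bits per weight, which is the quantitative core of the "Q8 small beats IQ2 big" recommendation.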


5. Qwen3.5 Small Models (Worth Benchmarking)

Qwen3.5-9B

The standout small model. Outperforms models 3-13x its size:

  • GPQA Diamond: 81.7 (vs GPT-OSS-120B: 71.5)
  • HMMT Feb 2025: 83.2
  • MMMU-Pro: 70.1 (beats Gemini 2.5 Flash-Lite at 59.7)

At Q4_K_M, the 9B model needs roughly 6-7 GB. Runs comfortably on any hardware. Useful as a draft model for speculative decoding with the 35B-A3B.
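
The draft-model idea can be quantified with the standard expected-acceptance formula from the speculative decoding literature: if the draft proposes k tokens and each is accepted independently with probability a, the expected number of tokens emitted per target-model pass is (1 - a^(k+1)) / (1 - a). A sketch (the 70% acceptance rate is an assumption; it must be measured per model pair):

```python
def expected_tokens_per_pass(a: float, k: int) -> float:
    """Expected tokens accepted per target-model pass with k drafted tokens
    and per-token acceptance probability a (geometric acceptance model)."""
    if a == 1.0:
        return float(k + 1)
    return (1 - a ** (k + 1)) / (1 - a)

# assumption: ~70% acceptance is plausible for a same-family 9B draft model
for k in (2, 4, 8):
    print(f"k={k}: ~{expected_tokens_per_pass(0.7, k):.2f} tokens per target pass")
```

Real speedup also depends on the draft model's own cost per token, so this is an upper bound on the benefit, not a prediction.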

Qwen3.5-4B

Performance close to the previous Qwen3-80B-A3B (20x larger). Excellent for on-device/edge tasks. ~3 GB at Q4_K_M.


6. Best GGUF Quantizers: Unsloth vs bartowski vs Others

Providers Compared

| Provider | Approach | Strengths |
|---|---|---|
| Unsloth | Dynamic 2.0: per-layer adaptive quantization, 1.5M+ token calibration dataset | Best at low bit-rates (Q2, Q3), model-specific tuning, fast updates |
| bartowski | Custom imatrix calibration, upstream llama.cpp PR for improved tensor recipes | Lower KLD at Q4_K_M in some tests, stable quality |
| noctrex | MXFP4 for MoE experts + Q8/BF16 for rest | Specialized for MoE models |
| ubergarm | Standard llama.cpp quantization | Reliable baseline |
| AesSedai | imatrix-based | Good coverage, sometimes outperformed by Unsloth Dynamic |
| mradermacher | Mass-produced quants across many models | Broad coverage, less specialized |

Head-to-Head: Unsloth vs bartowski

On standard KLD benchmarks (Qwen QwQ-32B comparison):

  • bartowski Q4_K_M: 0.0087 KLD
  • Unsloth Q4_K_M: 0.0222 KLD
  • bartowski IQ4_XS: 0.0127 KLD at 4.93 GiB

However, on real-world task evaluations (LiveCodeBench v6, MMLU Pro), Unsloth Dynamic IQ2_XXS outperformed AesSedai IQ3_S despite being 11GB smaller -- demonstrating that KLD/perplexity alone do not predict task performance.

Recommendation

  • Q4 and above: bartowski and Unsloth are both excellent. bartowski may have slightly lower KLD at Q4_K_M. Either is a safe choice.
  • Q3 and below: Unsloth Dynamic 2.0 (UD- prefix) is the clear winner. The per-layer adaptive approach preserves critical layers at higher precision.
  • MoE-specific: noctrex MXFP4_MOE is worth testing if you want pure MoE-optimized quantization.
  • Overall: For Qwen3.5-35B-A3B, use Unsloth UD-Q4_K_XL (22.2 GB) or Q8_0 (36.9 GB) for maximum quality. For bartowski, use their Q4_K_M.

imatrix Note

All modern GGUF quantizers now use imatrix (importance matrix) calibration. The calibration runs once at quantization time, not at inference time, and significantly improves quality at low bit-rates. The calibration dataset matters: Unsloth uses 1.5M+ hand-curated tokens; bartowski uses different calibration texts optimized for different use cases.


7. Unsloth Studio

What It Is

Unsloth Studio is an open-source, no-code web UI for training and running LLMs locally. Released March 17, 2026 (beta). Dual-licensed: Apache 2.0 (core) + AGPL-3.0 (UI).

Installation

```sh
# macOS, Linux, WSL
curl -fsSL https://unsloth.ai/install.sh | sh

# Launch
unsloth studio -H 0.0.0.0 -p 8888
```

Capabilities

| Feature | Details |
|---|---|
| Inference | Run GGUF and safetensor models with tool-calling, web search, OpenAI-compatible API |
| Fine-tuning | SFT, GRPO (RL), 500+ models, 2x faster, 70% less VRAM |
| Data Recipes | Auto-create datasets from PDF, CSV, JSON, DOCX, TXT |
| Model Arena | Side-by-side comparison of two models |
| Export | Save to GGUF or safetensors |
| Multimodal | Text, vision, TTS audio, embedding models |

Platform Support

| Platform | Inference | Training |
|---|---|---|
| Linux (NVIDIA) | Yes | Yes |
| Linux (AMD) | Yes | Coming soon |
| Linux (CPU) | Yes | No |
| macOS | Yes (CPU only) | Coming (MLX) |
| Windows | Yes | Yes |

Relevance for Strix Halo

Unsloth Studio provides inference via llama.cpp backend, so it should work on Strix Halo for running models. Training requires NVIDIA or Intel GPUs currently, so fine-tuning is not yet supported on AMD. The inference component is essentially a nice web UI wrapper around llama.cpp, similar to LM Studio but with integrated training capabilities.

Verdict: Useful for inference on Strix Halo. Not yet useful for training on AMD. If you only need inference, LM Studio or raw llama.cpp may be simpler. If you want training + inference in one tool (when AMD support arrives), Unsloth Studio is worth watching.


8. LM Studio on AMD Strix Halo

Backend Status

| Backend | Status | Notes |
|---|---|---|
| Vulkan | Working, recommended | Best for general inference, no special config needed |
| ROCm | Partially broken | gfx1151 declared supported but data files missing, crashes on inference |
| CPU | Working | Slow fallback |

Vulkan Configuration

LM Studio with Vulkan is the most reliable path on Strix Halo:

```json
{
  "llm.gpu.backend": "vulkan",
  "llm.gpu.device": "auto",
  "llm.gpu.layers": -1
}
```

Verify GPU detection: `vulkaninfo | grep "GPU id"`

An automated installer exists: smarttechlabs-projects/strix-halo-lmstudio

Performance Expectations (LM Studio / Vulkan, 128GB system)

| Model Size | Quant | Throughput |
|---|---|---|
| 7B | Q4 | 30-40 t/s |
| 13B | Q4 | 20-30 t/s |
| 30B MoE | Q4 | ~50+ t/s (MoE advantage) |
| 70B | Q4 | 5-8 t/s |

For a 64GB system, expect similar per-token speeds but with lower maximum context lengths before memory pressure kicks in.

ROCm Status and Future

AMD's Ryzen AI Halo Mini PC (Q2 2026) will ship with ROCm 7.2.2 optimization for LM Studio. As of January 2026, stable ROCm+Linux configurations exist for Strix Halo (documented at Framework Community). The gfx1151 ROCm issue in LM Studio specifically is a packaging problem (missing data files), not a fundamental incompatibility.

For now: use Vulkan for short-medium context, or build llama.cpp from source with ROCm for long-context workloads (where Flash Attention matters).

LM Studio Unsloth Dynamic 2.0 Note

There was a reported issue (GitHub #1594) where Unsloth Dynamic 2.0 (UD-) GGUF variants were not shown in LM Studio's download options. Verify that LM Studio is updated to the latest version, or download the GGUF files manually from HuggingFace and load them directly.


9. Recommended Configurations for 64GB Strix Halo

Primary: Qwen3.5-35B-A3B (MoE)

| Use Case | Quantization | Size | KV Budget | Context Est. |
|---|---|---|---|---|
| Maximum quality | Q8_0 | 36.9 GB | ~25 GB | ~32K-65K |
| Best balance | UD-Q4_K_XL | 22.2 GB | ~40 GB | ~65K-131K |
| Maximum context | UD-IQ3_XXS | 13.1 GB | ~49 GB | ~131K+ |
| Speed test | Q4_K_M | 22.0 GB | ~40 GB | ~65K-131K |
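
The context estimates depend on KV-cache size per token. In the hybrid layout, only the full-attention blocks (10 of 40 layers, per Section 2) keep a conventional KV cache; the Gated DeltaNet blocks keep a fixed-size recurrent state instead. A rough f16 sketch using the Section 2 head counts (this is an assumption-laden estimate that ignores the DeltaNet state and any llama.cpp-specific padding):

```python
# Per-token KV bytes for the full-attention layers of Qwen3.5-35B-A3B.
# From Section 2: 10 of 40 layers use gated (softmax) attention,
# with 2 KV heads of head dim 256; f16 = 2 bytes per element.
FULL_ATTN_LAYERS = 10
KV_HEADS = 2
HEAD_DIM = 256
BYTES_F16 = 2

def kv_bytes_per_token() -> int:
    # K and V each store KV_HEADS * HEAD_DIM elements per attention layer
    return FULL_ATTN_LAYERS * 2 * KV_HEADS * HEAD_DIM * BYTES_F16

def kv_gb(context_tokens: int) -> float:
    return kv_bytes_per_token() * context_tokens / 1e9

print(f"{kv_bytes_per_token()} bytes/token")
print(f"131072-token cache: ~{kv_gb(131072):.1f} GB")
```

If this sketch is right, long contexts on this architecture are limited more by compute and backend behavior (see Section 11) than by KV memory, and the KV budgets in the table are generous.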

Secondary: Qwen3.5-27B (Dense)

| Use Case | Quantization | Size | KV Budget | Notes |
|---|---|---|---|---|
| Quality comparison | Q8_0 | 28.6 GB | ~33 GB | Slower gen than 35B-A3B |
| Balanced | Q4_K_M | 16.7 GB | ~45 GB | |

Quick Reference: Qwen3.5-9B (Small/Draft)

| Use Case | Quantization | Size |
|---|---|---|
| Speculative decoding draft | Q4_K_M | ~6 GB |
| Standalone small model | Q8_0 | ~10 GB |

10. Sampling Parameters (Official Recommendations)

Thinking Mode (General)

  • Temperature: 1.0
  • Top-p: 0.95
  • Top-k: 20
  • Min-p: 0.0
  • Presence penalty: 1.5
  • Max output: 32,768 tokens (general) or 81,920 (math/coding)

Thinking Mode (Coding)

  • Temperature: 0.6
  • Top-p: 0.95
  • Top-k: 20
  • Presence penalty: 0.0

Non-Thinking / Instruct Mode

  • Temperature: 0.7
  • Top-p: 0.8
  • Top-k: 20
  • Presence penalty: 1.5
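
These settings map directly onto the OpenAI-compatible request body served by llama.cpp or Ollama. A minimal sketch of a thinking-mode (general) payload — the model name and endpoint URL are placeholders, and `top_k`/`min_p` are llama.cpp server extensions to the OpenAI schema, so support varies by server:

```python
import json

# Thinking-mode (general) sampling parameters from the section above.
payload = {
    "model": "qwen3.5-35b-a3b",  # placeholder model name
    "messages": [{"role": "user", "content": "Explain KV caching briefly."}],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 20,            # llama.cpp extension to the OpenAI schema
    "min_p": 0.0,           # llama.cpp extension
    "presence_penalty": 1.5,
    "max_tokens": 32768,
}

# POST this to e.g. http://localhost:8080/v1/chat/completions (placeholder URL)
print(json.dumps(payload, indent=2))
```

For coding or instruct use, swap in the corresponding temperature/top-p/presence values from the lists above.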

Best Practices

  • Maintain minimum 128K context to preserve thinking capabilities
  • Exclude thinking content from multi-turn conversation history
  • For math: "Please reason step by step, and put your final answer within \boxed{}."
  • For multiple choice: request JSON output like {"answer": "C"}

11. Open Questions / Limitations

  1. Qwen3.5 on gfx1151 ROCm: LM Studio's ROCm backend crashes on Strix Halo due to missing gfx1151 data files. Building llama.cpp from source with ROCm 7.x works but requires manual setup.

  2. Vulkan long-context degradation: Vulkan performance drops significantly beyond ~4K context on Strix Halo. ROCm with Flash Attention is needed for long-context workloads, creating a backend choice dilemma.

  3. Quantizer quality debate: KLD and perplexity metrics do not always predict real-world task performance. The "best" quantizer depends on the specific use case. More task-based evaluation is needed.

  4. 122B-A10B viability at 64GB: Only fits at 2-bit or aggressive 3-bit. Quality at these compression levels for a 122B MoE is not well-characterized.

  5. Unsloth Studio AMD training: Not yet supported. Timeline unclear ("coming soon").

  6. Multi-token Prediction (MTP): Qwen3.5 supports MTP for faster generation, but llama.cpp support status for this feature on the MoE variants needs verification.

  7. Speculative decoding: Qwen3.5-9B as a draft model for 35B-A3B has been discussed but needs benchmarking on Strix Halo specifically.


Sources