Qwen 3.5 Model Family: Research Summary for Strix Halo (64GB)
Date: 2026-03-26
Target Hardware: AMD Ryzen AI MAX+ 395 / Radeon 8060S (gfx1151), 64 GB unified LPDDR5x, Fedora 43
Focus: GGUF quantized models for llama.cpp inference
Scope
This report covers the Qwen3.5 model family (released February-March 2026) with emphasis on GGUF quantization options, file sizes, memory fit analysis for 64GB unified memory, GGUF quantizer comparison (Unsloth vs bartowski vs others), Unsloth Studio capabilities, and LM Studio backend support on AMD Strix Halo. Out of scope: cloud API pricing, full-precision training, non-GGUF formats (AWQ, GPTQ, EXL2).
1. Qwen3.5 Model Family Overview
Released mid-February 2026 (medium/large) and March 2, 2026 (small), licensed Apache 2.0. All models share the Gated DeltaNet hybrid architecture: a 3:1 ratio of linear attention (Gated DeltaNet) to full softmax attention blocks. Native 262K context window, extensible to 1,010,000 tokens via YaRN scaling. Supports 201 languages. Native multimodal (vision+language). Thinking/non-thinking hybrid mode.
| Model | Type | Total Params | Active Params | Architecture |
|---|---|---|---|---|
| Qwen3.5-397B-A17B | MoE | 397B | 17B | 256 experts, 8 routed + 1 shared |
| Qwen3.5-122B-A10B | MoE | 122B | 10B | 256 experts, 8 routed + 1 shared |
| Qwen3.5-35B-A3B | MoE | 35B | 3B | 256 experts, 8 routed + 1 shared |
| Qwen3.5-27B | Dense | 27B | 27B | Full activation |
| Qwen3.5-9B | Dense | 9B | 9B | Gated DeltaNet hybrid |
| Qwen3.5-4B | Dense | 4B | 4B | Gated DeltaNet hybrid |
| Qwen3.5-2B | Dense | 2B | 2B | Gated DeltaNet hybrid |
| Qwen3.5-0.8B | Dense | 0.8B | 0.8B | Gated DeltaNet hybrid |
2. Qwen3.5-35B-A3B (MoE) -- Detailed Analysis
Architecture Specs
- Hidden dimension: 2048
- Token embedding: 248,320 (padded)
- Layers: 40
- Hidden layout: 10 x (3 x (Gated DeltaNet -> MoE) -> 1 x (Gated Attention -> MoE))
- MoE: 256 total experts, 8 routed + 1 shared active, expert intermediate dim 512
- Linear attention heads: 32 (V), 16 (QK), head dim 128
- Gated attention heads: 16 (Q), 2 (KV), head dim 256
- BF16 model size: 69.4 GB
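The 40-layer hybrid layout above can be sketched as a simple pattern generator. This is an illustration of the stated 3:1 layout, not code from the Qwen repository; the layer-type names are descriptive labels, not actual module names:

```python
def layer_pattern(blocks: int = 10, linear_per_block: int = 3) -> list[str]:
    """Per-layer attention-type sequence described above: each block is
    3 Gated DeltaNet (linear attention) layers followed by 1 full
    gated-attention layer; every layer is followed by an MoE FFN."""
    pattern: list[str] = []
    for _ in range(blocks):
        pattern += ["gated_deltanet"] * linear_per_block
        pattern.append("gated_attention")
    return pattern

layers = layer_pattern()
assert len(layers) == 40                      # matches the 40-layer spec
assert layers.count("gated_attention") == 10  # 3:1 linear-to-full ratio
```

The 3:1 ratio is what keeps KV-cache growth modest at long context: only the 10 full-attention layers accumulate a conventional KV cache.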
GGUF Quantizations (Unsloth)
Source: unsloth/Qwen3.5-35B-A3B-GGUF. Updated March 5, 2026 with improved imatrix data.
| Quantization | Size (GB) | Fits 64GB? | Notes |
|---|---|---|---|
| UD-IQ2_XXS | 10.7 | Yes | Ultra-compressed, quality loss |
| UD-IQ2_M | 11.4 | Yes | |
| UD-Q2_K_XL | 12.2 | Yes | |
| UD-IQ3_XXS | 13.1 | Yes | |
| UD-IQ3_S | 13.6 | Yes | |
| Q3_K_S | 15.3 | Yes | |
| Q3_K_M | 16.4 | Yes | |
| UD-Q3_K_XL | 16.6 | Yes | |
| UD-IQ4_XS | 17.5 | Yes | |
| UD-IQ4_NL | 17.8 | Yes | |
| Q4_K_S | 20.7 | Yes | |
| MXFP4_MOE | 21.6 | Yes | MoE-optimized mixed precision |
| Q4_K_M | 22.0 | Yes | Recommended sweet spot |
| UD-Q4_K_XL | 22.2 | Yes | Dynamic 2.0, best 4-bit |
| Q5_K_S | 24.8 | Yes | |
| Q5_K_M | 26.2 | Yes | |
| UD-Q5_K_XL | 26.4 | Yes | |
| UD-Q6_K_S | 28.5 | Yes | |
| Q6_K | 28.9 | Yes | |
| UD-Q6_K_XL | 32.1 | Yes | |
| Q8_0 | 36.9 | Yes | High quality, fits with room |
| UD-Q8_K_XL | 48.7 | Yes* | Tight -- ~15GB for KV cache |
| BF16 | 69.4 | No | Exceeds 64GB |
Key finding: Every quantization except BF16 fits in 64GB. Even Q8_0 at 36.9 GB leaves ~27 GB for KV cache and OS overhead, which is excellent. The MoE architecture (only 3B active params) means token generation is fast relative to total model size.
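The fit analysis above reduces to simple subtraction. A minimal sketch, where the 4 GB OS reserve is an assumption to tune for your own system (a heavier desktop session needs more):

```python
TOTAL_GB = 64.0
OS_RESERVE_GB = 4.0  # assumed headroom for OS, compositor, other apps

def kv_budget_gb(model_size_gb: float) -> float:
    """Memory left for KV cache and activations after loading weights."""
    return TOTAL_GB - OS_RESERVE_GB - model_size_gb

print(round(kv_budget_gb(36.9), 1))  # Q8_0
print(round(kv_budget_gb(48.7), 1))  # UD-Q8_K_XL, the tight case
```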
Benchmark Results (Official, from Model Card)
| Benchmark | Qwen3.5-35B-A3B | GPT-5-mini | Notes |
|---|---|---|---|
| MMLU-Pro | 85.3 | 83.7 | Outperforms |
| C-Eval | 90.2 | 82.2 | Outperforms |
| GPQA Diamond | 84.2 | 82.8 | Outperforms |
| SWE-bench Verified | 69.2 | 72.0 | Slightly behind |
| LiveCodeBench v6 | 74.6 | 80.5 | Behind on coding |
| MMMU (vision) | 81.4 | 79.0 | Outperforms |
| MathVision | 83.9 | 71.9 | Strongly outperforms |
| VideoMME (w/ sub.) | 86.6 | 83.5 | Outperforms |
Strix Halo Performance Estimates
Based on Qwen3-30B-A3B benchmarks (similar architecture, predecessor):
| Backend | pp512 (t/s) | tg128 (t/s) | Context |
|---|---|---|---|
| Vulkan RADV | ~755 | ~85 | Short |
| Vulkan AMDVLK | ~742 | ~82 | Short |
| ROCm hipBLASlt | ~652 | ~64 | Short |
| ROCm rocWMMA (tuned) | ~659 | ~68 | Short |
| Vulkan RADV | ~17 | ~13 | 130K |
| ROCm hipBLASlt | ~40 | ~5 | 130K |
Key insight: Vulkan wins on short-context token generation. ROCm wins on long-context prompt processing. For interactive chat (short-medium context), Vulkan RADV is the best backend on Strix Halo.
3. Qwen3.5-27B (Dense) -- Detailed Analysis
Source: unsloth/Qwen3.5-27B-GGUF
The only dense (non-MoE) model in the medium range. All 27B parameters activate on every forward pass, meaning slower token generation than 35B-A3B despite being "smaller" in total params. BF16 size: 53.8 GB.
GGUF Quantizations (Unsloth)
| Quantization | Size (GB) | Fits 64GB? | Notes |
|---|---|---|---|
| UD-IQ2_XXS | 8.57 | Yes | |
| UD-IQ2_M | 10.2 | Yes | |
| UD-Q2_K_XL | 11.2 | Yes | |
| UD-IQ3_XXS | 11.5 | Yes | |
| Q3_K_S | 12.3 | Yes | |
| Q3_K_M | 13.5 | Yes | |
| UD-Q3_K_XL | 14.4 | Yes | |
| IQ4_XS | 15.0 | Yes | |
| Q4_0 | 15.7 | Yes | |
| IQ4_NL | 15.7 | Yes | |
| Q4_K_S | 15.8 | Yes | |
| Q4_K_M | 16.7 | Yes | Recommended |
| Q4_1 | 17.2 | Yes | |
| UD-Q4_K_XL | 17.6 | Yes | Dynamic 2.0 |
| Q5_K_S | 18.9 | Yes | |
| Q5_K_M | 19.6 | Yes | |
| UD-Q5_K_XL | 20.2 | Yes | |
| Q6_K | 22.5 | Yes | |
| UD-Q6_K_XL | 25.7 | Yes | |
| Q8_0 | 28.6 | Yes | Plenty of room |
| UD-Q8_K_XL | 35.5 | Yes | Good quality + headroom |
| BF16 | 53.8 | Yes* | Tight -- only ~10GB for KV cache |
Key finding: All quantizations fit in 64GB, including BF16 (barely). However, because this is a dense model with 27B active params, token generation will be significantly slower than 35B-A3B (which only activates 3B). For interactive use on Strix Halo, the 35B-A3B MoE is likely the better choice despite being larger on disk.
35B-A3B vs 27B: Which to Run?
| Factor | 35B-A3B (MoE) | 27B (Dense) |
|---|---|---|
| Active params | 3B | 27B |
| Token gen speed | ~85 t/s (Vulkan) | ~10-15 t/s (estimated) |
| Quality (MMLU-Pro) | 85.3 | Comparable |
| Memory (Q4_K_M) | 22.0 GB | 16.7 GB |
| Memory (Q8_0) | 36.9 GB | 28.6 GB |
| Best for | Interactive chat, speed | Batch processing, quality |
Recommendation: For interactive inference on 64GB Strix Halo, strongly prefer Qwen3.5-35B-A3B. The MoE architecture is ideal for unified memory systems since only 3B params are active per token, yielding much faster generation despite the larger total weight file.
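The speed gap follows from memory bandwidth: token generation is roughly bound by how many weight bytes must be read per token. A back-of-envelope sketch, where the ~256 GB/s LPDDR5x figure and the linear active-fraction scaling are assumptions (real throughput is lower due to KV reads, shared-expert traffic, routing, and kernel overhead):

```python
MEM_BW_GBPS = 256.0  # assumed effective Strix Halo LPDDR5x bandwidth

def est_tg_tps(weights_read_per_token_gb: float) -> float:
    """Upper-bound tokens/s if generation were purely bandwidth-bound."""
    return MEM_BW_GBPS / weights_read_per_token_gb

dense_27b_q4 = 16.7           # dense: all weights touched every token
moe_35b_q4 = 22.0 * (3 / 35)  # MoE: only ~3B of 35B params active

print(round(est_tg_tps(dense_27b_q4)))  # dense ceiling, tens of t/s
print(round(est_tg_tps(moe_35b_q4)))    # MoE ceiling, far higher
```

The observed ~85 t/s for the MoE and the ~10-15 t/s estimate for the dense model both sit below these ceilings, as expected, but preserve the same ordering.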
4. Qwen3.5-122B-A10B (MoE) -- Stretch Goal
Source: unsloth/Qwen3.5-122B-A10B-GGUF
BF16 size: 244 GB. This is the next tier up from 35B-A3B.
Quantizations That Fit 64GB
| Quantization | Size (GB) | Fit? | Notes |
|---|---|---|---|
| UD-IQ1_M | 34.2 | Yes | 1-bit, quality concerns |
| UD-IQ2_XXS | 36.6 | Yes | Very compressed |
| UD-IQ2_M | 39.1 | Yes | |
| UD-Q2_K_XL | 41.8 | Yes | |
| UD-IQ3_XXS | 44.7 | Yes | |
| UD-IQ3_S | 46.6 | Yes* | Tight with KV cache |
| Q3_K_S | 52.5 | Marginal | Very little KV headroom |
| Q3_K_M | 56.4 | No | Leaves <8GB for everything else |
| Q4_K_M+ | 76.5+ | No | Does not fit |
Warning: Q3-level quantization of 122B has been reported to produce garbled output, infinite repetition, and failures on tool calls and code generation. The UD-Q2_K_XL (41.8 GB) is the recommended minimum viable quantization.
Verdict: Possible at 2-bit, but risky. Quality at IQ2 level on a 122B MoE model is largely untested for production use. The 35B-A3B at Q8_0 (36.9 GB) is likely higher quality than 122B at IQ2 (36.6 GB) and much safer. Not recommended for 64GB systems unless you specifically need the 10B active parameter count.
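A quick filter reproduces the table's fit column; the 8 GB minimum-headroom threshold is an arbitrary assumption chosen to mirror the "leaves <8GB" cutoff above:

```python
QUANTS_122B = {  # sizes in GB, from the table above
    "UD-IQ1_M": 34.2, "UD-IQ2_XXS": 36.6, "UD-IQ2_M": 39.1,
    "UD-Q2_K_XL": 41.8, "UD-IQ3_XXS": 44.7, "UD-IQ3_S": 46.6,
    "Q3_K_S": 52.5, "Q3_K_M": 56.4,
}

def viable(total_gb: float = 64.0, min_headroom_gb: float = 8.0) -> list[str]:
    """Quants leaving at least min_headroom_gb for KV cache + OS."""
    return [q for q, size in QUANTS_122B.items()
            if total_gb - size >= min_headroom_gb]

print(viable())  # Q3_K_M (56.4 GB) is the first to drop out
```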
5. Qwen3.5 Small Models (Worth Benchmarking)
Qwen3.5-9B
The standout small model. Outperforms models 3-13x its size:
- GPQA Diamond: 81.7 (vs GPT-OSS-120B: 71.5)
- HMMT Feb 2025: 83.2
- MMMU-Pro: 70.1 (beats Gemini 2.5 Flash-Lite at 59.7)
At Q4_K_M, the 9B model needs roughly 6-7 GB. Runs comfortably on any hardware. Useful as a draft model for speculative decoding with the 35B-A3B.
Qwen3.5-4B
Performance close to the previous Qwen3-80B-A3B (20x larger). Excellent for on-device/edge tasks. ~3 GB at Q4_K_M.
6. Best GGUF Quantizers: Unsloth vs bartowski vs Others
Providers Compared
| Provider | Approach | Strengths |
|---|---|---|
| Unsloth | Dynamic 2.0: per-layer adaptive quantization, 1.5M+ token calibration dataset | Best at low bit-rates (Q2, Q3), model-specific tuning, fast updates |
| bartowski | Custom imatrix calibration, upstream llama.cpp PR for improved tensor recipes | Lower KLD at Q4_K_M in some tests, stable quality |
| noctrex | MXFP4 for MoE experts + Q8/BF16 for rest | Specialized for MoE models |
| ubergarm | Standard llama.cpp quantization | Reliable baseline |
| AesSedai | imatrix-based | Good coverage, sometimes outperformed by Unsloth Dynamic |
| mradermacher | Mass-produced quants across many models | Broad coverage, less specialized |
Head-to-Head: Unsloth vs bartowski
On standard KLD benchmarks (Qwen QwQ-32B comparison):
- bartowski Q4_K_M: 0.0087 KLD
- Unsloth Q4_K_M: 0.0222 KLD
- bartowski IQ4_XS: 0.0127 KLD at 4.93 GiB
However, on real-world task evaluations (LiveCodeBench v6, MMLU Pro), Unsloth Dynamic IQ2_XXS outperformed AesSedai IQ3_S despite being 11GB smaller -- demonstrating that KLD/perplexity alone do not predict task performance.
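For reference, KLD here is the Kullback-Leibler divergence between the full-precision model's next-token distribution and the quantized model's, averaged over a test corpus. A minimal sketch of the per-position metric (illustrative only, not the exact llama.cpp implementation):

```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two next-token probability distributions.
    p: reference (full-precision) probs, q: quantized-model probs."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical distributions diverge by zero...
assert kl_divergence([0.7, 0.2, 0.1], [0.7, 0.2, 0.1]) == 0.0
# ...while a slightly shifted distribution yields a small positive
# value, comparable in spirit to the per-token KLD numbers above.
print(round(kl_divergence([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]), 4))
```

Lower is better, but as the task results above show, a small KLD edge does not guarantee better downstream behavior.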
Recommendation
- Q4 and above: bartowski and Unsloth are both excellent. bartowski may have slightly lower KLD at Q4_K_M. Either is a safe choice.
- Q3 and below: Unsloth Dynamic 2.0 (UD- prefix) is the clear winner. The per-layer adaptive approach preserves critical layers at higher precision.
- MoE-specific: noctrex MXFP4_MOE is worth testing if you want pure MoE-optimized quantization.
- Overall: For Qwen3.5-35B-A3B, use Unsloth UD-Q4_K_XL (22.2 GB) or Q8_0 (36.9 GB) for maximum quality. For bartowski, use their Q4_K_M.
imatrix Note
All modern GGUF quantizers now use imatrix (importance matrix) calibration. The calibration is applied at quantization time and significantly improves quality at low bit-rates; some importance-quant (IQ) formats trade a modest amount of inference speed for this. The calibration dataset matters: Unsloth uses 1.5M+ hand-curated tokens; bartowski uses different calibration texts optimized for different use cases.
7. Unsloth Studio
What It Is
Unsloth Studio is an open-source, no-code web UI for training and running LLMs locally. Released March 17, 2026 (beta). Dual-licensed: Apache 2.0 (core) + AGPL-3.0 (UI).
Installation
```sh
# macOS, Linux, WSL
curl -fsSL https://unsloth.ai/install.sh | sh

# Launch
unsloth studio -H 0.0.0.0 -p 8888
```
Capabilities
| Feature | Details |
|---|---|
| Inference | Run GGUF and safetensor models with tool-calling, web search, OpenAI-compatible API |
| Fine-tuning | SFT, GRPO (RL), 500+ models, 2x faster, 70% less VRAM |
| Data Recipes | Auto-create datasets from PDF, CSV, JSON, DOCX, TXT |
| Model Arena | Side-by-side comparison of two models |
| Export | Save to GGUF or safetensors |
| Multimodal | Text, vision, TTS audio, embedding models |
Platform Support
| Platform | Inference | Training |
|---|---|---|
| Linux (NVIDIA) | Yes | Yes |
| Linux (AMD) | Yes | Coming soon |
| Linux (CPU) | Yes | No |
| macOS | Yes (CPU only) | Coming (MLX) |
| Windows | Yes | Yes |
Relevance for Strix Halo
Unsloth Studio provides inference via llama.cpp backend, so it should work on Strix Halo for running models. Training requires NVIDIA or Intel GPUs currently, so fine-tuning is not yet supported on AMD. The inference component is essentially a nice web UI wrapper around llama.cpp, similar to LM Studio but with integrated training capabilities.
Verdict: Useful for inference on Strix Halo. Not yet useful for training on AMD. If you only need inference, LM Studio or raw llama.cpp may be simpler. If you want training + inference in one tool (when AMD support arrives), Unsloth Studio is worth watching.
8. LM Studio on AMD Strix Halo
Backend Status
| Backend | Status | Notes |
|---|---|---|
| Vulkan | Working, recommended | Best for general inference, no special config needed |
| ROCm | Partially broken | gfx1151 declared supported but data files missing, crashes on inference |
| CPU | Working | Slow fallback |
Vulkan Configuration
LM Studio with Vulkan is the most reliable path on Strix Halo:
```json
{
  "llm.gpu.backend": "vulkan",
  "llm.gpu.device": "auto",
  "llm.gpu.layers": -1
}
```
Verify GPU detection: `vulkaninfo | grep "GPU id"`
An automated installer exists: smarttechlabs-projects/strix-halo-lmstudio
Performance Expectations (LM Studio / Vulkan, 128GB system)
| Model Size | Quant | Throughput |
|---|---|---|
| 7B | Q4 | 30-40 t/s |
| 13B | Q4 | 20-30 t/s |
| 30B MoE | Q4 | ~50+ t/s (MoE advantage) |
| 70B | Q4 | 5-8 t/s |
For a 64GB system, expect similar per-token speeds but with lower maximum context lengths before memory pressure kicks in.
ROCm Status and Future
AMD's Ryzen AI Halo Mini PC (Q2 2026) will ship with ROCm 7.2.2 optimization for LM Studio. As of January 2026, stable ROCm+Linux configurations exist for Strix Halo (documented at Framework Community). The gfx1151 ROCm issue in LM Studio specifically is a packaging problem (missing data files), not a fundamental incompatibility.
For now: use Vulkan for short-medium context, or build llama.cpp from source with ROCm for long-context workloads (where Flash Attention matters).
LM Studio Unsloth Dynamic 2.0 Note
There was a reported issue (GitHub #1594) where Unsloth Dynamic 2.0 (UD-) GGUF variants were not shown in LM Studio's download options. Verify that LM Studio is updated to the latest version, or download the GGUF files manually from HuggingFace and load them directly.
9. Recommended Configurations for 64GB Strix Halo
Primary: Qwen3.5-35B-A3B (MoE)
| Use Case | Quantization | Size | KV Budget | Context Est. |
|---|---|---|---|---|
| Maximum quality | Q8_0 | 36.9 GB | ~25 GB | ~32K-65K |
| Best balance | UD-Q4_K_XL | 22.2 GB | ~40 GB | ~65K-131K |
| Maximum context | UD-IQ3_XXS | 13.1 GB | ~49 GB | ~131K+ |
| Speed test | Q4_K_M | 22.0 GB | ~40 GB | ~65K-131K |
Secondary: Qwen3.5-27B (Dense)
| Use Case | Quantization | Size | KV Budget | Notes |
|---|---|---|---|---|
| Quality comparison | Q8_0 | 28.6 GB | ~33 GB | Slower gen than 35B-A3B |
| Balanced | Q4_K_M | 16.7 GB | ~45 GB | |
Quick Reference: Qwen3.5-9B (Small/Draft)
| Use Case | Quantization | Size |
|---|---|---|
| Speculative decoding draft | Q4_K_M | ~6 GB |
| Standalone small model | Q8_0 | ~10 GB |
10. Sampling Parameters (Official Recommendations)
Thinking Mode (General)
- Temperature: 1.0
- Top-p: 0.95
- Top-k: 20
- Min-p: 0.0
- Presence penalty: 1.5
- Max output: 32,768 tokens (general) or 81,920 (math/coding)
Thinking Mode (Coding)
- Temperature: 0.6
- Top-p: 0.95
- Top-k: 20
- Presence penalty: 0.0
Non-Thinking / Instruct Mode
- Temperature: 0.7
- Top-p: 0.8
- Top-k: 20
- Presence penalty: 1.5
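When serving through an OpenAI-compatible endpoint (llama.cpp's llama-server and LM Studio both expose one), the thinking-mode defaults above map to a request payload roughly like this. The model id is a placeholder; `top_k` and `min_p` are llama.cpp/LM Studio extensions rather than core OpenAI parameters, though local servers generally accept them:

```python
import json

# Thinking-mode (general) sampling defaults from the lists above.
payload = {
    "model": "qwen3.5-35b-a3b",  # placeholder model id
    "messages": [{"role": "user", "content": "Explain YaRN scaling."}],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 20,       # llama.cpp/LM Studio extension
    "min_p": 0.0,      # llama.cpp/LM Studio extension
    "presence_penalty": 1.5,
    "max_tokens": 32768,
}

print(json.dumps(payload, indent=2))
```

For coding workloads, swap in the coding-mode values (temperature 0.6, presence penalty 0.0) per the list above.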
Best Practices
- Maintain minimum 128K context to preserve thinking capabilities
- Exclude thinking content from multi-turn conversation history
- For math: "Please reason step by step, and put your final answer within \boxed{}."
- For multiple choice: request JSON output like {"answer": "C"}
11. Open Questions / Limitations
- Qwen3.5 on gfx1151 ROCm: LM Studio's ROCm backend crashes on Strix Halo due to missing gfx1151 data files. Building llama.cpp from source with ROCm 7.x works but requires manual setup.
- Vulkan long-context degradation: Vulkan performance drops significantly beyond ~4K context on Strix Halo. ROCm with Flash Attention is needed for long-context workloads, creating a backend choice dilemma.
- Quantizer quality debate: KLD and perplexity metrics do not always predict real-world task performance. The "best" quantizer depends on the specific use case. More task-based evaluation is needed.
- 122B-A10B viability at 64GB: Only fits at 2-bit or aggressive 3-bit. Quality at these compression levels for a 122B MoE is not well-characterized.
- Unsloth Studio AMD training: Not yet supported. Timeline unclear ("coming soon").
- Multi-token Prediction (MTP): Qwen3.5 supports MTP for faster generation, but llama.cpp support status for this feature on the MoE variants needs verification.
- Speculative decoding: Qwen3.5-9B as a draft model for 35B-A3B has been discussed but needs benchmarking on Strix Halo specifically.
Sources
- Qwen/Qwen3.5-35B-A3B Model Card
- QwenLM/Qwen3.5 GitHub
- unsloth/Qwen3.5-35B-A3B-GGUF
- unsloth/Qwen3.5-27B-GGUF
- unsloth/Qwen3.5-122B-A10B-GGUF
- bartowski/Qwen_Qwen3.5-35B-A3B-GGUF
- bartowski/Qwen_Qwen3.5-27B-GGUF
- noctrex/Qwen3.5-35B-A3B-MXFP4_MOE-GGUF
- Unsloth Dynamic 2.0 GGUFs Documentation
- Qwen3.5 GGUF Benchmarks (Unsloth)
- Unsloth Studio Documentation
- Qwen3.5 Local Running Guide (Unsloth)
- Summary of Qwen3.5 GGUF Evaluations (kaitchup)
- LM Studio Vulkan on Strix Halo (SmartTechLabs)
- LM Studio on Ryzen AI
- Strix Halo llama.cpp Performance Wiki
- AMD Strix Halo Backend Benchmarks
- Strix Halo LLM Optimization (hardware-corner.net)
- Qwen3.5 Small Models (Artificial Analysis)
- Qwen 3.5 9B Beats 120B Models (VentureBeat)
- AMD ROCm 7 Strix Halo Performance (Phoronix)
- Qwen3.5 Blog (qwen.ai)