# Qwen 3.5 Model Family: Research Summary for Strix Halo (64GB)
**Date**: 2026-03-26

**Target Hardware**: AMD Ryzen AI MAX+ 395 / Radeon 8060S (gfx1151), 64 GB unified LPDDR5x, Fedora 43

**Focus**: GGUF quantized models for llama.cpp inference

---
## Scope

This report covers the Qwen3.5 model family (released February-March 2026), with emphasis on GGUF quantization options, file sizes, memory-fit analysis for 64 GB unified memory, a comparison of GGUF quantizers (Unsloth vs. bartowski vs. others), Unsloth Studio capabilities, and LM Studio backend support on AMD Strix Halo. Out of scope: cloud API pricing, full-precision training, and non-GGUF formats (AWQ, GPTQ, EXL2).

---
## 1. Qwen3.5 Model Family Overview

Released mid-February 2026 (medium/large models) and March 2, 2026 (small models), under the Apache 2.0 license. All models share the Gated DeltaNet hybrid architecture: a 3:1 ratio of linear-attention (Gated DeltaNet) blocks to full softmax-attention blocks. Native 262K context window, extensible to 1,010,000 tokens via YaRN scaling. Supports 201 languages, native multimodality (vision + language), and a hybrid thinking/non-thinking mode.

| Model | Type | Total Params | Active Params | Architecture |
|-------|------|--------------|---------------|--------------|
| Qwen3.5-397B-A17B | MoE | 397B | 17B | 256 experts, 8 routed + 1 shared |
| Qwen3.5-122B-A10B | MoE | 122B | 10B | 256 experts, 8 routed + 1 shared |
| **Qwen3.5-35B-A3B** | **MoE** | **35B** | **3B** | **256 experts, 8 routed + 1 shared** |
| **Qwen3.5-27B** | **Dense** | **27B** | **27B** | **Full activation** |
| Qwen3.5-9B | Dense | 9B | 9B | Gated DeltaNet hybrid |
| Qwen3.5-4B | Dense | 4B | 4B | Gated DeltaNet hybrid |
| Qwen3.5-2B | Dense | 2B | 2B | Gated DeltaNet hybrid |
| Qwen3.5-0.8B | Dense | 0.8B | 0.8B | Gated DeltaNet hybrid |

---
## 2. Qwen3.5-35B-A3B (MoE) -- Detailed Analysis

### Architecture Specs

- Hidden dimension: 2048
- Token embedding: 248,320 (padded)
- Layers: 40
- Layer layout: 10 x (3 x (Gated DeltaNet -> MoE) -> 1 x (Gated Attention -> MoE))
- MoE: 256 total experts, 8 routed + 1 shared active, expert intermediate dim 512
- Linear attention heads: 32 (V), 16 (QK), head dim 128
- Gated attention heads: 16 (Q), 2 (KV), head dim 256
- BF16 model size: 69.4 GB

### GGUF Quantizations (Unsloth)

Source: [unsloth/Qwen3.5-35B-A3B-GGUF](https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF), updated March 5, 2026 with improved imatrix data.

| Quantization | Size (GB) | Fits 64GB? | Notes |
|--------------|-----------|------------|-------|
| UD-IQ2_XXS | 10.7 | Yes | Ultra-compressed, noticeable quality loss |
| UD-IQ2_M | 11.4 | Yes | |
| UD-Q2_K_XL | 12.2 | Yes | |
| UD-IQ3_XXS | 13.1 | Yes | |
| UD-IQ3_S | 13.6 | Yes | |
| Q3_K_S | 15.3 | Yes | |
| Q3_K_M | 16.4 | Yes | |
| UD-Q3_K_XL | 16.6 | Yes | |
| UD-IQ4_XS | 17.5 | Yes | |
| UD-IQ4_NL | 17.8 | Yes | |
| Q4_K_S | 20.7 | Yes | |
| MXFP4_MOE | 21.6 | Yes | MoE-optimized mixed precision |
| Q4_K_M | 22.0 | **Yes** | **Recommended sweet spot** |
| UD-Q4_K_XL | 22.2 | Yes | Dynamic 2.0, best 4-bit |
| Q5_K_S | 24.8 | Yes | |
| Q5_K_M | 26.2 | Yes | |
| UD-Q5_K_XL | 26.4 | Yes | |
| UD-Q6_K_S | 28.5 | Yes | |
| Q6_K | 28.9 | Yes | |
| UD-Q6_K_XL | 32.1 | Yes | |
| Q8_0 | 36.9 | Yes | High quality, fits with room to spare |
| UD-Q8_K_XL | 48.7 | Yes* | Tight -- only ~15 GB left for KV cache |
| BF16 | 69.4 | **No** | Exceeds 64 GB |

**Key finding**: Every quantization except BF16 fits in 64 GB. Even Q8_0 at 36.9 GB leaves ~27 GB for KV cache and OS overhead, which is excellent. Because the MoE architecture activates only 3B parameters per token, generation is fast relative to the total model size.

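The fit figures above are simple arithmetic, and the same check extends to any model in this report. A minimal sketch, assuming a 6 GB OS/desktop reserve (an assumption, not a measured figure):

```python
# Rough 64 GB unified-memory fit check for Qwen3.5-35B-A3B GGUF files.
# Sizes come from the table above; the 6 GB OS/desktop reserve is an assumption.
TOTAL_GB = 64.0
OS_RESERVE_GB = 6.0  # hypothetical headroom for Fedora + desktop + llama.cpp buffers

QUANT_SIZES_GB = {
    "UD-Q4_K_XL": 22.2,
    "Q8_0": 36.9,
    "UD-Q8_K_XL": 48.7,
    "BF16": 69.4,
}

def kv_budget_gb(model_gb: float) -> float:
    """Memory left for KV cache after weights and the OS reserve."""
    return TOTAL_GB - OS_RESERVE_GB - model_gb

for name, size in QUANT_SIZES_GB.items():
    budget = kv_budget_gb(size)
    verdict = "fits" if budget > 0 else "does NOT fit"
    print(f"{name:12s} {size:5.1f} GB -> {budget:6.1f} GB KV budget ({verdict})")
```

With this reserve, Q8_0 still leaves ~21 GB purely for KV cache, consistent with the ~27 GB combined KV + OS figure above.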
### Benchmark Results (Official, from Model Card)

| Benchmark | Qwen3.5-35B-A3B | GPT-5-mini | Notes |
|-----------|-----------------|------------|-------|
| MMLU-Pro | 85.3 | 83.7 | Outperforms |
| C-Eval | 90.2 | 82.2 | Outperforms |
| GPQA Diamond | 84.2 | 82.8 | Outperforms |
| SWE-bench Verified | 69.2 | 72.0 | Slightly behind |
| LiveCodeBench v6 | 74.6 | 80.5 | Behind on coding |
| MMMU (vision) | 81.4 | 79.0 | Outperforms |
| MathVision | 83.9 | 71.9 | Strongly outperforms |
| VideoMME (w/ sub.) | 86.6 | 83.5 | Outperforms |

### Strix Halo Performance Estimates

Based on Qwen3-30B-A3B benchmarks (a similar-architecture predecessor):

| Backend | pp512 (t/s) | tg128 (t/s) | Context |
|---------|-------------|-------------|---------|
| Vulkan RADV | ~755 | ~85 | Short |
| Vulkan AMDVLK | ~742 | ~82 | Short |
| ROCm hipBLASlt | ~652 | ~64 | Short |
| ROCm rocWMMA (tuned) | ~659 | ~68 | Short |
| Vulkan RADV | ~17 | ~13 | 130K |
| ROCm hipBLASlt | ~40 | ~5 | 130K |

**Key insight**: Vulkan wins on short-context token generation; ROCm wins on long-context prompt processing. For interactive chat at short-to-medium context, Vulkan RADV is the best backend on Strix Halo.

---
## 3. Qwen3.5-27B (Dense) -- Detailed Analysis

Source: [unsloth/Qwen3.5-27B-GGUF](https://huggingface.co/unsloth/Qwen3.5-27B-GGUF)

The only dense (non-MoE) model in the medium range. All 27B parameters activate on every forward pass, so token generation is slower than on 35B-A3B despite the smaller total parameter count. BF16 size: 53.8 GB.

### GGUF Quantizations (Unsloth)

| Quantization | Size (GB) | Fits 64GB? | Notes |
|--------------|-----------|------------|-------|
| UD-IQ2_XXS | 8.57 | Yes | |
| UD-IQ2_M | 10.2 | Yes | |
| UD-Q2_K_XL | 11.2 | Yes | |
| UD-IQ3_XXS | 11.5 | Yes | |
| Q3_K_S | 12.3 | Yes | |
| Q3_K_M | 13.5 | Yes | |
| UD-Q3_K_XL | 14.4 | Yes | |
| IQ4_XS | 15.0 | Yes | |
| Q4_0 | 15.7 | Yes | |
| IQ4_NL | 15.7 | Yes | |
| Q4_K_S | 15.8 | Yes | |
| Q4_K_M | 16.7 | **Yes** | **Recommended** |
| Q4_1 | 17.2 | Yes | |
| UD-Q4_K_XL | 17.6 | Yes | Dynamic 2.0 |
| Q5_K_S | 18.9 | Yes | |
| Q5_K_M | 19.6 | Yes | |
| UD-Q5_K_XL | 20.2 | Yes | |
| Q6_K | 22.5 | Yes | |
| UD-Q6_K_XL | 25.7 | Yes | |
| Q8_0 | 28.6 | Yes | Plenty of room |
| UD-Q8_K_XL | 35.5 | Yes | Good quality + headroom |
| BF16 | 53.8 | Yes* | Tight -- only ~10 GB for KV cache |

**Key finding**: All quantizations fit in 64 GB, including (barely) BF16. However, because this is a dense model with 27B active parameters, token generation will be significantly slower than on 35B-A3B, which activates only 3B. For interactive use on Strix Halo, the 35B-A3B MoE is likely the better choice despite its larger file size.

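The speed gap follows directly from memory bandwidth: token generation on Strix Halo is bandwidth-bound, so a rough ceiling is effective bandwidth divided by bytes read per token. The 210 GB/s effective bandwidth and 0.6 bytes/weight (~4.8 bpw at Q4) figures below are illustrative assumptions, not measurements:

```python
# Rough upper bound for token-generation speed on a bandwidth-bound system:
#   t/s ~= effective memory bandwidth / bytes read per token,
# where bytes per token ~= active params * bytes per weight.
# Both constants below are assumptions for illustration.
BANDWIDTH_GBS = 210.0        # assumed effective LPDDR5x bandwidth
BYTES_PER_WEIGHT_Q4 = 0.6    # ~4.8 bits per weight at Q4-class quantization

def tg_upper_bound(active_params_b: float) -> float:
    """Ceiling on tokens/s given active parameters in billions."""
    bytes_per_token_gb = active_params_b * BYTES_PER_WEIGHT_Q4
    return BANDWIDTH_GBS / bytes_per_token_gb

print(f"35B-A3B (3B active):    ~{tg_upper_bound(3.0):.0f} t/s ceiling")
print(f"27B dense (27B active): ~{tg_upper_bound(27.0):.0f} t/s ceiling")
```

Under these assumptions the dense 27B ceiling lands near ~13 t/s, consistent with the ~10-15 t/s estimate for the dense model, while the MoE's ceiling (~117 t/s) comfortably accommodates the observed ~85 t/s.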
### 35B-A3B vs 27B: Which to Run?

| Factor | 35B-A3B (MoE) | 27B (Dense) |
|--------|---------------|-------------|
| Active params | 3B | 27B |
| Token gen speed | ~85 t/s (Vulkan) | ~10-15 t/s (estimated) |
| Quality (MMLU-Pro) | 85.3 | Comparable |
| Memory (Q4_K_M) | 22.0 GB | 16.7 GB |
| Memory (Q8_0) | 36.9 GB | 28.6 GB |
| Best for | Interactive chat, speed | Batch processing, quality |

**Recommendation**: For interactive inference on 64 GB Strix Halo, strongly prefer Qwen3.5-35B-A3B. The MoE architecture is ideal for unified-memory systems: only 3B parameters are active per token, yielding much faster generation despite the larger total weight file.

---
## 4. Qwen3.5-122B-A10B (MoE) -- Stretch Goal

Source: [unsloth/Qwen3.5-122B-A10B-GGUF](https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF)

BF16 size: 244 GB. This is the next tier up from 35B-A3B.

### Quantizations That Fit 64GB

| Quantization | Size (GB) | Fit? | Notes |
|--------------|-----------|------|-------|
| UD-IQ1_M | 34.2 | Yes | 1-bit, quality concerns |
| UD-IQ2_XXS | 36.6 | Yes | Very compressed |
| UD-IQ2_M | 39.1 | Yes | |
| UD-Q2_K_XL | 41.8 | Yes | |
| UD-IQ3_XXS | 44.7 | Yes | |
| UD-IQ3_S | 46.6 | Yes* | Tight with KV cache |
| Q3_K_S | 52.5 | Marginal | Very little KV headroom |
| Q3_K_M | 56.4 | No | Leaves <8 GB for everything else |
| Q4_K_M+ | 76.5+ | No | Does not fit |

**Warning**: Q3-level quantization of the 122B model has been reported to produce garbled output, infinite repetition, and failures on tool calls and code generation. UD-Q2_K_XL (41.8 GB) is the recommended minimum viable quantization.

**Verdict**: Possible at 2-bit, but risky. Quality at the IQ2 level on a 122B MoE is largely untested for production use. The 35B-A3B at Q8_0 (36.9 GB) is likely higher quality than the 122B at IQ2 (36.6 GB), and much safer. Not recommended for 64 GB systems unless you specifically need the 10B active-parameter count.

---
## 5. Qwen3.5 Small Models (Worth Benchmarking)

### Qwen3.5-9B

The standout small model. It outperforms models 3-13x its size:

- GPQA Diamond: 81.7 (vs GPT-OSS-120B: 71.5)
- HMMT Feb 2025: 83.2
- MMMU-Pro: 70.1 (beats Gemini 2.5 Flash-Lite at 59.7)

At Q4_K_M, the 9B model needs roughly 6-7 GB and runs comfortably on any hardware covered here. It is also useful as a draft model for speculative decoding with the 35B-A3B.

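Whether a 9B draft pays off can be reasoned about with the standard expected-accepted-tokens formula for speculative decoding. The acceptance rate and draft length below are placeholders, not Strix Halo measurements:

```python
# Expected tokens produced per target-model verification pass when drafting
# k tokens with per-token acceptance probability p (geometric acceptance model):
#   E[tokens] = (1 - p**(k + 1)) / (1 - p)
# p and k values here are illustrative placeholders, not measured values.
def expected_tokens_per_pass(p: float, k: int) -> float:
    if p >= 1.0:
        return k + 1.0
    return (1.0 - p ** (k + 1)) / (1.0 - p)

for p in (0.6, 0.8, 0.9):
    e = expected_tokens_per_pass(p, k=4)
    print(f"acceptance {p:.1f} -> {e:.2f} tokens per 35B-A3B pass")
```

The draft only helps if it is cheap relative to the target; since 35B-A3B activates just 3B parameters per token, the margin over a 9B draft is narrower than for dense targets, which is why this combination needs actual benchmarking (see section 11).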
### Qwen3.5-4B

Performance approaches the previous-generation Qwen3-80B-A3B (20x larger in total parameters). Excellent for on-device/edge tasks; ~3 GB at Q4_K_M.

---
## 6. Best GGUF Quantizers: Unsloth vs bartowski vs Others

### Providers Compared

| Provider | Approach | Strengths |
|----------|----------|-----------|
| **Unsloth** | Dynamic 2.0: per-layer adaptive quantization, 1.5M+ token calibration dataset | Best at low bit-rates (Q2, Q3), model-specific tuning, fast updates |
| **bartowski** | Custom imatrix calibration, upstream llama.cpp PR for improved tensor recipes | Lower KLD at Q4_K_M in some tests, stable quality |
| **noctrex** | MXFP4 for MoE experts + Q8/BF16 for the rest | Specialized for MoE models |
| **ubergarm** | Standard llama.cpp quantization | Reliable baseline |
| **AesSedai** | imatrix-based | Good coverage, sometimes outperformed by Unsloth Dynamic |
| **mradermacher** | Mass-produced quants across many models | Broad coverage, less specialized |

### Head-to-Head: Unsloth vs bartowski

On standard KLD benchmarks (Qwen QwQ-32B comparison):

- bartowski Q4_K_M: 0.0087 KLD
- Unsloth Q4_K_M: 0.0222 KLD
- bartowski IQ4_XS: 0.0127 KLD at 4.93 GiB

However, on real-world task evaluations (LiveCodeBench v6, MMLU-Pro), Unsloth Dynamic IQ2_XXS outperformed AesSedai IQ3_S despite being 11 GB smaller -- demonstrating that KLD and perplexity alone do not predict task performance.

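KLD here is the Kullback-Leibler divergence between the full-precision model's next-token distribution and the quantized model's, averaged over a test corpus. A minimal sketch of the per-position computation, using toy three-token distributions rather than real model outputs:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions over the same vocab."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions: full-precision vs. quantized model.
p_full = [0.70, 0.20, 0.10]
q_quant = [0.65, 0.23, 0.12]

print(f"KLD = {kl_divergence(p_full, q_quant):.4f}")
```

A small mismatch like this yields a KLD on the order of the table values above; the metric only measures distributional drift, which is exactly why it can diverge from downstream task accuracy.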
### Recommendation

- **Q4 and above**: bartowski and Unsloth are both excellent; bartowski may have slightly lower KLD at Q4_K_M. Either is a safe choice.
- **Q3 and below**: Unsloth Dynamic 2.0 (UD- prefix) is the clear winner. The per-layer adaptive approach preserves critical layers at higher precision.
- **MoE-specific**: noctrex MXFP4_MOE is worth testing if you want pure MoE-optimized quantization.
- **Overall**: For Qwen3.5-35B-A3B, use **Unsloth UD-Q4_K_XL** (22.2 GB), or **Q8_0** (36.9 GB) for maximum quality. From bartowski, use their Q4_K_M.

### imatrix Note

All modern GGUF quantizers now use imatrix (importance-matrix) calibration, which is applied at quantization time and significantly improves quality at low bit-rates. The calibration dataset matters: Unsloth uses 1.5M+ hand-curated tokens, while bartowski uses different calibration texts optimized for different use cases.

---
## 7. Unsloth Studio

### What It Is

Unsloth Studio is an open-source, no-code web UI for training and running LLMs locally. Released March 17, 2026 (beta). Dual-licensed: Apache 2.0 (core) + AGPL-3.0 (UI).

### Installation

```bash
# macOS, Linux, WSL
curl -fsSL https://unsloth.ai/install.sh | sh

# Launch
unsloth studio -H 0.0.0.0 -p 8888
```

### Capabilities

| Feature | Details |
|---------|---------|
| **Inference** | Run GGUF and safetensors models with tool calling, web search, OpenAI-compatible API |
| **Fine-tuning** | SFT, GRPO (RL), 500+ models, 2x faster, 70% less VRAM |
| **Data Recipes** | Auto-create datasets from PDF, CSV, JSON, DOCX, TXT |
| **Model Arena** | Side-by-side comparison of two models |
| **Export** | Save to GGUF or safetensors |
| **Multimodal** | Text, vision, TTS audio, embedding models |

### Platform Support

| Platform | Inference | Training |
|----------|-----------|----------|
| Linux (NVIDIA) | Yes | Yes |
| Linux (AMD) | Yes | Coming soon |
| Linux (CPU) | Yes | No |
| macOS | Yes (CPU only) | Coming (MLX) |
| Windows | Yes | Yes |

### Relevance for Strix Halo

Unsloth Studio runs inference through a llama.cpp backend, so it should work on Strix Halo for **running** models. Training currently requires NVIDIA or Intel GPUs, so fine-tuning is not yet supported on AMD. The inference component is essentially a web UI wrapper around llama.cpp, similar to LM Studio but with integrated training capabilities.

**Verdict**: Useful for inference on Strix Halo; not yet useful for training on AMD. If you only need inference, LM Studio or raw llama.cpp may be simpler. If you want training + inference in one tool (once AMD support arrives), Unsloth Studio is worth watching.

---
## 8. LM Studio on AMD Strix Halo

### Backend Status

| Backend | Status | Notes |
|---------|--------|-------|
| **Vulkan** | **Working, recommended** | Best for general inference, no special config needed |
| ROCm | Partially broken | gfx1151 declared supported but data files missing; crashes on inference |
| CPU | Working | Slow fallback |

### Vulkan Configuration

LM Studio with Vulkan is the most reliable path on Strix Halo:

```json
{
  "llm.gpu.backend": "vulkan",
  "llm.gpu.device": "auto",
  "llm.gpu.layers": -1
}
```

Verify GPU detection: `vulkaninfo | grep "GPU id"`

An automated installer exists: [smarttechlabs-projects/strix-halo-lmstudio](https://github.com/smarttechlabs-projects/strix-halo-lmstudio)

### Performance Expectations (LM Studio / Vulkan, 128GB system)

| Model Size | Quant | Throughput |
|------------|-------|------------|
| 7B | Q4 | 30-40 t/s |
| 13B | Q4 | 20-30 t/s |
| 30B MoE | Q4 | ~50+ t/s (MoE advantage) |
| 70B | Q4 | 5-8 t/s |

On a 64 GB system, expect similar per-token speeds but lower maximum context lengths before memory pressure sets in.

### ROCm Status and Future

AMD's Ryzen AI Halo Mini PC (Q2 2026) will ship with ROCm 7.2.2 optimizations for LM Studio. As of January 2026, stable ROCm + Linux configurations exist for Strix Halo (documented at Framework Community). The gfx1151 ROCm issue in LM Studio specifically is a packaging problem (missing data files), not a fundamental incompatibility.

For now: use **Vulkan for short-to-medium context**, or build **llama.cpp from source with ROCm** for long-context workloads, where Flash Attention matters.

### LM Studio Unsloth Dynamic 2.0 Note

There was a reported issue (GitHub #1594) where Unsloth Dynamic 2.0 (UD-) GGUF variants were not shown in LM Studio's download options. Verify that LM Studio is updated to the latest version, or download the GGUF files manually from HuggingFace and load them directly.

---
## 9. Recommended Configurations for 64GB Strix Halo

### Primary: Qwen3.5-35B-A3B (MoE)

| Use Case | Quantization | Size | KV Budget | Context Est. |
|----------|--------------|------|-----------|--------------|
| Maximum quality | Q8_0 | 36.9 GB | ~25 GB | ~32K-65K |
| Best balance | UD-Q4_K_XL | 22.2 GB | ~40 GB | ~65K-131K |
| Maximum context | UD-IQ3_XXS | 13.1 GB | ~49 GB | ~131K+ |
| Speed test | Q4_K_M | 22.0 GB | ~40 GB | ~65K-131K |

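The KV-budget column can be sanity-checked from the architecture specs in section 2. The sketch below applies the standard GQA KV-cache formula to the ten full-attention layers only (2 KV heads, head dim 256), assumes an FP16 cache, and ignores the constant-size linear-attention state and runtime buffers:

```python
# Standard GQA KV-cache size per token:
#   2 (K and V) * attention layers * kv_heads * head_dim * bytes per element.
# Layer/head figures are the Qwen3.5-35B-A3B specs from section 2;
# the FP16 (2-byte) cache is an assumption.
FULL_ATTN_LAYERS = 10  # 1 gated-attention layer per group of 4, of 40 layers total
KV_HEADS = 2
HEAD_DIM = 256
BYTES_FP16 = 2

def kv_cache_gb(context_tokens: int) -> float:
    """Softmax-attention KV-cache size in GB for a given context length."""
    per_token = 2 * FULL_ATTN_LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16
    return per_token * context_tokens / 1e9

for ctx in (32_768, 131_072, 262_144):
    print(f"{ctx:>7} tokens -> {kv_cache_gb(ctx):5.2f} GB KV cache")
```

Because only a quarter of the layers carry a softmax KV cache, even the full 262K context needs only a few GB under these assumptions; the budgets in the table are mostly safety margin rather than cache itself.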
### Secondary: Qwen3.5-27B (Dense)

| Use Case | Quantization | Size | KV Budget | Notes |
|----------|--------------|------|-----------|-------|
| Quality comparison | Q8_0 | 28.6 GB | ~33 GB | Slower gen than 35B-A3B |
| Balanced | Q4_K_M | 16.7 GB | ~45 GB | |

### Quick Reference: Qwen3.5-9B (Small/Draft)

| Use Case | Quantization | Size |
|----------|--------------|------|
| Speculative decoding draft | Q4_K_M | ~6 GB |
| Standalone small model | Q8_0 | ~10 GB |

---
## 10. Sampling Parameters (Official Recommendations)

### Thinking Mode (General)

- Temperature: 1.0
- Top-p: 0.95
- Top-k: 20
- Min-p: 0.0
- Presence penalty: 1.5
- Max output: 32,768 tokens (general) or 81,920 (math/coding)

### Thinking Mode (Coding)

- Temperature: 0.6
- Top-p: 0.95
- Top-k: 20
- Presence penalty: 0.0

### Non-Thinking / Instruct Mode

- Temperature: 0.7
- Top-p: 0.8
- Top-k: 20
- Presence penalty: 1.5

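As an example of applying these settings, the non-thinking defaults might be packaged as an OpenAI-compatible chat request against a local llama.cpp or LM Studio server. The model name and message are placeholders, and `top_k` is a llama.cpp extension rather than part of the core OpenAI schema:

```python
import json

# Non-thinking / instruct-mode defaults from above, as an OpenAI-compatible
# chat-completions payload. Model name and message are placeholders.
payload = {
    "model": "qwen3.5-35b-a3b",
    "messages": [{"role": "user", "content": "Summarize GGUF quantization."}],
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,             # llama.cpp extension to the OpenAI schema
    "presence_penalty": 1.5,
}

print(json.dumps(payload, indent=2))
```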
### Best Practices

- Maintain at least a 128K context window to preserve thinking capabilities
- Exclude thinking content from multi-turn conversation history
- For math: "Please reason step by step, and put your final answer within \boxed{}."
- For multiple choice: request JSON output such as {"answer": "C"}

---
## 11. Open Questions / Limitations

1. **Qwen3.5 on gfx1151 ROCm**: LM Studio's ROCm backend crashes on Strix Halo due to missing gfx1151 data files. Building llama.cpp from source with ROCm 7.x works but requires manual setup.

2. **Vulkan long-context degradation**: Vulkan performance drops significantly beyond ~4K context on Strix Halo. ROCm with Flash Attention is needed for long-context workloads, creating a backend-choice dilemma.

3. **Quantizer quality debate**: KLD and perplexity metrics do not always predict real-world task performance. The "best" quantizer depends on the specific use case; more task-based evaluation is needed.

4. **122B-A10B viability at 64GB**: Only fits at 2-bit or aggressive 3-bit. Quality at these compression levels for a 122B MoE is not well characterized.

5. **Unsloth Studio AMD training**: Not yet supported. Timeline unclear ("coming soon").

6. **Multi-token prediction (MTP)**: Qwen3.5 supports MTP for faster generation, but llama.cpp support for this feature on the MoE variants needs verification.

7. **Speculative decoding**: Qwen3.5-9B as a draft model for 35B-A3B has been discussed but needs benchmarking on Strix Halo specifically.

---
## Sources

- [Qwen/Qwen3.5-35B-A3B Model Card](https://huggingface.co/Qwen/Qwen3.5-35B-A3B)
- [QwenLM/Qwen3.5 GitHub](https://github.com/QwenLM/Qwen3.5)
- [unsloth/Qwen3.5-35B-A3B-GGUF](https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF)
- [unsloth/Qwen3.5-27B-GGUF](https://huggingface.co/unsloth/Qwen3.5-27B-GGUF)
- [unsloth/Qwen3.5-122B-A10B-GGUF](https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF)
- [bartowski/Qwen_Qwen3.5-35B-A3B-GGUF](https://huggingface.co/bartowski/Qwen_Qwen3.5-35B-A3B-GGUF)
- [bartowski/Qwen_Qwen3.5-27B-GGUF](https://huggingface.co/bartowski/Qwen_Qwen3.5-27B-GGUF)
- [noctrex/Qwen3.5-35B-A3B-MXFP4_MOE-GGUF](https://huggingface.co/noctrex/Qwen3.5-35B-A3B-MXFP4_MOE-GGUF)
- [Unsloth Dynamic 2.0 GGUFs Documentation](https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs)
- [Qwen3.5 GGUF Benchmarks (Unsloth)](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks)
- [Unsloth Studio Documentation](https://unsloth.ai/docs/new/studio)
- [Qwen3.5 Local Running Guide (Unsloth)](https://unsloth.ai/docs/models/qwen3.5)
- [Summary of Qwen3.5 GGUF Evaluations (kaitchup)](https://kaitchup.substack.com/p/summary-of-qwen35-gguf-evaluations)
- [LM Studio Vulkan on Strix Halo (SmartTechLabs)](https://www.smarttechlabs.de/blog/2026-01-14-lmstudio-strix-halo/)
- [LM Studio on Ryzen AI](https://lmstudio.ai/ryzenai)
- [Strix Halo llama.cpp Performance Wiki](https://strixhalo.wiki/AI/llamacpp-performance)
- [AMD Strix Halo Backend Benchmarks](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [Strix Halo LLM Optimization (hardware-corner.net)](https://www.hardware-corner.net/strix-halo-llm-optimization/)
- [Qwen3.5 Small Models (Artificial Analysis)](https://artificialanalysis.ai/articles/qwen3-5-small-models)
- [Qwen 3.5 9B Beats 120B Models (VentureBeat)](https://venturebeat.com/technology/alibabas-small-open-source-qwen3-5-9b-beats-openais-gpt-oss-120b-and-can-run)
- [AMD ROCm 7 Strix Halo Performance (Phoronix)](https://www.phoronix.com/review/amd-rocm-7-strix-halo/4)
- [Qwen3.5 Blog (qwen.ai)](https://qwen.ai/blog?id=qwen3.5)