# Qwen 3.5 Model Family: Research Summary for Strix Halo (64GB)

**Date**: 2026-03-26
**Target Hardware**: AMD Ryzen AI MAX+ 395 / Radeon 8060S (gfx1151), 64 GB unified LPDDR5x, Fedora 43
**Focus**: GGUF quantized models for llama.cpp inference

---

## Scope

This report covers the Qwen3.5 model family (released February-March 2026) with emphasis on GGUF quantization options, file sizes, memory-fit analysis for 64 GB unified memory, GGUF quantizer comparison (Unsloth vs bartowski vs others), Unsloth Studio capabilities, and LM Studio backend support on AMD Strix Halo. Out of scope: cloud API pricing, full-precision training, non-GGUF formats (AWQ, GPTQ, EXL2).

---

## 1. Qwen3.5 Model Family Overview

Released mid-February 2026 (medium/large) and March 2, 2026 (small), licensed Apache 2.0. All models share the Gated DeltaNet hybrid architecture: a 3:1 ratio of linear attention (Gated DeltaNet) to full softmax attention blocks. Native 262K context window, extensible to 1,010,000 tokens via YaRN scaling. Supports 201 languages. Native multimodal (vision + language). Hybrid thinking/non-thinking mode.

| Model | Type | Total Params | Active Params | Architecture |
|-------|------|--------------|---------------|--------------|
| Qwen3.5-397B-A17B | MoE | 397B | 17B | 256 experts, 8 routed + 1 shared |
| Qwen3.5-122B-A10B | MoE | 122B | 10B | 256 experts, 8 routed + 1 shared |
| **Qwen3.5-35B-A3B** | **MoE** | **35B** | **3B** | **256 experts, 8 routed + 1 shared** |
| **Qwen3.5-27B** | **Dense** | **27B** | **27B** | **Full activation** |
| Qwen3.5-9B | Dense | 9B | 9B | Gated DeltaNet hybrid |
| Qwen3.5-4B | Dense | 4B | 4B | Gated DeltaNet hybrid |
| Qwen3.5-2B | Dense | 2B | 2B | Gated DeltaNet hybrid |
| Qwen3.5-0.8B | Dense | 0.8B | 0.8B | Gated DeltaNet hybrid |

---

## 2. Qwen3.5-35B-A3B (MoE) -- Detailed Analysis

### Architecture Specs

- Hidden dimension: 2048
- Token embedding: 248,320 (padded)
- Layers: 40
- Layer layout: 10 x (3 x (Gated DeltaNet -> MoE) -> 1 x (Gated Attention -> MoE))
- MoE: 256 total experts, 8 routed + 1 shared active, expert intermediate dim 512
- Linear attention heads: 32 (V), 16 (QK), head dim 128
- Gated attention heads: 16 (Q), 2 (KV), head dim 256
- BF16 model size: 69.4 GB

### GGUF Quantizations (Unsloth)

Source: [unsloth/Qwen3.5-35B-A3B-GGUF](https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF). Updated March 5, 2026 with improved imatrix data.

| Quantization | Size (GB) | Fits 64GB? | Notes |
|--------------|-----------|------------|-------|
| UD-IQ2_XXS | 10.7 | Yes | Ultra-compressed, quality loss |
| UD-IQ2_M | 11.4 | Yes | |
| UD-Q2_K_XL | 12.2 | Yes | |
| UD-IQ3_XXS | 13.1 | Yes | |
| UD-IQ3_S | 13.6 | Yes | |
| Q3_K_S | 15.3 | Yes | |
| Q3_K_M | 16.4 | Yes | |
| UD-Q3_K_XL | 16.6 | Yes | |
| UD-IQ4_XS | 17.5 | Yes | |
| UD-IQ4_NL | 17.8 | Yes | |
| Q4_K_S | 20.7 | Yes | |
| MXFP4_MOE | 21.6 | Yes | MoE-optimized mixed precision |
| Q4_K_M | 22.0 | **Yes** | **Recommended sweet spot** |
| UD-Q4_K_XL | 22.2 | Yes | Dynamic 2.0, best 4-bit |
| Q5_K_S | 24.8 | Yes | |
| Q5_K_M | 26.2 | Yes | |
| UD-Q5_K_XL | 26.4 | Yes | |
| UD-Q6_K_S | 28.5 | Yes | |
| Q6_K | 28.9 | Yes | |
| UD-Q6_K_XL | 32.1 | Yes | |
| Q8_0 | 36.9 | Yes | High quality, fits with room |
| UD-Q8_K_XL | 48.7 | Yes* | Tight -- ~15 GB for KV cache |
| BF16 | 69.4 | **No** | Exceeds 64GB |

**Key finding**: Every quantization except BF16 fits in 64 GB. Even Q8_0 at 36.9 GB leaves ~27 GB for KV cache and OS overhead, which is excellent. The MoE architecture (only 3B active params) means token generation is fast relative to total model size.
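The fit numbers above can be double-checked with a quick sketch. The 6 GB OS/runtime reserve below is an assumption for a Fedora desktop plus llama.cpp buffers, not a measured figure:

```python
# Rough memory-fit check for a 64 GB unified-memory system.
# ASSUMPTION: ~6 GB reserved for OS/desktop/compute buffers (not measured).
TOTAL_GB = 64.0
OS_RESERVE_GB = 6.0

def kv_headroom_gb(model_file_gb: float) -> float:
    """Memory left for KV cache after loading the model and reserving for the OS."""
    return TOTAL_GB - OS_RESERVE_GB - model_file_gb

for name, size in [("Q4_K_M", 22.0), ("Q8_0", 36.9),
                   ("UD-Q8_K_XL", 48.7), ("BF16", 69.4)]:
    h = kv_headroom_gb(size)
    flag = "  <-- does not fit" if h < 0 else ""
    print(f"{name:>11}: {h:5.1f} GB headroom{flag}")
```

Anything with negative headroom is out; anything under roughly 8 GB of headroom will noticeably constrain context length.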
### Benchmark Results (Official, from Model Card)

| Benchmark | Qwen3.5-35B-A3B | GPT-5-mini | Notes |
|-----------|-----------------|------------|-------|
| MMLU-Pro | 85.3 | 83.7 | Outperforms |
| C-Eval | 90.2 | 82.2 | Outperforms |
| GPQA Diamond | 84.2 | 82.8 | Outperforms |
| SWE-bench Verified | 69.2 | 72.0 | Slightly behind |
| LiveCodeBench v6 | 74.6 | 80.5 | Behind on coding |
| MMMU (vision) | 81.4 | 79.0 | Outperforms |
| MathVision | 83.9 | 71.9 | Strongly outperforms |
| VideoMME (w/ sub.) | 86.6 | 83.5 | Outperforms |

### Strix Halo Performance Estimates

Based on Qwen3-30B-A3B benchmarks (similar architecture, predecessor):

| Backend | pp512 (t/s) | tg128 (t/s) | Context |
|---------|-------------|-------------|---------|
| Vulkan RADV | ~755 | ~85 | Short |
| Vulkan AMDVLK | ~742 | ~82 | Short |
| ROCm hipBLASlt | ~652 | ~64 | Short |
| ROCm rocWMMA (tuned) | ~659 | ~68 | Short |
| Vulkan RADV | ~17 | ~13 | 130K |
| ROCm hipBLASlt | ~40 | ~5 | 130K |

**Key insight**: Vulkan wins on short-context token generation. ROCm wins on long-context prompt processing. For interactive chat (short-medium context), Vulkan RADV is the best backend on Strix Halo.

---

## 3. Qwen3.5-27B (Dense) -- Detailed Analysis

Source: [unsloth/Qwen3.5-27B-GGUF](https://huggingface.co/unsloth/Qwen3.5-27B-GGUF)

The only dense (non-MoE) model in the medium range. All 27B parameters are active on every forward pass, so token generation is slower than the 35B-A3B despite the smaller total parameter count. BF16 size: 53.8 GB.

### GGUF Quantizations (Unsloth)

| Quantization | Size (GB) | Fits 64GB? | Notes |
|--------------|-----------|------------|-------|
| UD-IQ2_XXS | 8.57 | Yes | |
| UD-IQ2_M | 10.2 | Yes | |
| UD-Q2_K_XL | 11.2 | Yes | |
| UD-IQ3_XXS | 11.5 | Yes | |
| Q3_K_S | 12.3 | Yes | |
| Q3_K_M | 13.5 | Yes | |
| UD-Q3_K_XL | 14.4 | Yes | |
| IQ4_XS | 15.0 | Yes | |
| Q4_0 | 15.7 | Yes | |
| IQ4_NL | 15.7 | Yes | |
| Q4_K_S | 15.8 | Yes | |
| Q4_K_M | 16.7 | **Yes** | **Recommended** |
| Q4_1 | 17.2 | Yes | |
| UD-Q4_K_XL | 17.6 | Yes | Dynamic 2.0 |
| Q5_K_S | 18.9 | Yes | |
| Q5_K_M | 19.6 | Yes | |
| UD-Q5_K_XL | 20.2 | Yes | |
| Q6_K | 22.5 | Yes | |
| UD-Q6_K_XL | 25.7 | Yes | |
| Q8_0 | 28.6 | Yes | Plenty of room |
| UD-Q8_K_XL | 35.5 | Yes | Good quality + headroom |
| BF16 | 53.8 | Yes* | Tight -- only ~10 GB for KV cache |

**Key finding**: All quantizations fit in 64 GB, including BF16 (barely). However, because this is a dense model with 27B active params, token generation will be significantly slower than the 35B-A3B (which only activates 3B). For interactive use on Strix Halo, the 35B-A3B MoE is likely the better choice despite being larger on disk.

### 35B-A3B vs 27B: Which to Run?

| Factor | 35B-A3B (MoE) | 27B (Dense) |
|--------|---------------|-------------|
| Active params | 3B | 27B |
| Token gen speed | ~85 t/s (Vulkan) | ~10-15 t/s (estimated) |
| Quality (MMLU-Pro) | 85.3 | Comparable |
| Memory (Q4_K_M) | 22.0 GB | 16.7 GB |
| Memory (Q8_0) | 36.9 GB | 28.6 GB |
| Best for | Interactive chat, speed | Batch processing, quality |

**Recommendation**: For interactive inference on 64GB Strix Halo, strongly prefer Qwen3.5-35B-A3B. The MoE architecture is ideal for unified memory systems since only 3B params are active per token, yielding much faster generation despite the larger total weight file.

---

## 4. Qwen3.5-122B-A10B (MoE) -- Stretch Goal

Source: [unsloth/Qwen3.5-122B-A10B-GGUF](https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF)

BF16 size: 244 GB. This is the next tier up from 35B-A3B.
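Whether the bigger models are worth running comes down mostly to decode speed. On a unified-memory system, decode is roughly memory-bandwidth-bound, so active-parameter bytes per token set an upper bound on tokens/s. A back-of-envelope sketch, where every constant is an assumption (256 GB/s is the theoretical LPDDR5x-8000 figure for Strix Halo, 0.6 is a guessed efficiency factor, and the bits-per-weight values are approximate):

```python
# Decode speed ceiling: tokens/s ~= effective_bandwidth / bytes_read_per_token.
# ASSUMPTIONS: 256 GB/s theoretical bandwidth, 0.6 efficiency, approximate bpw values.
BANDWIDTH_GBPS = 256 * 0.6  # assumed effective GB/s; varies by backend

def est_tokens_per_sec(active_params_b: float, bits_per_weight: float) -> float:
    """Upper-bound decode t/s from active-parameter bytes streamed per token."""
    active_gb = active_params_b * bits_per_weight / 8  # GB read per generated token
    return BANDWIDTH_GBPS / active_gb

for name, active_b, bpw in [("35B-A3B @ Q4_K_M", 3, 4.8),
                            ("27B dense @ Q4_K_M", 27, 4.8),
                            ("122B-A10B @ UD-Q2_K_XL", 10, 2.7)]:
    print(f"{name}: ~{est_tokens_per_sec(active_b, bpw):.0f} t/s (upper bound)")
```

For the 35B-A3B at Q4_K_M this lands near the ~85 t/s measured with Vulkan RADV in section 2, which is what you would expect if decode is bandwidth-limited; for the dense 27B it predicts roughly 9-10 t/s.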
### Quantizations That Fit 64GB

| Quantization | Size (GB) | Fit? | Notes |
|--------------|-----------|------|-------|
| UD-IQ1_M | 34.2 | Yes | 1-bit, quality concerns |
| UD-IQ2_XXS | 36.6 | Yes | Very compressed |
| UD-IQ2_M | 39.1 | Yes | |
| UD-Q2_K_XL | 41.8 | Yes | |
| UD-IQ3_XXS | 44.7 | Yes | |
| UD-IQ3_S | 46.6 | Yes* | Tight with KV cache |
| Q3_K_S | 52.5 | Marginal | Very little KV headroom |
| Q3_K_M | 56.4 | No | Leaves <8 GB for everything else |
| Q4_K_M+ | 76.5+ | No | Does not fit |

**Warning**: Q3-level quantization of 122B has been reported to produce garbled output, infinite repetition, and failures on tool calls and code generation. The UD-Q2_K_XL (41.8 GB) is the recommended minimum viable quantization.

**Verdict**: Possible at 2-bit, but risky. Quality at IQ2 level on a 122B MoE model is largely untested for production use. The 35B-A3B at Q8_0 (36.9 GB) is likely higher quality than 122B at IQ2 (36.6 GB) and much safer. Not recommended for 64GB systems unless you specifically need the 10B active parameter count.

---

## 5. Qwen3.5 Small Models (Worth Benchmarking)

### Qwen3.5-9B

The standout small model. Outperforms models 3-13x its size:

- GPQA Diamond: 81.7 (vs GPT-OSS-120B: 71.5)
- HMMT Feb 2025: 83.2
- MMMU-Pro: 70.1 (beats Gemini 2.5 Flash-Lite at 59.7)

At Q4_K_M, the 9B model needs roughly 6-7 GB and runs comfortably on any hardware. Useful as a draft model for speculative decoding with the 35B-A3B.

### Qwen3.5-4B

Performance close to the previous Qwen3-80B-A3B (20x larger). Excellent for on-device/edge tasks. ~3 GB at Q4_K_M.

---

## 6. Best GGUF Quantizers: Unsloth vs bartowski vs Others

### Providers Compared

| Provider | Approach | Strengths |
|----------|----------|-----------|
| **Unsloth** | Dynamic 2.0: per-layer adaptive quantization, 1.5M+ token calibration dataset | Best at low bit-rates (Q2, Q3), model-specific tuning, fast updates |
| **bartowski** | Custom imatrix calibration, upstream llama.cpp PR for improved tensor recipes | Lower KLD at Q4_K_M in some tests, stable quality |
| **noctrex** | MXFP4 for MoE experts + Q8/BF16 for rest | Specialized for MoE models |
| **ubergarm** | Standard llama.cpp quantization | Reliable baseline |
| **AesSedai** | imatrix-based | Good coverage, sometimes outperformed by Unsloth Dynamic |
| **mradermacher** | Mass-produced quants across many models | Broad coverage, less specialized |

### Head-to-Head: Unsloth vs bartowski

On standard KLD benchmarks (Qwen QwQ-32B comparison):

- bartowski Q4_K_M: 0.0087 KLD
- Unsloth Q4_K_M: 0.0222 KLD
- bartowski IQ4_XS: 0.0127 KLD at 4.93 GiB

However, on real-world task evaluations (LiveCodeBench v6, MMLU Pro), Unsloth Dynamic IQ2_XXS outperformed AesSedai IQ3_S despite being 11 GB smaller -- demonstrating that KLD/perplexity alone do not predict task performance.

### Recommendation

- **Q4 and above**: bartowski and Unsloth are both excellent. bartowski may have slightly lower KLD at Q4_K_M. Either is a safe choice.
- **Q3 and below**: Unsloth Dynamic 2.0 (UD- prefix) is the clear winner. The per-layer adaptive approach preserves critical layers at higher precision.
- **MoE-specific**: noctrex MXFP4_MOE is worth testing if you want pure MoE-optimized quantization.
- **Overall**: For Qwen3.5-35B-A3B, use **Unsloth UD-Q4_K_XL** (22.2 GB) or **Q8_0** (36.9 GB) for maximum quality. For bartowski, use their Q4_K_M.

### imatrix Note

All modern GGUF quantizers now use imatrix (importance matrix) calibration. Computing the imatrix adds time to the quantization step, but it significantly improves quality at low bit-rates and adds no inference-time overhead, since the resulting file uses standard quant formats.
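For reference, an imatrix-calibrated quant is produced with llama.cpp's own tools roughly as follows. This is a sketch, not the exact recipe any of the providers above uses; file names are placeholders:

```bash
# Compute an importance matrix from a calibration text file.
./llama-imatrix -m model-f16.gguf -f calibration.txt -o imatrix.dat

# Quantize, using the imatrix to keep the most sensitive weights at higher precision.
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```

The choice of `calibration.txt` is exactly the "calibration dataset" the providers differentiate on.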
The calibration dataset matters: Unsloth uses 1.5M+ hand-curated tokens; bartowski uses different calibration texts optimized for different use cases.

---

## 7. Unsloth Studio

### What It Is

Unsloth Studio is an open-source, no-code web UI for training and running LLMs locally. Released March 17, 2026 (beta). Dual-licensed: Apache 2.0 (core) + AGPL-3.0 (UI).

### Installation

```bash
# macOS, Linux, WSL
curl -fsSL https://unsloth.ai/install.sh | sh

# Launch
unsloth studio -H 0.0.0.0 -p 8888
```

### Capabilities

| Feature | Details |
|---------|---------|
| **Inference** | Run GGUF and safetensor models with tool-calling, web search, OpenAI-compatible API |
| **Fine-tuning** | SFT, GRPO (RL), 500+ models, 2x faster, 70% less VRAM |
| **Data Recipes** | Auto-create datasets from PDF, CSV, JSON, DOCX, TXT |
| **Model Arena** | Side-by-side comparison of two models |
| **Export** | Save to GGUF or safetensors |
| **Multimodal** | Text, vision, TTS audio, embedding models |

### Platform Support

| Platform | Inference | Training |
|----------|-----------|----------|
| Linux (NVIDIA) | Yes | Yes |
| Linux (AMD) | Yes | Coming soon |
| Linux (CPU) | Yes | No |
| macOS | Yes (CPU only) | Coming (MLX) |
| Windows | Yes | Yes |

### Relevance for Strix Halo

Unsloth Studio provides inference via a llama.cpp backend, so it should work on Strix Halo for **running** models. Training currently requires NVIDIA or Intel GPUs, so fine-tuning is not yet supported on AMD. The inference component is essentially a web UI wrapper around llama.cpp, similar to LM Studio but with integrated training capabilities.

**Verdict**: Useful for inference on Strix Halo. Not yet useful for training on AMD. If you only need inference, LM Studio or raw llama.cpp may be simpler. If you want training + inference in one tool (when AMD support arrives), Unsloth Studio is worth watching.

---

## 8. LM Studio on AMD Strix Halo

### Backend Status

| Backend | Status | Notes |
|---------|--------|-------|
| **Vulkan** | **Working, recommended** | Best for general inference, no special config needed |
| ROCm | Partially broken | gfx1151 declared supported but data files missing, crashes on inference |
| CPU | Working | Slow fallback |

### Vulkan Configuration

LM Studio with Vulkan is the most reliable path on Strix Halo:

```json
{
  "llm.gpu.backend": "vulkan",
  "llm.gpu.device": "auto",
  "llm.gpu.layers": -1
}
```

Verify GPU detection: `vulkaninfo | grep "GPU id"`

An automated installer exists: [smarttechlabs-projects/strix-halo-lmstudio](https://github.com/smarttechlabs-projects/strix-halo-lmstudio)

### Performance Expectations (LM Studio / Vulkan, 128GB system)

| Model Size | Quant | Throughput |
|------------|-------|------------|
| 7B | Q4 | 30-40 t/s |
| 13B | Q4 | 20-30 t/s |
| 30B MoE | Q4 | ~50+ t/s (MoE advantage) |
| 70B | Q4 | 5-8 t/s |

For a 64GB system, expect similar per-token speeds but lower maximum context lengths before memory pressure kicks in.

### ROCm Status and Future

AMD's Ryzen AI Halo Mini PC (Q2 2026) will ship with ROCm 7.2.2 optimization for LM Studio. As of January 2026, stable ROCm+Linux configurations exist for Strix Halo (documented at Framework Community). The gfx1151 ROCm issue in LM Studio specifically is a packaging problem (missing data files), not a fundamental incompatibility.

For now: use **Vulkan for short-medium context**, or build **llama.cpp from source with ROCm** for long-context workloads (where Flash Attention matters).

### LM Studio Unsloth Dynamic 2.0 Note

There was a reported issue (GitHub #1594) where Unsloth Dynamic 2.0 (UD-) GGUF variants were not shown in LM Studio's download options. Verify that LM Studio is updated to the latest version, or download the GGUF files manually from HuggingFace and load them directly.

---

## 9. Recommended Configurations for 64GB Strix Halo

### Primary: Qwen3.5-35B-A3B (MoE)

| Use Case | Quantization | Size | KV Budget | Context Est. |
|----------|--------------|------|-----------|--------------|
| Maximum quality | Q8_0 | 36.9 GB | ~25 GB | ~32K-65K |
| Best balance | UD-Q4_K_XL | 22.2 GB | ~40 GB | ~65K-131K |
| Maximum context | UD-IQ3_XXS | 13.1 GB | ~49 GB | ~131K+ |
| Speed test | Q4_K_M | 22.0 GB | ~40 GB | ~65K-131K |

### Secondary: Qwen3.5-27B (Dense)

| Use Case | Quantization | Size | KV Budget | Notes |
|----------|--------------|------|-----------|-------|
| Quality comparison | Q8_0 | 28.6 GB | ~33 GB | Slower gen than 35B-A3B |
| Balanced | Q4_K_M | 16.7 GB | ~45 GB | |

### Quick Reference: Qwen3.5-9B (Small/Draft)

| Use Case | Quantization | Size |
|----------|--------------|------|
| Speculative decoding draft | Q4_K_M | ~6 GB |
| Standalone small model | Q8_0 | ~10 GB |

---

## 10. Sampling Parameters (Official Recommendations)

### Thinking Mode (General)

- Temperature: 1.0
- Top-p: 0.95
- Top-k: 20
- Min-p: 0.0
- Presence penalty: 1.5
- Max output: 32,768 tokens (general) or 81,920 (math/coding)

### Thinking Mode (Coding)

- Temperature: 0.6
- Top-p: 0.95
- Top-k: 20
- Presence penalty: 0.0

### Non-Thinking / Instruct Mode

- Temperature: 0.7
- Top-p: 0.8
- Top-k: 20
- Presence penalty: 1.5

### Best Practices

- Maintain minimum 128K context to preserve thinking capabilities
- Exclude thinking content from multi-turn conversation history
- For math: "Please reason step by step, and put your final answer within \boxed{}."
- For multiple choice: request JSON output like `{"answer": "C"}`

---

## 11. Open Questions / Limitations

1. **Qwen3.5 on gfx1151 ROCm**: LM Studio's ROCm backend crashes on Strix Halo due to missing gfx1151 data files. Building llama.cpp from source with ROCm 7.x works but requires manual setup.
2. **Vulkan long-context degradation**: Vulkan performance drops significantly beyond ~4K context on Strix Halo.
   ROCm with Flash Attention is needed for long-context workloads, creating a backend choice dilemma.
3. **Quantizer quality debate**: KLD and perplexity metrics do not always predict real-world task performance. The "best" quantizer depends on the specific use case. More task-based evaluation is needed.
4. **122B-A10B viability at 64GB**: Only fits at 2-bit or aggressive 3-bit. Quality at these compression levels for a 122B MoE is not well-characterized.
5. **Unsloth Studio AMD training**: Not yet supported. Timeline unclear ("coming soon").
6. **Multi-token Prediction (MTP)**: Qwen3.5 supports MTP for faster generation, but llama.cpp support status for this feature on the MoE variants needs verification.
7. **Speculative decoding**: Qwen3.5-9B as a draft model for 35B-A3B has been discussed but needs benchmarking on Strix Halo specifically.

---

## Sources

- [Qwen/Qwen3.5-35B-A3B Model Card](https://huggingface.co/Qwen/Qwen3.5-35B-A3B)
- [QwenLM/Qwen3.5 GitHub](https://github.com/QwenLM/Qwen3.5)
- [unsloth/Qwen3.5-35B-A3B-GGUF](https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF)
- [unsloth/Qwen3.5-27B-GGUF](https://huggingface.co/unsloth/Qwen3.5-27B-GGUF)
- [unsloth/Qwen3.5-122B-A10B-GGUF](https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF)
- [bartowski/Qwen_Qwen3.5-35B-A3B-GGUF](https://huggingface.co/bartowski/Qwen_Qwen3.5-35B-A3B-GGUF)
- [bartowski/Qwen_Qwen3.5-27B-GGUF](https://huggingface.co/bartowski/Qwen_Qwen3.5-27B-GGUF)
- [noctrex/Qwen3.5-35B-A3B-MXFP4_MOE-GGUF](https://huggingface.co/noctrex/Qwen3.5-35B-A3B-MXFP4_MOE-GGUF)
- [Unsloth Dynamic 2.0 GGUFs Documentation](https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs)
- [Qwen3.5 GGUF Benchmarks (Unsloth)](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks)
- [Unsloth Studio Documentation](https://unsloth.ai/docs/new/studio)
- [Qwen3.5 Local Running Guide (Unsloth)](https://unsloth.ai/docs/models/qwen3.5)
- [Summary of Qwen3.5 GGUF Evaluations (kaitchup)](https://kaitchup.substack.com/p/summary-of-qwen35-gguf-evaluations)
- [LM Studio Vulkan on Strix Halo (SmartTechLabs)](https://www.smarttechlabs.de/blog/2026-01-14-lmstudio-strix-halo/)
- [LM Studio on Ryzen AI](https://lmstudio.ai/ryzenai)
- [Strix Halo llama.cpp Performance Wiki](https://strixhalo.wiki/AI/llamacpp-performance)
- [AMD Strix Halo Backend Benchmarks](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [Strix Halo LLM Optimization (hardware-corner.net)](https://www.hardware-corner.net/strix-halo-llm-optimization/)
- [Qwen3.5 Small Models (Artificial Analysis)](https://artificialanalysis.ai/articles/qwen3-5-small-models)
- [Qwen 3.5 9B Beats 120B Models (VentureBeat)](https://venturebeat.com/technology/alibabas-small-open-source-qwen3-5-9b-beats-openais-gpt-oss-120b-and-can-run)
- [AMD ROCm 7 Strix Halo Performance (Phoronix)](https://www.phoronix.com/review/amd-rocm-7-strix-halo/4)
- [Qwen3.5 Blog (qwen.ai)](https://qwen.ai/blog?id=qwen3.5)