feat: add Qwen3.5 model catalog and agentic evaluation framework

Models:
- configs/models.conf: catalog with Qwen3.5-35B-A3B (MoE, top pick), Qwen3.5-27B (dense), Qwen3-Coder-30B-A3B (agentic/coding)
- Updated benchmark setup to show catalog with download status
- docs/model-recommendations.md: memory planning, quantization guide

Agentic evaluation:
- scripts/agentic/setup.sh: installs inspect-ai, evalplus, bigcodebench in a Python venv
- scripts/agentic/run-eval.sh: runs evaluations against a local LLM server (ollama or llama.cpp). Suites: quick (HumanEval+IFEval), code (EvalPlus+BigCodeBench), tooluse (BFCL), full (all)
- bin/agentic: dispatcher with help
- docs/agentic-benchmarks.md: methodology, framework comparison, model recommendations for agentic use

Updated: Makefile (6 new targets), README, CLAUDE.md, docs/references.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New file: docs/model-recommendations.md (488 lines)
# Qwen 3.5 Model Family: Research Summary for Strix Halo (64GB)

**Date**: 2026-03-26
**Target Hardware**: AMD Ryzen AI MAX+ 395 / Radeon 8060S (gfx1151), 64 GB unified LPDDR5x, Fedora 43
**Focus**: GGUF quantized models for llama.cpp inference

---
## Scope

This report covers the Qwen3.5 model family (released February-March 2026) with emphasis
on GGUF quantization options, file sizes, memory-fit analysis for 64 GB unified memory,
GGUF quantizer comparison (Unsloth vs bartowski vs others), Unsloth Studio capabilities,
and LM Studio backend support on AMD Strix Halo. Out of scope: cloud API pricing,
full-precision training, and non-GGUF formats (AWQ, GPTQ, EXL2).

---
## 1. Qwen3.5 Model Family Overview

Released mid-February 2026 (medium/large) and March 2, 2026 (small), licensed Apache 2.0.
All models share the Gated DeltaNet hybrid architecture: a 3:1 ratio of linear-attention
(Gated DeltaNet) to full softmax-attention blocks. Native 262K context window, extensible
to 1,010,000 tokens via YaRN scaling. Supports 201 languages. Natively multimodal
(vision + language). Hybrid thinking/non-thinking mode.

| Model | Type | Total Params | Active Params | Architecture |
|-------|------|--------------|---------------|--------------|
| Qwen3.5-397B-A17B | MoE | 397B | 17B | 256 experts, 8 routed + 1 shared |
| Qwen3.5-122B-A10B | MoE | 122B | 10B | 256 experts, 8 routed + 1 shared |
| **Qwen3.5-35B-A3B** | **MoE** | **35B** | **3B** | **256 experts, 8 routed + 1 shared** |
| **Qwen3.5-27B** | **Dense** | **27B** | **27B** | **Full activation** |
| Qwen3.5-9B | Dense | 9B | 9B | Gated DeltaNet hybrid |
| Qwen3.5-4B | Dense | 4B | 4B | Gated DeltaNet hybrid |
| Qwen3.5-2B | Dense | 2B | 2B | Gated DeltaNet hybrid |
| Qwen3.5-0.8B | Dense | 0.8B | 0.8B | Gated DeltaNet hybrid |

---
## 2. Qwen3.5-35B-A3B (MoE) -- Detailed Analysis

### Architecture Specs

- Hidden dimension: 2048
- Token embedding: 248,320 (padded)
- Layers: 40
- Layer layout: 10 x (3 x (Gated DeltaNet -> MoE) -> 1 x (Gated Attention -> MoE))
- MoE: 256 total experts, 8 routed + 1 shared active, expert intermediate dim 512
- Linear attention heads: 32 (V), 16 (QK), head dim 128
- Gated attention heads: 16 (Q), 2 (KV), head dim 256
- BF16 model size: 69.4 GB
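The headline 35B/3B figures can be sanity-checked from the specs above. A rough back-of-envelope sketch (it assumes each expert is a standard gate/up/down MLP, i.e. three projections; attention, norms, and the output head account for the remaining gap and are not modeled):

```python
# Back-of-envelope parameter count for Qwen3.5-35B-A3B from the listed specs.
# Assumption: each expert is a gate/up/down MLP (3 projections); attention,
# norms, and the output head are not counted here.
HIDDEN = 2048
EXPERT_DIM = 512          # expert intermediate dim
N_EXPERTS = 256
ACTIVE_EXPERTS = 8 + 1    # 8 routed + 1 shared
N_LAYERS = 40
VOCAB = 248_320           # padded token embedding

per_expert = 3 * HIDDEN * EXPERT_DIM                    # gate, up, down
total_expert = per_expert * N_EXPERTS * N_LAYERS        # all experts
active_expert = per_expert * ACTIVE_EXPERTS * N_LAYERS  # touched per token
embeddings = VOCAB * HIDDEN                             # input embedding

print(f"expert params, total:  {total_expert / 1e9:.1f}B")   # ~32.2B
print(f"expert params, active: {active_expert / 1e9:.2f}B")  # ~1.13B
print(f"embeddings:            {embeddings / 1e9:.2f}B")     # ~0.51B
```

The experts alone come to ~32.2B of the 35B total, and ~1.1B of the ~3B active; the rest is attention layers, embeddings, and the LM head, consistent with the headline figures.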
### GGUF Quantizations (Unsloth)

Source: [unsloth/Qwen3.5-35B-A3B-GGUF](https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF)
Updated March 5, 2026 with improved imatrix data.

| Quantization | Size (GB) | Fits 64GB? | Notes |
|--------------|-----------|------------|-------|
| UD-IQ2_XXS | 10.7 | Yes | Ultra-compressed, quality loss |
| UD-IQ2_M | 11.4 | Yes | |
| UD-Q2_K_XL | 12.2 | Yes | |
| UD-IQ3_XXS | 13.1 | Yes | |
| UD-IQ3_S | 13.6 | Yes | |
| Q3_K_S | 15.3 | Yes | |
| Q3_K_M | 16.4 | Yes | |
| UD-Q3_K_XL | 16.6 | Yes | |
| UD-IQ4_XS | 17.5 | Yes | |
| UD-IQ4_NL | 17.8 | Yes | |
| Q4_K_S | 20.7 | Yes | |
| MXFP4_MOE | 21.6 | Yes | MoE-optimized mixed precision |
| Q4_K_M | 22.0 | **Yes** | **Recommended sweet spot** |
| UD-Q4_K_XL | 22.2 | Yes | Dynamic 2.0, best 4-bit |
| Q5_K_S | 24.8 | Yes | |
| Q5_K_M | 26.2 | Yes | |
| UD-Q5_K_XL | 26.4 | Yes | |
| UD-Q6_K_S | 28.5 | Yes | |
| Q6_K | 28.9 | Yes | |
| UD-Q6_K_XL | 32.1 | Yes | |
| Q8_0 | 36.9 | Yes | High quality, fits with room |
| UD-Q8_K_XL | 48.7 | Yes* | Tight -- ~15 GB left for KV cache |
| BF16 | 69.4 | **No** | Exceeds 64GB |

**Key finding**: Every quantization except BF16 fits in 64GB. Even Q8_0 at 36.9 GB
leaves ~27 GB for KV cache and OS overhead, which is excellent. The MoE architecture
(only 3B active params) means token generation is fast relative to total model size.
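The fit analysis in the table reduces to simple subtraction. A minimal sketch (the 4 GB OS/overhead reserve is an assumed figure, not a measured one):

```python
# Headroom left for KV cache after loading model weights on a 64 GB
# unified-memory system. The OS/overhead reserve is an assumption.
TOTAL_GB = 64.0
OS_RESERVE_GB = 4.0  # assumed: OS + desktop + llama.cpp runtime overhead

def kv_headroom(model_gb: float) -> float:
    """GB left for KV cache and activations once weights are resident."""
    return TOTAL_GB - OS_RESERVE_GB - model_gb

for name, size in [("Q4_K_M", 22.0), ("Q8_0", 36.9),
                   ("UD-Q8_K_XL", 48.7), ("BF16", 69.4)]:
    h = kv_headroom(size)
    verdict = f"fits, {h:.1f} GB headroom" if h > 0 else "does not fit"
    print(f"{name:12s} {size:5.1f} GB -> {verdict}")
```

With the reserve included the Q8_0 headroom drops from the table's raw ~27 GB to ~23 GB; adjust the reserve to match your actual idle memory usage.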
### Benchmark Results (Official, from Model Card)

| Benchmark | Qwen3.5-35B-A3B | GPT-5-mini | Notes |
|-----------|-----------------|------------|-------|
| MMLU-Pro | 85.3 | 83.7 | Outperforms |
| C-Eval | 90.2 | 82.2 | Outperforms |
| GPQA Diamond | 84.2 | 82.8 | Outperforms |
| SWE-bench Verified | 69.2 | 72.0 | Slightly behind |
| LiveCodeBench v6 | 74.6 | 80.5 | Behind on coding |
| MMMU (vision) | 81.4 | 79.0 | Outperforms |
| MathVision | 83.9 | 71.9 | Strongly outperforms |
| VideoMME (w/ sub.) | 86.6 | 83.5 | Outperforms |
### Strix Halo Performance Estimates

Based on Qwen3-30B-A3B benchmarks (similar architecture, predecessor):

| Backend | pp512 (t/s) | tg128 (t/s) | Context |
|---------|-------------|-------------|---------|
| Vulkan RADV | ~755 | ~85 | Short |
| Vulkan AMDVLK | ~742 | ~82 | Short |
| ROCm hipBLASlt | ~652 | ~64 | Short |
| ROCm rocWMMA (tuned) | ~659 | ~68 | Short |
| Vulkan RADV | ~17 | ~13 | 130K |
| ROCm hipBLASlt | ~40 | ~5 | 130K |

**Key insight**: Vulkan wins on short-context token generation; ROCm wins on
long-context prompt processing. For interactive chat (short-to-medium context),
Vulkan RADV is the best backend on Strix Halo.

---
## 3. Qwen3.5-27B (Dense) -- Detailed Analysis

Source: [unsloth/Qwen3.5-27B-GGUF](https://huggingface.co/unsloth/Qwen3.5-27B-GGUF)

The only dense (non-MoE) model in the medium range. All 27B parameters activate on
every forward pass, meaning slower token generation than 35B-A3B despite the
"smaller" total parameter count. BF16 size: 53.8 GB.
### GGUF Quantizations (Unsloth)

| Quantization | Size (GB) | Fits 64GB? | Notes |
|--------------|-----------|------------|-------|
| UD-IQ2_XXS | 8.57 | Yes | |
| UD-IQ2_M | 10.2 | Yes | |
| UD-Q2_K_XL | 11.2 | Yes | |
| UD-IQ3_XXS | 11.5 | Yes | |
| Q3_K_S | 12.3 | Yes | |
| Q3_K_M | 13.5 | Yes | |
| UD-Q3_K_XL | 14.4 | Yes | |
| IQ4_XS | 15.0 | Yes | |
| Q4_0 | 15.7 | Yes | |
| IQ4_NL | 15.7 | Yes | |
| Q4_K_S | 15.8 | Yes | |
| Q4_K_M | 16.7 | **Yes** | **Recommended** |
| Q4_1 | 17.2 | Yes | |
| UD-Q4_K_XL | 17.6 | Yes | Dynamic 2.0 |
| Q5_K_S | 18.9 | Yes | |
| Q5_K_M | 19.6 | Yes | |
| UD-Q5_K_XL | 20.2 | Yes | |
| Q6_K | 22.5 | Yes | |
| UD-Q6_K_XL | 25.7 | Yes | |
| Q8_0 | 28.6 | Yes | Plenty of room |
| UD-Q8_K_XL | 35.5 | Yes | Good quality + headroom |
| BF16 | 53.8 | Yes* | Tight -- only ~10 GB for KV cache |

**Key finding**: All quantizations fit in 64GB, including BF16 (barely). However,
because this is a dense model with 27B active params, token generation will be
significantly slower than 35B-A3B (which only activates 3B). For interactive use on
Strix Halo, the 35B-A3B MoE is likely the better choice despite being larger on disk.
### 35B-A3B vs 27B: Which to Run?

| Factor | 35B-A3B (MoE) | 27B (Dense) |
|--------|---------------|-------------|
| Active params | 3B | 27B |
| Token gen speed | ~85 t/s (Vulkan) | ~10-15 t/s (estimated) |
| Quality (MMLU-Pro) | 85.3 | Comparable |
| Memory (Q4_K_M) | 22.0 GB | 16.7 GB |
| Memory (Q8_0) | 36.9 GB | 28.6 GB |
| Best for | Interactive chat, speed | Batch processing, quality |

**Recommendation**: For interactive inference on 64GB Strix Halo, strongly prefer
Qwen3.5-35B-A3B. The MoE architecture is ideal for unified-memory systems: since
only 3B params are active per token, generation is much faster despite the larger
total weight file.
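The speed gap follows from memory bandwidth: each generated token must stream the active weights from memory. A rough roofline sketch (the 256 GB/s peak-bandwidth figure and the 60% efficiency factor are assumptions, not benchmarks):

```python
# Bandwidth-bound token-generation estimate: t/s ≈ usable bandwidth divided
# by bytes streamed per token. Bandwidth and efficiency are assumptions.
PEAK_BW_GBS = 256.0   # assumed: 256-bit LPDDR5x on Strix Halo
EFFICIENCY = 0.6      # assumed: fraction of peak achievable in practice

def est_tg_tps(bytes_per_token_gb: float) -> float:
    """Upper-bound tokens/s if generation is purely bandwidth-bound."""
    return PEAK_BW_GBS * EFFICIENCY / bytes_per_token_gb

# 35B-A3B at Q4_K_M (22.0 GB): only ~3B of 35B params touched per token.
moe_gb_per_token = 22.0 * (3 / 35)   # ~1.9 GB streamed per token
dense_gb_per_token = 16.7            # dense 27B streams all weights

print(f"35B-A3B (MoE): ~{est_tg_tps(moe_gb_per_token):.0f} t/s")
print(f"27B (dense):   ~{est_tg_tps(dense_gb_per_token):.0f} t/s")
```

Under these assumptions the estimates land near ~80 t/s for the MoE and ~9 t/s for the dense model, matching the table's ~85 and ~10-15 t/s figures reasonably well.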
---

## 4. Qwen3.5-122B-A10B (MoE) -- Stretch Goal

Source: [unsloth/Qwen3.5-122B-A10B-GGUF](https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF)

BF16 size: 244 GB. This is the next tier up from 35B-A3B.

### Quantizations That Fit 64GB

| Quantization | Size (GB) | Fit? | Notes |
|--------------|-----------|------|-------|
| UD-IQ1_M | 34.2 | Yes | 1-bit, quality concerns |
| UD-IQ2_XXS | 36.6 | Yes | Very compressed |
| UD-IQ2_M | 39.1 | Yes | |
| UD-Q2_K_XL | 41.8 | Yes | |
| UD-IQ3_XXS | 44.7 | Yes | |
| UD-IQ3_S | 46.6 | Yes* | Tight with KV cache |
| Q3_K_S | 52.5 | Marginal | Very little KV headroom |
| Q3_K_M | 56.4 | No | Leaves <8 GB for everything else |
| Q4_K_M+ | 76.5+ | No | Does not fit |

**Warning**: Q3-level quantization of the 122B has been reported to produce garbled
output, infinite repetition, and failures on tool calls and code generation. UD-Q2_K_XL
(41.8 GB) is the recommended minimum viable quantization.

**Verdict**: Possible at 2-bit, but risky. Quality at the IQ2 level on a 122B MoE model
is largely untested for production use. The 35B-A3B at Q8_0 (36.9 GB) is likely higher
quality than the 122B at IQ2 (36.6 GB) and much safer. Not recommended for 64GB systems
unless you specifically need the 10B active parameter count.

---
## 5. Qwen3.5 Small Models (Worth Benchmarking)

### Qwen3.5-9B

The standout small model. Outperforms models 3-13x its size:

- GPQA Diamond: 81.7 (vs GPT-OSS-120B: 71.5)
- HMMT Feb 2025: 83.2
- MMMU-Pro: 70.1 (beats Gemini 2.5 Flash-Lite at 59.7)

At Q4_K_M, the 9B model needs roughly 6-7 GB and runs comfortably on any hardware.
It is also useful as a draft model for speculative decoding with the 35B-A3B.

### Qwen3.5-4B

Performance close to the previous Qwen3-80B-A3B (20x larger). Excellent for
on-device/edge tasks. ~3 GB at Q4_K_M.

---
## 6. Best GGUF Quantizers: Unsloth vs bartowski vs Others

### Providers Compared

| Provider | Approach | Strengths |
|----------|----------|-----------|
| **Unsloth** | Dynamic 2.0: per-layer adaptive quantization, 1.5M+ token calibration dataset | Best at low bit-rates (Q2, Q3), model-specific tuning, fast updates |
| **bartowski** | Custom imatrix calibration, upstream llama.cpp PR for improved tensor recipes | Lower KLD at Q4_K_M in some tests, stable quality |
| **noctrex** | MXFP4 for MoE experts + Q8/BF16 for the rest | Specialized for MoE models |
| **ubergarm** | Standard llama.cpp quantization | Reliable baseline |
| **AesSedai** | imatrix-based | Good coverage, sometimes outperformed by Unsloth Dynamic |
| **mradermacher** | Mass-produced quants across many models | Broad coverage, less specialized |

### Head-to-Head: Unsloth vs bartowski

On standard KLD benchmarks (Qwen QwQ-32B comparison):

- bartowski Q4_K_M: 0.0087 KLD
- Unsloth Q4_K_M: 0.0222 KLD
- bartowski IQ4_XS: 0.0127 KLD at 4.93 GiB

However, on real-world task evaluations (LiveCodeBench v6, MMLU-Pro), Unsloth Dynamic
IQ2_XXS outperformed AesSedai IQ3_S despite being 11 GB smaller -- demonstrating that
KLD/perplexity alone do not predict task performance.
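For readers unfamiliar with the metric: KLD here is the KL divergence between the full-precision model's next-token distribution and the quantized model's, averaged over a test corpus. A minimal illustration of the computation itself (the toy distributions below are invented for demonstration, not drawn from any model):

```python
import math

def kld(p, q):
    """KL divergence D(p||q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over 4 candidate tokens: a reference
# (full-precision) model vs a slightly perturbed "quantized" one.
# Values are invented purely to illustrate the metric.
p = [0.70, 0.20, 0.07, 0.03]
q = [0.68, 0.21, 0.08, 0.03]

print(f"KLD = {kld(p, q):.4f}")  # small value -> distributions are close
```

Lower is better, and 0 means the quantized model reproduces the reference distribution exactly; but as noted above, a lower corpus-average KLD does not guarantee better downstream task scores.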
### Recommendation

- **Q4 and above**: bartowski and Unsloth are both excellent. bartowski may have slightly
  lower KLD at Q4_K_M. Either is a safe choice.
- **Q3 and below**: Unsloth Dynamic 2.0 (UD- prefix) is the clear winner. The per-layer
  adaptive approach preserves critical layers at higher precision.
- **MoE-specific**: noctrex MXFP4_MOE is worth testing if you want pure MoE-optimized
  quantization.
- **Overall**: For Qwen3.5-35B-A3B, use **Unsloth UD-Q4_K_XL** (22.2 GB) or
  **Q8_0** (36.9 GB) for maximum quality. From bartowski, use their Q4_K_M.

### imatrix Note

All modern GGUF quantizers now use imatrix (importance matrix) calibration. The
calibration is applied at quantization time -- it adds no inference overhead -- and
significantly improves quality at low bit-rates. The calibration dataset matters:
Unsloth uses 1.5M+ hand-curated tokens; bartowski uses different calibration texts
optimized for different use cases.

---
## 7. Unsloth Studio

### What It Is

Unsloth Studio is an open-source, no-code web UI for training and running LLMs locally.
Released March 17, 2026 (beta). Dual-licensed: Apache 2.0 (core) + AGPL-3.0 (UI).

### Installation

```bash
# macOS, Linux, WSL
curl -fsSL https://unsloth.ai/install.sh | sh

# Launch the web UI on all interfaces, port 8888
unsloth studio -H 0.0.0.0 -p 8888
```
### Capabilities

| Feature | Details |
|---------|---------|
| **Inference** | Run GGUF and safetensors models with tool-calling, web search, OpenAI-compatible API |
| **Fine-tuning** | SFT, GRPO (RL), 500+ models, 2x faster, 70% less VRAM |
| **Data Recipes** | Auto-create datasets from PDF, CSV, JSON, DOCX, TXT |
| **Model Arena** | Side-by-side comparison of two models |
| **Export** | Save to GGUF or safetensors |
| **Multimodal** | Text, vision, TTS audio, embedding models |

### Platform Support

| Platform | Inference | Training |
|----------|-----------|----------|
| Linux (NVIDIA) | Yes | Yes |
| Linux (AMD) | Yes | Coming soon |
| Linux (CPU) | Yes | No |
| macOS | Yes (CPU only) | Coming (MLX) |
| Windows | Yes | Yes |
### Relevance for Strix Halo

Unsloth Studio provides inference via a llama.cpp backend, so it should work on Strix
Halo for **running** models. Training currently requires NVIDIA or Intel GPUs, so
fine-tuning is not yet supported on AMD. The inference component is essentially a
polished web UI wrapper around llama.cpp, similar to LM Studio but with integrated
training capabilities.

**Verdict**: Useful for inference on Strix Halo; not yet useful for training on AMD.
If you only need inference, LM Studio or raw llama.cpp may be simpler. If you want
training and inference in one tool (once AMD support arrives), Unsloth Studio is worth
watching.

---
## 8. LM Studio on AMD Strix Halo

### Backend Status

| Backend | Status | Notes |
|---------|--------|-------|
| **Vulkan** | **Working, recommended** | Best for general inference, no special config needed |
| ROCm | Partially broken | gfx1151 declared supported but data files are missing; crashes on inference |
| CPU | Working | Slow fallback |
### Vulkan Configuration

LM Studio with Vulkan is the most reliable path on Strix Halo:

```json
{
  "llm.gpu.backend": "vulkan",
  "llm.gpu.device": "auto",
  "llm.gpu.layers": -1
}
```

Verify GPU detection: `vulkaninfo | grep "GPU id"`

An automated installer exists: [smarttechlabs-projects/strix-halo-lmstudio](https://github.com/smarttechlabs-projects/strix-halo-lmstudio)
### Performance Expectations (LM Studio / Vulkan, 128GB system)

| Model Size | Quant | Throughput |
|------------|-------|------------|
| 7B | Q4 | 30-40 t/s |
| 13B | Q4 | 20-30 t/s |
| 30B MoE | Q4 | ~50+ t/s (MoE advantage) |
| 70B | Q4 | 5-8 t/s |

For a 64GB system, expect similar per-token speeds but lower maximum context
lengths before memory pressure kicks in.
### ROCm Status and Future

AMD's Ryzen AI Halo Mini PC (Q2 2026) will ship with ROCm 7.2.2 optimization for
LM Studio. As of January 2026, stable ROCm+Linux configurations exist for Strix Halo
(documented at the Framework Community). The gfx1151 ROCm issue in LM Studio is
specifically a packaging problem (missing data files), not a fundamental incompatibility.

For now: use **Vulkan for short-to-medium context**, or build **llama.cpp from source
with ROCm** for long-context workloads (where Flash Attention matters).
### LM Studio Unsloth Dynamic 2.0 Note

There was a reported issue (GitHub #1594) where Unsloth Dynamic 2.0 (UD-) GGUF variants
were not shown in LM Studio's download options. Verify that LM Studio is updated to
the latest version, or download the GGUF files manually from Hugging Face and load
them directly.

---
## 9. Recommended Configurations for 64GB Strix Halo

### Primary: Qwen3.5-35B-A3B (MoE)

| Use Case | Quantization | Size | KV Budget | Context Est. |
|----------|--------------|------|-----------|--------------|
| Maximum quality | Q8_0 | 36.9 GB | ~25 GB | ~32K-65K |
| Best balance | UD-Q4_K_XL | 22.2 GB | ~40 GB | ~65K-131K |
| Maximum context | UD-IQ3_XXS | 13.1 GB | ~49 GB | ~131K+ |
| Speed test | Q4_K_M | 22.0 GB | ~40 GB | ~65K-131K |
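KV cache needs can be estimated from the section-2 architecture specs. A sketch, assuming fp16 KV entries and that only the 10 gated-attention layers keep a growing KV cache (linear-attention layers carry constant-size recurrent state instead):

```python
# KV cache size estimate for Qwen3.5-35B-A3B from the section-2 specs.
# Assumptions: fp16 KV entries; only the 10 full-attention layers store KV
# (the 30 linear-attention layers keep constant-size state, not modeled).
KV_HEADS = 2       # gated attention KV heads
HEAD_DIM = 256     # gated attention head dim
ATTN_LAYERS = 10   # 1 gated-attention layer per 4-layer block, 40 layers
BYTES = 2          # fp16 per element

def kv_gb(context_tokens: int) -> float:
    """KV cache size in GB at a given context length (K and V both stored)."""
    per_token = 2 * KV_HEADS * HEAD_DIM * ATTN_LAYERS * BYTES
    return context_tokens * per_token / 1e9

for ctx in (32_768, 131_072, 262_144):
    print(f"{ctx:>7} tokens -> {kv_gb(ctx):.2f} GB")
```

By this arithmetic the hybrid design's KV cache stays small even at 262K (~5 GB); much of the tabulated "KV budget" is headroom for compute buffers and activations, which this sketch does not model.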
### Secondary: Qwen3.5-27B (Dense)

| Use Case | Quantization | Size | KV Budget | Notes |
|----------|--------------|------|-----------|-------|
| Quality comparison | Q8_0 | 28.6 GB | ~33 GB | Slower gen than 35B-A3B |
| Balanced | Q4_K_M | 16.7 GB | ~45 GB | |

### Quick Reference: Qwen3.5-9B (Small/Draft)

| Use Case | Quantization | Size |
|----------|--------------|------|
| Speculative decoding draft | Q4_K_M | ~6 GB |
| Standalone small model | Q8_0 | ~10 GB |

---
## 10. Sampling Parameters (Official Recommendations)

### Thinking Mode (General)

- Temperature: 1.0
- Top-p: 0.95
- Top-k: 20
- Min-p: 0.0
- Presence penalty: 1.5
- Max output: 32,768 tokens (general) or 81,920 (math/coding)

### Thinking Mode (Coding)

- Temperature: 0.6
- Top-p: 0.95
- Top-k: 20
- Presence penalty: 0.0

### Non-Thinking / Instruct Mode

- Temperature: 0.7
- Top-p: 0.8
- Top-k: 20
- Presence penalty: 1.5

### Best Practices

- Maintain minimum 128K context to preserve thinking capabilities
- Exclude thinking content from multi-turn conversation history
- For math: "Please reason step by step, and put your final answer within \boxed{}."
- For multiple choice: request JSON output like `{"answer": "C"}`
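These presets translate directly into an OpenAI-compatible request body for a local llama.cpp or ollama server. A sketch (the endpoint URL and model name are placeholders; `top_k` and `min_p` are llama.cpp extensions to the OpenAI schema, and no network call is made here):

```python
# Build OpenAI-compatible /v1/chat/completions payloads from the official
# sampling presets above. URL and model name are placeholders.
PRESETS = {
    "thinking":        {"temperature": 1.0, "top_p": 0.95, "top_k": 20,
                        "min_p": 0.0, "presence_penalty": 1.5,
                        "max_tokens": 32_768},
    "thinking_coding": {"temperature": 0.6, "top_p": 0.95, "top_k": 20,
                        "presence_penalty": 0.0, "max_tokens": 81_920},
    "instruct":        {"temperature": 0.7, "top_p": 0.8, "top_k": 20,
                        "presence_penalty": 1.5, "max_tokens": 32_768},
}

def make_payload(mode: str, prompt: str,
                 model: str = "qwen3.5-35b-a3b") -> dict:
    """Request body for e.g. POST http://localhost:8080/v1/chat/completions."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}],
            **PRESETS[mode]}

payload = make_payload("thinking_coding", "Write a binary search in Python.")
print(payload["temperature"], payload["presence_penalty"])
```

Keeping the presets in one dict makes it easy to A/B the thinking and instruct profiles against the same prompt set during benchmarking.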
---

## 11. Open Questions / Limitations

1. **Qwen3.5 on gfx1151 ROCm**: LM Studio's ROCm backend crashes on Strix Halo due
   to missing gfx1151 data files. Building llama.cpp from source with ROCm 7.x works
   but requires manual setup.

2. **Vulkan long-context degradation**: Vulkan performance drops significantly beyond
   ~4K context on Strix Halo. ROCm with Flash Attention is needed for long-context
   workloads, creating a backend-choice dilemma.

3. **Quantizer quality debate**: KLD and perplexity metrics do not always predict
   real-world task performance. The "best" quantizer depends on the specific use case.
   More task-based evaluation is needed.

4. **122B-A10B viability at 64GB**: Only fits at 2-bit or aggressive 3-bit. Quality
   at these compression levels for a 122B MoE is not well characterized.

5. **Unsloth Studio AMD training**: Not yet supported. Timeline unclear ("coming soon").

6. **Multi-token prediction (MTP)**: Qwen3.5 supports MTP for faster generation, but
   llama.cpp support for this feature on the MoE variants needs verification.

7. **Speculative decoding**: Qwen3.5-9B as a draft model for 35B-A3B has been discussed
   but needs benchmarking on Strix Halo specifically.

---
## Sources

- [Qwen/Qwen3.5-35B-A3B Model Card](https://huggingface.co/Qwen/Qwen3.5-35B-A3B)
- [QwenLM/Qwen3.5 GitHub](https://github.com/QwenLM/Qwen3.5)
- [unsloth/Qwen3.5-35B-A3B-GGUF](https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF)
- [unsloth/Qwen3.5-27B-GGUF](https://huggingface.co/unsloth/Qwen3.5-27B-GGUF)
- [unsloth/Qwen3.5-122B-A10B-GGUF](https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF)
- [bartowski/Qwen_Qwen3.5-35B-A3B-GGUF](https://huggingface.co/bartowski/Qwen_Qwen3.5-35B-A3B-GGUF)
- [bartowski/Qwen_Qwen3.5-27B-GGUF](https://huggingface.co/bartowski/Qwen_Qwen3.5-27B-GGUF)
- [noctrex/Qwen3.5-35B-A3B-MXFP4_MOE-GGUF](https://huggingface.co/noctrex/Qwen3.5-35B-A3B-MXFP4_MOE-GGUF)
- [Unsloth Dynamic 2.0 GGUFs Documentation](https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs)
- [Qwen3.5 GGUF Benchmarks (Unsloth)](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks)
- [Unsloth Studio Documentation](https://unsloth.ai/docs/new/studio)
- [Qwen3.5 Local Running Guide (Unsloth)](https://unsloth.ai/docs/models/qwen3.5)
- [Summary of Qwen3.5 GGUF Evaluations (kaitchup)](https://kaitchup.substack.com/p/summary-of-qwen35-gguf-evaluations)
- [LM Studio Vulkan on Strix Halo (SmartTechLabs)](https://www.smarttechlabs.de/blog/2026-01-14-lmstudio-strix-halo/)
- [LM Studio on Ryzen AI](https://lmstudio.ai/ryzenai)
- [Strix Halo llama.cpp Performance Wiki](https://strixhalo.wiki/AI/llamacpp-performance)
- [AMD Strix Halo Backend Benchmarks](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [Strix Halo LLM Optimization (hardware-corner.net)](https://www.hardware-corner.net/strix-halo-llm-optimization/)
- [Qwen3.5 Small Models (Artificial Analysis)](https://artificialanalysis.ai/articles/qwen3-5-small-models)
- [Qwen 3.5 9B Beats 120B Models (VentureBeat)](https://venturebeat.com/technology/alibabas-small-open-source-qwen3-5-9b-beats-openais-gpt-oss-120b-and-can-run)
- [AMD ROCm 7 Strix Halo Performance (Phoronix)](https://www.phoronix.com/review/amd-rocm-7-strix-halo/4)
- [Qwen3.5 Blog (qwen.ai)](https://qwen.ai/blog?id=qwen3.5)