docs: update optimization guide with measured hardware data
Replace estimated values with clpeak measurements: DRAM 216-233 GB/s, GPU clocks confirmed 2900 MHz under load (ROCm #5750 is sysfs reporting only). Correct backend recommendation to Vulkan RADV (2.7x faster tg than ROCm at 131K). Update KV cache recommendation to q4_0. Add Nemotron-Cascade-2 to coder shootout results. Remove Nemotron-3-Nano from catalog (replaced by Cascade-2). Update Q4_K_L to Q4_K_XL entry.
This commit is contained in:
@@ -143,11 +143,12 @@ Living document tracking what was applied, tested, and the actual results. Each
|
||||
- **Status**: BLOCKED — llama.cpp PR #20700
|
||||
- **Last checked**: 2026-03-27 — WIP, not expected to merge soon
|
||||
|
||||
### 5.3 GPU Clock Fix
|
||||
### 5.3 GPU Clock Reporting
|
||||
|
||||
- **Status**: BLOCKED — ROCm issue #5750
|
||||
- **Notes**: GPU may be stuck at 885 MHz instead of 2900 MHz on gfx1151
|
||||
- **Last checked**: 2026-03-27
|
||||
- **Status**: NOT A REAL ISSUE — sysfs reporting is broken, actual clocks are fine
|
||||
- **Measured**: clpeak (2026-03-30) confirms GPU reaches 2900 MHz under compute load
|
||||
- **Notes**: ROCm issue #5750 is about sysfs `pp_dpm_sclk` reporting, not actual performance. No action needed.
|
||||
- **Verdict**: CLOSED — no performance impact
|
||||
|
||||
---
|
||||
|
||||
@@ -187,7 +188,31 @@ Living document tracking what was applied, tested, and the actual results. Each
|
||||
| **UD-Q4_K_XL** | 20.7 GB | 835 | **56.4** | **Daily driver** — best quality/speed. |
|
||||
| Q8_0 | 34.4 GB | 850 | 51.4 | Best quality, 10% slower tg. |
|
||||
|
||||
**Decision**: Keep UD-Q4_K_XL (daily driver) and Q8_0 (quality fallback). Q4_K_L can be deleted — Q4_K_XL is strictly better at only +2 GB.
|
||||
**Decision**: Keep UD-Q4_K_XL (daily driver) and Q8_0 (quality fallback). Q4_K_L deleted — Q4_K_XL is strictly better at only +2 GB.
|
||||
|
||||
### Coder Model Shootout (2026-03-30)
|
||||
|
||||
- **Benchmark**: `data/benchmarks/coder-shootout-*`
|
||||
- **Config**: Vulkan RADV, q4_0 KV cache, pp2048/tg1024, 2 reps
|
||||
- **RyzenAdj**: STAPM=81W (sustained ~70W)
|
||||
|
||||
| Model | Architecture | File Size | pp2048 (t/s) | tg1024 (t/s) |
|
||||
|-------|-------------|-----------|-------------|-------------|
|
||||
| **Qwen3-Coder-30B** UD-Q6_K_XL | Pure Transformer | 24.5 GB | 737 | **61.0** |
|
||||
| **Qwen3.5-35B-A3B** UD-Q4_K_XL | Hybrid DeltaNet | 20.7 GB | **821** | 54.9 |
|
||||
| **Nemotron-Cascade-2** Q8_0 | Hybrid Mamba-2 | 31.3 GB | 643 | 52.8 |
|
||||
| **Qwen3-Coder-Next** UD-Q3_K_XL | Hybrid DeltaNet | 33.8 GB | 545 | 46.8 |
|
||||
|
||||
**Analysis**:
|
||||
- tg speed scales inversely with model size (bandwidth-bound at ~215 GB/s)
|
||||
- Pure Transformer (Qwen3-Coder-30B) has lowest overhead per token
|
||||
- DeltaNet hybrid (Qwen3.5) has best pp — DeltaNet layers are efficient for prefill
|
||||
- Qwen3-Coder-Next (80B at 3-bit) is 25% slower tg but has >70% SWE-bench vs ~50% for the 30B
|
||||
|
||||
**Recommended roles**:
|
||||
- **Qwen3-Coder-30B**: Interactive tool-use / function-calling loops (fastest tg, purpose-built)
|
||||
- **Qwen3.5-35B-A3B**: General tasks, long prompt processing (best pp, best all-rounder)
|
||||
- **Qwen3-Coder-Next**: Complex multi-file coding tasks where quality > speed
|
||||
|
||||
---
|
||||
|
||||
|
||||
Reference in New Issue
Block a user