docs: update optimization guide with measured hardware data

- Replace estimated values with clpeak measurements: DRAM 216-233 GB/s; GPU clocks confirmed at 2900 MHz under load (ROCm #5750 is sysfs reporting only).
- Correct backend recommendation to Vulkan RADV (2.7x faster tg than ROCm at 131K).
- Update KV cache recommendation to q4_0.
- Add Nemotron-Cascade-2 to coder shootout results.
- Remove Nemotron-3-Nano from catalog (replaced by Cascade-2).
- Update Q4_K_L to Q4_K_XL entry.
2026-03-30 19:56:18 +02:00
parent 1549bc27c0
commit ea70687cd2
3 changed files with 157 additions and 208 deletions
--- a/docs/optimization-log.md
+++ b/docs/optimization-log.md
@@ -143,11 +143,12 @@ Living document tracking what was applied, tested, and the actual results. Each
 - **Status**: BLOCKED — llama.cpp PR #20700
 - **Last checked**: 2026-03-27 — WIP, not expected to merge soon

-### 5.3 GPU Clock Fix
+### 5.3 GPU Clock Reporting

-- **Status**: BLOCKED — ROCm issue #5750
-- **Notes**: GPU may be stuck at 885 MHz instead of 2900 MHz on gfx1151
-- **Last checked**: 2026-03-27
+- **Status**: NOT A REAL ISSUE — sysfs reporting is broken, actual clocks are fine
+- **Measured**: clpeak (2026-03-30) confirms GPU reaches 2900 MHz under compute load
+- **Notes**: ROCm issue #5750 is about sysfs `pp_dpm_sclk` reporting, not actual performance. No action needed.
+- **Verdict**: CLOSED — no performance impact

 ---

@@ -187,7 +188,31 @@ Living document tracking what was applied, tested, and the actual results. Each
 | **UD-Q4_K_XL** | 20.7 GB | 835 | **56.4** | **Daily driver** — best quality/speed. |
 | Q8_0 | 34.4 GB | 850 | 51.4 | Best quality, 10% slower tg. |

-**Decision**: Keep UD-Q4_K_XL (daily driver) and Q8_0 (quality fallback). Q4_K_L can be deleted — Q4_K_XL is strictly better at only +2 GB.
+**Decision**: Keep UD-Q4_K_XL (daily driver) and Q8_0 (quality fallback). Q4_K_L deleted — Q4_K_XL is strictly better at only +2 GB.
+
+### Coder Model Shootout (2026-03-30)
+
+- **Benchmark**: `data/benchmarks/coder-shootout-*`
+- **Config**: Vulkan RADV, q4_0 KV cache, pp2048/tg1024, 2 reps
+- **RyzenAdj**: STAPM=81W (sustained ~70W)
+
+| Model | Architecture | File Size | pp2048 (t/s) | tg1024 (t/s) |
+|-------|-------------|-----------|-------------|-------------|
+| **Qwen3-Coder-30B** UD-Q6_K_XL | Pure Transformer | 24.5 GB | 737 | **61.0** |
+| **Qwen3.5-35B-A3B** UD-Q4_K_XL | Hybrid DeltaNet | 20.7 GB | **821** | 54.9 |
+| **Nemotron-Cascade-2** Q8_0 | Hybrid Mamba-2 | 31.3 GB | 643 | 52.8 |
+| **Qwen3-Coder-Next** UD-Q3_K_XL | Hybrid DeltaNet | 33.8 GB | 545 | 46.8 |
+
+**Analysis**:
+- tg speed scales inversely with model size (bandwidth-bound at ~215 GB/s)
+- Pure Transformer (Qwen3-Coder-30B) has lowest overhead per token
+- DeltaNet hybrid (Qwen3.5) has best pp — DeltaNet layers are efficient for prefill
+- Qwen3-Coder-Next (80B at 3-bit) is 25% slower tg but has >70% SWE-bench vs ~50% for the 30B
+
+**Recommended roles**:
+- **Qwen3-Coder-30B**: Interactive tool-use / function-calling loops (fastest tg, purpose-built)
+- **Qwen3.5-35B-A3B**: General tasks, long prompt processing (best pp, best all-rounder)
+- **Qwen3-Coder-Next**: Complex multi-file coding tasks where quality > speed

 ---