Replace estimated values with clpeak measurements: DRAM 216-233 GB/s, GPU clocks confirmed 2900 MHz under load (ROCm #5750 is sysfs reporting only). Correct backend recommendation to Vulkan RADV (2.7x faster tg than ROCm at 131K). Update KV cache recommendation to q4_0. Add Nemotron-Cascade-2 to coder shootout results. Remove Nemotron-3-Nano from catalog (replaced by Cascade-2). Update Q4_K_L to Q4_K_XL entry.
Optimization Log
Living document tracking what was applied, tested, and the actual results. Each entry records the change, benchmark evidence, and verdict.
Verdicts: KEEP (applied permanently), REVERTED (tested, didn't help), PENDING (not yet tested), BLOCKED (can't test yet).
Phase 1: Core System
1.1 Tuned Profile: accelerator-performance
- Date: 2026-03-26
- Change: `sudo tuned-adm profile accelerator-performance`
- Benchmark: `data/benchmarks/after-tuned-*`
- Result: +5-8% prompt processing (pp) improvement, +2-3% token generation (tg) improvement
- Verdict: KEEP
1.2 Kernel Boot Parameters
- Date: 2026-03-26
- Change: `iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496`
- Benchmark: `data/benchmarks/full-opt-all-models-*`
- Result: Combined with BIOS VRAM change. Large models now fit in GTT. Peak usage 38.8/59 GiB.
- Verdict: KEEP
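The two size parameters in 1.2 express the same 59 GiB ceiling in different units. The sketch below reproduces that arithmetic, assuming `amdgpu.gttsize` is given in MiB and `ttm.pages_limit` counts 4 KiB pages; the function name `gtt_params` is illustrative and not part of this repo.

```python
def gtt_params(gtt_gib: int, page_size: int = 4096) -> dict:
    """Convert a GTT budget in GiB into the two kernel parameter values."""
    gttsize_mib = gtt_gib * 1024                   # amdgpu.gttsize, in MiB
    pages_limit = gtt_gib * 1024**3 // page_size   # ttm.pages_limit, in pages
    return {"amdgpu.gttsize": gttsize_mib, "ttm.pages_limit": pages_limit}


params = gtt_params(59)
print(" ".join(f"{key}={value}" for key, value in params.items()))
# amdgpu.gttsize=60416 ttm.pages_limit=15466496
```

Changing the budget means recomputing both values together, since a mismatch would leave one limit tighter than the other.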
1.3 BIOS VRAM Reduction (512 MB)
- Date: 2026-03-26
- Change: UMA Frame Buffer Size 32 GB -> 512 MB (HP ZBook F10 BIOS)
- Benchmark: `data/benchmarks/full-opt-all-models-*`
- Result: 31.5 GB freed for OS/GTT. Small models ~3-8% slower (GTT indirection vs dedicated VRAM), but system gained ability to run 37 GB+ models at 32K+ context. Net positive.
- Trade-off: Small model regression is acceptable given the massive capability gain.
- Verdict: KEEP
Phase 2: System Tuning
2.1 RyzenAdj PPT Increase
- Date: 2026-03-30
- Change: `sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000 --apu-slow-limit=85000`
- Result: STAPM raised from 59W→81W. PPT Fast raised to 81W. However, PPT SLOW and APU SLOW stuck at 70W because the HP ZBook BIOS EC overrides these limits. Effective sustained power: ~70W (was ~59W).
- Benchmark: `data/benchmarks/qwen35-shootout-v2-*` (Vulkan, q4_0 KV, pp2048/tg1024)
  - UD-Q4_K_L: 57.0 t/s (was ~39 t/s before RyzenAdj = +46%)
  - UD-Q4_K_XL: 56.4 t/s
  - Q8_0: 51.4 t/s (was ~39-41 t/s before = +25-32%)
- Thermals: 70-73 °C under load, 30 °C headroom. Cooling handles it easily.
- Notes: Settings are volatile (reset on reboot/sleep). Use `sudo make optimize-power` or install a systemd service for persistence. HP firmware hard-caps slow PPT at 70W regardless.
- Verdict: KEEP (significant real-world improvement despite HP firmware limit)
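The percentage gains quoted for RyzenAdj follow from the before/after tg figures in this entry. A minimal sketch of that calculation is below; `pct_gain` is a hypothetical helper, and the Q8_0 baseline is treated as the 39-41 t/s range given above.

```python
def pct_gain(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# tg1024 figures from this entry (t/s)
print(f"UD-Q4_K_L: {pct_gain(39.0, 57.0):+.0f}%")

# The Q8_0 baseline was a range, so the gain is a range too.
low = pct_gain(41.0, 51.4)
high = pct_gain(39.0, 51.4)
print(f"Q8_0: {low:+.0f}% to {high:+.0f}%")
```

Reporting the gain as a range keeps the uncertainty of the baseline visible instead of hiding it in a single number.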
2.2 VM Sysctl Tuning
- Date: 2026-03-30
- Change: `vm.swappiness=1, vm.dirty_ratio=40, vm.dirty_background_ratio=10, vm.max_map_count=500000, vm.zone_reclaim_mode=0`
- Applied via: `sudo make optimize-power` (persists to `/etc/sysctl.d/99-llm-inference.conf`)
- Notes: Hard to isolate impact because it was applied together with other Phase 2 changes. Intended to prevent model weight eviction and I/O disruption.
- Verdict: KEEP — low risk, persists across reboots
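For reference, the settings in 2.2 can be rendered in the `key = value` form that sysctl.d files accept. This is only a sketch of what the persisted file might contain; the exact output of `make optimize-power` is not shown in this log, and `render_sysctl_conf` is a hypothetical helper.

```python
# Values copied from the Change line of entry 2.2.
SYSCTLS = {
    "vm.swappiness": 1,
    "vm.dirty_ratio": 40,
    "vm.dirty_background_ratio": 10,
    "vm.max_map_count": 500000,
    "vm.zone_reclaim_mode": 0,
}


def render_sysctl_conf(settings: dict) -> str:
    """Render settings as sysctl.d-style 'key = value' lines."""
    return "".join(f"{key} = {value}\n" for key, value in settings.items())


# Print only; writing under /etc/sysctl.d/ is left to the Makefile target.
print(render_sysctl_conf(SYSCTLS), end="")
```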
2.3 Transparent Huge Pages
- Date: 2026-03-30
- Change: `echo always > /sys/kernel/mm/transparent_hugepage/enabled`
- Applied via: `sudo make optimize-power` (volatile; add `transparent_hugepage=always` to kernel cmdline for persistence)
- Notes: Reduces TLB misses for mmap'd model files. Hard to isolate impact.
- Verdict: KEEP — low risk
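Because the THP setting is volatile, it is worth confirming after a reboot or resume. The kernel marks the active mode with square brackets in the sysfs file; the sketch below parses that format. The helper names are illustrative, not part of this repo.

```python
from pathlib import Path

THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text: str) -> str:
    """Return the bracketed mode, e.g. 'always' from '[always] madvise never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode found in {text!r}")


def current_thp_mode(path: Path = THP_PATH) -> str:
    """Read the sysfs file and return the active mode."""
    return active_thp_mode(path.read_text())


# Demonstration on a sample string rather than the live system.
print(active_thp_mode("[always] madvise never"))  # always
```

On the target machine, `current_thp_mode()` should return `always` once the change from 2.3 is in effect.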
2.4 RADV_PERFTEST=nogttspill
- Date: 2026-03-30
- Change: `RADV_PERFTEST=nogttspill` persisted to `/etc/environment.d/radv-llm.conf`
- Applied via: `sudo make optimize-power`
- Notes: Prevents GTT spill management overhead on unified memory Vulkan. Takes effect on next login. For current session: `export RADV_PERFTEST=nogttspill`
- Verdict: KEEP (persists across reboots)
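Since the variable only takes effect on the next login, a benchmark launched from an old session can silently run without it. The sketch below checks the current environment, assuming `RADV_PERFTEST` holds a comma-separated list of flags; `radv_flag_enabled` is a hypothetical helper.

```python
import os


def radv_flag_enabled(flag: str, env=None) -> bool:
    """Check whether `flag` is present in the comma-separated RADV_PERFTEST."""
    env = os.environ if env is None else env
    value = env.get("RADV_PERFTEST", "")
    flags = [item.strip() for item in value.split(",") if item.strip()]
    return flag in flags


# True only if the current session already has the setting.
print(radv_flag_enabled("nogttspill"))
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment.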
2.5 amdgpu.noretry=0
- Date: PENDING
- Change: Kernel cmdline `amdgpu.noretry=0`
- Expected: Improved stability under memory pressure
- Notes: Only apply if experiencing GPU page faults or crashes during large model loading
- Verdict: PENDING
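Before and after changing a kernel parameter such as `amdgpu.noretry=0`, it can help to confirm what the running kernel actually booted with. The sketch below parses a command line string of the kind found in `/proc/cmdline`; it splits on whitespace only, so quoted values containing spaces are not handled, and `parse_cmdline` is a hypothetical helper.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {key: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Sample string built from the parameters in entry 1.2; on a live system
# the input would come from open("/proc/cmdline").read().
sample = "quiet iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496"
params = parse_cmdline(sample)
print(params.get("amdgpu.noretry", "not set"))  # not set
print(params["amdgpu.gttsize"])                 # 60416
```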
Phase 3: Runtime Flags
3.1 KV Cache Quantization
- Date: 2026-03-27
- Change: `--kv-types f16,q8_0,q4_0` sweep
- Benchmark: `data/benchmarks/kv-sweep-256k-*`
- Result (Vulkan RADV, Qwen3.5-35B-A3B Q8, pp2048/tg1024):
  - f16: 456 pp, 39.8 tg
  - q8_0: 418 pp, 38.5 tg (slight, unexpected Vulkan regression)
  - q4_0: 460 pp, 41.1 tg (fastest overall, +3% tg over f16)
- Result (ROCm, same model):
  - f16: 445 pp, 21.5 tg
  - q8_0: 495 pp, 21.7 tg (+11% pp, same tg)
  - q4_0: 494 pp, 21.8 tg (+11% pp, same tg)
- Conclusion: q4_0 is the sweet spot on Vulkan (fastest tg and roughly 72% less KV memory than f16). On ROCm, KV quant helps pp but not tg.
- Verdict: KEEP — use q4_0 KV as default for serving
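The KV memory saving can be estimated from the storage cost per element of each cache type. The sketch below assumes the usual ggml block layouts (32 elements per block plus one fp16 scale for `q8_0` and `q4_0`); with the scale counted, `q4_0` costs 4.5 bits per element, so the saving lands slightly below the nominal 4-bit figure of 75%.

```python
# (elements per block, bytes per block); layouts assumed from ggml's formats.
BLOCK_LAYOUT = {
    "f16": (1, 2),
    "q8_0": (32, 34),  # 32 int8 values + one fp16 scale
    "q4_0": (32, 18),  # 32 packed 4-bit values + one fp16 scale
}


def bits_per_element(kind: str) -> float:
    """Average storage cost per cached element, in bits."""
    elems, nbytes = BLOCK_LAYOUT[kind]
    return nbytes * 8 / elems


def saving_vs_f16(kind: str) -> float:
    """Fraction of KV memory saved relative to an f16 cache."""
    return 1.0 - bits_per_element(kind) / bits_per_element("f16")


for kind in BLOCK_LAYOUT:
    print(f"{kind}: {bits_per_element(kind):.1f} bits/element, "
          f"{saving_vs_f16(kind):.1%} smaller than f16")
```

The ratio is independent of model shape and context length, so it applies to any of the context window runs in this log.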
3.2 MoE Batch Size -b 256
- Date: PENDING
- Change: Add `-b 256` to MoE benchmark runs
- Expected: Up to +70% pp improvement for MoE models (community benchmarks)
- Benchmark: Not yet run
- Verdict: PENDING
Phase 4: Build Optimizations
4.1 rocWMMA Flash Attention
- Date: PENDING
- Change: Rebuild ROCm toolbox with `-DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON`
- Expected: +96% long-context performance (65K+)
- Notes: Need to check if Donato's toolboxes already include this
- Verdict: PENDING
4.2 rocWMMA Tuned Patch (PR #16827)
- Date: PENDING
- Notes: Fixes long-context regression. Check Donato's latest toolbox builds.
- Verdict: PENDING
Phase 5: Future / Blocked
5.1 Speculative Decoding
- Status: BLOCKED — llama.cpp PR #20075 (hybrid SSM/MoE fix)
- Draft model: Downloaded `Qwen3.5-0.8B-Q8_0.gguf` (812 MB) on 2026-03-27
- Last checked: 2026-03-27 (PR open since 2026-03-03, has ROCm buffer issues)
5.2 Native MTP (Multi-Token Prediction)
- Status: BLOCKED — llama.cpp PR #20700
- Last checked: 2026-03-27 — WIP, not expected to merge soon
5.3 GPU Clock Reporting
- Status: NOT A REAL ISSUE — sysfs reporting is broken, actual clocks are fine
- Measured: clpeak (2026-03-30) confirms GPU reaches 2900 MHz under compute load
- Notes: ROCm issue #5750 is about sysfs `pp_dpm_sclk` reporting, not actual performance. No action needed.
- Verdict: CLOSED (no performance impact)
Context Window Benchmarks
64K Context (pp4096/tg1024, MoE models)
- Date: 2026-03-26
- Benchmark: `data/benchmarks/ctx64k-*`
- Results: (check logs)
128K Context (pp8192/tg1024, MoE models)
- Date: 2026-03-26
- Benchmark: `data/benchmarks/ctx128k-realistic-*`
- Results: (check logs)
256K Context (pp16384/tg1024, MoE models)
- Date: 2026-03-27
- Benchmark: `data/benchmarks/ctx256k-*`
- Results: (check logs)
Model Quant Shootout
Qwen3.5-35B-A3B — Q4_K_L vs Q4_K_XL vs Q8 (2026-03-30)
- Benchmark: `data/benchmarks/qwen35-shootout-v2-*`
- Config: Vulkan RADV, q4_0 KV cache, pp2048/tg1024, 2 reps
- RyzenAdj: STAPM=81W (sustained ~70W due to HP firmware cap)
| Quant | File Size | pp2048 (t/s) | tg1024 (t/s) | Recommendation |
|---|---|---|---|---|
| UD-Q4_K_L | 18.8 GB | 825 | 57.0 | Fastest. Good quality. |
| UD-Q4_K_XL | 20.7 GB | 835 | 56.4 | Daily driver — best quality/speed. |
| Q8_0 | 34.4 GB | 850 | 51.4 | Best quality, 10% slower tg. |
Decision: Keep UD-Q4_K_XL (daily driver) and Q8_0 (quality fallback). Q4_K_L deleted: Q4_K_XL offers better quality for only +1.9 GB and about 1% lower tg (56.4 vs 57.0 t/s).
Coder Model Shootout (2026-03-30)
- Benchmark: `data/benchmarks/coder-shootout-*`
- Config: Vulkan RADV, q4_0 KV cache, pp2048/tg1024, 2 reps
- RyzenAdj: STAPM=81W (sustained ~70W)
| Model | Architecture | File Size | pp2048 (t/s) | tg1024 (t/s) |
|---|---|---|---|---|
| Qwen3-Coder-30B UD-Q6_K_XL | Pure Transformer | 24.5 GB | 737 | 61.0 |
| Qwen3.5-35B-A3B UD-Q4_K_XL | Hybrid DeltaNet | 20.7 GB | 821 | 54.9 |
| Nemotron-Cascade-2 Q8_0 | Hybrid Mamba-2 | 31.3 GB | 643 | 52.8 |
| Qwen3-Coder-Next UD-Q3_K_XL | Hybrid DeltaNet | 33.8 GB | 545 | 46.8 |
Analysis:
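- tg speed generally falls as file size grows, consistent with being memory-bandwidth-bound (~215 GB/s); the exception is Qwen3.5 (20.7 GB), which is slower than the larger Qwen3-Coder-30B (24.5 GB)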
- Pure Transformer (Qwen3-Coder-30B) has lowest overhead per token
- DeltaNet hybrid (Qwen3.5) has best pp — DeltaNet layers are efficient for prefill
- Qwen3-Coder-Next (80B at 3-bit) is ~23% slower in tg than Qwen3-Coder-30B (46.8 vs 61.0 t/s) but has >70% SWE-bench vs ~50% for the 30B
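One way to sanity-check the bandwidth-bound reading is to ask how much memory traffic per token each tg figure implies. The sketch below divides the ~215 GB/s figure by measured tg, which gives an upper bound on bytes read per generated token under the assumption that generation is purely bandwidth-limited; the helper name is illustrative.

```python
BANDWIDTH_GBPS = 215.0  # approximate DRAM bandwidth quoted in the analysis

# model: (file size in GB, tg1024 in t/s), copied from the shootout table
RESULTS = {
    "Qwen3-Coder-30B UD-Q6_K_XL": (24.5, 61.0),
    "Qwen3.5-35B-A3B UD-Q4_K_XL": (20.7, 54.9),
    "Nemotron-Cascade-2 Q8_0": (31.3, 52.8),
    "Qwen3-Coder-Next UD-Q3_K_XL": (33.8, 46.8),
}


def implied_gb_per_token(tg: float, bandwidth: float = BANDWIDTH_GBPS) -> float:
    """Upper bound on GB read per token if tg is purely bandwidth-bound."""
    return bandwidth / tg


for name, (size_gb, tg) in RESULTS.items():
    gb = implied_gb_per_token(tg)
    print(f"{name}: <= {gb:.2f} GB/token ({gb / size_gb:.0%} of file size)")
```

If the implied traffic is far below the file size, only a fraction of the weights is being read per token, so file size alone is a rough proxy for tg at best.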
Recommended roles:
- Qwen3-Coder-30B: Interactive tool-use / function-calling loops (fastest tg, purpose-built)
- Qwen3.5-35B-A3B: General tasks, long prompt processing (best pp, best all-rounder)
- Qwen3-Coder-Next: Complex multi-file coding tasks where quality > speed
How to Add Entries
When testing a new optimization:
- Record the date and exact change
- Run a benchmark: `make benchmark ARGS="--tag DESCRIPTIVE-NAME ..."`
- Compare: `make benchmark-compare BEFORE=data/path/baseline AFTER=data/path/new`
- Update this log with results and verdict
- If KEEP: document in optimization.md with the measured numbers
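To keep new entries consistent with the existing ones, the fields can be assembled by a small helper. This is a sketch only: `format_entry` does not exist in this repo, and the field order simply mirrors the entries above. It rejects verdicts outside the four defined at the top of this log.

```python
VERDICTS = {"KEEP", "REVERTED", "PENDING", "BLOCKED"}


def format_entry(number: str, title: str, date: str, change: str,
                 benchmark: str, result: str, verdict: str) -> str:
    """Render one log entry in the bullet style used by this document."""
    if verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    lines = [
        f"{number} {title}",
        f"- Date: {date}",
        f"- Change: {change}",
        f"- Benchmark: {benchmark}",
        f"- Result: {result}",
        f"- Verdict: {verdict}",
    ]
    return "\n".join(lines)


print(format_entry("3.2", "MoE Batch Size -b 256", "PENDING",
                   "Add -b 256 to MoE benchmark runs", "Not yet run",
                   "n/a", "PENDING"))
```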