Add `make serve` and `make serve-ngram` for launching llama-server with baked-in optimal settings (Vulkan RADV, q4_0 KV cache, flash attention, no-mmap, full GPU offload). N-gram speculative decoding gives 1.1-1.4x tg speedup on repetitive content without upstream PR dependencies. Update Phase 5 status: MTP is months away (4 unmerged PRs, no MoE support), draft-model speculation stalled on ROCm buffer crashes.
Optimization Log
Living document tracking what was applied, tested, and the actual results. Each entry records the change, benchmark evidence, and verdict.
Verdicts: KEEP (applied permanently), REVERTED (tested, didn't help), PENDING (not yet tested), BLOCKED (can't test yet).
Phase 1: Core System
1.1 Tuned Profile: accelerator-performance
- Date: 2026-03-26
- Change: `sudo tuned-adm profile accelerator-performance`
- Benchmark: `data/benchmarks/after-tuned-*`
- Result: +5-8% pp improvement, +2-3% tg improvement
- Verdict: KEEP
1.2 Kernel Boot Parameters
- Date: 2026-03-26
- Change: `iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496`
- Benchmark: `data/benchmarks/full-opt-all-models-*`
- Result: Combined with BIOS VRAM change. Large models now fit in GTT. Peak usage 38.8/59 GiB.
- Verdict: KEEP
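The two size parameters in the change above express the same 59 GiB limit in different units: `amdgpu.gttsize` is given in MiB and `ttm.pages_limit` in pages. A minimal sketch of the conversion, assuming standard 4 KiB pages (`gtt_params` is an illustrative helper, not part of this repo):

```python
PAGE_SIZE = 4096  # bytes; assumes standard 4 KiB pages on x86-64

def gtt_params(gtt_gib: int) -> dict[str, int]:
    """Convert a GTT size in GiB to the two kernel parameter values."""
    gtt_mib = gtt_gib * 1024
    pages = gtt_mib * 1024 * 1024 // PAGE_SIZE
    return {"amdgpu.gttsize": gtt_mib, "ttm.pages_limit": pages}

params = gtt_params(59)
print(" ".join(f"{k}={v}" for k, v in params.items()))
# amdgpu.gttsize=60416 ttm.pages_limit=15466496
```

Keeping both values derived from one number avoids the two limits drifting apart when the GTT size is changed later.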
1.3 BIOS VRAM Reduction (512 MB)
- Date: 2026-03-26
- Change: UMA Frame Buffer Size 32 GB -> 512 MB (HP ZBook F10 BIOS)
- Benchmark: `data/benchmarks/full-opt-all-models-*`
- Result: 31.5 GB freed for OS/GTT. Small models ~3-8% slower (GTT indirection vs dedicated VRAM), but system gained ability to run 37 GB+ models at 32K+ context. Net positive.
- Trade-off: Small model regression is acceptable given the massive capability gain.
- Verdict: KEEP
Phase 2: System Tuning
2.1 RyzenAdj PPT Increase
- Date: 2026-03-30
- Change: `sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000 --apu-slow-limit=85000`
- Result: STAPM raised from 59W→81W. PPT Fast raised to 81W. However, PPT SLOW and APU SLOW are stuck at 70W because the HP ZBook BIOS EC overrides these limits. Effective sustained power: ~70W (was ~59W).
- Benchmark: `data/benchmarks/qwen35-shootout-v2-*` (Vulkan, q4_0 KV, pp2048/tg1024)
- UD-Q4_K_L: 57.0 t/s (was ~39 t/s before RyzenAdj = +46%)
- UD-Q4_K_XL: 56.4 t/s
- Q8_0: 51.4 t/s (was ~39-41 t/s before = +25%)
- Thermals: 70-73C under load, 30C headroom. Cooling handles it easily.
- Notes: Settings are volatile (reset on reboot/sleep). Use `sudo make optimize-power` or install a systemd service for persistence. HP firmware hard-caps slow PPT at 70W regardless.
- Verdict: KEEP (significant real-world improvement despite HP firmware limit)
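ryzenadj takes these limits in milliwatts, so `85000` requests 85 W. The sketch below rebuilds the command from a wattage; it uses only the four flags from the change above, and `ryzenadj_command` is a hypothetical helper rather than anything provided by this repo or by ryzenadj:

```python
import shlex

# The four limit flags used in the change above; ryzenadj expects milliwatts.
LIMIT_FLAGS = ("--stapm-limit", "--fast-limit", "--slow-limit", "--apu-slow-limit")

def ryzenadj_command(watts: int) -> list[str]:
    """Build the argv that sets all four limits to the same wattage."""
    milliwatts = watts * 1000
    return ["sudo", "ryzenadj"] + [f"{flag}={milliwatts}" for flag in LIMIT_FLAGS]

print(shlex.join(ryzenadj_command(85)))
# sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000 --apu-slow-limit=85000
```

Requesting a value does not guarantee it is applied: as recorded above, the firmware still holds the slow limits at 70W.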
2.2 VM Sysctl Tuning
- Date: 2026-03-30
- Change: `vm.swappiness=1, vm.dirty_ratio=40, vm.dirty_background_ratio=10, vm.max_map_count=500000, vm.zone_reclaim_mode=0`
- Applied via: `sudo make optimize-power` (persists to `/etc/sysctl.d/99-llm-inference.conf`)
- Notes: Hard to isolate impact, since it was applied together with other Phase 2 changes. Prevents model weight eviction and I/O disruption.
- Verdict: KEEP — low risk, persists across reboots
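Because these values persist across reboots, it can be useful to confirm they are actually in effect. A minimal sketch that compares the live values against the targets listed above, assuming the standard procfs mapping (`vm.swappiness` lives at `/proc/sys/vm/swappiness`); the function names are illustrative:

```python
from pathlib import Path

# Target values from the change above.
EXPECTED = {
    "vm.swappiness": "1",
    "vm.dirty_ratio": "40",
    "vm.dirty_background_ratio": "10",
    "vm.max_map_count": "500000",
    "vm.zone_reclaim_mode": "0",
}

def read_sysctl(key: str) -> str:
    """Read a sysctl through procfs, e.g. vm.swappiness -> /proc/sys/vm/swappiness."""
    return Path("/proc/sys", *key.split(".")).read_text().strip()

def sysctl_mismatches(read=read_sysctl) -> dict[str, tuple[str, str]]:
    """Return {key: (expected, actual)} for every setting that differs."""
    mismatches = {}
    for key, expected in EXPECTED.items():
        actual = read(key)
        if actual != expected:
            mismatches[key] = (expected, actual)
    return mismatches

# Demo with a stubbed reader: pretend swappiness is still at 60.
fake = dict(EXPECTED, **{"vm.swappiness": "60"})
print(sysctl_mismatches(read=fake.__getitem__))
# {'vm.swappiness': ('1', '60')}
```

Calling `sysctl_mismatches()` with no arguments reads the running system; an empty result means every setting matches.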
2.3 Transparent Huge Pages
- Date: 2026-03-30
- Change: `echo always > /sys/kernel/mm/transparent_hugepage/enabled`
- Applied via: `sudo make optimize-power` (volatile; add `transparent_hugepage=always` to kernel cmdline for persistence)
- Notes: Reduces TLB misses for mmap'd model files. Hard to isolate impact.
- Verdict: KEEP — low risk
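Since this setting is volatile, it is worth checking after a reboot. The sysfs file lists all modes and marks the active one with square brackets (for example `always [madvise] never`). A small sketch of parsing that format; `active_thp_mode` is an illustrative name:

```python
THP_ENABLED = "/sys/kernel/mm/transparent_hugepage/enabled"

def active_thp_mode(text: str) -> str:
    """Return the bracketed (active) mode from the sysfs file contents."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError(f"no active mode found in {text!r}")

print(active_thp_mode("always [madvise] never"))  # madvise
print(active_thp_mode("[always] madvise never"))  # always
```

On a live system, pass the file contents in: `active_thp_mode(open(THP_ENABLED).read())`.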
2.4 RADV_PERFTEST=nogttspill
- Date: 2026-03-30
- Change: `RADV_PERFTEST=nogttspill` persisted to `/etc/environment.d/radv-llm.conf`
- Applied via: `sudo make optimize-power`
- Notes: Prevents GTT spill management overhead on unified memory Vulkan. Takes effect on next login. For current session: `export RADV_PERFTEST=nogttspill`
- Verdict: KEEP (persists across reboots)
2.5 amdgpu.noretry=0
- Date: PENDING
- Change: Kernel cmdline `amdgpu.noretry=0`
- Expected: Improved stability under memory pressure
- Notes: Only apply if experiencing GPU page faults or crashes during large model loading
- Verdict: PENDING
Phase 3: Runtime Flags
3.1 KV Cache Quantization
- Date: 2026-03-27
- Change: `--kv-types f16,q8_0,q4_0` sweep
- Benchmark: `data/benchmarks/kv-sweep-256k-*`
- Result (Vulkan RADV, Qwen3.5-35B-A3B Q8, pp2048/tg1024):
- f16: 456 pp, 39.8 tg
- q8_0: 418 pp, 38.5 tg (slight Vulkan regression — unexpected)
- q4_0: 460 pp, 41.1 tg (fastest overall, +3% tg over f16)
- Result (ROCm, same model):
- f16: 445 pp, 21.5 tg
- q8_0: 495 pp, 21.7 tg (+11% pp, same tg)
- q4_0: 494 pp, 21.8 tg (+11% pp, same tg)
- Conclusion: q4_0 is the sweet spot on Vulkan (fastest tg, plus ~72% less KV memory than f16 at 4.5 vs 16 bits per element). On ROCm, KV quant helps pp but not tg.
- Verdict: KEEP — use q4_0 KV as default for serving
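The KV memory saving in the conclusion follows from ggml's block layout: `q8_0` and `q4_0` store 32 elements per block plus one 2-byte f16 scale, so the effective cost is slightly above 8 and 4 bits per element. A worked sketch of that arithmetic (helper names are illustrative; it counts only the cache element storage, nothing else):

```python
# ggml block layouts: 32 elements per block plus one 2-byte f16 scale.
BLOCK_ELEMS = 32
BYTES_PER_BLOCK = {"q8_0": 2 + 32, "q4_0": 2 + 16}

def bits_per_element(kv_type: str) -> float:
    """Effective storage cost per cached element, including the block scale."""
    if kv_type == "f16":
        return 16.0
    return BYTES_PER_BLOCK[kv_type] * 8 / BLOCK_ELEMS

def saving_vs_f16(kv_type: str) -> float:
    """Fraction of KV cache memory saved relative to f16."""
    return 1 - bits_per_element(kv_type) / 16.0

for t in ("f16", "q8_0", "q4_0"):
    print(f"{t}: {bits_per_element(t):.1f} bits/elem, {saving_vs_f16(t):.0%} smaller than f16")
# f16: 16.0 bits/elem, 0% smaller than f16
# q8_0: 8.5 bits/elem, 47% smaller than f16
# q4_0: 4.5 bits/elem, 72% smaller than f16
```

The per-block scale is why q4_0 lands a little short of a full three-quarters reduction.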
3.2 MoE Batch Size -b 256
- Date: 2026-03-30
- Change: `-b 256` vs default (2048)
- Benchmark: `data/benchmarks/batch-default-*` vs `data/benchmarks/batch-256-*`
- Result (Vulkan RADV, Qwen3.5-35B-A3B UD-Q4_K_XL, q4_0 KV):
- Default: 826 pp, 55.9 tg
- b=256: 843 pp, 55.5 tg (within noise)
- Notes: Community-reported +70% improvement does not reproduce on Vulkan RADV. May only apply to ROCm or CPU backends, or to longer prompts (pp8192+).
- Verdict: NO IMPACT on Vulkan — not recommended
Phase 4: Build Optimizations
4.1 rocWMMA Flash Attention
- Date: PENDING
- Change: Rebuild ROCm toolbox with `-DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON`
- Expected: +96% long-context performance (65K+)
- Notes: Need to check if Donato's toolboxes already include this
- Verdict: PENDING
4.2 rocWMMA Tuned Patch (PR #16827)
- Date: PENDING
- Notes: Fixes long-context regression. Check Donato's latest toolbox builds.
- Verdict: PENDING
Phase 5: Future / Blocked
5.1 Speculative Decoding (draft model)
- Status: BLOCKED — llama.cpp PR #20075 (hybrid SSM/MoE checkpoint/restore)
- Draft model: Downloaded `Qwen3.5-0.8B-Q8_0.gguf` (812 MB) on 2026-03-27
- Last checked: 2026-03-30. PR stalled since Mar 5. ROCm buffer crashes in `copy_cell()`. Works on Metal/CUDA but not AMD. Months away from landing.
5.2 Native MTP (Multi-Token Prediction)
- Status: BLOCKED — multiple dependencies unmerged
- Last checked: 2026-03-30
- Details: 4 separate PRs in flight, none merged:
- PR #18886: MTP API framework (DRAFT since Feb 6) — foundation for all MTP work
- PR #20700: MTP for Qwen3.5 dense only (WIP, author says "not expected to merge soon")
- PR #15225: GLM-style MTP (open since Aug 2025, "slower than baseline")
- PR #18039: EAGLE3 speculative (open since Dec 2025)
- Key gap: No MTP implementation exists for MoE models. PR #20700 only covers dense Qwen3.5 (0.8B-27B), not the 35B-A3B MoE.
- Timeline estimate: MTP API (#18886) must merge first, then the model-specific implementations have to be adapted to it. Months, not weeks.
5.2a N-gram Speculative Decoding (AVAILABLE NOW)
- Status: WORKS TODAY — no upstream PRs needed
- How: `llama-server --spec-type ngram-simple --draft-max 64 --draft-min 4`
- Expected: 1.1-1.4x tg speedup on repetitive content (code, structured output)
- Added to: `make serve-ngram ARGS="-m MODEL.gguf"` and `bin/serve --ngram`
- Notes: Pattern-matches from token history, no draft model needed. Best for code generation where patterns repeat. No quality impact.
- Verdict: AVAILABLE (use the `--ngram` flag when serving)
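To make "pattern-matches from token history" concrete, here is a simplified sketch of n-gram drafting: look for the most recent earlier occurrence of the last `n` tokens and propose whatever followed it. This illustrates the general idea only; it is not llama.cpp's implementation, and the function name and defaults are assumptions:

```python
def ngram_draft(tokens: list[int], n: int = 3, max_draft: int = 8) -> list[int]:
    """Propose draft tokens by matching the trailing n-gram against earlier history.

    Returns the tokens that followed the most recent earlier occurrence of the
    last n tokens (at most max_draft), or [] when there is no match.
    """
    if len(tokens) <= n:
        return []
    key = tokens[-n:]
    # Scan backwards over earlier start positions, skipping the trailing n-gram itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == key:
            return tokens[start + n:start + n + max_draft]
    return []

history = [1, 2, 3, 4, 5, 6, 1, 2, 3]
print(ngram_draft(history, n=3, max_draft=4))  # [4, 5, 6, 1]
```

In speculative decoding the target model still verifies every drafted token, so a wrong guess costs some time but does not change the output. That is also why the speedup depends on how repetitive the content is.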
5.3 GPU Clock Reporting
- Status: NOT A REAL ISSUE — sysfs reporting is broken, actual clocks are fine
- Measured: clpeak (2026-03-30) confirms GPU reaches 2900 MHz under compute load
- Notes: ROCm issue #5750 is about sysfs `pp_dpm_sclk` reporting, not actual performance. No action needed.
- Verdict: CLOSED (no performance impact)
Context Window Benchmarks
64K Context (pp4096/tg1024, MoE models)
- Date: 2026-03-26
- Benchmark: `data/benchmarks/ctx64k-*`
- Results: (check logs)
128K Context (pp8192/tg1024, MoE models)
- Date: 2026-03-26
- Benchmark: `data/benchmarks/ctx128k-realistic-*`
- Results: (check logs)
256K Context (pp16384/tg1024, MoE models)
- Date: 2026-03-27
- Benchmark: `data/benchmarks/ctx256k-*`
- Results: (check logs)
Model Quant Shootout
Qwen3.5-35B-A3B — Q4_K_L vs Q4_K_XL vs Q8 (2026-03-30)
- Benchmark: `data/benchmarks/qwen35-shootout-v2-*`
- Config: Vulkan RADV, q4_0 KV cache, pp2048/tg1024, 2 reps
- RyzenAdj: STAPM=81W (sustained ~70W due to HP firmware cap)
| Quant | File Size | pp2048 (t/s) | tg1024 (t/s) | Recommendation |
|---|---|---|---|---|
| UD-Q4_K_L | 18.8 GB | 825 | 57.0 | Fastest. Good quality. |
| UD-Q4_K_XL | 20.7 GB | 835 | 56.4 | Daily driver — best quality/speed. |
| Q8_0 | 34.4 GB | 850 | 51.4 | Best quality, 10% slower tg. |
Decision: Keep UD-Q4_K_XL (daily driver) and Q8_0 (quality fallback). Q4_K_L deleted: Q4_K_XL gives better quality for only ~2 GB more, with tg within ~1% (56.4 vs 57.0 t/s).
Coder Model Shootout (2026-03-30)
- Benchmark: `data/benchmarks/coder-shootout-*`
- Config: Vulkan RADV, q4_0 KV cache, pp2048/tg1024, 2 reps
- RyzenAdj: STAPM=81W (sustained ~70W)
| Model | Architecture | File Size | pp2048 (t/s) | tg1024 (t/s) |
|---|---|---|---|---|
| Qwen3-Coder-30B UD-Q6_K_XL | Pure Transformer | 24.5 GB | 737 | 61.0 |
| Qwen3.5-35B-A3B UD-Q4_K_XL | Hybrid DeltaNet | 20.7 GB | 821 | 54.9 |
| Nemotron-Cascade-2 Q8_0 | Hybrid Mamba-2 | 31.3 GB | 643 | 52.8 |
| Qwen3-Coder-Next UD-Q3_K_XL | Hybrid DeltaNet | 33.8 GB | 545 | 46.8 |
Analysis:
- tg speed broadly falls as file size grows (decode is bandwidth-bound at ~215 GB/s), with one exception: the 20.7 GB Qwen3.5 is slower than the 24.5 GB Qwen3-Coder-30B
- Pure Transformer (Qwen3-Coder-30B) has lowest overhead per token
- DeltaNet hybrid (Qwen3.5) has best pp — DeltaNet layers are efficient for prefill
- Qwen3-Coder-Next (80B at 3-bit) is ~23% slower in tg than the 30B (46.8 vs 61.0 t/s) but scores >70% on SWE-bench vs ~50% for the 30B
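The bandwidth-bound reading of these results can be checked with simple arithmetic: if decode is limited purely by memory bandwidth, then bandwidth divided by tg gives an upper bound on the data read per generated token. The sketch below applies that to the table, taking ~215 GB/s as the achievable bandwidth (an assumption carried over from the analysis) and 1 GB as 10^9 bytes:

```python
BANDWIDTH_GBPS = 215.0  # approximate memory bandwidth quoted in the analysis, GB/s

# tg1024 results (t/s) from the coder shootout table.
TG = {
    "Qwen3-Coder-30B UD-Q6_K_XL": 61.0,
    "Qwen3.5-35B-A3B UD-Q4_K_XL": 54.9,
    "Nemotron-Cascade-2 Q8_0": 52.8,
    "Qwen3-Coder-Next UD-Q3_K_XL": 46.8,
}

def implied_gb_per_token(tg_tps: float, bandwidth_gbps: float = BANDWIDTH_GBPS) -> float:
    """Upper bound on GB read per generated token if decode is purely bandwidth-bound."""
    return bandwidth_gbps / tg_tps

for model, tg in TG.items():
    print(f"{model}: <= {implied_gb_per_token(tg):.2f} GB/token")
# Qwen3-Coder-30B UD-Q6_K_XL: <= 3.52 GB/token
# Qwen3.5-35B-A3B UD-Q4_K_XL: <= 3.92 GB/token
# Nemotron-Cascade-2 Q8_0: <= 4.07 GB/token
# Qwen3-Coder-Next UD-Q3_K_XL: <= 4.59 GB/token
```

Each bound is far below the corresponding file size (20.7 to 33.8 GB), which is what one would expect if only a subset of the weights is read per token. Under that reading, tg tracks the per-token working set rather than the total file size.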
Recommended roles:
- Qwen3-Coder-30B: Interactive tool-use / function-calling loops (fastest tg, purpose-built)
- Qwen3.5-35B-A3B: General tasks, long prompt processing (best pp, best all-rounder)
- Qwen3-Coder-Next: Complex multi-file coding tasks where quality > speed
How to Add Entries
When testing a new optimization:
- Record the date and exact change
- Run a benchmark: `make benchmark ARGS="--tag DESCRIPTIVE-NAME ..."`
- Compare: `make benchmark-compare BEFORE=data/path/baseline AFTER=data/path/new`
- Update this log with results and verdict
- If KEEP: document in optimization.md with the measured numbers
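The percentage deltas quoted throughout this log are relative changes against the baseline run. When writing up results by hand, a small helper like the following keeps them consistent; `pct_change` is an illustrative function and not part of `make benchmark-compare`:

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100

# Entry 2.1: UD-Q4_K_L tg went from ~39 t/s to 57.0 t/s.
print(f"{pct_change(39, 57.0):+.0f}%")    # +46%
# Entry 3.2: tg with -b 256 vs default, 55.9 -> 55.5 t/s.
print(f"{pct_change(55.9, 55.5):+.1f}%")  # -0.7%
```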