feat(optimize): add Phase 2 power profile and system tuning
Add `make optimize-power` (ryzenadj 85W, sysctl, THP, RADV nogttspill) with systemd services for boot/resume persistence. Integrate into `make optimize --all` as Phase 2. Update optimization log with RyzenAdj results (+46% tg at 70W sustained), KV sweep data, and quant shootout. Add Qwen3-Coder-30B and Nemotron-Cascade-2 to model catalog.
@@ -37,38 +37,42 @@ Living document tracking what was applied, tested, and the actual results. Each

 ## Phase 2: System Tuning

-### 2.1 RyzenAdj 85W PPT
+### 2.1 RyzenAdj PPT Increase

-- **Date**: PENDING
-- **Change**: `sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000`
-- **Expected**: +12-19% CPU/GPU throughput (community data from Strix Halo Wiki)
-- **Benchmark**: Not yet run
-- **Notes**: HP ZBook ships at 60W. 85W is the community-recommended sweet spot.
-- **Verdict**: PENDING
+- **Date**: 2026-03-30
+- **Change**: `sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000 --apu-slow-limit=85000`
+- **Result**: STAPM raised from 59W→81W. PPT Fast raised to 81W. **However, PPT SLOW and APU SLOW stuck at 70W**: the HP ZBook BIOS EC overrides these limits. Effective sustained power: ~70W (was ~59W).
+- **Benchmark**: `data/benchmarks/qwen35-shootout-v2-*` (Vulkan, q4_0 KV, pp2048/tg1024)
+  - UD-Q4_K_L: **57.0 t/s** (was ~39 t/s before RyzenAdj = **+46%**)
+  - UD-Q4_K_XL: **56.4 t/s**
+  - Q8_0: **51.4 t/s** (was ~39-41 t/s before = **+25%**)
+- **Thermals**: 70-73 °C under load, 30 °C headroom. Cooling handles it easily.
+- **Notes**: Settings are volatile (reset on reboot/sleep). Use `sudo make optimize-power` or install systemd service for persistence. HP firmware hard-caps slow PPT at 70W regardless.
+- **Verdict**: KEEP: significant real-world improvement despite HP firmware limit
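
The percentage gains quoted in the 2.1 entry follow directly from the before/after tg figures. A minimal sketch of that arithmetic, assuming a POSIX shell with `awk`; `pct_gain` is a hypothetical helper for illustration and is not part of the repository:

```shell
#!/bin/sh
# pct_gain BEFORE AFTER: print the relative change in percent, rounded to an integer.
# Hypothetical helper; not taken from the repo's Makefile or scripts.
pct_gain() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.0f\n", (after - before) / before * 100 }'
}

pct_gain 39 57.0   # UD-Q4_K_L: ~39 t/s before, 57.0 t/s after
pct_gain 41 51.4   # Q8_0, using the upper end of the ~39-41 t/s baseline
```

With the lower baseline of 39 t/s the Q8_0 gain works out to roughly 32%, so the +25% in the log is the conservative end of the range.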

 ### 2.2 VM Sysctl Tuning

-- **Date**: PENDING
-- **Change**: `vm.swappiness=1, vm.dirty_ratio=40, vm.max_map_count=500000`
-- **Expected**: Prevent model weight eviction, reduce I/O disruption
-- **Benchmark**: Not yet run
-- **Verdict**: PENDING
+- **Date**: 2026-03-30
+- **Change**: `vm.swappiness=1, vm.dirty_ratio=40, vm.dirty_background_ratio=10, vm.max_map_count=500000, vm.zone_reclaim_mode=0`
+- **Applied via**: `sudo make optimize-power` (persists to `/etc/sysctl.d/99-llm-inference.conf`)
+- **Notes**: Hard to isolate impact, since it was applied together with other Phase 2 changes. Intended to prevent model weight eviction and I/O disruption.
+- **Verdict**: KEEP: low risk, persists across reboots
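
For reference, the five settings listed in the 2.2 entry could be written as a sysctl drop-in along these lines. This is a sketch only: `write_sysctl_conf` is a hypothetical name, and the actual `make optimize-power` target may generate the file differently.

```shell
#!/bin/sh
# write_sysctl_conf PATH: write the five vm.* settings from the log entry to PATH.
# Hypothetical helper for illustration; the repo's real implementation may differ.
write_sysctl_conf() {
    cat > "$1" <<'EOF'
vm.swappiness = 1
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10
vm.max_map_count = 500000
vm.zone_reclaim_mode = 0
EOF
}

# Example: write to a scratch file and inspect it without touching /etc.
conf="$(mktemp)"
write_sysctl_conf "$conf"
cat "$conf"
rm -f "$conf"
```

To apply it for real, the same content would go to `/etc/sysctl.d/99-llm-inference.conf` as root, followed by `sudo sysctl --system`; `sysctl -n vm.swappiness` then confirms the live value.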

 ### 2.3 Transparent Huge Pages

-- **Date**: PENDING
-- **Change**: `transparent_hugepage=always`
-- **Expected**: Faster model load time, possible 1-5% tg improvement from reduced TLB misses
-- **Benchmark**: Not yet run
-- **Verdict**: PENDING
+- **Date**: 2026-03-30
+- **Change**: `echo always > /sys/kernel/mm/transparent_hugepage/enabled`
+- **Applied via**: `sudo make optimize-power` (volatile; add `transparent_hugepage=always` to the kernel cmdline for persistence)
+- **Notes**: Intended to reduce TLB misses for mmap'd model files. Hard to isolate impact.
+- **Verdict**: KEEP: low risk
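
Because the THP setting is volatile, it is worth checking after a reboot or resume. The sysfs file lists all modes and marks the active one in square brackets (for example `always [madvise] never`). A small sketch for reading it back, where `thp_mode` is a hypothetical helper name:

```shell
#!/bin/sh
# thp_mode [FILE]: print the active THP mode, i.e. the bracketed word in FILE.
# FILE defaults to the kernel's sysfs knob. Hypothetical helper for illustration.
thp_mode() {
    file="${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "$file"
}

# Only query sysfs when the knob exists (Linux with THP support).
if [ -r /sys/kernel/mm/transparent_hugepage/enabled ]; then
    thp_mode
fi
```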

 ### 2.4 RADV_PERFTEST=nogttspill

-- **Date**: PENDING
-- **Change**: `export RADV_PERFTEST=nogttspill`
-- **Expected**: Fix pp degradation on Vulkan RADV (community-reported fix for Strix Halo)
-- **Benchmark**: Not yet run — needs Vulkan-specific benchmark comparison
-- **Verdict**: PENDING
+- **Date**: 2026-03-30
+- **Change**: `RADV_PERFTEST=nogttspill` persisted to `/etc/environment.d/radv-llm.conf`
+- **Applied via**: `sudo make optimize-power`
+- **Notes**: Prevents GTT spill management overhead on unified memory Vulkan. Takes effect on next login. For current session: `export RADV_PERFTEST=nogttspill`
+- **Verdict**: KEEP: persists across reboots
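
Since the persisted value only takes effect on the next login, a quick check of the current session helps avoid benchmarking without the flag. `RADV_PERFTEST` is a comma-separated list, so the sketch below matches whole entries; `has_radv_flag` is a hypothetical helper name:

```shell
#!/bin/sh
# has_radv_flag FLAG: succeed if FLAG is one of the comma-separated
# entries in $RADV_PERFTEST. Hypothetical helper for illustration.
has_radv_flag() {
    case ",${RADV_PERFTEST:-}," in
        *",$1,"*) return 0 ;;
        *) return 1 ;;
    esac
}

if has_radv_flag nogttspill; then
    echo "nogttspill active in this session"
else
    echo "nogttspill not set; run: export RADV_PERFTEST=nogttspill"
fi
```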

 ### 2.5 amdgpu.noretry=0

@@ -84,11 +88,19 @@ Living document tracking what was applied, tested, and the actual results. Each

 ### 3.1 KV Cache Quantization

-- **Date**: PENDING (sweep running)
-- **Change**: `-ctk q8_0 -ctv q8_0` / `-ctk q4_0 -ctv q4_0`
-- **Benchmark**: `data/benchmarks/kv-sweep-128k-*` (in progress)
-- **Expected**: Q8_0: ~50% less KV memory, negligible quality loss. Q4_0: ~75% less, noticeable quality impact.
-- **Verdict**: PENDING
+- **Date**: 2026-03-27
+- **Change**: `--kv-types f16,q8_0,q4_0` sweep
+- **Benchmark**: `data/benchmarks/kv-sweep-256k-*`
+- **Result** (Vulkan RADV, Qwen3.5-35B-A3B Q8, pp2048/tg1024):
+  - f16: 456 pp, 39.8 tg
+  - q8_0: 418 pp, 38.5 tg (slight Vulkan regression, unexpected)
+  - **q4_0: 460 pp, 41.1 tg** (fastest overall, +3% tg over f16)
+- **Result** (ROCm, same model):
+  - f16: 445 pp, 21.5 tg
+  - q8_0: 495 pp, 21.7 tg (+11% pp, same tg)
+  - q4_0: 494 pp, 21.8 tg (+11% pp, same tg)
+- **Conclusion**: q4_0 is the sweet spot on Vulkan (fastest tg + 75% less KV memory). On ROCm, KV quant helps pp but not tg.
+- **Verdict**: KEEP: use q4_0 KV as default for serving
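
The "~50%" and "75% less KV memory" figures can be sanity-checked from the quantization block layouts. The sketch below assumes the usual ggml layouts, where q8_0 packs 32 values into 34 bytes and q4_0 packs 32 values into 18 bytes (each with one fp16 scale), against 64 bytes for 32 f16 values. Those block sizes are an assumption here, not something stated in the log, and `kv_saving` is a hypothetical helper:

```shell
#!/bin/sh
# kv_saving BYTES_PER_32: percent of KV memory saved relative to f16,
# where f16 needs 64 bytes per 32 values. Block sizes passed in are assumptions.
kv_saving() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", (1 - b / 64) * 100 }'
}

kv_saving 34   # q8_0: 32 int8 values + 2-byte scale
kv_saving 18   # q4_0: 32 4-bit values + 2-byte scale
```

Under these assumptions the savings come out slightly below the rounded figures in the log (about 47% and 72%), which does not change the ranking or the verdict.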

 ### 3.2 MoE Batch Size `-b 256`

@@ -161,6 +173,24 @@ Living document tracking what was applied, tested, and the actual results. Each

 ---

+## Model Quant Shootout
+
+### Qwen3.5-35B-A3B: Q4_K_L vs Q4_K_XL vs Q8 (2026-03-30)
+
+- **Benchmark**: `data/benchmarks/qwen35-shootout-v2-*`
+- **Config**: Vulkan RADV, q4_0 KV cache, pp2048/tg1024, 2 reps
+- **RyzenAdj**: STAPM=81W (sustained ~70W due to HP firmware cap)
+
+| Quant | File Size | pp2048 (t/s) | tg1024 (t/s) | Recommendation |
+|-------|-----------|-------------|-------------|----------------|
+| UD-Q4_K_L | 18.8 GB | 825 | **57.0** | Fastest. Good quality. |
+| **UD-Q4_K_XL** | 20.7 GB | 835 | **56.4** | **Daily driver**: best quality/speed. |
+| Q8_0 | 34.4 GB | 850 | 51.4 | Best quality, 10% slower tg. |
+
+**Decision**: Keep UD-Q4_K_XL (daily driver) and Q8_0 (quality fallback). UD-Q4_K_L can be deleted: UD-Q4_K_XL is within about 1% on tg (56.4 vs 57.0 t/s), slightly faster on pp, and costs only +1.9 GB.
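
To make the tg trade-off between the three quants explicit, each result can be expressed as a percentage of the fastest one. A rough sketch using the tg1024 numbers from the table; `rel_to_fastest` is a hypothetical helper, not part of the repository:

```shell
#!/bin/sh
# rel_to_fastest: read "name tg" pairs on stdin and print each quant's tg
# as a percentage of the fastest entry. Hypothetical helper for illustration.
rel_to_fastest() {
    awk '{ name[NR] = $1; tg[NR] = $2 + 0; if (tg[NR] > max + 0) max = tg[NR] }
         END { for (i = 1; i <= NR; i++) printf "%s %.1f%%\n", name[i], tg[i] / max * 100 }'
}

printf '%s\n' 'UD-Q4_K_L 57.0' 'UD-Q4_K_XL 56.4' 'Q8_0 51.4' | rel_to_fastest
```

This puts UD-Q4_K_XL at roughly 99% and Q8_0 at roughly 90% of the UD-Q4_K_L tg rate, consistent with the "10% slower tg" note in the table.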
+
+---
+
 ## How to Add Entries

 When testing a new optimization: