Compare commits


10 Commits

Author SHA1 Message Date
Felipe Cardoso
c847991740 docs: add agentic coding evaluation landscape research
Comprehensive research (706 lines, dated 2026-03-30) covering evaluation
dimensions, benchmark suites, and open-weight model performance for
software engineering agent use cases on 64GB systems.

Also gitignore evalplus_results/ (runtime outputs) and ztop/ (nested repo).
2026-04-15 15:55:04 +02:00
Felipe Cardoso
15bb6a8ed9 feat(serve): set APEX I-Compact as default, harden benchmark workflow
Serving:
- make serve now launches Claude-distilled APEX 35B-A3B (16GB) with 2
  parallel slots and 256K context as the daily driver
- add serve-custom for ad-hoc model testing
- add flush-gpu to reclaim unified memory after stuck runs

Benchmarks:
- default Vulkan-only backends (ROCm trails at long context)
- add --backends filter to run-baseline.sh
- fix backend filter substring bug (grep -qFx for exact line match)
- fix model filter regex metacharacter bug (grep -qiF for literal)
- respect --tg in long-context tests instead of hardcoded n=32
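Both grep fixes boil down to forcing literal, whole-line matching; a minimal sketch (list contents illustrative, not the actual script):

```shell
# Backend list as a filter might consume it (contents illustrative).
backends='vulkan
vulkan-rocm'

# Plain grep does substring/regex matching: "vulkan" also hits "vulkan-rocm".
echo "$backends" | grep -c 'vulkan'      # → 2
# -F = literal string (no regex metacharacters), -x = whole-line match:
echo "$backends" | grep -cFx 'vulkan'    # → 1
```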

ROCm bump to 7.2.1 (kernel 6.18.4+ patch); keep 7.2 as optional.

Catalog:
- add mudler APEX I-Compact (Claude-distilled 35B, 17GB)
- add 0xSero REAP-40 (pruned 122B-A10B, 46GB)
- update download instructions: hf download (huggingface-cli is gone)
2026-04-13 01:11:46 +02:00
Felipe Cardoso
474d94a07e chore: update model catalog with gemma 4, opus distill, and hw-bandwidth target 2026-04-03 20:03:53 +02:00
Felipe Cardoso
6ab08537ca fix: address code review findings — batch args, venv path, serve flags
- Fix missing BATCH_ARGS in long-context commands (both benchmark scripts)
- Fix CLAUDE.md stale venv path (data/venv → .venv) and add serve/power docs
- Add -b/--batch to bin/benchmark help text
- Add --no-think flag to serve script (--reasoning-budget 0)
- Sanitize model names in eval run directories
- Simplify agentic setup to use requirements.txt
- Add serve --help test, batch flag assertions to existing tests
- Add requirements.txt for reproducible venv setup (Python 3.13)
2026-03-31 10:10:48 +02:00
Felipe Cardoso
dd403a907c feat(serve): add optimized llama-server launcher with n-gram speculation
Add `make serve` and `make serve-ngram` for launching llama-server with
baked-in optimal settings (Vulkan RADV, q4_0 KV cache, flash attention,
no-mmap, full GPU offload). N-gram speculative decoding gives 1.1-1.4x
tg speedup on repetitive content without upstream PR dependencies.
Update Phase 5 status: MTP is months away (4 unmerged PRs, no MoE
support), draft-model speculation stalled on ROCm buffer crashes.
2026-03-30 21:12:30 +02:00
Felipe Cardoso
ba24091791 feat(benchmark): add -b/--batch flag, test MoE batch size impact
Add batch size override to benchmark scripts. Testing -b 256 vs default
2048 on Vulkan RADV shows no meaningful difference for MoE pp2048
(826 vs 843 t/s, within noise). Community-reported +70% improvement
does not reproduce on this backend.
2026-03-30 20:01:24 +02:00
Felipe Cardoso
ea70687cd2 docs: update optimization guide with measured hardware data
Replace estimated values with clpeak measurements: DRAM 216-233 GB/s,
GPU clocks confirmed 2900 MHz under load (ROCm #5750 is sysfs reporting
only). Correct backend recommendation to Vulkan RADV (2.7x faster tg
than ROCm at 131K). Update KV cache recommendation to q4_0. Add
Nemotron-Cascade-2 to coder shootout results. Remove Nemotron-3-Nano
from catalog (replaced by Cascade-2). Update Q4_K_L to Q4_K_XL entry.
2026-03-30 19:56:18 +02:00
Felipe Cardoso
1549bc27c0 feat(optimize): add Phase 2 power profile and system tuning
Add `make optimize-power` (ryzenadj 85W, sysctl, THP, RADV nogttspill)
with systemd services for boot/resume persistence. Integrate into
`make optimize --all` as Phase 2. Update optimization log with RyzenAdj
results (+46% tg at 70W sustained), KV sweep data, and quant shootout.
Add Qwen3-Coder-30B and Nemotron-Cascade-2 to model catalog.
2026-03-30 18:53:52 +02:00
Felipe Cardoso
f92b710492 fix(benchmark): parse llama-bench output with variable column count
KV cache quantization adds type_k/type_v columns to llama-bench output,
shifting test and t/s to different indices. Parse from end of row instead
of hardcoded positions. Also fix KV suffix separator (underscore to dash)
to avoid regex ambiguity with type names like q8_0.
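The parse-from-end approach can be sketched as follows (row layout illustrative, not verbatim llama-bench output):

```shell
# llama-bench markdown rows gain type_k/type_v columns when KV cache
# quantization is set, but "test" and "t/s" stay the last two populated
# fields, so index from the end (NF) instead of fixed positions.
row='| llama 30B MoE | q4_0 | q4_0 | pp2048 | 843.21 |'
echo "$row" | awk -F'|' '{
  gsub(/ /, "", $(NF-2)); gsub(/ /, "", $(NF-1));
  printf "test=%s t/s=%s\n", $(NF-2), $(NF-1)
}'
# → test=pp2048 t/s=843.21
```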

Add 5-phase optimization guide, optimization log for tracking results,
and research docs on llama.cpp and inference landscape optimizations.
2026-03-27 14:54:19 +01:00
Felipe Cardoso
7531f6fa74 feat(benchmark): add --kv-types flag for KV cache quantization sweep 2026-03-27 12:29:19 +01:00
25 changed files with 3539 additions and 242 deletions

.gitignore

@@ -1,6 +1,9 @@
data/
.venv/
*.log
*.csv
*.tmp
.claude/
.idea/
evalplus_results/
ztop/


@@ -41,9 +41,21 @@ make verify # 9-point optimization checklist
bin/audit --json | python3 -m json.tool # Verify JSON output is valid
```
## Serving
`scripts/serve/launch.sh` with dispatcher at `bin/serve`. Launches llama-server inside toolbox containers with optimized defaults: Vulkan RADV, q4_0 KV cache, flash attention, no-mmap, full GPU offload. Key flags:
- `--ngram` — n-gram speculative decoding (~1.1-1.4x tg for repetitive content)
- `--no-think` — disables thinking/reasoning via `--reasoning-budget 0` (faster for evals)
- `--ctx N` — context size (default 131072)
- `--parallel N` — concurrent request slots
## System Tuning
`scripts/optimize/power-profile.sh` applies Phase 2 optimizations: RyzenAdj PPT increase (85W target, HP caps at 70W sustained), sysctl tuning (vm.swappiness=1, vm.max_map_count=500000), THP=always, RADV_PERFTEST=nogttspill. Systemd services for boot/resume persistence at `configs/ryzenadj-llm.service` and `configs/ryzenadj-resume.service`.
## Agentic Evaluation
Scripts in `scripts/agentic/` with dispatcher at `bin/agentic`. Uses a Python venv at `data/venv/`. Eval frameworks: inspect-ai (all-in-one), evalplus (HumanEval+/MBPP+), bigcodebench. All target an OpenAI-compatible endpoint (ollama or llama.cpp server). Model catalog at `configs/models.conf`.
Scripts in `scripts/agentic/` with dispatcher at `bin/agentic`. Uses a Python venv at `.venv/` (Python 3.13, dependencies in `requirements.txt`). Eval frameworks: inspect-ai (all-in-one), inspect-evals (task definitions), evalplus (HumanEval+/MBPP+), bigcodebench. All target an OpenAI-compatible endpoint — auto-detects llama-server (port 8080) or ollama (port 11434). Model catalog at `configs/models.conf`.
## External Resources


@@ -38,6 +38,28 @@ benchmark: ## Run full benchmark suite (supports ARGS="--tag NAME --max-size 20"
benchmark-compare: ## Compare two benchmark runs (usage: make benchmark-compare BEFORE=dir AFTER=dir)
@bash bin/benchmark compare $(BEFORE) $(AFTER)
# --- Serve ---
serve: ## Launch APEX I-Compact daily driver (2 slots, 256K ctx)
@bash bin/serve -m Qwen3.5-35B-A3B-Claude-Distilled-APEX-I-Compact.gguf --parallel 2 --ctx 262144 $(ARGS)
serve-custom: ## Launch llama-server with custom model (ARGS="-m MODEL.gguf")
@bash bin/serve $(ARGS)
serve-ngram: ## Launch with n-gram speculative decoding (ARGS="-m MODEL.gguf")
@bash bin/serve --ngram $(ARGS)
flush-gpu: ## Kill llama-server/bench processes and drop kernel caches to free unified VRAM
-@pkill -x llama-server 2>/dev/null || true
-@pkill -x llama-bench 2>/dev/null || true
-@pkill -x llama-cli 2>/dev/null || true
-@podman ps --filter name=llama --format '{{.Names}}' | xargs -r podman stop
@sync && sudo sysctl vm.drop_caches=3
@echo "VRAM usage:" && cat /sys/class/drm/card*/device/mem_info_vram_used 2>/dev/null | awk '{printf " %.2f MiB\n", $$1/1048576}'
# --- Hardware Info ---
hw-bandwidth: ## Measure GPU memory bandwidth and compute (clpeak)
@clpeak 2>&1
# --- Optimize ---
optimize: ## Interactive optimization walkthrough
@bash bin/optimize --all
@@ -48,6 +70,9 @@ optimize-kernel: ## Configure kernel boot parameters
optimize-tuned: ## Switch to accelerator-performance profile
@bash scripts/optimize/tuned-profile.sh
optimize-power: ## Apply Phase 2 tuning (ryzenadj, sysctl, THP, RADV)
@bash scripts/optimize/power-profile.sh
optimize-vram: ## BIOS VRAM guidance + GTT verification
@bash scripts/optimize/vram-gtt.sh


@@ -23,10 +23,13 @@ case "${1:-help}" in
echo " --category LIST Comma-separated: smoke,dense,moe"
echo " --skip-longctx Skip long-context (32K) tests"
echo " --reps N Standard test repetitions (default: 5)"
echo " -b, --batch N Batch size (default: 2048, try 256 for MoE)"
echo " --kv-types LIST KV cache sweep (e.g. f16,q8_0,q4_0 or q4_0:q8_0)"
echo ""
echo "Examples:"
echo " benchmark baseline --max-size 20 --skip-longctx"
echo " benchmark run --tag post-opt --category moe"
echo " benchmark run --tag kv-sweep --kv-types f16,q8_0,q4_0 --context 131072"
exit 1
;;
esac


@@ -6,24 +6,31 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
case "${1:---all}" in
--all|-a)
echo "Running optimization walkthrough..."
echo "Running full optimization walkthrough..."
echo ""
echo "=== Phase 1: Core System ==="
bash "$SCRIPT_DIR/scripts/optimize/tuned-profile.sh"
bash "$SCRIPT_DIR/scripts/optimize/kernel-params.sh"
bash "$SCRIPT_DIR/scripts/optimize/vram-gtt.sh"
echo ""
echo "=== Phase 2: System Tuning ==="
bash "$SCRIPT_DIR/scripts/optimize/power-profile.sh"
echo ""
bash "$SCRIPT_DIR/scripts/optimize/verify.sh"
;;
--kernel|-k) exec bash "$SCRIPT_DIR/scripts/optimize/kernel-params.sh" ;;
--tuned|-t) exec bash "$SCRIPT_DIR/scripts/optimize/tuned-profile.sh" ;;
--vram|-v) exec bash "$SCRIPT_DIR/scripts/optimize/vram-gtt.sh" ;;
--power|-p) exec bash "$SCRIPT_DIR/scripts/optimize/power-profile.sh" ;;
--verify) exec bash "$SCRIPT_DIR/scripts/optimize/verify.sh" ;;
--rollback) exec bash "$SCRIPT_DIR/scripts/optimize/rollback.sh" ;;
*)
echo "Usage: optimize [--all|--kernel|--tuned|--vram|--verify|--rollback]"
echo " --all Full optimization walkthrough (default)"
echo "Usage: optimize [--all|--kernel|--tuned|--vram|--power|--verify|--rollback]"
echo " --all Full optimization walkthrough (Phase 1 + 2)"
echo " --kernel Configure kernel boot parameters"
echo " --tuned Switch tuned profile"
echo " --vram BIOS VRAM + GTT guidance"
echo " --power Phase 2: ryzenadj, sysctl, THP, RADV"
echo " --verify Post-optimization checklist"
echo " --rollback Revert changes"
exit 1

bin/serve (new executable file)

@@ -0,0 +1,5 @@
#!/usr/bin/env bash
# Server dispatcher
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
exec bash "$SCRIPT_DIR/scripts/serve/launch.sh" "$@"


@@ -2,22 +2,39 @@
# Format: NAME|HF_REPO|FILE|SIZE_GB|CATEGORY|DESCRIPTION
#
# Categories: smoke, standard, moe, dense
# Download with: huggingface-cli download REPO FILE --local-dir /data/models/llms/REPO
# Download with: hf download REPO FILE --local-dir /data/models/llms/REPO
# ── Smoke tests (quick, small) ───────────────────────────
qwen3.5-0.8b-q8|unsloth/Qwen3.5-0.8B-GGUF|Qwen3.5-0.8B-Q8_0.gguf|0.8|smoke|Tiny, Q8 full precision
qwen2.5-0.5b-q8|lmstudio-community/Qwen2.5-0.5B-Instruct-GGUF|Qwen2.5-0.5B-Instruct-Q8_0.gguf|0.4|smoke|Tiny Qwen2.5, Q8
qwen3.5-0.8b-q8|unsloth/Qwen3.5-0.8B-GGUF|Qwen3.5-0.8B-Q8_0.gguf|0.8|smoke|Tiny Qwen3.5, Q8
qwen3.5-2b-q4|unsloth/Qwen3.5-2B-GGUF|Qwen3.5-2B-Q4_K_S.gguf|1.2|smoke|Small dense 2B
qwen3.5-4b-q4|unsloth/Qwen3.5-4B-GGUF|Qwen3.5-4B-Q4_K_S.gguf|2.5|smoke|Small dense 4B
# ── Standard dense models ────────────────────────────────
qwen3.5-9b-q4|unsloth/Qwen3.5-9B-GGUF|Qwen3.5-9B-Q4_K_S.gguf|5.1|dense|Dense 9B
gpt-oss-20b-mxfp4|lmstudio-community/gpt-oss-20b-GGUF|gpt-oss-20b-MXFP4.gguf|12|dense|GPT-OSS 20B MXFP4
glm-4.7-flash-q6|lmstudio-community/GLM-4.7-Flash-GGUF|GLM-4.7-Flash-Q6_K.gguf|23|dense|GLM 4.7 Flash Q6
glm-4.7-flash-q6|unsloth/GLM-4.7-Flash-GGUF|GLM-4.7-Flash-UD-Q6_K_XL.gguf|24|moe|GLM 4.7 Flash, UD Q6 (MoE 30B, 3B active)
# ── Qwen3.5-27B dense (download needed) ─────────────────
# ── Gemma 4 ────────────────────────────────────────────
gemma-4-26b-a4b-q6xl|unsloth/gemma-4-26B-A4B-it-GGUF|gemma-4-26B-A4B-it-UD-Q6_K_XL.gguf|22|moe|Gemma 4 MoE 26B, 4B active, UD Q6 XL
gemma-4-26b-a4b-q4s|unsloth/gemma-4-26B-A4B-it-GGUF|gemma-4-26B-A4B-it-UD-Q4_K_S.gguf|15|moe|Gemma 4 MoE 26B, 4B active, UD Q4
gemma-4-31b-q3xl|unsloth/gemma-4-31B-it-GGUF|gemma-4-31B-it-UD-Q3_K_XL.gguf|14|dense|Gemma 4 dense 31B, UD Q3 XL
# ── Qwen3.5-27B dense ──────────────────────────────────
qwen3.5-27b-q4|unsloth/Qwen3.5-27B-GGUF|Qwen3.5-27B-Q4_K_M.gguf|17|dense|Dense 27B, quality-first
qwen3.5-27b-opus-distill|Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF|Qwen3.5-27B.Q4_K_M.gguf|15|dense|Dense 27B, Claude Opus reasoning distilled v2
# ── MoE models (fast generation, best for 64GB) ─────────
qwen3.5-35b-a3b-q4|unsloth/Qwen3.5-35B-A3B-GGUF|Qwen3.5-35B-A3B-UD-Q4_K_L.gguf|19|moe|MoE 35B, 3B active, Unsloth dynamic
qwen3.5-35b-a3b-q4|unsloth/Qwen3.5-35B-A3B-GGUF|Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf|21|moe|MoE 35B, 3B active, Unsloth dynamic XL
qwen3.5-35b-a3b-q8|unsloth/Qwen3.5-35B-A3B-GGUF|Qwen3.5-35B-A3B-Q8_0.gguf|37|moe|MoE 35B Q8, near-full precision
nemotron-30b-a3b-q4|lmstudio-community/NVIDIA-Nemotron-3-Nano-30B-A3B-GGUF|NVIDIA-Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf|23|moe|Nemotron MoE 30B, 3B active
qwen3.5-35b-a3b-apex-compact|mudler/Qwen3.5-35B-A3B-Claude-Distilled-APEX-GGUF|Qwen3.5-35B-A3B-Claude-Distilled-APEX-I-Compact.gguf|17|moe|MoE 35B Claude-distilled APEX, I-Compact quant
nemotron-cascade2-q8|bartowski/nvidia_Nemotron-Cascade-2-30B-A3B-GGUF|nvidia_Nemotron-Cascade-2-30B-A3B-Q8_0.gguf|31|moe|Nemotron Cascade 2, Mamba-2 hybrid (replaces Nano)
# ── Coding models ─────────────────────────────────────────
qwen3-coder-30b-a3b-q6|unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF|Qwen3-Coder-30B-A3B-Instruct-UD-Q6_K_XL.gguf|26|moe|Agentic coding MoE, pure Transformer
qwen3-coder-next-q3|unsloth/Qwen3-Coder-Next-GGUF|Qwen3-Coder-Next-UD-Q3_K_XL.gguf|34|moe|80B MoE coder, >70% SWE-bench, hybrid DeltaNet
# ── Pruned MoE (REAP expert pruning) ─────────────────────
qwen3.5-122b-a10b-reap40-q4|0xSero/Qwen3.5-122B-A10B-REAP-40-GGUF|Qwen3.5-122B-A10B-REAP-40-Q4_K_M.gguf|46|moe|122B MoE pruned to 40 experts, 10B active, Q4_K_M
# ── Draft models (speculative decoding) ───────────────────
qwen3.5-0.8b-q8-draft|unsloth/Qwen3.5-0.8B-GGUF|Qwen3.5-0.8B-Q8_0.gguf|0.8|draft|Draft for Qwen3.5 speculative decoding
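A minimal way to consume this pipe-separated format (helper name and sample entry are illustrative, mirroring the header comment):

```shell
# Resolve NAME -> "HF_REPO FILE" from the catalog, skipping comment
# and ruler lines. Sample entry follows the documented field layout.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# NAME|HF_REPO|FILE|SIZE_GB|CATEGORY|DESCRIPTION
qwen3.5-9b-q4|unsloth/Qwen3.5-9B-GGUF|Qwen3.5-9B-Q4_K_S.gguf|5.1|dense|Dense 9B
EOF
lookup() { grep -v '^#' "$conf" | awk -F'|' -v n="$1" '$1 == n { print $2, $3 }'; }
lookup qwen3.5-9b-q4
# → unsloth/Qwen3.5-9B-GGUF Qwen3.5-9B-Q4_K_S.gguf
```

The output pairs directly with the download instruction above, e.g. `hf download $(lookup NAME) --local-dir ...`.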


@@ -0,0 +1,14 @@
[Unit]
Description=Apply RyzenAdj power limits for LLM inference
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/usr/local/bin/ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000 --apu-slow-limit=85000
RemainAfterExit=yes
# Re-apply after resume from sleep/hibernate (HP firmware resets limits)
ExecStartPost=/bin/sleep 2
[Install]
WantedBy=multi-user.target


@@ -0,0 +1,10 @@
[Unit]
Description=Re-apply RyzenAdj power limits after resume
After=suspend.target hibernate.target hybrid-sleep.target suspend-then-hibernate.target
[Service]
Type=oneshot
ExecStart=/usr/local/bin/ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000 --apu-slow-limit=85000
[Install]
WantedBy=suspend.target hibernate.target hybrid-sleep.target suspend-then-hibernate.target


@@ -0,0 +1,706 @@
# Agentic Coding Evaluation Landscape
Comprehensive research into the dimensions, benchmarks, and model performance for
evaluating LLMs in software engineering agent use cases. Research date: 2026-03-30.
---
## Table of Contents
1. [Evaluation Taxonomy](#1-evaluation-taxonomy)
2. [Dimension 1: Code Generation Accuracy](#2-code-generation-accuracy)
3. [Dimension 2: Code Editing / Patching](#3-code-editing--patching)
4. [Dimension 3: Tool Use / Function Calling](#4-tool-use--function-calling)
5. [Dimension 4: Multi-Step Planning](#5-multi-step-planning)
6. [Dimension 5: Debugging / Error Recovery](#6-debugging--error-recovery)
7. [Dimension 6: Repository Understanding](#7-repository-understanding)
8. [Dimension 7: Instruction Following](#8-instruction-following)
9. [Dimension 8: Long Context Utilization](#9-long-context-utilization)
10. [Dimension 9: Multi-Language Support](#10-multi-language-support)
11. [Dimension 10: Test Generation](#11-test-generation)
12. [Benchmark Suite Summary](#12-benchmark-suite-summary)
13. [Open-Weight Model Landscape for 64GB Systems](#13-open-weight-model-landscape-for-64gb-systems)
14. [Frontier vs. Open Model Gap](#14-frontier-vs-open-model-gap)
15. [Recommended Evaluation Stack](#15-recommended-evaluation-stack)
16. [Sources](#16-sources)
---
## 1. Evaluation Taxonomy
Recent survey work (CSLLM Survey, 2025; SE Agent Benchmark Survey, 2025) organizes
coding LLM evaluation along two orthogonal axes:
- **Capability dimension**: What is being measured (generation, editing, tool use,
planning, debugging, comprehension, instruction following, etc.)
- **Evaluation paradigm**: How it is measured (static benchmarks, execution-based
evaluation, agent-in-the-loop evaluation, human evaluation)
The field has moved decisively from static benchmarks (HumanEval, MBPP) toward
agent-in-the-loop evaluations (SWE-bench, Terminal-Bench, FeatureBench) that test
the full agentic loop: plan, act, observe, iterate. This shift matters because models
that score 95%+ on HumanEval can still score below 50% on realistic agentic tasks.
The ten dimensions below map to the capability axis. Each dimension lists the
benchmarks that best isolate it, though in practice most agentic benchmarks test
multiple dimensions simultaneously.
---
## 2. Code Generation Accuracy
**Definition**: Writing correct, complete code from natural-language specifications or
docstrings, measured by functional correctness (pass@k on test suites).
### Key Benchmarks
| Benchmark | Tasks | Languages | Metric | Notes |
|---|---|---|---|---|
| **HumanEval** (Chen et al., 2021) | 164 | Python | pass@k | Foundational but near-saturated; best models >95% |
| **HumanEval+** / **MBPP+** (EvalPlus, NeurIPS 2023) | 164 / 399 | Python | pass@k (80x more tests) | Catches false positives from HumanEval; ~10-15% score drops |
| **HumanEval Pro** / **MBPP Pro** (ACL 2025) | 164 / 399 | Python | pass@k on self-invoking tasks | Tests compositional reasoning; o1-mini drops from 96.2% to 76.2% |
| **BigCodeBench** (ICLR 2025) | 1,140 | Python (139 libs) | pass@1 | Multi-tool, cross-domain; best model (GPT-4o) ~60% Complete, <50% Instruct |
| **BigCodeBench-Hard** | 148 | Python | pass@1 | Hardest subset; human performance 97%, LLMs ~60% |
| **LiveCodeBench** (EMNLP 2025) | Rolling | Python | pass@k | Contamination-free: new problems added continuously from competitive programming |
### State of the Art
- **Frontier**: Claude Opus 4.5/4.6, GPT-5.2, Gemini 3.1 Pro all score >95% on
HumanEval, ~85% on HumanEval+, ~65% on BigCodeBench-Complete.
- **Open (64GB-feasible)**: Qwen3.5-27B-Q4 achieves ~80% on HumanEval+.
Qwen3-Coder-30B-A3B (3.3B active, ~18GB at Q4) is strong on BigCodeBench.
Qwen2.5-Coder-32B-Instruct matched GPT-4o on HumanEval when released.
### Key Insight
HumanEval is near-saturated and should no longer be used as a primary differentiator.
BigCodeBench and LiveCodeBench are the current gold standards for code generation
accuracy, as they test realistic multi-library tasks and resist contamination.
---
## 3. Code Editing / Patching
**Definition**: Modifying existing code correctly -- applying diffs, fixing bugs in
context, integrating new code into existing files -- rather than generating from scratch.
### Key Benchmarks
| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| **Aider Code Editing** | 133 | Edit Python files to solve Exercism problems | Tests edit format compliance + coding ability |
| **Aider Polyglot** | 225 | Edit code across 6 languages with error feedback | Two attempts per problem; measures edit+debug loop |
| **Diff-XYZ** (Oct 2025) | 3 tasks | Apply, anti-apply, generate diffs | Tests diff understanding in multiple formats |
| **EDIT-Bench** | Varied | Real-world instructed code edits | Repository-level editing tasks |
| **SWE-bench** (indirectly) | 2,294 | Generate patches that resolve GitHub issues | Requires generating correct unified diffs |
### Edit Format Considerations
Code editing performance depends heavily on the edit format used:
- **Search/Replace blocks** (Aider default): Most reliable for most models
- **Unified diff**: GPT-4 Turbo was "3x less lazy" with unified diffs (Aider blog)
- **V4A diff format**: OpenAI's recommended format (published with GPT-4.1, April 2025)
- **Whole-file rewrite**: Simpler but wasteful; works with weaker models
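As an illustration, an Aider-style search/replace block looks roughly like this (file path and code invented; exact framing varies by tool version):

```
path/to/file.py
<<<<<<< SEARCH
def total(items):
    return sum(items)
=======
def total(items):
    return sum(i.price for i in items)
>>>>>>> REPLACE
```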
Models that excel at generation can fail at editing because they struggle to produce
syntactically valid diffs or correctly locate the code to modify.
### State of the Art (Aider Polyglot, March 2026)
| Model | Score | Type |
|---|---|---|
| GPT-5 | 88.0% | Frontier |
| MiniMax M2.5 | 80.2% | Open |
| DeepSeek V3.2-Exp | 74.2% | Open |
| DeepSeek-R1-0528 | 71.4% | Open |
| GLM-4.5-FP8 | 66.0% | Open |
| Qwen3-Coder-480B | 61.8% | Open (too large for 64GB) |
| Qwen3-Coder-30B-A3B | ~55-60%* | Open (fits 64GB at Q4) |
*Estimated from quantized GGUF performance data; exact Aider Polyglot score for
the 30B-A3B variant not independently confirmed.
---
## 4. Tool Use / Function Calling
**Definition**: Correctly invoking APIs, tools, or MCP servers -- selecting the right
function, constructing valid arguments, parsing responses, deciding when NOT to call.
### Key Benchmarks
| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| **BFCL V4** (Berkeley) | Thousands | Function calling accuracy across formats | De facto standard; AST-based evaluation |
| **BFCL-v3** (via EvalScope) | Multi-turn | Stateful multi-step function calling | Tests memory and context management |
| **Nexus Function Calling** | Varied | Tool selection and invocation | Broader tool landscape |
| **IFEval-FC** (2025) | 500+ | Instruction following within function schemas | JSON schema constraint adherence |
| **tau-bench** | Varied | Tool-augmented task completion | End-to-end agent tool use |
### BFCL Key Findings
The Berkeley Function Calling Leaderboard reveals a critical split:
1. **Single-turn calls**: Most frontier models score >90% accuracy
2. **Multi-turn stateful calls**: Performance drops 20-40% even for top models
3. **Abstention**: Knowing when NOT to call a function remains a major weakness
4. **Long-horizon tool use**: Memory, dynamic decision-making, and context management
are open challenges
### State of the Art
- **Frontier**: Claude Opus 4.5/4.6, GPT-5.2 lead overall BFCL V4
- **Open**: Qwen3-Coder-480B is "comparable to Claude Sonnet 4 on Agentic Tool-Use"
(Qwen team). For 64GB-feasible models, Qwen3-Coder-30B-A3B has a specially
designed function call format and strong tool-use training.
Nemotron 3 Super (120B, 12B active) was explicitly trained for tool-use workflows.
### Relevance to MCP
MCP (Model Context Protocol) servers expose tools via JSON schemas -- exactly what
BFCL tests. A model's BFCL score is a reasonable proxy for MCP tool-use competence,
though MCP adds discovery and session management complexity not yet benchmarked.
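Concretely, a minimal tool declaration of the kind BFCL scores and MCP servers expose looks like this (names and schema illustrative):

```json
{
  "name": "read_file",
  "description": "Read a file from the repository",
  "inputSchema": {
    "type": "object",
    "properties": {
      "path": { "type": "string", "description": "Repo-relative path" }
    },
    "required": ["path"]
  }
}
```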
---
## 5. Multi-Step Planning
**Definition**: Breaking complex tasks into subtasks, maintaining coherent plans across
many steps, tracking progress, and adapting when plans fail.
### Key Benchmarks
| Benchmark | Tasks | Steps | What It Tests | Notes |
|---|---|---|---|---|
| **SWE-bench Verified** | 500 | 5-50+ | End-to-end issue resolution | Gold standard for agentic coding |
| **SWE-bench Pro** (Scale AI) | Harder | 10-100+ | More complex issues | Best model ~46% (vs 81% on Verified) |
| **FeatureBench** (Feb 2026) | 200 | Many | Complex feature development | Claude 4.5 Opus: only 11.0% (vs 74.4% SWE-bench) |
| **Snorkel Agentic Coding** | 100 | Multi-step, 4 tiers | Plan, track, execute, recover | Claude Opus 4.5: 58%, Gemini 3 Pro: 51.6% |
| **GAIA** (ICLR 2025) | 450 | Multi-step | General assistant planning | Near saturation (~90% top scores) |
| **Gaia2** (2026) | Varied | Async | Dynamic, asynchronous environments | Adds temporal constraints and agent collaboration |
| **Terminal-Bench 2.0** | 89 | Multi-step | Terminal workflow completion | Tests plan execution in CLI environments |
### Planning-Specific Insights
The gap between SWE-bench Verified (~81% frontier) and SWE-bench Pro (~46% frontier)
and FeatureBench (~11% frontier) reveals that multi-step planning degrades rapidly
with task complexity:
- **SWE-bench Verified**: Often requires 5-15 steps (find file, understand bug, edit,
test)
- **SWE-bench Pro**: Requires deeper reasoning about architecture and dependencies
- **FeatureBench**: Requires implementing features across multiple files with
architectural coherence over 50+ steps
This is the dimension where frontier models most decisively outperform open models,
though the gap is narrowing with agentic RL training (Qwen3-Coder, GLM-5).
### State of the Art (SWE-bench Verified, March 2026)
| Model | Score | Type | Notes |
|---|---|---|---|
| Claude Opus 4.5 | 80.9% | Frontier | Overall leader |
| Claude Opus 4.6 | 80.8% | Frontier | |
| Gemini 3.1 Pro | 80.6% | Frontier | |
| MiniMax M2.5 | 80.2% | Open | Best open model |
| GPT-5.2 | 80.0% | Frontier | |
| GLM-5 | 77.8% | Open | 744B MoE, 40B active |
| Kimi K2.5 | 76.8% | Open | |
| DeepSeek V3.2 | 73.0% | Open | |
| Qwen3-Coder-Next | 70.6% | Open | Only 3B active params |
| DeepSeek V3.1 | 66.0% | Open | |
| Nemotron 3 Super | 60.5% | Open | 120B, 12B active |
---
## 6. Debugging / Error Recovery
**Definition**: Handling test failures, reading error messages, diagnosing root causes,
and iterating toward a fix -- including recovering from the agent's own mistakes.
### Key Benchmarks
| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| **Terminal-Bench 2.0** (Stanford/Laude) | 89 | CLI debugging, error recovery, state mgmt | Gold standard for debugging evaluation |
| **Recovery-Bench** (Letta, 2025) | Varied | Recovery from corrupted states and error traces | Tests context pollution handling |
| **AgentErrorBench** (2025) | Varied | Error detection and debugging in trajectories | 24% improvement with AgentDebug method |
| **ReliabilityBench** (Jan 2026) | Varied | Consistency and fault recovery | Multi-dimensional reliability |
| **Aider Polyglot** (indirectly) | 225 | Two-attempt model with error feedback | Second attempt tests debug-from-feedback |
### Recovery-Bench Key Findings
Recovery-Bench (Letta) specifically evaluates a critical gap: even frontier models
"lack the ability to naturally recover from failed states." The benchmark creates
scenarios with:
- Erroneous files from previous attempts
- Corrupted reasoning traces in context
- Environment artifacts from failed edits
This is directly relevant to agentic coding loops where an agent makes a mistake
at step 15 of a 30-step task and must recover without starting over.
### Terminal-Bench 2.0 Key Findings
Terminal-Bench tests real terminal workflows: inspect environments, read/edit files,
run commands, recover from errors, and finish multi-step tasks. Error categories:
- **Execution errors**: Dominate for Claude Opus 4.5 and GPT-5.2
- **Coherence errors**: Less frequent but more damaging
- **Verification errors**: Failing to check that a fix actually worked
### State of the Art
Debugging/error recovery is one of the weakest dimensions for all models. No model
achieves >70% on Terminal-Bench 2.0 or Recovery-Bench as of March 2026. This is
a primary area where the frontier-open gap matters most for practical agentic use.
---
## 7. Repository Understanding
**Definition**: Navigating large codebases, understanding file structure, dependency
graphs, cross-file relationships, and architectural patterns.
### Key Benchmarks
| Benchmark | Tasks | Languages | What It Tests | Notes |
|---|---|---|---|---|
| **CrossCodeEval** (NeurIPS 2023) | Varied | Python, Java, TS, C# | Cross-file code completion | Requires understanding imports and dependencies |
| **RepoBench** | 3 tasks | Python | Retrieval, completion, pipeline | Tests codebase navigation |
| **RepoEval** | Varied | Python | Repository-level completion | 16 GitHub repositories |
| **RepoCod** (ACL 2025) | Varied | Multiple | Full repository code generation | "LLMs not yet ready" |
| **LoCoBench-Agent** (2025) | Varied | Multiple | Interactive repo exploration | Agent-based evaluation |
| **DependEval** | 3 tasks | Multiple | Dependency recognition, multi-file editing | Tests architectural understanding |
### Key Challenge
Repository understanding is difficult to isolate as a benchmark dimension because
it is a prerequisite for most agentic coding tasks. SWE-bench implicitly tests it
(you cannot fix a bug if you cannot find the relevant file), but does not score it
separately.
The most direct measures are:
1. **CrossCodeEval**: Do predictions improve when cross-file context is provided?
2. **RepoBench-R**: Can the model retrieve the right context from the repository?
3. **DependEval**: Can the model understand and modify dependency relationships?
### State of the Art
Models with longer context windows have an inherent advantage. The Qwen3-Coder family
was explicitly trained for "repository-scale understanding" with 256K native context
(extendable to 1M). GLM-5 uses DeepSeek Sparse Attention for 205K context.
For 64GB systems, Qwen3-Coder-30B-A3B and Qwen3-Coder-Next are the strongest choices
due to their long-context training and MoE efficiency.
---
## 8. Instruction Following
**Definition**: Following complex, multi-constraint instructions precisely --
formatting requirements, length constraints, keyword inclusion, structural rules.
### Key Benchmarks
| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| **IFEval** (Google, Nov 2023) | ~500 | 25 types of verifiable instructions | Format, length, keyword, structure constraints |
| **IFEval-Extended** (2024) | Dynamic | Generative instruction synthesis | Thousands of unique instructions from templates |
| **M-IFEval** (NAACL 2025) | Multi-lingual | French, Japanese, Spanish instruction following | Performance varies widely across languages |
| **IFEval-FC** (2025) | Varied | Instruction following in function call schemas | JSON schema constraint adherence |
| **AgentIF** (Tsinghua, 2025) | Varied | Agent-specific instruction following | Evaluates IF within agentic loops |
### Relevance to Agentic Coding
Instruction following is critical for agentic coding because:
1. **System prompts**: Agents receive detailed behavioral instructions (e.g., CLAUDE.md
conventions in this repo)
2. **Edit format compliance**: Models must produce output in exact formats (search/replace
blocks, unified diffs, JSON tool calls)
3. **Multi-constraint tasks**: "Fix the bug AND add a test AND update the docstring AND
follow the project's naming conventions"
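Such constraints are attractive precisely because they are machine-verifiable; a toy checker (constraint invented for illustration):

```shell
# Verify a response obeys "answer in exactly three bullet points".
resp='- plan
- act
- verify'
count=$(printf '%s\n' "$resp" | grep -c '^- ')
[ "$count" -eq 3 ] && echo "constraint satisfied"
# → constraint satisfied
```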
### State of the Art
IFEval is included in the Open LLM Leaderboard V2, making it one of the most widely
reported benchmarks. Frontier models score >90% on IFEval. Open models vary widely;
instruction-tuned variants of Qwen3.5, DeepSeek V3, and GLM-5 are competitive at >85%.
---
## 9. Long Context Utilization
**Definition**: Effectively using large context windows (32K-1M tokens) with code --
not just accepting long inputs, but actually using information from all parts.
### Key Benchmarks
| Benchmark | What It Tests | Notes |
|---|---|---|
| **RULER** (NVIDIA, COLM 2024) | Multi-needle retrieval, distractor handling | Most models degrade significantly beyond 32K |
| **Needle in a Haystack** (NIAH) | Single-fact retrieval in long context | Near-saturated for frontier models |
| **LoCoBench** (2025) | Long-context code completion and comprehension | Claude 3.5 Sonnet: 29% at short context, 3% at long |
| **LongCodeBench** (2025) | Long-context code tasks | Single-language, limited diversity |
| **LongBench** (ACL 2025) | General long-context evaluation | Reveals limitations of existing benchmarks |
### "Context Rot" Phenomenon
Research from Chroma (2025) documented "context rot": as input tokens increase,
LLM performance degrades even when the relevant information is present. This is
particularly acute for code tasks where:
- File A defines a class, file B imports it, file C tests it
- All three must be in context simultaneously
- Models must cross-reference across files, not just retrieve individual facts
### State of the Art
| Model | Native Context | Effective Context* | Notes |
|---|---|---|---|
| Nemotron 3 Super | 1M tokens | 91.75% accuracy at 1M | Best retention score |
| Qwen3-Coder-Next | 256K (1M w/ Yarn) | Good at 256K | Trained for repo-scale |
| GLM-5 | 205K | Good | DeepSeek Sparse Attention |
| DeepSeek V3.2 | 128K | Moderate | |
*"Effective context" means the model actually uses information at that distance,
not just accepts it without error.
For 64GB systems, context length is bounded by available memory. At Q4 quantization,
a 30B-A3B model can handle ~64K-128K tokens before running out of KV cache space
(depending on GQA configuration and batch size).
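The KV-cache arithmetic behind that bound can be sketched in a few lines. The layer count, KV-head count, and head dimension below are illustrative assumptions for a GQA MoE model, not the published Qwen3-Coder-30B-A3B configuration:

```python
# Back-of-envelope KV-cache sizing. Config values are illustrative
# assumptions, not any specific model's published architecture.

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache per token: K and V tensors, per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def max_context_tokens(budget_gib, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """How many tokens fit in a given KV-cache memory budget."""
    per_tok = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem)
    return int(budget_gib * 1024**3 // per_tok)

# Assumed GQA config: 48 layers, 4 KV heads, head dim 128.
per_tok = kv_bytes_per_token(48, 4, 128)          # 98,304 bytes/token at FP16
fp16_ctx = max_context_tokens(16, 48, 4, 128)     # ~174K tokens in a 16 GiB budget
q4_ctx = max_context_tokens(16, 48, 4, 128, 0.5)  # ~4x more with a 4-bit KV cache
print(per_tok, fp16_ctx, q4_ctx)
```

With fewer KV heads or a smaller memory budget the token ceiling drops quickly, which is where the ~64K-128K figure comes from.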
---
## 10. Multi-Language Support
**Definition**: Handling different programming languages correctly -- not just Python,
but also compiled languages, systems languages, and less common languages.
### Key Benchmarks
| Benchmark | Languages | What It Tests | Notes |
|---|---|---|---|
| **Aider Polyglot** | C++, Go, Java, JS, Python, Rust | Edit + debug in 6 languages | 225 Exercism exercises |
| **Multi-SWE-bench** (NeurIPS 2025) | Python, Java, TS, JS, Go, Rust, C, C++ | Issue resolution in 8 languages | 1,632 validated issues |
| **Multi-SWE-bench mini** | 8 languages | Lightweight version | 400 instances, reduced compute |
| **SWE-PolyBench** (Amazon) | Java, JS, TS, Python | Bug fixes, features, refactoring | 2,110 curated issues |
| **SWE-smith** | 9 languages | SWE-bench style across 42 repos | 300 curated tasks |
| **HumanEval-X** | Python, C++, Java, JS, Go | Cross-lingual code generation | Translation of HumanEval |
| **BigCodeBench** | Python (139 libs) | Multi-library Python | Tests library-specific knowledge |
### Multi-SWE-bench vs SWE-PolyBench
Two competing multilingual benchmarks emerged in 2025:
- **Multi-SWE-bench** (ByteDance): 1,632 issues, 8 languages, NeurIPS 2025
Datasets track. Also provides `mini` (400 instances) and `flash` (300 instances)
variants for reduced compute.
- **SWE-PolyBench** (Amazon): 2,110 issues, 4 languages, with a verified subset of
384 instances. Covers bug fixes, features, and refactoring.
### Language-Specific Performance Gaps
Open models show significant performance variation across languages:
- **Python**: Best-supported universally
- **JavaScript/TypeScript**: Second-best, strong ecosystem coverage
- **Rust, Go, C++**: Substantially weaker, especially for complex patterns
- **Low-resource languages** (Julia, Lua, Perl): StarCoder2-15B historically strong here
### State of the Art
Qwen3-Coder-Next achieves 62.8% on SWE-bench Multilingual. For 64GB-feasible models,
the Qwen3-Coder-30B-A3B benefits from Qwen's broad multilingual training data.
---
## 11. Test Generation
**Definition**: Writing tests, understanding test frameworks, achieving coverage,
generating meaningful assertions -- not just syntactically valid tests.
### Key Benchmarks
| Benchmark | Tasks | What It Tests | Notes |
|---|---|---|---|
| **TestEval** (2024) | 210 | LLM test case generation for LeetCode programs | Basic test generation ability |
| **ULT** (2025) | 3,909 | Unit test generation for complex functions | High cyclomatic complexity, leakage-free |
| **WebApp1K** (2025) | 1,000 | Test-driven development tasks | Tests serve as both prompt and verification |
| **CoverUp** (2024) | Varied | Coverage-guided test generation | Iterative LLM-guided coverage improvement |
### Current Performance
LLM-generated tests achieve on average:
- **41.32%** accuracy (tests pass and are meaningful)
- **45.10%** statement coverage
- **30.22%** branch coverage
- **40.21%** mutation score
These numbers are from a multi-model benchmark study (2025). CoverUp's iterative
approach achieves 80% line+branch coverage (vs 47% for CodaMosa), suggesting that
agentic test generation loops significantly outperform single-shot generation.
### Key Insight
Test generation is an area where agentic approaches (generate, run, check coverage,
iterate) dramatically outperform single-shot generation. This makes it particularly
suited to the iterative agent loop and a strong candidate for local model evaluation.
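The generate/run/measure/iterate loop can be sketched as follows. The "model" here is a stub that proposes one test per uncovered branch; real systems like CoverUp drive an LLM with actual coverage reports, so everything below is a stand-in:

```python
# Sketch of an iterative coverage-guided test-generation loop.
# `stub_generate_test` stands in for an LLM call; branches are just ids.

def measure_coverage(covered, total_branches):
    return len(covered) / total_branches

def stub_generate_test(uncovered):
    """Stand-in for an LLM: pretend to write a test hitting one branch."""
    return sorted(uncovered)[0]

def coverage_loop(total_branches, target=0.8, max_iters=20):
    covered = set()
    for _ in range(max_iters):
        if measure_coverage(covered, total_branches) >= target:
            break
        uncovered = set(range(total_branches)) - covered
        covered.add(stub_generate_test(uncovered))  # "run" the new test
    return measure_coverage(covered, total_branches)

print(coverage_loop(10))
```

The point of the sketch is the control flow: coverage feedback decides whether to keep generating, which single-shot generation cannot do.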
### State of the Art
Code agents were shown to be "state of the art software testers" when given an
iterative loop with coverage feedback (2024 paper). No single model dominates this
dimension; the scaffolding (coverage feedback, iteration) matters more than the
base model for test generation.
---
## 12. Benchmark Suite Summary
### Tier 1: Must-Run for Agentic Coding Evaluation
These are the most informative benchmarks for evaluating a model's fitness as a
coding agent:
| Benchmark | Primary Dimensions | Run Cost | Notes |
|---|---|---|---|
| **SWE-bench Verified** | Planning, editing, repo understanding | High (500 Docker envs) | Gold standard |
| **Aider Polyglot** | Editing, multi-lang, debugging | Medium (225 problems) | Best edit benchmark |
| **BigCodeBench** | Generation, multi-tool | Medium (1,140 tasks) | Best generation benchmark |
| **BFCL V4** | Tool use, function calling | Low-Medium | De facto tool-use standard |
| **Terminal-Bench 2.0** | Debugging, planning, error recovery | High (89 real envs) | Best debugging benchmark |
### Tier 2: Valuable Supplementary Benchmarks
| Benchmark | Primary Dimensions | Notes |
|---|---|---|
| **LiveCodeBench** | Generation (contamination-free) | Rolling benchmark |
| **IFEval** | Instruction following | Quick to run, widely reported |
| **Multi-SWE-bench mini** | Multi-language, planning | 400 instances, 8 languages |
| **EvalPlus (HumanEval+/MBPP+)** | Generation (rigorous) | Good baseline |
| **Recovery-Bench** | Error recovery | Novel and underexplored |
| **FeatureBench** | Complex planning | Very hard; differentiates top models |
### Tier 3: Niche or Near-Saturated
| Benchmark | Status | Notes |
|---|---|---|
| **HumanEval** | Near-saturated | >95% for frontier models; use EvalPlus instead |
| **MBPP** | Near-saturated | Use MBPP+ instead |
| **GAIA** | Near-saturation (~90%) | Good for general agents, less code-specific |
| **Needle-in-a-Haystack** | Saturated | Use RULER for long-context |
### Commonly Cited on Model Cards
When coding-focused models publish on Hugging Face, the most frequently cited
benchmarks (in rough order of frequency) are:
1. SWE-bench Verified (agentic coding standard)
2. HumanEval / HumanEval+ (code generation baseline)
3. MBPP / MBPP+ (code generation)
4. BigCodeBench (multi-tool generation)
5. Aider Polyglot (code editing, multi-language)
6. LiveCodeBench (contamination-free generation)
7. BFCL (function calling)
8. IFEval (instruction following)
9. Multi-SWE-bench (multilingual agentic)
---
## 13. Open-Weight Model Landscape for 64GB Systems
### Models Feasible on 64GB Unified Memory (Strix Halo)
Sorted by practical fitness for agentic coding tasks. "Active" = parameters active
per forward pass for MoE models.
| Model | Total / Active | GGUF Q4 Size | SWE-bench | Key Strength |
|---|---|---|---|---|
| **Qwen3-Coder-Next** | 80B / 3B | ~46GB (Q4) | 70.6% Verified | Best efficiency ratio; agentic RL training |
| **Qwen3-Coder-30B-A3B** | 30.5B / 3.3B | ~18GB (Q4) | ~55%* (est.) | Fits easily; native 256K context; function call format |
| **Qwen3.5-35B-A3B** | 35B / 3B | ~19GB (Q4) | N/A | General + coding; fast at 112 tok/s on RTX 3090 |
| **Nemotron 3 Super** | 120B / 12B | ~64GB (Q4) | 60.5% | 1M context; PinchBench 85.6%; hybrid Mamba-Transformer |
| **Qwen3.5-27B** | 27B / 27B (dense) | ~17GB (Q4) | ~55%* | Dense; 72.4% SWE-bench reported for Qwen3.5-27B |
| **DeepSeek V3.2** | 671B / 37B | Too large at Q4 | 73.0% | Requires >200GB; not feasible for 64GB |
| **GLM-5** | 744B / 40B | Too large at Q4 | 77.8% | Best open SWE-bench; not feasible for 64GB |
*Estimated; exact scores for quantized GGUF variants not independently benchmarked.
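The size column can be sanity-checked with a rule of thumb: Q4_K_M lands near 4.8 bits per weight once scales and mins are included. The 4.8 figure and the 10 GB headroom allowance are approximations, not exact format accounting:

```python
# Rough GGUF file-size estimate from parameter count and bits per weight.
# 4.8 bpw approximates Q4_K_M (4-bit weights plus scale/min metadata).

def gguf_size_gb(total_params_b, bits_per_weight=4.8):
    """Approximate on-disk size in GB for a quantized model."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

for name, params in [("30B-A3B", 30.5), ("80B MoE", 80.0), ("120B MoE", 120.0)]:
    size = gguf_size_gb(params)
    fits = size + 10 < 64   # leave ~10 GB for KV cache, OS, and runtime
    print(f"{name}: ~{size:.0f} GB at Q4_K_M, comfortable on 64 GB: {fits}")
```

This reproduces the ~18 GB and ~46 GB figures in the table and shows why the 120B-class models are borderline at best.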
### Recommended Configuration for 64GB Strix Halo
**Primary coding agent**: Qwen3-Coder-30B-A3B-Instruct (Q4_K_M, ~18GB)
- Fits with ample room for KV cache and context
- Specially designed function call format
- Native 256K context, extendable to 1M
- Strong agentic coding training
- Fast inference due to 3.3B active parameters
**Stretch option**: Qwen3-Coder-Next (Q4, ~46GB)
- Tighter fit but significantly stronger (70.6% SWE-bench Verified)
- 3B active parameters = good generation speed
- Leaves ~18GB for KV cache and system
**Dense alternative**: Qwen3.5-27B (Q4_K_M, ~17GB)
- When you need strong general + coding ability
- Dense model = more predictable behavior
- Good baseline for comparison
### Older Models: Still Relevant?
- **CodeLlama-34B** (Meta, 2023): Superseded by Qwen and DeepSeek families. Only
relevant for historical comparison or if specific fine-tunes are needed.
- **StarCoder2-15B** (ServiceNow/HF/NVIDIA, 2024): Outperformed CodeLlama-34B at half
the size. Still competitive for low-resource languages (Julia, Lua, Perl) but
otherwise superseded by Qwen3-Coder.
- **DeepSeek-Coder-V2-Lite-16B** (2024): Was competitive but now clearly behind
Qwen3-Coder-30B-A3B and Qwen3-Coder-Next.
---
## 14. Frontier vs. Open Model Gap
### Gap Analysis by Dimension (March 2026)
| Dimension | Frontier Best | Open Best (64GB) | Gap | Trend |
|---|---|---|---|---|
| Code Generation | ~98% HumanEval | ~85% HumanEval | Small | Closing rapidly |
| Code Editing | 88% Aider Polyglot | ~60% Aider Polyglot | Large | Closing (MoE helps) |
| Tool Use | >90% BFCL | ~80% BFCL | Moderate | Closing with dedicated training |
| Multi-Step Planning | 80.9% SWE-bench | 70.6% SWE-bench (Coder-Next) | Moderate | Narrowing with agentic RL |
| Debugging/Recovery | ~65% Terminal-Bench | ~45% Terminal-Bench* | Large | Widest persistent gap |
| Repo Understanding | Excellent | Good (long-context models) | Moderate | Closing with 256K+ contexts |
| Instruction Following | >90% IFEval | >85% IFEval | Small | Nearly closed |
| Long Context | 1M+ effective | 256K effective | Moderate | Hardware-limited for local |
| Multi-Language | 80%+ Multi-SWE | 62.8% Multi-SWE | Moderate | Improving with diverse training |
| Test Generation | ~50% coverage | ~40% coverage | Small | Scaffolding matters more |
*Estimated; Terminal-Bench scores not widely reported for 64GB-feasible open models.
### Key Observations
1. **Code generation is nearly solved** for simple tasks. The gap has shifted to
complex, multi-step, multi-file tasks.
2. **Debugging/error recovery is the widest gap** and the hardest to close. This is
where frontier models' larger parameter counts and RLHF refinement matter most.
3. **MoE architectures are the bridge** for 64GB systems. Models like Qwen3-Coder-Next
(80B total, 3B active) achieve SWE-bench scores comparable to models with 10-20x
more active parameters.
4. **Agentic RL training** (as used in Qwen3-Coder, GLM-5) is the primary driver of
open model improvement on planning and debugging dimensions.
5. **Scaffolding equalizes** many gaps. A well-designed agent scaffold (SWE-Agent,
OpenHands, Aider) can make a 30B model perform comparably to a raw 400B model.
---
## 15. Recommended Evaluation Stack
For evaluating models locally on the Strix Halo system, the following stack covers
all 10 dimensions using tools already referenced in this project's `docs/references.md`:
### Inspect AI (Primary Framework)
Inspect AI supports multiple benchmarks in a unified framework:
- HumanEval (code generation)
- BigCodeBench (multi-tool generation)
- BFCL (function calling / tool use)
- GAIA (multi-step planning)
- IFEval (instruction following)
Run against an OpenAI-compatible endpoint (ollama or llama.cpp server).
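The endpoint contract is the standard `/v1/chat/completions` shape. A minimal sketch of building such a request, assuming llama.cpp's default port 8080 and a placeholder model name (no request is actually sent here; wiring this into an eval harness is omitted):

```python
# Build the URL and JSON body for an OpenAI-compatible chat completion call,
# as exposed by `llama-server` or ollama. Host/port and model name are
# placeholder assumptions.

import json

def chat_request(base_url, model, prompt, temperature=0.0):
    """Return (url, json_body) for a /v1/chat/completions request."""
    url = f"{base_url.rstrip('/')}/v1/chat/completions"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,  # deterministic decoding for evals
    }
    return url, json.dumps(body)

url, body = chat_request("http://localhost:8080", "qwen3-coder-30b", "def add(a, b):")
print(url)
```

Because every tool in this stack speaks the same protocol, one running `llama-server` instance can back Inspect AI, EvalPlus, Aider, and BFCL without reconfiguration.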
### EvalPlus (Code Generation)
- HumanEval+ and MBPP+ with native ollama support
- More rigorous than base HumanEval/MBPP
- Already configured in this project's `scripts/agentic/` framework
### BigCodeBench (Multi-Tool Generation)
- 1,140 tasks across 139 libraries
- Already listed in `docs/references.md`
- Tests multi-library, cross-domain code generation
### Aider (Code Editing + Multi-Language)
- Built-in polyglot benchmark: 225 exercises across 6 languages
- Tests edit format compliance, multi-language support, debugging loop
- Can be run against any OpenAI-compatible endpoint
### BFCL (Tool Use)
- Install with `pip install bfcl-eval`
- Tests function calling accuracy
- Already listed in `docs/references.md`
### Practical Execution Order
1. **Quick smoke test**: EvalPlus (HumanEval+) -- ~30 min
2. **Generation depth**: BigCodeBench-Hard (148 tasks) -- ~2-4 hours
3. **Editing ability**: Aider polyglot benchmark -- ~4-6 hours
4. **Tool use**: BFCL eval -- ~1-2 hours
5. **Instruction following**: IFEval via Inspect AI -- ~1 hour
6. **Full agentic**: SWE-bench Verified (if Docker resources available) -- ~24+ hours
---
## 16. Sources
### Papers
- Chen et al. (2021). "Evaluating Large Language Models Trained on Code." arXiv:2107.03374. [HumanEval]
- Liu et al. (2023). "Is Your Code Generated by ChatGPT Really Correct?" NeurIPS 2023. [EvalPlus/HumanEval+]
- Jimenez et al. (2024). "SWE-bench: Can Language Models Resolve Real-world GitHub Issues?" ICLR 2024.
- Zhuo et al. (2024). "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls." ICLR 2025.
- Patil et al. (2025). "The Berkeley Function Calling Leaderboard (BFCL)." ICML 2025.
- Mialon et al. (2023). "GAIA: A Benchmark for General AI Assistants." ICLR 2024.
- Hsieh et al. (2024). "RULER: What's the Real Context Size of Your Long-Context Language Models?" COLM 2024.
- Zhou et al. (2023). "Instruction-Following Evaluation for Large Language Models." arXiv:2311.07911. [IFEval]
- Terminal-Bench team (2026). "Terminal-Bench: Benchmarking Agents on Hard CLI Tasks." Stanford/Laude Institute.
- FeatureBench (Feb 2026). "Benchmarking Agentic Coding for Complex Feature Development." arXiv:2602.10975.
- HumanEval Pro / MBPP Pro (ACL 2025). "Evaluating LLMs on Self-invoking Code Generation Task."
- Multi-SWE-bench (NeurIPS 2025). "A Multilingual Benchmark for Issue Resolving."
- SWE-PolyBench (Amazon, 2025). "A multi-language benchmark for repository level evaluation."
- Recovery-Bench (Letta, 2025). "Evaluating LLMs' Ability to Recover from Mistakes."
- Diff-XYZ (Oct 2025). "A Benchmark for Evaluating Diff Understanding."
### Leaderboards and Live Data
- SWE-bench Leaderboard: https://www.swebench.com/
- SWE-bench Verified Leaderboard: https://llm-stats.com/benchmarks/swe-bench-verified
- SWE-rebench Leaderboard: https://swe-rebench.com/
- Aider LLM Leaderboards: https://aider.chat/docs/leaderboards/
- BFCL V4 Leaderboard: https://gorilla.cs.berkeley.edu/leaderboard.html
- EvalPlus Leaderboard: https://evalplus.github.io/leaderboard.html
- BigCodeBench Leaderboard: https://huggingface.co/blog/leaderboard-bigcodebench
- Terminal-Bench Leaderboard: https://www.tbench.ai/
- Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
- Scale Labs SWE-bench Pro: https://labs.scale.com/leaderboard/swe_bench_pro_public
- Artificial Analysis Terminal-Bench: https://artificialanalysis.ai/evaluations/terminalbench-hard
### Model Documentation
- Qwen3-Coder: https://github.com/QwenLM/Qwen3-Coder
- Qwen3-Coder-Next: https://qwen.ai/blog?id=qwen3-coder-next
- Qwen3-Coder-30B-A3B GGUF: https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
- GLM-5: https://huggingface.co/zai-org/GLM-5
- Nemotron 3 Super: https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/
- DeepSeek V3 series: https://www.bentoml.com/blog/the-complete-guide-to-deepseek-models-from-v3-to-r1-and-beyond
### Tools and Frameworks
- Inspect AI: https://github.com/UKGovernmentBEIS/inspect_ai
- Inspect Evals catalog: https://inspect.aisi.org.uk/evals/
- EvalPlus: https://github.com/evalplus/evalplus
- BigCodeBench: https://github.com/bigcode-project/bigcodebench
- BFCL: https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard
- Aider: https://aider.chat/
- Aider Polyglot benchmark: https://github.com/Aider-AI/polyglot-benchmark
- LiveCodeBench: https://livecodebench.github.io/
- CoverUp (test generation): https://arxiv.org/html/2403.16218v3

---
The `-fa 1 -mmp 0 -ngl 99` flags are **mandatory** on Strix Halo to avoid crashes.
| `llama-vulkan-radv` | Mesa RADV | Vulkan | Most stable, recommended default |
| `llama-vulkan-amdvlk` | AMDVLK | Vulkan | Fastest when it works, 2GB buffer limit |
| `llama-rocm-6.4.4` | ROCm 6.4.4 | HIP | Proven stable |
| `llama-rocm-7.2.1` | ROCm 7.2.1 | HIP | Current stable (kernel 6.18.4+ patch) |
| `llama-rocm-7.2` | ROCm 7.2 | HIP | Deprecated -- use 7.2.1 |
| `llama-rocm7-nightlies` | ROCm 7 nightly | HIP | Experimental/development builds |
Containers are from [kyuz0/amd-strix-halo-toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes). Set up with `make benchmark-setup`.

---
# LLM Inference Optimization Landscape (March 2026)
## Scope
Comprehensive survey of cutting-edge LLM inference optimization techniques applicable
to a high-end AMD APU workstation: Ryzen AI MAX+ 395 (Strix Halo), Radeon 8060S
(gfx1151, RDNA 3.5), 64 GB unified LPDDR5X memory, 256 GB/s bandwidth. Covers
inference engines, quantization, attention, MoE optimization, memory bandwidth, OS-level
tuning, hardware features, and model-level techniques. Research current as of March 2026.
---
## Table of Contents
1. [Inference Engines and Backends](#1-inference-engines-and-backends)
2. [Advanced Quantization Techniques](#2-advanced-quantization-techniques)
3. [Attention Optimization](#3-attention-optimization)
4. [MoE-Specific Optimizations](#4-moe-specific-optimizations)
5. [Memory Bandwidth Optimization](#5-memory-bandwidth-optimization)
6. [OS and Runtime Techniques](#6-os-and-runtime-techniques)
7. [Emerging Hardware Features](#7-emerging-hardware-features)
8. [Model-Level Optimizations](#8-model-level-optimizations)
9. [Prioritized Recommendations for Strix Halo](#9-prioritized-recommendations-for-strix-halo)
10. [Sources](#10-sources)
---
## 1. Inference Engines and Backends
### 1.1 llama.cpp -- Still the Foundation
llama.cpp remains the dominant local inference engine. All major interfaces (Ollama,
LM Studio, GPT4All, KoboldCpp) use it under the hood. For Strix Halo specifically:
- **ROCm/HIP backend**: Build with `-DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151
-DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON`. The `ROCBLAS_USE_HIPBLASLT=1`
environment variable forces hipBLASLt kernels, which deliver the best throughput on
gfx1151.
- **Vulkan backend**: The RADV Mesa driver has seen active RDNA 3.5/4 optimization
in Mesa 25.x. In some benchmarks Vulkan outperforms ROCm for single-shot inference
and shorter contexts. HIP+WMMA+FlashAttention is fastest for long contexts (tg8192+).
- **UMA detection bug (issue #18159)**: llama.cpp's UMA detection can incorrectly
  limit available memory on AMD APUs with large TTM allocations. The `-mmp 0`
(disable mmap) flag is critical for ROCm on Strix Halo to avoid catastrophically
slow model loading.
- **Performance**: Llama-2-7B Q4_0 achieves ~1464 t/s prompt processing (pp512) and
~50 t/s token generation (tg128) on Strix Halo with ROCm.
- **Known regression**: A commit enabling WMMA-MMQ INT kernels for RDNA 3 introduced
significant prompt processing regression on gfx1151 with ROCm 7.x (issue #17917).
**Status**: Production-ready. Best single-engine choice for Strix Halo.
### 1.2 KTransformers -- CPU/GPU Hybrid MoE Specialist
KTransformers (SOSP 2025) is the most significant new engine for hybrid inference.
It was purpose-built for running large MoE models (DeepSeek-R1/V3) on systems with
limited GPU memory but abundant CPU memory.
- **AMX-optimized kernels**: Uses Intel AMX instructions for CPU-side expert
computation. For AMD Zen 5, it falls back to AVX-512, which is still substantially
faster than naive CPU inference.
- **Async CPU-GPU scheduling**: Overlaps CPU expert computation with GPU attention
computation, hiding CPU latency.
- **Performance**: 4.62-19.74x prefill speedup, 1.25-4.09x decode speedup vs
existing hybrid systems. SGLang + KTransformers achieves 220+ tok/s total
throughput on trillion-parameter MoE models.
- **Relevance to Strix Halo**: Moderate. KTransformers shines when GPU VRAM is
scarce (24 GB discrete) and CPU RAM is abundant (382 GB). On Strix Halo, all 64 GB
is accessible to the GPU, so the CPU offloading advantage is diminished. However,
for models exceeding 64 GB, KTransformers-style hybrid inference becomes relevant.
**Status**: Production. Most useful for models that exceed available VRAM.
### 1.3 PowerInfer / PowerInfer-2
PowerInfer-2 targets smartphones, achieving 11.68 t/s on Mixtral 47B (22x faster
than alternatives). It exploits MoE sparsity by predicting which experts will
activate and only loading those. The core technique -- hot/cold neuron partitioning
and GPU-resident hot neurons -- is architecturally interesting but the implementation
targets mobile SoCs with discrete memory hierarchies, not unified-memory APUs where
all memory is equally accessible to the GPU.
**Status**: Research. Techniques are partially subsumed by llama.cpp's own MoE
offloading improvements.
### 1.4 MLC-LLM
MLC-LLM compiles models via TVM to target multiple backends including ROCm, Vulkan,
Metal, and OpenCL. It was one of the first engines to make AMD GPUs competitive for
LLM inference (2023 blog post). The Vulkan backend provides a universal fallback
that works on any GPU.
**Status**: Active but niche. For Strix Halo, llama.cpp's native ROCm/Vulkan
backends are more mature and better optimized.
### 1.5 mistral.rs / candle / burn
Rust-based inference engines:
- **mistral.rs**: Built on Hugging Face's candle library. Supports GGUF, GPTQ,
ISQ (in-situ quantization). Has CUDA support but no ROCm backend.
- **candle**: Hugging Face's Rust ML framework. GPU support via CUDA; no ROCm.
- **burn**: Rust ML framework with multiple backends (WGPU, Vulkan, CUDA). The
WGPU/Vulkan path could theoretically work on AMD, but LLM inference support
is limited.
**Status**: Not viable for Strix Halo in 2026. No ROCm support, and the Vulkan
paths are less optimized than llama.cpp's.
### 1.6 BitNet.cpp
Microsoft's official 1-bit LLM inference framework. Achieves 6x faster inference
and 82% lower energy consumption. GPU kernel support was added May 2025 for NVIDIA
and Apple Silicon. No AMD GPU kernels yet. CPU-only mode works on any x86 system
and could be relevant for future 1-bit models, but the model ecosystem (BitNet b1.58
variants) remains small.
**Status**: Watch. No AMD GPU support. CPU path works but model selection is limited.
### 1.7 vLLM and SGLang
Both are production LLM serving frameworks with AMD ROCm support:
- **vLLM v0.16.0** (Feb 2026): ROCm is now a first-class platform. 93% of AMD
test groups passing. Native AITER FP8 kernels, fused LayerNorm/SiLU, optimized
Paged Attention. Extended bitsandbytes quantization to warp-size-32 GPUs (RDNA).
- **SGLang**: Supports ROCm. KTransformers integration for hybrid MoE inference.
Both are overkill for single-user local inference but become relevant for serving
multiple users or running agentic workloads with concurrent requests.
**Status**: Production for server workloads. Consider if running multi-user or
agentic eval pipelines.
### 1.8 ExLlamaV3 / EXL3
ExLlamaV3 introduces the EXL3 format (based on QTIP from Cornell RelaxML), achieving
excellent perplexity at extreme compression (Llama 3.3 70B at 1.75 bpw, 19 GB). The
Marlin-inspired GEMM kernels are highly optimized for NVIDIA GPUs. AMD ROCm support
was absent at launch (early 2025) and current status is uncertain.
**Status**: Watch. Potentially best-in-class quantization quality, but AMD support
is unclear.
---
## 2. Advanced Quantization Techniques
### 2.1 GGUF Quantization Landscape
GGUF remains the dominant format for local inference via llama.cpp. The key variants:
| Format | Bits | Method | Best For |
|-----------|------|-----------------|-----------------------------|
| Q8_0 | 8 | Round-to-nearest| Maximum quality, 2x compress|
| Q6_K | 6.5 | K-quant | High quality, 2.5x compress |
| Q5_K_M | 5.5 | K-quant+imatrix | Balanced quality/size |
| Q4_K_M | 4.5 | K-quant+imatrix | Default recommendation |
| Q3_K_M | 3.9 | K-quant+imatrix | Aggressive, still usable |
| IQ3_XXS | 3.06 | I-quant+imatrix | Extreme compression |
| IQ2_XXS | 2.06 | I-quant+imatrix | Near-minimum viable |
| IQ1_S | 1.56 | I-quant+imatrix | Experimental |
**imatrix (Importance Matrix)**: The single most impactful quality improvement for
sub-4-bit quantization. The importance matrix identifies which weights produce large
activations during inference and allocates more precision to them. For aggressive
quantization (<4 bits), imatrix is no longer optional -- it is essential.
**Recommendation**: Q4_K_M + imatrix for most use cases. Q3_K_M + imatrix when
fitting a larger model matters more than marginal quality.
### 2.2 Unsloth Dynamic 2.0
Unsloth Dynamic 2.0 (Feb 2026) represents the state-of-the-art in intelligent GGUF
quantization:
- **Per-layer adaptive quantization**: Each layer gets a custom quantization type
based on sensitivity analysis. The quantization scheme for Gemma 3 differs
significantly from Llama 4.
- **Universal MoE + dense support**: Dynamic 2.0 works on all architectures
(previously MoE-only).
- **Calibration dataset**: 1.5M+ token hand-curated dataset for improved
conversational quality.
- **Quality results**: Dynamic 3-bit DeepSeek V3.1 GGUF scores 75.6% on 5-shot
MMLU, surpassing many full-precision models.
- **KL Divergence tracking**: Every GGUF is benchmarked against the original model
on both perplexity and KL divergence.
**Relevance**: Directly applicable. Use Unsloth Dynamic 2.0 GGUFs when available
for any model. They consistently outperform standard k-quant GGUFs at the same
bit-width.
### 2.3 AQLM and QuIP#
Both target extreme compression (2-3 bits):
- **QuIP#** (ICML 2024): Uses randomized Hadamard transforms + E8 lattice codebooks.
First PTQ method where 3-bit outperforms theoretical lossless 4-bit. The E8
codebook fits in L1 cache, enabling inference speedups over FP16.
- **AQLM** v1.1.7 (April 2025): Additive quantization achieving Pareto optimality
below 3 bpw. Outperforms QuIP# on MoE models at 2-bit. Added arbitrary
8-dimensional codebooks on GPU.
Both require PyTorch/CUDA for dequantization kernels. Neither has native llama.cpp
integration or AMD support. They represent the theoretical frontier of what is
achievable at extreme compression but are not practical for Strix Halo today.
**Status**: Research. Watch for llama.cpp integration of QTIP (via ExLlamaV3/EXL3).
### 2.4 AWQ vs GPTQ vs GGUF on AMD
For AMD GPUs in the llama.cpp ecosystem:
- **GGUF**: The only practical choice. Native llama.cpp support with ROCm/Vulkan
acceleration. K-quants and I-quants are well-optimized.
- **AWQ/GPTQ**: Require Marlin kernels for competitive speed (741 tok/s with
Marlin-AWQ vs 67 tok/s without on NVIDIA). Marlin kernels are CUDA-only. On AMD,
these formats are accessible via vLLM or Hugging Face Transformers with ROCm, but
not through llama.cpp.
- **Performance hierarchy on AMD (via vLLM)**: GPTQ and AWQ with Marlin kernels are
fastest on NVIDIA; on AMD ROCm, the performance advantage over GGUF is minimal
and setup complexity is higher.
**Recommendation**: GGUF for llama.cpp on Strix Halo. AWQ/GPTQ only if using vLLM.
### 2.5 Mixed-Precision and Layer-Wise Quantization
Active research area with direct practical implications:
- **Attention vs FFN sensitivity**: Attention layers (QKV projections, output
projection) have varying sensitivity. FFN layers are often the largest component
and frequent targets for aggressive quantization (INT4).
- **Channel-Wise Mixed-Precision (CMPQ)**: Allocates quantization precision per
weight channel based on activation distributions. Adapts to any bit-width.
- **HOBBIT for MoE**: Maintains FP16 and INT4 versions of experts simultaneously.
Hot experts stay at FP16; cold experts use INT4 or even INT2. This concept is
partially implemented in Unsloth Dynamic 2.0's per-layer approach.
- **Fine-Grained Mixed Precision (FGMP)**: Goes below row-level granularity to
handle unstructured sensitivity patterns in both weights and activations.
**Relevance**: Unsloth Dynamic 2.0 already implements the practical version of
layer-wise mixed precision for GGUF. The research frontier is moving toward
sub-layer and channel-level mixed precision.
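The HOBBIT-style hot/cold split described above reduces to a ranking problem: keep the most frequently activated experts at high precision, drop the rest. The activation counts and the 25% "hot" fraction below are arbitrary illustration values:

```python
# Sketch of hot/cold expert precision assignment: top experts by activation
# frequency stay at FP16, the rest drop to INT4. All numbers are synthetic.

def assign_precision(activation_counts, hot_fraction=0.25):
    """Map expert id -> 'fp16' (hot) or 'int4' (cold)."""
    n_hot = max(1, int(len(activation_counts) * hot_fraction))
    ranked = sorted(activation_counts, key=activation_counts.get, reverse=True)
    hot = set(ranked[:n_hot])
    return {e: ("fp16" if e in hot else "int4") for e in activation_counts}

counts = {0: 900, 1: 40, 2: 10, 3: 870, 4: 55, 5: 12, 6: 700, 7: 8}
plan = assign_precision(counts)
print(plan)
```

In a real system the counts would come from routing statistics gathered over a calibration run, and the assignment would be revisited as the workload shifts.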
### 2.6 KV Cache Quantization
- **TurboQuant** (ICLR 2026): Being integrated into llama.cpp. TQ3 (3-bit) achieves
4.9x compression vs FP16 KV cache; TQ4 (4-bit) achieves 3.8x. This directly
reduces memory pressure for long-context inference.
- **llama.cpp native**: Already supports Q8_0 and Q4_0 KV cache quantization via
`--cache-type-k` and `--cache-type-v` flags.
**Relevance**: High. On a 64 GB system, KV cache can consume significant memory for
long contexts. Q4_0 KV cache is recommended; TurboQuant will push this further.
---
## 3. Attention Optimization
### 3.1 Flash Attention on AMD
Current status for RDNA 3.5 / gfx1151:
- **Triton backend**: Supports CDNA and RDNA GPUs with fp16, bf16, fp32. This is
the primary Flash Attention path for non-Instinct AMD GPUs.
- **PyTorch integration**: Since PyTorch 2.5.0+, `F.scaled_dot_product_attention`
automatically uses Flash Attention on RDNA cards via the Triton backend.
- **llama.cpp WMMA Flash Attention**: Enabled via `-DGGML_HIP_ROCWMMA_FATTN=ON`.
Uses RDNA 3.5's WMMA instructions for matrix multiply within the attention kernel.
This is the fastest path for long-context inference on Strix Halo.
- **CK (Composable Kernel) backend**: Supports MI200x, MI250x, MI300x, MI355x.
Not available for RDNA consumer GPUs.
**Gap**: Flash Attention 3 (with asynchronous pipelines and FP8 attention) is
NVIDIA Hopper-specific. No AMD equivalent exists.
### 3.2 SageAttention
SageAttention (ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight) achieves 2-5x speedup
over FlashAttention through quantized attention (8-bit Q/K matrices, FP16 values).
SageAttention3 further uses FP4 Tensor Cores on Blackwell GPUs.
**AMD status**: SageAttention's Triton implementation could theoretically work on
AMD GPUs, but no AMD-optimized kernels exist. The quantized attention concept is
sound and could be adapted.
**Status**: Watch. Would be high-impact if ported to AMD.
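The quantized-attention idea is simple enough to demonstrate end to end on tiny shapes: quantize Q and K to int8 with per-tensor scales, take integer dot products, and rescale once. Real SageAttention kernels use per-block scales and tensor cores; this pure-Python toy only shows that the numerics hold up:

```python
# Toy int8 Q/K attention scores: quantize to int8, dot in integers,
# rescale at the end. Tiny synthetic matrices; per-tensor scales only.

def quant_int8(mat):
    amax = max(abs(v) for row in mat for v in row) or 1.0
    scale = amax / 127.0
    q = [[round(v / scale) for v in row] for row in mat]
    return q, scale

def int8_scores(Q, K):
    qQ, sQ = quant_int8(Q)
    qK, sK = quant_int8(K)
    # integer dot products, one float rescale at the end
    return [[sum(a * b for a, b in zip(qrow, krow)) * sQ * sK
             for krow in qK] for qrow in qQ]

Q = [[0.1, -0.4], [0.7, 0.2]]
K = [[0.3, 0.5], [-0.2, 0.6]]
exact = [[sum(a * b for a, b in zip(qr, kr)) for kr in K] for qr in Q]
approx = int8_scores(Q, K)
err = max(abs(e - a) for er, ar in zip(exact, approx) for e, a in zip(er, ar))
print(err)
```

The quantization error stays small because Q/K dot products are robust to 8-bit rounding; keeping V at FP16 (as SageAttention does) is what protects the final output quality.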
### 3.3 Paged Attention
Paged Attention (vLLM) manages KV cache as non-contiguous memory pages, eliminating
60-80% of memory waste from fragmentation. llama.cpp's server mode implements a
simplified version of this for concurrent request handling, but the full PagedAttention
system is more mature in vLLM.
**Relevance**: Moderate for single-user. High for multi-user serving.
### 3.4 GQA/MQA Architecture Implications
Modern models (Llama 2/3, Mistral, Qwen) use Grouped Query Attention:
- GQA reduces KV cache by up to 90% vs MHA (Multi-Head Attention)
- 30-40% faster inference than MHA with near-equivalent accuracy
- Enables larger batch sizes due to smaller memory footprint
**Practical impact**: When choosing models for Strix Halo, prefer GQA models. All
modern model families (Llama 3, Qwen 3, Gemma 3, Mistral) use GQA. Avoid older MHA
models when alternatives exist.
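The cache saving is pure head-count arithmetic, since the KV cache scales with the number of KV heads. Head counts below are illustrative:

```python
# KV-cache reduction from GQA/MQA relative to full multi-head attention.
# The saving is just the KV-head to query-head ratio.

def kv_reduction(n_query_heads, n_kv_heads):
    """Fraction of the MHA KV cache eliminated by head grouping."""
    return 1 - n_kv_heads / n_query_heads

print(f"GQA 32Q/8KV: {kv_reduction(32, 8):.0%} smaller")  # 75%
print(f"GQA 32Q/4KV: {kv_reduction(32, 4):.0%} smaller")  # 88%
print(f"MQA 32Q/1KV: {kv_reduction(32, 1):.0%} smaller")  # 97%
```

The "up to 90%" figure in the list corresponds to aggressive grouping at the MQA end of this range.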
### 3.5 Ring Attention and Linear Attention
- **Ring Attention**: Distributes long sequences across multiple devices. Achieves
1M context prefill in 77s with 93% parallelization efficiency. Not applicable to
single-device Strix Halo.
- **Linear Attention**: Reduces KV cache from O(n) to O(1) and computation from
O(n^2) to O(n). The Ring-Linear models (hybrid softmax + linear attention) reduce
inference cost to 1/10 of dense models. This is a model architecture choice, not
a runtime optimization.
**Relevance**: Linear attention models would be transformative for long-context on
Strix Halo. Watch for Qwen, DeepSeek, or Llama variants with hybrid attention.
---
## 4. MoE-Specific Optimizations
### 4.1 Expert Offloading on Unified Memory
On discrete GPU systems, MoE inference involves expensive PCIe transfers of expert
weights between CPU RAM and GPU VRAM. On Strix Halo's unified memory, this bottleneck
is fundamentally different:
- All expert weights reside in the same physical memory accessible to both CPU and
GPU. There is no PCIe transfer cost.
- The bottleneck shifts to **memory bandwidth**: at 256 GB/s, loading a 2 GB expert
takes ~7.8 ms. With GGUF Q4 quantization, experts are 4x smaller, reducing this
to ~2 ms.
- **Implication**: Unified memory eliminates the offloading problem but does not
eliminate the bandwidth problem. The optimization focus should be on reducing the
number of expert weights that must be read per token.
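The expert-load figures above follow directly from bandwidth arithmetic; a quick check, using the 2 GB FP16 and ~0.5 GB Q4 sizes from the text:

```bash
# Time to stream one expert's weights at Strix Halo's 256 GB/s
bw_gbs=256
for size_gb in 2 0.5; do
  awk -v s="$size_gb" -v b="$bw_gbs" \
    'BEGIN { printf "%.1f GB expert @ %d GB/s: %.1f ms\n", s, b, s / b * 1000 }'
done
```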
### 4.2 Expert Caching and Prediction
The research frontier in 2025-2026 focuses on predicting which experts will be needed:
- **OD-MoE**: 99.94% expert activation prediction accuracy, delivering ~75% of
fully GPU-cached speed using 1/3 GPU memory.
- **MoE-SpeQ**: Uses a small draft model to predict expert sequences, enabling
prefetching. Combines speculative decoding with expert prediction.
- **SP-MoE**: First speculative-decoding-aware expert offloading framework. Achieves
1.07-3.5x TPOT speedup by exploiting structural correspondence between draft
and target models.
- **SliceMoE**: Dynamic Bit-Sliced Caching -- caches experts at sub-expert
granularity, assigning precision on demand.
- **FlashMoE**: ML-based cache replacement for SSD-based expert offloading on edge.
**Relevance for Strix Halo**: Expert caching is less critical when all experts fit
in memory, but expert prediction can still help by enabling **prefetching into L2/
Infinity Cache** before the expert is needed, reducing effective memory latency.
### 4.3 Expert Pruning
- Static pruning: Remove least-used experts entirely (MC-SMoE, EEP). Can reduce
active parameters by up to 96.875% (TSEP). Requires fine-tuning.
- Dynamic pruning: Skip experts below an activation threshold at inference time.
38.2% FLOPs reduction with 1.32x speedup (Li et al.).
- **DynMoE**: 9% FLOPs reduction, 1.37x speedup through dynamic gating.
**Relevance**: Moderate. Dynamic expert skipping could reduce memory bandwidth
requirements on Strix Halo, but requires model-specific configuration.
### 4.4 MoE Quantization -- Inactive Expert Compression
HOBBIT maintains multiple precision versions of experts: FP16 hot experts, INT4 cold
experts, INT2 for rarely-used experts. On unified memory, a variant of this approach
could keep the working set of experts at higher precision while storing rarely-activated
experts at aggressive quantization, reducing total memory footprint.
MoE-CSP achieves 26x speedup through 4-bit/8-bit quantization with custom CUDA
kernels. QMoE achieves 20x memory reduction but lacks efficient 1-bit kernel support.
**Practical approach for Strix Halo**: Use Unsloth Dynamic 2.0 GGUFs, which already
implement per-layer (including per-expert) precision allocation.
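The memory savings from HOBBIT-style mixed-precision expert storage can be estimated with simple arithmetic. The figures below (64 experts of 0.5 GB each at FP16, 8 kept hot at FP16, the rest at INT4) are illustrative assumptions, not measurements of any specific model:

```bash
# Footprint of uniform vs mixed-precision expert storage (INT4 = FP16 / 4)
awk 'BEGIN {
  experts = 64; size_fp16 = 0.5; hot = 8
  uniform = experts * size_fp16
  mixed   = hot * size_fp16 + (experts - hot) * size_fp16 / 4
  printf "uniform FP16: %.0f GB  mixed FP16/INT4: %.0f GB\n", uniform, mixed
}'
```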
---
## 5. Memory Bandwidth Optimization
### 5.1 The Fundamental Bottleneck
LLM inference (especially token generation / decode) is almost always memory-bandwidth
bound. On Strix Halo:
- **Available bandwidth**: 256 GB/s (LPDDR5X-8000, 256-bit bus)
- **Theoretical decode throughput** for a 7B Q4_0 model (~3.5 GB):
256 GB/s / 3.5 GB = ~73 tok/s (assuming 100% utilization)
- **Measured**: ~50 t/s (tg128), implying ~68% bandwidth utilization
- **Infinity Cache effect**: The 32 MB Infinity Cache acts as a bandwidth amplifier.
When working set fits in cache, effective bandwidth can exceed 256 GB/s. For LLM
inference, per-layer weights typically exceed 32 MB, so cache benefit is limited
to KV cache and activations.
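The roofline estimate above assumes every generated token reads the full weight set once; the numbers from the text reproduce as:

```bash
# Decode roofline: peak tok/s = bandwidth / model size; compare to measured
awk -v bw=256 -v model_gb=3.5 -v measured=50 'BEGIN {
  peak = bw / model_gb
  printf "peak: %.0f tok/s  measured: %d tok/s  utilization: %.0f%%\n",
         peak, measured, measured / peak * 100
}'
```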
### 5.2 Techniques to Reduce Bandwidth Requirements
| Technique | Bandwidth Reduction | Status on Strix Halo |
|----------------------------|--------------------|-----------------------|
| Lower quantization (Q4->Q3)| ~25% | Available now |
| KV cache quantization (Q4) | ~75% for KV reads | Available now |
| Speculative decoding | 2-3x effective | Available now |
| Expert prediction/caching | Variable (MoE) | Research |
| Weight compression (EXL3) | Up to 8x | No AMD support |
| Activation checkpointing | Reduces peak memory | Available |
### 5.3 Speculative Decoding
The most impactful bandwidth optimization technique available today:
- **Principle**: A small, fast "draft" model generates N candidate tokens. The large
"target" model verifies all N tokens in a single forward pass (batch). Accepted
tokens are "free" -- they required no additional bandwidth from the target model.
- **Speedup**: 2-3x without accuracy loss. NVIDIA demonstrates 3.6x on H200.
- **EAGLE-3**: Lightweight autoregressive head attached to target model internals.
No separate draft model needed.
- **TurboSpec**: Closed-loop control system that dynamically adjusts speculative
parameters based on online feedback.
- **MoE-SpeQ**: Combines speculative decoding with expert prefetching.
**Relevance**: High. Speculative decoding is the single highest-impact optimization
for decode throughput on bandwidth-limited systems like Strix Halo. llama.cpp
supports speculative decoding via `--model-draft`.
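A minimal sketch of such a setup with llama-server. The model filenames and draft-token bounds here are illustrative assumptions, not tuned values; the flags follow current llama.cpp conventions:

```bash
# -md / --model-draft: small draft model from the same family as the target.
# --draft-max / --draft-min bound the candidate tokens per verification round.
llama-server \
  -m  Qwen3-30B-A3B-UD-Q4_K_XL.gguf \
  -md Qwen3-0.6B-Q8_0.gguf \
  --draft-max 16 --draft-min 1 \
  -ngl 99 -c 32768 --flash-attn
```

Acceptance rate, and therefore realized speedup, depends heavily on how well the draft model's token distribution matches the target's.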
### 5.4 Prefetching Strategies
- **L2 cache prefetching**: Proactively load KV cache and next-layer weights into
GPU L2 during computation. Achieves 2.15x attention kernel speedup on NVIDIA H20.
- **PRESERVE**: Prefetch model weights from HBM to on-chip cache during communication
operations. Up to 1.6x end-to-end speedup.
- **Strix Halo consideration**: The 32 MB Infinity Cache + 2 MB L2 provides limited
on-chip storage. Prefetching activations and KV cache (which are smaller than
weights) into Infinity Cache during weight reads could help.
### 5.5 Batched Inference
Batching amortizes weight-read cost across multiple requests:
- Single request: ~68% bandwidth utilization on Strix Halo
- Batch of 4: Approaches compute-bound regime for prefill; still bandwidth-bound
for decode on most models
- **Continuous batching** (vLLM, llama.cpp server): 10-20x throughput improvement
over naive batching
**Trade-off**: Batching increases throughput but also increases per-request latency
and memory consumption (KV cache scales linearly with batch size).
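The linear KV cache scaling is worth quantifying before raising parallel slots. The figures below (128 KiB/token for a GQA model, 8K context per slot) are illustrative assumptions:

```bash
# KV cache memory grows linearly with batch size (one full context per slot)
kv_per_tok_kib=128 ctx=8192
for batch in 1 2 4; do
  echo "batch $batch: $((kv_per_tok_kib * ctx * batch / 1024 / 1024)) GiB KV cache"
done
```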
---
## 6. OS and Runtime Techniques
### 6.1 Memory Management
**Huge Pages**: Transparent Huge Pages (THP) reduce TLB misses for large model
weights. On Fedora 43, THP is enabled by default. For explicit control:
```bash
# Check current THP setting
cat /sys/kernel/mm/transparent_hugepage/enabled
# For llama.cpp, ensure THP is at least "madvise"
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```
For models loaded with mmap, THP automatically promotes 4 KB pages to 2 MB pages,
reducing page faults during inference.
**Memory Locking**: `mlock` prevents model weights from being swapped. llama.cpp's
`--mlock` flag enables this. Critical for systems running other workloads alongside
inference.
**mmap vs direct load**: On Strix Halo with ROCm, disabling mmap (`--no-mmap` for
llama-server/llama-cli, `-mmp 0` for llama-bench) is recommended. mmap causes
catastrophically slow model loading when GPU offloading is active because of the
double-copy path through the page cache.
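A hedged example of a typical ROCm launch combining these two settings (the model path is illustrative):

```bash
# Disable mmap (avoid the double-copy path) and lock weights in RAM
llama-server -m model.gguf -ngl 99 --no-mmap --mlock
```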
### 6.2 Process Pinning and NUMA
Strix Halo is a single-die APU, so NUMA topology is simple (typically 1 NUMA node).
However, CPU core affinity still matters:
```bash
# Pin inference to specific cores, keeping others free for OS
numactl --physcpubind=0-15 llama-server [args]
# Or via taskset
taskset -c 0-15 llama-server [args]
```
**Core isolation**: For minimum-jitter inference:
```bash
# Add to kernel cmdline
isolcpus=0-15 nohz_full=0-15 rcu_nocbs=0-15
```
This prevents the OS from scheduling unrelated tasks on inference cores.
### 6.3 CPU Frequency and Power
```bash
# Set performance governor for consistent throughput
sudo cpupower frequency-set -g performance
# Verify
cpupower frequency-info | grep "current CPU frequency"
```
### 6.4 cgroups v2 for Resource Isolation
Reserve memory and CPU for inference workloads:
```bash
# Enable controllers for child cgroups (written to the parent's
# subtree_control, not the child's)
echo "+memory +cpu +cpuset" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
# Create inference cgroup
sudo mkdir /sys/fs/cgroup/inference
# Reserve 56 GB for inference (leave 8 GB for system)
echo $((56 * 1024 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/inference/memory.min
# Pin CPU cores (requires the cpuset controller enabled above)
echo "0-15" | sudo tee /sys/fs/cgroup/inference/cpuset.cpus
# Move the current shell into the cgroup, then launch inference
echo $$ | sudo tee /sys/fs/cgroup/inference/cgroup.procs
llama-server [args]
```
### 6.5 io_uring for Model Loading
io_uring provides zero-copy, kernel-bypassing I/O that can accelerate initial model
loading. While llama.cpp does not natively use io_uring, the underlying mmap/read
path can benefit from io_uring-based file I/O when loading from NVMe:
- Eliminates context switch overhead during model load
- Enables true async I/O with completion ring buffers
- Most benefit when loading very large models (>32 GB) from storage
**Practical impact**: Minor for Strix Halo since model loading is a one-time cost,
and LPDDR5X bandwidth far exceeds NVMe read speeds.
### 6.6 eBPF-Based Performance Monitoring
eBPF enables zero-instrumentation monitoring of inference workloads:
```bash
# Monitor GPU DRM scheduler jobs (works with amdgpu driver)
sudo bpftrace -e 'tracepoint:gpu_scheduler:drm_sched_job { printf("GPU job: %s\n", str(args->name)); }'
# Track page faults during model loading
sudo bpftrace -e 'tracepoint:exceptions:page_fault_user { @[comm] = count(); }'
# Monitor context switches on inference cores
sudo bpftrace -e 'tracepoint:sched:sched_switch /cpu == 0/ { @[args->next_comm] = count(); }'
```
The eunomia project provides ready-made eBPF programs for AI workload monitoring.
---
## 7. Emerging Hardware Features
### 7.1 AMD XDNA NPU
The Ryzen AI MAX+ 395 includes an XDNA 2 NPU rated at 50 TOPS. Current status for
LLM inference:
- **Software stack**: AMD Ryzen AI Software supports ONNX model execution on the NPU.
AMD Quark provides quantization for NPU deployment (SmoothQuant, GPTQ, Quarot).
- **LLM capability**: The NPU can accelerate small models and specific operations
(attention heads, small expert networks) but cannot run full large LLMs.
- **Linux support**: Kernel 7.1 (expected 2026) brings significant XDNA upstreaming.
Current Linux support is limited compared to Windows.
- **Practical use**: The NPU could potentially handle a speculative decoding draft
model while the GPU runs the main model. This is not yet implemented in any
inference engine.
**Status**: Not viable for LLM inference in March 2026. Watch for Linux kernel 7.1
and llama.cpp NPU backend development.
### 7.2 RDNA 3.5 Matrix Cores (WMMA)
The Radeon 8060S (gfx1151) has the same WMMA instruction set as RDNA 3 (gfx11xx),
which is a generation behind RDNA 4 (gfx12xx):
**RDNA 3 / 3.5 (gfx1151) WMMA capabilities**:
- FP16/BF16: 512 FLOPS/clock/CU
- INT8: 1024 OPS/clock/CU
- 16x16 matrix dimensions
- Requires inter-lane data shuffling for chained operations
**RDNA 4 (gfx12xx) improvements over RDNA 3.5**:
- FP16/BF16: 1024 FLOPS/clock/CU (2x)
- INT8: 2048 OPS/clock/CU (2x)
- New FP8/BF8 formats at 4x the FP16 rate
- 4:2 structured sparsity support (effectively 2x more)
- No inter-lane shuffling needed for chained WMMA (major efficiency gain)
- New efficient matrix load instruction
**Current usage in llama.cpp**: WMMA is used for Flash Attention
(`GGML_HIP_ROCWMMA_FATTN`) and quantized matrix-multiplication (`MMQ`) kernels. The
ROCm 7.x regression for gfx1151 (issue #17917) specifically affects MMQ kernels.
### 7.3 Vulkan Cooperative Matrices
The `VK_KHR_cooperative_matrix` Vulkan extension was merged into the RADV driver
for RDNA 3+ hardware. This provides a portable API for matrix operations that maps
to WMMA hardware:
- Enables inference engines to use matrix cores through Vulkan instead of
vendor-specific ROCm/HIP APIs
- llama.cpp's Vulkan backend could leverage this for WMMA-accelerated matrix
operations
- Currently less optimized than native HIP/ROCm paths
**Status**: Available in Mesa 25.x. Watch for llama.cpp Vulkan backend improvements
using cooperative matrices.
### 7.4 Infinity Cache for Inference
Strix Halo has a 32 MB Infinity Cache (MALL -- Memory Attached Last Level):
- **Architecture**: L1 (256 KB/shader array) -> L2 (2 MB) -> Infinity Cache (32 MB)
-> LPDDR5X
- **Latency**: Slightly higher than discrete GPU Infinity Cache implementations
- **Hit rate**: Varies by workload. Graphics benchmarks show ~73% hit rate at peak.
- **LLM inference implications**: For a 7B Q4 model (~3.5 GB), per-layer weights
are ~70-140 MB, far exceeding the 32 MB cache. Benefit is limited to:
- KV cache for current context (fits well for shorter contexts)
- Activations and intermediate results
- Embedding layer (often accessed repeatedly)
- Small models/layers that fit entirely in cache
The Infinity Cache is most impactful as a bandwidth amplifier -- when inference
accesses exhibit temporal locality (same data accessed multiple times within a
short window), effective bandwidth exceeds the 256 GB/s DRAM limit.
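The per-layer estimate in the text follows from dividing model size by layer count; for the 7B Q4 example:

```bash
# Per-layer weight size vs the 32 MB Infinity Cache (7B Q4, ~3.5 GB, 32 layers)
model_mb=3500 layers=32 cache_mb=32
per_layer=$((model_mb / layers))
echo "~${per_layer} MB per layer vs ${cache_mb} MB cache"
```

Even a single layer exceeds the cache by ~3x, which is why the benefit accrues to KV cache and activations rather than weights.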
---
## 8. Model-Level Optimizations
### 8.1 Prompt Compression
- **LLMLingua / LLMLingua-2** (Microsoft): Compresses input prompts by removing
low-information tokens. 20x compression with 1.5 point performance drop.
1.7-5.7x end-to-end inference speedup. LLMLingua-2 is 3-6x faster than v1.
Integrated into LangChain and LlamaIndex.
- **500xCompressor**: Compresses contexts into a single special token. 6x-480x
compression. Adds only 0.25% parameters. More aggressive but less mature.
**Relevance**: High for RAG and agentic workloads where prompts are long. Reduces
both prefill time and KV cache memory.
### 8.2 Speculative Decoding (Model-Level)
Beyond the engine-level implementation described in Section 5.3:
- **Self-speculative decoding**: Model drafts its own tokens using early exit from
lower layers. No separate draft model needed.
- **EAGLE-3**: Autoregressive head on target model internals. Higher acceptance
rates than separate draft models.
- **Draft model latency > accuracy**: Research shows that draft model speed matters
more than its language modeling accuracy for overall throughput.
### 8.3 Mixture of Depths / Mixture of Recursions
- **Mixture of Depths (MoD)**: Dynamically allocates compute to tokens that need it.
2-3x inference speedup with minimal quality degradation. Implemented at training
time -- requires model architecture support.
- **Mixture of Recursions (MoR)** (NeurIPS 2025): Combines parameter sharing with
adaptive token-level compute. Lightweight routers assign different recursion depths
to individual tokens. 2x inference throughput with reduced KV cache sizes.
**Relevance**: These are model architecture choices, not runtime optimizations.
Watch for models trained with MoD/MoR architectures.
### 8.4 Structured Pruning
Post-training methods to permanently remove model components:
- **Width pruning**: Remove neurons, attention heads, or embedding channels. Better
accuracy retention than depth pruning.
- **Depth pruning**: Remove entire layers. More latency reduction per parameter
removed.
- **LLM-Pruner, SliceGPT, FLAP**: State-of-the-art structured pruning methods.
- **AMP**: Jointly prunes attention heads and MLP neurons.
- **NIRVANA** (2025): Structured pruning reimagined for LLM compression.
**Practical approach**: Structured pruning requires per-model effort and is generally
less practical than quantization for local inference. Exception: if a specific model
is too slow at a given quantization level, pruning the model first and then
quantizing can yield a better speed/quality trade-off.
### 8.5 Token Merging and Pruning
- **TokenSelect** (EMNLP 2025): Dynamic token-level KV cache selection for
efficient long-context inference and length extrapolation.
- **LightThinker**: Step-by-step compression of chain-of-thought reasoning.
- **Attention sparsity**: Twilight (NeurIPS 2025) uses hierarchical top-p pruning
for adaptive attention sparsity.
These techniques reduce the effective sequence length during inference, directly
reducing both compute and memory bandwidth requirements.
---
## 9. Prioritized Recommendations for Strix Halo
### Tier 1: Implement Now (High Impact, Available Today)
1. **Use Unsloth Dynamic 2.0 GGUFs** for all models. They provide the best
quality-per-bit through intelligent layer-wise quantization.
2. **Build llama.cpp with WMMA Flash Attention**: `-DGGML_HIP_ROCWMMA_FATTN=ON
-DGGML_HIP_UMA=ON`. Monitor issue #17917 for MMQ regression fix.
3. **Disable mmap for ROCm**: Always use `--no-mmap` (`-mmp 0` for llama-bench) to
   avoid the double-copy performance penalty.
4. **Enable KV cache quantization**: Use `--cache-type-k q4_0 --cache-type-v q4_0`
for long-context workloads. Watch for TurboQuant integration.
5. **Set ROCBLAS_USE_HIPBLASLT=1**: Forces the optimized hipBLASLt kernels.
6. **Speculative decoding for decode-heavy workloads**: Use `--model-draft` with a
small model from the same family.
7. **GPU performance governor and frequency pinning**: Ensures consistent throughput.
### Tier 2: Evaluate (Moderate Impact, Some Setup Required)
8. **LLMLingua-2 for agentic/RAG workloads**: Compress long prompts before inference.
3-6x prompt processing speedup.
9. **vLLM for multi-user serving**: If running concurrent inference requests
(e.g., agentic eval pipelines), vLLM's continuous batching and PagedAttention
provide 10-20x throughput improvement.
10. **cgroups v2 memory reservation**: Prevent the OS from reclaiming GPU-mapped
memory under memory pressure.
11. **Vulkan backend for short-context workloads**: Test whether the Vulkan/RADV
path is faster than ROCm for your specific model and context length.
12. **Process pinning** with `numactl` or `taskset` for reduced scheduling jitter.
### Tier 3: Watch and Prepare (High Potential, Not Ready)
13. **KTransformers for >64 GB models**: When running DeepSeek V3 or similar models
that exceed available memory.
14. **ExLlamaV3/EXL3 AMD support**: If AMD kernels arrive, EXL3's QTIP-based
quantization could significantly improve quality at extreme compression.
15. **XDNA NPU for draft model acceleration**: If/when llama.cpp adds NPU backend
support, the NPU could run the draft model for speculative decoding.
16. **SageAttention AMD port**: 2-5x attention speedup through quantized attention.
17. **Linear attention models**: Watch for hybrid softmax/linear attention models
from major labs that would dramatically improve long-context inference.
18. **Cooperative matrices in Vulkan**: As llama.cpp's Vulkan backend matures, this
provides a portable path to WMMA acceleration without ROCm dependency.
---
## 10. Sources
### Papers and Conference Proceedings
- Raposo et al., "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models," 2024. https://arxiv.org/abs/2404.02258
- Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints," ICML 2023. https://arxiv.org/abs/2305.13245
- Tseng et al., "QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks," ICML 2024. https://arxiv.org/abs/2402.04396
- Egiazarian et al., "AQLM: Extreme Compression of Large Language Models via Additive Quantization," ICLR 2025. https://arxiv.org/abs/2401.06118
- Chen et al., "KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models," SOSP 2025. https://dl.acm.org/doi/10.1145/3731569.3764843
- Min et al., "Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation," NeurIPS 2025. https://arxiv.org/abs/2507.10524
- Varadarajan et al., "Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures," 2025. https://arxiv.org/abs/2504.11750
- Zandieh et al., "TurboQuant: Extreme KV Cache Quantization," ICLR 2026. https://github.com/ggml-org/llama.cpp/discussions/20969
- Agrawal et al., "SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration," ICLR 2025. https://arxiv.org/abs/2410.02367
- Ye et al., "FlashInfer: Efficient and Customizable Attention Engine for LLM Serving," 2025. https://arxiv.org/abs/2501.01005
- Jiang et al., "LLMLingua: Compressing Prompts for Accelerated Inference," EMNLP 2023. https://arxiv.org/abs/2310.05736
- Li et al., "A Survey on Inference Optimization Techniques for Mixture of Experts Models," 2024. https://arxiv.org/abs/2412.14219
- Liu et al., "MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching," 2025. https://arxiv.org/abs/2511.14102
- Zhou et al., "SP-MoE: Speculative Decoding and Prefetching for Accelerating MoE Inference," 2025. https://arxiv.org/abs/2510.10302
- He et al., "SliceMoE: Bit-Sliced Expert Caching under Miss-Rate Constraints," 2025. https://arxiv.org/abs/2512.12990
- Jin et al., "OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference," 2025. https://arxiv.org/abs/2512.03927
### Documentation and Technical References
- AMD ROCm Strix Halo System Optimization: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/strixhalo.html
- AMD GPUOpen -- Using Matrix Cores of RDNA 4: https://gpuopen.com/learn/using_matrix_core_amd_rdna4/
- AMD GPUOpen -- Accelerating Generative AI on Radeon GPUs: https://gpuopen.com/learn/accelerating_generative_ai_on_amd_radeon_gpus/
- vLLM ROCm Blog: https://blog.vllm.ai/2026/02/27/rocm-attention-backend.html
- AMD ROCm vLLM Blog: https://rocm.blogs.amd.com/software-tools-optimization/vllm-omni/README.html
- AMD AI Inference on Ryzen AI NPU with Quark: https://www.amd.com/en/developer/resources/technical-articles/2025/ai-inference-acceleration-on-ryzen-ai-with-quark.html
- Chips and Cheese -- Evaluating Infinity Cache in Strix Halo: https://chipsandcheese.com/p/evaluating-the-infinity-cache-in
- Chips and Cheese -- RDNA 4 Architecture at Hot Chips 2025: https://chipsandcheese.com/p/amds-rdna4-gpu-architecture-at-hot
- Linux Kernel XDNA NPU Documentation: https://docs.kernel.org/accel/amdxdna/amdnpu.html
### Community Resources and Guides
- llama.cpp ROCm Performance Discussion: https://github.com/ggml-org/llama.cpp/discussions/15021
- llama.cpp Strix Halo UMA Detection Bug: https://github.com/ggml-org/llama.cpp/issues/18159
- llama.cpp Strix Halo Performance Regression: https://github.com/ggml-org/llama.cpp/issues/17917
- Strix Halo Wiki -- llama.cpp with ROCm: https://strixhalo.wiki/AI/llamacpp-with-ROCm
- Strix Halo Wiki -- Performance: https://strixhalo.wiki/AI/llamacpp-performance
- AMD Strix Halo Toolboxes: https://github.com/kyuz0/amd-strix-halo-toolboxes
- LLM Tracker -- AMD GPUs: https://llm-tracker.info/howto/AMD-GPUs
- LLM Tracker -- Strix Halo: https://llm-tracker.info/_TOORG/Strix-Halo
- Unsloth Dynamic 2.0 Documentation: https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs
- Unsloth Dynamic v2.0 Blog: https://unsloth.ai/blog/dynamic-v2
- KTransformers GitHub: https://github.com/kvcache-ai/ktransformers
- ExLlamaV3 GitHub: https://github.com/turboderp-org/exllamav3
- BitNet GitHub: https://github.com/microsoft/BitNet
- LLMLingua GitHub: https://github.com/microsoft/LLMLingua
- MoE Inference Awesome List: https://github.com/MoE-Inf/awesome-moe-inference
- Awesome LLM Inference: https://github.com/xlite-dev/Awesome-LLM-Inference
- Phoronix -- ROCm 7.1 vs Vulkan on AI PRO R9700: https://www.phoronix.com/review/rocm-71-llama-cpp-vulkan
- eunomia -- OS-Level LLM Inference Optimizations: https://eunomia.dev/blog/2025/02/18/os-level-challenges-in-llm-inference-and-optimizations/
- RADV Cooperative Matrix for RDNA4: https://www.phoronix.com/forums/forum/phoronix/latest-phoronix-articles/1524861-vulkan-cooperative-matrix-merged-for-rdna4-gpus-with-radv-dcc-support-inches-closer
- Kaitchup -- GGUF Quant Selection: https://kaitchup.substack.com/p/choosing-a-gguf-model-k-quants-i

# llama.cpp Runtime and Compilation Optimization for AMD RDNA 3.5 (gfx1151)
Comprehensive research into maximizing inference performance on AMD Strix Halo
(Ryzen AI MAX+ 395, Radeon 8060S gfx1151, 64 GB unified LPDDR5x-8000).
Researched March 2026.
---
## Scope
This document covers every known compilation flag, runtime parameter, environment
variable, and architectural optimization for llama.cpp targeting gfx1151 (RDNA 3.5)
with both ROCm/HIP and Vulkan backends on Fedora. It does not cover vLLM, ollama
internals, or non-llama.cpp inference engines except where their findings inform
llama.cpp optimization.
---
## Table of Contents
1. [Compilation Flags and Build Optimizations](#1-compilation-flags-and-build-optimizations)
2. [Runtime Flags and Environment Variables](#2-runtime-flags-and-environment-variables)
3. [Flash Attention and Attention Backends](#3-flash-attention-and-attention-backends)
4. [Quantization Strategies for Speed](#4-quantization-strategies-for-speed)
5. [Memory Layout and Caching](#5-memory-layout-and-caching)
6. [llama-server Specific Optimizations](#6-llama-server-specific-optimizations)
7. [Upcoming llama.cpp Features (2026)](#7-upcoming-llamacpp-features-2026)
8. [Recommended Configurations](#8-recommended-configurations)
9. [Sources](#9-sources)
---
## 1. Compilation Flags and Build Optimizations
### 1.1 GGML_HIP (ROCm) vs GGML_VULKAN: Which Backend to Build
Both backends are worth building. Neither is universally faster on gfx1151:
| Workload | Winner | Rationale |
|----------|--------|-----------|
| Token generation (short ctx) | Vulkan RADV | Lower driver overhead, mature kernel paths |
| Token generation (long ctx, 8K+) | ROCm + rocWMMA + FA | Maintains speed as context grows; uses less memory |
| Prompt processing (short ctx) | Mixed -- model-dependent | AMDVLK or ROCm hipBLASLt win on some shapes |
| Prompt processing (long ctx) | ROCm + rocWMMA-tuned | 96% speedup over untuned rocWMMA at 65K ctx |
| Memory efficiency at long ctx | ROCm + FA | Less memory than Vulkan equivalent |
Benchmark data (Qwen3-30B-A3B UD-Q4_K_XL, gfx1151, flash attention on):
| Backend | pp512 t/s | tg128 t/s | pp512@130K t/s | tg128@130K t/s |
|---------|-----------|-----------|----------------|----------------|
| Vulkan RADV | 755.14 | 85.11 | 17.24 | 12.54 |
| Vulkan AMDVLK | 741.60 | 81.79 | 10.75 | 3.51 |
| ROCm hipBLASLt | 651.93 | 63.95 | 40.35 | 4.97 |
| ROCm rocWMMA-tuned | 659.07 | 67.66 | 51.12 | 13.33 |
Key insight: RADV scales significantly better than AMDVLK for long contexts (3.6x
faster tg at 130K depth). ROCm with tuned rocWMMA provides the best long-context
prompt processing (3x faster than RADV at 130K).
### 1.2 ROCm Build Flags
**Minimal build:**
```bash
cmake -B build -S . \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS="gfx1151" \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```
**Optimized build (recommended):**
```bash
cmake -B build -S . \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS="gfx1151" \
-DGGML_HIP_ROCWMMA_FATTN=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```
#### Critical ROCm build flags:
| Flag | Effect | Recommendation |
|------|--------|----------------|
| `-DGGML_HIP=ON` | Enable HIP/ROCm backend | Required |
| `-DAMDGPU_TARGETS="gfx1151"` | Target Strix Halo GPU | Required -- do not use gfx1100 |
| `-DGGML_HIP_ROCWMMA_FATTN=ON` | Enable rocWMMA flash attention | Strongly recommended for pp |
| `-DGGML_HIP_GRAPHS=ON` | HIP graph kernel scheduling | Test -- may help reduce launch overhead |
| `-DGGML_HIP_NO_VMM=OFF` | Re-enable Virtual Memory Management | Default is disabled; test if needed |
#### Flags to be aware of but NOT set by default:
| Flag | Notes |
|------|-------|
| `-DGGML_HIP_UMA=ON` | Uses hipMallocManaged for UMA. **Avoid on Strix Halo** -- it uses fine-grained memory that is significantly slower. Standard hipMalloc + GTT expansion via kernel params is faster. |
| `-DGGML_CUDA_FORCE_CUBLAS_COMPUTE_16F` | Forces FP16 compute in hipBLAS. Documented for RDNA4 -- may help pp performance on gfx1151. **Test before deploying.** |
| `-DGGML_CUDA_FA_ALL_QUANTS=ON` | Compiles all KV cache quant type combinations for FA. **Works for CUDA kernels which HIP reuses via hipify.** Increases compilation time substantially. Enable if you need quantized KV cache with flash attention. |
#### ROCm version considerations:
- **ROCm 7.2**: Known rocWMMA compilation issue (ambiguous template specializations in `mfma_impl.hpp`). Fixed in later point releases or by disabling rocWMMA.
- **ROCm 7.0 RC / 7.1**: Generally work well with gfx1151.
- **ROCm 6.4.4**: Some users report better performance than 7.x for certain workloads. The gfx1151 rocBLAS kernel regression means hipBLASLt is essential.
- **Recommendation**: Use ROCm 7.2+ with rocWMMA patches, or pre-built toolbox containers that have been validated for gfx1151.
### 1.3 Vulkan Build Flags
```bash
cmake -B build -S . \
-DGGML_VULKAN=ON \
-DLLAMA_BUILD_SERVER=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```
The Vulkan build auto-detects cooperative matrix support (KHR_coopmat) at runtime.
gfx1151 with RADV reports `matrix cores: KHR_coopmat` in llama-bench logs.
#### Vulkan-specific considerations:
| Topic | Detail |
|-------|--------|
| **RADV vs AMDVLK** | RADV (Mesa) is recommended for gfx1151. Better long-context scaling, no 2GB buffer allocation limit. |
| **AMDVLK buffer limit** | AMDVLK caps single Vulkan allocations at ~2 GiB (`VkPhysicalDeviceLimits::maxMemoryAllocationSize`). RADV allows ~4 GiB. This causes OOM for models with large compute buffers. |
| **CoopMat1 vs CoopMat2** | gfx1151 supports KHR_coopmat (CoopMat1). CoopMat2 (`VK_NV_cooperative_matrix2`) is NVIDIA-only. This means Vulkan flash attention on AMD falls back to CPU -- use ROCm for GPU-accelerated FA. |
| **Shader compilation** | Building from source with `glslc` available enables cooperative matrix shader variants. Pre-built binaries may omit them. |
### 1.4 LTO and PGO
llama.cpp does not have built-in LTO/PGO support in its CMake configuration.
You can enable LTO manually:
```bash
cmake -B build -S . \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS="gfx1151" \
-DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON \
-DCMAKE_BUILD_TYPE=Release
```
Expected benefit: 2-5% improvement in CPU-bound paths. The GPU kernels are
compiled by the HIP/ROCm compiler and are not affected by host LTO.
PGO would require a two-pass build (instrument, profile, rebuild) and is not
commonly done for llama.cpp. The dominant bottleneck is GPU kernel performance
and memory bandwidth, not host-side code paths.
### 1.5 Compiler Tuning for ROCm
A known LLVM regression affects loop unrolling on RDNA. The following flag has
been reported to help:
```bash
-DCMAKE_CXX_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600"
```
This increases the unrolling threshold for local memory operations, which can
improve kernel performance for flash attention and matrix multiplication.
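Putting it together, a full configure line combining the optimized ROCm build from Section 1.2 with this workaround might look like the following (a sketch; verify each flag against your ROCm version):

```bash
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CXX_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600"
cmake --build build --config Release -j$(nproc)
```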
---
## 2. Runtime Flags and Environment Variables
### 2.1 ROCm Environment Variables
| Variable | Value | Effect |
|----------|-------|--------|
| `ROCBLAS_USE_HIPBLASLT=1` | **Critical** | Switches from rocBLAS tensile kernels to hipBLASLt. On gfx1151, default rocBLAS achieves only 5.76 TFLOPS (<9% efficiency). hipBLASLt achieves >60% efficiency. **This is a 2-7x improvement for prompt processing.** |
| `HSA_OVERRIDE_GFX_VERSION=11.5.1` | Set inside toolbox containers | Required for ROCm to recognize gfx1151. Set in container, not by host scripts. |
| `HSA_ENABLE_SDMA=0` | Optional | Disables SDMA engine. May help on some configurations, but generally not needed on Strix Halo with recent kernels. |
| `HIP_VISIBLE_DEVICES=0` | Optional | Select specific GPU device. Useful in multi-GPU or container setups. |
| `GPU_MAX_HEAP_SIZE=100` | Optional | Allow 100% of GPU memory for heap. Default may be lower. |
| `GPU_MAX_ALLOC_PERCENT=100` | Optional | Allow single allocation up to 100% of GPU memory. |
| `ROCR_VISIBLE_DEVICES=0` | Optional | HSA-level device visibility control. |
| `AMD_LOG_LEVEL=0` | Optional | Suppress AMD driver logging noise. |
**The single most impactful environment variable is `ROCBLAS_USE_HIPBLASLT=1`.**
Without it, ROCm pp512 on Llama-2-7B drops from 882 t/s to 348 t/s (~2.5x slower).
### 2.2 Vulkan Environment Variables
| Variable | Value | Effect |
|----------|-------|--------|
| `AMD_VULKAN_ICD=RADV` | Recommended | Force RADV driver (skip AMDVLK). |
| `RADV_PERFTEST=nogttspill` | **Important** | Fixes GTT memory spilling issues on RADV. Can resolve significant pp performance drops (especially with FA off). |
| `GGML_VK_VISIBLE_DEVICES=0` | Optional | Select Vulkan device index. |
| `GGML_VULKAN_DISABLE_F16=1` | Debugging | Force FP32 compute. Slower but useful for debugging precision issues. |
| `GGML_LOG_LEVEL=2` | Debugging | Verbose logging to verify coopmat detection. |
### 2.3 Thread Count (`-t` flag)
For GPU-dominant inference (all layers offloaded), the thread count has minimal
impact on throughput. The recommendation:
- **Single-user inference**: `-t 4` to `-t 8` (enough for tokenization/sampling overhead)
- **Server with parallel slots**: `-t` equal to physical core count (16 on the Ryzen AI MAX+ 395)
- **Hybrid CPU+GPU (partial offload)**: `-t` equal to number of physical cores
The Ryzen AI MAX+ 395 has 16 full Zen 5 cores (Strix Halo does not use Zen 5c
compact cores). For llama.cpp, using all 16 physical cores (`-t 16`) is a good
ceiling for CPU-involved workloads.
### 2.4 Batch Size Tuning (`-b` and `-ub`)
| Flag | Default | Role |
|------|---------|------|
| `-b` / `--batch-size` | 2048 | Logical batch size (application level) |
| `-ub` / `--ubatch-size` | 512 | Physical batch size (device level) |
Tuning guidance for gfx1151:
- **MoE models**: `-b 256` significantly improves pp512 (reported 70% improvement on Qwen3-30B-A3B)
- **Dense models**: Default `-b 2048` is generally fine
- **Long context**: `-ub 2048` can improve performance, but test against OOM
- **Ultra-long context**: Reduce `-ub` if memory allocation fails
The Vulkan backend blog post for Strix Halo recommends: `-c 32768 -b 4096 -ub 256`
for a good balance of performance and memory.
### 2.5 Memory-Mapped Loading (`-mmp` / `--no-mmap`)
**Critical finding for unified memory APUs:**
> When you load large models to the GPU, memory mapping can make loading moderately
> slower for Vulkan, and **catastrophically slower for ROCm**. You should always set
> `--mmap 0` or `--no-mmap` to improve model loading times on Strix Halo.
For `llama-bench`, use `-mmp 0`. For `llama-server`/`llama-cli`, use `--no-mmap`.
On Strix Halo, both "GPU memory" and "CPU memory" share the same physical LPDDR5x.
The difference is which pages are mapped for GPU access. GPU-mapped pages have full
bandwidth (~215 GB/s). CPU-accessed pages get approximately half (~84 GB/s for
CPU-to-GPU copies).
**Always use `-ngl 99` (or higher) to ensure all layers are on GPU memory.**
Even on a unified memory system, GPU memory paths provide 2x the bandwidth.
### 2.6 GPU Layer Offloading (`-ngl`)
For Strix Halo with 64GB unified memory:
- **Models < 50GB**: `-ngl 99` offloads everything. No tuning needed.
- **Models 50-60GB**: `-ngl 99` should still work with GTT expanded via kernel params.
- **Models > 60GB**: May need partial offload. Use `-ngl <N>` where N is tuned to
keep GPU memory under the GTT limit. Remaining layers run on CPU at ~1/2 bandwidth.
**Never let GPU spill to system RAM paths** -- performance will be worse than pure CPU.
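A back-of-envelope way to pick `N` for partial offload (a sketch, not project tooling; the uniform-layer-size and fixed-reserve assumptions are mine):

```python
# Choose the largest -ngl N whose GPU-resident layers fit under the
# GTT limit, leaving headroom for KV cache and activations.
def pick_ngl(model_gb: float, n_layers: int, gtt_gb: float,
             reserve_gb: float = 4.0) -> int:
    per_layer_gb = model_gb / n_layers   # assume uniform layer size
    budget_gb = gtt_gb - reserve_gb      # leave space for KV cache / compute
    return max(0, min(n_layers, int(budget_gb / per_layer_gb)))
```

For example, `pick_ngl(70.0, 80, 59.0)` suggests starting at `-ngl 62` for a 70 GB model against a 59 GiB GTT pool; treat the result as a starting point and adjust for the actual KV cache size at your target context.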
---
## 3. Flash Attention and Attention Backends
### 3.1 When to Enable Flash Attention
**Rule of thumb for gfx1151:**
| Backend | Flash Attention | Recommendation |
|---------|----------------|----------------|
| ROCm + rocWMMA | `-fa 1` | **Always enable.** 24% pp improvement, maintains tg speed, uses less memory. |
| ROCm without rocWMMA | `-fa 1` | Enable, but smaller improvement. |
| Vulkan RADV | `-fa 1` | **Enable for short contexts.** Minor improvement at pp512/tg128. At long contexts, Vulkan FA may degrade performance. |
| Vulkan AMDVLK | `-fa 1` | Similar to RADV. |
**Key caveat**: Vulkan flash attention on AMD uses CoopMat1 (KHR_coopmat), not
the more efficient CoopMat2 (NVIDIA-only). For AMD, ROCm + rocWMMA is the superior
FA path.
### 3.2 rocWMMA Flash Attention Performance
Benchmark on gfx1151 (Llama2-7B Q4_K_M):
| Configuration | pp512 t/s | tg128 t/s |
|---------------|-----------|-----------|
| HIP standard | 592.28 | 40.40 |
| HIP + hipBLASLt | 548.72 | 40.43 |
| HIP + rocWMMA + hipBLASLt | 1006.80 | 39.46 |
| HIP + rocWMMA (no hipBLASLt) | 899.73 | 39.45 |
rocWMMA provides ~70% improvement in prompt processing with flash attention.
Token generation is slightly slower (~2%) due to WMMA overhead at small batch.
### 3.3 The rocWMMA Long-Context Regression and Fix
The standard rocWMMA implementation has a **long-context decode regression**:
at 65K context, tg degrades by up to 57% compared to HIP-only baseline.
**The fix** (PR #16827, "rocm-wmma-tune" branch) implements:
1. **`__launch_bounds__(256, 2)`**: Ensures minimum 2 blocks per SM, improving occupancy
2. **Adaptive KQ stride**: Uses stride 128 when head dimension <= 128, reducing LDS footprint
3. **Selective WMMA usage**: WMMA only for prefill; decode reverts to VEC/TILE kernels
Results after fix (Llama 3.2 1B Q4_K_M on gfx1151):
- pp512 at 65K context: **96% faster** than untuned rocWMMA
- tg128 at 65K context: Matches HIP baseline (previously 57% degraded)
**Status**: This patch is available in `-rocwmma-improved` toolbox builds. It may
not be merged into upstream llama.cpp. Check Donato Capitella's toolboxes.
### 3.4 Vulkan Flash Attention Limitations on AMD
The Vulkan backend supports three FA paths:
| Path | Extension | AMD Support |
|------|-----------|-------------|
| FA_SCALAR | None | Yes (CPU fallback) |
| FA_COOPMAT1 | KHR_cooperative_matrix | Yes (gfx1151 reports support) |
| FA_COOPMAT2 | NV_cooperative_matrix2 | **No** (NVIDIA-only) |
FA_COOPMAT1 supports: f16, q4_0, q8_0, f32 KV cache types.
FA_COOPMAT2 additionally supports all quant types.
When Vulkan FA is enabled on AMD with RADV, it uses CoopMat1 for matrix operations.
This provides a modest improvement over scalar FA but is significantly less
efficient than ROCm + rocWMMA.
### 3.5 New Attention Models (GatedDeltaNet)
Models using GatedDeltaNet architecture (Qwen3.5-27B, Qwen3.5-35B-A3B) have
severe performance problems on gfx1151:
- **Vulkan**: No GATED_DELTA_NET compute shader exists; ops fall back to CPU
- **ROCm/HIP**: Kernel cross-compiles but suffers from register spilling (float s[S_v]
allocates up to 512 bytes per thread) and hipMemcpyWithStream bottleneck (92-95%
of decode time on models >15GB)
Result: Qwen3.5-27B runs at ~12 t/s on gfx1151 vs expected 50-80 t/s.
**Avoid GatedDeltaNet models on gfx1151 until kernel optimization lands.**
---
## 4. Quantization Strategies for Speed
### 4.1 Quantization Speed on RDNA 3.5
Token generation speed is dominated by memory bandwidth, not compute. Smaller
quantizations are faster because they reduce bytes-per-weight, allowing more
tokens per second within the ~215 GB/s bandwidth envelope.
Approximate throughput formula for decode (bandwidth-bound):
```
tg_tokens/s ≈ effective_bandwidth_GB/s / model_size_bytes * 1e9
```
For a 7B Q4_K_M model (~4.1 GB):
```
215 / 4.1 ≈ 52 t/s (theoretical max; practical ~50 t/s on gfx1151)
```
### 4.2 Quantization Type Comparison
| Quant | Bits/Weight | Quality | Speed (relative) | Notes |
|-------|------------|---------|-------------------|-------|
| Q4_0 | 4.0 | Low | Fastest | Legacy. Simple dequant. |
| Q4_K_M | 4.83 | Good | Very fast | K-quant with hierarchical blocks. Recommended default. |
| IQ4_XS | 4.25 | Good | Fast | Importance-weighted. Better quality/bit than Q4_K_M. |
| Q5_K_M | 5.69 | Very good | Fast | Sweet spot for quality-sensitive use. |
| Q6_K | 6.56 | Excellent | Moderate | Near-lossless quality. |
| Q8_0 | 8.0 | Near-perfect | Slower | ~1.7x the bytes of Q4_K_M; tg proportionally slower. |
| F16 | 16.0 | Perfect | Slowest | Reference baseline. |
**For RDNA 3.5 specifically**:
- **Q4_K_M** is the best general-purpose quantization. The K-quant family uses
hierarchical super-blocks (256 values) with per-sub-block scales, providing
better quality than Q4_0 at marginally higher dequant cost that is invisible
at the GPU level.
- **Q4_0** has the simplest dequant kernels and is marginally faster than Q4_K_M
on some GPU backends. However, the quality loss is significant. Use only for
smoke tests or when every t/s matters more than quality.
- **IQ4_XS** (importance-matrix quantized) offers better quality per bit than
Q4_K_M. Speed is similar. Requires an importance matrix file during quantization.
**Recommended over Q4_K_M when you control the quantization process.**
- **Q8_0** does NOT have special hardware-accelerated dequant on RDNA 3.5.
RDNA 3.5 lacks INT8 tensor core equivalents. Q8_0 performance relies on the
same FP16 compute paths, just with more memory bandwidth consumed.
### 4.3 Importance Matrix (imatrix) Quantization
imatrix quantization records how much each weight affects output quality, then
allocates more precision bits to important weights. This is essential for
sub-4-bit quantizations (IQ2_XS, IQ3_XXS, IQ4_XS) where standard K-quant
shows measurable degradation.
```bash
# Generate importance matrix (GPU-accelerated)
llama-imatrix -m model-f16.gguf -f calibration_data.txt -ngl 99 -o imatrix.dat
# Quantize with imatrix
llama-quantize --imatrix imatrix.dat model-f16.gguf model-iq4_xs.gguf IQ4_XS
```
**Speed impact**: None. imatrix affects quantization quality, not inference speed.
The dequantization kernels are identical regardless of whether imatrix was used.
### 4.4 Unsloth Dynamic (UD) Quantizations
Unsloth Dynamic 2.0 selectively quantizes different layers at different bit widths,
choosing the optimal quantization per layer based on sensitivity analysis.
**Speed impact**: Minimal to none. UD quants use the same dequant kernels as
standard GGUF quantizations. A UD-Q4_K_XL file runs at the same speed as a
standard Q4_K_M of the same total size.
**Quality impact**: Significantly better. UD consistently outperforms standard
quantizations in 5-shot MMLU and KL divergence metrics at the same total file size.
**Recommendation**: Prefer UD quants (e.g., `UD-Q4_K_XL`, `UD-Q4_K_M`) from
Unsloth when available. They are a free quality upgrade with no speed penalty.
---
## 5. Memory Layout and Caching
### 5.1 KV Cache Quantization
KV cache quantization reduces the memory footprint of the attention cache,
allowing larger context windows within the same memory budget.
| Cache Type | Memory vs F16 | Quality Impact | Recommendation |
|------------|--------------|----------------|----------------|
| f16 (default) | 1.0x | None | Baseline |
| q8_0 | 0.5x | Negligible (+0.002-0.05 ppl) | **Recommended for production** |
| q4_0 | 0.33x | Noticeable (+0.2-0.25 ppl) | Use when memory-constrained |
| q4_1 | 0.33x | Slightly better than q4_0 | Alternative to q4_0 |
| iq4_nl | 0.33x | Better than q4_0 | Best 4-bit KV option |
Usage:
```bash
llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0 ...
# or for llama-bench:
llama-bench -m model.gguf -ctk q8_0 -ctv q8_0 ...
```
**Performance impact**: Quantizing K cache slightly **improves** throughput
(less memory to read). Quantizing V cache may have a slight negative impact.
Overall performance impact is negligible for normal inference.
**Caveat with speculative decoding**: Using KV cache quantization with a draft
model causes a consistent ~16% performance drop. q4_0 KV with speculative
decoding causes massive acceptance rate drops. **Avoid KV quant if using
speculative decoding.**
### 5.2 mmap vs Full Load on Unified Memory
On Strix Halo's unified memory architecture:
- **`--no-mmap` is strongly recommended** for both ROCm and Vulkan.
- With mmap enabled, ROCm model loading is "catastrophically slower."
- Vulkan loading is "moderately slower" with mmap.
- Since CPU and GPU share physical RAM, there is no data copy when loading
to "GPU memory" -- it is just a page table update.
For `llama-bench`: Always use `-mmp 0`.
For `llama-server`/`llama-cli`: Always use `--no-mmap`.
### 5.3 Prompt Caching
llama-server supports two levels of prompt caching:
**1. Automatic KV cache reuse (`cache_prompt: true`)**:
Reuses KV cache from previous requests when prompts share a common prefix.
The server only reprocesses the suffix that differs.
**2. Host-memory prompt caching (`--cram N`)**:
Stores pre-computed prompt representations in system RAM.
- Reduces TTFT from ~4.2s to ~0.3s for cached requests (93% reduction)
- +6% token throughput (34 vs 32 t/s)
- Memory formula: `num_prefixes * avg_prefix_tokens * 8 bytes`
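The memory formula is cheap to sanity-check before picking a `--cram` size (a sketch applying the formula above; the prefix counts are illustrative):

```python
# Host-RAM cost of the prompt cache:
# num_prefixes * avg_prefix_tokens * 8 bytes
def cram_mb(num_prefixes: int, avg_prefix_tokens: int) -> float:
    return num_prefixes * avg_prefix_tokens * 8 / 1e6

# e.g. 32 distinct agentic system prompts of ~12K tokens each
budget = cram_mb(32, 12288)  # a few MB
```

By this formula even generous prefix sets cost only a few MB, so a `--cram` budget of 128-256 MB is comfortably oversized.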
Configuration:
```bash
# --cram 256: 256 MB of host RAM for the prompt cache
# q8_0 K/V cache halves attention-cache memory
llama-server -m model.gguf \
  --cram 256 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --no-mmap \
  -fa \
  -ngl 99
```
Best for:
- System prompts > 5K tokens
- Multi-user chatbots with shared context
- Agentic use with repeated tool-call prefixes
### 5.4 UMA Detection Bug (Issue #18159)
llama.cpp's UMA detection (from PR #17368, designed for NVIDIA DGX Spark)
incorrectly activates on AMD APUs when `prop.integrated=1`. It reads
`/proc/meminfo` instead of `hipMemGetInfo()`, severely underreporting available
GPU memory (e.g., reporting 27GB instead of 96GB).
**Workarounds**:
- Build without `GGML_CUDA_ENABLE_UNIFIED_MEMORY`
- Guard UMA detection with `!defined(GGML_USE_HIP)` (upstream fix pending)
- Use toolbox containers where this has been patched
### 5.5 KV Cache Placement on ROCm (Issue #18011)
On Strix Halo, the ROCm backend may dump KV cache into shared (CPU-accessible)
memory instead of GPU-mapped memory, causing performance degradation at high
context sizes. This is a known issue contributing to ROCm falling behind Vulkan
for tg at high contexts.
**Mitigation**: Use the rocWMMA-tuned branch which maintains better memory
placement, or use Vulkan RADV for workloads where this matters.
---
## 6. llama-server Specific Optimizations
### 6.1 Recommended Server Configuration
```bash
# -c 32768:  total context, shared across all slots
# -np 4:     parallel slots (adjust for your use)
# -b / -ub:  logical / physical batch sizes
# --jinja:   enable Jinja2 chat templates
# (continuous batching is enabled by default; no flag needed)
llama-server -m model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 \
  --no-mmap \
  -fa \
  -c 32768 \
  -np 4 \
  -b 2048 \
  -ub 512 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --jinja
```
### 6.2 Parallel Slot Configuration (`-np`)
| Use Case | Slots | Context per Slot | Total `-c` |
|----------|-------|-----------------|-----------|
| Single user chat | 1 | 32768 | 32768 |
| Agentic coding (Claude Code style) | 2-4 | 8192-16384 | 32768-65536 |
| Multi-user API | 4-8 | 4096-8192 | 32768-65536 |
| Eval harness | 1 | 32768+ | 32768+ |
Memory formula: each slot requires `context_size * 2 * n_layers * n_kv_heads * head_dim * bytes_per_kv_element`
(the factor 2 covers K and V; for GQA models the KV width `n_kv_heads * head_dim` is much smaller than the hidden dimension).
With q8_0 KV cache, this is roughly halved compared to f16.
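To size `-c` and `-np`, the per-slot cost can be evaluated directly (a sketch; the 32-layer, 8-KV-head, 128-dim shape is an assumed Llama-3-8B-like geometry, not a measured model):

```python
# Per-slot KV cache: K and V each store n_kv_heads * head_dim values
# per layer per token.
def kv_gib_per_slot(ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                    head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    return ctx * 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem / 2**30

f16_slot = kv_gib_per_slot(32768)                    # 4.0 GiB at f16
q8_slot = kv_gib_per_slot(32768, bytes_per_elem=1)   # 2.0 GiB at ~q8_0
```

Four q8_0 slots at 8K context each would then cost ~2 GiB total for this geometry, consistent with the agentic-coding row of the table.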
### 6.3 Continuous Batching
Enabled by default (`--cont-batching`). Allows the server to process multiple
requests simultaneously, interleaving prefill and decode operations.
For agentic workloads: One slot typically holds a large system prompt + conversation
context, while additional slots handle parallel tool calls. Configure with:
```bash
-np 4 -c 131072 # 4 slots, up to 32K context each
```
### 6.4 Prompt Caching for Agentic Use
For agentic coding tools that send the same system prompt repeatedly:
1. Use `cache_prompt: true` in API requests (reuses KV cache prefix)
2. Use `--system-prompt-file system.txt` for static system prompts (note: may be
removed in recent versions; verify with your build)
3. Use `--cram 128` to enable host-memory caching for prefix deduplication
### 6.5 Speculative Decoding
For token generation speedup with a draft model:
```bash
llama-server -m main-model.gguf \
--model-draft draft-model.gguf \
-ngl 99 \
--draft-max 8 \
--draft-min 1 \
--no-mmap \
-fa
```
**Caveat**: Do NOT combine speculative decoding with KV cache quantization.
The 16% performance drop and reduced acceptance rate negate the benefits.
---
## 7. Upcoming llama.cpp Features (2026)
### 7.1 Backend-Agnostic Tensor Parallelism (PR #19378)
Merged January 2026. Adds `--split-mode tensor` for splitting computation across
multiple GPUs via a new "meta" backend.
**Relevance to Strix Halo**: Limited. Single integrated GPU. However, for RPC
configurations with multiple Strix Halo nodes (Jeff Geerling's Beowulf cluster),
tensor parallelism could complement the existing layer-split approach.
Currently supports 1-2 GPUs with equal data split. `--tensor-split` has no effect yet.
### 7.2 TurboQuant KV Cache Compression (ICLR 2026)
Google's TurboQuant (Zandieh et al.) achieves 3-bit KV cache quantization with
no training and negligible quality loss:
| Format | MSE vs FP16 | Compression |
|--------|-------------|-------------|
| TQ3 (3-bit) | 0.034 | 4.9x |
| TQ4 (4-bit) | 0.009 | 3.8x |
**Timeline**: Open-source llama.cpp integration expected Q2-Q3 2026. A 6-phase
integration plan exists covering GGML type registration, KV cache paths, FA
integration, and CLI flags.
### 7.3 Vulkan Improvements
Active 2025-2026 developments:
- Mesa RADV optimizations for RDNA4 AI workloads (Rhys Perry/Valve patches)
- 13% pp improvement from CU mode optimization for LDS utilization
- BFloat16 Vulkan support (`VK_KHR_shader_bfloat16`) maturing in Mesa 25.x
- Partial offloading performance improvement for AMD (llama.cpp b8185, March 2026)
### 7.4 Flash Attention for Head Dimension 512
Pull request from March 2026 adds FA support for HD=512 in CUDA kernels.
This benefits models with larger head dimensions (some newer architectures).
The HIP path should inherit this via hipify.
### 7.5 ik_llama.cpp Fork Innovations
The `ik_llama.cpp` fork by ikawrakow introduces:
- Row-interleaved quant packing (better memory access patterns)
- Smart Expert Reduction for faster MoE inference
- Tensor overrides with regex patterns for hybrid GPU/CPU placement
- FlashMLA for DeepSeek models
**Caveat**: ik_llama.cpp only fully supports CPU and CUDA backends. ROCm/Vulkan
are not maintained. Not recommended for AMD gfx1151.
---
## 8. Recommended Configurations
### 8.1 For llama-bench (Benchmarking)
**ROCm backend:**
```bash
ROCBLAS_USE_HIPBLASLT=1 \
toolbox run -c llama-rocm-7.2 -- \
/path/to/llama-bench \
-m /path/to/model.gguf \
-ngl 99 -mmp 0 -fa 1 \
-p 512 -n 128 -r 5
```
**Vulkan backend:**
```bash
AMD_VULKAN_ICD=RADV \
RADV_PERFTEST=nogttspill \
toolbox run -c llama-vulkan -- \
/path/to/llama-bench \
-m /path/to/model.gguf \
-ngl 99 -mmp 0 -fa 1 \
-p 512 -n 128 -r 5
```
### 8.2 For llama-server (Production Agentic Use)
**ROCm (best for long context):**
```bash
ROCBLAS_USE_HIPBLASLT=1 \
llama-server -m model.gguf \
-ngl 99 --no-mmap -fa \
-c 65536 -np 4 \
-b 2048 -ub 512 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--cram 256 \
--jinja --cont-batching \
--host 0.0.0.0 --port 8080
```
**Vulkan RADV (best for single-user tg):**
```bash
AMD_VULKAN_ICD=RADV \
RADV_PERFTEST=nogttspill \
llama-server -m model.gguf \
-ngl 99 --no-mmap -fa \
-c 32768 -np 2 \
-b 4096 -ub 256 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--jinja --cont-batching \
--host 0.0.0.0 --port 8080
```
### 8.3 Decision Matrix
| Question | Answer |
|----------|--------|
| Which backend for benchmarking? | Both. ROCm and Vulkan have different strengths. |
| Which backend for daily chat? | Vulkan RADV for best tg speed. |
| Which backend for long-context agentic? | ROCm + rocWMMA-tuned for context resilience. |
| Which quantization? | Q4_K_M or UD-Q4_K_XL for speed; Q5_K_M for quality. |
| Enable flash attention? | Yes, always on ROCm. Yes on Vulkan for short contexts. |
| Use `--no-mmap`? | Always. |
| Set `ROCBLAS_USE_HIPBLASLT=1`? | Always for ROCm. |
| Set `RADV_PERFTEST=nogttspill`? | Always for Vulkan RADV. |
| KV cache quantization? | q8_0 for both K and V unless using speculative decoding. |
| Batch size for MoE? | `-b 256` (lower than default improves some MoE models). |
---
## 9. Sources
### GitHub Issues and Discussions
- [Performance of llama.cpp on AMD ROCm (HIP) - Discussion #15021](https://github.com/ggml-org/llama.cpp/discussions/15021)
- [Performance of llama.cpp with Vulkan - Discussion #10879](https://github.com/ggml-org/llama.cpp/discussions/10879)
- [HIP backend performs poorly on gfx1151 - Issue #13565](https://github.com/ggml-org/llama.cpp/issues/13565)
- [UMA detection incorrectly limits memory on AMD APUs - Issue #18159](https://github.com/ggml-org/llama.cpp/issues/18159)
- [ROCm model loading dumps KV cache to shared memory - Issue #18011](https://github.com/ggml-org/llama.cpp/issues/18011)
- [GATED_DELTA_NET underperformance on gfx1151 - Issue #20354](https://github.com/ggml-org/llama.cpp/issues/20354)
- [Under-Performance of ROCm 7.2 binaries - Issue #19984](https://github.com/ggml-org/llama.cpp/issues/19984)
- [ROCm 7.2 + rocWMMA compilation - Issue #19269](https://github.com/ggml-org/llama.cpp/issues/19269)
- [Building for gfx1151 - Issue #14734](https://github.com/ggml-org/llama.cpp/issues/14734)
- [AMDVLK 2GB buffer allocation limit - Issue #15054](https://github.com/ggml-org/llama.cpp/issues/15054)
- [Mastering Host-Memory Prompt Caching - Discussion #20574](https://github.com/ggml-org/llama.cpp/discussions/20574)
- [TurboQuant Extreme KV Cache Quantization - Discussion #20969](https://github.com/ggml-org/llama.cpp/discussions/20969)
- [Backend-agnostic tensor parallelism - PR #19378](https://github.com/ggml-org/llama.cpp/pull/19378)
- [Massively Improved rocWMMA Performance - PR #16827](https://github.com/ggml-org/llama.cpp/pull/16827)
- [rocWMMA for gfx1151 performance boost - lemonade-sdk Issue #7](https://github.com/lemonade-sdk/llamacpp-rocm/issues/7)
- [Increase llama.cpp performance on AI Max 395+ - geerlingguy Issue #5](https://github.com/geerlingguy/beowulf-ai-cluster/issues/5)
### Wiki and Community Resources
- [Strix Halo Wiki - llama.cpp Performance](https://strixhalo.wiki/AI/llamacpp-performance)
- [Strix Halo Wiki - llama.cpp with ROCm](https://strixhalo.wiki/AI/llamacpp-with-ROCm)
- [AMD Strix Halo Backend Benchmarks (Grid View)](https://kyuz0.github.io/amd-strix-halo-toolboxes/)
- [LLM Tracker - AMD Strix Halo GPU Performance](https://llm-tracker.info/AMD-Strix-Halo-(Ryzen-AI-Max+-395)-GPU-Performance)
- [Framework Community - Strix Halo GPU LLM Performance Tests](https://community.frame.work/t/amd-strix-halo-ryzen-ai-max-395-gpu-llm-performance-tests/72521)
- [Framework Community - Toolboxes for LLM inference on Strix Halo](https://community.frame.work/t/llama-cpp-vllm-toolboxes-for-llm-inference-on-strix-halo/74916)
### Articles and Blog Posts
- [Hardware Corner - Strix Halo LLM Optimization](https://www.hardware-corner.net/strix-halo-llm-optimization/)
- [Hardware Corner - RADV Vulkan Driver 13% Improvement](https://www.hardware-corner.net/llama-cpp-amd-radv-vulkan-driver-update/)
- [Phoronix - AMD ROCm 7.1 vs RADV Vulkan](https://www.phoronix.com/review/rocm-71-llama-cpp-vulkan)
- [Phoronix - Valve Developer RADV Improvement](https://www.phoronix.com/news/RADV-Valve-Boost-Llama.cpp)
- [Yifei's Blog - Strix Halo Matrix Cores with llama.cpp](https://blog.yifei.sg/jekyll/update/2025/08/27/building-llamacpp-strix-halo.html)
- [Strix Halo CUDA/HIP Testing Notes (lhl)](https://github.com/lhl/strix-halo-testing/blob/main/llama-cpp-cuda-hip.md)
### Official Documentation
- [ROCm - llama.cpp compatibility](https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/llama-cpp-compatibility.html)
- [ROCm - llama.cpp installation](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/llama-cpp-install.html)
- [ROCm Blog - Llama.cpp Meets Instinct](https://rocm.blogs.amd.com/ecosystems-and-partners/llama-cpp/README.html)
- [llama.cpp build documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md)
- [llama-server README](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md)
### Papers
- "Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct" (January 2026, arXiv:2601.14277)
- "TurboQuant: Redefining AI efficiency with extreme compression" (Zandieh et al., ICLR 2026)
- [Unsloth Dynamic 2.0 GGUFs Documentation](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)
---
## Open Questions / Limitations
1. **rocWMMA-tuned patch upstream status**: PR #16827 may not be fully merged.
Monitor for inclusion in mainline llama.cpp or continue using patched toolboxes.
2. **ROCm 7.2 stability on gfx1151**: Multiple reports of crashes (MUT_MAL errors),
performance regressions, and compilation issues. ROCm 7.x is maturing but
not fully stable for gfx1151 as of March 2026.
3. **Vulkan CoopMat FA for AMD**: Will AMD ever get CoopMat2 support? The current
CoopMat1 path provides modest improvement. A native AMD CoopMat2 or equivalent
extension would close the gap with ROCm FA.
4. **KV cache placement on ROCm**: Issue #18011 (KV cache dumped to shared memory)
reduces ROCm tg performance at high contexts. Root cause appears to be in
HIP memory allocation behavior on APUs.
5. **GGML_HIP_UMA vs kernel-param GTT expansion**: The UMA flag uses slow
fine-grained memory. GTT expansion via `amdgpu.gttsize` kernel params provides
coarse-grained GPU-mapped memory that is much faster. The upstream approach
may eventually improve, but kernel params remain the correct method for now.
6. **GatedDeltaNet architecture support**: Both Vulkan (missing shader) and ROCm
(register pressure, memcpy bottleneck) perform poorly on GDN models. This
blocks efficient use of Qwen3.5-27B and similar models.
7. **TurboQuant integration timeline**: Expected Q2-Q3 2026 for llama.cpp.
Would provide 3-bit KV cache with no quality loss, roughly doubling available
context within the same memory budget.
8. **NPU utilization**: The 50 TOPS NPU on Strix Halo is currently Linux-unusable
for llama.cpp inference. AMD driver support for NPU on Linux remains pending.
---
## Overlap Notes
- **Kernel parameters** (`amdgpu.gttsize`, `ttm.pages_limit`, `iommu=pt`):
Already documented in the project's `scripts/optimize/kernel-params.sh`.
This research covers the llama.cpp side (why they matter for inference).
- **BIOS VRAM allocation**: Reducing dedicated VRAM in BIOS frees more memory
for GTT. This is documented in the project's audit scripts but is a prerequisite
for the optimizations described here.
- **Toolbox container builds**: The project uses pre-built toolboxes
(`llama-rocm-7.2`, `llama-vulkan`). The compilation flags documented here
describe what should be baked into those containers.

---
**`docs/optimization-log.md`** (new file, 246 lines)
# Optimization Log
Living document tracking what was applied, tested, and the actual results. Each entry records the change, benchmark evidence, and verdict.
**Verdicts**: KEEP (applied permanently), REVERTED (tested, didn't help), PENDING (not yet tested), BLOCKED (can't test yet).
---
## Phase 1: Core System
### 1.1 Tuned Profile: accelerator-performance
- **Date**: 2026-03-26
- **Change**: `sudo tuned-adm profile accelerator-performance`
- **Benchmark**: `data/benchmarks/after-tuned-*`
- **Result**: +5-8% pp improvement, +2-3% tg improvement
- **Verdict**: KEEP
### 1.2 Kernel Boot Parameters
- **Date**: 2026-03-26
- **Change**: `iommu=pt amdgpu.gttsize=60416 ttm.pages_limit=15466496`
- **Benchmark**: `data/benchmarks/full-opt-all-models-*`
- **Result**: Combined with BIOS VRAM change. Large models now fit in GTT. Peak usage 38.8/59 GiB.
- **Verdict**: KEEP
### 1.3 BIOS VRAM Reduction (512 MB)
- **Date**: 2026-03-26
- **Change**: UMA Frame Buffer Size 32 GB -> 512 MB (HP ZBook F10 BIOS)
- **Benchmark**: `data/benchmarks/full-opt-all-models-*`
- **Result**: 31.5 GB freed for OS/GTT. Small models ~3-8% slower (GTT indirection vs dedicated VRAM), but system gained ability to run 37 GB+ models at 32K+ context. Net positive.
- **Trade-off**: Small model regression is acceptable given the massive capability gain.
- **Verdict**: KEEP
---
## Phase 2: System Tuning
### 2.1 RyzenAdj PPT Increase
- **Date**: 2026-03-30
- **Change**: `sudo ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000 --apu-slow-limit=85000`
- **Result**: STAPM raised from 59W→81W. PPT Fast raised to 81W. **However, PPT SLOW and APU SLOW stuck at 70W** — HP ZBook BIOS EC overrides these limits. Effective sustained power: ~70W (was ~59W).
- **Benchmark**: `data/benchmarks/qwen35-shootout-v2-*` (Vulkan, q4_0 KV, pp2048/tg1024)
- UD-Q4_K_L: **57.0 t/s** (was ~39 t/s before RyzenAdj = **+46%**)
- UD-Q4_K_XL: **56.4 t/s**
- Q8_0: **51.4 t/s** (was ~39-41 t/s before = **+25%**)
- **Thermals**: 70-73C under load, 30C headroom. Cooling handles it easily.
- **Notes**: Settings are volatile (reset on reboot/sleep). Use `sudo make optimize-power` or install systemd service for persistence. HP firmware hard-caps slow PPT at 70W regardless.
- **Verdict**: KEEP — significant real-world improvement despite HP firmware limit
### 2.2 VM Sysctl Tuning
- **Date**: 2026-03-30
- **Change**: `vm.swappiness=1, vm.dirty_ratio=40, vm.dirty_background_ratio=10, vm.max_map_count=500000, vm.zone_reclaim_mode=0`
- **Applied via**: `sudo make optimize-power` (persists to `/etc/sysctl.d/99-llm-inference.conf`)
- **Notes**: Hard to isolate impact — applied together with other Phase 2 changes. Prevents model weight eviction and I/O disruption.
- **Verdict**: KEEP — low risk, persists across reboots
### 2.3 Transparent Huge Pages
- **Date**: 2026-03-30
- **Change**: `echo always > /sys/kernel/mm/transparent_hugepage/enabled`
- **Applied via**: `sudo make optimize-power` (volatile — add `transparent_hugepage=always` to kernel cmdline for persistence)
- **Notes**: Reduces TLB misses for mmap'd model files. Hard to isolate impact.
- **Verdict**: KEEP — low risk
### 2.4 RADV_PERFTEST=nogttspill
- **Date**: 2026-03-30
- **Change**: `RADV_PERFTEST=nogttspill` persisted to `/etc/environment.d/radv-llm.conf`
- **Applied via**: `sudo make optimize-power`
- **Notes**: Prevents GTT spill management overhead on unified memory Vulkan. Takes effect on next login. For current session: `export RADV_PERFTEST=nogttspill`
- **Verdict**: KEEP — persists across reboots
### 2.5 amdgpu.noretry=0
- **Date**: PENDING
- **Change**: Kernel cmdline `amdgpu.noretry=0`
- **Expected**: Improved stability under memory pressure
- **Notes**: Only apply if experiencing GPU page faults or crashes during large model loading
- **Verdict**: PENDING
---
## Phase 3: Runtime Flags
### 3.1 KV Cache Quantization
- **Date**: 2026-03-27
- **Change**: `--kv-types f16,q8_0,q4_0` sweep
- **Benchmark**: `data/benchmarks/kv-sweep-256k-*`
- **Result** (Vulkan RADV, Qwen3.5-35B-A3B Q8, pp2048/tg1024):
- f16: 456 pp, 39.8 tg
- q8_0: 418 pp, 38.5 tg (slight Vulkan regression — unexpected)
- **q4_0: 460 pp, 41.1 tg** (fastest overall, +3% tg over f16)
- **Result** (ROCm, same model):
- f16: 445 pp, 21.5 tg
- q8_0: 495 pp, 21.7 tg (+11% pp, same tg)
- q4_0: 494 pp, 21.8 tg (+11% pp, same tg)
- **Conclusion**: q4_0 is the sweet spot on Vulkan (fastest tg + 75% less KV memory). On ROCm, KV quant helps pp but not tg.
- **Verdict**: KEEP — use q4_0 KV as default for serving
### 3.2 MoE Batch Size `-b 256`
- **Date**: 2026-03-30
- **Change**: `-b 256` vs default (2048)
- **Benchmark**: `data/benchmarks/batch-default-*` vs `data/benchmarks/batch-256-*`
- **Result** (Vulkan RADV, Qwen3.5-35B-A3B UD-Q4_K_XL, q4_0 KV):
- Default: 826 pp, 55.9 tg
- b=256: 843 pp, 55.5 tg (within noise)
- **Notes**: Community-reported +70% improvement does not reproduce on Vulkan RADV. May only apply to ROCm or CPU backends, or to longer prompts (pp8192+).
- **Verdict**: NO IMPACT on Vulkan — not recommended
---
## Phase 4: Build Optimizations
### 4.1 rocWMMA Flash Attention
- **Date**: PENDING
- **Change**: Rebuild ROCm toolbox with `-DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_UMA=ON`
- **Expected**: +96% long-context performance (65K+)
- **Notes**: Need to check if Donato's toolboxes already include this
- **Verdict**: PENDING
### 4.2 rocWMMA Tuned Patch (PR #16827)
- **Date**: PENDING
- **Notes**: Fixes long-context regression. Check Donato's latest toolbox builds.
- **Verdict**: PENDING
---
## Phase 5: Future / Blocked
### 5.1 Speculative Decoding (draft model)
- **Status**: BLOCKED — llama.cpp PR #20075 (hybrid SSM/MoE checkpoint/restore)
- **Draft model**: Downloaded `Qwen3.5-0.8B-Q8_0.gguf` (812 MB) on 2026-03-27
- **Last checked**: 2026-03-30 — PR stalled since Mar 5. ROCm buffer crashes in `copy_cell()`. Works on Metal/CUDA but not AMD. Months away from landing.
### 5.2 Native MTP (Multi-Token Prediction)
- **Status**: BLOCKED — multiple dependencies unmerged
- **Last checked**: 2026-03-30
- **Details**: 4 separate PRs in flight, none merged:
- PR #18886: MTP API framework (DRAFT since Feb 6) — foundation for all MTP work
- PR #20700: MTP for Qwen3.5 **dense only** (WIP, author says "not expected to merge soon")
- PR #15225: GLM-style MTP (open since Aug 2025, "slower than baseline")
- PR #18039: EAGLE3 speculative (open since Dec 2025)
- **Key gap**: No MTP implementation exists for MoE models. PR #20700 only covers dense Qwen3.5 (0.8B-27B), not the 35B-A3B MoE.
- **Timeline estimate**: MTP API (#18886) must merge first, then model-specific implementations adapted. Months, not weeks.
### 5.2a N-gram Speculative Decoding (AVAILABLE NOW)
- **Status**: WORKS TODAY — no upstream PRs needed
- **How**: `llama-server --spec-type ngram-simple --draft-max 64 --draft-min 4`
- **Expected**: 1.1-1.4x tg speedup on repetitive content (code, structured output)
- **Added to**: `make serve-ngram ARGS="-m MODEL.gguf"` and `bin/serve --ngram`
- **Notes**: Pattern-matches from token history, no draft model needed. Best for code generation where patterns repeat. No quality impact.
- **Verdict**: AVAILABLE — use `--ngram` flag when serving
### 5.3 GPU Clock Reporting
- **Status**: NOT A REAL ISSUE — sysfs reporting is broken, actual clocks are fine
- **Measured**: clpeak (2026-03-30) confirms GPU reaches 2900 MHz under compute load
- **Notes**: ROCm issue #5750 is about sysfs `pp_dpm_sclk` reporting, not actual performance. No action needed.
- **Verdict**: CLOSED — no performance impact
---
## Context Window Benchmarks
### 64K Context (pp4096/tg1024, MoE models)
- **Date**: 2026-03-26
- **Benchmark**: `data/benchmarks/ctx64k-*`
- **Results**: (check logs)
### 128K Context (pp8192/tg1024, MoE models)
- **Date**: 2026-03-26
- **Benchmark**: `data/benchmarks/ctx128k-realistic-*`
- **Results**: (check logs)
### 256K Context (pp16384/tg1024, MoE models)
- **Date**: 2026-03-27
- **Benchmark**: `data/benchmarks/ctx256k-*`
- **Results**: (check logs)
---
## Model Quant Shootout
### Qwen3.5-35B-A3B — Q4_K_L vs Q4_K_XL vs Q8 (2026-03-30)
- **Benchmark**: `data/benchmarks/qwen35-shootout-v2-*`
- **Config**: Vulkan RADV, q4_0 KV cache, pp2048/tg1024, 2 reps
- **RyzenAdj**: STAPM=81W (sustained ~70W due to HP firmware cap)
| Quant | File Size | pp2048 (t/s) | tg1024 (t/s) | Recommendation |
|-------|-----------|-------------|-------------|----------------|
| UD-Q4_K_L | 18.8 GB | 825 | **57.0** | Fastest. Good quality. |
| **UD-Q4_K_XL** | 20.7 GB | 835 | **56.4** | **Daily driver** — best quality/speed. |
| Q8_0 | 34.4 GB | 850 | 51.4 | Best quality, 10% slower tg. |
**Decision**: Keep UD-Q4_K_XL (daily driver) and Q8_0 (quality fallback). Q4_K_L deleted — Q4_K_XL is strictly better at only +2 GB.
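As a sanity check on what these file sizes mean in effective bits per weight (assuming 35B total params and decimal GB; the file also holds embeddings and norms, so this slightly overstates):

```bash
# bits/weight = (GB * 8e9 bits) / 35e9 weights = gb * 8 / 35
bits_per_weight() { awk -v gb="$1" 'BEGIN { printf "%.2f", gb * 8 / 35 }'; }
xl=$(bits_per_weight 20.7)   # UD-Q4_K_XL
q8=$(bits_per_weight 34.4)   # Q8_0
echo "UD-Q4_K_XL: ${xl} b/w, Q8_0: ${q8} b/w"
```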
### Coder Model Shootout (2026-03-30)
- **Benchmark**: `data/benchmarks/coder-shootout-*`
- **Config**: Vulkan RADV, q4_0 KV cache, pp2048/tg1024, 2 reps
- **RyzenAdj**: STAPM=81W (sustained ~70W)
| Model | Architecture | File Size | pp2048 (t/s) | tg1024 (t/s) |
|-------|-------------|-----------|-------------|-------------|
| **Qwen3-Coder-30B** UD-Q6_K_XL | Pure Transformer | 24.5 GB | 737 | **61.0** |
| **Qwen3.5-35B-A3B** UD-Q4_K_XL | Hybrid DeltaNet | 20.7 GB | **821** | 54.9 |
| **Nemotron-Cascade-2** Q8_0 | Hybrid Mamba-2 | 31.3 GB | 643 | 52.8 |
| **Qwen3-Coder-Next** UD-Q3_K_XL | Hybrid DeltaNet | 33.8 GB | 545 | 46.8 |
**Analysis**:
- tg speed scales inversely with model size (bandwidth-bound at ~215 GB/s)
- Pure Transformer (Qwen3-Coder-30B) has lowest overhead per token
- DeltaNet hybrid (Qwen3.5) has best pp — DeltaNet layers are efficient for prefill
- Qwen3-Coder-Next (80B at 3-bit) is 25% slower tg but has >70% SWE-bench vs ~50% for the 30B
**Recommended roles**:
- **Qwen3-Coder-30B**: Interactive tool-use / function-calling loops (fastest tg, purpose-built)
- **Qwen3.5-35B-A3B**: General tasks, long prompt processing (best pp, best all-rounder)
- **Qwen3-Coder-Next**: Complex multi-file coding tasks where quality > speed
---
## How to Add Entries
When testing a new optimization:
1. Record the date and exact change
2. Run a benchmark: `make benchmark ARGS="--tag DESCRIPTIVE-NAME ..."`
3. Compare: `make benchmark-compare BEFORE=data/path/baseline AFTER=data/path/new`
4. Update this log with results and verdict
5. If KEEP: document in [optimization.md](optimization.md) with the measured numbers

View File

@@ -1,84 +1,275 @@
# Optimization Guide
Complete walkthrough for optimizing AMD Strix Halo for LLM inference workloads. Organized in phases from essential to experimental. Each phase builds on the previous.
**Prerequisites**: Run `make audit` first to see your current state. Run `make benchmark-baseline` to capture pre-optimization performance numbers.
Track results in [optimization-log.md](optimization-log.md) as you apply each change.
---
## Phase 1: Core System (automated scripts)
These are the foundational optimizations handled by this repo's scripts. Apply in order.
### 1.1 Tuned Profile (no reboot)
```bash
sudo make optimize-tuned
```
Switches to `accelerator-performance`: disables deeper CPU C-states (trading idle power for lower wakeup latency) and sets the CPU governor to performance.
Takes effect immediately. Previous profile is saved for rollback.
**Measured**: +5-8% pp, +2-3% tg.
### 1.2 Kernel Boot Parameters (reboot required)
```bash
sudo make optimize-kernel
```
Adds three parameters to GRUB:
| Parameter | Value (64 GB) | Purpose |
|-----------|--------------|---------|
| `iommu=pt` | -- | IOMMU passthrough, reduces memory access latency |
| `amdgpu.gttsize` | `60416` | Max GPU-addressable system RAM in MiB (~59 GiB) |
| `ttm.pages_limit` | `15466496` | Max pinnable 4K pages for GPU memory |
Values computed dynamically from total physical RAM. See [architecture.md](architecture.md) for the math.
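A sketch of how the 64 GB values fall out. The ~5 GiB OS reserve is an assumption about the script's policy — check the script itself for the actual rule:

```bash
total_mib=65536                    # 64 GiB of physical RAM
reserve_mib=5120                   # assumed ~5 GiB held back for the OS
gttsize=$(( total_mib - reserve_mib ))
pages_limit=$(( gttsize * 256 ))   # 256 four-KiB pages per MiB
echo "amdgpu.gttsize=${gttsize} ttm.pages_limit=${pages_limit}"
```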
### 1.3 BIOS VRAM Reduction (reboot + BIOS access)
```bash
make optimize-vram # Prints guidance — cannot modify BIOS directly
```
Reduce UMA Frame Buffer Size from 32 GB to 512 MB. Frees 31.5 GB for GTT. See [bios-vram-guide.md](bios-vram-guide.md) (HP ZBook: F10 at boot).
**Combine 1.2 and 1.3 into a single reboot**: apply kernel params, then reboot into BIOS (F10) to change VRAM, then boot normally.
### 1.4 Verify
```bash
make verify # 9-point checklist, target: 9/9
make audit # Single-screen system status with scores
```
### Phase 1 Measured Impact
```bash
make benchmark
make benchmark-compare BEFORE=data/baselines/TIMESTAMP AFTER=data/benchmarks/TAG-TIMESTAMP
```
See [benchmarking.md](benchmarking.md) for methodology and result interpretation.
| Optimization | pp | tg |
|-------------|----|----|
| Tuned profile | +5-8% | +2-3% |
| Kernel params + BIOS VRAM | Enables 37 GB+ models | +5-15% |
| **Phase 1 combined** | **+15-25%** | **+8-18%** |
Numbers vary by model size and backend. Larger models see bigger gains from GTT expansion.
Trade-off: Small models (<5 GB) are ~3-8% slower due to GTT indirection vs dedicated VRAM. Acceptable given the massive capability gain.
---
## Phase 2: System Tuning
All Phase 2 optimizations are applied with a single command:
```bash
sudo make optimize-power
```
This applies RyzenAdj, sysctl, THP, and RADV nogttspill. Sysctl and nogttspill persist across reboots. RyzenAdj and THP are volatile.
For RyzenAdj persistence across reboots:
```bash
sudo cp configs/ryzenadj-llm.service configs/ryzenadj-resume.service /etc/systemd/system/
sudo systemctl enable --now ryzenadj-llm.service
sudo systemctl enable ryzenadj-resume.service
```
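The shipped unit files live in `configs/`; a minimal sketch of what such a oneshot unit looks like (field values here are illustrative, not the repo's actual file):

```ini
# ryzenadj-llm.service — illustrative sketch only
[Unit]
Description=Raise Ryzen power limits for LLM inference

[Service]
Type=oneshot
# RyzenAdj limits are in mW; 81000 = 81 W STAPM as measured below
ExecStart=/usr/bin/ryzenadj --stapm-limit=81000 --fast-limit=85000 --slow-limit=81000

[Install]
WantedBy=multi-user.target
```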
### 2.1 RyzenAdj PPT Increase
The HP ZBook Ultra G1a ships at 59W sustained. RyzenAdj raises this.
**Measured on HP ZBook**: STAPM raised to 81W, but **HP firmware hard-caps PPT SLOW at 70W**. Effective sustained power: ~70W (was ~59W). Cannot be overridden without modded BIOS.
**Measured impact**: Qwen3.5-35B-A3B tg1024 went from **~39 t/s → 57 t/s (+46%)**. This was the single largest improvement in the entire optimization journey.
Thermals: 70-73C under sustained load. 30C headroom. Cooling handles it easily.
### 2.2 VM / Sysctl Tuning
Persisted to `/etc/sysctl.d/99-llm-inference.conf`:
| Parameter | Default | Set To | Why |
|-----------|---------|--------|-----|
| `vm.swappiness` | 60 | **1** | Prevent model weight eviction |
| `vm.dirty_ratio` | 20 | **40** | Reduce I/O flush storms during inference |
| `vm.max_map_count` | 65530 | **500000** | Large models need many memory mappings |
| `vm.zone_reclaim_mode` | 0 | **0** | Don't aggressively reclaim memory zones |
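Equivalently, the drop-in file contains (reconstructed from the table above):

```ini
# /etc/sysctl.d/99-llm-inference.conf
vm.swappiness = 1
vm.dirty_ratio = 40
vm.max_map_count = 500000
vm.zone_reclaim_mode = 0
```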
### 2.3 Transparent Huge Pages
Set to `always`. Reduces TLB misses for mmap'd model files. Volatile — add `transparent_hugepage=always` to kernel cmdline for persistence.
### 2.4 RADV_PERFTEST=nogttspill
Persisted to `/etc/environment.d/radv-llm.conf`. Prevents GTT spill management overhead on unified memory. Vulkan RADV only.
---
## Phase 3: Runtime Flags (per-invocation)
These flags should be used when running llama-bench, llama-server, or llama-cli. Already set in this repo's benchmark scripts.
### 3.1 `-mmp 0` (no mmap) — mandatory
On unified memory, mmap adds a double-copy penalty. Always disable (`-mmp 0` for llama-bench, `--no-mmap` for llama-server and llama-cli).
### 3.2 KV Cache Quantization — use Q4_0
**Measured** (Vulkan RADV, Qwen3.5-35B-A3B):
| KV Type | pp2048 | tg1024 | Memory Savings |
|---------|--------|--------|---------------|
| f16 | 456 | 39.8 | Baseline |
| q8_0 | 418 | 38.5 | ~50% (slightly slower on Vulkan) |
| **q4_0** | **460** | **41.1** | **75% (fastest on Vulkan)** |
Q4_0 is faster because less memory bandwidth is spent reading KV cache. Use as default for serving. Quality impact is noticeable only on reasoning-heavy tasks.
### 3.3 Flash Attention (`-fa 1`) — always enable
+24% pp on ROCm. Modest improvement on Vulkan (CoopMat1). Already in benchmark scripts.
### 3.4 `ROCBLAS_USE_HIPBLASLT=1` (ROCm only) — mandatory
Without this, ROCm pp is 2-7x slower on gfx1151. Already in benchmark scripts.
### 3.5 Backend Selection
**Measured** (Qwen3.5-35B-A3B Q8, 128K context):
| Workload | Vulkan RADV | ROCm 7.2 | Winner |
|----------|------------|----------|--------|
| pp2048 | 456 | 445 | Vulkan (+2%) |
| tg1024 | 39.8 | 21.5 | **Vulkan (1.9x)** |
| pp8192 @ 131K | 130 | 84 | **Vulkan (1.5x)** |
| tg32 @ 131K | 22.0 | 8.1 | **Vulkan (2.7x)** |
**Vulkan RADV dominates across all workloads on this hardware.** ROCm is significantly slower for token generation, especially at long context. Never use AMDVLK.
### 3.6 MoE Batch Size `-b 256` — NO IMPACT on Vulkan
Community reports up to +70% pp improvement for MoE models. Measured 2026-03-30 on Vulkan RADV: within noise (826 → 843 pp, tg unchanged). May still help ROCm or CPU backends, or longer prompts (pp8192+).
---
## Phase 4: Build Optimizations (not yet tested)
These require rebuilding the llama.cpp toolbox containers. Given that **Vulkan RADV already outperforms ROCm significantly**, the ROI of these ROCm-specific optimizations is unclear.
### 4.1 ROCm Build Flags
```bash
cmake -B build \
-DGGML_HIP=ON \
-DGGML_HIP_ROCWMMA_FATTN=ON \
-DGGML_HIP_UMA=ON \
-DAMDGPU_TARGETS=gfx1151
```
`GGML_HIP_ROCWMMA_FATTN` enables GPU-accelerated flash attention via WMMA. Could close the gap between ROCm and Vulkan for long-context workloads. Check if Donato Capitella's ROCm toolboxes already include this.
### 4.2 Vulkan Cooperative Matrices
RADV supports `VK_KHR_cooperative_matrix` for RDNA 3+. Already used by llama.cpp for matrix operations. The current Vulkan toolbox shows `matrix cores: KHR_coopmat` — this is likely already active.
---
## Phase 5: Future / Currently Blocked
### 5.1 Speculative Decoding
Expected 1.8-2.5x tg speedup. Draft model (`Qwen3.5-0.8B-Q8_0.gguf`) downloaded.
**Blocked**: llama.cpp PR #20075 — hybrid SSM/MoE speculative rollback.
### 5.2 Native Multi-Token Prediction
Qwen3.5 has built-in MTP heads. No draft model needed.
**Blocked**: llama.cpp PR #20700 — MTP for Qwen3.5.
### 5.3 GPU Clock Reporting Fix
ROCm issue #5750 reports GPU clocks appearing stuck at ~885 MHz in sysfs. However, **clpeak confirms clocks reach 2900 MHz under compute load** (measured 2026-03-30). The issue is likely broken sysfs reporting, not actual clock throttling. No performance impact.
**Tracking**: ROCm issue #5750 (sysfs reporting only, not a real blocker).
### 5.4 Other Future Items
- **SageAttention**: 2-5x over FlashAttention. No AMD port yet.
- **XDNA NPU** (50 TOPS): Linux support coming in kernel 7.1. Could run draft model for speculative decoding.
- **TurboQuant 3-bit KV** (ICLR 2026): 4.9x KV compression. Being integrated into llama.cpp.
- **LLMLingua-2**: 20x prompt compression for agentic/RAG workloads.
---
## Hardware Limits (measured via clpeak on this system, 2026-03-30)
| Resource | Value | Notes |
|----------|-------|-------|
| DRAM bandwidth | **216-233 GB/s** | clpeak float: 215.7, float16: 232.6. Theoretical max: 256 GB/s. |
| Infinity Cache | **32 MB, ~1 TB/s** | MoE effective throughput exceeds DRAM BW due to cache hits |
| GPU clocks | **2900 MHz confirmed** | clpeak shows clocks reach 2900 MHz under load. ROCm #5750 sysfs reporting may be broken but actual clocks are correct. |
| FP16 compute | **21.9 TFLOPS** | Measured via clpeak |
| FP32 compute | **12.3 TFLOPS** | |
| LPDDR5X-8000 | 8000 MT/s, 256-bit | Soldered, no overclocking |
| Infinity Fabric | 2 GHz FCLK | Fixed |
| HP ZBook sustained power | **70W** (firmware cap) | RyzenAdj can set 85W but HP overrides to 70W |
**Note on MoE bandwidth**: MoE models (3B active params) read ~13.5 GB per token at Q4. At 55 t/s this implies ~740 GB/s effective throughput — well above the 216 GB/s DRAM ceiling. The 32 MB Infinity Cache (~1 TB/s) is actively boosting throughput for repeated KV cache and activation reads.
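The arithmetic behind that estimate, using the figures above:

```bash
# effective GB/s = GB read per token * tokens per second
eff=$(awk 'BEGIN { printf "%.1f", 13.5 * 55 }')
echo "~${eff} GB/s effective vs 216 GB/s DRAM ceiling"
```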
---
## Performance Summary (all measured, Vulkan RADV, q4_0 KV)
### Model Shootout (pp2048/tg1024, Phase 1+2 applied)
| Model | Arch | Size | pp2048 | tg1024 |
|-------|------|------|--------|--------|
| **Qwen3-Coder-30B** UD-Q6_K_XL | Pure Transformer | 24.5 GB | 737 | **61.0** |
| **Qwen3.5-35B-A3B** UD-Q4_K_XL | Hybrid DeltaNet | 20.7 GB | **821** | 54.9 |
| **Nemotron-Cascade-2** Q8_0 | Hybrid Mamba-2 | 31.3 GB | 643 | 52.8 |
| **Qwen3-Coder-Next** UD-Q3_K_XL | Hybrid DeltaNet | 33.8 GB | 545 | 46.8 |
tg speed is bandwidth-bound: smaller model = faster tokens.
### Optimization Journey (Qwen3.5-35B-A3B, tg on Vulkan)
| Stage | tg (t/s) | Improvement |
|-------|----------|-------------|
| Pre-optimization (stock) | ~33 | Baseline |
| After Phase 1 (tuned + kernel + BIOS) | ~39 | +18% |
| After Phase 2 (RyzenAdj + sysctl + THP) | **~57** | **+46%** |
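The stage-wise gains compound; end-to-end from stock:

```bash
# (57/33 - 1) * 100: overall tg improvement over the pre-optimization baseline
overall=$(awk 'BEGIN { printf "%.0f", (57 / 33 - 1) * 100 }')
echo "stock 33 t/s -> 57 t/s: +${overall}%"
```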
---
## Rollback
```bash
sudo make rollback # Restores GRUB backup and previous tuned profile
```
Phase 2 rollback:
- RyzenAdj: `sudo ryzenadj --stapm-limit=60000 --fast-limit=60000 --slow-limit=60000`
- Sysctl: `sudo rm /etc/sysctl.d/99-llm-inference.conf && sudo sysctl --system`
- THP: `echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled`
- nogttspill: `sudo rm /etc/environment.d/radv-llm.conf`
- BIOS VRAM: revert manually (F10 at boot)
---
If anything goes wrong, see [troubleshooting.md](troubleshooting.md).
## Further Reading
- [Optimization log](optimization-log.md) — Detailed test results and verdicts
- [Hardware analysis](llama-cpp-optimization-research.md) — llama.cpp flags, backends, quantization deep dive
- [Inference landscape](inference-optimization-landscape.md) — Broader survey of engines and techniques
- [Benchmarking guide](benchmarking.md) — Methodology and result interpretation
- [References](references.md) — All external links

View File

@@ -21,8 +21,13 @@ The most comprehensive community resource for Strix Halo LLM optimization.
## Community
- [Strix Halo Wiki — AI Capabilities](https://strixhalo.wiki/AI/AI_Capabilities_Overview) — Community benchmarks, model compatibility
- [Strix Halo Wiki — Power Modes](https://strixhalo.wiki/Guides/Power-Modes-and-Performance) — RyzenAdj sweet spots (85W recommended)
- [Strix Halo Wiki — llama.cpp Performance](https://strixhalo.wiki/AI/llamacpp-performance) — Backend comparison data
- [Level1Techs Forum — HP G1a Guide](https://forum.level1techs.com/t/the-ultimate-arch-secureboot-guide-for-ryzen-ai-max-ft-hp-g1a-128gb-8060s-monster-laptop/230652) — Laptop-specific configuration
- [Framework Community — GPU Performance Tests](https://community.frame.work/t/amd-strix-halo-ryzen-ai-max-395-gpu-llm-performance-tests/72521) — Framework Desktop results
- [Framework Community — Compiling vLLM on Strix Halo](https://community.frame.work/t/how-to-compiling-vllm-from-source-on-strix-halo/77241) — Native vLLM build guide
- [Hardware Corner — Strix Halo LLM Optimization](https://www.hardware-corner.net/strix-halo-llm-optimization/) — Comprehensive optimization walkthrough
- [Chips and Cheese — Strix Halo Memory Subsystem](https://chipsandcheese.com/p/strix-halos-memory-subsystem-tackling) — Bandwidth measurements (215 GB/s)
- [LLM Tracker — Strix Halo](https://llm-tracker.info/_TOORG/Strix-Halo) — Centralized performance database
## Other Strix Halo Repos
@@ -61,6 +66,22 @@ The most comprehensive community resource for Strix Halo LLM optimization.
- [SWE-bench](https://github.com/princeton-nlp/SWE-bench) — Real GitHub issue resolution
- [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) — Optimized agentic framework for Qwen models
## System Tuning
- [RyzenAdj](https://github.com/FlyGoat/RyzenAdj) — Power management for Ryzen APUs (PPT/TDP control)
- [geohot/ztop](https://github.com/geohot/ztop) — Power monitoring for Strix Halo (discovered 60W HP limits)
- [ROCm Issue #5750](https://github.com/ROCm/ROCm/issues/5750) — GPU clocks stuck at idle on gfx1151
- [Mesa RADV Environment Variables](https://docs.mesa3d.org/envvars.html) — RADV_PERFTEST=nogttspill docs
- [Linux Kernel: amd-pstate](https://docs.kernel.org/admin-guide/pm/amd-pstate.html) — CPU power management
## llama.cpp Optimization
- [llama.cpp Speculative Decoding](https://github.com/ggml-org/llama.cpp/blob/master/docs/speculative.md) — Draft model setup
- [llama.cpp PR #20075](https://github.com/ggml-org/llama.cpp/pull/20075) — Fix speculative for Qwen3.5 MoE (pending)
- [llama.cpp PR #20700](https://github.com/ggml-org/llama.cpp/pull/20700) — Native MTP for Qwen3.5 (WIP)
- [llama.cpp PR #16827](https://github.com/ggml-org/llama.cpp/pull/16827) — rocWMMA tuned flash attention
- [llama.cpp Issue #12444](https://github.com/ggml-org/llama.cpp/issues/12444) — Hugepage support proposal
## AMD GPU Profiling
- [Radeon GPU Profiler (RGP)](https://gpuopen.com/rgp/) — Hardware-level Vulkan/HIP profiling

requirements.txt Normal file
View File

@@ -0,0 +1,12 @@
# Agentic evaluation frameworks
# Install: python3.13 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
# Requires Python >=3.10, <3.14 (bigcodebench constraint)
inspect-ai>=0.3.201
inspect-evals>=0.6.0
evalplus>=0.3.1
bigcodebench>=0.2.5
openai>=2.26.0
# IFEval dependency (not on PyPI)
instruction_following_eval @ git+https://github.com/josejg/instruction_following_eval

View File

@@ -5,7 +5,7 @@ set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/../../lib/common.sh"
VENV_DIR="$PROJECT_ROOT/.venv"
EVAL_DIR="$(data_dir evals)"
# ── Argument parsing ─────────────────────────────────────
@@ -37,33 +37,59 @@ while [[ $# -gt 0 ]]; do
done
# ── Validation ───────────────────────────────────────────
if [[ ! -f "$VENV_DIR/bin/activate" ]]; then
log_error "Virtual environment not found. Run: make agentic-setup"
exit 1
fi
source "$VENV_DIR/bin/activate"
# Auto-detect server if no explicit endpoint given
if [[ "$ENDPOINT" == "http://localhost:11434/v1" ]]; then
if curl -sf "http://localhost:8080/health" >/dev/null 2>&1; then
ENDPOINT="http://localhost:8080/v1"
log_info "Auto-detected llama-server at localhost:8080"
elif curl -sf "http://localhost:11434/api/tags" >/dev/null 2>&1; then
log_info "Auto-detected ollama at localhost:11434"
else
log_error "No LLM server found. Start one first:"
log_info " make serve ARGS=\"-m MODEL.gguf\" (llama-server)"
log_info " ollama serve (ollama)"
exit 1
fi
else
if ! curl -sf "${ENDPOINT%/v1}/health" >/dev/null 2>&1 && \
! curl -sf "$ENDPOINT/models" >/dev/null 2>&1; then
log_error "No LLM server at $ENDPOINT"
exit 1
fi
fi
# Auto-detect model name from server if not provided
if [[ -z "$MODEL" ]]; then
DETECTED_MODEL=$(curl -sf "$ENDPOINT/models" 2>/dev/null | python3 -c "
import sys, json
try:
data = json.load(sys.stdin)
models = data.get('data', [])
if models:
print(models[0].get('id', ''))
except: pass
" 2>/dev/null || true)
if [[ -n "$DETECTED_MODEL" ]]; then
MODEL="$DETECTED_MODEL"
log_info "Auto-detected model: $MODEL"
else
log_error "Model name required. Use --model NAME"
log_info "Examples:"
log_info " --model qwen3.5:35b-a3b-q8_0 (ollama)"
log_info " --model Qwen3.5-35B-A3B-Q8_0 (llama.cpp server)"
exit 1
fi
fi
TS="$(timestamp)"
SAFE_MODEL="$(echo "$MODEL" | tr -cs 'a-zA-Z0-9._-' '_')"
RUN_DIR="$EVAL_DIR/${SUITE}-${SAFE_MODEL}-${TS}"
mkdir -p "$RUN_DIR"
log_header "Agentic Evaluation: $SUITE"
@@ -86,7 +112,11 @@ ENDJSON
METRICS_FILE="$RUN_DIR/metrics.csv"
bash "$SCRIPT_DIR/../monitor/log-metrics.sh" --output "$METRICS_FILE" --interval 5 &
METRICS_PID=$!
cleanup() {
kill "$METRICS_PID" 2>/dev/null || true
wait "$METRICS_PID" 2>/dev/null || true
}
trap 'cleanup; exit 0' EXIT
# ── Suite execution ──────────────────────────────────────
@@ -113,14 +143,14 @@ run_evalplus() {
run_inspect_eval() {
local eval_name="$1"
local display_name="$2"
local safe_name="${eval_name//\//_}" # inspect_evals/ifeval → inspect_evals_ifeval
log_info "Running Inspect AI: $display_name..."
local out="$RUN_DIR/inspect-${eval_name}.json"
OPENAI_BASE_URL="$ENDPOINT" OPENAI_API_KEY="not-needed" \
inspect eval "$eval_name" \
--model "openai/$MODEL" \
--log-dir "$RUN_DIR/inspect-logs/" \
2>&1 | tee "$RUN_DIR/inspect-${safe_name}.log"
log_success "Inspect $display_name complete"
}
@@ -138,7 +168,7 @@ run_bigcodebench() {
case "$SUITE" in
quick)
run_evalplus "humaneval"
run_inspect_eval "inspect_evals/ifeval" "IFEval (instruction following)"
;;
code)
run_evalplus "humaneval"
@@ -146,13 +176,13 @@ case "$SUITE" in
run_bigcodebench
;;
tooluse)
run_inspect_eval "inspect_evals/bfcl" "BFCL (function calling)"
;;
full)
run_evalplus "humaneval"
run_evalplus "mbpp"
run_inspect_eval "inspect_evals/ifeval" "IFEval (instruction following)"
run_inspect_eval "inspect_evals/bfcl" "BFCL (function calling)"
run_bigcodebench
;;
*)

View File

@@ -8,91 +8,56 @@ source "$SCRIPT_DIR/../../lib/common.sh"
log_header "Agentic Evaluation Setup"
# ── Python virtual environment ───────────────────────────
VENV_DIR="$PROJECT_ROOT/.venv"
REQUIREMENTS="$PROJECT_ROOT/requirements.txt"
if [[ ! -f "$VENV_DIR/bin/activate" ]]; then
# Prefer Python 3.13 (bigcodebench requires <3.14)
PYTHON_BIN="python3.13"
if ! command -v "$PYTHON_BIN" >/dev/null 2>&1; then
PYTHON_BIN="python3"
log_warn "python3.13 not found, using $(python3 --version). bigcodebench may not install."
fi
log_info "Creating virtual environment with $($PYTHON_BIN --version)..."
"$PYTHON_BIN" -m venv "$VENV_DIR"
log_success "Virtual environment created at $VENV_DIR"
fi
source "$VENV_DIR/bin/activate"
log_info "Python: $(python3 --version) from $VENV_DIR"
# ── Install from requirements.txt ────────────────────────
log_info "Installing dependencies from requirements.txt..."
pip install -r "$REQUIREMENTS" 2>&1 | tail -5
log_success "Dependencies installed"
# ── Check for local LLM server ──────────────────────────
log_header "LLM Server Check"
if curl -sf http://localhost:8080/health >/dev/null 2>&1; then
log_success "llama-server running at localhost:8080"
elif curl -sf http://localhost:11434/api/tags >/dev/null 2>&1; then
log_success "ollama running at localhost:11434"
else
log_warn "No local LLM server running. Start one before running evals:"
log_info " make serve ARGS=\"-m MODEL.gguf\" (llama-server)"
log_info " ollama serve (ollama)"
fi
# ── Summary ──────────────────────────────────────────────
log_header "Setup Complete"
echo ""
echo " Installed tools:"
echo " inspect-ai — All-in-one eval framework (IFEval, BFCL, GAIA, ...)"
echo " inspect-evals — Task definitions for inspect-ai"
echo " evalplus — HumanEval+ / MBPP+ with native ollama support"
echo " bigcodebench — 1,140 coding tasks across 139 libraries"
echo ""
echo " Activate venv: source .venv/bin/activate"
echo ""
echo " Run evaluations:"
echo " make agentic-quick # EvalPlus HumanEval+ + IFEval (~1 hour)"
echo " make agentic-code # EvalPlus + BigCodeBench (~2-3 hours)"
echo " make agentic-tooluse # BFCL function calling (~1-2 hours)"
echo " make agentic-full # All of the above (~5-6 hours)"
echo ""

View File

@@ -21,16 +21,22 @@ CTX_DEPTH=32768
CTX_PROMPT=2048
PP_TOKENS=512
TG_TOKENS=128
BATCH_SIZE="" # Batch size override (-b flag, empty = llama-bench default 2048)
KV_TYPES_RAW="" # Comma-separated KV cache types to sweep (e.g. f16,q8_0,q4_0 or q4_0:q8_0)
BACKENDS_FILTER="llama-vulkan-radv" # Default to Vulkan; use --backends to override
while [[ $# -gt 0 ]]; do
case "$1" in
--skip-longctx) SKIP_LONGCTX=true; shift ;;
--max-size|-s) MAX_SIZE_GB="$2"; shift 2 ;;
--category|-c) CATEGORY_FILTER="$2"; shift 2 ;;
--backends) BACKENDS_FILTER="$2"; shift 2 ;;
--reps|-r) REPS_STANDARD="$2"; shift 2 ;;
--context|-d) CTX_DEPTH="$2"; shift 2 ;;
--pp) PP_TOKENS="$2"; shift 2 ;;
--tg) TG_TOKENS="$2"; shift 2 ;;
-b|--batch) BATCH_SIZE="$2"; shift 2 ;;
--kv-types) KV_TYPES_RAW="$2"; shift 2 ;;
--help|-h)
echo "Usage: run-baseline.sh [OPTIONS]"
echo ""
@@ -38,15 +44,21 @@ while [[ $# -gt 0 ]]; do
echo " --skip-longctx Skip long-context tests"
echo " --max-size GB Only bench models up to this file size in GB"
echo " --category LIST Comma-separated: smoke,dense,moe (from models.conf)"
echo " --backends LIST Comma-separated backends (default: llama-vulkan-radv)"
echo " --reps N Standard test repetitions (default: 5)"
echo " --context N Long-context depth in tokens (default: 32768)"
echo " --pp N Prompt processing tokens (default: 512)"
echo " --tg N Token generation count (default: 128)"
echo " -b, --batch N Batch size (default: 2048, try 256 for MoE)"
echo " --kv-types LIST KV cache sweep: comma-separated types to test"
echo " Each entry: TYPE (both K+V) or K_TYPE:V_TYPE"
echo " Types: f16, q8_0, q4_0, q4_1"
echo ""
echo "Examples:"
echo " run-baseline.sh --max-size 20 # Only models ≤20 GB"
echo " run-baseline.sh --context 131072 --category moe # 128K context on MoE"
echo " run-baseline.sh --tg 1024 --pp 2048 --category moe # Realistic agentic"
echo " run-baseline.sh --kv-types f16,q8_0,q4_0 --context 131072 # KV sweep"
echo " run-baseline.sh --skip-longctx --max-size 15 # Quick safe run"
exit 0 ;;
*) log_warn "Unknown argument: $1"; shift ;;
@@ -59,11 +71,19 @@ if (( CTX_DEPTH > 32768 )); then
(( CTX_PROMPT < 512 )) && CTX_PROMPT=512
fi
# Parse KV cache types for sweep
if [[ -n "$KV_TYPES_RAW" ]]; then
IFS=',' read -ra KV_TYPES <<< "$KV_TYPES_RAW"
else
KV_TYPES=("f16")
fi
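The `--kv-types` handling above splits the comma list with `read -ra`, then each entry is expanded into K/V cache types later in the loop. A standalone sketch of that parsing, with an illustrative sample value:

```shell
# Sketch of the --kv-types parsing: "q8_0" sets both K and V, "K:V" splits.
KV_TYPES_RAW="f16,q8_0,q4_0:q8_0"
IFS=',' read -ra KV_TYPES <<< "$KV_TYPES_RAW"
for KV_SPEC in "${KV_TYPES[@]}"; do
  if [[ "$KV_SPEC" == *:* ]]; then
    KV_K="${KV_SPEC%%:*}"   # keep text before the first ':' (key cache type)
    KV_V="${KV_SPEC##*:}"   # keep text after the last ':' (value cache type)
  else
    KV_K="$KV_SPEC"
    KV_V="$KV_SPEC"
  fi
  echo "K=$KV_K V=$KV_V"
done
# -> K=f16 V=f16, then K=q8_0 V=q8_0, then K=q4_0 V=q8_0
```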
log_header "Baseline Benchmark Capture"
log_info "Results will be saved to: $RESULT_DIR"
$SKIP_LONGCTX && log_info "Long-context tests: SKIPPED"
(( MAX_SIZE_GB > 0 )) && log_info "Max model size: ${MAX_SIZE_GB} GB"
[[ -n "$CATEGORY_FILTER" ]] && log_info "Categories: $CATEGORY_FILTER"
(( ${#KV_TYPES[@]} > 1 )) && log_info "KV cache sweep: ${KV_TYPES[*]}"
# ── 1. Save system state ────────────────────────────────
log_info "Capturing system state..."
@@ -77,14 +97,17 @@ declare -A BENCH_PATHS=(
[llama-vulkan-amdvlk]="/usr/sbin/llama-bench"
[llama-rocm-6.4.4]="/usr/local/bin/llama-bench"
[llama-rocm-7.2]="/usr/local/bin/llama-bench"
[llama-rocm-7.2.1]="/usr/local/bin/llama-bench"
[llama-rocm7-nightlies]="/usr/local/bin/llama-bench"
)
available_backends=()
for tb in "${!BENCH_PATHS[@]}"; do
if echo "$existing" | grep -q "^${tb}$"; then
-available_backends+=("$tb")
-log_success "Backend: $tb"
+if [[ -z "$BACKENDS_FILTER" ]] || echo "$BACKENDS_FILTER" | tr ',' '\n' | grep -qFx "$tb"; then
+available_backends+=("$tb")
+log_success "Backend: $tb"
+fi
fi
done
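The backend filter above is the substring-bug fix called out in the commit message: plain `grep -q` does regex/substring matching, so a filter of `llama-rocm-7.2` would also select `llama-rocm-7.2.1` (and `.` matches any character). A small demonstration of why `-Fx` is needed:

```shell
# -F treats the pattern as a fixed string; -x requires the whole line to match.
FILTER="llama-rocm-7.2"
echo "llama-rocm-7.2.1" | grep -q "$FILTER" && echo "substring: 7.2.1 wrongly selected"
echo "llama-rocm-7.2.1" | grep -qFx "$FILTER" || echo "exact: 7.2.1 correctly rejected"
echo "llama-rocm-7.2" | grep -qFx "$FILTER" && echo "exact: 7.2 still matches itself"
```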
@@ -165,9 +188,8 @@ log_info "Metric logger started (PID: $METRICS_PID)"
cleanup() {
kill "$METRICS_PID" 2>/dev/null || true
wait "$METRICS_PID" 2>/dev/null || true
-return 0
}
-trap cleanup EXIT
+trap 'cleanup; exit 0' EXIT
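The cleanup change above keeps the metric logger from outliving the run: a background process is recorded by PID and reaped from an EXIT trap. A minimal sketch of that lifecycle, run in a child shell so the trap does not affect the surrounding session:

```shell
# A background process is started, then an EXIT trap kills and reaps it.
bash -c '
  sleep 60 & BG_PID=$!
  cleanup() {
    kill "$BG_PID" 2>/dev/null || true
    wait "$BG_PID" 2>/dev/null || true
  }
  trap "cleanup; exit 0" EXIT
  echo "logger running as PID $BG_PID"
'
echo "exit status: $?"
# -> exit status: 0 (the trap ran cleanup, then forced exit 0)
```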
# ── 5. Run benchmarks ───────────────────────────────────
for MODEL_PATH in "${MODEL_PATHS[@]}"; do
@@ -189,56 +211,85 @@ for MODEL_PATH in "${MODEL_PATHS[@]}"; do
TOOLBOX_MODEL_PATH="/run/host${TOOLBOX_MODEL_PATH}"
fi
-# Standard test
-local_suffix="fa1"
-[[ "$PP_TOKENS" != "512" || "$TG_TOKENS" != "128" ]] && local_suffix="fa1__pp${PP_TOKENS}_tg${TG_TOKENS}"
-OUT="$RESULT_DIR/${MODEL_NAME}__${BACKEND_SAFE}__${local_suffix}.log"
-if [[ ! -s "$OUT" ]]; then
-printf "\n${BOLD}>> [%s] %s — pp%s/tg%s${RESET}\n" "$BACKEND" "$MODEL_NAME" "$PP_TOKENS" "$TG_TOKENS"
-CMD=(toolbox run -c "$BACKEND" -- "${ENV_ARGS[@]}" "$BENCH_BIN"
--ngl 99 -mmp 0 -m "$TOOLBOX_MODEL_PATH" -fa 1
--p "$PP_TOKENS" -n "$TG_TOKENS" -r "$REPS_STANDARD")
-printf " cmd: %s\n" "${CMD[*]}"
-if "${CMD[@]}" > "$OUT" 2>&1; then
-log_success "Standard test complete"
-tail -5 "$OUT"
-else
-log_error "Standard test failed (exit $?)"
-echo "FAILED" >> "$OUT"
-fi
-else
-log_info "Skipping standard test (log exists): $OUT"
-fi
-# Long-context test (pp2048, tg32, ctx 32768)
-if $SKIP_LONGCTX; then
-continue
-fi
-OUT_LC="$RESULT_DIR/${MODEL_NAME}__${BACKEND_SAFE}__fa1__longctx${CTX_DEPTH}.log"
-if [[ ! -s "$OUT_LC" ]]; then
-printf "\n${BOLD}>> [%s] %s — long-context %s${RESET}\n" "$BACKEND" "$MODEL_NAME" "$CTX_DEPTH"
-UB_SIZE=2048
-[[ "$BACKEND" == *vulkan* ]] && UB_SIZE=512
-CMD_LC=(toolbox run -c "$BACKEND" -- "${ENV_ARGS[@]}" "$BENCH_BIN"
--ngl 99 -mmp 0 -m "$TOOLBOX_MODEL_PATH" -fa 1
--p "$CTX_PROMPT" -n 32 -d "$CTX_DEPTH" -ub "$UB_SIZE"
--r "$REPS_LONGCTX")
-printf " cmd: %s\n" "${CMD_LC[*]}"
-if "${CMD_LC[@]}" > "$OUT_LC" 2>&1; then
-log_success "Long-context test complete"
-tail -5 "$OUT_LC"
-else
-log_error "Long-context test failed (exit $?)"
-echo "FAILED" >> "$OUT_LC"
-fi
-else
-log_info "Skipping long-context test (log exists): $OUT_LC"
-fi
+for KV_SPEC in "${KV_TYPES[@]}"; do
+# Parse KV spec: "q8_0" → K=q8_0,V=q8_0 or "q4_0:q8_0" → K=q4_0,V=q8_0
+if [[ "$KV_SPEC" == *:* ]]; then
+KV_K="${KV_SPEC%%:*}"
+KV_V="${KV_SPEC##*:}"
+else
+KV_K="$KV_SPEC"
+KV_V="$KV_SPEC"
+fi
+# Build KV cache args (skip for f16 — it's the default)
+KV_ARGS=()
+KV_SUFFIX=""
+if [[ "$KV_K" != "f16" || "$KV_V" != "f16" ]]; then
+KV_ARGS+=(-ctk "$KV_K" -ctv "$KV_V")
+KV_SUFFIX="__kv_${KV_K}-${KV_V}"
+fi
+# Build batch size args
+BATCH_ARGS=()
+BATCH_SUFFIX=""
+if [[ -n "$BATCH_SIZE" ]]; then
+BATCH_ARGS+=(-b "$BATCH_SIZE")
+BATCH_SUFFIX="__b${BATCH_SIZE}"
+fi
+# Standard test
+local_suffix="fa1"
+[[ "$PP_TOKENS" != "512" || "$TG_TOKENS" != "128" ]] && local_suffix="fa1__pp${PP_TOKENS}_tg${TG_TOKENS}"
+OUT="$RESULT_DIR/${MODEL_NAME}__${BACKEND_SAFE}__${local_suffix}${KV_SUFFIX}${BATCH_SUFFIX}.log"
+if [[ ! -s "$OUT" ]]; then
+printf "\n${BOLD}>> [%s] %s — pp%s/tg%s KV=%s${RESET}\n" \
+"$BACKEND" "$MODEL_NAME" "$PP_TOKENS" "$TG_TOKENS" "${KV_K}/${KV_V}"
+CMD=(toolbox run -c "$BACKEND" -- "${ENV_ARGS[@]}" "$BENCH_BIN"
+-ngl 99 -mmp 0 -m "$TOOLBOX_MODEL_PATH" -fa 1
+-p "$PP_TOKENS" -n "$TG_TOKENS" -r "$REPS_STANDARD" "${BATCH_ARGS[@]}" "${KV_ARGS[@]}")
+printf " cmd: %s\n" "${CMD[*]}"
+if "${CMD[@]}" > "$OUT" 2>&1; then
+log_success "Standard test complete"
+tail -5 "$OUT"
+else
+log_error "Standard test failed (exit $?)"
+echo "FAILED" >> "$OUT"
+fi
+else
+log_info "Skipping standard test (log exists): $OUT"
+fi
+# Long-context test
+if $SKIP_LONGCTX; then
+continue
+fi
+OUT_LC="$RESULT_DIR/${MODEL_NAME}__${BACKEND_SAFE}__fa1__longctx${CTX_DEPTH}${KV_SUFFIX}${BATCH_SUFFIX}.log"
+if [[ ! -s "$OUT_LC" ]]; then
+printf "\n${BOLD}>> [%s] %s — long-context %s KV=%s${RESET}\n" \
+"$BACKEND" "$MODEL_NAME" "$CTX_DEPTH" "${KV_K}/${KV_V}"
+UB_SIZE=2048
+[[ "$BACKEND" == *vulkan* ]] && UB_SIZE=512
+CMD_LC=(toolbox run -c "$BACKEND" -- "${ENV_ARGS[@]}" "$BENCH_BIN"
+-ngl 99 -mmp 0 -m "$TOOLBOX_MODEL_PATH" -fa 1
+-p "$CTX_PROMPT" -n "$TG_TOKENS" -d "$CTX_DEPTH" -ub "$UB_SIZE"
+-r "$REPS_LONGCTX" "${BATCH_ARGS[@]}" "${KV_ARGS[@]}")
+printf " cmd: %s\n" "${CMD_LC[*]}"
+if "${CMD_LC[@]}" > "$OUT_LC" 2>&1; then
+log_success "Long-context test complete"
+tail -5 "$OUT_LC"
+else
+log_error "Long-context test failed (exit $?)"
+echo "FAILED" >> "$OUT_LC"
+fi
+else
+log_info "Skipping long-context test (log exists): $OUT_LC"
+fi
+done # KV_TYPES
done
done
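The per-run log names built in the loop above compose a base `fa1` suffix with optional KV and batch suffixes, so every sweep combination lands in its own file. A sketch with illustrative model and backend names:

```shell
# Log filename composition: base suffix + optional KV suffix + optional batch suffix.
MODEL_NAME="example-model"
BACKEND_SAFE="llama-vulkan-radv"
local_suffix="fa1"
KV_SUFFIX="__kv_q4_0-q8_0"   # empty when KV cache is f16/f16
BATCH_SUFFIX="__b256"        # empty when no -b override is given
OUT="${MODEL_NAME}__${BACKEND_SAFE}__${local_suffix}${KV_SUFFIX}${BATCH_SUFFIX}.log"
echo "$OUT"
# -> example-model__llama-vulkan-radv__fa1__kv_q4_0-q8_0__b256.log
```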
@@ -258,6 +309,10 @@ for logfile in sorted(result_dir.glob("*.log")):
if "FAILED" in content:
continue
# Extract KV cache type from filename (__kv_q8_0-q8_0, tolerating a __bN batch suffix)
kv_match = re.search(r'__kv_(.+?)-(.+?)(?:__b\d+)?\.log$', logfile.name)
kv_type = f"{kv_match.group(1)}/{kv_match.group(2)}" if kv_match else "f16/f16"
for line in content.splitlines():
line = line.strip()
if not line.startswith("|") or ("model" in line.lower() and "size" in line.lower()):
@@ -266,12 +321,15 @@ for logfile in sorted(result_dir.glob("*.log")):
continue
parts = [p.strip() for p in line.split("|")]
-if len(parts) < 10:
+# Filter out empty parts from leading/trailing pipes
+data = [p for p in parts if p and "---" not in p]
+if len(data) < 6:
continue
try:
-test_type = parts[8].strip() if len(parts) > 8 else ""
-ts_raw = parts[9].strip() if len(parts) > 9 else ""
+# test and t/s are always the last two columns
+test_type = data[-2]
+ts_raw = data[-1]
if not test_type or not ts_raw:
continue
@@ -281,11 +339,12 @@ for logfile in sorted(result_dir.glob("*.log")):
results.append({
"file": logfile.name,
-"model": parts[1].strip(),
-"size": parts[2].strip(),
-"backend": parts[4].strip(),
+"model": data[0],
+"size": data[1],
+"backend": data[3],
"test": test_type,
"tokens_per_sec": float(ts_match.group(1)),
"kv_cache": kv_type,
"raw": ts_raw,
})
except (ValueError, IndexError):
@@ -307,13 +366,14 @@ if not data["results"]:
print(" No results parsed. Check log files for errors.")
sys.exit(0)
-fmt = " {:<20} {:<16} {:<8} {:>10}"
-print(fmt.format("Model", "Backend", "Test", "t/s"))
-print(" " + "-" * 58)
+fmt = " {:<20} {:<16} {:<10} {:<8} {:>10}"
+print(fmt.format("Model", "Backend", "KV cache", "Test", "t/s"))
+print(" " + "-" * 68)
for r in data["results"]:
print(fmt.format(
r["model"][:20],
r["backend"][:16],
r.get("kv_cache", "f16/f16")[:10],
r["test"],
f"{r['tokens_per_sec']:.2f}"
))
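The rewritten parser above drops the empty cells produced by llama-bench's leading and trailing pipes, then takes the last two nonempty columns as `test` and `t/s`; unlike fixed indexes, this survives columns being added to the table. The same idea in awk, on a made-up table row:

```shell
# Trim every cell, keep nonempty ones, then read the last two columns.
row='| example-model | 17.2 GiB | 35.0 B | Vulkan | 99 | pp512 | 123.45 ± 1.2 |'
echo "$row" | awk -F'|' '{
  n = 0
  for (i = 1; i <= NF; i++) {
    gsub(/^ +| +$/, "", $i)          # strip leading/trailing spaces per cell
    if ($i != "") cells[++n] = $i    # skip the empty edge cells
  }
  print "test=" cells[n-1], "t/s=" cells[n]
}'
# -> test=pp512 t/s=123.45 ± 1.2
```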

View File

@@ -9,7 +9,7 @@ source "$SCRIPT_DIR/../../lib/format.sh"
MODEL_DIR="$(data_dir models)"
TAG="run"
-BACKENDS_FILTER=""
+BACKENDS_FILTER="llama-vulkan-radv"
MODELS_FILTER=""
SKIP_LONGCTX=false
MAX_SIZE_GB=0
@@ -20,6 +20,8 @@ CTX_DEPTH=32768
CTX_PROMPT=2048
PP_TOKENS=512
TG_TOKENS=128
BATCH_SIZE="" # Batch size override (-b flag, empty = llama-bench default 2048)
KV_TYPES_RAW="" # Comma-separated KV cache types to sweep (e.g. f16,q8_0,q4_0 or q4_0:q8_0)
while [[ $# -gt 0 ]]; do
case "$1" in
@@ -33,6 +35,8 @@ while [[ $# -gt 0 ]]; do
--context|-d) CTX_DEPTH="$2"; shift 2 ;;
--pp) PP_TOKENS="$2"; shift 2 ;;
--tg) TG_TOKENS="$2"; shift 2 ;;
-b|--batch) BATCH_SIZE="$2"; shift 2 ;;
--kv-types) KV_TYPES_RAW="$2"; shift 2 ;;
--help|-h)
echo "Usage: run-suite.sh [OPTIONS]"
echo ""
@@ -47,10 +51,16 @@ while [[ $# -gt 0 ]]; do
echo " --context N Long-context depth in tokens (default: 32768)"
echo " --pp N Prompt processing tokens (default: 512)"
echo " --tg N Token generation count (default: 128)"
echo " -b, --batch N Batch size (default: 2048, try 256 for MoE)"
echo " --kv-types LIST KV cache sweep: comma-separated types to test"
echo " Each entry: TYPE (both K+V) or K_TYPE:V_TYPE"
echo " Types: f16, q8_0, q4_0, q4_1"
echo ""
echo "Examples:"
echo " run-suite.sh --tag ctx128k --context 131072 --category moe"
echo " run-suite.sh --tag realistic --tg 1024 --pp 2048 --category moe"
echo " run-suite.sh --tag kv-sweep --kv-types f16,q8_0,q4_0 --context 131072"
echo " run-suite.sh --tag kv-mixed --kv-types q8_0,q4_0:q8_0 --context 131072"
echo " run-suite.sh --tag post-opt --max-size 20 --skip-longctx"
exit 0 ;;
*) log_warn "Unknown argument: $1"; shift ;;
@@ -63,12 +73,20 @@ if (( CTX_DEPTH > 32768 )); then
(( CTX_PROMPT < 512 )) && CTX_PROMPT=512
fi
# Parse KV cache types for sweep
if [[ -n "$KV_TYPES_RAW" ]]; then
IFS=',' read -ra KV_TYPES <<< "$KV_TYPES_RAW"
else
KV_TYPES=("f16")
fi
TS="$(timestamp)"
RESULT_DIR="$(data_dir benchmarks)/${TAG}-${TS}"
mkdir -p "$RESULT_DIR"
log_header "Benchmark Suite: $TAG"
log_info "Results: $RESULT_DIR"
(( ${#KV_TYPES[@]} > 1 )) && log_info "KV cache sweep: ${KV_TYPES[*]}"
# Save system state
bash "$SCRIPT_DIR/../audit/system-report.sh" --json > "$RESULT_DIR/system-state.json" 2>/dev/null
@@ -81,13 +99,14 @@ declare -A BENCH_PATHS=(
[llama-vulkan-amdvlk]="/usr/sbin/llama-bench"
[llama-rocm-6.4.4]="/usr/local/bin/llama-bench"
[llama-rocm-7.2]="/usr/local/bin/llama-bench"
[llama-rocm-7.2.1]="/usr/local/bin/llama-bench"
[llama-rocm7-nightlies]="/usr/local/bin/llama-bench"
)
available_backends=()
for tb in "${!BENCH_PATHS[@]}"; do
if echo "$existing" | grep -q "^${tb}$"; then
-if [[ -z "$BACKENDS_FILTER" ]] || echo "$BACKENDS_FILTER" | tr ',' '\n' | grep -q "$tb"; then
+if [[ -z "$BACKENDS_FILTER" ]] || echo "$BACKENDS_FILTER" | tr ',' '\n' | grep -qFx "$tb"; then
available_backends+=("$tb")
fi
fi
@@ -112,7 +131,7 @@ for p in "${ALL_MODEL_PATHS[@]}"; do
# Name filter
if [[ -n "$MODELS_FILTER" ]]; then
-if ! echo "$MODELS_FILTER" | tr ',' '\n' | grep -qi "$local_name"; then
+if ! echo "$MODELS_FILTER" | tr ',' '\n' | grep -qiF "$local_name"; then
continue
fi
fi
@@ -157,7 +176,11 @@ log_info "Models: ${#MODEL_PATHS[@]}"
METRICS_FILE="$RESULT_DIR/metrics.csv"
bash "$SCRIPT_DIR/../monitor/log-metrics.sh" --output "$METRICS_FILE" --interval 2 &
METRICS_PID=$!
-trap 'kill "$METRICS_PID" 2>/dev/null; wait "$METRICS_PID" 2>/dev/null; true' EXIT
+cleanup() {
+kill "$METRICS_PID" 2>/dev/null || true
+wait "$METRICS_PID" 2>/dev/null || true
+}
+trap 'cleanup; exit 0' EXIT
# Run benchmarks (same logic as run-baseline.sh)
for MODEL_PATH in "${MODEL_PATHS[@]}"; do
@@ -176,39 +199,68 @@ for MODEL_PATH in "${MODEL_PATHS[@]}"; do
TOOLBOX_MODEL_PATH="/run/host${TOOLBOX_MODEL_PATH}"
fi
-# Standard test
-local_suffix="fa1"
-[[ "$PP_TOKENS" != "512" || "$TG_TOKENS" != "128" ]] && local_suffix="fa1__pp${PP_TOKENS}_tg${TG_TOKENS}"
-OUT="$RESULT_DIR/${MODEL_NAME}__${BACKEND_SAFE}__${local_suffix}.log"
-if [[ ! -s "$OUT" ]]; then
-printf "\n${BOLD}>> [%s] %s — pp%s/tg%s${RESET}\n" "$BACKEND" "$MODEL_NAME" "$PP_TOKENS" "$TG_TOKENS"
-CMD=(toolbox run -c "$BACKEND" -- "${ENV_ARGS[@]}" "$BENCH_BIN"
--ngl 99 -mmp 0 -m "$TOOLBOX_MODEL_PATH" -fa 1
--p "$PP_TOKENS" -n "$TG_TOKENS" -r "$REPS_STANDARD")
-if "${CMD[@]}" > "$OUT" 2>&1; then
-log_success "Done"; tail -3 "$OUT"
-else
-log_error "Failed"; echo "FAILED" >> "$OUT"
-fi
-fi
-# Long-context test
-if $SKIP_LONGCTX; then
-continue
-fi
-OUT_LC="$RESULT_DIR/${MODEL_NAME}__${BACKEND_SAFE}__fa1__longctx${CTX_DEPTH}.log"
-if [[ ! -s "$OUT_LC" ]]; then
-printf "\n${BOLD}>> [%s] %s — longctx %s${RESET}\n" "$BACKEND" "$MODEL_NAME" "$CTX_DEPTH"
-UB_SIZE=2048; [[ "$BACKEND" == *vulkan* ]] && UB_SIZE=512
-CMD_LC=(toolbox run -c "$BACKEND" -- "${ENV_ARGS[@]}" "$BENCH_BIN"
--ngl 99 -mmp 0 -m "$TOOLBOX_MODEL_PATH" -fa 1
--p "$CTX_PROMPT" -n 32 -d "$CTX_DEPTH" -ub "$UB_SIZE" -r "$REPS_LONGCTX")
-if "${CMD_LC[@]}" > "$OUT_LC" 2>&1; then
-log_success "Done"; tail -3 "$OUT_LC"
-else
-log_error "Failed"; echo "FAILED" >> "$OUT_LC"
-fi
-fi
+for KV_SPEC in "${KV_TYPES[@]}"; do
+# Parse KV spec: "q8_0" → K=q8_0,V=q8_0 or "q4_0:q8_0" → K=q4_0,V=q8_0
+if [[ "$KV_SPEC" == *:* ]]; then
+KV_K="${KV_SPEC%%:*}"
+KV_V="${KV_SPEC##*:}"
+else
+KV_K="$KV_SPEC"
+KV_V="$KV_SPEC"
+fi
+# Build KV cache args (skip for f16 — it's the default)
+KV_ARGS=()
+KV_SUFFIX=""
+if [[ "$KV_K" != "f16" || "$KV_V" != "f16" ]]; then
+KV_ARGS+=(-ctk "$KV_K" -ctv "$KV_V")
+KV_SUFFIX="__kv_${KV_K}-${KV_V}"
+fi
+# Build batch size args
+BATCH_ARGS=()
+BATCH_SUFFIX=""
+if [[ -n "$BATCH_SIZE" ]]; then
+BATCH_ARGS+=(-b "$BATCH_SIZE")
+BATCH_SUFFIX="__b${BATCH_SIZE}"
+fi
+# Standard test
+local_suffix="fa1"
+[[ "$PP_TOKENS" != "512" || "$TG_TOKENS" != "128" ]] && local_suffix="fa1__pp${PP_TOKENS}_tg${TG_TOKENS}"
+OUT="$RESULT_DIR/${MODEL_NAME}__${BACKEND_SAFE}__${local_suffix}${KV_SUFFIX}${BATCH_SUFFIX}.log"
+if [[ ! -s "$OUT" ]]; then
+printf "\n${BOLD}>> [%s] %s — pp%s/tg%s KV=%s${RESET}\n" \
+"$BACKEND" "$MODEL_NAME" "$PP_TOKENS" "$TG_TOKENS" "${KV_K}/${KV_V}"
+CMD=(toolbox run -c "$BACKEND" -- "${ENV_ARGS[@]}" "$BENCH_BIN"
+-ngl 99 -mmp 0 -m "$TOOLBOX_MODEL_PATH" -fa 1
+-p "$PP_TOKENS" -n "$TG_TOKENS" -r "$REPS_STANDARD" "${BATCH_ARGS[@]}" "${KV_ARGS[@]}")
+if "${CMD[@]}" > "$OUT" 2>&1; then
+log_success "Done"; tail -3 "$OUT"
+else
+log_error "Failed"; echo "FAILED" >> "$OUT"
+fi
+fi
+# Long-context test
+if $SKIP_LONGCTX; then
+continue
+fi
+OUT_LC="$RESULT_DIR/${MODEL_NAME}__${BACKEND_SAFE}__fa1__longctx${CTX_DEPTH}${KV_SUFFIX}${BATCH_SUFFIX}.log"
+if [[ ! -s "$OUT_LC" ]]; then
+printf "\n${BOLD}>> [%s] %s — longctx %s KV=%s${RESET}\n" \
+"$BACKEND" "$MODEL_NAME" "$CTX_DEPTH" "${KV_K}/${KV_V}"
+UB_SIZE=2048; [[ "$BACKEND" == *vulkan* ]] && UB_SIZE=512
+CMD_LC=(toolbox run -c "$BACKEND" -- "${ENV_ARGS[@]}" "$BENCH_BIN"
+-ngl 99 -mmp 0 -m "$TOOLBOX_MODEL_PATH" -fa 1
+-p "$CTX_PROMPT" -n "$TG_TOKENS" -d "$CTX_DEPTH" -ub "$UB_SIZE" -r "$REPS_LONGCTX" "${BATCH_ARGS[@]}" "${KV_ARGS[@]}")
+if "${CMD_LC[@]}" > "$OUT_LC" 2>&1; then
+log_success "Done"; tail -3 "$OUT_LC"
+else
+log_error "Failed"; echo "FAILED" >> "$OUT_LC"
+fi
+fi
+done # KV_TYPES
done
done
@@ -226,6 +278,11 @@ for logfile in sorted(result_dir.glob("*.log")):
content = logfile.read_text()
if "FAILED" in content:
continue
# Extract KV cache type from filename (__kv_q8_0-q8_0, tolerating a __bN batch suffix)
kv_match = re.search(r'__kv_(.+?)-(.+?)(?:__b\d+)?\.log$', logfile.name)
kv_type = f"{kv_match.group(1)}/{kv_match.group(2)}" if kv_match else "f16/f16"
for line in content.splitlines():
line = line.strip()
if not line.startswith("|") or ("model" in line.lower() and "size" in line.lower()):
@@ -233,21 +290,25 @@ for logfile in sorted(result_dir.glob("*.log")):
if "---" in line:
continue
parts = [p.strip() for p in line.split("|")]
-if len(parts) < 10:
+# Filter out empty parts from leading/trailing pipes
+data = [p for p in parts if p and "---" not in p]
+if len(data) < 6:
continue
try:
-test_type = parts[8].strip()
-ts_raw = parts[9].strip()
+# test and t/s are always the last two columns
+test_type = data[-2]
+ts_raw = data[-1]
ts_match = re.match(r'([\d.]+)', ts_raw)
if not ts_match:
continue
results.append({
"file": logfile.name,
-"model": parts[1].strip(),
-"size": parts[2].strip(),
-"backend": parts[4].strip(),
+"model": data[0],
+"size": data[1],
+"backend": data[3],
"test": test_type,
"tokens_per_sec": float(ts_match.group(1)),
"kv_cache": kv_type,
"raw": ts_raw,
})
except (ValueError, IndexError):
@@ -264,11 +325,14 @@ with open(sys.argv[1]) as f:
if not data["results"]:
print(" No results parsed.")
sys.exit(0)
-fmt = " {:<20} {:<16} {:<8} {:>10}"
-print(fmt.format("Model", "Backend", "Test", "t/s"))
-print(" " + "-" * 58)
+fmt = " {:<20} {:<16} {:<10} {:<8} {:>10}"
+print(fmt.format("Model", "Backend", "KV cache", "Test", "t/s"))
+print(" " + "-" * 68)
for r in data["results"]:
-print(fmt.format(r["model"][:20], r["backend"][:16], r["test"], f"{r['tokens_per_sec']:.2f}"))
+print(fmt.format(
+r["model"][:20], r["backend"][:16],
+r.get("kv_cache", "f16/f16")[:10], r["test"],
+f"{r['tokens_per_sec']:.2f}"))
PYEOF
echo ""

View File

@@ -15,8 +15,8 @@ log_header "Benchmark Setup"
# ── 1. Check toolbox containers ──────────────────────────
log_info "Checking toolbox containers..."
-REQUIRED_TOOLBOXES=("llama-vulkan-radv" "llama-rocm-7.2")
-OPTIONAL_TOOLBOXES=("llama-rocm-6.4.4" "llama-vulkan-amdvlk")
+REQUIRED_TOOLBOXES=("llama-vulkan-radv" "llama-rocm-7.2.1")
+OPTIONAL_TOOLBOXES=("llama-rocm-7.2" "llama-rocm-6.4.4" "llama-vulkan-amdvlk")
existing=$(detect_toolbox_names 2>/dev/null || true)
missing=()

View File

@@ -0,0 +1,111 @@
#!/usr/bin/env bash
# Apply power profile and system tuning for LLM inference workloads
# Requires root. Settings are volatile — use the systemd service for persistence.
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/../../lib/common.sh"
source "$SCRIPT_DIR/../../lib/format.sh"
require_root
# ── Power limits via ryzenadj ─────────────────────────────
STAPM=85000
FAST=85000
SLOW=85000
APU_SLOW=85000
if is_cmd ryzenadj; then
log_header "Power Profile (ryzenadj)"
log_info "Setting STAPM=${STAPM}mW, Fast=${FAST}mW, Slow=${SLOW}mW, APU=${APU_SLOW}mW"
ryzenadj \
--stapm-limit=$STAPM \
--fast-limit=$FAST \
--slow-limit=$SLOW \
--apu-slow-limit=$APU_SLOW 2>&1 | grep -E 'Successfully|Error|not supported' || true
# Verify what actually took effect
log_info "Verifying limits..."
ryzenadj -i 2>&1 | grep -E 'LIMIT|VALUE' | head -8
echo ""
log_warn "Note: HP firmware may cap PPT SLOW/APU at 70W regardless of setting"
else
log_error "ryzenadj not found. Install: cd /tmp && git clone https://github.com/FlyGoat/RyzenAdj.git && cd RyzenAdj && mkdir build && cd build && cmake .. && make && sudo cp ryzenadj /usr/local/bin/"
exit 1
fi
# ── VM sysctl tuning ──────────────────────────────────────
log_header "VM Sysctl Tuning"
declare -A SYSCTLS=(
[vm.swappiness]=1
[vm.dirty_ratio]=40
[vm.dirty_background_ratio]=10
[vm.max_map_count]=500000
[vm.zone_reclaim_mode]=0
)
for KEY in "${!SYSCTLS[@]}"; do
VAL="${SYSCTLS[$KEY]}"
CURRENT=$(sysctl -n "$KEY" 2>/dev/null || echo "?")
if [[ "$CURRENT" == "$VAL" ]]; then
log_success "$KEY = $VAL (already set)"
else
sysctl -w "$KEY=$VAL" > /dev/null 2>&1
log_success "$KEY = $VAL (was $CURRENT)"
fi
done
# Persist sysctl settings
SYSCTL_CONF="/etc/sysctl.d/99-llm-inference.conf"
if [[ ! -f "$SYSCTL_CONF" ]]; then
log_info "Persisting to $SYSCTL_CONF"
cat > "$SYSCTL_CONF" << 'EOF'
# LLM inference optimizations
vm.swappiness = 1
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10
vm.max_map_count = 500000
vm.zone_reclaim_mode = 0
EOF
log_success "Sysctl config saved (persists across reboots)"
else
log_info "Sysctl config already exists at $SYSCTL_CONF"
fi
# ── Transparent Huge Pages ────────────────────────────────
log_header "Transparent Huge Pages"
THP_ENABLED=$(cat /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null || echo "unknown")
if [[ "$THP_ENABLED" == *"[always]"* ]]; then
log_success "THP = always (already set)"
else
echo always > /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null || true
echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag 2>/dev/null || true
log_success "THP = always, defrag = defer+madvise"
fi
log_info "For persistence, add to kernel cmdline: transparent_hugepage=always"
# ── RADV nogttspill ───────────────────────────────────────
log_header "Vulkan RADV Environment"
RADV_CONF="/etc/environment.d/radv-llm.conf"
if [[ ! -f "$RADV_CONF" ]]; then
mkdir -p /etc/environment.d
echo 'RADV_PERFTEST=nogttspill' > "$RADV_CONF"
log_success "RADV_PERFTEST=nogttspill persisted to $RADV_CONF"
log_info "Takes effect on next login. For this session: export RADV_PERFTEST=nogttspill"
else
log_success "RADV config already exists at $RADV_CONF"
fi
# ── Summary ───────────────────────────────────────────────
log_header "Phase 2 Optimization Summary"
log_success "Power profile: ryzenadj limits applied (volatile — resets on reboot)"
log_success "VM tuning: sysctl applied and persisted"
log_success "THP: enabled (volatile — add to kernel cmdline for persistence)"
log_success "RADV: nogttspill persisted"
echo ""
log_info "To persist ryzenadj across reboots:"
log_info " sudo cp $SCRIPT_DIR/../../configs/ryzenadj-llm.service /etc/systemd/system/"
log_info " sudo systemctl enable --now ryzenadj-llm.service"
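The script points at a `ryzenadj-llm.service` unit for persistence. A hypothetical sketch of what such a oneshot unit could look like, written to /tmp here; the ExecStart path and limit values are assumptions mirroring this script, not the repo's actual file:

```shell
# Write a sample oneshot unit that re-applies the power limits at boot.
cat > /tmp/ryzenadj-llm.service <<'EOF'
[Unit]
Description=Apply ryzenadj power limits for LLM inference
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/ryzenadj --stapm-limit=85000 --fast-limit=85000 --slow-limit=85000 --apu-slow-limit=85000

[Install]
WantedBy=multi-user.target
EOF
```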

scripts/serve/launch.sh Executable file (+146 lines)
View File

@@ -0,0 +1,146 @@
#!/usr/bin/env bash
# Launch llama-server with optimized settings for Strix Halo
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/../../lib/common.sh"
source "$SCRIPT_DIR/../../lib/detect.sh"
source "$SCRIPT_DIR/../../lib/format.sh"
MODEL_DIR="$(data_dir models)"
BACKEND="llama-vulkan-radv"
PORT=8080
CTX_SIZE=131072
PARALLEL=1
MODEL=""
NGRAM=false
NO_THINK=false
while [[ $# -gt 0 ]]; do
case "$1" in
-m|--model) MODEL="$2"; shift 2 ;;
--backend) BACKEND="$2"; shift 2 ;;
--port) PORT="$2"; shift 2 ;;
--ctx) CTX_SIZE="$2"; shift 2 ;;
--parallel) PARALLEL="$2"; shift 2 ;;
--ngram) NGRAM=true; shift ;;
--no-think) NO_THINK=true; shift ;;
--help|-h)
echo "Usage: launch.sh [OPTIONS]"
echo ""
echo "Options:"
echo " -m, --model FILE GGUF model filename (searches data/models/)"
echo " --backend NAME Toolbox backend (default: llama-vulkan-radv)"
echo " --port N Listen port (default: 8080)"
echo " --ctx N Context size (default: 131072)"
echo " --parallel N Parallel request slots (default: 1)"
echo " --ngram Enable n-gram speculative decoding (~1.1-1.4x tg)"
echo " --no-think Disable thinking/reasoning (faster for evals)"
echo ""
echo "Presets (pass model filename):"
echo " Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf General purpose daily driver"
echo " Qwen3-Coder-30B-A3B-Instruct-UD-Q6_K_XL.gguf Agentic coding"
echo " Qwen3-Coder-Next-UD-Q3_K_XL.gguf Complex SE tasks"
echo ""
echo "Examples:"
echo " launch.sh -m Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf"
echo " launch.sh -m Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf --ngram --ctx 262144"
echo " launch.sh -m Qwen3-Coder-30B-A3B-Instruct-UD-Q6_K_XL.gguf --parallel 2"
exit 0 ;;
*) log_warn "Unknown argument: $1"; shift ;;
esac
done
if [[ -z "$MODEL" ]]; then
log_error "No model specified. Use -m MODEL_FILENAME"
echo ""
echo "Available models:"
find -L "$MODEL_DIR" -type f -name '*.gguf' -not -name 'mmproj-*' \
-not -name '*-000*-of-*.gguf' -printf ' %f\n' 2>/dev/null | sort
exit 1
fi
# Find model file
MODEL_PATH="$(find -L "$MODEL_DIR" -type f -name "$MODEL" -print -quit 2>/dev/null)"
if [[ -z "$MODEL_PATH" ]]; then
log_error "Model not found: $MODEL"
exit 1
fi
# Resolve for toolbox
TOOLBOX_MODEL_PATH="$(realpath "$MODEL_PATH")"
if [[ "$TOOLBOX_MODEL_PATH" != /home/* ]]; then
TOOLBOX_MODEL_PATH="/run/host${TOOLBOX_MODEL_PATH}"
fi
# Backend-specific settings
declare -A SERVER_PATHS=(
[llama-vulkan-radv]="/usr/sbin/llama-server"
[llama-vulkan-amdvlk]="/usr/sbin/llama-server"
[llama-rocm-6.4.4]="/usr/local/bin/llama-server"
[llama-rocm-7.2]="/usr/local/bin/llama-server"
[llama-rocm-7.2.1]="/usr/local/bin/llama-server"
[llama-rocm7-nightlies]="/usr/local/bin/llama-server"
)
SERVER_BIN="${SERVER_PATHS[$BACKEND]:-}"
if [[ -z "$SERVER_BIN" ]]; then
log_error "Unknown backend: $BACKEND"
exit 1
fi
# Check toolbox exists
if ! toolbox list 2>/dev/null | grep -q "$BACKEND"; then
log_error "Toolbox not found: $BACKEND"
exit 1
fi
# Build environment args
ENV_ARGS=()
if [[ "$BACKEND" == *rocm* ]]; then
ENV_ARGS=(env ROCBLAS_USE_HIPBLASLT=1)
fi
# Build server args
SERVER_ARGS=(
-ngl 99 # Full GPU offload
--no-mmap # Direct load, no mmap overhead
-fa on # Flash attention
-m "$TOOLBOX_MODEL_PATH"
-c "$CTX_SIZE" # Context size
--cache-type-k q4_0 # KV cache quantization (fastest on Vulkan)
--cache-type-v q4_0
--port "$PORT"
-np "$PARALLEL" # Parallel slots
)
# Disable thinking mode (faster for evals)
if $NO_THINK; then
SERVER_ARGS+=(--reasoning-budget 0)
fi
# N-gram speculative decoding
if $NGRAM; then
SERVER_ARGS+=(
--spec-type ngram-simple
--draft-max 64
--draft-min 4
)
fi
# Display config
log_header "llama-server"
log_info "Model: $(basename "$MODEL_PATH") ($(du -h "$MODEL_PATH" | cut -f1))"
log_info "Backend: $BACKEND"
log_info "Context: $CTX_SIZE tokens"
log_info "KV cache: q4_0/q4_0"
log_info "Parallel slots: $PARALLEL"
$NO_THINK && log_info "Thinking mode: DISABLED (--reasoning-budget 0)"
$NGRAM && log_info "N-gram speculative: enabled (draft-max=64)"
log_info "Port: $PORT"
log_info "Endpoint: http://localhost:$PORT"
echo ""
log_info "Starting server..."
# Launch
exec toolbox run -c "$BACKEND" -- "${ENV_ARGS[@]}" "$SERVER_BIN" "${SERVER_ARGS[@]}"
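The launcher builds its command line as a bash array and hands it to `exec` with `"${SERVER_ARGS[@]}"`. Quoted `[@]` expansion keeps each element as exactly one argument, even when a value contains spaces, which an unquoted string would word-split. A small demonstration with an illustrative filename:

```shell
# Each array element survives expansion as a single word.
ARGS=(-m "My Model.gguf" -c 131072)
printf '<%s>\n' "${ARGS[@]}"
# -> <-m>
#    <My Model.gguf>
#    <-c>
#    <131072>
```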

View File

@@ -10,6 +10,8 @@ load test_helper.sh
assert_output --partial "--max-size"
assert_output --partial "--category"
assert_output --partial "--skip-longctx"
assert_output --partial "--kv-types"
assert_output --partial "--batch"
}
@test "run-suite --help shows usage and exits 0" {
@@ -20,6 +22,8 @@ load test_helper.sh
assert_output --partial "--category"
assert_output --partial "--skip-longctx"
assert_output --partial "--tag"
assert_output --partial "--kv-types"
assert_output --partial "--batch"
}
@test "benchmark dispatcher shows help with no args" {
@@ -28,6 +32,19 @@ load test_helper.sh
assert_output --partial "Commands"
assert_output --partial "--max-size"
assert_output --partial "--skip-longctx"
assert_output --partial "--kv-types"
assert_output --partial "--batch"
}
@test "serve --help shows usage and exits 0" {
run bash "$PROJECT_ROOT/bin/serve" --help
assert_success
assert_output --partial "Usage"
assert_output --partial "--model"
assert_output --partial "--ngram"
assert_output --partial "--no-think"
assert_output --partial "--ctx"
assert_output --partial "--port"
}
@test "benchmark dispatcher passes --help through to baseline" {