strix-halo-optimizations

Author	SHA1	Message	Date
Felipe Cardoso	751180fdc1	feat(serve): upgrade daily driver to qwen3.6-35b-a3b q6_k_xl Switch `make serve` default to Qwen3.6 UD Q6_K_XL (32 GB, hybrid DeltaNet, near-lossless) and register it in the model catalog. Add --jinja to the llama-server launcher so tool/function calling works — without it clients silently ignore tool definitions advertised by the server.	2026-04-26 20:06:18 +02:00
Felipe Cardoso	15bb6a8ed9	feat(serve): set APEX I-Compact as default, harden benchmark workflow. Serving: make serve now launches Claude-distilled APEX 35B-A3B (16GB) with 2 parallel slots and 256K context as the daily driver; add serve-custom for ad-hoc model testing; add flush-gpu to reclaim unified memory after stuck runs. Benchmarks: default to Vulkan-only backends (ROCm trails at long context); add --backends filter to run-baseline.sh; fix backend filter substring bug (grep -qFx for exact line match); fix model filter regex metacharacter bug (grep -qiF for literal match); respect --tg in long-context tests instead of hardcoded n=32. ROCm: bump to 7.2.1 (kernel 6.18.4+ patch); keep 7.2 as optional. Catalog: add mudler APEX I-Compact (Claude-distilled 35B, 17GB); add 0xSero REAP-40 (pruned 122B-A10B, 46GB); update download instructions to hf download (huggingface-cli is gone).	2026-04-13 01:11:46 +02:00
Felipe Cardoso	6ab08537ca	fix: address code review findings (batch args, venv path, serve flags). Fix missing BATCH_ARGS in long-context commands (both benchmark scripts); fix CLAUDE.md stale venv path (data/venv → .venv) and add serve/power docs; add -b/--batch to bin/benchmark help text; add --no-think flag to serve script (--reasoning-budget 0); sanitize model names in eval run directories; simplify agentic setup to use requirements.txt; add serve --help test and batch flag assertions to existing tests; add requirements.txt for reproducible venv setup (Python 3.13).	2026-03-31 10:10:48 +02:00
Felipe Cardoso	dd403a907c	feat(serve): add optimized llama-server launcher with n-gram speculation Add `make serve` and `make serve-ngram` for launching llama-server with baked-in optimal settings (Vulkan RADV, q4_0 KV cache, flash attention, no-mmap, full GPU offload). N-gram speculative decoding gives 1.1-1.4x tg speedup on repetitive content without upstream PR dependencies. Update Phase 5 status: MTP is months away (4 unmerged PRs, no MoE support), draft-model speculation stalled on ROCm buffer crashes.	2026-03-30 21:12:30 +02:00
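
The launcher settings named in the commits above (full GPU offload, no-mmap, q4_0 KV cache, flash attention, `--jinja`, 2 parallel slots, 256K context, and the `--no-think` mapping to `--reasoning-budget 0`) could be assembled roughly as follows. This is only a sketch, not the repository's actual script: the function name `build_serve_args`, the model path, and the `-ngl 99` value are assumptions, and the `--flash-attn on` spelling assumes a recent llama.cpp build (older builds take a bare `-fa`). Vulkan RADV selection and n-gram speculation are not covered here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints llama-server arguments, one per line.
# $1 = path to a GGUF model file (assumed)
# $2 = 1 to disable reasoning, mirroring the --no-think option in the log
build_serve_args() {
  local model="$1"
  local no_think="${2:-0}"
  local args=(
    -m "$model"
    -ngl 99               # assumed value: offload all layers to the GPU
    --no-mmap             # load weights into memory instead of mmap-ing them
    --flash-attn on       # assumes a recent build; older ones use bare -fa
    -ctk q4_0 -ctv q4_0   # q4_0-quantized KV cache (keys and values)
    -c 262144             # 256K-token context
    -np 2                 # 2 parallel slots
    --jinja               # chat-template processing needed for tool calling
  )
  if [ "$no_think" = "1" ]; then
    args+=(--reasoning-budget 0)
  fi
  printf '%s\n' "${args[@]}"
}

# Usage (assumes llama-server is on PATH and bash 4+ for mapfile):
#   mapfile -t serve_args < <(build_serve_args /path/to/model.gguf)
#   llama-server "${serve_args[@]}"
```

Building the arguments in an array keeps each flag individually quoted, so a model path containing spaces survives the hand-off to `llama-server`.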

4 Commits
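
The two filter fixes mentioned in commit 15bb6a8ed9 (`grep -qFx` for exact line match, `grep -qiF` for literal match) can be illustrated with a small sketch. The variable and function names, the backend names, and the direction of the match (backend name checked against a requested list) are assumptions for illustration; the actual logic in run-baseline.sh may differ.

```shell
#!/usr/bin/env bash
# Hypothetical list of backends the user asked for, one per line.
requested=$'vulkan-radv\nrocm-7.2.1'

# Substring (regex) match: "rocm-7.2" is wrongly accepted because it
# appears inside the requested line "rocm-7.2.1".
is_requested_substring() { grep -q -- "$1" <<<"$requested"; }

# Exact match: -F treats the name literally, -x requires a whole-line match.
is_requested_exact() { grep -qFx -- "$1" <<<"$requested"; }

# Model filter as a regex: "." matches any character, so the filter
# "qwen3.6" also matches the unrelated name "qwen3x6".
model_matches_regex() { grep -qi -- "$2" <<<"$1"; }

# Model filter as a literal, case-insensitive string (-i -F).
model_matches_literal() { grep -qiF -- "$2" <<<"$1"; }

# Usage:
#   is_requested_exact "rocm-7.2" || echo "rocm-7.2 skipped"
#   model_matches_literal "Qwen3.6-35B-A3B-UD-Q6_K_XL.gguf" "qwen3.6" && echo "selected"
```

The `--` before the pattern guards against filter values that begin with a dash being parsed as grep options.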