strix-halo-optimizations

cardosofelipe/strix-halo-optimizations

Fork 0

Commit Graph

Author	SHA1	Message	Date
Felipe Cardoso	6ab08537ca	fix: address code review findings — batch args, venv path, serve flags - Fix missing BATCH_ARGS in long-context commands (both benchmark scripts) - Fix CLAUDE.md stale venv path (data/venv → .venv) and add serve/power docs - Add -b/--batch to bin/benchmark help text - Add --no-think flag to serve script (--reasoning-budget 0) - Sanitize model names in eval run directories - Simplify agentic setup to use requirements.txt - Add serve --help test, batch flag assertions to existing tests - Add requirements.txt for reproducible venv setup (Python 3.13)	2026-03-31 10:10:48 +02:00
Felipe Cardoso	dd403a907c	feat(serve): add optimized llama-server launcher with n-gram speculation Add `make serve` and `make serve-ngram` for launching llama-server with baked-in optimal settings (Vulkan RADV, q4_0 KV cache, flash attention, no-mmap, full GPU offload). N-gram speculative decoding gives 1.1-1.4x tg speedup on repetitive content without upstream PR dependencies. Update Phase 5 status: MTP is months away (4 unmerged PRs, no MoE support), draft-model speculation stalled on ROCm buffer crashes.	2026-03-30 21:12:30 +02:00

Author

SHA1

Message

Date

Felipe Cardoso

6ab08537ca

fix: address code review findings — batch args, venv path, serve flags

- Fix missing BATCH_ARGS in long-context commands (both benchmark scripts)
- Fix CLAUDE.md stale venv path (data/venv → .venv) and add serve/power docs
- Add -b/--batch to bin/benchmark help text
- Add --no-think flag to serve script (--reasoning-budget 0)
- Sanitize model names in eval run directories
- Simplify agentic setup to use requirements.txt
- Add serve --help test, batch flag assertions to existing tests
- Add requirements.txt for reproducible venv setup (Python 3.13)

2026-03-31 10:10:48 +02:00

Felipe Cardoso

dd403a907c

feat(serve): add optimized llama-server launcher with n-gram speculation

Add `make serve` and `make serve-ngram` for launching llama-server with
baked-in optimal settings (Vulkan RADV, q4_0 KV cache, flash attention,
no-mmap, full GPU offload). N-gram speculative decoding gives 1.1-1.4x
tg speedup on repetitive content without upstream PR dependencies.
Update Phase 5 status: MTP is months away (4 unmerged PRs, no MoE
support), draft-model speculation stalled on ROCm buffer crashes.

2026-03-30 21:12:30 +02:00

2 Commits