feat(serve): add optimized llama-server launcher with n-gram speculation
Add `make serve` and `make serve-ngram` for launching llama-server with baked-in optimal settings (Vulkan RADV, q4_0 KV cache, flash attention, no-mmap, full GPU offload). N-gram speculative decoding gives a 1.1-1.4x text-generation (tg) speedup on repetitive content without depending on unmerged upstream PRs.

Update Phase 5 status: MTP is months away (4 unmerged PRs, no MoE support), and draft-model speculation is stalled on ROCm buffer crashes.
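As a rough illustration of what the `serve` target ends up running, here is a hedged sketch of a `bin/serve`-style wrapper. The function name `serve_cmd` and the model path are placeholders, and the flag set is reconstructed from the settings named above (`-ngl`, `--flash-attn`, `--no-mmap`, `--cache-type-k/v` are standard llama-server flags, though exact spellings can vary across builds):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a bin/serve-style wrapper (not the repo's actual
# script). Flags mirror the commit message: full GPU offload, flash
# attention, no mmap, q4_0-quantized KV cache.
serve_cmd() {
  local model="${1:-model.gguf}"   # placeholder model path
  echo llama-server \
    -m "$model" \
    -ngl 99 \
    --flash-attn \
    --no-mmap \
    --cache-type-k q4_0 \
    --cache-type-v q4_0
}

# A real wrapper would `exec` this; here we just print the command line.
serve_cmd my-model.gguf
```

Device selection for Vulkan RADV is left out here, since it is typically handled by the build/driver environment rather than a server flag.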
Makefile | 7 +++++++
1 file changed, 7 insertions(+)
@@ -38,6 +38,13 @@ benchmark: ## Run full benchmark suite (supports ARGS="--tag NAME --max-size 20"
 benchmark-compare: ## Compare two benchmark runs (usage: make benchmark-compare BEFORE=dir AFTER=dir)
 	@bash bin/benchmark compare $(BEFORE) $(AFTER)
 
+# --- Serve ---
+serve: ## Launch llama-server with optimized settings (ARGS="-m MODEL.gguf")
+	@bash bin/serve $(ARGS)
+
+serve-ngram: ## Launch with n-gram speculative decoding (ARGS="-m MODEL.gguf")
+	@bash bin/serve --ngram $(ARGS)
+
 # --- Optimize ---
 optimize: ## Interactive optimization walkthrough
 	@bash bin/optimize --all
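The `serve-ngram` target relies on n-gram lookup speculation: when the most recent N tokens already appeared earlier in the context, the tokens that followed that earlier occurrence are proposed as a cheap draft, which is why the speedup shows up on repetitive content. A toy sketch of the lookup step (illustrative only, not llama.cpp's implementation; `ngram_draft`, `n`, and `max_draft` are made-up names):

```shell
#!/usr/bin/env bash
# Toy n-gram lookup: search backwards for a previous occurrence of the
# last n tokens and emit the up-to-max_draft tokens that followed it.
ngram_draft() {
  local n=3 max_draft=4
  local -a toks=("$@")
  local len=${#toks[@]}
  (( len < n + 1 )) && return 0

  local key="${toks[*]:len-n:n}"            # the most recent n-gram
  local i
  for (( i = len - n - 1; i >= 0; i-- )); do  # skip the suffix itself
    if [[ "${toks[*]:i:n}" == "$key" ]]; then
      echo "${toks[*]:i+n:max_draft}"        # draft = what followed the match
      return 0
    fi
  done
}

ngram_draft 1 2 3 4 5 1 2 3   # → 4 5 1 2
```

Drafted tokens are then verified in a single batched forward pass; accepted tokens come essentially for free, and a miss costs only the wasted draft.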