feat(serve): set APEX I-Compact as default, harden benchmark workflow

Serving:
- make serve now launches Claude-distilled APEX 35B-A3B (16GB) with 2
  parallel slots and 256K context as the daily driver
- add serve-custom for ad-hoc model testing
- add flush-gpu to reclaim unified memory after stuck runs

Benchmarks:
- default Vulkan-only backends (ROCm trails at long context)
- add --backends filter to run-baseline.sh
- fix backend filter substring bug (grep -qFx for exact line match)
- fix model filter regex metacharacter bug (grep -qiF for literal)
- respect --tg in long-context tests instead of hardcoded n=32

ROCm bump to 7.2.1 (kernel 6.18.4+ patch); keep 7.2 as optional.

Catalog:
- add mudler APEX I-Compact (Claude-distilled 35B, 17GB)
- add 0xSero REAP-40 (pruned 122B-A10B, 46GB)
- update download instructions: hf download (huggingface-cli is gone)
This commit is contained in:
Felipe Cardoso
2026-04-13 01:11:46 +02:00
parent 474d94a07e
commit 15bb6a8ed9
7 changed files with 36 additions and 12 deletions

View File

@@ -23,12 +23,14 @@ PP_TOKENS=512
TG_TOKENS=128
BATCH_SIZE="" # Batch size override (-b flag, empty = llama-bench default 2048)
KV_TYPES_RAW="" # Comma-separated KV cache types to sweep (e.g. f16,q8_0,q4_0 or q4_0:q8_0)
BACKENDS_FILTER="llama-vulkan-radv" # Default to Vulkan; use --backends to override
while [[ $# -gt 0 ]]; do
case "$1" in
--skip-longctx) SKIP_LONGCTX=true; shift ;;
--max-size|-s) MAX_SIZE_GB="$2"; shift 2 ;;
--category|-c) CATEGORY_FILTER="$2"; shift 2 ;;
--backends) BACKENDS_FILTER="$2"; shift 2 ;;
--reps|-r) REPS_STANDARD="$2"; shift 2 ;;
--context|-d) CTX_DEPTH="$2"; shift 2 ;;
--pp) PP_TOKENS="$2"; shift 2 ;;
@@ -42,6 +44,7 @@ while [[ $# -gt 0 ]]; do
echo " --skip-longctx Skip long-context tests"
echo " --max-size GB Only bench models up to this file size in GB"
echo " --category LIST Comma-separated: smoke,dense,moe (from models.conf)"
echo " --backends LIST Comma-separated backends (default: llama-vulkan-radv)"
echo " --reps N Standard test repetitions (default: 5)"
echo " --context N Long-context depth in tokens (default: 32768)"
echo " --pp N Prompt processing tokens (default: 512)"
@@ -94,14 +97,17 @@ declare -A BENCH_PATHS=(
[llama-vulkan-amdvlk]="/usr/sbin/llama-bench"
[llama-rocm-6.4.4]="/usr/local/bin/llama-bench"
[llama-rocm-7.2]="/usr/local/bin/llama-bench"
[llama-rocm-7.2.1]="/usr/local/bin/llama-bench"
[llama-rocm7-nightlies]="/usr/local/bin/llama-bench"
)
available_backends=()
for tb in "${!BENCH_PATHS[@]}"; do
if echo "$existing" | grep -q "^${tb}$"; then
available_backends+=("$tb")
log_success "Backend: $tb"
if [[ -z "$BACKENDS_FILTER" ]] || echo "$BACKENDS_FILTER" | tr ',' '\n' | grep -qFx "$tb"; then
available_backends+=("$tb")
log_success "Backend: $tb"
fi
fi
done
@@ -269,7 +275,7 @@ for MODEL_PATH in "${MODEL_PATHS[@]}"; do
CMD_LC=(toolbox run -c "$BACKEND" -- "${ENV_ARGS[@]}" "$BENCH_BIN"
-ngl 99 -mmp 0 -m "$TOOLBOX_MODEL_PATH" -fa 1
-p "$CTX_PROMPT" -n 32 -d "$CTX_DEPTH" -ub "$UB_SIZE"
-p "$CTX_PROMPT" -n "$TG_TOKENS" -d "$CTX_DEPTH" -ub "$UB_SIZE"
-r "$REPS_LONGCTX" "${BATCH_ARGS[@]}" "${KV_ARGS[@]}")
printf " cmd: %s\n" "${CMD_LC[*]}"