Commit Graph

12 Commits

Author SHA1 Message Date
Felipe Cardoso
d22c062ca7 fix: model catalog shows download status, GPU detection in toolbox
- Catalog * indicator now searches recursively (finds models in subdirs)
- GPU verification suppresses toolbox crun stderr (directory not found noise)
- Matches on "radeon" and "available devices" for Vulkan output

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 19:14:31 +01:00
Felipe Cardoso
6f197a1455 fix: pass ARGS through in benchmark Makefile targets
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 19:10:59 +01:00
Felipe Cardoso
cb25fa3f6f feat: add benchmark filtering (--max-size, --category, --skip-longctx)
Both run-baseline.sh and run-suite.sh now support:
- --max-size GB: skip models larger than N GB (prevents OOM)
- --category LIST: filter by catalog category (smoke,dense,moe)
- --skip-longctx: skip 32K context tests (saves time + memory)
- --reps N: configure repetition count
- --help: shows usage with examples
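
The size and category filters listed above can be sketched as follows. This is a minimal illustration in Python, not the shell logic in run-baseline.sh or run-suite.sh; the `Model` record, its field names, and `filter_models` are hypothetical, and the real layout of configs/models.conf is not shown in this log.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class Model:
    # Hypothetical catalog record; the field names are assumptions.
    name: str
    size_gb: float
    category: str


def filter_models(
    models: Iterable[Model],
    max_size_gb: Optional[float] = None,
    categories: Optional[Set[str]] = None,
) -> List[Model]:
    """Keep models that pass both optional filters."""
    selected = []
    for model in models:
        # --max-size: skip models larger than the limit.
        if max_size_gb is not None and model.size_gb > max_size_gb:
            continue
        # --category: keep only the listed categories.
        if categories is not None and model.category not in categories:
            continue
        selected.append(model)
    return selected


# Placeholder entries, not real catalog data.
catalog = [
    Model("model-a", 2.0, "smoke"),
    Model("model-b", 18.0, "dense"),
    Model("model-c", 40.0, "moe"),
]
small = filter_models(catalog, max_size_gb=20)
print([m.name for m in small])
```

Leaving both arguments at `None` returns every model, which corresponds to the unfiltered full run.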

Safe pre-optimization run: benchmark baseline --max-size 20 --skip-longctx
Full post-optimization: benchmark baseline (no filters, all models + longctx)

Also: 4 new BATS tests for flag parsing (98 total, all passing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 19:07:24 +01:00
Felipe Cardoso
eb52ea52ce fix: follow symlinks in model discovery, update model catalog
- Add -L flag to find in benchmark scripts (follows symlinks to /data/models/llms/)
- Exclude mmproj-*.gguf (vision projection files, not LLM models)
- Update configs/models.conf: remove Qwen3-Coder (user prefers Qwen3.5-35B-A3B),
  add Qwen3.5-27B-Q4_K_M and Q8_0 variant, reflect actual downloaded models
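
The discovery rules above (follow symlinks, skip mmproj-*.gguf) can be sketched as follows. The benchmark scripts use `find -L`; this Python version is only an illustration of the same rules, and the function name `discover_models` is hypothetical.

```python
import fnmatch
import os
from typing import List


def discover_models(root: str) -> List[str]:
    """Return sorted paths of *.gguf files under root, skipping mmproj files."""
    found = []
    # followlinks=True descends into symlinked directories, similar to
    # `find -L`. A symlink cycle would make this loop forever.
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=True):
        for name in filenames:
            if not fnmatch.fnmatch(name, "*.gguf"):
                continue
            # Vision projection files are not LLM models.
            if fnmatch.fnmatch(name, "mmproj-*.gguf"):
                continue
            found.append(os.path.join(dirpath, name))
    return sorted(found)
```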

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 09:44:16 +01:00
Felipe Cardoso
58124cd657 feat: add Qwen3.5 model catalog and agentic evaluation framework
Models:
- configs/models.conf: catalog with Qwen3.5-35B-A3B (MoE, top pick),
  Qwen3.5-27B (dense), Qwen3-Coder-30B-A3B (agentic/coding)
- Updated benchmark setup to show catalog with download status
- docs/model-recommendations.md: memory planning, quantization guide

Agentic evaluation:
- scripts/agentic/setup.sh: installs inspect-ai, evalplus, bigcodebench
  in a Python venv
- scripts/agentic/run-eval.sh: runs evaluations against local LLM server
  (ollama or llama.cpp). Suites: quick (HumanEval+IFEval), code
  (EvalPlus+BigCodeBench), tooluse (BFCL), full (all)
- bin/agentic: dispatcher with help
- docs/agentic-benchmarks.md: methodology, framework comparison, model
  recommendations for agentic use

Updated: Makefile (6 new targets), README, CLAUDE.md, docs/references.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 00:20:23 +01:00
Felipe Cardoso
71053997be chore: remove .idea from tracking, add to .gitignore
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 23:58:18 +01:00
Felipe Cardoso
e9cb5c491f fix+test: improve test suite, fix 2 bugs found by tests
Bugs fixed in production code:
- compare.sh: Python truthiness on 0.0 — `if b_val` was False for 0.0 t/s,
  displaying it as a dash instead of "0.0". Fixed with `is not None` checks.
- compare.sh: ZeroDivisionError when computing delta % with 0.0 baseline.
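
The two fixes above can be sketched as follows. This is a minimal illustration, not the code in compare.sh: the function names are hypothetical, and the "n/a" marker for a zero baseline is an assumption, since the log does not say how that case is displayed.

```python
from typing import Optional


def format_rate(value: Optional[float]) -> str:
    """Render a tokens/sec value; only a missing value becomes a dash."""
    # `if value:` would treat 0.0 as missing, so compare against None.
    return "-" if value is None else f"{value:.1f}"


def format_delta(baseline: Optional[float], current: Optional[float]) -> str:
    """Render the percentage change from baseline to current."""
    if baseline is None or current is None:
        return "-"
    if baseline == 0.0:
        # A percentage change is undefined for a zero baseline;
        # "n/a" is an assumed placeholder.
        return "n/a"
    pct = (current - baseline) / baseline * 100
    return "0.0%" if pct == 0 else f"{pct:+.1f}%"


print(format_rate(0.0))          # prints "0.0"
print(format_delta(10.0, 12.0))  # prints "+20.0%"
print(format_delta(0.0, 5.0))    # prints "n/a"
```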

Test improvements (review findings):
- detect.bats: kernel param tests now use real detect_kernel_param logic
  pattern (not a separate reimplementation). Added non-GiB-aligned RAM test,
  device ID without 0x prefix, empty firmware version, llama-bench detection,
  detect_total_physical_ram_kb tests.
- benchmark_compare.bats: assert delta percentages (+20.0%, -25.0%, 0.0%),
  test 0.0 t/s edge case, test per-directory error messages, test config
  change detection with specific field assertions.
- log_metrics.bats: add assert_success, --help test, timestamp format
  validation. Remove unused mock sysfs setup.
- common.bats: fix data_dir test, remove redundant assertion, add cleanup.
- test_helper.sh: remove unused FIXTURES_DIR.
- Remove empty tests/fixtures/ directory.

94 tests, all passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 22:22:41 +01:00
Felipe Cardoso
a403dd9ce0 test: add BATS test suite (79 tests)
- tests/common.bats: PROJECT_ROOT detection, is_cmd, timestamp, data_dir,
  logging functions, color handling, require_root
- tests/detect.bats: GPU sysfs reads with mock sysfs tree, kernel param
  parsing (word boundary, dot escaping, edge positions), recommended
  GTT/pages computation (64GB, 128GB, tiny, zero), bad-firmware detection,
  stack detection
- tests/format.bats: human_bytes (0, KiB, MiB, GiB boundaries, 64GiB),
  human_mib (sub-GiB, exact-GiB, recommended values, empty input)
- tests/benchmark_compare.bats: improvement/regression display, empty
  results, missing files, usage output, config change detection
- tests/log_metrics.bats: CSV header, data format, field count, input
  validation, unknown argument handling
- tests/test_helper.sh: mock sysfs tree builder, bats-assert/support setup

Makefile: add 'make test' target

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 22:15:34 +01:00
Felipe Cardoso
da2c4c6b8a fix(docs): address review findings — accuracy, consistency, completeness
- architecture.md: fix kernel param math to match actual computed values,
  use cardN placeholder in sysfs paths, clarify system_ram_kb is OS-visible
- benchmarking.md: normalize flags to -ngl 99 / -mmp 0 (matching code),
  add llama-rocm7-nightlies backend
- CLAUDE.md: clarify HSA_OVERRIDE_GFX_VERSION is set in containers not
  scripts, fix lib sourcing description, specify which scripts need root
- detect.sh: document detect_cpu_cores returns threads not cores
- troubleshooting.md: add link to references.md
- README.md: remove unsupported Fedora 42 claim, describe configs/ content

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 21:44:16 +01:00
Felipe Cardoso
5b81437637 docs: add README, CLAUDE.md, AGENTS.md, and full docs/ suite
- README.md: project overview, quick start, command reference, workflow
- CLAUDE.md: AI safety rules, technical details, conventions
- AGENTS.md: agent workflows, file responsibility map, dependency matrix
- docs/architecture.md: script layers, data flow, unified memory, JSON schemas
- docs/optimization.md: step-by-step optimization walkthrough
- docs/benchmarking.md: methodology, test params, result interpretation
- docs/troubleshooting.md: common issues and fixes
- docs/references.md: centralized external links (single source of truth)
- docs/bios-vram-guide.md: add back-link to optimization workflow

Cross-linked non-redundantly: each doc owns one layer, others link to it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 20:50:00 +01:00
Felipe Cardoso
af0515d05d fix: address code review findings (HIGH + MEDIUM)
- Replace GNU \b with portable word-boundary sed patterns in kernel-params
- Warn on unknown CLI arguments instead of silently swallowing
- Add floor check on recommended_gttsize_mib to prevent negative values
- Fix Python operator precedence in benchmark log parser
- Add root checks to tuned-profile.sh and rollback.sh
- Remove redundant sudo calls (scripts already require root at entry)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 20:19:44 +01:00
Felipe Cardoso
c596e38e9e Initial commit
2026-03-25 20:13:15 +01:00