Benchmark system for testing opencode with various LLM models, measuring speed (latency) and correctness (accuracy).
- The best tradeoff depends on your use-case and your hardware
- accuracy vs speed: reasoning, tok/s, different quantizations matters. Some small LLM can fix themself using tools, a fast LLM can be slow because it wastes too many tokens in the reasoning. Just test them in real world scenarios.
# Install dependencies
bun install
# Fill prompts/ and prompt-answers/ with your test cases ex. CODING-my-single-test.txt
# Check `~/.config/opencode/opencode.json` with your OpenAI-compatible models.
# It generates the answers with a specific model. use `opencode models` to see the availables
bun run answer -m "opencode/minimax-m2.5-free"
# bun run answer -m "opencode/minimax-m2.5-free" -t CODING-my-single-test
# It generates the evaluations with a specific model.
bun run evaluate -m "opencode/minimax-m2.5-free"
# bun run evaluate -m "opencode/minimax-m2.5-free" -t CODING-my-single-test
# it opens the dashboard on http://localhost:3000
bun run dashboard
