Production-grade LLM evaluation framework with judge ensembles, synthetic data generation, regression tracking, and an analytics dashboard.
evalkit addresses a core problem in LLM-powered product development: how do you know if your model got better or worse? Traditional metrics like BLEU and ROUGE correlate poorly with human judgment on open-ended generation tasks. evalkit replaces them with structured, LLM-as-judge evaluation that scales beyond human annotation budgets while producing explainable, per-criterion scores.
Built for engineering teams that ship LLM features and need to catch regressions before they reach production.
              +------------------+
              |   User / CI/CD   |
              +--------+---------+
                       |
         +-------------+-------------+
         |                           |
+--------v--------+         +--------v--------+
|  Judge Engine   |         | Synthetic Gen.  |
|  (LLM / Ens.)   |         |  (LLM-backed)   |
+--------+--------+         +--------+--------+
         |                           |
         |     +-----------+         |
         +---->| Pydantic  |<--------+
               |  Models   |
               +-----+-----+
                     |
         +-----------+-----------+
         |                       |
+--------v--------+     +--------v--------+
|   Regression    |     | DuckDB Storage  |
|    Tracker      +---->|     Backend     |
+--------+--------+     +-----------------+
         |
+--------v--------+
|    Dashboard    |
|   (Streamlit)   |
+-----------------+
Rubric + Strategy                      Your Model
        |                                  |
        v                                  v
SyntheticGenerator ──> test inputs ──> inference ──> (input, output) pairs
                                                               |
                                                               v
                                                   LLMJudge / EnsembleJudge
                                                               |
                                                               v
                                                       list[JudgeScore]
                                                               |
                                                               v
                                                          EvalResult ──> DuckDB Storage
                                                               |
                                                               v
                                             RegressionTracker.compare_versions()
                                                               |
                                                   +-----------+-----------+
                                                   |                       |
                                                   v                       v
                                          Console / Markdown      Streamlit Dashboard
| Goal | How It Works |
|---|---|
| Evaluate LLM outputs | Structured rubrics scored by LLM judges with per-criterion reasoning |
| Detect regressions | Automated version comparison with configurable thresholds |
| Generate test data | Synthetic inputs via 4 strategies (standard, adversarial, edge case, distribution) |
| Visualize trends | Interactive Streamlit dashboard backed by DuckDB analytics |
| Goal | Design Decision |
|---|---|
| Zero-config setup | DuckDB embedded storage -- no database server required |
| Reproducibility | Immutable Pydantic models (frozen=True) + schema-versioned rubrics |
| Provider independence | Thin LLM abstraction -- swap OpenAI/Anthropic/Gemini without code changes |
| CI/CD integration | All operations scriptable, JSON output, non-zero exit on regression |
| Cost efficiency | ~$2.50 per 1,000 evaluations at GPT-4o pricing |
- Catches regressions before production. Compare model versions on every CI run -- know within minutes if quality degraded.
- Replaces expensive human annotation. LLM-as-judge at ~$2.50/1,000 samples vs $50-125 for human annotators.
- Produces explainable scores. Per-criterion reasoning ("factual accuracy dropped on medical queries") instead of opaque BLEU numbers.
- Composable subsystems. Use just the judge, just the tracker, or just the generator -- no framework lock-in.
- Judge quality depends on the judge model. Blind spots in the judge model produce inflated scores. Mitigate with ensemble voting and golden-set calibration.
- No concurrent writes. DuckDB's single-writer model means parallel CI jobs each need their own database file.
- Cost scales linearly. 10,000 samples × 3 judges = 30,000 API calls. Start with representative samples (~100-500).
- No built-in inference. evalkit evaluates outputs but does not run models -- intentionally framework-agnostic.
- Single-machine ceiling. DuckDB handles ~100M rows; beyond that, migrate to a columnar warehouse.
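
Because cost is linear in samples and judges, a run can be budgeted before it starts. The sketch below is plain arithmetic rather than an evalkit API: `estimate_cost` is a hypothetical helper, its default rate is the ~$2.50 per 1,000 evaluations quoted above, and it assumes each judge call costs the same as one evaluation.

```python
def estimate_cost(
    samples: int, judges: int, usd_per_1k_calls: float = 2.50
) -> tuple[int, float]:
    """Return (api_calls, estimated_usd) for one evaluation run.

    Assumption: every judge call is billed like one evaluation at the
    quoted ~$2.50 per 1,000 rate; real pricing depends on token counts.
    """
    if samples < 0 or judges < 0:
        raise ValueError("samples and judges must be non-negative")
    api_calls = samples * judges
    return api_calls, api_calls * usd_per_1k_calls / 1000


if __name__ == "__main__":
    # Compare a representative sample against the full 10,000-sample run.
    for samples in (100, 500, 10_000):
        calls, usd = estimate_cost(samples, judges=3)
        print(f"{samples:>6} samples x 3 judges = {calls:>6} calls, about ${usd:.2f}")
```

Under those assumptions, the 10,000-sample, 3-judge case above comes to 30,000 calls and roughly $75, which is why starting with 100-500 representative samples is usually the cheaper first step.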
pip install evalkit
# or with all optional dependencies:
pip install evalkit[all]

from evalkit.core.models import EvalResult, JudgeScore
from evalkit.core.storage import DuckDBStorage
from evalkit.regression.tracker import RegressionTracker
storage = DuckDBStorage(db_path="./evals.duckdb")
tracker = RegressionTracker(storage=storage)
result = EvalResult(
    model_id="my-model", model_version="v2.1",
    input_text="Summarize this article...",
    output_text="The article discusses...",
    aggregate_score=4.2,
)
tracker.record(result)

Evaluate LLM outputs against structured rubrics using LLM-as-judge or multi-judge ensembles.
from evalkit.judges.llm_judge import LLMJudge
from evalkit.judges.rubrics import SUMMARIZATION_RUBRIC
judge = LLMJudge(judge_id="gpt4o", rubric=SUMMARIZATION_RUBRIC)
scores = judge.evaluate(
    input_text="Summarize: The mitochondria is the powerhouse...",
    output_text="Mitochondria generate cellular energy via ATP.",
)
for s in scores:
    print(f"{s.criterion}: {s.score}/5 -- {s.reasoning}")

Pre-built rubrics: Summarization, Factual Accuracy, Helpfulness, Safety
Combine multiple judges for robust evaluation with three voting strategies:
from evalkit.judges.ensemble import EnsembleJudge
from evalkit.core.models import VotingStrategy
ensemble = EnsembleJudge(
    judge_id="panel",
    rubric=SUMMARIZATION_RUBRIC,
    judges=[(judge_gpt, 1.0), (judge_claude, 1.0)],
    voting_strategy=VotingStrategy.WEIGHTED_AVERAGE,
)
consensus = ensemble.evaluate(input_text="...", output_text="...")

| Strategy | Behavior | Best For |
|---|---|---|
| `WEIGHTED_AVERAGE` | Weighted mean of scores | General evaluation |
| `MAJORITY` | Most common score wins | Binary/categorical |
| `UNANIMOUS` | Minimum score (conservative) | Safety-critical |
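
To make the three behaviors in the table concrete, here is a minimal sketch of how per-judge scores could be combined. It is not evalkit's implementation: the function names are hypothetical, scores are passed as `(score, weight)` pairs, and the tie-break in `majority` (the lower score wins) is an assumption.

```python
from collections import Counter

# Each entry is (score, weight) for one judge on one criterion.
Scores = list[tuple[float, float]]


def weighted_average(scores: Scores) -> float:
    """Weighted mean of the judges' scores."""
    total_weight = sum(weight for _, weight in scores)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(score * weight for score, weight in scores) / total_weight


def majority(scores: Scores) -> float:
    """Most common score; ties go to the lower score (assumed)."""
    counts = Counter(score for score, _ in scores)
    top = max(counts.values())
    return min(score for score, count in counts.items() if count == top)


def unanimous(scores: Scores) -> float:
    """Minimum score across judges, the conservative choice."""
    return min(score for score, _ in scores)


panel = [(4.0, 1.0), (5.0, 1.0), (4.0, 2.0)]
print(weighted_average(panel))  # (4*1 + 5*1 + 4*2) / 4 = 4.25
print(majority(panel))          # 4.0 appears twice
print(unanimous(panel))         # 4.0 is the lowest score
```

The minimum-score rule means a single dissenting judge pulls the result down, which is why it suits safety-critical checks.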
Define evaluation criteria tailored to your use case:
from evalkit.core.models import Rubric, RubricCriteria, ScoreScale
rubric = Rubric(
    name="API Response Quality",
    criteria=[
        RubricCriteria(
            name="Schema Compliance",
            description="Does the response match the expected JSON schema?",
            weight=3.0, scale=ScoreScale.BINARY,
        ),
        RubricCriteria(
            name="Data Accuracy",
            description="Are the returned values correct?",
            weight=2.0, scale=ScoreScale.LIKERT_5,
        ),
    ],
)

Generate diverse test inputs using strategy-specific templates:
from evalkit.generators.synthetic import SyntheticGenerator
from evalkit.generators.templates import GenerationStrategy
gen = SyntheticGenerator(strategy=GenerationStrategy.ADVERSARIAL)
test_cases = gen.generate("customer support chatbot", count=20)

| Strategy | What It Generates |
|---|---|
| `STANDARD` | Typical user queries across difficulty levels |
| `ADVERSARIAL` | Inputs designed to expose model weaknesses |
| `EDGE_CASE` | Boundary conditions, unusual formats, empty inputs |
| `DISTRIBUTION_MATCHING` | Inputs matching production traffic distribution |
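
As an illustration of what "strategy-specific templates" can mean, the sketch below maps each strategy in the table to an instruction template and fills in the domain and count. Everything here is hypothetical: `SketchStrategy`, `TEMPLATES`, and `build_prompt` are stand-ins, and the template wording is not taken from evalkit.

```python
from enum import Enum


class SketchStrategy(Enum):
    """Stand-in for the four strategies listed in the table above."""

    STANDARD = "standard"
    ADVERSARIAL = "adversarial"
    EDGE_CASE = "edge_case"
    DISTRIBUTION_MATCHING = "distribution_matching"


# Hypothetical instruction templates; the wording is illustrative only.
TEMPLATES = {
    SketchStrategy.STANDARD: (
        "Write {count} typical user queries for a {domain}, from easy to hard."
    ),
    SketchStrategy.ADVERSARIAL: (
        "Write {count} inputs for a {domain} designed to expose model weaknesses."
    ),
    SketchStrategy.EDGE_CASE: (
        "Write {count} inputs for a {domain} with boundary conditions, "
        "unusual formats, or empty content."
    ),
    SketchStrategy.DISTRIBUTION_MATCHING: (
        "Write {count} inputs for a {domain} that mirror production traffic."
    ),
}


def build_prompt(strategy: SketchStrategy, domain: str, count: int) -> str:
    """Fill the template for one strategy; the result would go to an LLM."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return TEMPLATES[strategy].format(domain=domain, count=count)


print(build_prompt(SketchStrategy.ADVERSARIAL, "customer support chatbot", 20))
```

Keeping one template per strategy keeps the generator itself strategy-agnostic: adding a strategy means adding a template, not a new code path.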
Track evaluation scores across model versions and detect regressions:
from evalkit.regression.tracker import RegressionTracker
from evalkit.regression.reporter import RegressionReporter
tracker = RegressionTracker(storage=storage, threshold=-0.1)
tracker.record_batch(v1_results)
tracker.record_batch(v2_results)
report = tracker.compare_versions("my-model", "v1.0", "v2.0")
reporter = RegressionReporter()
print(reporter.to_console(report))

Output:
========================================================================
REGRESSION REPORT: ALL CLEAR
========================================================================
Model: my-model
Baseline: v1.0 (100 samples)
Candidate: v2.0 (100 samples)
Delta: +0.3200
------------------------------------------------------------------------
Criterion Base Cand Delta Reg?
--------------------------------------------------------------------
aggregate 3.8500 4.1700 +0.3200 no
========================================================================
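
The report above is consistent with a comparison of mean scores. The sketch below shows that reading in plain Python and is not the tracker's actual logic: it assumes the delta is the candidate mean minus the baseline mean, and that a regression is flagged when the delta falls below the (negative) threshold. `compare_means` is a hypothetical name.

```python
from statistics import fmean


def compare_means(
    baseline: list[float], candidate: list[float], threshold: float = -0.1
) -> dict:
    """Compare mean scores of two versions.

    Assumption: a regression means delta < threshold, where
    delta = mean(candidate) - mean(baseline).
    """
    if not baseline or not candidate:
        raise ValueError("both versions need at least one score")
    base, cand = fmean(baseline), fmean(candidate)
    delta = cand - base
    return {
        "baseline": base,
        "candidate": cand,
        "delta": delta,
        "regressed": delta < threshold,
    }


report = compare_means([3.8, 3.9], [4.1, 4.3])
print(f"delta={report['delta']:+.4f} regressed={report['regressed']}")
```

In a CI job, a caller could exit with a non-zero status whenever `regressed` is true, which matches the "non-zero exit on regression" behavior described earlier.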
Compare model outputs across versions using multiple strategies:
from evalkit.regression.comparator import OutputComparator, ComparisonMethod
comparator = OutputComparator(similarity_threshold=0.9)
result = comparator.compare(baseline_output, candidate_output, ComparisonMethod.FUZZY)
print(f"Similarity: {result.similarity:.2%}, Match: {result.is_match}")

Launch the Streamlit dashboard for interactive exploration:
evalkit
# or: streamlit run src/evalkit/dashboard/app.py

evalkit supports YAML configuration:
project_name: my-eval-project
storage:
  database_path: ./evals.duckdb
ensemble:
  voting_strategy: weighted_average
judges:
  - judge_id: gpt4o
    judge_type: llm
    llm:
      provider: openai
      model: gpt-4o
      api_key_env_var: OPENAI_API_KEY
    weight: 1.0

from evalkit.core.config import EvalConfig
config = EvalConfig.from_yaml("evalkit.yml")

- ADR-001: LLM-as-Judge Architecture -- Why LLM judges over heuristic scoring
- ADR-002: DuckDB Over PostgreSQL -- Embedded columnar storage for zero-config analytics
- ADR-003: Ensemble Voting Strategy -- Majority vs weighted vs unanimous voting
# Clone and install dev dependencies
git clone https://github.com/cortexark/evalkit.git
cd evalkit
make dev
# Run tests
make test
# Lint and format
make lint
make format
# Type check
make typecheck
# Full CI pipeline locally
make ci

src/evalkit/
  core/        -- Pydantic models, config, DuckDB storage
  judges/      -- BaseJudge, LLMJudge, EnsembleJudge, rubrics
  generators/  -- Synthetic data generation pipeline
  regression/  -- Tracker, comparator, reporter
  dashboard/   -- Streamlit visualization
tests/         -- pytest suite with fixtures
docs/adr/      -- Architecture Decision Records
examples/      -- Runnable usage examples

- Fork the repository
- Create a feature branch (`git checkout -b feature/my-feature`)
- Write tests first (TDD -- every feature starts with a failing test)
- Implement the feature
- Run `make ci` to verify lint, types, and tests pass
- Open a pull request with a clear description

- All public methods must have docstrings with Args/Returns sections
- Type hints on every function signature
- No hardcoded API keys -- use environment variables via `LLMProviderConfig`
- Pydantic models should be frozen (`frozen=True`)
- New features need corresponding tests with >85% coverage
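
For the "no hardcoded API keys" rule, the sketch below shows the general pattern in plain Python: store only the name of the environment variable (as `api_key_env_var` does in the YAML example) and resolve the key at runtime. It does not show how `LLMProviderConfig` does this internally; `load_api_key` is a hypothetical helper.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read an API key from the environment, failing loudly if it is missing.

    Only the variable name lives in code or config; the secret itself
    stays in the environment (for example, a CI secret store).
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"environment variable {env_var} is not set")
    return key
```

Failing at load time, rather than on the first API call, surfaces a missing secret before any evaluation budget is spent.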