A practical guide to comparing image codecs fairly, with metric accuracy data, viewing-condition considerations, and scientific methodology.
Integrating your codec? See INTEGRATION.md for:
- Wiring up encode/decode callbacks
- MozJPEG, Jpegli, AVIF examples
- CI quality regression testing
- Interpreting DSSIM thresholds
Want to improve this tool? See CONTRIBUTING.md. We actively want input from codec developers—you know your domain better than we do.
```bash
# Quick start
cargo add codec-eval --git https://github.com/imazen/codec-eval

# Or use the CLI
cargo install --git https://github.com/imazen/codec-eval codec-eval-cli
```

API-first design: you provide encode/decode callbacks, the library handles everything else.
```rust
use codec_eval::{EvalSession, EvalConfig, ViewingCondition};

let config = EvalConfig::builder()
    .report_dir("./reports")
    .viewing(ViewingCondition::desktop())
    .quality_levels(vec![60.0, 80.0, 95.0])
    .build();

let mut session = EvalSession::new(config);
session.add_codec("my-codec", "1.0", Box::new(|image, request| {
    my_codec::encode(image, request.quality)
}));

let report = session.evaluate_corpus("./test_images")?;
```

Features:
- DSSIM, Butteraugli, and SSIMULACRA2 metrics (PSNR for legacy comparisons)
- Viewing condition modeling (desktop, mobile, retina)
- Automatic corpus download and caching via codec-corpus
- CSV import for third-party benchmark results
- Pareto front analysis and BD-Rate calculation
- JSON/CSV report generation
- Optional: SVG charts, polynomial interpolation
For simple tests without full corpus evaluation:
```rust
use codec_eval::{assert_quality, metrics::prelude::*};

// Load images
let reference = ImgVec::new(/* ... */, width, height);
let encoded = ImgVec::new(/* ... */, width, height);

// Assert quality thresholds
assert_quality(&reference, &encoded,
    Some(80.0), // min SSIMULACRA2 score
    Some(0.003) // max DSSIM
)?;

// Or use semantic levels
assert_perception_level(&reference, &encoded,
    PerceptionLevel::Imperceptible // DSSIM < 0.0003
)?;
```

Unified imports via metrics::prelude:
```rust
use codec_eval::metrics::prelude::*;

// Now have: Dssim, butteraugli, compute_ssimulacra2,
// ImgRef, ImgVec, RGB8, RGBA8, etc.
```

See INTEGRATION.md for detailed examples.
| Metric | Correlation with Human Perception | Best For |
|---|---|---|
| PSNR | ~67% | Legacy benchmarks only |
| SSIM/DSSIM | ~82% | Quick approximation |
| Butteraugli | 80-91% | High-quality threshold (score < 1.0) |
| SSIMULACRA2 | 87-98% | Recommended — best overall accuracy |
| VMAF | ~90% | Video, large datasets |
Based on Kornel Lesiński's guide:
❌ JPEG → WebP → AVIF (each conversion adds artifacts)
✓ PNG/TIFF → WebP
✓ PNG/TIFF → AVIF
Always start from a lossless source. Converting lossy→lossy compounds artifacts and skews results.
Don't compare mozjpeg -quality 80 against cjxl -quality 80 — quality scales differ between encoders.
Instead, match by:
- File size — encode to the same byte budget, compare quality
- Quality metric — encode to the same SSIMULACRA2 score, compare file size (see the sketch below)
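Matching by metric usually means searching the encoder's quality knob. A minimal sketch of that search, assuming the score rises monotonically with the quality setting; `encode_and_score` is a hypothetical callback you would implement with your encoder plus a SSIMULACRA2 call, not part of codec-eval:

```rust
// Hypothetical: encode `image` at `quality`, decode, then score the result
// against the original (e.g. with compute_ssimulacra2 from the prelude).
fn encode_and_score(image: &[u8], quality: f32) -> f32 {
    unimplemented!("encode, decode, and score here")
}

/// Bisect the encoder quality setting until the SSIMULACRA2 score lands
/// within `tol` of `target`.
fn quality_for_target_s2(image: &[u8], target: f32, tol: f32) -> f32 {
    let (mut lo, mut hi) = (1.0_f32, 100.0_f32);
    for _ in 0..20 {
        let mid = (lo + hi) / 2.0;
        let score = encode_and_score(image, mid);
        if (score - target).abs() <= tol {
            return mid;
        }
        if score < target { lo = mid; } else { hi = mid; }
    }
    (lo + hi) / 2.0
}
```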
A single test image can favor certain codecs. Use diverse datasets:
- Kodak — 24 classic benchmark images
- CLIC 2025 — 62 high-resolution images
- CID22 — 250 perceptual quality research images
These datasets are automatically downloaded and cached via the codec-corpus crate:
```rust
use codec_eval::corpus::Corpus;

let kodak = Corpus::get_dataset("kodak")?;
let clic = Corpus::get_dataset("clic2025/training")?;
```

Codec rankings change across the quality spectrum:
- High quality (SSIMULACRA2 > 80): Differences minimal
- Medium quality (60-80): Most visible differences
- Low quality (< 50): Edge cases, artifacts become dominant
A codec that's 5% smaller but 100x slower may not be practical. Report:
- Encode time (CPU seconds)
- Decode time (critical for web)
- Memory usage
SSIMULACRA2 is the current best metric for perceptual quality assessment.
| Score | Quality Level | Typical Use Case |
|---|---|---|
| < 30 | Poor | Thumbnails, previews |
| 40-50 | Low | Aggressive compression |
| 50-70 | Medium | General web images |
| 70-80 | Good | Photography sites |
| 80-85 | Very High | Professional/archival |
| > 85 | Excellent | Near-lossless |
Accuracy: 87% overall, up to 98% on high-confidence comparisons.
Tool: ssimulacra2_rs
```bash
ssimulacra2 original.png compressed.jpg
```

DSSIM measures structural similarity; it is derived from SSIM but outputs a distance (lower = better).
Accuracy: Validated against TID2013 database:
- Spearman correlation: -0.84 to -0.95 (varies by distortion type)
- Best on: Noise, compression artifacts, blur
- Weaker on: Exotic distortions, color shifts
Tool: dssim
```bash
dssim original.png compressed.jpg
```

| DSSIM Score | Approximate Quality |
|---|---|
| < 0.001 | Visually identical |
| 0.001-0.01 | Excellent |
| 0.01-0.05 | Good |
| 0.05-0.10 | Acceptable |
| > 0.10 | Noticeable artifacts |
Note: Values are not directly comparable between DSSIM versions. Always report version.
Butteraugli is Google's perceptual metric, good for high-quality comparisons.
Accuracy: 80-91% (varies by image type).
Best for: Determining if compression is "transparent" (score < 1.0).
Limitation: Less reliable for heavily compressed images.
VMAF is Netflix's Video Multi-Method Assessment Fusion.
Accuracy: ~90% for video, slightly less for still images.
Best for: Large-scale automated testing, video frames.
PSNR (Peak Signal-to-Noise Ratio) is purely mathematical and ignores perception.
Accuracy: ~67% — only slightly better than chance.
Use only: For backwards compatibility with legacy benchmarks.
The fixed-frame angle is a coordinate system for describing where an encode sits on the rate-distortion tradeoff. Every encode gets an angle measured from the worst corner of a fixed frame. The reference codec's knee (balanced tradeoff point) lands at exactly 45 degrees.
```
theta = atan2(quality_norm * aspect, 1.0 - bpp_norm)
```

Where:

- `bpp_norm = bpp / bpp_max` (how much of the budget you're using)
- `quality_norm = metric_value / metric_max` (how much quality you're getting)
- `aspect`: quality-axis stretch factor, calibrated so the reference knee = 45 deg
For SSIMULACRA2 (higher is better), quality_norm = s2 / s2_max.
For Butteraugli (lower is better), quality_norm = 1.0 - ba / ba_max.
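To make the formula concrete, here is a standalone sketch using the WEB frame constants from the table below; in practice `FixedFrame::WEB` (shown later) computes the same angles for you.

```rust
// Sketch of the fixed-frame angle, not the library's own code.
const BPP_MAX: f64 = 4.0;
const S2_MAX: f64 = 100.0;
const BA_MAX: f64 = 15.0;
const ASPECT: f64 = 1.2568;

fn s2_angle_deg(bpp: f64, s2: f64) -> f64 {
    let quality_norm = s2 / S2_MAX;       // higher is better
    let bpp_norm = bpp / BPP_MAX;
    (quality_norm * ASPECT).atan2(1.0 - bpp_norm).to_degrees()
}

fn ba_angle_deg(bpp: f64, ba: f64) -> f64 {
    let quality_norm = 1.0 - ba / BA_MAX; // lower is better, so invert
    let bpp_norm = bpp / BPP_MAX;
    (quality_norm * ASPECT).atan2(1.0 - bpp_norm).to_degrees()
}

fn main() {
    // The reference knee (0.7274 bpp, s2 = 65.10) should land at ~45 deg.
    println!("{:.1}", s2_angle_deg(0.7274, 65.10));
}
```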
| Parameter | Value | Rationale |
|---|---|---|
| `bpp_max` | 4.0 | Practical web ceiling. Few images exceed this. |
| `s2_max` | 100.0 | SSIMULACRA2 scale maximum. |
| `ba_max` | 15.0 | Butteraugli score where quality is essentially destroyed. |
| `aspect` | 1.2568 | Calibrated from CID22-training mozjpeg s2 knee. |
The aspect factor is derived from the reference knee at (0.7274 bpp, s2 = 65.10):

```
aspect = (1 - bpp_knee / bpp_max) / (s2_knee / s2_max)
       = (1 - 0.7274 / 4.0) / (65.10 / 100.0)
       = 0.81815 / 0.651
       = 1.2568
```
This stretches the quality axis so that the reference knee's quality displacement equals its bpp displacement in the angle computation, producing exactly 45 degrees.
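The same arithmetic as a quick check in plain Rust:

```rust
// Recompute the aspect factor from the published reference knee.
let (bpp_knee, s2_knee) = (0.7274_f64, 65.10_f64);
let aspect = (1.0 - bpp_knee / 4.0) / (s2_knee / 100.0);
assert!((aspect - 1.2568).abs() < 1e-3);
```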
| Angle | Meaning |
|---|---|
| < 0 deg | Worse than the worst corner. Negative quality (s2 below 0). |
| 0 deg | Worst corner: max bpp, zero quality. |
| ~10-25 deg | Aggressive compression. Thumbnails, heavy lossy. |
| 45 deg | Reference knee. Balanced tradeoff for mozjpeg/CID22. |
| ~52 deg | Ideal diagonal (0 bpp, perfect quality). Theoretical limit. |
| 60-80 deg | Quality-dominated. Large files, diminishing quality returns. |
| 90 deg | No compression: max bpp, max quality. |
| > 90 deg | Over-budget. bpp exceeds the frame ceiling. |
The knee is where the R-D curve transitions from "every extra bit buys meaningful quality" to "you're spending bits for marginal gains." It's the point where the normalized slope of the curve equals 1.0 — equal return on both axes.
Angles below 45 deg are compression-efficient. Angles above 45 deg are quality-focused. The ideal diagonal at ~52 deg represents "perfect quality at zero cost" — unachievable, but it's the geometric ceiling for good encodes.
Computed from full corpus evaluation with mozjpeg 4:2:0 progressive, quality sweep 10-98, step 2.
CID22-training (209 images, 512x512):
| Metric | Knee bpp | Knee value | Angle |
|---|---|---|---|
| SSIMULACRA2 | 0.7274 | 65.10 | 45.0 deg |
| Butteraugli | 0.7048 | 4.378 | 47.2 deg |
Disagreement range: 0.70-0.73 bpp (the two metrics nearly agree on where the knee is).
CLIC2025-training (32 images, ~2048px):
| Metric | Knee bpp | Knee value | Angle |
|---|---|---|---|
| SSIMULACRA2 | 0.4623 | 58.95 | 40.0 deg |
| Butteraugli | 0.3948 | 5.192 | 42.4 deg |
Disagreement range: 0.39-0.46 bpp.
CLIC2025 knees are at lower angles because the larger images (~2048px vs 512px) have more pixels per bit — the curve shifts left, and the balanced point is cheaper.
Method 1: From corpus averages (fast, approximate)
If you know the bpp and SSIMULACRA2 score for an encode, just compute:
```rust
use codec_eval::stats::FixedFrame;

let frame = FixedFrame::WEB;
let angle = frame.s2_angle(bpp, ssimulacra2_score);
```

This tells you where the encode sits relative to the reference knee. If angle < 45, you're compressing harder than the balanced point. If angle > 45, you're spending more bits than necessary for the quality gain.
For Butteraugli:
```rust
let angle = frame.ba_angle(bpp, butteraugli_score);
```

Method 2: From a per-image Pareto set (precise)
If you have a quality sweep for the specific image (multiple quality settings, each with bpp and metric scores), you can:
- Build the Pareto front from the sweep data.
- Find the knee of the per-image R-D curve using `CorpusAggregate` with a single image.
- Compare the per-image knee angle to the corpus reference (45 deg).
```rust
use codec_eval::stats::rd_knee::{CorpusAggregate, FixedFrame};

// curve: Vec<(bpp, ssimulacra2, butteraugli)> sorted by bpp
let agg = CorpusAggregate {
    corpus: "single-image".into(),
    codec: "mozjpeg-420-prog".into(),
    curve,
    image_count: 1,
};
let frame = FixedFrame::WEB;

// Find this image's knee
if let Some(knee) = agg.ssimulacra2_knee(&frame) {
    println!("Image knee at {:.1} deg ({:.2} bpp, s2={:.1})",
        knee.fixed_angle, knee.bpp, knee.quality);

    // Compare to corpus reference (45 deg)
    if knee.fixed_angle < 40.0 {
        println!("This image compresses efficiently — knee is below average");
    } else if knee.fixed_angle > 50.0 {
        println!("This image is hard to compress — needs more bits for quality");
    }
}
```

A per-image knee at 35 deg means the image reaches diminishing returns earlier (compresses well). A knee at 55 deg means the image needs more bits to look good.
Method 3: Angle of a specific encode on the per-image curve
Given a specific encode (one quality setting), compute its angle and compare to the image's own knee:
```rust
let frame = FixedFrame::WEB;

// The specific encode
let encode_angle = frame.s2_angle(encode_bpp, encode_s2);

// The image's knee (from a previous sweep)
let image_knee_angle = image_knee.fixed_angle;

if encode_angle < image_knee_angle {
    // Below this image's knee: compression-efficient territory.
    // You could increase quality without wasting bits.
} else {
    // Above this image's knee: quality-dominated territory.
    // Reducing quality would save significant bits.
}
```

Every encode has two angles: one from SSIMULACRA2 and one from Butteraugli. Comparing them reveals what kind of artifacts the codec produces:
- `theta_s2 > theta_ba`: The encode looks better structurally (SSIMULACRA2 is happy) than it does perceptually (Butteraugli sees local contrast issues). Common with aggressive chroma subsampling.
- `theta_s2 < theta_ba`: Butteraugli is more forgiving than SSIMULACRA2 at this operating point. The artifacts present are local-contrast-friendly but structurally visible.
- `theta_s2 ≈ theta_ba`: Both metrics agree on the quality level. The encode is well-balanced.
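Using the two angle helpers shown earlier, the comparison is one line per metric. A sketch; the variables are placeholders and the 2-degree tolerance is an arbitrary choice for illustration, not a library constant:

```rust
use codec_eval::stats::FixedFrame;

let frame = FixedFrame::WEB;
let theta_s2 = frame.s2_angle(encode_bpp, encode_s2);
let theta_ba = frame.ba_angle(encode_bpp, encode_ba);

// A small tolerance keeps "approximately equal" from flapping on noise.
match theta_s2 - theta_ba {
    d if d > 2.0 => println!("structurally fine, perceptually weaker (check chroma subsampling)"),
    d if d < -2.0 => println!("Butteraugli is forgiving here; structural artifacts likely visible"),
    _ => println!("metrics agree; encode is well balanced"),
}
```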
The knee is found by:
- Normalizing the corpus-aggregate R-D curve to [0, 1] on both axes (per-curve normalization, using observed min/max bpp and quality values).
- Computing the normalized slope between adjacent points.
- Finding the point closest to where the normalized slope crosses 1.0 (equal quality gain per bit spent).
- Interpolating linearly between the two bracketing points for a smooth result.
- Mapping the raw (bpp, quality) result back to the fixed-frame angle using the aspect ratio.
The per-curve normalization in step 1 is independent of the fixed frame — it uses the actual observed range of the data. The fixed frame and aspect ratio only enter in step 5, when converting the raw knee coordinates to a comparable angle.
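A condensed sketch of steps 1-3 on a plain `(bpp, quality)` curve, assuming points sorted by ascending bpp; the library's implementation adds the interpolation and fixed-frame mapping of steps 4-5:

```rust
/// Normalize the curve to [0, 1] on both axes, then return the index of the
/// first segment whose normalized slope has fallen to 1.0 or below
/// (equal quality gain per bit spent).
fn knee_index(curve: &[(f64, f64)]) -> Option<usize> {
    let (bpp_min, bpp_max) = (curve.first()?.0, curve.last()?.0);
    let q_min = curve.iter().map(|p| p.1).fold(f64::INFINITY, f64::min);
    let q_max = curve.iter().map(|p| p.1).fold(f64::NEG_INFINITY, f64::max);

    let norm: Vec<(f64, f64)> = curve.iter()
        .map(|&(b, q)| ((b - bpp_min) / (bpp_max - bpp_min),
                        (q - q_min) / (q_max - q_min)))
        .collect();

    // Slope of each adjacent segment in normalized coordinates.
    norm.windows(2).position(|w| {
        let slope = (w[1].1 - w[0].1) / (w[1].0 - w[0].0);
        slope <= 1.0
    })
}
```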
- `src/stats/rd_knee.rs` — All types, angle computation, knee detection, SVG plotting
- `crates/codec-compare/src/rd_calibrate.rs` — Calibration binary (corpus sweep)
Pixels per degree (PPD) is the number of pixels that fit in one degree of the visual field. Critical for assessing when compression artifacts become visible.
| PPD | Context | Notes |
|---|---|---|
| 30 | 1080p at arm's length | Casual viewing |
| 60 | 20/20 vision threshold | Most artifacts visible |
| 80 | Average human acuity limit | Diminishing returns above this |
| 120 | 4K at close range | Overkill for most content |
| 159 | iPhone 15 Pro | "Retina" display density |
Formula:
PPD = (viewing_distance_inches × resolution_ppi × π) / 180
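A minimal helper for the formula; the 460 ppi / 20 inch figures in the example are assumed values, chosen to land near the "Retina" row of the table above.

```rust
use std::f64::consts::PI;

/// Pixels per degree from display density (PPI) and viewing distance (inches).
fn ppd(resolution_ppi: f64, viewing_distance_in: f64) -> f64 {
    viewing_distance_in * resolution_ppi * PI / 180.0
}

fn main() {
    // A ~460 ppi phone viewed from about 20 inches works out to roughly 160 PPD.
    println!("{:.0}", ppd(460.0, 20.0));
}
```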
| Device Type | Typical PPD | Compression Tolerance |
|---|---|---|
| Desktop monitor | 40-80 | Medium quality acceptable |
| Laptop | 80-120 | Higher quality needed |
| Smartphone | 120-160 | Very high quality or artifacts visible |
| 4K TV at 3m | 30-40 | More compression acceptable |
- Mobile-first sites need higher quality settings (SSIMULACRA2 > 70)
- Desktop sites can use more aggressive compression (SSIMULACRA2 50-70)
- Thumbnails can be heavily compressed regardless of device
- Hero images on retina displays need minimal compression
ITU-R BT.500 is the international standard for subjective video/image quality assessment.
Key elements:
- Controlled viewing conditions (luminance, distance, display calibration)
- Non-expert viewers (15-30 recommended)
- 5-grade Mean Opinion Score (MOS):
- 5: Excellent
- 4: Good
- 3: Fair
- 2: Poor
- 1: Bad
- Statistical analysis with confidence intervals
When to use: Final validation of codec choices, publishing research.
| Method | Description | Best For |
|---|---|---|
| DSIS | Show reference, then test image | Impairment detection |
| DSCQS | Side-by-side, both unlabeled | Quality comparison |
| 2AFC | "Which is better?" forced choice | Fine discrimination |
| Flicker test | Rapid A/B alternation | Detecting subtle differences |
When metrics aren't enough, subjective testing provides ground truth. But poorly designed studies produce unreliable data.
Randomization:
- Randomize presentation order (left/right, first/second)
- Randomize image order across participants
- Balance codec appearances to avoid order effects
Blinding:
- Participants must not know which codec produced which image
- Use neutral labels ("Image A" / "Image B")
- Don't reveal hypothesis until after data collection
Controls:
- Include known quality differences as sanity checks
- Add duplicate pairs to measure participant consistency
- Include "same image" pairs to detect bias
| Comparison Type | Minimum N | Recommended N |
|---|---|---|
| Large quality difference (obvious) | 15 | 20-30 |
| Medium difference (noticeable) | 30 | 50-80 |
| Small difference (subtle) | 80 | 150+ |
Power analysis: For 80% power to detect a 0.5 MOS difference with SD=1.0, you need ~64 participants per condition.
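The ~64 figure follows from the standard two-sample normal-approximation formula. A quick sketch with the usual z-values hard-coded (α = 0.05 two-sided, 80% power); exact software will differ slightly in rounding:

```rust
/// Per-group sample size for a two-sided two-sample z-test:
/// n = 2 * ((z_{alpha/2} + z_beta) * sd / delta)^2
fn n_per_group(z_alpha_half: f64, z_beta: f64, sd: f64, delta: f64) -> u64 {
    (2.0 * ((z_alpha_half + z_beta) * sd / delta).powi(2)).ceil() as u64
}

fn main() {
    // alpha = 0.05 (z = 1.96), power = 0.80 (z = 0.84), SD = 1.0 MOS, delta = 0.5 MOS
    println!("{} per condition", n_per_group(1.96, 0.84, 1.0, 0.5)); // 63, in line with ~64 above
}
```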
Pre-study:
- Visual acuity test (corrected 20/40 or better)
- Color vision screening (Ishihara plates)
- Display calibration verification
Exclusion criteria (define before data collection):
- Failed attention checks (> 20% incorrect on known pairs)
- Inconsistent responses (< 60% agreement on duplicate pairs)
- Response time outliers (< 200ms suggests random clicking)
- Incomplete sessions (< 80% of trials)
Embed these throughout the study:
Types of attention checks:
1. Obvious pairs - Original vs heavily compressed (SSIMULACRA2 < 30)
2. Identical pairs - Same image twice (should report "same" or 50/50 split)
3. Reversed pairs - Same comparison shown twice, order flipped
4. Instructed response - "For this pair, select the LEFT image"
Threshold: Exclude participants who fail > 2 attention checks or > 20% of obvious pairs.
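As a sketch, that exclusion rule is only a few lines; the struct fields here are hypothetical bookkeeping, not a codec-eval type:

```rust
/// Drop a participant who fails more than 2 attention checks
/// or more than 20% of the obvious pairs.
struct Participant {
    failed_attention_checks: u32,
    obvious_pairs_wrong: u32,
    obvious_pairs_total: u32,
}

fn should_exclude(p: &Participant) -> bool {
    let obvious_fail_rate =
        p.obvious_pairs_wrong as f64 / p.obvious_pairs_total.max(1) as f64;
    p.failed_attention_checks > 2 || obvious_fail_rate > 0.20
}
```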
Position bias: Tendency to favor left/right or first/second.
- Detect: Chi-square test on position choices across all trials
- Correct: Counter-balance positions; exclude participants with > 70% same-side choices
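With only two positions the chi-square test has one degree of freedom and reduces to a short expression; compare the result against 3.84 for p = 0.05. A sketch, not a library function:

```rust
/// 1-df chi-square goodness-of-fit on left vs right choices,
/// against an expected 50/50 split.
fn position_chi2(left: u64, right: u64) -> f64 {
    let expected = (left + right) as f64 / 2.0;
    ((left as f64 - expected).powi(2) + (right as f64 - expected).powi(2)) / expected
}

// Example: 140 left vs 60 right out of 200 trials gives chi2 = 32
// (and exactly 70% same-side choices) — clearly biased.
```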
Fatigue effects: Quality judgments degrade over time.
- Detect: Compare accuracy on attention checks early vs late in session
- Correct: Limit sessions to 15-20 minutes; analyze by time block
Anchoring: First few images bias subsequent judgments.
- Detect: Compare ratings for same image shown early vs late
- Correct: Use practice trials (discard data); randomize order
Central tendency: Avoiding extreme ratings.
- Detect: Histogram of ratings (should use full scale)
- Correct: Use forced choice (2AFC) instead of rating scales
For rating data (MOS):
1. Calculate mean and 95% CI per condition
2. Check normality (Shapiro-Wilk) - often violated
3. Use robust methods:
- Trimmed means (10-20% trim)
- Bootstrap confidence intervals
- Non-parametric tests (Wilcoxon, Kruskal-Wallis)
4. Report effect sizes (Cohen's d, or MOS difference)
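A self-contained sketch of step 3 above (10% trimmed mean with a percentile bootstrap CI). It uses a tiny inline xorshift generator instead of a crate and is illustrative rather than production statistics code:

```rust
/// Trimmed mean: drop the top and bottom `trim` fraction of ratings.
fn trimmed_mean(data: &[f64], trim: f64) -> f64 {
    let mut v = data.to_vec();
    v.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let k = (v.len() as f64 * trim) as usize;
    let kept = &v[k..v.len() - k];
    kept.iter().sum::<f64>() / kept.len() as f64
}

/// Percentile bootstrap 95% CI for the trimmed mean.
fn bootstrap_ci(data: &[f64], iters: usize, trim: f64) -> (f64, f64) {
    fn xorshift(s: &mut u64) -> u64 {
        *s ^= *s << 13;
        *s ^= *s >> 7;
        *s ^= *s << 17;
        *s
    }
    let mut seed = 0x9E37_79B9_7F4A_7C15u64;
    let mut stats = Vec::with_capacity(iters);
    for _ in 0..iters {
        let resample: Vec<f64> = (0..data.len())
            .map(|_| data[(xorshift(&mut seed) % data.len() as u64) as usize])
            .collect();
        stats.push(trimmed_mean(&resample, trim));
    }
    stats.sort_by(|a, b| a.partial_cmp(b).unwrap());
    (stats[(iters as f64 * 0.025) as usize],
     stats[(iters as f64 * 0.975) as usize])
}
```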
For forced choice (2AFC):
1. Calculate preference percentage per pair
2. Binomial test for significance (H0: 50%)
3. Apply multiple comparison correction:
- Bonferroni (conservative)
- Holm-Bonferroni (less conservative)
- Benjamini-Hochberg FDR (for many comparisons)
4. Report: "Codec A preferred 67% of time (p < 0.01, N=100)"
Outlier handling:
1. Define criteria BEFORE analysis (pre-registration)
2. Report both with and without outlier exclusion
3. Use robust statistics that down-weight outliers
4. Never exclude based on "inconvenient" results
Always include:
- Sample size (N) and exclusion count with reasons
- Confidence intervals, not just p-values
- Effect sizes in meaningful units (MOS points, % preference)
- Individual data points or distributions (not just means)
- Attention check pass rates
- Participant demographics (if relevant to display/vision)
Example:
"N=87 participants completed the study (12 excluded: 8 failed attention checks, 4 incomplete). Codec A was preferred over Codec B in 62% of comparisons (95% CI: 55-69%, p=0.003, binomial test). This corresponds to a mean quality difference of 0.4 MOS points (95% CI: 0.2-0.6)."
| Pitfall | Problem | Solution |
|---|---|---|
| Small N | Underpowered, unreliable | Power analysis before study |
| No attention checks | Can't detect random responders | Embed 10-15% check trials |
| Post-hoc exclusion | Cherry-picking results | Pre-register exclusion criteria |
| Only reporting means | Hides variability | Show distributions + CI |
| Multiple comparisons | Inflated false positives | Apply correction (Bonferroni, FDR) |
| Unbalanced design | Confounds codec with position/order | Full counterbalancing |
| Lab-only testing | May not generalize | Include diverse participants/displays |
Cloudinary 2021 (1.4 million opinions):
- JPEG XL: 10-15% better than AVIF at web quality levels
- AVIF: Best for low-bandwidth scenarios
- WebP: Solid middle ground
- All modern codecs beat JPEG by 25-35%
```bash
# 1. Encode to same file size
convert source.png -define webp:target-size=50000 output.webp
cjxl source.png output.jxl --target_size 50000

# 2. Measure with SSIMULACRA2
ssimulacra2 source.png output.webp
ssimulacra2 source.png output.jxl
```

- Gather diverse test images from codec-corpus
- Create quality ladder (10 quality levels per codec)
- Compute metrics for each combination
- Plot rate-distortion curves (file size vs quality)
- Consider encode/decode speed
- Validate with subjective testing if publishing results
| Implementation | Type | Install | Notes |
|---|---|---|---|
| ssimulacra2_rs | CLI (Rust) | `cargo install ssimulacra2_rs` | Recommended |
| ssimulacra2 | Library (Rust) | `cargo add ssimulacra2` | For integration |
| ssimulacra2-cuda | GPU (CUDA) | `cargo install ssimulacra2-cuda` | Fast batch processing |
| libjxl | CLI (C++) | Build from source | Original implementation |
```bash
# Install CLI
cargo install ssimulacra2_rs

# Usage
ssimulacra2_rs original.png compressed.jpg
# Output: 76.543210 (higher = better, scale 0-100)
```

| Implementation | Type | Install | Notes |
|---|---|---|---|
| dssim | CLI (Rust) | `cargo install dssim` | Recommended |
```bash
# Install
cargo install dssim

# Basic comparison (lower = better)
dssim original.png compressed.jpg
# Output: 0.02341

# Generate difference visualization
dssim -o difference.png original.png compressed.jpg
```

Accuracy: Validated against the TID2013 database. Spearman correlation -0.84 to -0.95 depending on distortion type.
| Implementation | Type | Install | Notes |
|---|---|---|---|
| butteraugli | CLI (C++) | Build from source | Original |
| libjxl | CLI (C++) | Build from source | Includes butteraugli |
| Implementation | Type | Install | Notes |
|---|---|---|---|
| libvmaf | CLI + Library | Package manager or build | Official Netflix implementation |
```bash
# Ubuntu/Debian
apt install libvmaf-dev

# Usage (via ffmpeg)
ffmpeg -i original.mp4 -i compressed.mp4 -lavfi libvmaf -f null -
```

| Tool | Purpose | Link |
|---|---|---|
| imageflow | High-performance image processing with quality calibration | Rust + C ABI |
| libvips | Fast image processing library | C + bindings |
| sharp | Node.js image processing (uses libvips) | npm |
- How to Compare Images Fairly — Kornel Lesiński
- ITU-R BT.500-15 — Subjective quality assessment
- The Netflix Tech Blog: VMAF
- Image Codec Comparison — Google
- Cloudinary Image Format Study — 1.4M opinions
- Are We Compressed Yet? — Video codec comparison
- codec-corpus — Reference images for calibration
See CONTRIBUTING.md for the full guide.
We especially want contributions from:
- Codec developers (mozjpeg, jpegli, libavif, webp, etc.) — integration examples, quality scale docs, edge cases
- Metrics researchers — new metrics, calibration data, perception thresholds
- Anyone — docs, tests, bug reports, benchmark results
This project is designed to be community-driven. Fork it, experiment, share what you learn.
This guide is released under CC0 — use freely without attribution.