Performance: 29× speedup across matchers + Coma accuracy improvements by kPsarakis · Pull Request #96 · delftdata/valentine

kPsarakis · 2026-04-08T06:24:34Z

Summary

Resolves #88

Across-the-board performance and accuracy work on every matcher, plus robustness fixes, Polars support, dead-code removal, documentation updates, and comprehensive test coverage.

Headline: ~29× wall-clock speedup on the full NYU Open Data benchmark (1048s → 36s) with no F1 regressions and a +0.055 Coma F1 improvement from targeted accuracy work.

Performance: before / after

Full NYU Open Data benchmark, 10 dataset pairs, single machine, sequential.

Matcher	Total (s)	Worst pair (s)	Mean F1
Coma	74.87 → 10.30 (7.3×)	24.41 → 2.91 (8.4×)	0.754 → 0.809 (+0.055)
Cupid	191.45 → 8.29 (23.1×)	78.17 → 2.20 (35.5×)	0.480 → 0.501 (+0.021)
DistributionBased	36.86 → 5.62 (6.6×)	13.81 → 2.64 (5.2×)	0.678 → 0.681 (+0.003)
JaccardDistanceMatcher	739.17 → 4.60 (160.8×)	203.98 → 1.85 (110.2×)	0.645 → 0.635 (−0.009)
SimilarityFlooding	5.75 → 7.30 (0.8×)	1.56 → 1.86 (0.8×)	0.505 → 0.509 (+0.004)
Total	1048.10 → 36.11 (29.0×)	—	—

JaccardDistanceMatcher F1 delta (−0.009) is within run-to-run noise. SimilarityFlooding is slightly slower due to the improved tokeniser and NodeID collision fix processing more tokens.

What changed

Performance

Coma — TF-IDF cosine fast path. TfidfCorpus builds float32 sparse CSR matrices, caches per-column vectorisations on object identity, and memoises pair-level similarities on a symmetric (id, id) key. InstancesCM evaluates InstancesDirect and InstancesAll on the same list — the pair cache collapses both calls into one matmul.
Cupid — WordNet caching. Cached wn.synsets and wn.wup_similarity (symmetric key), plus the English stopword frozenset and the all_lemma_names corpus walk. The cold-path lemma walk used to dominate; it's now paid once per process.
DistributionBased — vectorised quantile histograms. QuantileHistogram.add_values replaces a Python bucket_binary_search loop with a single np.searchsorted + np.bincount over precomputed lower/upper-bound arrays. __slots__ on the histogram, lru_cache on the constant _bucket_distance_matrix(n), and a global ranks pickle cache to avoid re-unpickling per column.
JaccardDistanceMatcher — rapidfuzz cdist. Replaced the per-pair Python loop with rapidfuzz.process.cdist, dispatched to the smaller side (rows × cols favours small rows), with score_cutoff=threshold so rapidfuzz can short-circuit. This is the matcher that moved most in absolute time (−735 s).
Type inference fix. BaseTable.get_data_type now treats pandas "str" / "string" dtypes as text, not as unknown. Free F1 wins for Cupid and SF (which read data_type) and a prerequisite for the Coma accuracy work.

Coma accuracy (+0.055 F1)

Four small additions, each tested independently against the full bench:

Token Jaccard inside NameCM. New tokens.py splits column names into camelCase / snake_case / digit runs and computes a soft Jaccard with generic abbreviation matching (dept→department, fname→firstname). Folded into NameCM with maximum so it can only help — never dilute trigram on well-formed names.
Containment bonus. When all tokens from the short side appear in the long side, a small bonus (up to 0.15) is added proportional to coverage, catching partial matches like curb inside pap_curb_pri.
NLTK stopwords in TF-IDF. Replaced hardcoded 33-word Lucene frozenset with NLTK's 179-word English stopwords list for better noise filtering.
Weighted matcher combination. Switched from uniform average to weighted, with InstancesCM getting a 1.3× weight (tuned empirically — 1.5 over-weights, 1.2 under-weights). Schema matchers stay at 1.0.

Robustness fixes

Cupid division-by-zero guards. name_similarity_tokens returns 0.0 when either token set is empty; compute_ssim returns 0.0 when both nodes have empty leaf lists.
DistributionBased zero-sum histogram guard. quantile_emd returns inf when histogram values sum to zero instead of dividing by zero.
Coma TF-IDF cache robustness. Cache stores list reference alongside id() key to detect id() reuse after GC, preventing stale cache hits.
Similarity Flooding NodeID collision. Replaced plain "NodeID" string prefix with null-byte sentinel "\x00NID" so columns named "NodeID*" don't collide with structural graph nodes.
Similarity Flooding tokeniser. _camel_case_split now handles snake_case, SCREAMING_SNAKE, hyphens, and embedded digits.
Cupid datatype compatibility. Replaced static DATATYPE_COMPATIBILITY_TABLE dict with generic family-based classifier that handles arbitrary SQL type strings (varchar(255), bigint, etc.) via keyword matching.
Data source utilities. get_encoding handles chardet returning None; get_delimiter catches csv.Sniffer failures on malformed input.

Polars support

PolarsTable / PolarsColumn — new BaseTable/BaseColumn adapters in valentine/data_sources/polars/ for Polars DataFrames. Install with pip install valentine[polars].
Auto-detection in valentine_match — pandas and Polars frames can be freely mixed in the same call. The function detects the frame type via type(obj).__module__ without importing polars eagerly.
Full test coverage — 24 tests in test_polars.py verifying PolarsTable properties, all matchers with Polars input, pandas↔Polars equivalence, and mixed-framework matching.
Documentation updated — README, getting-started, API reference, FAQ, and examples all document Polars support.
CI updated — both workflows now install the polars extra so polars tests actually run (with || true fallback for Python versions without polars wheels).

Dead code removal

After the vectorised add_values rewrite the legacy paths are unreachable:

QuantileHistogram.bucket_binary_search
QuantileHistogram.normalize_values
QuantileHistogram.calc_dist_matrix
The 7-tuple branch in process_columns (callers always pass 8-tuples now)

Experiments tried and rejected

Documented here so they don't get re-attempted:

SiblingsCM in build_matchers — +41% Coma time, zero F1 movement (flat schemas have no sibling structure).
Datatype gate as score multiplier — every floor (0.5 / 0.8 / 0.9) regressed Coma F1, because real ground-truth matches frequently span types (varchar IDs ↔ int IDs, varchar dates ↔ date dates).
WordNet third arm in NameCM — −0.037 F1, +87% time. maximum combination doesn't fence off the noise: WordNet's high-scoring false positives on common tokens still win bidirectional selection.
Hungarian algorithm for one-to-one matching — regressed Cupid F1 (0.833→0.667) because global optimality "wastes" high-confidence pairs on wrong matches in noisy similarity matrices. Greedy is better here.

Tests / coverage

51 new tests across test_coverage_gaps.py, test_brittleness.py, and test_polars.py. 273 tests pass total.

File	Before	After
`coma/similarity/tfidf.py`	87.50%	99%
`coma/similarity/tokens.py`	(new file)	100%
`cupid/linguistic_matching.py`	89.23%	96%
`distribution_based/clustering_utils.py`	70.00%	98%
`distribution_based/column_model.py`	86.95%	100%
`distribution_based/quantile_histogram.py`	80.64%	96%
`jaccard_distance/jaccard_distance.py`	90.47%	100%

Test plan

pytest -q tests — 273 passed
python -m unittest discover tests — 66 passed (pytest-only tests excluded, polars tests skipped gracefully when polars not installed)
Full NYU Open Data benchmark — see table above
Coverage report on every touched file
Each accuracy change re-benched independently to catch regressions
Polars equivalence tests — same data produces identical match results under pandas and Polars

🤖 Generated with Claude Code

codecov · 2026-04-08T06:26:02Z

Codecov Report

❌ Patch coverage is 93.33333% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.67%. Comparing base (a0d488d) to head (d6a9e32).

Files with missing lines	Patch %	Lines
valentine/algorithms/cupid/linguistic_matching.py	91.78%	6 Missing ⚠️
valentine/data_sources/polars/polars_table.py	90.76%	4 Missing and 2 partials ⚠️
valentine/__init__.py	71.42%	3 Missing and 1 partial ⚠️
valentine/algorithms/cupid/__init__.py	86.20%	2 Missing and 2 partials ⚠️
.../algorithms/distribution_based/clustering_utils.py	81.25%	2 Missing and 1 partial ⚠️
...lgorithms/distribution_based/quantile_histogram.py	90.00%	1 Missing and 2 partials ⚠️
valentine/algorithms/coma/similarity/tokens.py	97.33%	1 Missing and 1 partial ⚠️
valentine/data_sources/__init__.py	75.00%	2 Missing ⚠️
valentine/data_sources/base_table.py	83.33%	1 Missing and 1 partial ⚠️
valentine/algorithms/coma/similarity/tfidf.py	98.14%	0 Missing and 1 partial ⚠️
... and 1 more

Additional details and impacted files

@@            Coverage Diff             @@
##           master      #96      +/-   ##
==========================================
+ Coverage   95.44%   95.67%   +0.22%     
==========================================
  Files          50       53       +3     
  Lines        2351     2633     +282     
  Branches      366      398      +32     
==========================================
+ Hits         2244     2519     +275     
- Misses         64       72       +8     
+ Partials       43       42       -1

Flag	Coverage Δ
unit	`95.67% <93.33%> (+0.22%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
valentine/algorithms/coma/coma.py	`100.00% <100.00%> (ø)`
valentine/algorithms/coma/matchers.py	`95.60% <100.00%> (+0.09%)`	⬆️
valentine/algorithms/coma/schema.py	`100.00% <100.00%> (ø)`
valentine/algorithms/coma/similarity/trigram.py	`94.11% <100.00%> (+1.26%)`	⬆️
...alentine/algorithms/cupid/structural_similarity.py	`96.29% <100.00%> (+8.79%)`	⬆️
...tine/algorithms/distribution_based/column_model.py	`100.00% <100.00%> (+7.69%)`	⬆️
...lgorithms/distribution_based/distribution_based.py	`99.02% <100.00%> (+0.03%)`	⬆️
...lentine/algorithms/distribution_based/emd_utils.py	`92.30% <100.00%> (+5.35%)`	⬆️
...ne/algorithms/jaccard_distance/jaccard_distance.py	`100.00% <100.00%> (+3.84%)`	⬆️
...lentine/algorithms/similarity_flooding/__init__.py	`100.00% <100.00%> (ø)`
... and 16 more

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…inor improvements

…nstalled The CI matrix runs `python -m unittest discover tests` without polars. `pytest.importorskip` at module level raises a Skipped exception that unittest treats as an ImportError, failing the entire run. Replace with try/except that guards all polars-dependent imports and data loading. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The coverage CI job runs pytest without polars installed. The previous fix only handled unittest discover (import crash) but pytest still discovered the test classes and hit NameError on undefined symbols. Use @pytest.mark.skipif on each class so both runners skip gracefully. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Previously the polars tests were silently skipped in every CI job because polars was never installed. Add `pip install ".[polars]" || true` to the coverage job and all matrix test jobs. The `|| true` handles Python versions where polars doesn't ship a wheel yet (e.g. 3.14 pre-release) — the skip markers in test_polars.py handle that gracefully. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- docs/example.md: fix code block title to match renamed file - cupid/__init__.py: remove empty DATATYPE_COMPATIBILITY_TABLE (was kept as backwards-compat stub but is a silent-failure trap) - utils/utils.py: remove unused is_sorted function - test_utils.py: remove corresponding test Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

improve performance across the board

c91a019

kPsarakis self-assigned this Apr 8, 2026

remove dead code and update tests

3e86b62

kPsarakis changed the title ~~improve performance across the board~~ Performance: 35× speedup across matchers + Coma accuracy improvements Apr 8, 2026

kPsarakis requested a review from chrisk21 April 8, 2026 07:07

kPsarakis and others added 5 commits April 8, 2026 09:09

apply ruff rules

effd12d

make the project less brittle, add polars data source, and add more m…

6a67f30

…inor improvements

kPsarakis changed the title ~~Performance: 35× speedup across matchers + Coma accuracy improvements~~ Performance: 29× speedup across matchers + Coma accuracy improvements Apr 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance: 29× speedup across matchers + Coma accuracy improvements#96

Performance: 29× speedup across matchers + Coma accuracy improvements#96
kPsarakis wants to merge 8 commits intomasterfrom
improve-performance

kPsarakis commented Apr 8, 2026 •

edited

Loading

Uh oh!

codecov bot commented Apr 8, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kPsarakis commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Performance: before / after

What changed

Performance

Coma accuracy (+0.055 F1)

Robustness fixes

Polars support

Dead code removal

Experiments tried and rejected

Tests / coverage

Test plan

Uh oh!

codecov bot commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

kPsarakis commented Apr 8, 2026 •

edited

Loading

codecov bot commented Apr 8, 2026 •

edited

Loading