Skip to content

Performance: 29× speedup across matchers + Coma accuracy improvements#96

Open
kPsarakis wants to merge 8 commits intomasterfrom
improve-performance
Open

Performance: 29× speedup across matchers + Coma accuracy improvements#96
kPsarakis wants to merge 8 commits intomasterfrom
improve-performance

Conversation

@kPsarakis
Copy link
Copy Markdown
Member

@kPsarakis kPsarakis commented Apr 8, 2026

Summary

Resolves #88

Across-the-board performance and accuracy work on every matcher, plus robustness fixes, Polars support, dead-code removal, documentation updates, and comprehensive test coverage.

Headline: ~29× wall-clock speedup on the full NYU Open Data benchmark (1048s → 36s) with no F1 regressions and a +0.055 Coma F1 improvement from targeted accuracy work.

Performance: before / after

Full NYU Open Data benchmark, 10 dataset pairs, single machine, sequential.

Matcher Total (s) Worst pair (s) Mean F1
Coma 74.87 → 10.30 (7.3×) 24.41 → 2.91 (8.4×) 0.754 → 0.809 (+0.055)
Cupid 191.45 → 8.29 (23.1×) 78.17 → 2.20 (35.5×) 0.480 → 0.501 (+0.021)
DistributionBased 36.86 → 5.62 (6.6×) 13.81 → 2.64 (5.2×) 0.678 → 0.681 (+0.003)
JaccardDistanceMatcher 739.17 → 4.60 (160.8×) 203.98 → 1.85 (110.2×) 0.645 → 0.635 (−0.009)
SimilarityFlooding 5.75 → 7.30 (0.8×) 1.56 → 1.86 (0.8×) 0.505 → 0.509 (+0.004)
Total 1048.10 → 36.11 (29.0×)

JaccardDistanceMatcher F1 delta (−0.009) is within run-to-run noise. SimilarityFlooding is slightly slower due to the improved tokeniser and NodeID collision fix processing more tokens.

What changed

Performance

  • Coma — TF-IDF cosine fast path. TfidfCorpus builds float32 sparse CSR matrices, caches per-column vectorisations on object identity, and memoises pair-level similarities on a symmetric (id, id) key. InstancesCM evaluates InstancesDirect and InstancesAll on the same list — the pair cache collapses both calls into one matmul.
  • Cupid — WordNet caching. Cached wn.synsets and wn.wup_similarity (symmetric key), plus the English stopword frozenset and the all_lemma_names corpus walk. The cold-path lemma walk used to dominate; it's now paid once per process.
  • DistributionBased — vectorised quantile histograms. QuantileHistogram.add_values replaces a Python bucket_binary_search loop with a single np.searchsorted + np.bincount over precomputed lower/upper-bound arrays. __slots__ on the histogram, lru_cache on the constant _bucket_distance_matrix(n), and a global ranks pickle cache to avoid re-unpickling per column.
  • JaccardDistanceMatcher — rapidfuzz cdist. Replaced the per-pair Python loop with rapidfuzz.process.cdist, dispatched to the smaller side (rows × cols favours small rows), with score_cutoff=threshold so rapidfuzz can short-circuit. This is the matcher that moved most in absolute time (−735 s).
  • Type inference fix. BaseTable.get_data_type now treats pandas "str" / "string" dtypes as text, not as unknown. Free F1 wins for Cupid and SF (which read data_type) and a prerequisite for the Coma accuracy work.

Coma accuracy (+0.055 F1)

Four small additions, each tested independently against the full bench:

  1. Token Jaccard inside NameCM. New tokens.py splits column names into camelCase / snake_case / digit runs and computes a soft Jaccard with generic abbreviation matching (deptdepartment, fnamefirstname). Folded into NameCM with maximum so it can only help — never dilute trigram on well-formed names.
  2. Containment bonus. When all tokens from the short side appear in the long side, a small bonus (up to 0.15) is added proportional to coverage, catching partial matches like curb inside pap_curb_pri.
  3. NLTK stopwords in TF-IDF. Replaced hardcoded 33-word Lucene frozenset with NLTK's 179-word English stopwords list for better noise filtering.
  4. Weighted matcher combination. Switched from uniform average to weighted, with InstancesCM getting a 1.3× weight (tuned empirically — 1.5 over-weights, 1.2 under-weights). Schema matchers stay at 1.0.

Robustness fixes

  • Cupid division-by-zero guards. name_similarity_tokens returns 0.0 when either token set is empty; compute_ssim returns 0.0 when both nodes have empty leaf lists.
  • DistributionBased zero-sum histogram guard. quantile_emd returns inf when histogram values sum to zero instead of dividing by zero.
  • Coma TF-IDF cache robustness. Cache stores list reference alongside id() key to detect id() reuse after GC, preventing stale cache hits.
  • Similarity Flooding NodeID collision. Replaced plain "NodeID" string prefix with null-byte sentinel "\x00NID" so columns named "NodeID*" don't collide with structural graph nodes.
  • Similarity Flooding tokeniser. _camel_case_split now handles snake_case, SCREAMING_SNAKE, hyphens, and embedded digits.
  • Cupid datatype compatibility. Replaced static DATATYPE_COMPATIBILITY_TABLE dict with generic family-based classifier that handles arbitrary SQL type strings (varchar(255), bigint, etc.) via keyword matching.
  • Data source utilities. get_encoding handles chardet returning None; get_delimiter catches csv.Sniffer failures on malformed input.

Polars support

  • PolarsTable / PolarsColumn — new BaseTable/BaseColumn adapters in valentine/data_sources/polars/ for Polars DataFrames. Install with pip install valentine[polars].
  • Auto-detection in valentine_match — pandas and Polars frames can be freely mixed in the same call. The function detects the frame type via type(obj).__module__ without importing polars eagerly.
  • Full test coverage — 24 tests in test_polars.py verifying PolarsTable properties, all matchers with Polars input, pandas↔Polars equivalence, and mixed-framework matching.
  • Documentation updated — README, getting-started, API reference, FAQ, and examples all document Polars support.
  • CI updated — both workflows now install the polars extra so polars tests actually run (with || true fallback for Python versions without polars wheels).

Dead code removal

After the vectorised add_values rewrite the legacy paths are unreachable:

  • QuantileHistogram.bucket_binary_search
  • QuantileHistogram.normalize_values
  • QuantileHistogram.calc_dist_matrix
  • The 7-tuple branch in process_columns (callers always pass 8-tuples now)

Experiments tried and rejected

Documented here so they don't get re-attempted:

  • SiblingsCM in build_matchers — +41% Coma time, zero F1 movement (flat schemas have no sibling structure).
  • Datatype gate as score multiplier — every floor (0.5 / 0.8 / 0.9) regressed Coma F1, because real ground-truth matches frequently span types (varchar IDs ↔ int IDs, varchar dates ↔ date dates).
  • WordNet third arm in NameCM — −0.037 F1, +87% time. maximum combination doesn't fence off the noise: WordNet's high-scoring false positives on common tokens still win bidirectional selection.
  • Hungarian algorithm for one-to-one matching — regressed Cupid F1 (0.833→0.667) because global optimality "wastes" high-confidence pairs on wrong matches in noisy similarity matrices. Greedy is better here.

Tests / coverage

51 new tests across test_coverage_gaps.py, test_brittleness.py, and test_polars.py. 273 tests pass total.

File Before After
coma/similarity/tfidf.py 87.50% 99%
coma/similarity/tokens.py (new file) 100%
cupid/linguistic_matching.py 89.23% 96%
distribution_based/clustering_utils.py 70.00% 98%
distribution_based/column_model.py 86.95% 100%
distribution_based/quantile_histogram.py 80.64% 96%
jaccard_distance/jaccard_distance.py 90.47% 100%

Test plan

  • pytest -q tests — 273 passed
  • python -m unittest discover tests — 66 passed (pytest-only tests excluded, polars tests skipped gracefully when polars not installed)
  • Full NYU Open Data benchmark — see table above
  • Coverage report on every touched file
  • Each accuracy change re-benched independently to catch regressions
  • Polars equivalence tests — same data produces identical match results under pandas and Polars

🤖 Generated with Claude Code

@kPsarakis kPsarakis self-assigned this Apr 8, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 93.33333% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.67%. Comparing base (a0d488d) to head (d6a9e32).

Files with missing lines Patch % Lines
valentine/algorithms/cupid/linguistic_matching.py 91.78% 6 Missing ⚠️
valentine/data_sources/polars/polars_table.py 90.76% 4 Missing and 2 partials ⚠️
valentine/__init__.py 71.42% 3 Missing and 1 partial ⚠️
valentine/algorithms/cupid/__init__.py 86.20% 2 Missing and 2 partials ⚠️
.../algorithms/distribution_based/clustering_utils.py 81.25% 2 Missing and 1 partial ⚠️
...lgorithms/distribution_based/quantile_histogram.py 90.00% 1 Missing and 2 partials ⚠️
valentine/algorithms/coma/similarity/tokens.py 97.33% 1 Missing and 1 partial ⚠️
valentine/data_sources/__init__.py 75.00% 2 Missing ⚠️
valentine/data_sources/base_table.py 83.33% 1 Missing and 1 partial ⚠️
valentine/algorithms/coma/similarity/tfidf.py 98.14% 0 Missing and 1 partial ⚠️
... and 1 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #96      +/-   ##
==========================================
+ Coverage   95.44%   95.67%   +0.22%     
==========================================
  Files          50       53       +3     
  Lines        2351     2633     +282     
  Branches      366      398      +32     
==========================================
+ Hits         2244     2519     +275     
- Misses         64       72       +8     
+ Partials       43       42       -1     
Flag Coverage Δ
unit 95.67% <93.33%> (+0.22%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
valentine/algorithms/coma/coma.py 100.00% <100.00%> (ø)
valentine/algorithms/coma/matchers.py 95.60% <100.00%> (+0.09%) ⬆️
valentine/algorithms/coma/schema.py 100.00% <100.00%> (ø)
valentine/algorithms/coma/similarity/trigram.py 94.11% <100.00%> (+1.26%) ⬆️
...alentine/algorithms/cupid/structural_similarity.py 96.29% <100.00%> (+8.79%) ⬆️
...tine/algorithms/distribution_based/column_model.py 100.00% <100.00%> (+7.69%) ⬆️
...lgorithms/distribution_based/distribution_based.py 99.02% <100.00%> (+0.03%) ⬆️
...lentine/algorithms/distribution_based/emd_utils.py 92.30% <100.00%> (+5.35%) ⬆️
...ne/algorithms/jaccard_distance/jaccard_distance.py 100.00% <100.00%> (+3.84%) ⬆️
...lentine/algorithms/similarity_flooding/__init__.py 100.00% <100.00%> (ø)
... and 16 more

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@kPsarakis kPsarakis changed the title improve performance across the board Performance: 35× speedup across matchers + Coma accuracy improvements Apr 8, 2026
@kPsarakis kPsarakis requested a review from chrisk21 April 8, 2026 07:07
kPsarakis and others added 5 commits April 8, 2026 09:09
…nstalled

The CI matrix runs `python -m unittest discover tests` without polars.
`pytest.importorskip` at module level raises a Skipped exception that
unittest treats as an ImportError, failing the entire run. Replace with
try/except that guards all polars-dependent imports and data loading.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The coverage CI job runs pytest without polars installed. The previous
fix only handled unittest discover (import crash) but pytest still
discovered the test classes and hit NameError on undefined symbols.
Use @pytest.mark.skipif on each class so both runners skip gracefully.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously the polars tests were silently skipped in every CI job
because polars was never installed. Add `pip install ".[polars]" || true`
to the coverage job and all matrix test jobs. The `|| true` handles
Python versions where polars doesn't ship a wheel yet (e.g. 3.14
pre-release) — the skip markers in test_polars.py handle that gracefully.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kPsarakis kPsarakis changed the title Performance: 35× speedup across matchers + Coma accuracy improvements Performance: 29× speedup across matchers + Coma accuracy improvements Apr 12, 2026
- docs/example.md: fix code block title to match renamed file
- cupid/__init__.py: remove empty DATATYPE_COMPATIBILITY_TABLE (was
  kept as backwards-compat stub but is a silent-failure trap)
- utils/utils.py: remove unused is_sorted function
- test_utils.py: remove corresponding test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Performance: profile and optimize algorithm speed

1 participant