Performance: 29× speedup across matchers + Coma accuracy improvements #96
Open
…inor improvements
…nstalled The CI matrix runs `python -m unittest discover tests` without polars. `pytest.importorskip` at module level raises a Skipped exception that unittest treats as an ImportError, failing the entire run. Replace with try/except that guards all polars-dependent imports and data loading. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The coverage CI job runs pytest without polars installed. The previous fix only handled unittest discover (import crash) but pytest still discovered the test classes and hit NameError on undefined symbols. Use @pytest.mark.skipif on each class so both runners skip gracefully. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously the polars tests were silently skipped in every CI job because polars was never installed. Add `pip install ".[polars]" || true` to the coverage job and all matrix test jobs. The `|| true` handles Python versions where polars doesn't ship a wheel yet (e.g. 3.14 pre-release) — the skip markers in test_polars.py handle that gracefully. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- docs/example.md: fix code block title to match renamed file
- cupid/__init__.py: remove empty DATATYPE_COMPATIBILITY_TABLE (was kept as backwards-compat stub but is a silent-failure trap)
- utils/utils.py: remove unused is_sorted function
- test_utils.py: remove corresponding test
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
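The import guard described in the commit messages above can be sketched roughly as follows. This is a minimal illustration, not the PR's `test_polars.py`: it uses `unittest.skipUnless` from the standard library as a stand-in for the `@pytest.mark.skipif` markers, and the test class and `run_suite` helper are hypothetical names.

```python
import unittest

# Guard the optional dependency so importing this module never fails.
try:
    import polars as pl
    HAS_POLARS = True
except ImportError:
    pl = None
    HAS_POLARS = False


@unittest.skipUnless(HAS_POLARS, "polars is not installed")
class PolarsFrameTest(unittest.TestCase):  # hypothetical test class
    def test_frame_has_expected_columns(self):
        frame = pl.DataFrame({"id": [1, 2], "name": ["a", "b"]})
        self.assertEqual(frame.columns, ["id", "name"])


def run_suite() -> unittest.TestResult:
    """Run the guarded tests; they are reported as skipped without polars."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PolarsFrameTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the skip decision is attached to the class rather than raised at import time, both `unittest discover` and pytest can collect the module whether or not polars is available.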
Summary
Resolves #88
Across-the-board performance and accuracy work on every matcher, plus robustness fixes, Polars support, dead-code removal, documentation updates, and comprehensive test coverage.
Headline: ~29× wall-clock speedup on the full NYU Open Data benchmark (1048s → 36s) with no F1 regressions and a +0.055 Coma F1 improvement from targeted accuracy work.
Performance: before / after
Full NYU Open Data benchmark, 10 dataset pairs, single machine, sequential.
JaccardDistanceMatcher F1 delta (−0.009) is within run-to-run noise. SimilarityFlooding is slightly slower due to the improved tokeniser and NodeID collision fix processing more tokens.
What changed
Performance
- `TfidfCorpus` builds float32 sparse CSR matrices, caches per-column vectorisations on object identity, and memoises pair-level similarities on a symmetric `(id, id)` key. `InstancesCM` evaluates `InstancesDirect` and `InstancesAll` on the same list, so the pair cache collapses both calls into one matmul.
- … `wn.synsets` and `wn.wup_similarity` (symmetric key), plus the English stopword frozenset and the `all_lemma_names` corpus walk. The cold-path lemma walk used to dominate; it's now paid once per process.
- `QuantileHistogram.add_values` replaces a Python `bucket_binary_search` loop with a single `np.searchsorted` + `np.bincount` over precomputed lower/upper-bound arrays. `__slots__` on the histogram, `lru_cache` on the constant `_bucket_distance_matrix(n)`, and a global ranks pickle cache to avoid re-unpickling per column.
- … `rapidfuzz.process.cdist`, dispatched to the smaller side (rows × cols favours small rows), with `score_cutoff=threshold` so rapidfuzz can short-circuit. This is the matcher that moved most in absolute time (−735 s).
- `BaseTable.get_data_type` now treats pandas `"str"` / `"string"` dtypes as text, not as unknown. Free F1 wins for Cupid and SF (which read `data_type`) and a prerequisite for the Coma accuracy work.
Coma accuracy (+0.055 F1)
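The symmetric pair cache mentioned in the Performance list can be illustrated with a small sketch. `PairCache` and `jaccard` are hypothetical names chosen for this example, not the PR's `TfidfCorpus` API, and the sketch assumes the similarity function is symmetric. It also keeps the cached objects alive, so their `id()` values cannot be reused while an entry exists.

```python
from typing import Callable, Dict, List, Tuple


def jaccard(a: List[str], b: List[str]) -> float:
    """Plain Jaccard similarity, used here only as an example metric."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


class PairCache:
    """Memoise a symmetric similarity function on object identity."""

    def __init__(self, similarity: Callable[[List[str], List[str]], float]) -> None:
        self._similarity = similarity
        self._cache: Dict[Tuple[int, int], Tuple[object, object, float]] = {}
        self.misses = 0

    def get(self, a: List[str], b: List[str]) -> float:
        # Order the two ids so (a, b) and (b, a) share one cache entry.
        key = (id(a), id(b)) if id(a) <= id(b) else (id(b), id(a))
        hit = self._cache.get(key)
        if hit is not None:
            return hit[2]
        self.misses += 1
        value = self._similarity(a, b)
        # Storing a and b keeps them alive, so their ids stay unique.
        self._cache[key] = (a, b, value)
        return value
```

With this shape, two callers that score the same pair of column value lists in either order pay for the underlying computation once.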
Four small additions, each tested independently against the full bench:
- `NameCM`. New `tokens.py` splits column names into `camelCase` / `snake_case` / digit runs and computes a soft Jaccard with generic abbreviation matching (`dept`→`department`, `fname`→`firstname`). Folded into `NameCM` with `maximum` so it can only help and never dilutes the trigram score on well-formed names.
- … `curb` inside `pap_curb_pri`.
- … `average` to `weighted`, with `InstancesCM` getting a 1.3× weight (tuned empirically: 1.5 over-weights, 1.2 under-weights). Schema matchers stay at 1.0.
Robustness fixes
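The token-based name comparison described under Coma accuracy might look roughly like the sketch below. It is not the PR's `tokens.py`: the function names, the subsequence-based abbreviation rule, the greedy matching, and the 0.9 partial-credit score are all assumptions made for illustration.

```python
import re
from typing import List

ABBREVIATION_SCORE = 0.9  # assumed partial credit for an abbreviation match

# Acronym runs, capitalised or lowercase words, and digit runs.
_TOKEN_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def tokenize(name: str) -> List[str]:
    """Split camelCase, snake_case and digit runs into lowercase tokens."""
    return [token.lower() for token in _TOKEN_RE.findall(name)]


def is_abbreviation(short: str, long: str) -> bool:
    """True if short starts like long and is an in-order subsequence of it."""
    if len(short) >= len(long) or short[0] != long[0]:
        return False
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def token_score(a: str, b: str) -> float:
    if a == b:
        return 1.0
    short, long = sorted((a, b), key=len)
    return ABBREVIATION_SCORE if is_abbreviation(short, long) else 0.0


def soft_jaccard(name_a: str, name_b: str) -> float:
    """Jaccard over tokens where abbreviation matches earn partial overlap."""
    tokens_a, tokens_b = set(tokenize(name_a)), set(tokenize(name_b))
    if not tokens_a or not tokens_b:
        return 0.0  # empty token set on either side: no evidence
    unmatched = set(tokens_b)
    overlap = 0.0
    for token in sorted(tokens_a):
        # Greedily pair each token with its best still-unmatched partner.
        best = max(sorted(unmatched),
                   key=lambda other: token_score(token, other),
                   default=None)
        if best is not None and token_score(token, best) > 0.0:
            overlap += token_score(token, best)
            unmatched.discard(best)
    return overlap / (len(tokens_a) + len(tokens_b) - overlap)
```

Combining such a score with an existing trigram score via `max(trigram_score, soft_jaccard(a, b))` mirrors the "can only help" property: the token score can raise a pair's similarity but never lower it.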
- `name_similarity_tokens` returns 0.0 when either token set is empty; `compute_ssim` returns 0.0 when both nodes have empty leaf lists.
- `quantile_emd` returns `inf` when histogram values sum to zero instead of dividing by zero.
- … `id()` key to detect `id()` reuse after GC, preventing stale cache hits.
- `NodeID` collision. Replaced plain `"NodeID"` string prefix with null-byte sentinel `"\x00NID"` so columns named `"NodeID*"` don't collide with structural graph nodes.
- `_camel_case_split` now handles `snake_case`, `SCREAMING_SNAKE`, hyphens, and embedded digits.
- … `DATATYPE_COMPATIBILITY_TABLE` dict with generic family-based classifier that handles arbitrary SQL type strings (`varchar(255)`, `bigint`, etc.) via keyword matching.
- `get_encoding` handles chardet returning `None`; `get_delimiter` catches `csv.Sniffer` failures on malformed input.
Polars support
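A family-based classifier of the kind named in the robustness fixes could be sketched as below. The family names, keyword lists, and both function names are hypothetical and are not taken from the PR. Plain substring matching also has known blind spots (for example, `point` contains `int`), so this only shows the general shape.

```python
from typing import Tuple

# Hypothetical families and keywords; earlier entries win when several match.
TYPE_FAMILIES: Tuple[Tuple[str, Tuple[str, ...]], ...] = (
    ("datetime", ("timestamp", "datetime", "date", "time", "interval")),
    ("boolean", ("bool",)),
    ("float", ("float", "double", "real", "decimal", "numeric")),
    ("integer", ("int", "serial")),
    ("text", ("char", "text", "string", "str")),
)


def type_family(sql_type: str) -> str:
    """Map an arbitrary SQL type string to a coarse family by keyword."""
    lowered = sql_type.strip().lower()
    for family, keywords in TYPE_FAMILIES:
        if any(keyword in lowered for keyword in keywords):
            return family
    return "unknown"


def compatible(type_a: str, type_b: str) -> bool:
    """Treat two types as compatible when they share a known family."""
    family_a, family_b = type_family(type_a), type_family(type_b)
    return family_a != "unknown" and family_a == family_b
```

Unlike a fixed lookup table, unseen spellings such as `varchar(255)` fall into a family without needing a dedicated entry, and unrecognised types degrade to `"unknown"` rather than raising.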
- `PolarsTable` / `PolarsColumn`: new `BaseTable` / `BaseColumn` adapters in `valentine/data_sources/polars/` for Polars DataFrames. Install with `pip install valentine[polars]`.
- `valentine_match`: pandas and Polars frames can be freely mixed in the same call. The function detects the frame type via `type(obj).__module__` without importing polars eagerly.
- … `test_polars.py` verifying PolarsTable properties, all matchers with Polars input, pandas↔Polars equivalence, and mixed-framework matching.
- … `|| true` fallback for Python versions without polars wheels).
Dead code removal
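Detecting the frame library from `type(obj).__module__`, as described in the Polars support list, can be sketched without importing either library. `frame_backend` is a hypothetical helper, not the actual dispatch code in `valentine_match`, and the stand-in class exists only so the sketch runs without polars installed.

```python
def frame_backend(obj: object) -> str:
    """Return "pandas" or "polars" based on the defining module of obj's type."""
    root = type(obj).__module__.split(".", 1)[0]
    if root in ("pandas", "polars"):
        return root
    raise TypeError(f"unsupported frame type: {type(obj).__name__}")


# Stand-in class whose module path imitates a Polars DataFrame.
FakePolarsFrame = type("DataFrame", (), {"__module__": "polars.dataframe.frame"})

backend = frame_backend(FakePolarsFrame())  # "polars"
```

Since only the module name string is inspected, the check costs nothing for users who never installed polars; the real import can be deferred until a Polars frame is actually seen.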
After the vectorised `add_values` rewrite the legacy paths are unreachable:
- `QuantileHistogram.bucket_binary_search`
- `QuantileHistogram.normalize_values`
- `QuantileHistogram.calc_dist_matrix`
- `process_columns` (callers always pass 8-tuples now)
Experiments tried and rejected
Documented here so they don't get re-attempted:
- … `build_matchers`: +41% Coma time, zero F1 movement (flat schemas have no sibling structure).
- … `NameCM`: −0.037 F1, +87% time. `maximum` combination doesn't fence off the noise: WordNet's high-scoring false positives on common tokens still win bidirectional selection.
Tests / coverage
51 new tests across `test_coverage_gaps.py`, `test_brittleness.py`, and `test_polars.py`. 273 tests pass total.
- `coma/similarity/tfidf.py`
- `coma/similarity/tokens.py`
- `cupid/linguistic_matching.py`
- `distribution_based/clustering_utils.py`
- `distribution_based/column_model.py`
- `distribution_based/quantile_histogram.py`
- `jaccard_distance/jaccard_distance.py`
Test plan
- `pytest -q tests`: 273 passed
- `python -m unittest discover tests`: 66 passed (pytest-only tests excluded, polars tests skipped gracefully when polars not installed)
🤖 Generated with Claude Code