feat: CID-optimal consensus tree search (CIDConsensus) by ms609 · Pull Request #213 · ms609/TreeSearch

ms609 · 2026-03-23T08:10:10Z

Summary

CIDConsensus() finds a consensus tree that minimizes mean Clustering Information Distance (CID) to a set of input trees, using the existing C++ driven search engine.

Architecture

Dual-layer scoring for performance:

Fast layer: MRP (Matrix Representation with Parsimony) binary characters scored via Fitch for incremental screening during TBR/SPR. IW with k=7 by default (empirically maximizes rank correlation with CID, ρ ≈ 0.688).
Accurate layer: Full CID scoring via LAP (Linear Assignment Problem) matching for move acceptance decisions.

Two-baseline tracking in TBR, SPR, and drift maintains separate mrp_baseline (Fitch) and best_score (CID) values. Sectorial search (RSS/XSS) uses Fitch on MRP internally, verifies with full CID after reinsertion.

New files

File	Purpose
`R/CIDConsensus.R`	R-level API, MRP dataset construction, rogue taxon dropping
`src/ts_cid.h` / `ts_cid.cpp`	CID scoring engine: LAP solver, MCI computation, split hashing
`tests/testthat/test-CIDConsensus.R`	R-level tests (504 lines)
`tests/testthat/test-ts-cid.R`	C++ engine tests (267 lines)

Modified files

CID mode wired into all search modules:

ts_tbr.cpp, ts_search.cpp, ts_drift.cpp: two-baseline tracking, score_budget early termination
ts_sector.cpp: CID mode dispatch for sectorial search
ts_ratchet.cpp: CID-aware ratchet
ts_parallel.cpp: per-thread CidData deep copy (fixes data race)
ts_rcpp.cpp: ts_cid_consensus() Rcpp bridge function

Performance optimizations

Hash-based exact match lookup (O(n) vs O(n²))
Precomputed per-split log2 values
Persistent candidate split buffers (zero-alloc reuse)
Presized LAP scratch buffers
Bounded cid_score() with early termination (wired into TBR/SPR)
Batch candidate log2 reuse

Multi-threaded benchmarks (coalescent-derived trees):

40 tips / 50 trees: 1.43× speedup (2 threads)
80 tips / 100 trees: 1.73× speedup (2 threads)

Tests

All 9928 tests pass (0 failures, 0 warnings). Includes 771 new CID-specific assertions across R-level and C++ test files.

New exported function CIDConsensus() finds consensus trees that minimize mean Clustering Information Distance to a set of input trees. Phase 1: Binary search via existing Ratchet/TreeSearch infrastructure - Custom CID scorer (.CIDScorer) plugs into TreeScorer interface - CID bootstrapper (.CIDBootstrap) resamples input trees with replacement - cidData S3 class (environment-backed) for reference semantics - Supports SPR, TBR, NNI, and ratchet search methods - Any TreeDist metric can serve as objective (CID, MCI, SPI, etc.) Phase 2: Collapse/resolve refinement - .CollapseRefine() greedily collapses edges that reduce CID - Tries all pairwise polytomy resolutions to re-add splits - Produces non-binary trees when optimal - Enabled by default (collapse=TRUE parameter) Test suite: 55 assertions (Tier 2, skip_on_cran)

For the default ClusteringInfoDistance metric, precompute input tree splits and clustering entropies once in .MakeCIDData(). The scorer then computes CID as CE(cand) + mean(CE(inputs)) - 2*mean(MCI), avoiding redundant as.Splits() conversion of the N input trees on each candidate evaluation. Benchmark (20 tips, 50 trees, SPR 100 iter): Before: 1.55s (6.8 ms/scorer call) After: 0.25s (0.9 ms/scorer call) Speedup: ~6x The CIDBootstrap also resamples the precomputed splits/CEs to maintain consistency during ratchet perturbation. Remaining bottleneck: N LAP solutions per candidate in MutualClusteringInfoSplits (C++ in TreeDist). This is irreducible without a warm-start LAP solver.

Replace vapply over MutualClusteringInfoSplits with a for-loop over unclass'd raw split matrices. This eliminates per-iteration function call overhead and S3 dispatch. Total speedup vs naive ClusteringInfoDistance: 8.1x Per-scorer-call: 6.8 ms → 0.76 ms (20 tips, 50 trees) SPR search: 1.55s → 0.19s

LAP solver (Jonker-Volgenant), mutual clustering information, clustering entropy, MRP dataset builder, and weight management for CID consensus tree search. Verified: C++ CID matches R ClusteringInfoDistance to 1e-12. All 28 ts-cid + 121 CIDConsensus tests pass.

cid_score() now returns -mean(MCI) (negated for minimization infrastructure). The normalize parameter is removed from the R API. User-facing score is positive MCI (higher = better), converted at the CIDConsensus() boundary. C++ changes: - cid_score(): single branch computing -weighted_mean(MCI) - Removed normalize/non-normalize branching - Early termination uses cand_CE upper bound R changes: - Removed normalize param from CIDConsensus(), .MakeCIDData(), .CIDDrivenSearch(), .PrescreenMarginalNID(), .BestInsertion() - .CIDScoreFast() returns -mean(MCI) instead of mean(CID) - CIDConsensus() negates internal score to positive MCI on output - Verbosity messages show MCI instead of score Test changes: - Updated score direction checks (higher MCI = better) - Removed normalized scoring tests - Added mean(MutualClusteringInfo) verification test - All 148 CID tests pass

- TBR/SPR: set score_budget = best_score + eps before candidate full_rescore(), reset to HUGE_VAL after. Enables cid_score() early termination when candidate can't beat current best. - Drift: no budget (needs exact CID scores for RFD computation). - ts_parallel.cpp: deep-copy CidData per worker thread so mutable scratch buffers (lap_scratch, cand_tip_bits, cand_buf, score_budget) don't race between threads. - Benchmark verified: 2-thread CID produces identical scores to single-thread; 1.3-1.7x speedup across 40-80 tip scenarios. - Mark T-150 complete.

- ts_rcpp.cpp: add ts_cid_consensus() Rcpp bridge function - ts_drift.cpp: CID two-baseline tracking (mrp_baseline + CID score) - ts_sector.cpp: CID mode dispatch for sectorial search - ts_ratchet.cpp: CID-aware ratchet scoring - ts_fitch.cpp: CID support in score helpers - ts_data.h: CID fields in DataSet - RcppExports, TreeSearch-init.c, NAMESPACE: register new exports - man/CIDConsensus.Rd: documentation - inst/test-data/takazawa/: reference data for CID tests - inst/analysis/: MRP weighting analysis

- RcppExports.cpp: restore missing return/END_RCPP/} at end of ts_cid_consensus (auto-merge inserted subsequent functions inside the function body) - ts_driven.cpp: skip NNI warmup for CID mode (nni_search lacks dual MRP/CID score tracking, causing segfault) - Move inst/analysis/ to dev/analysis/ per project conventions

@Seealso

- Rename CIDConsensus() to InfoConsensus() with deprecated alias - Remove 'metric' param (always MCI; was vestigial) - Remove 'start' param (redundant with multi-replicate Wagner) - Remove unused '...' from public API - Default nThreads to getOption('mc.cores', 1L) - Replace ape::drop.tip/keep.tip/root/consensus with TreeTools equivalents - Remove ape imports: consensus, drop.tip, keep.tip, root - Fix unused variable warning (min_val) in ts_cid.cpp LAP solver - Add @Seealso placeholder for TreeDist/Quartet consensus methods - Rename source files: CIDConsensus.R -> InfoConsensus.R - Simplify .MakeCIDData, .CIDScorer, .CIDBootstrap (remove isCID branches)

…R CID mode build_mrp_dataset() created a DataSet with ScoringMode::CID and a finite concavity (mrp_concavity = 7.0), but did not populate ds.phi or ds.eff_k. These vectors are accessed by compute_iw() via compute_weighted_score(), which is called from mrp_screening_score() at the start of tbr_search(). Accessing an empty vector caused a SIGSEGV that crashed the R session. Fix: populate eff_k (= concavity) and phi (= 1.0) for all MRP patterns, matching the standard IW formula for binary characters. Also: - Remove CIDConsensus() deprecated alias (never released publicly) - Fix TBR verbosity to not print '+NNI' for CID mode (NNI is skipped) - Update test-InfoConsensus.R: fix .MakeCIDData call signatures (2-arg form), drop stale metric= field checks, add rogue-dropping crash test

- Add C++ CID scoring helpers for faster split evaluation - Refactor R-side search loop for reduced overhead - Register new C routines in init.c

- ts_cid_score_trees(): batch-scores multiple candidate trees, CidData built once and amortised across all candidates - ts_cid_prescreen_rogue(): bit-masked rogue prescreen (no R-level DropTip/as.Splits; masks tip bit from pre-built splits, scores in C++) - Refactored .RogueRefine() to eliminate per-candidate re-optimisation - Reduced CID search defaults for ~5x speedup (needs Hamilton validation) - .CollapseRefine()/.BestInsertion() now use C++ batch scoring

- Unique bipartitions hashed and deduped across input trees; pattern_freq set to number of trees containing each split - Replaced mrp_tree_block_start with mrp_split_trees reverse index and mrp_tree_n_splits for normalization - sync_cid_weights_from_mrp() rewritten to use reverse index - New tests: identical-tree CID=0 check, duplicated-tree score equivalence (Agent C)

Add scoreTol and plateauReps parameters to SearchControl/DrivenParams. scoreTol (default 0.0): minimum score improvement to count as meaningful. When > 0, trivially small improvements don't reset the unsuccessful-rep counter, enabling convergence detection for continuous-score modes (CID). plateauReps (default 0 = disabled): stop after N consecutive replicates without meaningful improvement. Unlike perturbStopFactor (scales with n_tips), this is an absolute count suitable for small replicate budgets. CID defaults: scoreTol = 0.001, plateauReps = 3. These allow CID search to exit early when score has plateaued, rather than always running to maxReplicates. Both serial (ts_driven.cpp) and parallel (ts_parallel.cpp) paths updated. Parsimony behaviour unchanged (defaults are 0/0.0).

MCI absolute values scale with tree size, so 0.001 was too aggressive for larger trees. 1e-5 is well above floating-point noise while preserving sensitivity to real improvements.

Add treeSample parameter to InfoConsensus() that controls how many input trees are used during Phase 1 (driven search). Default 'auto' selects min(T, max(50, 2*nTip)) based on benchmarking showing quality saturates at T_sub ~ 50-100 and degrades at higher values under fixed time budgets. Phases 2 (collapse/resolve) and 3 (rogue dropping) always use the full tree set. The returned score attribute reflects full-set MCI. Rationale documented in roxygen and inline comments with benchmark findings from mammals 1449-taxon bootstrap dataset.

…ling Benchmark data from Hamilton HPC (EPYC 7702): - T_sub quality sweep: 50-tip and 100-tip scenarios (3 seeds × 21 T_sub values) - Per-candidate CID profiling: 50/100/200 tips × 7 T_sub values × 3 reps Key findings: - CID cost: O(T × n^2.4) per candidate, dominated by O(n^3) LAP solver - Per-tree marginal cost: 142/706/4031 µs at 50/100/200 tips (R²=1.000) - Quality saturates at T_sub~50-100; degrades at high T_sub (100 tips) - MRP dedup provides 97% compression, making Fitch screening T-insensitive - Notes on TreeTools ClusterTable (Day 1985) applicability

…init.c dupes

…est_strategy_tracker)

…er from ts_rcpp.cpp

… in RcppExports.R

When cid_top_k > 1, TBR collects the top-k MRP candidates per clip and CID-scores all of them, accepting the best CID result. This catches moves where MRP and CID rankings disagree. - CidData.cid_top_k (default 1 = existing behaviour, zero overhead) - CidCandidate struct + topk_insert() helper in ts_tbr.cpp - Score budget tightens across batch (later candidates early-terminate) - InfoConsensus(screeningTopK = 1L) exposed to R - Non-CID modes completely unaffected

…sensus

Add Splitwise Phylogenetic Information Content (SPIC) as an alternative consensus scoring method to MCI in InfoConsensus(). SPIC weights candidate consensus splits by their frequency in the input tree set, naturally discounting low-support splits. It is ~100x cheaper than MCI (split-additive O(n) vs pairwise LAP solver O(n^3)) and does not require a LAP solver dependency. ## New files - src/ts_spic.h / ts_spic.cpp: SpicData struct, build_spic_data(), spic_score(), build_spic_data_from_splits() ## C++ changes - ts_data.h: add SPIC to ScoringMode enum; add is_cid_like() helper covering both CID and SPIC; add spic_data pointer to DataSet - ts_rcpp.cpp: ts_spic_score_trees(), ts_spic_prescreen_rogue(); cid_tree_from_edge() extended to handle trifurcating/polytomous roots via virtual binary node insertion; n_bins_override in build_spic_data_from_splits() fixes 64-tip boundary mismatch in rogue prescreen; SPIC wired into ts_driven_search bridge - ts_tbr/drift/search/sector/driven/parallel.cpp: replace bare == ScoringMode::CID with is_cid_like() (8 sites across 6 files) to prevent uninitialized Fitch state reads under SPIC mode - ts_parallel.cpp: SpicData deep-copy in worker thread setup ## R changes - InfoConsensus(): method = c("mci", "spic") argument; all phases (driven search, collapse/resolve, rogue screening, restoration) dispatch through method parameter ## Fixes bundled - Segfault in TBR/drift/sector under SPIC (uninitialized prelim_ arrays) - Trifurcating root silent score corruption in cid_tree_from_edge() - n_bins boundary mismatch in SPIC rogue prescreen at 64-tip boundary

ms609 · 2026-03-27T08:04:30Z

SPIC scoring added (latest commit `6636924`)

This commit extends InfoConsensus() (previously CIDConsensus()) with a second scoring method:

InfoConsensus(trees, method = "spic")  # new
InfoConsensus(trees, method = "mci")   # existing (default)

SPIC (Splitwise Phylogenetic Information Content) weights each split by its frequency in the input tree set. ~100× cheaper than MCI (no LAP solver needed) and naturally discounts low-support splits.

Bugs fixed in this commit

Segfault in TBR/drift/sector under SPIC — 8 bare == ScoringMode::CID checks replaced with is_cid_like() helper (covers CID and SPIC), preventing uninitialized prelim_ array reads
Trifurcating root silent score corruption — cid_tree_from_edge() assumed binary trees; ape's standard unrooted trees have trifurcating roots, causing the third child to silently overwrite right child; fixed with virtual node insertion for polytomies
n_bins boundary mismatch at 64 tips — rogue prescreen used reduced tip-count word size to hash/compare full tip-count splits; fixed via n_bins_override parameter

Validation

165 CID/consensus tests pass, 237 search engine tests pass
SPIC tested end-to-end at 12, 50, 65 tips across all pipeline phases
GHA: run #23636944848

…sensus # Conflicts: # src/ts_rcpp.cpp # src/ts_tbr.cpp

…ported from TreeTools lacks rooted arg

… search path without explicit library)

InfoConsensus.Rd: add treeSample, screeningTopK, method params; fix defaults for maxReplicates (5L), targetHits (2L), maxDrop (min). SearchControl.Rd: add scoreTol and plateauReps params (inserted after perturbStopFactor in both usage and arguments sections). Fixes codoc mismatches that were failing R CMD check.

…devel rlang <=1.1.7 (CRAN 2026-01-09) calls PREXPR() in src/capture.c. PREXPR was removed from the R public API in R-devel (gcc 15.2.1, 2026-01-23). The rlang main branch already rewrites capture.c using their own r_obj* abstraction (no PREXPR calls), but this fix is not yet released to CRAN. Add a pre-install step in ASan.yml before ms609/actions/asan@main so pak installs the GitHub development version before the action's local_install_deps() runs. pak will not downgrade a more recent installed version. Also: update to-do.md with S-PR round 37 status (ubuntu jobs passing, InfoConsensus.Rd codoc mismatch noted for T-150 / PR #213). S-PR round 37 (Agent E, 2026-03-27).

…(T-150)

ms609 added 16 commits March 26, 2026 11:02

perf: optimize InfoConsensus search

0015f06

- Add C++ CID scoring helpers for faster split evaluation - Refactor R-side search loop for reduced overhead - Register new C routines in init.c

Merge origin/cpp-search into feature/cid-consensus

36a0b5c

fix: reduce CID scoreTol default from 0.001 to 1e-5

c8ec3e8

MCI absolute values scale with tree size, so 0.001 was too aggressive for larger trees. 1e-5 is well above floating-point noise while preserving sensitivity to real improvements.

ms609 force-pushed the feature/cid-consensus branch from aaa7f48 to c8ec3e8 Compare March 26, 2026 11:07

ms609 added 13 commits March 26, 2026 13:24

Merge origin/cpp-search into feature/cid-consensus

fe8503f

fix: close ts_test_strategy_tracker in RcppExports.cpp; remove stale …

3d09892

…init.c dupes

fix: remove duplicate RcppExports entries (ts_wagner_bias_bench, ts_t…

85838ca

…est_strategy_tracker)

fix: remove duplicate ts_wagner_bias_bench and ts_test_strategy_track…

c8a6904

…er from ts_rcpp.cpp

fix: close ts_test_strategy_tracker and remove stale ts_cid_consensus…

c9b04b9

… in RcppExports.R

bench: treeSample auto vs Inf benchmark + incremental CID analysis

17f6260

fix: treeSample Inf coercion in bench script

0aaa4a6

bench: top-k CID screening benchmark script

0d1ee31

Merge remote-tracking branch 'origin/cpp-search' into feature/cid-con…

6a8b50d

…sensus

ms609 added 4 commits March 27, 2026 10:57

Merge remote-tracking branch 'origin/cpp-search' into feature/cid-con…

f9fee3b

…sensus # Conflicts: # src/ts_rcpp.cpp # src/ts_tbr.cpp

fix: use ape::consensus (lowercase) in test-Morphy.R — Consensus() im…

0c1f0fe

…ported from TreeTools lacks rooted arg

fix: use TreeTools::Consensus() in vignettes (bare Consensus() not on…

23d93f2

… search path without explicit library)

ms609 added 2 commits March 27, 2026 13:36

fix: add Splitwise to WORDLIST (T-150 spell-check fix)

9b7ee66

fix: qualify bare Consensus() as TreeTools::Consensus() in vignettes …

f8bfee4

…(T-150)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: CID-optimal consensus tree search (CIDConsensus)#213

feat: CID-optimal consensus tree search (CIDConsensus)#213
ms609 wants to merge 35 commits intocpp-searchfrom
feature/cid-consensus

ms609 commented Mar 23, 2026

Uh oh!

ms609 commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ms609 commented Mar 23, 2026

Summary

Architecture

New files

Modified files

Performance optimizations

Tests

Uh oh!

ms609 commented Mar 27, 2026

SPIC scoring added (latest commit 6636924)

Bugs fixed in this commit

Validation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

SPIC scoring added (latest commit `6636924`)