feat: CID-optimal consensus tree search (CIDConsensus)#213
Open
ms609 wants to merge 35 commits intocpp-searchfrom
Open
feat: CID-optimal consensus tree search (CIDConsensus)#213ms609 wants to merge 35 commits intocpp-searchfrom
ms609 wants to merge 35 commits intocpp-searchfrom
Conversation
New exported function CIDConsensus() finds consensus trees that minimize mean Clustering Information Distance to a set of input trees. Phase 1: Binary search via existing Ratchet/TreeSearch infrastructure - Custom CID scorer (.CIDScorer) plugs into TreeScorer interface - CID bootstrapper (.CIDBootstrap) resamples input trees with replacement - cidData S3 class (environment-backed) for reference semantics - Supports SPR, TBR, NNI, and ratchet search methods - Any TreeDist metric can serve as objective (CID, MCI, SPI, etc.) Phase 2: Collapse/resolve refinement - .CollapseRefine() greedily collapses edges that reduce CID - Tries all pairwise polytomy resolutions to re-add splits - Produces non-binary trees when optimal - Enabled by default (collapse=TRUE parameter) Test suite: 55 assertions (Tier 2, skip_on_cran)
For the default ClusteringInfoDistance metric, precompute input tree splits and clustering entropies once in .MakeCIDData(). The scorer then computes CID as CE(cand) + mean(CE(inputs)) - 2*mean(MCI), avoiding redundant as.Splits() conversion of the N input trees on each candidate evaluation. Benchmark (20 tips, 50 trees, SPR 100 iter): Before: 1.55s (6.8 ms/scorer call) After: 0.25s (0.9 ms/scorer call) Speedup: ~6x The CIDBootstrap also resamples the precomputed splits/CEs to maintain consistency during ratchet perturbation. Remaining bottleneck: N LAP solutions per candidate in MutualClusteringInfoSplits (C++ in TreeDist). This is irreducible without a warm-start LAP solver.
Replace vapply over MutualClusteringInfoSplits with a for-loop over unclass'd raw split matrices. This eliminates per-iteration function call overhead and S3 dispatch. Total speedup vs naive ClusteringInfoDistance: 8.1x Per-scorer-call: 6.8 ms → 0.76 ms (20 tips, 50 trees) SPR search: 1.55s → 0.19s
LAP solver (Jonker-Volgenant), mutual clustering information, clustering entropy, MRP dataset builder, and weight management for CID consensus tree search. Verified: C++ CID matches R ClusteringInfoDistance to 1e-12. All 28 ts-cid + 121 CIDConsensus tests pass.
cid_score() now returns -mean(MCI) (negated for minimization infrastructure). The normalize parameter is removed from the R API. User-facing score is positive MCI (higher = better), converted at the CIDConsensus() boundary. C++ changes: - cid_score(): single branch computing -weighted_mean(MCI) - Removed normalize/non-normalize branching - Early termination uses cand_CE upper bound R changes: - Removed normalize param from CIDConsensus(), .MakeCIDData(), .CIDDrivenSearch(), .PrescreenMarginalNID(), .BestInsertion() - .CIDScoreFast() returns -mean(MCI) instead of mean(CID) - CIDConsensus() negates internal score to positive MCI on output - Verbosity messages show MCI instead of score Test changes: - Updated score direction checks (higher MCI = better) - Removed normalized scoring tests - Added mean(MutualClusteringInfo) verification test - All 148 CID tests pass
- TBR/SPR: set score_budget = best_score + eps before candidate full_rescore(), reset to HUGE_VAL after. Enables cid_score() early termination when candidate can't beat current best. - Drift: no budget (needs exact CID scores for RFD computation). - ts_parallel.cpp: deep-copy CidData per worker thread so mutable scratch buffers (lap_scratch, cand_tip_bits, cand_buf, score_budget) don't race between threads. - Benchmark verified: 2-thread CID produces identical scores to single-thread; 1.3-1.7x speedup across 40-80 tip scenarios. - Mark T-150 complete.
- ts_rcpp.cpp: add ts_cid_consensus() Rcpp bridge function - ts_drift.cpp: CID two-baseline tracking (mrp_baseline + CID score) - ts_sector.cpp: CID mode dispatch for sectorial search - ts_ratchet.cpp: CID-aware ratchet scoring - ts_fitch.cpp: CID support in score helpers - ts_data.h: CID fields in DataSet - RcppExports, TreeSearch-init.c, NAMESPACE: register new exports - man/CIDConsensus.Rd: documentation - inst/test-data/takazawa/: reference data for CID tests - inst/analysis/: MRP weighting analysis
- RcppExports.cpp: restore missing return/END_RCPP/} at end of ts_cid_consensus (auto-merge inserted subsequent functions inside the function body) - ts_driven.cpp: skip NNI warmup for CID mode (nni_search lacks dual MRP/CID score tracking, causing segfault) - Move inst/analysis/ to dev/analysis/ per project conventions
- Rename CIDConsensus() to InfoConsensus() with deprecated alias
- Remove 'metric' param (always MCI; was vestigial)
- Remove 'start' param (redundant with multi-replicate Wagner)
- Remove unused '...' from public API
- Default nThreads to getOption('mc.cores', 1L)
- Replace ape::drop.tip/keep.tip/root/consensus with TreeTools equivalents
- Remove ape imports: consensus, drop.tip, keep.tip, root
- Fix unused variable warning (min_val) in ts_cid.cpp LAP solver
- Add @Seealso placeholder for TreeDist/Quartet consensus methods
- Rename source files: CIDConsensus.R -> InfoConsensus.R
- Simplify .MakeCIDData, .CIDScorer, .CIDBootstrap (remove isCID branches)
…R CID mode build_mrp_dataset() created a DataSet with ScoringMode::CID and a finite concavity (mrp_concavity = 7.0), but did not populate ds.phi or ds.eff_k. These vectors are accessed by compute_iw() via compute_weighted_score(), which is called from mrp_screening_score() at the start of tbr_search(). Accessing an empty vector caused a SIGSEGV that crashed the R session. Fix: populate eff_k (= concavity) and phi (= 1.0) for all MRP patterns, matching the standard IW formula for binary characters. Also: - Remove CIDConsensus() deprecated alias (never released publicly) - Fix TBR verbosity to not print '+NNI' for CID mode (NNI is skipped) - Update test-InfoConsensus.R: fix .MakeCIDData call signatures (2-arg form), drop stale metric= field checks, add rogue-dropping crash test
- Add C++ CID scoring helpers for faster split evaluation - Refactor R-side search loop for reduced overhead - Register new C routines in init.c
- ts_cid_score_trees(): batch-scores multiple candidate trees, CidData built once and amortised across all candidates - ts_cid_prescreen_rogue(): bit-masked rogue prescreen (no R-level DropTip/as.Splits; masks tip bit from pre-built splits, scores in C++) - Refactored .RogueRefine() to eliminate per-candidate re-optimisation - Reduced CID search defaults for ~5x speedup (needs Hamilton validation) - .CollapseRefine()/.BestInsertion() now use C++ batch scoring
- Unique bipartitions hashed and deduped across input trees; pattern_freq set to number of trees containing each split - Replaced mrp_tree_block_start with mrp_split_trees reverse index and mrp_tree_n_splits for normalization - sync_cid_weights_from_mrp() rewritten to use reverse index - New tests: identical-tree CID=0 check, duplicated-tree score equivalence (Agent C)
Add scoreTol and plateauReps parameters to SearchControl/DrivenParams. scoreTol (default 0.0): minimum score improvement to count as meaningful. When > 0, trivially small improvements don't reset the unsuccessful-rep counter, enabling convergence detection for continuous-score modes (CID). plateauReps (default 0 = disabled): stop after N consecutive replicates without meaningful improvement. Unlike perturbStopFactor (scales with n_tips), this is an absolute count suitable for small replicate budgets. CID defaults: scoreTol = 0.001, plateauReps = 3. These allow CID search to exit early when score has plateaued, rather than always running to maxReplicates. Both serial (ts_driven.cpp) and parallel (ts_parallel.cpp) paths updated. Parsimony behaviour unchanged (defaults are 0/0.0).
MCI absolute values scale with tree size, so 0.001 was too aggressive for larger trees. 1e-5 is well above floating-point noise while preserving sensitivity to real improvements.
aaa7f48 to
c8ec3e8
Compare
Add treeSample parameter to InfoConsensus() that controls how many input trees are used during Phase 1 (driven search). Default 'auto' selects min(T, max(50, 2*nTip)) based on benchmarking showing quality saturates at T_sub ~ 50-100 and degrades at higher values under fixed time budgets. Phases 2 (collapse/resolve) and 3 (rogue dropping) always use the full tree set. The returned score attribute reflects full-set MCI. Rationale documented in roxygen and inline comments with benchmark findings from mammals 1449-taxon bootstrap dataset.
…ling Benchmark data from Hamilton HPC (EPYC 7702): - T_sub quality sweep: 50-tip and 100-tip scenarios (3 seeds × 21 T_sub values) - Per-candidate CID profiling: 50/100/200 tips × 7 T_sub values × 3 reps Key findings: - CID cost: O(T × n^2.4) per candidate, dominated by O(n^3) LAP solver - Per-tree marginal cost: 142/706/4031 µs at 50/100/200 tips (R²=1.000) - Quality saturates at T_sub~50-100; degrades at high T_sub (100 tips) - MRP dedup provides 97% compression, making Fitch screening T-insensitive - Notes on TreeTools ClusterTable (Day 1985) applicability
…est_strategy_tracker)
…er from ts_rcpp.cpp
… in RcppExports.R
When cid_top_k > 1, TBR collects the top-k MRP candidates per clip and CID-scores all of them, accepting the best CID result. This catches moves where MRP and CID rankings disagree. - CidData.cid_top_k (default 1 = existing behaviour, zero overhead) - CidCandidate struct + topk_insert() helper in ts_tbr.cpp - Score budget tightens across batch (later candidates early-terminate) - InfoConsensus(screeningTopK = 1L) exposed to R - Non-CID modes completely unaffected
Add Splitwise Phylogenetic Information Content (SPIC) as an alternative
consensus scoring method to MCI in InfoConsensus().
SPIC weights candidate consensus splits by their frequency in the input
tree set, naturally discounting low-support splits. It is ~100x cheaper
than MCI (split-additive O(n) vs pairwise LAP solver O(n^3)) and does
not require a LAP solver dependency.
## New files
- src/ts_spic.h / ts_spic.cpp: SpicData struct, build_spic_data(),
spic_score(), build_spic_data_from_splits()
## C++ changes
- ts_data.h: add SPIC to ScoringMode enum; add is_cid_like() helper
covering both CID and SPIC; add spic_data pointer to DataSet
- ts_rcpp.cpp: ts_spic_score_trees(), ts_spic_prescreen_rogue();
cid_tree_from_edge() extended to handle trifurcating/polytomous roots
via virtual binary node insertion; n_bins_override in
build_spic_data_from_splits() fixes 64-tip boundary mismatch in
rogue prescreen; SPIC wired into ts_driven_search bridge
- ts_tbr/drift/search/sector/driven/parallel.cpp: replace bare
== ScoringMode::CID with is_cid_like() (8 sites across 6 files)
to prevent uninitialized Fitch state reads under SPIC mode
- ts_parallel.cpp: SpicData deep-copy in worker thread setup
## R changes
- InfoConsensus(): method = c("mci", "spic") argument; all phases
(driven search, collapse/resolve, rogue screening, restoration)
dispatch through method parameter
## Fixes bundled
- Segfault in TBR/drift/sector under SPIC (uninitialized prelim_ arrays)
- Trifurcating root silent score corruption in cid_tree_from_edge()
- n_bins boundary mismatch in SPIC rogue prescreen at 64-tip boundary
Owner
Author
SPIC scoring added (latest commit 6636924)This commit extends InfoConsensus(trees, method = "spic") # new
InfoConsensus(trees, method = "mci") # existing (default)SPIC (Splitwise Phylogenetic Information Content) weights each split by its frequency in the input tree set. ~100× cheaper than MCI (no LAP solver needed) and naturally discounts low-support splits. Bugs fixed in this commit
Validation
|
…sensus # Conflicts: # src/ts_rcpp.cpp # src/ts_tbr.cpp
…ported from TreeTools lacks rooted arg
… search path without explicit library)
InfoConsensus.Rd: add treeSample, screeningTopK, method params; fix defaults for maxReplicates (5L), targetHits (2L), maxDrop (min). SearchControl.Rd: add scoreTol and plateauReps params (inserted after perturbStopFactor in both usage and arguments sections). Fixes codoc mismatches that were failing R CMD check.
ms609
added a commit
that referenced
this pull request
Mar 27, 2026
…devel rlang <=1.1.7 (CRAN 2026-01-09) calls PREXPR() in src/capture.c. PREXPR was removed from the R public API in R-devel (gcc 15.2.1, 2026-01-23). The rlang main branch already rewrites capture.c using their own r_obj* abstraction (no PREXPR calls), but this fix is not yet released to CRAN. Add a pre-install step in ASan.yml before ms609/actions/asan@main so pak installs the GitHub development version before the action's local_install_deps() runs. pak will not downgrade a more recent installed version. Also: update to-do.md with S-PR round 37 status (ubuntu jobs passing, InfoConsensus.Rd codoc mismatch noted for T-150 / PR #213). S-PR round 37 (Agent E, 2026-03-27).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
CIDConsensus()finds a consensus tree that minimizes mean Clustering Information Distance (CID) to a set of input trees, using the existing C++ driven search engine.Architecture
Dual-layer scoring for performance:
Two-baseline tracking in TBR, SPR, and drift maintains separate
mrp_baseline(Fitch) andbest_score(CID) values. Sectorial search (RSS/XSS) uses Fitch on MRP internally, verifies with full CID after reinsertion.New files
R/CIDConsensus.Rsrc/ts_cid.h/ts_cid.cpptests/testthat/test-CIDConsensus.Rtests/testthat/test-ts-cid.RModified files
CID mode wired into all search modules:
ts_tbr.cpp,ts_search.cpp,ts_drift.cpp: two-baseline tracking, score_budget early terminationts_sector.cpp: CID mode dispatch for sectorial searchts_ratchet.cpp: CID-aware ratchetts_parallel.cpp: per-thread CidData deep copy (fixes data race)ts_rcpp.cpp:ts_cid_consensus()Rcpp bridge functionPerformance optimizations
cid_score()with early termination (wired into TBR/SPR)Multi-threaded benchmarks (coalescent-derived trees):
Tests
All 9928 tests pass (0 failures, 0 warnings). Includes 771 new CID-specific assertions across R-level and C++ test files.