Skip to content

feat: CID-optimal consensus tree search (CIDConsensus)#213

Open
ms609 wants to merge 35 commits intocpp-searchfrom
feature/cid-consensus
Open

feat: CID-optimal consensus tree search (CIDConsensus)#213
ms609 wants to merge 35 commits intocpp-searchfrom
feature/cid-consensus

Conversation

@ms609
Copy link
Copy Markdown
Owner

@ms609 ms609 commented Mar 23, 2026

Summary

CIDConsensus() finds a consensus tree that minimizes mean Clustering Information Distance (CID) to a set of input trees, using the existing C++ driven search engine.

Architecture

Dual-layer scoring for performance:

  • Fast layer: MRP (Matrix Representation with Parsimony) binary characters scored via Fitch for incremental screening during TBR/SPR. IW with k=7 by default (empirically maximizes rank correlation with CID, ρ ≈ 0.688).
  • Accurate layer: Full CID scoring via LAP (Linear Assignment Problem) matching for move acceptance decisions.

Two-baseline tracking in TBR, SPR, and drift maintains separate mrp_baseline (Fitch) and best_score (CID) values. Sectorial search (RSS/XSS) uses Fitch on MRP internally, verifies with full CID after reinsertion.

New files

File Purpose
R/CIDConsensus.R R-level API, MRP dataset construction, rogue taxon dropping
src/ts_cid.h / ts_cid.cpp CID scoring engine: LAP solver, MCI computation, split hashing
tests/testthat/test-CIDConsensus.R R-level tests (504 lines)
tests/testthat/test-ts-cid.R C++ engine tests (267 lines)

Modified files

CID mode wired into all search modules:

  • ts_tbr.cpp, ts_search.cpp, ts_drift.cpp: two-baseline tracking, score_budget early termination
  • ts_sector.cpp: CID mode dispatch for sectorial search
  • ts_ratchet.cpp: CID-aware ratchet
  • ts_parallel.cpp: per-thread CidData deep copy (fixes data race)
  • ts_rcpp.cpp: ts_cid_consensus() Rcpp bridge function

Performance optimizations

  1. Hash-based exact match lookup (O(n) vs O(n²))
  2. Precomputed per-split log2 values
  3. Persistent candidate split buffers (zero-alloc reuse)
  4. Presized LAP scratch buffers
  5. Bounded cid_score() with early termination (wired into TBR/SPR)
  6. Batch candidate log2 reuse

Multi-threaded benchmarks (coalescent-derived trees):

  • 40 tips / 50 trees: 1.43× speedup (2 threads)
  • 80 tips / 100 trees: 1.73× speedup (2 threads)

Tests

All 9928 tests pass (0 failures, 0 warnings). Includes 771 new CID-specific assertions across R-level and C++ test files.

ms609 added 16 commits March 26, 2026 11:02
New exported function CIDConsensus() finds consensus trees that minimize
mean Clustering Information Distance to a set of input trees.

Phase 1: Binary search via existing Ratchet/TreeSearch infrastructure
- Custom CID scorer (.CIDScorer) plugs into TreeScorer interface
- CID bootstrapper (.CIDBootstrap) resamples input trees with replacement
- cidData S3 class (environment-backed) for reference semantics
- Supports SPR, TBR, NNI, and ratchet search methods
- Any TreeDist metric can serve as objective (CID, MCI, SPI, etc.)

Phase 2: Collapse/resolve refinement
- .CollapseRefine() greedily collapses edges that reduce CID
- Tries all pairwise polytomy resolutions to re-add splits
- Produces non-binary trees when optimal
- Enabled by default (collapse=TRUE parameter)

Test suite: 55 assertions (Tier 2, skip_on_cran)
For the default ClusteringInfoDistance metric, precompute input tree
splits and clustering entropies once in .MakeCIDData(). The scorer then
computes CID as CE(cand) + mean(CE(inputs)) - 2*mean(MCI), avoiding
redundant as.Splits() conversion of the N input trees on each candidate
evaluation.

Benchmark (20 tips, 50 trees, SPR 100 iter):
  Before: 1.55s  (6.8 ms/scorer call)
  After:  0.25s  (0.9 ms/scorer call)
  Speedup: ~6x

The CIDBootstrap also resamples the precomputed splits/CEs to maintain
consistency during ratchet perturbation.

Remaining bottleneck: N LAP solutions per candidate in
MutualClusteringInfoSplits (C++ in TreeDist). This is irreducible
without a warm-start LAP solver.
Replace vapply over MutualClusteringInfoSplits with a for-loop over
unclass'd raw split matrices. This eliminates per-iteration function
call overhead and S3 dispatch.

Total speedup vs naive ClusteringInfoDistance: 8.1x
Per-scorer-call: 6.8 ms → 0.76 ms (20 tips, 50 trees)
SPR search: 1.55s → 0.19s
LAP solver (Jonker-Volgenant), mutual clustering information,
clustering entropy, MRP dataset builder, and weight management
for CID consensus tree search.

Verified: C++ CID matches R ClusteringInfoDistance to 1e-12.
All 28 ts-cid + 121 CIDConsensus tests pass.
cid_score() now returns -mean(MCI) (negated for minimization
infrastructure). The normalize parameter is removed from the R API.
User-facing score is positive MCI (higher = better), converted at
the CIDConsensus() boundary.

C++ changes:
- cid_score(): single branch computing -weighted_mean(MCI)
- Removed normalize/non-normalize branching
- Early termination uses cand_CE upper bound

R changes:
- Removed normalize param from CIDConsensus(), .MakeCIDData(),
  .CIDDrivenSearch(), .PrescreenMarginalNID(), .BestInsertion()
- .CIDScoreFast() returns -mean(MCI) instead of mean(CID)
- CIDConsensus() negates internal score to positive MCI on output
- Verbosity messages show MCI instead of score

Test changes:
- Updated score direction checks (higher MCI = better)
- Removed normalized scoring tests
- Added mean(MutualClusteringInfo) verification test
- All 148 CID tests pass
- TBR/SPR: set score_budget = best_score + eps before candidate
  full_rescore(), reset to HUGE_VAL after. Enables cid_score()
  early termination when candidate can't beat current best.
- Drift: no budget (needs exact CID scores for RFD computation).
- ts_parallel.cpp: deep-copy CidData per worker thread so mutable
  scratch buffers (lap_scratch, cand_tip_bits, cand_buf, score_budget)
  don't race between threads.
- Benchmark verified: 2-thread CID produces identical scores to
  single-thread; 1.3-1.7x speedup across 40-80 tip scenarios.
- Mark T-150 complete.
- ts_rcpp.cpp: add ts_cid_consensus() Rcpp bridge function
- ts_drift.cpp: CID two-baseline tracking (mrp_baseline + CID score)
- ts_sector.cpp: CID mode dispatch for sectorial search
- ts_ratchet.cpp: CID-aware ratchet scoring
- ts_fitch.cpp: CID support in score helpers
- ts_data.h: CID fields in DataSet
- RcppExports, TreeSearch-init.c, NAMESPACE: register new exports
- man/CIDConsensus.Rd: documentation
- inst/test-data/takazawa/: reference data for CID tests
- inst/analysis/: MRP weighting analysis
- RcppExports.cpp: restore missing return/END_RCPP/} at end of
  ts_cid_consensus (auto-merge inserted subsequent functions inside
  the function body)
- ts_driven.cpp: skip NNI warmup for CID mode (nni_search lacks
  dual MRP/CID score tracking, causing segfault)
- Move inst/analysis/ to dev/analysis/ per project conventions
- Rename CIDConsensus() to InfoConsensus() with deprecated alias
- Remove 'metric' param (always MCI; was vestigial)
- Remove 'start' param (redundant with multi-replicate Wagner)
- Remove unused '...' from public API
- Default nThreads to getOption('mc.cores', 1L)
- Replace ape::drop.tip/keep.tip/root/consensus with TreeTools equivalents
- Remove ape imports: consensus, drop.tip, keep.tip, root
- Fix unused variable warning (min_val) in ts_cid.cpp LAP solver
- Add @Seealso placeholder for TreeDist/Quartet consensus methods
- Rename source files: CIDConsensus.R -> InfoConsensus.R
- Simplify .MakeCIDData, .CIDScorer, .CIDBootstrap (remove isCID branches)
…R CID mode

build_mrp_dataset() created a DataSet with ScoringMode::CID and a finite
concavity (mrp_concavity = 7.0), but did not populate ds.phi or ds.eff_k.
These vectors are accessed by compute_iw() via compute_weighted_score(),
which is called from mrp_screening_score() at the start of tbr_search().
Accessing an empty vector caused a SIGSEGV that crashed the R session.

Fix: populate eff_k (= concavity) and phi (= 1.0) for all MRP patterns,
matching the standard IW formula for binary characters.

Also:
- Remove CIDConsensus() deprecated alias (never released publicly)
- Fix TBR verbosity to not print '+NNI' for CID mode (NNI is skipped)
- Update test-InfoConsensus.R: fix .MakeCIDData call signatures (2-arg
  form), drop stale metric= field checks, add rogue-dropping crash test
- Add C++ CID scoring helpers for faster split evaluation
- Refactor R-side search loop for reduced overhead
- Register new C routines in init.c
- ts_cid_score_trees(): batch-scores multiple candidate trees, CidData
  built once and amortised across all candidates
- ts_cid_prescreen_rogue(): bit-masked rogue prescreen (no R-level
  DropTip/as.Splits; masks tip bit from pre-built splits, scores in C++)
- Refactored .RogueRefine() to eliminate per-candidate re-optimisation
- Reduced CID search defaults for ~5x speedup (needs Hamilton validation)
- .CollapseRefine()/.BestInsertion() now use C++ batch scoring
- Unique bipartitions hashed and deduped across input trees; pattern_freq
  set to number of trees containing each split
- Replaced mrp_tree_block_start with mrp_split_trees reverse index and
  mrp_tree_n_splits for normalization
- sync_cid_weights_from_mrp() rewritten to use reverse index
- New tests: identical-tree CID=0 check, duplicated-tree score equivalence
  (Agent C)
Add scoreTol and plateauReps parameters to SearchControl/DrivenParams.

scoreTol (default 0.0): minimum score improvement to count as meaningful.
When > 0, trivially small improvements don't reset the unsuccessful-rep
counter, enabling convergence detection for continuous-score modes (CID).

plateauReps (default 0 = disabled): stop after N consecutive replicates
without meaningful improvement. Unlike perturbStopFactor (scales with
n_tips), this is an absolute count suitable for small replicate budgets.

CID defaults: scoreTol = 0.001, plateauReps = 3. These allow CID search
to exit early when score has plateaued, rather than always running to
maxReplicates.

Both serial (ts_driven.cpp) and parallel (ts_parallel.cpp) paths updated.
Parsimony behaviour unchanged (defaults are 0/0.0).
MCI absolute values scale with tree size, so 0.001 was too aggressive
for larger trees. 1e-5 is well above floating-point noise while
preserving sensitivity to real improvements.
@ms609 ms609 force-pushed the feature/cid-consensus branch from aaa7f48 to c8ec3e8 Compare March 26, 2026 11:07
ms609 added 13 commits March 26, 2026 13:24
Add treeSample parameter to InfoConsensus() that controls how many input
trees are used during Phase 1 (driven search). Default 'auto' selects
min(T, max(50, 2*nTip)) based on benchmarking showing quality saturates
at T_sub ~ 50-100 and degrades at higher values under fixed time budgets.

Phases 2 (collapse/resolve) and 3 (rogue dropping) always use the full
tree set. The returned score attribute reflects full-set MCI.

Rationale documented in roxygen and inline comments with benchmark
findings from mammals 1449-taxon bootstrap dataset.
…ling

Benchmark data from Hamilton HPC (EPYC 7702):
- T_sub quality sweep: 50-tip and 100-tip scenarios (3 seeds × 21 T_sub values)
- Per-candidate CID profiling: 50/100/200 tips × 7 T_sub values × 3 reps

Key findings:
- CID cost: O(T × n^2.4) per candidate, dominated by O(n^3) LAP solver
- Per-tree marginal cost: 142/706/4031 µs at 50/100/200 tips (R²=1.000)
- Quality saturates at T_sub~50-100; degrades at high T_sub (100 tips)
- MRP dedup provides 97% compression, making Fitch screening T-insensitive
- Notes on TreeTools ClusterTable (Day 1985) applicability
When cid_top_k > 1, TBR collects the top-k MRP candidates per clip
and CID-scores all of them, accepting the best CID result. This catches
moves where MRP and CID rankings disagree.

- CidData.cid_top_k (default 1 = existing behaviour, zero overhead)
- CidCandidate struct + topk_insert() helper in ts_tbr.cpp
- Score budget tightens across batch (later candidates early-terminate)
- InfoConsensus(screeningTopK = 1L) exposed to R
- Non-CID modes completely unaffected
Add Splitwise Phylogenetic Information Content (SPIC) as an alternative
consensus scoring method to MCI in InfoConsensus().

SPIC weights candidate consensus splits by their frequency in the input
tree set, naturally discounting low-support splits.  It is ~100x cheaper
than MCI (split-additive O(n) vs pairwise LAP solver O(n^3)) and does
not require a LAP solver dependency.

## New files
- src/ts_spic.h / ts_spic.cpp: SpicData struct, build_spic_data(),
  spic_score(), build_spic_data_from_splits()

## C++ changes
- ts_data.h: add SPIC to ScoringMode enum; add is_cid_like() helper
  covering both CID and SPIC; add spic_data pointer to DataSet
- ts_rcpp.cpp: ts_spic_score_trees(), ts_spic_prescreen_rogue();
  cid_tree_from_edge() extended to handle trifurcating/polytomous roots
  via virtual binary node insertion; n_bins_override in
  build_spic_data_from_splits() fixes 64-tip boundary mismatch in
  rogue prescreen; SPIC wired into ts_driven_search bridge
- ts_tbr/drift/search/sector/driven/parallel.cpp: replace bare
  == ScoringMode::CID with is_cid_like() (8 sites across 6 files)
  to prevent uninitialized Fitch state reads under SPIC mode
- ts_parallel.cpp: SpicData deep-copy in worker thread setup

## R changes
- InfoConsensus(): method = c("mci", "spic") argument; all phases
  (driven search, collapse/resolve, rogue screening, restoration)
  dispatch through method parameter

## Fixes bundled
- Segfault in TBR/drift/sector under SPIC (uninitialized prelim_ arrays)
- Trifurcating root silent score corruption in cid_tree_from_edge()
- n_bins boundary mismatch in SPIC rogue prescreen at 64-tip boundary
@ms609
Copy link
Copy Markdown
Owner Author

ms609 commented Mar 27, 2026

SPIC scoring added (latest commit 6636924)

This commit extends InfoConsensus() (previously CIDConsensus()) with a second scoring method:

InfoConsensus(trees, method = "spic")  # new
InfoConsensus(trees, method = "mci")   # existing (default)

SPIC (Splitwise Phylogenetic Information Content) weights each split by its frequency in the input tree set. ~100× cheaper than MCI (no LAP solver needed) and naturally discounts low-support splits.

Bugs fixed in this commit

  1. Segfault in TBR/drift/sector under SPIC — 8 bare == ScoringMode::CID checks replaced with is_cid_like() helper (covers CID and SPIC), preventing uninitialized prelim_ array reads
  2. Trifurcating root silent score corruptioncid_tree_from_edge() assumed binary trees; ape's standard unrooted trees have trifurcating roots, causing the third child to silently overwrite right child; fixed with virtual node insertion for polytomies
  3. n_bins boundary mismatch at 64 tips — rogue prescreen used reduced tip-count word size to hash/compare full tip-count splits; fixed via n_bins_override parameter

Validation

  • 165 CID/consensus tests pass, 237 search engine tests pass
  • SPIC tested end-to-end at 12, 50, 65 tips across all pipeline phases
  • GHA: run #23636944848

ms609 added 4 commits March 27, 2026 10:57
…sensus

# Conflicts:
#	src/ts_rcpp.cpp
#	src/ts_tbr.cpp
InfoConsensus.Rd: add treeSample, screeningTopK, method params;
fix defaults for maxReplicates (5L), targetHits (2L), maxDrop (min).

SearchControl.Rd: add scoreTol and plateauReps params (inserted
after perturbStopFactor in both usage and arguments sections).

Fixes codoc mismatches that were failing R CMD check.
ms609 added a commit that referenced this pull request Mar 27, 2026
…devel

rlang <=1.1.7 (CRAN 2026-01-09) calls PREXPR() in src/capture.c.
PREXPR was removed from the R public API in R-devel (gcc 15.2.1,
2026-01-23). The rlang main branch already rewrites capture.c using
their own r_obj* abstraction (no PREXPR calls), but this fix is not yet
released to CRAN.

Add a pre-install step in ASan.yml before ms609/actions/asan@main so
pak installs the GitHub development version before the action's
local_install_deps() runs. pak will not downgrade a more recent
installed version.

Also: update to-do.md with S-PR round 37 status (ubuntu jobs passing,
InfoConsensus.Rd codoc mismatch noted for T-150 / PR #213).

S-PR round 37 (Agent E, 2026-03-27).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant