Bayesian Online Change Point Detection — AVX2 Optimized
A production-grade C implementation of Adams & MacKay (2007) BOCPD, engineered for sub-microsecond latency in high-frequency trading systems.
BOCPD maintains a probability distribution over "run lengths" — time since the last changepoint — updated online with each observation. Unlike batch methods, it processes data sequentially with O(active_len) complexity per step.
Use cases: Volatility regime detection, trend break identification, adaptive filtering, real-time anomaly detection.
| Method | Latency | Detects | Online |
|---|---|---|---|
| CUSUM | ~100ns | Mean shift | ✓ |
| BOCPD | ~0.19µs | Mean + Variance | ✓ |
| HMM | ~50µs | Regimes | ✓ |
| Offline CPD | ~10ms | All | ✗ |
Benchmark Results (Intel Core i9):
| Configuration | Throughput | Latency |
|---|---|---|
| Single detector | 2.43M obs/sec | 0.41 µs |
| Stationary data (100K) | 1.87M obs/sec | 0.53 µs |
| With changepoints (100K) | 5.14M obs/sec | 0.19 µs |
| Pool (100 instruments) | 3.03M obs/sec | 0.33 µs |
| Large scale (380 instruments) | 2.12M obs/sec | 0.47 µs |
| Peak throughput | 5.14M obs/sec | 0.19 µs |
Measured on Intel Core i9, GCC/MSVC, -O3 -mavx2 -mfma:
| Test Scenario | Throughput | Latency | Speedup vs Naive |
|---|---|---|---|
| Single detector | 2.43M obs/sec | 0.41 µs | 46.4× |
| Stationary data (100K obs) | 1.87M obs/sec | 0.53 µs | 36× |
| With changepoints (100K obs) | 5.14M obs/sec | 0.19 µs | 99× |
| Multi-detector pool (100) | 3.03M obs/sec | 0.33 µs | 17.4× |
| Large scale (380 × 500) | 2.12M obs/sec | 0.47 µs | — |
| max_run | Throughput | avg_active |
|---|---|---|
| 64 | 2.26M obs/sec | 41.3 |
| 128 | 1.90M obs/sec | 53.1 |
| 256 | 1.84M obs/sec | 56.5 |
| 512 | 1.69M obs/sec | 61.4 |
| 1024 | 1.69M obs/sec | 62.8 |
| 2048 | 1.87M obs/sec | 56.0 |
| 4096 | 1.72M obs/sec | 61.2 |
| Implementation | Throughput | Relative |
|---|---|---|
| Python (numpy) | ~1.2K obs/sec | 1× |
| Rust (changepoint) | ~22K obs/sec | 18× |
| Naive C | 52K obs/sec | 43× |
| BOCPD Ultra V3.2 | 2.43M obs/sec | 2,025× |
| Peak (with changepoints) | 5.14M obs/sec | 4,283× |
┌─────────────────────────────────────────────────────────────────┐
│ BOCPD Ultra V3.2 │
├─────────────────────────────────────────────────────────────────┤
│ Custom lgamma (Lanczos + Stirling) │
│ ├─ Replaces slow libm lgamma() (~100 cycles → ~15 cycles) │
│ ├─ Lanczos g=4.7421875 for small args │
│ └─ Stirling with Bernoulli coefficients for large args │
├─────────────────────────────────────────────────────────────────┤
│ Native Interleaved Layout │
│ ├─ 256-byte superblocks (4 run lengths × 8 parameters) │
│ ├─ Prediction params first (cache-friendly) │
│ └─ Zero-copy: no per-step data transformation │
├─────────────────────────────────────────────────────────────────┤
│ Ping-Pong Double Buffering │
│ ├─ Two interleaved buffers, alternating each step │
│ ├─ Zero memmove operations │
│ └─ Implicit +1 shift via buffer swap │
├─────────────────────────────────────────────────────────────────┤
│ AVX2 ASM Kernel │
│ ├─ 8 elements/iteration (2 blocks × 4 lanes) │
│ ├─ Interleaved Block A/B for ILP │
│ ├─ Estrin polynomial evaluation (reduced latency) │
│ ├─ IEEE-754 exponent bit manipulation │
│ └─ bsr-based truncation (no branch mispredicts) │
├─────────────────────────────────────────────────────────────────┤
│ Pool Allocator │
│ ├─ Single malloc for N detectors │
│ ├─ Cache-friendly sequential layout │
│ └─ Reduced TLB pressure │
└─────────────────────────────────────────────────────────────────┘
At each timestep t with observation x_t, BOCPD updates the run-length distribution:
- Predict: Compute P(x_t | run_length = r) via Student-t posterior predictive
- Growth: P(r_t = r) ∝ P(r_{t-1} = r-1) · P(x_t | run) · (1 - H)
- Changepoint: P(r_t = 0) ∝ Σᵣ P(r_{t-1} = r) · P(x_t | run) · H
- Normalize: Ensure Σᵣ P(r_t = r) = 1
Where H = 1/λ is the hazard rate.
For Gaussian data with unknown mean μ and variance σ²:
Prior: p(μ, τ) = Normal(μ | μ₀, (κ₀τ)⁻¹) · Gamma(τ | α₀, β₀)
Posterior Predictive (Student-t):
p(x | data) = Student-t(x | μₙ, σ²ₙ, νₙ)
where:
νₙ = 2αₙ (degrees of freedom)
σ²ₙ = βₙ(κₙ+1)/(αₙκₙ) (scale)
The key signal is not r[0] directly, but:
- MAP drop: argmax(r) suddenly falls from high to low
- Short-run probability: P(r < k) spikes at changepoints
Time: 1 2 3 4 5 6 7 8 9 10 11 12
Data: ~~~~~~~~~~~~ normal ~~~~~~~~~~~~ | !! SHIFT !!
MAP rl: 1 2 3 4 5 6 7 8 9 2 3 4
P(r<5): .9 .5 .3 .2 .1 .1 .1 .1 .1 .8 .6 .4
↑
DETECTED
This section details the engineering that achieves 5.14M obs/sec peak throughput.
Problem: The Student-t log-pdf requires lgamma((ν+1)/2) - lgamma(ν/2) for every run length. The standard library lgamma() is extremely slow (~100+ cycles) and dominates runtime.
Solution: Implement custom lgamma using Lanczos approximation for small arguments and Stirling series for large arguments:
// Lanczos approximation (ν < 40): ~15 cycles
// g = 4.7421875 (exactly representable in binary)
double lgamma_lanczos(double x) {
static const double c[7] = {
0.99999999999980993,
676.5203681218851,
-1259.1392167224028,
771.32342877765313,
-176.61502916214059,
12.507343278686905,
-0.13857109526572012
};
// Horner's method evaluation...
}
// Stirling series (ν ≥ 40): ~12 cycles
// ln Γ(x) ≈ (x-½)ln(x) - x + ½ln(2π) + Σ B_{2k}/(2k(2k-1)x^{2k-1})
double lgamma_stirling(double x) {
// Bernoulli number coefficients B₂, B₄, B₆, B₈, B₁₀, B₁₂
// Achieves ~15 digits precision for x ≥ 40
}Impact: Replacing libm lgamma() with custom implementation gave ~100% speedup (2× faster). This was the single largest optimization.
Full Student-t log-pdf:
ln p(x) = lgamma((ν+1)/2) - lgamma(ν/2) - ½ln(νπσ²) - ((ν+1)/2)·ln(1 + z²/(νσ²))
Optimized (constants precomputed when posterior updates):
ln p(x) = C₁ - C₂ · log1p(z² · inv_σ²ν)
C₁ = lgamma(α+½) - lgamma(α) - ½ln(2απσ²) ← precomputed at update time
C₂ = α + ½ ← precomputed
inv_σ²ν = 1/(σ²·2α) ← precomputed
Hot path: 1 multiply, 1 FMA, 1 log1p. No lgamma, no division.
The lgamma calls happen only during posterior updates (once per run length that survives truncation), not during prediction (which runs for all active run lengths).
Problem: V2 required O(n) build_interleaved() transformation every observation to convert separate arrays into SIMD-friendly format.
Solution: Store parameters directly in 256-byte superblocks:
V2 Layout (separate arrays + staging buffer):
mu[]: [μ₀ μ₁ μ₂ μ₃ ...] ← separate allocation
C1[]: [c₀ c₁ c₂ c₃ ...] ← separate allocation
C2[]: [d₀ d₁ d₂ d₃ ...] ← separate allocation
inv_ssn[]: [e₀ e₁ e₂ e₃ ...] ← separate allocation
build_interleaved(): O(n) copy every step!
V3 Layout (native 256-byte superblocks):
Block 0 (bytes 0-255):
[μ₀ μ₁ μ₂ μ₃] offset 0 (prediction)
[c₀ c₁ c₂ c₃] offset 32 (prediction)
[d₀ d₁ d₂ d₃] offset 64 (prediction)
[e₀ e₁ e₂ e₃] offset 96 (prediction)
[κ₀ κ₁ κ₂ κ₃] offset 128 (update)
[α₀ α₁ α₂ α₃] offset 160 (update)
[β₀ β₁ β₂ β₃] offset 192 (update)
[n₀ n₁ n₂ n₃] offset 224 (update)
Block 1 (bytes 256-511):
... next 4 elements ...
Impact: Eliminated O(n) copy, direct SIMD access. ~20% overall speedup.
Problem: BOCPD reads from run length r but writes to r+1. Naive approach requires memmove() after every step to shift the distribution.
Solution: Maintain two buffers, alternate between them:
Step t (cur_buf = 0):
Read: interleaved[0][r] → posterior for run length r
Write: interleaved[1][r+1] → updated posterior at r+1
Flip: cur_buf = 1
Step t+1 (cur_buf = 1):
Read: interleaved[1][r] → now contains what was r+1
Write: interleaved[0][r+1] → write to other buffer
Flip: cur_buf = 0
Benefits:
- Zero
memmove()operations (was O(n) per step) - Implicit +1 index shift via buffer swap
- No data hazards between read and write
Impact: Eliminates ~15-20% overhead from memory copies.
Problem: Ping-pong update reads from index i, writes to index i+1. With interleaved blocks, this crosses block boundaries.
Solution: Use vpermpd to rotate vector, vblendpd to merge with existing blocks:
; Input: [v0, v1, v2, v3] for indices [i, i+1, i+2, i+3]
; Output: write to indices [i+1, i+2, i+3, i+4] spanning two blocks
vpermpd ymm_rot, ymm_vals, 0x93 ; [v3, v0, v1, v2]
vblendpd ymm_k, ymm_exist_k, ymm_rot, 0b1110 ; block k: lanes 1,2,3
vblendpd ymm_k1, ymm_exist_k1, ymm_rot, 0b0001 ; block k+1: lane 0Performance: 6 cycles vs ~16 cycles for scalar stores.
The Student-t log-pdf requires log(1 + t) where t = z²/σ²ν.
log(1+t) ≈ t · (c₁ + t·(c₂ + t·(c₃ + t·(c₄ + t·(c₅ + t·c₆)))))
Coefficients (Taylor series):
c₁ = 1, c₂ = -1/2, c₃ = 1/3, c₄ = -1/4, c₅ = 1/5, c₆ = -1/6
Horner's method: 6 FMAs, fully pipelined, ~15 cycles vs ~100 for libm log().
Problem: Horner's method for exp() has 6-deep dependency chain.
Solution: Estrin's scheme groups terms to reduce depth:
exp(x) = 2^k · 2^f where k = round(x/ln2), f = frac(x/ln2), |f| ≤ 0.5
2^f ≈ 1 + f·c₁ + f²·c₂ + f³·c₃ + f⁴·c₄ + f⁵·c₅ + f⁶·c₆
Estrin (depth 4):
p01 = 1 + f·c₁ }
p23 = c₂ + f·c₃ } parallel (level 1)
p45 = c₄ + f·c₅ }
f² = f · f
q0123 = p01 + f²·p23 } parallel (level 2)
q456 = p45 + f²·c₆ }
f⁴ = f² · f²
result = q0123 + f⁴·q456 (level 3)
Impact: ~5% improvement on polynomial evaluation.
Computing 2^k without libm:
vcvtpd2dq xmm0, ymm_k ; k → int32
vpmovsxdq ymm0, xmm0 ; sign-extend to int64
vpaddq ymm0, ymm0, [bias] ; add exponent bias (1023)
vpsllq ymm0, ymm0, 52 ; shift to exponent field
; ymm0 now contains bit pattern for 2^k as doubleDirect bit manipulation: 4 integer ops vs expensive floating-point pow().
Problem: Single 4-wide AVX2 block leaves FMA units underutilized during memory latency.
Solution: Process 8 elements per iteration as two 4-wide blocks:
Block addressing (V3 layout):
Block A: byte_offset = (i / 4) * 256
Block B: byte_offset = (i / 4) * 256 + 256
Iteration structure:
Block A: load → z² → t → log1p → exp → update → store
Block B: load → z² → t → log1p → exp → update → store
Advance: i += 8, idx_vec += 8
Problem: vhaddpd is slow (high latency, limited ports).
Solution: Use vunpckhpd + vaddsd:
; Reduce [a, b, c, d] → a+b+c+d
vextractf128 xmm0, ymm12, 1 ; [c, d]
vaddpd xmm0, xmm0, xmm12 ; [a+c, b+d]
vunpckhpd xmm1, xmm0, xmm0 ; [b+d, b+d]
vaddsd xmm0, xmm0, xmm1 ; a+b+c+dProblem: Chain of bt (bit test) instructions is slow.
Solution: Single bsr (bit scan reverse) finds highest set bit:
vcmppd ymm0, ymm_growth, ymm_thresh, 14
vmovmskpd eax, ymm0
test eax, eax
jz .skip
bsr ecx, eax ; Find highest set bit
lea rbx, [rsi + rcx + 1] ; last_valid = i + bit + 1Problem: Broadcasting loop counter is expensive:
// Old: 2 µops, 3-cycle latency per iteration
__m256i idx = _mm256_set1_epi64x(i);
idx = _mm256_add_epi64(idx, offset_vec);Solution: Maintain running index vectors:
; Setup (once):
vmovapd ymm15, [1.0, 2.0, 3.0, 4.0] ; idx_vec_A
vmovapd ymm_b, [5.0, 6.0, 7.0, 8.0] ; idx_vec_B (on stack)
; Per iteration (1 µop each):
vaddpd ymm15, ymm15, ymm7 ; ymm7 = [8,8,8,8]
vaddpd ymm3, [stack], ymm7
vmovapd [stack], ymm3Hand-written AVX2 kernel optimized for modern x86-64:
Dedicated constants:
ymm12 = threshold
ymm13 = 1-h
ymm14 = h (hazard rate)
ymm15 = x (observation, broadcast)
Accumulators:
ymm9 = max_growth_B
ymm10 = max_growth_A
ymm11 = r0 accumulator
Scratch (reused per block):
ymm0-8 = computation temporaries
Stack storage:
idx_vec_A, idx_vec_B, max_idx_A, max_idx_B, r_old_B
; Block A: elements [i, i+1, i+2, i+3]
mov rax, r14
shr rax, 2 ; block_index = i / 4
shl rax, 8 ; byte_offset = block_index * 256
vmovapd ymm0, [r13 + rax] ; mu at offset 0
vmovapd ymm1, [r13 + rax + 32] ; C1 at offset 32
vmovapd ymm2, [r13 + rax + 64] ; C2 at offset 64
vmovapd ymm3, [r13 + rax + 96] ; inv_ssn at offset 96
; Block B: elements [i+4, i+5, i+6, i+7]
mov rax, r14
add rax, 4
shr rax, 2
shl rax, 8 ; Next block at +256 bytes| Parameter | Typical Range | Effect |
|---|---|---|
mu0 |
0 | Prior mean center |
kappa0 |
0.001 - 0.1 | Lower = faster adaptation |
alpha0 |
0.5 - 5 | Variance prior shape |
beta0 |
Scale to data | E[σ²] ≈ β₀/(α₀-1) |
Daily returns (σ ≈ 1-2%):
bocpd_prior_t prior = {
.mu0 = 0.0,
.kappa0 = 0.01,
.alpha0 = 2.0,
.beta0 = 0.0002 // E[σ²] ≈ 0.0001
};| Data Frequency | λ | Expected regime duration |
|---|---|---|
| Tick data | 1000-10000 | Many ticks per regime |
| 1-minute bars | 200-500 | ~1-2 changes per day |
| Daily bars | 50-200 | ~1 change per month |
#include "bocpd_asm.h"
bocpd_asm_t detector;
bocpd_prior_t prior = {0.0, 1.0, 1.0, 1.0};
bocpd_ultra_init(&detector, 200.0, prior, 1024);
for (int i = 0; i < n_observations; i++) {
bocpd_ultra_step(&detector, data[i]);
if (detector.p_changepoint > 0.5) {
printf("Changepoint detected at t=%zu\n", detector.t);
}
}
bocpd_ultra_free(&detector);bocpd_pool_t pool;
bocpd_prior_t prior = {0.0, 1.0, 1.0, 1.0};
bocpd_pool_init(&pool, 100, 200.0, prior, 1024);
for (int t = 0; t < n_steps; t++) {
for (int i = 0; i < 100; i++) {
bocpd_asm_t *det = bocpd_pool_get(&pool, i);
bocpd_ultra_step(det, data[i][t]);
}
}
bocpd_pool_free(&pool);| Capacity | Memory | Notes |
|---|---|---|
| 64 | ~18 KB | Minimum practical |
| 256 | ~38 KB | Low-latency trading |
| 512 | ~74 KB | General use |
| 1024 | ~148 KB | Long regimes |
| 2048 | ~296 KB | Extended analysis |
This implementation achieves 5.14M obs/sec peak. Faster is possible — but not without sacrificing correctness.
| Rejected Optimization | Potential Gain | Why We Refused |
|---|---|---|
| Low-order polynomials | +7-15% | Tail probabilities span 300+ orders of magnitude |
| Approximate pruning (top-K) | +30-70% | Makes posterior non-Bayesian |
| float16/float32 | +10-20% | Underflows at ~1e-38; run-lengths hit 1e-200 |
| Top-1 path tracking | +3000% | Not Bayesian, can't quantify uncertainty |
| Bit-hack exponentiation | +15-25% | Numerically unsafe for BOCPD ranges |
Anyone claiming 10M+ obs/sec for "BOCPD" is likely doing top-1 path tracking or using garbage polynomials. That's not Bayesian inference.
5.14M obs/sec is the performance ceiling given numerical constraints. For trading systems where wrong signals cost real money, that ceiling is the right place to be.
-
Adams, R. P., & MacKay, D. J. C. (2007). Bayesian Online Changepoint Detection. arXiv:0710.3742. [PDF]
-
Murphy, K. P. (2007). Conjugate Bayesian analysis of the Gaussian distribution. [PDF]
-
Intel Intrinsics Guide. [Link]
-
Agner Fog's Optimization Manuals. [Link]
| Version | Throughput | Key Changes |
|---|---|---|
| V1 (Naive) | 52K obs/sec | Reference implementation |
| V2 | 525K obs/sec | Ping-pong buffers, AVX2 kernel |
| V3 | 1.5M obs/sec | Native interleaved layout |
| V3.2 | 5.14M obs/sec | Optimized ASM kernel, pool allocator |
MIT License. See LICENSE.
Built for speed. Designed for trading. Open for all.