
Fix flaky CpuAndWallTimeWorker sampling test on macOS #5482

Draft
p-datadog wants to merge 1 commit into master from investigate/flaky-cpu-wall-time-worker-spec

Conversation

@p-datadog
Member

What does this PR do?

Fixes a flaky profiling test by increasing the sampling window from 100ms to 200ms.

Motivation:

The test CpuAndWallTimeWorker#start when main thread is sleeping but a background thread is working failed on Test (macos-15, 3.0) in PR #5481 with sample_count: 4 (threshold: 5). The profiler managed only 4 trigger_sample_attempts in the 100ms window, due to startup overhead on the macOS ARM64 + Ruby 3.0 runners.

How I Reproduced the Issue

The CI failure on PR #5481 shows the test got exactly 4 samples in 100ms instead of the required 5. The stats (trigger_sample_attempts=>4) confirm the profiler did not even attempt enough samples: the failure is not a signal-delivery problem but an insufficient sampling window.

Root Cause

The test sleeps for only 100ms and expects ≥5 samples at a target rate of 100 samples/sec. On macOS-15 ARM64 CI runners with Ruby 3.0, profiler startup overhead and thread scheduling variability reduce the effective sampling window below what's needed for 5 samples. The margin between expected (10 samples) and threshold (5) is too thin for this environment.

Fix

Increase sleep from 0.1s to 0.2s, matching the duration used by similar tests in the same file (lines 474 and 605). At 100 samples/sec, 200ms gives ~20 expected samples — well above the threshold of 5.
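The arithmetic behind the fix can be sketched as follows (a minimal illustration; the constant and helper names are made up for this sketch and are not the actual spec code):

```ruby
# Illustrative sketch of the sampling-window math behind the fix.
# SAMPLES_PER_SECOND and expected_samples are hypothetical names.
SAMPLES_PER_SECOND = 100     # profiler's target sampling rate
MINIMUM_EXPECTED_SAMPLES = 5 # threshold the test asserts on

# Samples the profiler should collect during a sleep of the given length,
# assuming it runs at the target rate for the whole window
def expected_samples(sleep_duration_seconds)
  (sleep_duration_seconds * SAMPLES_PER_SECOND).round
end

expected_samples(0.1) # => 10, only 2x the threshold of 5
expected_samples(0.2) # => 20, 4x the threshold of 5
```

With the old 100ms window, losing a bit more than half the window to startup overhead is enough to dip below the threshold; the 200ms window tolerates losing three quarters of it.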

Change log entry

None.

How to test the change?

CI should pass on Test (macos-15, 3.0) which was previously failing.

Root cause: The test sleeps for only 100ms and expects ≥5 samples at
100 samples/sec. On macOS-15 ARM64 + Ruby 3.0 CI runners, profiler
startup overhead and thread scheduling variability reduce the effective
sampling window, resulting in only 4 samples (just below the threshold).

Increase sleep from 0.1s to 0.2s to provide more margin for sample
collection, matching the duration used by similar tests in the same file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@p-datadog added the AI Generated label (Largely based on code generated by an AI or LLM; this label is the same across all dd-trace-* repos) on Mar 19, 2026
@github-actions bot added the dev/testing label (Involves testing processes, e.g. RSpec) on Mar 19, 2026

datadog-official bot commented Mar 19, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 95.13% (+0.00%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: c01965c


pr-commenter bot commented Mar 19, 2026

Benchmarks

Benchmark execution time: 2026-03-19 20:46:29

Comparing candidate commit c01965c in PR branch investigate/flaky-cpu-wall-time-worker-spec with baseline commit e87e284 in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 46 metrics, 0 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'
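The relative-difference-of-means CI described above could be computed roughly as follows. This is a simplified normal-approximation sketch assuming independent samples; the function name is hypothetical and the benchmarking platform's actual statistics may differ:

```ruby
# Hypothetical sketch: confidence interval over the relative difference
# of means, with the baseline as the reference. Not the platform's code.
def relative_diff_ci(candidate, baseline, z: 1.96)
  mean = ->(xs) { xs.sum(0.0) / xs.size }
  var  = ->(xs) { m = mean.call(xs); xs.sum(0.0) { |x| (x - m)**2 } / (xs.size - 1) }

  baseline_mean = mean.call(baseline)
  # Standard error of the difference of means (independent samples)
  se = Math.sqrt(var.call(candidate) / candidate.size +
                 var.call(baseline) / baseline.size)
  center = (mean.call(candidate) - baseline_mean) / baseline_mean
  half   = z * se / baseline_mean
  [center - half, center + half] # [lower bound, upper bound]
end
```

With identical measurements in each group the interval collapses to a point at the relative difference of the means; real benchmark runs have variance, which widens the interval around that center.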

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'
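The classification rule the two diagrams above illustrate might be encoded as follows (an illustrative sketch with made-up names, for a "lower is better" metric such as execution time; not the benchmarking platform's actual implementation):

```ruby
# Hypothetical encoding of the rule above: a change is significant only
# when the entire CI lies outside the [-threshold, +threshold] band.
def classify(ci_lower, ci_upper, threshold)
  if ci_lower > threshold
    :significantly_worse    # whole CI above +threshold (🟥)
  elsif ci_upper < -threshold
    :significantly_better   # whole CI below -threshold (🟩)
  else
    :no_significant_change  # CI overlaps the threshold band
  end
end

# With a 1% threshold, a CI of [1.3%, 3.1%] (as in the second diagram)
# is significantly worse, while [-0.6%, +1.2%] (as in the first) is not.
classify(0.013, 0.031, 0.01)  # => :significantly_worse
classify(-0.006, 0.012, 0.01) # => :no_significant_change
```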
