
Reduce microbenchmark runtime and fix pr-comment retry behavior#5481

Open
p-datadog wants to merge 1 commit into master from fix/microbenchmarks-ci-runtime

Conversation


@p-datadog p-datadog commented Mar 19, 2026

What does this PR do?

Two targeted improvements to microbenchmark CI reliability. Full investigation at https://github.com/p-datadog/datadog-docs/blob/master/microbenchmarks-ci-investigation.md.

1. Reduce REPETITIONS from 6 to 4 (benchmarks/execution.yml)

The [other] benchmark group (error_tracking, tracing, DI, gem loading) currently takes ~34 min per run. The cost model is 1.5 + 5.33 × REPETITIONS, so reducing repetitions has a meaningful impact. At REPETITIONS=4 the [other] job runs in ~23 min — 32% faster and still 7 min under the old pre-parallelization single-job runtime of 30 min.
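The cost model above can be sanity-checked with a few lines of arithmetic (the function name and the rounding-up convention are mine; the 1.5 and 5.33 coefficients are from the investigation):

```python
import math

def estimated_runtime_min(repetitions, overhead=1.5, per_rep=5.33):
    # Cost model from the investigation: fixed setup overhead plus a
    # per-repetition cost, both in minutes.
    return overhead + per_rep * repetitions

for reps in (6, 4, 3):
    print(f"{reps} reps -> ~{math.ceil(estimated_runtime_min(reps))} min")
# 6 reps -> ~34 min (current)
# 4 reps -> ~23 min (this PR)
# 3 reps -> ~18 min (possible follow-up)
```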

PR #5313 (which introduced parallelization) validated stability at 6 and 10 reps with CPU isolation, showing 0/46 unstable metrics. No data exists at 4 reps, but CPU isolation independently reduces variance and is unchanged here. This PR's own benchmark report serves as the stability validation — if unstable metrics appear we can adjust before merging. Reducing further to 3 (18 min) is worth considering if the report looks clean.

Shorter runs also reduce the window during which a job can be cancelled by a new push on active PR branches.
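For concreteness, a sketch of the change in benchmarks/execution.yml (the surrounding structure is illustrative; only the REPETITIONS value is from this PR):

```yaml
# benchmarks/execution.yml (illustrative structure)
variables:
  REPETITIONS: 4   # was 6; ~23 min vs ~34 min for the [other] group
```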

2. Improve microbenchmarks-pr-comment retry behavior (.gitlab/benchmarks.yml)

Changed when: always to when: on_success and added allow_failure: true.

With when: always, if an upstream benchmark job is cancelled, pr-comment still attempts to run with partial/missing artifacts and fails — resulting in two jobs needing attention instead of one. It's not immediately obvious which job to retry first (the upstream benchmark, not the downstream comment job).

With when: on_success:

  • If benchmarks are cancelled: pr-comment is skipped (not failed), making it clear which job needs a retry
  • When the benchmark is retried and succeeds: pr-comment auto-triggers via needs: — no manual retry needed (behavior confirmed in pipeline 102673493)
  • allow_failure: true ensures a comment-posting issue (network blip, bp-runner bug) never blocks the pipeline
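A minimal sketch of the resulting job definition in .gitlab/benchmarks.yml (the job name is from this PR; the upstream job name, stage, and script command are illustrative):

```yaml
microbenchmarks-pr-comment:
  stage: benchmark
  when: on_success        # was always; skip (not fail) if an upstream job is cancelled
  allow_failure: true     # a comment-posting failure never blocks the pipeline
  needs:
    - job: microbenchmarks-other   # illustrative upstream job name
      artifacts: true
  script:
    - bp-runner post-pr-comment    # illustrative command
```

With when: on_success and needs:, GitLab skips this job when a dependency fails or is cancelled, and re-triggers it automatically once the retried dependency succeeds.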

3. Per-benchmark completion timestamps (DataDog/benchmarking-platform#249)

Companion PR in benchmarking-platform adds a completion timestamp log line after each parallel benchmark finishes. This will help identify which of the 4 [other] benchmarks takes the longest — useful context for any future splitting or rebalancing of the group.

Motivation:

Microbenchmarks are currently the longest-running CI job (~34 min) and run on every PR regardless of what changed. On active branches, they can be cancelled mid-run due to interruptible: true, which sometimes leads to a multi-step retry process. These two changes reduce runtime and simplify recovery when cancellations do happen.

Change log entry

None.

How to test the change?

The benchmark report from this PR's CI run is the stability validation for REPETITIONS=4. Check the report for "unstable metrics" — a no-change comparison should show 0 improvements, 0 regressions, 0 unstable metrics. The pr-comment behavior change can be verified by observing that a cancelled [other] job leaves pr-comment in skipped state rather than failed/cancelled.

- Reduce REPETITIONS from 6 to 4 in benchmarks/execution.yml
- Change microbenchmarks-pr-comment to when: on_success + allow_failure: true

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@p-datadog p-datadog requested a review from a team as a code owner March 19, 2026 19:39
@p-datadog added the AI Generated label (largely based on code generated by an AI or LLM; this label is the same across all dd-trace-* repos) on Mar 19, 2026

datadog-datadog-prod-us1-2 bot commented Mar 19, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 95.12% (-0.01%)

🔗 Commit SHA: 920366b


pr-commenter bot commented Mar 19, 2026

Benchmarks

Benchmark execution time: 2026-03-19 20:05:05

Comparing candidate commit 920366b in PR branch fix/microbenchmarks-ci-runtime with baseline commit e87e284 in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 45 metrics; 1 metric is unstable.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'
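The decision rule illustrated above can be sketched in a few lines (my own illustrative code, not the benchmarking platform's implementation; the interval and threshold values are taken from the two diagrams):

```python
def is_significant(ci_low, ci_high, threshold):
    # Significant only if the whole confidence interval (in %) lies
    # outside the ±threshold band around 0%.
    return ci_low > threshold or ci_high < -threshold

print(is_significant(-0.6, 1.2, 1.0))  # False: CI straddles 0%
print(is_significant(1.3, 3.1, 1.0))   # True: significantly worse
```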
