Clean up align_partial_results, eliminate double alignment, and remove Data round-trip #5110

Open
ltiao wants to merge 1 commit into facebook:main from ltiao:export-D98544835

Conversation


@ltiao ltiao commented Mar 30, 2026

Summary:
Profiling-driven cleanup and performance improvements to the early stopping pipeline.

**1. Remove redundant validations from `align_partial_results` (utils.py)**

`align_partial_results` contained three validation/logging blocks that are redundant with upstream checks already performed by `_lookup_and_validate` in `base.py`:

- **Missing metrics check + `raise ValueError`:** `_lookup_and_validate` already verifies each metric signature exists in the data and returns `None` before `align_partial_results` is ever called.
- **Per-metric logging loop:** The "no data" branch is unreachable because the upstream checks guarantee data exists for each metric. Debug logging about MAP_KEY ranges is misplaced for a pure alignment function.
- **Trial-to-arm uniqueness check:** Redundant with `is_eligible_any`, which already rejects `BatchTrial` -- the only way a trial gets multiple arms.

The arm-to-trial uniqueness check was moved (not removed) to `_lookup_and_validate`, where it properly guards all downstream consumers, not just alignment.

The `isin` filter (`df = df[df['metric_signature'].isin(metrics)]`) was kept -- it is part of `align_partial_results`'s own contract, since the function accepts a `metrics` argument and callers depend on it filtering to those metrics.

After cleanup, `align_partial_results` is a focused alignment function: filter to requested metrics -> drop `arm_name` -> drop duplicates -> sort -> pivot -> interpolate.
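A minimal pandas sketch of that pipeline (the column names `metric_signature`, `trial_index`, `step`, and `mean` are illustrative stand-ins for Ax's actual schema, and the real function handles more edge cases):

```python
import pandas as pd

def align_partial_results_sketch(
    df: pd.DataFrame, metrics: list[str]
) -> dict[str, pd.DataFrame]:
    """Sketch of the cleaned-up alignment pipeline: one frame per metric."""
    # 1. Filter to requested metrics (the retained `isin` contract).
    df = df[df["metric_signature"].isin(metrics)]
    # 2. Drop arm_name (each trial maps to a single arm after upstream checks).
    df = df.drop(columns=["arm_name"])
    # 3. Drop duplicates and sort by trial and map key (e.g. training step).
    df = df.drop_duplicates().sort_values(["trial_index", "step"])
    # 4. Pivot each metric to a (step x trial) frame, then interpolate gaps
    #    so trials reported at different steps share a common index.
    return {
        m: g.pivot(index="step", columns="trial_index", values="mean").interpolate()
        for m, g in df.groupby("metric_signature")
    }
```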

**2. Eliminate double data alignment in `PercentileEarlyStoppingStrategy` (percentile.py)**

When `check_safe=True`, the base-class `should_stop_trials_early` calls `_is_harmful()`, which calls `_prepare_aligned_frames()`; then `_should_stop_trials_early()` calls `_prepare_aligned_frames()` again -- the identical data lookup + alignment runs twice.

Fix: override `should_stop_trials_early` to call `_default_objective_and_direction()` and `_prepare_aligned_frames()` once, passing the results as required arguments (`metric_signature`, `minimize`, `aligned_frames`) to both `_is_harmful` and `_should_stop_trials_early`.
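The shape of the fix can be sketched with a toy class (the method names mirror the PR, but the bodies are stand-ins, not Ax's real logic):

```python
class CountingStrategy:
    """Toy model of the override: align once, thread the result through."""

    def __init__(self) -> None:
        self.prepare_calls = 0

    def _prepare_aligned_frames(self) -> tuple:
        self.prepare_calls += 1  # stands in for the expensive lookup + alignment
        return ("aligned_frames",)

    def _is_harmful(self, aligned_frames: tuple) -> bool:
        return False  # the safety check now reuses the caller's frames

    def _should_stop_trials_early(self, aligned_frames: tuple) -> dict:
        return {}  # the stopping decision reuses the same frames

    def should_stop_trials_early(self, check_safe: bool = True) -> dict:
        aligned_frames = self._prepare_aligned_frames()  # computed exactly once
        if check_safe and self._is_harmful(aligned_frames):
            return {}
        return self._should_stop_trials_early(aligned_frames)
```

Previously the equivalent of `_prepare_aligned_frames` ran inside both `_is_harmful` and `_should_stop_trials_early`; here the counter makes it easy to verify it runs once.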

**3. Eliminate the `Data` round-trip in `_lookup_and_validate` (base.py, threshold.py)**

`_lookup_and_validate` (formerly `_lookup_and_validate_data`) used to wrap its filtered DataFrame back into `Data(df=filtered_df)` at the end, purely to satisfy its `Data | None` return type. The sole consumer (`_prepare_aligned_frames`) immediately called `.full_df` to get the DataFrame back. This triggered a pointless df-to-DataRow-to-df round-trip on every call:

```
DataFrame -> itertuples -> list[DataRow] -> from_records -> regex sort -> cast -> DataFrame
```

Profiling shows this round-trip is **100% overhead** at every scale:

| Scale            | Rows    | Round-trip cost | Replacement (df.copy) |
|------------------|---------|-----------------|-----------------------|
| tiny (5x10)      | 50      | 4ms             | <0.1ms                |
| typical (20x100) | 2,000   | 15ms            | <0.1ms                |
| large (50x200)   | 10,000  | 60ms            | <0.1ms                |
| xlarge (100x200) | 20,000  | 160ms           | <0.1ms                |
| huge (200x500)   | 100,000 | 733ms           | <0.1ms                |

The cost comes from `Data.__init__` iterating every row via `itertuples()` to build `list[DataRow]` (~35ms at 10k rows), then `Data.full_df` reconstructing the DataFrame via `from_records()` (~12ms) and running regex-based arm-name parsing in `sort_by_trial_index_and_arm_name()` (~19ms) -- none of which is needed, since we already had the DataFrame.

Fix: change `_lookup_and_validate` to return `pd.DataFrame | None` directly. Update `ModelBasedEarlyStoppingStrategy` and `ThresholdEarlyStoppingStrategy` accordingly.

End-to-end profiling of `should_stop_trials_early` at the 50x200 scale (10k rows) shows a ~19% speedup (323ms -> ~263ms), with the benefit growing at larger scales (up to ~41% at 100k rows).

**4. Naming cleanup**

Renamed methods and variables to disambiguate between `Data` objects and `pd.DataFrame`:

- `_lookup_and_validate_data` -> `_lookup_and_validate` (returns `pd.DataFrame | None`)
- `_prepare_aligned_data` -> `_prepare_aligned_frames` (returns a tuple of DataFrames)
- Variables holding `_lookup_and_validate` results: `data`/`map_data`/`data_lookup` -> `df`
- Parameter `aligned_data` -> `aligned_frames`

**5. Tests**

- Removed tests for moved/removed `align_partial_results` validations.
- Added the arm-to-trial uniqueness check as a subtest in `test_is_eligible`.
- Added a profiling notebook at `ax/early_stopping/profiling.ipynb`.

Reviewed By: saitcakmak

Differential Revision: D98544835


meta-codesync bot commented Mar 30, 2026

@ltiao has exported this pull request. If you are a Meta employee, you can view the originating Diff in D98544835.

meta-cla bot added the "CLA Signed" label Mar 30, 2026

codecov-commenter commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 90.76923% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.40%. Comparing base (c08c4a9) to head (c6a49ee).

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| ax/early_stopping/strategies/base.py | 73.33% | 4 Missing ⚠️ |
| ax/early_stopping/strategies/percentile.py | 80.00% | 2 Missing ⚠️ |
Additional details and impacted files
```
@@           Coverage Diff           @@
##             main    #5110   +/-   ##
=======================================
  Coverage   96.39%   96.40%
=======================================
  Files         613      613
  Lines       68345    68322   -23
=======================================
- Hits        65880    65863   -17
+ Misses       2465     2459    -6
```


ltiao force-pushed the export-D98544835 branch from a01af83 to fb5afd8 on March 31, 2026
ltiao force-pushed the export-D98544835 branch from fb5afd8 to c6a49ee on April 6, 2026