Decouple metric fetch errors from trial status in Orchestrator (#5119) by saitcakmak · Pull Request #5119 · facebook/Ax

saitcakmak · 2026-04-01T17:39:43Z

Summary:

Design doc: D98741656

When `fetch_trials_data_results` returned a `MetricFetchE` for an
optimization config metric, the orchestrator marked the trial as
ABANDONED. This discarded good data, inflated the failure rate, and was
inconsistent with the Client layer, which keeps trials COMPLETED with
incomplete metrics via `MetricAvailability` (D93924193).

This diff removes the trial abandonment behavior. Metric fetch errors
are now logged (with traceback via `logger.exception`) but trial status
is unchanged. `MetricAvailability` tracks data completeness, and the
failure rate check uses it to detect persistent metric issues.
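
A minimal sketch of the new behavior described above, assuming simplified stand-in types. `Trial`, `Status`, `FetchError`, and `process_fetch_results` are hypothetical names for illustration and are not the Ax API. The sketch logs through `logger.error(..., exc_info=...)` because it runs outside an `except` block; the diff itself is described as using `logger.exception`.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

logger = logging.getLogger("orchestrator_sketch")


class Status(Enum):
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    ABANDONED = "ABANDONED"


@dataclass
class FetchError:
    """Hypothetical stand-in for a failed metric fetch result."""

    message: str
    exception: Optional[Exception] = None


@dataclass
class Trial:
    """Hypothetical stand-in for a trial; only holds status and fetched data."""

    index: int
    status: Status
    data: dict = field(default_factory=dict)


def process_fetch_results(trials, results):
    """Store successful results and log failures without changing trial status.

    `results` maps trial index -> {metric name -> float or FetchError}.
    Returns the (trial index, metric name) pairs that failed to fetch.
    """
    failed = []
    for trial_index, per_metric in results.items():
        trial = trials[trial_index]
        for metric_name, result in per_metric.items():
            if isinstance(result, FetchError):
                # Log with a traceback when one is available; do not abandon the trial.
                logger.error(
                    "Failed to fetch %s for trial %d: %s",
                    metric_name,
                    trial_index,
                    result.message,
                    exc_info=result.exception,
                )
                failed.append((trial_index, metric_name))
                continue
            # Data that was fetched successfully is kept.
            trial.data[metric_name] = result
    return failed
```

In this sketch a trial with one failed metric keeps its status and its other metric values, which is the property the diff is after: completeness is tracked separately from status.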

Changes:

- `_fetch_and_process_trials_data_results`: Removed the branch that
  marked trials ABANDONED for metric fetch errors and the separate
  `is_available_while_running` branch. All metric fetch errors are
  now simply logged and the method continues. The `_report_metric_fetch_e`
  hook is still called so subclasses (e.g. `AxSweepOrchestrator`) can
  react to errors (create pastes, build error tables, etc.).
- `error_if_failure_rate_exceeded`: Merged `_check_if_failure_rate_exceeded`
  into this method to avoid duplicate computation. Now counts both
  runner failures (FAILED/ABANDONED) and metric-incomplete trials (via
  `compute_metric_availability`) toward the failure rate.
- `_get_failure_rate_exceeded_error`: Rewritten with an actionable
  error message listing runner failures, metric-incomplete trials,
  missing metrics, and affected trial indices.
- Removed dead code: `_mark_err_trial_status`,
  `_num_trials_bad_due_to_err`, `_num_metric_fetch_e_encountered`,
  `_check_if_failure_rate_exceeded`, `METRIC_FETCH_ERR_MESSAGE`.
- Kept `_report_metric_fetch_e` as a no-op hook so subclasses like
  `AxSweepOrchestrator` can still react to metric fetch errors.
- Updated telemetry (`OrchestratorCompletedRecord`) to use
  `_count_metric_incomplete_trials` (via `compute_metric_availability`)
  for both `num_metric_fetch_e_encountered` and
  `num_trials_bad_due_to_err`.
- Updated `AxSweepOrchestrator` test assertions: trials now stay
  COMPLETED (not ABANDONED) after metric fetch errors.
- `Metric.recoverable_exceptions` and `Metric.is_recoverable_fetch_e`
  are kept for now since `pts/` metrics still reference them; cleanup
  will follow in a separate diff.
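
The combined failure rate check described in the list above could look roughly like the following sketch. It is an illustration under stated assumptions, not the code in this diff: `TrialSummary`, `FailureRateExceeded`, `check_failure_rate`, and the `tolerated_rate` parameter are hypothetical names, and the choice of all trials as the denominator is an assumption. In the real change, metric completeness comes from `compute_metric_availability`; here it is represented by a plain `missing_metrics` tuple.

```python
from dataclasses import dataclass


class FailureRateExceeded(Exception):
    """Hypothetical error type raised when too many trials are bad."""


@dataclass
class TrialSummary:
    """Hypothetical per-trial summary: status plus any metrics with no data."""

    index: int
    status: str
    missing_metrics: tuple = ()


def check_failure_rate(trials, tolerated_rate):
    """Raise if runner failures plus metric-incomplete trials exceed the rate."""
    if not trials:
        return
    # Runner failures: the trial itself did not finish successfully.
    runner_failed = [t.index for t in trials if t.status in ("FAILED", "ABANDONED")]
    # Metric-incomplete: the trial completed but some metric has no data.
    metric_incomplete = [
        t.index for t in trials if t.status == "COMPLETED" and t.missing_metrics
    ]
    missing = sorted(
        {m for t in trials if t.status == "COMPLETED" for m in t.missing_metrics}
    )
    rate = (len(runner_failed) + len(metric_incomplete)) / len(trials)
    if rate <= tolerated_rate:
        return
    raise FailureRateExceeded(
        f"Failure rate {rate:.0%} exceeds tolerated {tolerated_rate:.0%}: "
        f"{len(runner_failed)} runner failure(s) at trials {runner_failed}, "
        f"{len(metric_incomplete)} metric-incomplete trial(s) at trials "
        f"{metric_incomplete}, missing metrics {missing}."
    )
```

Counting each trial in at most one category keeps the rate bounded by 1, and listing trial indices and missing metric names in the message mirrors the "actionable error message" goal stated for `_get_failure_rate_exceeded_error`.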

Differential Revision: D98924467

meta-codesync · 2026-04-01T17:39:50Z

@saitcakmak has exported this pull request. If you are a Meta employee, you can view the originating Diff in D98924467.

codecov-commenter · 2026-04-01T18:19:30Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.39%. Comparing base (6cebd1c) to head (fa8f484).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5119      +/-   ##
==========================================
- Coverage   96.40%   96.39%   -0.02%     
==========================================
  Files         613      613              
  Lines       68142    68146       +4     
==========================================
- Hits        65694    65691       -3     
- Misses       2448     2455       +7

meta-cla bot added the CLA Signed label Apr 1, 2026

meta-codesync bot added fb-exported meta-exported labels Apr 1, 2026

meta-codesync bot changed the title ~~Decouple metric fetch errors from trial status in Orchestrator~~ Decouple metric fetch errors from trial status in Orchestrator (#5119) Apr 1, 2026

saitcakmak force-pushed the export-D98924467 branch from fa8f484 to 2209b6a Compare April 1, 2026 22:16

saitcakmak force-pushed the export-D98924467 branch from 2209b6a to 012e594 Compare April 1, 2026 22:19

saitcakmak wants to merge 1 commit into facebook:main from saitcakmak:export-D98924467
