Decouple metric fetch errors from trial status in Orchestrator (#5119)

Open
saitcakmak wants to merge 1 commit into facebook:main from saitcakmak:export-D98924467
Conversation

@saitcakmak
Contributor

@saitcakmak saitcakmak commented Apr 1, 2026

Summary:

Design doc: D98741656

When fetch_trials_data_results returned a MetricFetchE for an
optimization config metric, the orchestrator marked the trial as
ABANDONED. This discarded good data, inflated the failure rate, and was
inconsistent with the Client layer, which keeps trials COMPLETED with
incomplete metrics via MetricAvailability (D93924193).

This diff removes the trial abandonment behavior. Metric fetch errors
are now logged (with traceback via logger.exception) but trial status
is unchanged. MetricAvailability tracks data completeness, and the
failure rate check uses it to detect persistent metric issues.
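
As a rough illustration of the new behavior (log the error, leave the trial status alone, still notify a hook), here is a minimal self-contained sketch. It is not the Ax implementation: `Trial`, `Status`, `fetch_metric`, `fetch_and_process`, and the `on_error` callback are hypothetical stand-ins, and the sketch raises exceptions where the real code receives error results from `fetch_trials_data_results`.

```python
import logging
from dataclasses import dataclass
from enum import Enum

logger = logging.getLogger("orchestrator_sketch")


class Status(Enum):
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    ABANDONED = "ABANDONED"


@dataclass
class Trial:
    index: int
    status: Status


def fetch_metric(trial, metric_name):
    """Hypothetical fetch that always fails for the 'latency' metric."""
    if metric_name == "latency":
        raise ConnectionError("backend unavailable")
    return 1.0


def fetch_and_process(trial, metric_names, on_error=None):
    """Fetch every metric; log failures without touching trial.status."""
    data = {}
    for name in metric_names:
        try:
            data[name] = fetch_metric(trial, name)
        except Exception as exc:
            # logger.exception records the message plus the active traceback.
            logger.exception("Failed to fetch %r for trial %d", name, trial.index)
            # Stand-in for a reporting hook that subclasses may override.
            if on_error is not None:
                on_error(trial, name, exc)
    return data
```

With this shape, a trial that finished running keeps its COMPLETED status and whatever metrics were fetched successfully; the missing metric is tracked separately rather than encoded in the status.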

Changes:

  • _fetch_and_process_trials_data_results: Removed the branch that
    marked trials ABANDONED for metric fetch errors and the separate
    is_available_while_running branch. All metric fetch errors are
    now simply logged and the method continues. The _report_metric_fetch_e
    hook is still called so subclasses (e.g. AxSweepOrchestrator) can
    react to errors (create pastes, build error tables, etc.).
  • error_if_failure_rate_exceeded: Merged _check_if_failure_rate_exceeded
    into this method to avoid duplicate computation. Now counts both
    runner failures (FAILED/ABANDONED) and metric-incomplete trials (via
    compute_metric_availability) toward the failure rate.
  • _get_failure_rate_exceeded_error: Rewritten with an actionable
    error message listing runner failures, metric-incomplete trials,
    missing metrics, and affected trial indices.
  • Removed dead code: _mark_err_trial_status,
    _num_trials_bad_due_to_err, _num_metric_fetch_e_encountered,
    _check_if_failure_rate_exceeded, METRIC_FETCH_ERR_MESSAGE.
  • Kept _report_metric_fetch_e as a no-op hook so subclasses like
    AxSweepOrchestrator can still react to metric fetch errors.
  • Updated telemetry (OrchestratorCompletedRecord) to use
    _count_metric_incomplete_trials (via compute_metric_availability)
    for both num_metric_fetch_e_encountered and
    num_trials_bad_due_to_err.
  • Updated AxSweepOrchestrator test assertions: trials now stay
    COMPLETED (not ABANDONED) after metric fetch errors.
  • Metric.recoverable_exceptions and Metric.is_recoverable_fetch_e
    are kept for now since pts/ metrics still reference them; cleanup
    will follow in a separate diff.
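
The merged failure-rate check described in the list above could look roughly like the following sketch. It assumes a simplified per-trial summary (`TrialSummary`, hypothetical) in place of `compute_metric_availability`, treats only COMPLETED trials with missing metrics as metric-incomplete (an assumption), and uses a hypothetical `tolerated_rate` parameter. The message wording is illustrative, not the text produced by `_get_failure_rate_exceeded_error`.

```python
from dataclasses import dataclass, field


@dataclass
class TrialSummary:
    index: int
    status: str  # e.g. "COMPLETED", "FAILED", "ABANDONED"
    missing_metrics: set = field(default_factory=set)


RUNNER_FAILURE_STATUSES = ("FAILED", "ABANDONED")


def failure_rate_error(trials, tolerated_rate):
    """Return an error message if the failure rate is exceeded, else None."""
    runner_failed = [t for t in trials if t.status in RUNNER_FAILURE_STATUSES]
    metric_incomplete = [
        t for t in trials if t.status == "COMPLETED" and t.missing_metrics
    ]
    # Both kinds of bad trial count toward the same rate.
    num_bad = len(runner_failed) + len(metric_incomplete)
    if not trials or num_bad / len(trials) <= tolerated_rate:
        return None
    missing = sorted(set().union(*(t.missing_metrics for t in metric_incomplete)))
    return (
        f"Failure rate {num_bad / len(trials):.0%} exceeds tolerated "
        f"{tolerated_rate:.0%}. "
        f"Runner failures: {[t.index for t in runner_failed]}. "
        f"Metric-incomplete trials: {[t.index for t in metric_incomplete]}. "
        f"Missing metrics: {missing}."
    )
```

Counting both categories in one pass keeps the rate and the message built from the same data, which is the stated reason for merging the two methods.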

Differential Revision: D98924467

@meta-codesync

meta-codesync bot commented Apr 1, 2026

@saitcakmak has exported this pull request. If you are a Meta employee, you can view the originating Diff in D98924467.

@meta-cla meta-cla bot added the "CLA Signed" label (Do not delete this pull request or issue due to inactivity.) Apr 1, 2026
saitcakmak added a commit to saitcakmak/Ax that referenced this pull request Apr 1, 2026
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.39%. Comparing base (6cebd1c) to head (fa8f484).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5119      +/-   ##
==========================================
- Coverage   96.40%   96.39%   -0.02%     
==========================================
  Files         613      613              
  Lines       68142    68146       +4     
==========================================
- Hits        65694    65691       -3     
- Misses       2448     2455       +7     

@meta-codesync meta-codesync bot changed the title from "Decouple metric fetch errors from trial status in Orchestrator" to "Decouple metric fetch errors from trial status in Orchestrator (#5119)" Apr 1, 2026
saitcakmak added a commit to saitcakmak/Ax that referenced this pull request Apr 1, 2026
Labels

CLA Signed (Do not delete this pull request or issue due to inactivity.), fb-exported, meta-exported
