Fix UCP pod CrashLoopBackOff caused by startup race condition #11640

Draft
Copilot wants to merge 3 commits into main from copilot/fix-ucp-pod-crashloopbackoff

Copilot AI commented Apr 14, 2026

Description

UCP pod intermittently enters CrashLoopBackOff when the Kubernetes API server isn't fully ready at startup. Three compounding issues:

  1. DatabaseProvider permanently caches initialization errors — if the first GetClient() call fails (API server not ready), the error is cached forever. All subsequent calls from every service return the same cached error with no retry possible.

  2. Initializer service has no retry logic — a single transient failure during manifest registration kills the service, which triggers RunWithInterrupts to shut down the entire UCP process.

  3. RunWithInterrupts swallows the error — the service error is logged but discarded; host.Run() returns nil after cancellation, so the process exits with code 0. Kubernetes sees a clean exit that keeps happening → CrashLoopBackOff.

Changes

  • DatabaseProvider.initialize() — transient factory errors are no longer cached in p.result. Only successful initialization and permanent config errors (unsupported provider) are cached. Callers can retry on next call.

  • Initializer Run() — wraps manifest registration in exponential backoff retry (up to 2 min) via sethvargo/go-retry, already a project dependency. Core logic extracted to registerManifests().

  • RunWithInterrupts() — captures the LifecycleMessage.Err from the failed service and returns it when host.Run() returns nil. Adds defer cancel() and distinguishes service failure from normal shutdown in logs.

  • Tests: Test_GetClient_CachedError replaced with Test_GetClient_RetryAfterError (verifies retry-then-succeed). Two new initializer tests cover transient error retry and context cancellation during retry.

Type of change

  • This pull request fixes a bug in Radius and has an approved issue (issue link required).

Contributor checklist

Please verify that the PR meets the following requirements, where applicable:

  • An overview of proposed schema changes is included in a linked GitHub issue.
    • Yes
    • Not applicable
  • A design document PR is created in the design-notes repository, if new APIs are being introduced.
    • Yes
    • Not applicable
  • The design document has been reviewed and approved by Radius maintainers/approvers.
    • Yes
    • Not applicable
  • A PR for the samples repository is created, if existing samples are affected by the changes in this PR.
    • Yes
    • Not applicable
  • A PR for the documentation repository is created, if the changes in this PR affect the documentation or any user facing updates are made.
    • Yes
    • Not applicable
  • A PR for the recipes repository is created, if existing recipes are affected by the changes in this PR.
    • Yes
    • Not applicable

Copilot AI linked an issue Apr 14, 2026 that may be closed by this pull request
Copilot AI requested a deployment to external-contributor-approval April 14, 2026 17:05 Waiting
Copilot AI and others added 2 commits April 14, 2026 17:07
…andling

Three root causes addressed:
1. DatabaseProvider permanently cached initialization errors, preventing retry
   when the database (API server) was not yet ready during startup.
2. Initializer service had no retry logic, so any transient failure during
   manifest registration would immediately fail the service.
3. RunWithInterrupts discarded the actual error from failed services,
   causing the process to exit with code 0 even on failure.

Changes:
- DatabaseProvider: Don't cache transient initialization errors. Only cache
  successful results and permanent configuration errors (unsupported provider).
- Initializer: Add retry with exponential backoff (up to 2 minutes) around
  the manifest registration operation.
- RunWithInterrupts: Properly capture and propagate service errors. Add
  defer cancel() for cleanup. Distinguish between service errors and
  normal shutdown.
- Tests: Updated and added tests for retry behavior.

Agent-Logs-Url: https://github.com/radius-project/radius/sessions/f457aee2-c5dd-45aa-a110-0804dfe09a28

Co-authored-by: nicolejms <101607760+nicolejms@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix UCP pod intermittently going into CrashLoopBackOff state" to "Fix UCP pod CrashLoopBackOff caused by startup race condition" Apr 14, 2026
Copilot AI requested a review from nicolejms April 14, 2026 17:09
Copilot AI requested a deployment to external-contributor-approval April 14, 2026 17:18 Waiting
@github-actions

Unit Tests

    2 files  ±0    415 suites  ±0   6m 29s ⏱️ -17s
4 874 tests +2  4 872 ✅ +2  2 💤 ±0  0 ❌ ±0 
5 776 runs  +2  5 774 ✅ +2  2 💤 ±0  0 ❌ ±0 

Results for commit de07398. ± Comparison against base commit 6d326a9.

This pull request removes 1 test and adds 3 tests. Note that renamed tests count towards both.
Removed:
github.com/radius-project/radius/pkg/components/database/databaseprovider ‑ Test_GetClient_CachedError
Added:
github.com/radius-project/radius/pkg/components/database/databaseprovider ‑ Test_GetClient_RetryAfterError
github.com/radius-project/radius/pkg/ucp/initializer ‑ Test_Run/retries_on_transient_database_error
github.com/radius-project/radius/pkg/ucp/initializer ‑ Test_Run/returns_error_when_context_is_cancelled_during_retry

@codecov

codecov bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.38%. Comparing base (6d326a9) to head (de07398).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/components/hosting/run.go | 0.00% | 11 Missing ⚠️ |
| ...nents/database/databaseprovider/storageprovider.go | 33.33% | 3 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #11640      +/-   ##
==========================================
+ Coverage   51.37%   51.38%   +0.01%     
==========================================
  Files         699      699              
  Lines       44111    44132      +21     
==========================================
+ Hits        22661    22677      +16     
- Misses      19280    19287       +7     
+ Partials     2170     2168       -2     

Development

Successfully merging this pull request may close these issues.

UCP pod intermittently going into CrashLoopBackOff state

2 participants