Fix UCP pod CrashLoopBackOff caused by startup race condition by Copilot · Pull Request #11640 · radius-project/radius

Copilot · 2026-04-14T16:53:26Z

Description

The UCP pod intermittently enters CrashLoopBackOff when the Kubernetes API server isn't fully ready at startup. Three issues compound:

1. DatabaseProvider permanently caches initialization errors: if the first GetClient() call fails (API server not ready), the error is cached forever. Every subsequent call from every service returns the same cached error, and no retry is possible.
2. The Initializer service has no retry logic: a single transient failure during manifest registration kills the service, which triggers RunWithInterrupts to shut down the entire UCP process.
3. RunWithInterrupts swallows the error: the service error is logged but discarded, and host.Run() returns nil after cancellation, so the process exits with code 0. Kubernetes sees a clean exit that keeps happening, which leads to CrashLoopBackOff.
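
The first issue can be reproduced in isolation. The sketch below is not the Radius code: `cachingProvider`, `client`, and `flakyFactory` are hypothetical names, and the use of `sync.Once` is an assumption about how an error could end up cached forever. It only shows why a provider that remembers its first failure never recovers, even after the backing service becomes ready.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// client stands in for a real database client (hypothetical type).
type client struct{ name string }

// cachingProvider is a hypothetical provider that remembers the outcome of
// its first initialization attempt, including a failure.
type cachingProvider struct {
	factory func() (*client, error)

	once   sync.Once
	result *client
	err    error
}

// GetClient runs the factory exactly once and returns the remembered outcome
// on every later call, even if that outcome was a transient error.
func (p *cachingProvider) GetClient() (*client, error) {
	p.once.Do(func() {
		p.result, p.err = p.factory()
	})
	return p.result, p.err
}

// flakyFactory returns a factory that fails the first `failures` times and
// succeeds afterwards, imitating an API server that becomes ready shortly
// after the process starts.
func flakyFactory(failures int) func() (*client, error) {
	calls := 0
	return func() (*client, error) {
		calls++
		if calls <= failures {
			return nil, errors.New("api server not ready")
		}
		return &client{name: "db"}, nil
	}
}

func main() {
	p := &cachingProvider{factory: flakyFactory(1)}
	for i := 1; i <= 3; i++ {
		_, err := p.GetClient()
		fmt.Printf("call %d: err=%v\n", i, err)
	}
}
```

Although the factory would succeed on its second invocation, it is never invoked again, so every caller keeps receiving the startup error.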

Changes

- DatabaseProvider.initialize(): transient factory errors are no longer cached in p.result. Only successful initialization and permanent config errors (unsupported provider) are cached, so callers can retry on the next call.
- Initializer Run(): wraps manifest registration in an exponential backoff retry (up to 2 min) via sethvargo/go-retry, which is already a project dependency. The core logic is extracted to registerManifests().
- RunWithInterrupts(): captures the LifecycleMessage.Err from the failed service and returns it when host.Run() returns nil. Adds defer cancel() and distinguishes service failure from normal shutdown in logs.
- Tests: Test_GetClient_CachedError is replaced with Test_GetClient_RetryAfterError (verifies retry-then-succeed). Two new initializer tests cover transient error retry and context cancellation during retry.
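
The first two changes can be sketched together. This is an illustration under stated assumptions, not the code in this PR: `retryingProvider`, `retryWithBackoff`, and `demo` are hypothetical names, the backoff is a hand-written loop using only the standard library rather than sethvargo/go-retry, and the durations are scaled down from the 2-minute limit so the example finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// client stands in for a real database client (hypothetical type).
type client struct{ name string }

// retryingProvider caches only a successful result, so a failed attempt can
// be retried by the next caller.
type retryingProvider struct {
	factory func() (*client, error)

	mu     sync.Mutex
	result *client
}

func (p *retryingProvider) GetClient() (*client, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.result != nil {
		return p.result, nil
	}
	c, err := p.factory()
	if err != nil {
		return nil, err // not stored: the next call tries again
	}
	p.result = c
	return c, nil
}

// retryWithBackoff calls op until it succeeds, the context ends, or the next
// wait would pass maxElapsed. The delay doubles after each failure.
func retryWithBackoff(ctx context.Context, base, maxElapsed time.Duration, op func() error) error {
	deadline := time.Now().Add(maxElapsed)
	delay := base
	for {
		err := op()
		if err == nil {
			return nil
		}
		if time.Now().Add(delay).After(deadline) {
			return fmt.Errorf("giving up after %s: %w", maxElapsed, err)
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("%w (last error: %v)", ctx.Err(), err)
		case <-time.After(delay):
		}
		delay *= 2
	}
}

// demo wires the two pieces together with a factory that fails twice before
// succeeding, and reports how many factory calls were needed.
func demo() (attempts int, err error) {
	p := &retryingProvider{factory: func() (*client, error) {
		attempts++
		if attempts <= 2 {
			return nil, errors.New("api server not ready")
		}
		return &client{name: "db"}, nil
	}}
	err = retryWithBackoff(context.Background(), time.Millisecond, time.Second, func() error {
		_, getErr := p.GetClient()
		return getErr
	})
	return attempts, err
}

func main() {
	attempts, err := demo()
	fmt.Printf("attempts=%d err=%v\n", attempts, err)
}
```

Wrapping the context error with `%w` keeps `errors.Is(err, context.Canceled)` usable by the caller, which is one way a host could tell a cancelled retry apart from an exhausted one.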

Type of change

This pull request fixes a bug in Radius and has an approved issue (issue link required).

Contributor checklist

Please verify that the PR meets the following requirements, where applicable:

An overview of proposed schema changes is included in a linked GitHub issue.
- Yes
- Not applicable
A design document PR is created in the design-notes repository, if new APIs are being introduced.
- Yes
- Not applicable
The design document has been reviewed and approved by Radius maintainers/approvers.
- Yes
- Not applicable
A PR for the samples repository is created, if existing samples are affected by the changes in this PR.
- Yes
- Not applicable
A PR for the documentation repository is created, if the changes in this PR affect the documentation or any user facing updates are made.
- Yes
- Not applicable
A PR for the recipes repository is created, if existing recipes are affected by the changes in this PR.
- Yes
- Not applicable

…andling

Three root causes addressed:
1. DatabaseProvider permanently cached initialization errors, preventing retry when the database (API server) was not yet ready during startup.
2. Initializer service had no retry logic, so any transient failure during manifest registration would immediately fail the service.
3. RunWithInterrupts discarded the actual error from failed services, causing the process to exit with code 0 even on failure.

Changes:
- DatabaseProvider: Don't cache transient initialization errors. Only cache successful results and permanent configuration errors (unsupported provider).
- Initializer: Add retry with exponential backoff (up to 2 minutes) around the manifest registration operation.
- RunWithInterrupts: Properly capture and propagate service errors. Add defer cancel() for cleanup. Distinguish between service errors and normal shutdown.
- Tests: Updated and added tests for retry behavior.

Agent-Logs-Url: https://github.com/radius-project/radius/sessions/f457aee2-c5dd-45aa-a110-0804dfe09a28
Co-authored-by: nicolejms <101607760+nicolejms@users.noreply.github.com>
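
The error propagation described for RunWithInterrupts can be illustrated with a small host loop. Everything here is an assumption made for the sketch: `lifecycleMessage`, `service`, `runAll`, and `demo` are hypothetical names, signal handling is left out, and the real hosting package may be structured quite differently. The point is only that the first service error is kept and returned instead of being logged and dropped.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// lifecycleMessage is a hypothetical stand-in for the message a host emits
// when a service stops.
type lifecycleMessage struct {
	Name string
	Err  error
}

// service is a hypothetical unit of work managed by the host.
type service struct {
	name string
	run  func(ctx context.Context) error
}

// runAll starts every service, cancels the rest when one fails, waits for
// all of them to stop, and returns the first real service error.
func runAll(parent context.Context, services []service) error {
	ctx, cancel := context.WithCancel(parent)
	defer cancel() // release the context on every return path

	messages := make(chan lifecycleMessage, len(services))
	for _, s := range services {
		go func(s service) {
			messages <- lifecycleMessage{Name: s.name, Err: s.run(ctx)}
		}(s)
	}

	var serviceErr error
	for range services {
		msg := <-messages
		// A context.Canceled result is a normal shutdown, not a failure.
		if msg.Err != nil && !errors.Is(msg.Err, context.Canceled) && serviceErr == nil {
			serviceErr = fmt.Errorf("service %q failed: %w", msg.Name, msg.Err)
			cancel() // ask the other services to stop
		}
	}
	return serviceErr
}

// demo runs one service that fails immediately and one that runs until it
// is cancelled.
func demo() error {
	return runAll(context.Background(), []service{
		{name: "initializer", run: func(ctx context.Context) error {
			return errors.New("manifest registration failed")
		}},
		{name: "api", run: func(ctx context.Context) error {
			<-ctx.Done()
			return ctx.Err()
		}},
	})
}

func main() {
	code := 0
	if err := demo(); err != nil {
		fmt.Println("error:", err)
		code = 1
	}
	// A real main would call os.Exit(code) here.
	fmt.Println("exit code would be", code)
}
```

With the error returned to `main`, the process can exit non-zero, so the failure is visible in the container's termination status rather than looking like a clean exit.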

github-actions · 2026-04-14T17:39:09Z

Unit Tests

2 files ±0	415 suites ±0	6m 29s ⏱️ -17s
4 874 tests +2	4 872 ✅ +2	2 💤 ±0	0 ❌ ±0
5 776 runs +2	5 774 ✅ +2	2 💤 ±0	0 ❌ ±0

Results for commit de07398. ± Comparison against base commit 6d326a9.

This pull request removes 1 and adds 3 tests. Note that renamed tests count towards both.

Removed:
github.com/radius-project/radius/pkg/components/database/databaseprovider ‑ Test_GetClient_CachedError

Added:
github.com/radius-project/radius/pkg/components/database/databaseprovider ‑ Test_GetClient_RetryAfterError
github.com/radius-project/radius/pkg/ucp/initializer ‑ Test_Run/retries_on_transient_database_error
github.com/radius-project/radius/pkg/ucp/initializer ‑ Test_Run/returns_error_when_context_is_cancelled_during_retry

codecov · 2026-04-14T17:39:12Z

Codecov Report

❌ Patch coverage is 50.00000% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.38%. Comparing base (6d326a9) to head (de07398).

Files with missing lines	Patch %	Lines
pkg/components/hosting/run.go	0.00%	11 Missing ⚠️
...nents/database/databaseprovider/storageprovider.go	33.33%	3 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #11640      +/-   ##
==========================================
+ Coverage   51.37%   51.38%   +0.01%     
==========================================
  Files         699      699              
  Lines       44111    44132      +21     
==========================================
+ Hits        22661    22677      +16     
- Misses      19280    19287       +7     
+ Partials     2170     2168       -2

Initial plan

280cebc

Copilot AI assigned Copilot and nicolejms Apr 14, 2026

Copilot started work on behalf of nicolejms April 14, 2026 16:53 View session

Copilot AI linked an issue Apr 14, 2026 that may be closed by this pull request

UCP pod intermittently going into CrashLoopBackOff state #11017

Open

Copilot AI requested a deployment to external-contributor-approval April 14, 2026 17:05 Waiting

Copilot AI and others added 2 commits April 14, 2026 17:07

Address code review: rename variable for Go idiomatic naming

de07398

Agent-Logs-Url: https://github.com/radius-project/radius/sessions/f457aee2-c5dd-45aa-a110-0804dfe09a28
Co-authored-by: nicolejms <101607760+nicolejms@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Fix UCP pod intermittently going into CrashLoopBackOff state~~ Fix UCP pod CrashLoopBackOff caused by startup race condition Apr 14, 2026

Copilot finished work on behalf of nicolejms April 14, 2026 17:09

Copilot AI requested a review from nicolejms April 14, 2026 17:09

nicolejms added the pr:standard label Apr 14, 2026

Copilot AI requested a deployment to external-contributor-approval April 14, 2026 17:18 Waiting

Copilot wants to merge 3 commits into main from copilot/fix-ucp-pod-crashloopbackoff

2 participants
