Fix UCP pod CrashLoopBackOff caused by startup race condition#11640
Draft
Fix UCP pod CrashLoopBackOff caused by startup race condition#11640
Conversation
…andling Three root causes addressed: 1. DatabaseProvider permanently cached initialization errors, preventing retry when the database (API server) was not yet ready during startup. 2. Initializer service had no retry logic, so any transient failure during manifest registration would immediately fail the service. 3. RunWithInterrupts discarded the actual error from failed services, causing the process to exit with code 0 even on failure. Changes: - DatabaseProvider: Don't cache transient initialization errors. Only cache successful results and permanent configuration errors (unsupported provider). - Initializer: Add retry with exponential backoff (up to 2 minutes) around the manifest registration operation. - RunWithInterrupts: Properly capture and propagate service errors. Add defer cancel() for cleanup. Distinguish between service errors and normal shutdown. - Tests: Updated and added tests for retry behavior. Agent-Logs-Url: https://github.com/radius-project/radius/sessions/f457aee2-c5dd-45aa-a110-0804dfe09a28 Co-authored-by: nicolejms <101607760+nicolejms@users.noreply.github.com>
Agent-Logs-Url: https://github.com/radius-project/radius/sessions/f457aee2-c5dd-45aa-a110-0804dfe09a28 Co-authored-by: nicolejms <101607760+nicolejms@users.noreply.github.com>
Copilot
AI
changed the title
[WIP] Fix UCP pod intermittently going into CrashLoopBackOff state
Fix UCP pod CrashLoopBackOff caused by startup race condition
Apr 14, 2026
Unit Tests 2 files ±0 415 suites ±0 6m 29s ⏱️ -17s Results for commit de07398. ± Comparison against base commit 6d326a9. This pull request removes 1 and adds 3 tests. Note that renamed tests count towards both. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #11640 +/- ##
==========================================
+ Coverage 51.37% 51.38% +0.01%
==========================================
Files 699 699
Lines 44111 44132 +21
==========================================
+ Hits 22661 22677 +16
- Misses 19280 19287 +7
+ Partials 2170 2168 -2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
UCP pod intermittently enters
CrashLoopBackOffwhen the Kubernetes API server isn't fully ready at startup. Three compounding issues:DatabaseProviderpermanently caches initialization errors — if the firstGetClient()call fails (API server not ready), the error is cached forever. All subsequent calls from every service return the same cached error with no retry possible.Initializer service has no retry logic — a single transient failure during manifest registration kills the service, which triggers
RunWithInterruptsto shut down the entire UCP process.RunWithInterruptsswallows the error — the service error is logged but discarded;host.Run()returnsnilafter cancellation, so the process exits with code 0. Kubernetes sees a clean exit that keeps happening →CrashLoopBackOff.Changes
DatabaseProvider.initialize()— transient factory errors are no longer cached inp.result. Only successful initialization and permanent config errors (unsupported provider) are cached. Callers can retry on next call.Initializer
Run()— wraps manifest registration in exponential backoff retry (up to 2 min) viasethvargo/go-retry, already a project dependency. Core logic extracted toregisterManifests().RunWithInterrupts()— captures theLifecycleMessage.Errfrom the failed service and returns it whenhost.Run()returns nil. Addsdefer cancel()and distinguishes service failure from normal shutdown in logs.Tests —
Test_GetClient_CachedErrorreplaced withTest_GetClient_RetryAfterError(verifies retry-then-succeed). Two new initializer tests cover transient error retry and context cancellation during retry.Type of change
Contributor checklist
Please verify that the PR meets the following requirements, where applicable: