fix: integrate service resources into database reconciliation cycle #269
rshoemaker wants to merge 5 commits into main
Conversation
Service instance resources now participate in the normal EndState → MarkPendingDeletion → PlanAll → applyPlans cycle alongside database resources, instead of being provisioned by a separate ProvisionServices sub-workflow that deleted and recreated all services on every database update. Service provisioning failures are no longer tolerated; they now trigger a "failed" state for the entire database.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Services were modeled as ServiceResources and merged into EndState/UpdateDatabase; the orchestrator's service-user role now manages credentials and Postgres routing; many service-provisioning workflows and activities were removed; e2e tests switched to client-driven, task-oriented API flows.
🚥 Pre-merge checks: 3 passed
Actionable comments posted: 2
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/internal/orchestrator/swarm/service_user_role.go (1)
145-248: ⚠️ Potential issue | 🟠 Major

Don’t swallow GetDatabase failures during delete.

Delete currently treats any connectToPrimary error as non-fatal, which hides etcd lookup failures that should be retried. Only connection/SQL errors should be swallowed; lookup failures should propagate. Based on learnings: the Delete method distinguishes between etcd errors (GetDatabase) and Postgres connection errors; GetDatabase errors should be propagated, while connection errors should be swallowed with logging during deletion.

Suggested fix (propagate lookup errors):

```diff
+type primaryLookupError struct{ err error }
+
+func (e primaryLookupError) Error() string { return e.err.Error() }
+func (e primaryLookupError) Unwrap() error { return e.err }
+
 func (r *ServiceUserRole) Delete(ctx context.Context, rc *resource.Context) error {
 	...
 	conn, err := r.connectToPrimary(ctx, rc, logger)
 	if err != nil {
+		var lookupErr primaryLookupError
+		if errors.As(err, &lookupErr) {
+			return err
+		}
 		logger.Warn().Err(err).Msg("failed to connect to primary instance, skipping user deletion")
 		return nil
 	}
 	...
 }

 func (r *ServiceUserRole) connectToPrimary(...) (*pgx.Conn, error) {
 	...
 	db, err := dbSvc.GetDatabase(ctx, r.DatabaseID)
 	if err != nil {
-		if errors.Is(err, database.ErrDatabaseNotFound) {
-			return nil, fmt.Errorf("database not found: %w", err)
-		}
-		return nil, fmt.Errorf("failed to get database: %w", err)
+		return nil, primaryLookupError{err}
 	}
 	...
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/internal/orchestrator/swarm/service_user_role.go` around lines 145 - 248, The Delete method currently swallows all errors from connectToPrimary; create a clear distinction by adding a new exported sentinel error (e.g. ErrLookupFailure) and have connectToPrimary wrap/return lookup-related failures (errors from dbSvc.GetDatabase, orch.GetInstanceConnectionInfo, certSvc.PostgresUserTLS, etc.) using errors.Wrap/ fmt.Errorf("%w: ...", ErrLookupFailure) while leaving genuine connection/SQL errors unwrapped, then update Delete to check errors.Is(err, ErrLookupFailure) and return that error (propagate) but continue to log-and-ignore non-lookup connection/SQL errors as before; reference functions/types: ServiceUserRole.Delete, ServiceUserRole.connectToPrimary, dbSvc.GetDatabase (database.ErrDatabaseNotFound), orch.GetInstanceConnectionInfo, and certSvc.PostgresUserTLS.
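For reference, a minimal runnable sketch of the sentinel-error pattern this prompt describes; ErrLookupFailure and the helper functions are illustrative stand-ins, not the repo's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrLookupFailure marks errors that come from state lookups (e.g. etcd)
// rather than from connecting to Postgres; callers propagate these.
var ErrLookupFailure = errors.New("lookup failure")

// getDatabase stands in for dbSvc.GetDatabase.
func getDatabase() error { return errors.New("etcd: key not found") }

func connectToPrimary() error {
	if err := getDatabase(); err != nil {
		// Wrap lookup errors with the sentinel so errors.Is can detect them.
		return fmt.Errorf("%w: get database: %v", ErrLookupFailure, err)
	}
	// Genuine connection errors would be returned unwrapped here.
	return nil
}

func deleteUser() error {
	if err := connectToPrimary(); err != nil {
		if errors.Is(err, ErrLookupFailure) {
			return err // propagate lookup failures so they can be retried
		}
		fmt.Println("warn: skipping user deletion:", err) // swallow connection errors
		return nil
	}
	// ...drop the Postgres role here...
	return nil
}

func main() {
	fmt.Println(deleteUser()) // lookup failure: get database: etcd: key not found
}
```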
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/internal/workflows/plan_update.go`:
- Around line 97-98: The call to host.NewPgEdgeVersion currently ignores its
error (pgEdgeVersion, _ := host.NewPgEdgeVersion(...)), which can leave
pgEdgeVersion nil and bypass compatibility checks; update the code to capture
and handle the error (e.g., pgEdgeVersion, err := host.NewPgEdgeVersion(...)),
and if err != nil return or propagate the error (or log and fail the plan
update) so compatibility validation runs only with a valid PgEdgeVersion
instance; reference NewPgEdgeVersion and the pgEdgeVersion variable to locate
the fix.
- Around line 139-165: findPostgresInstance currently returns an empty host
string when there are no Postgres instances which silently yields an invalid
DatabaseHost; change its signature to return (string, int, error), keep the
instancePort helper and logic, and when no instance (fallback==nil) return a
descriptive error (e.g. "no postgres instances found") instead of "", so callers
of findPostgresInstance (the call sites using the returned host/port) are
updated to handle and propagate the error during planning.
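Taken together, a sketch of both suggested fixes; the types and the newPgEdgeVersion stub are hypothetical stand-ins for the real code in plan_update.go:

```go
package main

import (
	"errors"
	"fmt"
)

type instance struct {
	Host string
	Port int
}

// newPgEdgeVersion stands in for host.NewPgEdgeVersion.
func newPgEdgeVersion(v string) (string, error) {
	if v == "" {
		return "", errors.New("empty version")
	}
	return v, nil
}

// findPostgresInstance returns (host, port, error) instead of an empty host,
// so callers fail fast during planning when no Postgres instance exists.
func findPostgresInstance(instances []instance) (string, int, error) {
	for _, inst := range instances {
		if inst.Host != "" {
			return inst.Host, inst.Port, nil
		}
	}
	return "", 0, errors.New("no postgres instances found")
}

func planUpdate(version string, instances []instance) error {
	// Capture the error rather than discarding it with `_`, so compatibility
	// checks never run against a nil version.
	pgEdgeVersion, err := newPgEdgeVersion(version)
	if err != nil {
		return fmt.Errorf("invalid pgEdge version %q: %w", version, err)
	}
	_ = pgEdgeVersion // compatibility validation would use this

	host, port, err := findPostgresInstance(instances)
	if err != nil {
		return fmt.Errorf("planning failed: %w", err)
	}
	fmt.Printf("DatabaseHost=%s:%d\n", host, port)
	return nil
}

func main() {
	fmt.Println(planUpdate("17.2", []instance{{Host: "10.0.0.5", Port: 5432}}))
}
```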
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/internal/workflows/plan_update.go`:
- Around line 59-75: The code currently only sets postgresHostID from
nodeInstances[0].Instances[0], which can be empty if the first node has no
instances; update the logic before building serviceResources to scan all
nodeInstances and set postgresHostID to the HostID of the first available
instance (e.g., first node where len(Instances)>0 then Instances[0].HostID); if
no instance is found, return an error (fail fast) so ServiceUserRole.Create
routing won’t get an empty postgresHostID; touch the postgresHostID
initialization used by getServiceResources and ensure callers
(serviceSpec.HostIDs loop) rely on this validated value.
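A sketch of the suggested scan, with a stand-in type for the workflow's node-instance structs:

```go
package main

import (
	"errors"
	"fmt"
)

// nodeInstanceGroup is a hypothetical stand-in for the workflow struct.
type nodeInstanceGroup struct {
	Instances []struct{ HostID string }
}

// firstPostgresHostID scans every node rather than only nodeInstances[0],
// and fails fast so ServiceUserRole.Create never routes to an empty host.
func firstPostgresHostID(nodeInstances []nodeInstanceGroup) (string, error) {
	for _, n := range nodeInstances {
		if len(n.Instances) > 0 {
			return n.Instances[0].HostID, nil
		}
	}
	return "", errors.New("no instances available for postgres host routing")
}

func main() {
	groups := []nodeInstanceGroup{
		{}, // first node has no instances
		{Instances: []struct{ HostID string }{{HostID: "host-2"}}},
	}
	fmt.Println(firstPostgresHostID(groups)) // host-2 <nil>
}
```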
Actionable comments posted: 1
🧹 Nitpick comments (1)
e2e/service_provisioning_test.go (1)
478-497: Consider extracting the cleanup helper to reduce duplication.

This cleanup closure is identical to the one in TestProvisionMCPServiceUnsupportedVersion (lines 374-393). Extracting a helper function would reduce duplication and make maintenance easier.

Example helper:

```go
// In a helper file or at package level:
func cleanupDatabase(t *testing.T, client *client.Client, dbID string) {
	t.Helper()
	cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cleanupCancel()
	t.Logf("cleaning up database %s", dbID)
	resp, err := client.DeleteDatabase(cleanupCtx, &controlplane.DeleteDatabasePayload{
		DatabaseID: dbID,
		Force:      true,
	})
	if err != nil {
		if !errors.Is(err, client.ErrNotFound) {
			t.Logf("failed to cleanup database %s: %s", dbID, err)
		}
		return
	}
	_, _ = client.WaitForDatabaseTask(cleanupCtx, &controlplane.GetDatabaseTaskPayload{
		DatabaseID: dbID,
		TaskID:     resp.Task.TaskID,
	})
}

// Usage:
t.Cleanup(func() { cleanupDatabase(t, fixture.Client, dbID) })
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@e2e/service_provisioning_test.go` around lines 478 - 497, Duplicate cleanup closure used in t.Cleanup across tests should be extracted into a shared helper; create a package-level function like cleanupDatabase(t *testing.T, client *client.Client, dbID string) that calls context.WithTimeout(...), logs, calls client.DeleteDatabase(...) and client.WaitForDatabaseTask(...), marks t.Helper(), and then replace the inline t.Cleanup closures in tests (e.g., in TestProvisionMCPServiceUnsupportedVersion and the test at lines shown) with t.Cleanup(func(){ cleanupDatabase(t, fixture.Client, dbID) }) to remove duplication and centralize cleanup logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@e2e/service_provisioning_test.go`:
- Around line 668-669: The assertions after fixture creation assume immediate
"running" state and can flake; change the test to wait for the service to
transition to "running" before asserting. Locate the lines using require.Len and
require.Equal and replace the immediate state check with a short poll loop (or
reuse the existing wait helper used by TestProvisionMCPService) that inspects
db.ServiceInstances[0].State until it equals "running" or a timeout (e.g., few
seconds) is reached, then assert Len==1 and State=="running".
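A sketch of the suggested wait using testify's require.Eventually; fetchDatabase is a hypothetical helper standing in for however the test reloads database state:

```go
func TestServiceEventuallyRunning(t *testing.T) {
	// ...fixture and dbID setup elided...
	// Poll instead of asserting immediately after fixture creation.
	require.Eventually(t, func() bool {
		db := fetchDatabase(t, fixture.Client, dbID)
		return len(db.ServiceInstances) == 1 && db.ServiceInstances[0].State == "running"
	}, 10*time.Second, 250*time.Millisecond, "service never reached running state")

	// The strict assertions only run once the service has settled.
	db := fetchDatabase(t, fixture.Client, dbID)
	require.Len(t, db.ServiceInstances, 1)
	require.Equal(t, "running", db.ServiceInstances[0].State)
}
```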
Integrate service resources into the database reconciliation cycle
Summary
Problem
ProvisionServices ran as a separate sub-workflow after applyPlans() in UpdateDatabase. Because EndState() only included database node resources, the reconciliation cycle saw existing service resources in the start state but not in the end state, marking them for deletion. The separate workflow then recreated everything from scratch, causing unnecessary service downtime on every database update, even when services hadn't changed.
Approach
Service resources now participate in the same EndState → MarkPendingDeletion → PlanAll → applyPlans cycle as database resources:
UpdateDatabase workflow
→ RefreshCurrentState (load state from etcd, refresh all resources)
→ PlanUpdate:
→ getNodeResources() → calls GenerateInstanceResources on each host
→ getServiceResources() → calls GenerateServiceInstanceResources on each host
→ operations.UpdateDatabase(current, nodeResources, serviceResources):
→ EndState(nodes, services) → desired state (database + service resources)
→ start.MarkPendingDeletion(end)
→ start.PlanAll(states...) → []Plan
→ applyPlans (execute plans, persist state)
Service resource errors during planning are fatal — if getServiceResources() fails (e.g., unsupported version), the entire workflow fails and the database transitions to "failed" state. This matches how node resource errors are handled.
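A condensed, runnable sketch of the cycle above; the types and signatures are assumptions based on this description, not the actual API in server/internal/database/operations:

```go
package main

import "fmt"

type Resource struct{ ID string }

type Plan struct{ Action, Target string }

type State struct{ Resources map[string]bool }

// EndState builds the desired state from node AND service resources, so the
// reconciler no longer sees services only in the start state.
func EndState(groups ...[]*Resource) *State {
	end := &State{Resources: map[string]bool{}}
	for _, g := range groups {
		for _, r := range g {
			end.Resources[r.ID] = true
		}
	}
	return end
}

// PlanAll diffs start against end: resources missing from end are deleted,
// resources missing from start are created.
func (s *State) PlanAll(end *State) []Plan {
	var plans []Plan
	for id := range s.Resources {
		if !end.Resources[id] {
			plans = append(plans, Plan{"delete", id})
		}
	}
	for id := range end.Resources {
		if !s.Resources[id] {
			plans = append(plans, Plan{"create", id})
		}
	}
	return plans
}

func main() {
	start := &State{Resources: map[string]bool{"node-1": true, "svc-mcp": true}}
	nodes := []*Resource{{ID: "node-1"}}
	services := []*Resource{{ID: "svc-mcp"}}
	// Because services now appear in the end state, no spurious
	// delete-and-recreate plan is produced for svc-mcp.
	fmt.Println(start.PlanAll(EndState(nodes, services))) // []
}
```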
Key changes
Resource lifecycle — ServiceUserRole.Create() now generates credentials and creates the Postgres role directly (previously a no-op with a separate CreateServiceUser activity). ServiceInstanceSpecResource reads credentials from the sibling ServiceUserRole via resource Context (see the sketch after these notes). ServiceInstanceResource.Create() stores the initial etcd record (previously done by a separate activity).
Operations — EndState() accepts []*ServiceResources and merges them into the desired state alongside node resources. UpdateDatabase() passes service resources through.
Deleted — ProvisionServices workflow, CreateServiceUser activity, StoreServiceInstance activity, UpdateServiceInstanceState activity, GetServiceInstanceStatus activity.
E2E tests — Updated TestProvisionMCPServiceUnsupportedVersion and TestProvisionMCPServiceRecovery to expect state: "failed" when the service version is invalid (previously they expected the database to reach "available" despite the service failure).
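A minimal sketch of the sibling-credential handoff described under Resource lifecycle, assuming a hypothetical key-value resource Context (the repo's actual Context API may differ):

```go
package main

import "fmt"

// Context is a hypothetical stand-in for the resource Context that sibling
// resources share while a plan executes.
type Context struct{ values map[string]string }

func (c *Context) Put(k, v string) { c.values[k] = v }

func (c *Context) Get(k string) (string, bool) {
	v, ok := c.values[k]
	return v, ok
}

type ServiceUserRole struct{ DatabaseID string }

// Create generates credentials and creates the Postgres role directly
// (no separate CreateServiceUser activity), then publishes the password
// for sibling resources to read.
func (r *ServiceUserRole) Create(rc *Context) error {
	password := "generated-password" // stand-in for real credential generation
	// ...CREATE ROLE in Postgres would happen here...
	rc.Put("service-user-password/"+r.DatabaseID, password)
	return nil
}

type ServiceInstanceSpecResource struct{ DatabaseID string }

// Create reads the credentials published by the sibling ServiceUserRole.
func (s *ServiceInstanceSpecResource) Create(rc *Context) error {
	pw, ok := rc.Get("service-user-password/" + s.DatabaseID)
	if !ok {
		return fmt.Errorf("sibling ServiceUserRole has not published credentials")
	}
	_ = pw // build the service instance spec with these credentials
	return nil
}

func main() {
	rc := &Context{values: map[string]string{}}
	role := &ServiceUserRole{DatabaseID: "db-1"}
	spec := &ServiceInstanceSpecResource{DatabaseID: "db-1"}
	fmt.Println(role.Create(rc), spec.Create(rc)) // <nil> <nil>
}
```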
Test plan
TODO: update golden tests to include services in server/internal/database/operations/
PLAT-412