fix: integrate service resources into database reconciliation cycle #269

Open · rshoemaker wants to merge 5 commits into main from fix/PLAT-412/service_reconciliation
Conversation

rshoemaker (Contributor) commented Feb 18, 2026

Integrate service resources into the database reconciliation cycle

Summary

  • Fold service instance resources (Network, ServiceUserRole, ServiceInstanceSpec, ServiceInstance, ServiceInstanceMonitor) into the normal database Plan/Apply cycle so they get declarative reconciliation: no-op when unchanged, create when added, delete when removed
  • Remove the separate ProvisionServices sub-workflow, which was destroying and recreating all service resources on every database update
  • Move credential creation from a workflow activity into ServiceUserRole.Create(), following the same pattern as database instance resources

Problem

ProvisionServices ran as a separate sub-workflow after applyPlans() in UpdateDatabase. Because EndState() only included database node resources, the reconciliation cycle saw existing service resources in the start state but
not in the end state, marking them for deletion. The separate workflow then recreated everything from scratch, causing unnecessary service downtime on every database update — even when services hadn't changed.

Approach

Service resources now participate in the same EndState → MarkPendingDeletion → PlanAll → applyPlans cycle as database resources:

UpdateDatabase workflow
  → RefreshCurrentState (load state from etcd, refresh all resources)
  → PlanUpdate:
      → getNodeResources() → calls GenerateInstanceResources on each host
      → getServiceResources() → calls GenerateServiceInstanceResources on each host
      → operations.UpdateDatabase(current, nodeResources, serviceResources):
          → EndState(nodes, services) → desired state (database + service resources)
          → start.MarkPendingDeletion(end)
          → start.PlanAll(states...) → []Plan
          → applyPlans (execute plans, persist state)
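The cycle above can be sketched with a toy model of resource state. The types and maps below are illustrative stand-ins, not the actual server/internal/database/operations types:

```go
package main

import "fmt"

// state maps a resource name to a stand-in for its desired spec hash.
type state map[string]string

type plan struct {
	Action string // "create", "update", or "delete"
	Name   string
}

// endState merges node and service resources into one desired state,
// mirroring the EndState(nodes, services) change described in the PR.
func endState(nodes, services state) state {
	desired := state{}
	for k, v := range nodes {
		desired[k] = v
	}
	for k, v := range services {
		desired[k] = v
	}
	return desired
}

// planAll diffs current against desired: resources missing from desired are
// deleted (MarkPendingDeletion), new ones are created, changed ones updated,
// and unchanged ones produce no plan at all.
func planAll(current, desired state) []plan {
	var plans []plan
	for name := range current {
		if _, ok := desired[name]; !ok {
			plans = append(plans, plan{"delete", name})
		}
	}
	for name, spec := range desired {
		cur, ok := current[name]
		switch {
		case !ok:
			plans = append(plans, plan{"create", name})
		case cur != spec:
			plans = append(plans, plan{"update", name})
		}
	}
	return plans
}

func main() {
	current := state{"pg-node-1": "v1", "svc-mcp": "v1"}
	// Before the fix: services were absent from the desired state, so the
	// cycle planned their deletion even when nothing had changed.
	fmt.Println(planAll(current, endState(state{"pg-node-1": "v1"}, nil)))
	// After the fix: services are merged in, so an unchanged service is a no-op.
	fmt.Println(planAll(current, endState(state{"pg-node-1": "v1"}, state{"svc-mcp": "v1"})))
}
```

The behavioral change is visible in main(): once services participate in the desired state, an unchanged service yields no plan instead of a delete followed by a recreate.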

Service resource errors during planning are fatal — if getServiceResources() fails (e.g., unsupported version), the entire workflow fails and the database transitions to "failed" state. This matches how node resource errors are handled.

Key changes

Resource lifecycle — ServiceUserRole.Create() now generates credentials and creates the Postgres role directly (previously a no-op with a separate CreateServiceUser activity). ServiceInstanceSpecResource reads credentials
from the sibling ServiceUserRole via resource Context. ServiceInstanceResource.Create() stores the initial etcd record (previously done by a separate activity).
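As a rough illustration of what a ServiceUserRole.Create-style method has to produce, here is a stdlib-only sketch of credential generation and role DDL construction. The actual implementation connects to the Postgres primary through a driver and may generate passwords differently; every name here is hypothetical:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"strings"
)

// newPassword returns a random 32-character hex password; an illustrative
// stand-in for whatever credential generation the real code performs.
func newPassword() (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// quoteIdent double-quotes a Postgres identifier, escaping embedded quotes.
// DDL statements cannot use bind parameters for identifiers, so the role
// name must be quoted before interpolation.
func quoteIdent(s string) string {
	return `"` + strings.ReplaceAll(s, `"`, `""`) + `"`
}

// createRoleSQL builds the statement a Create method would execute on the
// Postgres primary; the password literal has its single quotes escaped.
func createRoleSQL(role, password string) string {
	pw := strings.ReplaceAll(password, "'", "''")
	return fmt.Sprintf("CREATE ROLE %s LOGIN PASSWORD '%s'", quoteIdent(role), pw)
}

func main() {
	pw, err := newPassword()
	if err != nil {
		panic(err)
	}
	fmt.Println(createRoleSQL("svc_mcp_user", pw))
}
```

Folding this into Create() keeps credential generation and role creation in one resource, which is what lets the sibling ServiceInstanceSpec read the credentials via the resource Context instead of a separate activity.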

Operations — EndState() accepts []*ServiceResources and merges them into the desired state alongside node resources. UpdateDatabase() passes service resources through.

Deleted — ProvisionServices workflow, CreateServiceUser activity, StoreServiceInstance activity, UpdateServiceInstanceState activity, GetServiceInstanceStatus activity.

E2E tests — Updated TestProvisionMCPServiceUnsupportedVersion and TestProvisionMCPServiceRecovery to expect state: "failed" when service version is invalid (previously expected the database to reach "available" despite
service failure).

Test plan

  • Unit tests pass: go test ./server/internal/...
  • E2E: TestProvisionMCPService — create database with service
  • E2E: TestProvisionMultiHostMCPService — service on multiple hosts
  • E2E: TestUpdateDatabaseAddService — add service to existing database
  • E2E: TestUpdateDatabaseRemoveService — remove service from database
  • E2E: TestProvisionMCPServiceUnsupportedVersion — bad version fails workflow
  • E2E: TestProvisionMCPServiceRecovery — recover from bad version via update
  • E2E: TestUpdateDatabaseServiceStable — assert no service delete/recreate on update

TODO: update golden tests to include services in server/internal/database/operations/
PLAT-412

Service instance resources now participate in the normal
EndState → MarkPendingDeletion → PlanAll → applyPlans cycle
alongside database resources, instead of being provisioned by
a separate ProvisionServices sub-workflow that deleted and
recreated all services on every database update. Service
provisioning failures are no longer tolerated; they now
trigger a "failed" state for the entire database.
@coderabbitai
Copy link

coderabbitai bot commented Feb 18, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Services were modeled as ServiceResources and merged into EndState/UpdateDatabase; orchestrator/service-user role now manage credentials and Postgres routing; many service-provisioning workflows and activities were removed; e2e tests switched to client-driven, task-oriented API flows.

Changes

  • Tests (e2e/service_provisioning_test.go): Converted tests to API-driven flows (CreateDatabase, UpdateDatabase, DeleteDatabase, GetDatabase, WaitForDatabaseTask); added task waits, explicit cleanup, and lifecycle assertions for failure/recovery and service stability.
  • Service resources & end-state (server/internal/database/operations/common.go, server/internal/database/operations/end.go): Added ServiceResources type with State() and extended EndState to accept and merge services []*ServiceResources.
  • Database operations API (server/internal/database/operations/update_database.go, .../update_database_test.go, .../restore_database.go, server/internal/workflows/delete_database.go): Expanded the UpdateDatabase signature to accept services []*ServiceResources; callers updated (often passing nil) and services threaded into EndState.
  • Service instance spec (server/internal/database/service_instance.go): Added PostgresHostID string to ServiceInstanceSpec for routing credentials to a specific Postgres host.
  • Orchestrator, swarm service instance (server/internal/orchestrator/swarm/orchestrator.go, .../service_instance.go, .../service_instance_spec.go): Populated ServiceID/HostID fields; added HostID to ServiceInstanceResource and DiffIgnore; created a deploy helper; Create writes initial etcd state; Refresh now depends on ServiceUserRole for credentials.
  • ServiceUserRole credentials (server/internal/orchestrator/swarm/service_user_role.go): Converted to an active credential manager: added PostgresHostID, ServiceID, Password; bumped ResourceVersion; Create/Delete connect to the Postgres primary to create/drop roles and persist credentials; the executor routes to the specified Postgres host.
  • Plan update / resource generation (server/internal/workflows/plan_update.go): Now generates per-service ServiceResources via getServiceResources and findPostgresInstance, resolves the Postgres host/port, attaches monitor resources, and supplies services to UpdateDatabase.
  • Workflow activities removed (server/internal/workflows/activities/activities.go, .../create_service_user.go, .../get_service_instance_status.go, .../store_service_instance.go, .../update_service_instance_state.go): Removed registrations and implementations for the CreateServiceUser, GetServiceInstanceStatus, StoreServiceInstance, and UpdateServiceInstanceState activities.
  • Provisioning workflow removed / update flow simplified (server/internal/workflows/provision_services.go, .../update_database.go, .../workflows.go): Deleted the ProvisionServices workflow and its registration; removed the post-database provisioning block from update_database.go, shifting provisioning responsibilities into resource generation and operations.


Pre-merge checks (3 passed)

  • Title check ✅ Passed: The title accurately summarizes the main change: integrating service resources into the database reconciliation cycle, which is the core objective of this PR.
  • Docstring coverage ✅ Passed: Docstring coverage is 85.71%, above the required 80.00% threshold.
  • Description check ✅ Passed: The PR description is comprehensive and well-structured, covering summary, problem, approach, key changes, and test plan with clear context.


coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/internal/orchestrator/swarm/service_user_role.go (1)

145-248: ⚠️ Potential issue | 🟠 Major

Don’t swallow GetDatabase failures during delete.

Delete currently treats any connectToPrimary error as non-fatal, which hides etcd lookup failures that should be retried. Only connection/SQL errors should be swallowed; lookup failures should propagate.

Suggested fix (propagate lookup errors)
+type primaryLookupError struct{ err error }
+func (e primaryLookupError) Error() string { return e.err.Error() }
+func (e primaryLookupError) Unwrap() error { return e.err }

 func (r *ServiceUserRole) Delete(ctx context.Context, rc *resource.Context) error {
   ...
   conn, err := r.connectToPrimary(ctx, rc, logger)
   if err != nil {
+    var lookupErr primaryLookupError
+    if errors.As(err, &lookupErr) {
+      return err
+    }
     logger.Warn().Err(err).Msg("failed to connect to primary instance, skipping user deletion")
     return nil
   }
   ...
 }

 func (r *ServiceUserRole) connectToPrimary(...) (*pgx.Conn, error) {
   ...
   db, err := dbSvc.GetDatabase(ctx, r.DatabaseID)
   if err != nil {
-    if errors.Is(err, database.ErrDatabaseNotFound) {
-      return nil, fmt.Errorf("database not found: %w", err)
-    }
-    return nil, fmt.Errorf("failed to get database: %w", err)
+    return nil, primaryLookupError{err}
   }
   ...
 }
Based on learnings: The Delete method distinguishes between etcd errors (GetDatabase) and Postgres connection errors; GetDatabase errors should be propagated, while connection errors should be swallowed with logging during deletion.
Inline review comments

In server/internal/workflows/plan_update.go:

  • Lines 97-98: The call to host.NewPgEdgeVersion ignores its error (pgEdgeVersion, _ := host.NewPgEdgeVersion(...)), which can leave pgEdgeVersion nil and bypass compatibility checks. Capture the error (pgEdgeVersion, err := host.NewPgEdgeVersion(...)) and return or propagate it (or log and fail the plan update) so compatibility validation only runs with a valid PgEdgeVersion.
  • Lines 139-165: findPostgresInstance returns an empty host string when there are no Postgres instances, which silently yields an invalid DatabaseHost. Change its signature to (string, int, error), keep the instancePort helper and logic, and return a descriptive error (e.g. "no postgres instances found") when no instance exists; update callers to handle and propagate the error during planning.
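The suggested findPostgresInstance signature change might look like the following sketch; the node and instance types here are invented stand-ins, not the real structs in plan_update.go:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative stand-ins for the real node/instance types.
type instance struct {
	Host string
	Port int
}

type node struct {
	Instances []instance
}

// findPostgresInstance returns the host and port of the first available
// Postgres instance, or an error when none exists, instead of silently
// returning an empty host string.
func findPostgresInstance(nodes []node) (string, int, error) {
	for _, n := range nodes {
		if len(n.Instances) > 0 {
			inst := n.Instances[0]
			return inst.Host, inst.Port, nil
		}
	}
	return "", 0, errors.New("no postgres instances found")
}

func main() {
	// With no instances, planning can now fail fast rather than
	// proceeding with an invalid DatabaseHost.
	_, _, err := findPostgresInstance([]node{{}})
	fmt.Println(err) // prints: no postgres instances found
}
```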


coderabbitai bot left a comment

Actionable comments posted: 1

Inline review comment

In server/internal/workflows/plan_update.go:

  • Lines 59-75: postgresHostID is set only from nodeInstances[0].Instances[0], which can be empty when the first node has no instances. Before building serviceResources, scan all nodeInstances and use the HostID of the first node with a non-empty Instances slice; if none is found, return an error (fail fast) so ServiceUserRole.Create routing never receives an empty postgresHostID. The serviceSpec.HostIDs loop should rely on this validated value.
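A minimal sketch of the suggested fail-fast host scan, again with hypothetical stand-in types rather than the real ones:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical shapes standing in for the real node-instance structs.
type inst struct{ HostID string }
type nodeInstances struct{ Instances []inst }

// firstPostgresHostID scans every node rather than only nodeInstances[0],
// and fails fast when no instance exists, so credential routing never
// receives an empty host ID.
func firstPostgresHostID(nodes []nodeInstances) (string, error) {
	for _, n := range nodes {
		if len(n.Instances) > 0 {
			return n.Instances[0].HostID, nil
		}
	}
	return "", errors.New("no postgres instance available for service routing")
}

func main() {
	// The first node has no instances; the scan still finds host-2.
	id, err := firstPostgresHostID([]nodeInstances{{}, {Instances: []inst{{HostID: "host-2"}}}})
	fmt.Println(id, err) // prints: host-2 <nil>
}
```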

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
e2e/service_provisioning_test.go (1)

478-497: Consider extracting the cleanup helper to reduce duplication.

This cleanup closure is identical to the one in TestProvisionMCPServiceUnsupportedVersion (lines 374-393). Extracting a helper function would reduce duplication and make maintenance easier.

Example helper
// In a helper file or at package level:
func cleanupDatabase(t *testing.T, client *client.Client, dbID string) {
	t.Helper()
	cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cleanupCancel()

	t.Logf("cleaning up database %s", dbID)
	resp, err := client.DeleteDatabase(cleanupCtx, &controlplane.DeleteDatabasePayload{
		DatabaseID: dbID,
		Force:      true,
	})
	if err != nil {
		if !errors.Is(err, client.ErrNotFound) {
			t.Logf("failed to cleanup database %s: %s", dbID, err)
		}
		return
	}
	_, _ = client.WaitForDatabaseTask(cleanupCtx, &controlplane.GetDatabaseTaskPayload{
		DatabaseID: dbID,
		TaskID:     resp.Task.TaskID,
	})
}

// Usage:
t.Cleanup(func() { cleanupDatabase(t, fixture.Client, dbID) })
Inline review comment

In e2e/service_provisioning_test.go:

  • Lines 668-669: The assertions after fixture creation assume the service is immediately "running" and can flake. Wait for the service to transition before asserting: replace the immediate state check with a short poll loop (or reuse the wait helper from TestProvisionMCPService) that inspects db.ServiceInstances[0].State until it equals "running" or a timeout of a few seconds is reached, then assert Len == 1 and State == "running".

