fix: integrate service resources into database reconciliation cycle #269
rshoemaker wants to merge 5 commits into main
Conversation
Service instance resources now participate in the normal EndState → MarkPendingDeletion → PlanAll → applyPlans cycle alongside database resources, instead of being provisioned by a separate ProvisionServices sub-workflow that deleted and recreated all services on every database update. Service provisioning failures are no longer tolerated; they now trigger a "failed" state for the entire database.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Services were modeled as ServiceResources and merged into EndState/UpdateDatabase; the orchestrator's service-user role now manages credentials and Postgres routing; many service-provisioning workflows and activities were removed; e2e tests switched to client-driven, task-oriented API flows.
🚥 Pre-merge checks: 3 passed
Actionable comments posted: 2
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/internal/orchestrator/swarm/service_user_role.go (1)
145-248: ⚠️ Potential issue | 🟠 Major

Don’t swallow GetDatabase failures during delete.

Delete currently treats any connectToPrimary error as non-fatal, which hides etcd lookup failures that should be retried. Only connection/SQL errors should be swallowed; lookup failures should propagate. Based on learnings: the Delete method distinguishes between etcd errors (GetDatabase) and Postgres connection errors; GetDatabase errors should be propagated, while connection errors should be swallowed with logging during deletion.

Suggested fix (propagate lookup errors):

```diff
+type primaryLookupError struct{ err error }
+
+func (e primaryLookupError) Error() string { return e.err.Error() }
+func (e primaryLookupError) Unwrap() error { return e.err }
+
 func (r *ServiceUserRole) Delete(ctx context.Context, rc *resource.Context) error {
 	...
 	conn, err := r.connectToPrimary(ctx, rc, logger)
 	if err != nil {
+		var lookupErr primaryLookupError
+		if errors.As(err, &lookupErr) {
+			return err
+		}
 		logger.Warn().Err(err).Msg("failed to connect to primary instance, skipping user deletion")
 		return nil
 	}
 	...
 }

 func (r *ServiceUserRole) connectToPrimary(...) (*pgx.Conn, error) {
 	...
 	db, err := dbSvc.GetDatabase(ctx, r.DatabaseID)
 	if err != nil {
-		if errors.Is(err, database.ErrDatabaseNotFound) {
-			return nil, fmt.Errorf("database not found: %w", err)
-		}
-		return nil, fmt.Errorf("failed to get database: %w", err)
+		return nil, primaryLookupError{err}
 	}
 	...
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/internal/orchestrator/swarm/service_user_role.go` around lines 145 - 248, The Delete method currently swallows all errors from connectToPrimary; create a clear distinction by adding a new exported sentinel error (e.g. ErrLookupFailure) and have connectToPrimary wrap/return lookup-related failures (errors from dbSvc.GetDatabase, orch.GetInstanceConnectionInfo, certSvc.PostgresUserTLS, etc.) using errors.Wrap/ fmt.Errorf("%w: ...", ErrLookupFailure) while leaving genuine connection/SQL errors unwrapped, then update Delete to check errors.Is(err, ErrLookupFailure) and return that error (propagate) but continue to log-and-ignore non-lookup connection/SQL errors as before; reference functions/types: ServiceUserRole.Delete, ServiceUserRole.connectToPrimary, dbSvc.GetDatabase (database.ErrDatabaseNotFound), orch.GetInstanceConnectionInfo, and certSvc.PostgresUserTLS.
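For reference, a minimal runnable sketch of the sentinel-error pattern this prompt describes; ErrLookupFailure and the helper functions are illustrative stand-ins, not the repo's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrLookupFailure marks errors that come from state lookups (e.g. etcd)
// rather than from connecting to Postgres; callers propagate these.
var ErrLookupFailure = errors.New("lookup failure")

// getDatabase stands in for dbSvc.GetDatabase.
func getDatabase() error { return errors.New("etcd: key not found") }

func connectToPrimary() error {
	if err := getDatabase(); err != nil {
		// Wrap lookup errors with the sentinel so errors.Is can detect them.
		return fmt.Errorf("%w: get database: %v", ErrLookupFailure, err)
	}
	// Genuine connection errors would be returned unwrapped here.
	return nil
}

func deleteUser() error {
	if err := connectToPrimary(); err != nil {
		if errors.Is(err, ErrLookupFailure) {
			return err // propagate lookup failures so they can be retried
		}
		fmt.Println("warn: skipping user deletion:", err) // swallow connection errors
		return nil
	}
	// ...drop the Postgres role here...
	return nil
}

func main() {
	fmt.Println(deleteUser()) // lookup failure: get database: etcd: key not found
}
```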
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/internal/workflows/plan_update.go`:
- Around line 97-98: The call to host.NewPgEdgeVersion currently ignores its
error (pgEdgeVersion, _ := host.NewPgEdgeVersion(...)), which can leave
pgEdgeVersion nil and bypass compatibility checks; update the code to capture
and handle the error (e.g., pgEdgeVersion, err := host.NewPgEdgeVersion(...)),
and if err != nil return or propagate the error (or log and fail the plan
update) so compatibility validation runs only with a valid PgEdgeVersion
instance; reference NewPgEdgeVersion and the pgEdgeVersion variable to locate
the fix.
- Around line 139-165: findPostgresInstance currently returns an empty host
string when there are no Postgres instances which silently yields an invalid
DatabaseHost; change its signature to return (string, int, error), keep the
instancePort helper and logic, and when no instance (fallback==nil) return a
descriptive error (e.g. "no postgres instances found") instead of "", so callers
of findPostgresInstance (the call sites using the returned host/port) are
updated to handle and propagate the error during planning.
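Taken together, a sketch of both suggested fixes; the types and the newPgEdgeVersion stub are hypothetical stand-ins for the real code in plan_update.go:

```go
package main

import (
	"errors"
	"fmt"
)

type instance struct {
	Host string
	Port int
}

// newPgEdgeVersion stands in for host.NewPgEdgeVersion.
func newPgEdgeVersion(v string) (string, error) {
	if v == "" {
		return "", errors.New("empty version")
	}
	return v, nil
}

// findPostgresInstance returns (host, port, error) instead of an empty host,
// so callers fail fast during planning when no Postgres instance exists.
func findPostgresInstance(instances []instance) (string, int, error) {
	for _, inst := range instances {
		if inst.Host != "" {
			return inst.Host, inst.Port, nil
		}
	}
	return "", 0, errors.New("no postgres instances found")
}

func planUpdate(version string, instances []instance) error {
	// Capture the error rather than discarding it with `_`, so compatibility
	// checks never run against a nil version.
	pgEdgeVersion, err := newPgEdgeVersion(version)
	if err != nil {
		return fmt.Errorf("invalid pgEdge version %q: %w", version, err)
	}
	_ = pgEdgeVersion // compatibility validation would use this

	host, port, err := findPostgresInstance(instances)
	if err != nil {
		return fmt.Errorf("planning failed: %w", err)
	}
	fmt.Printf("DatabaseHost=%s:%d\n", host, port)
	return nil
}

func main() {
	fmt.Println(planUpdate("17.2", []instance{{Host: "10.0.0.5", Port: 5432}}))
}
```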
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/internal/workflows/plan_update.go`:
- Around line 59-75: The code currently only sets postgresHostID from
nodeInstances[0].Instances[0], which can be empty if the first node has no
instances; update the logic before building serviceResources to scan all
nodeInstances and set postgresHostID to the HostID of the first available
instance (e.g., first node where len(Instances)>0 then Instances[0].HostID); if
no instance is found, return an error (fail fast) so ServiceUserRole.Create
routing won’t get an empty postgresHostID; touch the postgresHostID
initialization used by getServiceResources and ensure callers
(serviceSpec.HostIDs loop) rely on this validated value.
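A sketch of the suggested scan, with a stand-in type for the workflow's node-instance structs:

```go
package main

import (
	"errors"
	"fmt"
)

// nodeInstanceGroup is a hypothetical stand-in for the workflow struct.
type nodeInstanceGroup struct {
	Instances []struct{ HostID string }
}

// firstPostgresHostID scans every node rather than only nodeInstances[0],
// and fails fast so ServiceUserRole.Create never routes to an empty host.
func firstPostgresHostID(nodeInstances []nodeInstanceGroup) (string, error) {
	for _, n := range nodeInstances {
		if len(n.Instances) > 0 {
			return n.Instances[0].HostID, nil
		}
	}
	return "", errors.New("no instances available for postgres host routing")
}

func main() {
	groups := []nodeInstanceGroup{
		{}, // first node has no instances
		{Instances: []struct{ HostID string }{{HostID: "host-2"}}},
	}
	fmt.Println(firstPostgresHostID(groups)) // host-2 <nil>
}
```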
Actionable comments posted: 1
🧹 Nitpick comments (1)
e2e/service_provisioning_test.go (1)
478-497: Consider extracting the cleanup helper to reduce duplication.

This cleanup closure is identical to the one in TestProvisionMCPServiceUnsupportedVersion (lines 374-393). Extracting a helper function would reduce duplication and make maintenance easier.

Example helper:

```go
// In a helper file or at package level:
func cleanupDatabase(t *testing.T, client *client.Client, dbID string) {
	t.Helper()
	cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cleanupCancel()
	t.Logf("cleaning up database %s", dbID)
	resp, err := client.DeleteDatabase(cleanupCtx, &controlplane.DeleteDatabasePayload{
		DatabaseID: dbID,
		Force:      true,
	})
	if err != nil {
		if !errors.Is(err, client.ErrNotFound) {
			t.Logf("failed to cleanup database %s: %s", dbID, err)
		}
		return
	}
	_, _ = client.WaitForDatabaseTask(cleanupCtx, &controlplane.GetDatabaseTaskPayload{
		DatabaseID: dbID,
		TaskID:     resp.Task.TaskID,
	})
}

// Usage:
t.Cleanup(func() { cleanupDatabase(t, fixture.Client, dbID) })
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@e2e/service_provisioning_test.go` around lines 478 - 497, Duplicate cleanup closure used in t.Cleanup across tests should be extracted into a shared helper; create a package-level function like cleanupDatabase(t *testing.T, client *client.Client, dbID string) that calls context.WithTimeout(...), logs, calls client.DeleteDatabase(...) and client.WaitForDatabaseTask(...), marks t.Helper(), and then replace the inline t.Cleanup closures in tests (e.g., in TestProvisionMCPServiceUnsupportedVersion and the test at lines shown) with t.Cleanup(func(){ cleanupDatabase(t, fixture.Client, dbID) }) to remove duplication and centralize cleanup logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@e2e/service_provisioning_test.go`:
- Around line 668-669: The assertions after fixture creation assume immediate
"running" state and can flake; change the test to wait for the service to
transition to "running" before asserting. Locate the lines using require.Len and
require.Equal and replace the immediate state check with a short poll loop (or
reuse the existing wait helper used by TestProvisionMCPService) that inspects
db.ServiceInstances[0].State until it equals "running" or a timeout (e.g., few
seconds) is reached, then assert Len==1 and State=="running".
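A sketch of the suggested wait using testify's require.Eventually; fetchDatabase is a hypothetical helper standing in for however the test reloads database state:

```go
func TestServiceEventuallyRunning(t *testing.T) {
	// ...fixture and dbID setup elided...
	// Poll instead of asserting immediately after fixture creation.
	require.Eventually(t, func() bool {
		db := fetchDatabase(t, fixture.Client, dbID)
		return len(db.ServiceInstances) == 1 && db.ServiceInstances[0].State == "running"
	}, 10*time.Second, 250*time.Millisecond, "service never reached running state")

	// The strict assertions only run once the service has settled.
	db := fetchDatabase(t, fixture.Client, dbID)
	require.Len(t, db.ServiceInstances, 1)
	require.Equal(t, "running", db.ServiceInstances[0].State)
}
```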
Integrate service resources into the database reconciliation cycle
Summary
Problem
ProvisionServices ran as a separate sub-workflow after applyPlans() in UpdateDatabase. Because EndState() only included database node resources, the reconciliation cycle saw existing service resources in the start state but not in the end state, marking them for deletion. The separate workflow then recreated everything from scratch, causing unnecessary service downtime on every database update, even when services hadn't changed.
Approach
Service resources now participate in the same EndState → MarkPendingDeletion → PlanAll → applyPlans cycle as database resources:
UpdateDatabase workflow
→ RefreshCurrentState (load state from etcd, refresh all resources)
→ PlanUpdate:
→ getNodeResources() → calls GenerateInstanceResources on each host
→ getServiceResources() → calls GenerateServiceInstanceResources on each host
→ operations.UpdateDatabase(current, nodeResources, serviceResources):
→ EndState(nodes, services) → desired state (database + service resources)
→ start.MarkPendingDeletion(end)
→ start.PlanAll(states...) → []Plan
→ applyPlans (execute plans, persist state)
Service resource errors during planning are fatal — if getServiceResources() fails (e.g., unsupported version), the entire workflow fails and the database transitions to "failed" state. This matches how node resource errors are handled.
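A condensed, runnable sketch of the cycle above; the types and signatures are assumptions based on this description, not the actual API in server/internal/database/operations:

```go
package main

import "fmt"

type Resource struct{ ID string }

type Plan struct{ Action, Target string }

type State struct{ Resources map[string]bool }

// EndState builds the desired state from node AND service resources, so the
// reconciler no longer sees services only in the start state.
func EndState(groups ...[]*Resource) *State {
	end := &State{Resources: map[string]bool{}}
	for _, g := range groups {
		for _, r := range g {
			end.Resources[r.ID] = true
		}
	}
	return end
}

// PlanAll diffs start against end: resources missing from end are deleted,
// resources missing from start are created.
func (s *State) PlanAll(end *State) []Plan {
	var plans []Plan
	for id := range s.Resources {
		if !end.Resources[id] {
			plans = append(plans, Plan{"delete", id})
		}
	}
	for id := range end.Resources {
		if !s.Resources[id] {
			plans = append(plans, Plan{"create", id})
		}
	}
	return plans
}

func main() {
	start := &State{Resources: map[string]bool{"node-1": true, "svc-mcp": true}}
	nodes := []*Resource{{ID: "node-1"}}
	services := []*Resource{{ID: "svc-mcp"}}
	// Because services now appear in the end state, no spurious
	// delete-and-recreate plan is produced for svc-mcp.
	fmt.Println(start.PlanAll(EndState(nodes, services))) // []
}
```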
Key changes
Resource lifecycle — ServiceUserRole.Create() now generates credentials and creates the Postgres role directly (previously a no-op with a separate CreateServiceUser activity). ServiceInstanceSpecResource reads credentials from the sibling ServiceUserRole via resource Context (see the sketch after these notes). ServiceInstanceResource.Create() stores the initial etcd record (previously done by a separate activity).
Operations — EndState() accepts []*ServiceResources and merges them into the desired state alongside node resources. UpdateDatabase() passes service resources through.
Deleted — ProvisionServices workflow, CreateServiceUser activity, StoreServiceInstance activity, UpdateServiceInstanceState activity, GetServiceInstanceStatus activity.
E2E tests — Updated TestProvisionMCPServiceUnsupportedVersion and TestProvisionMCPServiceRecovery to expect state: "failed" when the service version is invalid (previously they expected the database to reach "available" despite the service failure).
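A minimal sketch of the sibling-credential handoff described under Resource lifecycle, assuming a hypothetical key-value resource Context (the repo's actual Context API may differ):

```go
package main

import "fmt"

// Context is a hypothetical stand-in for the resource Context that sibling
// resources share while a plan executes.
type Context struct{ values map[string]string }

func (c *Context) Put(k, v string) { c.values[k] = v }

func (c *Context) Get(k string) (string, bool) {
	v, ok := c.values[k]
	return v, ok
}

type ServiceUserRole struct{ DatabaseID string }

// Create generates credentials and creates the Postgres role directly
// (no separate CreateServiceUser activity), then publishes the password
// for sibling resources to read.
func (r *ServiceUserRole) Create(rc *Context) error {
	password := "generated-password" // stand-in for real credential generation
	// ...CREATE ROLE in Postgres would happen here...
	rc.Put("service-user-password/"+r.DatabaseID, password)
	return nil
}

type ServiceInstanceSpecResource struct{ DatabaseID string }

// Create reads the credentials published by the sibling ServiceUserRole.
func (s *ServiceInstanceSpecResource) Create(rc *Context) error {
	pw, ok := rc.Get("service-user-password/" + s.DatabaseID)
	if !ok {
		return fmt.Errorf("sibling ServiceUserRole has not published credentials")
	}
	_ = pw // build the service instance spec with these credentials
	return nil
}

func main() {
	rc := &Context{values: map[string]string{}}
	role := &ServiceUserRole{DatabaseID: "db-1"}
	spec := &ServiceInstanceSpecResource{DatabaseID: "db-1"}
	fmt.Println(role.Create(rc), spec.Create(rc)) // <nil> <nil>
}
```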
Test plan
TODO: update golden tests to include services in server/internal/database/operations/
PLAT-412