
feat(cp+runner+manifests): ambient-control-plane Go service, gRPC runner event streaming #1005

Closed
markturansky wants to merge 4 commits into main from feat/cp-grpc-runner-v1

Conversation

@markturansky
Contributor

Summary

Re-introduces the work from PR #975 (which was reverted in PR #980 due to a test failure). That failure has since been resolved in main.

  • ambient-control-plane (new Go service): Kubernetes reconciler that watches the ambient-api-server via gRPC streams and reconciles Sessions, Projects, and ProjectSettings into K8s — namespaces, runner Pods, Secrets, RoleBindings. Informer-based watch loop, TLS/gRPC support, tally reconciler, stress and unit tests.
  • runner: gRPC AG-UI event streaming — _grpc_client.py, _session_messages_api.py, grpc_transport.py bridge between Claude Code SSE output and gRPC push stream, /events SSE endpoint for AG-UI fan-out. Structured logging throughout.
  • manifests: Control-plane RBAC, gRPC Route, kind/production overlays, CI workflow (ambient-control-plane-tests).

70 files changed, +10,310 / -158

Test plan

  • make build-control-plane succeeds
  • make kind-up && make deploy — control-plane pod starts and reconciles sessions
  • Create a session — runner pod spawns, gRPC transport pushes AG-UI events
  • Runner tests: cd components/runners/ambient-runner && python -m pytest tests/
  • CI: ambient-control-plane-tests workflow passes

🤖 Generated with Claude Code

@coderabbitai
Contributor

coderabbitai bot commented Mar 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

This pull request introduces ambient-control-plane, a new Kubernetes-native controller component written in Go that reconciles Ambient API sessions into Kubernetes resources (pods, services, secrets, RBAC). It includes initial sync via REST API and live streaming via gRPC watch, dispatching events through reconcilers for sessions, projects, and project settings. The runner is updated with gRPC support for bidirectional session message streaming. Kubernetes manifests, deployment automation, GitHub Actions workflows, and comprehensive documentation are added to support local and OpenShift deployment.

Changes

Cohort / File(s) / Summary
Control Plane Go Module
components/ambient-control-plane/go.mod, components/ambient-control-plane/go.sum
New Go module defining dependencies on Kubernetes packages, gRPC, and local ambient components with pinned versions.
Control Plane Main Application
components/ambient-control-plane/cmd/ambient-control-plane/main.go
Entrypoint that loads configuration, installs service CA certificates, starts HTTP metrics server, and branches to kube/test mode with concurrent reconciler and informer loop execution.
Control Plane Configuration
components/ambient-control-plane/internal/config/config.go
Configuration module parsing environment variables for API endpoints, gRPC, Kubernetes, reconciler types, runner image, and LLM credentials with validation and fallback defaults.
Control Plane Kubernetes Client
components/ambient-control-plane/internal/kubeclient/kubeclient.go, internal/kubeclient/kubeclient_test.go
Dynamic Kubernetes client wrapper providing CRUD operations for namespaces, pods, services, secrets, service accounts, roles, and role bindings with comprehensive unit test coverage.
Control Plane Informer & Watcher
components/ambient-control-plane/internal/informer/informer.go, internal/watcher/watcher.go
Event-driven informer managing initial SDK list sync and live gRPC watch streams for sessions, projects, and project settings with handler dispatch and echo detection.
Control Plane Reconcilers
components/ambient-control-plane/internal/reconciler/shared.go, internal/reconciler/kube_reconciler.go, internal/reconciler/project_reconciler.go, internal/reconciler/project_settings_reconciler.go, internal/reconciler/tally.go, internal/reconciler/tally_reconciler.go, internal/reconciler/tally_test.go, internal/reconciler/stress_test.go
Reconciliation logic for session provisioning/deprovisioning (creates runner pod, service, RBAC, secret), project namespace/secret setup, project settings role binding management, and event tallying for testing/monitoring.
Control Plane Metrics
components/ambient-control-plane/internal/metrics/metrics.go
Prometheus metrics (counters, histograms) tracking reconciliation outcomes, session lifecycle transitions, and phase changes.
Control Plane Build & Documentation
components/ambient-control-plane/Dockerfile, Dockerfile.simple, Makefile, README.md, CLAUDE.md, .gitignore
Multi-stage Dockerfile for containerization, build automation with version injection, and architecture/design documentation.
Control Plane GitHub Actions
.github/workflows/ambient-control-plane-tests.yml
New workflow running Go tests, race detector, and Docker build validation on control-plane changes with concurrency limits.
Root Makefile Updates
Makefile
Added control-plane image build target, OpenShift deployment automation (includes runner selection, gRPC auth setup, RBAC patching), local version syncing, and removed backend AMBIENT_VERSION build args.
API Server Manifest Updates
components/manifests/base/core/ambient-api-server-service.yml
Hardened API server with HTTPS/TLS, reduced logging verbosity, mandatory database SSL, gRPC TLS configuration, and auto-provisioned serving certs via OpenShift annotation.
API Server gRPC Route
components/manifests/base/ambient-api-server-grpc-route.yml
New OpenShift Route exposing gRPC service on ambient-api-server with TLS passthrough.
Database Manifest Updates
components/manifests/base/platform/ambient-api-server-db.yml, components/manifests/components/ambient-api-server-db/ambient-api-server-db-json-patch.yaml, components/manifests/overlays/local-dev/ambient-api-server-db-json-patch.yaml
PostgreSQL image update to RHEL 9, security hardening (non-root, dropped capabilities), environment variable rename, health probe adjustments, and directory path changes.
Control Plane RBAC
components/manifests/base/rbac/control-plane-sa.yaml, control-plane-clusterrole.yaml, control-plane-clusterrolebinding.yaml, rbac/kustomization.yaml
Service account, cluster role with permissions for session CRDs, Kubernetes resources (namespaces, pods, secrets, services, RBAC), and role bindings; binding to control-plane service account.
Control Plane Deployment
components/manifests/base/core/ambient-control-plane-deployment.yaml, core/kustomization.yaml
Kubernetes Deployment with pod security, resource limits, environment configuration from Secret, and appropriate Kustomization inclusion.
Kustomization Base & Image Overrides
components/manifests/base/kustomization.yaml
Extended image overrides to include control-plane image tag replacement.
Kind/Local Overlay
components/manifests/overlays/kind-local/control-plane-env-patch.yaml, kind-local/kustomization.yaml
Local development environment patches setting insecure gRPC, local image references, and http API server URLs for control-plane.
Kind Overlay Updates
components/manifests/overlays/kind/kustomization.yaml, kind/control-plane-env-patch.yaml, kind/backend-ambient-api-patch.yaml, kind/local-image-pull-policy-patch.yaml, kind/ambient-api-server-jwks-patch.yaml
Replaced JWKS ConfigMap with control-plane env patch, added backend API URL patch, image policy patches, and localhost image remapping.
Production Overlay
components/manifests/overlays/production/kustomization.yaml, production/control-plane-image-patch.yaml, production/control-plane-env-patch.yaml
Production patches setting internal registry image, gRPC server address with TLS, runner image namespace, HTTPS API URLs, and backend service endpoint.
Deployment Scripts
components/manifests/deploy/deploy.sh, components/manifests/deploy-no-api-server.sh, components/manifests/deploy.sh (updates)
New deployment automation for OpenShift with OAuth setup, image overrides, rollout verification, control-plane token management via ocm, and gRPC route validation; updated main deploy script with control-plane image and rollout steps.
Runner gRPC Support
components/runners/ambient-runner/ambient_runner/_grpc_client.py, _session_messages_api.py, bridges/claude/grpc_transport.py
New gRPC client with TLS/token support, session messages API wrapper with push/watch operations, and gRPC transport listener that manages watch streams and routes incoming user messages to bridge execution.
Runner Events Endpoint
components/runners/ambient-runner/ambient_runner/endpoints/events.py
New FastAPI router exposing /events/{thread_id} SSE streaming and wait-based fallback polling with event filtering and queue management.
Runner App & Bridge Updates
components/runners/ambient-runner/ambient_runner/app.py, bridges/claude/bridge.py, bridges/claude/prompts.py, platform/prompts.py
Added gRPC listener initialization, gRPC/HTTP push helpers with conditional routing, bridge gRPC state tracking, agent preamble prefix for system prompt, and dependency imports.
Runner Testing
components/runners/ambient-runner/tests/test_app_initial_prompt.py, test_bridge_claude.py, test_events_endpoint.py, test_grpc_transport.py
Comprehensive test coverage for gRPC client, session message API, initial prompt dispatch (gRPC/HTTP), bridge lifecycle, SSE events endpoint, listener/writer behavior, and message accumulation.
Runner Metadata & Documentation
components/runners/ambient-runner/pyproject.toml, architecture.md
Added gRPC/protobuf dependencies and detailed architecture documentation covering bridge pattern, session workers, AG-UI integration, and gRPC transport lifecycle.
Design & Developer Documentation
REMOVE_CRDs.md, docs/internal/design/agent-api.md, docs/internal/design/blackboard-api.md, docs/internal/developer/agent-workflow.md, docs/internal/developer/local-development/openshift.md
CRD removal proposal with phased migration plan, API/blackboard design specs covering agent/session models and CLI/REST endpoints, multi-agent workflow coordination protocol, and OpenShift development setup with token/image handling.
CI/CD Workflows
.github/workflows/components-build-deploy.yml, .github/workflows/prod-release-deploy.yaml
Extended workflows to detect control-plane changes, build and push control-plane image, and include control-plane in deployment matrices and tag overrides.
End-to-End Testing
test-e2e-control-plane.sh, e2e/scripts/load-images.sh
New session lifecycle e2e test validating API creation, CR reconciliation, pod provisioning, gRPC connectivity, and message streaming; image loading script extended for control-plane.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API as Ambient API Server
    participant CP as Control Plane
    participant K8s as Kubernetes
    participant Runner as Runner Pod

    Client->>API: Create Session
    API-->>Client: Session ID, Credentials
    
    API->>CP: gRPC Watch (Session event)
    CP->>CP: Receive SessionAdded event
    CP->>K8s: Create Namespace
    CP->>K8s: Create ServiceAccount
    CP->>K8s: Create RoleBinding
    CP->>K8s: Create Secret (token)
    CP->>K8s: Create Pod (runner)
    CP->>K8s: Create Service
    K8s-->>Runner: Pod scheduled & started
    
    CP->>API: Update Session Status (Running)
    Client->>Runner: Send message (via Service)
    Runner-->>Client: Response (via Service)
    
    Client->>API: Delete Session
    API->>CP: gRPC Watch (Session event)
    CP->>CP: Receive SessionDeleted event
    CP->>K8s: Delete Pod
    CP->>K8s: Delete Service
    CP->>K8s: Delete Secrets
    CP->>K8s: Delete RBAC
    CP->>API: Update Session Status (Stopped)
sequenceDiagram
    participant Runner as Runner Pod
    participant GRPC as gRPC Watch Stream
    participant Listener as GRPCSessionListener
    participant Bridge as ClaudeBridge
    participant Worker as SessionWorker
    participant Model as Claude SDK

    Runner->>GRPC: Connect & authenticate
    GRPC-->>Runner: Stream established
    Runner->>Listener: Watch loop receives message
    
    Listener->>Listener: Parse SessionMessage (user event)
    Listener->>Bridge: run(input_data) async
    Bridge->>Worker: Submit prompt to queue
    Worker->>Model: Send message
    Model-->>Worker: Receive response stream
    Worker->>Bridge: Yield AG-UI events
    Bridge->>Listener: Forward events
    
    Listener->>Listener: Accumulate MESSAGES_SNAPSHOT
    Listener->>Bridge: emit RUN_FINISHED
    Listener->>GRPC: push(event_type=assistant, payload)
    GRPC-->>Listener: Ack
    
    Listener->>Runner: Remove from active_streams

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 36

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/ambient-control-plane/cmd/ambient-control-plane/main.go`:
- Around line 171-178: installServiceCAIntoDefaultTransport currently replaces
http.DefaultTransport with a new http.Transport losing tuned defaults; instead
obtain the existing transport (assert http.DefaultTransport to *http.Transport),
clone it (use the Transport.Clone method or shallow copy), set/merge its
TLSClientConfig to ensure MinVersion = tls.VersionTLS12 and RootCAs = pool
(preserving other TLS settings if present), then assign the cloned transport
back to http.DefaultTransport so all original connection/timeouts/keep-alive
settings are preserved.

In `@components/ambient-control-plane/go.mod`:
- Line 11: Update the pinned dependency google.golang.org/grpc from v1.79.1 to
v1.79.3 (or later) in the go.mod and then regenerate the module checksum file
(go.sum) so the updated release that fixes CVE-2026-33186 is recorded; locate
the dependency entry for google.golang.org/grpc in go.mod, change the version to
v1.79.3+, and run your project's dependency resolution (e.g., module
update/tidy) to refresh go.sum.

In `@components/ambient-control-plane/internal/informer/informer.go`:
- Around line 400-405: The current Informer.dispatchBlocking blocks when
inf.eventCh is full, which can stall synchronous watch handlers in watchManager
and harm gRPC watch streams; change dispatchBlocking in informer.go to avoid
blocking: replace the blocking send on inf.eventCh with a non-blocking select
that attempts to send, and on channel-full fallback log a warning (use
processLogger or the component logger) and increment a new drop/overload metric
to detect overloads; alternatively expose/configure the buffer size for
inf.eventCh (increase from 256) if expected load requires it, but prefer the
non-blocking send + warning/metric approach inside Informer.dispatchBlocking to
prevent stalling watch handlers.
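The non-blocking send pattern the review suggests looks roughly like this (Event, Informer, and the counter are stand-ins for the informer.go symbols, not the actual types):

```go
package main

import "log"

// Event and Informer are illustrative stand-ins; only the dispatch pattern matters.
type Event struct{ Kind string }

type Informer struct {
	eventCh chan Event
	dropped int64 // would be a Prometheus counter in the real code
}

// dispatchNonBlocking attempts the send and falls back to a logged drop when
// the channel is full, so synchronous watch handlers are never stalled.
func (inf *Informer) dispatchNonBlocking(ev Event) {
	select {
	case inf.eventCh <- ev:
	default:
		inf.dropped++
		log.Printf("event channel full; dropping %s event", ev.Kind)
	}
}
```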

In `@components/ambient-control-plane/internal/kubeclient/kubeclient.go`:
- Around line 93-97: The code currently ignores the error from os.UserHomeDir()
and builds localPath via string concatenation; update the logic in the kube
client setup so you check and handle the error returned by os.UserHomeDir(), and
construct the kube config path using filepath.Join(home, ".kube", "config")
instead of home + "/.kube/config"; then use os.Stat(localPath) as before and
pass the resulting localPath to clientcmd.BuildConfigFromFlags("", localPath)
only when the home lookup succeeded and the file exists.

In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`:
- Line 476: unstructured.NestedMap call is ignoring its error which can hide
malformed secret data; replace the blank error capture with proper handling by
calling unstructured.NestedMap(src.Object, "data") into (data, found, err) and
then check err (and optionally found) — if err is non-nil return or log a clear
error (including context like the secret name/namespace or src.Object) and stop
the copy operation so invalid data isn't propagated; update any callers of this
block to handle the returned error if necessary.
- Around line 660-665: Remove the custom min function implementation and use
Go's built-in min instead: delete the local function named min and replace its
usages with the predeclared min call (e.g., min(x, y)); ensure there are no
local name conflicts and that the module's Go version remains >= 1.21 (project
targets 1.24), then run go vet/build to verify no import or shadowing issues.
- Around line 344-346: The local variable currently declared as "secretName :=
secretName(session.ID)" shadows the helper function "secretName", so rename the
variable to avoid shadowing (e.g., "secretNameVal" or "saSecretName") and update
all subsequent references in this scope; keep the call to the helper function
"secretName(session.ID)" unchanged but assign its result to the new variable
name alongside the existing "saName := serviceAccountName(session.ID)"
assignment.

In `@components/ambient-control-plane/internal/reconciler/project_reconciler.go`:
- Around line 104-110: The current creatorRoleBindingName function truncates the
sanitized subject and callers only check for existence by name, which lets stale
ambient-code.io/created-by annotations or different long identities collide;
update creatorRoleBindingName to derive the name from the full subject plus a
stable hash (e.g., sha1/hex) instead of simple truncation, and change the
reconciler logic that currently returns when a RoleBinding with that name exists
to instead fetch the existing RoleBinding (by name), compare its subjects and
ambient-code.io/created-by annotation to the desired subject, and either update
the subjects/annotation in-place or recreate the binding when they differ (use
functions like creatorRoleBindingName and the reconcile path that currently does
the existence check to locate the RoleBinding to update).
- Around line 67-96: The namespace creation is racy: after
r.kube.GetNamespace(...) returns not found, a concurrent reconcile may create
the namespace before r.kube.CreateNamespace(...) runs, causing an AlreadyExists
error to be treated as fatal; update the CreateNamespace error handling in
project_reconciler.go (around r.kube.GetNamespace and r.kube.CreateNamespace) to
treat k8serrors.IsAlreadyExists(err) as a successful outcome (log/debug and
return nil) instead of returning an error so the reconcile becomes idempotent
and converges.

In
`@components/ambient-control-plane/internal/reconciler/project_settings_reconciler.go`:
- Around line 130-139: The loop over entries currently logs errors from
ensureGroupRoleBinding but always returns nil, hiding partial failures; update
the reconcile function (the loop that iterates over entries and calls
ensureGroupRoleBinding) to collect errors (e.g., append to an error slice) and
at the end return an aggregated error (or return immediately on first
non-recoverable error) instead of nil; ensure the returned error includes
context (namespace, group, role) similar to the existing log entries so callers
can detect and handle partial reconciliation failures.

In `@components/ambient-control-plane/internal/reconciler/shared.go`:
- Around line 88-96: The namespaceForSession function uses session.ProjectID
verbatim (only lowercased) which can produce invalid Kubernetes namespace names;
update namespaceForSession to sanitize/validate session.ProjectID (derived from
Project.Name) into a DNS-1123 label: convert to lowercase, replace any character
not [a-z0-9-] with hyphen, collapse consecutive hyphens, trim leading/trailing
hyphens, and truncate to 63 characters (ensuring not empty) and fall back to
"default" if the result is empty; additionally add validation at project
creation (where Project.Name is set) to reject names that cannot be normalized
to a valid DNS-1123 label so invalid Project.Name values are caught early.

In `@components/ambient-control-plane/internal/reconciler/stress_test.go`:
- Around line 344-347: The test spawns goroutines that call t.Errorf (unsafe);
change the goroutine to capture the error returned by tallyReconciler.Reconcile
(called with ctx and event) and send a formatted error string or the error value
into a concurrent-safe collector (e.g., a buffered channel or a slice guarded by
a sync.Mutex), then after the WaitGroup.Wait iterate the collected errors and
call t.Errorf or t.Fatal from the main test goroutine; ensure you reference
tallyReconciler.Reconcile, event, index, and the WaitGroup when implementing the
collector.

In `@components/ambient-control-plane/internal/reconciler/tally_reconciler.go`:
- Around line 98-121: handleSessionDeleted can decrement r.tally.TotalSessions
below zero when a delete arrives for a session that was never counted; guard the
decrement by checking r.tally.TotalSessions > 0 before doing TotalSessions-- (or
clamp to zero) and optionally emit a warning/log when a delete is ignored so
it’s visible; keep the existing guarded decrements for SessionsByPhase and
SessionsByUser but add the TotalSessions floor check in the handleSessionDeleted
method to prevent negative totals.

In `@components/ambient-control-plane/internal/watcher/watcher.go`:
- Around line 124-149: The loop in watcher.go increments attempt forever so
backoffDuration(attempt) grows even after successful reconnects; modify the loop
around wm.watchOnce(ctx, resource) to reset attempt to 0 when a watch was
established successfully (i.e., when wm.watchOnce returns nil and ctx isn't
canceled), leaving attempt increments only for consecutive failures; use the
existing attempt variable, wm.watchOnce, and backoffDuration to implement this
reset so that backoff only accrues across consecutive errors.

In `@components/manifests/base/core/ambient-api-server-service.yml`:
- Around line 81-82: The flags --https-cert-file and --https-key-file point to
/secrets/tls/ which doesn't exist because the TLS secret is mounted at
/etc/tls/; update the API server args (the --https-cert-file and
--https-key-file values) to reference /etc/tls/tls.crt and /etc/tls/tls.key (or
alternatively change the volumeMount path from /etc/tls/ to /secrets/tls/),
ensuring the flag values and the volumeMount location (the TLS secret mount)
match.

In `@components/manifests/deploy`:
- Around line 82-112: The script writes the plaintext CLIENT_SECRET_VALUE to
/tmp/ambient-frontend-oauthclient.yaml and echoes it on failure (uses
CLIENT_SECRET_VALUE, /tmp/ambient-frontend-oauthclient.yaml and OAUTH_APPLY_RC);
change to avoid creating a world-readable temp file and avoid printing the
secret: generate the manifest on stdin and apply directly (pipe a heredoc into
oc apply -f - and capture the exit code) or create a secure temp file (mktemp)
and chmod 600 immediately, then remove it; also redact the secret in all
user-facing echo/usage output (replace ${CLIENT_SECRET_VALUE} with a placeholder
like [REDACTED] when printing the manual oc apply instructions) and ensure the
echo block that prints the fallback manifest no longer emits the real
CLIENT_SECRET_VALUE.
- Around line 423-424: The "Next steps" output contains a duplicate step number:
find the echo that prints "4. Monitor RFE workflows:" (the echo -e line with
that exact text) and renumber it to the correct sequence (e.g., "5. Monitor RFE
workflows:") so the step numbers are unique and sequential; ensure any
subsequent step labels in the same block are adjusted consistently.

In `@components/manifests/overlays/production/control-plane-env-patch.yaml`:
- Around line 15-16: The RUNNER_IMAGE environment variable is using an unpinned
tag ("quay.io/ambient_code/vteam_claude_runner:latest") which is unsafe for
production; update the RUNNER_IMAGE value to a fixed image tag or immutable
digest (e.g., "quay.io/ambient_code/vteam_claude_runner:vX.Y.Z" or
"quay.io/ambient_code/vteam_claude_runner@sha256:...") so deployments are
reproducible and predictable.

In `@components/runners/ambient-runner/ambient_runner/_grpc_client.py`:
- Around line 1-5: The file header incorrectly states "Code generated ... DO NOT
EDIT" even though ambient_runner._grpc_client appears hand-written; update the
top-of-file comment to accurately reflect this by removing or replacing the
generated-code warning and adding a clear note that the file is manually
maintained (or, if it truly is generated, restore the canonical generator
metadata and automation); ensure any replacement mentions the module name
ambient_runner._grpc_client and the reason for the change so maintainers know
whether the file should be edited or regenerated.
- Around line 24-34: The _load_ca_cert function currently swallows OSError when
a candidate CA file exists but can't be opened, which hides permission or IO
issues; update _load_ca_cert to log a warning on except OSError (include the
path and the exception message) so failures to read an existing file are visible
(use the module logger via logging.getLogger(__name__) or the project's standard
logger), then continue to the next candidate and preserve the existing return
behavior.

In `@components/runners/ambient-runner/ambient_runner/_session_messages_api.py`:
- Around line 77-84: Don't log raw payload_preview at INFO level; update the
logger calls that build payload_preview (the logger.info call referencing
session_id, event_type, len(payload), payload_preview and any other logger.info
that uses payload_preview) to only emit session_id, event_type and payload
length at INFO, and move the generation and logging of a preview behind an
explicit debug/preview gate (e.g. check logger.isEnabledFor(logging.DEBUG) or a
configurable redact_preview flag) so previews are only produced and logged when
debugging is enabled and otherwise suppressed or redacted.
- Around line 54-64: The SessionMessagesAPI.push() method is synchronous and
currently being called directly from async contexts, which blocks the event
loop; update the two call sites — _push_initial_prompt_via_grpc (in app.py) and
GRPCMessageWriter._write_message (in grpc_transport.py) — to offload the push
call using the event loop's run_in_executor (e.g., await
loop.run_in_executor(None, lambda: client.session_messages.push(...))) instead
of calling client.session_messages.push(...) directly so push() runs in a thread
and does not block async I/O; keep the existing offloading for watch() as-is.

In `@components/runners/ambient-runner/ambient_runner/app.py`:
- Around line 129-131: The code currently calls ClaudeBridge._setup_platform()
and accesses ClaudeBridge._grpc_listener.ready directly, coupling app logic to
private internals; add a public method on ClaudeBridge (e.g.
prepare_grpc_listener() or get_grpc_ready()) that performs the setup and returns
the ready awaitable/async event, update the caller to await
bridge.prepare_grpc_listener() instead of calling _setup_platform() and touching
_grpc_listener, and remove any remaining direct references to the private
members (_setup_platform and _grpc_listener) so the runner depends only on the
new public API.
- Around line 356-403: The _push_initial_prompt_via_grpc function can leak the
AmbientGRPCClient if an exception happens before client.close(); wrap the
lifecycle so the client is always closed: create the client via
AmbientGRPCClient.from_env() and then use a try/finally where the existing logic
(payload build, client.session_messages.push, logging) stays in the try block
and client.close() is invoked in the finally; ensure you guard the finally close
with a check that the client variable was successfully created (e.g., client is
not None) to avoid NameError if from_env() itself raises.

In
`@components/runners/ambient-runner/ambient_runner/bridges/claude/grpc_transport.py`:
- Around line 129-194: The loop can leak ThreadPoolExecutor instances when an
exception triggers the reconnect path because executor.shutdown() is only called
on the normal and CancelledError paths; update the exception handler in the
async loop that creates executor/ watch_future (the block that calls
self._watch_in_thread with msg_queue, loop, stop_event, last_seq) to always call
stop_event.set() and executor.shutdown(wait=False) before awaiting sleep/backoff
or continuing, ensuring watch_future is also cancelled if not done; reset
backoff as before and preserve last_seq; make these changes around the variables
executor, watch_future, stop_event and backoff so executors are cleaned up on
all paths.
- Around line 199-207: When RunnerInput.model_validate_json(msg.payload) fails,
the exception is currently swallowed; update the except block to log the
original parse exception at DEBUG along with the offending msg.payload before
constructing the fallback RunnerInput (references:
RunnerInput.model_validate_json, RunnerInput, msg.payload, self._session_id).
Ensure you capture the exception object (e.g., as exc) and call the module/class
logger.debug with a message that includes the exception details and payload so
malformed payloads can be diagnosed.
- Around line 335-338: In _write_message (an async method) the synchronous
blocking call to self._grpc_client.session_messages.push(self._session_id,
event_type="assistant", payload=payload) must be offloaded to a thread so it
doesn't block the asyncio event loop; replace the direct push call with an
awaited call using asyncio.to_thread(...) or
asyncio.get_event_loop().run_in_executor(...) to invoke session_messages.push
with the same args (self._session_id, event_type, payload) and await its
completion.

In `@components/runners/ambient-runner/ambient_runner/endpoints/events.py`:
- Around line 61-63: active_streams[thread_id] = queue currently overwrites any
existing subscriber and allows one connection to orphan another; change this to
either reject a new subscriber if active_streams already contains thread_id or
support multiple subscribers by storing a list/collection of queues per
thread_id, and in both cases make cleanup identity-based (only remove the queue
instance you created) inside the finally blocks (e.g., in wait_for_events and
the other places setting active_streams) by checking that
active_streams[thread_id] is the same queue object (or remove the queue from the
list) before popping/deleting it to avoid dropping events for other connections;
apply the same pattern for the other occurrences where you set active_streams
(the spots referenced as 109-110, 155-156, 183-184).

In `@components/runners/ambient-runner/tests/test_app_initial_prompt.py`:
- Around line 34-35: The tests patch the class at its definition site instead of
where it's looked up; change all patch targets referencing
"ambient_runner._grpc_client.AmbientGRPCClient" in this test file to
"ambient_runner.app.AmbientGRPCClient" so the mocked class is used by the code
that imported it (the function under test _push_initial_prompt_via_grpc in
app.py); update every occurrence mentioned (the patches around lines where
AmbientGRPCClient is used in the tests) to target the app module's name binding.

In `@components/runners/ambient-runner/tests/test_bridge_claude.py`:
- Around line 53-75: The test is currently patching ClaudeBridge._setup_platform
and then assigning bridge._grpc_listener manually so it never verifies listener
creation; remove the patch that replaces _setup_platform, patch
GRPCSessionListener to return a mock_listener_instance, call/await the real
bridge._setup_platform (or the public setup entry that triggers it) with
AMBIENT_GRPC_URL set, and then assert that GRPCSessionListener was called and
bridge._grpc_listener is the mock_listener_instance (use symbols ClaudeBridge,
_setup_platform, GRPCSessionListener, bridge._grpc_listener, and the
AMBIENT_GRPC_URL env var to locate the code).
- Around line 19-29: The tests fail because ClaudeBridge.__init__ does not
initialize the gRPC-related attributes; update ClaudeBridge.__init__ to set
self._grpc_listener (type GRPCSessionListener | None) to None and
self._active_streams to an empty dict so both _grpc_listener and _active_streams
exist on construction; add these initializations alongside the other existing
attribute setups in ClaudeBridge.__init__ (referencing the ClaudeBridge class
and its __init__ method, and the symbols _grpc_listener and _active_streams).
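
A minimal sketch of the suggested `__init__` change, with `GRPCSessionListener` reduced to a placeholder since the real class lives in `grpc_transport.py`:

```python
import asyncio
from typing import Optional

class GRPCSessionListener:
    """Placeholder for the real listener type in grpc_transport.py."""

class ClaudeBridge:
    """Sketch showing only the attribute initialization the review asks for."""
    def __init__(self) -> None:
        # ... existing attribute setup would appear here ...
        self._grpc_listener: Optional[GRPCSessionListener] = None
        self._active_streams: dict[str, asyncio.Queue] = {}
```

With both attributes set at construction, tests can inspect them without triggering `AttributeError` before `_setup_platform` runs.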

In `@components/runners/ambient-runner/tests/test_events_endpoint.py`:
- Around line 66-79: The test creates an unused Queue (q) that isn't added to
active_streams, so either remove that unused q or actually assert the endpoint
registered its own queue; update test_registers_queue_before_streaming to delete
the q creation (q = asyncio.Queue / q.put_nowait...) and after consuming the
response body (resp.read()) assert that 't-1' is present in active_streams and
that active_streams['t-1'] is an asyncio.Queue (or alternatively move creation
of q before calling _make_bridge and assign active_streams['t-1'] = q if you
intend to pre-register the queue); reference
test_registers_queue_before_streaming, active_streams, _make_bridge,
client.stream, and resp.read.

In `@components/runners/ambient-runner/tests/test_grpc_transport.py`:
- Around line 152-181: The test test_user_message_triggers_bridge_run currently
never verifies that bridge.run() was invoked; update it to assert the call by
replacing the simple generator fake_run with a callable that signals when
invoked (e.g., set an asyncio.Event or use an AsyncMock) and await that signal
before cancelling the listener. Concretely: keep the existing events and
listener setup, but change bridge.run to a wrapper that sets an event (or
records a call via AsyncMock) when started and then yields the events; after
awaiting listener.ready, await the event with a timeout to ensure bridge.run
fired, then proceed to cancel the task and assert the event/AsyncMock was
triggered. Target symbols: test_user_message_triggers_bridge_run, bridge.run,
fake_run, GRPCSessionListener, listener._listen_loop.

In `@docs/internal/developer/local-development/openshift.md`:
- Around line 231-234: Add a language specifier to the fenced code block that
contains the lines starting with "unknown kid" — wrong token type (oc whoami -t
was used) and "invalid_grant" — RH SSO token expired by changing the opening
fence to include a language identifier (e.g., use ```text) so the block renders
and is accessible; ensure the closing fence remains present after those two
lines.

In `@test-e2e-control-plane.sh`:
- Around line 21-22: TIMEOUT_SECONDS and CHECK_INTERVAL are declared but not
used; update the loops that currently use hardcoded intervals (the loops
referencing 30 and 60 seconds) to use the TIMEOUT_SECONDS and CHECK_INTERVAL
variables (so replace the literal timeout and sleep interval values with
TIMEOUT_SECONDS and CHECK_INTERVAL), or remove the TIMEOUT_SECONDS and
CHECK_INTERVAL declarations if the hardcoded values are intentional; ensure you
update any related echo/log messages to reference the variables if you choose to
use them.
- Around line 224-230: The duration uses start_time which gets reassigned later
(in the runner pod wait loop), so capture the initial test start time into a
separate variable (e.g., TEST_START_TIME or ORIGINAL_START_TIME) immediately
when the script begins and do not modify it; keep using start_time for
short-lived step timers if needed. Update the final summary echo that computes
Duration to use this preserved variable (TEST_START_TIME/ORIGINAL_START_TIME)
instead of start_time so the reported duration reflects the entire test run;
ensure the new variable is exported or in scope for the summary block.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 82f26582-fb6c-464c-98f4-b995276d104c

📥 Commits

Reviewing files that changed from the base of the PR and between 0d78e86 and 36cdcbc.

⛔ Files ignored due to path filters (2)
  • components/ambient-control-plane/go.sum is excluded by !**/*.sum
  • components/runners/ambient-runner/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (68)
  • .github/workflows/ambient-control-plane-tests.yml
  • Makefile
  • REMOVE_CRDs.md
  • components/ambient-control-plane/.gitignore
  • components/ambient-control-plane/CLAUDE.md
  • components/ambient-control-plane/Dockerfile
  • components/ambient-control-plane/Dockerfile.simple
  • components/ambient-control-plane/Makefile
  • components/ambient-control-plane/README.md
  • components/ambient-control-plane/cmd/ambient-control-plane/main.go
  • components/ambient-control-plane/go.mod
  • components/ambient-control-plane/internal/config/config.go
  • components/ambient-control-plane/internal/informer/informer.go
  • components/ambient-control-plane/internal/kubeclient/kubeclient.go
  • components/ambient-control-plane/internal/kubeclient/kubeclient_test.go
  • components/ambient-control-plane/internal/reconciler/kube_reconciler.go
  • components/ambient-control-plane/internal/reconciler/project_reconciler.go
  • components/ambient-control-plane/internal/reconciler/project_settings_reconciler.go
  • components/ambient-control-plane/internal/reconciler/shared.go
  • components/ambient-control-plane/internal/reconciler/stress_test.go
  • components/ambient-control-plane/internal/reconciler/tally.go
  • components/ambient-control-plane/internal/reconciler/tally_reconciler.go
  • components/ambient-control-plane/internal/reconciler/tally_test.go
  • components/ambient-control-plane/internal/watcher/watcher.go
  • components/manifests/base/ambient-api-server-grpc-route.yml
  • components/manifests/base/ambient-control-plane-service.yml
  • components/manifests/base/core/ambient-api-server-service.yml
  • components/manifests/base/kustomization.yaml
  • components/manifests/base/platform/ambient-api-server-db.yml
  • components/manifests/base/rbac/control-plane-clusterrole.yaml
  • components/manifests/base/rbac/control-plane-clusterrolebinding.yaml
  • components/manifests/base/rbac/control-plane-sa.yaml
  • components/manifests/base/rbac/kustomization.yaml
  • components/manifests/components/ambient-api-server-db/ambient-api-server-db-json-patch.yaml
  • components/manifests/deploy
  • components/manifests/deploy-no-api-server.sh
  • components/manifests/deploy.sh
  • components/manifests/overlays/kind-local/control-plane-env-patch.yaml
  • components/manifests/overlays/kind-local/kustomization.yaml
  • components/manifests/overlays/kind/ambient-api-server-jwks-patch.yaml
  • components/manifests/overlays/kind/backend-ambient-api-patch.yaml
  • components/manifests/overlays/kind/control-plane-env-patch.yaml
  • components/manifests/overlays/kind/kustomization.yaml
  • components/manifests/overlays/kind/local-image-pull-policy-patch.yaml
  • components/manifests/overlays/local-dev/ambient-api-server-db-json-patch.yaml
  • components/manifests/overlays/production/control-plane-env-patch.yaml
  • components/manifests/overlays/production/control-plane-image-patch.yaml
  • components/manifests/overlays/production/kustomization.yaml
  • components/runners/ambient-runner/ambient_runner/_grpc_client.py
  • components/runners/ambient-runner/ambient_runner/_session_messages_api.py
  • components/runners/ambient-runner/ambient_runner/app.py
  • components/runners/ambient-runner/ambient_runner/bridge.py
  • components/runners/ambient-runner/ambient_runner/bridges/claude/grpc_transport.py
  • components/runners/ambient-runner/ambient_runner/bridges/claude/prompts.py
  • components/runners/ambient-runner/ambient_runner/endpoints/events.py
  • components/runners/ambient-runner/ambient_runner/platform/prompts.py
  • components/runners/ambient-runner/architecture.md
  • components/runners/ambient-runner/pyproject.toml
  • components/runners/ambient-runner/tests/test_app_initial_prompt.py
  • components/runners/ambient-runner/tests/test_bridge_claude.py
  • components/runners/ambient-runner/tests/test_events_endpoint.py
  • components/runners/ambient-runner/tests/test_grpc_transport.py
  • docs/internal/design/agent-api.md
  • docs/internal/design/blackboard-api.md
  • docs/internal/developer/agent-workflow.md
  • docs/internal/developer/local-development/openshift.md
  • e2e/scripts/load-images.sh
  • test-e2e-control-plane.sh
💤 Files with no reviewable changes (2)
  • components/manifests/overlays/local-dev/ambient-api-server-db-json-patch.yaml
  • components/manifests/overlays/kind/ambient-api-server-jwks-patch.yaml

Comment on lines +171 to +178
func installServiceCAIntoDefaultTransport(pool *x509.CertPool) {
	http.DefaultTransport = &http.Transport{
		TLSClientConfig: &tls.Config{
			MinVersion: tls.VersionTLS12,
			RootCAs:    pool,
		},
	}
}

⚠️ Potential issue | 🟠 Major

Replacing DefaultTransport loses default connection settings.

Creating a new http.Transport with only TLSClientConfig discards the default transport's tuned settings (timeouts, idle connection limits, keep-alive configuration). This may degrade HTTP client performance.

Proposed fix — clone default transport and update TLS config
 func installServiceCAIntoDefaultTransport(pool *x509.CertPool) {
-	http.DefaultTransport = &http.Transport{
-		TLSClientConfig: &tls.Config{
-			MinVersion: tls.VersionTLS12,
-			RootCAs:    pool,
-		},
-	}
+	// Clone the default transport to preserve connection pooling/timeout settings
+	defaultTransport := http.DefaultTransport.(*http.Transport).Clone()
+	defaultTransport.TLSClientConfig = &tls.Config{
+		MinVersion: tls.VersionTLS12,
+		RootCAs:    pool,
+	}
+	http.DefaultTransport = defaultTransport
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/cmd/ambient-control-plane/main.go` around
lines 171 - 178, installServiceCAIntoDefaultTransport currently replaces
http.DefaultTransport with a new http.Transport losing tuned defaults; instead
obtain the existing transport (assert http.DefaultTransport to *http.Transport),
clone it (use the Transport.Clone method or shallow copy), set/merge its
TLSClientConfig to ensure MinVersion = tls.VersionTLS12 and RootCAs = pool
(preserving other TLS settings if present), then assign the cloned transport
back to http.DefaultTransport so all original connection/timeouts/keep-alive
settings are preserved.

github.com/ambient-code/platform/components/ambient-api-server v0.0.0
github.com/ambient-code/platform/components/ambient-sdk/go-sdk v0.0.0
github.com/rs/zerolog v1.34.0
google.golang.org/grpc v1.79.1

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Current grpc dependency:"
rg -n 'google\.golang\.org/grpc' components/ambient-control-plane/go.mod

echo
echo "Fetching GHSA advisory details from OSV:"
curl -s https://api.osv.dev/v1/vulns/GHSA-p77j-4mvh-x3m3 | jq '{id,summary,aliases,affected}'

echo
echo "After choosing a patched version, verify go.mod no longer pins the affected version."

Repository: ambient-code/platform

Length of output: 1017


Upgrade google.golang.org/grpc from v1.79.1 to v1.79.3 or later.

Line 11 pins google.golang.org/grpc v1.79.1, which contains GHSA-p77j-4mvh-x3m3 (CVE-2026-33186)—an authorization bypass vulnerability via missing leading slash in the :path header. Bump to v1.79.3 or later and regenerate go.sum.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/go.mod` at line 11, Update the pinned
dependency google.golang.org/grpc from v1.79.1 to v1.79.3 (or later) in the
go.mod and then regenerate the module checksum file (go.sum) so the updated
release that fixes CVE-2026-33186 is recorded; locate the dependency entry for
google.golang.org/grpc in go.mod, change the version to v1.79.3+, and run your
project's dependency resolution (e.g., module update/tidy) to refresh go.sum.

Comment on lines +400 to +405
func (inf *Informer) dispatchBlocking(ctx context.Context, event ResourceEvent) {
	select {
	case inf.eventCh <- event:
	case <-ctx.Done():
	}
}

⚠️ Potential issue | 🟠 Major

Blocking dispatch can stall watch streams under load.

When eventCh is full (256 events), dispatchBlocking blocks the calling watch handler. Since watch handlers are called synchronously by watchManager, this backpressure can stall the gRPC watch stream and potentially cause disconnects or missed events during bursts.

Consider: (1) increasing the buffer based on expected load, (2) adding a non-blocking send with a warning log when the channel is full, or (3) dropping events with metrics to detect overload.

Example: non-blocking with warning
 func (inf *Informer) dispatchBlocking(ctx context.Context, event ResourceEvent) {
 	select {
 	case inf.eventCh <- event:
+	default:
+		inf.logger.Warn().
+			Str("resource", event.Resource).
+			Str("event_type", string(event.Type)).
+			Str("id", event.Object.GetID()).
+			Msg("event channel full, dropping event")
 	case <-ctx.Done():
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/internal/informer/informer.go` around lines
400 - 405, The current Informer.dispatchBlocking blocks when inf.eventCh is
full, which can stall synchronous watch handlers in watchManager and harm gRPC
watch streams; change dispatchBlocking in informer.go to avoid blocking: replace
the blocking send on inf.eventCh with a non-blocking select that attempts to
send, and on channel-full fallback log a warning (use processLogger or the
component logger) and increment a new drop/overload metric to detect overloads;
alternatively expose/configure the buffer size for inf.eventCh (increase from
256) if expected load requires it, but prefer the non-blocking send +
warning/metric approach inside Informer.dispatchBlocking to prevent stalling
watch handlers.

Comment on lines +93 to +97
home, _ := os.UserHomeDir()
localPath := home + "/.kube/config"
if _, err := os.Stat(localPath); err == nil {
	return clientcmd.BuildConfigFromFlags("", localPath)
}

🧹 Nitpick | 🔵 Trivial

Use filepath.Join for path construction.

String concatenation for paths can cause issues on non-Unix systems. Also, the error from os.UserHomeDir() is ignored—if it fails, home will be empty, leading to checking "/.kube/config".

Suggested fix
+import "path/filepath"
+
 func buildRestConfig(kubeconfig string) (*rest.Config, error) {
 	if kubeconfig != "" {
 		return clientcmd.BuildConfigFromFlags("", kubeconfig)
 	}
 
-	home, _ := os.UserHomeDir()
-	localPath := home + "/.kube/config"
-	if _, err := os.Stat(localPath); err == nil {
-		return clientcmd.BuildConfigFromFlags("", localPath)
+	if home, err := os.UserHomeDir(); err == nil {
+		localPath := filepath.Join(home, ".kube", "config")
+		if _, err := os.Stat(localPath); err == nil {
+			return clientcmd.BuildConfigFromFlags("", localPath)
+		}
 	}
 
 	return rest.InClusterConfig()
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/internal/kubeclient/kubeclient.go` around
lines 93 - 97, The code currently ignores the error from os.UserHomeDir() and
builds localPath via string concatenation; update the logic in the kube client
setup so you check and handle the error returned by os.UserHomeDir(), and
construct the kube config path using filepath.Join(home, ".kube", "config")
instead of home + "/.kube/config"; then use os.Stat(localPath) as before and
pass the resulting localPath to clientcmd.BuildConfigFromFlags("", localPath)
only when the home lookup succeeded and the file exists.

Comment on lines +344 to +346
saName := serviceAccountName(session.ID)
secretName := secretName(session.ID)


🛠️ Refactor suggestion | 🟠 Major

Variable secretName shadows the function of the same name.

secretName := secretName(session.ID) shadows the helper function, making the code confusing and error-prone for future modifications.

Proposed fix
 	saName := serviceAccountName(session.ID)
-	secretName := secretName(session.ID)
+	credSecretName := secretName(session.ID)
 
 	// ... later in buildEnv call:
-	"env":          r.buildEnv(session, secretName),
+	"env":          r.buildEnv(session, credSecretName),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`
around lines 344 - 346, The local variable currently declared as "secretName :=
secretName(session.ID)" shadows the helper function "secretName", so rename the
variable to avoid shadowing (e.g., "secretNameVal" or "saSecretName") and update
all subsequent references in this scope; keep the call to the helper function
"secretName(session.ID)" unchanged but assign its result to the new variable
name alongside the existing "saName := serviceAccountName(session.ID)"
assignment.

Comment on lines +66 to +79
def test_registers_queue_before_streaming(self):
    active_streams = {}
    bridge = _make_bridge(active_streams=active_streams)
    client = _make_client(bridge)

    # Pre-fill queue with a terminal event so the stream closes immediately
    q: asyncio.Queue = asyncio.Queue(maxsize=100)
    q.put_nowait(make_run_finished())

    # The endpoint creates its own queue on connect — verify it's registered
    # by inspecting active_streams after the response body is fully consumed
    with client.stream("GET", "/events/t-1") as resp:
        assert resp.status_code == 200
        resp.read()

⚠️ Potential issue | 🟡 Minor

Unused queue in test may not be testing intended behavior.

The queue q created on line 72-73 is never added to active_streams, so the endpoint will create its own queue. This test passes trivially because the stream blocks waiting for events. Consider either removing the unused queue or verifying the actual registration behavior.

🔧 Proposed fix to properly test queue registration
     def test_registers_queue_before_streaming(self):
         active_streams = {}
         bridge = _make_bridge(active_streams=active_streams)
         client = _make_client(bridge)

-        # Pre-fill queue with a terminal event so the stream closes immediately
-        q: asyncio.Queue = asyncio.Queue(maxsize=100)
-        q.put_nowait(make_run_finished())
-
-        # The endpoint creates its own queue on connect — verify it's registered
-        # by inspecting active_streams after the response body is fully consumed
-        with client.stream("GET", "/events/t-1") as resp:
-            assert resp.status_code == 200
-            resp.read()
+        # Verify the endpoint registers a queue in active_streams
+        # We need to inject a terminal event after the queue is registered
+        import threading
+        def inject_terminal():
+            import time
+            time.sleep(0.1)  # Wait for queue to be registered
+            if "t-1" in active_streams:
+                active_streams["t-1"].put_nowait(make_run_finished())
+        
+        t = threading.Thread(target=inject_terminal)
+        t.start()
+        with client.stream("GET", "/events/t-1") as resp:
+            assert resp.status_code == 200
+            body = resp.read()
+        t.join()
+        assert "t-1" not in active_streams  # Cleaned up after stream closes
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/runners/ambient-runner/tests/test_events_endpoint.py` around lines
66 - 79, The test creates an unused Queue (q) that isn't added to
active_streams, so either remove that unused q or actually assert the endpoint
registered its own queue; update test_registers_queue_before_streaming to delete
the q creation (q = asyncio.Queue / q.put_nowait...) and after consuming the
response body (resp.read()) assert that 't-1' is present in active_streams and
that active_streams['t-1'] is an asyncio.Queue (or alternatively move creation
of q before calling _make_bridge and assign active_streams['t-1'] = q if you
intend to pre-register the queue); reference
test_registers_queue_before_streaming, active_streams, _make_bridge,
client.stream, and resp.read.

Comment on lines +152 to +181
async def test_user_message_triggers_bridge_run(self):
    payload = _make_runner_payload(thread_id="t-1", run_id="r-1")
    msgs = [_make_session_message("user", payload, seq=1)]
    client = _make_grpc_client(messages=msgs)
    bridge = _make_bridge()

    events = [make_text_start(), make_text_content(), make_run_finished()]

    async def fake_run(input_data):
        for e in events:
            yield e

    bridge.run = fake_run
    bridge._active_streams = {}

    listener = GRPCSessionListener(
        bridge=bridge, session_id="s-1", grpc_url="localhost:9000"
    )
    listener._grpc_client = client

    task = asyncio.create_task(listener._listen_loop())
    try:
        await asyncio.wait_for(listener.ready.wait(), timeout=2.0)
        await asyncio.sleep(0.3)
    finally:
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass

⚠️ Potential issue | 🟠 Major

Assert that bridge.run() actually fires in this test.

This case only waits for 300ms and never verifies the user message triggered bridge.run(). If the listener drops user events or exits early, the test still passes.

Proposed fix
         bridge = _make_bridge()
 
         events = [make_text_start(), make_text_content(), make_run_finished()]
+        ran = asyncio.Event()
 
         async def fake_run(input_data):
+            ran.set()
             for e in events:
                 yield e
 
         bridge.run = fake_run
@@
         task = asyncio.create_task(listener._listen_loop())
         try:
             await asyncio.wait_for(listener.ready.wait(), timeout=2.0)
-            await asyncio.sleep(0.3)
+            await asyncio.wait_for(ran.wait(), timeout=1.0)
+            assert ran.is_set()
         finally:
             task.cancel()
             try:
                 await task
🧰 Tools
🪛 Ruff (0.15.6)

[warning] 160-160: Missing return type annotation for private function fake_run

(ANN202)


[warning] 160-160: Unused function argument: input_data

(ARG001)


[warning] 178-181: Use contextlib.suppress(asyncio.CancelledError) instead of try-except-pass

Replace try-except-pass with with contextlib.suppress(asyncio.CancelledError): ...

(SIM105)
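
The SIM105 pattern Ruff suggests looks like this in isolation; the `cancel_and_wait` helper is illustrative, not part of the test file:

```python
import asyncio
import contextlib

async def cancel_and_wait(task: asyncio.Task) -> None:
    """Cancel a task and wait for it, replacing try/except/pass."""
    task.cancel()
    # contextlib.suppress swallows only the expected CancelledError;
    # any other exception from the task still propagates.
    with contextlib.suppress(asyncio.CancelledError):
        await task
```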

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/runners/ambient-runner/tests/test_grpc_transport.py` around lines
152 - 181, The test test_user_message_triggers_bridge_run currently never
verifies that bridge.run() was invoked; update it to assert the call by
replacing the simple generator fake_run with a callable that signals when
invoked (e.g., set an asyncio.Event or use an AsyncMock) and await that signal
before cancelling the listener. Concretely: keep the existing events and
listener setup, but change bridge.run to a wrapper that sets an event (or
records a call via AsyncMock) when started and then yields the events; after
awaiting listener.ready, await the event with a timeout to ensure bridge.run
fired, then proceed to cancel the task and assert the event/AsyncMock was
triggered. Target symbols: test_user_message_triggers_bridge_run, bridge.run,
fake_run, GRPCSessionListener, listener._listen_loop.

Comment on lines +21 to +22
TIMEOUT_SECONDS=300 # 5 minutes
CHECK_INTERVAL=5 # Check every 5 seconds

⚠️ Potential issue | 🟡 Minor

Unused configuration variables.

TIMEOUT_SECONDS and CHECK_INTERVAL are defined but never used. The loops on lines 90 and 122 use hardcoded values (30, 60) instead. Either use these variables or remove them.

🔧 Proposed fix to use the configuration variables
-while [[ $(($(date +%s) - start_time)) -lt 30 ]]; do
+while [[ $(($(date +%s) - start_time)) -lt $TIMEOUT_SECONDS ]]; do
-while [[ $(($(date +%s) - start_time)) -lt 60 ]]; do
+while [[ $(($(date +%s) - start_time)) -lt $TIMEOUT_SECONDS ]]; do

Or remove the unused variables if hardcoded values are intentional.

🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 21-21: TIMEOUT_SECONDS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 22-22: CHECK_INTERVAL appears unused. Verify use (or export if used externally).

(SC2034)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test-e2e-control-plane.sh` around lines 21 - 22, TIMEOUT_SECONDS and
CHECK_INTERVAL are declared but not used; update the loops that currently use
hardcoded intervals (the loops referencing 30 and 60 seconds) to use the
TIMEOUT_SECONDS and CHECK_INTERVAL variables (so replace the literal timeout and
sleep interval values with TIMEOUT_SECONDS and CHECK_INTERVAL), or remove the
TIMEOUT_SECONDS and CHECK_INTERVAL declarations if the hardcoded values are
intentional; ensure you update any related echo/log messages to reference the
variables if you choose to use them.

Comment on lines +224 to +230
echo ""
echo -e "${BLUE}📊 Test Summary:${NC}"
echo -e " • Duration: $(($(date +%s) - start_time)) seconds"
echo -e " • Session Name: ${SESSION_NAME}"
echo -e " • Session ID: ${SESSION_ID}"
echo -e " • CR Name: ${CR_NAME}"
echo ""

⚠️ Potential issue | 🟡 Minor

Duration calculation uses wrong start_time.

start_time is reassigned on line 119 for the runner pod wait loop, so the duration reported here reflects only the time since step 4, not the total test duration. Capture the initial start time in a separate variable.

🔧 Proposed fix
+TEST_START_TIME=$(date +%s)
 echo -e "${BLUE}🚀 Starting End-to-End Control Plane Test${NC}"

Then on line 226:

-echo -e "   • Duration: $(($(date +%s) - start_time)) seconds"
+echo -e "   • Duration: $(($(date +%s) - TEST_START_TIME)) seconds"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test-e2e-control-plane.sh` around lines 224 - 230, The duration uses
start_time which gets reassigned later (in the runner pod wait loop), so capture
the initial test start time into a separate variable (e.g., TEST_START_TIME or
ORIGINAL_START_TIME) immediately when the script begins and do not modify it;
keep using start_time for short-lived step timers if needed. Update the final
summary echo that computes Duration to use this preserved variable
(TEST_START_TIME/ORIGINAL_START_TIME) instead of start_time so the reported
duration reflects the entire test run; ensure the new variable is exported or in
scope for the summary block.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/components-build-deploy.yml (1)

307-361: ⚠️ Potential issue | 🟠 Major

Missing ambient-control-plane in image tag determination and kustomization update.

The deploy-to-openshift job handles image tags for all other components but omits ambient-control-plane:

  • "Determine image tags" step (lines 310-350) doesn't set a control_plane_tag output
  • "Update kustomization with image tags" step (lines 354-361) doesn't call kustomize edit set image for the control-plane image

As a result, when ambient-control-plane is rebuilt and pushed, the deployment will still use whatever tag is in the manifest (likely latest or stage), not the newly built SHA.

Proposed fix

Add to "Determine image tags" step:

+          if [ "${{ needs.detect-changes.outputs.ambient-control-plane }}" == "true" ]; then
+            echo "control_plane_tag=${{ github.sha }}" >> $GITHUB_OUTPUT
+          else
+            echo "control_plane_tag=stage" >> $GITHUB_OUTPUT
+          fi

Add to "Update kustomization with image tags" step:

+          kustomize edit set image quay.io/ambient_code/ambient_control_plane:latest=quay.io/ambient_code/ambient_control_plane:${{ steps.image-tags.outputs.control_plane_tag }}

Note: Use ambient_control_plane to match the manifest declarations (or fix the naming mismatch first).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/components-build-deploy.yml around lines 307 - 361, The
Determine image tags step (image-tags) is missing a control_plane_tag output and
the Update kustomization step doesn't set the control-plane image; add logic in
the image-tags run to set control_plane_tag to either ${{ github.sha }} when
needs.detect-changes.outputs.ambient-control-plane is "true" or to the
appropriate default (stage or latest) otherwise, and add a corresponding
kustomize edit set image invocation in the Update kustomization step to map the
manifest image (ambient_control_plane or the exact image name used in overlays)
to quay.io/ambient_code/ambient_control_plane:${{
steps.image-tags.outputs.control_plane_tag }} so the overlay uses the newly
determined tag.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/components-build-deploy.yml:
- Around line 100-101: The workflow's component matrix uses the image name
quay.io/ambient_code/vteam_control_plane which doesn't match the kustomize
manifests that reference quay.io/ambient_code/ambient_control_plane; update the
matrix entry in .github/workflows/components-build-deploy.yml to use
quay.io/ambient_code/ambient_control_plane so `kustomize edit set image` will
match, then re-run or validate that kustomization files (referenced by symbols
like the kustomization.yaml in components/manifests/base and overlays/production
and ambient-control-plane-service.yml) are already consistent; alternatively, if
you prefer the vteam_ name, update all manifest image references to
quay.io/ambient_code/vteam_control_plane so names are consistent across the
matrix and manifests.

In @.github/workflows/prod-release-deploy.yaml:
- Around line 160-161: The component image name in the matrix/comp_image loop
("quay.io/ambient_code/vteam_control_plane") does not match the kustomization
manifests ("quay.io/ambient_code/ambient_control_plane"), so the kustomize edit
set image replacement (used by the release/tag substitution) will not find the
entry; update the image reference used by the matrix and comp_image loop to
exactly match the manifest image name (or conversely update the manifests to use
"vteam_control_plane") so the image string is identical across the component
matrix, the comp_image loop, and the kustomization manifests referenced by the
kustomize edit set image command.

In `@components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py`:
- Around line 244-245: Wrap the call to self._grpc_listener.stop() in a
try/except so an exception from stop() is caught and logged, and move the
session manager and observability cleanup calls into a finally block (or ensure
they run after the except) so that self._session_manager.shutdown() and
self._obs.finalize() always run even if self._grpc_listener.stop()
raises; make sure to reference the existing symbols self._grpc_listener.stop(),
self._session_manager.shutdown(), and self._obs.finalize(), and log the
stop() exception before proceeding with the cleanup.

In `@Makefile`:
- Line 68: The Makefile variable CONTROL_PLANE_IMAGE currently uses the image
name literal "vteam_control_plane" which mismatches the manifest-declared
"ambient_control_plane"; update the CONTROL_PLANE_IMAGE assignment to use the
same image base as manifests (e.g., replace "vteam_control_plane" with
"ambient_control_plane") while preserving IMAGE_TAG (i.e., keep the format
CONTROL_PLANE_IMAGE ?= ambient_control_plane:$(IMAGE_TAG)) so all references to
CONTROL_PLANE_IMAGE and IMAGE_TAG remain consistent with the manifests.
- Around line 293-305: The production deploy re-applies a manifest-declared
kubernetes.io/service-account-token Secret (ambient-control-plane-token from
base/rbac/control-plane-sa.yaml) before you immediately delete/recreate it with
the OCM token, which can cause the recreated token to be overwritten if someone
runs kubectl kustomize/apply separately; update the Makefile or manifests to
avoid the race by either (A) adding a clear comment above the kubectl kustomize
&& oc create secret block describing the dependency on the subsequent
delete/create of ambient-control-plane-token so operators know the behavior, or
(B) remove/exclude the ambient-control-plane-token Secret from the kustomize
base when using this deploy path (or add a kustomize patch/overlay to prevent
the Secret from being managed) and ensure the Makefile still creates the secret
via oc create --from-literal=token="$(ocm token)"; reference the
ambient-control-plane-token name, the kubectl kustomize call for
components/manifests/overlays/production/, and base/rbac/control-plane-sa.yaml
when making the change.

---

Outside diff comments:
In @.github/workflows/components-build-deploy.yml:
- Around line 307-361: The Determine image tags step (image-tags) is missing a
control_plane_tag output and the Update kustomization step doesn't set the
control-plane image; add logic in the image-tags run to set control_plane_tag to
either ${{ github.sha }} when needs.detect-changes.outputs.ambient-control-plane
is "true" or to the appropriate default (stage or latest) otherwise, and add a
corresponding kustomize edit set image invocation in the Update kustomization
step to map the manifest image (ambient_control_plane or the exact image name
used in overlays) to quay.io/ambient_code/ambient_control_plane:${{
steps.image-tags.outputs.control_plane_tag }} so the overlay uses the newly
determined tag.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0e8d75b7-4b30-4219-a89e-6805a65316f9

📥 Commits

Reviewing files that changed from the base of the PR and between 36cdcbc and 30b1f4a.

📒 Files selected for processing (4)
  • .github/workflows/components-build-deploy.yml
  • .github/workflows/prod-release-deploy.yaml
  • Makefile
  • components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py

Comment on lines +244 to +245
if self._grpc_listener is not None:
    await self._grpc_listener.stop()

⚠️ Potential issue | 🟠 Major

Guard shutdown against partial failure.

At Line 245, if await self._grpc_listener.stop() raises, session manager and observability cleanup are skipped, which can leak resources and leave shutdown incomplete.

Proposed fix
 async def shutdown(self) -> None:
     """Graceful shutdown: persist sessions, finalise tracing."""
-    if self._grpc_listener is not None:
-        await self._grpc_listener.stop()
-    if self._session_manager:
-        await self._session_manager.shutdown()
-    if self._obs:
-        await self._obs.finalize()
+    try:
+        if self._grpc_listener is not None:
+            await self._grpc_listener.stop()
+            self._grpc_listener = None
+    finally:
+        if self._session_manager:
+            await self._session_manager.shutdown()
+        if self._obs:
+            await self._obs.finalize()
     logger.info("ClaudeBridge: shutdown complete")

As per coding guidelines "-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (replaces lines 244-245):

    async def shutdown(self) -> None:
        """Graceful shutdown: persist sessions, finalise tracing."""
        try:
            if self._grpc_listener is not None:
                await self._grpc_listener.stop()
                self._grpc_listener = None
        finally:
            if self._session_manager:
                await self._session_manager.shutdown()
            if self._obs:
                await self._obs.finalize()
        logger.info("ClaudeBridge: shutdown complete")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py`
around lines 244 - 245, Wrap the call to self._grpc_listener.stop() in a
try/except so an exception from stop() is caught and logged, and move the
session manager and observability cleanup calls into a finally block (or ensure
they run after the except) so that self._session_manager.shutdown() and
self._obs.finalize() always run even if self._grpc_listener.stop()
raises; make sure to reference the existing symbols self._grpc_listener.stop(),
self._session_manager.shutdown(), and self._obs.finalize(), and log the
stop() exception before proceeding with the cleanup.

STATE_SYNC_IMAGE ?= vteam_state_sync:$(IMAGE_TAG)
PUBLIC_API_IMAGE ?= vteam_public_api:$(IMAGE_TAG)
API_SERVER_IMAGE ?= vteam_api_server:$(IMAGE_TAG)
CONTROL_PLANE_IMAGE ?= vteam_control_plane:$(IMAGE_TAG)

⚠️ Potential issue | 🔴 Critical

Image name should match manifest declarations.

Same naming issue as the workflow files: vteam_control_plane here vs ambient_control_plane in manifests. Ensure consistency across all files when fixing the naming mismatch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` at line 68, The Makefile variable CONTROL_PLANE_IMAGE currently
uses the image name literal "vteam_control_plane" which mismatches the
manifest-declared "ambient_control_plane"; update the CONTROL_PLANE_IMAGE
assignment to use the same image base as manifests (e.g., replace
"vteam_control_plane" with "ambient_control_plane") while preserving IMAGE_TAG
(i.e., keep the format CONTROL_PLANE_IMAGE ?=
ambient_control_plane:$(IMAGE_TAG)) so all references to CONTROL_PLANE_IMAGE and
IMAGE_TAG remain consistent with the manifests.

@markturansky markturansky force-pushed the feat/cp-grpc-runner-v1 branch from 30b1f4a to 94e5980 Compare March 24, 2026 19:44
@markturansky
Contributor Author

/retest

1 similar comment
@markturansky
Contributor Author

/retest

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 27

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
components/manifests/components/ambient-api-server-db/ambient-api-server-db-json-patch.yaml (1)

8-30: 🛠️ Refactor suggestion | 🟠 Major

Remove the redundant list replacements from this overlay.

These env and volumeMounts values now match the base manifest exactly. Keeping whole-array replace ops here means any future base env var or mount will be silently discarded in this overlay.

🧹 Suggested cleanup
 - op: replace
   path: /spec/template/spec/containers/0/image
   value: registry.redhat.io/rhel10/postgresql-16:10.1
-- op: replace
-  path: /spec/template/spec/containers/0/env
-  value:
-  - name: POSTGRESQL_USER
-    valueFrom:
-      secretKeyRef:
-        name: ambient-api-server-db
-        key: db.user
-  - name: POSTGRESQL_PASSWORD
-    valueFrom:
-      secretKeyRef:
-        name: ambient-api-server-db
-        key: db.password
-  - name: POSTGRESQL_DATABASE
-    valueFrom:
-      secretKeyRef:
-        name: ambient-api-server-db
-        key: db.name
-- op: replace
-  path: /spec/template/spec/containers/0/volumeMounts
-  value:
-  - name: ambient-api-server-db-data
-    mountPath: /var/lib/pgsql/data
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/manifests/components/ambient-api-server-db/ambient-api-server-db-json-patch.yaml`
around lines 8 - 30, Remove the two redundant whole-array JSON Patch replace ops
that overwrite the container 0 env and volumeMounts in
ambient-api-server-db-json-patch.yaml: the replace at path
/spec/template/spec/containers/0/env (which replaces env entries like
POSTGRESQL_USER, POSTGRESQL_PASSWORD, POSTGRESQL_DATABASE) and the replace at
path /spec/template/spec/containers/0/volumeMounts (which replaces the mount for
name ambient-api-server-db-data). Instead delete these two ops from the patch so
the overlay inherits the base manifest's env and volumeMounts and future base
additions are preserved.
.github/workflows/components-build-deploy.yml (1)

354-363: ⚠️ Potential issue | 🟠 Major

Missing kustomize edit set image for ambient-control-plane in deploy step.

The deploy-to-openshift job updates image tags for all other components but omits ambient-control-plane. When control-plane changes are pushed, the deployment will use whatever tag is in the base kustomization rather than the newly built SHA.

🐛 Proposed fix — add control-plane tag handling

Add to the "Determine image tags" step (after line 352):

if [ "${{ needs.detect-changes.outputs.ambient-control-plane }}" == "true" ]; then
  echo "control_plane_tag=${{ github.sha }}" >> $GITHUB_OUTPUT
else
  echo "control_plane_tag=stage" >> $GITHUB_OUTPUT
fi

Add to the "Update kustomization with image tags" step (after line 363):

kustomize edit set image quay.io/ambient_code/ambient_control_plane:latest=quay.io/ambient_code/ambient_control_plane:${{ steps.image-tags.outputs.control_plane_tag }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/components-build-deploy.yml around lines 354 - 363, Add
handling for ambient-control-plane by emitting a control_plane_tag in the
"Determine image tags" step (set control_plane_tag to github.sha when
needs.detect-changes.outputs.ambient-control-plane == "true", otherwise set it
to "stage") and then include a kustomize edit set image command in the "Update
kustomization with image tags" step to update
quay.io/ambient_code/ambient_control_plane from :latest to :${{
steps.image-tags.outputs.control_plane_tag }} so the control-plane image uses
the newly built SHA when changed; reference the existing image-tags output
variable name control_plane_tag and the kustomize edit set image pattern used
for other components.
♻️ Duplicate comments (16)
components/ambient-control-plane/go.mod (1)

11-11: ⚠️ Potential issue | 🔴 Critical

Upgrade google.golang.org/grpc to a patched release.

google.golang.org/grpc v1.79.1 is still the vulnerable pin already called out earlier for GHSA-p77j-4mvh-x3m3, so the new control-plane module would ship with that authorization-bypass issue unresolved.

Verification: confirm the pinned version in go.mod and compare it with the advisory’s affected range. Expected result: the module should be updated to a non-vulnerable release before merge.

#!/bin/bash
set -euo pipefail

echo "Pinned grpc version:"
rg -n 'google\.golang\.org/grpc' components/ambient-control-plane/go.mod

echo
python - <<'PY'
import json, urllib.request

with urllib.request.urlopen("https://api.osv.dev/v1/vulns/GHSA-p77j-4mvh-x3m3") as r:
    data = json.load(r)

print("advisory:", data["id"])
for affected in data.get("affected", []):
    pkg = affected.get("package", {})
    if pkg.get("name") == "google.golang.org/grpc":
        print("package:", pkg["name"])
        for rng in affected.get("ranges", []):
            print("range:", rng)
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/go.mod` at line 11, The go.mod entry pins
google.golang.org/grpc at the vulnerable v1.79.1; update that dependency in
components/ambient-control-plane/go.mod to a non-vulnerable release (e.g.,
v1.79.2 or any release that the GHSA-p77j-4mvh-x3m3 advisory marks as patched),
then run go mod tidy (or go get google.golang.org/grpc@<patched-version>) to
update go.sum; after updating, verify the pinned version no longer falls in the
advisory’s affected range for GHSA-p77j-4mvh-x3m3.
components/manifests/deploy (1)

82-109: ⚠️ Potential issue | 🟠 Major

Stop writing the OAuth client secret to a fixed file and echoing it on failure.

This still writes the plaintext CLIENT_SECRET_VALUE to /tmp/ambient-frontend-oauthclient.yaml and prints it again in the fallback block. That exposes a reusable credential to other local users and CI logs.

Suggested change
-    cat > /tmp/ambient-frontend-oauthclient.yaml <<EOF
+    OAUTH_TMPFILE=$(mktemp)
+    chmod 600 "$OAUTH_TMPFILE"
+    cat > "$OAUTH_TMPFILE" <<EOF
 apiVersion: oauth.openshift.io/v1
 kind: OAuthClient
 metadata:
   name: ambient-frontend
 secret: ${CLIENT_SECRET_VALUE}
 redirectURIs:
 - https://${ROUTE_HOST}/oauth/callback
 grantMethod: auto
 EOF
     set +e
-    oc apply -f /tmp/ambient-frontend-oauthclient.yaml
+    oc apply -f "$OAUTH_TMPFILE"
     OAUTH_APPLY_RC=$?
     set -e
-    rm -f /tmp/ambient-frontend-oauthclient.yaml
+    rm -f "$OAUTH_TMPFILE"
@@
-        echo "secret: ${CLIENT_SECRET_VALUE}"
+        echo "secret: [REDACTED]"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/manifests/deploy` around lines 82 - 109, Do not persist or echo
the plaintext CLIENT_SECRET_VALUE to /tmp/ambient-frontend-oauthclient.yaml or
stdout; instead feed the OAuthClient YAML to oc apply via stdin (oc apply -f -)
without creating a temp file so the secret never lands on disk, and remove any
echo that prints CLIENT_SECRET_VALUE in the fallback/error block. Update the
logic around OAUTH_APPLY_RC to build the YAML on-the-fly using
CLIENT_SECRET_VALUE and ROUTE_HOST and pipe it to oc apply -f -, and in the
failure branch print a redacted/admin-instruction that tells the admin how to
apply the OAuthClient without showing the secret (omit the secret value or
instruct them to set CLIENT_SECRET_VALUE in their environment) rather than
echoing the secret.
components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py (1)

244-249: ⚠️ Potential issue | 🟠 Major

Protect shutdown cleanup from gRPC stop failures.

At Line 245, an exception from await self._grpc_listener.stop() prevents self._session_manager.shutdown() and self._obs.finalize() from running.

Proposed fix
 async def shutdown(self) -> None:
     """Graceful shutdown: persist sessions, finalise tracing."""
-    if self._grpc_listener is not None:
-        await self._grpc_listener.stop()
-    if self._session_manager:
-        await self._session_manager.shutdown()
-    if self._obs:
-        await self._obs.finalize()
+    try:
+        if self._grpc_listener is not None:
+            await self._grpc_listener.stop()
+            self._grpc_listener = None
+    except Exception:
+        logger.exception("ClaudeBridge: failed to stop gRPC listener during shutdown")
+    finally:
+        if self._session_manager:
+            await self._session_manager.shutdown()
+        if self._obs:
+            await self._obs.finalize()
     logger.info("ClaudeBridge: shutdown complete")
As per coding guidelines "-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py`
around lines 244 - 249, The call to await self._grpc_listener.stop() can raise
and currently short-circuits the remaining cleanup; wrap the await
self._grpc_listener.stop() call in a try/except that catches Exception, logs the
error (including the exception) and does not re-raise so that subsequent cleanup
calls await self._session_manager.shutdown() and await self._obs.finalize()
still run; ensure you reference self._grpc_listener.stop(),
self._session_manager.shutdown(), and self._obs.finalize() in the fix so the
cleanup order and logging are clear.
components/runners/ambient-runner/ambient_runner/app.py (2)

356-403: ⚠️ Potential issue | 🟠 Major

gRPC client may leak on exception.

If an exception occurs after AmbientGRPCClient.from_env() but before client.close(), the connection leaks. Use try/finally to ensure cleanup.

 async def _push_initial_prompt_via_grpc(prompt: str, session_id: str) -> None:
     """Push INITIAL_PROMPT as a PushSessionMessage so it is durable in DB."""
+    client = None
     try:
         from ambient_runner._grpc_client import AmbientGRPCClient

         client = AmbientGRPCClient.from_env()
         # ... existing logic ...
-        client.close()
     except Exception as exc:
         logger.error(
             "INITIAL_PROMPT gRPC push failed: session=%s error=%s",
             session_id,
             exc,
             exc_info=True,
         )
+    finally:
+        if client is not None:
+            client.close()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/runners/ambient-runner/ambient_runner/app.py` around lines 356 -
403, In _push_initial_prompt_via_grpc the AmbientGRPCClient created via
AmbientGRPCClient.from_env() can leak if an exception occurs before
client.close(); wrap the client usage in a try/finally (or use a context manager
if AmbientGRPCClient supports __enter__/__exit__) so client.close() is always
called, ensuring you only call close when the client variable was successfully
assigned; update the function to create the client, run the payload push logic,
and then call client.close() in finally to guarantee cleanup.
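The close-on-any-path guarantee the comment asks for can also be expressed once with `contextlib.closing`; the sketch below uses a stand-in class rather than the real AmbientGRPCClient, assuming only that the client exposes a `close()` method:

```python
from contextlib import closing


class FakeClient:
    """Stand-in for the gRPC client: anything exposing close()."""

    def __init__(self):
        self.closed = False

    def push(self, payload):
        # Simulate a failing RPC for a bad payload.
        if payload is None:
            raise ValueError("bad payload")

    def close(self):
        self.closed = True


def push_prompt(client_factory, payload):
    # closing() guarantees close() runs whether push() succeeds or raises,
    # and never calls close() on a client that was never constructed.
    with closing(client_factory()) as client:
        client.push(payload)
        return client
```

This is equivalent to the try/finally in the proposed fix, but keeps the acquire/release pairing on a single line so it cannot drift apart in later edits.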

129-131: 🧹 Nitpick | 🔵 Trivial

Accessing private members _setup_platform() and _grpc_listener creates coupling.

Consider exposing a public method on ClaudeBridge that handles setup and returns the ready signal, rather than directly accessing private members.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/runners/ambient-runner/ambient_runner/app.py` around lines 129 -
131, The code currently couples to ClaudeBridge's internals by calling its
private members _setup_platform() and _grpc_listener.ready; instead add a public
async method on ClaudeBridge (e.g., ensure_ready() or setup_and_wait_ready())
that performs the internal setup and exposes the ready signal (awaiting
_setup_platform() and returning/awaiting the internal ready event), then replace
the direct calls in ambient_runner.app (the grpc_url conditional) with a single
await bridge.ensure_ready() to remove private member access.
components/manifests/overlays/production/control-plane-env-patch.yaml (1)

15-16: ⚠️ Potential issue | 🟠 Major

Pin RUNNER_IMAGE to a specific version instead of :latest.

Using :latest in production makes deployments non-reproducible. Pin to a specific version or digest for predictable behavior.

         - name: RUNNER_IMAGE
-          value: "quay.io/ambient_code/vteam_claude_runner:latest"
+          value: "quay.io/ambient_code/vteam_claude_runner:v1.0.0"  # Pin to specific version
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/manifests/overlays/production/control-plane-env-patch.yaml` around
lines 15 - 16, The RUNNER_IMAGE env var is pinned to :latest causing
non-reproducible production deployments; update the value of the environment
variable named RUNNER_IMAGE (currently
"quay.io/ambient_code/vteam_claude_runner:latest") to a specific version tag or
an immutable digest (for example
"quay.io/ambient_code/vteam_claude_runner:v1.2.3" or
"quay.io/ambient_code/vteam_claude_runner@sha256:<digest>") so deployments are
deterministic; ensure you change the string literal in the manifest where
RUNNER_IMAGE is defined and verify any CI/deploy docs reference the same pinned
tag.
components/ambient-control-plane/internal/reconciler/project_settings_reconciler.go (1)

130-139: ⚠️ Potential issue | 🟠 Major

Partial failures are silently swallowed.

When ensureGroupRoleBinding fails for one entry, the error is logged but the function returns nil, making reconciliation appear successful. This could mask persistent failures and prevent retry mechanisms from kicking in.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@components/ambient-control-plane/internal/reconciler/project_settings_reconciler.go`
around lines 130 - 139, The loop over entries in project_settings_reconciler.go
currently logs errors from ensureGroupRoleBinding but always returns nil, hiding
failures; change the reconciler method that contains this loop to collect and
return errors (e.g., return the first error or aggregate them) instead of
swallowing them so callers/retry logic see failures—specifically update the
for-loop handling of ensureGroupRoleBinding(ctx, namespace, entry.GroupName,
entry.Role) to propagate its error (or append to an error slice and return a
combined error) rather than only logging it.
components/runners/ambient-runner/tests/test_app_initial_prompt.py (1)

34-35: ⚠️ Potential issue | 🟡 Minor

Patch target should target where the name is looked up.

The patches at lines 34, 58, 67, and 78 target "ambient_runner._grpc_client.AmbientGRPCClient" (definition site). If app.py imports with from ambient_runner._grpc_client import AmbientGRPCClient, the name is bound in app's namespace before the patch applies. The patch should target "ambient_runner.app.AmbientGRPCClient" instead.

Also applies to: 58-58, 67-67, 78-78

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/runners/ambient-runner/tests/test_app_initial_prompt.py` around
lines 34 - 35, The tests patch the definition site
"ambient_runner._grpc_client.AmbientGRPCClient" but the code under test imports
and binds the class into app's namespace, so update each patch target to the
lookup site "ambient_runner.app.AmbientGRPCClient"; specifically change the
patch call that wraps the call to _push_initial_prompt_via_grpc("hello world",
"sess-1") and the three other patch usages (the ones at the other test cases) to
patch "ambient_runner.app.AmbientGRPCClient" so the mocked class is used when
_push_initial_prompt_via_grpc invokes AmbientGRPCClient.
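The "patch where the name is looked up" rule can be demonstrated with a minimal self-contained example; the module names `defs_mod` and `app_mod` below are illustrative, not the runner's actual layout:

```python
import sys
import types
from unittest import mock

# A "definition" module and an "app" module that imports the class into
# its own namespace, mirroring
# `from ambient_runner._grpc_client import AmbientGRPCClient`.
defs = types.ModuleType("defs_mod")

class Client:
    def ping(self):
        return "real"

defs.Client = Client
sys.modules["defs_mod"] = defs

app = types.ModuleType("app_mod")
exec(
    "from defs_mod import Client\n"
    "def call():\n"
    "    return Client().ping()\n",
    app.__dict__,
)
sys.modules["app_mod"] = app

# Patching the definition site does NOT affect app.call():
with mock.patch("defs_mod.Client") as m:
    m.return_value.ping.return_value = "mocked"
    print(app.call())  # "real" — app_mod holds its own binding

# Patching the lookup site does:
with mock.patch("app_mod.Client") as m:
    m.return_value.ping.return_value = "mocked"
    print(app.call())  # "mocked"
```

The same reasoning applies to the tests above: because `app.py` binds the class at import time, only a patch on `ambient_runner.app.AmbientGRPCClient` reaches the code under test.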
components/runners/ambient-runner/tests/test_events_endpoint.py (1)

66-79: ⚠️ Potential issue | 🟡 Minor

Unused queue in test may not verify intended behavior.

The queue q created on lines 72-73 is never added to active_streams, so the endpoint creates its own queue. The test passes but doesn't actually verify queue registration behavior. Consider removing the unused queue or properly verifying registration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/runners/ambient-runner/tests/test_events_endpoint.py` around lines
66 - 79, The test test_registers_queue_before_streaming creates a local
asyncio.Queue q and puts a terminal event into it but never registers it with
active_streams, so it doesn't actually verify queue-registration behavior;
either remove the unused q and its put_nowait call, or register it into
active_streams (e.g., active_streams["t-1"] = q) before creating the bridge via
_make_bridge so the endpoint will use that queue, and then after
client.stream("GET", "/events/t-1") and resp.read() assert the expected
registration/consumption from active_streams to confirm the endpoint
registered/used the queue.
components/ambient-control-plane/internal/reconciler/shared.go (1)

88-96: ⚠️ Potential issue | 🟠 Major

Normalize namespace inputs before using them as Kubernetes namespaces.

strings.ToLower() only handles case, and Line 92 returns session.KubeNamespace without even that. Values with _, ., spaces, uppercase letters, or length >63 will still be invalid namespace names and will fail later resource creation. Reuse one DNS-1123 normalizer here and in namespaceForProject, or assert upstream that ProjectID is already slug-safe.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/internal/reconciler/shared.go` around lines
88 - 96, namespaceForSession currently only lowercases ProjectID and returns
KubeNamespace raw, which allows invalid Kubernetes names (underscores, dots,
spaces, uppercase, >63 chars); update namespaceForSession to run both
session.ProjectID and session.KubeNamespace through the shared DNS-1123
normalizer used elsewhere (reuse the existing normalizer function used by
namespaceForProject or create a single normalizeToDNS1123 helper) so returned
namespaces are DNS-1123 safe and truncated if needed; also ensure
namespaceForProject calls the same normalizer or add an assertion upstream that
ProjectID is already slug-safe.
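The reconciler is Go, but the shape of the shared normalizer the comment asks for can be sketched in Python; the hash-suffix scheme for over-long names is one reasonable choice, not the project's existing behavior:

```python
import hashlib
import re


def normalize_dns1123(raw: str, max_len: int = 63) -> str:
    """Best-effort RFC 1123 label: lowercase alphanumerics and '-',
    starting and ending alphanumeric, at most 63 characters."""
    name = raw.lower()
    name = re.sub(r"[^a-z0-9-]+", "-", name)  # _, ., spaces -> '-'
    name = name.strip("-")
    if len(name) > max_len:
        # Keep a stable hash suffix so distinct long inputs stay distinct
        # instead of aliasing after truncation.
        digest = hashlib.sha256(raw.encode()).hexdigest()[:8]
        name = name[: max_len - 9].rstrip("-") + "-" + digest
    return name or "default"
```

Running both `session.ProjectID` and `session.KubeNamespace` through one such helper removes the class of failures where a syntactically valid Session produces an invalid namespace name at resource-creation time.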
components/runners/ambient-runner/ambient_runner/_session_messages_api.py (1)

77-84: ⚠️ Potential issue | 🟠 Major

Stop logging payload previews at INFO.

Both RPC paths now emit prompt/response content previews on every push/watch message. That leaks user and assistant content into durable logs and will balloon log volume on long sessions. Since this file is generated, the fix should happen in the generator too.

Also applies to: 136-147

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/runners/ambient-runner/ambient_runner/_session_messages_api.py`
around lines 77 - 84, The current code constructs payload_preview and emits it
in logger.info (using session_id, event_type, payload_len, payload_preview),
which leaks user/assistant content and increases log volume; update the
generator so generated code no longer logs payload previews at INFO—either
remove payload_preview from the logger.info call or lower that log call to DEBUG
(e.g., use logger.debug with payload_preview) and keep only non-sensitive
metadata (session_id, event_type, payload_len) at INFO; apply the same change
for the other generated block that builds payload_preview (the analogous logger
call around the 136–147 area).
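The split the comment recommends — metadata at INFO, content only at DEBUG — can be sketched as follows; the function and logger names are illustrative, not the generated module's actual API:

```python
import logging

logger = logging.getLogger("session_messages")


def log_push(session_id: str, event_type: str, payload: str) -> None:
    # Non-sensitive metadata at INFO: safe for durable logs.
    logger.info(
        "push session=%s type=%s payload_len=%d",
        session_id, event_type, len(payload),
    )
    # Content preview only at DEBUG, gated so the preview string is not
    # even built when DEBUG is disabled in production.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("push payload_preview=%r", payload[:120])
```

At the default INFO level the user/assistant content never enters the log stream, which addresses both the leak and the log-volume concern.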
components/ambient-control-plane/internal/reconciler/project_reconciler.go (2)

104-110: ⚠️ Potential issue | 🔴 Critical

Reconcile creator RoleBindings instead of create-once.

A changed ambient-code.io/created-by value can leave the previous subject bound to ambient-project-admin, and the 40-character truncation makes distinct long identities alias the same RoleBinding name. Line 132 only checks for name existence, so this never converges back to the desired subject.

Also applies to: 132-137

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/internal/reconciler/project_reconciler.go`
around lines 104 - 110, The current creatorRoleBindingName function and the
reconcile logic create RoleBindings once by name and never update them, causing
stale subjects when the ambient-code.io/created-by value changes and collisions
from truncating long subjects; update creatorRoleBindingName to produce
deterministic unique names for any subject (e.g., include a stable hash of the
full subject instead of truncating to 40 chars) and change the reconciler that
handles ambient-project-admin RoleBindings to not only check existence but to
reconcile the RoleBinding's subjects/owner to match the current
ambient-code.io/created-by (update existing RoleBinding subjects when they
differ); reference creatorRoleBindingName and the reconciler that currently
checks for name existence to locate where to change name generation and add
subject-update logic.

94-96: ⚠️ Potential issue | 🟠 Major

Treat AlreadyExists on namespace creation as success.

After the Line 67 existence check, another reconcile can create the namespace before CreateNamespace runs. Returning that race as an error makes add/modify events fail instead of converging.

Proposed fix
-	if _, err := r.kube.CreateNamespace(ctx, ns); err != nil {
-		return fmt.Errorf("creating namespace %s: %w", name, err)
-	}
+	if _, err := r.kube.CreateNamespace(ctx, ns); err != nil {
+		if !k8serrors.IsAlreadyExists(err) {
+			return fmt.Errorf("creating namespace %s: %w", name, err)
+		}
+		r.logger.Debug().Str("namespace", name).Msg("namespace concurrently created")
+		return nil
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/internal/reconciler/project_reconciler.go`
around lines 94 - 96, The CreateNamespace call in project_reconciler.go should
treat an AlreadyExists result as success: after calling
r.kube.CreateNamespace(ctx, ns) check the returned error and if
apierrors.IsAlreadyExists(err) then ignore it (consider creation successful)
otherwise return the wrapped fmt.Errorf as currently done; reference the
r.kube.CreateNamespace call and the ns/name variables and import/use
k8s.io/apimachinery/pkg/api/errors (apierrors) for the IsAlreadyExists check.
components/runners/ambient-runner/tests/test_grpc_transport.py (1)

152-181: ⚠️ Potential issue | 🟠 Major

Assert that bridge.run() actually fired.

This test only waits 300ms and never proves that the user message triggered bridge.run(). If the listener drops user events or exits early, the case still passes.

Proposed fix
         bridge = _make_bridge()
 
         events = [make_text_start(), make_text_content(), make_run_finished()]
+        ran = asyncio.Event()
 
         async def fake_run(input_data):
+            ran.set()
             for e in events:
                 yield e
 
         bridge.run = fake_run
@@
         task = asyncio.create_task(listener._listen_loop())
         try:
             await asyncio.wait_for(listener.ready.wait(), timeout=2.0)
-            await asyncio.sleep(0.3)
+            await asyncio.wait_for(ran.wait(), timeout=1.0)
+            assert ran.is_set()
         finally:
             task.cancel()
             try:
                 await task
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/runners/ambient-runner/tests/test_grpc_transport.py` around lines
152 - 181, The test doesn't verify bridge.run was invoked; modify
test_user_message_triggers_bridge_run to set a sentinel (e.g., an asyncio.Event
or boolean flag) inside the fake_run assigned to bridge.run (refer to fake_run
and bridge.run) and await that sentinel with a timeout after listener.ready
instead of just sleeping; ensure you still cancel the listener task as before
(listener._listen_loop and listener.ready) so the test fails if bridge.run never
fires.
components/runners/ambient-runner/ambient_runner/endpoints/events.py (1)

61-63: ⚠️ Potential issue | 🟠 Major

Do not overwrite another subscriber's queue.

active_streams[thread_id] = queue lets a reconnect or second client orphan the first stream, and the unconditional pop() in both endpoints can then remove the replacement queue. wait_for_events() is worse because it tears down a queue it never created.

Also applies to: 109-110, 155-156, 183-184

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/runners/ambient-runner/ambient_runner/endpoints/events.py` around
lines 61 - 63, active_streams[thread_id] = queue unconditionally overwrites an
existing subscriber's queue (and later unconditional active_streams.pop()
removes a different queue), so change creation and teardown to be identity-safe:
when creating a queue in wait_for_events() (and the subscribe endpoints), use
active_streams.setdefault(thread_id, queue) or append to a list of subscribers
instead of replacing an existing entry, and track whether this call actually
created the queue; when removing/tearing down, only remove the mapping if
active_streams.get(thread_id) is the same queue you created (e.g., if
active_streams.get(thread_id) is queue: active_streams.pop(thread_id)), and only
close/put sentinel into queues that this code created—this prevents orphaning
other subscribers and avoids tearing down queues you don't own.
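The identity-safe create/teardown pattern the prompt describes can be sketched like this (hypothetical helper names; the real logic lives inline in the endpoints of events.py):

```python
import asyncio

active_streams: dict[str, asyncio.Queue] = {}

def register_stream(thread_id: str) -> tuple[asyncio.Queue, bool]:
    """Register a queue for thread_id without clobbering an existing one.

    Returns (queue, created): the queue to consume from and whether this
    call created it. Only the creator should tear it down later.
    """
    queue: asyncio.Queue = asyncio.Queue()
    existing = active_streams.setdefault(thread_id, queue)
    return existing, existing is queue

def unregister_stream(thread_id: str, queue: asyncio.Queue) -> None:
    """Remove the mapping only if it still points at the queue we created."""
    if active_streams.get(thread_id) is queue:
        active_streams.pop(thread_id)
```

A reconnecting client then reuses or shares the surviving queue instead of orphaning it, and teardown of one subscriber can never remove a replacement registered by another.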
Makefile (1)

68-68: ⚠️ Potential issue | 🔴 Critical

CONTROL_PLANE_IMAGE still points at the wrong image name.

The new build/push/load targets all use vteam_control_plane, so they will not satisfy manifests that deploy ambient_control_plane. That breaks the control-plane rollout path introduced in this PR.

🛠 Proposed fix
-CONTROL_PLANE_IMAGE ?= vteam_control_plane:$(IMAGE_TAG)
+CONTROL_PLANE_IMAGE ?= ambient_control_plane:$(IMAGE_TAG)

Also applies to: 209-215, 261-264, 278-282, 1004-1005

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` at line 68, Update the CONTROL_PLANE_IMAGE definition and all
places still referencing the old image name "vteam_control_plane" to
"ambient_control_plane" so manifests and the new build/push/load targets use the
correct image; specifically, change the CONTROL_PLANE_IMAGE variable and any
other Makefile variables/targets that set or reference "vteam_control_plane"
(e.g., in build/push/load targets and image-related variables) to use
"ambient_control_plane:$(IMAGE_TAG)" so the control-plane rollout uses the
correct image name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/ambient-control-plane/internal/config/config.go`:
- Around line 41-42: The current behavior in Load() uses
parseReconcilers(envOrDefault("RECONCILERS", "tally,kube")) which causes
parseReconcilers to silently collapse whitespace/comma-only values into just
"tally", disabling the kube reconciler; update parseReconcilers (and any other
parsing used in the 76-91 region) to validate the raw env string: if the env is
empty/only whitespace/only commas treat it as invalid and either (a) return the
full default set ["tally","kube"] or (b) return an error that Load() surfaces
and fails fast — then update Load() to handle the error path (or accept the
returned default) so invalid RECONCILERS never silently drop kube; reference
functions/fields: parseReconcilers, Load, Reconcilers, envOrDefault and apply
the same validation to the similar parsing code in lines 76-91.
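The fall-back-to-defaults behavior can be illustrated like this (a Python stand-in for the Go `parseReconcilers`; the default set is taken from the prompt and the error-returning variant is omitted):

```python
DEFAULT_RECONCILERS = ["tally", "kube"]

def parse_reconcilers(raw: str) -> list[str]:
    """Parse a comma-separated RECONCILERS value.

    Whitespace- or comma-only input is treated as unset and falls back to
    the full default set, so a malformed env var cannot silently drop "kube".
    """
    names = [part.strip() for part in raw.split(",") if part.strip()]
    return names if names else list(DEFAULT_RECONCILERS)
```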

In `@components/ambient-control-plane/internal/informer/informer.go`:
- Around line 129-133: The function currently returns ctx.Err() unconditionally
after calling watchManager.Run(ctx), which causes normal cancellations (e.g.,
SIGINT/SIGTERM -> context.Canceled) to be treated as failures; update the return
behavior in the function containing inf.watchManager.Run(ctx) and Run() so that
expected context cancellation is treated as a clean shutdown: after
watchManager.Run(ctx) check ctx.Err() and return nil for context.Canceled and
context.DeadlineExceeded (or otherwise special-case expected cancellation),
otherwise return the actual error; reference the Run() caller and
inf.watchManager.Run(ctx) to locate where to implement this check.
- Around line 118-125: The initialSync currently enqueues every listed object
via dispatchBlocking while dispatchLoop is started after, which can deadlock
when the initial dataset exceeds the 256-event buffer; change Run to start the
dispatcher before performing the initial sync (move the go inf.dispatchLoop(ctx)
call to before calling inf.initialSync(ctx)) or alternatively have initialSync
use a non-blocking enqueue method (replace dispatchBlocking calls with a
non-blocking/try-enqueue variant) so initial list processing cannot block
waiting for a consumer; update Informer.Run, initialSync, and any calls to
dispatchBlocking to ensure the dispatcher (dispatchLoop) is running or that
enqueues are non-blocking.

In `@components/ambient-control-plane/internal/kubeclient/kubeclient.go`:
- Around line 88-99: The buildRestConfig function currently prefers a local
~/.kube/config over in-cluster service-account credentials; change the
resolution order so that after honoring an explicit kubeconfig parameter
(clientcmd.BuildConfigFromFlags when kubeconfig != ""), you attempt
rest.InClusterConfig() next and return it if successful, and only if in-cluster
fails fall back to checking the home kubeconfig via os.UserHomeDir and
clientcmd.BuildConfigFromFlags(localPath); update error handling so you return
the in-cluster config on success and only probe localPath when
rest.InClusterConfig() returns an error.
- Around line 161-162: All six deletion helpers (DeletePodsByLabel,
DeleteServicesByLabel, DeleteSecretsByLabel, DeleteServiceAccountsByLabel,
DeleteRolesByLabel, DeleteRoleBindingsByLabel) currently call DeleteCollection
with an empty labelSelector which can delete every resource in a namespace; add
a shared validation helper (e.g., ensureNonEmptyLabelSelector(ctx,
labelSelector) or validateLabelSelector) that returns an error if labelSelector
is empty or whitespace, call it at the start of each Delete*ByLabel method, and
return that error instead of invoking kc.dynamic.Resource(...).DeleteCollection
when the selector is missing; apply the same pattern to all six methods so no
deletion proceeds with an empty selector.
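The guard itself is small; a minimal sketch of the idea (Python stand-in for the shared Go helper, name hypothetical):

```python
def ensure_non_empty_label_selector(label_selector: str) -> None:
    """Reject empty or whitespace-only selectors before DeleteCollection.

    An empty labelSelector matches every resource in the namespace, so a
    missing selector must fail loudly instead of deleting everything.
    """
    if not label_selector or not label_selector.strip():
        raise ValueError("refusing to delete collection with empty labelSelector")
```

Each `Delete*ByLabel` method would call this first and return the error before touching the dynamic client.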

In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`:
- Around line 147-166: cleanupSession currently only deletes labeled child
resources and leaves the per-session namespace (created by
ensureNamespaceExists) behind; update cleanupSession to also delete the
namespace itself (call the kube client's DeleteNamespace/DeleteNamespaceByName
equivalent for the computed namespace variable), handle
k8serrors.IsNotFound(err) the same way as other deletions, and log
errors/warnings via r.logger; make the identical change to the other cleanup
path referenced (the similar function/block around lines 210-242) so both
teardown code paths remove the namespace instead of leaking it.
- Around line 245-270: The RoleBinding name "ambient-image-puller" in
ensureImagePullAccess is shared across sessions so later calls reuse a binding
that points to the first session's subject; change the behavior so the binding
is unique per session or update an existing binding's subjects: either make the
RoleBinding name include the session namespace (e.g. use a name formatted with
the namespace when constructing rb) so each session creates its own binding, or
on CreateRoleBinding error IsAlreadyExists fetch the existing RoleBinding (via
the kube client), reconcile its "subjects" to include the serviceaccounts group
for the current namespace and update/patch it; reference ensureImagePullAccess,
rb, r.kube.CreateRoleBinding and r.cfg.RunnerImageNamespace when making the
change.

In `@components/ambient-control-plane/internal/reconciler/stress_test.go`:
- Around line 65-76: The stress test (TestSessionAPIStressTest) currently
assumes defaults from getTestAPIBaseURL()/getTestAPIToken()/getTestAPIProject()
and will hard-fail when no live API exists; change it to be opt-in by gating
execution on an explicit integration flag or env var (e.g. require
INTEGRATION_TEST=true or non-default API_BASE_URL/API_TOKEN) before constructing
StressTestConfig, and call t.Skip(...) when the flag/credentials are not set or
are the default placeholders; apply the same guard/skip logic to the other
affected tests that use getTestAPIBaseURL/getTestAPIToken/getTestAPIProject and
StressTestConfig so none of them run unless explicitly enabled.

In `@components/ambient-control-plane/internal/reconciler/tally_reconciler.go`:
- Around line 91-96: The handler handleSessionModified currently ignores
MODIFIED events and leaves SessionTallyReconciler's exported tallies
(SessionsByPhase, SessionsByUser) stale; update handleSessionModified to accept
both old and new session values from the event, compare old.Phase vs new.Phase
and old.UserID vs new.UserID, decrement the appropriate counts for the old
bucket(s) and increment the counts for the new bucket(s) (updating
SessionsByPhase and SessionsByUser accordingly), and preserve existing logging
via r.logger.Debug() in the function to reflect the move; ensure the update is
atomic/consistent with how increments/decrements are done elsewhere in
SessionTallyReconciler to avoid transient inconsistencies.
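The decrement-old-bucket / increment-new-bucket move can be sketched as follows (a Python illustration; the field loosely mirrors `SessionsByPhase`, and the reconciler's locking is omitted):

```python
from collections import Counter

sessions_by_phase: Counter = Counter()

def handle_session_modified(old_phase: str, new_phase: str) -> None:
    """Move a session between phase buckets when a MODIFIED event changes it."""
    if old_phase == new_phase:
        return
    if sessions_by_phase[old_phase] > 0:
        sessions_by_phase[old_phase] -= 1
        if sessions_by_phase[old_phase] == 0:
            del sessions_by_phase[old_phase]
    sessions_by_phase[new_phase] += 1
```

The same pattern applies to the per-user tally: compare old and new `UserID` and move the count only when they differ.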

In `@components/ambient-control-plane/internal/reconciler/tally.go`:
- Around line 26-44: The TallyReconciler struct contains an unused sdk field and
NewTallyReconciler accepts an sdk parameter that is never used; remove the sdk
*sdkclient.Client field from the TallyReconciler type and drop the corresponding
sdk parameter from NewTallyReconciler, updating the constructor to only take
(resource string, logger zerolog.Logger) and to initialize the struct without
sdk (keep resource, logger.With().Str("reconciler", "tally-"+resource).Logger(),
seenIDs map, mu, tally, lastEventAt). Also search for and update all call sites
of NewTallyReconciler to stop passing the sdk argument.

In `@components/manifests/base/ambient-control-plane-service.yml`:
- Around line 25-26: The manifest currently pins the container image to
quay.io/ambient_code/ambient_control_plane:latest which makes deployments
non-reproducible; update the image field in the Ambient Control Plane service
manifest to use a pinned tag or an immutable digest (e.g.,
quay.io/ambient_code/ambient_control_plane@sha256:...) instead of :latest, and
keep imagePullPolicy as appropriate (e.g., IfNotPresent) so Kustomize overlays
can still override the tag when needed.

In `@components/manifests/base/platform/ambient-api-server-db.yml`:
- Around line 61-62: Replace the mutable image tag and private registry in the
base manifest so the deployment is deterministic and doesn't require Red Hat
credentials: update the image value currently set to
"registry.redhat.io/rhel9/postgresql-16:latest" to a pinned public tag (for
example "postgresql-16:10.1" or "docker.io/library/postgres:16.1") in the same
YAML key and keep imagePullPolicy as needed; ensure the change is made where the
image field is defined in the ambient-api-server-db manifest so overlays no
longer must remap the image to run locally.
- Around line 57-58: Add seccompProfile: { type: RuntimeDefault } alongside the
existing securityContext entries to enforce syscall filtering for
defense-in-depth: update the pod-level securityContext (the existing
securityContext block that contains runAsNonRoot: true) to include
seccompProfile: { type: RuntimeDefault } and also add the same seccompProfile
under the container-level securityContext for the container spec (the
container's securityContext block around lines where capabilities, runAsNonRoot,
runAsUser, etc. are defined) so both pod and container scopes explicitly set
RuntimeDefault.

In `@components/manifests/base/rbac/control-plane-clusterrole.yaml`:
- Around line 1-27: The current ClusterRole "ambient-control-plane" grants
cluster-wide CRUD (including deletecollection) on workload resources (secrets,
serviceaccounts, services, pods, jobs); restrict this by: 1) remove workload
resources and verbs (secrets, serviceaccounts, services, pods, jobs and their
verbs) from the "ambient-control-plane" ClusterRole so it only contains
cluster-scoped permissions (namespaces and vteam.ambient-code agenticsessions
entries), and 2) implement per-project namespace Roles and RoleBindings created
dynamically by the control plane that grant the workload verbs for resources
["secrets","serviceaccounts","services","pods"] and ["jobs"] within each project
namespace and bind them to the control-plane ServiceAccount; ensure Role names
and RoleBinding subjects align with the control plane ServiceAccount used by
"ambient-control-plane".

In `@components/manifests/deploy`:
- Around line 198-211: The oauth-secret.env creation currently writes plaintext
secrets (CLIENT_SECRET_VALUE, COOKIE_SECRET_VALUE) into the working tree with
default perms and no cleanup; change the flow in the secrets handling so you
either avoid writing oauth-secret.env at all (pipe secrets directly to the
consumer/subcommand) or, if a file is required, create it in a secure temporary
file (using a restrictive umask or mktemp in a secure dir), set tight
permissions, and register a trap/cleanup (remove file on EXIT/ERR) before the
first write; ensure the secrets subcommand consumes stdin or the temp file and
that oauth-secret.env is removed immediately after use so no leftover
credentials remain.
- Around line 164-174: The deploy script currently mutates the checked-in
overlay with "kustomize edit set namespace" and then attempts to restore it,
which can leave the repo modified on failure; instead, change the flow to
operate on a temporary copy or use a cleanup trap: copy overlays/production to a
temp dir, run "kustomize edit set namespace" (and any "kustomize edit set image"
calls) against that temp copy, run "kustomize build" and "oc delete -f -" from
the temp copy, then delete the temp copy; alternatively, install a trap at the
top of the script that restores the original namespace/image on EXIT or ERR so
"kustomize edit set namespace" never leaves the checked-in overlays modified
(apply same fix for the other similar blocks referenced).
- Line 8: Add pipefail to the shell safety flags so failures in the left side of
pipelines like the kustomize/manifest generators piped into "oc apply -f -" are
not ignored; replace the lone "set -e" with flags that also enable pipefail
(e.g., "set -euo pipefail" or add "set -o pipefail" prior to the
cluster-mutating pipelines) and apply this change before every occurrence of the
"... | oc apply -f -" pipeline (also at the other locations noted).
Additionally, update any random-secret generation that uses the fragile "tr |
head" pattern so it won’t fail on SIGPIPE (use a robust generator such as
openssl rand, head -c, or dd to produce fixed-length secrets instead of "tr |
head").

In `@components/manifests/deploy-no-api-server.sh`:
- Around line 35-41: The script uses unquoted variable expansions for NAMESPACE
(and other occurrences) which can cause word-splitting; update the if check and
oc commands shown (the conditional using oc get deployment ambient-api-server -n
${NAMESPACE} and the API_SERVER_IMAGE assignment using ${NAMESPACE}) to quote
the variable like "-n \"${NAMESPACE}\"" and likewise quote all other
${NAMESPACE} usages (lines referenced in the review) so every oc/echo/exit
invocation uses "${NAMESPACE}" to prevent splitting and preserve values.

In `@components/manifests/overlays/production/control-plane-image-patch.yaml`:
- Around line 9-10: The image for the container named "ambient-control-plane" is
pinned to ":latest", making production non-reproducible; update the image field
for the "ambient-control-plane" container to use an immutable reference (a
specific release tag or an image digest like `@sha256:<digest>`) instead of
":latest" so deploys only change when the manifest changes and the registry
content is fixed.

In `@components/manifests/overlays/production/kustomization.yaml`:
- Around line 62-64: The production kustomization image uses the mutable tag
"latest", which must be replaced with a fixed release identifier: update the
image entry referencing "quay.io/ambient_code/ambient_control_plane:latest" by
replacing newTag: latest with a specific release tag or git SHA (e.g., newTag:
<release-or-sha>) so the kustomization image stanza (name/newName/newTag) uses
an immutable reference for repeatable, traceable rollouts.

In
`@components/runners/ambient-runner/ambient_runner/bridges/claude/grpc_transport.py`:
- Around line 118-123: The listener currently ignores immediate watch worker
failures because _listen_loop only checks watch_future.done() inside the 30s
timeout branch; change _listen_loop to wait on both msg_queue.get() and the
watch_future (the Future returned from _watch_in_thread) so that if watch_future
completes with an exception the loop immediately notices and propagates the
failure to trigger reconnect/backoff logic. Specifically, when selecting/waiting
for a message from msg_queue, also add the watch_future to the wait set (and
handle its completion by re-raising or returning an error) so worker exceptions
from _watch_in_thread aren't masked; apply the same fix where watch_future is
currently polled (the code around the other check at the 147-150 region) to
ensure consistent immediate reconnect behavior.
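Waiting on the message queue and the worker future together might look like this (an asyncio sketch under assumed signatures, not the module's actual code — `_listen_loop` would call something like this instead of polling the future on a timeout):

```python
import asyncio

async def next_message_or_failure(msg_queue: asyncio.Queue, watch_future: asyncio.Future):
    """Return the next queued message, or raise immediately if the watch
    worker future completes, instead of noticing only on a periodic timeout."""
    get_task = asyncio.ensure_future(msg_queue.get())
    done, _ = await asyncio.wait(
        {get_task, watch_future}, return_when=asyncio.FIRST_COMPLETED
    )
    if watch_future in done:
        get_task.cancel()
        exc = watch_future.exception()
        # Propagate the worker's failure so reconnect/backoff logic runs now.
        raise exc if exc else RuntimeError("watch worker exited unexpectedly")
    return get_task.result()
```

A thread-pool `concurrent.futures.Future` from `_watch_in_thread` would first be adapted with `asyncio.wrap_future` before being passed in.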

In `@docs/internal/design/blackboard-api.md`:
- Around line 440-453: The snapshot query computes latest_checkins over the
entire session_checkins table which breaks the O(agents) guarantee; modify the
CTE named latest_checkins to apply the project filter before DISTINCT ON by
adding the same WHERE project_id = ? (or, alternatively, ensure SessionCheckIn
is denormalized with project_id and use that field in the CTE) so the CTE only
scans check-ins for the target project before selecting DISTINCT ON (agent_id).

In `@docs/internal/developer/agent-workflow.md`:
- Around line 248-253: The docs still show the removed direct session creation
using POST /sessions; change the example and expected-flow to first create an
agent (POST /agents with the required payload) and then call POST
/agents/{id}/ignite to create/ignite the session instead of POST /sessions, and
update the second example block the same way; specifically replace the POST
/sessions curl example with a create-agent request followed by POST
/agents/{id}/ignite and adjust any explanatory text to reference the agent-first
+ ignite flow.

In `@docs/internal/developer/local-development/openshift.md`:
- Around line 392-405: The docs are inconsistent: this section says the control
plane accepts RH SSO tokens (ocm token) and rejects oc whoami -t, but the
earlier acpctl login example still uses oc whoami -t; update the example to use
a RH SSO access token (ocm token) or a registered service account instead of oc
whoami -t. Specifically, change the acpctl/login example and any references to
authentication against ambient-api-server to obtain and use the token from `ocm
token` (or document using a long-lived service account token) and ensure the
secret name `ambient-control-plane-token` and restart step (`oc rollout restart
deployment/ambient-control-plane -n ambient-code`) align with that guidance.

In `@e2e/scripts/load-images.sh`:
- Around line 53-54: The verification step is filtering loaded images with grep
'vteam_' so the image "ambient_control_plane:latest" in the images array won't
show up; update the verification command that currently uses grep vteam_ (the
verification pipeline near the end of the script) to either remove the 'vteam_'
filter or expand it to include ambient_control_plane (or a more general pattern
like matching both vteam_ and ambient_control_plane) so the verification output
accurately reflects all images in the images array.

In `@REMOVE_CRDs.md`:
- Line 519: Fix the typo in the Phase 3 heading by replacing "Safe
Decomissioning (2 weeks)" with "Safe Decommissioning (2 weeks)" so the word
"Decommissioning" is spelled correctly in the heading line that currently reads
"Phase 3: Safe Decomissioning (2 weeks)".

In `@test-e2e-control-plane.sh`:
- Around line 149-164: The POST that sends a session message (the oc exec curl
that populates MESSAGE_RESPONSE and then tests for .id) currently swallows
errors (note the "|| true") and only logs a warning; change it so the request
failure causes the script to fail the test: remove the "|| true", ensure you
validate MESSAGE_RESPONSE contains an "id" (using jq -e '.id') and if not, log
an error and exit 1. Apply the same change to the similar block around the later
message/test (the other POST block that sets MESSAGE_RESPONSE / MESSAGE_ID at
lines ~187-197) so missing runner-side gRPC push or /events fan-out fails the
e2e run. Ensure you reference MESSAGE_RESPONSE, MESSAGE_ID, API_SERVER_URL and
SESSION_ID when implementing the check.

---

Outside diff comments:
In @.github/workflows/components-build-deploy.yml:
- Around line 354-363: Add handling for ambient-control-plane by emitting a
control_plane_tag in the "Determine image tags" step (set control_plane_tag to
github.sha when needs.detect-changes.outputs.ambient-control-plane == "true",
otherwise set it to "stage") and then include a kustomize edit set image command
in the "Update kustomization with image tags" step to update
quay.io/ambient_code/ambient_control_plane from :latest to :${{
steps.image-tags.outputs.control_plane_tag }} so the control-plane image uses
the newly built SHA when changed; reference the existing image-tags output
variable name control_plane_tag and the kustomize edit set image pattern used
for other components.

In
`@components/manifests/components/ambient-api-server-db/ambient-api-server-db-json-patch.yaml`:
- Around line 8-30: Remove the two redundant whole-array JSON Patch replace ops
that overwrite the container 0 env and volumeMounts in
ambient-api-server-db-json-patch.yaml: the replace at path
/spec/template/spec/containers/0/env (which replaces env entries like
POSTGRESQL_USER, POSTGRESQL_PASSWORD, POSTGRESQL_DATABASE) and the replace at
path /spec/template/spec/containers/0/volumeMounts (which replaces the mount for
name ambient-api-server-db-data). Instead delete these two ops from the patch so
the overlay inherits the base manifest's env and volumeMounts and future base
additions are preserved.

---

Duplicate comments:
In `@components/ambient-control-plane/go.mod`:
- Line 11: The go.mod entry pins google.golang.org/grpc at the vulnerable
v1.79.1; update that dependency in components/ambient-control-plane/go.mod to a
non-vulnerable release (e.g., v1.79.2 or any release that the
GHSA-p77j-4mvh-x3m3 advisory marks as patched), then run go mod tidy (or go get
google.golang.org/grpc@<patched-version>) to update go.sum; after updating,
verify the pinned version no longer falls in the advisory’s affected range for
GHSA-p77j-4mvh-x3m3.

In `@components/ambient-control-plane/internal/reconciler/project_reconciler.go`:
- Around line 104-110: The current creatorRoleBindingName function and the
reconcile logic create RoleBindings once by name and never update them, causing
stale subjects when the ambient-code.io/created-by value changes and collisions
from truncating long subjects; update creatorRoleBindingName to produce
deterministic unique names for any subject (e.g., include a stable hash of the
full subject instead of truncating to 40 chars) and change the reconciler that
handles ambient-project-admin RoleBindings to not only check existence but to
reconcile the RoleBinding's subjects/owner to match the current
ambient-code.io/created-by (update existing RoleBinding subjects when they
differ); reference creatorRoleBindingName and the reconciler that currently
checks for name existence to locate where to change name generation and add
subject-update logic.
- Around line 94-96: The CreateNamespace call in project_reconciler.go should
treat an AlreadyExists result as success: after calling
r.kube.CreateNamespace(ctx, ns) check the returned error and if
apierrors.IsAlreadyExists(err) then ignore it (consider creation successful)
otherwise return the wrapped fmt.Errorf as currently done; reference the
r.kube.CreateNamespace call and the ns/name variables and import/use
k8s.io/apimachinery/pkg/api/errors (apierrors) for the IsAlreadyExists check.

In
`@components/ambient-control-plane/internal/reconciler/project_settings_reconciler.go`:
- Around line 130-139: The loop over entries in project_settings_reconciler.go
currently logs errors from ensureGroupRoleBinding but always returns nil, hiding
failures; change the reconciler method that contains this loop to collect and
return errors (e.g., return the first error or aggregate them) instead of
swallowing them so callers/retry logic see failures—specifically update the
for-loop handling of ensureGroupRoleBinding(ctx, namespace, entry.GroupName,
entry.Role) to propagate its error (or append to an error slice and return a
combined error) rather than only logging it.

In `@components/ambient-control-plane/internal/reconciler/shared.go`:
- Around line 88-96: namespaceForSession currently only lowercases ProjectID and
returns KubeNamespace raw, which allows invalid Kubernetes names (underscores,
dots, spaces, uppercase, >63 chars); update namespaceForSession to run both
session.ProjectID and session.KubeNamespace through the shared DNS-1123
normalizer used elsewhere (reuse the existing normalizer function used by
namespaceForProject or create a single normalizeToDNS1123 helper) so returned
namespaces are DNS-1123 safe and truncated if needed; also ensure
namespaceForProject calls the same normalizer or add an assertion upstream that
ProjectID is already slug-safe.
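A single shared normalizer along these lines would cover both fields (a Python sketch of the suggested `normalizeToDNS1123`; the hash-suffix truncation budget is an assumption):

```python
import hashlib
import re

def normalize_to_dns1123(name: str, max_len: int = 63) -> str:
    """Normalize an arbitrary string to a DNS-1123 label: lowercase
    alphanumerics and '-', starting and ending alphanumeric, <= 63 chars.

    Over-long inputs are truncated with a stable hash suffix so distinct
    inputs remain distinct after truncation.
    """
    slug = re.sub(r"[^a-z0-9-]+", "-", name.lower()).strip("-")
    if not slug:
        slug = "x"
    if len(slug) <= max_len:
        return slug
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()[:8]
    return f"{slug[:max_len - 9].rstrip('-')}-{digest}"
```

Both `namespaceForSession` and `namespaceForProject` would then route `ProjectID` and `KubeNamespace` through the same helper.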

In `@components/manifests/deploy`:
- Around line 82-109: Do not persist or echo the plaintext CLIENT_SECRET_VALUE
to /tmp/ambient-frontend-oauthclient.yaml or stdout; instead feed the
OAuthClient YAML to oc apply via stdin (oc apply -f -) without creating a temp
file so the secret never lands on disk, and remove any echo that prints
CLIENT_SECRET_VALUE in the fallback/error block. Update the logic around
OAUTH_APPLY_RC to build the YAML on-the-fly using CLIENT_SECRET_VALUE and
ROUTE_HOST and pipe it to oc apply -f -, and in the failure branch print a
redacted/admin-instruction that tells the admin how to apply the OAuthClient
without showing the secret (omit the secret value or instruct them to set
CLIENT_SECRET_VALUE in their environment) rather than echoing the secret.

In `@components/manifests/overlays/production/control-plane-env-patch.yaml`:
- Around line 15-16: The RUNNER_IMAGE env var is pinned to :latest causing
non-reproducible production deployments; update the value of the environment
variable named RUNNER_IMAGE (currently
"quay.io/ambient_code/vteam_claude_runner:latest") to a specific version tag or
an immutable digest (for example
"quay.io/ambient_code/vteam_claude_runner:v1.2.3" or
"quay.io/ambient_code/vteam_claude_runner@sha256:<digest>") so deployments are
deterministic; ensure you change the string literal in the manifest where
RUNNER_IMAGE is defined and verify any CI/deploy docs reference the same pinned
tag.

In `@components/runners/ambient-runner/ambient_runner/_session_messages_api.py`:
- Around line 77-84: The current code constructs payload_preview and emits it in
logger.info (using session_id, event_type, payload_len, payload_preview), which
leaks user/assistant content and increases log volume; update the generator so
generated code no longer logs payload previews at INFO—either remove
payload_preview from the logger.info call or lower that log call to DEBUG (e.g.,
use logger.debug with payload_preview) and keep only non-sensitive metadata
(session_id, event_type, payload_len) at INFO; apply the same change for the
other generated block that builds payload_preview (the analogous logger call
around the 136–147 area).

In `@components/runners/ambient-runner/ambient_runner/app.py`:
- Around line 356-403: In _push_initial_prompt_via_grpc the AmbientGRPCClient
created via AmbientGRPCClient.from_env() can leak if an exception occurs before
client.close(); wrap the client usage in a try/finally (or use a context manager
if AmbientGRPCClient supports __enter__/__exit__) so client.close() is always
called, ensuring you only call close when the client variable was successfully
assigned; update the function to create the client, run the payload push logic,
and then call client.close() in finally to guarantee cleanup.
- Around line 129-131: The code currently couples to ClaudeBridge's internals by
calling its private members _setup_platform() and _grpc_listener.ready; instead
add a public async method on ClaudeBridge (e.g., ensure_ready() or
setup_and_wait_ready()) that performs the internal setup and exposes the ready
signal (awaiting _setup_platform() and returning/awaiting the internal ready
event), then replace the direct calls in ambient_runner.app (the grpc_url
conditional) with a single await bridge.ensure_ready() to remove private member
access.

In `@components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py`:
- Around line 244-249: The call to await self._grpc_listener.stop() can raise
and currently short-circuits the remaining cleanup; wrap the await
self._grpc_listener.stop() call in a try/except that catches Exception, logs the
error (including the exception) and does not re-raise so that subsequent cleanup
calls await self._session_manager.shutdown() and await self._obs.finalize()
still run; ensure you reference self._grpc_listener.stop(),
self._session_manager.shutdown(), and self._obs.finalize() in the fix so the
cleanup order and logging are clear.

In `@components/runners/ambient-runner/ambient_runner/endpoints/events.py`:
- Around line 61-63: active_streams[thread_id] = queue unconditionally
overwrites an existing subscriber's queue, and the later unconditional
active_streams.pop() can then remove a queue that a different subscriber owns.
Make creation and teardown identity-safe: when creating a queue in
wait_for_events() (and in the subscribe endpoints), use
active_streams.setdefault(thread_id, queue), or append to a list of
subscribers instead of replacing the entry, and track whether this call
actually created the mapping. On teardown, remove the mapping only if it still
points at the queue you created (e.g., if active_streams.get(thread_id) is
queue: active_streams.pop(thread_id)), and only close or put a sentinel into
queues this code created. This prevents orphaning other subscribers and avoids
tearing down queues you don't own.
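
The setdefault-plus-identity-check pattern can be shown in isolation (register and unregister are hypothetical helper names, not the endpoint's real functions):

```python
import asyncio

active_streams: dict[str, asyncio.Queue] = {}

def register(thread_id: str) -> tuple[asyncio.Queue, bool]:
    """Create a queue but register it only when no subscriber exists yet;
    the bool tells the caller whether it owns the registry entry."""
    queue: asyncio.Queue = asyncio.Queue()
    existing = active_streams.setdefault(thread_id, queue)
    return existing, existing is queue

def unregister(thread_id: str, queue: asyncio.Queue) -> None:
    """Identity-safe teardown: pop the mapping only if it still points at
    the queue this subscriber created, never someone else's."""
    if active_streams.get(thread_id) is queue:
        active_streams.pop(thread_id)
```

A second register() for the same thread_id returns the first subscriber's queue and owned=False, and unregister() with a foreign queue leaves the mapping intact.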

In `@components/runners/ambient-runner/tests/test_app_initial_prompt.py`:
- Around line 34-35: The tests patch the definition site
"ambient_runner._grpc_client.AmbientGRPCClient" but the code under test imports
and binds the class into app's namespace, so update each patch target to the
lookup site "ambient_runner.app.AmbientGRPCClient"; specifically change the
patch call that wraps the call to _push_initial_prompt_via_grpc("hello world",
"sess-1") and the three other patch usages (the ones at the other test cases) to
patch "ambient_runner.app.AmbientGRPCClient" so the mocked class is used when
_push_initial_prompt_via_grpc invokes AmbientGRPCClient.
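
The definition-site versus lookup-site distinction is easy to reproduce with two throwaway modules (defs, app, Client, and call are illustrative names, not the real runner modules):

```python
import sys
import types
from unittest import mock

# "defs" is the definition site; "app" imports Client into its own namespace.
defs = types.ModuleType("defs")

class Client:
    def ping(self):
        return "real"

defs.Client = Client
sys.modules["defs"] = defs

app = types.ModuleType("app")
exec(
    "from defs import Client\n"
    "def call():\n"
    "    return Client().ping()\n",
    app.__dict__,
)
sys.modules["app"] = app

# Patching the definition site leaves app's own binding untouched...
with mock.patch("defs.Client") as m:
    m.return_value.ping.return_value = "mocked"
    definition_site_result = app.call()  # still "real"

# ...while patching the lookup site is what the test actually needs.
with mock.patch("app.Client") as m:
    m.return_value.ping.return_value = "mocked"
    lookup_site_result = app.call()  # "mocked"
```

The same reasoning is why the tests must patch "ambient_runner.app.AmbientGRPCClient" rather than the module that defines the class.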

In `@components/runners/ambient-runner/tests/test_events_endpoint.py`:
- Around line 66-79: The test test_registers_queue_before_streaming creates a
local asyncio.Queue q and puts a terminal event into it but never registers it
with active_streams, so it doesn't actually verify queue-registration behavior;
either remove the unused q and its put_nowait call, or register it into
active_streams (e.g., active_streams["t-1"] = q) before creating the bridge via
_make_bridge so the endpoint will use that queue, and then after
client.stream("GET", "/events/t-1") and resp.read() assert the expected
registration/consumption from active_streams to confirm the endpoint
registered/used the queue.

In `@components/runners/ambient-runner/tests/test_grpc_transport.py`:
- Around line 152-181: The test doesn't verify bridge.run was invoked; modify
test_user_message_triggers_bridge_run to set a sentinel (e.g., an asyncio.Event
or boolean flag) inside the fake_run assigned to bridge.run (refer to fake_run
and bridge.run) and await that sentinel with a timeout after listener.ready
instead of just sleeping; ensure you still cancel the listener task as before
(listener._listen_loop and listener.ready) so the test fails if bridge.run never
fires.
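
The sentinel-with-timeout idea looks like this in isolation (fake_run and the task scheduling are simplified stand-ins for the listener invoking bridge.run):

```python
import asyncio

async def main() -> str:
    ran = asyncio.Event()

    async def fake_run(*args, **kwargs):
        ran.set()  # sentinel: proves the run hook actually fired

    # Stand-in for the gRPC listener dispatching a user message to bridge.run.
    asyncio.create_task(fake_run())

    # Await the sentinel with a timeout instead of sleeping blindly;
    # a TimeoutError here means the hook never fired and the test fails.
    await asyncio.wait_for(ran.wait(), timeout=1.0)
    return "bridge.run fired"

result = asyncio.run(main())
```

If fake_run is never scheduled, wait_for raises instead of letting the test pass silently.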

In `@Makefile`:
- Line 68: Update the CONTROL_PLANE_IMAGE definition and all places still
referencing the old image name "vteam_control_plane" to "ambient_control_plane"
so manifests and the new build/push/load targets use the correct image;
specifically, change the CONTROL_PLANE_IMAGE variable and any other Makefile
variables/targets that set or reference "vteam_control_plane" (e.g., in
build/push/load targets and image-related variables) to use
"ambient_control_plane:$(IMAGE_TAG)" so the control-plane rollout uses the
correct image name.

Comment on lines +41 to +42
Reconcilers: parseReconcilers(envOrDefault("RECONCILERS", "tally,kube")),
RunnerImage: envOrDefault("RUNNER_IMAGE", "quay.io/ambient_code/vteam_claude_runner:latest"),

⚠️ Potential issue | 🟠 Major

Reject or fully default invalid RECONCILERS values.

Load()'s normal default is tally,kube, but parseReconcilers() falls back to only tally when the env is whitespace/comma-only. That silently disables the kube reconciler on bad config and breaks the main provisioning path instead of restoring the full defaults or failing fast.

Suggested change
 func parseReconcilers(reconcilersStr string) []string {
-	if reconcilersStr == "" {
-		return []string{"tally"}
+	if strings.TrimSpace(reconcilersStr) == "" {
+		return []string{"tally", "kube"}
 	}
@@
-	if len(result) == 0 {
-		return []string{"tally"}
+	if len(result) == 0 {
+		return []string{"tally", "kube"}
 	}

Also applies to: 76-91

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/internal/config/config.go` around lines 41 -
42, The current behavior in Load() uses
parseReconcilers(envOrDefault("RECONCILERS", "tally,kube")) which causes
parseReconcilers to silently collapse whitespace/comma-only values into just
"tally", disabling the kube reconciler; update parseReconcilers (and any other
parsing used in the 76-91 region) to validate the raw env string: if the env is
empty/only whitespace/only commas treat it as invalid and either (a) return the
full default set ["tally","kube"] or (b) return an error that Load() surfaces
and fails fast — then update Load() to handle the error path (or accept the
returned default) so invalid RECONCILERS never silently drop kube; reference
functions/fields: parseReconcilers, Load, Reconcilers, envOrDefault and apply
the same validation to the similar parsing code in lines 76-91.

Comment on lines +118 to +125
func (inf *Informer) Run(ctx context.Context) error {
inf.logger.Info().Msg("performing initial list sync")

if err := inf.initialSync(ctx); err != nil {
inf.logger.Warn().Err(err).Msg("initial sync failed, will rely on watch events")
}

go inf.dispatchLoop(ctx)

⚠️ Potential issue | 🔴 Critical

Initial sync can deadlock before the dispatcher starts.

initialSync() enqueues every listed object through dispatchBlocking(), but dispatchLoop() is started afterward. Once the combined initial dataset exceeds the 256-event buffer, startup blocks forever with no consumer on eventCh. Start the dispatcher before syncing or dispatch initial-list events directly.

🛠 Suggested fix
 func (inf *Informer) Run(ctx context.Context) error {
+	go inf.dispatchLoop(ctx)
+
 	inf.logger.Info().Msg("performing initial list sync")
 
 	if err := inf.initialSync(ctx); err != nil {
 		inf.logger.Warn().Err(err).Msg("initial sync failed, will rely on watch events")
 	}
-
-	go inf.dispatchLoop(ctx)
 
 	inf.wireWatchHandlers()

Also applies to: 400-405

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/internal/informer/informer.go` around lines
118 - 125, The initialSync currently enqueues every listed object via
dispatchBlocking while dispatchLoop is started after, which can deadlock when
the initial dataset exceeds the 256-event buffer; change Run to start the
dispatcher before performing the initial sync (move the go inf.dispatchLoop(ctx)
call to before calling inf.initialSync(ctx)) or alternatively have initialSync
use a non-blocking enqueue method (replace dispatchBlocking calls with a
non-blocking/try-enqueue variant) so initial list processing cannot block
waiting for a consumer; update Informer.Run, initialSync, and any calls to
dispatchBlocking to ensure the dispatcher (dispatchLoop) is running or that
enqueues are non-blocking.

Comment on lines +129 to +133
inf.logger.Info().Msg("starting gRPC watch streams")
inf.watchManager.Run(ctx)

return ctx.Err()
}

⚠️ Potential issue | 🟠 Major

Treat context cancellation as a clean shutdown.

Run() returns ctx.Err() unconditionally after watchManager.Run(ctx), so a normal SIGTERM/SIGINT becomes context.Canceled and main() logs kube mode failed / test mode failed on expected shutdown. Return nil for expected cancellation, or special-case it in the caller.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/internal/informer/informer.go` around lines
129 - 133, The function currently returns ctx.Err() unconditionally after
calling watchManager.Run(ctx), which causes normal cancellations (e.g.,
SIGINT/SIGTERM -> context.Canceled) to be treated as failures; update the return
behavior in the function containing inf.watchManager.Run(ctx) and Run() so that
expected context cancellation is treated as a clean shutdown: after
watchManager.Run(ctx) check ctx.Err() and return nil for context.Canceled and
context.DeadlineExceeded (or otherwise special-case expected cancellation),
otherwise return the actual error; reference the Run() caller and
inf.watchManager.Run(ctx) to locate where to implement this check.

Comment on lines +88 to +99
func buildRestConfig(kubeconfig string) (*rest.Config, error) {
if kubeconfig != "" {
return clientcmd.BuildConfigFromFlags("", kubeconfig)
}

home, _ := os.UserHomeDir()
localPath := home + "/.kube/config"
if _, err := os.Stat(localPath); err == nil {
return clientcmd.BuildConfigFromFlags("", localPath)
}

return rest.InClusterConfig()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's read the full kubeclient.go file to understand the context
wc -l components/ambient-control-plane/internal/kubeclient/kubeclient.go

Repository: ambient-code/platform

Length of output: 135


🏁 Script executed:

# Get full file content
cat -n components/ambient-control-plane/internal/kubeclient/kubeclient.go



🏁 Script executed:

# Search for how this client is instantiated and what environment it runs in
rg -A 5 -B 5 "buildRestConfig" components/ambient-control-plane/



🏁 Script executed:

# Look for deployment manifests or container configurations for the control-plane
fd -e yaml -e yml | xargs rg -l "ambient.*control.*plane|kubeclient" | head -20



🏁 Script executed:

# Search for any documentation about this service
fd -e md -e txt | xargs rg -l "ambient.*control.*plane" | head -10



🏁 Script executed:

# Check the control-plane service manifest
cat -n components/manifests/base/ambient-control-plane-service.yml



🏁 Script executed:

# Check the control-plane environment patches
cat -n components/manifests/overlays/kind-local/control-plane-env-patch.yaml



🏁 Script executed:

# Check the README for control-plane
cat -n components/ambient-control-plane/README.md



🏁 Script executed:

# Search for how the kubeconfig parameter is initialized
rg -n "New\(" components/ambient-control-plane/ --type go -A 3



🏁 Script executed:

# Check if KUBECONFIG env var is used anywhere
rg -n "KUBECONFIG" components/ambient-control-plane/ --type go -B 2 -A 2



🏁 Script executed:

# Check all deployment overlays for volume mounts
cat -n components/manifests/overlays/production/control-plane-image-patch.yaml



🏁 Script executed:

# Check kind overlays
cat -n components/manifests/overlays/kind/kustomization.yaml | head -50



🏁 Script executed:

# Look for any volume mounts that might be added via overlays
rg -n "volumeMounts|volumes:" components/manifests/ --type yaml -A 5



Prefer in-cluster credentials before probing ~/.kube/config.

This code runs in a control-plane service, but the current fallback order lets any mounted or baked-in home kubeconfig override the pod's service-account credentials. That can silently point reconciliation at the wrong cluster. Keep the explicit kubeconfig path for local use, then try rest.InClusterConfig() before any home-directory fallback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/internal/kubeclient/kubeclient.go` around
lines 88 - 99, The buildRestConfig function currently prefers a local
~/.kube/config over in-cluster service-account credentials; change the
resolution order so that after honoring an explicit kubeconfig parameter
(clientcmd.BuildConfigFromFlags when kubeconfig != ""), you attempt
rest.InClusterConfig() next and return it if successful, and only if in-cluster
fails fall back to checking the home kubeconfig via os.UserHomeDir and
clientcmd.BuildConfigFromFlags(localPath); update error handling so you return
the in-cluster config on success and only probe localPath when
rest.InClusterConfig() returns an error.

Comment on lines +161 to +162
func (kc *KubeClient) DeletePodsByLabel(ctx context.Context, namespace, labelSelector string) error {
return kc.dynamic.Resource(PodGVR).Namespace(namespace).DeleteCollection(ctx, metav1.DeleteOptions{}, metav1.ListOptions{LabelSelector: labelSelector})

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n components/ambient-control-plane/internal/kubeclient/kubeclient.go | head -230



🏁 Script executed:

rg "DeletePodsByLabel|DeleteServicesByLabel|DeleteSecretsByLabel|DeleteServiceAccountsByLabel|DeleteRolesByLabel|DeleteRoleBindingsByLabel" --type go -A 2 -B 2



🏁 Script executed:

grep -n "selector\s*=" components/ambient-control-plane/internal/reconciler/kube_reconciler.go | head -20



🏁 Script executed:

rg "selector\s*:=" components/ambient-control-plane/internal/reconciler/kube_reconciler.go -A 1 -B 1



🏁 Script executed:

rg "func sessionLabelSelector" components/ambient-control-plane/internal/reconciler/kube_reconciler.go -A 3



Reject empty selectors before calling DeleteCollection.

An empty labelSelector makes these helpers target every resource of that kind in the namespace. A missing or empty session ID would wipe all pods, services, secrets, service accounts, roles, and role bindings instead of just the reconciled set. This vulnerability exists across six methods: DeletePodsByLabel, DeleteServicesByLabel, DeleteSecretsByLabel, DeleteServiceAccountsByLabel, DeleteRolesByLabel, and DeleteRoleBindingsByLabel (lines 161–162, 174–175, 187–188, 200–201, 213–214, 217–218).

Extract validation into a shared helper method to guard all deletion operations:

Suggested hardening
+func (kc *KubeClient) deleteCollectionByLabel(ctx context.Context, gvr schema.GroupVersionResource, namespace, resourceName, labelSelector string) error {
+	if labelSelector == "" {
+		return fmt.Errorf("refusing to delete %s in namespace %q without a label selector", resourceName, namespace)
+	}
+
+	return kc.dynamic.Resource(gvr).Namespace(namespace).DeleteCollection(
+		ctx,
+		metav1.DeleteOptions{},
+		metav1.ListOptions{LabelSelector: labelSelector},
+	)
+}
+
 func (kc *KubeClient) DeletePodsByLabel(ctx context.Context, namespace, labelSelector string) error {
-	return kc.dynamic.Resource(PodGVR).Namespace(namespace).DeleteCollection(ctx, metav1.DeleteOptions{}, metav1.ListOptions{LabelSelector: labelSelector})
+	return kc.deleteCollectionByLabel(ctx, PodGVR, namespace, "pods", labelSelector)
 }

Apply the same pattern to all six Delete*ByLabel methods.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/internal/kubeclient/kubeclient.go` around
lines 161 - 162, All six deletion helpers (DeletePodsByLabel,
DeleteServicesByLabel, DeleteSecretsByLabel, DeleteServiceAccountsByLabel,
DeleteRolesByLabel, DeleteRoleBindingsByLabel) currently call DeleteCollection
with an empty labelSelector which can delete every resource in a namespace; add
a shared validation helper (e.g., ensureNonEmptyLabelSelector(ctx,
labelSelector) or validateLabelSelector) that returns an error if labelSelector
is empty or whitespace, call it at the start of each Delete*ByLabel method, and
return that error instead of invoking kc.dynamic.Resource(...).DeleteCollection
when the selector is missing; apply the same pattern to all six methods so no
deletion proceeds with an empty selector.

Comment on lines +248 to +253
# 4. Create a session (no initial_prompt — idle, waiting for messages)
curl -s -X POST http://localhost:18000/api/ambient/v1/sessions \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-H "X-Ambient-Project: smoke-test" \
-d '{"name":"smoke-session","project_id":"smoke-test"}' | python3 -m json.tool

⚠️ Potential issue | 🟠 Major

The smoke test still uses the removed direct session-creation flow.

This document calls POST /sessions directly, but the design docs added in this PR say sessions are created only via POST /agents/{id}/ignite. Leaving the old flow here will send testers through an API path the new model explicitly removes. Update the example and the expected-flow section to create an agent first and ignite it.

Also applies to: 315-324

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/internal/developer/agent-workflow.md` around lines 248 - 253, The docs
still show the removed direct session creation using POST /sessions; change the
example and expected-flow to first create an agent (POST /agents with the
required payload) and then call POST /agents/{id}/ignite to create/ignite the
session instead of POST /sessions, and update the second example block the same
way; specifically replace the POST /sessions curl example with a create-agent
request followed by POST /agents/{id}/ignite and adjust any explanatory text to
reference the agent-first + ignite flow.

Comment on lines +53 to 54
"ambient_control_plane:latest"
)

⚠️ Potential issue | 🟡 Minor

Verification output won't show ambient_control_plane.

The ambient_control_plane:latest image is correctly added to the loading array, but the verification command on line 93 filters with grep vteam_, so this image won't appear in the verification output. Consider updating the grep pattern or removing the filter for more accurate verification.

Suggested fix
 echo "Verifying images in cluster..."
-$CONTAINER_ENGINE exec "${KIND_CLUSTER_NAME}-control-plane" crictl images | grep vteam_ | head -n 5
+$CONTAINER_ENGINE exec "${KIND_CLUSTER_NAME}-control-plane" crictl images | grep -E 'vteam_|ambient_control_plane' | head -n 10
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@e2e/scripts/load-images.sh` around lines 53 - 54, The verification step is
filtering loaded images with grep 'vteam_' so the image
"ambient_control_plane:latest" in the images array won't show up; update the
verification command that currently uses grep vteam_ (the verification pipeline
near the end of the script) to either remove the 'vteam_' filter or expand it to
include ambient_control_plane (or a more general pattern like matching both
vteam_ and ambient_control_plane) so the verification output accurately reflects
all images in the images array.

migration-session-filters: "new-sessions-only"
```

### Phase 3: Safe Decomissioning (2 weeks)

⚠️ Potential issue | 🟡 Minor

Minor typo: "Decomissioning" → "Decommissioning".

-### Phase 3: Safe Decomissioning (2 weeks)
+### Phase 3: Safe Decommissioning (2 weeks)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
### Phase 3: Safe Decomissioning (2 weeks)
### Phase 3: Safe Decommissioning (2 weeks)
🧰 Tools
🪛 LanguageTool

[grammar] ~519-~519: Ensure spelling is correct
Context: ...w-sessions-only" ``` ### Phase 3: Safe Decomissioning (2 weeks) Only after **100% of session...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@REMOVE_CRDs.md` at line 519, Fix the typo in the Phase 3 heading by replacing
"Safe Decomissioning (2 weeks)" with "Safe Decommissioning (2 weeks)" so the
word "Decommissioning" is spelled correctly in the heading line that currently
reads "Phase 3: Safe Decomissioning (2 weeks)".

Comment on lines +149 to +164
# Step 6: Try to send a message to the session (if it supports it)
log "${BLUE}💬 Step 6: Testing session interaction${NC}"

MESSAGE_RESPONSE=$(oc exec deployment/backend-api -n ${NAMESPACE} -- curl -s -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $(oc whoami -t)" \
-H "X-Ambient-Project: ${NAMESPACE}" \
-d "{\"content\":\"Test message from e2e test\"}" \
"${API_SERVER_URL}/api/ambient/v1/sessions/${SESSION_ID}/messages" 2>/dev/null || true)

if echo "$MESSAGE_RESPONSE" | jq -e '.id' >/dev/null 2>&1; then
MESSAGE_ID=$(echo "$MESSAGE_RESPONSE" | jq -r '.id')
log "${GREEN}✅ Successfully sent message: ${MESSAGE_ID}${NC}"
else
log "${YELLOW}⚠️ Message sending not available or failed (expected for current implementation)${NC}"
fi

⚠️ Potential issue | 🟠 Major

Make this e2e fail when the runner event path is broken.

Line 157 explicitly allows the session-message POST to fail, and Step 8 only checks control-plane watch logs. This means the script can still pass when the new runner-side gRPC push and /events fan-out never work.

Also applies to: 187-197

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test-e2e-control-plane.sh` around lines 149 - 164, The POST that sends a
session message (the oc exec curl that populates MESSAGE_RESPONSE and then tests
for .id) currently swallows errors (note the "|| true") and only logs a warning;
change it so the request failure causes the script to fail the test: remove the
"|| true", ensure you validate MESSAGE_RESPONSE contains an "id" (using jq -e
'.id') and if not, log an error and exit 1. Apply the same change to the similar
block around the later message/test (the other POST block that sets
MESSAGE_RESPONSE / MESSAGE_ID at lines ~187-197) so missing runner-side gRPC
push or /events fan-out fails the e2e run. Ensure you reference
MESSAGE_RESPONSE, MESSAGE_ID, API_SERVER_URL and SESSION_ID when implementing
the check.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (4)
components/ambient-control-plane/cmd/ambient-control-plane/main.go (1)

180-186: ⚠️ Potential issue | 🟠 Major

Clone http.DefaultTransport instead of replacing it.

Line 181 discards the default transport's proxy, timeout, keep-alive, idle-connection, and HTTP/2 settings. That changes the behavior of every HTTP client in-process just to install the extra CA.

Suggested change
 func installServiceCAIntoDefaultTransport(pool *x509.CertPool) {
-	http.DefaultTransport = &http.Transport{
-		TLSClientConfig: &tls.Config{
-			MinVersion: tls.VersionTLS12,
-			RootCAs:    pool,
-		},
-	}
+	transport := http.DefaultTransport.(*http.Transport).Clone()
+	if transport.TLSClientConfig == nil {
+		transport.TLSClientConfig = &tls.Config{}
+	} else {
+		transport.TLSClientConfig = transport.TLSClientConfig.Clone()
+	}
+	transport.TLSClientConfig.MinVersion = tls.VersionTLS12
+	transport.TLSClientConfig.RootCAs = pool
+	http.DefaultTransport = transport
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/cmd/ambient-control-plane/main.go` around
lines 180 - 186, installServiceCAIntoDefaultTransport currently replaces
http.DefaultTransport and loses existing proxy, timeouts, keep-alive,
idle-connection and HTTP/2 settings; instead, detect if http.DefaultTransport is
an *http.Transport, deep-copy it (e.g., dereference/copy the struct), preserve
its Proxy/IdleConn/MaxConns/IdleConnTimeout/ForceAttemptHTTP2 and other fields,
then update or create TLSClientConfig on the copied transport to set MinVersion
= tls.VersionTLS12 and add/replace RootCAs with the provided pool, and finally
assign the copied transport back to http.DefaultTransport so only the CA changes
while all other original transport settings remain intact.
components/ambient-control-plane/internal/reconciler/kube_reconciler.go (2)

292-320: ⚠️ Potential issue | 🟠 Major

The shared RoleBinding name only works for the first namespace.

ambient-image-puller is always created in RunnerImageNamespace, but its subject changes with namespace. After the first create, later namespaces hit AlreadyExists on Line 316 and silently reuse a binding that still points at the old subject, so those pods do not get pull access.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`
around lines 292 - 320, The RoleBinding name is constant so the first create
wins and later namespaces silently reuse the old subject; in
ensureImagePullAccess change the binding name (variable name / metadata.name in
rb) to be unique per target namespace (e.g., include the namespace via
fmt.Sprintf("ambient-image-puller-%s", namespace)) so each call creates a
distinct RoleBinding, and keep the CreateRoleBinding call as-is; alternatively,
if you prefer a single binding, modify the code to fetch the existing
RoleBinding and update its "subjects" to include the new namespace before
returning (handle update via r.kube.UpdateRoleBinding) instead of treating
IsAlreadyExists as success. Ensure you update the rb construction referenced in
ensureImagePullAccess and the CreateRoleBinding error handling accordingly.

523-523: ⚠️ Potential issue | 🟠 Major

Handle NestedMap errors before copying the Vertex secret.

Line 523 drops the unstructured.NestedMap error, so a malformed source secret can be copied as empty or partial data and still look successful. Fail fast here before creating the destination secret.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go` at
line 523, The call to unstructured.NestedMap (data, _, _ :=
unstructured.NestedMap(src.Object, "data")) swallows the returned error so a
malformed src secret can be copied; change this to capture and check the error
from unstructured.NestedMap and return/fail fast when err != nil before
proceeding to create or populate the destination/Vertex secret (i.e., replace
the ignored error with proper err handling near the code that reads src.Object
and abort the reconcile or return the error so the destination secret is not
created from invalid data).
components/ambient-control-plane/go.mod (1)

11-11: ⚠️ Potential issue | 🔴 Critical

Upgrade google.golang.org/grpc before shipping this service.

Line 11 still pins v1.79.1, which is flagged in the provided OSV results as GHSA-p77j-4mvh-x3m3. This PR adds a new control-plane binary that opens gRPC connections, so carrying a known-vulnerable transport dependency here is blocking. Bump to a patched release and refresh go.sum.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/go.mod` at line 11, Update the pinned
dependency google.golang.org/grpc in the module manifest to a patched release
(bump the version from v1.79.1 to a non-vulnerable tag such as the latest
patched v1.x.y), then run the Go tooling to refresh checksums (e.g., go get
google.golang.org/grpc@vX.Y.Z && go mod tidy or go mod vendor as appropriate) so
go.mod and go.sum reflect the upgraded, fixed release; ensure the new version
resolves the GHSA-p77j-4mvh-x3m3 advisory before merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/ambient-control-plane-tests.yml:
- Around line 6-15: The workflow path filters are missing the ambient API
server; update the paths arrays under both the push and pull_request triggers to
include 'components/ambient-api-server/**' so changes to the API server (which
is replaced by ../ambient-api-server in components/ambient-control-plane/go.mod)
will trigger the ambient-control-plane-tests workflow; modify the paths entries
alongside the existing 'components/ambient-control-plane/**' and
'components/ambient-sdk/go-sdk/**' lines in
.github/workflows/ambient-control-plane-tests.yml.

In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`:
- Around line 257-287: The current ensureNamespaceExists returns early when
r.kube.GetNamespace succeeds so existing namespaces never run
ensureImagePullAccess; also when ensureImagePullAccess fails it only logs a
warning and lets provisioning continue. Change ensureNamespaceExists (the
function using r.kube.GetNamespace, r.kube.CreateNamespace and calling
ensureImagePullAccess) to always call ensureImagePullAccess when
r.cfg.RunnerImageNamespace is set, regardless of whether the namespace was newly
created or preexisting, and treat errors from ensureImagePullAccess as blocking
by returning the error (wrap with context like "ensuring image pull access for
namespace %s: %w") instead of only logging a warning so provisioning halts if
image-pull access cannot be granted.
- Around line 189-213: The cleanup code currently sets cleanupErr when
r.kube.DeletePodsByLabel / DeleteSecretsByLabel / DeleteServiceAccountsByLabel /
DeleteServicesByLabel fail but still returns nil at the end; change the final
return to propagate the failure (return cleanupErr or a wrapped error) so
callers see cleanup failures instead of success. Keep the existing
metrics.SessionCleanupTotal increment and logging behavior, but after updating
the metric/logging branch, ensure the function returns cleanupErr when non-nil
(or a combined error if you prefer aggregating multiple errors).

---

Duplicate comments:
In `@components/ambient-control-plane/cmd/ambient-control-plane/main.go`:
- Around line 180-186: installServiceCAIntoDefaultTransport currently replaces
http.DefaultTransport and loses existing proxy, timeouts, keep-alive,
idle-connection and HTTP/2 settings; instead, detect if http.DefaultTransport is
an *http.Transport, deep-copy it (e.g., dereference/copy the struct), preserve
its Proxy/IdleConn/MaxConns/IdleConnTimeout/ForceAttemptHTTP2 and other fields,
then update or create TLSClientConfig on the copied transport to set MinVersion
= tls.VersionTLS12 and add/replace RootCAs with the provided pool, and finally
assign the copied transport back to http.DefaultTransport so only the CA changes
while all other original transport settings remain intact.

In `@components/ambient-control-plane/go.mod`:
- Line 11: Update the pinned dependency google.golang.org/grpc in the module
manifest to a patched release (bump the version from v1.79.1 to a non-vulnerable
tag such as the latest patched v1.x.y), then run the Go tooling to refresh
checksums (e.g., go get google.golang.org/grpc@vX.Y.Z && go mod tidy or go mod
vendor as appropriate) so go.mod and go.sum reflect the upgraded, fixed release;
ensure the new version resolves the GHSA-p77j-4mvh-x3m3 advisory before merging.

In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`:
- Around line 292-320: The RoleBinding name is constant so the first create wins
and later namespaces silently reuse the old subject; in ensureImagePullAccess
change the binding name (variable name / metadata.name in rb) to be unique per
target namespace (e.g., include the namespace via
fmt.Sprintf("ambient-image-puller-%s", namespace)) so each call creates a
distinct RoleBinding, and keep the CreateRoleBinding call as-is; alternatively,
if you prefer a single binding, modify the code to fetch the existing
RoleBinding and update its "subjects" to include the new namespace before
returning (handle update via r.kube.UpdateRoleBinding) instead of treating
IsAlreadyExists as success. Ensure you update the rb construction referenced in
ensureImagePullAccess and the CreateRoleBinding error handling accordingly.
- Line 523: The call to unstructured.NestedMap (data, _, _ :=
unstructured.NestedMap(src.Object, "data")) swallows the returned error so a
malformed src secret can be copied; change this to capture and check the error
from unstructured.NestedMap and return/fail fast when err != nil before
proceeding to create or populate the destination/Vertex secret (i.e., replace
the ignored error with proper err handling near the code that reads src.Object
and abort the reconcile or return the error so the destination secret is not
created from invalid data).
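The per-namespace RoleBinding naming suggested in the first point can be sketched as follows (the helper and map layout are illustrative; only the `ambient-image-puller-%s` pattern comes from the review):

```go
package main

import "fmt"

// bindingName derives the RoleBinding name from the target namespace, so each
// namespace gets its own binding instead of silently reusing the first one created.
func bindingName(namespace string) string {
	return fmt.Sprintf("ambient-image-puller-%s", namespace)
}

// newRoleBinding sketches the rb construction with the per-namespace name.
func newRoleBinding(namespace, imageNamespace string) map[string]interface{} {
	return map[string]interface{}{
		"apiVersion": "rbac.authorization.k8s.io/v1",
		"kind":       "RoleBinding",
		"metadata": map[string]interface{}{
			"name":      bindingName(namespace),
			"namespace": imageNamespace, // the binding lives where the images are
		},
	}
}

func main() {
	a := newRoleBinding("team-a", "ambient-images")
	b := newRoleBinding("team-b", "ambient-images")
	fmt.Println(a["metadata"].(map[string]interface{})["name"]) // ambient-image-puller-team-a
	fmt.Println(b["metadata"].(map[string]interface{})["name"]) // ambient-image-puller-team-b
}
```

With distinct names per namespace, the `IsAlreadyExists` short-circuit no longer hides a stale subject list.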
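The fail-fast lookup asked for in the second point can be sketched with a local stand-in that mirrors the `(map, found, err)` signature of apimachinery's `unstructured.NestedMap` (the real dependency is omitted to keep the example self-contained; names are illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// nestedMap mimics unstructured.NestedMap: it reports both absence and
// malformed values instead of letting callers ignore them.
func nestedMap(obj map[string]interface{}, field string) (map[string]interface{}, bool, error) {
	v, ok := obj[field]
	if !ok {
		return nil, false, nil
	}
	m, ok := v.(map[string]interface{})
	if !ok {
		return nil, false, fmt.Errorf("field %q is %T, not a map", field, v)
	}
	return m, true, nil
}

// copySecretData captures all three returns and aborts before the destination
// secret is created from invalid source data.
func copySecretData(src map[string]interface{}) (map[string]interface{}, error) {
	data, found, err := nestedMap(src, "data")
	if err != nil {
		return nil, fmt.Errorf("reading source secret data: %w", err) // malformed: abort
	}
	if !found {
		return nil, errors.New("source secret has no data field")
	}
	return data, nil
}

func main() {
	_, err := copySecretData(map[string]interface{}{"data": "not-a-map"})
	fmt.Println(err != nil) // true: malformed source is rejected, not copied
}
```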

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6ffee65c-da79-44a5-bb12-069256734255

📥 Commits

Reviewing files that changed from the base of the PR and between 94e5980 and 62c3bb8.

⛔ Files ignored due to path filters (1)
  • components/ambient-control-plane/go.sum is excluded by !**/*.sum
📒 Files selected for processing (5)
  • .github/workflows/ambient-control-plane-tests.yml
  • components/ambient-control-plane/cmd/ambient-control-plane/main.go
  • components/ambient-control-plane/go.mod
  • components/ambient-control-plane/internal/metrics/metrics.go
  • components/ambient-control-plane/internal/reconciler/kube_reconciler.go

Comment on lines +6 to +15
    paths:
      - 'components/ambient-control-plane/**'
      - 'components/ambient-sdk/go-sdk/**'
      - '.github/workflows/ambient-control-plane-tests.yml'
  pull_request:
    branches: [main]
    paths:
      - 'components/ambient-control-plane/**'
      - 'components/ambient-sdk/go-sdk/**'
      - '.github/workflows/ambient-control-plane-tests.yml'

⚠️ Potential issue | 🟠 Major

Add components/ambient-api-server/** to these path filters.

components/ambient-control-plane/go.mod replaces github.com/ambient-code/platform/components/ambient-api-server with ../ambient-api-server, so API-server changes can break this module without rerunning this workflow.

Suggested change
  push:
    branches: [main]
    paths:
      - 'components/ambient-control-plane/**'
+     - 'components/ambient-api-server/**'
      - 'components/ambient-sdk/go-sdk/**'
      - '.github/workflows/ambient-control-plane-tests.yml'
  pull_request:
    branches: [main]
    paths:
      - 'components/ambient-control-plane/**'
+     - 'components/ambient-api-server/**'
      - 'components/ambient-sdk/go-sdk/**'
      - '.github/workflows/ambient-control-plane-tests.yml'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ambient-control-plane-tests.yml around lines 6 - 15, The
workflow path filters are missing the ambient API server; update the paths
arrays under both the push and pull_request triggers to include
'components/ambient-api-server/**' so changes to the API server (which is
replaced by ../ambient-api-server in components/ambient-control-plane/go.mod)
will trigger the ambient-control-plane-tests workflow; modify the paths entries
alongside the existing 'components/ambient-control-plane/**' and
'components/ambient-sdk/go-sdk/**' lines in
.github/workflows/ambient-control-plane-tests.yml.

Comment on lines +189 to +213
	var cleanupErr error
	if err := r.kube.DeletePodsByLabel(ctx, namespace, selector); err != nil && !k8serrors.IsNotFound(err) {
		r.logger.Warn().Err(err).Str("session_id", session.ID).Msg("deleting pods during cleanup")
		cleanupErr = err
	}
	if err := r.kube.DeleteSecretsByLabel(ctx, namespace, selector); err != nil && !k8serrors.IsNotFound(err) {
		r.logger.Warn().Err(err).Str("session_id", session.ID).Msg("deleting secrets during cleanup")
		cleanupErr = err
	}
	if err := r.kube.DeleteServiceAccountsByLabel(ctx, namespace, selector); err != nil && !k8serrors.IsNotFound(err) {
		r.logger.Warn().Err(err).Str("session_id", session.ID).Msg("deleting service accounts during cleanup")
		cleanupErr = err
	}
	if err := r.kube.DeleteServicesByLabel(ctx, namespace, selector); err != nil && !k8serrors.IsNotFound(err) {
		r.logger.Warn().Err(err).Str("session_id", session.ID).Msg("deleting services during cleanup")
		cleanupErr = err
	}

	if cleanupErr != nil {
		metrics.SessionCleanupTotal.WithLabelValues("error").Inc()
	} else {
		metrics.SessionCleanupTotal.WithLabelValues("success").Inc()
		r.logger.Info().Str("session_id", session.ID).Str("namespace", namespace).Msg("session resources cleaned up")
	}
	return nil

⚠️ Potential issue | 🟠 Major

Return cleanup failures instead of swallowing them.

When any delete on Lines 190-204 fails, cleanupErr is set but Line 213 still returns nil. That makes deleted-session reconciles look successful to the caller and leaves leaked resources much harder to detect.

Suggested change
 	if cleanupErr != nil {
 		metrics.SessionCleanupTotal.WithLabelValues("error").Inc()
-	} else {
-		metrics.SessionCleanupTotal.WithLabelValues("success").Inc()
-		r.logger.Info().Str("session_id", session.ID).Str("namespace", namespace).Msg("session resources cleaned up")
+		return cleanupErr
 	}
+	metrics.SessionCleanupTotal.WithLabelValues("success").Inc()
+	r.logger.Info().Str("session_id", session.ID).Str("namespace", namespace).Msg("session resources cleaned up")
 	return nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`
around lines 189 - 213, The cleanup code currently sets cleanupErr when
r.kube.DeletePodsByLabel / DeleteSecretsByLabel / DeleteServiceAccountsByLabel /
DeleteServicesByLabel fail but still returns nil at the end; change the final
return to propagate the failure (return cleanupErr or a wrapped error) so
callers see cleanup failures instead of success. Keep the existing
metrics.SessionCleanupTotal increment and logging behavior, but after updating
the metric/logging branch, ensure the function returns cleanupErr when non-nil
(or a combined error if you prefer aggregating multiple errors).

Comment on lines +257 to +287
func (r *SimpleKubeReconciler) ensureNamespaceExists(ctx context.Context, namespace string, session types.Session) error {
	if _, err := r.kube.GetNamespace(ctx, namespace); err == nil {
		return nil
	}

	ns := &unstructured.Unstructured{
		Object: map[string]interface{}{
			"apiVersion": "v1",
			"kind":       "Namespace",
			"metadata": map[string]interface{}{
				"name": namespace,
				"labels": map[string]interface{}{
					LabelManaged:   "true",
					LabelProjectID: session.ProjectID,
					LabelManagedBy: "ambient-control-plane",
				},
			},
		},
	}

	if _, err := r.kube.CreateNamespace(ctx, ns); err != nil && !k8serrors.IsAlreadyExists(err) {
		return fmt.Errorf("creating namespace %s: %w", namespace, err)
	}

	r.logger.Info().Str("namespace", namespace).Msg("namespace created for session")

	if r.cfg.RunnerImageNamespace != "" {
		if err := r.ensureImagePullAccess(ctx, namespace); err != nil {
			r.logger.Warn().Err(err).Str("namespace", namespace).Msg("failed to grant image pull access")
		}
	}

⚠️ Potential issue | 🟠 Major

Make image-pull access part of the steady-state reconcile.

Line 258 returns before Lines 283-287 when the namespace already exists, so preexisting namespaces never get image-pull access reconciled here. On top of that, Line 285 only logs a warning and provisioning continues, which can move the session forward even though the runner pod still cannot pull its image.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`
around lines 257 - 287, The current ensureNamespaceExists returns early when
r.kube.GetNamespace succeeds so existing namespaces never run
ensureImagePullAccess; also when ensureImagePullAccess fails it only logs a
warning and lets provisioning continue. Change ensureNamespaceExists (the
function using r.kube.GetNamespace, r.kube.CreateNamespace and calling
ensureImagePullAccess) to always call ensureImagePullAccess when
r.cfg.RunnerImageNamespace is set, regardless of whether the namespace was newly
created or preexisting, and treat errors from ensureImagePullAccess as blocking
by returning the error (wrap with context like "ensuring image pull access for
namespace %s: %w") instead of only logging a warning so provisioning halts if
image-pull access cannot be granted.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/manifests/base/core/ambient-control-plane-deployment.yaml`:
- Around line 24-54: Add HTTP health endpoints and Kubernetes probes: update the
Go service in components/ambient-control-plane/cmd/ambient-control-plane/main.go
to register /healthz (liveness) and /readyz (readiness) on the existing metrics
HTTP mux (port 9090) and ensure they return 200 when healthy and non-2xx when
unhealthy; then update the ambient-control-plane container spec in
ambient-control-plane-deployment.yaml to add a livenessProbe (httpGet /healthz
on port 9090 with reasonable initialDelaySeconds/periodSeconds/failureThreshold)
and a readinessProbe (httpGet /readyz on port 9090 with reasonable
timeout/period/failureThreshold) for the container named ambient-control-plane
so Kubernetes can detect and restart/unready failing pods.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5cc5d3aa-dd7a-4d37-9d38-8de88412fad2

📥 Commits

Reviewing files that changed from the base of the PR and between 62c3bb8 and 448c441.

📒 Files selected for processing (3)
  • components/manifests/base/core/ambient-control-plane-deployment.yaml
  • components/manifests/base/core/kustomization.yaml
  • components/manifests/base/kustomization.yaml

Comment on lines +24 to +54
      - name: ambient-control-plane
        image: quay.io/ambient_code/vteam_control_plane:latest
        imagePullPolicy: Always
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop: ["ALL"]
        env:
          - name: AMBIENT_API_TOKEN
            valueFrom:
              secretKeyRef:
                name: ambient-control-plane-token
                key: token
          - name: AMBIENT_API_SERVER_URL
            value: "https://ambient-api-server.ambient-code.svc:8000"
          - name: AMBIENT_GRPC_SERVER_ADDR
            value: "ambient-api-server.ambient-code.svc:9000"
          - name: AMBIENT_GRPC_USE_TLS
            value: "true"
          - name: MODE
            value: "kube"
          - name: LOG_LEVEL
            value: "info"
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
          limits:
            cpu: 200m
            memory: 256Mi

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Search for health check endpoint handlers in the control-plane Go code

# Search for common health endpoint patterns
rg -n -C3 --type=go -g '!**/vendor/**' -g '!**/test/**' '/healthz|/readyz|/health|healthcheck' components/ambient-control-plane/

Repository: ambient-code/platform

Length of output: 47


🏁 Script executed:

#!/bin/bash
# Find the control-plane Go code structure
find . -path "*/ambient-control-plane/*" -type f -name "*.go" | head -20

Repository: ambient-code/platform

Length of output: 1108


🏁 Script executed:

#!/bin/bash
# Check if the control-plane directory exists and list its structure
ls -la components/ambient-control-plane/ 2>/dev/null || echo "Directory not found at expected location"
# Try alternative paths
find . -maxdepth 3 -type d -name "*control-plane*" | grep -v node_modules | grep -v ".git"

Repository: ambient-code/platform

Length of output: 821


🏁 Script executed:

#!/bin/bash
# Search for any HTTP server setup, main function, or service initialization
rg -n --type=go -g '!**/vendor/**' 'func main|http.Server|http.Handle|ListenAndServe|/metrics|port.*9090' --max-count=20

Repository: ambient-code/platform

Length of output: 7585


🏁 Script executed:

#!/bin/bash
# Read the control-plane main.go file to understand server setup
cat -n components/ambient-control-plane/cmd/ambient-control-plane/main.go

Repository: ambient-code/platform

Length of output: 9000


🏁 Script executed:

#!/bin/bash
# Check if there are any health-related functions defined in the control-plane package
rg -n --type=go 'health|ready|liveness' components/ambient-control-plane/ -i

Repository: ambient-code/platform

Length of output: 3026


Add health check endpoints and probes to the control-plane container.

The ambient-control-plane lacks health probes, preventing Kubernetes from detecting failures or unresponsiveness. The current implementation only exposes /metrics on port 9090; it does not implement /healthz or /readyz endpoints.

To add health probes, you must:

  1. Implement health check endpoints in the Go service (components/ambient-control-plane/cmd/ambient-control-plane/main.go):

    • Add /healthz (liveness) and /readyz (readiness) endpoints to the HTTP mux (port 9090)
    • Return appropriate status codes based on service health
  2. Add probes to the deployment once endpoints are available:

    • livenessProbe: /healthz with appropriate timeout/threshold values
    • readinessProbe: /readyz with appropriate timeout/threshold values

Without these probes, rolling updates may proceed with broken pods, and failed instances won't restart automatically.

🧰 Tools
🪛 Trivy (0.69.3)

[warning] 24-54: Image tag ":latest" used

Container 'ambient-control-plane' of Deployment 'ambient-control-plane' should specify an image tag

Rule: KSV-0013

Learn more

(IaC/Kubernetes)


[info] 24-54: Runs with UID <= 10000

Container 'ambient-control-plane' of Deployment 'ambient-control-plane' should set 'securityContext.runAsUser' > 10000

Rule: KSV-0020

Learn more

(IaC/Kubernetes)


[info] 24-54: Runs with GID <= 10000

Container 'ambient-control-plane' of Deployment 'ambient-control-plane' should set 'securityContext.runAsGroup' > 10000

Rule: KSV-0021

Learn more

(IaC/Kubernetes)


[warning] 24-54: Restrict container images to trusted registries

Container ambient-control-plane in deployment ambient-control-plane (namespace: default) uses an image from an untrusted registry.

Rule: KSV-0125

Learn more

(IaC/Kubernetes)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/manifests/base/core/ambient-control-plane-deployment.yaml` around
lines 24 - 54, Add HTTP health endpoints and Kubernetes probes: update the Go
service in components/ambient-control-plane/cmd/ambient-control-plane/main.go to
register /healthz (liveness) and /readyz (readiness) on the existing metrics
HTTP mux (port 9090) and ensure they return 200 when healthy and non-2xx when
unhealthy; then update the ambient-control-plane container spec in
ambient-control-plane-deployment.yaml to add a livenessProbe (httpGet /healthz
on port 9090 with reasonable initialDelaySeconds/periodSeconds/failureThreshold)
and a readinessProbe (httpGet /readyz on port 9090 with reasonable
timeout/period/failureThreshold) for the container named ambient-control-plane
so Kubernetes can detect and restart/unready failing pods.

@markturansky force-pushed the feat/cp-grpc-runner-v1 branch from 55337c5 to e2cfb8e on March 24, 2026 23:39
markturansky and others added 3 commits March 24, 2026 19:40
…ent streaming

- New ambient-control-plane binary: gRPC watch + list-sync informer, reconcilers for sessions/projects/project-settings, KubeClient wrapper
- Runner: gRPC transport bridge, _grpc_client, _session_messages_api, events endpoint
- Makefile targets: build-control-plane, deploy-openshift, local-sync-version
- Supporting docs, architecture notes, stress/tally tests

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…validation

- Add ambient-control-plane image to components-build-deploy.yml and prod-release-deploy.yaml
- Add Dockerfile build validation step to ambient-control-plane-tests.yml (PR-only, no push)
- Fix deploy-to-openshift condition; add CP to e2e image loader

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…tionalize image sources

- Move ambient-control-plane-service.yml → base/core/ambient-control-plane-deployment.yaml
- Add CP RBAC (ClusterRole, ClusterRoleBinding, ServiceAccount) to base/rbac/
- Fix image name throughout: ambient_control_plane → vteam_control_plane
- Production: quay.io (no patch needed); local-dev: internal OpenShift registry patch
- kind/kind-local: localhost image remaps; production: quay.io from base
- Add manifests/README.md documenting all overlay environments
- Add gRPC route, API server DB, deploy scripts

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@markturansky force-pushed the feat/cp-grpc-runner-v1 branch from e2cfb8e to 7dea845 on March 24, 2026 23:41
…ploy-with-dispatch

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@ambient-code

ambient-code bot commented Mar 30, 2026

Review Queue Status

| Check     | Status | Detail                                          |
| --------- | ------ | ----------------------------------------------- |
| CI        | FAIL   | End-to-End Tests, Ambient Runner Tests, summary |
| Conflicts | FAIL   | Merge conflicts present                         |
| Reviews   | pass   |                                                 |

Action needed: Fix failing CI checks; Resolve merge conflicts

Auto-generated by Review Queue workflow. Updated when PR changes.

@markturansky
Contributor Author

closing this in favor of alpha branch
