feat(cp+runner+manifests): ambient-control-plane Go service, gRPC runner event streaming #1005
markturansky wants to merge 4 commits into main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior; use the review commands or the checkboxes below to manage reviews.
Walkthrough
This pull request introduces the changes summarized below.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant API as Ambient API Server
    participant CP as Control Plane
    participant K8s as Kubernetes
    participant Runner as Runner Pod
    Client->>API: Create Session
    API-->>Client: Session ID, Credentials
    API->>CP: gRPC Watch (Session event)
    CP->>CP: Receive SessionAdded event
    CP->>K8s: Create Namespace
    CP->>K8s: Create ServiceAccount
    CP->>K8s: Create RoleBinding
    CP->>K8s: Create Secret (token)
    CP->>K8s: Create Pod (runner)
    CP->>K8s: Create Service
    K8s-->>Runner: Pod scheduled & started
    CP->>API: Update Session Status (Running)
    Client->>Runner: Send message (via Service)
    Runner-->>Client: Response (via Service)
    Client->>API: Delete Session
    API->>CP: gRPC Watch (Session event)
    CP->>CP: Receive SessionDeleted event
    CP->>K8s: Delete Pod
    CP->>K8s: Delete Service
    CP->>K8s: Delete Secrets
    CP->>K8s: Delete RBAC
    CP->>API: Update Session Status (Stopped)
```

```mermaid
sequenceDiagram
    participant Runner as Runner Pod
    participant GRPC as gRPC Watch Stream
    participant Listener as GRPCSessionListener
    participant Bridge as ClaudeBridge
    participant Worker as SessionWorker
    participant Model as Claude SDK
    Runner->>GRPC: Connect & authenticate
    GRPC-->>Runner: Stream established
    Runner->>Listener: Watch loop receives message
    Listener->>Listener: Parse SessionMessage (user event)
    Listener->>Bridge: run(input_data) async
    Bridge->>Worker: Submit prompt to queue
    Worker->>Model: Send message
    Model-->>Worker: Receive response stream
    Worker->>Bridge: Yield AG-UI events
    Bridge->>Listener: Forward events
    Listener->>Listener: Accumulate MESSAGES_SNAPSHOT
    Listener->>Bridge: emit RUN_FINISHED
    Listener->>GRPC: push(event_type=assistant, payload)
    GRPC-->>Listener: Ack
    Listener->>Runner: Remove from active_streams
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 36
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/ambient-control-plane/cmd/ambient-control-plane/main.go`:
- Around line 171-178: installServiceCAIntoDefaultTransport currently replaces
http.DefaultTransport with a new http.Transport losing tuned defaults; instead
obtain the existing transport (assert http.DefaultTransport to *http.Transport),
clone it (use the Transport.Clone method or shallow copy), set/merge its
TLSClientConfig to ensure MinVersion = tls.VersionTLS12 and RootCAs = pool
(preserving other TLS settings if present), then assign the cloned transport
back to http.DefaultTransport so all original connection/timeouts/keep-alive
settings are preserved.
In `@components/ambient-control-plane/go.mod`:
- Line 11: Update the pinned dependency google.golang.org/grpc from v1.79.1 to
v1.79.3 (or later) in the go.mod and then regenerate the module checksum file
(go.sum) so the updated release that fixes CVE-2026-33186 is recorded; locate
the dependency entry for google.golang.org/grpc in go.mod, change the version to
v1.79.3+, and run your project's dependency resolution (e.g., module
update/tidy) to refresh go.sum.
In `@components/ambient-control-plane/internal/informer/informer.go`:
- Around line 400-405: The current Informer.dispatchBlocking blocks when
inf.eventCh is full, which can stall synchronous watch handlers in watchManager
and harm gRPC watch streams; change dispatchBlocking in informer.go to avoid
blocking: replace the blocking send on inf.eventCh with a non-blocking select
that attempts to send, and on channel-full fallback log a warning (use
processLogger or the component logger) and increment a new drop/overload metric
to detect overloads; alternatively expose/configure the buffer size for
inf.eventCh (increase from 256) if expected load requires it, but prefer the
non-blocking send + warning/metric approach inside Informer.dispatchBlocking to
prevent stalling watch handlers.
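The non-blocking send the prompt asks for can be sketched language-agnostically; below is a minimal Python analogue (the Go version would use a `select` with a `default` case, plus a logger/metric — names here are illustrative, not the PR's code):

```python
import queue

def dispatch_non_blocking(event_ch: queue.Queue, event, stats: dict) -> None:
    """Try to enqueue; when the buffer is full, drop the event and record it
    instead of blocking the caller (stand-in for Go's select/default)."""
    try:
        event_ch.put_nowait(event)
    except queue.Full:
        # The real implementation would log a warning and bump a drop metric here.
        stats["dropped"] = stats.get("dropped", 0) + 1

ch = queue.Queue(maxsize=2)
stats = {}
for i in range(5):
    dispatch_non_blocking(ch, i, stats)
print(stats["dropped"])  # → 3
```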
In `@components/ambient-control-plane/internal/kubeclient/kubeclient.go`:
- Around line 93-97: The code currently ignores the error from os.UserHomeDir()
and builds localPath via string concatenation; update the logic in the kube
client setup so you check and handle the error returned by os.UserHomeDir(), and
construct the kube config path using filepath.Join(home, ".kube", "config")
instead of home + "/.kube/config"; then use os.Stat(localPath) as before and
pass the resulting localPath to clientcmd.BuildConfigFromFlags("", localPath)
only when the home lookup succeeded and the file exists.
In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`:
- Line 476: unstructured.NestedMap call is ignoring its error which can hide
malformed secret data; replace the blank error capture with proper handling by
calling unstructured.NestedMap(src.Object, "data") into (data, found, err) and
then check err (and optionally found) — if err is non-nil return or log a clear
error (including context like the secret name/namespace or src.Object) and stop
the copy operation so invalid data isn't propagated; update any callers of this
block to handle the returned error if necessary.
- Around line 660-665: Remove the custom min function implementation and use
Go's built-in min instead: delete the local function named min and replace its
usages with the predeclared min call (e.g., min(x, y)); ensure there are no
local name conflicts and that the module's Go version remains >= 1.21 (project
targets 1.24), then run go vet/build to verify no import or shadowing issues.
- Around line 344-346: The local variable currently declared as "secretName :=
secretName(session.ID)" shadows the helper function "secretName", so rename the
variable to avoid shadowing (e.g., "secretNameVal" or "saSecretName") and update
all subsequent references in this scope; keep the call to the helper function
"secretName(session.ID)" unchanged but assign its result to the new variable
name alongside the existing "saName := serviceAccountName(session.ID)"
assignment.
In `@components/ambient-control-plane/internal/reconciler/project_reconciler.go`:
- Around line 104-110: The current creatorRoleBindingName function truncates the
sanitized subject and callers only check for existence by name, which lets stale
ambient-code.io/created-by annotations or different long identities collide;
update creatorRoleBindingName to derive the name from the full subject plus a
stable hash (e.g., sha1/hex) instead of simple truncation, and change the
reconciler logic that currently returns when a RoleBinding with that name exists
to instead fetch the existing RoleBinding (by name), compare its subjects and
ambient-code.io/created-by annotation to the desired subject, and either update
the subjects/annotation in-place or recreate the binding when they differ (use
functions like creatorRoleBindingName and the reconcile path that currently does
the existence check to locate the RoleBinding to update).
- Around line 67-96: The namespace creation is racy: after
r.kube.GetNamespace(...) returns not found, a concurrent reconcile may create
the namespace before r.kube.CreateNamespace(...) runs, causing an AlreadyExists
error to be treated as fatal; update the CreateNamespace error handling in
project_reconciler.go (around r.kube.GetNamespace and r.kube.CreateNamespace) to
treat k8serrors.IsAlreadyExists(err) as a successful outcome (log/debug and
return nil) instead of returning an error so the reconcile becomes idempotent
and converges.
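The hash-suffixed naming the first finding describes can be sketched as follows (a Python illustration of the idea, assuming sha1 and a 63-character name limit; the real helper lives in project_reconciler.go):

```python
import hashlib
import re

def creator_role_binding_name(subject: str, max_len: int = 63) -> str:
    """Derive a stable, collision-resistant name: a sanitized prefix of the
    subject plus a short hash of the *untruncated* identity, so two long
    subjects sharing a prefix no longer collide after truncation."""
    digest = hashlib.sha1(subject.encode()).hexdigest()[:8]
    sanitized = re.sub(r"[^a-z0-9-]", "-", subject.lower())
    prefix = sanitized[: max_len - len(digest) - 1].strip("-")
    return f"{prefix}-{digest}"

a = creator_role_binding_name("system:serviceaccount:very-long-namespace-name:user-one")
b = creator_role_binding_name("system:serviceaccount:very-long-namespace-name:user-two")
```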
In `@components/ambient-control-plane/internal/reconciler/project_settings_reconciler.go`:
- Around line 130-139: The loop over entries currently logs errors from
ensureGroupRoleBinding but always returns nil, hiding partial failures; update
the reconcile function (the loop that iterates over entries and calls
ensureGroupRoleBinding) to collect errors (e.g., append to an error slice) and
at the end return an aggregated error (or return immediately on first
non-recoverable error) instead of nil; ensure the returned error includes
context (namespace, group, role) similar to the existing log entries so callers
can detect and handle partial reconciliation failures.
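The collect-then-aggregate shape could look like this (Python sketch with illustrative names; the Go version would use `errors.Join` or an aggregated `fmt.Errorf`):

```python
def reconcile_group_bindings(entries, ensure_binding):
    """Run every binding, collect failures, and surface them together
    instead of logging each one and returning success."""
    failures = []
    for namespace, group, role in entries:
        try:
            ensure_binding(namespace, group, role)
        except Exception as exc:
            # Keep the same context the existing log entries carry.
            failures.append(f"group {group} -> {role} in {namespace}: {exc}")
    if failures:
        raise RuntimeError("partial reconcile: " + "; ".join(failures))

def ensure_binding(namespace, group, role):
    if group == "bad-group":
        raise PermissionError("forbidden")

try:
    reconcile_group_bindings(
        [("proj-a", "dev-team", "edit"), ("proj-a", "bad-group", "admin")],
        ensure_binding,
    )
    error_seen = None
except RuntimeError as exc:
    error_seen = str(exc)
```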
In `@components/ambient-control-plane/internal/reconciler/shared.go`:
- Around line 88-96: The namespaceForSession function uses session.ProjectID
verbatim (only lowercased) which can produce invalid Kubernetes namespace names;
update namespaceForSession to sanitize/validate session.ProjectID (derived from
Project.Name) into a DNS-1123 label: convert to lowercase, replace any character
not [a-z0-9-] with hyphen, collapse consecutive hyphens, trim leading/trailing
hyphens, and truncate to 63 characters (ensuring not empty) and fall back to
"default" if the result is empty; additionally add validation at project
creation (where Project.Name is set) to reject names that cannot be normalized
to a valid DNS-1123 label so invalid Project.Name values are caught early.
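The sanitization steps above can be sketched directly (Python illustration of the DNS-1123 normalization; function name is hypothetical):

```python
import re

def sanitize_dns1123(name: str, fallback: str = "default") -> str:
    """Normalize an arbitrary project ID into a DNS-1123 label: lowercase,
    [a-z0-9-] only, collapsed hyphens, trimmed, at most 63 characters,
    falling back to a default when nothing survives."""
    s = re.sub(r"[^a-z0-9-]", "-", name.lower())
    s = re.sub(r"-+", "-", s).strip("-")[:63].strip("-")
    return s or fallback
```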
In `@components/ambient-control-plane/internal/reconciler/stress_test.go`:
- Around line 344-347: The test spawns goroutines that call t.Errorf (unsafe);
change the goroutine to capture the error returned by tallyReconciler.Reconcile
(called with ctx and event) and send a formatted error string or the error value
into a concurrent-safe collector (e.g., a buffered channel or a slice guarded by
a sync.Mutex), then after the WaitGroup.Wait iterate the collected errors and
call t.Errorf or t.Fatal from the main test goroutine; ensure you reference
tallyReconciler.Reconcile, event, index, and the WaitGroup when implementing the
collector.
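The collector pattern translates cleanly to any language; here is a Python sketch (threads standing in for goroutines, a lock-guarded list standing in for the channel — the point is that only the main test thread reports failures after the join):

```python
import threading

def run_workers(work, count: int) -> list:
    """Workers append failures to a lock-guarded list; the caller inspects
    them after join, mirroring 'never call t.Errorf from a goroutine'."""
    errors, lock = [], threading.Lock()

    def worker(i):
        try:
            work(i)
        except Exception as exc:
            with lock:
                errors.append(f"worker {i}: {exc}")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors

def reconcile(i):
    if i % 2 == 1:
        raise ValueError("reconcile failed")

errs = run_workers(reconcile, 4)
```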
In `@components/ambient-control-plane/internal/reconciler/tally_reconciler.go`:
- Around line 98-121: handleSessionDeleted can decrement r.tally.TotalSessions
below zero when a delete arrives for a session that was never counted; guard the
decrement by checking r.tally.TotalSessions > 0 before doing TotalSessions-- (or
clamp to zero) and optionally emit a warning/log when a delete is ignored so
it’s visible; keep the existing guarded decrements for SessionsByPhase and
SessionsByUser but add the TotalSessions floor check in the handleSessionDeleted
method to prevent negative totals.
In `@components/ambient-control-plane/internal/watcher/watcher.go`:
- Around line 124-149: The loop in watcher.go increments attempt forever so
backoffDuration(attempt) grows even after successful reconnects; modify the loop
around wm.watchOnce(ctx, resource) to reset attempt to 0 when a watch was
established successfully (i.e., when wm.watchOnce returns nil and ctx isn't
canceled), leaving attempt increments only for consecutive failures; use the
existing attempt variable, wm.watchOnce, and backoffDuration to implement this
reset so that backoff only accrues across consecutive errors.
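The reset-on-success behavior can be sketched like this (Python illustration; the Go loop would do the same with its `attempt` variable around `wm.watchOnce`):

```python
def next_attempt(attempt: int, watch_succeeded: bool) -> int:
    """Reset the counter whenever a watch was established successfully;
    only consecutive failures accrue backoff."""
    return 0 if watch_succeeded else attempt + 1

def backoff_seconds(attempt: int, base: int = 1, cap: int = 30) -> int:
    # Exponential backoff with a ceiling (illustrative parameters).
    return min(cap, base * (2 ** attempt))

attempt = 0
history = []
for ok in [False, False, True, False]:
    attempt = next_attempt(attempt, ok)
    history.append(backoff_seconds(attempt))
```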
In `@components/manifests/base/core/ambient-api-server-service.yml`:
- Around line 81-82: The flags --https-cert-file and --https-key-file point to
/secrets/tls/ which doesn't exist because the TLS secret is mounted at
/etc/tls/; update the API server args (the --https-cert-file and
--https-key-file values) to reference /etc/tls/tls.crt and /etc/tls/tls.key (or
alternatively change the volumeMount path from /etc/tls/ to /secrets/tls/),
ensuring the flag values and the volumeMount location (the TLS secret mount)
match.
In `@components/manifests/deploy`:
- Around line 82-112: The script writes the plaintext CLIENT_SECRET_VALUE to
/tmp/ambient-frontend-oauthclient.yaml and echoes it on failure (uses
CLIENT_SECRET_VALUE, /tmp/ambient-frontend-oauthclient.yaml and OAUTH_APPLY_RC);
change to avoid creating a world-readable temp file and avoid printing the
secret: generate the manifest on stdin and apply directly (pipe a heredoc into
oc apply -f - and capture the exit code) or create a secure temp file (mktemp)
and chmod 600 immediately, then remove it; also redact the secret in all
user-facing echo/usage output (replace ${CLIENT_SECRET_VALUE} with a placeholder
like [REDACTED] when printing the manual oc apply instructions) and ensure the
echo block that prints the fallback manifest no longer emits the real
CLIENT_SECRET_VALUE.
- Around line 423-424: The "Next steps" output contains a duplicate step number:
find the echo that prints "4. Monitor RFE workflows:" (the echo -e line with
that exact text) and renumber it to the correct sequence (e.g., "5. Monitor RFE
workflows:") so the step numbers are unique and sequential; ensure any
subsequent step labels in the same block are adjusted consistently.
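For the first finding, the secure-temp-file variant can be sketched in Python (the heredoc-to-stdin approach the review prefers avoids the file entirely; names here are illustrative). `tempfile.mkstemp` creates the file with `O_CREAT|O_EXCL` and mode 0600, so the plaintext secret is never world-readable:

```python
import os
import tempfile

def write_secret_manifest(contents: str) -> str:
    """Write the manifest with owner-only permissions from the start,
    unlike a fixed world-readable /tmp path."""
    fd, path = tempfile.mkstemp(suffix=".yaml")
    try:
        os.write(fd, contents.encode())
    finally:
        os.close(fd)
    return path

path = write_secret_manifest("kind: OAuthClient\nsecret: [REDACTED]\n")
mode = os.stat(path).st_mode & 0o777
os.unlink(path)  # remove the manifest as soon as it has been applied
```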
In `@components/manifests/overlays/production/control-plane-env-patch.yaml`:
- Around line 15-16: The RUNNER_IMAGE environment variable is using an unpinned
tag ("quay.io/ambient_code/vteam_claude_runner:latest") which is unsafe for
production; update the RUNNER_IMAGE value to a fixed image tag or immutable
digest (e.g., "quay.io/ambient_code/vteam_claude_runner:vX.Y.Z" or
"quay.io/ambient_code/vteam_claude_runner@sha256:...") so deployments are
reproducible and predictable.
In `@components/runners/ambient-runner/ambient_runner/_grpc_client.py`:
- Around line 1-5: The file header incorrectly states "Code generated ... DO NOT
EDIT" even though ambient_runner._grpc_client appears hand-written; update the
top-of-file comment to accurately reflect this by removing or replacing the
generated-code warning and adding a clear note that the file is manually
maintained (or, if it truly is generated, restore the canonical generator
metadata and automation); ensure any replacement mentions the module name
ambient_runner._grpc_client and the reason for the change so maintainers know
whether the file should be edited or regenerated.
- Around line 24-34: The _load_ca_cert function currently swallows OSError when
a candidate CA file exists but can't be opened, which hides permission or IO
issues; update _load_ca_cert to log a warning on except OSError (include the
path and the exception message) so failures to read an existing file are visible
(use the module logger via logging.getLogger(__name__) or the project's standard
logger), then continue to the next candidate and preserve the existing return
behavior.
In `@components/runners/ambient-runner/ambient_runner/_session_messages_api.py`:
- Around line 77-84: Don't log raw payload_preview at INFO level; update the
logger calls that build payload_preview (the logger.info call referencing
session_id, event_type, len(payload), payload_preview and any other logger.info
that uses payload_preview) to only emit session_id, event_type and payload
length at INFO, and move the generation and logging of a preview behind an
explicit debug/preview gate (e.g. check logger.isEnabledFor(logging.DEBUG) or a
configurable redact_preview flag) so previews are only produced and logged when
debugging is enabled and otherwise suppressed or redacted.
- Around line 54-64: The SessionMessagesAPI.push() method is synchronous and
currently being called directly from async contexts, which blocks the event
loop; update the two call sites — _push_initial_prompt_via_grpc (in app.py) and
GRPCMessageWriter._write_message (in grpc_transport.py) — to offload the push
call using the event loop's run_in_executor (e.g., await
loop.run_in_executor(None, lambda: client.session_messages.push(...))) instead
of calling client.session_messages.push(...) directly so push() runs in a thread
and does not block async I/O; keep the existing offloading for watch() as-is.
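The offload described in the second finding looks like this in miniature (the `blocking_push` function is a stand-in for the synchronous `push()`; the real call sites would pass their own arguments):

```python
import asyncio
import time

def blocking_push(session_id: str, payload: str) -> str:
    """Stand-in for the synchronous SessionMessagesAPI.push()."""
    time.sleep(0.05)  # simulated network round-trip
    return f"ack:{session_id}"

async def push_async(session_id: str, payload: str) -> str:
    # Run the blocking call in a worker thread so the event loop
    # keeps servicing other I/O while the push is in flight.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_push, session_id, payload)

ack = asyncio.run(push_async("s-1", "{}"))
```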
In `@components/runners/ambient-runner/ambient_runner/app.py`:
- Around line 129-131: The code currently calls ClaudeBridge._setup_platform()
and accesses ClaudeBridge._grpc_listener.ready directly, coupling app logic to
private internals; add a public method on ClaudeBridge (e.g.
prepare_grpc_listener() or get_grpc_ready()) that performs the setup and returns
the ready awaitable/async event, update the caller to await
bridge.prepare_grpc_listener() instead of calling _setup_platform() and touching
_grpc_listener, and remove any remaining direct references to the private
members (_setup_platform and _grpc_listener) so the runner depends only on the
new public API.
- Around line 356-403: The _push_initial_prompt_via_grpc function can leak the
AmbientGRPCClient if an exception happens before client.close(); wrap the
lifecycle so the client is always closed: create the client via
AmbientGRPCClient.from_env() and then use a try/finally where the existing logic
(payload build, client.session_messages.push, logging) stays in the try block
and client.close() is invoked in the finally; ensure you guard the finally close
with a check that the client variable was successfully created (e.g., client is
not None) to avoid NameError if from_env() itself raises.
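The guarded try/finally lifecycle for the second finding can be sketched as follows (the client class here is a fake standing in for `AmbientGRPCClient`):

```python
class FakeGRPCClient:
    """Stand-in for AmbientGRPCClient, to demonstrate the lifecycle."""
    def __init__(self):
        self.closed = False

    @classmethod
    def from_env(cls):
        return cls()

    def push(self, payload):
        raise RuntimeError("stream reset")  # simulate a mid-push failure

    def close(self):
        self.closed = True

def push_initial_prompt(client_cls=FakeGRPCClient):
    client = None
    try:
        client = client_cls.from_env()
        client.push("{...}")
    except RuntimeError:
        pass  # the real code would log; the point is the finally below
    finally:
        if client is not None:  # guards against from_env() itself raising
            client.close()
    return client

c = push_initial_prompt()
```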
In `@components/runners/ambient-runner/ambient_runner/bridges/claude/grpc_transport.py`:
- Around line 129-194: The loop can leak ThreadPoolExecutor instances when an
exception triggers the reconnect path because executor.shutdown() is only called
on the normal and CancelledError paths; update the exception handler in the
async loop that creates executor/ watch_future (the block that calls
self._watch_in_thread with msg_queue, loop, stop_event, last_seq) to always call
stop_event.set() and executor.shutdown(wait=False) before awaiting sleep/backoff
or continuing, ensuring watch_future is also cancelled if not done; reset
backoff as before and preserve last_seq; make these changes around the variables
executor, watch_future, stop_event and backoff so executors are cleaned up on
all paths.
- Around line 199-207: When RunnerInput.model_validate_json(msg.payload) fails,
the exception is currently swallowed; update the except block to log the
original parse exception at DEBUG along with the offending msg.payload before
constructing the fallback RunnerInput (references:
RunnerInput.model_validate_json, RunnerInput, msg.payload, self._session_id).
Ensure you capture the exception object (e.g., as exc) and call the module/class
logger.debug with a message that includes the exception details and payload so
malformed payloads can be diagnosed.
- Around line 335-338: In _write_message (an async method) the synchronous
blocking call to self._grpc_client.session_messages.push(self._session_id,
event_type="assistant", payload=payload) must be offloaded to a thread so it
doesn't block the asyncio event loop; replace the direct push call with an
awaited call using asyncio.to_thread(...) or
asyncio.get_event_loop().run_in_executor(...) to invoke session_messages.push
with the same args (self._session_id, event_type, payload) and await its
completion.
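For the first finding, the clean-up-on-all-paths shape can be sketched like this (a simplified reconnect loop; the real method's variables and watch body differ):

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

async def watch_with_reconnect(iterations: int) -> int:
    """Each loop turn owns an executor; stop_event.set() and
    executor.shutdown() run in a finally, so no executor leaks when an
    exception sends us down the reconnect path."""
    started = 0
    for _ in range(iterations):
        stop_event = threading.Event()
        executor = ThreadPoolExecutor(max_workers=1)
        try:
            started += 1
            executor.submit(stop_event.wait, 1.0)  # stand-in for _watch_in_thread
            raise ConnectionError("stream dropped")  # simulate the failure path
        except ConnectionError:
            await asyncio.sleep(0)  # backoff would go here
        finally:
            stop_event.set()
            executor.shutdown(wait=False)
    return started

n = asyncio.run(watch_with_reconnect(3))
```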
In `@components/runners/ambient-runner/ambient_runner/endpoints/events.py`:
- Around line 61-63: active_streams[thread_id] = queue currently overwrites any
existing subscriber and allows one connection to orphan another; change this to
either reject a new subscriber if active_streams already contains thread_id or
support multiple subscribers by storing a list/collection of queues per
thread_id, and in both cases make cleanup identity-based (only remove the queue
instance you created) inside the finally blocks (e.g., in wait_for_events and
the other places setting active_streams) by checking that
active_streams[thread_id] is the same queue object (or remove the queue from the
list) before popping/deleting it to avoid dropping events for other connections;
apply the same pattern for the other occurrences where you set active_streams
(the spots referenced as 109-110, 155-156, 183-184).
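The multi-subscriber variant with identity-based cleanup might look like this (illustrative helper names; the endpoint code would inline these in its try/finally blocks):

```python
import asyncio
from collections import defaultdict

active_streams: dict[str, list] = defaultdict(list)

def subscribe(thread_id: str) -> asyncio.Queue:
    q = asyncio.Queue()
    active_streams[thread_id].append(q)
    return q

def unsubscribe(thread_id: str, q: asyncio.Queue) -> None:
    """Identity-based cleanup: remove only the queue this connection
    created, so a second subscriber is never orphaned."""
    queues = active_streams.get(thread_id, [])
    if q in queues:
        queues.remove(q)
    if not queues:
        active_streams.pop(thread_id, None)

def publish(thread_id: str, event) -> None:
    for q in active_streams.get(thread_id, []):
        q.put_nowait(event)

q1 = subscribe("t-1")
q2 = subscribe("t-1")
publish("t-1", "hello")
unsubscribe("t-1", q1)       # q2 keeps receiving
publish("t-1", "world")
```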
In `@components/runners/ambient-runner/tests/test_app_initial_prompt.py`:
- Around line 34-35: The tests patch the class at its definition site instead of
where it's looked up; change all patch targets referencing
"ambient_runner._grpc_client.AmbientGRPCClient" in this test file to
"ambient_runner.app.AmbientGRPCClient" so the mocked class is used by the code
that imported it (the function under test _push_initial_prompt_via_grpc in
app.py); update every occurrence mentioned (the patches around lines where
AmbientGRPCClient is used in the tests) to target the app module's name binding.
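The "patch where it's looked up" rule can be demonstrated in miniature (the class below is a stand-in; in the real tests the target string would be `ambient_runner.app.AmbientGRPCClient`):

```python
from unittest.mock import patch

class AmbientGRPCClient:
    """Stand-in for ambient_runner._grpc_client.AmbientGRPCClient."""
    @classmethod
    def from_env(cls):
        return cls()

def push_initial_prompt():
    # Imagine app.py did `from ambient_runner._grpc_client import AmbientGRPCClient`:
    # the name is bound in the importing module, so the patch target must be
    # that module's binding, not the defining module's.
    return AmbientGRPCClient.from_env()

with patch(f"{__name__}.AmbientGRPCClient") as mock_cls:
    push_initial_prompt()
```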
In `@components/runners/ambient-runner/tests/test_bridge_claude.py`:
- Around line 53-75: The test is currently patching ClaudeBridge._setup_platform
and then assigning bridge._grpc_listener manually so it never verifies listener
creation; remove the patch that replaces _setup_platform, patch
GRPCSessionListener to return a mock_listener_instance, call/await the real
bridge._setup_platform (or the public setup entry that triggers it) with
AMBIENT_GRPC_URL set, and then assert that GRPCSessionListener was called and
bridge._grpc_listener is the mock_listener_instance (use symbols ClaudeBridge,
_setup_platform, GRPCSessionListener, bridge._grpc_listener, and the
AMBIENT_GRPC_URL env var to locate the code).
- Around line 19-29: The tests fail because ClaudeBridge.__init__ does not
initialize the gRPC-related attributes; update ClaudeBridge.__init__ to set
self._grpc_listener (type GRPCSessionListener | None) to None and
self._active_streams to an empty dict so both _grpc_listener and _active_streams
exist on construction; add these initializations alongside the other existing
attribute setups in ClaudeBridge.__init__ (referencing the ClaudeBridge class
and its __init__ method, and the symbols _grpc_listener and _active_streams).
In `@components/runners/ambient-runner/tests/test_events_endpoint.py`:
- Around line 66-79: The test creates an unused Queue (q) that isn't added to
active_streams, so either remove that unused q or actually assert the endpoint
registered its own queue; update test_registers_queue_before_streaming to delete
the q creation (q = asyncio.Queue / q.put_nowait...) and after consuming the
response body (resp.read()) assert that 't-1' is present in active_streams and
that active_streams['t-1'] is an asyncio.Queue (or alternatively move creation
of q before calling _make_bridge and assign active_streams['t-1'] = q if you
intend to pre-register the queue); reference
test_registers_queue_before_streaming, active_streams, _make_bridge,
client.stream, and resp.read.
In `@components/runners/ambient-runner/tests/test_grpc_transport.py`:
- Around line 152-181: The test test_user_message_triggers_bridge_run currently
never verifies that bridge.run() was invoked; update it to assert the call by
replacing the simple generator fake_run with a callable that signals when
invoked (e.g., set an asyncio.Event or use an AsyncMock) and await that signal
before cancelling the listener. Concretely: keep the existing events and
listener setup, but change bridge.run to a wrapper that sets an event (or
records a call via AsyncMock) when started and then yields the events; after
awaiting listener.ready, await the event with a timeout to ensure bridge.run
fired, then proceed to cancel the task and assert the event/AsyncMock was
triggered. Target symbols: test_user_message_triggers_bridge_run, bridge.run,
fake_run, GRPCSessionListener, listener._listen_loop.
In `@docs/internal/developer/local-development/openshift.md`:
- Around line 231-234: Add a language specifier to the fenced code block that
contains the lines starting with "unknown kid" — wrong token type (oc whoami -t
was used) and "invalid_grant" — RH SSO token expired by changing the opening
fence to include a language identifier (e.g., use ```text) so the block renders
and is accessible; ensure the closing fence remains present after those two
lines.
In `@test-e2e-control-plane.sh`:
- Around line 21-22: TIMEOUT_SECONDS and CHECK_INTERVAL are declared but not
used; update the loops that currently use hardcoded intervals (the loops
referencing 30 and 60 seconds) to use the TIMEOUT_SECONDS and CHECK_INTERVAL
variables (so replace the literal timeout and sleep interval values with
TIMEOUT_SECONDS and CHECK_INTERVAL), or remove the TIMEOUT_SECONDS and
CHECK_INTERVAL declarations if the hardcoded values are intentional; ensure you
update any related echo/log messages to reference the variables if you choose to
use them.
- Around line 224-230: The duration uses start_time which gets reassigned later
(in the runner pod wait loop), so capture the initial test start time into a
separate variable (e.g., TEST_START_TIME or ORIGINAL_START_TIME) immediately
when the script begins and do not modify it; keep using start_time for
short-lived step timers if needed. Update the final summary echo that computes
Duration to use this preserved variable (TEST_START_TIME/ORIGINAL_START_TIME)
instead of start_time so the reported duration reflects the entire test run;
ensure the new variable is exported or in scope for the summary block.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 82f26582-fb6c-464c-98f4-b995276d104c
⛔ Files ignored due to path filters (2)
- components/ambient-control-plane/go.sum is excluded by !**/*.sum
- components/runners/ambient-runner/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (68)
- .github/workflows/ambient-control-plane-tests.yml
- Makefile
- REMOVE_CRDs.md
- components/ambient-control-plane/.gitignore
- components/ambient-control-plane/CLAUDE.md
- components/ambient-control-plane/Dockerfile
- components/ambient-control-plane/Dockerfile.simple
- components/ambient-control-plane/Makefile
- components/ambient-control-plane/README.md
- components/ambient-control-plane/cmd/ambient-control-plane/main.go
- components/ambient-control-plane/go.mod
- components/ambient-control-plane/internal/config/config.go
- components/ambient-control-plane/internal/informer/informer.go
- components/ambient-control-plane/internal/kubeclient/kubeclient.go
- components/ambient-control-plane/internal/kubeclient/kubeclient_test.go
- components/ambient-control-plane/internal/reconciler/kube_reconciler.go
- components/ambient-control-plane/internal/reconciler/project_reconciler.go
- components/ambient-control-plane/internal/reconciler/project_settings_reconciler.go
- components/ambient-control-plane/internal/reconciler/shared.go
- components/ambient-control-plane/internal/reconciler/stress_test.go
- components/ambient-control-plane/internal/reconciler/tally.go
- components/ambient-control-plane/internal/reconciler/tally_reconciler.go
- components/ambient-control-plane/internal/reconciler/tally_test.go
- components/ambient-control-plane/internal/watcher/watcher.go
- components/manifests/base/ambient-api-server-grpc-route.yml
- components/manifests/base/ambient-control-plane-service.yml
- components/manifests/base/core/ambient-api-server-service.yml
- components/manifests/base/kustomization.yaml
- components/manifests/base/platform/ambient-api-server-db.yml
- components/manifests/base/rbac/control-plane-clusterrole.yaml
- components/manifests/base/rbac/control-plane-clusterrolebinding.yaml
- components/manifests/base/rbac/control-plane-sa.yaml
- components/manifests/base/rbac/kustomization.yaml
- components/manifests/components/ambient-api-server-db/ambient-api-server-db-json-patch.yaml
- components/manifests/deploy
- components/manifests/deploy-no-api-server.sh
- components/manifests/deploy.sh
- components/manifests/overlays/kind-local/control-plane-env-patch.yaml
- components/manifests/overlays/kind-local/kustomization.yaml
- components/manifests/overlays/kind/ambient-api-server-jwks-patch.yaml
- components/manifests/overlays/kind/backend-ambient-api-patch.yaml
- components/manifests/overlays/kind/control-plane-env-patch.yaml
- components/manifests/overlays/kind/kustomization.yaml
- components/manifests/overlays/kind/local-image-pull-policy-patch.yaml
- components/manifests/overlays/local-dev/ambient-api-server-db-json-patch.yaml
- components/manifests/overlays/production/control-plane-env-patch.yaml
- components/manifests/overlays/production/control-plane-image-patch.yaml
- components/manifests/overlays/production/kustomization.yaml
- components/runners/ambient-runner/ambient_runner/_grpc_client.py
- components/runners/ambient-runner/ambient_runner/_session_messages_api.py
- components/runners/ambient-runner/ambient_runner/app.py
- components/runners/ambient-runner/ambient_runner/bridge.py
- components/runners/ambient-runner/ambient_runner/bridges/claude/grpc_transport.py
- components/runners/ambient-runner/ambient_runner/bridges/claude/prompts.py
- components/runners/ambient-runner/ambient_runner/endpoints/events.py
- components/runners/ambient-runner/ambient_runner/platform/prompts.py
- components/runners/ambient-runner/architecture.md
- components/runners/ambient-runner/pyproject.toml
- components/runners/ambient-runner/tests/test_app_initial_prompt.py
- components/runners/ambient-runner/tests/test_bridge_claude.py
- components/runners/ambient-runner/tests/test_events_endpoint.py
- components/runners/ambient-runner/tests/test_grpc_transport.py
- docs/internal/design/agent-api.md
- docs/internal/design/blackboard-api.md
- docs/internal/developer/agent-workflow.md
- docs/internal/developer/local-development/openshift.md
- e2e/scripts/load-images.sh
- test-e2e-control-plane.sh
💤 Files with no reviewable changes (2)
- components/manifests/overlays/local-dev/ambient-api-server-db-json-patch.yaml
- components/manifests/overlays/kind/ambient-api-server-jwks-patch.yaml
```go
func installServiceCAIntoDefaultTransport(pool *x509.CertPool) {
	http.DefaultTransport = &http.Transport{
		TLSClientConfig: &tls.Config{
			MinVersion: tls.VersionTLS12,
			RootCAs:    pool,
		},
	}
}
```
Replacing DefaultTransport loses default connection settings.
Creating a new http.Transport with only TLSClientConfig discards the default transport's tuned settings (timeouts, idle connection limits, keep-alive configuration). This may degrade HTTP client performance.
Proposed fix — clone default transport and update TLS config
```diff
 func installServiceCAIntoDefaultTransport(pool *x509.CertPool) {
-	http.DefaultTransport = &http.Transport{
-		TLSClientConfig: &tls.Config{
-			MinVersion: tls.VersionTLS12,
-			RootCAs:    pool,
-		},
-	}
+	// Clone the default transport to preserve connection pooling/timeout settings
+	defaultTransport := http.DefaultTransport.(*http.Transport).Clone()
+	defaultTransport.TLSClientConfig = &tls.Config{
+		MinVersion: tls.VersionTLS12,
+		RootCAs:    pool,
+	}
+	http.DefaultTransport = defaultTransport
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/ambient-control-plane/cmd/ambient-control-plane/main.go` around
lines 171 - 178, installServiceCAIntoDefaultTransport currently replaces
http.DefaultTransport with a new http.Transport losing tuned defaults; instead
obtain the existing transport (assert http.DefaultTransport to *http.Transport),
clone it (use the Transport.Clone method or shallow copy), set/merge its
TLSClientConfig to ensure MinVersion = tls.VersionTLS12 and RootCAs = pool
(preserving other TLS settings if present), then assign the cloned transport
back to http.DefaultTransport so all original connection/timeouts/keep-alive
settings are preserved.
```text
github.com/ambient-code/platform/components/ambient-api-server v0.0.0
github.com/ambient-code/platform/components/ambient-sdk/go-sdk v0.0.0
github.com/rs/zerolog v1.34.0
google.golang.org/grpc v1.79.1
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Current grpc dependency:"
rg -n 'google\.golang\.org/grpc' components/ambient-control-plane/go.mod
echo
echo "Fetching GHSA advisory details from OSV:"
curl -s https://api.osv.dev/v1/vulns/GHSA-p77j-4mvh-x3m3 | jq '{id,summary,aliases,affected}'
echo
echo "After choosing a patched version, verify go.mod no longer pins the affected version."

Repository: ambient-code/platform
Length of output: 1017
Upgrade google.golang.org/grpc from v1.79.1 to v1.79.3 or later.
Line 11 pins google.golang.org/grpc v1.79.1, which contains GHSA-p77j-4mvh-x3m3 (CVE-2026-33186)—an authorization bypass vulnerability via missing leading slash in the :path header. Bump to v1.79.3 or later and regenerate go.sum.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/ambient-control-plane/go.mod` at line 11, Update the pinned
dependency google.golang.org/grpc from v1.79.1 to v1.79.3 (or later) in the
go.mod and then regenerate the module checksum file (go.sum) so the updated
release that fixes CVE-2026-33186 is recorded; locate the dependency entry for
google.golang.org/grpc in go.mod, change the version to v1.79.3+, and run your
project's dependency resolution (e.g., module update/tidy) to refresh go.sum.
func (inf *Informer) dispatchBlocking(ctx context.Context, event ResourceEvent) {
	select {
	case inf.eventCh <- event:
	case <-ctx.Done():
	}
}
Blocking dispatch can stall watch streams under load.
When eventCh is full (256 events), dispatchBlocking blocks the calling watch handler. Since watch handlers are called synchronously by watchManager, this backpressure can stall the gRPC watch stream and potentially cause disconnects or missed events during bursts.
Consider: (1) increasing the buffer based on expected load, (2) adding a non-blocking send with a warning log when the channel is full, or (3) dropping events with metrics to detect overload.
Example: non-blocking with warning
func (inf *Informer) dispatchBlocking(ctx context.Context, event ResourceEvent) {
select {
case inf.eventCh <- event:
+ default:
+ inf.logger.Warn().
+ Str("resource", event.Resource).
+ Str("event_type", string(event.Type)).
+ Str("id", event.Object.GetID()).
+ Msg("event channel full, dropping event")
case <-ctx.Done():
}
}

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
func (inf *Informer) dispatchBlocking(ctx context.Context, event ResourceEvent) {
	select {
	case inf.eventCh <- event:
	default:
		inf.logger.Warn().
			Str("resource", event.Resource).
			Str("event_type", string(event.Type)).
			Str("id", event.Object.GetID()).
			Msg("event channel full, dropping event")
	case <-ctx.Done():
	}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/ambient-control-plane/internal/informer/informer.go` around lines
400 - 405, The current Informer.dispatchBlocking blocks when inf.eventCh is
full, which can stall synchronous watch handlers in watchManager and harm gRPC
watch streams; change dispatchBlocking in informer.go to avoid blocking: replace
the blocking send on inf.eventCh with a non-blocking select that attempts to
send, and on channel-full fallback log a warning (use processLogger or the
component logger) and increment a new drop/overload metric to detect overloads;
alternatively expose/configure the buffer size for inf.eventCh (increase from
256) if expected load requires it, but prefer the non-blocking send +
warning/metric approach inside Informer.dispatchBlocking to prevent stalling
watch handlers.
home, _ := os.UserHomeDir()
localPath := home + "/.kube/config"
if _, err := os.Stat(localPath); err == nil {
	return clientcmd.BuildConfigFromFlags("", localPath)
}
🧹 Nitpick | 🔵 Trivial
Use filepath.Join for path construction.
String concatenation for paths can cause issues on non-Unix systems. Also, the error from os.UserHomeDir() is ignored—if it fails, home will be empty, leading to checking "/.kube/config".
Suggested fix
+import "path/filepath"
+
func buildRestConfig(kubeconfig string) (*rest.Config, error) {
if kubeconfig != "" {
return clientcmd.BuildConfigFromFlags("", kubeconfig)
}
- home, _ := os.UserHomeDir()
- localPath := home + "/.kube/config"
- if _, err := os.Stat(localPath); err == nil {
- return clientcmd.BuildConfigFromFlags("", localPath)
+ if home, err := os.UserHomeDir(); err == nil {
+ localPath := filepath.Join(home, ".kube", "config")
+ if _, err := os.Stat(localPath); err == nil {
+ return clientcmd.BuildConfigFromFlags("", localPath)
+ }
}
return rest.InClusterConfig()
}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/ambient-control-plane/internal/kubeclient/kubeclient.go` around
lines 93 - 97, The code currently ignores the error from os.UserHomeDir() and
builds localPath via string concatenation; update the logic in the kube client
setup so you check and handle the error returned by os.UserHomeDir(), and
construct the kube config path using filepath.Join(home, ".kube", "config")
instead of home + "/.kube/config"; then use os.Stat(localPath) as before and
pass the resulting localPath to clientcmd.BuildConfigFromFlags("", localPath)
only when the home lookup succeeded and the file exists.
saName := serviceAccountName(session.ID)
secretName := secretName(session.ID)
🛠️ Refactor suggestion | 🟠 Major
Variable secretName shadows the function of the same name.
secretName := secretName(session.ID) shadows the helper function, making the code confusing and error-prone for future modifications.
Proposed fix
saName := serviceAccountName(session.ID)
- secretName := secretName(session.ID)
+ credSecretName := secretName(session.ID)
// ... later in buildEnv call:
- "env": r.buildEnv(session, secretName),
+ "env": r.buildEnv(session, credSecretName),

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`
around lines 344 - 346, The local variable currently declared as "secretName :=
secretName(session.ID)" shadows the helper function "secretName", so rename the
variable to avoid shadowing (e.g., "secretNameVal" or "saSecretName") and update
all subsequent references in this scope; keep the call to the helper function
"secretName(session.ID)" unchanged but assign its result to the new variable
name alongside the existing "saName := serviceAccountName(session.ID)"
assignment.
def test_registers_queue_before_streaming(self):
    active_streams = {}
    bridge = _make_bridge(active_streams=active_streams)
    client = _make_client(bridge)

    # Pre-fill queue with a terminal event so the stream closes immediately
    q: asyncio.Queue = asyncio.Queue(maxsize=100)
    q.put_nowait(make_run_finished())

    # The endpoint creates its own queue on connect — verify it's registered
    # by inspecting active_streams after the response body is fully consumed
    with client.stream("GET", "/events/t-1") as resp:
        assert resp.status_code == 200
        resp.read()
Unused queue in test may not be testing intended behavior.
The queue q created on lines 72-73 is never added to active_streams, so the endpoint will create its own queue. This test passes trivially because the stream blocks waiting for events. Consider either removing the unused queue or verifying the actual registration behavior.
🔧 Proposed fix to properly test queue registration
def test_registers_queue_before_streaming(self):
active_streams = {}
bridge = _make_bridge(active_streams=active_streams)
client = _make_client(bridge)
- # Pre-fill queue with a terminal event so the stream closes immediately
- q: asyncio.Queue = asyncio.Queue(maxsize=100)
- q.put_nowait(make_run_finished())
-
- # The endpoint creates its own queue on connect — verify it's registered
- # by inspecting active_streams after the response body is fully consumed
- with client.stream("GET", "/events/t-1") as resp:
- assert resp.status_code == 200
- resp.read()
+ # Verify the endpoint registers a queue in active_streams
+ # We need to inject a terminal event after the queue is registered
+ import threading
+ def inject_terminal():
+ import time
+ time.sleep(0.1) # Wait for queue to be registered
+ if "t-1" in active_streams:
+ active_streams["t-1"].put_nowait(make_run_finished())
+
+ t = threading.Thread(target=inject_terminal)
+ t.start()
+ with client.stream("GET", "/events/t-1") as resp:
+ assert resp.status_code == 200
+ body = resp.read()
+ t.join()
+ assert "t-1" not in active_streams # Cleaned up after stream closes

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/runners/ambient-runner/tests/test_events_endpoint.py` around lines
66 - 79, The test creates an unused Queue (q) that isn't added to
active_streams, so either remove that unused q or actually assert the endpoint
registered its own queue; update test_registers_queue_before_streaming to delete
the q creation (q = asyncio.Queue / q.put_nowait...) and after consuming the
response body (resp.read()) assert that 't-1' is present in active_streams and
that active_streams['t-1'] is an asyncio.Queue (or alternatively move creation
of q before calling _make_bridge and assign active_streams['t-1'] = q if you
intend to pre-register the queue); reference
test_registers_queue_before_streaming, active_streams, _make_bridge,
client.stream, and resp.read.
async def test_user_message_triggers_bridge_run(self):
    payload = _make_runner_payload(thread_id="t-1", run_id="r-1")
    msgs = [_make_session_message("user", payload, seq=1)]
    client = _make_grpc_client(messages=msgs)
    bridge = _make_bridge()

    events = [make_text_start(), make_text_content(), make_run_finished()]

    async def fake_run(input_data):
        for e in events:
            yield e

    bridge.run = fake_run
    bridge._active_streams = {}

    listener = GRPCSessionListener(
        bridge=bridge, session_id="s-1", grpc_url="localhost:9000"
    )
    listener._grpc_client = client

    task = asyncio.create_task(listener._listen_loop())
    try:
        await asyncio.wait_for(listener.ready.wait(), timeout=2.0)
        await asyncio.sleep(0.3)
    finally:
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass
Assert that bridge.run() actually fires in this test.
This case only waits for 300ms and never verifies the user message triggered bridge.run(). If the listener drops user events or exits early, the test still passes.
Proposed fix
bridge = _make_bridge()
events = [make_text_start(), make_text_content(), make_run_finished()]
+ ran = asyncio.Event()
async def fake_run(input_data):
+ ran.set()
for e in events:
yield e
bridge.run = fake_run
@@
task = asyncio.create_task(listener._listen_loop())
try:
await asyncio.wait_for(listener.ready.wait(), timeout=2.0)
- await asyncio.sleep(0.3)
+ await asyncio.wait_for(ran.wait(), timeout=1.0)
+ assert ran.is_set()
finally:
task.cancel()
try:
await task

🧰 Tools
🪛 Ruff (0.15.6)
[warning] 160-160: Missing return type annotation for private function fake_run
(ANN202)
[warning] 160-160: Unused function argument: input_data
(ARG001)
[warning] 178-181: Use contextlib.suppress(asyncio.CancelledError) instead of try-except-pass
Replace try-except-pass with with contextlib.suppress(asyncio.CancelledError): ...
(SIM105)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/runners/ambient-runner/tests/test_grpc_transport.py` around lines
152 - 181, The test test_user_message_triggers_bridge_run currently never
verifies that bridge.run() was invoked; update it to assert the call by
replacing the simple generator fake_run with a callable that signals when
invoked (e.g., set an asyncio.Event or use an AsyncMock) and await that signal
before cancelling the listener. Concretely: keep the existing events and
listener setup, but change bridge.run to a wrapper that sets an event (or
records a call via AsyncMock) when started and then yields the events; after
awaiting listener.ready, await the event with a timeout to ensure bridge.run
fired, then proceed to cancel the task and assert the event/AsyncMock was
triggered. Target symbols: test_user_message_triggers_bridge_run, bridge.run,
fake_run, GRPCSessionListener, listener._listen_loop.
TIMEOUT_SECONDS=300  # 5 minutes
CHECK_INTERVAL=5     # Check every 5 seconds
Unused configuration variables.
TIMEOUT_SECONDS and CHECK_INTERVAL are defined but never used. The loops on lines 90 and 122 use hardcoded values (30, 60) instead. Either use these variables or remove them.
🔧 Proposed fix to use the configuration variables
-while [[ $(($(date +%s) - start_time)) -lt 30 ]]; do
+while [[ $(($(date +%s) - start_time)) -lt $TIMEOUT_SECONDS ]]; do

-while [[ $(($(date +%s) - start_time)) -lt 60 ]]; do
+while [[ $(($(date +%s) - start_time)) -lt $TIMEOUT_SECONDS ]]; do

Or remove the unused variables if hardcoded values are intentional.
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 21-21: TIMEOUT_SECONDS appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 22-22: CHECK_INTERVAL appears unused. Verify use (or export if used externally).
(SC2034)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test-e2e-control-plane.sh` around lines 21 - 22, TIMEOUT_SECONDS and
CHECK_INTERVAL are declared but not used; update the loops that currently use
hardcoded intervals (the loops referencing 30 and 60 seconds) to use the
TIMEOUT_SECONDS and CHECK_INTERVAL variables (so replace the literal timeout and
sleep interval values with TIMEOUT_SECONDS and CHECK_INTERVAL), or remove the
TIMEOUT_SECONDS and CHECK_INTERVAL declarations if the hardcoded values are
intentional; ensure you update any related echo/log messages to reference the
variables if you choose to use them.
echo ""
echo -e "${BLUE}📊 Test Summary:${NC}"
echo -e " • Duration: $(($(date +%s) - start_time)) seconds"
echo -e " • Session Name: ${SESSION_NAME}"
echo -e " • Session ID: ${SESSION_ID}"
echo -e " • CR Name: ${CR_NAME}"
echo ""
Duration calculation uses wrong start_time.
start_time is reassigned on line 119 for the runner pod wait loop, so the duration reported here reflects only the time since step 4, not the total test duration. Capture the initial start time in a separate variable.
🔧 Proposed fix
+TEST_START_TIME=$(date +%s)
echo -e "${BLUE}🚀 Starting End-to-End Control Plane Test${NC}"

Then on line 226:
-echo -e " • Duration: $(($(date +%s) - start_time)) seconds"
+echo -e " • Duration: $(($(date +%s) - TEST_START_TIME)) seconds"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test-e2e-control-plane.sh` around lines 224 - 230, The duration uses
start_time which gets reassigned later (in the runner pod wait loop), so capture
the initial test start time into a separate variable (e.g., TEST_START_TIME or
ORIGINAL_START_TIME) immediately when the script begins and do not modify it;
keep using start_time for short-lived step timers if needed. Update the final
summary echo that computes Duration to use this preserved variable
(TEST_START_TIME/ORIGINAL_START_TIME) instead of start_time so the reported
duration reflects the entire test run; ensure the new variable is exported or in
scope for the summary block.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/components-build-deploy.yml (1)
307-361: ⚠️ Potential issue | 🟠 Major

Missing ambient-control-plane in image tag determination and kustomization update.

The deploy-to-openshift job handles image tags for all other components but omits ambient-control-plane:

- "Determine image tags" step (lines 310-350) doesn't set a control_plane_tag output
- "Update kustomization with image tags" step (lines 354-361) doesn't call kustomize edit set image for the control-plane image

As a result, when ambient-control-plane is rebuilt and pushed, the deployment will still use whatever tag is in the manifest (likely latest or stage), not the newly built SHA.

Proposed fix

Add to "Determine image tags" step:

+ if [ "${{ needs.detect-changes.outputs.ambient-control-plane }}" == "true" ]; then
+   echo "control_plane_tag=${{ github.sha }}" >> $GITHUB_OUTPUT
+ else
+   echo "control_plane_tag=stage" >> $GITHUB_OUTPUT
+ fi

Add to "Update kustomization with image tags" step:

+ kustomize edit set image quay.io/ambient_code/ambient_control_plane:latest=quay.io/ambient_code/ambient_control_plane:${{ steps.image-tags.outputs.control_plane_tag }}

Note: Use ambient_control_plane to match the manifest declarations (or fix the naming mismatch first).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/components-build-deploy.yml around lines 307 - 361, The Determine image tags step (image-tags) is missing a control_plane_tag output and the Update kustomization step doesn't set the control-plane image; add logic in the image-tags run to set control_plane_tag to either ${{ github.sha }} when needs.detect-changes.outputs.ambient-control-plane is "true" or to the appropriate default (stage or latest) otherwise, and add a corresponding kustomize edit set image invocation in the Update kustomization step to map the manifest image (ambient_control_plane or the exact image name used in overlays) to quay.io/ambient_code/ambient_control_plane:${{ steps.image-tags.outputs.control_plane_tag }} so the overlay uses the newly determined tag.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/components-build-deploy.yml:
- Around line 100-101: The workflow's component matrix uses the image name
quay.io/ambient_code/vteam_control_plane which doesn't match the kustomize
manifests that reference quay.io/ambient_code/ambient_control_plane; update the
matrix entry in .github/workflows/components-build-deploy.yml to use
quay.io/ambient_code/ambient_control_plane so `kustomize edit set image` will
match, then re-run or validate that kustomization files (referenced by symbols
like the kustomization.yaml in components/manifests/base and overlays/production
and ambient-control-plane-service.yml) are already consistent; alternatively, if
you prefer the vteam_ name, update all manifest image references to
quay.io/ambient_code/vteam_control_plane so names are consistent across the
matrix and manifests.
In @.github/workflows/prod-release-deploy.yaml:
- Around line 160-161: The component image name in the matrix/comp_image loop
("quay.io/ambient_code/vteam_control_plane") does not match the kustomization
manifests ("quay.io/ambient_code/ambient_control_plane"), so the kustomize edit
set image replacement (used by the release/tag substitution) will not find the
entry; update the image reference used by the matrix and comp_image loop to
exactly match the manifest image name (or conversely update the manifests to use
"vteam_control_plane") so the image string is identical across the component
matrix, the comp_image loop, and the kustomization manifests referenced by the
kustomize edit set image command.
In `@components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py`:
- Around line 244-245: Wrap the call to self._grpc_listener.stop() in a
try/except so an exception from stop() is caught and logged, and move the
session manager and observability cleanup calls into a finally block (or ensure
they run after the except) so that self._session_manager.shutdown() and
self._observability.shutdown() always run even if self._grpc_listener.stop()
raises; make sure to reference the existing symbols self._grpc_listener.stop(),
self._session_manager.shutdown(), and self._observability.shutdown() and log the
stop() exception before proceeding with the cleanup.
In `@Makefile`:
- Line 68: The Makefile variable CONTROL_PLANE_IMAGE currently uses the image
name literal "vteam_control_plane" which mismatches the manifest-declared
"ambient_control_plane"; update the CONTROL_PLANE_IMAGE assignment to use the
same image base as manifests (e.g., replace "vteam_control_plane" with
"ambient_control_plane") while preserving IMAGE_TAG (i.e., keep the format
CONTROL_PLANE_IMAGE ?= ambient_control_plane:$(IMAGE_TAG)) so all references to
CONTROL_PLANE_IMAGE and IMAGE_TAG remain consistent with the manifests.
- Around line 293-305: The production deploy re-applies a manifest-declared
kubernetes.io/service-account-token Secret (ambient-control-plane-token from
base/rbac/control-plane-sa.yaml) before you immediately delete/recreate it with
the OCM token, which can cause the recreated token to be overwritten if someone
runs kubectl kustomize/apply separately; update the Makefile or manifests to
avoid the race by either (A) adding a clear comment above the kubectl kustomize
&& oc create secret block describing the dependency on the subsequent
delete/create of ambient-control-plane-token so operators know the behavior, or
(B) remove/exclude the ambient-control-plane-token Secret from the kustomize
base when using this deploy path (or add a kustomize patch/overlay to prevent
the Secret from being managed) and ensure the Makefile still creates the secret
via oc create --from-literal=token="$(ocm token)"; reference the
ambient-control-plane-token name, the kubectl kustomize call for
components/manifests/overlays/production/, and base/rbac/control-plane-sa.yaml
when making the change.
---
Outside diff comments:
In @.github/workflows/components-build-deploy.yml:
- Around line 307-361: The Determine image tags step (image-tags) is missing a
control_plane_tag output and the Update kustomization step doesn't set the
control-plane image; add logic in the image-tags run to set control_plane_tag to
either ${{ github.sha }} when needs.detect-changes.outputs.ambient-control-plane
is "true" or to the appropriate default (stage or latest) otherwise, and add a
corresponding kustomize edit set image invocation in the Update kustomization
step to map the manifest image (ambient_control_plane or the exact image name
used in overlays) to quay.io/ambient_code/ambient_control_plane:${{
steps.image-tags.outputs.control_plane_tag }} so the overlay uses the newly
determined tag.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0e8d75b7-4b30-4219-a89e-6805a65316f9
📒 Files selected for processing (4)
.github/workflows/components-build-deploy.yml
.github/workflows/prod-release-deploy.yaml
Makefile
components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py
if self._grpc_listener is not None:
    await self._grpc_listener.stop()
Guard shutdown against partial failure.
At Line 245, if await self._grpc_listener.stop() raises, session manager and observability cleanup are skipped, which can leak resources and leave shutdown incomplete.
Proposed fix
async def shutdown(self) -> None:
"""Graceful shutdown: persist sessions, finalise tracing."""
- if self._grpc_listener is not None:
- await self._grpc_listener.stop()
- if self._session_manager:
- await self._session_manager.shutdown()
- if self._obs:
- await self._obs.finalize()
+ try:
+ if self._grpc_listener is not None:
+ await self._grpc_listener.stop()
+ self._grpc_listener = None
+ finally:
+ if self._session_manager:
+ await self._session_manager.shutdown()
+ if self._obs:
+ await self._obs.finalize()
logger.info("ClaudeBridge: shutdown complete")

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
async def shutdown(self) -> None:
    """Graceful shutdown: persist sessions, finalise tracing."""
    try:
        if self._grpc_listener is not None:
            await self._grpc_listener.stop()
        self._grpc_listener = None
    finally:
        if self._session_manager:
            await self._session_manager.shutdown()
        if self._obs:
            await self._obs.finalize()
    logger.info("ClaudeBridge: shutdown complete")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py`
around lines 244 - 245, Wrap the call to self._grpc_listener.stop() in a
try/except so an exception from stop() is caught and logged, and move the
session manager and observability cleanup calls into a finally block (or ensure
they run after the except) so that self._session_manager.shutdown() and
self._observability.shutdown() always run even if self._grpc_listener.stop()
raises; make sure to reference the existing symbols self._grpc_listener.stop(),
self._session_manager.shutdown(), and self._observability.shutdown() and log the
stop() exception before proceeding with the cleanup.
STATE_SYNC_IMAGE ?= vteam_state_sync:$(IMAGE_TAG)
PUBLIC_API_IMAGE ?= vteam_public_api:$(IMAGE_TAG)
API_SERVER_IMAGE ?= vteam_api_server:$(IMAGE_TAG)
CONTROL_PLANE_IMAGE ?= vteam_control_plane:$(IMAGE_TAG)
Image name should match manifest declarations.
Same naming issue as the workflow files: vteam_control_plane here vs ambient_control_plane in manifests. Ensure consistency across all files when fixing the naming mismatch.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Makefile` at line 68, The Makefile variable CONTROL_PLANE_IMAGE currently
uses the image name literal "vteam_control_plane" which mismatches the
manifest-declared "ambient_control_plane"; update the CONTROL_PLANE_IMAGE
assignment to use the same image base as manifests (e.g., replace
"vteam_control_plane" with "ambient_control_plane") while preserving IMAGE_TAG
(i.e., keep the format CONTROL_PLANE_IMAGE ?=
ambient_control_plane:$(IMAGE_TAG)) so all references to CONTROL_PLANE_IMAGE and
IMAGE_TAG remain consistent with the manifests.
Force-pushed from 30b1f4a to 94e5980 (Compare)
/retest

1 similar comment

/retest
Actionable comments posted: 27
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
components/manifests/components/ambient-api-server-db/ambient-api-server-db-json-patch.yaml (1)
8-30: 🛠️ Refactor suggestion | 🟠 Major

Remove the redundant list replacements from this overlay.

These env and volumeMounts values now match the base manifest exactly. Keeping whole-array replace ops here means any future base env var or mount will be silently discarded in this overlay.

🧹 Suggested cleanup

   - op: replace
     path: /spec/template/spec/containers/0/image
     value: registry.redhat.io/rhel10/postgresql-16:10.1
 - - op: replace
 -   path: /spec/template/spec/containers/0/env
 -   value:
 -     - name: POSTGRESQL_USER
 -       valueFrom:
 -         secretKeyRef:
 -           name: ambient-api-server-db
 -           key: db.user
 -     - name: POSTGRESQL_PASSWORD
 -       valueFrom:
 -         secretKeyRef:
 -           name: ambient-api-server-db
 -           key: db.password
 -     - name: POSTGRESQL_DATABASE
 -       valueFrom:
 -         secretKeyRef:
 -           name: ambient-api-server-db
 -           key: db.name
 - - op: replace
 -   path: /spec/template/spec/containers/0/volumeMounts
 -   value:
 -     - name: ambient-api-server-db-data
 -       mountPath: /var/lib/pgsql/data

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/manifests/components/ambient-api-server-db/ambient-api-server-db-json-patch.yaml` around lines 8 - 30, Remove the two redundant whole-array JSON Patch replace ops that overwrite the container 0 env and volumeMounts in ambient-api-server-db-json-patch.yaml: the replace at path /spec/template/spec/containers/0/env (which replaces env entries like POSTGRESQL_USER, POSTGRESQL_PASSWORD, POSTGRESQL_DATABASE) and the replace at path /spec/template/spec/containers/0/volumeMounts (which replaces the mount for name ambient-api-server-db-data). Instead delete these two ops from the patch so the overlay inherits the base manifest's env and volumeMounts and future base additions are preserved..github/workflows/components-build-deploy.yml (1)
354-363: ⚠️ Potential issue | 🟠 Major

Missing kustomize edit set image for ambient-control-plane in deploy step.

The deploy-to-openshift job updates image tags for all other components but omits ambient-control-plane. When control-plane changes are pushed, the deployment will use whatever tag is in the base kustomization rather than the newly built SHA.

🐛 Proposed fix — add control-plane tag handling

Add to the "Determine image tags" step (after line 352):

if [ "${{ needs.detect-changes.outputs.ambient-control-plane }}" == "true" ]; then
  echo "control_plane_tag=${{ github.sha }}" >> $GITHUB_OUTPUT
else
  echo "control_plane_tag=stage" >> $GITHUB_OUTPUT
fi

Add to the "Update kustomization with image tags" step (after line 363):

kustomize edit set image quay.io/ambient_code/ambient_control_plane:latest=quay.io/ambient_code/ambient_control_plane:${{ steps.image-tags.outputs.control_plane_tag }}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/components-build-deploy.yml around lines 354 - 363, Add handling for ambient-control-plane by emitting a control_plane_tag in the "Determine image tags" step (set control_plane_tag to github.sha when needs.detect-changes.outputs.ambient-control-plane == "true", otherwise set it to "stage") and then include a kustomize edit set image command in the "Update kustomization with image tags" step to update quay.io/ambient_code/ambient_control_plane from :latest to :${{ steps.image-tags.outputs.control_plane_tag }} so the control-plane image uses the newly built SHA when changed; reference the existing image-tags output variable name control_plane_tag and the kustomize edit set image pattern used for other components.
♻️ Duplicate comments (16)
components/ambient-control-plane/go.mod (1)
11-11: ⚠️ Potential issue | 🔴 Critical

Upgrade google.golang.org/grpc to a patched release.

google.golang.org/grpc v1.79.1 is still the vulnerable pin already called out earlier for GHSA-p77j-4mvh-x3m3, so the new control-plane module would ship with that authorization-bypass issue unresolved.

Verification: confirm the pinned version in go.mod and compare it with the advisory's affected range. Expected result: the module should be updated to a non-vulnerable release before merge.

#!/bin/bash
set -euo pipefail
echo "Pinned grpc version:"
rg -n 'google\.golang\.org/grpc' components/ambient-control-plane/go.mod
echo
python - <<'PY'
import json, urllib.request
with urllib.request.urlopen("https://api.osv.dev/v1/vulns/GHSA-p77j-4mvh-x3m3") as r:
    data = json.load(r)
print("advisory:", data["id"])
for affected in data.get("affected", []):
    pkg = affected.get("package", {})
    if pkg.get("name") == "google.golang.org/grpc":
        print("package:", pkg["name"])
        for rng in affected.get("ranges", []):
            print("range:", rng)
PY

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/ambient-control-plane/go.mod` at line 11, The go.mod entry pins google.golang.org/grpc at the vulnerable v1.79.1; update that dependency in components/ambient-control-plane/go.mod to a non-vulnerable release (e.g., v1.79.2 or any release that the GHSA-p77j-4mvh-x3m3 advisory marks as patched), then run go mod tidy (or go get google.golang.org/grpc@<patched-version>) to update go.sum; after updating, verify the pinned version no longer falls in the advisory’s affected range for GHSA-p77j-4mvh-x3m3.components/manifests/deploy (1)
**82-109:** ⚠️ Potential issue | 🟠 Major

**Stop writing the OAuth client secret to a fixed file and echoing it on failure.**

This still writes the plaintext `CLIENT_SECRET_VALUE` to `/tmp/ambient-frontend-oauthclient.yaml` and prints it again in the fallback block. That exposes a reusable credential to other local users and CI logs.

Suggested change:

```diff
-cat > /tmp/ambient-frontend-oauthclient.yaml <<EOF
+OAUTH_TMPFILE=$(mktemp)
+chmod 600 "$OAUTH_TMPFILE"
+cat > "$OAUTH_TMPFILE" <<EOF
 apiVersion: oauth.openshift.io/v1
 kind: OAuthClient
 metadata:
   name: ambient-frontend
 secret: ${CLIENT_SECRET_VALUE}
 redirectURIs:
   - https://${ROUTE_HOST}/oauth/callback
 grantMethod: auto
 EOF
 set +e
-oc apply -f /tmp/ambient-frontend-oauthclient.yaml
+oc apply -f "$OAUTH_TMPFILE"
 OAUTH_APPLY_RC=$?
 set -e
-rm -f /tmp/ambient-frontend-oauthclient.yaml
+rm -f "$OAUTH_TMPFILE"
@@
-echo "secret: ${CLIENT_SECRET_VALUE}"
+echo "secret: [REDACTED]"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/manifests/deploy` around lines 82 - 109, Do not persist or echo the plaintext CLIENT_SECRET_VALUE to /tmp/ambient-frontend-oauthclient.yaml or stdout; instead feed the OAuthClient YAML to oc apply via stdin (oc apply -f -) without creating a temp file so the secret never lands on disk, and remove any echo that prints CLIENT_SECRET_VALUE in the fallback/error block. Update the logic around OAUTH_APPLY_RC to build the YAML on-the-fly using CLIENT_SECRET_VALUE and ROUTE_HOST and pipe it to oc apply -f -, and in the failure branch print a redacted admin instruction that tells the admin how to apply the OAuthClient without showing the secret (omit the secret value or instruct them to set CLIENT_SECRET_VALUE in their environment) rather than echoing the secret.

components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py (1)
**244-249:** ⚠️ Potential issue | 🟠 Major

**Protect shutdown cleanup from gRPC stop failures.**

At line 245, an exception from `await self._grpc_listener.stop()` prevents `self._session_manager.shutdown()` and `self._obs.finalize()` from running. As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

Proposed fix:

```diff
 async def shutdown(self) -> None:
     """Graceful shutdown: persist sessions, finalise tracing."""
-    if self._grpc_listener is not None:
-        await self._grpc_listener.stop()
-    if self._session_manager:
-        await self._session_manager.shutdown()
-    if self._obs:
-        await self._obs.finalize()
+    try:
+        if self._grpc_listener is not None:
+            await self._grpc_listener.stop()
+            self._grpc_listener = None
+    except Exception:
+        logger.exception("ClaudeBridge: failed to stop gRPC listener during shutdown")
+    finally:
+        if self._session_manager:
+            await self._session_manager.shutdown()
+        if self._obs:
+            await self._obs.finalize()
     logger.info("ClaudeBridge: shutdown complete")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py` around lines 244 - 249, The call to await self._grpc_listener.stop() can raise and currently short-circuits the remaining cleanup; wrap the await self._grpc_listener.stop() call in a try/except that catches Exception, logs the error (including the exception) and does not re-raise so that subsequent cleanup calls await self._session_manager.shutdown() and await self._obs.finalize() still run; ensure you reference self._grpc_listener.stop(), self._session_manager.shutdown(), and self._obs.finalize() in the fix so the cleanup order and logging are clear.

components/runners/ambient-runner/ambient_runner/app.py (2)
**356-403:** ⚠️ Potential issue | 🟠 Major

**gRPC client may leak on exception.**

If an exception occurs after `AmbientGRPCClient.from_env()` but before `client.close()`, the connection leaks. Use try/finally to ensure cleanup.

```diff
 async def _push_initial_prompt_via_grpc(prompt: str, session_id: str) -> None:
     """Push INITIAL_PROMPT as a PushSessionMessage so it is durable in DB."""
+    client = None
     try:
         from ambient_runner._grpc_client import AmbientGRPCClient

         client = AmbientGRPCClient.from_env()
         # ... existing logic ...
-        client.close()
     except Exception as exc:
         logger.error(
             "INITIAL_PROMPT gRPC push failed: session=%s error=%s",
             session_id,
             exc,
             exc_info=True,
         )
+    finally:
+        if client is not None:
+            client.close()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/runners/ambient-runner/ambient_runner/app.py` around lines 356 - 403, In _push_initial_prompt_via_grpc the AmbientGRPCClient created via AmbientGRPCClient.from_env() can leak if an exception occurs before client.close(); wrap the client usage in a try/finally (or use a context manager if AmbientGRPCClient supports __enter__/__exit__) so client.close() is always called, ensuring you only call close when the client variable was successfully assigned; update the function to create the client, run the payload push logic, and then call client.close() in finally to guarantee cleanup.
**129-131:** 🧹 Nitpick | 🔵 Trivial

**Accessing private members `_setup_platform()` and `_grpc_listener` creates coupling.**

Consider exposing a public method on `ClaudeBridge` that handles setup and returns the ready signal, rather than directly accessing private members.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/runners/ambient-runner/ambient_runner/app.py` around lines 129 - 131, The code currently couples to ClaudeBridge's internals by calling its private members _setup_platform() and _grpc_listener.ready; instead add a public async method on ClaudeBridge (e.g., ensure_ready() or setup_and_wait_ready()) that performs the internal setup and exposes the ready signal (awaiting _setup_platform() and returning/awaiting the internal ready event), then replace the direct calls in ambient_runner.app (the grpc_url conditional) with a single await bridge.ensure_ready() to remove private member access.

components/manifests/overlays/production/control-plane-env-patch.yaml (1)
**15-16:** ⚠️ Potential issue | 🟠 Major

**Pin `RUNNER_IMAGE` to a specific version instead of `:latest`.**

Using `:latest` in production makes deployments non-reproducible. Pin to a specific version or digest for predictable behavior.

```diff
   - name: RUNNER_IMAGE
-    value: "quay.io/ambient_code/vteam_claude_runner:latest"
+    value: "quay.io/ambient_code/vteam_claude_runner:v1.0.0"  # Pin to specific version
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/manifests/overlays/production/control-plane-env-patch.yaml` around lines 15 - 16, The RUNNER_IMAGE env var is pinned to :latest causing non-reproducible production deployments; update the value of the environment variable named RUNNER_IMAGE (currently "quay.io/ambient_code/vteam_claude_runner:latest") to a specific version tag or an immutable digest (for example "quay.io/ambient_code/vteam_claude_runner:v1.2.3" or "quay.io/ambient_code/vteam_claude_runner@sha256:<digest>") so deployments are deterministic; ensure you change the string literal in the manifest where RUNNER_IMAGE is defined and verify any CI/deploy docs reference the same pinned tag.

components/ambient-control-plane/internal/reconciler/project_settings_reconciler.go (1)
**130-139:** ⚠️ Potential issue | 🟠 Major

**Partial failures are silently swallowed.**

When `ensureGroupRoleBinding` fails for one entry, the error is logged but the function returns `nil`, making reconciliation appear successful. This could mask persistent failures and prevent retry mechanisms from kicking in.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/ambient-control-plane/internal/reconciler/project_settings_reconciler.go` around lines 130 - 139, The loop over entries in project_settings_reconciler.go currently logs errors from ensureGroupRoleBinding but always returns nil, hiding failures; change the reconciler method that contains this loop to collect and return errors (e.g., return the first error or aggregate them) instead of swallowing them so callers/retry logic see failures—specifically update the for-loop handling of ensureGroupRoleBinding(ctx, namespace, entry.GroupName, entry.Role) to propagate its error (or append to an error slice and return a combined error) rather than only logging it.

components/runners/ambient-runner/tests/test_app_initial_prompt.py (1)
**34-35:** ⚠️ Potential issue | 🟡 Minor

**Patch the site where the name is looked up, not where it is defined.**

The patches at lines 34, 58, 67, and 78 target `"ambient_runner._grpc_client.AmbientGRPCClient"` (the definition site). If `app.py` imports with `from ambient_runner._grpc_client import AmbientGRPCClient`, the name is bound in `app`'s namespace before the patch applies. The patch should target `"ambient_runner.app.AmbientGRPCClient"` instead.

Also applies to: 58-58, 67-67, 78-78
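The lookup-site rule can be demonstrated without the real package; `demo_defs` and `demo_app` below are hypothetical stand-ins for `ambient_runner._grpc_client` and `ambient_runner.app`:

```python
import sys
import types
from unittest import mock

# Hypothetical stand-ins for the definition site (ambient_runner._grpc_client)
# and the lookup site (ambient_runner.app); all names here are illustrative.
defs = types.ModuleType("demo_defs")

class Client:  # plays the role of AmbientGRPCClient
    name = "real"

defs.Client = Client
sys.modules["demo_defs"] = defs

app = types.ModuleType("demo_app")
app.Client = defs.Client  # simulates `from demo_defs import Client`
sys.modules["demo_app"] = app

# Patching the definition site does NOT affect the name already bound in app:
with mock.patch("demo_defs.Client"):
    print(app.Client.name)  # -> real (wrong target, mock never seen)

# Patching the lookup site replaces what app actually uses:
with mock.patch("demo_app.Client") as m:
    m.name = "mocked"
    print(app.Client.name)  # -> mocked (correct target)
```

The same reasoning applies to the four test patches: because `app.py` binds the class at import time, only the `ambient_runner.app.…` target intercepts the call.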
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/runners/ambient-runner/tests/test_app_initial_prompt.py` around lines 34 - 35, The tests patch the definition site "ambient_runner._grpc_client.AmbientGRPCClient" but the code under test imports and binds the class into app's namespace, so update each patch target to the lookup site "ambient_runner.app.AmbientGRPCClient"; specifically change the patch call that wraps the call to _push_initial_prompt_via_grpc("hello world", "sess-1") and the three other patch usages (the ones at the other test cases) to patch "ambient_runner.app.AmbientGRPCClient" so the mocked class is used when _push_initial_prompt_via_grpc invokes AmbientGRPCClient.

components/runners/ambient-runner/tests/test_events_endpoint.py (1)
**66-79:** ⚠️ Potential issue | 🟡 Minor

**Unused queue in test may not verify intended behavior.**

The queue `q` created on lines 72-73 is never added to `active_streams`, so the endpoint creates its own queue. The test passes but doesn't actually verify queue registration behavior. Consider removing the unused queue or properly verifying registration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/runners/ambient-runner/tests/test_events_endpoint.py` around lines 66 - 79, The test test_registers_queue_before_streaming creates a local asyncio.Queue q and puts a terminal event into it but never registers it with active_streams, so it doesn't actually verify queue-registration behavior; either remove the unused q and its put_nowait call, or register it into active_streams (e.g., active_streams["t-1"] = q) before creating the bridge via _make_bridge so the endpoint will use that queue, and then after client.stream("GET", "/events/t-1") and resp.read() assert the expected registration/consumption from active_streams to confirm the endpoint registered/used the queue.

components/ambient-control-plane/internal/reconciler/shared.go (1)
**88-96:** ⚠️ Potential issue | 🟠 Major

**Normalize namespace inputs before using them as Kubernetes namespaces.**

`strings.ToLower()` only handles case, and line 92 returns `session.KubeNamespace` without even that. Values with `_`, `.`, spaces, uppercase letters, or length >63 will still be invalid namespace names and will fail later resource creation. Reuse one DNS-1123 normalizer here and in `namespaceForProject`, or assert upstream that `ProjectID` is already slug-safe.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/ambient-control-plane/internal/reconciler/shared.go` around lines 88 - 96, namespaceForSession currently only lowercases ProjectID and returns KubeNamespace raw, which allows invalid Kubernetes names (underscores, dots, spaces, uppercase, >63 chars); update namespaceForSession to run both session.ProjectID and session.KubeNamespace through the shared DNS-1123 normalizer used elsewhere (reuse the existing normalizer function used by namespaceForProject or create a single normalizeToDNS1123 helper) so returned namespaces are DNS-1123 safe and truncated if needed; also ensure namespaceForProject calls the same normalizer or add an assertion upstream that ProjectID is already slug-safe.

components/runners/ambient-runner/ambient_runner/_session_messages_api.py (1)
**77-84:** ⚠️ Potential issue | 🟠 Major

**Stop logging payload previews at INFO.**

Both RPC paths now emit prompt/response content previews on every push/watch message. That leaks user and assistant content into durable logs and will balloon log volume on long sessions. Since this file is generated, the fix should happen in the generator too.
Also applies to: 136-147
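A minimal sketch of the suggested split, using an illustrative logger (the generated module's real names and field layout differ):

```python
import io
import logging

# Illustrative setup: capture log output in memory to show the effect.
logger = logging.getLogger("session_messages_demo")
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.INFO)  # typical production level: DEBUG is dropped

def log_push(session_id: str, event_type: str, payload: str) -> None:
    # Non-sensitive metadata stays at INFO...
    logger.info("push: session=%s type=%s payload_len=%d",
                session_id, event_type, len(payload))
    # ...while the content preview is demoted to DEBUG.
    logger.debug("push payload preview: %r", payload[:120])

log_push("sess-1", "user", "a private prompt")
print("private" in stream.getvalue())  # -> False: content never reaches INFO logs
```

At INFO level the durable log keeps only `session_id`, `event_type`, and `payload_len`; the preview appears only when DEBUG is explicitly enabled.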
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/runners/ambient-runner/ambient_runner/_session_messages_api.py` around lines 77 - 84, The current code constructs payload_preview and emits it in logger.info (using session_id, event_type, payload_len, payload_preview), which leaks user/assistant content and increases log volume; update the generator so generated code no longer logs payload previews at INFO—either remove payload_preview from the logger.info call or lower that log call to DEBUG (e.g., use logger.debug with payload_preview) and keep only non-sensitive metadata (session_id, event_type, payload_len) at INFO; apply the same change for the other generated block that builds payload_preview (the analogous logger call around the 136–147 area).

components/ambient-control-plane/internal/reconciler/project_reconciler.go (2)
**104-110:** ⚠️ Potential issue | 🔴 Critical

**Reconcile creator RoleBindings instead of creating them once.**

A changed `ambient-code.io/created-by` value can leave the previous subject bound to `ambient-project-admin`, and the 40-character truncation makes distinct long identities alias the same RoleBinding name. Line 132 only checks for name existence, so this never converges back to the desired subject.

Also applies to: 132-137
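One way to make the names deterministic and collision-free is a hashed suffix. This is a sketch only (shown in Python for brevity, not the repository's Go `creatorRoleBindingName`), and the readable prefix would still need DNS-1123 sanitizing in real code:

```python
import hashlib

def creator_role_binding_name(subject: str) -> str:
    # Keep a short readable prefix, then append a stable hash of the FULL
    # identity: truncation alone can never make two distinct subjects alias
    # the same RoleBinding name.
    digest = hashlib.sha256(subject.encode()).hexdigest()[:8]
    prefix = subject[:30]
    return f"ambient-project-admin-{prefix}-{digest}"

a = creator_role_binding_name("alice-with-a-very-long-identity-suffix-1")
b = creator_role_binding_name("alice-with-a-very-long-identity-suffix-2")
print(a == b)  # -> False: identical 30-char prefixes, different hashes
```

With names stable per identity, the reconciler can also compare the existing binding's subjects against the current `created-by` value and update them instead of returning early on name existence.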
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/ambient-control-plane/internal/reconciler/project_reconciler.go` around lines 104 - 110, The current creatorRoleBindingName function and the reconcile logic create RoleBindings once by name and never update them, causing stale subjects when the ambient-code.io/created-by value changes and collisions from truncating long subjects; update creatorRoleBindingName to produce deterministic unique names for any subject (e.g., include a stable hash of the full subject instead of truncating to 40 chars) and change the reconciler that handles ambient-project-admin RoleBindings to not only check existence but to reconcile the RoleBinding's subjects/owner to match the current ambient-code.io/created-by (update existing RoleBinding subjects when they differ); reference creatorRoleBindingName and the reconciler that currently checks for name existence to locate where to change name generation and add subject-update logic.
**94-96:** ⚠️ Potential issue | 🟠 Major

**Treat `AlreadyExists` on namespace creation as success.**

After the line 67 existence check, another reconcile can create the namespace before `CreateNamespace` runs. Returning that race as an error makes add/modify events fail instead of converging.

Proposed fix:

```diff
-	if _, err := r.kube.CreateNamespace(ctx, ns); err != nil {
-		return fmt.Errorf("creating namespace %s: %w", name, err)
-	}
+	if _, err := r.kube.CreateNamespace(ctx, ns); err != nil {
+		if !k8serrors.IsAlreadyExists(err) {
+			return fmt.Errorf("creating namespace %s: %w", name, err)
+		}
+		r.logger.Debug().Str("namespace", name).Msg("namespace concurrently created")
+		return nil
+	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/ambient-control-plane/internal/reconciler/project_reconciler.go` around lines 94 - 96, The CreateNamespace call in project_reconciler.go should treat an AlreadyExists result as success: after calling r.kube.CreateNamespace(ctx, ns) check the returned error and if apierrors.IsAlreadyExists(err) then ignore it (consider creation successful) otherwise return the wrapped fmt.Errorf as currently done; reference the r.kube.CreateNamespace call and the ns/name variables and import/use k8s.io/apimachinery/pkg/api/errors (apierrors) for the IsAlreadyExists check.

components/runners/ambient-runner/tests/test_grpc_transport.py (1)
**152-181:** ⚠️ Potential issue | 🟠 Major

**Assert that `bridge.run()` actually fired.**

This test only waits 300 ms and never proves that the `user` message triggered `bridge.run()`. If the listener drops `user` events or exits early, the case still passes.

Proposed fix:

```diff
 bridge = _make_bridge()
 events = [make_text_start(), make_text_content(), make_run_finished()]
+ran = asyncio.Event()

 async def fake_run(input_data):
+    ran.set()
     for e in events:
         yield e

 bridge.run = fake_run
@@
 task = asyncio.create_task(listener._listen_loop())
 try:
     await asyncio.wait_for(listener.ready.wait(), timeout=2.0)
-    await asyncio.sleep(0.3)
+    await asyncio.wait_for(ran.wait(), timeout=1.0)
+    assert ran.is_set()
 finally:
     task.cancel()
     try:
         await task
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/runners/ambient-runner/tests/test_grpc_transport.py` around lines 152 - 181, The test doesn't verify bridge.run was invoked; modify test_user_message_triggers_bridge_run to set a sentinel (e.g., an asyncio.Event or boolean flag) inside the fake_run assigned to bridge.run (refer to fake_run and bridge.run) and await that sentinel with a timeout after listener.ready instead of just sleeping; ensure you still cancel the listener task as before (listener._listen_loop and listener.ready) so the test fails if bridge.run never fires.

components/runners/ambient-runner/ambient_runner/endpoints/events.py (1)
**61-63:** ⚠️ Potential issue | 🟠 Major

**Do not overwrite another subscriber's queue.**

`active_streams[thread_id] = queue` lets a reconnect or a second client orphan the first stream, and the unconditional `pop()` in both endpoints can then remove the replacement queue. `wait_for_events()` is worse because it tears down a queue it never created.

Also applies to: 109-110, 155-156, 183-184
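The ownership check can be sketched as below; the one-queue-per-thread shape of `active_streams` is taken from the review, while `subscribe`/`unsubscribe` are hypothetical helpers, not the endpoint's actual functions:

```python
import asyncio

active_streams: dict[str, asyncio.Queue] = {}

def subscribe(thread_id: str) -> asyncio.Queue:
    # setdefault never clobbers an existing subscriber's queue: a reconnect
    # gets the already-registered queue back instead of orphaning it.
    return active_streams.setdefault(thread_id, asyncio.Queue())

def unsubscribe(thread_id: str, queue: asyncio.Queue) -> None:
    # Identity check: only drop the mapping if it is still the queue
    # this caller registered, never someone else's replacement.
    if active_streams.get(thread_id) is queue:
        del active_streams[thread_id]

q1 = subscribe("t-1")
q2 = subscribe("t-1")                # second client: same queue returned
print(q1 is q2)                      # -> True
unsubscribe("t-1", asyncio.Queue())  # a queue we never registered: no-op
print("t-1" in active_streams)       # -> True
unsubscribe("t-1", q1)
print("t-1" in active_streams)       # -> False
```

The same `is`-identity guard applied in both endpoints would stop `wait_for_events()` from tearing down queues it did not create.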
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/runners/ambient-runner/ambient_runner/endpoints/events.py` around lines 61 - 63, active_streams[thread_id] = queue unconditionally overwrites an existing subscriber's queue (and later unconditional active_streams.pop() removes a different queue), so change creation and teardown to be identity-safe: when creating a queue in wait_for_events() (and the subscribe endpoints), use active_streams.setdefault(thread_id, queue) or append to a list of subscribers instead of replacing an existing entry, and track whether this call actually created the queue; when removing/tearing down, only remove the mapping if active_streams.get(thread_id) is the same queue you created (e.g., if active_streams.get(thread_id) is queue: active_streams.pop(thread_id)), and only close/put sentinel into queues that this code created—this prevents orphaning other subscribers and avoids tearing down queues you don't own.

Makefile (1)
**68-68:** ⚠️ Potential issue | 🔴 Critical

**`CONTROL_PLANE_IMAGE` still points at the wrong image name.**

The new build/push/load targets all use `vteam_control_plane`, so they will not satisfy manifests that deploy `ambient_control_plane`. That breaks the control-plane rollout path introduced in this PR.

🛠 Proposed fix:

```diff
-CONTROL_PLANE_IMAGE ?= vteam_control_plane:$(IMAGE_TAG)
+CONTROL_PLANE_IMAGE ?= ambient_control_plane:$(IMAGE_TAG)
```

Also applies to: 209-215, 261-264, 278-282, 1004-1005
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Makefile` at line 68, Update the CONTROL_PLANE_IMAGE definition and all places still referencing the old image name "vteam_control_plane" to "ambient_control_plane" so manifests and the new build/push/load targets use the correct image; specifically, change the CONTROL_PLANE_IMAGE variable and any other Makefile variables/targets that set or reference "vteam_control_plane" (e.g., in build/push/load targets and image-related variables) to use "ambient_control_plane:$(IMAGE_TAG)" so the control-plane rollout uses the correct image name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/ambient-control-plane/internal/config/config.go`:
- Around line 41-42: The current behavior in Load() uses
parseReconcilers(envOrDefault("RECONCILERS", "tally,kube")) which causes
parseReconcilers to silently collapse whitespace/comma-only values into just
"tally", disabling the kube reconciler; update parseReconcilers (and any other
parsing used in the 76-91 region) to validate the raw env string: if the env is
empty/only whitespace/only commas treat it as invalid and either (a) return the
full default set ["tally","kube"] or (b) return an error that Load() surfaces
and fails fast — then update Load() to handle the error path (or accept the
returned default) so invalid RECONCILERS never silently drop kube; reference
functions/fields: parseReconcilers, Load, Reconcilers, envOrDefault and apply
the same validation to the similar parsing code in lines 76-91.
In `@components/ambient-control-plane/internal/informer/informer.go`:
- Around line 129-133: The function currently returns ctx.Err() unconditionally
after calling watchManager.Run(ctx), which causes normal cancellations (e.g.,
SIGINT/SIGTERM -> context.Canceled) to be treated as failures; update the return
behavior in the function containing inf.watchManager.Run(ctx) and Run() so that
expected context cancellation is treated as a clean shutdown: after
watchManager.Run(ctx) check ctx.Err() and return nil for context.Canceled and
context.DeadlineExceeded (or otherwise special-case expected cancellation),
otherwise return the actual error; reference the Run() caller and
inf.watchManager.Run(ctx) to locate where to implement this check.
- Around line 118-125: The initialSync currently enqueues every listed object
via dispatchBlocking while dispatchLoop is started after, which can deadlock
when the initial dataset exceeds the 256-event buffer; change Run to start the
dispatcher before performing the initial sync (move the go inf.dispatchLoop(ctx)
call to before calling inf.initialSync(ctx)) or alternatively have initialSync
use a non-blocking enqueue method (replace dispatchBlocking calls with a
non-blocking/try-enqueue variant) so initial list processing cannot block
waiting for a consumer; update Informer.Run, initialSync, and any calls to
dispatchBlocking to ensure the dispatcher (dispatchLoop) is running or that
enqueues are non-blocking.
In `@components/ambient-control-plane/internal/kubeclient/kubeclient.go`:
- Around line 88-99: The buildRestConfig function currently prefers a local
~/.kube/config over in-cluster service-account credentials; change the
resolution order so that after honoring an explicit kubeconfig parameter
(clientcmd.BuildConfigFromFlags when kubeconfig != ""), you attempt
rest.InClusterConfig() next and return it if successful, and only if in-cluster
fails fall back to checking the home kubeconfig via os.UserHomeDir and
clientcmd.BuildConfigFromFlags(localPath); update error handling so you return
the in-cluster config on success and only probe localPath when
rest.InClusterConfig() returns an error.
- Around line 161-162: All six deletion helpers (DeletePodsByLabel,
DeleteServicesByLabel, DeleteSecretsByLabel, DeleteServiceAccountsByLabel,
DeleteRolesByLabel, DeleteRoleBindingsByLabel) currently call DeleteCollection
with an empty labelSelector which can delete every resource in a namespace; add
a shared validation helper (e.g., ensureNonEmptyLabelSelector(ctx,
labelSelector) or validateLabelSelector) that returns an error if labelSelector
is empty or whitespace, call it at the start of each Delete*ByLabel method, and
return that error instead of invoking kc.dynamic.Resource(...).DeleteCollection
when the selector is missing; apply the same pattern to all six methods so no
deletion proceeds with an empty selector.
In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`:
- Around line 147-166: cleanupSession currently only deletes labeled child
resources and leaves the per-session namespace (created by
ensureNamespaceExists) behind; update cleanupSession to also delete the
namespace itself (call the kube client's DeleteNamespace/DeleteNamespaceByName
equivalent for the computed namespace variable), handle
k8serrors.IsNotFound(err) the same way as other deletions, and log
errors/warnings via r.logger; make the identical change to the other cleanup
path referenced (the similar function/block around lines 210-242) so both
teardown code paths remove the namespace instead of leaking it.
- Around line 245-270: The RoleBinding name "ambient-image-puller" in
ensureImagePullAccess is shared across sessions so later calls reuse a binding
that points to the first session's subject; change the behavior so the binding
is unique per session or update an existing binding's subjects: either make the
RoleBinding name include the session namespace (e.g. use a name formatted with
the namespace when constructing rb) so each session creates its own binding, or
on CreateRoleBinding error IsAlreadyExists fetch the existing RoleBinding (via
the kube client), reconcile its "subjects" to include the serviceaccounts group
for the current namespace and update/patch it; reference ensureImagePullAccess,
rb, r.kube.CreateRoleBinding and r.cfg.RunnerImageNamespace when making the
change.
In `@components/ambient-control-plane/internal/reconciler/stress_test.go`:
- Around line 65-76: The stress test (TestSessionAPIStressTest) currently
assumes defaults from getTestAPIBaseURL()/getTestAPIToken()/getTestAPIProject()
and will hard-fail when no live API exists; change it to be opt-in by gating
execution on an explicit integration flag or env var (e.g. require
INTEGRATION_TEST=true or non-default API_BASE_URL/API_TOKEN) before constructing
StressTestConfig, and call t.Skip(...) when the flag/credentials are not set or
are the default placeholders; apply the same guard/skip logic to the other
affected tests that use getTestAPIBaseURL/getTestAPIToken/getTestAPIProject and
StressTestConfig so none of them run unless explicitly enabled.
In `@components/ambient-control-plane/internal/reconciler/tally_reconciler.go`:
- Around line 91-96: The handler handleSessionModified currently ignores
MODIFIED events and leaves SessionTallyReconciler's exported tallies
(SessionsByPhase, SessionsByUser) stale; update handleSessionModified to accept
both old and new session values from the event, compare old.Phase vs new.Phase
and old.UserID vs new.UserID, decrement the appropriate counts for the old
bucket(s) and increment the counts for the new bucket(s) (updating
SessionsByPhase and SessionsByUser accordingly), and preserve existing logging
via r.logger.Debug() in the function to reflect the move; ensure the update is
atomic/consistent with how increments/decrements are done elsewhere in
SessionTallyReconciler to avoid transient inconsistencies.
In `@components/ambient-control-plane/internal/reconciler/tally.go`:
- Around line 26-44: The TallyReconciler struct contains an unused sdk field and
NewTallyReconciler accepts an sdk parameter that is never used; remove the sdk
*sdkclient.Client field from the TallyReconciler type and drop the corresponding
sdk parameter from NewTallyReconciler, updating the constructor to only take
(resource string, logger zerolog.Logger) and to initialize the struct without
sdk (keep resource, logger.With().Str("reconciler", "tally-"+resource).Logger(),
seenIDs map, mu, tally, lastEventAt). Also search for and update all call sites
of NewTallyReconciler to stop passing the sdk argument.
In `@components/manifests/base/ambient-control-plane-service.yml`:
- Around line 25-26: The manifest currently pins the container image to
quay.io/ambient_code/ambient_control_plane:latest which makes deployments
non-reproducible; update the image field in the Ambient Control Plane service
manifest to use a pinned tag or an immutable digest (e.g.,
quay.io/ambient_code/ambient_control_plane@sha256:...) instead of :latest, and
keep imagePullPolicy as appropriate (e.g., IfNotPresent) so Kustomize overlays
can still override the tag when needed.
In `@components/manifests/base/platform/ambient-api-server-db.yml`:
- Around line 61-62: Replace the mutable image tag and private registry in the
base manifest so the deployment is deterministic and doesn't require Red Hat
credentials: update the image value currently set to
"registry.redhat.io/rhel9/postgresql-16:latest" to a pinned public tag (for
example "postgresql-16:10.1" or "docker.io/library/postgres:16.1") in the same
YAML key and keep imagePullPolicy as needed; ensure the change is made where the
image field is defined in the ambient-api-server-db manifest so overlays no
longer must remap the image to run locally.
- Around line 57-58: Add seccompProfile: { type: RuntimeDefault } alongside the
existing securityContext entries to enforce syscall filtering for
defense-in-depth: update the pod-level securityContext (the existing
securityContext block that contains runAsNonRoot: true) to include
seccompProfile: { type: RuntimeDefault } and also add the same seccompProfile
under the container-level securityContext for the container spec (the
container's securityContext block around lines where capabilities, runAsNonRoot,
runAsUser, etc. are defined) so both pod and container scopes explicitly set
RuntimeDefault.
In `@components/manifests/base/rbac/control-plane-clusterrole.yaml`:
- Around line 1-27: The current ClusterRole "ambient-control-plane" grants
cluster-wide CRUD (including deletecollection) on workload resources (secrets,
serviceaccounts, services, pods, jobs); restrict this by: 1) remove workload
resources and verbs (secrets, serviceaccounts, services, pods, jobs and their
verbs) from the "ambient-control-plane" ClusterRole so it only contains
cluster-scoped permissions (namespaces and vteam.ambient-code agenticsessions
entries), and 2) implement per-project namespace Roles and RoleBindings created
dynamically by the control plane that grant the workload verbs for resources
["secrets","serviceaccounts","services","pods"] and ["jobs"] within each project
namespace and bind them to the control-plane ServiceAccount; ensure Role names
and RoleBinding subjects align with the control plane ServiceAccount used by
"ambient-control-plane".
In `@components/manifests/deploy`:
- Around line 198-211: The oauth-secret.env creation currently writes plaintext
secrets (CLIENT_SECRET_VALUE, COOKIE_SECRET_VALUE) into the working tree with
default perms and no cleanup; change the flow in the secrets handling so you
either avoid writing oauth-secret.env at all (pipe secrets directly to the
consumer/subcommand) or, if a file is required, create it in a secure temporary
file (using a restrictive umask or mktemp in a secure dir), set tight
permissions, and register a trap/cleanup (remove file on EXIT/ERR) before the
first write; ensure the secrets subcommand consumes stdin or the temp file and
that oauth-secret.env is removed immediately after use so no leftover
credentials remain.
- Around line 164-174: The deploy script currently mutates the checked-in
overlay with "kustomize edit set namespace" and then attempts to restore it,
which can leave the repo modified on failure; instead, change the flow to
operate on a temporary copy or use a cleanup trap: copy overlays/production to a
temp dir, run "kustomize edit set namespace" (and any "kustomize edit set image"
calls) against that temp copy, run "kustomize build" and "oc delete -f -" from
the temp copy, then delete the temp copy; alternatively, install a trap at the
top of the script that restores the original namespace/image on EXIT or ERR so
"kustomize edit set namespace" never leaves the checked-in overlays modified
(apply same fix for the other similar blocks referenced).
- Line 8: Add pipefail to the shell safety flags so failures in the left side of
pipelines like the kustomize/manifest generators piped into "oc apply -f -" are
not ignored; replace the lone "set -e" with a safe flag that enables pipefail
(e.g., "set -euo pipefail" or add "set -o pipefail" prior to the
cluster-mutating pipelines) and apply this change before every occurrence of the
"... | oc apply -f -" pipeline (also at the other locations noted).
Additionally, update any random-secret generation that uses the fragile "tr |
head" pattern so it won’t fail on SIGPIPE (use a robust generator such as
openssl rand, head -c, or dd to produce fixed-length secrets instead of "tr |
head").
In `@components/manifests/deploy-no-api-server.sh`:
- Around line 35-41: The script uses unquoted variable expansions for NAMESPACE
(and other occurrences) which can cause word-splitting; update the if check and
oc commands shown (the conditional using oc get deployment ambient-api-server -n
${NAMESPACE} and the API_SERVER_IMAGE assignment using ${NAMESPACE}) to quote
the variable like "-n \"${NAMESPACE}\"" and likewise quote all other
${NAMESPACE} usages (lines referenced in the review) so every oc/echo/exit
invocation uses "${NAMESPACE}" to prevent splitting and preserve values.
In `@components/manifests/overlays/production/control-plane-image-patch.yaml`:
- Around line 9-10: The image for the container named "ambient-control-plane" is
pinned to ":latest", making production non-reproducible; update the image field
for the "ambient-control-plane" container to use an immutable reference (a
specific release tag or an image digest like `@sha256`:<digest>) instead of
":latest" so deploys only change when the manifest changes and the registry
content is fixed.
In `@components/manifests/overlays/production/kustomization.yaml`:
- Around line 62-64: The production kustomization image uses the mutable tag
"latest", which must be replaced with a fixed release identifier: update the
image entry referencing "quay.io/ambient_code/ambient_control_plane:latest" by
replacing newTag: latest with a specific release tag or git SHA (e.g., newTag:
<release-or-sha>) so the kustomization image stanza (name/newName/newTag) uses
an immutable reference for repeatable, traceable rollouts.
In
`@components/runners/ambient-runner/ambient_runner/bridges/claude/grpc_transport.py`:
- Around line 118-123: The listener currently ignores immediate watch worker
failures because _listen_loop only checks watch_future.done() inside the 30s
timeout branch; change _listen_loop to wait on both msg_queue.get() and the
watch_future (the Future returned from _watch_in_thread) so that if watch_future
completes with an exception the loop immediately notices and propagates the
failure to trigger reconnect/backoff logic. Specifically, when selecting/waiting
for a message from msg_queue, also add the watch_future to the wait set (and
handle its completion by re-raising or returning an error) so worker exceptions
from _watch_in_thread aren't masked; apply the same fix where watch_future is
currently polled (the code around the other check at the 147-150 region) to
ensure consistent immediate reconnect behavior.
In `@docs/internal/design/blackboard-api.md`:
- Around line 440-453: The snapshot query computes latest_checkins over the
entire session_checkins table which breaks the O(agents) guarantee; modify the
CTE named latest_checkins to apply the project filter before DISTINCT ON by
adding the same WHERE project_id = ? (or, alternatively, ensure SessionCheckIn
is denormalized with project_id and use that field in the CTE) so the CTE only
scans check-ins for the target project before selecting DISTINCT ON (agent_id).
In `@docs/internal/developer/agent-workflow.md`:
- Around line 248-253: The docs still show the removed direct session creation
using POST /sessions; change the example and expected-flow to first create an
agent (POST /agents with the required payload) and then call POST
/agents/{id}/ignite to create/ignite the session instead of POST /sessions, and
update the second example block the same way; specifically replace the POST
/sessions curl example with a create-agent request followed by POST
/agents/{id}/ignite and adjust any explanatory text to reference the agent-first
+ ignite flow.
In `@docs/internal/developer/local-development/openshift.md`:
- Around line 392-405: The docs are inconsistent: this section says the control
plane accepts RH SSO tokens (ocm token) and rejects oc whoami -t, but the
earlier acpctl login example still uses oc whoami -t; update the example to use
a RH SSO access token (ocm token) or a registered service account instead of oc
whoami -t. Specifically, change the acpctl/login example and any references to
authentication against ambient-api-server to obtain and use the token from `ocm
token` (or document using a long-lived service account token) and ensure the
secret name `ambient-control-plane-token` and restart step (`oc rollout restart
deployment/ambient-control-plane -n ambient-code`) align with that guidance.
In `@e2e/scripts/load-images.sh`:
- Around line 53-54: The verification step is filtering loaded images with grep
'vteam_' so the image "ambient_control_plane:latest" in the images array won't
show up; update the verification command that currently uses grep vteam_ (the
verification pipeline near the end of the script) to either remove the 'vteam_'
filter or expand it to include ambient_control_plane (or a more general pattern
like matching both vteam_ and ambient_control_plane) so the verification output
accurately reflects all images in the images array.
In `@REMOVE_CRDs.md`:
- Line 519: Fix the typo in the Phase 3 heading by replacing "Safe
Decomissioning (2 weeks)" with "Safe Decommissioning (2 weeks)" so the word
"Decommissioning" is spelled correctly in the heading line that currently reads
"Phase 3: Safe Decomissioning (2 weeks)".
In `@test-e2e-control-plane.sh`:
- Around line 149-164: The POST that sends a session message (the oc exec curl
that populates MESSAGE_RESPONSE and then tests for .id) currently swallows
errors (note the "|| true") and only logs a warning; change it so the request
failure causes the script to fail the test: remove the "|| true", ensure you
validate MESSAGE_RESPONSE contains an "id" (using jq -e '.id') and if not, log
an error and exit 1. Apply the same change to the similar block around the later
message/test (the other POST block that sets MESSAGE_RESPONSE / MESSAGE_ID at
lines ~187-197) so missing runner-side gRPC push or /events fan-out fails the
e2e run. Ensure you reference MESSAGE_RESPONSE, MESSAGE_ID, API_SERVER_URL and
SESSION_ID when implementing the check.
---
Outside diff comments:
In @.github/workflows/components-build-deploy.yml:
- Around line 354-363: Add handling for ambient-control-plane by emitting a
control_plane_tag in the "Determine image tags" step (set control_plane_tag to
github.sha when needs.detect-changes.outputs.ambient-control-plane == "true",
otherwise set it to "stage") and then include a kustomize edit set image command
in the "Update kustomization with image tags" step to update
quay.io/ambient_code/ambient_control_plane from :latest to :${{
steps.image-tags.outputs.control_plane_tag }} so the control-plane image uses
the newly built SHA when changed; reference the existing image-tags output
variable name control_plane_tag and the kustomize edit set image pattern used
for other components.
In
`@components/manifests/components/ambient-api-server-db/ambient-api-server-db-json-patch.yaml`:
- Around line 8-30: Remove the two redundant whole-array JSON Patch replace ops
that overwrite the container 0 env and volumeMounts in
ambient-api-server-db-json-patch.yaml: the replace at path
/spec/template/spec/containers/0/env (which replaces env entries like
POSTGRESQL_USER, POSTGRESQL_PASSWORD, POSTGRESQL_DATABASE) and the replace at
path /spec/template/spec/containers/0/volumeMounts (which replaces the mount for
name ambient-api-server-db-data). Instead delete these two ops from the patch so
the overlay inherits the base manifest's env and volumeMounts and future base
additions are preserved.
---
Duplicate comments:
In `@components/ambient-control-plane/go.mod`:
- Line 11: The go.mod entry pins google.golang.org/grpc at the vulnerable
v1.79.1; update that dependency in components/ambient-control-plane/go.mod to a
non-vulnerable release (e.g., v1.79.2 or any release that the
GHSA-p77j-4mvh-x3m3 advisory marks as patched), then run go mod tidy (or go get
google.golang.org/grpc@<patched-version>) to update go.sum; after updating,
verify the pinned version no longer falls in the advisory’s affected range for
GHSA-p77j-4mvh-x3m3.
In `@components/ambient-control-plane/internal/reconciler/project_reconciler.go`:
- Around line 104-110: The current creatorRoleBindingName function and the
reconcile logic create RoleBindings once by name and never update them, causing
stale subjects when the ambient-code.io/created-by value changes and collisions
from truncating long subjects; update creatorRoleBindingName to produce
deterministic unique names for any subject (e.g., include a stable hash of the
full subject instead of truncating to 40 chars) and change the reconciler that
handles ambient-project-admin RoleBindings to not only check existence but to
reconcile the RoleBinding's subjects/owner to match the current
ambient-code.io/created-by (update existing RoleBinding subjects when they
differ); reference creatorRoleBindingName and the reconciler that currently
checks for name existence to locate where to change name generation and add
subject-update logic.
- Around line 94-96: The CreateNamespace call in project_reconciler.go should
treat an AlreadyExists result as success: after calling
r.kube.CreateNamespace(ctx, ns) check the returned error and if
apierrors.IsAlreadyExists(err) then ignore it (consider creation successful)
otherwise return the wrapped fmt.Errorf as currently done; reference the
r.kube.CreateNamespace call and the ns/name variables and import/use
k8s.io/apimachinery/pkg/api/errors (apierrors) for the IsAlreadyExists check.
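The idempotent-create pattern described above can be sketched with stdlib errors only. In the real reconciler the check would be `apierrors.IsAlreadyExists` from `k8s.io/apimachinery`; the `errAlreadyExists` sentinel and in-memory store here are illustrative stand-ins:

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists stands in for a Kubernetes AlreadyExists API error.
var errAlreadyExists = errors.New("already exists")

// createNamespace simulates a create call against a cluster.
func createNamespace(store map[string]bool, name string) error {
	if store[name] {
		return fmt.Errorf("namespace %q: %w", name, errAlreadyExists)
	}
	store[name] = true
	return nil
}

// ensureNamespace treats AlreadyExists as success, making reconciles idempotent.
func ensureNamespace(store map[string]bool, name string) error {
	if err := createNamespace(store, name); err != nil && !errors.Is(err, errAlreadyExists) {
		return fmt.Errorf("create namespace %q: %w", name, err)
	}
	return nil
}

func main() {
	store := map[string]bool{}
	fmt.Println(ensureNamespace(store, "proj-a") == nil) // first create succeeds: true
	fmt.Println(ensureNamespace(store, "proj-a") == nil) // AlreadyExists is swallowed: true
}
```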
In
`@components/ambient-control-plane/internal/reconciler/project_settings_reconciler.go`:
- Around line 130-139: The loop over entries in project_settings_reconciler.go
currently logs errors from ensureGroupRoleBinding but always returns nil, hiding
failures; change the reconciler method that contains this loop to collect and
return errors (e.g., return the first error or aggregate them) instead of
swallowing them so callers/retry logic see failures—specifically update the
for-loop handling of ensureGroupRoleBinding(ctx, namespace, entry.GroupName,
entry.Role) to propagate its error (or append to an error slice and return a
combined error) rather than only logging it.
In `@components/ambient-control-plane/internal/reconciler/shared.go`:
- Around line 88-96: namespaceForSession currently only lowercases ProjectID and
returns KubeNamespace raw, which allows invalid Kubernetes names (underscores,
dots, spaces, uppercase, >63 chars); update namespaceForSession to run both
session.ProjectID and session.KubeNamespace through the shared DNS-1123
normalizer used elsewhere (reuse the existing normalizer function used by
namespaceForProject or create a single normalizeToDNS1123 helper) so returned
namespaces are DNS-1123 safe and truncated if needed; also ensure
namespaceForProject calls the same normalizer or add an assertion upstream that
ProjectID is already slug-safe.
In `@components/manifests/deploy`:
- Around line 82-109: Do not persist or echo the plaintext CLIENT_SECRET_VALUE
to /tmp/ambient-frontend-oauthclient.yaml or stdout; instead feed the
OAuthClient YAML to oc apply via stdin (oc apply -f -) without creating a temp
file so the secret never lands on disk, and remove any echo that prints
CLIENT_SECRET_VALUE in the fallback/error block. Update the logic around
OAUTH_APPLY_RC to build the YAML on-the-fly using CLIENT_SECRET_VALUE and
ROUTE_HOST and pipe it to oc apply -f -, and in the failure branch print a
redacted/admin-instruction that tells the admin how to apply the OAuthClient
without showing the secret (omit the secret value or instruct them to set
CLIENT_SECRET_VALUE in their environment) rather than echoing the secret.
In `@components/manifests/overlays/production/control-plane-env-patch.yaml`:
- Around line 15-16: The RUNNER_IMAGE env var is pinned to :latest causing
non-reproducible production deployments; update the value of the environment
variable named RUNNER_IMAGE (currently
"quay.io/ambient_code/vteam_claude_runner:latest") to a specific version tag or
an immutable digest (for example
"quay.io/ambient_code/vteam_claude_runner:v1.2.3" or
"quay.io/ambient_code/vteam_claude_runner@sha256:<digest>") so deployments are
deterministic; ensure you change the string literal in the manifest where
RUNNER_IMAGE is defined and verify any CI/deploy docs reference the same pinned
tag.
In `@components/runners/ambient-runner/ambient_runner/_session_messages_api.py`:
- Around line 77-84: The current code constructs payload_preview and emits it in
logger.info (using session_id, event_type, payload_len, payload_preview), which
leaks user/assistant content and increases log volume; update the generator so
generated code no longer logs payload previews at INFO—either remove
payload_preview from the logger.info call or lower that log call to DEBUG (e.g.,
use logger.debug with payload_preview) and keep only non-sensitive metadata
(session_id, event_type, payload_len) at INFO; apply the same change for the
other generated block that builds payload_preview (the analogous logger call
around the 136–147 area).
In `@components/runners/ambient-runner/ambient_runner/app.py`:
- Around line 356-403: In _push_initial_prompt_via_grpc the AmbientGRPCClient
created via AmbientGRPCClient.from_env() can leak if an exception occurs before
client.close(); wrap the client usage in a try/finally (or use a context manager
if AmbientGRPCClient supports __enter__/__exit__) so client.close() is always
called, ensuring you only call close when the client variable was successfully
assigned; update the function to create the client, run the payload push logic,
and then call client.close() in finally to guarantee cleanup.
- Around line 129-131: The code currently couples to ClaudeBridge's internals by
calling its private members _setup_platform() and _grpc_listener.ready; instead
add a public async method on ClaudeBridge (e.g., ensure_ready() or
setup_and_wait_ready()) that performs the internal setup and exposes the ready
signal (awaiting _setup_platform() and returning/awaiting the internal ready
event), then replace the direct calls in ambient_runner.app (the grpc_url
conditional) with a single await bridge.ensure_ready() to remove private member
access.
In `@components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py`:
- Around line 244-249: The call to await self._grpc_listener.stop() can raise
and currently short-circuits the remaining cleanup; wrap the await
self._grpc_listener.stop() call in a try/except that catches Exception, logs the
error (including the exception) and does not re-raise so that subsequent cleanup
calls await self._session_manager.shutdown() and await self._obs.finalize()
still run; ensure you reference self._grpc_listener.stop(),
self._session_manager.shutdown(), and self._obs.finalize() in the fix so the
cleanup order and logging are clear.
In `@components/runners/ambient-runner/ambient_runner/endpoints/events.py`:
- Around line 61-63: active_streams[thread_id] = queue unconditionally
overwrites an existing subscriber's queue (and later unconditional
active_streams.pop() removes a different queue), so change creation and teardown
to be identity-safe: when creating a queue in wait_for_events() (and the
subscribe endpoints), use active_streams.setdefault(thread_id, queue) or append
to a list of subscribers instead of replacing an existing entry, and track
whether this call actually created the queue; when removing/tearing down, only
remove the mapping if active_streams.get(thread_id) is the same queue you
created (e.g., if active_streams.get(thread_id) is queue:
active_streams.pop(thread_id)), and only close/put sentinel into queues that
this code created—this prevents orphaning other subscribers and avoids tearing
down queues you don't own.
In `@components/runners/ambient-runner/tests/test_app_initial_prompt.py`:
- Around line 34-35: The tests patch the definition site
"ambient_runner._grpc_client.AmbientGRPCClient" but the code under test imports
and binds the class into app's namespace, so update each patch target to the
lookup site "ambient_runner.app.AmbientGRPCClient"; specifically change the
patch call that wraps the call to _push_initial_prompt_via_grpc("hello world",
"sess-1") and the three other patch usages (the ones at the other test cases) to
patch "ambient_runner.app.AmbientGRPCClient" so the mocked class is used when
_push_initial_prompt_via_grpc invokes AmbientGRPCClient.
In `@components/runners/ambient-runner/tests/test_events_endpoint.py`:
- Around line 66-79: The test test_registers_queue_before_streaming creates a
local asyncio.Queue q and puts a terminal event into it but never registers it
with active_streams, so it doesn't actually verify queue-registration behavior;
either remove the unused q and its put_nowait call, or register it into
active_streams (e.g., active_streams["t-1"] = q) before creating the bridge via
_make_bridge so the endpoint will use that queue, and then after
client.stream("GET", "/events/t-1") and resp.read() assert the expected
registration/consumption from active_streams to confirm the endpoint
registered/used the queue.
In `@components/runners/ambient-runner/tests/test_grpc_transport.py`:
- Around line 152-181: The test doesn't verify bridge.run was invoked; modify
test_user_message_triggers_bridge_run to set a sentinel (e.g., an asyncio.Event
or boolean flag) inside the fake_run assigned to bridge.run (refer to fake_run
and bridge.run) and await that sentinel with a timeout after listener.ready
instead of just sleeping; ensure you still cancel the listener task as before
(listener._listen_loop and listener.ready) so the test fails if bridge.run never
fires.
In `@Makefile`:
- Line 68: Update the CONTROL_PLANE_IMAGE definition and all places still
referencing the old image name "vteam_control_plane" to "ambient_control_plane"
so manifests and the new build/push/load targets use the correct image;
specifically, change the CONTROL_PLANE_IMAGE variable and any other Makefile
variables/targets that set or reference "vteam_control_plane" (e.g., in
build/push/load targets and image-related variables) to use
"ambient_control_plane:$(IMAGE_TAG)" so the control-plane rollout uses the
correct image name.
```go
Reconcilers: parseReconcilers(envOrDefault("RECONCILERS", "tally,kube")),
RunnerImage: envOrDefault("RUNNER_IMAGE", "quay.io/ambient_code/vteam_claude_runner:latest"),
```
Reject or fully default invalid RECONCILERS values.
Load()'s normal default is tally,kube, but parseReconcilers() falls back to only tally when the env is whitespace/comma-only. That silently disables the kube reconciler on bad config and breaks the main provisioning path instead of restoring the full defaults or failing fast.
Suggested change
```diff
 func parseReconcilers(reconcilersStr string) []string {
-	if reconcilersStr == "" {
-		return []string{"tally"}
+	if strings.TrimSpace(reconcilersStr) == "" {
+		return []string{"tally", "kube"}
 	}
@@
-	if len(result) == 0 {
-		return []string{"tally"}
+	if len(result) == 0 {
+		return []string{"tally", "kube"}
 	}
```
Also applies to: 76-91
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/ambient-control-plane/internal/config/config.go` around lines 41 -
42, The current behavior in Load() uses
parseReconcilers(envOrDefault("RECONCILERS", "tally,kube")) which causes
parseReconcilers to silently collapse whitespace/comma-only values into just
"tally", disabling the kube reconciler; update parseReconcilers (and any other
parsing used in the 76-91 region) to validate the raw env string: if the env is
empty/only whitespace/only commas treat it as invalid and either (a) return the
full default set ["tally","kube"] or (b) return an error that Load() surfaces
and fails fast — then update Load() to handle the error path (or accept the
returned default) so invalid RECONCILERS never silently drop kube; reference
functions/fields: parseReconcilers, Load, Reconcilers, envOrDefault and apply
the same validation to the similar parsing code in lines 76-91.
```go
func (inf *Informer) Run(ctx context.Context) error {
	inf.logger.Info().Msg("performing initial list sync")

	if err := inf.initialSync(ctx); err != nil {
		inf.logger.Warn().Err(err).Msg("initial sync failed, will rely on watch events")
	}

	go inf.dispatchLoop(ctx)
```
Initial sync can deadlock before the dispatcher starts.
initialSync() enqueues every listed object through dispatchBlocking(), but dispatchLoop() is started afterward. Once the combined initial dataset exceeds the 256-event buffer, startup blocks forever with no consumer on eventCh. Start the dispatcher before syncing or dispatch initial-list events directly.
🛠 Suggested fix
```diff
 func (inf *Informer) Run(ctx context.Context) error {
+	go inf.dispatchLoop(ctx)
+
 	inf.logger.Info().Msg("performing initial list sync")
 	if err := inf.initialSync(ctx); err != nil {
 		inf.logger.Warn().Err(err).Msg("initial sync failed, will rely on watch events")
 	}
-
-	go inf.dispatchLoop(ctx)
 	inf.wireWatchHandlers()
```
Also applies to: 400-405
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/ambient-control-plane/internal/informer/informer.go` around lines
118 - 125, The initialSync currently enqueues every listed object via
dispatchBlocking while dispatchLoop is started after, which can deadlock when
the initial dataset exceeds the 256-event buffer; change Run to start the
dispatcher before performing the initial sync (move the go inf.dispatchLoop(ctx)
call to before calling inf.initialSync(ctx)) or alternatively have initialSync
use a non-blocking enqueue method (replace dispatchBlocking calls with a
non-blocking/try-enqueue variant) so initial list processing cannot block
waiting for a consumer; update Informer.Run, initialSync, and any calls to
dispatchBlocking to ensure the dispatcher (dispatchLoop) is running or that
enqueues are non-blocking.
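The ordering fix can be illustrated with a stdlib-only sketch: start the consumer goroutine before producing the initial event burst, so a burst larger than the channel buffer cannot block forever. The buffer size of 2 stands in for the informer's 256:

```go
package main

import "fmt"

// runSync models Run(): consumer first, then the blocking initial sync.
func runSync(items []string) []string {
	eventCh := make(chan string, 2)
	done := make(chan []string)

	// Consumer started first — the equivalent of `go inf.dispatchLoop(ctx)`
	// before initialSync().
	go func() {
		var handled []string
		for e := range eventCh {
			handled = append(handled, e)
		}
		done <- handled
	}()

	// "Initial sync": with the consumer already draining, this cannot
	// deadlock even though len(items) exceeds the buffer.
	for _, it := range items {
		eventCh <- it
	}
	close(eventCh)
	return <-done
}

func main() {
	fmt.Println(len(runSync([]string{"a", "b", "c", "d", "e"}))) // 5
}
```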
```go
	inf.logger.Info().Msg("starting gRPC watch streams")
	inf.watchManager.Run(ctx)

	return ctx.Err()
}
```
Treat context cancellation as a clean shutdown.
Run() returns ctx.Err() unconditionally after watchManager.Run(ctx), so a normal SIGTERM/SIGINT becomes context.Canceled and main() logs kube mode failed / test mode failed on expected shutdown. Return nil for expected cancellation, or special-case it in the caller.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/ambient-control-plane/internal/informer/informer.go` around lines
129 - 133, The function currently returns ctx.Err() unconditionally after
calling watchManager.Run(ctx), which causes normal cancellations (e.g.,
SIGINT/SIGTERM -> context.Canceled) to be treated as failures; update the return
behavior in the function containing inf.watchManager.Run(ctx) and Run() so that
expected context cancellation is treated as a clean shutdown: after
watchManager.Run(ctx) check ctx.Err() and return nil for context.Canceled and
context.DeadlineExceeded (or otherwise special-case expected cancellation),
otherwise return the actual error; reference the Run() caller and
inf.watchManager.Run(ctx) to locate where to implement this check.
```go
func buildRestConfig(kubeconfig string) (*rest.Config, error) {
	if kubeconfig != "" {
		return clientcmd.BuildConfigFromFlags("", kubeconfig)
	}

	home, _ := os.UserHomeDir()
	localPath := home + "/.kube/config"
	if _, err := os.Stat(localPath); err == nil {
		return clientcmd.BuildConfigFromFlags("", localPath)
	}

	return rest.InClusterConfig()
}
```
🧩 Analysis chain

Repository: ambient-code/platform

🏁 Script executed:
```shell
# First, let's read the full kubeclient.go file to understand the context
wc -l components/ambient-control-plane/internal/kubeclient/kubeclient.go
```

🏁 Script executed:
```shell
# Get full file content
cat -n components/ambient-control-plane/internal/kubeclient/kubeclient.go
```

🏁 Script executed:
```shell
# Search for how this client is instantiated and what environment it runs in
rg -A 5 -B 5 "buildRestConfig" components/ambient-control-plane/
```

🏁 Script executed:
```shell
# Look for deployment manifests or container configurations for the control-plane
fd -e yaml -e yml | xargs rg -l "ambient.*control.*plane|kubeclient" | head -20
```

🏁 Script executed:
```shell
# Search for any documentation about this service
fd -e md -e txt | xargs rg -l "ambient.*control.*plane" | head -10
```

🏁 Script executed:
```shell
# Check the control-plane service manifest
cat -n components/manifests/base/ambient-control-plane-service.yml
```

🏁 Script executed:
```shell
# Check the control-plane environment patches
cat -n components/manifests/overlays/kind-local/control-plane-env-patch.yaml
```

🏁 Script executed:
```shell
# Check the README for control-plane
cat -n components/ambient-control-plane/README.md
```

🏁 Script executed:
```shell
# Search for how the kubeconfig parameter is initialized
rg -n "New\(" components/ambient-control-plane/ --type go -A 3
```

🏁 Script executed:
```shell
# Check if KUBECONFIG env var is used anywhere
rg -n "KUBECONFIG" components/ambient-control-plane/ --type go -B 2 -A 2
```

🏁 Script executed:
```shell
# Check all deployment overlays for volume mounts
cat -n components/manifests/overlays/production/control-plane-image-patch.yaml
```

🏁 Script executed:
```shell
# Check kind overlays
cat -n components/manifests/overlays/kind/kustomization.yaml | head -50
```

🏁 Script executed:
```shell
# Look for any volume mounts that might be added via overlays
rg -n "volumeMounts|volumes:" components/manifests/ --type yaml -A 5
```
Prefer in-cluster credentials before probing ~/.kube/config.
This code runs in a control-plane service, but the current fallback order lets any mounted or baked-in home kubeconfig override the pod's service-account credentials. That can silently point reconciliation at the wrong cluster. Keep the explicit kubeconfig path for local use, then try rest.InClusterConfig() before any home-directory fallback.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/ambient-control-plane/internal/kubeclient/kubeclient.go` around
lines 88 - 99, The buildRestConfig function currently prefers a local
~/.kube/config over in-cluster service-account credentials; change the
resolution order so that after honoring an explicit kubeconfig parameter
(clientcmd.BuildConfigFromFlags when kubeconfig != ""), you attempt
rest.InClusterConfig() next and return it if successful, and only if in-cluster
fails fall back to checking the home kubeconfig via os.UserHomeDir and
clientcmd.BuildConfigFromFlags(localPath); update error handling so you return
the in-cluster config on success and only probe localPath when
rest.InClusterConfig() returns an error.
```go
func (kc *KubeClient) DeletePodsByLabel(ctx context.Context, namespace, labelSelector string) error {
	return kc.dynamic.Resource(PodGVR).Namespace(namespace).DeleteCollection(ctx, metav1.DeleteOptions{}, metav1.ListOptions{LabelSelector: labelSelector})
}
```
🧩 Analysis chain

🏁 Script executed:
```shell
cat -n components/ambient-control-plane/internal/kubeclient/kubeclient.go | head -230
```

🏁 Script executed:
```shell
rg "DeletePodsByLabel|DeleteServicesByLabel|DeleteSecretsByLabel|DeleteServiceAccountsByLabel|DeleteRolesByLabel|DeleteRoleBindingsByLabel" --type go -A 2 -B 2
```

🏁 Script executed:
```shell
grep -n "selector\s*=" components/ambient-control-plane/internal/reconciler/kube_reconciler.go | head -20
```

🏁 Script executed:
```shell
rg "selector\s*:=" components/ambient-control-plane/internal/reconciler/kube_reconciler.go -A 1 -B 1
```

🏁 Script executed:
```shell
rg "func sessionLabelSelector" components/ambient-control-plane/internal/reconciler/kube_reconciler.go -A 3
```
Reject empty selectors before calling DeleteCollection.
An empty labelSelector makes these helpers target every resource of that kind in the namespace. A missing or empty session ID would wipe all pods, services, secrets, service accounts, roles, and role bindings instead of just the reconciled set. This vulnerability exists across six methods: DeletePodsByLabel, DeleteServicesByLabel, DeleteSecretsByLabel, DeleteServiceAccountsByLabel, DeleteRolesByLabel, and DeleteRoleBindingsByLabel (lines 161–162, 174–175, 187–188, 200–201, 213–214, 217–218).
Extract validation into a shared helper method to guard all deletion operations:
Suggested hardening
```diff
+func (kc *KubeClient) deleteCollectionByLabel(ctx context.Context, gvr schema.GroupVersionResource, namespace, resourceName, labelSelector string) error {
+	if labelSelector == "" {
+		return fmt.Errorf("refusing to delete %s in namespace %q without a label selector", resourceName, namespace)
+	}
+
+	return kc.dynamic.Resource(gvr).Namespace(namespace).DeleteCollection(
+		ctx,
+		metav1.DeleteOptions{},
+		metav1.ListOptions{LabelSelector: labelSelector},
+	)
+}
+
 func (kc *KubeClient) DeletePodsByLabel(ctx context.Context, namespace, labelSelector string) error {
-	return kc.dynamic.Resource(PodGVR).Namespace(namespace).DeleteCollection(ctx, metav1.DeleteOptions{}, metav1.ListOptions{LabelSelector: labelSelector})
+	return kc.deleteCollectionByLabel(ctx, PodGVR, namespace, "pods", labelSelector)
 }
```
Apply the same pattern to all six Delete*ByLabel methods.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/ambient-control-plane/internal/kubeclient/kubeclient.go` around
lines 161 - 162, All six deletion helpers (DeletePodsByLabel,
DeleteServicesByLabel, DeleteSecretsByLabel, DeleteServiceAccountsByLabel,
DeleteRolesByLabel, DeleteRoleBindingsByLabel) currently call DeleteCollection
with an empty labelSelector which can delete every resource in a namespace; add
a shared validation helper (e.g., ensureNonEmptyLabelSelector(ctx,
labelSelector) or validateLabelSelector) that returns an error if labelSelector
is empty or whitespace, call it at the start of each Delete*ByLabel method, and
return that error instead of invoking kc.dynamic.Resource(...).DeleteCollection
when the selector is missing; apply the same pattern to all six methods so no
deletion proceeds with an empty selector.
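The guard suggested above can be exercised in isolation. The sketch below is illustrative, not the repository's code: `validateLabelSelector` is a hypothetical stand-in for the shared helper the review proposes, using only the standard library so the refusal logic is easy to verify.

```go
package main

import (
	"fmt"
	"strings"
)

// validateLabelSelector is a hypothetical stand-in for the shared guard the
// review suggests: reject empty or whitespace-only selectors before any
// DeleteCollection call, so a missing session ID cannot wipe a namespace.
func validateLabelSelector(resource, namespace, selector string) error {
	if strings.TrimSpace(selector) == "" {
		return fmt.Errorf("refusing to delete %s in namespace %q without a label selector", resource, namespace)
	}
	return nil
}

func main() {
	// Empty selector: deletion must be refused.
	if err := validateLabelSelector("pods", "session-ns", "   "); err != nil {
		fmt.Println("blocked:", err)
	}
	// Session-scoped selector: deletion may proceed.
	if err := validateLabelSelector("pods", "session-ns", "ambient.io/session-id=abc123"); err == nil {
		fmt.Println("allowed")
	}
}
```

Calling this at the top of each Delete*ByLabel method turns the empty-selector case into an error return instead of a namespace-wide delete.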
| # 4. Create a session (no initial_prompt — idle, waiting for messages) | ||
| curl -s -X POST http://localhost:18000/api/ambient/v1/sessions \ | ||
| -H "Authorization: Bearer $TOKEN" \ | ||
| -H "Content-Type: application/json" \ | ||
| -H "X-Ambient-Project: smoke-test" \ | ||
| -d '{"name":"smoke-session","project_id":"smoke-test"}' | python3 -m json.tool |
The smoke test still uses the removed direct session-creation flow.
This document calls POST /sessions directly, but the design docs added in this PR say sessions are created only via POST /agents/{id}/ignite. Leaving the old flow here will send testers through an API path the new model explicitly removes. Update the example and the expected-flow section to create an agent first and ignite it.
Also applies to: 315-324
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/internal/developer/agent-workflow.md` around lines 248 - 253, The docs
still show the removed direct session creation using POST /sessions; change the
example and expected-flow to first create an agent (POST /agents with the
required payload) and then call POST /agents/{id}/ignite to create/ignite the
session instead of POST /sessions, and update the second example block the same
way; specifically replace the POST /sessions curl example with a create-agent
request followed by POST /agents/{id}/ignite and adjust any explanatory text to
reference the agent-first + ignite flow.
| "ambient_control_plane:latest" | ||
| ) |
Verification output won't show ambient_control_plane.
The ambient_control_plane:latest image is correctly added to the loading array, but the verification command on line 93 filters with grep vteam_, so this image won't appear in the verification output. Consider updating the grep pattern or removing the filter for more accurate verification.
Suggested fix
echo "Verifying images in cluster..."
-$CONTAINER_ENGINE exec "${KIND_CLUSTER_NAME}-control-plane" crictl images | grep vteam_ | head -n 5
+$CONTAINER_ENGINE exec "${KIND_CLUSTER_NAME}-control-plane" crictl images | grep -E 'vteam_|ambient_control_plane' | head -n 10
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@e2e/scripts/load-images.sh` around lines 53 - 54, The verification step is
filtering loaded images with grep 'vteam_' so the image
"ambient_control_plane:latest" in the images array won't show up; update the
verification command that currently uses grep vteam_ (the verification pipeline
near the end of the script) to either remove the 'vteam_' filter or expand it to
include ambient_control_plane (or a more general pattern like matching both
vteam_ and ambient_control_plane) so the verification output accurately reflects
all images in the images array.
| migration-session-filters: "new-sessions-only" | ||
| ``` | ||
|
|
||
| ### Phase 3: Safe Decomissioning (2 weeks) |
Minor typo: "Decomissioning" → "Decommissioning".
-### Phase 3: Safe Decomissioning (2 weeks)
+### Phase 3: Safe Decommissioning (2 weeks)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ### Phase 3: Safe Decomissioning (2 weeks) | |
| ### Phase 3: Safe Decommissioning (2 weeks) |
🧰 Tools
🪛 LanguageTool
[grammar] ~519-~519: Ensure spelling is correct
Context: ...w-sessions-only" ``` ### Phase 3: Safe Decomissioning (2 weeks) Only after **100% of session...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@REMOVE_CRDs.md` at line 519, Fix the typo in the Phase 3 heading by replacing
"Safe Decomissioning (2 weeks)" with "Safe Decommissioning (2 weeks)" so the
word "Decommissioning" is spelled correctly in the heading line that currently
reads "Phase 3: Safe Decomissioning (2 weeks)".
| # Step 6: Try to send a message to the session (if it supports it) | ||
| log "${BLUE}💬 Step 6: Testing session interaction${NC}" | ||
|
|
||
| MESSAGE_RESPONSE=$(oc exec deployment/backend-api -n ${NAMESPACE} -- curl -s -X POST \ | ||
| -H "Content-Type: application/json" \ | ||
| -H "Authorization: Bearer $(oc whoami -t)" \ | ||
| -H "X-Ambient-Project: ${NAMESPACE}" \ | ||
| -d "{\"content\":\"Test message from e2e test\"}" \ | ||
| "${API_SERVER_URL}/api/ambient/v1/sessions/${SESSION_ID}/messages" 2>/dev/null || true) | ||
|
|
||
| if echo "$MESSAGE_RESPONSE" | jq -e '.id' >/dev/null 2>&1; then | ||
| MESSAGE_ID=$(echo "$MESSAGE_RESPONSE" | jq -r '.id') | ||
| log "${GREEN}✅ Successfully sent message: ${MESSAGE_ID}${NC}" | ||
| else | ||
| log "${YELLOW}⚠️ Message sending not available or failed (expected for current implementation)${NC}" | ||
| fi |
Make this e2e fail when the runner event path is broken.
Line 157 explicitly allows the session-message POST to fail, and Step 8 only checks control-plane watch logs. This means the script can still pass when the new runner-side gRPC push and /events fan-out never work.
Also applies to: 187-197
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test-e2e-control-plane.sh` around lines 149 - 164, The POST that sends a
session message (the oc exec curl that populates MESSAGE_RESPONSE and then tests
for .id) currently swallows errors (note the "|| true") and only logs a warning;
change it so the request failure causes the script to fail the test: remove the
"|| true", ensure you validate MESSAGE_RESPONSE contains an "id" (using jq -e
'.id') and if not, log an error and exit 1. Apply the same change to the similar
block around the later message/test (the other POST block that sets
MESSAGE_RESPONSE / MESSAGE_ID at lines ~187-197) so missing runner-side gRPC
push or /events fan-out fails the e2e run. Ensure you reference
MESSAGE_RESPONSE, MESSAGE_ID, API_SERVER_URL and SESSION_ID when implementing
the check.
Actionable comments posted: 3
♻️ Duplicate comments (4)
components/ambient-control-plane/cmd/ambient-control-plane/main.go (1)
180-186: ⚠️ Potential issue | 🟠 Major
Clone http.DefaultTransport instead of replacing it.
Line 181 discards the default transport's proxy, timeout, keep-alive, idle-connection, and HTTP/2 settings. That changes the behavior of every HTTP client in-process just to install the extra CA.
Suggested change
 func installServiceCAIntoDefaultTransport(pool *x509.CertPool) {
-	http.DefaultTransport = &http.Transport{
-		TLSClientConfig: &tls.Config{
-			MinVersion: tls.VersionTLS12,
-			RootCAs:    pool,
-		},
-	}
+	transport := http.DefaultTransport.(*http.Transport).Clone()
+	if transport.TLSClientConfig == nil {
+		transport.TLSClientConfig = &tls.Config{}
+	} else {
+		transport.TLSClientConfig = transport.TLSClientConfig.Clone()
+	}
+	transport.TLSClientConfig.MinVersion = tls.VersionTLS12
+	transport.TLSClientConfig.RootCAs = pool
+	http.DefaultTransport = transport
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/ambient-control-plane/cmd/ambient-control-plane/main.go` around lines 180 - 186, installServiceCAIntoDefaultTransport currently replaces http.DefaultTransport and loses existing proxy, timeouts, keep-alive, idle-connection and HTTP/2 settings; instead, detect if http.DefaultTransport is an *http.Transport, deep-copy it (e.g., dereference/copy the struct), preserve its Proxy/IdleConn/MaxConns/IdleConnTimeout/ForceAttemptHTTP2 and other fields, then update or create TLSClientConfig on the copied transport to set MinVersion = tls.VersionTLS12 and add/replace RootCAs with the provided pool, and finally assign the copied transport back to http.DefaultTransport so only the CA changes while all other original transport settings remain intact.
components/ambient-control-plane/internal/reconciler/kube_reconciler.go (2)
292-320: ⚠️ Potential issue | 🟠 Major
The shared RoleBinding name only works for the first namespace.
ambient-image-puller is always created in RunnerImageNamespace, but its subject changes with namespace. After the first create, later namespaces hit AlreadyExists on Line 316 and silently reuse a binding that still points at the old subject, so those pods do not get pull access.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go` around lines 292 - 320, The RoleBinding name is constant so the first create wins and later namespaces silently reuse the old subject; in ensureImagePullAccess change the binding name (variable name / metadata.name in rb) to be unique per target namespace (e.g., include the namespace via fmt.Sprintf("ambient-image-puller-%s", namespace)) so each call creates a distinct RoleBinding, and keep the CreateRoleBinding call as-is; alternatively, if you prefer a single binding, modify the code to fetch the existing RoleBinding and update its "subjects" to include the new namespace before returning (handle update via r.kube.UpdateRoleBinding) instead of treating IsAlreadyExists as success. Ensure you update the rb construction referenced in ensureImagePullAccess and the CreateRoleBinding error handling accordingly.
523-523: ⚠️ Potential issue | 🟠 Major
Handle NestedMap errors before copying the Vertex secret.
Line 523 drops the unstructured.NestedMap error, so a malformed source secret can be copied as empty or partial data and still look successful. Fail fast here before creating the destination secret.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go` at line 523, The call to unstructured.NestedMap (data, _, _ := unstructured.NestedMap(src.Object, "data")) swallows the returned error so a malformed src secret can be copied; change this to capture and check the error from unstructured.NestedMap and return/fail fast when err != nil before proceeding to create or populate the destination/Vertex secret (i.e., replace the ignored error with proper err handling near the code that reads src.Object and abort the reconcile or return the error so the destination secret is not created from invalid data).
components/ambient-control-plane/go.mod (1)
11-11: ⚠️ Potential issue | 🔴 Critical
Upgrade google.golang.org/grpc before shipping this service.
Line 11 still pins v1.79.1, which is flagged in the provided OSV results as GHSA-p77j-4mvh-x3m3. This PR adds a new control-plane binary that opens gRPC connections, so carrying a known-vulnerable transport dependency here is blocking. Bump to a patched release and refresh go.sum.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/ambient-control-plane/go.mod` at line 11, Update the pinned dependency google.golang.org/grpc in the module manifest to a patched release (bump the version from v1.79.1 to a non-vulnerable tag such as the latest patched v1.x.y), then run the Go tooling to refresh checksums (e.g., go get google.golang.org/grpc@vX.Y.Z && go mod tidy or go mod vendor as appropriate) so go.mod and go.sum reflect the upgraded, fixed release; ensure the new version resolves the GHSA-p77j-4mvh-x3m3 advisory before merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/ambient-control-plane-tests.yml:
- Around line 6-15: The workflow path filters are missing the ambient API
server; update the paths arrays under both the push and pull_request triggers to
include 'components/ambient-api-server/**' so changes to the API server (which
is replaced by ../ambient-api-server in components/ambient-control-plane/go.mod)
will trigger the ambient-control-plane-tests workflow; modify the paths entries
alongside the existing 'components/ambient-control-plane/**' and
'components/ambient-sdk/go-sdk/**' lines in
.github/workflows/ambient-control-plane-tests.yml.
In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`:
- Around line 257-287: The current ensureNamespaceExists returns early when
r.kube.GetNamespace succeeds so existing namespaces never run
ensureImagePullAccess; also when ensureImagePullAccess fails it only logs a
warning and lets provisioning continue. Change ensureNamespaceExists (the
function using r.kube.GetNamespace, r.kube.CreateNamespace and calling
ensureImagePullAccess) to always call ensureImagePullAccess when
r.cfg.RunnerImageNamespace is set, regardless of whether the namespace was newly
created or preexisting, and treat errors from ensureImagePullAccess as blocking
by returning the error (wrap with context like "ensuring image pull access for
namespace %s: %w") instead of only logging a warning so provisioning halts if
image-pull access cannot be granted.
- Around line 189-213: The cleanup code currently sets cleanupErr when
r.kube.DeletePodsByLabel / DeleteSecretsByLabel / DeleteServiceAccountsByLabel /
DeleteServicesByLabel fail but still returns nil at the end; change the final
return to propagate the failure (return cleanupErr or a wrapped error) so
callers see cleanup failures instead of success. Keep the existing
metrics.SessionCleanupTotal increment and logging behavior, but after updating
the metric/logging branch, ensure the function returns cleanupErr when non-nil
(or a combined error if you prefer aggregating multiple errors).
---
Duplicate comments:
In `@components/ambient-control-plane/cmd/ambient-control-plane/main.go`:
- Around line 180-186: installServiceCAIntoDefaultTransport currently replaces
http.DefaultTransport and loses existing proxy, timeouts, keep-alive,
idle-connection and HTTP/2 settings; instead, detect if http.DefaultTransport is
an *http.Transport, deep-copy it (e.g., dereference/copy the struct), preserve
its Proxy/IdleConn/MaxConns/IdleConnTimeout/ForceAttemptHTTP2 and other fields,
then update or create TLSClientConfig on the copied transport to set MinVersion
= tls.VersionTLS12 and add/replace RootCAs with the provided pool, and finally
assign the copied transport back to http.DefaultTransport so only the CA changes
while all other original transport settings remain intact.
In `@components/ambient-control-plane/go.mod`:
- Line 11: Update the pinned dependency google.golang.org/grpc in the module
manifest to a patched release (bump the version from v1.79.1 to a non-vulnerable
tag such as the latest patched v1.x.y), then run the Go tooling to refresh
checksums (e.g., go get google.golang.org/grpc@vX.Y.Z && go mod tidy or go mod
vendor as appropriate) so go.mod and go.sum reflect the upgraded, fixed release;
ensure the new version resolves the GHSA-p77j-4mvh-x3m3 advisory before merging.
In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`:
- Around line 292-320: The RoleBinding name is constant so the first create wins
and later namespaces silently reuse the old subject; in ensureImagePullAccess
change the binding name (variable name / metadata.name in rb) to be unique per
target namespace (e.g., include the namespace via
fmt.Sprintf("ambient-image-puller-%s", namespace)) so each call creates a
distinct RoleBinding, and keep the CreateRoleBinding call as-is; alternatively,
if you prefer a single binding, modify the code to fetch the existing
RoleBinding and update its "subjects" to include the new namespace before
returning (handle update via r.kube.UpdateRoleBinding) instead of treating
IsAlreadyExists as success. Ensure you update the rb construction referenced in
ensureImagePullAccess and the CreateRoleBinding error handling accordingly.
- Line 523: The call to unstructured.NestedMap (data, _, _ :=
unstructured.NestedMap(src.Object, "data")) swallows the returned error so a
malformed src secret can be copied; change this to capture and check the error
from unstructured.NestedMap and return/fail fast when err != nil before
proceeding to create or populate the destination/Vertex secret (i.e., replace
the ignored error with proper err handling near the code that reads src.Object
and abort the reconcile or return the error so the destination secret is not
created from invalid data).
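The NestedMap fail-fast pattern described above can be sketched without client-go. `nestedMap` below is a hypothetical stdlib stand-in that mimics unstructured.NestedMap's (value, found, err) contract; the point is that capturing err, instead of discarding it with `_`, stops a malformed secret before the copy.

```go
package main

import "fmt"

// nestedMap mimics unstructured.NestedMap's (value, found, err) contract:
// err is non-nil when the field exists but is not a map.
func nestedMap(obj map[string]interface{}, field string) (map[string]interface{}, bool, error) {
	raw, found := obj[field]
	if !found {
		return nil, false, nil
	}
	m, ok := raw.(map[string]interface{})
	if !ok {
		return nil, true, fmt.Errorf("field %q is %T, expected map", field, raw)
	}
	return m, true, nil
}

// copySecretData fails fast on a malformed source instead of silently
// producing empty data, as the review asks for the Vertex secret copy.
func copySecretData(src map[string]interface{}) (map[string]interface{}, error) {
	data, found, err := nestedMap(src, "data")
	if err != nil {
		return nil, fmt.Errorf("reading source secret data: %w", err)
	}
	if !found {
		return nil, fmt.Errorf("source secret has no data field")
	}
	return data, nil
}

func main() {
	// Malformed secret: "data" is a string, not a map, so it must be rejected.
	_, err := copySecretData(map[string]interface{}{"data": "oops"})
	fmt.Println("malformed rejected:", err != nil)
}
```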
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6ffee65c-da79-44a5-bb12-069256734255
⛔ Files ignored due to path filters (1)
components/ambient-control-plane/go.sum is excluded by !**/*.sum
📒 Files selected for processing (5)
.github/workflows/ambient-control-plane-tests.yml
components/ambient-control-plane/cmd/ambient-control-plane/main.go
components/ambient-control-plane/go.mod
components/ambient-control-plane/internal/metrics/metrics.go
components/ambient-control-plane/internal/reconciler/kube_reconciler.go
| paths: | ||
| - 'components/ambient-control-plane/**' | ||
| - 'components/ambient-sdk/go-sdk/**' | ||
| - '.github/workflows/ambient-control-plane-tests.yml' | ||
| pull_request: | ||
| branches: [main] | ||
| paths: | ||
| - 'components/ambient-control-plane/**' | ||
| - 'components/ambient-sdk/go-sdk/**' | ||
| - '.github/workflows/ambient-control-plane-tests.yml' |
Add components/ambient-api-server/** to these path filters.
components/ambient-control-plane/go.mod replaces github.com/ambient-code/platform/components/ambient-api-server with ../ambient-api-server, so API-server changes can break this module without rerunning this workflow.
Suggested change
push:
branches: [main]
paths:
- 'components/ambient-control-plane/**'
+ - 'components/ambient-api-server/**'
- 'components/ambient-sdk/go-sdk/**'
- '.github/workflows/ambient-control-plane-tests.yml'
pull_request:
branches: [main]
paths:
- 'components/ambient-control-plane/**'
+ - 'components/ambient-api-server/**'
- 'components/ambient-sdk/go-sdk/**'
- '.github/workflows/ambient-control-plane-tests.yml'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/ambient-control-plane-tests.yml around lines 6 - 15, The
workflow path filters are missing the ambient API server; update the paths
arrays under both the push and pull_request triggers to include
'components/ambient-api-server/**' so changes to the API server (which is
replaced by ../ambient-api-server in components/ambient-control-plane/go.mod)
will trigger the ambient-control-plane-tests workflow; modify the paths entries
alongside the existing 'components/ambient-control-plane/**' and
'components/ambient-sdk/go-sdk/**' lines in
.github/workflows/ambient-control-plane-tests.yml.
| var cleanupErr error | ||
| if err := r.kube.DeletePodsByLabel(ctx, namespace, selector); err != nil && !k8serrors.IsNotFound(err) { | ||
| r.logger.Warn().Err(err).Str("session_id", session.ID).Msg("deleting pods during cleanup") | ||
| cleanupErr = err | ||
| } | ||
| if err := r.kube.DeleteSecretsByLabel(ctx, namespace, selector); err != nil && !k8serrors.IsNotFound(err) { | ||
| r.logger.Warn().Err(err).Str("session_id", session.ID).Msg("deleting secrets during cleanup") | ||
| cleanupErr = err | ||
| } | ||
| if err := r.kube.DeleteServiceAccountsByLabel(ctx, namespace, selector); err != nil && !k8serrors.IsNotFound(err) { | ||
| r.logger.Warn().Err(err).Str("session_id", session.ID).Msg("deleting service accounts during cleanup") | ||
| cleanupErr = err | ||
| } | ||
| if err := r.kube.DeleteServicesByLabel(ctx, namespace, selector); err != nil && !k8serrors.IsNotFound(err) { | ||
| r.logger.Warn().Err(err).Str("session_id", session.ID).Msg("deleting services during cleanup") | ||
| cleanupErr = err | ||
| } | ||
|
|
||
| if cleanupErr != nil { | ||
| metrics.SessionCleanupTotal.WithLabelValues("error").Inc() | ||
| } else { | ||
| metrics.SessionCleanupTotal.WithLabelValues("success").Inc() | ||
| r.logger.Info().Str("session_id", session.ID).Str("namespace", namespace).Msg("session resources cleaned up") | ||
| } | ||
| return nil |
Return cleanup failures instead of swallowing them.
When any delete on Lines 190-204 fails, cleanupErr is set but Line 213 still returns nil. That makes deleted-session reconciles look successful to the caller and leaves leaked resources much harder to detect.
Suggested change
if cleanupErr != nil {
metrics.SessionCleanupTotal.WithLabelValues("error").Inc()
- } else {
- metrics.SessionCleanupTotal.WithLabelValues("success").Inc()
- r.logger.Info().Str("session_id", session.ID).Str("namespace", namespace).Msg("session resources cleaned up")
+ return cleanupErr
}
+ metrics.SessionCleanupTotal.WithLabelValues("success").Inc()
+ r.logger.Info().Str("session_id", session.ID).Str("namespace", namespace).Msg("session resources cleaned up")
return nil🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`
around lines 189 - 213, The cleanup code currently sets cleanupErr when
r.kube.DeletePodsByLabel / DeleteSecretsByLabel / DeleteServiceAccountsByLabel /
DeleteServicesByLabel fail but still returns nil at the end; change the final
return to propagate the failure (return cleanupErr or a wrapped error) so
callers see cleanup failures instead of success. Keep the existing
metrics.SessionCleanupTotal increment and logging behavior, but after updating
the metric/logging branch, ensure the function returns cleanupErr when non-nil
(or a combined error if you prefer aggregating multiple errors).
| func (r *SimpleKubeReconciler) ensureNamespaceExists(ctx context.Context, namespace string, session types.Session) error { | ||
| if _, err := r.kube.GetNamespace(ctx, namespace); err == nil { | ||
| return nil | ||
| } | ||
|
|
||
| ns := &unstructured.Unstructured{ | ||
| Object: map[string]interface{}{ | ||
| "apiVersion": "v1", | ||
| "kind": "Namespace", | ||
| "metadata": map[string]interface{}{ | ||
| "name": namespace, | ||
| "labels": map[string]interface{}{ | ||
| LabelManaged: "true", | ||
| LabelProjectID: session.ProjectID, | ||
| LabelManagedBy: "ambient-control-plane", | ||
| }, | ||
| }, | ||
| }, | ||
| } | ||
|
|
||
| if _, err := r.kube.CreateNamespace(ctx, ns); err != nil && !k8serrors.IsAlreadyExists(err) { | ||
| return fmt.Errorf("creating namespace %s: %w", namespace, err) | ||
| } | ||
|
|
||
| r.logger.Info().Str("namespace", namespace).Msg("namespace created for session") | ||
|
|
||
| if r.cfg.RunnerImageNamespace != "" { | ||
| if err := r.ensureImagePullAccess(ctx, namespace); err != nil { | ||
| r.logger.Warn().Err(err).Str("namespace", namespace).Msg("failed to grant image pull access") | ||
| } | ||
| } |
Make image-pull access part of the steady-state reconcile.
Line 258 returns before Lines 283-287 when the namespace already exists, so preexisting namespaces never get image-pull access reconciled here. On top of that, Line 285 only logs a warning and provisioning continues, which can move the session forward even though the runner pod still cannot pull its image.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`
around lines 257 - 287, The current ensureNamespaceExists returns early when
r.kube.GetNamespace succeeds so existing namespaces never run
ensureImagePullAccess; also when ensureImagePullAccess fails it only logs a
warning and lets provisioning continue. Change ensureNamespaceExists (the
function using r.kube.GetNamespace, r.kube.CreateNamespace and calling
ensureImagePullAccess) to always call ensureImagePullAccess when
r.cfg.RunnerImageNamespace is set, regardless of whether the namespace was newly
created or preexisting, and treat errors from ensureImagePullAccess as blocking
by returning the error (wrap with context like "ensuring image pull access for
namespace %s: %w") instead of only logging a warning so provisioning halts if
image-pull access cannot be granted.
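The control flow the prompt asks for can be sketched with stubs. Everything here is hypothetical scaffolding (the package-level error stands in for a failing RoleBinding create); the shape to note is that image-pull access is reconciled on every pass, namespace preexisting or not, and its failure blocks provisioning instead of being logged and dropped.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for a RoleBinding create that the cluster rejects.
var imagePullAccessError = errors.New("rolebinding create forbidden")

func ensureImagePullAccess(namespace string) error { return imagePullAccessError }

// ensureNamespaceExists sketches the review's shape: no early return that
// skips image-pull reconciliation, and a hard error when access cannot be
// granted.
func ensureNamespaceExists(namespace, runnerImageNamespace string) error {
	// Namespace get/create would happen here; omitted in this sketch.
	if runnerImageNamespace != "" {
		if err := ensureImagePullAccess(namespace); err != nil {
			return fmt.Errorf("ensuring image pull access for namespace %s: %w", namespace, err)
		}
	}
	return nil
}

func main() {
	err := ensureNamespaceExists("session-ns", "ambient-code")
	fmt.Println("provisioning blocked:", err != nil)
}
```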
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/manifests/base/core/ambient-control-plane-deployment.yaml`:
- Around line 24-54: Add HTTP health endpoints and Kubernetes probes: update the
Go service in components/ambient-control-plane/cmd/ambient-control-plane/main.go
to register /healthz (liveness) and /readyz (readiness) on the existing metrics
HTTP mux (port 9090) and ensure they return 200 when healthy and non-2xx when
unhealthy; then update the ambient-control-plane container spec in
ambient-control-plane-deployment.yaml to add a livenessProbe (httpGet /healthz
on port 9090 with reasonable initialDelaySeconds/periodSeconds/failureThreshold)
and a readinessProbe (httpGet /readyz on port 9090 with reasonable
timeout/period/failureThreshold) for the container named ambient-control-plane
so Kubernetes can detect and restart/unready failing pods.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5cc5d3aa-dd7a-4d37-9d38-8de88412fad2
📒 Files selected for processing (3)
components/manifests/base/core/ambient-control-plane-deployment.yaml
components/manifests/base/core/kustomization.yaml
components/manifests/base/kustomization.yaml
| - name: ambient-control-plane | ||
| image: quay.io/ambient_code/vteam_control_plane:latest | ||
| imagePullPolicy: Always | ||
| securityContext: | ||
| allowPrivilegeEscalation: false | ||
| readOnlyRootFilesystem: true | ||
| capabilities: | ||
| drop: ["ALL"] | ||
| env: | ||
| - name: AMBIENT_API_TOKEN | ||
| valueFrom: | ||
| secretKeyRef: | ||
| name: ambient-control-plane-token | ||
| key: token | ||
| - name: AMBIENT_API_SERVER_URL | ||
| value: "https://ambient-api-server.ambient-code.svc:8000" | ||
| - name: AMBIENT_GRPC_SERVER_ADDR | ||
| value: "ambient-api-server.ambient-code.svc:9000" | ||
| - name: AMBIENT_GRPC_USE_TLS | ||
| value: "true" | ||
| - name: MODE | ||
| value: "kube" | ||
| - name: LOG_LEVEL | ||
| value: "info" | ||
| resources: | ||
| requests: | ||
| cpu: 50m | ||
| memory: 64Mi | ||
| limits: | ||
| cpu: 200m | ||
| memory: 256Mi |
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Search for health check endpoint handlers in the control-plane Go code
# Search for common health endpoint patterns
rg -n -C3 --type=go -g '!**/vendor/**' -g '!**/test/**' '/healthz|/readyz|/health|healthcheck' components/ambient-control-plane/
Repository: ambient-code/platform
Length of output: 47
🏁 Script executed:
#!/bin/bash
# Find the control-plane Go code structure
find . -path "*/ambient-control-plane/*" -type f -name "*.go" | head -20
Repository: ambient-code/platform
Length of output: 1108
🏁 Script executed:
#!/bin/bash
# Check if the control-plane directory exists and list its structure
ls -la components/ambient-control-plane/ 2>/dev/null || echo "Directory not found at expected location"
# Try alternative paths
find . -maxdepth 3 -type d -name "*control-plane*" | grep -v node_modules | grep -v ".git"
Repository: ambient-code/platform
Length of output: 821
🏁 Script executed:
#!/bin/bash
# Search for any HTTP server setup, main function, or service initialization
rg -n --type=go -g '!**/vendor/**' 'func main|http.Server|http.Handle|ListenAndServe|/metrics|port.*9090' --max-count=20
Repository: ambient-code/platform
Length of output: 7585
🏁 Script executed:
#!/bin/bash
# Read the control-plane main.go file to understand server setup
cat -n components/ambient-control-plane/cmd/ambient-control-plane/main.go
Repository: ambient-code/platform
Length of output: 9000
🏁 Script executed:
#!/bin/bash
# Check if there are any health-related functions defined in the control-plane package
rg -n --type=go 'health|ready|liveness' components/ambient-control-plane/ -i
Repository: ambient-code/platform
Length of output: 3026
Add health check endpoints and probes to the control-plane container.
The ambient-control-plane lacks health probes, preventing Kubernetes from detecting failures or unresponsiveness. The current implementation only exposes /metrics on port 9090; it does not implement /healthz or /readyz endpoints.
To add health probes, you must:
- Implement health check endpoints in the Go service (components/ambient-control-plane/cmd/ambient-control-plane/main.go):
  - Add /healthz (liveness) and /readyz (readiness) endpoints to the HTTP mux (port 9090)
  - Return appropriate status codes based on service health
- Add probes to the deployment once endpoints are available:
  - livenessProbe: /healthz with appropriate timeout/threshold values
  - readinessProbe: /readyz with appropriate timeout/threshold values
Without these probes, rolling updates may proceed with broken pods, and failed instances won't restart automatically.
🧰 Tools
🪛 Trivy (0.69.3)
[warning] 24-54: Image tag ":latest" used
Container 'ambient-control-plane' of Deployment 'ambient-control-plane' should specify an image tag
Rule: KSV-0013
(IaC/Kubernetes)
[info] 24-54: Runs with UID <= 10000
Container 'ambient-control-plane' of Deployment 'ambient-control-plane' should set 'securityContext.runAsUser' > 10000
Rule: KSV-0020
(IaC/Kubernetes)
[info] 24-54: Runs with GID <= 10000
Container 'ambient-control-plane' of Deployment 'ambient-control-plane' should set 'securityContext.runAsGroup' > 10000
Rule: KSV-0021
(IaC/Kubernetes)
[warning] 24-54: Restrict container images to trusted registries
Container ambient-control-plane in deployment ambient-control-plane (namespace: default) uses an image from an untrusted registry.
Rule: KSV-0125
(IaC/Kubernetes)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/manifests/base/core/ambient-control-plane-deployment.yaml` around
lines 24 - 54, Add HTTP health endpoints and Kubernetes probes: update the Go
service in components/ambient-control-plane/cmd/ambient-control-plane/main.go to
register /healthz (liveness) and /readyz (readiness) on the existing metrics
HTTP mux (port 9090) and ensure they return 200 when healthy and non-2xx when
unhealthy; then update the ambient-control-plane container spec in
ambient-control-plane-deployment.yaml to add a livenessProbe (httpGet /healthz
on port 9090 with reasonable initialDelaySeconds/periodSeconds/failureThreshold)
and a readinessProbe (httpGet /readyz on port 9090 with reasonable
timeout/period/failureThreshold) for the container named ambient-control-plane
so Kubernetes can detect and restart/unready failing pods.
55337c5 to
e2cfb8e
Compare
…ent streaming - New ambient-control-plane binary: gRPC watch + list-sync informer, reconcilers for sessions/projects/project-settings, KubeClient wrapper - Runner: gRPC transport bridge, _grpc_client, _session_messages_api, events endpoint - Makefile targets: build-control-plane, deploy-openshift, local-sync-version - Supporting docs, architecture notes, stress/tally tests 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
…validation - Add ambient-control-plane image to components-build-deploy.yml and prod-release-deploy.yaml - Add Dockerfile build validation step to ambient-control-plane-tests.yml (PR-only, no push) - Fix deploy-to-openshift condition; add CP to e2e image loader 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
…tionalize image sources - Move ambient-control-plane-service.yml → base/core/ambient-control-plane-deployment.yaml - Add CP RBAC (ClusterRole, ClusterRoleBinding, ServiceAccount) to base/rbac/ - Fix image name throughout: ambient_control_plane → vteam_control_plane - Production: quay.io (no patch needed); local-dev: internal OpenShift registry patch - kind/kind-local: localhost image remaps; production: quay.io from base - Add manifests/README.md documenting all overlay environments - Add gRPC route, API server DB, deploy scripts 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
e2cfb8e to
7dea845
Compare
…ploy-with-dispatch 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Review Queue Status
Action needed: Fix failing CI checks; Resolve merge conflicts
closing this in favor of alpha branch |
Summary
Re-introduces the work from PR #975 (which was reverted in PR #980 due to a test failure). That failure has since been resolved in
main.
- Runner: _grpc_client.py, _session_messages_api.py, grpc_transport.py bridge between Claude Code SSE output and gRPC push stream, /events SSE endpoint for AG-UI fan-out. Structured logging throughout.
- New workflow (ambient-control-plane-tests).
- 70 files changed, +10,310 / -158
Test plan
- make build-control-plane succeeds
- make kind-up && make deploy — control-plane pod starts and reconciles sessions
- cd components/runners/ambient-runner && python -m pytest tests/
- ambient-control-plane-tests workflow passes
🤖 Generated with Claude Code