
feat(ci): ephemeral PR test instances on MPP dev cluster#1032

Open
markturansky wants to merge 1 commit into main from test_my_pr

Conversation

@markturansky
Contributor

Summary

  • Build workflow: All 7 component images are now built and pushed to quay on every PR (pr-<N>-amd64 tag), not just changed components
  • Namespace provisioner (components/pr-test/provision.sh): Creates/destroys TenantNamespace CRs on ambient-code--config; capacity-gated at 5 concurrent S0.x instances
  • Ambient installer (components/pr-test/install.sh): Deploys production manifests with PR image tags; copies ambient-vertex and ambient-api-server secrets from source namespace via oc; mirrors Vertex AI config from operator-config
  • GitHub workflows: pr-e2e-openshift.yml (provision → install → e2e → teardown on build completion) and pr-namespace-cleanup.yml (safety-net teardown on PR close)
  • Skills: .claude/skills/ambient/ (install on any OpenShift namespace) and .claude/skills/ambient-pr-test/ (full PR test workflow)
  • Developer guide: openshift.md trimmed to internal-registry-only content with links to skills
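The naming conventions above fit together as follows; a minimal shell sketch for this PR's values (the slug rule is an assumption inferred from the names shown, not the scripts' exact code):

```shell
# Illustrative: how the PR image tag and ephemeral namespace are derived.
PR_NUMBER=1032
ARCH=amd64
BRANCH="test_my_pr"

IMAGE_TAG="pr-${PR_NUMBER}-${ARCH}"

# Assumed slug rule: lowercase, non-alphanumerics collapsed to single dashes.
BRANCH_SLUG=$(printf '%s' "$BRANCH" \
  | tr '[:upper:]' '[:lower:]' \
  | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//')
NAMESPACE="ambient-code--pr-${PR_NUMBER}-${BRANCH_SLUG}"

echo "$IMAGE_TAG"    # pr-1032-amd64
echo "$NAMESPACE"    # ambient-code--pr-1032-test-my-pr
```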

Test plan

  • Add TEST_OPENSHIFT_SERVER and TEST_OPENSHIFT_TOKEN GitHub Actions secrets pointing at dev-spoke-aws-us-east-1
  • Verify build workflow pushes images tagged pr-<N>-amd64 to quay on this PR
  • Verify pr-e2e-openshift.yml triggers and provisions ambient-code--pr-<N>-* namespace
  • Verify Ambient deploys and health check passes
  • Verify E2E tests run against the ephemeral instance
  • Verify namespace is torn down after workflow completes
  • Verify pr-namespace-cleanup.yml fires on PR close

🤖 Generated with Claude Code

@coderabbitai
Contributor

coderabbitai bot commented Mar 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds PR-focused ephemeral environment tooling: build/push scripts and workflows, TenantNamespace provisioning and teardown scripts, an installer to deploy Ambient into PR namespaces, accompanying docs/skills, and removal of an unused API proposal doc.

Changes

  • Skills / Docs — `.claude/skills/ambient-pr-test/SKILL.md`, `.claude/skills/ambient/SKILL.md`, `components/pr-test/README.md`, `docs/internal/developer/local-development/openshift.md`
    Adds PR-test and Ambient installer skill docs and a README describing the TenantNamespace spec, naming, capacity rules, RBAC, and procedure; rewrites the OpenShift local-dev guide to reference the internal registry and the new PR flows.
  • GitHub Actions workflows — `.github/workflows/components-build-deploy.yml`, `.github/workflows/pr-e2e-openshift.yml`, `.github/workflows/pr-namespace-cleanup.yml`
    Alters the build workflow to push PR images and add a BuildKit cache; adds a PR E2E workflow that builds, provisions an ephemeral namespace, installs Ambient, runs Cypress, and always tears down; adds a PR-close cleanup workflow to destroy instances.
  • PR build script — `components/pr-test/build.sh`
    New script to build and push component images for a given PR URL, producing a pr-<N>-amd64 image tag.
  • Provisioning script — `components/pr-test/provision.sh`
    New provisioner CLI to create/destroy TenantNamespace CRs with concurrency limits, readiness/delete polling, timeouts, and env-configurable parameters (MAX_S0X_INSTANCES, READY_TIMEOUT, DELETE_TIMEOUT).
  • Install script — `components/pr-test/install.sh`
    New installer that copies secrets, applies CRDs/RBAC, rewrites kustomize overlays to use a PR image tag, patches operator-config and ambient-agent-registry, waits for rollouts, performs health checks, and emits frontend_url and namespace outputs.
  • Removed proposal — `docs/internal/proposals/acp-public-rest-api.md`
    Deletes the ACP Public REST API proposal document.

Sequence Diagram

sequenceDiagram
    participant Dev as Developer
    participant GH as GitHub Actions
    participant Prov as provision.sh
    participant OS as OpenShift API
    participant Inst as install.sh
    participant E2E as Cypress

    Dev->>GH: Open/Update PR
    GH->>GH: Build & Push component images
    GH->>Prov: create(instance_id)
    Prov->>OS: Apply TenantNamespace CR
    OS->>OS: Tenant operator provisions namespace
    Prov->>OS: Poll until namespace.phase == "Active"
    OS-->>Prov: Namespace Active
    GH->>Inst: install.sh <namespace> <image-tag>
    Inst->>OS: Apply CRDs/RBAC, copy secrets, apply patched kustomize
    Inst->>OS: Patch operator-config & agent-registry, wait rollouts
    Inst-->>GH: frontend_url output
    GH->>E2E: Run Cypress tests against frontend_url
    E2E->>OS: Query routes/health endpoints
    E2E-->>GH: Upload artifacts (on failure)
    GH->>Prov: destroy(instance_id) [always]
    Prov->>OS: Delete TenantNamespace CR
    OS->>OS: Tenant operator finalizes and removes namespace
    Prov->>OS: Poll until namespace absent
    OS-->>Prov: Namespace removed

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)
  • Title check — ✅ Passed: the title accurately describes the main change, adding ephemeral PR test instances on the MPP dev cluster.
  • Description check — ✅ Passed: the description matches the changeset, summarizing the build workflow, provisioning, installation, GitHub workflows, skills, and documentation changes in the included files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.

Comment on lines +73 to +78
      - name: Provision namespace
        run: |
          bash components/pr-test/provision.sh create \
            "${{ needs.setup.outputs.instance_id }}"

  install:

Check failure

Code scanning / CodeQL

Checkout of untrusted code in a privileged context Critical

Potential execution of untrusted code on a privileged workflow (workflow_run)
Contributor

We must address this

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 15

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/ambient/SKILL.md:
- Around line 58-88: The sample secret creation commands publish fixed, shared
credentials; replace hardcoded values with placeholders and guidance to generate
unique secrets instead: update the examples that create minio-credentials,
postgresql-credentials, unleash-credentials, and github-app-secret to use
parameterized values (e.g. <ROOT_USER>, <ROOT_PASSWORD>, <DB_HOST>,
<DB_PASSWORD>, <UNLEASH_DB_URL>, <GITHUB_PRIVATE_KEY>, <GITHUB_CLIENT_SECRET>,
<GITHUB_STATE_SECRET>) and add a short note instructing users to generate
strong, unique secrets (or source them from a secret manager) before running
kustomize so no default credentials are shared across installations.
- Around line 175-189: The rollout wait step is missing gates for public-api and
ambient-api-server-db which makes Step 6 flaky; update the Step 5 "Wait for
Rollout" loop/commands in .claude/skills/ambient/SKILL.md to include oc rollout
status checks for deployment/public-api and deployment/ambient-api-server-db (in
addition to ambient-api-server) so the docs explicitly wait for public-api and
ambient-api-server-db to finish before proceeding to Step 6.
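To follow the first suggestion above, the skill doc could have readers generate unique credentials rather than copy shared defaults; a minimal sketch (the `openssl rand` approach and variable names are illustrative, not the doc's current text):

```shell
# Illustrative: generate strong, unique secrets before running kustomize.
ROOT_PASSWORD=$(openssl rand -hex 16)   # 32 hex characters
DB_PASSWORD=$(openssl rand -hex 16)

# These values would replace the hardcoded defaults, e.g.:
#   oc create secret generic minio-credentials -n ambient-code \
#     --from-literal=root-user=admin \
#     --from-literal=root-password="$ROOT_PASSWORD"
```

A secret manager would serve the same purpose; the point is that no two installations share credentials.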

In @.github/workflows/components-build-deploy.yml:
- Around line 206-217: The PR build currently only pushes a mutable tag
(pr-<N>-<arch>) which can be overwritten; update the docker/build-push-action
step (the "Build and push ${{ matrix.component.name }} (${{ matrix.arch.suffix
}}) for pull request" step) to also push an immutable tag that includes the PR
head commit SHA so downstream workflow_run jobs can pull the exact image.
Concretely, add a second tag value alongside the existing tags that uses the PR
head SHA (e.g. ${{ matrix.component.image }}:${{
github.event.pull_request.head.sha }}-${{ matrix.arch.suffix }}) while keeping
the current pr-... tag and ensure push remains true so both tags are pushed.

In @.github/workflows/pr-e2e-openshift.yml:
- Around line 57-60: Replace the unsafe checkout ref that pulls the PR head (the
actions/checkout step using ref: ${{ github.event.workflow_run.head_sha }}) so
the job does not execute untrusted PR code with secrets in scope; change the
checkout to use a safe ref like ${{ github.sha }} (or the run's immutable
commit) and apply the same change to the other checkout occurrences noted (lines
84-119, 126-147), and ensure jobs that require secrets (those invoking
components/pr-test/*.sh, npm ci, npx cypress run with TEST_OPENSHIFT_TOKEN,
QUAY_PASSWORD, ANTHROPIC_API_KEY) only run against repository-controlled code
(or switch to pull_request_target with checkout of the base commit) so
PR-modified scripts cannot exfiltrate secrets.
- Around line 27-35: The run script is inlining
github.event.workflow_run.head_branch into the shell source, exposing
secret-scoped workflow context; instead, move PR_NUMBER and HEAD_BRANCH into
environment variables and reference them inside the run block (e.g., add an env:
section that sets PR_NUMBER: ${{
github.event.workflow_run.pull_requests[0].number }} and HEAD_BRANCH: ${{
github.event.workflow_run.head_branch }} and then use $PR_NUMBER and
$HEAD_BRANCH in the script), so BRANCH_SLUG creation and sed/tr commands operate
on an env var rather than interpolated workflow text.
- Around line 37-39: The derived NAMESPACE can exceed allowed length because
INSTANCE_ID uses BRANCH_SLUG:0:45; reduce the slug truncation to fit the
remaining budget (e.g., change BRANCH_SLUG:0:45 to BRANCH_SLUG:0:41) so that
"ambient-code--" + "pr-${PR_NUMBER}-" plus the slug stays within the namespace
length limit; update the INSTANCE_ID assignment (and any related uses like
NAMESPACE/IMAGE_TAG) to use the shorter BRANCH_SLUG slice.
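The truncation finding above reduces to a simple length budget; a hedged sketch (the prefix lengths come from the findings, the clamping is illustrative):

```shell
PR_NUMBER=1032
BRANCH_SLUG="a-very-long-branch-name-that-would-otherwise-overflow-the-limit"

# Kubernetes namespace names are capped at 63 characters.
PREFIX="ambient-code--pr-${PR_NUMBER}-"
BUDGET=$((63 - ${#PREFIX}))
if [ "$BUDGET" -lt 0 ]; then BUDGET=0; fi   # clamp for pathological inputs

INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG:0:$BUDGET}"
NAMESPACE="ambient-code--${INSTANCE_ID}"
echo "${#NAMESPACE}"   # stays at or under 63
```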

In @.github/workflows/pr-namespace-cleanup.yml:
- Around line 27-28: The hardcoded slice ${BRANCH_SLUG:0:45} can still produce
>63-char Kubernetes namespaces; replace it with a dynamic truncation that
accounts for the fixed "ambient-code--" prefix (14 chars) and the variable
"pr-${PR_NUMBER}-" prefix length by using ${BRANCH_SLUG:0:$((45 -
${#PR_NUMBER}))} (45 derived from 63-14-4), and clamp the computed length to
zero or greater before constructing INSTANCE_ID; keep the rest (INSTANCE_ID
assignment and echo to GITHUB_OUTPUT) unchanged.
- Around line 17-25: The job currently interpolates untrusted HEAD_BRANCH
directly into the shell pipeline that computes BRANCH_SLUG (variables PR_NUMBER,
HEAD_BRANCH, BRANCH_SLUG and the sed/tr transliteration pipeline), so change the
script to first strictly validate and sanitize HEAD_BRANCH: reject or fail the
job if HEAD_BRANCH contains any characters outside a safe whitelist (allow only
ASCII letters, digits and separators like '.' '_' '-'), limit length (e.g. 64
chars), and then construct BRANCH_SLUG from that validated value; always
double-quote variable expansions when used in commands and avoid passing user
input to eval-like contexts or unquoted substitutions, ensure BRANCH_SLUG never
starts with a hyphen and normalize runs of non-alphanumerics to a single dash as
before but only after validation to prevent malicious injection.
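A sketch of the strict validation the last finding describes (whitelist per the finding, with '/' added since branch names commonly contain it; this is not the workflow's actual code):

```shell
HEAD_BRANCH="feature/My_Branch-1"   # untrusted input (illustrative value)

# Reject anything outside a conservative whitelist, and cap length at 64.
if ! printf '%s' "$HEAD_BRANCH" | grep -Eq '^[A-Za-z0-9._/-]{1,64}$'; then
  echo "ERROR: unsafe branch name" >&2
  exit 1
fi

# Only after validation: lowercase and collapse non-alphanumerics to dashes.
BRANCH_SLUG=$(printf '%s' "$HEAD_BRANCH" \
  | tr '[:upper:]' '[:lower:]' \
  | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//')
echo "$BRANCH_SLUG"
```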

In `@components/pr-test/install.sh`:
- Around line 102-105: The health check ignores failures and can hang because it
uses curl -sk ... || true; change the block that uses BACKEND_HOST and HEALTH to
call curl with a connect and overall timeout and fail on HTTP errors (e.g., curl
-fsS --connect-timeout 5 --max-time 10 "https://$BACKEND_HOST/health"), remove
the "|| true", capture the exit code and the output into HEALTH, and if curl
fails (non-zero exit) print the HEALTH/error and exit non‑zero so the script
fails fast; reference the BACKEND_HOST variable, HEALTH assignment, and the curl
invocation when making the change.
- Around line 35-37: The script is applying cluster-scoped CRDs/RBAC from the PR
branch (the oc apply -k "$REPO_ROOT/components/manifests/base/crds/" and oc
apply -k "$REPO_ROOT/components/manifests/base/rbac/" lines under the "Step 1"
echo), which must not run in a PR install; change install.sh to skip or abort
these two oc apply commands unless running in a trusted/mainline pipeline (guard
with an explicit env var like ALLOW_CLUSTER_SCOPED_MANIFESTS or CI_MAINLINE) or
move them to a separate pre-provision step; keep the namespace-scoped install
flow intact and ensure the script logs a clear message when cluster-scoped
manifests are skipped.
- Around line 52-59: kustomize edit set image currently only replaces images
named with :latest, leaving untagged image entries unchanged; update the
kustomize image mappings so each image name is provided twice (once with the
:latest source and once without the tag) mapping both variants to
quay.io/ambient_code/<image>:${IMAGE_TAG} (e.g., ensure vteam_frontend,
vteam_frontend:latest, vteam_backend, vteam_backend:latest, etc. are all
included) so that both tagged and untagged entries are rewritten; update the
kustomize edit set image invocation to include these duplicate mappings for
vteam_frontend, vteam_backend, vteam_operator, vteam_claude_runner,
vteam_state_sync, vteam_api_server, and vteam_public_api.
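The duplicate-mapping suggestion can be illustrated by constructing the argument list first; a dry-run sketch (image names from the finding; the array construction is illustrative):

```shell
IMAGE_TAG="pr-1032-amd64"   # illustrative PR tag
ARGS=()
for img in vteam_frontend vteam_backend vteam_operator vteam_claude_runner \
           vteam_state_sync vteam_api_server vteam_public_api; do
  # Map both the untagged and the :latest form to the PR tag.
  ARGS+=("${img}=quay.io/ambient_code/${img}:${IMAGE_TAG}")
  ARGS+=("${img}:latest=quay.io/ambient_code/${img}:${IMAGE_TAG}")
done
# Would be passed as: kustomize edit set image "${ARGS[@]}"
printf '%s\n' "${ARGS[@]}"
```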

In `@components/pr-test/provision.sh`:
- Around line 29-43: The current capacity check using ACTIVE and
MAX_S0X_INSTANCES is racy (TOCTOU) because two concurrent runs can both observe
the same ACTIVE count then both apply the TenantNamespace CR; replace the
read-then-apply pattern with an atomic reservation or cluster-side quota:
implement a create-if-not-exists reservation/lock (for example create a
namespaced Reservation CR or a ConfigMap/Lease named with the instance id and
fail the create if it already exists) and only count SUCCESSFUL reservations
toward the limit, or enforce with a Kubernetes ResourceQuota/OPA/Gatekeeper
policy so the API server enforces the MAX_S0X_INSTANCES limit; update the
provision flow around the "echo '==> Applying TenantNamespace CR: $INSTANCE_ID'"
block to create the reservation atomically before applying the TenantNamespace
and delete the reservation on rollback/cleanup.
- Around line 59-74: The loop that currently waits for Namespace.phase ==
"Active" (using NAMESPACE, READY_TIMEOUT, STATUS) must instead wait for the
TenantNamespace CR Ready condition to be True to ensure the tenant operator
finished reconciling; change the check inside the while loop to query the
TenantNamespace resource (oc get tenantnamespace "$INSTANCE_ID" -n
"$CONFIG_NAMESPACE" -o
jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null || true)
and treat "True" as success (you may still fall back to checking Namespace.phase
if desired), and preserve the existing timeout, logging, and the final oc
describe tenantnamespace "$INSTANCE_ID" invocation on failure.
- Around line 82-88: The current loop treats any failure of `oc get namespace
"$NAMESPACE"` as successful deletion; change the check inside the while loop to
capture stderr/output from `oc get namespace "$NAMESPACE" 2>&1` and only
consider the namespace deleted when that output explicitly indicates a NotFound
error (e.g. matches "NotFound" or "namespaces .* not found" or "Error from
server (NotFound)"). For non-NotFound failures (auth errors, timeouts, transport
errors) do not exit; instead log a warning and continue retrying until DEADLINE.
Update the if-condition in the while loop that references
NAMESPACE/DELETE_TIMEOUT/DEADLINE accordingly so only a confirmed NotFound
result is treated as deletion.
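The atomic-reservation pattern behind the capacity finding can be shown in miniature with `mkdir`, which, like a create-if-not-exists ConfigMap or Lease, succeeds for exactly one concurrent caller; this is a local stand-in, not the cluster-side fix itself:

```shell
LOCK_DIR="${TMPDIR:-/tmp}/ambient-pr-lock-demo.$$"

# mkdir is atomic: only the first caller creates the directory and proceeds.
if mkdir "$LOCK_DIR" 2>/dev/null; then
  echo "reservation acquired"
  # ...apply the TenantNamespace CR here...
  rmdir "$LOCK_DIR"   # release on cleanup/rollback
else
  echo "reservation held by another run"
  exit 1
fi
```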

In `@docs/internal/developer/local-development/openshift.md`:
- Around line 47-78: The docs' internal-registry steps miss mirroring the
operator-spawned images (vteam_state_sync) and updating operator-config and
ambient-agent-registry references, so runner/state-sync pods will still pull
from Quay; update the mirror loop to include vteam_state_sync and any other
operator-spawned images, ensure deploy steps also replace newName refs inside
components/manifests/overlays/production/kustomization.yaml and the
operator-config and ambient-agent-registry resources (update their image refs to
${REGISTRY_HOST}/ambient-code/*), and verify deploy.sh applies the updated
manifests before restoring kustomization.yaml so operator-created pods use the
internal registry.
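Per that finding, the mirror loop needs `vteam_state_sync` included; a dry-run sketch that only prints the commands (the registry host is a placeholder):

```shell
REGISTRY_HOST="default-route.example.com"   # placeholder; resolve via oc get route
IMAGES="vteam_frontend vteam_backend vteam_operator vteam_public_api \
vteam_claude_runner vteam_api_server vteam_state_sync"

for img in $IMAGES; do
  echo podman tag "localhost/${img}:latest" "${REGISTRY_HOST}/ambient-code/${img}:latest"
  echo podman push "${REGISTRY_HOST}/ambient-code/${img}:latest"
done
```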

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 025dd6a6-1ae1-437d-903d-cfb4facdffed

📥 Commits

Reviewing files that changed from the base of the PR and between 797d7bf and 82b024e.

📒 Files selected for processing (9)
  • .claude/skills/ambient-pr-test/SKILL.md
  • .claude/skills/ambient/SKILL.md
  • .github/workflows/components-build-deploy.yml
  • .github/workflows/pr-e2e-openshift.yml
  • .github/workflows/pr-namespace-cleanup.yml
  • components/pr-test/SPEC.md
  • components/pr-test/install.sh
  • components/pr-test/provision.sh
  • docs/internal/developer/local-development/openshift.md

Comment on lines +206 to +217
      - name: Build and push ${{ matrix.component.name }} (${{ matrix.arch.suffix }}) for pull request
        if: github.event_name == 'pull_request'
        uses: docker/build-push-action@v6
        with:
          context: ${{ matrix.component.context }}
          file: ${{ matrix.component.dockerfile }}
          platforms: ${{ matrix.arch.platform }}
-          push: false
+          push: true
          tags: ${{ matrix.component.image }}:pr-${{ github.event.pull_request.number }}-${{ matrix.arch.suffix }}
          build-args: AMBIENT_VERSION=${{ github.sha }}
          cache-from: type=gha,scope=${{ matrix.component.name }}-${{ matrix.arch.suffix }}
          cache-to: type=gha,mode=max,scope=${{ matrix.component.name }}-${{ matrix.arch.suffix }}

⚠️ Potential issue | 🟠 Major

Publish an immutable PR image tag alongside the mutable pr-<N>-<arch> alias.

This step now pushes a tag that is overwritten on every PR commit, while the downstream workflow_run job checks out a specific head_sha. If two commits overlap, an older E2E run can deploy newer images and report the wrong result.

Suggested change
-          tags: ${{ matrix.component.image }}:pr-${{ github.event.pull_request.number }}-${{ matrix.arch.suffix }}
+          tags: |
+            ${{ matrix.component.image }}:pr-${{ github.event.pull_request.number }}-${{ matrix.arch.suffix }}
+            ${{ matrix.component.image }}:pr-${{ github.event.pull_request.number }}-${{ github.sha }}-${{ matrix.arch.suffix }}

Contributor

I agree with this

Comment on lines +102 to +105
if [[ -n "$BACKEND_HOST" ]]; then
  HEALTH=$(curl -sk "https://${BACKEND_HOST}/health" || true)
  echo "    Backend health: $HEALTH"
fi

⚠️ Potential issue | 🟠 Major

Health verification is non-blocking and unbounded; make it fail-fast with timeouts.

Line 103 uses curl -sk ... || true, so backend failures are ignored and the script can still report success. It also lacks explicit timeout bounds, which can stall CI.

Suggested fix
 if [[ -n "$BACKEND_HOST" ]]; then
-  HEALTH=$(curl -sk "https://${BACKEND_HOST}/health" || true)
+  HEALTH=$(curl -fsS --connect-timeout 5 --max-time 20 \
+    --retry 3 --retry-all-errors "https://${BACKEND_HOST}/health")
   echo "    Backend health: $HEALTH"
+else
+  echo "    ERROR: backend-route host not found"
+  exit 1
 fi

Comment on lines +29 to +43
echo "==> Checking S0.x instance capacity..."
ACTIVE=$(oc get tenantnamespace -n "$CONFIG_NAMESPACE" \
  -l ambient-code/instance-type=s0x --no-headers 2>/dev/null | wc -l | tr -d ' ')

if [ "$ACTIVE" -ge "$MAX_S0X_INSTANCES" ]; then
  echo "ERROR: At capacity — $ACTIVE/$MAX_S0X_INSTANCES S0.x instances active."
  echo "Active instances:"
  oc get tenantnamespace -n "$CONFIG_NAMESPACE" \
    -l ambient-code/instance-type=s0x -o name
  exit 1
fi
echo "    Capacity OK: $ACTIVE/$MAX_S0X_INSTANCES"

echo "==> Applying TenantNamespace CR: $INSTANCE_ID"
cat <<EOF | oc apply -f -

⚠️ Potential issue | 🟠 Major

The capacity gate is racy across concurrent provisions.

This is a classic TOCTOU window: two runs can both read the same active count and then both oc apply, exceeding the configured limit of five S0.x instances. The cap needs an atomic reservation step or a cluster-side lock/quota.


Comment on lines +59 to +74
echo "==> Waiting for namespace ${NAMESPACE} to become Active (timeout: ${READY_TIMEOUT}s)..."
DEADLINE=$((SECONDS + READY_TIMEOUT))
while [ $SECONDS -lt $DEADLINE ]; do
  STATUS=$(oc get namespace "$NAMESPACE" -o jsonpath='{.status.phase}' 2>/dev/null || true)
  if [ "$STATUS" == "Active" ]; then
    echo "    Namespace ${NAMESPACE} is Active."
    echo "$NAMESPACE"
    exit 0
  fi
  echo "    status=${STATUS:-NotFound}, retrying..."
  sleep 3
done

echo "ERROR: Namespace ${NAMESPACE} did not become Active within ${READY_TIMEOUT}s."
oc describe tenantnamespace "$INSTANCE_ID" -n "$CONFIG_NAMESPACE" || true
exit 1

⚠️ Potential issue | 🟠 Major

Wait for the TenantNamespace Ready condition, not only Namespace.phase=Active.

Active only proves the namespace object exists. The tenant operator can still be reconciling quotas/RBAC/finalizers, so the install step can race immediately after this script returns.

Suggested change
   echo "==> Waiting for namespace ${NAMESPACE} to become Active (timeout: ${READY_TIMEOUT}s)..."
   DEADLINE=$((SECONDS + READY_TIMEOUT))
   while [ $SECONDS -lt $DEADLINE ]; do
-    STATUS=$(oc get namespace "$NAMESPACE" -o jsonpath='{.status.phase}' 2>/dev/null || true)
-    if [ "$STATUS" == "Active" ]; then
+    READY=$(oc get tenantnamespace "$INSTANCE_ID" -n "$CONFIG_NAMESPACE" \
+      -o jsonpath='{range .status.conditions[?(@.type=="Ready")]}{.status}{end}' 2>/dev/null || true)
+    STATUS=$(oc get namespace "$NAMESPACE" -o jsonpath='{.status.phase}' 2>/dev/null || true)
+    if [ "$READY" == "True" ] && [ "$STATUS" == "Active" ]; then
       echo "    Namespace ${NAMESPACE} is Active."
       echo "$NAMESPACE"
       exit 0
     fi
-    echo "    status=${STATUS:-NotFound}, retrying..."
+    echo "    ready=${READY:-Unknown} status=${STATUS:-NotFound}, retrying..."
     sleep 3
   done

Comment on lines +82 to +88
echo "==> Waiting for namespace ${NAMESPACE} to be deleted (timeout: ${DELETE_TIMEOUT}s)..."
DEADLINE=$((SECONDS + DELETE_TIMEOUT))
while [ $SECONDS -lt $DEADLINE ]; do
  if ! oc get namespace "$NAMESPACE" &>/dev/null; then
    echo "    Namespace ${NAMESPACE} deleted."
    exit 0
  fi

⚠️ Potential issue | 🟠 Major

Do not treat every oc get namespace failure as successful deletion.

if ! oc get namespace ... returns true for NotFound, auth failures, API timeouts, and transport errors alike. A transient cluster error will be reported as "deleted" and leak the namespace.

Suggested change
   echo "==> Waiting for namespace ${NAMESPACE} to be deleted (timeout: ${DELETE_TIMEOUT}s)..."
   DEADLINE=$((SECONDS + DELETE_TIMEOUT))
   while [ $SECONDS -lt $DEADLINE ]; do
-    if ! oc get namespace "$NAMESPACE" &>/dev/null; then
+    OUTPUT=$(oc get namespace "$NAMESPACE" -o name 2>&1) && RC=0 || RC=$?
+    if [ "$RC" -eq 0 ]; then
+      echo "    Namespace still exists, waiting..."
+    elif grep -q "(NotFound)" <<<"$OUTPUT"; then
       echo "    Namespace ${NAMESPACE} deleted."
       exit 0
+    else
+      echo "    oc get namespace failed: $OUTPUT"
     fi
-    echo "    Namespace still exists, waiting..."
     sleep 5
   done

Comment on lines 47 to 78
-```bash
-oc new-project ambient-code
-```
-
-**MinIO credentials:**
-
-```bash
-oc create secret generic minio-credentials -n ambient-code \
-  --from-literal=root-user=admin \
-  --from-literal=root-password=changeme123
-```
-
-**PostgreSQL credentials (for Unleash feature flag database):**
-
-```bash
-oc create secret generic postgresql-credentials -n ambient-code \
-  --from-literal=db.host="postgresql" \
-  --from-literal=db.port="5432" \
-  --from-literal=db.name="postgres" \
-  --from-literal=db.user="postgres" \
-  --from-literal=db.password="postgres123"
-```
-
-**Unleash credentials (for feature flag service):**
-
-```bash
-oc create secret generic unleash-credentials -n ambient-code \
-  --from-literal=database-url="postgres://postgres:postgres123@postgresql:5432/unleash" \
-  --from-literal=database-ssl="false" \
-  --from-literal=admin-api-token="*:*.unleash-admin-token" \
-  --from-literal=client-api-token="default:development.unleash-client-token" \
-  --from-literal=frontend-api-token="default:development.unleash-frontend-token" \
-  --from-literal=default-admin-password="unleash123"
-```
-
-## Platform Deployment
-
-**Patch kustomization to internal registry, deploy, then revert:**
-
-**⚠️ CRITICAL**: Never commit `kustomization.yaml` while it contains internal registry refs.
+```bash
+REGISTRY_HOST=$(oc get route default-route -n openshift-image-registry \
+  --template='{{ .spec.host }}')
+INTERNAL_REG="image-registry.openshift-image-registry.svc:5000/ambient-code"
+
+for img in vteam_frontend vteam_backend vteam_operator vteam_public_api vteam_claude_runner vteam_api_server; do
+  podman tag localhost/${img}:latest ${REGISTRY_HOST}/ambient-code/${img}:latest
+  podman push ${REGISTRY_HOST}/ambient-code/${img}:latest
+done
+
+oc rollout restart deployment backend-api frontend agentic-operator public-api ambient-api-server -n ambient-code
+```
+
+---
+
+## Deploy with Internal Registry Images
+
+The production kustomization in `components/manifests/overlays/production/kustomization.yaml` references `quay.io/ambient_code/*` images by default. When deploying to an OpenShift cluster using the internal registry, you must temporarily point the image refs at the internal registry, deploy, then **immediately revert** before committing.
+
+**⚠️ CRITICAL**: Never commit `kustomization.yaml` with internal registry refs.
```bash
REGISTRY_HOST=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}')
INTERNAL_REG="image-registry.openshift-image-registry.svc:5000/ambient-code"
REGISTRY_HOST=$(oc get route default-route -n openshift-image-registry \
--template='{{ .spec.host }}')

# Temporarily override image refs to internal registry
cd components/manifests/overlays/production
sed -i "s#newName: quay.io/ambient_code/#newName: ${INTERNAL_REG}/#g" kustomization.yaml
sed -i "s#newName: quay.io/ambient_code/#newName: ${REGISTRY_HOST}/ambient-code/#g" kustomization.yaml

# Deploy
cd ../..
./deploy.sh

# IMMEDIATELY revert — do not commit with internal registry refs
cd overlays/production
git checkout kustomization.yaml
```
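An alternative to remembering the manual revert is a shell `trap` that restores the file on any exit. A minimal sketch, demonstrated on a temp file so it runs anywhere; in the real flow the file would be `kustomization.yaml` and the failing `false` would be `./deploy.sh`:

```bash
FILE=$(mktemp)
echo "newName: quay.io/ambient_code/vteam_frontend" > "$FILE"

(
  cp "$FILE" "${FILE}.orig"
  trap 'mv "${FILE}.orig" "$FILE"' EXIT   # restore on success, error, or Ctrl-C
  sed -i "s#quay.io/ambient_code/#registry.example/ambient-code/#g" "$FILE"
  false   # stand-in for a deploy step that fails partway
) || true

cat "$FILE"   # original content, restored by the trap
```

Because the trap fires on every exit path, the patched refs can never survive past the deploy step.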
Contributor

⚠️ Potential issue | 🟠 Major

This internal-registry flow still leaves the operator-spawned images on the old registry.

Only six images are mirrored here, and vteam_state_sync is missing. Rewriting kustomization.yaml is also not enough for runner pods, because operator-config and ambient-agent-registry still need to point at the internal-registry refs. As written, runner/state-sync changes will continue pulling stale Quay images or fail outright.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/internal/developer/local-development/openshift.md` around lines 47 - 78,
The docs' internal-registry steps miss mirroring the operator-spawned images
(vteam_state_sync) and updating operator-config and ambient-agent-registry
references, so runner/state-sync pods will still pull from Quay; update the
mirror loop to include vteam_state_sync and any other operator-spawned images,
ensure deploy steps also replace newName refs inside
components/manifests/overlays/production/kustomization.yaml and the
operator-config and ambient-agent-registry resources (update their image refs to
${REGISTRY_HOST}/ambient-code/*), and verify deploy.sh applies the updated
manifests before restoring kustomization.yaml so operator-created pods use the
internal registry.

@ktdreyer
Contributor

TODO: Let's identify if/where this runs untrusted code and stamp that out. GH gives us some clues in the PR review here. I also filed #1034 to add another tool in the box.


```yaml
- name: Install kustomize
  run: |
    curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
```

We install kustomize in a few places (eg #768). We should stop doing that and use oc kustomize instead.
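A sketch of that replacement: `oc` bundles kustomize (inherited from kubectl), so the overlay can be rendered and applied without a `curl | bash` install. The overlay path here assumes the layout referenced elsewhere in this PR:

```bash
oc kustomize components/manifests/overlays/production | oc apply -f -
```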

```yaml
run: |
  oc login ${{ secrets.TEST_OPENSHIFT_SERVER }} \
    --token=${{ secrets.TEST_OPENSHIFT_TOKEN }} \
    --insecure-skip-tls-verify
```

I realize we're copying --insecure-skip-tls-verify from other GH Actions, but do we need it? I thought ROSA obviates this.

Contributor Author


good catch. this is from testing and should not be in any full deployment

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (1)
.github/workflows/pr-namespace-cleanup.yml (1)

7-13: ⚠️ Potential issue | 🟠 Major

Add explicit permissions block and checkout main to use trusted scripts.

The workflow lacks a permissions block, which CodeQL has flagged. Additionally, the checkout step should explicitly use ref: main to ensure only trusted repository scripts are executed when running the destroy command with cluster credentials.

Suggested fix
 jobs:
   cleanup:
     runs-on: ubuntu-latest
     if: github.event.pull_request.head.repo.full_name == github.repository
+    permissions:
+      contents: read
     steps:
       - name: Checkout code
         uses: actions/checkout@v6
+        with:
+          ref: main
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/pr-namespace-cleanup.yml around lines 7 - 13, The workflow
job "cleanup" is missing an explicit permissions block and the checkout step
should pin to the trusted branch; add a permissions block under the cleanup job
(for example: permissions: contents: read) and update the Checkout code step
that uses actions/checkout@v6 to include ref: main so the job runs with explicit
repository read permissions and checks out the main branch before executing
destroy/cluster credentials operations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/pr-e2e-openshift.yml:
- Around line 8-10: The concurrency group uses
github.event.workflow_run.pull_requests[0].number which can be null and will
break group evaluation; update the concurrency key to use a safe
conditional/fallback (e.g., use github.event.workflow_run.pull_requests &&
github.event.workflow_run.pull_requests[0] ?
github.event.workflow_run.pull_requests[0].number : github.run_id) so the group
always resolves; modify the line referencing pull_requests[0].number in the
concurrency block to this conditional expression (or an equivalent fallback like
github.run_id or github.sha) to avoid null access before jobs run.
- Around line 158-183: The teardown job's Destroy namespace step uses
needs.setup.outputs.instance_id unguarded, causing the destroy script
(components/pr-test/provision.sh destroy) to receive an empty INSTANCE_ID when
setup failed; update the teardown (or the Destroy namespace step) to skip
running when that output is empty—either by adding an if condition on the step
(e.g., check needs.setup.outputs.instance_id != '') or by guarding the run
command to no-op when INSTANCE_ID is empty before calling the destroy script;
reference the teardown job, the Destroy namespace step, the env VARIABLE
INSTANCE_ID and the components/pr-test/provision.sh destroy invocation when
making the change.
- Around line 37-43: The SLUG_MAX calculation can become zero or negative for
very large PR numbers which makes BRANCH_SLUG empty and leaves a trailing dash
in INSTANCE_ID; update the workflow to clamp SLUG_MAX to a minimum positive
value (e.g., SLUG_MAX=$(( SLUG_MAX < 1 ? 1 : SLUG_MAX ))) or otherwise guard it
the same way as the cleanup workflow, ensure BRANCH_SLUG uses that clamped
value, and build INSTANCE_ID without a trailing dash (e.g., only append
"-${BRANCH_SLUG}" when BRANCH_SLUG is non-empty) so symbols SLUG_MAX, PR_LEN,
BRANCH_SLUG and INSTANCE_ID are adjusted accordingly.

In @.github/workflows/pr-namespace-cleanup.yml:
- Around line 24-28: Clamp SLUG_MAX to a non-negative value after computing it
(ensure SLUG_MAX=$(( SLUG_MAX < 0 ? 0 : SLUG_MAX ))) and then derive
BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"; after that, trim any trailing dashes
from BRANCH_SLUG and guard INSTANCE_ID construction so it doesn't produce a
trailing dash when BRANCH_SLUG is empty (e.g., build INSTANCE_ID by
conditionally appending "-${BRANCH_SLUG}" only if BRANCH_SLUG is non-empty,
using the existing symbols PR_LEN, SLUG_MAX, SAFE_BRANCH, BRANCH_SLUG,
PR_NUMBER, and INSTANCE_ID).

---

Duplicate comments:
In @.github/workflows/pr-namespace-cleanup.yml:
- Around line 7-13: The workflow job "cleanup" is missing an explicit
permissions block and the checkout step should pin to the trusted branch; add a
permissions block under the cleanup job (for example: permissions: contents:
read) and update the Checkout code step that uses actions/checkout@v6 to include
ref: main so the job runs with explicit repository read permissions and checks
out the main branch before executing destroy/cluster credentials operations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5ab0169e-0a5e-44f9-92e8-0fed82edc418

📥 Commits

Reviewing files that changed from the base of the PR and between 82b024e and b7a58c5.

📒 Files selected for processing (2)
  • .github/workflows/pr-e2e-openshift.yml
  • .github/workflows/pr-namespace-cleanup.yml

Comment on lines +8 to +10
```yaml
concurrency:
  group: pr-e2e-openshift-${{ github.event.workflow_run.pull_requests[0].number }}
  cancel-in-progress: true
```

⚠️ Potential issue | 🟡 Minor

Concurrency group may fail when pull_requests[0] is null.

The concurrency key references pull_requests[0].number, but if the workflow_run isn't associated with a PR (e.g., push to main), this could produce an invalid or colliding group name. The setup job guards against this, but the concurrency block is evaluated before job conditions.

Suggested fix
 concurrency:
-  group: pr-e2e-openshift-${{ github.event.workflow_run.pull_requests[0].number }}
+  group: pr-e2e-openshift-${{ github.event.workflow_run.pull_requests[0].number || github.event.workflow_run.id }}
   cancel-in-progress: true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/pr-e2e-openshift.yml around lines 8 - 10, The concurrency
group uses github.event.workflow_run.pull_requests[0].number which can be null
and will break group evaluation; update the concurrency key to use a safe
conditional/fallback (e.g., use github.event.workflow_run.pull_requests &&
github.event.workflow_run.pull_requests[0] ?
github.event.workflow_run.pull_requests[0].number : github.run_id) so the group
always resolves; modify the line referencing pull_requests[0].number in the
concurrency block to this conditional expression (or an equivalent fallback like
github.run_id or github.sha) to avoid null access before jobs run.

Comment on lines +37 to +43
```bash
# Budget: "ambient-code--" (14) + "pr-${PR_NUMBER}-" (4+digits) + slug
# Max namespace = 63. Slug budget = 63 - 14 - 4 - ${#PR_NUMBER}
PR_LEN=${#PR_NUMBER}
SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"

INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
```

⚠️ Potential issue | 🟡 Minor

Apply the same edge-case guard for SLUG_MAX as in the cleanup workflow.

Same issue as pr-namespace-cleanup.yml: SLUG_MAX could be zero/negative for very large PR numbers, and an empty BRANCH_SLUG would leave a trailing dash in INSTANCE_ID.

Suggested fix
           PR_LEN=${#PR_NUMBER}
           SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
+          [[ $SLUG_MAX -lt 0 ]] && SLUG_MAX=0
           BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"

-          INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
+          if [[ -n "$BRANCH_SLUG" ]]; then
+            INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
+          else
+            INSTANCE_ID="pr-${PR_NUMBER}"
+          fi
           NAMESPACE="ambient-code--${INSTANCE_ID}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/pr-e2e-openshift.yml around lines 37 - 43, The SLUG_MAX
calculation can become zero or negative for very large PR numbers which makes
BRANCH_SLUG empty and leaves a trailing dash in INSTANCE_ID; update the workflow
to clamp SLUG_MAX to a minimum positive value (e.g., SLUG_MAX=$(( SLUG_MAX < 1 ?
1 : SLUG_MAX ))) or otherwise guard it the same way as the cleanup workflow,
ensure BRANCH_SLUG uses that clamped value, and build INSTANCE_ID without a
trailing dash (e.g., only append "-${BRANCH_SLUG}" when BRANCH_SLUG is
non-empty) so symbols SLUG_MAX, PR_LEN, BRANCH_SLUG and INSTANCE_ID are adjusted
accordingly.
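The clamp-and-trim logic from both suggestions can be exercised locally; `build_instance_id` is an illustrative wrapper, not a symbol in the workflow:

```bash
build_instance_id() {
  PR_NUMBER="$1"; SAFE_BRANCH="$2"
  PR_LEN=${#PR_NUMBER}
  # Budget: "ambient-code--" (14) + "pr-" plus trailing "-" (4) + PR digits.
  SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
  [ "$SLUG_MAX" -lt 0 ] && SLUG_MAX=0
  BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"
  # Trim any trailing dashes left by truncation.
  while [ "${BRANCH_SLUG%-}" != "$BRANCH_SLUG" ]; do
    BRANCH_SLUG="${BRANCH_SLUG%-}"
  done
  if [ -n "$BRANCH_SLUG" ]; then
    echo "pr-${PR_NUMBER}-${BRANCH_SLUG}"
  else
    echo "pr-${PR_NUMBER}"
  fi
}

build_instance_id 1032 test-my-pr   # → pr-1032-test-my-pr
```

The empty-slug branch guarantees no trailing dash ever reaches the namespace name.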

Comment on lines +158 to +183
```yaml
teardown:
  needs: [setup, provision, e2e]
  if: always()
  runs-on: ubuntu-latest
  permissions:
    contents: read
  steps:
    - name: Checkout main (trusted scripts only)
      uses: actions/checkout@v6
      with:
        ref: main

    - name: Install oc
      uses: redhat-actions/oc-installer@v1
      with:
        oc_version: 'latest'

    - name: Log in to OpenShift
      run: |
        oc login "${{ secrets.TEST_OPENSHIFT_SERVER }}" \
          --token="${{ secrets.TEST_OPENSHIFT_TOKEN }}"

    - name: Destroy namespace
      env:
        INSTANCE_ID: ${{ needs.setup.outputs.instance_id }}
      run: bash components/pr-test/provision.sh destroy "$INSTANCE_ID"
```

⚠️ Potential issue | 🟡 Minor

Guard teardown against missing instance_id when upstream jobs fail.

With if: always(), the teardown runs even if setup fails or is skipped, in which case needs.setup.outputs.instance_id will be empty. The destroy script may behave unexpectedly with an empty argument.

Suggested fix
   teardown:
     needs: [setup, provision, e2e]
-    if: always()
+    if: always() && needs.setup.outputs.instance_id != ''
     runs-on: ubuntu-latest

Alternatively, add a guard in the destroy step:

       - name: Destroy namespace
         env:
           INSTANCE_ID: ${{ needs.setup.outputs.instance_id }}
-        run: bash components/pr-test/provision.sh destroy "$INSTANCE_ID"
+        run: |
+          if [[ -n "$INSTANCE_ID" ]]; then
+            bash components/pr-test/provision.sh destroy "$INSTANCE_ID"
+          else
+            echo "No instance_id available; skipping teardown"
+          fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/pr-e2e-openshift.yml around lines 158 - 183, The teardown
job's Destroy namespace step uses needs.setup.outputs.instance_id unguarded,
causing the destroy script (components/pr-test/provision.sh destroy) to receive
an empty INSTANCE_ID when setup failed; update the teardown (or the Destroy
namespace step) to skip running when that output is empty—either by adding an if
condition on the step (e.g., check needs.setup.outputs.instance_id != '') or by
guarding the run command to no-op when INSTANCE_ID is empty before calling the
destroy script; reference the teardown job, the Destroy namespace step, the env
VARIABLE INSTANCE_ID and the components/pr-test/provision.sh destroy invocation
when making the change.

Comment on lines +24 to +28
```bash
PR_LEN=${#PR_NUMBER}
SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"

INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
```

⚠️ Potential issue | 🟡 Minor

Guard against edge cases: negative SLUG_MAX or trailing dash.

For very high PR numbers, SLUG_MAX could become zero or negative, and if BRANCH_SLUG ends up empty, INSTANCE_ID would have a trailing dash (pr-123-). Consider clamping and trimming:

Suggested fix
           PR_LEN=${#PR_NUMBER}
           SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
+          [[ $SLUG_MAX -lt 0 ]] && SLUG_MAX=0
           BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"

-          INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
+          if [[ -n "$BRANCH_SLUG" ]]; then
+            INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
+          else
+            INSTANCE_ID="pr-${PR_NUMBER}"
+          fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/pr-namespace-cleanup.yml around lines 24 - 28, Clamp
SLUG_MAX to a non-negative value after computing it (ensure SLUG_MAX=$((
SLUG_MAX < 0 ? 0 : SLUG_MAX ))) and then derive
BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"; after that, trim any trailing dashes
from BRANCH_SLUG and guard INSTANCE_ID construction so it doesn't produce a
trailing dash when BRANCH_SLUG is empty (e.g., build INSTANCE_ID by
conditionally appending "-${BRANCH_SLUG}" only if BRANCH_SLUG is non-empty,
using the existing symbols PR_LEN, SLUG_MAX, SAFE_BRANCH, BRANCH_SLUG,
PR_NUMBER, and INSTANCE_ID).

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

♻️ Duplicate comments (2)
.claude/skills/ambient/SKILL.md (2)

58-88: ⚠️ Potential issue | 🟠 Major

Do not publish shared default credentials in a skill covering production installs.

This issue was previously flagged and remains unresolved. The commands hardcode predictable values for MinIO (changeme123), PostgreSQL (postgres123), Unleash (fixed admin tokens), and GitHub state secret (test-state-secret). Since this skill explicitly covers "production, ephemeral PR test instances, or developer clusters" (lines 5–6), following it verbatim would give every installation the same credentials.

Replace hardcoded values with placeholders (e.g., <ROOT_PASSWORD>, <DB_PASSWORD>, <ADMIN_TOKEN>) and add a note instructing users to generate strong, unique secrets (or source them from a secret manager) before running the commands.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/ambient/SKILL.md around lines 58 - 88, Replace all hardcoded
secret values in the kustomize secret creation examples (the secret names
minio-credentials, postgresql-credentials, unleash-credentials,
github-app-secret) with clearly labelled placeholders (e.g., <ROOT_PASSWORD>,
<DB_PASSWORD>, <ADMIN_TOKEN>, <GITHUB_STATE_SECRET>) and update the text to
instruct users to generate strong, unique secrets or pull them from a secret
manager before applying the overlay; ensure the examples do not contain any
fixed/default credentials and add a single-line note above the commands
explaining to replace placeholders with secure values or reference a secret
store.

203-211: ⚠️ Potential issue | 🟠 Major

Wait for public-api and ambient-api-server-db before Step 6.

This issue was previously flagged and remains unresolved. Step 6 immediately uses public-api-route (line 281) and execs into ambient-api-server-db (line 258), but the rollout gate never waits for those deployments. On a slow cluster the documented verification flow is flaky.

Add public-api to the loop and add an explicit wait for ambient-api-server-db:

🛡️ Proposed fix

```diff
-for deploy in backend-api frontend agentic-operator postgresql minio unleash; do
+for deploy in backend-api frontend agentic-operator postgresql minio unleash public-api; do
   oc rollout status deployment/$deploy -n $NAMESPACE --timeout=300s
 done
```

ambient-api-server and ambient-api-server-db may take longer due to DB init:

```diff
+oc rollout status deployment/ambient-api-server-db -n $NAMESPACE --timeout=300s
 oc rollout status deployment/ambient-api-server -n $NAMESPACE --timeout=300s
```
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@.claude/skills/ambient/SKILL.md` around lines 203 - 211, The deployment wait
loop currently iterates "backend-api frontend agentic-operator postgresql minio
unleash" but omits the public API and the DB init; update the loop that declares
deploys (the line with for deploy in ...) to include public-api, and add an
explicit rollout wait for ambient-api-server-db (before the existing
ambient-api-server wait) so Step 6's usage of public-api-route and the
subsequent exec into ambient-api-server-db will block until those deployments
are ready; target the symbolic names public-api and ambient-api-server-db when
adding the waits.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/ambient/SKILL.md:

  • Line 292: The command oc policy add-role-to-group system:image-puller system:serviceaccounts --namespace=$NAMESPACE is too permissive; change it to
    grant image-pull only to the specific runner namespace or service account(s)
    instead of the cluster-wide system:serviceaccounts group — e.g., replace that
    call with a namespace-scoped binding that targets the runner service account
    (oc policy add-role-to-user system:image-puller system:serviceaccount:<runner-namespace>:<sa-name> --namespace=$NAMESPACE) or
    implement a RoleBinding per target namespace for the system:image-puller role
    so only intended namespaces/users can pull from $NAMESPACE.
  • Line 103: The placeholder path /path/to/service-account-key.json should be
    replaced with a real, downloaded service account key file: instruct the user to
    create or select a service account in GCP Console (IAM & Admin > Service
    Accounts), click "Keys" → "Add Key" → "Create new key" and download the JSON,
    save it to a secure local path (e.g., ./keys/ambient-service-account.json), and
    then update the --from-file flag to point to that real file name instead of the
    placeholder; include a note to keep the JSON secure and to follow the project's
    secrets/storage guidelines.
  • Around line 318-321: Update the "JWT Configuration" section to include a
    concrete example for ephemeral/non‑prod namespaces: show how to patch the
    ambient-api-server container args (referencing
    ambient-api-server-jwt-args-patch.yaml) to either disable validation (e.g.,
    override args to include a --disable-jwt-validation flag or remove the
    --jwk-cert-url/--issuer args) or point to a test issuer by setting
    --jwk-cert-url to a known test JWKS URL; include the exact arg keys to change
    (--jwk-cert-url, --issuer, --disable-jwt-validation), a short example
    description of patching the Deployment/DaemonSet container args for the
    ambient-api-server, and note to revert for production.
  • Line 238: Update the sentence that says the frontend shows "2/2" to clarify
    that this only occurs when the oauth-proxy sidecar is present (e.g., when using
    the production overlay); mention that installs via install.sh or custom overlays
    without oauth-proxy will show "1/1" for the frontend pod. Reference the terms
    "frontend" pod, "production overlay", "oauth-proxy sidecar", and "install.sh" in
    the rewording so readers know the difference in expected readiness counts
    depending on whether oauth-proxy is configured.

In @components/pr-test/README.md:

  • Line 187: Update the README to document the security implications of changing
    push: false → push: true in components-build-deploy.yml's PR build step: explain
    that enabling image push causes PR builds to push ephemeral images tagged
    pr-- (overwritten per commit), call out that the Quay registry
    namespace must have strict access controls and review who can push/read images,
    advise evaluating whether untrusted contributors should be allowed to push to
    the shared registry, and add a cross-reference to the security model/trust
    boundaries section for further guidance.
  • Around line 1-246: Add a new "Security Model" section to the specification
    that states where untrusted PR code runs and how it's isolated: describe
    isolation boundaries for each ephemeral instance (namespace
    ambient-code--) including network policies and the "internal"
    security-zone, resource quotas and limits (referencing MAX_S0X_INSTANCES,
    READY_TIMEOUT, DELETE_TIMEOUT), RBAC restrictions and that the
    ambient-code/instance-type=s0x label is used for counting only, secrets handling
    (namespace-scoped secrets, limited secret injection, cleanup on teardown),
    mitigation controls for malicious builds (GitHub Actions runner isolation, push:
    true implications for PR image builds, e.g. push only from trusted runners,
    build-time sandboxing), and explicit notes about what components run untrusted
    code vs trusted platform code (provisioner.create/destroy, provision.sh,
    installer scripts). Ensure the section references TenantNamespace CR
    creation/deletion and the provisioner workflow so reviewers can see where to
    apply network policies and RBAC to prevent cross-namespace access.
  • Line 220: Update the file list reference that still points to "SPEC.md" so it
    matches the renamed file by replacing the "SPEC.md" entry with "README.md"
    (search for the string "SPEC.md" in components/pr-test/README.md and update it
    to "README.md"); ensure any surrounding text or links that referenced SPEC.md
    are adjusted to point to README.md as well.

Duplicate comments:
In @.claude/skills/ambient/SKILL.md:

  • Around line 58-88: Replace all hardcoded secret values in the kustomize secret
    creation examples (the secret names minio-credentials, postgresql-credentials,
    unleash-credentials, github-app-secret) with clearly labelled placeholders
    (e.g., <ROOT_PASSWORD>, <DB_PASSWORD>, <ADMIN_TOKEN>, <GITHUB_STATE_SECRET>) and
    update the text to instruct users to generate strong, unique secrets or pull
    them from a secret manager before applying the overlay; ensure the examples do
    not contain any fixed/default credentials and add a single-line note above the
    commands explaining to replace placeholders with secure values or reference a
    secret store.
  • Around line 203-211: The deployment wait loop currently iterates "backend-api
    frontend agentic-operator postgresql minio unleash" but omits the public API and
    the DB init; update the loop that declares deploys (the line with for deploy in
    ...) to include public-api, and add an explicit rollout wait for
    ambient-api-server-db (before the existing ambient-api-server wait) so Step 6's
    usage of public-api-route and the subsequent exec into ambient-api-server-db
    will block until those deployments are ready; target the symbolic names
    public-api and ambient-api-server-db when adding the waits.


---

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 86f2be1c-2bd0-4140-8dda-16106d5abda3

📥 Commits

Reviewing files that changed from the base of the PR and between b7a58c5cd33742cc19586c0bef4900ed2b361a5e and 2eb2465302c9a4a6c86475f453f1544940030396.

📒 Files selected for processing (2)
  • .claude/skills/ambient/SKILL.md
  • components/pr-test/README.md



```bash
oc create secret generic ambient-vertex -n $NAMESPACE \
  --from-file=ambient-code-key.json=/path/to/service-account-key.json
```

⚠️ Potential issue | 🟡 Minor

Replace placeholder path with actionable guidance.

The placeholder /path/to/service-account-key.json is not actionable. Provide explicit instructions on how to obtain the service account key file (e.g., "Download from GCP Console > IAM & Admin > Service Accounts") or reference documentation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/ambient/SKILL.md at line 103, The placeholder path
`/path/to/service-account-key.json` should be replaced with a real, downloaded
service account key file: instruct the user to create or select a service
account in GCP Console (IAM & Admin > Service Accounts), click "Keys" → "Add
Key" → "Create new key" and download the JSON, save it to a secure local path
(e.g., ./keys/ambient-service-account.json), and then update the --from-file
flag to point to that real file name instead of the placeholder; include a note
to keep the JSON secure and to follow the project's secrets/storage guidelines.

```
unleash-xxxxx 1/1 Running 0
```

Frontend shows `2/2` because of the oauth-proxy sidecar in the production overlay.

🧹 Nitpick | 🔵 Trivial

Clarify oauth-proxy sidecar context.

The note assumes the production overlay is always used, but this skill covers "production, ephemeral PR test instances, or developer clusters" (lines 5–6). If users deploy via the scripted installer (install.sh) or a custom overlay without oauth-proxy, the frontend pod would show 1/1 instead of 2/2, potentially confusing readers.

Rephrase to indicate this applies when using the production overlay or when oauth-proxy is configured.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/ambient/SKILL.md at line 238, Update the sentence that says
the frontend shows "2/2" to clarify that this only occurs when the oauth-proxy
sidecar is present (e.g., when using the production overlay); mention that
installs via install.sh or custom overlays without oauth-proxy will show "1/1"
for the frontend pod. Reference the terms "frontend" pod, "production overlay",
"oauth-proxy sidecar", and "install.sh" in the rewording so readers know the
difference in expected readiness counts depending on whether oauth-proxy is
configured.

The operator creates runner pods in dynamically-created project namespaces. Those pods pull images from quay.io directly — no cross-namespace image access issue with quay. However, if you're using the OpenShift internal registry, grant pull access:

```bash
oc policy add-role-to-group system:image-puller system:serviceaccounts --namespace=$NAMESPACE
```

⚠️ Potential issue | 🟠 Major

Overly permissive cross-namespace image pull policy.

The command grants system:image-puller to the cluster-wide system:serviceaccounts group, allowing any pod in any namespace to pull images from $NAMESPACE. On a shared cluster, this is a security posture gap.

Instead, grant access only to the specific namespaces where runner pods will be created. If the operator creates runner pods in dynamically-named project namespaces, scope the role binding per namespace or use a more targeted group.

Alternative approach

If you know the target namespace pattern (e.g., `ambient-code--pr-*`), grant access to specific service accounts or use a namespace-scoped RoleBinding instead of cluster-wide group membership:

```bash
# Example: grant pull access only to service accounts in a specific namespace
oc policy add-role-to-user system:image-puller \
  system:serviceaccount:<runner-namespace>:default \
  --namespace=$NAMESPACE
```

Or document that quay.io images avoid this issue entirely (as noted in lines 289–290).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/ambient/SKILL.md at line 292, The command `oc policy
add-role-to-group system:image-puller system:serviceaccounts
--namespace=$NAMESPACE` is too permissive; change it to grant image-pull only to
the specific runner namespace or service account(s) instead of the cluster-wide
`system:serviceaccounts` group — e.g., replace that call with a namespace-scoped
binding that targets the runner service account (`oc policy add-role-to-user
system:image-puller system:serviceaccount:<runner-namespace>:<sa-name>
--namespace=$NAMESPACE`) or implement a RoleBinding per target namespace for the
`system:image-puller` role so only intended namespaces/users can pull from
`$NAMESPACE`.

Comment on lines +318 to +321
### JWT Configuration

Production uses Red Hat SSO JWKS (`--jwk-cert-url=https://sso.redhat.com/...`). For ephemeral test instances, JWT validation may need to be disabled or pointed at a different issuer. Check the `ambient-api-server-jwt-args-patch.yaml` in the production overlay and adjust as needed for non-production contexts.


🧹 Nitpick | 🔵 Trivial

Provide concrete JWT configuration guidance for non-production contexts.

The note mentions that ephemeral test instances "may need to be disabled or pointed at a different issuer" but does not explain how. Users following this skill for PR test instances will hit JWT validation errors and have no clear remediation path.

Add a concrete example showing how to disable JWT validation or configure an alternative issuer for ephemeral namespaces (e.g., patching the deployment args or referencing the ambient-api-server-jwt-args-patch.yaml).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/ambient/SKILL.md around lines 318 - 321, Update the "JWT
Configuration" section to include a concrete example for ephemeral/non‑prod
namespaces: show how to patch the ambient-api-server container args (referencing
ambient-api-server-jwt-args-patch.yaml) to either disable validation (e.g.,
override args to include a --disable-jwt-validation flag or remove the
--jwk-cert-url/--issuer args) or point to a test issuer by setting
--jwk-cert-url to a known test JWKS URL; include the exact arg keys to change
(--jwk-cert-url, --issuer, --disable-jwt-validation), a short example
description of patching the Deployment/DaemonSet container args for the
ambient-api-server, and note to revert for production.
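As a sketch only (not the platform's confirmed configuration), an ephemeral overlay could carry a variant of that patch which replaces the container args without the JWT flags. Note that a strategic-merge patch overwrites the whole `args` array, so the full production arg list must be copied in:

```yaml
# Hypothetical ephemeral variant of ambient-api-server-jwt-args-patch.yaml.
# Copy the production args here, omitting --jwk-cert-url=... (and any issuer flag);
# a strategic-merge patch replaces the entire args list, not individual entries.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ambient-api-server
spec:
  template:
    spec:
      containers:
        - name: ambient-api-server
          args: []   # replace with the production args minus the JWT validation flags
```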

markturansky added a commit that referenced this pull request Mar 25, 2026
- Remove -k (insecure TLS) from curl in install.sh and SKILL.md
- Add ref: main and permissions: contents: read to pr-namespace-cleanup.yml
- Replace hardcoded example credentials with <PLACEHOLDER> tokens in SKILL.md
- Add public-api and ambient-api-server-db to rollout waits in install.sh and SKILL.md

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (6)
components/pr-test/install.sh (3)

105-108: ⚠️ Potential issue | 🟠 Major

Health verification is non-blocking; consider making it fail-fast with timeouts.

Line 106 uses curl -s ... || true, so backend failures are silently ignored and the script reports success regardless. Consider adding timeouts and failing on errors for better CI signal.

🔧 Suggested fix

```diff
 if [[ -n "$BACKEND_HOST" ]]; then
-  HEALTH=$(curl -s "https://${BACKEND_HOST}/health" || true)
+  if HEALTH=$(curl -fsS --connect-timeout 5 --max-time 20 \
+    --retry 3 --retry-all-errors "https://${BACKEND_HOST}/health"); then
+    echo "    Backend health: $HEALTH"
+  else
+    echo "    WARNING: Backend health check failed"
+  fi
-  echo "    Backend health: $HEALTH"
+else
+  echo "    WARNING: backend-route host not found"
 fi
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/pr-test/install.sh` around lines 105 - 108, The health check
currently silences failures by using `curl -s ... || true`; update the `curl`
invocation that populates `HEALTH` (when `BACKEND_HOST` is set) to use fail-fast
options (e.g., `--fail` plus `--max-time`/`--connect-timeout`) and remove the
`|| true` so non-2xx responses or timeouts cause a non-zero exit; if the `curl`
call fails, print the error and exit the script (or return a non-zero status)
instead of continuing, ensuring the `HEALTH` check fails the CI when the backend
is unreachable.
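The fail-fast pattern is easy to exercise standalone. A minimal sketch, assuming a hypothetical `check_health` helper and a deliberately dead endpoint (`127.0.0.1:1`) to trigger the failure branch; the real script would target `https://${BACKEND_HOST}/health`:

```shell
# Hypothetical fail-fast probe; helper name and endpoint are illustration only.
check_health() {
  host="$1"
  if HEALTH=$(curl -fsS --connect-timeout 2 --max-time 5 "http://${host}/health"); then
    echo "Backend health: $HEALTH"
  else
    echo "WARNING: backend health check failed for ${host}" >&2
    return 1
  fi
}

# Port 1 is almost always closed, so this exercises the failure branch:
check_health "127.0.0.1:1" || echo "probe exited non-zero"
```

Because `-f` makes curl fail on HTTP errors and the `|| true` is gone, a dead backend now propagates a non-zero status to CI instead of reporting success.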

52-59: ⚠️ Potential issue | 🔴 Critical

Update both tagged and untagged image mappings to ensure PR isolation.

kustomize edit set image matches by exact name field. If the production overlay contains untagged entries (e.g., quay.io/ambient_code/vteam_frontend without :latest), they won't be updated and could resolve to latest instead of the PR-specific tag.

🔧 Proposed fix - add untagged mappings

```diff
 kustomize edit set image \
+  "quay.io/ambient_code/vteam_frontend=quay.io/ambient_code/vteam_frontend:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_frontend:latest=quay.io/ambient_code/vteam_frontend:${IMAGE_TAG}" \
+  "quay.io/ambient_code/vteam_backend=quay.io/ambient_code/vteam_backend:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_backend:latest=quay.io/ambient_code/vteam_backend:${IMAGE_TAG}" \
+  "quay.io/ambient_code/vteam_operator=quay.io/ambient_code/vteam_operator:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_operator:latest=quay.io/ambient_code/vteam_operator:${IMAGE_TAG}" \
+  "quay.io/ambient_code/vteam_claude_runner=quay.io/ambient_code/vteam_claude_runner:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_claude_runner:latest=quay.io/ambient_code/vteam_claude_runner:${IMAGE_TAG}" \
+  "quay.io/ambient_code/vteam_state_sync=quay.io/ambient_code/vteam_state_sync:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_state_sync:latest=quay.io/ambient_code/vteam_state_sync:${IMAGE_TAG}" \
+  "quay.io/ambient_code/vteam_api_server=quay.io/ambient_code/vteam_api_server:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_api_server:latest=quay.io/ambient_code/vteam_api_server:${IMAGE_TAG}" \
+  "quay.io/ambient_code/vteam_public_api=quay.io/ambient_code/vteam_public_api:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_public_api:latest=quay.io/ambient_code/vteam_public_api:${IMAGE_TAG}"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/pr-test/install.sh` around lines 52 - 59, kustomize edit set image
only updates images that exactly match the name given, so add duplicate untagged
mappings alongside the existing tagged ones in the kustomize command (the call
to kustomize edit set image) for each image (e.g., add mappings for
quay.io/ambient_code/vteam_frontend, vteam_backend, vteam_operator,
vteam_claude_runner, vteam_state_sync, vteam_api_server, vteam_public_api
without :latest) so they are also set to
quay.io/ambient_code/<image>:${IMAGE_TAG}; update the same command block that
currently lists the ":latest" entries to include the corresponding untagged
entries to ensure PR isolation.

35-37: ⚠️ Potential issue | 🔴 Critical

Do not apply cluster-scoped CRDs/RBAC from PR branch content.

Applying cluster-scoped manifests from unmerged PR code allows ephemeral test jobs to mutate cluster-wide resources. This should be handled by a trusted/mainline pipeline or pre-provisioned, not during PR install.

🛡️ Suggested hardening

```diff
-echo "==> Step 1: Applying CRDs and RBAC (cluster-scoped, idempotent)"
-oc apply -k "$REPO_ROOT/components/manifests/base/crds/"
-oc apply -k "$REPO_ROOT/components/manifests/base/rbac/"
+if [[ "${APPLY_CLUSTER_SCOPED:-0}" == "1" ]]; then
+  echo "==> Step 1: Applying CRDs and RBAC (cluster-scoped, explicit opt-in)"
+  oc apply -k "$REPO_ROOT/components/manifests/base/crds/"
+  oc apply -k "$REPO_ROOT/components/manifests/base/rbac/"
+else
+  echo "==> Step 1: Skipping cluster-scoped apply (expected pre-installed)"
+fi
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/pr-test/install.sh` around lines 35 - 37, The script is applying
cluster-scoped CRDs/RBAC from PR branch content via the two oc apply -k commands
(the invocations that reference "$REPO_ROOT/components/manifests/base/crds/" and
".../rbac/"), which must not run in ephemeral PR installs; change the script to
skip these applies unless running in a trusted/mainline context by guarding them
with an explicit environment flag (e.g. MAINLINE_PIPELINE or
SKIP_CLUSTER_MANIFESTS) or CI variable check, and when skipped, print a clear
message that cluster-scoped resources are omitted; ensure the logic surrounds
the oc apply -k calls (not other local-only manifests) so PR jobs never mutate
cluster-wide resources.
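The suggested opt-in guard can be sanity-checked in isolation. A sketch with the `oc apply -k` calls stubbed out by `echo` so the branching runs anywhere (`APPLY_CLUSTER_SCOPED` is the flag name assumed in the hardening suggestion, not an existing variable in install.sh):

```shell
# Control-flow sketch of the opt-in guard; real cluster mutations are stubbed.
apply_cluster_scoped() {
  if [ "${APPLY_CLUSTER_SCOPED:-0}" = "1" ]; then
    echo "applying CRDs and RBAC"   # would be: oc apply -k .../crds/ and .../rbac/
  else
    echo "skipping cluster-scoped apply (expected pre-installed)"
  fi
}

APPLY_CLUSTER_SCOPED=0 apply_cluster_scoped
APPLY_CLUSTER_SCOPED=1 apply_cluster_scoped
```

Defaulting the flag to `0` means PR jobs never mutate cluster-scoped resources unless a trusted pipeline opts in explicitly.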
components/pr-test/README.md (1)

1-11: ⚠️ Potential issue | 🟠 Major

Add a security model section addressing untrusted PR code execution.

Per the PR discussion (ktdreyer's comment), identify where untrusted PR code might run and document mitigation strategies. Key points to cover:

  • Isolation boundaries (namespace-scoped, network policies via security-zone: internal)
  • What executes PR code vs. trusted platform code (provision.sh/install.sh run from main branch)
  • Secrets management (namespace-scoped, cleaned up on teardown)
  • The security implication of enabling push: true for PR builds (line 189)

This helps reviewers understand the trust model for ephemeral test instances.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/pr-test/README.md` around lines 1 - 11, Add a new "Security model"
section to the README describing where untrusted PR code may run and
mitigations: specify that S0.x instances are namespace-scoped and isolated with
network policies (use label/security-zone: internal), list which artifacts are
executed from trusted main branch (provision.sh and install.sh) vs what comes
from the PR branch, explain secrets handling (namespace-scoped secrets, scrubbed
on teardown, no cluster-wide secrets exposed), and call out the security
implications and required review when enabling the push: true flag for PR builds
so reviewers know the increased risk and required controls.
.claude/skills/ambient/SKILL.md (1)

290-294: ⚠️ Potential issue | 🟠 Major

Overly permissive cross-namespace image pull policy.

The command grants system:image-puller to the cluster-wide system:serviceaccounts group, allowing any pod in any namespace to pull images from $NAMESPACE. On a shared cluster this is a security posture gap.

Consider granting access only to specific runner namespaces or service accounts, or note that quay.io images (which this skill primarily uses) avoid this issue entirely.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/ambient/SKILL.md around lines 290 - 294, The documented oc
command grants system:image-puller to the cluster-wide group
system:serviceaccounts which is overly permissive; change the guidance to
recommend granting pull access only to the specific runner namespace or specific
service accounts (e.g., bind system:image-puller to
system:serviceaccounts:<runner-namespace> or to a named ServiceAccount) instead
of the entire cluster group, and/or note that using quay.io avoids the internal
registry cross-namespace pull requirement; reference the existing command and
the principals system:image-puller and system:serviceaccounts and replace the
example with a scoped approach (per-namespace rolebinding or
role-to-serviceaccount) or an explicit note recommending scoped access only.
.github/workflows/pr-namespace-cleanup.yml (1)

28-34: ⚠️ Potential issue | 🟡 Minor

Guard against edge case where SLUG_MAX becomes negative.

For very high PR numbers (e.g., 7+ digits), SLUG_MAX could become zero or negative, and if BRANCH_SLUG ends up empty, INSTANCE_ID would have a trailing dash. Also, $GITHUB_OUTPUT should be quoted per shellcheck SC2086.

🛡️ Proposed fix

```diff
           PR_LEN=${#PR_NUMBER}
           SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
+          [[ $SLUG_MAX -lt 0 ]] && SLUG_MAX=0
           BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"

-          INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
-          echo "instance_id=${INSTANCE_ID}" >> $GITHUB_OUTPUT
+          if [[ -n "$BRANCH_SLUG" ]]; then
+            INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
+          else
+            INSTANCE_ID="pr-${PR_NUMBER}"
+          fi
+          echo "instance_id=${INSTANCE_ID}" >> "$GITHUB_OUTPUT"
           echo "Cleaning up instance: ${INSTANCE_ID}"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/pr-namespace-cleanup.yml around lines 28 - 34, Guard
SLUG_MAX against negative values and avoid a trailing dash when BRANCH_SLUG is
empty: after computing SLUG_MAX (from PR_LEN) clamp it to zero if negative
(e.g., if [ "$SLUG_MAX" -lt 0 ]; then SLUG_MAX=0; fi), then derive BRANCH_SLUG
from SAFE_BRANCH as before; build INSTANCE_ID conditionally so it omits the dash
when BRANCH_SLUG is empty (e.g., if [ -n "$BRANCH_SLUG" ]; then
INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"; else
INSTANCE_ID="pr-${PR_NUMBER}"; fi); and quote the GITHUB_OUTPUT variable when
writing (use >> "$GITHUB_OUTPUT"). Use the symbols PR_LEN, SLUG_MAX,
SAFE_BRANCH, BRANCH_SLUG, INSTANCE_ID and GITHUB_OUTPUT to locate the changes.
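Pulled out as a standalone snippet with sample values (the `63 - 14 - 4` budget is taken from the workflow; the PR number and branch are illustrative), the clamped computation behaves like this:

```shell
# Standalone sketch of the clamped slug logic with sample inputs.
PR_NUMBER="1032"
SAFE_BRANCH="test-my-pr"

PR_LEN=${#PR_NUMBER}
SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
if [ "$SLUG_MAX" -lt 0 ]; then SLUG_MAX=0; fi   # clamp: very long PR numbers would go negative
BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"

if [ -n "$BRANCH_SLUG" ]; then
  INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"  # normal case
else
  INSTANCE_ID="pr-${PR_NUMBER}"                 # no trailing dash when the slug is empty
fi
echo "$INSTANCE_ID"
# → pr-1032-test-my-pr
```

Without the clamp, a negative `SLUG_MAX` would change the meaning of the substring expansion; with it, the worst case degrades gracefully to `pr-<N>`.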
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In @.claude/skills/ambient/SKILL.md:
- Around line 290-294: The documented oc command grants system:image-puller to
the cluster-wide group system:serviceaccounts which is overly permissive; change
the guidance to recommend granting pull access only to the specific runner
namespace or specific service accounts (e.g., bind system:image-puller to
system:serviceaccounts:<runner-namespace> or to a named ServiceAccount) instead
of the entire cluster group, and/or note that using quay.io avoids the internal
registry cross-namespace pull requirement; reference the existing command and
the principals system:image-puller and system:serviceaccounts and replace the
example with a scoped approach (per-namespace rolebinding or
role-to-serviceaccount) or an explicit note recommending scoped access only.

In @.github/workflows/pr-namespace-cleanup.yml:
- Around line 28-34: Guard SLUG_MAX against negative values and avoid a trailing
dash when BRANCH_SLUG is empty: after computing SLUG_MAX (from PR_LEN) clamp it
to zero if negative (e.g., if [ "$SLUG_MAX" -lt 0 ]; then SLUG_MAX=0; fi), then
derive BRANCH_SLUG from SAFE_BRANCH as before; build INSTANCE_ID conditionally
so it omits the dash when BRANCH_SLUG is empty (e.g., if [ -n "$BRANCH_SLUG" ];
then INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"; else
INSTANCE_ID="pr-${PR_NUMBER}"; fi); and quote the GITHUB_OUTPUT variable when
writing (use >> "$GITHUB_OUTPUT"). Use the symbols PR_LEN, SLUG_MAX,
SAFE_BRANCH, BRANCH_SLUG, INSTANCE_ID and GITHUB_OUTPUT to locate the changes.

In `@components/pr-test/install.sh`:
- Around line 105-108: The health check currently silences failures by using
`curl -s ... || true`; update the `curl` invocation that populates `HEALTH`
(when `BACKEND_HOST` is set) to use fail-fast options (e.g., `--fail` plus
`--max-time`/`--connect-timeout`) and remove the `|| true` so non-2xx responses
or timeouts cause a non-zero exit; if the `curl` call fails, print the error and
exit the script (or return a non-zero status) instead of continuing, ensuring
the `HEALTH` check fails the CI when the backend is unreachable.
- Around line 52-59: kustomize edit set image only updates images that exactly
match the name given, so add duplicate untagged mappings alongside the existing
tagged ones in the kustomize command (the call to kustomize edit set image) for
each image (e.g., add mappings for quay.io/ambient_code/vteam_frontend,
vteam_backend, vteam_operator, vteam_claude_runner, vteam_state_sync,
vteam_api_server, vteam_public_api without :latest) so they are also set to
quay.io/ambient_code/<image>:${IMAGE_TAG}; update the same command block that
currently lists the ":latest" entries to include the corresponding untagged
entries to ensure PR isolation.
- Around line 35-37: The script is applying cluster-scoped CRDs/RBAC from PR
branch content via the two oc apply -k commands (the invocations that reference
"$REPO_ROOT/components/manifests/base/crds/" and ".../rbac/"), which must not
run in ephemeral PR installs; change the script to skip these applies unless
running in a trusted/mainline context by guarding them with an explicit
environment flag (e.g. MAINLINE_PIPELINE or SKIP_CLUSTER_MANIFESTS) or CI
variable check, and when skipped, print a clear message that cluster-scoped
resources are omitted; ensure the logic surrounds the oc apply -k calls (not
other local-only manifests) so PR jobs never mutate cluster-wide resources.

In `@components/pr-test/README.md`:
- Around line 1-11: Add a new "Security model" section to the README describing
where untrusted PR code may run and mitigations: specify that S0.x instances are
namespace-scoped and isolated with network policies (use label/security-zone:
internal), list which artifacts are executed from trusted main branch
(provision.sh and install.sh) vs what comes from the PR branch, explain secrets
handling (namespace-scoped secrets, scrubbed on teardown, no cluster-wide
secrets exposed), and call out the security implications and required review
when enabling the push: true flag for PR builds so reviewers know the increased
risk and required controls.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6db8f4d5-602f-45b4-9e1e-d19494fdf5a9

📥 Commits

Reviewing files that changed from the base of the PR and between 2eb2465 and 9d943a1.

📒 Files selected for processing (6)
  • .claude/skills/ambient-pr-test/SKILL.md
  • .claude/skills/ambient/SKILL.md
  • .github/workflows/pr-namespace-cleanup.yml
  • components/pr-test/README.md
  • components/pr-test/install.sh
  • docs/internal/proposals/acp-public-rest-api.md
💤 Files with no reviewable changes (1)
  • docs/internal/proposals/acp-public-rest-api.md

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/pr-test/README.md`:
- Around line 69-82: The example TenantNamespace CR (kind: TenantNamespace,
name: pr-123-example) is missing the required label ambient-code/instance-type:
s0x under metadata.labels; add that label alongside the existing tenant and
namespace-type labels so the provisioner capacity check (referenced around lines
107-108) will count this instance and match the schema expectations mentioned
near lines 56 and 144.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8aebdebc-b679-4112-9c35-868a57e925b8

📥 Commits

Reviewing files that changed from the base of the PR and between 9d943a1 and 02ace41.

📒 Files selected for processing (3)
  • .claude/skills/ambient-pr-test/SKILL.md
  • components/pr-test/README.md
  • components/pr-test/build.sh


```yaml
- name: Log in to OpenShift
  run: |
    oc login "${{ secrets.TEST_OPENSHIFT_SERVER }}" \
```

I'll note one thing here: if we want to deploy to an MP+ cluster behind the firewall, we can use ArgoCD of course, or we can use the internal GH runner as documented in #793.

@@ -1,214 +0,0 @@
# Proposal: ACP Public REST API

Why delete this?

# OpenShift Cluster Development

This guide covers deploying the Ambient Code Platform on OpenShift clusters for development and testing. Use this when you need to test OpenShift-specific features like Routes, OAuth integration, or service mesh capabilities.
This guide covers deploying Ambient Code on an OpenShift cluster using the **OpenShift internal image registry**. This is useful when iterating on local builds against a dev cluster without pushing to quay.io.

Can this change be broken out into a separate PR? I don't understand what this has to do with this change.

Contributor Author

by introducing the logic into the skill, it becomes re-usable. it's like a code refactor where we make a change by making a new dual use function and refactoring the old code to use it. this developer guide was my original How To script for claude.

```bash
print(doc)
PYEOF

kustomize build . \
```

Suggested change

```diff
-kustomize build . \
+oc kustomize build . \
```

See #768

@@ -0,0 +1,448 @@
# Specification: Ephemeral PR Test Environments on MPP

This doc is helpful to understand the design thinking, but some things are already out of date (for example "no branch slug" is wrong). If this stays in this PR, it needs another pass to reconcile with the actual implementation. (See also #952)


There is significant duplication between the SKILL.md files and this file. For maintainability, I wonder if we could trim these both down and dedup where it makes sense.

```bash
  "quay.io/ambient_code/vteam_public_api:latest=quay.io/ambient_code/vteam_public_api:${IMAGE_TAG}"

FILTER_SCRIPT="$TMPDIR/filter.py"
cat > "$FILTER_SCRIPT" << 'PYEOF'
```

This inline Python script is 35 lines of yaml manipulation via regex. That will be very challenging to maintain.

Could this be a standalone Python file in components/pr-test/filter.py with proper tests? Or better, could you use yq or oc instead of regex?

@ktdreyer ktdreyer left a comment

Biggest thing we must address is the warnings from GH re: acting on untrusted code

- Build workflow pushes all 7 component images tagged pr-<N>-amd64 to quay on every PR
- provision.sh creates/destroys TenantNamespace CRs on ambient-code--config; capacity-gated at 5 concurrent instances
- install.sh deploys production manifests with PR image tags using ArgoCD SA token; handles MPP restricted environment constraints (Route labels, PVC annotations, ClusterRoleBinding subject patching)
- pr-e2e-openshift.yml workflow: provision → install → e2e → teardown on build completion
- pr-namespace-cleanup.yml: safety-net teardown on PR close
- Skills: ambient-pr-test (full PR test workflow) and ambient (install on any OpenShift namespace)
- Validates required secrets before install; documents MPP resource inventory and constraints
- Route admission webhook fix: add paas.redhat.com/appcode label via kustomize filter

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
markturansky pushed a commit that referenced this pull request Mar 28, 2026
- components-build-deploy: push immutable SHA-tagged PR image alongside
  mutable pr-<N>-<arch> tag to prevent race conditions across commits
- pr-e2e-openshift: concurrency group null safety (|| workflow_run.id),
  clamp SLUG_MAX to 0 min / skip branch slug if empty, remove
  --insecure-skip-tls-verify, install kustomize for kustomize-edit steps
- install.sh: add both :latest and untagged kustomize set-image mappings
  so all image name variants are rewritten; add vteam_control_plane and
  vteam_mcp; replace kustomize build with oc kustomize; make health check
  fail-fast with curl timeouts and exit on missing routes
- provision.sh: atomic ConfigMap lock before capacity check (TOCTOU fix);
  wait for TenantNamespace Ready condition, not only Namespace Active;
  fix destroy() to detect only (NotFound) as success, not all oc errors;
  delete lock ConfigMap on destroy
- openshift.md: add vteam_state_sync to internal registry push loop

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@ambient-code

ambient-code bot commented Mar 30, 2026

Review Queue Status

| Check | Status | Detail |
| --- | --- | --- |
| CI | pass | |
| Conflicts | FAIL | Merge conflicts present |
| Reviews | pass | |

Action needed: Resolve merge conflicts

Auto-generated by Review Queue workflow. Updated when PR changes.
