
feat(ci): ephemeral PR test instances on MPP dev cluster#1032

Open
markturansky wants to merge 1 commit into main from test_my_pr

Conversation

@markturansky
Contributor

Summary

  • Build workflow: All 7 component images are now built and pushed to quay on every PR (pr-<N>-amd64 tag), not just changed components
  • Namespace provisioner (components/pr-test/provision.sh): Creates/destroys TenantNamespace CRs on ambient-code--config; capacity-gated at 5 concurrent S0.x instances
  • Ambient installer (components/pr-test/install.sh): Deploys production manifests with PR image tags; copies ambient-vertex and ambient-api-server secrets from source namespace via oc; mirrors Vertex AI config from operator-config
  • GitHub workflows: pr-e2e-openshift.yml (provision → install → e2e → teardown on build completion) and pr-namespace-cleanup.yml (safety-net teardown on PR close)
  • Skills: .claude/skills/ambient/ (install on any OpenShift namespace) and .claude/skills/ambient-pr-test/ (full PR test workflow)
  • Developer guide: openshift.md trimmed to internal-registry-only content with links to skills
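The naming conventions above fit together as follows; a minimal shell sketch for this PR's values (the slug rule is an assumption inferred from the names shown, not the scripts' exact code):

```shell
# Illustrative: how the PR image tag and ephemeral namespace are derived.
PR_NUMBER=1032
ARCH=amd64
BRANCH="test_my_pr"

IMAGE_TAG="pr-${PR_NUMBER}-${ARCH}"

# Assumed slug rule: lowercase, non-alphanumerics collapsed to single dashes.
BRANCH_SLUG=$(printf '%s' "$BRANCH" \
  | tr '[:upper:]' '[:lower:]' \
  | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//')
NAMESPACE="ambient-code--pr-${PR_NUMBER}-${BRANCH_SLUG}"

echo "$IMAGE_TAG"    # pr-1032-amd64
echo "$NAMESPACE"    # ambient-code--pr-1032-test-my-pr
```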

Test plan

  • Add TEST_OPENSHIFT_SERVER and TEST_OPENSHIFT_TOKEN GitHub Actions secrets pointing at dev-spoke-aws-us-east-1
  • Verify build workflow pushes images tagged pr-<N>-amd64 to quay on this PR
  • Verify pr-e2e-openshift.yml triggers and provisions ambient-code--pr-<N>-* namespace
  • Verify Ambient deploys and health check passes
  • Verify E2E tests run against the ephemeral instance
  • Verify namespace is torn down after workflow completes
  • Verify pr-namespace-cleanup.yml fires on PR close

🤖 Generated with Claude Code

@coderabbitai
Contributor

coderabbitai bot commented Mar 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds PR-focused ephemeral environment tooling: build/push scripts and workflows, TenantNamespace provisioning and teardown scripts, an installer to deploy Ambient into PR namespaces, accompanying docs/skills, and removal of an unused API proposal doc.

Changes

  • Skills / Docs — `.claude/skills/ambient-pr-test/SKILL.md`, `.claude/skills/ambient/SKILL.md`, `components/pr-test/README.md`, `docs/internal/developer/local-development/openshift.md`
    Adds PR-test and Ambient installer skill docs and a README describing the TenantNamespace spec, naming, capacity rules, RBAC, and procedure; rewrites the OpenShift local-dev guide to reference the internal registry and the new PR flows.
  • GitHub Actions workflows — `.github/workflows/components-build-deploy.yml`, `.github/workflows/pr-e2e-openshift.yml`, `.github/workflows/pr-namespace-cleanup.yml`
    Alters the build workflow to push PR images and add a BuildKit cache; adds a PR E2E workflow that builds, provisions an ephemeral namespace, installs Ambient, runs Cypress, and always tears down; adds a PR-close cleanup workflow to destroy instances.
  • PR build script — `components/pr-test/build.sh`
    New script to build and push component images for a given PR URL, producing a pr-<N>-amd64 image tag.
  • Provisioning script — `components/pr-test/provision.sh`
    New provisioner CLI to create/destroy TenantNamespace CRs with concurrency limits, readiness/delete polling, timeouts, and env-configurable parameters (MAX_S0X_INSTANCES, READY_TIMEOUT, DELETE_TIMEOUT).
  • Install script — `components/pr-test/install.sh`
    New installer that copies secrets, applies CRDs/RBAC, rewrites kustomize overlays to use a PR image tag, patches operator-config and ambient-agent-registry, waits for rollouts, performs health checks, and emits frontend_url and namespace outputs.
  • Removed proposal — `docs/internal/proposals/acp-public-rest-api.md`
    Deletes the ACP Public REST API proposal document.

Sequence Diagram

sequenceDiagram
    participant Dev as Developer
    participant GH as GitHub Actions
    participant Prov as provision.sh
    participant OS as OpenShift API
    participant Inst as install.sh
    participant E2E as Cypress

    Dev->>GH: Open/Update PR
    GH->>GH: Build & Push component images
    GH->>Prov: create(instance_id)
    Prov->>OS: Apply TenantNamespace CR
    OS->>OS: Tenant operator provisions namespace
    Prov->>OS: Poll until namespace.phase == "Active"
    OS-->>Prov: Namespace Active
    GH->>Inst: install.sh <namespace> <image-tag>
    Inst->>OS: Apply CRDs/RBAC, copy secrets, apply patched kustomize
    Inst->>OS: Patch operator-config & agent-registry, wait rollouts
    Inst-->>GH: frontend_url output
    GH->>E2E: Run Cypress tests against frontend_url
    E2E->>OS: Query routes/health endpoints
    E2E-->>GH: Upload artifacts (on failure)
    GH->>Prov: destroy(instance_id) [always]
    Prov->>OS: Delete TenantNamespace CR
    OS->>OS: Tenant operator finalizes and removes namespace
    Prov->>OS: Poll until namespace absent
    OS-->>Prov: Namespace removed

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)
  • Title check — ✅ Passed: the title accurately describes the main change, adding ephemeral PR test instances on the MPP dev cluster.
  • Description check — ✅ Passed: the description matches the changeset, summarizing the build workflow, provisioning, installation, GitHub workflows, skills, and documentation changes in the included files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.

Comment on lines +73 to +78
      - name: Provision namespace
        run: |
          bash components/pr-test/provision.sh create \
            "${{ needs.setup.outputs.instance_id }}"

  install:

Check failure

Code scanning / CodeQL

Checkout of untrusted code in a privileged context Critical

Potential execution of untrusted code on a privileged workflow (workflow_run)
Contributor

We must address this

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 15

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/ambient/SKILL.md:
- Around line 58-88: The sample secret creation commands publish fixed, shared
credentials; replace hardcoded values with placeholders and guidance to generate
unique secrets instead: update the examples that create minio-credentials,
postgresql-credentials, unleash-credentials, and github-app-secret to use
parameterized values (e.g. <ROOT_USER>, <ROOT_PASSWORD>, <DB_HOST>,
<DB_PASSWORD>, <UNLEASH_DB_URL>, <GITHUB_PRIVATE_KEY>, <GITHUB_CLIENT_SECRET>,
<GITHUB_STATE_SECRET>) and add a short note instructing users to generate
strong, unique secrets (or source them from a secret manager) before running
kustomize so no default credentials are shared across installations.
- Around line 175-189: The rollout wait step is missing gates for public-api and
ambient-api-server-db which makes Step 6 flaky; update the Step 5 "Wait for
Rollout" loop/commands in .claude/skills/ambient/SKILL.md to include oc rollout
status checks for deployment/public-api and deployment/ambient-api-server-db (in
addition to ambient-api-server) so the docs explicitly wait for public-api and
ambient-api-server-db to finish before proceeding to Step 6.
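To follow the first suggestion above, the skill doc could have readers generate unique credentials rather than copy shared defaults; a minimal sketch (the `openssl rand` approach and variable names are illustrative, not the doc's current text):

```shell
# Illustrative: generate strong, unique secrets before running kustomize.
ROOT_PASSWORD=$(openssl rand -hex 16)   # 32 hex characters
DB_PASSWORD=$(openssl rand -hex 16)

# These values would replace the hardcoded defaults, e.g.:
#   oc create secret generic minio-credentials -n ambient-code \
#     --from-literal=root-user=admin \
#     --from-literal=root-password="$ROOT_PASSWORD"
```

A secret manager would serve the same purpose; the point is that no two installations share credentials.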

In @.github/workflows/components-build-deploy.yml:
- Around line 206-217: The PR build currently only pushes a mutable tag
(pr-<N>-<arch>) which can be overwritten; update the docker/build-push-action
step (the "Build and push ${{ matrix.component.name }} (${{ matrix.arch.suffix
}}) for pull request" step) to also push an immutable tag that includes the PR
head commit SHA so downstream workflow_run jobs can pull the exact image.
Concretely, add a second tag value alongside the existing tags that uses the PR
head SHA (e.g. ${{ matrix.component.image }}:${{
github.event.pull_request.head.sha }}-${{ matrix.arch.suffix }}) while keeping
the current pr-... tag and ensure push remains true so both tags are pushed.

In @.github/workflows/pr-e2e-openshift.yml:
- Around line 57-60: Replace the unsafe checkout ref that pulls the PR head (the
actions/checkout step using ref: ${{ github.event.workflow_run.head_sha }}) so
the job does not execute untrusted PR code with secrets in scope; change the
checkout to use a safe ref like ${{ github.sha }} (or the run's immutable
commit) and apply the same change to the other checkout occurrences noted (lines
84-119, 126-147), and ensure jobs that require secrets (those invoking
components/pr-test/*.sh, npm ci, npx cypress run with TEST_OPENSHIFT_TOKEN,
QUAY_PASSWORD, ANTHROPIC_API_KEY) only run against repository-controlled code
(or switch to pull_request_target with checkout of the base commit) so
PR-modified scripts cannot exfiltrate secrets.
- Around line 27-35: The run script is inlining
github.event.workflow_run.head_branch into the shell source, exposing
secret-scoped workflow context; instead, move PR_NUMBER and HEAD_BRANCH into
environment variables and reference them inside the run block (e.g., add an env:
section that sets PR_NUMBER: ${{
github.event.workflow_run.pull_requests[0].number }} and HEAD_BRANCH: ${{
github.event.workflow_run.head_branch }} and then use $PR_NUMBER and
$HEAD_BRANCH in the script), so BRANCH_SLUG creation and sed/tr commands operate
on an env var rather than interpolated workflow text.
- Around line 37-39: The derived NAMESPACE can exceed allowed length because
INSTANCE_ID uses BRANCH_SLUG:0:45; reduce the slug truncation to fit the
remaining budget (e.g., change BRANCH_SLUG:0:45 to BRANCH_SLUG:0:41) so that
"ambient-code--" + "pr-${PR_NUMBER}-" plus the slug stays within the namespace
length limit; update the INSTANCE_ID assignment (and any related uses like
NAMESPACE/IMAGE_TAG) to use the shorter BRANCH_SLUG slice.
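The truncation finding above reduces to a simple length budget; a hedged sketch (the prefix lengths come from the findings, the clamping is illustrative):

```shell
PR_NUMBER=1032
BRANCH_SLUG="a-very-long-branch-name-that-would-otherwise-overflow-the-limit"

# Kubernetes namespace names are capped at 63 characters.
PREFIX="ambient-code--pr-${PR_NUMBER}-"
BUDGET=$((63 - ${#PREFIX}))
if [ "$BUDGET" -lt 0 ]; then BUDGET=0; fi   # clamp for pathological inputs

INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG:0:$BUDGET}"
NAMESPACE="ambient-code--${INSTANCE_ID}"
echo "${#NAMESPACE}"   # stays at or under 63
```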

In @.github/workflows/pr-namespace-cleanup.yml:
- Around line 27-28: The hardcoded slice ${BRANCH_SLUG:0:45} can still produce
>63-char Kubernetes namespaces; replace it with a dynamic truncation that
accounts for the fixed "ambient-code--" prefix (14 chars) and the variable
"pr-${PR_NUMBER}-" prefix length by using ${BRANCH_SLUG:0:$((45 -
${#PR_NUMBER}))} (45 derived from 63-14-4), and clamp the computed length to
zero or greater before constructing INSTANCE_ID; keep the rest (INSTANCE_ID
assignment and echo to GITHUB_OUTPUT) unchanged.
- Around line 17-25: The job currently interpolates untrusted HEAD_BRANCH
directly into the shell pipeline that computes BRANCH_SLUG (variables PR_NUMBER,
HEAD_BRANCH, BRANCH_SLUG and the sed/tr transliteration pipeline), so change the
script to first strictly validate and sanitize HEAD_BRANCH: reject or fail the
job if HEAD_BRANCH contains any characters outside a safe whitelist (allow only
ASCII letters, digits and separators like '.' '_' '-'), limit length (e.g. 64
chars), and then construct BRANCH_SLUG from that validated value; always
double-quote variable expansions when used in commands and avoid passing user
input to eval-like contexts or unquoted substitutions, ensure BRANCH_SLUG never
starts with a hyphen and normalize runs of non-alphanumerics to a single dash as
before but only after validation to prevent malicious injection.
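A sketch of the strict validation the last finding describes (whitelist per the finding, with '/' added since branch names commonly contain it; this is not the workflow's actual code):

```shell
HEAD_BRANCH="feature/My_Branch-1"   # untrusted input (illustrative value)

# Reject anything outside a conservative whitelist, and cap length at 64.
if ! printf '%s' "$HEAD_BRANCH" | grep -Eq '^[A-Za-z0-9._/-]{1,64}$'; then
  echo "ERROR: unsafe branch name" >&2
  exit 1
fi

# Only after validation: lowercase and collapse non-alphanumerics to dashes.
BRANCH_SLUG=$(printf '%s' "$HEAD_BRANCH" \
  | tr '[:upper:]' '[:lower:]' \
  | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//')
echo "$BRANCH_SLUG"
```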

In `@components/pr-test/install.sh`:
- Around line 102-105: The health check ignores failures and can hang because it
uses curl -sk ... || true; change the block that uses BACKEND_HOST and HEALTH to
call curl with a connect and overall timeout and fail on HTTP errors (e.g., curl
-fsS --connect-timeout 5 --max-time 10 "https://$BACKEND_HOST/health"), remove
the "|| true", capture the exit code and the output into HEALTH, and if curl
fails (non-zero exit) print the HEALTH/error and exit non‑zero so the script
fails fast; reference the BACKEND_HOST variable, HEALTH assignment, and the curl
invocation when making the change.
- Around line 35-37: The script is applying cluster-scoped CRDs/RBAC from the PR
branch (the oc apply -k "$REPO_ROOT/components/manifests/base/crds/" and oc
apply -k "$REPO_ROOT/components/manifests/base/rbac/" lines under the "Step 1"
echo), which must not run in a PR install; change install.sh to skip or abort
these two oc apply commands unless running in a trusted/mainline pipeline (guard
with an explicit env var like ALLOW_CLUSTER_SCOPED_MANIFESTS or CI_MAINLINE) or
move them to a separate pre-provision step; keep the namespace-scoped install
flow intact and ensure the script logs a clear message when cluster-scoped
manifests are skipped.
- Around line 52-59: kustomize edit set image currently only replaces images
named with :latest, leaving untagged image entries unchanged; update the
kustomize image mappings so each image name is provided twice (once with the
:latest source and once without the tag) mapping both variants to
quay.io/ambient_code/<image>:${IMAGE_TAG} (e.g., ensure vteam_frontend,
vteam_frontend:latest, vteam_backend, vteam_backend:latest, etc. are all
included) so that both tagged and untagged entries are rewritten; update the
kustomize edit set image invocation to include these duplicate mappings for
vteam_frontend, vteam_backend, vteam_operator, vteam_claude_runner,
vteam_state_sync, vteam_api_server, and vteam_public_api.
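The duplicate-mapping suggestion can be illustrated by constructing the argument list first; a dry-run sketch (image names from the finding; the array construction is illustrative):

```shell
IMAGE_TAG="pr-1032-amd64"   # illustrative PR tag
ARGS=()
for img in vteam_frontend vteam_backend vteam_operator vteam_claude_runner \
           vteam_state_sync vteam_api_server vteam_public_api; do
  # Map both the untagged and the :latest form to the PR tag.
  ARGS+=("${img}=quay.io/ambient_code/${img}:${IMAGE_TAG}")
  ARGS+=("${img}:latest=quay.io/ambient_code/${img}:${IMAGE_TAG}")
done
# Would be passed as: kustomize edit set image "${ARGS[@]}"
printf '%s\n' "${ARGS[@]}"
```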

In `@components/pr-test/provision.sh`:
- Around line 29-43: The current capacity check using ACTIVE and
MAX_S0X_INSTANCES is racy (TOCTOU) because two concurrent runs can both observe
the same ACTIVE count then both apply the TenantNamespace CR; replace the
read-then-apply pattern with an atomic reservation or cluster-side quota:
implement a create-if-not-exists reservation/lock (for example create a
namespaced Reservation CR or a ConfigMap/Lease named with the instance id and
fail the create if it already exists) and only count SUCCESSFUL reservations
toward the limit, or enforce with a Kubernetes ResourceQuota/OPA/Gatekeeper
policy so the API server enforces the MAX_S0X_INSTANCES limit; update the
provision flow around the "echo '==> Applying TenantNamespace CR: $INSTANCE_ID'"
block to create the reservation atomically before applying the TenantNamespace
and delete the reservation on rollback/cleanup.
- Around line 59-74: The loop that currently waits for Namespace.phase ==
"Active" (using NAMESPACE, READY_TIMEOUT, STATUS) must instead wait for the
TenantNamespace CR Ready condition to be True to ensure the tenant operator
finished reconciling; change the check inside the while loop to query the
TenantNamespace resource (oc get tenantnamespace "$INSTANCE_ID" -n
"$CONFIG_NAMESPACE" -o
jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null || true)
and treat "True" as success (you may still fall back to checking Namespace.phase
if desired), and preserve the existing timeout, logging, and the final oc
describe tenantnamespace "$INSTANCE_ID" invocation on failure.
- Around line 82-88: The current loop treats any failure of `oc get namespace
"$NAMESPACE"` as successful deletion; change the check inside the while loop to
capture stderr/output from `oc get namespace "$NAMESPACE" 2>&1` and only
consider the namespace deleted when that output explicitly indicates a NotFound
error (e.g. matches "NotFound" or "namespaces .* not found" or "Error from
server (NotFound)"). For non-NotFound failures (auth errors, timeouts, transport
errors) do not exit; instead log a warning and continue retrying until DEADLINE.
Update the if-condition in the while loop that references
NAMESPACE/DELETE_TIMEOUT/DEADLINE accordingly so only a confirmed NotFound
result is treated as deletion.
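The atomic-reservation pattern behind the capacity finding can be shown in miniature with `mkdir`, which, like a create-if-not-exists ConfigMap or Lease, succeeds for exactly one concurrent caller; this is a local stand-in, not the cluster-side fix itself:

```shell
LOCK_DIR="${TMPDIR:-/tmp}/ambient-pr-lock-demo.$$"

# mkdir is atomic: only the first caller creates the directory and proceeds.
if mkdir "$LOCK_DIR" 2>/dev/null; then
  echo "reservation acquired"
  # ...apply the TenantNamespace CR here...
  rmdir "$LOCK_DIR"   # release on cleanup/rollback
else
  echo "reservation held by another run"
  exit 1
fi
```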

In `@docs/internal/developer/local-development/openshift.md`:
- Around line 47-78: The docs' internal-registry steps miss mirroring the
operator-spawned images (vteam_state_sync) and updating operator-config and
ambient-agent-registry references, so runner/state-sync pods will still pull
from Quay; update the mirror loop to include vteam_state_sync and any other
operator-spawned images, ensure deploy steps also replace newName refs inside
components/manifests/overlays/production/kustomization.yaml and the
operator-config and ambient-agent-registry resources (update their image refs to
${REGISTRY_HOST}/ambient-code/*), and verify deploy.sh applies the updated
manifests before restoring kustomization.yaml so operator-created pods use the
internal registry.
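Per that finding, the mirror loop needs `vteam_state_sync` included; a dry-run sketch that only prints the commands (the registry host is a placeholder):

```shell
REGISTRY_HOST="default-route.example.com"   # placeholder; resolve via oc get route
IMAGES="vteam_frontend vteam_backend vteam_operator vteam_public_api \
vteam_claude_runner vteam_api_server vteam_state_sync"

for img in $IMAGES; do
  echo podman tag "localhost/${img}:latest" "${REGISTRY_HOST}/ambient-code/${img}:latest"
  echo podman push "${REGISTRY_HOST}/ambient-code/${img}:latest"
done
```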

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 025dd6a6-1ae1-437d-903d-cfb4facdffed

📥 Commits

Reviewing files that changed from the base of the PR and between 797d7bf and 82b024e.

📒 Files selected for processing (9)
  • .claude/skills/ambient-pr-test/SKILL.md
  • .claude/skills/ambient/SKILL.md
  • .github/workflows/components-build-deploy.yml
  • .github/workflows/pr-e2e-openshift.yml
  • .github/workflows/pr-namespace-cleanup.yml
  • components/pr-test/SPEC.md
  • components/pr-test/install.sh
  • components/pr-test/provision.sh
  • docs/internal/developer/local-development/openshift.md

Comment on lines +206 to +217
      - name: Build and push ${{ matrix.component.name }} (${{ matrix.arch.suffix }}) for pull request
        if: github.event_name == 'pull_request'
        uses: docker/build-push-action@v6
        with:
          context: ${{ matrix.component.context }}
          file: ${{ matrix.component.dockerfile }}
          platforms: ${{ matrix.arch.platform }}
-          push: false
+          push: true
          tags: ${{ matrix.component.image }}:pr-${{ github.event.pull_request.number }}-${{ matrix.arch.suffix }}
          build-args: AMBIENT_VERSION=${{ github.sha }}
          cache-from: type=gha,scope=${{ matrix.component.name }}-${{ matrix.arch.suffix }}
          cache-to: type=gha,mode=max,scope=${{ matrix.component.name }}-${{ matrix.arch.suffix }}

⚠️ Potential issue | 🟠 Major

Publish an immutable PR image tag alongside the mutable pr-<N>-<arch> alias.

This step now pushes a tag that is overwritten on every PR commit, while the downstream workflow_run job checks out a specific head_sha. If two commits overlap, an older E2E run can deploy newer images and report the wrong result.

Suggested change
-          tags: ${{ matrix.component.image }}:pr-${{ github.event.pull_request.number }}-${{ matrix.arch.suffix }}
+          tags: |
+            ${{ matrix.component.image }}:pr-${{ github.event.pull_request.number }}-${{ matrix.arch.suffix }}
+            ${{ matrix.component.image }}:pr-${{ github.event.pull_request.number }}-${{ github.sha }}-${{ matrix.arch.suffix }}

Contributor

I agree with this

Comment on lines +102 to +105
if [[ -n "$BACKEND_HOST" ]]; then
  HEALTH=$(curl -sk "https://${BACKEND_HOST}/health" || true)
  echo "    Backend health: $HEALTH"
fi

⚠️ Potential issue | 🟠 Major

Health verification is non-blocking and unbounded; make it fail-fast with timeouts.

Line 103 uses curl -sk ... || true, so backend failures are ignored and the script can still report success. It also lacks explicit timeout bounds, which can stall CI.

Suggested fix
 if [[ -n "$BACKEND_HOST" ]]; then
-  HEALTH=$(curl -sk "https://${BACKEND_HOST}/health" || true)
+  HEALTH=$(curl -fsS --connect-timeout 5 --max-time 20 \
+    --retry 3 --retry-all-errors "https://${BACKEND_HOST}/health")
   echo "    Backend health: $HEALTH"
+else
+  echo "    ERROR: backend-route host not found"
+  exit 1
 fi

Comment on lines +29 to +43
echo "==> Checking S0.x instance capacity..."
ACTIVE=$(oc get tenantnamespace -n "$CONFIG_NAMESPACE" \
  -l ambient-code/instance-type=s0x --no-headers 2>/dev/null | wc -l | tr -d ' ')

if [ "$ACTIVE" -ge "$MAX_S0X_INSTANCES" ]; then
  echo "ERROR: At capacity — $ACTIVE/$MAX_S0X_INSTANCES S0.x instances active."
  echo "Active instances:"
  oc get tenantnamespace -n "$CONFIG_NAMESPACE" \
    -l ambient-code/instance-type=s0x -o name
  exit 1
fi
echo "    Capacity OK: $ACTIVE/$MAX_S0X_INSTANCES"

echo "==> Applying TenantNamespace CR: $INSTANCE_ID"
cat <<EOF | oc apply -f -

⚠️ Potential issue | 🟠 Major

The capacity gate is racy across concurrent provisions.

This is a classic TOCTOU window: two runs can both read the same active count and then both oc apply, exceeding the configured limit of five S0.x instances. The cap needs an atomic reservation step or a cluster-side lock/quota.


Comment on lines +59 to +74
echo "==> Waiting for namespace ${NAMESPACE} to become Active (timeout: ${READY_TIMEOUT}s)..."
DEADLINE=$((SECONDS + READY_TIMEOUT))
while [ $SECONDS -lt $DEADLINE ]; do
  STATUS=$(oc get namespace "$NAMESPACE" -o jsonpath='{.status.phase}' 2>/dev/null || true)
  if [ "$STATUS" == "Active" ]; then
    echo "    Namespace ${NAMESPACE} is Active."
    echo "$NAMESPACE"
    exit 0
  fi
  echo "    status=${STATUS:-NotFound}, retrying..."
  sleep 3
done

echo "ERROR: Namespace ${NAMESPACE} did not become Active within ${READY_TIMEOUT}s."
oc describe tenantnamespace "$INSTANCE_ID" -n "$CONFIG_NAMESPACE" || true
exit 1

⚠️ Potential issue | 🟠 Major

Wait for the TenantNamespace Ready condition, not only Namespace.phase=Active.

Active only proves the namespace object exists. The tenant operator can still be reconciling quotas/RBAC/finalizers, so the install step can race immediately after this script returns.

Suggested change
   echo "==> Waiting for namespace ${NAMESPACE} to become Active (timeout: ${READY_TIMEOUT}s)..."
   DEADLINE=$((SECONDS + READY_TIMEOUT))
   while [ $SECONDS -lt $DEADLINE ]; do
-    STATUS=$(oc get namespace "$NAMESPACE" -o jsonpath='{.status.phase}' 2>/dev/null || true)
-    if [ "$STATUS" == "Active" ]; then
+    READY=$(oc get tenantnamespace "$INSTANCE_ID" -n "$CONFIG_NAMESPACE" \
+      -o jsonpath='{range .status.conditions[?(@.type=="Ready")]}{.status}{end}' 2>/dev/null || true)
+    STATUS=$(oc get namespace "$NAMESPACE" -o jsonpath='{.status.phase}' 2>/dev/null || true)
+    if [ "$READY" == "True" ] && [ "$STATUS" == "Active" ]; then
       echo "    Namespace ${NAMESPACE} is Active."
       echo "$NAMESPACE"
       exit 0
     fi
-    echo "    status=${STATUS:-NotFound}, retrying..."
+    echo "    ready=${READY:-Unknown} status=${STATUS:-NotFound}, retrying..."
     sleep 3
   done

Comment on lines +82 to +88
echo "==> Waiting for namespace ${NAMESPACE} to be deleted (timeout: ${DELETE_TIMEOUT}s)..."
DEADLINE=$((SECONDS + DELETE_TIMEOUT))
while [ $SECONDS -lt $DEADLINE ]; do
  if ! oc get namespace "$NAMESPACE" &>/dev/null; then
    echo "    Namespace ${NAMESPACE} deleted."
    exit 0
  fi

⚠️ Potential issue | 🟠 Major

Do not treat every oc get namespace failure as successful deletion.

if ! oc get namespace ... returns true for NotFound, auth failures, API timeouts, and transport errors alike. A transient cluster error will be reported as "deleted" and leak the namespace.

Suggested change
   echo "==> Waiting for namespace ${NAMESPACE} to be deleted (timeout: ${DELETE_TIMEOUT}s)..."
   DEADLINE=$((SECONDS + DELETE_TIMEOUT))
   while [ $SECONDS -lt $DEADLINE ]; do
-    if ! oc get namespace "$NAMESPACE" &>/dev/null; then
+    OUTPUT=$(oc get namespace "$NAMESPACE" -o name 2>&1) && RC=0 || RC=$?
+    if [ "$RC" -eq 0 ]; then
+      echo "    Namespace still exists, waiting..."
+    elif grep -q "(NotFound)" <<<"$OUTPUT"; then
       echo "    Namespace ${NAMESPACE} deleted."
       exit 0
+    else
+      echo "    oc get namespace failed: $OUTPUT"
     fi
-    echo "    Namespace still exists, waiting..."
     sleep 5
   done

Comment on lines 47 to 78
-```bash
-oc new-project ambient-code
-```
-
-**MinIO credentials:**
-
-```bash
-oc create secret generic minio-credentials -n ambient-code \
-  --from-literal=root-user=admin \
-  --from-literal=root-password=changeme123
-```
-
-**PostgreSQL credentials (for Unleash feature flag database):**
-
-```bash
-oc create secret generic postgresql-credentials -n ambient-code \
-  --from-literal=db.host="postgresql" \
-  --from-literal=db.port="5432" \
-  --from-literal=db.name="postgres" \
-  --from-literal=db.user="postgres" \
-  --from-literal=db.password="postgres123"
-```
-
-**Unleash credentials (for feature flag service):**
-
-```bash
-oc create secret generic unleash-credentials -n ambient-code \
-  --from-literal=database-url="postgres://postgres:postgres123@postgresql:5432/unleash" \
-  --from-literal=database-ssl="false" \
-  --from-literal=admin-api-token="*:*.unleash-admin-token" \
-  --from-literal=client-api-token="default:development.unleash-client-token" \
-  --from-literal=frontend-api-token="default:development.unleash-frontend-token" \
-  --from-literal=default-admin-password="unleash123"
-```
-
-## Platform Deployment
-
-**Patch kustomization to internal registry, deploy, then revert:**
-
-**⚠️ CRITICAL**: Never commit `kustomization.yaml` while it contains internal registry refs.
+```bash
+REGISTRY_HOST=$(oc get route default-route -n openshift-image-registry \
+  --template='{{ .spec.host }}')
+INTERNAL_REG="image-registry.openshift-image-registry.svc:5000/ambient-code"
+
+for img in vteam_frontend vteam_backend vteam_operator vteam_public_api vteam_claude_runner vteam_api_server; do
+  podman tag localhost/${img}:latest ${REGISTRY_HOST}/ambient-code/${img}:latest
+  podman push ${REGISTRY_HOST}/ambient-code/${img}:latest
+done
+
+oc rollout restart deployment backend-api frontend agentic-operator public-api ambient-api-server -n ambient-code
+```
+
+---
+
+## Deploy with Internal Registry Images
+
+The production kustomization in `components/manifests/overlays/production/kustomization.yaml` references `quay.io/ambient_code/*` images by default. When deploying to an OpenShift cluster using the internal registry, you must temporarily point the image refs at the internal registry, deploy, then **immediately revert** before committing.
+
+**⚠️ CRITICAL**: Never commit `kustomization.yaml` with internal registry refs.
```bash
REGISTRY_HOST=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}')
INTERNAL_REG="image-registry.openshift-image-registry.svc:5000/ambient-code"
REGISTRY_HOST=$(oc get route default-route -n openshift-image-registry \
--template='{{ .spec.host }}')

# Temporarily override image refs to internal registry
cd components/manifests/overlays/production
sed -i "s#newName: quay.io/ambient_code/#newName: ${INTERNAL_REG}/#g" kustomization.yaml
sed -i "s#newName: quay.io/ambient_code/#newName: ${REGISTRY_HOST}/ambient-code/#g" kustomization.yaml

# Deploy
cd ../..
./deploy.sh

# IMMEDIATELY revert — do not commit with internal registry refs
cd overlays/production
git checkout kustomization.yaml
```
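An alternative to remembering the manual revert is a shell `trap` that restores the file on any exit. A minimal sketch, demonstrated on a temp file so it runs anywhere; in the real flow the file would be `kustomization.yaml` and the failing `false` would be `./deploy.sh`:

```bash
FILE=$(mktemp)
echo "newName: quay.io/ambient_code/vteam_frontend" > "$FILE"

(
  cp "$FILE" "${FILE}.orig"
  trap 'mv "${FILE}.orig" "$FILE"' EXIT   # restore on success, error, or Ctrl-C
  sed -i "s#quay.io/ambient_code/#registry.example/ambient-code/#g" "$FILE"
  false   # stand-in for a deploy step that fails partway
) || true

cat "$FILE"   # original content, restored by the trap
```

Because the trap fires on every exit path, the patched refs can never survive past the deploy step.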
Contributor

⚠️ Potential issue | 🟠 Major

This internal-registry flow still leaves the operator-spawned images on the old registry.

Only six images are mirrored here, and vteam_state_sync is missing. Rewriting kustomization.yaml is also not enough for runner pods, because operator-config and ambient-agent-registry still need to point at the internal-registry refs. As written, runner/state-sync changes will continue pulling stale Quay images or fail outright.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/internal/developer/local-development/openshift.md` around lines 47 - 78,
The docs' internal-registry steps miss mirroring the operator-spawned images
(vteam_state_sync) and updating operator-config and ambient-agent-registry
references, so runner/state-sync pods will still pull from Quay; update the
mirror loop to include vteam_state_sync and any other operator-spawned images,
ensure deploy steps also replace newName refs inside
components/manifests/overlays/production/kustomization.yaml and the
operator-config and ambient-agent-registry resources (update their image refs to
${REGISTRY_HOST}/ambient-code/*), and verify deploy.sh applies the updated
manifests before restoring kustomization.yaml so operator-created pods use the
internal registry.

@ktdreyer
Contributor

TODO: Let's identify if/where this runs untrusted code and stamp that out. GH gives us some clues in the PR review here. I also filed #1034 to add another tool in the box.


```yaml
- name: Install kustomize
  run: |
    curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
```

We install kustomize in a few places (eg #768). We should stop doing that and use oc kustomize instead.
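A sketch of that replacement: `oc` bundles kustomize (inherited from kubectl), so the overlay can be rendered and applied without a `curl | bash` install. The overlay path here assumes the layout referenced elsewhere in this PR:

```bash
oc kustomize components/manifests/overlays/production | oc apply -f -
```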

```yaml
run: |
  oc login ${{ secrets.TEST_OPENSHIFT_SERVER }} \
    --token=${{ secrets.TEST_OPENSHIFT_TOKEN }} \
    --insecure-skip-tls-verify
```

I realize we're copying --insecure-skip-tls-verify from other GH Actions, but do we need it? I thought ROSA obviates this.

Contributor Author


good catch. this is from testing and should not be in any full deployment

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (1)
.github/workflows/pr-namespace-cleanup.yml (1)

7-13: ⚠️ Potential issue | 🟠 Major

Add explicit permissions block and checkout main to use trusted scripts.

The workflow lacks a permissions block, which CodeQL has flagged. Additionally, the checkout step should explicitly use ref: main to ensure only trusted repository scripts are executed when running the destroy command with cluster credentials.

Suggested fix
 jobs:
   cleanup:
     runs-on: ubuntu-latest
     if: github.event.pull_request.head.repo.full_name == github.repository
+    permissions:
+      contents: read
     steps:
       - name: Checkout code
         uses: actions/checkout@v6
+        with:
+          ref: main
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/pr-namespace-cleanup.yml around lines 7 - 13, The workflow
job "cleanup" is missing an explicit permissions block and the checkout step
should pin to the trusted branch; add a permissions block under the cleanup job
(for example: permissions: contents: read) and update the Checkout code step
that uses actions/checkout@v6 to include ref: main so the job runs with explicit
repository read permissions and checks out the main branch before executing
destroy/cluster credentials operations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/pr-e2e-openshift.yml:
- Around line 8-10: The concurrency group uses
github.event.workflow_run.pull_requests[0].number which can be null and will
break group evaluation; update the concurrency key to use a safe
conditional/fallback (e.g., use github.event.workflow_run.pull_requests &&
github.event.workflow_run.pull_requests[0] ?
github.event.workflow_run.pull_requests[0].number : github.run_id) so the group
always resolves; modify the line referencing pull_requests[0].number in the
concurrency block to this conditional expression (or an equivalent fallback like
github.run_id or github.sha) to avoid null access before jobs run.
- Around line 158-183: The teardown job's Destroy namespace step uses
needs.setup.outputs.instance_id unguarded, causing the destroy script
(components/pr-test/provision.sh destroy) to receive an empty INSTANCE_ID when
setup failed; update the teardown (or the Destroy namespace step) to skip
running when that output is empty—either by adding an if condition on the step
(e.g., check needs.setup.outputs.instance_id != '') or by guarding the run
command to no-op when INSTANCE_ID is empty before calling the destroy script;
reference the teardown job, the Destroy namespace step, the env VARIABLE
INSTANCE_ID and the components/pr-test/provision.sh destroy invocation when
making the change.
- Around line 37-43: The SLUG_MAX calculation can become zero or negative for
very large PR numbers which makes BRANCH_SLUG empty and leaves a trailing dash
in INSTANCE_ID; update the workflow to clamp SLUG_MAX to a minimum positive
value (e.g., SLUG_MAX=$(( SLUG_MAX < 1 ? 1 : SLUG_MAX ))) or otherwise guard it
the same way as the cleanup workflow, ensure BRANCH_SLUG uses that clamped
value, and build INSTANCE_ID without a trailing dash (e.g., only append
"-${BRANCH_SLUG}" when BRANCH_SLUG is non-empty) so symbols SLUG_MAX, PR_LEN,
BRANCH_SLUG and INSTANCE_ID are adjusted accordingly.

In @.github/workflows/pr-namespace-cleanup.yml:
- Around line 24-28: Clamp SLUG_MAX to a non-negative value after computing it
(ensure SLUG_MAX=$(( SLUG_MAX < 0 ? 0 : SLUG_MAX ))) and then derive
BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"; after that, trim any trailing dashes
from BRANCH_SLUG and guard INSTANCE_ID construction so it doesn't produce a
trailing dash when BRANCH_SLUG is empty (e.g., build INSTANCE_ID by
conditionally appending "-${BRANCH_SLUG}" only if BRANCH_SLUG is non-empty,
using the existing symbols PR_LEN, SLUG_MAX, SAFE_BRANCH, BRANCH_SLUG,
PR_NUMBER, and INSTANCE_ID).

---

Duplicate comments:
In @.github/workflows/pr-namespace-cleanup.yml:
- Around line 7-13: The workflow job "cleanup" is missing an explicit
permissions block and the checkout step should pin to the trusted branch; add a
permissions block under the cleanup job (for example: permissions: contents:
read) and update the Checkout code step that uses actions/checkout@v6 to include
ref: main so the job runs with explicit repository read permissions and checks
out the main branch before executing destroy/cluster credentials operations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5ab0169e-0a5e-44f9-92e8-0fed82edc418

📥 Commits

Reviewing files that changed from the base of the PR and between 82b024e and b7a58c5.

📒 Files selected for processing (2)
  • .github/workflows/pr-e2e-openshift.yml
  • .github/workflows/pr-namespace-cleanup.yml

Comment on lines +8 to +10
```yaml
concurrency:
  group: pr-e2e-openshift-${{ github.event.workflow_run.pull_requests[0].number }}
  cancel-in-progress: true
```

⚠️ Potential issue | 🟡 Minor

Concurrency group may fail when pull_requests[0] is null.

The concurrency key references pull_requests[0].number, but if the workflow_run isn't associated with a PR (e.g., push to main), this could produce an invalid or colliding group name. The setup job guards against this, but the concurrency block is evaluated before job conditions.

Suggested fix
 concurrency:
-  group: pr-e2e-openshift-${{ github.event.workflow_run.pull_requests[0].number }}
+  group: pr-e2e-openshift-${{ github.event.workflow_run.pull_requests[0].number || github.event.workflow_run.id }}
   cancel-in-progress: true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/pr-e2e-openshift.yml around lines 8 - 10, The concurrency
group uses github.event.workflow_run.pull_requests[0].number which can be null
and will break group evaluation; update the concurrency key to use a safe
conditional/fallback (e.g., use github.event.workflow_run.pull_requests &&
github.event.workflow_run.pull_requests[0] ?
github.event.workflow_run.pull_requests[0].number : github.run_id) so the group
always resolves; modify the line referencing pull_requests[0].number in the
concurrency block to this conditional expression (or an equivalent fallback like
github.run_id or github.sha) to avoid null access before jobs run.

Comment on lines +37 to +43
```bash
# Budget: "ambient-code--" (14) + "pr-${PR_NUMBER}-" (4+digits) + slug
# Max namespace = 63. Slug budget = 63 - 14 - 4 - ${#PR_NUMBER}
PR_LEN=${#PR_NUMBER}
SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"

INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
```

⚠️ Potential issue | 🟡 Minor

Apply the same edge-case guard for SLUG_MAX as in the cleanup workflow.

Same issue as pr-namespace-cleanup.yml: SLUG_MAX could be zero/negative for very large PR numbers, and an empty BRANCH_SLUG would leave a trailing dash in INSTANCE_ID.

Suggested fix
           PR_LEN=${#PR_NUMBER}
           SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
+          [[ $SLUG_MAX -lt 0 ]] && SLUG_MAX=0
           BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"

-          INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
+          if [[ -n "$BRANCH_SLUG" ]]; then
+            INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
+          else
+            INSTANCE_ID="pr-${PR_NUMBER}"
+          fi
           NAMESPACE="ambient-code--${INSTANCE_ID}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/pr-e2e-openshift.yml around lines 37 - 43, The SLUG_MAX
calculation can become zero or negative for very large PR numbers which makes
BRANCH_SLUG empty and leaves a trailing dash in INSTANCE_ID; update the workflow
to clamp SLUG_MAX to a minimum positive value (e.g., SLUG_MAX=$(( SLUG_MAX < 1 ?
1 : SLUG_MAX ))) or otherwise guard it the same way as the cleanup workflow,
ensure BRANCH_SLUG uses that clamped value, and build INSTANCE_ID without a
trailing dash (e.g., only append "-${BRANCH_SLUG}" when BRANCH_SLUG is
non-empty) so symbols SLUG_MAX, PR_LEN, BRANCH_SLUG and INSTANCE_ID are adjusted
accordingly.
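The clamp-and-trim logic from both suggestions can be exercised locally; `build_instance_id` is an illustrative wrapper, not a symbol in the workflow:

```bash
build_instance_id() {
  PR_NUMBER="$1"; SAFE_BRANCH="$2"
  PR_LEN=${#PR_NUMBER}
  # Budget: "ambient-code--" (14) + "pr-" plus trailing "-" (4) + PR digits.
  SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
  [ "$SLUG_MAX" -lt 0 ] && SLUG_MAX=0
  BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"
  # Trim any trailing dashes left by truncation.
  while [ "${BRANCH_SLUG%-}" != "$BRANCH_SLUG" ]; do
    BRANCH_SLUG="${BRANCH_SLUG%-}"
  done
  if [ -n "$BRANCH_SLUG" ]; then
    echo "pr-${PR_NUMBER}-${BRANCH_SLUG}"
  else
    echo "pr-${PR_NUMBER}"
  fi
}

build_instance_id 1032 test-my-pr   # → pr-1032-test-my-pr
```

The empty-slug branch guarantees no trailing dash ever reaches the namespace name.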

Comment on lines +158 to +183
```yaml
teardown:
  needs: [setup, provision, e2e]
  if: always()
  runs-on: ubuntu-latest
  permissions:
    contents: read
  steps:
    - name: Checkout main (trusted scripts only)
      uses: actions/checkout@v6
      with:
        ref: main

    - name: Install oc
      uses: redhat-actions/oc-installer@v1
      with:
        oc_version: 'latest'

    - name: Log in to OpenShift
      run: |
        oc login "${{ secrets.TEST_OPENSHIFT_SERVER }}" \
          --token="${{ secrets.TEST_OPENSHIFT_TOKEN }}"

    - name: Destroy namespace
      env:
        INSTANCE_ID: ${{ needs.setup.outputs.instance_id }}
      run: bash components/pr-test/provision.sh destroy "$INSTANCE_ID"
```

⚠️ Potential issue | 🟡 Minor

Guard teardown against missing instance_id when upstream jobs fail.

With if: always(), the teardown runs even if setup fails or is skipped, in which case needs.setup.outputs.instance_id will be empty. The destroy script may behave unexpectedly with an empty argument.

Suggested fix
   teardown:
     needs: [setup, provision, e2e]
-    if: always()
+    if: always() && needs.setup.outputs.instance_id != ''
     runs-on: ubuntu-latest

Alternatively, add a guard in the destroy step:

       - name: Destroy namespace
         env:
           INSTANCE_ID: ${{ needs.setup.outputs.instance_id }}
-        run: bash components/pr-test/provision.sh destroy "$INSTANCE_ID"
+        run: |
+          if [[ -n "$INSTANCE_ID" ]]; then
+            bash components/pr-test/provision.sh destroy "$INSTANCE_ID"
+          else
+            echo "No instance_id available; skipping teardown"
+          fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/pr-e2e-openshift.yml around lines 158 - 183, The teardown
job's Destroy namespace step uses needs.setup.outputs.instance_id unguarded,
causing the destroy script (components/pr-test/provision.sh destroy) to receive
an empty INSTANCE_ID when setup failed; update the teardown (or the Destroy
namespace step) to skip running when that output is empty—either by adding an if
condition on the step (e.g., check needs.setup.outputs.instance_id != '') or by
guarding the run command to no-op when INSTANCE_ID is empty before calling the
destroy script; reference the teardown job, the Destroy namespace step, the env
VARIABLE INSTANCE_ID and the components/pr-test/provision.sh destroy invocation
when making the change.

Comment on lines +24 to +28
```bash
PR_LEN=${#PR_NUMBER}
SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"

INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
```

⚠️ Potential issue | 🟡 Minor

Guard against edge cases: negative SLUG_MAX or trailing dash.

For very high PR numbers, SLUG_MAX could become zero or negative, and if BRANCH_SLUG ends up empty, INSTANCE_ID would have a trailing dash (pr-123-). Consider clamping and trimming:

Suggested fix
           PR_LEN=${#PR_NUMBER}
           SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
+          [[ $SLUG_MAX -lt 0 ]] && SLUG_MAX=0
           BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"

-          INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
+          if [[ -n "$BRANCH_SLUG" ]]; then
+            INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
+          else
+            INSTANCE_ID="pr-${PR_NUMBER}"
+          fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/pr-namespace-cleanup.yml around lines 24 - 28, Clamp
SLUG_MAX to a non-negative value after computing it (ensure SLUG_MAX=$((
SLUG_MAX < 0 ? 0 : SLUG_MAX ))) and then derive
BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"; after that, trim any trailing dashes
from BRANCH_SLUG and guard INSTANCE_ID construction so it doesn't produce a
trailing dash when BRANCH_SLUG is empty (e.g., build INSTANCE_ID by
conditionally appending "-${BRANCH_SLUG}" only if BRANCH_SLUG is non-empty,
using the existing symbols PR_LEN, SLUG_MAX, SAFE_BRANCH, BRANCH_SLUG,
PR_NUMBER, and INSTANCE_ID).

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

♻️ Duplicate comments (2)
.claude/skills/ambient/SKILL.md (2)

58-88: ⚠️ Potential issue | 🟠 Major

Do not publish shared default credentials in a skill covering production installs.

This issue was previously flagged and remains unresolved. The commands hardcode predictable values for MinIO (changeme123), PostgreSQL (postgres123), Unleash (fixed admin tokens), and GitHub state secret (test-state-secret). Since this skill explicitly covers "production, ephemeral PR test instances, or developer clusters" (lines 5–6), following it verbatim would give every installation the same credentials.

Replace hardcoded values with placeholders (e.g., <ROOT_PASSWORD>, <DB_PASSWORD>, <ADMIN_TOKEN>) and add a note instructing users to generate strong, unique secrets (or source them from a secret manager) before running the commands.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/ambient/SKILL.md around lines 58 - 88, Replace all hardcoded
secret values in the kustomize secret creation examples (the secret names
minio-credentials, postgresql-credentials, unleash-credentials,
github-app-secret) with clearly labelled placeholders (e.g., <ROOT_PASSWORD>,
<DB_PASSWORD>, <ADMIN_TOKEN>, <GITHUB_STATE_SECRET>) and update the text to
instruct users to generate strong, unique secrets or pull them from a secret
manager before applying the overlay; ensure the examples do not contain any
fixed/default credentials and add a single-line note above the commands
explaining to replace placeholders with secure values or reference a secret
store.

203-211: ⚠️ Potential issue | 🟠 Major

Wait for public-api and ambient-api-server-db before Step 6.

This issue was previously flagged and remains unresolved. Step 6 immediately uses public-api-route (line 281) and execs into ambient-api-server-db (line 258), but the rollout gate never waits for those deployments. On a slow cluster the documented verification flow is flaky.

Add public-api to the loop and add an explicit wait for ambient-api-server-db:

🛡️ Proposed fix

```diff
-for deploy in backend-api frontend agentic-operator postgresql minio unleash; do
+for deploy in backend-api frontend agentic-operator postgresql minio unleash public-api; do
   oc rollout status deployment/$deploy -n $NAMESPACE --timeout=300s
 done
```

ambient-api-server and ambient-api-server-db may take longer due to DB init:

```diff
+oc rollout status deployment/ambient-api-server-db -n $NAMESPACE --timeout=300s
 oc rollout status deployment/ambient-api-server -n $NAMESPACE --timeout=300s
```
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@.claude/skills/ambient/SKILL.md` around lines 203 - 211, The deployment wait
loop currently iterates "backend-api frontend agentic-operator postgresql minio
unleash" but omits the public API and the DB init; update the loop that declares
deploys (the line with for deploy in ...) to include public-api, and add an
explicit rollout wait for ambient-api-server-db (before the existing
ambient-api-server wait) so Step 6's usage of public-api-route and the
subsequent exec into ambient-api-server-db will block until those deployments
are ready; target the symbolic names public-api and ambient-api-server-db when
adding the waits.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/ambient/SKILL.md:

  • Line 292: The command oc policy add-role-to-group system:image-puller system:serviceaccounts --namespace=$NAMESPACE is too permissive; change it to
    grant image-pull only to the specific runner namespace or service account(s)
    instead of the cluster-wide system:serviceaccounts group — e.g., replace that
    call with a namespace-scoped binding that targets the runner service account
    (oc policy add-role-to-user system:image-puller system:serviceaccount:<runner-namespace>:<sa-name> --namespace=$NAMESPACE) or
    implement a RoleBinding per target namespace for the system:image-puller role
    so only intended namespaces/users can pull from $NAMESPACE.
  • Line 103: The placeholder path /path/to/service-account-key.json should be
    replaced with a real, downloaded service account key file: instruct the user to
    create or select a service account in GCP Console (IAM & Admin > Service
    Accounts), click "Keys" → "Add Key" → "Create new key" and download the JSON,
    save it to a secure local path (e.g., ./keys/ambient-service-account.json), and
    then update the --from-file flag to point to that real file name instead of the
    placeholder; include a note to keep the JSON secure and to follow the project's
    secrets/storage guidelines.
  • Around line 318-321: Update the "JWT Configuration" section to include a
    concrete example for ephemeral/non‑prod namespaces: show how to patch the
    ambient-api-server container args (referencing
    ambient-api-server-jwt-args-patch.yaml) to either disable validation (e.g.,
    override args to include a --disable-jwt-validation flag or remove the
    --jwk-cert-url/--issuer args) or point to a test issuer by setting
    --jwk-cert-url to a known test JWKS URL; include the exact arg keys to change
    (--jwk-cert-url, --issuer, --disable-jwt-validation), a short example
    description of patching the Deployment/DaemonSet container args for the
    ambient-api-server, and note to revert for production.
  • Line 238: Update the sentence that says the frontend shows "2/2" to clarify
    that this only occurs when the oauth-proxy sidecar is present (e.g., when using
    the production overlay); mention that installs via install.sh or custom overlays
    without oauth-proxy will show "1/1" for the frontend pod. Reference the terms
    "frontend" pod, "production overlay", "oauth-proxy sidecar", and "install.sh" in
    the rewording so readers know the difference in expected readiness counts
    depending on whether oauth-proxy is configured.

In @components/pr-test/README.md:

  • Line 187: Update the README to document the security implications of changing
    push: false → push: true in components-build-deploy.yml's PR build step: explain
    that enabling image push causes PR builds to push ephemeral images tagged
    pr-- (overwritten per commit), call out that the Quay registry
    namespace must have strict access controls and review who can push/read images,
    advise evaluating whether untrusted contributors should be allowed to push to
    the shared registry, and add a cross-reference to the security model/trust
    boundaries section for further guidance.
  • Around line 1-246: Add a new "Security Model" section to the specification
    that states where untrusted PR code runs and how it's isolated: describe
    isolation boundaries for each ephemeral instance (namespace
    ambient-code--) including network policies and the "internal"
    security-zone, resource quotas and limits (referencing MAX_S0X_INSTANCES,
    READY_TIMEOUT, DELETE_TIMEOUT), RBAC restrictions and that the
    ambient-code/instance-type=s0x label is used for counting only, secrets handling
    (namespace-scoped secrets, limited secret injection, cleanup on teardown),
    mitigation controls for malicious builds (GitHub Actions runner isolation, push:
    true implications for PR image builds, e.g. push only from trusted runners,
    build-time sandboxing), and explicit notes about what components run untrusted
    code vs trusted platform code (provisioner.create/destroy, provision.sh,
    installer scripts). Ensure the section references TenantNamespace CR
    creation/deletion and the provisioner workflow so reviewers can see where to
    apply network policies and RBAC to prevent cross-namespace access.
  • Line 220: Update the file list reference that still points to "SPEC.md" so it
    matches the renamed file by replacing the "SPEC.md" entry with "README.md"
    (search for the string "SPEC.md" in components/pr-test/README.md and update it
    to "README.md"); ensure any surrounding text or links that referenced SPEC.md
    are adjusted to point to README.md as well.

Duplicate comments:
In @.claude/skills/ambient/SKILL.md:

  • Around line 58-88: Replace all hardcoded secret values in the kustomize secret
    creation examples (the secret names minio-credentials, postgresql-credentials,
    unleash-credentials, github-app-secret) with clearly labelled placeholders
    (e.g., <ROOT_PASSWORD>, <DB_PASSWORD>, <ADMIN_TOKEN>, <GITHUB_STATE_SECRET>) and
    update the text to instruct users to generate strong, unique secrets or pull
    them from a secret manager before applying the overlay; ensure the examples do
    not contain any fixed/default credentials and add a single-line note above the
    commands explaining to replace placeholders with secure values or reference a
    secret store.
  • Around line 203-211: The deployment wait loop currently iterates "backend-api
    frontend agentic-operator postgresql minio unleash" but omits the public API and
    the DB init; update the loop that declares deploys (the line with for deploy in
    ...) to include public-api, and add an explicit rollout wait for
    ambient-api-server-db (before the existing ambient-api-server wait) so Step 6's
    usage of public-api-route and the subsequent exec into ambient-api-server-db
    will block until those deployments are ready; target the symbolic names
    public-api and ambient-api-server-db when adding the waits.


---

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 86f2be1c-2bd0-4140-8dda-16106d5abda3

📥 Commits

Reviewing files that changed from the base of the PR and between b7a58c5cd33742cc19586c0bef4900ed2b361a5e and 2eb2465302c9a4a6c86475f453f1544940030396.

📒 Files selected for processing (2)
  • .claude/skills/ambient/SKILL.md
  • components/pr-test/README.md



```bash
oc create secret generic ambient-vertex -n $NAMESPACE \
  --from-file=ambient-code-key.json=/path/to/service-account-key.json
```

⚠️ Potential issue | 🟡 Minor

Replace placeholder path with actionable guidance.

The placeholder /path/to/service-account-key.json is not actionable. Provide explicit instructions on how to obtain the service account key file (e.g., "Download from GCP Console > IAM & Admin > Service Accounts") or reference documentation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/ambient/SKILL.md at line 103, The placeholder path
`/path/to/service-account-key.json` should be replaced with a real, downloaded
service account key file: instruct the user to create or select a service
account in GCP Console (IAM & Admin > Service Accounts), click "Keys" → "Add
Key" → "Create new key" and download the JSON, save it to a secure local path
(e.g., ./keys/ambient-service-account.json), and then update the --from-file
flag to point to that real file name instead of the placeholder; include a note
to keep the JSON secure and to follow the project's secrets/storage guidelines.

```
unleash-xxxxx 1/1 Running 0
```

Frontend shows `2/2` because of the oauth-proxy sidecar in the production overlay.

🧹 Nitpick | 🔵 Trivial

Clarify oauth-proxy sidecar context.

The note assumes the production overlay is always used, but this skill covers "production, ephemeral PR test instances, or developer clusters" (lines 5–6). If users deploy via the scripted installer (install.sh) or a custom overlay without oauth-proxy, the frontend pod would show 1/1 instead of 2/2, potentially confusing readers.

Rephrase to indicate this applies when using the production overlay or when oauth-proxy is configured.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/ambient/SKILL.md at line 238, Update the sentence that says
the frontend shows "2/2" to clarify that this only occurs when the oauth-proxy
sidecar is present (e.g., when using the production overlay); mention that
installs via install.sh or custom overlays without oauth-proxy will show "1/1"
for the frontend pod. Reference the terms "frontend" pod, "production overlay",
"oauth-proxy sidecar", and "install.sh" in the rewording so readers know the
difference in expected readiness counts depending on whether oauth-proxy is
configured.

The operator creates runner pods in dynamically-created project namespaces. Those pods pull images from quay.io directly — no cross-namespace image access issue with quay. However, if you're using the OpenShift internal registry, grant pull access:

```bash
oc policy add-role-to-group system:image-puller system:serviceaccounts --namespace=$NAMESPACE
```

⚠️ Potential issue | 🟠 Major

Overly permissive cross-namespace image pull policy.

The command grants system:image-puller to the cluster-wide system:serviceaccounts group, allowing any pod in any namespace to pull images from $NAMESPACE. On a shared cluster, this is a security posture gap.

Instead, grant access only to the specific namespaces where runner pods will be created. If the operator creates runner pods in dynamically-named project namespaces, scope the role binding per namespace or use a more targeted group.

Alternative approach

If you know the target namespace pattern (e.g., `ambient-code--pr-*`), grant access to specific service accounts or use a namespace-scoped RoleBinding instead of cluster-wide group membership:

```bash
# Example: grant pull access only to service accounts in a specific namespace
oc policy add-role-to-user system:image-puller \
  system:serviceaccount:<runner-namespace>:default \
  --namespace=$NAMESPACE
```

Or document that quay.io images avoid this issue entirely (as noted in lines 289–290).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/ambient/SKILL.md at line 292, The command `oc policy
add-role-to-group system:image-puller system:serviceaccounts
--namespace=$NAMESPACE` is too permissive; change it to grant image-pull only to
the specific runner namespace or service account(s) instead of the cluster-wide
`system:serviceaccounts` group — e.g., replace that call with a namespace-scoped
binding that targets the runner service account (`oc policy add-role-to-user
system:image-puller system:serviceaccount:<runner-namespace>:<sa-name>
--namespace=$NAMESPACE`) or implement a RoleBinding per target namespace for the
`system:image-puller` role so only intended namespaces/users can pull from
`$NAMESPACE`.

Comment on lines +318 to +321
### JWT Configuration

Production uses Red Hat SSO JWKS (`--jwk-cert-url=https://sso.redhat.com/...`). For ephemeral test instances, JWT validation may need to be disabled or pointed at a different issuer. Check the `ambient-api-server-jwt-args-patch.yaml` in the production overlay and adjust as needed for non-production contexts.


🧹 Nitpick | 🔵 Trivial

Provide concrete JWT configuration guidance for non-production contexts.

The note mentions that ephemeral test instances "may need to be disabled or pointed at a different issuer" but does not explain how. Users following this skill for PR test instances will hit JWT validation errors and have no clear remediation path.

Add a concrete example showing how to disable JWT validation or configure an alternative issuer for ephemeral namespaces (e.g., patching the deployment args or referencing the ambient-api-server-jwt-args-patch.yaml).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/ambient/SKILL.md around lines 318 - 321, Update the "JWT
Configuration" section to include a concrete example for ephemeral/non‑prod
namespaces: show how to patch the ambient-api-server container args (referencing
ambient-api-server-jwt-args-patch.yaml) to either disable validation (e.g.,
override args to include a --disable-jwt-validation flag or remove the
--jwk-cert-url/--issuer args) or point to a test issuer by setting
--jwk-cert-url to a known test JWKS URL; include the exact arg keys to change
(--jwk-cert-url, --issuer, --disable-jwt-validation), a short example
description of patching the Deployment/DaemonSet container args for the
ambient-api-server, and note to revert for production.
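As a sketch only (not the platform's confirmed configuration), an ephemeral overlay could carry a variant of that patch which replaces the container args without the JWT flags. Note that a strategic-merge patch overwrites the whole `args` array, so the full production arg list must be copied in:

```yaml
# Hypothetical ephemeral variant of ambient-api-server-jwt-args-patch.yaml.
# Copy the production args here, omitting --jwk-cert-url=... (and any issuer flag);
# a strategic-merge patch replaces the entire args list, not individual entries.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ambient-api-server
spec:
  template:
    spec:
      containers:
        - name: ambient-api-server
          args: []   # replace with the production args minus the JWT validation flags
```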

markturansky added a commit that referenced this pull request Mar 25, 2026
- Remove -k (insecure TLS) from curl in install.sh and SKILL.md
- Add ref: main and permissions: contents: read to pr-namespace-cleanup.yml
- Replace hardcoded example credentials with <PLACEHOLDER> tokens in SKILL.md
- Add public-api and ambient-api-server-db to rollout waits in install.sh and SKILL.md

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (6)
components/pr-test/install.sh (3)

105-108: ⚠️ Potential issue | 🟠 Major

Health verification is non-blocking; consider making it fail-fast with timeouts.

Line 106 uses curl -s ... || true, so backend failures are silently ignored and the script reports success regardless. Consider adding timeouts and failing on errors for better CI signal.

🔧 Suggested fix

```diff
 if [[ -n "$BACKEND_HOST" ]]; then
-  HEALTH=$(curl -s "https://${BACKEND_HOST}/health" || true)
+  if HEALTH=$(curl -fsS --connect-timeout 5 --max-time 20 \
+    --retry 3 --retry-all-errors "https://${BACKEND_HOST}/health"); then
+    echo "    Backend health: $HEALTH"
+  else
+    echo "    WARNING: Backend health check failed"
+  fi
-  echo "    Backend health: $HEALTH"
+else
+  echo "    WARNING: backend-route host not found"
 fi
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/pr-test/install.sh` around lines 105 - 108, The health check
currently silences failures by using `curl -s ... || true`; update the `curl`
invocation that populates `HEALTH` (when `BACKEND_HOST` is set) to use fail-fast
options (e.g., `--fail` plus `--max-time`/`--connect-timeout`) and remove the
`|| true` so non-2xx responses or timeouts cause a non-zero exit; if the `curl`
call fails, print the error and exit the script (or return a non-zero status)
instead of continuing, ensuring the `HEALTH` check fails the CI when the backend
is unreachable.
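The fail-fast pattern is easy to exercise standalone. A minimal sketch, assuming a hypothetical `check_health` helper and a deliberately dead endpoint (`127.0.0.1:1`) to trigger the failure branch; the real script would target `https://${BACKEND_HOST}/health`:

```shell
# Hypothetical fail-fast probe; helper name and endpoint are illustration only.
check_health() {
  host="$1"
  if HEALTH=$(curl -fsS --connect-timeout 2 --max-time 5 "http://${host}/health"); then
    echo "Backend health: $HEALTH"
  else
    echo "WARNING: backend health check failed for ${host}" >&2
    return 1
  fi
}

# Port 1 is almost always closed, so this exercises the failure branch:
check_health "127.0.0.1:1" || echo "probe exited non-zero"
```

Because `-f` makes curl fail on HTTP errors and the `|| true` is gone, a dead backend now propagates a non-zero status to CI instead of reporting success.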

52-59: ⚠️ Potential issue | 🔴 Critical

Update both tagged and untagged image mappings to ensure PR isolation.

kustomize edit set image matches by exact name field. If the production overlay contains untagged entries (e.g., quay.io/ambient_code/vteam_frontend without :latest), they won't be updated and could resolve to latest instead of the PR-specific tag.

🔧 Proposed fix - add untagged mappings

```diff
 kustomize edit set image \
+  "quay.io/ambient_code/vteam_frontend=quay.io/ambient_code/vteam_frontend:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_frontend:latest=quay.io/ambient_code/vteam_frontend:${IMAGE_TAG}" \
+  "quay.io/ambient_code/vteam_backend=quay.io/ambient_code/vteam_backend:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_backend:latest=quay.io/ambient_code/vteam_backend:${IMAGE_TAG}" \
+  "quay.io/ambient_code/vteam_operator=quay.io/ambient_code/vteam_operator:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_operator:latest=quay.io/ambient_code/vteam_operator:${IMAGE_TAG}" \
+  "quay.io/ambient_code/vteam_claude_runner=quay.io/ambient_code/vteam_claude_runner:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_claude_runner:latest=quay.io/ambient_code/vteam_claude_runner:${IMAGE_TAG}" \
+  "quay.io/ambient_code/vteam_state_sync=quay.io/ambient_code/vteam_state_sync:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_state_sync:latest=quay.io/ambient_code/vteam_state_sync:${IMAGE_TAG}" \
+  "quay.io/ambient_code/vteam_api_server=quay.io/ambient_code/vteam_api_server:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_api_server:latest=quay.io/ambient_code/vteam_api_server:${IMAGE_TAG}" \
+  "quay.io/ambient_code/vteam_public_api=quay.io/ambient_code/vteam_public_api:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_public_api:latest=quay.io/ambient_code/vteam_public_api:${IMAGE_TAG}"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/pr-test/install.sh` around lines 52 - 59, kustomize edit set image
only updates images that exactly match the name given, so add duplicate untagged
mappings alongside the existing tagged ones in the kustomize command (the call
to kustomize edit set image) for each image (e.g., add mappings for
quay.io/ambient_code/vteam_frontend, vteam_backend, vteam_operator,
vteam_claude_runner, vteam_state_sync, vteam_api_server, vteam_public_api
without :latest) so they are also set to
quay.io/ambient_code/<image>:${IMAGE_TAG}; update the same command block that
currently lists the ":latest" entries to include the corresponding untagged
entries to ensure PR isolation.

35-37: ⚠️ Potential issue | 🔴 Critical

Do not apply cluster-scoped CRDs/RBAC from PR branch content.

Applying cluster-scoped manifests from unmerged PR code allows ephemeral test jobs to mutate cluster-wide resources. This should be handled by a trusted/mainline pipeline or pre-provisioned, not during PR install.

🛡️ Suggested hardening

```diff
-echo "==> Step 1: Applying CRDs and RBAC (cluster-scoped, idempotent)"
-oc apply -k "$REPO_ROOT/components/manifests/base/crds/"
-oc apply -k "$REPO_ROOT/components/manifests/base/rbac/"
+if [[ "${APPLY_CLUSTER_SCOPED:-0}" == "1" ]]; then
+  echo "==> Step 1: Applying CRDs and RBAC (cluster-scoped, explicit opt-in)"
+  oc apply -k "$REPO_ROOT/components/manifests/base/crds/"
+  oc apply -k "$REPO_ROOT/components/manifests/base/rbac/"
+else
+  echo "==> Step 1: Skipping cluster-scoped apply (expected pre-installed)"
+fi
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/pr-test/install.sh` around lines 35 - 37, The script is applying
cluster-scoped CRDs/RBAC from PR branch content via the two oc apply -k commands
(the invocations that reference "$REPO_ROOT/components/manifests/base/crds/" and
".../rbac/"), which must not run in ephemeral PR installs; change the script to
skip these applies unless running in a trusted/mainline context by guarding them
with an explicit environment flag (e.g. MAINLINE_PIPELINE or
SKIP_CLUSTER_MANIFESTS) or CI variable check, and when skipped, print a clear
message that cluster-scoped resources are omitted; ensure the logic surrounds
the oc apply -k calls (not other local-only manifests) so PR jobs never mutate
cluster-wide resources.
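The suggested opt-in guard can be sanity-checked in isolation. A sketch with the `oc apply -k` calls stubbed out by `echo` so the branching runs anywhere (`APPLY_CLUSTER_SCOPED` is the flag name assumed in the hardening suggestion, not an existing variable in install.sh):

```shell
# Control-flow sketch of the opt-in guard; real cluster mutations are stubbed.
apply_cluster_scoped() {
  if [ "${APPLY_CLUSTER_SCOPED:-0}" = "1" ]; then
    echo "applying CRDs and RBAC"   # would be: oc apply -k .../crds/ and .../rbac/
  else
    echo "skipping cluster-scoped apply (expected pre-installed)"
  fi
}

APPLY_CLUSTER_SCOPED=0 apply_cluster_scoped
APPLY_CLUSTER_SCOPED=1 apply_cluster_scoped
```

Defaulting the flag to `0` means PR jobs never mutate cluster-scoped resources unless a trusted pipeline opts in explicitly.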
components/pr-test/README.md (1)

1-11: ⚠️ Potential issue | 🟠 Major

Add a security model section addressing untrusted PR code execution.

Per the PR discussion (ktdreyer's comment), identify where untrusted PR code might run and document mitigation strategies. Key points to cover:

  • Isolation boundaries (namespace-scoped, network policies via security-zone: internal)
  • What executes PR code vs. trusted platform code (provision.sh/install.sh run from main branch)
  • Secrets management (namespace-scoped, cleaned up on teardown)
  • The security implication of enabling push: true for PR builds (line 189)

This helps reviewers understand the trust model for ephemeral test instances.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/pr-test/README.md` around lines 1 - 11, Add a new "Security model"
section to the README describing where untrusted PR code may run and
mitigations: specify that S0.x instances are namespace-scoped and isolated with
network policies (use label/security-zone: internal), list which artifacts are
executed from trusted main branch (provision.sh and install.sh) vs what comes
from the PR branch, explain secrets handling (namespace-scoped secrets, scrubbed
on teardown, no cluster-wide secrets exposed), and call out the security
implications and required review when enabling the push: true flag for PR builds
so reviewers know the increased risk and required controls.
.claude/skills/ambient/SKILL.md (1)

290-294: ⚠️ Potential issue | 🟠 Major

Overly permissive cross-namespace image pull policy.

The command grants system:image-puller to the cluster-wide system:serviceaccounts group, allowing any pod in any namespace to pull images from $NAMESPACE. On a shared cluster this is a security posture gap.

Consider granting access only to specific runner namespaces or service accounts, or note that quay.io images (which this skill primarily uses) avoid this issue entirely.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/ambient/SKILL.md around lines 290 - 294, The documented oc
command grants system:image-puller to the cluster-wide group
system:serviceaccounts which is overly permissive; change the guidance to
recommend granting pull access only to the specific runner namespace or specific
service accounts (e.g., bind system:image-puller to
system:serviceaccounts:<runner-namespace> or to a named ServiceAccount) instead
of the entire cluster group, and/or note that using quay.io avoids the internal
registry cross-namespace pull requirement; reference the existing command and
the principals system:image-puller and system:serviceaccounts and replace the
example with a scoped approach (per-namespace rolebinding or
role-to-serviceaccount) or an explicit note recommending scoped access only.
.github/workflows/pr-namespace-cleanup.yml (1)

28-34: ⚠️ Potential issue | 🟡 Minor

Guard against edge case where SLUG_MAX becomes negative.

For very high PR numbers (e.g., 7+ digits), SLUG_MAX could become zero or negative, and if BRANCH_SLUG ends up empty, INSTANCE_ID would have a trailing dash. Also, $GITHUB_OUTPUT should be quoted per shellcheck SC2086.

🛡️ Proposed fix

```diff
           PR_LEN=${#PR_NUMBER}
           SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
+          [[ $SLUG_MAX -lt 0 ]] && SLUG_MAX=0
           BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"

-          INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
-          echo "instance_id=${INSTANCE_ID}" >> $GITHUB_OUTPUT
+          if [[ -n "$BRANCH_SLUG" ]]; then
+            INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
+          else
+            INSTANCE_ID="pr-${PR_NUMBER}"
+          fi
+          echo "instance_id=${INSTANCE_ID}" >> "$GITHUB_OUTPUT"
           echo "Cleaning up instance: ${INSTANCE_ID}"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/pr-namespace-cleanup.yml around lines 28 - 34, Guard
SLUG_MAX against negative values and avoid a trailing dash when BRANCH_SLUG is
empty: after computing SLUG_MAX (from PR_LEN) clamp it to zero if negative
(e.g., if [ "$SLUG_MAX" -lt 0 ]; then SLUG_MAX=0; fi), then derive BRANCH_SLUG
from SAFE_BRANCH as before; build INSTANCE_ID conditionally so it omits the dash
when BRANCH_SLUG is empty (e.g., if [ -n "$BRANCH_SLUG" ]; then
INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"; else
INSTANCE_ID="pr-${PR_NUMBER}"; fi); and quote the GITHUB_OUTPUT variable when
writing (use >> "$GITHUB_OUTPUT"). Use the symbols PR_LEN, SLUG_MAX,
SAFE_BRANCH, BRANCH_SLUG, INSTANCE_ID and GITHUB_OUTPUT to locate the changes.
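Pulled out as a standalone snippet with sample values (the `63 - 14 - 4` budget is taken from the workflow; the PR number and branch are illustrative), the clamped computation behaves like this:

```shell
# Standalone sketch of the clamped slug logic with sample inputs.
PR_NUMBER="1032"
SAFE_BRANCH="test-my-pr"

PR_LEN=${#PR_NUMBER}
SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
if [ "$SLUG_MAX" -lt 0 ]; then SLUG_MAX=0; fi   # clamp: very long PR numbers would go negative
BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"

if [ -n "$BRANCH_SLUG" ]; then
  INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"  # normal case
else
  INSTANCE_ID="pr-${PR_NUMBER}"                 # no trailing dash when the slug is empty
fi
echo "$INSTANCE_ID"
# → pr-1032-test-my-pr
```

Without the clamp, a negative `SLUG_MAX` would change the meaning of the substring expansion; with it, the worst case degrades gracefully to `pr-<N>`.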
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In @.claude/skills/ambient/SKILL.md:
- Around line 290-294: The documented oc command grants system:image-puller to
the cluster-wide group system:serviceaccounts which is overly permissive; change
the guidance to recommend granting pull access only to the specific runner
namespace or specific service accounts (e.g., bind system:image-puller to
system:serviceaccounts:<runner-namespace> or to a named ServiceAccount) instead
of the entire cluster group, and/or note that using quay.io avoids the internal
registry cross-namespace pull requirement; reference the existing command and
the principals system:image-puller and system:serviceaccounts and replace the
example with a scoped approach (per-namespace rolebinding or
role-to-serviceaccount) or an explicit note recommending scoped access only.

In @.github/workflows/pr-namespace-cleanup.yml:
- Around line 28-34: Guard SLUG_MAX against negative values and avoid a trailing
dash when BRANCH_SLUG is empty: after computing SLUG_MAX (from PR_LEN) clamp it
to zero if negative (e.g., if [ "$SLUG_MAX" -lt 0 ]; then SLUG_MAX=0; fi), then
derive BRANCH_SLUG from SAFE_BRANCH as before; build INSTANCE_ID conditionally
so it omits the dash when BRANCH_SLUG is empty (e.g., if [ -n "$BRANCH_SLUG" ];
then INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"; else
INSTANCE_ID="pr-${PR_NUMBER}"; fi); and quote the GITHUB_OUTPUT variable when
writing (use >> "$GITHUB_OUTPUT"). Use the symbols PR_LEN, SLUG_MAX,
SAFE_BRANCH, BRANCH_SLUG, INSTANCE_ID and GITHUB_OUTPUT to locate the changes.

In `@components/pr-test/install.sh`:
- Around line 105-108: The health check currently silences failures by using
`curl -s ... || true`; update the `curl` invocation that populates `HEALTH`
(when `BACKEND_HOST` is set) to use fail-fast options (e.g., `--fail` plus
`--max-time`/`--connect-timeout`) and remove the `|| true` so non-2xx responses
or timeouts cause a non-zero exit; if the `curl` call fails, print the error and
exit the script (or return a non-zero status) instead of continuing, ensuring
the `HEALTH` check fails the CI when the backend is unreachable.
- Around line 52-59: kustomize edit set image only updates images that exactly
match the name given, so add duplicate untagged mappings alongside the existing
tagged ones in the kustomize command (the call to kustomize edit set image) for
each image (e.g., add mappings for quay.io/ambient_code/vteam_frontend,
vteam_backend, vteam_operator, vteam_claude_runner, vteam_state_sync,
vteam_api_server, vteam_public_api without :latest) so they are also set to
quay.io/ambient_code/<image>:${IMAGE_TAG}; update the same command block that
currently lists the ":latest" entries to include the corresponding untagged
entries to ensure PR isolation.
- Around line 35-37: The script is applying cluster-scoped CRDs/RBAC from PR
branch content via the two oc apply -k commands (the invocations that reference
"$REPO_ROOT/components/manifests/base/crds/" and ".../rbac/"), which must not
run in ephemeral PR installs; change the script to skip these applies unless
running in a trusted/mainline context by guarding them with an explicit
environment flag (e.g. MAINLINE_PIPELINE or SKIP_CLUSTER_MANIFESTS) or CI
variable check, and when skipped, print a clear message that cluster-scoped
resources are omitted; ensure the logic surrounds the oc apply -k calls (not
other local-only manifests) so PR jobs never mutate cluster-wide resources.

In `@components/pr-test/README.md`:
- Around line 1-11: Add a new "Security model" section to the README describing
where untrusted PR code may run and mitigations: specify that S0.x instances are
namespace-scoped and isolated with network policies (use label/security-zone:
internal), list which artifacts are executed from trusted main branch
(provision.sh and install.sh) vs what comes from the PR branch, explain secrets
handling (namespace-scoped secrets, scrubbed on teardown, no cluster-wide
secrets exposed), and call out the security implications and required review
when enabling the push: true flag for PR builds so reviewers know the increased
risk and required controls.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6db8f4d5-602f-45b4-9e1e-d19494fdf5a9

📥 Commits

Reviewing files that changed from the base of the PR and between 2eb2465 and 9d943a1.

📒 Files selected for processing (6)
  • .claude/skills/ambient-pr-test/SKILL.md
  • .claude/skills/ambient/SKILL.md
  • .github/workflows/pr-namespace-cleanup.yml
  • components/pr-test/README.md
  • components/pr-test/install.sh
  • docs/internal/proposals/acp-public-rest-api.md
💤 Files with no reviewable changes (1)
  • docs/internal/proposals/acp-public-rest-api.md

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/pr-test/README.md`:
- Around line 69-82: The example TenantNamespace CR (kind: TenantNamespace,
name: pr-123-example) is missing the required label ambient-code/instance-type:
s0x under metadata.labels; add that label alongside the existing tenant and
namespace-type labels so the provisioner capacity check (referenced around lines
107-108) will count this instance and match the schema expectations mentioned
near lines 56 and 144.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8aebdebc-b679-4112-9c35-868a57e925b8

📥 Commits

Reviewing files that changed from the base of the PR and between 9d943a1 and 02ace41.

📒 Files selected for processing (3)
  • .claude/skills/ambient-pr-test/SKILL.md
  • components/pr-test/README.md
  • components/pr-test/build.sh


```yaml
- name: Log in to OpenShift
  run: |
    oc login "${{ secrets.TEST_OPENSHIFT_SERVER }}" \
```

I'll note one thing here: if we want to deploy to an MP+ cluster behind the firewall, we can use ArgoCD of course, or we can use the internal GH runner as documented in #793.

@@ -1,214 +0,0 @@
# Proposal: ACP Public REST API

Why delete this?

# OpenShift Cluster Development

This guide covers deploying the Ambient Code Platform on OpenShift clusters for development and testing. Use this when you need to test OpenShift-specific features like Routes, OAuth integration, or service mesh capabilities.
This guide covers deploying Ambient Code on an OpenShift cluster using the **OpenShift internal image registry**. This is useful when iterating on local builds against a dev cluster without pushing to quay.io.

Can this change be broken out into a separate PR? I don't understand what this has to do with this change.

Contributor Author

by introducing the logic into the skill, it becomes re-usable. it's like a code refactor where we make a change by making a new dual use function and refactoring the old code to use it. this developer guide was my original How To script for claude.

```bash
print(doc)
PYEOF

kustomize build . \
```

Suggested change

```diff
-kustomize build . \
+oc kustomize build . \
```

See #768

@@ -0,0 +1,448 @@
# Specification: Ephemeral PR Test Environments on MPP

This doc is helpful to understand the design thinking, but some things are already out of date (for example "no branch slug" is wrong). If this stays in this PR, it needs another pass to reconcile with the actual implementation. (See also #952)


There is significant duplication between the SKILL.md files and this file. For maintainability, I wonder if we could trim these both down and dedup where it makes sense.

```bash
  "quay.io/ambient_code/vteam_public_api:latest=quay.io/ambient_code/vteam_public_api:${IMAGE_TAG}"

FILTER_SCRIPT="$TMPDIR/filter.py"
cat > "$FILTER_SCRIPT" << 'PYEOF'
```

This inline Python script is 35 lines of yaml manipulation via regex. That will be very challenging to maintain.

Could this be a standalone Python file in components/pr-test/filter.py with proper tests? Or better, could you use yq or oc instead of regex?

@ktdreyer ktdreyer left a comment

Biggest thing we must address is the warnings from GH re: acting on untrusted code

- Build workflow pushes all 7 component images tagged pr-<N>-amd64 to quay on every PR
- provision.sh creates/destroys TenantNamespace CRs on ambient-code--config; capacity-gated at 5 concurrent instances
- install.sh deploys production manifests with PR image tags using ArgoCD SA token; handles MPP restricted environment constraints (Route labels, PVC annotations, ClusterRoleBinding subject patching)
- pr-e2e-openshift.yml workflow: provision → install → e2e → teardown on build completion
- pr-namespace-cleanup.yml: safety-net teardown on PR close
- Skills: ambient-pr-test (full PR test workflow) and ambient (install on any OpenShift namespace)
- Validates required secrets before install; documents MPP resource inventory and constraints
- Route admission webhook fix: add paas.redhat.com/appcode label via kustomize filter

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
markturansky pushed a commit that referenced this pull request Mar 28, 2026
- components-build-deploy: push immutable SHA-tagged PR image alongside
  mutable pr-<N>-<arch> tag to prevent race conditions across commits
- pr-e2e-openshift: concurrency group null safety (|| workflow_run.id),
  clamp SLUG_MAX to 0 min / skip branch slug if empty, remove
  --insecure-skip-tls-verify, install kustomize for kustomize-edit steps
- install.sh: add both :latest and untagged kustomize set-image mappings
  so all image name variants are rewritten; add vteam_control_plane and
  vteam_mcp; replace kustomize build with oc kustomize; make health check
  fail-fast with curl timeouts and exit on missing routes
- provision.sh: atomic ConfigMap lock before capacity check (TOCTOU fix);
  wait for TenantNamespace Ready condition, not only Namespace Active;
  fix destroy() to detect only (NotFound) as success, not all oc errors;
  delete lock ConfigMap on destroy
- openshift.md: add vteam_state_sync to internal registry push loop

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@ambient-code

ambient-code bot commented Mar 30, 2026

Review Queue Status

| Check | Status | Detail |
| --- | --- | --- |
| CI | pass | |
| Conflicts | FAIL | Merge conflicts present |
| Reviews | pass | |

Action needed: Resolve merge conflicts

Auto-generated by Review Queue workflow. Updated when PR changes.
