feat(ci): ephemeral PR test instances on MPP dev cluster #1032
markturansky wants to merge 1 commit into main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
Walkthrough: Adds PR-focused ephemeral environment tooling: build/push scripts and workflows, TenantNamespace provisioning and teardown scripts, an installer to deploy Ambient into PR namespaces, accompanying docs/skills, and removal of an unused API proposal doc.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Dev as Developer
    participant GH as GitHub Actions
    participant Prov as provision.sh
    participant OS as OpenShift API
    participant Inst as install.sh
    participant E2E as Cypress
    Dev->>GH: Open/Update PR
    GH->>GH: Build & Push component images
    GH->>Prov: create(instance_id)
    Prov->>OS: Apply TenantNamespace CR
    OS->>OS: Tenant operator provisions namespace
    Prov->>OS: Poll until namespace.phase == "Active"
    OS-->>Prov: Namespace Active
    GH->>Inst: install.sh <namespace> <image-tag>
    Inst->>OS: Apply CRDs/RBAC, copy secrets, apply patched kustomize
    Inst->>OS: Patch operator-config & agent-registry, wait rollouts
    Inst-->>GH: frontend_url output
    GH->>E2E: Run Cypress tests against frontend_url
    E2E->>OS: Query routes/health endpoints
    E2E-->>GH: Upload artifacts (on failure)
    GH->>Prov: destroy(instance_id) [always]
    Prov->>OS: Delete TenantNamespace CR
    OS->>OS: Tenant operator finalizes and removes namespace
    Prov->>OS: Poll until namespace absent
    OS-->>Prov: Namespace removed
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 15
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/skills/ambient/SKILL.md:
- Around line 58-88: The sample secret creation commands publish fixed, shared
credentials; replace hardcoded values with placeholders and guidance to generate
unique secrets instead: update the examples that create minio-credentials,
postgresql-credentials, unleash-credentials, and github-app-secret to use
parameterized values (e.g. <ROOT_USER>, <ROOT_PASSWORD>, <DB_HOST>,
<DB_PASSWORD>, <UNLEASH_DB_URL>, <GITHUB_PRIVATE_KEY>, <GITHUB_CLIENT_SECRET>,
<GITHUB_STATE_SECRET>) and add a short note instructing users to generate
strong, unique secrets (or source them from a secret manager) before running
kustomize so no default credentials are shared across installations.
- Around line 175-189: The rollout wait step is missing gates for public-api and
ambient-api-server-db which makes Step 6 flaky; update the Step 5 "Wait for
Rollout" loop/commands in .claude/skills/ambient/SKILL.md to include oc rollout
status checks for deployment/public-api and deployment/ambient-api-server-db (in
addition to ambient-api-server) so the docs explicitly wait for public-api and
ambient-api-server-db to finish before proceeding to Step 6.
In @.github/workflows/components-build-deploy.yml:
- Around line 206-217: The PR build currently only pushes a mutable tag
(pr-<N>-<arch>) which can be overwritten; update the docker/build-push-action
step (the "Build and push ${{ matrix.component.name }} (${{ matrix.arch.suffix
}}) for pull request" step) to also push an immutable tag that includes the PR
head commit SHA so downstream workflow_run jobs can pull the exact image.
Concretely, add a second tag value alongside the existing tags that uses the PR
head SHA (e.g. ${{ matrix.component.image }}:${{
github.event.pull_request.head.sha }}-${{ matrix.arch.suffix }}) while keeping
the current pr-... tag and ensure push remains true so both tags are pushed.
In @.github/workflows/pr-e2e-openshift.yml:
- Around line 57-60: Replace the unsafe checkout ref that pulls the PR head (the
actions/checkout step using ref: ${{ github.event.workflow_run.head_sha }}) so
the job does not execute untrusted PR code with secrets in scope; change the
checkout to use a safe ref like ${{ github.sha }} (or the run's immutable
commit) and apply the same change to the other checkout occurrences noted (lines
84-119, 126-147), and ensure jobs that require secrets (those invoking
components/pr-test/*.sh, npm ci, npx cypress run with TEST_OPENSHIFT_TOKEN,
QUAY_PASSWORD, ANTHROPIC_API_KEY) only run against repository-controlled code
(or switch to pull_request_target with checkout of the base commit) so
PR-modified scripts cannot exfiltrate secrets.
- Around line 27-35: The run script is inlining
github.event.workflow_run.head_branch into the shell source, exposing
secret-scoped workflow context; instead, move PR_NUMBER and HEAD_BRANCH into
environment variables and reference them inside the run block (e.g., add an env:
section that sets PR_NUMBER: ${{
github.event.workflow_run.pull_requests[0].number }} and HEAD_BRANCH: ${{
github.event.workflow_run.head_branch }} and then use $PR_NUMBER and
$HEAD_BRANCH in the script), so BRANCH_SLUG creation and sed/tr commands operate
on an env var rather than interpolated workflow text.
- Around line 37-39: The derived NAMESPACE can exceed allowed length because
INSTANCE_ID uses BRANCH_SLUG:0:45; reduce the slug truncation to fit the
remaining budget (e.g., change BRANCH_SLUG:0:45 to BRANCH_SLUG:0:41) so that
"ambient-code--" + "pr-${PR_NUMBER}-" plus the slug stays within the namespace
length limit; update the INSTANCE_ID assignment (and any related uses like
NAMESPACE/IMAGE_TAG) to use the shorter BRANCH_SLUG slice.
In @.github/workflows/pr-namespace-cleanup.yml:
- Around line 27-28: The hardcoded slice ${BRANCH_SLUG:0:45} can still produce
>63-char Kubernetes namespaces; replace it with a dynamic truncation that
accounts for the fixed "ambient-code--" prefix (14 chars) and the variable
"pr-${PR_NUMBER}-" prefix length by using ${BRANCH_SLUG:0:$((45 -
${#PR_NUMBER}))} (45 derived from 63-14-4) and ensure the computed length is
clamped to zero or positive before constructing INSTANCE_ID; keep the rest
(INSTANCE_ID assignment and echo to GITHUB_OUTPUT) unchanged.
- Around line 17-25: The job currently interpolates untrusted HEAD_BRANCH
directly into the shell pipeline that computes BRANCH_SLUG (variables PR_NUMBER,
HEAD_BRANCH, BRANCH_SLUG and the sed/tr transliteration pipeline), so change the
script to first strictly validate and sanitize HEAD_BRANCH: reject or fail the
job if HEAD_BRANCH contains any characters outside a safe whitelist (allow only
ASCII letters, digits and separators like '.' '_' '-'), limit length (e.g. 64
chars), and then construct BRANCH_SLUG from that validated value; always
double-quote variable expansions when used in commands and avoid passing user
input to eval-like contexts or unquoted substitutions, ensure BRANCH_SLUG never
starts with a hyphen and normalize runs of non-alphanumerics to a single dash as
before but only after validation to prevent malicious injection.
In `@components/pr-test/install.sh`:
- Around line 102-105: The health check ignores failures and can hang because it
uses curl -sk ... || true; change the block that uses BACKEND_HOST and HEALTH to
call curl with a connect and overall timeout and fail on HTTP errors (e.g., curl
-fsS --connect-timeout 5 --max-time 10 "https://$BACKEND_HOST/health"), remove
the "|| true", capture the exit code and the output into HEALTH, and if curl
fails (non-zero exit) print the HEALTH/error and exit non‑zero so the script
fails fast; reference the BACKEND_HOST variable, HEALTH assignment, and the curl
invocation when making the change.
- Around line 35-37: The script is applying cluster-scoped CRDs/RBAC from the PR
branch (the oc apply -k "$REPO_ROOT/components/manifests/base/crds/" and oc
apply -k "$REPO_ROOT/components/manifests/base/rbac/" lines under the "Step 1"
echo), which must not run in a PR install; change install.sh to skip or abort
these two oc apply commands unless running in a trusted/mainline pipeline (guard
with an explicit env var like ALLOW_CLUSTER_SCOPED_MANIFESTS or CI_MAINLINE) or
move them to a separate pre-provision step; keep the namespace-scoped install
flow intact and ensure the script logs a clear message when cluster-scoped
manifests are skipped.
- Around line 52-59: kustomize edit set image currently only replaces images
named with :latest, leaving untagged image entries unchanged; update the
kustomize image mappings so each image name is provided twice (once with the
:latest source and once without the tag) mapping both variants to
quay.io/ambient_code/<image>:${IMAGE_TAG} (e.g., ensure vteam_frontend,
vteam_frontend:latest, vteam_backend, vteam_backend:latest, etc. are all
included) so that both tagged and untagged entries are rewritten; update the
kustomize edit set image invocation to include these duplicate mappings for
vteam_frontend, vteam_backend, vteam_operator, vteam_claude_runner,
vteam_state_sync, vteam_api_server, and vteam_public_api.
In `@components/pr-test/provision.sh`:
- Around line 29-43: The current capacity check using ACTIVE and
MAX_S0X_INSTANCES is racy (TOCTOU) because two concurrent runs can both observe
the same ACTIVE count then both apply the TenantNamespace CR; replace the
read-then-apply pattern with an atomic reservation or cluster-side quota:
implement a create-if-not-exists reservation/lock (for example create a
namespaced Reservation CR or a ConfigMap/Lease named with the instance id and
fail the create if it already exists) and only count SUCCESSFUL reservations
toward the limit, or enforce with a Kubernetes ResourceQuota/OPA/Gatekeeper
policy so the API server enforces the MAX_S0X_INSTANCES limit; update the
provision flow around the "echo '==> Applying TenantNamespace CR: $INSTANCE_ID'"
block to create the reservation atomically before applying the TenantNamespace
and delete the reservation on rollback/cleanup.
- Around line 59-74: The loop that currently waits for Namespace.phase ==
"Active" (using NAMESPACE, READY_TIMEOUT, STATUS) must instead wait for the
TenantNamespace CR Ready condition to be True to ensure the tenant operator
finished reconciling; change the check inside the while loop to query the
TenantNamespace resource (oc get tenantnamespace "$INSTANCE_ID" -n
"$CONFIG_NAMESPACE" -o
jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null || true)
and treat "True" as success (you may still fall back to checking Namespace.phase
if desired), and preserve the existing timeout, logging, and the final oc
describe tenantnamespace "$INSTANCE_ID" invocation on failure.
- Around line 82-88: The current loop treats any failure of `oc get namespace
"$NAMESPACE"` as successful deletion; change the check inside the while loop to
capture stderr/output from `oc get namespace "$NAMESPACE" 2>&1` and only
consider the namespace deleted when that output explicitly indicates a NotFound
error (e.g. matches "NotFound" or "namespaces .* not found" or "Error from
server (NotFound)"). For non-NotFound failures (auth errors, timeouts, transport
errors) do not exit; instead log a warning and continue retrying until DEADLINE.
Update the if-condition in the while loop that references
NAMESPACE/DELETE_TIMEOUT/DEADLINE accordingly so only a confirmed NotFound
result is treated as deletion.
In `@docs/internal/developer/local-development/openshift.md`:
- Around line 47-78: The docs' internal-registry steps miss mirroring the
operator-spawned images (vteam_state_sync) and updating operator-config and
ambient-agent-registry references, so runner/state-sync pods will still pull
from Quay; update the mirror loop to include vteam_state_sync and any other
operator-spawned images, ensure deploy steps also replace newName refs inside
components/manifests/overlays/production/kustomization.yaml and the
operator-config and ambient-agent-registry resources (update their image refs to
${REGISTRY_HOST}/ambient-code/*), and verify deploy.sh applies the updated
manifests before restoring kustomization.yaml so operator-created pods use the
internal registry.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 025dd6a6-1ae1-437d-903d-cfb4facdffed
📒 Files selected for processing (9)

- .claude/skills/ambient-pr-test/SKILL.md
- .claude/skills/ambient/SKILL.md
- .github/workflows/components-build-deploy.yml
- .github/workflows/pr-e2e-openshift.yml
- .github/workflows/pr-namespace-cleanup.yml
- components/pr-test/SPEC.md
- components/pr-test/install.sh
- components/pr-test/provision.sh
- docs/internal/developer/local-development/openshift.md
```diff
       - name: Build and push ${{ matrix.component.name }} (${{ matrix.arch.suffix }}) for pull request
         if: github.event_name == 'pull_request'
         uses: docker/build-push-action@v6
         with:
           context: ${{ matrix.component.context }}
           file: ${{ matrix.component.dockerfile }}
           platforms: ${{ matrix.arch.platform }}
-          push: false
+          push: true
           tags: ${{ matrix.component.image }}:pr-${{ github.event.pull_request.number }}-${{ matrix.arch.suffix }}
           build-args: AMBIENT_VERSION=${{ github.sha }}
           cache-from: type=gha,scope=${{ matrix.component.name }}-${{ matrix.arch.suffix }}
           cache-to: type=gha,mode=max,scope=${{ matrix.component.name }}-${{ matrix.arch.suffix }}
```
Publish an immutable PR image tag alongside the mutable pr-<N>-<arch> alias.
This step now pushes a tag that is overwritten on every PR commit, while the downstream workflow_run job checks out a specific head_sha. If two commits overlap, an older E2E run can deploy newer images and report the wrong result.
Suggested change

```diff
-          tags: ${{ matrix.component.image }}:pr-${{ github.event.pull_request.number }}-${{ matrix.arch.suffix }}
+          tags: |
+            ${{ matrix.component.image }}:pr-${{ github.event.pull_request.number }}-${{ matrix.arch.suffix }}
+            ${{ matrix.component.image }}:pr-${{ github.event.pull_request.number }}-${{ github.sha }}-${{ matrix.arch.suffix }}
```
components/pr-test/install.sh
Outdated
```bash
if [[ -n "$BACKEND_HOST" ]]; then
  HEALTH=$(curl -sk "https://${BACKEND_HOST}/health" || true)
  echo "  Backend health: $HEALTH"
fi
```
Health verification is non-blocking and unbounded; make it fail-fast with timeouts.
Line 103 uses curl -sk ... || true, so backend failures are ignored and the script can still report success. It also lacks explicit timeout bounds, which can stall CI.
Suggested fix

```diff
 if [[ -n "$BACKEND_HOST" ]]; then
-  HEALTH=$(curl -sk "https://${BACKEND_HOST}/health" || true)
+  HEALTH=$(curl -fsS --connect-timeout 5 --max-time 20 \
+    --retry 3 --retry-all-errors "https://${BACKEND_HOST}/health")
   echo "  Backend health: $HEALTH"
+else
+  echo "  ERROR: backend-route host not found"
+  exit 1
 fi
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```bash
if [[ -n "$BACKEND_HOST" ]]; then
  HEALTH=$(curl -fsS --connect-timeout 5 --max-time 20 \
    --retry 3 --retry-all-errors "https://${BACKEND_HOST}/health")
  echo "  Backend health: $HEALTH"
else
  echo "  ERROR: backend-route host not found"
  exit 1
fi
```
```bash
echo "==> Checking S0.x instance capacity..."
ACTIVE=$(oc get tenantnamespace -n "$CONFIG_NAMESPACE" \
  -l ambient-code/instance-type=s0x --no-headers 2>/dev/null | wc -l | tr -d ' ')

if [ "$ACTIVE" -ge "$MAX_S0X_INSTANCES" ]; then
  echo "ERROR: At capacity — $ACTIVE/$MAX_S0X_INSTANCES S0.x instances active."
  echo "Active instances:"
  oc get tenantnamespace -n "$CONFIG_NAMESPACE" \
    -l ambient-code/instance-type=s0x -o name
  exit 1
fi
echo "  Capacity OK: $ACTIVE/$MAX_S0X_INSTANCES"

echo "==> Applying TenantNamespace CR: $INSTANCE_ID"
cat <<EOF | oc apply -f -
```
The capacity gate is racy across concurrent provisions.
This is a classic TOCTOU window: two runs can both read the same active count and then both oc apply, exceeding the configured limit of five S0.x instances. The cap needs an atomic reservation step or a cluster-side lock/quota.
```bash
echo "==> Waiting for namespace ${NAMESPACE} to become Active (timeout: ${READY_TIMEOUT}s)..."
DEADLINE=$((SECONDS + READY_TIMEOUT))
while [ $SECONDS -lt $DEADLINE ]; do
  STATUS=$(oc get namespace "$NAMESPACE" -o jsonpath='{.status.phase}' 2>/dev/null || true)
  if [ "$STATUS" == "Active" ]; then
    echo "  Namespace ${NAMESPACE} is Active."
    echo "$NAMESPACE"
    exit 0
  fi
  echo "  status=${STATUS:-NotFound}, retrying..."
  sleep 3
done

echo "ERROR: Namespace ${NAMESPACE} did not become Active within ${READY_TIMEOUT}s."
oc describe tenantnamespace "$INSTANCE_ID" -n "$CONFIG_NAMESPACE" || true
exit 1
```
Wait for the TenantNamespace Ready condition, not only Namespace.phase=Active.
Active only proves the namespace object exists. The tenant operator can still be reconciling quotas/RBAC/finalizers, so the install step can race immediately after this script returns.
Suggested change

```diff
 echo "==> Waiting for namespace ${NAMESPACE} to become Active (timeout: ${READY_TIMEOUT}s)..."
 DEADLINE=$((SECONDS + READY_TIMEOUT))
 while [ $SECONDS -lt $DEADLINE ]; do
-  STATUS=$(oc get namespace "$NAMESPACE" -o jsonpath='{.status.phase}' 2>/dev/null || true)
-  if [ "$STATUS" == "Active" ]; then
+  READY=$(oc get tenantnamespace "$INSTANCE_ID" -n "$CONFIG_NAMESPACE" \
+    -o jsonpath='{range .status.conditions[?(@.type=="Ready")]}{.status}{end}' 2>/dev/null || true)
+  STATUS=$(oc get namespace "$NAMESPACE" -o jsonpath='{.status.phase}' 2>/dev/null || true)
+  if [ "$READY" == "True" ] && [ "$STATUS" == "Active" ]; then
     echo "  Namespace ${NAMESPACE} is Active."
     echo "$NAMESPACE"
     exit 0
   fi
-  echo "  status=${STATUS:-NotFound}, retrying..."
+  echo "  ready=${READY:-Unknown} status=${STATUS:-NotFound}, retrying..."
   sleep 3
 done
```
```bash
echo "==> Waiting for namespace ${NAMESPACE} to be deleted (timeout: ${DELETE_TIMEOUT}s)..."
DEADLINE=$((SECONDS + DELETE_TIMEOUT))
while [ $SECONDS -lt $DEADLINE ]; do
  if ! oc get namespace "$NAMESPACE" &>/dev/null; then
    echo "  Namespace ${NAMESPACE} deleted."
    exit 0
  fi
```
Do not treat every oc get namespace failure as successful deletion.
if ! oc get namespace ... returns true for NotFound, auth failures, API timeouts, and transport errors alike. A transient cluster error will be reported as "deleted" and leak the namespace.
Suggested change

```diff
 echo "==> Waiting for namespace ${NAMESPACE} to be deleted (timeout: ${DELETE_TIMEOUT}s)..."
 DEADLINE=$((SECONDS + DELETE_TIMEOUT))
 while [ $SECONDS -lt $DEADLINE ]; do
-  if ! oc get namespace "$NAMESPACE" &>/dev/null; then
+  OUTPUT=$(oc get namespace "$NAMESPACE" -o name 2>&1) && RC=0 || RC=$?
+  if [ "$RC" -eq 0 ]; then
+    echo "  Namespace still exists, waiting..."
+  elif grep -q "(NotFound)" <<<"$OUTPUT"; then
     echo "  Namespace ${NAMESPACE} deleted."
     exit 0
+  else
+    echo "  oc get namespace failed: $OUTPUT"
   fi
-  echo "  Namespace still exists, waiting..."
   sleep 5
 done
```

📝 Committable suggestion
```bash
echo "==> Waiting for namespace ${NAMESPACE} to be deleted (timeout: ${DELETE_TIMEOUT}s)..."
DEADLINE=$((SECONDS + DELETE_TIMEOUT))
while [ $SECONDS -lt $DEADLINE ]; do
  OUTPUT=$(oc get namespace "$NAMESPACE" -o name 2>&1) && RC=0 || RC=$?
  if [ "$RC" -eq 0 ]; then
    echo "  Namespace still exists, waiting..."
  elif grep -q "(NotFound)" <<<"$OUTPUT"; then
    echo "  Namespace ${NAMESPACE} deleted."
    exit 0
  else
    echo "  oc get namespace failed: $OUTPUT"
  fi
  sleep 5
done
```
```bash
oc new-project ambient-code
```

**MinIO credentials:**

```bash
oc create secret generic minio-credentials -n ambient-code \
  --from-literal=root-user=admin \
  --from-literal=root-password=changeme123
```

**PostgreSQL credentials (for Unleash feature flag database):**

```bash
oc create secret generic postgresql-credentials -n ambient-code \
  --from-literal=db.host="postgresql" \
  --from-literal=db.port="5432" \
  --from-literal=db.name="postgres" \
  --from-literal=db.user="postgres" \
  --from-literal=db.password="postgres123"
```

**Unleash credentials (for feature flag service):**

```bash
oc create secret generic unleash-credentials -n ambient-code \
  --from-literal=database-url="postgres://postgres:postgres123@postgresql:5432/unleash" \
  --from-literal=database-ssl="false" \
  --from-literal=admin-api-token="*:*.unleash-admin-token" \
  --from-literal=client-api-token="default:development.unleash-client-token" \
  --from-literal=frontend-api-token="default:development.unleash-frontend-token" \
  --from-literal=default-admin-password="unleash123"
```

## Deploy with Internal Registry Images

The production kustomization in `components/manifests/overlays/production/kustomization.yaml` references `quay.io/ambient_code/*` images by default. When deploying to an OpenShift cluster using the internal registry, you must temporarily point the image refs at the internal registry, deploy, then **immediately revert** before committing.

**⚠️ CRITICAL**: Never commit `kustomization.yaml` with internal registry refs.

```bash
REGISTRY_HOST=$(oc get route default-route -n openshift-image-registry \
  --template='{{ .spec.host }}')

for img in vteam_frontend vteam_backend vteam_operator vteam_public_api vteam_claude_runner vteam_api_server; do
  podman tag localhost/${img}:latest ${REGISTRY_HOST}/ambient-code/${img}:latest
  podman push ${REGISTRY_HOST}/ambient-code/${img}:latest
done
```

**Patch kustomization to internal registry, deploy, then revert:**

```bash
# Temporarily override image refs to internal registry
cd components/manifests/overlays/production
sed -i "s#newName: quay.io/ambient_code/#newName: ${REGISTRY_HOST}/ambient-code/#g" kustomization.yaml

# Deploy
cd ../..
./deploy.sh
oc rollout restart deployment backend-api frontend agentic-operator public-api ambient-api-server -n ambient-code

# IMMEDIATELY revert — do not commit with internal registry refs
cd overlays/production
git checkout kustomization.yaml
```
This internal-registry flow still leaves the operator-spawned images on the old registry.
Only six images are mirrored here, and vteam_state_sync is missing. Rewriting kustomization.yaml is also not enough for runner pods, because operator-config and ambient-agent-registry still need to point at the internal-registry refs. As written, runner/state-sync changes will continue pulling stale Quay images or fail outright.
TODO: Let's identify if/where this runs untrusted code and stamp that out. GH gives us some clues in the PR review here. I also filed #1034 to add another tool in the box.
```yaml
- name: Install kustomize
  run: |
    curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
```
We install kustomize in a few places (eg #768). We should stop doing that and use oc kustomize instead.
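The suggested replacement could look like this (a sketch; the helper name and arguments are illustrative): `oc` ships kustomize rendering built in, so the curl-install step disappears entirely.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Render a kustomization with the kustomize build that ships inside oc,
# then pipe the result to apply — no separate kustomize install needed.
apply_kustomization() {
  local dir="$1" namespace="$2"
  oc kustomize "$dir" | oc apply -n "$namespace" -f -
}
```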
```yaml
run: |
  oc login ${{ secrets.TEST_OPENSHIFT_SERVER }} \
    --token=${{ secrets.TEST_OPENSHIFT_TOKEN }} \
    --insecure-skip-tls-verify
```
I realize we're copying --insecure-skip-tls-verify from other GH Actions, but do we need it? I thought ROSA obviates this.
good catch. this is from testing and should not be in any full deployment
Actionable comments posted: 4
♻️ Duplicate comments (1)
.github/workflows/pr-namespace-cleanup.yml (1)
7-13: ⚠️ Potential issue | 🟠 Major

Add explicit `permissions` block and checkout `main` to use trusted scripts.

The workflow lacks a `permissions` block, which CodeQL has flagged. Additionally, the checkout step should explicitly use `ref: main` to ensure only trusted repository scripts are executed when running the destroy command with cluster credentials.

Suggested fix

```diff
 jobs:
   cleanup:
     runs-on: ubuntu-latest
     if: github.event.pull_request.head.repo.full_name == github.repository
+    permissions:
+      contents: read
     steps:
       - name: Checkout code
         uses: actions/checkout@v6
+        with:
+          ref: main
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/pr-namespace-cleanup.yml around lines 7 - 13, The workflow job "cleanup" is missing an explicit permissions block and the checkout step should pin to the trusted branch; add a permissions block under the cleanup job (for example: permissions: contents: read) and update the Checkout code step that uses actions/checkout@v6 to include ref: main so the job runs with explicit repository read permissions and checks out the main branch before executing destroy/cluster credentials operations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/pr-e2e-openshift.yml:
- Around line 8-10: The concurrency group uses
github.event.workflow_run.pull_requests[0].number which can be null and will
break group evaluation; update the concurrency key to use a safe
conditional/fallback (e.g., use github.event.workflow_run.pull_requests &&
github.event.workflow_run.pull_requests[0] ?
github.event.workflow_run.pull_requests[0].number : github.run_id) so the group
always resolves; modify the line referencing pull_requests[0].number in the
concurrency block to this conditional expression (or an equivalent fallback like
github.run_id or github.sha) to avoid null access before jobs run.
- Around line 158-183: The teardown job's Destroy namespace step uses
needs.setup.outputs.instance_id unguarded, causing the destroy script
(components/pr-test/provision.sh destroy) to receive an empty INSTANCE_ID when
setup failed; update the teardown (or the Destroy namespace step) to skip
running when that output is empty—either by adding an if condition on the step
(e.g., check needs.setup.outputs.instance_id != '') or by guarding the run
command to no-op when INSTANCE_ID is empty before calling the destroy script;
reference the teardown job, the Destroy namespace step, the env VARIABLE
INSTANCE_ID and the components/pr-test/provision.sh destroy invocation when
making the change.
- Around line 37-43: The SLUG_MAX calculation can become zero or negative for
very large PR numbers which makes BRANCH_SLUG empty and leaves a trailing dash
in INSTANCE_ID; update the workflow to clamp SLUG_MAX to a minimum positive
value (e.g., SLUG_MAX=$(( SLUG_MAX < 1 ? 1 : SLUG_MAX ))) or otherwise guard it
the same way as the cleanup workflow, ensure BRANCH_SLUG uses that clamped
value, and build INSTANCE_ID without a trailing dash (e.g., only append
"-${BRANCH_SLUG}" when BRANCH_SLUG is non-empty) so symbols SLUG_MAX, PR_LEN,
BRANCH_SLUG and INSTANCE_ID are adjusted accordingly.
In @.github/workflows/pr-namespace-cleanup.yml:
- Around line 24-28: Clamp SLUG_MAX to a non-negative value after computing it
(ensure SLUG_MAX=$(( SLUG_MAX < 0 ? 0 : SLUG_MAX ))) and then derive
BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"; after that, trim any trailing dashes
from BRANCH_SLUG and guard INSTANCE_ID construction so it doesn't produce a
trailing dash when BRANCH_SLUG is empty (e.g., build INSTANCE_ID by
conditionally appending "-${BRANCH_SLUG}" only if BRANCH_SLUG is non-empty,
using the existing symbols PR_LEN, SLUG_MAX, SAFE_BRANCH, BRANCH_SLUG,
PR_NUMBER, and INSTANCE_ID).
---
Duplicate comments:
In @.github/workflows/pr-namespace-cleanup.yml:
- Around line 7-13: The workflow job "cleanup" is missing an explicit
permissions block and the checkout step should pin to the trusted branch; add a
permissions block under the cleanup job (for example: permissions: contents:
read) and update the Checkout code step that uses actions/checkout@v6 to include
ref: main so the job runs with explicit repository read permissions and checks
out the main branch before executing destroy/cluster credentials operations.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5ab0169e-0a5e-44f9-92e8-0fed82edc418
📒 Files selected for processing (2)
* .github/workflows/pr-e2e-openshift.yml
* .github/workflows/pr-namespace-cleanup.yml
```yaml
concurrency:
  group: pr-e2e-openshift-${{ github.event.workflow_run.pull_requests[0].number }}
  cancel-in-progress: true
```
Concurrency group may fail when pull_requests[0] is null.
The concurrency key references pull_requests[0].number, but if the workflow_run isn't associated with a PR (e.g., push to main), this could produce an invalid or colliding group name. The setup job guards against this, but the concurrency block is evaluated before job conditions.
Suggested fix:

```diff
 concurrency:
-  group: pr-e2e-openshift-${{ github.event.workflow_run.pull_requests[0].number }}
+  group: pr-e2e-openshift-${{ github.event.workflow_run.pull_requests[0].number || github.event.workflow_run.id }}
   cancel-in-progress: true
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/pr-e2e-openshift.yml around lines 8 - 10, The concurrency
group uses github.event.workflow_run.pull_requests[0].number which can be null
and will break group evaluation; update the concurrency key to use a safe
conditional/fallback (e.g., use github.event.workflow_run.pull_requests &&
github.event.workflow_run.pull_requests[0] ?
github.event.workflow_run.pull_requests[0].number : github.run_id) so the group
always resolves; modify the line referencing pull_requests[0].number in the
concurrency block to this conditional expression (or an equivalent fallback like
github.run_id or github.sha) to avoid null access before jobs run.
```bash
# Budget: "ambient-code--" (14) + "pr-${PR_NUMBER}-" (4+digits) + slug
# Max namespace = 63. Slug budget = 63 - 14 - 4 - ${#PR_NUMBER}
PR_LEN=${#PR_NUMBER}
SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"

INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
```
Apply the same edge-case guard for SLUG_MAX as in the cleanup workflow.
Same issue as pr-namespace-cleanup.yml: SLUG_MAX could be zero/negative for very large PR numbers, and an empty BRANCH_SLUG would leave a trailing dash in INSTANCE_ID.
Suggested fix:

```diff
 PR_LEN=${#PR_NUMBER}
 SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
+[[ $SLUG_MAX -lt 0 ]] && SLUG_MAX=0
 BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"
-INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
+if [[ -n "$BRANCH_SLUG" ]]; then
+  INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
+else
+  INSTANCE_ID="pr-${PR_NUMBER}"
+fi
 NAMESPACE="ambient-code--${INSTANCE_ID}"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/pr-e2e-openshift.yml around lines 37 - 43, The SLUG_MAX
calculation can become zero or negative for very large PR numbers which makes
BRANCH_SLUG empty and leaves a trailing dash in INSTANCE_ID; update the workflow
to clamp SLUG_MAX to a minimum positive value (e.g., SLUG_MAX=$(( SLUG_MAX < 1 ?
1 : SLUG_MAX ))) or otherwise guard it the same way as the cleanup workflow,
ensure BRANCH_SLUG uses that clamped value, and build INSTANCE_ID without a
trailing dash (e.g., only append "-${BRANCH_SLUG}" when BRANCH_SLUG is
non-empty) so symbols SLUG_MAX, PR_LEN, BRANCH_SLUG and INSTANCE_ID are adjusted
accordingly.
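The clamped computation is plain bash and easy to sanity-check locally before wiring it into the workflow; the PR numbers and branch name below are illustrative:

```shell
# Sketch of the clamped slug logic; inputs are illustrative.
build_instance_id() {
  local PR_NUMBER="$1" SAFE_BRANCH="$2"
  local PR_LEN=${#PR_NUMBER}
  local SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
  # Clamp so the substring expansion never receives a negative length
  (( SLUG_MAX < 0 )) && SLUG_MAX=0
  local BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"
  if [[ -n "$BRANCH_SLUG" ]]; then
    echo "pr-${PR_NUMBER}-${BRANCH_SLUG}"
  else
    echo "pr-${PR_NUMBER}"   # slug budget exhausted: no trailing dash
  fi
}

build_instance_id 1032 "feature-x"   # -> pr-1032-feature-x
build_instance_id "12345678901234567890123456789012345678901234567" "feature-x"
```

Note that bash treats a negative length in `${var:0:$SLUG_MAX}` as "trim from the end" rather than an error, so without the clamp an oversized PR number would silently produce a wrong slug instead of failing loudly.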
```yaml
teardown:
  needs: [setup, provision, e2e]
  if: always()
  runs-on: ubuntu-latest
  permissions:
    contents: read
  steps:
    - name: Checkout main (trusted scripts only)
      uses: actions/checkout@v6
      with:
        ref: main

    - name: Install oc
      uses: redhat-actions/oc-installer@v1
      with:
        oc_version: 'latest'

    - name: Log in to OpenShift
      run: |
        oc login "${{ secrets.TEST_OPENSHIFT_SERVER }}" \
          --token="${{ secrets.TEST_OPENSHIFT_TOKEN }}"

    - name: Destroy namespace
      env:
        INSTANCE_ID: ${{ needs.setup.outputs.instance_id }}
      run: bash components/pr-test/provision.sh destroy "$INSTANCE_ID"
```
Guard teardown against missing instance_id when upstream jobs fail.
With if: always(), the teardown runs even if setup fails or is skipped, in which case needs.setup.outputs.instance_id will be empty. The destroy script may behave unexpectedly with an empty argument.
Suggested fix:

```diff
 teardown:
   needs: [setup, provision, e2e]
-  if: always()
+  if: always() && needs.setup.outputs.instance_id != ''
   runs-on: ubuntu-latest
```

Alternatively, add a guard in the destroy step:

```diff
 - name: Destroy namespace
   env:
     INSTANCE_ID: ${{ needs.setup.outputs.instance_id }}
-  run: bash components/pr-test/provision.sh destroy "$INSTANCE_ID"
+  run: |
+    if [[ -n "$INSTANCE_ID" ]]; then
+      bash components/pr-test/provision.sh destroy "$INSTANCE_ID"
+    else
+      echo "No instance_id available; skipping teardown"
+    fi
```
+ fi📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
teardown:
  needs: [setup, provision, e2e]
  if: always() && needs.setup.outputs.instance_id != ''
  runs-on: ubuntu-latest
  permissions:
    contents: read
  steps:
    - name: Checkout main (trusted scripts only)
      uses: actions/checkout@v6
      with:
        ref: main

    - name: Install oc
      uses: redhat-actions/oc-installer@v1
      with:
        oc_version: 'latest'

    - name: Log in to OpenShift
      run: |
        oc login "${{ secrets.TEST_OPENSHIFT_SERVER }}" \
          --token="${{ secrets.TEST_OPENSHIFT_TOKEN }}"

    - name: Destroy namespace
      env:
        INSTANCE_ID: ${{ needs.setup.outputs.instance_id }}
      run: bash components/pr-test/provision.sh destroy "$INSTANCE_ID"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/pr-e2e-openshift.yml around lines 158 - 183, The teardown
job's Destroy namespace step uses needs.setup.outputs.instance_id unguarded,
causing the destroy script (components/pr-test/provision.sh destroy) to receive
an empty INSTANCE_ID when setup failed; update the teardown (or the Destroy
namespace step) to skip running when that output is empty—either by adding an if
condition on the step (e.g., check needs.setup.outputs.instance_id != '') or by
guarding the run command to no-op when INSTANCE_ID is empty before calling the
destroy script; reference the teardown job, the Destroy namespace step, the env
VARIABLE INSTANCE_ID and the components/pr-test/provision.sh destroy invocation
when making the change.
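The step-level guard can be expressed as a small shell wrapper and exercised without a cluster; `provision.sh` is stubbed here as a plain function for illustration:

```shell
# Sketch of the guarded destroy step; the real workflow calls
# components/pr-test/provision.sh, stubbed here as a function.
provision_destroy() { echo "destroying $1"; }

guarded_teardown() {
  local instance_id="$1"
  if [[ -n "$instance_id" ]]; then
    provision_destroy "$instance_id"
  else
    echo "No instance_id available; skipping teardown"
  fi
}

guarded_teardown "pr-1032-feature-x"   # destroys the named instance
guarded_teardown ""                    # no-ops instead of passing an empty arg
```

The same shape works whether the guard lives in the workflow `if:` condition or inside the `run:` script; the latter keeps the skip message visible in the job log.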
```bash
PR_LEN=${#PR_NUMBER}
SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"

INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
```
Guard against edge cases: negative SLUG_MAX or trailing dash.
For very high PR numbers, SLUG_MAX could become zero or negative, and if BRANCH_SLUG ends up empty, INSTANCE_ID would have a trailing dash (pr-123-). Consider clamping and trimming:
Suggested fix:

```diff
 PR_LEN=${#PR_NUMBER}
 SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
+[[ $SLUG_MAX -lt 0 ]] && SLUG_MAX=0
 BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"
-INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
+if [[ -n "$BRANCH_SLUG" ]]; then
+  INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
+else
+  INSTANCE_ID="pr-${PR_NUMBER}"
+fi
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```bash
PR_LEN=${#PR_NUMBER}
SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
[[ $SLUG_MAX -lt 0 ]] && SLUG_MAX=0
BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"
if [[ -n "$BRANCH_SLUG" ]]; then
  INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
else
  INSTANCE_ID="pr-${PR_NUMBER}"
fi
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/pr-namespace-cleanup.yml around lines 24 - 28, Clamp
SLUG_MAX to a non-negative value after computing it (ensure SLUG_MAX=$((
SLUG_MAX < 0 ? 0 : SLUG_MAX ))) and then derive
BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"; after that, trim any trailing dashes
from BRANCH_SLUG and guard INSTANCE_ID construction so it doesn't produce a
trailing dash when BRANCH_SLUG is empty (e.g., build INSTANCE_ID by
conditionally appending "-${BRANCH_SLUG}" only if BRANCH_SLUG is non-empty,
using the existing symbols PR_LEN, SLUG_MAX, SAFE_BRANCH, BRANCH_SLUG,
PR_NUMBER, and INSTANCE_ID).
Actionable comments posted: 7
♻️ Duplicate comments (2)
.claude/skills/ambient/SKILL.md (2)
58-88: ⚠️ Potential issue | 🟠 Major

Do not publish shared default credentials in a skill covering production installs.

This issue was previously flagged and remains unresolved. The commands hardcode predictable values for MinIO (`changeme123`), PostgreSQL (`postgres123`), Unleash (fixed admin tokens), and the GitHub state secret (`test-state-secret`). Since this skill explicitly covers "production, ephemeral PR test instances, or developer clusters" (lines 5–6), following it verbatim would give every installation the same credentials.

Replace hardcoded values with placeholders (e.g., `<ROOT_PASSWORD>`, `<DB_PASSWORD>`, `<ADMIN_TOKEN>`) and add a note instructing users to generate strong, unique secrets (or source them from a secret manager) before running the commands.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.claude/skills/ambient/SKILL.md around lines 58 - 88, replace all hardcoded secret values in the kustomize secret creation examples (the secret names minio-credentials, postgresql-credentials, unleash-credentials, github-app-secret) with clearly labelled placeholders (e.g., <ROOT_PASSWORD>, <DB_PASSWORD>, <ADMIN_TOKEN>, <GITHUB_STATE_SECRET>) and update the text to instruct users to generate strong, unique secrets or pull them from a secret manager before applying the overlay; ensure the examples do not contain any fixed/default credentials and add a single-line note above the commands explaining to replace placeholders with secure values or reference a secret store.
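One way to satisfy the "generate strong, unique secrets" note is to derive them at install time. This sketch assumes `openssl` is on PATH; the variable names are illustrative stand-ins for the placeholders:

```shell
# Generate unique per-install credentials instead of shared defaults.
# Assumes openssl is available; variable names are illustrative.
MINIO_ROOT_PASSWORD="$(openssl rand -hex 24)"   # 48 hex chars
POSTGRES_PASSWORD="$(openssl rand -hex 24)"
UNLEASH_ADMIN_TOKEN="$(openssl rand -hex 24)"

echo "generated ${#MINIO_ROOT_PASSWORD}-char MinIO password"
# These values would then replace <ROOT_PASSWORD>, <DB_PASSWORD>,
# <ADMIN_TOKEN> in the kustomize secret-creation commands.
```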
203-211: ⚠️ Potential issue | 🟠 Major

Wait for `public-api` and `ambient-api-server-db` before Step 6.

This issue was previously flagged and remains unresolved. Step 6 immediately uses `public-api-route` (line 281) and execs into `ambient-api-server-db` (line 258), but the rollout gate never waits for those deployments. On a slow cluster the documented verification flow is flaky.

Add `public-api` to the loop and add an explicit wait for `ambient-api-server-db`:

🛡️ Proposed fix

```diff
-for deploy in backend-api frontend agentic-operator postgresql minio unleash; do
+for deploy in backend-api frontend agentic-operator postgresql minio unleash public-api; do
   oc rollout status deployment/$deploy -n $NAMESPACE --timeout=300s
 done
```

`ambient-api-server` and `ambient-api-server-db` may take longer due to DB init:

```diff
+oc rollout status deployment/ambient-api-server-db -n $NAMESPACE --timeout=300s
 oc rollout status deployment/ambient-api-server -n $NAMESPACE --timeout=300s
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.
In @.claude/skills/ambient/SKILL.md around lines 203 - 211, The deployment wait
loop currently iterates "backend-api frontend agentic-operator postgresql minio
unleash" but omits the public API and the DB init; update the loop that declares
deploys (the line with for deploy in ...) to include public-api, and add an
explicit rollout wait for ambient-api-server-db (before the existing
ambient-api-server wait) so Step 6's usage of public-api-route and the
subsequent exec into ambient-api-server-db will block until those deployments
are ready; target the symbolic names public-api and ambient-api-server-db when
adding the waits.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/skills/ambient/SKILL.md:
- Line 292: The command `oc policy add-role-to-group system:image-puller system:serviceaccounts --namespace=$NAMESPACE` is too permissive; change it to grant image-pull only to the specific runner namespace or service account(s) instead of the cluster-wide `system:serviceaccounts` group — e.g., replace that call with a namespace-scoped binding that targets the runner service account (`oc policy add-role-to-user system:image-puller system:serviceaccount:<runner-namespace>:<sa-name> --namespace=$NAMESPACE`) or implement a RoleBinding per target namespace for the `system:image-puller` role so only intended namespaces/users can pull from `$NAMESPACE`.
- Line 103: The placeholder path `/path/to/service-account-key.json` should be
replaced with a real, downloaded service account key file: instruct the user to
create or select a service account in GCP Console (IAM & Admin > Service
Accounts), click "Keys" → "Add Key" → "Create new key" and download the JSON,
save it to a secure local path (e.g., ./keys/ambient-service-account.json), and
then update the --from-file flag to point to that real file name instead of the
placeholder; include a note to keep the JSON secure and to follow the project's
secrets/storage guidelines.
- Around line 318-321: Update the "JWT Configuration" section to include a
concrete example for ephemeral/non‑prod namespaces: show how to patch the
ambient-api-server container args (referencing
ambient-api-server-jwt-args-patch.yaml) to either disable validation (e.g.,
override args to include a --disable-jwt-validation flag or remove the
--jwk-cert-url/--issuer args) or point to a test issuer by setting
--jwk-cert-url to a known test JWKS URL; include the exact arg keys to change
(--jwk-cert-url, --issuer, --disable-jwt-validation), a short example
description of patching the Deployment/DaemonSet container args for the
ambient-api-server, and note to revert for production.
- Line 238: Update the sentence that says the frontend shows "2/2" to clarify
that this only occurs when the oauth-proxy sidecar is present (e.g., when using
the production overlay); mention that installs via install.sh or custom overlays
without oauth-proxy will show "1/1" for the frontend pod. Reference the terms
"frontend" pod, "production overlay", "oauth-proxy sidecar", and "install.sh" in
the rewording so readers know the difference in expected readiness counts
depending on whether oauth-proxy is configured.

In @components/pr-test/README.md:
- Line 187: Update the README to document the security implications of changing
push: false → push: true in components-build-deploy.yml's PR build step: explain
that enabling image push causes PR builds to push ephemeral images tagged
pr-- (overwritten per commit), call out that the Quay registry
namespace must have strict access controls and review who can push/read images,
advise evaluating whether untrusted contributors should be allowed to push to
the shared registry, and add a cross-reference to the security model/trust
boundaries section for further guidance.
- Around line 1-246: Add a new "Security Model" section to the specification
that states where untrusted PR code runs and how it's isolated: describe
isolation boundaries for each ephemeral instance (namespace
ambient-code--) including network policies and the "internal"
security-zone, resource quotas and limits (referencing MAX_S0X_INSTANCES,
READY_TIMEOUT, DELETE_TIMEOUT), RBAC restrictions and that the
ambient-code/instance-type=s0x label is used for counting only, secrets handling
(namespace-scoped secrets, limited secret injection, cleanup on teardown),
mitigation controls for malicious builds (GitHub Actions runner isolation, push:
true implications for PR image builds, e.g. push only from trusted runners,
build-time sandboxing), and explicit notes about what components run untrusted
code vs trusted platform code (provisioner.create/destroy, provision.sh,
installer scripts). Ensure the section references TenantNamespace CR
creation/deletion and the provisioner workflow so reviewers can see where to
apply network policies and RBAC to prevent cross-namespace access.
- Line 220: Update the file list reference that still points to "SPEC.md" so it
matches the renamed file by replacing the "SPEC.md" entry with "README.md"
(search for the string "SPEC.md" in components/pr-test/README.md and update it
to "README.md"); ensure any surrounding text or links that referenced SPEC.md
are adjusted to point to README.md as well.
Duplicate comments:
In @.claude/skills/ambient/SKILL.md:
- Around line 58-88: Replace all hardcoded secret values in the kustomize secret
creation examples (the secret names minio-credentials, postgresql-credentials,
unleash-credentials, github-app-secret) with clearly labelled placeholders
(e.g., <ROOT_PASSWORD>, <DB_PASSWORD>, <ADMIN_TOKEN>, <GITHUB_STATE_SECRET>) and
update the text to instruct users to generate strong, unique secrets or pull
them from a secret manager before applying the overlay; ensure the examples do
not contain any fixed/default credentials and add a single-line note above the
commands explaining to replace placeholders with secure values or reference a
secret store.
- Around line 203-211: The deployment wait loop currently iterates "backend-api
frontend agentic-operator postgresql minio unleash" but omits the public API and
the DB init; update the loop that declares deploys (the line with for deploy in
...) to include public-api, and add an explicit rollout wait for
ambient-api-server-db (before the existing ambient-api-server wait) so Step 6's
usage of public-api-route and the subsequent exec into ambient-api-server-db
will block until those deployments are ready; target the symbolic names
public-api and ambient-api-server-db when adding the waits.

---

ℹ️ Review info

⚙️ Run configuration

**Configuration used**: Organization UI
**Review profile**: ASSERTIVE
**Plan**: Pro
**Run ID**: `86f2be1c-2bd0-4140-8dda-16106d5abda3`

📥 Commits

Reviewing files that changed from the base of the PR and between b7a58c5cd33742cc19586c0bef4900ed2b361a5e and 2eb2465302c9a4a6c86475f453f1544940030396.

📒 Files selected for processing (2)

* `.claude/skills/ambient/SKILL.md`
* `components/pr-test/README.md`
```bash
oc create secret generic ambient-vertex -n $NAMESPACE \
  --from-file=ambient-code-key.json=/path/to/service-account-key.json
```
Replace placeholder path with actionable guidance.
The placeholder /path/to/service-account-key.json is not actionable. Provide explicit instructions on how to obtain the service account key file (e.g., "Download from GCP Console > IAM & Admin > Service Accounts") or reference documentation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/skills/ambient/SKILL.md at line 103, The placeholder path
`/path/to/service-account-key.json` should be replaced with a real, downloaded
service account key file: instruct the user to create or select a service
account in GCP Console (IAM & Admin > Service Accounts), click "Keys" → "Add
Key" → "Create new key" and download the JSON, save it to a secure local path
(e.g., ./keys/ambient-service-account.json), and then update the --from-file
flag to point to that real file name instead of the placeholder; include a note
to keep the JSON secure and to follow the project's secrets/storage guidelines.
```
unleash-xxxxx    1/1   Running   0
```

Frontend shows `2/2` because of the oauth-proxy sidecar in the production overlay.
🧹 Nitpick | 🔵 Trivial
Clarify oauth-proxy sidecar context.
The note assumes the production overlay is always used, but this skill covers "production, ephemeral PR test instances, or developer clusters" (lines 5–6). If users deploy via the scripted installer (install.sh) or a custom overlay without oauth-proxy, the frontend pod would show 1/1 instead of 2/2, potentially confusing readers.
Rephrase to indicate this applies when using the production overlay or when oauth-proxy is configured.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/skills/ambient/SKILL.md at line 238, Update the sentence that says
the frontend shows "2/2" to clarify that this only occurs when the oauth-proxy
sidecar is present (e.g., when using the production overlay); mention that
installs via install.sh or custom overlays without oauth-proxy will show "1/1"
for the frontend pod. Reference the terms "frontend" pod, "production overlay",
"oauth-proxy sidecar", and "install.sh" in the rewording so readers know the
difference in expected readiness counts depending on whether oauth-proxy is
configured.
The operator creates runner pods in dynamically-created project namespaces. Those pods pull images from quay.io directly — no cross-namespace image access issue with quay. However, if you're using the OpenShift internal registry, grant pull access:

```bash
oc policy add-role-to-group system:image-puller system:serviceaccounts --namespace=$NAMESPACE
```
Overly permissive cross-namespace image pull policy.
The command grants system:image-puller to the cluster-wide system:serviceaccounts group, allowing any pod in any namespace to pull images from $NAMESPACE. On a shared cluster, this is a security posture gap.
Instead, grant access only to the specific namespaces where runner pods will be created. If the operator creates runner pods in dynamically-named project namespaces, scope the role binding per namespace or use a more targeted group.
Alternative approach
If you know the target namespace pattern (e.g., ambient-code--pr-*), grant access to specific service accounts or use a namespace-scoped RoleBinding instead of cluster-wide group membership:
```bash
# Example: grant pull access only to service accounts in a specific namespace
oc policy add-role-to-user system:image-puller \
  system:serviceaccount:<runner-namespace>:default \
  --namespace=$NAMESPACE
```

Or document that quay.io images avoid this issue entirely (as noted in lines 289–290).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/skills/ambient/SKILL.md at line 292, The command `oc policy
add-role-to-group system:image-puller system:serviceaccounts
--namespace=$NAMESPACE` is too permissive; change it to grant image-pull only to
the specific runner namespace or service account(s) instead of the cluster-wide
`system:serviceaccounts` group — e.g., replace that call with a namespace-scoped
binding that targets the runner service account (`oc policy add-role-to-user
system:image-puller system:serviceaccount:<runner-namespace>:<sa-name>
--namespace=$NAMESPACE`) or implement a RoleBinding per target namespace for the
`system:image-puller` role so only intended namespaces/users can pull from
`$NAMESPACE`.
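Equivalently, that `oc policy add-role-to-user` call can be written as a declarative, namespace-scoped RoleBinding; the namespace and service account names below are placeholders:

```yaml
# Namespace-scoped pull access for a single runner service account.
# <runner-namespace> and the metadata.namespace value are placeholders.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: image-puller-runner
  namespace: ambient-code--pr-1032   # the namespace that owns the images
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:image-puller
subjects:
  - kind: ServiceAccount
    name: default
    namespace: <runner-namespace>
```

Because the binding lives in the image-owning namespace and names a single subject, deleting that namespace on teardown also revokes the grant.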
### JWT Configuration

Production uses Red Hat SSO JWKS (`--jwk-cert-url=https://sso.redhat.com/...`). For ephemeral test instances, JWT validation may need to be disabled or pointed at a different issuer. Check the `ambient-api-server-jwt-args-patch.yaml` in the production overlay and adjust as needed for non-production contexts.
🧹 Nitpick | 🔵 Trivial
Provide concrete JWT configuration guidance for non-production contexts.
The note mentions that ephemeral test instances "may need to be disabled or pointed at a different issuer" but does not explain how. Users following this skill for PR test instances will hit JWT validation errors and have no clear remediation path.
Add a concrete example showing how to disable JWT validation or configure an alternative issuer for ephemeral namespaces (e.g., patching the deployment args or referencing the ambient-api-server-jwt-args-patch.yaml).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/skills/ambient/SKILL.md around lines 318 - 321, Update the "JWT
Configuration" section to include a concrete example for ephemeral/non‑prod
namespaces: show how to patch the ambient-api-server container args (referencing
ambient-api-server-jwt-args-patch.yaml) to either disable validation (e.g.,
override args to include a --disable-jwt-validation flag or remove the
--jwk-cert-url/--issuer args) or point to a test issuer by setting
--jwk-cert-url to a known test JWKS URL; include the exact arg keys to change
(--jwk-cert-url, --issuer, --disable-jwt-validation), a short example
description of patching the Deployment/DaemonSet container args for the
ambient-api-server, and note to revert for production.
- Remove -k (insecure TLS) from curl in install.sh and SKILL.md - Add ref: main and permissions: contents: read to pr-namespace-cleanup.yml - Replace hardcoded example credentials with <PLACEHOLDER> tokens in SKILL.md - Add public-api and ambient-api-server-db to rollout waits in install.sh and SKILL.md 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
♻️ Duplicate comments (6)
components/pr-test/install.sh (3)
105-108: ⚠️ Potential issue | 🟠 Major

Health verification is non-blocking; consider making it fail-fast with timeouts.

Line 106 uses `curl -s ... || true`, so backend failures are silently ignored and the script reports success regardless. Consider adding timeouts and failing on errors for better CI signal.

🔧 Suggested fix

```diff
 if [[ -n "$BACKEND_HOST" ]]; then
-  HEALTH=$(curl -s "https://${BACKEND_HOST}/health" || true)
+  if HEALTH=$(curl -fsS --connect-timeout 5 --max-time 20 \
+      --retry 3 --retry-all-errors "https://${BACKEND_HOST}/health"); then
+    echo "  Backend health: $HEALTH"
+  else
+    echo "  WARNING: Backend health check failed"
+  fi
-  echo "  Backend health: $HEALTH"
+else
+  echo "  WARNING: backend-route host not found"
 fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/pr-test/install.sh` around lines 105 - 108, The health check currently silences failures by using `curl -s ... || true`; update the `curl` invocation that populates `HEALTH` (when `BACKEND_HOST` is set) to use fail-fast options (e.g., `--fail` plus `--max-time`/`--connect-timeout`) and remove the `|| true` so non-2xx responses or timeouts cause a non-zero exit; if the `curl` call fails, print the error and exit the script (or return a non-zero status) instead of continuing, ensuring the `HEALTH` check fails the CI when the backend is unreachable.
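The fail-fast behaviour can be factored into a small retry helper and tested with stub commands in place of `curl`; function and variable names here are illustrative:

```shell
# Retry helper for health checks; the probe command is injected so the
# logic can be exercised without a real backend.
check_health() {
  local attempts="$1"; shift
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    sleep 0   # a real script would back off between attempts
  done
  echo "WARNING: health check failed after ${attempts} attempts" >&2
  return 1
}

check_health 3 true && echo "healthy"
check_health 2 false || echo "unhealthy"
```

In the installer this would wrap the real probe, e.g. `check_health 3 curl -fsS --max-time 20 "https://${BACKEND_HOST}/health"`, and the script could then `exit 1` on failure instead of masking it with `|| true`.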
52-59: ⚠️ Potential issue | 🔴 Critical

Update both tagged and untagged image mappings to ensure PR isolation.

`kustomize edit set image` matches by exact name field. If the production overlay contains untagged entries (e.g., `quay.io/ambient_code/vteam_frontend` without `:latest`), they won't be updated and could resolve to `latest` instead of the PR-specific tag.

🔧 Proposed fix - add untagged mappings

```diff
 kustomize edit set image \
+  "quay.io/ambient_code/vteam_frontend=quay.io/ambient_code/vteam_frontend:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_frontend:latest=quay.io/ambient_code/vteam_frontend:${IMAGE_TAG}" \
+  "quay.io/ambient_code/vteam_backend=quay.io/ambient_code/vteam_backend:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_backend:latest=quay.io/ambient_code/vteam_backend:${IMAGE_TAG}" \
+  "quay.io/ambient_code/vteam_operator=quay.io/ambient_code/vteam_operator:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_operator:latest=quay.io/ambient_code/vteam_operator:${IMAGE_TAG}" \
+  "quay.io/ambient_code/vteam_claude_runner=quay.io/ambient_code/vteam_claude_runner:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_claude_runner:latest=quay.io/ambient_code/vteam_claude_runner:${IMAGE_TAG}" \
+  "quay.io/ambient_code/vteam_state_sync=quay.io/ambient_code/vteam_state_sync:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_state_sync:latest=quay.io/ambient_code/vteam_state_sync:${IMAGE_TAG}" \
+  "quay.io/ambient_code/vteam_api_server=quay.io/ambient_code/vteam_api_server:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_api_server:latest=quay.io/ambient_code/vteam_api_server:${IMAGE_TAG}" \
+  "quay.io/ambient_code/vteam_public_api=quay.io/ambient_code/vteam_public_api:${IMAGE_TAG}" \
   "quay.io/ambient_code/vteam_public_api:latest=quay.io/ambient_code/vteam_public_api:${IMAGE_TAG}"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/pr-test/install.sh` around lines 52 - 59, kustomize edit set image only updates images that exactly match the name given, so add duplicate untagged mappings alongside the existing tagged ones in the kustomize command (the call to kustomize edit set image) for each image (e.g., add mappings for quay.io/ambient_code/vteam_frontend, vteam_backend, vteam_operator, vteam_claude_runner, vteam_state_sync, vteam_api_server, vteam_public_api without :latest) so they are also set to quay.io/ambient_code/<image>:${IMAGE_TAG}; update the same command block that currently lists the ":latest" entries to include the corresponding untagged entries to ensure PR isolation.
35-37: ⚠️ Potential issue | 🔴 Critical

Do not apply cluster-scoped CRDs/RBAC from PR branch content.
Applying cluster-scoped manifests from unmerged PR code allows ephemeral test jobs to mutate cluster-wide resources. This should be handled by a trusted/mainline pipeline or pre-provisioned, not during PR install.
🛡️ Suggested hardening
```diff
-echo "==> Step 1: Applying CRDs and RBAC (cluster-scoped, idempotent)"
-oc apply -k "$REPO_ROOT/components/manifests/base/crds/"
-oc apply -k "$REPO_ROOT/components/manifests/base/rbac/"
+if [[ "${APPLY_CLUSTER_SCOPED:-0}" == "1" ]]; then
+  echo "==> Step 1: Applying CRDs and RBAC (cluster-scoped, explicit opt-in)"
+  oc apply -k "$REPO_ROOT/components/manifests/base/crds/"
+  oc apply -k "$REPO_ROOT/components/manifests/base/rbac/"
+else
+  echo "==> Step 1: Skipping cluster-scoped apply (expected pre-installed)"
+fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/pr-test/install.sh` around lines 35 - 37, The script is applying cluster-scoped CRDs/RBAC from PR branch content via the two oc apply -k commands (the invocations that reference "$REPO_ROOT/components/manifests/base/crds/" and ".../rbac/"), which must not run in ephemeral PR installs; change the script to skip these applies unless running in a trusted/mainline context by guarding them with an explicit environment flag (e.g. MAINLINE_PIPELINE or SKIP_CLUSTER_MANIFESTS) or CI variable check, and when skipped, print a clear message that cluster-scoped resources are omitted; ensure the logic surrounds the oc apply -k calls (not other local-only manifests) so PR jobs never mutate cluster-wide resources.

components/pr-test/README.md (1)
1-11: ⚠️ Potential issue | 🟠 Major

Add a security model section addressing untrusted PR code execution.
Per the PR discussion (ktdreyer's comment), identify where untrusted PR code might run and document mitigation strategies. Key points to cover:
- Isolation boundaries (namespace-scoped, network policies via `security-zone: internal`)
- What executes PR code vs. trusted platform code (`provision.sh`/`install.sh` run from the `main` branch)
- Secrets management (namespace-scoped, cleaned up on teardown)
- The security implication of enabling `push: true` for PR builds (line 189)

This helps reviewers understand the trust model for ephemeral test instances.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/pr-test/README.md` around lines 1 - 11, Add a new "Security model" section to the README describing where untrusted PR code may run and mitigations: specify that S0.x instances are namespace-scoped and isolated with network policies (use label/security-zone: internal), list which artifacts are executed from trusted main branch (provision.sh and install.sh) vs what comes from the PR branch, explain secrets handling (namespace-scoped secrets, scrubbed on teardown, no cluster-wide secrets exposed), and call out the security implications and required review when enabling the push: true flag for PR builds so reviewers know the increased risk and required controls.

.claude/skills/ambient/SKILL.md (1)
290-294: ⚠️ Potential issue | 🟠 Major

Overly permissive cross-namespace image pull policy.
The command grants `system:image-puller` to the cluster-wide `system:serviceaccounts` group, allowing any pod in any namespace to pull images from `$NAMESPACE`. On a shared cluster this is a security posture gap.

Consider granting access only to specific runner namespaces or service accounts, or note that quay.io images (which this skill primarily uses) avoid this issue entirely.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/ambient/SKILL.md around lines 290 - 294, The documented oc command grants system:image-puller to the cluster-wide group system:serviceaccounts which is overly permissive; change the guidance to recommend granting pull access only to the specific runner namespace or specific service accounts (e.g., bind system:image-puller to system:serviceaccounts:<runner-namespace> or to a named ServiceAccount) instead of the entire cluster group, and/or note that using quay.io avoids the internal registry cross-namespace pull requirement; reference the existing command and the principals system:image-puller and system:serviceaccounts and replace the example with a scoped approach (per-namespace rolebinding or role-to-serviceaccount) or an explicit note recommending scoped access only..github/workflows/pr-namespace-cleanup.yml (1)
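A scoped alternative might look like the following sketch (`$NAMESPACE` and `$RUNNER_NS` are placeholder values, not names from this repo):

```sh
# Sketch: grant image pulls only to service accounts in one runner
# namespace, instead of the cluster-wide system:serviceaccounts group.
oc policy add-role-to-group system:image-puller \
  "system:serviceaccounts:${RUNNER_NS}" \
  --namespace="$NAMESPACE"
```

This keeps the RoleBinding namespace-scoped, so only pods in the named runner namespace can pull from the internal registry's `$NAMESPACE` image streams.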
28-34: ⚠️ Potential issue | 🟡 Minor

Guard against the edge case where `SLUG_MAX` becomes negative.

For very high PR numbers (e.g., 7+ digits), `SLUG_MAX` could become zero or negative, and if `BRANCH_SLUG` ends up empty, `INSTANCE_ID` would have a trailing dash. Also, `$GITHUB_OUTPUT` should be quoted per shellcheck SC2086.

🛡️ Proposed fix
```diff
 PR_LEN=${#PR_NUMBER}
 SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN ))
+[[ $SLUG_MAX -lt 0 ]] && SLUG_MAX=0
 BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}"
-INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
-echo "instance_id=${INSTANCE_ID}" >> $GITHUB_OUTPUT
+if [[ -n "$BRANCH_SLUG" ]]; then
+  INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"
+else
+  INSTANCE_ID="pr-${PR_NUMBER}"
+fi
+echo "instance_id=${INSTANCE_ID}" >> "$GITHUB_OUTPUT"
 echo "Cleaning up instance: ${INSTANCE_ID}"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/pr-namespace-cleanup.yml around lines 28 - 34, Guard SLUG_MAX against negative values and avoid a trailing dash when BRANCH_SLUG is empty: after computing SLUG_MAX (from PR_LEN) clamp it to zero if negative (e.g., if [ "$SLUG_MAX" -lt 0 ]; then SLUG_MAX=0; fi), then derive BRANCH_SLUG from SAFE_BRANCH as before; build INSTANCE_ID conditionally so it omits the dash when BRANCH_SLUG is empty (e.g., if [ -n "$BRANCH_SLUG" ]; then INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"; else INSTANCE_ID="pr-${PR_NUMBER}"; fi); and quote the GITHUB_OUTPUT variable when writing (use >> "$GITHUB_OUTPUT"). Use the symbols PR_LEN, SLUG_MAX, SAFE_BRANCH, BRANCH_SLUG, INSTANCE_ID and GITHUB_OUTPUT to locate the changes.
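The guarded derivation described in the prompt can be sketched as a pure-bash helper (the function name is hypothetical; 63 is the Kubernetes object-name length limit, and 14 + 4 mirror the prefix/suffix budget used in the workflow):

```shell
#!/usr/bin/env bash
# Sketch: derive an instance id that never exceeds the name budget
# and never ends in a trailing dash when the branch slug is empty.
derive_instance_id() {
  local pr_number="$1" safe_branch="$2"
  local pr_len=${#pr_number}
  local slug_max=$(( 63 - 14 - 4 - pr_len ))
  if (( slug_max < 0 )); then
    slug_max=0   # clamp so substring expansion never gets a negative length
  fi
  local branch_slug="${safe_branch:0:$slug_max}"
  if [[ -n "$branch_slug" ]]; then
    printf 'pr-%s-%s\n' "$pr_number" "$branch_slug"
  else
    printf 'pr-%s\n' "$pr_number"   # omit the dash entirely
  fi
}
```

For a normal PR number the slug survives intact; for pathologically long PR numbers the function degrades to `pr-<N>` instead of producing an invalid name.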
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.claude/skills/ambient/SKILL.md:
- Around line 290-294: The documented oc command grants system:image-puller to
the cluster-wide group system:serviceaccounts which is overly permissive; change
the guidance to recommend granting pull access only to the specific runner
namespace or specific service accounts (e.g., bind system:image-puller to
system:serviceaccounts:<runner-namespace> or to a named ServiceAccount) instead
of the entire cluster group, and/or note that using quay.io avoids the internal
registry cross-namespace pull requirement; reference the existing command and
the principals system:image-puller and system:serviceaccounts and replace the
example with a scoped approach (per-namespace rolebinding or
role-to-serviceaccount) or an explicit note recommending scoped access only.
In @.github/workflows/pr-namespace-cleanup.yml:
- Around line 28-34: Guard SLUG_MAX against negative values and avoid a trailing
dash when BRANCH_SLUG is empty: after computing SLUG_MAX (from PR_LEN) clamp it
to zero if negative (e.g., if [ "$SLUG_MAX" -lt 0 ]; then SLUG_MAX=0; fi), then
derive BRANCH_SLUG from SAFE_BRANCH as before; build INSTANCE_ID conditionally
so it omits the dash when BRANCH_SLUG is empty (e.g., if [ -n "$BRANCH_SLUG" ];
then INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}"; else
INSTANCE_ID="pr-${PR_NUMBER}"; fi); and quote the GITHUB_OUTPUT variable when
writing (use >> "$GITHUB_OUTPUT"). Use the symbols PR_LEN, SLUG_MAX,
SAFE_BRANCH, BRANCH_SLUG, INSTANCE_ID and GITHUB_OUTPUT to locate the changes.
In `@components/pr-test/install.sh`:
- Around line 105-108: The health check currently silences failures by using
`curl -s ... || true`; update the `curl` invocation that populates `HEALTH`
(when `BACKEND_HOST` is set) to use fail-fast options (e.g., `--fail` plus
`--max-time`/`--connect-timeout`) and remove the `|| true` so non-2xx responses
or timeouts cause a non-zero exit; if the `curl` call fails, print the error and
exit the script (or return a non-zero status) instead of continuing, ensuring
the `HEALTH` check fails the CI when the backend is unreachable.
- Around line 52-59: kustomize edit set image only updates images that exactly
match the name given, so add duplicate untagged mappings alongside the existing
tagged ones in the kustomize command (the call to kustomize edit set image) for
each image (e.g., add mappings for quay.io/ambient_code/vteam_frontend,
vteam_backend, vteam_operator, vteam_claude_runner, vteam_state_sync,
vteam_api_server, vteam_public_api without :latest) so they are also set to
quay.io/ambient_code/<image>:${IMAGE_TAG}; update the same command block that
currently lists the ":latest" entries to include the corresponding untagged
entries to ensure PR isolation.
- Around line 35-37: The script is applying cluster-scoped CRDs/RBAC from PR
branch content via the two oc apply -k commands (the invocations that reference
"$REPO_ROOT/components/manifests/base/crds/" and ".../rbac/"), which must not
run in ephemeral PR installs; change the script to skip these applies unless
running in a trusted/mainline context by guarding them with an explicit
environment flag (e.g. MAINLINE_PIPELINE or SKIP_CLUSTER_MANIFESTS) or CI
variable check, and when skipped, print a clear message that cluster-scoped
resources are omitted; ensure the logic surrounds the oc apply -k calls (not
other local-only manifests) so PR jobs never mutate cluster-wide resources.
In `@components/pr-test/README.md`:
- Around line 1-11: Add a new "Security model" section to the README describing
where untrusted PR code may run and mitigations: specify that S0.x instances are
namespace-scoped and isolated with network policies (use label/security-zone:
internal), list which artifacts are executed from trusted main branch
(provision.sh and install.sh) vs what comes from the PR branch, explain secrets
handling (namespace-scoped secrets, scrubbed on teardown, no cluster-wide
secrets exposed), and call out the security implications and required review
when enabling the push: true flag for PR builds so reviewers know the increased
risk and required controls.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6db8f4d5-602f-45b4-9e1e-d19494fdf5a9
📒 Files selected for processing (6)
- .claude/skills/ambient-pr-test/SKILL.md
- .claude/skills/ambient/SKILL.md
- .github/workflows/pr-namespace-cleanup.yml
- components/pr-test/README.md
- components/pr-test/install.sh
- docs/internal/proposals/acp-public-rest-api.md
💤 Files with no reviewable changes (1)
- docs/internal/proposals/acp-public-rest-api.md
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/pr-test/README.md`:
- Around line 69-82: The example TenantNamespace CR (kind: TenantNamespace,
name: pr-123-example) is missing the required label ambient-code/instance-type:
s0x under metadata.labels; add that label alongside the existing tenant and
namespace-type labels so the provisioner capacity check (referenced around lines
107-108) will count this instance and match the schema expectations mentioned
near lines 56 and 144.
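A minimal sketch of the labeled CR (the API version and the values of the pre-existing labels are placeholders; only the `ambient-code/instance-type: s0x` label is the point of the finding):

```yaml
kind: TenantNamespace
metadata:
  name: pr-123-example
  labels:
    ambient-code/instance-type: s0x   # counted by the provisioner capacity check
    # ...existing tenant and namespace-type labels stay as-is...
```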
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8aebdebc-b679-4112-9c35-868a57e925b8
📒 Files selected for processing (3)
- .claude/skills/ambient-pr-test/SKILL.md
- components/pr-test/README.md
- components/pr-test/build.sh
```yaml
- name: Log in to OpenShift
  run: |
    oc login "${{ secrets.TEST_OPENSHIFT_SERVER }}" \
```
I'll note one thing here: if we want to deploy to an MP+ cluster behind the firewall, we can use ArgoCD of course, or we can use the internal GH runner as documented in #793.
```diff
@@ -1,214 +0,0 @@
-# Proposal: ACP Public REST API
```

```diff
 # OpenShift Cluster Development

-This guide covers deploying the Ambient Code Platform on OpenShift clusters for development and testing. Use this when you need to test OpenShift-specific features like Routes, OAuth integration, or service mesh capabilities.
+This guide covers deploying Ambient Code on an OpenShift cluster using the **OpenShift internal image registry**. This is useful when iterating on local builds against a dev cluster without pushing to quay.io.
```
Can this change be broken out into a separate PR? I don't understand what this has to do with this change.
By introducing the logic into the skill, it becomes reusable. It's like a code refactor where we make a change by creating a new dual-use function and refactoring the old code to use it. This developer guide was my original how-to script for Claude.
```sh
print(doc)
PYEOF

kustomize build . \
```
```diff
-kustomize build . \
+oc kustomize build . \
```
See #768
```diff
@@ -0,0 +1,448 @@
+# Specification: Ephemeral PR Test Environments on MPP
```
This doc is helpful to understand the design thinking, but some things are already out of date (for example "no branch slug" is wrong). If this stays in this PR, it needs another pass to reconcile with the actual implementation. (See also #952)
There is significant duplication between the SKILL.md files and this file. For maintainability, I wonder if we could trim these both down and dedup where it makes sense.
```sh
  "quay.io/ambient_code/vteam_public_api:latest=quay.io/ambient_code/vteam_public_api:${IMAGE_TAG}"

FILTER_SCRIPT="$TMPDIR/filter.py"
cat > "$FILTER_SCRIPT" << 'PYEOF'
```
This inline Python script is 35 lines of YAML manipulation via regex. That will be very challenging to maintain.

Could this be a standalone Python file in `components/pr-test/filter.py` with proper tests? Or better, could you use `yq` or `oc` instead of regex?
ktdreyer
left a comment
Biggest thing we must address is the warnings from GH re: acting on untrusted code
- Build workflow pushes all 7 component images tagged `pr-<N>-amd64` to quay on every PR
- provision.sh creates/destroys TenantNamespace CRs on `ambient-code--config`; capacity-gated at 5 concurrent instances
- install.sh deploys production manifests with PR image tags using ArgoCD SA token; handles MPP restricted environment constraints (Route labels, PVC annotations, ClusterRoleBinding subject patching)
- pr-e2e-openshift.yml workflow: provision → install → e2e → teardown on build completion
- pr-namespace-cleanup.yml: safety-net teardown on PR close
- Skills: ambient-pr-test (full PR test workflow) and ambient (install on any OpenShift namespace)
- Validates required secrets before install; documents MPP resource inventory and constraints
- Route admission webhook fix: add `paas.redhat.com/appcode` label via kustomize filter

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Force-pushed from c7cfaf8 to cb00ff0 (Compare)
- components-build-deploy: push immutable SHA-tagged PR image alongside mutable `pr-<N>-<arch>` tag to prevent race conditions across commits
- pr-e2e-openshift: concurrency group null safety (`|| workflow_run.id`), clamp SLUG_MAX to 0 minimum / skip branch slug if empty, remove `--insecure-skip-tls-verify`, install kustomize for kustomize-edit steps
- install.sh: add both `:latest` and untagged kustomize set-image mappings so all image name variants are rewritten; add vteam_control_plane and vteam_mcp; replace `kustomize build` with `oc kustomize`; make health check fail-fast with curl timeouts and exit on missing routes
- provision.sh: atomic ConfigMap lock before capacity check (TOCTOU fix); wait for TenantNamespace Ready condition, not only Namespace Active; fix destroy() to detect only (NotFound) as success, not all oc errors; delete lock ConfigMap on destroy
- openshift.md: add vteam_state_sync to internal registry push loop

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Review Queue Status
Action needed: Resolve merge conflicts
Summary
- Build workflow: pushes all component images on every PR (`pr-<N>-amd64` tag), not just changed components
- Provisioner (`components/pr-test/provision.sh`): creates/destroys `TenantNamespace` CRs on `ambient-code--config`; capacity-gated at 5 concurrent S0.x instances
- Installer (`components/pr-test/install.sh`): deploys production manifests with PR image tags; copies `ambient-vertex` and `ambient-api-server` secrets from source namespace via `oc`; mirrors Vertex AI config from `operator-config`
- Workflows: `pr-e2e-openshift.yml` (provision → install → e2e → teardown on build completion) and `pr-namespace-cleanup.yml` (safety-net teardown on PR close)
- Skills: `.claude/skills/ambient/` (install on any OpenShift namespace) and `.claude/skills/ambient-pr-test/` (full PR test workflow)
- Docs: `openshift.md` trimmed to internal-registry-only content with links to skills

Test plan
- `TEST_OPENSHIFT_SERVER` and `TEST_OPENSHIFT_TOKEN` GitHub Actions secrets pointing at `dev-spoke-aws-us-east-1`
- Images pushed as `pr-<N>-amd64` to quay on this PR
- `pr-e2e-openshift.yml` triggers and provisions `ambient-code--pr-<N>-*` namespace
- `pr-namespace-cleanup.yml` fires on PR close

🤖 Generated with Claude Code