feat: add ContainerGrader for Kubernetes/Docker sandboxed grading + migrate to uv by blarghmatey · Pull Request #122 · openedx/xqueue-watcher

blarghmatey · 2026-03-26T17:37:30Z

What are the relevant tickets?

N/A — this is a contribution from the mitodl/xqueue-watcher fork, developed for MIT OpenCourseWare production use.

Description (What does it do?)

This PR adds two major capabilities and modernises the project tooling:

1. ContainerGrader — Kubernetes & Docker sandboxed grading

A new ContainerGrader handler (xqueue_watcher/containergrader.py) that runs graders inside isolated containers instead of the AppArmor/codejail jail. Two backends are supported:

Kubernetes — spins up a one-shot Job via the kubernetes client, mounts submission as a ConfigMap, streams pod logs, enforces CPU/memory resource limits, and cleans up after itself.
Docker — runs a short-lived container via the Docker SDK with the same interface, useful for local development.

Key design points:

Grader image is configurable per-queue; the same grader_support/entrypoint.py entry point is used for both backends.
Configurable resource limits (CPU, memory), timeouts, namespace, image pull policy.
grader_support/Dockerfile.base — a minimal base image with grader_support/entrypoint.py baked in; grader-specific images FROM this base.
New EnvSettings loader (xqueue_watcher/env_settings.py) reads full watcher/queue config from environment variables, making it easy to configure inside Kubernetes pods.

2. Kubernetes deployment manifests

Ready-to-use Kustomize manifests under deploy/kubernetes/:

Deployment for the watcher
ConfigMap for queue configuration
Secret template for XQueue credentials
ServiceAccount + RBAC roles so the watcher can create/watch/delete Job and Pod resources in its own namespace
NetworkPolicy to restrict egress

3. Dependency management migration: pip-compile → uv

Replaces setup.py + requirements/*.txt with pyproject.toml + uv.lock.
CI updated to use astral-sh/setup-uv, adds Python 3.13, drops EOL Python 3.8.
make test now runs uv sync && uv run pytest.

4. codejail made optional

JailedGrader no longer hard-depends on AppArmor/codejail. The import is wrapped in a try/except; if codejail is unavailable a clear RuntimeError directs operators to ContainerGrader.

5. Named XQueue server references (`SERVER_REF` / `xqueue_servers.json`)

Operators can define shared XQueue server credentials once in xqueue_servers.json and reference them by name in per-queue config via SERVER_REF, avoiding credential duplication across many queue definitions.

6. Metrics

A new xqueue_watcher/metrics.py module exposes Prometheus-compatible counters/histograms for grading latency, queue depth, and error rates.

How can this be tested?

Unit tests — the PR ships a full test suite for all new code:

# Install uv if needed: https://docs.astral.sh/uv/getting-started/installation/
make test

Local Docker smoke test:

# Build grader base image
docker build -f grader_support/Dockerfile.base -t xqueue-grader-base grader_support/

# Start watcher + mock xqueue
docker compose up

Kubernetes smoke test:

kubectl apply -k deploy/kubernetes/
kubectl -n xqueue-watcher logs -l app=xqueue-watcher -f

Additional Context

This code has been running in MIT OpenCourseWare production Kubernetes clusters. The ContainerGrader replaces our previous codejail-based setup and has been stable.

The upstream remote in the mitodl fork still points to github.com/edx/xqueue-watcher; this PR is submitted from mitodl:feat/containergrader-kubernetes which is rebased cleanly on top of openedx/xqueue-watcher:master (239ee5c).

…oxed grading (#14) * chore: migrate from pip-compile to uv for dependency management - Run migrate-to-uv to bootstrap pyproject.toml from requirements/base.txt and requirements/test.txt - Add full project metadata: name, version, description, requires-python>=3.11, license, hatchling build backend, entry point xqueue-watcher -> manager:main - Add newrelic as [project.optional-dependencies.production] - Add dev dependency group: coverage, mock, pytest-cov - Remove setup.py (replaced by pyproject.toml) - Remove all requirements/*.in and requirements/*.txt files (14 files) - Generate uv.lock with pinned dependency graph - Update Makefile: replace pip/pip-compile targets with uv sync / uv run pytest - Update .github/workflows/ci.yml: use astral-sh/setup-uv@v4, drop ubuntu-20.04 and Python 3.8, add Python 3.13, update to actions/checkout@v4 - Replace upgrade-python-requirements workflow with uv lock --upgrade + create-pull-request workflow Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: remove AppArmor/codejail hard dependency; make codejail optional - Remove six (Python 2 compat shim) from imports and SUPPORT_FILES in jailedgrader.py — Python 3 only going forward - Wrap codejail imports in try/except in jailedgrader.py and manager.py; raise RuntimeError with clear message directing users to ContainerGrader - Fix Path.abspath() -> Path.absolute() (breaking API change in path v17) in grader.py and jailedgrader.py - Update Dockerfile: ubuntu:xenial -> python:3.11-slim, remove apparmor and language-pack-en packages, fix layer ordering - Update test_codejail_config to use fork_per_item=False to avoid multiprocessing state-inheritance failure on Python 3.14 forkserver default - Update conf.d/600.json example to use ContainerGrader handler Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add ContainerGrader for Kubernetes/Docker-based sandboxed grading Adds xqueue_watcher/containergrader.py — a drop-in replacement for JailedGrader that executes student code inside an isolated container instead of using AppArmor/codejail. Security model (replaces AppArmor): - Container isolation (Linux namespaces + cgroups) - Non-root user (UID 1000), read-only root filesystem - CPU/memory resource limits enforced by container runtime - Network disabled for grader containers (no egress) - Hard wall-clock timeout via activeDeadlineSeconds (k8s) or timeout (Docker) Two pluggable backends selected via the 'backend' KWARGS option: kubernetes (default / production): - Creates a batch/v1 Job per submission using the kubernetes Python client - Auto-detects in-cluster vs kubeconfig credentials - Polls until Job completes, collects stdout from pod logs - Deletes the Job after result collection (ttlSecondsAfterFinished=300) - Job pod spec includes: securityContext, resource limits, activeDeadlineSeconds, and labels for observability docker (local dev / CI): - Runs a container via the docker Python SDK - Bind-mounts the grader directory read-only - Passes SUBMISSION_CODE as an environment variable - Network disabled, memory + CPU limits applied Student code is passed via SUBMISSION_CODE env var (avoids argv length limits and shell injection). The entrypoint writes it to /tmp before invoking grader_support.run, producing the same JSON output format that JailedGrader already expects — so no changes to grader test framework or course team grader code are required. Configuration example (conf.d/my-course.json): { "my-course": { "HANDLERS": [{ "HANDLER": "xqueue_watcher.containergrader.ContainerGrader", "KWARGS": { "grader_root": "/graders/my-course/", "image": "registry.example.com/my-course:latest", "backend": "kubernetes", "cpu_limit": "500m", "memory_limit": "256Mi", "timeout": 20 } }] } } Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add grader base Docker image and container entrypoint grader_support/Dockerfile.base: - python:3.11-slim base, non-root grader user (UID 1000) - Copies grader_support framework; installs path-py - ENTRYPOINT: python -m grader_support.entrypoint - /tmp volume for submission files (writable even with read-only root fs) - Course teams extend this image to add their deps and grader scripts grader_support/entrypoint.py: - Reads SUBMISSION_CODE env var, writes to /tmp/submission.py - Adds /tmp and cwd to sys.path, then delegates to grader_support.run - Prints JSON result to stdout (same schema JailedGrader already parses) grader_support/README.md: - Course team authoring guide: how to extend the base image, configure the handler, and understand the security properties Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add Kubernetes deployment manifests and Docker Compose local dev deploy/kubernetes/ (Kustomize-compatible): - serviceaccount.yaml — dedicated SA for xqueue-watcher pods - rbac.yaml — Role + RoleBinding: create/delete Jobs, read pod logs - configmap.yaml — watcher xqwatcher.json config (edit for your queues) - deployment.yaml — 2 replicas, topologySpreadConstraints, securityContext, resource limits, readinessProbe - networkpolicy.yaml — deny all ingress/egress on grader Job pods (label: role=grader-job); allow xqueue-watcher egress to xqueue - secret.yaml.template — placeholder: copy to secret.yaml, fill in credentials, do not commit secret.yaml (added to .gitignore) - kustomization.yaml — Kustomize entry point for the base directory docker-compose.yml (local dev): - xqueue-watcher container with docker socket access (for docker backend) - Mounts conf.d/ and grader directories - Includes a sample xqueue service reference for full local stack Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: correct grader path handling in ContainerGrader and entrypoint ContainerGrader had two bugs affecting how grader files were located inside the container at runtime: 1. Docker backend bind-mounted the grader problem directory at /grader, overwriting the grader_support package that the base image copies there. Fixed by binding at /graders instead and passing the resulting absolute in-container path (/graders/<file>) to the entrypoint. 2. Kubernetes backend set working_dir to the grader problem directory (e.g. /graders/ps07/Robot/), preventing Python from finding the grader_support package which lives at /grader/grader_support/. Fixed by keeping working_dir=/grader (the base image WORKDIR) and passing the absolute grader path in args instead of just the basename. entrypoint.py previously passed the full absolute path verbatim to __import__(), which fails for paths containing slashes. It now detects absolute paths, inserts the parent directory into sys.path, and uses only the basename as the importable module name. Also updates grader_support/README.md to document the correct layout (/graders/ for course grader scripts, /grader/ for grader_support) and the gradelib compatibility note for course teams migrating from Python 2 graders. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(tests): skip jailed grader tests when codejail is not installed codejail is an optional dependency (not installed in CI). Guard the import with a try/except and apply @pytest.mark.skipif to the test class so collection succeeds and tests are skipped gracefully when codejail is absent. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: address PR review feedback - Dockerfile: replace deleted requirements/ pip install with uv sync (copies uv binary from ghcr.io/astral-sh/uv and uses uv sync --frozen) - grader.py: guard against path traversal in grader_config['grader']; validate that the resolved grader path stays within grader_root - containergrader.py: fix Docker SDK TypeError - containers.run() does not accept a timeout kwarg; switch to detach=True + container.wait() to enforce the timeout, then collect logs and remove the container - containergrader.py: remove brittle hardcoded line numbers (L364, L379, L397, L450) from user-facing error messages - docker-compose.yml: change conf.d and data volumes from :ro to :rw so local edits take effect without rebuild (matches comment intent) - upgrade-python-requirements.yml: add explicit permissions block (contents: write, pull-requests: write) as required by security policy - Automated code Graders With xqueue-watcher.md: remove empty heading, add 'Property' header to comparison table Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * refactor: replace path-py with stdlib pathlib path-py is an external dependency that wraps pathlib with a fluent API. Since we now require Python >= 3.11, pathlib covers all the same functionality without an extra dependency. Changes: - Replace 'from path import Path' with 'from pathlib import Path' in all source and test files - .dirname() → .parent - .basename() → .name - .absolute() / .absolute() → .resolve() (symlink-safe) - .files('*.json') → .glob('*.json') (with sorted() for stable ordering) - Remove path-py (path-py / path) from pyproject.toml dependencies - Regenerate uv.lock (removes path==17.1.1 and path-py==12.5.0) - Simplify grader.py path-traversal check: now that grader_path is a native pathlib.Path, the inline 'import pathlib' is no longer needed - Fix test_grader.py mock: grader_path.endswith() → grader_path.name == - Fix test_manager.py: pass str() to argparse (Path is not subscriptable) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add edx-codejail as optional dependency; document container isolation decision Add edx-codejail (the upstream PyPI package, v4.1.0) as an optional 'codejail' extra, replacing the previously pinned git-URL reference to a specific commit. uv add --optional codejail edx-codejail codejail is intentionally excluded from the base Docker image because ContainerGrader uses container-level isolation (Linux namespaces, cgroups, capability dropping, network isolation, read-only filesystem) which provides equivalent sandboxing to AppArmor without requiring host-level AppArmor configuration that is unavailable inside Kubernetes pods. Install the 'codejail' extra only when using the legacy JailedGrader on a bare-metal or VM host with AppArmor configured. To use: uv sync --extra codejail Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: address second round of PR review feedback - Makefile: fix tab indentation on all recipe lines (was space-indented) - grader.py: remove unused sys import - jailedgrader.py: replace deprecated load_module() with spec_from_file_location/exec_module - containergrader.py: - remove unused imports (logging, os, tempfile) and _JOB_LABEL constant - add emptyDir volume at /tmp in K8s Job spec (required when read_only_root_filesystem=True) - add clarifying comment that K8s grader scripts are baked into the course image - replace deprecated load_module() with importlib.util spec/exec_module pattern - capture stderr from Docker container on non-zero exit for better diagnostics - grader_support/entrypoint.py: correct misleading comment about /tmp writability - deploy/kubernetes/deployment.yaml: fix command to use xqueue-watcher entry point - deploy/kubernetes/configmap.yaml: add xqueue-watcher-queue-configs ConfigMap so manifests apply cleanly out of the box - docker-compose.yml: mount Docker socket for docker backend to work - conf.d/600.json: use absolute /graders/ path instead of relative ../data path - Dockerfile: use C.UTF-8 locale (available without installing locales package) - pyproject.toml: add edx-codejail to dev group so jailed grader tests run in CI Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * refactor: move full grading pipeline into container; add ContainerGrader unit tests Architecture change: grader scripts are baked into the course-specific Docker image, so the watcher pod has no need to access grader files locally. The grader_support entrypoint now runs the complete grading pipeline inside the container (load grader, preprocess, run answer + submission, compare, return JSON grade), and ContainerGrader.grade() is simplified to just launch the container and parse its JSON output. Changes: - grader_support/entrypoint.py: complete rewrite; now takes GRADER_FILE SEED (not GRADER_FILE submission.py SEED); runs full grade pipeline in container; reads GRADER_LANGUAGE and HIDE_OUTPUT env vars from ContainerGrader - xqueue_watcher/containergrader.py: - Remove grader-module loading, gettext, answer.py reading, and all test- comparison logic from grade() — the container handles this now - grade() now just calls _run() and parses the returned JSON - _run() accepts grader_config and forwards lang/hide_output as env vars - _build_k8s_job(): args are now [grader_abs, seed] (not 3 args), adds GRADER_LANGUAGE and HIDE_OUTPUT env vars, still mounts emptyDir at /tmp - _run_docker(): same arg change; passes GRADER_LANGUAGE and HIDE_OUTPUT - ReadTimeout from container.wait() caught and re-raised as clear RuntimeError - Remove unused _truncate, _prepend_coding, importlib.util - tests/test_container_grader.py: 36 new unit tests covering: - _parse_cpu_millis - ContainerGrader init / backend validation - _build_k8s_job: args, env vars, resource limits, emptyDir/tmp, security - _run_docker: success, non-zero exit (with stderr), timeout, missing SDK - grade(): skip_grader, successful result, container failure, size warning Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * refactor: replace statsd/newrelic with OpenTelemetry; add 12-factor settings - Remove dogstatsd-python dependency; replace statsd instrumentation in grader.py with OpenTelemetry counters and a histogram - Add xqueue_watcher/metrics.py: configure_metrics() wires a MeterProvider with an OTLP HTTP exporter when OTEL_EXPORTER_OTLP_ENDPOINT is set; all four instruments (process_item, grader_payload_error, grading_time, replies) defined at module level against the global proxy meter - Call configure_metrics() from Manager.configure_from_directory() so the real provider is installed before any submissions are processed - Add xqueue_watcher/env_settings.py: get_manager_config_from_env() reads all manager config from XQWATCHER_* environment variables, compatible with 12-factor / Kubernetes deployment patterns - Remove newrelic from the production optional-dependency group and from the edx.org Dockerfile stage; the stage now runs xqueue-watcher directly - Add opentelemetry-api, opentelemetry-sdk, opentelemetry-exporter-otlp-proto-http to core dependencies; regenerate uv.lock - Add tests/test_env_settings.py and tests/test_metrics.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore: remove planning doc from git tracking Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore: remove codecov upload from CI Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: address PR #14 review feedback - docker-compose.yml: remove unused GRADER_BACKEND env var, fix duplicate volumes key by merging into one list, tag sample-grader with image: grader-base:local so conf.d/600.json reference resolves - Dockerfile: standardise CMD config path to /etc/xqueue-watcher to match docker-compose and Kubernetes manifests - metrics.py: remove OTEL_METRIC_EXPORT_INTERVAL from docstring since it is not wired up in _build_meter_provider() - containergrader.py: add pod template metadata labels so the NetworkPolicy podSelector (app.kubernetes.io/component=xqueue-grader) actually matches grading pods; set automount_service_account_token=False on the grading pod spec to reduce blast radius if the NetworkPolicy is misconfigured; add _parse_memory_bytes() helper and use it for the Docker backend mem_limit so Kubernetes-style strings like '256Mi' are converted to bytes rather than passed raw (which Docker does not accept) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: add venv bin to PATH so xqueue-watcher entrypoint resolves uv installs the console script into the project virtual environment at .venv/bin/xqueue-watcher. Without adding this directory to PATH the CMD cannot be found at container startup. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add configure_logging() for 12-factor stdout logging When no logging.json file is present, manager.py now calls configure_logging() from env_settings instead of basicConfig(). configure_logging() sets up a single StreamHandler on stdout with a consistent timestamp/level/module format, honours XQWATCHER_LOG_LEVEL (default INFO), and suppresses noisy requests/urllib3 debug output. This removes the need for a logging.json file in Kubernetes deployments. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: symlink xqueue-watcher into /usr/local/bin for reliable resolution Using PATH via ENV is fragile -- container runtimes and security policies can reset or ignore it. Install a symlink at /usr/local/bin/xqueue-watcher (always in the standard system PATH) so the entrypoint resolves regardless of how the container is launched. Also remove the stale NEW_RELIC_LICENSE_KEY env entry from the Kubernetes deployment manifest. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add env-based defaults for ContainerGrader configuration Add get_container_grader_defaults() to env_settings, reading five new XQWATCHER_GRADER_* env vars: XQWATCHER_GRADER_BACKEND (default: kubernetes) XQWATCHER_GRADER_NAMESPACE (default: default) XQWATCHER_GRADER_CPU_LIMIT (default: 500m) XQWATCHER_GRADER_MEMORY_LIMIT (default: 256Mi) XQWATCHER_GRADER_TIMEOUT (default: 20) ContainerGrader.__init__ now uses None sentinels for these params so that any value omitted from a conf.d KWARGS block falls back to the env-derived default rather than a hardcoded constant. Values supplied explicitly in conf.d always take precedence, preserving backwards compatibility. Also fixes duplicate function definitions that had crept into env_settings.py. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat(containergrader): add ImageDigestPoller and image pull policy support - Add ImageDigestPoller class: background daemon thread that periodically resolves a tag-based image reference to its current digest via docker.APIClient.inspect_distribution(). Thread-safe; falls back to the original reference if resolution fails. - Add image_pull_policy param to ContainerGrader (auto-detect: IfNotPresent for digest refs, Always for tag-based refs; can be overridden explicitly). - Add poll_image_digest and digest_poll_interval params to activate the poller. When enabled, Kubernetes Jobs use the most recently resolved repo@sha256:… reference via _effective_image(), ensuring nodes always run the latest pushed image without relying on imagePullPolicy: Always for every pod. - Add .github/workflows/publish-grader-base-image.yml to build and push grader_support/Dockerfile.base to ghcr.io/mitodl/xqueue-watcher-grader-base on push to master (grader_support/** paths), weekly schedule, and workflow_dispatch. Multi-platform linux/amd64,linux/arm64. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: normalise imagePullPolicy to title-case before K8s API call Kubernetes requires imagePullPolicy to be exactly 'Always', 'IfNotPresent', or 'Never' (case-sensitive). When the value is supplied via KWARGS in the conf.d JSON (e.g. 'always' or 'ALWAYS'), the K8s API returns 422 Unprocessable Entity. Add a normalisation dict lookup that maps the lowercased input back to the canonical title-case form. Unknown values are passed through unchanged so Kubernetes can surface the validation error with a clear message. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add strip_path_components to ContainerGrader for legacy path prefixes LMS grader_payload 'grader' fields configured against the old git-clone deployment include a queue-name prefix, e.g.: mit-600x-Watcher-MITX-6.0001r/graders/python3graders/chips1/.../grade.py In the containerized approach, graders are baked directly into the image at grader_root, so the path resolves to: /graders/mit-600x-Watcher-MITX-6.0001r/graders/python3graders/... which doesn't exist. The actual path in the image is: /graders/python3graders/... Add strip_path_components (int, default 0) KWARG to ContainerGrader. When > 0, that many leading path components are stripped from the grader path (relative to grader_root) before it is passed as the container entrypoint argument. Set to 2 to remove both the queue-name component and the redundant repo subdirectory name. Example KWARGS: "strip_path_components": 2 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: install gettext into builtins before loading grader module Grader scripts may call _() at module level (e.g. in input_validators defined at import time). The previous code installed trans.install() after exec_module, causing NameError: name '_' is not defined. Move the entire locale/gettext setup block to before exec_module so _ is available in builtins when the grader script is first executed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: normalize mixed tab/space indentation before exec Python 3 raises TabError when exec'ing code with mixed tabs and spaces in the same indented block. Many course grader answer.py files were authored for Python 2 which tolerated this. Call expandtabs(4) on both the staff answer and student submission before preprocessing and writing to /tmp, so exec never sees raw tabs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: sys.path ordering so preprocessed answer/submission shadow originals run.py's import_captured() uses __import__() to load answer and submission modules. grader_dir was inserted into sys.path AFTER /tmp, making it position 0, so __import__('answer') found the original /graders/.../answer.py (with bare 'for c in s:') instead of the preprocessed /tmp/answer.py (with 'submission_code = repr(...)'). Fix: insert grader_dir first, then /tmp, so /tmp is position 0 and the preprocessed files always shadow the originals. Also: - Add _dbg() helper for debug tracing behind GRADER_DEBUG=1 env var; off by default so stderr output doesn't corrupt the JSON pod log that containergrader.py reads via read_namespaced_pod_log. - Import traceback (used by _dbg exception paths). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore: log raw container output bytes on JSON parse failure Add an explicit ERROR-level log of the raw bytes (repr, up to 4096) when json.loads fails so we can see exactly what the pod log contains, including any leading/trailing garbage from stderr that Kubernetes combines into the pod log stream. Also add a DEBUG-level log of every container output for tracing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore: push grader base image to DockerHub as well as GHCR Concourse grader-image pipelines use DockerHub as the trigger source. The workflow previously only pushed to GHCR, so Concourse never saw updates to the base image. Changes: - Add DockerHub login step (DOCKERHUB_USERNAME/DOCKERHUB_PASSWORD secrets) - Push to both mitodl/xqueue-watcher-grader-base (DockerHub) and ghcr.io/mitodl/xqueue-watcher-grader-base (GHCR) - Tag :latest on feature branches during active development so Concourse picks up fixes without waiting for master merge - Add feature branches to push trigger so grader_support fixes are published immediately Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: request stdout-only stream from read_namespaced_pod_log The default stream parameter is 'All', which interleaves stderr into the returned string. Any stderr output from the container (Python warnings, import messages, etc.) corrupts the JSON that the entrypoint prints to stdout, causing JSONDecodeError in the watcher. Pass stream='Stdout' and container='grader' explicitly so only the container's stdout is returned. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: extract last line from pod log instead of using stream param The PodLogsQuery feature gate (which enables the 'stream' field in PodLogOptions) is opt-in and disabled on the target cluster. Using stream= returns a 422 FieldValueForbidden error even on K8s 1.35. Instead, fetch the combined stdout+stderr log and scan backwards for the last non-empty line. The entrypoint always prints exactly one JSON object as its final output line, so this reliably extracts the result regardless of any stderr noise interleaved earlier in the log. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: bypass kubernetes client JSON deserialisation of pod logs read_namespaced_pod_log returns response_type='str'. The client's deserialize() method first calls json.loads() on the raw response body (succeeds since the entrypoint outputs valid JSON), then passes the resulting Python dict to __deserialize_primitive(dict, str) which calls str(dict) — producing Python repr with single-quoted keys and True/False booleans, which is not valid JSON. Fix: pass _preload_content=False to get the raw urllib3 response object and read .data directly as bytes, bypassing the client deserialisation entirely. The raw bytes are valid UTF-8 JSON as printed by the entrypoint. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore: add top-level permissions: {} to restrict default GITHUB_TOKEN scope Addresses GitHub Advanced Security finding: 'Workflow does not contain permissions'. Adding a workflow-level permissions: {} block ensures the GITHUB_TOKEN has no default permissions; each job must explicitly declare what it needs. The update-dependencies job retains its required contents: write and pull-requests: write grants. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * refactor: remove strip_path_components from ContainerGrader strip_path_components was added to work around what turned out to be a configuration error in the LMS grader_payload, not a structural problem in the grading path resolution. Remove the parameter, its __init__ assignment, the stripping logic in _build_k8s_job, and all docstring references to keep the code simple and correct. Also addressed in this commit: - grader.py: downgrade per-submission grading-time log from INFO to DEBUG to avoid high-volume noise in production log streams - Dockerfile: pin uv to 0.10.7 via a named build stage instead of floating ghcr.io/astral-sh/uv:latest; replace the xqueue-watcher symlink with ENV PATH so the full venv is on PATH - env_settings.py: add XQWATCHER_DOCKER_HOST_GRADER_ROOT env var (preparation for docker_host_grader_root ContainerGrader param) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: add docker_host_grader_root; drop path-py from grader base image ContainerGrader (Docker backend): add docker_host_grader_root parameter so that when xqueue-watcher runs inside a container the bind-mount source path can be translated from the watcher-container path to the equivalent host-side path. Without this the Docker daemon (reached via the mounted socket) would look for the grader directory on the host where it does not exist. Defaults to XQWATCHER_DOCKER_HOST_GRADER_ROOT env var or None (watcher runs directly on the host, no translation needed). docker-compose.yml: add XQWATCHER_DOCKER_HOST_GRADER_ROOT placeholder and explanatory comment so operators know to set the absolute host path. grader_support/Dockerfile.base: remove the path-py pip install. The grader_support framework itself does not import path; course teams that need path-py can add it in their own downstream image. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat: add named xqueue server references via xqueue_servers.json Queue configs in conf.d can now use SERVER_REF to reference a named server defined in xqueue_servers.json, avoiding the need to embed XQueue URLs and credentials directly in grader configuration files. - settings.py: add get_xqueue_servers() to load and validate xqueue_servers.json from the config root - manager.py: load xqueue_servers.json in configure_from_directory(); resolve SERVER_REF in client_from_config(), raising ValueError for unknown names or conflicts with SERVER/AUTH - env_settings.py: document the Kubernetes Secret volume-mount pattern for xqueue_servers.json as the preferred credentials delivery mechanism - conf.d/600.json: update example to use SERVER_REF - tests: add ServerRefTests and TestGetXqueueServers covering resolution, error cases, and configure_from_directory integration - tests/fixtures/config/xqueue_servers.json: fixture server for tests - README.md: document SERVER_REF, xqueue_servers.json format, and Kubernetes Secret mounting pattern Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore: remove DockerHub push from grader base image workflow Only push to GHCR. Remove the DockerHub login step, DockerHub image reference from the metadata action, and the DOCKERHUB_USERNAME / DOCKERHUB_PASSWORD secret dependencies. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: harden containergrader and XQueue client - Fix TLS certificate verification: replace hardcoded verify=False with a _VERIFY_TLS flag (default True). Operators can opt out via XQWATCHER_VERIFY_TLS=false for dev environments; a warning is logged when verification is disabled. - Remove credentials from logs: strip self.password from the debug login message and the login-retry error message in client.py. - Enforce hard submission size limit: reject submissions larger than XQWATCHER_SUBMISSION_SIZE_LIMIT bytes (default 1 MB) before launching a container. Prevents etcd object-size overflows and resource-exhaustion attacks via very large env vars. Keep the existing 32 KB warning for submissions that are large but within the limit. - Add seccomp RuntimeDefault profile to Kubernetes grading Jobs: applied at both the pod level (V1PodSecurityContext) and the container level (V1SecurityContext) to restrict the available syscall surface. - Add PID limit to grading container resource limits: caps the number of processes a grading container may create at 256, preventing fork-bomb attacks from affecting other node workloads. - Cap /tmp emptyDir at 50 Mi: adds size_limit='50Mi' to the emptyDir volume backing /tmp in grading pods, preventing disk-exhaustion attacks. - Add path traversal pre-check in grader.py: explicitly reject grader paths containing '..' components before Path.resolve() is called, removing symlink edge-cases that could bypass the relative_to() guard. - Update containergrader module docstring and env_settings docs to accurately describe the security posture and new env vars. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: address PR #14 review feedback - Makefile: add missing tab indentation on help target recipe lines - grader_support/entrypoint.py: fix always-true EndTest check (use str(e).strip() not e is not None) - tests/test_env_settings.py: use clear=True in hermetic default-value tests - tests/test_metrics.py: use clear=True to prevent OTEL_ env vars bleeding in - xqueue_watcher/client.py: apply _VERIFY_TLS in _request() and _login(), not just put_result - xqueue_watcher/containergrader.py: - fix image repo parsing to handle registry:port/image:tag refs (rfind approach) - fix 'pods' → 'pids' container resource limit - lazy-init Kubernetes API clients once per instance (avoids per-submission config load) - xqueue_watcher/env_settings.py: parse HTTP_BASIC_AUTH into (username, password) tuple - xqueue_watcher/metrics.py: clarify OTEL_RESOURCE_ATTRIBUTES is parsed by SDK automatically Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Kubernetes does not support 'pids' as a container-level resource limit. Setting it causes a 422 Unprocessable Entity error when creating grading Jobs. PID limits must instead be enforced at the namespace level via a LimitRange or at the node level via kubelet --pod-pids-limit. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

openedx-webhooks · 2026-03-26T17:37:36Z

Thanks for the pull request, @blarghmatey!

This repository is currently unmaintained.

🔘 Find a technical reviewer

To get help with finding a technical reviewer, reach out to the community contributions project manager for this PR:

On the right-hand side of the PR, find the Contributions project, click the caret in the top right corner to expand it, and check the "Primary PM" field for the name of your project manager.
Find their GitHub handle here.

Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
- This process (including the steps you'll need to take) is documented here.
If it doesn't, simply proceed with the next step.

🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

Dependencies

This PR must be merged before / after / at the same time as ...
Blockers

This PR is waiting for OEP-1234 to be accepted.
Timeline information

This PR must be merged by XX date because ...
Partner information

This is for a course on edx.org.
Supporting documentation
Relevant Open edX discussion forum threads

🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Details

Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

The size and impact of the changes that it introduces
The need for product review
Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

feanil

Generally the change in approach looks good to me. This is a big change but from a security and maintenance posture a large improvement as well. My questions are mostly minor-ish stuff that we can probably resolve before we land this. I'll also post in slack in-case others want to take a look at it. But as we discussed previously, I'm not sure who all is using this so we'll communicate out but plan on moving forward unless there are major concerns by someone.

feanil · 2026-03-27T13:23:33Z

.github/workflows/ci.yml

        os:
        - ubuntu-latest
-        python-version: ['3.12']
+        python-version: ['3.12', '3.13']


Minor question, why not just add 3.14 directly and skip over 3.13?

feanil · 2026-03-27T13:41:41Z

.github/workflows/publish-grader-base-image.yml

+name: Publish grader base image
+
+# Builds grader_support/Dockerfile.base and pushes to:
+#   - GHCR: ghcr.io/mitodl/xqueue-watcher-grader-base


Need to update the location for upstream right?

feanil · 2026-03-27T13:42:05Z

.github/workflows/publish-grader-base-image.yml

+      - feat/xqwatcher-kubernetes-migration
+      - chore/migrate-to-uv-and-k8s-container-grader


Thes can be dropped now as well.

feanil · 2026-03-27T13:42:24Z

.github/workflows/publish-grader-base-image.yml

+  workflow_dispatch:
+
+env:
+  IMAGE_NAME: mitodl/xqueue-watcher-grader-base


Flagging that this image name needs to be updated.

feanil · 2026-03-27T13:43:22Z

.github/workflows/publish-grader-base-image.yml

+            ghcr.io/${{ env.IMAGE_NAME }}
+          tags: |
+            type=raw,value=latest,enable={{is_default_branch}}
+            type=raw,value=latest,enable=${{ github.ref_name == 'chore/migrate-to-uv-and-k8s-container-grader' || github.ref_name == 'feat/xqwatcher-kubernetes-migration' }}


I think we don't need this line as well.

feanil · 2026-03-27T16:44:16Z

grader_support/Dockerfile.base

@@ -0,0 +1,19 @@
+FROM python:3.11-slim AS grader-base


Should this be a higher version? Also what's the story for rolling forward to a newer version of this container? Like if we're already running 3.11 graders, what are the steps that will need to be taken to do this migration?

Reading a bit further down, it looks like, we can cut over on a grader by grader basis but maybe we need to publish both the current and "next" base files to allow for that?

feanil · 2026-03-27T17:07:38Z

tests/test_container_grader.py

+
+import pytest
+
+from xqueue_watcher.containergrader import ContainerGrader, _parse_cpu_millis, _parse_memory_bytes


Looks like _parse_memory_bytes is imported but never tested.

- ci.yml: expand Python matrix to 3.12/3.13/3.14 - publish-grader-base-image.yml: update registry to openedx namespace, drop dev branch triggers, replace single build with 3-version matrix, tag images as py<version>-latest (+ latest for newest) - Dockerfile.base: parameterise Python version via ARG PYTHON_VERSION, drop hardcoded 3.11 - tests/test_container_grader.py: add TestParseMemoryBytes (10 tests) covering IEC binary, SI decimal, and plain integer inputs Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

blarghmatey and others added 2 commits March 26, 2026 13:36

openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Mar 26, 2026

openedx-webhooks added this to Contributions Mar 26, 2026

github-project-automation bot moved this to Needs Triage in Contributions Mar 26, 2026

blarghmatey requested a review from feanil March 26, 2026 17:38

feanil reviewed Mar 27, 2026

View reviewed changes

mphilbrick211 moved this from Needs Triage to In Eng Review in Contributions Mar 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add ContainerGrader for Kubernetes/Docker sandboxed grading + migrate to uv#122

feat: add ContainerGrader for Kubernetes/Docker sandboxed grading + migrate to uv#122
blarghmatey wants to merge 3 commits intoopenedx:masterfrom
mitodl:feat/containergrader-kubernetes

blarghmatey commented Mar 26, 2026

Uh oh!

openedx-webhooks commented Mar 26, 2026

Uh oh!

feanil left a comment

Uh oh!

feanil Mar 27, 2026

Uh oh!

feanil Mar 27, 2026

Uh oh!

feanil Mar 27, 2026

Uh oh!

feanil Mar 27, 2026

Uh oh!

feanil Mar 27, 2026

Uh oh!

feanil Mar 27, 2026

Uh oh!

feanil Mar 27, 2026

Uh oh!

feanil Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		- feat/xqwatcher-kubernetes-migration
		- chore/migrate-to-uv-and-k8s-container-grader


		import pytest

		from xqueue_watcher.containergrader import ContainerGrader, _parse_cpu_millis, _parse_memory_bytes

Conversation

blarghmatey commented Mar 26, 2026

What are the relevant tickets?

Description (What does it do?)

1. ContainerGrader — Kubernetes & Docker sandboxed grading

2. Kubernetes deployment manifests

3. Dependency management migration: pip-compile → uv

4. codejail made optional

5. Named XQueue server references (SERVER_REF / xqueue_servers.json)

6. Metrics

How can this be tested?

Additional Context

Uh oh!

openedx-webhooks commented Mar 26, 2026

Uh oh!

feanil left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

5. Named XQueue server references (`SERVER_REF` / `xqueue_servers.json`)