Skip to content

feat: add ContainerGrader for Kubernetes/Docker sandboxed grading + migrate to uv#122

Open
blarghmatey wants to merge 3 commits intoopenedx:masterfrom
mitodl:feat/containergrader-kubernetes
Open

feat: add ContainerGrader for Kubernetes/Docker sandboxed grading + migrate to uv#122
blarghmatey wants to merge 3 commits intoopenedx:masterfrom
mitodl:feat/containergrader-kubernetes

Conversation

@blarghmatey
Copy link
Copy Markdown
Contributor

What are the relevant tickets?

N/A — this is a contribution from the mitodl/xqueue-watcher fork, developed for MIT OpenCourseWare production use.

Description (What does it do?)

This PR adds two major capabilities and modernises the project tooling:

1. ContainerGrader — Kubernetes & Docker sandboxed grading

A new ContainerGrader handler (xqueue_watcher/containergrader.py) that runs graders inside isolated containers instead of the AppArmor/codejail jail. Two backends are supported:

  • Kubernetes — spins up a one-shot Job via the kubernetes client, mounts submission as a ConfigMap, streams pod logs, enforces CPU/memory resource limits, and cleans up after itself.
  • Docker — runs a short-lived container via the Docker SDK with the same interface, useful for local development.

Key design points:

  • Grader image is configurable per-queue; the same grader_support/entrypoint.py entry point is used for both backends.
  • Configurable resource limits (CPU, memory), timeouts, namespace, image pull policy.
  • grader_support/Dockerfile.base — a minimal base image with grader_support/entrypoint.py baked in; grader-specific images FROM this base.
  • New EnvSettings loader (xqueue_watcher/env_settings.py) reads full watcher/queue config from environment variables, making it easy to configure inside Kubernetes pods.

2. Kubernetes deployment manifests

Ready-to-use Kustomize manifests under deploy/kubernetes/:

  • Deployment for the watcher
  • ConfigMap for queue configuration
  • Secret template for XQueue credentials
  • ServiceAccount + RBAC roles so the watcher can create/watch/delete Job and Pod resources in its own namespace
  • NetworkPolicy to restrict egress

3. Dependency management migration: pip-compile → uv

  • Replaces setup.py + requirements/*.txt with pyproject.toml + uv.lock.
  • CI updated to use astral-sh/setup-uv, adds Python 3.13, drops EOL Python 3.8.
  • make test now runs uv sync && uv run pytest.

4. codejail made optional

JailedGrader no longer hard-depends on AppArmor/codejail. The import is wrapped in a try/except; if codejail is unavailable a clear RuntimeError directs operators to ContainerGrader.

5. Named XQueue server references (SERVER_REF / xqueue_servers.json)

Operators can define shared XQueue server credentials once in xqueue_servers.json and reference them by name in per-queue config via SERVER_REF, avoiding credential duplication across many queue definitions.

6. Metrics

A new xqueue_watcher/metrics.py module exposes Prometheus-compatible counters/histograms for grading latency, queue depth, and error rates.

How can this be tested?

Unit tests — the PR ships a full test suite for all new code:

# Install uv if needed: https://docs.astral.sh/uv/getting-started/installation/
make test

Local Docker smoke test:

# Build grader base image
docker build -f grader_support/Dockerfile.base -t xqueue-grader-base grader_support/

# Start watcher + mock xqueue
docker compose up

Kubernetes smoke test:

kubectl apply -k deploy/kubernetes/
kubectl -n xqueue-watcher logs -l app=xqueue-watcher -f

Additional Context

This code has been running in MIT OpenCourseWare production Kubernetes clusters. The ContainerGrader replaces our previous codejail-based setup and has been stable.

The upstream remote in the mitodl fork still points to github.com/edx/xqueue-watcher; this PR is submitted from mitodl:feat/containergrader-kubernetes which is rebased cleanly on top of openedx/xqueue-watcher:master (239ee5c).

blarghmatey and others added 2 commits March 26, 2026 13:36
…oxed grading (#14)

* chore: migrate from pip-compile to uv for dependency management

- Run migrate-to-uv to bootstrap pyproject.toml from requirements/base.txt
  and requirements/test.txt
- Add full project metadata: name, version, description, requires-python>=3.11,
  license, hatchling build backend, entry point xqueue-watcher -> manager:main
- Add newrelic as [project.optional-dependencies.production]
- Add dev dependency group: coverage, mock, pytest-cov
- Remove setup.py (replaced by pyproject.toml)
- Remove all requirements/*.in and requirements/*.txt files (14 files)
- Generate uv.lock with pinned dependency graph
- Update Makefile: replace pip/pip-compile targets with uv sync / uv run pytest
- Update .github/workflows/ci.yml: use astral-sh/setup-uv@v4, drop ubuntu-20.04
  and Python 3.8, add Python 3.13, update to actions/checkout@v4
- Replace upgrade-python-requirements workflow with uv lock --upgrade +
  create-pull-request workflow

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: remove AppArmor/codejail hard dependency; make codejail optional

- Remove six (Python 2 compat shim) from imports and SUPPORT_FILES in
  jailedgrader.py — Python 3 only going forward
- Wrap codejail imports in try/except in jailedgrader.py and manager.py;
  raise RuntimeError with clear message directing users to ContainerGrader
- Fix Path.abspath() -> Path.absolute() (breaking API change in path v17)
  in grader.py and jailedgrader.py
- Update Dockerfile: ubuntu:xenial -> python:3.11-slim, remove apparmor
  and language-pack-en packages, fix layer ordering
- Update test_codejail_config to use fork_per_item=False to avoid
  multiprocessing state-inheritance failure on Python 3.14 forkserver default
- Update conf.d/600.json example to use ContainerGrader handler

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add ContainerGrader for Kubernetes/Docker-based sandboxed grading

Adds xqueue_watcher/containergrader.py — a drop-in replacement for
JailedGrader that executes student code inside an isolated container
instead of using AppArmor/codejail.

Security model (replaces AppArmor):
- Container isolation (Linux namespaces + cgroups)
- Non-root user (UID 1000), read-only root filesystem
- CPU/memory resource limits enforced by container runtime
- Network disabled for grader containers (no egress)
- Hard wall-clock timeout via activeDeadlineSeconds (k8s) or timeout (Docker)

Two pluggable backends selected via the 'backend' KWARGS option:

  kubernetes (default / production):
    - Creates a batch/v1 Job per submission using the kubernetes Python client
    - Auto-detects in-cluster vs kubeconfig credentials
    - Polls until Job completes, collects stdout from pod logs
    - Deletes the Job after result collection (ttlSecondsAfterFinished=300)
    - Job pod spec includes: securityContext, resource limits,
      activeDeadlineSeconds, and labels for observability

  docker (local dev / CI):
    - Runs a container via the docker Python SDK
    - Bind-mounts the grader directory read-only
    - Passes SUBMISSION_CODE as an environment variable
    - Network disabled, memory + CPU limits applied

Student code is passed via SUBMISSION_CODE env var (avoids argv length
limits and shell injection). The entrypoint writes it to /tmp before
invoking grader_support.run, producing the same JSON output format that
JailedGrader already expects — so no changes to grader test framework
or course team grader code are required.

Configuration example (conf.d/my-course.json):
  {
    "my-course": {
      "HANDLERS": [{
        "HANDLER": "xqueue_watcher.containergrader.ContainerGrader",
        "KWARGS": {
          "grader_root": "/graders/my-course/",
          "image": "registry.example.com/my-course:latest",
          "backend": "kubernetes",
          "cpu_limit": "500m",
          "memory_limit": "256Mi",
          "timeout": 20
        }
      }]
    }
  }

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add grader base Docker image and container entrypoint

grader_support/Dockerfile.base:
  - python:3.11-slim base, non-root grader user (UID 1000)
  - Copies grader_support framework; installs path-py
  - ENTRYPOINT: python -m grader_support.entrypoint
  - /tmp volume for submission files (writable even with read-only root fs)
  - Course teams extend this image to add their deps and grader scripts

grader_support/entrypoint.py:
  - Reads SUBMISSION_CODE env var, writes to /tmp/submission.py
  - Adds /tmp and cwd to sys.path, then delegates to grader_support.run
  - Prints JSON result to stdout (same schema JailedGrader already parses)

grader_support/README.md:
  - Course team authoring guide: how to extend the base image, configure
    the handler, and understand the security properties

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add Kubernetes deployment manifests and Docker Compose local dev

deploy/kubernetes/ (Kustomize-compatible):
  - serviceaccount.yaml  — dedicated SA for xqueue-watcher pods
  - rbac.yaml            — Role + RoleBinding: create/delete Jobs, read pod logs
  - configmap.yaml       — watcher xqwatcher.json config (edit for your queues)
  - deployment.yaml      — 2 replicas, topologySpreadConstraints, securityContext,
                           resource limits, readinessProbe
  - networkpolicy.yaml   — deny all ingress/egress on grader Job pods (label:
                           role=grader-job); allow xqueue-watcher egress to xqueue
  - secret.yaml.template — placeholder: copy to secret.yaml, fill in credentials,
                           do not commit secret.yaml (added to .gitignore)
  - kustomization.yaml   — Kustomize entry point for the base directory

docker-compose.yml (local dev):
  - xqueue-watcher container with docker socket access (for docker backend)
  - Mounts conf.d/ and grader directories
  - Includes a sample xqueue service reference for full local stack

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: correct grader path handling in ContainerGrader and entrypoint

ContainerGrader had two bugs affecting how grader files were located
inside the container at runtime:

1. Docker backend bind-mounted the grader problem directory at /grader,
   overwriting the grader_support package that the base image copies
   there.  Fixed by binding at /graders instead and passing the
   resulting absolute in-container path (/graders/<file>) to the
   entrypoint.

2. Kubernetes backend set working_dir to the grader problem directory
   (e.g. /graders/ps07/Robot/), preventing Python from finding the
   grader_support package which lives at /grader/grader_support/.
   Fixed by keeping working_dir=/grader (the base image WORKDIR) and
   passing the absolute grader path in args instead of just the
   basename.

entrypoint.py previously passed the full absolute path verbatim to
__import__(), which fails for paths containing slashes.  It now detects
absolute paths, inserts the parent directory into sys.path, and uses
only the basename as the importable module name.

Also updates grader_support/README.md to document the correct layout
(/graders/ for course grader scripts, /grader/ for grader_support) and
the gradelib compatibility note for course teams migrating from
Python 2 graders.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(tests): skip jailed grader tests when codejail is not installed

codejail is an optional dependency (not installed in CI). Guard the
import with a try/except and apply @pytest.mark.skipif to the test
class so collection succeeds and tests are skipped gracefully when
codejail is absent.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address PR review feedback

- Dockerfile: replace deleted requirements/ pip install with uv sync
  (copies uv binary from ghcr.io/astral-sh/uv and uses uv sync --frozen)
- grader.py: guard against path traversal in grader_config['grader'];
  validate that the resolved grader path stays within grader_root
- containergrader.py: fix Docker SDK TypeError - containers.run() does
  not accept a timeout kwarg; switch to detach=True + container.wait()
  to enforce the timeout, then collect logs and remove the container
- containergrader.py: remove brittle hardcoded line numbers (L364,
  L379, L397, L450) from user-facing error messages
- docker-compose.yml: change conf.d and data volumes from :ro to :rw
  so local edits take effect without rebuild (matches comment intent)
- upgrade-python-requirements.yml: add explicit permissions block
  (contents: write, pull-requests: write) as required by security policy
- Automated code Graders With xqueue-watcher.md: remove empty heading,
  add 'Property' header to comparison table

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* refactor: replace path-py with stdlib pathlib

path-py is an external dependency that wraps pathlib with a fluent API.
Since we now require Python >= 3.11, pathlib covers all the same
functionality without an extra dependency.

Changes:
- Replace 'from path import Path' with 'from pathlib import Path' in all
  source and test files
- .dirname() → .parent
- .basename() → .name
- .absolute() / .absolute() → .resolve() (symlink-safe)
- .files('*.json') → .glob('*.json') (with sorted() for stable ordering)
- Remove path-py (path-py / path) from pyproject.toml dependencies
- Regenerate uv.lock (removes path==17.1.1 and path-py==12.5.0)
- Simplify grader.py path-traversal check: now that grader_path is a
  native pathlib.Path, the inline 'import pathlib' is no longer needed
- Fix test_grader.py mock: grader_path.endswith() → grader_path.name ==
- Fix test_manager.py: pass str() to argparse (Path is not subscriptable)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add edx-codejail as optional dependency; document container isolation decision

Add edx-codejail (the upstream PyPI package, v4.1.0) as an optional
'codejail' extra, replacing the previously pinned git-URL reference to
a specific commit.

  uv add --optional codejail edx-codejail

codejail is intentionally excluded from the base Docker image because
ContainerGrader uses container-level isolation (Linux namespaces,
cgroups, capability dropping, network isolation, read-only filesystem)
which provides equivalent sandboxing to AppArmor without requiring
host-level AppArmor configuration that is unavailable inside Kubernetes
pods.  Install the 'codejail' extra only when using the legacy
JailedGrader on a bare-metal or VM host with AppArmor configured.

To use: uv sync --extra codejail

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address second round of PR review feedback

- Makefile: fix tab indentation on all recipe lines (was space-indented)
- grader.py: remove unused sys import
- jailedgrader.py: replace deprecated load_module() with spec_from_file_location/exec_module
- containergrader.py:
  - remove unused imports (logging, os, tempfile) and _JOB_LABEL constant
  - add emptyDir volume at /tmp in K8s Job spec (required when read_only_root_filesystem=True)
  - add clarifying comment that K8s grader scripts are baked into the course image
  - replace deprecated load_module() with importlib.util spec/exec_module pattern
  - capture stderr from Docker container on non-zero exit for better diagnostics
- grader_support/entrypoint.py: correct misleading comment about /tmp writability
- deploy/kubernetes/deployment.yaml: fix command to use xqueue-watcher entry point
- deploy/kubernetes/configmap.yaml: add xqueue-watcher-queue-configs ConfigMap so
  manifests apply cleanly out of the box
- docker-compose.yml: mount Docker socket for docker backend to work
- conf.d/600.json: use absolute /graders/ path instead of relative ../data path
- Dockerfile: use C.UTF-8 locale (available without installing locales package)
- pyproject.toml: add edx-codejail to dev group so jailed grader tests run in CI

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* refactor: move full grading pipeline into container; add ContainerGrader unit tests

Architecture change: grader scripts are baked into the course-specific Docker
image, so the watcher pod has no need to access grader files locally.  The
grader_support entrypoint now runs the complete grading pipeline inside the
container (load grader, preprocess, run answer + submission, compare, return
JSON grade), and ContainerGrader.grade() is simplified to just launch the
container and parse its JSON output.

Changes:
- grader_support/entrypoint.py: complete rewrite; now takes GRADER_FILE SEED
  (not GRADER_FILE submission.py SEED); runs full grade pipeline in container;
  reads GRADER_LANGUAGE and HIDE_OUTPUT env vars from ContainerGrader
- xqueue_watcher/containergrader.py:
  - Remove grader-module loading, gettext, answer.py reading, and all test-
    comparison logic from grade() — the container handles this now
  - grade() now just calls _run() and parses the returned JSON
  - _run() accepts grader_config and forwards lang/hide_output as env vars
  - _build_k8s_job(): args are now [grader_abs, seed] (not 3 args), adds
    GRADER_LANGUAGE and HIDE_OUTPUT env vars, still mounts emptyDir at /tmp
  - _run_docker(): same arg change; passes GRADER_LANGUAGE and HIDE_OUTPUT
  - ReadTimeout from container.wait() caught and re-raised as clear RuntimeError
  - Remove unused _truncate, _prepend_coding, importlib.util
- tests/test_container_grader.py: 36 new unit tests covering:
  - _parse_cpu_millis
  - ContainerGrader init / backend validation
  - _build_k8s_job: args, env vars, resource limits, emptyDir/tmp, security
  - _run_docker: success, non-zero exit (with stderr), timeout, missing SDK
  - grade(): skip_grader, successful result, container failure, size warning

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* refactor: replace statsd/newrelic with OpenTelemetry; add 12-factor settings

- Remove dogstatsd-python dependency; replace statsd instrumentation in
  grader.py with OpenTelemetry counters and a histogram
- Add xqueue_watcher/metrics.py: configure_metrics() wires a MeterProvider
  with an OTLP HTTP exporter when OTEL_EXPORTER_OTLP_ENDPOINT is set;
  all four instruments (process_item, grader_payload_error, grading_time,
  replies) defined at module level against the global proxy meter
- Call configure_metrics() from Manager.configure_from_directory() so the
  real provider is installed before any submissions are processed
- Add xqueue_watcher/env_settings.py: get_manager_config_from_env() reads
  all manager config from XQWATCHER_* environment variables, compatible
  with 12-factor / Kubernetes deployment patterns
- Remove newrelic from the production optional-dependency group and from
  the edx.org Dockerfile stage; the stage now runs xqueue-watcher directly
- Add opentelemetry-api, opentelemetry-sdk, opentelemetry-exporter-otlp-proto-http
  to core dependencies; regenerate uv.lock
- Add tests/test_env_settings.py and tests/test_metrics.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore: remove planning doc from git tracking

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore: remove codecov upload from CI

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address PR #14 review feedback

- docker-compose.yml: remove unused GRADER_BACKEND env var, fix duplicate
  volumes key by merging into one list, tag sample-grader with
  image: grader-base:local so conf.d/600.json reference resolves
- Dockerfile: standardise CMD config path to /etc/xqueue-watcher to match
  docker-compose and Kubernetes manifests
- metrics.py: remove OTEL_METRIC_EXPORT_INTERVAL from docstring since it is
  not wired up in _build_meter_provider()
- containergrader.py: add pod template metadata labels so the NetworkPolicy
  podSelector (app.kubernetes.io/component=xqueue-grader) actually matches
  grading pods; set automount_service_account_token=False on the grading pod
  spec to reduce blast radius if the NetworkPolicy is misconfigured; add
  _parse_memory_bytes() helper and use it for the Docker backend mem_limit
  so Kubernetes-style strings like '256Mi' are converted to bytes rather
  than passed raw (which Docker does not accept)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: add venv bin to PATH so xqueue-watcher entrypoint resolves

uv installs the console script into the project virtual environment at
.venv/bin/xqueue-watcher. Without adding this directory to PATH the CMD
cannot be found at container startup.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add configure_logging() for 12-factor stdout logging

When no logging.json file is present, manager.py now calls
configure_logging() from env_settings instead of basicConfig().

configure_logging() sets up a single StreamHandler on stdout with a
consistent timestamp/level/module format, honours XQWATCHER_LOG_LEVEL
(default INFO), and suppresses noisy requests/urllib3 debug output.
This removes the need for a logging.json file in Kubernetes deployments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: symlink xqueue-watcher into /usr/local/bin for reliable resolution

Using PATH via ENV is fragile -- container runtimes and security policies
can reset or ignore it.  Install a symlink at /usr/local/bin/xqueue-watcher
(always in the standard system PATH) so the entrypoint resolves regardless
of how the container is launched.  Also remove the stale NEW_RELIC_LICENSE_KEY
env entry from the Kubernetes deployment manifest.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add env-based defaults for ContainerGrader configuration

Add get_container_grader_defaults() to env_settings, reading five new
XQWATCHER_GRADER_* env vars:

  XQWATCHER_GRADER_BACKEND       (default: kubernetes)
  XQWATCHER_GRADER_NAMESPACE     (default: default)
  XQWATCHER_GRADER_CPU_LIMIT     (default: 500m)
  XQWATCHER_GRADER_MEMORY_LIMIT  (default: 256Mi)
  XQWATCHER_GRADER_TIMEOUT       (default: 20)

ContainerGrader.__init__ now uses None sentinels for these params so that
any value omitted from a conf.d KWARGS block falls back to the env-derived
default rather than a hardcoded constant. Values supplied explicitly in
conf.d always take precedence, preserving backwards compatibility.

Also fixes duplicate function definitions that had crept into env_settings.py.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(containergrader): add ImageDigestPoller and image pull policy support

- Add ImageDigestPoller class: background daemon thread that periodically
  resolves a tag-based image reference to its current digest via
  docker.APIClient.inspect_distribution(). Thread-safe; falls back to the
  original reference if resolution fails.

- Add image_pull_policy param to ContainerGrader (auto-detect: IfNotPresent
  for digest refs, Always for tag-based refs; can be overridden explicitly).

- Add poll_image_digest and digest_poll_interval params to activate the
  poller. When enabled, Kubernetes Jobs use the most recently resolved
  repo@sha256:… reference via _effective_image(), ensuring nodes always run
  the latest pushed image without relying on imagePullPolicy: Always for
  every pod.

- Add .github/workflows/publish-grader-base-image.yml to build and push
  grader_support/Dockerfile.base to ghcr.io/mitodl/xqueue-watcher-grader-base
  on push to master (grader_support/** paths), weekly schedule, and
  workflow_dispatch. Multi-platform linux/amd64,linux/arm64.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: normalise imagePullPolicy to title-case before K8s API call

Kubernetes requires imagePullPolicy to be exactly 'Always', 'IfNotPresent',
or 'Never' (case-sensitive). When the value is supplied via KWARGS in the
conf.d JSON (e.g. 'always' or 'ALWAYS'), the K8s API returns 422 Unprocessable
Entity.

Add a normalisation dict lookup that maps the lowercased input back to the
canonical title-case form. Unknown values are passed through unchanged so
Kubernetes can surface the validation error with a clear message.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add strip_path_components to ContainerGrader for legacy path prefixes

LMS grader_payload 'grader' fields configured against the old git-clone
deployment include a queue-name prefix, e.g.:

  mit-600x-Watcher-MITX-6.0001r/graders/python3graders/chips1/.../grade.py

In the containerized approach, graders are baked directly into the image
at grader_root, so the path resolves to:

  /graders/mit-600x-Watcher-MITX-6.0001r/graders/python3graders/...

which doesn't exist. The actual path in the image is:

  /graders/python3graders/...

Add strip_path_components (int, default 0) KWARG to ContainerGrader.
When > 0, that many leading path components are stripped from the
grader path (relative to grader_root) before it is passed as the
container entrypoint argument. Set to 2 to remove both the queue-name
component and the redundant repo subdirectory name.

Example KWARGS:
  "strip_path_components": 2

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: install gettext into builtins before loading grader module

Grader scripts may call _() at module level (e.g. in input_validators
defined at import time). The previous code installed trans.install()
after exec_module, causing NameError: name '_' is not defined.

Move the entire locale/gettext setup block to before exec_module so
_ is available in builtins when the grader script is first executed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: normalize mixed tab/space indentation before exec

Python 3 raises TabError when exec'ing code with mixed tabs and spaces
in the same indented block.  Many course grader answer.py files were
authored for Python 2 which tolerated this.

Call expandtabs(4) on both the staff answer and student submission
before preprocessing and writing to /tmp, so exec never sees raw tabs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: sys.path ordering so preprocessed answer/submission shadow originals

run.py's import_captured() uses __import__() to load answer and
submission modules.  grader_dir was inserted into sys.path AFTER /tmp,
making it position 0, so __import__('answer') found the original
/graders/.../answer.py (with bare 'for c in s:') instead of the
preprocessed /tmp/answer.py (with 'submission_code = repr(...)').

Fix: insert grader_dir first, then /tmp, so /tmp is position 0 and
the preprocessed files always shadow the originals.

Also:
- Add _dbg() helper for debug tracing behind GRADER_DEBUG=1 env var;
  off by default so stderr output doesn't corrupt the JSON pod log
  that containergrader.py reads via read_namespaced_pod_log.
- Import traceback (used by _dbg exception paths).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore: log raw container output bytes on JSON parse failure

Add an explicit ERROR-level log of the raw bytes (repr, up to 4096)
when json.loads fails so we can see exactly what the pod log contains,
including any leading/trailing garbage from stderr that Kubernetes
combines into the pod log stream.

Also add a DEBUG-level log of every container output for tracing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore: push grader base image to DockerHub as well as GHCR

Concourse grader-image pipelines use DockerHub as the trigger source.
The workflow previously only pushed to GHCR, so Concourse never saw
updates to the base image.

Changes:
- Add DockerHub login step (DOCKERHUB_USERNAME/DOCKERHUB_PASSWORD secrets)
- Push to both mitodl/xqueue-watcher-grader-base (DockerHub) and
  ghcr.io/mitodl/xqueue-watcher-grader-base (GHCR)
- Tag :latest on feature branches during active development so Concourse
  picks up fixes without waiting for master merge
- Add feature branches to push trigger so grader_support fixes are
  published immediately

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: request stdout-only stream from read_namespaced_pod_log

The default stream parameter is 'All', which interleaves stderr into
the returned string.  Any stderr output from the container (Python
warnings, import messages, etc.) corrupts the JSON that the entrypoint
prints to stdout, causing JSONDecodeError in the watcher.

Pass stream='Stdout' and container='grader' explicitly so only the
container's stdout is returned.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: extract last line from pod log instead of using stream param

The PodLogsQuery feature gate (which enables the 'stream' field in
PodLogOptions) is opt-in and disabled on the target cluster.  Using
stream= returns a 422 FieldValueForbidden error even on K8s 1.35.

Instead, fetch the combined stdout+stderr log and scan backwards for
the last non-empty line.  The entrypoint always prints exactly one JSON
object as its final output line, so this reliably extracts the result
regardless of any stderr noise interleaved earlier in the log.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: bypass kubernetes client JSON deserialisation of pod logs

read_namespaced_pod_log returns response_type='str'. The client's
deserialize() method first calls json.loads() on the raw response body
(succeeds since the entrypoint outputs valid JSON), then passes the
resulting Python dict to __deserialize_primitive(dict, str) which calls
str(dict) — producing Python repr with single-quoted keys and True/False
booleans, which is not valid JSON.

Fix: pass _preload_content=False to get the raw urllib3 response object
and read .data directly as bytes, bypassing the client deserialisation
entirely. The raw bytes are valid UTF-8 JSON as printed by the entrypoint.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore: add top-level permissions: {} to restrict default GITHUB_TOKEN scope

Addresses GitHub Advanced Security finding: 'Workflow does not contain
permissions'. Adding a workflow-level permissions: {} block ensures the
GITHUB_TOKEN has no default permissions; each job must explicitly declare
what it needs. The update-dependencies job retains its required
contents: write and pull-requests: write grants.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* refactor: remove strip_path_components from ContainerGrader

strip_path_components was added to work around what turned out to be a
configuration error in the LMS grader_payload, not a structural problem
in the grading path resolution.  Remove the parameter, its __init__
assignment, the stripping logic in _build_k8s_job, and all docstring
references to keep the code simple and correct.

Also addressed in this commit:
- grader.py: downgrade per-submission grading-time log from INFO to DEBUG
  to avoid high-volume noise in production log streams
- Dockerfile: pin uv to 0.10.7 via a named build stage instead of
  floating ghcr.io/astral-sh/uv:latest; replace the xqueue-watcher
  symlink with ENV PATH so the full venv is on PATH
- env_settings.py: add XQWATCHER_DOCKER_HOST_GRADER_ROOT env var
  (preparation for docker_host_grader_root ContainerGrader param)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: add docker_host_grader_root; drop path-py from grader base image

ContainerGrader (Docker backend): add docker_host_grader_root parameter
so that when xqueue-watcher runs inside a container the bind-mount source
path can be translated from the watcher-container path to the equivalent
host-side path. Without this the Docker daemon (reached via the mounted
socket) would look for the grader directory on the host where it does not
exist. Defaults to XQWATCHER_DOCKER_HOST_GRADER_ROOT env var or None
(watcher runs directly on the host, no translation needed).

docker-compose.yml: add XQWATCHER_DOCKER_HOST_GRADER_ROOT placeholder
and explanatory comment so operators know to set the absolute host path.

grader_support/Dockerfile.base: remove the path-py pip install. The
grader_support framework itself does not import path; course teams that
need path-py can add it in their own downstream image.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add named xqueue server references via xqueue_servers.json

Queue configs in conf.d can now use SERVER_REF to reference a named
server defined in xqueue_servers.json, avoiding the need to embed
XQueue URLs and credentials directly in grader configuration files.

- settings.py: add get_xqueue_servers() to load and validate
  xqueue_servers.json from the config root
- manager.py: load xqueue_servers.json in configure_from_directory();
  resolve SERVER_REF in client_from_config(), raising ValueError for
  unknown names or conflicts with SERVER/AUTH
- env_settings.py: document the Kubernetes Secret volume-mount pattern
  for xqueue_servers.json as the preferred credentials delivery mechanism
- conf.d/600.json: update example to use SERVER_REF
- tests: add ServerRefTests and TestGetXqueueServers covering resolution,
  error cases, and configure_from_directory integration
- tests/fixtures/config/xqueue_servers.json: fixture server for tests
- README.md: document SERVER_REF, xqueue_servers.json format, and
  Kubernetes Secret mounting pattern

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore: remove DockerHub push from grader base image workflow

Only push to GHCR. Remove the DockerHub login step, DockerHub image
reference from the metadata action, and the DOCKERHUB_USERNAME /
DOCKERHUB_PASSWORD secret dependencies.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: harden containergrader and XQueue client

- Fix TLS certificate verification: replace hardcoded verify=False with
  a _VERIFY_TLS flag (default True). Operators can opt out via
  XQWATCHER_VERIFY_TLS=false for dev environments; a warning is logged
  when verification is disabled.

- Remove credentials from logs: strip self.password from the debug login
  message and the login-retry error message in client.py.

- Enforce hard submission size limit: reject submissions larger than
  XQWATCHER_SUBMISSION_SIZE_LIMIT bytes (default 1 MB) before launching
  a container. Prevents etcd object-size overflows and resource-exhaustion
  attacks via very large env vars. Keep the existing 32 KB warning for
  submissions that are large but within the limit.

- Add seccomp RuntimeDefault profile to Kubernetes grading Jobs: applied
  at both the pod level (V1PodSecurityContext) and the container level
  (V1SecurityContext) to restrict the available syscall surface.

- Add PID limit to grading container resource limits: caps the number of
  processes a grading container may create at 256, preventing fork-bomb
  attacks from affecting other node workloads.

- Cap /tmp emptyDir at 50 Mi: adds size_limit='50Mi' to the emptyDir
  volume backing /tmp in grading pods, preventing disk-exhaustion attacks.

- Add path traversal pre-check in grader.py: explicitly reject grader
  paths containing '..' components before Path.resolve() is called,
  removing symlink edge-cases that could bypass the relative_to() guard.

- Update containergrader module docstring and env_settings docs to
  accurately describe the security posture and new env vars.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address PR #14 review feedback

- Makefile: add missing tab indentation on help target recipe lines
- grader_support/entrypoint.py: fix always-true EndTest check (use str(e).strip() not e is not None)
- tests/test_env_settings.py: use clear=True in hermetic default-value tests
- tests/test_metrics.py: use clear=True to prevent OTEL_ env vars bleeding in
- xqueue_watcher/client.py: apply _VERIFY_TLS in _request() and _login(), not just put_result
- xqueue_watcher/containergrader.py:
  - fix image repo parsing to handle registry:port/image:tag refs (rfind approach)
  - fix 'pods' → 'pids' container resource limit
  - lazy-init Kubernetes API clients once per instance (avoids per-submission config load)
- xqueue_watcher/env_settings.py: parse HTTP_BASIC_AUTH into (username, password) tuple
- xqueue_watcher/metrics.py: clarify OTEL_RESOURCE_ATTRIBUTES is parsed by SDK automatically

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Kubernetes does not support 'pids' as a container-level resource limit.
Setting it causes a 422 Unprocessable Entity error when creating grading
Jobs. PID limits must instead be enforced at the namespace level via a
LimitRange or at the node level via kubelet --pod-pids-limit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Mar 26, 2026
@openedx-webhooks
Copy link
Copy Markdown

Thanks for the pull request, @blarghmatey!

This repository is currently unmaintained.

🔘 Find a technical reviewer To get help with finding a technical reviewer, reach out to the community contributions project manager for this PR:
  1. On the right-hand side of the PR, find the Contributions project, click the caret in the top right corner to expand it, and check the "Primary PM" field for the name of your project manager.
  2. Find their GitHub handle here.

Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

  • If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
    • This process (including the steps you'll need to take) is documented here.
  • If it doesn't, simply proceed with the next step.
🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

  • Dependencies

    This PR must be merged before / after / at the same time as ...

  • Blockers

    This PR is waiting for OEP-1234 to be accepted.

  • Timeline information

    This PR must be merged by XX date because ...

  • Partner information

    This is for a course on edx.org.

  • Supporting documentation
  • Relevant Open edX discussion forum threads
🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Details
Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

  • The size and impact of the changes that it introduces
  • The need for product review
  • Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

Copy link
Copy Markdown
Contributor

@feanil feanil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally the change in approach looks good to me. This is a big change but from a security and maintenance posture a large improvement as well. My questions are mostly minor-ish stuff that we can probably resolve before we land this. I'll also post in slack in-case others want to take a look at it. But as we discussed previously, I'm not sure who all is using this so we'll communicate out but plan on moving forward unless there are major concerns by someone.

os:
- ubuntu-latest
python-version: ['3.12']
python-version: ['3.12', '3.13']
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor question, why not just add 3.14 directly and skip over 3.13?

name: Publish grader base image

# Builds grader_support/Dockerfile.base and pushes to:
# - GHCR: ghcr.io/mitodl/xqueue-watcher-grader-base
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to update the location for upstream right?

Comment on lines +10 to +11
- feat/xqwatcher-kubernetes-migration
- chore/migrate-to-uv-and-k8s-container-grader
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thes can be dropped now as well.

workflow_dispatch:

env:
IMAGE_NAME: mitodl/xqueue-watcher-grader-base
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flagging that this image name needs to be updated.

ghcr.io/${{ env.IMAGE_NAME }}
tags: |
type=raw,value=latest,enable={{is_default_branch}}
type=raw,value=latest,enable=${{ github.ref_name == 'chore/migrate-to-uv-and-k8s-container-grader' || github.ref_name == 'feat/xqwatcher-kubernetes-migration' }}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need this line as well.

@@ -0,0 +1,19 @@
FROM python:3.11-slim AS grader-base
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a higher version? Also what's the story for rolling forward to a newer version of this container? Like if we're already running 3.11 graders, what are the steps that will need to be taken to do this migration?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading a bit further down, it looks like, we can cut over on a grader by grader basis but maybe we need to publish both the current and "next" base files to allow for that?


import pytest

from xqueue_watcher.containergrader import ContainerGrader, _parse_cpu_millis, _parse_memory_bytes
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like _parse_memory_bytes is imported but never tested.

@mphilbrick211 mphilbrick211 moved this from Needs Triage to In Eng Review in Contributions Mar 27, 2026
- ci.yml: expand Python matrix to 3.12/3.13/3.14
- publish-grader-base-image.yml: update registry to openedx namespace,
  drop dev branch triggers, replace single build with 3-version matrix,
  tag images as py<version>-latest (+ latest for newest)
- Dockerfile.base: parameterise Python version via ARG PYTHON_VERSION,
  drop hardcoded 3.11
- tests/test_container_grader.py: add TestParseMemoryBytes (10 tests)
  covering IEC binary, SI decimal, and plain integer inputs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

open-source-contribution PR author is not from Axim or 2U

Projects

Status: In Eng Review

Development

Successfully merging this pull request may close these issues.

4 participants