diff --git a/content/en/docs/how-to-guides/_index.md b/content/en/docs/how-to-guides/_index.md index 6f99f18..67504e4 100755 --- a/content/en/docs/how-to-guides/_index.md +++ b/content/en/docs/how-to-guides/_index.md @@ -3,3 +3,48 @@ title: How-to Guides weight: 10 description: How-to solve concrete problems --- + +This section contains practical how-to guides for common ClusterCockpit tasks. + +## Setup & Deployment + +| Guide | Description | +|-------|-------------| +| [Hands-On Demo]({{< relref "handson" >}}) | Basic ClusterCockpit setup and API usage walkthrough | +| [Deploy and Update cc-backend]({{< relref "deployment" >}}) | Recommended deployment and update workflow for production use | +| [Setup a systemd Service]({{< relref "systemd-service" >}}) | Run ClusterCockpit components as systemd services | +| [Create a `cluster.json` File]({{< relref "clusterConfig" >}}) | How to initially create a cluster configuration | + +## Configuration & Customization + +| Guide | Description | +|-------|-------------| +| [Customize cc-backend]({{< relref "customization" >}}) | Add legal texts, modify login page, and add custom logo | +| [Notification Banner]({{< relref "notificationBanner" >}}) | Add a message of the day banner on the homepage | +| [Auto-Tagging]({{< relref "auto-tagging" >}}) | Enable automatic job tagging for application detection and classification | +| [Resampling]({{< relref "resampling" >}}) | Plan and configure metric resampling | +| [Retention Policies]({{< relref "retention-policy" >}}) | Manage database and job archive size with retention policies | + +## Metric Collection & Integration + +| Guide | Description | +|-------|-------------| +| [Hierarchical Metric Collection]({{< relref "hierarchical-collection" >}}) | Configure multiple cc-metric-collector instances with forwarding | +| [External TSDB Integration]({{< relref "external-tsdb-integration" >}}) | Integrate Prometheus and InfluxDB data into cc-metric-store | +| [Sharing HPM Metrics 
(SLURM)]({{< relref "slurm-hwperf" >}}) | Share hardware performance counter access between monitoring and user jobs | + +## API & Developer Tools + +| Guide | Description | +|-------|-------------| +| [Generate JWT Tokens]({{< relref "generateJWT" >}}) | Generate JSON Web Tokens for API authentication | +| [Use the REST API]({{< relref "useRest" >}}) | How to use the REST API endpoints | +| [Use the Swagger UI]({{< relref "useSwagger" >}}) | Browse and test API endpoints via Swagger UI | +| [Regenerate Swagger Docs]({{< relref "generateSwagger" >}}) | Regenerate the Swagger UI documentation from source | + +## Migrations + +| Guide | Description | +|-------|-------------| +| [Database Migrations]({{< relref "database-migration" >}}) | Apply database schema migrations | +| [Job Archive Migrations]({{< relref "archive-migration" >}}) | Migrate job archive data between versions | diff --git a/content/en/docs/how-to-guides/external-tsdb-integration.md b/content/en/docs/how-to-guides/external-tsdb-integration.md new file mode 100644 index 0000000..10b6f44 --- /dev/null +++ b/content/en/docs/how-to-guides/external-tsdb-integration.md @@ -0,0 +1,606 @@ +--- +title: How to ingest metrics from external time series databases +description: Integrate Prometheus and InfluxDB data into cc-metric-store using adapter scripts and proxy approaches +categories: [cc-metric-store] +tags: [Admin, Developer] +weight: 20 +--- + +## Overview + +Many HPC sites already operate Prometheus or InfluxDB for general infrastructure +monitoring. When adopting ClusterCockpit, you may want to reuse existing metric +data rather than immediately deploying +[cc-metric-collector]({{< ref "../reference/cc-metric-collector" >}}) on every +node. This guide shows how to bridge data from external time series databases +into cc-metric-store. + +cc-metric-store accepts metric data through two ingestion paths: + +1. 
**REST API** (`POST /api/write/`) — uses + [InfluxDB line protocol]({{< ref "../explanation/lineProtocol" >}}) format, + authenticated with JWT (Ed25519) +2. **NATS messaging** — see + [NATS configuration]({{< ref "../reference/cc-metric-store/ccms-configuration#nats-section" >}}) + +This guide focuses on the REST API path since it requires no additional +infrastructure beyond HTTP connectivity. + +{{< alert title="Important" >}} +cc-metric-store only stores metrics that are listed in its +[`metrics` configuration section]({{< ref "../reference/cc-metric-store/ccms-configuration" >}}). +Any metric name sent via `/api/write/` that is not configured will be silently +dropped. Plan your metric name mapping carefully before deploying. +{{< /alert >}} + +## Architecture + +The general approach is: query the external TSDB, transform results into +ClusterCockpit's InfluxDB line protocol format, and POST them to cc-metric-store. + +```mermaid +flowchart LR + subgraph External ["External TSDB"] + prom["Prometheus"] + influx["InfluxDB"] + end + + subgraph Adapter ["Adapter Layer"] + script["Sync Script\n(cron-based)"] + end + + subgraph CC ["ClusterCockpit"] + ccms[("cc-metric-store\n/api/write/")] + ccbe["cc-backend"] + end + + prom -->|"HTTP API query"| script + influx -->|"HTTP API query"| script + script -->|"InfluxDB line protocol\n+ JWT auth"| ccms + ccms <--> ccbe +``` + +### Approach Comparison + +| Approach | Source | Mechanism | Latency | Complexity | +|---|---|---|---|---| +| Cron-based sync script | Prometheus or InfluxDB | Periodic query + POST | ~60s | Low | +| Prometheus remote_write proxy | Prometheus | Continuous push | ~seconds | Medium | +| Telegraf HTTP output | InfluxDB | Telegraf pipeline | ~seconds | Medium | + +For most HPC sites, the **cron-based sync script** is recommended. It is the +simplest to deploy and maintain, and 60-second latency is perfectly adequate for +monitoring dashboards. 
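To make the transform step concrete before reading the full adapter scripts, here is a minimal sketch of rendering one sample in CC's InfluxDB line protocol. The metric, cluster, hostname, and timestamp values are illustrative placeholders, and only the node-level (`type=node`) case is shown:

```python
def to_line_protocol(metric: str, cluster: str, hostname: str,
                     value: float, timestamp: int) -> str:
    """Format one node-level sample as ClusterCockpit line protocol.

    Tag layout: measurement name, then cluster/hostname/type tags,
    then a single 'value' field and a Unix timestamp in seconds.
    """
    # "type=node" marks a node-level metric in ClusterCockpit's convention
    return (f"{metric},cluster={cluster},hostname={hostname},"
            f"type=node value={value} {timestamp}")


print(to_line_protocol("cpu_load", "mycluster", "node01", 1.5, 1700000000))
# cpu_load,cluster=mycluster,hostname=node01,type=node value=1.5 1700000000
```

Each POST to `/api/write/` carries one such line per sample, newline-separated.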
+ +## Prerequisites + +- Running cc-metric-store instance with HTTP API enabled +- Valid JWT token (Ed25519-signed) — see + [JWT generation guide]({{< ref "../how-to-guides/generateJWT" >}}) +- Python 3.6+ with the `requests` library (`pip install requests`) +- Target metrics pre-configured in cc-metric-store's `config.json` +- Network access from the adapter host to both the source TSDB and cc-metric-store + +## Metric Name Mapping + +ClusterCockpit uses its own +[metric naming convention]({{< ref "../tutorials/prod-metric-list" >}}) which +differs from Prometheus and InfluxDB/Telegraf conventions. The adapter scripts +must translate between them. + +The following table shows mappings for common node-level metrics: + +| CC Metric | Prometheus Query | InfluxDB / Telegraf | Unit | Aggregation | +|---|---|---|---|---| +| `cpu_load` | `node_load1` | `system.load1` | - | avg | +| `cpu_user` | `100 * rate(node_cpu_seconds_total{mode="user"}[1m])` | `cpu.usage_user` | % | avg | +| `mem_used` | `(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / (1024*1024)` | `mem.used / (1024*1024)` | MB | - | +| `net_bw` | `rate(node_network_receive_bytes_total[1m]) + rate(node_network_transmit_bytes_total[1m])` | `net.bytes_recv + net.bytes_sent` | bytes/s | sum | +| `cpu_power` | `node_rapl_package_joules_total` (rate) | `ipmi_sensors` (if available) | W | sum | + +{{< alert color="warning" title="Limitation" >}} +HPC-specific metrics like `flops_any`, `mem_bw`, and `ipc` require hardware +performance counter access (e.g., via LIKWID). These are **not** available from +standard Prometheus node_exporter or Telegraf. This integration supplements but +does not replace +[cc-metric-collector]({{< ref "../reference/cc-metric-collector" >}}) for +hardware counter metrics. +{{< /alert >}} + +## cc-metric-store Configuration + +Each bridged metric must be listed in the `metrics` section of cc-metric-store's +`config.json`. 
The `frequency` must match the interval at which your sync script +runs (in seconds). + +```json +{ + "metrics": { + "cpu_load": { "frequency": 60, "aggregation": "avg" }, + "cpu_user": { "frequency": 60, "aggregation": "avg" }, + "mem_used": { "frequency": 60, "aggregation": null }, + "net_bw": { "frequency": 60, "aggregation": "sum" } + } +} +``` + +If you already have metrics configured for cc-metric-collector with a different +frequency (e.g., 10s), do **not** change that. Instead, the sync script should +run at the same frequency as the already configured value. See the +[configuration reference]({{< ref "../reference/cc-metric-store/ccms-configuration" >}}) +for details. + +## Prometheus to cc-metric-store + +### Cron-based Sync Script + +The following Python script queries Prometheus for the latest metric values and +forwards them to cc-metric-store. Save it as `prom2ccms.py`: + +```python +#!/usr/bin/env python3 +"""Sync metrics from Prometheus to cc-metric-store.""" + +import sys +import time +import logging +import requests + +# --- Configuration ----------------------------------------------------------- + +PROMETHEUS_URL = "http://prometheus.example.org:9090" +CCMS_URL = "http://ccms.example.org:8080" +JWT_TOKEN = "eyJ0eXAiOiJKV1QiLC..." 
# Your Ed25519-signed JWT +CLUSTER_NAME = "mycluster" + +# Mapping: CC metric name -> (PromQL query, type, scale_factor) +# type is "node" for node-level metrics +METRIC_MAP = { + "cpu_load": ( + 'node_load1', + "node", 1.0 + ), + "cpu_user": ( + '100 * avg by(instance)(rate(node_cpu_seconds_total{mode="user"}[2m]))', + "node", 1.0 + ), + "mem_used": ( + '(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / (1024*1024)', + "node", 1.0 + ), + "net_bw": ( + 'sum by(instance)(rate(node_network_receive_bytes_total{device!="lo"}[2m])' + ' + rate(node_network_transmit_bytes_total{device!="lo"}[2m]))', + "node", 1.0 + ), +} + +# --- End Configuration ------------------------------------------------------- + +logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s") +log = logging.getLogger("prom2ccms") + + +def query_prometheus(promql: str) -> list: + """Run an instant query against Prometheus and return result list.""" + resp = requests.get( + f"{PROMETHEUS_URL}/api/v1/query", + params={"query": promql}, + timeout=10, + ) + resp.raise_for_status() + data = resp.json() + if data["status"] != "success": + raise RuntimeError(f"Prometheus query failed: {data}") + return data["data"]["result"] + + +def instance_to_hostname(instance: str) -> str: + """Strip port from Prometheus instance label (e.g., 'node01:9100' -> 'node01').""" + return instance.rsplit(":", 1)[0] + + +def build_line_protocol(metric_name: str, hostname: str, metric_type: str, + value: float, timestamp: int) -> str: + """Build an InfluxDB line protocol string for cc-metric-store.""" + return ( + f"{metric_name},cluster={CLUSTER_NAME},hostname={hostname}," + f"type={metric_type} value={value} {timestamp}" + ) + + +def push_to_ccms(lines: list): + """POST metric lines to cc-metric-store /api/write/ endpoint.""" + if not lines: + return + payload = "\n".join(lines) + resp = requests.post( + f"{CCMS_URL}/api/write/", + headers={"Authorization": f"Bearer {JWT_TOKEN}"}, + 
data=payload, + timeout=10, + ) + resp.raise_for_status() + log.info("Pushed %d lines to cc-metric-store", len(lines)) + + +def main(): + lines = [] + now = int(time.time()) + + for cc_name, (promql, metric_type, scale) in METRIC_MAP.items(): + try: + results = query_prometheus(promql) + except Exception as e: + log.error("Failed to query %s: %s", cc_name, e) + continue + + for series in results: + instance = series["metric"].get("instance", "") + if not instance: + continue + hostname = instance_to_hostname(instance) + value = float(series["value"][1]) * scale + lines.append( + build_line_protocol(cc_name, hostname, metric_type, value, now) + ) + + try: + push_to_ccms(lines) + except Exception as e: + log.error("Failed to push to cc-metric-store: %s", e) + sys.exit(1) + + +if __name__ == "__main__": + main() +``` + +Make the script executable and set up a cron job: + +```bash +chmod +x /opt/monitoring/prom2ccms/prom2ccms.py + +# Run every 60 seconds (matching cc-metric-store frequency) +# crontab -e +* * * * * /opt/monitoring/prom2ccms/prom2ccms.py >> /var/log/prom2ccms.log 2>&1 +``` + +{{< alert title="Note" >}} +For rate-based Prometheus metrics (like `node_cpu_seconds_total`), the PromQL +query itself uses `rate()` or `irate()` so the returned value is already a +per-second rate. The script simply forwards the computed value. +{{< /alert >}} + +### Advanced: Prometheus remote_write Proxy + +For near-real-time forwarding, Prometheus can push metrics via its `remote_write` +feature to a small HTTP proxy that converts them to cc-metric-store format. + +Add to your `prometheus.yml`: + +```yaml +remote_write: + - url: "http://adapter-host:9201/receive" + write_relabel_configs: + # Only forward metrics we care about + - source_labels: [__name__] + regex: "node_load1|node_cpu_seconds_total|node_memory_.*|node_network_.*" + action: keep +``` + +The proxy receives Protobuf-encoded, Snappy-compressed payloads and must decode +them before converting. 
This requires additional Python packages: + +```bash +pip install protobuf python-snappy +``` + +A skeleton implementation is available in the +[Prometheus remote write specification](https://prometheus.io/docs/concepts/remote_write_spec/). +The proxy must: + +1. Decompress the Snappy-encoded request body +2. Decode the Protobuf `WriteRequest` message +3. Map metric names and labels to CC format +4. Batch-POST the resulting lines to `/api/write/` + +{{< alert title="Note" >}} +The remote_write proxy is significantly more complex than the cron approach. It +requires a long-running service, additional dependencies, and handling Protobuf +schemas. For most HPC sites, the cron-based script above is recommended. +{{< /alert >}} + +## InfluxDB to cc-metric-store + +### Cron-based Sync Script + +The following script queries InfluxDB v2 (Flux) for the latest metric values and +forwards them to cc-metric-store. Save it as `influx2ccms.py`: + +```python +#!/usr/bin/env python3 +"""Sync metrics from InfluxDB v2 to cc-metric-store.""" + +import sys +import time +import csv +import io +import logging +import requests + +# --- Configuration ----------------------------------------------------------- + +INFLUXDB_URL = "http://influxdb.example.org:8086" +INFLUXDB_ORG = "myorg" +INFLUXDB_TOKEN = "your-influxdb-token" +INFLUXDB_BUCKET = "telegraf" + +CCMS_URL = "http://ccms.example.org:8080" +JWT_TOKEN = "eyJ0eXAiOiJKV1QiLC..." 
# Your Ed25519-signed JWT +CLUSTER_NAME = "mycluster" + +# Mapping: CC metric name -> (Flux query returning _value and host columns) +METRIC_MAP = { + "cpu_load": ''' + from(bucket: "{bucket}") + |> range(start: -5m) + |> filter(fn: (r) => r._measurement == "system" and r._field == "load1") + |> last() + ''', + "cpu_user": ''' + from(bucket: "{bucket}") + |> range(start: -5m) + |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_user" and r.cpu == "cpu-total") + |> last() + ''', + "mem_used": ''' + from(bucket: "{bucket}") + |> range(start: -5m) + |> filter(fn: (r) => r._measurement == "mem" and r._field == "used") + |> last() + |> map(fn: (r) => ({{ r with _value: r._value / 1048576.0 }})) + ''', + "net_bw": ''' + from(bucket: "{bucket}") + |> range(start: -5m) + |> filter(fn: (r) => r._measurement == "net" and (r._field == "bytes_recv" or r._field == "bytes_sent")) + |> derivative(unit: 1s, nonNegative: true) + |> last() + |> group(columns: ["host"]) + |> sum() + ''', +} + +# --- End Configuration ------------------------------------------------------- + +logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s") +log = logging.getLogger("influx2ccms") + + +def query_influxdb(flux_query: str) -> list: + """Execute a Flux query and return list of (host, value) tuples.""" + query = flux_query.format(bucket=INFLUXDB_BUCKET) + resp = requests.post( + f"{INFLUXDB_URL}/api/v2/query", + headers={ + "Authorization": f"Token {INFLUXDB_TOKEN}", + "Content-Type": "application/vnd.flux", + "Accept": "text/csv", + }, + params={"org": INFLUXDB_ORG}, + data=query, + timeout=15, + ) + resp.raise_for_status() + + results = [] + reader = csv.DictReader(io.StringIO(resp.text)) + for row in reader: + host = row.get("host", "") + value = row.get("_value", "") + if host and value: + try: + results.append((host, float(value))) + except ValueError: + continue + return results + + +def push_to_ccms(lines: list): + """POST metric lines to 
cc-metric-store /api/write/ endpoint.""" + if not lines: + return + payload = "\n".join(lines) + resp = requests.post( + f"{CCMS_URL}/api/write/", + headers={"Authorization": f"Bearer {JWT_TOKEN}"}, + data=payload, + timeout=10, + ) + resp.raise_for_status() + log.info("Pushed %d lines to cc-metric-store", len(lines)) + + +def main(): + lines = [] + now = int(time.time()) + + for cc_name, flux_query in METRIC_MAP.items(): + try: + results = query_influxdb(flux_query) + except Exception as e: + log.error("Failed to query %s: %s", cc_name, e) + continue + + for hostname, value in results: + lines.append( + f"{cc_name},cluster={CLUSTER_NAME},hostname={hostname}," + f"type=node value={value} {now}" + ) + + try: + push_to_ccms(lines) + except Exception as e: + log.error("Failed to push to cc-metric-store: %s", e) + sys.exit(1) + + +if __name__ == "__main__": + main() +``` + +Deploy the same way as the Prometheus script: + +```bash +chmod +x /opt/monitoring/influx2ccms/influx2ccms.py + +# Run every 60 seconds +* * * * * /opt/monitoring/influx2ccms/influx2ccms.py >> /var/log/influx2ccms.log 2>&1 +``` + +### Alternative: Telegraf HTTP Output + +If you already run [Telegraf](https://docs.influxdata.com/telegraf/) and prefer +not to maintain a separate script, Telegraf can POST directly to cc-metric-store +using the `outputs.http` plugin: + +```toml +[[outputs.http]] + url = "http://ccms.example.org:8080/api/write/" + method = "POST" + data_format = "influx" + [outputs.http.headers] + Authorization = "Bearer eyJ0eXAiOiJKV1QiLC..." 
+ +# Rename tags to match CC expectations +[[processors.rename]] + [[processors.rename.replace]] + tag = "host" + dest = "hostname" + +# Add required cluster and type tags +[[processors.override]] + [processors.override.tags] + cluster = "mycluster" + type = "node" + +# Only forward metrics we care about +[[processors.filter]] + namepass = ["system", "cpu", "mem", "net"] +``` + +{{< alert color="warning" title="Caveat" >}} +Telegraf's native tag structure may not match cc-metric-store's expected format +exactly. The `processors.rename` and `processors.override` plugins help, but +metric *names* (measurement + field) still differ from CC conventions. For full +control over the mapping, the Python script approach is more transparent. +{{< /alert >}} + +## Deployment as a Systemd Service + +For higher reliability than cron (with restart-on-failure and journald logging), +deploy the sync script as a systemd timer or a looping service. + +**Systemd service** (`/etc/systemd/system/prom2ccms.service`): + +```ini +[Unit] +Description=Prometheus to cc-metric-store Sync +After=network.target + +[Service] +Type=oneshot +User=monitoring +ExecStart=/opt/monitoring/prom2ccms/prom2ccms.py +``` + +**Systemd timer** (`/etc/systemd/system/prom2ccms.timer`): + +```ini +[Unit] +Description=Run prom2ccms every 60 seconds + +[Timer] +OnBootSec=30s +OnUnitActiveSec=60s +AccuracySec=5s + +[Install] +WantedBy=timers.target +``` + +```bash +sudo systemctl daemon-reload +sudo systemctl enable --now prom2ccms.timer + +# Check status +systemctl list-timers prom2ccms.timer +journalctl -u prom2ccms.service -f +``` + +## Testing and Validation + +### 1. Test the Write Endpoint Manually + +Verify cc-metric-store accepts writes with a simple curl command: + +```bash +JWT="eyJ0eXAiOiJKV1QiLC..." 
+ +curl -X POST "http://ccms.example.org:8080/api/write/" \ + -H "Authorization: Bearer $JWT" \ + -d "cpu_load,cluster=mycluster,hostname=testnode,type=node value=1.5 $(date +%s)" +``` + +A `200 OK` response with no body confirms success. + +### 2. Query the Data Back + +Verify the metric was stored: + +```bash +curl -X GET "http://ccms.example.org:8080/api/query/" \ + -H "Authorization: Bearer $JWT" \ + -d '{ + "cluster": "mycluster", + "from": '"$(($(date +%s) - 300))"', + "to": '"$(date +%s)"', + "queries": [{"metric": "cpu_load", "hostname": "testnode"}] + }' +``` + +### 3. Check cc-backend + +Open the cc-backend web interface and navigate to the node view. Metrics from +external sources should appear alongside natively collected ones. + +### 4. Check Logs + +Monitor cc-metric-store logs for warnings about unknown or dropped metrics: + +```bash +journalctl -u cc-metric-store -f | grep -i "unknown\|drop\|error" +``` + +## Troubleshooting + +| Symptom | Cause | Solution | +|---|---|---| +| Metrics not appearing in cc-backend | Metric name not in cc-metric-store `metrics` config | Add the metric to the `metrics` section and restart cc-metric-store | +| `401 Unauthorized` from `/api/write/` | Invalid or expired JWT token | Regenerate JWT with the correct Ed25519 private key | +| Data gaps or irregular intervals | Cron interval does not match configured `frequency` | Align cron/timer schedule with the `frequency` value in cc-metric-store config | +| Hostname mismatch (no data for known nodes) | Prometheus `instance` label includes port, or InfluxDB uses different host naming | Adjust hostname extraction in the adapter script to match cc-backend's `cluster.json` | +| Wrong values after aggregation | `aggregation` set to `sum` instead of `avg` or vice versa | Check the `aggregation` field in cc-metric-store config matches the metric semantics | +| Script runs but pushes 0 lines | Source TSDB returns empty results | Verify the source query works independently (e.g., 
test PromQL in Prometheus UI) | + +## See Also + +- [JWT Token Generation]({{< ref "../how-to-guides/generateJWT" >}}) +- [cc-metric-store Configuration]({{< ref "../reference/cc-metric-store/ccms-configuration" >}}) +- [cc-metric-store REST API]({{< ref "../reference/cc-metric-store/ccms-rest-api" >}}) +- [InfluxDB Line Protocol]({{< ref "../explanation/lineProtocol" >}}) +- [Hierarchical Metric Collection]({{< ref "../how-to-guides/hierarchical-collection" >}}) +- [Decide on Metric List]({{< ref "../tutorials/prod-metric-list" >}}) +- [Cluster Configuration]({{< ref "../how-to-guides/clusterConfig" >}}) +- [CC Line Protocol Specification](https://github.com/ClusterCockpit/cc-specifications/blob/master/interfaces/lineprotocol/README.md) diff --git a/content/en/docs/how-to-guides/useRest.md b/content/en/docs/how-to-guides/useRest.md index 3942245..7424a47 100644 --- a/content/en/docs/how-to-guides/useRest.md +++ b/content/en/docs/how-to-guides/useRest.md @@ -6,12 +6,34 @@ tags: [User, Admin, Developer] ## Overview -ClusterCockpit offers several REST API Endpoints. While some are integral part of the ClusterCockpit-Stack Workflow (such as`start_job`), others are optional. -These optional endpoints supplement the functionality of the webinterface with information reachable from scripts or the command line. For example, job metrics could be requested for specific jobs and handled in external statistics programs. +ClusterCockpit offers several REST API endpoints. While some are integral parts +of the ClusterCockpit stack workflow (such as `start_job`), others are optional +and supplement the web interface with scriptable access. For example, job +metrics can be requested for specific jobs and processed in external analysis +programs. -All of the endpoints listed for both administrators and users are secured by [JWT]({{< ref "jwtoken" >}} "JSON Web Token") authentication. As such, all prerequisites applicable to JSON Web Tokens apply in this case as well, e.g. 
[private and public key setup]({{< ref "generatejwt" >}} "Key Setup"). +All endpoints are secured by [JWT]({{< ref "jwtoken" >}} "JSON Web Token") +authentication. See [How to generate JWT tokens]({{< ref "generatejwt" >}} "Key Setup") +for setup instructions, and [the Swagger Reference]({{< ref "rest-api" >}} "Swagger REST") +for detailed endpoint documentation and payload schemas. -See also [the Swagger Reference]({{< ref "rest-api" >}} "Swagger REST") for more detailed information on each endpoint and the payloads. +## Setup: Shell Variables for Examples + +All examples in this guide use the following shell variables. Set them once +before running any command: + +```bash +export CC_URL="https://your-clustercockpit-instance" +export TOKEN="eyJ..." # your JWT token +``` + +The common curl header pattern used throughout: + +```bash +-H "Authorization: Bearer $TOKEN" -H "accept: application/json" +``` + +--- ## Admin Accessible REST API @@ -21,46 +43,506 @@ Endpoints described here should be restricted to administrators only, as they in ### Admin API Prerequisites -1. JWT has to be generated by either a dedicated API user (has only `api` role) or by an _administrator_ with both `admin` and `api` [roles]({{< ref "roles" >}} "ClusterCockpit Roles"). -2. JWTs have a limited lifetime, i.e. will become invalid after a configurable amount of time (see `auth.jwt.max-age` [config option]({{< ref "ccb-configuration" >}} "ClusterCockpit Configuration")). -3. Administrator endpoints are additionally subjected to a configurable IP whitelist (see `api-allowed-ips` [config option]({{< ref "ccb-configuration" >}} "ClusterCockpit Configuration")). Per default there is no restriction on IPs that can access the endpoints. +1. JWT must be generated by either a dedicated API user (has only `api` role) or + an administrator with both `admin` and `api` + [roles]({{< ref "roles" >}} "ClusterCockpit Roles"). +2. 
JWTs have a limited lifetime and become invalid after a configurable duration + (see `auth.jwt.max-age` [config option]({{< ref "ccb-configuration" >}} "ClusterCockpit Configuration")). +3. Admin endpoints are subject to a configurable IP allowlist + (`api-allowed-ips` [config option]({{< ref "ccb-configuration" >}} "ClusterCockpit Configuration")). + By default there is no IP restriction. ### Admin API Endpoints and Functions | Endpoint | Method | Request Payload(s) | Description | | ---------------------------------- | ----------- | --------------------- | --------------------------------------------------------------------------------------------------------- | -| `/api/users/` | GET | - | Lists all Users | -| `/api/clusters/` | GET | - | Lists all Clusters | -| `/api/tags/` | DELETE | JSON Payload | Removes payload array of tags specified with `Type, Name, Scope` from DB. Private Tags cannot be removed. | -| `/api/jobs/start_job/` | POST, PUT | JSON Payload | Starts Job | -| `/api/jobs/stop_job/` | POST, PUT | JSON Payload | Stops Jobs | -| `/api/jobs/` | GET | URL-Query Params | Lists Jobs | +| `/api/users/` | GET | - | Lists all users | +| `/api/clusters/` | GET | - | Lists all clusters | +| `/api/tags/` | DELETE | JSON Payload | Removes array of tags (Type, Name, Scope) from DB. Private tags cannot be removed. | +| `/api/jobs/start_job/` | POST, PUT | JSON Payload | Starts a job | +| `/api/jobs/stop_job/` | POST, PUT | JSON Payload | Stops a job | +| `/api/jobs/` | GET | URL-Query Params | Lists jobs | | `/api/jobs/{id}` | POST | $id, JSON Payload | Loads specified job metadata | | `/api/jobs/{id}` | GET | $id | Loads specified job with metrics | -| `/api/jobs/tag_job/{id}` | POST, PATCH | $id, JSON Payload | Adds payload array of tags specified with `Type, Name, Scope` to Job with $id. Tags are created in BD. | -| `/api/jobs/tag_job/{id}` | POST, PATCH | $id, JSON Payload | Removes payload array of tags specified with `Type, Name, Scope` from Job with $id. 
Tags remain in DB. | -| `/api/jobs/edit_meta/{id}` | POST, PATCH | $id, JSON Payload | Edits meta_data db colums info | -| `/api/jobs/metrics/{id}` | GET | $id, URL-Query Params | Loads specified jobmetrics for metric and scope params | +| `/api/jobs/tag_job/{id}` | POST, PATCH | $id, JSON Payload | Adds array of tags (Type, Name, Scope) to job. Tags are created in DB if new. | +| `/api/jobs/tag_job/{id}` | DELETE | $id, JSON Payload | Removes array of tags from job. Tags remain in DB. | +| `/api/jobs/edit_meta/{id}` | POST, PATCH | $id, JSON Payload | Edits `meta_data` column for the specified job | +| `/api/jobs/metrics/{id}` | GET | $id, URL-Query Params | Loads specified job metrics for given metric and scope params | | `/api/jobs/delete_job/` | DELETE | JSON Payload | Deletes job specified in payload | -| `/api/jobs/delete_job/{id}` | DELETE | $id, JSON Payload | Deletes job specified by db id | -| `/api/jobs/delete_job_before/{ts}` | DELETE | $ts | Deletes all jobs before specified unix timestamp | +| `/api/jobs/delete_job/{id}` | DELETE | $id, JSON Payload | Deletes job specified by database id | +| `/api/jobs/delete_job_before/{ts}` | DELETE | $ts | Deletes all jobs with stop time before the given unix timestamp | + +--- + +### Listing and Filtering Jobs + +Use `GET /api/jobs/` with URL query parameters to filter the job list. +Useful parameters: + +| Parameter | Example | Description | +| --------------- | -------------------- | -------------------------------------- | +| `state` | `running` | Job state: `running`, `completed`, ... 
| +| `cluster` | `mycluster` | Cluster name | +| `user` | `alice` | Username | +| `project` | `myproject` | Project name | +| `startTime[from]` | `1700000000` | Unix timestamp lower bound | +| `startTime[to]` | `1710000000` | Unix timestamp upper bound | +| `numNodes[from]`| `4` | Minimum node count | +| `page` | `2` | Page number (default: 1) | +| `items_per_page`| `25` | Results per page (default: 25) | + +**List all running jobs on a cluster:** + +```bash +curl -s "$CC_URL/api/jobs/?state=running&cluster=mycluster" \ + -H "Authorization: Bearer $TOKEN" \ + -H "accept: application/json" | jq '.jobs[] | {id, jobId, user, numNodes, startTime}' +``` + +**List completed jobs for a specific user in a time window:** + +```bash +START=$(date -d "7 days ago" +%s) +NOW=$(date +%s) + +curl -s "$CC_URL/api/jobs/?state=completed&user=alice&startTime[from]=$START&startTime[to]=$NOW" \ + -H "Authorization: Bearer $TOKEN" \ + -H "accept: application/json" | jq '.jobs | length' +``` + +**Paginate through all jobs (100 at a time):** + +```bash +#!/bin/bash +set -euo pipefail + +PAGE=1 +TOTAL=0 + +while true; do + RESP=$(curl -s "$CC_URL/api/jobs/?items_per_page=100&page=$PAGE" \ + -H "Authorization: Bearer $TOKEN" \ + -H "accept: application/json") + + COUNT=$(echo "$RESP" | jq '.jobs | length') + TOTAL=$((TOTAL + COUNT)) + echo "Page $PAGE: $COUNT jobs (total so far: $TOTAL)" + + [[ "$COUNT" -lt 100 ]] && break + PAGE=$((PAGE + 1)) +done +echo "Grand total: $TOTAL jobs" +``` + +--- + +### Retrieving Job Details and Metrics + +**Get full metadata and metrics for a specific job (by database id):** + +```bash +DB_ID=12345 + +curl -s "$CC_URL/api/jobs/$DB_ID" \ + -H "Authorization: Bearer $TOKEN" \ + -H "accept: application/json" | jq '{jobId, user, cluster, numNodes, startTime, duration, state}' +``` + +**Request specific metrics only:** + +```bash +curl -s "$CC_URL/api/jobs/metrics/$DB_ID?metric=flops_any&metric=mem_bw&metric=cpu_load" \ + -H "Authorization: Bearer $TOKEN" \ + -H 
"accept: application/json" | jq . +``` + +**Extract per-node average for a metric:** + +```bash +curl -s "$CC_URL/api/jobs/metrics/$DB_ID?metric=mem_bw" \ + -H "Authorization: Bearer $TOKEN" \ + -H "accept: application/json" \ +| jq ' + .metrics[] + | select(.name == "mem_bw") + | .series[] + | {hostname, avg: (.statistics.avg // "n/a")} +' +``` + +**Script: print a summary table for all completed jobs in the last 24 hours:** + +```bash +#!/bin/bash +set -euo pipefail + +SINCE=$(date -d "24 hours ago" +%s) + +curl -s "$CC_URL/api/jobs/?state=completed&startTime[from]=$SINCE" \ + -H "Authorization: Bearer $TOKEN" \ + -H "accept: application/json" \ +| jq -r ' + ["DB_ID", "JobID", "User", "Nodes", "Duration(s)", "Cluster"], + (.jobs[] | [.id, .jobId, .user, .numNodes, .duration, .cluster]) + | @tsv +' | column -t +``` + +--- + +### Tagging Jobs + +Tags have a `type`, a `name`, and a `scope` (`global` for admin-created tags). + +**Add a tag to a job:** + +```bash +DB_ID=12345 + +curl -s -X PATCH "$CC_URL/api/jobs/tag_job/$DB_ID" \ + -H "Authorization: Bearer $TOKEN" \ + -H "accept: application/json" \ + -H "Content-Type: application/json" \ + -d '[{"type": "review", "name": "problematic", "scope": "global"}]' +``` + +**Remove a tag from a job:** + +```bash +curl -s -X DELETE "$CC_URL/api/jobs/tag_job/$DB_ID" \ + -H "Authorization: Bearer $TOKEN" \ + -H "accept: application/json" \ + -H "Content-Type: application/json" \ + -d '[{"type": "review", "name": "problematic", "scope": "global"}]' +``` + +**Script: bulk-tag a list of job database IDs from a file (`job_ids.txt`):** + +```bash +#!/bin/bash +set -euo pipefail + +TAG_TYPE="review" +TAG_NAME="flagged" +TAG_SCOPE="global" + +while IFS= read -r DB_ID; do + [[ -z "$DB_ID" ]] && continue + STATUS=$(curl -s -o /dev/null -w "%{http_code}" \ + -X PATCH "$CC_URL/api/jobs/tag_job/$DB_ID" \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d "[{\"type\": \"$TAG_TYPE\", \"name\": \"$TAG_NAME\", 
\"scope\": \"$TAG_SCOPE\"}]")
+  echo "Job $DB_ID: HTTP $STATUS"
+done < job_ids.txt
+```
+
+**Script: tag all jobs from a user that exceeded a node threshold:**
+
+```bash
+#!/bin/bash
+set -euo pipefail
+
+# Use a dedicated variable instead of clobbering the shell's USER
+TARGET_USER="alice"
+MIN_NODES=64
+
+curl -s "$CC_URL/api/jobs/?state=completed&user=$TARGET_USER&numNodes[from]=$MIN_NODES" \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "accept: application/json" \
+| jq -r '.jobs[].id' \
+| while read -r DB_ID; do
+    curl -s -X PATCH "$CC_URL/api/jobs/tag_job/$DB_ID" \
+      -H "Authorization: Bearer $TOKEN" \
+      -H "Content-Type: application/json" \
+      -d '[{"type": "size", "name": "large-job", "scope": "global"}]' > /dev/null
+    echo "Tagged job $DB_ID"
+  done
+```
+
+---
+
+### Editing Job Metadata
+
+Use `PATCH /api/jobs/edit_meta/{id}` to set or update key/value pairs in the
+`meta_data` column of a job. This is useful for annotating jobs with workflow
+information, analysis results, or operator notes.
+
+**Add an annotation to a job:**
+
+```bash
+DB_ID=12345
+
+curl -s -X PATCH "$CC_URL/api/jobs/edit_meta/$DB_ID" \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "accept: application/json" \
+  -H "Content-Type: application/json" \
+  -d '{"note": "High memory variance detected; check NUMA binding"}'
+```
+
+**Script: read a CSV of `db_id,note` and annotate each job:**
+
+```bash
+#!/bin/bash
+# Format: db_id,note (one entry per line)
+set -euo pipefail
+
+while IFS=, read -r DB_ID NOTE; do
+  [[ -z "$DB_ID" ]] && continue
+  # Build the JSON payload with jq so quotes and backslashes
+  # in the note are escaped safely
+  curl -s -X PATCH "$CC_URL/api/jobs/edit_meta/$DB_ID" \
+    -H "Authorization: Bearer $TOKEN" \
+    -H "Content-Type: application/json" \
+    -d "$(jq -n --arg note "$NOTE" '{note: $note}')" > /dev/null
+  echo "Annotated job $DB_ID"
+done < annotations.csv
+```
+
+---
+
+### Job Lifecycle (Admin)
+
+For starting and stopping jobs via the API (e.g. from a batch scheduler adapter),
+see the [Hands-On Demo]({{< ref "handson" >}} "Hands-On Demo") for a complete
+walkthrough with example payloads.
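When a scheduler adapter builds the `start_job` payload, the `resources` array is usually derived from the scheduler's own host list. A minimal sketch with `jq` (the `NODELIST` variable and its default hosts are illustrative; under SLURM you could fill it via `scontrol show hostnames "$SLURM_JOB_NODELIST"`):

```shell
#!/bin/bash
set -euo pipefail

# One hostname per line; under SLURM, generate this list with e.g.
#   NODELIST=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
NODELIST=${NODELIST:-$'node01\nnode02\nnode03\nnode04'}

# Wrap each hostname as {"hostname": ...} and slurp the objects into a
# JSON array, ready to use as the "resources" field of start_job.
RESOURCES=$(printf '%s\n' "$NODELIST" | jq -R '{hostname: .}' | jq -sc .)
echo "$RESOURCES"
```

The resulting array matches the shape of the `resources` field in the `start_job` example.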
+ +**Start a job:** + +```bash +curl -s -X POST "$CC_URL/api/jobs/start_job/" \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "jobId": 100001, + "user": "alice", + "cluster": "mycluster", + "subCluster": "main", + "project": "myproject", + "startTime": '"$(date +%s)"', + "numNodes": 4, + "numHwthreads": 128, + "walltime": 86400, + "resources": [ + {"hostname": "node01"}, + {"hostname": "node02"}, + {"hostname": "node03"}, + {"hostname": "node04"} + ] + }' +``` + +The response contains the database id assigned to the new job: + +```json +{"id": 3938} +``` + +**Stop a job (by database id):** + +```bash +DB_ID=3938 + +curl -s -X POST "$CC_URL/api/jobs/stop_job/$DB_ID" \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "cluster": "mycluster", + "jobState": "completed", + "stopTime": '"$(date +%s)"' + }' +``` + +--- ## User Accessible REST API {{< alert >}} -Endpoints described here can be used by users to write scripted job analysis for their jobs only. +Endpoints described here can be used by users to write scripted job analysis for their own jobs only. {{< /alert >}} ### User API Prerequisites -1. JWT has to be generated by either a dedicated API user (Has only `api` role) or an _User_ with additional `api` [role]({{< ref "roles" >}} "ClusterCockpit Roles"). -2. JWTs have a limited lifetime, i.e. will become invalid after a configurable amount of time (see `jwt.max-age` [config option]({{< ref "configuration" >}} "ClusterCockpit Configuration")). +1. JWT must be generated for a user with the `api` + [role]({{< ref "roles" >}} "ClusterCockpit Roles") (in addition to the + regular `user` role). +2. JWTs have a limited lifetime (see `jwt.max-age` + [config option]({{< ref "ccb-configuration" >}} "ClusterCockpit Configuration")). +3. User API endpoints only return data for jobs owned by the authenticated user. 
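Because of the limited token lifetime, it can save confusion to check when a token expires before scripting against the API. A JWT's payload segment is plain base64url-encoded JSON, so the `exp` claim can be read locally. This is a sketch: the fallback dummy token is illustrative so the script runs standalone, and it only inspects the claim without verifying the signature.

```shell
#!/bin/bash
set -euo pipefail

# Fall back to a dummy token (exp = 1700000000) so the script runs
# standalone; point TOKEN at a real JWT to inspect it instead.
DEMO_PAYLOAD=$(printf '%s' '{"exp":1700000000}' | base64 | tr -d '=\n' | tr '/+' '_-')
TOKEN=${TOKEN:-"header.$DEMO_PAYLOAD.signature"}

# The claims object is the second dot-separated segment, base64url-encoded.
PAYLOAD=$(printf '%s' "$TOKEN" | cut -d. -f2 | tr '_-' '/+')

# Restore the base64 padding that the JWT encoding strips.
PAD=$(( (4 - ${#PAYLOAD} % 4) % 4 ))
if [ "$PAD" -gt 0 ]; then
  PAYLOAD="$PAYLOAD$(printf '%*s' "$PAD" '' | tr ' ' '=')"
fi

printf '%s' "$PAYLOAD" | base64 -d | jq -r '"token expires: \(.exp | todate)"'
```

Comparing the printed expiry against `date -u` tells you whether a fresh token is needed.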
 ### User API Endpoints and Functions
 
 | Endpoint                     | Method | Request               | Description                                            |
 | ---------------------------- | ------ | --------------------- | ------------------------------------------------------ |
-| `/userapi/jobs/`             | GET    | URL-Query Params      | Lists Jobs                                             |
+| `/userapi/jobs/`             | GET    | URL-Query Params      | Lists jobs belonging to the authenticated user         |
 | `/userapi/jobs/{id}`         | POST   | $id, JSON Payload     | Loads specified job metadata                           |
 | `/userapi/jobs/{id}`         | GET    | $id                   | Loads specified job with metrics                       |
-| `/userapi/jobs/metrics/{id}` | GET    | $id, URL-Query Params | Loads specified jobmetrics for metric and scope params |
+| `/userapi/jobs/metrics/{id}` | GET    | $id, URL-Query Params | Loads specified job metrics for metric and scope params |
+
+---
+
+### Listing Your Own Jobs
+
+```bash
+# All completed jobs in the last 7 days
+START=$(date -d "7 days ago" +%s)
+NOW=$(date +%s)
+
+curl -s "$CC_URL/userapi/jobs/?state=completed&startTime[from]=$START&startTime[to]=$NOW" \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "accept: application/json" \
+| jq '.jobs[] | {id, jobId, cluster, numNodes, duration, startTime}'
+```
+
+```bash
+# Count jobs by cluster
+curl -s "$CC_URL/userapi/jobs/?state=completed" \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "accept: application/json" \
+| jq '
+  .jobs
+  | group_by(.cluster)[]
+  | {cluster: .[0].cluster, count: length}
+'
+```
+
+---
+
+### Investigating a Specific Job
+
+**Get full job metadata:**
+
+```bash
+DB_ID=12345
+
+curl -s "$CC_URL/userapi/jobs/$DB_ID" \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "accept: application/json" \
+| jq '{jobId, cluster, numNodes, startTime, duration, state, tags}'
+```
+
+**Get per-node metric data and show statistics:**
+
+```bash
+curl -s "$CC_URL/userapi/jobs/metrics/$DB_ID?metric=mem_bw&metric=flops_any" \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "accept: application/json" \
+| jq '
+  .metrics[]
+  | {
+      metric: .name,
+      unit: ((.unit.prefix // "") + .unit.base),
+      nodes: [
+        .series[]
+
| { + host: .hostname, + min: .statistics.min, + avg: .statistics.avg, + max: .statistics.max + } + ] + } +' +``` + +**Check whether any node had very low CPU utilization (potential idle node):** + +```bash +curl -s "$CC_URL/userapi/jobs/metrics/$DB_ID?metric=cpu_load" \ + -H "Authorization: Bearer $TOKEN" \ + -H "accept: application/json" \ +| jq ' + .metrics[] + | select(.name == "cpu_load") + | .series[] + | select(.statistics.avg < 0.1) + | "WARNING: low cpu_load on \(.hostname): avg=\(.statistics.avg)" +' +``` + +--- + +### Statistics on Your Own Jobs + +**Script: compute average flops_any across all completed jobs in a time range:** + +```bash +#!/bin/bash +set -euo pipefail + +START=$(date -d "30 days ago" +%s) +NOW=$(date +%s) + +# Collect all completed job IDs +JOB_IDS=$(curl -s "$CC_URL/userapi/jobs/?state=completed&startTime[from]=$START&startTime[to]=$NOW" \ + -H "Authorization: Bearer $TOKEN" \ + -H "accept: application/json" \ + | jq -r '.jobs[].id') + +TOTAL=0 +COUNT=0 + +for DB_ID in $JOB_IDS; do + AVG=$(curl -s "$CC_URL/userapi/jobs/metrics/$DB_ID?metric=flops_any" \ + -H "Authorization: Bearer $TOKEN" \ + -H "accept: application/json" \ + | jq ' + [ .metrics[] + | select(.name == "flops_any") + | .series[].statistics.avg + ] | add / length' 2>/dev/null) + + if [[ "$AVG" != "null" && -n "$AVG" ]]; then + echo "Job $DB_ID: flops_any avg = $AVG" + TOTAL=$(echo "$TOTAL + $AVG" | bc) + COUNT=$((COUNT + 1)) + fi +done + +if [[ $COUNT -gt 0 ]]; then + echo "---" + echo "Jobs evaluated: $COUNT" + echo "Overall average flops_any: $(echo "scale=4; $TOTAL / $COUNT" | bc)" +fi +``` + +**Script: find your top-10 jobs by peak memory bandwidth:** + +```bash +#!/bin/bash +set -euo pipefail + +curl -s "$CC_URL/userapi/jobs/?state=completed&items_per_page=200" \ + -H "Authorization: Bearer $TOKEN" \ + -H "accept: application/json" \ +| jq -r '.jobs[].id' \ +| while read -r DB_ID; do + MAX=$(curl -s "$CC_URL/userapi/jobs/metrics/$DB_ID?metric=mem_bw" \ + -H 
"Authorization: Bearer $TOKEN" \
+      -H "accept: application/json" \
+      | jq '
+        [ .metrics[]
+          | select(.name == "mem_bw")
+          | .series[].statistics.max
+        ] | max // 0' 2>/dev/null)
+    echo "$MAX $DB_ID"
+  done \
+| sort -rn \
+| head -10 \
+| awk '{printf "DB_ID %-8s peak mem_bw = %s\n", $2, $1}'
+```
+
+**Script: summarize job efficiency, i.e. the ratio of actual runtime to requested walltime:**
+
+```bash
+#!/bin/bash
+set -euo pipefail
+
+curl -s "$CC_URL/userapi/jobs/?state=completed&items_per_page=100" \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "accept: application/json" \
+| jq -r '
+  .jobs[]
+  | select(.walltime > 0)
+  | [.id, .jobId, .duration, .walltime,
+     (100 * .duration / .walltime | round | tostring) + "%"]
+  | @tsv
+' | column -t -N "DB_ID,JobID,Duration,Walltime,Efficiency"
+```
diff --git a/content/en/docs/reference/cc-backend/rest-api.md b/content/en/docs/reference/cc-backend/rest-api.md
index ee3a3d5..cff25f8 100644
--- a/content/en/docs/reference/cc-backend/rest-api.md
+++ b/content/en/docs/reference/cc-backend/rest-api.md
@@ -98,4 +98,4 @@ This means that all interactivity ("Try It Out") will not return actual data. Ho
 Endpoints displayed here correspond to the administrator `/api/` endpoints, but user-accessible `/userapi/` endpoints are functionally identical. See [these lists]({{< ref "userest" >}} "How-To REST API") for information about accessibility.
 {{< /alert >}}
 
-{{< swagger-ui "" >}}
+{{< swagger-ui "https://raw.githubusercontent.com/ClusterCockpit/cc-backend/refs/heads/master/api/swagger.json" >}}
diff --git a/content/en/docs/reference/cc-metric-store/ccms-rest-api.md b/content/en/docs/reference/cc-metric-store/ccms-rest-api.md
index 3f00661..ea2e2c3 100644
--- a/content/en/docs/reference/cc-metric-store/ccms-rest-api.md
+++ b/content/en/docs/reference/cc-metric-store/ccms-rest-api.md
@@ -102,4 +102,4 @@ However, a `Curl` call and a compiled
 `Request URL` will still be displayed,
 if an API endpoint is executed.
{{< /alert >}} -{{< swagger-ui "" >}} +{{< swagger-ui "https://raw.githubusercontent.com/ClusterCockpit/cc-metric-store/refs/heads/main/api/swagger.json" >}}