feat(cosmos/workloads): add performance metrics collection for DR drill testing#46271

Open
tvaron3 wants to merge 22 commits intoAzure:mainfrom
tvaron3:feat/dr-drill-workload-metrics

@tvaron3
Member

@tvaron3 tvaron3 commented Apr 12, 2026

Unified Workload with Performance Metrics for Cosmos DB DR Drill Testing

Summary

Consolidates 8 separate workload scripts into a single workload.py controlled entirely by environment variables, and adds a performance metrics collection layer that reports latency, throughput, errors, and resource utilization to a Cosmos DB results account. This enables Grafana dashboards and ADX-powered analysis for DR drill testing.

Motivation

Previously, testing different workload configurations (read-only, write-only, proxy, sync, etc.) required copying and modifying entire workload files. Adding observability meant manually instrumenting each file. This PR solves both problems:

  1. One workload, many configurations — env vars control operations, proxy, sync, and client behavior
  2. Built-in metrics — every operation is timed and reported automatically

Architecture

workload.py
├── WORKLOAD_OPERATIONS → read, write, query (comma-separated)
├── WORKLOAD_USE_PROXY → route through Envoy proxy
├── WORKLOAD_USE_SYNC → sync vs async client
├── WORKLOAD_SKIP_CLOSE → simulate applications that don't close the client
└── PerfReporter (daemon thread, configurable interval)
    ├── HdrHistogram per operation (p50/p90/p99, O(1) record)
    ├── psutil for CPU/memory metrics
    ├── Error status code + sub-status code tracking
    └── Upserts PerfResult docs to results Cosmos DB account
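The reporter branch of the tree above can be sketched as follows. This is a minimal illustration and not the code in perf_reporter.py: the class name `PeriodicReporter`, the `flush` callback, and the `stop()` behavior are assumptions. The lock around the flush and the "final flush only if the thread has exited" rule follow a later commit message in this PR.

```python
import logging
import threading
from typing import Callable


class PeriodicReporter:
    """Hypothetical stand-in for PerfReporter: calls `flush` every `interval_s` seconds."""

    def __init__(self, flush: Callable[[], None], interval_s: float) -> None:
        self._flush = flush
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._flush_lock = threading.Lock()
        # daemon=True so a stuck reporter never keeps the workload process alive
        self._thread = threading.Thread(
            target=self._run, name="perf-reporter", daemon=True
        )

    def start(self) -> None:
        self._thread.start()

    def _run(self) -> None:
        # Event.wait returns False on timeout (time to report), True after stop()
        while not self._stop.wait(self._interval_s):
            self._safe_flush()

    def _safe_flush(self) -> None:
        with self._flush_lock:
            try:
                self._flush()
            except Exception as exc:  # reporting must never crash the workload
                logging.warning("perf flush failed: %s", exc)

    def stop(self, timeout_s: float = 5.0) -> None:
        self._stop.set()
        self._thread.join(timeout_s)
        # Final flush only if the background thread has really exited
        if not self._thread.is_alive():
            self._safe_flush()
```

In the real reporter the `flush` callback would drain the per-operation statistics and upsert PerfResult documents; here it is left as an opaque callable so the threading behavior can be read in isolation.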

New Files

| File | Description |
| --- | --- |
| workload.py | Single unified workload; replaces all 8 operation-specific files |
| perf_stats.py | Per-operation latency tracking via HdrHistogram: O(1) record, O(1) percentile, fixed 40KB memory per histogram |
| perf_reporter.py | Background daemon thread that upserts PerfResult documents to a results Cosmos DB account at a configurable interval |
| perf_config.py | All metrics configuration from environment variables |
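As a rough picture of what perf_stats.py is responsible for, the sketch below shows a thread-safe collector with an atomic drain. It is an assumption-laden illustration: the class and method names are hypothetical (only `drain_all()` appears in the commit messages), and plain Python lists stand in for HdrHistogram so the example needs no third-party package. Lists grow without bound, which is exactly the problem the fixed-size histograms in this PR avoid.

```python
import threading
from collections import defaultdict


class Stats:
    """Hypothetical thread-safe collector; plain lists stand in for HdrHistogram."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._latencies_ms = defaultdict(list)  # operation -> [latency_ms, ...]
        self._errors = defaultdict(int)  # (operation, status, sub_status) -> count

    def record(self, operation: str, latency_ms: float) -> None:
        with self._lock:
            self._latencies_ms[operation].append(latency_ms)

    def record_error(self, operation, status_code, sub_status=None) -> None:
        with self._lock:
            self._errors[(operation, status_code, sub_status)] += 1

    def drain_all(self):
        """Return (latencies, errors) and reset both under one lock acquisition."""
        with self._lock:
            latencies, self._latencies_ms = self._latencies_ms, defaultdict(list)
            errors, self._errors = self._errors, defaultdict(int)
        return dict(latencies), dict(errors)
```

Swapping both containers inside a single critical section means a report never pairs latencies from one interval with errors from another.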

Modified Files

| File | Change |
| --- | --- |
| workload_configs.py | All SDK configs now driven by environment variables (no manual editing) |
| workload_utils.py | Added timed operation wrappers using time.perf_counter_ns() with error status code capture |
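The timed-wrapper idea in workload_utils.py might look roughly like the sketch below. The names `timed_call`, `on_success`, and `on_error` are hypothetical, and `FakeCosmosError` is a local stand-in for `CosmosHttpResponseError` so the example runs without the SDK installed. Re-raising the exception after recording it is an assumption; the PR text only says that status and sub-status codes are captured and that failed operations do not record latency.

```python
import time


class FakeCosmosError(Exception):
    """Stand-in for azure.cosmos.exceptions.CosmosHttpResponseError (illustration only)."""

    def __init__(self, status_code, sub_status=None):
        super().__init__(f"status={status_code} sub_status={sub_status}")
        self.status_code = status_code
        self.sub_status = sub_status


def timed_call(operation, func, on_success, on_error):
    """Run func(); report latency in ms on success, status codes on failure."""
    start_ns = time.perf_counter_ns()
    try:
        result = func()
    except FakeCosmosError as exc:
        # Failed calls are counted but not timed, so they cannot skew percentiles
        on_error(operation, exc.status_code, exc.sub_status)
        raise
    # Integer nanoseconds, converted to milliseconds once at the end
    elapsed_ms = (time.perf_counter_ns() - start_ns) / 1_000_000
    on_success(operation, elapsed_ms)
    return result
```

An async variant would follow the same shape with `await func()` in place of `func()`.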

Deleted Files

r_workload.py, w_workload.py, r_proxy_workload.py, w_proxy_workload.py, r_w_q_workload.py, r_w_q_proxy_workload.py, r_w_q_sync_workload.py,
r_w_q_with_incorrect_client_workload.py, run_workloads.sh, dev.md

The eight workload scripts are replaced by workload.py with environment variable configuration. The infrastructure/orchestration files (run_workloads.sh, dev.md) moved to the cosmos-sdk-copilot-toolkit repo.

Environment Variables

Workload behavior:

| Variable | Default | Description |
| --- | --- | --- |
| WORKLOAD_OPERATIONS | read,write,query | Comma-separated operations to run |
| WORKLOAD_USE_PROXY | false | Route through Envoy proxy |
| WORKLOAD_USE_SYNC | false | Use sync client instead of async |
| WORKLOAD_SKIP_CLOSE | false | Don't close client (simulate incorrect usage) |
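A minimal sketch of how these four variables could be parsed and validated is shown below. The function names and the set of accepted truthy strings are assumptions rather than the behavior of workload_configs.py. The rejection of sync plus proxy mirrors a later commit in this PR, which fails fast with a RuntimeError for that combination.

```python
import os

VALID_OPERATIONS = ("read", "write", "query")


def parse_bool(raw, default=False):
    """Treat 'true', '1', 'yes' (any case) as True; unset or empty uses default."""
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in ("true", "1", "yes")


def load_workload_settings(env=os.environ):
    raw_ops = env.get("WORKLOAD_OPERATIONS", "read,write,query")
    ops = [o.strip().lower() for o in raw_ops.split(",") if o.strip()]
    unknown = [o for o in ops if o not in VALID_OPERATIONS]
    if unknown or not ops:
        raise ValueError(
            f"WORKLOAD_OPERATIONS must be a non-empty subset of "
            f"{VALID_OPERATIONS}, got {raw_ops!r}"
        )
    settings = {
        "operations": ops,
        "use_proxy": parse_bool(env.get("WORKLOAD_USE_PROXY")),
        "use_sync": parse_bool(env.get("WORKLOAD_USE_SYNC")),
        "skip_close": parse_bool(env.get("WORKLOAD_SKIP_CLOSE")),
    }
    # A later commit in this PR rejects the sync client combined with the proxy
    if settings["use_sync"] and settings["use_proxy"]:
        raise RuntimeError("WORKLOAD_USE_SYNC cannot be combined with WORKLOAD_USE_PROXY")
    return settings
```

Passing a plain dict instead of `os.environ` keeps the parser easy to exercise without touching the process environment.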

Cosmos DB client:

| Variable | Default | Description |
| --- | --- | --- |
| COSMOS_URI | (required) | Workload account endpoint |
| COSMOS_PREFERRED_LOCATIONS | "" | Comma-separated preferred regions |
| COSMOS_CLIENT_EXCLUDED_LOCATIONS | "" | Comma-separated excluded regions |
| COSMOS_USE_MULTIPLE_WRITABLE_LOCATIONS | false | Enable multi-region writes |
| AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER | false | Per-partition circuit breaker |
| COSMOS_CONCURRENT_REQUESTS | 100 | Concurrent operations per process |
| COSMOS_THROUGHPUT | 1000000 | Container provisioned throughput |

Metrics reporting:

| Variable | Default | Description |
| --- | --- | --- |
| RESULTS_COSMOS_URI | "" | Results account endpoint (enables metrics when set) |
| PERF_ENABLED | true | Toggle metrics collection |
| PERF_REPORT_INTERVAL | 300 | Seconds between metric reports |
PerfResult Document Schema

Each document matches the Rust SDK perf tool schema, enabling both SDKs to feed the same ADX → Grafana pipeline:

{
  "operation": "ReadItem",
  "count": 12345,
  "errors": 0,
  "min_ms": 1.2, "max_ms": 50.3, "mean_ms": 5.4,
  "p50_ms": 4.8, "p90_ms": 8.2, "p99_ms": 15.1,
  "cpu_percent": 45.2,
  "memory_bytes": 104857600,
  "system_cpu_percent": 62.1,
  "sdk_language": "python",
  "sdk_version": "4.15.0",
  "config_concurrency": 100,
  "config_ppcb_enabled": false
}
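To make the latency fields of the schema concrete, the sketch below assembles a PerfResult-shaped dict from a list of successful latencies. It is illustrative only: the function names are hypothetical, nearest-rank percentiles on a sorted list are used in place of the HdrHistogram queries this PR relies on, and treating `count` as the number of successful samples is an assumption. The CPU, memory, and SDK version fields are omitted because they depend on psutil and the installed SDK.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile on a non-empty ascending list (0 < pct <= 100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def build_perf_result(operation, latencies_ms, errors, concurrency, ppcb_enabled):
    values = sorted(latencies_ms)
    doc = {
        "operation": operation,
        "count": len(values),  # assumption: successful operations only
        "errors": errors,
        "sdk_language": "python",
        "config_concurrency": concurrency,
        "config_ppcb_enabled": ppcb_enabled,
    }
    if values:  # latency fields are only meaningful with at least one sample
        doc.update(
            {
                "min_ms": values[0],
                "max_ms": values[-1],
                "mean_ms": sum(values) / len(values),
                "p50_ms": percentile(values, 50),
                "p90_ms": percentile(values, 90),
                "p99_ms": percentile(values, 99),
            }
        )
    return doc
```

Sorting costs O(n log n) per report and keeps every sample in memory, which is the trade-off the histogram-based implementation removes.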

Usage Examples

# All operations, direct connection
WORKLOAD_OPERATIONS=read,write,query python3 workload.py

# Read-only via Envoy proxy
WORKLOAD_OPERATIONS=read WORKLOAD_USE_PROXY=true python3 workload.py

# Write-only with metrics reporting
WORKLOAD_OPERATIONS=write \
  RESULTS_COSMOS_URI=https://my-results.documents.azure.com:443/ \
  python3 workload.py

# Simulate unclosed client
WORKLOAD_SKIP_CLOSE=true python3 workload.py

# Scale to 5 processes
for i in $(seq 5); do
  WORKLOAD_OPERATIONS=read,write,query nohup python3 workload.py &
done
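For the "unclosed client" example above, the difference between the two code paths can be sketched as follows. `FakeClient` and `run_workload` are hypothetical names used only for illustration; per the commit message, the real flag creates the Cosmos client without a context manager, and the sketch mimics that with a stand-in class.

```python
class FakeClient:
    """Stand-in for a Cosmos client that supports the context-manager protocol."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()

    def close(self):
        self.closed = True


def run_workload(client_factory, skip_close, work):
    """Run work(client); leave the client open when skip_close is True."""
    if skip_close:
        client = client_factory()  # deliberately never closed
        work(client)
        return client
    with client_factory() as client:  # closed on exit, even if work() raises
        work(client)
    return client
```

Calling `run_workload(FakeClient, skip_close=True, work=...)` returns a client whose `closed` attribute is still False, which is the incorrect usage pattern the flag is meant to reproduce.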

tvaron3 and others added 15 commits April 9, 2026 19:29
…hroughput

- Uncomment concurrent upsert/read/query calls
- Remove manual timing counters and log_request_counts
- Set THROUGHPUT to 1000000 in workload_configs.py
- Keep CIRCUIT_BREAKER_ENABLED = False (PPCB disabled)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a performance metrics library that reports PerfResult documents to a
Cosmos DB results account, matching the Rust perf tool schema exactly so
both SDKs feed the same ADX → Grafana pipeline.

New files:
- perf_stats.py: Thread-safe latency histogram with sorted-list percentile
  calculation and atomic drain_all() for consistent summary+error snapshots
- perf_config.py: All config from environment variables (RESULTS_COSMOS_URI,
  PERF_REPORT_INTERVAL=300s, perfdb/perfresults defaults)
- perf_reporter.py: Background daemon thread that drains Stats every 5 min
  and upserts PerfResult documents via sync CosmosClient with AAD auth

Modified files:
- workload_configs.py: All configs now driven by environment variables
- workload_utils.py: Added timed operation wrappers with error tracking
  (CosmosHttpResponseError status_code/sub_status extraction), only
  successful operations record latency to avoid polluting percentiles
- All *_workload.py files: Integrated Stats + PerfReporter with try/finally
  lifecycle management

Key design decisions:
- Sorted-list percentiles (no hdrhistogram native dependency)
- psutil for CPU/memory with /proc fallback on Linux
- Cached psutil.Process() instance for accurate CPU readings
- CosmosClient stored and closed properly to avoid resource leaks
- sdk_language='python', sdk_version from azure.cosmos.__version__
- PPCB disabled by default
- All reporter errors caught and logged as warnings (never crash workload)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
psutil is now a hard import (not optional). Removed all /proc/meminfo
and /proc/self/status fallback parsing — if psutil is not installed,
the import will fail immediately rather than silently degrading.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Single workload.py replaces 6 operation-specific files
- WORKLOAD_OPERATIONS env var controls which ops run (read,write,query)
- WORKLOAD_USE_PROXY env var enables Envoy proxy routing
- WORKLOAD_USE_SYNC env var enables sync client
- Validate operation names at import time with clear error
- Replace manual sorted-list percentiles with hdrhistogram (O(1) record/query)
- Fixed memory usage (~40KB per histogram vs unbounded list growth)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rkload.py

Removed: r_workload.py, w_workload.py, r_proxy_workload.py,
w_proxy_workload.py, r_w_q_workload.py, r_w_q_proxy_workload.py,
r_w_q_sync_workload.py

All replaced by workload.py with WORKLOAD_OPERATIONS and
WORKLOAD_USE_PROXY env vars.

Kept: r_w_q_with_incorrect_client_workload.py (special test case)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces r_w_q_with_incorrect_client_workload.py with an env var:
WORKLOAD_SKIP_CLOSE=true creates the client without a context manager,
simulating applications that don't properly close the Cosmos client.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Switch from time.perf_counter() * 1000 to time.perf_counter_ns() / 1_000_000
for nanosecond precision without floating-point multiplication artifacts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Infra/orchestration scripts belong in the cosmos-sdk-copilot-toolkit repo,
not in the SDK repo. Workload code (workload.py, perf_*, workload_utils.py)
stays here.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…istogram

The pip package is 'hdrhistogram' but the Python module is 'hdrh'.
Import changed from 'from hdrhistogram import HdrHistogram' to
'from hdrh.histogram import HdrHistogram'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reports COSMOS_USE_MULTIPLE_WRITABLE_LOCATIONS in the config snapshot
so it's visible in the Grafana dashboard and queryable from Kusto.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The variable was used but never defined — caused pylint E0602.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…, histogram clamp, safe parsing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…dictionary

Move cspell words to sdk/cosmos/azure-cosmos/cspell.json instead of
root .vscode/cspell.json to keep changes within cosmos folder scope.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tvaron3 and others added 3 commits April 11, 2026 20:53
…faultAzureCredential

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rfResult

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t-toolkit

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tvaron3 tvaron3 marked this pull request as ready for review April 12, 2026 18:18
@tvaron3 tvaron3 requested a review from a team as a code owner April 12, 2026 18:18
Copilot AI review requested due to automatic review settings April 12, 2026 18:18
Contributor

Copilot AI left a comment

Pull request overview

This PR modernizes the Cosmos DB DR drill workload harness by consolidating multiple per-scenario scripts into a single environment-variable-driven workload runner, and adds an optional performance metrics collection/reporting layer that can upsert results to a separate Cosmos DB account for dashboarding and analysis.

Changes:

  • Consolidates prior workload entrypoints into a single workload.py controlled by environment variables.
  • Adds performance statistics collection (per-operation latency + errors) and a background reporter that periodically upserts PerfResult documents.
  • Refactors workload configuration and operation helpers to be environment-variable-driven and to support timed operation wrappers.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| sdk/cosmos/azure-cosmos/tests/workloads/workload.py | New unified workload entrypoint (sync/async + proxy + perf hooks). |
| sdk/cosmos/azure-cosmos/tests/workloads/workload_utils.py | Adds timed operation wrappers and Cosmos error status/substatus extraction for perf stats. |
| sdk/cosmos/azure-cosmos/tests/workloads/workload_configs.py | Moves workload/client configuration to environment variables with parsing/validation. |
| sdk/cosmos/azure-cosmos/tests/workloads/perf_stats.py | New thread-safe per-operation latency histograms + error aggregation (HdrHistogram-backed). |
| sdk/cosmos/azure-cosmos/tests/workloads/perf_reporter.py | New daemon-thread reporter to drain stats and upsert PerfResult/Error docs to a results Cosmos account. |
| sdk/cosmos/azure-cosmos/tests/workloads/perf_config.py | New perf reporter configuration builder from environment variables (includes git SHA + defaults). |
| sdk/cosmos/azure-cosmos/cspell.json | Adds workload/perf-specific dictionary exceptions for spell checking. |
| sdk/cosmos/azure-cosmos/tests/workloads/w_workload.py | Removes legacy write-only workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/w_proxy_workload.py | Removes legacy write-via-proxy workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_workload.py | Removes legacy read/query workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_w_q_workload.py | Removes legacy mixed workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_w_q_with_incorrect_client_workload.py | Removes legacy "incorrect client usage" script (replaced by WORKLOAD_SKIP_CLOSE). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_w_q_sync_workload.py | Removes legacy sync mixed workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_w_q_proxy_workload.py | Removes legacy mixed workload + proxy script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_proxy_workload.py | Removes legacy read/query-via-proxy workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/setup_env.sh | Removes local environment bootstrap script (moved out of repo per PR description). |
| sdk/cosmos/azure-cosmos/tests/workloads/run_workloads.sh | Removes legacy multi-script launcher (superseded by env-var-driven workload). |
| sdk/cosmos/azure-cosmos/tests/workloads/dev.md | Removes legacy local runbook doc (moved out of repo per PR description). |

tvaron3 and others added 4 commits April 12, 2026 21:02
Extract _extra_kwargs(), _timed_call(), and _timed_call_async() to
eliminate duplicated excluded_locations branching and timing/error
recording across 6 operation functions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… flush lock

1. Use public azure.core.pipeline.transport import for AioHttpTransport
2. Fail fast with RuntimeError if WORKLOAD_USE_SYNC + WORKLOAD_USE_PROXY
3. Add threading.Lock around _flush(), skip final flush if thread alive

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>