feat(cosmos/workloads): add performance metrics collection for DR drill testing by tvaron3 · Pull Request #46271 · Azure/azure-sdk-for-python

tvaron3 · 2026-04-12T03:42:54Z

Unified Workload with Performance Metrics for Cosmos DB DR Drill Testing

Summary

Consolidates 8 separate workload scripts into a single workload.py controlled entirely by environment variables, and adds a performance metrics
collection layer that reports latency, throughput, errors, and resource
utilization to a Cosmos DB results account — enabling Grafana dashboards and ADX-powered analysis for DR drill testing.

Motivation

Previously, testing different workload configurations (read-only, write-only, proxy, sync, etc.) required copying and modifying entire workload files.
Adding observability meant manually instrumenting each file. This PR solves
both problems:

- One workload, many configurations: env vars control operations, proxy, sync, and client behavior
- Built-in metrics: every operation is timed and reported automatically

Architecture

workload.py
├── WORKLOAD_OPERATIONS → read, write, query (comma-separated)
├── WORKLOAD_USE_PROXY → route through Envoy proxy
├── WORKLOAD_USE_SYNC → sync vs async client
├── WORKLOAD_SKIP_CLOSE → simulate applications that don't close the client
└── PerfReporter (daemon thread, configurable interval)
    ├── HdrHistogram per operation (p50/p90/p99, O(1) record)
    ├── psutil for CPU/memory metrics
    ├── Error status code + sub-status code tracking
    └── Upserts PerfResult docs to results Cosmos DB account

New Files

File	Description
`workload.py`	Single unified workload — replaces all 8 operation-specific files
`perf_stats.py`	Per-operation latency tracking via HdrHistogram: O(1) record, O(1) percentile, fixed 40KB memory per histogram
`perf_reporter.py`	Background daemon thread that upserts PerfResult documents to a results Cosmos DB account at a configurable interval
`perf_config.py`	All metrics configuration from environment variables

Modified Files

File	Change
`workload_configs.py`	All SDK configs now driven by environment variables (no manual editing)
`workload_utils.py`	Added timed operation wrappers using `time.perf_counter_ns()` with error status code capture
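
The timed wrappers described for `workload_utils.py` could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's code: `FakeCosmosError` is a hypothetical stand-in for `CosmosHttpResponseError`, and `timed_call` and its two callbacks are invented names. Re-raising the error after recording it is also an assumption.

```python
import time


class FakeCosmosError(Exception):
    """Hypothetical stand-in for CosmosHttpResponseError (assumption)."""

    def __init__(self, status_code, sub_status=None):
        super().__init__(f"status={status_code} sub_status={sub_status}")
        self.status_code = status_code
        self.sub_status = sub_status


def timed_call(operation, func, record_latency, record_error):
    """Run func() and time it with perf_counter_ns.

    Latency is recorded only on success so failures do not skew the
    percentiles; failures are counted by (status_code, sub_status).
    """
    start_ns = time.perf_counter_ns()
    try:
        result = func()
    except FakeCosmosError as exc:
        record_error(operation, exc.status_code, exc.sub_status)
        raise  # assumption: the caller still sees the failure
    elapsed_ms = (time.perf_counter_ns() - start_ns) / 1_000_000
    record_latency(operation, elapsed_ms)
    return result


latencies, errors = [], []


def failing_read():
    raise FakeCosmosError(503, 20003)


def on_latency(operation, ms):
    latencies.append((operation, ms))


def on_error(operation, status_code, sub_status):
    errors.append((operation, status_code, sub_status))


timed_call("ReadItem", lambda: "ok", on_latency, on_error)
try:
    timed_call("ReadItem", failing_read, on_latency, on_error)
except FakeCosmosError:
    pass
```

Dividing the integer nanosecond delta once, at the end, keeps the measurement in integer arithmetic until the final conversion to milliseconds.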

Deleted Files

r_workload.py, w_workload.py, r_proxy_workload.py, w_proxy_workload.py, r_w_q_workload.py, r_w_q_proxy_workload.py, r_w_q_sync_workload.py, r_w_q_with_incorrect_client_workload.py, run_workloads.sh, setup_env.sh, dev.md

The workload scripts are all replaced by workload.py with environment variable configuration. The infrastructure/orchestration files (run_workloads.sh, setup_env.sh, dev.md) moved to the cosmos-sdk-copilot-toolkit repo.

Environment Variables

Workload behavior:

Variable	Default	Description
`WORKLOAD_OPERATIONS`	`read,write,query`	Comma-separated operations to run
`WORKLOAD_USE_PROXY`	`false`	Route through Envoy proxy
`WORKLOAD_USE_SYNC`	`false`	Use sync client instead of async
`WORKLOAD_SKIP_CLOSE`	`false`	Don't close client (simulate incorrect usage)
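
As a rough sketch of how these four variables might be parsed and validated: the function names, the set of accepted truthy strings, and the error messages below are assumptions, not taken from `workload_configs.py`. The check that rejects sync plus proxy mirrors the fail-fast behavior mentioned in a later commit on this PR.

```python
import os

VALID_OPERATIONS = {"read", "write", "query"}


def parse_bool(name, default=False, env=os.environ):
    """Read a boolean flag; the accepted truthy spellings are an assumption."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in ("1", "true", "yes")


def parse_operations(env=os.environ):
    """Split WORKLOAD_OPERATIONS on commas and reject unknown names."""
    raw = env.get("WORKLOAD_OPERATIONS", "read,write,query")
    ops = [part.strip().lower() for part in raw.split(",") if part.strip()]
    unknown = sorted(set(ops) - VALID_OPERATIONS)
    if not ops or unknown:
        raise ValueError(
            f"WORKLOAD_OPERATIONS must list read, write and/or query; got {raw!r}"
        )
    return ops


def load_workload_settings(env=os.environ):
    """Collect the workload behavior flags into one dict."""
    settings = {
        "operations": parse_operations(env),
        "use_proxy": parse_bool("WORKLOAD_USE_PROXY", env=env),
        "use_sync": parse_bool("WORKLOAD_USE_SYNC", env=env),
        "skip_close": parse_bool("WORKLOAD_SKIP_CLOSE", env=env),
    }
    if settings["use_sync"] and settings["use_proxy"]:
        raise RuntimeError("WORKLOAD_USE_SYNC cannot be combined with WORKLOAD_USE_PROXY")
    return settings
```

Passing a plain dict as `env` makes the parsing easy to exercise without touching the real process environment.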

Cosmos DB client:

Variable	Default	Description
`COSMOS_URI`	(required)	Workload account endpoint
`COSMOS_PREFERRED_LOCATIONS`	`""`	Comma-separated preferred regions
`COSMOS_CLIENT_EXCLUDED_LOCATIONS`	`""`	Comma-separated excluded regions
`COSMOS_USE_MULTIPLE_WRITABLE_LOCATIONS`	`false`	Enable multi-region writes
`AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER`	`false`	Per-partition circuit breaker
`COSMOS_CONCURRENT_REQUESTS`	`100`	Concurrent operations per process
`COSMOS_THROUGHPUT`	`1000000`	Container provisioned throughput

Metrics reporting:

Variable	Default	Description
`RESULTS_COSMOS_URI`	`""`	Results account endpoint (enables metrics when set)
`PERF_ENABLED`	`true`	Toggle metrics collection
`PERF_REPORT_INTERVAL`	`300`	Seconds between metric reports
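
A minimal sketch of how these three settings could combine, assuming reporting is active only when `PERF_ENABLED` is true and `RESULTS_COSMOS_URI` is non-empty, and that an unparsable or non-positive interval falls back to the default. The function name and both of those behaviors are assumptions rather than a description of `perf_config.py`.

```python
import os

DEFAULT_REPORT_INTERVAL_S = 300


def load_perf_config(env=os.environ):
    """Build the metrics settings from environment variables (sketch)."""
    results_uri = env.get("RESULTS_COSMOS_URI", "").strip()
    perf_enabled = env.get("PERF_ENABLED", "true").strip().lower() == "true"
    try:
        interval_s = int(env.get("PERF_REPORT_INTERVAL", str(DEFAULT_REPORT_INTERVAL_S)))
    except ValueError:
        interval_s = DEFAULT_REPORT_INTERVAL_S  # assumption: fall back, do not crash
    if interval_s <= 0:
        interval_s = DEFAULT_REPORT_INTERVAL_S
    return {
        # Assumption: both the toggle and a results endpoint are required.
        "enabled": perf_enabled and bool(results_uri),
        "results_uri": results_uri,
        "report_interval_s": interval_s,
    }
```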

PerfResult Document Schema

Each document matches the Rust SDK perf tool schema, enabling
both SDKs to feed the same ADX → Grafana pipeline:

{
  "operation": "ReadItem",
  "count": 12345,
  "errors": 0,
  "min_ms": 1.2, "max_ms": 50.3, "mean_ms": 5.4,
  "p50_ms": 4.8, "p90_ms": 8.2, "p99_ms": 15.1,
  "cpu_percent": 45.2,
  "memory_bytes": 104857600,
  "system_cpu_percent": 62.1,
  "sdk_language": "python",
  "sdk_version": "4.15.0",
  "config_concurrency": 100,
  "config_ppcb_enabled": false
}
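
For readers who prefer a typed view, the fields shown above can be mirrored in a dataclass. This is only an illustration of the example document; the class name `PerfResult` follows the schema name, but the PR may build the document differently, and a stored Cosmos DB item would carry additional fields that are not shown here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PerfResult:
    """Typed mirror of the example document above (illustrative only)."""

    operation: str
    count: int
    errors: int
    min_ms: float
    max_ms: float
    mean_ms: float
    p50_ms: float
    p90_ms: float
    p99_ms: float
    cpu_percent: float
    memory_bytes: int
    system_cpu_percent: float
    sdk_language: str
    sdk_version: str
    config_concurrency: int
    config_ppcb_enabled: bool


doc = asdict(
    PerfResult(
        operation="ReadItem", count=12345, errors=0,
        min_ms=1.2, max_ms=50.3, mean_ms=5.4,
        p50_ms=4.8, p90_ms=8.2, p99_ms=15.1,
        cpu_percent=45.2, memory_bytes=104857600, system_cpu_percent=62.1,
        sdk_language="python", sdk_version="4.15.0",
        config_concurrency=100, config_ppcb_enabled=False,
    )
)
print(json.dumps(doc, indent=2))
```

Later commits on this PR add `config_multi_write_enabled`, `config_proxy_enabled`, and `config_skip_close` to the document, so the field list above reflects only the example as written.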

Usage Examples

# All operations, direct connection
WORKLOAD_OPERATIONS=read,write,query python3 workload.py

# Read-only via Envoy proxy
WORKLOAD_OPERATIONS=read WORKLOAD_USE_PROXY=true python3 workload.py

# Write-only with metrics reporting
WORKLOAD_OPERATIONS=write \
  RESULTS_COSMOS_URI=https://my-results.documents.azure.com:443/ \
  python3 workload.py

# Simulate unclosed client
WORKLOAD_SKIP_CLOSE=true python3 workload.py

# Scale to 5 processes
for i in $(seq 5); do
  WORKLOAD_OPERATIONS=read,write,query nohup python3 workload.py &
done

…hroughput
- Uncomment concurrent upsert/read/query calls
- Remove manual timing counters and log_request_counts
- Set THROUGHPUT to 1000000 in workload_configs.py
- Keep CIRCUIT_BREAKER_ENABLED = False (PPCB disabled)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add a performance metrics library that reports PerfResult documents to a Cosmos DB results account, matching the Rust perf tool schema exactly so both SDKs feed the same ADX → Grafana pipeline.

New files:
- perf_stats.py: Thread-safe latency histogram with sorted-list percentile calculation and atomic drain_all() for consistent summary+error snapshots
- perf_config.py: All config from environment variables (RESULTS_COSMOS_URI, PERF_REPORT_INTERVAL=300s, perfdb/perfresults defaults)
- perf_reporter.py: Background daemon thread that drains Stats every 5 min and upserts PerfResult documents via sync CosmosClient with AAD auth

Modified files:
- workload_configs.py: All configs now driven by environment variables
- workload_utils.py: Added timed operation wrappers with error tracking (CosmosHttpResponseError status_code/sub_status extraction), only successful operations record latency to avoid polluting percentiles
- All *_workload.py files: Integrated Stats + PerfReporter with try/finally lifecycle management

Key design decisions:
- Sorted-list percentiles (no hdrhistogram native dependency)
- psutil for CPU/memory with /proc fallback on Linux
- Cached psutil.Process() instance for accurate CPU readings
- CosmosClient stored and closed properly to avoid resource leaks
- sdk_language='python', sdk_version from azure.cosmos.__version__
- PPCB disabled by default
- All reporter errors caught and logged as warnings (never crash workload)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
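
The sorted-list approach with an atomic `drain_all()` that this commit describes can be sketched as follows. The class and helper names are hypothetical, and the nearest-rank percentile rule is an assumption, since the commit message does not say which rule was used.

```python
import math
import threading


def percentile(sorted_values, pct):
    """Nearest-rank percentile over an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(values):
    """Sort once, then read every statistic from the sorted copy."""
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "min_ms": ordered[0],
        "max_ms": ordered[-1],
        "mean_ms": sum(ordered) / len(ordered),
        "p50_ms": percentile(ordered, 50),
        "p90_ms": percentile(ordered, 90),
        "p99_ms": percentile(ordered, 99),
    }


class SortedListStats:
    """Hypothetical thread-safe collector for latencies and error counts."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latencies = {}  # operation -> list of latencies in ms
        self._errors = {}  # (operation, status_code, sub_status) -> count

    def record_latency(self, operation, ms):
        with self._lock:
            self._latencies.setdefault(operation, []).append(ms)

    def record_error(self, operation, status_code, sub_status):
        key = (operation, status_code, sub_status)
        with self._lock:
            self._errors[key] = self._errors.get(key, 0) + 1

    def drain_all(self):
        """Swap out both maps under one lock so the snapshot is consistent."""
        with self._lock:
            latencies, self._latencies = self._latencies, {}
            errors, self._errors = self._errors, {}
        # Sorting happens outside the lock so recorders are not blocked.
        return {op: summarize(vals) for op, vals in latencies.items()}, errors
```

The per-operation lists grow without bound between drains, which is the drawback a later commit on this PR cites when replacing this approach with hdrhistogram.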

psutil is now a hard import (not optional). Removed all /proc/meminfo and /proc/self/status fallback parsing — if psutil is not installed, the import will fail immediately rather than silently degrading. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Single workload.py replaces 6 operation-specific files
- WORKLOAD_OPERATIONS env var controls which ops run (read,write,query)
- WORKLOAD_USE_PROXY env var enables Envoy proxy routing
- WORKLOAD_USE_SYNC env var enables sync client
- Validate operation names at import time with clear error
- Replace manual sorted-list percentiles with hdrhistogram (O(1) record/query)
- Fixed memory usage (~40KB per histogram vs unbounded list growth)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…rkload.py

Removed: r_workload.py, w_workload.py, r_proxy_workload.py, w_proxy_workload.py, r_w_q_workload.py, r_w_q_proxy_workload.py, r_w_q_sync_workload.py

All replaced by workload.py with WORKLOAD_OPERATIONS and WORKLOAD_USE_PROXY env vars.

Kept: r_w_q_with_incorrect_client_workload.py (special test case)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replaces r_w_q_with_incorrect_client_workload.py with an env var: WORKLOAD_SKIP_CLOSE=true creates the client without a context manager, simulating applications that don't properly close the Cosmos client. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
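
The difference between the two modes can be illustrated with a stand-in client. `FakeClient`, `run_workload`, and `main` are hypothetical names used only for this sketch; the real workload constructs a Cosmos client instead.

```python
class FakeClient:
    """Hypothetical stand-in for a Cosmos client (assumption)."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()

    def close(self):
        self.closed = True


def run_workload(client):
    """Placeholder for the read/write/query loop."""
    return "done"


def main(skip_close):
    client = FakeClient()
    if skip_close:
        # WORKLOAD_SKIP_CLOSE=true: no context manager, so close() never runs.
        run_workload(client)
    else:
        # Default: the context manager closes the client on exit.
        with client:
            run_workload(client)
    return client
```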

Infra/orchestration scripts belong in the cosmos-sdk-copilot-toolkit repo, not in the SDK repo. Workload code (workload.py, perf_*, workload_utils.py) stays here. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…istogram The pip package is 'hdrhistogram' but the Python module is 'hdrh'. Import changed from 'from hdrhistogram import HdrHistogram' to 'from hdrh.histogram import HdrHistogram'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…dictionary Move cspell words to sdk/cosmos/azure-cosmos/cspell.json instead of root .vscode/cspell.json to keep changes within cosmos folder scope. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR modernizes the Cosmos DB DR drill workload harness by consolidating multiple per-scenario scripts into a single environment-variable-driven workload runner, and adds an optional performance metrics collection/reporting layer that can upsert results to a separate Cosmos DB account for dashboarding and analysis.

Changes:

Consolidates prior workload entrypoints into a single workload.py controlled by environment variables.
Adds performance statistics collection (per-operation latency + errors) and a background reporter that periodically upserts PerfResult documents.
Refactors workload configuration and operation helpers to be environment-variable-driven and to support timed operation wrappers.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.

File	Description
sdk/cosmos/azure-cosmos/tests/workloads/workload.py	New unified workload entrypoint (sync/async + proxy + perf hooks).
sdk/cosmos/azure-cosmos/tests/workloads/workload_utils.py	Adds timed operation wrappers and Cosmos error status/substatus extraction for perf stats.
sdk/cosmos/azure-cosmos/tests/workloads/workload_configs.py	Moves workload/client configuration to environment variables with parsing/validation.
sdk/cosmos/azure-cosmos/tests/workloads/perf_stats.py	New thread-safe per-operation latency histograms + error aggregation (HdrHistogram-backed).
sdk/cosmos/azure-cosmos/tests/workloads/perf_reporter.py	New daemon-thread reporter to drain stats and upsert PerfResult/Error docs to a results Cosmos account.
sdk/cosmos/azure-cosmos/tests/workloads/perf_config.py	New perf reporter configuration builder from environment variables (includes git SHA + defaults).
sdk/cosmos/azure-cosmos/cspell.json	Adds workload/perf-specific dictionary exceptions for spell checking.
sdk/cosmos/azure-cosmos/tests/workloads/w_workload.py	Removes legacy write-only workload script (replaced by `workload.py`).
sdk/cosmos/azure-cosmos/tests/workloads/w_proxy_workload.py	Removes legacy write-via-proxy workload script (replaced by `workload.py`).
sdk/cosmos/azure-cosmos/tests/workloads/r_workload.py	Removes legacy read/query workload script (replaced by `workload.py`).
sdk/cosmos/azure-cosmos/tests/workloads/r_w_q_workload.py	Removes legacy mixed workload script (replaced by `workload.py`).
sdk/cosmos/azure-cosmos/tests/workloads/r_w_q_with_incorrect_client_workload.py	Removes legacy “incorrect client usage” script (replaced by `WORKLOAD_SKIP_CLOSE`).
sdk/cosmos/azure-cosmos/tests/workloads/r_w_q_sync_workload.py	Removes legacy sync mixed workload script (replaced by `workload.py`).
sdk/cosmos/azure-cosmos/tests/workloads/r_w_q_proxy_workload.py	Removes legacy mixed workload + proxy script (replaced by `workload.py`).
sdk/cosmos/azure-cosmos/tests/workloads/r_proxy_workload.py	Removes legacy read/query-via-proxy workload script (replaced by `workload.py`).
sdk/cosmos/azure-cosmos/tests/workloads/setup_env.sh	Removes local environment bootstrap script (moved out of repo per PR description).
sdk/cosmos/azure-cosmos/tests/workloads/run_workloads.sh	Removes legacy multi-script launcher (superseded by env-var-driven workload).
sdk/cosmos/azure-cosmos/tests/workloads/dev.md	Removes legacy local runbook doc (moved out of repo per PR description).

Extract _extra_kwargs(), _timed_call(), and _timed_call_async() to eliminate duplicated excluded_locations branching and timing/error recording across 6 operation functions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… flush lock
1. Use public azure.core.pipeline.transport import for AioHttpTransport
2. Fail fast with RuntimeError if WORKLOAD_USE_SYNC + WORKLOAD_USE_PROXY
3. Add threading.Lock around _flush(), skip final flush if thread alive

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
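
The third item in the commit above, a lock around `_flush()` and a final flush that is skipped while the thread is still alive, might look like the sketch below. The `Reporter` class, its constructor arguments, and the join timeout are assumptions made for illustration; the real reporter upserts to Cosmos DB rather than calling a plain callback.

```python
import logging
import threading


class Reporter:
    """Hypothetical periodic reporter running on a daemon thread."""

    def __init__(self, drain, upsert, interval_s):
        self._drain = drain  # callable returning a list of documents
        self._upsert = upsert  # callable that sends one document
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._flush_lock = threading.Lock()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # Event.wait returns False on timeout and True once stop() sets it.
        while not self._stop.wait(self._interval_s):
            self._flush()

    def _flush(self):
        # The lock keeps a periodic flush and the final flush from overlapping.
        with self._flush_lock:
            try:
                for doc in self._drain():
                    self._upsert(doc)
            except Exception as exc:  # reporting must never crash the workload
                logging.warning("perf report failed: %s", exc)

    def stop(self, join_timeout_s=5.0):
        self._stop.set()
        self._thread.join(join_timeout_s)
        if not self._thread.is_alive():
            self._flush()  # final flush only once the thread has exited
```

Waiting on the event instead of sleeping lets `stop()` wake the thread immediately rather than after a full reporting interval.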

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

tvaron3 and others added 15 commits April 9, 2026 19:29

perf(workloads): use perf_counter_ns for higher precision timing

14d7797

Switch from time.perf_counter() * 1000 to time.perf_counter_ns() / 1_000_000 for nanosecond precision without floating-point multiplication artifacts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix(workloads): use get_mean_value() for hdrh API

63ae1a8

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

style(workloads): fix black formatting and setup_env.sh references

52a3956

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat(workloads): add config_multi_write_enabled to PerfResult

12b4a20

Reports COSMOS_USE_MULTIPLE_WRITABLE_LOCATIONS in the config snapshot so it's visible in the Grafana dashboard and queryable from Kusto. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix(workloads): define multi_write variable in perf_reporter

f647d54

The variable was used but never defined — caused pylint E0602. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix(workloads): address review findings — lazy imports, session close…

b595bd9

…, histogram clamp, safe parsing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions bot added the Cosmos label Apr 12, 2026

github-project-automation bot added this to CosmosDB Python Eco-System Apr 12, 2026

tvaron3 and others added 3 commits April 11, 2026 20:53

fix(workloads): use RESULTS_COSMOS_KEY when available, fallback to De…

8ce76f9

…faultAzureCredential Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat(workloads): add config_proxy_enabled and config_skip_close to Pe…

42ee5c2

…rfResult Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

refactor(workloads): remove setup_env.sh — moved to cosmos-sdk-copilo…

b3354db

…t-toolkit Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

tvaron3 marked this pull request as ready for review April 12, 2026 18:18

tvaron3 requested a review from a team as a code owner April 12, 2026 18:18

Copilot AI review requested due to automatic review settings April 12, 2026 18:18

Copilot started reviewing on behalf of tvaron3 April 12, 2026 18:19

Copilot AI reviewed Apr 12, 2026

tvaron3 and others added 4 commits April 12, 2026 21:02

fix(workloads): re-enable WorkloadLoggerFilter to reduce log noise

84c0d41

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix(workloads): rename coro to coroutine/awaitable to fix cspell errors

7bad5ab

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

tvaron3 wants to merge 22 commits into Azure:main from tvaron3:feat/dr-drill-workload-metrics
