feat(cosmos/workloads): add performance metrics collection for DR drill testing #46271
Open
tvaron3 wants to merge 22 commits into Azure:main from
…hroughput
- Uncomment concurrent upsert/read/query calls
- Remove manual timing counters and log_request_counts
- Set THROUGHPUT to 1000000 in workload_configs.py
- Keep CIRCUIT_BREAKER_ENABLED = False (PPCB disabled)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a performance metrics library that reports PerfResult documents to a Cosmos DB results account, matching the Rust perf tool schema exactly so both SDKs feed the same ADX → Grafana pipeline.
New files:
- perf_stats.py: Thread-safe latency histogram with sorted-list percentile calculation and atomic drain_all() for consistent summary+error snapshots
- perf_config.py: All config from environment variables (RESULTS_COSMOS_URI, PERF_REPORT_INTERVAL=300s, perfdb/perfresults defaults)
- perf_reporter.py: Background daemon thread that drains Stats every 5 min and upserts PerfResult documents via sync CosmosClient with AAD auth
Modified files:
- workload_configs.py: All configs now driven by environment variables
- workload_utils.py: Added timed operation wrappers with error tracking (CosmosHttpResponseError status_code/sub_status extraction); only successful operations record latency, to avoid polluting percentiles
- All *_workload.py files: Integrated Stats + PerfReporter with try/finally lifecycle management
Key design decisions:
- Sorted-list percentiles (no hdrhistogram native dependency)
- psutil for CPU/memory with /proc fallback on Linux
- Cached psutil.Process() instance for accurate CPU readings
- CosmosClient stored and closed properly to avoid resource leaks
- sdk_language='python', sdk_version from azure.cosmos.__version__
- PPCB disabled by default
- All reporter errors caught and logged as warnings (never crash workload)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
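The sorted-list percentile and atomic drain described in this commit could look roughly like the sketch below. It is not the code from perf_stats.py: the class name `Stats` and the method `drain_all()` come from the commit message, but `record_latency`, `record_error`, `summarize`, `percentile`, and the nearest-rank percentile rule are assumptions made for illustration.

```python
import math
import threading
from collections import defaultdict


def percentile(sorted_values, pct):
    # Nearest-rank percentile over an already sorted, non-empty list.
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(values):
    # Summary keys mirror the *_ms fields of the PerfResult document.
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "min_ms": ordered[0],
        "max_ms": ordered[-1],
        "mean_ms": sum(ordered) / len(ordered),
        "p50_ms": percentile(ordered, 50),
        "p90_ms": percentile(ordered, 90),
        "p99_ms": percentile(ordered, 99),
    }


class Stats:
    """Hypothetical stand-in for the Stats class in perf_stats.py."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latencies = defaultdict(list)  # operation -> [latency_ms, ...]
        self._errors = defaultdict(int)      # (operation, status, sub_status) -> count

    def record_latency(self, operation, latency_ms):
        with self._lock:
            self._latencies[operation].append(latency_ms)

    def record_error(self, operation, status_code, sub_status=None):
        with self._lock:
            self._errors[(operation, status_code, sub_status)] += 1

    def drain_all(self):
        # Swap both maps out under a single lock acquisition so the latency
        # summaries and the error counts describe the same interval.
        with self._lock:
            latencies, self._latencies = self._latencies, defaultdict(list)
            errors, self._errors = self._errors, defaultdict(int)
        summaries = {op: summarize(vals) for op, vals in latencies.items()}
        return summaries, dict(errors)
```

Sorting happens outside the lock, so recording threads are blocked only for the duration of the swap. The list grows without bound between drains, which is the cost a later commit in this PR addresses by switching to a fixed-size histogram.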
psutil is now a hard import (not optional). Removed all /proc/meminfo and /proc/self/status fallback parsing — if psutil is not installed, the import will fail immediately rather than silently degrading. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Single workload.py replaces 6 operation-specific files
- WORKLOAD_OPERATIONS env var controls which ops run (read,write,query)
- WORKLOAD_USE_PROXY env var enables Envoy proxy routing
- WORKLOAD_USE_SYNC env var enables sync client
- Validate operation names at import time with clear error
- Replace manual sorted-list percentiles with hdrhistogram (O(1) record/query)
- Fixed memory usage (~40KB per histogram vs unbounded list growth)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
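As an illustration of the environment-variable parsing and operation-name validation listed above, here is a minimal sketch. The variable names and the `read,write,query` default are taken from this PR; the helper names (`parse_operations`, `parse_bool`, `load_workload_settings`), the accepted truthy spellings, and the exact error wording are assumptions and may differ from workload_configs.py.

```python
import os

VALID_OPERATIONS = ("read", "write", "query")


def parse_operations(raw):
    # Split a comma-separated list, tolerate whitespace and mixed case.
    ops = [part.strip().lower() for part in raw.split(",") if part.strip()]
    unknown = [op for op in ops if op not in VALID_OPERATIONS]
    if unknown:
        raise ValueError(
            f"Unknown WORKLOAD_OPERATIONS entries {unknown}; "
            f"valid values are {list(VALID_OPERATIONS)}"
        )
    if not ops:
        raise ValueError("WORKLOAD_OPERATIONS must name at least one operation")
    return ops


def parse_bool(raw, default=False):
    # Unset or empty falls back to the default (assumed truthy spellings).
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in ("1", "true", "yes")


def load_workload_settings(env=os.environ):
    # `env` is injectable so the parsing can be exercised without
    # touching the real process environment.
    return {
        "operations": parse_operations(
            env.get("WORKLOAD_OPERATIONS", "read,write,query")
        ),
        "use_proxy": parse_bool(env.get("WORKLOAD_USE_PROXY")),
        "use_sync": parse_bool(env.get("WORKLOAD_USE_SYNC")),
    }
```

Calling `load_workload_settings()` at module import time would give the fail-fast behavior the commit describes: a typo such as `WORKLOAD_OPERATIONS=read,delete` stops the process before any client is created.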
…rkload.py
Removed: r_workload.py, w_workload.py, r_proxy_workload.py, w_proxy_workload.py, r_w_q_workload.py, r_w_q_proxy_workload.py, r_w_q_sync_workload.py
All replaced by workload.py with WORKLOAD_OPERATIONS and WORKLOAD_USE_PROXY env vars.
Kept: r_w_q_with_incorrect_client_workload.py (special test case)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces r_w_q_with_incorrect_client_workload.py with an env var: WORKLOAD_SKIP_CLOSE=true creates the client without a context manager, simulating applications that don't properly close the Cosmos client. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
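One way to picture the WORKLOAD_SKIP_CLOSE switch is the sketch below. It uses a hypothetical `FakeClient` in place of the real Cosmos client so that it runs without a network, and `run_workload` is an illustrative helper rather than a function from workload.py.

```python
from contextlib import ExitStack


class FakeClient:
    """Hypothetical stand-in for a client that supports `with`."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # never swallow exceptions from the workload


def run_workload(client_factory, skip_close, work):
    with ExitStack() as stack:
        client = client_factory()
        if not skip_close:
            # Normal path: the client is closed when the stack unwinds.
            stack.enter_context(client)
        # skip_close path: the client is deliberately left open, mimicking
        # an application that never calls close().
        work(client)
    return client
```

With `skip_close=False` the returned client reports `closed == True`; with `skip_close=True` it stays open after the workload finishes.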
Switch from time.perf_counter() * 1000 to time.perf_counter_ns() / 1_000_000 for nanosecond precision without floating-point multiplication artifacts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Infra/orchestration scripts belong in the cosmos-sdk-copilot-toolkit repo, not in the SDK repo. Workload code (workload.py, perf_*, workload_utils.py) stays here. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…istogram The pip package is 'hdrhistogram' but the Python module is 'hdrh'. Import changed from 'from hdrhistogram import HdrHistogram' to 'from hdrh.histogram import HdrHistogram'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reports COSMOS_USE_MULTIPLE_WRITABLE_LOCATIONS in the config snapshot so it's visible in the Grafana dashboard and queryable from Kusto. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The variable was used but never defined — caused pylint E0602. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…, histogram clamp, safe parsing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…dictionary Move cspell words to sdk/cosmos/azure-cosmos/cspell.json instead of root .vscode/cspell.json to keep changes within cosmos folder scope. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…faultAzureCredential Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rfResult Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t-toolkit Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR modernizes the Cosmos DB DR drill workload harness by consolidating multiple per-scenario scripts into a single environment-variable-driven workload runner, and adds an optional performance metrics collection/reporting layer that can upsert results to a separate Cosmos DB account for dashboarding and analysis.
Changes:
- Consolidates prior workload entrypoints into a single workload.py controlled by environment variables.
- Adds performance statistics collection (per-operation latency + errors) and a background reporter that periodically upserts PerfResult documents.
- Refactors workload configuration and operation helpers to be environment-variable-driven and to support timed operation wrappers.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/tests/workloads/workload.py | New unified workload entrypoint (sync/async + proxy + perf hooks). |
| sdk/cosmos/azure-cosmos/tests/workloads/workload_utils.py | Adds timed operation wrappers and Cosmos error status/substatus extraction for perf stats. |
| sdk/cosmos/azure-cosmos/tests/workloads/workload_configs.py | Moves workload/client configuration to environment variables with parsing/validation. |
| sdk/cosmos/azure-cosmos/tests/workloads/perf_stats.py | New thread-safe per-operation latency histograms + error aggregation (HdrHistogram-backed). |
| sdk/cosmos/azure-cosmos/tests/workloads/perf_reporter.py | New daemon-thread reporter to drain stats and upsert PerfResult/Error docs to a results Cosmos account. |
| sdk/cosmos/azure-cosmos/tests/workloads/perf_config.py | New perf reporter configuration builder from environment variables (includes git SHA + defaults). |
| sdk/cosmos/azure-cosmos/cspell.json | Adds workload/perf-specific dictionary exceptions for spell checking. |
| sdk/cosmos/azure-cosmos/tests/workloads/w_workload.py | Removes legacy write-only workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/w_proxy_workload.py | Removes legacy write-via-proxy workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_workload.py | Removes legacy read/query workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_w_q_workload.py | Removes legacy mixed workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_w_q_with_incorrect_client_workload.py | Removes legacy “incorrect client usage” script (replaced by WORKLOAD_SKIP_CLOSE). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_w_q_sync_workload.py | Removes legacy sync mixed workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_w_q_proxy_workload.py | Removes legacy mixed workload + proxy script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_proxy_workload.py | Removes legacy read/query-via-proxy workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/setup_env.sh | Removes local environment bootstrap script (moved out of repo per PR description). |
| sdk/cosmos/azure-cosmos/tests/workloads/run_workloads.sh | Removes legacy multi-script launcher (superseded by env-var-driven workload). |
| sdk/cosmos/azure-cosmos/tests/workloads/dev.md | Removes legacy local runbook doc (moved out of repo per PR description). |
Extract _extra_kwargs(), _timed_call(), and _timed_call_async() to eliminate duplicated excluded_locations branching and timing/error recording across 6 operation functions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
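The shape of a shared timing helper like the one this commit extracts might be as follows. This is a sketch under assumptions: `OperationError` stands in for CosmosHttpResponseError, `RecordingStats` stands in for the real stats object, and re-raising after recording the error is a choice made for the example; the actual `_timed_call()` signature and behavior are not shown in this PR page.

```python
import time


class OperationError(Exception):
    """Hypothetical stand-in for CosmosHttpResponseError."""

    def __init__(self, status_code, sub_status=None):
        super().__init__(f"status {status_code}")
        self.status_code = status_code
        self.sub_status = sub_status


class RecordingStats:
    """Hypothetical stats sink that simply remembers what it was given."""

    def __init__(self):
        self.latencies = []
        self.errors = []

    def record_latency(self, operation, latency_ms):
        self.latencies.append((operation, latency_ms))

    def record_error(self, operation, status_code, sub_status):
        self.errors.append((operation, status_code, sub_status))


def timed_call(stats, operation, func, *args, **kwargs):
    start_ns = time.perf_counter_ns()
    try:
        result = func(*args, **kwargs)
    except OperationError as exc:
        # Failures are counted by status code but contribute no latency
        # sample, so error paths do not distort the percentiles.
        stats.record_error(operation, exc.status_code, exc.sub_status)
        raise
    # Integer nanoseconds, converted to milliseconds with one division.
    elapsed_ms = (time.perf_counter_ns() - start_ns) / 1_000_000
    stats.record_latency(operation, elapsed_ms)
    return result
```

An async variant would have the same structure with `await func(*args, **kwargs)` inside an `async def`.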
… flush lock
1. Use public azure.core.pipeline.transport import for AioHttpTransport
2. Fail fast with RuntimeError if WORKLOAD_USE_SYNC + WORKLOAD_USE_PROXY
3. Add threading.Lock around _flush(), skip final flush if thread alive
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
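The flush lock and the "skip final flush if thread alive" rule from item 3 can be sketched as below. The class name `Reporter`, its constructor arguments, and the `drain`/`publish` callables are assumptions for illustration; the real PerfReporter upserts to Cosmos DB, which is replaced here by a plain callable.

```python
import logging
import threading


class Reporter:
    """Hypothetical periodic reporter; not the PerfReporter from this PR."""

    def __init__(self, drain, publish, interval_s):
        self._drain = drain          # callable returning a snapshot
        self._publish = publish      # callable that ships the snapshot
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._flush_lock = threading.Lock()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval_s):
            self._flush()

    def _flush(self):
        # The lock keeps a periodic flush and the final flush from draining
        # and publishing at the same time.
        with self._flush_lock:
            try:
                self._publish(self._drain())
            except Exception as exc:  # reporting must never crash the workload
                logging.warning("perf report failed: %s", exc)

    def stop(self, timeout_s=5.0):
        self._stop.set()
        if self._thread.ident is not None:  # only join a started thread
            self._thread.join(timeout_s)
        if not self._thread.is_alive():
            # Final flush only when the background thread has really exited.
            self._flush()
```

If the join times out because a publish is still in flight, the final flush is skipped rather than racing the background thread, at the cost of losing the last partial interval.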
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Unified Workload with Performance Metrics for Cosmos DB DR Drill Testing
Summary
Consolidates 8 separate workload scripts into a single workload.py controlled entirely by environment variables, and adds a performance metrics collection layer that reports latency, throughput, errors, and resource utilization to a Cosmos DB results account, enabling Grafana dashboards and ADX-powered analysis for DR drill testing.
Motivation
Previously, testing different workload configurations (read-only, write-only, proxy, sync, etc.) required copying and modifying entire workload files. Adding observability meant manually instrumenting each file. This PR addresses both problems.
Architecture
    workload.py
    ├── WORKLOAD_OPERATIONS → read, write, query (comma-separated)
    ├── WORKLOAD_USE_PROXY → route through Envoy proxy
    ├── WORKLOAD_USE_SYNC → sync vs async client
    ├── WORKLOAD_SKIP_CLOSE → simulate applications that don't close the client
    └── PerfReporter (daemon thread, configurable interval)
        ├── HdrHistogram per operation (p50/p90/p99, O(1) record)
        ├── psutil for CPU/memory metrics
        ├── Error status code + sub-status code tracking
        └── Upserts PerfResult docs to results Cosmos DB account
New Files
- workload.py
- perf_stats.py
- perf_reporter.py
- perf_config.py

Modified Files
- workload_configs.py
- workload_utils.py: time.perf_counter_ns() with error status code capture

Deleted Files
- r_workload.py, w_workload.py, r_proxy_workload.py, w_proxy_workload.py, r_w_q_workload.py, r_w_q_proxy_workload.py, r_w_q_sync_workload.py, r_w_q_with_incorrect_client_workload.py, run_workloads.sh, dev.md

All replaced by workload.py with environment variable configuration. Infrastructure/orchestration scripts (run_workloads.sh, dev.md) moved to the cosmos-sdk-copilot-toolkit repo.
Environment Variables
Workload behavior:
| Variable | Default |
|---|---|
| WORKLOAD_OPERATIONS | read,write,query |
| WORKLOAD_USE_PROXY | false |
| WORKLOAD_USE_SYNC | false |
| WORKLOAD_SKIP_CLOSE | false |

Cosmos DB client:

| Variable | Default |
|---|---|
| COSMOS_URI | |
| COSMOS_PREFERRED_LOCATIONS | "" |
| COSMOS_CLIENT_EXCLUDED_LOCATIONS | "" |
| COSMOS_USE_MULTIPLE_WRITABLE_LOCATIONS | false |
| AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER | false |
| COSMOS_CONCURRENT_REQUESTS | 100 |
| COSMOS_THROUGHPUT | 1000000 |

Metrics reporting:

| Variable | Default |
|---|---|
| RESULTS_COSMOS_URI | "" |
| PERF_ENABLED | true |
| PERF_REPORT_INTERVAL | 300 |

PerfResult Document Schema
Each document matches the Rust SDK perf tool schema, enabling both SDKs to feed the same ADX → Grafana pipeline:

    {
      "operation": "ReadItem",
      "count": 12345,
      "errors": 0,
      "min_ms": 1.2,
      "max_ms": 50.3,
      "mean_ms": 5.4,
      "p50_ms": 4.8,
      "p90_ms": 8.2,
      "p99_ms": 15.1,
      "cpu_percent": 45.2,
      "memory_bytes": 104857600,
      "system_cpu_percent": 62.1,
      "sdk_language": "python",
      "sdk_version": "4.15.0",
      "config_concurrency": 100,
      "config_ppcb_enabled": false
    }

Usage Examples

    # All operations, direct connection
    WORKLOAD_OPERATIONS=read,write,query python3 workload.py

    # Read-only via Envoy proxy
    WORKLOAD_OPERATIONS=read WORKLOAD_USE_PROXY=true python3 workload.py

    # Write-only with metrics reporting
    WORKLOAD_OPERATIONS=write \
    RESULTS_COSMOS_URI=https://my-results.documents.azure.com:443/ \
    python3 workload.py

    # Simulate unclosed client
    WORKLOAD_SKIP_CLOSE=true python3 workload.py

    # Scale to 5 processes
    for i in $(seq 5); do
      WORKLOAD_OPERATIONS=read,write,query nohup python3 workload.py &
    done
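For readers who want to see how a PerfResult document with the fields from the schema sample might be assembled, here is a minimal sketch. The field names come from that sample; the function `build_perf_result`, its parameters, and the shape of its inputs are assumptions, and any additional fields the real reporter writes (for example a document id) are not covered here.

```python
def build_perf_result(operation, summary, error_count, process_metrics, config):
    # `summary` is assumed to hold count and latency fields in milliseconds,
    # `process_metrics` the CPU and memory readings, `config` the run settings.
    return {
        "operation": operation,
        "count": summary["count"],
        "errors": error_count,
        "min_ms": summary["min_ms"],
        "max_ms": summary["max_ms"],
        "mean_ms": summary["mean_ms"],
        "p50_ms": summary["p50_ms"],
        "p90_ms": summary["p90_ms"],
        "p99_ms": summary["p99_ms"],
        "cpu_percent": process_metrics["cpu_percent"],
        "memory_bytes": process_metrics["memory_bytes"],
        "system_cpu_percent": process_metrics["system_cpu_percent"],
        "sdk_language": "python",
        "sdk_version": config["sdk_version"],
        "config_concurrency": config["concurrency"],
        "config_ppcb_enabled": config["ppcb_enabled"],
    }


# Example inputs reusing the values from the schema sample.
example = build_perf_result(
    operation="ReadItem",
    summary={
        "count": 12345, "min_ms": 1.2, "max_ms": 50.3, "mean_ms": 5.4,
        "p50_ms": 4.8, "p90_ms": 8.2, "p99_ms": 15.1,
    },
    error_count=0,
    process_metrics={
        "cpu_percent": 45.2, "memory_bytes": 104857600,
        "system_cpu_percent": 62.1,
    },
    config={"sdk_version": "4.15.0", "concurrency": 100, "ppcb_enabled": False},
)
```

Keeping the document flat, with configuration values prefixed `config_`, matches the sample and keeps every field directly addressable in downstream queries.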