Add wait_event_timing: Oracle-style wait event instrumentation #1

Open

DmitryNFomin wants to merge 8 commits into master from wait-event-timing

Conversation

DmitryNFomin (Owner) commented Mar 22, 2026

Summary

Adds Oracle-style wait event instrumentation to PostgreSQL — per-event timing,
histograms, query attribution, and per-session 10046-style tracing. Controlled
by a compile-time flag (--enable-wait-event-timing) with zero overhead when
not compiled in.

The problem

External BPF-based wait event profilers (like pg_wait_tracer) use CPU hardware watchpoints on PGPROC->wait_event_info, costing ~200-300 ns per debug exception — 29% TPS overhead on high-transition workloads. Oracle solved this by instrumenting internally with clock_gettime() VDSO calls (~70-100 ns, no kernel trap).

What this patch provides

Compile-time: ./configure --enable-wait-event-timing (or meson setup -Dwait_event_timing=true).
Without the flag, the binary is identical to stock PostgreSQL.

Two runtime GUCs:

| GUC | Scope | What it enables |
|---|---|---|
| `wait_event_timing` | PGC_SUSET | Per-event count, total ns, max, log2 histogram, per-(query_id, event) attribution |
| `wait_event_trace` | PGC_USERSET | Per-session ring buffer of individual events (Oracle 10046 equivalent) |

Oracle equivalents:

| Oracle | PostgreSQL function |
|---|---|
| V$SYSTEM_EVENT | `pg_stat_get_wait_event_timing()` |
| V$EVENT_HISTOGRAM | `pg_stat_get_wait_event_timing()` — histogram column |
| V$SESSION_EVENT | `pg_stat_get_wait_event_timing()` — backend_id column |
| V$SQL wait stats | `pg_stat_get_wait_event_timing_by_query()` |
| 10046 trace | `pg_stat_get_wait_event_trace(backend_id)` |

Benchmark results

Environment: Hetzner cx43 (8 vCPU, 16 GB RAM), Rocky 9.7, PG 19devel,
pgbench scale 100, shared_buffers=128MB, 8 clients, SELECT-only — worst case
at ~220K wait event transitions/sec, 60-second runs.

Test 1: Compile flag overhead

Stock PG vs patched (without `--enable-wait-event-timing`) on same data directory,
alternating A/B runs, 5 rounds:

| Round | Stock TPS | Patched (no flag) TPS |
|---|---|---|
| 1 | 116,398 | 118,427 |
| 2 | 115,850 | 116,609 |
| 3 | 116,141 | 117,669 |
| 4 | 117,699 | 118,136 |
| 5 | 117,634 | 118,879 |

No measurable difference. Hot-path object code is byte-identical
(verified via `objcopy -O binary -j .text`).

Test 2: GUC overhead (same binary + same data)

Same binary (WITH flag), same data directory, toggling GUCs between restarts, 5 rounds:

| Round | GUC off | timing ON | all ON (timing+trace) |
|---|---|---|---|
| 1 | 114,768 | 116,058 | 115,911 |
| 2 | 116,726 | 115,789 | 116,502 |
| 3 | 115,972 | 115,977 | 115,170 |
| 4 | 116,405 | 113,924 | 115,242 |
| 5 | 115,781 | 116,556 | 114,603 |
| Mean | 115,930 | 115,661 | 115,486 |

< 0.5% difference. All configs within run-to-run variance.

vs hardware watchpoints

| Approach | Worst-case overhead |
|---|---|
| This patch (all features ON) | < 0.5% |
| Hardware watchpoint (pg_wait_tracer) | 29% |

All 243 PostgreSQL regression tests pass (`make check`).

Files changed

New files:

  • `src/include/utils/wait_event_timing.h` — data structures, inline helpers
  • `src/backend/utils/activity/wait_event_timing.c` — shmem, SQL functions, stubs

Modified files:

  • `src/include/utils/wait_event.h` — timing code in inline functions (`#ifdef`)
  • `src/backend/storage/lmgr/proc.c` — init/cleanup timing storage
  • `src/backend/storage/ipc/ipci.c` — shared memory allocation
  • `src/backend/utils/activity/backend_status.c` — query_id pointer setup
  • `src/backend/utils/misc/guc_parameters.dat` — GUC definitions
  • `src/backend/utils/misc/guc_tables.c` — GUC variable externs
  • `src/include/catalog/pg_proc.dat` — SQL function registration
  • `configure.ac`, `meson.build`, `meson_options.txt` — compile-time option
  • `src/include/pg_config.h.in` — `USE_WAIT_EVENT_TIMING` define
  • Build files: `Makefile`, `meson.build` in activity/

Related work

🤖 Generated with Claude Code

DmitryNFomin and others added 5 commits March 22, 2026 20:49
Add per-backend wait event timing with nanosecond precision, controlled
by the wait_event_timing GUC (default: off).  When enabled, every
pgstat_report_wait_start()/pgstat_report_wait_end() call records the
wait duration using clock_gettime(CLOCK_MONOTONIC) and accumulates:

  - Per-event call count
  - Per-event total nanoseconds
  - Per-event max duration
  - Per-event log2 histogram (16 buckets: <1us to >=16ms)

Statistics are stored in per-backend shared memory arrays, requiring no
locking (each backend writes only to its own slot).  External tools can
read accumulated stats via the pg_stat_get_wait_event_timing() function.

Overhead when enabled: two VDSO clock_gettime() calls per wait event
transition (~40-100 ns), plus a few memory writes — comparable to
Oracle's TIMED_STATISTICS.  When disabled: one predictable branch per
wait event (~1 ns), zero other cost.

This provides PostgreSQL with Oracle V$SYSTEM_EVENT / V$EVENT_HISTOGRAM
equivalent functionality, enabling:

  - Wait event time model (which events consume the most time)
  - Latency histograms (bimodal distributions, tail latency)
  - Per-backend wait profiles

Motivation: external BPF-based wait event tracers (pg_wait_tracer) use
hardware watchpoints on PGPROC->wait_event_info, which incur 6-30%
overhead due to CPU debug exceptions.  Internal instrumentation via
clock_gettime VDSO achieves equivalent functionality at ~1-5% overhead.

New files:
  src/include/utils/wait_event_timing.h  -- data structures, inline helpers
  src/backend/utils/activity/wait_event_timing.c  -- shmem init, SQL function

Modified files:
  src/include/utils/wait_event.h  -- timing calls in inline functions
  src/backend/storage/lmgr/proc.c  -- init/cleanup timing storage
  src/backend/storage/ipc/ipci.c  -- shared memory allocation
  src/backend/utils/misc/guc_parameters.dat  -- wait_event_timing GUC
  src/include/catalog/pg_proc.dat  -- pg_stat_get_wait_event_timing()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When wait_event_timing is enabled, also accumulates wait stats per
(query_id, wait_event) pair in a per-backend hash table (1024 slots,
open addressing with linear probing).

Query_id is read from PgBackendStatus->st_query_id via a cached
pointer set during pgstat_beinit(). Zero overhead when query_id is
not set (most background processes).

New SQL function: pg_stat_get_wait_event_timing_by_query()
Returns (backend_id, query_id, wait_event_type, wait_event, calls,
total_time_ms) for all backends with non-zero counts.

Shared memory: 32 KB per backend (1024 * 32-byte entries).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DmitryNFomin force-pushed the wait-event-timing branch 2 times, most recently from dc89622 to ffe0a58 on March 23, 2026
When wait_event_trace GUC is enabled for a session (PGC_USERSET),
every pgstat_report_wait_end() writes a record to a per-backend ring
buffer in shared memory: {timestamp_ns, event, duration_ns, query_id}.

Ring buffer holds 4096 records (128 KB) per backend. At 220K events/sec
this covers ~18ms of history. External tools read via
pg_stat_get_wait_event_trace(backend_id).

Requires wait_event_timing = on (trace code runs inside the timing
block to reuse the already-computed timestamp and duration).

New GUC: wait_event_trace (bool, PGC_USERSET, default off)
New SQL function: pg_stat_get_wait_event_trace(backend_id int4)
Returns (seq, timestamp_ns, wait_event_type, wait_event, duration_us,
query_id) in chronological order.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ing)

When compiled without the flag (default), zero instructions are added to
pgstat_report_wait_start/end — identical binary to stock PostgreSQL.

When compiled with --enable-wait-event-timing:
  ./configure --enable-wait-event-timing
  meson setup -Dwait_event_timing=true

The wait_event_timing and wait_event_trace GUCs still exist in both
builds (setting them is harmless without the compile flag).

This eliminates the ~2% overhead from the compiled-in branch check
that was present even with GUC off.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
At 32 bytes per record, this is 4 MB per backend. At 220K events/sec
(worst case), holds ~0.6 seconds of history — enough for a background
worker polling at 100ms to drain without losing events.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>