SPOC-393: Use pg_replication_origin_status for wait_for_sync_event#348
SPOC-393: Use pg_replication_origin_status for wait_for_sync_event#348
Conversation
Previously, wait_for_sync_event() relied on spock.progress.remote_commit_lsn to track apply progress. This required Spock to maintain its own progress table, adding complexity. This change switches to using PostgreSQL's native pg_replication_origin_status view. The key insight is that PostgreSQL's RecordTransactionCommit() only calls replorigin_session_advance() when there is actual WAL written. For empty transactions (like sync_event), this never happens. The fix adds explicit replorigin_session_advance() call in handle_commit() for empty transactions, ensuring pg_replication_origin_status.remote_lsn is advanced even when no WAL is written. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The two wait_for_sync_event() procedure overloads differed only in how they received the origin parameter (name vs OID). Refactor to have the name-based version resolve the node name to OID, then delegate to the OID-based version, reducing code duplication. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
📝 WalkthroughWalkthroughIntroduces a new Changes
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Tip Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
sql/spock--6.0.0-devel.sql (1)
444-508: Same timeout precision and NULL progress_lsn observations apply here as in the migration SQL.This is the base install version of the same procedure. The recommended refactor to use
clock_timestamp()for timeout tracking (instead of the incrementalelapsed_timecounter) applies here as well for consistency.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sql/spock--6.0.0-devel.sql` around lines 444 - 508, The wait loop in procedure spock.wait_for_sync_event uses an incremental elapsed_time counter and compares progress_lsn directly to lsn without handling NULL; change timeout handling to use clock_timestamp() by recording a start_ts := clock_timestamp() before the loop and replace elapsed_time checks with clock_timestamp() - start_ts >= (timeout || ' seconds')::interval, and guard the progress_lsn vs lsn comparison to handle NULL (e.g. only compare when progress_lsn IS NOT NULL, otherwise treat as not reached). Update references in the procedure to use start_ts, progress_lsn and timeout interval comparison and leave the pg_sleep loop otherwise unchanged.sql/spock--5.0.4--6.0.0-devel.sql (2)
236-238: No row inpg_replication_origin_statussilently leavesprogress_lsnas NULL.If the apply worker hasn't started yet or the origin doesn't exist in
pg_replication_origin_status,progress_lsnstays NULL and the loop spins until timeout. This is likely acceptable, but a debug RAISE NOTICE on the first miss (or a NOT FOUND check) would help users diagnose whywait_for_sync_eventis timing out.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sql/spock--5.0.4--6.0.0-devel.sql` around lines 236 - 238, The SELECT INTO from pg_replication_origin_status can leave progress_lsn NULL when no row exists; update the wait_for_sync_event logic to detect that case by checking the PL/pgSQL NOT FOUND condition (or testing progress_lsn IS NULL) immediately after the SELECT and emit a single RAISE NOTICE on the first miss to aid debugging; introduce a local boolean flag (e.g., first_miss) initialized true, RAISE NOTICE including sub_slot and that pg_replication_origin_status had no row, then set first_miss := false so subsequent loop iterations don’t spam notices, and otherwise continue the existing timeout/loop behavior.
190-254: Timeout tracking uses loop-count-based estimation, not wall-clock time.
elapsed_timeincrements by 0.2 each iteration, but the actual time per iteration includes query execution overhead (the subscription check + the origin status query + ROLLBACK). Under load, the real elapsed time could significantly exceed the requestedtimeout. Consider usingclock_timestamp()for accurate wall-clock tracking, similar towait_for_apply_workerin the same extension.♻️ Proposed fix for accurate timeout tracking
DECLARE target_id oid; - elapsed_time numeric := 0; + start_time timestamptz := clock_timestamp(); progress_lsn pg_lsn; sub_is_enabled bool; sub_slot name; ... - elapsed_time := elapsed_time + .2; - IF timeout <> 0 AND elapsed_time >= timeout THEN + IF timeout <> 0 AND EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= timeout THEN result := false; RETURN; END IF;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sql/spock--5.0.4--6.0.0-devel.sql` around lines 190 - 254, The timeout logic in spock.wait_for_sync_event is using a loop-count-based elapsed_time += 0.2 which underestimates real wall-clock time; change to record a start timestamp (e.g., start_ts := clock_timestamp()) at the top of the procedure and replace the elapsed_time numeric increments with a wall-clock check like extract(epoch FROM (clock_timestamp() - start_ts)) to compare against timeout (seconds), keeping the existing timeout semantics and the rest of the loop (subscription checks, progress_lsn query, ROLLBACK, pg_sleep) unchanged; update or remove the elapsed_time variable and ensure comparisons use the new timestamp-based elapsed calculation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@sql/spock--5.0.4--6.0.0-devel.sql`:
- Around line 236-238: The SELECT INTO from pg_replication_origin_status can
leave progress_lsn NULL when no row exists; update the wait_for_sync_event logic
to detect that case by checking the PL/pgSQL NOT FOUND condition (or testing
progress_lsn IS NULL) immediately after the SELECT and emit a single RAISE
NOTICE on the first miss to aid debugging; introduce a local boolean flag (e.g.,
first_miss) initialized true, RAISE NOTICE including sub_slot and that
pg_replication_origin_status had no row, then set first_miss := false so
subsequent loop iterations don’t spam notices, and otherwise continue the
existing timeout/loop behavior.
- Around line 190-254: The timeout logic in spock.wait_for_sync_event is using a
loop-count-based elapsed_time += 0.2 which underestimates real wall-clock time;
change to record a start timestamp (e.g., start_ts := clock_timestamp()) at the
top of the procedure and replace the elapsed_time numeric increments with a
wall-clock check like extract(epoch FROM (clock_timestamp() - start_ts)) to
compare against timeout (seconds), keeping the existing timeout semantics and
the rest of the loop (subscription checks, progress_lsn query, ROLLBACK,
pg_sleep) unchanged; update or remove the elapsed_time variable and ensure
comparisons use the new timestamp-based elapsed calculation.
In `@sql/spock--6.0.0-devel.sql`:
- Around line 444-508: The wait loop in procedure spock.wait_for_sync_event uses
an incremental elapsed_time counter and compares progress_lsn directly to lsn
without handling NULL; change timeout handling to use clock_timestamp() by
recording a start_ts := clock_timestamp() before the loop and replace
elapsed_time checks with clock_timestamp() - start_ts >= (timeout || '
seconds')::interval, and guard the progress_lsn vs lsn comparison to handle NULL
(e.g. only compare when progress_lsn IS NOT NULL, otherwise treat as not
reached). Update references in the procedure to use start_ts, progress_lsn and
timeout interval comparison and leave the pg_sleep loop otherwise unchanged.
Previously, wait_for_sync_event() relied on spock.progress.remote_commit_lsn
to track apply progress. This required Spock to maintain its own progress
table, adding complexity.
This change switches to using PostgreSQL's native pg_replication_origin_status
view. The key insight is that PostgreSQL's RecordTransactionCommit() only
calls replorigin_session_advance() when there is actual WAL written. For
empty transactions (like sync_event), this never happens.
The fix adds explicit replorigin_session_advance() call in handle_commit()
for empty transactions, ensuring pg_replication_origin_status.remote_lsn
is advanced even when no WAL is written.