Skip to content

SPOC-393: Use pg_replication_origin_status for wait_for_sync_event#348

Open
rasifr wants to merge 3 commits intomainfrom
task/SPOC-393-sync-event
Open

SPOC-393: Use pg_replication_origin_status for wait_for_sync_event#348
rasifr wants to merge 3 commits intomainfrom
task/SPOC-393-sync-event

Conversation

@rasifr
Copy link
Member

@rasifr rasifr commented Feb 17, 2026

Previously, wait_for_sync_event() relied on spock.progress.remote_commit_lsn
to track apply progress. This required Spock to maintain its own progress
table, adding complexity.

This change switches to using PostgreSQL's native pg_replication_origin_status
view. The key insight is that PostgreSQL's RecordTransactionCommit() only
calls replorigin_session_advance() when there is actual WAL written. For
empty transactions (like sync_event), this never happens.

The fix adds explicit replorigin_session_advance() call in handle_commit()
for empty transactions, ensuring pg_replication_origin_status.remote_lsn
is advanced even when no WAL is written.

rasifr and others added 3 commits February 17, 2026 15:33
Previously, wait_for_sync_event() relied on spock.progress.remote_commit_lsn
to track apply progress. This required Spock to maintain its own progress
table, adding complexity.

This change switches to using PostgreSQL's native pg_replication_origin_status
view. The key insight is that PostgreSQL's RecordTransactionCommit() only
calls replorigin_session_advance() when there is actual WAL written. For
empty transactions (like sync_event), this never happens.

The fix adds explicit replorigin_session_advance() call in handle_commit()
for empty transactions, ensuring pg_replication_origin_status.remote_lsn
is advanced even when no WAL is written.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The two wait_for_sync_event() procedure overloads differed only in how
they received the origin parameter (name vs OID). Refactor to have the
name-based version resolve the node name to OID, then delegate to the
OID-based version, reducing code duplication.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@coderabbitai
Copy link

coderabbitai bot commented Feb 17, 2026

📝 Walkthrough

Walkthrough

Introduces a new spock.wait_for_sync_event() procedure for monitoring replication progress with timeout support, updates its signature to accept origin names, adds subscription validation logic, handles empty transactions in the replication apply layer, and includes comprehensive regression tests.

Changes

Cohort / File(s) Summary
Build Configuration
Makefile
Added sync_event regression test target to the REGRESS list alongside existing tests.
SQL Extension Schema
sql/spock--5.0.4--6.0.0-devel.sql, sql/spock--6.0.0-devel.sql
Introduced two spock.wait_for_sync_event() procedure variants with subscription validation, slot usage tracking, and timeout handling. Updated procedure signature to accept origin name parameter instead of oid in the public interface.
Replication Apply Logic
src/spock_apply.c
Added explicit replication origin advancement for empty transactions via replorigin_session_advance() to ensure remote_lsn status updates.
Regression Tests
tests/regress/sql/sync_event.sql
Added comprehensive SQL test covering sync_event functionality, wait_for_sync_event with various parameter combinations, timeout behavior, and error conditions.

Poem

🐰 A rabbit hops through LSNs so bright,
Wait_for_sync_event guides replication's flight!
Empty transactions now don't lag behind,
With origin names and subscriptions aligned—
Hop on, dear spock, your journey's refined! 🌟

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the primary objective of switching wait_for_sync_event to use PostgreSQL's native pg_replication_origin_status instead of spock.progress.
Description check ✅ Passed The description clearly relates to the changeset, explaining the motivation for using pg_replication_origin_status and the specific fix needed for empty transactions.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch task/SPOC-393-sync-event

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
sql/spock--6.0.0-devel.sql (1)

444-508: Same timeout precision and NULL progress_lsn observations apply here as in the migration SQL.

This is the base install version of the same procedure. The recommended refactor to use clock_timestamp() for timeout tracking (instead of the incremental elapsed_time counter) applies here as well for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sql/spock--6.0.0-devel.sql` around lines 444 - 508, The wait loop in
procedure spock.wait_for_sync_event uses an incremental elapsed_time counter and
compares progress_lsn directly to lsn without handling NULL; change timeout
handling to use clock_timestamp() by recording a start_ts := clock_timestamp()
before the loop and replace elapsed_time checks with clock_timestamp() -
start_ts >= (timeout || ' seconds')::interval, and guard the progress_lsn vs lsn
comparison to handle NULL (e.g. only compare when progress_lsn IS NOT NULL,
otherwise treat as not reached). Update references in the procedure to use
start_ts, progress_lsn and timeout interval comparison and leave the pg_sleep
loop otherwise unchanged.
sql/spock--5.0.4--6.0.0-devel.sql (2)

236-238: No row in pg_replication_origin_status silently leaves progress_lsn as NULL.

If the apply worker hasn't started yet or the origin doesn't exist in pg_replication_origin_status, progress_lsn stays NULL and the loop spins until timeout. This is likely acceptable, but a debug RAISE NOTICE on the first miss (or a NOT FOUND check) would help users diagnose why wait_for_sync_event is timing out.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sql/spock--5.0.4--6.0.0-devel.sql` around lines 236 - 238, The SELECT INTO
from pg_replication_origin_status can leave progress_lsn NULL when no row
exists; update the wait_for_sync_event logic to detect that case by checking the
PL/pgSQL NOT FOUND condition (or testing progress_lsn IS NULL) immediately after
the SELECT and emit a single RAISE NOTICE on the first miss to aid debugging;
introduce a local boolean flag (e.g., first_miss) initialized true, RAISE NOTICE
including sub_slot and that pg_replication_origin_status had no row, then set
first_miss := false so subsequent loop iterations don’t spam notices, and
otherwise continue the existing timeout/loop behavior.

190-254: Timeout tracking uses loop-count-based estimation, not wall-clock time.

elapsed_time increments by 0.2 each iteration, but the actual time per iteration includes query execution overhead (the subscription check + the origin status query + ROLLBACK). Under load, the real elapsed time could significantly exceed the requested timeout. Consider using clock_timestamp() for accurate wall-clock tracking, similar to wait_for_apply_worker in the same extension.

♻️ Proposed fix for accurate timeout tracking
 DECLARE
 	target_id		oid;
-	elapsed_time	numeric := 0;
+	start_time		timestamptz := clock_timestamp();
 	progress_lsn	pg_lsn;
 	sub_is_enabled	bool;
 	sub_slot		name;
 ...
-		elapsed_time := elapsed_time + .2;
-		IF timeout <> 0 AND elapsed_time >= timeout THEN
+		IF timeout <> 0 AND EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= timeout THEN
 			result := false;
 			RETURN;
 		END IF;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sql/spock--5.0.4--6.0.0-devel.sql` around lines 190 - 254, The timeout logic
in spock.wait_for_sync_event is using a loop-count-based elapsed_time += 0.2
which underestimates real wall-clock time; change to record a start timestamp
(e.g., start_ts := clock_timestamp()) at the top of the procedure and replace
the elapsed_time numeric increments with a wall-clock check like extract(epoch
FROM (clock_timestamp() - start_ts)) to compare against timeout (seconds),
keeping the existing timeout semantics and the rest of the loop (subscription
checks, progress_lsn query, ROLLBACK, pg_sleep) unchanged; update or remove the
elapsed_time variable and ensure comparisons use the new timestamp-based elapsed
calculation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@sql/spock--5.0.4--6.0.0-devel.sql`:
- Around line 236-238: The SELECT INTO from pg_replication_origin_status can
leave progress_lsn NULL when no row exists; update the wait_for_sync_event logic
to detect that case by checking the PL/pgSQL NOT FOUND condition (or testing
progress_lsn IS NULL) immediately after the SELECT and emit a single RAISE
NOTICE on the first miss to aid debugging; introduce a local boolean flag (e.g.,
first_miss) initialized true, RAISE NOTICE including sub_slot and that
pg_replication_origin_status had no row, then set first_miss := false so
subsequent loop iterations don’t spam notices, and otherwise continue the
existing timeout/loop behavior.
- Around line 190-254: The timeout logic in spock.wait_for_sync_event is using a
loop-count-based elapsed_time += 0.2 which underestimates real wall-clock time;
change to record a start timestamp (e.g., start_ts := clock_timestamp()) at the
top of the procedure and replace the elapsed_time numeric increments with a
wall-clock check like extract(epoch FROM (clock_timestamp() - start_ts)) to
compare against timeout (seconds), keeping the existing timeout semantics and
the rest of the loop (subscription checks, progress_lsn query, ROLLBACK,
pg_sleep) unchanged; update or remove the elapsed_time variable and ensure
comparisons use the new timestamp-based elapsed calculation.

In `@sql/spock--6.0.0-devel.sql`:
- Around line 444-508: The wait loop in procedure spock.wait_for_sync_event uses
an incremental elapsed_time counter and compares progress_lsn directly to lsn
without handling NULL; change timeout handling to use clock_timestamp() by
recording a start_ts := clock_timestamp() before the loop and replace
elapsed_time checks with clock_timestamp() - start_ts >= (timeout || '
seconds')::interval, and guard the progress_lsn vs lsn comparison to handle NULL
(e.g. only compare when progress_lsn IS NOT NULL, otherwise treat as not
reached). Update references in the procedure to use start_ts, progress_lsn and
timeout interval comparison and leave the pg_sleep loop otherwise unchanged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant

Comments