Re: Improve pg_sync_replication_slots() to wait for primary to advance

Amit Kapila <amit.kapila16@gmail.com>

From: Amit Kapila <amit.kapila16@gmail.com>
To: shveta malik <shveta.malik@gmail.com>
Cc: Ajin Cherian <itsajin@gmail.com>, PostgreSQL mailing lists <pgsql-hackers@postgresql.org>
Date: 2025-08-04T06:00:49Z
Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Enhance slot synchronization API to respect promotion signal.

  2. Fix inconsistent elevel in pg_sync_replication_slots() retry logic.

  3. Refactor slot synchronization logic in slotsync.c.

  4. Fix intermittent BF failure in 040_standby_failover_slots_sync.

  5. Add retry logic to pg_sync_replication_slots().

  6. Fix LOCK_TIMEOUT handling in slotsync worker.

  7. Add slotsync skip statistics.

On Fri, Aug 1, 2025 at 2:50 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> 5)
> I tried a test where there were 4 slots on the publisher, where one
> was getting used while the others were not. Initiated
> pg_sync_replication_slots on standby. Forced unused slots to be
> invalidated by setting idle_replication_slot_timeout=60 on primary,
> due to which API finished but gave a warning:
>
> postgres=# SELECT pg_sync_replication_slots();
> WARNING:  aborting initial sync for slot "failover_slot"
> DETAIL:  This slot was invalidated on the primary server.
> WARNING:  aborting initial sync for slot "failover_slot2"
> DETAIL:  This slot was invalidated on the primary server.
> WARNING:  aborting initial sync for slot "failover_slot3"
> DETAIL:  This slot was invalidated on the primary server.
>  pg_sync_replication_slots
> ---------------------------
>
> (1 row)
>
> Do we need these warnings here? I think we can have it as a LOG rather
> than having it on console. Thoughts?
>

What is the behaviour of a slotsync worker in the same case? I don't
see any such WARNING messages in the code of slotsync worker, so why
do we want a different behaviour here?

Few other comments:
======================
1.
update_and_persist_local_synced_slot()
{
...
+ /*
+ * For SQL API synchronization, we wait for the remote slot to catch up
+ * here, since we can't assume the SQL API will be called again soon.
+ * We will retry the sync once the slot catches up.
+ *
+ * Note: This will return false if a promotion is triggered on the
+ * standby while waiting, in which case we stop syncing and drop the
+ * temporary slot.
+ */
+ if (!wait_for_primary_slot_catchup(wrconn, remote_slot))
+ return false;

Why is the wait added at this level? Shouldn't it be at API level aka
in SyncReplicationSlots() or pg_sync_replication_slots() similar to
what we do in ReplSlotSyncWorkerMain() for slotsync workers?

2.
 REPLICATION_SLOTSYNC_MAIN "Waiting in main loop of slot sync worker."
...
+REPLICATION_SLOTSYNC_PRIMARY_CATCHUP "Waiting for the primary to catch-up."

Can't we reuse existing waitevent REPLICATION_SLOTSYNC_MAIN? We may
want to change the description. Is there a specific reason to add a
new wait_event for this API?

-- 
With Regards,
Amit Kapila.