Re: Improve pg_sync_replication_slots() to wait for primary to advance

shveta malik <shveta.malik@gmail.com>

From: shveta malik <shveta.malik@gmail.com>
To: Ajin Cherian <itsajin@gmail.com>
Cc: Ashutosh Sharma <ashu.coek88@gmail.com>, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>, Amit Kapila <amit.kapila16@gmail.com>, PostgreSQL mailing lists <pgsql-hackers@postgresql.org>, shveta malik <shveta.malik@gmail.com>
Date: 2025-09-10T04:45:31Z
Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Enhance slot synchronization API to respect promotion signal.

  2. Fix inconsistent elevel in pg_sync_replication_slots() retry logic.

  3. Refactor slot synchronization logic in slotsync.c.

  4. Fix intermittent BF failure in 040_standby_failover_slots_sync.

  5. Add retry logic to pg_sync_replication_slots().

  6. Fix LOCK_TIMEOUT handling in slotsync worker.

  7. Add slotsync skip statistics.

On Tue, Sep 9, 2025 at 5:37 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Attached v11 patch addressing the above comments.
>

Please find a few comments:

1)

+ Retry is done after 2
+ * sec wait. Exits early if promotion is triggered or certain critical

We can say: Retry is done after SLOTSYNC_API_NAPTIME_MS wait.

2)
+ /*
+ * Fetch remote slot info for the given slot_names. If slot_names is NIL,
+ * fetch all failover-enabled slots. Note that we reuse slot_names from
+ * the previous iteration; re-fetching all failover slots each time could
+ * cause an endless loop.
+ */

a)
the previous iteration --> the first iteration.

b) Also we can mention the reason why we take names from first
iteration instead of going for pending ones alone, something like:

Instead of reprocessing only the pending slots in each iteration, it's
better to process all the slots received in the first iteration.
This ensures that by the time we're done, all slots reflect the latest values.

3)
+ remote_slots = fetch_remote_slots(wrconn, slot_names);
+
+
+ /* Attempt to synchronize slots */
+ synchronize_slots(wrconn, remote_slots, &slot_persistence_pending);

One extra blank line can be removed

4)

+ /* Clean up slot_names if allocated in TopMemoryContext */
+ if (slot_names)
+ list_free_deep(slot_names);

Can we please move it before 'ReplicationSlotCleanup'.

5)
In case of error in subsequent iteration, slot_names allocated from
TopMemoryContext will be left unfreed?

6)
+ ListCell   *lc;
+ bool first_slot = true;

Shall we move these two to concerned if-block:
if (slot_names != NIL)

7)
 * The slot_persistence_pending flag is used by the pg_sync_replication_slots
 * API to track if any slots could not be persisted and need to be retried.

a) Instead of mentioning only about slot_persistence_pending argument
in concerned function's header, we shall define all the arguments.

b) We can remove the 'flag' term from the comments as it is a
function-argument now.

8)
I think we should add briefly in the header of the file about the new
behaviour of API.


thanks
Shveta