Re: Improve pg_sync_replication_slots() to wait for primary to advance

Ajin Cherian <itsajin@gmail.com>

From: Ajin Cherian <itsajin@gmail.com>
To: shveta malik <shveta.malik@gmail.com>
Cc: Ashutosh Sharma <ashu.coek88@gmail.com>, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>, Amit Kapila <amit.kapila16@gmail.com>, PostgreSQL mailing lists <pgsql-hackers@postgresql.org>
Date: 2025-09-15T12:47:43Z
Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Enhance slot synchronization API to respect promotion signal.

  2. Fix inconsistent elevel in pg_sync_replication_slots() retry logic.

  3. Refactor slot synchronization logic in slotsync.c.

  4. Fix intermittent BF failure in 040_standby_failover_slots_sync.

  5. Add retry logic to pg_sync_replication_slots().

  6. Fix LOCK_TIMEOUT handling in slotsync worker.

  7. Add slotsync skip statistics.

Attachments

On Wed, Sep 10, 2025 at 2:45 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Tue, Sep 9, 2025 at 5:37 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > Attached v11 patch addressing the above comments.
> >
>
> Please find a few comments:
>
> 1)
>
> + Retry is done after 2
> + * sec wait. Exits early if promotion is triggered or certain critical
>
> We can say: Retry is done after SLOTSYNC_API_NAPTIME_MS wait.
>

Changed.

> 2)
> + /*
> + * Fetch remote slot info for the given slot_names. If slot_names is NIL,
> + * fetch all failover-enabled slots. Note that we reuse slot_names from
> + * the previous iteration; re-fetching all failover slots each time could
> + * cause an endless loop.
> + */
>
> a)
> the previous iteration --> the first iteration.
>
> b) Also we can mention the reason why we take names from first
> iteration instead of going for pending ones alone, something like:
>
> Instead of reprocessing only the pending slots in each iteration, it's
> better to process all the slots received in the first iteration.
> This ensures that by the time we're done, all slots reflect the latest values.
>
> 3)
> + remote_slots = fetch_remote_slots(wrconn, slot_names);
> +
> +
> + /* Attempt to synchronize slots */
> + synchronize_slots(wrconn, remote_slots, &slot_persistence_pending);
>
> One extra blank line can be removed
>

Fixed.

> 4)
>
> + /* Clean up slot_names if allocated in TopMemoryContext */
> + if (slot_names)
> + list_free_deep(slot_names);
>
> Can we please move it before 'ReplicationSlotCleanup'.
>

Fixed.

> 5)
> In case of error in subsequent iteration, slot_names allocated from
> TopMemoryContext will be left unfreed?
>

I've changed the logic so that even on error, slot_names are freed.

> 6)
> + ListCell   *lc;
> + bool first_slot = true;
>
> Shall we move these two to concerned if-block:
> if (slot_names != NIL)
>

Changed.

> 7)
>  * The slot_persistence_pending flag is used by the pg_sync_replication_slots
>  * API to track if any slots could not be persisted and need to be retried.
>
> a) Instead of mentioning only about slot_persistence_pending argument
> in concerned function's header, we shall define all the arguments.
>
> b) We can remove the 'flag' term from the comments as it is a
> function-argument now.
>

Changed.

> 8)
> I think we should add briefly in the header of the file about the new
> behaviour of API.
>

Added.

Attaching patch v12 addressing these comments.

regards,
Ajin Cherian
Fujitsu Australia