Re: Improve pg_sync_replication_slots() to wait for primary to advance
shveta malik <shveta.malik@gmail.com>
From: shveta malik <shveta.malik@gmail.com>
To: Ajin Cherian <itsajin@gmail.com>
Cc: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>,
Amit Kapila <amit.kapila16@gmail.com>, PostgreSQL mailing lists <pgsql-hackers@postgresql.org>,
shveta malik <shveta.malik@gmail.com>
Date: 2025-08-29T05:01:39Z
Lists: pgsql-hackers
Commits
Same data as JSON:
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
Enhance slot synchronization API to respect promotion signal.
- 4bed04d39566 17.10 landed
- 94efd308bcec 18.4 landed
- 1362bc33e025 19 (unreleased) landed
-
Fix inconsistent elevel in pg_sync_replication_slots() retry logic.
- f1ddaa15357f 19 (unreleased) landed
-
Refactor slot synchronization logic in slotsync.c.
- 788ec96d591d 19 (unreleased) landed
-
Fix intermittent BF failure in 040_standby_failover_slots_sync.
- b47c50e5667b 19 (unreleased) landed
-
Add retry logic to pg_sync_replication_slots().
- 0d2d4a0ec3ec 19 (unreleased) landed
-
Fix LOCK_TIMEOUT handling in slotsync worker.
- 04396eacd3fa 19 (unreleased) cited
-
Add slotsync skip statistics.
- 76b78721ca49 19 (unreleased) cited
On Tue, Aug 26, 2025 at 9:58 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Fri, Aug 22, 2025 at 3:44 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Wed, Aug 20, 2025 at 10:53 AM Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > >
> > > I've removed them.
> > > Attaching patch v8 addressing the above comments.
> > >
> >
> > Thanks for the patch. Please find a few comments:
> >
> > 1)
> > When the API is in progress, and meanwhile in another session we turn
> > off hot_standby_feedback, the API session terminates abnormally.
> >
> > postgres=# SELECT pg_sync_replication_slots();
> > server closed the connection unexpectedly
> >
> > It seems slotsync_reread_config() is not adjusted for API. It does
> > proc_exit assuming it is a worker process.
> >
>
> I've removed the API calling ProcessSlotSyncInterrupts() as I don't
> think the API needs to specifically handle shutdown requests, it works
> without calling this. And the other config checks, I don't think the
> API needs to handle, I think we should leave it to the user.
>
> > 2)
> > slotsync_worker_onexit():
> >
> > SlotSyncCtx->slot_persistence_pending = false;
> >
> > /*
> > * If syncing_slots is true, it indicates that the process errored out
> > * without resetting the flag. So, we need to clean up shared memory and
> > * reset the flag here.
> > */
> > if (syncing_slots)
> > {
> > SlotSyncCtx->syncing = false;
> > syncing_slots = false;
> > }
> >
> > Shall we reset slot_persistence_pending inside 'if (syncing_slots)'?
> > slot_persistence_pending can not be true without syncing_slots being
> > true.
> >
> > 3)
> > reset_syncing_flag():
> >
> > SpinLockAcquire(&SlotSyncCtx->mutex);
> > SlotSyncCtx->syncing = false;
> > + SlotSyncCtx->slot_persistence_pending = false;
> > SpinLockRelease(&SlotSyncCtx->mutex);
> >
> > Here we are changing slot_persistence_pending under mutex, while at
> > other places, it is not protected by mutex. Is it intentional here?
> >
> > 4)
> > On rethinking, we maintain anything in shared memory if it has to be
> > shared between a few processes. 'slot_persistence_pending' OTOH is
> > required to be set and accessed by only one process at a time. Shall
> > we move it out of SlotSyncCtxStruct and keep it static similar to
> > 'syncing_slots'? Rest of the setting, resetting flow remains the same.
> > What do you think?
> >
>
> Yes, I agree. I have modified it accordingly.
>
> >
> > 5)
> > /* Build the query based on whether we're fetching all or refreshing
> > specific slots */
> >
> > Perhaps we can shift this comment to where we actually append
> > target_slot_list. Better to have it something like:
> > 'If target_slot_list is provided, construct the query only to fetch given slots'
> >
>
> Changed.
>
> Attaching patch v9 addressing the above comments.
>
Thank You for the patches. Please find a few comments.
1)
Change not needed:
* without waiting for SLOTSYNC_RESTART_INTERVAL_SEC.
+ *
*/
2)
Regarding the naptime in API, I was thinking why not to use
wait_for_slot_activity() directly? If there are no slots activity, it
will keep on doubling the naptime starting from 2sec till max of
30sec. Thoughts?
We can rename MIN_SLOTSYNC_WORKER_NAPTIME_MS and
MAX_SLOTSYNC_WORKER_NAPTIME_MS to MIN_SLOTSYNC_NAPTIME_MS and
MAX_SLOTSYNC_NAPTIME_MS in such a case.
3)
+ * target_slot_list - List of RemoteSlot structures to refresh, or NIL to
+ * fetch all failover slots
Can we please change it to:
List of failover logical slots to fetch from primary, or NIL to fetch
all failover logical slots
4)
In the worker, before each call to synchronize_slots(), we are
starting a new transaction. It aligns with the previous implementation
where StartTransaction was inside synchronize_slots(). But in API, we
are doing StartTransaction once outside of the loop instead of doing
before each synchronize_slots(), is it intentional? It may keep the
transaction open for a long duration for the case where slots are not
getting persisted soon.
5)
With ProcessSlotSyncInterrupts() being removed from API, can you
please check the behaviour of API on smart-shutdown and rest of the
shutdown modes? It should behave like other APIs. And what happens if
I change primary_conninfo to some non-existing server when the API is
running. Does it error out or keep retrying?
thanks
Shveta