Re: Improve pg_sync_replication_slots() to wait for primary to advance
shveta malik <shveta.malik@gmail.com>
From: shveta malik <shveta.malik@gmail.com>
To: Ajin Cherian <itsajin@gmail.com>
Cc: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>,
Japin Li <japinli@hotmail.com>, Ashutosh Sharma <ashu.coek88@gmail.com>,
Amit Kapila <amit.kapila16@gmail.com>, PostgreSQL mailing lists <pgsql-hackers@postgresql.org>,
shveta malik <shveta.malik@gmail.com>
Date: 2025-12-03T04:00:49Z
Lists: pgsql-hackers
Commits
Same data as JSON:
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
Enhance slot synchronization API to respect promotion signal.
- 4bed04d39566 17.10 landed
- 94efd308bcec 18.4 landed
- 1362bc33e025 19 (unreleased) landed
-
Fix inconsistent elevel in pg_sync_replication_slots() retry logic.
- f1ddaa15357f 19 (unreleased) landed
-
Refactor slot synchronization logic in slotsync.c.
- 788ec96d591d 19 (unreleased) landed
-
Fix intermittent BF failure in 040_standby_failover_slots_sync.
- b47c50e5667b 19 (unreleased) landed
-
Add retry logic to pg_sync_replication_slots().
- 0d2d4a0ec3ec 19 (unreleased) landed
-
Fix LOCK_TIMEOUT handling in slotsync worker.
- 04396eacd3fa 19 (unreleased) cited
-
Add slotsync skip statistics.
- 76b78721ca49 19 (unreleased) cited
On Wed, Dec 3, 2025 at 8:51 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Tue, Dec 2, 2025 at 8:35 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Tue, Dec 2, 2025 at 1:58 PM Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > >
> > > Attached patch v27 addresses the above comments.
> > >
> >
> > Thanks for the patch. Please find a few comments:
> >
> > 1)
> > + /* The worker pid must not be already assigned in SlotSyncCtx */
> > + Assert(SlotSyncCtx->pid == InvalidPid);
> > +
> >
> > We can mention just 'pid' here instead of 'worker pid'
> >
>
> Changed.
>
> > 2)
> > + /*
> > + * The syscache access in fetch_or_refresh_remote_slots() needs a
> > + * transaction env.
> > + */
> > fetch_or_refresh_remote_slots --> fetch_remote_slots
> >
> > 3)
> > SyncReplicationSlots(WalReceiverConn *wrconn)
> > {
> > +
> >
> > We can get rid of this blank line at the start of the function.
> >
>
> Fixed.
>
> > 4)
> > /*
> > * Shut down the slot sync worker.
> > *
> > - * This function sends signal to shutdown slot sync worker, if required. It
> > + * This function sends signal to shutdown the slot sync process, if
> > required. It
> > * also waits till the slot sync worker has exited or
> > * pg_sync_replication_slots() has finished.
> > */
> > Shall we change comment to something like (rephrase if required):
> >
> > Shut down the slot synchronization.
> > This function wakes up the slot sync process (either worker or backend
> > running SQL function) and sets stopSignaled=true
> > so that worker can exit or SQL function pg_sync_replication_slots()
> > can finish. It also waits till the slot sync worker has exited or
> > pg_sync_replication_slots() has finished.
> >
>
> Changed.
>
> > 5)
> > We should change the comment atop 'SlotSyncCtxStruct' as well to
> > mention that this pid is either the slot sync worker's pid or
> > backend's pid running the SQL function. It is needed by the startup
> > process to wake these up, so that they can stop synchronization on
> > seeing stopSignaled. <please rephrase as needed>
> >
>
> Changed.
>
> > 6)
> > + ereport(LOG,
> > + errmsg("replication slot synchronization worker is shutting down on
> > receiving SIGUSR1"));
> >
> > SIGUSR1 was actually just a wake-up signal. We may change the comment to:
> > replication slot synchronization worker is shutting down as promotion
> > is triggered.
> >
>
> Changed.
>
> > 7)
> > update_synced_slots_inactive_since:
> > /* The slot sync worker or SQL function mustn't be running by now */
> > - Assert((SlotSyncCtx->pid == InvalidPid) && !SlotSyncCtx->syncing);
> > + Assert(!SlotSyncCtx->syncing);
> >
> > Regarding this, I see that 'update_synced_slots_inactive_since' is
> > only called when we are sure that 'syncing' is false. So shouldn't pid
> > be also Invalid by that time? Even if it was backend's pid to start
> > with, but since backend has stopped syncing (finished or error-ed
> > out),
> > pid should be reset to Invalid in such a case. And this Assert need
> > not to be changed.
> >
>
> Fixed.
>
> > 8)
> >
> > + if (sync_process_pid != InvalidPid)
> > + kill(sync_process_pid, SIGUSR1);
> >
> > We can write comments to say wake-up slot sync process.
> >
>
> Added comments.
>
> Attaching patch v28 addressing these comments.
>
Thanks for the patch. A few trivial comments:
1)
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot continue replication slots synchronization"
+ " as standby promotion is triggered"));
slots->slot, since in every error-message we use 'replication slot
synchronization'
2)
+ * The pid is either the slot sync worker's pid or the backend's pid running
+ * the SQL function pg_sync_replication_slots(). It is needed by the startup
+ * process to wake these up, so that they can stop synchronization on seeing
+ * stopSignaled on promotion.
+ * Setting stopSignaled is also used to handle the race condition when the
Can we rephrase slightly to indicate clearly that it is the startup
process which sets 'stopSignaled' during promotion. Suggestion:
The pid is either the slot sync worker’s pid or the backend’s pid running
the SQL function pg_sync_replication_slots(). When the startup process
sets stopSignaled during promotion, it uses this pid to wake the
currently synchronizing process so that the process can immediately stop its
synchronization work upon seeing stopSignaled set to true.
Setting stopSignaled....
thanks
Shveta