Re: Improve pg_sync_replication_slots() to wait for primary to advance

From: shveta malik <shveta.malik@gmail.com>

To: Ajin Cherian <itsajin@gmail.com>

Cc: Amit Kapila <amit.kapila16@gmail.com>, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>, Japin Li <japinli@hotmail.com>, Ashutosh Sharma <ashu.coek88@gmail.com>, PostgreSQL mailing lists <pgsql-hackers@postgresql.org>, shveta malik <shveta.malik@gmail.com>

Date: 2025-12-04T06:03:50Z

Lists: pgsql-hackers

Commits

Enhance slot synchronization API to respect promotion signal.
- 4bed04d39566 17.10 landed
- 94efd308bcec 18.4 landed
- 1362bc33e025 19 (unreleased) landed
Fix inconsistent elevel in pg_sync_replication_slots() retry logic.
- f1ddaa15357f 19 (unreleased) landed
Refactor slot synchronization logic in slotsync.c.
- 788ec96d591d 19 (unreleased) landed
Fix intermittent BF failure in 040_standby_failover_slots_sync.
- b47c50e5667b 19 (unreleased) landed
Add retry logic to pg_sync_replication_slots().
- 0d2d4a0ec3ec 19 (unreleased) landed
Fix LOCK_TIMEOUT handling in slotsync worker.
- 04396eacd3fa 19 (unreleased) cited
Add slotsync skip statistics.
- 76b78721ca49 19 (unreleased) cited

On Thu, Dec 4, 2025 at 10:51 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Dec 3, 2025 at 10:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Dec 3, 2025 at 8:51 AM Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > > Attaching patch v28 addressing these comments.
> > >
> >
> > Can we extract the part of the patch that handles SIGUSR1 signal
> > separately as a first patch and the remaining as a second patch?
> > Please do mention the reason in the commit message as to why we are
> > changing the signal for SIGINT to SIGUSR1.
> >
>
> I have extracted out the SIGUSR1 signal handling changes separately
> into a patch and sharing. I will share the next patch later.
> Let me know if there are any comments for this patch.
>

I have just 2 trivial comments for v29-001:

1)
-   * receives a SIGINT from the startup process, or when there is an error.
+   * receives a SIGUSR1 from the startup process, or when there is an error.

In the above comment, we should mention stopSignaled rather than
SIGUSR1, as SIGUSR1 is just a wakeup signal and not a termination signal.

2)
+    else
+      ereport(ERROR,
+          errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+          errmsg("cannot continue replication slot synchronization"
+               " as standby promotion is triggered"));

Please mention in the comment for the else-block that this path is
for the SQL function.

~~

I tested the touched scenarios and here are the LOGs:

a)
When promotion is in progress and the startup process has terminated
the slotsync worker, the postmaster may start the worker again if it
has not yet noticed that. In that scenario, we get these logs:

11:03:19.712 IST [151559] LOG:  replication slot synchronization
worker is shutting down as promotion is triggered
11:03:19.726 IST [151629] LOG:  slot sync worker started
11:03:19.795 IST [151629] LOG:  replication slot synchronization
worker is shutting down as promotion is triggered

b)
On promotion, the API gets this error (now originating from
ProcessSlotSyncInterrupts()):
postgres=# SELECT pg_sync_replication_slots();
ERROR:  cannot continue replication slot synchronization as standby
promotion is triggered

c)
If any parameter is changed between the ValidateSlotSyncParams() and
ProcessSlotSyncInterrupts() calls in the API, we get this:
postgres=# SELECT pg_sync_replication_slots();
ERROR:  replication slot synchronization will stop because of a parameter change

--on re-run (originating from ValidateSlotSyncParams())
postgres=# SELECT pg_sync_replication_slots();
ERROR:  replication slot synchronization requires
"hot_standby_feedback" to be enabled

~~

The tested scenarios' behaviour looks good to me.

thanks
Shveta