Re: Improve pg_sync_replication_slots() to wait for primary to advance
From: shveta malik <shveta.malik@gmail.com>
To: Ajin Cherian <itsajin@gmail.com>
Cc: PostgreSQL mailing lists <pgsql-hackers@postgresql.org>,
shveta malik <shveta.malik@gmail.com>
Date: 2025-08-01T06:32:06Z
Lists: pgsql-hackers
Commits
- Enhance slot synchronization API to respect promotion signal.
  - 4bed04d39566 17.10 landed
  - 94efd308bcec 18.4 landed
  - 1362bc33e025 19 (unreleased) landed
- Fix inconsistent elevel in pg_sync_replication_slots() retry logic.
  - f1ddaa15357f 19 (unreleased) landed
- Refactor slot synchronization logic in slotsync.c.
  - 788ec96d591d 19 (unreleased) landed
- Fix intermittent BF failure in 040_standby_failover_slots_sync.
  - b47c50e5667b 19 (unreleased) landed
- Add retry logic to pg_sync_replication_slots().
  - 0d2d4a0ec3ec 19 (unreleased) landed
- Fix LOCK_TIMEOUT handling in slotsync worker.
  - 04396eacd3fa 19 (unreleased) cited
- Add slotsync skip statistics.
  - 76b78721ca49 19 (unreleased) cited
On Thu, Jul 31, 2025 at 3:11 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
>
> Patch v3 attached.
>
Thanks for the patch. I tested it; please find a few comments below:
1)
It hits an assert
(slotsync_reread_config() --> Assert(sync_replication_slots)) when the
API is trying to sync and is in the wait loop, while in another session
I enable sync_replication_slots using:
ALTER SYSTEM SET sync_replication_slots = 'on';
SELECT pg_reload_conf();
Assert:
2025-08-01 10:55:43.637 IST [118576] STATEMENT: SELECT
pg_sync_replication_slots();
2025-08-01 10:55:51.730 IST [118563] LOG: received SIGHUP, reloading
configuration files
2025-08-01 10:55:51.731 IST [118563] LOG: parameter
"sync_replication_slots" changed to "on"
TRAP: failed Assert("sync_replication_slots"), File: "slotsync.c",
Line: 1334, PID: 118576
postgres: shveta postgres [local]
SELECT(ExceptionalCondition+0xbb)[0x61df0160e090]
postgres: shveta postgres [local] SELECT(+0x6520dc)[0x61df0133a0dc]
2025-08-01 10:55:51.739 IST [118666] ERROR: cannot synchronize
replication slots concurrently
postgres: shveta postgres [local] SELECT(+0x6522b2)[0x61df0133a2b2]
postgres: shveta postgres [local] SELECT(+0x650664)[0x61df01338664]
postgres: shveta postgres [local] SELECT(+0x650cf8)[0x61df01338cf8]
postgres: shveta postgres [local] SELECT(+0x6513ea)[0x61df013393ea]
postgres: shveta postgres [local] SELECT(+0x6519df)[0x61df013399df]
postgres: shveta postgres [local]
SELECT(SyncReplicationSlots+0xbb)[0x61df0133af60]
postgres: shveta postgres [local]
SELECT(pg_sync_replication_slots+0x1b1)[0x61df01357e52]
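One possible reading of the trace above, sketched as a standalone toy model rather than actual slotsync.c code: the assertion presumably assumes the caller is the slotsync worker, which only runs while sync_replication_slots is on, whereas the API can reach the same reload path in a backend where the setting is still off. The enum, the function name, and the `guard_by_caller` flag below are all hypothetical and exist only to illustrate that assumption.

```c
#include <stdbool.h>

/* Hypothetical caller kinds; these names are not from slotsync.c. */
typedef enum
{
    CALLER_SLOTSYNC_WORKER,   /* background worker, runs only while the GUC is on */
    CALLER_SQL_FUNCTION       /* pg_sync_replication_slots(), GUC may be on or off */
} SyncCaller;

/*
 * Toy model of the check made when the configuration is reread.
 * Returns true when the invariant "sync_replication_slots is enabled"
 * holds or is not applied, and false when an Assert on it would trap.
 */
static bool
reread_invariant_ok(SyncCaller caller, bool guc_enabled, bool guard_by_caller)
{
    /* Assumed variant: only the worker is required to have the GUC on. */
    if (guard_by_caller && caller != CALLER_SLOTSYNC_WORKER)
        return true;

    /* Unguarded variant: the invariant is checked for every caller. */
    return guc_enabled;
}
```

In this model, the unguarded check fails for the SQL function whenever the setting is off in that backend, which matches the scenario described above; whether guarding by caller is the right fix for the patch is left open.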
2)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot synchronize replication slots when"
+ " standby promotion is ongoing")));
I think a better error message would be:
"exiting from slot synchronization as promotion is triggered"
This would also read better in the log file when it follows wait
statements such as:
LOG: continuing to wait for remote slot "failover_slot" LSN
(0/3000060) and catalog xmin (755) to pass local slot LSN (0/3000060)
and catalog xmin (757)
STATEMENT: SELECT pg_sync_replication_slots();
3)
The API dumps this when it is waiting for the primary:
----
LOG: could not synchronize replication slot "failover_slot2"
DETAIL: Synchronization could lead to data loss, because the remote
slot needs WAL at LSN 0/03066E70 and catalog xmin 755, but the standby
has LSN 0/03066E70 and catalog xmin 770.
STATEMENT: SELECT pg_sync_replication_slots();
LOG: waiting for remote slot "failover_slot2" LSN (0/3066E70) and
catalog xmin (755) to pass local slot LSN (0/3066E70) and catalog xmin
(770)
STATEMENT: SELECT pg_sync_replication_slots();
LOG: continuing to wait for remote slot "failover_slot2" LSN
(0/3066E70) and catalog xmin (755) to pass local slot LSN (0/3066E70)
and catalog xmin (770)
STATEMENT: SELECT pg_sync_replication_slots();
----
I am unsure whether we should still dump 'could not synchronize..' when
the API is going to retry until it succeeds. The log in question gives
the impression that we are done trying and could not synchronize the
slot. What do you think?
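As a rough illustration of the question in point 3, here is a standalone sketch (not code from the patch) that picks the message according to whether the caller will retry. The function name, the `will_retry` flag, and the exact wording of both messages are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: writes the message to report for a slot that
 * cannot be synchronized yet.  When the caller is going to retry, it
 * reports a wait instead of a failure.  Returns snprintf()'s result.
 */
static int
format_skip_message(char *buf, size_t buflen, const char *slot_name,
                    bool will_retry)
{
    assert(buf != NULL && buflen > 0 && slot_name != NULL);

    if (will_retry)
        return snprintf(buf, buflen,
                        "waiting for remote slot \"%s\" to catch up before synchronizing",
                        slot_name);

    return snprintf(buf, buflen,
                    "could not synchronize replication slot \"%s\"",
                    slot_name);
}
```

With a split like this, the 'could not synchronize' wording would appear only on paths that give up, while the retrying API path would log only the wait.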
thanks
Shveta