Thread
-
Re: Implement waiting for wal lsn replay: reloaded
Alexander Lakhin <exclusion@gmail.com> — 2026-05-19T20:00:00Z
Hello Alexander and Xuneng, 06.04.2026 22:49, Alexander Korotkov wrote: > Thank you, I've pushed your version of patchset. I made two minor > corrections for patch #2: mention default mode value in the header > comment, and fallback to polling on has_wal_read_bug sparc64+ext4 bug. I discovered a new test failure, that is apparently caused by new wait_for_catchup() implementation [1]: [06:20:23.110](1.069s) not ok 8 - check that the slot state changes to "extended" [06:20:23.110](0.001s) # Failed test 'check that the slot state changes to "extended"' # at /Users/ec2-user/bf/goldfish/HEAD/pgsql/src/test/recovery/t/019_replslot_limit.pl line 140. [06:20:23.111](0.000s) # got: 'unreserved' # expected: 'extended' [06:20:23.231](0.120s) not ok 9 - check that the slot state changes to "unreserved" [06:20:23.231](0.000s) # Failed test 'check that the slot state changes to "unreserved"' # at /Users/ec2-user/bf/goldfish/HEAD/pgsql/src/test/recovery/t/019_replslot_limit.pl line 152. [06:20:23.231](0.000s) # got: 'lost|' # expected: 'unreserved|t' I've managed to reproduce such failures with: diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 07eac07b9ce..493ce92674e 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -1143,2 +1143,3 @@ XLogWalRcvSendReply(bool force, bool requestReply, bool checkApply) +pg_usleep(10000); /* Get current timestamp. */ diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 04aa770d981..19cda3a6b51 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -2521,2 +2521,3 @@ ProcessStandbyReplyMessage(void) +pg_usleep(100000); /* the caller already consumed the msgtype byte */ Concretely, a loop: for i in {1..100}; do echo "ITERATION $i"; PROVE_TESTS="t/019*" make -s check -C src/test/recovery/ || break; done failed for me on iterations 2, 1, 7: ITERATION 7 # +++ tap check in src/test/recovery +++ t/019_replslot_limit.pl .. 8/? # Failed test 'check that the slot state changes to "extended"' # at t/019_replslot_limit.pl line 140. # got: 'unreserved' # expected: 'extended' t/019_replslot_limit.pl .. 26/? # Looks like you failed 1 test of 26. t/019_replslot_limit.pl .. Dubious, test returned 1 (wstat 256, 0x100) Failed 1/26 subtests With "WAIT FOR LSN" in wait_for_catchup() disabled, 100 iterations passed. Having extra logging added, I could see the key difference. Failed run: 2026-05-19 22:01:37.968 EEST client backend[3632148] 019_replslot_limit.pl LOG: !!!GetWALAvailability| targetLSN: 0/016C0000, targetSeg: 22, oldestSlotSeg: 23, oldestSegMaxWalSize: 24, oldestSeg: 22 2026-05-19 22:01:37.968 EEST client backend[3632148] 019_replslot_limit.pl STATEMENT: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep1' vs Successful run: 2026-05-19 22:04:18.102 EEST client backend[3633761] 019_replslot_limit.pl LOG: !!!GetWALAvailability| targetLSN: 0/01700000, targetSeg: 23, oldestSlotSeg: 23, oldestSegMaxWalSize: 24, oldestSeg: 23 2026-05-19 22:04:18.102 EEST client backend[3633761] 019_replslot_limit.pl STATEMENT: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep1' That is, with WAIT FOR LSN, primary in this test may advance slot->data.restart_lsn to the expected position after wait_for_catchup() returns. [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=goldfish&dt=2026-05-13%2006%3A15%3A03 Best regards, Alexander