Re: Implement waiting for wal lsn replay: reloaded
From: Xuneng Zhou <xunengzhou@gmail.com>
To: Alexander Korotkov <aekorotkov@gmail.com>
Cc: Andres Freund <andres@anarazel.de>, Tom Lane <tgl@sss.pgh.pa.us>, Heikki Linnakangas <hlinnaka@iki.fi>, Peter Eisentraut <peter@eisentraut.org>, Thomas Munro <thomas.munro@gmail.com>, Álvaro Herrera <alvherre@kurilemu.de>, Chao Li <li.evan.chao@gmail.com>, pgsql-hackers <pgsql-hackers@lists.postgresql.org>, Michael Paquier <michael@paquier.xyz>, jian he <jian.universality@gmail.com>, Tomas Vondra <tomas@vondra.me>, Yura Sokolov <y.sokolov@postgrespro.ru>
Date: 2026-05-01T02:44:00Z
Lists: pgsql-hackers
Attachments
- v8-0004-Use-replay-position-as-floor-for-WAIT-FOR-LSN-sta.patch (application/octet-stream)
- v8-0007-Document-that-WAIT-FOR-LSN-is-timeline-blind.patch (application/octet-stream)
- v8-0003-Remove-redundant-WAIT-FOR-LSN-caller-side-pre-che.patch (application/octet-stream)
- v8-0006-Improve-WAIT-FOR-LSN-test-coverage.patch (application/octet-stream)
- v8-0002-Fix-memory-ordering-in-WAIT-FOR-LSN-wakeup-mechan.patch (application/octet-stream)
- v8-0001-Use-barrier-semantics-when-reading-writing-writte.patch (application/octet-stream)
- v8-0005-Wake-standby_write-standby_flush-waiters-from-the.patch (application/octet-stream)
Hi Alexander,

On Wed, Apr 29, 2026 at 5:01 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Tue, Apr 21, 2026 at 7:03 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > On Tue, Apr 21, 2026 at 2:46 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > >
> > > The updated patchset is attached. It includes improved coverage as
> > > suggested by Andres upthread. And documentation that WAIT FOR LSN is
> > > timeline-blind (per off-list discussion with Xuneng).
> >
> > I revised the test patch 6 to make the new cases check the intended
> > WAIT FOR behavior more directly, and to avoid cases where the test
> > could pass for the wrong reason.
> >
> > The fresh walreceiver restart test now distinguishes what we can
> > observe from what is only covered indirectly.
> > 'pg_last_wal_receive_lsn()' reports 'flushedUpto', not 'writtenUpto',
> > so the test now describes that state accurately and covers
> > 'writtenUpto' through the 'standby_write' result. This seems
> > appropriate to me since the two positions are seeded in the same
> > places and conditions. Test for flush lsn should also help verify
> > write lsn.
> >
> > The fencepost tests were split by the actual frontier being tested.
> > 'standby_replay' uses 'pg_last_wal_replay_lsn()', while
> > 'standby_flush' uses 'pg_last_wal_receive_lsn()'. This avoids treating
> > a replay-derived LSN as if it were also the exact write/flush
> > boundary. I left 'standby_write' out of the exact fencepost helper
> > because its frontier is not SQL-visible once walreceiver is stopped.
> > The async wakeup case now starts the waiter while replay is still
> > paused, so it must actually sleep before replay and walreceiver are
> > allowed to advance.
> >
> > The cascading timeline-switch test now checks the 'WAIT FOR ...
> > NO_THROW' status from background psql stdout. The previous log-marker
> > pattern could pass after an unexpected returned status, including
> > 'timeout', because the following statement would still run. The
> > 'received_tli > 1' check remains, but only as confirmation that the
> > downstream followed the new timeline; the 'success' status proves the
> > wait completed as intended.
> >
> > Please check it.
>
> LGTM, I've added some comments for new functions in 0006. I propose
> to push this patchset. Probably something is still missing and we
> will have to go back to this. But it seems to make a lot of aspects
> much better.
>

I reviewed the patchset and found a potential issue in the test for
patch 5, similar to the log-checking problem in the cascading
timeline-switch test. I've applied a minor fix to address it. Other
parts LGTM.

Best,
Xuneng
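
For readers following the thread, the log-marker problem mentioned above can be shown with a toy model. This is only a minimal Python sketch, not the actual TAP test code: the function names and the marker text are hypothetical, and the session is simulated rather than run against a server. It assumes only what the thread states, namely that the wait reports a status such as 'success' or 'timeout' on stdout and that the following statement runs either way.

```python
def run_session(wait_status):
    """Simulate a background psql session (hypothetical model).

    The wait prints its status on stdout. The next statement then runs
    regardless of that status and leaves a marker in the server log.
    """
    stdout = [wait_status]
    log = ["marker: statement after wait executed"]  # hypothetical marker text
    return stdout, log


def log_marker_check(log):
    """Old-style check: passes as soon as the marker shows up in the log."""
    return any("statement after wait executed" in line for line in log)


def status_check(stdout):
    """Status-based check: passes only if the wait itself reported success."""
    return bool(stdout) and stdout[0] == "success"


if __name__ == "__main__":
    for status in ("success", "timeout"):
        out, log = run_session(status)
        print(status, "marker check:", log_marker_check(log),
              "status check:", status_check(out))
```

With a 'timeout' status the marker check still passes, because the marker statement ran anyway, while the status check fails. That is the "pass for the wrong reason" case the stdout-based check is meant to rule out.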