Thread

  1. Re: Implement waiting for wal lsn replay: reloaded

    Xuneng Zhou <xunengzhou@gmail.com> — 2026-05-01T02:44:00Z

    Hi Alexander,
    
    On Wed, Apr 29, 2026 at 5:01 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
    
    > On Tue, Apr 21, 2026 at 7:03 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
    > >
    > > On Tue, Apr 21, 2026 at 2:46 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
    > >
    > > > The updated patchset is attached.  It includes improved coverage, as
    > > > suggested by Andres upthread, and documentation that WAIT FOR LSN is
    > > > timeline-blind (per off-list discussion with Xuneng).
    > >
    > > I revised the test patch 6 to make the new cases check the intended
    > > WAIT FOR behavior more directly, and to avoid cases where the test
    > > could pass for the wrong reason.
    > >
    > > The fresh walreceiver restart test now distinguishes what we can
    > > observe from what is only covered indirectly.
    > > 'pg_last_wal_receive_lsn()' reports 'flushedUpto', not 'writtenUpto',
    > > so the test now describes that state accurately and covers
    > > 'writtenUpto' through the 'standby_write' result. This seems
    > > appropriate to me since the two positions are seeded in the same
    > > places and under the same conditions. A test for the flush LSN should
    > > therefore also help verify the write LSN.
    > >
    > > The fencepost tests were split by the actual frontier being tested.
    > > 'standby_replay' uses 'pg_last_wal_replay_lsn()', while
    > > 'standby_flush' uses 'pg_last_wal_receive_lsn()'. This avoids treating
    > > a replay-derived LSN as if it were also the exact write/flush
    > > boundary. I left 'standby_write' out of the exact fencepost helper
    > > because its frontier is not SQL-visible once walreceiver is stopped.
    > > The async wakeup case now starts the waiter while replay is still
    > > paused, so it must actually sleep before replay and walreceiver are
    > > allowed to advance.
    > >
    > > The cascading timeline-switch test now checks the 'WAIT FOR ...
    > > NO_THROW' status from background psql stdout. The previous log-marker
    > > pattern could pass after an unexpected return status, including
    > > 'timeout', because the following statement would still run. The
    > > 'received_tli > 1' check remains, but only as confirmation that the
    > > downstream followed the new timeline; the 'success' status proves the
    > > wait completed as intended.
    > >
    > > Please check it.
    >
    > LGTM, I've added some comments for new functions in 0006.  I propose
    > to push this patchset.  Probably something is still missing and we
    > will have to go back to this.  But it seems to make a lot of aspects
    > much better.
    >
    
    I reviewed the patchset and found a potential issue in the test for patch
    5, similar to the log-checking problem in the cascading timeline-switch
    test. I've applied a minor fix to address it. Other parts LGTM.
    
    Best,
    Xuneng