
  1. Re: BUG #19490: Streaming standby on 16.14 stops applying WAL on MultiXactOffsetSLRU when primary is 16.8

    Michael Paquier <michael@paquier.xyz> — 2026-05-26T08:02:27Z

    On Fri, May 22, 2026 at 10:21:32PM +0530, Ayush Tiwari wrote:
    > I think the right fix is to remove that SimpleLruWriteAll() call while
    > keeping the missing-page initialization logic.  The flush is only meant to
    > make SimpleLruDoesPhysicalPageExist() see pages that exist in SLRU buffers
    > but have not reached disk.  In this fallback path, I don't see a way for
    > the tested next_pageno to be in that state: if RecordNewMultiXact() itself
    > initializes the page, it writes it synchronously with SimpleLruWritePage()
    > before setting last_initialized_offsets_page.
    
    FWIW, I have a couple of customers complaining about that as well,
    as cross-version physical replication is a thing in minor upgrade
    flows.  This bug is suddenly making recovery disruptive for some
    folks out there.  :(
    
    > I attached a small patch for REL_16_STABLE.  The same self-deadlock pattern
    > is also present on PG 14 and 15.  PG 17 and 18 have the same
    > compatibility call, but SLRU locking is banked there, and
    > RecordNewMultiXact() does not appear to hold the relevant bank lock
    > before calling SimpleLruWriteAll(), so I would not describe those
    > branches as having this exact self-deadlock, but this needs more analysis.
    
    So your core argument is that, while the SimpleLruWriteAll() call is
    defensive, it is not actually necessary: if
    last_initialized_offsets_page is -1, we have not yet replayed a
    ZERO_OFF_PAGE record, so there can be no dirty page that would make
    SimpleLruDoesPhysicalPageExist() return an incorrect result, which
    would be bad.  I am not sure I agree that this assumption holds all
    the time; see for example the WAL sequence mentioned in the thread
    that led to 77dff5d937b1:
    https://www.postgresql.org/message-id/33319276-e4d0-4773-89e4-09084905fdb0%40iki.fi
    
    I can see this WAL sequence mentioned there, which is possible because
    there is no strict ordering in the creation of multixacts:
    ZERO_PAGE:2048 -> CREATE_ID:2048 -> CREATE_ID:2049 -> CREATE_ID:2047
    
    Based on that, if we begin recovery after ZERO_PAGE:2048, we could
    end up replaying this kind of sequence:
    CREATE_ID:2048 -> CREATE_ID:2049 -> CREATE_ID:2047
    
    Looking closer, last_initialized_offsets_page stays at -1.  The page
    for 2048 was zeroed before the checkpoint by the earlier
    ZERO_PAGE:2048.  CREATE_ID:2048 and CREATE_ID:2049 are replayed
    first.  Then comes CREATE_ID:2047, which enters the
    last_initialized_offsets_page fallback branch.  Without the
    WriteAll(), the page holding the offsets of 2048 and 2049 gets zeroed
    while replaying 2047, corrupting the existing state of 2048 and 2049.
    
    A different approach would be to release the MultiXactOffsetSLRULock
    before calling SimpleLruWriteAll() and re-acquire it afterwards, and I
    think that this should actually be safe.  Even if read-only backends
    evict dirty pages between the moment the lock is released and the
    moment it is re-acquired in SimpleLruWriteAll(), those pages would be
    written to disk by the eviction, which is what we want for
    correctness.  And only the startup process dirties offset pages
    during recovery, AFAIK.  Thoughts?
    
    > Added both Andrey and Heikki to the To: line, since I'm not sure whether
    > this is more severe than the multixact offset issue we had with 16.12, or
    > on par with it.
    
    Indeed, let's wait for at least Heikki's input.  
    
    Anyway, whatever the fix, I don't think that it would be a good idea
    to skip v17 and v18 by relying on the SLRU bank locks not conflicting
    to sidestep the WriteAll() problem.  Let's keep all the branches from
    v14 to v18 in sync.
    --
    Michael