Thread

  1. Re: [PATCH] Fix WAIT FOR LSN cleanup on subtransaction abort

    Xuneng Zhou <xunengzhou@gmail.com> — 2026-05-06T08:46:04Z

    Hi Ayush,
    
    On Wed, May 6, 2026 at 3:18 PM Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote:
    >
    > Hi,
    >
    > I found a backend crash in WAIT FOR LSN when it is interrupted inside a
    > savepoint and the session then waits again.
    >
    > I tried to find if it was already reported, but could not find it, so, posting it.
    >
    > While navigating I noticed  WAIT FOR LSN cleanup is incomplete on
    > subtransaction abort. An interrupt such as statement_timeout while
    > waiting inside a savepoint leaves stale per-backend wait state,
    > causing a later WAIT FOR LSN in the same backend to violate
    > the wait-heap invariant and crash an assertion-enabled build.
    >
    > A small reproducer is:
    >
    >     BEGIN;
    >     SAVEPOINT s;
    >     SET statement_timeout = '100ms';
    >     WAIT FOR LSN '<future-lsn>' WITH (MODE 'primary_flush');
    >     ROLLBACK TO s;
    >     SET statement_timeout = 0;
    >     WAIT FOR LSN '0/0' WITH (MODE 'primary_flush', TIMEOUT '10ms', NO_THROW);
    >     COMMIT;
    >
    > where <future-lsn> can be generated with:
    >
    >     SELECT pg_current_wal_insert_lsn() + 10000000000;
    >
    > TRAP: failed Assert("!procInfo->inHeap"), File: "xlogwait.c"
    >
    > The attached patch mirrors the top-level abort cleanup by calling
    > WaitLSNCleanup() from AbortSubTransaction(), after LWLockReleaseAll().  It
    > also adds a TAP test to verify that WAIT FOR LSN can be reused in the same
    > backend after a statement_timeout and ROLLBACK TO SAVEPOINT.
    >
    > Thoughts?
    
    Thanks for reporting this. I agree with your analysis. We need to add
    this clean-up into AbortSubTransaction. I've some comments on the
    patch:
    
    1) Update the comment of WaitLSNCleanup
    
    Now the comment of this function says, "Clean up LSN waiters for
    exiting process." After this patch, this description would be too
    narrow because after a ROLLBACK TO SAVEPOINT, the same backend
    continues running and may issue another WAIT FOR LSN. How about
    changing it to something like:
    
    /*
     * Clean up any LSN wait state for the current process.
     */
    
    2) How about turning the SET statement_timeout = '100ms'; into more
    deterministic machinery used in other sections of the test file to
    ensure that the registration actually occurs by starting the waiter in
    a background psql session and waits until itself reports: wait_event =
    'WaitForWalFlush' for that backend.
    
    3) For the validation, consider replacing the fast success of '0/0'
    with the timeout of the unreachable LSN used earlier. This would
    reduce assumptions about the order between registration and the fast
    check being fixed.
    
    --
    Best,
    Xuneng