Thread

  1. Re: [PATCH] Fix WAIT FOR LSN cleanup on subtransaction abort

    Alexander Korotkov <aekorotkov@gmail.com> — 2026-05-06T09:43:05Z

    On Wed, May 6, 2026 at 11:46 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
    >
    > On Wed, May 6, 2026 at 3:18 PM Ayush Tiwari <ayushtiwari.slg01@gmail.com> wrote:
    > >
    > > Hi,
    > >
    > > I found a backend crash in WAIT FOR LSN when it is interrupted inside a
    > > savepoint and the session then waits again.
    > >
    > > I tried to find if it was already reported, but could not find it, so, posting it.
    > >
    > > While navigating I noticed  WAIT FOR LSN cleanup is incomplete on
    > > subtransaction abort. An interrupt such as statement_timeout while
    > > waiting inside a savepoint leaves stale per-backend wait state,
    > > causing a later WAIT FOR LSN in the same backend to violate
    > > the wait-heap invariant and crash an assertion-enabled build.
    > >
    > > A small reproducer is:
    > >
    > >     BEGIN;
    > >     SAVEPOINT s;
    > >     SET statement_timeout = '100ms';
    > >     WAIT FOR LSN '<future-lsn>' WITH (MODE 'primary_flush');
    > >     ROLLBACK TO s;
    > >     SET statement_timeout = 0;
    > >     WAIT FOR LSN '0/0' WITH (MODE 'primary_flush', TIMEOUT '10ms', NO_THROW);
    > >     COMMIT;
    > >
    > > where <future-lsn> can be generated with:
    > >
    > >     SELECT pg_current_wal_insert_lsn() + 10000000000;
    > >
    > > TRAP: failed Assert("!procInfo->inHeap"), File: "xlogwait.c"
    > >
    > > The attached patch mirrors the top-level abort cleanup by calling
    > > WaitLSNCleanup() from AbortSubTransaction(), after LWLockReleaseAll().  It
    > > also adds a TAP test to verify that WAIT FOR LSN can be reused in the same
    > > backend after a statement_timeout and ROLLBACK TO SAVEPOINT.
    > >
    > > Thoughts?
    >
    > Thanks for reporting this. I agree with your analysis.
    
    +1, thank you for reporting this.
    
    > We need to add
    > this clean-up into AbortSubTransaction. I've some comments on the
    > patch:
    >
    > 1) Update the comment of WaitLSNCleanup
    >
    > Now the comment of this function says, "Clean up LSN waiters for
    > exiting process." After this patch, this description would be too
    > narrow because after a ROLLBACK TO SAVEPOINT, the same backend
    > continues running and may issue another WAIT FOR LSN. How about
    > changing it to something like:
    >
    > /*
    >  * Clean up any LSN wait state for the current process.
    >  */
    
    I agree with this change.  Under detailed consideration, "existing
    process" sounds confusing in this context even if it's called just
    from AbortTransaction().
    
    > 2) How about turning the SET statement_timeout = '100ms'; into more
    > deterministic machinery used in other sections of the test file to
    > ensure that the registration actually occurs by starting the waiter in
    > a background psql session and waits until itself reports: wait_event =
    > 'WaitForWalFlush' for that backend.
    >
    > 3) For the validation, consider replacing the fast success of '0/0'
    > with the timeout of the unreachable LSN used earlier. This would
    > reduce assumptions about the order between registration and the fast
    > check being fixed.
    
    I'm OK with these changes too, as they improve the test stability.
    
    I'm going to push this if no objections.
    
    ------
    Regards,
    Alexander Korotkov
    Supabase