Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Reset btpo_cycleid in nbtree VACUUM's REDO routine.

  1. nbtree VACUUM's REDO routine doesn't clear page's VACUUM cycle ID

    Peter Geoghegan <pg@bowt.ie> — 2024-11-15T16:33:37Z

    Attached patch teaches btree_xlog_vacuum, nbtree VACUUM's REDO
    routine, to reset the target page's opaque->btpo_cycleid to 0. This
    makes the REDO routine match original execution, which seems like a
    good idea on consistency grounds.
    
    I propose this for the master branch only.
    -- 
    Peter Geoghegan
    
  2. Re: nbtree VACUUM's REDO routine doesn't clear page's VACUUM cycle ID

    Michael Paquier <michael@paquier.xyz> — 2024-11-20T06:32:11Z

    On Fri, Nov 15, 2024 at 11:33:37AM -0500, Peter Geoghegan wrote:
    > Attached patch teaches btree_xlog_vacuum, nbtree VACUUM's REDO
    > routine, to reset the target page's opaque->btpo_cycleid to 0. This
    > makes the REDO routine match original execution, which seems like a
    > good idea on consistency grounds.
    > 
    > I propose this for the master branch only.
    
    Perhaps this patch should also explain why this is better this way?
    This path does not manipulate btpo_level, for one.
    --
    Michael
    
  3. Re: nbtree VACUUM's REDO routine doesn't clear page's VACUUM cycle ID

    Andrey Borodin <x4mmm@yandex-team.ru> — 2024-11-20T09:40:09Z

    
    > On 15 Nov 2024, at 21:33, Peter Geoghegan <pg@bowt.ie> wrote:
    > 
    > Attached patch teaches btree_xlog_vacuum, nbtree VACUUM's REDO
    > routine, to reset the target page's opaque->btpo_cycleid to 0. This
    > makes the REDO routine match original execution, which seems like a
    > good idea on consistency grounds.
    > 
    > I propose this for the master branch only.
    
    The change seems correct to me: anyway cycle must be less than cycle of any future vacuum after promotion. I cannot say anything about beauty of resetting or not resetting the field.
    I'd suggest renaming the field into something like "btpo_split_vaccycleid". I was aware of index vacuum backtracking, but it took me a while to build context again.
    
    
    Best regards, Andrey Borodin.
    
    
    
  4. Re: nbtree VACUUM's REDO routine doesn't clear page's VACUUM cycle ID

    Peter Geoghegan <pg@bowt.ie> — 2024-12-23T20:46:49Z

    On Wed, Nov 20, 2024 at 4:41 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
    > > On 15 Nov 2024, at 21:33, Peter Geoghegan <pg@bowt.ie> wrote:
    > > I propose this for the master branch only.
    >
    > The change seems correct to me: anyway cycle must be less than cycle of any future vacuum after promotion.
    
    The cycles set in the page special area during page splits that happen
    to run while a VACUUM also runs must  use that same VACUUM's cycle ID
    (which is stored in shared memory for the currently running VACUUM).
    That way the VACUUM will know when it must backtrack later on, to
    avoid missing index tuples that it is expected to remove.
    
    It doesn't matter if the cycle_id that VACUUM sees is less than or
    greater than its own one -- only that it matches its own one when it
    needs to match to get correct behavior from VACUUM. (Though it's also
    possible to get a false positive, in rare cases where we get unlucky
    and there's a collision.  This might waste cycles within VACUUM, but
    it shouldn't lead to truly incorrect behavior.)
    
    > I cannot say anything about beauty of resetting or not resetting the field.
    
    The main reason to do this is simple: we should make REDO routines
    match original execution, unless there is a very good reason not to do
    so -- original execution is "authoritative". There's no good reason
    not to do so here.
    
    There is a secondary reason to do things this way, though: resetting
    the cycle ID within the REDO routine can save future work from within
    btvacuumpage, after a standby gets promoted, when nbtree VACUUM runs
    against an affected index for the first time. Note that nbtree VACUUM
    goes out of its way to clear an old cycle ID in passing, even when it
    has nothing else to do on the page. And so by taking care of that on
    the REDO side, right from the start, the newly promoted standby won't
    need to dirty so many pages during the first nbtree VACUUM after it is
    promoted.
    
    (Actually, it's worse than that -- we're not just dirtying pages these
    days. It's quite likely that we'll need to write a WAL record to clear
    the cycle ID, since page-level checksums are enabled by default.)
    
    > I'd suggest renaming the field into something like "btpo_split_vaccycleid". I was aware of index vacuum backtracking, but it took me a while to build context again.
    
    I'm not inclined to change it now. It's been like this for many years now.
    
    Pushed this patch just now. Thanks for the review!
    
    -- 
    Peter Geoghegan
    
    
    
    
  5. Re: nbtree VACUUM's REDO routine doesn't clear page's VACUUM cycle ID

    Peter Geoghegan <pg@bowt.ie> — 2024-12-23T20:47:35Z

    On Wed, Nov 20, 2024 at 1:32 AM Michael Paquier <michael@paquier.xyz> wrote:
    > Perhaps this patch should also explain why this is better this way?
    
    See the commit message, and my remarks to Andrey just now.
    
    > This path does not manipulate btpo_level, for one.
    
    Why would it do so? Changing the btree level would corrupt the index?
    
    -- 
    Peter Geoghegan
    
    
    
    
  6. Re: nbtree VACUUM's REDO routine doesn't clear page's VACUUM cycle ID

    Andrey Borodin <x4mmm@yandex-team.ru> — 2025-11-18T06:32:19Z

    
    > On 24 Dec 2024, at 01:46, Peter Geoghegan <pg@bowt.ie> wrote:
    > 
    > On Wed, Nov 20, 2024 at 4:41 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
    >>> On 15 Nov 2024, at 21:33, Peter Geoghegan <pg@bowt.ie> wrote:
    >>> I propose this for the master branch only.
    >> 
    >> The change seems correct to me: anyway cycle must be less than cycle of any future vacuum after promotion.
    > 
    > The cycles set in the page special area during page splits that happen
    > to run while a VACUUM also runs must  use that same VACUUM's cycle ID
    > (which is stored in shared memory for the currently running VACUUM).
    > That way the VACUUM will know when it must backtrack later on, to
    > avoid missing index tuples that it is expected to remove.
    > 
    > It doesn't matter if the cycle_id that VACUUM sees is less than or
    > greater than its own one -- only that it matches its own one when it
    > needs to match to get correct behavior from VACUUM. (Though it's also
    > possible to get a false positive, in rare cases where we get unlucky
    > and there's a collision.  This might waste cycles within VACUUM, but
    > it shouldn't lead to truly incorrect behavior.)
    
    I'm thinking more about it. We always reset btpo_cycleid even in redo of a split.
    This "btpo_cycleid = 0;" reset can break two scenarios that are not currently supported by us, but might be supported in future.
    This reset is based on the idea that crash recovery will interrupt vacuum. It is not true in these cases.
    
    1. We are dealing with compute-storage separation system. We do not have filesystem and when we need to read a page we get it from some storage service, that rebuild pages from WAL. (e.g. Aurora and Neon) If we split a page during vacuum, evict it and read it from service - we will miss needed backtrack to the left page...
    2. There's a tool for repairing pages with checksum violations - page repair. AFAIK it can request page from Standby, and if it does amidst vacuum, vacuum can get false negative for backtracking logic.
    
    Thanks!
    
    
    Best regards, Andrey Borodin.
    
    
    
  7. Re: nbtree VACUUM's REDO routine doesn't clear page's VACUUM cycle ID

    Peter Geoghegan <pg@bowt.ie> — 2025-11-18T18:54:11Z

    On Tue, Nov 18, 2025 at 1:32 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
    > I'm thinking more about it. We always reset btpo_cycleid even in redo of a split.
    > This "btpo_cycleid = 0;" reset can break two scenarios that are not currently supported by us, but might be supported in future.
    
    I don't follow.
    
    > This reset is based on the idea that crash recovery will interrupt vacuum. It is not true in these cases.
    
    It's also based on the idea that only one VACUUM operation can be
    running at a time.
    
    > 1. We are dealing with compute-storage separation system. We do not have filesystem and when we need to read a page we get it from some storage service, that rebuild pages from WAL. (e.g. Aurora and Neon) If we split a page during vacuum, evict it and read it from service - we will miss needed backtrack to the left page...
    
    Are you arguing that the xl_btree_split record should include the cycleid?
    
    I see that systems that are built on this architecture do something
    along these lines:
    https://github.com/neondatabase/postgres/commit/a9b92820c5d14dbff8f59ab65ffdaae92ab9c3c8
    
    However, that seems well out of scope for core Postgres. At least for
    the foreseeable future.
    
    -- 
    Peter Geoghegan
    
    
    
    
  8. Re: nbtree VACUUM's REDO routine doesn't clear page's VACUUM cycle ID

    Andrey Borodin <x4mmm@yandex-team.ru> — 2025-11-19T04:36:19Z

    
    > On 18 Nov 2025, at 23:54, Peter Geoghegan <pg@bowt.ie> wrote:
    > 
    > Are you arguing that the xl_btree_split record should include the cycleid?
    
    Yes.
    
    > I see that systems that are built on this architecture do something
    > along these lines:
    > https://github.com/neondatabase/postgres/commit/a9b92820c5d14dbff8f59ab65ffdaae92ab9c3c8
    
    Thanks for the link. This solution seems surprisingly complicated.
    
    > However, that seems well out of scope for core Postgres. At least for
    > the foreseeable future.
    
    Yes, I agree. It's just a case when redo routines do not match original execution, and it seemed interesting to me. I understand that for core Postgres it's extra overhead in WAL record only for a beautification.
    
    Thanks!
    
    
    Best regards, Andrey Borodin.