Thread

  1. Re: [WiP] B-tree page merge during vacuum to reduce index bloat

    Matthias van de Meent <boekewurm+postgres@gmail.com> — 2025-11-11T12:06:30Z

    On Sun, 31 Aug 2025 at 14:15, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
    > > On 29 Aug 2025, at 13:39, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
    > >
    > > I think to establish baseline for locking correctness we are going to start from writing index scan tests, that fail with proposed merge patch and pass on current HEAD. I want to observe that forward scan is showing duplicates and backward scan misses tuples.
    >
    > Well, that was unexpectedly easy. See patch 0001. It brings a test where we create sparse tree, and injection point that will wait on a scan stepping into some middle leaf page.
    > Then the test invokes vacuum. There are ~35 leaf pages, most of them will be merged into just a few pages.
    > As expected, both scans produce incorrect results.
    > t/008_btree_merge_scan_correctness.pl .. 1/?
    > #   Failed test 'Forward scan returns correct count'
    > #   at t/008_btree_merge_scan_correctness.pl line 132.
    > #          got: '364'
    > #     expected: '250'
    >
    > #   Failed test 'Backward scan returns correct count'
    > #   at t/008_btree_merge_scan_correctness.pl line 133.
    > #          got: '142'
    > #     expected: '250'
    > # Looks like you failed 2 tests of 2.
    >
    >
    > > From that we will try to design locking that does not affect performance significantly, but allows to merge pages. Perhaps, we can design a way to switch new index scans to "safe mode" during index vacuum and waiting for existing scans to complete.
    >
    > What if we just abort a scan, that stepped on the page where tuples were moved out?
    
    I don't think that's viable nor valid. We can't let vacuum processes
    abort active scans; it'd go against the principles of vacuum being a
    background process; and it'd be a recipe for disaster when it triggers
    in catalog scans. It'd probably also fail to detect the case where a
    non-backward index scan's just-accessed data was merged to the right,
    onto the page that the scan will access next (or, a backward scan's
    just-accessed data merged to the left).
    
    I think the issue isn't _just_ what to do when we detect that a page
    was merged (it's relatively easy to keep track of what data was
    present on a merged-now-empty page), but rather how we detect that
    pages got merged, and how we then work to return to a valid scan
    position.
    Preferably, we'd have a way that guarantees that a scan can not fail
    to notice that a keyspace with live tuples was concurrently merged
    onto a page, what keyspace this was, and whether that change was
    relevant to the currently active index scan; all whilst still obeying
    the other nbtree invariants that scans currently rely on.
    
    > I've prototype this approach, please see patch 0002. Maybe in future we will improve locking protocol if we will observe high error rates.
    > Unfortunately, this approach leads to default mergefactor 0 instead of 5%.
    >
    > What do you think? Should we add this to CF or the idea is too wild for a review?
    
    Aborting queries to be able to merge pages is not a viable approach;
    -1 on that. It would make running VACUUM concurrently with other
    workloads unsafe, and that's not acceptable.
    
    
    Kind regards,
    
    Matthias van de Meent
    Databricks (https://databricks.com)