Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Fix amcheck's handling of half-dead B-tree pages

  2. Fix amcheck's handling of incomplete root splits in B-tree

  3. Add a test for half-dead pages in B-tree indexes

  4. Add a test for incomplete splits in B-tree indexes

  5. Improve checking of child pages in contrib/amcheck.

  1. Bug in amcheck?

    Konstantin Knizhnik <knizhnik@garret.ru> — 2025-10-22T16:29:51Z

    Hi hackers.
    
    We see the following error reported by amcheck (I have added dump of 
    opaque) when it interleaves with autovacuum and cancel pt:
    
    
    ERROR:  mismatch between parent key and child high key in index 
    "pg_attribute_relid_attnam_index"
    DETAIL:  Target block=274, target opaque->flags=0, child block=427, 
    child opaque=11, target page lsn=1/484A8FC8.
    CONTEXT:  SQL statement "SELECT bt_index_parent_check(indexrelid, true, 
    true) from pg_index"
    
    So child has BTP_HALF_DEAD bit set.
    Autovacuum is interrupted in this place in _bt_pagedel:
    
             /*
              * Check here, as calling loops will have locks held, preventing
              * interrupts from being processed.
              */
             CHECK_FOR_INTERRUPTS();
    
    Reproducing it is not so easy.
    First of all I added sleep here:
    
             /*
              * Check here, as calling loops will have locks held, preventing
              * interrupts from being processed.
              */
             pg_usleep(10000);
             CHECK_FOR_INTERRUPTS();
    
    Then I create two procedures:
    
    create or replace procedure create_tables(tables integer, partitions 
    integer) as $$
    declare
         i integer;
         j integer;
    begin
         for i in 1..tables
         loop
             execute 'DROP TABLE IF EXISTS t_' || i;
             execute 'CREATE TABLE t_' || i || '(pk integer) partition by 
    range (pk)';
             for j in 1..partitions
             loop
                 execute 'create table p_'||i||'_'||j||' partition of 
    t_'||i||' for values from ('||j||') to ('||(j + 1)||')';
             end loop;
             execute 'insert into t_'||i||' values 
    (generate_series(1,'||partitions||'))';
         end loop;
    end;
    $$ language plpgsql;
    
    and
    
    create or replace procedure run_amcheck() as $$
    begin
         loop
             if (select count(*) from pg_stat_activity where 
    backend_type='autovacuum worker') > 0
             then
                 raise notice 'Run amcheck!';
                 perform bt_index_parent_check(indexrelid, true, true) from 
    pg_index;
             end if;
             perform pg_sleep(1);
         end loop;
    end;
    $$ language plpgsql;
    
    Then I run concurrently run_amcheck()
    and the following script for pgbench:
    
    call create_tables(2,1000);
    select pg_sleep(2);
    
    If the problem is not reproduced, then cancel run_amcheck()  and restart 
    it once again.
    
    
    Backtrace (pg16) is the following:
    
       * frame #0: 0x00000001017b6aac 
    amcheck.dylib`bt_child_highkey_check(state=0x000000010c846318, 
    target_downlinkoffnum=37, loaded_child="\U00000001", target_level=1) at 
    verify_nbtree.c:2146:23
         frame #1: 0x00000001017b7fd8 
    amcheck.dylib`bt_child_check(state=0x000000010c846318, 
    targetkey=0x000000013c01c448, downlinkoffnum=37) at verify_nbtree.c:2262:2
         frame #2: 0x00000001017b5f4c 
    amcheck.dylib`bt_target_page_check(state=0x000000010c846318) at 
    verify_nbtree.c:1623:4
         frame #3: 0x00000001017b3908 
    amcheck.dylib`bt_check_level_from_leftmost(state=0x000000010c846318, 
    level=(level = 1, leftmost = 3, istruerootlevel = false)) at 
    verify_nbtree.c:859:3
         frame #4: 0x00000001017b24e8 
    amcheck.dylib`bt_check_every_level(rel=0x0000000140074f18, 
    heaprel=0x0000000130070148, heapkeyspace=true, readonly=true, 
    heapallindexed=true, rootdescend=true) at verify_nbtree.c:603:13
         frame #5: 0x00000001017b198c 
    amcheck.dylib`bt_index_check_internal(indrelid=2674, parentcheck=true, 
    heapallindexed=true, rootdescend=true) at verify_nbtree.c:362:3
         frame #6: 0x00000001017b1a78 
    amcheck.dylib`bt_index_parent_check(fcinfo=0x000000010c83b040) at 
    verify_nbtree.c:242:2
    
    
    I wonder if we should add P_ISHALFDEAD(opaque) for child page?
    
    
    
    
    
    
  2. Re: Bug in amcheck?

    Mikhail Nikalayeu <mihailnikalayeu@gmail.com> — 2025-11-02T12:27:00Z

    Hello!
    
    > I wonder if we should add P_ISHALFDEAD(opaque) for child page?
    
    I am not a btree expert, but things I was able to find so far:
    
    In commit d114cc538715e14d29d6de8b6ea1a1d5d3e0edb4 next check is added:
    
    > bt_child_highkey_check(state, downlinkoffnum,
    >                   child, topaque->btpo_level);
    
    At the same time there is a comment below:
    
    > * We go ahead with our checks if the child page is half-dead.  It's safe
    > * to do so because we do not test the child's high key, so it does not
    > * matter that the original high key will have been replaced by a dummy
    > * truncated high key within _bt_mark_page_halfdead().  All other page
    > * items are left intact on a half-dead page, so there is still something
    > * to test.
    
    So, yes, it looks like we need to skip the child's high key test for
    half-dead pages.
    
    BWT, have you tried to create an injection_point-based reproducer?
    
    Best regards,
    Mikhail.
    
    
    
    
  3. Re: Bug in amcheck?

    Konstantin Knizhnik <knizhnik@garret.ru> — 2025-11-18T13:59:36Z

    On 02/11/2025 2:27 PM, Mihail Nikalayeu wrote:
    > Hello!
    >
    >> I wonder if we should add P_ISHALFDEAD(opaque) for child page?
    > I am not a btree expert, but things I was able to find so far:
    >
    > In commit d114cc538715e14d29d6de8b6ea1a1d5d3e0edb4 next check is added:
    >
    >> bt_child_highkey_check(state, downlinkoffnum,
    >>                    child, topaque->btpo_level);
    > At the same time there is a comment below:
    >
    >> * We go ahead with our checks if the child page is half-dead.  It's safe
    >> * to do so because we do not test the child's high key, so it does not
    >> * matter that the original high key will have been replaced by a dummy
    >> * truncated high key within _bt_mark_page_halfdead().  All other page
    >> * items are left intact on a half-dead page, so there is still something
    >> * to test.
    > So, yes, it looks like we need to skip the child's high key test for
    > half-dead pages.
    >
    > BWT, have you tried to create an injection_point-based reproducer?
    >
    > Best regards,
    > Mikhail.
    
    
    Hello Mikhail,
    
    
    Thank you very much for looking at this issue. And I am very sorry for 
    delay with answer.
    Unfortunately I was not able to reproduce the problem for the latest 
    Postgres: neither with injection points, neither with my original 
    approach with sleeps.
    
    Originally I investigated the customer's problem with PG16. And have 
    reproduced it for pg16,. I checked that relevant amcheck code was not 
    changed since pg16, so I thought that the problem takes place for all 
    Postgres versions. But looks like it is not true.
    
    
    
    
    
    
  4. Re: Bug in amcheck?

    Mikhail Nikalayeu <mihailnikalayeu@gmail.com> — 2025-11-18T22:19:00Z

    Hello!
    
    > Originally I investigated the customer's problem with PG16. And have
    > reproduced it for pg16,. I checked that relevant amcheck code was not
    > changed since pg16, so I thought that the problem takes place for all
    > Postgres versions. But looks like it is not true.
    
    I think it is still broken, but with less probability.
    Have you tried injection points on v16? Such a test case will make
    things much more clear.
    
    Also, I added Alexander to CC (he is author of bt_child_highkey_check)
    - maybe the issue is easily understandable for him.
    
    Best regards,
    Mikhail.
    
    
    
    
  5. Re: Bug in amcheck?

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-11-25T20:03:17Z

    On 19/11/2025 00:19, Mihail Nikalayeu wrote:
    > Hello!
    > 
    >> Originally I investigated the customer's problem with PG16. And have
    >> reproduced it for pg16,. I checked that relevant amcheck code was not
    >> changed since pg16, so I thought that the problem takes place for all
    >> Postgres versions. But looks like it is not true.
    > 
    > I think it is still broken, but with less probability.
    > Have you tried injection points on v16? Such a test case will make
    > things much more clear.
    
    Konstantin's original repro involved autovacuum and concurrent sessions. 
    I was confused by that, because bt_index_parent_check() holds a 
    ShareLock, which prevents it from running concurrently with vacuum. But 
    this isn't a race condition as such, the issue arises whenever there is 
    a half-dead page in the index. To end up with a half-dead page, you need 
    to gracefully cancel/interrupt autovacuum while it's deleting a page. 
    The repro's way of canceling autovacuum was very complicated. I didn't 
    fully understand it, but I think the concurrent dropping/creating of 
    tables would sometimes cause autovauum to be canceled.
    
    Here's a much more straightforward repro. Apply this little patch:
    
    diff --git a/src/backend/access/nbtree/nbtpage.c 
    b/src/backend/access/nbtree/nbtpage.c
    index 30b43a4dd18..c132fc90277 100644
    --- a/src/backend/access/nbtree/nbtpage.c
    +++ b/src/backend/access/nbtree/nbtpage.c
    @@ -2353,6 +2353,12 @@ _bt_unlink_halfdead_page(Relation rel, Buffer 
    leafbuf, BlockNumber scanblkno,
      	 * Check here, as calling loops will have locks held, preventing
      	 * interrupts from being processed.
      	 */
    +	if (random()  < INT32_MAX / 2)
    +	{
    +		elog(ERROR, "aborting page deletion");
    +	}
    +	else
    +		elog(NOTICE, "unlinking halfdead page %u %u", 
    BufferGetBlockNumber(leafbuf), scanblkno);
      	CHECK_FOR_INTERRUPTS();
    
      	/* Unlink the current top parent of the subtree */
    
    Then run this:
    
    postgres=# CREATE TABLE amchecktest (id int4);
    CREATE TABLE
    postgres=# INSERT INTO amchecktest SELECT g from generate_series(1, 
    1000000) g;
    INSERT 0 1000000
    postgres=# CREATE INDEX on amchecktest (id);
    CREATE INDEX
    postgres=# DELETE FROM amchecktest WHERE id > 100000 AND id < 120000;
    DELETE 19999
    postgres=# -- this will hit the error added by the patch
    VACUUM amchecktest;
    ERROR:  aborting page deletion
    CONTEXT:  while vacuuming index "amchecktest_id_idx" of relation 
    "public.amchecktest"
    postgres=# select bt_index_parent_check('amchecktest_id_idx');
    ERROR:  mismatch between parent key and child high key in index 
    "amchecktest_id_idx"
    DETAIL:  Target block=3 child block=276 target page lsn=0/6ED0DB68.
    
    
    To fix this, I guess we need to teach bt_index_parent_check() about 
    half-dead pages. Anyone volunteer to write that patch?
    
    - Heikki
    
    
    
    
    
  6. Re: Bug in amcheck?

    Peter Geoghegan <pg@bowt.ie> — 2025-11-25T20:51:06Z

    On Tue, Nov 25, 2025 at 3:03 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    > To fix this, I guess we need to teach bt_index_parent_check() about
    > half-dead pages. Anyone volunteer to write that patch?
    
    It's not like bt_index_parent_check doesn't generally know about them.
    For example, bt_downlink_missing_check goes to great lengths to
    distinguish between legitimate "missing" downlinks caused by an
    interrupted page deletion, and real missing downlinks caused by
    corruption.
    
    The problem we're seeing here seems likely limited to code added by
    commit d114cc53, which enhanced bt_index_parent_check by adding the
    new bt_child_highkey_check check. bt_child_highkey_check actually
    reuses bt_downlink_missing_check (which deals with half-dead pages
    correctly), but still isn't careful enough about half-dead pages. This
    is kind of surprising, since it *does* account for incomplete splits,
    which are similar.
    
    In short, I think that we need to track something like
    BtreeCheckState.previncompletesplit, but for half-dead pages. And then
    actually use that within bt_child_highkey_check, to avoid spurious
    false-positive reports of corruption.
    
    -- 
    Peter Geoghegan
    
    
    
    
  7. Re: Bug in amcheck?

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-11-26T08:20:40Z

    On 25/11/2025 22:51, Peter Geoghegan wrote:
    > On Tue, Nov 25, 2025 at 3:03 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    >> To fix this, I guess we need to teach bt_index_parent_check() about
    >> half-dead pages. Anyone volunteer to write that patch?
    > 
    > It's not like bt_index_parent_check doesn't generally know about them.
    > For example, bt_downlink_missing_check goes to great lengths to
    > distinguish between legitimate "missing" downlinks caused by an
    > interrupted page deletion, and real missing downlinks caused by
    > corruption.
    > 
    > The problem we're seeing here seems likely limited to code added by
    > commit d114cc53, which enhanced bt_index_parent_check by adding the
    > new bt_child_highkey_check check. bt_child_highkey_check actually
    > reuses bt_downlink_missing_check (which deals with half-dead pages
    > correctly), but still isn't careful enough about half-dead pages. This
    > is kind of surprising, since it *does* account for incomplete splits,
    > which are similar.
    
    +1. It took me a moment to understand bt_downlink_missing_check(). The 
    comments there talk about interrupted page deletions and incomplete 
    splits, but it's actually never called for half-dead pages. The caller 
    checks !P_IGNORE(opaque), and P_IGNORE means BTP_DELETED | BTP_HALF_DEAD.
    
    I don't think BTP_DELETED pages should be reachable here at all. So 
    perhaps we should specifically check BTP_DELETED and give an ERROR on 
    those. And for clarity, perhaps move the check for BTP_HALF_DEAD into 
    bt_downlink_missing_check(), alongside the incomplete-split check, so 
    that both cases would be checked at the same place. Just for clarity, 
    it's not wrong as it is.
    
    > In short, I think that we need to track something like
    > BtreeCheckState.previncompletesplit, but for half-dead pages. And then
    > actually use that within bt_child_highkey_check, to avoid spurious
    > false-positive reports of corruption.
    
    I think it's even simpler than that, and we can just do this:
    
    --- a/contrib/amcheck/verify_nbtree.c
    +++ b/contrib/amcheck/verify_nbtree.c
    @@ -2268,7 +2268,7 @@ bt_child_highkey_check(BtreeCheckState *state,
      		 * If we visit page with high key, check that it is equal to the
      		 * target key next to corresponding downlink.
      		 */
    -		if (!rightsplit && !P_RIGHTMOST(opaque))
    +		if (!rightsplit && !P_RIGHTMOST(opaque) && !P_ISHALFDEAD(opaque))
      		{
      			BTPageOpaque topaque;
      			IndexTuple	highkey;
    
    
    Both BTP_HALF_DEAD and BTP_INCOMPLETE_SPLIT indicate that a downlink is 
    missing, but they work slightly differently. If a page is marked as 
    BTP_INCOMPLETE_SPLIT, it means that its right sibling has no downlink, 
    but if a page is marked as BTP_HALF_DEAD, it means that the page itself 
    has no downlink.
    
    - Heikki
    
    
    
    
    
  8. Re: Bug in amcheck?

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-01T21:20:16Z

    On 26/11/2025 10:20, Heikki Linnakangas wrote:
    > On 25/11/2025 22:51, Peter Geoghegan wrote:
    >> In short, I think that we need to track something like
    >> BtreeCheckState.previncompletesplit, but for half-dead pages. And then
    >> actually use that within bt_child_highkey_check, to avoid spurious
    >> false-positive reports of corruption.
    > 
    > I think it's even simpler than that, and we can just do this:
    > 
    > --- a/contrib/amcheck/verify_nbtree.c
    > +++ b/contrib/amcheck/verify_nbtree.c
    > @@ -2268,7 +2268,7 @@ bt_child_highkey_check(BtreeCheckState *state,
    >            * If we visit page with high key, check that it is equal to the
    >            * target key next to corresponding downlink.
    >            */
    > -        if (!rightsplit && !P_RIGHTMOST(opaque))
    > +        if (!rightsplit && !P_RIGHTMOST(opaque) && !P_ISHALFDEAD(opaque))
    >           {
    >               BTPageOpaque topaque;
    >               IndexTuple    highkey;
    > 
    > 
    > Both BTP_HALF_DEAD and BTP_INCOMPLETE_SPLIT indicate that a downlink is 
    > missing, but they work slightly differently. If a page is marked as 
    > BTP_INCOMPLETE_SPLIT, it means that its right sibling has no downlink, 
    > but if a page is marked as BTP_HALF_DEAD, it means that the page itself 
    > has no downlink.
    
    Ok, here's a proper patch with tests. The patch itself is the above 
    one-liner. It's in patch 0004.
    
    While testing this, I bumped into another similar amcheck bug: if the 
    root page split is interrupted, verify_btree() complains:
    
    ERROR:  block 3 is not true root in index "nbtree_incomplete_splits_i_idx"
    
    Attached patch 0002 contains a fix and a test for that. The fix for that 
    is also one-liner.
    
    
    Summary of the patches:
    
    Patch 0001 adds an injection point test for incomplete splits. We 
    already had such a test for GIN, which handles incomplete splits the 
    same way as B-tree. I copy-pasted and adapted the GIN test for B-tree. 
    This was an easy way to increase our test coverage.
    
    Patch 0002 fixes the incomplete-root-split bug in amcheck. It modifies 
    the test added in patch 0001 to cover the bug fix.
    
    Patch 0003 adds a test for half-dead pages, similar to what 0001 did for 
    incomplete splits.
    
    Patch 0004 fixes the bogus half-deaf-page error in amcheck, i.e. the 
    issue that started this thread. It modifies the test introduced in patch 
    0003 to add amcheck calls, to cover the bug fix.
    
    - Heikki
    
  9. Re: Bug in amcheck?

    Peter Geoghegan <pg@bowt.ie> — 2025-12-02T17:59:19Z

    On Mon, Dec 1, 2025 at 4:20 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    > Ok, here's a proper patch with tests. The patch itself is the above
    > one-liner. It's in patch 0004.
    >
    > While testing this, I bumped into another similar amcheck bug: if the
    > root page split is interrupted, verify_btree() complains:
    >
    > ERROR:  block 3 is not true root in index "nbtree_incomplete_splits_i_idx"
    >
    > Attached patch 0002 contains a fix and a test for that. The fix for that
    > is also one-liner.
    
    Good catch.
    
    > Summary of the patches:
    >
    > Patch 0001 adds an injection point test for incomplete splits. We
    > already had such a test for GIN, which handles incomplete splits the
    > same way as B-tree. I copy-pasted and adapted the GIN test for B-tree.
    > This was an easy way to increase our test coverage.
    >
    > Patch 0002 fixes the incomplete-root-split bug in amcheck. It modifies
    > the test added in patch 0001 to cover the bug fix.
    >
    > Patch 0003 adds a test for half-dead pages, similar to what 0001 did for
    > incomplete splits.
    >
    > Patch 0004 fixes the bogus half-deaf-page error in amcheck, i.e. the
    > issue that started this thread. It modifies the test introduced in patch
    > 0003 to add amcheck calls, to cover the bug fix.
    
    All seem reasonable.
    
    These tests will increase nbtree code coverage quite a bit, which is a
    nice bonus.
    
    --
    Peter Geoghegan
    
    
    
    
  10. Re: Bug in amcheck?

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-02T19:27:09Z

    On 02/12/2025 19:59, Peter Geoghegan wrote:
    > On Mon, Dec 1, 2025 at 4:20 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    >> Summary of the patches:
    >>
    >> Patch 0001 adds an injection point test for incomplete splits. We
    >> already had such a test for GIN, which handles incomplete splits the
    >> same way as B-tree. I copy-pasted and adapted the GIN test for B-tree.
    >> This was an easy way to increase our test coverage.
    >>
    >> Patch 0002 fixes the incomplete-root-split bug in amcheck. It modifies
    >> the test added in patch 0001 to cover the bug fix.
    >>
    >> Patch 0003 adds a test for half-dead pages, similar to what 0001 did for
    >> incomplete splits.
    >>
    >> Patch 0004 fixes the bogus half-deaf-page error in amcheck, i.e. the
    >> issue that started this thread. It modifies the test introduced in patch
    >> 0003 to add amcheck calls, to cover the bug fix.
    > 
    > All seem reasonable.
    > 
    > These tests will increase nbtree code coverage quite a bit, which is a
    > nice bonus.
    
    Committed, thanks for the review!
    
    - Heikki