Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Clarify why _bt_killitems sorts its items array.

  2. Return TIDs in desc order during backwards scans.

  3. Optimize nbtree backwards scans.

  4. Optimize nbtree backward scan boundary cases.

  1. Returning nbtree posting list TIDs in DESC order during backwards scans

    Peter Geoghegan <pg@bowt.ie> — 2025-06-16T16:46:29Z

    Currently, nbtree always returns individual posting list TIDs to the
    scan in ASC order -- even during a backwards scan (more on why the
    deduplication patch deliberately did things that way later). But
    allowing backwards scans to return TIDs in "somewhat descending order"
    like this results in needless unpinning and re-pinning of the same
    heap page. This seems worth fixing.
    
    Patch goals
    ===========
    
    nbtree scans ought to guarantee that they'll return all TIDs in
    whatever order they appear on the page, when read either
    front-to-back, or back-to-front (whichever order matches the scan's
    current direction). Individual posting list TIDs should also be
    returned in either front-to-back or back-to-front order, also
    according to the scan's direction. In general, the order that TIDs are
    returned in ought to never be affected by the use of deduplication --
    including during backwards scans.
    
    Attached patch teaches nbtree's _bt_readpage function to return TIDs
    from a posting list in DESC order during backwards scans, bringing
    nbtree in line with the ideal described in the previous paragraph.
    This approach seems more principled to me, and is likely to have a
    small performance benefit.
    
    I'll submit this to the first commitfest for Postgres 19.
    
    Test case
    =========
    
    Here's an example of the problem on HEAD, that the patch fixes:
    
    -- Setup table with low cardinality index:
    create table duplicate_test(dup int4);
    create index on duplicate_test (dup);
    insert into duplicate_test
    select val from generate_series(1, 18) val,
                    generate_series(1,1000) dups_per_val;
    
    -- Forwards scan, gets 23 buffer hits:
    explain analyze
    select ctid, * from
    duplicate_test
    where dup < 5
    order by dup;
    
    -- Equivalent backwards scan, gets 42 buffer hits on HEAD:
    explain analyze
    select ctid, * from
    duplicate_test
    where dup < 5
    order by dup desc;
    
    With the patch applied, *both* queries get the expected 23 buffer hits.
    
    I am generally suspicious of cases where backwards scans are at a
    gratuitous disadvantage relative to equivalent forwards scans. See
    previous work in commit c9c0589f and commit 1bd4bc85 for earlier
    examples of fixing problems that have that same quality to them. This
    might be the last such patch that I'll ever have to write.
    
    Sorting so->killedItems[] in _bt_killitems
    ==========================================
    
    Making this change in _bt_readpage creates a problem in _bt_killitems:
    it currently expects posting list TIDs in ascending order (the loop
    invariant relies on that), which is why the deduplication patch didn't
    just do things like this in _bt_readpage to begin with. The patch
    deals with that new problem by qsort'ing the so->killedItems[] array
    (at the top of _bt_killitems) such that the items always appear
    exactly once, in the expected ASC order.
    
    Another benefit of the qsort approach in _bt_killitems is that it'll
    gracefully handle scrollable cursor scans that happen to scroll back
    and forth on the same leaf page. This might result in the same dead
    TID being added to the so->killedItems[] array multiple times, in
    neither ASC or DESC order. That also confuses the loop inside
    _bt_killitems, in a way that needlessly prevents setting LP_DEAD bits
    (this is a preexisting problem, not a problem created by the changes
    that the patch makes to _bt_readpage). Having duplicate TIDs in
    so->killedItems[] like this isn't particularly likely, but it seems
    worth having proper handling for all the same, on general principle.
    
    I tend to doubt that the overhead of the qsort will ever really
    matter, but if it does then we can always limit it to backwards scans
    (meaning limit it to cases where so->currPos.dir indicates that
    _bt_readpage processed the page in the backwards scan direction), and
    drop the goal of dealing with duplicate so->killedItems[] TIDs
    gracefully.
    
    Note that the qsort is completely useless during standard forwards
    scans. It's hard to tell those apart from scrollable cursor scans that
    briefly changed directions "within a page", though, so I'm inclined to
    just always do the qsort, rather than trying to cleverly avoid it.
    That's what happens in this v1 of the patch (we qsort unconditionally
    in _bt_killitems).
    
    -- 
    Peter Geoghegan
    
  2. Re: Returning nbtree posting list TIDs in DESC order during backwards scans

    Peter Geoghegan <pg@bowt.ie> — 2025-06-16T23:47:01Z

    On Mon, Jun 16, 2025 at 12:46 PM Peter Geoghegan <pg@bowt.ie> wrote:
    > Making this change in _bt_readpage creates a problem in _bt_killitems:
    > it currently expects posting list TIDs in ascending order (the loop
    > invariant relies on that), which is why the deduplication patch didn't
    > just do things like this in _bt_readpage to begin with. The patch
    > deals with that new problem by qsort'ing the so->killedItems[] array
    > (at the top of _bt_killitems) such that the items always appear
    > exactly once, in the expected ASC order.
    
    Actually, we can just completely get rid of so->killedItems[]. We can
    replace it with a new 1 bit itemDead field in the BTScanPosItem struct
    (the struct used for so->currPos.items[] entries). That way, the patch
    avoids a qsort. The patch doesn't need to change the order of anything
    (except of course for the order that _bt_readpage initially saves
    posting list tuple TIDs in, when scanning backwards).
    
    The problem with so->killedItems[] is that it stores things in
    whatever order successive btgettuple calls find most convenient in the
    kill_prior_tuple path. It turns out that it's even more convenient for
    btgettuple if its this kill_prior_tuple path simply sets the relevant
    (prior/just-returned) item's so->currPos.items[].itemDead field. There
    is no need to allocate memory for so->killedItems[], and no need to
    worry about running out of space in so->killedItems[] in cases
    involving scrollable cursors that happen to scroll back and forth,
    triggering kill_prior_tuple for the same TID. We'll simply set the
    TID/item's so->currPos.items[].itemDead field a second or a third
    time, which can't complicate things, since my new approach makes that
    idempotent.
    
    Attached is v2 of the patch, which works that way. This seems like a
    better direction to take things in within _bt_killitems.
    
    In general we're quite sensitive to the size of the BTScanPosItem
    struct, since (at least for now) we routinely allocate
    MaxTIDsPerBTreePage-many of them (i.e. 1,358 of them). 1,358 * 10
    bytes is already too much. But there's no need to make the relevant
    allocation even larger. I can steal a bit from
    BTScanPosItem.tupleOffset for these new
    so->currPos.items[].itemDead/BTScanPosItem.itemDead fields. That's
    safe, since we only really need 15 bits for each
    BTScanPosItem.tupleOffset offset.
    
    --
    Peter Geoghegan
    
  3. Re: Returning nbtree posting list TIDs in DESC order during backwards scans

    Mircea Cadariu <cadariu.mircea@gmail.com> — 2025-07-16T21:27:56Z

    Hi,
    
    
    > -    for (int i = 0; i < numKilled; i++)
    > +    for (int i = so->currPos.firstItem; i <= so->currPos.lastItem; i++)
    
    Does the above change mean it will have to do more work in the loop? 
    Whereas before it visited strictly killed, it now has to go through all 
    of them?
    
    
    Kind regards,
    
    Mircea Cadariu
    
    
    
    
    
  4. Re: Returning nbtree posting list TIDs in DESC order during backwards scans

    Peter Geoghegan <pg@bowt.ie> — 2025-07-17T18:26:35Z

    On Wed, Jul 16, 2025 at 5:27 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:
    > Does the above change mean it will have to do more work in the loop?
    > Whereas before it visited strictly killed, it now has to go through all
    > of them?
    
    Yes, that's true. Any item that the scan returns from the so->currPos
    page needs to be considered within the loop.
    
    The loop has an early check for this (for non-itemDead entries) here:
    
            /* Quickly skip over items never ItemDead-set by btgettuple */
            if (!kitem->itemDead)
                continue;
    
    I really doubt that this matters, because:
    
    * This can only happen when we actually call _bt_killitems in the
    first place, so there has to be at least one item whose index tuple
    really does need to be LP_DEAD-set.
    
    * The chances of there being a huge number of so->currPos.items[]
    items but only one or two with their "itemDead" bit set seems low, in
    general.
    
    * The new loop is significantly simpler in that it iterates through
    so->currPos.items[] in order, without any of the so->killedItems[]
    indirection you see on HEAD. Modern CPUs are likely to skip over
    non-itemDead entries very quickly.
    
    Note that so->killedItems[] (which this patch removes) can be in
    ascending leaf-page-wise order, descending leaf-page-wise order, or
    (with a scrollable cursor) some random mix of the two -- even when
    there's no posting list tuples involved.
    
    -- 
    Peter Geoghegan
    
    
    
    
  5. Re: Returning nbtree posting list TIDs in DESC order during backwards scans

    Peter Geoghegan <pg@bowt.ie> — 2025-07-17T19:37:31Z

    On Thu, Jul 17, 2025 at 2:26 PM Peter Geoghegan <pg@bowt.ie> wrote:
    > The loop has an early check for this (for non-itemDead entries) here:
    >
    >         /* Quickly skip over items never ItemDead-set by btgettuple */
    >         if (!kitem->itemDead)
    >             continue;
    >
    > I really doubt that this matters
    
    I think that this patch isn't too far off being committable. I'll need
    to think about issues around adding the new kitem->itemDead bitfield.
    I'm not really worried about _bt_killitems; more so about the routines
    called by _bt_readpage, which must make sure that the bit is unset
    every time a TID is saved in so->currPos.items[].
    
    Attached is v3. Not much has changed. We now defensively unset each
    previously set kitem->itemDead within _bt_killitems. I've also added a
    pair of macros that represent 0 and 1 values for these kitem->itemDead
    bits.
    
    Thanks
    -- 
    Peter Geoghegan
    
  6. Re: Returning nbtree posting list TIDs in DESC order during backwards scans

    Mircea Cadariu <cadariu.mircea@gmail.com> — 2025-07-20T04:12:43Z

    Hi,
    
    On 18/07/2025 05:37, Peter Geoghegan wrote:
    
    > I think that this patch isn't too far off being committable.
    
    Agreed. I just tried v3, and got 23 buffer hits, same as in the original 
    demonstrative example.
    
    
    > +                    /* Remember all prior TIDs (must be at least one) */
    > +                    for (int i = nitems - 2; i >= 0; i--)
    
    This loop has to start from the end, in order to return TIDs in DESC order?
    
    
    > I'm not really worried about _bt_killitems; more so about the routines
    > called by _bt_readpage, which must make sure that the bit is unset
    > every time a TID is saved in so->currPos.items[].
    
    I did a search through the code for "so->" and had a look at the results 
    for functions I'd expect to see changes in, at the minimum:
    
       *  btbeginscan
      * btrescan
      * btendscan
      * btrestrpos
      * _bt_steppage
      * _bt_readfirstpage
    
    I could find all of the above being touched in v3.
    
    
    > Modern CPUs are likely to skip over
    > non-itemDead entries very quickly.
    
    Okay, yeah. A sequential iteration through an array will be fast, and we 
    expect the branch predictor to do its job properly with the "if 
    (!kitem->itemDead)".
    
    
    > I'll need
    > to think about issues around adding the new kitem->itemDead bitfield.
    
    It's probably not a concern here, but got reminded of this ABI break: 
    https://www.postgresql.org/message-id/flat/CABOikdNmVBC1LL6pY26dyxAS2f%2BgLZvTsNt%3D2XbcyG7WxXVBBQ%40mail.gmail.com
    
    
    Kind regards,
    
    Mircea Cadariu
    
  7. Re: Returning nbtree posting list TIDs in DESC order during backwards scans

    Andy Fan <zhihuifan1213@163.com> — 2025-07-27T13:42:53Z

    
    Hi,
    
    > On Thu, Jul 17, 2025 at 2:26 PM Peter Geoghegan <pg@bowt.ie> wrote:
    >> The loop has an early check for this (for non-itemDead entries) here:
    >>
    >>         /* Quickly skip over items never ItemDead-set by btgettuple */
    >>         if (!kitem->itemDead)
    >>             continue;
    >>
    >> I really doubt that this matters
    
    > I'll need to think about issues around adding the new kitem->itemDead
    > bitfield. I'm not really worried about _bt_killitems; more so about
    > the routines called by _bt_readpage, which must make sure that the bit
    > is unset every time a TID is saved in so->currPos.items[].
    
    Yes, otherwise the live tuple maybe marked as dead, which is terrible. I
    think you have unset the bit on all the needed places, including
    _bt_saveitem, _bt_savepostingitem and _bt_setuppostingitems for the
    first item in postlist item. I can't find out anything is missed.
    
    (I thought another place for this work might be _bt_returnitem, this
    might be a more centralized place to set. Later I think it is totally
    wrong because for the queries like SELECT * FROM t WHERE idx_col = 3
    LIMIT 3;  Some items in so->currPos.items[] were filled in
    _bt_readpage but maybe never returned to heap at all, and later
    _bt_killitems also run against it on the whole, so unsetting the bit on
    _bt_returnitem is too late...)
    
    I think this patch does two things. one is refactoring the data struct
    for _bt_killitems, the other one is scaning the postlist in the backward
    way for the backward scan. then does the below changes belongs to any of
    them? Is it an intentional change? 
    
    _bt_killitems:
    
     		if (offnum < minoff)
    -			continue;			/* pure paranoia */
    +		{
    +			/*
    +			 * Page must have originally been the rightmost page, but has
    +			 * since split
    +			 */
    +			Assert(!so->dropPin);
    +			offnum = minoff;
    +		}
    
    
    At last, I can get the same test result as you. The buffer hits go back to
    23 in the test case, thank for working on this!
    
    > I think that this patch isn't too far off being committable.
    
    I think so.
    
    -- 
    Best Regards
    Andy Fan
    
    
    
    
    
  8. Re: Returning nbtree posting list TIDs in DESC order during backwards scans

    Peter Geoghegan <pg@bowt.ie> — 2025-12-03T03:08:26Z

    On Sun, Jul 27, 2025 at 9:43 AM Andy Fan <zhihuifan1213@163.com> wrote:
    > At last, I can get the same test result as you. The buffer hits go back to
    > 23 in the test case, thank for working on this!
    >
    > > I think that this patch isn't too far off being committable.
    >
    > I think so.
    
    Coming back to this patch now, after several months of work on index
    prefetching.
    
    I decided that it wasn't such a great idea to reuse/steal an unused
    "itemDead" bit from the BTScanPosItem.itemOffset field after all. That
    forces _bt_killitems to iterate through every so->currPos.item[], not
    just those that are known to require LP_DEAD marking.
    
    Tomas Vondra suggested that I keep killedItems as a separate
    allocation (as it is on master), while using a Bitmapset to represent
    killedItems (unlike on master, where it is represented using a simple
    array). This has all of the same advantages as my previous approach,
    but doesn't have the aforementioned disadvantages within _bt_killitems
    (plus we no longer need to change BTScanPosItem in any way).
    
    Attached is v4, which does it that way.
    
    My plan is to commit this improved version in the next couple of days.
    
    --
    Peter Geoghegan
    
  9. Re: Returning nbtree posting list TIDs in DESC order during backwards scans

    Chao Li <li.evan.chao@gmail.com> — 2025-12-03T12:31:43Z

    Hi Peter,
    
    The patch v4 looks carefully written and technically solid, and the core logic (switching killedItems[] to Bitmapset and reworking backwards posting list scans) is coherent.
    
    I just got a few comments/questions:
    
    > On Dec 3, 2025, at 11:08, Peter Geoghegan <pg@bowt.ie> wrote:
    > 
    > Attached is v4, which does it that way.
    > 
    > My plan is to commit this improved version in the next couple of days.
    > 
    > --
    > Peter Geoghegan
    > <v4-0001-Return-TIDs-in-desc-order-during-backwards-scans.patch>
    
    
    1
    ```
    -	/* Always invalidate so->killedItems[] before leaving so->currPos */
    -	so->numKilled = 0;
    ```
    
    The old code only sets so->numKilled to 0 and reuse memory of so->killedItems[], now the new code always bms_free(so->killedItems) and re-alloc memory when next time adding a member to bms.
    
    I am think that, if there were bms_clear(), then we could have just cleared the bitmap and reused the memory next time. But unfortunately, there is no such a bms_clear() now. What do you think to add bms_clear() and use it here? I don’t want to do that, I can try that once you push this patch.
    
    2
    ```
    +					/* Set up posting list state (and remember last TID) */
     					itemIndex--;
     					tupleOffset =
     						_bt_setuppostingitems(so, itemIndex, offnum,
    -											  BTreeTupleGetPostingN(itup, 0),
    +											  BTreeTupleGetPostingN(itup, nitems - 1),
     											  itup);
    -					/* Remember additional TIDs */
    -					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
    +
    +					/* Remember all prior TIDs (must be at least one) */
    +					for (int i = nitems - 2; i >= 0; i--)
     					{
     						itemIndex--;
     						_bt_savepostingitem(so, itemIndex, offnum,
    ```
    
    I wonder if the comment “must be at lease one” should apply to the assignment of tupleOffset? The “for” loop starts from nitems-2, will it still must be at lease one item?
    
    3
    ```
     				/*
    -				 * Don't bother advancing the outermost loop's int iterator to
    -				 * avoid processing killed items that relate to the same
    -				 * offnum/posting list tuple.  This micro-optimization hardly
    -				 * seems worth it.  (Further iterations of the outermost loop
    -				 * will fail to match on this same posting list's first heap
    -				 * TID instead, so we'll advance to the next offnum/index
    -				 * tuple pretty quickly.)
    +				 * Don't advance itemIndex for outermost loop, no matter how
    +				 * nextIndex was advanced.  It's possible that items whose
    +				 * TIDs weren't matched in posting list can still be killed
    +				 * (there might be a later tuple whose TID is a match).
     				 */
     				if (j == nposting)
     					killtuple = true;
    ```
    
    I really don’t get what "Don't bother advancing the outermost loop's int iterator” means? Here we only set killtuple to true, then if (killtuple && !ItemIdIsDead(iid)), it breaks the inner while loop, in that case, outer while loop "while ((itemIndex = bms_next_member(so->killedItems, itemIndex)) >= 0)” will advance itemIndex.
    
    I know this is a legacy comment, if you can explain a little bit, that will be very appreciated.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
    
    
    
  10. Re: Returning nbtree posting list TIDs in DESC order during backwards scans

    Victor Yegorov <vyegorov@gmail.com> — 2025-12-03T15:18:30Z

    ср, 3 дек. 2025 г. в 06:09, Peter Geoghegan <pg@bowt.ie>:
    
    > Coming back to this patch now, after several months of work on index
    > prefetching.
    >
    > I decided that it wasn't such a great idea to reuse/steal an unused
    > "itemDead" bit from the BTScanPosItem.itemOffset field after all. That
    > forces _bt_killitems to iterate through every so->currPos.item[], not
    > just those that are known to require LP_DEAD marking.
    >
    > Tomas Vondra suggested that I keep killedItems as a separate
    > allocation (as it is on master), while using a Bitmapset to represent
    > killedItems (unlike on master, where it is represented using a simple
    > array). This has all of the same advantages as my previous approach,
    > but doesn't have the aforementioned disadvantages within _bt_killitems
    > (plus we no longer need to change BTScanPosItem in any way).
    >
    > Attached is v4, which does it that way.
    >
    > My plan is to commit this improved version in the next couple of days.
    >
    > --
    > Peter Geoghegan
    >
    
    Patch looks fine, applies and compiles cleanly, passes tests.
    
    I'd like to point out a missing space after the dot in the 2nd para of the
    commit message,
    falls out of style.
    
    -- 
    Victor Yegorov
    
  11. Re: Returning nbtree posting list TIDs in DESC order during backwards scans

    Mircea Cadariu <cadariu.mircea@gmail.com> — 2025-12-03T15:20:04Z

    Hi,
    
    On 03/12/2025 03:08, Peter Geoghegan wrote:
    > Coming back to this patch now, after several months of work on index
    > prefetching.
    
    Nice, welcome back! It would be great to wrap this up.
    
    > Tomas Vondra suggested that I keep killedItems as a separate
    > allocation (as it is on master), while using a Bitmapset to represent
    > killedItems (unlike on master, where it is represented using a simple
    > array). This has all of the same advantages as my previous approach,
    > but doesn't have the aforementioned disadvantages within _bt_killitems
    > (plus we no longer need to change BTScanPosItem in any way).
    Good one!
    
    
    > -	* Don't bother advancing the outermost loop's int iterator to
    > -	* avoid processing killed items that relate to the same
    > -       * offnum/posting list tuple.  This micro-optimization hardly
    > -       * seems worth it.  (Further iterations of the outermost loop
    > -	* will fail to match on this same posting list's first heap
    > -       * TID instead, so we'll advance to the next offnum/index
    > -       * tuple pretty quickly.)
    > +       * Don't advance itemIndex for outermost loop, no matter how
    > +       * nextIndex was advanced.  It's possible that items whose
    > +       * TIDs weren't matched in posting list can still be killed
    > +       * (there might be a later tuple whose TID is a match).
    Hmm, as far as I can tell, the old version of the comment seems more 
    accurate. If I understand correctly, it's still safe to do the 
    micro-optimization, but we choose to not do it because we expect the 
    speedup will not be worth the increased complexity / risk of introducing 
    bugs.
    
    -- 
    Thanks,
    Mircea Cadariu
    
  12. Re: Returning nbtree posting list TIDs in DESC order during backwards scans

    Peter Geoghegan <pg@bowt.ie> — 2025-12-03T17:31:24Z

    On Wed, Dec 3, 2025 at 7:32 AM Chao Li <li.evan.chao@gmail.com> wrote:
    > The old code only sets so->numKilled to 0 and reuse memory of so->killedItems[], now the new code always bms_free(so->killedItems) and re-alloc memory when next time adding a member to bms.
    >
    > I am think that, if there were bms_clear(), then we could have just cleared the bitmap and reused the memory next time. But unfortunately, there is no such a bms_clear() now. What do you think to add bms_clear() and use it here? I don’t want to do that, I can try that once you push this patch.
    
    I don't think that it's worth the complexity. We can rely on palloc()
    to recycle the memory that was freed by the most recent bms_clear().
    
    I say this in part because the complexity of reusing the same
    Bitmapset would be rather high. The idea that the only valid
    representation of an empty set is a NULL pointer is baked into the
    Bitmapset API.
    
    Note also that we'll use much less memory for killedItems by
    representing it as a Bitmapset. We'll use at most one bit per
    so->currPos.items[] item, whereas before we used 4 bytes per item.
    
    > I wonder if the comment “must be at lease one” should apply to the assignment of tupleOffset? The “for” loop starts from nitems-2, will it still must be at lease one item?
    
    By definition, a posting list tuple has at least 2 TIDs -- that's a
    posting list invariant. If there was only 1 TID, then it wouldn't be a
    posting list in the first place. (Unlike within GIN, where single TID
    posting lists are possible.)
    
    
    >                                 /*
    > -                                * Don't bother advancing the outermost loop's int iterator to
    > -                                * avoid processing killed items that relate to the same
    > -                                * offnum/posting list tuple.  This micro-optimization hardly
    > -                                * seems worth it.  (Further iterations of the outermost loop
    > -                                * will fail to match on this same posting list's first heap
    > -                                * TID instead, so we'll advance to the next offnum/index
    > -                                * tuple pretty quickly.)
    > +                                * Don't advance itemIndex for outermost loop, no matter how
    > +                                * nextIndex was advanced.  It's possible that items whose
    > +                                * TIDs weren't matched in posting list can still be killed
    > +                                * (there might be a later tuple whose TID is a match).
    >                                  */
    >                                 if (j == nposting)
    >                                         killtuple = true;
    > ```
    >
    > I really don’t get what "Don't bother advancing the outermost loop's int iterator” means? Here we only set killtuple to true, then if (killtuple && !ItemIdIsDead(iid)), it breaks the inner while loop, in that case, outer while loop "while ((itemIndex = bms_next_member(so->killedItems, itemIndex)) >= 0)” will advance itemIndex.
    
    The old comment should probably have been written more like the new
    one (that I propose to replace it with). It's saying "don't try to be
    clever by remembering that we already determined that all the TIDs
    that we tried to compare to this posting list aren't matches for *any*
    TID". But I don't think that that's accurate; we *haven't* determined
    that those TIDs aren't matches for *any and all* TIDs on the page
    (only for those now in the posting list specifically). We might still
    be able to match those TIDs to later tuples, immediately after the
    posting list.
    
    Note that this is all a bit academic and theoretical; in practice it
    rarely comes up. During so->dropPin scans (which is the common case),
    we'll give up/get scared at the start of _bt_killitems if the page has
    changed at all since it was initially examined within _bt_readpage
    anyway -- no matter how the page was modified. It doesn't matter that
    the page probably *won't* have been modified by VACUUM when
    _bt_kilitems gets scared of modifying the page like this (VACUUM is
    the only thing that truly makes it unsafe for _bt_killitems to run,
    but _bt_killitems is conservative/easily scared).
    
    So while it is true that "We might still be able to match those TIDs
    to later tuples, immediately after the posting list" (as I said in the
    paragraph before the previous paragraph), we can only do so during
    !so->dropPin scans. In other words, only during index-only scans, or
    scans of an unlogged relation, or when we don't use an MVCC snapshot.
    All of which are rare. Which makes all this fairly
    academic/theoretical (mostly it's historical, 10+ years ago *all*
    scans were !so->dropPin scans).
    
    --
    Peter Geoghegan
    
    
    
    
  13. Re: Returning nbtree posting list TIDs in DESC order during backwards scans

    Chao Li <li.evan.chao@gmail.com> — 2025-12-04T03:24:16Z

    
    > On Dec 4, 2025, at 01:31, Peter Geoghegan <pg@bowt.ie> wrote:
    > 
    > 
    > Note also that we'll use much less memory for killedItems by
    > representing it as a Bitmapset. We'll use at most one bit per
    > so->currPos.items[] item, whereas before we used 4 bytes per item.
    > 
    
    That’s true, BitmapSet saves 7/8 of memory usage for killedItems. However, it also brings a little performance burden, because bms_next_member() does O(N) iteration. Say so->curPos.items[] = {0, 1000}, the old code directly gives 0 and 1000 to the “for” loop, but the new code needs to iterate over 999 bits to get next member 1000. Maybe that’s affordable.
    
    Actually MaxTIDsPerBTreePage (max length of so->curPos.items[]) is a value around 1000, in the old code, killedItems could be “short *” instead of “int *”, which may also save a half of memory usage.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
    
    
    
  14. Re: Returning nbtree posting list TIDs in DESC order during backwards scans

    Peter Geoghegan <pg@bowt.ie> — 2025-12-10T18:40:03Z

    On Wed, Dec 3, 2025 at 10:18 AM Victor Yegorov <vyegorov@gmail.com> wrote:
    > Patch looks fine, applies and compiles cleanly, passes tests.
    
    This patch was trickier than initially expected. I paid no attention
    to the possible downside of changing the posting list iteration code
    in _bt_readpage (i.e. from scan posting lists in descending order for
    backwards scans), which was an oversight.
    
    It seems that we're very sensitive to how the compiler allocates
    registers within _bt_readpage, which led to v4 of the patch (and
    presumably all earlier versions) not storing itemIndex in a register.
    With v4, the compiler spilled the itemIndex comparison to the stack
    (at least on my machine, which uses GCC 15.2 from Debian unstable),
    and so had to read itemIndex from memory on each loop iteration. This
    regressed pgbench's SELECT workload by about 1% of total TPS (on my
    Zen 3 CPU).
    
    Attached v5 avoids the regression by tweaking _bt_readpage. I will
    commit this version soon (I really mean it this time!).
    
    I'm not sure why these changes have the intended effect -- I mostly
    figured all this out through trial and error. Though I can say that my
    testing showed that _not_ changing the posting list iteration code in
    the _bt_readpage forwards scan loop makes the crucial difference. The
    other tweaks probably aren't strictly necessary, but they seem to make
    the compiler generate ever so slightly faster code (at least with
    pgbench SELECT), so I kept them in.
    
    I also gave up on the idea of using a bitmapset for v5 -- the issue
    with regressing _bt_readpage discouraged me from adding code that
    allocates memory (more than once per scan) within btgettuple, which is
    a performance-critical path. We can simply sort and unique-ify the
    killedItems array from within _bt_killitems instead -- which is
    largely the same way that my original v1 did it. That way it won't
    matter what order we append to killItems in, relative to the scan
    direction. (The downside of sticking with an array for killedItems is
    that we can still run out of array space in extreme cases involving
    scrollable cursors that return many killable tuples, and repeatedly
    append the same TID to killedItems[], but that doesn't seem like much
    of a loss to me -- that almost never happens in practice.)
    
    > I'd like to point out a missing space after the dot in the 2nd para of the commit message,
    > falls out of style.
    
    Fixed that too.
    
    Thanks
    
    --
    Peter Geoghegan
    
  15. Re: Returning nbtree posting list TIDs in DESC order during backwards scans

    Victor Yegorov <vyegorov@gmail.com> — 2025-12-10T19:41:13Z

    ср, 10 дек. 2025 г. в 21:40, Peter Geoghegan <pg@bowt.ie>:
    
    > Attached v5 avoids the regression by tweaking _bt_readpage. I will
    > commit this version soon (I really mean it this time!).
    >
    > …
    >
    > I also gave up on the idea of using a bitmapset for v5 -- the issue
    > with regressing _bt_readpage discouraged me from adding code that
    > allocates memory (more than once per scan) within btgettuple, which is
    > a performance-critical path. We can simply sort and unique-ify the
    > killedItems array from within _bt_killitems instead -- which is
    > largely the same way that my original v1 did it. That way it won't
    > matter what order we append to killItems in, relative to the scan
    > direction. (The downside of sticking with an array for killedItems is
    > that we can still run out of array space in extreme cases involving
    > scrollable cursors that return many killable tuples, and repeatedly
    > append the same TID to killedItems[], but that doesn't seem like much
    > of a loss to me -- that almost never happens in practice.)
    >
    
    Compiled and tested without issues.
    
    Small note: as you're removing “We rely on the convention that heap TIDs in
    the scanpos
    items array are stored in ascending heap TID order…” part of the comment,
    perhaps it'd be
    good to add smth similar to that to the “Sort and uniqueify
    so->killedItems[] to deal with all this.”
    piece? Smth like:
    
    + * Sort and uniqueify so->killedItems[] to deal with all this,
    + * as TIDs are processed in ascending order.
    
    I feel the need for such a comment from my POV as a code reader.
    
    
    -- 
    Victor Yegorov
    
  16. Re: Returning nbtree posting list TIDs in DESC order during backwards scans

    Peter Geoghegan <pg@bowt.ie> — 2025-12-10T22:33:36Z

    On Wed, Dec 10, 2025 at 2:41 PM Victor Yegorov <vyegorov@gmail.com> wrote:
    > Compiled and tested without issues.
    
    Pushed. Thanks for the review!
    
    > Small note: as you're removing “We rely on the convention that heap TIDs in the scanpos
    > items array are stored in ascending heap TID order…” part of the comment, perhaps it'd be
    > good to add smth similar to that to the “Sort and uniqueify so->killedItems[] to deal with all this.”
    > piece? Smth like:
    >
    > + * Sort and uniqueify so->killedItems[] to deal with all this,
    > + * as TIDs are processed in ascending order.
    >
    > I feel the need for such a comment from my POV as a code reader.
    
    I did have a comment like that at one point, but I felt that it didn't
    quite make sense to keep it. Such a comment would address how things
    used to work, not how they work now (also how they really should have
    worked all along).
    
    With this change in place, we have a strict guarantee that the
    contents of so->currPos.items[] will always be exactly the same when
    scanning a page backwards as when scanning the same page forwards --
    regardless of whether or not posting lists are involved, or any other
    factor. In short, so->currPos.items[] will contain matching items in
    index key space order in all cases. So ISTM that it doesn't make any
    sense to draw attention to posting list TIDs within _bt_killitems. All
    of that is strongly implied by the index key space order thing.
    
    -- 
    Peter Geoghegan
    
    
    
    
  17. Re: Returning nbtree posting list TIDs in DESC order during backwards scans

    Peter Geoghegan <pg@bowt.ie> — 2025-12-11T01:51:24Z

    On Wed, Dec 10, 2025 at 5:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
    > I did have a comment like that at one point, but I felt that it didn't
    > quite make sense to keep it. Such a comment would address how things
    > used to work, not how they work now (also how they really should have
    > worked all along).
    
    On second thought, you (Victor) had this right: we really should have
    such a comment.
    
    I must have forgotten that the loop in _bt_killitems doesn't iterate
    through so->currPos.items[] directly; it iterates through
    killedItems[]. Earlier versions of the patch (that fully got rid of
    killedItems) *directly* looped over so->currPos.items[], but the
    committed version doesn't work that way.
    
    I pushed a commit just now that adds a comment to clarify the
    situation. It specifically mentions posting list tuples, per your
    suggestion. (The commit also adds a documenting assertion to verify
    leaf page order within the _bt_killitems loop.)
    
    Thanks
    -- 
    Peter Geoghegan