Thread

  1. Add a berief general comment on BTScanInsertData's nextkey and backward

    Yugo Nagata <nagata@sraoss.co.jp> — 2025-11-18T07:28:01Z

    Hi,
    
    While reading the nbtree codes, I noticed that the comments on
    BTScanInsertData no longer describes the meaning of the nextkey and
    backward fields. The comment curently only says:
    
     * See comments in _bt_first for an explanation of the nextkey and backward
     * fields.
    
    Detailed comments used to exit here, but they were removed by
    c9c0589fda0e, I guess, because the semantic changed when
    the optimazation for backward scans was introduced. However, having
    a brief, general description here is still useful for readers.
    
    I've attached a patch that adds the following comment:
    
    + * nextkey determines how the scankey's boundary is interpreted, and backward
    + * indicates a backward scan.  See comments in _bt_first for a more detailed
    + * explanation of these fields.
    
    What do think?
    
    Regards,
    Yugo Nagata
    
    -- 
    Yugo Nagata <nagata@sraoss.co.jp>
    
  2. Re: Add a berief general comment on BTScanInsertData's nextkey and backward

    Peter Geoghegan <pg@bowt.ie> — 2025-11-19T01:15:40Z

    On Tue, Nov 18, 2025 at 2:28 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
    > I've attached a patch that adds the following comment:
    >
    > + * nextkey determines how the scankey's boundary is interpreted, and backward
    > + * indicates a backward scan.  See comments in _bt_first for a more detailed
    > + * explanation of these fields.
    >
    > What do think?
    
    Seems reasonable to me.
    
    I wonder if we also do something about these existing _bt_binsrch comments:
    
     * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
     * of the last key < given scankey, or last key <= given scankey if nextkey
     * is true.  (Since _bt_compare treats the first data key of such a page as
     * minus infinity, there will be at least one key < scankey, so the result
     * always points at one of the keys on the page.)
    
    Here we describe what happens on an internal page. This is correct,
    but *seems* to contradict the higher level comments at the end of
    _bt_first. There is actually no contradiction between the two -- not
    when you understand that _bt_first describes the whole end-to-end
    process of calling _bt_search and then calling _bt_binsrch on the leaf
    page (not of calling _bt_binsrch once, against an internal page).
    
    One could think of this _bt_binsrch internal page behavior as
    compensating for the fact that internal pages have pivot tuples that
    consist of a separator key (except for the first key, which is just
    has a -inf key/no key), plus a downlink that points to the *next* page
    after that separator key one level down (except for the internal page
    high key, which has no downlink at all). Might be useful to say
    something like that instead.
    
    This is hard to explain without an example. Right now, an internal
    page might have pivot tuples that look like this:
    
    Separator: -inf, Downlink: 1
    Separator: 'a', Downlink: 2
    Separator: 'c', Downlink: 3
    Separator: 'f', Downlink: (none, this is the high key)
    
    But _bt_binsrch "pretends" that our internal page actually contains:
    
    Downlink: 1
    Separator: 'a'
    Downlink: 2
    Separator: 'c'
    Downlink: 3
    Separator: 'f'
    
    So if our = scan key is (say) 'c', we should descend using the
    downlink that points to block 2 (not the one that points to block 3,
    as might be expected from looking at the real physical representation
    consisting of pivot tuples). That is what the scan needs to do to get
    to the page one level down whose high key is also 'c' -- that's where
    the scan needs to look. (If the next level down is the leaf level,
    then the leaf page we descend to might also contain a *non*-pivot
    tuple with the key value 'c', "right before" the high key with 'c',
    which the scan will need to return in _bt_readpage. Lehman & Yao allow
    the key before a leaf page's high key to be equal to the high key, but
    otherwise forbid all duplicate keys.)
    
    I find it very hard to know what explanation will be the least
    confusing to other people, at least in this area. Since you're
    interested in this area, I thought I'd ask what you think.
    
    (Here I'm pretending that the keys in the tree are unique without
    needing a heap TID, per classic L&Y, which is unrealistic but
    simplifies the example.)
    
    --
    Peter Geoghegan
    
    
    
    
  3. Re: Add a berief general comment on BTScanInsertData's nextkey and backward

    Yugo Nagata <nagata@sraoss.co.jp> — 2025-12-31T10:21:24Z

    On Tue, 18 Nov 2025 20:15:40 -0500
    Peter Geoghegan <pg@bowt.ie> wrote:
    
    > On Tue, Nov 18, 2025 at 2:28 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
    > > I've attached a patch that adds the following comment:
    > >
    > > + * nextkey determines how the scankey's boundary is interpreted, and backward
    > > + * indicates a backward scan.  See comments in _bt_first for a more detailed
    > > + * explanation of these fields.
    > >
    > > What do think?
    > 
    > Seems reasonable to me.
    
    Thank you for your review.
    
    > 
    > I wonder if we also do something about these existing _bt_binsrch comments:
    > 
    >  * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
    >  * of the last key < given scankey, or last key <= given scankey if nextkey
    >  * is true.  (Since _bt_compare treats the first data key of such a page as
    >  * minus infinity, there will be at least one key < scankey, so the result
    >  * always points at one of the keys on the page.)
    > 
    > Here we describe what happens on an internal page. This is correct,
    > but *seems* to contradict the higher level comments at the end of
    > _bt_first. There is actually no contradiction between the two -- not
    > when you understand that _bt_first describes the whole end-to-end
    > process of calling _bt_search and then calling _bt_binsrch on the leaf
    > page (not of calling _bt_binsrch once, against an internal page).
    > 
    > One could think of this _bt_binsrch internal page behavior as
    > compensating for the fact that internal pages have pivot tuples that
    > consist of a separator key (except for the first key, which is just
    > has a -inf key/no key), plus a downlink that points to the *next* page
    > after that separator key one level down (except for the internal page
    > high key, which has no downlink at all). Might be useful to say
    > something like that instead.
    > 
    > This is hard to explain without an example. Right now, an internal
    > page might have pivot tuples that look like this:
    > 
    > Separator: -inf, Downlink: 1
    > Separator: 'a', Downlink: 2
    > Separator: 'c', Downlink: 3
    > Separator: 'f', Downlink: (none, this is the high key)
    > 
    > But _bt_binsrch "pretends" that our internal page actually contains:
    > 
    > Downlink: 1
    > Separator: 'a'
    > Downlink: 2
    > Separator: 'c'
    > Downlink: 3
    > Separator: 'f'
    > 
    > So if our = scan key is (say) 'c', we should descend using the
    > downlink that points to block 2 (not the one that points to block 3,
    > as might be expected from looking at the real physical representation
    > consisting of pivot tuples). That is what the scan needs to do to get
    > to the page one level down whose high key is also 'c' -- that's where
    > the scan needs to look. (If the next level down is the leaf level,
    > then the leaf page we descend to might also contain a *non*-pivot
    > tuple with the key value 'c', "right before" the high key with 'c',
    > which the scan will need to return in _bt_readpage. Lehman & Yao allow
    > the key before a leaf page's high key to be equal to the high key, but
    > otherwise forbid all duplicate keys.)
    > 
    > I find it very hard to know what explanation will be the least
    > confusing to other people, at least in this area. Since you're
    > interested in this area, I thought I'd ask what you think.
    
    I do not see any contradiction between the comment on _bt_binsrch and
    the comments at the end of _bt_first. The former explicitly states that
    it refers to internal (non-leaf) pages, and I understand the latter to
    describe loading data from a leaf page.
    
    That said, we could make this clearer by explicitly mentioning the leaf
    page in the first sentence, for example:
    
          * Now load data from the first leaf page of the scan (usually the
          *page currently in so->currPos.buf).
    
    Regards,
    Yugo Nagata
    
    -- 
    Yugo Nagata <nagata@sraoss.co.jp>