Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Document nbtree row comparison design.

  2. Make row compares robust during nbtree array scans.

  3. Fix incorrect handling of NULL index entries in indexed ROW() comparisons.

  1. Fully documenting the design of nbtree row comparison scan keys

    Peter Geoghegan <pg@bowt.ie> — 2025-10-30T21:34:29Z

    nbtree row comparison scan keys consist of one header key (which
    appears in so->keyData[]), and 2 or more subkeys (which
    _bt_check_rowcompare accesses by following a pointer stored in the
    header key). This design makes sense to me, but it's not at all
    obvious why the scan keys are structured in this way.
    
    Attached patch adds comments about all this above
    _bt_check_rowcompare. It also adds a couple of new documenting
    assertions. This clears up why don't we just include the subkeys in
    the main so->keyData[] array instead [1], and why it's useful to
    sometimes treat a row compare inequality as if it was a similar scalar
    inequality (on the most significant row member's index column).
    
    I didn't understand every nuance of row compare inequalities myself
    until quite recently. The rules with NULLs are particularly tricky.
    
    It seems worthwhile to clear things up now in large part due to the
    recent addition of code in places like _bt_advance_array_keys -- code
    that wants to to treat row compare keys as if they were just a simple
    scalar inequality on the row compare's most significant column. That
    general behavior isn't new (e.g., _bt_first has long ignored row
    compare scan key markings when deducing a NOT NULL constraint), but
    it's not easy to see why it's correct.
    
    I'd like to commit this on both the master branch and the 18 branch in
    the next couple of days. Seems worth keeping them in sync for this.
    
    [1] https://www.postgresql.org/message-id/24134.1137366192%40sss.pgh.pa.us
    -- 
    Peter Geoghegan
    
  2. Re: Fully documenting the design of nbtree row comparison scan keys

    Chao Li <li.evan.chao@gmail.com> — 2025-10-31T08:56:25Z

    Hi Peter,
    
    I am glad to review this patch.
    
    > On Oct 31, 2025, at 05:34, Peter Geoghegan <pg@bowt.ie> wrote:
    > 
    > nbtree row comparison scan keys consist of one header key (which
    > appears in so->keyData[]), and 2 or more subkeys (which
    > _bt_check_rowcompare accesses by following a pointer stored in the
    > header key). This design makes sense to me, but it's not at all
    > obvious why the scan keys are structured in this way.
    > 
    > Attached patch adds comments about all this above
    > _bt_check_rowcompare. It also adds a couple of new documenting
    > assertions. This clears up why don't we just include the subkeys in
    > the main so->keyData[] array instead [1], and why it's useful to
    > sometimes treat a row compare inequality as if it was a similar scalar
    > inequality (on the most significant row member's index column).
    > 
    > I didn't understand every nuance of row compare inequalities myself
    > until quite recently. The rules with NULLs are particularly tricky.
    > 
    > It seems worthwhile to clear things up now in large part due to the
    > recent addition of code in places like _bt_advance_array_keys -- code
    > that wants to to treat row compare keys as if they were just a simple
    > scalar inequality on the row compare's most significant column. That
    > general behavior isn't new (e.g., _bt_first has long ignored row
    > compare scan key markings when deducing a NOT NULL constraint), but
    > it's not easy to see why it's correct.
    > 
    > I'd like to commit this on both the master branch and the 18 branch in
    > the next couple of days. Seems worth keeping them in sync for this.
    > 
    > [1] https://www.postgresql.org/message-id/24134.1137366192%40sss.pgh.pa.us
    > -- 
    > Peter Geoghegan
    > <v1-0001-Document-nbtree-row-comparison-design.patch>
    
    I spent 2 hours on this patch. Renaming the function parameter “skey" to “header” as well as adding new asserts make things much clearer, which is nice.
    
    For the function comment added to _bt_check_rowcompare(), I really have a mixed feeling. Personally, I feel that the new comment is even harder to understand than reading the code directly.
    
    ```
    + * This is a subroutine for _bt_checkkeys/_bt_check_compare.  Caller passes us
    + * a row compare header key taken from so->keyData[].
    + *
    + * The SQL standard describes row value comparisons in terms of logical
    + * expansions that only use scalar operators.  Consider the following example
    + * row comparison:
    + *
    + * "(a, b, c) > (7, 'bar', 77)"
    + *
    + * This is logically/semantically equivalent to:
    + *
    + * "(a = 7 AND b = 'bar' AND c > 77) OR (a = 7 AND b > 'bar') OR (a > 7)".
    + *
    + * Notice that this condition is satisfied by _all_ rows that satisfy "a > 7",
    + * and by a subset of all rows that satisfy "a >= 7" (possibly all such rows).
    + * It _can't_ be satisfied by other rows (where "a < 7" or where "a IS NULL").
    + * A row comparison header key can therefore often be treated as if it was a
    + * simple scalar inequality on the most significant row member's index column.
    + *
    ```
    
    So far so good. Clearly explained row comparison.
    
    ```
    + * Things get more complicated for our row compare with rows where "a = 7".
    + * Note that a row comparison isn't necessarily satisfied by _every_ row that
    + * appears after the first satisfied/matching row.  A forwards scan that uses
    + * our example qual might first return a row "(a, b, c) = (7, 'zebra', 54)".
    + * But it must not return a row "(a, b, c) = (7, NULL, 1)" that'll appear to
    + * the right of the first match (assumes that "b" was declared NULLS LAST).
    + * The scan only returns additional matches upon reaching rows where "a > 7".
    + * If you rereview our example row comparison's logical expansion, you'll
    + * understand why this is so.
    + *
    ```
    
    This paragraph is actually describing how index data are stored, (in which order index data are loaded,)  but I think that doesn’t matter to this function. This function just takes a header of ScanKey and an IndexTuple as inputs and compare them. Understanding row comparison seems enough to understand this function.
    
    ```
    + * Note that a row comparison key behaves _exactly_ the same as an equivalent
    + * scalar inequality key on its most significant column once the scan reaches
    + * the point where it no longer needs to consider any of its lower-order keys.
    + * For example, once a forwards scan that uses our example qual reaches the
    + * first tuple "a > 7", we'll behave in precisely the same way as our caller
    + * would behave with a scalar inequality "a > 7" for the remainder of the scan
    + * (assuming that the scan never changes direction/never goes backwards).
    + * This includes setting continuescan=false based on a deduced NOT NULL
    + * constraint according to the same rules that our caller applies when a NULL
    + * tuple value fails to satisfy a scalar inequality that's marked required in
    + * the opposite scan direction.
    ```
    
    _bt_check_rowcompare() actually just does a recursive-like comparison. To evaluate (a1, a2, … an) > (b1, b2 …, bn):
    
        • Compare the first column.
        • If they differ, that decides.
        • If they’re equal, move to next column and compare the rest.
        • Any NULL comparison yields “unknown”, then the whole thing is false.
    
    The corresponding code for this process is quite short. The big portion of the function are handling the situations when column data is NULL and when const data is NULL, where there are inline comments to describe the behaviors.
    
    That’s just my personal feeling, others may think differently. I am sorry if you are unhappy with my comment.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
    
    
    
  3. Re: Fully documenting the design of nbtree row comparison scan keys

    Victor Yegorov <vyegorov@gmail.com> — 2025-10-31T09:06:45Z

    пт, 31 окт. 2025 г. в 00:35, Peter Geoghegan <pg@bowt.ie>:
    
    > I didn't understand every nuance of row compare inequalities myself
    > until quite recently. The rules with NULLs are particularly tricky.
    >
    > It seems worthwhile to clear things up now in large part due to the
    > recent addition of code in places like _bt_advance_array_keys -- code
    > that wants to to treat row compare keys as if they were just a simple
    > scalar inequality on the row compare's most significant column. That
    > general behavior isn't new (e.g., _bt_first has long ignored row
    > compare scan key markings when deducing a NOT NULL constraint), but
    > it's not easy to see why it's correct.
    >
    
    Greetings.
    
    I took a look at the patch. Proposed comments look highly valuable,
    especially around NULLs, doesn't look immediately obvious, so
    definitely requires a comment.
    Looks good to commit.
    
    Wouldn't it be good to add such information also into the user
    documentation, say into
    https://www.postgresql.org/docs/current/functions-comparisons.html#ROW-WISE-COMPARISON
    ?
    
    -- 
    Victor Yegorov
    
  4. Re: Fully documenting the design of nbtree row comparison scan keys

    Peter Geoghegan <pg@bowt.ie> — 2025-10-31T16:15:28Z

    On Fri, Oct 31, 2025 at 4:57 AM Chao Li <li.evan.chao@gmail.com> wrote:
    > Personally, I feel that the new comment is even harder to understand than reading the code directly.
    
    I don't disagree. But that's not the goal that I have in mind.
    
    My goal is to make it clear when it is okay to treat row comparisons
    just like scalar inequalities (on the same column as a row compare's
    leading column), and when it is not okay. This matters to code in
    places like _bt_advance_array_keys (which knows about inequalities but
    not about row compare inequalities specifically), and in places like
    _bt_set_startikey (which has to know about the difference between row
    compares and simple scalar inequalities).
    
    > ```
    > + * Things get more complicated for our row compare with rows where "a = 7".
    > + * Note that a row comparison isn't necessarily satisfied by _every_ row that
    > + * appears after the first satisfied/matching row.  A forwards scan that uses
    > + * our example qual might first return a row "(a, b, c) = (7, 'zebra', 54)".
    > + * But it must not return a row "(a, b, c) = (7, NULL, 1)" that'll appear to
    > + * the right of the first match (assumes that "b" was declared NULLS LAST).
    > + * The scan only returns additional matches upon reaching rows where "a > 7".
    > + * If you rereview our example row comparison's logical expansion, you'll
    > + * understand why this is so.
    > + *
    > ```
    >
    > This paragraph is actually describing how index data are stored, (in which order index data are loaded,)  but I think that doesn’t matter to this function.
    
    The order in which data is stored is obviously relevant. The main use
    case for row compares is to implement keyset pagination, where each
    scan returns only a subset of index tuples in index order. The next
    scan starts to return tuples precisely after the end of matches from
    the previous scan. This has to work in lexicographic/index order to
    reliably avoid returning the same row twice (across each of the scans
    performed by the application). That's the whole point.
    
    In general, with both row compares and simple scalar inequalities, we
    can end an index scan upon reaching a NULL tuple value that doesn't
    satisfy a row compare qual according to NULL semantics (the result is
    UNKNOWN, not false). This is safe only when we know that NULL is the
    last value stored in the index column -- NULL doesn't satisfy the
    inequality, *and* there can be no later value after NULL that will
    satisfy the inequality for the entire qual, so we can just end the
    scan right away. The nbtree code is directly exploiting what it knows
    about the order that the data is stored (and NULL semantics).
    
    > The corresponding code for this process is quite short. The big portion of the function are handling the situations when column data is NULL and when const data is NULL, where there are inline comments to describe the behaviors.
    
    If it's so obvious, then why did 2016 bugfix commit a298a1e0 get it
    wrong? It's easy to get confused about why and how these rules apply,
    and the implications. See the last paragraph of my commit bd3f59fd for
    full details.
    
    It's not obvious that NULLs in lower-order index columns will make a >
    row compare qual not return any matches at the very start of a
    forwards scan (or at the very end of a backwards scan), without
    affecting the results for the rest of the scan in any way. And so the
    recently added row compare code in _bt_set_startikey must be extra
    careful: it cannot just assume that every tuple on the page must
    satisfy a row compare qual just because both the first and last index
    tuple on the page satisfy the row compare qual. The funny rules about
    NULLs mean that there could be a group of index tuples in the middle
    of the page that don't satisfy the row compare (in spite of the first
    and last tuples doing so). This is very much unlike simple scalar
    inequalities, which will always return *all* tuples in the index after
    the first matching tuple up to and including the last matching tuple,
    without any "gaps" where index tuples are examined but not returned by
    the scan (assuming that the inequality key is marked required to
    continue the scan, which is typical).
    
    
    -- 
    Peter Geoghegan
    
    
    
    
  5. Re: Fully documenting the design of nbtree row comparison scan keys

    Peter Geoghegan <pg@bowt.ie> — 2025-10-31T17:40:40Z

    On Fri, Oct 31, 2025 at 5:06 AM Victor Yegorov <vyegorov@gmail.com> wrote:
    > I took a look at the patch. Proposed comments look highly valuable, especially around NULLs, doesn't look immediately obvious, so definitely requires a comment.
    > Looks good to commit.
    
    Cool.
    
    > Wouldn't it be good to add such information also into the user documentation, say into
    > https://www.postgresql.org/docs/current/functions-comparisons.html#ROW-WISE-COMPARISON
    
    I wasn't thinking about the user-facing documentation. But I think
    that you make a good point.
    
    The main problem with that documentation of row compares is that we
    don't say anything about keyset pagination with a multicolumn index
    (not anywhere). That is almost the entire reason why this feature
    exists -- but we don't actually say anything about it to users. No
    wonder the feature is underused.
    
    Separately, there is a risk that NULLs will break applications that
    implement keyset pagination. Attached test case shows what I mean by
    this.
    
    To summarize the test case:
    
    The test case shows that QUERY 1 returns slightly more rows than QUERY
    2 due to the presence of NULLs in lower-order index columns. This
    seems surprising. (The underlying issue is more or less the same issue
    that makes row comparisons confusing to the implementation, especially
    in places like _bt_set_startikey).
    
    A user might expect that it won't matter how many or how few "keyset
    pages"/row compare queries were used -- one big query (no pagination)
    should give the same answer as (say) 4 "keyset pagination" queries
    (just in smaller pieces/sets of rows). But as the test case shows,
    when there are NULLs in lower-order index columns, that isn't
    necessarily true. QUERY 1 and QUERY 2 are only guaranteed to return
    precisely the same rows if we somehow make sure that there can be no
    rows returned with NULLs in lower order columns (e.g., by using a
    primary key for keyset pagination queries, or by always adding IS NOT
    NULL conditions against lower-order columns like "b" and "c").
    
    -- 
    Peter Geoghegan
    
  6. Re: Fully documenting the design of nbtree row comparison scan keys

    Peter Geoghegan <pg@bowt.ie> — 2025-11-02T20:29:48Z

    On Fri, Oct 31, 2025 at 1:40 PM Peter Geoghegan <pg@bowt.ie> wrote:
    > On Fri, Oct 31, 2025 at 5:06 AM Victor Yegorov <vyegorov@gmail.com> wrote:
    > > I took a look at the patch. Proposed comments look highly valuable, especially around NULLs, doesn't look immediately obvious, so definitely requires a comment.
    > > Looks good to commit.
    >
    > Cool.
    
    Pushed a cleaned up version of the patch just now.
    
    Thanks
    -- 
    Peter Geoghegan