Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Document nbtree row comparison design.
- b8f1c62807a5 19 (unreleased) landed
- 6c3b1df878a6 18.1 landed
Make row compares robust during nbtree array scans.
- bd3f59fdb717 19 (unreleased) cited
Fix incorrect handling of NULL index entries in indexed ROW() comparisons.
- a298a1e06fb0 9.6.0 cited

Fully documenting the design of nbtree row comparison scan keys

Peter Geoghegan <pg@bowt.ie> — 2025-10-30T21:34:29Z

nbtree row comparison scan keys consist of one header key (which
appears in so->keyData[]), and 2 or more subkeys (which
_bt_check_rowcompare accesses by following a pointer stored in the
header key). This design makes sense to me, but it's not at all
obvious why the scan keys are structured in this way.

Attached patch adds comments about all this above
_bt_check_rowcompare. It also adds a couple of new documenting
assertions. This clears up why don't we just include the subkeys in
the main so->keyData[] array instead [1], and why it's useful to
sometimes treat a row compare inequality as if it was a similar scalar
inequality (on the most significant row member's index column).

I didn't understand every nuance of row compare inequalities myself
until quite recently. The rules with NULLs are particularly tricky.

It seems worthwhile to clear things up now in large part due to the
recent addition of code in places like _bt_advance_array_keys -- code
that wants to to treat row compare keys as if they were just a simple
scalar inequality on the row compare's most significant column. That
general behavior isn't new (e.g., _bt_first has long ignored row
compare scan key markings when deducing a NOT NULL constraint), but
it's not easy to see why it's correct.

I'd like to commit this on both the master branch and the 18 branch in
the next couple of days. Seems worth keeping them in sync for this.

[1] https://www.postgresql.org/message-id/24134.1137366192%40sss.pgh.pa.us
-- 
Peter Geoghegan

Re: Fully documenting the design of nbtree row comparison scan keys

Chao Li <li.evan.chao@gmail.com> — 2025-10-31T08:56:25Z

Hi Peter,

I am glad to review this patch.

> On Oct 31, 2025, at 05:34, Peter Geoghegan <pg@bowt.ie> wrote:
> 
> nbtree row comparison scan keys consist of one header key (which
> appears in so->keyData[]), and 2 or more subkeys (which
> _bt_check_rowcompare accesses by following a pointer stored in the
> header key). This design makes sense to me, but it's not at all
> obvious why the scan keys are structured in this way.
> 
> Attached patch adds comments about all this above
> _bt_check_rowcompare. It also adds a couple of new documenting
> assertions. This clears up why don't we just include the subkeys in
> the main so->keyData[] array instead [1], and why it's useful to
> sometimes treat a row compare inequality as if it was a similar scalar
> inequality (on the most significant row member's index column).
> 
> I didn't understand every nuance of row compare inequalities myself
> until quite recently. The rules with NULLs are particularly tricky.
> 
> It seems worthwhile to clear things up now in large part due to the
> recent addition of code in places like _bt_advance_array_keys -- code
> that wants to to treat row compare keys as if they were just a simple
> scalar inequality on the row compare's most significant column. That
> general behavior isn't new (e.g., _bt_first has long ignored row
> compare scan key markings when deducing a NOT NULL constraint), but
> it's not easy to see why it's correct.
> 
> I'd like to commit this on both the master branch and the 18 branch in
> the next couple of days. Seems worth keeping them in sync for this.
> 
> [1] https://www.postgresql.org/message-id/24134.1137366192%40sss.pgh.pa.us
> -- 
> Peter Geoghegan
> <v1-0001-Document-nbtree-row-comparison-design.patch>

I spent 2 hours on this patch. Renaming the function parameter “skey" to “header” as well as adding new asserts make things much clearer, which is nice.

For the function comment added to _bt_check_rowcompare(), I really have a mixed feeling. Personally, I feel that the new comment is even harder to understand than reading the code directly.

```
+ * This is a subroutine for _bt_checkkeys/_bt_check_compare.  Caller passes us
+ * a row compare header key taken from so->keyData[].
+ *
+ * The SQL standard describes row value comparisons in terms of logical
+ * expansions that only use scalar operators.  Consider the following example
+ * row comparison:
+ *
+ * "(a, b, c) > (7, 'bar', 77)"
+ *
+ * This is logically/semantically equivalent to:
+ *
+ * "(a = 7 AND b = 'bar' AND c > 77) OR (a = 7 AND b > 'bar') OR (a > 7)".
+ *
+ * Notice that this condition is satisfied by _all_ rows that satisfy "a > 7",
+ * and by a subset of all rows that satisfy "a >= 7" (possibly all such rows).
+ * It _can't_ be satisfied by other rows (where "a < 7" or where "a IS NULL").
+ * A row comparison header key can therefore often be treated as if it was a
+ * simple scalar inequality on the most significant row member's index column.
+ *
```

So far so good. Clearly explained row comparison.

```
+ * Things get more complicated for our row compare with rows where "a = 7".
+ * Note that a row comparison isn't necessarily satisfied by _every_ row that
+ * appears after the first satisfied/matching row.  A forwards scan that uses
+ * our example qual might first return a row "(a, b, c) = (7, 'zebra', 54)".
+ * But it must not return a row "(a, b, c) = (7, NULL, 1)" that'll appear to
+ * the right of the first match (assumes that "b" was declared NULLS LAST).
+ * The scan only returns additional matches upon reaching rows where "a > 7".
+ * If you rereview our example row comparison's logical expansion, you'll
+ * understand why this is so.
+ *
```

This paragraph is actually describing how index data are stored, (in which order index data are loaded,)  but I think that doesn’t matter to this function. This function just takes a header of ScanKey and an IndexTuple as inputs and compare them. Understanding row comparison seems enough to understand this function.

```
+ * Note that a row comparison key behaves _exactly_ the same as an equivalent
+ * scalar inequality key on its most significant column once the scan reaches
+ * the point where it no longer needs to consider any of its lower-order keys.
+ * For example, once a forwards scan that uses our example qual reaches the
+ * first tuple "a > 7", we'll behave in precisely the same way as our caller
+ * would behave with a scalar inequality "a > 7" for the remainder of the scan
+ * (assuming that the scan never changes direction/never goes backwards).
+ * This includes setting continuescan=false based on a deduced NOT NULL
+ * constraint according to the same rules that our caller applies when a NULL
+ * tuple value fails to satisfy a scalar inequality that's marked required in
+ * the opposite scan direction.
```

_bt_check_rowcompare() actually just does a recursive-like comparison. To evaluate (a1, a2, … an) > (b1, b2 …, bn):

    • Compare the first column.
    • If they differ, that decides.
    • If they’re equal, move to next column and compare the rest.
    • Any NULL comparison yields “unknown”, then the whole thing is false.

The corresponding code for this process is quite short. The big portion of the function are handling the situations when column data is NULL and when const data is NULL, where there are inline comments to describe the behaviors.

That’s just my personal feeling, others may think differently. I am sorry if you are unhappy with my comment.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Fully documenting the design of nbtree row comparison scan keys

Victor Yegorov <vyegorov@gmail.com> — 2025-10-31T09:06:45Z

пт, 31 окт. 2025 г. в 00:35, Peter Geoghegan <pg@bowt.ie>:

> I didn't understand every nuance of row compare inequalities myself
> until quite recently. The rules with NULLs are particularly tricky.
>
> It seems worthwhile to clear things up now in large part due to the
> recent addition of code in places like _bt_advance_array_keys -- code
> that wants to to treat row compare keys as if they were just a simple
> scalar inequality on the row compare's most significant column. That
> general behavior isn't new (e.g., _bt_first has long ignored row
> compare scan key markings when deducing a NOT NULL constraint), but
> it's not easy to see why it's correct.
>

Greetings.

I took a look at the patch. Proposed comments look highly valuable,
especially around NULLs, doesn't look immediately obvious, so
definitely requires a comment.
Looks good to commit.

Wouldn't it be good to add such information also into the user
documentation, say into
https://www.postgresql.org/docs/current/functions-comparisons.html#ROW-WISE-COMPARISON
?

-- 
Victor Yegorov

Re: Fully documenting the design of nbtree row comparison scan keys

Peter Geoghegan <pg@bowt.ie> — 2025-10-31T16:15:28Z

On Fri, Oct 31, 2025 at 4:57 AM Chao Li <li.evan.chao@gmail.com> wrote:
> Personally, I feel that the new comment is even harder to understand than reading the code directly.

I don't disagree. But that's not the goal that I have in mind.

My goal is to make it clear when it is okay to treat row comparisons
just like scalar inequalities (on the same column as a row compare's
leading column), and when it is not okay. This matters to code in
places like _bt_advance_array_keys (which knows about inequalities but
not about row compare inequalities specifically), and in places like
_bt_set_startikey (which has to know about the difference between row
compares and simple scalar inequalities).

> ```
> + * Things get more complicated for our row compare with rows where "a = 7".
> + * Note that a row comparison isn't necessarily satisfied by _every_ row that
> + * appears after the first satisfied/matching row.  A forwards scan that uses
> + * our example qual might first return a row "(a, b, c) = (7, 'zebra', 54)".
> + * But it must not return a row "(a, b, c) = (7, NULL, 1)" that'll appear to
> + * the right of the first match (assumes that "b" was declared NULLS LAST).
> + * The scan only returns additional matches upon reaching rows where "a > 7".
> + * If you rereview our example row comparison's logical expansion, you'll
> + * understand why this is so.
> + *
> ```
>
> This paragraph is actually describing how index data are stored, (in which order index data are loaded,)  but I think that doesn’t matter to this function.

The order in which data is stored is obviously relevant. The main use
case for row compares is to implement keyset pagination, where each
scan returns only a subset of index tuples in index order. The next
scan starts to return tuples precisely after the end of matches from
the previous scan. This has to work in lexicographic/index order to
reliably avoid returning the same row twice (across each of the scans
performed by the application). That's the whole point.

In general, with both row compares and simple scalar inequalities, we
can end an index scan upon reaching a NULL tuple value that doesn't
satisfy a row compare qual according to NULL semantics (the result is
UNKNOWN, not false). This is safe only when we know that NULL is the
last value stored in the index column -- NULL doesn't satisfy the
inequality, *and* there can be no later value after NULL that will
satisfy the inequality for the entire qual, so we can just end the
scan right away. The nbtree code is directly exploiting what it knows
about the order that the data is stored (and NULL semantics).

> The corresponding code for this process is quite short. The big portion of the function are handling the situations when column data is NULL and when const data is NULL, where there are inline comments to describe the behaviors.

If it's so obvious, then why did 2016 bugfix commit a298a1e0 get it
wrong? It's easy to get confused about why and how these rules apply,
and the implications. See the last paragraph of my commit bd3f59fd for
full details.

It's not obvious that NULLs in lower-order index columns will make a >
row compare qual not return any matches at the very start of a
forwards scan (or at the very end of a backwards scan), without
affecting the results for the rest of the scan in any way. And so the
recently added row compare code in _bt_set_startikey must be extra
careful: it cannot just assume that every tuple on the page must
satisfy a row compare qual just because both the first and last index
tuple on the page satisfy the row compare qual. The funny rules about
NULLs mean that there could be a group of index tuples in the middle
of the page that don't satisfy the row compare (in spite of the first
and last tuples doing so). This is very much unlike simple scalar
inequalities, which will always return *all* tuples in the index after
the first matching tuple up to and including the last matching tuple,
without any "gaps" where index tuples are examined but not returned by
the scan (assuming that the inequality key is marked required to
continue the scan, which is typical).

-- 
Peter Geoghegan

Re: Fully documenting the design of nbtree row comparison scan keys

Peter Geoghegan <pg@bowt.ie> — 2025-10-31T17:40:40Z

On Fri, Oct 31, 2025 at 5:06 AM Victor Yegorov <vyegorov@gmail.com> wrote:
> I took a look at the patch. Proposed comments look highly valuable, especially around NULLs, doesn't look immediately obvious, so definitely requires a comment.
> Looks good to commit.

Cool.

> Wouldn't it be good to add such information also into the user documentation, say into
> https://www.postgresql.org/docs/current/functions-comparisons.html#ROW-WISE-COMPARISON

I wasn't thinking about the user-facing documentation. But I think
that you make a good point.

The main problem with that documentation of row compares is that we
don't say anything about keyset pagination with a multicolumn index
(not anywhere). That is almost the entire reason why this feature
exists -- but we don't actually say anything about it to users. No
wonder the feature is underused.

Separately, there is a risk that NULLs will break applications that
implement keyset pagination. Attached test case shows what I mean by
this.

To summarize the test case:

The test case shows that QUERY 1 returns slightly more rows than QUERY
2 due to the presence of NULLs in lower-order index columns. This
seems surprising. (The underlying issue is more or less the same issue
that makes row comparisons confusing to the implementation, especially
in places like _bt_set_startikey).

A user might expect that it won't matter how many or how few "keyset
pages"/row compare queries were used -- one big query (no pagination)
should give the same answer as (say) 4 "keyset pagination" queries
(just in smaller pieces/sets of rows). But as the test case shows,
when there are NULLs in lower-order index columns, that isn't
necessarily true. QUERY 1 and QUERY 2 are only guaranteed to return
precisely the same rows if we somehow make sure that there can be no
rows returned with NULLs in lower order columns (e.g., by using a
primary key for keyset pagination queries, or by always adding IS NOT
NULL conditions against lower-order columns like "b" and "c").

-- 
Peter Geoghegan

Re: Fully documenting the design of nbtree row comparison scan keys

Peter Geoghegan <pg@bowt.ie> — 2025-11-02T20:29:48Z

On Fri, Oct 31, 2025 at 1:40 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Oct 31, 2025 at 5:06 AM Victor Yegorov <vyegorov@gmail.com> wrote:
> > I took a look at the patch. Proposed comments look highly valuable, especially around NULLs, doesn't look immediately obvious, so definitely requires a comment.
> > Looks good to commit.
>
> Cool.

Pushed a cleaned up version of the patch just now.

Thanks
-- 
Peter Geoghegan