Re: Adding skip scan (including MDAM style range skip scan) to nbtree

Peter Geoghegan <pg@bowt.ie>

From: Peter Geoghegan <pg@bowt.ie>

To: Matthias van de Meent <boekewurm+postgres@gmail.com>

Cc: Heikki Linnakangas <hlinnaka@iki.fi>, Masahiro Ikeda <ikedamsh@oss.nttdata.com>, Tomas Vondra <tomas@vondra.me>, Masahiro.Ikeda@nttdata.com, pgsql-hackers@lists.postgresql.org, Masao.Fujii@nttdata.com

Date: 2025-04-02T23:15:33Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

nbtree: Always set skipScan flag on rescan.
- 454c046094ab 19 (unreleased) landed
- bee763aea13f 18.0 landed
meson: Build numeric.c with -ftree-vectorize.
- 9016fa7e3bcd 19 (unreleased) cited
Fix "variable not found in subplan target lists" in semijoin de-duplication.
- b8a1bdc458e3 19 (unreleased) cited
Revert "nbtree: Remove useless row compare arg."
- dd2ce3792754 18.0 landed
nbtree: Remove useless row compare arg.
- 54c6ea8c81db 18.0 cited
Prevent premature nbtree array advancement.
- 5f4d98d4f371 18.0 landed
nbtree: tighten up array recheck rules.
- 7e25c9363a82 18.0 landed
Avoid treating nonrequired nbtree keys as required.
- 0f08df406822 18.0 landed
Adjust overstrong nbtree skip array assertion.
- 9d924dbb3710 18.0 landed
Make NULL tuple values always advance skip arrays.
- b75fedcab791 18.0 cited
Avoid extra index searches through preprocessing.
- b3f1a13f22f9 18.0 landed
Improve nbtree skip scan primitive scan scheduling.
- 21a152b37f36 18.0 landed
Further optimize nbtree search scan key comparisons.
- 8a510275dd6b 18.0 landed
Add nbtree skip scan optimization.
- 92fe23d93aa3 18.0 landed
Improve nbtree array primitive scan scheduling.
- 9a2e2a285a14 18.0 landed
nbtree: Make BTMaxItemSize into object-like macro.
- 426ea611171d 18.0 landed
Show index search count in EXPLAIN ANALYZE, take 2.
- 0fbceae841cb 18.0 landed
Make parallel nbtree index scans use an LWLock.
- 67fc4c9fd7fa 18.0 landed
Show index search count in EXPLAIN ANALYZE.
- 5ead85fbc811 18.0 landed
Avoid nbtree parallel scan currPos confusion.
- b5ee4e52026b 18.0 cited
nbtree: Remove useless 'strat' local variable.
- b6558e4f837e 18.0 landed
Normalize nbtree truncated high key array behavior.
- 79fa7b3b1a44 18.0 landed
Refactor handling of nbtree array redundancies.
- b524974106ac 18.0 landed
Fix nbtree pgstats accounting with parallel scans.
- c00c54a9ac1e 18.0 landed
- fb4f5e58af97 17.0 landed
Avoid parallel nbtree index scan hangs with SAOPs.
- d8adfc18bebf 18.0 landed
- a24bffc021d9 17.0 landed
Show Parallel Bitmap Heap Scan worker stats in EXPLAIN ANALYZE
- 5a1e6df3b84c 18.0 cited
Enhance nbtree ScalarArrayOp execution.
- 5bf748b86bc6 17.0 cited
Skip checking of scan keys required for directional scan in B-tree
- e0b1ee17dc3a 17.0 cited
Instead of using a numberOfRequiredKeys count to distinguish required
- 7ccaf13a06b8 8.2.0 cited

On Tue, Apr 1, 2025 at 4:16 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> While I agree that there is such a cost, I don't think that this is
> too far fetched. They are not just added when we have SAOP scankeys,
> and I think the non-SAOP case is one of the most important areas where
> this patch improves performance. Yes, this patch improves performance
> for queries with only SAOP on non-first keys, but I've seen more
> non-SAOP queries where this patch would improve performance than
> similar queries but with SAOP.

That's all likely to be true. I just don't think that the commit
message for the big commit (the part that you took issue with) said
anything that suggests otherwise.

To recap, the sentence in question says "Preprocessing can freely add
a skip array before or after any input ScalarArrayOp arrays". This is
mostly just making a high level point about the design itself -- so I
just don't get what you mean.

The "everything is an array" design is what allows skip arrays to work
with a qual like "WHERE b IN (1, 2, 3)", as you say. It's also what
allows things like "WHERE a IN (100, 500) AND c = 55" to work
efficiently, without introducing any special cases -- it works both
ways. And, a pair of skip arrays can also be composed together, in
just the same way as a pair of SAOP arrays. This all works in the same
way; _bt_advance_array_keys concepts like array roll over continue to
apply, with essentially zero changes to the high level design,
relative to Postgres 17. That's the core idea that the paragraph in
question conveys.

Recall that _bt_advance_array_keys likes to think of simple scalar =
keys as a degenerate single-value array. They are the same thing, for
the purposes of rolling over the scan's arrays. We need to use a 3-way
ORDER proc for scalar scan keys for this reason.

> It's mostly as an observation (and not problem per se) that "compare"
> (which sounds like a pure function that doens't modify anything, e.g.
> _bt_compare) is used to dispatch to "shrink" (which does sound like
> it'd modify something).

It sounds like it's modifying something because (as you must know) it
does just that. This has been the case since the Postgres 17 SAOP
patch, of course (only the use of the term "shrink" in a helper
function is new here).

I don't want to rename _bt_compare_scankey_args now (that name is well
over 20 years old). That would be what it would take to make this
consistent in the way you expect. I just don't think it matters very
much.

> My comment is not about your potential future actions, but rather what
> any future developer or committer working on this code might think and
> worry about when reading this. = ANY {} constructs *always* have NOT
> NULL behaviour, just like any other operator clause that isn't "IS
> NULL", so clarifying that this is only similar -and does not behave
> the same in important edge cases- is IMO important.

> Not specifically for you, but for any other developer trying to get a correct
> understanding of how this works and why it is correct.

How many times does it have to be clarified, though? Do I have to put
something about it anywhere I give a brief high-level description of
how skip arrays work, where it's natural to compare them to a
traditional SAOP that generates all possible matching elements?
Explaining the concepts in question is hard enough, without having to
always list all of the ways that my analogy isn't the full and literal
truth of the matter. It's already extremely obvious that it must be
far from a literal account of what happens.

> It is my understanding that those are mostly historical artifacts of
> the code base, and not systems in active development. Their rarety
> makes it difficult to put numbers on, but IIRC at least one such case
> was recently removed for bitrot and apparent lack of use in years.

It's effectively a comment (nobody is expected to ever uncomment it by
removing the "#ifdef 0"). Sometimes, comments become obsolete. It's a
trade-off.

> > What if I just had a local copy of numSkipArrayKeys, and copied back
> > into caller's arg when the function returns? We'll still need a
> > prev_numSkipArrayKeys under this scheme, but we won't have to set the
> > caller's pointer until right at the end anymore (which, I agree, seems
> > like it might be worth avoiding).
>
> That's a nice alternative too, indeed.

I'll do it that way in the commited patch. That's probably not going
to happen until Friday morning EST, to give me another full day to
work some more on the docs.

I don't see much point in posting another version of the patchset to the list.

> > But the API definition *does* specifically address the opclass
> > author's responsibilities around NULLs? It specifically says that it's
> > not up to the opclass author to think about them at all.
>
> Yes. What I'm suggesting is to make the contract enforceable to a
> degree. If it was defined to "return (Datum) 0 on overflow", then that
> could be Assert()ed; and code that does leak memory could get detected
> by static analysis tools in the function scope because the allocation
> didn't get returned, but with this definition returning an allocation
> is never detected because that's not part of the contract.

All B-Tree support functions aren't allowed to leak memory. Same with
all operators. This will be far from the only time that we expect
opclass authors to get that right. This mostly works just fine,
probably because an opclass that leaked memory like this would visibly
break quite quickly.

> You could Assert(inc_sk_argument == (Datum) 0) in the oflow branch?
> I'm certain that (Datum) 0 is an invalid representation of a pointer,
> so we know that no allocated value is returned (be it new or
> pre-existing).

I just don't see what the point would be. Nothing would stop a
decrement/increment callback that needs to allocate memory from
returning 0 and then leaking memory anyway.

-- 
Peter Geoghegan