Re: Adding skip scan (including MDAM style range skip scan) to nbtree

Peter Geoghegan <pg@bowt.ie>

From: Peter Geoghegan <pg@bowt.ie>

To: Tomas Vondra <tomas@vondra.me>

Cc: Masahiro.Ikeda@nttdata.com, pgsql-hackers@lists.postgresql.org, Masao.Fujii@nttdata.com

Date: 2024-09-18T18:52:35Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

nbtree: Always set skipScan flag on rescan.
- 454c046094ab 19 (unreleased) landed
- bee763aea13f 18.0 landed
meson: Build numeric.c with -ftree-vectorize.
- 9016fa7e3bcd 19 (unreleased) cited
Fix "variable not found in subplan target lists" in semijoin de-duplication.
- b8a1bdc458e3 19 (unreleased) cited
Revert "nbtree: Remove useless row compare arg."
- dd2ce3792754 18.0 landed
nbtree: Remove useless row compare arg.
- 54c6ea8c81db 18.0 cited
Prevent premature nbtree array advancement.
- 5f4d98d4f371 18.0 landed
nbtree: tighten up array recheck rules.
- 7e25c9363a82 18.0 landed
Avoid treating nonrequired nbtree keys as required.
- 0f08df406822 18.0 landed
Adjust overstrong nbtree skip array assertion.
- 9d924dbb3710 18.0 landed
Make NULL tuple values always advance skip arrays.
- b75fedcab791 18.0 cited
Avoid extra index searches through preprocessing.
- b3f1a13f22f9 18.0 landed
Improve nbtree skip scan primitive scan scheduling.
- 21a152b37f36 18.0 landed
Further optimize nbtree search scan key comparisons.
- 8a510275dd6b 18.0 landed
Add nbtree skip scan optimization.
- 92fe23d93aa3 18.0 landed
Improve nbtree array primitive scan scheduling.
- 9a2e2a285a14 18.0 landed
nbtree: Make BTMaxItemSize into object-like macro.
- 426ea611171d 18.0 landed
Show index search count in EXPLAIN ANALYZE, take 2.
- 0fbceae841cb 18.0 landed
Make parallel nbtree index scans use an LWLock.
- 67fc4c9fd7fa 18.0 landed
Show index search count in EXPLAIN ANALYZE.
- 5ead85fbc811 18.0 landed
Avoid nbtree parallel scan currPos confusion.
- b5ee4e52026b 18.0 cited
nbtree: Remove useless 'strat' local variable.
- b6558e4f837e 18.0 landed
Normalize nbtree truncated high key array behavior.
- 79fa7b3b1a44 18.0 landed
Refactor handling of nbtree array redundancies.
- b524974106ac 18.0 landed
Fix nbtree pgstats accounting with parallel scans.
- c00c54a9ac1e 18.0 landed
- fb4f5e58af97 17.0 landed
Avoid parallel nbtree index scan hangs with SAOPs.
- d8adfc18bebf 18.0 landed
- a24bffc021d9 17.0 landed
Show Parallel Bitmap Heap Scan worker stats in EXPLAIN ANALYZE
- 5a1e6df3b84c 18.0 cited
Enhance nbtree ScalarArrayOp execution.
- 5bf748b86bc6 17.0 cited
Skip checking of scan keys required for directional scan in B-tree
- e0b1ee17dc3a 17.0 cited
Instead of using a numberOfRequiredKeys count to distinguish required
- 7ccaf13a06b8 8.2.0 cited

On Wed, Sep 18, 2024 at 7:36 AM Tomas Vondra <tomas@vondra.me> wrote:
> Makes sense. I started with the testing before before even looking at
> the code, so it's mostly a "black box" approach. I did read the 1995
> paper before that, and the script generates queries with clauses
> inspired by that paper, in particular:

I think that this approach with black box testing is helpful, but also
something to refine over time. Gray box testing might work best.

> OK, understood. If it's essentially an independent issue (perhaps even
> counts as a bug?) what about correcting it on master first? Doesn't
> sound like something we'd backpatch, I guess.

What about backpatching it to 17?

As things stand, you can get quite contradictory counts of the number
of index scans due to irrelevant implementation details from parallel
index scan. It just looks wrong, particularly on 17, where it is
reasonable to expect near exact consistency between parallel and
serial scans of the same index.

> Seems like a bit of a mess. IMHO we should either divide everything by
> nloops (so that everything is "per loop", or not divide anything. My
> vote would be to divide, but that's mostly my "learned assumption" from
> the other fields. But having a 50:50 split is confusing for everyone.

My idea was that it made most sense to follow the example of
"Buffers:", since both describe physical costs.

Honestly, I'm more than ready to take whatever the path of least
resistance is. If dividing by nloops is what people want, I have no
objections.

> > It's worth having skip support (the idea comes from the MDAM paper),
> > but it's not essential. Whether or not an opclass has skip support
> > isn't accounted for by the cost model, but I doubt that it's worth
> > addressing (the cost model is already pessimistic).
> >
>
> I admit I'm a bit confused. I probably need to reread the paper, but my
> impression was that the increment/decrement is required for skipscan to
> work. If we can't do that, how would it generate the intermediate values
> to search for? I imagine it would be possible to "step through" the
> index, but I thought the point of skip scan is to not do that.

I think that you're probably still a bit confused because the
terminology in this area is a little confusing. There are two ways of
explaining the situation with types like text and numeric (types that
lack skip support). The two explanations might seem to be
contradictory, but they're really not, if you think about it.

The first way of explaining it, which focuses on how the scan moves
through the index:

For a text index column "a", and an int index column "b", skip scan
will work like this for a query with a qual "WHERE b = 55":

1. Find the first/lowest sorting "a" value in the index. Let's say
that it's "Aardvark".

2. Look for matches "WHERE a = 'Aardvark' and b = 55", possibly
returning some matches.

3. Find the next value after "Aardvark" in the index using a probe
like the one we'd use for a qual "WHERE a > 'Aardvark'". Let's say
that it turns out to be "Abacus".

4. Look for matches "WHERE a = 'Abacus' and b = 55"...

... (repeat these steps until we've exhaustively processed every
existing "a" value in the index)...

The second way of explaining it, which focuses on how the skip arrays
advance. Same query (and really the same behavior) as in the first
explanation:

1. Skip array's initial value is the sentinel -inf, which cannot
possibly match any real index tuple, but can still guide the search.
So we search for tuples "WHERE a = -inf AND b = 55" (actually we don't
include the "b = 55" part, since it is unnecessary, but conceptually
it's a part of what we search for within _bt_first).

2. Find that the index has no "a" values matching -inf (it inevitably
cannot have any matches for -inf), but we do locate the next highest
match. The closest matching value is "Aardvark". The skip array on "a"
therefore advances from -inf to "Aardvark".

3. Look for matches "WHERE a = 'Aardvark' and b = 55", possibly
returning some matches.

4. Reach tuples after the last match for "WHERE a = 'Aardvark' and b =
55", which will cause us to advance the array on "a" incrementally
inside _bt_advance_array_keys (just like it would if there was a
standard SAOP array on "a" instead). The skip array on "a" therefore
advances from "Aardvark" to "Aardvark" +infinitesimal (we need to use
sentinel values for this text column, which lacks skip support).

5. Look for matches "WHERE a = 'Aardvark'+infinitesimal and b = 55",
which cannot possibly find matches, but, again, can reposition the
scan as needed. We can't find an exact match, of course, but we do
locate the next closest match -- which is "Abacus", again. So the skip
array now advances from "Aardvark" +infinitesimal to "Abacus". The
sentinel values are made up values, but that doesn't change anything.
(And, again, we don't include the "b = 55" part here, for the same
reason as before.)

6. Look for matches "WHERE a = 'Abacus' and b = 55"...

...(repeat these steps as many times as required)...

In summary:

Even index columns that lack skip support get to "increment" (or
"decrement") their arrays by using sentinel values that represent -inf
(or +inf for backwards scans), as well as sentinels that represent
concepts such as "Aardvark" +infinitesimal (or "Zebra" -infinitesimal
for backwards scans, say). This scheme sounds contradictory, because
in one sense it allows every skip array to be incremented, but in
another sense it makes it okay that we don't have a type-specific way
to increment values for many individual types/opclasses.

Inventing these sentinel values allows _bt_advance_array_keys to reason about
arrays without really having to care about which kinds of arrays are
involved, their order relative to each other, etc. In a certain sense,
we don't really need explicit "next key" probes of the kind that the
MDAM paper contemplates, though we do still require the same index
accesses as a design with explicit accesses.

Does that make sense?

Obviously, if we did add skip support for text, it would be very
unlikely to help performance. Sure, one can imagine incrementing from
"Aardvark" to "Aardvarl" using dedicated opclass infrastructure, but
that isn't very helpful. You're almost certain to end up accessing the
same pages with such a scheme, anyway. What are the chances of an
index with a leading text column actually containing tuples matching
(say) "WHERE a = 'Aardvarl' and b = 55"? The chances are practically
zero. Whereas if the column "a" happens to use a discrete type such as
integer or date, then skip support is likely to help: there's a decent
chance that a value generated by incrementing the last value
(and I mean incrementing it for real) will find a real match when
combined with the user-supplied "b" predicate.

It might be possible to add skip support for text, but there wouldn't
be much point.

> Anyway, probably a good idea for extending the stress testing script.
> Right now it tests with "bigint" columns only.

Good idea.

> Hmmm, yeah. I think it'd be useful to explain this reasoning (assuming
> no correlation means pessimistic skipscan costing) in a comment before
> btcostestimate, or somewhere close.

Will do.

> Don't we do something similar elsewhere? For example, IIRC we do some
> adjustments when estimating grouping in estimate_num_groups(), and
> incremental sort had to deal with something similar too. Maybe we could
> learn something from those places ... (both from the good and bad
> experiences).

I'll make a note of that. Gonna focus on regressions for now.

> Right. I don't think I've been suggesting having a separate path, I 100%
> agree it's better to have this as an option for index scan paths.

Cool.

> Sure. With this kind of testing I don't know what I'm looking for, so I
> try to cover very wide range of cases. Inevitably, some of the cases
> will not test the exact subject of the patch. I think it's fine.

I agree. Just wanted to make sure that we were on the same page.

> I think it'd help if I go through the results and try to prepare some
> reproducers, to make it easier for you. After all, it's my script and
> you'd have to reverse engineer some of it.

Yes, that would be helpful.

I'll probably memorialize the problem by writing my own minimal test
case for it. I'm using the same TDD approach for this project as was
used for the related Postgres 17 project.

> This is on exactly the same data, after dropping caches and restarting
> the instance. So there should be no caching effects. Yet, there's a
> pretty clear difference - the total number of buffers is the same, but
> the patched version has many more hits. Yet it's slower. Weird, right?

Yes, it's weird. It seems likely that you've found an unambiguous bug,
not just a "regular" performance regression. The regressions that I
already know about aren't nearly this bad. So it seems like you have
the right general idea about what to expect, and it seems like your
approach to testing the patch is effective.

--
Peter Geoghegan