Re: Adding skip scan (including MDAM style range skip scan) to nbtree

Peter Geoghegan <pg@bowt.ie>

From: Peter Geoghegan <pg@bowt.ie>

To: Masahiro Ikeda <ikedamsh@oss.nttdata.com>

Cc: Tomas Vondra <tomas@vondra.me>, Masahiro.Ikeda@nttdata.com, pgsql-hackers@lists.postgresql.org, Masao.Fujii@nttdata.com

Date: 2024-11-25T22:32:34Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

nbtree: Always set skipScan flag on rescan.
- 454c046094ab 19 (unreleased) landed
- bee763aea13f 18.0 landed
meson: Build numeric.c with -ftree-vectorize.
- 9016fa7e3bcd 19 (unreleased) cited
Fix "variable not found in subplan target lists" in semijoin de-duplication.
- b8a1bdc458e3 19 (unreleased) cited
Revert "nbtree: Remove useless row compare arg."
- dd2ce3792754 18.0 landed
nbtree: Remove useless row compare arg.
- 54c6ea8c81db 18.0 cited
Prevent premature nbtree array advancement.
- 5f4d98d4f371 18.0 landed
nbtree: tighten up array recheck rules.
- 7e25c9363a82 18.0 landed
Avoid treating nonrequired nbtree keys as required.
- 0f08df406822 18.0 landed
Adjust overstrong nbtree skip array assertion.
- 9d924dbb3710 18.0 landed
Make NULL tuple values always advance skip arrays.
- b75fedcab791 18.0 cited
Avoid extra index searches through preprocessing.
- b3f1a13f22f9 18.0 landed
Improve nbtree skip scan primitive scan scheduling.
- 21a152b37f36 18.0 landed
Further optimize nbtree search scan key comparisons.
- 8a510275dd6b 18.0 landed
Add nbtree skip scan optimization.
- 92fe23d93aa3 18.0 landed
Improve nbtree array primitive scan scheduling.
- 9a2e2a285a14 18.0 landed
nbtree: Make BTMaxItemSize into object-like macro.
- 426ea611171d 18.0 landed
Show index search count in EXPLAIN ANALYZE, take 2.
- 0fbceae841cb 18.0 landed
Make parallel nbtree index scans use an LWLock.
- 67fc4c9fd7fa 18.0 landed
Show index search count in EXPLAIN ANALYZE.
- 5ead85fbc811 18.0 landed
Avoid nbtree parallel scan currPos confusion.
- b5ee4e52026b 18.0 cited
nbtree: Remove useless 'strat' local variable.
- b6558e4f837e 18.0 landed
Normalize nbtree truncated high key array behavior.
- 79fa7b3b1a44 18.0 landed
Refactor handling of nbtree array redundancies.
- b524974106ac 18.0 landed
Fix nbtree pgstats accounting with parallel scans.
- c00c54a9ac1e 18.0 landed
- fb4f5e58af97 17.0 landed
Avoid parallel nbtree index scan hangs with SAOPs.
- d8adfc18bebf 18.0 landed
- a24bffc021d9 17.0 landed
Show Parallel Bitmap Heap Scan worker stats in EXPLAIN ANALYZE
- 5a1e6df3b84c 18.0 cited
Enhance nbtree ScalarArrayOp execution.
- 5bf748b86bc6 17.0 cited
Skip checking of scan keys required for directional scan in B-tree
- e0b1ee17dc3a 17.0 cited
Instead of using a numberOfRequiredKeys count to distinguish required
- 7ccaf13a06b8 8.2.0 cited

Attachments

v18-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch (application/octet-stream) patch v18-0001
v18-0003-Add-skipskip-nbtree-skip-scan-optimization.patch (application/octet-stream) patch v18-0003
v18-0002-Add-skip-scan-to-nbtree.patch (application/octet-stream) patch v18-0002

On Mon, Nov 25, 2024 at 4:07 AM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
> The approach looks good to me. In fact, the significant regressions I
> reported have disappeared in my environment. And the test_for_v17.sql
> shows that the performance of the master and the master with your
> patches
> is almost same.

That's great.

I'm also a bit worried about regressing queries that don't even
involve skip arrays -- particularly very simple queries. I'm thinking
of things like "pgbench -S" style SELECT queries. Adding almost any
new code to a hot code path such as _bt_check_compare comes with a
risk of such regressions. (Up until now it has been convenient to
pretend that "skipscan_prefix_cols=0" is 100% representative of
master, but obviously it is less than 100% representative -- since
"skipscan_prefix_cols=0" still has enlarged object code size in places
like _bt_check_compare.)

Attached is v18, which makes _bt_check_compare copy all relevant scan
key fields into local variables. This gives the compiler more freedom
due to not having to worry about aliasing.

I believe that this change to _bt_check_compare fixes some regressions
with simple queries. This includes a "pgbench -S" variant that I
previously used during the Postgres 17 SAOP array work -- a script
that showed a favorable case for that 17 work, involving "SELECT aid
FROM pgbench_accounts WHERE aid IN (....)", with hundreds of
contiguous array elements. More importantly, it avoids regressions
with standard "pgbench -S" SELECT queries.

It takes a long time to validate changes such as these -- the
regressions are within the range of noise, so it is difficult to tell
the difference between signal and noise. I probably went further with
some of the new changes to _bt_check_compare than really made sense.
For example, I probably added likely() or unlikely() annotations that
don't really help. This is something that I will continue to refine.

> One thing I am concerned about is that it reduces the cases where the
> optimization of _bt_checkkeys_look_ahead() works effectively, as the
> approach
> skips the skip immediately on the first occurrence per page.

I noticed that with the recent v17 revision of the patch, my original
MDAM paper "sales_mdam_paper" test case (the complicated query in the
introductory email of this thread) was about 2x slower. That's just
not okay, obviously. But the issue was relatively easy to fix: it was
fixed by making _bt_readpage not apply the "skipskip" optimization
when on the first page for the current primitive index scan -- we
already do this with the "precheck" optimization, so it's natural to
do it with the "skipskip" optimization as well.

The "sales_mdam_paper" test case involves thousands of primitive index
scans that each access only one leaf page. But each leaf page returns
2 non-adjoining tuples, with quite a few non-matching tuples "in
between" the matching tuples. There is one matching tuple for "store =
200", and another for "store = 250" -- and there's non-matching stores
201 - 249 between these two, which we want _bt_checkkeys_look_ahead to
skip over. This is exactly the kind of case where the
_bt_checkkeys_look_ahead() optimization is expected to help.

> -- performance
> SET skipscan_prefix_cols=32;
> EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT, SETTINGS, WAL, VERBOSE) SELECT *
> FROM t WHERE id2 = 360; -- cache
> EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT, SETTINGS, WAL, VERBOSE) SELECT *
> FROM t WHERE id2 = 360;
> SET skipscan_prefix_cols=0;
> EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT, SETTINGS, WAL, VERBOSE) SELECT *
> FROM t WHERE id2 = 360;
>
> Again, the above results are provided for reference, as I believe that
> many users prioritize stability and I'd like to take your new approach.

Adversarial cases specifically designed to "make the patch look bad"
are definitely useful review feedback. Ideally, the patch will be 100%
free of regressions -- no matter how unlikely (or even silly) they may
seem. I always prefer to not have to rely on anybody's opinion of what
is likely or unlikely.  :-)

A quick test seems to show that this particular regression is more or
less fixed by v18. As you said, the _bt_checkkeys_look_ahead stuff is
the issue here (same with the MDAM sales query). You should confirm
that the issue has actually been fixed, though.

Thanks
-- 
Peter Geoghegan