Re: Adding skip scan (including MDAM style range skip scan) to nbtree

Heikki Linnakangas <hlinnaka@iki.fi>

From: Heikki Linnakangas <hlinnaka@iki.fi>

To: Peter Geoghegan <pg@bowt.ie>, Masahiro Ikeda <ikedamsh@oss.nttdata.com>

Cc: Tomas Vondra <tomas@vondra.me>, Masahiro.Ikeda@nttdata.com, pgsql-hackers@lists.postgresql.org, Masao.Fujii@nttdata.com

Date: 2025-01-25T03:07:45Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

nbtree: Always set skipScan flag on rescan.
- 454c046094ab 19 (unreleased) landed
- bee763aea13f 18.0 landed
meson: Build numeric.c with -ftree-vectorize.
- 9016fa7e3bcd 19 (unreleased) cited
Fix "variable not found in subplan target lists" in semijoin de-duplication.
- b8a1bdc458e3 19 (unreleased) cited
Revert "nbtree: Remove useless row compare arg."
- dd2ce3792754 18.0 landed
nbtree: Remove useless row compare arg.
- 54c6ea8c81db 18.0 cited
Prevent premature nbtree array advancement.
- 5f4d98d4f371 18.0 landed
nbtree: tighten up array recheck rules.
- 7e25c9363a82 18.0 landed
Avoid treating nonrequired nbtree keys as required.
- 0f08df406822 18.0 landed
Adjust overstrong nbtree skip array assertion.
- 9d924dbb3710 18.0 landed
Make NULL tuple values always advance skip arrays.
- b75fedcab791 18.0 cited
Avoid extra index searches through preprocessing.
- b3f1a13f22f9 18.0 landed
Improve nbtree skip scan primitive scan scheduling.
- 21a152b37f36 18.0 landed
Further optimize nbtree search scan key comparisons.
- 8a510275dd6b 18.0 landed
Add nbtree skip scan optimization.
- 92fe23d93aa3 18.0 landed
Improve nbtree array primitive scan scheduling.
- 9a2e2a285a14 18.0 landed
nbtree: Make BTMaxItemSize into object-like macro.
- 426ea611171d 18.0 landed
Show index search count in EXPLAIN ANALYZE, take 2.
- 0fbceae841cb 18.0 landed
Make parallel nbtree index scans use an LWLock.
- 67fc4c9fd7fa 18.0 landed
Show index search count in EXPLAIN ANALYZE.
- 5ead85fbc811 18.0 landed
Avoid nbtree parallel scan currPos confusion.
- b5ee4e52026b 18.0 cited
nbtree: Remove useless 'strat' local variable.
- b6558e4f837e 18.0 landed
Normalize nbtree truncated high key array behavior.
- 79fa7b3b1a44 18.0 landed
Refactor handling of nbtree array redundancies.
- b524974106ac 18.0 landed
Fix nbtree pgstats accounting with parallel scans.
- c00c54a9ac1e 18.0 landed
- fb4f5e58af97 17.0 landed
Avoid parallel nbtree index scan hangs with SAOPs.
- d8adfc18bebf 18.0 landed
- a24bffc021d9 17.0 landed
Show Parallel Bitmap Heap Scan worker stats in EXPLAIN ANALYZE
- 5a1e6df3b84c 18.0 cited
Enhance nbtree ScalarArrayOp execution.
- 5bf748b86bc6 17.0 cited
Skip checking of scan keys required for directional scan in B-tree
- e0b1ee17dc3a 17.0 cited
Instead of using a numberOfRequiredKeys count to distinguish required
- 7ccaf13a06b8 8.2.0 cited

On 03/01/2025 21:43, Peter Geoghegan wrote:
> The newly revised "skipskip" optimization seems to get the regressions
> down to only a 5% - 10% increase in runtime across a wide variety of
> unsympathetic cases -- I'm now validating performance against a test
> suite based on the adversarial cases presented by Masahiro Ikeda on
> this thread. Although I think that I'll end up tuning the "skipskip"
> mechanism some more (I may have been too conservative in marginal
> cases that actually do benefit from skipping), I deem these
> regressions to be acceptable. They're only seen in the most
> unsympathetic cases, where an omitted leading column has groupings of
> no more than about 50 index tuples, making skipping pretty hopeless.

On my laptop, this is the worst case I could come up with:

create table skiptest as select g / 10 as a, g%10 as b from 
generate_series(1, 10000000) g;
vacuum freeze skiptest;
create index on skiptest (a, b);

set enable_seqscan=off; set max_parallel_workers_per_gather=0;

\timing on

After repeating a few times, to warm the cache:

postgres=# select count(*) from skiptest where b=1;
   count
---------
  1000000
(1 row)

Time: 202.501 ms

And after 'set skipscan_prefix_cols=0':

select count(*) from skiptest where b=1;
   count
---------
  1000000
(1 row)

Time: 164.762 ms

EXPLAIN ANALYZE confirms that it uses an Index Only scan in both cases.

> I knew from the outset that the hardest part of this project would be
> avoiding regressions in highly unsympathetic cases. The regressions
> that are still there seem very difficult to minimize any further; the
> overhead that remains comes from the simple need to maintain the
> scan's skip arrays once per page, before leaving the page. Once a scan
> decides to apply the "skipskip" optimization, it tends to stick with
> it for all future leaf pages -- leaving only the overhead of checking
> the high key while advancing the scan's arrays. I've cut just about
> all that that I can reasonably cut from the hot code paths that are at
> issue with the regressed cases.

Hmm, looking at the code and profile with perf, a lot of code is 
executed per tuple. Just function calls, passing arguments, checking 
flags etc. I suspect you could shave off some cycles by structuring the 
code in a more compiler and branch-prediction friendly way. Or just with 
more aggressive inlining. Not sure what exactly to suggest, but the code 
of _bt_readpage() and all the subroutines it calls is complicated.

Aside from the performance of those routines, it doesn't feel very 
intuitive. _bt_checkkeys() not only checks the current keys, but it can 
also read ahead tuples on the same page and update the array keys.

One little thing I noticed by stepping with debugger is that it calls 
index_getattr() twice for the same tuple and attribute. First in 
_bt_check_compare(), and if that sets *continuescan=false, again in 
_bt_tuple_before_array_skeys(). The first index_getattr() call is 
certainly pretty expensive because that's where you get the cache miss 
on the tuple when scanning. Not sure if the second call matters much, 
but it feels like a bad smell.

The comment on _bt_tuple_before_array_skeys() says:

>  * readpagetup callers must only call here when _bt_check_compare already set
>  * continuescan=false.  We help these callers deal with _bt_check_compare's
>  * inability to distinguishing between the < and > cases (it uses equality
>  * operator scan keys, whereas we use 3-way ORDER procs)
That begs the question, why does _bt_check_compare() not call the 3-way 
ORDER proc in the first place? That would avoid the overhead of another 
call here.

Aside from these micro-optimizations, I wonder about the "look-ahead" 
logic in _bt_checkkeys_look_ahead. It feels pretty simplistic. Could you 
use Exponential Search 
(https://en.wikipedia.org/wiki/Exponential_search) instead of a plain 
binary search on the page?

-- 
Heikki Linnakangas
Neon (https://neon.tech)