Re: Adding skip scan (including MDAM style range skip scan) to nbtree

Heikki Linnakangas <hlinnaka@iki.fi>

From: Heikki Linnakangas <hlinnaka@iki.fi>
To: Peter Geoghegan <pg@bowt.ie>, Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Cc: Tomas Vondra <tomas@vondra.me>, Masahiro.Ikeda@nttdata.com, pgsql-hackers@lists.postgresql.org, Masao.Fujii@nttdata.com
Date: 2025-01-25T03:07:45Z
Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. nbtree: Always set skipScan flag on rescan.

  2. meson: Build numeric.c with -ftree-vectorize.

  3. Fix "variable not found in subplan target lists" in semijoin de-duplication.

  4. Revert "nbtree: Remove useless row compare arg."

  5. nbtree: Remove useless row compare arg.

  6. Prevent premature nbtree array advancement.

  7. nbtree: tighten up array recheck rules.

  8. Avoid treating nonrequired nbtree keys as required.

  9. Adjust overstrong nbtree skip array assertion.

  10. Make NULL tuple values always advance skip arrays.

  11. Avoid extra index searches through preprocessing.

  12. Improve nbtree skip scan primitive scan scheduling.

  13. Further optimize nbtree search scan key comparisons.

  14. Add nbtree skip scan optimization.

  15. Improve nbtree array primitive scan scheduling.

  16. nbtree: Make BTMaxItemSize into object-like macro.

  17. Show index search count in EXPLAIN ANALYZE, take 2.

  18. Make parallel nbtree index scans use an LWLock.

  19. Show index search count in EXPLAIN ANALYZE.

  20. Avoid nbtree parallel scan currPos confusion.

  21. nbtree: Remove useless 'strat' local variable.

  22. Normalize nbtree truncated high key array behavior.

  23. Refactor handling of nbtree array redundancies.

  24. Fix nbtree pgstats accounting with parallel scans.

  25. Avoid parallel nbtree index scan hangs with SAOPs.

  26. Show Parallel Bitmap Heap Scan worker stats in EXPLAIN ANALYZE

  27. Enhance nbtree ScalarArrayOp execution.

  28. Skip checking of scan keys required for directional scan in B-tree

  29. Instead of using a numberOfRequiredKeys count to distinguish required

On 03/01/2025 21:43, Peter Geoghegan wrote:
> The newly revised "skipskip" optimization seems to get the regressions
> down to only a 5% - 10% increase in runtime across a wide variety of
> unsympathetic cases -- I'm now validating performance against a test
> suite based on the adversarial cases presented by Masahiro Ikeda on
> this thread. Although I think that I'll end up tuning the "skipskip"
> mechanism some more (I may have been too conservative in marginal
> cases that actually do benefit from skipping), I deem these
> regressions to be acceptable. They're only seen in the most
> unsympathetic cases, where an omitted leading column has groupings of
> no more than about 50 index tuples, making skipping pretty hopeless.

On my laptop, this is the worst case I could come up with:

create table skiptest as select g / 10 as a, g%10 as b from 
generate_series(1, 10000000) g;
vacuum freeze skiptest;
create index on skiptest (a, b);

set enable_seqscan=off; set max_parallel_workers_per_gather=0;

\timing on

After repeating a few times, to warm the cache:

postgres=# select count(*) from skiptest where b=1;
   count
---------
  1000000
(1 row)

Time: 202.501 ms

And after 'set skipscan_prefix_cols=0':

select count(*) from skiptest where b=1;
   count
---------
  1000000
(1 row)

Time: 164.762 ms

EXPLAIN ANALYZE confirms that it uses an Index Only scan in both cases.

> I knew from the outset that the hardest part of this project would be
> avoiding regressions in highly unsympathetic cases. The regressions
> that are still there seem very difficult to minimize any further; the
> overhead that remains comes from the simple need to maintain the
> scan's skip arrays once per page, before leaving the page. Once a scan
> decides to apply the "skipskip" optimization, it tends to stick with
> it for all future leaf pages -- leaving only the overhead of checking
> the high key while advancing the scan's arrays. I've cut just about
> all that that I can reasonably cut from the hot code paths that are at
> issue with the regressed cases.

Hmm, looking at the code and profile with perf, a lot of code is 
executed per tuple. Just function calls, passing arguments, checking 
flags etc. I suspect you could shave off some cycles by structuring the 
code in a more compiler and branch-prediction friendly way. Or just with 
more aggressive inlining. Not sure what exactly to suggest, but the code 
of _bt_readpage() and all the subroutines it calls is complicated.

Aside from the performance of those routines, it doesn't feel very 
intuitive. _bt_checkkeys() not only checks the current keys, but it can 
also read ahead tuples on the same page and update the array keys.

One little thing I noticed by stepping with debugger is that it calls 
index_getattr() twice for the same tuple and attribute. First in 
_bt_check_compare(), and if that sets *continuescan=false, again in 
_bt_tuple_before_array_skeys(). The first index_getattr() call is 
certainly pretty expensive because that's where you get the cache miss 
on the tuple when scanning. Not sure if the second call matters much, 
but it feels like a bad smell.

The comment on _bt_tuple_before_array_skeys() says:

>  * readpagetup callers must only call here when _bt_check_compare already set
>  * continuescan=false.  We help these callers deal with _bt_check_compare's
>  * inability to distinguishing between the < and > cases (it uses equality
>  * operator scan keys, whereas we use 3-way ORDER procs)
That begs the question, why does _bt_check_compare() not call the 3-way 
ORDER proc in the first place? That would avoid the overhead of another 
call here.

Aside from these micro-optimizations, I wonder about the "look-ahead" 
logic in _bt_checkkeys_look_ahead. It feels pretty simplistic. Could you 
use Exponential Search 
(https://en.wikipedia.org/wiki/Exponential_search) instead of a plain 
binary search on the page?

-- 
Heikki Linnakangas
Neon (https://neon.tech)