Re: Adding skip scan (including MDAM style range skip scan) to nbtree

Tomas Vondra <tomas@vondra.me>

From: Tomas Vondra <tomas@vondra.me>
To: Peter Geoghegan <pg@bowt.ie>, Mark Dilger <mark.dilger@enterprisedb.com>
Cc: Heikki Linnakangas <hlinnaka@iki.fi>, pgsql-hackers@lists.postgresql.org, Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: 2025-05-09T12:58:34Z
Lists: pgsql-hackers

Commits

  1. nbtree: Always set skipScan flag on rescan.

  2. meson: Build numeric.c with -ftree-vectorize.

  3. Fix "variable not found in subplan target lists" in semijoin de-duplication.

  4. Revert "nbtree: Remove useless row compare arg."

  5. nbtree: Remove useless row compare arg.

  6. Prevent premature nbtree array advancement.

  7. nbtree: tighten up array recheck rules.

  8. Avoid treating nonrequired nbtree keys as required.

  9. Adjust overstrong nbtree skip array assertion.

  10. Make NULL tuple values always advance skip arrays.

  11. Avoid extra index searches through preprocessing.

  12. Improve nbtree skip scan primitive scan scheduling.

  13. Further optimize nbtree search scan key comparisons.

  14. Add nbtree skip scan optimization.

  15. Improve nbtree array primitive scan scheduling.

  16. nbtree: Make BTMaxItemSize into object-like macro.

  17. Show index search count in EXPLAIN ANALYZE, take 2.

  18. Make parallel nbtree index scans use an LWLock.

  19. Show index search count in EXPLAIN ANALYZE.

  20. Avoid nbtree parallel scan currPos confusion.

  21. nbtree: Remove useless 'strat' local variable.

  22. Normalize nbtree truncated high key array behavior.

  23. Refactor handling of nbtree array redundancies.

  24. Fix nbtree pgstats accounting with parallel scans.

  25. Avoid parallel nbtree index scan hangs with SAOPs.

  26. Show Parallel Bitmap Heap Scan worker stats in EXPLAIN ANALYZE

  27. Enhance nbtree ScalarArrayOp execution.

  28. Skip checking of scan keys required for directional scan in B-tree

  29. Instead of using a numberOfRequiredKeys count to distinguish required

Hi,

While doing some benchmarks to compare 17 vs. 18, I ran into a
regression that I ultimately tracked to commit 92fe23d93aa.

    commit 92fe23d93aa3bbbc40fca669cabc4a4d7975e327
    Author: Peter Geoghegan <pg@bowt.ie>
    Date:   Fri Apr 4 12:27:04 2025 -0400

    Add nbtree skip scan optimization.

The workload is very simple - pgbench scale 1 with 100 partitions, an
extra index and a custom select script (same as the other regression I
just reported, but with low client counts):

  pg_ctl -D data init
  pg_ctl -D data -l pg.log start

  createdb test

  pgbench -i -s 1 --partitions=100 test

  psql test -c 'create index on pgbench_accounts(bid)'

and a custom script with a single query:

  select count(*) from pgbench_accounts where bid = 0

and then simply run this for a couple client counts:

  for m in simple prepared; do
    for c in 1 4 32; do
      pgbench -n -f select.sql -M $m -T 10 -c $c -j $c test | grep tps;
    done;
  done;

And the results for 92fe23d93aa and 3ba2cdaa454 (the commit prior to the
skip scan one) look like this:

  mode       #c    3ba2cdaa454     92fe23d93aa      diff
  -------------------------------------------------------
  simple      1           2617            1832       70%
              4           8332            6260       75%
             32          11603            7110       61%
  ------------------------------------------------------
  prepared    1          11113            3646       33%
              4          25379           11375       45%
             32          37319           14097       38%

The numbers are throughput (tps), as reported by pgbench, and "diff" is
92fe23d93aa relative to 3ba2cdaa454. For this workload, we're losing
25-40% of throughput in simple mode and 55-67% in prepared mode with
92fe23d93aa.
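
For reference, the "diff" column in the table above is the new throughput
as a percentage of the old one. A minimal sketch of that arithmetic,
assuming awk is available (the ratio helper is a hypothetical name used
only for illustration, it is not part of pgbench):

```shell
# ratio OLD NEW: print NEW as a rounded percentage of OLD
# (hypothetical helper, only for illustration)
ratio() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.0f%%\n", 100 * new / old }'
}

ratio 2617 1832     # simple, 1 client    -> 70%
ratio 11113 3646    # prepared, 1 client  -> 33%
```

So 33% in the prepared/1 client row means about two thirds of the
throughput is gone, not one third.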

Despite that, I'm not entirely sure how serious this is. This was meant
to be a micro-benchmark stressing the locking, but maybe it's considered
unrealistic in practice. Not sure.

I'm also not sure about the root cause, but while investigating it one
of the experiments I tried was tweaking the glibc malloc by setting

    export MALLOC_TOP_PAD_=$((64*1024*1024))

which keeps a 64MB "buffer" in glibc, to reduce the number of syscalls
malloc has to make to grow and shrink the heap. And with that, the
results change to this:

  mode       #c    3ba2cdaa454     92fe23d93aa      diff
  -------------------------------------------------------
  simple      1           3168            3153      100%
              4           9172            9171      100%
             32          12425           13248      107%
  ------------------------------------------------------
  prepared    1          11104           11460      103%
              4          25481           25737      101%
             32          36795           38372      104%

So the difference disappears - what remains is essentially run-to-run
variability. The throughput actually improves a little bit for
3ba2cdaa454.
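
When repeating the MALLOC_TOP_PAD_ experiment, the variable presumably
has to be in the environment the server is started from, so that the
backends inherit it. A small sketch, assuming a POSIX shell (the
with_top_pad wrapper is a hypothetical helper, not an existing tool):

```shell
# with_top_pad MB CMD...: run CMD with MALLOC_TOP_PAD_ set to MB megabytes
# (hypothetical helper; the variable is only set for that one command)
with_top_pad() {
  mb="$1"
  shift
  env MALLOC_TOP_PAD_="$((mb * 1024 * 1024))" "$@"
}

# show the value the command would see (64MB in bytes)
with_top_pad 64 sh -c 'echo "$MALLOC_TOP_PAD_"'

# assumed usage for the benchmark, restarting the server with the setting:
#   pg_ctl -D data stop
#   with_top_pad 64 pg_ctl -D data -l pg.log start
```

Setting it per command rather than with export makes it easier to switch
between the two configurations without the setting leaking into the
baseline run.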

My conclusion from this is that 92fe23d93aa ends up doing a lot of
malloc calls, and this is what causes the regression. Otherwise
setting MALLOC_TOP_PAD_ would not help like this. But I haven't
looked at the code, and I wouldn't have guessed the query to have
anything to do with skip scan ...


regards

-- 
Tomas Vondra