Re: Adding skip scan (including MDAM style range skip scan) to nbtree

Tomas Vondra <tomas@vondra.me>

From: Tomas Vondra <tomas@vondra.me>

To: Peter Geoghegan <pg@bowt.ie>, Masahiro.Ikeda@nttdata.com

Cc: pgsql-hackers@lists.postgresql.org, Masao.Fujii@nttdata.com

Date: 2024-09-16T22:05:47Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

nbtree: Always set skipScan flag on rescan.
- 454c046094ab 19 (unreleased) landed
- bee763aea13f 18.0 landed
meson: Build numeric.c with -ftree-vectorize.
- 9016fa7e3bcd 19 (unreleased) cited
Fix "variable not found in subplan target lists" in semijoin de-duplication.
- b8a1bdc458e3 19 (unreleased) cited
Revert "nbtree: Remove useless row compare arg."
- dd2ce3792754 18.0 landed
nbtree: Remove useless row compare arg.
- 54c6ea8c81db 18.0 cited
Prevent premature nbtree array advancement.
- 5f4d98d4f371 18.0 landed
nbtree: tighten up array recheck rules.
- 7e25c9363a82 18.0 landed
Avoid treating nonrequired nbtree keys as required.
- 0f08df406822 18.0 landed
Adjust overstrong nbtree skip array assertion.
- 9d924dbb3710 18.0 landed
Make NULL tuple values always advance skip arrays.
- b75fedcab791 18.0 cited
Avoid extra index searches through preprocessing.
- b3f1a13f22f9 18.0 landed
Improve nbtree skip scan primitive scan scheduling.
- 21a152b37f36 18.0 landed
Further optimize nbtree search scan key comparisons.
- 8a510275dd6b 18.0 landed
Add nbtree skip scan optimization.
- 92fe23d93aa3 18.0 landed
Improve nbtree array primitive scan scheduling.
- 9a2e2a285a14 18.0 landed
nbtree: Make BTMaxItemSize into object-like macro.
- 426ea611171d 18.0 landed
Show index search count in EXPLAIN ANALYZE, take 2.
- 0fbceae841cb 18.0 landed
Make parallel nbtree index scans use an LWLock.
- 67fc4c9fd7fa 18.0 landed
Show index search count in EXPLAIN ANALYZE.
- 5ead85fbc811 18.0 landed
Avoid nbtree parallel scan currPos confusion.
- b5ee4e52026b 18.0 cited
nbtree: Remove useless 'strat' local variable.
- b6558e4f837e 18.0 landed
Normalize nbtree truncated high key array behavior.
- 79fa7b3b1a44 18.0 landed
Refactor handling of nbtree array redundancies.
- b524974106ac 18.0 landed
Fix nbtree pgstats accounting with parallel scans.
- c00c54a9ac1e 18.0 landed
- fb4f5e58af97 17.0 landed
Avoid parallel nbtree index scan hangs with SAOPs.
- d8adfc18bebf 18.0 landed
- a24bffc021d9 17.0 landed
Show Parallel Bitmap Heap Scan worker stats in EXPLAIN ANALYZE
- 5a1e6df3b84c 18.0 cited
Enhance nbtree ScalarArrayOp execution.
- 5bf748b86bc6 17.0 cited
Skip checking of scan keys required for directional scan in B-tree
- e0b1ee17dc3a 17.0 cited
Instead of using a numberOfRequiredKeys count to distinguish required
- 7ccaf13a06b8 8.2.0 cited

Attachments

optimal.pdf (application/pdf)
q702.sql (application/sql)
run-mdam.sh (application/x-shellscript)

Hi,

I've been looking at this patch over the couple last days, mostly doing
some stress testing / benchmarking (hence the earlier report) and basic
review. I do have some initial review comments, and the testing produced
some interesting regressions (not sure if those are the cases where
skipscan can't really help, that Peter mentioned he needs to look into).


review
------

First, the review comments - nothing particularly serious, mostly just
cosmetic stuff:

1) v6-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch

- I find the places that increment "nsearches" a bit random. Each AM
does it in entirely different place (at least it seems like that to me).
Is there a way make this a bit more consistent?

- I find this comment rather unhelpful:

    uint64   btps_nsearches; /* instrumentation */

Instrumentation what? What's the counter for?

- I see _bt_first moved the pgstat_count_index_scan, but doesn't that
mean we skip it if the earlier code does "goto readcomplete"? Shouldn't
that still count as an index scan?

- show_indexscan_nsearches does this:

    if (scanDesc && scanDesc->nsearches > 0)
        ExplainPropertyUInteger("Index Searches", NULL,
                                scanDesc->nsearches, es);

But shouldn't it divide the count by nloops, similar to (for example)
show_instrumentation_count?


2) v6-0002-Normalize-nbtree-truncated-high-key-array-behavio.patch

- Admittedly very subjective, but I find the "oppoDirCheck" abbreviation
rather weird, I'd just call it "oppositeDirCheck".


3) v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch

- nothing


4) v6-0004-Add-skip-scan-to-nbtree.patch

- indices.sgml seems to hahve typo "Intevening" -> "Intervening"

- It doesn't seem like a good idea to remove the paragraph about
multicolumn indexes and replace it with just:

  Multicolumn indexes should be used judiciously.

I mean, what does judiciously even mean? what should the user consider
to be judicious? Seems rather unclear to me. Admittedly, the old text
was not much helpful, but at least it gave some advice.

But maybe more importantly, doesn't skipscan apply only to a rather
limited subset of data types (that support increment/decrement)? Doesn't
the new wording mostly ignore that, implying skipscan applies to all
btree indexes? I don't think it mentions datatypes anywhere, but there
are many indexes on data types like text, UUID and so on.

- Very subjective nitpicking, but I find it a bit strange when a comment
about a block is nested in the block, like in _bt_first() for the
array->null_elem check.

- assignProcTypes() claims providing skipscan for cross-type scenarios
doesn't make sense. Why is that? I'm not saying the claim is wrong, but
it's not clear to me why would that be the case.


costing
-------

Peter asked me to look at the costing, and I think it looks generally
sensible. We don't really have a lot of information to base the costing
on in the first place - the whole point of skipscan is about multicolumn
indexes, but none of the existing extended statistic seems very useful.
We'd need some cross-column correlation info, or something like that.

It's an interesting question - if we could collect some new statistics
for multicolumn indexes (say, by having a way to collect AM-specific
stats), what would we collect for skipscan?

There's one thing that I don't quite understand, and that's how
btcost_correlation() adjusts correlation for multicolumn indexes:

    if (index->nkeycolumns > 1)
        indexCorrelation = varCorrelation * 0.75;

That seems fine for a two-column index, I guess. But shouldn't it
compound for indexes with more keys? I mean, 0.75 * 0.75 for third
column, etc? I don't think btcostestimate() does that, it just remembers
whatever btcost_correlation() returns.

Anyway, the overall costing approach seems sensible, I think. It assumes
things we assume in general (columns/keys are considered independent),
which may be problematic, but this is the best we can do.

The only alternative approach I can think of is not to adjust the
costing for the index scan at all, and only use this to enable (or not
enable) the skipscan internally. That would mean the overall plan
remains the same, and maybe sometimes we would think an index scan would
be too expensive and use something else. Not great, but it doesn't have
the risk of regressions - IIUC we can disable the skipscan at runtime,
if we realize it's not really helpful.

If we're concerned about regressions, I think this would be the way to
deal with them. Or at least it's the best idea I have.


testing
-------

As usual, I wrote a bash script to do a bit of stress testing. It
generates tables with random data, and then runs random queries with
random predicates on them, while mutating a couple parameters (like
number of workers) to trigger different plans. It does that on 16,
master and with the skipscan patch (with the fix for parallel scans).

I've uploaded the script and results from the last run here:

    https://github.com/tvondra/pg-skip-scan-tests

There's the "run-mdam.sh" script that generates tables/queries, runs
them, collects all kinds of info about the query, and produces files
with explain plans, CSV with timings, etc.

Not all of the queries end up using index scans - depending on the
predicates, etc. it might have to use seqscan. Or maybe it only uses
index scan because it's forced to by the enable_* options, etc.

Anyway, I ran a couple thousand such queries, and I haven't found any
incorrect results (the script compares that between versions too). So
that's good ;-)

But my main goal was to see how this affects performance. The tables
were pretty small (just 1M rows, maybe ~80MB), but with restarts and
dropping caches, large enough to test this.

And generally the results seem good. You can either inspect the CSV with
raw results (look at the script to undestand what the fields are), or
check the attached PDF with a pivot table summarizing them.

As usual, there's a heatmap on the right side, comparing the results for
different versions (first "master/16" and then "skipscan/master"). Green
means "speedup/good" and red meand "regression/bad".

Most of the places are "white" (no change) or not very far from it, or
perhaps "green". But there's also a bunch of red results, which means
regression (FWIW the PDF is filtered only to queries that would actually
use the executed plans without the GUCs).

Some of the red placees are for very short queries - just a couple ms,
which means it can easily be random noise, or something like that. But a
couple queries are much longer, and might deserve some investigation.
The easiest way is to look at the "QID" column in the row, which
identifies the query in the "query" CSV. Then look into the results CSV
for IDs of the runs (in the first "SEQ" column), and find the details in
the "analyze" log, which has all the plans etc.

Alternatively, use the .ods in the git repository, which allows drill
down to results (from the pivot tables).

For example, one of the slowed down queries is query 702 (top of page 8
in the PDF). The query is pretty simple:

  explain (analyze, timing off, buffers off)
  select id1,id2 from t_1000000_1000_1_2
   where NOT (id1 in (:list)) AND (id2 = :value);

and it was executed on a table with random data in two columns, each
with 1000 distinct values. This is perfectly random data, so a great
match for the assumptions in costing etc.

But with uncached data, this runs in ~50 ms on master, but takes almost
200 ms with skipscan (these timings are from my laptop, but similar to
the results).

-- master
 Index Only Scan using t_1000000_1000_1_2_id1_id2_idx on
t_1000000_1000_1_2  (cost=0.96..20003.96 rows=1719 width=16)
                    (actual rows=811 loops=1)
   Index Cond: (id2 = 997)
   Filter: (id1 <> ALL ('{983,...,640}'::bigint[]))
   Rows Removed by Filter: 163
   Heap Fetches: 0
 Planning Time: 7.596 ms
 Execution Time: 28.851 ms
(7 rows)


-- with skipscan
 Index Only Scan using t_1000000_1000_1_2_id1_id2_idx on
t_1000000_1000_1_2  (cost=0.96..983.26 rows=1719 width=16)
                    (actual rows=811 loops=1)
   Index Cond: (id2 = 997)
   Index Searches: 1007
   Filter: (id1 <> ALL ('{983,...,640}'::bigint[]))
   Rows Removed by Filter: 163
   Heap Fetches: 0
 Planning Time: 3.730 ms
 Execution Time: 238.554 ms
(8 rows)

I haven't looked into why this is happening, but this seems like a
pretty good match for skipscan (on the first column). And for the
costing too - it's perfectly random data, no correllation, etc.



regards

-- 
Tomas Vondra