Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Peter Geoghegan <pg@bowt.ie>
Commits
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
nbtree: Always set skipScan flag on rescan.
- 454c046094ab 19 (unreleased) landed
- bee763aea13f 18.0 landed
-
meson: Build numeric.c with -ftree-vectorize.
- 9016fa7e3bcd 19 (unreleased) cited
-
Fix "variable not found in subplan target lists" in semijoin de-duplication.
- b8a1bdc458e3 19 (unreleased) cited
-
Revert "nbtree: Remove useless row compare arg."
- dd2ce3792754 18.0 landed
-
nbtree: Remove useless row compare arg.
- 54c6ea8c81db 18.0 cited
-
Prevent premature nbtree array advancement.
- 5f4d98d4f371 18.0 landed
-
nbtree: tighten up array recheck rules.
- 7e25c9363a82 18.0 landed
-
Avoid treating nonrequired nbtree keys as required.
- 0f08df406822 18.0 landed
-
Adjust overstrong nbtree skip array assertion.
- 9d924dbb3710 18.0 landed
-
Make NULL tuple values always advance skip arrays.
- b75fedcab791 18.0 cited
-
Avoid extra index searches through preprocessing.
- b3f1a13f22f9 18.0 landed
-
Improve nbtree skip scan primitive scan scheduling.
- 21a152b37f36 18.0 landed
-
Further optimize nbtree search scan key comparisons.
- 8a510275dd6b 18.0 landed
-
Add nbtree skip scan optimization.
- 92fe23d93aa3 18.0 landed
-
Improve nbtree array primitive scan scheduling.
- 9a2e2a285a14 18.0 landed
-
nbtree: Make BTMaxItemSize into object-like macro.
- 426ea611171d 18.0 landed
-
Show index search count in EXPLAIN ANALYZE, take 2.
- 0fbceae841cb 18.0 landed
-
Make parallel nbtree index scans use an LWLock.
- 67fc4c9fd7fa 18.0 landed
-
Show index search count in EXPLAIN ANALYZE.
- 5ead85fbc811 18.0 landed
-
Avoid nbtree parallel scan currPos confusion.
- b5ee4e52026b 18.0 cited
-
nbtree: Remove useless 'strat' local variable.
- b6558e4f837e 18.0 landed
-
Normalize nbtree truncated high key array behavior.
- 79fa7b3b1a44 18.0 landed
-
Refactor handling of nbtree array redundancies.
- b524974106ac 18.0 landed
-
Fix nbtree pgstats accounting with parallel scans.
- c00c54a9ac1e 18.0 landed
- fb4f5e58af97 17.0 landed
-
Avoid parallel nbtree index scan hangs with SAOPs.
- d8adfc18bebf 18.0 landed
- a24bffc021d9 17.0 landed
-
Show Parallel Bitmap Heap Scan worker stats in EXPLAIN ANALYZE
- 5a1e6df3b84c 18.0 cited
-
Enhance nbtree ScalarArrayOp execution.
- 5bf748b86bc6 17.0 cited
-
Skip checking of scan keys required for directional scan in B-tree
- e0b1ee17dc3a 17.0 cited
-
Instead of using a numberOfRequiredKeys count to distinguish required
- 7ccaf13a06b8 8.2.0 cited
Attachments
- v3-0001-Add-skip-scan-to-nbtree.patch (application/octet-stream) patch v3-0001
On Fri, Jul 12, 2024 at 1:19 AM <Masahiro.Ikeda@nttdata.com> wrote: > Since I'd like to understand the skip scan to improve the EXPLAIN output > for multicolumn B-Tree Index[1], I began to try the skip scan with some > queries and look into the source code. Thanks for the review! Attached is v3, which generalizes skip scan, allowing it to work with opclasses/types that lack a skip support routine. In other words, v3 makes skip scan work for all types, including continuous types, where it's impractical or infeasible to add skip support. So now important types like text and numeric also get the skip scan optimization (it's not just discrete types like integer and date, as in previous versions). I feel very strongly that everything should be implemented as part of the new skip array abstraction; the patch should only add the concept of skip arrays, which should work just like SAOP arrays. We should avoid introducing any special cases. In short, _bt_advance_array_keys should work in exactly the same way as it does as of Postgres 17 (except for a few representational differences for skip arrays). This seems essential because _bt_advance_array_keys inherently need to be able to trigger moving on to the next skip array value when it reaches the end of a SAOP array (and vice-versa). And so it just makes sense to abstract-away the differences, hiding the difference in lower level code. I have described the new _bt_first behavior that is now available in this new v3 of the patch as "adding explicit next key probes". While v3 does make new changes to _bt_first, it's not really a special kind of index probe. v3 invents new sentinel values instead. The use of sentinels avoids inventing true special cases: the values -inf, +inf, as well as variants of = that use a real datum value, but match on the next key in the index. These new = variants can be thought of as "+infinitesimal" values. So when _bt_advance_array_keys has to "increment" the numeric value 5.0, it sets the scan key to the value "5.0 +infinitesimal". There can never be any matching tuples in the index (just like with -inf sentinel values), but that doesn't matter. So the changes v3 makes to _bt_first doesn't change the basic conceptual model. The added complexity is kept to a manageable level, particularly within _bt_advance_array_keys, which is already very complicated. To help with testing and review, I've added another temporary testing GUC to v3: skipscan_skipsupport_enabled. This can be set to "false" to avoid using skip support, even where available. The GUC makes it easy to measure how skip support routines can help performance (with discrete types like integer and date). > I found the cost is estimated to much higher if the number of skipped attributes > is more than two. Is it expected behavior? Yes and no. Honestly, the current costing is just placeholder code. It is totally inadequate. I'm not surprised that you found problems with it. I just didn't put much work into it, because I didn't really know what to do. > # Test result. The attached file is the detail of tests. > > -- Index Scan > -- The actual time is low since the skip scan works well > -- But the cost is higher than one of seqscan > EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM test WHERE id3 = 101; > QUERY PLAN > --------------------------------------------------------------------------------------------------------------------------------------- > Index Scan using idx_id1_id2_id3 on public.test (cost=0.42..26562.77 rows=984 width=20) (actual time=0.051..15.533 rows=991 loops=1) > Output: id1, id2, id3, value > Index Cond: (test.id3 = 101) > Buffers: shared hit=4402 > Planning: > Buffers: shared hit=7 > Planning Time: 0.234 ms > Execution Time: 15.711 ms > (8 rows) This is a useful example, because it shows the difficulty with the costing. I ran this query using my own custom instrumentation of the scan. I saw that we only ever manage to skip ahead by perhaps 3 leaf pages at a time, but we still come out ahead. As you pointed out, it's ~7.5x faster than the sequential scan, but not very different to the equivalent full index scan. At least not very different in terms of leaf page accesses. Why should we win by this much, for what seems like a marginal case for skip scan? Even cases where "skipping" doesn't manage to skip any leaf pages can still benefit from skipping *index tuples* -- there is more than one kind of skipping to consider. That is, the patch helps a lot with some (though not all) cases where I didn't really expect that to happen: the Postgres 17 SAOP tuple skipping code (the code in _bt_checkkeys_look_ahead, and the related code in _bt_readpage) helps quite a bit in "marginal" skip scan cases, even though it wasn't really designed for that purpose (it was added to avoid regressions in SAOP array scans for the Postgres 17 work). I find that some queries using my original example test case are about twice as fast as an equivalent full index scan, even when only the fourth and final index column is used in the query predicate. The scan can't even skip a single leaf page at a time, and yet we still win by a nice amount. We win, though it is almost by mistake! This is mostly a good thing. Both for the obvious reason (fast is better than slow), and because it justifies being so aggressive in assuming that skip scan might work out during planning (being wrong without really losing is nice). But there is also a downside: it makes it even harder to model costs at runtime, from within the optimizer. If I measure the actual runtime costs other than runtime (e.g., buffers accesses), I'm not sure that the optimizer is wrong to think that the parallel sequential scan is faster. It looks approximately correct. It is only when we look at runtime that the optimizer's choice looks wrong. Which is...awkward. In general, I have very little idea about how to improve the costing within btcostestimate. I am hoping that somebody has better ideas about it. btcostestimate is definitely the area where the patch is weakest right now. > I look at btcostestimate() to find the reason and found the bound quals > and cost.num_sa_scans are different from my expectation. > > My assumption is > * bound quals is id3=XXX (and id1 and id2 are skipped attributes) > * cost.num_sa_scans = 100 (=10*10 because assuming 10 primitive index scans > per skipped attribute) > > But it's wrong. The above index scan result is > * bound quals is NULL > * cost.num_sa_scans = 1 The logic with cost.num_sa_scans was definitely not what I intended. That's fixed in v3, at least. But the code in btcostestimate is still essentially the same as in earlier versions -- it needs to be completely redesigned (or, uh, designed for the first time). > As I know you said the below, but I'd like to know the above is expected or not. > Currently, there is an assumption that "there will be 10 primitive index scans > per skipped attribute". Is any chance to use pg_stats.n_distinct? It probably makes sense to use pg_stats.n_distinct here. But how? If the problem is that we're too pessimistic, then I think that this will usually (though not always) make us more pessimistic. Isn't that the wrong direction to go in? (We're probably also too optimistic in some cases, but being too pessimistic is a bigger problem in practice.) For example, your test case involved 11 distinct values in each column. The current approach of hard-coding 10 (which is just a temporary hack) should actually make the scan look a bit cheaper than it would if we used the true ndistinct. Another underlying problem is that the existing SAOP costing really isn't very accurate, without skip scan -- that's a big source of the pessimism with arrays/skipping. Why should we be able to get the true number of primitive index scans just by multiplying together each omitted prefix column's ndistinct? That approach is good for getting the worst case, which is probably relevant -- but it's probably not a very good assumption for the average case. (Though at least we can cap the total number of primitive index scans to 1/3 of the total number of pages in the index in btcostestimate, since we have guarantees about the worst case as of Postgres 17.) -- Peter Geoghegan