Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Peter Geoghegan <pg@bowt.ie>
Commits
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
nbtree: Always set skipScan flag on rescan.
- 454c046094ab 19 (unreleased) landed
- bee763aea13f 18.0 landed
-
meson: Build numeric.c with -ftree-vectorize.
- 9016fa7e3bcd 19 (unreleased) cited
-
Fix "variable not found in subplan target lists" in semijoin de-duplication.
- b8a1bdc458e3 19 (unreleased) cited
-
Revert "nbtree: Remove useless row compare arg."
- dd2ce3792754 18.0 landed
-
nbtree: Remove useless row compare arg.
- 54c6ea8c81db 18.0 cited
-
Prevent premature nbtree array advancement.
- 5f4d98d4f371 18.0 landed
-
nbtree: tighten up array recheck rules.
- 7e25c9363a82 18.0 landed
-
Avoid treating nonrequired nbtree keys as required.
- 0f08df406822 18.0 landed
-
Adjust overstrong nbtree skip array assertion.
- 9d924dbb3710 18.0 landed
-
Make NULL tuple values always advance skip arrays.
- b75fedcab791 18.0 cited
-
Avoid extra index searches through preprocessing.
- b3f1a13f22f9 18.0 landed
-
Improve nbtree skip scan primitive scan scheduling.
- 21a152b37f36 18.0 landed
-
Further optimize nbtree search scan key comparisons.
- 8a510275dd6b 18.0 landed
-
Add nbtree skip scan optimization.
- 92fe23d93aa3 18.0 landed
-
Improve nbtree array primitive scan scheduling.
- 9a2e2a285a14 18.0 landed
-
nbtree: Make BTMaxItemSize into object-like macro.
- 426ea611171d 18.0 landed
-
Show index search count in EXPLAIN ANALYZE, take 2.
- 0fbceae841cb 18.0 landed
-
Make parallel nbtree index scans use an LWLock.
- 67fc4c9fd7fa 18.0 landed
-
Show index search count in EXPLAIN ANALYZE.
- 5ead85fbc811 18.0 landed
-
Avoid nbtree parallel scan currPos confusion.
- b5ee4e52026b 18.0 cited
-
nbtree: Remove useless 'strat' local variable.
- b6558e4f837e 18.0 landed
-
Normalize nbtree truncated high key array behavior.
- 79fa7b3b1a44 18.0 landed
-
Refactor handling of nbtree array redundancies.
- b524974106ac 18.0 landed
-
Fix nbtree pgstats accounting with parallel scans.
- c00c54a9ac1e 18.0 landed
- fb4f5e58af97 17.0 landed
-
Avoid parallel nbtree index scan hangs with SAOPs.
- d8adfc18bebf 18.0 landed
- a24bffc021d9 17.0 landed
-
Show Parallel Bitmap Heap Scan worker stats in EXPLAIN ANALYZE
- 5a1e6df3b84c 18.0 cited
-
Enhance nbtree ScalarArrayOp execution.
- 5bf748b86bc6 17.0 cited
-
Skip checking of scan keys required for directional scan in B-tree
- e0b1ee17dc3a 17.0 cited
-
Instead of using a numberOfRequiredKeys count to distinguish required
- 7ccaf13a06b8 8.2.0 cited
On Wed, Sep 18, 2024 at 7:36 AM Tomas Vondra <tomas@vondra.me> wrote: > Makes sense. I started with the testing before before even looking at > the code, so it's mostly a "black box" approach. I did read the 1995 > paper before that, and the script generates queries with clauses > inspired by that paper, in particular: I think that this approach with black box testing is helpful, but also something to refine over time. Gray box testing might work best. > OK, understood. If it's essentially an independent issue (perhaps even > counts as a bug?) what about correcting it on master first? Doesn't > sound like something we'd backpatch, I guess. What about backpatching it to 17? As things stand, you can get quite contradictory counts of the number of index scans due to irrelevant implementation details from parallel index scan. It just looks wrong, particularly on 17, where it is reasonable to expect near exact consistency between parallel and serial scans of the same index. > Seems like a bit of a mess. IMHO we should either divide everything by > nloops (so that everything is "per loop", or not divide anything. My > vote would be to divide, but that's mostly my "learned assumption" from > the other fields. But having a 50:50 split is confusing for everyone. My idea was that it made most sense to follow the example of "Buffers:", since both describe physical costs. Honestly, I'm more than ready to take whatever the path of least resistance is. If dividing by nloops is what people want, I have no objections. > > It's worth having skip support (the idea comes from the MDAM paper), > > but it's not essential. Whether or not an opclass has skip support > > isn't accounted for by the cost model, but I doubt that it's worth > > addressing (the cost model is already pessimistic). > > > > I admit I'm a bit confused. I probably need to reread the paper, but my > impression was that the increment/decrement is required for skipscan to > work. If we can't do that, how would it generate the intermediate values > to search for? I imagine it would be possible to "step through" the > index, but I thought the point of skip scan is to not do that. I think that you're probably still a bit confused because the terminology in this area is a little confusing. There are two ways of explaining the situation with types like text and numeric (types that lack skip support). The two explanations might seem to be contradictory, but they're really not, if you think about it. The first way of explaining it, which focuses on how the scan moves through the index: For a text index column "a", and an int index column "b", skip scan will work like this for a query with a qual "WHERE b = 55": 1. Find the first/lowest sorting "a" value in the index. Let's say that it's "Aardvark". 2. Look for matches "WHERE a = 'Aardvark' and b = 55", possibly returning some matches. 3. Find the next value after "Aardvark" in the index using a probe like the one we'd use for a qual "WHERE a > 'Aardvark'". Let's say that it turns out to be "Abacus". 4. Look for matches "WHERE a = 'Abacus' and b = 55"... ... (repeat these steps until we've exhaustively processed every existing "a" value in the index)... The second way of explaining it, which focuses on how the skip arrays advance. Same query (and really the same behavior) as in the first explanation: 1. Skip array's initial value is the sentinel -inf, which cannot possibly match any real index tuple, but can still guide the search. So we search for tuples "WHERE a = -inf AND b = 55" (actually we don't include the "b = 55" part, since it is unnecessary, but conceptually it's a part of what we search for within _bt_first). 2. Find that the index has no "a" values matching -inf (it inevitably cannot have any matches for -inf), but we do locate the next highest match. The closest matching value is "Aardvark". The skip array on "a" therefore advances from -inf to "Aardvark". 3. Look for matches "WHERE a = 'Aardvark' and b = 55", possibly returning some matches. 4. Reach tuples after the last match for "WHERE a = 'Aardvark' and b = 55", which will cause us to advance the array on "a" incrementally inside _bt_advance_array_keys (just like it would if there was a standard SAOP array on "a" instead). The skip array on "a" therefore advances from "Aardvark" to "Aardvark" +infinitesimal (we need to use sentinel values for this text column, which lacks skip support). 5. Look for matches "WHERE a = 'Aardvark'+infinitesimal and b = 55", which cannot possibly find matches, but, again, can reposition the scan as needed. We can't find an exact match, of course, but we do locate the next closest match -- which is "Abacus", again. So the skip array now advances from "Aardvark" +infinitesimal to "Abacus". The sentinel values are made up values, but that doesn't change anything. (And, again, we don't include the "b = 55" part here, for the same reason as before.) 6. Look for matches "WHERE a = 'Abacus' and b = 55"... ...(repeat these steps as many times as required)... In summary: Even index columns that lack skip support get to "increment" (or "decrement") their arrays by using sentinel values that represent -inf (or +inf for backwards scans), as well as sentinels that represent concepts such as "Aardvark" +infinitesimal (or "Zebra" -infinitesimal for backwards scans, say). This scheme sounds contradictory, because in one sense it allows every skip array to be incremented, but in another sense it makes it okay that we don't have a type-specific way to increment values for many individual types/opclasses. Inventing these sentinel values allows _bt_advance_array_keys to reason about arrays without really having to care about which kinds of arrays are involved, their order relative to each other, etc. In a certain sense, we don't really need explicit "next key" probes of the kind that the MDAM paper contemplates, though we do still require the same index accesses as a design with explicit accesses. Does that make sense? Obviously, if we did add skip support for text, it would be very unlikely to help performance. Sure, one can imagine incrementing from "Aardvark" to "Aardvarl" using dedicated opclass infrastructure, but that isn't very helpful. You're almost certain to end up accessing the same pages with such a scheme, anyway. What are the chances of an index with a leading text column actually containing tuples matching (say) "WHERE a = 'Aardvarl' and b = 55"? The chances are practically zero. Whereas if the column "a" happens to use a discrete type such as integer or date, then skip support is likely to help: there's a decent chance that a value generated by incrementing the last value (and I mean incrementing it for real) will find a real match when combined with the user-supplied "b" predicate. It might be possible to add skip support for text, but there wouldn't be much point. > Anyway, probably a good idea for extending the stress testing script. > Right now it tests with "bigint" columns only. Good idea. > Hmmm, yeah. I think it'd be useful to explain this reasoning (assuming > no correlation means pessimistic skipscan costing) in a comment before > btcostestimate, or somewhere close. Will do. > Don't we do something similar elsewhere? For example, IIRC we do some > adjustments when estimating grouping in estimate_num_groups(), and > incremental sort had to deal with something similar too. Maybe we could > learn something from those places ... (both from the good and bad > experiences). I'll make a note of that. Gonna focus on regressions for now. > Right. I don't think I've been suggesting having a separate path, I 100% > agree it's better to have this as an option for index scan paths. Cool. > Sure. With this kind of testing I don't know what I'm looking for, so I > try to cover very wide range of cases. Inevitably, some of the cases > will not test the exact subject of the patch. I think it's fine. I agree. Just wanted to make sure that we were on the same page. > I think it'd help if I go through the results and try to prepare some > reproducers, to make it easier for you. After all, it's my script and > you'd have to reverse engineer some of it. Yes, that would be helpful. I'll probably memorialize the problem by writing my own minimal test case for it. I'm using the same TDD approach for this project as was used for the related Postgres 17 project. > This is on exactly the same data, after dropping caches and restarting > the instance. So there should be no caching effects. Yet, there's a > pretty clear difference - the total number of buffers is the same, but > the patched version has many more hits. Yet it's slower. Weird, right? Yes, it's weird. It seems likely that you've found an unambiguous bug, not just a "regular" performance regression. The regressions that I already know about aren't nearly this bad. So it seems like you have the right general idea about what to expect, and it seems like your approach to testing the patch is effective. -- Peter Geoghegan