Re: Adding skip scan (including MDAM style range skip scan) to nbtree

Tomas Vondra <tomas@vondra.me>

From: Tomas Vondra <tomas@vondra.me>
To: Peter Geoghegan <pg@bowt.ie>
Cc: Masahiro.Ikeda@nttdata.com, pgsql-hackers@lists.postgresql.org, Masao.Fujii@nttdata.com
Date: 2024-09-20T14:07:41Z
Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. nbtree: Always set skipScan flag on rescan.

  2. meson: Build numeric.c with -ftree-vectorize.

  3. Fix "variable not found in subplan target lists" in semijoin de-duplication.

  4. Revert "nbtree: Remove useless row compare arg."

  5. nbtree: Remove useless row compare arg.

  6. Prevent premature nbtree array advancement.

  7. nbtree: tighten up array recheck rules.

  8. Avoid treating nonrequired nbtree keys as required.

  9. Adjust overstrong nbtree skip array assertion.

  10. Make NULL tuple values always advance skip arrays.

  11. Avoid extra index searches through preprocessing.

  12. Improve nbtree skip scan primitive scan scheduling.

  13. Further optimize nbtree search scan key comparisons.

  14. Add nbtree skip scan optimization.

  15. Improve nbtree array primitive scan scheduling.

  16. nbtree: Make BTMaxItemSize into object-like macro.

  17. Show index search count in EXPLAIN ANALYZE, take 2.

  18. Make parallel nbtree index scans use an LWLock.

  19. Show index search count in EXPLAIN ANALYZE.

  20. Avoid nbtree parallel scan currPos confusion.

  21. nbtree: Remove useless 'strat' local variable.

  22. Normalize nbtree truncated high key array behavior.

  23. Refactor handling of nbtree array redundancies.

  24. Fix nbtree pgstats accounting with parallel scans.

  25. Avoid parallel nbtree index scan hangs with SAOPs.

  26. Show Parallel Bitmap Heap Scan worker stats in EXPLAIN ANALYZE

  27. Enhance nbtree ScalarArrayOp execution.

  28. Skip checking of scan keys required for directional scan in B-tree

  29. Instead of using a numberOfRequiredKeys count to distinguish required

On 9/18/24 20:52, Peter Geoghegan wrote:
> On Wed, Sep 18, 2024 at 7:36 AM Tomas Vondra <tomas@vondra.me> wrote:
>> Makes sense. I started with the testing before before even looking at
>> the code, so it's mostly a "black box" approach. I did read the 1995
>> paper before that, and the script generates queries with clauses
>> inspired by that paper, in particular:
> 
> I think that this approach with black box testing is helpful, but also
> something to refine over time. Gray box testing might work best.
> 
>> OK, understood. If it's essentially an independent issue (perhaps even
>> counts as a bug?) what about correcting it on master first? Doesn't
>> sound like something we'd backpatch, I guess.
> 
> What about backpatching it to 17?
> 
> As things stand, you can get quite contradictory counts of the number
> of index scans due to irrelevant implementation details from parallel
> index scan. It just looks wrong, particularly on 17, where it is
> reasonable to expect near exact consistency between parallel and
> serial scans of the same index.
> 

Yes, I think backpatching to 17 would be fine. I'd be worried about
maybe disrupting some monitoring in production systems, but for 17 that
shouldn't be a problem yet. So fine with me.

FWIW I wonder how likely is it that someone has some sort of alerting
tied to this counter. I'd bet few people do. It's probably more about a
couple people looking at explain plans, but they'll be confused even if
we change that only starting with 17.

>> Seems like a bit of a mess. IMHO we should either divide everything by
>> nloops (so that everything is "per loop", or not divide anything. My
>> vote would be to divide, but that's mostly my "learned assumption" from
>> the other fields. But having a 50:50 split is confusing for everyone.
> 
> My idea was that it made most sense to follow the example of
> "Buffers:", since both describe physical costs.
> 
> Honestly, I'm more than ready to take whatever the path of least
> resistance is. If dividing by nloops is what people want, I have no
> objections.
> 

I don't have a strong opinion on this. I just know I'd be confused by
half the counters being total and half /loop, but chances are other
people would disagree.

>>> It's worth having skip support (the idea comes from the MDAM paper),
>>> but it's not essential. Whether or not an opclass has skip support
>>> isn't accounted for by the cost model, but I doubt that it's worth
>>> addressing (the cost model is already pessimistic).
>>>
>>
>> I admit I'm a bit confused. I probably need to reread the paper, but my
>> impression was that the increment/decrement is required for skipscan to
>> work. If we can't do that, how would it generate the intermediate values
>> to search for? I imagine it would be possible to "step through" the
>> index, but I thought the point of skip scan is to not do that.
> 
> I think that you're probably still a bit confused because the
> terminology in this area is a little confusing. There are two ways of
> explaining the situation with types like text and numeric (types that
> lack skip support). The two explanations might seem to be
> contradictory, but they're really not, if you think about it.
> 
> The first way of explaining it, which focuses on how the scan moves
> through the index:
> 
> For a text index column "a", and an int index column "b", skip scan
> will work like this for a query with a qual "WHERE b = 55":
> 
> 1. Find the first/lowest sorting "a" value in the index. Let's say
> that it's "Aardvark".
> 
> 2. Look for matches "WHERE a = 'Aardvark' and b = 55", possibly
> returning some matches.
> 
> 3. Find the next value after "Aardvark" in the index using a probe
> like the one we'd use for a qual "WHERE a > 'Aardvark'". Let's say
> that it turns out to be "Abacus".
> 
> 4. Look for matches "WHERE a = 'Abacus' and b = 55"...
> 
> ... (repeat these steps until we've exhaustively processed every
> existing "a" value in the index)...

Ah, OK. So we do probe the index like this. I was under the impression
we don't do that. But yeah, this makes sense.

> 
> The second way of explaining it, which focuses on how the skip arrays
> advance. Same query (and really the same behavior) as in the first
> explanation:
> 
> 1. Skip array's initial value is the sentinel -inf, which cannot
> possibly match any real index tuple, but can still guide the search.
> So we search for tuples "WHERE a = -inf AND b = 55" (actually we don't
> include the "b = 55" part, since it is unnecessary, but conceptually
> it's a part of what we search for within _bt_first).
> 
> 2. Find that the index has no "a" values matching -inf (it inevitably
> cannot have any matches for -inf), but we do locate the next highest
> match. The closest matching value is "Aardvark". The skip array on "a"
> therefore advances from -inf to "Aardvark".
> 
> 3. Look for matches "WHERE a = 'Aardvark' and b = 55", possibly
> returning some matches.
> 
> 4. Reach tuples after the last match for "WHERE a = 'Aardvark' and b =
> 55", which will cause us to advance the array on "a" incrementally
> inside _bt_advance_array_keys (just like it would if there was a
> standard SAOP array on "a" instead). The skip array on "a" therefore
> advances from "Aardvark" to "Aardvark" +infinitesimal (we need to use
> sentinel values for this text column, which lacks skip support).
> 
> 5. Look for matches "WHERE a = 'Aardvark'+infinitesimal and b = 55",
> which cannot possibly find matches, but, again, can reposition the
> scan as needed. We can't find an exact match, of course, but we do
> locate the next closest match -- which is "Abacus", again. So the skip
> array now advances from "Aardvark" +infinitesimal to "Abacus". The
> sentinel values are made up values, but that doesn't change anything.
> (And, again, we don't include the "b = 55" part here, for the same
> reason as before.)
> 
> 6. Look for matches "WHERE a = 'Abacus' and b = 55"...
> 
> ...(repeat these steps as many times as required)...
> 

Yeah, this makes more sense. Thanks.

> In summary:
> 
> Even index columns that lack skip support get to "increment" (or
> "decrement") their arrays by using sentinel values that represent -inf
> (or +inf for backwards scans), as well as sentinels that represent
> concepts such as "Aardvark" +infinitesimal (or "Zebra" -infinitesimal
> for backwards scans, say). This scheme sounds contradictory, because
> in one sense it allows every skip array to be incremented, but in
> another sense it makes it okay that we don't have a type-specific way
> to increment values for many individual types/opclasses.
> 
> Inventing these sentinel values allows _bt_advance_array_keys to reason about
> arrays without really having to care about which kinds of arrays are
> involved, their order relative to each other, etc. In a certain sense,
> we don't really need explicit "next key" probes of the kind that the
> MDAM paper contemplates, though we do still require the same index
> accesses as a design with explicit accesses.
> 
> Does that make sense?
> 

Yes, it does. Most of my confusion was caused by my belief that we can't
probe the index for the next value without "incrementing" the current
value, but that was a silly idea.

> Obviously, if we did add skip support for text, it would be very
> unlikely to help performance. Sure, one can imagine incrementing from
> "Aardvark" to "Aardvarl" using dedicated opclass infrastructure, but
> that isn't very helpful. You're almost certain to end up accessing the
> same pages with such a scheme, anyway. What are the chances of an
> index with a leading text column actually containing tuples matching
> (say) "WHERE a = 'Aardvarl' and b = 55"? The chances are practically
> zero. Whereas if the column "a" happens to use a discrete type such as
> integer or date, then skip support is likely to help: there's a decent
> chance that a value generated by incrementing the last value
> (and I mean incrementing it for real) will find a real match when
> combined with the user-supplied "b" predicate.
> 
> It might be possible to add skip support for text, but there wouldn't
> be much point.
> 

Stupid question - so why does it make sense for types like int? There
can also be a lot of values between the current and the next value, so
why would that be very different from "incrementing" a text value?

> 
>> I think it'd help if I go through the results and try to prepare some
>> reproducers, to make it easier for you. After all, it's my script and
>> you'd have to reverse engineer some of it.
> 
> Yes, that would be helpful.
> 
> I'll probably memorialize the problem by writing my own minimal test
> case for it. I'm using the same TDD approach for this project as was
> used for the related Postgres 17 project.
> 

Sure. Still, giving you a reproducer should make it easier .

>> This is on exactly the same data, after dropping caches and restarting
>> the instance. So there should be no caching effects. Yet, there's a
>> pretty clear difference - the total number of buffers is the same, but
>> the patched version has many more hits. Yet it's slower. Weird, right?
> 
> Yes, it's weird. It seems likely that you've found an unambiguous bug,
> not just a "regular" performance regression. The regressions that I
> already know about aren't nearly this bad. So it seems like you have
> the right general idea about what to expect, and it seems like your
> approach to testing the patch is effective.
> 

Yeah, it's funny. It's not the first time I start stress testing a patch
only to stumble over some pre-existing issues ... ;-)


regards

-- 
Tomas Vondra