Thread
-
Re: index prefetching
Konstantin Knizhnik <knizhnik@garret.ru> — 2025-12-29T16:34:47Z
On 29/12/2025 1:53 AM, Tomas Vondra wrote: > > On 12/28/25 21:30, Konstantin Knizhnik wrote: >> On 28/12/2025 8:08 PM, Tomas Vondra wrote: >>> On 12/25/25 16:39, Konstantin Knizhnik wrote: >>>> On 21/12/2025 7:55 PM, Peter Geoghegan wrote: >>>>> On Wed, Dec 10, 2025 at 9:21 PM Peter Geoghegan <pg@bowt.ie> wrote: >>>>>> Attached is v4. >>>>> Attached is v5. Changes from v4: >>>>> >>>>> * Simplified and optimized index-only scans, with a particular >>>>> emphasis on avoiding regressions with nested loop joins with an inner >>>>> index-only scan. >>>>> >>>>> There were quite a number of small problems/dead code related to >>>>> index-only scans fixed by this new v5. Overall, I'm quite a bit >>>>> happier with the state of index-only scans, which I'd not paid too >>>>> much attention to before now. >>>>> >>>>> * Added Valgrind instrumentation to the hash index patch, which was >>>>> required to fix some false positives. >>>>> >>>>> The generic indexam_util_batch_unlock routine had Valgrind >>>>> instrumentation in earlier versions, just to keep nbtree's buffer >>>>> locking checks from generating similar false positives. Some time >>>>> later, when I added the hashgetbatch patch, there were new Valgrind >>>>> false positives during hash index scans -- which I missed at first. >>>>> This new v5 revisions adds similar Valgrind checks to hash itself >>>>> (changes that add code that is more or less a direct port of the stuff >>>>> added to nbtree by commit 4a70f829), which fixes the false positives, >>>>> and is independently useful. >>>>> >>>>> The rule for amgetbatch-based index AMs is that they must have similar >>>>> buffer locking instrumentation. That seems like a good thing. >>>>> >>>>> -- >>>>> Peter Geoghegan >>>> I the previous mail I shared results of my experiments with different >>>> prefetch distance. >>>> I think that we should start prefetching of heap tuples not from the >>>> second batch, but after some number of proceeded tids. >>>> >>>> Attached please find a patch which implements this approach. >>>> And below are updated results: >>>> >>>> limit\prefetch on off always inc threshold >>>> 1 12074 12765 3146 3282 12394 >>>> 2 5912 6198 2463 2438 6124 >>>> 4 2919 3047 1334 1964 2910 >>>> 8 1554 1496 1166 1409 1588 >>>> 16 815 775 947 940 600 >>>> 32 424 403 687 695 478 >>>> 64 223 208 446 453 358 >>>> 128 115 106 258 270 232 >>>> 256 68 53 138 149 131 >>>> 512 43 27 72 78 71 >>>> 1024 28 13 38 40 38 >>>> >>>> Last column is result of prefetch with read_stream_threshold=10. >>>> >>> That's great, but it only works for cases that can (and do) benefit from >>> the prefetching. Try running the benchmark with a data set that fits >>> into shared buffers (or RAM), which makes prefetching useless. >>> >>> I tried that with your test, comparing master, v5 and v5 + your >>> read_stream_threshold patch. See the attached run.sh script, and the PDF >>> summarizing the results. The last two column groups are comparisons to >>> master, with green=improvement, red=regression. There are no actual >>> improvements (1% delta is just noise). But the read_stream_threshold >>> results have a clear pattern of pretty massive (20-30%) regressions. >>> >>> The difference between v5 and v5-threshold is pretty clear. >>> >>> IIRC cases like this are *exactly* why we ended up with the current >>> heuristics, enabling prefetching only from the second batch. This >>> removes the risk of expensive read_stream init for very fast queries >>> that don't benefit anything. Of course, prefetching may be useless for >>> later batches too (e.g. if all the data is cached), but the query will >>> be expensive enough for the read_stream init cost to be negligible. >>> >>> To put this differently, the more aggressive the heuristics is (enabling >>> prefetching in more case), the more likely it's to cause regressions. >>> We've chosen to be more defensive, i.e. to sacrifice some possible gains >>> in order to not regress plausible workloads. I hope we agree queries on >>> fully cached "hot" data are pretty common / important. >>> >>> We can probably do better in the future. But we'll never know for sure >>> if a given scan benefits from prefetching. It's not just about the >>> number of items in the batch, but also about how many heap pages that >>> translates to, what I/O pattern (random vs. sequential?), how many are >>> already cached. For some queries we don't even know how many items we'll >>> actually need. We can't check all that at the very beginning, because >>> it's simply prohibitively expensive. >> >> I tried to reproduce your results, but at Mac I do not see some >> noticeable difference for 250k records, fillfactor=10 and 4GB shared >> buffers >> between `enable_indexscan_prefetch=false` and >> `enable_indexscan_prefetch=true`. >> I can't believe that just adding this checks in `heap_batch_advance_pos` >> can cause 75% degrade of performance (because for limit < 10, no read >> stream is initialized, but still we somewhere loose 25%). >> >> I just commented this fragment of code in heapam_handler.c: >> >> >> #if 0 >> proceed_items = ScanDirectionIsForward(direction) >> ? pos->item - batch->firstItem >> : batch->lastItem - pos->item; >> /* Delay initializing stream until proceeding */ >> if (proceed_items >= read_stream_threshold >> && !scan->xs_heapfetch->rs >> && !scan->batchqueue->disabled >> && !scan->xs_want_itup /* XXX prefetching disabled for IoS, >> for now */ >> && enable_indexscan_prefetch) >> { >> scan->xs_heapfetch->rs = >> read_stream_begin_relation(READ_STREAM_DEFAULT, NULL, >> scan->heapRelation, MAIN_FORKNUM, >> scan->heapRelation->rd_tableam->index_getnext_stream, >> scan, 0); >> } >> #endif >> >> and ... see no difference. >> >> I can understand why initializing read stream earlier (not at the second >> batch, but after 10 proceeded items) may have negative impact on >> performance when all data is present i shared buffers for LIMIT>=10. >> But how it can happen with LIMIT 1 and commented fragment above. There >> is nothing else in my patch except adding GUC. >> So I think that it is some "external" factor and wonder if you can >> reproduce this results (just first line). >> > It seems this is due to sending an extra SET (for the new GUC) in the > pgbench script, which is recognized only on the v5+threshold build. > > That's a thinko on my side, I should have realized the extra command > might affect this. It doesn't really affect the behavior, because 10 is > the default value for read_stream_threshold. I've fixed the script, will > check fresh results tomorrow. > > Still, I think most of what I said about heuristics when to initialize > the read stream, and the risk/benefit tradeoff, still applies. > > > regards Attached please find alternative version of the proposed patch which use number of disk reads as criteria for using read stream.