Thread

  1. Re: index prefetching

    Konstantin Knizhnik <knizhnik@garret.ru> — 2025-12-29T16:34:47Z

    On 29/12/2025 1:53 AM, Tomas Vondra wrote:
    >
    > On 12/28/25 21:30, Konstantin Knizhnik wrote:
    >> On 28/12/2025 8:08 PM, Tomas Vondra wrote:
    >>> On 12/25/25 16:39, Konstantin Knizhnik wrote:
    >>>> On 21/12/2025 7:55 PM, Peter Geoghegan wrote:
    >>>>> On Wed, Dec 10, 2025 at 9:21 PM Peter Geoghegan <pg@bowt.ie> wrote:
    >>>>>> Attached is v4.
    >>>>> Attached is v5. Changes from v4:
    >>>>>
    >>>>> * Simplified and optimized index-only scans, with a particular
    >>>>> emphasis on avoiding regressions with nested loop joins with an inner
    >>>>> index-only scan.
    >>>>>
    >>>>> There were quite a number of small problems/dead code related to
    >>>>> index-only scans fixed by this new v5. Overall, I'm quite a bit
    >>>>> happier with the state of index-only scans, which I'd not paid too
    >>>>> much attention to before now.
    >>>>>
    >>>>> * Added Valgrind instrumentation to the hash index patch, which was
    >>>>> required to fix some false positives.
    >>>>>
    >>>>> The generic indexam_util_batch_unlock routine had Valgrind
    >>>>> instrumentation in earlier versions, just to keep nbtree's buffer
    >>>>> locking checks from generating similar false positives. Some time
    >>>>> later, when I added the hashgetbatch patch, there were new Valgrind
    >>>>> false positives during hash index scans -- which I missed at first.
    >>>>> This new v5 revisions adds similar Valgrind checks to hash itself
    >>>>> (changes that add code that is more or less a direct port of the stuff
    >>>>> added to nbtree by commit 4a70f829), which fixes the false positives,
    >>>>> and is independently useful.
    >>>>>
    >>>>> The rule for amgetbatch-based index AMs is that they must have similar
    >>>>> buffer locking instrumentation. That seems like a good thing.
    >>>>>
    >>>>> -- 
    >>>>> Peter Geoghegan
    >>>> I the previous mail I shared results of my experiments with different
    >>>> prefetch distance.
    >>>> I think that we should start prefetching of heap tuples not from the
    >>>> second batch, but after some number of proceeded tids.
    >>>>
    >>>> Attached please find a patch which implements this approach.
    >>>> And below are updated results:
    >>>>
    >>>> limit\prefetch    on      off   always  inc    threshold
    >>>> 1                 12074   12765  3146    3282     12394
    >>>> 2                 5912    6198   2463    2438      6124
    >>>> 4                 2919    3047   1334    1964      2910
    >>>> 8                 1554    1496   1166    1409      1588
    >>>> 16                815     775    947     940        600
    >>>> 32                424     403    687     695        478
    >>>> 64                223     208    446     453        358
    >>>> 128               115     106    258     270        232
    >>>> 256               68      53     138     149        131
    >>>> 512               43      27     72      78          71
    >>>> 1024              28      13     38      40          38
    >>>>
    >>>> Last column is result of prefetch with read_stream_threshold=10.
    >>>>
    >>> That's great, but it only works for cases that can (and do) benefit from
    >>> the prefetching. Try running the benchmark with a data set that fits
    >>> into shared buffers (or RAM), which makes prefetching useless.
    >>>
    >>> I tried that with your test, comparing master, v5 and v5 + your
    >>> read_stream_threshold patch. See the attached run.sh script, and the PDF
    >>> summarizing the results. The last two column groups are comparisons to
    >>> master, with green=improvement, red=regression. There are no actual
    >>> improvements (1% delta is just noise). But the read_stream_threshold
    >>> results have a clear pattern of pretty massive (20-30%) regressions.
    >>>
    >>> The difference between v5 and v5-threshold is pretty clear.
    >>>
    >>> IIRC cases like this are *exactly* why we ended up with the current
    >>> heuristics, enabling prefetching only from the second batch. This
    >>> removes the risk of expensive read_stream init for very fast queries
    >>> that don't benefit anything. Of course, prefetching may be useless for
    >>> later batches too (e.g. if all the data is cached), but the query will
    >>> be expensive enough for the read_stream init cost to be negligible.
    >>>
    >>> To put this differently, the more aggressive the heuristics is (enabling
    >>> prefetching in more case), the more likely it's to cause regressions.
    >>> We've chosen to be more defensive, i.e. to sacrifice some possible gains
    >>> in order to not regress plausible workloads. I hope we agree queries on
    >>> fully cached "hot" data are pretty common / important.
    >>>
    >>> We can probably do better in the future. But we'll never know for sure
    >>> if a given scan benefits from prefetching. It's not just about the
    >>> number of items in the batch, but also about how many heap pages that
    >>> translates to, what I/O pattern (random vs. sequential?), how many are
    >>> already cached. For some queries we don't even know how many items we'll
    >>> actually need. We can't check all that at the very beginning, because
    >>> it's simply prohibitively expensive.
    >>
    >> I tried to reproduce your results, but at Mac I do not see some
    >> noticeable difference  for 250k records, fillfactor=10 and 4GB shared
    >> buffers
    >> between `enable_indexscan_prefetch=false` and
    >> `enable_indexscan_prefetch=true`.
    >> I can't believe that just adding this checks in `heap_batch_advance_pos`
    >> can cause 75% degrade of performance (because for limit < 10, no read
    >> stream is initialized, but still we somewhere loose 25%).
    >>
    >> I just commented this fragment of code in heapam_handler.c:
    >>
    >>
    >> #if 0
    >>      proceed_items = ScanDirectionIsForward(direction)
    >>          ? pos->item - batch->firstItem
    >>          : batch->lastItem - pos->item;
    >>      /* Delay initializing stream until proceeding */
    >>      if (proceed_items >= read_stream_threshold
    >>          && !scan->xs_heapfetch->rs
    >>          && !scan->batchqueue->disabled
    >>          && !scan->xs_want_itup    /* XXX prefetching disabled for IoS,
    >> for now */
    >>          && enable_indexscan_prefetch)
    >>      {
    >>          scan->xs_heapfetch->rs =
    >>              read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
    >>                                         scan->heapRelation, MAIN_FORKNUM,
    >>   scan->heapRelation->rd_tableam->index_getnext_stream,
    >>                                         scan, 0);
    >>      }
    >> #endif
    >>
    >> and ... see no difference.
    >>
    >> I can understand why initializing read stream earlier (not at the second
    >> batch, but after 10 proceeded items) may have negative impact on
    >> performance when all data is present i shared buffers for LIMIT>=10.
    >> But how it can happen with LIMIT 1 and commented fragment above. There
    >> is nothing else in my patch except adding GUC.
    >> So I think that it is some "external" factor and wonder if you can
    >> reproduce this results (just first line).
    >>
    > It seems this is due to sending an extra SET (for the new GUC) in the
    > pgbench script, which is recognized only on the v5+threshold build.
    >
    > That's a thinko on my side, I should have realized the extra command
    > might affect this. It doesn't really affect the behavior, because 10 is
    > the default value for read_stream_threshold. I've fixed the script, will
    > check fresh results tomorrow.
    >
    > Still, I think most of what I said about heuristics when to initialize
    > the read stream, and the risk/benefit tradeoff, still applies.
    >
    >
    > regards
    
    
    
    Attached please find alternative version of the proposed patch which use 
    number of disk reads as criteria for using read stream.