Thread

  1. Re: index prefetching

    Tomas Vondra <tomas@vondra.me> — 2025-12-28T23:53:16Z

    
    On 12/28/25 21:30, Konstantin Knizhnik wrote:
    > 
    > On 28/12/2025 8:08 PM, Tomas Vondra wrote:
    >> On 12/25/25 16:39, Konstantin Knizhnik wrote:
    >>> On 21/12/2025 7:55 PM, Peter Geoghegan wrote:
    >>>> On Wed, Dec 10, 2025 at 9:21 PM Peter Geoghegan <pg@bowt.ie> wrote:
    >>>>> Attached is v4.
    >>>> Attached is v5. Changes from v4:
    >>>>
    >>>> * Simplified and optimized index-only scans, with a particular
    >>>> emphasis on avoiding regressions with nested loop joins with an inner
    >>>> index-only scan.
    >>>>
    >>>> There were quite a number of small problems/dead code related to
    >>>> index-only scans fixed by this new v5. Overall, I'm quite a bit
    >>>> happier with the state of index-only scans, which I'd not paid too
    >>>> much attention to before now.
    >>>>
    >>>> * Added Valgrind instrumentation to the hash index patch, which was
    >>>> required to fix some false positives.
    >>>>
    >>>> The generic indexam_util_batch_unlock routine had Valgrind
    >>>> instrumentation in earlier versions, just to keep nbtree's buffer
    >>>> locking checks from generating similar false positives. Some time
    >>>> later, when I added the hashgetbatch patch, there were new Valgrind
    >>>> false positives during hash index scans -- which I missed at first.
    >>>> This new v5 revisions adds similar Valgrind checks to hash itself
    >>>> (changes that add code that is more or less a direct port of the stuff
    >>>> added to nbtree by commit 4a70f829), which fixes the false positives,
    >>>> and is independently useful.
    >>>>
    >>>> The rule for amgetbatch-based index AMs is that they must have similar
    >>>> buffer locking instrumentation. That seems like a good thing.
    >>>>
    >>>> -- 
    >>>> Peter Geoghegan
    >>> I the previous mail I shared results of my experiments with different
    >>> prefetch distance.
    >>> I think that we should start prefetching of heap tuples not from the
    >>> second batch, but after some number of proceeded tids.
    >>>
    >>> Attached please find a patch which implements this approach.
    >>> And below are updated results:
    >>>
    >>> limit\prefetch    on      off   always  inc    threshold
    >>> 1                 12074   12765  3146    3282     12394
    >>> 2                 5912    6198   2463    2438      6124
    >>> 4                 2919    3047   1334    1964      2910
    >>> 8                 1554    1496   1166    1409      1588
    >>> 16                815     775    947     940        600
    >>> 32                424     403    687     695        478
    >>> 64                223     208    446     453        358
    >>> 128               115     106    258     270        232
    >>> 256               68      53     138     149        131
    >>> 512               43      27     72      78          71
    >>> 1024              28      13     38      40          38
    >>>
    >>> Last column is result of prefetch with read_stream_threshold=10.
    >>>
    >> That's great, but it only works for cases that can (and do) benefit from
    >> the prefetching. Try running the benchmark with a data set that fits
    >> into shared buffers (or RAM), which makes prefetching useless.
    >>
    >> I tried that with your test, comparing master, v5 and v5 + your
    >> read_stream_threshold patch. See the attached run.sh script, and the PDF
    >> summarizing the results. The last two column groups are comparisons to
    >> master, with green=improvement, red=regression. There are no actual
    >> improvements (1% delta is just noise). But the read_stream_threshold
    >> results have a clear pattern of pretty massive (20-30%) regressions.
    >>
    >> The difference between v5 and v5-threshold is pretty clear.
    >>
    >> IIRC cases like this are *exactly* why we ended up with the current
    >> heuristics, enabling prefetching only from the second batch. This
    >> removes the risk of expensive read_stream init for very fast queries
    >> that don't benefit anything. Of course, prefetching may be useless for
    >> later batches too (e.g. if all the data is cached), but the query will
    >> be expensive enough for the read_stream init cost to be negligible.
    >>
    >> To put this differently, the more aggressive the heuristics is (enabling
    >> prefetching in more case), the more likely it's to cause regressions.
    >> We've chosen to be more defensive, i.e. to sacrifice some possible gains
    >> in order to not regress plausible workloads. I hope we agree queries on
    >> fully cached "hot" data are pretty common / important.
    >>
    >> We can probably do better in the future. But we'll never know for sure
    >> if a given scan benefits from prefetching. It's not just about the
    >> number of items in the batch, but also about how many heap pages that
    >> translates to, what I/O pattern (random vs. sequential?), how many are
    >> already cached. For some queries we don't even know how many items we'll
    >> actually need. We can't check all that at the very beginning, because
    >> it's simply prohibitively expensive.
    > 
    > 
    > I tried to reproduce your results, but at Mac I do not see some
    > noticeable difference  for 250k records, fillfactor=10 and 4GB shared
    > buffers
    > between `enable_indexscan_prefetch=false` and
    > `enable_indexscan_prefetch=true`.
    > I can't believe that just adding this checks in `heap_batch_advance_pos`
    > can cause 75% degrade of performance (because for limit < 10, no read
    > stream is initialized, but still we somewhere loose 25%).
    > 
    > I just commented this fragment of code in heapam_handler.c:
    > 
    > 
    > #if 0
    >     proceed_items = ScanDirectionIsForward(direction)
    >         ? pos->item - batch->firstItem
    >         : batch->lastItem - pos->item;
    >     /* Delay initializing stream until proceeding */
    >     if (proceed_items >= read_stream_threshold
    >         && !scan->xs_heapfetch->rs
    >         && !scan->batchqueue->disabled
    >         && !scan->xs_want_itup    /* XXX prefetching disabled for IoS,
    > for now */
    >         && enable_indexscan_prefetch)
    >     {
    >         scan->xs_heapfetch->rs =
    >             read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
    >                                        scan->heapRelation, MAIN_FORKNUM,
    >  scan->heapRelation->rd_tableam->index_getnext_stream,
    >                                        scan, 0);
    >     }
    > #endif
    > 
    > and ... see no difference.
    > 
    > I can understand why initializing read stream earlier (not at the second
    > batch, but after 10 proceeded items) may have negative impact on
    > performance when all data is present i shared buffers for LIMIT>=10.
    > But how it can happen with LIMIT 1 and commented fragment above. There
    > is nothing else in my patch except adding GUC.
    > So I think that it is some "external" factor and wonder if you can
    > reproduce this results (just first line).
    > 
    
    It seems this is due to sending an extra SET (for the new GUC) in the
    pgbench script, which is recognized only on the v5+threshold build.
    
    That's a thinko on my side, I should have realized the extra command
    might affect this. It doesn't really affect the behavior, because 10 is
    the default value for read_stream_threshold. I've fixed the script, will
    check fresh results tomorrow.
    
    Still, I think most of what I said about heuristics when to initialize
    the read stream, and the risk/benefit tradeoff, still applies.
    
    
    regards
    
    -- 
    Tomas Vondra