Thread

  1. Re: index prefetching

    Tomas Vondra <tomas@vondra.me> — 2025-12-18T14:40:59Z

    
    On 12/18/25 14:57, Konstantin Knizhnik wrote:
    > 
    > On 17/12/2025 9:54 PM, Tomas Vondra wrote:
    >> On 12/17/25 20:30, Andres Freund wrote:
    >>> Hi,
    >>>
    >>> On 2025-12-17 13:49:43 -0500, Peter Geoghegan wrote:
    >>>> On Wed, Dec 17, 2025 at 12:19 PM Konstantin Knizhnik
    >>>> <knizhnik@garret.ru> wrote:
    >>>>> Moreover with `enable_indexscan_prefetch=off` results are the same.
    >>>> It's quite unlikely that the current heuristics that trigger
    >>>> prefetching would have ever allowed any prefetching, for queries such
    >>>> as these.
    >>>>
    >>>> The exact rule right now is that we don't even begin prefetching until
    >>>> we've already read at least one index leaf page, and have to read
    >>>> another one. So it's impossible to use prefetching with a LIMIT of 1,
    >>>> with queries such as these. It's highly unlikely that you'd see any
    >>>> benefits from prefetching even with LIMIT 100 (usually we wouldn't
    >>>> even begin prefetching).
    >>> Note that due to the tuple size and fillfactor in Konstantin's
    >>> workload, there
    >>> will be one tuple per page... That should allow for some prefetching.
    >>>
    >> Yes, but that's in the heap. The mechanism Peter described is about leaf
    >> pages in the index, and the index has the usual fillfactor. So there'll
    >> be many index entries per leaf.
    >>
    > I slightly change my benchmark setup:
    > 
    > create table t (pk integer, sk integer, payload text default repeat('x',
    > 1000)) with (fillfactor=10);
    > insert into t values (generate_series(1,10000000),random()*10000000);
    > create index on t(sk);
    > 
    > select.sql:
    > 
    > \set sk random(1, 10000000)
    > select * from t where sk >= :sk order by sk limit N;
    > 
    > You are right. There is almost no effect of prefetch for limit=100, but
    > ~2x times improvement for limit=1000:
    > 
    > eio\limit       1      100   1000
    >  10          11102    142    28
    >   0           11419    137    14
    > 
    > master:
    > limit              1     100   1000
    >                11480   130      13
    > 
    > One of the motivation of my experiments was to check that there is no
    > degrade of performance because of batching.
    > And it is nice that there is no performance penalty here.
    > Still it is not quite clear to me why there is no any positive effect
    > for LIMIT 100.
    
    The technical reason is that batch_getnext() does this:
    
      /* Delay initializing stream until reading from scan's second batch */
      if (priorbatch && !scan->xs_heapfetch->rs && !batchqueue->disabled &&
          enable_indexscan_prefetch)
          scan->xs_heapfetch->rs =
              read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
                                         ....);
    
    which means we only create the read_stream (which is what enables the
    prefetching) only when creating the second batch. And with LIMIT 100 we
    likely read just a single leaf page (=batch) most of the time, which
    means no read_stream and thus no prefetching.
    
    You can try disabling this "priorbatch" condition, so that the
    read_stream gets created right away.
    
    > Reading 100 random heap pages definitely should take advantages of AIO.
    > We have also implemented prefetching for index only scan in Neon and
    > here effect for similar query is quite noticeable (~3x times).
    > But in Neon architecture prices of IO is much higher because requires
    > network communication with page server.
    > 
    
    True, but only if the data is not already in memory / shared buffers.
    IIRC this "priorbatch" logic mitigates regressions for cached workloads,
    because the read_stream initialization is expensive enough to hurt small
    queries when no I/O is needed.
    
    Maybe the tradeoff is different for Neon, which probably can't rely on
    cache that much? It's also true tying this to the number of batches is a
    bit coarse, because the batch size can vary a lot. It can be a couple
    items or hundreds of items, easily.
    
    I believe we're open to alternative ideas.
    
    
    regards
    
    -- 
    Tomas Vondra