Thread

  1. Re: index prefetching

    Konstantin Knizhnik <knizhnik@garret.ru> — 2025-12-25T08:56:50Z

    On 18/12/2025 4:45 PM, Andres Freund wrote:
    > Hi,
    >
    > On 2025-12-18 15:40:59 +0100, Tomas Vondra wrote:
    >> The technical reason is that batch_getnext() does this:
    >>
    >>    /* Delay initializing stream until reading from scan's second batch */
    >>    if (priorbatch && !scan->xs_heapfetch->rs && !batchqueue->disabled &&
    >>        enable_indexscan_prefetch)
    >>        scan->xs_heapfetch->rs =
    >>            read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
    >>                                       ....);
    >>
    >> which means we only create the read_stream (which is what enables the
    >> prefetching) only when creating the second batch. And with LIMIT 100 we
    >> likely read just a single leaf page (=batch) most of the time, which
    >> means no read_stream and thus no prefetching.
    > Why is the logic tied to the number of batches, rather the number of items in
    > batches? It's not hard to come up with scenarios where having to wait for ~100
    > random pages will be the majority of the queries IO wait... It makes sense to
    > not initialize readahead if we just fetch an entry or two, but after that?
    
    
    I did more experiments trying to understand when we can take advantage 
    of prefetch:
    
    So schema is the same:
    
    create table t (pk integer, sk integer, payload text default repeat('x', 
    1000)) with (fillfactor=10);
    insert into t values (generate_series(1,10000000),random()*10000000);
    create index on t(sk);
    
    select.sql:
    
    \set sk random(1, 10000000)
    select * from t where sk >= :sk order by sk limit N;
    
    And I do
    pgbench -n -T 30 -M prepared -f select.sql postgres
    
    limit\prefetch    on      off     always  incremental
    1                 12074   12765    3146    3282
    2                   5912     6198    2463    2438
    4                   2919     3047    1334    1964
    8                   1554     1496    1166    1409
    16                   815       775      947      940
    32                   424       403      687      695
    64                   223       208      446      453
    128                 115       106      258      270
    256                  68          53      138      149
    512                  43          27       72         78
    1024                28          13       38         40
    
    
    prefetch=always means commenting of `priorbatch` check and immediate 
    creation of read_stream:
    
             /* Delay initializing stream until reading from scan's second 
    batch */
    -        if (priorbatch && !scan->xs_heapfetch->rs && 
    !batchqueue->disabled &&+
    +       if (/*priorbatch && */!scan->xs_heapfetch->rs && 
    !batchqueue->disabled &&
    
    prefetch=increment replaces doubling of prefetch distance with increment:
    
             /* Look-ahead distance ramps up rapidly after we do I/O. */
    -        distance = stream->distance * 2;
    +       distance = stream->distance ? stream->distance + 1 : 0;
    
    
    So as you expected, immediate creation of read_stream cause quite 
    significant degrade of performance on indexscans inspecting small number 
    of TIDs.
    Looks like the threshold where read stream provides advantages in 
    performance is about 10.
    After it earlier initialization of read stream adds quite noticeable 
    performance improvement.
    
    I tried to find out using profiler and debugger where most of the time 
    is spent in this case and answer was quite predictable -
    in read_stream_reset->read_stream_next_buffer.
    
    So we just consuming pefetched buffers which we do not need.
    
    I thought that we can use some better policy for increasing prefetch 
    distance (right now it is just doubled).
    This is why I have tried this "incremental" policy.
    Unfortunately it  can not help to reduce prefetch overhead for "short" 
    indexscans.
    But what surprised me is that for longer indexscans this approach seems 
    to be slightly more efficient than doubling.
    
    
    So look like we really should use number of items criteria for read 
    stream initialization rather than number of batches.
    And may be think about alternative policy for increasing prefetch distance.