Thread

Re: index prefetching

Konstantin Knizhnik <knizhnik@garret.ru> — 2025-12-25T08:56:50Z
On 18/12/2025 4:45 PM, Andres Freund wrote:
> Hi,
>
> On 2025-12-18 15:40:59 +0100, Tomas Vondra wrote:
>> The technical reason is that batch_getnext() does this:
>>
>>    /* Delay initializing stream until reading from scan's second batch */
>>    if (priorbatch && !scan->xs_heapfetch->rs && !batchqueue->disabled &&
>>        enable_indexscan_prefetch)
>>        scan->xs_heapfetch->rs =
>>            read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
>>                                       ....);
>>
>> which means we only create the read_stream (which is what enables the
>> prefetching) only when creating the second batch. And with LIMIT 100 we
>> likely read just a single leaf page (=batch) most of the time, which
>> means no read_stream and thus no prefetching.
> Why is the logic tied to the number of batches, rather the number of items in
> batches? It's not hard to come up with scenarios where having to wait for ~100
> random pages will be the majority of the queries IO wait... It makes sense to
> not initialize readahead if we just fetch an entry or two, but after that?


I did more experiments trying to understand when we can take advantage 
of prefetch:

So schema is the same:

create table t (pk integer, sk integer, payload text default repeat('x', 
1000)) with (fillfactor=10);
insert into t values (generate_series(1,10000000),random()*10000000);
create index on t(sk);

select.sql:

\set sk random(1, 10000000)
select * from t where sk >= :sk order by sk limit N;

And I do
pgbench -n -T 30 -M prepared -f select.sql postgres

limit\prefetch    on      off     always  incremental
1                 12074   12765    3146    3282
2                   5912     6198    2463    2438
4                   2919     3047    1334    1964
8                   1554     1496    1166    1409
16                   815       775      947      940
32                   424       403      687      695
64                   223       208      446      453
128                 115       106      258      270
256                  68          53      138      149
512                  43          27       72         78
1024                28          13       38         40


prefetch=always means commenting of `priorbatch` check and immediate 
creation of read_stream:

         /* Delay initializing stream until reading from scan's second 
batch */
-        if (priorbatch && !scan->xs_heapfetch->rs && 
!batchqueue->disabled &&+
+       if (/*priorbatch && */!scan->xs_heapfetch->rs && 
!batchqueue->disabled &&

prefetch=increment replaces doubling of prefetch distance with increment:

         /* Look-ahead distance ramps up rapidly after we do I/O. */
-        distance = stream->distance * 2;
+       distance = stream->distance ? stream->distance + 1 : 0;


So as you expected, immediate creation of read_stream cause quite 
significant degrade of performance on indexscans inspecting small number 
of TIDs.
Looks like the threshold where read stream provides advantages in 
performance is about 10.
After it earlier initialization of read stream adds quite noticeable 
performance improvement.

I tried to find out using profiler and debugger where most of the time 
is spent in this case and answer was quite predictable -
in read_stream_reset->read_stream_next_buffer.

So we just consuming pefetched buffers which we do not need.

I thought that we can use some better policy for increasing prefetch 
distance (right now it is just doubled).
This is why I have tried this "incremental" policy.
Unfortunately it  can not help to reduce prefetch overhead for "short" 
indexscans.
But what surprised me is that for longer indexscans this approach seems 
to be slightly more efficient than doubling.


So look like we really should use number of items criteria for read 
stream initialization rather than number of batches.
And may be think about alternative policy for increasing prefetch distance.