Re: index prefetching

Peter Geoghegan <pg@bowt.ie>

From: Peter Geoghegan <pg@bowt.ie>

To: Konstantin Knizhnik <knizhnik@garret.ru>

Cc: Tomas Vondra <tomas@vondra.me>, Andres Freund <andres@anarazel.de>, Thomas Munro <thomas.munro@gmail.com>, Nazir Bilal Yavuz <byavuz81@gmail.com>, Robert Haas <robertmhaas@gmail.com>, Melanie Plageman <melanieplageman@gmail.com>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>, Georgios <gkokolatos@protonmail.com>, Dilip Kumar <dilipbalaut@gmail.com>

Date: 2025-12-17T18:49:43Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

aio: io_uring: Trigger async processing for large IOs
- a9ee66881744 19 (unreleased) landed
read stream: Split decision about look ahead for AIO and combining
- 8ca147d582a5 19 (unreleased) landed
read_stream: Only increase read-ahead distance when waiting for IO
- f63ca3379025 19 (unreleased) landed
read_stream: Prevent distance from decaying too quickly
- 6e36930f9aaf 19 (unreleased) landed
Reduce ExecSeqScan* code size using pg_assume()
- b227b0bb4e03 19 (unreleased) cited
Fix rare bug in read_stream.c's split IO handling.
- b421223172a2 19 (unreleased) cited
Fix multiranges to behave more like dependent types.
- 3e8235ba4f9c 17.0 cited
Add EXPLAIN (MEMORY) to report planner memory consumption
- 5de890e3610d 17.0 cited
Optimize nbtree backward scan boundary cases.
- c9c0589fda0e 17.0 cited
Increment xactCompletionCount during subtransaction abort.
- 90c885cdab8b 14.0 cited
Add nbtree Valgrind buffer lock checks.
- 4a70f829d86c 14.0 cited
Add nbtree high key "continuescan" optimization.
- 29b64d1de7c7 12.0 cited
Reduce pinning and buffer content locking for btree scans.
- 2ed5b87f96d4 9.5.0 cited
Teach btree to handle ScalarArrayOpExpr quals natively.
- 9e8da0f75731 9.2.0 cited

On Wed, Dec 17, 2025 at 12:19 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> create table t (pk integer primary key, payload text default repeat('x',
> 1000)) with (fillfactor=10);
> insert into t values (generate_series(1,10000000))
>
> So it creates table with size 80Gb (160 after vacuum) which doesn't fit
> in RAM.

160 after VACUUM? What do you mean?

> but what confuses me is that they do not depend on
> `effective_io_concurrency`.

You did change other settings, right? You didn't just use the default
shared_buffers, for example? (Sorry, I have to ask.)

> Moreover with `enable_indexscan_prefetch=off` results are the same.

It's quite unlikely that the current heuristics that trigger
prefetching would have ever allowed any prefetching, for queries such
as these.

The exact rule right now is that we don't even begin prefetching until
we've already read at least one index leaf page, and have to read
another one. So it's impossible to use prefetching with a LIMIT of 1,
with queries such as these. It's highly unlikely that you'd see any
benefits from prefetching even with LIMIT 100 (usually we wouldn't
even begin prefetching).

> Also I expected that the best effect of index prefetching should be for
> larger limit (accessing more heap pages). But as you see - it is not true.
>
> May we there is something wrong with my test scenario.

I could definitely believe that the new amgetbatch interface is
noticeably faster with range queries. Maybe 5% - 10% faster (even
without using the heap-buffer-locking optimization we've talked about
on this thread, which you can't have used here because I haven't
posted it to the list just yet). But a near 2x improvement wildly
exceeds my expectations. Honestly, I have no idea why the patch is so
much faster, and suspect an invalid result.

It might make sense for you to try it again with just the first patch
applied (the patch that adds the basic table AM and index AM interface
revisions, and makes nbtree supply its own amgetbatch/replaces
btgetbatch with btgettuple). I suppose it's possible that Andres'
patch 0004 somehow played some role here, since that is independently
useful work (I don't quite recall the details of where else that might
be useful right now). But that's just a wild guess.

> It will be nice to get some information about efficiency of prefetch,
> for example add `pefetch` option to explain: `explain
> (analyze,buffers,prefetch) ...`
> I think that in `pgaio_io_wait` we can distinguish IO operations which
> are completed without waiting and can be considered as prefetch hit.

> Right now it is hard to understand without debugger whether prefetch is
> perfromed at all.

Tomas did write a patch for that, but it isn't particularly well
optimized. I have mostly avoided using it for that reason. Basic
performance validation of the patch set is really hard in general, and
I've found it easier to just be extremely paranoid.

-- 
Peter Geoghegan