Re: index prefetching

From: Peter Geoghegan <pg@bowt.ie>

To: Amit Langote <amitlangote09@gmail.com>

Cc: Tomas Vondra <tomas@vondra.me>, Andres Freund <andres@anarazel.de>, Thomas Munro <thomas.munro@gmail.com>, Nazir Bilal Yavuz <byavuz81@gmail.com>, Robert Haas <robertmhaas@gmail.com>, Melanie Plageman <melanieplageman@gmail.com>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>, Georgios <gkokolatos@protonmail.com>, Konstantin Knizhnik <knizhnik@garret.ru>, Dilip Kumar <dilipbalaut@gmail.com>

Date: 2025-12-04T21:10:44Z

Lists: pgsql-hackers

Commits

aio: io_uring: Trigger async processing for large IOs
- a9ee66881744 19 (unreleased) landed
read stream: Split decision about look ahead for AIO and combining
- 8ca147d582a5 19 (unreleased) landed
read_stream: Only increase read-ahead distance when waiting for IO
- f63ca3379025 19 (unreleased) landed
read_stream: Prevent distance from decaying too quickly
- 6e36930f9aaf 19 (unreleased) landed
Reduce ExecSeqScan* code size using pg_assume()
- b227b0bb4e03 19 (unreleased) cited
Fix rare bug in read_stream.c's split IO handling.
- b421223172a2 19 (unreleased) cited
Fix multiranges to behave more like dependent types.
- 3e8235ba4f9c 17.0 cited
Add EXPLAIN (MEMORY) to report planner memory consumption
- 5de890e3610d 17.0 cited
Optimize nbtree backward scan boundary cases.
- c9c0589fda0e 17.0 cited
Increment xactCompletionCount during subtransaction abort.
- 90c885cdab8b 14.0 cited
Add nbtree Valgrind buffer lock checks.
- 4a70f829d86c 14.0 cited
Add nbtree high key "continuescan" optimization.
- 29b64d1de7c7 12.0 cited
Reduce pinning and buffer content locking for btree scans.
- 2ed5b87f96d4 9.5.0 cited
Teach btree to handle ScalarArrayOpExpr quals natively.
- 9e8da0f75731 9.2.0 cited

Hi Amit,

On Thu, Dec 4, 2025 at 12:54 AM Amit Langote <amitlangote09@gmail.com> wrote:
> I want to acknowledge that figuring out the right layering to make I/O
> prefetching and perhaps other optimizations internal to IndexNext()
> work is obviously the priority right now, regardless of the output
> format used to populate the slots ultimately returned by
> table_index_getnext_slot().

Right; table_index_getnext_slot simply returns a tuple into the
caller's slot. That's almost the same as the existing getnext_slot
interface used by those same call sites on the master branch, except
that in the patch we're directly calling a table AM callback/heapam
specific implementation (not code in indexam.c).

The new heapam implementation heapam_index_getnext_slot applies more
high-level context about ordered index scans, which enables it to
reorder work quite freely, even when it is work that takes place in
index AMs.

> However, regarding your question about
> "painting ourselves into a corner":
>
> In my executor batching work (which has focused on Seq Scans), the
> HeapBatch is essentially just a pinned buffer plus an array of
> pre-allocated tuple headers. I hadn't strictly considered creating a
> HeapBatch to return from Index Scans, largely because
> heap_hot_search_buffer() is designed for scalar (or non-batched)
> access that requires repeated buffer locking.
>
> But it seems like the eventual goal of batching calls to
> heap_hot_search_buffer() effectively clears that hurdle.

Actually, that's not the eventual goal anymore; now we're treating it
as our *immediate* goal, at least in terms of things that will have
user-visible impact (as opposed to API changes needed to facilitate
batching type optimizations in the future, including I/O prefetching).

It's not completely clear whether prefetching is off the table for
Postgres 19, but getting it in certainly seems optimistic at this
point. The heap_hot_search_buffer thing, on the other hand, definitely
is in scope for Postgres 19 (if we're going to make all these API
changes, then it seems best to give users an immediate benefit).

> As long as
> the internal logic separates the "grouping/locking" from the
> "materializing into a slot," it seems this design does not prevent us
> from eventually wiring up a table_index_getnext_batch() to populate
> the HeapBatch structure I am proposing for the regular non-index scan
> path (table_scan_getnextbatch() in my patch).

That's good.

Suppose we do a much more advanced version of the kind of work
reordering that the heap_hot_search_buffer thing will do for Postgres
19. I described this to Tomas in my last email to this thread, when I
said:

"""
We could even do something much more sophisticated than what I
actually have planned for 19: we could reorder table fetches, such
that we only had to lock and pin each heap page exactly once *even
when the TIDs returned by the index scan return TIDs slightly out of
order*. For example, if an index page/batch returns TIDs "(1,1),
(2,1), (1,2), (1,3), (2,2)", we could get all tuples for heap blocks 1
and 2 by locking and pinning each of those 2 pages exactly once. The
only downside (other than the complexity) is that we'd sometimes hold
multiple heap page pins at a time, not just one.
"""

(To be clear this more advanced version is definitely out of scope for
Postgres 19.)
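
To make the example from the quoted text concrete, here is a toy Python model (not PostgreSQL code) that counts how many times each heap page would have to be locked and pinned under three strategies. It rests on assumptions: TIDs are plain (block, offset) pairs, every function name is hypothetical, and "one visit per run of adjacent same-block TIDs" is only a guess at what the simpler batching amounts to. The last function also shows that tuples can still be returned in index order, at the cost of holding several pages pinned at once.

```python
from itertools import groupby

# The TID sequence from the example above, as (heap_block, offset) pairs.
TIDS = [(1, 1), (2, 1), (1, 2), (1, 3), (2, 2)]


def visits_per_tid(tids):
    """No batching: one lock/pin cycle for every TID."""
    return len(tids)


def visits_adjacent_runs(tids):
    """Assumed simpler batching: one cycle per run of adjacent TIDs
    that point into the same heap block."""
    return sum(1 for _ in groupby(tids, key=lambda tid: tid[0]))


def group_by_block(tids):
    """Full reordering: collect every offset wanted from each block,
    keeping blocks in first-seen order."""
    groups = {}
    for block, offset in tids:
        groups.setdefault(block, []).append(offset)
    return groups


def fetch_in_index_order(tids, read_page):
    """Read each block exactly once via read_page (a hypothetical callback
    returning {offset: tuple}), then emit tuples in the original index order.
    Returns the tuples and the number of pages held at the same time."""
    pinned = {}
    for block, _ in tids:
        if block not in pinned:
            pinned[block] = read_page(block)
    rows = [pinned[block][offset] for block, offset in tids]
    return rows, len(pinned)
```

For the five TIDs above, the model gives 5 page visits with no batching, 4 when only adjacent TIDs are grouped, and 2 with full reordering, while holding 2 pages pinned at once.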

We'd be holding on to multiple buffer pins at a time (across calls to
heapam_index_getnext_slot) were we to do this more advanced
optimization. I *think* that still means that the design/internal
logic will (as you put it) "separate the 'grouping/locking' from the
'materializing into a slot'". That's just the only way that could
possibly work correctly, at least with heapam.

It makes sense for both of us to have (at a minimum) some general
awareness of each other's goals. I really only want to avoid
completely gratuitous incompatibilities/conflicts. For example, if you
invent a new slot-like mechanism in the executor that can return
multiple tuples in one go, then it seems like we should probably try
to use that in our own work on batching. If we're already assembling
the information in a way that almost works with that new interface,
why wouldn't we make sure that it actually worked with and used that
new interface directly?

It doesn't sound like there'd be many disagreements on how that would
have to work, since the requirements are largely dictated by existing
constraints that we're both already naturally subject to. For example:

* We need to hold on to a buffer pin on a heap page if one of its heap
tuples is contained in a slot/something slot-like, for as long as
there's any chance that somebody will examine that heap tuple (until
the slot releases the tuple).

* Buffer locks must only be acquired by lower-level access method
code, for very short periods, and never in a way that requires
coordination across module boundaries.
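
The two constraints above can be illustrated with a small Python model. This is only a sketch under stated assumptions: Buffer, Slot, and am_fetch are hypothetical names invented for illustration, not PostgreSQL APIs, and pins and locks are reduced to a counter and a flag. The point is the lifetime difference: the content lock is taken and released entirely inside the access method function, while the pin stays held for as long as the slot references the tuple.

```python
from contextlib import contextmanager


class Buffer:
    """Hypothetical stand-in for a shared buffer holding one heap page."""

    def __init__(self, block):
        self.block = block
        self.pin_count = 0
        self.locked = False

    def pin(self):
        self.pin_count += 1

    def unpin(self):
        assert self.pin_count > 0, "unpin without a matching pin"
        self.pin_count -= 1

    @contextmanager
    def content_lock(self):
        # A content lock is only legal while the buffer is pinned.
        assert self.pin_count > 0, "must pin before locking"
        self.locked = True
        try:
            yield
        finally:
            self.locked = False


class Slot:
    """Hypothetical slot: keeps its own pin while it references a tuple."""

    def __init__(self):
        self.buffer = None
        self.tuple = None

    def store(self, buffer, tup):
        buffer.pin()  # take the new pin first, in case it is the same buffer
        self.clear()
        self.buffer, self.tuple = buffer, tup

    def clear(self):
        if self.buffer is not None:
            self.buffer.unpin()
        self.buffer = self.tuple = None


def am_fetch(buffer, offset, slot):
    """Access-method side: lock briefly, hand the tuple to the slot, and
    never return to the caller with the content lock still held."""
    buffer.pin()
    with buffer.content_lock():
        tup = (buffer.block, offset)  # stands in for reading the heap tuple
    slot.store(buffer, tup)
    buffer.unpin()  # the slot's own pin keeps the page in place
```

After am_fetch returns, the buffer is unlocked but still pinned once (by the slot); the pin only goes away when the slot is cleared or overwritten.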

It sounds like the potential for conflicts between each other's work
will be absolutely minimal. It seems as if we don't even have to agree
on anything new or novel.

> Sorry to hijack the thread, but just wanted to confirm I haven't
> misunderstood the architectural implications for future batching.

I don't think that you've hijacked anything. Your input is more than welcome.

-- 
Peter Geoghegan