Re: index prefetching
Peter Geoghegan <pg@bowt.ie>
Commits
- aio: io_uring: Trigger async processing for large IOs (a9ee66881744; 19, unreleased; landed)
- read stream: Split decision about look ahead for AIO and combining (8ca147d582a5; 19, unreleased; landed)
- read_stream: Only increase read-ahead distance when waiting for IO (f63ca3379025; 19, unreleased; landed)
- read_stream: Prevent distance from decaying too quickly (6e36930f9aaf; 19, unreleased; landed)
- Reduce ExecSeqScan* code size using pg_assume() (b227b0bb4e03; 19, unreleased; cited)
- Fix rare bug in read_stream.c's split IO handling. (b421223172a2; 19, unreleased; cited)
- Fix multiranges to behave more like dependent types. (3e8235ba4f9c; 17.0; cited)
- Add EXPLAIN (MEMORY) to report planner memory consumption (5de890e3610d; 17.0; cited)
- Optimize nbtree backward scan boundary cases. (c9c0589fda0e; 17.0; cited)
- Increment xactCompletionCount during subtransaction abort. (90c885cdab8b; 14.0; cited)
- Add nbtree Valgrind buffer lock checks. (4a70f829d86c; 14.0; cited)
- Add nbtree high key "continuescan" optimization. (29b64d1de7c7; 12.0; cited)
- Reduce pinning and buffer content locking for btree scans. (2ed5b87f96d4; 9.5.0; cited)
- Teach btree to handle ScalarArrayOpExpr quals natively. (9e8da0f75731; 9.2.0; cited)

Hi Amit,

On Thu, Dec 4, 2025 at 12:54 AM Amit Langote <amitlangote09@gmail.com> wrote:
> I want to acknowledge that figuring out the right layering to make I/O
> prefetching and perhaps other optimizations internal to IndexNext()
> work is obviously the priority right now, regardless of the output
> format used to populate the slots ultimately returned by
> table_index_getnext_slot().

Right; table_index_getnext_slot simply returns a tuple into the
caller's slot. That's almost the same as the existing getnext_slot
interface used by those same call sites on the master branch, except
that in the patch we're directly calling a table AM callback/heapam
specific implementation (not code in indexam.c). The new heapam
implementation heapam_index_getnext_slot applies more high-level
context about ordered index scans, which enables it to reorder work
quite freely, even when it is work that takes place in index AMs.

> However, regarding your question about
> "painting ourselves into a corner":
>
> In my executor batching work (which has focused on Seq Scans), the
> HeapBatch is essentially just a pinned buffer plus an array of
> pre-allocated tuple headers. I hadn't strictly considered creating a
> HeapBatch to return from Index Scans, largely because
> heap_hot_search_buffer() is designed for scalar (or non-batched)
> access that requires repeated buffer locking.
>
> But it seems like the eventual goal of batching calls to
> heap_hot_search_buffer() effectively clears that hurdle.

Actually, that's not the eventual goal anymore; now we're treating it
as our *immediate* goal, at least in terms of things that will have
user-visible impact (as opposed to API changes needed to facilitate
batching type optimizations in the future, including I/O prefetching).
It's not completely clear if prefetching is off the table for Postgres
19, but it certainly seems optimistic at this point.
But the heap_hot_search_buffer thing definitely is in scope for
Postgres 19 (if we're going to make all these API changes then it
seems best to give users an immediate benefit).

> As long as
> the internal logic separates the "grouping/locking" from the
> "materializing into a slot," it seems this design does not prevent us
> from eventually wiring up a table_index_getnext_batch() to populate
> the HeapBatch structure I am proposing for the regular non-index scan
> path (table_scan_getnextbatch() in my patch).

That's good.

Suppose we do a much more advanced version of the kind of work
reordering that the heap_hot_search_buffer thing will do for Postgres
19. I described this to Tomas in my last email to this thread, when I
said:

"""
We could even do something much more sophisticated than what I
actually have planned for 19: we could reorder table fetches, such
that we only had to lock and pin each heap page exactly once *even
when the TIDs returned by the index scan return TIDs slightly out of
order*. For example, if an index page/batch returns TIDs "(1,1),
(2,1), (1,2), (1,3), (2,2)", we could get all tuples for heap blocks 1
and 2 by locking and pinning each of those 2 pages exactly once. The
only downside (other than the complexity) is that we'd sometimes hold
multiple heap page pins at a time, not just one.
"""

(To be clear this more advanced version is definitely out of scope for
Postgres 19.)

We'd be holding on to multiple buffer pins at a time (across calls to
heapam_index_getnext_slot) were we to do this more advanced
optimization. I *think* that still means that the design/internal
logic will (as you put it) "separate the 'grouping/locking' from the
'materializing into a slot'". That's just the only way that could
possibly work correctly, at least with heapam.

It makes sense for us both to (at a minimum) have at least some
general awareness of each other's goals. I really only want to avoid
completely gratuitous incompatibilities/conflicts.
For example, if you invent a new slot-like mechanism in the executor
that can return multiple tuples in one go, then it seems like we
should probably try to use that in our own work on batching. If we're
already assembling the information in a way that almost works with
that new interface, why wouldn't we make sure that it actually worked
with and used that new interface directly?

It doesn't sound like there'd be many disagreements on how that would
have to work, since the requirements are largely dictated by existing
constraints that we're both already naturally subject to. For example:

* We need to hold on to a buffer pin on a heap page if one of its heap
  tuples is contained in a slot/something slot-like. For as long as
  there's any chance that somebody will examine that heap tuple (until
  the slot releases the tuple).

* Buffer locks must only be acquired by lower-level access method
  code, for very short periods, and never in a way that requires
  coordination across module boundaries.

It sounds like the potential for conflicts between each other's work
will be absolutely minimal. It seems as if we don't even have to agree
on anything new or novel.

> Sorry to hijack the thread, but just wanted to confirm I haven't
> misunderstood the architectural implications for future batching.

I don't think that you've hijacked anything. Your input is more than
welcome.

--
Peter Geoghegan
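
A rough illustration of the TID example quoted in the message above,
written as a minimal Python sketch rather than anything taken from the
patch. The names Tid, visits_in_index_order, and group_by_block are
hypothetical. The sketch only counts how many times a heap page would
have to be locked and pinned; it does not model buffer pins, visibility
checks, or returning tuples in index order.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for a heap TID: (heap block number, offset number).
Tid = Tuple[int, int]


def visits_in_index_order(tids: List[Tid]) -> int:
    """Count page visits when TIDs are fetched strictly in index order.

    A new visit (lock + pin) is assumed whenever the block number differs
    from the block of the previous TID.
    """
    visits = 0
    prev_block: Optional[int] = None
    for block, _offset in tids:
        if block != prev_block:
            visits += 1
            prev_block = block
    return visits


def group_by_block(tids: List[Tid]) -> Dict[int, List[int]]:
    """Group offsets by heap block, so each block is visited only once.

    Blocks keep their first-seen order; offsets keep their index order.
    """
    groups: Dict[int, List[int]] = {}
    for block, offset in tids:
        groups.setdefault(block, []).append(offset)
    return groups


# The batch of TIDs used as the example in the quoted passage.
batch: List[Tid] = [(1, 1), (2, 1), (1, 2), (1, 3), (2, 2)]

groups = group_by_block(batch)
print("visits in index order:", visits_in_index_order(batch))  # 4
print("visits when grouped:  ", len(groups))                   # 2
print("offsets per block:    ", groups)  # {1: [1, 2, 3], 2: [1, 2]}
```

In this toy model the grouped approach touches each of the 2 heap pages
once instead of 4 times, at the cost of keeping state for both pages
around at once, which corresponds to the "multiple heap page pins at a
time" downside mentioned in the message.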