Re: index prefetching
Peter Geoghegan <pg@bowt.ie>
Commits
- aio: io_uring: Trigger async processing for large IOs
  a9ee66881744, 19 (unreleased), landed
- read stream: Split decision about look ahead for AIO and combining
  8ca147d582a5, 19 (unreleased), landed
- read_stream: Only increase read-ahead distance when waiting for IO
  f63ca3379025, 19 (unreleased), landed
- read_stream: Prevent distance from decaying too quickly
  6e36930f9aaf, 19 (unreleased), landed
- Reduce ExecSeqScan* code size using pg_assume()
  b227b0bb4e03, 19 (unreleased), cited
- Fix rare bug in read_stream.c's split IO handling.
  b421223172a2, 19 (unreleased), cited
- Fix multiranges to behave more like dependent types.
  3e8235ba4f9c, 17.0, cited
- Add EXPLAIN (MEMORY) to report planner memory consumption
  5de890e3610d, 17.0, cited
- Optimize nbtree backward scan boundary cases.
  c9c0589fda0e, 17.0, cited
- Increment xactCompletionCount during subtransaction abort.
  90c885cdab8b, 14.0, cited
- Add nbtree Valgrind buffer lock checks.
  4a70f829d86c, 14.0, cited
- Add nbtree high key "continuescan" optimization.
  29b64d1de7c7, 12.0, cited
- Reduce pinning and buffer content locking for btree scans.
  2ed5b87f96d4, 9.5.0, cited
- Teach btree to handle ScalarArrayOpExpr quals natively.
  9e8da0f75731, 9.2.0, cited

On Wed, Nov 12, 2025 at 12:39 PM Tomas Vondra <tomas@vondra.me> wrote:
> I think I generally agree with what you said here about the challenges,
> although it's a bit too abstract to respond to individual parts. I just
> don't know how to rework the design to resolve this ...

I'm trying to identify which subsets of the existing design can
reasonably be committed in a single release (while acknowledging that
even those subsets will need to be reworked). That is more abstract
than any of us would like -- no question.

What are we most confident will definitely be useful to prefetching,
that also enables the "only lock heap buffer once per group of TIDs
that point to the same heap page returned from an index scan"
optimization? I'm trying to reach a tentative agreement that just
doing the amgetbatch revisions and the table AM revisions (to do the
other heap buffer lock optimization) will represent useful progress
that can be committed in a single release. And on what the specifics
of the table AM revisions will need to be, to get us to a patch that
we can commit to Postgres 19.

> For the reads stream "pausing" I think it's pretty clear it's more a
> workaround than a desired behavior. We only pause the stream because we
> need to limit the look-ahead distance (measured in index leaf pages),
> and the read_stream has no such concept. It only knows about heap pins,
> but e.g. IOS may need to read many leaf pages to find a single heap page
> to prefetch. And the leaf pages are invisible to the stream.

Right. But we seemed to talk about this as if the implementation of
"pausing" was the problem. I was suggesting that the general idea of
pausing might well be the wrong one -- at least when applied in
anything like the way we currently apply it.

More importantly, I feel that it'll be really hard to get a clear
answer to that particular question (and a couple of others like it)
without first getting clarity on what we need from the table AM at a
high level, API-wise.
Bearing in mind that we've made no real progress on that at all.

We all agree that it's bad that indexam.c tacitly coordinates with
heapam in the way it does in the current patch. And that assuming a
TID representation in the API is bad. But that isn't very satisfying
to me; it's too focussed on that one really obvious and glaring
problem, and what we *don't* want. There's been very little (almost
nothing) on this thread about what we actually *do* want. That's the
thing that's still way too abstract, that I'd like to make more
concrete.

As you know, I think that we should add a new table AM interface that
makes the table AM directly aware of the fact that it is feeding an
ordered index scan, completely avoiding the use of TIDs (as well as
avoiding *any* more abstract representation of a table AM tuple
identifier). In other words, I think that we should just fully admit
the fact that the table AM is in control of the scan, and all that
comes with it. The table AM will have to directly coordinate with the
index AM in a way that's quite different to what we do right now.

I don't think that anybody else has really said much about that idea,
at least on the list. Is it a reasonable approach to take? This is
really important, especially in the short term/for Postgres 19.

> The limit of 64 batches is entirely arbitrary. I needed a number that
> would limit the amount of memory and time wasted on useless look-ahead,
> and 64 seemed "reasonable" (not too high, but enough to not be hit very
> often). Originally there was a fixed-length queue of batches, and 64 was
> the capacity, but we no longer do it that way. So it's an imperfect
> safety measure against "runaway" streams.

Right, but we still max out at 64. And then we stay there. It just
feels unprincipled to me.

> I don't want to get into too much detail about this particular issue,
> it's already discussed somewhere in this thread.
> But if there was a way
> to "tell" the read stream how much effort to spend looking ahead, we
> wouldn't do the pausing (not in the end+reset way).

I don't want to get into that again either. It was just an example of
the kinds of problems we're running into. Though a particularly good
example IMV.

> > That's all I have for now. My thoughts here should be considered
> > tentative; I want to put my thinking on a more rigorous footing before
> > really committing to this new phased approach.
> >
>
> I don't object to the "phased approach" with doing the batching first,
> but without seeing the code I can't really say if/how much it helps with
> resolving the design/layering questions. It feels a bit too abstract to
> me.

It is in no small part based on gut feeling and intuition. I don't
have anything better to go on right now. It's a really difficult
project.

> While working on the prefetching I moved the code between layers
> about three times, and I'm still not quite sure which layer should be
> responsible for which piece :-(

I don't think that this is quite the same situation.

The index prefetching design was completely overhauled twice now, but
on both occasions that was driven by some clear goal/the need to fix
some problem with the prior design. The first time it was due to the
fact that the original version didn't work with kill_prior_tuple. The
second time was due to the need to support reading index pages that
were ahead of the current page that the scan is returning tuples from.
Granted, it took a while to actually prove that the second overhaul
(which created the third major redesign) was the right direction to
take things in, but testing did eventually make that quite clear.

I don't see this as doing the same thing a third time/creating a
fourth design from scratch. It's more of a refinement (albeit quite a
big one) of the most recent design. And in a direction that doesn't
seem too surprising to me.
We knew that the table AM side of the most recent redesign still had
plenty of problems. We should have been a bit more focussed on that
side of things earlier on.

--
Peter Geoghegan