Re: index prefetching

From: Peter Geoghegan <pg@bowt.ie>

To: Tomas Vondra <tomas@vondra.me>

Cc: Andres Freund <andres@anarazel.de>, Thomas Munro <thomas.munro@gmail.com>, Nazir Bilal Yavuz <byavuz81@gmail.com>, Robert Haas <robertmhaas@gmail.com>, Melanie Plageman <melanieplageman@gmail.com>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>, Georgios <gkokolatos@protonmail.com>, Konstantin Knizhnik <knizhnik@garret.ru>, Dilip Kumar <dilipbalaut@gmail.com>

Date: 2025-11-14T18:19:08Z

Lists: pgsql-hackers

Commits

aio: io_uring: Trigger async processing for large IOs
- a9ee66881744 19 (unreleased) landed
read stream: Split decision about look ahead for AIO and combining
- 8ca147d582a5 19 (unreleased) landed
read_stream: Only increase read-ahead distance when waiting for IO
- f63ca3379025 19 (unreleased) landed
read_stream: Prevent distance from decaying too quickly
- 6e36930f9aaf 19 (unreleased) landed
Reduce ExecSeqScan* code size using pg_assume()
- b227b0bb4e03 19 (unreleased) cited
Fix rare bug in read_stream.c's split IO handling.
- b421223172a2 19 (unreleased) cited
Fix multiranges to behave more like dependent types.
- 3e8235ba4f9c 17.0 cited
Add EXPLAIN (MEMORY) to report planner memory consumption
- 5de890e3610d 17.0 cited
Optimize nbtree backward scan boundary cases.
- c9c0589fda0e 17.0 cited
Increment xactCompletionCount during subtransaction abort.
- 90c885cdab8b 14.0 cited
Add nbtree Valgrind buffer lock checks.
- 4a70f829d86c 14.0 cited
Add nbtree high key "continuescan" optimization.
- 29b64d1de7c7 12.0 cited
Reduce pinning and buffer content locking for btree scans.
- 2ed5b87f96d4 9.5.0 cited
Teach btree to handle ScalarArrayOpExpr quals natively.
- 9e8da0f75731 9.2.0 cited

On Wed, Nov 12, 2025 at 12:39 PM Tomas Vondra <tomas@vondra.me> wrote:
> I think I generally agree with what you said here about the challenges,
> although it's a bit too abstract to respond to individual parts. I just
> don't know how to rework the design to resolve this ...

I'm trying to identify which subsets of the existing design can
reasonably be committed in a single release (while acknowledging that
even those subsets will need to be reworked). That is more abstract
than any of us would like -- no question.

What are we most confident will definitely be useful to prefetching,
that also enables the "only lock heap buffer once per group of TIDs
that point to the same heap page returned from an index scan"
optimization? I'm trying to reach a tentative agreement that just
doing the amgetbatch revisions and the table AM revisions (to do the
other heap buffer lock optimization) will represent useful progress
that can be committed in a single release. And on what the specifics
of the table AM revisions will need to be, to get us to a patch that
we can commit to Postgres 19.
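
To make the "lock the heap buffer once per group of TIDs" idea concrete, here is a minimal sketch. It is not code from the patch: DemoTid and both demo_* functions are hypothetical names invented for illustration, and the actual buffer locking is only indicated by a comment. It assumes the index scan returns TIDs in an order where entries for the same heap block are often adjacent, and counts how many lock acquisitions a grouped approach would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a TID: heap block number plus line pointer offset. */
typedef struct DemoTid
{
    uint32_t    block;
    uint16_t    offset;
} DemoTid;

/*
 * Return the length of the run of adjacent TIDs, starting at tids[start],
 * that all point to the same heap block.
 */
static size_t
demo_same_page_run(const DemoTid *tids, size_t ntids, size_t start)
{
    size_t      end = start + 1;

    assert(start < ntids);
    while (end < ntids && tids[end].block == tids[start].block)
        end++;
    return end - start;
}

/*
 * Count the heap buffer lock acquisitions needed when each run of adjacent
 * same-block TIDs is processed under a single lock.
 */
static size_t
demo_count_lock_acquisitions(const DemoTid *tids, size_t ntids)
{
    size_t      locks = 0;
    size_t      i = 0;

    while (i < ntids)
    {
        /*
         * A real implementation would lock the buffer for tids[i].block
         * here, examine every TID in the run, then unlock.
         */
        i += demo_same_page_run(tids, ntids, i);
        locks++;
    }
    return locks;
}
```

Only adjacent TIDs are grouped in this sketch, so a heap block that reappears later in the index order is locked again. Without grouping, every TID would need its own lock acquisition.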

> For the read stream "pausing" I think it's pretty clear it's more a
> workaround than a desired behavior. We only pause the stream because we
> need to limit the look-ahead distance (measured in index leaf pages),
> and the read_stream has no such concept. It only knows about heap pins,
> but e.g. IOS may need to read many leaf pages to find a single heap page
> to prefetch. And the leaf pages are invisible to the stream.

Right. But we seemed to talk about this as if the implementation of
"pausing" was the problem. I was suggesting that the general idea of
pausing might well be the wrong one -- at least when applied in
anything like the way we currently apply it.

More importantly, I feel that it'll be really hard to get a clear
answer to that particular question (and a couple of others like it)
without first getting clarity on what we need from the table AM at a
high level, API-wise. Bearing in mind that we've made no real progress
on that at all.

We all agree that it's bad that indexam.c tacitly coordinates with
heapam in the way it does in the current patch. And that assuming a
TID representation in the API is bad. But that isn't very satisfying
to me; it's too focussed on that one really obvious and glaring
problem, and what we *don't* want. There's been very little (almost
nothing) on this thread about what we actually *do* want. That's the
thing that's still way too abstract, that I'd like to make more
concrete.

As you know, I think that we should add a new table AM interface that
makes the table AM directly aware of the fact that it is feeding an
ordered index scan, completely avoiding the use of TIDs (as well as
avoiding *any* more abstract representation of a table AM tuple
identifier). In other words, I think that we should just fully admit
the fact that the table AM is in control of the scan, and all that
comes with it. The table AM will have to directly coordinate with the
index AM in a way that's quite different to what we do right now.

I don't think that anybody else has really said much about that idea,
at least on the list. Is it a reasonable approach to take? This is
really important, especially in the short term/for Postgres 19.

> The limit of 64 batches is entirely arbitrary. I needed a number that
> would limit the amount of memory and time wasted on useless look-ahead,
> and 64 seemed "reasonable" (not too high, but enough to not be hit very
> often). Originally there was a fixed-length queue of batches, and 64 was
> the capacity, but we no longer do it that way. So it's an imperfect
> safety measure against "runaway" streams.

Right, but we still max out at 64. And then we stay there. It just
feels unprincipled to me.
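
For anyone following along, the behavior in question can be modeled roughly as below. This is a toy model under stated assumptions, not the patch's logic: DemoScan, DEMO_MAX_BATCHES, and the demo_* functions are hypothetical names, and a "batch" here simply stands for one index leaf page's worth of look-ahead state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_BATCHES 64     /* the arbitrary cap discussed above */

/* Hypothetical look-ahead state for one index scan. */
typedef struct DemoScan
{
    size_t      batches_loaded; /* leaf-page batches held ahead of the scan */
    bool        paused;         /* true once look-ahead hit the cap */
} DemoScan;

/*
 * Try to load one more batch for look-ahead.  Returns false, and marks the
 * scan as paused, once the cap has been reached.
 */
static bool
demo_try_load_batch(DemoScan *scan)
{
    assert(scan != NULL);
    if (scan->batches_loaded >= DEMO_MAX_BATCHES)
    {
        scan->paused = true;
        return false;
    }
    scan->batches_loaded++;
    return true;
}

/*
 * The scan finished returning tuples from its oldest batch: free that slot
 * and allow look-ahead to resume.
 */
static void
demo_release_batch(DemoScan *scan)
{
    assert(scan != NULL);
    if (scan->batches_loaded > 0)
        scan->batches_loaded--;
    scan->paused = false;
}
```

In this model a scan whose look-ahead keeps loading batches climbs to 64 and then hovers there, gaining one slot each time the scan consumes a batch. The cap limits memory use but says nothing about how much look-ahead is actually useful.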

> I don't want to get into too much detail about this particular issue,
> it's already discussed somewhere in this thread. But if there was a way
> to "tell" the read stream how much effort to spend looking ahead, we
> wouldn't do the pausing (not in the end+reset way).

I don't want to get into that again either. It was just an example of
the kinds of problems we're running into. Though a particularly good
example IMV.

> > That's all I have for now. My thoughts here should be considered
> > tentative; I want to put my thinking on a more rigorous footing before
> > really committing to this new phased approach.
> >
>
> I don't object to the "phased approach" with doing the batching first,
> but without seeing the code I can't really say if/how much it helps with
> resolving the design/layering questions. It feels a bit too abstract to
> me.

It is in no small part based on gut feeling and intuition. I don't
have anything better to go on right now. It's a really difficult
project.

> While working on the prefetching I moved the code between layers
> about three times, and I'm still not quite sure which layer should be
> responsible for which piece :-(

I don't think that this is quite the same situation.

The index prefetching design has been completely overhauled twice now, but
on both occasions that was driven by some clear goal/the need to fix
some problem with the prior design. The first time it was due to the
fact that the original version didn't work with kill_prior_tuple. The
second time was due to the need to support reading index pages that
were ahead of the current page that the scan is returning tuples from.
Granted, it took a while to actually prove that the second overhaul
(which created the third major redesign) was the right direction to
take things in, but testing did eventually make that quite clear.

I don't see this as doing the same thing a third time/creating a fourth
design from scratch. It's more of a refinement (albeit quite a big
one) of the most recent design. And in a direction that doesn't seem
too surprising to me. We knew that the table AM side of the most
recent redesign still had plenty of problems. We should have been a
bit more focussed on that side of things earlier on.

--
Peter Geoghegan