Thread

Re: Trying out read streams in pgvector (an extension)

Thomas Munro <thomas.munro@gmail.com> — 2025-12-09T03:47:08Z
On Fri, Nov 21, 2025 at 4:28 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I'm not totally opposed to this. My rationale for making it an error
> is that the developer could have test cases where all the buffers are
> consumed but the code is written such that that won't always happen.
> Then if a real production query doesn't consume all the buffers, it
> could return wrong results (I think). That will mean the user can't
> complete their query until the extension author releases a new version
> of their code. But I'm not sure what the right answer is here.

Focusing on making sure v19 has a good interface for this, and
abandoning thoughts of back-patching a bandaid, and the constraints
that leads to, for now...

I think it'd be better if that were the consumer's choice.   I don't
want the consumer to be required to drain the stream before resuming,
as that'd be an unprincipled stall.  For example, if new WAL arrives
over the network then I think it should be possible for recovery's
WAL-powered stream of heap pages to resume looking ahead even if
recovery hasn't drained the existing stream completely.

Peter G (CC'd) and I discussed some problems he had in the index
prefetching work, and I tried to extend this a bit to give the
semantics he wanted, in point 2 below.  It's simple itself, but might
lead to some tricky questions higher up.  Posted for experimentation.
It'll be interesting to see if this goes somewhere.

1.  read_stream_resume() as before, but with a new explicit
read_stream_pause(): if a block number callback would like to report a
temporary lack of information, it should return
read_stream_pause(stream), not InvalidBlockNumber.  Then after
read_stream_resume(stream) is called, the next
read_stream_next_buffer() enters the lookahead loop again.  While
paused, if the consumer drains all the existing buffers in the stream
and then one more, it will receive InvalidBuffer, but if the _resume()
call is made sooner, the consumer won't ever know about the temporary
lack of buffers in the stream.

2.  read_stream_yield(): while streaming heap pages that come from
TIDs on index pages, Peter didn't like that the executor lost control
of how much work was done by the lookahead loop underneath
read_stream_next_buffer().  The consumer might have a heap page with
some tuples that could be emitted right now, but the block number
callback might be evaluating arbitrarily expensive filter qual
expressions far ahead, and they might prefer to emit more tuples now
before doing an unbounded amount of work finding more.  This interface
allows some limited coroutine-like multitasking, where the block
number callback can return read_stream_yield(stream) to return control
back to the consumer periodically if it knows the consumer could
already do something else.  It works by pausing the stream and
resuming it in the next read_stream_next_buffer() call, but that's an
internal detail.

Some half-baked thoughts about the resulting flow control:

Yielding control periodically just when it happens to be possible
within the constraints of the volcano executor is an interesting thing
to think about.  You can only yield if you already have a tuple to
emit.  There is no saying when control will return to you, and the
node you yield to might immediately block on I/O and yet you could
have been doing useful CPU work.  You probably need an event-driven
node-hopping executor to fix that in general, but on the flip side, I
can think of one bet that I'd take: if you already have a tuple to
emit AND if index scans themselves (not only referenced heap pages)
were also streamed AND if a hypothetical
read_stream_next_buffer_no_wait(btree_stream) said the next index page
you need is not ready yet, then you should yield.  You're gambling
that other plan nodes will have better luck running without an I/O
stall, but you have ~0% chance.

Yielding just because you've scanned N index pages/tuples/whatever is
harder to think about.  The stream shouldn't get far ahead unless it's
recently been useful for I/O concurrency (though optimal distance
heuristics are an open problem), but in this case a single invocation
of the block number callback can call ReadBuffer() an arbitrary number
of times, filtering out all the index tuples as it rampages through
the whole index IIUC.  I see why you might want to yield periodically
if you can, but I also wonder how much that can really help if you
still have to pick up where you left off next time.  I guess it
depends on the distribution of matches.  It's also clear that any
cold-cache testing done with direct I/O enabled will stall abominably
as long as that level calls ReadBuffer(), possibly confusing matters.