Thread

  1. Re: Trying out read streams in pgvector (an extension)

    Thomas Munro <thomas.munro@gmail.com> — 2025-12-09T03:47:08Z

    On Fri, Nov 21, 2025 at 4:28 AM Melanie Plageman
    <melanieplageman@gmail.com> wrote:
    > I'm not totally opposed to this. My rationale for making it an error
    > is that the developer could have test cases where all the buffers are
    > consumed but the code is written such that that won't always happen.
    > Then if a real production query doesn't consume all the buffers, it
    > could return wrong results (I think). That will mean the user can't
    > complete their query until the extension author releases a new version
    > of their code. But I'm not sure what the right answer is here.
    
    For now I'm focusing on making sure v19 has a good interface for this,
    and setting aside thoughts of back-patching a band-aid and the
    constraints that would impose...
    
    I think it'd be better if that were the consumer's choice.  I don't
    want the consumer to be required to drain the stream before resuming,
    as that'd be an unprincipled stall.  For example, if new WAL arrives
    over the network then I think it should be possible for recovery's
    WAL-powered stream of heap pages to resume looking ahead even if
    recovery hasn't drained the existing stream completely.
    
    Peter G (CC'd) and I discussed some problems he had in the index
    prefetching work, and I tried to extend this a bit to give the
    semantics he wanted, in point 2 below.  It's simple in itself, but might
    lead to some tricky questions higher up.  Posted for experimentation.
    It'll be interesting to see if this goes somewhere.
    
    1.  read_stream_resume() as before, but with a new explicit
    read_stream_pause(): if a block number callback would like to report a
    temporary lack of information, it should return
    read_stream_pause(stream), not InvalidBlockNumber.  Then after
    read_stream_resume(stream) is called, the next
    read_stream_next_buffer() enters the lookahead loop again.  While
    paused, if the consumer drains all the existing buffers in the stream
    and then one more, it will receive InvalidBuffer, but if the _resume()
    call is made sooner, the consumer won't ever know about the temporary
    lack of buffers in the stream.
    
    2.  read_stream_yield(): while streaming heap pages that come from
    TIDs on index pages, Peter didn't like that the executor lost control
    of how much work was done by the lookahead loop underneath
    read_stream_next_buffer().  The consumer might have a heap page with
    some tuples that could be emitted right now, but the block number
    callback might be evaluating arbitrarily expensive filter qual
    expressions far ahead, and they might prefer to emit more tuples now
    before doing an unbounded amount of work finding more.  This interface
    allows some limited coroutine-like multitasking, where the block
    number callback can return read_stream_yield(stream) to return control
    back to the consumer periodically if it knows the consumer could
    already do something else.  It works by pausing the stream and
    resuming it in the next read_stream_next_buffer() call, but that's an
    internal detail.
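
To make the control flow in points 1 and 2 concrete, here is a toy model
in plain C.  It is not the patch and does not use PostgreSQL's headers:
every Toy*/toy_* name is hypothetical, "buffers" are just block numbers,
and the assumption that pause and yield report themselves by returning
the invalid block number after setting a flag is a guess at the
mechanism, not something confirmed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_INVALID_BLOCK  (-1)     /* stands in for InvalidBlockNumber */
#define TOY_INVALID_BUFFER (-1)     /* stands in for InvalidBuffer */
#define TOY_QUEUE_SIZE 8

typedef struct ToyStream ToyStream;
typedef int (*ToyBlockCallback) (ToyStream *stream, void *arg);

struct ToyStream
{
    ToyBlockCallback callback;
    void *arg;
    int queue[TOY_QUEUE_SIZE];      /* toy "buffers" are just block numbers */
    int head;
    int count;
    bool paused;                    /* look-ahead suspended until resume */
    bool yielded;                   /* pause that auto-resumes on next call */
    bool ended;                     /* callback reported a real end */
};

/* Callback returns this to report a temporary lack of information. */
int toy_stream_pause(ToyStream *stream)
{
    stream->paused = true;
    return TOY_INVALID_BLOCK;
}

/* Callback returns this to hand control back to the consumer for a while. */
int toy_stream_yield(ToyStream *stream)
{
    stream->yielded = true;
    return toy_stream_pause(stream);
}

void toy_stream_resume(ToyStream *stream)
{
    stream->paused = false;
    stream->yielded = false;
}

int toy_stream_next_buffer(ToyStream *stream)
{
    int buffer;

    assert(stream->callback != NULL);
    if (stream->yielded)
        toy_stream_resume(stream);  /* a yield is a self-resuming pause */

    /* Look-ahead loop: skipped entirely while paused or after a real end. */
    while (!stream->paused && !stream->ended && stream->count < TOY_QUEUE_SIZE)
    {
        int blocknum = stream->callback(stream, stream->arg);

        if (blocknum == TOY_INVALID_BLOCK)
        {
            if (!stream->paused)
                stream->ended = true;
            break;
        }
        stream->queue[(stream->head + stream->count) % TOY_QUEUE_SIZE] = blocknum;
        stream->count++;
    }

    if (stream->count == 0)
        return TOY_INVALID_BUFFER;  /* drained: paused, yielded dry, or ended */
    buffer = stream->queue[stream->head];
    stream->head = (stream->head + 1) % TOY_QUEUE_SIZE;
    stream->count--;
    return buffer;
}

typedef struct ToySource
{
    int next;                       /* next block number to hand out */
    int known;                      /* blocks [0, known) are known so far */
    int total;                      /* blocks [0, total) will exist eventually */
    bool yield_when_queued;         /* yield once the consumer has work queued */
} ToySource;

int toy_source_cb(ToyStream *stream, void *arg)
{
    ToySource *src = arg;

    if (src->next >= src->total)
        return TOY_INVALID_BLOCK;           /* real end of stream */
    if (src->next >= src->known)
        return toy_stream_pause(stream);    /* temporary lack of information */
    if (src->yield_when_queued && stream->count > 0)
        return toy_stream_yield(stream);    /* let the consumer run first */
    return src->next++;
}
```

In this model, a consumer that calls toy_stream_resume() before draining
the queue never observes the gap, while one that drains the queue and
asks for one more gets TOY_INVALID_BUFFER and can resume later.  With
yield_when_queued set, each toy_stream_next_buffer() call does one unit
of look-ahead and then returns to the caller.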
    
    Some half-baked thoughts about the resulting flow control:
    
    Yielding control periodically just when it happens to be possible
    within the constraints of the volcano executor is an interesting thing
    to think about.  You can only yield if you already have a tuple to
    emit.  There is no saying when control will return to you, and the
    node you yield to might immediately block on I/O and yet you could
    have been doing useful CPU work.  You probably need an event-driven
    node-hopping executor to fix that in general, but on the flip side, I
    can think of one bet that I'd take: if you already have a tuple to
    emit AND if index scans themselves (not only referenced heap pages)
    were also streamed AND if a hypothetical
    read_stream_next_buffer_no_wait(btree_stream) said the next index page
    you need is not ready yet, then you should yield.  You're gambling
    that other plan nodes will have better luck running without an I/O
    stall, but your own chance of doing so is ~0%.
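
The bet above boils down to a three-way conjunction.  As a minimal
sketch, with a hypothetical ToyScanState whose next_index_page_ready
field stands in for what the (equally hypothetical)
read_stream_next_buffer_no_wait(btree_stream) would report:

```c
#include <stdbool.h>

/* Hypothetical snapshot of what an index scan node knows right now. */
typedef struct ToyScanState
{
    bool have_tuple_to_emit;    /* a tuple could be returned immediately */
    bool index_pages_streamed;  /* index pages come from a stream too */
    bool next_index_page_ready; /* a no-wait probe says the page is ready */
} ToyScanState;

/*
 * Yield only when continuing would certainly stall on I/O while the
 * consumer already has something useful to do.
 */
bool toy_should_yield(const ToyScanState *state)
{
    return state->have_tuple_to_emit &&
           state->index_pages_streamed &&
           !state->next_index_page_ready;
}
```

Any one condition failing means continuing is either required (nothing
to emit) or not known to stall, so the sketch does not yield.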
    
    Yielding just because you've scanned N index pages/tuples/whatever is
    harder to think about.  The stream shouldn't get far ahead unless it's
    recently been useful for I/O concurrency (though optimal distance
    heuristics are an open problem), but in this case a single invocation
    of the block number callback can call ReadBuffer() an arbitrary number
    of times, filtering out all the index tuples as it rampages through
    the whole index IIUC.  I see why you might want to yield periodically
    if you can, but I also wonder how much that can really help if you
    still have to pick up where you left off next time.  I guess it
    depends on the distribution of matches.  It's also clear that any
    cold-cache testing done with direct I/O enabled will stall abominably
    as long as that level calls ReadBuffer(), possibly confusing matters.
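
To illustrate the "yield after N" idea and its limits, here is a
stand-alone toy in C.  All names are hypothetical and nothing here is
PostgreSQL code: the index is modelled as two parallel arrays, the
budget counts index tuples examined per invocation, and
consumer_has_work stands in for "there is already a tuple to emit".

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum ToyStepKind
{
    TOY_MATCH,                  /* found a heap block to stream */
    TOY_YIELD,                  /* budget spent, give control back */
    TOY_END                     /* index exhausted */
} ToyStepKind;

typedef struct ToyStep
{
    ToyStepKind kind;
    int block;                  /* valid only for TOY_MATCH */
} ToyStep;

typedef struct ToyIndexScan
{
    const int *heap_blocks;     /* heap block referenced by each index tuple */
    const bool *passes_qual;    /* result of the (expensive) filter qual */
    size_t ntuples;
    size_t pos;                 /* scan position, kept across invocations */
    size_t budget;              /* max tuples examined before yielding */
} ToyIndexScan;

/*
 * One invocation of a toy block number callback.  Without a budget it
 * would examine tuples until a match or the end of the index; with one,
 * it stops early, but only if the consumer has something else to do.
 */
ToyStep toy_next_block(ToyIndexScan *scan, bool consumer_has_work)
{
    ToyStep step = {TOY_END, -1};
    size_t examined = 0;

    while (scan->pos < scan->ntuples)
    {
        size_t i;

        if (consumer_has_work && examined >= scan->budget)
        {
            step.kind = TOY_YIELD;
            return step;
        }
        i = scan->pos++;
        examined++;
        if (scan->passes_qual[i])
        {
            step.kind = TOY_MATCH;
            step.block = scan->heap_blocks[i];
            return step;
        }
    }
    return step;
}
```

Because pos survives a yield, the filtering work is deferred rather than
avoided: the total number of tuples examined before the next match is
the same with or without the budget, which is why the benefit depends on
how matches are distributed.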