Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Add read_stream_{pause,resume}()

  1. Trying out read streams in pgvector (an extension)

    Thomas Munro <thomas.munro@gmail.com> — 2024-06-11T04:53:41Z

    Hi,
    
    I was looking around for an exotic index type to try the experience of
    streamifying an extension, ie out-of-core code.  I am totally new to
    pgvector, but since everyone keeps talking about it, I could not avoid
    picking up some basic facts in the pgconf.dev hallway track, and
    understood that its scans have some degree of known-order access
    predictability, and then also some degree of fuzzy-predictable
    order-not-yet-determined access too.  It's also quite random in the
    I/O sense.
    
    Here's a toy to streamify the known-order part.  I think for the fuzzy
    part that links those parts together, maybe there is some way to guess
    when it's a reasonable time to speculatively prefetch the lowest order
    stuff in the pairing heap, and then deal with it if you're wrong, but
    I didn't try that...
    
    Someone involved in that project mentioned that it's probably not a
    great topic to research in practice, because real world users of HNSW
    use fully cached ie prewarmed indexes, because the performance is so
    bad otherwise.  (Though maybe that argument is a little circular...).
    So although this patch clearly speeds up cold HSNW searches to a
    degree controlled by effective_io_concurrency, I'll probably look for
    something else.  Suggestions for interesting index types to look at
    streamifying are very welcome!
    
    Hmm.  If that's really true about HNSW though, then there may still be
    an opportunity to do automatic memory prefetching[1].  But then in the
    case of index building, "stream" is NULL in this patch anyway.  It
    surely must also be possible to find some good places to put
    profitable explicit pg_mem_prefetch() calls given the predictability
    and the need to get only ~60ns ahead for that usage.  I didn't look
    into that because I was trying to prove things about read_stream.c,
    not get involved in another project :-D
    
    Here ends my science experiment report, which I'm dropping here just
    in case others see useful ideas here.  The main thing I learned about
    the read stream API is that it'd be nice to be able to reset the
    stream but preserve the distance (something that came up on the
    streaming sequential scan thread for a different reason), to deal with
    cases where look-ahead opportunities come in bursts but you want a
    longer lived stream than I used here.  That is the reason the patch
    creates and destroys temporary streams in a loop; doh.  It also
    provides an interesting case study for what speculative random
    look-ahead support might need to look like.
    
    [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGKNUMnqubrrz8pRBdEM8vHeSCZcNq7iqERmkt6zPtpA3g%40mail.gmail.com
    
    === setup ====
    
    create extension vector;
    
    create or replace function random_vector(dimensions int)
    returns vector language sql
    begin atomic;
      select array_agg(random())::vector
        from generate_series(1, dimensions);
    end;
    
    create table t (id serial, embedding vector(6));
    
    insert into t (embedding)
    select random_vector(6)
      from generate_series(1, 1000000);
    
    set maintenance_work_mem = '2GB';
    
    create index on t using hnsw(embedding vector_l2_ops);
    
    === test of a hot search, assuming repeated ===
    
    select embedding <-> '[0.5,0.5,0.5,0.5,0.5,0.5]'::vector
      from t
     where embedding <-> '[0.5,0.5,0.5,0.5,0.5,0.5]'::vector < 0.2
     order by 1 limit 20;
    
    === test of a cold search, assuming empty caches ===
    
    create or replace function test()
    returns void
    language plpgsql as
    $$
    declare
      my_vec vector(6) := random_vector(6);
    begin
      perform embedding <-> my_vec
         from t
        where embedding <-> my_vec < 0.2
        order by 1 limit 20;
    end;
    $$;
    
    select test();
    
    (Make sure you remember to set effective_io_concurrency to an
    interesting number if you want to generate a lot of overlapping
    fadvise calls.)
    
  2. Re: Trying out read streams in pgvector (an extension)

    Heikki Linnakangas <hlinnaka@iki.fi> — 2024-06-11T07:47:31Z

    On 11/06/2024 07:53, Thomas Munro wrote:
    > Someone involved in that project mentioned that it's probably not a
    > great topic to research in practice, because real world users of HNSW
    > use fully cached ie prewarmed indexes, because the performance is so
    > bad otherwise.  (Though maybe that argument is a little circular...).
    
    I think that's true in practice for *building* an HNSW index, but faster 
    *searching* when the index is not in memory seems quite useful. And of 
    course, faster is always better, even if it's only in a non-optimal 
    scenario.
    
    > So although this patch clearly speeds up cold HSNW searches to a
    > degree controlled by effective_io_concurrency, I'll probably look for
    > something else.  Suggestions for interesting index types to look at
    > streamifying are very welcome!
    
    GiST and GIN?
    
    > Hmm.  If that's really true about HNSW though, then there may still be
    > an opportunity to do automatic memory prefetching[1].  But then in the
    > case of index building, "stream" is NULL in this patch anyway.  It
    > surely must also be possible to find some good places to put
    > profitable explicit pg_mem_prefetch() calls given the predictability
    > and the need to get only ~60ns ahead for that usage.  I didn't look
    > into that because I was trying to prove things about read_stream.c,
    > not get involved in another project :-D
    > 
    > Here ends my science experiment report, which I'm dropping here just
    > in case others see useful ideas here.  The main thing I learned about
    > the read stream API is that it'd be nice to be able to reset the
    > stream but preserve the distance (something that came up on the
    > streaming sequential scan thread for a different reason), to deal with
    > cases where look-ahead opportunities come in bursts but you want a
    > longer lived stream than I used here.  That is the reason the patch
    > creates and destroys temporary streams in a loop; doh.  It also
    > provides an interesting case study for what speculative random
    > look-ahead support might need to look like.
    
    This reminds me of a prototype I wrote earlier, see 
    https://github.com/pgvector/pgvector/pull/386, 1st commit. It 
    reorganizes HnswSearchLayer() so that it in iteration, it first collects 
    all the neighbors to visit, and then visits them, somewhat similar to 
    your patch.
    
    -- 
    Heikki Linnakangas
    Neon (https://neon.tech)
    
    
    
    
    
  3. Re: Trying out read streams in pgvector (an extension)

    Jonathan S. Katz <jkatz@postgresql.org> — 2024-06-11T15:37:09Z

    On 6/11/24 12:53 AM, Thomas Munro wrote:
    > Hi,
    > 
    > I was looking around for an exotic index type to try the experience of
    > streamifying an extension, ie out-of-core code.  I am totally new to
    > pgvector, but since everyone keeps talking about it, I could not avoid
    > picking up some basic facts in the pgconf.dev hallway track, and
    > understood that its scans have some degree of known-order access
    > predictability, and then also some degree of fuzzy-predictable
    > order-not-yet-determined access too.  It's also quite random in the
    > I/O sense.
    
    Cool! I happened to be chatting w/Andrew about this yesterday to see if 
    there could be some benefits for folks who are running pgvector on PG17.
    
    > Here's a toy to streamify the known-order part.  I think for the fuzzy
    > part that links those parts together, maybe there is some way to guess
    > when it's a reasonable time to speculatively prefetch the lowest order
    > stuff in the pairing heap, and then deal with it if you're wrong, but
    > I didn't try that...
    
    I would suggest submitting this at least as a draft PR to the pgvector 
    project[1]:
    
    https://github.com/pgvector/pgvector
    
    > Someone involved in that project mentioned that it's probably not a
    > great topic to research in practice, because real world users of HNSW
    > use fully cached ie prewarmed indexes, because the performance is so
    > bad otherwise.
    
    I don't think that was me, at least in those words (and I had noted I'd 
    love to chat w/you about this, but we didn't find time). Stating it 
    differently, the "ideal" is to keep the indexes in memory, as that leads 
    to the best performance, but reality is more complicated. These datasets 
    are quite large (e.g. the 1536-dim vector is a 6KB payload, excluding 
    what's in the index) and if you're storing the full vector in the index 
    (there are now some quantization methods available[4]), you can easily 
    double your dataset size, and quickly exceed available memory. So I 
    think in the real world, you're more likely to see swapping pages 
    between disk and memory. Some of this was addressed in the talk @ 
    PGConf.dev[3] (slides here[2]).
    
    >  (Though maybe that argument is a little circular...).
    > So although this patch clearly speeds up cold HSNW searches to a
    > degree controlled by effective_io_concurrency, I'll probably look for
    > something else.  Suggestions for interesting index types to look at
    > streamifying are very welcome!
    
    Yup, so this makes sense for HNSW particularly at the higher-level 
    pages. But it may make more sense for IVFFlat, given how it clusters 
    data. With IVFFlat, you first find your lists/centers, and then you 
    determine how you index each vector around the lists. When those lists 
    are stored to disk, they're basically sequential. A lot of the struggles 
    with IVFFlat is both the long load from disk and ultimately some 
    comptuational issues for a larger set of vector comparisons (though if 
    you're able to build small, efficient clusters, it can be much faster 
    than HNSW!). HNSW behaves more like a (bear with me) typically 
    "tree-based" index, where you'll have hot spots at the top, but because 
    of the nature of vector search, the lower levels tend to be more random 
    in access.
    
    Regardless, the part where this is interesting (at least to me) is that 
    a lot of these vectors tend to take up a full page anyway, so anything 
    we can do to read them faster from disk will generally get a thumbs up 
    from me.
    
    > Hmm.  If that's really true about HNSW though, then there may still be
    > an opportunity to do automatic memory prefetching[1].  But then in the
    > case of index building, "stream" is NULL in this patch anyway.  It
    > surely must also be possible to find some good places to put
    > profitable explicit pg_mem_prefetch() calls given the predictability
    > and the need to get only ~60ns ahead for that usage.  I didn't look
    > into that because I was trying to prove things about read_stream.c,
    > not get involved in another project :-D
    
    Well, as alluded to in[2], thinking about how another project uses this 
    will certainly help, and anything we can do to continue to speed up 
    vector queries helps PostgreSQL ;) Some of the contributions from folks 
    who have focused on core have significantly helped pgvector.
    
    > Here ends my science experiment report, which I'm dropping here just
    > in case others see useful ideas here.  The main thing I learned about
    > the read stream API is that it'd be nice to be able to reset the
    > stream but preserve the distance (something that came up on the
    > streaming sequential scan thread for a different reason), to deal with
    > cases where look-ahead opportunities come in bursts but you want a
    > longer lived stream than I used here.  That is the reason the patch
    > creates and destroys temporary streams in a loop; doh.  It also
    > provides an interesting case study for what speculative random
    > look-ahead support might need to look like.
    
    If you're curious, I can fire up some of my more serious benchmarks on 
    this to do a before/after to see if there's anything interesting. I have 
    a few large datasets (10s of millions) of larger vectors (1536dim => 6KB 
    payloads) that could see the net effect here.
    
    > (Make sure you remember to set effective_io_concurrency to an
    > interesting number if you want to generate a lot of overlapping
    > fadvise calls.)
    
    What would you recommend as an "interesting number?" - particularly 
    using the data parameters above.
    
    Thanks,
    
    Jonathan
    
    [1] https://github.com/pgvector/pgvector
    [2] 
    https://www.pgevents.ca/events/pgconfdev2024/sessions/session/1/slides/42/pgconfdev-2024-vectors.pdf
    [3] 
    https://www.pgevents.ca/events/pgconfdev2024/schedule/session/1-vectors-how-to-better-support-a-nasty-data-type-in-postgresql/
    [4] https://jkatz05.com/post/postgres/pgvector-scalar-binary-quantization/
    
  4. Re: Trying out read streams in pgvector (an extension)

    Thomas Munro <thomas.munro@gmail.com> — 2024-09-06T04:28:52Z

    On Wed, Jun 12, 2024 at 3:37 AM Jonathan S. Katz <jkatz@postgresql.org> wrote:
    > If you're curious, I can fire up some of my more serious benchmarks on
    > this to do a before/after to see if there's anything interesting. I have
    > a few large datasets (10s of millions) of larger vectors (1536dim => 6KB
    > payloads) that could see the net effect here.
    >
    > > (Make sure you remember to set effective_io_concurrency to an
    > > interesting number if you want to generate a lot of overlapping
    > > fadvise calls.)
    >
    > What would you recommend as an "interesting number?" - particularly
    > using the data parameters above.
    
    Hi Jonathan,
    
    Sorry for not replying sooner (ETOOMANYPROJECTS).  For HNSW, I think
    the maximum useful effective_io_concurrency is bound by the number of
    connections per HNSW layer ("m").  Here are some times I measured
    using m=16 on two laptops:
    
                  |     linux (xfs)           |  macos (apfs)
     branch | eic |  avg   | speedup | stdev  |  avg   | speedup | stdev
    --------+-----+--------+---------+--------+--------+---------+--------
     master |     | 73.959 |     1.0 | 24.168 | 72.290 |     1.0 | 11.851
     stream |   0 | 70.117 |     1.1 | 36.699 | 76.289 |     1.0 | 12.742
     stream |   1 | 57.983 |     1.3 |  5.845 | 79.969 |     1.2 |  8.308
     stream |   2 | 35.629 |     2.1 |  4.088 | 49.198 |     2.0 |  7.686
     stream |   3 | 28.477 |     2.6 |  2.607 | 37.540 |     2.5 |  5.272
     stream |   4 | 26.493 |     2.8 |  3.691 | 33.014 |     2.7 |  4.444
     stream |   5 | 23.711 |     3.1 |  2.435 | 32.622 |     3.0 |  2.270
     stream |   6 | 22.885 |     3.2 |  1.908 | 31.254 |     3.2 |  4.170
     stream |   7 | 21.910 |     3.4 |  2.153 | 33.669 |     3.3 |  4.616
     stream |   8 | 20.741 |     3.6 |  1.594 | 34.182 |     3.5 |  3.819
     stream |   9 | 22.471 |     3.3 |  3.094 | 30.690 |     3.2 |  2.677
     stream |  10 | 19.895 |     3.7 |  1.695 | 32.631 |     3.6 |  4.976
     stream |  11 | 19.447 |     3.8 |  1.647 | 31.163 |     3.7 |  3.351
     stream |  12 | 18.658 |     4.0 |  1.503 | 30.817 |     3.9 |  3.538
     stream |  13 | 18.886 |     3.9 |  0.874 | 29.184 |     3.8 |  4.832
     stream |  14 | 18.667 |     4.0 |  1.692 | 28.783 |     3.9 |  3.459
     stream |  15 | 19.080 |     3.9 |  1.429 | 28.928 |     3.8 |  3.396
     stream |  16 | 18.929 |     3.9 |  3.469 | 29.282 |     3.8 |  2.868
    
    Those are millisecond times to run the test() function shown earlier,
    with empty kernel cache and PostgreSQL cache (see below) for maximum
    physical I/O.  I ran the master test 30 times, and each
    effective_io_concurrency level 10 times, to show that the variance
    decreases even at the default effective_io_concurency = 1, so we're
    not only talking about the avg speed improving.
    
    The all-cached performance also seems to improve, ~8.9ms -> ~6.9ms on
    Linux, but I can't fully explain why that is, maybe just some random
    stuff about memory layout run-to-run in my quick and dirty test or
    something like that, so I'm not claiming that is significant.  It
    certainly didn't get slower, anyway.
    
    I think you would get very different numbers on a high latency storage
    system (say, non-local cloud storage) and potentially much more
    speedup with your large test indexes.  Also my 6d random number test
    may not be very representative and you may be able to come up with
    much better tests.
    
    Here's a new version with a TODO tidied up.  I also understood that we
    need to tweak the read_stream_reset() function, so that it doesn't
    forget its current readhead distance when it hops between HNSW nodes
    (which is something that comes up in several other potential uses
    cases including another one I am working in in core).  Without this
    patch for PostgreSQL, it reads 1, 2, 4, 7 blocks (= 16 in total)
    before it has to take a break to hop to a new page, and then it start
    again at 1.  Oops.  With this patch, it is less forgetful, and reaches
    the full possible I/O concurrency of 16 (or whatever the minimum of
    HNSW's m parameter and effective_io_concurrency is for you).
    
    PSA two patches, one for PostgreSQL and one for pgvector.
    
    I am not actively working on this right now.  If someone wants to try
    to develop it further, please feel free!  I haven't looked at IVFFlat
    at all.
    
    --- function to let you do SELECT uncache('t_embedding_idx'),
    --- which is the opposite of SELECT pg_prewarm('t_embedding_idx')
    --- see also "echo 1 | sudo tee /proc/sys/vm/drop_caches" (Linux)
    --- "sudo purge" (macOS)
    create extension pg_buffercache;
    create or replace function uncache(name text) returns bool
    begin atomic;
      select bool_and(pg_buffercache_evict(bufferid))
        from pg_buffercache where relfilenode = name::regclass;
    end;
    
  5. Re: Trying out read streams in pgvector (an extension)

    Thomas Munro <thomas.munro@gmail.com> — 2024-09-06T04:49:49Z

    On Fri, Sep 6, 2024 at 4:28 PM Thomas Munro <thomas.munro@gmail.com> wrote:
    > Without this
    > patch for PostgreSQL, it reads 1, 2, 4, 7 blocks (= 16 in total)
    > before it has to take a break to hop to a new page, and then it start
    > again at 1.  Oops.
    
    Erm, correction: 1, 2, 4, 8, 1 (because it runs out due to m == 16 and resets).
    
    
    
    
  6. Re: Trying out read streams in pgvector (an extension)

    Thomas Munro <thomas.munro@gmail.com> — 2024-09-06T22:27:27Z

    There was a mistake in my query, so the macOS speedup column was wrong
    (was accidentally comparing Linux number with macOS master, sorry for
    the noise).  I also forgot to mention that you don't actually get the
    speedup on PostgreSQL 17 on a Mac, because Peter only recently
    implemented the needed read-ahead support for macOS in master/18, but
    it doesn't get slower.  Here's the corrected table:
    
                  |     linux (xfs)           |  macos (apfs)
     branch | eic |  avg   | speedup | stdev  |  avg   | speedup | stdev
    --------+-----+--------+---------+--------+--------+---------+--------
     master |     | 73.959 |     1.0 | 24.168 | 72.290 |     1.0 | 11.851
     stream |   0 | 70.117 |     1.1 | 36.699 | 76.289 |     0.9 | 12.742
     stream |   1 | 57.983 |     1.3 |  5.845 | 79.969 |     0.9 |  8.308
     stream |   2 | 35.629 |     2.1 |  4.088 | 49.198 |     1.5 |  7.686
     stream |   3 | 28.477 |     2.6 |  2.607 | 37.540 |     1.9 |  5.272
     stream |   4 | 26.493 |     2.8 |  3.691 | 33.014 |     2.2 |  4.444
     stream |   5 | 23.711 |     3.1 |  2.435 | 32.622 |     2.2 |  2.270
     stream |   6 | 22.885 |     3.2 |  1.908 | 31.254 |     2.3 |  4.170
     stream |   7 | 21.910 |     3.4 |  2.153 | 33.669 |     2.1 |  4.616
     stream |   8 | 20.741 |     3.6 |  1.594 | 34.182 |     2.1 |  3.819
     stream |   9 | 22.471 |     3.3 |  3.094 | 30.690 |     2.4 |  2.677
     stream |  10 | 19.895 |     3.7 |  1.695 | 32.631 |     2.2 |  4.976
     stream |  11 | 19.447 |     3.8 |  1.647 | 31.163 |     2.3 |  3.351
     stream |  12 | 18.658 |     4.0 |  1.503 | 30.817 |     2.3 |  3.538
     stream |  13 | 18.886 |     3.9 |  0.874 | 29.184 |     2.5 |  4.832
     stream |  14 | 18.667 |     4.0 |  1.692 | 28.783 |     2.5 |  3.459
     stream |  15 | 19.080 |     3.9 |  1.429 | 28.928 |     2.5 |  3.396
     stream |  16 | 18.929 |     3.9 |  3.469 | 29.282 |     2.5 |  2.868
    
    
    
    
  7. Re: Trying out read streams in pgvector (an extension)

    Thomas Munro <thomas.munro@gmail.com> — 2025-11-11T21:21:50Z

    On Fri, Sep 6, 2024 at 4:28 PM Thomas Munro <thomas.munro@gmail.com> wrote:
    > Here's a new version with a TODO tidied up.  I also understood that we
    > need to tweak the read_stream_reset() function, so that it doesn't
    > forget its current readhead distance when it hops between HNSW nodes
    > (which is something that comes up in several other potential uses
    > cases including another one I am working in in core).  Without this
    > patch for PostgreSQL, it reads 1, 2, 4, 7 blocks (= 16 in total)
    > before it has to take a break to hop to a new page, and then it start
    > again at 1.  Oops.  With this patch, it is less forgetful, and reaches
    > the full possible I/O concurrency of 16 (or whatever the minimum of
    > HNSW's m parameter and effective_io_concurrency is for you).
    
    I heard that the pgvector project is now trying to do this for real,
    and (surprise!) running into this problem.  It causes streamified HNSW
    search to regress in performance on some queries when the overheads of
    streaming are not outweighed by the (bogusly constrained) gains in
    concurrency.  We just don't generate enough concurrency to win.  I
    probably should have been more opinionated and just committed a
    version of that distance-reset policy change, but I guess at the time
    I wrote the above, streaming and AIO were a little too abstract to
    attract reviews relating to hypothetical external projects.
    
    We definitely want to fix that for v19, because it also affects the
    streamified index scan project and doubtless many other things.  I
    wrote about that with patches[1] and will start a new thread soon with
    a new collection of rebased heuristics improvements.
    
    But for now, to fix pgvector's woes, I wonder if it might make sense
    to call this a bug in v18, and back-patch the tiniest possible change.
    Something like what I posted[2] in this thread almost two years ago.
    I don't think it really affects any core code: we use
    read_stream_reset() only in very minimal ways there (I could
    elaborate), and it's quite arguable that the existing policy is wrong
    for them too, but we'd need to confirm that and perhaps think about
    other extensions that might be using it.
    
    Better ideas?
    
    [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGL6hCd40Dh1AcFcoiw5zJXK2T1dRKO3oe8RkPExqA5zoQ%40mail.gmail.com#181a22a8be99ff561b7beae44986870c
    [2] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2Bx2BcqWzBC77cN0ewhzMF0kYhC6c4G_T2gJLPbqYQ6Ow%40mail.gmail.com#9aa6012713b473611ae46d8e2032586f
    
    
    
    
  8. Re: Trying out read streams in pgvector (an extension)

    Melanie Plageman <melanieplageman@gmail.com> — 2025-11-11T22:52:17Z

    On Tue, Nov 11, 2025 at 4:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:
    >
    > But for now, to fix pgvector's woes, I wonder if it might make sense
    > to call this a bug in v18, and back-patch the tiniest possible change.
    > Something like what I posted[2] in this thread almost two years ago.
    > I don't think it really affects any core code: we use
    > read_stream_reset() only in very minimal ways there (I could
    > elaborate), and it's quite arguable that the existing policy is wrong
    > for them too, but we'd need to confirm that and perhaps think about
    > other extensions that might be using it.
    
    If we are worried about regressing other extensions using
    read_stream_reset(), we could make the read stream reset which
    preserves the distance a different function in backbranches.
    
    - Melanie
    
    
    
    
  9. Re: Trying out read streams in pgvector (an extension)

    Thomas Munro <thomas.munro@gmail.com> — 2025-11-11T23:19:26Z

    On Wed, Nov 12, 2025 at 11:52 AM Melanie Plageman
    <melanieplageman@gmail.com> wrote:
    > On Tue, Nov 11, 2025 at 4:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:
    > > But for now, to fix pgvector's woes, I wonder if it might make sense
    > > to call this a bug in v18, and back-patch the tiniest possible change.
    > > Something like what I posted[2] in this thread almost two years ago.
    > > I don't think it really affects any core code: we use
    > > read_stream_reset() only in very minimal ways there (I could
    > > elaborate), and it's quite arguable that the existing policy is wrong
    > > for them too, but we'd need to confirm that and perhaps think about
    > > other extensions that might be using it.
    >
    > If we are worried about regressing other extensions using
    > read_stream_reset(), we could make the read stream reset which
    > preserves the distance a different function in backbranches.
    
    Hmm, yeah, interesting idea.  Candidate names might include
    read_stream_restart() and read_stream_continue().  The point being
    that the block number callback reported end-of-stream, but that was
    only temporary, and now it has more information and would like to
    continue.  Those are some of the names I bounced around for a new
    read_stream_reset() flag argument for v19 (I rather liked "continue"),
    but I also like this separate function idea.  Back-patching a new
    function would certainly remove all doubt about unintended
    consequences for existing callers of read_stream_reset(), so yeah,
    that wins on pure conservative safety grounds.  As for the future,
    hmm, it might even be better to have an explicit separate API for this
    operation in master too, as it is turning out to be quite a common
    requirement and the naming is much clearer like that.  We don't
    usually design new APIs while back-patching though, that's probably
    why I didn't think of that, but if we view this as a design bug that
    folded too many jobs into read_stream_reset() that we now want to fix
    by splitting one off, maybe that's OK?  Seems pretty risk-free,
    anyway.
    
    
    
    
  10. Re: Trying out read streams in pgvector (an extension)

    Thomas Munro <thomas.munro@gmail.com> — 2025-11-12T04:11:39Z

    On Wed, Nov 12, 2025 at 12:19 PM Thomas Munro <thomas.munro@gmail.com> wrote:
    > On Wed, Nov 12, 2025 at 11:52 AM Melanie Plageman
    > <melanieplageman@gmail.com> wrote:
    > > If we are worried about regressing other extensions using
    > > read_stream_reset(), we could make the read stream reset which
    > > preserves the distance a different function in backbranches.
    
    Here is a draft patch like that, that tries to be as small as
    possible.  Trying out the name read_stream_resume().
    
  11. Re: Trying out read streams in pgvector (an extension)

    Nazir Bilal Yavuz <byavuz81@gmail.com> — 2025-11-12T08:04:28Z

    Hi,
    
    Thank you for working on this!
    
    On Wed, 12 Nov 2025 at 07:12, Thomas Munro <thomas.munro@gmail.com> wrote:
    >
    > On Wed, Nov 12, 2025 at 12:19 PM Thomas Munro <thomas.munro@gmail.com> wrote:
    > > On Wed, Nov 12, 2025 at 11:52 AM Melanie Plageman
    > > <melanieplageman@gmail.com> wrote:
    > > > If we are worried about regressing other extensions using
    > > > read_stream_reset(), we could make the read stream reset which
    > > > preserves the distance a different function in backbranches.
    >
    > Here is a draft patch like that, that tries to be as small as
    > possible.  Trying out the name read_stream_resume().
    
    I liked the idea of having a different function named
    read_stream_resume for this purpose.
    
    0001 looks good to me.
    
    0002:
    
    +        /* End-of-stream. */
    +        buf = read_stream_next_buffer(stream, NULL);
    +        Assert(buf == InvalidBuffer);
    +        buf = read_stream_next_buffer(stream, NULL);
    +        Assert(buf == InvalidBuffer);
    
    I noticed there are two 'read_stream_next_buffer()' and
    'InvalidBuffer' checks. Does having both provide any additional
    validation? I tried removing one of them, and the test still passed.
    
    Also, there is one thing I wanted to clarify about the
    'read_stream_resume()'. If 'read_stream_next_buffer()' returns an
    'InvalidBuffer', then we can use 'read_stream_resume()' alone because
    we know that we already consumed all buffers in the stream. For the
    rest, we need to use 'read_stream_resume()' together with the
    'read_stream_reset()', right?
    
    -- 
    Regards,
    Nazir Bilal Yavuz
    Microsoft
    
    
    
    
  12. Re: Trying out read streams in pgvector (an extension)

    Thomas Munro <thomas.munro@gmail.com> — 2025-11-12T10:47:16Z

    On Wed, Nov 12, 2025 at 9:04 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
    > On Wed, 12 Nov 2025 at 07:12, Thomas Munro <thomas.munro@gmail.com> wrote:
    > 0002:
    >
    > +        /* End-of-stream. */
    > +        buf = read_stream_next_buffer(stream, NULL);
    > +        Assert(buf == InvalidBuffer);
    > +        buf = read_stream_next_buffer(stream, NULL);
    > +        Assert(buf == InvalidBuffer);
    >
    > I noticed there are two 'read_stream_next_buffer()' and
    > 'InvalidBuffer' checks. Does having both provide any additional
    > validation? I tried removing one of them, and the test still passed.
    
    I wanted to demonstrate that this is a state that the stream is stuck
    in until you call _resume().
    
    I suppose an alternative design would be that _next_buffer() returns
    InvalidBuffer only once (= the block number callback returns
    InvalidBlock once) and then automatically resumes (= it restores the
    distance) and then you can call read_stream_next_buffer() again (= the
    block number callback will be called again to fill the stream up with
    new buffers before waiting for the first one to be ready to give to
    you if it isn't already).  That would have the advantage of not
    requiring a new function at all and make the patch even shorter, but I
    don't know, I guess I thought that would be a bit more fragile in some
    way, less explicit.  Hmm, would it actually be better if it worked
    like that?
    
    > Also, there is one thing I wanted to clarify about the
    > 'read_stream_resume()'. If 'read_stream_next_buffer()' returns an
    > 'InvalidBuffer', then we can use 'read_stream_resume()' alone because
    > we know that we already consumed all buffers in the stream. For the
    > rest, we need to use 'read_stream_resume()' together with the
    > 'read_stream_reset()', right?
    
    For the rest, there would be no need to call read_stream_resume().
    
    To recap the uses of read_stream_reset():  the original purpose was to
    release any buffers (pins) that the stream is holding internally
    because of look-ahead, and put it back to its original state, ready to
    be reused.  It is called (1) by read_stream_end() as an implementation
    detail (eg if a LIMIT or anything else except ERROR/FATAL ends your
    query early, we need to unpin buffers queued in the stream before we
    pfree it), (2) explicitly by rescans, (3) in hypothetical code I
    thought about that would want to stream blocks speculatively and then
    change its mind after predicting incorrectly (I had a few patches like
    that, abandoned for now), and then (4) in this case, by places that
    temporarily ran out of block numbers, but will have some more again
    soon and want to resume the stream.
    
    It was already debatable whether heuristic state like lookahead
    distance should survive acoss rescans, or in other words, whether the
    expected I/O requirements of the previous scan are a useful prediction
    of the requirements of the next scan, but the answer is clearer in
    case (4), hence the desire to find a way to separate that use case
    from the others.
    
    
    
    
  13. Re: Trying out read streams in pgvector (an extension)

    Melanie Plageman <melanieplageman@gmail.com> — 2025-11-12T20:44:14Z

    On Tue, Nov 11, 2025 at 11:12 PM Thomas Munro <thomas.munro@gmail.com> wrote:
    >
    > Here is a draft patch like that, that tries to be as small as
    > possible.  Trying out the name read_stream_resume().
    
    I like read_stream_resume(). Tested out 0001 with pgvector and can
    confirm it works.
    
    In the test, I would initialize test_read_stream_resume_state.count to 0
    
    +    test_read_stream_resume_state state = {.blkno = blkno};
    
    - Melanie
    
    
    
    
  14. Re: Trying out read streams in pgvector (an extension)

    Melanie Plageman <melanieplageman@gmail.com> — 2025-11-18T21:17:28Z

    On Wed, Nov 12, 2025 at 5:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:
    >
    > I suppose an alternative design would be that _next_buffer() returns
    > InvalidBuffer only once (= the block number callback returns
    > InvalidBlock once) and then automatically resumes (= it restores the
    > distance) and then you can call read_stream_next_buffer() again (= the
    > block number callback will be called again to fill the stream up with
    > new buffers before waiting for the first one to be ready to give to
    > you if it isn't already).  That would have the advantage of not
    > requiring a new function at all and make the patch even shorter, but I
    > don't know, I guess I thought that would be a bit more fragile in some
    > way, less explicit.  Hmm, would it actually be better if it worked
    > like that?
    
    We discussed off-list and decided that changing existing functionality
    in an unexpected way is undesirable. So, it is better we stick with
    adding read_stream_resume. However, in talking about
    read_stream_resume() further, Thomas and I also thought of potential
    issues with it:
    
    If read_stream_resume() is called before the read stream user callback
    has ever returned InvalidBlockNumber,
    1) The value of resume_distance will be the original value of distance
    from read_stream_begin_relation(). You don't want to reset the
    distance to that value.
    2) There may be inflight or completed buffers that have yet to be
    yielded which will be returned the next time read_stream_next_buffer()
    is invoked. If the user resets the state the callback is using to
    return blocks and expects the next invocation of
    read_stream_next_buffer() to return buffers with those blocks, they
    will be disappointed.
    
    If we try to address this by requiring that stream->distance is 0 when
    read_stream_resume() is called, that won't work because while it is
    set to 0 when the callback returns InvalidBlockNumber, there may still
    be unreturned buffers in the stream.
    
    If the user wants to use read_stream_reset() to exhaust the stream
    before calling read_stream_resume(), read_stream_reset() sets
    stream->distance to 1 at the end, so read_stream_resume() couldn't
    detect if reset() was correctly called first or if the distance is > 0
    because the stream is still in progress.
    
    To make sure 1) distance isn't reset to a resume_distance from
    read_stream_begin_relation() and 2) unexpected buffers aren't returned
    from the read stream, we could error out in read_stream_resume() if
    pinned_buffers > 0. And in read_stream_reset(), we would save distance
    in resume_distance before clearing distance. That would allow calling
    read_stream_resume() either if you called read_stream_reset() or if
    you exhausted the stream yourself. See rough attached patch for a
    sketch of this.
    
    It would be nicer if we could error out if read_stream_next_buffer()
    didn't return InvalidBuffer, but we can't do that if we want to allow
    calling read_stream_reset() followed by read_stream_resume() because
    distance won't be 0.
    
    I tried this with a modified pgvector with an hnsw read stream user
    and it seemed to work correctly.
    
    - Melanie
    
  15. Re: Trying out read streams in pgvector (an extension)

    Nazir Bilal Yavuz <byavuz81@gmail.com> — 2025-11-19T07:27:59Z

    Hi,
    
    On Wed, 19 Nov 2025 at 00:17, Melanie Plageman
    <melanieplageman@gmail.com> wrote:
    >
    > To make sure 1) distance isn't reset to a resume_distance from
    > read_stream_begin_relation() and 2) unexpected buffers aren't returned
    > from the read stream, we could error out in read_stream_resume() if
    > pinned_buffers > 0. And in read_stream_reset(), we would save distance
    > in resume_distance before clearing distance. That would allow calling
    > read_stream_resume() either if you called read_stream_reset() or if
    > you exhausted the stream yourself. See rough attached patch for a
    > sketch of this.
    
    This looks correct to me. What do you think about using an assert
    instead of erroring out?
    
    -- 
    Regards,
    Nazir Bilal Yavuz
    Microsoft
    
    
    
    
  16. Re: Trying out read streams in pgvector (an extension)

    Melanie Plageman <melanieplageman@gmail.com> — 2025-11-20T15:27:59Z

    On Wed, Nov 19, 2025 at 2:28 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
    >
    > > To make sure 1) distance isn't reset to a resume_distance from
    > > read_stream_begin_relation() and 2) unexpected buffers aren't returned
    > > from the read stream, we could error out in read_stream_resume() if
    > > pinned_buffers > 0. And in read_stream_reset(), we would save distance
    > > in resume_distance before clearing distance. That would allow calling
    > > read_stream_resume() either if you called read_stream_reset() or if
    > > you exhausted the stream yourself. See rough attached patch for a
    > > sketch of this.
    >
    > This looks correct to me. What do you think about using an assert
    > instead of erroring out?
    
    I'm not totally opposed to this. My rationale for making it an error
    is that the developer could have test cases where all the buffers are
    consumed but the code is written such that that won't always happen.
    Then if a real production query doesn't consume all the buffers, it
    could return wrong results (I think). That will mean the user can't
    complete their query until the extension author releases a new version
    of their code. But I'm not sure what the right answer is here.
    
    - Melanie
    
    
    
    
  17. Re: Trying out read streams in pgvector (an extension)

    Thomas Munro <thomas.munro@gmail.com> — 2025-12-09T03:47:08Z

    On Fri, Nov 21, 2025 at 4:28 AM Melanie Plageman
    <melanieplageman@gmail.com> wrote:
    > I'm not totally opposed to this. My rationale for making it an error
    > is that the developer could have test cases where all the buffers are
    > consumed but the code is written such that that won't always happen.
    > Then if a real production query doesn't consume all the buffers, it
    > could return wrong results (I think). That will mean the user can't
    > complete their query until the extension author releases a new version
    > of their code. But I'm not sure what the right answer is here.
    
    Focusing on making sure v19 has a good interface for this, and
    abandoning thoughts of back-patching a bandaid, and the constraints
    that leads to, for now...
    
    I think it'd be better if that were the consumer's choice.   I don't
    want the consumer to be required to drain the stream before resuming,
    as that'd be an unprincipled stall.  For example, if new WAL arrives
    over the network then I think it should be possible for recovery's
    WAL-powered stream of heap pages to resume looking ahead even if
    recovery hasn't drained the existing stream completely.
    
    Peter G (CC'd) and I discussed some problems he had in the index
    prefetching work, and I tried to extend this a bit to give the
    semantics he wanted, in point 2 below.  It's simple itself, but might
    lead to some tricky questions higher up.  Posted for experimentation.
    It'll be interesting to see if this goes somewhere.
    
    1.  read_stream_resume() as before, but with a new explicit
    read_stream_pause(): if a block number callback would like to report a
    temporary lack of information, it should return
    read_stream_pause(stream), not InvalidBlockNumber.  Then after
    read_stream_resume(stream) is called, the next
    read_stream_next_buffer() enters the lookahead loop again.  While
    paused, if the consumer drains all the existing buffers in the stream
    and then one more, it will receive InvalidBuffer, but if the _resume()
    call is made sooner, the consumer won't ever know about the temporary
    lack of buffers in the stream.
    
    2.  read_stream_yield(): while streaming heap pages that come from
    TIDs on index pages, Peter didn't like that the executor lost control
    of how much work was done by the lookahead loop underneath
    read_stream_next_buffer().  The consumer might have a heap page with
    some tuples that could be emitted right now, but the block number
    callback might be evaluating arbitrarily expensive filter qual
    expressions far ahead, and they might prefer to emit more tuples now
    before doing an unbounded amount of work finding more.  This interface
    allows some limited coroutine-like multitasking, where the block
    number callback can return read_stream_yield(stream) to return control
    back to the consumer periodically if it knows the consumer could
    already do something else.  It works by pausing the stream and
    resuming it in the next read_stream_next_buffer() call, but that's an
    internal detail.
    
    Some half-baked thoughts about the resulting flow control:
    
    Yielding control periodically just when it happens to be possible
    within the constraints of the volcano executor is an interesting thing
    to think about.  You can only yield if you already have a tuple to
    emit.  There is no saying when control will return to you, and the
    node you yield to might immediately block on I/O and yet you could
    have been doing useful CPU work.  You probably need an event-driven
    node-hopping executor to fix that in general, but on the flip side, I
    can think of one bet that I'd take: if you already have a tuple to
    emit AND if index scans themselves (not only referenced heap pages)
    were also streamed AND if a hypothetical
    read_stream_next_buffer_no_wait(btree_stream) said the next index page
    you need is not ready yet, then you should yield.  You're gambling
    that other plan nodes will have better luck running without an I/O
    stall, but you have ~0% chance.
    
    Yielding just because you've scanned N index pages/tuples/whatever is
    harder to think about.  The stream shouldn't get far ahead unless it's
    recently been useful for I/O concurrency (though optimal distance
    heuristics are an open problem), but in this case a single invocation
    of the block number callback can call ReadBuffer() an arbitrary number
    of times, filtering out all the index tuples as it rampages through
    the whole index IIUC.  I see why you might want to yield periodically
    if you can, but I also wonder how much that can really help if you
    still have to pick up where you left off next time.  I guess it
    depends on the distribution of matches.  It's also clear that any
    cold-cache testing done with direct I/O enabled will stall abominably
    as long as that level calls ReadBuffer(), possibly confusing matters.
    
  18. Re: Trying out read streams in pgvector (an extension)

    Melanie Plageman <melanieplageman@gmail.com> — 2025-12-09T21:42:21Z

    On Mon, Dec 8, 2025 at 10:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
    >
    > I think it'd be better if that were the consumer's choice.   I don't
    > want the consumer to be required to drain the stream before resuming,
    > as that'd be an unprincipled stall.  For example, if new WAL arrives
    > over the network then I think it should be possible for recovery's
    > WAL-powered stream of heap pages to resume looking ahead even if
    > recovery hasn't drained the existing stream completely.
    >
    > 1.  read_stream_resume() as before, but with a new explicit
    > read_stream_pause(): if a block number callback would like to report a
    > temporary lack of information, it should return
    > read_stream_pause(stream), not InvalidBlockNumber.  Then after
    > read_stream_resume(stream) is called, the next
    > read_stream_next_buffer() enters the lookahead loop again.  While
    > paused, if the consumer drains all the existing buffers in the stream
    > and then one more, it will receive InvalidBuffer, but if the _resume()
    > call is made sooner, the consumer won't ever know about the temporary
    > lack of buffers in the stream.
    
    I like this new interface. If the user does want to exhaust the stream
    (as was the case with earlier pgvector read stream user code), I
    assume you would want to do:
    
    read_stream_pause()
    read_stream_reset()
    read_stream_resume()
    
    - Melanie
    
    
    
    
  19. Re: Trying out read streams in pgvector (an extension)

    Peter Geoghegan <pg@bowt.ie> — 2025-12-09T22:38:10Z

    On Mon, Dec 8, 2025 at 10:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
    > Yielding just because you've scanned N index pages/tuples/whatever is
    > harder to think about.  The stream shouldn't get far ahead unless it's
    > recently been useful for I/O concurrency (though optimal distance
    > heuristics are an open problem), but in this case a single invocation
    > of the block number callback can call ReadBuffer() an arbitrary number
    > of times, filtering out all the index tuples as it rampages through
    > the whole index IIUC.  I see why you might want to yield periodically
    > if you can, but I also wonder how much that can really help if you
    > still have to pick up where you left off next time.
    
    I think of it as a necessary precaution against pathological behavior
    where the amount of memory used to cache matching tuples/TIDs gets out
    of hand. There's no specific reason to expect that to happen (or no
    good reason). But I'm pretty sure that it'll prove necessary to pay
    non-zero attention to how much work has been done since the last time
    we returned a tuple (when there's a tuple available to return).
    
    > I guess it
    > depends on the distribution of matches.
    
    To be clear, I haven't done any kind of modelling of the problems in
    this area. Once I do that (in 2026), I'll be able to say more about
    the requirements. Maybe Tomas could take a look sooner?
    
    Right now my focus is on getting the basic interfaces/API revisions in
    better shape. And avoiding regressions while doing so.
    
    -- 
    Peter Geoghegan
    
    
    
    
  20. Re: Trying out read streams in pgvector (an extension)

    Tomas Vondra <tomas@vondra.me> — 2025-12-14T23:02:53Z

    On 12/9/25 23:38, Peter Geoghegan wrote:
    > On Mon, Dec 8, 2025 at 10:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
    >> Yielding just because you've scanned N index pages/tuples/whatever is
    >> harder to think about.  The stream shouldn't get far ahead unless it's
    >> recently been useful for I/O concurrency (though optimal distance
    >> heuristics are an open problem), but in this case a single invocation
    >> of the block number callback can call ReadBuffer() an arbitrary number
    >> of times, filtering out all the index tuples as it rampages through
    >> the whole index IIUC.  I see why you might want to yield periodically
    >> if you can, but I also wonder how much that can really help if you
    >> still have to pick up where you left off next time.
    > 
    > I think of it as a necessary precaution against pathological behavior
    > where the amount of memory used to cache matching tuples/TIDs gets out
    > of hand. There's no specific reason to expect that to happen (or no
    > good reason). But I'm pretty sure that it'll prove necessary to pay
    > non-zero attention to how much work has been done since the last time
    > we returned a tuple (when there's a tuple available to return).
    > 
    
    I think it's not all that difficult to hit such runaway cases, loading
    arbitrary number of batches ahead. That's exactly why I had to come up
    with the read_stream_reset approach in the first place, actually.
    
    The simplest example is an index scan on a correlated index. We skip
    duplicate blocks, so to find the next block to pass to the stream, we
    may have to load multiple leaf pages. And we may need multiple such
    blocks, to satisfy the prefetch distance (= number of IOs).
    
    An even more extreme example is IOS, with just a couple not-allvisible
    pages (that need prefetching). You hit the first one, then the distance
    ramps up, and at some point there are no more not-allvisible pages. But
    we try to maintain the distance, and we end up reading all remaining
    leaf pages.
    
    I'm sure we need to protect against these issues, which is why we have
    INDEX_SCAN_MAX_BATCHES.
    
    IIRC you also suggested we will need some internal limit to keep the
    number of buffer pins under control, right?
    
    I agree there may be other important reasons to "pause", e.g. based on
    how much work was done since the last time the index scan returned a
    tuple. But I'm not sure what exactly to look at, because most of the
    "work" is happening outside the index scan, no?
    
    >> I guess it
    >> depends on the distribution of matches.
    > 
    > To be clear, I haven't done any kind of modelling of the problems in
    > this area. Once I do that (in 2026), I'll be able to say more about
    > the requirements. Maybe Tomas could take a look sooner?
    > 
    
    I don't have any explicit formal model of the problems, but it should
    not be difficult to come up with "adversary cases" for the prefetching
    patches, for example. That is, ways to construct datasets / indexes that
    end up looking very far (many leafs) ahead.
    
    Is that what you meant by modelling, or did you mean some sort of a
    formal model of how far to prefetch for a given dataset?
    
    AFAICS the "ideal" prefetch distance would be such that a page gets
    loaded into shared buffers right before the scan actually needs it
    (after processing through the earlier index entries).
    
    Let's say we know
    
    * T1 - cost of "processing" one index tuple (average time between calls
    to table_index_fetch_tuple?)
    
    * T2 - cost of I/O, i.e. how long does it take to read a block
    
    We want keep the look-ahead distance at least K, such that
    
       K * T1 > T2
    
    but not much more than that. But I'm not sure where to get T1 and T2, as
    T1 depends on what happens outside the index scan, and T2 is hardware
    dependent (and not actually linear / additive).
    
    Or am I confused and you asked about something else?
    
    
    regards
    
    -- 
    Tomas Vondra