Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. bufmgr: Improve StartBufferIO interface

  2. test_aio: Add basic tests for StartReadBuffers()

  3. test_aio: Add read_stream test infrastructure & tests

  4. bufmgr: Make buffer hit helper

  5. bufmgr: Pass io_object and io_context through to PinBufferForBlock()

  6. bufmgr: Restructure AsyncReadBuffers()

  7. aio: Refactor tests in preparation for more tests

  1. Don't synchronously wait for already-in-progress IO in read stream

    Andres Freund <andres@anarazel.de> — 2025-09-11T21:46:51Z

    Hi,
    
    In the index prefetching thread we discovered that read stream performance
    suffers rather substantially when a read stream is reading blocks multiple
    times within the readahead distance.
    
    The problem leading to that is that we are currently synchronously waiting for
    IO on a buffer when AsyncReadBuffers() encounters a buffer already undergoing
    IO. If a block is read twice, that means we won't actually have enough IOs in
    flight to have good performance. What's worse, currently the wait is not
    attributed to IO wait (since we're waiting in WaitIO, rather than waiting for
    IO).
    
    This does not commonly occur with in-tree users of read streams, as users like
    seqscans, bitmap heap scans, vacuum, ... will never try to read the same block
    twice. However with index prefetching that is a more common case.
    
    It is possible to encounter a version of this in 18/master: If multiple scans
    for the same table are in progress, they can end up waiting synchronously for
    each other. However it's a much less severe issue, as the scan that is
    "further ahead" will not be blocked.
    
    
    To fix it, the attached patch has AsyncReadBuffers() check if the "target"
    buffer already has IO in progress. If so, it assing the buffer's IO wait
    reference to the ReadBuffersOperation. That allows WaitReadBuffers() to wait
    for the IO. To make that work correctly, the buffer stats etc have to be
    updated in that case in WaitReadBuffers().
    
    
    I did not feel like I was sufficiently confident in making this work without
    tests. However, it's not exactly trivial to test some versions of this, due to
    the way multiple processes need to be coordinated. It took way way longer to
    write tests than to make the code actually work :/.
    
    The attached tests add a new read_stream_for_blocks() function to test_aio. I
    found it also rather useful to reproduce the performance issue without the
    index prefetching patch applied.  To test the cross process case the injection
    point infrastructure in test_aio had to be extended a bit.
    
    
    Attached are three patches:
    
    0001: Introduces a TestAio package and splits out some existing tests out of
          001_aio.pl
    
    0002: Add read_stream test infrastructure & tests
    
          Note that the tests don't test that we don't unnecessarily wait, as
          described above, just that IO works correctly.
    
    0003: Improve performance of read stream when re-encountering blocks
    
    
    To reproduce the issue, the read_stream_for_blocks() function added to
    test_aio can be used, in combination with debug_io_direct=data (it's also
    possible without DIO, it'd just be more work).
    
    prep:
    CREATE EXTENSION  test_aio;
    CREATE TABLE large AS SELECT i, repeat(random()::text, 5) FROM generate_series(1, 1000000) g(i);
    
    test:
    SELECT pg_buffercache_evict_relation('large');
    EXPLAIN ANALYZE SELECT * FROM read_stream_for_blocks('large', ARRAY(SELECT i + generate_series(0, 3) FROM generate_series(1, 10000) g(i)));
    
    
    Without 0003 applied:
    ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
    │                                                                   QUERY PLAN                                                                    │
    ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
    │ Function Scan on read_stream_for_blocks  (cost=975.00..985.00 rows=1000 width=12) (actual time=673.647..675.254 rows=40000.00 loops=1)          │
    │   Buffers: shared hit=29997 read=10003                                                                                                          │
    │   I/O Timings: shared read=16.116                                                                                                               │
    │   InitPlan 1                                                                                                                                    │
    │     ->  Result  (cost=0.00..975.00 rows=40000 width=4) (actual time=0.556..7.575 rows=40000.00 loops=1)                                         │
    │           ->  ProjectSet  (cost=0.00..375.00 rows=40000 width=8) (actual time=0.556..4.804 rows=40000.00 loops=1)                               │
    │                 ->  Function Scan on generate_series g  (cost=0.00..100.00 rows=10000 width=4) (actual time=0.554..0.988 rows=10000.00 loops=1) │
    │ Planning Time: 0.060 ms                                                                                                                         │
    │ Execution Time: 676.436 ms                                                                                                                      │
    └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
    (9 rows)
    
    Time: 676.753 ms
    
    
    With 0003:
    
    ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
    │                                                                   QUERY PLAN                                                                    │
    ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
    │ Function Scan on read_stream_for_blocks  (cost=975.00..985.00 rows=1000 width=12) (actual time=87.730..89.453 rows=40000.00 loops=1)            │
    │   Buffers: shared hit=29997 read=10003                                                                                                          │
    │   I/O Timings: shared read=50.467                                                                                                               │
    │   InitPlan 1                                                                                                                                    │
    │     ->  Result  (cost=0.00..975.00 rows=40000 width=4) (actual time=0.541..7.496 rows=40000.00 loops=1)                                         │
    │           ->  ProjectSet  (cost=0.00..375.00 rows=40000 width=8) (actual time=0.540..4.772 rows=40000.00 loops=1)                               │
    │                 ->  Function Scan on generate_series g  (cost=0.00..100.00 rows=10000 width=4) (actual time=0.539..0.965 rows=10000.00 loops=1) │
    │ Planning Time: 0.046 ms                                                                                                                         │
    │ Execution Time: 90.661 ms                                                                                                                       │
    └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
    (9 rows)
    
    Time: 90.887 ms
    
    Greetings,
    
    Andres Freund
    
  2. Re: Don't synchronously wait for already-in-progress IO in read stream

    Peter Geoghegan <pg@bowt.ie> — 2025-11-09T19:51:08Z

    On Thu, Sep 11, 2025 at 5:46 PM Andres Freund <andres@anarazel.de> wrote:
    > The problem leading to that is that we are currently synchronously waiting for
    > IO on a buffer when AsyncReadBuffers() encounters a buffer already undergoing
    > IO. If a block is read twice, that means we won't actually have enough IOs in
    > flight to have good performance. What's worse, currently the wait is not
    > attributed to IO wait (since we're waiting in WaitIO, rather than waiting for
    > IO).
    
    This patch no longer cleanly applies. Can you post a new version?
    
    Thanks
    -- 
    Peter Geoghegan
    
    
    
    
  3. Re: Don't synchronously wait for already-in-progress IO in read stream

    Thomas Munro <thomas.munro@gmail.com> — 2025-11-09T22:20:38Z

    On Fri, Sep 12, 2025 at 9:46 AM Andres Freund <andres@anarazel.de> wrote:
    +     * It's possible that another backend starts IO on the buffer between this
    +     * check and the ReadBuffersCanStartIO(nowait = false) below. In that case
    +     * we will synchronously wait for the IO below, but the window for that is
    +     * small enough that it won't happen often enough to have a significant
    +     * performance impact.
    +     */
    +    if (ReadBuffersIOAlreadyInProgress(operation, buffers[nblocks_done]))
    ...
         /*
          * Check if we can start IO on the first to-be-read buffer.
          *
    -     * If an I/O is already in progress in another backend, we want to wait
    -     * for the outcome: either done, or something went wrong and we will
    -     * retry.
    +     * If a synchronous I/O is in progress in another backend (it can't be
    +     * this backend), we want to wait for the outcome: either done, or
    +     * something went wrong and we will retry.
          */
         if (!ReadBuffersCanStartIO(buffers[nblocks_done], false))
    
    "..., or an asynchronous IO was started after
    ReadBuffersIOAlreadyInProgress() (unlikely), ..."?
    
    I suppose (or perhaps vaguely recall from an off-list discussion?)
    that you must have considered merging the new
    is-it-already-in-progress check into ReadBuffersCanStartIO().  I
    suppose the nowait argument would become a tri-state argument with a
    value that means "don't wait for an in-progress read, just give me the
    IO handle so I can 'join' it as a foreign waiter", with a new output
    argument to receive the handle, or something along those lines, and I
    guess you'd need a tri-state result, and perhaps s/Can/Try/ in the
    name.  That'd remove the double-check (extra header lock-unlock cycle)
    and associated race that can cause that rare synchronous wait (which
    must still happen sometimes in the duelling concurrent scan use
    case?), at the slight extra cost of having to allocate and free a
    handle in the case of repeated blocks (eg the index->heap scan use
    case), but at least that's just backend-local list pushups and doesn't
    do extra work otherwise.  Is there some logical problem with that
    approach?  Is the code just too clumsy?