Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Avoid pointer chasing in _bt_readpage inner loop.

  2. Relocate _bt_readpage and related functions.

  3. Fix bug in nbtree array primitive scan scheduling.

  1. Moving _bt_readpage and _bt_checkkeys into a new .c file

    Peter Geoghegan <pg@bowt.ie> — 2025-12-06T03:48:15Z

    Attached patch v1-0001-* moves _bt_readpage (from nbtsearch.c) and
    _bt_checkkeys (from nbtutils.c) into a new .c file -- nbtreadpage.c.
    It also moves all of the functions that _bt_checkkeys itself calls
    (either directly or indirectly) over to nbtreadpage.c. This first
    patch is strictly mechanical, in that it only moves existing functions
    around, without directly changing anything.
    
    There's also a second patch, which adds a performance optimization
    that builds on the first patch. This makes the functions moved to the
    new file pass around the scan's direction, rather than requiring the
    functions to fish it out of so->currPos as before.
    
    With large range query variants of pgbench consisting of queries like
    "SELECT abalance FROM pgbench_accounts WHERE aid between 1000 AND
    3000", at scale 100, with a single client, I find that there's a ~4.5%
    increase in transaction throughput with both patches applied (when
    compared to master). The new module structure coupled with the
    optimization in the second patch gives the compiler the ability to
    produce more efficient code.
    
    Here's a set of results from a series on interlaced 5 minute runs:
    
    master: tps = 2682.600100, lat avg = 0.373 ms, lat stddev = 0.006 ms
            cpu/txn = 0.011320, cs/txn = 8.1114, in/txn = 0.1832
    patch:  tps = 2827.020776, lat avg = 0.354 ms, lat stddev = 0.006 ms
            cpu/txn = 0.010777, cs/txn = 8.1083, in/txn = 0.1768
    master: tps = 2711.945605, lat avg = 0.369 ms, lat stddev = 0.006 ms
            cpu/txn = 0.011173, cs/txn = 8.1109, in/txn = 0.1815
    patch:  tps = 2832.105849, lat avg = 0.353 ms, lat stddev = 0.006 ms
            cpu/txn = 0.010793, cs/txn = 8.1090, in/txn = 0.1769
    master: tps = 2699.348706, lat avg = 0.370 ms, lat stddev = 0.007 ms
            cpu/txn = 0.011225, cs/txn = 8.1123, in/txn = 0.1813
    patch:  tps = 2822.511539, lat avg = 0.354 ms, lat stddev = 0.006 ms
            cpu/txn = 0.010759, cs/txn = 8.1078, in/txn = 0.1760
    master: tps = 2715.051289, lat avg = 0.368 ms, lat stddev = 0.006 ms
            cpu/txn = 0.011160, cs/txn = 8.1116, in/txn = 0.1824
    Patch:  tps = 2818.002045, lat avg = 0.355 ms, lat stddev = 0.007 ms
            cpu/txn = 0.010752, cs/txn = 8.1078, in/txn = 0.1762
    master: tps = 2720.488826, lat avg = 0.367 ms, lat stddev = 0.011 ms
            cpu/txn = 0.011138, cs/txn = 8.1103, in/txn = 0.1815
    Patch:  tps = 2818.035923, lat avg = 0.355 ms, lat stddev = 0.006 ms
            cpu/txn = 0.010752, cs/txn = 8.1087, in/txn = 0.1762
    
    Other kinds of queries benefit much less. I think that there's a sub
    1% improvement with standard pgbench SELECT, though it's hard to
    distinguish from noise. There are fewer code paths affected by the
    second patch in use with such queries, compared to the favorable range
    scan query case that I've shown detailed numbers for, which might
    explain why simple equality lookups aren't improved by very much.
    
    This seems like an enhancement that is pretty easy to justify. Note
    that the changes in the second patch essentially restore things to how
    they already were prior to my commit 763d65ae. I doubt that that
    change caused a regression at the time, since the speedup that I see
    now depends on the changes in the first patch (though I must admit
    that I haven't benchmarked the changes made by the second patch in
    isolation).
    
    -- 
    Peter Geoghegan
    
  2. Re: Moving _bt_readpage and _bt_checkkeys into a new .c file

    Victor Yegorov <vyegorov@gmail.com> — 2025-12-06T08:07:19Z

    сб, 6 дек. 2025 г. в 06:49, Peter Geoghegan <pg@bowt.ie>:
    
    > Attached patch v1-0001-* moves _bt_readpage (from nbtsearch.c) and
    > _bt_checkkeys (from nbtutils.c) into a new .c file -- nbtreadpage.c.
    > It also moves all of the functions that _bt_checkkeys itself calls
    > (either directly or indirectly) over to nbtreadpage.c. This first
    > patch is strictly mechanical, in that it only moves existing functions
    > around, without directly changing anything.
    >
    > …
    >
    > This seems like an enhancement that is pretty easy to justify. Note
    > that the changes in the second patch essentially restore things to how
    > they already were prior to my commit 763d65ae. I doubt that that
    > change caused a regression at the time, since the speedup that I see
    > now depends on the changes in the first patch (though I must admit
    > that I haven't benchmarked the changes made by the second patch in
    > isolation).
    >
    
    Hey.
    
    I like this change and I agree that it's both handy and gives an easy
    performance boost.
    
    Patch applies and compiles cleanly. I can barely see a performance boost on
    my end (VM on a busy host), round 1%, but I still consider this change
    beneficial.
    
    -- 
    Victor Yegorov
    
  3. Re: Moving _bt_readpage and _bt_checkkeys into a new .c file

    Peter Geoghegan <pg@bowt.ie> — 2025-12-06T20:04:20Z

    On Sat, Dec 6, 2025 at 3:07 AM Victor Yegorov <vyegorov@gmail.com> wrote:
    > I like this change and I agree that it's both handy and gives an easy performance boost.
    
    There's a number of things that I find counterintuitive about the
    performance impact of this patch:
    
    * It doesn't make very much difference (under a 1% improvement) on
    similar index-only scans, with queries such as "SELECT count(*) FROM
    pgbench_accounts WHERE aid between :aid AND :endrange". One might
    expect this case to be improved at least as much as the plain index
    scan case, since the relative cost of _bt_readpage is higher (since
    it's much cheaper to access the visibility map than to access the
    heap).
    
    * The patch makes an even larger difference when I make pgbench use a
    larger range of values in the "between". For example, if I increase
    the number of values/tuples returned from 2,000 (which is what I used
    initially/reported on in the first email) to 15,000, I find that the
    patch increases TPS by as much as 5.5%.
    
    * These queries make maximum use of the _bt_set_startikey optimization
    -- most individual leaf pages don't need to evaluate any scan key
    (after an initial page-level check within _bt_set_startikey). So the
    patch really helps in exactly those cases where we don't truly need to
    access the scan direction at all -- the loop inside _bt_check_compare
    always has 0 iterations with these queries, which means that scan
    direction doesn't actually ever need to be considered at that point.
    
    My best guess is that the benefits I see come from eliminating a
    dependent load. Without the second patch applied, I see this
    disassembly for _bt_checkkeys:
    
    mov    rax,QWORD PTR [rdi+0x38]   ; Load scan->opaque
    mov    r15d,DWORD PTR [rax+0x70]  ; Load so->dir
    
    A version with the second patch applied still loads a pointer passed
    by the _bt_checkkeys caller (_bt_readpage), but doesn't have to chase
    another pointer to get to it. Maybe this significantly ameliorates
    execution port pressure in the cases where I see a speedup?
    
    > Patch applies and compiles cleanly. I can barely see a performance boost on my end (VM on a busy host), round 1%, but I still consider this change beneficial.
    
    It seems to have no downsides, and some upside. I wouldn't be
    surprised if the results I'm seeing are dependent on
    microarchitectural details. I myself use a Zen 3 chip (a Ryzen 9
    5950X).
    
    -- 
    Peter Geoghegan
    
    
    
    
  4. Re: Moving _bt_readpage and _bt_checkkeys into a new .c file

    Peter Geoghegan <pg@bowt.ie> — 2025-12-07T02:44:46Z

    On Sat, Dec 6, 2025 at 3:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
    > My best guess is that the benefits I see come from eliminating a
    > dependent load. Without the second patch applied, I see this
    > disassembly for _bt_checkkeys:
    >
    > mov    rax,QWORD PTR [rdi+0x38]   ; Load scan->opaque
    > mov    r15d,DWORD PTR [rax+0x70]  ; Load so->dir
    >
    > A version with the second patch applied still loads a pointer passed
    > by the _bt_checkkeys caller (_bt_readpage), but doesn't have to chase
    > another pointer to get to it. Maybe this significantly ameliorates
    > execution port pressure in the cases where I see a speedup?
    
    I found a way to further speed up the queries that the second patch
    already helped with, following profiling with perf: if _bt_readpage
    takes a local copy of scan->ignore_killed_tuples when first called,
    and then uses that local copy within its per-tuple loop (instead of
    using scan->ignore_killed_tuples directly), it gives me an additional
    1% speedup over what I reported earlier today. In other words, the
    range/BETWEEN pgbench variant I summarized earlier today goes from
    being about 4.5% faster than master, to being about ~5.5% faster than
    master. Testing has also shown that the ignore_killed_tuples
    enhancement doesn't significantly change the picture with other types
    of queries (such as the default pgbench SELECT).
    
    In short, this ignore_killed_tuples change makes the second patch from
    v1 more effective, seemingly by further ameliorating the same
    bottleneck. Apparently accessing scan->ignore_killed_tuples created
    another load-use hazard in the same tight inner loop (the per-tuple
    _bt_readpage loop). Which matters with these queries, where we don't
    need to do very much work per-tuple (_bt_readpage's pstate.startikey
    optimization is as effective as possible here) and have quite a few
    tuples (2,000 tuples) that need to be returned by each test query run.
    
    Since this ignore_killed_tuples change is also very simple, and also
    seems like an easy win, I think that it can be committed as part of
    the second patch. Without it needing to wait for too much more
    performance validation.
    
    Attached are 2 text files showing pgbench output/summary info,
    generated by my test script (both are from runs that took place within
    the last 2 hours). One of these result sets just confirms what I
    reported earlier on, with an unmodified v1 patchset. The other set of
    results/file shows detailed results for the v1 patchset with the
    ignore_killed_tuples change also applied, for the same pgbench
    config/workload. This second file gives full details to back up my
    "~5.5% faster than master" claim.
    
    The pgbench script used for this is as follows:
    
    \set aid random_exponential(1, 100000 * :scale, 3.0)
    \set endrange :aid + 2000
    SELECT abalance FROM pgbench_accounts WHERE aid between :aid AND :endrange;
    
    I'm deliberately not attaching a new v2 for this ignore_killed_tuples
    change right now. The first patch is a few hundred KBs, and I don't
    want this email to get held up in moderation. Though I will attach the
    ignore_killed_tuples change in its own patch, which I've also attached
    (with a .txt extension, just to avoid confusing CFTester).
    
    -- 
    Peter Geoghegan
    
  5. Re: Moving _bt_readpage and _bt_checkkeys into a new .c file

    Peter Geoghegan <pg@bowt.ie> — 2025-12-07T22:30:54Z

    On Sat, Dec 6, 2025 at 9:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
    > Since this ignore_killed_tuples change is also very simple, and also
    > seems like an easy win, I think that it can be committed as part of
    > the second patch. Without it needing to wait for too much more
    > performance validation.
    
    My plan is to commit the entire patch series (with a modified second
    patch that includes the ignore_killed_tuples change) in the next
    couple of days.
    
    As far as I can determine through performance validation that tested a
    variety of different scan types (simple point lookups, range scans,
    and a variety of different SAOP scan patterns), the patch series is an
    unambiguous win. It appears to be slightly faster even in
    unsympathetic cases, such as standard pgbench SELECT.
    
    -- 
    Peter Geoghegan
    
    
    
    
  6. Re: Moving _bt_readpage and _bt_checkkeys into a new .c file

    Victor Yegorov <vyegorov@gmail.com> — 2025-12-08T09:13:13Z

    пн, 8 дек. 2025 г. в 01:31, Peter Geoghegan <pg@bowt.ie>:
    
    > On Sat, Dec 6, 2025 at 9:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
    > > Since this ignore_killed_tuples change is also very simple, and also
    > > seems like an easy win, I think that it can be committed as part of
    > > the second patch. Without it needing to wait for too much more
    > > performance validation.
    >
    > My plan is to commit the entire patch series (with a modified second
    > patch that includes the ignore_killed_tuples change) in the next
    > couple of days.
    >
    > As far as I can determine through performance validation that tested a
    > variety of different scan types (simple point lookups, range scans,
    > and a variety of different SAOP scan patterns), the patch series is an
    > unambiguous win. It appears to be slightly faster even in
    > unsympathetic cases, such as standard pgbench SELECT.
    >
    
    Even without the performance increase, I think this a valuable code
    reorganization, worth committing.
    
    I've compiled and run test (check and installcheck) with all 3 patches, did
    a small standard pgbench run:
    pgbench -s 100 -i
    pgbench -P 60 -T 300
    
    Results:
    master: 569 TPS
    patched: 590 TPS +3.7%
    
    LGTM.
    
    -- 
    Victor Yegorov
    
  7. Re: Moving _bt_readpage and _bt_checkkeys into a new .c file

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-08T18:03:18Z

    On 08/12/2025 11:13, Victor Yegorov wrote:
    > Even without the performance increase, I think this a valuable code 
    > reorganization, worth committing.
    
    +1
    
    - Heikki
    
    
    
    
    
  8. Re: Moving _bt_readpage and _bt_checkkeys into a new .c file

    Peter Geoghegan <pg@bowt.ie> — 2025-12-08T18:50:08Z

    On Mon, Dec 8, 2025 at 4:13 AM Victor Yegorov <vyegorov@gmail.com> wrote:
    > I've compiled and run test (check and installcheck) with all 3 patches, did a small standard pgbench run:
    > pgbench -s 100 -i
    > pgbench -P 60 -T 300
    >
    > Results:
    > master: 569 TPS
    > patched: 590 TPS +3.7%
    
    Pushed both patches just now.
    
    Thanks for the review!
    
    -- 
    Peter Geoghegan