Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Avoid pointer chasing in _bt_readpage inner loop.
- 83a26ba59b18 19 (unreleased) landed
Relocate _bt_readpage and related functions.
- 65d6acbc5649 19 (unreleased) landed
Fix bug in nbtree array primitive scan scheduling.
- 763d65ae2545 18.0 cited

Moving _bt_readpage and _bt_checkkeys into a new .c file

Peter Geoghegan <pg@bowt.ie> — 2025-12-06T03:48:15Z

Attached patch v1-0001-* moves _bt_readpage (from nbtsearch.c) and
_bt_checkkeys (from nbtutils.c) into a new .c file -- nbtreadpage.c.
It also moves all of the functions that _bt_checkkeys itself calls
(either directly or indirectly) over to nbtreadpage.c. This first
patch is strictly mechanical, in that it only moves existing functions
around, without directly changing anything.

There's also a second patch, which adds a performance optimization
that builds on the first patch. This makes the functions moved to the
new file pass around the scan's direction, rather than requiring the
functions to fish it out of so->currPos as before.

With large range query variants of pgbench consisting of queries like
"SELECT abalance FROM pgbench_accounts WHERE aid between 1000 AND
3000", at scale 100, with a single client, I find that there's a ~4.5%
increase in transaction throughput with both patches applied (when
compared to master). The new module structure coupled with the
optimization in the second patch gives the compiler the ability to
produce more efficient code.

Here's a set of results from a series on interlaced 5 minute runs:

master: tps = 2682.600100, lat avg = 0.373 ms, lat stddev = 0.006 ms
        cpu/txn = 0.011320, cs/txn = 8.1114, in/txn = 0.1832
patch:  tps = 2827.020776, lat avg = 0.354 ms, lat stddev = 0.006 ms
        cpu/txn = 0.010777, cs/txn = 8.1083, in/txn = 0.1768
master: tps = 2711.945605, lat avg = 0.369 ms, lat stddev = 0.006 ms
        cpu/txn = 0.011173, cs/txn = 8.1109, in/txn = 0.1815
patch:  tps = 2832.105849, lat avg = 0.353 ms, lat stddev = 0.006 ms
        cpu/txn = 0.010793, cs/txn = 8.1090, in/txn = 0.1769
master: tps = 2699.348706, lat avg = 0.370 ms, lat stddev = 0.007 ms
        cpu/txn = 0.011225, cs/txn = 8.1123, in/txn = 0.1813
patch:  tps = 2822.511539, lat avg = 0.354 ms, lat stddev = 0.006 ms
        cpu/txn = 0.010759, cs/txn = 8.1078, in/txn = 0.1760
master: tps = 2715.051289, lat avg = 0.368 ms, lat stddev = 0.006 ms
        cpu/txn = 0.011160, cs/txn = 8.1116, in/txn = 0.1824
Patch:  tps = 2818.002045, lat avg = 0.355 ms, lat stddev = 0.007 ms
        cpu/txn = 0.010752, cs/txn = 8.1078, in/txn = 0.1762
master: tps = 2720.488826, lat avg = 0.367 ms, lat stddev = 0.011 ms
        cpu/txn = 0.011138, cs/txn = 8.1103, in/txn = 0.1815
Patch:  tps = 2818.035923, lat avg = 0.355 ms, lat stddev = 0.006 ms
        cpu/txn = 0.010752, cs/txn = 8.1087, in/txn = 0.1762

Other kinds of queries benefit much less. I think that there's a sub
1% improvement with standard pgbench SELECT, though it's hard to
distinguish from noise. There are fewer code paths affected by the
second patch in use with such queries, compared to the favorable range
scan query case that I've shown detailed numbers for, which might
explain why simple equality lookups aren't improved by very much.

This seems like an enhancement that is pretty easy to justify. Note
that the changes in the second patch essentially restore things to how
they already were prior to my commit 763d65ae. I doubt that that
change caused a regression at the time, since the speedup that I see
now depends on the changes in the first patch (though I must admit
that I haven't benchmarked the changes made by the second patch in
isolation).

-- 
Peter Geoghegan

Re: Moving _bt_readpage and _bt_checkkeys into a new .c file

Victor Yegorov <vyegorov@gmail.com> — 2025-12-06T08:07:19Z

сб, 6 дек. 2025 г. в 06:49, Peter Geoghegan <pg@bowt.ie>:

> Attached patch v1-0001-* moves _bt_readpage (from nbtsearch.c) and
> _bt_checkkeys (from nbtutils.c) into a new .c file -- nbtreadpage.c.
> It also moves all of the functions that _bt_checkkeys itself calls
> (either directly or indirectly) over to nbtreadpage.c. This first
> patch is strictly mechanical, in that it only moves existing functions
> around, without directly changing anything.
>
> …
>
> This seems like an enhancement that is pretty easy to justify. Note
> that the changes in the second patch essentially restore things to how
> they already were prior to my commit 763d65ae. I doubt that that
> change caused a regression at the time, since the speedup that I see
> now depends on the changes in the first patch (though I must admit
> that I haven't benchmarked the changes made by the second patch in
> isolation).
>

Hey.

I like this change and I agree that it's both handy and gives an easy
performance boost.

Patch applies and compiles cleanly. I can barely see a performance boost on
my end (VM on a busy host), round 1%, but I still consider this change
beneficial.

-- 
Victor Yegorov

Re: Moving _bt_readpage and _bt_checkkeys into a new .c file

Peter Geoghegan <pg@bowt.ie> — 2025-12-06T20:04:20Z

On Sat, Dec 6, 2025 at 3:07 AM Victor Yegorov <vyegorov@gmail.com> wrote:
> I like this change and I agree that it's both handy and gives an easy performance boost.

There's a number of things that I find counterintuitive about the
performance impact of this patch:

* It doesn't make very much difference (under a 1% improvement) on
similar index-only scans, with queries such as "SELECT count(*) FROM
pgbench_accounts WHERE aid between :aid AND :endrange". One might
expect this case to be improved at least as much as the plain index
scan case, since the relative cost of _bt_readpage is higher (since
it's much cheaper to access the visibility map than to access the
heap).

* The patch makes an even larger difference when I make pgbench use a
larger range of values in the "between". For example, if I increase
the number of values/tuples returned from 2,000 (which is what I used
initially/reported on in the first email) to 15,000, I find that the
patch increases TPS by as much as 5.5%.

* These queries make maximum use of the _bt_set_startikey optimization
-- most individual leaf pages don't need to evaluate any scan key
(after an initial page-level check within _bt_set_startikey). So the
patch really helps in exactly those cases where we don't truly need to
access the scan direction at all -- the loop inside _bt_check_compare
always has 0 iterations with these queries, which means that scan
direction doesn't actually ever need to be considered at that point.

My best guess is that the benefits I see come from eliminating a
dependent load. Without the second patch applied, I see this
disassembly for _bt_checkkeys:

mov rax,QWORD PTR [rdi+0x38] ; Load scan->opaque
mov r15d,DWORD PTR [rax+0x70] ; Load so->dir

A version with the second patch applied still loads a pointer passed
by the _bt_checkkeys caller (_bt_readpage), but doesn't have to chase
another pointer to get to it. Maybe this significantly ameliorates
execution port pressure in the cases where I see a speedup?

> Patch applies and compiles cleanly. I can barely see a performance boost on my end (VM on a busy host), round 1%, but I still consider this change beneficial.

It seems to have no downsides, and some upside. I wouldn't be
surprised if the results I'm seeing are dependent on
microarchitectural details. I myself use a Zen 3 chip (a Ryzen 9
5950X).

--
Peter Geoghegan

Re: Moving _bt_readpage and _bt_checkkeys into a new .c file

Peter Geoghegan <pg@bowt.ie> — 2025-12-07T02:44:46Z

On Sat, Dec 6, 2025 at 3:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
> My best guess is that the benefits I see come from eliminating a
> dependent load. Without the second patch applied, I see this
> disassembly for _bt_checkkeys:
>
> mov    rax,QWORD PTR [rdi+0x38]   ; Load scan->opaque
> mov    r15d,DWORD PTR [rax+0x70]  ; Load so->dir
>
> A version with the second patch applied still loads a pointer passed
> by the _bt_checkkeys caller (_bt_readpage), but doesn't have to chase
> another pointer to get to it. Maybe this significantly ameliorates
> execution port pressure in the cases where I see a speedup?

I found a way to further speed up the queries that the second patch
already helped with, following profiling with perf: if _bt_readpage
takes a local copy of scan->ignore_killed_tuples when first called,
and then uses that local copy within its per-tuple loop (instead of
using scan->ignore_killed_tuples directly), it gives me an additional
1% speedup over what I reported earlier today. In other words, the
range/BETWEEN pgbench variant I summarized earlier today goes from
being about 4.5% faster than master, to being about ~5.5% faster than
master. Testing has also shown that the ignore_killed_tuples
enhancement doesn't significantly change the picture with other types
of queries (such as the default pgbench SELECT).

In short, this ignore_killed_tuples change makes the second patch from
v1 more effective, seemingly by further ameliorating the same
bottleneck. Apparently accessing scan->ignore_killed_tuples created
another load-use hazard in the same tight inner loop (the per-tuple
_bt_readpage loop). Which matters with these queries, where we don't
need to do very much work per-tuple (_bt_readpage's pstate.startikey
optimization is as effective as possible here) and have quite a few
tuples (2,000 tuples) that need to be returned by each test query run.

Since this ignore_killed_tuples change is also very simple, and also
seems like an easy win, I think that it can be committed as part of
the second patch. Without it needing to wait for too much more
performance validation.

Attached are 2 text files showing pgbench output/summary info,
generated by my test script (both are from runs that took place within
the last 2 hours). One of these result sets just confirms what I
reported earlier on, with an unmodified v1 patchset. The other set of
results/file shows detailed results for the v1 patchset with the
ignore_killed_tuples change also applied, for the same pgbench
config/workload. This second file gives full details to back up my
"~5.5% faster than master" claim.

The pgbench script used for this is as follows:

\set aid random_exponential(1, 100000 * :scale, 3.0)
\set endrange :aid + 2000
SELECT abalance FROM pgbench_accounts WHERE aid between :aid AND :endrange;

I'm deliberately not attaching a new v2 for this ignore_killed_tuples
change right now. The first patch is a few hundred KBs, and I don't
want this email to get held up in moderation. Though I will attach the
ignore_killed_tuples change in its own patch, which I've also attached
(with a .txt extension, just to avoid confusing CFTester).

-- 
Peter Geoghegan

Re: Moving _bt_readpage and _bt_checkkeys into a new .c file

Peter Geoghegan <pg@bowt.ie> — 2025-12-07T22:30:54Z

On Sat, Dec 6, 2025 at 9:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Since this ignore_killed_tuples change is also very simple, and also
> seems like an easy win, I think that it can be committed as part of
> the second patch. Without it needing to wait for too much more
> performance validation.

My plan is to commit the entire patch series (with a modified second
patch that includes the ignore_killed_tuples change) in the next
couple of days.

As far as I can determine through performance validation that tested a
variety of different scan types (simple point lookups, range scans,
and a variety of different SAOP scan patterns), the patch series is an
unambiguous win. It appears to be slightly faster even in
unsympathetic cases, such as standard pgbench SELECT.

-- 
Peter Geoghegan

Re: Moving _bt_readpage and _bt_checkkeys into a new .c file

Victor Yegorov <vyegorov@gmail.com> — 2025-12-08T09:13:13Z

пн, 8 дек. 2025 г. в 01:31, Peter Geoghegan <pg@bowt.ie>:

> On Sat, Dec 6, 2025 at 9:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Since this ignore_killed_tuples change is also very simple, and also
> > seems like an easy win, I think that it can be committed as part of
> > the second patch. Without it needing to wait for too much more
> > performance validation.
>
> My plan is to commit the entire patch series (with a modified second
> patch that includes the ignore_killed_tuples change) in the next
> couple of days.
>
> As far as I can determine through performance validation that tested a
> variety of different scan types (simple point lookups, range scans,
> and a variety of different SAOP scan patterns), the patch series is an
> unambiguous win. It appears to be slightly faster even in
> unsympathetic cases, such as standard pgbench SELECT.
>

Even without the performance increase, I think this a valuable code
reorganization, worth committing.

I've compiled and run test (check and installcheck) with all 3 patches, did
a small standard pgbench run:
pgbench -s 100 -i
pgbench -P 60 -T 300

Results:
master: 569 TPS
patched: 590 TPS +3.7%

LGTM.

-- 
Victor Yegorov

Re: Moving _bt_readpage and _bt_checkkeys into a new .c file

Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-08T18:03:18Z

On 08/12/2025 11:13, Victor Yegorov wrote:
> Even without the performance increase, I think this a valuable code 
> reorganization, worth committing.

+1

- Heikki

Re: Moving _bt_readpage and _bt_checkkeys into a new .c file

Peter Geoghegan <pg@bowt.ie> — 2025-12-08T18:50:08Z

On Mon, Dec 8, 2025 at 4:13 AM Victor Yegorov <vyegorov@gmail.com> wrote:
> I've compiled and run test (check and installcheck) with all 3 patches, did a small standard pgbench run:
> pgbench -s 100 -i
> pgbench -P 60 -T 300
>
> Results:
> master: 569 TPS
> patched: 590 TPS +3.7%

Pushed both patches just now.

Thanks for the review!

-- 
Peter Geoghegan