Re: Returning nbtree posting list TIDs in DESC order during backwards scans
Peter Geoghegan <pg@bowt.ie>
From: Peter Geoghegan <pg@bowt.ie>
To: Victor Yegorov <vyegorov@gmail.com>
Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Date: 2025-12-10T18:40:03Z
Lists: pgsql-hackers
Commits
Same data as JSON:
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
Clarify why _bt_killitems sorts its items array.
- e16c6f024718 19 (unreleased) landed
-
Return TIDs in desc order during backwards scans.
- bfb335df58ea 19 (unreleased) landed
-
Optimize nbtree backwards scans.
- 1bd4bc85cac2 18.0 cited
-
Optimize nbtree backward scan boundary cases.
- c9c0589fda0e 17.0 cited
Attachments
- v5-0001-Return-TIDs-in-desc-order-during-backwards-scans.patch (application/x-patch) patch v5-0001
On Wed, Dec 3, 2025 at 10:18 AM Victor Yegorov <vyegorov@gmail.com> wrote: > Patch looks fine, applies and compiles cleanly, passes tests. This patch was trickier than initially expected. I paid no attention to the possible downside of changing the posting list iteration code in _bt_readpage (i.e. from scan posting lists in descending order for backwards scans), which was an oversight. It seems that we're very sensitive to how the compiler allocates registers within _bt_readpage, which led to v4 of the patch (and presumably all earlier versions) not storing itemIndex in a register. With v4, the compiler spilled the itemIndex comparison to the stack (at least on my machine, which uses GCC 15.2 from Debian unstable), and so had to read itemIndex from memory on each loop iteration. This regressed pgbench's SELECT workload by about 1% of total TPS (on my Zen 3 CPU). Attached v5 avoids the regression by tweaking _bt_readpage. I will commit this version soon (I really mean it this time!). I'm not sure why these changes have the intended effect -- I mostly figured all this out through trial and error. Though I can say that my testing showed that _not_ changing the posting list iteration code in the _bt_readpage forwards scan loop makes the crucial difference. The other tweaks probably aren't strictly necessary, but they seem to make the compiler generate ever so slightly faster code (at least with pgbench SELECT), so I kept them in. I also gave up on the idea of using a bitmapset for v5 -- the issue with regressing _bt_readpage discouraged me from adding code that allocates memory (more than once per scan) within btgettuple, which is a performance-critical path. We can simply sort and unique-ify the killedItems array from within _bt_killitems instead -- which is largely the same way that my original v1 did it. That way it won't matter what order we append to killItems in, relative to the scan direction. (The downside of sticking with an array for killedItems is that we can still run out of array space in extreme cases involving scrollable cursors that return many killable tuples, and repeatedly append the same TID to killedItems[], but that doesn't seem like much of a loss to me -- that almost never happens in practice.) > I'd like to point out a missing space after the dot in the 2nd para of the commit message, > falls out of style. Fixed that too. Thanks -- Peter Geoghegan