Thread

Re: Randomize B-Tree page split location to avoid oscillating patterns

Dmitry Dolgov <9erthalion6@gmail.com> — 2026-05-06T17:10:18Z

> On Mon, Apr 27, 2026 at 02:07:14PM -0400, Peter Geoghegan wrote:

Sorry for the delay.

> My interpretation is that randomization can be combined with the usual
> suffix truncation split point choosing heuristics; the paper even
> recommends this at one point. That is, you randomly pick from a small
> number of candidate split points that are all approximately equally
> good according to the traditional criteria. It looks like your patch
> doesn't account for suffix truncation at all, though.
> 
> Your patch should initially look for several split points that are
> approximately equal in quality according to the current criteria -- a
> separate, initial pass. You'd then randomly pick a final split point
> from those gathered during this initial pass -- not from the original
> fillfactormult-sorted list of split points. What you have in v1 will
> make suffix truncation much less effective, which seems unacceptable.

The paper suggest combining randomization and suffix truncation via
randomly shifting the interval for truncation. I'm not sure a separate
pass is needed for that, looks like it should be enough to add random
shift to lowsplit / highsplit in _bt_bestsplitloc, when the penalty is
calculated.

The interesting part here is that it seems the current split interval
might be too small for such randomization to make a significant impact
-- few experiments show that with the current value the result looks
more like changing the fillfactor, waves are just shifting to the right.
Increasing the split interval helps in this case.

Such approach (randomized suffix truncation interval) produces the same
effect in a test with single column index as in the patch above. In a
test with a wider index and significant impact of truncation, there is
no visible degradation (and no oscillations either, as expected suffix
truncation introduces some randomness on it's own).

> Another issue is that nbtsort.c doesn't have any ability to pick among
> split points; it focuses entirely on keeping free space balanced. To
> get much benefit from this, I think you'd have to teach nbtsort.c to
> also pick split points using approximately the same algorithm as
> nbtsplitloc.c would (assuming retail inserts in ascending key space
> order). I've been meaning to add proper suffix truncation to the
> CREATE INDEX case.

Good point, I need to look into this.

> > The unfortunate part is that I couldn't get clear numbers for the performance
> > impact. Turns out the disk in my experimental setup is not good enough to get a
> > sufficient number of inserts to trigger the issue, and to get nice graphs I was
> > running everything either on a RAM disk or on an unlogged table -- in both
> > cases it's easy to observe oscillations of page splits, but their impact is not
> > large enough since only so much IO is happening.
> >
> > But anyway, any thoughts / commentaries on that?
> 
> I wouldn't expect the patch to increase absolute throughput
> significantly, if at all. Its value comes from making the *rate* of
> splits over time more consistent, for a given fixed workload. You
> might notice a more interesting effect if you look at latency,
> particularly worst case latency.

That's exactly what I'm talking about. In those experiments I was trying
to get any visible changes in latency variability, but in this
particular setup I could spot nothing beyond regular noise, while the
number of page splits was obviously heavily oscillating.

Re: Randomize B-Tree page split location to avoid oscillating patterns

Peter Geoghegan <pg@bowt.ie> — 2026-05-06T17:49:10Z

On Wed, May 6, 2026 at 1:10 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > On Mon, Apr 27, 2026 at 02:07:14PM -0400, Peter Geoghegan wrote:
> > Your patch should initially look for several split points that are
> > approximately equal in quality according to the current criteria -- a
> > separate, initial pass. You'd then randomly pick a final split point
> > from those gathered during this initial pass -- not from the original
> > fillfactormult-sorted list of split points. What you have in v1 will
> > make suffix truncation much less effective, which seems unacceptable.
>
> The paper suggest combining randomization and suffix truncation via
> randomly shifting the interval for truncation. I'm not sure a separate
> pass is needed for that, looks like it should be enough to add random
> shift to lowsplit / highsplit in _bt_bestsplitloc, when the penalty is
> calculated.

I can't see why that would make much difference. Increasing the split
interval usually won't affect the final chosen split point at all.
Decreasing is more likely to change the split point, but that's
*precisely* because it makes suffix truncation less effective.

In general it's quite unlikely that an expanded interval (e.g.,
doubling LEAF_SPLIT_DISTANCE) will include a split point that
truncates additional index attributes compared to the best split point
available within the current calculated split interval/current
LEAF_SPLIT_DISTANCE. For example, with the TPC-C indexes, I'd expect
only a tiny minority of all page splits to be affected by doubling
LEAF_SPLIT_DISTANCE to make nbtsplitloc.c consider ~20% of all
possible split points around the middle of the page. Whereas *halving*
LEAF_SPLIT_DISTANCE will indeed make suffix truncation worse.

This asymmetry matters. I believe that it'll make it impossible for
space utilization to average out to the target leaffillfactor% over
time. After all, this approach doesn't actually target space
utilization. This is also why it will make literally no difference at
all with a single column unique index.

> The interesting part here is that it seems the current split interval
> might be too small for such randomization to make a significant impact
> -- few experiments show that with the current value the result looks
> more like changing the fillfactor, waves are just shifting to the right.
> Increasing the split interval helps in this case.

The point of gathering a list of "equally good" split points in an
initial pass is that it removes the danger of making the split
interval less effective in terms of final split penalty. The random
choice becomes a choice among split points that all truncate away the
same number of suffix attributes (all of which must still be
sufficiently close to the space-optimal split location to ensure that
no split is ever wildly unbalanced). This prevents the variable/random
choice from affecting the suffix truncation/split penalty.

With a single column unique index, all split points will have an equal
penalty under this scheme (implying that suffix truncation will be
equally effective). The initial list gathered in the first pass is
exactly all of the split points within the split interval in that
case. Whereas with other types of indexes the initial list gathered is
some subset of all the split points within the interval, without
regard for how balanced the post-split space utilization will be (all
splits within the interval are assumed to be "good enough" from a
space utilization perspective). Over time, space utilization should
reach fillfactor% -- without it negatively affecting suffix truncation.

It's possible that increasing the split interval would also make
sense. That seems like a follow-up question to me.

--
Peter Geoghegan

Re: Randomize B-Tree page split location to avoid oscillating patterns

Dmitry Dolgov <9erthalion6@gmail.com> — 2026-05-07T18:10:30Z

> On Wed, May 06, 2026 at 01:49:10PM -0400, Peter Geoghegan wrote:
> On Wed, May 6, 2026 at 1:10 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > > On Mon, Apr 27, 2026 at 02:07:14PM -0400, Peter Geoghegan wrote:
> > > Your patch should initially look for several split points that are
> > > approximately equal in quality according to the current criteria -- a
> > > separate, initial pass. You'd then randomly pick a final split point
> > > from those gathered during this initial pass -- not from the original
> > > fillfactormult-sorted list of split points. What you have in v1 will
> > > make suffix truncation much less effective, which seems unacceptable.
> >
> > The paper suggest combining randomization and suffix truncation via
> > randomly shifting the interval for truncation. I'm not sure a separate
> > pass is needed for that, looks like it should be enough to add random
> > shift to lowsplit / highsplit in _bt_bestsplitloc, when the penalty is
> > calculated.
> 
> I can't see why that would make much difference. Increasing the split
> interval usually won't affect the final chosen split point at all.
> Decreasing is more likely to change the split point, but that's
> *precisely* because it makes suffix truncation less effective.

I assume by increasing/decreasing the split interval you mean increase/decrease
around the split point, but I was talking about shifting the interval to the
right. More about this in the next part. 

> The point of gathering a list of "equally good" split points in an
> initial pass is that it removes the danger of making the split
> interval less effective in terms of final split penalty. The random
> choice becomes a choice among split points that all truncate away the
> same number of suffix attributes (all of which must still be
> sufficiently close to the space-optimal split location to ensure that
> no split is ever wildly unbalanced). This prevents the variable/random
> choice from affecting the suffix truncation/split penalty.

I see what you mean, but I'm concerned about the resulting overhead. Here is my
understanding of how it would work, let me know if something is missing in the
chain of thoughts:

* What happens currently is we collect all possible split points, sort them by
  delta, calculate the split interval. Then we iterate over the possible split
  points in the interval [0, split interval), trying to find the one causing
  the lowest penalty. If what we found is the best possible penalty, we stop --
  for unique indexes it means that virtually all the time everything is over
  after a single iteration, since the first attribute is enough to
  distinguish the split.

* If we're going to pick at random from a list of "equally good" split points,
  everything up to the split interval is the same. Then we iterate over the
  possible split points in the interval [0, split interval), and search for all
  the points with penalty equal or lower than the current lowest value (the
  list is reset if we found a lower value). We search this way until we found
  enough for the chosen randomization interval (those 20% from the paper) or
  until we're out of split points. In the end we've got a list of split points,
  each has equally low penalty, and we choose a split point at random from this
  list. For a unique index it means we have to do at least as many iterations
  as the size of randomization interval, instead of one iteration as before --
  this is the overhead I'm concerned about.

* If we're going to shift the split interval, everything is also the same up to
  the split interval calculation. Now we introduce a

      delta = random from (0, randomization interval)

  and consider two intervals of available split points:

      [0, delta) and [delta, split interval + delta)

  We first search in [0, delta) as before for lowest possible penalty, just to
  make sure we're not missing it, if it's not present in the second interval.
  The main part is search in [delta, split interval + delta) for a split point
  with the same or lower penalty as found in the first interval and use it as
  the final result. There are few possible outcomes:

  1. There are one or more lowest penalty split point in [0, delta) and one or
  more in [delta, split interval + delta). Since delta is randomized, we pick
  one of the "equally good" split points at random.

  2. There are one or more lowest penalty split points in [0, delta) and for
  some reason none in [delta, split interval + delta). In this case we forgo
  randomization and go with the lowest penalty split point, using the point
  from the [0, delta).

  3. There are none lowest penalty split points in [0, delta) and one or more
  in [delta, split interval + delta). In this case we still pick up one of the
  "equally good" split points, still with some randomization.

  4. There are none lowest penalty split points in [0, randomization interval)
  and one or more in [randomization interval, split interval + randomization
  interval). In this case we still pick up one of the "equally good" split
  points, but without randomization, since the best point lays beyond the
  allowed randomization range, and we hit it all the time.

  Ultimately we would like to apply randomization of split point for mostly
  unique indexes, since otherwise suffix truncation already does the job. With
  this approach the randomization interval serves as a threshold to distinguish
  such indexes -- if the index has mostly unique values, the optimal split
  point would be located close to the start of the interval and multiple
  "equally good" points will fall within the randomization interval. If on the
  other hand the index has many duplicating values, the optimal split point may
  be located outside the randomization interval and we would choose it all
  the time -- but thanks to suffix truncation it doesn't matter.

  From the overhead perspective, with this approach we do:
  - one search in [0, delta), which will end after one iteration for unique
    indexes.
  - then do a "jump" of random distance and do one search in [delta, split
    interval + delta) on the same conditions. It will end after one iteration
    for unique index, and will process a similar number of iteration otherwise.

To summarize, in comparison with the second approach the third one seems to
have less overhead, the same guarantees regarding penalty, and more complex
implementation. I'm fine going with the second approach, but first would like
to discuss the alternatives and make sure we're on the same page.