Thread

  1. Re: Randomize B-Tree page split location to avoid oscillating patterns

    Dmitry Dolgov <9erthalion6@gmail.com> — 2026-05-06T17:10:18Z

    > On Mon, Apr 27, 2026 at 02:07:14PM -0400, Peter Geoghegan wrote:
    
    Sorry for the delay.
    
    > My interpretation is that randomization can be combined with the usual
    > suffix truncation split point choosing heuristics; the paper even
    > recommends this at one point. That is, you randomly pick from a small
    > number of candidate split points that are all approximately equally
    > good according to the traditional criteria. It looks like your patch
    > doesn't account for suffix truncation at all, though.
    > 
    > Your patch should initially look for several split points that are
    > approximately equal in quality according to the current criteria -- a
    > separate, initial pass. You'd then randomly pick a final split point
    > from those gathered during this initial pass -- not from the original
    > fillfactormult-sorted list of split points. What you have in v1 will
    > make suffix truncation much less effective, which seems unacceptable.
    
    The paper suggest combining randomization and suffix truncation via
    randomly shifting the interval for truncation. I'm not sure a separate
    pass is needed for that, looks like it should be enough to add random
    shift to lowsplit / highsplit in _bt_bestsplitloc, when the penalty is
    calculated.
    
    The interesting part here is that it seems the current split interval
    might be too small for such randomization to make a significant impact
    -- few experiments show that with the current value the result looks
    more like changing the fillfactor, waves are just shifting to the right.
    Increasing the split interval helps in this case.
    
    Such approach (randomized suffix truncation interval) produces the same
    effect in a test with single column index as in the patch above. In a
    test with a wider index and significant impact of truncation, there is
    no visible degradation (and no oscillations either, as expected suffix
    truncation introduces some randomness on it's own).
    
    > Another issue is that nbtsort.c doesn't have any ability to pick among
    > split points; it focuses entirely on keeping free space balanced. To
    > get much benefit from this, I think you'd have to teach nbtsort.c to
    > also pick split points using approximately the same algorithm as
    > nbtsplitloc.c would (assuming retail inserts in ascending key space
    > order). I've been meaning to add proper suffix truncation to the
    > CREATE INDEX case.
    
    Good point, I need to look into this.
    
    > > The unfortunate part is that I couldn't get clear numbers for the performance
    > > impact. Turns out the disk in my experimental setup is not good enough to get a
    > > sufficient number of inserts to trigger the issue, and to get nice graphs I was
    > > running everything either on a RAM disk or on an unlogged table -- in both
    > > cases it's easy to observe oscillations of page splits, but their impact is not
    > > large enough since only so much IO is happening.
    > >
    > > But anyway, any thoughts / commentaries on that?
    > 
    > I wouldn't expect the patch to increase absolute throughput
    > significantly, if at all. Its value comes from making the *rate* of
    > splits over time more consistent, for a given fixed workload. You
    > might notice a more interesting effect if you look at latency,
    > particularly worst case latency.
    
    That's exactly what I'm talking about. In those experiments I was trying
    to get any visible changes in latency variability, but in this
    particular setup I could spot nothing beyond regular noise, while the
    number of page splits was obviously heavily oscillating.
    
    
    
    
  2. Re: Randomize B-Tree page split location to avoid oscillating patterns

    Peter Geoghegan <pg@bowt.ie> — 2026-05-06T17:49:10Z

    On Wed, May 6, 2026 at 1:10 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > > On Mon, Apr 27, 2026 at 02:07:14PM -0400, Peter Geoghegan wrote:
    > > Your patch should initially look for several split points that are
    > > approximately equal in quality according to the current criteria -- a
    > > separate, initial pass. You'd then randomly pick a final split point
    > > from those gathered during this initial pass -- not from the original
    > > fillfactormult-sorted list of split points. What you have in v1 will
    > > make suffix truncation much less effective, which seems unacceptable.
    >
    > The paper suggest combining randomization and suffix truncation via
    > randomly shifting the interval for truncation. I'm not sure a separate
    > pass is needed for that, looks like it should be enough to add random
    > shift to lowsplit / highsplit in _bt_bestsplitloc, when the penalty is
    > calculated.
    
    I can't see why that would make much difference. Increasing the split
    interval usually won't affect the final chosen split point at all.
    Decreasing is more likely to change the split point, but that's
    *precisely* because it makes suffix truncation less effective.
    
    In general it's quite unlikely that an expanded interval (e.g.,
    doubling LEAF_SPLIT_DISTANCE) will include a split point that
    truncates additional index attributes compared to the best split point
    available within the current calculated split interval/current
    LEAF_SPLIT_DISTANCE. For example, with the TPC-C indexes, I'd expect
    only a tiny minority of all page splits to be affected by doubling
    LEAF_SPLIT_DISTANCE to make nbtsplitloc.c consider ~20% of all
    possible split points around the middle of the page. Whereas *halving*
    LEAF_SPLIT_DISTANCE will indeed make suffix truncation worse.
    
    This asymmetry matters. I believe that it'll make it impossible for
    space utilization to average out to the target leaffillfactor% over
    time. After all, this approach doesn't actually target space
    utilization. This is also why it will make literally no difference at
    all with a single column unique index.
    
    > The interesting part here is that it seems the current split interval
    > might be too small for such randomization to make a significant impact
    > -- few experiments show that with the current value the result looks
    > more like changing the fillfactor, waves are just shifting to the right.
    > Increasing the split interval helps in this case.
    
    The point of gathering a list of "equally good" split points in an
    initial pass is that it removes the danger of making the split
    interval less effective in terms of final split penalty. The random
    choice becomes a choice among split points that all truncate away the
    same number of suffix attributes (all of which must still be
    sufficiently close to the space-optimal split location to ensure that
    no split is ever wildly unbalanced). This prevents the variable/random
    choice from affecting the suffix truncation/split penalty.
    
    With a single column unique index, all split points will have an equal
    penalty under this scheme (implying that suffix truncation will be
    equally effective). The initial list gathered in the first pass is
    exactly all of the split points within the split interval in that
    case. Whereas with other types of indexes the initial list gathered is
    some subset of all the split points within the interval, without
    regard for how balanced the post-split space utilization will be (all
    splits within the interval are assumed to be "good enough" from a
    space utilization perspective). Over time, space utilization should
    reach fillfactor% -- without it negatively affecting suffix truncation.
    
    It's possible that increasing the split interval would also make
    sense. That seems like a follow-up question to me.
    
    --
    Peter Geoghegan
    
    
    
    
  3. Re: Randomize B-Tree page split location to avoid oscillating patterns

    Dmitry Dolgov <9erthalion6@gmail.com> — 2026-05-07T18:10:30Z

    > On Wed, May 06, 2026 at 01:49:10PM -0400, Peter Geoghegan wrote:
    > On Wed, May 6, 2026 at 1:10 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
    > > > On Mon, Apr 27, 2026 at 02:07:14PM -0400, Peter Geoghegan wrote:
    > > > Your patch should initially look for several split points that are
    > > > approximately equal in quality according to the current criteria -- a
    > > > separate, initial pass. You'd then randomly pick a final split point
    > > > from those gathered during this initial pass -- not from the original
    > > > fillfactormult-sorted list of split points. What you have in v1 will
    > > > make suffix truncation much less effective, which seems unacceptable.
    > >
    > > The paper suggest combining randomization and suffix truncation via
    > > randomly shifting the interval for truncation. I'm not sure a separate
    > > pass is needed for that, looks like it should be enough to add random
    > > shift to lowsplit / highsplit in _bt_bestsplitloc, when the penalty is
    > > calculated.
    > 
    > I can't see why that would make much difference. Increasing the split
    > interval usually won't affect the final chosen split point at all.
    > Decreasing is more likely to change the split point, but that's
    > *precisely* because it makes suffix truncation less effective.
    
    I assume by increasing/decreasing the split interval you mean increase/decrease
    around the split point, but I was talking about shifting the interval to the
    right. More about this in the next part. 
    
    > The point of gathering a list of "equally good" split points in an
    > initial pass is that it removes the danger of making the split
    > interval less effective in terms of final split penalty. The random
    > choice becomes a choice among split points that all truncate away the
    > same number of suffix attributes (all of which must still be
    > sufficiently close to the space-optimal split location to ensure that
    > no split is ever wildly unbalanced). This prevents the variable/random
    > choice from affecting the suffix truncation/split penalty.
    
    I see what you mean, but I'm concerned about the resulting overhead. Here is my
    understanding of how it would work, let me know if something is missing in the
    chain of thoughts:
    
    * What happens currently is we collect all possible split points, sort them by
      delta, calculate the split interval. Then we iterate over the possible split
      points in the interval [0, split interval), trying to find the one causing
      the lowest penalty. If what we found is the best possible penalty, we stop --
      for unique indexes it means that virtually all the time everything is over
      after a single iteration, since the first attribute is enough to
      distinguish the split.
    
    * If we're going to pick at random from a list of "equally good" split points,
      everything up to the split interval is the same. Then we iterate over the
      possible split points in the interval [0, split interval), and search for all
      the points with penalty equal or lower than the current lowest value (the
      list is reset if we found a lower value). We search this way until we found
      enough for the chosen randomization interval (those 20% from the paper) or
      until we're out of split points. In the end we've got a list of split points,
      each has equally low penalty, and we choose a split point at random from this
      list. For a unique index it means we have to do at least as many iterations
      as the size of randomization interval, instead of one iteration as before --
      this is the overhead I'm concerned about.
    
    * If we're going to shift the split interval, everything is also the same up to
      the split interval calculation. Now we introduce a
    
          delta = random from (0, randomization interval)
    
      and consider two intervals of available split points:
    
          [0, delta) and [delta, split interval + delta)
    
      We first search in [0, delta) as before for lowest possible penalty, just to
      make sure we're not missing it, if it's not present in the second interval.
      The main part is search in [delta, split interval + delta) for a split point
      with the same or lower penalty as found in the first interval and use it as
      the final result. There are few possible outcomes:
    
      1. There are one or more lowest penalty split point in [0, delta) and one or
      more in [delta, split interval + delta). Since delta is randomized, we pick
      one of the "equally good" split points at random.
    
      2. There are one or more lowest penalty split points in [0, delta) and for
      some reason none in [delta, split interval + delta). In this case we forgo
      randomization and go with the lowest penalty split point, using the point
      from the [0, delta).
    
      3. There are none lowest penalty split points in [0, delta) and one or more
      in [delta, split interval + delta). In this case we still pick up one of the
      "equally good" split points, still with some randomization.
    
      4. There are none lowest penalty split points in [0, randomization interval)
      and one or more in [randomization interval, split interval + randomization
      interval). In this case we still pick up one of the "equally good" split
      points, but without randomization, since the best point lays beyond the
      allowed randomization range, and we hit it all the time.
    
      Ultimately we would like to apply randomization of split point for mostly
      unique indexes, since otherwise suffix truncation already does the job. With
      this approach the randomization interval serves as a threshold to distinguish
      such indexes -- if the index has mostly unique values, the optimal split
      point would be located close to the start of the interval and multiple
      "equally good" points will fall within the randomization interval. If on the
      other hand the index has many duplicating values, the optimal split point may
      be located outside the randomization interval and we would choose it all
      the time -- but thanks to suffix truncation it doesn't matter.
    
      From the overhead perspective, with this approach we do:
      - one search in [0, delta), which will end after one iteration for unique
        indexes.
      - then do a "jump" of random distance and do one search in [delta, split
        interval + delta) on the same conditions. It will end after one iteration
        for unique index, and will process a similar number of iteration otherwise.
    
    To summarize, in comparison with the second approach the third one seems to
    have less overhead, the same guarantees regarding penalty, and more complex
    implementation. I'm fine going with the second approach, but first would like
    to discuss the alternatives and make sure we're on the same page.