Thread

  1. Re: Expanding HOT updates for expression and partial indexes

    Matthias van de Meent <boekewurm+postgres@gmail.com> — 2025-11-21T15:25:06Z

    On Wed, 19 Nov 2025 at 19:00, Greg Burd <greg@burd.me> wrote:
    >
    > Attached are rebased (d5b4f3a6d4e) patches with the only changes
    > happening in the last patch in the series.
    
    Here's a high-level review of the patchset, extending on what I shared
    offline. I haven't looked too closely at the code changes.
    
    Re: Perf testing
    
    Apart from the workload we've discussed offline, there's another
    workload to consider: Right now, we only really consider HOT when we
    know there's space on the page. This patch, however, will front-load a
    lot more checks before we have access to the page, and that will
    ~always impact update performance.
    I'm a bit worried that the cost of off-page updates (when the page of
    the old tuple can't fit the new tuple) will be too significantly
    increased, especially considering that we have a default fillfactor of
    100 -- if a table's tuples only grow, it's quite likely the table
    frequently can't apply HOT regardless of the updated columns. So, a
    workload that's tuned to only update tuples in a way that excercises
    'can we HOT or not' code for already-full pages would be appreciated.
    A solution to the issue might be needed if we lose too much
    performance on expensive checks; a solution like passing page space of
    the old tuple to the checks, and short-circuiting the non-HOT path if
    that page space is too small for the new tuple.
    
    0001:
    
    I'm not sure I understand why the code is near completely duplicated
    here. Maybe you can rename the resulting heap_update to
    heap_update_ext, and keep a heap_update around which wraps this
    heap_update_ext, allowing old callers to keep their signature until
    they need to use _ext's features? Then you can introduce the
    duplication if and when needed in later patches -- though I don't
    expect a lot of this duplication to be strictly necessary, though that
    may need some new helper functions.
    
    I also see that for every update we're now copying, passing 5
    bitmapsets individually and then freeing those bitmaps just a moment
    later. I'd like to avoid that overhead and duplication if possible.
    Maybe we can store these in an 'update context' struct, passed by
    reference down to table_tuple_update() from the calling code, and then
    onward to heap_update? That might then also be a prime candidate to
    contain the EState * of ExecCheckIndexedAttrsForChanges.
    
    0002:
    
    This patch seems to have some formatting updates to changes you made
    in 0001, without actually changing the code (e.g. at heap_update's
    definition). When updating code, please put it in the expected
    formatting in the same patch.
    
    ---
    In the patch subject:
    > For instance,
    > indexes with collation information allowing more HOT updates when the
    > index is specified to be case insensitive.
    
    It is incorrect to assume that indexed "btree-equal" datums allow HOT
    updates. The user can not be returned an old datum in index-only
    scans, even if it's sorted the same as the new datum -- after all, a
    function on the datum may return different results even if the datums
    are otherwise equal. Think: count_uppercase(string). See also below at
    "HOT, datum compare, etc".
    
    ---
    With the addition of rd_indexedattr we now have 6 bitmaps in
    RelationData, which generally only get accessed through
    RelationGetIndexAttrBitmap by an enum value. Maybe it's now time to
    bite the bullet and change that to a more general approach with
    `Bitmapset  *rd_bitmaps[NUM_INDEX_ATTR_BITMAP]`? That way, the hot
    path in RelationGetIndexAttrBitmap would not depend on the compiler to
    determine that the fast path can do simple offset arithmatic to get
    the requested bitmap.
    
    0003:
    This looks like it's a cleanup patch for 0002, and doesn't have much
    standing on its own. Maybe the changes for ri_ChangedIndexedCols can
    all be moved into 0003? I think that gives this patch more weight and
    standing.
    
    0004:
    This does two things:
     1. Add index expression evaluation to the toolset to determine which
    indexes were unchanged, and
     2. Allow index access methods to say "this value has not changed"
    even if the datum itself may have changed.
    
    Could that be split up into two different patches?
    
    (aside: 2 makes a lot of sense in some cases, like trgm indexes
    strings if no new trigrams are added/removed, so I really like the
    idea behind this change)
    
    HOT, datum compare, etc.:
    
    Note that for index-only scans on an index to return correct results,
    you _must_ update the index (and thus, do a non-HOT update) whenever a
    value changes its binary datum, even if the value has the same btree
    sort location as the old value. Even for non-IOS-supporting indexes,
    the index may need more information than what's used in btree
    comparisons when it has a btree opclass.
    As SP/GIST have IOS support, they also need to compare the image of
    the datum and not use ordinary equality as defined in nbtree's compare
    function: the value must be exactly equal to what the table AM
    would've provided.
    
    Primary example: Btree compares the `numeric` datums of `1.0` and
    `1.00` as equal, but for a user there is an observable difference; the
    following SQL must return `1.0` in every valid plan:
    
    BEGIN;
    INSERT INTO mytab (mynumeric) VALUES ('1.00');
    UPDATE mytab SET mynumeric = '1.0';
    SELECT mynumeric FROM mytab;
    
    So, IMO, the default datum compare in ExecCheckIndexedAttrsForChanges
    and friends should just use datumIsEqual, and not this new
    tts_attr_equal.
    Indexes without IOS support might be able to opt into using a more lax
    datum comparator, but 1.) it should never be the default, as it'd be a
    loaded footgun for IndexAM implementers, and 2.) should not depend on
    another AM's understanding of attributes, as that is a very leaky
    abstraction.
    
    
    Kind regards,
    
    Matthias van de Meent
    Databricks (https://www.databricks.com)