Thread

  1. Re: Expanding HOT updates for expression and partial indexes

    Greg Burd <greg@burd.me> — 2025-11-16T18:53:09Z

    Hello again.
    
    This idea started over a year ago for me while working on a project that
    used JSONB and had horrible "bloat" with update times that were not
    fantastic.  The root cause, expression indexes prevent HOT updates and
    all indexes on JSONB are by definition, expressions.  That's the backstory.
    
    The idea for the solution came from a patch [1] applied [2], then later
    reverted [3], that basically evaluated the before/after tuple
    expressions in heap_update() at the point where newbuff == buffer just
    before deciding to use_hot_update or not.  When the evaluated
    expressions produced equal results using a binary comparison the index
    didn't need to be updated.  While this approach sorta worked, but it was
    reverted for a few reasons, here's Tom's summary: "The problem here is
    that [the code that checks for equality] thinks that the type of the
    index column is identical to the type of the source datum for it, which
    is not true for any opclass making use of the opckeytype property. [4]"
    
    Still, expanding the domain of what might go HOT seems like a good goal
    to me and it was at the heart of the issues I was facing on the project
    using JSONB, so I kept at it.
    
    The patches I've sent on this thread have evolved from that first idea. 
    First they addressed the specific issue raised by Tom.  Then they
    expanded to include partial indexes as well.  This worked, and I helped
    to ship a fork of Postgres used this approach and solved customer issues
    with the JSONB use case.  Customer experience was better, no unnecessary
    index updates when indexed data wasn't modified meant no more heap/index
    bloat and faster update times.  Vacuum could finally keep up and
    performance and storage overhead was much better.
    
    For this patch set v1 through v17 were essentially work that
    refined/reworked that original approach, but there was a serious flaw
    that I didn't fully appreciate until around v16/17.  I was evaluating
    the expressions while holding a lock on the buffer page which a) expands
    the time the lock is held, and b) opens the door to self-deadlock.  No bueno.
    
    Then there was v18, a quick work-around for that.  I moved the call that
    invokes the executor to the beginning of heap_update() before taking the
    lock on the page.  To do this I had to find the set of updated
    attributes, which I discovered was available in the executor as the
    updated tuple is created.  This was a viable fix, but didn't really go
    far enough and was a bit hackish IMO.  Jeff Davis and others challenged
    me to move the work to identify what's changed into the executor and
    clean it up.  I'm a sucker for a challenge.
    
    More generally, this idea made sense.  While Postgres has many places
    where logic is tightly coupled to the way the heap works this didn't
    have to be one. To make this work I needed the update path in the
    executor to be interested in:
    
    a) knowing what columns were specified in the UPDATE statement and those
    impacted by before/after triggers,
    b) reducing that set to those attributes known to be both indexed and to
    have changed value,
    c) finding which of those (and possibly other) attributes that force new
    index updates.
    
    Why?  We'll, that code already exists in a few places and in some cases
    is replicated; for (a) there is ExecGetAllUpdatedCols(), for (b)
    HeapDetermineColumnsInfo() and index_unchanged_by_update().
    
    An interesting thing to note is that HeapDetermineColumnsInfo() might
    return a set that includes columns not returned by
    ExecGetAllUpdatedCols() because HeapDetermineColumnsInfo() iterates over
    all indexed attributes looking for changes and that might find an
    indexed attribute that was changed by heap_modify_tuple() but not
    knowable by ExecGetAllUpdatedCols().  This happens in tsvector code, see
    tsvector_op.c tsvector_update_trigger() where if (update_needed)
    heap_modify_tuple_by_cols().  That column isn't known to ExecGetAllUpdatedCols().
    
    HeapDetermineColumnsInfo() is also critical when modifying catalog
    tuples.  Catalog tuples are modified using either Form/GETSTRUCT or
    values/nulls/replaces then using heap_modify_tuple() and calling into
    CatalogTupleUpdate() which calls simple_heap_update() that calls
    heap_update() where we find HeapDetermineColumnsInfo().  The interesting
    thing here is that when modifying catalog tuples there is knowledge of
    what attributes are changed, but that knowledge isn't preserved and
    passed into CatalogTupleUpdate(), rather it is re-discovered in
    HeapDetermineColumnsInfo().  That's how catalog tuples are able to take
    the HOT path, they re-use that same logic.  There is a fix for that [5]
    too (and I really hope that lands in master ASAP), but that's not the
    subject of this thread.
    
    HeapDetermineColumnsInfo() also helps inform a few other decisions in
    heap_update(), but these have to happen after taking the buffer lock and
    are very heap-specific, namely:
    
    1. Do either the replica identity key attributes overlap with the
    modified index attributes or do they need to be stored externally, this
    is passed on to ExtractReplicaIdentity() to find out if we must augment
    the WAL log or not.
    
    2. Are there any modified indexed attributes that intersect with the
    primary keys of the relation, if not lower the lock mode to enable
    multixact to work.
    
    HeapDetermineColumnsInfo() also takes a pragmatic approach to testing
    for equality when looking for modified indexed attributes, it uses
    datumIsEqual() which boils down to a simple memcmp() of the before/after
    HeapTuple datum.  This is fine in most cases, but limits the scope of
    what can be HOT.
    
    Interestingly, this requirement of binary equality has leaked into other
    parts of the code, namely nbtree's deduplication of TIDs on page split. 
    That code uses binary equality as well.  A nbtree index with collation
    for case insensitive must store both "A" and "a" despite those being
    type-equal because they are not binary equivalent.  More on this later.
    
    At this point, I had my sights set on HeapDetermineColumnsInfo().  I
    felt that what it was doing should move into the executor, well as much
    of that work as possible, and outside of the buffer lock.  This would
    also open the door for removal of redundant code.  My thought was that
    the table AM update API should have an additional argument, the
    "modified indexed attributes" or "mix_attrs", passed in.
    
    So, here we are at the door of v19... let's begin.
    
    0001 - Reorganize heap update logic
    
    This is preparatory work for the larger goal in that heap_update()
    serves two masters: CatalogTupleUpdate()/simple_heap_update() and
    heap_tuple_update() and in reality, they were different but needed most
    of the same logic that happens at the start of heap_update().  This
    patch splits that logic out and moves it into heap_tuple_update() and
    simple_heap_update().  Functionally nothing changes.  That's the meat of
    this patch.Reorganize heap update logic
    
    
    0002 - Track changed indexed columns in the executor during UPDATEs
    
    This is the first core set of changes, it doesn't expand HOT updates but
    it does restructure where HeapDetermineColumnsInfo()'s core work happens.
    
    A new function ExecCheckIndexedAttrsForChanges() in nodeModifyTable.c is
    now responsible for checking for changes between Datum in the old/new
    TupleTableSlots.  This is different from before in that we're not
    checking the new HeapTuple Datum verses the HeapTuple we read from the
    buffer page while holding the lock on that page.
    
    An update starts off by reading the existing tuple using the table AM. 
    Then a new updated tuple is created as the set of changes to the old. 
    Then the new TupleTableSlot is the combination of the existing one we
    just read and the changes we just recorded.  So, in the executor before
    calling into the table AM's update function we have a pin on the buffer
    and the before/after TupleTableSlots for this update.  So, I've put the
    call to my new ExecCheckIndexedAttrsForChanges() function just before
    calling table_tuple_update() and I've added the "mix_attrs" into that
    call which get passed on to the heap in heap_tuple_update() and then
    heap_update() and all is well.
    
    Why is this safe?  The way I read heap_update() is that it has always
    historically had code to deal with cases where the tuple is concurrently
    updated and react accordingly thanks to HeapTupleSatisfiesUpdate() which
    remains where it was in heap_update(). Visibility checks happened when
    we first read the tuple to form the updated tuple and later in
    heap_update() when we call HeapTupleSatisfiesVisibility() to check for
    transaction-snapshot mode RI updates.
    
    So, this new update path from the executor into the table AM seems to me
    to be okay and almost functionally equivalent.  But there is one big
    change to discuss before moving to the simple_heap_update() path.
    
    In nodeModifyTable.c tts_attr_equal() replaces heap_attr_equal()
    changing the test for equality when calling into heap_tuple_update(). 
    In the past we used datumIsEqual(), essentially a binary comparison
    using memcmp(), now the comparison code in tts_attr_equal uses
    type-specific equality function when available and falls back to
    datumIsEqual() when not.
    
    The other parts of HeapDetermineColumnsInfo() remain in the code, but
    they still happen within the simple_heap_update() and
    heap_tuple_update() code.  That's where you'll find that after the
    buffer is locked we do (1) and (2) from above.  This keeps the
    heap-specific work in the heap, but we've moved some work up into the
    executor and outside the buffer lock.
    
    While this accomplishes the goal of removing HeapDetermineColumnsInfo()
    from heap_update() on the path that uses the table AM API
    heap_tuple_update() but it doesn't on the simple_heap_update() path. 
    That remains the same as it was in the previous patch.  Ideally, my
    patch to restructure how catalog tuples are updated [5] is committed and
    we can fully remove HeapDetermineColumnsInfo() and likely speed up all
    catalog updates in the process.  That's what motivated [5], please take
    a look, it required a huge number of changes so I thought it deserved a
    life/thread of its own.
    
    Finally, there is the ExecSimpleRelationUpdate() path and
    slot_modify_data().  On this path we know what attributes are being
    updated, so we just check to see if they changed and then intersect that
    with the set of indexed attributes and we have our modified indexed
    attributes set to pass into simple_heap_update().
    
    0003 - Replace index_unchanged_by_update with ri_ChangedIndexedCols
    
    This patch removes the function index_unchanged_by_update() in
    execIndexing.c and simply re-uses the modified indexed attributes that
    we've stashed away in ResultRelInfo as ri_ChangedIndexedCols.  This
    provides a hint when calling into the index AM's index_insert() function
    indicating if the UPDATE was without logical change to the data or not. 
    We've done that check, we don't need to do it again.
    
    
    0004 - Enable HOT updates for expression and partial indexes
    
    This finally gets us back to where this project started, but on much
    more firm ground than before because we're not going to self-deadlock. 
    The idea has grown from a small function into something larger, but only
    out of necessity.
    
    In this patch I add ExecWhichIndexesRequireUpdates() in execIndexing.c
    which implements (c) finding the set of attributes that force new index
    updates.  This set can be very different from the modified indexed
    attributes.  We know that some attributes are not equal to their
    previous versions, but does that mean that the index that references
    that attribute needs a new index tuple?  It may, or it may not.  Here's
    the comment on that function that explains:
    
    /*
     * ExecWhichIndexesRequireUpdates
     *
     * Determine which indexes need updating given modified indexed attributes.
     * This function is a companion to ExecCheckIndexedAttrsForChanges(). 
    On the
     * surface, they appear similar but they are doing two very different things.
     *
     * For a standard index on a set of attributes this is the intersection of
     * the mix_attrs and the index attrs (key, expression, but not predicate).
     *
     * For expression indexes and indexes which implement the amcomparedatums()
     * index AM API we'll need to form index datum and compare each
    attribute to
     * see if any actually changed.
     *
     * For expression indexes the result of the expression might not change
    at all,
     * this is common with JSONB columns which require expression indexes
    and where
     * it is commonplace to index a field within a document and have updates that
     * generally don't update that field.
     *
     * Partial indexes won't trigger index tuples when the old/new tuples
    are both
     * outside of the predicate range.
     *
     * For nbtree the amcomparedatums() API is critical as it requires that key
     * attributes are equal when they memcmp(), which might not be the case when
     * using type-specific comparison or factoring in collation which might make
     * an index case insensitive.
     *
     * All of this is to say that the goal is for the executor to know,
    ahead of
     * calling into the table AM for the update and before calling into the index
     * AM for inserting new index tuples, which attributes at a minimum will
     * necessitate a new index tuple.
     *
    ...
     */
    
    Whereas before we were comparing Datum in the table relation, now we're
    comparing Datum in the index relation.  Index AMs are free to store what
    they want, we need to know if what's changed and referenced by the index
    means that the index needs a new tuple or not.
    
    In the case of a JSONB expression index (or any expression index) the
    expression is evaluated when calling FormIndexDatum().  The result of
    the expression is the Datum in the values/isnull arrays.  Then we need
    to compare them to see if they changed.  This can be done using the same
    tts_attr_equal() function, but with the attribute desc of the index, not
    table relation.
    
    In some cases that's not enough, for instance nbtree and it's more
    stringent equality requirements.  For that reason (and another coming up
    in a second) we need a new optional index AM API, amcomparedatums(). 
    Indexes implementing this function have the ability to compare two
    Datums for equality in what ever way they want.  For nbtree, that's a
    binary comparison.
    
    The new case here that this also supports is for indexes like GIN/RUM
    where there is an opclass that can extract zero or more pieces from the
    attribute and form multiple index entries when needed.  Those extracted
    pieces might have odd equality rules as well.  This opens the door for
    index implementations to provide information that will help inform heap
    when making HOT decisions that didn't exist before.
    
    Take for example the Linux Foundation's DocumentDB [6] project [7] which
    aims to be an open source alternative to MongoDB built on top of
    PostgreSQL.  One of the pieces of that project is its "extended RUM"
    index AM implementation.  This index extracts portions of the BSON
    documents stored and forms index keys from that.  Here is an example:
    
    CREATE INDEX documents_rum_index_14002
    ON documentdb_data.documents_14001
    USING documentdb_rum
      (document bson_rum_single_path_ops (path=a, iswildcard='true', tl='2699'))
    
    An index on the "document" column which uses the
    "bson_rum_single_path_ops" opclass to extract a portion of the BSON
    document that matches "path=a".
    
    For this to potentially be stored by heap as a HOT update we need to
    know that what changed within that document didn't intersect with the
    "path=a" and if it did that the new value(s) were all equal to the old
    values.  Equality in DocumentDB isn't what you think, it's quite odd and
    specific to BSON and rules defined by MongoDB so it's important to allow
    the index AM to execute what's necessary for it's use case.
    
    For the more common JSONB use case we can now have heap make an informed
    decision  about HOT updates after evaluating the expression.  For
    GIN/RUM/etc. implementation there is a path to HOT.  For nbtree there is
    a way to maintain its requirements.
    
    Is there a cost to all this?  Yes, of course.  There is net new work
    being done on some paths.  IndexFormDatum() will be called more
    frequently and sometimes twice for the same thing.  This could be
    improved, there might be a way to cache that information.  But to have
    tests of old/new we'll have to do that work at least twice.
    
    Is there a benefit?  Yes, of course.  Some redundant code paths are gone
    and in the end we've increased the number of cases where HOT updates are
    a possibility.  This  especially helps out users of JSONB, but not only them.
    
    What's left undone?
    
    * I need to check code coverage so that I might
    * create tests covering all the new cases
    * update the README.HOT documentation, wiki, etc.
    * performance...
    
    For performance I'd like to examine some worst cases as in lots of
    indexes that have a lot of new code to exercise and all but the last
    index would allow for a HOT update.  That should represent the maximum
    amount of new overhead for this code.  Then, the other side of the
    equation, how much does this help JSONB?  I think that is something to
    measure in terms of TPS as well as "bloat" avoided and time spent vacuuming.
    
    I also don't like the TU_Updating enum, I think it's a leaky abstraction
    and really pointless now.  I'd like to remove it in favor of the bitmap
    of attributes known to force index tuples to be inserted.  Maybe I'll
    layer that into the next set.
    
    In the end, this is a lot of work and I believe that it moves the ball
    forward.  I'll have more metrics on that soon I hope, but I wanted to
    get the conversation re-started ASAP as we're late in the v19 cycle.
    
    Finally, the "elephant" in the room (ha!) is PHOT[9]/WARM[10][11].  Yes,
    some of this work does help make a solution to allow HOT updates when
    only updating a subset of indexes closer to reality, I'd be lying not to
    mention that here, it is a big part of my overall plan for my next year
    and (I hope) v20 and I need this work to get there.  Specifically, PHOT
    is in part based on HeapDetermineColumnsInfo() and I needed to make that
    more truthful for it to work.
    
    I hope you see the value in this work and will partner with me to
    finalize it and get it into master.
    
    best.
    
    -greg
    
    [1] https://www.postgresql.org/message-id/flat/4d9928ee-a9e6-15f9-9c82-5981f13ffca6%40postgrespro.ru
    [2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c203d6cf8
    [3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=05f84605dbeb9cf8279a157234b24bbb706c5256
    [4] https://www.postgresql.org/message-id/2877.1541538838%40sss.pgh.pa.us
    [5] https://www.postgresql.org/message-id/flat/2C5C8B8D-8B36-4547-88EB-BDCF9A7C8D94@greg.burd.me
    [6] https://www.linuxfoundation.org/press/linux-foundation-welcomes-documentdb-to-advance-open-developer-first-nosql-innovation
    [7] https://github.com/documentdb/documentdb
    [8] https://github.com/documentdb/documentdb/tree/main/pg_documentdb_extended_rum
    [9] https://www.postgresql.org/message-id/flat/2ECBBCA0-4D8D-4841-8872-4A5BBDC063D2%40amazon.com
    [10] https://www.postgresql.org/message-id/flat/CABOikdMop5Rb_RnS2xFdAXMZGSqcJ-P-BY2ruMd%2BbuUkJ4iDPw%40mail.gmail.com
    [11]
    https://www.postgresql.org/message-id/flat/CABOikdMNy6yowA%2BwTGK9RVd8iw%2BCzqHeQSGpW7Yka_4RSZ_LOQ%40mail.gmail.com