Thread

Re: Expanding HOT updates for expression and partial indexes

Greg Burd <greg@burd.me> — 2025-11-16T18:53:09Z

Hello again.

This idea started over a year ago for me while working on a project that
used JSONB and had horrible "bloat" with update times that were not
fantastic. The root cause, expression indexes prevent HOT updates and
all indexes on JSONB are by definition, expressions. That's the backstory.

The idea for the solution came from a patch [1] applied [2], then later
reverted [3], that basically evaluated the before/after tuple
expressions in heap_update() at the point where newbuff == buffer just
before deciding to use_hot_update or not. When the evaluated
expressions produced equal results using a binary comparison the index
didn't need to be updated. While this approach sorta worked, but it was
reverted for a few reasons, here's Tom's summary: "The problem here is
that [the code that checks for equality] thinks that the type of the
index column is identical to the type of the source datum for it, which
is not true for any opclass making use of the opckeytype property. [4]"

Still, expanding the domain of what might go HOT seems like a good goal
to me and it was at the heart of the issues I was facing on the project
using JSONB, so I kept at it.

The patches I've sent on this thread have evolved from that first idea.
First they addressed the specific issue raised by Tom. Then they
expanded to include partial indexes as well. This worked, and I helped
to ship a fork of Postgres used this approach and solved customer issues
with the JSONB use case. Customer experience was better, no unnecessary
index updates when indexed data wasn't modified meant no more heap/index
bloat and faster update times. Vacuum could finally keep up and
performance and storage overhead was much better.

For this patch set v1 through v17 were essentially work that
refined/reworked that original approach, but there was a serious flaw
that I didn't fully appreciate until around v16/17. I was evaluating
the expressions while holding a lock on the buffer page which a) expands
the time the lock is held, and b) opens the door to self-deadlock. No bueno.

Then there was v18, a quick work-around for that. I moved the call that
invokes the executor to the beginning of heap_update() before taking the
lock on the page. To do this I had to find the set of updated
attributes, which I discovered was available in the executor as the
updated tuple is created. This was a viable fix, but didn't really go
far enough and was a bit hackish IMO. Jeff Davis and others challenged
me to move the work to identify what's changed into the executor and
clean it up. I'm a sucker for a challenge.

More generally, this idea made sense. While Postgres has many places
where logic is tightly coupled to the way the heap works this didn't
have to be one. To make this work I needed the update path in the
executor to be interested in:

a) knowing what columns were specified in the UPDATE statement and those
impacted by before/after triggers,
b) reducing that set to those attributes known to be both indexed and to
have changed value,
c) finding which of those (and possibly other) attributes that force new
index updates.

Why? We'll, that code already exists in a few places and in some cases
is replicated; for (a) there is ExecGetAllUpdatedCols(), for (b)
HeapDetermineColumnsInfo() and index_unchanged_by_update().

An interesting thing to note is that HeapDetermineColumnsInfo() might
return a set that includes columns not returned by
ExecGetAllUpdatedCols() because HeapDetermineColumnsInfo() iterates over
all indexed attributes looking for changes and that might find an
indexed attribute that was changed by heap_modify_tuple() but not
knowable by ExecGetAllUpdatedCols(). This happens in tsvector code, see
tsvector_op.c tsvector_update_trigger() where if (update_needed)
heap_modify_tuple_by_cols(). That column isn't known to ExecGetAllUpdatedCols().

HeapDetermineColumnsInfo() is also critical when modifying catalog
tuples. Catalog tuples are modified using either Form/GETSTRUCT or
values/nulls/replaces then using heap_modify_tuple() and calling into
CatalogTupleUpdate() which calls simple_heap_update() that calls
heap_update() where we find HeapDetermineColumnsInfo(). The interesting
thing here is that when modifying catalog tuples there is knowledge of
what attributes are changed, but that knowledge isn't preserved and
passed into CatalogTupleUpdate(), rather it is re-discovered in
HeapDetermineColumnsInfo(). That's how catalog tuples are able to take
the HOT path, they re-use that same logic. There is a fix for that [5]
too (and I really hope that lands in master ASAP), but that's not the
subject of this thread.

HeapDetermineColumnsInfo() also helps inform a few other decisions in
heap_update(), but these have to happen after taking the buffer lock and
are very heap-specific, namely:

1. Do either the replica identity key attributes overlap with the
modified index attributes or do they need to be stored externally, this
is passed on to ExtractReplicaIdentity() to find out if we must augment
the WAL log or not.

2. Are there any modified indexed attributes that intersect with the
primary keys of the relation, if not lower the lock mode to enable
multixact to work.

HeapDetermineColumnsInfo() also takes a pragmatic approach to testing
for equality when looking for modified indexed attributes, it uses
datumIsEqual() which boils down to a simple memcmp() of the before/after
HeapTuple datum. This is fine in most cases, but limits the scope of
what can be HOT.

Interestingly, this requirement of binary equality has leaked into other
parts of the code, namely nbtree's deduplication of TIDs on page split.
That code uses binary equality as well. A nbtree index with collation
for case insensitive must store both "A" and "a" despite those being
type-equal because they are not binary equivalent. More on this later.

At this point, I had my sights set on HeapDetermineColumnsInfo(). I
felt that what it was doing should move into the executor, well as much
of that work as possible, and outside of the buffer lock. This would
also open the door for removal of redundant code. My thought was that
the table AM update API should have an additional argument, the
"modified indexed attributes" or "mix_attrs", passed in.

So, here we are at the door of v19... let's begin.

0001 - Reorganize heap update logic

This is preparatory work for the larger goal in that heap_update()
serves two masters: CatalogTupleUpdate()/simple_heap_update() and
heap_tuple_update() and in reality, they were different but needed most
of the same logic that happens at the start of heap_update(). This
patch splits that logic out and moves it into heap_tuple_update() and
simple_heap_update(). Functionally nothing changes. That's the meat of
this patch.Reorganize heap update logic

0002 - Track changed indexed columns in the executor during UPDATEs

This is the first core set of changes, it doesn't expand HOT updates but
it does restructure where HeapDetermineColumnsInfo()'s core work happens.

A new function ExecCheckIndexedAttrsForChanges() in nodeModifyTable.c is
now responsible for checking for changes between Datum in the old/new
TupleTableSlots. This is different from before in that we're not
checking the new HeapTuple Datum verses the HeapTuple we read from the
buffer page while holding the lock on that page.

An update starts off by reading the existing tuple using the table AM.
Then a new updated tuple is created as the set of changes to the old.
Then the new TupleTableSlot is the combination of the existing one we
just read and the changes we just recorded. So, in the executor before
calling into the table AM's update function we have a pin on the buffer
and the before/after TupleTableSlots for this update. So, I've put the
call to my new ExecCheckIndexedAttrsForChanges() function just before
calling table_tuple_update() and I've added the "mix_attrs" into that
call which get passed on to the heap in heap_tuple_update() and then
heap_update() and all is well.

Why is this safe? The way I read heap_update() is that it has always
historically had code to deal with cases where the tuple is concurrently
updated and react accordingly thanks to HeapTupleSatisfiesUpdate() which
remains where it was in heap_update(). Visibility checks happened when
we first read the tuple to form the updated tuple and later in
heap_update() when we call HeapTupleSatisfiesVisibility() to check for
transaction-snapshot mode RI updates.

So, this new update path from the executor into the table AM seems to me
to be okay and almost functionally equivalent. But there is one big
change to discuss before moving to the simple_heap_update() path.

In nodeModifyTable.c tts_attr_equal() replaces heap_attr_equal()
changing the test for equality when calling into heap_tuple_update().
In the past we used datumIsEqual(), essentially a binary comparison
using memcmp(), now the comparison code in tts_attr_equal uses
type-specific equality function when available and falls back to
datumIsEqual() when not.

The other parts of HeapDetermineColumnsInfo() remain in the code, but
they still happen within the simple_heap_update() and
heap_tuple_update() code. That's where you'll find that after the
buffer is locked we do (1) and (2) from above. This keeps the
heap-specific work in the heap, but we've moved some work up into the
executor and outside the buffer lock.

While this accomplishes the goal of removing HeapDetermineColumnsInfo()
from heap_update() on the path that uses the table AM API
heap_tuple_update() but it doesn't on the simple_heap_update() path.
That remains the same as it was in the previous patch. Ideally, my
patch to restructure how catalog tuples are updated [5] is committed and
we can fully remove HeapDetermineColumnsInfo() and likely speed up all
catalog updates in the process. That's what motivated [5], please take
a look, it required a huge number of changes so I thought it deserved a
life/thread of its own.

Finally, there is the ExecSimpleRelationUpdate() path and
slot_modify_data(). On this path we know what attributes are being
updated, so we just check to see if they changed and then intersect that
with the set of indexed attributes and we have our modified indexed
attributes set to pass into simple_heap_update().

0003 - Replace index_unchanged_by_update with ri_ChangedIndexedCols

This patch removes the function index_unchanged_by_update() in
execIndexing.c and simply re-uses the modified indexed attributes that
we've stashed away in ResultRelInfo as ri_ChangedIndexedCols. This
provides a hint when calling into the index AM's index_insert() function
indicating if the UPDATE was without logical change to the data or not.
We've done that check, we don't need to do it again.

0004 - Enable HOT updates for expression and partial indexes

This finally gets us back to where this project started, but on much
more firm ground than before because we're not going to self-deadlock.
The idea has grown from a small function into something larger, but only
out of necessity.

In this patch I add ExecWhichIndexesRequireUpdates() in execIndexing.c
which implements (c) finding the set of attributes that force new index
updates. This set can be very different from the modified indexed
attributes. We know that some attributes are not equal to their
previous versions, but does that mean that the index that references
that attribute needs a new index tuple? It may, or it may not. Here's
the comment on that function that explains:

/*
* ExecWhichIndexesRequireUpdates
*
* Determine which indexes need updating given modified indexed attributes.
* This function is a companion to ExecCheckIndexedAttrsForChanges().
On the
* surface, they appear similar but they are doing two very different things.
*
* For a standard index on a set of attributes this is the intersection of
* the mix_attrs and the index attrs (key, expression, but not predicate).
*
* For expression indexes and indexes which implement the amcomparedatums()
* index AM API we'll need to form index datum and compare each
attribute to
* see if any actually changed.
*
* For expression indexes the result of the expression might not change
at all,
* this is common with JSONB columns which require expression indexes
and where
* it is commonplace to index a field within a document and have updates that
* generally don't update that field.
*
* Partial indexes won't trigger index tuples when the old/new tuples
are both
* outside of the predicate range.
*
* For nbtree the amcomparedatums() API is critical as it requires that key
* attributes are equal when they memcmp(), which might not be the case when
* using type-specific comparison or factoring in collation which might make
* an index case insensitive.
*
* All of this is to say that the goal is for the executor to know,
ahead of
* calling into the table AM for the update and before calling into the index
* AM for inserting new index tuples, which attributes at a minimum will
* necessitate a new index tuple.
*
...
*/

Whereas before we were comparing Datum in the table relation, now we're
comparing Datum in the index relation. Index AMs are free to store what
they want, we need to know if what's changed and referenced by the index
means that the index needs a new tuple or not.

In the case of a JSONB expression index (or any expression index) the
expression is evaluated when calling FormIndexDatum(). The result of
the expression is the Datum in the values/isnull arrays. Then we need
to compare them to see if they changed. This can be done using the same
tts_attr_equal() function, but with the attribute desc of the index, not
table relation.

In some cases that's not enough, for instance nbtree and it's more
stringent equality requirements. For that reason (and another coming up
in a second) we need a new optional index AM API, amcomparedatums().
Indexes implementing this function have the ability to compare two
Datums for equality in what ever way they want. For nbtree, that's a
binary comparison.

The new case here that this also supports is for indexes like GIN/RUM
where there is an opclass that can extract zero or more pieces from the
attribute and form multiple index entries when needed. Those extracted
pieces might have odd equality rules as well. This opens the door for
index implementations to provide information that will help inform heap
when making HOT decisions that didn't exist before.

Take for example the Linux Foundation's DocumentDB [6] project [7] which
aims to be an open source alternative to MongoDB built on top of
PostgreSQL. One of the pieces of that project is its "extended RUM"
index AM implementation. This index extracts portions of the BSON
documents stored and forms index keys from that. Here is an example:

CREATE INDEX documents_rum_index_14002
ON documentdb_data.documents_14001
USING documentdb_rum
(document bson_rum_single_path_ops (path=a, iswildcard='true', tl='2699'))

An index on the "document" column which uses the
"bson_rum_single_path_ops" opclass to extract a portion of the BSON
document that matches "path=a".

For this to potentially be stored by heap as a HOT update we need to
know that what changed within that document didn't intersect with the
"path=a" and if it did that the new value(s) were all equal to the old
values. Equality in DocumentDB isn't what you think, it's quite odd and
specific to BSON and rules defined by MongoDB so it's important to allow
the index AM to execute what's necessary for it's use case.

For the more common JSONB use case we can now have heap make an informed
decision about HOT updates after evaluating the expression. For
GIN/RUM/etc. implementation there is a path to HOT. For nbtree there is
a way to maintain its requirements.

Is there a cost to all this? Yes, of course. There is net new work
being done on some paths. IndexFormDatum() will be called more
frequently and sometimes twice for the same thing. This could be
improved, there might be a way to cache that information. But to have
tests of old/new we'll have to do that work at least twice.

Is there a benefit? Yes, of course. Some redundant code paths are gone
and in the end we've increased the number of cases where HOT updates are
a possibility. This especially helps out users of JSONB, but not only them.

What's left undone?

* I need to check code coverage so that I might
* create tests covering all the new cases
* update the README.HOT documentation, wiki, etc.
* performance...

For performance I'd like to examine some worst cases as in lots of
indexes that have a lot of new code to exercise and all but the last
index would allow for a HOT update. That should represent the maximum
amount of new overhead for this code. Then, the other side of the
equation, how much does this help JSONB? I think that is something to
measure in terms of TPS as well as "bloat" avoided and time spent vacuuming.

I also don't like the TU_Updating enum, I think it's a leaky abstraction
and really pointless now. I'd like to remove it in favor of the bitmap
of attributes known to force index tuples to be inserted. Maybe I'll
layer that into the next set.

In the end, this is a lot of work and I believe that it moves the ball
forward. I'll have more metrics on that soon I hope, but I wanted to
get the conversation re-started ASAP as we're late in the v19 cycle.

Finally, the "elephant" in the room (ha!) is PHOT[9]/WARM[10][11]. Yes,
some of this work does help make a solution to allow HOT updates when
only updating a subset of indexes closer to reality, I'd be lying not to
mention that here, it is a big part of my overall plan for my next year
and (I hope) v20 and I need this work to get there. Specifically, PHOT
is in part based on HeapDetermineColumnsInfo() and I needed to make that
more truthful for it to work.

I hope you see the value in this work and will partner with me to
finalize it and get it into master.

best.

-greg

[1] https://www.postgresql.org/message-id/flat/4d9928ee-a9e6-15f9-9c82-5981f13ffca6%40postgrespro.ru
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c203d6cf8
[3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=05f84605dbeb9cf8279a157234b24bbb706c5256
[4] https://www.postgresql.org/message-id/2877.1541538838%40sss.pgh.pa.us
[5] https://www.postgresql.org/message-id/flat/2C5C8B8D-8B36-4547-88EB-BDCF9A7C8D94@greg.burd.me
[6] https://www.linuxfoundation.org/press/linux-foundation-welcomes-documentdb-to-advance-open-developer-first-nosql-innovation
[7] https://github.com/documentdb/documentdb
[8] https://github.com/documentdb/documentdb/tree/main/pg_documentdb_extended_rum
[9] https://www.postgresql.org/message-id/flat/2ECBBCA0-4D8D-4841-8872-4A5BBDC063D2%40amazon.com
[10] https://www.postgresql.org/message-id/flat/CABOikdMop5Rb_RnS2xFdAXMZGSqcJ-P-BY2ruMd%2BbuUkJ4iDPw%40mail.gmail.com
[11]
https://www.postgresql.org/message-id/flat/CABOikdMNy6yowA%2BwTGK9RVd8iw%2BCzqHeQSGpW7Yka_4RSZ_LOQ%40mail.gmail.com