Thread

Re: making tid and HOTness of UPDATE available to logical decoding plugins

Matthias van de Meent <boekewurm+postgres@gmail.com> — 2025-12-10T22:37:54Z
On Tue, 9 Dec 2025 at 21:08, Hannu Krosing <hannuk@google.com> wrote:
>
> On Tue, Dec 9, 2025 at 7:41 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> >
> > On Mon, 8 Dec 2025 at 16:25, Hannu Krosing <hannuk@google.com> wrote:
> > > Suggestion to "get the user specify their own identity column" sounds
> > > good in purely theoretical sense, but can have unacceptable overheads
> > > in practice, especially if a large table started out - often for a
> > > good reason - without a PK or other suittable identiuty column
> >
> > I don't think that this user problem is something we should be making
> > our problem; at least not like this.
>
> I assume you don't have to deal with real-world users much :)

At least none of the kind that don't pay a penny and expect the world
in return, no :)

> And we had similar objections for all CONCURRENTLY commands - why on
> earth would anyone want a slower version of INDEX

With CONCURRENTLY, you're reducing the work done whilst holding heavy
and expensive locks, and that is a real concern in any database that
wants to do maintenance without significant downtime.

But in this case there are roads to get your maintenance (adding a PK)
done without significant downtime, and that without exposing
non-logical data. I don't see the point in bloating the logical
decoding system for this. We might be able to add facilities that add
a primary key CONCURRENTLY, but that's not what's being proposed.

> I see again and again real user problems taking ages to address, for
> example when I sent a patch it possible to run move than one VACUUM
> concurrently and actually clean up tables in 2005 it lingered for
> about a year before I got Alvaro to put it in at PostgreSQL
> Anniversary Summit in July 2006.

Yeah, you're not alone in those pain points. I suspect nearly everyone
who frequents this list has had similar experiences.

> > > As I said before, the row id does not have to be stable across row
> > > versions for logical replication, it just has to be able to track the
> > > identity of "the row that is changed or deleted" which ctid already
> > > does very well.
> >
> > Yes, I'm aware. But a TID doesn't really mean anything if you don't
> > have indexes - its value may change arbitrarily between queries.
> > Anything may happen to a tuple's CTID, long as 1.) within a statement,
> > the CTID doesn't change, and 2.) indexed TIDs won't have changes to
> > the attributes represented in indexes.
>
> Can't parse that :(
>
> Can you perhaps elaborate ?

There are practically two mechanisms in PostgreSQL that hold
references to TIDs which prevent housekeeping jobs in the AM from
reassigning TIDs at will:
1.) indexes, which reference specific TIDs (or the blknos of TIDs for
amsummarizing indexes) in the table that have a specific set of
unchanging attribute values for the indexed attributes (in HOT terms,
the HOT root), and
2.) active scans, which use the TIDs for updates, deletes, locking,
and other tuple-level operations.
2.a) catcache, whilst not exactly an active scan, also holds TIDs for
those same reasons. I grouped it under (2) because it's not _that_
different, and because it exclusively uses heapam it's otherwise
irrelevant to the points below about how other AMs would expose TIDs.

But outside any single query, catcache internals, and indexes, we
don't have any internal expectation of TID stability. If you had an AM
that could guarantee that there's no index that references the tuple,
and no scan that'll need to refer to that tuple by its TID, then the
TID of the tuple could well be reassigned at will.

<tangent>
I've actually tried to implement something similar at some point by
making scans refer to the HOT root instead of the live tuple's TID,
but that failed due to unforeseen complexities.
In principle, MVCC scans could refer to only the HOT root, and ignore
non-visible tuples of that HOT chain. Pruning would then be allowed to
move HOT tuples' ItemIds around in the page's line pointer array
(because scans only refer to the root, and can find the original
tuples again using visibility checks), enabling some more
defragmentation in the heap page.
Sadly that doesn't work for non-MVCC scans, as they (may) need to
access all of the tuples in the HOT chain and not just the current
visible tuple; thus requiring more work than just this. But
theoretically, that problem can also be solved.
</tangent>

> > > You were very worried about extra WAL usage above, but seem to be
> > > oblivious of huge resource usage of REPLICA IDENTITY FULL (when used
> > > as a poor man's row id and not because the old row data is needed for
> > > some other reason)
> >
> > I don't generally suggest people run their systems with
> > wal_level=logical and try to avoid those systems that have, exactly
> > because of issues like the significant additional overhead involved
> > with logging page-local update records.
>
> But real-world PostgreSQL users do use logical replication and logical
> decoding for CDC all the time.
>
> And logical replication is currently the only way to do no-downtime
> major version upgrades.

Yep. Hence my point - add a PK column. Then you don't have the
overhead of REPLICA IDENTITY FULL, nor any new requirements during
table rewrites.

> > > When you at some point discover the need for logical replication of a
> > > large table inside a 24/7 production database where you do have
> > > occasional updates - or even frequent updates, just not based on
> > > unique id -  you have currently a few options.
> > >
> > > 1. add REPLICA IDENTITY FULL
> > >    - this will double the WAL traffic for updates and usually more
> > > than double for DELETEs (could be 1x or 100x)
> > >    - it can also be REALLY REALLY SLOW to replicate, the worst case
> > > requiring 1 sequential scan of the whole table for each UPDATE or
> > > DELETE
> >
> > Yep, that's about expected; if you want performant UPDATE with
> > OLTP-style databases you have to index your data.
>
> People don't always do that for all tables, like logs.

So, let me understand this.

This hypothetical table has no indexes, because errnoresources. And
you want to replicate this data, and want to use TID because it's slow
to use FULL identity.

But I still don't see how TID will make this meaningfully faster; you
still don't have an index to improve the O(tablesize) scans with. And
if you're putting an index on it on the remote side, why wouldn't the
origin be able to support this index?

> And they still may occasionally need to change them, for example
> getting a right-to-be-forgotten request and running a delete of all
> log records with a specific unindexed attribute. A single sequential
> scan is cheap enough to be a good compromise against an all-around
> slowdown caused by the index, but if that delete affects 100,000 lines
> in a 10TB table you  suddenly have 100k sequential scans on the
> replica.

Again, that's not solved by replicating TIDs when you refuse to add an
index. I don't think that the cost of a full row in WAL is that
expensive when the replica will do a full table scan regardless of the
row ID used.

> > > 2. add a primary key column - quite hard to do CONCURRENTLY, will have
> > > severe disk and cpu space demands and once it has been added (which
> > > could have taken up to a few weeks) it will slow down any inserts.
> >
> > It's quite possible. Not trivial, but it is a road that many have
> > taken.
>
> Yes, but only because they have no better option than to have a senior
> DBA spend a week or two on this.

How about a normal DBA or database-aware application developer? Those
are also often employed, and can apply the same techniques which have
been developed over time. It isn't rocket science, and not even deep
database technology anymore.

> > And yes, indexing slows down inserts. That's how we make sure
> > the index remains correct.
>
> I know *why* it slows down inserts :).
> I'm just saying that there are many  cases where you don't want this slowdown.

Well, yes. I don't expect anyone would say "yes" if you asked them "do
you like the slowdown caused by indexing tuples during insertions and
updates?", but they probably meant "improve the performance of index
insertion" rather than "remove my indexes".

> > > > As I also said in my other mail, adding
> > > > ctid to the logical replication system will expose too much internal
> > > > information and will turn current logical no-ops into logical
> > > > operations;
> > >
> > > Can you provide an example of this?
>
> I thought if I should mention CLUSTER and VACUUM FULL in the question
> but they seemed so obvious that I left them out.
>
> When writing I was in the mindset of multi-terabyte 24/7 high-traffic
> databases where these things are out of the question anyway.

Most PostgreSQL tables are much smaller than the multi-terabyte scale
that you think of, and most have much lower traffic. I don't think
it's that unreasonable to consider that use case as a valid issue with
this patch, and table rewrites are not that uncommon (even if most
which are done through ALTER TABLE are accidental nowadays).

> > > > possibly even bloating the subscriber by a good deal more
> > > > than what the publisher cleaned up.
> > >
> > > The absolute biggest bloater is REPLICA IDENTITY FULL.
> > >
> > > The beauty of using REPLICA IDENTITY ROWID is that absolutely *nothing
> > > extra needs to be added to WAL*.
> >
> > Except for CLUSTER and related operations, which now need to log TID mappings.
>
> I would rather restrict CLUSTER and VACUUM FULL to refuse to run on
> tables with REPLICA IDENTITY ROWID and any  valid replication slots.

So, that's effectively blocked for all tables with REPLICA IDENTITY
ROWID, because you can't know which invalid replication slots will be
turning valid during the operation, or which slots will get started. I
don't think that's a nice approach.

> >
> > About the patch:
> >
> > The ItemPointerDatas newly added to ReorderBufferChange.data.tp are
> > better stored in {old,new}tuple->t_self, if we're going to store them.
>
> I thought so too, but then found out that no, because old ctid is
> present even when oldtuple is not. It is there even for
> wal_level=minimal.

We can't (well, shouldn't) do logical decoding for non-logical wal_level's WAL.
But more importantly, how can we correctly process an UPDATE when
there's no old tuple that was updated? Presumably that means 'the
primary key wasn't updated', but that also means that oldtuple must be
populated when the TID-based surrogate primary key changes.

Alternatively, you could make {old,new}tuple a union {HeapTuple, TID};
with the specific type indicated by a bool that's stored in the 3
alignment bytes after .clear_toast_afterwards. It'd prevent the data
union from growing further.

> also keeping them separate keeps the overhead minimal when the
> decoding plugin does not need them.

What do you mean by "keeping them seprate ... does not need them"?

> > The size of .data is currently 32 bytes, increasing that by 16 bytes
> > is rather wasteful if we already have a more convenient and accurate
> > place to store this data.
>
> Since this is just an in-memory structure I would mainly worry about
> going over 64 bytes (x64 cache line, likely also palloc's internal
> step)

The ReorderBufferChanges are palloc-ed in a SlabContext, which is
optimized for (and only allows) allocations of a single size, saving
the additional overhead of bucketed sizing.
This means that for ReorderBufferChange, every byte (well, MAXALIGN()
of bytes) counts, there is no bucket alignment.

//aside, all this talk about sizing got me to realize that we're
wasting 10% of the struct's size on alignment gaps in the first 4
fields. That's not great either.

> > I see even less reasons why logical replication can care about
> > HOT-ness of updates (that's mostly an implementation detail of heap;
> > logically there is and should be no difference between a normal update
> > and a HOT update).
> > But, if the decoder really did care about those implementation
> > details, it would arguably be better pull that data from the tuple's
> > infomasks, as then there wouldn't be a need for the additional boolean
> > ReorderBufferChange->data.tp.is_hot_update.
>
> It is there for efficiency and low overhead. It can be made even more
> efficient by givin HOT update its own if() and skipping the second
> comparison.

What efficiency is there to gain in logical decoding from adding HOT
to logical decoding? Logically speaking, there is no distinction
between a normal update and one that applied the HOT mechanism. Even
if you were to include TIDs as part of the tuples' attributes, HOT
doesn't change anything there. Indexes are not logically replicated,
and even if they were the replica would have to decide for itself to
apply HOT or not.

> > Alternatively, move it
> > into an alignment gap, because the current placement (after removing
> > the TIDs from data.tp) is rather unfortunate and would add 8 bytes to
> > the size of this union.
>
> Have to check, maybe there is a free bit somewhere.

There are still 3 bytes of alignment left just after clear_toast_afterwards.

> But does it push it over 64-byte boundary ?

No (HEAD is at 80B), but without the ItemPointerDatas, and at the
current location it would increase the size of ReorderBufferChange by
8B to 88B, which would increase ReorderBufferChange's memory usage by
10%. The ItemPointerDatas in your patch create a gap in which the bool
is stored without additional alignment losses, but as I mentioned
above those are also a change I'm not particularly happy about.

> My main gripe with the union is that there is one struct member after
> the union of different-sized things, so changing the lrgest union
> member moves that one. Not a big issue, but it would be nice if you
> did not have to recompile all extensions that do not need the new
> fields.

I would be extremely hesitant to run extensions that were compiled for
one major PostgreSQL version against a different major PostgreSQL
version. There are too many changing internals across versions to
realistically expect everything to just work.

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)