Thread

Re: making tid and HOTness of UPDATE available to logical decoding plugins

Matthias van de Meent <boekewurm+postgres@gmail.com> — 2025-12-09T18:41:42Z
On Mon, 8 Dec 2025 at 16:25, Hannu Krosing <hannuk@google.com> wrote:
>
> On Mon, Dec 8, 2025 at 2:46 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > I don't see logical decoding as a viable alternative to any
> > indexing-related workload. Creating and maintaining indexes need
> > strict limits on their resource usage, and logical decoding is unable
> > to give those guarantees:
>
> Are these 'strict limits on their resource usage' defined somewhere so
> I could take a look?

None formalized, but "don't read more than O(tablesize) of base data",
"try to keep temporary disk usage to a minimum", "alloc only what you
need", and  "keep yourself to maintenance_work_mem" are some I'd
consider basic.
LR would read arbitrarily large amounts of data from WAL, allocates at
least 2 chunks of memory per WAL record, has a tendency to spill large
transactions to disk when it runs out of memory, and at times does so
at scales much larger than the subscribed-to tables.

I also have issues with the permission models here --
pg_create_subscription is a separate permissions from pg_maintain for
good reasons, and I personally would like to keep it that way. Let's
not add another way for people to accidentally lose track of a logical
slot.

> > It needs to read WAL, which can be produced
> > in approximately arbitrarily large amounts in any period.
>
> If you read the WAL as it is generated, then you are essentially just
> filtering an in-memory stream, most of the time just jumping to next
> WAL record.

I think the main point here is that you can't guarantee that you're
going to read from memory, so you can't assume that the performance is
going to be comparable to "reading from memory".

> The upsides for logical decoding based collection are
> - is that you do not need to have any extra settings and conditions in
> you index methods to do in-index-method collection

What do you mean by this? AFAIK, we don't have any such "settings or
conditions" inside index AMs for CIC/RIC. Most, if not everything else
is handled outside the AM code, in either generic indexing code, or in
heapam's decision for which snapshot to use in each of the scans.
IF LR at some point was to be used for indexing, it'd probably have at
least a similar (if not larger) footprint in the code.

> - you avoid the hassle of synchronoizing collection starts and stops
> between all active backends

I don't see how you can prevent the synchronization step before the
first heap scan ("collection start"). We need to be certain all
backends see the new index, or we might corrupt the index with too
much/to little data through HOT updates in concurrent workloads.
AFAICT, this can not be prevented with LR.
I also can't think of a correct way to prevent the synchronization
step after the initial index is built ("collection end"). We must wait
for all backends to consider this index for insertions, or some
backend may still produce tuples that aren't inserted into the index.
LR can't help with this either - it can't see into the future and
determine which backends will still insert which tuples into which
tables.

So, which hassle would be avoided specifically?

> - you are collecting in a single process, so no overhead from
> synchronizing between all the backends that capture/log index
> insertions

With LR, all backends still have to synchronize through WAL
insertions. Mihail's STIR index does not use WAL in operations, so
it's safe to say that there is also minimal additional overhead there.

> - you can choose between collecting immediately in a background worker
> and collecting later by re-reading WAL.

Which requires writing a new integration with this system, right?

> The one upside (?) of in-index capture is that it will naturally
> throttle your production workload if capture can not keep up for some
> reason.

The insertion of TIDs into the STIR index is 99 times out of 100 going
to be cheaper than the insertion into the index that's being built. I
don't see why you consider STIR an issue, but not LR.

> > This is further worsened by the index build itself, which will have to write
> > out WAL if the relation it's building on also needs to log WAL, which
> > then also needs to be parsed and decoded by LR.
>
> Not "parsed and decoded" - just read the (database, tablespace,
> relation) triplet, decide "not for me" and jump to the next record.

AFAIK we have yet to move record filtering ahead of the point where we
allocate the XLogRecord (and DecodedXLogRecord) that we're retrieving
from raw WAL pages. We don't actually skip the data and variable
headers sections when the record doesn't have rm_decode; we don't
actually skip the data sections when the page headers indicate the WAL
record only modified pages in a different database.
So in my view, yes, we do parse and decode every WAL record in LR;
even if that doesn't always involve calling RMgrData->rm_decode.

> I am working on this specifically because of huge databases with heavy
> production workloads which by definition generate a huge amount of
> WAL.
> One of the top goals is to avoid REPLICA IDENTITY FULL which can
> hugely bloat amount of WAL generated .
> The fact that you can turn on REPLICA IDENTITY ROWID with no extra
> cost on write side is a bonus

There is an extra cost for maintenance operations; as you can see
below; and that ignores the increased overhead for users of pgoutput.

> (you can even replace the small overhead
> of writing th eprimary key recorde by turning on rowid)

You can't use rowid for Primary Keys, because you'd create referential
update loops for foreign keys, or (with ON UPDATE RESTRICT) you'd be
unable to update the rows at all.

> > And lastly, we want
> > CIC/RIC to work on all indexes, not just those on logged relations. "A
> > table with WAL-logging enabled" cannot be a requirement for CIC.
>
> We can use the easy, straightforward collection method - logical
> decoding - when available, and fall back to the complicated method
> (collecting inside index access method) or the resource-intensive
> method (re-scanning the whole table) if logical decoding is
> unavailable.

I prefer a single method that always works, is testable, and that has
a guaranteed limit, over two that are only excercised in certain
paths, one of which with a tendency to keep WAL on disk.

I think it's much, much easier to reason about the performance profile
and impact of just one more index.

> > I understand how you got to this point, but the right solution here
> > still is to get the user to specify their own identity column that is
> > stable across operations, and not to use the volatile and
> > guaranteed-unstable ctid.
>
> Suggestion to "get the user specify their own identity column" sounds
> good in purely theoretical sense, but can have unacceptable overheads
> in practice, especially if a large table started out - often for a
> good reason - without a PK or other suittable identiuty column

I don't think that this user problem is something we should be making
our problem; at least not like this.

> As I said before, the row id does not have to be stable across row
> versions for logical replication, it just has to be able to track the
> identity of "the row that is changed or deleted" which ctid already
> does very well.

Yes, I'm aware. But a TID doesn't really mean anything if you don't
have indexes - its value may change arbitrarily between queries.
Anything may happen to a tuple's CTID, long as 1.) within a statement,
the CTID doesn't change, and 2.) indexed TIDs won't have changes to
the attributes represented in indexes.

> You were very worried about extra WAL usage above, but seem to be
> oblivious of huge resource usage of REPLICA IDENTITY FULL (when used
> as a poor man's row id and not because the old row data is needed for
> some other reason)

I don't generally suggest people run their systems with
wal_level=logical and try to avoid those systems that have, exactly
because of issues like the significant additional overhead involved
with logging page-local update records.

> When you at some point discover the need for logical replication of a
> large table inside a 24/7 production database where you do have
> occasional updates - or even frequent updates, just not based on
> unique id -  you have currently a few options.
>
> 1. add REPLICA IDENTITY FULL
>    - this will double the WAL traffic for updates and usually more
> than double for DELETEs (could be 1x or 100x)
>    - it can also be REALLY REALLY SLOW to replicate, the worst case
> requiring 1 sequential scan of the whole table for each UPDATE or
> DELETE

Yep, that's about expected; if you want performant UPDATE with
OLTP-style databases you have to index your data.

> 2. add a primary key column - quite hard to do CONCURRENTLY, will have
> severe disk and cpu space demands and once it has been added (which
> could have taken up to a few weeks) it will slow down any inserts.

It's quite possible. Not trivial, but it is a road that many have
taken. And yes, indexing slows down inserts. That's how we make sure
the index remains correct.

> To reiterate - "stable across operations" is not at a requirement for
> logical replication, tuple id is "stable enough" for streaming
> replication changes. Think of it as somebody changing the primary key
> column at each update - it seems weird, but the updated PK still
> uniquely identifies the tuple for the next operation.

Yeah, I'm aware of that part.

> > As I also said in my other mail, adding
> > ctid to the logical replication system will expose too much internal
> > information and will turn current logical no-ops into logical
> > operations;
>
> Can you provide an example of this?

CLUSTER itself doesn't modify any logical columns, and would thus
normally have been no-op from a LR PoV; allowing us to ignore it in LR
of user tables.
By adding CTID as new logical column, however, you force effectively
CLUSTER to issue a logical UPDATE record for all tuples that get a new
CTID in the table, or every UPDATE afterward would be replicated with
the wrong CTID from the subscriber's point of view.

Example:
A table with live CTIDs A: (0, 1), B: (0, 2)
This table gets CLUSTERed: the CTIDs on disk in the new table are A:
(0, 0), B: (0, 1).

Note how row B now has the CTID that row A had before clustering; A
has a previously unused CTID; and B's old CTID is now orphaned.
Assuming LR didn't get an update about this CLUSTER changing the CTIDs
of logical rows, the replica will be desynced from the primary.
An UPDATE on row B will still get recorded and replicated, but because
the CTID is now (0, 1) the remote thinks the update was for row A
(which it had recorded previously as the one that last had CTID (0,
1)). Updates for row A on the primary will get replicated and fail to
find a matching row, because its new CTID (0, 0) wasn't an ID that was
replicated yet, and there's now an additional row B that's been
orphaned on the replica.

> > possibly even bloating the subscriber by a good deal more
> > than what the publisher cleaned up.
>
> The absolute biggest bloater is REPLICA IDENTITY FULL.
>
> The beauty of using REPLICA IDENTITY ROWID is that absolutely *nothing
> extra needs to be added to WAL*.

Except for CLUSTER and related operations, which now need to log TID mappings.

> The tids are already there even for physical replication (and even for
> wal_level=minimal) as they are required even for crash recovery. All
> my patch to core did is exposing them to logical decodoing mechanism.
> You do not have to use them, nothing changes for decoding plugins not
> using them.

About the patch:

The ItemPointerDatas newly added to ReorderBufferChange.data.tp are
better stored in {old,new}tuple->t_self, if we're going to store them.
The size of .data is currently 32 bytes, increasing that by 16 bytes
is rather wasteful if we already have a more convenient and accurate
place to store this data.

I see even less reasons why logical replication can care about
HOT-ness of updates (that's mostly an implementation detail of heap;
logically there is and should be no difference between a normal update
and a HOT update).
But, if the decoder really did care about those implementation
details, it would arguably be better pull that data from the tuple's
infomasks, as then there wouldn't be a need for the additional boolean
ReorderBufferChange->data.tp.is_hot_update. Alternatively, move it
into an alignment gap, because the current placement (after removing
the TIDs from data.tp) is rather unfortunate and would add 8 bytes to
the size of this union.


Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)