Thread
-
Re: making tid and HOTness of UPDATE available to logical decoding plugins
Matthias van de Meent <boekewurm+postgres@gmail.com> — 2025-12-10T22:37:54Z
On Tue, 9 Dec 2025 at 21:08, Hannu Krosing <hannuk@google.com> wrote: > > On Tue, Dec 9, 2025 at 7:41 PM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > > > On Mon, 8 Dec 2025 at 16:25, Hannu Krosing <hannuk@google.com> wrote: > > > Suggestion to "get the user specify their own identity column" sounds > > > good in purely theoretical sense, but can have unacceptable overheads > > > in practice, especially if a large table started out - often for a > > > good reason - without a PK or other suittable identiuty column > > > > I don't think that this user problem is something we should be making > > our problem; at least not like this. > > I assume you don't have to deal with real-world users much :) At least none of the kind that don't pay a penny and expect the world in return, no :) > And we had similar objections for all CONCURRENTLY commands - why on > earth would anyone want a slower version of INDEX With CONCURRENTLY, you're reducing the work done whilst holding heavy and expensive locks, and that is a real concern in any database that wants to do maintenance without significant downtime. But in this case there are roads to get your maintenance (adding a PK) done without significant downtime, and that without exposing non-logical data. I don't see the point in bloating the logical decoding system for this. We might be able to add facilities that add a primary key CONCURRENTLY, but that's not what's being proposed. > I see again and again real user problems taking ages to address, for > example when I sent a patch it possible to run move than one VACUUM > concurrently and actually clean up tables in 2005 it lingered for > about a year before I got Alvaro to put it in at PostgreSQL > Anniversary Summit in July 2006. Yeah, you're not alone in those pain points. I suspect nearly everyone who frequents this list has had similar experiences. > > > As I said before, the row id does not have to be stable across row > > > versions for logical replication, it just has to be able to track the > > > identity of "the row that is changed or deleted" which ctid already > > > does very well. > > > > Yes, I'm aware. But a TID doesn't really mean anything if you don't > > have indexes - its value may change arbitrarily between queries. > > Anything may happen to a tuple's CTID, long as 1.) within a statement, > > the CTID doesn't change, and 2.) indexed TIDs won't have changes to > > the attributes represented in indexes. > > Can't parse that :( > > Can you perhaps elaborate ? There are practically two mechanisms in PostgreSQL that hold references to TIDs which prevent housekeeping jobs in the AM from reassigning TIDs at will: 1.) indexes, which reference specific TIDs (or the blknos of TIDs for amsummarizing indexes) in the table that have a specific set of unchanging attribute values for the indexed attributes (in HOT terms, the HOT root), and 2.) active scans, which use the TIDs for updates, deletes, locking, and other tuple-level operations. 2.a) catcache, whilst not exactly an active scan, also holds TIDs for those same reasons. I grouped it under (2) because it's not _that_ different, and because it exclusively uses heapam it's otherwise irrelevant to the points below about how other AMs would expose TIDs. But outside any single query, catcache internals, and indexes, we don't have any internal expectation of TID stability. If you had an AM that could guarantee that there's no index that references the tuple, and no scan that'll need to refer to that tuple by its TID, then the TID of the tuple could well be reassigned at will. <tangent> I've actually tried to implement something similar at some point by making scans refer to the HOT root instead of the live tuple's TID, but that failed due to unforeseen complexities. In principle, MVCC scans could refer to only the HOT root, and ignore non-visible tuples of that HOT chain. Pruning would then be allowed to move HOT tuples' ItemIds around in the page's line pointer array (because scans only refer to the root, and can find the original tuples again using visibility checks), enabling some more defragmentation in the heap page. Sadly that doesn't work for non-MVCC scans, as they (may) need to access all of the tuples in the HOT chain and not just the current visible tuple; thus requiring more work than just this. But theoretically, that problem can also be solved. </tangent> > > > You were very worried about extra WAL usage above, but seem to be > > > oblivious of huge resource usage of REPLICA IDENTITY FULL (when used > > > as a poor man's row id and not because the old row data is needed for > > > some other reason) > > > > I don't generally suggest people run their systems with > > wal_level=logical and try to avoid those systems that have, exactly > > because of issues like the significant additional overhead involved > > with logging page-local update records. > > But real-world PostgreSQL users do use logical replication and logical > decoding for CDC all the time. > > And logical replication is currently the only way to do no-downtime > major version upgrades. Yep. Hence my point - add a PK column. Then you don't have the overhead of REPLICA IDENTITY FULL, nor any new requirements during table rewrites. > > > When you at some point discover the need for logical replication of a > > > large table inside a 24/7 production database where you do have > > > occasional updates - or even frequent updates, just not based on > > > unique id - you have currently a few options. > > > > > > 1. add REPLICA IDENTITY FULL > > > - this will double the WAL traffic for updates and usually more > > > than double for DELETEs (could be 1x or 100x) > > > - it can also be REALLY REALLY SLOW to replicate, the worst case > > > requiring 1 sequential scan of the whole table for each UPDATE or > > > DELETE > > > > Yep, that's about expected; if you want performant UPDATE with > > OLTP-style databases you have to index your data. > > People don't always do that for all tables, like logs. So, let me understand this. This hypothetical table has no indexes, because errnoresources. And you want to replicate this data, and want to use TID because it's slow to use FULL identity. But I still don't see how TID will make this meaningfully faster; you still don't have an index to improve the O(tablesize) scans with. And if you're putting an index on it on the remote side, why wouldn't the origin be able to support this index? > And they still may occasionally need to change them, for example > getting a right-to-be-forgotten request and running a delete of all > log records with a specific unindexed attribute. A single sequential > scan is cheap enough to be a good compromise against an all-around > slowdown caused by the index, but if that delete affects 100,000 lines > in a 10TB table you suddenly have 100k sequential scans on the > replica. Again, that's not solved by replicating TIDs when you refuse to add an index. I don't think that the cost of a full row in WAL is that expensive when the replica will do a full table scan regardless of the row ID used. > > > 2. add a primary key column - quite hard to do CONCURRENTLY, will have > > > severe disk and cpu space demands and once it has been added (which > > > could have taken up to a few weeks) it will slow down any inserts. > > > > It's quite possible. Not trivial, but it is a road that many have > > taken. > > Yes, but only because they have no better option than to have a senior > DBA spend a week or two on this. How about a normal DBA or database-aware application developer? Those are also often employed, and can apply the same techniques which have been developed over time. It isn't rocket science, and not even deep database technology anymore. > > And yes, indexing slows down inserts. That's how we make sure > > the index remains correct. > > I know *why* it slows down inserts :). > I'm just saying that there are many cases where you don't want this slowdown. Well, yes. I don't expect anyone would say "yes" if you asked them "do you like the slowdown caused by indexing tuples during insertions and updates?", but they probably meant "improve the performance of index insertion" rather than "remove my indexes". > > > > As I also said in my other mail, adding > > > > ctid to the logical replication system will expose too much internal > > > > information and will turn current logical no-ops into logical > > > > operations; > > > > > > Can you provide an example of this? > > I thought if I should mention CLUSTER and VACUUM FULL in the question > but they seemed so obvious that I left them out. > > When writing I was in the mindset of multi-terabyte 24/7 high-traffic > databases where these things are out of the question anyway. Most PostgreSQL tables are much smaller than the multi-terabyte scale that you think of, and most have much lower traffic. I don't think it's that unreasonable to consider that use case as a valid issue with this patch, and table rewrites are not that uncommon (even if most which are done through ALTER TABLE are accidental nowadays). > > > > possibly even bloating the subscriber by a good deal more > > > > than what the publisher cleaned up. > > > > > > The absolute biggest bloater is REPLICA IDENTITY FULL. > > > > > > The beauty of using REPLICA IDENTITY ROWID is that absolutely *nothing > > > extra needs to be added to WAL*. > > > > Except for CLUSTER and related operations, which now need to log TID mappings. > > I would rather restrict CLUSTER and VACUUM FULL to refuse to run on > tables with REPLICA IDENTITY ROWID and any valid replication slots. So, that's effectively blocked for all tables with REPLICA IDENTITY ROWID, because you can't know which invalid replication slots will be turning valid during the operation, or which slots will get started. I don't think that's a nice approach. > > > > About the patch: > > > > The ItemPointerDatas newly added to ReorderBufferChange.data.tp are > > better stored in {old,new}tuple->t_self, if we're going to store them. > > I thought so too, but then found out that no, because old ctid is > present even when oldtuple is not. It is there even for > wal_level=minimal. We can't (well, shouldn't) do logical decoding for non-logical wal_level's WAL. But more importantly, how can we correctly process an UPDATE when there's no old tuple that was updated? Presumably that means 'the primary key wasn't updated', but that also means that oldtuple must be populated when the TID-based surrogate primary key changes. Alternatively, you could make {old,new}tuple a union {HeapTuple, TID}; with the specific type indicated by a bool that's stored in the 3 alignment bytes after .clear_toast_afterwards. It'd prevent the data union from growing further. > also keeping them separate keeps the overhead minimal when the > decoding plugin does not need them. What do you mean by "keeping them seprate ... does not need them"? > > The size of .data is currently 32 bytes, increasing that by 16 bytes > > is rather wasteful if we already have a more convenient and accurate > > place to store this data. > > Since this is just an in-memory structure I would mainly worry about > going over 64 bytes (x64 cache line, likely also palloc's internal > step) The ReorderBufferChanges are palloc-ed in a SlabContext, which is optimized for (and only allows) allocations of a single size, saving the additional overhead of bucketed sizing. This means that for ReorderBufferChange, every byte (well, MAXALIGN() of bytes) counts, there is no bucket alignment. //aside, all this talk about sizing got me to realize that we're wasting 10% of the struct's size on alignment gaps in the first 4 fields. That's not great either. > > I see even less reasons why logical replication can care about > > HOT-ness of updates (that's mostly an implementation detail of heap; > > logically there is and should be no difference between a normal update > > and a HOT update). > > But, if the decoder really did care about those implementation > > details, it would arguably be better pull that data from the tuple's > > infomasks, as then there wouldn't be a need for the additional boolean > > ReorderBufferChange->data.tp.is_hot_update. > > It is there for efficiency and low overhead. It can be made even more > efficient by givin HOT update its own if() and skipping the second > comparison. What efficiency is there to gain in logical decoding from adding HOT to logical decoding? Logically speaking, there is no distinction between a normal update and one that applied the HOT mechanism. Even if you were to include TIDs as part of the tuples' attributes, HOT doesn't change anything there. Indexes are not logically replicated, and even if they were the replica would have to decide for itself to apply HOT or not. > > Alternatively, move it > > into an alignment gap, because the current placement (after removing > > the TIDs from data.tp) is rather unfortunate and would add 8 bytes to > > the size of this union. > > Have to check, maybe there is a free bit somewhere. There are still 3 bytes of alignment left just after clear_toast_afterwards. > But does it push it over 64-byte boundary ? No (HEAD is at 80B), but without the ItemPointerDatas, and at the current location it would increase the size of ReorderBufferChange by 8B to 88B, which would increase ReorderBufferChange's memory usage by 10%. The ItemPointerDatas in your patch create a gap in which the bool is stored without additional alignment losses, but as I mentioned above those are also a change I'm not particularly happy about. > My main gripe with the union is that there is one struct member after > the union of different-sized things, so changing the lrgest union > member moves that one. Not a big issue, but it would be nice if you > did not have to recompile all extensions that do not need the new > fields. I would be extremely hesitant to run extensions that were compiled for one major PostgreSQL version against a different major PostgreSQL version. There are too many changing internals across versions to realistically expect everything to just work. Kind regards, Matthias van de Meent Databricks (https://www.databricks.com)