Thread
-
Re: making tid and HOTness of UPDATE available to logical decoding plugins
Matthias van de Meent <boekewurm+postgres@gmail.com> — 2025-12-09T18:41:42Z
On Mon, 8 Dec 2025 at 16:25, Hannu Krosing <hannuk@google.com> wrote: > > On Mon, Dec 8, 2025 at 2:46 PM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > I don't see logical decoding as a viable alternative to any > > indexing-related workload. Creating and maintaining indexes need > > strict limits on their resource usage, and logical decoding is unable > > to give those guarantees: > > Are these 'strict limits on their resource usage' defined somewhere so > I could take a look? None formalized, but "don't read more than O(tablesize) of base data", "try to keep temporary disk usage to a minimum", "alloc only what you need", and "keep yourself to maintenance_work_mem" are some I'd consider basic. LR would read arbitrarily large amounts of data from WAL, allocates at least 2 chunks of memory per WAL record, has a tendency to spill large transactions to disk when it runs out of memory, and at times does so at scales much larger than the subscribed-to tables. I also have issues with the permission models here -- pg_create_subscription is a separate permissions from pg_maintain for good reasons, and I personally would like to keep it that way. Let's not add another way for people to accidentally lose track of a logical slot. > > It needs to read WAL, which can be produced > > in approximately arbitrarily large amounts in any period. > > If you read the WAL as it is generated, then you are essentially just > filtering an in-memory stream, most of the time just jumping to next > WAL record. I think the main point here is that you can't guarantee that you're going to read from memory, so you can't assume that the performance is going to be comparable to "reading from memory". > The upsides for logical decoding based collection are > - is that you do not need to have any extra settings and conditions in > you index methods to do in-index-method collection What do you mean by this? AFAIK, we don't have any such "settings or conditions" inside index AMs for CIC/RIC. Most, if not everything else is handled outside the AM code, in either generic indexing code, or in heapam's decision for which snapshot to use in each of the scans. IF LR at some point was to be used for indexing, it'd probably have at least a similar (if not larger) footprint in the code. > - you avoid the hassle of synchronoizing collection starts and stops > between all active backends I don't see how you can prevent the synchronization step before the first heap scan ("collection start"). We need to be certain all backends see the new index, or we might corrupt the index with too much/to little data through HOT updates in concurrent workloads. AFAICT, this can not be prevented with LR. I also can't think of a correct way to prevent the synchronization step after the initial index is built ("collection end"). We must wait for all backends to consider this index for insertions, or some backend may still produce tuples that aren't inserted into the index. LR can't help with this either - it can't see into the future and determine which backends will still insert which tuples into which tables. So, which hassle would be avoided specifically? > - you are collecting in a single process, so no overhead from > synchronizing between all the backends that capture/log index > insertions With LR, all backends still have to synchronize through WAL insertions. Mihail's STIR index does not use WAL in operations, so it's safe to say that there is also minimal additional overhead there. > - you can choose between collecting immediately in a background worker > and collecting later by re-reading WAL. Which requires writing a new integration with this system, right? > The one upside (?) of in-index capture is that it will naturally > throttle your production workload if capture can not keep up for some > reason. The insertion of TIDs into the STIR index is 99 times out of 100 going to be cheaper than the insertion into the index that's being built. I don't see why you consider STIR an issue, but not LR. > > This is further worsened by the index build itself, which will have to write > > out WAL if the relation it's building on also needs to log WAL, which > > then also needs to be parsed and decoded by LR. > > Not "parsed and decoded" - just read the (database, tablespace, > relation) triplet, decide "not for me" and jump to the next record. AFAIK we have yet to move record filtering ahead of the point where we allocate the XLogRecord (and DecodedXLogRecord) that we're retrieving from raw WAL pages. We don't actually skip the data and variable headers sections when the record doesn't have rm_decode; we don't actually skip the data sections when the page headers indicate the WAL record only modified pages in a different database. So in my view, yes, we do parse and decode every WAL record in LR; even if that doesn't always involve calling RMgrData->rm_decode. > I am working on this specifically because of huge databases with heavy > production workloads which by definition generate a huge amount of > WAL. > One of the top goals is to avoid REPLICA IDENTITY FULL which can > hugely bloat amount of WAL generated . > The fact that you can turn on REPLICA IDENTITY ROWID with no extra > cost on write side is a bonus There is an extra cost for maintenance operations; as you can see below; and that ignores the increased overhead for users of pgoutput. > (you can even replace the small overhead > of writing th eprimary key recorde by turning on rowid) You can't use rowid for Primary Keys, because you'd create referential update loops for foreign keys, or (with ON UPDATE RESTRICT) you'd be unable to update the rows at all. > > And lastly, we want > > CIC/RIC to work on all indexes, not just those on logged relations. "A > > table with WAL-logging enabled" cannot be a requirement for CIC. > > We can use the easy, straightforward collection method - logical > decoding - when available, and fall back to the complicated method > (collecting inside index access method) or the resource-intensive > method (re-scanning the whole table) if logical decoding is > unavailable. I prefer a single method that always works, is testable, and that has a guaranteed limit, over two that are only excercised in certain paths, one of which with a tendency to keep WAL on disk. I think it's much, much easier to reason about the performance profile and impact of just one more index. > > I understand how you got to this point, but the right solution here > > still is to get the user to specify their own identity column that is > > stable across operations, and not to use the volatile and > > guaranteed-unstable ctid. > > Suggestion to "get the user specify their own identity column" sounds > good in purely theoretical sense, but can have unacceptable overheads > in practice, especially if a large table started out - often for a > good reason - without a PK or other suittable identiuty column I don't think that this user problem is something we should be making our problem; at least not like this. > As I said before, the row id does not have to be stable across row > versions for logical replication, it just has to be able to track the > identity of "the row that is changed or deleted" which ctid already > does very well. Yes, I'm aware. But a TID doesn't really mean anything if you don't have indexes - its value may change arbitrarily between queries. Anything may happen to a tuple's CTID, long as 1.) within a statement, the CTID doesn't change, and 2.) indexed TIDs won't have changes to the attributes represented in indexes. > You were very worried about extra WAL usage above, but seem to be > oblivious of huge resource usage of REPLICA IDENTITY FULL (when used > as a poor man's row id and not because the old row data is needed for > some other reason) I don't generally suggest people run their systems with wal_level=logical and try to avoid those systems that have, exactly because of issues like the significant additional overhead involved with logging page-local update records. > When you at some point discover the need for logical replication of a > large table inside a 24/7 production database where you do have > occasional updates - or even frequent updates, just not based on > unique id - you have currently a few options. > > 1. add REPLICA IDENTITY FULL > - this will double the WAL traffic for updates and usually more > than double for DELETEs (could be 1x or 100x) > - it can also be REALLY REALLY SLOW to replicate, the worst case > requiring 1 sequential scan of the whole table for each UPDATE or > DELETE Yep, that's about expected; if you want performant UPDATE with OLTP-style databases you have to index your data. > 2. add a primary key column - quite hard to do CONCURRENTLY, will have > severe disk and cpu space demands and once it has been added (which > could have taken up to a few weeks) it will slow down any inserts. It's quite possible. Not trivial, but it is a road that many have taken. And yes, indexing slows down inserts. That's how we make sure the index remains correct. > To reiterate - "stable across operations" is not at a requirement for > logical replication, tuple id is "stable enough" for streaming > replication changes. Think of it as somebody changing the primary key > column at each update - it seems weird, but the updated PK still > uniquely identifies the tuple for the next operation. Yeah, I'm aware of that part. > > As I also said in my other mail, adding > > ctid to the logical replication system will expose too much internal > > information and will turn current logical no-ops into logical > > operations; > > Can you provide an example of this? CLUSTER itself doesn't modify any logical columns, and would thus normally have been no-op from a LR PoV; allowing us to ignore it in LR of user tables. By adding CTID as new logical column, however, you force effectively CLUSTER to issue a logical UPDATE record for all tuples that get a new CTID in the table, or every UPDATE afterward would be replicated with the wrong CTID from the subscriber's point of view. Example: A table with live CTIDs A: (0, 1), B: (0, 2) This table gets CLUSTERed: the CTIDs on disk in the new table are A: (0, 0), B: (0, 1). Note how row B now has the CTID that row A had before clustering; A has a previously unused CTID; and B's old CTID is now orphaned. Assuming LR didn't get an update about this CLUSTER changing the CTIDs of logical rows, the replica will be desynced from the primary. An UPDATE on row B will still get recorded and replicated, but because the CTID is now (0, 1) the remote thinks the update was for row A (which it had recorded previously as the one that last had CTID (0, 1)). Updates for row A on the primary will get replicated and fail to find a matching row, because its new CTID (0, 0) wasn't an ID that was replicated yet, and there's now an additional row B that's been orphaned on the replica. > > possibly even bloating the subscriber by a good deal more > > than what the publisher cleaned up. > > The absolute biggest bloater is REPLICA IDENTITY FULL. > > The beauty of using REPLICA IDENTITY ROWID is that absolutely *nothing > extra needs to be added to WAL*. Except for CLUSTER and related operations, which now need to log TID mappings. > The tids are already there even for physical replication (and even for > wal_level=minimal) as they are required even for crash recovery. All > my patch to core did is exposing them to logical decodoing mechanism. > You do not have to use them, nothing changes for decoding plugins not > using them. About the patch: The ItemPointerDatas newly added to ReorderBufferChange.data.tp are better stored in {old,new}tuple->t_self, if we're going to store them. The size of .data is currently 32 bytes, increasing that by 16 bytes is rather wasteful if we already have a more convenient and accurate place to store this data. I see even less reasons why logical replication can care about HOT-ness of updates (that's mostly an implementation detail of heap; logically there is and should be no difference between a normal update and a HOT update). But, if the decoder really did care about those implementation details, it would arguably be better pull that data from the tuple's infomasks, as then there wouldn't be a need for the additional boolean ReorderBufferChange->data.tp.is_hot_update. Alternatively, move it into an alignment gap, because the current placement (after removing the TIDs from data.tp) is rather unfortunate and would add 8 bytes to the size of this union. Kind regards, Matthias van de Meent Databricks (https://www.databricks.com)