Re: logical decoding and replication of sequences, take 2
Tomas Vondra <tomas.vondra@enterprisedb.com>
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
To: Amit Kapila <amit.kapila16@gmail.com>,
Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Cc: Dilip Kumar <dilipbalaut@gmail.com>,
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>,
"Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>,
PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
Masahiko Sawada <sawada.mshk@gmail.com>,
Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Date: 2024-01-11T16:26:56Z
Lists: pgsql-hackers
Commits
Same data as JSON:
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
Migrate logical slots to the new node during an upgrade.
- 29d0a77fa660 17.0 cited
-
Make test_decoding ddl.out shorter
- d6677b93c79b 17.0 landed
- c5c5832600e9 14.9 landed
- b1dc946eee3d 16.0 landed
- 3bb8b9342f8a 15.4 landed
-
Fix snapshot handling in logicalmsg_decode
- 949ac32e1267 15.3 landed
- 8b9cbd42b61f 14.8 landed
- 4df581fa0f4b 13.11 landed
- 497f863f0598 12.15 landed
- 8de91ebf2ac1 11.20 landed
- 7fe1aa991b62 16.0 landed
-
doc: Adjust a few more references to "postmaster"
- 17e72ec45d31 16.0 cited
-
Revert "Logical decoding of sequences"
- 2c7ea57e56ca 15.0 cited
Attachments
- v20240111-0001-Logical-decoding-of-sequences.patch (text/x-patch) patch v20240111-0001
- v20240111-0002-Add-decoding-of-sequences-to-test_decoding.patch (text/x-patch) patch v20240111-0002
- v20240111-0003-Add-decoding-of-sequences-to-built-in-repl.patch (text/x-patch) patch v20240111-0003
- v20240111-0004-global-hash-table-of-sequence-relfilenodes.patch (text/x-patch) patch v20240111-0004
- v20240111-0005-CREATE-SUBSCRIPTION-flag-to-enable-sequenc.patch (text/x-patch) patch v20240111-0005
- v20240111-0006-add-comment-about-SNAPBUILD_FULL.patch (text/x-patch) patch v20240111-0006
- v20240111-0007-decode-all-sequence-relfilenodes.patch (text/x-patch) patch v20240111-0007
Hi, Here's new version of this patch series. It rebases the 2023/12/03 version, and there's a couple improvements to address the performance and correctness questions. Since the 2023/12/03 version was posted, there were a couple off-list discussions with several people - with Amit, as mentioned in [1], and then also internally and at pgconf.eu. My personal (very brief) takeaway from these discussions is this: 1) desirability: We want a built-in way to handle sequences in logical replication. I think everyone agrees this is not a way to do distributed sequences in an active-active setups, but that there are other use cases that need this feature - typically upgrades / logical failover. Multiple approaches were discussed (support in logical replication or a separate tool to be executed on the logical replica). Both might work, people usually end up with some sort of custom tool anyway. But it's cumbersome, and the consensus seems the logical rep feature is better. 2) performance: There was concern about the performance impact, and that it affects everyone, including those who don't replicate sequences (as the overhead is mostly incurred before calls to output plugin etc.). I do agree with this, but I don't think sequences can be decoded in a much cheaper way. There was a proposal [2] that maybe we could batch the non-transactional sequences changes in the "next" transaction, and distribute them similarly to SnapBuildDistributeNewCatalogSnapshot() distributes catalog snapshots. But I doubt that'd actually work. Or more precisely - if we can make the code work, I think it would not solve the issue for some common cases. Consider for example a case with many concurrent top-level transactions, making this quite expensive. And I'd bet sequence changes are far more common than catalog changes. However, I think we ultimately agreed that the overhead is acceptable if it only applies to use cases that actually need to decode sequences. So if there was a way to skip sequence decoding when not necessary, that would work. Unfortunately, that can't be based on simply checking which callbacks are defined by the output plugin, because e.g. pgoutput needs to handle both cases (so the callbacks need to be defined). Nor it can be determined based on what's included in the publication (as that's not available that early). The agreement was that the best way is to have a CREATE SUBSCRIPTION option that would instruct the upstream to decode sequences. By default this option is 'off' (because that's the no-overhead case), but it can be enabled for each subscription. This is what 0005 implements, and interestingly enough, this is what an earlier version [3] from 2023/04/02 did. This means that if you add a sequence to the publication, but leave "sequences=off" in CREATE SUBSCRIPTION, the sequence won't be replicated after all. That may seems a bit surprising, and I don't like it, but I don't think there's a better way to do this. 3) correctness: The last point is about making "transactional" flag correct when the snapshot state changes mid-transaction, originally pointed out by Dilip [4]. Per [5] this however happens to work correctly, because while we identify the change as 'non-transactional' (which is incorrect), we immediately throw it again (so we don't try to apply it, which would error-out). One option would be to document/describe this in the comments, per 0006. This means that when ReorderBufferSequenceIsTransactional() returns true, it's correct. But if it returns 'false', it means 'maybe'. I agree it seems a bit strange, but with the extra comments I think it's OK. It simply means that if we get transactional=false incorrectly, we're guaranteed to not process it. Maybe we could rename the function to make this clear from the name. The other solution proposed in the thread [6] was to always decode the relfilenode, and add it to the hash table. 0007 does this, and it works. But I agree this seems possibly worse than 0006 - it means we may be adding entries to the hash table, and it's not clear when exactly we'll clean them up etc. It'd be the only place processing stuff before the snapshots reaches FULL. I personally would go with 0006, i.e. just explaining why doing it this way is correct. regards [1] https://www.postgresql.org/message-id/12822961-b7de-9d59-dd27-2e3dc3980c7e%40enterprisedb.com [2] https://www.postgresql.org/message-id/CAFiTN-vm3-bGfm-uJdzRLERMHozW8xjZHu4rdmtWR-rP-SJYMQ%40mail.gmail.com [3] https://www.postgresql.org/message-id/1f96b282-cb90-8302-cee8-7b3f5576a31c%40enterprisedb.com [4] https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com [5] https://www.postgresql.org/message-id/CAA4eK1LFise9iN%2BNN%3Dagrk4prR1qD%2BebvzNjKAWUog2%2Bhy3HxQ%40mail.gmail.com [6] https://www.postgresql.org/message-id/CAFiTN-sYpyUBabxopJysqH3DAp4OZUCTi6m_qtgt8d32vDcWSA%40mail.gmail.com -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company