Re: logical decoding and replication of sequences, take 2

Tomas Vondra <tomas.vondra@enterprisedb.com>

From: Tomas Vondra <tomas.vondra@enterprisedb.com>

To: Amit Kapila <amit.kapila16@gmail.com>, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>

Cc: Dilip Kumar <dilipbalaut@gmail.com>, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>, "Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>, Masahiko Sawada <sawada.mshk@gmail.com>, Peter Eisentraut <peter.eisentraut@enterprisedb.com>

Date: 2024-01-11T16:26:56Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Migrate logical slots to the new node during an upgrade.
- 29d0a77fa660 17.0 cited
Make test_decoding ddl.out shorter
- d6677b93c79b 17.0 landed
- c5c5832600e9 14.9 landed
- b1dc946eee3d 16.0 landed
- 3bb8b9342f8a 15.4 landed
Fix snapshot handling in logicalmsg_decode
- 949ac32e1267 15.3 landed
- 8b9cbd42b61f 14.8 landed
- 4df581fa0f4b 13.11 landed
- 497f863f0598 12.15 landed
- 8de91ebf2ac1 11.20 landed
- 7fe1aa991b62 16.0 landed
doc: Adjust a few more references to "postmaster"
- 17e72ec45d31 16.0 cited
Revert "Logical decoding of sequences"
- 2c7ea57e56ca 15.0 cited

Attachments

v20240111-0001-Logical-decoding-of-sequences.patch (text/x-patch) patch v20240111-0001
v20240111-0002-Add-decoding-of-sequences-to-test_decoding.patch (text/x-patch) patch v20240111-0002
v20240111-0003-Add-decoding-of-sequences-to-built-in-repl.patch (text/x-patch) patch v20240111-0003
v20240111-0004-global-hash-table-of-sequence-relfilenodes.patch (text/x-patch) patch v20240111-0004
v20240111-0005-CREATE-SUBSCRIPTION-flag-to-enable-sequenc.patch (text/x-patch) patch v20240111-0005
v20240111-0006-add-comment-about-SNAPBUILD_FULL.patch (text/x-patch) patch v20240111-0006
v20240111-0007-decode-all-sequence-relfilenodes.patch (text/x-patch) patch v20240111-0007

Hi,

Here's new version of this patch series. It rebases the 2023/12/03
version, and there's a couple improvements to address the performance
and correctness questions.

Since the 2023/12/03 version was posted, there were a couple off-list
discussions with several people - with Amit, as mentioned in [1], and
then also internally and at pgconf.eu.

My personal (very brief) takeaway from these discussions is this:

1) desirability: We want a built-in way to handle sequences in logical
replication. I think everyone agrees this is not a way to do distributed
sequences in an active-active setups, but that there are other use cases
that need this feature - typically upgrades / logical failover.

Multiple approaches were discussed (support in logical replication or a
separate tool to be executed on the logical replica). Both might work,
people usually end up with some sort of custom tool anyway. But it's
cumbersome, and the consensus seems the logical rep feature is better.

2) performance: There was concern about the performance impact, and that
it affects everyone, including those who don't replicate sequences (as
the overhead is mostly incurred before calls to output plugin etc.).

I do agree with this, but I don't think sequences can be decoded in a
much cheaper way. There was a proposal [2] that maybe we could batch the
non-transactional sequences changes in the "next" transaction, and
distribute them similarly to SnapBuildDistributeNewCatalogSnapshot()
distributes catalog snapshots.

But I doubt that'd actually work. Or more precisely - if we can make the
code work, I think it would not solve the issue for some common cases.
Consider for example a case with many concurrent top-level transactions,
making this quite expensive. And I'd bet sequence changes are far more
common than catalog changes.

However, I think we ultimately agreed that the overhead is acceptable if
it only applies to use cases that actually need to decode sequences. So
if there was a way to skip sequence decoding when not necessary, that
would work. Unfortunately, that can't be based on simply checking which
callbacks are defined by the output plugin, because e.g. pgoutput needs
to handle both cases (so the callbacks need to be defined). Nor it can
be determined based on what's included in the publication (as that's not
available that early).

The agreement was that the best way is to have a CREATE SUBSCRIPTION
option that would instruct the upstream to decode sequences. By default
this option is 'off' (because that's the no-overhead case), but it can
be enabled for each subscription.

This is what 0005 implements, and interestingly enough, this is what an
earlier version [3] from 2023/04/02 did.

This means that if you add a sequence to the publication, but leave
"sequences=off" in CREATE SUBSCRIPTION, the sequence won't be replicated
after all. That may seems a bit surprising, and I don't like it, but I
don't think there's a better way to do this.

3) correctness: The last point is about making "transactional" flag
correct when the snapshot state changes mid-transaction, originally
pointed out by Dilip [4]. Per [5] this however happens to work
correctly, because while we identify the change as 'non-transactional'
(which is incorrect), we immediately throw it again (so we don't try to
apply it, which would error-out).

One option would be to document/describe this in the comments, per 0006.
This means that when ReorderBufferSequenceIsTransactional() returns
true, it's correct. But if it returns 'false', it means 'maybe'. I agree
it seems a bit strange, but with the extra comments I think it's OK. It
simply means that if we get transactional=false incorrectly, we're
guaranteed to not process it. Maybe we could rename the function to make
this clear from the name.

The other solution proposed in the thread [6] was to always decode the
relfilenode, and add it to the hash table. 0007 does this, and it works.
But I agree this seems possibly worse than 0006 - it means we may be
adding entries to the hash table, and it's not clear when exactly we'll
clean them up etc. It'd be the only place processing stuff before the
snapshots reaches FULL.

I personally would go with 0006, i.e. just explaining why doing it this
way is correct.

regards

[1]
https://www.postgresql.org/message-id/12822961-b7de-9d59-dd27-2e3dc3980c7e%40enterprisedb.com

[2]
https://www.postgresql.org/message-id/CAFiTN-vm3-bGfm-uJdzRLERMHozW8xjZHu4rdmtWR-rP-SJYMQ%40mail.gmail.com

[3]
https://www.postgresql.org/message-id/1f96b282-cb90-8302-cee8-7b3f5576a31c%40enterprisedb.com

[4]
https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com

[5]
https://www.postgresql.org/message-id/CAA4eK1LFise9iN%2BNN%3Dagrk4prR1qD%2BebvzNjKAWUog2%2Bhy3HxQ%40mail.gmail.com

[6]
https://www.postgresql.org/message-id/CAFiTN-sYpyUBabxopJysqH3DAp4OZUCTi6m_qtgt8d32vDcWSA%40mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company