Re: logical decoding and replication of sequences, take 2

Tomas Vondra <tomas.vondra@enterprisedb.com>

From: Tomas Vondra <tomas.vondra@enterprisedb.com>

To: Robert Haas <robertmhaas@gmail.com>

Cc: Andres Freund <andres@anarazel.de>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>

Date: 2023-01-11T18:29:49Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Migrate logical slots to the new node during an upgrade.
- 29d0a77fa660 17.0 cited
Make test_decoding ddl.out shorter
- d6677b93c79b 17.0 landed
- c5c5832600e9 14.9 landed
- b1dc946eee3d 16.0 landed
- 3bb8b9342f8a 15.4 landed
Fix snapshot handling in logicalmsg_decode
- 949ac32e1267 15.3 landed
- 8b9cbd42b61f 14.8 landed
- 4df581fa0f4b 13.11 landed
- 497f863f0598 12.15 landed
- 8de91ebf2ac1 11.20 landed
- 7fe1aa991b62 16.0 landed
doc: Adjust a few more references to "postmaster"
- 17e72ec45d31 16.0 cited
Revert "Logical decoding of sequences"
- 2c7ea57e56ca 15.0 cited

On 1/10/23 20:52, Robert Haas wrote:
> On Tue, Jan 10, 2023 at 1:32 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> 0001 is a fix for the pre-existing issue in logicalmsg_decode,
>> attempting to build a snapshot before getting into a consistent state.
>> AFAICS this only affects assert-enabled builds and is otherwise
>> harmless, because we are not actually using the snapshot (apply gets a
>> valid snapshot from the transaction).
>>
>> This is mostly the fix I shared in November, except that I kept the call
>> in decode.c (per comment from Andres). I haven't added any argument to
>> SnapBuildProcessChange because we may need to backpatch this (and it
>> didn't seem much simpler, IMHO).
> 
> I tend to associate transactional behavior with snapshots, so it looks
> odd to see code that builds a snapshot only when the message is
> non-transactional. I think that a more detailed comment spelling out
> the reasoning would be useful here.
> 

I'll try adding a comment explaining this, but the reasoning is fairly
simple AFAICS:

1) We don't actually need to build the snapshot for transactional
changes, because if we end up applying the change, we'll use snapshot
provided/maintained by reorderbuffer.

2) But we don't know if we end up applying the change - it may happen
this is one of the transactions we're waiting to finish / skipped, in
which case the snapshot is kinda bogus anyway. What "saved" us is that
we'll not actually use the snapshot in the end. It's just the assert
that causes issues.

3) For non-transactional changes, we need a snapshot because we're going
to execute the callback right away. But in this case the code actually
protects against building inconsistent snapshots.

>> This however brings me to the original question what's the purpose of
>> this patch - and that's essentially keeping sequences up to date to make
>> them usable after a failover. We can't generate values from the sequence
>> on the subscriber, because it'd just get overwritten. And from this
>> point of view, it's also fine that the sequence is slightly ahead,
>> because that's what happens after crash recovery anyway. And we're not
>> guaranteeing the sequences to be gap-less.
> 
> I agree that it's fine for the sequence to be slightly ahead, but I
> think that it can't be too far ahead without causing problems. Suppose
> for example that transaction #1 creates a sequence. Transaction #2
> does nextval on the sequence a bunch of times and inserts rows into a
> table using the sequence values as the PK. It's fine if the nextval
> operations are replicated ahead of the commit of transaction #2 -- in
> fact I'd say it's necessary for correctness -- but they can't precede
> the commit of transaction #1, since then the sequence won't exist yet.

It's not clear to me how could that even happen. If transaction #1
creates a sequence, it's invisible for transaction #2. So how could it
do nextval() on it? #2 has to wait for #1 to commit before it can do
anything on the sequence, which enforces the correct ordering, no?

> Likewise, if there's an ALTER SEQUENCE that creates a new relfilenode,
> I think that needs to act as a barrier: non-transactional changes that
> happened before that transaction must also be replicated before that
> transaction is replicated, and those that happened after that
> transaction is replicated must be replayed after that transaction is
> replicated. Otherwise, at the very least, there will be states visible
> on the standby that were never visible on the origin server, and maybe
> we'll just straight up get the wrong answer. For instance:
> 
> 1. nextval, setting last_value to 3
> 2. ALTER SEQUENCE, getting a new relfilenode, and also set last_value to 19
> 3. nextval, setting last_value to 20
> 
> If 3 happens before 2, the sequence ends up in the wrong state.
> 
> Maybe you've already got this and similar cases totally correctly
> handled, I'm not sure, just throwing it out there.
> 

I believe this should behave correctly too, thanks to locking.

If a transaction does ALTER SEQUENCE, that locks the sequence, so only
that transaction can do stuff with that sequence (and changes from that
point are treated as transactional). And everyone else is waiting for #1
to commit.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company