Re: logical decoding and replication of sequences, take 2

Tomas Vondra <tomas.vondra@enterprisedb.com>

From: Tomas Vondra <tomas.vondra@enterprisedb.com>

To: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>

Date: 2022-11-11T22:49:07Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Migrate logical slots to the new node during an upgrade.
- 29d0a77fa660 17.0 cited
Make test_decoding ddl.out shorter
- d6677b93c79b 17.0 landed
- c5c5832600e9 14.9 landed
- b1dc946eee3d 16.0 landed
- 3bb8b9342f8a 15.4 landed
Fix snapshot handling in logicalmsg_decode
- 949ac32e1267 15.3 landed
- 8b9cbd42b61f 14.8 landed
- 4df581fa0f4b 13.11 landed
- 497f863f0598 12.15 landed
- 8de91ebf2ac1 11.20 landed
- 7fe1aa991b62 16.0 landed
doc: Adjust a few more references to "postmaster"
- 17e72ec45d31 16.0 cited
Revert "Logical decoding of sequences"
- 2c7ea57e56ca 15.0 cited

Attachments

decoding-sequences-tracking-20221112.patch (text/x-patch) patch

Hi,

I noticed on cfbot the patch no longer applies, so here's a rebased
version. Most of the breakage was due to the column filtering reworks,
grammar changes etc. A lot of bitrot, but mostly mechanical stuff.

I haven't looked into the optimizations / improvements I discussed in my
previous post (logging only LSN of the last WAL-logged increment),
because while fixing "make check-world" I ran into a more serious issue
that I think needs to be discussed first. And I suspect it might also
affect the feasibility of the LSN optimization.

So, what's the issue - the current solution is based on WAL-logging
state of all sequences incremented by the transaction at COMMIT. To do
that, we read the state from disk, and write that into WAL. However,
these WAL messages are not necessarily correlated to COMMIT records, so
stuff like this might happen:

1. transaction T1 increments sequence S
2. transaction T2 increments sequence S
3. both T1 and T2 start to COMMIT
4. T1 reads state of S from disk, writes it into WAL
5. transaction T3 increments sequence S
6. T2 reads state of S from disk, writes it into WAL
7. T2 write COMMIT into WAL
8. T1 write COMMIT into WAL

Because the apply order is determined by ordering of COMMIT records,
this means we'd apply the increments logged by T2, and then by T1. But
that undoes the increment by T3, and the sequence would go backwards.

The previous patch version addressed that by acquiring lock on the
sequence, holding it until transaction end. This effectively ensures the
order of sequence messages and COMMIT matches. But that's problematic
for a number of reasons:

1) throughput reduction, because the COMMIT records need to serialize

2) deadlock risk, if we happen to lock sequences in different order
   (in different transactions)

3) problem for prepared transactions - the sequences are locked and
   logged in PrepareTransaction, because we may not have seqhashtab
   beyond that point. This is a much worse variant of (1).

Note: I also wonder what happens if someone does DISCARD SEQUENCES. I
guess we'll forget the sequences, which is bad - so we'd have to invent
a separate cache that does not have this issue.


I realized (3) because one of the test_decoding TAP tests got stuck
exactly because of a sequence locked by a prepared transaction.

This patch simply releases the lock after writing the WAL message, but
that just makes it vulnerable to the reordering. And this would have
been true even with the LSN optimization.

However, I was thinking that maybe we could use the LSN of the WAL
message (XLOG_LOGICAL_SEQUENCE) to deal with the ordering issue, because
*this* is the sensible sequence increment ordering.

In the example above, we'd first apply the WAL message from T2 (because
that commits first). And then we'd get to apply T1, but the WAL message
has an older LSN, so we'd skip it.

But this requires us remembering LSN of the already applied WAL sequence
messages, which could be tricky - we'd need to persist it in some way
because of restarts, etc. We can't do this while decoding but on the
apply side, I think, because of streaming, aborts.

The other option might be to make these messages non-transactional, in
which case we'd separate the ordering from COMMIT ordering, evading the
reordering problem.

That'd mean we'd ignore rollbacks (which seems fine), we could probably
optimize this by checking if the state actually changed, etc. But we'd
also need to deal with transactions created in the (still uncommitted)
transaction. But I'm also worried it might lead to the same issue with
non-transactional behaviors that forced revert in v15.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company