Re: logical decoding and replication of sequences, take 2
Tomas Vondra <tomas.vondra@enterprisedb.com>
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
To: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Date: 2022-11-11T22:49:07Z
Lists: pgsql-hackers
Commits
Same data as JSON:
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
Migrate logical slots to the new node during an upgrade.
- 29d0a77fa660 17.0 cited
-
Make test_decoding ddl.out shorter
- d6677b93c79b 17.0 landed
- c5c5832600e9 14.9 landed
- b1dc946eee3d 16.0 landed
- 3bb8b9342f8a 15.4 landed
-
Fix snapshot handling in logicalmsg_decode
- 949ac32e1267 15.3 landed
- 8b9cbd42b61f 14.8 landed
- 4df581fa0f4b 13.11 landed
- 497f863f0598 12.15 landed
- 8de91ebf2ac1 11.20 landed
- 7fe1aa991b62 16.0 landed
-
doc: Adjust a few more references to "postmaster"
- 17e72ec45d31 16.0 cited
-
Revert "Logical decoding of sequences"
- 2c7ea57e56ca 15.0 cited
Attachments
- decoding-sequences-tracking-20221112.patch (text/x-patch) patch
Hi, I noticed on cfbot the patch no longer applies, so here's a rebased version. Most of the breakage was due to the column filtering reworks, grammar changes etc. A lot of bitrot, but mostly mechanical stuff. I haven't looked into the optimizations / improvements I discussed in my previous post (logging only LSN of the last WAL-logged increment), because while fixing "make check-world" I ran into a more serious issue that I think needs to be discussed first. And I suspect it might also affect the feasibility of the LSN optimization. So, what's the issue - the current solution is based on WAL-logging state of all sequences incremented by the transaction at COMMIT. To do that, we read the state from disk, and write that into WAL. However, these WAL messages are not necessarily correlated to COMMIT records, so stuff like this might happen: 1. transaction T1 increments sequence S 2. transaction T2 increments sequence S 3. both T1 and T2 start to COMMIT 4. T1 reads state of S from disk, writes it into WAL 5. transaction T3 increments sequence S 6. T2 reads state of S from disk, writes it into WAL 7. T2 write COMMIT into WAL 8. T1 write COMMIT into WAL Because the apply order is determined by ordering of COMMIT records, this means we'd apply the increments logged by T2, and then by T1. But that undoes the increment by T3, and the sequence would go backwards. The previous patch version addressed that by acquiring lock on the sequence, holding it until transaction end. This effectively ensures the order of sequence messages and COMMIT matches. But that's problematic for a number of reasons: 1) throughput reduction, because the COMMIT records need to serialize 2) deadlock risk, if we happen to lock sequences in different order (in different transactions) 3) problem for prepared transactions - the sequences are locked and logged in PrepareTransaction, because we may not have seqhashtab beyond that point. This is a much worse variant of (1). Note: I also wonder what happens if someone does DISCARD SEQUENCES. I guess we'll forget the sequences, which is bad - so we'd have to invent a separate cache that does not have this issue. I realized (3) because one of the test_decoding TAP tests got stuck exactly because of a sequence locked by a prepared transaction. This patch simply releases the lock after writing the WAL message, but that just makes it vulnerable to the reordering. And this would have been true even with the LSN optimization. However, I was thinking that maybe we could use the LSN of the WAL message (XLOG_LOGICAL_SEQUENCE) to deal with the ordering issue, because *this* is the sensible sequence increment ordering. In the example above, we'd first apply the WAL message from T2 (because that commits first). And then we'd get to apply T1, but the WAL message has an older LSN, so we'd skip it. But this requires us remembering LSN of the already applied WAL sequence messages, which could be tricky - we'd need to persist it in some way because of restarts, etc. We can't do this while decoding but on the apply side, I think, because of streaming, aborts. The other option might be to make these messages non-transactional, in which case we'd separate the ordering from COMMIT ordering, evading the reordering problem. That'd mean we'd ignore rollbacks (which seems fine), we could probably optimize this by checking if the state actually changed, etc. But we'd also need to deal with transactions created in the (still uncommitted) transaction. But I'm also worried it might lead to the same issue with non-transactional behaviors that forced revert in v15. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company