Re: logical decoding and replication of sequences, take 2
Tomas Vondra <tomas.vondra@enterprisedb.com>
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
To: Amit Kapila <amit.kapila16@gmail.com>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>,
"Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>,
Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>,
PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
Masahiko Sawada <sawada.mshk@gmail.com>,
Peter Eisentraut <peter.eisentraut@enterprisedb.com>,
Dilip Kumar <dilipbalaut@gmail.com>
Date: 2023-11-28T14:41:55Z
Lists: pgsql-hackers
Commits
Same data as JSON:
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
Migrate logical slots to the new node during an upgrade.
- 29d0a77fa660 17.0 cited
-
Make test_decoding ddl.out shorter
- d6677b93c79b 17.0 landed
- c5c5832600e9 14.9 landed
- b1dc946eee3d 16.0 landed
- 3bb8b9342f8a 15.4 landed
-
Fix snapshot handling in logicalmsg_decode
- 949ac32e1267 15.3 landed
- 8b9cbd42b61f 14.8 landed
- 4df581fa0f4b 13.11 landed
- 497f863f0598 12.15 landed
- 8de91ebf2ac1 11.20 landed
- 7fe1aa991b62 16.0 landed
-
doc: Adjust a few more references to "postmaster"
- 17e72ec45d31 16.0 cited
-
Revert "Logical decoding of sequences"
- 2c7ea57e56ca 15.0 cited
On 11/28/23 12:32, Amit Kapila wrote:
> On Mon, Nov 27, 2023 at 11:45 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> I spent a bit of time looking at the proposed change, and unfortunately
>> logging just the boolean flag does not work. A good example is this bit
>> from a TAP test added by the patch for built-in replication (which was
>> not included with the WIP patch):
>>
>> BEGIN;
>> ALTER SEQUENCE s RESTART WITH 1000;
>> SAVEPOINT sp1;
>> INSERT INTO seq_test SELECT nextval('s') FROM generate_series(1,100);
>> ROLLBACK TO sp1;
>> COMMIT;
>>
>> This is expected to produce:
>>
>> 1131|0|t
>>
>> but produces
>>
>> 1000|0|f
>>
>> instead. The reason is very simple - as implemented, the patch simply
>> checks if the relfilenode is from the same top-level transaction, which
>> it is, and sets the flag to "true". So we know the sequence changes need
>> to be queued and replayed as part of this transaction.
>>
>> But then during decoding, we still queue the changes into the subxact,
>> which then aborts, and the changes are discarded. That is not how it's
>> supposed to work, because the new relfilenode is still valid, someone
>> might do nextval() and commit. And the nextval() may not get WAL-logged,
>> so we'd lose this.
>>
>> What I guess we might do is log not just a boolean flag, but the XID of
>> the subtransaction that created the relfilenode. And then during
>> decoding we'd queue the changes into this subtransaction ...
>>
>> 0006 in the attached patch series does this, and it seems to fix the TAP
>> test failure. I left it at the end, to make it easier to run tests
>> without the patch applied.
>>
>
> Offhand, I don't have any better idea than what you have suggested for
> the problem but this needs some thoughts including the questions asked
> by you. I'll spend some time on it and respond back.
>
I've been experimenting with the idea to log the XID, and for a moment I
was worried it actually can't work, because subtransactions may not
actually be just nested in simple way, but form a tree. And what if the
sequence was altered in a different branch (sibling subxact), not in the
immediate parent. In which case the new SubTransactionGetXid() would
fail, because it just walks the current chain of subtransactions.
I've been thinking about cases like this:
BEGIN;
CREATE SEQUENCE s; # XID 1000
SELECT alter_sequence(); # XID 1001
SAVEPOINT s1;
SELECT COUNT(nextval('s')) FROM generate_series(1,100); # XID 1000
ROLLBACK TO s1;
SELECT COUNT(nextval('s')) FROM generate_series(1,100); # XID 1000
COMMIT;
The XID values are what the sequence wal record will reference, assuming
that the main transaction XID is 1000.
Initially, I thought it's wrong that the nextval() calls reference XID
of the main transaction, because the last relfilenode comes from 1001,
which is the subxact created by alter_sequence() thanks to the exception
handling block. And that's where the approach in reorderbuffer would
queue the changes.
But I think this is actually correct too. When a subtransaction commits
(e.g. when alter_sequence() completes), it essentially becomes part of
the parent. And AtEOSubXact_cleanup() updates rd_newRelfilelocatorSubid
accordingly, setting it to parentSubid.
This also means that SubTransactionGetXid() can't actually fail, because
the ID has to reference an active subtransaction in the current stack.
I'm still concerned about the cost of the lookup, because the list may
be long and the subxact we're looking for may be quite high, but I guess
we might have another field, caching the XID. It'd need to be updated
only in AtEOSubXact_cleanup, and at that point we know it's the
immediate parent, so it'd be pretty cheap I think.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company