Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From: Andres Freund <andres@anarazel.de>
To: Amit Kapila <amit.kapila16@gmail.com>
Cc: depesz@depesz.com, Masahiko Sawada <sawada.mshk@gmail.com>, pgsql-bugs mailing list <pgsql-bugs@postgresql.org>
Date: 2022-11-21T20:08:36Z
Lists: pgsql-bugs
Commits
- Fix a possibility of logical replication slot's restart_lsn going backwards.
- e5ed873b1b4a 18.0 landed
- 568e78a653ee 17.2 landed
- f353911337cf 16.6 landed
- 91771b3fbbc3 15.10 landed
- 26c4e8968690 14.15 landed
- 15dc1abb17dd 13.18 landed
Hi,

On 2022-11-21 19:56:20 +0530, Amit Kapila wrote:
> I think this problem could arise when walsender exits due to some
> error like "terminating walsender process due to replication timeout".
> Here is the theory I came up with:
>
> 1. Initially the restart_lsn is updated to 1039D/83825958. This will
> allow all files till 000000000001039D00000082 to be removed.
> 2. Next the slot->candidate_restart_lsn is updated to a 1039D/8B5773D8.
> 3. walsender restarts due to replication timeout.
> 4. After restart, it starts reading WAL from 1039D/83825958 as that
> was restart_lsn.
> 5. walsender gets a message to update write, flush, apply, etc. As
> part of that, it invokes
> ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation.
> 6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and
> replicationSlotMinLSN will also be computed to the same value allowing
> to remove of all files older than 000000000001039D0000008A. This will
> allow removing 000000000001039D00000083, 000000010001039D00000084,
> etc.

This would require that the client acknowledged an LSN that we haven't sent
out, no? Shouldn't the
	MyReplicationSlot->candidate_restart_valid <= lsn
check from LogicalConfirmReceivedLocation() have prevented this from
happening unless the client acknowledges up to candidate_restart_valid?

> 7. Now, we got new slot->candidate_restart_lsn as 1039D/83825958.
> Remember from step 1, we are still reading WAL from that location.

I don't think LogicalIncreaseRestartDecodingForSlot() would do anything in
that case, because of the
	/* don't overwrite if have a newer restart lsn */
check.

> If this diagnosis is correct, I think we need to clear
> candidate_restart_lsn and friends during ReplicationSlotRelease().

Possible, but I don't quite see it yet.

Greetings,

Andres Freund
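
The two guards discussed above can be sketched as a small standalone model. This is not the PostgreSQL source, only a simplified illustration of the conditions as they are described in this message. SlotModel, make_lsn, segment_of, confirm_received, offer_restart_candidate and run_thread_scenario are hypothetical names. The candidate_restart_valid value and the acknowledged LSNs are invented for the example, since the thread does not give them. segment_of assumes the default 16MB WAL segment size.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t Lsn;			/* stand-in for XLogRecPtr */
#define INVALID_LSN ((Lsn) 0)

/* Hypothetical, stripped-down model of the slot fields named in the thread. */
typedef struct SlotModel
{
	Lsn			restart_lsn;
	Lsn			candidate_restart_lsn;
	Lsn			candidate_restart_valid;
} SlotModel;

/* Build an LSN from the "hi/lo" notation used in the thread. */
static Lsn
make_lsn(uint32_t hi, uint32_t lo)
{
	return ((Lsn) hi << 32) | lo;
}

/* Last 8 hex digits of the WAL file name, assuming 16MB segments. */
static uint32_t
segment_of(Lsn lsn)
{
	return (uint32_t) (lsn & 0xFFFFFFFFu) >> 24;
}

/*
 * Model of the confirm path: restart_lsn only moves to the candidate once
 * the client has acknowledged at least candidate_restart_valid.
 * Returns 1 if restart_lsn advanced, 0 otherwise.
 */
static int
confirm_received(SlotModel *slot, Lsn confirmed)
{
	if (slot->candidate_restart_valid != INVALID_LSN &&
		slot->candidate_restart_valid <= confirmed)
	{
		slot->restart_lsn = slot->candidate_restart_lsn;
		slot->candidate_restart_lsn = INVALID_LSN;
		slot->candidate_restart_valid = INVALID_LSN;
		return 1;
	}
	return 0;
}

/*
 * Model of offering a new candidate. Returns 1 if the candidate was stored,
 * 0 if it was ignored.
 */
static int
offer_restart_candidate(SlotModel *slot, Lsn current_lsn, Lsn restart_lsn)
{
	/* don't overwrite if have a newer restart lsn */
	if (restart_lsn <= slot->restart_lsn)
		return 0;
	if (slot->candidate_restart_valid == INVALID_LSN)
	{
		slot->candidate_restart_valid = current_lsn;
		slot->candidate_restart_lsn = restart_lsn;
		return 1;
	}
	return 0;
}

/* Replays steps 1, 2, 5/6 and 7 of the theory against the model. */
static uint32_t
run_thread_scenario(void)
{
	SlotModel	slot = {
		.restart_lsn = make_lsn(0x1039D, 0x83825958),			/* step 1 */
		.candidate_restart_lsn = make_lsn(0x1039D, 0x8B5773D8), /* step 2 */
		.candidate_restart_valid = make_lsn(0x1039D, 0x8C000000)	/* invented */
	};
	int			moved;
	int			accepted;

	/* An acknowledgement below candidate_restart_valid changes nothing. */
	moved = confirm_received(&slot, make_lsn(0x1039D, 0x84000000));
	printf("early ack: moved=%d, oldest needed segment=%08X\n",
		   moved, (unsigned) segment_of(slot.restart_lsn));
	assert(!moved);

	/* Only an acknowledgement up to candidate_restart_valid advances it. */
	moved = confirm_received(&slot, make_lsn(0x1039D, 0x8C000000));
	printf("late ack:  moved=%d, oldest needed segment=%08X\n",
		   moved, (unsigned) segment_of(slot.restart_lsn));

	/* Step 7: the old location is offered again and is ignored. */
	accepted = offer_restart_candidate(&slot,
									   make_lsn(0x1039D, 0x8C000100),
									   make_lsn(0x1039D, 0x83825958));
	printf("stale candidate accepted=%d\n", accepted);

	return segment_of(slot.restart_lsn);
}
```

Calling run_thread_scenario() from a main() shows the oldest needed segment staying at 00000083 after the early acknowledgement and moving to 0000008B only after the later one, which lines up with the file names in steps 1 and 6 (files up to ...82, then up to ...8A, become removable). In this model the stale candidate from step 7 is rejected, so restart_lsn can only go backwards if something bypasses these two checks.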