Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Amit Kapila <amit.kapila16@gmail.com>
From: Amit Kapila <amit.kapila16@gmail.com>
To: depesz@depesz.com
Cc: Andres Freund <andres@anarazel.de>,
Masahiko Sawada <sawada.mshk@gmail.com>, pgsql-bugs mailing list <pgsql-bugs@postgresql.org>
Date: 2022-11-21T14:26:20Z
Lists: pgsql-bugs
Commits
Same data as JSON:
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
Fix a possibility of logical replication slot's restart_lsn going backwards.
- e5ed873b1b4a 18.0 landed
- 568e78a653ee 17.2 landed
- f353911337cf 16.6 landed
- 91771b3fbbc3 15.10 landed
- 26c4e8968690 14.15 landed
- 15dc1abb17dd 13.18 landed
On Fri, Nov 18, 2022 at 4:57 PM hubert depesz lubaczewski <depesz@depesz.com> wrote: > > On Thu, Nov 17, 2022 at 10:55:29AM -0800, Andres Freund wrote: > > Hi, > > > > On 2022-11-17 23:22:12 +0900, Masahiko Sawada wrote: > > > On Thu, Nov 17, 2022 at 5:03 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > 4. wal sender restarts for some reason (or server crashed). > > > > > > > > I don't think walsender alone restarting should change anything, but > > > > crash-restart obviously would. > > > > > > Right. I've confirmed this scenario is possible to happen with crash-restart. > > > > Depesz, were there any crashes / immediate restarts on the PG 12 side? If so, > > then we know what the problem likely is and can fix it. If not... > > > > > > Just to confirm, the problem has "newly" occurred after you've upgraded to > > 12.12? I couldn't quite tell from the thread. > > No crashes that I can find any track of. > I think this problem could arise when walsender exits due to some error like "terminating walsender process due to replication timeout". Here is the theory I came up with: 1. Initially the restart_lsn is updated to 1039D/83825958. This will allow all files till 000000000001039D00000082 to be removed. 2. Next the slot->candidate_restart_lsn is updated to a 1039D/8B5773D8. 3. walsender restarts due to replication timeout. 4. After restart, it starts reading WAL from 1039D/83825958 as that was restart_lsn. 5. walsender gets a message to update write, flush, apply, etc. As part of that, it invokes ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation. 6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and replicationSlotMinLSN will also be computed to the same value allowing to remove of all files older than 000000000001039D0000008A. This will allow removing 000000000001039D00000083, 000000010001039D00000084, etc. 7. Now, we got new slot->candidate_restart_lsn as 1039D/83825958. Remember from step 1, we are still reading WAL from that location. 8. At the next update for write, flush, etc. as part of processing standby reply message, we will invoke ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation. This updates restart_lsn to 1039D/83825958. 9. walsender restarts due to replication timeout. 10. After restart, it starts reading WAL from 1039D/83825958 as that was restart_lsn and gets an error "requested WAL segment 000000010001039D00000083 has already been removed" because the same is removed as part of point-6. 11. After that walsender will keep on getting the same error after restart as the restart_lsn is never progressed. What do you think? If this diagnosis is correct, I think we need to clear candidate_restart_lsn and friends during ReplicationSlotRelease(). -- With Regards, Amit Kapila.