Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

Amit Kapila <amit.kapila16@gmail.com>

From: Amit Kapila <amit.kapila16@gmail.com>

To: depesz@depesz.com

Cc: Andres Freund <andres@anarazel.de>, Masahiko Sawada <sawada.mshk@gmail.com>, pgsql-bugs mailing list <pgsql-bugs@postgresql.org>

Date: 2022-11-21T14:26:20Z

Lists: pgsql-bugs

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Fix a possibility of logical replication slot's restart_lsn going backwards.
- e5ed873b1b4a 18.0 landed
- 568e78a653ee 17.2 landed
- f353911337cf 16.6 landed
- 91771b3fbbc3 15.10 landed
- 26c4e8968690 14.15 landed
- 15dc1abb17dd 13.18 landed

On Fri, Nov 18, 2022 at 4:57 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Thu, Nov 17, 2022 at 10:55:29AM -0800, Andres Freund wrote:
> > Hi,
> >
> > On 2022-11-17 23:22:12 +0900, Masahiko Sawada wrote:
> > > On Thu, Nov 17, 2022 at 5:03 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > > 4. wal sender restarts for some reason (or server crashed).
> > > >
> > > > I don't think walsender alone restarting should change anything, but
> > > > crash-restart obviously would.
> > >
> > > Right. I've confirmed this scenario is possible to happen with crash-restart.
> >
> > Depesz, were there any crashes / immediate restarts on the PG 12 side? If so,
> > then we know what the problem likely is and can fix it. If not...
> >
> >
> > Just to confirm, the problem has "newly" occurred after you've upgraded to
> > 12.12? I couldn't quite tell from the thread.
>
> No crashes that I can find any track of.
>

I think this problem could arise when walsender exits due to some
error like "terminating walsender process due to replication timeout".
Here is the theory I came up with:

1. Initially the restart_lsn is updated to 1039D/83825958. This will
allow all files till 000000000001039D00000082 to be removed.
2. Next the slot->candidate_restart_lsn is updated to a 1039D/8B5773D8.
3. walsender restarts due to replication timeout.
4. After restart, it starts reading WAL from 1039D/83825958 as that
was restart_lsn.
5. walsender gets a message to update write, flush, apply, etc. As
part of that, it invokes
ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation.
6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and
replicationSlotMinLSN will also be computed to the same value allowing
to remove of all files older than 000000000001039D0000008A. This will
allow removing 000000000001039D00000083, 000000010001039D00000084,
etc.
7. Now, we got new slot->candidate_restart_lsn as 1039D/83825958.
Remember from step 1, we are still reading WAL from that location.
8. At the next update for write, flush, etc. as part of processing
standby reply message, we will invoke
ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation. This
updates restart_lsn to 1039D/83825958.
9. walsender restarts due to replication timeout.
10. After restart, it starts reading WAL from 1039D/83825958 as that
was restart_lsn and gets an error "requested WAL segment
000000010001039D00000083 has already been removed" because the same is
removed as part of point-6.
11. After that walsender will keep on getting the same error after
restart as the restart_lsn is never progressed.

What do you think?

If this diagnosis is correct, I think we need to clear
candidate_restart_lsn and friends during ReplicationSlotRelease().

-- 
With Regards,
Amit Kapila.