Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

hubert depesz lubaczewski <depesz@depesz.com>

From: hubert depesz lubaczewski <depesz@depesz.com>
To: Masahiko Sawada <sawada.mshk@gmail.com>
Cc: Amit Kapila <amit.kapila16@gmail.com>, pgsql-bugs mailing list <pgsql-bugs@postgresql.org>
Date: 2022-11-15T07:55:36Z
Lists: pgsql-bugs

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Fix a possibility of logical replication slot's restart_lsn going backwards.

On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote:
> On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
> <depesz@depesz.com> wrote:
> >
> > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/8DECBA18
> > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/911A8DB0
> > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
> > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
> > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/955633D0
> > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/96453F38
> > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
> > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
> > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
> > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
> 
> It happened twice that the walsender process was active for at least
> one minute while not advancing restart_lsn and exited. Could you share
> the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
> can find out what the wal sender processes were doing?

That will be ~ 270MB. Not sure how that would fly, plus that's awfully
lot of logs to check if I don't leak any sensitive info.

But, if I'll remove lines that are generated by walsender to pg12
replicas (streaming replication to bionic/pg12 replicas), the logfile is
only ~ 3MB. Still a bit too much, as there can be some queries with
sensitive info.

Perhaps I can grep something out?

> Also, you may have logs for "failed to increase restart lsn: proposed
> %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
> flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
> please share it as well.

Found one in the time frame you mentioned:
2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"

> I've tried to reproduce this issue with a similar setting in a loop
> but could not.

It seems to be rather elusive. Which is why I'm kinda "happy" that I can with
high level of luck replicate it.

Question might be 100% stupid, sorry if it is - any chance that we're on arm64
has something to do with it?

Best regards,

depesz