Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Fix a possibility of logical replication slot's restart_lsn going backwards.

  1. WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2022-10-05T15:23:54Z

    Hi,
    We are still in process of preparing upgrades from Ubuntu Bionic with Pg
    12 to Focal with Pg 14.
    
    And today (well, yesterday) we hit something that looks like a bug.
    
    Jumping to problem - wal files needed by logical replication were
    removed from primary, despite slot being there.
    
    Now, a bit more details.
    
    Our replication is setup using following method:
    
    1. We create all tables and primary keys on pg14
    2. replication slot is created, with no tables
    3. we add some tables to replication slot, and wait for all tables to
       switch state to "ready"
    4. rinse and repeat until all tables are in replication set
    5. once all is ready, we add rest of schema (indexes, constraints,
       functions, fkeys, and so on).
    
    This works. Worked?
    
    Yesterday we saw something that broke the replication.
    
    First data from bionic side:
    when                    | currentLSN    | slot_name │ plugin   │ slot_type │ datoid │ database │ temporary │ active │ active_pid │  xmin  │ catalog_xmin │  restart_lsn  │ confirmed_flush_lsn
    2022-10-04 14:13:49 UTC | CA0E/273F2180 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743431068   | CA09/BE23BC70 | CA09/C82B2A58
    2022-10-04 14:14:48 UTC | CA0E/3B5FF5C8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743451605   | CA09/C9DA89B0 | CA09/D3FA9898
    2022-10-04 14:15:47 UTC | CA0E/4F7C06F8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743479469   | CA09/DD32CE50 | CA09/E65E7F80
    2022-10-04 14:16:46 UTC | CA0E/6CA820A8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743516374   | CA09/EDD61CB8 | CA09/F3A856D8
    2022-10-04 14:17:45 UTC | CA0E/8C7F3950 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743635890   | CA0A/2157B8F8 | CA0A/2A4C48F8
    2022-10-04 14:18:44 UTC | CA0E/A5D874F8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743686305   | CA0A/38BE5AB0 | CA0A/3D2B0C20
    2022-10-04 14:19:43 UTC | CA0E/C3F6C240 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743749741   | CA0A/49ED91C8 | CA0A/4F6D3048
    2022-10-04 14:20:42 UTC | CA0E/D7DBA590 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743749741   | CA0A/49ED91C8 | CA0A/4F6D3048
    2022-10-04 14:21:41 UTC | CA0E/F65C0F48 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 29207      | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    2022-10-04 14:22:40 UTC | CA0F/103DBE98 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    2022-10-04 14:23:39 UTC | CA0F/22D791A8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    2022-10-04 14:24:38 UTC | CA0F/32E81380 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    2022-10-04 14:25:38 UTC | CA0F/41E4A380 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    2022-10-04 14:26:37 UTC | CA0F/58FD1B30 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    2022-10-04 14:27:36 UTC | CA0F/6F732698 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    2022-10-04 14:28:35 UTC | CA0F/855E2F50 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    2022-10-04 14:29:34 UTC | CA0F/94B36450 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    2022-10-04 14:30:33 UTC | CA0F/A7553BF8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    2022-10-04 14:31:33 UTC | CA0F/BB10F020 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    2022-10-04 14:32:32 UTC | CA0F/D1B4DF10 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    2022-10-04 14:33:31 UTC | CA0F/EE332240 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    2022-10-04 14:34:30 UTC | CA10/3B78E10  | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    
    As you can see around 2:20pm replication stopped fetching data.
    
    At this time, in pg14 logs don't show any problems what so ever.
    Pg_restore was doing it's work with create indexes, and some queries
    like:
    ALTER TABLE ONLY schema_name.table_name REPLICA IDENTITY USING INDEX index_table_name_replica_identity;
    
    At 2:20:48pm there was:
    2022-10-04 14:20:48.579 UTC,,,59744,,633c3fac.e960,62,,2022-10-04 14:14:04 UTC,8/6519966,7015376,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,"","logical replication worker",,0
    
    But these errors happened earlier too (minutes/hours earlier).
    
    But then, 3 minutes later I got:
    
    2022-10-04 14:23:10.618 UTC,,,63569,,633c4140.f851,38,,2022-10-04 14:20:48 UTC,8/0,0,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected
    server closed the connection unexpectedly
            This probably means the server terminated abnormally
            before or while processing the request.",,,,,,,,,"","logical replication worker",,0
    2022-10-04 14:23:10.623 UTC,,,64701,,633c41ce.fcbd,1,,2022-10-04 14:23:10 UTC,8/6529701,0,LOG,00000,"logical replication apply worker for subscription ""focal14"" has started",,,,,,,,,"","logical replication worker",,0
    2022-10-04 14:23:10.624 UTC,,,1801,,633aa5ea.709,41,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 63569) exited with exit code 1",,,,,,,,,"","postmaster",,0
    2022-10-04 14:23:10.643 UTC,,,64701,,633c41ce.fcbd,2,,2022-10-04 14:23:10 UTC,8/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000CA0A00000049 has already been removed",,,,,,,,,"","logical replication worker",,0
    2022-10-04 14:23:10.646 UTC,,,1801,,633aa5ea.709,42,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 64701) exited with exit code 1",,,,,,,,,"","postmaster",,0
    
    
    Wal file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visibile on pg12/bionic.
    
    Unfortunately I don't know the content of pg_wal on bionic at the
    moment, but currently I see:
    
    $ ls -l pg_wal/| head
    total 297534552
    -rw------- 1 postgres postgres      349 Feb 11  2022 000000010000A1410000001B.00C81918.backup
    -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A0000004D
    -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A0000004E
    -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A0000004F
    -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000050
    -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000051
    -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000052
    -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000053
    -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000054
    
    So, 4 files are missing.
    
    These were archived properly, and I tried to restore them from archive, and put
    them in pg_wal, but even then pg12 was rejecting tries to connect to focal14
    slot with the same message about "requested WAL segment
    000000010000CA0A00000049 has already been removed"
    
    I know that at this moment we're hosed. This replication was removed,
    everythbing cleared (after all, we accumulated already 300G of wal, and
    we can't extend the FS forever, so we needed to cleanup the situation).
    
    But, the way I look at it, Pg12 shouldn't be removing these 4 wal files,
    and the fact that it did means that either we did something wrong (what,
    though?), or there is some kind of bug in Pg.
    
    Anyone has any idea on how to investigate/fix that?
    
    Best regards,
    
    depesz
    
    
    
    
    
  2. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Amit Kapila <amit.kapila16@gmail.com> — 2022-10-16T05:05:17Z

    On Wed, Oct 5, 2022 at 8:54 PM hubert depesz lubaczewski
    <depesz@depesz.com> wrote:
    >
    > Hi,
    > We are still in process of preparing upgrades from Ubuntu Bionic with Pg
    > 12 to Focal with Pg 14.
    >
    > And today (well, yesterday) we hit something that looks like a bug.
    >
    > Jumping to problem - wal files needed by logical replication were
    > removed from primary, despite slot being there.
    >
    > Now, a bit more details.
    >
    > Our replication is setup using following method:
    >
    > 1. We create all tables and primary keys on pg14
    > 2. replication slot is created, with no tables
    > 3. we add some tables to replication slot, and wait for all tables to
    >    switch state to "ready"
    > 4. rinse and repeat until all tables are in replication set
    > 5. once all is ready, we add rest of schema (indexes, constraints,
    >    functions, fkeys, and so on).
    >
    > This works. Worked?
    >
    > Yesterday we saw something that broke the replication.
    >
    > First data from bionic side:
    > when                    | currentLSN    | slot_name │ plugin   │ slot_type │ datoid │ database │ temporary │ active │ active_pid │  xmin  │ catalog_xmin │  restart_lsn  │ confirmed_flush_lsn
    > 2022-10-04 14:13:49 UTC | CA0E/273F2180 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743431068   | CA09/BE23BC70 | CA09/C82B2A58
    > 2022-10-04 14:14:48 UTC | CA0E/3B5FF5C8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743451605   | CA09/C9DA89B0 | CA09/D3FA9898
    > 2022-10-04 14:15:47 UTC | CA0E/4F7C06F8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743479469   | CA09/DD32CE50 | CA09/E65E7F80
    > 2022-10-04 14:16:46 UTC | CA0E/6CA820A8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743516374   | CA09/EDD61CB8 | CA09/F3A856D8
    > 2022-10-04 14:17:45 UTC | CA0E/8C7F3950 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743635890   | CA0A/2157B8F8 | CA0A/2A4C48F8
    > 2022-10-04 14:18:44 UTC | CA0E/A5D874F8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743686305   | CA0A/38BE5AB0 | CA0A/3D2B0C20
    > 2022-10-04 14:19:43 UTC | CA0E/C3F6C240 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743749741   | CA0A/49ED91C8 | CA0A/4F6D3048
    > 2022-10-04 14:20:42 UTC | CA0E/D7DBA590 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743749741   | CA0A/49ED91C8 | CA0A/4F6D3048
    > 2022-10-04 14:21:41 UTC | CA0E/F65C0F48 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 29207      | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    > 2022-10-04 14:22:40 UTC | CA0F/103DBE98 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    > 2022-10-04 14:23:39 UTC | CA0F/22D791A8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    > 2022-10-04 14:24:38 UTC | CA0F/32E81380 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    > 2022-10-04 14:25:38 UTC | CA0F/41E4A380 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    > 2022-10-04 14:26:37 UTC | CA0F/58FD1B30 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    > 2022-10-04 14:27:36 UTC | CA0F/6F732698 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    > 2022-10-04 14:28:35 UTC | CA0F/855E2F50 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    > 2022-10-04 14:29:34 UTC | CA0F/94B36450 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    > 2022-10-04 14:30:33 UTC | CA0F/A7553BF8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    > 2022-10-04 14:31:33 UTC | CA0F/BB10F020 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    > 2022-10-04 14:32:32 UTC | CA0F/D1B4DF10 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    > 2022-10-04 14:33:31 UTC | CA0F/EE332240 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    > 2022-10-04 14:34:30 UTC | CA10/3B78E10  | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
    >
    > As you can see around 2:20pm replication stopped fetching data.
    >
    
    Around that time, it seems the slot is also marked inactive probably
    due to walsender exit.
    
    > At this time, in pg14 logs don't show any problems what so ever.
    > Pg_restore was doing it's work with create indexes, and some queries
    > like:
    > ALTER TABLE ONLY schema_name.table_name REPLICA IDENTITY USING INDEX index_table_name_replica_identity;
    >
    > At 2:20:48pm there was:
    > 2022-10-04 14:20:48.579 UTC,,,59744,,633c3fac.e960,62,,2022-10-04 14:14:04 UTC,8/6519966,7015376,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,"","logical replication worker",,0
    >
    > But these errors happened earlier too (minutes/hours earlier).
    >
    > But then, 3 minutes later I got:
    >
    > 2022-10-04 14:23:10.618 UTC,,,63569,,633c4140.f851,38,,2022-10-04 14:20:48 UTC,8/0,0,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected
    > server closed the connection unexpectedly
    >         This probably means the server terminated abnormally
    >         before or while processing the request.",,,,,,,,,"","logical replication worker",,0
    > 2022-10-04 14:23:10.623 UTC,,,64701,,633c41ce.fcbd,1,,2022-10-04 14:23:10 UTC,8/6529701,0,LOG,00000,"logical replication apply worker for subscription ""focal14"" has started",,,,,,,,,"","logical replication worker",,0
    > 2022-10-04 14:23:10.624 UTC,,,1801,,633aa5ea.709,41,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 63569) exited with exit code 1",,,,,,,,,"","postmaster",,0
    > 2022-10-04 14:23:10.643 UTC,,,64701,,633c41ce.fcbd,2,,2022-10-04 14:23:10 UTC,8/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000CA0A00000049 has already been removed",,,,,,,,,"","logical replication worker",,0
    > 2022-10-04 14:23:10.646 UTC,,,1801,,633aa5ea.709,42,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 64701) exited with exit code 1",,,,,,,,,"","postmaster",,0
    >
    >
    > Wal file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visibile on pg12/bionic.
    >
    
    This is quite strange and I am not able to see the reason why this can
    happen. The only way to debug this problem that comes to mind is to
    add more LOGS around the code that removes the WAL files. For example,
    we can try to print the value of minimumslotLSN (keep) and logSegNo in
    KeepLogSeg().
    
    > Unfortunately I don't know the content of pg_wal on bionic at the
    > moment, but currently I see:
    >
    > $ ls -l pg_wal/| head
    > total 297534552
    > -rw------- 1 postgres postgres      349 Feb 11  2022 000000010000A1410000001B.00C81918.backup
    > -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A0000004D
    > -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A0000004E
    > -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A0000004F
    > -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000050
    > -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000051
    > -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000052
    > -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000053
    > -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000054
    >
    > So, 4 files are missing.
    >
    > These were archived properly, and I tried to restore them from archive, and put
    > them in pg_wal, but even then pg12 was rejecting tries to connect to focal14
    > slot with the same message about "requested WAL segment
    > 000000010000CA0A00000049 has already been removed"
    >
    
    I think you are seeing this behavior because we update the
    lastRemovedSegNo before removing files in RemoveOldXlogFiles() and
    then we use that to give the error you are seeing.
    
    > I know that at this moment we're hosed. This replication was removed,
    > everythbing cleared (after all, we accumulated already 300G of wal, and
    > we can't extend the FS forever, so we needed to cleanup the situation).
    >
    > But, the way I look at it, Pg12 shouldn't be removing these 4 wal files,
    > and the fact that it did means that either we did something wrong (what,
    > though?), or there is some kind of bug in Pg.
    >
    > Anyone has any idea on how to investigate/fix that?
    >
    
    As suggested above, I feel we need to add some LOGS to see what is
    going on unless you have a reproducible testcase.
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  3. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Kyotaro Horiguchi <horikyota.ntt@gmail.com> — 2022-10-17T06:53:07Z

    At Sun, 16 Oct 2022 10:35:17 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
    > On Wed, Oct 5, 2022 at 8:54 PM hubert depesz lubaczewski
    > <depesz@depesz.com> wrote:
    > > At 2:20:48pm there was:
    > > 2022-10-04 14:20:48.579 UTC,,,59744,,633c3fac.e960,62,,2022-10-04 14:14:04 UTC,8/6519966,7015376,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,"","logical replication worker",,0
    > >
    > > But these errors happened earlier too (minutes/hours earlier).
    > >
    > > But then, 3 minutes later I got:
    > >
    > > 2022-10-04 14:23:10.618 UTC,,,63569,,633c4140.f851,38,,2022-10-04 14:20:48 UTC,8/0,0,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected
    > > server closed the connection unexpectedly
    > >         This probably means the server terminated abnormally
    > >         before or while processing the request.",,,,,,,,,"","logical replication worker",,0
    > > 2022-10-04 14:23:10.623 UTC,,,64701,,633c41ce.fcbd,1,,2022-10-04 14:23:10 UTC,8/6529701,0,LOG,00000,"logical replication apply worker for subscription ""focal14"" has started",,,,,,,,,"","logical replication worker",,0
    > > 2022-10-04 14:23:10.624 UTC,,,1801,,633aa5ea.709,41,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 63569) exited with exit code 1",,,,,,,,,"","postmaster",,0
    > > 2022-10-04 14:23:10.643 UTC,,,64701,,633c41ce.fcbd,2,,2022-10-04 14:23:10 UTC,8/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000CA0A00000049 has already been removed",,,,,,,,,"","logical replication worker",,0
    > > 2022-10-04 14:23:10.646 UTC,,,1801,,633aa5ea.709,42,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 64701) exited with exit code 1",,,,,,,,,"","postmaster",,0
    > >
    > >
    > > Wal file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visibile on pg12/bionic.
    > >
    > 
    > This is quite strange and I am not able to see the reason why this can
    > happen. The only way to debug this problem that comes to mind is to
    > add more LOGS around the code that removes the WAL files. For example,
    > we can try to print the value of minimumslotLSN (keep) and logSegNo in
    > KeepLogSeg().
    
    What I can see from the information at hand is that walsender is
    killed (not died) twice for some reasons then some "reserved" WAL
    files are lost.  I don't see the reason for the second walsender (pid
    29207) being killed but if the "has already been removed" log line
    above is the first one, the file removal is probably happend AT
    14:23:10.  (log is needed to know that.)
    
    Anyway, I didn't come up with a possibility other than of someone else
    having removed those files..  For example, virus scan softwares.  If
    any such software kills processes that is grabbing the files that the
    software identified as malicious and is going to remove, that could be
    happen?
    
    > > So, 4 files are missing.
    > >
    > > These were archived properly, and I tried to restore them from archive, and put
    > > them in pg_wal, but even then pg12 was rejecting tries to connect to focal14
    > > slot with the same message about "requested WAL segment
    > > 000000010000CA0A00000049 has already been removed"
    > >
    > 
    > I think you are seeing this behavior because we update the
    > lastRemovedSegNo before removing files in RemoveOldXlogFiles() and
    > then we use that to give the error you are seeing.
    
    lastRemovedSegNo is updated once per one segment of removal.  Four
    files are lost in this case.
    
    
    > > I know that at this moment we're hosed. This replication was removed,
    > > everythbing cleared (after all, we accumulated already 300G of wal, and
    > > we can't extend the FS forever, so we needed to cleanup the situation).
    > >
    > > But, the way I look at it, Pg12 shouldn't be removing these 4 wal files,
    > > and the fact that it did means that either we did something wrong (what,
    > > though?), or there is some kind of bug in Pg.
    > >
    > > Anyone has any idea on how to investigate/fix that?
    > >
    > 
    > As suggested above, I feel we need to add some LOGS to see what is
    > going on unless you have a reproducible testcase.
    
    regards.
    
    -- 
    Kyotaro Horiguchi
    NTT Open Source Software Center
    
    
    
    
  4. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2022-10-17T09:13:11Z

    On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote:
    > > Wal file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visibile on pg12/bionic.
    > This is quite strange and I am not able to see the reason why this can
    > happen. The only way to debug this problem that comes to mind is to
    > add more LOGS around the code that removes the WAL files. For example,
    > we can try to print the value of minimumslotLSN (keep) and logSegNo in
    > KeepLogSeg().
    
    That would require changing pg sources, I think, recompiling, and
    retrying?
    
    And on the pg12 side? I don't think we can do it, but I'll ask boss.
    
    > > Anyone has any idea on how to investigate/fix that?
    > As suggested above, I feel we need to add some LOGS to see what is
    > going on unless you have a reproducible testcase.
    
    The problem is that we were trying to use this approach to upgrade pg12
    production db to pg14. Modifying it (pg12) will be tricky, as it would
    require restart. Unless there is a way to add extra logging without
    restart...
    
    depesz
    
    
    
    
  5. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2022-10-17T09:14:32Z

    On Mon, Oct 17, 2022 at 03:53:07PM +0900, Kyotaro Horiguchi wrote:
    > Anyway, I didn't come up with a possibility other than of someone else
    > having removed those files..  For example, virus scan softwares.  If
    > any such software kills processes that is grabbing the files that the
    > software identified as malicious and is going to remove, that could be
    > happen?
    
    We don't use virus scanning there. The pg12 server is on ubuntu 18.04,
    used for normal production workload, and pg14 is on ubuntu focal, with
    no load, aside from 2 physical streaming replication to pg14 replicas.
    
    Best regards,
    
    depesz
    
    
    
    
    
  6. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Amit Kapila <amit.kapila16@gmail.com> — 2022-10-18T11:21:26Z

    On Mon, Oct 17, 2022 at 12:23 PM Kyotaro Horiguchi
    <horikyota.ntt@gmail.com> wrote:
    >
    > At Sun, 16 Oct 2022 10:35:17 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
    > > On Wed, Oct 5, 2022 at 8:54 PM hubert depesz lubaczewski
    > > > So, 4 files are missing.
    > > >
    > > > These were archived properly, and I tried to restore them from archive, and put
    > > > them in pg_wal, but even then pg12 was rejecting tries to connect to focal14
    > > > slot with the same message about "requested WAL segment
    > > > 000000010000CA0A00000049 has already been removed"
    > > >
    > >
    > > I think you are seeing this behavior because we update the
    > > lastRemovedSegNo before removing files in RemoveOldXlogFiles() and
    > > then we use that to give the error you are seeing.
    >
    > lastRemovedSegNo is updated once per one segment of removal.  Four
    > files are lost in this case.
    >
    
    I didn't understand your response. I was saying the one possible
    reason why even after restoring files from the archive the error
    appears is because of the lastRemovedSegNo related check in function
    CheckXLogRemoved() and we update its value while removing old xlog
    files. From this behavior, it appears that somehow the server has only
    removed those files even though the reason is not clear yet.
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  7. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Amit Kapila <amit.kapila16@gmail.com> — 2022-10-18T11:27:52Z

    On Mon, Oct 17, 2022 at 2:43 PM hubert depesz lubaczewski
    <depesz@depesz.com> wrote:
    >
    > On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote:
    > > > Wal file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visibile on pg12/bionic.
    > > This is quite strange and I am not able to see the reason why this can
    > > happen. The only way to debug this problem that comes to mind is to
    > > add more LOGS around the code that removes the WAL files. For example,
    > > we can try to print the value of minimumslotLSN (keep) and logSegNo in
    > > KeepLogSeg().
    >
    > That would require changing pg sources, I think, recompiling, and
    > retrying?
    >
    
    Yes. BTW, are you on the latest release of pg12, if not then you can
    once check the release notes to see if there is any related bug fix in
    the later releases?
    
    Is this problem reproducible? If so, can you find out why there are
    multiple time connection issues between walsender and walreceiver?
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  8. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Kyotaro Horiguchi <horikyota.ntt@gmail.com> — 2022-10-19T02:00:32Z

    At Tue, 18 Oct 2022 16:51:26 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
    > On Mon, Oct 17, 2022 at 12:23 PM Kyotaro Horiguchi
    > <horikyota.ntt@gmail.com> wrote:
    > >
    > > At Sun, 16 Oct 2022 10:35:17 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
    > > > On Wed, Oct 5, 2022 at 8:54 PM hubert depesz lubaczewski
    > > > > So, 4 files are missing.
    > > > >
    > > > > These were archived properly, and I tried to restore them from archive, and put
    > > > > them in pg_wal, but even then pg12 was rejecting tries to connect to focal14
    > > > > slot with the same message about "requested WAL segment
    > > > > 000000010000CA0A00000049 has already been removed"
    > > > >
    > > >
    > > > I think you are seeing this behavior because we update the
    > > > lastRemovedSegNo before removing files in RemoveOldXlogFiles() and
    > > > then we use that to give the error you are seeing.
    > >
    > > lastRemovedSegNo is updated once per one segment of removal.  Four
    > > files are lost in this case.
    > >
    > 
    > I didn't understand your response. I was saying the one possible
    > reason why even after restoring files from the archive the error
    > appears is because of the lastRemovedSegNo related check in function
    > CheckXLogRemoved() and we update its value while removing old xlog
    > files. From this behavior, it appears that somehow the server has only
    > removed those files even though the reason is not clear yet.
    
    I meant that if PostgreSQL did something wrong (that I don't
    understand at all) there, the difference between lastRemovedSegNo and
    the last actually removed segment won't be more than 1.
    
    CheckXLogRevmoed warns for a segment logically already removed but was
    physically able to be opend.  On the other hand WalSndSegmentOpen
    emits the same error earlier than CHeckXLogRemoved warns if the
    segment is actually ENOENT regardless of the value of
    lastRemovedSegNo.
    
    One point that bothers me is it seems that the walsender was
    killed. If the file is removed while the walsender is working, logical
    replication worker receives the error and emits "ERROR: could not
    receive data...has been removed" instead of suddenly disconnected as
    in this case.  Considering this together with the possibility that the
    segments were removed by someone else, I suspencted virus scanners but
    that has found to be false..  (I still don't know there's any virus
    scanner that kills processes grabbing a suspectedly-malicious file.)
    
    
    regards.
    
    -- 
    Kyotaro Horiguchi
    NTT Open Source Software Center
    
    
    
    
  9. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Amit Kapila <amit.kapila16@gmail.com> — 2022-10-19T09:46:59Z

    On Wed, Oct 19, 2022 at 7:31 AM Kyotaro Horiguchi
    <horikyota.ntt@gmail.com> wrote:
    >
    >
    > One point that bothers me is it seems that the walsender was
    > killed. If the file is removed while the walsender is working, logical
    > replication worker receives the error and emits "ERROR: could not
    > receive data...has been removed" instead of suddenly disconnected as
    > in this case.  Considering this together with the possibility that the
    > segments were removed by someone else, I suspencted virus scanners but
    > that has found to be false..
    >
    
    Yeah, as requested yesterday, maybe checking if this can be reproduced
    will help us to narrow down the cause. If this doesn't happen due to
    an application mistake it appears that after the slot becomes inactive
    during the time walsender was down, the server removes the required
    files but there is nothing in the code that suggests that such a thing
    can happen.
    
    Are there any more logs on publisher/subscriber which indicate any
    sort of error?
    
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  10. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2022-10-19T10:02:20Z

    On Wed, Oct 19, 2022 at 03:16:59PM +0530, Amit Kapila wrote:
    > Yeah, as requested yesterday, maybe checking if this can be reproduced
    > will help us to narrow down the cause. If this doesn't happen due to
    > an application mistake it appears that after the slot becomes inactive
    > during the time walsender was down, the server removes the required
    > files but there is nothing in the code that suggests that such a thing
    > can happen.
    > Are there any more logs on publisher/subscriber which indicate any
    > sort of error?
    
    This specific setup of replication was torn down, we couldn't keep it
    running. Can setup it once more, and with some luck it will break again.
    Will take a day, though.
    
    Is there anything I can do to make it more debugable, but that will not
    require restart of pg12?
    
    Best regards,
    
    depesz
    
    
    
    
    
  11. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2022-10-19T10:20:13Z

    On Tue, Oct 18, 2022 at 04:57:52PM +0530, Amit Kapila wrote:
    > On Mon, Oct 17, 2022 at 2:43 PM hubert depesz lubaczewski
    > <depesz@depesz.com> wrote:
    > >
    > > On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote:
    > > > > Wal file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visibile on pg12/bionic.
    > > > This is quite strange and I am not able to see the reason why this can
    > > > happen. The only way to debug this problem that comes to mind is to
    > > > add more LOGS around the code that removes the WAL files. For example,
    > > > we can try to print the value of minimumslotLSN (keep) and logSegNo in
    > > > KeepLogSeg().
    > >
    > > That would require changing pg sources, I think, recompiling, and
    > > retrying?
    > >
    > 
    > Yes. BTW, are you on the latest release of pg12, if not then you can
    > once check the release notes to see if there is any related bug fix in
    > the later releases?
    > 
    > Is this problem reproducible? If so, can you find out why there are
    > multiple time connection issues between walsender and walreceiver?
    
    I can try to redo it, but before I do - anything I could do to either
    side of replication to increase our chances at figuring out the
    underlying problem? I can't restart pg12, though.
    
    Best regards,
    
    depesz
    
    
    
    
    
  12. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Amit Kapila <amit.kapila16@gmail.com> — 2022-10-19T10:44:28Z

    On Wed, Oct 19, 2022 at 3:50 PM hubert depesz lubaczewski
    <depesz@depesz.com> wrote:
    >
    > On Tue, Oct 18, 2022 at 04:57:52PM +0530, Amit Kapila wrote:
    > > On Mon, Oct 17, 2022 at 2:43 PM hubert depesz lubaczewski
    > > <depesz@depesz.com> wrote:
    > > >
    > > > On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote:
    > > > > > Wal file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visibile on pg12/bionic.
    > > > > This is quite strange and I am not able to see the reason why this can
    > > > > happen. The only way to debug this problem that comes to mind is to
    > > > > add more LOGS around the code that removes the WAL files. For example,
    > > > > we can try to print the value of minimumslotLSN (keep) and logSegNo in
    > > > > KeepLogSeg().
    > > >
    > > > That would require changing pg sources, I think, recompiling, and
    > > > retrying?
    > > >
    > >
    > > Yes. BTW, are you on the latest release of pg12, if not then you can
    > > once check the release notes to see if there is any related bug fix in
    > > the later releases?
    > >
    > > Is this problem reproducible? If so, can you find out why there are
    > > multiple time connection issues between walsender and walreceiver?
    >
    > I can try to redo it, but before I do - anything I could do to either
    > side of replication to increase our chances at figuring out the
    > underlying problem? I can't restart pg12, though.
    >
    
    One idea is to change log level to DEBUG2 so that we can print which
    files are removed by the server via
    RemoveOldXlogFiles()
    {
    ...
    elog(DEBUG2, "attempting to remove WAL segments older than log file %s",
    lastoff);
    ...
    }
    
    If we can do this then at the very least we can know whether the
    required files are removed by the server or by some external
    application.
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  13. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2022-10-19T11:09:46Z

    On Wed, Oct 19, 2022 at 04:14:28PM +0530, Amit Kapila wrote:
    > One idea is to change log level to DEBUG2 so that we can print which
    > files are removed by the server via
    > RemoveOldXlogFiles()
    > {
    > ...
    > elog(DEBUG2, "attempting to remove WAL segments older than log file %s",
    > lastoff);
    > ...
    > }
    > 
    > If we can do this then at the very least we can know whether the
    > required files are removed by the server or by some external
    > application.
    
    That should be doable. Will take some time to setup test env, and run
    it. Will get back to you on this.
    
    Best regards,
    
    depesz
    
    
    
    
    
  14. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2022-10-19T11:11:06Z

    On Tue, Oct 18, 2022 at 04:57:52PM +0530, Amit Kapila wrote:
    > On Mon, Oct 17, 2022 at 2:43 PM hubert depesz lubaczewski
    > <depesz@depesz.com> wrote:
    > >
    > > On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote:
    > > > > Wal file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visibile on pg12/bionic.
    > > > This is quite strange and I am not able to see the reason why this can
    > > > happen. The only way to debug this problem that comes to mind is to
    > > > add more LOGS around the code that removes the WAL files. For example,
    > > > we can try to print the value of minimumslotLSN (keep) and logSegNo in
    > > > KeepLogSeg().
    > >
    > > That would require changing pg sources, I think, recompiling, and
    > > retrying?
    > >
    > 
    > Yes. BTW, are you on the latest release of pg12, if not then you can
    > once check the release notes to see if there is any related bug fix in
    > the later releases?
    
    We're not. We're on 12.9, and there seem to be some wal-related changes
    since then:
    https://why-upgrade.depesz.com/show?from=12.9&to=12.12&keywords=wal*
    
    will talk with responsible people about upgrading it, and then redoing
    the test.
    
    Best regards,
    
    depesz
    
    
    
    
    
  15. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2022-10-20T11:55:16Z

    On Wed, Oct 19, 2022 at 01:11:06PM +0200, hubert depesz lubaczewski wrote:
    > On Tue, Oct 18, 2022 at 04:57:52PM +0530, Amit Kapila wrote:
    > > On Mon, Oct 17, 2022 at 2:43 PM hubert depesz lubaczewski
    > > <depesz@depesz.com> wrote:
    > > >
    > > > On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote:
    > > > > > Wal file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visibile on pg12/bionic.
    > > > > This is quite strange and I am not able to see the reason why this can
    > > > > happen. The only way to debug this problem that comes to mind is to
    > > > > add more LOGS around the code that removes the WAL files. For example,
    > > > > we can try to print the value of minimumslotLSN (keep) and logSegNo in
    > > > > KeepLogSeg().
    > > >
    > > > That would require changing pg sources, I think, recompiling, and
    > > > retrying?
    > > >
    > > 
    > > Yes. BTW, are you on the latest release of pg12, if not then you can
    > > once check the release notes to see if there is any related bug fix in
    > > the later releases?
    > 
    > We're not. We're on 12.9, and there seem to be some wal-related changes
    > since then:
    > https://why-upgrade.depesz.com/show?from=12.9&to=12.12&keywords=wal*
    > 
    > will talk with responsible people about upgrading it, and then redoing
    > the test.
    
    Hi,
    just an update. Due to procedures we will upgrade to 12.12 at the
    beginning of November. Then, I'll try to replicate the problem again,
    and we'll see how this will pan out.
    
    For next test will do my best to use debug2 log level.
    
    Best regards,
    
    depesz
    
    
    
    
    
  16. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2022-11-11T14:50:40Z

    On Thu, Oct 20, 2022 at 01:55:16PM +0200, hubert depesz lubaczewski wrote:
    > Hi,
    > just an update. Due to procedures we will upgrade to 12.12 at the
    > beginning of November. Then, I'll try to replicate the problem again,
    > and we'll see how this will pan out.
    > For next test will do my best to use debug2 log level.
    
    So. I have an update.
    
    Out of 8 servers that we prepared upgrade for, one failed with the
    same/similar problem.
    Unfortunately I don't know if this is the same server as before. Sorry.
    
    Specifically, I see now, in logs on focal/pg14 side:
    
    #v+
    2022-11-11 12:45:26.432 UTC,,,994963,,636e43e6.f2e93,2,,2022-11-11 12:45:26 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
    #v-
    
    On source (bionic/pg12), I see:
    
    #v+
    =# select * from pg_replication_slots ;
     slot_name │  plugin  │ slot_type │ datoid │ database │ temporary │ active │ active_pid │  xmin  │ catalog_xmin │  restart_lsn   │ confirmed_flush_lsn
    ═══════════╪══════════╪═══════════╪════════╪══════════╪═══════════╪════════╪════════════╪════════╪══════════════╪════════════════╪═════════════════════
     focal14   │ pgoutput │ logical   │  16607 │ canvas   │ f         │ f      │     [null] │ [null] │   3241443528 │ 1039D/83825958 │ 1039D/96453F38
    (1 row)
    #v-
    
    but pg, like previously, lost some wal files, this time 8 of them:
    
    #v+
    =$ sudo ls -l ~postgres/12/main/pg_wal | head
    total 224403644
    -rw------- 1 postgres postgres      349 Feb 11  2022 000000010000BDEF000000FF.00045F68.backup
    -rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008B
    -rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008C
    -rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008D
    ...
    #v-
    
    This time we had debug2 enabled on both sides.
    
    Looks like we lost segments:
    000000010001039D00000083, 000000010001039D00000084,
    000000010001039D00000085, 000000010001039D00000086,
    000000010001039D00000087, 000000010001039D00000088,
    000000010001039D00000089, and 000000010001039D0000008A.
    
    It looks that first time replica logged the error was:
    
    #v+
    2022-11-10 21:06:16.427 UTC,,,572364,,636d67c8.8bbcc,2,,2022-11-10 21:06:16 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
    #v-
    
    In Pg logs on source I see:
    
    #v+
    2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
    2022-11-10 21:06:07.523 UTC,,,28925,,636378d6.70fd,4594903,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000083""",,,,,,,,,""
    2022-11-10 21:06:07.524 UTC,,,28925,,636378d6.70fd,4594904,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000084""",,,,,,,,,""
    2022-11-10 21:06:07.525 UTC,,,28925,,636378d6.70fd,4594905,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000085""",,,,,,,,,""
    2022-11-10 21:06:07.526 UTC,,,28925,,636378d6.70fd,4594906,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000086""",,,,,,,,,""
    2022-11-10 21:06:07.527 UTC,,,28925,,636378d6.70fd,4594907,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000087""",,,,,,,,,""
    2022-11-10 21:06:07.528 UTC,,,28925,,636378d6.70fd,4594908,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000088""",,,,,,,,,""
    2022-11-10 21:06:07.529 UTC,,,28925,,636378d6.70fd,4594909,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000089""",,,,,,,,,""
    2022-11-10 21:06:07.530 UTC,,,28925,,636378d6.70fd,4594910,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000008A""",,,,,,,,,""
    2022-11-10 21:06:16.426 UTC,"upgrayedd","canvas",28608,"10.1.238.101:45328",636d67c8.6fc0,8,"idle",2022-11-10 21:06:16 UTC,147/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
    2022-11-10 21:06:21.456 UTC,"upgrayedd","canvas",28815,"10.1.238.101:52316",636d67cd.708f,8,"idle",2022-11-10 21:06:21 UTC,147/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
    2022-11-10 21:06:26.488 UTC,"upgrayedd","canvas",28962,"10.1.238.101:52320",636d67d2.7122,8,"idle",2022-11-10 21:06:26 UTC,148/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
    #v-
    
    So, the whole set of missing files was recycled at the same time.
    
    One more thing that I have is copy of pg_replication_slots info. This data is taken periodically, around every 1 minute. Each line, aside from pg_replication_slots slots data has also output of pg_current_wal_lsn() and timestamp. Looks like this:
    
    #v+
    timestamp                pg_current_wal_lsn  slot_name  plugin    slot_type  datoid  database  temporary  active  active_pid  xmin  catalog_xmin  restart_lsn     confirmed_flush_lsn
    2022-11-10 20:50:04 UTC  103A6/C49D36F8      focal14    pgoutput  logical    16607   canvas    f          t       16794       \N    3241375739    1039D/65365F68  1039D/782908E8
    2022-11-10 20:51:03 UTC  103A6/CC957148      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241375739    1039D/65365F68  1039D/782908E8
    2022-11-10 20:52:02 UTC  103A6/D7348A30      focal14    pgoutput  logical    16607   canvas    f          t       21140       \N    3241422387    1039D/65365F68  1039D/78C94588
    2022-11-10 20:53:01 UTC  103A6/EB24F408      focal14    pgoutput  logical    16607   canvas    f          t       21140       \N    3241422387    1039D/65365F68  1039D/7C8E7D78
    2022-11-10 20:54:01 UTC  103A7/27A4778       focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241422387    1039D/65365F68  1039D/7C8E7D78
    2022-11-10 20:55:00 UTC  103A7/1A635CD0      focal14    pgoutput  logical    16607   canvas    f          t       27607       \N    3241422387    1039D/78833568  1039D/7F8BCD50
    2022-11-10 20:55:59 UTC  103A7/2D5E6EE8      focal14    pgoutput  logical    16607   canvas    f          t       27607       \N    3241422387    1039D/78833568  1039D/818B6990
    2022-11-10 20:56:58 UTC  103A7/44548000      focal14    pgoutput  logical    16607   canvas    f          t       27607       \N    3241422387    1039D/7E4CDF90  1039D/89E93620
    2022-11-10 20:57:57 UTC  103A7/579F8000      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241422387    1039D/7E4CDF90  1039D/89E93620
    2022-11-10 20:58:56 UTC  103A7/6B6C96D0      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241422387    1039D/7E4CDF90  1039D/89E93620
    2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/8DECBA18
    2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/911A8DB0
    2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/955633D0
    2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/96453F38
    2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    2022-11-10 21:09:46 UTC  103A8/233202E0      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    2022-11-10 21:10:45 UTC  103A8/2FBC9720      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    #v-
    
    Now, I can keep this situation up for some time, if more debug will be helpful,
    but I don't know how long (disk space issues).
    
    Is there anything in here that helps? Any other info I can dig out?
    
    Best regards,
    
    depesz
    
    
    
    
    
  17. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2022-11-13T08:06:34Z

    On Fri, Nov 11, 2022 at 03:50:40PM +0100, hubert depesz lubaczewski wrote:
    > #v+
    > 2022-11-11 12:45:26.432 UTC,,,994963,,636e43e6.f2e93,2,,2022-11-11 12:45:26 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
    > #v-
    
    Sooo... plot thickens.
    
    Without any changes, manual rebuild or anything, yesterday, the problem
    seems to have solved itself?!
    
    In logs on focal/pg14 I see:
    
    #v+
    2022-11-12 20:55:39.190 UTC,,,1897563,,6370084b.1cf45b,2,,2022-11-12 20:55:39 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
    #v-
    
    And this is *the last* such message.
    
    On bionic/pg12 we have in logs from pg_replication_slots:
    
    #v+
    timestamp                pg_current_wal_lsn  slot_name  plugin    slot_type  datoid  database  temporary  active  active_pid  xmin  catalog_xmin  restart_lsn     confirmed_flush_lsn
    2022-11-12 20:51:00 UTC  1041E/D3A0E540      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    2022-11-12 20:51:59 UTC  1041E/D89B6000      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    2022-11-12 20:52:58 UTC  1041E/E0547450      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    2022-11-12 20:53:58 UTC  1041E/E59634F0      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    2022-11-12 20:54:57 UTC  1041E/EBB50DE8      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    2022-11-12 20:55:55 UTC  1041E/FBBC3160      focal14    pgoutput  logical    16607   canvas    f          t       18626       \N    3241450490    1039D/9170B010  1039D/9B86EAF0
    2022-11-12 20:56:55 UTC  1041F/FF34000       focal14    pgoutput  logical    16607   canvas    f          t       18626       \N    3241469432    1039D/A21B4598  1039D/A928A6D0
    2022-11-12 20:57:54 UTC  1041F/277FAE30      focal14    pgoutput  logical    16607   canvas    f          t       18626       \N    3241480448    1039D/AD4C7A00  1039D/BA3FC840
    2022-11-12 20:58:53 UTC  1041F/319A0000      focal14    pgoutput  logical    16607   canvas    f          t       18626       \N    3241505462    1039D/C5C32398  1039D/DF899428
    2022-11-12 20:59:52 UTC  1041F/3A399688      focal14    pgoutput  logical    16607   canvas    f          t       18626       \N    3241527824    1039D/F3393280  1039E/AD9A740
    ...
    2022-11-13 08:00:44 UTC  1043F/E464B738      focal14    pgoutput  logical    16607   canvas    f          t       18626       \N    3292625412    1043F/E0F38E78  1043F/E4609628
    #v-
    
    I have no idea what has changed or why.
    
    If it helps I can provide logs, but would prefer to do it off list.
    
    Also, while the problem, with this cluster is "solved", I still have
    like 300 other clusters to upgrade, and at least 1 has hit the same
    problem today.
    
    Best regards,
    
    depesz
    
    
    
    
    
    
  18. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Amit Kapila <amit.kapila16@gmail.com> — 2022-11-14T11:38:15Z

    On Fri, Nov 11, 2022 at 8:20 PM hubert depesz lubaczewski
    <depesz@depesz.com> wrote:
    >
    > On Thu, Oct 20, 2022 at 01:55:16PM +0200, hubert depesz lubaczewski wrote:
    > > Hi,
    > > just an update. Due to procedures we will upgrade to 12.12 at the
    > > beginning of November. Then, I'll try to replicate the problem again,
    > > and we'll see how this will pan out.
    > > For next test will do my best to use debug2 log level.
    >
    > So. I have an update.
    >
    > Out of 8 servers that we prepared upgrade for, one failed with the
    > same/similar problem.
    > Unfortunately I don't know if this is the same server as before. Sorry.
    >
    > Specifically, I see now, in logs on focal/pg14 side:
    >
    > #v+
    > 2022-11-11 12:45:26.432 UTC,,,994963,,636e43e6.f2e93,2,,2022-11-11 12:45:26 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
    > #v-
    >
    > On source (bionic/pg12), I see:
    >
    > #v+
    > =# select * from pg_replication_slots ;
    >  slot_name │  plugin  │ slot_type │ datoid │ database │ temporary │ active │ active_pid │  xmin  │ catalog_xmin │  restart_lsn   │ confirmed_flush_lsn
    > ═══════════╪══════════╪═══════════╪════════╪══════════╪═══════════╪════════╪════════════╪════════╪══════════════╪════════════════╪═════════════════════
    >  focal14   │ pgoutput │ logical   │  16607 │ canvas   │ f         │ f      │     [null] │ [null] │   3241443528 │ 1039D/83825958 │ 1039D/96453F38
    > (1 row)
    > #v-
    >
    > but pg, like previously, lost some wal files, this time 8 of them:
    >
    > #v+
    > =$ sudo ls -l ~postgres/12/main/pg_wal | head
    > total 224403644
    > -rw------- 1 postgres postgres      349 Feb 11  2022 000000010000BDEF000000FF.00045F68.backup
    > -rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008B
    > -rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008C
    > -rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008D
    > ...
    > #v-
    >
    > This time we had debug2 enabled on both sides.
    >
    > Looks like we lost segments:
    > 000000010001039D00000083, 000000010001039D00000084,
    > 000000010001039D00000085, 000000010001039D00000086,
    > 000000010001039D00000087, 000000010001039D00000088,
    > 000000010001039D00000089, and 000000010001039D0000008A.
    >
    > It looks that first time replica logged the error was:
    >
    > #v+
    > 2022-11-10 21:06:16.427 UTC,,,572364,,636d67c8.8bbcc,2,,2022-11-10 21:06:16 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
    > #v-
    >
    > In Pg logs on source I see:
    >
    > #v+
    > 2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
    > 2022-11-10 21:06:07.523 UTC,,,28925,,636378d6.70fd,4594903,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000083""",,,,,,,,,""
    > 2022-11-10 21:06:07.524 UTC,,,28925,,636378d6.70fd,4594904,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000084""",,,,,,,,,""
    > 2022-11-10 21:06:07.525 UTC,,,28925,,636378d6.70fd,4594905,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000085""",,,,,,,,,""
    > 2022-11-10 21:06:07.526 UTC,,,28925,,636378d6.70fd,4594906,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000086""",,,,,,,,,""
    > 2022-11-10 21:06:07.527 UTC,,,28925,,636378d6.70fd,4594907,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000087""",,,,,,,,,""
    > 2022-11-10 21:06:07.528 UTC,,,28925,,636378d6.70fd,4594908,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000088""",,,,,,,,,""
    > 2022-11-10 21:06:07.529 UTC,,,28925,,636378d6.70fd,4594909,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000089""",,,,,,,,,""
    > 2022-11-10 21:06:07.530 UTC,,,28925,,636378d6.70fd,4594910,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000008A""",,,,,,,,,""
    > 2022-11-10 21:06:16.426 UTC,"upgrayedd","canvas",28608,"10.1.238.101:45328",636d67c8.6fc0,8,"idle",2022-11-10 21:06:16 UTC,147/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
    > 2022-11-10 21:06:21.456 UTC,"upgrayedd","canvas",28815,"10.1.238.101:52316",636d67cd.708f,8,"idle",2022-11-10 21:06:21 UTC,147/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
    > 2022-11-10 21:06:26.488 UTC,"upgrayedd","canvas",28962,"10.1.238.101:52320",636d67d2.7122,8,"idle",2022-11-10 21:06:26 UTC,148/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
    > #v-
    >
    > So, the whole set of missing files was recycled at the same time.
    >
    > One more thing that I have is copy of pg_replication_slots info. This data is taken periodically, around every 1 minute. Each line, aside from pg_replication_slots slots data has also output of pg_current_wal_lsn() and timestamp. Looks like this:
    >
    > #v+
    > timestamp                pg_current_wal_lsn  slot_name  plugin    slot_type  datoid  database  temporary  active  active_pid  xmin  catalog_xmin  restart_lsn     confirmed_flush_lsn
    ...
    ...
    > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/955633D0
    ...
    
    I think from the above two it is clear that the slot with restart_lsn
    1039D/83825958  is active and we seem to have recycled corresponding
    segment.
    
    >
    > Now, I can keep this situation up for some time, if more debug will be helpful,
    > but I don't know how long (disk space issues).
    >
    > Is there anything in here that helps? Any other info I can dig out?
    >
    
    You may have the LOG for "attempting to remove WAL segments older than
    log file %s", if so, please share. Also on similar lines, I think we
    need to add some LOGs before calling RemoveOldXlogFiles() to know
    where our computation to remove files goes off. Something, like the
    attached, would be helpful but note I have added this quickly on my
    working branch to show you what I have in mind. You may need small
    changes based on the branch you are working on.
    
    -- 
    With Regards,
    Amit Kapila.
    
  19. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2022-11-14T12:17:03Z

    On Mon, Nov 14, 2022 at 05:08:15PM +0530, Amit Kapila wrote:
    > > #v+
    > > 2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
    ...
    > > #v-
    > >
    > > So, the whole set of missing files was recycled at the same time.
    > >
    > > One more thing that I have is copy of pg_replication_slots info. This data is taken periodically, around every 1 minute. Each line, aside from pg_replication_slots slots data has also output of pg_current_wal_lsn() and timestamp. Looks like this:
    > >
    > > #v+
    > > timestamp                pg_current_wal_lsn  slot_name  plugin    slot_type  datoid  database  temporary  active  active_pid  xmin  catalog_xmin  restart_lsn     confirmed_flush_lsn
    > ...
    > ...
    > > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/955633D0
    > ...
    > 
    > I think from the above two it is clear that the slot with restart_lsn
    > 1039D/83825958  is active and we seem to have recycled corresponding
    > segment.
    > 
    > You may have the LOG for "attempting to remove WAL segments older than
    > log file %s", if so, please share. Also on similar lines, I think we
    
    Absolutely.
    
    There is something weird happening:
    
    #v+
    2022-11-10 20:01:07.520 UTC,,,28925,,636378d6.70fd,4488667,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039B000000D9",,,,,,,,,""
    2022-11-10 20:06:07.359 UTC,,,28925,,636378d6.70fd,4496840,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039B000000F8",,,,,,,,,""
    2022-11-10 20:11:07.522 UTC,,,28925,,636378d6.70fd,4505099,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039C00000027",,,,,,,,,""
    2022-11-10 20:16:07.466 UTC,,,28925,,636378d6.70fd,4513331,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039C00000054",,,,,,,,,""
    2022-11-10 20:21:07.572 UTC,,,28925,,636378d6.70fd,4521466,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039C0000008F",,,,,,,,,""
    2022-11-10 20:26:07.384 UTC,,,28925,,636378d6.70fd,4529518,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039C000000B9",,,,,,,,,""
    2022-11-10 20:31:07.619 UTC,,,28925,,636378d6.70fd,4537868,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039C000000F8",,,,,,,,,""
    2022-11-10 20:36:07.532 UTC,,,28925,,636378d6.70fd,4546079,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000002B",,,,,,,,,""
    2022-11-10 20:41:07.603 UTC,,,28925,,636378d6.70fd,4554468,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D00000052",,,,,,,,,""
    2022-11-10 20:46:07.609 UTC,,,28925,,636378d6.70fd,4562376,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000005F",,,,,,,,,""
    2022-11-10 20:51:07.513 UTC,,,28925,,636378d6.70fd,4570523,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000005F",,,,,,,,,""
    2022-11-10 20:56:07.727 UTC,,,28925,,636378d6.70fd,4578553,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D00000070",,,,,,,,,""
    2022-11-10 21:01:07.509 UTC,,,28925,,636378d6.70fd,4586721,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D00000082",,,,,,,,,""
    2022-11-10 21:06:07.521 UTC,,,28925,,636378d6.70fd,4594902,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    2022-11-10 21:11:07.402 UTC,,,28925,,636378d6.70fd,4602683,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    2022-11-10 21:16:07.837 UTC,,,28925,,636378d6.70fd,4610577,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    2022-11-10 21:21:07.571 UTC,,,28925,,636378d6.70fd,4618490,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    2022-11-10 21:26:07.559 UTC,,,28925,,636378d6.70fd,4626365,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    2022-11-10 21:31:07.649 UTC,,,28925,,636378d6.70fd,4634592,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    2022-11-10 21:36:07.737 UTC,,,28925,,636378d6.70fd,4642829,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    2022-11-10 21:41:07.278 UTC,,,28925,,636378d6.70fd,4650983,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    2022-11-10 21:46:07.670 UTC,,,28925,,636378d6.70fd,4658973,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    2022-11-10 21:51:07.579 UTC,,,28925,,636378d6.70fd,4666946,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    2022-11-10 21:56:07.618 UTC,,,28925,,636378d6.70fd,4674922,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    #v-
    
    If I'll add "recycled" lines we will get:
    
    #v+
    ...
    2022-11-10 21:01:07.509 UTC,,,28925,,636378d6.70fd,4586721,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D00000082",,,,,,,,,""
    2022-11-10 21:01:07.512 UTC,,,28925,,636378d6.70fd,4586722,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000071""",,,,,,,,,""
    2022-11-10 21:01:07.514 UTC,,,28925,,636378d6.70fd,4586723,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000072""",,,,,,,,,""
    2022-11-10 21:01:07.515 UTC,,,28925,,636378d6.70fd,4586724,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000073""",,,,,,,,,""
    2022-11-10 21:01:07.517 UTC,,,28925,,636378d6.70fd,4586725,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000074""",,,,,,,,,""
    2022-11-10 21:01:07.517 UTC,,,28925,,636378d6.70fd,4586726,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000075""",,,,,,,,,""
    2022-11-10 21:01:07.519 UTC,,,28925,,636378d6.70fd,4586727,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000076""",,,,,,,,,""
    2022-11-10 21:01:07.520 UTC,,,28925,,636378d6.70fd,4586728,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000077""",,,,,,,,,""
    2022-11-10 21:01:07.521 UTC,,,28925,,636378d6.70fd,4586729,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000078""",,,,,,,,,""
    2022-11-10 21:01:07.522 UTC,,,28925,,636378d6.70fd,4586730,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000079""",,,,,,,,,""
    2022-11-10 21:01:07.523 UTC,,,28925,,636378d6.70fd,4586731,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000007A""",,,,,,,,,""
    2022-11-10 21:01:07.524 UTC,,,28925,,636378d6.70fd,4586732,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000007B""",,,,,,,,,""
    2022-11-10 21:01:07.524 UTC,,,28925,,636378d6.70fd,4586733,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000007C""",,,,,,,,,""
    2022-11-10 21:01:07.525 UTC,,,28925,,636378d6.70fd,4586734,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000007D""",,,,,,,,,""
    2022-11-10 21:01:07.527 UTC,,,28925,,636378d6.70fd,4586735,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000007E""",,,,,,,,,""
    2022-11-10 21:01:07.528 UTC,,,28925,,636378d6.70fd,4586736,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000007F""",,,,,,,,,""
    2022-11-10 21:01:07.529 UTC,,,28925,,636378d6.70fd,4586737,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000080""",,,,,,,,,""
    2022-11-10 21:01:07.530 UTC,,,28925,,636378d6.70fd,4586738,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000081""",,,,,,,,,""
    2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
    2022-11-10 21:06:07.521 UTC,,,28925,,636378d6.70fd,4594902,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    2022-11-10 21:06:07.523 UTC,,,28925,,636378d6.70fd,4594903,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000083""",,,,,,,,,""
    2022-11-10 21:06:07.524 UTC,,,28925,,636378d6.70fd,4594904,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000084""",,,,,,,,,""
    2022-11-10 21:06:07.525 UTC,,,28925,,636378d6.70fd,4594905,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000085""",,,,,,,,,""
    2022-11-10 21:06:07.526 UTC,,,28925,,636378d6.70fd,4594906,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000086""",,,,,,,,,""
    2022-11-10 21:06:07.527 UTC,,,28925,,636378d6.70fd,4594907,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000087""",,,,,,,,,""
    2022-11-10 21:06:07.528 UTC,,,28925,,636378d6.70fd,4594908,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000088""",,,,,,,,,""
    2022-11-10 21:06:07.529 UTC,,,28925,,636378d6.70fd,4594909,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000089""",,,,,,,,,""
    2022-11-10 21:06:07.530 UTC,,,28925,,636378d6.70fd,4594910,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000008A""",,,,,,,,,""
    2022-11-10 21:11:07.402 UTC,,,28925,,636378d6.70fd,4602683,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    2022-11-10 21:16:07.837 UTC,,,28925,,636378d6.70fd,4610577,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    2022-11-10 21:21:07.571 UTC,,,28925,,636378d6.70fd,4618490,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    2022-11-10 21:26:07.559 UTC,,,28925,,636378d6.70fd,4626365,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    2022-11-10 21:31:07.649 UTC,,,28925,,636378d6.70fd,4634592,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    2022-11-10 21:36:07.737 UTC,,,28925,,636378d6.70fd,4642829,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    2022-11-10 21:41:07.278 UTC,,,28925,,636378d6.70fd,4650983,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    2022-11-10 21:46:07.670 UTC,,,28925,,636378d6.70fd,4658973,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    2022-11-10 21:51:07.579 UTC,,,28925,,636378d6.70fd,4666946,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    2022-11-10 21:56:07.618 UTC,,,28925,,636378d6.70fd,4674922,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    #v-
    
    > need to add some LOGs before calling RemoveOldXlogFiles() to know
    > where our computation to remove files goes off. Something, like the
    > attached, would be helpful but note I have added this quickly on my
    > working branch to show you what I have in mind. You may need small
    > changes based on the branch you are working on.
    
    That will be complicated.
    
    Changing pg (recompile, and rerun) requires careful preparation with our
    customers. Which takes time.
    
    Additionally we are using precompiled binaries from pgdg, that makes the
    process slightly more involved.
    
    And to top it off, as I mailed in the other message - the problem doesn't
    happen always. For 8 clusters that I tried to prepare problem happened on two,
    and on one of them (this one actually) it kinda solved on its own?!
    
    I know I'm asking a lot, but is there any chance we could setup
    conference/shared screen session to try to get as much data as possible
    out of the situations we have now? Of course that would be paid
    consulting. If you're OK with it, please mail me off list, so we can
    arrange it.
    
    Best regards,
    
    depesz
    
    
    
    
    
  20. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Amit Kapila <amit.kapila16@gmail.com> — 2022-11-14T13:00:57Z

    On Mon, Nov 14, 2022 at 5:47 PM hubert depesz lubaczewski
    <depesz@depesz.com> wrote:
    >
    > On Mon, Nov 14, 2022 at 05:08:15PM +0530, Amit Kapila wrote:
    > > > #v+
    > > > 2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
    > ...
    > > > #v-
    > > >
    > > > So, the whole set of missing files was recycled at the same time.
    > > >
    > > > One more thing that I have is copy of pg_replication_slots info. This data is taken periodically, around every 1 minute. Each line, aside from pg_replication_slots slots data has also output of pg_current_wal_lsn() and timestamp. Looks like this:
    > > >
    > > > #v+
    > > > timestamp                pg_current_wal_lsn  slot_name  plugin    slot_type  datoid  database  temporary  active  active_pid  xmin  catalog_xmin  restart_lsn     confirmed_flush_lsn
    > > ...
    > > ...
    > > > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/955633D0
    > > ...
    > >
    > > I think from the above two it is clear that the slot with restart_lsn
    > > 1039D/83825958  is active and we seem to have recycled corresponding
    > > segment.
    > >
    > > You may have the LOG for "attempting to remove WAL segments older than
    > > log file %s", if so, please share. Also on similar lines, I think we
    >
    > Absolutely.
    >
    > There is something weird happening:
    >
    
    What exactly weird you are seeing in this? To me, it appears as if the
    system due to some reason ignores an existing slot that has
    restart_lsn as 1039D/83825958.
    
    >
    > If I'll add "recycled" lines we will get:
    >
    > #v+
    > ...
    > 2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
    > 2022-11-10 21:06:07.521 UTC,,,28925,,636378d6.70fd,4594902,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
    >
    
    Actually, the last LOG was what I was not expecting because of the
    existence of a corresponding slot. This is what we have to find out
    how it has computed the value "000000000001039D0000008A" for the file
    to be removed.
    
    > > need to add some LOGs before calling RemoveOldXlogFiles() to know
    > > where our computation to remove files goes off. Something, like the
    > > attached, would be helpful but note I have added this quickly on my
    > > working branch to show you what I have in mind. You may need small
    > > changes based on the branch you are working on.
    >
    > That will be complicated.
    >
    > Changing pg (recompile, and rerun) requires careful preparation with our
    > customers. Which takes time.
    >
    > Additionally we are using precompiled binaries from pgdg, that makes the
    > process slightly more involved.
    >
    
    I can understand but OTOH it is not easy to proceed without more information.
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  21. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Amit Kapila <amit.kapila16@gmail.com> — 2022-11-14T13:11:59Z

    On Sun, Nov 13, 2022 at 1:36 PM hubert depesz lubaczewski
    <depesz@depesz.com> wrote:
    >
    > On Fri, Nov 11, 2022 at 03:50:40PM +0100, hubert depesz lubaczewski wrote:
    > > #v+
    > > 2022-11-11 12:45:26.432 UTC,,,994963,,636e43e6.f2e93,2,,2022-11-11 12:45:26 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
    > > #v-
    >
    > Sooo... plot thickens.
    >
    > Without any changes, manual rebuild or anything, yesterday, the problem
    > seems to have solved itself?!
    >
    > In logs on focal/pg14 I see:
    >
    > #v+
    > 2022-11-12 20:55:39.190 UTC,,,1897563,,6370084b.1cf45b,2,,2022-11-12 20:55:39 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
    > #v-
    >
    > And this is *the last* such message.
    >
    > On bionic/pg12 we have in logs from pg_replication_slots:
    >
    > #v+
    > timestamp                pg_current_wal_lsn  slot_name  plugin    slot_type  datoid  database  temporary  active  active_pid  xmin  catalog_xmin  restart_lsn     confirmed_flush_lsn
    > 2022-11-12 20:51:00 UTC  1041E/D3A0E540      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > 2022-11-12 20:51:59 UTC  1041E/D89B6000      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > 2022-11-12 20:52:58 UTC  1041E/E0547450      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > 2022-11-12 20:53:58 UTC  1041E/E59634F0      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > 2022-11-12 20:54:57 UTC  1041E/EBB50DE8      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > 2022-11-12 20:55:55 UTC  1041E/FBBC3160      focal14    pgoutput  logical    16607   canvas    f          t       18626       \N    3241450490    1039D/9170B010  1039D/9B86EAF0
    >
    ...
    ...
    
    From the last two lines above, it is clear why it started working. The
    restart_lsn has advanced from 1039D/83825958 to 1039D/9170B010 which
    means the system no longer needs data from 000000010001039D00000083.
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  22. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2022-11-14T13:24:34Z

    On Mon, Nov 14, 2022 at 06:41:59PM +0530, Amit Kapila wrote:
    > From the last two lines above, it is clear why it started working. The
    > restart_lsn has advanced from 1039D/83825958 to 1039D/9170B010 which
    > means the system no longer needs data from 000000010001039D00000083.
    
    Sure, but how could it advance if the wal wasn't there?
    
    Best regards,
    
    depesz
    
    
    
    
    
  23. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2022-11-14T13:33:28Z

    On Mon, Nov 14, 2022 at 06:30:57PM +0530, Amit Kapila wrote:
    > > There is something weird happening:
    > What exactly weird you are seeing in this? To me, it appears as if the
    > system due to some reason ignores an existing slot that has
    > restart_lsn as 1039D/83825958.
    
    The weird part for me is that it is trying to remove wal files older
    than the same "x" many times.
    But perhaps it just means that the code that does it doesn't check if it
    did the same previously.
    
    Best regards,
    
    depesz
    
    
    
    
    
  24. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Amit Kapila <amit.kapila16@gmail.com> — 2022-11-14T16:22:05Z

    On Mon, Nov 14, 2022 at 7:03 PM hubert depesz lubaczewski
    <depesz@depesz.com> wrote:
    >
    > On Mon, Nov 14, 2022 at 06:30:57PM +0530, Amit Kapila wrote:
    > > > There is something weird happening:
    > > What exactly weird you are seeing in this? To me, it appears as if the
    > > system due to some reason ignores an existing slot that has
    > > restart_lsn as 1039D/83825958.
    >
    > The weird part for me is that it is trying to remove wal files older
    > than the same "x" many times.
    >
    
    I think that is okay because as per checkpointer's computation it
    decides not to remove/replace any new WAL files. At this stage, I am
    not getting any idea except for getting the value of
    XLogGetReplicationSlotMinimumLSN() in one of the LOG prints. If you
    can't add all the LOGs, I shared in the last patch, can you try to get
    the value of XLogGetReplicationSlotMinimumLSN() by appending to the
    existing LOG "attempting to remove WAL segments older than log file
    .."?
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  25. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2022-11-14T16:51:40Z

    On Mon, Nov 14, 2022 at 09:52:05PM +0530, Amit Kapila wrote:
    > On Mon, Nov 14, 2022 at 7:03 PM hubert depesz lubaczewski
    > <depesz@depesz.com> wrote:
    > >
    > > On Mon, Nov 14, 2022 at 06:30:57PM +0530, Amit Kapila wrote:
    > > > > There is something weird happening:
    > > > What exactly weird you are seeing in this? To me, it appears as if the
    > > > system due to some reason ignores an existing slot that has
    > > > restart_lsn as 1039D/83825958.
    > >
    > > The weird part for me is that it is trying to remove wal files older
    > > than the same "x" many times.
    > >
    > 
    > I think that is okay because as per checkpointer's computation it
    > decides not to remove/replace any new WAL files. At this stage, I am
    > not getting any idea except for getting the value of
    > XLogGetReplicationSlotMinimumLSN() in one of the LOG prints. If you
    > can't add all the LOGs, I shared in the last patch, can you try to get
    > the value of XLogGetReplicationSlotMinimumLSN() by appending to the
    > existing LOG "attempting to remove WAL segments older than log file
    > .."?
    
    If I'll get *any* patch approved, and restart of client db, then it
    doesn't matter how many LOGs there will be :)
    
    Will get back to you as soon as I will have more info.
    
    Best regards,
    
    depesz
    
    
    
    
    
  26. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Masahiko Sawada <sawada.mshk@gmail.com> — 2022-11-15T02:20:19Z

    On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
    <depesz@depesz.com> wrote:
    >
    > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/8DECBA18
    > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/911A8DB0
    > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/955633D0
    > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/96453F38
    > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    
    It happened twice that the walsender process was active for at least
    one minute while not advancing restart_lsn and exited. Could you share
    the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
    can find out what the wal sender processes were doing?
    
    Also, you may have logs for "failed to increase restart lsn: proposed
    %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
    flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
    please share it as well.
    
    I've tried to reproduce this issue with a similar setting in a loop
    but could not.
    
    Given that you have the log "attempting to remove WAL segments older
    than log file 000000000001039D0000008A", the last WAL file we need to
    retain calculated by KeepLogSeg() was wrong.
    
    KeepLogSeg() retreated the last segment that we need to retain while
    considering replication slots' minimum LSN (and wal_keep_segments) but
    it seemed not to work fine. The possibilities are (1) we got the wrong
    slots' minimum LSN that is newer than a slot's LSN (ie, retreated
    logSegNo but it's still newer than required WAL files) or (2) we got
    an invalid slots's minimum LSN (ie, didn't retreated at all). If we
    know what the RedoRecPtr was, we can calculate _logSegNo that has not
    been retreated by KeepLogSeg().
    
    BTW in LogicalConfirmReceivedLocation(), we set restart_lsn as follows:
    
            if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
                MyReplicationSlot->candidate_restart_valid <= lsn)
            {
                Assert(MyReplicationSlot->candidate_restart_lsn !=
    InvalidXLogRecPtr);
    
                MyReplicationSlot->data.restart_lsn =
    MyReplicationSlot->candidate_restart_lsn;
                MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
                MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
                updated_restart = true;
            }
    
    Is it worth changing the assertion to elog(ERROR) so that we can
    detect that an invalid value is set to restart_lsn?
    
    Regards,
    
    -- 
    Masahiko Sawada
    Amazon Web Services: https://aws.amazon.com
    
    
    
    
  27. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Amit Kapila <amit.kapila16@gmail.com> — 2022-11-15T04:51:18Z

    On Tue, Nov 15, 2022 at 7:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
    >
    > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
    > <depesz@depesz.com> wrote:
    > >
    > > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/8DECBA18
    > > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/955633D0
    > > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/96453F38
    > > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    >
    > It happened twice that the walsender process was active for at least
    > one minute while not advancing restart_lsn and exited. Could you share
    > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
    > can find out what the wal sender processes were doing?
    >
    > Also, you may have logs for "failed to increase restart lsn: proposed
    > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
    > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
    > please share it as well.
    >
    > I've tried to reproduce this issue with a similar setting in a loop
    > but could not.
    >
    > Given that you have the log "attempting to remove WAL segments older
    > than log file 000000000001039D0000008A", the last WAL file we need to
    > retain calculated by KeepLogSeg() was wrong.
    >
    > KeepLogSeg() retreated the last segment that we need to retain while
    > considering replication slots' minimum LSN (and wal_keep_segments) but
    > it seemed not to work fine. The possibilities are (1) we got the wrong
    > slots' minimum LSN that is newer than a slot's LSN (ie, retreated
    > logSegNo but it's still newer than required WAL files) or (2) we got
    > an invalid slots's minimum LSN (ie, didn't retreated at all). If we
    > know what the RedoRecPtr was, we can calculate _logSegNo that has not
    > been retreated by KeepLogSeg().
    >
    
    The other possibility is that the slot is not considered for minimum
    LSN computation (say in_use flag is temporarily set to false).
    
    > BTW in LogicalConfirmReceivedLocation(), we set restart_lsn as follows:
    >
    >         if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
    >             MyReplicationSlot->candidate_restart_valid <= lsn)
    >         {
    >             Assert(MyReplicationSlot->candidate_restart_lsn !=
    > InvalidXLogRecPtr);
    >
    >             MyReplicationSlot->data.restart_lsn =
    > MyReplicationSlot->candidate_restart_lsn;
    >             MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
    >             MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
    >             updated_restart = true;
    >         }
    >
    > Is it worth changing the assertion to elog(ERROR) so that we can
    > detect that an invalid value is set to restart_lsn?
    >
    
    This is worth considering but note that the caller of this has an
    assertion for invalid values being assigned to candidate_restart_*
    variables. Also, wouldn't be shown as invalid for restart_lsn in
    pg_replication_slots?
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  28. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Amit Kapila <amit.kapila16@gmail.com> — 2022-11-15T05:11:44Z

    On Tue, Nov 15, 2022 at 7:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
    >
    > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
    > <depesz@depesz.com> wrote:
    > >
    > > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/8DECBA18
    > > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/955633D0
    > > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/96453F38
    > > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    >
    > It happened twice that the walsender process was active for at least
    > one minute while not advancing restart_lsn and exited. Could you share
    > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
    > can find out what the wal sender processes were doing?
    >
    > Also, you may have logs for "failed to increase restart lsn: proposed
    > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
    > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
    > please share it as well.
    >
    > I've tried to reproduce this issue with a similar setting in a loop
    > but could not.
    >
    > Given that you have the log "attempting to remove WAL segments older
    > than log file 000000000001039D0000008A", the last WAL file we need to
    > retain calculated by KeepLogSeg() was wrong.
    >
    > KeepLogSeg() retreated the last segment that we need to retain while
    > considering replication slots' minimum LSN (and wal_keep_segments) but
    > it seemed not to work fine. The possibilities are (1) we got the wrong
    > slots' minimum LSN that is newer than a slot's LSN (ie, retreated
    > logSegNo but it's still newer than required WAL files) or (2) we got
    > an invalid slots's minimum LSN (ie, didn't retreated at all). If we
    > know what the RedoRecPtr was, we can calculate _logSegNo that has not
    > been retreated by KeepLogSeg().
    >
    
    I was checking but didn't find if we have RedoRecPtr-related
    information available. I think it could be valuable to have RedoRecPtr
    and probably replicationSlotMinLSN, lastRemovedSegNo, etc. kind of
    information available via some view (either in pg_stat_bgwriter or a
    similar view) to debug this kind of issue.
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  29. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2022-11-15T07:55:36Z

    On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote:
    > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
    > <depesz@depesz.com> wrote:
    > >
    > > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/8DECBA18
    > > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/955633D0
    > > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/96453F38
    > > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > 
    > It happened twice that the walsender process was active for at least
    > one minute while not advancing restart_lsn and exited. Could you share
    > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
    > can find out what the wal sender processes were doing?
    
    That will be ~ 270MB. Not sure how that would fly, plus that's awfully
    lot of logs to check if I don't leak any sensitive info.
    
    But, if I'll remove lines that are generated by walsender to pg12
    replicas (streaming replication to bionic/pg12 replicas), the logfile is
    only ~ 3MB. Still a bit too much, as there can be some queries with
    sensitive info.
    
    Perhaps I can grep something out?
    
    > Also, you may have logs for "failed to increase restart lsn: proposed
    > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
    > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
    > please share it as well.
    
    Found one in the time frame you mentioned:
    2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
    
    > I've tried to reproduce this issue with a similar setting in a loop
    > but could not.
    
    It seems to be rather elusive. Which is why I'm kinda "happy" that I can with
    high level of luck replicate it.
    
    Question might be 100% stupid, sorry if it is - any chance that we're on arm64
    has something to do with it?
    
    Best regards,
    
    depesz
    
    
    
    
    
  30. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Masahiko Sawada <sawada.mshk@gmail.com> — 2022-11-15T09:01:27Z

    On Tue, Nov 15, 2022 at 4:55 PM hubert depesz lubaczewski
    <depesz@depesz.com> wrote:
    >
    > On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote:
    > > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
    > > <depesz@depesz.com> wrote:
    > > >
    > > > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/8DECBA18
    > > > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/955633D0
    > > > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/96453F38
    > > > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > >
    > > It happened twice that the walsender process was active for at least
    > > one minute while not advancing restart_lsn and exited. Could you share
    > > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
    > > can find out what the wal sender processes were doing?
    >
    > That will be ~ 270MB. Not sure how that would fly, plus that's awfully
    > lot of logs to check if I don't leak any sensitive info.
    >
    > But, if I'll remove lines that are generated by walsender to pg12
    > replicas (streaming replication to bionic/pg12 replicas), the logfile is
    > only ~ 3MB. Still a bit too much, as there can be some queries with
    > sensitive info.
    >
    > Perhaps I can grep something out?
    >
    > > Also, you may have logs for "failed to increase restart lsn: proposed
    > > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
    > > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
    > > please share it as well.
    >
    > Found one in the time frame you mentioned:
    > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
    >
    
    Thanks!
    
    LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
    slot's restart_lsn to this LSN, we remove WAL files older than
    000000000001039D0000008A.
    
    In LogicalIncreaseRestartDecodingForSlot(), since
    "current_lsn(1039D/9170B010) <
    slot->data.confirmed_flush(1039D/91F41B50)", we executed the following
    part and called LogicalConfirmReceivedLocation():
    
        else if (current_lsn <= slot->data.confirmed_flush)
        {
            slot->candidate_restart_valid = current_lsn;
            slot->candidate_restart_lsn = restart_lsn;
    
            /* our candidate can directly be used */
            updated_lsn = true;
        }
    
    Then in LogicalConfirmReceivedLocation(), IIUC we executed the following code:
    
           if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
               MyReplicationSlot->candidate_restart_valid <= lsn)
           {
               Assert(MyReplicationSlot->candidate_restart_lsn !=
    InvalidXLogRecPtr);
    
               MyReplicationSlot->data.restart_lsn =
    MyReplicationSlot->candidate_restart_lsn;
               MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
               MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
               updated_restart = true;
           }
    
    So you should have logs for "updated xmin: %u restart: %u", could you check it?
    
    After the above codes, we do:
    
            SpinLockRelease(&MyReplicationSlot->mutex);
    
            /* first write new xmin to disk, so we know what's up after a crash */
            if (updated_xmin || updated_restart)
            {
                ReplicationSlotMarkDirty();
                ReplicationSlotSave();
                elog(DEBUG1, "updated xmin: %u restart: %u", updated_xmin,
    updated_restart);
            }
    
    Is something like the following scenario possible to happen?
    
    1. wal sender updates slot's restart_lsn and releases the spin lock
    (not saved in the disk yet)
    2. someone updates slots' minimum restart_lsn (note that slot's
    restart_lsn in memory is already updated).
    3. checkpointer removes WAL files older than the minimum restart_lsn
    calculated at step 2.
    4. wal sender restarts for some reason (or server crashed).
    5. after the restarts, wal sender acquires the slot but the slot's
    restart_lsn is old.
    6. WAL files required for restart the decoding are already removed.
    
    Regards,
    
    -- 
    Masahiko Sawada
    Amazon Web Services: https://aws.amazon.com
    
    
    
    
  31. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Masahiko Sawada <sawada.mshk@gmail.com> — 2022-11-15T14:59:37Z

    On Tue, Nov 15, 2022 at 6:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
    >
    > On Tue, Nov 15, 2022 at 4:55 PM hubert depesz lubaczewski
    > <depesz@depesz.com> wrote:
    > >
    > > On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote:
    > > > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
    > > > <depesz@depesz.com> wrote:
    > > > >
    > > > > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/8DECBA18
    > > > > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > > > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > > > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > > > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/955633D0
    > > > > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/96453F38
    > > > > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > > > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > > > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > > > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > >
    > > > It happened twice that the walsender process was active for at least
    > > > one minute while not advancing restart_lsn and exited. Could you share
    > > > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
    > > > can find out what the wal sender processes were doing?
    > >
    > > That will be ~ 270MB. Not sure how that would fly, plus that's awfully
    > > lot of logs to check if I don't leak any sensitive info.
    > >
    > > But, if I'll remove lines that are generated by walsender to pg12
    > > replicas (streaming replication to bionic/pg12 replicas), the logfile is
    > > only ~ 3MB. Still a bit too much, as there can be some queries with
    > > sensitive info.
    > >
    > > Perhaps I can grep something out?
    > >
    > > > Also, you may have logs for "failed to increase restart lsn: proposed
    > > > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
    > > > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
    > > > please share it as well.
    > >
    > > Found one in the time frame you mentioned:
    > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
    > >
    >
    > Thanks!
    >
    > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
    > slot's restart_lsn to this LSN, we remove WAL files older than
    > 000000000001039D0000008A.
    >
    > In LogicalIncreaseRestartDecodingForSlot(), since
    > "current_lsn(1039D/9170B010) <
    > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following
    > part and called LogicalConfirmReceivedLocation():
    >
    >     else if (current_lsn <= slot->data.confirmed_flush)
    >     {
    >         slot->candidate_restart_valid = current_lsn;
    >         slot->candidate_restart_lsn = restart_lsn;
    >
    >         /* our candidate can directly be used */
    >         updated_lsn = true;
    >     }
    >
    > Then in LogicalConfirmReceivedLocation(), IIUC we executed the following code:
    >
    >        if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
    >            MyReplicationSlot->candidate_restart_valid <= lsn)
    >        {
    >            Assert(MyReplicationSlot->candidate_restart_lsn !=
    > InvalidXLogRecPtr);
    >
    >            MyReplicationSlot->data.restart_lsn =
    > MyReplicationSlot->candidate_restart_lsn;
    >            MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
    >            MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
    >            updated_restart = true;
    >        }
    >
    > So you should have logs for "updated xmin: %u restart: %u", could you check it?
    >
    > After the above codes, we do:
    >
    >         SpinLockRelease(&MyReplicationSlot->mutex);
    >
    >         /* first write new xmin to disk, so we know what's up after a crash */
    >         if (updated_xmin || updated_restart)
    >         {
    >             ReplicationSlotMarkDirty();
    >             ReplicationSlotSave();
    >             elog(DEBUG1, "updated xmin: %u restart: %u", updated_xmin,
    > updated_restart);
    >         }
    >
    > Is something like the following scenario possible to happen?
    >
    > 1. wal sender updates slot's restart_lsn and releases the spin lock
    > (not saved in the disk yet)
    > 2. someone updates slots' minimum restart_lsn (note that slot's
    > restart_lsn in memory is already updated).
    > 3. checkpointer removes WAL files older than the minimum restart_lsn
    > calculated at step 2.
    > 4. wal sender restarts for some reason (or server crashed).
    > 5. after the restarts, wal sender acquires the slot but the slot's
    > restart_lsn is old.
    > 6. WAL files required for restart the decoding are already removed.
    
    I'm not sure that the reported issue happened in the above scenario
    but I've confirmed this scenario happened on my environment by adding
    sleeps and killing the wal sender process. After the server restarts,
    pg_replication_slots.restart_lsn retreated from 0/5390F78 to
    0/1659FC0, and the walsender raised an error "requested WAL segment
    000000010000000000000001 has already been removed".
    
    Regards,
    
    -- 
    Masahiko Sawada
    Amazon Web Services: https://aws.amazon.com
    
    
    
    
  32. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Amit Kapila <amit.kapila16@gmail.com> — 2022-11-16T04:22:06Z

    On Tue, Nov 15, 2022 at 2:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
    >
    > On Tue, Nov 15, 2022 at 4:55 PM hubert depesz lubaczewski
    > <depesz@depesz.com> wrote:
    > >
    > > On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote:
    > > > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
    > > > <depesz@depesz.com> wrote:
    > > > >
    > > > > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/8DECBA18
    > > > > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > > > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > > > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > > > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/955633D0
    > > > > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/96453F38
    > > > > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > > > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > > > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > > > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > >
    > > > It happened twice that the walsender process was active for at least
    > > > one minute while not advancing restart_lsn and exited. Could you share
    > > > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
    > > > can find out what the wal sender processes were doing?
    > >
    > > That will be ~ 270MB. Not sure how that would fly, plus that's awfully
    > > lot of logs to check if I don't leak any sensitive info.
    > >
    > > But, if I'll remove lines that are generated by walsender to pg12
    > > replicas (streaming replication to bionic/pg12 replicas), the logfile is
    > > only ~ 3MB. Still a bit too much, as there can be some queries with
    > > sensitive info.
    > >
    > > Perhaps I can grep something out?
    > >
    > > > Also, you may have logs for "failed to increase restart lsn: proposed
    > > > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
    > > > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
    > > > please share it as well.
    > >
    > > Found one in the time frame you mentioned:
    > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
    > >
    >
    > Thanks!
    >
    > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
    > slot's restart_lsn to this LSN, we remove WAL files older than
    > 000000000001039D0000008A.
    >
    > In LogicalIncreaseRestartDecodingForSlot(), since
    > "current_lsn(1039D/9170B010) <
    > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following
    > part and called LogicalConfirmReceivedLocation():
    >
    >     else if (current_lsn <= slot->data.confirmed_flush)
    >     {
    >         slot->candidate_restart_valid = current_lsn;
    >         slot->candidate_restart_lsn = restart_lsn;
    >
    >         /* our candidate can directly be used */
    >         updated_lsn = true;
    >     }
    >
    
    If this would have been executed in
    LogicalIncreaseRestartDecodingForSlot(), then the values displayed in
    the above DEBUG messages "current candidate 1039D/83825958, current
    after 1039D/8B5773D8" should be the same as proposed and after
    "proposed 1039D/8B5773D8, after 1039D/9170B010". Am, I missing
    something?
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  33. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Masahiko Sawada <sawada.mshk@gmail.com> — 2022-11-16T06:13:28Z

    On Wed, Nov 16, 2022 at 1:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    >
    > On Tue, Nov 15, 2022 at 2:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
    > >
    > > On Tue, Nov 15, 2022 at 4:55 PM hubert depesz lubaczewski
    > > <depesz@depesz.com> wrote:
    > > >
    > > > On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote:
    > > > > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
    > > > > <depesz@depesz.com> wrote:
    > > > > >
    > > > > > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/8DECBA18
    > > > > > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > > > > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > > > > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > > > > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/955633D0
    > > > > > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/96453F38
    > > > > > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > > > > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > > > > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > > > > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > > >
    > > > > It happened twice that the walsender process was active for at least
    > > > > one minute while not advancing restart_lsn and exited. Could you share
    > > > > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
    > > > > can find out what the wal sender processes were doing?
    > > >
    > > > That will be ~ 270MB. Not sure how that would fly, plus that's awfully
    > > > lot of logs to check if I don't leak any sensitive info.
    > > >
    > > > But, if I'll remove lines that are generated by walsender to pg12
    > > > replicas (streaming replication to bionic/pg12 replicas), the logfile is
    > > > only ~ 3MB. Still a bit too much, as there can be some queries with
    > > > sensitive info.
    > > >
    > > > Perhaps I can grep something out?
    > > >
    > > > > Also, you may have logs for "failed to increase restart lsn: proposed
    > > > > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
    > > > > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
    > > > > please share it as well.
    > > >
    > > > Found one in the time frame you mentioned:
    > > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
    > > >
    > >
    > > Thanks!
    > >
    > > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
    > > slot's restart_lsn to this LSN, we remove WAL files older than
    > > 000000000001039D0000008A.
    > >
    > > In LogicalIncreaseRestartDecodingForSlot(), since
    > > "current_lsn(1039D/9170B010) <
    > > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following
    > > part and called LogicalConfirmReceivedLocation():
    > >
    > >     else if (current_lsn <= slot->data.confirmed_flush)
    > >     {
    > >         slot->candidate_restart_valid = current_lsn;
    > >         slot->candidate_restart_lsn = restart_lsn;
    > >
    > >         /* our candidate can directly be used */
    > >         updated_lsn = true;
    > >     }
    > >
    >
    > If this would have been executed in
    > LogicalIncreaseRestartDecodingForSlot(), then the values displayed in
    > the above DEBUG messages "current candidate 1039D/83825958, current
    > after 1039D/8B5773D8" should be the same as proposed and after
    > "proposed 1039D/8B5773D8, after 1039D/9170B010". Am, I missing
    > something?
    
    Oh, you're right.
    
    Given restart_lsn was 1039D/8B5773D8, slot->data.restart_lsn was equal
    to or greater than 1039D/8B5773D8 at that time but
    slot->candidate_restart_lsn was 1039D/83825958, right? Which is weird.
    
    Regards,
    
    -- 
    Masahiko Sawada
    Amazon Web Services: https://aws.amazon.com
    
    
    
    
  34. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Amit Kapila <amit.kapila16@gmail.com> — 2022-11-16T06:38:15Z

    On Wed, Nov 16, 2022 at 11:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
    >
    > On Wed, Nov 16, 2022 at 1:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > >
    > > On Tue, Nov 15, 2022 at 2:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
    > > >
    > > > On Tue, Nov 15, 2022 at 4:55 PM hubert depesz lubaczewski
    > > > <depesz@depesz.com> wrote:
    > > > >
    > > > > On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote:
    > > > > > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
    > > > > > <depesz@depesz.com> wrote:
    > > > > > >
    > > > > > > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/8DECBA18
    > > > > > > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > > > > > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > > > > > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
    > > > > > > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/955633D0
    > > > > > > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/96453F38
    > > > > > > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > > > > > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > > > > > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > > > > > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
    > > > > >
    > > > > > It happened twice that the walsender process was active for at least
    > > > > > one minute while not advancing restart_lsn and exited. Could you share
    > > > > > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
    > > > > > can find out what the wal sender processes were doing?
    > > > >
    > > > > That will be ~ 270MB. Not sure how that would fly, plus that's awfully
    > > > > lot of logs to check if I don't leak any sensitive info.
    > > > >
    > > > > But, if I'll remove lines that are generated by walsender to pg12
    > > > > replicas (streaming replication to bionic/pg12 replicas), the logfile is
    > > > > only ~ 3MB. Still a bit too much, as there can be some queries with
    > > > > sensitive info.
    > > > >
    
    We are primarily interested in LOGs that display lsn values. Can you
    try to get that or if possible remove sensitive queries information
    and then send the remaining LOGs?
    
    > > > > Perhaps I can grep something out?
    > > > >
    > > > > > Also, you may have logs for "failed to increase restart lsn: proposed
    > > > > > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
    > > > > > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
    > > > > > please share it as well.
    > > > >
    > > > > Found one in the time frame you mentioned:
    > > > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
    > > > >
    > > >
    > > > Thanks!
    > > >
    > > > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
    > > > slot's restart_lsn to this LSN, we remove WAL files older than
    > > > 000000000001039D0000008A.
    > > >
    > > > In LogicalIncreaseRestartDecodingForSlot(), since
    > > > "current_lsn(1039D/9170B010) <
    > > > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following
    > > > part and called LogicalConfirmReceivedLocation():
    > > >
    > > >     else if (current_lsn <= slot->data.confirmed_flush)
    > > >     {
    > > >         slot->candidate_restart_valid = current_lsn;
    > > >         slot->candidate_restart_lsn = restart_lsn;
    > > >
    > > >         /* our candidate can directly be used */
    > > >         updated_lsn = true;
    > > >     }
    > > >
    > >
    > > If this would have been executed in
    > > LogicalIncreaseRestartDecodingForSlot(), then the values displayed in
    > > the above DEBUG messages "current candidate 1039D/83825958, current
    > > after 1039D/8B5773D8" should be the same as proposed and after
    > > "proposed 1039D/8B5773D8, after 1039D/9170B010". Am, I missing
    > > something?
    >
    > Oh, you're right.
    >
    > Given restart_lsn was 1039D/8B5773D8, slot->data.restart_lsn was equal
    > to or greater than 1039D/8B5773D8 at that time but
    > slot->candidate_restart_lsn was 1039D/83825958, right? Which is weird.
    >
    
    Yes, that is weird but it had been a bit obvious if the same LOG would
    have printed slot->data.restart_lsn. This means that somehow slot's
    'candidate_restart_lsn' somehow went behind its 'restart_lsn'. I can't
    figure out yet how that can happen but if that happens then the slot's
    restart_lsn can retreat in LogicalConfirmReceivedLocation() because we
    don't check if slot's candidate_restart_lsn is lesser than its
    restart_lsn before assigning the same in line
    MyReplicationSlot->data.restart_lsn =
    MyReplicationSlot->candidate_restart_lsn;. I think that should be
    checked. Note that we call LogicalConfirmReceivedLocation() can be
    called from ProcessStandbyReplyMessage(), so once the wrong
    candidate_restart_lsn is set, it can be assigned to restart_lsn from
    other code paths as well.
    
    I am not able to think how 'candidate_restart_lsn' can be set to an
    LSN value prior to 'restart_lsn'. For that, we probably need more
    LOGs. I hope that depesz can share more information from LOGs as that
    might give us some clues.
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  35. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Amit Kapila <amit.kapila16@gmail.com> — 2022-11-16T07:13:39Z

    On Wed, Nov 16, 2022 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    >
    > On Wed, Nov 16, 2022 at 11:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
    > >
    > > > > > Found one in the time frame you mentioned:
    > > > > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
    > > > > >
    > > > >
    > > > > Thanks!
    > > > >
    > > > > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
    > > > > slot's restart_lsn to this LSN, we remove WAL files older than
    > > > > 000000000001039D0000008A.
    > > > >
    > > > > In LogicalIncreaseRestartDecodingForSlot(), since
    > > > > "current_lsn(1039D/9170B010) <
    > > > > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following
    > > > > part and called LogicalConfirmReceivedLocation():
    > > > >
    > > > >     else if (current_lsn <= slot->data.confirmed_flush)
    > > > >     {
    > > > >         slot->candidate_restart_valid = current_lsn;
    > > > >         slot->candidate_restart_lsn = restart_lsn;
    > > > >
    > > > >         /* our candidate can directly be used */
    > > > >         updated_lsn = true;
    > > > >     }
    > > > >
    > > >
    > > > If this would have been executed in
    > > > LogicalIncreaseRestartDecodingForSlot(), then the values displayed in
    > > > the above DEBUG messages "current candidate 1039D/83825958, current
    > > > after 1039D/8B5773D8" should be the same as proposed and after
    > > > "proposed 1039D/8B5773D8, after 1039D/9170B010". Am, I missing
    > > > something?
    > >
    > > Oh, you're right.
    > >
    > > Given restart_lsn was 1039D/8B5773D8, slot->data.restart_lsn was equal
    > > to or greater than 1039D/8B5773D8 at that time but
    > > slot->candidate_restart_lsn was 1039D/83825958, right? Which is weird.
    > >
    >
    > Yes, that is weird but it had been a bit obvious if the same LOG would
    > have printed slot->data.restart_lsn. This means that somehow slot's
    > 'candidate_restart_lsn' somehow went behind its 'restart_lsn'. I can't
    > figure out yet how that can happen but if that happens then the slot's
    > restart_lsn can retreat in LogicalConfirmReceivedLocation() because we
    > don't check if slot's candidate_restart_lsn is lesser than its
    > restart_lsn before assigning the same in line
    > MyReplicationSlot->data.restart_lsn =
    > MyReplicationSlot->candidate_restart_lsn;. I think that should be
    > checked. Note that we call LogicalConfirmReceivedLocation() can be
    > called from ProcessStandbyReplyMessage(), so once the wrong
    > candidate_restart_lsn is set, it can be assigned to restart_lsn from
    > other code paths as well.
    >
    > I am not able to think how 'candidate_restart_lsn' can be set to an
    > LSN value prior to 'restart_lsn'.
    >
    
    In the below part of the code, we use the value of
    'last_serialized_snapshot' for restart_lsn.
    else if (txn == NULL &&
    builder->reorder->current_restart_decoding_lsn != InvalidXLogRecPtr &&
    builder->last_serialized_snapshot != InvalidXLogRecPtr)
    LogicalIncreaseRestartDecodingForSlot(lsn,
      builder->last_serialized_snapshot);
    
    Now, say, after restart, we start reading from slot's restart_lsn
    which is 1039D/8B5773D8. At this LSN, we restored a snapshot that has
    the last_seriealized_snapshot set to 1039D/83825958. If that happens,
    then in LogicalIncreaseRestartDecodingForSlot, we can set these values
    to slot's candidate_*_lsn variables. Say, if this happens, next time
    whenever LogicalConfirmReceivedLocation() is called the value of
    slot's restart_lsn will be moved back. Once it is moved back, yet
    another restart will lead to this problem. Does this theory makes
    sense?
    
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  36. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Masahiko Sawada <sawada.mshk@gmail.com> — 2022-11-17T01:17:59Z

    On Wed, Nov 16, 2022 at 4:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    >
    > On Wed, Nov 16, 2022 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > >
    > > On Wed, Nov 16, 2022 at 11:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
    > > >
    > > > > > > Found one in the time frame you mentioned:
    > > > > > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
    > > > > > >
    > > > > >
    > > > > > Thanks!
    > > > > >
    > > > > > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
    > > > > > slot's restart_lsn to this LSN, we remove WAL files older than
    > > > > > 000000000001039D0000008A.
    > > > > >
    > > > > > In LogicalIncreaseRestartDecodingForSlot(), since
    > > > > > "current_lsn(1039D/9170B010) <
    > > > > > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following
    > > > > > part and called LogicalConfirmReceivedLocation():
    > > > > >
    > > > > >     else if (current_lsn <= slot->data.confirmed_flush)
    > > > > >     {
    > > > > >         slot->candidate_restart_valid = current_lsn;
    > > > > >         slot->candidate_restart_lsn = restart_lsn;
    > > > > >
    > > > > >         /* our candidate can directly be used */
    > > > > >         updated_lsn = true;
    > > > > >     }
    > > > > >
    > > > >
    > > > > If this would have been executed in
    > > > > LogicalIncreaseRestartDecodingForSlot(), then the values displayed in
    > > > > the above DEBUG messages "current candidate 1039D/83825958, current
    > > > > after 1039D/8B5773D8" should be the same as proposed and after
    > > > > "proposed 1039D/8B5773D8, after 1039D/9170B010". Am, I missing
    > > > > something?
    > > >
    > > > Oh, you're right.
    > > >
    > > > Given restart_lsn was 1039D/8B5773D8, slot->data.restart_lsn was equal
    > > > to or greater than 1039D/8B5773D8 at that time but
    > > > slot->candidate_restart_lsn was 1039D/83825958, right? Which is weird.
    > > >
    > >
    > > Yes, that is weird but it had been a bit obvious if the same LOG would
    > > have printed slot->data.restart_lsn. This means that somehow slot's
    > > 'candidate_restart_lsn' somehow went behind its 'restart_lsn'. I can't
    > > figure out yet how that can happen but if that happens then the slot's
    > > restart_lsn can retreat in LogicalConfirmReceivedLocation() because we
    > > don't check if slot's candidate_restart_lsn is lesser than its
    > > restart_lsn before assigning the same in line
    > > MyReplicationSlot->data.restart_lsn =
    > > MyReplicationSlot->candidate_restart_lsn;. I think that should be
    > > checked. Note that we call LogicalConfirmReceivedLocation() can be
    > > called from ProcessStandbyReplyMessage(), so once the wrong
    > > candidate_restart_lsn is set, it can be assigned to restart_lsn from
    > > other code paths as well.
    > >
    > > I am not able to think how 'candidate_restart_lsn' can be set to an
    > > LSN value prior to 'restart_lsn'.
    > >
    >
    > In the below part of the code, we use the value of
    > 'last_serialized_snapshot' for restart_lsn.
    > else if (txn == NULL &&
    > builder->reorder->current_restart_decoding_lsn != InvalidXLogRecPtr &&
    > builder->last_serialized_snapshot != InvalidXLogRecPtr)
    > LogicalIncreaseRestartDecodingForSlot(lsn,
    >   builder->last_serialized_snapshot);
    >
    > Now, say, after restart, we start reading from slot's restart_lsn
    > which is 1039D/8B5773D8. At this LSN, we restored a snapshot that has
    > the last_seriealized_snapshot set to 1039D/83825958.
    
    I don't get this part. You meant that the snapshot we restored at
    1039D/8B5773D8 has last_serialized_snapshot set to 1039D/83825958? I
    think we don't update last_serialized_snapshot when restoring the
    snapshot. Also, if we restore the snapshot after restart, we don't
    execute the above code since SnapBuildFindSnapshot() returns false,
    no?
    
    Regards,
    
    -- 
    Masahiko Sawada
    Amazon Web Services: https://aws.amazon.com
    
    
    
    
  37. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Amit Kapila <amit.kapila16@gmail.com> — 2022-11-17T04:06:47Z

    On Thu, Nov 17, 2022 at 6:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
    >
    > On Wed, Nov 16, 2022 at 4:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > >
    > > On Wed, Nov 16, 2022 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > > >
    > > > On Wed, Nov 16, 2022 at 11:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
    > > > >
    > > > > > > > Found one in the time frame you mentioned:
    > > > > > > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
    > > > > > > >
    > > > > > >
    > > > > > > Thanks!
    > > > > > >
    > > > > > > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
    > > > > > > slot's restart_lsn to this LSN, we remove WAL files older than
    > > > > > > 000000000001039D0000008A.
    > > > > > >
    > > > > > > In LogicalIncreaseRestartDecodingForSlot(), since
    > > > > > > "current_lsn(1039D/9170B010) <
    > > > > > > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following
    > > > > > > part and called LogicalConfirmReceivedLocation():
    > > > > > >
    > > > > > >     else if (current_lsn <= slot->data.confirmed_flush)
    > > > > > >     {
    > > > > > >         slot->candidate_restart_valid = current_lsn;
    > > > > > >         slot->candidate_restart_lsn = restart_lsn;
    > > > > > >
    > > > > > >         /* our candidate can directly be used */
    > > > > > >         updated_lsn = true;
    > > > > > >     }
    > > > > > >
    > > > > >
    > > > > > If this would have been executed in
    > > > > > LogicalIncreaseRestartDecodingForSlot(), then the values displayed in
    > > > > > the above DEBUG messages "current candidate 1039D/83825958, current
    > > > > > after 1039D/8B5773D8" should be the same as proposed and after
    > > > > > "proposed 1039D/8B5773D8, after 1039D/9170B010". Am, I missing
    > > > > > something?
    > > > >
    > > > > Oh, you're right.
    > > > >
    > > > > Given restart_lsn was 1039D/8B5773D8, slot->data.restart_lsn was equal
    > > > > to or greater than 1039D/8B5773D8 at that time but
    > > > > slot->candidate_restart_lsn was 1039D/83825958, right? Which is weird.
    > > > >
    > > >
    > > > Yes, that is weird but it had been a bit obvious if the same LOG would
    > > > have printed slot->data.restart_lsn. This means that somehow slot's
    > > > 'candidate_restart_lsn' somehow went behind its 'restart_lsn'. I can't
    > > > figure out yet how that can happen but if that happens then the slot's
    > > > restart_lsn can retreat in LogicalConfirmReceivedLocation() because we
    > > > don't check if slot's candidate_restart_lsn is lesser than its
    > > > restart_lsn before assigning the same in line
    > > > MyReplicationSlot->data.restart_lsn =
    > > > MyReplicationSlot->candidate_restart_lsn;. I think that should be
    > > > checked. Note that we call LogicalConfirmReceivedLocation() can be
    > > > called from ProcessStandbyReplyMessage(), so once the wrong
    > > > candidate_restart_lsn is set, it can be assigned to restart_lsn from
    > > > other code paths as well.
    > > >
    > > > I am not able to think how 'candidate_restart_lsn' can be set to an
    > > > LSN value prior to 'restart_lsn'.
    > > >
    > >
    > > In the below part of the code, we use the value of
    > > 'last_serialized_snapshot' for restart_lsn.
    > > else if (txn == NULL &&
    > > builder->reorder->current_restart_decoding_lsn != InvalidXLogRecPtr &&
    > > builder->last_serialized_snapshot != InvalidXLogRecPtr)
    > > LogicalIncreaseRestartDecodingForSlot(lsn,
    > >   builder->last_serialized_snapshot);
    > >
    > > Now, say, after restart, we start reading from slot's restart_lsn
    > > which is 1039D/8B5773D8. At this LSN, we restored a snapshot that has
    > > the last_seriealized_snapshot set to 1039D/83825958.
    >
    > I don't get this part. You meant that the snapshot we restored at
    > 1039D/8B5773D8 has last_serialized_snapshot set to 1039D/83825958?
    >
    
    Yes.
    
    > I
    > think we don't update last_serialized_snapshot when restoring the
    > snapshot.
    >
    
    Right, I missed that point that even though it is present in the
    serialized snapshot, while restoring we don't update the same in the
    builder. So, this theory won't work.
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  38. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Andres Freund <andres@anarazel.de> — 2022-11-17T07:25:12Z

    Hi,
    
    On 2022-11-11 15:50:40 +0100, hubert depesz lubaczewski wrote:
    > Out of 8 servers that we prepared upgrade for, one failed with the
    > same/similar problem.
    
    Perhaps a stupid question - are you using max_slot_wal_keep_size? And what's
    your wal_segment_size? And wal_keep_size?
    
    
    
    I'm a bit confused by the changes the changes c6550776394e made to
    KeepLogSeg() etc.
    
    Before:
    	/* then check whether slots limit removal further */
    	if (max_replication_slots > 0 && keep != InvalidXLogRecPtr)
    	{
    		XLogSegNo	slotSegNo;
    
    		XLByteToSeg(keep, slotSegNo, wal_segment_size);
    
    		if (slotSegNo <= 0)
    			segno = 1;
    		else if (slotSegNo < segno)
    			segno = slotSegNo;
    	}
    
    Current:
    	keep = XLogGetReplicationSlotMinimumLSN();
    	if (keep != InvalidXLogRecPtr)
    	{
    		XLByteToSeg(keep, segno, wal_segment_size);
    
    		/* Cap by max_slot_wal_keep_size ... */
    		if (max_slot_wal_keep_size_mb >= 0)
    		{
    			uint64		slot_keep_segs;
    
    			slot_keep_segs =
    				ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size);
    
    			if (currSegNo - segno > slot_keep_segs)
    				segno = currSegNo - slot_keep_segs;
    		}
    	}
    
    Which looks to me to have lost the protection against segno being 0. Which
    afaict would break the callers, because they do _logSegNo--. Which would then
    allow all WAL to be reused, no?  I guess the argument for changing this was
    that we should never have such a dangerous value returned by
    XLogGetReplicationSlotMinimumLSN()?
    
    
    It seems decidedly not great to not log at least a debug1 (but probably it
    should be LOG) message when KeepLogSeg() decides to limit based on
    max_slot_wal_keep_size.
    
    
    It feels wrong to subtract max_slot_wal_keep_size from recptr - that's the end
    of the checkpoint record. Given that we, leaving max_slot_wal_keep_size aside,
    only actually remove WAL if older than the segment that RedoRecPtr (the
    logical start of the checkpoint) is in. If the checkpoint is large we'll end
    up removing replication slots even though they potentially would only have
    retained one additional WAL segment.
    
    Isn't it problematic to use ConvertToXSegs() to implement
    max_slot_wal_keep_size, given that it rounds *down*? Particularly for a large
    wal_segment_size that'd afaict lead to being much more aggressive invalidating
    slots.
    
    Also, why do we do something as expensive as
    InvalidateObsoleteReplicationSlots() even when max_slot_wal_keep_size had no
    effect?
    
    Greetings,
    
    Andres Freund
    
    
    
    
  39. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Andres Freund <andres@anarazel.de> — 2022-11-17T08:02:58Z

    Hi,
    
    On 2022-11-15 23:59:37 +0900, Masahiko Sawada wrote:
    > > Is something like the following scenario possible to happen?
    > >
    > > 1. wal sender updates slot's restart_lsn and releases the spin lock
    > > (not saved in the disk yet)
    > > 2. someone updates slots' minimum restart_lsn (note that slot's
    > > restart_lsn in memory is already updated).
    
    You mean ReplicationSlotsComputeRequiredLSN(), or update that specific slot's
    restart_lsn? The latter shouldn't happen.
    
    
    > > 3. checkpointer removes WAL files older than the minimum restart_lsn
    > > calculated at step 2.
    
    For xmin we have protection against that via the split between
    catalog_xmin/effective_catalog_xmin. We should probably mirror that for
    restart_lsn as well.
    
    We should also call ReplicationSlotsComputeRequiredLSN if only update_restart
    is true...
    
    
    > > 4. wal sender restarts for some reason (or server crashed).
    
    I don't think walsender alone restarting should change anything, but
    crash-restart obviously would.
    
    Greetings,
    
    Andres Freund
    
    
    
    
  40. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2022-11-17T10:26:57Z

    On Wed, Nov 16, 2022 at 11:25:12PM -0800, Andres Freund wrote:
    > Perhaps a stupid question - are you using max_slot_wal_keep_size? And what's
    > your wal_segment_size? And wal_keep_size?
    
    Are we talking about pg12 side?
    
    => select name, setting from pg_settings where name ~ 'wal';
                 name             │  setting
    ══════════════════════════════╪═══════════
     max_wal_senders              │ 50
     max_wal_size                 │ 49152
     min_wal_size                 │ 256
     wal_block_size               │ 8192
     wal_buffers                  │ 2048
     wal_compression              │ on
     wal_consistency_checking     │
     wal_init_zero                │ on
     wal_keep_segments            │ 1000
     wal_level                    │ logical
     wal_log_hints                │ off
     wal_receiver_status_interval │ 10
     wal_receiver_timeout         │ 60000
     wal_recycle                  │ on
     wal_retrieve_retry_interval  │ 5000
     wal_segment_size             │ 16777216
     wal_sender_timeout           │ 60000
     wal_sync_method              │ fdatasync
     wal_writer_delay             │ 200
     wal_writer_flush_after       │ 128
    (20 rows)
    
    max_slot_wal_keep_size was added in pg13, afaik. and wel_keep_* was
    _segments in pg 12.
    
    Best regards,
    
    depesz
    
    
    
    
    
  41. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Masahiko Sawada <sawada.mshk@gmail.com> — 2022-11-17T14:22:12Z

    On Thu, Nov 17, 2022 at 5:03 PM Andres Freund <andres@anarazel.de> wrote:
    >
    > Hi,
    >
    > On 2022-11-15 23:59:37 +0900, Masahiko Sawada wrote:
    > > > Is something like the following scenario possible to happen?
    > > >
    > > > 1. wal sender updates slot's restart_lsn and releases the spin lock
    > > > (not saved in the disk yet)
    > > > 2. someone updates slots' minimum restart_lsn (note that slot's
    > > > restart_lsn in memory is already updated).
    >
    > You mean ReplicationSlotsComputeRequiredLSN(), or update that specific slot's
    > restart_lsn? The latter shouldn't happen.
    
    I meant the former.
    
    >
    >
    > > > 3. checkpointer removes WAL files older than the minimum restart_lsn
    > > > calculated at step 2.
    >
    > For xmin we have protection against that via the split between
    > catalog_xmin/effective_catalog_xmin. We should probably mirror that for
    > restart_lsn as well.
    >
    > We should also call ReplicationSlotsComputeRequiredLSN if only update_restart
    > is true...
    
    Agree.
    
    >
    >
    > > > 4. wal sender restarts for some reason (or server crashed).
    >
    > I don't think walsender alone restarting should change anything, but
    > crash-restart obviously would.
    
    Right. I've confirmed this scenario is possible to happen with crash-restart.
    
    Regards,
    
    -- 
    Masahiko Sawada
    Amazon Web Services: https://aws.amazon.com
    
    
    
    
  42. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2022-11-17T15:55:34Z

    On Mon, Nov 14, 2022 at 05:08:15PM +0530, Amit Kapila wrote:
    > You may have the LOG for "attempting to remove WAL segments older than
    > log file %s", if so, please share. Also on similar lines, I think we
    > need to add some LOGs before calling RemoveOldXlogFiles() to know
    > where our computation to remove files goes off. Something, like the
    > attached, would be helpful but note I have added this quickly on my
    > working branch to show you what I have in mind. You may need small
    > changes based on the branch you are working on.
    
    Hi,
    I tried to apply it on pg12.12, both clean from
    https://www.postgresql.org/ftp/source/v12.12/ and one that is being
    built for pgdg .deb for 12.12, and in both cases it failed.
    
    Current state of xlog.c from the deb building (which is what I have to
    build) is available at https://depesz.com/various/xlog.c.xz
    
    I don't trust myself with modifing patches in C - any chance I could ask
    you to check why it doesn't apply, and possibly make new one that
    applies on pg12?
    
    I tried to see what is going on there, but you have in patch lines that
    I don't see in my version of xlog.c, so I'm not sure what to do with it.
    
    The procedure I followed to get this xlog.c in this form was:
    
    dget -x https://apt-archive.postgresql.org/pub/repos/apt/pool/main/p/postgresql-12/postgresql-12_12.1-2.pgdg100%2B1.dsc
    dpkg-source -x postgresql-12_12.1-2.pgdg100%2B1.dsc
    cd postgresql-12-12.1/
    sudo apt build-dep .
    quilt push -a
    
    and I think that was it.
    
    Best regards,
    
    depesz
    
    
    
    
    
  43. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Andres Freund <andres@anarazel.de> — 2022-11-17T18:55:29Z

    Hi,
    
    On 2022-11-17 23:22:12 +0900, Masahiko Sawada wrote:
    > On Thu, Nov 17, 2022 at 5:03 PM Andres Freund <andres@anarazel.de> wrote:
    > > > > 4. wal sender restarts for some reason (or server crashed).
    > >
    > > I don't think walsender alone restarting should change anything, but
    > > crash-restart obviously would.
    > 
    > Right. I've confirmed this scenario is possible to happen with crash-restart.
    
    Depesz, were there any crashes / immediate restarts on the PG 12 side? If so,
    then we know what the problem likely is and can fix it. If not...
    
    
    Just to confirm, the problem has "newly" occurred after you've upgraded to
    12.12? I couldn't quite tell from the thread.
    
    Greetings,
    
    Andres Freund
    
    
    
    
  44. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2022-11-18T11:27:17Z

    On Thu, Nov 17, 2022 at 10:55:29AM -0800, Andres Freund wrote:
    > Hi,
    > 
    > On 2022-11-17 23:22:12 +0900, Masahiko Sawada wrote:
    > > On Thu, Nov 17, 2022 at 5:03 PM Andres Freund <andres@anarazel.de> wrote:
    > > > > > 4. wal sender restarts for some reason (or server crashed).
    > > >
    > > > I don't think walsender alone restarting should change anything, but
    > > > crash-restart obviously would.
    > > 
    > > Right. I've confirmed this scenario is possible to happen with crash-restart.
    > 
    > Depesz, were there any crashes / immediate restarts on the PG 12 side? If so,
    > then we know what the problem likely is and can fix it. If not...
    > 
    > 
    > Just to confirm, the problem has "newly" occurred after you've upgraded to
    > 12.12? I couldn't quite tell from the thread.
    
    No crashes that I can find any track of.
    
    And no, it's not related to 12.12
    
    We were using 12.9, and in process of trying to figure stuff out we
    upgraded to 12.12, which didn't help.
    
    Best regards,
    
    depesz
    
    
    
    
    
  45. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Amit Kapila <amit.kapila16@gmail.com> — 2022-11-21T14:26:20Z

    On Fri, Nov 18, 2022 at 4:57 PM hubert depesz lubaczewski
    <depesz@depesz.com> wrote:
    >
    > On Thu, Nov 17, 2022 at 10:55:29AM -0800, Andres Freund wrote:
    > > Hi,
    > >
    > > On 2022-11-17 23:22:12 +0900, Masahiko Sawada wrote:
    > > > On Thu, Nov 17, 2022 at 5:03 PM Andres Freund <andres@anarazel.de> wrote:
    > > > > > > 4. wal sender restarts for some reason (or server crashed).
    > > > >
    > > > > I don't think walsender alone restarting should change anything, but
    > > > > crash-restart obviously would.
    > > >
    > > > Right. I've confirmed this scenario is possible to happen with crash-restart.
    > >
    > > Depesz, were there any crashes / immediate restarts on the PG 12 side? If so,
    > > then we know what the problem likely is and can fix it. If not...
    > >
    > >
    > > Just to confirm, the problem has "newly" occurred after you've upgraded to
    > > 12.12? I couldn't quite tell from the thread.
    >
    > No crashes that I can find any track of.
    >
    
    I think this problem could arise when walsender exits due to some
    error like "terminating walsender process due to replication timeout".
    Here is the theory I came up with:
    
    1. Initially the restart_lsn is updated to 1039D/83825958. This will
    allow all files till 000000000001039D00000082 to be removed.
    2. Next the slot->candidate_restart_lsn is updated to a 1039D/8B5773D8.
    3. walsender restarts due to replication timeout.
    4. After restart, it starts reading WAL from 1039D/83825958 as that
    was restart_lsn.
    5. walsender gets a message to update write, flush, apply, etc. As
    part of that, it invokes
    ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation.
    6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and
    replicationSlotMinLSN will also be computed to the same value allowing
    to remove of all files older than 000000000001039D0000008A. This will
    allow removing 000000000001039D00000083, 000000010001039D00000084,
    etc.
    7. Now, we got new slot->candidate_restart_lsn as 1039D/83825958.
    Remember from step 1, we are still reading WAL from that location.
    8. At the next update for write, flush, etc. as part of processing
    standby reply message, we will invoke
    ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation. This
    updates restart_lsn to 1039D/83825958.
    9. walsender restarts due to replication timeout.
    10. After restart, it starts reading WAL from 1039D/83825958 as that
    was restart_lsn and gets an error "requested WAL segment
    000000010001039D00000083 has already been removed" because the same is
    removed as part of point-6.
    11. After that walsender will keep on getting the same error after
    restart as the restart_lsn is never progressed.
    
    What do you think?
    
    If this diagnosis is correct, I think we need to clear
    candidate_restart_lsn and friends during ReplicationSlotRelease().
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  46. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Andres Freund <andres@anarazel.de> — 2022-11-21T20:08:36Z

    Hi,
    
    On 2022-11-21 19:56:20 +0530, Amit Kapila wrote:
    > I think this problem could arise when walsender exits due to some
    > error like "terminating walsender process due to replication timeout".
    > Here is the theory I came up with:
    > 
    > 1. Initially the restart_lsn is updated to 1039D/83825958. This will
    > allow all files till 000000000001039D00000082 to be removed.
    > 2. Next the slot->candidate_restart_lsn is updated to a 1039D/8B5773D8.
    > 3. walsender restarts due to replication timeout.
    > 4. After restart, it starts reading WAL from 1039D/83825958 as that
    > was restart_lsn.
    > 5. walsender gets a message to update write, flush, apply, etc. As
    > part of that, it invokes
    > ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation.
    > 6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and
    > replicationSlotMinLSN will also be computed to the same value allowing
    > to remove of all files older than 000000000001039D0000008A. This will
    > allow removing 000000000001039D00000083, 000000010001039D00000084,
    > etc.
    
    This would require that the client acknowledged an LSN that we haven't
    sent out, no? Shouldn't the
      MyReplicationSlot->candidate_restart_valid <= lsn
    from LogicalConfirmReceivedLocation() prevented this from happening
    unless the client acknowledges up to candidate_restart_valid?
    
    
    > 7. Now, we got new slot->candidate_restart_lsn as 1039D/83825958.
    > Remember from step 1, we are still reading WAL from that location.
    
    I don't think LogicalIncreaseRestartDecodingForSlot() would do anything
    in that case, because of the
    	/* don't overwrite if have a newer restart lsn */
    check.
    
    
    > If this diagnosis is correct, I think we need to clear
    > candidate_restart_lsn and friends during ReplicationSlotRelease().
    
    Possible, but I don't quite see it yet.
    
    Greetings,
    
    Andres Freund
    
    
    
    
  47. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Amit Kapila <amit.kapila16@gmail.com> — 2022-11-22T03:36:22Z

    On Tue, Nov 22, 2022 at 1:38 AM Andres Freund <andres@anarazel.de> wrote:
    >
    > On 2022-11-21 19:56:20 +0530, Amit Kapila wrote:
    > > I think this problem could arise when walsender exits due to some
    > > error like "terminating walsender process due to replication timeout".
    > > Here is the theory I came up with:
    > >
    > > 1. Initially the restart_lsn is updated to 1039D/83825958. This will
    > > allow all files till 000000000001039D00000082 to be removed.
    > > 2. Next the slot->candidate_restart_lsn is updated to a 1039D/8B5773D8.
    > > 3. walsender restarts due to replication timeout.
    > > 4. After restart, it starts reading WAL from 1039D/83825958 as that
    > > was restart_lsn.
    > > 5. walsender gets a message to update write, flush, apply, etc. As
    > > part of that, it invokes
    > > ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation.
    > > 6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and
    > > replicationSlotMinLSN will also be computed to the same value allowing
    > > to remove of all files older than 000000000001039D0000008A. This will
    > > allow removing 000000000001039D00000083, 000000010001039D00000084,
    > > etc.
    >
    > This would require that the client acknowledged an LSN that we haven't
    > sent out, no?  Shouldn't the
    >   MyReplicationSlot->candidate_restart_valid <= lsn
    > from LogicalConfirmReceivedLocation() prevented this from happening
    > unless the client acknowledges up to candidate_restart_valid?
    >
    
    We would have sent the higher LSN prior to the restart. Let me share a
    snippet of LOG which I have extracted from a large LOG depesz has
    shared me offlist after my request:
    
    2022-11-10 21:03:23.478+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,4,idle,2022-11-10
    21:03:23+00,7/0,0,DEBUG,00000,received replication command:
    IDENTIFY_SYSTEM,,,,,,,,,focal14
    2022-11-10 21:03:23.479+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,5,idle,2022-11-10
    21:03:23+00,7/0,0,DEBUG,00000,"received replication command:
    START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/91F41B50
    (proto_version '1', publication_names '""focal14""')",,,,,,,,,focal14
    2022-11-10 21:03:23.481+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,6,idle,2022-11-10
    21:03:23+00,7/0,0,LOG,00000,"starting logical decoding for slot
    ""focal14""","Streaming transactions committing after 1039D/911A8DB0,
    reading WAL from 1039D/83825958.",,,,,"START_REPLICATION SLOT
    ""focal14"" LOGICAL 1039D/91F41B50 (proto_version '1',
    publication_names '""focal14""')",,,focal14
    2022-11-10 21:03:23.481+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,7,idle,2022-11-10
    21:03:23+00,7/0,0,DEBUG,00000,sending replication
    keepalive,,,,,,,,,focal14
    2022-11-10 21:03:23.484+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,8,idle,2022-11-10
    21:03:23+00,7/0,0,LOG,00000,logical decoding found consistent point at
    1039D/83825958,Logical decoding will begin using saved
    snapshot.,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL
    1039D/91F41B50 (proto_version '1', publication_names
    '""focal14""')",,,focal14
    2022-11-10 21:03:23.484+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,9,idle,2022-11-10
    21:03:23+00,7/0,0,DEBUG,00000,write 1039D/91F41B50 flush
    1039D/91F41B50 apply 1039D/91F41B50 reply_time 2022-11-10
    21:03:23.482052+00,,,,,,,,,focal14
    2022-11-10 21:03:23.495+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,10,idle,2022-11-10
    21:03:23+00,7/0,0,DEBUG,00000,updated xmin: 1 restart:
    1,,,,,,,,,focal14
    
    Here, we can see that even though we have started reading WAL from
    1039D/83825958, the confiremed_flush location is 1039D/911A8DB0. Then
    we get a client reply (write 1039D/91F41B50 flush 1039D/91F41B50)
    probably for a keepalive message. The flush LSN here is higher than
    candidate_restart_valid. The value of candidate_restart_valid before
    walsender restart was 1039D/9170B010 and candidate_restart_lsn was
    1039D/8B5773D8.
    
    >
    > > 7. Now, we got new slot->candidate_restart_lsn as 1039D/83825958.
    > > Remember from step 1, we are still reading WAL from that location.
    >
    > I don't think LogicalIncreaseRestartDecodingForSlot() would do anything
    > in that case, because of the
    >         /* don't overwrite if have a newer restart lsn */
    > check.
    >
    
    But later in LogicalIncreaseRestartDecodingForSlot(), we have the
    below check which can allow to override it.
    /*
    * Only increase if the previous values have been applied, otherwise we
    * might never end up updating if the receiver acks too slowly. A missed
    * value here will just cause some extra effort after reconnecting.
    */
    if (slot->candidate_restart_valid == InvalidXLogRecPtr)
    
    Do we need else if here instead of if? I think that would save us from
    retreating the restart_lsn value. We seem to have "else if" in the
    corresponding case in the function LogicalIncreaseXminForSlot():
    /*
    * Only increase if the previous values have been applied, otherwise we
    * might never end up updating if the receiver acks too slowly.
    */
    else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr)
    
    -- 
    With Regards,
    Amit Kapila.
    
    
    
    
  48. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Masahiko Sawada <sawada.mshk@gmail.com> — 2022-11-22T08:48:12Z

    On Tue, Nov 22, 2022 at 12:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
    >
    > On Tue, Nov 22, 2022 at 1:38 AM Andres Freund <andres@anarazel.de> wrote:
    > >
    > > On 2022-11-21 19:56:20 +0530, Amit Kapila wrote:
    > > > I think this problem could arise when walsender exits due to some
    > > > error like "terminating walsender process due to replication timeout".
    > > > Here is the theory I came up with:
    > > >
    > > > 1. Initially the restart_lsn is updated to 1039D/83825958. This will
    > > > allow all files till 000000000001039D00000082 to be removed.
    > > > 2. Next the slot->candidate_restart_lsn is updated to a 1039D/8B5773D8.
    > > > 3. walsender restarts due to replication timeout.
    > > > 4. After restart, it starts reading WAL from 1039D/83825958 as that
    > > > was restart_lsn.
    > > > 5. walsender gets a message to update write, flush, apply, etc. As
    > > > part of that, it invokes
    > > > ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation.
    > > > 6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and
    > > > replicationSlotMinLSN will also be computed to the same value allowing
    > > > to remove of all files older than 000000000001039D0000008A. This will
    > > > allow removing 000000000001039D00000083, 000000010001039D00000084,
    > > > etc.
    > >
    > > This would require that the client acknowledged an LSN that we haven't
    > > sent out, no?  Shouldn't the
    > >   MyReplicationSlot->candidate_restart_valid <= lsn
    > > from LogicalConfirmReceivedLocation() prevented this from happening
    > > unless the client acknowledges up to candidate_restart_valid?
    > >
    >
    > We would have sent the higher LSN prior to the restart. Let me share a
    > snippet of LOG which I have extracted from a large LOG depesz has
    > shared me offlist after my request:
    >
    > 2022-11-10 21:03:23.478+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,4,idle,2022-11-10
    > 21:03:23+00,7/0,0,DEBUG,00000,received replication command:
    > IDENTIFY_SYSTEM,,,,,,,,,focal14
    > 2022-11-10 21:03:23.479+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,5,idle,2022-11-10
    > 21:03:23+00,7/0,0,DEBUG,00000,"received replication command:
    > START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/91F41B50
    > (proto_version '1', publication_names '""focal14""')",,,,,,,,,focal14
    > 2022-11-10 21:03:23.481+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,6,idle,2022-11-10
    > 21:03:23+00,7/0,0,LOG,00000,"starting logical decoding for slot
    > ""focal14""","Streaming transactions committing after 1039D/911A8DB0,
    > reading WAL from 1039D/83825958.",,,,,"START_REPLICATION SLOT
    > ""focal14"" LOGICAL 1039D/91F41B50 (proto_version '1',
    > publication_names '""focal14""')",,,focal14
    > 2022-11-10 21:03:23.481+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,7,idle,2022-11-10
    > 21:03:23+00,7/0,0,DEBUG,00000,sending replication
    > keepalive,,,,,,,,,focal14
    > 2022-11-10 21:03:23.484+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,8,idle,2022-11-10
    > 21:03:23+00,7/0,0,LOG,00000,logical decoding found consistent point at
    > 1039D/83825958,Logical decoding will begin using saved
    > snapshot.,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL
    > 1039D/91F41B50 (proto_version '1', publication_names
    > '""focal14""')",,,focal14
    > 2022-11-10 21:03:23.484+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,9,idle,2022-11-10
    > 21:03:23+00,7/0,0,DEBUG,00000,write 1039D/91F41B50 flush
    > 1039D/91F41B50 apply 1039D/91F41B50 reply_time 2022-11-10
    > 21:03:23.482052+00,,,,,,,,,focal14
    > 2022-11-10 21:03:23.495+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,10,idle,2022-11-10
    > 21:03:23+00,7/0,0,DEBUG,00000,updated xmin: 1 restart:
    > 1,,,,,,,,,focal14
    >
    > Here, we can see that even though we have started reading WAL from
    > 1039D/83825958, the confiremed_flush location is 1039D/911A8DB0. Then
    > we get a client reply (write 1039D/91F41B50 flush 1039D/91F41B50)
    > probably for a keepalive message. The flush LSN here is higher than
    > candidate_restart_valid. The value of candidate_restart_valid before
    > walsender restart was 1039D/9170B010 and candidate_restart_lsn was
    > 1039D/8B5773D8.
    >
    > >
    > > > 7. Now, we got new slot->candidate_restart_lsn as 1039D/83825958.
    > > > Remember from step 1, we are still reading WAL from that location.
    > >
    > > I don't think LogicalIncreaseRestartDecodingForSlot() would do anything
    > > in that case, because of the
    > >         /* don't overwrite if have a newer restart lsn */
    > > check.
    > >
    >
    > But later in LogicalIncreaseRestartDecodingForSlot(), we have the
    > below check which can allow to override it.
    > /*
    > * Only increase if the previous values have been applied, otherwise we
    > * might never end up updating if the receiver acks too slowly. A missed
    > * value here will just cause some extra effort after reconnecting.
    > */
    > if (slot->candidate_restart_valid == InvalidXLogRecPtr)
    
    I also think we did this override-path. After updating the restart_lsn
    we accept the given values as candidate_restart_lsn and
    candidate_restart_valid, which were 1039D/83825958 and 1039D/8B5773D8.
    
    Analyzing the server logs that depesz shared also with me offlist,
    this scenario basically makes sense to me. But please note that
    000000000001039D0000008A and older files were removed while there was
    no running wal sender. The wal sender who restarted decoding from
    1039D/83825958 (ie step 4) exited due to wal_sender_timeout at
    21:04:56, and no walsender was launched until 000000000001039D0000008A
    or older files were removed at 21:06:07. I guess the scenario needs to
    change a bit.
    
    6. The restart_lsn was updated to 1039D/8B5773D8 and
    replicationSlotMinLSN was also computed to the same value. But WAL
    files were not removed at this point because there was no checkpoint
    between 6 and 7. This was around 21:03:23.
    
    7. We got new slot->candidate_restart_lsn as 1039D/83825958 and
    candidate_restart_valid as 1039D/8B5773D8. This can happen for the
    reason that Amit mentioned above.
    
    8. The restart_lsn was updated to 1039D/83825958, but we didn't update
    replicationSlotMinLSN since slot->data.catalog_xmin was not updated at
    the same time (see LogicalConfirmReceivedLocation()). And the
    walsender exited again due to wal_sender_timeout. At this point, note
    that slot's restart_lsn was 1039D/83825958 whereas
    replicationSlotMinLSN was 1039D/8B5773D8. This was around 21:04:56.
    
    9. Until 000000000001039D0000008A or older files were removed at
    21:06:07, a client had checked the pg_replication_slots view every
    minute but it had showed 1039D/83825958 in the restart_lsn column. On
    the other hand, the checkpointer removed the WAL file segment files
    000000000001039D0000008A or older at 21:06:07, based on
    replicationSlotMinLSN ( 1039D/8B5773D8).
    
    Regards,
    
    -- 
    Masahiko Sawada
    Amazon Web Services: https://aws.amazon.com
    
    
    
    
  49. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2022-12-08T11:13:13Z

    Hi,
    just checking - has there been any progress on diagnosing/fixing the
    bug?
    
    Best regards,
    
    depesz
    
    
    
    
    
  50. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Masahiko Sawada <sawada.mshk@gmail.com> — 2023-02-06T08:25:42Z

    Hi,
    
    On Thu, Dec 8, 2022 at 8:13 PM hubert depesz lubaczewski
    <depesz@depesz.com> wrote:
    >
    > Hi,
    > just checking - has there been any progress on diagnosing/fixing the
    > bug?
    
    Sorry for the late response.
    
    Based on the analysis we did[1][2], I've created the manual scenario
    to reproduce this issue with the attached patch and the script.
    
    The scenario.md explains the basic steps to reproduce this issue. It
    consists of 13 steps (very tricky!!). It's not sophisticated and could
    be improved. test.sh is the shell script I used to execute the
    reproduction steps from 1 to 10. In my environment, I could reproduce
    this issue by the following steps.
    
    1. apply the patch and build PostgreSQL.
    2. run test.sh.
    3. execute the step 11 and later described in scenario.md.
    
    The test.sh is a very hacky and dirty script and is optimized in my
    environment (especially adding many sleeps). You might need to adjust
    it while checking scenario.md.
    
    I've also confirmed that this issue is fixed by the attached patch,
    which clears candidate_restart_lsn and friends during
    ReplicationSlotRelease().
    
    [1] https://www.postgresql.org/message-id/CAA4eK1JvyWHzMwhO9jzPquctE_ha6bz3EkB3KE6qQJx63StErQ%40mail.gmail.com
    [2] https://www.postgresql.org/message-id/CAD21AoBHMCEDcV4eBtSVvDrCN4SrMXanX5N9%2BL-E%2B4OWXYY0ew%40mail.gmail.com
    
    Regards,
    
    --
    Masahiko Sawada
    Amazon Web Services: https://aws.amazon.com
    
  51. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2023-02-06T11:15:52Z

    On Mon, Feb 06, 2023 at 05:25:42PM +0900, Masahiko Sawada wrote:
    > Based on the analysis we did[1][2], I've created the manual scenario
    > to reproduce this issue with the attached patch and the script.
    > 
    > The scenario.md explains the basic steps to reproduce this issue. It
    > consists of 13 steps (very tricky!!). It's not sophisticated and could
    > be improved. test.sh is the shell script I used to execute the
    > reproduction steps from 1 to 10. In my environment, I could reproduce
    > this issue by the following steps.
    > 
    > 1. apply the patch and build PostgreSQL.
    > 2. run test.sh.
    > 3. execute the step 11 and later described in scenario.md.
    > 
    > The test.sh is a very hacky and dirty script and is optimized in my
    > environment (especially adding many sleeps). You might need to adjust
    > it while checking scenario.md.
    > 
    > I've also confirmed that this issue is fixed by the attached patch,
    > which clears candidate_restart_lsn and friends during
    > ReplicationSlotRelease().
    
    Hi,
    one important question - do I patch newer Pg, or older? The thing is
    that we were able to replicate the problem (with some luck) only on
    production databases, and patching them will be hard sell. Maybe
    possible, but if it's enough to patch the pg14 (recipient) it would make
    my life much easier.
    
    Best regards,
    
    depesz
    
    
    
    
    
  52. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Masahiko Sawada <sawada.mshk@gmail.com> — 2023-02-06T14:18:02Z

    On Mon, Feb 6, 2023 at 8:15 PM hubert depesz lubaczewski
    <depesz@depesz.com> wrote:
    >
    > On Mon, Feb 06, 2023 at 05:25:42PM +0900, Masahiko Sawada wrote:
    > > Based on the analysis we did[1][2], I've created the manual scenario
    > > to reproduce this issue with the attached patch and the script.
    > >
    > > The scenario.md explains the basic steps to reproduce this issue. It
    > > consists of 13 steps (very tricky!!). It's not sophisticated and could
    > > be improved. test.sh is the shell script I used to execute the
    > > reproduction steps from 1 to 10. In my environment, I could reproduce
    > > this issue by the following steps.
    > >
    > > 1. apply the patch and build PostgreSQL.
    > > 2. run test.sh.
    > > 3. execute the step 11 and later described in scenario.md.
    > >
    > > The test.sh is a very hacky and dirty script and is optimized in my
    > > environment (especially adding many sleeps). You might need to adjust
    > > it while checking scenario.md.
    > >
    > > I've also confirmed that this issue is fixed by the attached patch,
    > > which clears candidate_restart_lsn and friends during
    > > ReplicationSlotRelease().
    >
    > Hi,
    > one important question - do I patch newer Pg, or older? The thing is
    > that we were able to replicate the problem (with some luck) only on
    > production databases, and patching them will be hard sell. Maybe
    > possible, but if it's enough to patch the pg14 (recipient) it would make
    > my life much easier.
    
    Unfortunately, the patch I attached is for the publisher (i.e., sender
    side). There might be a way to fix this issue from the receiver side
    but I have no idea for now.
    
    Regards,
    
    -- 
    Masahiko Sawada
    Amazon Web Services: https://aws.amazon.com
    
    
    
    
  53. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Alvaro Herrera <alvherre@alvh.no-ip.org> — 2023-02-06T15:22:09Z

    On 2023-Feb-06, Masahiko Sawada wrote:
    
    > I've also confirmed that this issue is fixed by the attached patch,
    > which clears candidate_restart_lsn and friends during
    > ReplicationSlotRelease().
    
    Hmm, interesting -- I was studying some other bug recently involving the
    xmin of a slot that had been invalidated and I remember wondering if
    these "candidate" fields were being properly ignored when the slot is
    marked not in use; but I didn't check.  Are you sure that resetting them
    when the slot is released is the appropriate thing to do?  I mean,
    shouldn't they be kept set while the slot is in use, and only reset if
    we destroy it?
    
    (But, actually, in that case, maybe we don't need to reset them: instead
    we need to make sure to ignore the values for slots that are not in_use.
    However, I don't know what the semantics are for these "candidate"
    fields, so maybe this is wrong.)
    
    -- 
    Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
    
    
    
    
  54. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Masahiko Sawada <sawada.mshk@gmail.com> — 2023-02-07T08:20:26Z

    On Tue, Feb 7, 2023 at 12:22 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
    >
    > On 2023-Feb-06, Masahiko Sawada wrote:
    >
    > > I've also confirmed that this issue is fixed by the attached patch,
    > > which clears candidate_restart_lsn and friends during
    > > ReplicationSlotRelease().
    >
    > Hmm, interesting -- I was studying some other bug recently involving the
    > xmin of a slot that had been invalidated and I remember wondering if
    > these "candidate" fields were being properly ignored when the slot is
    > marked not in use; but I didn't check.  Are you sure that resetting them
    > when the slot is released is the appropriate thing to do?  I mean,
    > shouldn't they be kept set while the slot is in use, and only reset if
    > we destroy it?
    
    Thinking about the root cause more, it seems to me that the root cause
    is not the fact that candidate_xxx values are not cleared when being
    released.
    
    In the scenario I reproduced, after restarting the logical decoding,
    the walsender sets the restart_lsn to a candidate_restart_lsn left in
    the slot upon receiving the ack from the subscriber. Then, when
    decoding a RUNNING_XACTS record, the walsender updates the
    restart_lsn, which actually retreats it. The patch, which clears the
    candidate_xxx values at releasing, fixes this issue by not updating
    the restart_lsn upon receiving the ack in this case, but I think it's
    legitimate that the walsender sets the restart_lsn to the
    candidate_restart_lsn left in the slot in this case. The root cause
    here seems to me that it allows the restart_lsn to retreat. In
    LogicalIncreaseRestartDecodingForSlot(), we have the following codes:
    
    /* don't overwrite if have a newer restart lsn */
    if (restart_lsn <= slot->data.restart_lsn)
    {
    }
    
    /*
     * We might have already flushed far enough to directly accept this lsn,
     * in this case there is no need to check for existing candidate LSNs
     */
    else if (current_lsn <= slot->data.confirmed_flush)
    {
        slot->candidate_restart_valid = current_lsn;
        slot->candidate_restart_lsn = restart_lsn;
    
        /* our candidate can directly be used */
        updated_lsn = true;
    }
    
    /*
     * Only increase if the previous values have been applied, otherwise we
     * might never end up updating if the receiver acks too slowly. A missed
     * value here will just cause some extra effort after reconnecting.
     */
    if (slot->candidate_restart_valid == InvalidXLogRecPtr)
    {
        slot->candidate_restart_valid = current_lsn;
        slot->candidate_restart_lsn = restart_lsn;
        SpinLockRelease(&slot->mutex);
    }
    
    If slot->candidate_restart_valid is InvalidXLogRecPtr, he
    candidate_restart_lsn could be set to a LSN lower than the slot's
    restart_lsn in spite of the first comment saying "don't overwrite if
    have a newer restart lsn". So I think the right fix here is to not
    allow the restart_lsn to retreat by changing "if" to "else if" of the
    third block.
    
    >
    > (But, actually, in that case, maybe we don't need to reset them: instead
    > we need to make sure to ignore the values for slots that are not in_use.
    > However, I don't know what the semantics are for these "candidate"
    > fields, so maybe this is wrong.)
    
    IIUC in that case, the WAL segments are removed while the slot is in_use.
    
    Regards,
    
    --
    Masahiko Sawada
    Amazon Web Services: https://aws.amazon.com
    
    
    
    
  55. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    hubert depesz lubaczewski <depesz@depesz.com> — 2023-02-10T14:31:24Z

    Hi,
    so, we have another bit of interesting information. maybe related, maybe
    not.
    
    We noticed weird situation on two clusters we're trying to upgrade.
    
    In both cases sitaution looked the same:
    
    1. there was another process (debezium) connected to source (pg12) using
       logical replication
    2. pg12 -> pg14 replication failed with the message 'ERROR:  requested
       WAL segment ... has already been '
    3. some time afterwards (most likely couple of hours) the process that
       is/was responsible for debezium replicaiton (pg process) stopped
       handling WAL, but instead is eating 100% of cpu.
    
    When this situation happens, we can't pg_cancel_backend(pid) for the
    "broken" wal sender, it also can't be pg_terminate_backend() !
    
    strace of the process doesn't show anything.
    
    When I tried to get backtrace from gdb all I got was:
    
    (gdb) bt
    #0  0x0000aaaad270521c in hash_seq_search ()
    #1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
    #2  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
    #3  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
    #4  0x0000aaaad257764c in ReorderBufferCommit ()
    #5  0x0000aaaad256c804 in ?? ()
    #6  0x0000aaaaf303d280 in ?? ()
    
    If I'd quit gdb, and restart, and redo bt, I get 
    
    #0  0x0000ffff806c81a8 in hash_seq_search@plt () from /usr/lib/postgresql/12/lib/pgoutput.so
    #1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
    #2  0x0000aaaad291ae58 in ?? ()
    
    or
    
    #0  0x0000aaaad2705244 in hash_seq_search ()
    #1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
    #2  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
    #3  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
    #4  0x0000aaaad257764c in ReorderBufferCommit ()
    #5  0x0000aaaad256c804 in ?? ()
    #6  0x0000aaaaf303d280 in ?? ()
    
    At this moment, the only thing that we can do is kill -9 the process (or
    restart pg).
    
    I don't know if it's relevant, but I have this case *right now*, and if
    it's helpful I can provide more information before we will have to kill
    it.
    
    Best regards,
    
    depesz
    
    
    
    
    
  56. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Kyotaro Horiguchi <horikyota.ntt@gmail.com> — 2023-02-13T06:45:49Z

    This seems to have a thin connection to the issue, but.
    
    At Wed, 16 Nov 2022 23:25:12 -0800, Andres Freund <andres@anarazel.de> wrote in 
    > I'm a bit confused by the changes the changes c6550776394e made to
    > KeepLogSeg() etc.
    > 
    > Before:
    > 	/* then check whether slots limit removal further */
    > 	if (max_replication_slots > 0 && keep != InvalidXLogRecPtr)
    > 	{
    > 		XLogSegNo	slotSegNo;
    > 
    > 		XLByteToSeg(keep, slotSegNo, wal_segment_size);
    > 
    > 		if (slotSegNo <= 0)
    > 			segno = 1;
    > 		else if (slotSegNo < segno)
    > 			segno = slotSegNo;
    > 	}
    > 
    > Current:
    > 	keep = XLogGetReplicationSlotMinimumLSN();
    > 	if (keep != InvalidXLogRecPtr)
    > 	{
    > 		XLByteToSeg(keep, segno, wal_segment_size);
    > 
    > 		/* Cap by max_slot_wal_keep_size ... */
    > 		if (max_slot_wal_keep_size_mb >= 0)
    > 		{
    > 			uint64		slot_keep_segs;
    > 
    > 			slot_keep_segs =
    > 				ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size);
    > 
    > 			if (currSegNo - segno > slot_keep_segs)
    > 				segno = currSegNo - slot_keep_segs;
    > 		}
    > 	}
    > 
    > Which looks to me to have lost the protection against segno being 0. Which
    > afaict would break the callers, because they do _logSegNo--. Which would then
    > allow all WAL to be reused, no?  I guess the argument for changing this was
    > that we should never have such a dangerous value returned by
    > XLogGetReplicationSlotMinimumLSN()?
    
    Mmm. Right. If XLogGetReplicationSlotMinimumLSN() returned an LSN
    within the zeroth segment, segno ends up with being 0, which is
    dangerous if happens. However, if it happens, slots are evidently
    already broken, but there's no cost to have that protection.
    
    > It seems decidedly not great to not log at least a debug1 (but probably it
    > should be LOG) message when KeepLogSeg() decides to limit based on
    > max_slot_wal_keep_size.
    
    It's easy to do that, but that log is highly accompanied by a LOG line
    "terminating process %d to release replication slot \"%s\"". I don't
    mind adding it if it is a DEBUGx.
    
    > It feels wrong to subtract max_slot_wal_keep_size from recptr - that's the end
    > of the checkpoint record. Given that we, leaving max_slot_wal_keep_size aside,
    > only actually remove WAL if older than the segment that RedoRecPtr (the
    > logical start of the checkpoint) is in. If the checkpoint is large we'll end
    > up removing replication slots even though they potentially would only have
    > retained one additional WAL segment.
    
    I think that it is a controversial part, but that variable is defined
    the similar way to wal_keep_size.  And I think that all max_wal_size,
    wal_keep_size and max_slot_wal_keep_size being defined with the same
    base LSN makes things simpler for users (also for developers).
    Regardless of checkpoint length, if slots get frequently invalidated,
    the setting should be considered to be too small for the system
    requirements.
    
    > Isn't it problematic to use ConvertToXSegs() to implement
    > max_slot_wal_keep_size, given that it rounds *down*? Particularly for a large
    > wal_segment_size that'd afaict lead to being much more aggressive invalidating
    > slots.
    
    I think max_slot_wal_keep_size is, like max_wal_size for checkpoints,
    a safeguard for slots not to fill-up WAL directory. Thus they both are
    rounded down. If you have 1GB WAL directory and set wal_segment_size
    to 4192MB, I don't see it a sane setup. But if segment size is smaller
    than one hundredth of max_wal_size, that difference won't matter for
    anyone.  But anyway, it's a pain in the a.. that the both of them (and
    wal_keep_size) don't work in a immediate manner, though..
    
    > Also, why do we do something as expensive as
    > InvalidateObsoleteReplicationSlots() even when max_slot_wal_keep_size had no
    > effect?
    
    Indeed. Maybe we didn't regard that process as complex at start? I
    think we can compare the cutoff segno against
    XLogGetReplicationSlotMinimumLSN() before entering the loop over
    slots.
    
    Thus I think there's room for the following improvements.
    
    - Prevent KeepLogSeg from returning 0.
    
    - Add DEBUG log to KeepLogSeg emitted when max_slot_wal_keep_size affects.
    
    - Check against minimum slot LSN before actually examining through the
      slots in Invalidateobsoletereplicationslots.
    
    I'm not sure about the second item but the others seem back-patchable.
    
    If we need to continue further discussion, will need another
    thread. Anyway I'll come up with the patch for the above three items.
    
    regards.
    
    -- 
    Kyotaro Horiguchi
    NTT Open Source Software Center
    
    
    
    
  57. Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

    Andres Freund <andres@anarazel.de> — 2023-02-13T19:41:31Z

    Hi,
    
    On 2023-02-13 15:45:49 +0900, Kyotaro Horiguchi wrote:
    > This seems to have a thin connection to the issue, but.
    
    I was worried that the changes could lead us to removing WAL without
    max_slot_wal_keep_size set.
    
    
    > > It seems decidedly not great to not log at least a debug1 (but probably it
    > > should be LOG) message when KeepLogSeg() decides to limit based on
    > > max_slot_wal_keep_size.
    > 
    > It's easy to do that, but that log is highly accompanied by a LOG line
    > "terminating process %d to release replication slot \"%s\"". I don't
    > mind adding it if it is a DEBUGx.
    
    My problem with that is that we might *NOT* see those log messages for some
    reason, but that that's impossible to debug as-is. And even if we see them,
    it's not that easy to figure out by how much we were over
    max_slot_wal_keep_size, because we always report it in the context of a
    specific slot.
    
    Removing WAL that's still needed is a *very* severe operation. Emitting an
    additional line in case it happens isn't a problem.
    
    
    > > It feels wrong to subtract max_slot_wal_keep_size from recptr - that's the end
    > > of the checkpoint record. Given that we, leaving max_slot_wal_keep_size aside,
    > > only actually remove WAL if older than the segment that RedoRecPtr (the
    > > logical start of the checkpoint) is in. If the checkpoint is large we'll end
    > > up removing replication slots even though they potentially would only have
    > > retained one additional WAL segment.
    > 
    > I think that it is a controversial part, but that variable is defined
    > the similar way to wal_keep_size.  And I think that all max_wal_size,
    > wal_keep_size and max_slot_wal_keep_size being defined with the same
    > base LSN makes things simpler for users (also for developers).
    > Regardless of checkpoint length, if slots get frequently invalidated,
    > the setting should be considered to be too small for the system
    > requirements.
    
    I think it's bad that we define wal_keep_size, max_slot_wal_keep_size that
    way. I don't think bringing max_wal_size into this is useful, as it influences
    different things.
    
    
    > > Isn't it problematic to use ConvertToXSegs() to implement
    > > max_slot_wal_keep_size, given that it rounds *down*? Particularly for a large
    > > wal_segment_size that'd afaict lead to being much more aggressive invalidating
    > > slots.
    > 
    > I think max_slot_wal_keep_size is, like max_wal_size for checkpoints,
    > a safeguard for slots not to fill-up WAL directory. Thus they both are
    > rounded down. If you have 1GB WAL directory and set wal_segment_size
    > to 4192MB, I don't see it a sane setup. But if segment size is smaller
    > than one hundredth of max_wal_size, that difference won't matter for
    > anyone.  But anyway, it's a pain in the a.. that the both of them (and
    > wal_keep_size) don't work in a immediate manner, though..
    
    It doesn't matter a lot for 16MB segments, but with 1GB segments it's a
    different story.
    
    To me the way it's done now is a bug, one that can in extreme circumstances
    lead to data loss.
    
    
    > > Also, why do we do something as expensive as
    > > InvalidateObsoleteReplicationSlots() even when max_slot_wal_keep_size had no
    > > effect?
    > 
    > Indeed. Maybe we didn't regard that process as complex at start? I
    > think we can compare the cutoff segno against
    > XLogGetReplicationSlotMinimumLSN() before entering the loop over
    > slots.
    
    That'd be better, but I'd probably go further, and really gate it on
    max_slot_wal_keep_size having had an effect.
    
    
    > Thus I think there's room for the following improvements.
    > 
    > - Prevent KeepLogSeg from returning 0.
    > 
    > - Add DEBUG log to KeepLogSeg emitted when max_slot_wal_keep_size affects.
    > 
    > - Check against minimum slot LSN before actually examining through the
    >   slots in Invalidateobsoletereplicationslots.
    > 
    > I'm not sure about the second item but the others seem back-patchable.
    > 
    > If we need to continue further discussion, will need another
    > thread. Anyway I'll come up with the patch for the above three items.
    
    Yep, probably a good idea to start another thread.
    
    There's also https://www.postgresql.org/message-id/20220223014855.4lsddr464i7mymk2%40alap3.anarazel.de
    that unfortunately nobody replied to.
    
    Greetings,
    
    Andres Freund