Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. pg_rewind: Skip copy of WAL segments generated before point of divergence

  2. pg_rewind: Extend code detecting relation files to work with WAL files

  3. Split TESTDIR into TESTLOGDIR and TESTDATADIR

  1. Making pg_rewind faster

    vignesh ravichandran <admin@viggy28.dev> — 2022-06-30T13:22:28Z

    Hi Hackers,
    
    
    
    I have been using pg_rewind in production for 2 years. One of the things that I noticed in pg_rewind is if it doesn't know what to do with a file "it copies". I understand it's the more safer option. After all, the alternative, pg_basebackup copies all the files from source to target.
    
    
    
    However, this is making pg_rewind inefficient when we have a high number of WAL files. Majority of the data (in most of my cases 95%+) that it copies are WAL files which are anyway same between the source and target. Skipping those same WAL files from copying will improve the speed of pg_rewind a lot.
    
    
    
    1. Does pg_rewind need to copy WAL files before the WAL that contains the last common check point?
    
    
    
    Heikki's presentation https://pgsessions.com/assets/archives/pg_rewind-presentation-paris.pdf gave me a good overview and also explained the behavior what I mentioned.
    
    
    
    Thanks,
    
    Vignesh
  2. Re: Making pg_rewind faster

    Justin Kwan <justinpkwan@outlook.com> — 2022-07-15T22:24:54Z

    Hi everyone!
    
    Here's the attached patch submission to optimize pg_rewind performance when many WAL files are retained on server. This patch avoids replaying (copying over) older WAL segment files that fall before the point of divergence between the source and target servers.
    
    Thanks,
    Justin
    ________________________________
    From: Justin Kwan <jkwan@cloudflare.com>
    Sent: July 15, 2022 6:13 PM
    To: vignesh ravichandran <admin@viggy28.dev>
    Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; justinpkwan@outlook.com <justinpkwan@outlook.com>
    Subject: Re: Making pg_rewind faster
    
    Looping in my other email.
    
    On Thu, Jun 30, 2022 at 6:22 AM vignesh ravichandran <admin@viggy28.dev<mailto:admin@viggy28.dev>> wrote:
    Hi Hackers,
    
    I have been using pg_rewind in production for 2 years. One of the things that I noticed in pg_rewind is if it doesn't know what to do with a file "it copies". I understand it's the more safer option. After all, the alternative, pg_basebackup copies all the files from source to target.
    
    However, this is making pg_rewind inefficient when we have a high number of WAL files. Majority of the data (in most of my cases 95%+) that it copies are WAL files which are anyway same between the source and target. Skipping those same WAL files from copying will improve the speed of pg_rewind a lot.
    
    1. Does pg_rewind need to copy WAL files before the WAL that contains the last common check point?
    
    Heikki's presentation https://pgsessions.com/assets/archives/pg_rewind-presentation-paris.pdf gave me a good overview and also explained the behavior what I mentioned.
    
    Thanks,
    Vignesh
    
    
  3. Re: Making pg_rewind faster

    Justin Kwan <justinpkwan@outlook.com> — 2022-07-16T03:16:27Z

    Hi everyone!
    
    I've also attached the pg_rewind optimization patch file for Postgres version 14.4. The previous patch file targets version Postgres version 15 Beta 1/2.
    
    Thanks,
    Justin
    ________________________________
    From: Justin Kwan <jkwan@cloudflare.com>
    Sent: July 15, 2022 6:13 PM
    To: vignesh ravichandran <admin@viggy28.dev>
    Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; justinpkwan@outlook.com <justinpkwan@outlook.com>
    Subject: Re: Making pg_rewind faster
    
    Looping in my other email.
    
    On Thu, Jun 30, 2022 at 6:22 AM vignesh ravichandran <admin@viggy28.dev<mailto:admin@viggy28.dev>> wrote:
    Hi Hackers,
    
    I have been using pg_rewind in production for 2 years. One of the things that I noticed in pg_rewind is if it doesn't know what to do with a file "it copies". I understand it's the more safer option. After all, the alternative, pg_basebackup copies all the files from source to target.
    
    However, this is making pg_rewind inefficient when we have a high number of WAL files. Majority of the data (in most of my cases 95%+) that it copies are WAL files which are anyway same between the source and target. Skipping those same WAL files from copying will improve the speed of pg_rewind a lot.
    
    1. Does pg_rewind need to copy WAL files before the WAL that contains the last common check point?
    
    Heikki's presentation https://pgsessions.com/assets/archives/pg_rewind-presentation-paris.pdf gave me a good overview and also explained the behavior what I mentioned.
    
    Thanks,
    Vignesh
    
    
  4. Re: Making pg_rewind faster

    Tom Lane <tgl@sss.pgh.pa.us> — 2022-07-17T18:40:32Z

    Justin Kwan <justinpkwan@outlook.com> writes:
    > I've also attached the pg_rewind optimization patch file for Postgres version 14.4. The previous patch file targets version Postgres version 15 Beta 1/2.
    
    It's very unlikely that we would consider committing such changes into
    released branches.  In fact, it's too late even for v15.  You should
    be submitting non-bug-fix patches against master (v16-to-be).
    
    			regards, tom lane
    
    
    
    
  5. Re: Making pg_rewind faster

    Justin Kwan <justinpkwan@outlook.com> — 2022-07-18T17:14:00Z

    Hi Tom,
    
    Thank you for taking a look at this and that sounds good. I will send over a patch compatible with Postgres v16.
    
    Justin
    ________________________________
    From: Tom Lane <tgl@sss.pgh.pa.us>
    Sent: July 17, 2022 2:40 PM
    To: Justin Kwan <justinpkwan@outlook.com>
    Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; jkwan@cloudflare.com <jkwan@cloudflare.com>; vignesh ravichandran <admin@viggy28.dev>; hlinnaka@iki.fi <hlinnaka@iki.fi>
    Subject: Re: Making pg_rewind faster
    
    Justin Kwan <justinpkwan@outlook.com> writes:
    > I've also attached the pg_rewind optimization patch file for Postgres version 14.4. The previous patch file targets version Postgres version 15 Beta 1/2.
    
    It's very unlikely that we would consider committing such changes into
    released branches.  In fact, it's too late even for v15.  You should
    be submitting non-bug-fix patches against master (v16-to-be).
    
                            regards, tom lane
    
  6. Re: Making pg_rewind faster

    Michael Paquier <michael@paquier.xyz> — 2022-07-19T05:36:25Z

    On Mon, Jul 18, 2022 at 05:14:00PM +0000, Justin Kwan wrote:
    > Thank you for taking a look at this and that sounds good. I will
    > send over a patch compatible with Postgres v16.
    
    +$node_2->psql(
    +       'postgres',
    +       "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000003');",
    +       stdout => \my $last_common_tli1_wal_last_modified_at);
    Please note that you should not rely on the FS-level stats for
    anything that touches the WAL segments.  A rough guess about what you
    could here to make sure that only the set of WAL segments you are
    looking for is being copied over would be to either:
    - Scan the logs produced by pg_rewind and see if the segments are
    copied or not, depending on the divergence point (aka the last
    checkpoint before WAL forked).
    - Clean up pg_wal/ in the target node before running pg_rewind,
    checking that only the segments you want are available once the
    operation completes.
    --
    Michael
    
  7. Re: Making pg_rewind faster

    Justin Kwan <justinpkwan@outlook.com> — 2022-07-28T22:46:28Z

    Hi Michael,
    
    Not sure if this email went through previously but thank you for your feedback, I've incorporated your suggestions by scanning the logs produced from pg_rewind when asserting that certain WAL segment files were skipped from being copied over to the target server.
    
    I've also updated the pg_rewind patch file to target the Postgres master branch (version 16 to be). Please see attached.
    
    Thanks,
    Justin
    
    ________________________________
    From: Justin Kwan <justinpkwan@outlook.com>
    Sent: July 18, 2022 1:14 PM
    To: Tom Lane <tgl@sss.pgh.pa.us>
    Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; jkwan@cloudflare.com <jkwan@cloudflare.com>; vignesh ravichandran <admin@viggy28.dev>; hlinnaka@iki.fi <hlinnaka@iki.fi>
    Subject: Re: Making pg_rewind faster
    
    Hi Tom,
    
    Thank you for taking a look at this and that sounds good. I will send over a patch compatible with Postgres v16.
    
    Justin
    ________________________________
    From: Tom Lane <tgl@sss.pgh.pa.us>
    Sent: July 17, 2022 2:40 PM
    To: Justin Kwan <justinpkwan@outlook.com>
    Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; jkwan@cloudflare.com <jkwan@cloudflare.com>; vignesh ravichandran <admin@viggy28.dev>; hlinnaka@iki.fi <hlinnaka@iki.fi>
    Subject: Re: Making pg_rewind faster
    
    Justin Kwan <justinpkwan@outlook.com> writes:
    > I've also attached the pg_rewind optimization patch file for Postgres version 14.4. The previous patch file targets version Postgres version 15 Beta 1/2.
    
    It's very unlikely that we would consider committing such changes into
    released branches.  In fact, it's too late even for v15.  You should
    be submitting non-bug-fix patches against master (v16-to-be).
    
                            regards, tom lane
    
  8. Re: Making pg_rewind faster

    Alexander Korotkov <aekorotkov@gmail.com> — 2022-09-13T17:50:20Z

    Hi, Justin!
    
    On Fri, Jul 29, 2022 at 1:05 PM Justin Kwan <justinpkwan@outlook.com> wrote:
    > Not sure if this email went through previously but thank you for your feedback, I've incorporated your suggestions by scanning the logs produced from pg_rewind when asserting that certain WAL segment files were skipped from being copied over to the target server.
    >
    > I've also updated the pg_rewind patch file to target the Postgres master branch (version 16 to be). Please see attached.
    
    Thank you for the revision.
    
    I've taken a look at this patch. Overall it looks good to me. I also
    don't see any design objections in the thread.
    
    A couple of points from me:
    1) I would prefer to evade hard-coded names for WAL segments in the
    tap tests. Could we calculate the names in the tap tests based on the
    diverge point, etc.?
    2) Patch contains some indentation with spaces, which should be done
    in tabs. Please consider either manually fixing this or running
    pgindent over modified files.
    
    ------
    Regards,
    Alexander Korotkov
    
    
    
    
  9. Re: Making pg_rewind faster

    Andres Freund <andres@anarazel.de> — 2022-10-02T17:44:25Z

    Hi,
    
    On 2022-09-13 20:50:20 +0300, Alexander Korotkov wrote:
    > On Fri, Jul 29, 2022 at 1:05 PM Justin Kwan <justinpkwan@outlook.com> wrote:
    > > Not sure if this email went through previously but thank you for your feedback, I've incorporated your suggestions by scanning the logs produced from pg_rewind when asserting that certain WAL segment files were skipped from being copied over to the target server.
    > >
    > > I've also updated the pg_rewind patch file to target the Postgres master branch (version 16 to be). Please see attached.
    > 
    > Thank you for the revision.
    > 
    > I've taken a look at this patch. Overall it looks good to me. I also
    > don't see any design objections in the thread.
    > 
    > A couple of points from me:
    > 1) I would prefer to evade hard-coded names for WAL segments in the
    > tap tests. Could we calculate the names in the tap tests based on the
    > diverge point, etc.?
    > 2) Patch contains some indentation with spaces, which should be done
    > in tabs. Please consider either manually fixing this or running
    > pgindent over modified files.
    
    This patch currently fails because it hasn't been adjusted for
    commit c47885bd8b6
    Author: Andres Freund <andres@anarazel.de>
    Date:   2022-09-19 18:03:17 -0700
     
        Split TESTDIR into TESTLOGDIR and TESTDATADIR
    
    The adjustment is trivial. Attached, together with also producing an error
    message rather than just dying wordlessly.
    
    
    It doesn't seem quite right to read pg_rewind's logs by reading
    regress_log_001_basic. Too easy to confuse different runs of pg_rewind
    etc. I'd suggest trying to redirect the log to a different file.
    
    
    With regard to Alexander's point about whitespace:
    
    .git/rebase-apply/patch:25: indent with spaces.
                    /* Handle WAL segment file. */
    .git/rebase-apply/patch:26: indent with spaces.
                    const char  *fname;
    .git/rebase-apply/patch:27: indent with spaces.
                    char        *slash;
    .git/rebase-apply/patch:29: indent with spaces.
                    /* Split filepath into directory & filename. */
    .git/rebase-apply/patch:30: indent with spaces.
                    slash = strrchr(path, '/');
    warning: squelched 29 whitespace errors
    warning: 34 lines add whitespace errors.
    
    Greetings,
    
    Andres Freund
    
  10. Re: Making pg_rewind faster

    Michael Paquier <michael@paquier.xyz> — 2022-10-06T07:08:45Z

    On Sun, Oct 02, 2022 at 10:44:25AM -0700, Andres Freund wrote:
    > It doesn't seem quite right to read pg_rewind's logs by reading
    > regress_log_001_basic. Too easy to confuse different runs of pg_rewind
    > etc. I'd suggest trying to redirect the log to a different file.
    
    Hardcoding log file names in the test increases the overall
    maintenance, even if renaming these would be easy to track and fix if
    the naming convention is changed.  Anyway, I think that what this
    patch should do is to use command_checks_all() in RewindTest.pm as it
    is the test API able to check after a status and multiple expected
    outputs, which is what the changes in 001 and 008 are doing.
    RewindTest::run_pg_rewind() needs to be a bit tweaked to accept these
    regex patterns in input.
    
    +    if (file_segno < last_common_segno)
    +    {
    +        pg_log_debug("WAL file entry \"%s\" not copied to target", fname);
    +        return FILE_ACTION_NONE;
    +    }
    There may be something I am missing here, but there is no need to care
    about segments with a TLI older than lastcommontliIndex, no?
    
    decide_wal_file_action() assumes that the WAL segment exists on the
    target and the source.  This looks bug-prone to me without at least an
    assertion.
    
    file_entry_t has an entry to track if a file is a relation file.  I
    think that it would be much cleaner to track if we are handling a WAL
    segment when inserting an entry in insert_filehash_entry(), so
    isrelfile could be replaced by an enum with three values: relation
    file, WAL segment and the rest.
    --
    Michael
    
  11. Re: Making pg_rewind faster

    Michael Paquier <michael@paquier.xyz> — 2022-11-30T07:03:29Z

    On Thu, Oct 06, 2022 at 04:08:45PM +0900, Michael Paquier wrote:
    > file_entry_t has an entry to track if a file is a relation file.  I
    > think that it would be much cleaner to track if we are handling a WAL
    > segment when inserting an entry in insert_filehash_entry(), so
    > isrelfile could be replaced by an enum with three values: relation
    > file, WAL segment and the rest.
    
    This review has been done a few weeks ago, and there has been no
    update since, so I am marking this entry as returned with feedback in
    the CF app.
    --
    Michael
    
  12. Re: Making pg_rewind faster

    John H <johnhyvr@gmail.com> — 2025-07-01T18:13:46Z

    Hi,
    
    I've attached an updated version of the patch against master with the changes
    suggested.
    
    On Tue, Nov 29, 2022 at 10:03 PM Michael Paquier <michael@paquier.xyz> wrote:
    >
    > On Thu, Oct 06, 2022 at 04:08:45PM +0900, Michael Paquier wrote:
    >>
    >>  There may be something I am missing here, but there is no need to care
    >> about segments with a TLI older than lastcommontliIndex, no?
    
    Hard to say. pg_rewind is intended to make the same "copy" of the cluster which
    implies pg_wal/ should look the same. There might be use cases around logical
    replication where you would want these WAL files to still exist even
    across promotions?
    
    >> decide_wal_file_action() assumes that the WAL segment exists on the
    >> target and the source.  This looks bug-prone to me without at least an
    >> assertion.
    
    From previous refactors there is now an Assertion in filemap.c
    decide_file_action that handles this.
    
    > Assert(entry->target_exists && entry->source_exists);
    
    decide_wal_file_action is called after the assertion.
    
    Thanks,
    
    --
    John Hsu - Amazon Web Services
    
  13. Re: Making pg_rewind faster

    Japin Li <japinli@hotmail.com> — 2025-07-02T02:20:50Z

    On Tue, 01 Jul 2025 at 11:13, John H <johnhyvr@gmail.com> wrote:
    > Hi,
    >
    > I've attached an updated version of the patch against master with the changes
    > suggested.
    >
    > On Tue, Nov 29, 2022 at 10:03 PM Michael Paquier <michael@paquier.xyz> wrote:
    >>
    >> On Thu, Oct 06, 2022 at 04:08:45PM +0900, Michael Paquier wrote:
    >>>
    >>>  There may be something I am missing here, but there is no need to care
    >>> about segments with a TLI older than lastcommontliIndex, no?
    >
    > Hard to say. pg_rewind is intended to make the same "copy" of the cluster which
    > implies pg_wal/ should look the same. There might be use cases around logical
    > replication where you would want these WAL files to still exist even
    > across promotions?
    >
    >>> decide_wal_file_action() assumes that the WAL segment exists on the
    >>> target and the source.  This looks bug-prone to me without at least an
    >>> assertion.
    >
    > From previous refactors there is now an Assertion in filemap.c
    > decide_file_action that handles this.
    >
    >> Assert(entry->target_exists && entry->source_exists);
    >
    > decide_wal_file_action is called after the assertion.
    >
    
    Hi, John
    
    Thanks for updating the patch.
    
    1.
    +/* Determine the type of file content (relation, WAL, or other) */
    +static file_content_type_t
    +getFileType(const char *path)
    
    Considering the existence of file_type_t, would getFileContentType() be a
    suitable function for handling file content types?
    
    2.
    Perhaps decide_wal_file_action() could be defined in filemap.c.
    
    
    While this is unrelated to WAL logging, it could also contribute to faster
    pg_rewind operations.  Should we consider ignoring log files under PGDATA
    (e.g., those in the default log/ directory)?
    
    -- 
    Regards,
    Japin Li
    
    
    
    
  14. Re: Making pg_rewind faster

    wenhui qiu <qiuwenhuifx@gmail.com> — 2025-07-02T03:16:25Z

    HI
    >  2.
    > Perhaps decide_wal_file_action() could be defined in filemap.c.
    
    
    >  While this is unrelated to WAL logging, it could also contribute to
    faster
    > pg_rewind operations.  Should we consider ignoring log files under PGDATA
    > (e.g., those in the default log/ directory)?
    Agree ,Usually the log file directory also takes up a lot of space, and the
    number of log files is quite large
    
    On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com> wrote:
    
    > On Tue, 01 Jul 2025 at 11:13, John H <johnhyvr@gmail.com> wrote:
    > > Hi,
    > >
    > > I've attached an updated version of the patch against master with the
    > changes
    > > suggested.
    > >
    > > On Tue, Nov 29, 2022 at 10:03 PM Michael Paquier <michael@paquier.xyz>
    > wrote:
    > >>
    > >> On Thu, Oct 06, 2022 at 04:08:45PM +0900, Michael Paquier wrote:
    > >>>
    > >>>  There may be something I am missing here, but there is no need to care
    > >>> about segments with a TLI older than lastcommontliIndex, no?
    > >
    > > Hard to say. pg_rewind is intended to make the same "copy" of the
    > cluster which
    > > implies pg_wal/ should look the same. There might be use cases around
    > logical
    > > replication where you would want these WAL files to still exist even
    > > across promotions?
    > >
    > >>> decide_wal_file_action() assumes that the WAL segment exists on the
    > >>> target and the source.  This looks bug-prone to me without at least an
    > >>> assertion.
    > >
    > > From previous refactors there is now an Assertion in filemap.c
    > > decide_file_action that handles this.
    > >
    > >> Assert(entry->target_exists && entry->source_exists);
    > >
    > > decide_wal_file_action is called after the assertion.
    > >
    >
    > Hi, John
    >
    > Thanks for updating the patch.
    >
    > 1.
    > +/* Determine the type of file content (relation, WAL, or other) */
    > +static file_content_type_t
    > +getFileType(const char *path)
    >
    > Considering the existence of file_type_t, would getFileContentType() be a
    > suitable function for handling file content types?
    >
    > 2.
    > Perhaps decide_wal_file_action() could be defined in filemap.c.
    >
    >
    > While this is unrelated to WAL logging, it could also contribute to faster
    > pg_rewind operations.  Should we consider ignoring log files under PGDATA
    > (e.g., those in the default log/ directory)?
    >
    > --
    > Regards,
    > Japin Li
    >
    >
    >
    
  15. Re: Making pg_rewind faster

    John H <johnhyvr@gmail.com> — 2025-07-02T18:21:54Z

    Hi,
    
    Thanks for the quick review.
    
    On Tue, Jul 1, 2025 at 8:16 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
    > > Perhaps decide_wal_file_action() could be defined in filemap.c.
    >
    
    That's a good point. I updated the patch to reflect that.
    
    > >  While this is unrelated to WAL logging, it could also contribute to faster
    > > pg_rewind operations.  Should we consider ignoring log files under PGDATA
    > > (e.g., those in the default log/ directory)?
    > Agree ,Usually the log file directory also takes up a lot of space, and the number of log files is quite large
    >
    
    Should we handle this use case? I do agree that for the more common
    use-cases of pg_rewind which is synchronizing an old writer to the new
    leader after failover, avoiding syncing the logging directory is
    useful.
    At the same time, pg_rewind is intended to make the same copy of the
    source cluster as efficiently as possible which would include "all"
    directories if they exist in $PGDATA. By default logging_collector is
    off as well. I'd also think you would want to avoid putting the logs
    in $PGDATA to have smaller backups if you are using tools like
    pg_basebackup.
    
    > On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com> wrote:
    >>
    >> Hi, John
    >>
    >> Thanks for updating the patch.
    >>
    >> 1.
    >> +/* Determine the type of file content (relation, WAL, or other) */
    >> +static file_content_type_t
    >> +getFileType(const char *path)
    >>
    >> Considering the existence of file_type_t, would getFileContentType() be a
    >> suitable function for handling file content types?
    
    Do you mean naming getFileType to getFileContentType?
    
    Thanks,
    
    -- 
    John Hsu - Amazon Web Services
    
  16. Re: Making pg_rewind faster

    Japin Li <japinli@hotmail.com> — 2025-07-03T01:39:51Z

    On Wed, 02 Jul 2025 at 11:21, John H <johnhyvr@gmail.com> wrote:
    > Hi,
    >
    > Thanks for the quick review.
    >
    > On Tue, Jul 1, 2025 at 8:16 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
    >> > Perhaps decide_wal_file_action() could be defined in filemap.c.
    >>
    >
    > That's a good point. I updated the patch to reflect that.
    >
    
    Thanks for updating the patch.
    
    >> >  While this is unrelated to WAL logging, it could also contribute to faster
    >> > pg_rewind operations.  Should we consider ignoring log files under PGDATA
    >> > (e.g., those in the default log/ directory)?
    >> Agree ,Usually the log file directory also takes up a lot of space, and the number of log files is quite large
    >>
    >
    > Should we handle this use case? I do agree that for the more common
    > use-cases of pg_rewind which is synchronizing an old writer to the new
    > leader after failover, avoiding syncing the logging directory is
    > useful.
    > At the same time, pg_rewind is intended to make the same copy of the
    > source cluster as efficiently as possible which would include "all"
    > directories if they exist in $PGDATA. By default logging_collector is
    > off as well. I'd also think you would want to avoid putting the logs
    > in $PGDATA to have smaller backups if you are using tools like
    > pg_basebackup.
    >
    
    Splitting the logs from $PGDATA is definitely better. The question is whether
    it's worth implementing this directly in core or if a prominent note in the
    documentation would suffice.
    
    >> On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com> wrote:
    >>>
    >>> Hi, John
    >>>
    >>> Thanks for updating the patch.
    >>>
    >>> 1.
    >>> +/* Determine the type of file content (relation, WAL, or other) */
    >>> +static file_content_type_t
    >>> +getFileType(const char *path)
    >>>
    >>> Considering the existence of file_type_t, would getFileContentType() be a
    >>> suitable function for handling file content types?
    >
    > Do you mean naming getFileType to getFileContentType?
    >
    
    Exactly!  It's confusing that getFileType() returns file_content_type_t
    instead of file_type_t.
    
    For v5 patch:
    
    1.
    We could simply use the global WalSegSz variable within decide_file_action(),
    eliminating the need to pass wal_segsz_bytes as an argument.
    
    2.
    For last_common_segno, we could implement it similarly to WalSegSz, avoiding a
    signature change for decide_file_actions() and decide_file_action().  I'm not
    insisting on this approach, however.
    
    --
    Regards,
    Japin Li
    
    
    
    
  17. Re: Making pg_rewind faster

    John H <johnhyvr@gmail.com> — 2025-07-03T19:59:29Z

    Hi,
    
    On Wed, Jul 2, 2025 at 6:40 PM Japin Li <japinli@hotmail.com> wrote:
    >
    > >
    >
    > Splitting the logs from $PGDATA is definitely better. The question is whether
    > it's worth implementing this directly in core or if a prominent note in the
    > documentation would suffice.
    >
    
    I can work on the documentation update as a separate patch if folks
    think this is worthwhile.
    
    > >> On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com> wrote:
    >
    > Exactly!  It's confusing that getFileType() returns file_content_type_t
    > instead of file_type_t.
    >
    
    Ah yes that is confusing, updated in patch.
    
    > For v5 patch:
    >
    > 1.
    > We could simply use the global WalSegSz variable within decide_file_action(),
    > eliminating the need to pass wal_segsz_bytes as an argument.
    >
    
    Good point.
    
    > 2.
    > For last_common_segno, we could implement it similarly to WalSegSz, avoiding a
    > signature change for decide_file_actions() and decide_file_action().  I'm not
    > insisting on this approach, however.
    >
    
    I made it a global as well, and had to include access/xlog_internal.h
    in pg_rewind.h but I don't feel strongly about it either way.
    
    Thanks,
    
    -- 
    John Hsu - Amazon Web Services
    
  18. Re: Making pg_rewind faster

    Japin Li <japinli@hotmail.com> — 2025-07-04T02:48:25Z

    On Thu, 03 Jul 2025 at 12:59, John H <johnhyvr@gmail.com> wrote:
    > Hi,
    >
    > On Wed, Jul 2, 2025 at 6:40 PM Japin Li <japinli@hotmail.com> wrote:
    >>
    >> >
    >>
    >> Splitting the logs from $PGDATA is definitely better. The question is whether
    >> it's worth implementing this directly in core or if a prominent note in the
    >> documentation would suffice.
    >>
    >
    > I can work on the documentation update as a separate patch if folks
    > think this is worthwhile.
    >
    >> >> On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com> wrote:
    >>
    >> Exactly!  It's confusing that getFileType() returns file_content_type_t
    >> instead of file_type_t.
    >>
    >
    > Ah yes that is confusing, updated in patch.
    >
    >> For v5 patch:
    >>
    >> 1.
    >> We could simply use the global WalSegSz variable within decide_file_action(),
    >> eliminating the need to pass wal_segsz_bytes as an argument.
    >>
    >
    > Good point.
    >
    >> 2.
    >> For last_common_segno, we could implement it similarly to WalSegSz, avoiding a
    >> signature change for decide_file_actions() and decide_file_action().  I'm not
    >> insisting on this approach, however.
    >>
    >
    > I made it a global as well, and had to include access/xlog_internal.h
    > in pg_rewind.h but I don't feel strongly about it either way.
    >
    
    Thanks, LGTM.
    
    -- 
    Regards,
    Japin Li
    
    
    
    
  19. Re: Making pg_rewind faster

    wenhui qiu <qiuwenhuifx@gmail.com> — 2025-07-07T02:11:57Z

    Hi
    > Thanks, LGTM.
    I think it's possible to register to https://commitfest.postgresql.org/54,
    https://commitfest.postgresql.org/53 will closed soon
    
    
    
    Thanks
    
    On Fri, Jul 4, 2025 at 10:50 AM Japin Li <japinli@hotmail.com> wrote:
    
    > On Thu, 03 Jul 2025 at 12:59, John H <johnhyvr@gmail.com> wrote:
    > > Hi,
    > >
    > > On Wed, Jul 2, 2025 at 6:40 PM Japin Li <japinli@hotmail.com> wrote:
    > >>
    > >> >
    > >>
    > >> Splitting the logs from $PGDATA is definitely better. The question is
    > whether
    > >> it's worth implementing this directly in core or if a prominent note in
    > the
    > >> documentation would suffice.
    > >>
    > >
    > > I can work on the documentation update as a separate patch if folks
    > > think this is worthwhile.
    > >
    > >> >> On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com>
    > wrote:
    > >>
    > >> Exactly!  It's confusing that getFileType() returns file_content_type_t
    > >> instead of file_type_t.
    > >>
    > >
    > > Ah yes that is confusing, updated in patch.
    > >
    > >> For v5 patch:
    > >>
    > >> 1.
    > >> We could simply use the global WalSegSz variable within
    > decide_file_action(),
    > >> eliminating the need to pass wal_segsz_bytes as an argument.
    > >>
    > >
    > > Good point.
    > >
    > >> 2.
    > >> For last_common_segno, we could implement it similarly to WalSegSz,
    > avoiding a
    > >> signature change for decide_file_actions() and decide_file_action().
    > I'm not
    > >> insisting on this approach, however.
    > >>
    > >
    > > I made it a global as well, and had to include access/xlog_internal.h
    > > in pg_rewind.h but I don't feel strongly about it either way.
    > >
    >
    > Thanks, LGTM.
    >
    > --
    > Regards,
    > Japin Li
    >
    
  20. Re: Making pg_rewind faster

    John H <johnhyvr@gmail.com> — 2025-07-07T18:45:58Z

    Hi,
    
    Thanks for the review.
    
    On Sun, Jul 6, 2025 at 7:12 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
    >
    > Hi
    > > Thanks, LGTM.
    > I think it's possible to register to https://commitfest.postgresql.org/54, https://commitfest.postgresql.org/53 will closed soon
    >
    
    I've created the commitfest entry here:
    https://commitfest.postgresql.org/patch/5902/
    
    I realized I wasn't testing properly in 0006. I've updated the latest
    one to move it into a separate test file to make the results cleaner.
    
    Thanks,
    
    -- 
    John Hsu - Amazon Web Services
    
  21. Re: Making pg_rewind faster

    Robert Haas <robertmhaas@gmail.com> — 2025-09-25T18:35:16Z

    On Mon, Jul 7, 2025 at 2:46 PM John H <johnhyvr@gmail.com> wrote:
    > I've created the commitfest entry here:
    > https://commitfest.postgresql.org/patch/5902/
    >
    > I realized I wasn't testing properly in 0006. I've updated the latest
    > one to move it into a separate test file to make the results cleaner.
    
    Thanks to everyone who has worked on this patch. I think this is a
    real problem about which we ought to try to do something. I think you
    could even defend calling this a back-patchable bug fix, since copying
    a lot of files that we know have to be identical is a serious
    efficiency issue that can affect the usability of pg_upgrade. There
    are more aggressive things that we could try to do here that might be
    even better, e.g. we could copy just the WAL segment or segments that
    contain the last checkpoint record and let the server fetch the rest
    as needed after it starts up. However, that would be a definitional
    change, which we certainly wouldn't want to back-patch, and which
    might have downsides for some people, so perhaps it would need to be
    optional behavior. But the patch as proposed is just an optimization
    that shouldn't change the outcome for anyone. Perhaps it's still best
    to treat it as a master-only improvement, but calling it a fix for a
    performance bug isn't completely crazy either, IMHO.
    
    Moving on to the patch contents, the approach taken makes sense to me.
    Any WAL files that exist in both source and target from prior to the
    point of divergence should not need to be copied, because they should
    necessarily be identical. If anyone sees a flaw in this reasoning,
    please point it out! While I agree with the basic approach, I think
    that the patch could be hardened in two ways. First, isRelDataFile
    tests the complete pathname to decide whether it has a relation file,
    but getFileContentType only tests the final component of the pathname.
    I think it should test the whole pathname also, in case there's a
    stray file with a file name that looks like a WAL file in some other
    directory. In fact, I think that these two functions should be
    integrated. Let's rename the isRelDataFile() to getFileContentType()
    and put all the logic in that function, including appropriate tests of
    the path name. Second, in decide_file_action(), I think we should
    check that entry->target_size == entry->source_size and return
    FILE_ACTION_COPY if not. This case really shouldn't happen, but it's
    very cheap to check and might offer a bit of protection in some weird
    scenario.
    
    I am not a huge fan of the way that the patch stuffs last_common_segno
    into a global variable that is then accessed by
    decide_wal_file_action. I think it would be better to find a way to
    pass this information down through the call stack, either as a
    separate argument or as part of some struct we're already passing, if
    there's a reasonable and not-too-invasive way to do that. If it looks
    too ugly, maybe this is the best option, but I think we should try to
    find something better.
    
    It does not appear to me that the test case tests what it purports to
    test, and I don't think that what it purports to test is the right
    thing anyway. It claims that it is testing that a certain WAL file
    (which it erroneously calls a "WAL file entry") is not copied to the
    target, but I see no logic to actually check this. I also don't think
    it's the correct behavior. I think what the patch is doing and should
    do is ensure that we don't copy the file when it's already present. If
    the file isn't present in the target cluster, it should still be
    copied. Granted, I haven't verified that this is the actual behavior,
    but the "Handle cases where the file is missing from one of the
    systems" code in decide_file_actiomn() seems to be higher up than the
    code the patch changes, so I assume that logic should be applied
    first. If that's correct, I don't quite know what a test case here can
    usefully verify, since the only difference would be performance, but
    maybe I'm misunderstanding something?
    
    As an administrative note, the naming convention for the patches
    posted to the thread is not consistent and not correct. Use "git
    format-patch -v$VERSION" to generate patches. In a name like
    v7-0003-do-something.patch, "v7" is the version of the patch set as a
    whole, and "0003" identifies a specific patch within that version of
    the patch set. In other words, that would be the third patch in the
    series from the seventh time that the patch set was posted. This patch
    set so far contains only one patch, so the file name should be
    vX-0001-whatever.patch, where whatever is the subject of the patch and
    X is a number that increases by one every time it's reposted. Let git
    format-patch do the work for you, though.
    
    On a related note, something a committer would need to do if they
    wanted to commit this patch is figure out who should be credited with
    writing it. I see that a number of different people have posted
    versions of the patch; it would be nice if "Author:" or
    "Co-authored-by:" tags were added to the commit message to reflect
    whose code is present in a given version.
    
    -- 
    Robert Haas
    EDB: http://www.enterprisedb.com
    
    
    
    
  22. Re: Making pg_rewind faster

    John H <johnhyvr@gmail.com> — 2025-10-09T17:56:48Z

    Hi Robert,
    
    Thanks for taking a look.
    
    On Thu, Sep 25, 2025 at 11:35 AM Robert Haas <robertmhaas@gmail.com> wrote:
    > ...
    > integrated. Let's rename the isRelDataFile() to getFileContentType()
    > and put all the logic in that function, including appropriate tests of
    > the path name. Second, in decide_file_action(), I think we should
    > check that entry->target_size == entry->source_size and return
    > FILE_ACTION_COPY if not. This case really shouldn't happen, but it's
    > very cheap to check and might offer a bit of protection in some weird
    > scenario.
    
    Updated the patch to reflect that.
    
    > I am not a huge fan of the way that the patch stuffs last_common_segno
    > into a global variable that is then accessed by
    > ...
    
    I forgot why I bothered with a global instead since it was straightforward
    to actually pass in the argument. Updated.
    
    > It does not appear to me that the test case tests what it purports to
    > test, and I don't think that what it purports to test is the right
    > thing anyway. It claims that it is testing that a certain WAL file
    > (which it erroneously calls a "WAL file entry") is not copied to the
    > target, but I see no logic to actually check this.
    > ...
    
    That's a fair point. The test does:
    
    1. Write some data -> WAL 1
    2. SELECT pg_switch_wal() -> pg_current_wal_lsn() is at WAL 2
    3. CHECKPOINT; -> Makes WAL 2 the last common checkpoint
    
    Then it checks that "WAL 1 not copied over" was logged which isn't
    the best since if it was logged elsewhere then it would still pass.
    
    > ...
    > first. If that's correct, I don't quite know what a test case here can
    > usefully verify, since the only difference would be performance, but
    > maybe I'm misunderstanding something?
    
    I updated the test to run stat and get the modification time of the common
    WAL segment before and after pg_rewind and verify it is the same.
    
    In filemap.c in decide_wal_file_action, if you comment out
    > return FILE_ACTION_NONE;
    
    the test fails.
    
    >
    > As an administrative note, the naming convention for the patches
    > posted to the thread is not consistent and not correct. Use "git
    > format-patch -v$VERSION" to generate patches. In a name like
    > ...
    
    Fixed patch formatting.
    
    > ...
    > writing it. I see that a number of different people have posted
    > versions of the patch; it would be nice if "Author:" or
    > "Co-authored-by:" tags were added to the commit message to reflect
    
    I added Justin as the Co-Author in the commit message but I realized
    in their original patch they had "Author: Justin Kwan, Vignesh Ravichandran".
    Andres also had the latest patch series before mine so I'll leave it up to
    the committer how they want to handle this.
    
    Thanks,
    --
    John Hsu - Amazon Web Services
    
  23. Re: Making pg_rewind faster

    Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-09T19:09:43Z

    Hi,
    
    just a second late :( i was about to post a patch addressing the refactors
    which Robert mentioned  ,anyway will have a look at your latest patch John
    thanks :), curious about the tap test.
    
    while i was writing the patch something suddenly struck me , that is why we
    are even depending on last_common_segno ,because once we
    reached decide_wal_file_action it means that the file exists in both target
    and source ,AFAIK this can only happen with wal segments older than or
    equal to last_common_segno because once the promotion competes the filename
    of the WAL files gets changed with the new timelineID(2), for ex: if the
    last_common_segno is 000000010000000000000003 then based on the rules
    in XLogInitNewTimeline
    1) if the timeline switch happens in middle of segment ,copy data from the
    last WAL segment and create WAL file with same segno but different
    timelineID,in this case the starting WAL file for the new timeline will be
    000000020000000000000003
    2) if the timeline switch happens at segment boundary , just create next
    segment for this case the starting WAL file for the new timeline will be
    000000020000000000000004
    
    so basically the files which exists in source and not in target like the
    new timeline WAL segments will be copied to target in total before we reach
    decide_wal_file_action , so i think we don't need to think about copying
    WAL files after divergence point by calculating and checking against
    last_common_segno which we are doing in our current approach , i think we
    can just do
    
    in decide_wal_file_action
    
    if (entry->target_size != entry->source_size)
    {
    pg_log_debug("WAL segment file \"%s\" is copied to target", fname);
    return FILE_ACTION_COPY;
    }
    return FILE_ACTION_NONE;
    
    thoughts?
    
    -- 
    Thanks,
    Srinath Reddy Sadipiralla
    EDB: https://www.enterprisedb.com/
    
  24. Re: Making pg_rewind faster

    John H <johnhyvr@gmail.com> — 2025-10-09T21:48:33Z

    Hey Srinath
    
    On Thu, Oct 9, 2025 at 12:09 PM Srinath Reddy Sadipiralla
    <srinath2133@gmail.com> wrote:
    > ...
    > 1) if the timeline switch happens in middle of segment ,copy data from the last WAL segment and create WAL file with same segno but different timelineID,in this case the starting WAL file for the new timeline will be 000000020000000000000003
    > 2) if the timeline switch happens at segment boundary , just create next segment for this case the starting WAL file for the new timeline will be 000000020000000000000004
    >
    > so basically the files which exists in source and not in target like the new timeline WAL segments will be copied to target in total before we reach decide_wal_file_action , so i think we don't need to think about copying WAL files after divergence point by calculating and checking against last_common_segno which we are doing in our current approach , i think we can just do
    >
    > ...
    
    That's a great point. I need to think about it some more but the
    reasoning makes sense to me.
    I think 'last_common_segno ' is only useful as another sanity check
    but we already have the size
    ones.
    
    Thanks,
    
    -- 
    John Hsu - Amazon Web Services
    
    
    
    
  25. Re: Making pg_rewind faster

    Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-10T07:44:59Z

    Hi John , I have reviewed your latest patch(v8-0001).
    
    On Thu, Oct 9, 2025 at 11:27 PM John H <johnhyvr@gmail.com> wrote:
    
    > > ...
    > > integrated. Let's rename the isRelDataFile() to getFileContentType()
    > > and put all the logic in that function, including appropriate tests of
    > > the path name. Second, in decide_file_action(), I think we should
    > > check that entry->target_size == entry->source_size and return
    > > FILE_ACTION_COPY if not. This case really shouldn't happen, but it's
    > > very cheap to check and might offer a bit of protection in some weird
    > > scenario.
    >
    > Updated the patch to reflect that.
    >
    
    in getFileContentType as you moved the logic of validating
    whether they are xlog files or not using IsXLogFileName ,
    i think it's better to validate it the kind of similar way as its
    done for relfiles by parsing the path using sscanf this will
    help to get the tli,logno,segno of the xlog file and using
    XLogFromFileName which gives us logSegNo ,so that we
    can rebuild the xlog file path again using these values with
    XLogFilePath , then validate this new path with the given path
    ,this helps to catch invalid xlog files like
    pg_wal/00000001FFFFFFFFFFFFFF10.
    
    maybe like this
    + if (strncmp("pg_wal/", path, 7) == 0)
    + {
    + const char *filename = path + 7;
    + XLogFromFileName(filename,&tli,logSegNo,WalSegSz);
    + XLogFilePath(check_path,tli,*logSegNo,WalSegSz);
    + if (strcmp(check_path, path) == 0)
    + return FILE_CONTENT_TYPE_WAL;
    + else
    + return FILE_CONTENT_TYPE_OTHER;
    + }
    
    
    >
    > > I am not a huge fan of the way that the patch stuffs last_common_segno
    > > into a global variable that is then accessed by
    > > ...
    >
    > I forgot why I bothered with a global instead since it was straightforward
    > to actually pass in the argument. Updated.
    >
    
    instead of passing last_common_segno down the call stack directly,
    we can have a struct maybe called "rewindState" which will have the common
    information related to both clusters involved in rewind ,so that in future
    if we want to pass more such information down we won't be increasing
    the number of arguments that need to be passed down to all functions in
    stack.
    
    
    >
    > > It does not appear to me that the test case tests what it purports to
    > > test, and I don't think that what it purports to test is the right
    > > thing anyway. It claims that it is testing that a certain WAL file
    > > (which it erroneously calls a "WAL file entry") is not copied to the
    > > target, but I see no logic to actually check this.
    > > ...
    >
    > That's a fair point. The test does:
    >
    > 1. Write some data -> WAL 1
    > 2. SELECT pg_switch_wal() -> pg_current_wal_lsn() is at WAL 2
    > 3. CHECKPOINT; -> Makes WAL 2 the last common checkpoint
    >
    > Then it checks that "WAL 1 not copied over" was logged which isn't
    > the best since if it was logged elsewhere then it would still pass.
    >
    
    we can improve the TAP test file header comment as
    # Test the situation where we skip copying the same WAL files from source
    to target
    # except if WAL file size differs.
    
    let me put it this way
    1) WALs are written to 000000010000000000000002 in primary.
    2) then SELECT pg_switch_wal() now the current WAL is
    000000010000000000000003
    3) CHECKPOINT; this is the last common checkpoint and is written in
    000000010000000000000003.
    4) until this point we have the same WAL files in primary and standby.
    5) promote standby ,this leads to timeline switch and lets say the
    divergence point
    is at segment boundary , so the first WAL segment in the new timeline is
    000000020000000000000004.
    6) when we stop standby then primary , in primary we get checkpoint
    shutdown which
    is greater than the divergence point ,means there is divergence so
    pg_rewind is needed.
    7) so the last common WAL segment was 000000010000000000000003.
    
    I have mentioned below about how this updated TAP test is testing whether
    common WAL file is being copied or not and about one more test we need to
    cover
    in this TAP test.
    
    
    >
    > > ...
    > > first. If that's correct, I don't quite know what a test case here can
    > > usefully verify, since the only difference would be performance, but
    > > maybe I'm misunderstanding something?
    >
    > I updated the test to run stat and get the modification time of the common
    > WAL segment before and after pg_rewind and verify it is the same.
    >
    > In filemap.c in decide_wal_file_action, if you comment out
    > > return FILE_ACTION_NONE;
    >
    > the test fails.
    >
    
    This test makes sense to me, by checking whether
    last common segment(000000010000000000000003 in our case)
    been copied or not using the stat->mtime after the pg_rewind, if its not
    copied then stat->mtime will be the same ,else it will differ with the
    one we noted before rewind, but we also need to test the case
    where entry->target_size != entry->source_size and we do copy,
    like if stat->mtime is changed (which means file has been copied)
    and target_stat->size != soucrce_stat->size here no error expected.
    
    
    -- 
    Thanks,
    Srinath Reddy Sadipiralla
    EDB: https://www.enterprisedb.com/
    
  26. Re: Making pg_rewind faster

    John H <johnhyvr@gmail.com> — 2025-10-14T23:19:19Z

    Hi,
    
    On Fri, Oct 10, 2025 at 12:45 AM Srinath Reddy Sadipiralla
    <srinath2133@gmail.com> wrote:
    >
    > ...
    > XLogFilePath , then validate this new path with the given path
    > ,this helps to catch invalid xlog files like pg_wal/00000001FFFFFFFFFFFFFF10.
    > ...
    
    Are you concerned that somehow these files, which are named like XLog
    files but actually
    aren't, are somehow created therefore we should sync them in case?
    I'm trying to understand how these files would be generated in the first place.
    
    >
    > instead of passing last_common_segno down the call stack directly,
    > we can have a struct maybe called "rewindState" which will have the common
    > information related to both clusters involved in rewind ,so that in future
    > if we want to pass more such information down we won't be increasing
    > the number of arguments that need to be passed down to all functions in stack.
    >
    
    I don't feel too strongly about this. Not sure how we view
    future-proofing things,
    as in do we proactively wrap around structs or wait until there's an
    actual need.
    
    This is the first time it's been touched since it was introduced in 14 for what
    it's worth.
    
    >
    > we can improve the TAP test file header comment as
    > # Test the situation where we skip copying the same WAL files from source to target
    > # except if WAL file size differs.
    >
    > let me put it this way
    > 1) WALs are written to 000000010000000000000002 in primary.
    > ...
    > 7) so the last common WAL segment was 000000010000000000000003.
    
    Updated comments.
    
    >
    > ...
    > where entry->target_size != entry->source_size and we do copy,
    > like if stat->mtime is changed (which means file has been copied)
    > and target_stat->size != soucrce_stat->size here no error expected.
    >
    
    Updated patch with one additional segment has a byte appended to it
    on target but not source.
    
    
    Thanks,
    
    John H
    
  27. Re: Making pg_rewind faster

    Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-15T14:26:56Z

    Hi,
    
    On Wed, Oct 15, 2025 at 4:49 AM John H <johnhyvr@gmail.com> wrote:
    
    > Hi,
    >
    > On Fri, Oct 10, 2025 at 12:45 AM Srinath Reddy Sadipiralla
    > <srinath2133@gmail.com> wrote:
    > >
    > > ...
    > > XLogFilePath , then validate this new path with the given path
    > > ,this helps to catch invalid xlog files like
    > pg_wal/00000001FFFFFFFFFFFFFF10.
    > > ...
    >
    > Are you concerned that somehow these files, which are named like XLog
    > files but actually
    > aren't, are somehow created therefore we should sync them in case?
    > I'm trying to understand how these files would be generated in the first
    > place.
    >
    
    the problem is not when server generates them because
    filename is properly calculated using the XLogRecPtr in
    XLogWrite->XLogFileInit->XLogFileInitInternal->XLogFilePath
    ,the main problem is when if someone manually places an invalid WAL file
    in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will
    consider it as valid ,so with the approach as i mentioned earlier we can
    catch such cases.
    
    
    >
    > >
    > > instead of passing last_common_segno down the call stack directly,
    > > we can have a struct maybe called "rewindState" which will have the
    > common
    > > information related to both clusters involved in rewind ,so that in
    > future
    > > if we want to pass more such information down we won't be increasing
    > > the number of arguments that need to be passed down to all functions in
    > stack.
    > >
    >
    > I don't feel too strongly about this. Not sure how we view
    > future-proofing things,
    > as in do we proactively wrap around structs or wait until there's an
    > actual need.
    >
    > This is the first time it's been touched since it was introduced in 14 for
    > what
    > it's worth.
    >
    
    TBH first it seemed to me a good coding practice for future proofing,
    then i checked if there's any place in postgres we are doing
    something similar ,then found 1 in src/include/access/gist.h
    
    typedef struct GISTDeletedPageContents
    {
    /* last xid which could see the page in a scan */
    FullTransactionId deleteXid;
    } GISTDeletedPageContents;
    
    
    >
    > >
    > > we can improve the TAP test file header comment as
    > > # Test the situation where we skip copying the same WAL files from
    > source to target
    > > # except if WAL file size differs.
    > >
    > > let me put it this way
    > > 1) WALs are written to 000000010000000000000002 in primary.
    > > ...
    > > 7) so the last common WAL segment was 000000010000000000000003.
    >
    > Updated comments.
    >
    > >
    > > ...
    > > where entry->target_size != entry->source_size and we do copy,
    > > like if stat->mtime is changed (which means file has been copied)
    > > and target_stat->size != soucrce_stat->size here no error expected.
    > >
    >
    > Updated patch with one additional segment has a byte appended to it
    > on target but not source.
    >
    
    thanks for updating, i have attached a diff patch on
    top of v9-0001 patch , where i tried to add more tests
    to improve the validation that we copy the WAL file even
    when it exists on both source and target but the size differs.
    I also tried to rename some variables for readability,while
    testing this out i found that usleep(1000); is not enough
    to show that there has been a copy which changed the
    stat->mtime of the file because i think the copy was very
    fast (less than 1 millisecond) so the mtime of the
    file after rewind's copy didn't change or filesystem precision
    didn't caught the change, the copying of the file is confirmed
    because the same file has different size before rewind and
    same after rewind which these tests proved
    
    ok($corrupt_wal_seg_stat_before_rewind->size !=
    $corrupt_wal_seg_source_stat->size,"WAL segment $corrupt_wal_seg has
    different size in source vs target before rewind");
    ok 11 - WAL segment 000000010000000000000004 has different size in source
    vs target before rewind
    
    ok($corrupt_wal_seg_stat_after_rewind->size ==
    $corrupt_wal_seg_source_stat->size, "WAL segment $corrupt_wal_seg was
    copied: file sizes are same between target and source after rewind");
    ok 12 - WAL segment 000000010000000000000004 was copied: file sizes are
    same between target and source after rewind
    
    and more over the pg_rewind confirmed this with a new debug
    message which I added in this diff patch
    
    diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
    index 69d7728ce44..743484f4833 100644
    --- a/src/bin/pg_rewind/filemap.c
    +++ b/src/bin/pg_rewind/filemap.c
    @@ -732,11 +732,11 @@ decide_wal_file_action(const char *fname, XLogSegNo
    last_common_segno,
      */
      if (file_segno < last_common_segno && source_size == target_size)
      {
      pg_log_debug("WAL segment \"%s\" not copied to target", fname);
      return FILE_ACTION_NONE;
      }
    
    + pg_log_debug("WAL segment \"%s\" is copied to target", fname);
      return FILE_ACTION_COPY;
     }
    ok 6 - run pg_rewind stderr /(?^:WAL segment \"000000010000000000000004\"
    is copied to target)/
    
    also did this instead of using catfile
    -my $wal_skipped_path = File::Spec->catfile($node_primary->data_dir,
    'pg_wal', $wal_seg_skipped);
    +my $wal_skipped_path = $node_primary->data_dir . '/pg_wal/' .
    $wal_seg_skipped;
    seems consistent with other TAP tests.
    
    -- 
    Thanks,
    Srinath Reddy Sadipiralla
    EDB: https://www.enterprisedb.com/
    
  28. Re: Making pg_rewind faster

    Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-15T14:52:34Z

    On Wed, Oct 15, 2025 at 7:56 PM Srinath Reddy Sadipiralla <
    srinath2133@gmail.com> wrote:
    
    >
    > thanks for updating, i have attached a diff patch on
    > top of v9-0001 patch , where i tried to add more tests
    > to improve the validation that we copy the WAL file even
    > when it exists on both source and target but the size differs.
    > I also tried to rename some variables for readability,while
    > testing this out i found that usleep(1000); is not enough
    > to show that there has been a copy which changed the
    > stat->mtime of the file because i think the copy was very
    > fast (less than 1 millisecond) so the mtime of the
    > file after rewind's copy didn't change or filesystem precision
    > didn't caught the change, the copying of the file is confirmed
    > because the same file has different size before rewind and
    > same after rewind which these tests proved
    >
    
    >
    ok($corrupt_wal_seg_stat_before_rewind->size !=
    > $corrupt_wal_seg_source_stat->size,"WAL segment $corrupt_wal_seg has
    > different size in source vs target before rewind");
    > ok 11 - WAL segment 000000010000000000000004 has different size in source
    > vs target before rewind
    >
    > ok($corrupt_wal_seg_stat_after_rewind->size ==
    > $corrupt_wal_seg_source_stat->size, "WAL segment $corrupt_wal_seg was
    > copied: file sizes are same between target and source after rewind");
    > ok 12 - WAL segment 000000010000000000000004 was copied: file sizes are
    > same between target and source after rewind
    >
    > and more over the pg_rewind confirmed this with a new debug
    > message which I added in this diff patch
    >
    > diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
    > index 69d7728ce44..743484f4833 100644
    > --- a/src/bin/pg_rewind/filemap.c
    > +++ b/src/bin/pg_rewind/filemap.c
    > @@ -732,11 +732,11 @@ decide_wal_file_action(const char *fname, XLogSegNo
    > last_common_segno,
    >   */
    >   if (file_segno < last_common_segno && source_size == target_size)
    >   {
    >   pg_log_debug("WAL segment \"%s\" not copied to target", fname);
    >   return FILE_ACTION_NONE;
    >   }
    >
    > + pg_log_debug("WAL segment \"%s\" is copied to target", fname);
    >   return FILE_ACTION_COPY;
    >  }
    > ok 6 - run pg_rewind stderr /(?^:WAL segment \"000000010000000000000004\"
    > is copied to target)/
    >
    
    >
    
    
    btw i forgot to let you know that i have increased the
    usleep(1000) to usleep(1000000) for the above reason
    which helped to show the difference in mtime after rewind
    adding the below extra validation that the file has been copied from
    source to target ,this is already done in the v9-0001.diff patch.
    
    ok($corrupt_wal_seg_stat_after_rewind->mtime >
    $corrupt_wal_seg_stat_before_rewind->mtime,"WAL segment $corrupt_wal_seg
    was copied: mtime after rewind > mtime before rewind");
     ok 12 - WAL segment 000000010000000000000004 was copied: mtime after
    rewind > mtime before rewind
    
    
    On Fri, Oct 10, 2025 at 3:18 AM John H <johnhyvr@gmail.com> wrote:
    
    > Hey Srinath
    >
    > On Thu, Oct 9, 2025 at 12:09 PM Srinath Reddy Sadipiralla
    > <srinath2133@gmail.com> wrote:
    > > ...
    > > 1) if the timeline switch happens in middle of segment ,copy data from
    > the last WAL segment and create WAL file with same segno but different
    > timelineID,in this case the starting WAL file for the new timeline will be
    > 000000020000000000000003
    > > 2) if the timeline switch happens at segment boundary , just create next
    > segment for this case the starting WAL file for the new timeline will be
    > 000000020000000000000004
    > >
    > > so basically the files which exists in source and not in target like the
    > new timeline WAL segments will be copied to target in total before we reach
    > decide_wal_file_action , so i think we don't need to think about copying
    > WAL files after divergence point by calculating and checking against
    > last_common_segno which we are doing in our current approach , i think we
    > can just do
    > >
    > > ...
    >
    > That's a great point. I need to think about it some more but the
    > reasoning makes sense to me.
    > I think 'last_common_segno ' is only useful as another sanity check
    > but we already have the size
    > ones.
    >
    
    it would be awesome to know ,if you have any update on this.
    
    -- 
    Thanks,
    Srinath Reddy Sadipiralla
    EDB: https://www.enterprisedb.com/
    
  29. Re: Making pg_rewind faster

    Robert Haas <robertmhaas@gmail.com> — 2025-10-16T18:17:51Z

    On Thu, Oct 9, 2025 at 3:09 PM Srinath Reddy Sadipiralla
    <srinath2133@gmail.com> wrote:
    > just a second late :( i was about to post a patch addressing the refactors which Robert mentioned  ,anyway will have a look at your latest patch John thanks :), curious about the tap test.
    >
    > while i was writing the patch something suddenly struck me , that is why we are even depending on last_common_segno ,because once we reached decide_wal_file_action it means that the file exists in both target and source ,AFAIK this can only happen with wal segments older than or equal to last_common_segno because once the promotion competes the filename of the WAL files gets changed with the new timelineID(2), for ex: if the last_common_segno is 000000010000000000000003 then based on the rules in XLogInitNewTimeline
    > 1) if the timeline switch happens in middle of segment ,copy data from the last WAL segment and create WAL file with same segno but different timelineID,in this case the starting WAL file for the new timeline will be 000000020000000000000003
    > 2) if the timeline switch happens at segment boundary , just create next segment for this case the starting WAL file for the new timeline will be 000000020000000000000004
    >
    > so basically the files which exists in source and not in target like the new timeline WAL segments will be copied to target in total before we reach decide_wal_file_action , so i think we don't need to think about copying WAL files after divergence point by calculating and checking against last_common_segno which we are doing in our current approach , i think we can just do
    
    What makes me nervous about this is that it isn't necessarily the case
    that the servers were perfectly in sync at the time of the failure.
    Suppose that the primary was in the middle of writing
    000000010000000000000003. The standby might also have this file, but
    it might contain less valid data than the one on the primary;
    therefore, if we don't copy the file, the two servers might not have
    an identical file. Maybe that wouldn't really matter, in the sense
    that the extra valid data that exists on the original primary
    shouldn't prevent it from replaying WAL on the new primary's timeline,
    which is probably all we really care about. But it feels dangerous to
    me.
    
    -- 
    Robert Haas
    EDB: http://www.enterprisedb.com
    
    
    
    
  30. Re: Making pg_rewind faster

    Robert Haas <robertmhaas@gmail.com> — 2025-10-16T18:59:48Z

    On Wed, Oct 15, 2025 at 10:27 AM Srinath Reddy Sadipiralla
    <srinath2133@gmail.com> wrote:
    >> On Fri, Oct 10, 2025 at 12:45 AM Srinath Reddy Sadipiralla
    >> <srinath2133@gmail.com> wrote:
    >> > XLogFilePath , then validate this new path with the given path
    >> > ,this helps to catch invalid xlog files like pg_wal/00000001FFFFFFFFFFFFFF10.
    >> > ...
    >>
    >> Are you concerned that somehow these files, which are named like XLog
    >> files but actually
    >> aren't, are somehow created therefore we should sync them in case?
    >> I'm trying to understand how these files would be generated in the first place.
    >
    > the problem is not when server generates them because
    > filename is properly calculated using the XLogRecPtr in
    > XLogWrite->XLogFileInit->XLogFileInitInternal->XLogFilePath
    > ,the main problem is when if someone manually places an invalid WAL file
    > in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will
    > consider it as valid ,so with the approach as i mentioned earlier we can
    > catch such cases.
    
    I think that parsing the file name may be a good idea so that we can
    do appropriate sanity checks on the values (e.g. checking that we're
    only skipping copying prior to last_common_segno), but I do not think
    we should worry too much about the user manually injecting invalid WAL
    files. I mean, I would prefer that if that does happen, it either
    works anyway or fails with a sensible error message, rather than
    emitting an incomprehensible error message or dumping core. But, it is
    in general true that if manual modifications are made to the data
    directory, things may go terribly wrong, and this code is not obliged
    to provide any more protection against such scenarios than we do in
    other cases. Ultimately, such modifications are user error.
    
    -- 
    Robert Haas
    EDB: http://www.enterprisedb.com
    
    
    
    
  31. Re: Making pg_rewind faster

    John H <johnhyvr@gmail.com> — 2025-10-16T22:51:56Z

    Hi,
    
    On Wed, Oct 15, 2025 at 7:27 AM Srinath Reddy Sadipiralla
    <srinath2133@gmail.com> wrote:
    
    >
    > TBH first it seemed to me a good coding practice for future proofing,
    > then i checked if there's any place in postgres we are doing
    > something similar ,then found 1 in src/include/access/gist.h
    >
    > typedef struct GISTDeletedPageContents
    > {
    > /* last xid which could see the page in a scan */
    > FullTransactionId deleteXid;
    > } GISTDeletedPageContents;
    >
    
    It makes more sense to me if we expect the struct or same set of arguments
    to be reused in different places / want a stable API reference. I don't think
    we want everything to be wrapped around a struct for functions just in case.
    
    > ..
    > thanks for updating, i have attached a diff patch on
    > top of v9-0001 patch , where i tried to add more tests
    > ...
    
    Thank you for the diff! Your changes are a lot cleaner and I've included it in
    the latest patch. One difference I've made is instead of additionally logging in
    decide_wal_file_action
    
    > + pg_log_debug("WAL segment \"%s\" is copied to target", fname);
    
    I realized we are already logging the WAL segments that are copied over in
    print_filemap so I've updated the test to check for that instead
    
    > qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
    
    I also updated the error message for the last three checks.
    
    Thanks,
    
    
    -- 
    John Hsu - Amazon Web Services
    
  32. Re: Making pg_rewind faster

    John H <johnhyvr@gmail.com> — 2025-10-16T22:55:06Z

    On Thu, Oct 16, 2025 at 12:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
    >
    > On Wed, Oct 15, 2025 at 10:27 AM Srinath Reddy Sadipiralla
    > > ,the main problem is when if someone manually places an invalid WAL file
    > > in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will
    > > consider it as valid ,so with the approach as i mentioned earlier we can
    > > catch such cases.
    >
    > I think that parsing the file name may be a good idea so that we can
    > do appropriate sanity checks on the values (e.g. checking that we're
    > only skipping copying prior to last_common_segno), but I do not think
    > we should worry too much about the user manually injecting invalid WAL
    > files. I mean, I would prefer that if that does happen, it either
    > works anyway or fails with a sensible error message, rather than
    > emitting an incomprehensible error message or dumping core. But, it is
    > in general true that if manual modifications are made to the data
    > directory, things may go terribly wrong, and this code is not obliged
    > to provide any more protection against such scenarios than we do in
    > other cases. Ultimately, such modifications are user error.
    >
    
    It feels like there's a lot of things we could attempt to ensure
    "correctness" if we are concerned about scenarios when the user manually puts
    or modifies content unexpectedly in the pg_wal directory.
    
    For instance, one could make the argument that when considering to skip
    copying the common WAL segments, even though they are of the same
    size, it's possible the user has manipulated them directly. I don't
    think we need to
    run checksums on every WAL segment that is a valid candidate to ensure they
    match.
    
    
    -- 
    John Hsu - Amazon Web Services
    
    
    
    
  33. Re: Making pg_rewind faster

    Justin Kwan <justinpkwan@outlook.com> — 2025-10-17T22:08:00Z

    Thanks John, great to see this patch getting attention again.
    This builds on my original version from 2022, glad to see it moving forward and happy with the direction it’s taking.
    —Justin
    
    ________________________________
    From: John H <johnhyvr@gmail.com>
    Sent: October 16, 2025 6:55 PM
    To: Robert Haas <robertmhaas@gmail.com>
    Cc: Srinath Reddy Sadipiralla <srinath2133@gmail.com>; wenhui qiu <qiuwenhuifx@gmail.com>; Japin Li <japinli@hotmail.com>; Michael Paquier <michael@paquier.xyz>; Andres Freund <andres@anarazel.de>; Alexander Korotkov <aekorotkov@gmail.com>; Justin Kwan <justinpkwan@outlook.com>; Tom Lane <tgl@sss.pgh.pa.us>; pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; vignesh ravichandran <admin@viggy28.dev>; hlinnaka@iki.fi <hlinnaka@iki.fi>; jkwan@cloudflare.com <jkwan@cloudflare.com>
    Subject: Re: Making pg_rewind faster
    
    On Thu, Oct 16, 2025 at 12:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
    >
    > On Wed, Oct 15, 2025 at 10:27 AM Srinath Reddy Sadipiralla
    > > ,the main problem is when if someone manually places an invalid WAL file
    > > in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will
    > > consider it as valid ,so with the approach as i mentioned earlier we can
    > > catch such cases.
    >
    > I think that parsing the file name may be a good idea so that we can
    > do appropriate sanity checks on the values (e.g. checking that we're
    > only skipping copying prior to last_common_segno), but I do not think
    > we should worry too much about the user manually injecting invalid WAL
    > files. I mean, I would prefer that if that does happen, it either
    > works anyway or fails with a sensible error message, rather than
    > emitting an incomprehensible error message or dumping core. But, it is
    > in general true that if manual modifications are made to the data
    > directory, things may go terribly wrong, and this code is not obliged
    > to provide any more protection against such scenarios than we do in
    > other cases. Ultimately, such modifications are user error.
    >
    
    It feels like there's a lot of things we could attempt to ensure
    "correctness" if we are concerned about scenarios when the user manually puts
    or modifies content unexpectedly in the pg_wal directory.
    
    For instance, one could make the argument that when considering to skip
    copying the common WAL segments, even though they are of the same
    size, it's possible the user has manipulated them directly. I don't
    think we need to
    run checksums on every WAL segment that is a valid candidate to ensure they
    match.
    
    
    --
    John Hsu - Amazon Web Services
    
  34. Re: Making pg_rewind faster

    Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-18T14:17:06Z

    On Thu, Oct 16, 2025 at 11:48 PM Robert Haas <robertmhaas@gmail.com> wrote:
    
    > On Thu, Oct 9, 2025 at 3:09 PM Srinath Reddy Sadipiralla
    > <srinath2133@gmail.com> wrote:
    > > just a second late :( i was about to post a patch addressing the
    > refactors which Robert mentioned  ,anyway will have a look at your latest
    > patch John thanks :), curious about the tap test.
    > >
    > > while i was writing the patch something suddenly struck me , that is why
    > we are even depending on last_common_segno ,because once we reached
    > decide_wal_file_action it means that the file exists in both target and
    > source ,AFAIK this can only happen with wal segments older than or equal to
    > last_common_segno because once the promotion competes the filename of the
    > WAL files gets changed with the new timelineID(2), for ex: if the
    > last_common_segno is 000000010000000000000003 then based on the rules in
    > XLogInitNewTimeline
    > > 1) if the timeline switch happens in middle of segment ,copy data from
    > the last WAL segment and create WAL file with same segno but different
    > timelineID,in this case the starting WAL file for the new timeline will be
    > 000000020000000000000003
    > > 2) if the timeline switch happens at segment boundary , just create next
    > segment for this case the starting WAL file for the new timeline will be
    > 000000020000000000000004
    > >
    > > so basically the files which exists in source and not in target like the
    > new timeline WAL segments will be copied to target in total before we reach
    > decide_wal_file_action , so i think we don't need to think about copying
    > WAL files after divergence point by calculating and checking against
    > last_common_segno which we are doing in our current approach , i think we
    > can just do
    >
    > What makes me nervous about this is that it isn't necessarily the case
    > that the servers were perfectly in sync at the time of the failure.
    > Suppose that the primary was in the middle of writing
    > 000000010000000000000003. The standby might also have this file, but
    > it might contain less valid data than the one on the primary;
    > therefore, if we don't copy the file, the two servers might not have
    > an identical file. Maybe that wouldn't really matter, in the sense
    > that the extra valid data that exists on the original primary
    > shouldn't prevent it from replaying WAL on the new primary's timeline,
    > which is probably all we really care about. But it feels dangerous to
    > me.
    >
    
    Thanks Robert ,I want to understand this point more , and will get back .
    
    -- 
    Thanks,
    Srinath Reddy Sadipiralla
    EDB: https://www.enterprisedb.com/
    
  35. Re: Making pg_rewind faster

    Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-18T16:04:55Z

    Hi,
    
    On Fri, Oct 17, 2025 at 4:22 AM John H <johnhyvr@gmail.com> wrote:
    
    > Hi,
    >
    > On Wed, Oct 15, 2025 at 7:27 AM Srinath Reddy Sadipiralla
    > <srinath2133@gmail.com> wrote:
    >
    > >
    > > TBH first it seemed to me a good coding practice for future proofing,
    > > then i checked if there's any place in postgres we are doing
    > > something similar ,then found 1 in src/include/access/gist.h
    > >
    > > typedef struct GISTDeletedPageContents
    > > {
    > > /* last xid which could see the page in a scan */
    > > FullTransactionId deleteXid;
    > > } GISTDeletedPageContents;
    > >
    >
    > It makes more sense to me if we expect the struct or same set of arguments
    > to be reused in different places / want a stable API reference. I don't
    > think
    > we want everything to be wrapped around a struct for functions just in
    > case.
    >
    > > ..
    > > thanks for updating, i have attached a diff patch on
    > > top of v9-0001 patch , where i tried to add more tests
    > > ...
    >
    > Thank you for the diff! Your changes are a lot cleaner and I've included
    > it in
    > the latest patch. One difference I've made is instead of additionally
    > logging in
    > decide_wal_file_action
    >
    > > + pg_log_debug("WAL segment \"%s\" is copied to target", fname);
    >
    > I realized we are already logging the WAL segments that are copied over in
    > print_filemap so I've updated the test to check for that instead
    >
    > > qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
    >
    > I also updated the error message for the last three checks.
    >
    
    Thanks for updating the patch, I have reviewed it,
    except these below concerns, patch LGTM.
    
    1) regarding the parsing the WAL filename
    
    On Fri, Oct 17, 2025 at 4:25 AM John H <johnhyvr@gmail.com> wrote:
    
    > On Thu, Oct 16, 2025 at 12:00 PM Robert Haas <robertmhaas@gmail.com>
    > wrote:
    > >
    > > On Wed, Oct 15, 2025 at 10:27 AM Srinath Reddy Sadipiralla
    > > > ,the main problem is when if someone manually places an invalid WAL
    > file
    > > > in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will
    > > > consider it as valid ,so with the approach as i mentioned earlier we
    > can
    > > > catch such cases.
    > >
    > > I think that parsing the file name may be a good idea so that we can
    > > do appropriate sanity checks on the values (e.g. checking that we're
    > > only skipping copying prior to last_common_segno), but I do not think
    > > we should worry too much about the user manually injecting invalid WAL
    > > files. I mean, I would prefer that if that does happen, it either
    > > works anyway or fails with a sensible error message, rather than
    > > emitting an incomprehensible error message or dumping core. But, it is
    > > in general true that if manual modifications are made to the data
    > > directory, things may go terribly wrong, and this code is not obliged
    > > to provide any more protection against such scenarios than we do in
    > > other cases. Ultimately, such modifications are user error.
    > >
    >
    > It feels like there's a lot of things we could attempt to ensure
    > "correctness" if we are concerned about scenarios when the user manually
    > puts
    > or modifies content unexpectedly in the pg_wal directory.
    >
    > For instance, one could make the argument that when considering to skip
    > copying the common WAL segments, even though they are of the same
    > size, it's possible the user has manipulated them directly. I don't
    > think we need to
    > run checksums on every WAL segment that is a valid candidate to ensure they
    > match.
    
    
    I agree with Robert here, and i think if we found
    that the WAL filename is not valid after parsing
    we can return something like FILE_CONTENT_INVALID_TYPE
    in getFileContentType so that it won't end up going
    through decide_wal_file_action,then XLogFromFileName
    may assign file_segno invalid values greater than
    last_common_segno which means we lead to copying it,even
    though it's an invalid file, thoughts?
    
    2) Please fix the indentation by running pgindent,
    also add file_content_type_t typedef to typedefs.list
    file.
    3) maybe we can improve the error messages for the
    last 2 checks in tap test, that because of this
    reason hence proven that the copy from source to
    target has been done.
    4) We need to update the pg_rewind docs regarding
    this new behaviour.
    
    -- 
    Thanks,
    Srinath Reddy Sadipiralla
    EDB: https://www.enterprisedb.com/
    
  36. Re: Making pg_rewind faster

    John H <johnhyvr@gmail.com> — 2025-10-20T22:03:55Z

    Hi Srinath,
    
    Thank you for the quick review!
    
    On Sat, Oct 18, 2025 at 9:05 AM Srinath Reddy Sadipiralla
    <srinath2133@gmail.com> wrote:
    > ...
    > I agree with Robert here, and i think if we found
    > that the WAL filename is not valid after parsing
    > we can return something like FILE_CONTENT_INVALID_TYPE
    > in getFileContentType so that it won't end up going
    > through decide_wal_file_action,then XLogFromFileName
    > may assign file_segno invalid values greater than
    > last_common_segno which means we lead to copying it,even
    > though it's an invalid file, thoughts?
    >
    
    My understanding of pg_rewind is that it aims to make the two clusters
    "identical"
    after rewind, similar to a basebackup. It does not try to make decisions
    on whether or not a file is necessary for database to start. I'd
    prefer the patch aims
    to keep parity with the existing behavior of pg_rewind/pg_basebackup as much as
    possible. If there is an unexpected/invalid segment and we can't make
    assumptions
    about the content, we should just sync it to keep with existing behavior.
    
    There's lots of potential optimizations pg_rewind can do if we only
    talk about what is
    "strictly" necessary for the database to start. We can exclude any
    directories that
    should not be there. For example, if there exists a directory
    $PGDATA/random_files
    on source and target, the whole directory gets synced right now. We could not
    include that directory and the database would most likely still start
    but that goes
    against the current way pg_rewind works.
    
    > 2) Please fix the indentation by running pgindent,
    > also add file_content_type_t typedef to typedefs.list
    > file.
    
    Will update.
    
    > 3) maybe we can improve the error messages for the
    > last 2 checks in tap test, that because of this
    > reason hence proven that the copy from source to
    > target has been done.
    
    What are you thinking of? Something like the below?
    
    +       "Expected WAL segment $corrupt_wal_seg to have been modified
    due to rewind");
    +       "Expected WAL segment $corrupt_wal_seg to have been synced
    from source to target after rewind");
    
    > 4) We need to update the pg_rewind docs regarding
    > this new behaviour.
    >
    
    Will update.
    
    -- 
    John Hsu - Amazon Web Services
    
    
    
    
  37. Re: Making pg_rewind faster

    Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-21T04:36:26Z

    Hi John,
    
    On Tue, Oct 21, 2025 at 3:34 AM John H <johnhyvr@gmail.com> wrote:
    
    >
    > > 3) maybe we can improve the error messages for the
    > > last 2 checks in tap test, that because of this
    > > reason hence proven that the copy from source to
    > > target has been done.
    >
    > What are you thinking of? Something like the below?
    >
    > +       "Expected WAL segment $corrupt_wal_seg to have been modified
    > due to rewind");
    > +       "Expected WAL segment $corrupt_wal_seg to have been synced
    > from source to target after rewind");
    >
    
    maybe something like this
    + "Expected WAL segment $corrupt_wal_seg to have later mtime
    on target than source after rewind as it was copied");
    + "Expected WAL segment $corrupt_wal_seg file sizes to be same
    between target and source after rewind as it was copied");
    
    -- 
    Thanks,
    Srinath Reddy Sadipiralla
    EDB: https://www.enterprisedb.com/
    
  38. Re: Making pg_rewind faster

    John H <johnhyvr@gmail.com> — 2025-10-21T23:14:55Z

    Hi Srinath,
    
    On Mon, Oct 20, 2025 at 9:36 PM Srinath Reddy Sadipiralla
    <srinath2133@gmail.com> wrote:
    > ...
    > maybe something like this
    > + "Expected WAL segment $corrupt_wal_seg to have later mtime
    > on target than source after rewind as it was copied");
    > + "Expected WAL segment $corrupt_wal_seg file sizes to be same
    > between target and source after rewind as it was copied");
    >
    
    Made the changes and included the documentation update.
    
    Thanks,
    -- 
    John Hsu - Amazon Web Services
    
  39. Re: Making pg_rewind faster

    Michael Paquier <michael@paquier.xyz> — 2025-10-23T05:12:24Z

    On Thu, Sep 25, 2025 at 02:35:16PM -0400, Robert Haas wrote:
    > On a related note, something a committer would need to do if they
    > wanted to commit this patch is figure out who should be credited with
    > writing it. I see that a number of different people have posted
    > versions of the patch; it would be nice if "Author:" or
    > "Co-authored-by:" tags were added to the commit message to reflect
    > whose code is present in a given version.
    
    Looking at this patch was on my TODO list (more to that in a bit), and
    the simplest answer to this question is that everybody involved would
    be cited in the release notes.  Justin Kwan has been the original
    author.  John Hsu and Srinath have followed up in an equivalent
    fashion, including answers with your review comments posted on this
    previous message.  So listing all three as equivalent authors feels
    normal to me (no need for co-authoring notes, IMO).
    --
    Michael
    
  40. Re: Making pg_rewind faster

    Michael Paquier <michael@paquier.xyz> — 2025-10-23T08:22:33Z

    On Tue, Oct 21, 2025 at 04:14:55PM -0700, John H wrote:
    > Made the changes and included the documentation update.
    
    I have been looking at this patch, and the first part that stood out
    is that the refactoring related to isRelDataFile() can be done as an
    independent piece, cutting the total size of the main patch by 30% or
    so.  So I have extracted this part, simplified it to make it more
    consistent and les noisy on HEAD, and applied it as 6ae08d9583e9.
    
    I am still looking at the rest, and attached is a rebased version of
    what I have for the moment.  Note first that the new test was missing
    in meson.build, so the coverage provided by the CI was limited.
    
    +    * However we check last_common_segno and file_size again for sanity.
    +    */
    +   if (file_segno < last_common_segno && source_size == target_size)
    
    What if a segment has a size different than the one a cluster has been
    initialized with, still both have the same size on the target and the
    source?  Unlikely, still incorrect if not cross-checked with what a
    control file has in store.  :)
    
    +# Sleep to allow mtime to be different
    +usleep(1000000);
    [...]
    +	[qr/WAL segment \"$wal_seg_skipped\" not copied to target/,
    +	 qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
    +	],
    
    FWIW, I am definitely not a fan of the test relying on the timestamp
    of the files based on their retrieved meta-data stats, with the mtime.
    I suspect complications on Windows.  Worse thing, there is a forced
    sleep of 1s added to the test, to make sure that the timestamp of the
    files we compare have a different number.  But do we really need that
    anyway if we have the debug information with the file map printed in
    it?
    
    Hence, I think that we should simplify and rework the test a bit,
    tweaking a bit how the filemap is printed: now that we can detect if
    an operation is happening on a WAL file thanks to file_content_type_t,
    let's always print the type of operation done for each WAL file and
    remove this "not copied to target" debug log which provides the same
    information, then let's compare the debug output with what we want.
    
    It seems to me that it would be good for the test to run some sanity
    checks on the rewound standby, as well.  That would provide more
    validation for the case of the "corrupted" segment that gets copied
    anyway.
    --
    Michael
    
  41. Re: Making pg_rewind faster

    Robert Haas <robertmhaas@gmail.com> — 2025-10-23T12:40:14Z

    On Thu, Oct 23, 2025 at 4:22 AM Michael Paquier <michael@paquier.xyz> wrote:
    > +    * However we check last_common_segno and file_size again for sanity.
    > +    */
    > +   if (file_segno < last_common_segno && source_size == target_size)
    >
    > What if a segment has a size different than the one a cluster has been
    > initialized with, still both have the same size on the target and the
    > source?  Unlikely, still incorrect if not cross-checked with what a
    > control file has in store.  :)
    
    While I'm not against cross-checking against the control file, this
    sounds like an imaginary scenario to me. That is, it would only happen
    if somebody maliciously modified the contents of the data directory by
    hand with the express goal of breaking the tool. But we fundamentally
    cannot defend against a malicious user whose express goal is to break
    the tool, and I do not see any compelling reason to expend energy on
    it even in cases like this where we could theoretically detect it
    without much effort. If we go down that path, we'll end up not only
    complicating the code, but also obscuring our own goals: it will look
    like we've either done too much sanity checking (because we will have
    added checks that are unnecessary with a non-malicious user) or too
    little (because we will not have caught all the things a malicious
    user might do).
    
    -- 
    Robert Haas
    EDB: http://www.enterprisedb.com
    
    
    
    
  42. Re: Making pg_rewind faster

    Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-23T16:08:23Z

    Hi Michael,
    
    On Thu, Oct 23, 2025 at 1:52 PM Michael Paquier <michael@paquier.xyz> wrote:
    
    > On Tue, Oct 21, 2025 at 04:14:55PM -0700, John H wrote:
    > > Made the changes and included the documentation update.
    >
    > I have been looking at this patch, and the first part that stood out
    > is that the refactoring related to isRelDataFile() can be done as an
    > independent piece, cutting the total size of the main patch by 30% or
    > so.  So I have extracted this part, simplified it to make it more
    > consistent and les noisy on HEAD, and applied it as 6ae08d9583e9.
    >
    
    yeah ... that's cleaner, thanks for committing.
    
    
    >
    > +# Sleep to allow mtime to be different
    > +usleep(1000000);
    > [...]
    > +       [qr/WAL segment \"$wal_seg_skipped\" not copied to target/,
    > +        qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
    > +       ],
    >
    > FWIW, I am definitely not a fan of the test relying on the timestamp
    > of the files based on their retrieved meta-data stats, with the mtime.
    > I suspect complications on Windows.  Worse thing, there is a forced
    > sleep of 1s added to the test, to make sure that the timestamp of the
    > files we compare have a different number.  But do we really need that
    > anyway if we have the debug information with the file map printed in
    > it?
    >
    
    thought it would act like an extra sanity check ,to prove the point that the
    pg_rewind is saying through the debug info that it has been really copied
    or skipped.
    
    
    >
    > Hence, I think that we should simplify and rework the test a bit,
    > tweaking a bit how the filemap is printed: now that we can detect if
    > an operation is happening on a WAL file thanks to file_content_type_t,
    > let's always print the type of operation done for each WAL file and
    > remove this "not copied to target" debug log which provides the same
    > information, then let's compare the debug output with what we want.
    >
    
    i guess this is what you want ,please let me know if it's otherwise,
    
    diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
    index 2d82edec060..2180495d337 100644
    --- a/src/bin/pg_rewind/filemap.c
    +++ b/src/bin/pg_rewind/filemap.c
    @@ -546,7 +546,7 @@ print_filemap(filemap_t *filemap)
            for (i = 0; i < filemap->nentries; i++)
            {
                    entry = filemap->entries[i];
    -               if (entry->action != FILE_ACTION_NONE ||
    +               if (entry->content_type == FILE_CONTENT_TYPE_WAL ||
    entry->action != FILE_ACTION_NONE ||
                            entry->target_pages_to_overwrite.bitmapsize > 0)
                    {
                            pg_log_debug("%s (%s)", entry->path,
    @@ -733,7 +733,6 @@ decide_wal_file_action(const char *fname, XLogSegNo
    last_common_segno,
             */
            if (file_segno < last_common_segno && source_size == target_size)
            {
    -               pg_log_debug("WAL segment \"%s\" not copied to target",
    fname);
                    return FILE_ACTION_NONE;
            }
    
    diff --git a/src/bin/pg_rewind/t/011_wal_copy.pl b/src/bin/pg_rewind/t/
    011_wal_copy.pl
    index b9e24844654..3e12744dfa6 100644
    --- a/src/bin/pg_rewind/t/011_wal_copy.pl
    +++ b/src/bin/pg_rewind/t/011_wal_copy.pl
    @@ -93,7 +93,7 @@ command_checks_all(
            0,
            [qr//],
            [
    -               qr/WAL segment \"$wal_seg_skipped\" not copied to target/,
    +               qr/pg_wal\/$wal_seg_skipped \(NONE\)/,
                    qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
            ],
            'run pg_rewind');
    
    
    > It seems to me that it would be good for the test to run some sanity
    > checks on the rewound standby, as well.  That would provide more
    > validation for the case of the "corrupted" segment that gets copied
    > anyway.
    >
    
    i think these checks help do the same
    
    1) qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
    this shows the corrupted segment is copied.
    
    2) ok( $corrupt_wal_seg_stat_before_rewind->size !=
     $corrupt_wal_seg_source_stat->size,
    "Expected WAL segment $corrupt_wal_seg to have different size in
    source vs target before rewind"
    );
    my $corrupt_wal_seg_stat_after_rewind =
    stat($corrupt_wal_seg_in_target_path);
    ok( $corrupt_wal_seg_stat_after_rewind->size ==
     $corrupt_wal_seg_source_stat->size,
    "Expected WAL segment $corrupt_wal_seg file sizes to be same between
    target and source after rewind as it was copied"
    );
    These checks show that before the corrupt segment's size on
    rewound standby(target) was not the same as source but after
    rewind it was the same as source,please let me know if i am
    missing your point here.
    
    -- 
    Thanks,
    Srinath Reddy Sadipiralla
    EDB: https://www.enterprisedb.com/
    
  43. Re: Making pg_rewind faster

    John H <johnhyvr@gmail.com> — 2025-10-23T22:01:32Z

    Hi,
    
    On Thu, Oct 23, 2025 at 9:08 AM Srinath Reddy Sadipiralla
    <srinath2133@gmail.com> wrote:
    > On Thu, Oct 23, 2025 at 1:52 PM Michael Paquier <michael@paquier.xyz> wrote:
    >>
    >> FWIW, I am definitely not a fan of the test relying on the timestamp
    >> of the files based on their retrieved meta-data stats, with the mtime.
    >> I suspect complications on Windows.  Worse thing, there is a forced
    >> sleep of 1s added to the test, to make sure that the timestamp of the
    >> files we compare have a different number.  But do we really need that
    >> anyway if we have the debug information with the file map printed in
    >> it?
    >
    >
    > thought it would act like an extra sanity check ,to prove the point that the
    > pg_rewind is saying through the debug info that it has been really copied
    > or skipped.
    >
    
    I don't feel too strongly about it. If we want to rely on checking debug log
    that works for me. We don't do any in-depth checks that every file was
    properly copied anyway for existing files.
    
    > ...
    > i guess this is what you want ,please let me know if it's otherwise,
    > ...
    
    That looks right to me.
    
    >>
    >> It seems to me that it would be good for the test to run some sanity
    >> checks on the rewound standby, as well.  That would provide more
    >> validation for the case of the "corrupted" segment that gets copied
    >> anyway.
    >
    > ...
    > These checks show that before the corrupt segment's size on
    > rewound standby(target) was not the same as source but after
    > rewind it was the same as source,please let me know if i am
    > missing your point here.
    >
    
    What did you have in mind for additional sanity checks?
    The existing test checks that when sizes are different the correct
    branch is taken. For something more in-depth that requires comparing
    data before and after the rewind with a "corrupted" segment that seems
    complicated since the segment would have to somehow be applied to
    writer but not replica prior to divergence.
    
    --
    John Hsu - Amazon Web Services
    
    
    
    
  44. Re: Making pg_rewind faster

    Michael Paquier <michael@paquier.xyz> — 2025-10-24T07:56:33Z

    On Thu, Oct 23, 2025 at 03:01:32PM -0700, John H wrote:
    > On Thu, Oct 23, 2025 at 9:08 AM Srinath Reddy Sadipiralla <srinath2133@gmail.com> wrote:
    >> ...
    >> i guess this is what you want ,please let me know if it's otherwise,
    >> ...
    > 
    > That looks right to me.
    
    Yes, that would be something among these lines, WAL segments being
    treated as an exception when printing the file map to show that they
    are not copied.
    
    >> These checks show that before the corrupt segment's size on
    >> rewound standby(target) was not the same as source but after
    >> rewind it was the same as source,please let me know if i am
    >> missing your point here.
    > 
    > What did you have in mind for additional sanity checks?
    > The existing test checks that when sizes are different the correct
    > branch is taken. For something more in-depth that requires comparing
    > data before and after the rewind with a "corrupted" segment that seems
    > complicated since the segment would have to somehow be applied to
    > writer but not replica prior to divergence.
    
    I was just wondering about some queries on the new standby once we are
    done with the rewind.  But after sleeping on it, I'd be happy with
    just the set of tests we have: the debug output checks, the size
    checks pre and post-rewind, and no mtime.
    --
    Michael
    
  45. Re: Making pg_rewind faster

    Michael Paquier <michael@paquier.xyz> — 2025-10-24T08:38:37Z

    On Fri, Oct 24, 2025 at 04:56:33PM +0900, Michael Paquier wrote:
    > I was just wondering about some queries on the new standby once we are
    > done with the rewind.  But after sleeping on it, I'd be happy with
    > just the set of tests we have: the debug output checks, the size
    > checks pre and post-rewind, and no mtime.
    
    So, I am coming down to the attached (minus the commit message) after
    more edits and adjustments.  It's Friday evening here, and I am not
    planning to do more today and to take bets on the buildfarm, even if
    the CI is OK.
    --
    Michael
    
  46. Re: Making pg_rewind faster

    Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-24T12:21:54Z

    On Fri, Oct 24, 2025 at 2:08 PM Michael Paquier <michael@paquier.xyz> wrote:
    
    > On Fri, Oct 24, 2025 at 04:56:33PM +0900, Michael Paquier wrote:
    > > I was just wondering about some queries on the new standby once we are
    > > done with the rewind.  But after sleeping on it, I'd be happy with
    > > just the set of tests we have: the debug output checks, the size
    > > checks pre and post-rewind, and no mtime.
    >
    > So, I am coming down to the attached (minus the commit message) after
    > more edits and adjustments.  It's Friday evening here, and I am not
    > planning to do more today and to take bets on the buildfarm, even if
    > the CI is OK.
    >
    
    Thanks for the updated patch, I have reviewed and tested the
    patch, except these minor refactors ,it LGTM.
    
    diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
    index 59672e66932..467fd97ebcf 100644
    --- a/src/bin/pg_rewind/filemap.c
    +++ b/src/bin/pg_rewind/filemap.c
    @@ -728,7 +728,7 @@ decide_wal_file_action(const char *fname, XLogSegNo
    last_common_segno,
            /*
             * Avoid copying files before the last common segment.
             *
    -        * These files exist on the source and the target services, so they
    should
    +        * These files exist on the source and the target servers, so they
    should
             * be identical and located strictly before the segment that
    contains the
             * LSN where target and source servers have diverged.
             *
    I think as we are not using mtime to show that the file has not been copied
    and been skipped ,instead we are doing the same with the debug message
    (qr/pg_wal\/$wal_seg_skipped \(NONE\)/,), so stat calculation of this WAL
    segment can be removed.
    
    diff --git a/src/bin/pg_rewind/t/011_wal_copy.pl b/src/bin/pg_rewind/t/
    011_wal_copy.pl
    index fb5b7104378..c4d7200d710 100644
    --- a/src/bin/pg_rewind/t/011_wal_copy.pl
    +++ b/src/bin/pg_rewind/t/011_wal_copy.pl
    @@ -43,12 +43,6 @@ RewindTest::promote_standby;
     my $new_timeline_wal_seg = $node_standby->safe_psql('postgres',
            'SELECT pg_walfile_name(pg_current_wal_lsn())');
    
    -# Get some stats info for the WAL file whose copy is skipped.
    -my $wal_skipped_path =
    -  $node_primary->data_dir . '/pg_wal/' . $wal_seg_skipped;
    -my $wal_skipped_stat = stat($wal_skipped_path);
    -defined($wal_skipped_stat) or die("unable to stat $wal_skipped_path");
    -
     # Corrupt a WAL segment on target that has been generated before the
     # divergence point.  We will check that it is copied from the source.
    
    *Test:*
    I tried to demonstrate that with this patch, we are not copying WAL
    segments before the divergence point by using wal_keep_size.
    
    *Environment:*
    A primary and standby physical streaming replication setup was used.
    The TPC-H dbgen tool with a Scale Factor of 10 was used to generate
    data, which in turn generated sufficient WAL to test with wal_keep_size =
    20000 MB.
    A failover and rewind were performed.
    
    *Result:*
    
    *Without Patch:*
    
    SELECT
      pg_size_pretty(sum(size)) AS wal_size
    FROM
      pg_ls_waldir();
     wal_size
    ----------
     19 GB
    (1 row)
    
    pg_rewind: connected to server
    pg_rewind: servers diverged at WAL location 4/DEACBAB0 on timeline 1
    pg_rewind: rewinding from last common checkpoint at 4/C8250D28 on timeline 1
    pg_rewind: reading source file list
    pg_rewind: reading target file list
    pg_rewind: reading WAL in target
    *pg_rewind: need to copy 20449 MB (total source directory size is 32778 MB)*
    20940602/20940602 kB (100%) copied
    pg_rewind: creating backup label and updating control file
    pg_rewind: syncing target data directory
    pg_rewind: Done!
    
    To show that only a small amount of WAL was generated between the last
    common
    checkpoint and the latest checkpoint on the source:
    
    postgres=# select pg_size_pretty('4/DEACBB78'::pg_lsn -
    '4/C8250D28'::pg_lsn);
     pg_size_pretty
    ----------------
     360 MB
    (1 row)
    
    so the 20.449 GB copied was primarily due to wal_keep_size
    (19 GB of accumulated WAL) being copied, as previously non-data files were
    copied in full.
    
    *With Patch:*
    
    SELECT
      pg_size_pretty(sum(size)) AS wal_size
    FROM
      pg_ls_waldir();
     wal_size
    ----------
     19 GB
    (1 row)
    
    pg_rewind: connected to server
    pg_rewind: servers diverged at WAL location 4/B0F2FE78 on timeline 1
    pg_rewind: rewinding from last common checkpoint at 4/B087E468 on timeline 1
    pg_rewind: reading source file list
    pg_rewind: reading target file list
    pg_rewind: reading WAL in target
    *pg_rewind: need to copy 212 MB (total source directory size is 32074 MB)*
    *pg_rewind: skipped WAL 19168 MB* -> tweaked calculate_totals to get this,
    just to give more info (not included in patch).
    217467/217467 kB (100%) copied
    pg_rewind: creating backup label and updating control file
    pg_rewind: syncing target data directory
    pg_rewind: Done!
    
    postgres=# select pg_size_pretty('4/B0F30AB0'::pg_lsn -
    '4/B087E468'::pg_lsn);
     pg_size_pretty
    ----------------
     6855 kB
    (1 row)
    
    This test shows that with the patch applied, only the necessary WAL segments
    after the divergence point are copied, while previously accumulated WAL
    segments (due to wal_keep_size) are correctly skipped.
    
    -- 
    Thanks,
    Srinath Reddy Sadipiralla
    EDB: https://www.enterprisedb.com/
    
  47. Re: Making pg_rewind faster

    Michael Paquier <michael@paquier.xyz> — 2025-10-25T01:15:53Z

    On Fri, Oct 24, 2025 at 05:51:54PM +0530, Srinath Reddy Sadipiralla wrote:
    > -        * These files exist on the source and the target services, so they
    > should
    > +        * These files exist on the source and the target servers, so they
    > should
    
    Not sure what my fingers were doing here.
    
    > I think as we are not using mtime to show that the file has not been copied
    > and been skipped ,instead we are doing the same with the debug message
    > (qr/pg_wal\/$wal_seg_skipped \(NONE\)/,), so stat calculation of this WAL
    > segment can be removed.
    
    Yes, removing this one makes sense.
    
    And, to keep it short: applied.
    --
    Michael
    
  48. Re: Making pg_rewind faster

    Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-25T01:50:52Z

    On Sat, Oct 25, 2025 at 6:46 AM Michael Paquier <michael@paquier.xyz> wrote:
    
    > On Fri, Oct 24, 2025 at 05:51:54PM +0530, Srinath Reddy Sadipiralla wrote:
    > > -        * These files exist on the source and the target services, so
    > they
    > > should
    > > +        * These files exist on the source and the target servers, so
    > they
    > > should
    >
    > Not sure what my fingers were doing here.
    >
    > > I think as we are not using mtime to show that the file has not been
    > copied
    > > and been skipped ,instead we are doing the same with the debug message
    > > (qr/pg_wal\/$wal_seg_skipped \(NONE\)/,), so stat calculation of this WAL
    > > segment can be removed.
    >
    > Yes, removing this one makes sense.
    >
    > And, to keep it short: applied.
    >
    
    Thanks Michael.
    
    
    -- 
    Thanks,
    Srinath Reddy Sadipiralla
    EDB: https://www.enterprisedb.com/
    
  49. Re: Making pg_rewind faster

    Michael Paquier <michael@paquier.xyz> — 2025-10-28T04:02:06Z

    On Thu, Oct 23, 2025 at 08:40:14AM -0400, Robert Haas wrote:
    > While I'm not against cross-checking against the control file, this
    > sounds like an imaginary scenario to me. That is, it would only happen
    > if somebody maliciously modified the contents of the data directory by
    > hand with the express goal of breaking the tool. But we fundamentally
    > cannot defend against a malicious user whose express goal is to break
    > the tool, and I do not see any compelling reason to expend energy on
    > it even in cases like this where we could theoretically detect it
    > without much effort. If we go down that path, we'll end up not only
    > complicating the code, but also obscuring our own goals: it will look
    > like we've either done too much sanity checking (because we will have
    > added checks that are unnecessary with a non-malicious user) or too
    > little (because we will not have caught all the things a malicious
    > user might do).
    
    I was thinking about this argument over the weekend, and I am
    wondering if we could not do better here to detect if a file should be
    copied or not.  What if we included a checksum of each file if both
    exist on the target and source, and just not copy them if the
    checksums match?  You cannot do that for relation files when the
    source is online, of course, but for files like the oldest segments
    before the divergence point, that's better than checking the size,
    still more expensive due to the cost of the checksum computation.
    And there is a sha256() available at SQL level.
    
    Just throwing one idea in the bucket of ideas.  That may not be worth
    the extra cost here, of course, but attaching a checksum to
    file_entry_t is not what I would qualify as an invasive change.
    --
    Michael
    
  50. Re: Making pg_rewind faster

    Robert Haas <robertmhaas@gmail.com> — 2025-10-28T14:57:26Z

    On Tue, Oct 28, 2025 at 12:02 AM Michael Paquier <michael@paquier.xyz> wrote:
    > I was thinking about this argument over the weekend, and I am
    > wondering if we could not do better here to detect if a file should be
    > copied or not.  What if we included a checksum of each file if both
    > exist on the target and source, and just not copy them if the
    > checksums match?
    
    Well, that would require reading the entire file on both sides to
    compute the checksum, which sounds pretty expensive. I mean, a copy
    would only be a read on one side and a write on the other. Even
    granting that writes are more expensive than reads, a read of both
    sides would still be a substantial percentage of the total cost, I
    think.
    
    Also, I don't think we really want to reinvent a worse version of
    rsync. If you want to use checksums or file timestamps to decide what
    to copy, there are already good tools for that which probably handle
    that task better than our code ever will. What we can bring to the
    table is PG-specific logic, where we're able to reason about the
    behavior of PG in a way that a general-purpose tool can't. That's why
    for example we use the WAL to decide what data blocks need to be
    copied, rather than checksums -- it's an optimization that rsync can't
    do, and we can. The rule implemented here is similar: rsync can't know
    that WAL from before the divergence point should be the same on both
    sides, but we can.
    
    Now, of course, if in a specific situation the assumptions on which
    pg_rewind relies are not valid, e.g. because manual data directory
    modification has occurred, then pg_rewind should not be used. And if
    on the other hand we find some flaw that will keep pg_rewind from
    delivering correct results even when nothing strange has happened,
    then that's a bug or a design problem that we need to fix. But if we
    just start second-guessing ourselves by adding overhead to protect
    against can't-happen scenarios, we'll end up making pg_rewind useless.
    
    -- 
    Robert Haas
    EDB: http://www.enterprisedb.com