Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

pg_rewind: Skip copy of WAL segments generated before point of divergence
- 5173bfd0443e 19 (unreleased) landed
pg_rewind: Extend code detecting relation files to work with WAL files
- 6ae08d9583e9 19 (unreleased) landed
Split TESTDIR into TESTLOGDIR and TESTDATADIR
- c47885bd8b69 16.0 cited

Making pg_rewind faster

vignesh ravichandran <admin@viggy28.dev> — 2022-06-30T13:22:28Z

Hi Hackers,



I have been using pg_rewind in production for 2 years. One of the things that I noticed in pg_rewind is if it doesn't know what to do with a file "it copies". I understand it's the more safer option. After all, the alternative, pg_basebackup copies all the files from source to target.



However, this is making pg_rewind inefficient when we have a high number of WAL files. Majority of the data (in most of my cases 95%+) that it copies are WAL files which are anyway same between the source and target. Skipping those same WAL files from copying will improve the speed of pg_rewind a lot.



1. Does pg_rewind need to copy WAL files before the WAL that contains the last common check point?



Heikki's presentation https://pgsessions.com/assets/archives/pg_rewind-presentation-paris.pdf gave me a good overview and also explained the behavior what I mentioned.



Thanks,

Vignesh

Re: Making pg_rewind faster

Justin Kwan <justinpkwan@outlook.com> — 2022-07-15T22:24:54Z

Hi everyone!

Here's the attached patch submission to optimize pg_rewind performance when many WAL files are retained on server. This patch avoids replaying (copying over) older WAL segment files that fall before the point of divergence between the source and target servers.

Thanks,
Justin
________________________________
From: Justin Kwan <jkwan@cloudflare.com>
Sent: July 15, 2022 6:13 PM
To: vignesh ravichandran <admin@viggy28.dev>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; justinpkwan@outlook.com <justinpkwan@outlook.com>
Subject: Re: Making pg_rewind faster

Looping in my other email.

On Thu, Jun 30, 2022 at 6:22 AM vignesh ravichandran <admin@viggy28.dev<mailto:admin@viggy28.dev>> wrote:
Hi Hackers,

I have been using pg_rewind in production for 2 years. One of the things that I noticed in pg_rewind is if it doesn't know what to do with a file "it copies". I understand it's the more safer option. After all, the alternative, pg_basebackup copies all the files from source to target.

However, this is making pg_rewind inefficient when we have a high number of WAL files. Majority of the data (in most of my cases 95%+) that it copies are WAL files which are anyway same between the source and target. Skipping those same WAL files from copying will improve the speed of pg_rewind a lot.

1. Does pg_rewind need to copy WAL files before the WAL that contains the last common check point?

Heikki's presentation https://pgsessions.com/assets/archives/pg_rewind-presentation-paris.pdf gave me a good overview and also explained the behavior what I mentioned.

Thanks,
Vignesh

Re: Making pg_rewind faster

Justin Kwan <justinpkwan@outlook.com> — 2022-07-16T03:16:27Z

Hi everyone!

I've also attached the pg_rewind optimization patch file for Postgres version 14.4. The previous patch file targets version Postgres version 15 Beta 1/2.

Thanks,
Justin
________________________________
From: Justin Kwan <jkwan@cloudflare.com>
Sent: July 15, 2022 6:13 PM
To: vignesh ravichandran <admin@viggy28.dev>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; justinpkwan@outlook.com <justinpkwan@outlook.com>
Subject: Re: Making pg_rewind faster

Looping in my other email.

On Thu, Jun 30, 2022 at 6:22 AM vignesh ravichandran <admin@viggy28.dev<mailto:admin@viggy28.dev>> wrote:
Hi Hackers,

I have been using pg_rewind in production for 2 years. One of the things that I noticed in pg_rewind is if it doesn't know what to do with a file "it copies". I understand it's the more safer option. After all, the alternative, pg_basebackup copies all the files from source to target.

However, this is making pg_rewind inefficient when we have a high number of WAL files. Majority of the data (in most of my cases 95%+) that it copies are WAL files which are anyway same between the source and target. Skipping those same WAL files from copying will improve the speed of pg_rewind a lot.

1. Does pg_rewind need to copy WAL files before the WAL that contains the last common check point?

Heikki's presentation https://pgsessions.com/assets/archives/pg_rewind-presentation-paris.pdf gave me a good overview and also explained the behavior what I mentioned.

Thanks,
Vignesh

Re: Making pg_rewind faster

Tom Lane <tgl@sss.pgh.pa.us> — 2022-07-17T18:40:32Z

Justin Kwan <justinpkwan@outlook.com> writes:
> I've also attached the pg_rewind optimization patch file for Postgres version 14.4. The previous patch file targets version Postgres version 15 Beta 1/2.

It's very unlikely that we would consider committing such changes into
released branches.  In fact, it's too late even for v15.  You should
be submitting non-bug-fix patches against master (v16-to-be).

			regards, tom lane

Re: Making pg_rewind faster

Justin Kwan <justinpkwan@outlook.com> — 2022-07-18T17:14:00Z

Hi Tom,

Thank you for taking a look at this and that sounds good. I will send over a patch compatible with Postgres v16.

Justin
________________________________
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: July 17, 2022 2:40 PM
To: Justin Kwan <justinpkwan@outlook.com>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; jkwan@cloudflare.com <jkwan@cloudflare.com>; vignesh ravichandran <admin@viggy28.dev>; hlinnaka@iki.fi <hlinnaka@iki.fi>
Subject: Re: Making pg_rewind faster

Justin Kwan <justinpkwan@outlook.com> writes:
> I've also attached the pg_rewind optimization patch file for Postgres version 14.4. The previous patch file targets version Postgres version 15 Beta 1/2.

It's very unlikely that we would consider committing such changes into
released branches.  In fact, it's too late even for v15.  You should
be submitting non-bug-fix patches against master (v16-to-be).

                        regards, tom lane

Re: Making pg_rewind faster

Michael Paquier <michael@paquier.xyz> — 2022-07-19T05:36:25Z

On Mon, Jul 18, 2022 at 05:14:00PM +0000, Justin Kwan wrote:
> Thank you for taking a look at this and that sounds good. I will
> send over a patch compatible with Postgres v16.

+$node_2->psql(
+       'postgres',
+       "SELECT extract(epoch from modification) FROM pg_stat_file('pg_wal/000000010000000000000003');",
+       stdout => \my $last_common_tli1_wal_last_modified_at);
Please note that you should not rely on the FS-level stats for
anything that touches the WAL segments.  A rough guess about what you
could here to make sure that only the set of WAL segments you are
looking for is being copied over would be to either:
- Scan the logs produced by pg_rewind and see if the segments are
copied or not, depending on the divergence point (aka the last
checkpoint before WAL forked).
- Clean up pg_wal/ in the target node before running pg_rewind,
checking that only the segments you want are available once the
operation completes.
--
Michael

Re: Making pg_rewind faster

Justin Kwan <justinpkwan@outlook.com> — 2022-07-28T22:46:28Z

Hi Michael,

Not sure if this email went through previously but thank you for your feedback, I've incorporated your suggestions by scanning the logs produced from pg_rewind when asserting that certain WAL segment files were skipped from being copied over to the target server.

I've also updated the pg_rewind patch file to target the Postgres master branch (version 16 to be). Please see attached.

Thanks,
Justin

________________________________
From: Justin Kwan <justinpkwan@outlook.com>
Sent: July 18, 2022 1:14 PM
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; jkwan@cloudflare.com <jkwan@cloudflare.com>; vignesh ravichandran <admin@viggy28.dev>; hlinnaka@iki.fi <hlinnaka@iki.fi>
Subject: Re: Making pg_rewind faster

Hi Tom,

Thank you for taking a look at this and that sounds good. I will send over a patch compatible with Postgres v16.

Justin
________________________________
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: July 17, 2022 2:40 PM
To: Justin Kwan <justinpkwan@outlook.com>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; jkwan@cloudflare.com <jkwan@cloudflare.com>; vignesh ravichandran <admin@viggy28.dev>; hlinnaka@iki.fi <hlinnaka@iki.fi>
Subject: Re: Making pg_rewind faster

Justin Kwan <justinpkwan@outlook.com> writes:
> I've also attached the pg_rewind optimization patch file for Postgres version 14.4. The previous patch file targets version Postgres version 15 Beta 1/2.

It's very unlikely that we would consider committing such changes into
released branches.  In fact, it's too late even for v15.  You should
be submitting non-bug-fix patches against master (v16-to-be).

                        regards, tom lane

Re: Making pg_rewind faster

Alexander Korotkov <aekorotkov@gmail.com> — 2022-09-13T17:50:20Z

Hi, Justin!

On Fri, Jul 29, 2022 at 1:05 PM Justin Kwan <justinpkwan@outlook.com> wrote:
> Not sure if this email went through previously but thank you for your feedback, I've incorporated your suggestions by scanning the logs produced from pg_rewind when asserting that certain WAL segment files were skipped from being copied over to the target server.
>
> I've also updated the pg_rewind patch file to target the Postgres master branch (version 16 to be). Please see attached.

Thank you for the revision.

I've taken a look at this patch. Overall it looks good to me. I also
don't see any design objections in the thread.

A couple of points from me:
1) I would prefer to evade hard-coded names for WAL segments in the
tap tests. Could we calculate the names in the tap tests based on the
diverge point, etc.?
2) Patch contains some indentation with spaces, which should be done
in tabs. Please consider either manually fixing this or running
pgindent over modified files.

------
Regards,
Alexander Korotkov

Re: Making pg_rewind faster

Andres Freund <andres@anarazel.de> — 2022-10-02T17:44:25Z

Hi,

On 2022-09-13 20:50:20 +0300, Alexander Korotkov wrote:
> On Fri, Jul 29, 2022 at 1:05 PM Justin Kwan <justinpkwan@outlook.com> wrote:
> > Not sure if this email went through previously but thank you for your feedback, I've incorporated your suggestions by scanning the logs produced from pg_rewind when asserting that certain WAL segment files were skipped from being copied over to the target server.
> >
> > I've also updated the pg_rewind patch file to target the Postgres master branch (version 16 to be). Please see attached.
> 
> Thank you for the revision.
> 
> I've taken a look at this patch. Overall it looks good to me. I also
> don't see any design objections in the thread.
> 
> A couple of points from me:
> 1) I would prefer to evade hard-coded names for WAL segments in the
> tap tests. Could we calculate the names in the tap tests based on the
> diverge point, etc.?
> 2) Patch contains some indentation with spaces, which should be done
> in tabs. Please consider either manually fixing this or running
> pgindent over modified files.

This patch currently fails because it hasn't been adjusted for
commit c47885bd8b6
Author: Andres Freund <andres@anarazel.de>
Date:   2022-09-19 18:03:17 -0700
 
    Split TESTDIR into TESTLOGDIR and TESTDATADIR

The adjustment is trivial. Attached, together with also producing an error
message rather than just dying wordlessly.


It doesn't seem quite right to read pg_rewind's logs by reading
regress_log_001_basic. Too easy to confuse different runs of pg_rewind
etc. I'd suggest trying to redirect the log to a different file.


With regard to Alexander's point about whitespace:

.git/rebase-apply/patch:25: indent with spaces.
                /* Handle WAL segment file. */
.git/rebase-apply/patch:26: indent with spaces.
                const char  *fname;
.git/rebase-apply/patch:27: indent with spaces.
                char        *slash;
.git/rebase-apply/patch:29: indent with spaces.
                /* Split filepath into directory & filename. */
.git/rebase-apply/patch:30: indent with spaces.
                slash = strrchr(path, '/');
warning: squelched 29 whitespace errors
warning: 34 lines add whitespace errors.

Greetings,

Andres Freund

Re: Making pg_rewind faster

Michael Paquier <michael@paquier.xyz> — 2022-10-06T07:08:45Z

On Sun, Oct 02, 2022 at 10:44:25AM -0700, Andres Freund wrote:
> It doesn't seem quite right to read pg_rewind's logs by reading
> regress_log_001_basic. Too easy to confuse different runs of pg_rewind
> etc. I'd suggest trying to redirect the log to a different file.

Hardcoding log file names in the test increases the overall
maintenance, even if renaming these would be easy to track and fix if
the naming convention is changed.  Anyway, I think that what this
patch should do is to use command_checks_all() in RewindTest.pm as it
is the test API able to check after a status and multiple expected
outputs, which is what the changes in 001 and 008 are doing.
RewindTest::run_pg_rewind() needs to be a bit tweaked to accept these
regex patterns in input.

+    if (file_segno < last_common_segno)
+    {
+        pg_log_debug("WAL file entry \"%s\" not copied to target", fname);
+        return FILE_ACTION_NONE;
+    }
There may be something I am missing here, but there is no need to care
about segments with a TLI older than lastcommontliIndex, no?

decide_wal_file_action() assumes that the WAL segment exists on the
target and the source.  This looks bug-prone to me without at least an
assertion.

file_entry_t has an entry to track if a file is a relation file.  I
think that it would be much cleaner to track if we are handling a WAL
segment when inserting an entry in insert_filehash_entry(), so
isrelfile could be replaced by an enum with three values: relation
file, WAL segment and the rest.
--
Michael

Re: Making pg_rewind faster

Michael Paquier <michael@paquier.xyz> — 2022-11-30T07:03:29Z

On Thu, Oct 06, 2022 at 04:08:45PM +0900, Michael Paquier wrote:
> file_entry_t has an entry to track if a file is a relation file.  I
> think that it would be much cleaner to track if we are handling a WAL
> segment when inserting an entry in insert_filehash_entry(), so
> isrelfile could be replaced by an enum with three values: relation
> file, WAL segment and the rest.

This review has been done a few weeks ago, and there has been no
update since, so I am marking this entry as returned with feedback in
the CF app.
--
Michael

Re: Making pg_rewind faster

John H <johnhyvr@gmail.com> — 2025-07-01T18:13:46Z

Hi,

I've attached an updated version of the patch against master with the changes
suggested.

On Tue, Nov 29, 2022 at 10:03 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Oct 06, 2022 at 04:08:45PM +0900, Michael Paquier wrote:
>>
>>  There may be something I am missing here, but there is no need to care
>> about segments with a TLI older than lastcommontliIndex, no?

Hard to say. pg_rewind is intended to make the same "copy" of the cluster which
implies pg_wal/ should look the same. There might be use cases around logical
replication where you would want these WAL files to still exist even
across promotions?

>> decide_wal_file_action() assumes that the WAL segment exists on the
>> target and the source.  This looks bug-prone to me without at least an
>> assertion.

From previous refactors there is now an Assertion in filemap.c
decide_file_action that handles this.

> Assert(entry->target_exists && entry->source_exists);

decide_wal_file_action is called after the assertion.

Thanks,

--
John Hsu - Amazon Web Services

Re: Making pg_rewind faster

Japin Li <japinli@hotmail.com> — 2025-07-02T02:20:50Z

On Tue, 01 Jul 2025 at 11:13, John H <johnhyvr@gmail.com> wrote:
> Hi,
>
> I've attached an updated version of the patch against master with the changes
> suggested.
>
> On Tue, Nov 29, 2022 at 10:03 PM Michael Paquier <michael@paquier.xyz> wrote:
>>
>> On Thu, Oct 06, 2022 at 04:08:45PM +0900, Michael Paquier wrote:
>>>
>>>  There may be something I am missing here, but there is no need to care
>>> about segments with a TLI older than lastcommontliIndex, no?
>
> Hard to say. pg_rewind is intended to make the same "copy" of the cluster which
> implies pg_wal/ should look the same. There might be use cases around logical
> replication where you would want these WAL files to still exist even
> across promotions?
>
>>> decide_wal_file_action() assumes that the WAL segment exists on the
>>> target and the source.  This looks bug-prone to me without at least an
>>> assertion.
>
> From previous refactors there is now an Assertion in filemap.c
> decide_file_action that handles this.
>
>> Assert(entry->target_exists && entry->source_exists);
>
> decide_wal_file_action is called after the assertion.
>

Hi, John

Thanks for updating the patch.

1.
+/* Determine the type of file content (relation, WAL, or other) */
+static file_content_type_t
+getFileType(const char *path)

Considering the existence of file_type_t, would getFileContentType() be a
suitable function for handling file content types?

2.
Perhaps decide_wal_file_action() could be defined in filemap.c.


While this is unrelated to WAL logging, it could also contribute to faster
pg_rewind operations.  Should we consider ignoring log files under PGDATA
(e.g., those in the default log/ directory)?

-- 
Regards,
Japin Li

Re: Making pg_rewind faster

wenhui qiu <qiuwenhuifx@gmail.com> — 2025-07-02T03:16:25Z

HI
>  2.
> Perhaps decide_wal_file_action() could be defined in filemap.c.


>  While this is unrelated to WAL logging, it could also contribute to
faster
> pg_rewind operations.  Should we consider ignoring log files under PGDATA
> (e.g., those in the default log/ directory)?
Agree ，Usually the log file directory also takes up a lot of space, and the
number of log files is quite large

On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com> wrote:

> On Tue, 01 Jul 2025 at 11:13, John H <johnhyvr@gmail.com> wrote:
> > Hi,
> >
> > I've attached an updated version of the patch against master with the
> changes
> > suggested.
> >
> > On Tue, Nov 29, 2022 at 10:03 PM Michael Paquier <michael@paquier.xyz>
> wrote:
> >>
> >> On Thu, Oct 06, 2022 at 04:08:45PM +0900, Michael Paquier wrote:
> >>>
> >>>  There may be something I am missing here, but there is no need to care
> >>> about segments with a TLI older than lastcommontliIndex, no?
> >
> > Hard to say. pg_rewind is intended to make the same "copy" of the
> cluster which
> > implies pg_wal/ should look the same. There might be use cases around
> logical
> > replication where you would want these WAL files to still exist even
> > across promotions?
> >
> >>> decide_wal_file_action() assumes that the WAL segment exists on the
> >>> target and the source.  This looks bug-prone to me without at least an
> >>> assertion.
> >
> > From previous refactors there is now an Assertion in filemap.c
> > decide_file_action that handles this.
> >
> >> Assert(entry->target_exists && entry->source_exists);
> >
> > decide_wal_file_action is called after the assertion.
> >
>
> Hi, John
>
> Thanks for updating the patch.
>
> 1.
> +/* Determine the type of file content (relation, WAL, or other) */
> +static file_content_type_t
> +getFileType(const char *path)
>
> Considering the existence of file_type_t, would getFileContentType() be a
> suitable function for handling file content types?
>
> 2.
> Perhaps decide_wal_file_action() could be defined in filemap.c.
>
>
> While this is unrelated to WAL logging, it could also contribute to faster
> pg_rewind operations.  Should we consider ignoring log files under PGDATA
> (e.g., those in the default log/ directory)?
>
> --
> Regards,
> Japin Li
>
>
>

Re: Making pg_rewind faster

John H <johnhyvr@gmail.com> — 2025-07-02T18:21:54Z

Hi,

Thanks for the quick review.

On Tue, Jul 1, 2025 at 8:16 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
> > Perhaps decide_wal_file_action() could be defined in filemap.c.
>

That's a good point. I updated the patch to reflect that.

> >  While this is unrelated to WAL logging, it could also contribute to faster
> > pg_rewind operations.  Should we consider ignoring log files under PGDATA
> > (e.g., those in the default log/ directory)?
> Agree ，Usually the log file directory also takes up a lot of space, and the number of log files is quite large
>

Should we handle this use case? I do agree that for the more common
use-cases of pg_rewind which is synchronizing an old writer to the new
leader after failover, avoiding syncing the logging directory is
useful.
At the same time, pg_rewind is intended to make the same copy of the
source cluster as efficiently as possible which would include "all"
directories if they exist in $PGDATA. By default logging_collector is
off as well. I'd also think you would want to avoid putting the logs
in $PGDATA to have smaller backups if you are using tools like
pg_basebackup.

> On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com> wrote:
>>
>> Hi, John
>>
>> Thanks for updating the patch.
>>
>> 1.
>> +/* Determine the type of file content (relation, WAL, or other) */
>> +static file_content_type_t
>> +getFileType(const char *path)
>>
>> Considering the existence of file_type_t, would getFileContentType() be a
>> suitable function for handling file content types?

Do you mean naming getFileType to getFileContentType?

Thanks,

-- 
John Hsu - Amazon Web Services

Re: Making pg_rewind faster

Japin Li <japinli@hotmail.com> — 2025-07-03T01:39:51Z

On Wed, 02 Jul 2025 at 11:21, John H <johnhyvr@gmail.com> wrote:
> Hi,
>
> Thanks for the quick review.
>
> On Tue, Jul 1, 2025 at 8:16 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
>> > Perhaps decide_wal_file_action() could be defined in filemap.c.
>>
>
> That's a good point. I updated the patch to reflect that.
>

Thanks for updating the patch.

>> >  While this is unrelated to WAL logging, it could also contribute to faster
>> > pg_rewind operations.  Should we consider ignoring log files under PGDATA
>> > (e.g., those in the default log/ directory)?
>> Agree ，Usually the log file directory also takes up a lot of space, and the number of log files is quite large
>>
>
> Should we handle this use case? I do agree that for the more common
> use-cases of pg_rewind which is synchronizing an old writer to the new
> leader after failover, avoiding syncing the logging directory is
> useful.
> At the same time, pg_rewind is intended to make the same copy of the
> source cluster as efficiently as possible which would include "all"
> directories if they exist in $PGDATA. By default logging_collector is
> off as well. I'd also think you would want to avoid putting the logs
> in $PGDATA to have smaller backups if you are using tools like
> pg_basebackup.
>

Splitting the logs from $PGDATA is definitely better. The question is whether
it's worth implementing this directly in core or if a prominent note in the
documentation would suffice.

>> On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com> wrote:
>>>
>>> Hi, John
>>>
>>> Thanks for updating the patch.
>>>
>>> 1.
>>> +/* Determine the type of file content (relation, WAL, or other) */
>>> +static file_content_type_t
>>> +getFileType(const char *path)
>>>
>>> Considering the existence of file_type_t, would getFileContentType() be a
>>> suitable function for handling file content types?
>
> Do you mean naming getFileType to getFileContentType?
>

Exactly!  It's confusing that getFileType() returns file_content_type_t
instead of file_type_t.

For v5 patch:

1.
We could simply use the global WalSegSz variable within decide_file_action(),
eliminating the need to pass wal_segsz_bytes as an argument.

2.
For last_common_segno, we could implement it similarly to WalSegSz, avoiding a
signature change for decide_file_actions() and decide_file_action().  I'm not
insisting on this approach, however.

--
Regards,
Japin Li

Re: Making pg_rewind faster

John H <johnhyvr@gmail.com> — 2025-07-03T19:59:29Z

Hi,

On Wed, Jul 2, 2025 at 6:40 PM Japin Li <japinli@hotmail.com> wrote:
>
> >
>
> Splitting the logs from $PGDATA is definitely better. The question is whether
> it's worth implementing this directly in core or if a prominent note in the
> documentation would suffice.
>

I can work on the documentation update as a separate patch if folks
think this is worthwhile.

> >> On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com> wrote:
>
> Exactly!  It's confusing that getFileType() returns file_content_type_t
> instead of file_type_t.
>

Ah yes that is confusing, updated in patch.

> For v5 patch:
>
> 1.
> We could simply use the global WalSegSz variable within decide_file_action(),
> eliminating the need to pass wal_segsz_bytes as an argument.
>

Good point.

> 2.
> For last_common_segno, we could implement it similarly to WalSegSz, avoiding a
> signature change for decide_file_actions() and decide_file_action().  I'm not
> insisting on this approach, however.
>

I made it a global as well, and had to include access/xlog_internal.h
in pg_rewind.h but I don't feel strongly about it either way.

Thanks,

-- 
John Hsu - Amazon Web Services

Re: Making pg_rewind faster

Japin Li <japinli@hotmail.com> — 2025-07-04T02:48:25Z

On Thu, 03 Jul 2025 at 12:59, John H <johnhyvr@gmail.com> wrote:
> Hi,
>
> On Wed, Jul 2, 2025 at 6:40 PM Japin Li <japinli@hotmail.com> wrote:
>>
>> >
>>
>> Splitting the logs from $PGDATA is definitely better. The question is whether
>> it's worth implementing this directly in core or if a prominent note in the
>> documentation would suffice.
>>
>
> I can work on the documentation update as a separate patch if folks
> think this is worthwhile.
>
>> >> On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com> wrote:
>>
>> Exactly!  It's confusing that getFileType() returns file_content_type_t
>> instead of file_type_t.
>>
>
> Ah yes that is confusing, updated in patch.
>
>> For v5 patch:
>>
>> 1.
>> We could simply use the global WalSegSz variable within decide_file_action(),
>> eliminating the need to pass wal_segsz_bytes as an argument.
>>
>
> Good point.
>
>> 2.
>> For last_common_segno, we could implement it similarly to WalSegSz, avoiding a
>> signature change for decide_file_actions() and decide_file_action().  I'm not
>> insisting on this approach, however.
>>
>
> I made it a global as well, and had to include access/xlog_internal.h
> in pg_rewind.h but I don't feel strongly about it either way.
>

Thanks, LGTM.

-- 
Regards,
Japin Li

Re: Making pg_rewind faster

wenhui qiu <qiuwenhuifx@gmail.com> — 2025-07-07T02:11:57Z

Hi
> Thanks, LGTM.
I think it's possible to register to https://commitfest.postgresql.org/54,
https://commitfest.postgresql.org/53 will closed soon



Thanks

On Fri, Jul 4, 2025 at 10:50 AM Japin Li <japinli@hotmail.com> wrote:

> On Thu, 03 Jul 2025 at 12:59, John H <johnhyvr@gmail.com> wrote:
> > Hi,
> >
> > On Wed, Jul 2, 2025 at 6:40 PM Japin Li <japinli@hotmail.com> wrote:
> >>
> >> >
> >>
> >> Splitting the logs from $PGDATA is definitely better. The question is
> whether
> >> it's worth implementing this directly in core or if a prominent note in
> the
> >> documentation would suffice.
> >>
> >
> > I can work on the documentation update as a separate patch if folks
> > think this is worthwhile.
> >
> >> >> On Wed, Jul 2, 2025 at 10:21 AM Japin Li <japinli@hotmail.com>
> wrote:
> >>
> >> Exactly!  It's confusing that getFileType() returns file_content_type_t
> >> instead of file_type_t.
> >>
> >
> > Ah yes that is confusing, updated in patch.
> >
> >> For v5 patch:
> >>
> >> 1.
> >> We could simply use the global WalSegSz variable within
> decide_file_action(),
> >> eliminating the need to pass wal_segsz_bytes as an argument.
> >>
> >
> > Good point.
> >
> >> 2.
> >> For last_common_segno, we could implement it similarly to WalSegSz,
> avoiding a
> >> signature change for decide_file_actions() and decide_file_action().
> I'm not
> >> insisting on this approach, however.
> >>
> >
> > I made it a global as well, and had to include access/xlog_internal.h
> > in pg_rewind.h but I don't feel strongly about it either way.
> >
>
> Thanks, LGTM.
>
> --
> Regards,
> Japin Li
>

Re: Making pg_rewind faster

John H <johnhyvr@gmail.com> — 2025-07-07T18:45:58Z

Hi,

Thanks for the review.

On Sun, Jul 6, 2025 at 7:12 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
>
> Hi
> > Thanks, LGTM.
> I think it's possible to register to https://commitfest.postgresql.org/54, https://commitfest.postgresql.org/53 will closed soon
>

I've created the commitfest entry here:
https://commitfest.postgresql.org/patch/5902/

I realized I wasn't testing properly in 0006. I've updated the latest
one to move it into a separate test file to make the results cleaner.

Thanks,

-- 
John Hsu - Amazon Web Services

Re: Making pg_rewind faster

Robert Haas <robertmhaas@gmail.com> — 2025-09-25T18:35:16Z

On Mon, Jul 7, 2025 at 2:46 PM John H <johnhyvr@gmail.com> wrote:
> I've created the commitfest entry here:
> https://commitfest.postgresql.org/patch/5902/
>
> I realized I wasn't testing properly in 0006. I've updated the latest
> one to move it into a separate test file to make the results cleaner.

Thanks to everyone who has worked on this patch. I think this is a
real problem about which we ought to try to do something. I think you
could even defend calling this a back-patchable bug fix, since copying
a lot of files that we know have to be identical is a serious
efficiency issue that can affect the usability of pg_upgrade. There
are more aggressive things that we could try to do here that might be
even better, e.g. we could copy just the WAL segment or segments that
contain the last checkpoint record and let the server fetch the rest
as needed after it starts up. However, that would be a definitional
change, which we certainly wouldn't want to back-patch, and which
might have downsides for some people, so perhaps it would need to be
optional behavior. But the patch as proposed is just an optimization
that shouldn't change the outcome for anyone. Perhaps it's still best
to treat it as a master-only improvement, but calling it a fix for a
performance bug isn't completely crazy either, IMHO.

Moving on to the patch contents, the approach taken makes sense to me.
Any WAL files that exist in both source and target from prior to the
point of divergence should not need to be copied, because they should
necessarily be identical. If anyone sees a flaw in this reasoning,
please point it out! While I agree with the basic approach, I think
that the patch could be hardened in two ways. First, isRelDataFile
tests the complete pathname to decide whether it has a relation file,
but getFileContentType only tests the final component of the pathname.
I think it should test the whole pathname also, in case there's a
stray file with a file name that looks like a WAL file in some other
directory. In fact, I think that these two functions should be
integrated. Let's rename the isRelDataFile() to getFileContentType()
and put all the logic in that function, including appropriate tests of
the path name. Second, in decide_file_action(), I think we should
check that entry->target_size == entry->source_size and return
FILE_ACTION_COPY if not. This case really shouldn't happen, but it's
very cheap to check and might offer a bit of protection in some weird
scenario.

I am not a huge fan of the way that the patch stuffs last_common_segno
into a global variable that is then accessed by
decide_wal_file_action. I think it would be better to find a way to
pass this information down through the call stack, either as a
separate argument or as part of some struct we're already passing, if
there's a reasonable and not-too-invasive way to do that. If it looks
too ugly, maybe this is the best option, but I think we should try to
find something better.

It does not appear to me that the test case tests what it purports to
test, and I don't think that what it purports to test is the right
thing anyway. It claims that it is testing that a certain WAL file
(which it erroneously calls a "WAL file entry") is not copied to the
target, but I see no logic to actually check this. I also don't think
it's the correct behavior. I think what the patch is doing and should
do is ensure that we don't copy the file when it's already present. If
the file isn't present in the target cluster, it should still be
copied. Granted, I haven't verified that this is the actual behavior,
but the "Handle cases where the file is missing from one of the
systems" code in decide_file_actiomn() seems to be higher up than the
code the patch changes, so I assume that logic should be applied
first. If that's correct, I don't quite know what a test case here can
usefully verify, since the only difference would be performance, but
maybe I'm misunderstanding something?

As an administrative note, the naming convention for the patches
posted to the thread is not consistent and not correct. Use "git
format-patch -v$VERSION" to generate patches. In a name like
v7-0003-do-something.patch, "v7" is the version of the patch set as a
whole, and "0003" identifies a specific patch within that version of
the patch set. In other words, that would be the third patch in the
series from the seventh time that the patch set was posted. This patch
set so far contains only one patch, so the file name should be
vX-0001-whatever.patch, where whatever is the subject of the patch and
X is a number that increases by one every time it's reposted. Let git
format-patch do the work for you, though.

On a related note, something a committer would need to do if they
wanted to commit this patch is figure out who should be credited with
writing it. I see that a number of different people have posted
versions of the patch; it would be nice if "Author:" or
"Co-authored-by:" tags were added to the commit message to reflect
whose code is present in a given version.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Making pg_rewind faster

John H <johnhyvr@gmail.com> — 2025-10-09T17:56:48Z

Hi Robert,

Thanks for taking a look.

On Thu, Sep 25, 2025 at 11:35 AM Robert Haas <robertmhaas@gmail.com> wrote:
> ...
> integrated. Let's rename the isRelDataFile() to getFileContentType()
> and put all the logic in that function, including appropriate tests of
> the path name. Second, in decide_file_action(), I think we should
> check that entry->target_size == entry->source_size and return
> FILE_ACTION_COPY if not. This case really shouldn't happen, but it's
> very cheap to check and might offer a bit of protection in some weird
> scenario.

Updated the patch to reflect that.

> I am not a huge fan of the way that the patch stuffs last_common_segno
> into a global variable that is then accessed by
> ...

I forgot why I bothered with a global instead since it was straightforward
to actually pass in the argument. Updated.

> It does not appear to me that the test case tests what it purports to
> test, and I don't think that what it purports to test is the right
> thing anyway. It claims that it is testing that a certain WAL file
> (which it erroneously calls a "WAL file entry") is not copied to the
> target, but I see no logic to actually check this.
> ...

That's a fair point. The test does:

1. Write some data -> WAL 1
2. SELECT pg_switch_wal() -> pg_current_wal_lsn() is at WAL 2
3. CHECKPOINT; -> Makes WAL 2 the last common checkpoint

Then it checks that "WAL 1 not copied over" was logged which isn't
the best since if it was logged elsewhere then it would still pass.

> ...
> first. If that's correct, I don't quite know what a test case here can
> usefully verify, since the only difference would be performance, but
> maybe I'm misunderstanding something?

I updated the test to run stat and get the modification time of the common
WAL segment before and after pg_rewind and verify it is the same.

In filemap.c in decide_wal_file_action, if you comment out
> return FILE_ACTION_NONE;

the test fails.

>
> As an administrative note, the naming convention for the patches
> posted to the thread is not consistent and not correct. Use "git
> format-patch -v$VERSION" to generate patches. In a name like
> ...

Fixed patch formatting.

> ...
> writing it. I see that a number of different people have posted
> versions of the patch; it would be nice if "Author:" or
> "Co-authored-by:" tags were added to the commit message to reflect

I added Justin as the Co-Author in the commit message but I realized
in their original patch they had "Author: Justin Kwan, Vignesh Ravichandran".
Andres also had the latest patch series before mine so I'll leave it up to
the committer how they want to handle this.

Thanks,
--
John Hsu - Amazon Web Services

Re: Making pg_rewind faster

Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-09T19:09:43Z

Hi,

just a second late :( i was about to post a patch addressing the refactors
which Robert mentioned  ,anyway will have a look at your latest patch John
thanks :), curious about the tap test.

while i was writing the patch something suddenly struck me , that is why we
are even depending on last_common_segno ,because once we
reached decide_wal_file_action it means that the file exists in both target
and source ,AFAIK this can only happen with wal segments older than or
equal to last_common_segno because once the promotion competes the filename
of the WAL files gets changed with the new timelineID(2), for ex: if the
last_common_segno is 000000010000000000000003 then based on the rules
in XLogInitNewTimeline
1) if the timeline switch happens in middle of segment ,copy data from the
last WAL segment and create WAL file with same segno but different
timelineID,in this case the starting WAL file for the new timeline will be
000000020000000000000003
2) if the timeline switch happens at segment boundary , just create next
segment for this case the starting WAL file for the new timeline will be
000000020000000000000004

so basically the files which exists in source and not in target like the
new timeline WAL segments will be copied to target in total before we reach
decide_wal_file_action , so i think we don't need to think about copying
WAL files after divergence point by calculating and checking against
last_common_segno which we are doing in our current approach , i think we
can just do

in decide_wal_file_action

if (entry->target_size != entry->source_size)
{
pg_log_debug("WAL segment file \"%s\" is copied to target", fname);
return FILE_ACTION_COPY;
}
return FILE_ACTION_NONE;

thoughts?

-- 
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/

Re: Making pg_rewind faster

John H <johnhyvr@gmail.com> — 2025-10-09T21:48:33Z

Hey Srinath

On Thu, Oct 9, 2025 at 12:09 PM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
> ...
> 1) if the timeline switch happens in middle of segment ,copy data from the last WAL segment and create WAL file with same segno but different timelineID,in this case the starting WAL file for the new timeline will be 000000020000000000000003
> 2) if the timeline switch happens at segment boundary , just create next segment for this case the starting WAL file for the new timeline will be 000000020000000000000004
>
> so basically the files which exists in source and not in target like the new timeline WAL segments will be copied to target in total before we reach decide_wal_file_action , so i think we don't need to think about copying WAL files after divergence point by calculating and checking against last_common_segno which we are doing in our current approach , i think we can just do
>
> ...

That's a great point. I need to think about it some more but the
reasoning makes sense to me.
I think 'last_common_segno ' is only useful as another sanity check
but we already have the size
ones.

Thanks,

-- 
John Hsu - Amazon Web Services

Re: Making pg_rewind faster

Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-10T07:44:59Z

Hi John , I have reviewed your latest patch(v8-0001).

On Thu, Oct 9, 2025 at 11:27 PM John H <johnhyvr@gmail.com> wrote:

> > ...
> > integrated. Let's rename the isRelDataFile() to getFileContentType()
> > and put all the logic in that function, including appropriate tests of
> > the path name. Second, in decide_file_action(), I think we should
> > check that entry->target_size == entry->source_size and return
> > FILE_ACTION_COPY if not. This case really shouldn't happen, but it's
> > very cheap to check and might offer a bit of protection in some weird
> > scenario.
>
> Updated the patch to reflect that.
>

in getFileContentType as you moved the logic of validating
whether they are xlog files or not using IsXLogFileName ,
i think it's better to validate it the kind of similar way as its
done for relfiles by parsing the path using sscanf this will
help to get the tli,logno,segno of the xlog file and using
XLogFromFileName which gives us logSegNo ,so that we
can rebuild the xlog file path again using these values with
XLogFilePath , then validate this new path with the given path
,this helps to catch invalid xlog files like
pg_wal/00000001FFFFFFFFFFFFFF10.

maybe like this
+ if (strncmp("pg_wal/", path, 7) == 0)
+ {
+ const char *filename = path + 7;
+ XLogFromFileName(filename,&tli,logSegNo,WalSegSz);
+ XLogFilePath(check_path,tli,*logSegNo,WalSegSz);
+ if (strcmp(check_path, path) == 0)
+ return FILE_CONTENT_TYPE_WAL;
+ else
+ return FILE_CONTENT_TYPE_OTHER;
+ }

>
> > I am not a huge fan of the way that the patch stuffs last_common_segno
> > into a global variable that is then accessed by
> > ...
>
> I forgot why I bothered with a global instead since it was straightforward
> to actually pass in the argument. Updated.
>

instead of passing last_common_segno down the call stack directly,
we can have a struct maybe called "rewindState" which will have the common
information related to both clusters involved in rewind ,so that in future
if we want to pass more such information down we won't be increasing
the number of arguments that need to be passed down to all functions in
stack.

>
> > It does not appear to me that the test case tests what it purports to
> > test, and I don't think that what it purports to test is the right
> > thing anyway. It claims that it is testing that a certain WAL file
> > (which it erroneously calls a "WAL file entry") is not copied to the
> > target, but I see no logic to actually check this.
> > ...
>
> That's a fair point. The test does:
>
> 1. Write some data -> WAL 1
> 2. SELECT pg_switch_wal() -> pg_current_wal_lsn() is at WAL 2
> 3. CHECKPOINT; -> Makes WAL 2 the last common checkpoint
>
> Then it checks that "WAL 1 not copied over" was logged which isn't
> the best since if it was logged elsewhere then it would still pass.
>

we can improve the TAP test file header comment as
# Test the situation where we skip copying the same WAL files from source
to target
# except if WAL file size differs.

let me put it this way
1) WALs are written to 000000010000000000000002 in primary.
2) then SELECT pg_switch_wal() now the current WAL is
000000010000000000000003
3) CHECKPOINT; this is the last common checkpoint and is written in
000000010000000000000003.
4) until this point we have the same WAL files in primary and standby.
5) promote standby ,this leads to timeline switch and lets say the
divergence point
is at segment boundary , so the first WAL segment in the new timeline is
000000020000000000000004.
6) when we stop standby then primary , in primary we get checkpoint
shutdown which
is greater than the divergence point ,means there is divergence so
pg_rewind is needed.
7) so the last common WAL segment was 000000010000000000000003.

I have mentioned below about how this updated TAP test is testing whether
common WAL file is being copied or not and about one more test we need to
cover
in this TAP test.

>
> > ...
> > first. If that's correct, I don't quite know what a test case here can
> > usefully verify, since the only difference would be performance, but
> > maybe I'm misunderstanding something?
>
> I updated the test to run stat and get the modification time of the common
> WAL segment before and after pg_rewind and verify it is the same.
>
> In filemap.c in decide_wal_file_action, if you comment out
> > return FILE_ACTION_NONE;
>
> the test fails.
>

This test makes sense to me, by checking whether
last common segment(000000010000000000000003 in our case)
been copied or not using the stat->mtime after the pg_rewind, if its not
copied then stat->mtime will be the same ,else it will differ with the
one we noted before rewind, but we also need to test the case
where entry->target_size != entry->source_size and we do copy,
like if stat->mtime is changed (which means file has been copied)
and target_stat->size != soucrce_stat->size here no error expected.

-- 
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/

Re: Making pg_rewind faster

John H <johnhyvr@gmail.com> — 2025-10-14T23:19:19Z

Hi,

On Fri, Oct 10, 2025 at 12:45 AM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
>
> ...
> XLogFilePath , then validate this new path with the given path
> ,this helps to catch invalid xlog files like pg_wal/00000001FFFFFFFFFFFFFF10.
> ...

Are you concerned that somehow these files, which are named like XLog
files but actually
aren't, are somehow created therefore we should sync them in case?
I'm trying to understand how these files would be generated in the first place.

>
> instead of passing last_common_segno down the call stack directly,
> we can have a struct maybe called "rewindState" which will have the common
> information related to both clusters involved in rewind ,so that in future
> if we want to pass more such information down we won't be increasing
> the number of arguments that need to be passed down to all functions in stack.
>

I don't feel too strongly about this. Not sure how we view
future-proofing things,
as in do we proactively wrap around structs or wait until there's an
actual need.

This is the first time it's been touched since it was introduced in 14 for what
it's worth.

>
> we can improve the TAP test file header comment as
> # Test the situation where we skip copying the same WAL files from source to target
> # except if WAL file size differs.
>
> let me put it this way
> 1) WALs are written to 000000010000000000000002 in primary.
> ...
> 7) so the last common WAL segment was 000000010000000000000003.

Updated comments.

>
> ...
> where entry->target_size != entry->source_size and we do copy,
> like if stat->mtime is changed (which means file has been copied)
> and target_stat->size != soucrce_stat->size here no error expected.
>

Updated patch with one additional segment has a byte appended to it
on target but not source.

Thanks,

John H

Re: Making pg_rewind faster

Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-15T14:26:56Z

Hi,

On Wed, Oct 15, 2025 at 4:49 AM John H <johnhyvr@gmail.com> wrote:

> Hi,
>
> On Fri, Oct 10, 2025 at 12:45 AM Srinath Reddy Sadipiralla
> <srinath2133@gmail.com> wrote:
> >
> > ...
> > XLogFilePath , then validate this new path with the given path
> > ,this helps to catch invalid xlog files like
> pg_wal/00000001FFFFFFFFFFFFFF10.
> > ...
>
> Are you concerned that somehow these files, which are named like XLog
> files but actually
> aren't, are somehow created therefore we should sync them in case?
> I'm trying to understand how these files would be generated in the first
> place.
>

the problem is not when server generates them because
filename is properly calculated using the XLogRecPtr in
XLogWrite->XLogFileInit->XLogFileInitInternal->XLogFilePath
,the main problem is when if someone manually places an invalid WAL file
in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will
consider it as valid ,so with the approach as i mentioned earlier we can
catch such cases.


>
> >
> > instead of passing last_common_segno down the call stack directly,
> > we can have a struct maybe called "rewindState" which will have the
> common
> > information related to both clusters involved in rewind ,so that in
> future
> > if we want to pass more such information down we won't be increasing
> > the number of arguments that need to be passed down to all functions in
> stack.
> >
>
> I don't feel too strongly about this. Not sure how we view
> future-proofing things,
> as in do we proactively wrap around structs or wait until there's an
> actual need.
>
> This is the first time it's been touched since it was introduced in 14 for
> what
> it's worth.
>

TBH first it seemed to me a good coding practice for future proofing,
then i checked if there's any place in postgres we are doing
something similar ,then found 1 in src/include/access/gist.h

typedef struct GISTDeletedPageContents
{
/* last xid which could see the page in a scan */
FullTransactionId deleteXid;
} GISTDeletedPageContents;


>
> >
> > we can improve the TAP test file header comment as
> > # Test the situation where we skip copying the same WAL files from
> source to target
> > # except if WAL file size differs.
> >
> > let me put it this way
> > 1) WALs are written to 000000010000000000000002 in primary.
> > ...
> > 7) so the last common WAL segment was 000000010000000000000003.
>
> Updated comments.
>
> >
> > ...
> > where entry->target_size != entry->source_size and we do copy,
> > like if stat->mtime is changed (which means file has been copied)
> > and target_stat->size != soucrce_stat->size here no error expected.
> >
>
> Updated patch with one additional segment has a byte appended to it
> on target but not source.
>

thanks for updating, i have attached a diff patch on
top of v9-0001 patch , where i tried to add more tests
to improve the validation that we copy the WAL file even
when it exists on both source and target but the size differs.
I also tried to rename some variables for readability,while
testing this out i found that usleep(1000); is not enough
to show that there has been a copy which changed the
stat->mtime of the file because i think the copy was very
fast (less than 1 millisecond) so the mtime of the
file after rewind's copy didn't change or filesystem precision
didn't caught the change, the copying of the file is confirmed
because the same file has different size before rewind and
same after rewind which these tests proved

ok($corrupt_wal_seg_stat_before_rewind->size !=
$corrupt_wal_seg_source_stat->size,"WAL segment $corrupt_wal_seg has
different size in source vs target before rewind");
ok 11 - WAL segment 000000010000000000000004 has different size in source
vs target before rewind

ok($corrupt_wal_seg_stat_after_rewind->size ==
$corrupt_wal_seg_source_stat->size, "WAL segment $corrupt_wal_seg was
copied: file sizes are same between target and source after rewind");
ok 12 - WAL segment 000000010000000000000004 was copied: file sizes are
same between target and source after rewind

and more over the pg_rewind confirmed this with a new debug
message which I added in this diff patch

diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 69d7728ce44..743484f4833 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -732,11 +732,11 @@ decide_wal_file_action(const char *fname, XLogSegNo
last_common_segno,
  */
  if (file_segno < last_common_segno && source_size == target_size)
  {
  pg_log_debug("WAL segment \"%s\" not copied to target", fname);
  return FILE_ACTION_NONE;
  }

+ pg_log_debug("WAL segment \"%s\" is copied to target", fname);
  return FILE_ACTION_COPY;
 }
ok 6 - run pg_rewind stderr /(?^:WAL segment \"000000010000000000000004\"
is copied to target)/

also did this instead of using catfile
-my $wal_skipped_path = File::Spec->catfile($node_primary->data_dir,
'pg_wal', $wal_seg_skipped);
+my $wal_skipped_path = $node_primary->data_dir . '/pg_wal/' .
$wal_seg_skipped;
seems consistent with other TAP tests.

-- 
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/

Re: Making pg_rewind faster

Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-15T14:52:34Z

On Wed, Oct 15, 2025 at 7:56 PM Srinath Reddy Sadipiralla <
srinath2133@gmail.com> wrote:

>
> thanks for updating, i have attached a diff patch on
> top of v9-0001 patch , where i tried to add more tests
> to improve the validation that we copy the WAL file even
> when it exists on both source and target but the size differs.
> I also tried to rename some variables for readability,while
> testing this out i found that usleep(1000); is not enough
> to show that there has been a copy which changed the
> stat->mtime of the file because i think the copy was very
> fast (less than 1 millisecond) so the mtime of the
> file after rewind's copy didn't change or filesystem precision
> didn't caught the change, the copying of the file is confirmed
> because the same file has different size before rewind and
> same after rewind which these tests proved
>

>
ok($corrupt_wal_seg_stat_before_rewind->size !=
> $corrupt_wal_seg_source_stat->size,"WAL segment $corrupt_wal_seg has
> different size in source vs target before rewind");
> ok 11 - WAL segment 000000010000000000000004 has different size in source
> vs target before rewind
>
> ok($corrupt_wal_seg_stat_after_rewind->size ==
> $corrupt_wal_seg_source_stat->size, "WAL segment $corrupt_wal_seg was
> copied: file sizes are same between target and source after rewind");
> ok 12 - WAL segment 000000010000000000000004 was copied: file sizes are
> same between target and source after rewind
>
> and more over the pg_rewind confirmed this with a new debug
> message which I added in this diff patch
>
> diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
> index 69d7728ce44..743484f4833 100644
> --- a/src/bin/pg_rewind/filemap.c
> +++ b/src/bin/pg_rewind/filemap.c
> @@ -732,11 +732,11 @@ decide_wal_file_action(const char *fname, XLogSegNo
> last_common_segno,
>   */
>   if (file_segno < last_common_segno && source_size == target_size)
>   {
>   pg_log_debug("WAL segment \"%s\" not copied to target", fname);
>   return FILE_ACTION_NONE;
>   }
>
> + pg_log_debug("WAL segment \"%s\" is copied to target", fname);
>   return FILE_ACTION_COPY;
>  }
> ok 6 - run pg_rewind stderr /(?^:WAL segment \"000000010000000000000004\"
> is copied to target)/
>

>


btw i forgot to let you know that i have increased the
usleep(1000) to usleep(1000000) for the above reason
which helped to show the difference in mtime after rewind
adding the below extra validation that the file has been copied from
source to target ,this is already done in the v9-0001.diff patch.

ok($corrupt_wal_seg_stat_after_rewind->mtime >
$corrupt_wal_seg_stat_before_rewind->mtime,"WAL segment $corrupt_wal_seg
was copied: mtime after rewind > mtime before rewind");
 ok 12 - WAL segment 000000010000000000000004 was copied: mtime after
rewind > mtime before rewind


On Fri, Oct 10, 2025 at 3:18 AM John H <johnhyvr@gmail.com> wrote:

> Hey Srinath
>
> On Thu, Oct 9, 2025 at 12:09 PM Srinath Reddy Sadipiralla
> <srinath2133@gmail.com> wrote:
> > ...
> > 1) if the timeline switch happens in middle of segment ,copy data from
> the last WAL segment and create WAL file with same segno but different
> timelineID,in this case the starting WAL file for the new timeline will be
> 000000020000000000000003
> > 2) if the timeline switch happens at segment boundary , just create next
> segment for this case the starting WAL file for the new timeline will be
> 000000020000000000000004
> >
> > so basically the files which exists in source and not in target like the
> new timeline WAL segments will be copied to target in total before we reach
> decide_wal_file_action , so i think we don't need to think about copying
> WAL files after divergence point by calculating and checking against
> last_common_segno which we are doing in our current approach , i think we
> can just do
> >
> > ...
>
> That's a great point. I need to think about it some more but the
> reasoning makes sense to me.
> I think 'last_common_segno ' is only useful as another sanity check
> but we already have the size
> ones.
>

it would be awesome to know ,if you have any update on this.

-- 
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/

Re: Making pg_rewind faster

Robert Haas <robertmhaas@gmail.com> — 2025-10-16T18:17:51Z

On Thu, Oct 9, 2025 at 3:09 PM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
> just a second late :( i was about to post a patch addressing the refactors which Robert mentioned  ,anyway will have a look at your latest patch John thanks :), curious about the tap test.
>
> while i was writing the patch something suddenly struck me , that is why we are even depending on last_common_segno ,because once we reached decide_wal_file_action it means that the file exists in both target and source ,AFAIK this can only happen with wal segments older than or equal to last_common_segno because once the promotion competes the filename of the WAL files gets changed with the new timelineID(2), for ex: if the last_common_segno is 000000010000000000000003 then based on the rules in XLogInitNewTimeline
> 1) if the timeline switch happens in middle of segment ,copy data from the last WAL segment and create WAL file with same segno but different timelineID,in this case the starting WAL file for the new timeline will be 000000020000000000000003
> 2) if the timeline switch happens at segment boundary , just create next segment for this case the starting WAL file for the new timeline will be 000000020000000000000004
>
> so basically the files which exists in source and not in target like the new timeline WAL segments will be copied to target in total before we reach decide_wal_file_action , so i think we don't need to think about copying WAL files after divergence point by calculating and checking against last_common_segno which we are doing in our current approach , i think we can just do

What makes me nervous about this is that it isn't necessarily the case
that the servers were perfectly in sync at the time of the failure.
Suppose that the primary was in the middle of writing
000000010000000000000003. The standby might also have this file, but
it might contain less valid data than the one on the primary;
therefore, if we don't copy the file, the two servers might not have
an identical file. Maybe that wouldn't really matter, in the sense
that the extra valid data that exists on the original primary
shouldn't prevent it from replaying WAL on the new primary's timeline,
which is probably all we really care about. But it feels dangerous to
me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Making pg_rewind faster

Robert Haas <robertmhaas@gmail.com> — 2025-10-16T18:59:48Z

On Wed, Oct 15, 2025 at 10:27 AM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
>> On Fri, Oct 10, 2025 at 12:45 AM Srinath Reddy Sadipiralla
>> <srinath2133@gmail.com> wrote:
>> > XLogFilePath , then validate this new path with the given path
>> > ,this helps to catch invalid xlog files like pg_wal/00000001FFFFFFFFFFFFFF10.
>> > ...
>>
>> Are you concerned that somehow these files, which are named like XLog
>> files but actually
>> aren't, are somehow created therefore we should sync them in case?
>> I'm trying to understand how these files would be generated in the first place.
>
> the problem is not when server generates them because
> filename is properly calculated using the XLogRecPtr in
> XLogWrite->XLogFileInit->XLogFileInitInternal->XLogFilePath
> ,the main problem is when if someone manually places an invalid WAL file
> in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will
> consider it as valid ,so with the approach as i mentioned earlier we can
> catch such cases.

I think that parsing the file name may be a good idea so that we can
do appropriate sanity checks on the values (e.g. checking that we're
only skipping copying prior to last_common_segno), but I do not think
we should worry too much about the user manually injecting invalid WAL
files. I mean, I would prefer that if that does happen, it either
works anyway or fails with a sensible error message, rather than
emitting an incomprehensible error message or dumping core. But, it is
in general true that if manual modifications are made to the data
directory, things may go terribly wrong, and this code is not obliged
to provide any more protection against such scenarios than we do in
other cases. Ultimately, such modifications are user error.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Making pg_rewind faster

John H <johnhyvr@gmail.com> — 2025-10-16T22:51:56Z

Hi,

On Wed, Oct 15, 2025 at 7:27 AM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:

>
> TBH first it seemed to me a good coding practice for future proofing,
> then i checked if there's any place in postgres we are doing
> something similar ,then found 1 in src/include/access/gist.h
>
> typedef struct GISTDeletedPageContents
> {
> /* last xid which could see the page in a scan */
> FullTransactionId deleteXid;
> } GISTDeletedPageContents;
>

It makes more sense to me if we expect the struct or same set of arguments
to be reused in different places / want a stable API reference. I don't think
we want everything to be wrapped around a struct for functions just in case.

> ..
> thanks for updating, i have attached a diff patch on
> top of v9-0001 patch , where i tried to add more tests
> ...

Thank you for the diff! Your changes are a lot cleaner and I've included it in
the latest patch. One difference I've made is instead of additionally logging in
decide_wal_file_action

> + pg_log_debug("WAL segment \"%s\" is copied to target", fname);

I realized we are already logging the WAL segments that are copied over in
print_filemap so I've updated the test to check for that instead

> qr/pg_wal\/$corrupt_wal_seg \(COPY\)/

I also updated the error message for the last three checks.

Thanks,

-- 
John Hsu - Amazon Web Services

Re: Making pg_rewind faster

John H <johnhyvr@gmail.com> — 2025-10-16T22:55:06Z

On Thu, Oct 16, 2025 at 12:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Oct 15, 2025 at 10:27 AM Srinath Reddy Sadipiralla
> > ,the main problem is when if someone manually places an invalid WAL file
> > in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will
> > consider it as valid ,so with the approach as i mentioned earlier we can
> > catch such cases.
>
> I think that parsing the file name may be a good idea so that we can
> do appropriate sanity checks on the values (e.g. checking that we're
> only skipping copying prior to last_common_segno), but I do not think
> we should worry too much about the user manually injecting invalid WAL
> files. I mean, I would prefer that if that does happen, it either
> works anyway or fails with a sensible error message, rather than
> emitting an incomprehensible error message or dumping core. But, it is
> in general true that if manual modifications are made to the data
> directory, things may go terribly wrong, and this code is not obliged
> to provide any more protection against such scenarios than we do in
> other cases. Ultimately, such modifications are user error.
>

It feels like there's a lot of things we could attempt to ensure
"correctness" if we are concerned about scenarios when the user manually puts
or modifies content unexpectedly in the pg_wal directory.

For instance, one could make the argument that when considering to skip
copying the common WAL segments, even though they are of the same
size, it's possible the user has manipulated them directly. I don't
think we need to
run checksums on every WAL segment that is a valid candidate to ensure they
match.


-- 
John Hsu - Amazon Web Services

Re: Making pg_rewind faster

Justin Kwan <justinpkwan@outlook.com> — 2025-10-17T22:08:00Z

Thanks John, great to see this patch getting attention again.
This builds on my original version from 2022, glad to see it moving forward and happy with the direction it’s taking.
—Justin

________________________________
From: John H <johnhyvr@gmail.com>
Sent: October 16, 2025 6:55 PM
To: Robert Haas <robertmhaas@gmail.com>
Cc: Srinath Reddy Sadipiralla <srinath2133@gmail.com>; wenhui qiu <qiuwenhuifx@gmail.com>; Japin Li <japinli@hotmail.com>; Michael Paquier <michael@paquier.xyz>; Andres Freund <andres@anarazel.de>; Alexander Korotkov <aekorotkov@gmail.com>; Justin Kwan <justinpkwan@outlook.com>; Tom Lane <tgl@sss.pgh.pa.us>; pgsql-hackers <pgsql-hackers@postgresql.org>; vignesh <vignesh@cloudflare.com>; vignesh ravichandran <admin@viggy28.dev>; hlinnaka@iki.fi <hlinnaka@iki.fi>; jkwan@cloudflare.com <jkwan@cloudflare.com>
Subject: Re: Making pg_rewind faster

On Thu, Oct 16, 2025 at 12:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Oct 15, 2025 at 10:27 AM Srinath Reddy Sadipiralla
> > ,the main problem is when if someone manually places an invalid WAL file
> > in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will
> > consider it as valid ,so with the approach as i mentioned earlier we can
> > catch such cases.
>
> I think that parsing the file name may be a good idea so that we can
> do appropriate sanity checks on the values (e.g. checking that we're
> only skipping copying prior to last_common_segno), but I do not think
> we should worry too much about the user manually injecting invalid WAL
> files. I mean, I would prefer that if that does happen, it either
> works anyway or fails with a sensible error message, rather than
> emitting an incomprehensible error message or dumping core. But, it is
> in general true that if manual modifications are made to the data
> directory, things may go terribly wrong, and this code is not obliged
> to provide any more protection against such scenarios than we do in
> other cases. Ultimately, such modifications are user error.
>

It feels like there's a lot of things we could attempt to ensure
"correctness" if we are concerned about scenarios when the user manually puts
or modifies content unexpectedly in the pg_wal directory.

For instance, one could make the argument that when considering to skip
copying the common WAL segments, even though they are of the same
size, it's possible the user has manipulated them directly. I don't
think we need to
run checksums on every WAL segment that is a valid candidate to ensure they
match.


--
John Hsu - Amazon Web Services

Re: Making pg_rewind faster

Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-18T14:17:06Z

On Thu, Oct 16, 2025 at 11:48 PM Robert Haas <robertmhaas@gmail.com> wrote:

> On Thu, Oct 9, 2025 at 3:09 PM Srinath Reddy Sadipiralla
> <srinath2133@gmail.com> wrote:
> > just a second late :( i was about to post a patch addressing the
> refactors which Robert mentioned  ,anyway will have a look at your latest
> patch John thanks :), curious about the tap test.
> >
> > while i was writing the patch something suddenly struck me , that is why
> we are even depending on last_common_segno ,because once we reached
> decide_wal_file_action it means that the file exists in both target and
> source ,AFAIK this can only happen with wal segments older than or equal to
> last_common_segno because once the promotion competes the filename of the
> WAL files gets changed with the new timelineID(2), for ex: if the
> last_common_segno is 000000010000000000000003 then based on the rules in
> XLogInitNewTimeline
> > 1) if the timeline switch happens in middle of segment ,copy data from
> the last WAL segment and create WAL file with same segno but different
> timelineID,in this case the starting WAL file for the new timeline will be
> 000000020000000000000003
> > 2) if the timeline switch happens at segment boundary , just create next
> segment for this case the starting WAL file for the new timeline will be
> 000000020000000000000004
> >
> > so basically the files which exists in source and not in target like the
> new timeline WAL segments will be copied to target in total before we reach
> decide_wal_file_action , so i think we don't need to think about copying
> WAL files after divergence point by calculating and checking against
> last_common_segno which we are doing in our current approach , i think we
> can just do
>
> What makes me nervous about this is that it isn't necessarily the case
> that the servers were perfectly in sync at the time of the failure.
> Suppose that the primary was in the middle of writing
> 000000010000000000000003. The standby might also have this file, but
> it might contain less valid data than the one on the primary;
> therefore, if we don't copy the file, the two servers might not have
> an identical file. Maybe that wouldn't really matter, in the sense
> that the extra valid data that exists on the original primary
> shouldn't prevent it from replaying WAL on the new primary's timeline,
> which is probably all we really care about. But it feels dangerous to
> me.
>

Thanks Robert ,I want to understand this point more , and will get back .

-- 
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/

Re: Making pg_rewind faster

Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-18T16:04:55Z

Hi,

On Fri, Oct 17, 2025 at 4:22 AM John H <johnhyvr@gmail.com> wrote:

> Hi,
>
> On Wed, Oct 15, 2025 at 7:27 AM Srinath Reddy Sadipiralla
> <srinath2133@gmail.com> wrote:
>
> >
> > TBH first it seemed to me a good coding practice for future proofing,
> > then i checked if there's any place in postgres we are doing
> > something similar ,then found 1 in src/include/access/gist.h
> >
> > typedef struct GISTDeletedPageContents
> > {
> > /* last xid which could see the page in a scan */
> > FullTransactionId deleteXid;
> > } GISTDeletedPageContents;
> >
>
> It makes more sense to me if we expect the struct or same set of arguments
> to be reused in different places / want a stable API reference. I don't
> think
> we want everything to be wrapped around a struct for functions just in
> case.
>
> > ..
> > thanks for updating, i have attached a diff patch on
> > top of v9-0001 patch , where i tried to add more tests
> > ...
>
> Thank you for the diff! Your changes are a lot cleaner and I've included
> it in
> the latest patch. One difference I've made is instead of additionally
> logging in
> decide_wal_file_action
>
> > + pg_log_debug("WAL segment \"%s\" is copied to target", fname);
>
> I realized we are already logging the WAL segments that are copied over in
> print_filemap so I've updated the test to check for that instead
>
> > qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
>
> I also updated the error message for the last three checks.
>

Thanks for updating the patch, I have reviewed it,
except these below concerns, patch LGTM.

1) regarding the parsing the WAL filename

On Fri, Oct 17, 2025 at 4:25 AM John H <johnhyvr@gmail.com> wrote:

> On Thu, Oct 16, 2025 at 12:00 PM Robert Haas <robertmhaas@gmail.com>
> wrote:
> >
> > On Wed, Oct 15, 2025 at 10:27 AM Srinath Reddy Sadipiralla
> > > ,the main problem is when if someone manually places an invalid WAL
> file
> > > in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will
> > > consider it as valid ,so with the approach as i mentioned earlier we
> can
> > > catch such cases.
> >
> > I think that parsing the file name may be a good idea so that we can
> > do appropriate sanity checks on the values (e.g. checking that we're
> > only skipping copying prior to last_common_segno), but I do not think
> > we should worry too much about the user manually injecting invalid WAL
> > files. I mean, I would prefer that if that does happen, it either
> > works anyway or fails with a sensible error message, rather than
> > emitting an incomprehensible error message or dumping core. But, it is
> > in general true that if manual modifications are made to the data
> > directory, things may go terribly wrong, and this code is not obliged
> > to provide any more protection against such scenarios than we do in
> > other cases. Ultimately, such modifications are user error.
> >
>
> It feels like there's a lot of things we could attempt to ensure
> "correctness" if we are concerned about scenarios when the user manually
> puts
> or modifies content unexpectedly in the pg_wal directory.
>
> For instance, one could make the argument that when considering to skip
> copying the common WAL segments, even though they are of the same
> size, it's possible the user has manipulated them directly. I don't
> think we need to
> run checksums on every WAL segment that is a valid candidate to ensure they
> match.


I agree with Robert here, and i think if we found
that the WAL filename is not valid after parsing
we can return something like FILE_CONTENT_INVALID_TYPE
in getFileContentType so that it won't end up going
through decide_wal_file_action,then XLogFromFileName
may assign file_segno invalid values greater than
last_common_segno which means we lead to copying it,even
though it's an invalid file, thoughts?

2) Please fix the indentation by running pgindent,
also add file_content_type_t typedef to typedefs.list
file.
3) maybe we can improve the error messages for the
last 2 checks in tap test, that because of this
reason hence proven that the copy from source to
target has been done.
4) We need to update the pg_rewind docs regarding
this new behaviour.

-- 
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/

Re: Making pg_rewind faster

John H <johnhyvr@gmail.com> — 2025-10-20T22:03:55Z

Hi Srinath,

Thank you for the quick review!

On Sat, Oct 18, 2025 at 9:05 AM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
> ...
> I agree with Robert here, and i think if we found
> that the WAL filename is not valid after parsing
> we can return something like FILE_CONTENT_INVALID_TYPE
> in getFileContentType so that it won't end up going
> through decide_wal_file_action,then XLogFromFileName
> may assign file_segno invalid values greater than
> last_common_segno which means we lead to copying it,even
> though it's an invalid file, thoughts?
>

My understanding of pg_rewind is that it aims to make the two clusters
"identical"
after rewind, similar to a basebackup. It does not try to make decisions
on whether or not a file is necessary for database to start. I'd
prefer the patch aims
to keep parity with the existing behavior of pg_rewind/pg_basebackup as much as
possible. If there is an unexpected/invalid segment and we can't make
assumptions
about the content, we should just sync it to keep with existing behavior.

There's lots of potential optimizations pg_rewind can do if we only
talk about what is
"strictly" necessary for the database to start. We can exclude any
directories that
should not be there. For example, if there exists a directory
$PGDATA/random_files
on source and target, the whole directory gets synced right now. We could not
include that directory and the database would most likely still start
but that goes
against the current way pg_rewind works.

> 2) Please fix the indentation by running pgindent,
> also add file_content_type_t typedef to typedefs.list
> file.

Will update.

> 3) maybe we can improve the error messages for the
> last 2 checks in tap test, that because of this
> reason hence proven that the copy from source to
> target has been done.

What are you thinking of? Something like the below?

+       "Expected WAL segment $corrupt_wal_seg to have been modified
due to rewind");
+       "Expected WAL segment $corrupt_wal_seg to have been synced
from source to target after rewind");

> 4) We need to update the pg_rewind docs regarding
> this new behaviour.
>

Will update.

-- 
John Hsu - Amazon Web Services

Re: Making pg_rewind faster

Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-21T04:36:26Z

Hi John,

On Tue, Oct 21, 2025 at 3:34 AM John H <johnhyvr@gmail.com> wrote:

>
> > 3) maybe we can improve the error messages for the
> > last 2 checks in tap test, that because of this
> > reason hence proven that the copy from source to
> > target has been done.
>
> What are you thinking of? Something like the below?
>
> +       "Expected WAL segment $corrupt_wal_seg to have been modified
> due to rewind");
> +       "Expected WAL segment $corrupt_wal_seg to have been synced
> from source to target after rewind");
>

maybe something like this
+ "Expected WAL segment $corrupt_wal_seg to have later mtime
on target than source after rewind as it was copied");
+ "Expected WAL segment $corrupt_wal_seg file sizes to be same
between target and source after rewind as it was copied");

-- 
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/

Re: Making pg_rewind faster

John H <johnhyvr@gmail.com> — 2025-10-21T23:14:55Z

Hi Srinath,

On Mon, Oct 20, 2025 at 9:36 PM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
> ...
> maybe something like this
> + "Expected WAL segment $corrupt_wal_seg to have later mtime
> on target than source after rewind as it was copied");
> + "Expected WAL segment $corrupt_wal_seg file sizes to be same
> between target and source after rewind as it was copied");
>

Made the changes and included the documentation update.

Thanks,
-- 
John Hsu - Amazon Web Services

Re: Making pg_rewind faster

Michael Paquier <michael@paquier.xyz> — 2025-10-23T05:12:24Z

On Thu, Sep 25, 2025 at 02:35:16PM -0400, Robert Haas wrote:
> On a related note, something a committer would need to do if they
> wanted to commit this patch is figure out who should be credited with
> writing it. I see that a number of different people have posted
> versions of the patch; it would be nice if "Author:" or
> "Co-authored-by:" tags were added to the commit message to reflect
> whose code is present in a given version.

Looking at this patch was on my TODO list (more to that in a bit), and
the simplest answer to this question is that everybody involved would
be cited in the release notes.  Justin Kwan has been the original
author.  John Hsu and Srinath have followed up in an equivalent
fashion, including answers with your review comments posted on this
previous message.  So listing all three as equivalent authors feels
normal to me (no need for co-authoring notes, IMO).
--
Michael

Re: Making pg_rewind faster

Michael Paquier <michael@paquier.xyz> — 2025-10-23T08:22:33Z

On Tue, Oct 21, 2025 at 04:14:55PM -0700, John H wrote:
> Made the changes and included the documentation update.

I have been looking at this patch, and the first part that stood out
is that the refactoring related to isRelDataFile() can be done as an
independent piece, cutting the total size of the main patch by 30% or
so.  So I have extracted this part, simplified it to make it more
consistent and les noisy on HEAD, and applied it as 6ae08d9583e9.

I am still looking at the rest, and attached is a rebased version of
what I have for the moment.  Note first that the new test was missing
in meson.build, so the coverage provided by the CI was limited.

+    * However we check last_common_segno and file_size again for sanity.
+    */
+   if (file_segno < last_common_segno && source_size == target_size)

What if a segment has a size different than the one a cluster has been
initialized with, still both have the same size on the target and the
source?  Unlikely, still incorrect if not cross-checked with what a
control file has in store.  :)

+# Sleep to allow mtime to be different
+usleep(1000000);
[...]
+	[qr/WAL segment \"$wal_seg_skipped\" not copied to target/,
+	 qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
+	],

FWIW, I am definitely not a fan of the test relying on the timestamp
of the files based on their retrieved meta-data stats, with the mtime.
I suspect complications on Windows.  Worse thing, there is a forced
sleep of 1s added to the test, to make sure that the timestamp of the
files we compare have a different number.  But do we really need that
anyway if we have the debug information with the file map printed in
it?

Hence, I think that we should simplify and rework the test a bit,
tweaking a bit how the filemap is printed: now that we can detect if
an operation is happening on a WAL file thanks to file_content_type_t,
let's always print the type of operation done for each WAL file and
remove this "not copied to target" debug log which provides the same
information, then let's compare the debug output with what we want.

It seems to me that it would be good for the test to run some sanity
checks on the rewound standby, as well.  That would provide more
validation for the case of the "corrupted" segment that gets copied
anyway.
--
Michael

Re: Making pg_rewind faster

Robert Haas <robertmhaas@gmail.com> — 2025-10-23T12:40:14Z

On Thu, Oct 23, 2025 at 4:22 AM Michael Paquier <michael@paquier.xyz> wrote:
> +    * However we check last_common_segno and file_size again for sanity.
> +    */
> +   if (file_segno < last_common_segno && source_size == target_size)
>
> What if a segment has a size different than the one a cluster has been
> initialized with, still both have the same size on the target and the
> source?  Unlikely, still incorrect if not cross-checked with what a
> control file has in store.  :)

While I'm not against cross-checking against the control file, this
sounds like an imaginary scenario to me. That is, it would only happen
if somebody maliciously modified the contents of the data directory by
hand with the express goal of breaking the tool. But we fundamentally
cannot defend against a malicious user whose express goal is to break
the tool, and I do not see any compelling reason to expend energy on
it even in cases like this where we could theoretically detect it
without much effort. If we go down that path, we'll end up not only
complicating the code, but also obscuring our own goals: it will look
like we've either done too much sanity checking (because we will have
added checks that are unnecessary with a non-malicious user) or too
little (because we will not have caught all the things a malicious
user might do).

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Making pg_rewind faster

Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-23T16:08:23Z

Hi Michael,

On Thu, Oct 23, 2025 at 1:52 PM Michael Paquier <michael@paquier.xyz> wrote:

> On Tue, Oct 21, 2025 at 04:14:55PM -0700, John H wrote:
> > Made the changes and included the documentation update.
>
> I have been looking at this patch, and the first part that stood out
> is that the refactoring related to isRelDataFile() can be done as an
> independent piece, cutting the total size of the main patch by 30% or
> so.  So I have extracted this part, simplified it to make it more
> consistent and les noisy on HEAD, and applied it as 6ae08d9583e9.
>

yeah ... that's cleaner, thanks for committing.


>
> +# Sleep to allow mtime to be different
> +usleep(1000000);
> [...]
> +       [qr/WAL segment \"$wal_seg_skipped\" not copied to target/,
> +        qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
> +       ],
>
> FWIW, I am definitely not a fan of the test relying on the timestamp
> of the files based on their retrieved meta-data stats, with the mtime.
> I suspect complications on Windows.  Worse thing, there is a forced
> sleep of 1s added to the test, to make sure that the timestamp of the
> files we compare have a different number.  But do we really need that
> anyway if we have the debug information with the file map printed in
> it?
>

thought it would act like an extra sanity check ,to prove the point that the
pg_rewind is saying through the debug info that it has been really copied
or skipped.


>
> Hence, I think that we should simplify and rework the test a bit,
> tweaking a bit how the filemap is printed: now that we can detect if
> an operation is happening on a WAL file thanks to file_content_type_t,
> let's always print the type of operation done for each WAL file and
> remove this "not copied to target" debug log which provides the same
> information, then let's compare the debug output with what we want.
>

i guess this is what you want ,please let me know if it's otherwise,

diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 2d82edec060..2180495d337 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -546,7 +546,7 @@ print_filemap(filemap_t *filemap)
        for (i = 0; i < filemap->nentries; i++)
        {
                entry = filemap->entries[i];
-               if (entry->action != FILE_ACTION_NONE ||
+               if (entry->content_type == FILE_CONTENT_TYPE_WAL ||
entry->action != FILE_ACTION_NONE ||
                        entry->target_pages_to_overwrite.bitmapsize > 0)
                {
                        pg_log_debug("%s (%s)", entry->path,
@@ -733,7 +733,6 @@ decide_wal_file_action(const char *fname, XLogSegNo
last_common_segno,
         */
        if (file_segno < last_common_segno && source_size == target_size)
        {
-               pg_log_debug("WAL segment \"%s\" not copied to target",
fname);
                return FILE_ACTION_NONE;
        }

diff --git a/src/bin/pg_rewind/t/011_wal_copy.pl b/src/bin/pg_rewind/t/
011_wal_copy.pl
index b9e24844654..3e12744dfa6 100644
--- a/src/bin/pg_rewind/t/011_wal_copy.pl
+++ b/src/bin/pg_rewind/t/011_wal_copy.pl
@@ -93,7 +93,7 @@ command_checks_all(
        0,
        [qr//],
        [
-               qr/WAL segment \"$wal_seg_skipped\" not copied to target/,
+               qr/pg_wal\/$wal_seg_skipped \(NONE\)/,
                qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
        ],
        'run pg_rewind');


> It seems to me that it would be good for the test to run some sanity
> checks on the rewound standby, as well.  That would provide more
> validation for the case of the "corrupted" segment that gets copied
> anyway.
>

i think these checks help do the same

1) qr/pg_wal\/$corrupt_wal_seg \(COPY\)/
this shows the corrupted segment is copied.

2) ok( $corrupt_wal_seg_stat_before_rewind->size !=
 $corrupt_wal_seg_source_stat->size,
"Expected WAL segment $corrupt_wal_seg to have different size in
source vs target before rewind"
);
my $corrupt_wal_seg_stat_after_rewind =
stat($corrupt_wal_seg_in_target_path);
ok( $corrupt_wal_seg_stat_after_rewind->size ==
 $corrupt_wal_seg_source_stat->size,
"Expected WAL segment $corrupt_wal_seg file sizes to be same between
target and source after rewind as it was copied"
);
These checks show that before the corrupt segment's size on
rewound standby(target) was not the same as source but after
rewind it was the same as source,please let me know if i am
missing your point here.

-- 
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/

Re: Making pg_rewind faster

John H <johnhyvr@gmail.com> — 2025-10-23T22:01:32Z

Hi,

On Thu, Oct 23, 2025 at 9:08 AM Srinath Reddy Sadipiralla
<srinath2133@gmail.com> wrote:
> On Thu, Oct 23, 2025 at 1:52 PM Michael Paquier <michael@paquier.xyz> wrote:
>>
>> FWIW, I am definitely not a fan of the test relying on the timestamp
>> of the files based on their retrieved meta-data stats, with the mtime.
>> I suspect complications on Windows.  Worse thing, there is a forced
>> sleep of 1s added to the test, to make sure that the timestamp of the
>> files we compare have a different number.  But do we really need that
>> anyway if we have the debug information with the file map printed in
>> it?
>
>
> thought it would act like an extra sanity check ,to prove the point that the
> pg_rewind is saying through the debug info that it has been really copied
> or skipped.
>

I don't feel too strongly about it. If we want to rely on checking debug log
that works for me. We don't do any in-depth checks that every file was
properly copied anyway for existing files.

> ...
> i guess this is what you want ,please let me know if it's otherwise,
> ...

That looks right to me.

>>
>> It seems to me that it would be good for the test to run some sanity
>> checks on the rewound standby, as well.  That would provide more
>> validation for the case of the "corrupted" segment that gets copied
>> anyway.
>
> ...
> These checks show that before the corrupt segment's size on
> rewound standby(target) was not the same as source but after
> rewind it was the same as source,please let me know if i am
> missing your point here.
>

What did you have in mind for additional sanity checks?
The existing test checks that when sizes are different the correct
branch is taken. For something more in-depth that requires comparing
data before and after the rewind with a "corrupted" segment that seems
complicated since the segment would have to somehow be applied to
writer but not replica prior to divergence.

--
John Hsu - Amazon Web Services

Re: Making pg_rewind faster

Michael Paquier <michael@paquier.xyz> — 2025-10-24T07:56:33Z

On Thu, Oct 23, 2025 at 03:01:32PM -0700, John H wrote:
> On Thu, Oct 23, 2025 at 9:08 AM Srinath Reddy Sadipiralla <srinath2133@gmail.com> wrote:
>> ...
>> i guess this is what you want ,please let me know if it's otherwise,
>> ...
> 
> That looks right to me.

Yes, that would be something among these lines, WAL segments being
treated as an exception when printing the file map to show that they
are not copied.

>> These checks show that before the corrupt segment's size on
>> rewound standby(target) was not the same as source but after
>> rewind it was the same as source,please let me know if i am
>> missing your point here.
> 
> What did you have in mind for additional sanity checks?
> The existing test checks that when sizes are different the correct
> branch is taken. For something more in-depth that requires comparing
> data before and after the rewind with a "corrupted" segment that seems
> complicated since the segment would have to somehow be applied to
> writer but not replica prior to divergence.

I was just wondering about some queries on the new standby once we are
done with the rewind.  But after sleeping on it, I'd be happy with
just the set of tests we have: the debug output checks, the size
checks pre and post-rewind, and no mtime.
--
Michael

Re: Making pg_rewind faster

Michael Paquier <michael@paquier.xyz> — 2025-10-24T08:38:37Z

On Fri, Oct 24, 2025 at 04:56:33PM +0900, Michael Paquier wrote:
> I was just wondering about some queries on the new standby once we are
> done with the rewind.  But after sleeping on it, I'd be happy with
> just the set of tests we have: the debug output checks, the size
> checks pre and post-rewind, and no mtime.

So, I am coming down to the attached (minus the commit message) after
more edits and adjustments.  It's Friday evening here, and I am not
planning to do more today and to take bets on the buildfarm, even if
the CI is OK.
--
Michael

Re: Making pg_rewind faster

Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-24T12:21:54Z

On Fri, Oct 24, 2025 at 2:08 PM Michael Paquier <michael@paquier.xyz> wrote:

> On Fri, Oct 24, 2025 at 04:56:33PM +0900, Michael Paquier wrote:
> > I was just wondering about some queries on the new standby once we are
> > done with the rewind.  But after sleeping on it, I'd be happy with
> > just the set of tests we have: the debug output checks, the size
> > checks pre and post-rewind, and no mtime.
>
> So, I am coming down to the attached (minus the commit message) after
> more edits and adjustments.  It's Friday evening here, and I am not
> planning to do more today and to take bets on the buildfarm, even if
> the CI is OK.
>

Thanks for the updated patch, I have reviewed and tested the
patch, except these minor refactors ,it LGTM.

diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 59672e66932..467fd97ebcf 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -728,7 +728,7 @@ decide_wal_file_action(const char *fname, XLogSegNo
last_common_segno,
        /*
         * Avoid copying files before the last common segment.
         *
-        * These files exist on the source and the target services, so they
should
+        * These files exist on the source and the target servers, so they
should
         * be identical and located strictly before the segment that
contains the
         * LSN where target and source servers have diverged.
         *
I think as we are not using mtime to show that the file has not been copied
and been skipped ,instead we are doing the same with the debug message
(qr/pg_wal\/$wal_seg_skipped \(NONE\)/,), so stat calculation of this WAL
segment can be removed.

diff --git a/src/bin/pg_rewind/t/011_wal_copy.pl b/src/bin/pg_rewind/t/
011_wal_copy.pl
index fb5b7104378..c4d7200d710 100644
--- a/src/bin/pg_rewind/t/011_wal_copy.pl
+++ b/src/bin/pg_rewind/t/011_wal_copy.pl
@@ -43,12 +43,6 @@ RewindTest::promote_standby;
 my $new_timeline_wal_seg = $node_standby->safe_psql('postgres',
        'SELECT pg_walfile_name(pg_current_wal_lsn())');

-# Get some stats info for the WAL file whose copy is skipped.
-my $wal_skipped_path =
-  $node_primary->data_dir . '/pg_wal/' . $wal_seg_skipped;
-my $wal_skipped_stat = stat($wal_skipped_path);
-defined($wal_skipped_stat) or die("unable to stat $wal_skipped_path");
-
 # Corrupt a WAL segment on target that has been generated before the
 # divergence point.  We will check that it is copied from the source.

*Test:*
I tried to demonstrate that with this patch, we are not copying WAL
segments before the divergence point by using wal_keep_size.

*Environment:*
A primary and standby physical streaming replication setup was used.
The TPC-H dbgen tool with a Scale Factor of 10 was used to generate
data, which in turn generated sufficient WAL to test with wal_keep_size =
20000 MB.
A failover and rewind were performed.

*Result:*

*Without Patch:*

SELECT
  pg_size_pretty(sum(size)) AS wal_size
FROM
  pg_ls_waldir();
 wal_size
----------
 19 GB
(1 row)

pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 4/DEACBAB0 on timeline 1
pg_rewind: rewinding from last common checkpoint at 4/C8250D28 on timeline 1
pg_rewind: reading source file list
pg_rewind: reading target file list
pg_rewind: reading WAL in target
*pg_rewind: need to copy 20449 MB (total source directory size is 32778 MB)*
20940602/20940602 kB (100%) copied
pg_rewind: creating backup label and updating control file
pg_rewind: syncing target data directory
pg_rewind: Done!

To show that only a small amount of WAL was generated between the last
common
checkpoint and the latest checkpoint on the source:

postgres=# select pg_size_pretty('4/DEACBB78'::pg_lsn -
'4/C8250D28'::pg_lsn);
 pg_size_pretty
----------------
 360 MB
(1 row)

so the 20.449 GB copied was primarily due to wal_keep_size
(19 GB of accumulated WAL) being copied, as previously non-data files were
copied in full.

*With Patch:*

SELECT
  pg_size_pretty(sum(size)) AS wal_size
FROM
  pg_ls_waldir();
 wal_size
----------
 19 GB
(1 row)

pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 4/B0F2FE78 on timeline 1
pg_rewind: rewinding from last common checkpoint at 4/B087E468 on timeline 1
pg_rewind: reading source file list
pg_rewind: reading target file list
pg_rewind: reading WAL in target
*pg_rewind: need to copy 212 MB (total source directory size is 32074 MB)*
*pg_rewind: skipped WAL 19168 MB* -> tweaked calculate_totals to get this,
just to give more info (not included in patch).
217467/217467 kB (100%) copied
pg_rewind: creating backup label and updating control file
pg_rewind: syncing target data directory
pg_rewind: Done!

postgres=# select pg_size_pretty('4/B0F30AB0'::pg_lsn -
'4/B087E468'::pg_lsn);
 pg_size_pretty
----------------
 6855 kB
(1 row)

This test shows that with the patch applied, only the necessary WAL segments
after the divergence point are copied, while previously accumulated WAL
segments (due to wal_keep_size) are correctly skipped.

-- 
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/

Re: Making pg_rewind faster

Michael Paquier <michael@paquier.xyz> — 2025-10-25T01:15:53Z

On Fri, Oct 24, 2025 at 05:51:54PM +0530, Srinath Reddy Sadipiralla wrote:
> -        * These files exist on the source and the target services, so they
> should
> +        * These files exist on the source and the target servers, so they
> should

Not sure what my fingers were doing here.

> I think as we are not using mtime to show that the file has not been copied
> and been skipped ,instead we are doing the same with the debug message
> (qr/pg_wal\/$wal_seg_skipped \(NONE\)/,), so stat calculation of this WAL
> segment can be removed.

Yes, removing this one makes sense.

And, to keep it short: applied.
--
Michael

Re: Making pg_rewind faster

Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-25T01:50:52Z

On Sat, Oct 25, 2025 at 6:46 AM Michael Paquier <michael@paquier.xyz> wrote:

> On Fri, Oct 24, 2025 at 05:51:54PM +0530, Srinath Reddy Sadipiralla wrote:
> > -        * These files exist on the source and the target services, so
> they
> > should
> > +        * These files exist on the source and the target servers, so
> they
> > should
>
> Not sure what my fingers were doing here.
>
> > I think as we are not using mtime to show that the file has not been
> copied
> > and been skipped ,instead we are doing the same with the debug message
> > (qr/pg_wal\/$wal_seg_skipped \(NONE\)/,), so stat calculation of this WAL
> > segment can be removed.
>
> Yes, removing this one makes sense.
>
> And, to keep it short: applied.
>

Thanks Michael.


-- 
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/

Re: Making pg_rewind faster

Michael Paquier <michael@paquier.xyz> — 2025-10-28T04:02:06Z

On Thu, Oct 23, 2025 at 08:40:14AM -0400, Robert Haas wrote:
> While I'm not against cross-checking against the control file, this
> sounds like an imaginary scenario to me. That is, it would only happen
> if somebody maliciously modified the contents of the data directory by
> hand with the express goal of breaking the tool. But we fundamentally
> cannot defend against a malicious user whose express goal is to break
> the tool, and I do not see any compelling reason to expend energy on
> it even in cases like this where we could theoretically detect it
> without much effort. If we go down that path, we'll end up not only
> complicating the code, but also obscuring our own goals: it will look
> like we've either done too much sanity checking (because we will have
> added checks that are unnecessary with a non-malicious user) or too
> little (because we will not have caught all the things a malicious
> user might do).

I was thinking about this argument over the weekend, and I am
wondering if we could not do better here to detect if a file should be
copied or not.  What if we included a checksum of each file if both
exist on the target and source, and just not copy them if the
checksums match?  You cannot do that for relation files when the
source is online, of course, but for files like the oldest segments
before the divergence point, that's better than checking the size,
still more expensive due to the cost of the checksum computation.
And there is a sha256() available at SQL level.

Just throwing one idea in the bucket of ideas.  That may not be worth
the extra cost here, of course, but attaching a checksum to
file_entry_t is not what I would qualify as an invasive change.
--
Michael

Re: Making pg_rewind faster

Robert Haas <robertmhaas@gmail.com> — 2025-10-28T14:57:26Z

On Tue, Oct 28, 2025 at 12:02 AM Michael Paquier <michael@paquier.xyz> wrote:
> I was thinking about this argument over the weekend, and I am
> wondering if we could not do better here to detect if a file should be
> copied or not.  What if we included a checksum of each file if both
> exist on the target and source, and just not copy them if the
> checksums match?

Well, that would require reading the entire file on both sides to
compute the checksum, which sounds pretty expensive. I mean, a copy
would only be a read on one side and a write on the other. Even
granting that writes are more expensive than reads, a read of both
sides would still be a substantial percentage of the total cost, I
think.

Also, I don't think we really want to reinvent a worse version of
rsync. If you want to use checksums or file timestamps to decide what
to copy, there are already good tools for that which probably handle
that task better than our code ever will. What we can bring to the
table is PG-specific logic, where we're able to reason about the
behavior of PG in a way that a general-purpose tool can't. That's why
for example we use the WAL to decide what data blocks need to be
copied, rather than checksums -- it's an optimization that rsync can't
do, and we can. The rule implemented here is similar: rsync can't know
that WAL from before the divergence point should be the same on both
sides, but we can.

Now, of course, if in a specific situation the assumptions on which
pg_rewind relies are not valid, e.g. because manual data directory
modification has occurred, then pg_rewind should not be used. And if
on the other hand we find some flaw that will keep pg_rewind from
delivering correct results even when nothing strange has happened,
then that's a bug or a design problem that we need to fix. But if we
just start second-guessing ourselves by adding overhead to protect
against can't-happen scenarios, we'll end up making pg_rewind useless.

-- 
Robert Haas
EDB: http://www.enterprisedb.com