Re: Timeline switching with partial WAL records can break replica recovery

Alena Vinter <dlaaren8@gmail.com>

From: Alyona Vinter <dlaaren8@gmail.com>
To: Nataliia <k.natalissa@gmail.com>
Cc: pgsql-hackers@lists.postgresql.org
Date: 2025-09-10T09:07:35Z
Lists: pgsql-hackers

Hi!

I've noticed an issue with pg_rewind caused by my patches.

Here is the pg_rewind output that demonstrates it:
pg_rewind: Source timeline history:
pg_rewind: 1: 0/00000000 - 0/03002048
pg_rewind: 2: 0/03002048 - 0/00000000
pg_rewind: Target timeline history:
pg_rewind: 1: 0/00000000 - 0/00000000
pg_rewind: servers diverged at WAL location 0/03002048 on timeline 1
pg_rewind: error: could not find previous WAL record at 0/03002048: invalid
record length at 0/03002048: expected at least 24, got 0

When the common timeline ends with an overwritten contrecord, the
divergence point may not be the start of a valid WAL record on the target.
In that case pg_rewind fails with the error above and the rewind becomes
impossible.

To handle this case, I suggest that when the common timeline is unfinished
on the target, pg_rewind search for the checkpoint preceding the divergence
point starting from the target's last checkpoint rather than from the
divergence point itself. This ensures we always begin reading from a
known-valid position in WAL.
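
A rough sketch of the proposed search, written as a toy Python model rather
than the actual pg_rewind C code. The Record type, the function names, and
the WAL layout are all invented for illustration; only the divergence LSN
0/03002048 comes from the log above.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Record:
    """Toy stand-in for a WAL record header (hypothetical, not XLogRecord)."""
    prev: int                    # start LSN of the previous record, 0 if none
    is_checkpoint: bool = False


def parse_lsn(text: str) -> int:
    """Convert the textual 'hi/lo' hex form of an LSN into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def find_checkpoint_before(wal: Dict[int, Record], start: int,
                           divergence: int) -> int:
    """Walk backward from `start` along prev links and return the newest
    checkpoint whose start LSN is below `divergence`."""
    lsn = start
    while lsn != 0:
        rec = wal.get(lsn)
        if rec is None:
            # Models the failure in the log: no valid record starts here.
            raise LookupError(f"no valid WAL record at {lsn:#x}")
        if rec.is_checkpoint and lsn < divergence:
            return lsn
        lsn = rec.prev
    raise LookupError("no checkpoint found before the divergence point")


# Invented target WAL: the divergence point is not the start of any record.
DIVERGENCE = parse_lsn("0/03002048")
TARGET_WAL = {
    0x03000028: Record(prev=0, is_checkpoint=True),
    0x03001F00: Record(prev=0x03000028),
    0x03002100: Record(prev=0x03001F00, is_checkpoint=True),
}
LAST_CHECKPOINT = 0x03002100     # assumed to come from the target's control file

# Starting at the target's last checkpoint only ever visits valid records.
found = find_checkpoint_before(TARGET_WAL, LAST_CHECKPOINT, DIVERGENCE)
```

In this model, starting the walk at DIVERGENCE raises immediately because no
record begins there, while starting at LAST_CHECKPOINT follows valid prev
links and skips any checkpoint located at or after the divergence point.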

I'd appreciate any feedback!

Best Regards,
Alyona Vinter