Thread

  1. Re: Timeline switching with partial WAL records can break replica recovery

    Alena Vinter <dlaaren8@gmail.com> — 2025-09-10T09:07:35Z

    Hi!
    
    I've noticed an issue with pg_rewind caused by my patches.
    
    Here is the pg_rewind output that demonstrates the issue:
    pg_rewind: Source timeline history:
    pg_rewind: 1: 0/00000000 - 0/03002048
    pg_rewind: 2: 0/03002048 - 0/00000000
    pg_rewind: Target timeline history:
    pg_rewind: 1: 0/00000000 - 0/00000000
    pg_rewind: servers diverged at WAL location 0/03002048 on timeline 1
    pg_rewind: error: could not find previous WAL record at 0/03002048: invalid
    record length at 0/03002048: expected at least 24, got 0
    
    When the common timeline ends with an overwritten contrecord, the
    divergence point may not coincide with the start of a valid WAL record on
    the target. pg_rewind then fails to read a record at that position, which
    makes the rewind impossible.
    To handle this case, I suggest that, when the common timeline is
    unfinished on the target, we search for the checkpoint preceding the
    divergence point starting from the target's last checkpoint rather than
    from the divergence point itself. This ensures we always begin from a
    known-valid position in WAL.
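    
    The idea can be sketched with a toy model in Python. This is only an
    illustration and not pg_rewind's actual code: Record, search_start,
    find_checkpoint_before and the example LSNs other than the divergence
    point are all hypothetical, and WAL is reduced to a map of record start
    positions with back-links.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical, simplified WAL record (not a PostgreSQL structure)."""
    lsn: int                 # start position of this record
    prev: Optional[int]      # start of the previous record, None for the first
    is_checkpoint: bool


def search_start(divergence: int, last_checkpoint: int,
                 timeline_finished_on_target: bool) -> int:
    """Pick where the backward search for a checkpoint begins.

    If the common timeline is unfinished on the target, the divergence
    point may fall in the middle of a record, so start from the target's
    last checkpoint, which is known to be a valid record start.
    """
    if timeline_finished_on_target:
        return divergence
    return last_checkpoint


def find_checkpoint_before(wal: Dict[int, Record], start: int,
                           divergence: int) -> int:
    """Walk back-links from `start` to a checkpoint that precedes `divergence`."""
    if start not in wal:
        raise ValueError(f"no valid record starts at {start:#x}")
    pos: Optional[int] = start
    while pos is not None:
        rec = wal[pos]
        if rec.is_checkpoint and rec.lsn < divergence:
            return rec.lsn
        pos = rec.prev
    raise LookupError("no checkpoint precedes the divergence point")


# Made-up target WAL: the record at 0x03001F00 spans the divergence point,
# so no record starts at 0x03002048.
_records = [
    Record(0x03000028, None, True),
    Record(0x03001000, 0x03000028, False),
    Record(0x03001F00, 0x03001000, False),
    Record(0x03002100, 0x03001F00, False),
    Record(0x03002200, 0x03002100, True),   # last checkpoint on the target
]
wal = {r.lsn: r for r in _records}

DIVERGENCE = 0x03002048
LAST_CHECKPOINT = 0x03002200

start = search_start(DIVERGENCE, LAST_CHECKPOINT,
                     timeline_finished_on_target=False)
checkpoint = find_checkpoint_before(wal, start, DIVERGENCE)
print(f"search starts at {start:#x}, checkpoint found at {checkpoint:#x}")
```

    In this model, starting the search at DIVERGENCE itself raises
    ValueError because no record begins there, while starting at
    LAST_CHECKPOINT walks back over the later checkpoint and stops at the
    one that precedes the divergence point.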
    
    I'd appreciate any feedback!
    
    Best Regards,
    Alyona Vinter