Thread

  1. Re: pg_rewind does not rewind diverging timelines

    Mats Kindahl <mats.kindahl@gmail.com> — 2026-05-01T16:06:20Z

    On Thu, Apr 30, 2026 at 10:19 AM Mats Kindahl <mats.kindahl@gmail.com>
    wrote:
    
    > Hi all,
    >
    > I have been playing around with various promotion scenarios to check
    > whether writes can be lost in more complicated setups involving promotions
    > and synchronous_standby_names. To do that, I created a TLA+ model of
    > streaming replication with promotions and checked it with TLC. You can
    > find the models at [1] if you're interested.
    >
    > TLC found one scenario that I assume is known but that does not seem to
    > be fixed. It is a relatively rare case, but since the fix is quite easy,
    > I thought I'd share it with you and get feedback.
    >
    > The scenario can occur if you're unlucky and have more than one crash when
    > promoting standbys to be primaries, and goes like this:
    >
    > You have three servers, S1, S2, and S3. S1 is primary and S2 and S3 are
    > standbys. All are on timeline (TLI) 1.
    >
    > 1. S1 crashes
    > 2. S1 recovers and starts promotion. It writes XLOG_END_OF_RECOVERY (EOR)
    > for TLI 2 to the WAL.
    > 3. S1 manages to write some records W1 to the WAL.
    > 4. Before the EOR is replicated to any standby, S1 crashes again. It is
    > now on TLI 2 and has some changes that are not elsewhere.
    > 5. S2 is promoted. It writes an EOR for TLI 2 (since it is not aware of
    > any other timeline) to the WAL.
    > 6. S2 writes some records W2 to the WAL. S1 now holds one version of
    > TLI 2 (call it TLI 2.1) and S2 holds another (TLI 2.2).
    > 7. S1 recovers and wants to join as a standby. You run pg_rewind to get
    > rid of the extra data, but since S2 is also on TLI 2, pg_rewind will
    > happily assume that both are on the same timeline.
    > 8. S1 is now a standby but still has the extra records W1 both in the
    > WAL and in the database.
    >
    > The fix (see attached draft) is quite simple: add a UUID to the EOR and to
    > the history file. When comparing timelines, don't check only the TLI; also
    > check the UUID. If they do not both match, go back further until you find
    > a timeline where both the TLI and the timeline UUID match, and then do the
    > usual fandango to find the good LSN to rewind to.
    >
    > [1]: https://github.com/mkindahl/tla-postgres
    >
    
    Here is an updated version of the patch. It turns out that it is not
    necessary to extend the XLOG_END_OF_RECOVERY record with the UUID; adding
    it to the history files is enough. The scenario is still the same, though,
    and can cause servers to diverge, possibly silently. I have an additional
    test case that uses a divergence going back three promotions.
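
To make the comparison concrete, here is a minimal sketch in Python. It is not the patch itself and does not mirror pg_rewind's actual code: the TimelineEntry type, its field names, the UUID strings, and the LSN values are all hypothetical, and it assumes both servers switched to TLI 2 at the same LSN. It only illustrates why matching on the TLI alone treats the two histories as identical, while also matching on a per-timeline UUID finds the real common ancestor.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TimelineEntry:
    """One line of a timeline history (hypothetical layout)."""
    tli: int        # timeline ID
    uuid: str       # assumed per-timeline UUID stored in the history file
    begin_lsn: int  # LSN at which this timeline started


def last_common_timeline(
    a: Sequence[TimelineEntry],
    b: Sequence[TimelineEntry],
    check_uuid: bool = True,
) -> Optional[int]:
    """Return the index of the last entry shared by both histories.

    Returns None if the histories have no common ancestor at all.
    """
    common = None
    for i, (x, y) in enumerate(zip(a, b)):
        same = x.tli == y.tli and x.begin_lsn == y.begin_lsn
        if check_uuid:
            same = same and x.uuid == y.uuid
        if not same:
            break
        common = i
    return common


# S1 and S2 both created "TLI 2" independently at the same LSN.
s1 = [TimelineEntry(1, "uuid-a", 0), TimelineEntry(2, "uuid-b", 100)]
s2 = [TimelineEntry(1, "uuid-a", 0), TimelineEntry(2, "uuid-c", 100)]

tli_only = last_common_timeline(s1, s2, check_uuid=False)
with_uuid = last_common_timeline(s1, s2, check_uuid=True)
```

With TLI-only matching the last common entry is the TLI 2 entry, so nothing looks like it needs rewinding. With the UUID check the last common entry is TLI 1, so the rewind point has to be searched for from the end of TLI 1.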
    --
    Best wishes,
    Mats Kindahl, Multigres Developer, Supabase