Re: pg_rewind does not rewind diverging timelines

Mats Kindahl <mats.kindahl@gmail.com>
View thread
From: Mats Kindahl <mats.kindahl@gmail.com>
To: Japin Li <japinli@hotmail.com>
Cc: surya poondla <suryapoondla4@gmail.com>, pgsql-hackers@lists.postgresql.org
Date: 2026-05-26T16:03:32Z
Lists: pgsql-hackers
Attachments

v5.0001-pg_rewind-use-UUIDs-to-detect-independent-same-TLI-p.patch (text/x-patch)
Hi Japin,

Thanks for reviewing the patch.

On Tue, May 26, 2026 at 8:56 AM Japin Li <japinli@hotmail.com> wrote:

>
> Hi, Mats
>
> Thanks for updating the patch.
>
> On Mon, 25 May 2026 at 20:59, Mats Kindahl <mats.kindahl@gmail.com> wrote:
> > Hi Japin,
> >
> > On Mon, May 25, 2026 at 7:21 AM Japin Li <japinli@hotmail.com> wrote:
> >
> >  Hi, Mats
> >
> >  On Sun, 24 May 2026 at 20:30, Mats Kindahl <mats.kindahl@gmail.com>
> wrote:
> >  > On Fri, May 22, 2026 at 12:09 AM surya poondla <
> suryapoondla4@gmail.com> wrote:
> >  >
> >  >  Hi Mats,
> >  >
> >  >  Thanks for picking this up -- the scenario is a real one and I think
> the UUID-tagging approach is a clean way to
> >  >  solve it. v2 applies and builds without trouble, and the core
> algorithm reads well to me.
> >  >  I have a handful of observations that I'd love your thoughts.
> >  >
> >  > Hi Surya,
> >  >
> >  > Thank you for the review. It is a quite rare scenario, but it is real
> and the fix is simple.
> >  >
> >  >  Regarding Correctness I have the below thoughts
> >  >
> >  >  1. UUIDv7 timestamp epoch.
> >  >       In StartupXLOG():
> >  >           TimestampTz now = GetCurrentTimestamp();
> >  >           generate_uuidv7_r(&uuid_buf, (uint64)(now / 1000),
> >  >                                        (uint32)(now % 1000) * 1000);
> >  >
> >  >  I think there might be a small mismatch here: GetCurrentTimestamp()
> returns microseconds since the Postgres epoch
> >  >  (2000-01-01),
> >  >  whereas generate_uuidv7_r describes its first argument as
> milliseconds since the Unix epoch.
> >  >  As written that 30-year offset would land in the UUID's timestamp
> field, so the resulting UUID wouldn't be a
> >  >  conformant UUIDv7 and wouldn't
> >  >  time-order against UUIDv7s generated through the SQL functions.
> >  >
> >  >
> >  >
> >  >  Uniqueness is preserved either way, so the rewind logic still works
> as intended but it seemed worth flagging.
> >  >
> >  >  I see conversion that's used elsewhere as:
> >  >  us = ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE)
> >  >                     * SECS_PER_DAY * USECS_PER_SEC;
> >  >
> >  >  Or, since promotion isn't on a hot path, gettimeofday() / time(NULL)
> directly would also be fine.
> >  >
> >  > Yes, the intention was to use a proper timestamp to allow debugging
> servers if necessary. Switched to gettimeofday
> >  () and
> >  > used 0 for sub-ms since this is not going to be critical. (We could
> use ns here as well, but that would only solve
> >  a race
> >  > if you have two servers being promoted in the same ms, which I find
> unlikely, and there is a random number added
> >  for that
> >  > situation.)
> >  >
> >  >  2. EOR-record path, the intent is unclear.
> >  >
> >  >  The comment above generate_uuidv7_r() at says:
> >  >
> >  >  "The same UUID is written into the history file and later into the
> XLOG_END_OF_RECOVERY record so that pg_rewind
> >  can
> >  >  distinguish two servers..."
> >  >
> >  >  But from what I can see only the history-file part actually lands.
> >  >  xl_end_of_recovery is unchanged, CreateEndOfRecoveryRecord() doesn't
> add the UUID, and XLogCtl->ThisTimeLineUUID
> >  is
> >  >  written under info_lck without a
> >  >  reader (I couldn't grep it).
> >  >
> >  >  The xlog_redo() memset() + Min(rec_len, sizeof(...)) change reads
> like preparation for an EOR-struct extension
> >  that
> >  >  ended up not being part of the patch.
> >  >
> >  >  Was the EOR-record piece something you intended to keep for a
> follow-up, or has it been superseded by the
> >  >  history-file approach?
> >  >
> >  > No, the EOR changes are not needed for the promotion, contrary to
> what I originally thought. Cleaned up the comment
> >  and
> >  > the code and removed all traces of changes to the EOR (I hope).
> >  >
> >  >
> >  >
> >  >  3. Malformed UUID handling in readTimeLineHistory().
> >  >
> >  >       The optional field-4 path is:
> >  >
> >  >           if (nfields == 4 && strlen(uuid_str) == UUID_STR_LEN)
> >  >           {
> >  >               Datum datum = DirectFunctionCall1(uuid_in,
> >  >
>  CStringGetDatum(uuid_str));
> >  >               ...
> >  >           }
> >  >
> >  >  uuid_in() raises ereport(ERROR) on a malformed input, while the
> surrounding syntax-error paths in
> >  readTimeLineHistory
> >  >  () use FATAL deliberately.
> >  >  In practice an ERROR during startup ends up being fatal too, so this
> isn't strictly a bug but it would be nicer to
> >  >  stay consistent.
> >  >
> >  > Agree. I added code to capture the error and raise a FATAL instead
> (with the error message from the uuid_in, in
> >  case it
> >  > is modified it makes sense to show this).
> >  >
> >  >  Regarding the Tests I have the following thoughts
> >  >
> >  >  The two new cases are nice, a few extensions that I think would
> strengthen them:
> >  >  1. A mixed-version case where one side has a zero UUID. That's the
> path we're claiming is graceful, but nothing
> >  >  currently exercises it
> >  >
> >  > Yes, that should work regardless of whether the source or the target
> has the zero UUID.
> >  >
> >  > I realized one thing: if two timelines have identical TLI but one has
> zero UUID and one has not, it seems they
> >  could not
> >  > come from the same promotion (one promotion happened on an old server
> and the other one on a new server), that is,
> >  they
> >  > should be treated as different. Does that make sense? I made the
> necessary changes in the attached patches for
> >  testing.
> >  > Please have a look.
> >  >
> >  >  2. A deeper-divergence case (e.g. TLI1->2->3 vs TLI1->2->3') so that
> findCommonAncestorTimeline's loop walks past
> >  >  matching entries
> >  >       before hitting the mismatch. The 0002 test puts the divergence
> at depth 1.
> >  >
> >  > I was unsure if this test was necessary or interesting, hence a
> separate commit. Since you thought it was useful,
> >  it's
> >  > now rolled into the patch and I extended the tests with the scenarios
> you suggested.
> >  >
> >  > I also did some refactorings of the tests to avoid duplication. More
> below.
> >  >
> >  >  3. A small assertion against the on-disk 00000002.history contents,
> to pin down the file format.
> >  >  4. On 0002 the dependency on restore_command pointing at node_x's
> pg_wal is the kind of thing that tends to break
> >  >  under
> >  >       environment changes. A CHECKPOINT on node_x before the backup,
> or wal_keep_size as in 0001, would let the
> >  test
> >  >  stand on its own.
> >  >
> >  > Good point.
> >  >
> >  > I refactored the code to avoid some duplication and make the test
> flow self-explanatory and as part of that I set
> >  the
> >  > wal_keep_size for all nodes.
> >  >
> >  > In the process I noticed that many of the functions in RewindTest.pm
> do the same job as the primitives I wrote, but
> >  have
> >  > hard-coded variable names. I could rewrite them to take parameters,
> but that would be quite a big patch to add
> >  additional
> >  > changes to each call site, so I did not do that and rather added
> small wrappers specific for the tests in
> >  > 005_same_timeline.pl⚠️⚠️.
> >  >
> >  > Attached a new version of the now single patch.
> >  >
> >  >  I'm happy to keep reviewing/contributing, thanks again for working
> on it.
> >  >
> >  > Thank you for reviewing it.
> >
> >  Thank you for your work.  I have one comment.
> >
> >  +       a = &tlh->source[tlh->sourceNentries - 2].tluuid;
> >  +       b = &tlh->target[tlh->targetNentries - 2].tluuid;
> >  +
> >  +       if (memcmp(a, &zero, UUID_LEN) == 0 && memcmp(b, &zero,
> UUID_LEN) == 0)
> >  +               return true;
> >  +
> >  +       return memcmp(a, b, UUID_LEN) == 0;
> >
> >  Since we already have matchingTimelineUUID(), the above code can be
> simplified
> >  using it.
> >
> > Thank you for the review. I switched to using the matchingTimelineUUID()
> for this part of the code and made some other
> > minor improvements as well.
>
> Here are some comments on v4.
>
> 1.
> +/*
> + * Timeline histories for both clusters, populated by timelines_match().
> + */
>
> I don't see a timelines_match() function.  Does this refer to
> matchAndFetchTimelines()?
>

Correct. Updated.


>
> 2.
> +typedef struct TimelineHistoriesData
> +{
> +       TimeLineHistoryEntry *source,
> +                          *target;
> +       int                     sourceNentries,
> +                               targetNentries;
> +}                      TimelineHistoriesData;
>
> I'd prefer to use TimeLineHistoriesData to stay consistent with
> TimeLineHistoryEntry.  Anyway I'm not instant on it.
>

Makes sense to be consistent. Updated.


>
> 3.
> +typedef TimelineHistoriesData * TimelineHistories;
>
> The space between * and TimelineHistories is unnecessary — see
> StringInfoData and other typedefs.
>

My mistake. FIxed.


> 4.
> +# node_x and node_b both start from the same TLI 1 baseline.
> +my ($node_x, $node_b2) =
> +  setup_standbys_from_origin($node_origin2, 'node_x', 'node_b2');
>
> There appears to be a typo in the comment.  The node_b should be node_b2.
>

Right. Fixed.


>
>
> Everything else looks good.  Thank you again for updating the patch!
>

Thank you again for reviewing the patch. :)

Attached a new version of the patch with the changes you suggested.

-- 
Best wishes,
Mats Kindahl, Multigres Developer, Supabase