Thread

  1. Re: pg_rewind does not rewind diverging timelines

    Mats Kindahl <mats.kindahl@gmail.com> — 2026-05-26T16:03:32Z

    Hi Japin,
    
    Thanks for reviewing the patch.
    
    On Tue, May 26, 2026 at 8:56 AM Japin Li <japinli@hotmail.com> wrote:
    
    >
    > Hi, Mats
    >
    > Thanks for updating the patch.
    >
    > On Mon, 25 May 2026 at 20:59, Mats Kindahl <mats.kindahl@gmail.com> wrote:
    > > Hi Japin,
    > >
    > > On Mon, May 25, 2026 at 7:21 AM Japin Li <japinli@hotmail.com> wrote:
    > >
    > >  Hi, Mats
    > >
    > >  On Sun, 24 May 2026 at 20:30, Mats Kindahl <mats.kindahl@gmail.com>
    > wrote:
    > >  > On Fri, May 22, 2026 at 12:09 AM surya poondla <
    > suryapoondla4@gmail.com> wrote:
    > >  >
    > >  >  Hi Mats,
    > >  >
    > >  >  Thanks for picking this up -- the scenario is a real one and I think
    > the UUID-tagging approach is a clean way to
    > >  >  solve it. v2 applies and builds without trouble, and the core
    > algorithm reads well to me.
    > >  >  I have a handful of observations that I'd love your thoughts.
    > >  >
    > >  > Hi Surya,
    > >  >
    > >  > Thank you for the review. It is a quite rare scenario, but it is real
    > and the fix is simple.
    > >  >
    > >  >  Regarding Correctness I have the below thoughts
    > >  >
    > >  >  1. UUIDv7 timestamp epoch.
    > >  >       In StartupXLOG():
    > >  >           TimestampTz now = GetCurrentTimestamp();
    > >  >           generate_uuidv7_r(&uuid_buf, (uint64)(now / 1000),
    > >  >                                        (uint32)(now % 1000) * 1000);
    > >  >
    > >  >  I think there might be a small mismatch here: GetCurrentTimestamp()
    > returns microseconds since the Postgres epoch
    > >  >  (2000-01-01),
    > >  >  whereas generate_uuidv7_r describes its first argument as
    > milliseconds since the Unix epoch.
    > >  >  As written that 30-year offset would land in the UUID's timestamp
    > field, so the resulting UUID wouldn't be a
    > >  >  conformant UUIDv7 and wouldn't
    > >  >  time-order against UUIDv7s generated through the SQL functions.
    > >  >
    > >  >
    > >  >
    > >  >  Uniqueness is preserved either way, so the rewind logic still works
    > as intended but it seemed worth flagging.
    > >  >
    > >  >  I see conversion that's used elsewhere as:
    > >  >  us = ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE)
    > >  >                     * SECS_PER_DAY * USECS_PER_SEC;
    > >  >
    > >  >  Or, since promotion isn't on a hot path, gettimeofday() / time(NULL)
    > directly would also be fine.
    > >  >
    > >  > Yes, the intention was to use a proper timestamp to allow debugging
    > servers if necessary. Switched to gettimeofday
    > >  () and
    > >  > used 0 for sub-ms since this is not going to be critical. (We could
    > use ns here as well, but that would only solve
    > >  a race
    > >  > if you have two servers being promoted in the same ms, which I find
    > unlikely, and there is a random number added
    > >  for that
    > >  > situation.)
    > >  >
    > >  >  2. EOR-record path, the intent is unclear.
    > >  >
    > >  >  The comment above generate_uuidv7_r() at says:
    > >  >
    > >  >  "The same UUID is written into the history file and later into the
    > XLOG_END_OF_RECOVERY record so that pg_rewind
    > >  can
    > >  >  distinguish two servers..."
    > >  >
    > >  >  But from what I can see only the history-file part actually lands.
    > >  >  xl_end_of_recovery is unchanged, CreateEndOfRecoveryRecord() doesn't
    > add the UUID, and XLogCtl->ThisTimeLineUUID
    > >  is
    > >  >  written under info_lck without a
    > >  >  reader (I couldn't grep it).
    > >  >
    > >  >  The xlog_redo() memset() + Min(rec_len, sizeof(...)) change reads
    > like preparation for an EOR-struct extension
    > >  that
    > >  >  ended up not being part of the patch.
    > >  >
    > >  >  Was the EOR-record piece something you intended to keep for a
    > follow-up, or has it been superseded by the
    > >  >  history-file approach?
    > >  >
    > >  > No, the EOR changes are not needed for the promotion, contrary to
    > what I originally thought. Cleaned up the comment
    > >  and
    > >  > the code and removed all traces of changes to the EOR (I hope).
    > >  >
    > >  >
    > >  >
    > >  >  3. Malformed UUID handling in readTimeLineHistory().
    > >  >
    > >  >       The optional field-4 path is:
    > >  >
    > >  >           if (nfields == 4 && strlen(uuid_str) == UUID_STR_LEN)
    > >  >           {
    > >  >               Datum datum = DirectFunctionCall1(uuid_in,
    > >  >
    >  CStringGetDatum(uuid_str));
    > >  >               ...
    > >  >           }
    > >  >
    > >  >  uuid_in() raises ereport(ERROR) on a malformed input, while the
    > surrounding syntax-error paths in
    > >  readTimeLineHistory
    > >  >  () use FATAL deliberately.
    > >  >  In practice an ERROR during startup ends up being fatal too, so this
    > isn't strictly a bug but it would be nicer to
    > >  >  stay consistent.
    > >  >
    > >  > Agree. I added code to capture the error and raise a FATAL instead
    > (with the error message from the uuid_in, in
    > >  case it
    > >  > is modified it makes sense to show this).
    > >  >
    > >  >  Regarding the Tests I have the following thoughts
    > >  >
    > >  >  The two new cases are nice, a few extensions that I think would
    > strengthen them:
    > >  >  1. A mixed-version case where one side has a zero UUID. That's the
    > path we're claiming is graceful, but nothing
    > >  >  currently exercises it
    > >  >
    > >  > Yes, that should work regardless of whether the source or the target
    > has the zero UUID.
    > >  >
    > >  > I realized one thing: if two timelines have identical TLI but one has
    > zero UUID and one has not, it seems they
    > >  could not
    > >  > come from the same promotion (one promotion happened on an old server
    > and the other one on a new server), that is,
    > >  they
    > >  > should be treated as different. Does that make sense? I made the
    > necessary changes in the attached patches for
    > >  testing.
    > >  > Please have a look.
    > >  >
    > >  >  2. A deeper-divergence case (e.g. TLI1->2->3 vs TLI1->2->3') so that
    > findCommonAncestorTimeline's loop walks past
    > >  >  matching entries
    > >  >       before hitting the mismatch. The 0002 test puts the divergence
    > at depth 1.
    > >  >
    > >  > I was unsure if this test was necessary or interesting, hence a
    > separate commit. Since you thought it was useful,
    > >  it's
    > >  > now rolled into the patch and I extended the tests with the scenarios
    > you suggested.
    > >  >
    > >  > I also did some refactorings of the tests to avoid duplication. More
    > below.
    > >  >
    > >  >  3. A small assertion against the on-disk 00000002.history contents,
    > to pin down the file format.
    > >  >  4. On 0002 the dependency on restore_command pointing at node_x's
    > pg_wal is the kind of thing that tends to break
    > >  >  under
    > >  >       environment changes. A CHECKPOINT on node_x before the backup,
    > or wal_keep_size as in 0001, would let the
    > >  test
    > >  >  stand on its own.
    > >  >
    > >  > Good point.
    > >  >
    > >  > I refactored the code to avoid some duplication and make the test
    > flow self-explanatory and as part of that I set
    > >  the
    > >  > wal_keep_size for all nodes.
    > >  >
    > >  > In the process I noticed that many of the functions in RewindTest.pm
    > do the same job as the primitives I wrote, but
    > >  have
    > >  > hard-coded variable names. I could rewrite them to take parameters,
    > but that would be quite a big patch to add
    > >  additional
    > >  > changes to each call site, so I did not do that and rather added
    > small wrappers specific for the tests in
    > >  > 005_same_timeline.pl⚠️⚠️.
    > >  >
    > >  > Attached a new version of the now single patch.
    > >  >
    > >  >  I'm happy to keep reviewing/contributing, thanks again for working
    > on it.
    > >  >
    > >  > Thank you for reviewing it.
    > >
    > >  Thank you for your work.  I have one comment.
    > >
    > >  +       a = &tlh->source[tlh->sourceNentries - 2].tluuid;
    > >  +       b = &tlh->target[tlh->targetNentries - 2].tluuid;
    > >  +
    > >  +       if (memcmp(a, &zero, UUID_LEN) == 0 && memcmp(b, &zero,
    > UUID_LEN) == 0)
    > >  +               return true;
    > >  +
    > >  +       return memcmp(a, b, UUID_LEN) == 0;
    > >
    > >  Since we already have matchingTimelineUUID(), the above code can be
    > simplified
    > >  using it.
    > >
    > > Thank you for the review. I switched to using the matchingTimelineUUID()
    > for this part of the code and made some other
    > > minor improvements as well.
    >
    > Here are some comments on v4.
    >
    > 1.
    > +/*
    > + * Timeline histories for both clusters, populated by timelines_match().
    > + */
    >
    > I don't see a timelines_match() function.  Does this refer to
    > matchAndFetchTimelines()?
    >
    
    Correct. Updated.
    
    
    >
    > 2.
    > +typedef struct TimelineHistoriesData
    > +{
    > +       TimeLineHistoryEntry *source,
    > +                          *target;
    > +       int                     sourceNentries,
    > +                               targetNentries;
    > +}                      TimelineHistoriesData;
    >
    > I'd prefer to use TimeLineHistoriesData to stay consistent with
    > TimeLineHistoryEntry.  Anyway I'm not instant on it.
    >
    
    Makes sense to be consistent. Updated.
    
    
    >
    > 3.
    > +typedef TimelineHistoriesData * TimelineHistories;
    >
    > The space between * and TimelineHistories is unnecessary — see
    > StringInfoData and other typedefs.
    >
    
    My mistake. FIxed.
    
    
    > 4.
    > +# node_x and node_b both start from the same TLI 1 baseline.
    > +my ($node_x, $node_b2) =
    > +  setup_standbys_from_origin($node_origin2, 'node_x', 'node_b2');
    >
    > There appears to be a typo in the comment.  The node_b should be node_b2.
    >
    
    Right. Fixed.
    
    
    >
    >
    > Everything else looks good.  Thank you again for updating the patch!
    >
    
    Thank you again for reviewing the patch. :)
    
    Attached a new version of the patch with the changes you suggested.
    
    -- 
    Best wishes,
    Mats Kindahl, Multigres Developer, Supabase