Thread
-
Re: pg_rewind does not rewind diverging timelines
Mats Kindahl <mats.kindahl@gmail.com> — 2026-05-26T16:03:32Z
Hi Japin, Thanks for reviewing the patch. On Tue, May 26, 2026 at 8:56 AM Japin Li <japinli@hotmail.com> wrote: > > Hi, Mats > > Thanks for updating the patch. > > On Mon, 25 May 2026 at 20:59, Mats Kindahl <mats.kindahl@gmail.com> wrote: > > Hi Japin, > > > > On Mon, May 25, 2026 at 7:21 AM Japin Li <japinli@hotmail.com> wrote: > > > > Hi, Mats > > > > On Sun, 24 May 2026 at 20:30, Mats Kindahl <mats.kindahl@gmail.com> > wrote: > > > On Fri, May 22, 2026 at 12:09 AM surya poondla < > suryapoondla4@gmail.com> wrote: > > > > > > Hi Mats, > > > > > > Thanks for picking this up -- the scenario is a real one and I think > the UUID-tagging approach is a clean way to > > > solve it. v2 applies and builds without trouble, and the core > algorithm reads well to me. > > > I have a handful of observations that I'd love your thoughts. > > > > > > Hi Surya, > > > > > > Thank you for the review. It is a quite rare scenario, but it is real > and the fix is simple. > > > > > > Regarding Correctness I have the below thoughts > > > > > > 1. UUIDv7 timestamp epoch. > > > In StartupXLOG(): > > > TimestampTz now = GetCurrentTimestamp(); > > > generate_uuidv7_r(&uuid_buf, (uint64)(now / 1000), > > > (uint32)(now % 1000) * 1000); > > > > > > I think there might be a small mismatch here: GetCurrentTimestamp() > returns microseconds since the Postgres epoch > > > (2000-01-01), > > > whereas generate_uuidv7_r describes its first argument as > milliseconds since the Unix epoch. > > > As written that 30-year offset would land in the UUID's timestamp > field, so the resulting UUID wouldn't be a > > > conformant UUIDv7 and wouldn't > > > time-order against UUIDv7s generated through the SQL functions. > > > > > > > > > > > > Uniqueness is preserved either way, so the rewind logic still works > as intended but it seemed worth flagging. > > > > > > I see conversion that's used elsewhere as: > > > us = ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) > > > * SECS_PER_DAY * USECS_PER_SEC; > > > > > > Or, since promotion isn't on a hot path, gettimeofday() / time(NULL) > directly would also be fine. > > > > > > Yes, the intention was to use a proper timestamp to allow debugging > servers if necessary. Switched to gettimeofday > > () and > > > used 0 for sub-ms since this is not going to be critical. (We could > use ns here as well, but that would only solve > > a race > > > if you have two servers being promoted in the same ms, which I find > unlikely, and there is a random number added > > for that > > > situation.) > > > > > > 2. EOR-record path, the intent is unclear. > > > > > > The comment above generate_uuidv7_r() at says: > > > > > > "The same UUID is written into the history file and later into the > XLOG_END_OF_RECOVERY record so that pg_rewind > > can > > > distinguish two servers..." > > > > > > But from what I can see only the history-file part actually lands. > > > xl_end_of_recovery is unchanged, CreateEndOfRecoveryRecord() doesn't > add the UUID, and XLogCtl->ThisTimeLineUUID > > is > > > written under info_lck without a > > > reader (I couldn't grep it). > > > > > > The xlog_redo() memset() + Min(rec_len, sizeof(...)) change reads > like preparation for an EOR-struct extension > > that > > > ended up not being part of the patch. > > > > > > Was the EOR-record piece something you intended to keep for a > follow-up, or has it been superseded by the > > > history-file approach? > > > > > > No, the EOR changes are not needed for the promotion, contrary to > what I originally thought. Cleaned up the comment > > and > > > the code and removed all traces of changes to the EOR (I hope). > > > > > > > > > > > > 3. Malformed UUID handling in readTimeLineHistory(). > > > > > > The optional field-4 path is: > > > > > > if (nfields == 4 && strlen(uuid_str) == UUID_STR_LEN) > > > { > > > Datum datum = DirectFunctionCall1(uuid_in, > > > > CStringGetDatum(uuid_str)); > > > ... > > > } > > > > > > uuid_in() raises ereport(ERROR) on a malformed input, while the > surrounding syntax-error paths in > > readTimeLineHistory > > > () use FATAL deliberately. > > > In practice an ERROR during startup ends up being fatal too, so this > isn't strictly a bug but it would be nicer to > > > stay consistent. > > > > > > Agree. I added code to capture the error and raise a FATAL instead > (with the error message from the uuid_in, in > > case it > > > is modified it makes sense to show this). > > > > > > Regarding the Tests I have the following thoughts > > > > > > The two new cases are nice, a few extensions that I think would > strengthen them: > > > 1. A mixed-version case where one side has a zero UUID. That's the > path we're claiming is graceful, but nothing > > > currently exercises it > > > > > > Yes, that should work regardless of whether the source or the target > has the zero UUID. > > > > > > I realized one thing: if two timelines have identical TLI but one has > zero UUID and one has not, it seems they > > could not > > > come from the same promotion (one promotion happened on an old server > and the other one on a new server), that is, > > they > > > should be treated as different. Does that make sense? I made the > necessary changes in the attached patches for > > testing. > > > Please have a look. > > > > > > 2. A deeper-divergence case (e.g. TLI1->2->3 vs TLI1->2->3') so that > findCommonAncestorTimeline's loop walks past > > > matching entries > > > before hitting the mismatch. The 0002 test puts the divergence > at depth 1. > > > > > > I was unsure if this test was necessary or interesting, hence a > separate commit. Since you thought it was useful, > > it's > > > now rolled into the patch and I extended the tests with the scenarios > you suggested. > > > > > > I also did some refactorings of the tests to avoid duplication. More > below. > > > > > > 3. A small assertion against the on-disk 00000002.history contents, > to pin down the file format. > > > 4. On 0002 the dependency on restore_command pointing at node_x's > pg_wal is the kind of thing that tends to break > > > under > > > environment changes. A CHECKPOINT on node_x before the backup, > or wal_keep_size as in 0001, would let the > > test > > > stand on its own. > > > > > > Good point. > > > > > > I refactored the code to avoid some duplication and make the test > flow self-explanatory and as part of that I set > > the > > > wal_keep_size for all nodes. > > > > > > In the process I noticed that many of the functions in RewindTest.pm > do the same job as the primitives I wrote, but > > have > > > hard-coded variable names. I could rewrite them to take parameters, > but that would be quite a big patch to add > > additional > > > changes to each call site, so I did not do that and rather added > small wrappers specific for the tests in > > > 005_same_timeline.pl⚠️⚠️. > > > > > > Attached a new version of the now single patch. > > > > > > I'm happy to keep reviewing/contributing, thanks again for working > on it. > > > > > > Thank you for reviewing it. > > > > Thank you for your work. I have one comment. > > > > + a = &tlh->source[tlh->sourceNentries - 2].tluuid; > > + b = &tlh->target[tlh->targetNentries - 2].tluuid; > > + > > + if (memcmp(a, &zero, UUID_LEN) == 0 && memcmp(b, &zero, > UUID_LEN) == 0) > > + return true; > > + > > + return memcmp(a, b, UUID_LEN) == 0; > > > > Since we already have matchingTimelineUUID(), the above code can be > simplified > > using it. > > > > Thank you for the review. I switched to using the matchingTimelineUUID() > for this part of the code and made some other > > minor improvements as well. > > Here are some comments on v4. > > 1. > +/* > + * Timeline histories for both clusters, populated by timelines_match(). > + */ > > I don't see a timelines_match() function. Does this refer to > matchAndFetchTimelines()? > Correct. Updated. > > 2. > +typedef struct TimelineHistoriesData > +{ > + TimeLineHistoryEntry *source, > + *target; > + int sourceNentries, > + targetNentries; > +} TimelineHistoriesData; > > I'd prefer to use TimeLineHistoriesData to stay consistent with > TimeLineHistoryEntry. Anyway I'm not instant on it. > Makes sense to be consistent. Updated. > > 3. > +typedef TimelineHistoriesData * TimelineHistories; > > The space between * and TimelineHistories is unnecessary — see > StringInfoData and other typedefs. > My mistake. FIxed. > 4. > +# node_x and node_b both start from the same TLI 1 baseline. > +my ($node_x, $node_b2) = > + setup_standbys_from_origin($node_origin2, 'node_x', 'node_b2'); > > There appears to be a typo in the comment. The node_b should be node_b2. > Right. Fixed. > > > Everything else looks good. Thank you again for updating the patch! > Thank you again for reviewing the patch. :) Attached a new version of the patch with the changes you suggested. -- Best wishes, Mats Kindahl, Multigres Developer, Supabase