Re: BUG #19490: Streaming standby on 16.14 stops applying WAL on MultiXactOffsetSLRU when primary is 16.8

From: Nazneen Jafri <jafrinazneen@gmail.com>
To: Andrey Borodin <x4mmm@yandex-team.ru>
Cc: Heikki Linnakangas <hlinnaka@iki.fi>, Michael Paquier <michael@paquier.xyz>, Ayush Tiwari <ayushtiwari.slg01@gmail.com>, Radim Marek <radim@boringsql.com>, Marko Tiikkaja <marko@joh.to>, PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org>
Date: 2026-05-27T02:55:14Z
Lists: pgsql-hackers

Commits

  1. Fix self-deadlock when replaying WAL generated by older minor version

  2. Fix multixact backwards-compatibility with CHECKPOINT race condition

  3. Don't reset 'latest_page_number' when replaying multixid truncation

  4. Set next multixid's offset when creating a new multixid

I tested Andrey's demo.diff in a fresh environment:

  - Primary: REL_16_8, Standby: REL_16_14 (--enable-cassert)

  - ~2300 MultiXacts crossing the offsets page boundary

  - Without patch: startup deadlocks at RecordNewMultiXact(multi=2047)

  - With patch: standby replays all WAL and catches up
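
For readers following the numbers: multi=2047 is consistent with an offsets page holding 2048 entries, so the entry after 2047 is the first one on the next page. Below is a minimal sketch of that arithmetic. It assumes the default 8 kB block size and 4-byte offsets; neither value is stated in this thread. The helper names are illustrative and are not PostgreSQL functions.

```python
# Assumed constants (not stated in the thread): default block size and
# a 4-byte offset entry per MultiXactId.
BLCKSZ = 8192
OFFSET_SIZE = 4
OFFSETS_PER_PAGE = BLCKSZ // OFFSET_SIZE  # 2048 entries per offsets page


def offsets_page(multi: int) -> int:
    """Offsets page that holds the entry for this multixact ID."""
    return multi // OFFSETS_PER_PAGE


def offsets_entry(multi: int) -> int:
    """Position of this multixact ID's entry within its page."""
    return multi % OFFSETS_PER_PAGE


def next_entry_on_other_page(multi: int) -> bool:
    """True when multi + 1 lives on a different page than multi.

    ID wraparound is ignored here for simplicity.
    """
    return offsets_page(multi + 1) != offsets_page(multi)


if __name__ == "__main__":
    for m in (2046, 2047, 2048):
        print(m, offsets_page(m), offsets_entry(m), next_entry_on_other_page(m))
```

Under these assumptions 2047 is the last entry on page 0, which matches the page-boundary crossing at 2048 described in the quoted message.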


Thanks,
Nazneen

On Tue, May 26, 2026 at 2:55 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

>
>
> > On 26 May 2026, at 17:28, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > looks correct
>
> I tested that change as follows.
>
> Set up REL_16_0 as primary, REL_16_STABLE as standby.
>
> Generate multixacts in a single session using savepoints:
>
> BEGIN;
> SELECT * FROM t WHERE i = 1 FOR NO KEY UPDATE;
> -- repeat 2500 times:
> SAVEPOINT a; SELECT * FROM t WHERE i = 1 FOR UPDATE; ROLLBACK TO a;
> COMMIT;
>
> Each iteration creates a new MultiXactId. 2500 iterations cross the SLRU page
> boundary at multixact 2048, with some multis to spare (we'll pickle the excess
> ones in jars when all is fixed; toying with exactly 2048 wasted dev cycles for
> no reason).
>
> Test:
> 0. Run the workload on REL_16_0 primary (2500 multixacts, crossing page 0->1)
> 1. Take pg_basebackup
> 2. Run the workload again (2500 more, crossing page 1->2)
> 3. Start the standby
>
> I observe:
> Without the change, startup deadlocks.
> With the change, the standby catches up; the DEBUG1 message "next offsets page
> is not initialized, initializing it now" confirms the compat block fires
> correctly.
>
> I packaged this test into a buildfarm module (TestReplayXversion) [0] that
> builds REL_x_0 and runs this check on a REL_x_STABLE build. It reproduces
> the deadlock on 14, 15, and 16; 17 and 18 pass. I'm currently struggling to
> inject a regress WAL trace into it; that is not working so far. On the bright
> side, I managed to get PR number 42 in the buildfarm client repo.
>
>
> Best regards, Andrey Borodin.
>
> [0] https://github.com/PGBuildFarm/client-code/pull/42
>
>