Thread

Re: BUG #19490: Streaming standby on 16.14 stops applying WAL on MultiXactOffsetSLRU when primary is 16.8

Radim Marek <radim@boringsql.com> — 2026-05-21T09:06:18Z
Although the culprit is known, I've got more data as requested.

#0  0x00007f20e9bdb687 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f20e9bdbc8c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f20e9be6920 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x000055a71796e3ca in PGSemaphoreLock (sema=0x7f20de6d0e38) at ./build/src/backend/port/pg_sema.c:327
#4  0x000055a7179f57ed in LWLockAcquire (lock=0x7f20de6d1800, mode=mode@entry=LW_EXCLUSIVE) at ./build/../src/backend/storage/lmgr/lwlock.c:1314
#5  0x000055a71772dfb2 in SimpleLruWriteAll (ctl=ctl@entry=0x55a717e83040 <MultiXactOffsetCtlData>, allow_redirtied=allow_redirtied@entry=false) at ./build/../src/backend/access/transam/slru.c:1174
#6  0x000055a717727b6f in RecordNewMultiXact (multi=79871, offset=218449, nmembers=2, members=members@entry=0x7f20de6831ec) at ./build/../src/backend/access/transam/multixact.c:944
#7  0x000055a71772a983 in multixact_redo (record=0x55a73a8d0fc8) at ./build/../src/backend/access/transam/multixact.c:3464
#8  0x000055a71774d9b8 in ApplyWalRecord (xlogreader=<optimized out>, record=0x7f20de6831b0, replayTLI=<synthetic pointer>) at ./build/../src/backend/access/transam/xlogrecovery.c:1951
#9  PerformWalRecovery () at ./build/../src/backend/access/transam/xlogrecovery.c:1782
#10 0x000055a717740def in StartupXLOG () at ./build/../src/backend/access/transam/xlog.c:5452
#11 0x000055a71797c7e4 in StartupProcessMain () at ./build/../src/backend/postmaster/startup.c:282
#12 0x000055a717972b20 in AuxiliaryProcessMain (auxtype=auxtype@entry=StartupProcess) at ./build/../src/backend/postmaster/auxprocess.c:141
#13 0x000055a717977db3 in StartChildProcess (type=StartupProcess) at ./build/../src/backend/postmaster/postmaster.c:5381
#14 0x000055a71797bfb8 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x55a73a8d0590) at ./build/../src/backend/postmaster/postmaster.c:1463
#15 0x000055a7176a05bc in main (argc=1, argv=0x55a73a8d0590) at ./build/../src/backend/main/main.c:200

and the WAL dump:

rmgr: Btree       len (rec/tot):     64/    64, tx:     336098, lsn: 1/32DE75F0, prev 1/32DE7580, desc: INSERT_LEAF off: 244, blkref #0: rel 1663/16384/16432 blk 536
rmgr: MultiXact   len (rec/tot):     54/    54, tx:     336098, lsn: 1/32DE7630, prev 1/32DE75F0, desc: CREATE_ID 79871 offset 218449 nmembers 2: 336089 (keysh) 336098 (keysh)
rmgr: Heap        len (rec/tot):     54/    54, tx:     336098, lsn: 1/32DE7668, prev 1/32DE7630, desc: LOCK xmax: 79871, off: 1, infobits: [IS_MULTI, LOCK_ONLY, KEYSHR_LOCK], flags: 0x00, blkref #0: rel 1663/16384/16418 blk 0
rmgr: Heap        len (rec/tot):     72/    72, tx:     336096, lsn: 1/32DE76A0, prev 1/32DE7668, desc: HOT_UPDATE old_xmax: 336096, old_off: 52, old_infobits: [], flags: 0x20, new_xmax: 0, new_off: 149, blkref #0: rel 1663/16384/16401 blk 22
rmgr: Heap        len (rec/tot):     71/    71, tx:     336096, lsn: 1/32DE76E8, prev 1/32DE76A0, desc: HOT_UPDATE old_xmax: 336096, old_off: 149, old_infobits: [], flags: 0x60, new_xmax: 0, new_off: 209, blkref #0: rel 1663/16384/16399 blk 6
rmgr: Heap        len (rec/tot):     79/    79, tx:     336096, lsn: 1/32DE7730, prev 1/32DE76E8, desc: INSERT off: 150, flags: 0x00, blkref #0: rel 1663/16384/16417 blk 741
rmgr: Heap        len (rec/tot):     72/    72, tx:     336097, lsn: 1/32DE7780, prev 1/32DE7730, desc: HOT_UPDATE old_xmax: 336097, old_off: 243, old_infobits: [], flags: 0x20, new_xmax: 0, new_off: 228, blkref #0: rel 1663/16384/16401 blk 26
rmgr: Transaction len (rec/tot):     34/    34, tx:     336096, lsn: 1/32DE77C8, prev 1/32DE7780, desc: COMMIT 2026-05-21 08:43:07.003572 UTC

Radim

On Thu, 21 May 2026 at 10:34, Radim Marek <radim@boringsql.com> wrote:

> Thank you for the follow-up. In the meantime, I can confirm that commit
> 77dff5d937b1 is likely the source of the originally reported issue.
>
> Unfortunately, pinning the version down to 16.12 only avoids the
> MultiXactOffsetSLRU self-deadlock; the standby then fails recovery
> after 12+ hours:
>
> FATAL: could not access status of transaction 24958976
> DETAIL: Could not read from file "pg_multixact/offsets/017C" at offset 221184: read too few bytes.
> CONTEXT: WAL redo at 14770/873268E8 for MultiXact/CREATE_ID: 24958975 offset 61500431 nmembers 2: 3058927188 (fornokeyupd) 3058927189 (keysh)
>
> We are going to pin 16.13 and try that until we can safely upgrade the
> primary or are confident that we have working PITR available should we
> need it.
>
> Radim
>
> PS: Once I have some time, I will try to set up a Docker-based harness to
> reproduce the original problem, for later testing of the fix.
>
> On Thu, 21 May 2026 at 09:25, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
>>
>>
>> > On 21 May 2026, at 00:12, Marko Tiikkaja <marko@joh.to> wrote:
>> >
>> > #8  0x0000654c8ae2acba in SimpleLruWriteAll (ctl=0x654c8b63e400
>>
>> Thanks!
>>
>> This clearly points to SimpleLruWriteAll() added in 77dff5d937b1.
>> If by chance you get a backtrace of another deadlocking process,
>> please post it.
>>
>> But it's not strictly necessary for the analysis; I think we can figure
>> out what happened from the backtrace you already posted.
>>
>>
>> Best regards, Andrey Borodin.
>>
>
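
A note on the numbers in the two reports above: the following is a minimal
sketch of how a multixact ID maps to a file and byte offset under
pg_multixact/offsets. It assumes stock build defaults that are not stated
anywhere in this thread (8192-byte blocks, 4-byte offset entries, 32 pages
per SLRU segment, four-hex-digit segment file names), and the helper name
locate_offset_entry is made up for illustration. It is arithmetic on the
reported values only, not a diagnosis.

```python
# Assumed defaults for a stock PostgreSQL 16 build (not taken from the thread).
BLCKSZ = 8192                # block size in bytes
MULTIXACT_OFFSET_SIZE = 4    # size of one offsets entry (a 32-bit value)
SLRU_PAGES_PER_SEGMENT = 32  # pages per SLRU segment file

OFFSETS_PER_PAGE = BLCKSZ // MULTIXACT_OFFSET_SIZE  # 2048 entries per page


def locate_offset_entry(multi):
    """Hypothetical helper: map a multixact ID to its offsets-SLRU location.

    Returns (segment file name, byte offset of the page within that file,
    entry index within the page).
    """
    page = multi // OFFSETS_PER_PAGE
    entry = multi % OFFSETS_PER_PAGE
    segment = page // SLRU_PAGES_PER_SEGMENT
    page_in_segment = page % SLRU_PAGES_PER_SEGMENT
    return f"{segment:04X}", page_in_segment * BLCKSZ, entry


# The multixact IDs from the WAL dump and the FATAL message, plus their successors.
for multi in (79871, 79872, 24958975, 24958976):
    print(multi, locate_offset_entry(multi))
```

Under these assumptions, 24958976 maps to file 017C at byte offset 221184,
which matches the DETAIL line of the FATAL message. Both replayed records
(CREATE_ID 79871 and CREATE_ID 24958975) land on entry 2047, the last slot
of their offsets page, so the next multixact ID in each case is the first
entry of the following page.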