Thread
-
Re: BUG #19490: Streaming standby on 16.14 stops applying WAL on MultiXactOffsetSLRU when primary is 16.8
Radim Marek <radim@boringsql.com> — 2026-05-21T09:06:18Z
Although the culprit is known, I've got more data as requested.

#0  0x00007f20e9bdb687 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f20e9bdbc8c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f20e9be6920 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x000055a71796e3ca in PGSemaphoreLock (sema=0x7f20de6d0e38) at ./build/src/backend/port/pg_sema.c:327
#4  0x000055a7179f57ed in LWLockAcquire (lock=0x7f20de6d1800, mode=mode@entry=LW_EXCLUSIVE) at ./build/../src/backend/storage/lmgr/lwlock.c:1314
#5  0x000055a71772dfb2 in SimpleLruWriteAll (ctl=ctl@entry=0x55a717e83040 <MultiXactOffsetCtlData>, allow_redirtied=allow_redirtied@entry=false) at ./build/../src/backend/access/transam/slru.c:1174
#6  0x000055a717727b6f in RecordNewMultiXact (multi=79871, offset=218449, nmembers=2, members=members@entry=0x7f20de6831ec) at ./build/../src/backend/access/transam/multixact.c:944
#7  0x000055a71772a983 in multixact_redo (record=0x55a73a8d0fc8) at ./build/../src/backend/access/transam/multixact.c:3464
#8  0x000055a71774d9b8 in ApplyWalRecord (xlogreader=<optimized out>, record=0x7f20de6831b0, replayTLI=<synthetic pointer>) at ./build/../src/backend/access/transam/xlogrecovery.c:1951
#9  PerformWalRecovery () at ./build/../src/backend/access/transam/xlogrecovery.c:1782
#10 0x000055a717740def in StartupXLOG () at ./build/../src/backend/access/transam/xlog.c:5452
#11 0x000055a71797c7e4 in StartupProcessMain () at ./build/../src/backend/postmaster/startup.c:282
#12 0x000055a717972b20 in AuxiliaryProcessMain (auxtype=auxtype@entry=StartupProcess) at ./build/../src/backend/postmaster/auxprocess.c:141
#13 0x000055a717977db3 in StartChildProcess (type=StartupProcess) at ./build/../src/backend/postmaster/postmaster.c:5381
#14 0x000055a71797bfb8 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x55a73a8d0590) at ./build/../src/backend/postmaster/postmaster.c:1463
#15 0x000055a7176a05bc in main (argc=1, argv=0x55a73a8d0590) at ./build/../src/backend/main/main.c:200

and the WAL dump:

rmgr: Btree len (rec/tot): 64/ 64, tx: 336098, lsn: 1/32DE75F0, prev 1/32DE7580, desc: INSERT_LEAF off: 244, blkref #0: rel 1663/16384/16432 blk 536
rmgr: MultiXact len (rec/tot): 54/ 54, tx: 336098, lsn: 1/32DE7630, prev 1/32DE75F0, desc: CREATE_ID 79871 offset 218449 nmembers 2: 336089 (keysh) 336098 (keysh)
rmgr: Heap len (rec/tot): 54/ 54, tx: 336098, lsn: 1/32DE7668, prev 1/32DE7630, desc: LOCK xmax: 79871, off: 1, infobits: [IS_MULTI, LOCK_ONLY, KEYSHR_LOCK], flags: 0x00, blkref #0: rel 1663/16384/16418 blk 0
rmgr: Heap len (rec/tot): 72/ 72, tx: 336096, lsn: 1/32DE76A0, prev 1/32DE7668, desc: HOT_UPDATE old_xmax: 336096, old_off: 52, old_infobits: [], flags: 0x20, new_xmax: 0, new_off: 149, blkref #0: rel 1663/16384/16401 blk 22
rmgr: Heap len (rec/tot): 71/ 71, tx: 336096, lsn: 1/32DE76E8, prev 1/32DE76A0, desc: HOT_UPDATE old_xmax: 336096, old_off: 149, old_infobits: [], flags: 0x60, new_xmax: 0, new_off: 209, blkref #0: rel 1663/16384/16399 blk 6
rmgr: Heap len (rec/tot): 79/ 79, tx: 336096, lsn: 1/32DE7730, prev 1/32DE76E8, desc: INSERT off: 150, flags: 0x00, blkref #0: rel 1663/16384/16417 blk 741
rmgr: Heap len (rec/tot): 72/ 72, tx: 336097, lsn: 1/32DE7780, prev 1/32DE7730, desc: HOT_UPDATE old_xmax: 336097, old_off: 243, old_infobits: [], flags: 0x20, new_xmax: 0, new_off: 228, blkref #0: rel 1663/16384/16401 blk 26
rmgr: Transaction len (rec/tot): 34/ 34, tx: 336096, lsn: 1/32DE77C8, prev 1/32DE7780, desc: COMMIT 2026-05-21 08:43:07.003572 UTC

Radim

On Thu, 21 May 2026 at 10:34, Radim Marek <radim@boringsql.com> wrote:

> Thank you for the follow-up. In the meantime I can confirm that
> commit 77dff5d937b1 might be the source of the originally reported issue.
>
> Unfortunately, pinning the version down to 16.12 only avoids the
> MultiXactOffsetSLRU self-deadlock, but the standby then fails recovery
> after 12+ hours.
>
> FATAL: could not access status of transaction 24958976
> DETAIL: Could not read from file "pg_multixact/offsets/017C" at offset 221184: read too few bytes.
> CONTEXT: WAL redo at 14770/873268E8 for MultiXact/CREATE_ID: 24958975 offset 61500431 nmembers 2: 3058927188 (fornokeyupd) 3058927189 (keysh)
>
> We are going to pin 16.13 and try that before we can safely upgrade
> the primary/are confident we have working PITR recovery available should
> we need it.
>
> Radim
>
> PS: Once I have some time I will try to set up a Docker-based harness to be
> able to replicate the original problem for later testing of the fix.
>
> On Thu, 21 May 2026 at 09:25, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
>> > On 21 May 2026, at 00:12, Marko Tiikkaja <marko@joh.to> wrote:
>> >
>> > #8 0x0000654c8ae2acba in SimpleLruWriteAll (ctl=0x654c8b63e400
>>
>> Thanks!
>>
>> This clearly points to SimpleLruWriteAll() added in 77dff5d937b1.
>> If by chance you will have a backtrace of another deadlocking process -
>> please post it.
>>
>> But it's not strictly necessary for analysis, I think we can figure out
>> what happened from the backtrace you already posted.
>>
>> Best regards, Andrey Borodin.
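
For reference, the file name and byte offset in the quoted FATAL/DETAIL message can be cross-checked against the multixact ID with a little arithmetic. This is only a sketch and not code from PostgreSQL itself: the helper name locate_multixact_offset is made up for illustration, and the constants assume a default PostgreSQL 16 build (8 kB blocks, 4-byte MultiXactOffset entries, 32 pages per SLRU segment, 4-hex-digit segment file names).

```python
# Sketch only: maps a multixact ID to its slot in pg_multixact/offsets.
# Assumed constants for a default PostgreSQL 16 build (not read from a server):
BLCKSZ = 8192                      # default block size in bytes
SIZEOF_MULTIXACT_OFFSET = 4        # MultiXactOffset is a 32-bit value
SLRU_PAGES_PER_SEGMENT = 32        # pages per SLRU segment file
OFFSETS_PER_PAGE = BLCKSZ // SIZEOF_MULTIXACT_OFFSET  # 2048 entries per page


def locate_multixact_offset(multi):
    """Hypothetical helper: return segment file, page start offset, and entry index."""
    page = multi // OFFSETS_PER_PAGE            # SLRU page number
    entry = multi % OFFSETS_PER_PAGE            # slot within that page
    segment = page // SLRU_PAGES_PER_SEGMENT    # segment file number
    page_in_segment = page % SLRU_PAGES_PER_SEGMENT
    return {
        "file": f"pg_multixact/offsets/{segment:04X}",
        "file_offset": page_in_segment * BLCKSZ,  # byte offset where the page starts
        "page_in_segment": page_in_segment,
        "entry": entry,
    }


if __name__ == "__main__":
    # Multixact IDs taken from the messages in this thread.
    for multi in (24958975, 24958976, 79871):
        print(multi, locate_multixact_offset(multi))
```

Under these assumptions, 24958976 maps to file 017C at offset 221184, which matches the DETAIL line, and it is the first entry of page 27 in that segment. The multixact from the CONTEXT line, 24958975, is the last entry of page 26, so the failed read is for the page right after the one the record itself belongs to. The multixact from the backtrace, 79871, is likewise the last entry of its page.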