BUG #19490: Streaming standby on 16.14 stops applying WAL on MultiXactOffsetSLRU when primary is 16.8
From: PG Bug reporting form <noreply@postgresql.org>
To: pgsql-bugs@lists.postgresql.org
Cc: radim@boringsql.com
Date: 2026-05-20T21:16:59Z
Lists: pgsql-hackers
The following bug has been logged on the website:

Bug reference:      19490
Logged by:          Radim Marek
Email address:      radim@boringsql.com
PostgreSQL version: 16.14
Operating system:   Linux - Ubuntu 22.04
Description:

Hello,

due to a mistake we ran a standby on a newer 16.x minor version against a
primary that had not been upgraded. This led to repeated stalls in WAL replay.

Description:

A streaming replication standby running 16.14 stops advancing replay while WAL
keeps arriving from a 16.8 primary. The startup process is parked in
futex_wait_queue with wait_event = LWLock:MultiXactOffsetSLRU and no longer
makes progress. pg_stat_slru shows zero MultiXact activity over the same
window, so it appears to stop on the lock itself rather than inside any SLRU
read/write path.

Downgrading the standby binary to 16.12 (same data directory) resolved the
symptom under the same workload.

Configuration:

- Primary: 16.8-1.pgdg22.04+1. The problem was observed both under load and
  when "relatively" idle (below 1000 QPS).
- Replica: 16.14-1.pgdg22.04+1, physical streaming, async. It is the only
  replica on 16.14 (due to misconfiguration), no cascading.
- Other replicas are not affected (running 16.8).
- hot_standby_feedback enabled, logical replication from primary.
- Default WAL segment size.
- Default SLRU buffer sizes.

Observed symptoms on the standby:

1. pg_stat_replication on primary, just the affected node

   client_addr  state      sent_lag  write_lag  flush_lag  replay_lag_bytes  replay_lag
   10.x.x.x     streaming  0         0          0          8766784344        02:42:50

2. Receive/write/flush are all at the primary's current LSN; only replay is
   far behind and growing.

3. Startup process wait event on standby (sampled repeatedly, always identical)

   pid    wait_event_type  wait_event           state
   19095  LWLock           MultiXactOffsetSLRU  (null)

4. Kernel stack of the startup process

   cat /proc/19095/stack
   [<0>] futex_wait_queue+0x67/0xa0
   [<0>] __futex_wait+0x155/0x1d0
   [<0>] futex_wait+0x74/0x120
   [<0>] do_futex+0x16d/0x230
   [<0>] __x64_sys_futex+0x95/0x200
   [<0>] x64_sys_call+0x117b/0x2480
   [<0>] do_syscall_64+0x81/0x170
   [<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80

   cat /proc/19095/wchan
   futex_wait_queue

5. pg_stat_slru on the standby, after pg_stat_reset_slru(NULL) and a 60-second
   wait under live WAL streaming

   name             blks_zeroed  blks_hit  blks_read  blks_written
   MultiXactMember  0            0         0          0
   MultiXactOffset  0            0         0          0

6. There was no MultiXact SLRU activity while the startup process was
   reportedly waiting on the MultiXact offset SLRU lock.

7. Replay LSN frozen, receive LSN advancing. Sampled 60 sec apart.

   recv           replay          lag_bytes
   1476A/D1DA158  14767/EE01DB78  9111848416
   1476A/EB565D0  14767/EE01DB78  9138571864

8. No replay progress; ~9 GB of WAL is buffered locally and never applied.

9. Other backends on the standby: only a diagnostic psql client. No
   hot-standby readers.

10. MultiXact age on the primary is small (~360k on most DBs, ~239k on the
    main DB). No MultiXact storm.

Workarounds:

- Restarting the standby cleared the block, but the stall recurred once the
  standby had caught up.
- Downgrading the standby binary to 16.12 (16.12-1.pgdg22.04+1) against the
  same data directory restored normal replay. After 60s under the same
  workload, pg_stat_slru shows only 2 hits / 0 reads on MultiXact.

I understand that running 6 minor versions behind is not a particularly good
setup, but since this is a supported direction, it might be worth at least a
mention in the 16.13/16.14 release notes.

---
Hope this helps,
Radim
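
For readers who want to cross-check the lag_bytes figures in the recv/replay sample above, here is a minimal sketch in Python. It assumes the usual textual LSN layout: two hexadecimal halves separated by a slash, with the left half being the high 32 bits of a 64-bit byte position. The helper names parse_lsn and lsn_diff are made up for this illustration. On a live server, pg_wal_lsn_diff() gives the same subtraction.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '1476A/D1DA158' to a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def lsn_diff(a: str, b: str) -> int:
    """Number of WAL bytes between two LSNs (a minus b)."""
    return parse_lsn(a) - parse_lsn(b)


# (recv, replay) pairs copied from the report, sampled 60 seconds apart.
samples = [
    ("1476A/D1DA158", "14767/EE01DB78"),
    ("1476A/EB565D0", "14767/EE01DB78"),
]

lags = [lsn_diff(recv, replay) for recv, replay in samples]
for (recv, replay), lag in zip(samples, lags):
    print(f"{recv:>14}  {replay:>14}  {lag}")

# Replay did not move between samples, so the lag grows by exactly
# the amount of WAL received in the interval.
growth = lags[1] - lags[0]
print(f"lag grew by {growth} bytes in 60 s")
```

Both computed lags match the lag_bytes column in the report (9111848416 and 9138571864). The difference between them, 26723448 bytes over the 60-second interval, is WAL that was received but not replayed.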