Re: Changing the state of data checksums in a running cluster
Tomas Vondra <tomas@vondra.me>
From: Tomas Vondra <tomas@vondra.me>
To: Daniel Gustafsson <daniel@yesql.se>
Cc: Bernd Helmle <mailings@oopsware.de>, Michael Paquier
<michael@paquier.xyz>, Michael Banck <mbanck@gmx.net>,
PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Date: 2025-08-29T14:26:41Z
Lists: pgsql-hackers
Commits
Same data as JSON:
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
Use correct datatype for PID
- 0ca1b3010597 19 (unreleased) landed
-
Improve comments in online checksums code
- cd857dec0e0a 19 (unreleased) landed
-
Fix checksum state transition during promotion
- 5fee7cab1b87 19 (unreleased) landed
-
Fix regex searching for page verification failures in tests
- 486b9a9b9eb4 19 (unreleased) landed
-
Apply data-checksum worker throttling parameters
- 9a39056c418c 19 (unreleased) landed
-
Skip WAL for unlogged main fork during online checksum enable
- 2018bd616790 19 (unreleased) landed
-
Revert "Get rid of WALBufMappingLock"
- c13070a27b63 19 (unreleased) cited
-
Get rid of WALBufMappingLock
- bc22dc0e0ddc 18.0 cited
-
Improve grammar of options for command arrays in TAP tests
- ce1b0f9da03e 18.0 cited
Attachments
- failure.log (text/x-log)
- tap-fixes.txt (text/plain)
On 8/27/25 14:42, Tomas Vondra wrote: > On 8/27/25 14:39, Tomas Vondra wrote: >> ... >> >> And this happened on Friday: >> >> commit c13070a27b63d9ce4850d88a63bf889a6fde26f0 >> Author: Alexander Korotkov <akorotkov@postgresql.org> >> Date: Fri Aug 22 18:44:39 2025 +0300 >> >> Revert "Get rid of WALBufMappingLock" >> >> This reverts commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683. >> It appears that conditional variables are not suitable for use >> inside critical sections. If WaitLatch()/WaitEventSetWaitBlock() >> face postmaster death, they exit, releasing all locks instead of >> PANIC. In certain situations, this leads to data corruption. >> >> ... >> >> I think it's very likely the checksums were broken by this. After all, >> that linked thread has subject "VM corruption on standby" and I've only >> ever seen checksum failures on standby on the _vm fork. >> > > Forgot to mention - I did try with c13070a27b reverted, and with that I > can reproduce the checksum failures again (using the fixed TAP test). > > It's not a definitive proof, but it's a hint c13070a27b63 was causing > the checksum failures. > Unfortunately, it seems I spoke too soon :-( I decided to test this on multiple machines overnight, and it still fails on the slower ones. Attached is a patch addressing a couple more issues, to makes the TAP test work well enough. (Attached as .txt, to not confuse cfbot). - The pgbench started by IPC::Run::start() needs to be finished, to release resources. Otherwise it leaks file descriptors (and there's a bunch of "defunct" pgbench processes), which may be a problem with increased number of iterations. - AFAICS the pgbench can't use stdin/stdout/stderr, otherwise the pipes get broken when the command fails (after restart). I just used /dev/null instead, the stdout/stderr was not used anyway (or was it?). - commented out the pg_checksums call, because of the issues mentioned earlier (I was trying to make it work by remembering the state, but it seems to not make it into control file before shutdown occasionally) I increased the number of iterations to 1000+ and ran it on three machines: - ryzen (new machine from ~2024) - xeon (old slow machine from ~2016) - rpi5 (very slow machine) I haven't seen a single failure on ryzen, after ~3000 iterations. But both xeon and rpi5 show a number of failures. Xeon has about 35 reports of 'Failed test', rpi5 and about 10. My guess is it's something about timing. It works on the "fast" ryzen, but fails on xeon which is ~3-4x slower. And rpi5, which is even slower. The other reason why it seems unrelated to the reverted commit is that it's not just about visibility maps (which was got corrupted). I see checksum failures on VM and FSM. I think I forgot about the FSM cases, and by the fact that I saw no failures on the ryzen post revert. But clearly, other machines still have issues. Another interesting fact is that the checksum failures happen both on the primary and the standby, it's not just a standby issue. But again, this sees to be machine-dependent. On the rpi5 I've only seen standby issues. The xeon sees failures both on primary/standby (roughly 1:1). There are more weird things. If I grep for page failures, I see this (a more detailed log attached): ----------- # 2025-08-28 22:33:28.195 CEST startup[177466] LOG: page verification failed, calculated checksum 25350 but expected 44559 # 2025-08-28 22:33:28.197 CEST startup[177466] LOG: page verification failed, calculated checksum 25350 but expected 44559 # 2025-08-28 22:33:28.199 CEST startup[177466] LOG: page verification failed, calculated checksum 59909 but expected 53920 # 2025-08-28 22:33:28.201 CEST startup[177466] LOG: page verification failed, calculated checksum 59909 but expected 53920 # 2025-08-28 22:33:28.206 CEST startup[177466] LOG: page verification failed, calculated checksum 59909 but expected 53920 # 2025-08-28 22:33:28.207 CEST startup[177466] LOG: page verification failed, calculated checksum 25350 but expected 44559 ----------- This is right after a single restart, while doing the recovery. The weird thing is, this is all for just two FSM pages! ----------- LOG: invalid page in block 2 of relation "base/5/16410_fsm"; zeroing out page LOG: invalid page in block 2 of relation "base/5/16408_fsm"; zeroing out page ----------- And the calculated/expected checksums repeat! It's just different WAL records hitting the same page, and complaining about the same issue, after claiming the page was zeroed out. Isn't that weird? How come the page doesn't "get" the correct checksum after the first redo? I've seen these failures after changing checksums in both directions, both after enabling and disabling. But I've only ever saw this after immediate shutdown, never after fast shutdown. (It's interesting the pg_checksums failed only after fast shutdowns ...). Could it be that the redo happens to start from an older position, but using the new checksum version? regards -- Tomas Vondra