Re: Changing the state of data checksums in a running cluster

From: Tomas Vondra <tomas@vondra.me>

To: Daniel Gustafsson <daniel@yesql.se>

Cc: Bernd Helmle <mailings@oopsware.de>, Michael Paquier <michael@paquier.xyz>, Michael Banck <mbanck@gmx.net>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>

Date: 2025-08-29T14:26:41Z

Lists: pgsql-hackers

Commits

Use correct datatype for PID
- 0ca1b3010597 19 (unreleased) landed
Improve comments in online checksums code
- cd857dec0e0a 19 (unreleased) landed
Fix checksum state transition during promotion
- 5fee7cab1b87 19 (unreleased) landed
Fix regex searching for page verification failures in tests
- 486b9a9b9eb4 19 (unreleased) landed
Apply data-checksum worker throttling parameters
- 9a39056c418c 19 (unreleased) landed
Skip WAL for unlogged main fork during online checksum enable
- 2018bd616790 19 (unreleased) landed
Revert "Get rid of WALBufMappingLock"
- c13070a27b63 19 (unreleased) cited
Get rid of WALBufMappingLock
- bc22dc0e0ddc 18.0 cited
Improve grammar of options for command arrays in TAP tests
- ce1b0f9da03e 18.0 cited

Attachments

failure.log (text/x-log)
tap-fixes.txt (text/plain)

On 8/27/25 14:42, Tomas Vondra wrote:
> On 8/27/25 14:39, Tomas Vondra wrote:
>> ...
>>
>> And this happened on Friday:
>>
>> commit c13070a27b63d9ce4850d88a63bf889a6fde26f0
>> Author: Alexander Korotkov <akorotkov@postgresql.org>
>> Date:   Fri Aug 22 18:44:39 2025 +0300
>>
>>     Revert "Get rid of WALBufMappingLock"
>>
>>     This reverts commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683.
>>     It appears that conditional variables are not suitable for use
>>     inside critical sections.  If WaitLatch()/WaitEventSetWaitBlock()
>>     face postmaster death, they exit, releasing all locks instead of
>>     PANIC.  In certain situations, this leads to data corruption.
>>
>>     ...
>>
>> I think it's very likely the checksums were broken by this. After all,
>> that linked thread has subject "VM corruption on standby" and I've only
>> ever seen checksum failures on standby on the _vm fork.
>>
> 
> Forgot to mention - I did try with c13070a27b reverted, and with that I
> can reproduce the checksum failures again (using the fixed TAP test).
> 
> It's not a definitive proof, but it's a hint c13070a27b63 was causing
> the checksum failures.
> 

Unfortunately, it seems I spoke too soon :-( I decided to test this on
multiple machines overnight, and it still fails on the slower ones.

Attached is a patch addressing a couple more issues, to make the TAP
test work well enough. (Attached as .txt, so as not to confuse cfbot.)

- The pgbench started by IPC::Run::start() needs to be finished, to
release resources. Otherwise it leaks file descriptors (and leaves a
bunch of "defunct" pgbench processes around), which may be a problem
with an increased number of iterations.

- AFAICS the pgbench can't use stdin/stdout/stderr, otherwise the pipes
get broken when the command fails (after restart). I just used /dev/null
instead, the stdout/stderr was not used anyway (or was it?).

- Commented out the pg_checksums call, because of the issues mentioned
earlier (I was trying to make it work by remembering the state, but
occasionally it does not seem to make it into the control file before
shutdown).
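
The TAP test itself is Perl and uses IPC::Run, so the following is not
the patch, only a Python analogue of the first two points: start the
background client with its stdin/stdout/stderr pointed at /dev/null, and
always reap it afterwards. The helper names are made up, and a trivial
child process stands in for pgbench.

```python
import subprocess
import sys


def start_background(cmd):
    """Start cmd with stdin/stdout/stderr on /dev/null, so there are no
    pipes that could break when the command fails."""
    return subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )


def finish(proc, timeout=60):
    """Reap the child so it does not linger as a defunct process."""
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Give up waiting, kill the child, and still reap it.
        proc.kill()
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for pgbench: a child that exits with an error, the way
    # pgbench does once the server it talks to is restarted.
    child = start_background([sys.executable, "-c", "import sys; sys.exit(3)"])
    print("exit code:", finish(child))
```

Skipping the finish() step is what corresponds to the leaked descriptors
and defunct processes described above.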

I increased the number of iterations to 1000+ and ran it on three machines:

- ryzen (new machine from ~2024)
- xeon (old slow machine from ~2016)
- rpi5 (very slow machine)

I haven't seen a single failure on ryzen, after ~3000 iterations. But
both xeon and rpi5 show a number of failures. Xeon has about 35 reports
of 'Failed test', rpi5 about 10.

My guess is it's something about timing. It works on the "fast" ryzen,
but fails on xeon, which is ~3-4x slower, and on rpi5, which is even
slower.

The other reason why it seems unrelated to the reverted commit is that
it's not just about visibility maps (which is what got corrupted there).
I see checksum failures on both VM and FSM. I think I forgot about the
FSM cases, and was misled by the fact that I saw no failures on the
ryzen post revert. But clearly, other machines still have issues.

Another interesting fact is that the checksum failures happen both on
the primary and the standby, it's not just a standby issue. But again,
this seems to be machine-dependent. On the rpi5 I've only seen standby
issues. The xeon sees failures on both primary and standby (roughly 1:1).

There are more weird things. If I grep for page failures, I see this (a
more detailed log attached):

-----------
# 2025-08-28 22:33:28.195 CEST startup[177466] LOG:  page verification
failed, calculated checksum 25350 but expected 44559
# 2025-08-28 22:33:28.197 CEST startup[177466] LOG:  page verification
failed, calculated checksum 25350 but expected 44559
# 2025-08-28 22:33:28.199 CEST startup[177466] LOG:  page verification
failed, calculated checksum 59909 but expected 53920
# 2025-08-28 22:33:28.201 CEST startup[177466] LOG:  page verification
failed, calculated checksum 59909 but expected 53920
# 2025-08-28 22:33:28.206 CEST startup[177466] LOG:  page verification
failed, calculated checksum 59909 but expected 53920
# 2025-08-28 22:33:28.207 CEST startup[177466] LOG:  page verification
failed, calculated checksum 25350 but expected 44559
-----------

This is right after a single restart, while doing the recovery. The
weird thing is, this is all for just two FSM pages!

-----------
LOG:  invalid page in block 2 of relation "base/5/16410_fsm"; zeroing
out page
LOG:  invalid page in block 2 of relation "base/5/16408_fsm"; zeroing
out page
-----------

And the calculated/expected checksums repeat! It's just different WAL
records hitting the same page, and complaining about the same issue,
after claiming the page was zeroed out. Isn't that weird? How come the
page doesn't "get" the correct checksum after the first redo?
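
To check that it really is the same pairs repeating, the failures can be
grouped by (calculated, expected). This is a minimal sketch in Python,
assuming the message text shown above and allowing for it being wrapped
across two lines as in this mail; the function name is made up, and the
sample is just the excerpt from above.

```python
import re
from collections import Counter

# \s+ lets the match span the line break between "verification" and
# "failed" in the wrapped excerpt.
PATTERN = re.compile(
    r"page verification\s+failed, calculated checksum (\d+) but expected (\d+)"
)


def count_checksum_pairs(log_text):
    """Count how often each (calculated, expected) pair is reported."""
    return Counter(
        (int(calc), int(exp)) for calc, exp in PATTERN.findall(log_text)
    )


SAMPLE = """\
# 2025-08-28 22:33:28.195 CEST startup[177466] LOG:  page verification
failed, calculated checksum 25350 but expected 44559
# 2025-08-28 22:33:28.197 CEST startup[177466] LOG:  page verification
failed, calculated checksum 25350 but expected 44559
# 2025-08-28 22:33:28.199 CEST startup[177466] LOG:  page verification
failed, calculated checksum 59909 but expected 53920
# 2025-08-28 22:33:28.201 CEST startup[177466] LOG:  page verification
failed, calculated checksum 59909 but expected 53920
# 2025-08-28 22:33:28.206 CEST startup[177466] LOG:  page verification
failed, calculated checksum 59909 but expected 53920
# 2025-08-28 22:33:28.207 CEST startup[177466] LOG:  page verification
failed, calculated checksum 25350 but expected 44559
"""

if __name__ == "__main__":
    for (calc, exp), n in count_checksum_pairs(SAMPLE).most_common():
        print(f"calculated {calc}, expected {exp}: {n} reports")
```

For the six lines quoted above this gives two distinct pairs, three
reports each.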

I've seen these failures after changing checksums in both directions,
both after enabling and disabling. But I've only ever seen this after
an immediate shutdown, never after a fast shutdown. (It's interesting
that pg_checksums failed only after fast shutdowns ...).

Could it be that the redo happens to start from an older position, but
using the new checksum version?

regards

-- 
Tomas Vondra