Re: Changing the state of data checksums in a running cluster
From: Tomas Vondra <tomas@vondra.me>
To: Daniel Gustafsson <daniel@yesql.se>, Bernd Helmle <mailings@oopsware.de>
Cc: Michael Paquier <michael@paquier.xyz>, Michael Banck <mbanck@gmx.net>,
PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Date: 2025-08-20T14:37:33Z
Lists: pgsql-hackers
Commits
- Use correct datatype for PID: 0ca1b3010597, 19 (unreleased), landed
- Improve comments in online checksums code: cd857dec0e0a, 19 (unreleased), landed
- Fix checksum state transition during promotion: 5fee7cab1b87, 19 (unreleased), landed
- Fix regex searching for page verification failures in tests: 486b9a9b9eb4, 19 (unreleased), landed
- Apply data-checksum worker throttling parameters: 9a39056c418c, 19 (unreleased), landed
- Skip WAL for unlogged main fork during online checksum enable: 2018bd616790, 19 (unreleased), landed
- Revert "Get rid of WALBufMappingLock": c13070a27b63, 19 (unreleased), cited
- Get rid of WALBufMappingLock: bc22dc0e0ddc, 18.0, cited
- Improve grammar of options for command arrays in TAP tests: ce1b0f9da03e, 18.0, cited

On 8/16/25 21:34, Daniel Gustafsson wrote:
> Attached is a rebase on top of the func.sgml changes which caused this to no
> longer apply.
>
> This version is also substantially updated with a new injection point based
> test suite, fixed a few bugs (found by said test suite), added checkpoint to
> disabling checksums, code cleanup, more granular wait events, comment rewrites
> and additions and more smaller cleanups.
>

Thanks for the updated patch. The injection points seem like a huge
improvement, allowing testing of different code paths in a more
deterministic way.

I started running the stress test, using pretty much exactly the version
posted in March [1]. So far I have noticed only one issue: the standby
reports mismatched checksums on an FSM page:

  LOG: page verification failed, calculated checksum 24786 but expected 24760
  CONTEXT: WAL redo at 0/0344A290 for Heap2/MULTI_INSERT+INIT: ntuples: 185, flags: 0x28; blkref #0: rel 1663/16384/16403, blk 0
  LOG: invalid page in block 2 of relation base/16384/16403_fsm; zeroing out page
  CONTEXT: WAL redo at 0/0344A290 for Heap2/MULTI_INSERT+INIT: ntuples: 185, flags: 0x28; blkref #0: rel 1663/16384/16403, blk 0
  WARNING: invalid page in block 2 of relation base/16384/16403_fsm; zeroing out page
  CONTEXT: WAL redo at 0/0344A290 for Heap2/MULTI_INSERT+INIT: ntuples: 185, flags: 0x28; blkref #0: rel 1663/16384/16403, blk 0
  LOG: page verification failed, calculated checksum 37048 but expected 0
  CONTEXT: WAL redo at 0/0344D7E0 for Heap2/MULTI_INSERT+INIT: ntuples: 61, flags: 0x28; blkref #0: rel 1663/16384/16400, blk 0
  LOG: invalid page in block 2 of relation base/16384/16400_fsm; zeroing out page

This happens quite regularly, it's not hard to hit. But I've only seen it
happen on an FSM, and only right after an immediate shutdown. I don't think
that's quite expected.

I believe the built-in TAP tests (with injection points) can't catch this,
because there's no concurrent activity while flipping checksums on/off.
It'd be good to add some, for example by running pgbench in the background.
I also don't see any restarts of the primary/standby in the tests. That
might be good to do too.

I plan to randomize the stress test a bit more, once this FSM issue gets
fixed. Maybe that'll find some additional issues.

[1] https://www.postgresql.org/message-id/f528413c-477a-4ec3-a0df-e22a80ffbe41@vondra.me

-- 
Tomas Vondra
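
To check whether failures like the ones quoted above really are limited to the FSM fork, a stress run could post-process the standby log. The following is only a minimal sketch and is not part of the patch or the stress test script: the names summarize, fork_of and SAMPLE are made up for illustration, and the regular expressions assume the two message shapes shown in the log excerpt.

```python
import re
from collections import Counter

# Patterns for the two message shapes quoted in the report (an assumption:
# other server versions or settings may word these messages differently).
VERIFY_RE = re.compile(
    r"page verification failed, calculated checksum (\d+) but expected (\d+)"
)
INVALID_RE = re.compile(r"invalid page in block (\d+) of relation ([^;\s]+);")

# Log lines copied from the report, with the CONTEXT lines left out.
SAMPLE = """\
LOG: page verification failed, calculated checksum 24786 but expected 24760
LOG: invalid page in block 2 of relation base/16384/16403_fsm; zeroing out page
WARNING: invalid page in block 2 of relation base/16384/16403_fsm; zeroing out page
LOG: page verification failed, calculated checksum 37048 but expected 0
LOG: invalid page in block 2 of relation base/16384/16400_fsm; zeroing out page
"""


def fork_of(relpath):
    """Return the fork suffix of a relation path, or 'main' if there is none.

    Simplification: treats everything after the first '_' in the file name
    as the fork name, which is enough for paths like base/16384/16403_fsm.
    """
    filename = relpath.rsplit("/", 1)[-1]
    _, sep, suffix = filename.partition("_")
    return suffix if sep else "main"


def summarize(log_text):
    """Collect checksum mismatches and the distinct pages reported invalid."""
    failures = [(int(calc), int(exp)) for calc, exp in VERIFY_RE.findall(log_text)]
    # The same page can be reported at both LOG and WARNING level, so
    # deduplicate on (relation path, block number).
    invalid = {(rel, int(blk)) for blk, rel in INVALID_RE.findall(log_text)}
    by_fork = Counter(fork_of(rel) for rel, _ in invalid)
    return failures, sorted(invalid), dict(by_fork)


if __name__ == "__main__":
    failures, invalid, by_fork = summarize(SAMPLE)
    print("checksum mismatches:", failures)
    print("invalid pages:", invalid)
    print("invalid pages per fork:", by_fork)
```

A stress test could fail the run whenever the per-fork summary contains anything other than what is considered acceptable, which would make a regression on other forks visible right away.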