Re: Changing the state of data checksums in a running cluster
Tomas Vondra <tomas@vondra.me>
From: Tomas Vondra <tomas@vondra.me>
To: Daniel Gustafsson <daniel@yesql.se>
Cc: Bernd Helmle <mailings@oopsware.de>, Michael Paquier
<michael@paquier.xyz>, Michael Banck <mbanck@gmx.net>,
PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Date: 2025-08-29T14:38:22Z
Lists: pgsql-hackers
Commits
Same data as JSON:
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
Use correct datatype for PID
- 0ca1b3010597 19 (unreleased) landed
-
Improve comments in online checksums code
- cd857dec0e0a 19 (unreleased) landed
-
Fix checksum state transition during promotion
- 5fee7cab1b87 19 (unreleased) landed
-
Fix regex searching for page verification failures in tests
- 486b9a9b9eb4 19 (unreleased) landed
-
Apply data-checksum worker throttling parameters
- 9a39056c418c 19 (unreleased) landed
-
Skip WAL for unlogged main fork during online checksum enable
- 2018bd616790 19 (unreleased) landed
-
Revert "Get rid of WALBufMappingLock"
- c13070a27b63 19 (unreleased) cited
-
Get rid of WALBufMappingLock
- bc22dc0e0ddc 18.0 cited
-
Improve grammar of options for command arrays in TAP tests
- ce1b0f9da03e 18.0 cited
Attachments
- failure2.log (text/x-log)
On 8/29/25 16:26, Tomas Vondra wrote: > ... > > I've seen these failures after changing checksums in both directions, > both after enabling and disabling. But I've only ever saw this after > immediate shutdown, never after fast shutdown. (It's interesting the > pg_checksums failed only after fast shutdowns ...). > Of course, right after I send a message, it fails after a fast shutdown, contradicting my observation ... > Could it be that the redo happens to start from an older position, but > using the new checksum version? > ... but it also provided more data supporting this hypothesis. I added logging of pg_current_wal_lsn() before / after changing checksums on the primary, and I see this: 1) LSN before: 14/2B0F26A8 2) enable checksums 3) LSN after: 14/EE335D60 4) standby waits for 14/F4E786E8 (higher, likely thanks to pgbench) 5) standby restarts with -m fast 6) redo starts at 14/230043B0, which is *before* enabling checksums I guess this is the root cause. A bit more detailed log attached. regards -- Tomas Vondra