Re: Changing the state of data checksums in a running cluster
From: Tomas Vondra <tomas@vondra.me>
To: Daniel Gustafsson <daniel@yesql.se>
Cc: Bernd Helmle <mailings@oopsware.de>, Michael Paquier
<michael@paquier.xyz>, Michael Banck <mbanck@gmx.net>,
PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Date: 2025-08-27T12:42:05Z
Lists: pgsql-hackers
Commits
- Use correct datatype for PID: 0ca1b3010597, 19 (unreleased), landed
- Improve comments in online checksums code: cd857dec0e0a, 19 (unreleased), landed
- Fix checksum state transition during promotion: 5fee7cab1b87, 19 (unreleased), landed
- Fix regex searching for page verification failures in tests: 486b9a9b9eb4, 19 (unreleased), landed
- Apply data-checksum worker throttling parameters: 9a39056c418c, 19 (unreleased), landed
- Skip WAL for unlogged main fork during online checksum enable: 2018bd616790, 19 (unreleased), landed
- Revert "Get rid of WALBufMappingLock": c13070a27b63, 19 (unreleased), cited
- Get rid of WALBufMappingLock: bc22dc0e0ddc, 18.0, cited
- Improve grammar of options for command arrays in TAP tests: ce1b0f9da03e, 18.0, cited

On 8/27/25 14:39, Tomas Vondra wrote:
> ...
>
> And this happened on Friday:
>
> commit c13070a27b63d9ce4850d88a63bf889a6fde26f0
> Author: Alexander Korotkov <akorotkov@postgresql.org>
> Date: Fri Aug 22 18:44:39 2025 +0300
>
>     Revert "Get rid of WALBufMappingLock"
>
>     This reverts commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683.
>     It appears that conditional variables are not suitable for use
>     inside critical sections. If WaitLatch()/WaitEventSetWaitBlock()
>     face postmaster death, they exit, releasing all locks instead of
>     PANIC. In certain situations, this leads to data corruption.
>
> ...
>
> I think it's very likely the checksums were broken by this. After all,
> that linked thread has subject "VM corruption on standby" and I've only
> ever seen checksum failures on standby on the _vm fork.
>

Forgot to mention - I did try with c13070a27b reverted, and with that I
can reproduce the checksum failures again (using the fixed TAP test).

It's not a definitive proof, but it's a hint c13070a27b63 was causing
the checksum failures.

regards

--
Tomas Vondra