Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me>

From: Tomas Vondra <tomas@vondra.me>
To: Daniel Gustafsson <daniel@yesql.se>
Cc: Bernd Helmle <mailings@oopsware.de>, Michael Paquier <michael@paquier.xyz>, Michael Banck <mbanck@gmx.net>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Date: 2025-08-27T12:42:05Z
Lists: pgsql-hackers

Commits

  1. Use correct datatype for PID

  2. Improve comments in online checksums code

  3. Fix checksum state transition during promotion

  4. Fix regex searching for page verification failures in tests

  5. Apply data-checksum worker throttling parameters

  6. Skip WAL for unlogged main fork during online checksum enable

  7. Revert "Get rid of WALBufMappingLock"

  8. Get rid of WALBufMappingLock

  9. Improve grammar of options for command arrays in TAP tests

On 8/27/25 14:39, Tomas Vondra wrote:
> ...
>
> And this happened on Friday:
> 
> commit c13070a27b63d9ce4850d88a63bf889a6fde26f0
> Author: Alexander Korotkov <akorotkov@postgresql.org>
> Date:   Fri Aug 22 18:44:39 2025 +0300
> 
>     Revert "Get rid of WALBufMappingLock"
> 
>     This reverts commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683.
>     It appears that conditional variables are not suitable for use
>     inside critical sections.  If WaitLatch()/WaitEventSetWaitBlock()
>     face postmaster death, they exit, releasing all locks instead of
>     PANIC.  In certain situations, this leads to data corruption.
> 
>     ...
> 
> I think it's very likely the checksums were broken by this. After all,
> that linked thread has subject "VM corruption on standby" and I've only
> ever seen checksum failures on standby on the _vm fork.
> 

Forgot to mention - I did try with c13070a27b reverted, and with that I
can reproduce the checksum failures again (using the fixed TAP test).

It's not definitive proof, but it's a hint that bc22dc0e0ddc (the commit
that c13070a27b63 reverted) was causing the checksum failures.


regards

-- 
Tomas Vondra