Re: Changing the state of data checksums in a running cluster
Tomas Vondra <tomas@vondra.me>
From: Tomas Vondra <tomas@vondra.me>
To: Daniel Gustafsson <daniel@yesql.se>
Cc: Bernd Helmle <mailings@oopsware.de>, Michael Paquier
<michael@paquier.xyz>, Michael Banck <mbanck@gmx.net>,
PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Date: 2025-08-28T16:11:18Z
Lists: pgsql-hackers
Commits
- Use correct datatype for PID: 0ca1b3010597, 19 (unreleased), landed
- Improve comments in online checksums code: cd857dec0e0a, 19 (unreleased), landed
- Fix checksum state transition during promotion: 5fee7cab1b87, 19 (unreleased), landed
- Fix regex searching for page verification failures in tests: 486b9a9b9eb4, 19 (unreleased), landed
- Apply data-checksum worker throttling parameters: 9a39056c418c, 19 (unreleased), landed
- Skip WAL for unlogged main fork during online checksum enable: 2018bd616790, 19 (unreleased), landed
- Revert "Get rid of WALBufMappingLock": c13070a27b63, 19 (unreleased), cited
- Get rid of WALBufMappingLock: bc22dc0e0ddc, 18.0, cited
- Improve grammar of options for command arrays in TAP tests: ce1b0f9da03e, 18.0, cited
Attachments
- checksum-tap-fix.patch (text/x-patch) patch
- checksum-failure.txt (text/plain)
Hi,

I spent a bit more time fixing the TAP test. The attached patch makes it "work" for me (or I think it should, in principle). I'm not saying it's the best way to do stuff.

With the patch applied, I tried running it, and I got a failure when running pg_checksums. There's a log snippet describing the issue, but AFAICS it's happening like this:

1) checksums are disabled
2) flip_data_checksums gets called
3) both clusters go through 'inprogress-on' and 'on' states
4) primary gets shut down in 'immediate' mode
5) standby gets shut down in 'fast' mode
6) we try to validate checksums on the standby, but the control file still says checksums=inprogress-on

This seems like a bug to me - AFAICS the expectation is that after a fast shutdown, we don't forget the checksum state. Or is that expected? In that case the TAP test probably needs to check the control file, instead of relying on the Perl variable $data_checksum_state. Or maybe it should check that the control file has the correct / expected state?

FWIW I don't think the primary shutdown matters. I've seen multiple of these failures, and it happens even without the primary shutdown. But the standby "fast" shutdown is always there.

But this also shows a limitation of the TAP test - it never triggers the shutdowns while flipping the checksums (in flip_data_checksums). I think that's something worth testing.

regards

-- 
Tomas Vondra
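
To illustrate the suggestion of checking the control file rather than the test's own state variable, here is a minimal sketch. It is written in Python purely for illustration (the real TAP test is Perl), and the function names are hypothetical. It assumes the caller has already captured the text printed by pg_controldata for the standby's data directory, and it reads the "Data page checksum version" line from that text. The message does not say how the patch renders the in-progress states in that field, so the expected value is passed in as an opaque string rather than guessed.

```python
from typing import Optional


def checksum_field(controldata_output: str) -> Optional[str]:
    """Return the raw value of the 'Data page checksum version' line.

    Returns None if the line is not present in the given text.
    """
    for line in controldata_output.splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip() == "Data page checksum version":
            return value.strip()
    return None


def control_file_matches(controldata_output: str, expected: str) -> bool:
    """Compare the control file's checksum field with the expected value.

    'expected' is whatever string the test believes the control file
    should report after shutdown; the mapping from checksum states to
    field values is an assumption left to the caller.
    """
    return checksum_field(controldata_output) == expected


# Example input: two lines in the layout pg_controldata uses.
sample = (
    "Database cluster state:               shut down in recovery\n"
    "Data page checksum version:           1\n"
)
```

A test following this idea would run the comparison after the standby's fast shutdown and before invoking pg_checksums, so that a mismatch between the control file and the expected state is reported directly instead of surfacing later as a pg_checksums failure.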