Re: Changing the state of data checksums in a running cluster

From: Tomas Vondra <tomas@vondra.me>
To: Daniel Gustafsson <daniel@yesql.se>
Cc: Bernd Helmle <mailings@oopsware.de>, Michael Paquier <michael@paquier.xyz>, Michael Banck <mbanck@gmx.net>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Date: 2025-08-28T16:11:18Z
Lists: pgsql-hackers

Commits

  1. Use correct datatype for PID

  2. Improve comments in online checksums code

  3. Fix checksum state transition during promotion

  4. Fix regex searching for page verification failures in tests

  5. Apply data-checksum worker throttling parameters

  6. Skip WAL for unlogged main fork during online checksum enable

  7. Revert "Get rid of WALBufMappingLock"

  8. Get rid of WALBufMappingLock

  9. Improve grammar of options for command arrays in TAP tests

Hi,

I spent a bit more time fixing the TAP test. The attached patch makes it
"work" for me (or I think it should, in principle). I'm not saying it's
the best way to do stuff.

With the patch applied, I tried running it, and I got a failure when
running pg_checksums. There's a log snippet describing the issue, but
AFAICS it's happening like this:

1) checksums are disabled
2) flip_data_checksums gets called
3) both clusters go through 'inprogress-on' and 'on' states
4) primary gets shut down in 'immediate' mode
5) standby gets shut down in 'fast' mode
6) we try to validate checksums on the standby, but control file still
says checksums=inprogress-on

This seems like a bug to me - AFAICS the expectation is that after a fast
shutdown, we don't forget the checksum state. Or is that expected? In
that case the TAP test probably needs to read the state from the control
file, instead of relying on the Perl variable $data_checksum_state. Or
maybe it should assert that the control file has the expected state?
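
As a rough illustration of the control-file check suggested above, here
is a minimal sketch in Python (not the Perl used by the TAP tests). It
assumes the state shows up on the "Data page checksum version" line, which
is the label stock pg_controldata prints. How the patch renders
'inprogress-on' on that line is not verified here, so the sketch returns
the raw value and leaves the comparison to the caller. The function names
are hypothetical.

```python
import os
import subprocess

# Label printed by stock pg_controldata; assumed unchanged by the patch.
FIELD = "Data page checksum version"


def parse_checksum_field(controldata_output):
    """Return the raw value printed after the checksum label."""
    for line in controldata_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == FIELD:
            return value.strip()
    raise ValueError("no '%s' line in pg_controldata output" % FIELD)


def read_checksum_field(pgdata):
    """Run pg_controldata on a stopped cluster and extract the field."""
    env = dict(os.environ, LC_ALL="C")  # keep the labels untranslated
    result = subprocess.run(
        ["pg_controldata", "-D", pgdata],
        check=True, capture_output=True, text=True, env=env,
    )
    return parse_checksum_field(result.stdout)
```

A test could call read_checksum_field() on the standby's data directory
after the fast shutdown and only run pg_checksums once the value matches
what it expects, rather than trusting a variable tracked on the test side.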

FWIW I don't think the primary shutdown matters. I've seen several of
these failures, and it happens even without the primary shutdown. But the
standby "fast" shutdown is always there.

But this also shows a limitation of the TAP test - it never triggers the
shutdowns while flipping the checksums (in flip_data_checksums). I think
that's something worth testing.

regards

-- 
Tomas Vondra