Re: Changing the state of data checksums in a running cluster

From: Tomas Vondra <tomas@vondra.me>
To: Daniel Gustafsson <daniel@yesql.se>
Cc: Bernd Helmle <mailings@oopsware.de>, Michael Paquier <michael@paquier.xyz>, Michael Banck <mbanck@gmx.net>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Date: 2025-08-27T09:39:35Z
Lists: pgsql-hackers

Commits

  1. Use correct datatype for PID
  2. Improve comments in online checksums code
  3. Fix checksum state transition during promotion
  4. Fix regex searching for page verification failures in tests
  5. Apply data-checksum worker throttling parameters
  6. Skip WAL for unlogged main fork during online checksum enable
  7. Revert "Get rid of WALBufMappingLock"
  8. Get rid of WALBufMappingLock
  9. Improve grammar of options for command arrays in TAP tests

On 8/27/25 10:30, Daniel Gustafsson wrote:
>> On 26 Aug 2025, at 01:06, Tomas Vondra <tomas@vondra.me> wrote:
> 
>> I think this TAP looks very nice, but there's a couple issues with it.
>> See the attached patch fixing those.
> 
> Thanks, I have incorporated (most of) your patch in the attached.  I did keep
> the PG_TEST_EXTRA check for injection points though which I assume were removed
> out of mistake.
> 

Yes, that was a mistake.

>> With these changes it runs for me, and I even saw some
>>
>>   LOG: page verification failed
>>
>> in tmp_check/log/006_concurrent_pgbench_standby_1.log. But it takes a
>> while - a couple minutes, maybe? I think I saw it at
>>
>>    t/006_concurrent_pgbench.pl .. 427/?
> 
> That's very interesting, I have been running it to timeout several times in a
> row without hitting any verification failures.  Will keep running.
> 

Just to be clear - I don't see any pg_checksums failures either. I only
see failures in the standby log, and I don't think the script checks
that (it probably should).
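
To illustrate the kind of check I mean, here is a minimal sketch that scans
a server log for the message quoted above. It is only an assumption about how
such a check might look: the function name is hypothetical, it matches on the
message text alone (the prefix depends on log_line_prefix), and it is written
in Python purely for illustration, not as the TAP test's actual code.

```python
# Message text taken from the standby log line quoted earlier in this mail.
FAILURE_MESSAGE = "page verification failed"


def find_page_verification_failures(log_text):
    """Return (line_number, line) pairs for log lines reporting a failure.

    Hypothetical helper: line numbers are 1-based, and matching is a plain
    substring test so it does not depend on the log line prefix.
    """
    return [
        (lineno, line)
        for lineno, line in enumerate(log_text.splitlines(), start=1)
        if FAILURE_MESSAGE in line
    ]


def check_log_file(path):
    """Read a log file and return its failures (empty list means clean)."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return find_page_verification_failures(fh.read())
```

Something like check_log_file() could be pointed at
tmp_check/log/006_concurrent_pgbench_standby_1.log after each round, and the
test would fail if the returned list is not empty.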

>> or something like that. I think the bash version did a couple things
>> differently, which might make the failures more frequent (but it's just
>> a wild guess).
>>
>> In particular, I think the script restarts the two nodes independently,
>> while the TAP always stops both primary and standby, in this order. I
>> think it'd be useful to restart one or both.
> 
> Done in the attached, it will now randomly stop one or both or none.  If the
> node is stopped I've added an offline pg_checksums step to validate the
> datafiles as a why-not test.
> 
>> The other thing is the bash script added some random delays/sleep, which
>> increases the test duration, but it also means generating somewhat
>> random amounts of data, etc. It also randomized some other stuff (scale,
>> client count, ...). But that can wait.
> 
> Added as well in a few places, maybe more can be sprinkled in.
> 

Thanks. I'll take a look.


regards

-- 
Tomas Vondra