Re: Changing the state of data checksums in a running cluster

From: Tomas Vondra <tomas@vondra.me>

To: Daniel Gustafsson <daniel@yesql.se>

Cc: Bernd Helmle <mailings@oopsware.de>, Michael Paquier <michael@paquier.xyz>, Michael Banck <mbanck@gmx.net>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>

Date: 2025-08-25T23:06:24Z

Lists: pgsql-hackers

Commits

Use correct datatype for PID
- 0ca1b3010597 19 (unreleased) landed
Improve comments in online checksums code
- cd857dec0e0a 19 (unreleased) landed
Fix checksum state transition during promotion
- 5fee7cab1b87 19 (unreleased) landed
Fix regex searching for page verification failures in tests
- 486b9a9b9eb4 19 (unreleased) landed
Apply data-checksum worker throttling parameters
- 9a39056c418c 19 (unreleased) landed
Skip WAL for unlogged main fork during online checksum enable
- 2018bd616790 19 (unreleased) landed
Revert "Get rid of WALBufMappingLock"
- c13070a27b63 19 (unreleased) cited
Get rid of WALBufMappingLock
- bc22dc0e0ddc 18.0 cited
Improve grammar of options for command arrays in TAP tests
- ce1b0f9da03e 18.0 cited

Attachments

checksums-fixes.patch (text/x-patch) patch

On 8/25/25 20:32, Daniel Gustafsson wrote:
>> On 20 Aug 2025, at 16:37, Tomas Vondra <tomas@vondra.me> wrote:
> 
>> This happens quite regularly, it's not hard to hit. But I've only seen
>> it to happen on a FSM, and only right after immediate shutdown. I don't
>> think that's quite expected.
>>
>> I believe the built-in TAP tests (with injection points) can't catch
>> this, because there's no concurrent activity while flipping checksums
>> on/off. It'd be good to do something like that, by running pgbench in
>> the background, or something like that.
> 
> In searching for this bug I opted for implementing a version of the stress
> tests as a TAP test, see 006_concurrent_pgbench.pl in the attached patch
> version.  It's gated behind PG_TEST_EXTRA since it's clearly not something
> which can be enabled by default (if this goes in this need to be re-done to
> provide two levels IMO, but during testing this is more convenient).  I'm
> curious to see which improvements you can think to make it stress the code to
> the breaking point.
> 

I think this TAP test looks very nice, but there are a couple of issues
with it. See the attached patch fixing those.

1) I think test_checksums should be listed in src/test/modules/Makefile?

2) The test_checksums/Makefile didn't seem to work for me, I was getting

Makefile:23: *** recipe commences before first target.  Stop.

That was because of a missing "\", so I had to fix that. Then it was
complaining about Makefile.global or something, so I fixed that by
cargo-culting what other Makefiles in test modules do. Now it seems to
work for me. I guess you're on meson?

3) I'm no Perl expert, but AFAICS the test wasn't really running
pgbench, for a couple of reasons. It was passing "-q" to pgbench, but
that option only applies to initialization. The clusters had
max_connections=10, but pgbench was using "-c 10", so I was getting
"too many connections". It was not setting "$pgbench_running = 1", so
the other loops were getting "too many connections" too. Another thing
is I'm not sure it's OK to pass '' to IPC::Run::start, I think it'll be
passed on as an (empty) argument, confusing pgbench.
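
A minimal sketch of the argument checks described in 3), written in Python rather than the test's Perl and not taken from the attached patch. The helper name build_pgbench_cmd and the headroom parameter are hypothetical. It assumes the test wants to keep at least one connection free for its own queries, and that empty strings in the argument list should be dropped before the command is started.

```python
def build_pgbench_cmd(clients, max_connections, duration_s, dbname,
                      extra_args=(), headroom=1):
    """Build an argv list for a pgbench run (hypothetical helper)."""
    # Refuse a client count that leaves no connections for the test itself.
    if clients + headroom > max_connections:
        raise ValueError(
            f"{clients} clients + {headroom} spare exceeds "
            f"max_connections={max_connections}"
        )
    argv = ["pgbench", "-c", str(clients), "-T", str(duration_s)]
    # Drop empty strings: an '' element would reach pgbench as a real argument.
    argv.extend(arg for arg in extra_args if arg != "")
    argv.append(dbname)
    return argv


if __name__ == "__main__":
    print(build_pgbench_cmd(5, 10, 30, "postgres", extra_args=["", "-n"]))
```

With max_connections=10, asking for 10 clients raises an error up front instead of failing later with "too many connections".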

With these changes it runs for me, and I even saw some

   LOG: page verification failed

in tmp_check/log/006_concurrent_pgbench_standby_1.log. But it takes a
while - a couple minutes, maybe? I think I saw it at

    t/006_concurrent_pgbench.pl .. 427/?

or something like that. I think the bash version did a couple things
differently, which might make the failures more frequent (but it's just
a wild guess).

In particular, I think the bash script restarts the two nodes
independently, while the TAP test always stops both primary and standby,
in this order. I think it'd be useful for the TAP test to restart either
one node or both.

The other thing is the bash script added some random delays/sleep, which
increases the test duration, but it also means generating somewhat
random amounts of data, etc. It also randomized some other stuff (scale,
client count, ...). But that can wait.
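
The randomization described above (which nodes to restart, sleeps, scale, client count) could be sketched roughly as follows. This is an illustration in Python, not the bash script or the TAP test. The function name plan_iteration, the choice of stop modes, and all numeric ranges are assumptions made up for the example. A seeded generator is used so that a failing run can be replayed.

```python
import random

# Assumed option sets; the real script may use different ones.
STOP_MODES = ("fast", "immediate")
NODE_CHOICES = (
    ("primary",),
    ("standby",),
    ("primary", "standby"),
    ("standby", "primary"),
)


def plan_iteration(rng):
    """Pick randomized settings for one stress-test iteration (hypothetical)."""
    return {
        # Restart one node, or both in either order.
        "restart_nodes": rng.choice(NODE_CHOICES),
        "stop_mode": rng.choice(STOP_MODES),
        # Random pause, so each iteration generates a different amount of data.
        "sleep_s": round(rng.uniform(0.0, 10.0), 1),
        "scale": rng.randint(1, 50),
        "clients": rng.randint(1, 8),
    }


if __name__ == "__main__":
    rng = random.Random(12345)  # fixed seed makes the sequence reproducible
    for _ in range(3):
        print(plan_iteration(rng))
```

Logging the seed at the start of a run would make it possible to repeat the same sequence of restarts when a "page verification failed" message shows up.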

regards

-- 
Tomas Vondra