Re: Changing the state of data checksums in a running cluster
From: Tomas Vondra <tomas@vondra.me>
To: Daniel Gustafsson <daniel@yesql.se>
Cc: Bernd Helmle <mailings@oopsware.de>, Michael Paquier
<michael@paquier.xyz>, Michael Banck <mbanck@gmx.net>,
PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Date: 2025-08-27T09:39:35Z
Lists: pgsql-hackers
Commits
- Use correct datatype for PID: 0ca1b3010597, 19 (unreleased), landed
- Improve comments in online checksums code: cd857dec0e0a, 19 (unreleased), landed
- Fix checksum state transition during promotion: 5fee7cab1b87, 19 (unreleased), landed
- Fix regex searching for page verification failures in tests: 486b9a9b9eb4, 19 (unreleased), landed
- Apply data-checksum worker throttling parameters: 9a39056c418c, 19 (unreleased), landed
- Skip WAL for unlogged main fork during online checksum enable: 2018bd616790, 19 (unreleased), landed
- Revert "Get rid of WALBufMappingLock": c13070a27b63, 19 (unreleased), cited
- Get rid of WALBufMappingLock: bc22dc0e0ddc, 18.0, cited
- Improve grammar of options for command arrays in TAP tests: ce1b0f9da03e, 18.0, cited

On 8/27/25 10:30, Daniel Gustafsson wrote:
>> On 26 Aug 2025, at 01:06, Tomas Vondra <tomas@vondra.me> wrote:
>
>> I think this TAP looks very nice, but there's a couple issues with it.
>> See the attached patch fixing those.
>
> Thanks, I have incorporated (most of) your patch in the attached. I did keep
> the PG_TEST_EXTRA check for injection points though which I assume were removed
> out of mistake.
>

Yes, that was a mistake.

>> With these changes it runs for me, and I even saw some
>>
>>   LOG: page verification failed
>>
>> in tmp_check/log/006_concurrent_pgbench_standby_1.log. But it takes a
>> while - a couple minutes, maybe? I think I saw it at
>>
>>   t/006_concurrent_pgbench.pl .. 427/?
>
> That's very interesting, I have been running it to timeout several times in a
> row without hitting any verification failures. Will keep running.
>

Just to be clear - I don't see any pg_checksums failures either. I only
see failures in the standby log, and I don't think the script checks
that (it probably should).

>> or something like that. I think the bash version did a couple things
>> differently, which might make the failures more frequent (but it's just
>> a wild guess).
>>
>> In particular, I think the script restarts the two nodes independently,
>> while the TAP always stops both primary and standby, in this order. I
>> think it'd be useful to restart one or both.
>
> Done in the attached, it will now randomly stop one or both or none. If the
> node is stopped I've added an offline pg_checksum step to validate the
> datafiles as a why-not test.
>
>> The other thing is the bash script added some random delays/sleep, which
>> increases the test duration, but it also means generating somewhat
>> random amounts of data, etc. It also randomized some other stuff (scale,
>> client count, ...). But that can wait.
>
> Added as well in a few places, maybe more can be sprinkled in.
>

Thanks. I'll take a look.

regards

--
Tomas Vondra