Re: Changing the state of data checksums in a running cluster
From: Tomas Vondra <tomas@vondra.me>
To: Daniel Gustafsson <daniel@yesql.se>
Cc: Michael Banck <mbanck@gmx.net>,
PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Date: 2024-11-08T00:41:10Z
Lists: pgsql-hackers
Commits
- Use correct datatype for PID: 0ca1b3010597, 19 (unreleased), landed
- Improve comments in online checksums code: cd857dec0e0a, 19 (unreleased), landed
- Fix checksum state transition during promotion: 5fee7cab1b87, 19 (unreleased), landed
- Fix regex searching for page verification failures in tests: 486b9a9b9eb4, 19 (unreleased), landed
- Apply data-checksum worker throttling parameters: 9a39056c418c, 19 (unreleased), landed
- Skip WAL for unlogged main fork during online checksum enable: 2018bd616790, 19 (unreleased), landed
- Revert "Get rid of WALBufMappingLock": c13070a27b63, 19 (unreleased), cited
- Get rid of WALBufMappingLock: bc22dc0e0ddc, 18.0, cited
- Improve grammar of options for command arrays in TAP tests: ce1b0f9da03e, 18.0, cited
Attachments
- test.sh (application/x-shellscript)
- backtraces.txt (text/plain)
Hi,

Unfortunately it seems we're not out of the woods yet :-(

I started doing some more testing on the v8 patch. My plan was to do some stress testing with physical replication, random restarts and stuff like that. But I ran into issues before that.

Attached is a reproducer script that does this:

1) initializes an instance with a small (scale 10) pgbench database
2) runs pgbench in the background, and flips checksums
3) restarts the database in fast or immediate mode
4) watches the checksums state until it reaches the expected value
5) restarts the instance

Of course, the restart interrupts the checksum enable, with this message in the log:

  WARNING: data checksums are being enabled, but no worker is running
  1731024482.102 2024-11-08 01:08:02.102 CET [267066] [startup:] [672d5660.4133a:7] [2024-11-08 01:08:00 CET] [/0] HINT: If checksums were being enabled during shutdown then processing must be manually restarted.

That's expected, of course. So I did

  SELECT pg_enable_data_checksums()

and "datachecksumsworker launcher" appeared in pg_stat_activity, but nothing else was happening. It also says:

  Waiting for worker in database template0 (pid 258442)

But there is no worker with that PID. Not in the OS, not in the view, not in the server log. Seems a bit weird. Maybe it already completed, but then why is there a launcher waiting for it?

Ultimately I tried running CHECKPOINT, and that apparently did the trick, and the instance restarted. But then on start it hits an assert:

  (LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)

But this only happens if the final stop is -m immediate. If I change it to "-m fast" it works. I haven't looked into the details, but I guess it's related to the issue with the controlfile update we dealt with about a month ago.

Attached is the test.sh file (make sure to tweak the paths), and an example of the backtraces. I've seen various processes hitting that.
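
For readers without the attachment, the five reproducer steps listed above can be sketched roughly as a command plan. This is not the attached test.sh, only an illustration under assumptions: the function name `reproducer_plan`, the database name, the data directory path, and the pgbench client count and duration are all made up, and `pg_enable_data_checksums()` exists only with the patch applied. The sketch only builds the command lines; it does not run anything.

```python
import shlex

VALID_MODES = ("fast", "immediate")


def reproducer_plan(pgdata, dbname="bench", interrupt_mode="fast",
                    final_mode="immediate"):
    """Return the reproducer steps as argv lists. Nothing is executed here.

    Hypothetical helper; all names and default values are assumptions.
    """
    for mode in (interrupt_mode, final_mode):
        if mode not in VALID_MODES:
            raise ValueError(f"unsupported shutdown mode: {mode!r}")

    def psql(sql):
        # -X: skip psqlrc, -At: unaligned tuples-only output
        return ["psql", "-X", "-At", "-d", dbname, "-c", sql]

    def restart(mode):
        return ["pg_ctl", "-D", pgdata, "-w", "-m", mode, "restart"]

    return [
        # 1) initialize an instance with a small (scale 10) pgbench database
        ["initdb", "-D", pgdata],
        ["pg_ctl", "-D", pgdata, "-w", "start"],
        ["createdb", dbname],
        ["pgbench", "-i", "-s", "10", dbname],
        # 2) pgbench load (a driver would launch this without waiting for it;
        #    client count and duration are assumptions), then flip checksums
        ["pgbench", "-c", "4", "-T", "60", dbname],
        psql("SELECT pg_enable_data_checksums()"),  # function from the patch
        # 3) restart while the enable is still in progress
        restart(interrupt_mode),
        # 4) a driver would poll this until the expected state shows up
        psql("SHOW data_checksums"),
        # 5) final restart; the assert was only seen with "immediate"
        restart(final_mode),
    ]


if __name__ == "__main__":
    for argv in reproducer_plan("/tmp/checksums-test"):
        print(shlex.join(argv))
```

Keeping the two shutdown modes as separate parameters makes it easy to compare the failing case (final stop with -m immediate) against the working one (-m fast).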

Two more comments:

* It's a bit surprising that pg_disable_data_checksums() flips the state right away, while pg_enable_data_checksums() waits for a checkpoint. I guess it's correct, but maybe the docs should mention this difference?

* The docs currently say:

  <para>
   If the cluster is stopped while in <literal>inprogress-on</literal> mode, for any reason, then this process must be restarted manually. To do this, re-execute the function <function>pg_enable_data_checksums()</function> once the cluster has been restarted. The background worker will attempt to resume the work from where it was interrupted.
  </para>

  I believe that's incorrect/misleading. There's no attempt to resume work from where it was interrupted.

regards

--
Tomas Vondra