Thread

Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me> — 2026-05-04T13:16:34Z
Hi,

Thanks for getting this feature pushed, and for resolving the failures
reported since the feature freeze. I consider this to be an important
improvement, not just for the feature itself, but also because of all
the useful infrastructure it added.

Attached is a refined version of the TAP tests already posted by Daniel
some time ago [1]. Unfortunately, that .txt did not apply cleanly for
some reason, so here's a better version.

I found these tests quite useful when reasoning about how the patch
behaves in concurrent environment (e.g. with multiple sessions
triggering checksum enable/disable, or with a checkpoint, crashes, etc).

At this point all the tests pass, but there are a couple cases with
correct but slightly surprising behavior, worth discussing. Which is
what this e-mail is going to be about.

I'll explain what the TAP tests aim to do first, and then discuss the
slightly surprising behavior.

It's not meant for inclusion into PG19, at least not in this shape - I
wrote those TAP tests while investigating some of the earlier failures
and/or when wondering about behavior in various situations (sequence of
concurrent steps, race conditions, ...). So it's more of an exhaustive,
and the tests are somewhat redundant (N+1 is often just (N + some small
tweak)).

I can imagine distilling it into a tiny subset, and adding that. But
that's up to discussion. But that's for later.


Let me briefly explain what the various TAP tests aim to do. From the
very beginning, my main concern regarding this patch was race conditions
when updating the shared state about effective data_checksum_version.
Because the state is effectively split into about three or four places:

* LocalDataChecksumVersion (local cache)
* XLogCtl->data_checksum_version (XLogCtl->info_lck)
* ControlFile->data_checksum_version (ControlFileLock)
* state in control file on disk

These pieces are protected by different locks, the protocol for updating
and/or reading the various flags is not trivial (and some of the fixed
issues were due to ControlFile->data_checksum_version being updated from
a place that shouldn't have).

So the primary goal of the TAP tests was to check for race conditions by
leveraging injection points to step through concurrent processes in a
deterministic way. The first couple patches (0001-0004) add debug
logging and injection points into a lot of places. And by "a lot" I mean
~80 new injection points, which is about the number of injection points
we have in master now. Anyway, this allows stepping through concurrent
checksum changes, and also checksum change vs. checkpointer.

Then come the actual TAP tests:

1) 0005-TAP-10-concurrent-checksum-changes.patch

Two concurrent checksum changes. The first one gets paused at an
injection point, then the second one gets initiated.

2) 0006-TAP-11-concurrency-with-checkpoints.patch

A checksum change + checkpoint. The change gets paused at an injection
point, a synchronous checkpoint is performed.

3) 0007-TAP-12-crashes-at-injection-points.patch

Similar to 0006, but with a crash + recovery. A checksum change gets
paused at an injection point, a synchronous checkpoint is performed. The
changes gets wpken up and either completes, or pauses on a different
injection point. A restart/crash happens.

4) 0008-TAP-13-concurrency-with-checkpoint-REDO.patch

Similar to 0007, but the checkpoint is not synchronous - happens in the
background, so that the TAP can step through both sides and interleave
them in an arbitrary way. This matters because the checksum change
updates the different state pieces (XlogCtl/ControlFile), while the
checkpointer reads them to record initial state for REDO etc.

5) 0009-TAP-14-checkpoints-with-crashes.patch

Similar to 0008, except that the steps are more fine grained, and
focused on two particular cases with surprisingly different final state.


AFAIK everything works as expected, except for two cases in the "TAP
012" test. One for the "enabling" direction, one for the "disabling"
direction. I'm going to discuss the "enabling" direction, I believe the
other case is just a mirror with the same root cause.

The TAP 012 tests checksum change with a concurrent checkpoint, followed
by a crash, and tests the final state. It pauses the change at an
injection point, does a checkpoint, proceeds to the next injection
point, crashes and does recovery. The expectation is that the final
state "flips" at some injection point, once it gets further enough, and
stays there. But what actually happens is this:

a) test_checksum_transition(
    'disabled', 'enable', undef,
    'datachecksums-enable-inprogress-checksums-end',
    'datachecksums-enable-checksums-start',
    'off');

b) test_checksum_transition(
    'disabled', 'enable', undef,
    'datachecksums-enable-checksums-start',
    'datachecksums-enable-checksums-after-xlog',
    'on');

c) test_checksum_transition(
    'disabled', 'enable', 'datachecksums-enable-checksums-start',
    'datachecksums-enable-checksums-after-xlogs',
    'datachecksums-enable-checksums-after-xlogctl',
    'off');

This says that if the checkpoint happens after
'datachecksums-enable-inprogress-checksums-end' or after
'datachecksums-enable-checksums-after-xlog', we end up with 'off' (i.e.
enabling checksums fails).

But if the checkpoint happens after
'datachecksums-enable-checksums-start', we end up with "on" (after
recovery).

This is a bit surprising, because that injection point is before
'datachecksums-enable-checksums-after-xlog'. So the enabling process
gets further and further, but the final state flips off -> on -> off,
contradicting the expectation that it changes once.

I haven't quite wrapped my head around it yet, but my understanding is
this is due to a race condition between the checksum launcher (writing
XLOG2_CHECKSUMS and updating the shmem state), and the checkpointer
(reading the shmem state and generating REDO).

The launcher does this sequence of steps:

1) write XLOG2_CHECKSUMS with new state
2) update XLogCtl->data_checksum_version
3) update ControlFile->data_checksum_version
4) UpdateControlFile()
5) emits barrier

while the checkpointer (CreateCheckPoint) does this:

A) read XLogCtl->data_checksum_version (while holding insert locks)
B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
C) UpdateControlFile()

The outcome depends on how exactly these two sequences interleave. For
example, this can happen:

1) write XLOG2_CHECKSUMS with new state
  A) read XLogCtl->data_checksum_version (while holding insert locks)
  B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
  C) UpdateControlFile()
2) update XLogCtl->data_checksum_version
3) update ControlFile->data_checksum_version
4) UpdateControlFile()
5) emits barrier

Which means the XLOG_CHECKPOINT_REDO will be after XLOG2_CHECKSUMS (and
so redo won't see it), but the checkpoint will still get the old
checksum state from XLogCtl. And so the outcome is "off", per case (c).

But it can also happen what case (b) does:

  A) read XLogCtl->data_checksum_version (while holding insert locks)
  B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
  C) UpdateControlFile()
1) write XLOG2_CHECKSUMS with new state
2) update XLogCtl->data_checksum_version
3) update ControlFile->data_checksum_version
4) UpdateControlFile()
5) emits barrier

In which case the REDO will have the old state, but the recovery will
read the XLOG2_CHECKSUMS, and so end up with "on".

This is the root cause of the surprising behavior in TAP 012, I think.

I attempted to trigger these race conditions in TAP 013, but without
much success. In the end I realized it probably needs more control,
waiting for the other process to hit the next injection point before
unpausing the current one. TAP 014 does that, and it shows that with the
right interleaving of steps the (c) case can end up with both "on" and
"off" final state.

As I said, I don't claim I fully understand this yet. But I wouldn't
call this "bug" - AFAICS it won't produce an incorrect final state (I
haven't seen any such cases).

Still, I wonder if there's a potential issue I failed to notice.


The other question I had when looking at this (concurrency with
checkpoints) is what we get by doing

     MyProc->delayChkptFlags |= DELAY_CHKPT_START;

whenever updating the state in SetDataChecksums... functions. Because
the only thing that guarantees is the updates happen on one side of the
checkpoint record. What does that give us, actually?

It does not seem to prevent this surprising behavior, and it does not
say the XLOG2_CHECKSUMS happens before/after the XLOG_CHECKPOINT_REDO.


regards

[1]
https://www.postgresql.org/message-id/9197F930-DDEB-4CAC-82A2-16FEC715CCE8%40yesql.se

-- 
Tomas Vondra