Thread

  1. Re: Changing the state of data checksums in a running cluster

    Tomas Vondra <tomas@vondra.me> — 2026-05-04T13:16:34Z

    Hi,
    
    Thanks for getting this feature pushed, and for resolving the failures
    reported since the feature freeze. I consider this to be an important
    improvement, not just for the feature itself, but also because of all
    the useful infrastructure it added.
    
    Attached is a refined version of the TAP tests already posted by Daniel
    some time ago [1]. Unfortunately, that .txt did not apply cleanly for
    some reason, so here's a better version.
    
    I found these tests quite useful when reasoning about how the patch
    behaves in a concurrent environment (e.g. with multiple sessions
    triggering checksum enable/disable, or with a checkpoint, crashes, etc).
    
    At this point all the tests pass, but there are a couple cases with
    correct but slightly surprising behavior, worth discussing. Which is
    what this e-mail is going to be about.
    
    I'll explain what the TAP tests aim to do first, and then discuss the
    slightly surprising behavior.
    
    This is not meant for inclusion into PG19, at least not in this shape - I
    wrote those TAP tests while investigating some of the earlier failures
    and/or when wondering about behavior in various situations (sequence of
    concurrent steps, race conditions, ...). So it's more of an exhaustive
    test suite, and the tests are somewhat redundant (test N+1 is often just
    test N plus some small tweak).
    
    I can imagine distilling it into a tiny subset, and adding that. But
    that's up for discussion, and it's for later.
    
    
    Let me briefly explain what the various TAP tests aim to do. From the
    very beginning, my main concern regarding this patch was race conditions
    when updating the shared state about effective data_checksum_version.
    Because the state is effectively split into about three or four places:
    
    * LocalDataChecksumVersion (local cache)
    * XLogCtl->data_checksum_version (XLogCtl->info_lck)
    * ControlFile->data_checksum_version (ControlFileLock)
    * state in control file on disk
    
    These pieces are protected by different locks, and the protocol for
    updating and/or reading the various flags is not trivial (some of the
    fixed issues were due to ControlFile->data_checksum_version being updated
    from a place that shouldn't have updated it).
    
    So the primary goal of the TAP tests was to check for race conditions by
    leveraging injection points to step through concurrent processes in a
    deterministic way. The first couple patches (0001-0004) add debug
    logging and injection points into a lot of places. And by "a lot" I mean
    ~80 new injection points, which is about the number of injection points
    we have in master now. Anyway, this allows stepping through concurrent
    checksum changes, and also checksum change vs. checkpointer.
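    
    To illustrate the stepping idea, here is a minimal sketch in Python.
    It is only a toy stand-in, not the PostgreSQL injection point API: the
    InjectionPoints class, its methods and the point name
    "change-after-wal" are all made up for this example. One thread plays
    the checksum change and pauses at a named point, the driver runs the
    "checkpoint" step while it is paused, and then wakes it up.

```python
import threading


class InjectionPoints:
    """Toy model of named pause points (hypothetical, not the PostgreSQL API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._attached = set()   # points at which workers should pause
        self._reached = {}       # point name -> Event set when a worker arrives
        self._wakeup = {}        # point name -> Event set by the driver

    def _events(self, name):
        with self._lock:
            if name not in self._reached:
                self._reached[name] = threading.Event()
                self._wakeup[name] = threading.Event()
            return self._reached[name], self._wakeup[name]

    def attach(self, name):
        with self._lock:
            self._attached.add(name)

    def run(self, name):
        """Called by a worker; blocks only if the point is attached."""
        with self._lock:
            attached = name in self._attached
        if not attached:
            return
        reached, wakeup = self._events(name)
        reached.set()             # tell the driver we are paused here
        wakeup.wait(timeout=5)    # wait until the driver wakes us up

    def wait_reached(self, name):
        reached, _ = self._events(name)
        return reached.wait(timeout=5)

    def wakeup(self, name):
        _, wakeup = self._events(name)
        wakeup.set()


log = []
points = InjectionPoints()
points.attach("change-after-wal")


def checksum_change():
    log.append("change: write WAL record")
    points.run("change-after-wal")
    log.append("change: update shared state")


def checkpoint():
    log.append("checkpoint: read shared state")


worker = threading.Thread(target=checksum_change)
worker.start()
assert points.wait_reached("change-after-wal")   # change is now paused
checkpoint()                                     # runs while the change is paused
points.wakeup("change-after-wal")
worker.join()
print(log)
```

    The driver only proceeds once the worker has reported reaching the
    point, so the order of the three log entries is the same on every run.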
    
    Then come the actual TAP tests:
    
    1) 0005-TAP-10-concurrent-checksum-changes.patch
    
    Two concurrent checksum changes. The first one gets paused at an
    injection point, then the second one gets initiated.
    
    2) 0006-TAP-11-concurrency-with-checkpoints.patch
    
    A checksum change + checkpoint. The change gets paused at an injection
    point, a synchronous checkpoint is performed.
    
    3) 0007-TAP-12-crashes-at-injection-points.patch
    
    Similar to 0006, but with a crash + recovery. A checksum change gets
    paused at an injection point, and a synchronous checkpoint is performed.
    The change gets woken up and either completes, or pauses on a different
    injection point. Then a restart/crash happens.
    
    4) 0008-TAP-13-concurrency-with-checkpoint-REDO.patch
    
    Similar to 0007, but the checkpoint is not synchronous - happens in the
    background, so that the TAP can step through both sides and interleave
    them in an arbitrary way. This matters because the checksum change
    updates the different state pieces (XLogCtl/ControlFile), while the
    checkpointer reads them to record initial state for REDO etc.
    
    5) 0009-TAP-14-checkpoints-with-crashes.patch
    
    Similar to 0008, except that the steps are more fine grained, and
    focused on two particular cases with surprisingly different final state.
    
    
    AFAIK everything works as expected, except for two cases in the "TAP
    012" test. One for the "enabling" direction, one for the "disabling"
    direction. I'm going to discuss the "enabling" direction; I believe the
    other case is just a mirror with the same root cause.
    
    TAP 012 tests a checksum change with a concurrent checkpoint, followed
    by a crash, and checks the final state. It pauses the change at an
    injection point, does a checkpoint, proceeds to the next injection
    point, crashes and does recovery. The expectation is that the final
    state "flips" at some injection point, once the change gets far enough,
    and stays there. But what actually happens is this:
    
    a) test_checksum_transition(
        'disabled', 'enable', undef,
        'datachecksums-enable-inprogress-checksums-end',
        'datachecksums-enable-checksums-start',
        'off');
    
    b) test_checksum_transition(
        'disabled', 'enable', undef,
        'datachecksums-enable-checksums-start',
        'datachecksums-enable-checksums-after-xlog',
        'on');
    
    c) test_checksum_transition(
        'disabled', 'enable', 'datachecksums-enable-checksums-start',
        'datachecksums-enable-checksums-after-xlog',
        'datachecksums-enable-checksums-after-xlogctl',
        'off');
    
    This says that if the checkpoint happens after
    'datachecksums-enable-inprogress-checksums-end' or after
    'datachecksums-enable-checksums-after-xlog', we end up with 'off' (i.e.
    enabling checksums fails).
    
    But if the checkpoint happens after
    'datachecksums-enable-checksums-start', we end up with "on" (after
    recovery).
    
    This is a bit surprising, because that injection point is before
    'datachecksums-enable-checksums-after-xlog'. So the enabling process
    gets further and further, but the final state flips off -> on -> off,
    contradicting the expectation that it changes once.
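    
    The "changes once" expectation can be written down as a tiny check.
    This is just a sketch: count_flips is a made-up helper, and the list
    simply restates the three outcomes described above, ordered by how far
    the enabling process got before the checkpoint.

```python
def count_flips(states):
    """Count how many times consecutive entries differ."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# (injection point where the checkpoint happens, final state after recovery)
observed = [
    ("datachecksums-enable-inprogress-checksums-end", "off"),
    ("datachecksums-enable-checksums-start", "on"),
    ("datachecksums-enable-checksums-after-xlog", "off"),
]

states = [state for _, state in observed]
flips = count_flips(states)
print(flips)  # 2, while the expectation was at most 1
```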
    
    I haven't quite wrapped my head around it yet, but my understanding is
    this is due to a race condition between the checksum launcher (writing
    XLOG2_CHECKSUMS and updating the shmem state), and the checkpointer
    (reading the shmem state and generating REDO).
    
    The launcher does this sequence of steps:
    
    1) write XLOG2_CHECKSUMS with new state
    2) update XLogCtl->data_checksum_version
    3) update ControlFile->data_checksum_version
    4) UpdateControlFile()
    5) emits barrier
    
    while the checkpointer (CreateCheckPoint) does this:
    
    A) read XLogCtl->data_checksum_version (while holding insert locks)
    B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
    C) UpdateControlFile()
    
    The outcome depends on how exactly these two sequences interleave. For
    example, this can happen:
    
    1) write XLOG2_CHECKSUMS with new state
      A) read XLogCtl->data_checksum_version (while holding insert locks)
      B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
      C) UpdateControlFile()
    2) update XLogCtl->data_checksum_version
    3) update ControlFile->data_checksum_version
    4) UpdateControlFile()
    5) emits barrier
    
    Which means the XLOG_CHECKPOINT_REDO will be after XLOG2_CHECKSUMS (and
    so redo won't see it), but the checkpoint will still get the old
    checksum state from XLogCtl. And so the outcome is "off", per case (c).
    
    But the interleaving behind case (b) can also happen:
    
      A) read XLogCtl->data_checksum_version (while holding insert locks)
      B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
      C) UpdateControlFile()
    1) write XLOG2_CHECKSUMS with new state
    2) update XLogCtl->data_checksum_version
    3) update ControlFile->data_checksum_version
    4) UpdateControlFile()
    5) emits barrier
    
    In which case the REDO will have the old state, but the recovery will
    read the XLOG2_CHECKSUMS, and so end up with "on".
    
    This is the root cause of the surprising behavior in TAP 012, I think.
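    
    The two interleavings can be replayed in a toy model. This is a sketch
    under loud assumptions, not how the server is implemented: WAL is a
    Python list, recovery starts from the state the checkpoint read from
    XLogCtl and replays everything from the REDO position, the control
    file (steps 3, 4 and C) and the barrier are ignored, and the crash
    happens right after the listed steps with all WAL already durable. The
    function names and the step labels "1", "2", "AB" are made up.

```python
def recover(wal, redo_pos, checkpoint_state):
    """Start from the state recorded by the checkpoint, replay WAL from REDO."""
    state = checkpoint_state
    for kind, value in wal[redo_pos:]:
        if kind == "CHECKSUMS":
            state = value
    return state


def final_state_after_crash(schedule):
    """Run launcher/checkpointer steps in the given order, then crash and recover."""
    wal = []
    xlogctl = "off"                        # shared-memory state, lost in the crash
    redo_pos, checkpoint_state = 0, "off"
    for step in schedule:
        if step == "1":                    # launcher: write XLOG2_CHECKSUMS
            wal.append(("CHECKSUMS", "on"))
        elif step == "2":                  # launcher: update XLogCtl
            xlogctl = "on"
        elif step == "AB":                 # checkpointer: read XLogCtl, insert REDO
            checkpoint_state = xlogctl
            redo_pos = len(wal)
            wal.append(("CHECKPOINT_REDO", checkpoint_state))
        else:
            raise ValueError(f"unknown step {step!r}")
    return recover(wal, redo_pos, checkpoint_state)


# First interleaving: REDO lands after XLOG2_CHECKSUMS, but reads the old state.
print(final_state_after_crash(["1", "AB", "2"]))   # off

# Second interleaving: REDO has the old state, recovery replays XLOG2_CHECKSUMS.
print(final_state_after_crash(["AB", "1", "2"]))   # on
```

    In this model the only thing that differs between the two runs is
    whether the XLOG2_CHECKSUMS record sits before or after the REDO
    position, which is the ordering neither side controls.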
    
    I attempted to trigger these race conditions in TAP 013, but without
    much success. In the end I realized it probably needs more control,
    waiting for the other process to hit the next injection point before
    unpausing the current one. TAP 014 does that, and it shows that with the
    right interleaving of steps the (c) case can end up with both "on" and
    "off" final state.
    
    As I said, I don't claim I fully understand this yet. But I wouldn't
    call this a "bug" - AFAICS it won't produce an incorrect final state (I
    haven't seen any such cases).
    
    Still, I wonder if there's a potential issue I failed to notice.
    
    
    The other question I had when looking at this (concurrency with
    checkpoints) is what we get by doing
    
         MyProc->delayChkptFlags |= DELAY_CHKPT_START;
    
    whenever updating the state in SetDataChecksums... functions. Because
    the only thing that guarantees is the updates happen on one side of the
    checkpoint record. What does that give us, actually?
    
    It does not seem to prevent this surprising behavior, and it does not
    say the XLOG2_CHECKSUMS happens before/after the XLOG_CHECKPOINT_REDO.
    
    
    regards
    
    [1]
    https://www.postgresql.org/message-id/9197F930-DDEB-4CAC-82A2-16FEC715CCE8%40yesql.se
    
    -- 
    Tomas Vondra