Re: Changing the state of data checksums in a running cluster

Tomas Vondra <tomas@vondra.me>

From: Tomas Vondra <tomas@vondra.me>

To: Daniel Gustafsson <daniel@yesql.se>

Cc: Michael Paquier <michael@paquier.xyz>, Michael Banck <mbanck@gmx.net>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>

Date: 2025-03-14T12:20:37Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Use correct datatype for PID
- 0ca1b3010597 19 (unreleased) landed
Improve comments in online checksums code
- cd857dec0e0a 19 (unreleased) landed
Fix checksum state transition during promotion
- 5fee7cab1b87 19 (unreleased) landed
Fix regex searching for page verification failures in tests
- 486b9a9b9eb4 19 (unreleased) landed
Apply data-checksum worker throttling parameters
- 9a39056c418c 19 (unreleased) landed
Skip WAL for unlogged main fork during online checksum enable
- 2018bd616790 19 (unreleased) landed
Revert "Get rid of WALBufMappingLock"
- c13070a27b63 19 (unreleased) cited
Get rid of WALBufMappingLock
- bc22dc0e0ddc 18.0 cited
Improve grammar of options for command arrays in TAP tests
- ce1b0f9da03e 18.0 cited

On 3/14/25 00:11, Tomas Vondra wrote:
> ...
>>>>>> One issue I ran into is the postmaster does not seem to be processing
>>>>>> the barriers, and thus not getting info about the data_checksum_version
>>>>>> changes.
>>>>>
>>>>> Makes sense, that seems like a pretty reasonable constraint for the barrier.
>>>>
>>>> Not sure I follow. What's a reasonable constraint?
>>>
>>> That the postmaster deosn't process them.
>>>
>>
>> OK, that means we need a way to "refresh" the value for new child
>> processses, similar to what my patch does. But I suspect there might be
>> a race condition - if the child process starts while processing the
>> XLOG_CHECKUMS record, it might happen to get the new value and then also
>> the barrier (if it does the "refresh" in between the XLogCtl update and
>> the barrier). Doesn't this need some sort of interlock, preventing this?
>>
>> The child startup would need to do this:
>>
>> 1) acquire lock
>> 2) reset barriers
>> 3) refresh the LocalDataChecksumValue (from XLogCtl)
>> 4) release lock
>>
>> while the walreceiver would do this
>>
>> 1) acquire lock
>> 2) update XLogCtl value
>> 3) emit barrier
>> 4) release lock
>>
>> Or is there a reason why this would be unnecessary?
>>
> 
> I still think this might be a problem. I wonder if we could maybe
> leverage the barrier generation, to detect that we don't need to process
> this barrier, because we already got the value directly ...
> 
> FWIW we'd have this problem even if postmaster was processing barriers,
> because there'd always be a "gap" between the fork and ProcSignalInit()
> registering the new process into the procsignal array.
> 

I experimented with this a little bit, and unfortunately I ran into not
one, but two race conditions in this :-( I don't have reproducers, all
of this was done by manually adding sleep() calls / gdb breakpoints to
pause the processes for a while, but I'll try to explain what/why ...

1) race #1: SetDataChecksumsOn

The function (and all the other "SetDataChecksums" funcs) does this

    SpinLockAcquire(&XLogCtl->info_lck);
    XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
    SpinLockRelease(&XLogCtl->info_lck);

    barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);

Now, imagine there's a sleep() before the EmitProcSignalBarrier. A new
process may start during that, and it'll read the current checksum value
from XLogCtl. And then the SetDataChecksumsOn() wakes up, and emits the
barrier. So far so good.

But the new backend is already registered in ProcSignal, so it'll get
the barrier too, and will try to set the local version to "on" again.
And kaboom - that hits the assert in AbsorbChecksumsOnBarrier():

    Assert(LocalDataChecksumVersion ==
           PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);

The other "SetDataChecksums" have the same issue, except that in those
cases there are no asserts to trip. Only AbsorbChecksumsOnBarrier() has
such assert to check the state transition.

This is "ephemeral" in the sense that setting the value to "on" again
would be harmless, and indeed a non-assert build will run just fine.

2) race #2: InitPostgres

The InitPostgres does this:

    InitLocalControldata();

    ProcSignalInit(MyCancelKeyValid, MyCancelKey);

where InitLocalControldata gets the current checksum value from XLogCtl,
and ProcSignalInit registers the backend into the procsignal (which is
what barriers are based on).

Imagine there's a sleep() between these two calls, and the cluster does
not have checksums enabled. A backend will start, will read "off" from
XLogCtl, and then gets stuck on the sleep before it gets added to the
procsignal/barrier array.

Now, we enable checksums, and the instance goes through 'inprogress-on'
and 'on' states. This completes, and the backend wakes up and registers
itself into procsignal - but it won't get any barriers, of course.

So we end up with an instance with data_checksums="on", but this one
backend still believes data_checksums="on". This can cause a lot of
trouble, because it won't write blocks with checksums. I.e. this is
persistent data corruption.

I have been thinking about how to fix this. One way would be to
introduce some sort of locking, so that the two steps (update of the
XLogCtl version + barrier emit) and (local flag init + procsignal init)
would always happen atomically. So, something like this:

    SpinLockAcquire(&XLogCtl->info_lck);
    XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
    barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
    SpinLockRelease(&XLogCtl->info_lck);

and

    SpinLockAcquire(&XLogCtl->info_lck);
    InitLocalControldata();
    ProcSignalInit(MyCancelKeyValid, MyCancelKey);
    SpinLockRelease(&XLogCtl->info_lck);

But that seems pretty heavy-handed, it's definitely much more work while
holding a spinlock than I'm comfortable with, and I wouldn't be
surprised if there were deadlock cases etc. (FWIW I believe it needs to
use XLogCtl->info_lck, to make the value consistent with checkpoints.)

Anyway, I think a much simpler solution would be to reorder InitPostgres
like this:

    ProcSignalInit(MyCancelKeyValid, MyCancelKey);

    InitLocalControldata();

i.e. to first register into procsignal, and then read the new value.
AFAICS this guarantees we won't lose any checksum version updates. It
does mean we still can get a barrier for a value we've already seen, but
I think we should simply ignore this for the very first update.

Opinions? Other ideas how to fix this?

regards

-- 
Tomas Vondra