Re: Non-reproducible AIO failure

Konstantin Knizhnik <knizhnik@garret.ru>

From: Konstantin Knizhnik <knizhnik@garret.ru>
To: Andres Freund <andres@anarazel.de>, Nico Williams <nico@cryptonector.com>
Cc: Thomas Munro <thomas.munro@gmail.com>, Alexander Lakhin <exclusion@gmail.com>, Daniel Gustafsson <daniel@yesql.se>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>, rmt@lists.postgresql.org
Date: 2025-08-26T13:59:54Z
Lists: pgsql-hackers
On 26/08/2025 3:37 AM, Andres Freund wrote:
> Hi,
>
> I'm a bit confused by this focus on bitfields - both Alexander and Konstantin
> stated they could reproduce the issue without the bitfields.


Sorry if I am not correct, but it seems that the problem was never 
reproduced with replaced bitfields.
I have once again looked through all this thread (it is really long;) 
and find just one place when you wrote, that you was able to reproduce 
the problem without bitfields. But as far as understand it was the 
different problem which was fixed by your patch 
v1-0001-aio-Add-missing-memory-barrier-when-waiting-for-I.patch 
<https://www.postgresql.org/message-id/attachment/177619/v1-0001-aio-Add-missing-memory-barrier-when-waiting-for-I.patch>

Still it is not quite clear to me how bitfields can cause this issue.
Yes, fitfields are updated all together - when you update one field, all 
other bitfields are also rewritten.
But actually compiler (at least some versions) does quite 
straightforward thing (especially without optimization):
1. Loads all bitfields.
2. Update required bitfield.
3. Stores all bitfields.

It is very unlikely that some stale version is stored in registers - 
without -O optimizer is using registers only locally and doesn't try to 
keep some variables in registers. Also not clear how we can read some 
stale value from memory if we perform write barrier before updating 
status (in pgaio_io_update_state).
It it impossible that `op` and `state` fields belongs to different cache 
lines because PgAioHandle is aligned on 8. So doesn;t matter whether 
there are bitfields or bytes, this fields should be propagated between 
cores all together.
Certainly, extraction of bitfields is not an atomic operations (as we 
see in assembler, it is done using two loads).
So we can observe inconsistent state of this two fields.

Let's consider fields `op` and `state`. Let's their initial state is 
`op=1`m `state=1`.
The we do update `op=2` and call pgaio_io_update_state(2). Some other 
task can see (op=1, state=1)  or (op=2, state=1) or (op=2, state=2) but 
not (op=1, state=2).

The problem can happen only if some other task tries to update `op` 
without prior checking for `state`.
In this case we can get (op=3,state=1) as well as (op=2,state=2).
But `op` field is updated in just two places: pgaio_io_reclaim and 
pgaio_io_stage. And in both cases we first check 'state' value.

What makes me concern is what happen in case of branch misprediction.
Assume that we have code:

if (state == 1) {
     op = 2;
} else {
     op = 3;
}

It seems to be absolutely legal, but what will happen in case if with 
speculative execution processor start executing first branch and does 
assignment op=2, but then state is loaded and used to be not equal to 1, 
so processor has to undone this execution. But for some moment we we can 
have state (state=0 and op=2) which can be observed by other task and 
cause assertion failure.
Is it really possible?

> But we have observed the generated code being pretty grotty and it's caused
> more than enough confusion - so let's just replace them with plain uint8's and
> cast in switches.

+1

May be I am wrong, but it seems to me that after 
add-moissing-memory-barrier patch was applied nobody reproduced 
assertion failure with replaced bitfields.