Thread

Re: greenfly lwlock corruption in REL_14_STABLE and REL_15_STABLE

Thomas Munro <thomas.munro@gmail.com> — 2025-12-12T00:54:52Z
On Fri, Dec 12, 2025 at 6:27 AM Greg Burd <greg@burd.me> wrote:
> On Wed, Dec 10, 2025, at 12:10 AM, Thomas Munro wrote:
> > Beginning a week ago, greenfly (RISC-V, Clang 20.1) has failed like
> > this in 5 of 8 runs of the pgbench tests on the two oldest branches:
>
> Hey Thomas, raising this.  I should more closely monitor my farm animals.  As greenfly is one of them and my login name is littered in the logs (gburd) I suppose I should dive into this.

Thanks!  Since then each branch had a non-failing run, so the new
failure stats are 5 out of 10.

> > I will see if I can reproduce it or see something wrong under qemu,
> > but that'll take some time to set up...
>
> It'll take me far less time to reproduce than you. :)

Yeah.  I have given up, after a couple of days of grinding through
REL_14_STABLE's src/bin/pgbench tests under qemu, with the same
optimisation level and the clang-20 package from Debian sid +
experimental.  I've learned to use an ARM host when I want to weak
memory order effects of other RISC-ish architectures via qemu JIT
translation (well obviously not the *same* weak memory behavior but
not none at all), so I would have tried that next if you weren't
offering to use the real hardware...

It's a bit confusing that the C does "oldstate =
pg_atomic_sub_fetch_u32()" (that's the convenience wrapper that
computes and returns the *new* value), but that also makes sense for
how it's used in the following assertion.

Assuming the state was already corrupted before we got here, I wonder
if you'd learn something by asserting the expected state before the
change:

if (mode == LW_EXCLUSIVE)
    Assert(pg_atomic_read_u32(&lock->state) & LW_VAL_EXCLUSIVE);
else
    Assert(pg_atomic_read_u32(&lock->state) & LW_SHARED_MASK);

(Since we acquired the lock, a relaxed load should be guaranteed to
see a new enough value?)

While trying to imagine how it might have become corrupted and noting
that the recent change related to a flag that has some synchronisation
requirement with the non-atomic accesses to ->lwWaitXXX members
declared in proc.h, I wondered if that atomic-vs-non-atomic fencing
and reordering stuff might be involved, ie the waitlist spinlock is
acquired and then we load MyProc->lwWaiting in LWLockDequeueSelf, and
then perhaps corrupt LW_FLAG_HAS_WAITERS, or some other place like
that, but I havent' found that in the code gen yet and I don't have an
actual theory...  If you can reproduce this, does anything change if
you mark them volatile?

Can you please attach the animal's lwlock.o from REL_14_STABLE?  I'd
like to check if I'm even looking at the right code...

> I'll see what I can do to find the offending commit(s).

Thanks!  Could also be environmental.  Maybe the snow arrived and you
adjusted your thermostat? :-)  But even then we'd still need to figure
out what's different on those branches...  One difference is autoconf
vs meson.  Both say -g -O2, but only Meson says -fPIC.  Can you
reproduce it on later branches with configure/make?