Thread
-
Re: greenfly lwlock corruption in REL_14_STABLE and REL_15_STABLE
Thomas Munro <thomas.munro@gmail.com> — 2025-12-12T00:54:52Z
On Fri, Dec 12, 2025 at 6:27 AM Greg Burd <greg@burd.me> wrote: > On Wed, Dec 10, 2025, at 12:10 AM, Thomas Munro wrote: > > Beginning a week ago, greenfly (RISC-V, Clang 20.1) has failed like > > this in 5 of 8 runs of the pgbench tests on the two oldest branches: > > Hey Thomas, raising this. I should more closely monitor my farm animals. As greenfly is one of them and my login name is littered in the logs (gburd) I suppose I should dive into this. Thanks! Since then each branch had a non-failing run, so the new failure stats are 5 out of 10. > > I will see if I can reproduce it or see something wrong under qemu, > > but that'll take some time to set up... > > It'll take me far less time to reproduce than you. :) Yeah. I have given up, after a couple of days of grinding through REL_14_STABLE's src/bin/pgbench tests under qemu, with the same optimisation level and the clang-20 package from Debian sid + experimental. I've learned to use an ARM host when I want to weak memory order effects of other RISC-ish architectures via qemu JIT translation (well obviously not the *same* weak memory behavior but not none at all), so I would have tried that next if you weren't offering to use the real hardware... It's a bit confusing that the C does "oldstate = pg_atomic_sub_fetch_u32()" (that's the convenience wrapper that computes and returns the *new* value), but that also makes sense for how it's used in the following assertion. Assuming the state was already corrupted before we got here, I wonder if you'd learn something by asserting the expected state before the change: if (mode == LW_EXCLUSIVE) Assert(pg_atomic_read_u32(&lock->state) & LW_VAL_EXCLUSIVE); else Assert(pg_atomic_read_u32(&lock->state) & LW_SHARED_MASK); (Since we acquired the lock, a relaxed load should be guaranteed to see a new enough value?) While trying to imagine how it might have become corrupted and noting that the recent change related to a flag that has some synchronisation requirement with the non-atomic accesses to ->lwWaitXXX members declared in proc.h, I wondered if that atomic-vs-non-atomic fencing and reordering stuff might be involved, ie the waitlist spinlock is acquired and then we load MyProc->lwWaiting in LWLockDequeueSelf, and then perhaps corrupt LW_FLAG_HAS_WAITERS, or some other place like that, but I havent' found that in the code gen yet and I don't have an actual theory... If you can reproduce this, does anything change if you mark them volatile? Can you please attach the animal's lwlock.o from REL_14_STABLE? I'd like to check if I'm even looking at the right code... > I'll see what I can do to find the offending commit(s). Thanks! Could also be environmental. Maybe the snow arrived and you adjusted your thermostat? :-) But even then we'd still need to figure out what's different on those branches... One difference is autoconf vs meson. Both say -g -O2, but only Meson says -fPIC. Can you reproduce it on later branches with configure/make?