Thread

  1. Re: greenfly lwlock corruption in REL_14_STABLE and REL_15_STABLE

    Thomas Munro <thomas.munro@gmail.com> — 2025-12-12T00:54:52Z

    On Fri, Dec 12, 2025 at 6:27 AM Greg Burd <greg@burd.me> wrote:
    > On Wed, Dec 10, 2025, at 12:10 AM, Thomas Munro wrote:
    > > Beginning a week ago, greenfly (RISC-V, Clang 20.1) has failed like
    > > this in 5 of 8 runs of the pgbench tests on the two oldest branches:
    >
    > Hey Thomas, raising this.  I should more closely monitor my farm animals.  As greenfly is one of them and my login name is littered in the logs (gburd) I suppose I should dive into this.
    
    Thanks!  Since then each branch had a non-failing run, so the new
    failure stats are 5 out of 10.
    
    > > I will see if I can reproduce it or see something wrong under qemu,
    > > but that'll take some time to set up...
    >
    > It'll take me far less time to reproduce than you. :)
    
    Yeah.  I have given up, after a couple of days of grinding through
    REL_14_STABLE's src/bin/pgbench tests under qemu, with the same
    optimisation level and the clang-20 package from Debian sid +
    experimental.  I've learned to use an ARM host when I want to weak
    memory order effects of other RISC-ish architectures via qemu JIT
    translation (well obviously not the *same* weak memory behavior but
    not none at all), so I would have tried that next if you weren't
    offering to use the real hardware...
    
    It's a bit confusing that the C does "oldstate =
    pg_atomic_sub_fetch_u32()" (that's the convenience wrapper that
    computes and returns the *new* value), but that also makes sense for
    how it's used in the following assertion.
    
    Assuming the state was already corrupted before we got here, I wonder
    if you'd learn something by asserting the expected state before the
    change:
    
    if (mode == LW_EXCLUSIVE)
        Assert(pg_atomic_read_u32(&lock->state) & LW_VAL_EXCLUSIVE);
    else
        Assert(pg_atomic_read_u32(&lock->state) & LW_SHARED_MASK);
    
    (Since we acquired the lock, a relaxed load should be guaranteed to
    see a new enough value?)
    
    While trying to imagine how it might have become corrupted and noting
    that the recent change related to a flag that has some synchronisation
    requirement with the non-atomic accesses to ->lwWaitXXX members
    declared in proc.h, I wondered if that atomic-vs-non-atomic fencing
    and reordering stuff might be involved, ie the waitlist spinlock is
    acquired and then we load MyProc->lwWaiting in LWLockDequeueSelf, and
    then perhaps corrupt LW_FLAG_HAS_WAITERS, or some other place like
    that, but I havent' found that in the code gen yet and I don't have an
    actual theory...  If you can reproduce this, does anything change if
    you mark them volatile?
    
    Can you please attach the animal's lwlock.o from REL_14_STABLE?  I'd
    like to check if I'm even looking at the right code...
    
    > I'll see what I can do to find the offending commit(s).
    
    Thanks!  Could also be environmental.  Maybe the snow arrived and you
    adjusted your thermostat? :-)  But even then we'd still need to figure
    out what's different on those branches...  One difference is autoconf
    vs meson.  Both say -g -O2, but only Meson says -fPIC.  Can you
    reproduce it on later branches with configure/make?