Thread

  1. [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB barriers

    Greg Burd <greg@burd.me> — 2025-11-20T20:45:22Z

    Hi all,
    
    Dave and I have been working together to get ARM64 with MSVC functional.
    The attached patches accomplish that.  Dave is the author of the first,
    which addresses some build issues and fixes the spin_delay() semantics;
    I wrote the second, which fixes some atomics in this combination.
    
    PostgreSQL, when compiled with MSVC on ARM64 with optimizations
    enabled (e.g., /O2), fails 027_stream_regress.  After some
    investigation and analysis of the generated assembly code, Dave Cramer
    and I have identified the root cause: insufficient memory barrier
    semantics in both atomic operations and spinlocks on ARM64 when
    compiled with MSVC at /O2.
    
    Dave knew I was in the process of setting up a Win11/ARM64/MSVC build
    animal and pinged me with this issue.  Dave got me started on the path
    to finding the issue by sending me his workaround:
    
    --- a/src/backend/access/transam/xlog.c
    +++ b/src/backend/access/transam/xlog.c
    @@ -744,6 +744,7 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
      * before the data page can be written out.  This implements the basic
      * WAL rule "write the log before the data".)
      */
    +#pragma optimize("",off)
     XLogRecPtr
     XLogInsertRecord(XLogRecData *rdata,
                                     XLogRecPtr fpw_lsn,
    @@ -1088,7 +1089,7 @@ XLogInsertRecord(XLogRecData *rdata,
    
            return EndPos;
     }
    -
    +#pragma optimize("",on)
     /*
    
    
    This pointed a finger at the atomics, so I started there.  We used a few
    tools, but worth noting is https://godbolt.org/ where we were able to
    quickly see that the MSVC assembly was missing the "dmb" barriers on
    this platform.  I'm not sure how long this link will be valid, but in
    the short term here's our investigation: https://godbolt.org/z/PPqfxe1bn
    
    
    PROBLEM DESCRIPTION
    
    PostgreSQL test failures occur intermittently on MSVC ARM64 builds,
    manifesting as timing-dependent failures in critical sections
    protected by spinlocks and atomic variables. The failures are
    reproducible when the test suite is compiled with optimization flags
    (/O2), particularly in the recovery/027_stream_regress test which
    involves WAL replication and standby recovery.
    
    The root cause has two components:
    
    1. Atomic operations lack memory barriers on ARM64
    2. MSVC spinlock implementation lacks memory barriers on ARM64
    
    
    TECHNICAL ANALYSIS
    
    PART 1: ATOMIC OPERATIONS MEMORY BARRIERS
    
    GCC's __atomic_compare_exchange_n() with __ATOMIC_SEQ_CST semantics
    generates a call to __aarch64_cas4_acq_rel(), which is a library
    function that provides explicit acquire-release memory ordering
    semantics through either:
    
    * LSE path (modern ARM64): Using CASAL instruction with built-in
      memory ordering [1][2]
    
    * Legacy path (older ARM64): Using LDAXR/STLXR instructions with
      explicit dmb sy instruction [3]
    
    MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
    atomic operation but does NOT emit the necessary Data Memory Barrier
    (DMB) instructions [4][5].
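
    For reference, the ordering that the GCC path is described as
    providing can be written portably with C11 atomics.  This is only a
    minimal sketch, not PostgreSQL code: the helper name cas_u32_seq_cst
    is hypothetical, and it assumes a compiler that ships <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): compare-and-swap with
 * sequentially consistent ordering on both the success and the failure
 * path.  On failure, *expected is overwritten with the value found.
 */
static bool
cas_u32_seq_cst(_Atomic uint32_t *ptr, uint32_t *expected, uint32_t newval)
{
    assert(ptr != NULL && expected != NULL);
    return atomic_compare_exchange_strong_explicit(ptr, expected, newval,
                                                   memory_order_seq_cst,
                                                   memory_order_seq_cst);
}
```

    The point of the sketch is that the ordering is requested as part
    of the operation itself, so the compiler is responsible for emitting
    whatever barriers the target needs.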
    
    
    PART 2: SPINLOCK IMPLEMENTATION LACKS BARRIERS
    
    The MSVC spinlock implementation in src/include/storage/s_lock.h had
    two issues on ARM64/MSVC:
    
    #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
    #define S_UNLOCK(lock) do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)
    
    Issue 1: TAS() uses InterlockedCompareExchange without hardware barriers
    
    The InterlockedCompareExchange intrinsic lacks full memory barrier
    semantics on ARM64, identical to the atomic operations issue.
    
    Issue 2: S_UNLOCK() uses only a compiler barrier
    
    _ReadWriteBarrier() is a compiler barrier, NOT a hardware memory
    barrier [6].  It prevents the compiler from reordering operations, but
    the CPU can still reorder memory operations. This is fundamentally
    insufficient for ARM64's weaker memory model.
    
    For comparison, GCC's __sync_lock_release() emits actual hardware
    barriers.
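
    The compiler-barrier versus hardware-barrier distinction can be
    shown in portable C11.  In this sketch atomic_signal_fence() stands
    in for a compiler-only barrier such as _ReadWriteBarrier(), and a
    release store stands in for a hardware-ordered unlock.  Note that it
    uses a C11 release store rather than the explicit DMB used in the
    patch, and that all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef atomic_int demo_slock_t;    /* hypothetical stand-in for slock_t */

/*
 * Compiler-only barrier before the store: the compiler may not move
 * memory accesses across the fence, but no barrier instruction is
 * emitted, so a weakly ordered CPU may still reorder them.
 */
static void
demo_unlock_compiler_barrier(demo_slock_t *lock)
{
    assert(lock != NULL);
    atomic_signal_fence(memory_order_seq_cst);
    atomic_store_explicit(lock, 0, memory_order_relaxed);
}

/*
 * Release store: loads and stores made inside the critical section are
 * ordered before the store that frees the lock, in hardware as well.
 */
static void
demo_unlock_release(demo_slock_t *lock)
{
    assert(lock != NULL);
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

    Both functions leave the lock at 0, so a single-threaded test cannot
    tell them apart; the difference only shows up in the generated
    instructions and under concurrent access.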
    
    IMPACT ON 027_STREAM_REGRESS
    
    The 027_stream_regress test involves WAL replication and standby
    recovery — heavily dependent on synchronized access to shared memory
    protected by spinlocks [7].  Without proper barriers on ARM64:
    
    1. Thread A acquires spinlock (no full barrier emitted)
    2. Thread A modifies shared WAL buffer
    3. Thread B acquires spinlock before Thread A's writes become visible
    4. Thread B reads stale WAL data
    5. WAL replication gets corrupted or hangs indefinitely
    6. Test times out waiting for standby to catch up
    
    
    WHY ARM32 AND X86/X64 ARE UNAFFECTED
    
    MSVC's _InterlockedCompareExchange does provide full memory barriers on:
    * x86/x64: Memory barriers are implicit in the x86 memory model [8]
    * ARM32: MSVC explicitly generates full barriers for ARM32 [5]
    
    Only ARM64 lacks the necessary barriers, making this a platform-specific
    issue.
    
    ATTACHED SOLUTION
    
    Add explicit DMB (Data Memory Barrier) instructions before and after
    atomic operations and spinlock operations on ARM64 to provide sequential
    consistency semantics.
    
    0002: src/include/port/atomics/generic-msvc.h
    
    Added platform-specific DMB macros that expand to
    __dmb(_ARM64_BARRIER_SY) on ARM64.
    
    Applied to all six atomic operations:
    * pg_atomic_compare_exchange_u32_impl()
    * pg_atomic_exchange_u32_impl()
    * pg_atomic_fetch_add_u32_impl()
    * pg_atomic_compare_exchange_u64_impl()
    * pg_atomic_exchange_u64_impl()
    * pg_atomic_fetch_add_u64_impl()
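
    The general shape of such a barrier macro, wrapped around one
    operation, might look like the sketch below.  The names
    DEMO_FULL_BARRIER, DEMO_RAW_FETCH_ADD, demo_atomic_u32 and
    demo_fetch_add_u32 are hypothetical and are not the ones used in the
    attached patch.  The non-MSVC branch is an assumption added only so
    the sketch also compiles with other compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_ARM64)
#include <intrin.h>
/* MSVC on ARM64: explicit full-system data memory barrier. */
#define DEMO_FULL_BARRIER() __dmb(_ARM64_BARRIER_SY)
typedef volatile long demo_atomic_u32;
#define DEMO_RAW_FETCH_ADD(p, v) \
    ((uint32_t) _InterlockedExchangeAdd((p), (long) (v)))
#else
#include <stdatomic.h>
/* Elsewhere (assumption for illustration): a C11 full fence. */
#define DEMO_FULL_BARRIER() atomic_thread_fence(memory_order_seq_cst)
typedef _Atomic uint32_t demo_atomic_u32;
#define DEMO_RAW_FETCH_ADD(p, v) \
    atomic_fetch_add_explicit((p), (v), memory_order_relaxed)
#endif

/*
 * Fetch-and-add bracketed by full barriers.  Returns the value that
 * was stored before the addition.
 */
static uint32_t
demo_fetch_add_u32(demo_atomic_u32 *ptr, uint32_t add)
{
    uint32_t old;

    assert(ptr != NULL);
    DEMO_FULL_BARRIER();
    old = DEMO_RAW_FETCH_ADD(ptr, add);
    DEMO_FULL_BARRIER();
    return old;
}
```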
    
    
    0001: src/include/storage/s_lock.h
    
    Added ARM64-specific spinlock implementation with explicit DMB barriers [9]:
    
    #if defined(_M_ARM64)
    #define TAS(lock) tas_msvc_arm64(lock)
    
    static __forceinline int
    tas_msvc_arm64(volatile slock_t *lock)
    {
      int result;
    
      /* Full barrier before atomic operation */
      __dmb(_ARM64_BARRIER_SY);
    
      /* Atomic compare-and-swap */
      result = InterlockedCompareExchange(lock, 1, 0);
    
      /* Full barrier after atomic operation */
      __dmb(_ARM64_BARRIER_SY);
    
      return result;
    }
    
    #define S_UNLOCK(lock) \
    do { \
      __dmb(_ARM64_BARRIER_SY); /* Full barrier before release */ \
      (*(lock)) = 0; \
    } while (0)
    
    #else
      /* Non-ARM64 MSVC: existing implementation unchanged */
    #endif
    
    
    The spinlock acquire now ensures:
    
    * Before CAS: All prior memory operations complete before
      acquiring the lock.
    
    * After CAS: The CAS completes before subsequent operations
      access protected data
    
    The spinlock release now ensures:
    
    * Before writing 0: All critical section operations are visible
      to other threads
    
    
    You may ask: why two DMBs in the atomic operations instead of one?
    GCC's non-LSE path (LDAXR/STLXR) uses only one DMB because:
    * LDAXR (Load-Acquire Exclusive) provides half-barrier acquire
      semantics [3]
    * STLXR (Store-Release Exclusive) provides half-barrier release
      semantics [3]
    * One final dmb sy upgrades to full sequential consistency
    
    Since _InterlockedCompareExchange provides NO barrier semantics on
    ARM64, we must provide both halves:
    
    * First DMB acts as a release barrier (ensures prior memory ops
      complete before CAS)
    * Second DMB acts as an acquire barrier (ensures subsequent memory
      ops wait for CAS)
    * Together they provide sequential consistency matching GCC's
      semantics [3]
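
    The two shapes being contrasted can be written side by side with
    C11 atomics.  This is an illustrative sketch with hypothetical
    function names; it shows where the fences sit in each shape and does
    not claim the two are formally equivalent under the C11 memory
    model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Shape 1: acquire/release ordering is part of the CAS itself, and a
 * single trailing full fence is added afterwards.
 */
static bool
demo_cas_acq_rel_plus_fence(_Atomic uint32_t *ptr, uint32_t *expected,
                            uint32_t newval)
{
    bool ok;

    assert(ptr != NULL && expected != NULL);
    ok = atomic_compare_exchange_strong_explicit(ptr, expected, newval,
                                                 memory_order_acq_rel,
                                                 memory_order_acquire);
    atomic_thread_fence(memory_order_seq_cst);
    return ok;
}

/*
 * Shape 2: the CAS carries no ordering of its own, so a full fence is
 * placed on each side of it.
 */
static bool
demo_cas_relaxed_two_fences(_Atomic uint32_t *ptr, uint32_t *expected,
                            uint32_t newval)
{
    bool ok;

    assert(ptr != NULL && expected != NULL);
    atomic_thread_fence(memory_order_seq_cst);
    ok = atomic_compare_exchange_strong_explicit(ptr, expected, newval,
                                                 memory_order_relaxed,
                                                 memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return ok;
}
```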
    
    
    VERIFICATION
    
    The fix has been verified by:
    
    1. Spinlock fix resolves 027_stream_regress timeout: Test now passes
       consistently on MSVC ARM64 with /O2 optimization without hanging
    
    2. Assembly code inspection: Confirmed that dmb sy instructions now
       appear in the optimized assembly for ARM64 builds
    
    3. Platform compatibility: No regression on x86/x64 or ARM32 (macros
       expand to no-ops; original code path unchanged)
    
    
    WHY CLANG/LLVM ON MACOS ARM64 DOESN'T HAVE THIS PROBLEM
    
    PostgreSQL builds successfully on Apple Silicon Macs (ARM64) without
    the memory ordering issues observed on MSVC Windows ARM64. The
    difference comes down to how Clang/LLVM and MSVC handle atomic
    operations.
    
    CLANG/LLVM APPROACH (macOS, Linux, Android ARM64)
    
    Clang/LLVM uses GCC-compatible atomic builtins
    (__atomic_compare_exchange_n, etc.) even on platforms where it's not
    GCC [125][134]. The LLVM backend has an AtomicExpand pass that
    properly expands these operations to include appropriate memory
    barriers for the target architecture [134].
    
    On ARM64, Clang generates:
    
    * __aarch64_cas4_acq_rel library calls (or CASAL instruction with LSE)
    * Proper acquire-release semantics built into the instruction sequence
    * Automatic full dmb sy barriers where needed
    
    This means PostgreSQL's use of __sync_lock_test_and_set and __atomic_*
    builtins works correctly on macOS ARM64 without additional patches.
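
    As a rough illustration of the builtin-based path, a test-and-set
    lock can be written with the GCC-style __sync builtins as follows.
    This is a sketch that assumes GCC or Clang; the demo_sync_* names
    are hypothetical and this is not the s_lock.h code.  The GCC
    documentation describes __sync_lock_test_and_set as an acquire
    barrier and __sync_lock_release as a release barrier.

```c
#include <assert.h>
#include <stddef.h>

typedef int demo_sync_lock_t;   /* hypothetical lock word type */

/*
 * Test-and-set: returns 0 if the lock was free and is now held by the
 * caller, nonzero if it was already held.
 */
static int
demo_sync_tas(volatile demo_sync_lock_t *lock)
{
    assert(lock != NULL);
    return __sync_lock_test_and_set(lock, 1);
}

/* Release the lock by writing 0 with release ordering. */
static void
demo_sync_unlock(volatile demo_sync_lock_t *lock)
{
    assert(lock != NULL);
    __sync_lock_release(lock);
}
```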
    
    
    Phew... I hope I read all those docs correctly and got that right.  Feel
    free to let me know if I missed something.  Looking forward to your
    feedback and review so I can get this new build animal up and running.
    
    best.
    
    -greg
    
    [1] ARM Developer: CAS Instructions
    https://developer.arm.com/documentation/dui0801/latest/A64-Data-Transfer-Instructions/CASAB--CASALB--CASB--CASLB--A64-
    
    [2] ARM Developer: Load-Acquire and Store-Release Instructions
    https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions
    
    [3] ARM Developer: Data Memory Barrier (DMB)
    https://developer.arm.com/documentation/100069/0610/A64-General-Instructions/DMB?lang=en
    
    [4] Microsoft Learn: _InterlockedCompareExchange Intrinsic Functions
    https://learn.microsoft.com/en-us/cpp/intrinsics/interlockedcompareexchange-intrinsic-functions?view=msvc-170
    
    [5] Microsoft Learn: ARM Intrinsics - Memory Barriers
    https://learn.microsoft.com/en-us/cpp/intrinsics/arm-intrinsics?view=msvc-170
    
    [6] Microsoft Learn: _ReadWriteBarrier is a Compiler Barrier
    https://learn.microsoft.com/en-us/cpp/intrinsics/compiler-intrinsics?view=msvc-170
    
    [7] PostgreSQL: 027_stream_regress WAL replication testing
    https://www.postgresql.org/message-id/193115.1763243897@sss.pgh.pa.us
    
    [8] Intel Volume 3A: Memory Ordering
    https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf
    
    [9] Microsoft Developer Blog: The AArch64 processor - Barriers
    https://devblogs.microsoft.com/oldnewthing/20220812-00/?p=106968