Thread
-
[PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB barriers
Greg Burd <greg@burd.me> — 2025-11-20T20:45:22Z
Hi all,

Dave and I have been working together to get ARM64 with MSVC functional. The attached patches accomplish that. Dave is the author of the first, which addresses some build issues and fixes the spin_delay() semantics; I did the second, which fixes some atomics in this combination.

PostgreSQL, when compiled with MSVC on ARM64 with optimizations enabled (e.g., /O2), fails 027_stream_regress. After some investigation and analysis of the generated assembly code, Dave Cramer and I have identified the root cause as insufficient memory barrier semantics in both atomic operations and spinlocks on ARM64 when compiled with MSVC with /O2.

Dave knew I was in the process of setting up a Win11/ARM64/MSVC build animal and pinged me with this issue. He got me started on the path to finding it by sending me his workaround:

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -744,6 +744,7 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * before the data page can be written out. This implements the basic
  * WAL rule "write the log before the data".)
  */
+#pragma optimize("",off)
 XLogRecPtr
 XLogInsertRecord(XLogRecData *rdata,
                  XLogRecPtr fpw_lsn,
@@ -1088,7 +1089,7 @@ XLogInsertRecord(XLogRecData *rdata,
     return EndPos;
 }
-
+#pragma optimize("",on)
 /*

This pointed a finger at the atomics, so I started there. We used a few tools, but worth noting is https://godbolt.org/, where we were able to quickly see that the MSVC assembly was missing the "dmb" barriers on this platform. I'm not sure how long this link will be valid, but in the short term here's our investigation: https://godbolt.org/z/PPqfxe1bn

PROBLEM DESCRIPTION

PostgreSQL test failures occur intermittently on MSVC ARM64 builds, manifesting as timing-dependent failures in critical sections protected by spinlocks and atomic variables.
The failures are reproducible when the test suite is compiled with optimization flags (/O2), particularly in the recovery/027_stream_regress test, which involves WAL replication and standby recovery.

The root cause has two components:

1. Atomic operations lack memory barriers on ARM64
2. MSVC spinlock implementation lacks memory barriers on ARM64

TECHNICAL ANALYSIS

PART 1: ATOMIC OPERATIONS MEMORY BARRIERS

GCC's __atomic_compare_exchange_n() with __ATOMIC_SEQ_CST semantics generates a call to __aarch64_cas4_acq_rel(), which is a library function that provides explicit acquire-release memory ordering semantics through either:

* LSE path (modern ARM64): Using CASAL instruction with built-in memory ordering [1][2]
* Legacy path (older ARM64): Using LDAXR/STLXR instructions with explicit dmb sy instruction [3]

MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the atomic operation but does NOT emit the necessary Data Memory Barrier (DMB) instructions [4][5].

PART 2: SPINLOCK IMPLEMENTATION LACKS BARRIERS

The MSVC spinlock implementation in src/include/storage/s_lock.h had two issues on ARM64/MSVC:

    #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
    #define S_UNLOCK(lock) do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)

Issue 1: TAS() uses InterlockedCompareExchange without hardware barriers

The InterlockedCompareExchange intrinsic lacks full memory barrier semantics on ARM64, identical to the atomic operations issue.

Issue 2: S_UNLOCK() uses only a compiler barrier

_ReadWriteBarrier() is a compiler barrier, NOT a hardware memory barrier [6]. It prevents the compiler from reordering operations, but the CPU can still reorder memory operations. This is fundamentally insufficient for ARM64's weaker memory model.

For comparison, GCC's __sync_lock_release() emits actual hardware barriers.
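To make the compiler-barrier vs. hardware-barrier distinction concrete, here is a minimal sketch in portable C11. It is an analogy only: it is not MSVC code and not taken from the attached patches, and the sketch_* names are hypothetical. atomic_signal_fence() stands in for a compiler-only barrier such as _ReadWriteBarrier(), while atomic_thread_fence() is the kind of fence that a compiler lowers to a real barrier instruction on weakly ordered CPUs.

```c
#include <stdatomic.h>

/* Hypothetical lock type for illustration; 0 = free, 1 = held. */
typedef struct
{
    atomic_int locked;
} sketch_lock;

/*
 * Test-and-set with acquire ordering.  Returns 0 if the lock was free and
 * is now held by the caller, nonzero if it was already held (the same
 * convention as TAS()).
 */
static inline int
sketch_tas(sketch_lock *l)
{
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire);
}

/*
 * Release using only a compiler barrier.  atomic_signal_fence() stops the
 * compiler from moving memory accesses across it but emits no CPU barrier,
 * so on a weakly ordered CPU the critical section's stores may become
 * visible after the lock appears free.  Shown only for contrast.
 */
static inline void
sketch_unlock_compiler_only(sketch_lock *l)
{
    atomic_signal_fence(memory_order_seq_cst);
    atomic_store_explicit(&l->locked, 0, memory_order_relaxed);
}

/*
 * Release using a thread fence.  The release fence orders all earlier
 * loads and stores before the store that frees the lock, at both the
 * compiler and the hardware level.
 */
static inline void
sketch_unlock_release(sketch_lock *l)
{
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&l->locked, 0, memory_order_relaxed);
}
```

In a single-threaded test both unlock variants behave identically; the difference only shows up under concurrency on a weakly ordered CPU, which fits the intermittent, optimization-dependent failures described above.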
IMPACT ON 027_STREAM_REGRESS

The 027_stream_regress test involves WAL replication and standby recovery, both heavily dependent on synchronized access to shared memory protected by spinlocks [7]. Without proper barriers on ARM64:

1. Thread A acquires spinlock (no full barrier emitted)
2. Thread A modifies shared WAL buffer
3. Thread B acquires spinlock before Thread A's writes become visible
4. Thread B reads stale WAL data
5. WAL replication gets corrupted or hangs indefinitely
6. Test times out waiting for standby to catch up

WHY ARM32 AND X86/X64 ARE UNAFFECTED

MSVC's _InterlockedCompareExchange does provide full memory barriers on:

* x86/x64: Memory barriers are implicit in the x86 memory model [8]
* ARM32: MSVC explicitly generates full barriers for ARM32 [5]

Only ARM64 lacks the necessary barriers, making this a platform-specific issue.

ATTACHED SOLUTION

Add explicit DMB (Data Memory Barrier) instructions before and after atomic operations and spinlock operations on ARM64 to provide sequential consistency semantics.

0002: src/include/port/atomic/generic-msvc.h

Added platform-specific DMB macros that expand to __dmb(_ARM64_BARRIER_SY) on ARM64.
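As a rough illustration of that macro pattern (a sketch only, not the code from the attached patch; SKETCH_FULL_BARRIER and sketch_compare_exchange_u32 are hypothetical names), a barrier-wrapped compare-exchange could look something like the following. The non-MSVC branch exists only so the sketch compiles elsewhere; it uses GCC's __sync_val_compare_and_swap, which is already documented as a full barrier.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_ARM64)
#include <intrin.h>
/* On MSVC/ARM64, emit an explicit full-system data memory barrier. */
#define SKETCH_FULL_BARRIER() __dmb(_ARM64_BARRIER_SY)
#else
/* Elsewhere the macro expands to a no-op. */
#define SKETCH_FULL_BARRIER() ((void) 0)
#endif

/*
 * Compare-and-swap with a barrier on each side.  On success returns true;
 * on failure returns false and stores the value actually found in
 * *expected.
 */
static inline bool
sketch_compare_exchange_u32(volatile uint32_t *ptr,
                            uint32_t *expected,
                            uint32_t newval)
{
    uint32_t current;
    bool ok;

    SKETCH_FULL_BARRIER();      /* order prior accesses before the CAS */
#if defined(_MSC_VER)
    current = (uint32_t) _InterlockedCompareExchange((volatile long *) ptr,
                                                     (long) newval,
                                                     (long) *expected);
#else
    current = __sync_val_compare_and_swap(ptr, *expected, newval);
#endif
    SKETCH_FULL_BARRIER();      /* order the CAS before later accesses */

    ok = (current == *expected);
    *expected = current;
    return ok;
}
```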
Applied to all six atomic operations:

* pg_atomic_compare_exchange_u32_impl()
* pg_atomic_exchange_u32_impl()
* pg_atomic_fetch_add_u32_impl()
* pg_atomic_compare_exchange_u64_impl()
* pg_atomic_exchange_u64_impl()
* pg_atomic_fetch_add_u64_impl()

0001: src/include/storage/s_lock.h

Added ARM64-specific spinlock implementation with explicit DMB barriers [9]:

    #if defined(_M_ARM64)
    #define TAS(lock) tas_msvc_arm64(lock)

    static __forceinline int
    tas_msvc_arm64(volatile slock_t *lock)
    {
        int result;

        /* Full barrier before atomic operation */
        __dmb(_ARM64_BARRIER_SY);

        /* Atomic compare-and-swap */
        result = InterlockedCompareExchange(lock, 1, 0);

        /* Full barrier after atomic operation */
        __dmb(_ARM64_BARRIER_SY);

        return result;
    }

    #define S_UNLOCK(lock) \
    do { \
        __dmb(_ARM64_BARRIER_SY); /* Full barrier before release */ \
        (*(lock)) = 0; \
    } while (0)

    #else
    /* Non-ARM64 MSVC: existing implementation unchanged */
    #endif

The spinlock acquire now ensures:

* Before CAS: All prior memory operations complete before acquiring the lock
* After CAS: The CAS completes before subsequent operations access protected data

The spinlock release now ensures:

* Before writing 0: All critical section operations are visible to other threads

You may ask: why two DMBs in the atomic operations instead of one?

GCC's non-LSE path (LDAXR/STLXR) uses only one DMB because:

* LDAXR (Load-Acquire Exclusive) provides half-barrier acquire semantics [3]
* STLXR (Store-Release Exclusive) provides half-barrier release semantics [3]
* One final dmb sy upgrades to full sequential consistency

Since _InterlockedCompareExchange provides NO barrier semantics on ARM64, we must provide both halves:

* First DMB acts as a release barrier (ensures prior memory ops complete before CAS)
* Second DMB acts as an acquire barrier (ensures subsequent memory ops wait for CAS)
* Together they provide sequential consistency matching GCC's semantics [3]

VERIFICATION

The fix has been verified by:

1. Spinlock fix resolves 027_stream_regress timeout: Test now passes consistently on MSVC ARM64 with /O2 optimization without hanging
2. Assembly code inspection: Confirmed that dmb sy instructions now appear in the optimized assembly for ARM64 builds
3. Platform compatibility: No regression on x86/x64 or ARM32 (macros expand to no-ops; original code path unchanged)

WHY CLANG/LLVM ON MACOS ARM64 DOESN'T HAVE THIS PROBLEM

PostgreSQL builds successfully on Apple Silicon Macs (ARM64) without the memory ordering issues observed on MSVC Windows ARM64. The difference comes down to how Clang/LLVM and MSVC handle atomic operations.

CLANG/LLVM APPROACH (macOS, Linux, Android ARM64)

Clang/LLVM uses GCC-compatible atomic builtins (__atomic_compare_exchange_n, etc.) even on platforms where it's not GCC [125][134]. The LLVM backend has an AtomicExpand pass that properly expands these operations to include appropriate memory barriers for the target architecture [134].

On ARM64, Clang generates:

* __aarch64_cas4_acq_rel library calls (or CASAL instruction with LSE)
* Proper acquire-release semantics built into the instruction sequence
* Automatic full dmb sy barriers where needed

This means PostgreSQL's use of __sync_lock_test_and_set and the __atomic_* builtins works correctly on macOS ARM64 without additional patches.

Phew... I hope I read all those docs correctly and got that right. Feel free to let me know if I missed something. Looking forward to your feedback and review so I can get this new build animal up and running.

best.
-greg

[1] ARM Developer: CAS Instructions
    https://developer.arm.com/documentation/dui0801/latest/A64-Data-Transfer-Instructions/CASAB--CASALB--CASB--CASLB--A64-
[2] ARM Developer: Load-Acquire and Store-Release Instructions
    https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions
[3] ARM Developer: Data Memory Barrier (DMB)
    https://developer.arm.com/documentation/100069/0610/A64-General-Instructions/DMB?lang=en
[4] Microsoft Learn: _InterlockedCompareExchange Intrinsic Functions
    https://learn.microsoft.com/en-us/cpp/intrinsics/interlockedcompareexchange-intrinsic-functions?view=msvc-170
[5] Microsoft Learn: ARM Intrinsics - Memory Barriers
    https://learn.microsoft.com/en-us/cpp/intrinsics/arm-intrinsics?view=msvc-170
[6] Microsoft Learn: _ReadWriteBarrier is a Compiler Barrier
    https://learn.microsoft.com/en-us/cpp/intrinsics/compiler-intrinsics?view=msvc-170
[7] PostgreSQL: 027_stream_regress WAL replication testing
    https://www.postgresql.org/message-id/193115.1763243897@sss.pgh.pa.us
[8] Intel Volume 3A: Memory Ordering
    https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf
[9] Microsoft Developer Blog: The AArch64 processor - Barriers
    https://devblogs.microsoft.com/oldnewthing/20220812-00/?p=106968