[PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB barriers

Greg Burd <greg@burd.me> — 2025-11-20T20:45:22Z
Hi all,

Dave and I have been working together to get ARM64 with MSVC functional.
The attached patches accomplish that. Dave is the author of the first,
which addresses some build issues and fixes the spin_delay() semantics;
I wrote the second, which fixes some atomics in this combination.

PostgreSQL, when compiled with MSVC on ARM64 with optimizations enabled
(e.g., /O2), fails 027_stream_regress. After some investigation and
analysis of the generated assembly code, Dave Cramer and I have
identified the root cause: insufficient memory barrier semantics in both
atomic operations and spinlocks on ARM64 when compiled with MSVC at /O2.

Dave knew I was in the process of setting up a Win11/ARM64/MSVC build
animal and pinged me with this issue.  Dave got me started on the path
to finding the issue by sending me his workaround:

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -744,6 +744,7 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * before the data page can be written out.  This implements the basic
  * WAL rule "write the log before the data".)
  */
+#pragma optimize("",off)
 XLogRecPtr
 XLogInsertRecord(XLogRecData *rdata,
                                 XLogRecPtr fpw_lsn,
@@ -1088,7 +1089,7 @@ XLogInsertRecord(XLogRecData *rdata,

        return EndPos;
 }
-
+#pragma optimize("",on)
 /*


This pointed a finger at the atomics, so I started there.  We used a few
tools, but worth noting is https://godbolt.org/ where we were able to
quickly see that the MSVC assembly was missing the "dmb" barriers on
this platform.  I'm not sure how long this link will be valid, but in
the short term here's our investigation: https://godbolt.org/z/PPqfxe1bn


PROBLEM DESCRIPTION

PostgreSQL test failures occur intermittently on MSVC ARM64 builds,
manifesting as timing-dependent failures in critical sections
protected by spinlocks and atomic variables. The failures are
reproducible when the test suite is compiled with optimization flags
(/O2), particularly in the recovery/027_stream_regress test which
involves WAL replication and standby recovery.

The root cause has two components:

1. Atomic operations lack memory barriers on ARM64
2. MSVC spinlock implementation lacks memory barriers on ARM64


TECHNICAL ANALYSIS

PART 1: ATOMIC OPERATIONS MEMORY BARRIERS

GCC's __atomic_compare_exchange_n() with __ATOMIC_SEQ_CST semantics
generates a call to __aarch64_cas4_acq_rel(), which is a library
function that provides explicit acquire-release memory ordering
semantics through either:

* LSE path (modern ARM64): Using CASAL instruction with built-in
  memory ordering [1][2]

* Legacy path (older ARM64): Using LDAXR/STLXR instructions with
  explicit dmb sy instruction [3]

MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
atomic operation but does NOT emit the necessary Data Memory Barrier
(DMB) instructions [4][5].


PART 2: SPINLOCK IMPLEMENTATION LACKS BARRIERS

The MSVC spinlock implementation in src/include/storage/s_lock.h had
two issues on ARM64/MSVC:

#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
#define S_UNLOCK(lock) do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)

Issue 1: TAS() uses InterlockedCompareExchange without hardware barriers

The InterlockedCompareExchange intrinsic lacks full memory barrier
semantics on ARM64, identical to the atomic operations issue.

Issue 2: S_UNLOCK() uses only a compiler barrier

_ReadWriteBarrier() is a compiler barrier, NOT a hardware memory
barrier [6].  It prevents the compiler from reordering operations, but
the CPU can still reorder memory operations. This is fundamentally
insufficient for ARM64's weaker memory model.

For comparison, GCC's __sync_lock_release() emits actual hardware
barriers.
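
To make the compiler-barrier versus hardware-barrier distinction
concrete, here is a minimal sketch in portable C11 rather than MSVC
intrinsics.  atomic_signal_fence() stands in for _ReadWriteBarrier()
(both constrain only the compiler) and atomic_thread_fence() stands in
for a DMB.  All function and variable names are made up for
illustration and are not from the patches.

```c
#include <assert.h>
#include <stdatomic.h>

static int shared_data;         /* payload guarded by ready_flag */
static atomic_int ready_flag;   /* 0 = not published, 1 = published */

/*
 * Compiler-only ordering: the compiler may not move the store to
 * shared_data past the fence, but no barrier instruction is emitted,
 * so a weakly ordered CPU may still make the flag visible first.
 */
static void
publish_compiler_barrier_only(int value)
{
    assert(value >= 0);         /* negative is reserved for "not ready" */
    shared_data = value;
    atomic_signal_fence(memory_order_seq_cst);
    atomic_store_explicit(&ready_flag, 1, memory_order_relaxed);
}

/*
 * Hardware ordering: the release fence also orders the two stores as
 * observed by other CPUs (on AArch64, compilers typically emit a dmb).
 */
static void
publish_hardware_barrier(int value)
{
    assert(value >= 0);
    shared_data = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready_flag, 1, memory_order_relaxed);
}

/* Returns the published value, or -1 if nothing has been published. */
static int
consume(void)
{
    if (atomic_load_explicit(&ready_flag, memory_order_relaxed) == 0)
        return -1;
    atomic_thread_fence(memory_order_acquire);
    return shared_data;
}
```

Run single-threaded, both publish functions behave the same; the
difference only shows up when another CPU runs consume() concurrently
on a weakly ordered machine.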

IMPACT ON 027_STREAM_REGRESS

The 027_stream_regress test involves WAL replication and standby
recovery, both of which depend heavily on synchronized access to shared
memory protected by spinlocks [7].  Without proper barriers on ARM64:

1. Thread A acquires spinlock (no full barrier emitted)
2. Thread A modifies shared WAL buffer
3. Thread B acquires spinlock before Thread A's writes become visible
4. Thread B reads stale WAL data
5. WAL replication gets corrupted or hangs indefinitely
6. Test times out waiting for standby to catch up


WHY ARM32 AND X86/X64 ARE UNAFFECTED

MSVC's _InterlockedCompareExchange does provide full memory barriers on:
* x86/x64: Memory barriers are implicit in the x86 memory model [8]
* ARM32: MSVC explicitly generates full barriers for ARM32 [5]

Only ARM64 lacks the necessary barriers, making this a platform-specific
issue.

ATTACHED SOLUTION

Add explicit DMB (Data Memory Barrier) instructions before and after
atomic operations and spinlock operations on ARM64 to provide sequential
consistency semantics.

0002: src/include/port/atomics/generic-msvc.h

Added platform-specific DMB macros that expand to
__dmb(_ARM64_BARRIER_SY) on ARM64.

Applied to all six atomic operations:
* pg_atomic_compare_exchange_u32_impl()
* pg_atomic_exchange_u32_impl()
* pg_atomic_fetch_add_u32_impl()
* pg_atomic_compare_exchange_u64_impl()
* pg_atomic_exchange_u64_impl()
* pg_atomic_fetch_add_u64_impl()
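
The 0002 patch is attached rather than quoted here, so the following is
only a sketch of the barrier-bracketing pattern described above, not the
patch's code.  SKETCH_DMB and sketch_compare_exchange_u32 are
hypothetical names.  The MSVC/ARM64 branch uses the
__dmb(_ARM64_BARRIER_SY) and _InterlockedCompareExchange intrinsics
named in this mail; the other branch exists only so the sketch compiles
with GCC or Clang, and approximates the same shape with a relaxed
builtin compare-exchange between two full fences.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_ARM64)
#include <intrin.h>
/* Full-system data memory barrier on ARM64 */
#define SKETCH_DMB()    __dmb(_ARM64_BARRIER_SY)
#else
/* Stand-in for GCC/Clang: a sequentially consistent fence */
#define SKETCH_DMB()    __atomic_thread_fence(__ATOMIC_SEQ_CST)
#endif

/*
 * Hypothetical helper: if *ptr equals *expected, store newval and
 * return true; otherwise copy the current value into *expected and
 * return false.
 */
static bool
sketch_compare_exchange_u32(volatile uint32_t *ptr, uint32_t *expected,
                            uint32_t newval)
{
    uint32_t current;
    bool ret;

    assert(ptr != NULL && expected != NULL);

    SKETCH_DMB();               /* order earlier accesses before the CAS */
#if defined(_MSC_VER) && defined(_M_ARM64)
    current = (uint32_t) _InterlockedCompareExchange((volatile long *) ptr,
                                                     (long) newval,
                                                     (long) *expected);
#else
    current = *expected;
    /* Relaxed on purpose: the surrounding fences supply the ordering */
    __atomic_compare_exchange_n(ptr, &current, newval, false,
                                __ATOMIC_RELAXED, __ATOMIC_RELAXED);
#endif
    SKETCH_DMB();               /* order the CAS before later accesses */

    ret = (current == *expected);
    *expected = current;
    return ret;
}
```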


0001: src/include/storage/s_lock.h

Added ARM64-specific spinlock implementation with explicit DMB barriers [9]:

#if defined(_M_ARM64)
#define TAS(lock) tas_msvc_arm64(lock)

static __forceinline int
tas_msvc_arm64(volatile slock_t *lock)
{
  int result;

  /* Full barrier before atomic operation */
  __dmb(_ARM64_BARRIER_SY);

  /* Atomic compare-and-swap */
  result = InterlockedCompareExchange(lock, 1, 0);

  /* Full barrier after atomic operation */
  __dmb(_ARM64_BARRIER_SY);

  return result;
}

#define S_UNLOCK(lock) \
do { \
  __dmb(_ARM64_BARRIER_SY); /* Full barrier before release */ \
  (*(lock)) = 0; \
} while (0)

#else
  /* Non-ARM64 MSVC: existing implementation unchanged */
#endif


The spinlock acquire now ensures:

* Before CAS: All prior memory operations complete before
  acquiring the lock.

* After CAS: The CAS completes before subsequent operations
  access protected data

The spinlock release now ensures:

* Before writing 0: All critical section operations are visible
  to other threads
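
The acquire and release guarantees listed above can also be written
with C11 atomics, which may help when comparing against the DMB-based
version.  This is a portable sketch, not the s_lock.h code:
sketch_lock_t, sketch_tas and sketch_unlock are hypothetical names, and
the sketch relies on the compiler to pick instructions for the acquire
and release orderings instead of issuing explicit barriers.

```c
#include <assert.h>
#include <stdatomic.h>

typedef atomic_int sketch_lock_t;   /* 0 = free, 1 = held */

/* Returns 0 if the lock was acquired, nonzero if it was already held. */
static int
sketch_tas(sketch_lock_t *lock)
{
    /*
     * Acquire ordering: accesses in the critical section cannot be
     * reordered before the exchange that takes the lock.
     */
    return atomic_exchange_explicit(lock, 1, memory_order_acquire);
}

static void
sketch_unlock(sketch_lock_t *lock)
{
    /* Releasing a lock that is not held would be a caller bug. */
    assert(atomic_load_explicit(lock, memory_order_relaxed) == 1);

    /*
     * Release ordering: every access made inside the critical section
     * is visible to the next thread that acquires the lock.
     */
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

The patch issues full dmb sy barriers, which is stronger than the
acquire/release pair shown here.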


You may ask: why two DMBs in the atomic operations instead of one?
GCC's non-LSE path (LDAXR/STLXR) uses only one DMB because:
* LDAXR (Load-Acquire Exclusive) provides half-barrier acquire
  semantics [3]
* STLXR (Store-Release Exclusive) provides half-barrier release
  semantics [3]
* One final dmb sy upgrades to full sequential consistency

Since _InterlockedCompareExchange provides NO barrier semantics on
ARM64, we must provide both halves:

* First DMB acts as a release barrier (ensures prior memory ops
  complete before CAS)
* Second DMB acts as an acquire barrier (ensures subsequent memory
  ops wait for CAS)
* Together they provide sequential consistency matching GCC's
  semantics [3]


VERIFICATION

The fix has been verified by:

1. Spinlock fix resolves 027_stream_regress timeout: Test now passes
   consistently on MSVC ARM64 with /O2 optimization without hanging

2. Assembly code inspection: Confirmed that dmb sy instructions now
   appear in the optimized assembly for ARM64 builds

3. Platform compatibility: No regression on x86/x64 or ARM32 (macros
   expand to no-ops; original code path unchanged)


WHY CLANG/LLVM ON MACOS ARM64 DOESN'T HAVE THIS PROBLEM

PostgreSQL builds successfully on Apple Silicon Macs (ARM64) without
the memory ordering issues observed on MSVC Windows ARM64. The
difference comes down to how Clang/LLVM and MSVC handle atomic
operations.

CLANG/LLVM APPROACH (macOS, Linux, Android ARM64)

Clang/LLVM uses GCC-compatible atomic builtins
(__atomic_compare_exchange_n, etc.) even on platforms where it's not
GCC [125][134]. The LLVM backend has an AtomicExpand pass that
properly expands these operations to include appropriate memory
barriers for the target architecture [134].

On ARM64, Clang generates:

* __aarch64_cas4_acq_rel library calls (or the CASAL instruction with
  LSE)
* Proper acquire-release semantics built into the instruction sequence
* Automatic full dmb sy barriers where needed

This means PostgreSQL's use of __sync_lock_test_and_set and the
__atomic_* builtins works correctly on macOS ARM64 without additional
patches.
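
For contrast, this is roughly what the GCC-compatible builtin path looks
like from the caller's side.  It is a sketch with hypothetical function
names (sketch_builtin_*), not PostgreSQL's actual wrappers; the point is
only that the ordering is requested through the builtin and the compiler
chooses the instructions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Strong compare-and-swap, sequentially consistent on success and failure */
static bool
sketch_builtin_cas_u32(volatile uint32_t *ptr, uint32_t *expected,
                       uint32_t newval)
{
    return __atomic_compare_exchange_n(ptr, expected, newval, false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

/* Returns the previous lock value: 0 means the caller now holds the lock */
static int
sketch_builtin_tas(volatile int *lock)
{
    return __sync_lock_test_and_set(lock, 1);
}

/* Writes 0 to the lock with release semantics */
static void
sketch_builtin_unlock(volatile int *lock)
{
    assert(lock != NULL);
    __sync_lock_release(lock);
}
```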


Phew... I hope I read all those docs correctly and got that right.  Feel
free to let me know if I missed something.  Looking forward to your
feedback and review so I can get this new build animal up and running.

best.

-greg

[1] ARM Developer: CAS Instructions
https://developer.arm.com/documentation/dui0801/latest/A64-Data-Transfer-Instructions/CASAB--CASALB--CASB--CASLB--A64-

[2] ARM Developer: Load-Acquire and Store-Release Instructions
https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions

[3] ARM Developer: Data Memory Barrier (DMB)
https://developer.arm.com/documentation/100069/0610/A64-General-Instructions/DMB?lang=en

[4] Microsoft Learn: _InterlockedCompareExchange Intrinsic Functions
https://learn.microsoft.com/en-us/cpp/intrinsics/interlockedcompareexchange-intrinsic-functions?view=msvc-170

[5] Microsoft Learn: ARM Intrinsics - Memory Barriers
https://learn.microsoft.com/en-us/cpp/intrinsics/arm-intrinsics?view=msvc-170

[6] Microsoft Learn: _ReadWriteBarrier is a Compiler Barrier
https://learn.microsoft.com/en-us/cpp/intrinsics/compiler-intrinsics?view=msvc-170

[7] PostgreSQL: 027_stream_regress WAL replication testing
https://www.postgresql.org/message-id/193115.1763243897@sss.pgh.pa.us

[8] Intel Volume 3A: Memory Ordering
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

[9] Microsoft Developer Blog: The AArch64 processor - Barriers
https://devblogs.microsoft.com/oldnewthing/20220812-00/?p=106968