Thread

  1. Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB barriers

    Greg Burd <greg@burd.me> — 2025-11-25T16:37:38Z

    On Mon, Nov 24, 2025, at 6:20 PM, Andres Freund wrote:
    > Hi,
    
    Thanks again for taking a look at the patch; hopefully I got it right this time. :)
    
    > On 2025-11-24 11:28:28 -0500, Greg Burd wrote:
    >> @@ -2509,25 +2513,64 @@ int main(void)
    >>  }
    >>  '''
    >>  
    >> -  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
    >> -      args: test_c_args)
    >> -    # Use ARM CRC Extension unconditionally
    >> -    cdata.set('USE_ARMV8_CRC32C', 1)
    >> -    have_optimized_crc = true
    >> -  elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc+simd',
    >> -      args: test_c_args + ['-march=armv8-a+crc+simd'])
    >> -    # Use ARM CRC Extension, with runtime check
    >> -    cflags_crc += '-march=armv8-a+crc+simd'
    >> -    cdata.set('USE_ARMV8_CRC32C', false)
    >> -    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
    >> -    have_optimized_crc = true
    >> -  elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc',
    >> -      args: test_c_args + ['-march=armv8-a+crc'])
    >> -    # Use ARM CRC Extension, with runtime check
    >> -    cflags_crc += '-march=armv8-a+crc'
    >> -    cdata.set('USE_ARMV8_CRC32C', false)
    >> -    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
    >> -    have_optimized_crc = true
    >> +  if cc.get_id() == 'msvc'
    >> +    # MSVC: Intrinsic availability check for ARM64
    >> +    if host_machine.cpu_family() == 'aarch64'
    >> +      # Test if CRC32C intrinsics are available in intrin.h
    >> +      crc32c_test_msvc = '''
    >> +        #include <intrin.h>
    >> +        int main(void) {
    >> +          uint32_t crc = 0;
    >> +          uint8_t data = 0;
    >> +          crc = __crc32cb(crc, data);
    >> +          return 0;
    >> +        }
    >> +      '''
    >> +      if cc.links(crc32c_test_msvc, name: '__crc32cb intrinsic available')
    >> +        cdata.set('USE_ARMV8_CRC32C', 1)
    >> +        have_optimized_crc = true
    >> +        message('Using ARM64 CRC32C hardware acceleration (MSVC)')
    >> +      else
    >> +        message('CRC32C intrinsics not available on this MSVC ARM64 build')
    >> +      endif
    >
    > Does this:
    > a) need to be conditional at all, given that it's msvc specific, it seems we
    >    don't need to run a test?
    > b) why is the msvc block outside of the general aarch64 block but then has
    > another nested aarch64 test inside? That seems unnecessarily complicated and
    > requires reindenting unnecessarily much code?
    
    Yep, I rushed this.  Apologies.  I've re-worked it with your suggestions.
    
    >> +/*
    >> + * For Arm64, use __isb intrinsic. See aarch64 inline assembly definition for details.
    >> + */
    >> +#ifdef _M_ARM64
    >> +
    >> +static __forceinline void
    >> +spin_delay(void)
    >> +{
    >> +	 /* Reference: https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions */
    >> +	__isb(_ARM64_BARRIER_SY);
    >> +}
    >> +#else
    >> +/*
    >> + * For x64, use _mm_pause intrinsic instead of rep nop.
    >> + */
    >>  static __forceinline void
    >>  spin_delay(void)
    >>  {
    >>  	_mm_pause();
    >>  }
    >
    > This continues to use a barrier, with a reference to a list of barrier
    > semantics that really don't seem to make a whole lot of sense in the context
    > of spin_delay(). If we want to emit this kind of barrier for now it's ok with
    > me, but it should be documented as just being a fairly random choice, rather
    > than a link that doesn't explain anything.
    
    I did more digging and found that you were right about the use of ISB for spin_delay().  I was misled by earlier code in that file (lines 277-286), which implements spin_delay() with ISB; I ran with that without doing enough research myself.  I then found an article on this [1], and it seems that YIELD should be used, not ISB.  I also checked how others implement this: Java [2][3] uses YIELD, and Linux [4][5] uses YIELD in cpu_relax(), which is called by __delay().
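
    For illustration only, a minimal sketch of a YIELD-based delay, not the code from the attached patch.  The names spin_delay_sketch() and spin_wait_sketch() are hypothetical, and the MSVC/ARM64 branch assumes the __yield() intrinsic from <intrin.h>; the other branches exist only so the sketch builds elsewhere.

```c
#include <assert.h>

#if defined(_MSC_VER)
#include <intrin.h>
#endif

/*
 * Hypothetical helper: emit a CPU "I am spinning" hint.
 * YIELD on Arm64, PAUSE on x86; no barrier semantics are intended.
 */
static inline void
spin_delay_sketch(void)
{
#if defined(_MSC_VER) && defined(_M_ARM64)
	__yield();					/* assumed MSVC ARM64 intrinsic */
#elif defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
	_mm_pause();
#elif defined(__GNUC__) && defined(__aarch64__)
	__asm__ __volatile__("yield" ::: "memory");
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	__asm__ __volatile__("pause" ::: "memory");
#else
	/* no hint available on this platform; do nothing */
#endif
}

/*
 * Hypothetical caller: spin until *flag becomes nonzero or max_spins
 * iterations have passed.  Returns the number of delay hints issued.
 */
static int
spin_wait_sketch(volatile int *flag, int max_spins)
{
	int			spins = 0;

	assert(max_spins >= 0);
	while (*flag == 0 && spins < max_spins)
	{
		spin_delay_sketch();
		spins++;
	}
	return spins;
}
```

    The volatile flag only keeps the sketch short; a real spinlock would poll the lock word with the appropriate atomic operations.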
    
    >> +#endif
    >>  #else
    >>  static __forceinline void
    >>  spin_delay(void)
    >> @@ -623,9 +640,13 @@ spin_delay(void)
    >>  #include <intrin.h>
    >>  #pragma intrinsic(_ReadWriteBarrier)
    >>  
    >> -#define S_UNLOCK(lock)	\
    >> +#ifdef _M_ARM64
    >> +#define S_UNLOCK(lock) \
    >> +	do { __dmb(_ARM64_BARRIER_SY); (*(lock)) = 0; } while (0)
    >> +#else
    >
    > This doesn't seem like the right way to implement this - why not use
    > InterlockedExchange(lock, 0)? That will do the write with barrier semantics.
    
    Great idea, done.  Seems to work too.
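
    As a rough sketch of the idea (not the v5 code), an exchange-based unlock could look like the following.  The *_sketch names are hypothetical; the MSVC branch assumes the _InterlockedExchange() intrinsic from <intrin.h>, and the non-MSVC branch swaps in the GCC/Clang __atomic_exchange_n() builtin purely so the example builds elsewhere.

```c
#include <assert.h>

#if defined(_MSC_VER)
#include <intrin.h>
typedef volatile long slock_sketch_t;
/* The exchange performs the store with full barrier semantics. */
#define TAS_SKETCH(lock)		(_InterlockedExchange((lock), 1))
#define S_UNLOCK_SKETCH(lock)	((void) _InterlockedExchange((lock), 0))
#else
typedef volatile int slock_sketch_t;
/* Stand-in for non-MSVC compilers: sequentially consistent exchange. */
#define TAS_SKETCH(lock) \
	(__atomic_exchange_n((lock), 1, __ATOMIC_SEQ_CST))
#define S_UNLOCK_SKETCH(lock) \
	((void) __atomic_exchange_n((lock), 0, __ATOMIC_SEQ_CST))
#endif

/* Returns 1 if the lock was free and is now held by the caller. */
static int
try_lock_sketch(slock_sketch_t *lock)
{
	return TAS_SKETCH(lock) == 0;
}

/* Releases a held lock; the exchange orders prior writes before the release. */
static void
unlock_sketch(slock_sketch_t *lock)
{
	assert(*lock != 0);			/* caller must hold the lock */
	S_UNLOCK_SKETCH(lock);
}
```

    Compared with an explicit __dmb() followed by a plain store, the exchange keeps the barrier and the write in a single operation.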
    
    > Greetings,
    >
    > Andres Freund
    
    Given what I learned about YIELD vs ISB for spin delay, it seems reasonable to submit a separate patch that switches the non-MSVC path to YIELD as well.  What do you think?
    
    v5 attached, best.
    
    -greg
    
    [1] https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
    [2] https://cr.openjdk.org/~dchuyko/8186670/yield/spinwait.html
    [3] https://mail.openjdk.org/pipermail/aarch64-port-dev/2017-August/004880.html
    [4] https://github.com/torvalds/linux/blob/ac3fd01e4c1efce8f2c054cdeb2ddd2fc0fb150d/arch/arm64/include/asm/vdso/processor.h
    [5] https://github.com/torvalds/linux/commit/f511e079177a9b97175a9a3b0ee2374d55682403