Thread

  1. Re: Add RISC-V Zbb popcount optimization

    Greg Burd <greg@burd.me> — 2026-05-27T17:04:46Z

    On Fri, Mar 27, 2026, at 4:22 PM, Greg Burd wrote:
    > On Mon, Mar 23, 2026, at 11:09 AM, Nathan Bossart wrote:
    >> On Sun, Mar 22, 2026 at 02:01:50PM -0400, Andres Freund wrote:
    >>> I'm also pretty doubtful all the effort to e.g. add AVX 512 popcount was spent
    >>> all that effectively - hard to believe there's any real world workloads where
    >>> that gain is worth the squeeze. At least for aarch64 and x86-64 there's real
    >>> world use of those platforms, making niche-y perf improvements somewhat
    >>> worthwhile. Whereas there's afaict not yet a whole lot of riscv production
    >>> adoption.
    >
    > Hey Nathan,
    >
    >> That work was partially motivated by vector stuff that used popcount
    >> functions pretty heavily, but yeah, the complexity compared to the gains is
    >> the main reason I've been pushing to just use simd.h elsewhere (i.e., SSE2
    >> and Neon).  I'd still consider using AVX-512, etc. for things if the impact
    >> on real-world workloads was huge, though. 
    >
    > Yes, that and by research done while trying to understand why my RISC-V 
    > build farm animal "greenfly" (OrangePi RV2 with a VisionFive 2 CPU: 
    > RISC-V RV64GC + Zba/Zbb/Zbc/Zbs) is failing consistently.
    >
    >> -- 
    >> nathan
    >
    > Forgive me, while $subject only mentions popcount I couldn't help 
    > myself so I added a few more RISC-V patches including a bug fix that I 
    > hope makes greenfly happy again.
    >
    >
    > 0001 - This is a bug fix for DES initialization when built with Clang 
    > on RISC-V.
    >
    > ------> Join me in "the rabbit hole" on this issue if you care to...
    >
    > The existing software DES (as shown by the build-farm animal "greenfly" 
    > [1]) fails because Clang 20 has an auto-vectorization bug that we 
    > trigger in the DES initialization code (des_init() function), not the 
    > DES encryption algorithm itself.
    >
    > I searched the LLVM issue tracker, here are the issues that caught my eye:
    >   1. Issue #176001 - "RISC-V Wrong code at -O1"
    >     - Vector peephole optimization with vmerge folding
    >     - Fixed by PR #176077 (merged Jan 2026)
    >     - Link: https://github.com/llvm/llvm-project/issues/176001
    >   2. Issue #187458 - "Wrong code for vector.extract.last.active"
    >     - Large index issues with zvl1024b
    >     - Partially fixed, work still ongoing
    >     - Link: https://github.com/llvm/llvm-project/issues/187458
    >   3. Issue #171978 - "RISC-V Wrong code at -O2/O3"
    >     - Illegal instruction from mismatched EEW
    >     - Under investigation
    >     - Link: https://github.com/llvm/llvm-project/issues/171978
    >   4. PR #176105 - "Fix i64 gather/scatter cost on rv32"
    >     - Cost model fixes for scatter/gather (merged Jan 2026)
    >     - Link: https://github.com/llvm/llvm-project/pull/176105
    >
    > My fix in 0001 is simply adding this in a few places in crypt-des.c:
    >
    >   #if defined(__riscv) && defined(__clang__)
    >       pg_memory_barrier();
    >   #endif
    >
    > While searching I ran across a different solution: adding `-mllvm 
    > -riscv-v-vector-bits-min=0` sets the minimum vector bit width for the 
    > RISC-V vector extension in LLVM to 0, which disables all vectorization 
    > and forces scalar code generation, so no RVV instructions are emitted.  
    > This would prevent the DES bug at the cost of any vectorization 
    > anywhere in the binary.
    >
    > While that might also fix the other intermittent bug we'd been seeing 
    > on greenfly (not tested), disabling all RVV optimizations seems too 
    > heavy-handed to me.
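
A minimal sketch of the workaround described for 0001, not the patch itself. It builds the inverse P-box table (the `un_pbox` table named in the failing test output) and then issues a barrier only for Clang on RISC-V. The names `sketch_compiler_barrier()` and `init_un_pbox()` are hypothetical; the patch is described as calling `pg_memory_barrier()`, and where exactly those calls sit inside des_init() is an assumption here.

```c
#include <stdint.h>

/* Standard DES P-box permutation, written as 1-based bit positions. */
static const uint8_t pbox[32] = {
    16,  7, 20, 21, 29, 12, 28, 17,  1, 15, 23, 26,  5, 18, 31, 10,
     2,  8, 24, 14, 32, 27,  3,  9, 19, 13, 30,  6, 22, 11,  4, 25
};

static uint8_t un_pbox[32];

/*
 * Hypothetical stand-in for pg_memory_barrier(): an empty asm statement with
 * a "memory" clobber, which GCC and Clang treat as a compiler barrier.
 */
#define sketch_compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Build the inverse permutation: un_pbox[p - 1] = i wherever pbox[i] == p. */
static void
init_un_pbox(void)
{
    for (int i = 0; i < 32; i++)
        un_pbox[pbox[i] - 1] = (uint8_t) i;

#if defined(__riscv) && defined(__clang__)
    /*
     * Assumption: a barrier after the table-building loop keeps the
     * optimizer from transforming it together with the code that follows.
     * The placement is illustrative only.
     */
    sketch_compiler_barrier();
#endif
}
```

After calling `init_un_pbox()`, a correct build should report `un_pbox[0] == 8` and `un_pbox[1] == 16`, which are the "expected" values shown in the des-clang-o2-hw output.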
    >
    >
    > ------> Moving on.
    >
    > 0002 - (was "0001" in v2) this is unchanged; it implements popcount 
    > using the Zbb extension on RISC-V.
    >
    > 0003 - is a small patch adapted from the Google Abseil project's 
    > RISC-V CRC32C implementation [1].  It is *a lot faster* than the 
    > software crc32c we fall back to now (see: riscv-crc32c.c).  This 
    > algorithm requires the Zbc (or Zbkc) extension (for clmul), so the 
    > patch tests for that at build time and adds the '-march' flag when it 
    > is available.  However, as is the case for Zbb and popcnt in 0002, the 
    > presence of Zbc (or Zbkc) must be detected at runtime.  That's done 
    > following the pre-existing pattern used for ARM features.  This does 
    > introduce some runtime overhead and complexity, not more than required 
    > I hope.
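
The runtime-detection pattern mentioned for 0003 can be sketched roughly as below: a function pointer starts out at a "chooser" that probes the CPU once and then redirects later calls. This is an illustration under stated assumptions, not the patch. `cpu_has_zbc()`, `crc32c_zbc_stub()`, and the other names are hypothetical, the probe is hard-wired to "not available" (on Linux a real probe might use the riscv_hwprobe system call), and the Zbc path is a placeholder that simply reuses the software routine rather than clmul.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Portable bit-at-a-time CRC32C (reflected polynomial 0x82F63B78). */
static uint32_t
crc32c_sw(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Placeholder for a clmul-based routine; here it only forwards to software. */
static uint32_t
crc32c_zbc_stub(uint32_t crc, const void *data, size_t len)
{
    return crc32c_sw(crc, data, len);
}

/* Hypothetical probe; this sketch always answers "Zbc not available". */
static bool
cpu_has_zbc(void)
{
    return false;
}

static uint32_t crc32c_choose(uint32_t crc, const void *data, size_t len);

/* Starts at the chooser; the first call swaps in the selected routine. */
static uint32_t (*crc32c_impl) (uint32_t, const void *, size_t) = crc32c_choose;

static uint32_t
crc32c_choose(uint32_t crc, const void *data, size_t len)
{
    crc32c_impl = cpu_has_zbc() ? crc32c_zbc_stub : crc32c_sw;
    return crc32c_impl(crc, data, len);
}

/* Convenience wrapper applying the usual initial value and final XOR. */
static uint32_t
crc32c(const void *data, size_t len)
{
    return crc32c_impl(0xFFFFFFFFu, data, len) ^ 0xFFFFFFFFu;
}
```

With this sketch, `crc32c("123456789", 9)` yields 0xE3069283, the same check value the test programs print in their validation line. The probe cost is paid once; every later call pays only for an indirect call through the pointer.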
    >
    > I attached test code, and results at the end of this email:
    > * riscv-popcnt.c - unchanged
    > * riscv-crc32c.c - new, based on work in the Google Abseil project
    > * riscv-des.c    - highlights the fix for DES using Clang on RISC-V 
    >
    > I guess the question for 0002 and/or 0003 is if the "juice" is worth 
    > the "squeeze" or not.  There is a lot of performance juice to be had 
    > IMO.  But some might argue that RISC-V isn't widely adopted yet, and 
    > they'd be right.  Others might point out that RISC-V is currently 
    > showing up in embedded systems more than server/desktop/laptop/cloud, 
    > also true.  However, there is some evidence that is changing, as there 
    > are RISC-V servers [2][3], and there is a hosted (cloud) solution from 
    > Scaleway [4].  There exists a 64 core RISC-V desktop [5] and a 
    > Framework laptop mainboard [6] sporting a RISC-V CPU.  And there is the 
    > OrangePi RV2 [7] I have that is "greenfly".
    >
    > Is it early days?  Certainly!  But too early?  That's up for debate. :)
    >
    > If nothing else, these patches can serve as a durable record, useful 
    > later when RISC-V is a critical platform for Postgres, or as 
    > information for other projects.
    
    Rebased and tested (v4).  This adds better support for RISC-V: a fix for
    DES, plus faster popcount and CRC32C when the CPU supports the required
    extensions.
    
    best.
    
    -greg
    
    > best.
    >
    > -greg
    >
    > [1] https://github.com/abseil/abseil-cpp/pull/1986 
    > absl/crc/internal/crc_riscv.cc
    > [2] 
    > https://www.firefly.store/products/rs-sra120-risc-v-server-2u-computing-server-cloud-storage-large-model-sg2042
    > [3] 
    > https://edgeaicomputer.com/our-products/servers/risc-v-compute-server-sra1-20/
    > [4] 
    > https://www.scaleway.com/en/news/scaleway-launches-its-risc-v-servers-in-the-cloud-a-world-first-and-a-firm-commitment-to-technological-independence/
    > [5] https://milkv.io/pioneer and 
    > https://www.crowdsupply.com/milk-v/milk-v-pioneer/updates/current-status-of-production
    > [6] https://deepcomputing.io/product/dc-roma-risc-v-mainboard/
    > [7] 
    > http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-RV2.html
    >
    >
    > ---- TEST PROGRAM OUTPUT:
    >
    > gburd@rv:~/ws/postgres$ make -f Makefile.RISCV
    > gcc -O2 riscv-des.c -o des-gcc-sw
    > gcc -O2 riscv-des.c -march=rv64gcv -o des-gcc-hw
    > clang-20 -O1 riscv-des.c -o des-clang-o1-sw
    > clang-20 -O1 -march=rv64gcv riscv-des.c -o des-clang-o1-hw
    > clang-20 -O2 riscv-des.c -o des-clang-o2-sw
    > clang-20 -O2 -march=rv64gcv riscv-des.c -o des-clang-o2-hw
    > gcc -O2 -o popcnt-gcc-o2-sw riscv-popcnt.c
    > gcc -O2 -march=rv64gc_zbb -o popcnt-gcc-o2-hw riscv-popcnt.c
    > clang-20 -O2 -o popcnt-clang-o2-sw riscv-popcnt.c
    > clang-20 -O2 -march=rv64gc_zbb -o popcnt-clang-o2-hw riscv-popcnt.c
    > gcc -O2 -o crc32c-gcc-o2-sw riscv-crc32c.c
    > gcc -O2 -march=rv64gc_zbc -o crc32c-gcc-o2-hw riscv-crc32c.c
    > clang-20 -O2 -o crc32c-clang-o2-sw riscv-crc32c.c
    > clang-20 -O2 -march=rv64gc_zbc -o crc32c-clang-o2-hw riscv-crc32c.c
    > gburd@rv:~/ws/postgres$ make -f Makefile.RISCV test
    > ./des-gcc-sw
    > Compiler: GCC 13.3.0
    > Target: RISC-V 64-bit
    > Vector extension: Not enabled
    >
    > Testing WITHOUT compiler barriers:
    > PASS: Permutation tables are correct
    >
    > Testing WITH compiler barriers:
    > PASS: Permutation tables are correct
    >
    > Performance Comparison (1000000 iterations):
    > Without barriers: 0.409 seconds (409 ns/iter)
    > With barriers:    0.416 seconds (416 ns/iter)
    > Overhead: 1.6%
    > ./des-gcc-hw
    > Compiler: GCC 13.3.0
    > Target: RISC-V 64-bit
    > Vector extension: Enabled (RVV)
    >
    > Testing WITHOUT compiler barriers:
    > PASS: Permutation tables are correct
    >
    > Testing WITH compiler barriers:
    > PASS: Permutation tables are correct
    >
    > Performance Comparison (1000000 iterations):
    > Without barriers: 0.410 seconds (410 ns/iter)
    > With barriers:    0.410 seconds (410 ns/iter)
    > Overhead: Negligible
    > ./des-clang-o1-sw
    > Compiler: Clang 20.1.2
    > Target: RISC-V 64-bit
    > Vector extension: Not enabled
    >
    > Testing WITHOUT compiler barriers:
    > PASS: Permutation tables are correct
    >
    > Testing WITH compiler barriers:
    > PASS: Permutation tables are correct
    >
    > Performance Comparison (1000000 iterations):
    > Without barriers: 0.517 seconds (517 ns/iter)
    > With barriers:    0.516 seconds (516 ns/iter)
    > Overhead: Negligible
    > ./des-clang-o1-hw
    > Compiler: Clang 20.1.2
    > Target: RISC-V 64-bit
    > Vector extension: Enabled (RVV)
    >
    > Testing WITHOUT compiler barriers:
    > PASS: Permutation tables are correct
    >
    > Testing WITH compiler barriers:
    > PASS: Permutation tables are correct
    >
    > Performance Comparison (1000000 iterations):
    > Without barriers: 0.405 seconds (405 ns/iter)
    > With barriers:    0.405 seconds (405 ns/iter)
    > Overhead: Negligible
    > ./des-clang-o2-sw
    > Compiler: Clang 20.1.2
    > Target: RISC-V 64-bit
    > Vector extension: Not enabled
    >
    > Testing WITHOUT compiler barriers:
    > PASS: Permutation tables are correct
    >
    > Testing WITH compiler barriers:
    > PASS: Permutation tables are correct
    >
    > Performance Comparison (1000000 iterations):
    > Without barriers: 0.517 seconds (517 ns/iter)
    > With barriers:    0.518 seconds (518 ns/iter)
    > Overhead: Negligible
    > ./des-clang-o2-hw
    > Compiler: Clang 20.1.2
    > Target: RISC-V 64-bit
    > Vector extension: Enabled (RVV)
    >
    > Testing WITHOUT compiler barriers:
    > ERROR: un_pbox mismatch:
    > 	un_pbox[0] = 15, expected 8
    > 	un_pbox[1] = 6, expected 16
    > 	un_pbox[2] = 19, expected 22
    > 	un_pbox[3] = 20, expected 30
    > 	un_pbox[4] = 28, expected 12
    >   ... and 27 more errors
    > FAIL: Permutation tables are incorrect
    >
    > Testing WITH compiler barriers:
    > PASS: Permutation tables are correct
    >
    > Performance Comparison (1000000 iterations):
    > Without barriers: 0.093 seconds (93 ns/iter)
    > With barriers:    0.407 seconds (407 ns/iter)
    > Overhead: 335.5%
    > ./popcnt-gcc-o2-sw
    > sw popcount:    0.183 sec  (    547.89 MB/s)
    > hw popcount:    0.274 sec  (    365.40 MB/s)
    >
    > diff: 0.67x
    > match: 406261900 bits counted
    > ./popcnt-gcc-o2-hw
    > sw popcount:    0.182 sec  (    548.17 MB/s)
    > hw popcount:    0.044 sec  (   2287.82 MB/s)
    >
    > diff: 4.17x
    > match: 406261900 bits counted
    > ./popcnt-clang-o2-sw
    > sw popcount:    0.188 sec  (    531.96 MB/s)
    > hw popcount:    0.207 sec  (    482.84 MB/s)
    >
    > diff: 0.91x
    > match: 406261900 bits counted
    > ./popcnt-clang-o2-hw
    > sw popcount:    0.224 sec  (    446.46 MB/s)
    > hw popcount:    0.056 sec  (   1794.83 MB/s)
    >
    > diff: 4.02x
    > match: 406261900 bits counted
    > ./crc32c-gcc-o2-sw
    > sw crc32c:    0.651 sec  (    153.68 MB/s)
    > hw crc32c:    0.651 sec  (    153.72 MB/s)
    >
    > diff: 1.00x
    > match: 0x0B141F2D
    >
    > validation: CRC32C("123456789") = 0xE3069283 (correct)
    > ./crc32c-gcc-o2-hw
    > sw crc32c:    0.651 sec  (    153.70 MB/s)
    > hw crc32c:    0.000 sec  ( 308052.33 MB/s)
    >
    > diff: 2004.21x
    > match: 0x0B141F2D
    >
    > validation: CRC32C("123456789") = 0xE3069283 (correct)
    > ./crc32c-clang-o2-sw
    > sw crc32c:    0.584 sec  (    171.10 MB/s)
    > hw crc32c:    0.584 sec  (    171.17 MB/s)
    >
    > diff: 1.00x
    > match: 0x0B141F2D
    >
    > validation: CRC32C("123456789") = 0xE3069283 (correct)
    > ./crc32c-clang-o2-hw
    > sw crc32c:    0.584 sec  (    171.15 MB/s)
    > hw crc32c:    0.000 sec  ( 309282.38 MB/s)
    >
    > diff: 1807.08x
    > match: 0x0B141F2D
    >
    > validation: CRC32C("123456789") = 0xE3069283 (correct)
    > Attachments:
    > * Makefile.RISCV
    > * riscv-crc32c.c
    > * riscv-des.c
    > * riscv-popcnt.c
    > * v3-0001-Avoid-Clang-RISC-V-auto-vectorization-bug-in-DES.patch
    > * v3-0002-Add-RISC-V-popcount-using-Zbb-extension.patch
    > * v3-0003-Add-RISC-V-CRC32C-using-the-Zbc-extension.patch