Re: Add RISC-V Zbb popcount optimization

Greg Burd <greg@burd.me>

From: "Greg Burd" <greg@burd.me>
To: "Nathan Bossart" <nathandbossart@gmail.com>, "Andres Freund" <andres@anarazel.de>
Cc: "John Naylor" <johncnaylorls@gmail.com>, pgsql-hackers <pgsql-hackers@postgresql.org>, "Andrew Dunstan" <andrew@dunslane.net>
Date: 2026-05-27T17:04:46Z
Lists: pgsql-hackers

On Fri, Mar 27, 2026, at 4:22 PM, Greg Burd wrote:
> On Mon, Mar 23, 2026, at 11:09 AM, Nathan Bossart wrote:
>> On Sun, Mar 22, 2026 at 02:01:50PM -0400, Andres Freund wrote:
>>> I'm also pretty doubtful all the effort to e.g. add AVX 512 popcount was spent
>>> all that effectively - hard to believe there's any real world workloads where
>>> that gain is worth the squeeze. At least for aarch64 and x86-64 there's real
>>> world use of those platforms, making niche-y perf improvements somewhat
>>> worthwhile. Whereas there's afaict not yet a whole lot of riscv production
>>> adoption.
>
> Hey Nathan,
>
>> That work was partially motivated by vector stuff that used popcount
>> functions pretty heavily, but yeah, the complexity compared to the gains is
>> the main reason I've been pushing to just use simd.h elsewhere (i.e., SSE2
>> and Neon).  I'd still consider using AVX-512, etc. for things if the impact
>> on real-world workloads was huge, though. 
>
> Yes, that and by research done while trying to understand why my RISC-V 
> build farm animal "greenfly" (OrangePi RV2 with a VisionFive 2 CPU: 
> RISC-V RV64GC + Zba/Zbb/Zbc/Zbs) is failing consistently.
>
>> -- 
>> nathan
>
> Forgive me, while $subject only mentions popcount I couldn't help 
> myself so I added a few more RISC-V patches including a bug fix that I 
> hope makes greenfly happy again.
>
>
> 0001 - This is a bug fix for DES initialization when built with Clang on RISC-V.
>
> ------> Join me in "the rabbit hole" on this issue if you care to...
>
> The existing software DES (as shown by the build-farm animal "greenfly" 
> [1]) fails because Clang 20 has an auto-vectorization bug that we 
> trigger in the DES initialization code (des_init() function), not the 
> DES encryption algorithm itself.
>
> I searched the LLVM issue tracker; these are the issues that caught my eye:
>   1. Issue #176001 - "RISC-V Wrong code at -O1"
>     - Vector peephole optimization with vmerge folding
>     - Fixed by PR #176077 (merged Jan 2026)
>     - Link: https://github.com/llvm/llvm-project/issues/176001
>   2. Issue #187458 - "Wrong code for vector.extract.last.active"
>     - Large index issues with zvl1024b
>     - Partially fixed, work still ongoing
>     - Link: https://github.com/llvm/llvm-project/issues/187458
>   3. Issue #171978 - "RISC-V Wrong code at -O2/O3"
>     - Illegal instruction from mismatched EEW
>     - Under investigation
>     - Link: https://github.com/llvm/llvm-project/issues/171978
>   4. PR #176105 - "Fix i64 gather/scatter cost on rv32"
>     - Cost model fixes for scatter/gather (merged Jan 2026)
>     - Link: https://github.com/llvm/llvm-project/pull/176105
>
> My fix in 0001 is simply adding this in a few places in crypt-des.c:
>
>   #if defined(__riscv) && defined(__clang__)
>       pg_memory_barrier();
>   #endif
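To make the shape of that workaround concrete, here is a minimal sketch
of a table-inversion loop with a barrier in the loop body.  Everything
in it is an assumption for illustration: invert_perm32() is a
hypothetical helper (it is not the code in crypt-des.c, and it uses
0-based indexes), and compiler_barrier() is a GCC/Clang compiler-only
stand-in for pg_memory_barrier(), which is a full memory barrier in
Postgres.

```c
#include <stdint.h>

/*
 * Compiler-only barrier (GCC/Clang inline-asm syntax).  This is a
 * stand-in for pg_memory_barrier(), used so the sketch builds without
 * the PostgreSQL headers.
 */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/*
 * Hypothetical helper: build the inverse of a 32-entry permutation.
 * "perm" must contain each value 0..31 exactly once.
 */
void
invert_perm32(const uint8_t perm[32], uint8_t inverse[32])
{
    for (int i = 0; i < 32; i++)
    {
        inverse[perm[i]] = (uint8_t) i;
#if defined(__riscv) && defined(__clang__)
        /* The asm statement is expected to keep this loop scalar. */
        compiler_barrier();
#endif
    }
}
```

On any other compiler or target the #if block compiles to nothing, so
the loop is unchanged there.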
>
> While searching I ran across a different solution: adding `-mllvm 
> -riscv-v-vector-bits-min=0` sets the minimum vector bit width for the 
> RISC-V vector extension in LLVM to 0, which disables all vectorization 
> and forces scalar code generation, so no RVV instructions are emitted.  
> This would prevent the DES bug at the cost of losing vectorization 
> everywhere in the binary.
>
> While that might also fix the other intermittent bug we'd been seeing 
> on greenfly (not tested), disabling all RVV optimizations seems too 
> heavy-handed to me.
>
>
> ------> Moving on.
>
> 0002 - (was "0001" in v2) This is unchanged; it implements popcount 
> using the Zbb extension on RISC-V.
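For readers who have not looked at the patch, a rough sketch of the
idea follows.  It is not the patch itself: the function names are
hypothetical, and it selects the implementation at compile time only,
using the __riscv_zbb macro that GCC and Clang define when Zbb is in
-march.  With Zbb enabled, __builtin_popcountll() is expected to
compile to a single cpop instruction; otherwise the sketch falls back
to a portable bit-twiddling version.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable fallback: classic SWAR population count. */
static inline int
popcount64_sw(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (int) ((x * 0x0101010101010101ULL) >> 56);
}

/* Hypothetical wrapper: use the builtin only when built with Zbb. */
static inline int
popcount64(uint64_t x)
{
#if defined(__riscv_zbb)
    return __builtin_popcountll(x);
#else
    return popcount64_sw(x);
#endif
}

/* Count the set bits in an array of n 64-bit words. */
uint64_t
popcount_buf(const uint64_t *words, size_t n)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += (uint64_t) popcount64(words[i]);
    return total;
}
```

A binary that must also run on CPUs without Zbb cannot rely on the
compile-time check alone, which is why runtime detection is needed.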
>
> 0003 - This is a small patch adapted from the Google Abseil project's 
> RISC-V CRC32C implementation [1].  It is *a lot faster* than the 
> software crc32c we fall back to now (see: riscv-crc32c.c).  This 
> algorithm requires the Zbc (or Zbkc) extension (for clmul), so the 
> patch tests for that at build time and adds the '-march' flag when it 
> is available.  However, as is the case for Zbb and popcnt, the 
> presence of Zbc (or Zbkc) must be detected at runtime.  That's done 
> following the pre-existing pattern used for ARM features.  This does 
> introduce some runtime overhead and complexity, not more than 
> required I hope.
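As an illustration of where that runtime overhead comes from, here is a
sketch of a dispatch through a function pointer that resolves itself on
the first call.  This is my reading of the general pattern, not code
from the patch: cpu_has_zbc() and crc32c_zbc_placeholder() are
hypothetical stubs (the first always reports "no Zbc", the second just
defers to the software routine), and crc32c_sw() is a plain bitwise
reference, not the table-driven fallback Postgres uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t
crc32c_sw(uint32_t crc, const void *buf, size_t len)
{
    const unsigned char *p = buf;

    crc = ~crc;
    while (len-- > 0)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* HYPOTHETICAL: a real probe would ask the OS; this stub says "no". */
static int
cpu_has_zbc(void)
{
    return 0;
}

/* HYPOTHETICAL stand-in for a clmul-based routine. */
static uint32_t
crc32c_zbc_placeholder(uint32_t crc, const void *buf, size_t len)
{
    return crc32c_sw(crc, buf, len);
}

static uint32_t crc32c_choose(uint32_t crc, const void *buf, size_t len);

/* Starts at the chooser, which replaces it on the first call. */
static uint32_t (*crc32c_impl) (uint32_t, const void *, size_t) = crc32c_choose;

static uint32_t
crc32c_choose(uint32_t crc, const void *buf, size_t len)
{
    crc32c_impl = cpu_has_zbc() ? crc32c_zbc_placeholder : crc32c_sw;
    return crc32c_impl(crc, buf, len);
}

/* Public entry point: one indirect call per invocation. */
uint32_t
crc32c(uint32_t crc, const void *buf, size_t len)
{
    return crc32c_impl(crc, buf, len);
}
```

Calling crc32c(0, "123456789", 9) should give the standard check value
0xE3069283, the same one the test program validates against.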
>
> I attached test code, and results at the end of this email:
> * riscv-popcnt.c - unchanged
> * riscv-crc32c.c - new, based on work in the Google Abseil project
> * riscv-des.c    - highlights the fix for DES using Clang on RISC-V 
>
> I guess the question for 0002 and/or 0003 is whether the "juice" is 
> worth the "squeeze".  There is a lot of performance juice to be had 
> IMO.  But some might argue that RISC-V isn't widely adopted yet, and 
> they'd be right.  Others might point out that RISC-V is currently 
> showing up in embedded systems more than server/desktop/laptop/cloud, 
> also true.  However, there is some evidence that this is changing: 
> there are RISC-V servers [2][3], and there is a hosted (cloud) 
> solution from Scaleway [4].  There exists a 64-core RISC-V desktop [5] 
> and a Framework laptop mainboard [6] sporting a RISC-V CPU.  And there 
> is the OrangePi RV2 [7] I have that is "greenfly".
>
> Is it early days?  Certainly!  But too early?  That's up for debate. :)
>
> If nothing else, these patches can serve as a durable record, to be 
> used later when RISC-V is a critical platform for Postgres, or as 
> information for other projects.

Rebased and tested (v4).  This adds better support for RISC-V: a fix for DES, plus faster popcount and CRC32C when the CPU supports the required extensions.

best.

-greg

> best.
>
> -greg
>
> [1] https://github.com/abseil/abseil-cpp/pull/1986 
> absl/crc/internal/crc_riscv.cc
> [2] 
> https://www.firefly.store/products/rs-sra120-risc-v-server-2u-computing-server-cloud-storage-large-model-sg2042
> [3] 
> https://edgeaicomputer.com/our-products/servers/risc-v-compute-server-sra1-20/
> [4] 
> https://www.scaleway.com/en/news/scaleway-launches-its-risc-v-servers-in-the-cloud-a-world-first-and-a-firm-commitment-to-technological-independence/
> [5] https://milkv.io/pioneer and 
> https://www.crowdsupply.com/milk-v/milk-v-pioneer/updates/current-status-of-production
> [6] https://deepcomputing.io/product/dc-roma-risc-v-mainboard/
> [7] 
> http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-RV2.html
>
>
> ---- TEST PROGRAM OUTPUT:
>
> gburd@rv:~/ws/postgres$ make -f Makefile.RISCV
> gcc -O2 riscv-des.c -o des-gcc-sw
> gcc -O2 riscv-des.c -march=rv64gcv -o des-gcc-hw
> clang-20 -O1 riscv-des.c -o des-clang-o1-sw
> clang-20 -O1 -march=rv64gcv riscv-des.c -o des-clang-o1-hw
> clang-20 -O2 riscv-des.c -o des-clang-o2-sw
> clang-20 -O2 -march=rv64gcv riscv-des.c -o des-clang-o2-hw
> gcc -O2 -o popcnt-gcc-o2-sw riscv-popcnt.c
> gcc -O2 -march=rv64gc_zbb -o popcnt-gcc-o2-hw riscv-popcnt.c
> clang-20 -O2 -o popcnt-clang-o2-sw riscv-popcnt.c
> clang-20 -O2 -march=rv64gc_zbb -o popcnt-clang-o2-hw riscv-popcnt.c
> gcc -O2 -o crc32c-gcc-o2-sw riscv-crc32c.c
> gcc -O2 -march=rv64gc_zbc -o crc32c-gcc-o2-hw riscv-crc32c.c
> clang-20 -O2 -o crc32c-clang-o2-sw riscv-crc32c.c
> clang-20 -O2 -march=rv64gc_zbc -o crc32c-clang-o2-hw riscv-crc32c.c
> gburd@rv:~/ws/postgres$ make -f Makefile.RISCV test
> ./des-gcc-sw
> Compiler: GCC 13.3.0
> Target: RISC-V 64-bit
> Vector extension: Not enabled
>
> Testing WITHOUT compiler barriers:
> PASS: Permutation tables are correct
>
> Testing WITH compiler barriers:
> PASS: Permutation tables are correct
>
> Performance Comparison (1000000 iterations):
> Without barriers: 0.409 seconds (409 ns/iter)
> With barriers:    0.416 seconds (416 ns/iter)
> Overhead: 1.6%
> ./des-gcc-hw
> Compiler: GCC 13.3.0
> Target: RISC-V 64-bit
> Vector extension: Enabled (RVV)
>
> Testing WITHOUT compiler barriers:
> PASS: Permutation tables are correct
>
> Testing WITH compiler barriers:
> PASS: Permutation tables are correct
>
> Performance Comparison (1000000 iterations):
> Without barriers: 0.410 seconds (410 ns/iter)
> With barriers:    0.410 seconds (410 ns/iter)
> Overhead: Negligible
> ./des-clang-o1-sw
> Compiler: Clang 20.1.2
> Target: RISC-V 64-bit
> Vector extension: Not enabled
>
> Testing WITHOUT compiler barriers:
> PASS: Permutation tables are correct
>
> Testing WITH compiler barriers:
> PASS: Permutation tables are correct
>
> Performance Comparison (1000000 iterations):
> Without barriers: 0.517 seconds (517 ns/iter)
> With barriers:    0.516 seconds (516 ns/iter)
> Overhead: Negligible
> ./des-clang-o1-hw
> Compiler: Clang 20.1.2
> Target: RISC-V 64-bit
> Vector extension: Enabled (RVV)
>
> Testing WITHOUT compiler barriers:
> PASS: Permutation tables are correct
>
> Testing WITH compiler barriers:
> PASS: Permutation tables are correct
>
> Performance Comparison (1000000 iterations):
> Without barriers: 0.405 seconds (405 ns/iter)
> With barriers:    0.405 seconds (405 ns/iter)
> Overhead: Negligible
> ./des-clang-o2-sw
> Compiler: Clang 20.1.2
> Target: RISC-V 64-bit
> Vector extension: Not enabled
>
> Testing WITHOUT compiler barriers:
> PASS: Permutation tables are correct
>
> Testing WITH compiler barriers:
> PASS: Permutation tables are correct
>
> Performance Comparison (1000000 iterations):
> Without barriers: 0.517 seconds (517 ns/iter)
> With barriers:    0.518 seconds (518 ns/iter)
> Overhead: Negligible
> ./des-clang-o2-hw
> Compiler: Clang 20.1.2
> Target: RISC-V 64-bit
> Vector extension: Enabled (RVV)
>
> Testing WITHOUT compiler barriers:
> ERROR: un_pbox mismatch:
> 	un_pbox[0] = 15, expected 8
> 	un_pbox[1] = 6, expected 16
> 	un_pbox[2] = 19, expected 22
> 	un_pbox[3] = 20, expected 30
> 	un_pbox[4] = 28, expected 12
>   ... and 27 more errors
> FAIL: Permutation tables are incorrect
>
> Testing WITH compiler barriers:
> PASS: Permutation tables are correct
>
> Performance Comparison (1000000 iterations):
> Without barriers: 0.093 seconds (93 ns/iter)
> With barriers:    0.407 seconds (407 ns/iter)
> Overhead: 335.5%
> ./popcnt-gcc-o2-sw
> sw popcount:    0.183 sec  (    547.89 MB/s)
> hw popcount:    0.274 sec  (    365.40 MB/s)
>
> diff: 0.67x
> match: 406261900 bits counted
> ./popcnt-gcc-o2-hw
> sw popcount:    0.182 sec  (    548.17 MB/s)
> hw popcount:    0.044 sec  (   2287.82 MB/s)
>
> diff: 4.17x
> match: 406261900 bits counted
> ./popcnt-clang-o2-sw
> sw popcount:    0.188 sec  (    531.96 MB/s)
> hw popcount:    0.207 sec  (    482.84 MB/s)
>
> diff: 0.91x
> match: 406261900 bits counted
> ./popcnt-clang-o2-hw
> sw popcount:    0.224 sec  (    446.46 MB/s)
> hw popcount:    0.056 sec  (   1794.83 MB/s)
>
> diff: 4.02x
> match: 406261900 bits counted
> ./crc32c-gcc-o2-sw
> sw crc32c:    0.651 sec  (    153.68 MB/s)
> hw crc32c:    0.651 sec  (    153.72 MB/s)
>
> diff: 1.00x
> match: 0x0B141F2D
>
> validation: CRC32C("123456789") = 0xE3069283 (correct)
> ./crc32c-gcc-o2-hw
> sw crc32c:    0.651 sec  (    153.70 MB/s)
> hw crc32c:    0.000 sec  ( 308052.33 MB/s)
>
> diff: 2004.21x
> match: 0x0B141F2D
>
> validation: CRC32C("123456789") = 0xE3069283 (correct)
> ./crc32c-clang-o2-sw
> sw crc32c:    0.584 sec  (    171.10 MB/s)
> hw crc32c:    0.584 sec  (    171.17 MB/s)
>
> diff: 1.00x
> match: 0x0B141F2D
>
> validation: CRC32C("123456789") = 0xE3069283 (correct)
> ./crc32c-clang-o2-hw
> sw crc32c:    0.584 sec  (    171.15 MB/s)
> hw crc32c:    0.000 sec  ( 309282.38 MB/s)
>
> diff: 1807.08x
> match: 0x0B141F2D
>
> validation: CRC32C("123456789") = 0xE3069283 (correct)
> Attachments:
> * Makefile.RISCV
> * riscv-crc32c.c
> * riscv-des.c
> * riscv-popcnt.c
> * v3-0001-Avoid-Clang-RISC-V-auto-vectorization-bug-in-DES.patch
> * v3-0002-Add-RISC-V-popcount-using-Zbb-extension.patch
> * v3-0003-Add-RISC-V-CRC32C-using-the-Zbc-extension.patch