Re: Add RISC-V Zbb popcount optimization
Greg Burd <greg@burd.me> — 2026-05-27T17:04:46Z
On Fri, Mar 27, 2026, at 4:22 PM, Greg Burd wrote:
> On Mon, Mar 23, 2026, at 11:09 AM, Nathan Bossart wrote:
>> On Sun, Mar 22, 2026 at 02:01:50PM -0400, Andres Freund wrote:
>>> I'm also pretty doubtful all the effort to e.g. add AVX 512 popcount was spent
>>> all that effectively - hard to believe there's any real world workloads where
>>> that gain is worth the squeeze. At least for aarch64 and x86-64 there's real
>>> world use of those platforms, making niche-y perf improvements somewhat
>>> worthwhile. Whereas there's afaict not yet a whole lot of riscv production
>>> adoption.
>
> Hey Nathan,
>
>> That work was partially motivated by vector stuff that used popcount
>> functions pretty heavily, but yeah, the complexity compared to the gains is
>> the main reason I've been pushing to just use simd.h elsewhere (i.e., SSE2
>> and Neon). I'd still consider using AVX-512, etc. for things if the impact
>> on real-world workloads was huge, though.
>
> Yes, that and by research done while trying to understand why my RISC-V
> build farm animal "greenfly" (OrangePi RV2 with a VisionFive 2 CPU:
> RISC-V RV64GC + Zba/Zbb/Zbc/Zbs) is failing consistently.
>
>> --
>> nathan
>
> Forgive me, while $subject only mentions popcount I couldn't help
> myself so I added a few more RISC-V patches including a bug fix that I
> hope makes greenfly happy again.
>
>
> 0001 - This is a bug fix for DES initialization on RISC-V with Clang.
>
> ------> Join me in "the rabbit hole" on this issue if you care to...
>
> The existing software DES (as shown by the build-farm animal "greenfly"
> [1]) fails because Clang 20 has an auto-vectorization bug that we
> trigger in the DES initialization code (des_init() function), not the
> DES encryption algorithm itself.
>
> I searched the LLVM issue tracker, here are the issues that caught my eye:
> 1. Issue #176001 - "RISC-V Wrong code at -O1"
>    - Vector peephole optimization with vmerge folding
>    - Fixed by PR #176077 (merged Jan 2024)
>    - Link: https://github.com/llvm/llvm-project/issues/176001
> 2. Issue #187458 - "Wrong code for vector.extract.last.active"
>    - Large index issues with zvl1024b
>    - Partially fixed, work still ongoing
>    - Link: https://github.com/llvm/llvm-project/issues/187458
> 3. Issue #171978 - "RISC-V Wrong code at -O2/O3"
>    - Illegal instruction from mismatched EEW
>    - Under investigation
>    - Link: https://github.com/llvm/llvm-project/issues/171978
> 4. PR #176105 - "Fix i64 gather/scatter cost on rv32"
>    - Cost model fixes for scatter/gather (merged Jan 2026)
>    - Link: https://github.com/llvm/llvm-project/pull/176105
>
> My fix in 0001 is simply adding this in a few places in crypt-des.c:
>
>     #if defined(__riscv) && defined(__clang__)
>     pg_memory_barrier();
>     #endif
>
> While searching I ran across a different solution: adding `-mllvm
> -riscv-v-vector-bits-min=0` sets the minimum vector bit width for the
> RISC-V vector extension in LLVM to 0, which disables all vectorization
> and forces scalar code generation, so no RVV instructions are emitted.
> This would prevent the DES bug at the cost of any vectorization
> anywhere in the binary.
>
> While that might also fix the other intermittent bug we'd been seeing
> on greenfly (not tested), disabling all RVV optimizations seems too
> heavy-handed to me.
>
>
> ------> Moving on.
>
> 0002 - (was "0001" in v2) this is unchanged; it implements popcount
> using the Zbb extension on RISC-V.
>
> 0003 - is a small patch adapted from the Google Abseil project's
> RISC-V CRC32C implementation [1]. It is *a lot faster* than the
> software crc32c we fall back to now (see: riscv-crc32c.c). This
> algorithm requires the Zbc (or Zbkc) extension (for clmul), so the patch
> tests for that at build time and adds the '-march' flag when it is
> available.
> However, as is the case for Zbb and popcount, the presence of Zbc (or
> Zbkc) must be detected at runtime. That's done following the
> pre-existing pattern used for ARM features. This does introduce some
> runtime overhead and complexity, not more than required I hope.
>
> I attached test code, and results at the end of this email:
> * riscv-popcnt.c - unchanged
> * riscv-crc32c.c - new, based on work in the Google Abseil project
> * riscv-des.c - highlights the fix for DES using Clang on RISC-V
>
> I guess the question for 0002 and/or 0003 is if the "juice" is worth the
> "squeeze" or not. There is a lot of performance juice to be had IMO.
> But some might argue that RISC-V isn't widely adopted yet, and they'd
> be right. Others might point out that RISC-V is currently showing up
> in embedded systems more than server/desktop/laptop/cloud, also true.
> However, there is some evidence that this is changing: there are RISC-V
> servers [2][3], and there is a hosted (cloud) solution from Scaleway
> [4]. There exists a 64 core RISC-V desktop [5] and a Framework laptop
> mainboard [6] sporting a RISC-V CPU. And there is the OrangePi RV2
> [7] I have that is "greenfly".
>
> Is it early days? Certainly! But too early? That's up for debate. :)
>
> If nothing else, these patches can serve as a durable record, to be
> used later when RISC-V is a critical platform for Postgres, or as
> information for other projects.

Rebased and tested (v4), adding better support for RISC-V with a fix for
DES and faster popcount and CRC32 when the CPU supports it.

best.

-greg

> best.
>
> -greg
>
> [1] https://github.com/abseil/abseil-cpp/pull/1986 (absl/crc/internal/crc_riscv.cc)
> [2] https://www.firefly.store/products/rs-sra120-risc-v-server-2u-computing-server-cloud-storage-large-model-sg2042
> [3] https://edgeaicomputer.com/our-products/servers/risc-v-compute-server-sra1-20/
> [4] https://www.scaleway.com/en/news/scaleway-launches-its-risc-v-servers-in-the-cloud-a-world-first-and-a-firm-commitment-to-technological-independence/
> [5] https://milkv.io/pioneer and https://www.crowdsupply.com/milk-v/milk-v-pioneer/updates/current-status-of-production
> [6] https://deepcomputing.io/product/dc-roma-risc-v-mainboard/
> [7] http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-RV2.html
>
>
> ---- TEST PROGRAM OUTPUT:
>
> gburd@rv:~/ws/postgres$ make -f Makefile.RISCV
> gcc -O2 riscv-des.c -o des-gcc-sw
> gcc -O2 riscv-des.c -march=rv64gcv -o des-gcc-hw
> clang-20 -O1 riscv-des.c -o des-clang-o1-sw
> clang-20 -O1 -march=rv64gcv riscv-des.c -o des-clang-o1-hw
> clang-20 -O2 riscv-des.c -o des-clang-o2-sw
> clang-20 -O2 -march=rv64gcv riscv-des.c -o des-clang-o2-hw
> gcc -O2 -o popcnt-gcc-o2-sw riscv-popcnt.c
> gcc -O2 -march=rv64gc_zbb -o popcnt-gcc-o2-hw riscv-popcnt.c
> clang-20 -O2 -o popcnt-clang-o2-sw riscv-popcnt.c
> clang-20 -O2 -march=rv64gc_zbb -o popcnt-clang-o2-hw riscv-popcnt.c
> gcc -O2 -o crc32c-gcc-o2-sw riscv-crc32c.c
> gcc -O2 -march=rv64gc_zbc -o crc32c-gcc-o2-hw riscv-crc32c.c
> clang-20 -O2 -o crc32c-clang-o2-sw riscv-crc32c.c
> clang-20 -O2 -march=rv64gc_zbc -o crc32c-clang-o2-hw riscv-crc32c.c
> gburd@rv:~/ws/postgres$ make -f Makefile.RISCV test
> ./des-gcc-sw
> Compiler: GCC 13.3.0
> Target: RISC-V 64-bit
> Vector extension: Not enabled
>
> Testing WITHOUT compiler barriers:
> PASS: Permutation tables are correct
>
> Testing WITH compiler barriers:
> PASS: Permutation tables are correct
>
> Performance Comparison (1000000 iterations):
> Without barriers: 0.409 seconds (409 ns/iter)
> With barriers: 0.416 seconds (416 ns/iter)
> Overhead: 1.6%
> ./des-gcc-hw
> Compiler: GCC 13.3.0
> Target: RISC-V 64-bit
> Vector extension: Enabled (RVV)
>
> Testing WITHOUT compiler barriers:
> PASS: Permutation tables are correct
>
> Testing WITH compiler barriers:
> PASS: Permutation tables are correct
>
> Performance Comparison (1000000 iterations):
> Without barriers: 0.410 seconds (410 ns/iter)
> With barriers: 0.410 seconds (410 ns/iter)
> Overhead: Negligible
> ./des-clang-o1-sw
> Compiler: Clang 20.1.2
> Target: RISC-V 64-bit
> Vector extension: Not enabled
>
> Testing WITHOUT compiler barriers:
> PASS: Permutation tables are correct
>
> Testing WITH compiler barriers:
> PASS: Permutation tables are correct
>
> Performance Comparison (1000000 iterations):
> Without barriers: 0.517 seconds (517 ns/iter)
> With barriers: 0.516 seconds (516 ns/iter)
> Overhead: Negligible
> ./des-clang-o1-hw
> Compiler: Clang 20.1.2
> Target: RISC-V 64-bit
> Vector extension: Enabled (RVV)
>
> Testing WITHOUT compiler barriers:
> PASS: Permutation tables are correct
>
> Testing WITH compiler barriers:
> PASS: Permutation tables are correct
>
> Performance Comparison (1000000 iterations):
> Without barriers: 0.405 seconds (405 ns/iter)
> With barriers: 0.405 seconds (405 ns/iter)
> Overhead: Negligible
> ./des-clang-o2-sw
> Compiler: Clang 20.1.2
> Target: RISC-V 64-bit
> Vector extension: Not enabled
>
> Testing WITHOUT compiler barriers:
> PASS: Permutation tables are correct
>
> Testing WITH compiler barriers:
> PASS: Permutation tables are correct
>
> Performance Comparison (1000000 iterations):
> Without barriers: 0.517 seconds (517 ns/iter)
> With barriers: 0.518 seconds (518 ns/iter)
> Overhead: Negligible
> ./des-clang-o2-hw
> Compiler: Clang 20.1.2
> Target: RISC-V 64-bit
> Vector extension: Enabled (RVV)
>
> Testing WITHOUT compiler barriers:
> ERROR: un_pbox mismatch:
> un_pbox[0] = 15, expected 8
> un_pbox[1] = 6, expected 16
> un_pbox[2] = 19, expected 22
> un_pbox[3] = 20, expected 30
> un_pbox[4] = 28, expected 12
> ... and 27 more errors
> FAIL: Permutation tables are incorrect
>
> Testing WITH compiler barriers:
> PASS: Permutation tables are correct
>
> Performance Comparison (1000000 iterations):
> Without barriers: 0.093 seconds (93 ns/iter)
> With barriers: 0.407 seconds (407 ns/iter)
> Overhead: 335.5%
> ./popcnt-gcc-o2-sw
> sw popcount: 0.183 sec ( 547.89 MB/s)
> hw popcount: 0.274 sec ( 365.40 MB/s)
>
> diff: 0.67x
> match: 406261900 bits counted
> ./popcnt-gcc-o2-hw
> sw popcount: 0.182 sec ( 548.17 MB/s)
> hw popcount: 0.044 sec ( 2287.82 MB/s)
>
> diff: 4.17x
> match: 406261900 bits counted
> ./popcnt-clang-o2-sw
> sw popcount: 0.188 sec ( 531.96 MB/s)
> hw popcount: 0.207 sec ( 482.84 MB/s)
>
> diff: 0.91x
> match: 406261900 bits counted
> ./popcnt-clang-o2-hw
> sw popcount: 0.224 sec ( 446.46 MB/s)
> hw popcount: 0.056 sec ( 1794.83 MB/s)
>
> diff: 4.02x
> match: 406261900 bits counted
> ./crc32c-gcc-o2-sw
> sw crc32c: 0.651 sec ( 153.68 MB/s)
> hw crc32c: 0.651 sec ( 153.72 MB/s)
>
> diff: 1.00x
> match: 0x0B141F2D
>
> validation: CRC32C("123456789") = 0xE3069283 (correct)
> ./crc32c-gcc-o2-hw
> sw crc32c: 0.651 sec ( 153.70 MB/s)
> hw crc32c: 0.000 sec ( 308052.33 MB/s)
>
> diff: 2004.21x
> match: 0x0B141F2D
>
> validation: CRC32C("123456789") = 0xE3069283 (correct)
> ./crc32c-clang-o2-sw
> sw crc32c: 0.584 sec ( 171.10 MB/s)
> hw crc32c: 0.584 sec ( 171.17 MB/s)
>
> diff: 1.00x
> match: 0x0B141F2D
>
> validation: CRC32C("123456789") = 0xE3069283 (correct)
> ./crc32c-clang-o2-hw
> sw crc32c: 0.584 sec ( 171.15 MB/s)
> hw crc32c: 0.000 sec ( 309282.38 MB/s)
>
> diff: 1807.08x
> match: 0x0B141F2D
>
> validation: CRC32C("123456789") = 0xE3069283 (correct)
> Attachments:
> * Makefile.RISCV
> * riscv-crc32c.c
> * riscv-des.c
> * riscv-popcnt.c
> * v3-0001-Avoid-Clang-RISC-V-auto-vectorization-bug-in-DES.patch
> * v3-0002-Add-RISC-V-popcount-using-Zbb-extension.patch
> * v3-0003-Add-RISC-V-CRC32C-using-the-Zbc-extension.patch
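For readers who have not opened riscv-des.c, the un_pbox failure in the output above concerns an inverse permutation table built in a small scatter loop. The following is only a rough sketch of that pattern with the 0001-style guard applied, not the patch itself. Assumptions: sketch_barrier() is a hypothetical stand-in for pg_memory_barrier() so the code builds outside the PostgreSQL tree, placing the guard inside the loop is a guess (the mail only says "a few places"), and the table is an illustrative 0-based DES P-box chosen to agree with the "expected" values printed above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for pg_memory_barrier(); a sequentially consistent
 * fence also acts as a compiler barrier on GCC and Clang.
 */
#define sketch_barrier() __atomic_thread_fence(__ATOMIC_SEQ_CST)

/* Illustrative 0-based P-box permutation (assumed, see lead-in). */
const uint8_t pbox[32] = {
	15, 6, 19, 20, 28, 11, 27, 16, 0, 14, 22, 25, 4, 17, 30, 9,
	1, 7, 23, 13, 31, 26, 2, 8, 18, 12, 29, 5, 21, 10, 3, 24
};

uint8_t		un_pbox[32];

/* Build the inverse table so that un_pbox[pbox[i]] == i for every i. */
void
build_un_pbox(void)
{
	for (int i = 0; i < 32; i++)
	{
		assert(pbox[i] < 32);
		un_pbox[pbox[i]] = (uint8_t) i;
#if defined(__riscv) && defined(__clang__)
		/* Guard as quoted in the mail; placement here is an assumption. */
		sketch_barrier();
#endif
	}
}
```

With a correct build, un_pbox[0] is 8 and un_pbox[4] is 12, matching the "expected" column in the failing des-clang-o2-hw run; the failing values there (15, 6, 19, ...) are simply pbox[] copied straight through.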
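The sw/hw split in the popcnt results can be pictured roughly as below. This is a hedged sketch, not the code in 0002 or riscv-popcnt.c: the function names are made up for illustration, and the runtime Zbb check is reduced to a flag passed in by the caller because the actual detection mechanism is not shown in the mail. The only compiler facility used is __builtin_popcountll, which GCC and Clang are expected to lower to a single cpop when built with -march=rv64gc_zbb, and to a software sequence or library call otherwise.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable SWAR popcount, the kind of fallback used when Zbb is absent. */
uint64_t
popcount64_sw(uint64_t x)
{
	x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
	x = (x & UINT64_C(0x3333333333333333)) +
		((x >> 2) & UINT64_C(0x3333333333333333));
	x = (x + (x >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
	return (x * UINT64_C(0x0101010101010101)) >> 56;
}

/* Compiler builtin; expected to become cpop when Zbb is enabled at build. */
uint64_t
popcount64_builtin(uint64_t x)
{
	return (uint64_t) __builtin_popcountll(x);
}

typedef uint64_t (*popcount64_fn) (uint64_t);

/* Hypothetical selector: cpu_has_zbb comes from whatever runtime probe is used. */
popcount64_fn
choose_popcount64(int cpu_has_zbb)
{
	return cpu_has_zbb ? popcount64_builtin : popcount64_sw;
}

/* Count set bits in a buffer: 8 bytes at a time, then the tail bytewise. */
uint64_t
popcount_buf(const uint8_t *buf, size_t len, popcount64_fn fn)
{
	uint64_t	total = 0;

	while (len >= 8)
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));	/* avoids unaligned loads */
		total += fn(word);
		buf += 8;
		len -= 8;
	}
	while (len-- > 0)
		total += fn(*buf++);
	return total;
}
```

Choosing the function pointer once and reusing it keeps the detection cost out of the hot loop, which is presumably why the "match" lines above compare totals from both paths over the same buffer.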
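The "validation" lines in the crc32c output check the well-known CRC-32C test vector. As a point of reference, here is a minimal bitwise CRC-32C plus a software model of what the Zbc clmul instruction computes. Neither function is taken from 0003 or from Abseil; the names are illustrative, and the clmul-based folding that makes the hardware path fast is deliberately not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t
crc32c_bitwise(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t	crc = 0xFFFFFFFFu;

	while (len-- > 0)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc ^ 0xFFFFFFFFu;
}

/*
 * Software model of clmul on RV64: the low 64 bits of the carry-less
 * (XOR instead of add) product of a and b.
 */
uint64_t
clmul64_sw(uint64_t a, uint64_t b)
{
	uint64_t	result = 0;

	for (int i = 0; i < 64; i++)
	{
		if ((b >> i) & 1)
			result ^= a << i;
	}
	return result;
}
```

Any accelerated implementation has to return the same value as crc32c_bitwise for the same input; for "123456789" that value is 0xE3069283, as printed in the output above.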