Speed up ICU case conversion by using ucasemap_utf8To*()
From: Andreas Karlsson <andreas@proxel.se>
To: pgsql-hackers <pgsql-hackers@postgresql.org>
Cc: Jeff Davis <pgsql@j-davis.com>
Date: 2024-12-20T05:20:38Z
Lists: pgsql-hackers
Attachments:
- v1-0001-Use-optimized-versions-of-ICU-case-conversion-for.patch (text/x-patch)
- v1-0002-Reduce-code-duplication-in-ICU-case-mapping-code.patch (text/x-patch)
Hi,
Jeff pointed out to me that the case conversion functions in ICU have
UTF-8-specific versions, ucasemap_utf8To*(), which means we can call
those directly when the database encoding is UTF-8 and skip converting
to and from UChar. Since most people today run their databases in UTF-8,
I think this optimization is worth it. When measuring on short to
medium-length strings I got a 15-20% speedup. It is still slower than
glibc in my benchmarks, but the gap is smaller now.
SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
"sv-SE-x-icu") FROM generate_series(1, 1000000) i);
master:  ~540 ms
patched: ~460 ms
glibc:   ~410 ms
I have also attached a cleanup patch for the non-UTF-8 code paths. I
thought about doing the same for the new UTF-8 code paths, but it turned
out to be a bit messy because ucasemap_utf8ToUpper() and
ucasemap_utf8ToLower() have a different function signature from
ucasemap_utf8ToTitle().
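To illustrate the signature mismatch: as far as I can tell from ICU's ucasemap.h, ucasemap_utf8ToTitle() takes a non-const UCaseMap * while the upper and lower variants take a const one, so the three cannot share one function pointer type directly. Below is a minimal sketch of one possible workaround, an adapter function. It is not what the attached patches do. CaseMap, ErrorCode, mock_to_upper(), mock_to_title(), title_adapter() and convert_case() are all hypothetical stand-ins so that the sketch builds without ICU, and the mocks only handle ASCII.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for ICU's UCaseMap and UErrorCode (assumption). */
typedef struct CaseMap { int title_calls; } CaseMap;
typedef int ErrorCode;

/* Shaped like ucasemap_utf8ToUpper()/ucasemap_utf8ToLower(): const case map. */
int32_t
mock_to_upper(const CaseMap *csm, char *dest, int32_t cap,
              const char *src, int32_t len, ErrorCode *err)
{
    (void) csm;
    (void) err;
    for (int32_t i = 0; i < len && i < cap; i++)
        dest[i] = (char) toupper((unsigned char) src[i]);
    return len;                 /* length needed, even if truncated */
}

/* Shaped like ucasemap_utf8ToTitle(): the case map is NOT const. */
int32_t
mock_to_title(CaseMap *csm, char *dest, int32_t cap,
              const char *src, int32_t len, ErrorCode *err)
{
    int in_word = 0;

    (void) err;
    csm->title_calls++;         /* models state kept inside the case map */
    for (int32_t i = 0; i < len && i < cap; i++)
    {
        unsigned char c = (unsigned char) src[i];

        if (isalpha(c))
        {
            dest[i] = (char) (in_word ? tolower(c) : toupper(c));
            in_word = 1;
        }
        else
        {
            dest[i] = (char) c;
            in_word = 0;
        }
    }
    return len;
}

/* One pointer type for all three conversions, using the const signature. */
typedef int32_t (*CaseFuncUTF8) (const CaseMap *csm, char *dest, int32_t cap,
                                 const char *src, int32_t len, ErrorCode *err);

/*
 * Adapter that casts away const for the title variant. This is only valid
 * if the caller's CaseMap object was not itself declared const.
 */
int32_t
title_adapter(const CaseMap *csm, char *dest, int32_t cap,
              const char *src, int32_t len, ErrorCode *err)
{
    return mock_to_title((CaseMap *) csm, dest, cap, src, len, err);
}

/* Shared call site that no longer cares which conversion it runs. */
int32_t
convert_case(CaseFuncUTF8 func, const CaseMap *csm, char *dest, int32_t cap,
             const char *src, int32_t len, ErrorCode *err)
{
    assert(func != NULL);
    *err = 0;
    return func(csm, dest, cap, src, len, err);
}
```

A caller would pass mock_to_upper or title_adapter as func. The const cast in the adapter is the kind of wart that makes the shared version less tidy than the non-UTF-8 cleanup.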
Andreas