Speed up ICU case conversion by using ucasemap_utf8To*()

Andreas Karlsson <andreas@proxel.se>

From: Andreas Karlsson <andreas@proxel.se>

To: pgsql-hackers <pgsql-hackers@postgresql.org>

Cc: Jeff Davis <pgsql@j-davis.com>

Date: 2024-12-20T05:20:38Z

Lists: pgsql-hackers

Attachments

v1-0001-Use-optimized-versions-of-ICU-case-conversion-for.patch (text/x-patch)
v1-0002-Reduce-code-duplication-in-ICU-case-mapping-code.patch (text/x-patch)

Hi,

Jeff pointed out to me that the case conversion functions in ICU have
UTF-8-specific versions, which means we can call those directly when the
database encoding is UTF-8 and skip the conversion to and from UChar.

Since most people today run their databases in UTF-8, I think this
optimization is worth it. When measuring on short to medium-length
strings I got a 15-20% speedup. It is still slower than glibc in my
benchmarks, but the gap is smaller now.

SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE 
"sv-SE-x-icu") FROM generate_series(1, 1000000) i);

master:  ~540 ms
Patched: ~460 ms
glibc:   ~410 ms

I have also attached a cleanup patch for the non-UTF-8 code paths. I
thought about doing the same for the new UTF-8 code paths, but it turned
out to be a bit messy due to the different function signatures of
ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() vs ucasemap_utf8ToTitle().

Andreas