Thread
-
Speed up ICU case conversion by using ucasemap_utf8To*()
Andreas Karlsson <andreas@proxel.se> — 2024-12-20T05:20:38Z
Hi, Jeff pointed out to me that the case conversion functions in ICU have UTF-8 specific versions which means we can call those directly if the database encoding is UTF-8 and skip having to convert to and from UChar. Since most people today run their databases in UTF-8 I think this optimization is worth it and when measuring on short to medium length strings I got a 15-20% speed up. It is still slower than glibc in my benchmarks but the gap is smaller now. SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE "sv-SE-x-icu") FROM generate_series(1, 1000000) i); master: ~540 ms Patched: ~460 ms glibc: ~410 ms I have also attached a clean up patch for the non-UTF-8 code paths. I thought about doing the same for the new UTF-8 code paths but it turned out to be a bit messy due to different function signatures for ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() vs ucasemap_utf8ToTitle(). Andreas -
Re: Speed up ICU case conversion by using ucasemap_utf8To*()
Jeff Davis <pgsql@j-davis.com> — 2024-12-20T19:24:04Z
On Fri, 2024-12-20 at 06:20 +0100, Andreas Karlsson wrote: > SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE > "sv-SE-x-icu") FROM generate_series(1, 1000000) i); > > master: ~540 ms > Patched: ~460 ms > glibc: ~410 ms It looks like you are opening and closing the UCaseMap object each time. Why not save it in pg_locale_t? That should speed it up even more and hopefully beat libc. Also, to support older ICU versions consistently, we need to fix up the locale name to support "und"; cf. pg_ucol_open(). Perhaps factor out that logic? Regards, Jeff Davis -
Re: Speed up ICU case conversion by using ucasemap_utf8To*()
vignesh C <vignesh21@gmail.com> — 2025-03-17T06:46:11Z
On Fri, 20 Dec 2024 at 10:50, Andreas Karlsson <andreas@proxel.se> wrote: > > Hi, > > Jeff pointed out to me that the case conversion functions in ICU have > UTF-8 specific versions which means we can call those directly if the > database encoding is UTF-8 and skip having to convert to and from UChar. > > Since most people today run their databases in UTF-8 I think this > optimization is worth it and when measuring on short to medium length > strings I got a 15-20% speed up. It is still slower than glibc in my > benchmarks but the gap is smaller now. > > SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE > "sv-SE-x-icu") FROM generate_series(1, 1000000) i); > > master: ~540 ms > Patched: ~460 ms > glibc: ~410 ms > > I have also attached a clean up patch for the non-UTF-8 code paths. I > thought about doing the same for the new UTF-8 code paths but it turned > out to be a bit messy due to different function signatures for > ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() vs ucasemap_utf8ToTitle(). I noticed that Jeff's comments from [1] have not yet been addressed, I have changed the commitfest entry status to "Waiting on Author", please address them and update it to "Needs Review". [1] - https://www.postgresql.org/message-id/72c7c2b5848da44caddfe0f20f6c7ebc7c0c6e60.camel@j-davis.com Regards, Vignesh -
Re: Speed up ICU case conversion by using ucasemap_utf8To*()
Andres Freund <andres@anarazel.de> — 2025-03-29T18:50:03Z
On 2025-03-17 12:16:11 +0530, vignesh C wrote: > On Fri, 20 Dec 2024 at 10:50, Andreas Karlsson <andreas@proxel.se> wrote: > > > > Hi, > > > > Jeff pointed out to me that the case conversion functions in ICU have > > UTF-8 specific versions which means we can call those directly if the > > database encoding is UTF-8 and skip having to convert to and from UChar. > > > > Since most people today run their databases in UTF-8 I think this > > optimization is worth it and when measuring on short to medium length > > strings I got a 15-20% speed up. It is still slower than glibc in my > > benchmarks but the gap is smaller now. > > > > SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE > > "sv-SE-x-icu") FROM generate_series(1, 1000000) i); > > > > master: ~540 ms > > Patched: ~460 ms > > glibc: ~410 ms > > > > I have also attached a clean up patch for the non-UTF-8 code paths. I > > thought about doing the same for the new UTF-8 code paths but it turned > > out to be a bit messy due to different function signatures for > > ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() vs ucasemap_utf8ToTitle(). > > I noticed that Jeff's comments from [1] have not yet been addressed, I > have changed the commitfest entry status to "Waiting on Author", > please address them and update it to "Needs Review". > [1] - https://www.postgresql.org/message-id/72c7c2b5848da44caddfe0f20f6c7ebc7c0c6e60.camel@j-davis.com It's also worth noting that this patch hasn't been building for quite a while (at least not since 2025-01-29): https://cirrus-ci.com/task/5621435164524544?logs=build#L1228 [17:17:51.214] ld: error: undefined symbol: icu_convert_case [17:17:51.214] >>> referenced by pg_locale_icu.c:484 (../src/backend/utils/adt/pg_locale_icu.c:484) [17:17:51.214] >>> src/backend/postgres_lib.a.p/utils_adt_pg_locale_icu.c.o:(strfold_icu) [17:17:51.214] cc: error: linker command failed with exit code 1 (use -v to see invocation) I think we can mark this as returned-with-feedback for now? Greetings, Andres Freund -
Re: Speed up ICU case conversion by using ucasemap_utf8To*()
vignesh C <vignesh21@gmail.com> — 2025-03-30T01:18:46Z
On Sun, 30 Mar 2025 at 00:20, Andres Freund <andres@anarazel.de> wrote: > > On 2025-03-17 12:16:11 +0530, vignesh C wrote: > > On Fri, 20 Dec 2024 at 10:50, Andreas Karlsson <andreas@proxel.se> wrote: > > > > > > Hi, > > > > > > Jeff pointed out to me that the case conversion functions in ICU have > > > UTF-8 specific versions which means we can call those directly if the > > > database encoding is UTF-8 and skip having to convert to and from UChar. > > > > > > Since most people today run their databases in UTF-8 I think this > > > optimization is worth it and when measuring on short to medium length > > > strings I got a 15-20% speed up. It is still slower than glibc in my > > > benchmarks but the gap is smaller now. > > > > > > SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE > > > "sv-SE-x-icu") FROM generate_series(1, 1000000) i); > > > > > > master: ~540 ms > > > Patched: ~460 ms > > > glibc: ~410 ms > > > > > > I have also attached a clean up patch for the non-UTF-8 code paths. I > > > thought about doing the same for the new UTF-8 code paths but it turned > > > out to be a bit messy due to different function signatures for > > > ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() vs ucasemap_utf8ToTitle(). > > > > I noticed that Jeff's comments from [1] have not yet been addressed, I > > have changed the commitfest entry status to "Waiting on Author", > > please address them and update it to "Needs Review". > > [1] - https://www.postgresql.org/message-id/72c7c2b5848da44caddfe0f20f6c7ebc7c0c6e60.camel@j-davis.com > > It's also worth noting that this patch hasn't been building for quite a while > (at least not since 2025-01-29): > > https://cirrus-ci.com/task/5621435164524544?logs=build#L1228 > [17:17:51.214] ld: error: undefined symbol: icu_convert_case > [17:17:51.214] >>> referenced by pg_locale_icu.c:484 (../src/backend/utils/adt/pg_locale_icu.c:484) > [17:17:51.214] >>> src/backend/postgres_lib.a.p/utils_adt_pg_locale_icu.c.o:(strfold_icu) > [17:17:51.214] cc: error: linker command failed with exit code 1 (use -v to see invocation) > > I think we can mark this as returned-with-feedback for now? Thanks, the commitfest entry is marked as returned with feedback. @Andreas Karlsson Feel free to add a new commitfest entry when you have addressed the feedback. Regards, Vignesh -
Re: Speed up ICU case conversion by using ucasemap_utf8To*()
Andreas Karlsson <andreas@proxel.se> — 2025-12-31T00:18:40Z
On 12/20/24 8:24 PM, Jeff Davis wrote: > On Fri, 2024-12-20 at 06:20 +0100, Andreas Karlsson wrote: >> SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE >> "sv-SE-x-icu") FROM generate_series(1, 1000000) i); >> >> master: ~540 ms >> Patched: ~460 ms >> glibc: ~410 ms > > It looks like you are opening and closing the UCaseMap object each > time. Why not save it in pg_locale_t? That should speed it up even more > and hopefully beat libc. Fixed. New benchmarks are: SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE "sv-SE-x-icu") FROM generate_series(1, 1000000) i); master: ~570 ms Patched: ~340 ms glibc: ~400 ms So it does indeed seem like we got a further speedup and now are faster than glibc. > Also, to support older ICU versions consistently, we need to fix up the > locale name to support "und"; cf. pg_ucol_open(). Perhaps factor out > that logic? Fixed. Andreas -
Re: Speed up ICU case conversion by using ucasemap_utf8To*()
zengman <zengman@halodbtech.com> — 2025-12-31T02:36:05Z
Hi Andreas, On the mailing list, I've noticed this patch. I tested its functionality and it works really well. I have a few minor, non-critical comments to share. In the `pg_ucasemap_open` function, the error message `casemap lookup failed:` doesn't seem ideal. This is because we're opening the `UCaseMap` here, rather than performing a "lookup" operation. In the comment `Additional makes sure we get the right options for case folding.`, the word Additional seems inappropriate — `Additionally` would be a better replacement. -- Regards, Man Zeng www.openhalo.org
-
Re: Speed up ICU case conversion by using ucasemap_utf8To*()
Andreas Karlsson <andreas@proxel.se> — 2025-12-31T15:40:31Z
On 12/31/25 3:36 AM, zengman wrote: > On the mailing list, I've noticed this patch. I tested its functionality and it works really well. I have a few minor, non-critical comments to share. Thanks for trying it out! > In the `pg_ucasemap_open` function, the error message `casemap lookup failed:` doesn't seem ideal. This is because we're opening the `UCaseMap` here, rather than performing a "lookup" operation. Fixed. > In the comment `Additional makes sure we get the right options for case folding.`, the word Additional seems inappropriate — `Additionally` would be a better replacement. Fixed. Andreas