Thread

Speed up ICU case conversion by using ucasemap_utf8To*()

Andreas Karlsson <andreas@proxel.se> — 2024-12-20T05:20:38Z

Hi,

Jeff pointed out to me that the case conversion functions in ICU have 
UTF-8 specific versions which means we can call those directly if the 
database encoding is UTF-8 and skip having to convert to and from UChar.

Since most people today run their databases in UTF-8 I think this 
optimization is worth it and when measuring on short to medium length 
strings I got a 15-20% speed up. It is still slower than glibc in my 
benchmarks but the gap is smaller now.

SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE 
"sv-SE-x-icu") FROM generate_series(1, 1000000) i);

master:  ~540 ms
Patched: ~460 ms
glibc:   ~410 ms

I have also attached a clean up patch for the non-UTF-8 code paths. I 
thought about doing the same for the new UTF-8 code paths but it turned 
out to be a bit messy due to different function signatures for 
ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() vs ucasemap_utf8ToTitle().
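
A possible illustration of the signature problem, assuming the difference is the one declared in ICU's ucasemap.h: ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() take a `const UCaseMap *`, while ucasemap_utf8ToTitle() takes a non-const `UCaseMap *`, so a single function-pointer type cannot hold all of them without a cast. The sketch below mimics that shape with ASCII-only stand-ins so that it compiles without ICU. `StubCaseMap`, `stub_to_upper()`, `stub_to_title()` and `convert_case()` are hypothetical names, not ICU or PostgreSQL code.

```c
#include <ctype.h>
#include <stdint.h>

/* Hypothetical stand-in for ICU's UCaseMap. */
typedef struct StubCaseMap
{
    int         title_calls;    /* state that only title-casing touches */
} StubCaseMap;

typedef enum { CASE_UPPER, CASE_TITLE } CaseKind;

/* Same shape as the upper/lower functions: the map is const. */
static int32_t
stub_to_upper(const StubCaseMap *csm, char *dest, int32_t cap,
              const char *src, int32_t len)
{
    (void) csm;
    if (cap <= len)
        return -1;              /* no room for the terminator */
    for (int32_t i = 0; i < len; i++)
        dest[i] = (char) toupper((unsigned char) src[i]);
    dest[len] = '\0';
    return len;
}

/* Same shape as the title function: the map is NOT const. */
static int32_t
stub_to_title(StubCaseMap *csm, char *dest, int32_t cap,
              const char *src, int32_t len)
{
    int         at_word_start = 1;

    if (cap <= len)
        return -1;
    csm->title_calls++;         /* mutation, hence the non-const pointer */
    for (int32_t i = 0; i < len; i++)
    {
        unsigned char c = (unsigned char) src[i];

        if (isalpha(c))
        {
            dest[i] = (char) (at_word_start ? toupper(c) : tolower(c));
            at_word_start = 0;
        }
        else
        {
            dest[i] = (char) c;
            at_word_start = 1;
        }
    }
    dest[len] = '\0';
    return len;
}

/*
 * One way around the mismatch: the shared wrapper takes the non-const
 * pointer, which converts implicitly to const for the upper case path.
 */
static int32_t
convert_case(StubCaseMap *csm, CaseKind kind, char *dest, int32_t cap,
             const char *src, int32_t len)
{
    switch (kind)
    {
        case CASE_UPPER:
            return stub_to_upper(csm, dest, cap, src, len);
        case CASE_TITLE:
            return stub_to_title(csm, dest, cap, src, len);
    }
    return -1;
}
```

Dispatching through a switch rather than a function pointer is only one option; whether it is cleaner than separate code paths is the judgment call described above.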

Andreas

Re: Speed up ICU case conversion by using ucasemap_utf8To*()

Jeff Davis <pgsql@j-davis.com> — 2024-12-20T19:24:04Z

On Fri, 2024-12-20 at 06:20 +0100, Andreas Karlsson wrote:
> SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE 
> "sv-SE-x-icu") FROM generate_series(1, 1000000) i);
> 
> master:  ~540 ms
> Patched: ~460 ms
> glibc:   ~410 ms

It looks like you are opening and closing the UCaseMap object each
time. Why not save it in pg_locale_t? That should speed it up even more
and hopefully beat libc.
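
A minimal sketch of the caching idea, assuming the handle is opened lazily on first use and then kept for the lifetime of the locale object. `FakeCaseMap`, `fake_casemap_open()`, `SketchLocale` and `get_casemap()` are hypothetical stand-ins so that the block compiles without ICU; in real code the handle would come from ucasemap_open() and be stored in pg_locale_t.

```c
#include <stdlib.h>

/* Hypothetical stand-in for ICU's UCaseMap handle. */
typedef struct FakeCaseMap
{
    const char *locale;
} FakeCaseMap;

/* Counts how often the (presumably expensive) open step runs. */
static int  open_calls = 0;

/* Stand-in for ucasemap_open(); aborts on allocation failure. */
static FakeCaseMap *
fake_casemap_open(const char *locale)
{
    FakeCaseMap *map = malloc(sizeof(FakeCaseMap));

    if (map == NULL)
        abort();
    map->locale = locale;
    open_calls++;
    return map;
}

/* Stand-in for a locale object in the spirit of pg_locale_t. */
typedef struct SketchLocale
{
    const char  *icu_locale;
    FakeCaseMap *casemap;       /* NULL until first needed */
} SketchLocale;

/* Open the case map once, then hand back the cached handle. */
static FakeCaseMap *
get_casemap(SketchLocale *loc)
{
    if (loc->casemap == NULL)
        loc->casemap = fake_casemap_open(loc->icu_locale);
    return loc->casemap;
}
```

With this shape, a million calls to upper() under the same collation would pay for the open step once instead of a million times. Whether the handle is created lazily, as here, or eagerly when the locale object is built is a design choice the sketch does not settle.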

Also, to support older ICU versions consistently, we need to fix up the
locale name to support "und"; cf. pg_ucol_open(). Perhaps factor out
that logic?
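
The fix-up referred to here can be sketched as follows, assuming that ICU 54 and earlier do not accept "und" as a spelling of the root locale and that the workaround in pg_ucol_open() replaces a leading "und" with "root". `fix_und_locale()` is a hypothetical name, and the sketch uses plain string matching rather than ICU's uloc_getLanguage(), so it does not handle differently cased spellings such as "UND".

```c
#include <stdlib.h>
#include <string.h>

/*
 * Return a newly allocated copy of loc in which a leading "und" language
 * subtag is replaced by "root". Any other name is copied unchanged.
 * Returns NULL on allocation failure; the caller frees the result.
 */
static char *
fix_und_locale(const char *loc)
{
    const size_t undlen = 3;    /* strlen("und") */
    const char *prefix = "";
    const char *rest = loc;
    char       *result;

    /* Match "und" only as a complete first subtag, not e.g. "undx". */
    if (strncmp(loc, "und", undlen) == 0 &&
        (loc[undlen] == '\0' || loc[undlen] == '-' ||
         loc[undlen] == '_' || loc[undlen] == '@'))
    {
        prefix = "root";
        rest = loc + undlen;
    }

    result = malloc(strlen(prefix) + strlen(rest) + 1);
    if (result == NULL)
        return NULL;
    strcpy(result, prefix);
    strcat(result, rest);
    return result;
}
```

For example, `fix_und_locale("und-u-kn")` yields "root-u-kn", while "sv-SE" comes back unchanged. A helper of this kind could be called before opening either a collator or a case map, which is the factoring suggested above.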

Regards,
	Jeff Davis

Re: Speed up ICU case conversion by using ucasemap_utf8To*()

vignesh C <vignesh21@gmail.com> — 2025-03-17T06:46:11Z

On Fri, 20 Dec 2024 at 10:50, Andreas Karlsson <andreas@proxel.se> wrote:
>
> Hi,
>
> Jeff pointed out to me that the case conversion functions in ICU have
> UTF-8 specific versions which means we can call those directly if the
> database encoding is UTF-8 and skip having to convert to and from UChar.
>
> Since most people today run their databases in UTF-8 I think this
> optimization is worth it and when measuring on short to medium length
> strings I got a 15-20% speed up. It is still slower than glibc in my
> benchmarks but the gap is smaller now.
>
> SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
> "sv-SE-x-icu") FROM generate_series(1, 1000000) i);
>
> master:  ~540 ms
> Patched: ~460 ms
> glibc:   ~410 ms
>
> I have also attached a clean up patch for the non-UTF-8 code paths. I
> thought about doing the same for the new UTF-8 code paths but it turned
> out to be a bit messy due to different function signatures for
> ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() vs ucasemap_utf8ToTitle().

I noticed that Jeff's comments from [1] have not yet been addressed, so I
have changed the commitfest entry status to "Waiting on Author".
Please address them and update it to "Needs Review".
[1] - https://www.postgresql.org/message-id/72c7c2b5848da44caddfe0f20f6c7ebc7c0c6e60.camel@j-davis.com

Regards,
Vignesh

Re: Speed up ICU case conversion by using ucasemap_utf8To*()

Andres Freund <andres@anarazel.de> — 2025-03-29T18:50:03Z

On 2025-03-17 12:16:11 +0530, vignesh C wrote:
> I noticed that Jeff's comments from [1] have not yet been addressed, so I
> have changed the commitfest entry status to "Waiting on Author".
> Please address them and update it to "Needs Review".
> [1] - https://www.postgresql.org/message-id/72c7c2b5848da44caddfe0f20f6c7ebc7c0c6e60.camel@j-davis.com

It's also worth noting that this patch hasn't been building for quite a while
(at least not since 2025-01-29):

https://cirrus-ci.com/task/5621435164524544?logs=build#L1228
[17:17:51.214] ld: error: undefined symbol: icu_convert_case
[17:17:51.214] >>> referenced by pg_locale_icu.c:484 (../src/backend/utils/adt/pg_locale_icu.c:484)
[17:17:51.214] >>>               src/backend/postgres_lib.a.p/utils_adt_pg_locale_icu.c.o:(strfold_icu)
[17:17:51.214] cc: error: linker command failed with exit code 1 (use -v to see invocation)

I think we can mark this as returned-with-feedback for now?

Greetings,

Andres Freund

Re: Speed up ICU case conversion by using ucasemap_utf8To*()

vignesh C <vignesh21@gmail.com> — 2025-03-30T01:18:46Z

On Sun, 30 Mar 2025 at 00:20, Andres Freund <andres@anarazel.de> wrote:
>
> On 2025-03-17 12:16:11 +0530, vignesh C wrote:
> > I noticed that Jeff's comments from [1] have not yet been addressed, so I
> > have changed the commitfest entry status to "Waiting on Author".
> > Please address them and update it to "Needs Review".
> > [1] - https://www.postgresql.org/message-id/72c7c2b5848da44caddfe0f20f6c7ebc7c0c6e60.camel@j-davis.com
>
> It's also worth noting that this patch hasn't been building for quite a while
> (at least not since 2025-01-29):
>
> https://cirrus-ci.com/task/5621435164524544?logs=build#L1228
> [17:17:51.214] ld: error: undefined symbol: icu_convert_case
> [17:17:51.214] >>> referenced by pg_locale_icu.c:484 (../src/backend/utils/adt/pg_locale_icu.c:484)
> [17:17:51.214] >>>               src/backend/postgres_lib.a.p/utils_adt_pg_locale_icu.c.o:(strfold_icu)
> [17:17:51.214] cc: error: linker command failed with exit code 1 (use -v to see invocation)
>
> I think we can mark this as returned-with-feedback for now?

Thanks, the commitfest entry is marked as returned with feedback.
@Andreas Karlsson Feel free to add a new commitfest entry when you
have addressed the feedback.

Regards,
Vignesh

Re: Speed up ICU case conversion by using ucasemap_utf8To*()

Andreas Karlsson <andreas@proxel.se> — 2025-12-31T00:18:40Z

On 12/20/24 8:24 PM, Jeff Davis wrote:
> On Fri, 2024-12-20 at 06:20 +0100, Andreas Karlsson wrote:
>> SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
>> "sv-SE-x-icu") FROM generate_series(1, 1000000) i);
>>
>> master:  ~540 ms
>> Patched: ~460 ms
>> glibc:   ~410 ms
> 
> It looks like you are opening and closing the UCaseMap object each
> time. Why not save it in pg_locale_t? That should speed it up even more
> and hopefully beat libc.

Fixed. New benchmarks are:

SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE 
"sv-SE-x-icu") FROM generate_series(1, 1000000) i);

master:  ~570 ms
Patched: ~340 ms
glibc:   ~400 ms

So it does indeed seem like we got a further speedup and are now faster
than glibc.

> Also, to support older ICU versions consistently, we need to fix up the
> locale name to support "und"; cf. pg_ucol_open(). Perhaps factor out
> that logic?

Fixed.

Andreas

Re: Speed up ICU case conversion by using ucasemap_utf8To*()

zengman <zengman@halodbtech.com> — 2025-12-31T02:36:05Z

Hi Andreas,

I noticed this patch on the mailing list and tested it; it works really well. I have a few minor, non-critical comments to share.
In the `pg_ucasemap_open` function, the error message `casemap lookup failed:` doesn't seem ideal, because we are opening the `UCaseMap` here rather than performing a "lookup" operation.
In the comment `Additional makes sure we get the right options for case folding.`, the word `Additional` seems inappropriate; `Additionally` would be a better replacement.


--
Regards,
Man Zeng
www.openhalo.org

Re: Speed up ICU case conversion by using ucasemap_utf8To*()

Andreas Karlsson <andreas@proxel.se> — 2025-12-31T15:40:31Z

On 12/31/25 3:36 AM, zengman wrote:
> On the mailing list, I've noticed this patch. I tested its functionality and it works really well. I have a few minor, non-critical comments to share.

Thanks for trying it out!

> In the `pg_ucasemap_open` function, the error message `casemap lookup failed:` doesn't seem ideal. This is because we're opening the `UCaseMap` here, rather than performing a "lookup" operation.

Fixed.

> In the comment `Additional makes sure we get the right options for case folding.`, the word Additional seems inappropriate; `Additionally` would be a better replacement.

Fixed.

Andreas