Thread

  1. Speed up ICU case conversion by using ucasemap_utf8To*()

    Andreas Karlsson <andreas@proxel.se> — 2024-12-20T05:20:38Z

    Hi,
    
    Jeff pointed out to me that the case conversion functions in ICU have 
    UTF-8 specific versions which means we can call those directly if the 
    database encoding is UTF-8 and skip having to convert to and from UChar.
    
    Since most people today run their databases in UTF-8 I think this 
    optimization is worth it and when measuring on short to medium length 
    strings I got a 15-20% speed up. It is still slower than glibc in my 
    benchmarks but the gap is smaller now.
    
    SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE 
    "sv-SE-x-icu") FROM generate_series(1, 1000000) i);
    
    master:  ~540 ms
    Patched: ~460 ms
    glibc:   ~410 ms
    
    I have also attached a clean up patch for the non-UTF-8 code paths. I 
    thought about doing the same for the new UTF-8 code paths but it turned 
    out to be a bit messy due to different function signatures for 
    ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() vs ucasemap_utf8ToTitle().
    
    Andreas
    
  2. Re: Speed up ICU case conversion by using ucasemap_utf8To*()

    Jeff Davis <pgsql@j-davis.com> — 2024-12-20T19:24:04Z

    On Fri, 2024-12-20 at 06:20 +0100, Andreas Karlsson wrote:
    > SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE 
    > "sv-SE-x-icu") FROM generate_series(1, 1000000) i);
    > 
    > master:  ~540 ms
    > Patched: ~460 ms
    > glibc:   ~410 ms
    
    It looks like you are opening and closing the UCaseMap object each
    time. Why not save it in pg_locale_t? That should speed it up even more
    and hopefully beat libc.
    
    
    Also, to support older ICU versions consistently, we need to fix up the
    locale name to support "und"; cf. pg_ucol_open(). Perhaps factor out
    that logic?
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  3. Re: Speed up ICU case conversion by using ucasemap_utf8To*()

    vignesh C <vignesh21@gmail.com> — 2025-03-17T06:46:11Z

    On Fri, 20 Dec 2024 at 10:50, Andreas Karlsson <andreas@proxel.se> wrote:
    >
    > Hi,
    >
    > Jeff pointed out to me that the case conversion functions in ICU have
    > UTF-8 specific versions which means we can call those directly if the
    > database encoding is UTF-8 and skip having to convert to and from UChar.
    >
    > Since most people today run their databases in UTF-8 I think this
    > optimization is worth it and when measuring on short to medium length
    > strings I got a 15-20% speed up. It is still slower than glibc in my
    > benchmarks but the gap is smaller now.
    >
    > SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
    > "sv-SE-x-icu") FROM generate_series(1, 1000000) i);
    >
    > master:  ~540 ms
    > Patched: ~460 ms
    > glibc:   ~410 ms
    >
    > I have also attached a clean up patch for the non-UTF-8 code paths. I
    > thought about doing the same for the new UTF-8 code paths but it turned
    > out to be a bit messy due to different function signatures for
    > ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() vs ucasemap_utf8ToTitle().
    
    I noticed that Jeff's comments from [1] have not yet been addressed, I
    have changed the commitfest entry status to "Waiting on Author",
    please address them and update it to "Needs Review".
    [1] - https://www.postgresql.org/message-id/72c7c2b5848da44caddfe0f20f6c7ebc7c0c6e60.camel@j-davis.com
    
    Regards,
    Vignesh
    
    
    
    
  4. Re: Speed up ICU case conversion by using ucasemap_utf8To*()

    Andres Freund <andres@anarazel.de> — 2025-03-29T18:50:03Z

    On 2025-03-17 12:16:11 +0530, vignesh C wrote:
    > On Fri, 20 Dec 2024 at 10:50, Andreas Karlsson <andreas@proxel.se> wrote:
    > >
    > > Hi,
    > >
    > > Jeff pointed out to me that the case conversion functions in ICU have
    > > UTF-8 specific versions which means we can call those directly if the
    > > database encoding is UTF-8 and skip having to convert to and from UChar.
    > >
    > > Since most people today run their databases in UTF-8 I think this
    > > optimization is worth it and when measuring on short to medium length
    > > strings I got a 15-20% speed up. It is still slower than glibc in my
    > > benchmarks but the gap is smaller now.
    > >
    > > SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
    > > "sv-SE-x-icu") FROM generate_series(1, 1000000) i);
    > >
    > > master:  ~540 ms
    > > Patched: ~460 ms
    > > glibc:   ~410 ms
    > >
    > > I have also attached a clean up patch for the non-UTF-8 code paths. I
    > > thought about doing the same for the new UTF-8 code paths but it turned
    > > out to be a bit messy due to different function signatures for
    > > ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() vs ucasemap_utf8ToTitle().
    > 
    > I noticed that Jeff's comments from [1] have not yet been addressed, I
    > have changed the commitfest entry status to "Waiting on Author",
    > please address them and update it to "Needs Review".
    > [1] - https://www.postgresql.org/message-id/72c7c2b5848da44caddfe0f20f6c7ebc7c0c6e60.camel@j-davis.com
    
    It's also worth noting that this patch hasn't been building for quite a while
    (at least not since 2025-01-29):
    
    https://cirrus-ci.com/task/5621435164524544?logs=build#L1228
    [17:17:51.214] ld: error: undefined symbol: icu_convert_case
    [17:17:51.214] >>> referenced by pg_locale_icu.c:484 (../src/backend/utils/adt/pg_locale_icu.c:484)
    [17:17:51.214] >>>               src/backend/postgres_lib.a.p/utils_adt_pg_locale_icu.c.o:(strfold_icu)
    [17:17:51.214] cc: error: linker command failed with exit code 1 (use -v to see invocation)
    
    I think we can mark this as returned-with-feedback for now?
    
    Greetings,
    
    Andres Freund
    
    
    
    
  5. Re: Speed up ICU case conversion by using ucasemap_utf8To*()

    vignesh C <vignesh21@gmail.com> — 2025-03-30T01:18:46Z

    On Sun, 30 Mar 2025 at 00:20, Andres Freund <andres@anarazel.de> wrote:
    >
    > On 2025-03-17 12:16:11 +0530, vignesh C wrote:
    > > On Fri, 20 Dec 2024 at 10:50, Andreas Karlsson <andreas@proxel.se> wrote:
    > > >
    > > > Hi,
    > > >
    > > > Jeff pointed out to me that the case conversion functions in ICU have
    > > > UTF-8 specific versions which means we can call those directly if the
    > > > database encoding is UTF-8 and skip having to convert to and from UChar.
    > > >
    > > > Since most people today run their databases in UTF-8 I think this
    > > > optimization is worth it and when measuring on short to medium length
    > > > strings I got a 15-20% speed up. It is still slower than glibc in my
    > > > benchmarks but the gap is smaller now.
    > > >
    > > > SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
    > > > "sv-SE-x-icu") FROM generate_series(1, 1000000) i);
    > > >
    > > > master:  ~540 ms
    > > > Patched: ~460 ms
    > > > glibc:   ~410 ms
    > > >
    > > > I have also attached a clean up patch for the non-UTF-8 code paths. I
    > > > thought about doing the same for the new UTF-8 code paths but it turned
    > > > out to be a bit messy due to different function signatures for
    > > > ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() vs ucasemap_utf8ToTitle().
    > >
    > > I noticed that Jeff's comments from [1] have not yet been addressed, I
    > > have changed the commitfest entry status to "Waiting on Author",
    > > please address them and update it to "Needs Review".
    > > [1] - https://www.postgresql.org/message-id/72c7c2b5848da44caddfe0f20f6c7ebc7c0c6e60.camel@j-davis.com
    >
    > It's also worth noting that this patch hasn't been building for quite a while
    > (at least not since 2025-01-29):
    >
    > https://cirrus-ci.com/task/5621435164524544?logs=build#L1228
    > [17:17:51.214] ld: error: undefined symbol: icu_convert_case
    > [17:17:51.214] >>> referenced by pg_locale_icu.c:484 (../src/backend/utils/adt/pg_locale_icu.c:484)
    > [17:17:51.214] >>>               src/backend/postgres_lib.a.p/utils_adt_pg_locale_icu.c.o:(strfold_icu)
    > [17:17:51.214] cc: error: linker command failed with exit code 1 (use -v to see invocation)
    >
    > I think we can mark this as returned-with-feedback for now?
    
    Thanks, the commitfest entry is marked as returned with feedback.
    @Andreas Karlsson Feel free to add a new commitfest entry when you
    have addressed the feedback.
    
    Regards,
    Vignesh
    
    
    
    
  6. Re: Speed up ICU case conversion by using ucasemap_utf8To*()

    Andreas Karlsson <andreas@proxel.se> — 2025-12-31T00:18:40Z

    On 12/20/24 8:24 PM, Jeff Davis wrote:
    > On Fri, 2024-12-20 at 06:20 +0100, Andreas Karlsson wrote:
    >> SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
    >> "sv-SE-x-icu") FROM generate_series(1, 1000000) i);
    >>
    >> master:  ~540 ms
    >> Patched: ~460 ms
    >> glibc:   ~410 ms
    > 
    > It looks like you are opening and closing the UCaseMap object each
    > time. Why not save it in pg_locale_t? That should speed it up even more
    > and hopefully beat libc.
    
    Fixed. New benchmarks are:
    
    SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE 
    "sv-SE-x-icu") FROM generate_series(1, 1000000) i);
    
    master:  ~570 ms
    Patched: ~340 ms
    glibc:   ~400 ms
    
    So it does indeed seem like we got a further speedup and now are faster 
    than glibc.
    
    > Also, to support older ICU versions consistently, we need to fix up the
    > locale name to support "und"; cf. pg_ucol_open(). Perhaps factor out
    > that logic?
    
    Fixed.
    
    Andreas
    
  7. Re: Speed up ICU case conversion by using ucasemap_utf8To*()

    zengman <zengman@halodbtech.com> — 2025-12-31T02:36:05Z

    Hi Andreas,
    
    On the mailing list, I've noticed this patch. I tested its functionality and it works really well. I have a few minor, non-critical comments to share.
    In the `pg_ucasemap_open` function, the error message `casemap lookup failed:` doesn't seem ideal. This is because we're opening the `UCaseMap` here, rather than performing a "lookup" operation.
    In the comment `Additional makes sure we get the right options for case folding.`, the word Additional seems inappropriate — `Additionally` would be a better replacement.
    
    
    --
    Regards,
    Man Zeng
    www.openhalo.org
  8. Re: Speed up ICU case conversion by using ucasemap_utf8To*()

    Andreas Karlsson <andreas@proxel.se> — 2025-12-31T15:40:31Z

    On 12/31/25 3:36 AM, zengman wrote:
    > On the mailing list, I've noticed this patch. I tested its functionality and it works really well. I have a few minor, non-critical comments to share.
    
    Thanks for trying it out!
    
    > In the `pg_ucasemap_open` function, the error message `casemap lookup failed:` doesn't seem ideal. This is because we're opening the `UCaseMap` here, rather than performing a "lookup" operation.
    
    Fixed.
    
    > In the comment `Additional makes sure we get the right options for case folding.`, the word Additional seems inappropriate — `Additionally` would be a better replacement.
    Fixed.
    
    Andreas