Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Use C11 char16_t and char32_t for Unicode code points.

  1. Improve the performance of Unicode Normalization Forms.

    Alexander Borisov <lex.borisov@gmail.com> — 2025-06-02T21:51:08Z

    Hi, hackers!
    
    As promised, I continue to improve/speed up Unicode in Postgres.
    Last time, we improved the lower(), upper(), and casefold()
    functions. [1]
    Now it's time for Unicode Normalization Forms, specifically
    the normalize() function.
    
    The attachment contains three patches:
    1. Transfer of functions for building a range index to a separate
        Perl module.
    2-small. In this patch, the mapping tables are reduced, but this
        increases the number of branches for searching for the desired value.
    2-big. In this patch, the tables are large, but the number of branches
        for searching for the desired value is small.
    
    In other words, we choose between two patches, either small or big.
    
    What did we manage to achieve?
    1. The uint32 field, which stored the unicode codepoint record, was
        removed from the main structure/table. This was necessary for perfect
        hashing.
    2. Thanks to the first point, we managed to get rid of duplicates and
        reduce the main table from 6843 to 3943.
    3. We managed to get rid of duplicates in the table with codepoints for
        decomposition from 5138 to 4304 (uint32).
    4. These manipulations allowed us to increase the speed by up to 40%
        in some cases (benchmarks below).
    5. The server part "lost weight" in the binary, but the frontend
        "gained weight" a little.
    
    I read the old commits, which say that the size of the frontend is very
    important and that speed is not important
    (speed is important on the server).
    I'm not quite sure what to do if this is really the case. Perhaps
    we should leave the slow version for the frontend.
    
    Benchmarks.
    
    How was it tested?
    Four files were created for each normalization form: NFC, NFD, NFKC,
    and NFKD.
    The files were sent via pgbench. The files contain all code points that
    need to be normalized.
    Unfortunately, the patches are already quite large, but if necessary,
    I can send these files in a separate email or upload them somewhere.
    
    Legend.
    Patch (big table) — this is a bloated mapping table, but with a small
         number of branches for searching.
    Patch (small table) — these are small tables, but with a large number
         of branches for searching.
    Without — the original Postgres without patches.
    
    * Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)
    
    NFC:
    Patch (big table): tps = 5057.624509
    Patch (small table): tps = 4727.158048
    Without: tps = 3268.654867
    
    NFD:
    Patch (big table): tps = 1145.730247
    Patch (small table): tps = 1073.084633
    Without: tps = 689.627239
    
    NFKC:
    Patch (big table): tps = 4821.700702
    Patch (small table): tps = 4662.640629
    Without: tps = 3450.254581
    
    NFKD:
    Patch (big table): tps = 1142.644341
    Patch (small table): tps = 951.405893
    Without: tps = 636.161496
    
    How the size of object files and libraries has changed (in bytes):
    Patch (big table):
    unicode_norm.o: 138896
    unicode_norm_srv.o: 191336
    libpq.a: 364782
    libpq.so.5.18: 480944
    libpgcommon.a: 584432
    libpgcommon_shlib.a: 568724
    
    Patch (small table):
    unicode_norm.o: 86416
    unicode_norm_srv.o: 138824
    libpq.a: 364782
    libpq.so.5.18: 423600
    libpgcommon.a: 531952
    libpgcommon_shlib.a: 516244
    
    Without:
    unicode_norm.o: 79488
    unicode_norm_srv.o: 165504
    libpq.a: 364782
    libpq.so.5.18: 419288
    libpgcommon.a: 525024
    libpgcommon_shlib.a: 509316
    
    [1] 
    https://www.postgresql.org/message-id/flat/7cac7e66-9a3b-4e3f-a997-42aa0c401f80%40gmail.com
    
    --
    Best regards,
    Alexander Borisov
    
  2. Re: Improve the performance of Unicode Normalization Forms.

    John Naylor <johncnaylorls@gmail.com> — 2025-06-11T07:13:32Z

    On Tue, Jun 3, 2025 at 1:51 PM Alexander Borisov <lex.borisov@gmail.com> wrote:
    > 5. The server part "lost weight" in the binary, but the frontend
    >     "gained weight" a little.
    >
    > I read the old commits, which say that the size of the frontend is very
    > important and that speed is not important
    > (speed is important on the server).
    > I'm not quite sure what to do if this is really the case. Perhaps
    > we should leave the slow version for the frontend.
    
    In the "small" patch, the frontend files got a few kB bigger, but the
    backend got quite a bit smaller. If we decided to go with this patch,
    I'd say it's preferable to do it in a way that keeps both paths the
    same.
    
    > How was it tested?
    > Four files were created for each normalization form: NFC, NFD, NFKC,
    > and NFKD.
    > The files were sent via pgbench. The files contain all code points that
    > need to be normalized.
    > Unfortunately, the patches are already quite large, but if necessary,
    > I can send these files in a separate email or upload them somewhere.
    
    What kind of workload do they present?
    Did you consider running the same tests from the thread that lead to
    the current implementation?
    
    --
    John Naylor
    Amazon Web Services
    
    
    
    
  3. Re: Improve the performance of Unicode Normalization Forms.

    Alexander Borisov <lex.borisov@gmail.com> — 2025-06-11T12:27:02Z

    11.06.2025 10:13, John Naylor wrote:
    > On Tue, Jun 3, 2025 at 1:51 PM Alexander Borisov <lex.borisov@gmail.com> wrote:
    >> 5. The server part "lost weight" in the binary, but the frontend
    >>      "gained weight" a little.
    >>
    >> I read the old commits, which say that the size of the frontend is very
    >> important and that speed is not important
    >> (speed is important on the server).
    >> I'm not quite sure what to do if this is really the case. Perhaps
    >> we should leave the slow version for the frontend.
    > 
    > In the "small" patch, the frontend files got a few kB bigger, but the
    > backend got quite a bit smaller. If we decided to go with this patch,
    > I'd say it's preferable to do it in a way that keeps both paths the
    > same.
    
    Okay, then I'll leave the frontend unchanged so that the size remains
    the same. The changes will only affect the backend.
    
    >> How was it tested?
    >> Four files were created for each normalization form: NFC, NFD, NFKC,
    >> and NFKD.
    >> The files were sent via pgbench. The files contain all code points that
    >> need to be normalized.
    >> Unfortunately, the patches are already quite large, but if necessary,
    >> I can send these files in a separate email or upload them somewhere.
    > 
    > What kind of workload do they present?
    > Did you consider running the same tests from the thread that lead to
    > the current implementation?
    
    I found performance tests in this discussion 
    https://www.postgresql.org/message-id/CAFBsxsHUuMFCt6-pU+oG-F1==CmEp8wR+O+bRouXWu6i8kXuqA@mail.gmail.com
    Below are performance test results.
    
    * Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)
    
    1.
    
    Normalize, decomp only
    
    select count(normalize(t, NFD)) from (
    select md5(i::text) as t from
    generate_series(1,100000) as i
    ) s;
    
    Patch (big table): 279,858 ms
    Patch (small table): 282,925 ms
    Without: 444,118 ms
    
    
    2.
    
    select count(normalize(t, NFD)) from (
    select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3
    + 1) as t from
    generate_series(1,100000) as i
    ) s;
    
    Patch (big table): 219,858 ms
    Patch (small table): 247,893 ms
    Without: 376,906 ms
    
    
    3.
    
    Normalize, decomp+recomp
    
    select count(normalize(t, NFC)) from (
    select md5(i::text) as t from
    generate_series(1,1000) as i
    ) s;
    
    Patch (big table): 7,553 ms
    Patch (small table): 7,876 ms
    Without: 13,177 ms
    
    
    4.
    
    select count(normalize(t, NFC)) from (
    select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3
    + 1) as t from
    generate_series(1,1000) as i
    ) s;
    
    Patch (big table): 5,765 ms
    Patch (small table): 6,782 ms
    Without: 10,800 ms
    
    
    5.
    
    Quick check has not changed because these patches do not affect it:
    
    -- all chars are quickcheck YES
    select count(*) from (
    select md5(i::text) as t from
    generate_series(1,100000) as i
    ) s;
    
    Patch (big table): 29,477 ms
    Patch (small table): 29,436 ms
    Without: 29,378 ms
    
    
     From these tests, we see 2x in some tests.
    
    
    --
    Best regards,
    Alexander Borisov
    
    
    
    
  4. Re: Improve the performance of Unicode Normalization Forms.

    John Naylor <johncnaylorls@gmail.com> — 2025-06-12T07:58:31Z

    On Wed, Jun 11, 2025 at 7:27 PM Alexander Borisov <lex.borisov@gmail.com> wrote:
    >
    > 11.06.2025 10:13, John Naylor wrote:
    > > On Tue, Jun 3, 2025 at 1:51 PM Alexander Borisov <lex.borisov@gmail.com> wrote:
    > >> 5. The server part "lost weight" in the binary, but the frontend
    > >>      "gained weight" a little.
    > >>
    > >> I read the old commits, which say that the size of the frontend is very
    > >> important and that speed is not important
    > >> (speed is important on the server).
    > >> I'm not quite sure what to do if this is really the case. Perhaps
    > >> we should leave the slow version for the frontend.
    > >
    > > In the "small" patch, the frontend files got a few kB bigger, but the
    > > backend got quite a bit smaller. If we decided to go with this patch,
    > > I'd say it's preferable to do it in a way that keeps both paths the
    > > same.
    >
    > Okay, then I'll leave the frontend unchanged so that the size remains
    > the same. The changes will only affect the backend.
    
    Sorry, I by "both paths" I meant make the frontend and backend the
    same, because it's good for testing. In the "small table" patch, libpq
    etc increase by about 1%, which is negligible. unicode_norm.o is only
    bigger by 7kB -- That seems okay to me, especially considering
    unicode_norm_srv.o is smaller by 27kB.
    
    >  From these tests, we see 2x in some tests.
    
    Nice!
    
    -- 
    John Naylor
    Amazon Web Services
    
    
    
    
  5. Re: Improve the performance of Unicode Normalization Forms.

    Jeff Davis <pgsql@j-davis.com> — 2025-06-19T17:41:57Z

    On Tue, 2025-06-03 at 00:51 +0300, Alexander Borisov wrote:
    > As promised, I continue to improve/speed up Unicode in Postgres.
    > Last time, we improved the lower(), upper(), and casefold()
    > functions. [1]
    > Now it's time for Unicode Normalization Forms, specifically
    > the normalize() function.
    
    Did you compare against other implementations, such as ICU's
    normalization functions? There's also a rust crate here:
    
    https://github.com/unicode-rs/unicode-normalization
    
    that might have been optimized.
    
    In addition to the lookups themselves, there are other opportunities
    for optimization as well, such as:
    
    * reducing the need for palloc and extra buffers, perhaps by using
    buffers on the stack for small strings
    
    * operate more directly on UTF-8 data rather than decoding and re-
    encoding the entire string
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  6. Re: Improve the performance of Unicode Normalization Forms.

    Alexander Borisov <lex.borisov@gmail.com> — 2025-06-20T14:51:24Z

    19.06.2025 20:41, Jeff Davis wrote:
    > On Tue, 2025-06-03 at 00:51 +0300, Alexander Borisov wrote:
    >> As promised, I continue to improve/speed up Unicode in Postgres.
    >> Last time, we improved the lower(), upper(), and casefold()
    >> functions. [1]
    >> Now it's time for Unicode Normalization Forms, specifically
    >> the normalize() function.
    > 
    > Did you compare against other implementations, such as ICU's
    > normalization functions? There's also a rust crate here:
    > 
    > https://github.com/unicode-rs/unicode-normalization
    > 
    > that might have been optimized.
    
    I don't quite see how this compares to the implementation on Rust. In
    the link provided, they use perfect hash, which I get rid of and get
    a x2 boost.
    If you take ICU implementations in C++, I have always considered them
    slow, at least when used in C code.
    I may well run benchmarks and compare the performance of the approach
    in Postgres and ICU. But this is beyond the scope of the patches under
    discussion.
    
    I want to emphasize that the pachty I gave doesn't change the
    normalization code/logic.
    We change the approach in finding the right codepoints across tables,
    which is what gives us the performance boost.
    
    > In addition to the lookups themselves, there are other opportunities
    > for optimization as well, such as:
    > 
    > * reducing the need for palloc and extra buffers, perhaps by using
    > buffers on the stack for small strings
    > 
    > * operate more directly on UTF-8 data rather than decoding and re-
    > encoding the entire string
    
    Absolutely agree with you, the normalization code is very well written
    but far from optimized.
    I didn't send changes in the normalization code itself to avoid lumping
    everything together and make the review easier.
    In keeping with my idea of optimizations in normalization forms, I
    planned to discuss the optimization code (C code) in the next iteration
    on “Improve performance...”.
    
    
    -- 
    Regards,
    Alexander Borisov
    
    
    
    
  7. Re: Improve the performance of Unicode Normalization Forms.

    Nico Williams <nico@cryptonector.com> — 2025-06-20T16:31:49Z

    On Thu, Jun 19, 2025 at 10:41:57AM -0700, Jeff Davis wrote:
    > In addition to the lookups themselves, there are other opportunities
    > for optimization as well, such as:
    > 
    > * reducing the need for palloc and extra buffers, perhaps by using
    > buffers on the stack for small strings
    > 
    > * operate more directly on UTF-8 data rather than decoding and re-
    > encoding the entire string
    
    Not sure how relevant it is here, but in ZFS in the 00s we implement
    form-insensitive, from-preserving functionality with an interesting
    optimization that greatly reduced allocations and which had a decent
    fast path for ASCII-mostly strings.  It works by doing normalization
    only in a) string comparison functions, b) string hashing functions
    (because in ZFS directories are hash tables).
    
    For b-trees one has to normalize the whole string, so perhaps this
    scheme is just not that relevant to databases that use b-trees, but then
    depending on how the b-tree key is smeared over internal nodes it might
    be.
    
    The ASCII-mostly fast path was that when comparing strings you want a
    "cursor" through each string, and whenever the current and next bytes
    are both ASCII you compare the current one the fast way (as ASCII), and
    you only take the slow path of having to check if the current
    _character_ (as opposed to _byte_) requires normalization when either
    the current or next bytes are not ASCII.  In the slow path you only
    normalize the _current character_, so you only need enough buffer space
    for that.  In principle a Unicode character could consist of an
    unbounded number of codepoints (zalgo), I think, but in practice it will
    be no more than a half a dozen or so, so the buffer can be small and
    stack allocated.  This same concept applies to string hashing: you walk
    a cursor over the string and hash each ASCII _character_ and normalize
    all non-ASCII characters, all one character at a time.
    
    The really nice thing about form-insensitive/form-preserving
    functionality is that it is form-preserving rather than normalizing on
    create and lookup, and that makes the fast-path described above
    feasible.  Whereas if you normalize on create and lookup you have to
    heap allocate enough space for each string normalized.  The other nice
    thing is that f-i/f-p behavior is a lot less surprising to users -- the
    input methods they use don't matter.
    
    What motivated this f-i/f-p behavior was that HFS+ used NFD (well,
    something very close to NFD) but input methods (even on OS X) invariably
    produce NFC (well, something close to it), at least for Latin scripts.
    This means that on OS X if you use filenames from directory listings
    those will be NFD while user inputs will be NFC, and so you can't just
    memcmp()/strcmp() those -- you have to normalize _or_ use form-
    insensitive string comparison, but nothing did that 20 years ago.  Thus
    doing the form-insensitivity in the filesystem seemed best, and if you
    do that you can be form-preserving to enable the optimization described
    above.
    
    My apologies if the above is not relevant to PG and thus noise,
    
    Nico
    -- 
    
    
    
    
  8. Re: Improve the performance of Unicode Normalization Forms.

    Jeff Davis <pgsql@j-davis.com> — 2025-06-20T17:15:47Z

    On Fri, 2025-06-20 at 11:31 -0500, Nico Williams wrote:
    > In the slow path you only
    > normalize the _current character_, so you only need enough buffer
    > space
    > for that.
    
    That's a clear win for UTF8 data. Also, if there are no changes, then
    you can just return the input buffer and not bother allocating an
    output buffer.
    
    > The really nice thing about form-insensitive/form-preserving
    > functionality is that it is form-preserving rather than normalizing
    > on
    > create and lookup, and that makes the fast-path described above
    > feasible.  Whereas if you normalize on create and lookup you have to
    > heap allocate enough space for each string normalized.
    
    Non-deterministic ICU collations are already insensitive to most
    normalization differences. Some differences are not handled when it
    involves too many combining marks, but you can use the "full
    normalization" option to address that. See:
    
    https://www.postgresql.org/docs/current/collation.html#ICU-COLLATION-SETTINGS-TABLE
    
    >   The other nice
    > thing is that f-i/f-p behavior is a lot less surprising to users --
    > the
    > input methods they use don't matter.
    
    Postgres is already form-preserving; it does not auto-normalize. (I
    have suggested that we might want to offer something like that, but
    that would be a user choice.)
    
    Currently, the non-deterministic collations (which offer form-
    insensitivity) are not available at the database level, so you have to
    explicitly specify the COLLATE clause on a column or query. In other
    words, Postgres is not form-insensitive by default, though there is
    work to make that possible.
    
    > What motivated this f-i/f-p behavior was that HFS+ used NFD (well,
    > something very close to NFD) but input methods (even on OS X)
    > invariably
    > produce NFC (well, something close to it), at least for Latin
    > scripts.
    > This means that on OS X if you use filenames from directory listings
    > those will be NFD while user inputs will be NFC, and so you can't
    > just
    > memcmp()/strcmp() those -- you have to normalize _or_ use form-
    > insensitive string comparison, but nothing did that 20 years ago. 
    > Thus
    > doing the form-insensitivity in the filesystem seemed best, and if
    > you
    > do that you can be form-preserving to enable the optimization
    > described
    > above.
    
    Databases have similar concerns as a filesystem in this respect.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  9. Re: Improve the performance of Unicode Normalization Forms.

    Jeff Davis <pgsql@j-davis.com> — 2025-06-20T17:20:08Z

    On Fri, 2025-06-20 at 17:51 +0300, Alexander Borisov wrote:
    > I don't quite see how this compares to the implementation on Rust. In
    > the link provided, they use perfect hash, which I get rid of and get
    > a x2 boost.
    > If you take ICU implementations in C++, I have always considered them
    > slow, at least when used in C code.
    > I may well run benchmarks and compare the performance of the approach
    > in Postgres and ICU. But this is beyond the scope of the patches
    > under
    > discussion.
    
    Are you saying that, with these patches, Postgres will offer the
    fastest open-source Unicode normalization? If so, that would be very
    cool.
    
    The reason I'm asking is because, if there are multiple open source
    implementations, we should either have the best one, or just borrow
    another one as long as it has a suitable license (perhaps translating
    to C as necessary).
    
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  10. Re: Improve the performance of Unicode Normalization Forms.

    Nico Williams <nico@cryptonector.com> — 2025-06-20T18:45:22Z

    On Fri, Jun 20, 2025 at 10:15:47AM -0700, Jeff Davis wrote:
    > On Fri, 2025-06-20 at 11:31 -0500, Nico Williams wrote:
    > > In the slow path you only normalize the _current character_, so you
    > > only need enough buffer space for that.
    > 
    > That's a clear win for UTF8 data. Also, if there are no changes, then
    > you can just return the input buffer and not bother allocating an
    > output buffer.
    
    The latter is not relevant to string comparison or hashing, but, yeah,
    if you have to produce a normalized string you can optimistically assume
    it is already normalized and defer allocation until you know it isn't
    normalized.
    
    > Postgres is already form-preserving; it does not auto-normalize. (I
    > have suggested that we might want to offer something like that, but
    > that would be a user choice.)
    
    Excellent, then I would advise looking into adding form-insensitive
    string comparison and hashing to get f-i/f-p behavior.
    
    > Currently, the non-deterministic collations (which offer form-
    > insensitivity) are not available at the database level, so you have to
    > explicitly specify the COLLATE clause on a column or query. In other
    > words, Postgres is not form-insensitive by default, though there is
    > work to make that possible.
    
    TIL.  Thanks.
    
    > Databases have similar concerns as a filesystem in this respect.
    
    I figured :)
    
    
    
    
  11. Re: Improve the performance of Unicode Normalization Forms.

    Alexander Borisov <lex.borisov@gmail.com> — 2025-06-24T15:20:39Z

    20.06.2025 20:20, Jeff Davis wrote:
    > On Fri, 2025-06-20 at 17:51 +0300, Alexander Borisov wrote:
    >> I don't quite see how this compares to the implementation on Rust. In
    >> the link provided, they use perfect hash, which I get rid of and get
    >> a x2 boost.
    >> If you take ICU implementations in C++, I have always considered them
    >> slow, at least when used in C code.
    >> I may well run benchmarks and compare the performance of the approach
    >> in Postgres and ICU. But this is beyond the scope of the patches
    >> under
    >> discussion.
    > 
    > Are you saying that, with these patches, Postgres will offer the
    > fastest open-source Unicode normalization? If so, that would be very
    > cool.
    
    That's what we're aiming for - to implement the fastest approach.
    By applying the proposed patches (two patches) we get the fastest
    codepoints search by tables. This is evidenced by the measurements made
    here and earlier in the patch for unicode case improvement.
    
    After these patches are compiled, I will improve the C normalization
    code directly, optimize it. That's when we can take benchmarks and say 
    with confidence that we're the best at speed.
    
    > The reason I'm asking is because, if there are multiple open source
    > implementations, we should either have the best one, or just borrow
    > another one as long as it has a suitable license (perhaps translating
    > to C as necessary).
    
    Before getting into this "fight" I studied different approaches to
    searching for the necessary codepoints in tables (hash, binary search,
    radix trees...) and came to the conclusion that the approach I proposed
    (range index) is the fastest for this purpose.
    
    
    --
    Regards,
    Alexander Borisov
    
    
    
    
    
  12. Re: Improve the performance of Unicode Normalization Forms.

    Jeff Davis <pgsql@j-davis.com> — 2025-06-30T19:28:47Z

    On Tue, 2025-06-24 at 18:20 +0300, Alexander Borisov wrote:
    > That's what we're aiming for - to implement the fastest approach.
    
    Awesome! Thank you for clarifying this as a goal. Having the fastest
    open-source Unicode normalization would be a great thing to highlight
    when this is done.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  13. Re: Improve the performance of Unicode Normalization Forms.

    Alexander Borisov <lex.borisov@gmail.com> — 2025-07-07T18:44:03Z

    30.06.2025 22:28, Jeff Davis пишет:
    > On Tue, 2025-06-24 at 18:20 +0300, Alexander Borisov wrote:
    >> That's what we're aiming for - to implement the fastest approach.
    > 
    > Awesome! Thank you for clarifying this as a goal. Having the fastest
    > open-source Unicode normalization would be a great thing to highlight
    > when this is done.
    
    After the discussion in this correspondence, we are settling on
    the "small" patch option. The table size is reduced, the speed is
    almost x2.
    Attached are the final two patches. After reviewing/accepting them,
    I will proceed to optimizing the C code for Unicode Normalization Forms.
    
    
    --
    Regards,
    Alexander Borisov
  14. Re: Improve the performance of Unicode Normalization Forms.

    Alexander Borisov <lex.borisov@gmail.com> — 2025-07-08T19:42:20Z

    07.07.2025 21:44, Alexander Borisov пишет:
    > 30.06.2025 22:28, Jeff Davis пишет:
    >> On Tue, 2025-06-24 at 18:20 +0300, Alexander Borisov wrote:
    >>> That's what we're aiming for - to implement the fastest approach.
    >>
    >> Awesome! Thank you for clarifying this as a goal. Having the fastest
    >> open-source Unicode normalization would be a great thing to highlight
    >> when this is done.
    > 
    > After the discussion in this correspondence, we are settling on
    > the "small" patch option. The table size is reduced, the speed is
    > almost x2.
    > Attached are the final two patches. After reviewing/accepting them,
    > I will proceed to optimizing the C code for Unicode Normalization Forms.
    
    Version 3 patches. In version 2 "make -s headerscheck" did not work.
    
    --
    Regards,
    Alexander Borisov
  15. Re: Improve the performance of Unicode Normalization Forms.

    Alexander Borisov <lex.borisov@gmail.com> — 2025-08-01T19:00:08Z

    Hi,
    
    I'm new here, so please advise me: if a patch wasn't accepted at the
    commitfest, does that mean it's not needed (no one was interested in
    it), or was there not enough time?
    Please tell me how this works for this.
    Should I move it to the next commitfest? I'm not quite sure what to do.
    
    I looked and saw that patches are often transferred from commitfest to
    commitfest. I understand that this is normal practice?
    Please understand, it's not very transparent here, the approach is not
    obvious.
    
    What is the best course of action for me?
    Thanks!
    
    --
    Regards,
    Alexander Borisov
    
    
    
    
    
  16. Re: Improve the performance of Unicode Normalization Forms.

    David G. Johnston <david.g.johnston@gmail.com> — 2025-08-01T19:07:34Z

    On Friday, August 1, 2025, Alexander Borisov <lex.borisov@gmail.com> wrote:
    
    >
    > I looked and saw that patches are often transferred from commitfest to
    > commitfest. I understand that this is normal practice?
    >
    > What is the best course of action for me?
    >
    >
    If you feel the patch is committable it should remain in the non-draft CFs,
    being moved every other month or so, until it gets committed.
    
    David J.
    
  17. Re: Improve the performance of Unicode Normalization Forms.

    Tom Lane <tgl@sss.pgh.pa.us> — 2025-08-01T19:37:02Z

    Alexander Borisov <lex.borisov@gmail.com> writes:
    > I'm new here, so please advise me: if a patch wasn't accepted at the
    > commitfest, does that mean it's not needed (no one was interested in
    > it), or was there not enough time?
    
    It's kind of hard to tell really --- there are many patches in our
    queue and not nearly enough reviewers.  So maybe someone will get to
    it in the fullness of time, or maybe it's true that no one cares
    about the particular topic.  (But bug fixes and performance
    improvements are almost always interesting to someone.)
    
    I recommend optimism: as long as *you* still believe that the patch
    is worthwhile, keep pushing it forward to the next commitfest.
    We used to do that automatically, but we have started asking authors
    to do that themselves, as a way of weeding out patches for which
    the author has lost interest.
    
    			regards, tom lane
    
    
    
    
  18. Re: Improve the performance of Unicode Normalization Forms.

    Alexander Borisov <lex.borisov@gmail.com> — 2025-08-01T20:51:04Z

    01.08.2025 23:37, Tom Lane пишет:
    > Alexander Borisov <lex.borisov@gmail.com> writes:
    >> I'm new here, so please advise me: if a patch wasn't accepted at the
    >> commitfest, does that mean it's not needed (no one was interested in
    >> it), or was there not enough time?
    > 
    > It's kind of hard to tell really --- there are many patches in our
    > queue and not nearly enough reviewers.  So maybe someone will get to
    > it in the fullness of time, or maybe it's true that no one cares
    > about the particular topic.  (But bug fixes and performance
    > improvements are almost always interesting to someone.)
    > 
    > I recommend optimism: as long as *you* still believe that the patch
    > is worthwhile, keep pushing it forward to the next commitfest.
    > We used to do that automatically, but we have started asking authors
    > to do that themselves, as a way of weeding out patches for which
    > the author has lost interest.
    
    Thanks, Tom! I always choose optimism.
    
    I've been in open source for a while, and this is the first time I've
    seen this approach.
    I have a plan to further improve Postgres performance in terms of
    Unicode (and not only) (which is kind of the foundation for working with
    text).
    I don't want to overwhelm the community with patches. I take a
    systematic approach.
    
    Once again, thank you, Tom. The community's approach has become clearer.
    
    
    --
    Regards,
    Alexander Borisov
    
    
    
    
    
  19. Re: Improve the performance of Unicode Normalization Forms.

    Jeff Davis <pgsql@j-davis.com> — 2025-08-08T23:17:49Z

    On Tue, 2025-07-08 at 22:42 +0300, Alexander Borisov wrote:
    > Version 3 patches. In version 2 "make -s headerscheck" did not work.
    
    I ran my own performance tests. What I did was get some test data from
    ICU v76.1 by doing:
    
      cat icu4j/perf-tests/data/collation/Test* \
        | uconv -f utf-8 -t utf-8 -x nfc > ~/strings.nfc.txt
    
      cat icu4j/perf-tests/data/collation/Test* \
        | uconv -f utf-8 -t utf-8 -x nfd > ~/strings.nfd.txt
    
      export NORM_PERF_NFC_FILE=~/strings.nfc.txt
      export NORM_PERF_NFD_FILE=~/strings.nfd.txt
    
    The first is about 8MB, the second 9MB (because NFD is slightly
    larger).
    
    Then I added some testing code to norm_test.c. It's not intended for
    committing, just to run the test. Note that it requires setting
    environment variables to find the input files.
    
    If patch v3j-0001 are applied, it's using perfect hashing. If patches
    v3j-0002-4 are applied, it's using your code. In either case it
    compares with ICU.
    
    Results with perfect hashing (100 iterations):
    
      Normalization from NFC to  NFD with  PG: 010.009
      Normalization from NFC to  NFD with ICU: 001.580
      Normalization from NFC to NFKD with  PG: 009.376
      Normalization from NFC to NFKD with ICU: 000.857
      Normalization from NFD to  NFC with  PG: 016.026
      Normalization from NFD to  NFC with ICU: 001.205
      Normalization from NFD to NFKC with  PG: 015.903
      Normalization from NFD to NFKC with ICU: 000.654
    
    Results with your code (100 iterations):
    
      Normalization from NFC to  NFD with  PG: 004.626    
      Normalization from NFC to  NFD with ICU: 001.577
      Normalization from NFC to NFKD with  PG: 004.024
      Normalization from NFC to NFKD with ICU: 000.861
      Normalization from NFD to  NFC with  PG: 006.846
      Normalization from NFD to  NFC with ICU: 001.209                    
      Normalization from NFD to NFKC with  PG: 006.655          
      Normalization from NFD to NFKC with ICU: 000.651          
    
    Your patches are a major improvement, but I'm trying to figure out why
    ICU still wins by so much. Thoughts? I didn't investigate much myself
    yet, so it's quite possible there's a bug in my test or something.
    
    Regards,
    	Jeff Davis
    
    
  20. Re: Improve the performance of Unicode Normalization Forms.

    Alexander Borisov <lex.borisov@gmail.com> — 2025-08-11T14:21:04Z

    09.08.2025 02:17, Jeff Davis пишет:
    > On Tue, 2025-07-08 at 22:42 +0300, Alexander Borisov wrote:
    >> Version 3 patches. In version 2 "make -s headerscheck" did not work.
    > 
    > I ran my own performance tests. What I did was get some test data from
    > ICU v76.1 by doing:
    > 
    
    [..]
    
    > Results with perfect hashing (100 iterations):
    > 
    >    Normalization from NFC to  NFD with  PG: 010.009
    >    Normalization from NFC to  NFD with ICU: 001.580
    >    Normalization from NFC to NFKD with  PG: 009.376
    >    Normalization from NFC to NFKD with ICU: 000.857
    >    Normalization from NFD to  NFC with  PG: 016.026
    >    Normalization from NFD to  NFC with ICU: 001.205
    >    Normalization from NFD to NFKC with  PG: 015.903
    >    Normalization from NFD to NFKC with ICU: 000.654
    > 
    > Results with your code (100 iterations):
    > 
    >    Normalization from NFC to  NFD with  PG: 004.626
    >    Normalization from NFC to  NFD with ICU: 001.577
    >    Normalization from NFC to NFKD with  PG: 004.024
    >    Normalization from NFC to NFKD with ICU: 000.861
    >    Normalization from NFD to  NFC with  PG: 006.846
    >    Normalization from NFD to  NFC with ICU: 001.209
    >    Normalization from NFD to NFKC with  PG: 006.655
    >    Normalization from NFD to NFKC with ICU: 000.651
    > 
    > Your patches are a major improvement, but I'm trying to figure out why
    > ICU still wins by so much. Thoughts? I didn't investigate much myself
    > yet, so it's quite possible there's a bug in my test or something.
    
    I had some experimented a bit, and took a look at the ICU code.
    
    1. The performance test is not entirely accurate.
    
    The thing is that in ICU (unorm_normalize()), we pass the output
    buffer and its size.
    That is, ICU does not calculate how much data we will have at the
    output; we pass it our buffer. ICU simply checks whether the data
    fits into the buffer or not.
    
    In our case, PG (unicode_normalize()) only accepts the input buffer
    and first calculates the size of the output buffer.
    This means we are doing double the work, as we have already gone
    through the input data at least once too many times.
    
    Here, it would be more honest to call the function for calculating
    the buffer in ICU, and only then call data normalization.
    
    2. In PG, unnecessary table lookups (get_code_entry()) that can
        clearly be avoided.
    3. In PG, the algorithm is not optimized.
    
    We could eliminate the decompose_code() function, which is called
    for each code point.
    In this function, we can remove recursion and prepare the data in
    advance. That is, we can perform data expansion in the Perl script.
    If we do this, we will have code that is in place and without recursion.
    And then there are all sorts of other minor details.
    
    Of course, there are other approaches.
    
    I did this comparison (your test, Jeff):
    
    1. Without patch.
    
    Normalization from NFC to  NFD with  PG: 009.121
    Normalization from NFC to  NFD with ICU: 000.954
    Normalization from NFC to NFKD with  PG: 009.048
    Normalization from NFC to NFKD with ICU: 000.965
    Normalization from NFD to  NFC with  PG: 014.525
    Normalization from NFD to  NFC with ICU: 000.623
    Normalization from NFD to NFKC with  PG: 014.380
    Normalization from NFD to NFKC with ICU: 000.666
    
    2. With patch.
    
    Normalization from NFC to  NFD with  PG: 005.743
    Normalization from NFC to  NFD with ICU: 001.005
    Normalization from NFC to NFKD with  PG: 005.902
    Normalization from NFC to NFKD with ICU: 000.963
    Normalization from NFD to  NFC with  PG: 007.837
    Normalization from NFD to  NFC with ICU: 000.640
    Normalization from NFD to NFKC with  PG: 008.036
    Normalization from NFD to NFKC with ICU: 000.667
    
    3. With a patch, but! With direct access to tables, i.e., I created one
        large table (index table, unit16_t) where there is no search, direct
        access to data.
    
    Normalization from NFC to  NFD with  PG: 002.792
    Normalization from NFC to  NFD with ICU: 000.953
    Normalization from NFC to NFKD with  PG: 002.865
    Normalization from NFC to NFKD with ICU: 000.958
    Normalization from NFD to  NFC with  PG: 004.361
    Normalization from NFD to  NFC with ICU: 000.651
    Normalization from NFD to NFKC with  PG: 004.474
    Normalization from NFD to NFKC with ICU: 000.668
    
    It is evident that direct access provides x2 from the patch, but adds
    +1.5MB to the object file size.
    This is just to check the time difference in access.
    
    As a result, I would not look into ICU at the moment, especially since
    we have a different approach.
    I am currently working on optimizing unicode_normalize().
    I am trying to come up with an improved version of the algorithm in C
    by the next commitfest (which will be in September).
    
    P.S.: In general, I strive to surpass ICU in terms of speed.
    I think we're making good progress. Let's see what happens.
    
    
    --
    Regards,
    Alexander Borisov
    
    
    
    
  21. Re: Improve the performance of Unicode Normalization Forms.

    Jeff Davis <pgsql@j-davis.com> — 2025-08-13T22:12:17Z

    On Mon, 2025-08-11 at 17:21 +0300, Alexander Borisov wrote:
    > As a result, I would not look into ICU at the moment, especially
    > since
    > we have a different approach.
    > I am currently working on optimizing unicode_normalize().
    > I am trying to come up with an improved version of the algorithm in C
    > by the next commitfest (which will be in September).
    
    Agreed, but thank you for adding context so I can understand where we
    are.
    
    The patch as proposed is a speedup, and also a simplification because
    it eliminates the different code path for the frontend code. That also
    makes me feel better about testing, because I don't think both those
    paths were tested equally.
    
    Comments on the patch itself:
    
    The 0001 patch generalizes the two-step lookup process: first navigate
    branches to find the index into a partially-compacted sparse array, and
    then use that to get the index into the dense array. The branching
    code, the sparse array, and the dense array are all generated code. The
    reason for the two-step lookup is to keep the sparse array element size
    small (uint16), so that missing elements consume less space (missing
    elements still remain for small gaps). The full entry is kept in the
    dense array.
    
    GenerateSparseArray.pm would be more descriptive than "Ranges.pm" for
    the new module. And we should probably include "sparse" in the name of
    the sparse arrays.
    
    The new module is responsible for generating the branching code as well
    as the sparse array; while the caller is reponsible for the dense
    arrays. For case mapping, one sparse array is used for four parallel
    arrays for the different case kinds (lower/title/upper/fold).
    
    The use of zero values for different purposes is getting confusing. It
    means "doesn't exist", but we are also reserving the zeroth element in
    the arrays. Would it be easier to just "#define EMPTY 0xFFFF" and then
    have the caller check for it? That way we don't need to reserve the
    zeroth array element, which should make it easier to avoid off-by-one
    errors.
    
    I think we can simplify the interface, as well. Why does the caller
    need to separately generate the ranges, then generate the table, then
    generate the branches? It's really all the same action and can be based
    on an input hash with a certain structure, and then return both the
    table and the branches, right?
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  22. Re: Improve the performance of Unicode Normalization Forms.

    Alexander Borisov <lex.borisov@gmail.com> — 2025-08-14T14:41:42Z

    14.08.2025 01:12, Jeff Davis wrote:
    > On Mon, 2025-08-11 at 17:21 +0300, Alexander Borisov wrote:
    
    [..]
    
    > 
    > Comments on the patch itself:
    > 
    > The 0001 patch generalizes the two-step lookup process: first navigate
    > branches to find the index into a partially-compacted sparse array, and
    > then use that to get the index into the dense array. The branching
    > code, the sparse array, and the dense array are all generated code. The
    > reason for the two-step lookup is to keep the sparse array element size
    > small (uint16), so that missing elements consume less space (missing
    > elements still remain for small gaps). The full entry is kept in the
    > dense array.
    
    Yes, that's exactly right.
    
    > GenerateSparseArray.pm would be more descriptive than "Ranges.pm" for
    > the new module. And we should probably include "sparse" in the name of
    > the sparse arrays.
    
    I don't mind, that's good.
    
    > The new module is responsible for generating the branching code as well
    > as the sparse array; while the caller is reponsible for the dense
    > arrays. For case mapping, one sparse array is used for four parallel
    > arrays for the different case kinds (lower/title/upper/fold).
    
    Yes, that's right.
    
    > The use of zero values for different purposes is getting confusing. It
    > means "doesn't exist", but we are also reserving the zeroth element in
    > the arrays. Would it be easier to just "#define EMPTY 0xFFFF" and then
    > have the caller check for it? That way we don't need to reserve the
    > zeroth array element, which should make it easier to avoid off-by-one
    > errors.
    
    Initially, that was the case;
    However, I subsequently concluded that creating magical values to define
    an element that was not found was not an ideal solution.
    I felt that 0 was easier to understand.
    
    I'm not entirely sure now, but maybe this will even help get rid of
    the comparison to 0. That is, we can get an element from the main table
    by index 0. It will be zeroed, and the algorithm will work as it should.
    However, it is not certain that this will have any effect on
    performance.
    
    In general, I don't have a firm position on this issue.
    
    > I think we can simplify the interface, as well. Why does the caller
    > need to separately generate the ranges, then generate the table, then
    > generate the branches? It's really all the same action and can be based
    > on an input hash with a certain structure, and then return both the
    > table and the branches, right?
    
    Yes, it can be done in one go. But the separation was done
    intentionally. It makes the code more understandable, and, more
    importantly, it makes it easier for developers "from the street"
    to understand it and, if necessary, correct or supplement the algorithm.
    These are utilities, code for generating an algorithm that, let's say,
    will not be used very often, and I would not "compress" it. Here, it is
    more important for us to understand how it works.
    
    
    --
    Regards,
    Alexander Borisov
    
    
    
    
  23. Re: Improve the performance of Unicode Normalization Forms.

    Alexander Borisov <lex.borisov@gmail.com> — 2025-09-02T21:07:00Z

    Hi, Jeff, hackers!
    
    As promised, refactoring the C code for Unicode Normalization Forms.
    
    In general terms, here's what has changed:
    1. Recursion has been removed; now data is generated using
         a Perl script.
    2. Memory is no longer allocated for uint32 for the entire size,
         but uint8 is allocated for the entire size for the CCC cache, which
         boosts performance significantly.
    3. The code for the unicode_normalize() function has been completely
         rewritten.
    
    I am confident that we have achieved excellent results.
    
    Jeff's test:
    Without patch:
         Normalization from NFC to  NFD with  PG: 009.121
         Normalization from NFC to NFKD with  PG: 009.048
         Normalization from NFD to  NFC with  PG: 014.525
         Normalization from NFD to NFKC with  PG: 014.380
    
    Whith patch:
         Normalization from NFC to  NFD with  PG: 001.580
         Normalization from NFC to NFKD with  PG: 001.634
         Normalization from NFD to  NFC with  PG: 002.979
         Normalization from NFD to NFKC with  PG: 003.050
    
    Test with ICU (with path and ICU):
         Normalization from NFC to  NFD with  PG: 001.580
         Normalization from NFC to  NFD with ICU: 001.880
         Normalization from NFC to NFKD with  PG: 001.634
         Normalization from NFC to NFKD with ICU: 001.857
    
         Normalization from NFD to  NFC with  PG: 002.979
         Normalization from NFD to  NFC with ICU: 001.144
         Normalization from NFD to NFKC with  PG: 003.050
         Normalization from NFD to NFKC with ICU: 001.260
    
    pgbench:
    The files were sent via pgbench. The files contain all code points that
    need to be normalized.
    
    NFC:
         Patch: tps = 9701.568161
         Without: tps = 6820.828104
    
    NFD:
         Patch: tps = 2707.155148
         Without: tps = 1745.949174
    
    NFKC:
         Patch: tps = 9893.952804
         Without: tps = 6697.358888
    
    NFKD:
         Patch: tps = 2580.785909
         Without: tps = 1521.058417
    
    To ensure fairness in testing with ICU, I corrected Jeff's patch;
    we calculate the size of the final buffer, and I placed ICU in
    the same position.
    
    I'm talking about:
    Get size:
         length = unorm_normalize(u_input, -1, form, 0, NULL, 0, &status);
    Normalize:
         unorm_normalize(u_input, -1, form, 0, u_result, length, &status);
    
    Otherwise, it turned out that we were giving the ICU some huge buffer,
    and it was writing to it.
    And we ourselves calculate what buffer we need.
    
    
    -- 
    Regards,
    Alexander Borisov
  24. Re: Improve the performance of Unicode Normalization Forms.

    Victor Yegorov <vyegorov@gmail.com> — 2025-09-10T18:50:12Z

    ср, 3 сент. 2025 г. в 09:35, Alexander Borisov <lex.borisov@gmail.com>:
    
    > Hi, Jeff, hackers!
    >
    > As promised, refactoring the C code for Unicode Normalization Forms.
    >
    > In general terms, here's what has changed:
    > 1. Recursion has been removed; now data is generated using
    >      a Perl script.
    > 2. Memory is no longer allocated for uint32 for the entire size,
    >      but uint8 is allocated for the entire size for the CCC cache, which
    >      boosts performance significantly.
    > 3. The code for the unicode_normalize() function has been completely
    >      rewritten.
    >
    > I am confident that we have achieved excellent results.
    >
    
    Hey.
    
    I've looked into these patches.
    
    Patches apply, compilation succeedes, make check and make installcheck shows
    no errors.
    
    Code quality is good, although I suggest a native english speaker to review
    comments and commit messages — a bit difficult to follow.
    
    Description of the Sparse Array approach is done in the newly introduced
    GenerateSparseArray.pm module.  Perhaps it'd be valuable to add a section
    into
    the src/common/unicode/README, it'll get more visibility.
    ( Not insisting here. )
    
    For performance testing I've used an approach by Jeff Davis. [1]
    I've prepared NFC and NFD files, loaded them into UNLOGGED tables and
    measured
    normalize() calls.
    
        CREATE UNLOGGED TABLE strings_nfd (
          str   text STORAGE PLAIN NOT NULL
        );
        COPY strings_nfd FROM '/var/lib/postgresql/strings.nfd.txt';
    
        CREATE UNLOGGED TABLE strings_nfc (
          str   text STORAGE PLAIN NOT NULL
        );
        COPY strings_nfc FROM '/var/lib/postgresql/strings.nfc.txt';
    
        SELECT count( normalize( str, NFD ) ) FROM strings_nfd,
    generate_series( 1, 10 ) x;
        SELECT count( normalize( str, NFC ) ) FROM strings_nfc,
    generate_series( 1, 10 ) x;
    
    And I've got the following numbers:
    
    Master
    NFD Time: 2954.630 ms / 295ms
    NFC Time: 3929.939 ms / 330ms
    
    Patched
    NFD Time: 1658.345 ms / 166ms / +78%
    NFC Time: 1862.757 ms / 186ms / +77%
    
    Overall, I find these patches and performance very nice and valuable.
    I've added myself as a reviewer and marked this patch as Ready for
    Committer.
    
    [1]
    https://postgr.es/m/adffa1fbdb867d5a11c9a8211cde3bdb1e208823.camel@j-davis.com
    
    -- 
    Victor Yegorov
    
  25. Re: Improve the performance of Unicode Normalization Forms.

    Alexander Borisov <lex.borisov@gmail.com> — 2025-09-11T17:51:45Z

    > Hey.
    > 
    > I've looked into these patches.
    
    Hi Victor,
    
    Thank you for reviewing the patch and testing it!
    
    [..]
    
    > Description of the Sparse Array approach is done in the newly introduced
    > GenerateSparseArray.pm module.  Perhaps it'd be valuable to add a 
    > section into
    > the src/common/unicode/README, it'll get more visibility.
    > ( Not insisting here. )
    
    It seems that the description of the algorithm that forms the script is
    best kept in the script itself.
    But I'm not sure either, because I don't know what is customary in the
    PG community.
    
    
    -- 
    Regards,
    Alexander Borisov
    
    
    
    
    
  26. Re: Improve the performance of Unicode Normalization Forms.

    Jeff Davis <pgsql@j-davis.com> — 2025-09-20T01:03:46Z

    On Thu, 2025-09-11 at 20:51 +0300, Alexander Borisov wrote:
    > 
    > > Hey.
    > > 
    > > I've looked into these patches.
    > 
    > Hi Victor,
    > 
    > Thank you for reviewing the patch and testing it!
    
    Heikki, do you have thoughts on this thread?
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  27. Re: Improve the performance of Unicode Normalization Forms.

    Alexander Borisov <lex.borisov@gmail.com> — 2025-09-29T10:22:29Z

    20.09.2025 04:03, Jeff Davis wrote:
    > On Thu, 2025-09-11 at 20:51 +0300, Alexander Borisov wrote:
    >>
    >>> Hey.
    >>>
    >>> I've looked into these patches.
    >>
    >> Hi Victor,
    >>
    >> Thank you for reviewing the patch and testing it!
    > 
    > Heikki, do you have thoughts on this thread?
    
    Hey,
    
    In patch v5 (attached), I changed the approach to class caching.
    Now, for small texts (less than 512 characters), we don't allocate
    memory from the heap; we use the stack.
    
    And according to pgbench tests, we have a 2x speedup.
    According to Jeff's tests, we have a 10x speedup.
    According to tests from the thread
    https://www.postgresql.org/message-id/CAFBsxsHUuMFCt6-pU+oG-F1==CmEp8wR+O+bRouXWu6i8kXuqA@mail.gmail.com,
    we also have a 2x speedup.
    
    
    --
    Regards,
    Alexander Borisov
    
  28. Re: Improve the performance of Unicode Normalization Forms.

    Alexander Borisov <lex.borisov@gmail.com> — 2025-10-31T10:44:50Z

    Rebase after commit
    https://github.com/postgres/postgres/commit/3853a6956c3e3bc7a6fa9bcdb205a2997f46bac2.
    
    --
    Best regards,
    Alexander Borisov
    
  29. Re: Improve the performance of Unicode Normalization Forms.

    Alexander Borisov <lex.borisov@gmail.com> — 2025-11-24T09:55:35Z

    20.09.2025 04:03, Jeff Davis wrote:
    > On Thu, 2025-09-11 at 20:51 +0300, Alexander Borisov wrote:
    >>
    >>> Hey.
    >>>
    >>> I've looked into these patches.
    >>
    >> Hi Victor,
    >>
    >> Thank you for reviewing the patch and testing it!
    > 
    > Heikki, do you have thoughts on this thread?
    > 
    > Regards,
    > 	Jeff Davis
    > 
    
    Hey guys, any news?
    
    --
    Regards,
    Alexander Borisov
    
    
    
    
    
  30. Re: Improve the performance of Unicode Normalization Forms.

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-12T12:22:38Z

    On 24/11/2025 11:55, Alexander Borisov wrote:
    > Hey guys, any news?
    
    I finally got around to look at this, thanks for your patience!
    
    > v6-0001-Moving-Perl-functions-Sparse-Array-to-a-common-mo.patch
    
    +1, this makes the existing code more readable, even without the rest of 
    the patches. Thanks for the added comments. I'll review this in more 
    detail, but it seems pretty close to be ready for committing.
    
    Thinking of GenerateSparseArray's API, I think what the caller really 
    wants is to generate a function like:
    
    /*
      * Look up the value of 'x'.
      */
    static uint16
    lookup(uint16)
    {
        ...
    }
    
    That's how the PerfectHash module works. The tables should be 
    implementation detail that the caller doesn't need to know about.
    
    
    > v6-0002-Improve-the-performance-of-Unicode-Normalization-.patch
    > v6-0003-Refactoring-Unicode-Normalization-Forms-performan.patch
    
    These basically look OK to me too. Some minor issues and ideas for 
    further work:
    
    The 'make update-unicode' and 'ninja update-unicode' targets are broken. 
    Need to be updated for the removal of 'unicode_norm_hashfunc.h'.
    
    The GenerateSparseArray adds comments like "/* U+1234 */" to each 
    element. That's nice but it implies that the elements are unicode code 
    points. GenerateSparseArray could be used for many other things. Let's 
    use "/* 0x1234 */" instead, or make it somehow configurable.
    
    The generated file is very large, over 1 MB. I guess it doesn't matter 
    all that much, but perhaps we should generate a little more compact 
    code. Maybe we don't need the "/* U+1234 */" comment on every line, but 
    only one comment for each range, for example.
    
    > typedef struct
    > {
    > 	uint8		comb_class;		/* combining class of character */
    > 	uint8		dec_size_flags; /* size and flags of decomposition code list */
    > 	uint16		dec_index;		/* index into UnicodeDecomp_codepoints, or the
    > 								 * decomposition itself if DECOMP_INLINE */
    > } pg_unicode_decomposition;
    
    The 'UnicodeDecomp_codepoints' array mentioned in the comment doesn't 
    exist anymore.
    
    
    Finally, some ideas for packing the arrays more tightly. I'm not sure if 
    these make any difference in practice or are worth the effort, but here 
    we go:
    
    There are only 56 distinct comb_classes, and 18 distinct dec_size_flags, 
    so if we added one more small lookup table for them, they could be 
    packed into a single byte. That would shrink pg_unicode_decomposition 
    from 4 to 3 bytes.
    
    > static const uint8 UnicodeDecompSizes[4931] =
    
    The max value stored here is 7, so you could get away with just 3 bits 
    per element.
    
    > static const uint16 decomp_map[33752] =
    
    This array consists of multiple ranges, and each range is accessed in a 
    separate piece of code. We could use multiple arrays, one for each 
    range, instead of one big array. Some of the ranges only store a small 
    range of values, so we could use uint8 for them.
    
    - Heikki
    
    
    
    
  31. Re: Improve the performance of Unicode Normalization Forms.

    Alexander Borisov <lex.borisov@gmail.com> — 2025-12-22T15:44:44Z

    12.12.2025 15:22, Heikki Linnakangas wrote:
     > On 24/11/2025 11:55, Alexander Borisov wrote:
     >> Hey guys, any news?
     >
     > I finally got around to look at this, thanks for your patience!
     >
    
    Hi Heikki,
    
    First of all, I would like to thank you for the review.
    I tried to take your comments into account and test the approaches.
    
    
    About major changes/improvements
    
    The branching kept bothering me. The algorithm consisted of a function
    with branches (essentially a binary search with offsets) and a table of
    indexes. I really wanted to get rid of the branches. And then it dawned
    on me that instead of building branches with offsets, I could create
    another table that would contain these offsets.
    The "inspiration" came from my patch Optimization of the is_normalized()
    function [0].
    
    Actually, I changed the GenerateSparseArray algorithm, which now
    generates a table with offsets instead of a function with branches.
    In other words, we now have two tables:
    1. A table with offsets, indicating what offset is needed to obtain the
        desired index from the index table.
    2. A table with indexes. The data is divided into fixed ranges.
    
    
    How does it look, figuratively speaking?
    
    Let's take a fixed range of size 16.
    Let's take data to fill in 10..20, 100..110
    (let's imagine that these are codepoints).
    
    1. Let's create the first table with offsets:
    static const uint8 table_offset[8] =
    {
          16, 32, 0, 0, 0, 0, 48, 0
    };
    
    2. Create a second table with indexes to the data table:
    static const uint8 table_index[63] =
    {
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0,
          0, 19, 41, 152, 147, 111, 138, 46, 50, 133, 41, 164, 0, 0, 0, 0, 
    0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 174, 47, 75, 31, 235, 112, 103, 167, 130, 
    11, 240
    };
    
    (indexes/data within table_index are provided as an example)
    The first range is allocated for a dummy index (like a NULL).
    
    3. Lookup function:
    static uint8
    get_index(char32_t cp)
    {
          uint8 offset_idx, offset;
    
          offset_idx = cp >> 4;
    
          if (offset_idx > 7)
               return 0;
    
          offset = table_offset[offset_idx];
    
          return table_index[offset + (cp & 15)];
    }
    
    As you can see from the example, we only have one branch left, to check
    for going outside the bounds.
    By changing the size of the fixed range, we can compact either one table
    or the other.
    You can read more about this in ./src/tools/GenerateSparseArray.pm
    This is the main change to the algorithm.
    
    
    Regarding recommendations from the review
    
     > Thinking of GenerateSparseArray's API, I think what the caller really
     > wants is to generate a function like:
     >
     > /*
     >   * Look up the value of 'x'.
     >   */
     > static uint16
     > lookup(uint16)
     > {
     >     ...
     > }
     >
     > That's how the PerfectHash module works. The tables should be
     > implementation detail that the caller doesn't need to know about.
    
    Yes, now it is being generated together with the tables.
    
    [..]
    
     > The 'make update-unicode' and 'ninja update-unicode' targets are broken.
     > Need to be updated for the removal of 'unicode_norm_hashfunc.h'.
    
    Fixed.
    
     > The GenerateSparseArray adds comments like "/* U+1234 */" to each
     > element. That's nice but it implies that the elements are unicode code
     > points. GenerateSparseArray could be used for many other things. Let's
     > use "/* 0x1234 */" instead, or make it somehow configurable.
     >
     > The generated file is very large, over 1 MB. I guess it doesn't matter
     > all that much, but perhaps we should generate a little more compact
     > code. Maybe we don't need the "/* U+1234 */" comment on every line, but
     > only one comment for each range, for example.
    
    I have many places where I need to generate C tables. To make my life
    easier, and perhaps the lives of others, I added a small module for
    generating "pretty" tables. ./src/tools/PrettyLine.pm
    After applying it, the size of the generated header file was reduced
    to ~300KB.
    
     > typedef struct
     > {
     > 	uint8		comb_class;		/* combining class of character */
     > 	uint8		dec_size_flags; /* size and flags of decomposition code list */
     > 	uint16		dec_index;		/* index into UnicodeDecomp_codepoints, or the
     > 								 * decomposition itself if DECOMP_INLINE */
     > } pg_unicode_decomposition;
     >
     > The 'UnicodeDecomp_codepoints' array mentioned in the comment doesn't
     > exist anymore.
    
    Fixed.
    
     > Finally, some ideas for packing the arrays more tightly. I'm not sure if
     > these make any difference in practice or are worth the effort, but here
     > we go:
     >
     > There are only 56 distinct comb_classes, and 18 distinct dec_size_flags,
     > so if we added one more small lookup table for them, they could be
     > packed into a single byte. That would shrink pg_unicode_decomposition
     > from 4 to 3 bytes.
    
    I tried different ways to optimize this place. If we create a separate
    index for Canonical Combining Class using existing approaches, it will
    create a table that is almost the same size as the main one where
    "uint8 comb_class;" is currently located. The data is highly scattered.
    The current approach to storing comb_class is the golden mean.
    
     > static const uint16 decomp_map[33752] =
    
     > This array consists of multiple ranges, and each range is accessed in a
     > separate piece of code. We could use multiple arrays, one for each
     > range, instead of one big array. Some of the ranges only store a small
     > range of values, so we could use uint8 for them.
    
    In the current approach, the total number of records in the tables
    (offset and index) is approximately 21,000 uint16 records.
    This is approximately 12,000 fewer than before.
    It is possible to attempt to further divide the data, but the complexity
    of obtaining a record (lookup) is significantly reduced.
    
    
    Benchmarks
    
    tables offset -- current algorithm, latest.
    branches offset -- previous algorithm with branches in the lookup function.
    without -- the original Postgres without patches.
    
    * macOS 15.6.1 (Apple M3 Pro) (Apple clang version 17.0.0)
    
    NFC:
    Patch (tables offset): tps = 11180.259769
    Patch (branches offset): tps = 11051.958331
    Without: tps = 7540.085014
    
    NFD:
    Patch (tables offset): tps = 2962.334547
    Patch (branches offset): tps = 2855.137283
    Without: tps = 1841.182279
    
    NFKC:
    Patch (tables offset): tps = 11313.345548
    Patch (branches offset): tps = 11239.604566
    Without: tps = 7498.037333
    
    NFKD:
    Patch (tables offset): tps = 2828.025161
    Patch (branches offset): tps = 2702.252719
    Without: tps = 1556.659947
    
    We can see that there is a difference between the branching and table
    approaches. It may not be large, perhaps not even significant, but it
    is there.
    
    Summary of all patches:
    1. The uint32 field, which stored the unicode codepoint record, was
        removed from the main structure/table. This was necessary for perfect
        hashing.
    2. Thanks to the first point, we managed to get rid of duplicates and
        reduce the main table from 6843 to 3957.
    3. We managed to get rid of duplicates in the table with codepoints for
        decomposition from 5138 to 4931 (uint32).
    4. Recursion has been removed; now data is generated using
        a Perl script.
    5. Memory is no longer allocated for uint32 for the entire size,
        but uint8 is allocated for the entire size for the CCC cache, which
        boosts performance significantly. For small texts (less than 512
        characters), we don't allocate memory from the heap; we use
        the stack.
    6. The backend and frontend have a single code base.
    7. And most importantly, performance has increased significantly.
    
    
    [0] 
    https://www.postgresql.org/message-id/387cd5a3-2600-4c3c-b236-576087ef7062%40gmail.com
    
    
    --
    Regards,
    Alexander Borisov
  32. Re: Improve the performance of Unicode Normalization Forms.

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-22T16:24:53Z

    On 22/12/2025 17:44, Alexander Borisov wrote:
    > The branching kept bothering me. The algorithm consisted of a function
    > with branches (essentially a binary search with offsets) and a table of
    > indexes. I really wanted to get rid of the branches. And then it dawned
    > on me that instead of building branches with offsets, I could create
    > another table that would contain these offsets.
    > The "inspiration" came from my patch Optimization of the is_normalized()
    > function [0].
    > 
    > Actually, I changed the GenerateSparseArray algorithm, which now
    > generates a table with offsets instead of a function with branches.
    > In other words, we now have two tables:
    > 1. A table with offsets, indicating what offset is needed to obtain the
    >     desired index from the index table.
    > 2. A table with indexes. The data is divided into fixed ranges.
    
    Sounds like you re-invented the radix tree :-). Nothing wrong with that; 
    a radix tree sounds like a good data structure for this.
    
    >  > The GenerateSparseArray adds comments like "/* U+1234 */" to each
    >  > element. That's nice but it implies that the elements are unicode code
    >  > points. GenerateSparseArray could be used for many other things. Let's
    >  > use "/* 0x1234 */" instead, or make it somehow configurable.
    >  >
    >  > The generated file is very large, over 1 MB. I guess it doesn't matter
    >  > all that much, but perhaps we should generate a little more compact
    >  > code. Maybe we don't need the "/* U+1234 */" comment on every line, but
    >  > only one comment for each range, for example.
    > 
    > I have many places where I need to generate C tables. To make my life
    > easier, and perhaps the lives of others, I added a small module for
    > generating "pretty" tables. ./src/tools/PrettyLine.pm
    > After applying it, the size of the generated header file was reduced
    > to ~300KB.
    
    I think a fixed number of values per line would look nicer than trying 
    to pack as much possible on each line.
    
    (This isn't too important of course, this is just about making the 
    generated code look prettier.)
    
    Thanks, I'll try to find time to take a closer look.
    
    - Heikki
    
    
    
    
    
  33. Re: Improve the performance of Unicode Normalization Forms.

    Alexander Borisov <lex.borisov@gmail.com> — 2025-12-22T17:55:28Z

    22.12.2025 19:24, Heikki Linnakangas wrote:
    > On 22/12/2025 17:44, Alexander Borisov wrote:
    >> The branching kept bothering me. The algorithm consisted of a function
    >> with branches (essentially a binary search with offsets) and a table of
    >> indexes. I really wanted to get rid of the branches. And then it dawned
    >> on me that instead of building branches with offsets, I could create
    >> another table that would contain these offsets.
    >> The "inspiration" came from my patch Optimization of the is_normalized()
    >> function [0].
    >>
    >> Actually, I changed the GenerateSparseArray algorithm, which now
    >> generates a table with offsets instead of a function with branches.
    >> In other words, we now have two tables:
    >> 1. A table with offsets, indicating what offset is needed to obtain the
    >>     desired index from the index table.
    >> 2. A table with indexes. The data is divided into fixed ranges.
    > 
    > Sounds like you re-invented the radix tree :-). Nothing wrong with that; 
    > a radix tree sounds like a good data structure for this.
    
    I looked at the implementation of radix trees for Encoding in
    PostgreSQL. There are similarities, of course, but in my implementation,
    the access speed is literally O(1).
    
    For example:
    return table_index[ table_offset[cp >> 6] + (cp & 63) ];
    
    Access speed in Encoding (radix tree), it is more complicated. If I 
    understand correctly.
    
    In general, there was an idea to store offset and index data in one
    table. Not to separate them.
    But I'm sure you'll understand everything when you see the code.
    
    >>  > The GenerateSparseArray adds comments like "/* U+1234 */" to each
    >>  > element. That's nice but it implies that the elements are unicode code
    >>  > points. GenerateSparseArray could be used for many other things. Let's
    >>  > use "/* 0x1234 */" instead, or make it somehow configurable.
    >>  >
    >>  > The generated file is very large, over 1 MB. I guess it doesn't matter
    >>  > all that much, but perhaps we should generate a little more compact
    >>  > code. Maybe we don't need the "/* U+1234 */" comment on every line, 
    >> but
    >>  > only one comment for each range, for example.
    >>
    >> I have many places where I need to generate C tables. To make my life
    >> easier, and perhaps the lives of others, I added a small module for
    >> generating "pretty" tables. ./src/tools/PrettyLine.pm
    >> After applying it, the size of the generated header file was reduced
    >> to ~300KB.
    > 
    > I think a fixed number of values per line would look nicer than trying 
    > to pack as much possible on each line.
    > 
    > (This isn't too important of course, this is just about making the 
    > generated code look prettier.)
    
    With a sense of inner beauty :), when the number of values in a line is
    fixed, the lines start to "dance", which doesn't look very good.
    On the other hand, we can convert all values to %06X or something like
    that. But then it would defeat the whole purpose of the idea — to reduce
    the file size and make it look nice.
    I use this approach to create many tables, which allows me not to think
    about how the numbers will fit in and not waste time on it.
    But, it's more a matter of taste.
    
    > Thanks, I'll try to find time to take a closer look.
    
    Thank you for your time.
    
    --
    Regards,
    Alexander Borisov
    
    
    
    
    
  34. Re: Improve the performance of Unicode Normalization Forms.

    Alexander Borisov <lex.borisov@gmail.com> — 2026-05-04T19:28:34Z

    Main changes in v11:
    1. Removed the unnecessary Perl module (PrettyPrint) and replaced it
        with Text::Wrap, which has long been part of Perl (Core) and performs
        the same functions.
    2. Performed a minor refactoring of the table generation code (Perl).
    3. The table generation module has been renamed to something more
        appropriate. (I have my doubts here; I wouldn’t exactly call it a
        radix)
    4. The commit logs have been brought into a reasonable state.
    
    --
    Regards,
    Alexander Borisov
    
  35. Re: Improve the performance of Unicode Normalization Forms.

    Henson Choi <assam258@gmail.com> — 2026-05-05T08:02:22Z

    Hi Alexander,
    
    Thank you for the patch.  I applied v10 to a clean checkout, built it,
    and ran the regression suite — all tests pass.  Nice work.
    
    I have attached a review report below.
    
    Detailed gcov coverage in HTML format is included in coverage.tgz.
    
    
    Unicode Normalization Performance Patch Review
    ===============================================
    
    
    Patch: Improve the performance of Unicode Normalization Forms
    Commitfest: https://commitfest.postgresql.org/patch/5802
    
    
    Review Methodology:
      Quality assessment focused on key code paths and coverage analysis.
      Code coverage measured using gcov (modified lines only), supplemented
      by normalization-check (NormalizationTest.txt).
    
    
    Limitations:
      - Not a security audit or formal verification
    
    
    TABLE OF CONTENTS
    -----------------
    
    
    1. Executive Summary
    2. Issues Found
       2.1 Critical / Major
       2.2 Minor (Review Needed)
    3. Test Coverage
       3.1 Covered Areas
       3.2 Untested Items
    4. Code Coverage Analysis
       4.1 Overall Coverage
       4.2 Uncovered Code Risk Assessment
    5. Commit Analysis
    6. Recommendations
    7. Conclusion
    
    
    1. EXECUTIVE SUMMARY
    --------------------
    
    
    Overall assessment: GOOD
    
    
    The patch replaces two separate lookup mechanisms (bsearch for FRONTEND,
    perfect hash for BACKEND) with a single two-level sparse array.  This
    unifies code paths that diverged in 2020 for pragmatic reasons, and
    shrinks the generated table from ~19K to ~9K lines.  The approach is
    well-established in Unicode processing libraries (CPython, ICU, Tcl).
    No critical issues found.
    
    Performance was measured both by the author and independently by Jeff Davis
    using ICU v76.1 test data (100 iterations, ICU collation test strings):
    
      Without patch:  NFD  9.1s  NFC 14.5s
      With patch:     NFD  1.6s  NFC  3.0s
      ICU v76.1:      NFD  1.9s  NFC  1.1s
    
    
    Thread: https://postgr.es/m/7859e5ef-a574-4199-a69b-6fee26711521@gmail.com
    
    Rating by Area:
    - Code Quality:     Good (clean unification, no style violations)
    - Test Coverage:    Adequate (95.4% modified-line coverage)
    - Documentation:    N/A (no user-facing documentation changes)
    - Build/Regress:    Pass
    
    
    2. ISSUES FOUND
    ---------------
    
    
    2.1 Critical / Major
    
    
    None.
    
    
    2.2 Minor (Review Needed)
    
    
    #1 [Design] GenerateSparseArray.pm - zero value dual semantics
    
       Index value 0 serves as both "entry not found" and a valid array
       index.  This works today because all callers reserve position 0 as a
       null entry, but the convention is implicit and unenforced.  A future
       caller that does not follow it would silently return wrong data.
       Jeff Davis suggested #define EMPTY 0xFFFF to make the sentinel
       explicit.  The author's response was non-committal.  This remains open.
    
    
    #2 [Code] unicode_norm.c:476-480 - OOM path untestable
    
       ALLOC() failure in unicode_normalize() returns NULL, but this branch
       is unreachable in BACKEND (palloc() throws on OOM) and untestable
       without fault injection.
    
       Recommend manual review of all FRONTEND callers to confirm NULL return
       is handled correctly.
    
    
    
    3. TEST COVERAGE
    ----------------
    
    
    3.1 Covered Areas
    
    
    Existing tests in src/test/regress/sql/unicode.sql exercise all four
    normalization forms (NFC, NFD, NFKC, NFKD), idempotency, empty string,
    IS NORMALIZED predicate, and error cases.
    
    
    3.2 Untested Items
    
    
    The following lines were uncovered before this review.  Sample tests
    have been added to unicode.sql and verified via pg-regress.
    
    Line                    Code path
    -------------------------------------------------------------------------------
    unicode_norm.c:337      canonical_reorder() in Hangul decomposition path
    unicode_norm.c:414      canonical_reorder() in non-Hangul decomposition path
    unicode_norm.c:476-480  ALLOC() failure — FRONTEND OOM only, untestable
    unicode_norm.c:550      FREE(classes) when input decomposes to >512 chars
    
    
    Tests added for lines 337, 414, 550:
    
        -- Multiple combining marks before a Hangul syllable trigger canonical
        -- reorder before the Hangul fast path writes its jamo decomposition.
        SELECT normalize(U&'\0308\0301\AC00', NFD) = U&'\0308\0301\1100\1161'
    COLLATE "C" AS test_hangul_reorder;
    
        -- Multiple combining marks before a precomposed character trigger
        -- canonical reorder when the starter from its decomposition is written.
        SELECT normalize(U&'\0308\0301\00E0', NFD) = U&'\0308\0301\0061\0300'
    COLLATE "C" AS test_decomp_reorder;
    
        -- class_buf is uint8[512]; repeat(U+00E4, 300) decomposes to 600 chars,
        -- forcing heap allocation for the combining class buffer.
        SELECT length(normalize(repeat(U&'\00E4', 300), NFD)) = 600 AS
    test_large_decomp;
    
    
    4. CODE COVERAGE ANALYSIS
    -------------------------
    
    
    4.1 Overall Coverage (modified lines only)
    
    
    95.4% (146 / 153 lines) — measured on unicode_norm.c.
    GenerateSparseArray.pm, PrettyLine.pm, generate-unicode_norm_table.pl,
    and unicode_norm_table.h are build-time artifacts; runtime coverage not
    applicable.
    
    
    4.2 Uncovered Code Risk Assessment
    
    
    The 7 uncovered lines fall into two groups:
    
    Lines 337, 414, 550 — reachable but required edge-case inputs.
    Tests have been added (see section 3.2).
    
    Lines 476-480 — OOM path, unreachable in BACKEND.  The sole FRONTEND
    production caller (saslprep.c) handles NULL correctly.
    
    No uncovered code presents a functional or security risk.
    
    
    5. COMMIT ANALYSIS
    ------------------
    
    
    4 commits (2 preparatory + 2 core):
    
    
    Commit  Area          Files   +/-          Key Content
    -------------------------------------------------------------------------------
    1       Tools         1       +132/0       PrettyLine.pm - line formatter
    2       Tools         1       +374/0       GenerateSparseArray.pm - sparse
                                               array generator
    3       Core          7       +6485/-12534 Sparse array tables + lookup,
                                               remove perfect hash
    4       Refactor+Perf 3       +3746/-3223  unicode_normalize() rewrite,
                                               compat decomp separation
    
    
    Net change: -5,020 lines.  unicode_norm_hashfunc.h deleted (-3,025 lines);
    unicode_norm_table.h compacted from ~19K to ~9K lines.
    
    
    6. RECOMMENDATIONS
    ------------------
    
    
    1. Add tests for lines 337, 414, and 550 to the regression suite.
       Sample SQL provided in section 3.2 — all three verified correct.
    
    2. Revisit zero value semantics in GenerateSparseArray.pm (see section 2.2
    #1).
       The current code is correct, but the "0 means not found" convention is
       implicit.  Consider #define EMPTY 0xFFFF before adding more callers.
    
    3. Review FRONTEND callers of unicode_normalize() to confirm NULL return
       is handled correctly (see section 2.2 #2).
    
    
    7. CONCLUSION
    -------------
    
    
    The patch is a well-scoped improvement that closes a gap left open in
    2020: the FRONTEND path was kept on bsearch to avoid inflating libpq,
    while BACKEND used a perfect hash.  The sparse array is smaller than the
    hash table yet faster than bsearch, making it viable for both contexts.
    
    The only item worth discussing before commit is whether to include the
    three new regression tests from section 3.2.  Everything else is clean.