Thread

  1. Re: Improve the performance of Unicode Normalization Forms.

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-12T12:22:38Z

    On 24/11/2025 11:55, Alexander Borisov wrote:
    > Hey guys, any news?
    
    I finally got around to look at this, thanks for your patience!
    
    > v6-0001-Moving-Perl-functions-Sparse-Array-to-a-common-mo.patch
    
    +1, this makes the existing code more readable, even without the rest of 
    the patches. Thanks for the added comments. I'll review this in more 
    detail, but it seems pretty close to be ready for committing.
    
    Thinking of GenerateSparseArray's API, I think what the caller really 
    wants is to generate a function like:
    
    /*
      * Look up the value of 'x'.
      */
    static uint16
    lookup(uint16)
    {
        ...
    }
    
    That's how the PerfectHash module works. The tables should be 
    implementation detail that the caller doesn't need to know about.
    
    
    > v6-0002-Improve-the-performance-of-Unicode-Normalization-.patch
    > v6-0003-Refactoring-Unicode-Normalization-Forms-performan.patch
    
    These basically look OK to me too. Some minor issues and ideas for 
    further work:
    
    The 'make update-unicode' and 'ninja update-unicode' targets are broken. 
    Need to be updated for the removal of 'unicode_norm_hashfunc.h'.
    
    The GenerateSparseArray adds comments like "/* U+1234 */" to each 
    element. That's nice but it implies that the elements are unicode code 
    points. GenerateSparseArray could be used for many other things. Let's 
    use "/* 0x1234 */" instead, or make it somehow configurable.
    
    The generated file is very large, over 1 MB. I guess it doesn't matter 
    all that much, but perhaps we should generate a little more compact 
    code. Maybe we don't need the "/* U+1234 */" comment on every line, but 
    only one comment for each range, for example.
    
    > typedef struct
    > {
    > 	uint8		comb_class;		/* combining class of character */
    > 	uint8		dec_size_flags; /* size and flags of decomposition code list */
    > 	uint16		dec_index;		/* index into UnicodeDecomp_codepoints, or the
    > 								 * decomposition itself if DECOMP_INLINE */
    > } pg_unicode_decomposition;
    
    The 'UnicodeDecomp_codepoints' array mentioned in the comment doesn't 
    exist anymore.
    
    
    Finally, some ideas for packing the arrays more tightly. I'm not sure if 
    these make any difference in practice or are worth the effort, but here 
    we go:
    
    There are only 56 distinct comb_classes, and 18 distinct dec_size_flags, 
    so if we added one more small lookup table for them, they could be 
    packed into a single byte. That would shrink pg_unicode_decomposition 
    from 4 to 3 bytes.
    
    > static const uint8 UnicodeDecompSizes[4931] =
    
    The max value stored here is 7, so you could get away with just 3 bits 
    per element.
    
    > static const uint16 decomp_map[33752] =
    
    This array consists of multiple ranges, and each range is accessed in a 
    separate piece of code. We could use multiple arrays, one for each 
    range, instead of one big array. Some of the ranges only store a small 
    range of values, so we could use uint8 for them.
    
    - Heikki