Thread

Re: Improve the performance of Unicode Normalization Forms.

Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-12T12:22:38Z
On 24/11/2025 11:55, Alexander Borisov wrote:
> Hey guys, any news?

I finally got around to look at this, thanks for your patience!

> v6-0001-Moving-Perl-functions-Sparse-Array-to-a-common-mo.patch

+1, this makes the existing code more readable, even without the rest of 
the patches. Thanks for the added comments. I'll review this in more 
detail, but it seems pretty close to be ready for committing.

Thinking of GenerateSparseArray's API, I think what the caller really 
wants is to generate a function like:

/*
  * Look up the value of 'x'.
  */
static uint16
lookup(uint16)
{
    ...
}

That's how the PerfectHash module works. The tables should be 
implementation detail that the caller doesn't need to know about.


> v6-0002-Improve-the-performance-of-Unicode-Normalization-.patch
> v6-0003-Refactoring-Unicode-Normalization-Forms-performan.patch

These basically look OK to me too. Some minor issues and ideas for 
further work:

The 'make update-unicode' and 'ninja update-unicode' targets are broken. 
Need to be updated for the removal of 'unicode_norm_hashfunc.h'.

The GenerateSparseArray adds comments like "/* U+1234 */" to each 
element. That's nice but it implies that the elements are unicode code 
points. GenerateSparseArray could be used for many other things. Let's 
use "/* 0x1234 */" instead, or make it somehow configurable.

The generated file is very large, over 1 MB. I guess it doesn't matter 
all that much, but perhaps we should generate a little more compact 
code. Maybe we don't need the "/* U+1234 */" comment on every line, but 
only one comment for each range, for example.

> typedef struct
> {
> 	uint8		comb_class;		/* combining class of character */
> 	uint8		dec_size_flags; /* size and flags of decomposition code list */
> 	uint16		dec_index;		/* index into UnicodeDecomp_codepoints, or the
> 								 * decomposition itself if DECOMP_INLINE */
> } pg_unicode_decomposition;

The 'UnicodeDecomp_codepoints' array mentioned in the comment doesn't 
exist anymore.


Finally, some ideas for packing the arrays more tightly. I'm not sure if 
these make any difference in practice or are worth the effort, but here 
we go:

There are only 56 distinct comb_classes, and 18 distinct dec_size_flags, 
so if we added one more small lookup table for them, they could be 
packed into a single byte. That would shrink pg_unicode_decomposition 
from 4 to 3 bytes.

> static const uint8 UnicodeDecompSizes[4931] =

The max value stored here is 7, so you could get away with just 3 bits 
per element.

> static const uint16 decomp_map[33752] =

This array consists of multiple ranges, and each range is accessed in a 
separate piece of code. We could use multiple arrays, one for each 
range, instead of one big array. Some of the ranges only store a small 
range of values, so we could use uint8 for them.

- Heikki