Thread

  1. Re: Improve the performance of Unicode Normalization Forms.

    Henson Choi <assam258@gmail.com> — 2026-05-05T08:02:22Z

    Hi Alexander,
    
    Thank you for the patch.  I applied v10 to a clean checkout, built it,
    and ran the regression suite — all tests pass.  Nice work.
    
    I have attached a review report below.
    
    Detailed gcov coverage in HTML format is included in coverage.tgz.
    
    
    Unicode Normalization Performance Patch Review
    ===============================================
    
    
    Patch: Improve the performance of Unicode Normalization Forms
    Commitfest: https://commitfest.postgresql.org/patch/5802
    
    
    Review Methodology:
      Quality assessment focused on key code paths and coverage analysis.
      Code coverage measured using gcov (modified lines only), supplemented
      by normalization-check (NormalizationTest.txt).
    
    
    Limitations:
      - Not a security audit or formal verification
    
    
    TABLE OF CONTENTS
    -----------------
    
    
    1. Executive Summary
    2. Issues Found
       2.1 Critical / Major
       2.2 Minor (Review Needed)
    3. Test Coverage
       3.1 Covered Areas
       3.2 Untested Items
    4. Code Coverage Analysis
       4.1 Overall Coverage
       4.2 Uncovered Code Risk Assessment
    5. Commit Analysis
    6. Recommendations
    7. Conclusion
    
    
    1. EXECUTIVE SUMMARY
    --------------------
    
    
    Overall assessment: GOOD
    
    
    The patch replaces two separate lookup mechanisms (bsearch for FRONTEND,
    perfect hash for BACKEND) with a single two-level sparse array.  This
    unifies code paths that diverged in 2020 for pragmatic reasons, and
    shrinks the generated table from ~19K to ~9K lines.  The approach is
    well-established in Unicode processing libraries (CPython, ICU, Tcl).
    No critical issues found.
    
    Performance was measured both by the author and independently by Jeff Davis
    using ICU v76.1 test data (100 iterations, ICU collation test strings):
    
      Without patch:  NFD  9.1s  NFC 14.5s
      With patch:     NFD  1.6s  NFC  3.0s
      ICU v76.1:      NFD  1.9s  NFC  1.1s
    
    
    Thread: https://postgr.es/m/7859e5ef-a574-4199-a69b-6fee26711521@gmail.com
    
    Rating by Area:
    - Code Quality:     Good (clean unification, no style violations)
    - Test Coverage:    Adequate (95.4% modified-line coverage)
    - Documentation:    N/A (no user-facing documentation changes)
    - Build/Regress:    Pass
    
    
    2. ISSUES FOUND
    ---------------
    
    
    2.1 Critical / Major
    
    
    None.
    
    
    2.2 Minor (Review Needed)
    
    
    #1 [Design] GenerateSparseArray.pm - zero value dual semantics
    
       Index value 0 serves as both "entry not found" and a valid array
       index.  This works today because all callers reserve position 0 as a
       null entry, but the convention is implicit and unenforced.  A future
       caller that does not follow it would silently return wrong data.
       Jeff Davis suggested #define EMPTY 0xFFFF to make the sentinel
       explicit.  The author's response was non-committal.  This remains open.
    
    
    #2 [Code] unicode_norm.c:476-480 - OOM path untestable
    
       ALLOC() failure in unicode_normalize() returns NULL, but this branch
       is unreachable in BACKEND (palloc() throws on OOM) and untestable
       without fault injection.
    
       Recommend manual review of all FRONTEND callers to confirm NULL return
       is handled correctly.
    
    
    
    3. TEST COVERAGE
    ----------------
    
    
    3.1 Covered Areas
    
    
    Existing tests in src/test/regress/sql/unicode.sql exercise all four
    normalization forms (NFC, NFD, NFKC, NFKD), idempotency, empty string,
    IS NORMALIZED predicate, and error cases.
    
    
    3.2 Untested Items
    
    
    The following lines were uncovered before this review.  Sample tests
    have been added to unicode.sql and verified via pg-regress.
    
    Line                    Code path
    -------------------------------------------------------------------------------
    unicode_norm.c:337      canonical_reorder() in Hangul decomposition path
    unicode_norm.c:414      canonical_reorder() in non-Hangul decomposition path
    unicode_norm.c:476-480  ALLOC() failure — FRONTEND OOM only, untestable
    unicode_norm.c:550      FREE(classes) when input decomposes to >512 chars
    
    
    Tests added for lines 337, 414, 550:
    
        -- Multiple combining marks before a Hangul syllable trigger canonical
        -- reorder before the Hangul fast path writes its jamo decomposition.
        SELECT normalize(U&'\0308\0301\AC00', NFD) = U&'\0308\0301\1100\1161'
    COLLATE "C" AS test_hangul_reorder;
    
        -- Multiple combining marks before a precomposed character trigger
        -- canonical reorder when the starter from its decomposition is written.
        SELECT normalize(U&'\0308\0301\00E0', NFD) = U&'\0308\0301\0061\0300'
    COLLATE "C" AS test_decomp_reorder;
    
        -- class_buf is uint8[512]; repeat(U+00E4, 300) decomposes to 600 chars,
        -- forcing heap allocation for the combining class buffer.
        SELECT length(normalize(repeat(U&'\00E4', 300), NFD)) = 600 AS
    test_large_decomp;
    
    
    4. CODE COVERAGE ANALYSIS
    -------------------------
    
    
    4.1 Overall Coverage (modified lines only)
    
    
    95.4% (146 / 153 lines) — measured on unicode_norm.c.
    GenerateSparseArray.pm, PrettyLine.pm, generate-unicode_norm_table.pl,
    and unicode_norm_table.h are build-time artifacts; runtime coverage not
    applicable.
    
    
    4.2 Uncovered Code Risk Assessment
    
    
    The 7 uncovered lines fall into two groups:
    
    Lines 337, 414, 550 — reachable but required edge-case inputs.
    Tests have been added (see section 3.2).
    
    Lines 476-480 — OOM path, unreachable in BACKEND.  The sole FRONTEND
    production caller (saslprep.c) handles NULL correctly.
    
    No uncovered code presents a functional or security risk.
    
    
    5. COMMIT ANALYSIS
    ------------------
    
    
    4 commits (2 preparatory + 2 core):
    
    
    Commit  Area          Files   +/-          Key Content
    -------------------------------------------------------------------------------
    1       Tools         1       +132/0       PrettyLine.pm - line formatter
    2       Tools         1       +374/0       GenerateSparseArray.pm - sparse
                                               array generator
    3       Core          7       +6485/-12534 Sparse array tables + lookup,
                                               remove perfect hash
    4       Refactor+Perf 3       +3746/-3223  unicode_normalize() rewrite,
                                               compat decomp separation
    
    
    Net change: -5,020 lines.  unicode_norm_hashfunc.h deleted (-3,025 lines);
    unicode_norm_table.h compacted from ~19K to ~9K lines.
    
    
    6. RECOMMENDATIONS
    ------------------
    
    
    1. Add tests for lines 337, 414, and 550 to the regression suite.
       Sample SQL provided in section 3.2 — all three verified correct.
    
    2. Revisit zero value semantics in GenerateSparseArray.pm (see section 2.2
    #1).
       The current code is correct, but the "0 means not found" convention is
       implicit.  Consider #define EMPTY 0xFFFF before adding more callers.
    
    3. Review FRONTEND callers of unicode_normalize() to confirm NULL return
       is handled correctly (see section 2.2 #2).
    
    
    7. CONCLUSION
    -------------
    
    
    The patch is a well-scoped improvement that closes a gap left open in
    2020: the FRONTEND path was kept on bsearch to avoid inflating libpq,
    while BACKEND used a perfect hash.  The sparse array is smaller than the
    hash table yet faster than bsearch, making it viable for both contexts.
    
    The only item worth discussing before commit is whether to include the
    three new regression tests from section 3.2.  Everything else is clean.