Thread

  1. Re: Built-in case-insensitive collation pg_unicode_ci

    Jeff Davis <pgsql@j-davis.com> — 2025-11-06T21:54:41Z

    On Thu, 2025-10-16 at 15:46 +0200, Peter Eisentraut wrote:
    > If it's a variant of PG_UNICODE_FAST, then it ought to be called 
    > PG_UNICODE_FAST_CI or similar.  Otherwise, one would expect it to be
    > a 
    > variant of PG_UNICODE (if that existed, but there is also UNICODE).
    > 
    > But that name is also dubious since you later write that it's not 
    > actually fast.
    
    My reasoning for the naming was that "PG" means it's our locale,
    "UNICODE" describes the ctype behavior and "FAST" describes the
    collation behavior (that is, fast but not human-friendly).
    
    For this new case-insensitive collation, "UNICODE" still describes the
    ctype behavior, but "CI" is a better description of the collation
    behavior -- the main purpose is to be case-insensitive, not to be fast.
    
    Other naming suggestions are welcome.
    
    > 
    > This reasoning is a bit narrow.  SIMILAR TO is kind of deprecated,
    
    I didn't know that. Deprecated in the standard, or in Postgres? Should
    we document that?
    
    > and 
    > ILIKE is kind of stupid, 
    
    Should we be discouraging its use in the docs?
    
    > Nevertheless, I think there would be some value to provide CI (and
    > maybe 
    > accent-insensitive?) collations that operate separately from the 
    > "nondeterministic" mechanism.  But then I would like to see a 
    > comprehensive approach that covers a variety of providers and
    > locales. 
    > For example, I would expect there to be something like a "sv_SE_CI" 
    > locale, either available by default or easily created.
    
    I don't think that it's possible in another provider to get CI pattern
    matching semantics that are consistent with collation semantics. libc
    doesn't offer case-insensitive collations at all, and ICU doesn't give
    us enough information about the nature of a collation to use it for
    pattern matching.
    
    For instance:
    
       'e' SIMILAR TO 'e_' -- locale=en-u-ka-shifted
    
    should be true, because the right side could expand to 'e ', which
    matches the left side (because the locale ignores the space). But as
    mentioned in the other thread, it's not practical to guess at all the
    possible expansions that might lead to a match in that locale.
    
    Even with a basic CI locale:
    
       'é' SIMILAR TO 'e_' -- locale=en-u-ks-level2
    
    should also return true, because the right hand side can expand to
    U&'e\0301', and the en-u-ks-level2 locale does basic normalization and
    will consider that a match to U&'\00E9'.
    
    Given that we don't have a lot of visibility into what the ICU locale
    is doing, I don't see a safe way to decide whether an ICU locale will
    match our expectations about pattern matching. In fact, I don't think
    any ICU locales work because of the normalization issue in the second
    pattern, unless we redefine SIMILAR TO to be normalization-aware (which
    I'm not suggesting).
    
    The way I defined PG_UNICODE_CI (or whatever we want to name it), it
    avoids these problems: it does codepoint-at-a-time folding with no
    attempt to normalize, and that's all. Both SIMILAR TO expressions above
    will return false, because the right hand side always expands to a
    longer string than the left, and can never match.
    
    That being said, PG_UNICODE_CI is a collation, not a complete solution
    for SIMILAR TO or regexes.
    
    Is there interest in the collation independently as just a simple case-
    insensitive collation? It would also be nice for testing/documentation,
    and it's the only other collation that would be in-scope for the
    builtin provider (because it doesn't require human-friendly ordering,
    which is better handled by ICU).
    
    Regards,
    	Jeff Davis