Thread

  1. Re: Add CASEFOLD() function.

    Thom Brown <thom@linux.com> — 2025-06-19T04:03:35Z

    On Thu, 19 Jun 2025, 03:53 Jeff Davis, <pgsql@j-davis.com> wrote:
    
    > On Wed, 2025-06-18 at 19:09 +0200, Vik Fearing wrote:
    > > I don't know.  I am just pointing out what the Standard says.  I think
    > > we should either comply, or say that we don't do it for LOWER and UPPER
    > > so let's keep things implementation-consistent.
    >
    > For the standard, I see two potential philosophies:
    >
    > I. CASEFOLD() is another variant of LOWER()/UPPER(), and it should
    > preserve NFC in the same way.
    >
    > II. CASEFOLD() is not like LOWER()/UPPER(); it returns a semi-opaque
    > text value that is useful for caseless matching, but should not
    > ordinarily be used for display or sent to the application (those things
    > would be allowed, just not encouraged). For normalization, either:
    >   (A) Follow Unicode Default Caseless Matching (16.0 3.13.5 D144), and
    > don't require any kind of normalization; or
    >   (B) Follow Unicode Canonical Caseless Matching (D145), and require
    > that the input and output are normalized appropriately, but leave the
    > precise normal form as implementation-defined.
    >
    >
    > The current implementation could either be seen as philosophy (I) where
    > we've chosen to ignore the normalization part for the sake of
    > consistency with LOWER()/UPPER(); or it could be seen as philosophy
    > (II)(A).
    >
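    For illustration, the difference between (II)(A) and (II)(B) can be
    sketched in Python, whose str.casefold() applies Unicode full case
    folding. This is only an analogy under that assumption, not the
    PostgreSQL implementation; the helper names default_caseless_key and
    canonical_caseless_key are hypothetical.

```python
import unicodedata


def default_caseless_key(s: str) -> str:
    """Key for Default Caseless Matching (D144): toCasefold(X), no normalization."""
    return s.casefold()


def canonical_caseless_key(s: str) -> str:
    """Key for Canonical Caseless Matching (D145): NFD(toCasefold(NFD(X)))."""
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", decomposed.casefold())


# The same user-perceived character in two canonically equivalent encodings.
composed = "\u00c9"      # LATIN CAPITAL LETTER E WITH ACUTE (already NFC)
decomposed = "E\u0301"   # LATIN CAPITAL LETTER E + COMBINING ACUTE ACCENT

# Without normalization the folded strings differ code point by code point.
default_equal = default_caseless_key(composed) == default_caseless_key(decomposed)

# With normalization on both sides the folded strings compare equal.
canonical_equal = canonical_caseless_key(composed) == canonical_caseless_key(decomposed)

print(default_equal, canonical_equal)
```

    Under (A) the two spellings do not match unless the caller normalizes
    first; under (B) they do, at the cost of the extra normalization passes.
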
    > > How much does it cost to check for NFC?  I honestly don't know the
    > > answer to that question, but that is the only case where we need to
    > > maintain normalization.
    >
    > I attached a very rough patch and ran a very simple test on strings
    > averaging 36 bytes in length, all already in NFC and the result is also
    > NFC. Before the patch, doing a CASEFOLD() on 10M tuples took about 3
    > seconds, afterward about 8.
    >
    > There's a patch to optimize some of the normalization paths, which I
    > haven't had a chance to review yet. So those numbers might come down.
    >
    > >
    > > It's not unconditionally, it's only if the input was NFC.
    >
    > Optimizing the case where the input is _not_ NFC seems strange to me.
    > If we are normalizing the output, I'd say we should just make the
    > output always NFC. Being more strict, this seems likely to comply with
    > the eventual standard.
    >
    > Additionally, if we are normalizing the output, then we should also do
    > the input fixup for U+0345, which would make the result usable for
    > Canonical Caseless Matching. Again, this seems likely to comply with
    > the eventual standard.
    >
    > >
    >
    > So I only see two reasonable implementations:
    >
    > 1. The current CASEFOLD() implementation.
    >
    > 2. Do the input fixup for U+0345 and unconditionally normalize the
    > output in NFC.
    >
    > If there's a case to be made for both implementations, we could also
    > consider having two functions, say, CASEFOLD() for #1 and NCASEFOLD()
    > for #2. I'm not sure whether we'd want to standardize one or both of
    > those functions.
    >
    > And if you think there's likely to be a collision with the standard
    > that's hard to anticipate and fix now, then we should consider
    > reverting CASEFOLD() for 18 and wait for more progress on the
    > standardization. What's the likelihood that the name changes or
    > something like that?
    >
    
    Late to the party, but is there an argument for porting this to the citext
    type, or for supplementing the extension with an additional type ("cftext"?
    *shrug*)? citext currently uses lower(), so our current recommendation for
    dealing with all Unicode characters is to use nondeterministic collations.
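
    To make the lower() limitation concrete, here is a minimal sketch in
    Python rather than SQL. It assumes Python's str.lower() and
    str.casefold() are a fair stand-in for lowercasing versus Unicode full
    case folding; PostgreSQL's lower() depends on the collation provider and
    may behave differently. The helper names are hypothetical.

```python
def matches_with_lower(a: str, b: str) -> bool:
    """Caseless comparison via lowercasing, the approach citext takes with lower()."""
    return a.lower() == b.lower()


def matches_with_casefold(a: str, b: str) -> bool:
    """Caseless comparison via Unicode full case folding."""
    return a.casefold() == b.casefold()


# Pairs that differ only in case under Unicode case folding:
#   - sharp s versus "SS"
#   - Greek final sigma versus regular small sigma
pairs = [("Straße", "STRASSE"), ("\u03c2", "\u03c3")]

lower_results = [matches_with_lower(a, b) for a, b in pairs]
casefold_results = [matches_with_casefold(a, b) for a, b in pairs]

print(lower_results)     # lowercasing misses both pairs
print(casefold_results)  # case folding matches both pairs
```
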
    
    Thom
    