Thread

Re: Add CASEFOLD() function.

Thom Brown <thom@linux.com> — 2025-06-19T04:03:35Z
On Thu, 19 Jun 2025, 03:53 Jeff Davis, <pgsql@j-davis.com> wrote:

> On Wed, 2025-06-18 at 19:09 +0200, Vik Fearing wrote:
> > I don't know.  I am just pointing out what the Standard says.  I think
> > we should either comply, or say that we don't do it for LOWER and UPPER
> > so let's keep things implementation-consistent.
>
> For the standard, I see two potential philosophies:
>
> I. CASEFOLD() is another variant of LOWER()/UPPER(), and it should
> preserve NFC in the same way.
>
> II. CASEFOLD() is not like LOWER()/UPPER(); it returns a semi-opaque
> text value that is useful for caseless matching, but should not
> ordinarily be used for display or sent to the application (those things
> would be allowed, just not encouraged). For normalization, either:
>   (A) Follow Unicode Default Caseless Matching (16.0 3.13.5 D144), and
> don't require any kind of normalization; or
>   (B) Follow Unicode Canonical Caseless Matching (D145), and require
> that the input and output are normalized appropriately, but leave the
> precise normal form as implementation-defined.
>
>
> The current implementation could either be seen as philosophy (I) where
> we've chosen to ignore the normalization part for the sake of
> consistency with LOWER()/UPPER(); or it could be seen as philosophy
> (II)(A).
>
> > How much does it cost to check for NFC?  I honestly don't know the
> > answer to that question, but that is the only case where we need to
> > maintain normalization.
>
> I attached a very rough patch and ran a very simple test on strings
> averaging 36 bytes in length, all already in NFC and the result is also
> NFC. Before the patch, doing a CASEFOLD() on 10M tuples took about 3
> seconds, afterward about 8.
>
> There's a patch to optimize some of the normalization paths, which I
> haven't had a chance to review yet. So those numbers might come down.
>
> >
> > It's not unconditionally, it's only if the input was NFC.
>
> Optimizing the case where the input is _not_ NFC seems strange to me.
> If we are normalizing the output, I'd say we should just make the
> output always NFC. Being more strict, this seems likely to comply with
> the eventual standard.
>
> Additionally, if we are normalizing the output, then we should also do
> the input fixup for U+0345, which would make the result usable for
> Canonical Caseless Matching. Again, this seems likely to comply with
> the eventual standard.
>
> >
>
> So I only see two reasonable implementations:
>
> 1. The current CASEFOLD() implementation.
>
> 2. Do the input fixup for U+0345 and unconditionally normalize the
> output in NFC.
>
> If there's a case to be made for both implementations, we could also
> consider having two functions, say, CASEFOLD() for #1 and NCASEFOLD()
> for #2. I'm not sure whether we'd want to standardize one or both of
> those functions.
>
> And if you think there's likely to be a collision with the standard
> that's hard to anticipate and fix now, then we should consider
> reverting CASEFOLD() for 18 and wait for more progress on the
> standardization. What's the likelihood that the name changes or
> something like that?
>

Late to the party, but is there an argument for porting this to the citext
type? Or supplementing the extension with an additional type ("cftext"?
*shrug*). citext currently uses lower(), so our current recommendation for
dealing with all Unicode characters is to use nondeterministic collations.
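
For anyone wanting to see the gap concretely, here is a rough sketch in
Python rather than PostgreSQL. It assumes Python's str.casefold() and
unicodedata.normalize() are reasonable stand-ins for full case folding and
normalization; the function names are made up for illustration, and
nfc_casefold_key() only approximates option 2 above by running full NFD on
the input instead of the targeted U+0345 fixup.

```python
import unicodedata


def lower_key(s: str) -> str:
    """Matching key built with lower(), as citext is said to do above."""
    return s.lower()


def casefold_key(s: str) -> str:
    """Full Unicode case folding with no normalization (roughly option 1)."""
    return s.casefold()


def nfc_casefold_key(s: str) -> str:
    """Fold after NFD, then normalize to NFC (an approximation of option 2)."""
    folded = unicodedata.normalize("NFD", s).casefold()
    return unicodedata.normalize("NFC", folded)


# lower() leaves U+00DF alone, so the two spellings do not match.
print(lower_key("Straße") == lower_key("STRASSE"))        # False
print(casefold_key("Straße") == casefold_key("STRASSE"))  # True

# Folding alone can leave NFC: U+0390 folds to a three-code-point sequence.
folded = casefold_key("\u0390")
print(unicodedata.normalize("NFC", folded) == folded)      # False

# Canonically equivalent inputs only match once normalization is applied.
print(casefold_key("\u00c9") == casefold_key("E\u0301"))          # False
print(nfc_casefold_key("\u00c9") == nfc_casefold_key("E\u0301"))  # True
```

The last pair is the kind of case where a citext-like type built on plain
folding would still treat equivalent strings as different.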

Thom
