Re: Add CASEFOLD() function.

From: Jeff Davis <pgsql@j-davis.com>

To: Peter Eisentraut <peter@eisentraut.org>, Joe Conway <mail@joeconway.com>, Ian Lawrence Barwick <barwick@gmail.com>

Cc: pgsql-hackers@postgresql.org

Date: 2024-12-19T17:51:32Z

Lists: pgsql-hackers

Commits

Fix PDF doc build.
- d2ca16bb509c 18.0 landed
Add SQL function CASEFOLD().
- bfc5992069cf 18.0 landed
Add support for Unicode case folding.
- 4e7f62bc386a 18.0 landed

On Thu, 2024-12-19 at 17:18 +0100, Peter Eisentraut wrote:
> Can you explain this in further detail?  I don't quite follow why
> this would be required.

I am unsure now.

My initial reasoning was based on the idea that users would want to use
CASEFOLD(t) in a unique expression index as an improvement over
LOWER(t). And if you do that, you'd be surprised if canonically
equivalent strings ended up in the index as distinct entries. I don't
think that's a huge problem, because in other contexts we leave it up
to the user to keep things normalized consistently, and a
CHECK(t IS NFC NORMALIZED) is a good way to do that.

But there's a problem: full case folding doesn't preserve the normal
form, so even if the input is NFC normalized, the output might not be.
If we solve this problem, then we can just say that CASEFOLD()
preserves the normal form, consistently with how the spec defines
LOWER()/UPPER(), and I think that would be the best outcome.
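
To illustrate the point above, here is a minimal Python sketch (not
PostgreSQL code). It assumes that Python's str.casefold(), which applies
Unicode full case folding, is a fair stand-in for the folding discussed
here. The sample character U+01F0 is my own choice of example.

```python
import unicodedata

# U+01F0 LATIN SMALL LETTER J WITH CARON: one precomposed code point, valid NFC.
s = "\u01f0"

# Full case folding maps it to "j" followed by U+030C COMBINING CARON.
folded = s.casefold()

input_is_nfc = unicodedata.is_normalized("NFC", s)
output_is_nfc = unicodedata.is_normalized("NFC", folded)

# Re-normalizing the folded string composes it back into U+01F0.
renormalized = unicodedata.normalize("NFC", folded)

print([hex(ord(c)) for c in folded])
print("input NFC:", input_is_nfc, "| folded output NFC:", output_is_nfc)
```

The input passes an NFC check but the folded output does not, which is
the case a CHECK(t IS NFC NORMALIZED) constraint on the column alone
would not cover.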

I'm not sure if that problem is solvable, though, because what if the
input string is in both NFC and NFD, how do we know which normal form
to preserve?
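
A small Python sketch of that ambiguity, using an arbitrary sample
string of my own: any text without decomposable or combining characters
(plain ASCII, for instance) is already in both forms, so the input by
itself does not say which form the caller intended.

```python
import unicodedata

s = "casefold"  # plain ASCII sample, chosen only for illustration

is_nfc = unicodedata.is_normalized("NFC", s)
is_nfd = unicodedata.is_normalized("NFD", s)

# Both checks succeed, so "preserve the input's normal form" is ambiguous here.
print("NFC:", is_nfc, "| NFD:", is_nfd)
```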

We could tell users to use an expression index on
NORMALIZE(CASEFOLD(t)) instead, but that feels like inefficient
boilerplate.
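
As a rough Python analogue of that expression (a sketch, not the
proposed implementation), the hypothetical helper index_key() below
folds first and then normalizes to NFC. The two sample strings are my
own examples: both are NFC-normalized, their plain case-folded forms
differ code point by code point, yet they are canonically equivalent
and collapse to one key once normalized.

```python
import unicodedata

def index_key(t: str) -> str:
    """Hypothetical stand-in for NORMALIZE(CASEFOLD(t)) with NFC."""
    return unicodedata.normalize("NFC", t.casefold())

a = "\u0390"        # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
b = "\u03aa\u0301"  # GREEK CAPITAL LETTER IOTA WITH DIALYTIKA + COMBINING ACUTE

both_inputs_nfc = (unicodedata.is_normalized("NFC", a)
                   and unicodedata.is_normalized("NFC", b))

# Folding alone yields two different code point sequences...
fold_only_differs = a.casefold() != b.casefold()

# ...while folding followed by normalization yields a single key.
keys_match = index_key(a) == index_key(b)

print(both_inputs_nfc, fold_only_differs, keys_match)
```

With folding alone, a unique index would accept both strings as
distinct entries; with the extra normalization step it would not.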

> Another might be that's not entirely clear how this should work in
> encodings other than UTF-8.  For example, the normalized string might
> not be representable in the encoding.

That's a good point.
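
A minimal Python sketch of that concern, assuming Python's "latin-1"
codec is a reasonable stand-in for a single-byte server encoding such
as LATIN1, and using NFD as the target form purely for illustration:

```python
import unicodedata

s = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, representable in Latin-1

# The precomposed form encodes to a single byte.
encoded_original = s.encode("latin-1")

# NFD splits it into "e" + U+0301 COMBINING ACUTE ACCENT,
# and U+0301 has no Latin-1 code point.
decomposed = unicodedata.normalize("NFD", s)

try:
    decomposed.encode("latin-1")
    representable = True
except UnicodeEncodeError:
    representable = False

print(encoded_original, representable)
```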

Regards,
	Jeff Davis