Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Fix PDF doc build.

  2. Add SQL function CASEFOLD().

  3. Add support for Unicode case folding.

  1. Add CASEFOLD() function.

    Jeff Davis <pgsql@j-davis.com> — 2024-12-12T09:00:15Z

    Unicode case folding is a way to convert a string to a canonical case
    for the purpose of case-insensitive matching.
    
    Users have long used LOWER() for that purpose, but there are a few edge
    case problems:
    
    * Some characters have more than two cased forms, such as "Σ" (U+03A3),
    which can be lowercased as "σ" (U+03C3) or "ς" (U+03C2). The CASEFOLD()
    function converts all cased forms of the character to "σ".
    
    * The character "İ" (U+0130, capital I with dot) is lowercased to "i",
    which can be a problem in locales that don't expect that.
    
    * If new lower case characters are added to Unicode, the results of
    LOWER() may change.
    
    The CASEFOLD() function solves these problems.
    
    Patch attached.
    
    
    -- 
    Jeff Davis
    PostgreSQL Contributor Team - AWS
    
    
    
  2. Re: Add CASEFOLD() function.

    Ian Lawrence Barwick <barwick@gmail.com> — 2024-12-12T12:52:31Z

    Hi
    
    2024年12月12日(木) 18:00 Jeff Davis <pgsql@j-davis.com>:
    >
    > Unicode case folding is a way to convert a string to a canonical case
    > for the purpose of case-insensitive matching.
    >
    > Users have long used LOWER() for that purpose, but there are a few edge
    > case problems:
    >
    > * Some characters have more than two cased forms, such as "Σ" (U+03A3),
    > which can be lowercased as "σ" (U+03C3) or "ς" (U+03C2). The CASEFOLD()
    > function converts all cased forms of the character to "σ".
    >
    > * The character "İ" (U+0130, capital I with dot) is lowercased to "i",
    > which can be a problem in locales that don't expect that.
    >
    > * If new lower case characters are added to Unicode, the results of
    > LOWER() may change.
    >
    > The CASEFOLD() function solves these problems.
    >
    > Patch attached.
    
    I took a quick look at this as it sounds useful for the described issue,
    and it seems to work as advertised, except the function is named "FOLDCASE()"
    in the patch, so I'm wondering which is intended? A quick search indicates
    there are no functions of either name in other databases; Python has a
    "casefold()"
    function [1] and PHP a "foldCase()" function [2], so it doesn't seem there's a
    de-facto standard for this.
    
    [1] https://docs.python.org/3/library/stdtypes.html#str.casefold
    [2] https://www.php.net/manual/en/intlchar.foldcase.php
    
    Regards
    
    
    Ian Barwick
    
    
    
    
  3. Re: Add CASEFOLD() function.

    Jeff Davis <pgsql@j-davis.com> — 2024-12-12T18:30:04Z

    On Thu, 2024-12-12 at 21:52 +0900, Ian Lawrence Barwick wrote:
    > and it seems to work as advertised, except the function is named
    > "FOLDCASE()"
    > in the patch, so I'm wondering which is intended?
    
    Thank you for looking into this, I went back and forth on the name, and
    mistyped it a few times.
    
    ICU seems to use "foldcase":
    
    https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucasemap_8h.html
    
    and it seems to be slightly more proper grammatically (with "fold"
    being the verb). However, "case folding" is common terminology in
    Postgres and Unicode, so "casefold" can be seen as the verb instead.
    
    I don't have a strong opinion here so I will just go with whatever
    seems like the popular choice.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  4. Re: Add CASEFOLD() function.

    Joe Conway <mail@joeconway.com> — 2024-12-12T18:55:00Z

    On 12/12/24 13:30, Jeff Davis wrote:
    > On Thu, 2024-12-12 at 21:52 +0900, Ian Lawrence Barwick wrote:
    >> and it seems to work as advertised, except the function is named
    >> "FOLDCASE()"
    >> in the patch, so I'm wondering which is intended?
    > 
    > Thank you for looking into this, I went back and forth on the name, and
    > mistyped it a few times.
    > 
    > ICU seems to use "foldcase":
    > 
    > https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucasemap_8h.html
    > 
    > and it seems to be slightly more proper grammatically (with "fold"
    > being the verb). However, "case folding" is common terminology in
    > Postgres and Unicode, so "casefold" can be seen as the verb instead.
    > 
    > I don't have a strong opinion here so I will just go with whatever
    > seems like the popular choice.
    
    
    FWIW I prefer casefold()
    
    -- 
    Joe Conway
    PostgreSQL Contributors Team
    RDS Open Source Databases
    Amazon Web Services: https://aws.amazon.com
    
    
    
    
  5. Re: Add CASEFOLD() function.

    Jeff Davis <pgsql@j-davis.com> — 2024-12-16T17:49:22Z

    On Thu, 2024-12-12 at 13:55 -0500, Joe Conway wrote:
    > > I don't have a strong opinion here so I will just go with whatever
    > > seems like the popular choice.
    > 
    > 
    > FWIW I prefer casefold()
    
    Done. I just noticed that it now matches $SUBJECT. The fact that my
    code didn't match the email subject before further supports the idea
    that "foldcase" was never quite as natural -- so I agree that
    "casefold" is the way to go.
    
    
    One question I have is whether we want this function to normalize the
    output.
    
    I believe most usecases would want the output normalized, because
    normalization differences (e.g. "a" U+0061 followed by "combining
    acute" U+0301 vs "a with acute" U+00E1) are more minor than differences
    in case.
    
    Of course, a user could wrap it with the normalize() function, but
    that's verbose and easy to forget. I'm also not sure that it can be
    made as fast as a combined function that does both.
    
    And a follow-up question: if it does normalize, the second parameter
    would be the requested normal form. But to accept the keyword forms
    (NFC, NFD in gram.y) rather than the string forms ('NFC', 'NFD') then
    we'd need to also need to add CASEFOLD to gram.y (like NORMALIZE). Is
    that a reasonable thing to do?
    
    Regards,
    	Jeff Davis
    
    
    
  6. Re: Add CASEFOLD() function.

    Joe Conway <mail@joeconway.com> — 2024-12-16T21:27:11Z

    On 12/16/24 12:49, Jeff Davis wrote:
    > One question I have is whether we want this function to normalize the
    > output.
    > 
    > I believe most usecases would want the output normalized, because
    > normalization differences (e.g. "a" U+0061 followed by "combining
    > acute" U+0301 vs "a with acute" U+00E1) are more minor than differences
    > in case.
    > 
    > Of course, a user could wrap it with the normalize() function, but
    > that's verbose and easy to forget. I'm also not sure that it can be
    > made as fast as a combined function that does both.
    
    Perhaps a one arg version that always casefolds and a two arg version 
    that accepts nfc, nfd, none (or something similar)?
    
    > And a follow-up question: if it does normalize, the second parameter
    > would be the requested normal form. But to accept the keyword forms
    > (NFC, NFD in gram.y) rather than the string forms ('NFC', 'NFD') then
    > we'd need to also need to add CASEFOLD to gram.y (like NORMALIZE). Is
    > that a reasonable thing to do?
    
    SQL 2023 seems to include the NORMALIZE syntax, but the only case 
    folding considered is UPPER and LOWER. As such, I think it ought to be a 
    function but not part of the grammar.
    
    -- 
    Joe Conway
    PostgreSQL Contributors Team
    RDS Open Source Databases
    Amazon Web Services: https://aws.amazon.com
    
    
    
    
  7. Re: Add CASEFOLD() function.

    Andreas Karlsson <andreas@proxel.se> — 2024-12-18T01:31:14Z

    On 12/12/24 10:00 AM, Jeff Davis wrote:
    > Patch attached.
    
    I have not looked at the patch yet but +1 to the idea. I am leaning 
    towards that the function also optionally normalizing the codepoints 
    would be handy too since I think that is what most usecases want. 
    Otherwise people would have to always type normalize(casefold(str)).
    
    Andreas
    
    
    
    
    
  8. Re: Add CASEFOLD() function.

    Jeff Davis <pgsql@j-davis.com> — 2024-12-18T18:40:06Z

    On Mon, 2024-12-16 at 16:27 -0500, Joe Conway wrote:
    > 
    > SQL 2023 seems to include the NORMALIZE syntax, but the only case 
    > folding considered is UPPER and LOWER. As such, I think it ought to
    > be a 
    > function but not part of the grammar.
    
    Should the standard support something like the Unicode idea of case
    folding? If so, do we need to be careful of conflicts?
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  9. Re: Add CASEFOLD() function.

    Peter Eisentraut <peter@eisentraut.org> — 2024-12-19T16:18:31Z

    On 16.12.24 18:49, Jeff Davis wrote:
    > One question I have is whether we want this function to normalize the
    > output.
    > 
    > I believe most usecases would want the output normalized, because
    > normalization differences (e.g. "a" U+0061 followed by "combining
    > acute" U+0301 vs "a with acute" U+00E1) are more minor than differences
    > in case.
    
    Can you explain this in further detail?  I don't quite follow why this 
    would be required.
    
    > Of course, a user could wrap it with the normalize() function, but
    > that's verbose and easy to forget. I'm also not sure that it can be
    > made as fast as a combined function that does both.
    > 
    > And a follow-up question: if it does normalize, the second parameter
    > would be the requested normal form. But to accept the keyword forms
    > (NFC, NFD in gram.y) rather than the string forms ('NFC', 'NFD') then
    > we'd need to also need to add CASEFOLD to gram.y (like NORMALIZE). Is
    > that a reasonable thing to do?
    
    That's maybe one reason to keep it separate.
    
    Another might be that's not entirely clear how this should work in 
    encodings other than UTF-8.  For example, the normalized string might 
    not be representable in the encoding.
    
    
    
    
    
  10. Re: Add CASEFOLD() function.

    Jeff Davis <pgsql@j-davis.com> — 2024-12-19T17:51:32Z

    On Thu, 2024-12-19 at 17:18 +0100, Peter Eisentraut wrote:
    > Can you explain this in further detail?  I don't quite follow why
    > this 
    > would be required.
    
    I am unsure now.
    
    My initial reasoning was based on the idea that users would want to use
    CASEFOLD(t) in a unique expression index as an improvement over
    LOWER(t). And if you do that, you'd be surprised if some equivalent
    strings ended up in the index. I don't think that's a huge problem,
    because in other contexts we leave it up to the user to keep things
    normalized consistently, and a CHECK(t IS NFC NORMALIZED) is a good way
    to do that.
    
    But there's a problem: full case folding doesn't preserve the normal
    form, so even if the input is NFC normalized, the output might not be.
    If we solve this problem, then we can just say that CASEFOLD()
    preserves the normal form, consistently with how the spec defines
    LOWER()/UPPER(), and I think that would be the best outcome.
    
    I'm not sure if that problem is solvable, though, because what if the
    input string is in both NFC and NFD, how do we know which normal form
    to preserve?
    
    We could tell users to use an expression index on
    NORMALIZE(CASEFOLD(t)) instead, but that feels like inefficient
    boilerplate.
    
    > 
    > Another might be that's not entirely clear how this should work in 
    > encodings other than UTF-8.  For example, the normalized string might
    > not be representable in the encoding.
    
    That's a good point.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  11. Re: Add CASEFOLD() function.

    Jeff Davis <pgsql@j-davis.com> — 2025-01-08T23:19:58Z

    On Thu, 2024-12-19 at 09:51 -0800, Jeff Davis wrote:
    > But there's a problem: full case folding doesn't preserve the normal
    > form, so even if the input is NFC normalized, the output might not
    > be.
    > If we solve this problem, then we can just say that CASEFOLD()
    > preserves the normal form, consistently with how the spec defines
    > LOWER()/UPPER(), and I think that would be the best outcome.
    > 
    > I'm not sure if that problem is solvable, though, because what if the
    > input string is in both NFC and NFD, how do we know which normal form
    > to preserve?
    
    The options as I see it are:
    
    1. Normalize the output (either by using an extra parameter or just
    always normalizing to NFC). As you said, the problem here is that the
    encoding might not work with normalization. One solution might be that
    CASEFOLD() only works in UTF8, like NORMALIZE().
    
    2. Try to preserve normalization as long as the encoding supports it.
    The problem here is that we don't know what normal form to preserve,
    because the input string might be in both NFC and NFD. We could
    document that it preserves NFC form iff the input is NFC.
    
    3. Allow CASEFOLD() to break the normal form of the input string. The
    problem here is that the user may be surprised that the output is not
    normalized even when all of their data is normalized. It's not clear to
    me whether it still works for caseless matching -- it might if the
    string is in a consistent form, even if not normalized.
    
    Out of those I think #1 is the most appealing. Most users, and
    especially users that care about these edge cases enough to use Full
    Case Folding, are almost certainly going to be on UTF8 anyway. It's
    also the most user-friendly.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  12. Re: Add CASEFOLD() function.

    Jeff Davis <pgsql@j-davis.com> — 2025-01-11T00:27:12Z

    On Wed, 2025-01-08 at 15:19 -0800, Jeff Davis wrote:
    > 3. Allow CASEFOLD() to break the normal form of the input string. The
    > problem here is that the user may be surprised that the output is not
    > normalized even when all of their data is normalized. It's not clear
    > to
    > me whether it still works for caseless matching -- it might if the
    > string is in a consistent form, even if not normalized.
    
    Looking at the Unicode standard again, it distinguishes between
    "default caseless matching" and "canonical caseless matching". The
    latter accounts for a few nuances that the former does not. See Unicode
    16.0 section 3.13.4 D144 & D145. Using Default Caseless Matching
    simplifies things quite a bit.
    
    We could argue that it would be nice to have canonical caseless
    matching, but that seems to be going above and beyond what Unicode
    suggests. And normalization is expensive -- if we combine case folding
    and normalization, there's no way for the user to avoid the cost. So
    I'm changing my answer to #3, and we just document that it does not
    preserve normalization. I believe this means that Peter and I are now
    in agreement[1], though I'm not sure if his reasoning is the same.
    
    New patch series attached.
    
    Regards,
    	Jeff Davis
    
    [1] 
    https://www.postgresql.org/message-id/8c384b0d-00f2-4515-8e60-ff7d0d4c093a%40eisentraut.org
    
    
  13. Re: Add CASEFOLD() function.

    Jeff Davis <pgsql@j-davis.com> — 2025-01-18T00:34:43Z

    On Fri, 2025-01-10 at 16:27 -0800, Jeff Davis wrote:
    > New patch series attached.
    
    v5 attached.
    
    This version is rebased over the Full Case Mapping support, and
    supports Default Case Folding when using the PG_UNICODE_FAST collation.
    
    That means that "ẞ", "ß", "SS", "Ss", and "ss" all fold to "ss"; and
    "Σ", "σ", and "ς" all fold to "σ".
    
    CASEFOLD() is better (according to Unicode, anyway) than LOWER() for
    caseless matching, or in an expression index to enforce case-
    insensitive uniqueness without relying on ICU.
    
    Additionally, the infrastructure in this patch (as well as 286a365b9c)
    can be used in the future for better case-insensitive pattern matching,
    or casefolding identifiers in the parser without relying on libc.
    
    I feel this is about ready for commit. The main point of discussion was
    whether CASEFOLD() would do normalization, and if so, what the SQL API
    would look like. I concluded upthread that it was unnecessary to meet
    the Unicode Default Case Folding behavior, and we should just leave
    normalization as a separate process. If someone disagrees with
    reasoning, please let me know.
    
    Regards,
    	Jeff Davis
    
    [1]
    https://www.postgresql.org/message-id/610a56de2bd958e96c149ca60420db30e7d51588.camel%40j-davis.com
    
  14. Re: Add CASEFOLD() function.

    Jeff Davis <pgsql@j-davis.com> — 2025-01-23T19:28:58Z

    On Fri, 2025-01-17 at 16:34 -0800, Jeff Davis wrote:
    > v5 attached.
    
    v6 attached. I plan to commit this soon.
    
    A couple things to note:
    
    * The ICU API for lower/title/uppercasing is slightly different from
    folding. The former accept a locale, while the latter just has an
    option which is relevant only to languages 'az' and 'tr'. So the patch
    checks for those two languages to enable the option, so that ICU is
    consistently locale-aware for all the functions. I also added ICU
    tests.
    
    * I'm leaving out the normalization, which is not required for Unicode
    Default Caseless Matchinng, as mentioned in the last email. That
    simplifies the SQL function as well as the implementation. There was
    some discussion on normalization upthread.
    
    Regards,
    	Jeff Davis
    
    
  15. Re: Add CASEFOLD() function.

    Tom Lane <tgl@sss.pgh.pa.us> — 2025-01-25T05:00:22Z

    Jeff Davis <pgsql@j-davis.com> writes:
    > v6 attached. I plan to commit this soon.
    
    The documentation for this function is giving the PDF docs build
    indigestion:
    
    [WARN] FOUserAgent - Glyph "?" (0x3a3, Sigma) not available in font "Courier".
    [WARN] FOUserAgent - Glyph "?" (0x3c3, sigma) not available in font "Courier".
    [WARN] FOUserAgent - Glyph "?" (0x3c2, sigma1) not available in font "Courier".
    
    Found characters that cannot be output in the PDF document;  see README.non-ASCII
    
    Not sure about a good workaround for this.  Are there any characters
    within LATIN-1 that have interesting case-folding behavior?
    
    			regards, tom lane
    
    
    
    
  16. Re: Add CASEFOLD() function.

    Jeff Davis <pgsql@j-davis.com> — 2025-01-25T08:20:06Z

    On Sat, 2025-01-25 at 00:00 -0500, Tom Lane wrote:
    > Found characters that cannot be output in the PDF document;  see
    > README.non-ASCII
    
    Thank you, fixed.
    
    > Not sure about a good workaround for this.  Are there any characters
    > within LATIN-1 that have interesting case-folding behavior?
    
    I just removed that example. There's already another example using ß
    (U+00DF), though that only applies to PG_UNICODE_FAST (the new
    collation that performs full case mapping and now full case folding).
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  17. Re: Add CASEFOLD() function.

    Vik Fearing <vik@postgresfriends.org> — 2025-06-17T15:37:37Z

    On 16/12/2024 18:49, Jeff Davis wrote:
    > One question I have is whether we want this function to normalize the
    > output.
    
    
    Yes, we do.
    
    
    I am sorry that I am so late to the party, but I am currently writing 
    the Change Proposal for the SQL Standard for this function.
    
    
    For <fold> (which includes LOWER() and UPPER()), the text says in 
    Section 6.35 GR 7.e:
    
    
    If the character set of <character factor> is UTF8, UTF16, or UTF32, 
    then FR is replaced by
         Case:
             i) If the <search condition> S IS NORMALIZED evaluates to True, 
    then NORMALIZE (FR)
             ii) Otherwise, FR.
    
    
    Here, FR is the result of the function and S is its argument.
    
    
    It does not appear to me that our LOWER and UPPER functions obey this 
    rule, so there is a valid argument that we should continue to ignore it. 
    Or, we can say that we have at least one of three compliant.
    
    -- 
    
    Vik Fearing
    
  18. Re: Add CASEFOLD() function.

    Jeff Davis <pgsql@j-davis.com> — 2025-06-17T18:14:58Z

    On Tue, 2025-06-17 at 17:37 +0200, Vik Fearing wrote:
    > If the character set of <character factor> is UTF8, UTF16, or UTF32,
    > then FR is replaced by
    >      Case:
    >          i) If the <search condition> S IS NORMALIZED evaluates to
    > True, then NORMALIZE (FR)
    >          ii) Otherwise, FR.
    
    I read that as "if the input is normalized, then the output should be
    normalized", IOW preserve the normalization. But does it mean "preserve
    whatever the input normal form is" or "preserve NFC if the input is
    NFC, otherwise the normalization is undefined"?
    
    The above wording seems to mean "preserve NFC if the input is NFC",
    because that's what NORMALIZE(FR) does when the normal form is
    unspecified.
    
    > It does not appear to me that our LOWER and UPPER functions obey this
    > rule,
    
    You are correct:
    
       WITH s(t) AS
       (SELECT NORMALIZE(U&'\00C1\00DF\0301' COLLATE "en-US-x-icu"))
       SELECT UPPER(t) = NORMALIZE(UPPER(t)) FROM s;
        ?column? 
       ----------
        f
    
    >  so there is a valid argument that we should continue to ignore it.
    > Or, we can say that we have at least one of three compliant.
    
    What do other databases do?
    
    Given how costly normalization can be, imposing that on every caller
    seems like a bit much. And favoring NFC for the user unconditionally
    might not be the best thing. Then again, NFC is good most of the time,
    and there are patches to speed up normalization.
    
    I tend to think that a lot of users who want casefolding would also
    want normalization, but it's hard to weigh that against the performance
    cost. It might not matter outside of a few edge cases, though I'm not
    sure exactly how many.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  19. Re: Add CASEFOLD() function.

    Vik Fearing <vik@postgresfriends.org> — 2025-06-18T17:09:04Z

    On 17/06/2025 20:14, Jeff Davis wrote:
    > On Tue, 2025-06-17 at 17:37 +0200, Vik Fearing wrote:
    >> If the character set of <character factor> is UTF8, UTF16, or UTF32,
    >> then FR is replaced by
    >>       Case:
    >>           i) If the <search condition> S IS NORMALIZED evaluates to
    >> True, then NORMALIZE (FR)
    >>           ii) Otherwise, FR.
    > I read that as "if the input is normalized, then the output should be
    > normalized", IOW preserve the normalization. But does it mean "preserve
    > whatever the input normal form is" or "preserve NFC if the input is
    > NFC, otherwise the normalization is undefined"?
    >
    > The above wording seems to mean "preserve NFC if the input is NFC",
    > because that's what NORMALIZE(FR) does when the normal form is
    > unspecified.
    
    
    Yes, and that is also the default for <normalized predicate>.
    
    
    >> It does not appear to me that our LOWER and UPPER functions obey this
    >> rule,
    > You are correct:
    >
    >     WITH s(t) AS
    >     (SELECT NORMALIZE(U&'\00C1\00DF\0301' COLLATE "en-US-x-icu"))
    >     SELECT UPPER(t) = NORMALIZE(UPPER(t)) FROM s;
    >      ?column?
    >     ----------
    >      f
    >
    >>   so there is a valid argument that we should continue to ignore it.
    >> Or, we can say that we have at least one of three compliant.
    > What do other databases do?
    
    
    I don't know.  I am just pointing out what the Standard says.  I think 
    we should either comply, or say that we don't do it for LOWER and UPPER 
    so let's keep things implementation-consistent.
    
    
    > Given how costly normalization can be, imposing that on every caller
    > seems like a bit much.
    
    
    How much does it cost to check for NFC?  I honestly don't know the 
    answer to that question, but that is the only case where we need to 
    maintain normalization.
    
    
    > And favoring NFC for the user unconditionally
    > might not be the best thing. Then again, NFC is good most of the time,
    > and there are patches to speed up normalization.
    
    
    It's not unconditionally, it's only if the input was NFC.
    
    
    > I tend to think that a lot of users who want casefolding would also
    > want normalization, but it's hard to weigh that against the performance
    > cost. It might not matter outside of a few edge cases, though I'm not
    > sure exactly how many.
    
    
    I defer to you and others in the thread to make this decision.
    
    -- 
    
    Vik Fearing
    
    
    
    
    
    
    
  20. Re: Add CASEFOLD() function.

    Jeff Davis <pgsql@j-davis.com> — 2025-06-19T02:53:01Z

    On Wed, 2025-06-18 at 19:09 +0200, Vik Fearing wrote:
    > I don't know.  I am just pointing out what the Standard says.  I
    > think 
    > we should either comply, or say that we don't do it for LOWER and
    > UPPER 
    > so let's keep things implementation-consistent.
    
    For the standard, I see two potential philosophies:
    
    I. CASEFOLD() is another variant of LOWER()/UPPER(), and it should
    preserve NFC in the same way.
    
    II. CASEFOLD() is not like LOWER()/UPPER(); it returns a semi-opaque
    text value that is useful for caseless matching, but should not
    ordinarily be used for display or sent to the application (those things
    would be allowed, just not encouraged). For normalization, either:
      (A) Follow Unicode Default Caseless Matching (16.0 3.13.5 D144), and
    don't require any kind of normalization; or
      (B) Follow Unicode Canonical Caseless Matching (D145), and require
    that the input and output are normalized appropriately, but leave the
    precise normal form as implementation-defined.
    
    
    The current implementation could either be seen as philosophy (I) where
    we've chosen to ignore the normalization part for the sake of
    consistency with LOWER()/UPPER(); or it could be seen as philosophy
    (II)(A).
    
    > How much does it cost to check for NFC?  I honestly don't know the 
    > answer to that question, but that is the only case where we need to 
    > maintain normalization.
    
    I attached a very rough patch and ran a very simple test on strings
    averaging 36 bytes in length, all already in NFC and the result is also
    NFC. Before the patch, doing a CASEFOLD() on 10M tuples took about 3
    seconds, afterward about 8.
    
    There's a patch to optimize some of the normalization paths, which I
    haven't had a chance to review yet. So those numbers might come down. 
    
    > 
    > It's not unconditionally, it's only if the input was NFC.
    
    Optimizing the case where the input is _not_ NFC seems strange to me.
    If we are normalizing the output, I'd say we should just make the
    output always NFC. Being more strict, this seems likely to comply with
    the eventual standard.
    
    Additionally, if we are normalizing the output, then we should also do
    the input fixup for U+0345, which would make the result usable for
    Canonical Caseless Matching. Again, this seems likely to comply with
    the eventual standard.
    
    > 
    
    So I only see two reasonable implementations:
    
    1. The current CASEFOLD() implementation.
    
    2. Do the input fixup for U+0345 and unconditionally normalize the
    output in NFC.
    
    If there's a case to be made for both implementations, we could also
    consider having two functions, say, CASEFOLD() for #1 and NCASEFOLD()
    for #2. I'm not sure whether we'd want to standardize one or both of
    those functions.
    
    And if you think there's likely to be a collision with the standard
    that's hard to anticipate and fix now, then we should consider
    reverting CASEFOLD() for 18 and wait for more progress on the
    standardization. What's the likelihood that the name changes or
    something like that?
    
    Regards,
    	Jeff Davis
    
    
  21. Re: Add CASEFOLD() function.

    Thom Brown <thom@linux.com> — 2025-06-19T04:03:35Z

    On Thu, 19 Jun 2025, 03:53 Jeff Davis, <pgsql@j-davis.com> wrote:
    
    > On Wed, 2025-06-18 at 19:09 +0200, Vik Fearing wrote:
    > > I don't know.  I am just pointing out what the Standard says.  I
    > > think
    > > we should either comply, or say that we don't do it for LOWER and
    > > UPPER
    > > so let's keep things implementation-consistent.
    >
    > For the standard, I see two potential philosophies:
    >
    > I. CASEFOLD() is another variant of LOWER()/UPPER(), and it should
    > preserve NFC in the same way.
    >
    > II. CASEFOLD() is not like LOWER()/UPPER(); it returns a semi-opaque
    > text value that is useful for caseless matching, but should not
    > ordinarily be used for display or sent to the application (those things
    > would be allowed, just not encouraged). For normalization, either:
    >   (A) Follow Unicode Default Caseless Matching (16.0 3.13.5 D144), and
    > don't require any kind of normalization; or
    >   (B) Follow Unicode Canonical Caseless Matching (D145), and require
    > that the input and output are normalized appropriately, but leave the
    > precise normal form as implementation-defined.
    >
    >
    > The current implementation could either be seen as philosophy (I) where
    > we've chosen to ignore the normalization part for the sake of
    > consistency with LOWER()/UPPER(); or it could be seen as philosophy
    > (II)(A).
    >
    > > How much does it cost to check for NFC?  I honestly don't know the
    > > answer to that question, but that is the only case where we need to
    > > maintain normalization.
    >
    > I attached a very rough patch and ran a very simple test on strings
    > averaging 36 bytes in length, all already in NFC and the result is also
    > NFC. Before the patch, doing a CASEFOLD() on 10M tuples took about 3
    > seconds, afterward about 8.
    >
    > There's a patch to optimize some of the normalization paths, which I
    > haven't had a chance to review yet. So those numbers might come down.
    >
    > >
    > > It's not unconditionally, it's only if the input was NFC.
    >
    > Optimizing the case where the input is _not_ NFC seems strange to me.
    > If we are normalizing the output, I'd say we should just make the
    > output always NFC. Being more strict, this seems likely to comply with
    > the eventual standard.
    >
    > Additionally, if we are normalizing the output, then we should also do
    > the input fixup for U+0345, which would make the result usable for
    > Canonical Caseless Matching. Again, this seems likely to comply with
    > the eventual standard.
    >
    > >
    >
    > So I only see two reasonable implementations:
    >
    > 1. The current CASEFOLD() implementation.
    >
    > 2. Do the input fixup for U+0345 and unconditionally normalize the
    > output in NFC.
    >
    > If there's a case to be made for both implementations, we could also
    > consider having two functions, say, CASEFOLD() for #1 and NCASEFOLD()
    > for #2. I'm not sure whether we'd want to standardize one or both of
    > those functions.
    >
    > And if you think there's likely to be a collision with the standard
    > that's hard to anticipate and fix now, then we should consider
    > reverting CASEFOLD() for 18 and wait for more progress on the
    > standardization. What's the likelihood that the name changes or
    > something like that?
    >
    
    Late to the party, but is there an argument for porting this to the citext
    type? Or supplementing the extension with an additional type ("cftext"?
    *shrug*). It currently uses lower(), so our current recommendation for
    dealing with all unicode characters is to use nondeterministic collations.
    
    Thom
    
    >
    
  22. Re: Add CASEFOLD() function.

    Jeff Davis <pgsql@j-davis.com> — 2025-06-19T04:26:23Z

    On Thu, 2025-06-19 at 05:03 +0100, Thom Brown wrote:
    > Late to the party, but is there an argument for porting this to the
    > citext type? Or supplementing the extension with an additional type
    > ("cftext"? *shrug*).
    
    CASEFOLD() addresses a lot of the problems with using LOWER(), so that
    sounds like a good idea. I'd be interested to hear from users of
    citext.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  23. Re: Add CASEFOLD() function.

    Peter Eisentraut <peter@eisentraut.org> — 2025-06-19T14:47:10Z

    On 17.06.25 17:37, Vik Fearing wrote:
    > For <fold> (which includes LOWER() and UPPER()), the text says in 
    > Section 6.35 GR 7.e:
    > 
    > 
    > If the character set of <character factor> is UTF8, UTF16, or UTF32, 
    > then FR is replaced by
    >      Case:
    >          i) If the <search condition> S IS NORMALIZED evaluates to True, 
    > then NORMALIZE (FR)
    >          ii) Otherwise, FR.
    > 
    > 
    > Here, FR is the result of the function and S is its argument.
    > 
    > 
    > It does not appear to me that our LOWER and UPPER functions obey this 
    > rule, so there is a valid argument that we should continue to ignore it. 
    > Or, we can say that we have at least one of three compliant.
    
    The SQL standard also says in a few other places that normalization 
    should be applied, and we do none of those, so this is probably not a 
    reason to change CASEFOLD at this point.
    
    
    
    
    
  24. Re: Add CASEFOLD() function.

    Peter Eisentraut <peter@eisentraut.org> — 2025-06-19T14:51:02Z

    On 19.06.25 06:03, Thom Brown wrote:
    > Late to the party, but is there an argument for porting this to the 
    > citext type? Or supplementing the extension with an additional type 
    > ("cftext"? *shrug*). It currently uses lower(), so our current 
    > recommendation for dealing with all unicode characters is to use 
    > nondeterministic collations.
    
    What is the motivation for wanting a citext variant instead of using 
    nondeterministic collations?
    
    
    
    
    
  25. Re: Add CASEFOLD() function.

    Thom Brown <thom@linux.com> — 2025-06-19T15:36:29Z

    On Thu, 19 Jun 2025 at 15:51, Peter Eisentraut <peter@eisentraut.org> wrote:
    >
    > On 19.06.25 06:03, Thom Brown wrote:
    > > Late to the party, but is there an argument for porting this to the
    > > citext type? Or supplementing the extension with an additional type
    > > ("cftext"? *shrug*). It currently uses lower(), so our current
    > > recommendation for dealing with all unicode characters is to use
    > > nondeterministic collations.
    >
    > What is the motivation for wanting a citext variant instead of using
    > nondeterministic collations?
    
    Ease of use, perhaps. It seems easier to use:
    
    column_name cftext
    
    rather than:
    
    CREATE COLLATION case_insensitive_collation (
        PROVIDER = icu,
        LOCALE = 'und-u-ks-level2',
        DETERMINISTIC = FALSE
    );
    
    column_name text COLLATE case_insensitive_collation
    
    But I see the arguments against it. It creates an unnecessary
    dependency on an extension, and if someone wants to ignore both case
    and accents, they may resort to using 2 extensions (citext + unaccent)
    when none are needed. I guess I don't feel strongly about it either
    way.
    
    Thom
    
    
    
    
  26. Re: Add CASEFOLD() function.

    Robert Treat <rob@xzilla.net> — 2025-06-19T16:15:48Z

    On Thu, Jun 19, 2025 at 11:37 AM Thom Brown <thom@linux.com> wrote:
    > On Thu, 19 Jun 2025 at 15:51, Peter Eisentraut <peter@eisentraut.org> wrote:
    > > On 19.06.25 06:03, Thom Brown wrote:
    > > > Late to the party, but is there an argument for porting this to the
    > > > citext type? Or supplementing the extension with an additional type
    > > > ("cftext"? *shrug*). It currently uses lower(), so our current
    > > > recommendation for dealing with all unicode characters is to use
    > > > nondeterministic collations.
    > >
    > > What is the motivation for wanting a citext variant instead of using
    > > nondeterministic collations?
    >
    > Ease of use, perhaps. It seems easier to use:
    >
    > column_name cftext
    >
    > rather than:
    >
    > CREATE COLLATION case_insensitive_collation (
    >     PROVIDER = icu,
    >     LOCALE = 'und-u-ks-level2',
    >     DETERMINISTIC = FALSE
    > );
    >
    > column_name text COLLATE case_insensitive_collation
    >
    > But I see the arguments against it. It creates an unnecessary
    > dependency on an extension, and if someone wants to ignore both case
    > and accents, they may resort to using 2 extensions (citext + unaccent)
    > when none are needed. I guess I don't feel strongly about it either
    > way.
    
    Don't forget, if you have a defined insensitive / normalized
    collations, you also enable on-the-fly collation based matching, a la
    "SELECT 'Å' = 'A' COLLATE ignore_accent_case;" regardless of the
    provided collations (which I think is much more common certain in
    other databases)
    
    Robert Treat
    https://xzilla.net
    
    
    
    
  27. Re: Add CASEFOLD() function.

    Vik Fearing <vik@postgresfriends.org> — 2025-06-19T16:21:09Z

    On 19/06/2025 16:47, Peter Eisentraut wrote:
    > On 17.06.25 17:37, Vik Fearing wrote:
    >> For <fold> (which includes LOWER() and UPPER()), the text says in 
    >> Section 6.35 GR 7.e:
    >>
    >>
    >> If the character set of <character factor> is UTF8, UTF16, or UTF32, 
    >> then FR is replaced by
    >>      Case:
    >>          i) If the <search condition> S IS NORMALIZED evaluates to 
    >> True, then NORMALIZE (FR)
    >>          ii) Otherwise, FR.
    >>
    >>
    >> Here, FR is the result of the function and S is its argument.
    >>
    >>
    >> It does not appear to me that our LOWER and UPPER functions obey this 
    >> rule, so there is a valid argument that we should continue to ignore 
    >> it. Or, we can say that we have at least one of three compliant.
    >
    > The SQL standard also says in a few other places that normalization 
    > should be applied, and we do none of those, so this is probably not a 
    > reason to change CASEFOLD at this point.
    >
    
    Works for me.
    
    -- 
    
    Vik Fearing.
    
    
    
    
    
  28. Re: Add CASEFOLD() function.

    Jeff Davis <pgsql@j-davis.com> — 2025-06-19T16:33:41Z

    On Thu, 2025-06-19 at 16:36 +0100, Thom Brown wrote:
    > Ease of use, perhaps. It seems easier to use:
    > 
    > column_name cftext
    > 
    > rather than:
    > 
    > CREATE COLLATION case_insensitive_collation (
    >     PROVIDER = icu,
    >     LOCALE = 'und-u-ks-level2',
    >     DETERMINISTIC = FALSE
    > );
    
    We could auto-create such a collation at initdb time for ICU-enabled
    builds.
    
    > But I see the arguments against it. It creates an unnecessary
    > dependency on an extension, and if someone wants to ignore both case
    > and accents, they may resort to using 2 extensions (citext +
    > unaccent)
    > when none are needed.
    
    There are at least three ways to do case insensitivity (or other kinds
    of equivalence):
    
    * Explicit function calls in queries, as well as index and constraint
    definitions. E.g. expression index on LOWER(), queries that explicitly
    do "LOWER(x) = ..."
    
    * Wrap those function calls up in a separate data type, like citext.
    
    * Non-deterministic collations.
    
    Given that we have collations, which are a way of organizing alternate
    behaviors for existing data types, I'm not sure I see the need for
    creating an entirely separate data type.
    
    > I guess I don't feel strongly about it either
    > way.
    
    Are you a user of citext? I'm genuinely interested in the use cases,
    and whether the separate-data-type approach has merits that are missing
    in the other approaches.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  29. Re: Add CASEFOLD() function.

    Robert Treat <rob@xzilla.net> — 2025-06-19T16:51:05Z

    On Thu, Jun 19, 2025 at 12:33 PM Jeff Davis <pgsql@j-davis.com> wrote:
    >
    > On Thu, 2025-06-19 at 16:36 +0100, Thom Brown wrote:
    > > Ease of use, perhaps. It seems easier to use:
    > >
    > > column_name cftext
    > >
    > > rather than:
    > >
    > > CREATE COLLATION case_insensitive_collation (
    > >     PROVIDER = icu,
    > >     LOCALE = 'und-u-ks-level2',
    > >     DETERMINISTIC = FALSE
    > > );
    >
    > We could auto-create such a collation at initdb time for ICU-enabled
    > builds.
    >
    
    Providing a generic insensitive/non-deterministic collation by default
    would solve a number of different use cases, so +1 on the idea from
    me.
    And TBH I usually build --without-icu but this would likely cause me
    to change that.
    
    > > But I see the arguments against it. It creates an unnecessary
    > > dependency on an extension, and if someone wants to ignore both case
    > > and accents, they may resort to using 2 extensions (citext +
    > > unaccent)
    > > when none are needed.
    >
    > There are at least three ways to do case insensitivity (or other kinds
    > of equivalence):
    >
    > * Explicit function calls in queries, as well as index and constraint
    > definitions. E.g. expression index on LOWER(), queries that explicitly
    > do "LOWER(x) = ..."
    >
    > * Wrap those function calls up in a separate data type, like citext.
    >
    > * Non-deterministic collations.
    >
    > Given that we have collations, which are a way of organizing alternate
    > behaviors for existing data types, I'm not sure I see the need for
    > creating an entirely separate data type.
    >
    > > I guess I don't feel strongly about it either
    > > way.
    >
    > Are you a user of citext? I'm genuinely interested in the use cases,
    > and whether the separate-data-type approach has merits that are missing
    > in the other approaches.
    >
    
    Yeah, I'd be interested to hear if there is some missing bit that
    existing users have concerns over; as a former user of citext, it was
    a great workaround at the time, but there are "better ways" to handle
    those things now (imho).
    
    
    Robert Treat
    https://xzilla.net
    
    
    
    
  30. Re: Add CASEFOLD() function.

    Jeff Davis <pgsql@j-davis.com> — 2025-06-19T16:52:47Z

    On Thu, 2025-06-19 at 18:21 +0200, Vik Fearing wrote:
    > > 
    > > The SQL standard also says in a few other places that normalization
    > > should be applied, and we do none of those, so this is probably not
    > > a 
    > > reason to change CASEFOLD at this point.
    > > 
    > 
    > Works for me.
    
    Sounds good. We can document compatibility notes around this point.
    
    If normalization becomes important, we can take the time to work out
    the performance implications more carefully, and potentially introduce
    an NCASEFOLD() if needed.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  31. Re: Add CASEFOLD() function.

    Thom Brown <thom@linux.com> — 2025-06-19T16:59:08Z

    On Thu, 19 Jun 2025, 17:33 Jeff Davis, <pgsql@j-davis.com> wrote:
    
    > On Thu, 2025-06-19 at 16:36 +0100, Thom Brown wrote:
    > > Ease of use, perhaps. It seems easier to use:
    > >
    > > column_name cftext
    > >
    > > rather than:
    > >
    > > CREATE COLLATION case_insensitive_collation (
    > >     PROVIDER = icu,
    > >     LOCALE = 'und-u-ks-level2',
    > >     DETERMINISTIC = FALSE
    > > );
    >
    > We could auto-create such a collation at initdb time for ICU-enabled
    > builds.
    >
    > > But I see the arguments against it. It creates an unnecessary
    > > dependency on an extension, and if someone wants to ignore both case
    > > and accents, they may resort to using 2 extensions (citext +
    > > unaccent)
    > > when none are needed.
    >
    > There are at least three ways to do case insensitivity (or other kinds
    > of equivalence):
    >
    > * Explicit function calls in queries, as well as index and constraint
    > definitions. E.g. expression index on LOWER(), queries that explicitly
    > do "LOWER(x) = ..."
    >
    > * Wrap those function calls up in a separate data type, like citext.
    >
    > * Non-deterministic collations.
    >
    > Given that we have collations, which are a way of organizing alternate
    > behaviors for existing data types, I'm not sure I see the need for
    > creating an entirely separate data type.
    >
    > > I guess I don't feel strongly about it either
    > > way.
    >
    > Are you a user of citext? I'm genuinely interested in the use cases,
    > and whether the separate-data-type approach has merits that are missing
    > in the other approaches.
    >
    
    No. But given the options, I would personally choose nondeterministic
    collations now that they are available. I just wish they were more
    user-friendly as I suspect the majority of people either won't know about
    them, or won't know how to use them. But like you say, maybe having a set
    of predefined nd-collections would help. As it stands, I'm just bringing up
    the consideration of citext in case it has any value, which it doesn't
    appear to. In fact it's probably even an argument to begin the process of
    deprecation.
    
    Thom
    
    >
    
  32. Re: Add CASEFOLD() function.

    David E. Wheeler <david@justatheory.com> — 2025-06-19T17:38:54Z

    On Jun 19, 2025, at 12:59, Thom Brown <thom@linux.com> wrote:
    
    > No. But given the options, I would personally choose nondeterministic collations now that they are available. I just wish they were more user-friendly as I suspect the majority of people either won't know about them, or won't know how to use them.
    
    I suspect there are a lot of uses of citext for databases created before nondeterministic collations existed and people are unaware of them or unclear on the migration path from one to the other, let alone implications for any infrastructure they built around cutest (like function signatures and return values). As long as citext conteinues to be maintained there and there’s no super clear path to migrate, I’d bet good money that few would bother to switch.
    
    Best,
    
    David
    
    
    
  33. Re: Add CASEFOLD() function.

    Thom Brown <thom@linux.com> — 2025-06-19T18:55:43Z

    On Thu, 19 Jun 2025 at 18:39, David E. Wheeler <david@justatheory.com> wrote:
    >
    > On Jun 19, 2025, at 12:59, Thom Brown <thom@linux.com> wrote:
    >
    > > No. But given the options, I would personally choose nondeterministic collations now that they are available. I just wish they were more user-friendly as I suspect the majority of people either won't know about them, or won't know how to use them.
    >
    > I suspect there are a lot of uses of citext for databases created before nondeterministic collations existed and people are unaware of them or unclear on the migration path from one to the other, let alone implications for any infrastructure they built around cutest (like function signatures and return values). As long as citext conteinues to be maintained there and there’s no super clear path to migrate, I’d bet good money that few would bother to switch.
    
    Maybe the citext doc page should explain how to get unhooked from it.
    Something like:
    
    ALTER TABLE mytable
      ALTER COLUMN ci_column
      SET DATA TYPE TEXT COLLATE case_insensitive_collation;
    
    or
    
    CREATE DOMAIN ci_text AS text
      COLLATE case_insensitive_collation;
    
    ALTER TABLE mytable
      ALTER COLUMN ci_column
      SET DATA TYPE ci_text;
    
    And because they're binary-compatible, they should also be free. No
    doubt a procedure could do this to every instance in the database,
    although I guess it gets trickier when it comes to functions that
    accept citext as a parameter type, and other similar examples.
    
    Thom