Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. fuzzystrmatch: use pg_ascii_toupper().

  2. Avoid global LC_CTYPE dependency in pg_locale_icu.c.

  3. downcase_identifier(): use method table from locale provider.

  4. ltree: fix case-insensitive matching.

  5. Fix multibyte issue in ltree_strncasecmp().

  6. Use multibyte-aware extraction of pattern prefixes.

  7. Add pg_iswcased().

  8. Remove char_tolower() API.

  9. Make regex "max_chr" depend on encoding, not provider.

  10. Change some callers to use pg_ascii_toupper().

  11. Allow pg_locale_t APIs to work when ctype_is_c.

  12. Add #define for UNICODE_CASEMAP_BUFSZ.

  13. Inline pg_ascii_tolower() and pg_ascii_toupper().

  14. Avoid global LC_CTYPE dependency in pg_locale_libc.c.

  15. Force LC_COLLATE to C in postmaster.

  16. Change wchar2char() and char2wchar() to accept a locale_t.

  17. Use pg_ascii_tolower()/pg_ascii_toupper() where appropriate.

  18. inet_net_pton.c: use pg_ascii_tolower() rather than tolower().

  19. isn.c: use pg_ascii_toupper() instead of toupper().

  20. contrib/spi/refint.c: use pg_ascii_tolower() instead.

  21. copyfromparse.c: use pg_ascii_tolower() rather than tolower().

  22. Revert "Tidy up locale thread safety in ECPG library."

  23. Tidy up locale thread safety in ECPG library.

  24. All supported systems have locale_t.

  1. Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2024-08-06T21:59:55Z

    After some previous work here:
    
    https://postgr.es/m/89475ee5487d795124f4e25118ea8f1853edb8cb.camel@j-davis.com
    
    we are less dependent on setlocale(), but it's still not completely
    gone.
    
    setlocale() counts as thread-unsafe, so it would be nice to eliminate
    it completely.
    
    The obvious answer is uselocale(), which sets the locale only for the
    calling thread, and takes precedence over whatever is set with
    setlocale().
    
    But there are a couple problems:
    
    1. I don't think it's supported on Windows.
    
    2. I don't see a good way to canonicalize a locale name, like in
    check_locale(), which uses the result of setlocale().
    
    Thoughts?
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  2. Re: Remaining dependency on setlocale()

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-08-06T22:23:09Z

    Jeff Davis <pgsql@j-davis.com> writes:
    > But there are a couple problems:
    
    > 1. I don't think it's supported on Windows.
    
    Can't help with that, but surely Windows has some thread-safe way.
    
    > 2. I don't see a good way to canonicalize a locale name, like in
    > check_locale(), which uses the result of setlocale().
    
    What I can tell you about that is that check_locale's expectation
    that setlocale does any useful canonicalization is mostly wishful
    thinking [1].  On a lot of platforms you just get the input string
    back again.  If that's the only thing keeping us on setlocale,
    I think we could drop it.  (Perhaps we should do some canonicalization
    of our own instead?)
    
    			regards, tom lane
    
    [1] https://www.postgresql.org/message-id/14856.1348497531@sss.pgh.pa.us
    
    
    
    
  3. Re: Remaining dependency on setlocale()

    Thomas Munro <thomas.munro@gmail.com> — 2024-08-07T07:07:40Z

    On Wed, Aug 7, 2024 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Jeff Davis <pgsql@j-davis.com> writes:
    > > But there are a couple problems:
    >
    > > 1. I don't think it's supported on Windows.
    >
    > Can't help with that, but surely Windows has some thread-safe way.
    
    It does.  It's not exactly the same, instead there is a thing you can
    call that puts setlocale() itself into a thread-local mode, but last
    time I checked that mode was missing on MinGW so that's a bit of an
    obstacle.
    
    How far can we get by using more _l() functions?  For example, [1]
    shows a use of strftime() that I think can be converted to
    strftime_l() so that it doesn't depend on setlocale().  Since POSIX
    doesn't specify every obvious _l function, we might need to provide
    any missing wrappers that save/restore thread-locally with
    uselocale().  Windows doesn't have uselocale(), but it generally
    doesn't need such wrappers because it does have most of the obvious
    _l() functions.
    
    > > 2. I don't see a good way to canonicalize a locale name, like in
    > > check_locale(), which uses the result of setlocale().
    >
    > What I can tell you about that is that check_locale's expectation
    > that setlocale does any useful canonicalization is mostly wishful
    > thinking [1].  On a lot of platforms you just get the input string
    > back again.  If that's the only thing keeping us on setlocale,
    > I think we could drop it.  (Perhaps we should do some canonicalization
    > of our own instead?)
    
    +1
    
    I know it does something on Windows (we know the EDB installer gives
    it strings like "Language,Country" and it converts them to
    "Language_Country.Encoding", see various threads about it all going
    wrong), but I'm not sure it does anything we actually want to
    encourage.  I'm hoping we can gradually screw it down so that we only
    have sane BCP 47 in the system on that OS, and I don't see why we
    wouldn't just use them verbatim.
    
    [1] https://www.postgresql.org/message-id/CA%2BhUKGJ%3Dca39Cg%3Dy%3DS89EaCYvvCF8NrZRO%3Duog-cnz0VzC6Kfg%40mail.gmail.com
    
    
    
    
  4. Re: Remaining dependency on setlocale()

    Joe Conway <mail@joeconway.com> — 2024-08-07T13:42:32Z

    On 8/7/24 03:07, Thomas Munro wrote:
    > How far can we get by using more _l() functions?  For example, [1]
    > shows a use of strftime() that I think can be converted to
    > strftime_l() so that it doesn't depend on setlocale().  Since POSIX
    > doesn't specify every obvious _l function, we might need to provide
    > any missing wrappers that save/restore thread-locally with
    > uselocale().  Windows doesn't have uselocale(), but it generally
    > doesn't need such wrappers because it does have most of the obvious
    > _l() functions.
    
    Most of the strtoX functions have an _l variant, but one to watch is 
    atoi, which is defined with a hardcoded call to strtol, at least with glibc:
    
    8<----------
    /* Convert a string to an int.  */
    int
    atoi (const char *nptr)
    {
       return (int) strtol (nptr, (char **) NULL, 10);
    }
    8<----------
    
    I guess in many/most places we use atoi we don't care, but maybe it 
    matters for some?
    
    -- 
    Joe Conway
    PostgreSQL Contributors Team
    RDS Open Source Databases
    Amazon Web Services: https://aws.amazon.com
    
    
    
    
  5. Re: Remaining dependency on setlocale()

    Robert Haas <robertmhaas@gmail.com> — 2024-08-07T15:28:06Z

    On Wed, Aug 7, 2024 at 9:42 AM Joe Conway <mail@joeconway.com> wrote:
    > I guess in many/most places we use atoi we don't care, but maybe it
    > matters for some?
    
    I think we should move in the direction of replacing atoi() calls with
    strtol() and actually checking for errors. In many places where use
    atoi(), it's unlikely that the string would be anything but an
    integer, so error checks are arguably unnecessary. A backup label file
    isn't likely to say "START TIMELINE: potaytoes". On the other hand, if
    it did say that, I'd prefer to get an error about potaytoes than have
    it be treated as if it said "START TIMELINE: 0". And I've definitely
    found missing error-checks over the years. For example, on pg14,
    "pg_basebackup -Ft -Zmaximum -Dx" works as if you specified "-Z0"
    because atoi("maximum") == 0. If we make a practice of checking
    integer conversions for errors everywhere, we might avoid some such
    silliness.
    
    -- 
    Robert Haas
    EDB: http://www.enterprisedb.com
    
    
    
    
  6. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2024-08-07T17:16:09Z

    On Wed, 2024-08-07 at 19:07 +1200, Thomas Munro wrote:
    > How far can we get by using more _l() functions?
    
    There are a ton of calls to, for example, isspace(), used mostly for
    parsing.
    
    I wouldn't expect a lot of differences in behavior from locale to
    locale, like might be the case with iswspace(), but behavior can be
    different at least in theory.
    
    So I guess we're stuck with setlocale()/uselocale() for a while, unless
    we're able to move most of those call sites over to an ascii-only
    variant.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  7. Re: Remaining dependency on setlocale()

    Joe Conway <mail@joeconway.com> — 2024-08-07T17:28:45Z

    On 8/7/24 13:16, Jeff Davis wrote:
    > On Wed, 2024-08-07 at 19:07 +1200, Thomas Munro wrote:
    >> How far can we get by using more _l() functions?
    > 
    > There are a ton of calls to, for example, isspace(), used mostly for
    > parsing.
    > 
    > I wouldn't expect a lot of differences in behavior from locale to
    > locale, like might be the case with iswspace(), but behavior can be
    > different at least in theory.
    > 
    > So I guess we're stuck with setlocale()/uselocale() for a while, unless
    > we're able to move most of those call sites over to an ascii-only
    > variant.
    
    FWIW I see all of these in glibc:
    
    isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l, 
    isgraph_l,  islower_l, isprint_l, ispunct_l, isspace_l, isupper_l, 
    isxdigit_l
    
    
    -- 
    Joe Conway
    PostgreSQL Contributors Team
    RDS Open Source Databases
    Amazon Web Services: https://aws.amazon.com
    
    
    
    
  8. Re: Remaining dependency on setlocale()

    Robert Haas <robertmhaas@gmail.com> — 2024-08-07T18:18:00Z

    On Wed, Aug 7, 2024 at 1:29 PM Joe Conway <mail@joeconway.com> wrote:
    > FWIW I see all of these in glibc:
    >
    > isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l,
    > isgraph_l,  islower_l, isprint_l, ispunct_l, isspace_l, isupper_l,
    > isxdigit_l
    
    On my MacBook (Ventura, 13.6.7), I see all of these except for isascii_l.
    
    -- 
    Robert Haas
    EDB: http://www.enterprisedb.com
    
    
    
    
  9. Re: Remaining dependency on setlocale()

    Thomas Munro <thomas.munro@gmail.com> — 2024-08-07T20:52:41Z

    On Thu, Aug 8, 2024 at 5:16 AM Jeff Davis <pgsql@j-davis.com> wrote:
    > There are a ton of calls to, for example, isspace(), used mostly for
    > parsing.
    >
    > I wouldn't expect a lot of differences in behavior from locale to
    > locale, like might be the case with iswspace(), but behavior can be
    > different at least in theory.
    >
    > So I guess we're stuck with setlocale()/uselocale() for a while, unless
    > we're able to move most of those call sites over to an ascii-only
    > variant.
    
    We do know of a few isspace() calls that are already questionable[1]
    (should be scanner_isspace(), or something like that).  It's not only
    weird that SELECT ROW('libertà!') is displayed with or without double
    quote depending (in theory) on your locale, it's also undefined
    behaviour because we feed individual bytes of a multi-byte sequence to
    isspace(), so OSes disagree, and in practice we know that macOS and
    Windows think that the byte 0xa inside 'à' is a space while glibc and
    FreeBSD don't.  Looking at the languages with many sequences
    containing 0xa0, I guess you'd probably need to be processing CJK text
    and cross-platform for the difference to become obvious (that was the
    case for the problem report I analysed):
    
    for i in range(1, 0xffff):
      if (i < 0xd800 or i > 0xdfff) and 0xa0 in chr(i).encode('UTF-8'):
        print("%04x: %s" % (i, chr(i)))
    
    [1] https://www.postgresql.org/message-id/flat/CA%2BHWA9awUW0%2BRV_gO9r1ABZwGoZxPztcJxPy8vMFSTbTfi4jig%40mail.gmail.com
    
    
    
    
  10. Re: Remaining dependency on setlocale()

    Thomas Munro <thomas.munro@gmail.com> — 2024-08-07T21:40:56Z

    On Thu, Aug 8, 2024 at 6:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
    > On Wed, Aug 7, 2024 at 1:29 PM Joe Conway <mail@joeconway.com> wrote:
    > > FWIW I see all of these in glibc:
    > >
    > > isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l,
    > > isgraph_l,  islower_l, isprint_l, ispunct_l, isspace_l, isupper_l,
    > > isxdigit_l
    >
    > On my MacBook (Ventura, 13.6.7), I see all of these except for isascii_l.
    
    Those (except isascii_l) are from POSIX 2008[1].  They were absorbed
    from "Extended API Set Part 4"[2], along with locale_t (that's why
    there is a header <xlocale.h> on a couple of systems even though after
    absorption they are supposed to be in <locale.h>).  We already
    decided that all computers have that stuff (commit 8d9a9f03), but the
    reality is a little messier than that... NetBSD hasn't implemented
    uselocale() yet[3], though it has a good set of _l functions.  As
    discussed in [3], ECPG code is therefore currently broken in
    multithreaded clients because it's falling back to a setlocale() path,
    and I think Windows+MinGW must be too (it lacks
    HAVE__CONFIGTHREADLOCALE), but those both have a good set of _l
    functions.  In that thread I tried to figure out how to use _l
    functions to fix that problem, but ...
    
    The issue there is that we have our own snprintf.c, that implicitly
    requires LC_NUMERIC to be "C" (it is documented as always printing
    floats a certain way ignoring locale and that's what the callers there
    want in frontend and backend code, but in reality it punts to system
    snprintf for floats, assuming that LC_NUMERIC is "C", which we
    configure early in backend startup, but frontend code has to do it for
    itself!).  So we could use snprintf_l or strtod_l instead, but POSIX
    hasn't got those yet.  Or we could use own own Ryu code (fairly
    specific), but integrating Ryu into our snprintf.c (and correctly
    implementing all the %... stuff?) sounds like quite a hard,
    devil-in-the-details kind of an undertaking to me.  Or maybe it's
    easy, I dunno.  As for the _l functions, you could probably get away
    with "every computer has either uselocale() or snprintf_() (or
    strtod_()?)" and have two code paths in our snprintf.c.  But then we'd
    also need a place to track a locale_t for a long-lived newlocale("C"),
    which was too messy in my latest attempt...
    
    [1] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/isspace.html
    [2] https://pubs.opengroup.org/onlinepubs/9699939499/toc.pdf
    [3] https://www.postgresql.org/message-id/flat/CWZBBRR6YA8D.8EHMDRGLCKCD%40neon.tech
    
    
    
    
  11. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2024-08-07T22:45:19Z

    On Wed, 2024-08-07 at 13:28 -0400, Joe Conway wrote:
    > FWIW I see all of these in glibc:
    > 
    > isalnum_l, isalpha_l, isascii_l, isblank_l, iscntrl_l, isdigit_l, 
    > isgraph_l,  islower_l, isprint_l, ispunct_l, isspace_l, isupper_l, 
    > isxdigit_l
    
    My point was just that there are a lot of those call sites (especially
    for isspace()) in various parsers. It feels like a lot of code churn to
    change all of them, when a lot of them seem to be intended for ascii
    anyway.
    
    And where do we get the locale_t structure from? We can create our own
    at database connection time, and supply it to each of those call sites,
    but I'm not sure that's a huge advantage over just using uselocale().
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  12. Re: Remaining dependency on setlocale()

    Andreas Karlsson <andreas@proxel.se> — 2024-08-09T11:41:26Z

    On 8/8/24 12:45 AM, Jeff Davis wrote:
    > My point was just that there are a lot of those call sites (especially
    > for isspace()) in various parsers. It feels like a lot of code churn to
    > change all of them, when a lot of them seem to be intended for ascii
    > anyway.
    > 
    > And where do we get the locale_t structure from? We can create our own
    > at database connection time, and supply it to each of those call sites,
    > but I'm not sure that's a huge advantage over just using uselocale().
    
    I am leaning towards that we should write our own pure ascii functions 
    for this. Since we do not support any non-ascii compatible encodings 
    anyway I do not see the point in having locale support in most of these 
    call-sites.
    
    Andewas
    
    
    
    
  13. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2024-08-09T18:24:03Z

    On Fri, 2024-08-09 at 13:41 +0200, Andreas Karlsson wrote:
    > I am leaning towards that we should write our own pure ascii
    > functions 
    > for this.
    
    That makes sense for a lot of call sites, but it could cause breakage
    if we aren't careful.
    
    >  Since we do not support any non-ascii compatible encodings 
    > anyway I do not see the point in having locale support in most of
    > these 
    > call-sites.
    
    An ascii-compatible encoding just means that the code points in the
    ascii range are represented as ascii. I'm not clear on whether code
    points in the ascii range can return different results for things like
    isspace(), but it sounds plausible -- toupper() can return different
    results for 'i' in tr_TR.
    
    Also, what about the values outside 128-255, which are still valid
    input to isspace()?
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  14. Re: Remaining dependency on setlocale()

    Tristan Partin <tristan@partin.io> — 2024-08-09T20:16:58Z

    On Tue Aug 6, 2024 at 5:00 PM CDT, Jeff Davis wrote:
    > After some previous work here:
    >
    > https://postgr.es/m/89475ee5487d795124f4e25118ea8f1853edb8cb.camel@j-davis.com
    >
    > we are less dependent on setlocale(), but it's still not completely
    > gone.
    >
    > setlocale() counts as thread-unsafe, so it would be nice to eliminate
    > it completely.
    >
    > The obvious answer is uselocale(), which sets the locale only for the
    > calling thread, and takes precedence over whatever is set with
    > setlocale().
    >
    > But there are a couple problems:
    >
    > 1. I don't think it's supported on Windows.
    >
    > 2. I don't see a good way to canonicalize a locale name, like in
    > check_locale(), which uses the result of setlocale().
    >
    > Thoughts?
    
    Hey Jeff,
    
    See this thread[0] for some work that I had previously done. Feel free 
    to take it over, or we could collaborate.
    
    [0]: https://www.postgresql.org/message-id/CWMW5OZBWJ10.1YFLQWSUE5RE9@neon.tech
    
    --
    Tristan Partin
    Neon (https://neon.tech)
    
    
    
    
  15. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2024-08-09T20:48:20Z

    Hi,
    
    On Fri, 2024-08-09 at 15:16 -0500, Tristan Partin wrote:
    > Hey Jeff,
    > 
    > See this thread[0] for some work that I had previously done. Feel
    > free 
    > to take it over, or we could collaborate.
    > 
    > [0]:
    > https://www.postgresql.org/message-id/CWMW5OZBWJ10.1YFLQWSUE5RE9@neon.tech
    
    Sounds good, sorry I missed that.
    
    Can you please rebase and we can discuss in that thread?
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  16. Re: Remaining dependency on setlocale()

    Thomas Munro <thomas.munro@gmail.com> — 2024-08-09T21:42:29Z

    On Wed, Aug 7, 2024 at 7:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:
    > On Wed, Aug 7, 2024 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > > Jeff Davis <pgsql@j-davis.com> writes:
    > > > But there are a couple problems:
    > >
    > > > 1. I don't think it's supported on Windows.
    > >
    > > Can't help with that, but surely Windows has some thread-safe way.
    >
    > It does.  It's not exactly the same, instead there is a thing you can
    > call that puts setlocale() itself into a thread-local mode, but last
    > time I checked that mode was missing on MinGW so that's a bit of an
    > obstacle.
    
    Actually the MinGW situation might be better than that these days.  I
    know of three environments where we currently have to keep code
    working on MinGW: build farm animal fairywren (msys2 compiler
    toochain), CI's optional "Windows - Server 2019, MinGW64 - Meson"
    task, and CI's "CompilerWarnings" task, in the "mingw_cross_warning"
    step (which actually runs on Linux, and uses configure rather than
    meson).  All three environments show that they have
    _configthreadlocale.  So could we could simply require it on Windows?
    Then it might be possible to write a replacement implementation of
    uselocale() that does a two-step dance with _configthreadlocale() and
    setlocale(), restoring both afterwards if they changed.  That's what
    ECPG open-codes already.
    
    The NetBSD situation is more vexing.  I was trying to find out if
    someone is working on it and unfortunately it looks like there is a
    principled stand against adding it:
    
    https://mail-index.netbsd.org/tech-userlevel/2015/12/28/msg009546.html
    https://mail-index.netbsd.org/netbsd-users/2017/02/14/msg019352.html
    
    They're right that we really just want to use "C" in some places, and
    their LC_C_LOCALE is a very useful system-provided value to be able to
    pass into _l functions.  It's a shame it's non-standard, because
    without it you have to allocate a locale_t for "C" and keep it
    somewhere to feed to _l functions...
    
    
    
    
  17. Re: Remaining dependency on setlocale()

    Thomas Munro <thomas.munro@gmail.com> — 2024-08-10T01:36:39Z

    I've posted a new attempt at ripping those ECPG setlocales() out on
    the other thread that had the earlier version and discussion:
    
    https://www.postgresql.org/message-id/CA%2BhUKG%2BYv%2Bps%3DnS2T8SS1UDU%3DiySHSr4sGHYiYGkPTpZx6Ooww%40mail.gmail.com
    
    
    
    
  18. Re: Remaining dependency on setlocale()

    Thomas Munro <thomas.munro@gmail.com> — 2024-08-12T03:24:01Z

    On Thu, Aug 8, 2024 at 5:16 AM Jeff Davis <pgsql@j-davis.com> wrote:
    > On Wed, 2024-08-07 at 19:07 +1200, Thomas Munro wrote:
    > > How far can we get by using more _l() functions?
    >
    > There are a ton of calls to, for example, isspace(), used mostly for
    > parsing.
    >
    > I wouldn't expect a lot of differences in behavior from locale to
    > locale, like might be the case with iswspace(), but behavior can be
    > different at least in theory.
    >
    > So I guess we're stuck with setlocale()/uselocale() for a while, unless
    > we're able to move most of those call sites over to an ascii-only
    > variant.
    
    Here are two more cases that I don't think I've seen discussed.
    
    1.  The nl_langinfo() call in pg_get_encoding_from_locale(), can
    probably be changed to nl_langinfo_l() (it is everywhere we currently
    care about except Windows, which has a different already-thread-safe
    alternative; AIX seems to lack the _l version, but someone writing a
    patch to re-add support for that OS could supply the configure goo for
    a uselocale() safe/restore implementation).  One problem is that it
    has callers that pass it NULL meaning the backend default, but we'd
    perhaps use LC_C_GLOBAL for now and have to think about where we get
    the database default locale_t in the future.
    
    2.  localeconv() is *doubly* non-thread-safe: it depends on the
    current locale, and it also returns an object whose storage might be
    clobbered by any other call to localeconv(), setlocale, or even,
    according to POSIX, uselocale() (!!!).  I think that effectively
    closes off that escape hatch.  On some OSes (macOS, BSDs) you find
    localeconv_l() and then I think they give you a more workable
    lifetime: as long as the locale_t lives, which makes perfect sense.  I
    am surprised that no one has invented localeconv_r() where you supply
    the output storage, and you could wrap that in uselocale()
    save/restore to deal with the other problem, or localeconv_r_l() or
    something.  I can't understand why this is so bad.  The glibc
    documentation calls it "a masterpiece of poor design".  Ahh, so it
    seems like we need to delete our use of localeconf() completely,
    because we should be able to get all the information we need from
    nl_langinfo_l() instead:
    
    https://www.gnu.org/software/libc/manual/html_node/Locale-Information.html
    
    
    
    
  19. Re: Remaining dependency on setlocale()

    Thomas Munro <thomas.munro@gmail.com> — 2024-08-12T04:53:17Z

    On Mon, Aug 12, 2024 at 3:24 PM Thomas Munro <thomas.munro@gmail.com> wrote:
    > 1.  The nl_langinfo() call in pg_get_encoding_from_locale(), can
    > probably be changed to nl_langinfo_l() (it is everywhere we currently
    > care about except Windows, which has a different already-thread-safe
    > alternative ...
    
    ... though if we wanted to replace all use of localeconv and struct
    lconv with nl_langinfo_l() calls, it's not totally obvious how to do
    that on Windows.  Its closest thing is GetLocaleInfoEx(), but that has
    complications: it takes wchar_t locale names, which we don't even have
    and can't access when we only have a locale_t, and it must look them
    up in some data structure every time, and it copies data out to the
    caller as wchar_t so now you have two conversion problems and a
    storage problem.  If I understand correctly, the whole point of
    nl_langinfo_l(item, loc) is that it is supposed to be fast, it's
    really just an array lookup, and item is just an index, and the result
    is supposed to be stable as long as loc hasn't been freed (and the
    thread hasn't exited).  So you can use it without putting your own
    caching in front of it.  One idea I came up with which I haven't tried
    and it might turn out to be terrible, is that we could change our
    definition of locale_t on Windows.  Currently it's a typedef to
    Windows' _locale_t, and we use it with a bunch of _XXX functions that
    we access by macro to remove the underscore.  Instead, we could make
    locale_t a pointer to a struct of our own design in WIN32 builds,
    holding the native _locale_t and also an array full of all the values
    that nl_langinfo_l() can return.  We'd provide the standard enums,
    indexes into that array, in a fake POSIX-oid header <langinfo.h>.
    Then nl_langinfo_l(item, loc) could be implemented as
    loc->private_langinfo[item], and strcoll_l(.., loc) could be a static
    inline function that does _strcol_l(...,
    loc->private_windows_locale_t).  These structs would be allocated and
    freed with standard-looking newlocale() and freelocale(), so we could
    finally stop using #ifdef WIN32-wrapped _create_locale() directly.
    Then everything would look more POSIX-y, nl_langinfo_l() could be used
    directly wherever we need fast access to that info, and we could, I
    think, banish the awkward localeconv, right?  I don't know if this all
    makes total sense and haven't tried it, just spitballing here...
    
    
    
    
  20. Re: Remaining dependency on setlocale()

    Thomas Munro <thomas.munro@gmail.com> — 2024-08-12T22:43:14Z

    On Mon, Aug 12, 2024 at 4:53 PM Thomas Munro <thomas.munro@gmail.com> wrote:
    > ... though if we wanted to replace all use of localeconv and struct
    > lconv with nl_langinfo_l() calls,
    
    Gah.  I realised while trying the above that you can't really replace
    localeconv() with nl_langinfo_l() as the GNU documentation recommends,
    because some of the lconv fields we're using are missing from
    langinfo.h in POSIX (only GNU added them all, that was a good idea).
    :-(
    
    Next idea:
    
    Windows: its localeconv() returns pointer to thread-local storage,
    good, but it still needs setlocale().  So the options are: make our
    own lconv-populator function for Windows, using GetLocaleInfoEx(), or
    do that _configthreadlocale() dance (possibly excluding some MinGW
    configurations from working)
    Systems that have localeconv_l(): use that
    POSIX: use uselocale() and also put a big global lock around
    localeconv() call + accessing result (optionally skipping that on an
    OS-by-OS basis after confirming that its implementation doesn't really
    need it)
    
    The reason the uselocale() + localeconv() seems to require a Big Lock
    (by default at least) is that the uselocale() deals only with the
    "current locale" aspect, not the output buffer aspect.  Clearly the
    standard allows for it to be thread-local storage (that's why since
    2008 it says that after thread-exit you can't access the result, and I
    guess that's how it works on real systems (?)), but it also seems to
    allow for a single static buffer (that's why it says that it's not
    re-entrant, and any call to localeconv() might clobber it).  That
    might be OK in practice because we tend to cache that stuff, eg when
    assigning GUC lc_monetary (that cache would presumably become
    thread-local in the first phase of the multithreading plan), so the
    locking shouldn't really hurt.
    
    The reason we'd have to have three ways, and not just two, is again
    that NetBSD declined to implement uselocale().
    
    I'll try this in a bit unless someone else has better ideas or plans
    for this part... sorry for the drip-feeding.
    
    
    
    
  21. Re: Remaining dependency on setlocale()

    Thomas Munro <thomas.munro@gmail.com> — 2024-08-13T05:47:46Z

    On Tue, Aug 13, 2024 at 10:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
    > I'll try this in a bit unless someone else has better ideas or plans
    > for this part... sorry for the drip-feeding.
    
    And done, see commitfest entry #5170.
    
    
    
    
  22. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2024-08-14T01:05:23Z

    On Sat, 2024-08-10 at 09:42 +1200, Thomas Munro wrote:
    > The NetBSD situation is more vexing.  I was trying to find out if
    > someone is working on it and unfortunately it looks like there is a
    > principled stand against adding it:
    > 
    > https://mail-index.netbsd.org/tech-userlevel/2015/12/28/msg009546.html
    > https://mail-index.netbsd.org/netbsd-users/2017/02/14/msg019352.html
    
    The objection seems to be very general: that uselocale() modifies the
    thread state and affects calls a long distance from uselocale(). I
    don't disagree with the general sentiment. But in effect, that just
    prevents people migrating away from setlocale(), to which the same
    argument applies, and is additionally thread-unsafe.
    
    The only alternative is to essentially ban the use of non-_l variants,
    which is fine I suppose, but causes a fair amount of code churn.
    
    > They're right that we really just want to use "C" in some places, and
    > their LC_C_LOCALE is a very useful system-provided value to be able
    > to
    > pass into _l functions.  It's a shame it's non-standard, because
    > without it you have to allocate a locale_t for "C" and keep it
    > somewhere to feed to _l functions...
    
    If we're going to do that, why not just have ascii-only variants of our
    own? pg_ascii_isspace(...) is at least as readable as isspace_l(...,
    LC_C_LOCALE).
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  23. Re: Remaining dependency on setlocale()

    Thomas Munro <thomas.munro@gmail.com> — 2024-08-14T02:31:20Z

    On Wed, Aug 14, 2024 at 1:05 PM Jeff Davis <pgsql@j-davis.com> wrote:
    > The only alternative is to essentially ban the use of non-_l variants,
    > which is fine I suppose, but causes a fair amount of code churn.
    
    Let's zoom out a bit and consider some ways we could set up the
    process, threads and individual calls:
    
    1.  The process global locale is always "C".  If you ever call
    uselocale(), it can only be for short stretches, and you have to
    restore it straight after; perhaps it is only ever used in replacement
    _l() functions for systems that lack them.  You need to use _l()
    functions for all non-"C" locales.  The current database default needs
    to be available as a variable (in future: thread-local variable, or
    reachable from one), so you can use it in _l() functions.  The "C"
    locale can be accessed implicitly with non-l() functions, or you could
    ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE) for
    "C".  Or a name like PG_C_LOCALE, which, in backend code could be just
    LC_GLOBAL_LOCALE, while in frontend/library code it could be the
    singleton mechanism I showed in CF#5166.
    
    XXX Note that nailing LC_ALL to "C" in the backend would extend our
    existing policy for LC_NUMERIC to all categories.  That's why we can
    use strtod() in the backend and expect the radix character to be '.'.
    It's interesting to contemplate the strtod() calls in CF#5166: they
    are code copied-and-pasted from backend and frontend; in the backend
    we can use strtod() currently but in the frontend code I showed a
    change to strtod_l(..., PG_C_LOCALE), in order to be able to delete
    some ugly deeply nested uselocale()/setlocale() stuff of the exact
    sort that NetBSD hackers (and I) hate.  It's obviously a bit of a code
    smell that it's copied-and-pasted in the first place, and should
    really share code.  Supposing some of that stuff finishes up in
    src/common, then I think you'd want a strtod_l(..., PG_C_LOCALE) that
    could be allowed to take advantage of the knowledge that the global
    locale is "C" in the backend.  Just thoughts...
    
    2.  The process global locale is always "C".  Each backend process (in
    future: thread) calls uselocale() to set the thread-local locale to
    the database default, so it can keep using the non-_l() functions as a
    way to access the database default, and otherwise uses _l() functions
    if it wants something else (as we do already).  The "C" locale can be
    accessed with foo_l(..., LC_GLOBAL_LOCALE) or PG_C_LOCALE etc.
    
    XXX This option is blocked by NetBSD's rejection of uselocale().  I
    guess if you really wanted to work around NetBSD's policy you could
    make your own wrapper for all affected functions, translating foo() to
    foo_l(..., pg_thread_current_locale), so you could write uselocale(),
    which is pretty much what every other libc does...  But eughhh
    
    3.  The process global locale is inherited from the system or can be
    set by the user however they want for the benefit of extensions, but
    we never change it after startup or refer to it.  Then we do the same
    as 1 or 2, except if we ever want "C" we'll need a locale_t for that,
    again perhaps using the PC_C_LOCALE mechanism.  Non-_l() functions are
    effectively useless except in cases where you really want to use the
    system's settings inherited from startup, eg for messages, so they'd
    mostly be banned.
    
    What else?
    
    > > They're right that we really just want to use "C" in some places, and
    > > their LC_C_LOCALE is a very useful system-provided value to be able
    > > to
    > > pass into _l functions.  It's a shame it's non-standard, because
    > > without it you have to allocate a locale_t for "C" and keep it
    > > somewhere to feed to _l functions...
    >
    > If we're going to do that, why not just have ascii-only variants of our
    > own? pg_ascii_isspace(...) is at least as readable as isspace_l(...,
    > LC_C_LOCALE).
    
    Yeah, I agree there are some easy things we should do that way.  In
    fact we've already established that scanner_isspace() needs to be used
    in lots more places for that, even aside from thread-safety.
    
    
    
    
  24. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2024-08-14T19:00:52Z

    On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote:
    > 1.  The process global locale is always "C".  If you ever call
    > uselocale(), it can only be for short stretches, and you have to
    > restore it straight after; perhaps it is only ever used in
    > replacement
    > _l() functions for systems that lack them.  You need to use _l()
    > functions for all non-"C" locales.  The current database default
    > needs
    > to be available as a variable (in future: thread-local variable, or
    > reachable from one), so you can use it in _l() functions.  The "C"
    > locale can be accessed implicitly with non-l() functions, or you
    > could
    > ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE)
    > for
    > "C".  Or a name like PG_C_LOCALE, which, in backend code could be
    > just
    > LC_GLOBAL_LOCALE, while in frontend/library code it could be the
    > singleton mechanism I showed in CF#5166.
    
    +1 to this approach. It makes things more consistent across platforms
    and avoids surprising dependencies on the global setting.
    
    We'll have to be careful that each call site is either OK with C, or
    that it gets changed to an _l() variant. We also have to be careful
    about extensions.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  25. Re: Remaining dependency on setlocale()

    Thomas Munro <thomas.munro@gmail.com> — 2024-08-14T22:43:50Z

    On Wed, Aug 7, 2024 at 7:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:
    > On Wed, Aug 7, 2024 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > > Jeff Davis <pgsql@j-davis.com> writes:
    > > > 2. I don't see a good way to canonicalize a locale name, like in
    > > > check_locale(), which uses the result of setlocale().
    > >
    > > What I can tell you about that is that check_locale's expectation
    > > that setlocale does any useful canonicalization is mostly wishful
    > > thinking [1].  On a lot of platforms you just get the input string
    > > back again.  If that's the only thing keeping us on setlocale,
    > > I think we could drop it.  (Perhaps we should do some canonicalization
    > > of our own instead?)
    >
    > +1
    >
    > I know it does something on Windows (we know the EDB installer gives
    > it strings like "Language,Country" and it converts them to
    > "Language_Country.Encoding", see various threads about it all going
    > wrong), but I'm not sure it does anything we actually want to
    > encourage.  I'm hoping we can gradually screw it down so that we only
    > have sane BCP 47 in the system on that OS, and I don't see why we
    > wouldn't just use them verbatim.
    
    Some more thoughts on check_locale() and canonicalisation:
    
    I doubt the canonicalisation does anything useful on any Unix system,
    as they're basically just file names.  In the case of glibc, the
    encoding part is munged before opening the file so it tolerates .utf8
    or .UTF-8 or .u---T----f------8 on input, but it still returns
    whatever you gave it so the return value isn't cleaning the input or
    anything.
    
    "" is a problem however... the special value for "native environment"
    is returned as a real locale name, which we probably still need in
    places.  We could change that to newlocale("") + query instead, but
    there is a portability pipeline problem getting the name out of it:
    
    1. POSIX only just added getlocalename_l() in 2024[1][2].
    2. Glibc has non-standard nl_langinfo_l(NL_LOCALE_NAME(category), loc).
    3. The <xlocale.h> systems (macOS/*BSD) have non-standard
    querylocale(mask, loc).
    4. AFAIK there is no way to do it on pure POSIX 2008 systems.
    5. For Windows, there is a completely different thing to get the
    user's default locale, see CF#3772.
    
    The systems in category 4 would in practice be Solaris and (if it
    comes back) AIX.  Given that, we probably just can't go that way soon.
    
    So I think the solution could perhaps be something like: in some early
    startup phase before there are any threads, we nail down all the
    locale categories to "C" (or whatever we decide on for the permanent
    global locale), and also query the "" categories and make a copy of
    them in case anyone wants them later, and then never call setlocale()
    again.
    
    [1] https://pubs.opengroup.org/onlinepubs/9799919799/functions/getlocalename_l.html
    [2] https://www.austingroupbugs.net/view.php?id=1220
    
    
    
    
  26. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2024-08-14T23:00:39Z

    On Thu, 2024-08-15 at 10:43 +1200, Thomas Munro wrote:
    > So I think the solution could perhaps be something like: in some
    > early
    > startup phase before there are any threads, we nail down all the
    > locale categories to "C" (or whatever we decide on for the permanent
    > global locale), and also query the "" categories and make a copy of
    > them in case anyone wants them later, and then never call setlocale()
    > again.
    
    +1.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  27. Re: Remaining dependency on setlocale()

    Thomas Munro <thomas.munro@gmail.com> — 2024-08-15T08:46:15Z

    On Thu, Aug 15, 2024 at 11:00 AM Jeff Davis <pgsql@j-davis.com> wrote:
    > On Thu, 2024-08-15 at 10:43 +1200, Thomas Munro wrote:
    > > So I think the solution could perhaps be something like: in some
    > > early
    > > startup phase before there are any threads, we nail down all the
    > > locale categories to "C" (or whatever we decide on for the permanent
    > > global locale), and also query the "" categories and make a copy of
    > > them in case anyone wants them later, and then never call setlocale()
    > > again.
    >
    > +1.
    
    We currently nail down these categories:
    
        /* We keep these set to "C" always.  See pg_locale.c for explanation. */
        init_locale("LC_MONETARY", LC_MONETARY, "C");
        init_locale("LC_NUMERIC", LC_NUMERIC, "C");
        init_locale("LC_TIME", LC_TIME, "C");
    
    CF #5170 has patches to make it so that we stop changing them even
    transiently, using locale_t interfaces to feed our caches of stuff
    needed to work with those categories, so they really stay truly nailed
    down.
    
    It sounds like someone needs to investigate doing the same thing for
    these two, from CheckMyDatabase():
    
        if (pg_perm_setlocale(LC_COLLATE, collate) == NULL)
            ereport(FATAL,
                    (errmsg("database locale is incompatible with
    operating system"),
                     errdetail("The database was initialized with
    LC_COLLATE \"%s\", "
                               " which is not recognized by setlocale().", collate),
                     errhint("Recreate the database with another locale or
    install the missing locale.")));
    
        if (pg_perm_setlocale(LC_CTYPE, ctype) == NULL)
            ereport(FATAL,
                    (errmsg("database locale is incompatible with
    operating system"),
                     errdetail("The database was initialized with LC_CTYPE \"%s\", "
                               " which is not recognized by setlocale().", ctype),
                     errhint("Recreate the database with another locale or
    install the missing locale.")));
    
    How should that work?  Maybe we could imagine something like
    MyDatabaseLocale, a locale_t with LC_COLLATE and LC_CTYPE categories
    set appropriately.  Or should it be a pg_locale_t instead (if your
    database default provider is ICU, then you don't even need a locale_t,
    right?).
    
    Then I think there is one quite gnarly category, from
    assign_locale_messages() (a GUC assignment function):
    
        (void) pg_perm_setlocale(LC_MESSAGES, newval);
    
    I have never really studied gettext(), but I know it was just
    standardised in POSIX 2024, and the standardised interface has _l()
    variants of all functions.  Current implementations don't have them
    yet.  Clearly we absolutely couldn't call pg_perm_setlocale() after
    early startup --  but if gettext() is relying on the current locale to
    affect far away code, then maybe this is one place where we'd just
    have to use uselocale().  Perhaps we could plan some transitional
    strategy where NetBSD users lose the ability to change the GUC without
    restarting the server and it has to be the same for all sessions, or
    something like that, until they produce either gettext_l() or
    uselocale(), but I haven't thought hard about this part at all yet...
    
    
    
    
  28. Re: Remaining dependency on setlocale()

    Peter Eisentraut <peter@eisentraut.org> — 2024-08-15T13:25:35Z

    On 15.08.24 00:43, Thomas Munro wrote:
    > "" is a problem however... the special value for "native environment"
    > is returned as a real locale name, which we probably still need in
    > places.  We could change that to newlocale("") + query instead, but
    
    Where do we need that in the server?
    
    It should just be initdb doing that and then initializing the server 
    with concrete values based on that.
    
    I guess technically some of these GUC settings default to the 
    environment?  But I think we could consider getting rid of that.
    
    
    
    
  29. Re: Remaining dependency on setlocale()

    Thomas Munro <thomas.munro@gmail.com> — 2024-08-15T21:09:40Z

    On Fri, Aug 16, 2024 at 1:25 AM Peter Eisentraut <peter@eisentraut.org> wrote:
    > On 15.08.24 00:43, Thomas Munro wrote:
    > > "" is a problem however... the special value for "native environment"
    > > is returned as a real locale name, which we probably still need in
    > > places.  We could change that to newlocale("") + query instead, but
    >
    > Where do we need that in the server?
    
    Hmm.  Yeah, right, the only way I've found so far to even reach that
    code and that captures that result is:
    
    create database db2 locale = '';
    
    Thats puts 'en_NZ.UTF-8' or whatever in pg_database.  In contrast,
    create collation will accept '' but just store it verbatim, and the
    GUCs for changing time, monetary, numeric accept it too and keep it
    verbatim.  We could simply ban '' in all user commands.  I doubt
    they're documented as acceptable values, once you get past initdb and
    have a running system.  Looking into that...
    
    > It should just be initdb doing that and then initializing the server
    > with concrete values based on that.
    
    Right.
    
    > I guess technically some of these GUC settings default to the
    > environment?  But I think we could consider getting rid of that.
    
    Yeah.
    
    
    
    
  30. Re: Remaining dependency on setlocale()

    Thomas Munro <thomas.munro@gmail.com> — 2024-08-16T01:12:33Z

    On Fri, Aug 16, 2024 at 9:09 AM Thomas Munro <thomas.munro@gmail.com> wrote:
    > On Fri, Aug 16, 2024 at 1:25 AM Peter Eisentraut <peter@eisentraut.org> wrote:
    > > It should just be initdb doing that and then initializing the server
    > > with concrete values based on that.
    >
    > Right.
    >
    > > I guess technically some of these GUC settings default to the
    > > environment?  But I think we could consider getting rid of that.
    >
    > Yeah.
    
    Seems to make a lot of sense.  I tried that out over in CF #5170.
    
    (In case it's not clear why I'm splitting discussion between threads:
    I was thinking of this thread as the one where we're discussing what
    needs to be done, with other threads being spun off to become CF entry
    with concrete patches.  I realised re-reading some discussion that
    that might not be obvious...)
    
    
    
    
  31. Re: Remaining dependency on setlocale()

    Andreas Karlsson <andreas@proxel.se> — 2024-08-28T16:26:04Z

    On 8/9/24 8:24 PM, Jeff Davis wrote:
    > On Fri, 2024-08-09 at 13:41 +0200, Andreas Karlsson wrote:
    >> I am leaning towards that we should write our own pure ascii
    >> functions
    >> for this.
    > 
    > That makes sense for a lot of call sites, but it could cause breakage
    > if we aren't careful.
    > 
    >>   Since we do not support any non-ascii compatible encodings
    >> anyway I do not see the point in having locale support in most of
    >> these
    >> call-sites.
    > 
    > An ascii-compatible encoding just means that the code points in the
    > ascii range are represented as ascii. I'm not clear on whether code
    > points in the ascii range can return different results for things like
    > isspace(), but it sounds plausible -- toupper() can return different
    > results for 'i' in tr_TR.
    > 
    > Also, what about the values outside 128-255, which are still valid
    > input to isspace()?
    
    My idea was that in a lot of those cases we only try to parse e.g. 0-9 
    as digits and always only . as the decimal separator so we should make 
    just make that obvious by either using locale C or writing our own ascii 
    only functions. These strings are meant to be read by machines, not 
    humans, primarily.
    
    Andreas
    
    
    
    
  32. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2024-12-12T19:22:09Z

    On Wed, 2024-08-14 at 12:00 -0700, Jeff Davis wrote:
    > On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote:
    > > 1.  The process global locale is always "C".  If you ever call
    > > uselocale(), it can only be for short stretches, and you have to
    > > restore it straight after; perhaps it is only ever used in
    > > replacement
    > > _l() functions for systems that lack them.  You need to use _l()
    > > functions for all non-"C" locales.  The current database default
    > > needs
    > > to be available as a variable (in future: thread-local variable, or
    > > reachable from one), so you can use it in _l() functions.  The "C"
    > > locale can be accessed implicitly with non-l() functions, or you
    > > could
    > > ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE)
    > > for
    > > "C".  Or a name like PG_C_LOCALE, which, in backend code could be
    > > just
    > > LC_GLOBAL_LOCALE, while in frontend/library code it could be the
    > > singleton mechanism I showed in CF#5166.
    > 
    > +1 to this approach. It makes things more consistent across platforms
    > and avoids surprising dependencies on the global setting.
    > 
    > We'll have to be careful that each call site is either OK with C, or
    > that it gets changed to an _l() variant. We also have to be careful
    > about extensions.
    
    Did we reach a conclusion here? Any thoughts on moving in this
    direction, and whether 18 is the right time to do it?
    
    Regards,
    	Jeff Davis
    
    
    
    
    
    
  33. Re: Remaining dependency on setlocale()

    Thomas Munro <thomas.munro@gmail.com> — 2024-12-13T09:44:06Z

    On Fri, Dec 13, 2024 at 8:22 AM Jeff Davis <pgsql@j-davis.com> wrote:
    > On Wed, 2024-08-14 at 12:00 -0700, Jeff Davis wrote:
    > > On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote:
    > > > 1.  The process global locale is always "C".  If you ever call
    > > > uselocale(), it can only be for short stretches, and you have to
    > > > restore it straight after; perhaps it is only ever used in
    > > > replacement
    > > > _l() functions for systems that lack them.  You need to use _l()
    > > > functions for all non-"C" locales.  The current database default
    > > > needs
    > > > to be available as a variable (in future: thread-local variable, or
    > > > reachable from one), so you can use it in _l() functions.  The "C"
    > > > locale can be accessed implicitly with non-l() functions, or you
    > > > could
    > > > ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE)
    > > > for
    > > > "C".  Or a name like PG_C_LOCALE, which, in backend code could be
    > > > just
    > > > LC_GLOBAL_LOCALE, while in frontend/library code it could be the
    > > > singleton mechanism I showed in CF#5166.
    > >
    > > +1 to this approach. It makes things more consistent across platforms
    > > and avoids surprising dependencies on the global setting.
    > >
    > > We'll have to be careful that each call site is either OK with C, or
    > > that it gets changed to an _l() variant. We also have to be careful
    > > about extensions.
    >
    > Did we reach a conclusion here? Any thoughts on moving in this
    > direction, and whether 18 is the right time to do it?
    
    I think this is the best way, and I haven't seen anyone supporting any
    other idea.  (I'm working on those setlocale()-removal patches I
    mentioned, more very soon...)
    
    
    
    
  34. Re: Remaining dependency on setlocale()

    Peter Eisentraut <peter@eisentraut.org> — 2024-12-17T12:14:21Z

    On 13.12.24 10:44, Thomas Munro wrote:
    > On Fri, Dec 13, 2024 at 8:22 AM Jeff Davis <pgsql@j-davis.com> wrote:
    >> On Wed, 2024-08-14 at 12:00 -0700, Jeff Davis wrote:
    >>> On Wed, 2024-08-14 at 14:31 +1200, Thomas Munro wrote:
    >>>> 1.  The process global locale is always "C".  If you ever call
    >>>> uselocale(), it can only be for short stretches, and you have to
    >>>> restore it straight after; perhaps it is only ever used in
    >>>> replacement
    >>>> _l() functions for systems that lack them.  You need to use _l()
    >>>> functions for all non-"C" locales.  The current database default
    >>>> needs
    >>>> to be available as a variable (in future: thread-local variable, or
    >>>> reachable from one), so you can use it in _l() functions.  The "C"
    >>>> locale can be accessed implicitly with non-l() functions, or you
    >>>> could
    >>>> ban those to reduce confusion and use foo_l(..., LC_GLOBAL_LOCALE)
    >>>> for
    >>>> "C".  Or a name like PG_C_LOCALE, which, in backend code could be
    >>>> just
    >>>> LC_GLOBAL_LOCALE, while in frontend/library code it could be the
    >>>> singleton mechanism I showed in CF#5166.
    >>>
    >>> +1 to this approach. It makes things more consistent across platforms
    >>> and avoids surprising dependencies on the global setting.
    >>>
    >>> We'll have to be careful that each call site is either OK with C, or
    >>> that it gets changed to an _l() variant. We also have to be careful
    >>> about extensions.
    >>
    >> Did we reach a conclusion here? Any thoughts on moving in this
    >> direction, and whether 18 is the right time to do it?
    > 
    > I think this is the best way, and I haven't seen anyone supporting any
    > other idea.  (I'm working on those setlocale()-removal patches I
    > mentioned, more very soon...)
    
    I also think this is the right direction, and we'll get closer with the 
    remaining patches that Thomas has lined up.
    
    I think at this point, we could already remove all locale settings 
    related to LC_COLLATE.  Nothing uses that anymore.
    
    I think we will need to keep the global LC_CTYPE setting set to 
    something useful, for example so that system error messages come out in 
    the right encoding.
    
    But I'm concerned about the the Perl_setlocale() dance in plperl.c. 
    Perl apparently does a setlocale(LC_ALL, "") during startup, and that 
    code is a workaround to reset everything back afterwards.  We need to be 
    careful not to break that.
    
    (Perl has fixed that in 5.19, but the fix requires that you set another 
    environment variable before launching Perl, which you can't do in a 
    threaded system, so we'd probably need another fix eventually.  See 
    <https://github.com/Perl/perl5/issues/8274>.)
    
    
    
    
    
  35. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2024-12-17T18:10:50Z

    On Tue, 2024-12-17 at 13:14 +0100, Peter Eisentraut wrote:
    > I think we will need to keep the global LC_CTYPE setting set to 
    > something useful, for example so that system error messages come out
    > in 
    > the right encoding.
    
    Do we need to rely on the global LC_CTYPE setting? We already use
    bind_textdomain_codeset().
    
    > But I'm concerned about the the Perl_setlocale() dance in plperl.c. 
    > Perl apparently does a setlocale(LC_ALL, "") during startup, and that
    > code is a workaround to reset everything back afterwards.  We need to
    > be 
    > careful not to break that.
    > 
    > (Perl has fixed that in 5.19, but the fix requires that you set
    > another 
    > environment variable before launching Perl, which you can't do in a 
    > threaded system, so we'd probably need another fix eventually.  See 
    > <https://github.com/Perl/perl5/issues/8274>.)
    
    I don't fully understand that issue, but I would think the direction we
    are going (keeping the global LC_CTYPE more consistent and relying on
    it less) would make the problem better.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  36. Re: Remaining dependency on setlocale()

    Peter Eisentraut <peter@eisentraut.org> — 2024-12-19T16:23:11Z

    On 17.12.24 19:10, Jeff Davis wrote:
    > On Tue, 2024-12-17 at 13:14 +0100, Peter Eisentraut wrote:
    >> I think we will need to keep the global LC_CTYPE setting set to
    >> something useful, for example so that system error messages come out
    >> in
    >> the right encoding.
    > 
    > Do we need to rely on the global LC_CTYPE setting? We already use
    > bind_textdomain_codeset().
    
    I don't think that would cover messages from the C library (strerror, 
    dlerror, etc.).
    
    >> But I'm concerned about the the Perl_setlocale() dance in plperl.c.
    >> Perl apparently does a setlocale(LC_ALL, "") during startup, and that
    >> code is a workaround to reset everything back afterwards.  We need to
    >> be
    >> careful not to break that.
    >>
    >> (Perl has fixed that in 5.19, but the fix requires that you set
    >> another
    >> environment variable before launching Perl, which you can't do in a
    >> threaded system, so we'd probably need another fix eventually.  See
    >> <https://github.com/Perl/perl5/issues/8274>.)
    > 
    > I don't fully understand that issue, but I would think the direction we
    > are going (keeping the global LC_CTYPE more consistent and relying on
    > it less) would make the problem better.
    
    Yes, I think it's the right direction, but we need to figure this issue 
    out eventually.
    
    
    
    
    
  37. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-06-06T05:15:34Z

    On Tue, 2024-12-17 at 13:14 +0100, Peter Eisentraut wrote:
    > > > > +1 to this approach. It makes things more consistent across
    > > > > platforms
    > > > > and avoids surprising dependencies on the global setting.
    > > > > 
    > > 
    > > I think this is the best way, and I haven't seen anyone supporting
    > > any
    > > other idea.  (I'm working on those setlocale()-removal patches I
    > > mentioned, more very soon...)
    > 
    > I also think this is the right direction, and we'll get closer with
    > the 
    > remaining patches that Thomas has lined up.
    > 
    > I think at this point, we could already remove all locale settings 
    > related to LC_COLLATE.  Nothing uses that anymore.
    > 
    > I think we will need to keep the global LC_CTYPE setting set to 
    > something useful, for example so that system error messages come out
    > in 
    > the right encoding.
    > 
    > But I'm concerned about the the Perl_setlocale() dance in plperl.c. 
    > Perl apparently does a setlocale(LC_ALL, "") during startup, and that
    > code is a workaround to reset everything back afterwards.  We need to
    > be 
    > careful not to break that.
    > 
    > (Perl has fixed that in 5.19, but the fix requires that you set
    > another 
    > environment variable before launching Perl, which you can't do in a 
    > threaded system, so we'd probably need another fix eventually.  See 
    > <https://github.com/Perl/perl5/issues/8274>.)
    
    To continue this thread, I did a symbol search in the meson build
    directory like (patterns.txt attached):
    
      for f in `find . -name *.o`; do
        if ( nm --format=just-symbols $f | \
             grep -xE -f /tmp/patterns.txt > /dev/null ); then
          echo $f; fi; done
    
    and it output:
    
    ./contrib/fuzzystrmatch/fuzzystrmatch.so.p/dmetaphone.c.o
    ./contrib/fuzzystrmatch/fuzzystrmatch.so.p/fuzzystrmatch.c.o
    ./contrib/isn/isn.so.p/isn.c.o
    ./contrib/spi/refint.so.p/refint.c.o
    ./contrib/ltree/ltree.so.p/crc32.c.o
    ./src/backend/postgres_lib.a.p/commands_copyfromparse.c.o
    ./src/backend/postgres_lib.a.p/utils_adt_pg_locale_libc.c.o
    ./src/backend/postgres_lib.a.p/tsearch_wparser_def.c.o
    ./src/backend/postgres_lib.a.p/parser_scansup.c.o
    ./src/backend/postgres_lib.a.p/utils_adt_inet_net_pton.c.o
    ./src/backend/postgres_lib.a.p/tsearch_ts_locale.c.o
    ./src/bin/psql/psql.p/meson-generated_.._tab-complete.c.o
    ./src/interfaces/ecpg/preproc/ecpg.p/meson-generated_.._preproc.c.o
    ./src/interfaces/ecpg/compatlib/libecpg_compat.so.3.18.p/informix.c.o
    ./src/interfaces/ecpg/compatlib/libecpg_compat.a.p/informix.c.o
    ./src/port/libpgport_srv.a.p/pgstrcasecmp.c.o
    ./src/port/libpgport_shlib.a.p/pgstrcasecmp.c.o
    ./src/port/libpgport.a.p/pgstrcasecmp.c.o
    
    Not a short list, but not a long list, either, so seems tractable. Note
    that this misses things like isdigit() which is inlined.
    
    A few observations while spot-checking these files:
    
    ---------------------
    pgstrcasecmp.c - has code like:
    
           else if (IS_HIGHBIT_SET(ch) && islower(ch))
            ch = toupper(ch);
    
    and comments like "Note however that the whole thing is a bit bogus for
    multibyte character sets."
    
    Most of the callers are directly comparing with ascii literals, so I'm
    not sure what the point is. There are probably some more interesting
    callers hidden in there.
    ----------------------
    pg_locale_libc.c -
    
    char2wchar and wchar2char use mbstowcs and wcstombs when the input
    locale is NULL. The main culprit seems to be full text search, which
    has a bunch of /* TODO */ comments. Another caller is
    get_iso_localename().
    
    There are also a couple false positives where mbstowcs_l/wcstombs_l are
    emulated with uselocale() and mbstowcs/wcstombs. In that case, it's not
    actually sensitive to the global setting.
    -----------------------
    copyfromparse.c - the input is ASCII so it can use pg_ascii_tolower()
    instead of tolower()
    -----------------------
    
    Regards,
    	Jeff Davis
    
    
    
    
  38. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-06-06T22:23:41Z

    On Thu, 2025-06-05 at 22:15 -0700, Jeff Davis wrote:
    > To continue this thread, I did a symbol search in the meson build
    > directory like (patterns.txt attached):
    
    Attached a rough patch series which does what everyone seemed to agree
    on:
    
      * Change some trivial ASCII cases to use pg_ascii_* variants
      * Set LC_COLLATE and LC_CTYPE to C with pg_perm_setlocale
      * Introduce a new global_lc_ctype for callers that still need to use
    operations that depend on datctype
    
    There should be no behavior changes in this series.
    
    Benefits:
    
      * a tiny step toward multithreading, because connections to different
    databases no longer require different setlocale() settings
      * easier to identify dependence on datctype, because callers will
    need to refer to global_lc_ctype.
      * harder to accidentally depend on datctype or datcollate
    
    Ideally, when the database locale provider is not libc, the user
    shouldn't need to even think about a valid LC_CTYPE locale at all. But
    that requires more work, and potentially risk of breakage.
    
    Regards,
    	Jeff Davis
    
    `
    
    
    
  39. Re: Remaining dependency on setlocale()

    Peter Eisentraut <peter@eisentraut.org> — 2025-06-10T15:32:01Z

    On 07.06.25 00:23, Jeff Davis wrote:
    > On Thu, 2025-06-05 at 22:15 -0700, Jeff Davis wrote:
    >> To continue this thread, I did a symbol search in the meson build
    >> directory like (patterns.txt attached):
    > 
    > Attached a rough patch series which does what everyone seemed to agree
    > on:
    > 
    >    * Change some trivial ASCII cases to use pg_ascii_* variants
    >    * Set LC_COLLATE and LC_CTYPE to C with pg_perm_setlocale
    >    * Introduce a new global_lc_ctype for callers that still need to use
    > operations that depend on datctype
    
    v1-0001-copyfromparse.c-use-pg_ascii_tolower-rather-than-.patch
    v1-0002-contrib-spi-refint.c-use-pg_ascii_tolower-instead.patch
    v1-0003-isn.c-use-pg_ascii_toupper-instead-of-toupper.patch
    v1-0004-inet_net_pton.c-use-pg_ascii_tolower-rather-than-.patch
    
    These look good to me.
    
    v1-0005-Add-global_lc_ctype-to-hold-locale_t-for-datctype.patch
    
    This looks ok (but might depend on how patch 0006 turns out).
    
    v1-0006-Use-global_lc_ctype-for-callers-of-locale-aware-f.patch
    
    I think these need further individual analysis and explanation why these 
    should use the global lc_ctype setting.  For example, you could argue 
    that the SQL-callable soundex(text) function should use the collation 
    object of its input value, not the global locale.  But furthermore, 
    soundex_code() could actually just use pg_ascii_toupper() instead.  And 
    in ts_locale.c, the isalnum_l() call should use mylocale that already 
    exists in that function.  The problem to solve it getting a good value 
    into mylocale.  Using the global setting confuses the issue a bit, I think.
    
    v1-0007-Fix-the-last-remaining-callers-relying-on-setloca.patch
    
    Do we have any data what platforms we'd need these checks for?
    
    Also, if you look into wparser_def.c what p_isxdigit is used for, it's 
    used for parsing XML (presumably HTML) files, so we just need ASCII-only 
    behavior and no locale dependency.
    
    v1-0008-Set-process-LC_COLLATE-C-and-LC_CTYPE-C.patch
    
    As I mentioned earlier in the thread, I don't think we can do this for 
    LC_CTYPE, because otherwise system error messages would not come out in 
    the right encoding.  For the LC_COLLATE settings, I think we could just 
    do the setting in main(), where the other non-database-specific locale 
    categories are set.
    
    
    
    
    
  40. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-06-10T17:48:26Z

    On Tue, 2025-06-10 at 17:32 +0200, Peter Eisentraut wrote:
    > As I mentioned earlier in the thread, I don't think we can do this
    > for 
    > LC_CTYPE, because otherwise system error messages would not come out
    > in 
    > the right encoding.
    
    Is there any way around that? If all we need is the right encoding, do
    we need a proper locale?
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  41. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-06-11T19:15:14Z

    On Tue, 2025-06-10 at 17:32 +0200, Peter Eisentraut wrote:
    > v1-0001-copyfromparse.c-use-pg_ascii_tolower-rather-than-.patch
    > v1-0002-contrib-spi-refint.c-use-pg_ascii_tolower-instead.patch
    > v1-0003-isn.c-use-pg_ascii_toupper-instead-of-toupper.patch
    > v1-0004-inet_net_pton.c-use-pg_ascii_tolower-rather-than-.patch
    > 
    > These look good to me.
    
    Committed. (That means they're in 18, which was not my intention, but
    others seemed to think it was harmless enough, so I didn't revert. I
    will wait for the branch before I commit any more of these.)
    
    > v1-0005-Add-global_lc_ctype-to-hold-locale_t-for-datctype.patch
    > 
    > This looks ok (but might depend on how patch 0006 turns out).
    
    I changed this to a global_libc_locale that includes both LC_COLLATE
    and LC_CTYPE (from datcollate and datctype), in case an extension is
    relying on strcoll for some reason.
    
    > v1-0006-Use-global_lc_ctype-for-callers-of-locale-aware-f.patch
    > 
    > I think these need further individual analysis and explanation why
    > these 
    > should use the global lc_ctype setting.
    
    This patch series, at least so far, is designed to have zero behavior
    changes. Anything with a potential for a behavior change should be a
    separate commit, so that if we need to revert it, we can revert the
    behavior change without reintroducing a setlocale() dependency.
    
    >   For example, you could argue
    > that the SQL-callable soundex(text) function should use the collation
    > object of its input value, not the global locale.
    
    That would be a behavior change.
    
    >   But furthermore, 
    > soundex_code() could actually just use pg_ascii_toupper() instead.  
    
    Is that a behavior change?
    
    > And 
    > in ts_locale.c, the isalnum_l() call should use mylocale that already
    > exists in that function.  The problem to solve it getting a good
    > value 
    > into mylocale.  Using the global setting confuses the issue a bit, I
    > think.
    
    I reworked it to be less confusing by changing wchar2char/char2wchar to
    take a locale_t instead of pg_locale_t. Hopefully it's an improvement.
    
    In get_iso_localename(), there's a comment saying that it doesn't
    matter which locale is used (because it's ASCII), but to use the "_l"
    variants, we need to pick some locale. At that point it's not clear to
    me that global_libc_locale will be set yet, so I used LC_C_LOCALE.
    
    I'm not sure whether we can rely on LC_C_LOCALE being available, but it
    passed in CI, and if it's not available somewhere it might be a good
    idea to create it on those platforms anyway.
    
    > v1-0007-Fix-the-last-remaining-callers-relying-on-setloca.patch
    > 
    > Do we have any data what platforms we'd need these checks for?
    
    https://cirrus-ci.com/build/5167600088383488
    
    Looks like windows doesn't have iswxdigit_l or isxdigit_l.
    
    > Also, if you look into wparser_def.c what p_isxdigit is used for,
    > it's 
    > used for parsing XML (presumably HTML) files, so we just need ASCII-
    > only 
    > behavior and no locale dependency.
    
    iswxdigit() does seem to be dependent on locale, so this could be a
    subtle behavior change.
    
    > v1-0008-Set-process-LC_COLLATE-C-and-LC_CTYPE-C.patch
    > 
    > As I mentioned earlier in the thread, I don't think we can do this
    > for 
    > LC_CTYPE, because otherwise system error messages would not come out
    > in 
    > the right encoding.
    
    Changed it so that it only sets LC_COLLATE to C, and leaves LC_CTYPE
    set to datctype.
    
    Unfortunately, as long as LC_CTYPE is set to a real locale, there's a
    danger of accidentally depending on that setting. Can the encoding be
    controlled with LC_MESSAGES instead of LC_CTYPE?
    
    Do you have an example of how things can go wrong?
    
    >   For the LC_COLLATE settings, I think we could just 
    > do the setting in main(), where the other non-database-specific
    > locale 
    > categories are set.
    
    Done.
    
    Regards,
    	Jeff Davis
    
    
  42. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-07-01T15:06:51Z

    On Wed, 2025-06-11 at 12:15 -0700, Jeff Davis wrote:
    > I changed this to a global_libc_locale that includes both LC_COLLATE
    > and LC_CTYPE (from datcollate and datctype), in case an extension is
    > relying on strcoll for some reason.
    
    ..
    
    > This patch series, at least so far, is designed to have zero behavior
    > changes. Anything with a potential for a behavior change should be a
    > separate commit, so that if we need to revert it, we can revert the
    > behavior change without reintroducing a setlocale() dependency.
    
    ...
    
    > 
    > I reworked it to be less confusing by changing wchar2char/char2wchar
    > to
    > take a locale_t instead of pg_locale_t. Hopefully it's an
    > improvement.
    
    ...
    
    > 
    > Changed it so that it only sets LC_COLLATE to C, and leaves LC_CTYPE
    > set to datctype.
    
    
    Attached rebased v3.
    
    Regards,
    	Jeff Davis
    
    
  43. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-07-08T00:56:03Z

    On Wed, 2025-06-11 at 12:15 -0700, Jeff Davis wrote:
    > > v1-0008-Set-process-LC_COLLATE-C-and-LC_CTYPE-C.patch
    > > 
    > > As I mentioned earlier in the thread, I don't think we can do this
    > > for 
    > > LC_CTYPE, because otherwise system error messages would not come
    > > out
    > > in 
    > > the right encoding.
    > 
    > Changed it so that it only sets LC_COLLATE to C, and leaves LC_CTYPE
    > set to datctype.
    > 
    > Unfortunately, as long as LC_CTYPE is set to a real locale, there's a
    > danger of accidentally depending on that setting. Can the encoding be
    > controlled with LC_MESSAGES instead of LC_CTYPE?
    > 
    > Do you have an example of how things can go wrong?
    
    I looked into this a bit, and if I understand correctly, the only
    problem is with strerror() and strerror_r(), which depend on
    LC_MESSAGES for the language but LC_CTYPE to find the right encoding.
    
    I attached some example C code to illustrate how strerror() is affected
    by both LC_MESSAGES and LC_CTYPE. For example:
    
       $ ./strerror de_DE.UTF-8 de_DE.UTF-8
       LC_CTYPE set to: de_DE.UTF-8
       LC_MESSAGES set to: de_DE.UTF-8
       Error message (from strerror(EILSEQ)): Ungültiges oder
    unvollständiges Multi-Byte- oder Wide-Zeichen
       $ ./strerror C de_DE.UTF-8
       LC_CTYPE set to: C
       LC_MESSAGES set to: de_DE.UTF-8
       Error message (from strerror(EILSEQ)): Ung?ltiges oder
    unvollst?ndiges Multi-Byte- oder Wide-Zeichen
    
    On unix-based systems, we can use newlocale() to initialize a global
    variable with both LC_CTYPE and LC_MESSAGES set. The LC_MESSAGES
    portion would need to be updated every time the GUC changes, which is
    not great.
    
    Windows would be a different story, though: strerror() doesn't seem to
    have a variant that accepts a _locale_t object, and even if it did, I
    don't see a way to create a _locale_t object with LC_MESSAGES and
    LC_CTYPE set to different values. One idea is to use
    _configthreadlocale(_ENABLE_PER_THREAD_LOCALE), and then use
    setlocale(), which could enable us to use setlocale() similar to how we
    use uselocale() on other systems. That would be awkward, though.
    
    Thoughts? That seems like a lot of work just for the case of
    strerror()/strerror_r().
    
    Regards,
    	Jeff Davis
    
    [1]
    https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/configthreadlocale?view=msvc-170
    
  44. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-07-08T01:14:10Z

    On Tue, 2025-07-01 at 08:06 -0700, Jeff Davis wrote:
    > Attached rebased v3.
    
    And here's v4.
    
    I changed the global variable to only hold the LC_CTYPE (not
    LC_COLLATE), because windows doesn't support a _locale_t that
    represents multiple categories with different locales.
    
    This patch series is designed to not have any changes in behavior.
    There was some feedback that I could go further, but I'll leave those
    suggestions for future patches, in case one causes an unexpected
    behavior change and needs to be reverted. I intend to start committing
    these soon.
    
    v4-0008 uses LC_C_LOCALE, and I'm not sure if that's portable, but if
    the buildfarm complains then I'll fix it or revert it.
    
    Regards,
    	Jeff Davis
    
    
  45. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-07-09T22:52:34Z

    On Mon, 2025-07-07 at 17:56 -0700, Jeff Davis wrote:
    > I looked into this a bit, and if I understand correctly, the only
    > problem is with strerror() and strerror_r(), which depend on
    > LC_MESSAGES for the language but LC_CTYPE to find the right encoding.
    
    ...
    
    > Windows would be a different story, though: strerror() doesn't seem
    > to
    > have a variant that accepts a _locale_t object, and even if it did, I
    > don't see a way to create a _locale_t object with LC_MESSAGES and
    > LC_CTYPE set to different values.
    
    I think I have an answer to the second part here:
    
    "For information about the format of the locale argument, see Locale
    names, Languages, and Country/Region strings."
    
    https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/create-locale-wcreate-locale?view=msvc-170
    
    and when I follow that link, I see:
    
      "You can specify multiple category types, separated by semicolons.
    Category types that aren't specified use the current locale setting.
    For example, this code snippet sets the current locale for all
    categories to de-DE, and then sets the categories LC_MONETARY to en-GB
    and LC_TIME to es-ES:
    
      _wsetlocale(LC_ALL, L"de-DE");
      _wsetlocale(LC_ALL, L"LC_MONETARY=en-GB;LC_TIME=es-ES");"
    
    https://learn.microsoft.com/en-us/cpp/c-runtime-library/locale-names-languages-and-country-region-strings?view=msvc-170
    
    So we just need to construct a string of the right form, and we can
    have a _locale_t object representing the global locale for all
    categories. I'm not sure exactly how we escape the individual locale
    names, but it might be enough to just reject ';' in the locale name (at
    least for windows).
    
    The first problem -- how to affect the encoding of strings returned by
    strerror() on windows -- may be solvable as well. It looks like
    LC_MESSAGES is not supported at all on windows, so the only thing to be
    concerned about is the encoding, which is affected by LC_CTYPE. But
    windows doesn't offer uselocale() or strerror_l(). The only way seems
    to be to call _configthreadlocale(_ENABLE_PER_THREAD_LOCALE) and then
    setlocale(LC_CTYPE, datctype) right before strerror(), and switch it
    back to "C" right afterward. Comments welcome.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  46. Re: Remaining dependency on setlocale()

    Thomas Munro <thomas.munro@gmail.com> — 2025-07-09T23:53:02Z

    On Thu, Jul 10, 2025 at 10:52 AM Jeff Davis <pgsql@j-davis.com> wrote:
    > The first problem -- how to affect the encoding of strings returned by
    > strerror() on windows -- may be solvable as well. It looks like
    > LC_MESSAGES is not supported at all on windows, so the only thing to be
    > concerned about is the encoding, which is affected by LC_CTYPE. But
    > windows doesn't offer uselocale() or strerror_l(). The only way seems
    > to be to call _configthreadlocale(_ENABLE_PER_THREAD_LOCALE) and then
    > setlocale(LC_CTYPE, datctype) right before strerror(), and switch it
    > back to "C" right afterward. Comments welcome.
    
    FWIW there is an example of that in src/port/pg_localeconv_r.c.
    
    
    
    
  47. Re: Remaining dependency on setlocale()

    Thomas Munro <thomas.munro@gmail.com> — 2025-07-10T00:01:12Z

    On Tue, Jul 8, 2025 at 1:14 PM Jeff Davis <pgsql@j-davis.com> wrote:
    > v4-0008 uses LC_C_LOCALE, and I'm not sure if that's portable, but if
    > the buildfarm complains then I'll fix it or revert it.
    
    (Catching up with this thread...)
    
    LC_C_LOCALE is definitely not portable: I've only seen it on macOS and
    NetBSD.  It would be a good thing to propose to POSIX, since no other
    approach can be retrofitted quite so cromulently...
    
    I tried to make a portable PG_C_LOCALE mechanism like that, but it was
    reverted for reasons needing more investigation... see
    8e993bff5326b00ced137c837fce7cd1e0ecae14 (reverted by
    3c8e463b0d885e0d976f6a13a1fb78187b25c86f).
    
    
    
    
  48. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-07-10T18:22:50Z

    On Thu, 2025-07-10 at 11:53 +1200, Thomas Munro wrote:
    > On Thu, Jul 10, 2025 at 10:52 AM Jeff Davis <pgsql@j-davis.com>
    > wrote:
    > > The first problem -- how to affect the encoding of strings returned
    > > by
    > > strerror() on windows -- may be solvable as well. It looks like
    > > LC_MESSAGES is not supported at all on windows, so the only thing
    > > to be
    > > concerned about is the encoding, which is affected by LC_CTYPE. But
    > > windows doesn't offer uselocale() or strerror_l(). The only way
    > > seems
    > > to be to call _configthreadlocale(_ENABLE_PER_THREAD_LOCALE) and
    > > then
    > > setlocale(LC_CTYPE, datctype) right before strerror(), and switch
    > > it
    > > back to "C" right afterward. Comments welcome.
    > 
    > FWIW there is an example of that in src/port/pg_localeconv_r.c.
    
    OK, so it seems we have a path forward here:
    
    1. Have a global_libc_locale that represents all of the categories, and
    keep it up to date with GUC changes. On windows, it requires keeping
    the textual locale names handy (e.g. copies of datcollate and
    datctype), and building the special locale string and doing
    _create_locale(LC_ALL, "LC_ABC=somelocale;LC_XYZ=otherlocale").
    
    2. When there's no _l() variant of a function, like strerror_r(), wrap
    with uselocale(). On windows, this means using the trick above with
    _configthreadlocale(_ENABLE_PER_THREAD_LOCALE).
    
    I don't have a great windows development environment, and it appears CI
    and the buildfarm don't offer great coverage either. Can I ask for a
    volunteer to do the windows side of this work?
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  49. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-07-10T18:33:41Z

    On Thu, 2025-07-10 at 12:01 +1200, Thomas Munro wrote:
    > I tried to make a portable PG_C_LOCALE mechanism like that, but it
    > was
    > reverted for reasons needing more investigation... see
    > 8e993bff5326b00ced137c837fce7cd1e0ecae14 (reverted by
    > 3c8e463b0d885e0d976f6a13a1fb78187b25c86f).
    
    The revert seems to be related to pgport_shlib. At least for my current
    work, I'm focused on removing setlocale() dependencies in the backend,
    and a PG_C_LOCALE should work fine there.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  50. Re: Remaining dependency on setlocale()

    Thomas Munro <thomas.munro@gmail.com> — 2025-07-10T23:47:25Z

    On Fri, Jul 11, 2025 at 6:33 AM Jeff Davis <pgsql@j-davis.com> wrote:
    > On Thu, 2025-07-10 at 12:01 +1200, Thomas Munro wrote:
    > > I tried to make a portable PG_C_LOCALE mechanism like that, but it
    > > was
    > > reverted for reasons needing more investigation... see
    > > 8e993bff5326b00ced137c837fce7cd1e0ecae14 (reverted by
    > > 3c8e463b0d885e0d976f6a13a1fb78187b25c86f).
    >
    > The revert seems to be related to pgport_shlib. At least for my current
    > work, I'm focused on removing setlocale() dependencies in the backend,
    > and a PG_C_LOCALE should work fine there.
    
    OK, I'll figure out what happened with that and try to post a new
    version over on that other thread soon.
    
    (FWIW I learned a couple of encouraging things about that topic:
    glibc's newlocale(LC_ALL, NULL, 0) seems to give you a static
    singleton anyway, no allocation happens, so it can't fail in practice
    and it's cheap, and FreeBSD also supports LC_C_LOCALE like NetBSD and
    macOS, it just doesn't have a name, you pass NULL instead (which is
    the value that LC_C_LOCALE has on those other systems).  I actually
    rather like the macro, because how else are you supposed to test
    whether this system can accept NULL there?)
    
    
    
    
  51. Re: Remaining dependency on setlocale()

    Thomas Munro <thomas.munro@gmail.com> — 2025-07-10T23:48:50Z

    On Fri, Jul 11, 2025 at 6:22 AM Jeff Davis <pgsql@j-davis.com> wrote:
    > I don't have a great windows development environment, and it appears CI
    > and the buildfarm don't offer great coverage either. Can I ask for a
    > volunteer to do the windows side of this work?
    
    Me neither but I'm willing to help with that, and have done lots of
    closely related things through trial-by-CI...
    
    
    
    
  52. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-07-24T02:11:34Z

    On Fri, 2025-07-11 at 11:48 +1200, Thomas Munro wrote:
    > On Fri, Jul 11, 2025 at 6:22 AM Jeff Davis <pgsql@j-davis.com> wrote:
    > > I don't have a great windows development environment, and it
    > > appears CI
    > > and the buildfarm don't offer great coverage either. Can I ask for
    > > a
    > > volunteer to do the windows side of this work?
    > 
    > Me neither but I'm willing to help with that, and have done lots of
    > closely related things through trial-by-CI...
    
    Attached a patch to separate the message translation (both gettext and
    strerror translations) from setlocale(). That's a step towards thread
    safety, and also a step toward setting LC_CTYPE=C permanently (more
    work still required there).
    
    The patch feels a bit over-engineered, but I'd like to know what you
    think. It would be great if you could test/debug the windows NLS-
    enabled paths.
    
    I'm also not sure what to do about the NetBSD path. NetBSD has no
    uselocale(), so I have to fall bad to temporary setlocale(), which is
    not thread safe. And I'm getting a mysterious error in test_aio for
    NetBSD, which I haven't investigated yet.
    
    Regards,
    	Jeff Davis
    
    
  53. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-07-24T18:10:40Z

    On Wed, 2025-07-23 at 19:11 -0700, Jeff Davis wrote:
    > The patch feels a bit over-engineered, but I'd like to know what you
    > think. It would be great if you could test/debug the windows NLS-
    > enabled paths.
    
    Let me explain how it ended up looking over-engineered, and perhaps
    someone has a simpler solution.
    
    For gettext, we already configure the encoding with
    bind_textdomain_codeset(). All it needs is LC_MESSAGES set properly,
    which can be done with uselocale(), as a semi-permanent setting until
    the next GUC change, just like setlocale() today. There are a couple
    minor problems for platforms without uselocale(). For windows, we could
    just permanently do:
    
      _configthreadlocale(_ENABLE_PER_THREAD_LOCALE)
    
    and then use _wsetlocale. For NetBSD, I don't have a solution, but
    perhaps we can just reject new lc_messages settings after startup, or
    just defer the problem until threading actually becomes a pressing
    issue.
    
    The main problem is with strerror_r(). To get the right LC_MESSAGES
    setting, we need the separate path for windows (which has neither
    uselocale() nor strerror_l()). Because we need to keep track of that
    path anyway, I used it for gettext as well to have a cleaner separation
    for the entire message translation locale. That means we can avoid
    permanent locale settings, and reduce the chances that we accidentally
    depend on the global locale.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  54. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-10-14T23:26:47Z

    On Thu, 2025-07-24 at 11:10 -0700, Jeff Davis wrote:
    > The main problem is with strerror_r()...
    
    Postgres messages, like "division by zero" are translated just fine
    without LC_CTYPE; gettext() only needs LC_MESSAGES and the server
    encoding. So these are fine.
    
    We use strerror_r() to translate the system errno into a readable
    message, like "No such file or directory", i.e. the %m replacements.
    That needs LC_CTYPE set (just for the encoding, not the
    language/region) as well as LC_MESSAGES (for the language/region).
    
    When using a locale provider other than libc, it's unfortunate to
    require LC_CTYPE to be set for just this one single purpose. The locale
    itself, e.g. the "en_US" part, is not used at all; only the encoding
    part of the setting is relevant. And there is no value other than "C"
    that works on all platforms. It's fairly confusing to explain why the
    LC_CTYPE setting is required for the builtin or ICU providers at all.
    Also, while it's far from the biggest challenge when it comes to
    multithreading, it does cause thread-safety headaches on platforms
    without uselocale().
    
    Perhaps we could get the ASCII message and run it through gettext()?
    That would be extra work for translators, but perhaps not a lot, given
    that it's a small and static set of messages in practice. That would
    also have the benefit that either NLS is enabled or not -- right now,
    since the translation happens in two different ways you can end up with
    partially-translated messages. It would also result in consistent
    translations across platforms.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  55. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-10-29T00:19:50Z

    On Wed, 2025-07-23 at 19:11 -0700, Jeff Davis wrote:
    > On Fri, 2025-07-11 at 11:48 +1200, Thomas Munro wrote:
    > > On Fri, Jul 11, 2025 at 6:22 AM Jeff Davis <pgsql@j-davis.com>
    > > wrote:
    > > > I don't have a great windows development environment, and it
    > > > appears CI
    > > > and the buildfarm don't offer great coverage either. Can I ask
    > > > for
    > > > a
    > > > volunteer to do the windows side of this work?
    > > 
    > > Me neither but I'm willing to help with that, and have done lots of
    > > closely related things through trial-by-CI...
    
    Attached a new patch series, v6.
    
    Rather than creating new global locale_t objects, this series (along
    with a separate patch for NLS[1]) removes the dependency on the global
    LC_CTYPE entirely. It's a bunch of small patches that replace direct
    calls to tolower()/toupper() with calls into the provider.
    
    An assumption of these patches is that, in the UTF-8 encoding, the
    logic in pg_tolower()/pg_toupper() is equivalent to
    pg_ascii_tolower()/pg_ascii_toupper().
    
    Generally these preserve existing behavior, but there are a couple
    differences:
    
      * If using the builtin C locale (not C.UTF-8) along with a datctype
    that's a non-C locale with single-byte encoding, it could affect the
    results of downcase_identifier(), ltree, and fuzzystrmatch on
    characters > 127. For ICU, I went to a bit of extra effort to preserve
    the existing behavior here, because it's more likely to be used for
    single-byte encodings.
    
      * When using ICU or builtin C.UTF-8, along with a datctype of
    "tr_TR.UTF-8", then it will affect ltree's and fuzzystrmatch's
    treatment of i/I.
    
    If these are a concern we can fix them with some hacks, but those
    behaviors seem fairly obscure to me.
    
    Regards,
    	Jeff Davis
    
    [1]
    https://www.postgresql.org/message-id/90f176c5b85b9da26a3265b2630ece3552068566.camel@j-davis.com
    
    
  56. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-10-30T17:17:03Z

    On Tue, 2025-10-28 at 17:19 -0700, Jeff Davis wrote:
    > Attached a new patch series, v6.
    
    I'm eager to start committing this series so that we have plenty of
    time to sort out any problems. I welcome feedback before or after
    commit, and I can revert if necessary.
    
    The goal here is to do a permanent:
    
       setlocale(LC_CTYPE, "C")
    
    in the postmaster, and instead use _l() variants where necessary.
    
    Forcing the global LC_CTYPE to C will avoid platform-specific nuances
    spread throughout the code, and prevent new code from accidentally
    depending on platform-specific libc behavior. Instead, libc ctype
    behavior will only happen through a pg_locale_t object.
    
    It also takes us a step closer to thread safety.
    
    LC_COLLATE was already permenently set to "C" (5e6e42e4), and most of
    LC_CTYPE behavior already uses a pg_locale_t object. This series is
    about removing the last few places that rely on raw calls to
    tolower()/toupper() (sometimes through pg_tolower()). Where there isn't
    a pg_locale_t immediately available it uses the database default locale
    (which might or might not be libc).
    
    There's another thread for what to do about strerror_r[1], which
    depends on LC_CTYPE for the encoding:
    
    https://www.postgresql.org/message-id/90f176c5b85b9da26a3265b2630ece3552068566.camel@j-davis.com
    
    pg_localeconv_r() does depend on the LC_CTYPE for the encoding, but it
    already sets it from lc_monetary and lc_numeric, without using datctype
    or the global setting. Then PGLC_localeconv() converts to the database
    encoding, if necessary. So it's an exception to the rule that all ctype
    behavior goes through a pg_locale_t, but it's not a problem. (Aside: we
    could consider this approach as a narrower fix for strerror_r(), as
    well.)
    
    There may be a loose end around plperl, as well, but not sure if this
    will make it any worse.
    
    Some other LC_* settings still rely on setlocale(), which can be
    considered separately unless there's some interaction that I missed.
    
    Note that the datcollate and datctype fields are already mostly
    irrelevant for non-libc providers. We could set those to NULL, but for
    now I don't intend to do that.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  57. Re: Remaining dependency on setlocale()

    Daniel Verite <daniel@manitou-mail.org> — 2025-10-30T20:41:50Z

    	Jeff Davis wrote:
    
    > The goal here is to do a permanent:
    > 
    >   setlocale(LC_CTYPE, "C")
    > 
    > in the postmaster, and instead use _l() variants where necessary.
    
    What about code in extensions? AFAIU a user can control the 
    locale in effect by setting the LC_CTYPE argument of
    CREATE DATABASE, which ends up in the environment
    of backends serving that database.
    If it's forced to "C", how can an extension use locale-aware
    libc functions?
    
    In theory it's the same problem with LC_COLLATE, except
    that functions like tolower()/toupper() are much more likely
    to be used in extensions than strcoll().
    
    
    Best regards,
    -- 
    Daniel Vérité 
    https://postgresql.verite.pro/
    
    
    
    
  58. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-10-30T20:53:43Z

    On Thu, 2025-10-30 at 21:41 +0100, Daniel Verite wrote:
    > What about code in extensions? AFAIU a user can control the 
    > locale in effect by setting the LC_CTYPE argument of
    > CREATE DATABASE, which ends up in the environment
    > of backends serving that database.
    > If it's forced to "C", how can an extension use locale-aware
    > libc functions?
    
    Extensions often need to be updated for a new major version.
    
    The extension should call pg_database_locale(), and pass that to a
    function exposed in pg_locale.h. A recent commit exposed pg_iswalpha(),
    etc., so there's a reasonable set of functions that should be suitable
    for most purposes.
    
    If it's not available in pg_locale.h, or the extension really needs to
    use a different LC_CTYPE for some reason, it can use an _l() variant of
    the function.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
    
  59. Re: Remaining dependency on setlocale()

    Peter Eisentraut <peter@eisentraut.org> — 2025-10-31T09:40:41Z

    On 30.10.25 18:17, Jeff Davis wrote:
    > On Tue, 2025-10-28 at 17:19 -0700, Jeff Davis wrote:
    >> Attached a new patch series, v6.
    > 
    > I'm eager to start committing this series so that we have plenty of
    > time to sort out any problems. I welcome feedback before or after
    > commit, and I can revert if necessary.
    
    What is one supposed to do with this statement?  You post a series of 9 
    patches and the next day you say you are eager to commit it?  Do you not 
    want to give others the time to properly review this?  The patches say 
    they are "v6", but AFAICT the previous patches "v5" and "v4" in this 
    thread are substantially different from these.
    
    > The goal here is to do a permanent:
    > 
    >     setlocale(LC_CTYPE, "C")
    > 
    > in the postmaster, and instead use _l() variants where necessary.
    > 
    > Forcing the global LC_CTYPE to C will avoid platform-specific nuances
    > spread throughout the code, and prevent new code from accidentally
    > depending on platform-specific libc behavior. Instead, libc ctype
    > behavior will only happen through a pg_locale_t object.
    > 
    > It also takes us a step closer to thread safety.
    
    At first glance, these patches seem reasonable steps into that direction.
    
    But I'm not sure that we actually want to make that switch.  It would be 
    good if our code is independent of the global locale settings, but that 
    doesn't mean that there couldn't be code in extensions, other libraries, 
    or other corners of the operating system that relies on this.  In 
    general, and I haven't looked this up in the applicable standards, it 
    seems like a good idea to accurately declare what encoding you operate in.
    
    
    
    
    
  60. Re: Remaining dependency on setlocale()

    Daniel Verite <daniel@manitou-mail.org> — 2025-10-31T14:01:39Z

    	Jeff Davis wrote:
    
    > On Thu, 2025-10-30 at 21:41 +0100, Daniel Verite wrote:
    > > What about code in extensions? AFAIU a user can control the 
    > > locale in effect by setting the LC_CTYPE argument of
    > > CREATE DATABASE, which ends up in the environment
    > > of backends serving that database.
    > > If it's forced to "C", how can an extension use locale-aware
    > > libc functions?
    > 
    > Extensions often need to be updated for a new major version.
    
    I think forcing the C locale is not comparable to API changes,
    and the consequences are not even necessarily fixable for extensions.
    
    For instance, consider the following function, when run in a database
    with en_US.utf8 as locale.
    
    CREATE FUNCTION lt_test(text,text) RETURNS boolean as $$
     use locale; return ($_[0] lt $_[1])?1:0;
    $$ LANGUAGE plperlu;
    
    select lt_test('a', 'B');
    
    With PG 18 it returns true
    With 19devel it returns false.
    
    This is since commit 5e6e42e4 doing that:
    
    +	 * Collation is handled by pg_locale.c, and the behavior is dependent
    on
    +	 * the provider. strcoll(), etc., should not be called directly.
    +	 */
    +	init_locale("LC_COLLATE", LC_COLLATE, "C");
    +
    +	/*
    
    Obviously libperl is not going to be updated to call Postgres
    string comparisons functions instead of strcoll().
    The same is probably true for other languages available as
    extensions that expose POSIX locale-aware functions.
    
    Extending this logic to LC_CTYPE will extend the breakage.
    
    
    While I agree with the goal of not depending on setlocale()
    in the core code for anything that should be locale-provider
    dependent, making this goal leak into extensions seems
    unnecessarily user-hostile. What it's saying to users is,
    before v19 you could choose your locale, and starting
    with v19 you'll have "C" whether you want it or not.
    
    
    Best regards,
    -- 
    Daniel Vérité 
    https://postgresql.verite.pro/
    
    
    
    
  61. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-10-31T15:25:46Z

    On Fri, 2025-10-31 at 15:01 +0100, Daniel Verite wrote:
    > > Extensions often need to be updated for a new major version.
    > 
    > I think forcing the C locale is not comparable to API changes,
    > and the consequences are not even necessarily fixable for extensions.
    
    Are we in agreement that it's fine for C extensions?
    
    > For instance, consider the following function, when run in a database
    > with en_US.utf8 as locale.
    > 
    > CREATE FUNCTION lt_test(text,text) RETURNS boolean as $$
    >  use locale; return ($_[0] lt $_[1])?1:0;
    > $$ LANGUAGE plperlu;
    > 
    > select lt_test('a', 'B');
    
    Are you aware of PL code that does things like that? If the database
    locale is ICU, that would be at least a little bit confusing.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  62. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-10-31T19:59:20Z

    On Fri, 2025-10-31 at 10:40 +0100, Peter Eisentraut wrote:
    
    > But I'm not sure that we actually want to make that switch.  It would
    > be 
    > good if our code is independent of the global locale settings, but
    > that 
    > doesn't mean that there couldn't be code in extensions, other
    > libraries, 
    > or other corners of the operating system that relies on this.
    
    This question has been brewing for a while. How should we make this
    decision?
    
    >   In 
    > general, and I haven't looked this up in the applicable standards, it
    > seems like a good idea to accurately declare what encoding you
    > operate in.
    
    One frustration (for me, at least) is that there is no way to set the
    encoding without specifying the locale. LC_CTYPE=C.UTF-8 is close, but
    the libc version is not available on all platforms and has some quirks.
    
    That makes any changes to the initdb default logic difficult to sort
    out. Some combinations which seem simple -- like ICU/UTF8 -- need to
    handle the case when LC_CTYPE is not compatible with UTF-8, even though
    the LC_CTYPE has no effect in that case.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  63. Re: Remaining dependency on setlocale()

    Daniel Verite <daniel@manitou-mail.org> — 2025-11-03T19:14:03Z

    	Jeff Davis wrote:
    
    > > > Extensions often need to be updated for a new major version.
    > > 
    > > I think forcing the C locale is not comparable to API changes,
    > > and the consequences are not even necessarily fixable for extensions.
    > 
    > Are we in agreement that it's fine for C extensions?
    
    No, I think we should put the database's lc_ctype
    into LC_CTYPE and the database's lc_collate into
    LC_COLLATE, independently of anything else,
    like it was done until commit 5e6e42e.
    I believe that's the purpose of these database
    properties, whether the provider is libc or ICU or builtin.
    
    Forcing "C" is a disruptive change, that IMO does
    not seem compensated by substantial advantages
    that would justify the disruption.
    
    
    > > CREATE FUNCTION lt_test(text,text) RETURNS boolean as $$
    > >  use locale; return ($_[0] lt $_[1])?1:0;
    > > $$ LANGUAGE plperlu;
    > > 
    > > select lt_test('a', 'B');
    > 
    > Are you aware of PL code that does things like that? If the database
    > locale is ICU, that would be at least a little bit confusing.
    
    plperl users writing "use locale" should understand that
    it's the libc locale, like when this code is run outside Postgres.
    
    
    Best regards,
    -- 
    Daniel Vérité 
    https://postgresql.verite.pro/
    
    
    
    
  64. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-11-03T19:59:57Z

    On Mon, 2025-11-03 at 20:14 +0100, Daniel Verite wrote:
    > No, I think we should put the database's lc_ctype
    > into LC_CTYPE and the database's lc_collate into
    > LC_COLLATE, independently of anything else,
    > like it was done until commit 5e6e42e.
    > I believe that's the purpose of these database
    > properties, whether the provider is libc or ICU or builtin.
    
    Is there a clean way to document this behavior? I have tried to improve
    the documentation in this area before, but it's not easy because the
    behavior is so nuanced.
    
    The CREATE DATABASE command needs LOCALE (or LC_CTYPE/LC_COLLATE). But
    it's easy to get LOCALE and ICU_LOCALE or BUILTIN_LOCALE confused. And
    similarly for initdb. We could clean those up a lot if
    LC_CTYPE/LC_COLLATE didn't even need to be set for non-libc providers.
    
    Users can end up in a situation where they need to define
    LC_CTYPE/LC_COLLATE, even though it has almost (but not entirely) no
    effect:
    
    https://www.postgresql.org/message-id/f794e177b0b1ed8917e75258726ae315cf8fbbef.camel%40j-davis.com
    
    
    Reverting commit 5e6e42e may be the right thing, but I'd like to hear
    what others have to say on this point first. In particualr, I'd like to
    know whether such a revert is based on principle, a practical problem,
    or just an abundance of caution.
    
    Another option we have here is LC_CTYPE=LC_COLLATE=C for non-libc
    providers, but leave it as in v17 for libc providers. If someone
    explicitly wants libc behavior (by specifying something like "use
    locale" in plperl), they probably want to be using the libc provider
    for the database.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  65. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-11-03T21:54:13Z

    On Mon, 2025-11-03 at 11:59 -0800, Jeff Davis wrote:
    > Reverting commit 5e6e42e may be the right thing, but I'd like to hear
    > what others have to say on this point first. In particualr, I'd like
    > to
    > know whether such a revert is based on principle, a practical
    > problem,
    > or just an abundance of caution.
    > 
    > Another option we have here is LC_CTYPE=LC_COLLATE=C for non-libc
    > providers, but leave it as in v17 for libc providers. If someone
    > explicitly wants libc behavior (by specifying something like "use
    > locale" in plperl), they probably want to be using the libc provider
    > for the database.
    
    Actually, there's yet another option: lc_ctype and lc_collate could be
    GUCs again. If they don't affect any backend behavior, then why not?
    The user would have complete control.
    
    (Probably PGC_POSTMASTER, because one of the goals is to not rely on
    setlocale() in the backends.)
    
    Of course, we need to be sure that they are compatible with the
    database encoding. We have code to do that, but not sure how reliable
    it is across platforms.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  66. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-11-10T20:39:28Z

    On Mon, 2025-11-03 at 20:14 +0100, Daniel Verite wrote:
    > No, I think we should put the database's lc_ctype
    > into LC_CTYPE and the database's lc_collate into
    > LC_COLLATE, independently of anything else,
    > like it was done until commit 5e6e42e.
    > I believe that's the purpose of these database
    > properties, whether the provider is libc or ICU or builtin.
    
    As phrased, that appears to be a promise that we will never support
    thread-per-connection. setlocale() is not thread-safe, and uselocale()
    is not available on NetBSD.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  67. Re: Remaining dependency on setlocale()

    Peter Eisentraut <peter@eisentraut.org> — 2025-11-12T18:41:58Z

    On 03.11.25 20:14, Daniel Verite wrote:
    > No, I think we should put the database's lc_ctype
    > into LC_CTYPE and the database's lc_collate into
    > LC_COLLATE, independently of anything else,
    > like it was done until commit 5e6e42e.
    > I believe that's the purpose of these database
    > properties, whether the provider is libc or ICU or builtin.
    > 
    > Forcing "C" is a disruptive change, that IMO does
    > not seem compensated by substantial advantages
    > that would justify the disruption.
    
     From my perspective, the difference between LC_COLLATE and LC_CTYPE is 
    that LC_COLLATE has a quite limited impact area.  Either your code uses 
    strcoll() (or strxfrm()) or it does not.  And if it does, you can find 
    all the places and adjust them, and it probably won't be that many 
    places.  The impact area of LC_CTYPE is much larger and more complicated 
    and possibly interacts with other settings and third-party libraries in 
    ways that we don't understand yet and might not be able to change. 
    That's why I'm more hesitant about it.  But I don't see any reason to 
    keep LC_COLLATE set going forward.
    
    
    
    
    
  68. Re: Remaining dependency on setlocale()

    Peter Eisentraut <peter@eisentraut.org> — 2025-11-12T18:59:33Z

    On 29.10.25 01:19, Jeff Davis wrote:
    > On Wed, 2025-07-23 at 19:11 -0700, Jeff Davis wrote:
    >> On Fri, 2025-07-11 at 11:48 +1200, Thomas Munro wrote:
    >>> On Fri, Jul 11, 2025 at 6:22 AM Jeff Davis <pgsql@j-davis.com>
    >>> wrote:
    >>>> I don't have a great windows development environment, and it
    >>>> appears CI
    >>>> and the buildfarm don't offer great coverage either. Can I ask
    >>>> for
    >>>> a
    >>>> volunteer to do the windows side of this work?
    >>>
    >>> Me neither but I'm willing to help with that, and have done lots of
    >>> closely related things through trial-by-CI...
    > 
    > Attached a new patch series, v6.
    > 
    > Rather than creating new global locale_t objects, this series (along
    > with a separate patch for NLS[1]) removes the dependency on the global
    > LC_CTYPE entirely. It's a bunch of small patches that replace direct
    > calls to tolower()/toupper() with calls into the provider.
    > 
    > An assumption of these patches is that, in the UTF-8 encoding, the
    > logic in pg_tolower()/pg_toupper() is equivalent to
    > pg_ascii_tolower()/pg_ascii_toupper().
    
    I'm getting a bit confused by all these different variant function 
    names.  Like we have now
    
    tolower
    TOLOWER
    char_tolower
    pg_tolower
    pg_strlower
    pg_ascii_tolower
    downcase_identifier
    
    and maybe more, and upper versions.
    
    This patch set makes changes like
    
    -           else if (IS_HIGHBIT_SET(ch2) && isupper(ch2))
    -               ch2 = tolower(ch2);
    +           else if (IS_HIGHBIT_SET(ch2))
    +               ch2 = TOLOWER(ch2);
    
    So there is apparently some semantic difference between tolower() and 
    TOLOWER(), which is represented by the fact that the function name is 
    all upper case?  Actually, it's a macro and could mean different things 
    in different contexts.
    
    And there is very little documentation accompanying all these different 
    functions.  For example, struct collate_methods and struct ctype_methods 
    contain barely any documentation at all.
    
    Many of these issues are pre-existing, but I just figured it has reached 
    a point where we need to do something about it.
    
    
    
    
    
  69. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-11-12T18:59:35Z

    On Wed, 2025-11-12 at 19:41 +0100, Peter Eisentraut wrote:
    > The impact area of LC_CTYPE is much larger and more complicated 
    > and possibly interacts with other settings and third-party libraries
    > in 
    > ways that we don't understand yet and might not be able to change. 
    > That's why I'm more hesitant about it.
    
    What do you think about making lc_ctype and/or lc_collate into GUCs
    (like lc_messages), assuming we remove all known effects in the backend
    first?
    
    If we make the setting PGC_POSTMASTER, then it eliminates potential
    problems with threading, because setlocale() would be called only once
    before allowing connections. For platforms that support uselocale(), we
    could allow it to be set more freely, for those who need it set to
    different values in different backends.
    
    It would also be easier to document, which would be nice. There could
    be some confusion if various settings are inconsistent with each other,
    but that's true currently. And we'd still enforce a ctype that's
    consistent with the encoding, at least.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  70. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-11-15T00:23:11Z

    On Wed, 2025-11-12 at 19:59 +0100, Peter Eisentraut wrote:
    > I'm getting a bit confused by all these different variant function 
    > names. 
    
    One way of looking at it is that the functions in this patch series
    mostly affect how identifiers are treated, whereas earlier collation-
    related work affects how text data is treated. Ideally, they should be
    similar, but for historical reasons they're not.
    
    There are a lot of subtle behaviors for identifiers, which individually
    make some sense, but over time have just become edge cases and sources
    of inconsistency:
    
    downcase_identifier() is a server function to casefold unquoted
    identifiers during parsing (used by other callers, too). For non-ascii
    characters in a single-byte encoding, it uses tolower(); otherwise the
    lowercasing is ascii-only. Note: if an application is reliant on the
    casefolding of non-ascii identifiers, such as SELECT * FROM É finding
    the table named "é", that application would not work in UTF-8 even with
    a dump/restore.
    
    pg_strcasecmp() and pg_tolower() are used from the server and the
    client to do case-insensitive comparison of option names. They're
    supposed to use the same casing semantics as downcase_identifier(), but
    they don't differentiate between single-byte and multi-byte encodings;
    they just call tolower() on any non-ascii byte. That difference
    probably doesn't matter for UTF8, because tolower() on a single byte in
    a multibyte sequence should be a no-op, but perhaps it can matter in
    non-UTF-8 multibyte encodings.
    
    It's hard to avoid some confusion unless we're able to simplify some of
    these behaviors. Let me know if you think we can tolerate some
    simplifications in these edge cases without breaking anything too
    badly.
    
    
    > Many of these issues are pre-existing, but I just figured it has
    > reached 
    > a point where we need to do something about it.
    
    Starting from first principles, individual character operations should
    be mostly for parsing (e.g. tsearch) or pattern matching. Case folding
    and caseless matching should be done with string operations. And
    obviously all of this should be multibyte aware and work consistently
    in different encodings (to the extent possible given the
    representational constraints).
    
    Our APIs in pg_locale.c do a good job of offering that, and do not
    depend on the global LC_CTYPE. (There are a few things I'd like to add
    or clean up, but it offers most of what we need.)
    
    The problem, of course, is migrating the callers to use pg_locale.c
    APIs without breaking things. This patch series is intended to make
    everything locale-sensitive in the backend go through pg_locale_t
    without any behavior changes. The benefit is that it would at least
    remove the global LC_CTYPE dependency, but it ends up with hacky
    compatibility methods like char_tolower(), which piles on to the
    already-confusing set of tolower-like functions.
    
    In an earlier approach:
    
    https://www.postgresql.org/message-id/5f95b81af1e81b28b8a9ac5929f199b2f4091fdf.camel@j-davis.com
    
    I added a strfold_ident() method. That's easier to understand for
    downcase_identifier(), but didn't solve the problems for other
    callsites that depend on tolower(), and so I'd need to add more methods
    for those places, and started to look unpleasant.
    
    And earlier in this thread, I had tried the approach of using a global
    variable to hold a locale representing datctype. That felt a bit weird,
    though, because it mostly only matters when datlocprovider='c', and in
    that case, there's already a locale_t initialized along with the
    default collation. So why not find a way to go through the default
    collation?
    
    I still favor the approach used in the current patch series to remove
    the dependency on the global LC_CTYPE, but I'm open to suggestion.
    Whatever we do will probably require some additional hacking later
    anyway.
    
    I tried to improve the comments in pgstrcasecmp.c, and I rebased.
    
    Regards,
    	Jeff Davis
    
    
  71. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-11-21T00:58:16Z

    On Wed, 2025-11-12 at 19:59 +0100, Peter Eisentraut wrote:
    > Many of these issues are pre-existing, but I just figured it has
    > reached 
    > a point where we need to do something about it.
    
    I tried to simplify things in this patch series, assuming that we have
    some tolerance for small behavior changes.
    
    0001: No behavior change here, same patch as before. Uncontroversial
    simplification, so I plan to commit this soon.
    
    0002: change fuzzystrmatch to use ASCII semantics. As far as I can
    tell, this only affects the results of soundex(). Before the patch, in
    en_US.iso885915, soundex('réd') was 'RÉ30', after the patch it's
    'Ré30'. I'm not sure whether the current behavior is intentional or
    not. Other functions (daitch_mokotoff, levenshtein, and metaphone) are
    unaffected as far as I can tell.
    
    0003+0005: change ltree to use case folding instead of tolower(). I
    believe this is a bug fix, because the current code is inconsistent
    between ltree_strncasecmp() and ltree_crc32_sz().
    
    0006-0007: Remove char_tolower() API. This also removes the
    optimization for single-byte encodings with the libc provider and a
    non-C locale, but simplifies the code (the optimization is retained for
    the C locale). It's possible to make the lazy-folding optimization work
    for all locales without the char_tolower() API by doing something
    simlar to what 0004 does for ltree. But to make this work efficiently
    for Generic_Text_IC_like() would be a bit more complex: we'd need to
    adjust MatchText() to be able to fold the arguments lazily, and perhaps
    introduce some kind of casemapping iterator. That's already a pretty
    complex function, so I'm hesitant to do that work unless the
    optimization is important.
    
    These patches don't get us quite to the point of eliminating the
    LC_CTYPE dependency (there's still downcase_identifier() and
    pg_strcasecmp() to worry about, and some assorted isxyz() calls to
    examine), but they simplify things enough that the path forward will be
    easier.
    
    Regards,
    	Jeff Davis
    
    
  72. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-11-24T23:57:43Z

    On Thu, 2025-11-20 at 16:58 -0800, Jeff Davis wrote:
    > On Wed, 2025-11-12 at 19:59 +0100, Peter Eisentraut wrote:
    > > Many of these issues are pre-existing, but I just figured it has
    > > reached 
    > > a point where we need to do something about it.
    > 
    > I tried to simplify things in this patch series, assuming that we
    > have
    > some tolerance for small behavior changes.
    > 
    > 0001: No behavior change here, same patch as before. Uncontroversial
    > simplification, so I plan to commit this soon.
    
    Committed.
    
    New series attached, which I tried to put in an order that would be
    reasonable for commit.
    
    0001-0004: Pure refactoring patches. I intend to commit a couple of
    these soon.
    
    0005: No behavioral change, and not much change at all. Computes the
    "max_chr" for regexes (a performance optimization for low codepoints)
    more consistently and simply based on the encoding.
    
    0006: fixes longstanding ltree bug due to inconsistency between the
    database locale and the global LC_CTYPE setting when using a non-libc
    provider. The end result is also cleaner: use the database locale
    consistently, like tsearch. I don't intend to backport this, unless
    someone thinks it should be, but it should come with a release note to
    reindex ltree indexes if using a non-libc provider.
    
    0007: remove the char_tolower() API completely. We'd lose a pattern
    matching optimization for single-byte encodings with libc and a non-C
    locale, but it's a significant simplification. We could go even further
    and change this to use casefolding rather than lower(), but that seems
    like a separate change.
    
    0008: Multibyte-aware extraction of pattern prefixes. The previous code
    gave up on any byte that it didn't understand, which made prefixes
    unnecessarily short. This patch is also cleaner.
    
    0009: Changes fuzzystrmatch to use pg_ascii_toupper(). Most functions
    in the extension are unaffected, but soundex() can be affected, and I'm
    not sure what exactly it's supposed to do with non-ASCII.
    
    0010: For downcase_identifier(), use a new provider-specific
    pg_strfold_ident() method. The ICU version of this method is a work-in-
    progress, because right now it depends on libc. I suppose it should
    decode to UTF-32, then go through u_tolower(), then re-encode -- but
    can the re-encoding fail? In any case, it would be a behavior change
    for identifier casefolding with ICU and a single-byte encoding, which
    is probably OK but the risk is non-zero.
    
    0011: POC patch to introduce lc_collate GUC. It would only affect
    extensions, PLs, libraries, or other non-core code that happens to call
    strcoll() or strxfrm(). This would address Daniel's complaint, but it's
    more flexible. And by being a GUC, it's clear that we shouldn't depend
    on it for any stored data. We can do something similar for LC_CTYPE
    after we eliminate dependencies in core code.
    
    Regards,
    	Jeff Davis
    
    
  73. Re: Remaining dependency on setlocale()

    Chao Li <li.evan.chao@gmail.com> — 2025-11-25T01:44:51Z

    Hi Jeff,
    
    I have reviewed 0001-0004 and got a few comments:
    
    > On Nov 25, 2025, at 07:57, Jeff Davis <pgsql@j-davis.com> wrote:
    > 
    > 
    > 0001-0004: Pure refactoring patches. I intend to commit a couple of
    > these soon.
    > 
    
    1 - 0001
    ```
    +/*
    + * Fold a character to upper case, following C/POSIX locale rules.
    + */
    +static inline unsigned char
    +pg_ascii_toupper(unsigned char ch)
    ```
    
    I was curious why “inline” is needed, then I figured out when I tried to build. Without “inline”, compile will raise warnings of “unused function”. So I guess it’s better to explain why “inline” is used in the function comment, otherwise other readers may get the same confusion.
    
    2 - 0002
    ```
    + * three output codepoints. See Unicode 5.18.2, "Change in Length".
    ```
    
    With “change in length”, I confirmed “Unicode 5.18.2” means the Unicode Standard Section 5.18.2 “Complications for Case Mapping”. Why don’t we just give the URL in the comment. https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-5/#G29675
    
    3 - 0004
    ```
    @@ -1282,10 +1327,18 @@ size_t
     pg_strfold(char *dst, size_t dstsize, const char *src, ssize_t srclen,
     		   pg_locale_t locale)
     {
    -	if (locale->ctype->strfold)
    -		return locale->ctype->strfold(dst, dstsize, src, srclen, locale);
    -	else
    -		return locale->ctype->strlower(dst, dstsize, src, srclen, locale);
    +	if (locale->ctype == NULL)
    +	{
    +		int			i;
    +
    +		srclen = (srclen >= 0) ? srclen : strlen(src);
    +		for (i = 0; i < srclen && i < dstsize; i++)
    +			dst[i] = pg_ascii_tolower(src[i]);
    +		if (i < dstsize)
    +			dst[i] = '\0';
    +		return srclen;
    +	}
    +	return locale->ctype->strfold(dst, dstsize, src, srclen, locale);
     }
    ```
    
    I don’t get this change. In old code, depending on locale->ctype->strfold, it calls strfold or strlower. But in this patch, it only calls strfold. Why? If that’s intentional, maybe better to add a comment to explain that.
    
    4 - 0004
    
    In pg_strfold, the ctype==NULL fallback code is exactly the same as pg_strlower. I guess you intentionally to not call pg_strlower here for performance consideration, is that true?
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
    
    
    
  74. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-11-25T17:20:01Z

    On Tue, 2025-11-25 at 09:44 +0800, Chao Li wrote:
    > I was curious why “inline” is needed, then I figured out when I tried
    > to build. Without “inline”, compile will raise warnings of “unused
    > function”. So I guess it’s better to explain why “inline” is used in
    > the function comment, otherwise other readers may get the same
    > confusion.
    
    That's a typical pattern: to make it "inline", move it to a .h file and
    declare it as "static inline". For common patterns like that, I don't
    think we should explain them in comments, because it would mean we
    would start adding comments in zillions of places.
    
    > With “change in length”, I confirmed “Unicode 5.18.2” means the
    > Unicode Standard Section 5.18.2 “Complications for Case Mapping”. Why
    > don’t we just give the URL in the comment.
    > https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-5/#G29675
    
    Done, thank you. (Though since we haven't actually moved to 17 yet, I
    linked to 16 instead.)
    
    > I don’t get this change. In old code, depending on locale->ctype-
    > >strfold, it calls strfold or strlower. But in this patch, it only
    > calls strfold. Why? If that’s intentional, maybe better to add a
    > comment to explain that.
    
    I thought it would be slightly cleaner to just define the strfold
    method in the libc provider as the same as strlower. I agree it's worth
    a comment, so I added some in pg_locale_libc.c.
    
    > In pg_strfold, the ctype==NULL fallback code is exactly the same as
    > pg_strlower. I guess you intentionally to not call pg_strlower here
    > for performance consideration, is that true?
    
    I made some static functions to clean that up, and added some comments.
    Thank you.
    
    New series attached with only these changes and a rebase.
    
    Regards,
    	Jeff Davis
    
    
  75. Re: Remaining dependency on setlocale()

    Chao Li <li.evan.chao@gmail.com> — 2025-11-26T01:50:54Z

    
    > On Nov 26, 2025, at 01:20, Jeff Davis <pgsql@j-davis.com> wrote:
    > 
    > 
    > New series attached with only these changes and a rebase.
    > 
    > Regards,
    > Jeff Davis
    > 
    > <v10-0001-Inline-pg_ascii_tolower-and-pg_ascii_toupper.patch><v10-0002-Add-define-for-UNICODE_CASEMAP_BUFSZ.patch><v10-0003-Change-some-callers-to-use-pg_ascii_toupper.patch><v10-0004-Allow-pg_locale_t-APIs-to-work-when-ctype_is_c.patch><v10-0005-Make-regex-max_chr-depend-on-encoding-not-provid.patch><v10-0006-Fix-inconsistency-between-ltree_strncasecmp-and-.patch><v10-0007-Remove-char_tolower-API.patch><v10-0008-Use-multibyte-aware-extraction-of-pattern-prefix.patch><v10-0009-fuzzystrmatch-use-pg_ascii_toupper.patch><v10-0010-downcase_identifier-use-method-table-from-locale.patch><v10-0011-Control-LC_COLLATE-with-GUC.patch>
    
    I continued reviewing 0005-0008.
    
    0005 - no comment. The change looks correct to me. Before the patch, pg_regex_locale->ctype->max_chr <= MAX_SIMPLE_CHR should be the more common cases, with the patch, MAX_SIMPLE_CHR >= UCHAR_MAX should be the more common cases, thus there should be not a behavior change.
    
    5 - 0006
    ```
    @@ -77,10 +77,37 @@ compare_subnode(ltree_level *t, char *qn, int len, int (*cmpptr) (const char *,
     int
     ltree_strncasecmp(const char *a, const char *b, size_t s)
     {
    -	char	   *al = str_tolower(a, s, DEFAULT_COLLATION_OID);
    -	char	   *bl = str_tolower(b, s, DEFAULT_COLLATION_OID);
    +	static pg_locale_t locale = NULL;
    +	size_t		al_sz = s + 1;
    +	char	   *al = palloc(al_sz);
    +	size_t		bl_sz = s + 1;
    +	char	   *bl = palloc(bl_sz);
    +	size_t		needed;
     	int			res;
     
    +	if (!locale)
    +		locale = pg_database_locale();
    +
    +	needed = pg_strfold(al, al_sz, a, s, locale);
    +	if (needed + 1 > al_sz)
    +	{
    +		/* grow buffer if needed and retry */
    +		al_sz = needed + 1;
    +		al = repalloc(al, al_sz);
    +		needed = pg_strfold(al, al_sz, a, s, locale);
    +		Assert(needed + 1 <= al_sz);
    +	}
    +
    +	needed = pg_strfold(bl, bl_sz, b, s, locale);
    +	if (needed + 1 > bl_sz)
    +	{
    +		/* grow buffer if needed and retry */
    +		bl_sz = needed + 1;
    +		bl = repalloc(bl, bl_sz);
    +		needed = pg_strfold(bl, bl_sz, b, s, locale);
    +		Assert(needed + 1 <= bl_sz);
    +	}
    +
     	res = strncmp(al, bl, s);
     
     	pfree(al);
    ```
    
    I do think the new implementation has some problem.
    
    * The retry logic implies that a single-byte char may become multiple bytes after folding, otherwise retry is not needed because you have allocated s+1 bytes for dest buffers. From this perspective, we should use two needed variables: neededA and neededB, if neededA != neededB, then the two strings are different; if neededA == neededB, then we should be perform strncmp, but here we should pass neededA (or neededB as they are identical) to strncmp(al, bl, neededA).
    
    * Based on the logic you implemented in 0004, first pg_strfold() has copied as many chars as possible to dest buffer, so when retry, ideally we can should resume instead of start over. However, if single-byte->multi-byte folding happens, we have no information to decide from where to resume. From this perspective, in 0004, do we really need to take the try-the-best strategy for strlower_c()? If there are some other use cases that require data to be placed in dest buffer even if dest buffer doesn’t have enough space, then my patch [1] of changing strlower_libc_sb() should be considered.
    
    6 - 0007
    ```
     	/*
    -	 * For efficiency reasons, in the single byte case we don't call lower()
    -	 * on the pattern and text, but instead call SB_lower_char on each
    -	 * character.  In the multi-byte case we don't have much choice :-(. Also,
    -	 * ICU does not support single-character case folding, so we go the long
    -	 * way.
    +	 * For efficiency reasons, in the C locale we don't call lower() on the
    +	 * pattern and text, but instead call SB_lower_char on each character.
     	 */
    ```
    
    SB_lower_char should be changed to C_IMatchText.
    
    7 - 0007
    ```
    /* setup to compile like_match.c for single byte case insensitive matches */
    -#define MATCH_LOWER(t, locale) SB_lower_char((unsigned char) (t), locale)
    +#define MATCH_LOWER
     ```
    
    I think the comment should be updated accordingly, like “for ILIKE in the C locale”.
    
    
    8 - 0008
    ```
    +	/* for text types, use pg_wchar; for BYTEA, use char */
     	if (typeid != BYTEAOID)
     	{
    -		patt = TextDatumGetCString(patt_const->constvalue);
    -		pattlen = strlen(patt);
    +		text	   *val = DatumGetTextPP(patt_const->constvalue);
    +		pg_wchar   *wpatt;
    +		pg_wchar   *wmatch;
    +		char	   *match;
    +
    +		patt = VARDATA_ANY(val);
    +		pattlen = VARSIZE_ANY_EXHDR(val);
    +		wpatt = palloc((pattlen + 1) * sizeof(pg_wchar));
    +		wmatch = palloc((pattlen + 1) * sizeof(pg_wchar));
    +		pg_mb2wchar_with_len(patt, wpatt, pattlen);
    +
    +		match = palloc(pattlen + 1);
    ```
    
    * match is allocated with pattlen+1 bytes, is it long enough to hold pattlen multiple-byte chars?
    
    * match is not freed, but looks like it should be: 
    
    *prefix_const = string_to_const(match, typeid);
     -> string_to_datum(str, datatype);
     -> CStringGetTextDatum(str);
     -> cstring_to_text(s)
     -> cstring_to_text_with_len(s, strlen(s));
     -> *result = (text *) palloc(len + VARHDRSZ);
    
    So, match has been copied, it should be freed.
    
    9 - 0008
    ```
    -	}
    +			/* Backslash escapes the next character */
    +			if (patt[pos] == '\\')
    +			{
    +				pos++;
    +				if (pos >= pattlen)
    +					break;
    +			}
     
    -	match[match_pos] = '\0';
    +			match[match_pos++] = pos;
    +		}
    ```
    
    Should “pos” be “part[pos]” assigning to match[match_pos++]?
    
    I will review the rest 3 commits tomorrow.
    
    [1] https://postgr.es/m/CAEoWx2m9mUN397neL=p9x0vaVcj5EGiKD53F1MNTwTDXizxiaA@mail.gmail.com
    
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
    
    
    
  76. Re: Remaining dependency on setlocale()

    Chao Li <li.evan.chao@gmail.com> — 2025-11-27T01:08:16Z

    
    > On Nov 26, 2025, at 09:50, Chao Li <li.evan.chao@gmail.com> wrote:
    > 
    > I will review the rest 3 commits tomorrow.
    
    10 - 0009
    ```
     {
     	if (isalpha((unsigned char) c))
     	{
    -		c = toupper((unsigned char) c);
    +		c = pg_ascii_toupper((unsigned char) c);
    ```
    
    Just curious. As isaplha() and toupper() come from the same header file ctype.h, if we replace toupper with pg_ascii_toupper, does isapha also need to be handled?
    
    11 - 0010
    ```
    -	for (i = 0; i < len; i++)
    -	{
    -		unsigned char ch = (unsigned char) ident[i];
    +	dstsize = len + 1;
    +	result = palloc(dstsize);
     
    -		if (ch >= 'A' && ch <= 'Z')
    -			ch += 'a' - 'A';
    -		else if (enc_is_single_byte && IS_HIGHBIT_SET(ch) && isupper(ch))
    -			ch = tolower(ch);
    -		result[i] = (char) ch;
    -	}
    -	result[i] = '\0';
    +	needed = pg_strfold_ident(result, dstsize, ident, len);
    +	Assert(needed + 1 == dstsize);
    +	Assert(needed == len);
    ```
    
    I think assert both dstsize and len are redundant, because dstsize=len+1, and no place to change their values.
    
    12 - 0010
    ```
    +/*
    + * Fold an identifier using the database default locale.
    + *
    + * For historical reasons, does not use ordinary locale behavior. Should only
    + * be used for identifier folding. XXX: can we make this equivalent to
    + * pg_strfold(..., default_locale)?
    + */
    +size_t
    +pg_strfold_ident(char *dest, size_t destsize, const char *src, ssize_t srclen)
    +{
    +	if (default_locale == NULL || default_locale->ctype == NULL)
    +	{
    +		int			i;
    +
    +		for (i = 0; i < srclen && i < destsize; i++)
    +		{
    +			unsigned char ch = (unsigned char) src[i];
    +
    +			if (ch >= 'A' && ch <= 'Z')
    +				ch += 'a' - 'A';
    +			dest[i] = (char) ch;
    +		}
    +
    +		if (i < destsize)
    +			dest[i] = '\0';
    +
    +		return srclen;
    +	}
    +	return default_locale->ctype->strfold_ident(dest, destsize, src, srclen,
    +												default_locale);
    +}
    ```
    
    Given default_local can be NULL only at some specific moment, can we do something like
    
    Local = default_local;
    If (local == NULL || local->ctype == NULL)
      Local = libc or other fallback;
    Return default_locale->ctype->strfold_ident(dest, destsize, src, srclen, local);
    
    This way avoids the duplicate code.
    
    13 - 0011
    ```
    +{ name => 'lc_collate', type => 'string', context => 'PGC_SUSET', group => 'CLIENT_CONN_LOCALE',
    +  short_desc => 'Sets the locale for text ordering in extensions.',
    ```
    
    I just feel the GUC name is very misleading. Without carefully reading the doc, users may very easy to consider lc_collate the system’s locale. If it only affects extensions, then let’s name it accordingly, for example, “extension_lc_collate”, or “legacy_lc_collate”.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
    
    
    
  77. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-11-29T20:50:56Z

    On Wed, 2025-11-26 at 09:50 +0800, Chao Li wrote:
    > * The retry logic implies that a single-byte char may become multiple
    > bytes after folding, otherwise retry is not needed because you have
    > allocated s+1 bytes for dest buffers. From this perspective, we
    > should use two needed variables: neededA and neededB, if neededA !=
    > neededB, then the two strings are different; if neededA == neededB,
    > then we should be perform strncmp, but here we should pass neededA
    > (or neededB as they are identical) to strncmp(al, bl, neededA).
    
    Thank you.
    
    It's actually worse than that: having a single 's' argument is just
    completely wrong. Consider:
    
      a: U&'x\0394\0394\0394'
      b: U&'\0394\0394\0394'
    
    There is no value for byte length 's' for which both 'a' and 'b' are
    properly-encoded strings. So, the current code passes invalid byte
    sequences to LOWER(), which is another pre-existing bug.
    
    ltree_strncasecmp() is only used for checking equality of the first s
    bytes of the query, so let's make it a safer API that just checks
    prefix equality. Attached.
    
    > * Based on the logic you implemented in 0004, first pg_strfold() has
    > copied as many chars as possible to dest buffer, so when retry,
    > ideally we can should resume instead of start over. However, if
    > single-byte->multi-byte folding happens, we have no information to
    > decide from where to resume.
    
    Right.
    
    That suggests that we might want some kind of lazy or iterator-based
    API for string folding. We'd need to find the right way to do that with
    ICU. If we find that it's a performance problem somewhere, we can look
    into that. Do you think we need that now?
    
    >  From this perspective, in 0004, do we really need to take the try-
    > the-best strategy for strlower_c()? If there are some other use cases
    > that require data to be placed in dest buffer even if dest buffer
    > doesn’t have enough space, then my patch [1] of changing
    > strlower_libc_sb() should be considered.
    
    I will look into that.
    
    > SB_lower_char should be changed to C_IMatchText.
    
    Updated comment.
    
    > I think the comment should be updated accordingly, like “for ILIKE in
    > the C locale”.
    
    Done, thank you.
    
    > * match is allocated with pattlen+1 bytes, is it long enough to hold
    > pattlen multiple-byte chars?
    > 
    > * match is not freed, but looks like it should be: 
    
    ...
    
    > Should “pos” be “part[pos]” assigning to match[match_pos++]?
    
    All fixed, thank you! (I apologize for posting a patch in that state to
    begin with...)
    
    I also reorganized slightly to separate out the pg_iswcased() API into
    its own patch, and moved the like_support.c changes from the ctype_is_c
    patch (already committed: 1476028225) into the pattern prefixes patch.
    
    Regards,
    	Jeff Davis
    
    
    
  78. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-11-29T21:04:00Z

    On Thu, 2025-11-27 at 09:08 +0800, Chao Li wrote:
    > > On Nov 26, 2025, at 09:50, Chao Li <li.evan.chao@gmail.com> wrote:
    > > 
    > > I will review the rest 3 commits tomorrow.
    > 
    > 10 - 0009
    > 
    > Just curious. As isaplha() and toupper() come from the same header
    > file ctype.h, if we replace toupper with pg_ascii_toupper, does
    > isapha also need to be handled?
    
    OK.
    
    What do you think about the change overall? Is fuzzystrmatch inherently
    ASCII-based? Does it cause behavior changes aside from soundex()? Does
    the behavior change in soundex() matter?
    
    > 11 - 0010
    > 
    > I think assert both dstsize and len are redundant, because
    > dstsize=len+1, and no place to change their values.
    
    OK.
    
    What do you think of the change overall?
    
    > 
    > If (local == NULL || local->ctype == NULL)
    >   Local = libc or other fallback;
    > Return default_locale->ctype->strfold_ident(dest, destsize, src,
    > srclen, local);
    > 
    > This way avoids the duplicate code.
    
    OK. The fallback would still be ASCII though, right?
    
    > I just feel the GUC name is very misleading. Without carefully
    > reading the doc, users may very easy to consider lc_collate the
    > system’s locale. If it only affects extensions, then let’s name it
    > accordingly, for example, “extension_lc_collate”, or
    > “legacy_lc_collate”.
    
    It is the system locale, it's just that we won't be using the system
    locale for most purposes, so it has very little effect: PLs,
    extensions, and libraries used by extensions that happen to rely on the
    system locale. That is a bit confusing, which is why I previously just
    set LC_COLLATE=C. This patch addresses Daniel's concern that people
    might still want lc_collate set to something other than C. I'm not sure
    we want this patch, it's just a POC.
    
    
    I didn't attach a new series here yet, but will after some of the
    earlier patches get committed.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  79. Re: Remaining dependency on setlocale()

    Peter Eisentraut <peter@eisentraut.org> — 2025-12-05T15:01:35Z

    On 29.11.25 21:50, Jeff Davis wrote:
    > All fixed, thank you! (I apologize for posting a patch in that state to
    > begin with...)
    > 
    > I also reorganized slightly to separate out the pg_iswcased() API into
    > its own patch, and moved the like_support.c changes from the ctype_is_c
    > patch (already committed: 1476028225) into the pattern prefixes patch.
    
    I reviewed the v11 patches.  But I wasn't able to apply them locally 
    (couldn't find a starting commit where they applied cleanly), so I 
    haven't tested them.
    
    Patches 0001 through 0006 seem generally ok, with some small comments:
    
    v11-0003-Fix-inconsistency-between-ltree_strncasecmp-and-.patch
    
    The function comment reads "Check if b has a prefix of a." -- Is that 
    the same as "Check if a is a prefix of b."?  The latter might be clearer.
    
    
    v11-0004-Remove-char_tolower-API.patch
    
    The updated comment reads
    
    +        * For efficiency reasons, in the C locale we don't call lower() 
    on the
    +        * pattern and text, but instead call SB_lower_char on each 
    character.
    
    but the patch removes SB_lower_char().
    
    
    v11-0006-Use-multibyte-aware-extraction-of-pattern-prefix.patch
    
    Might have a small typo in the commit message:
    
    ; and preserve and char-at-a-time logic for bytea.
    
    
    For the remaining patches I have some more substantial questions.
    
    v11-0007-fuzzystrmatch-use-pg_ascii_toupper.patch
    
    dmetaphone.c has a comment
    
    case '\xc7':        /* C with cedilla */
    
    so the premise that "fuzzystrmatch is designed for ASCII" does not
    appear to be correct.  Needs more analysis.
    
    (But apparently it's not multibyte aware at all, so I don't know what to 
    do about that.)
    
    
    v11-0008-downcase_identifier-use-method-table-from-locale.patch
    
    I'm confused here about the name of the function pg_strfold_ident().  In 
    general, case "folding" results in an opaque string that is really only 
    useful for comparing against other case-folded strings.  But for 
    identifiers we are actually interested lower-casing.  I think this 
    should be corrected in the API naming.
    
    
    v11-0009-Control-LC_COLLATE-with-GUC.patch
    
    I know there were some complaints about compatibility with extensions, 
    but I don't think anything concrete was presented.  I would like to see 
    more evidence that we need this.
    
    Also, recall that we used to have a lc_collate GUC, and in the end 
    people got confused that it didn't actually show a meaningful value when 
    you used ICU.  So we removed that.  It seems adding this back in would 
    create a similar kind of confusion.  So to avoid that, maybe this should 
    be called fallback_lc_collate or something like that.
    
    If we were to proceed with this patch, it should have some documentation 
    and tests.
    
    
    
    
    
  80. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-12-12T20:11:40Z

    On Fri, 2025-12-05 at 16:01 +0100, Peter Eisentraut wrote:
    > v11-0003-Fix-inconsistency-between-ltree_strncasecmp-and-.patch
    > 
    > The function comment reads "Check if b has a prefix of a." -- Is that
    > the same as "Check if a is a prefix of b."?  The latter might be
    > clearer.
    
    Yes, fixed.
    
    Note: I separated this into two patches. 0003 fixes the multibyte
    mishandling issue, and 0004 consistently performs case folding. 0003 is
    backpatchable, I believe.
    
    > but the patch removes SB_lower_char().
    
    Fixed and committed.
    
    > v11-0006-Use-multibyte-aware-extraction-of-pattern-prefix.patch
    > 
    > Might have a small typo in the commit message:
    > 
    > ; and preserve and char-at-a-time logic for bytea.
    
    Fixed.
    
    I also changed it into two functions: like_fixed_prefix(), which is
    almost unchanged from the original; and like_fixed_prefix_ci(), which
    is multibyte and locale-aware. It was too confusing to have single-byte
    and multi-byte logic in the same function, and they didn't share much
    code anyway.
    
    > case '\xc7':        /* C with cedilla */
    > 
    > so the premise that "fuzzystrmatch is designed for ASCII" does not
    > appear to be correct.  Needs more analysis.
    > 
    > (But apparently it's not multibyte aware at all, so I don't know what
    > to 
    > do about that.)
    
    I didn't notice that, thank you. Agreed, we need a bit more discussion
    around this case as well as soundex().
    
    > v11-0008-downcase_identifier-use-method-table-from-locale.patch
    > 
    > I'm confused here about the name of the function pg_strfold_ident(). 
    > In 
    > general, case "folding" results in an opaque string that is really
    > only 
    > useful for comparing against other case-folded strings.  But for 
    > identifiers we are actually interested lower-casing.  I think this 
    > should be corrected in the API naming.
    
    Agreed and fixed.
    
    Also, I added 0006, which saves a locale_t object for ICU in this one
    case where it's required. Surely that's not what we want in the long
    term, but we don't have the infrastructure for decoding pg_wchar into
    code points yet, and 0006 avoids the dependency on the global LC_CTYPE
    setting.
    
    > v11-0009-Control-LC_COLLATE-with-GUC.patch
    > 
    > I know there were some complaints about compatibility with
    > extensions, 
    > but I don't think anything concrete was presented.  I would like to
    > see 
    > more evidence that we need this.
    > 
    > Also, recall that we used to have a lc_collate GUC, and in the end 
    > people got confused that it didn't actually show a meaningful value
    > when 
    > you used ICU.  So we removed that.  It seems adding this back in
    > would 
    > create a similar kind of confusion.  So to avoid that, maybe this
    > should 
    > be called fallback_lc_collate or something like that.
    
    Yes, this is a POC patch and needs more discussion.
    
    What are your thoughts about a similar lc_ctype GUC, though? That has
    slightly different trade-offs.
    
    
    I believe v12 0001-0005 are about ready for commit, and 0003 should be
    backported.
    
    Regards,
    	Jeff Davis
    
    
  81. Re: Remaining dependency on setlocale()

    Chao Li <li.evan.chao@gmail.com> — 2025-12-13T09:48:13Z

    
    > On Dec 13, 2025, at 04:11, Jeff Davis <pgsql@j-davis.com> wrote:
    > 
    > I believe v12 0001-0005 are about ready for commit, and 0003 should be
    > backported.
    
    I quickly went through 0001-0005, and got a few nitpicks:
    
    1 - 0001
    ```
    +	int			match_mblen pg_attribute_unused();
    ```
    
    Why this variable is marked unused? It’s actually used.
    
    2 - 0002
    
    I did a search and found one place that you missed at line 181 in pg_locale.h
    ```
    extern bool char_is_cased(char ch, pg_locale_t locale);
    ```
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
    
    
    
  82. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-12-15T19:34:01Z

    On Sat, 2025-12-13 at 17:48 +0800, Chao Li wrote:
    > 1 - 0001
    > ```
    > +	int			match_mblen pg_attribute_unused();
    > ```
    > 
    > Why this variable is marked unused? It’s actually used.
    
    Fixed and committed.
    
    I originally marked it unused because I just had an:
    
       Assert(match[match_mblen] == '\0');
    
    but I changed it to just set it:
    
       match[match_mblen] = '\0';
    
    for clarity, even though I think the underlying API guarantees NUL-
    termination.
    
    > 2 - 0002
    > 
    > I did a search and found one place that you missed at line 181 in
    > pg_locale.h
    > ```
    > extern bool char_is_cased(char ch, pg_locale_t locale);
    > ```
    
    Thank you, fixed and committed.
    
    
    I'll continue committing v12 0003-0005. Note that I'm planning to
    backport 0003.
    
    I'm also inclined to move forward with 0006. It's not a long-term
    solution, but it solves an instance of the problem under discussion in
    this thread (removes setlocale() dependency) and is a narrow fix.
    
    
    After that, remaining loose ends:
    
    * 0007: Can we just rip out the non-ASCII support? If it's gone all
    this time without UTF8 support, and there's no suggestion about what
    the non-ASCII behavior should be (in docs or source code), then it's
    probably not terribly important.
    
    * Use of pg_strncasecmp() in the backend, which uses tolower() to do
    case-insensitive matching of command options, e.g. "if
    (pg_strcasecmp(a, "plain") == 0)" to check for PLAIN storage in CREATE
    TYPE.
    
    * strerror(): either consider an lc_ctype GUC, or the approach over
    here[1].
    
    * Daniel suggests that we need some way to set LC_COLLATE for
    extensions/dependencies.
    
    * address calls to pg_tolower in datetime.c and tzparser.c -- can those
    just be pg_ascii_tolower()?
    
    * examine remaining isalpha(), etc., calls in the backend
    
    Regards,
    	Jeff Davis
    
    [1]
    https://www.postgresql.org/message-id/flat/90f176c5b85b9da26a3265b2630ece3552068566.camel@j-davis.com
    
    
    
    
    
  83. Re: Remaining dependency on setlocale()

    Chao Li <li.evan.chao@gmail.com> — 2025-12-16T01:32:09Z

    
    > On Dec 16, 2025, at 03:34, Jeff Davis <pgsql@j-davis.com> wrote:
    > 
    > On Sat, 2025-12-13 at 17:48 +0800, Chao Li wrote:
    >> 1 - 0001
    >> ```
    >> + int match_mblen pg_attribute_unused();
    >> ```
    >> 
    >> Why this variable is marked unused? It’s actually used.
    > 
    > Fixed and committed.
    > 
    > I originally marked it unused because I just had an:
    > 
    >   Assert(match[match_mblen] == '\0');
    > 
    > but I changed it to just set it:
    > 
    >   match[match_mblen] = '\0';
    > 
    > for clarity, even though I think the underlying API guarantees NUL-
    > termination.
    > 
    >> 2 - 0002
    >> 
    >> I did a search and found one place that you missed at line 181 in
    >> pg_locale.h
    >> ```
    >> extern bool char_is_cased(char ch, pg_locale_t locale);
    >> ```
    > 
    > Thank you, fixed and committed.
    > 
    > 
    > I'll continue committing v12 0003-0005. Note that I'm planning to
    > backport 0003.
    
    I have re-reviewed 0003-0005 last week, they all look good to me.
    
    I have no comment on backport 0003.
    
    > 
    > I'm also inclined to move forward with 0006. It's not a long-term
    > solution, but it solves an instance of the problem under discussion in
    > this thread (removes setlocale() dependency) and is a narrow fix.
    > 
    > 
    > After that, remaining loose ends:
    > 
    > * 0007: Can we just rip out the non-ASCII support? If it's gone all
    > this time without UTF8 support, and there's no suggestion about what
    > the non-ASCII behavior should be (in docs or source code), then it's
    > probably not terribly important.
    > 
    > * Use of pg_strncasecmp() in the backend, which uses tolower() to do
    > case-insensitive matching of command options, e.g. "if
    > (pg_strcasecmp(a, "plain") == 0)" to check for PLAIN storage in CREATE
    > TYPE.
    > 
    > * strerror(): either consider an lc_ctype GUC, or the approach over
    > here[1].
    > 
    > * Daniel suggests that we need some way to set LC_COLLATE for
    > extensions/dependencies.
    > 
    > * address calls to pg_tolower in datetime.c and tzparser.c -- can those
    > just be pg_ascii_tolower()?
    > 
    > * examine remaining isalpha(), etc., calls in the backend
    > 
    > Regards,
    > Jeff Davis
    > 
    > [1]
    > https://www.postgresql.org/message-id/flat/90f176c5b85b9da26a3265b2630ece3552068566.camel@j-davis.com
    > 
    
    I just reviewed 0006-0007. Only got one comment on 0006:
    
    ```
    @@ -306,6 +342,7 @@ create_pg_locale_icu(Oid collid, MemoryContext context)
     	result = MemoryContextAllocZero(context, sizeof(struct pg_locale_struct));
     	result->icu.locale = MemoryContextStrdup(context, iculocstr);
     	result->icu.ucol = collator;
    +	result->icu.lt = loc;
    ```
    
    The old code didn’t create a locale object and store in result, thus it didn’t have a logic to free the created locale. This patch now dose that, but I don’t see where the created locale object is free-ed. I suppose newlocale() will allocate memory from the OS, so I guess the memory should be free-ed somewhere.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
    
    
    
  84. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-12-16T20:04:54Z

    On Tue, 2025-12-16 at 09:32 +0800, Chao Li wrote:
    > I have re-reviewed 0003-0005 last week, they all look good to me.
    > 
    > I have no comment on backport 0003.
    
    Committed 0003 and backported to 14.
    
    Committing 0004 also. For the archives, the bug in that case is:
    
      -- generate some randomly-cased non-ASCII data
      CREATE DATABASE i TEMPLATE template0 LOCALE 'C'
        LOCALE_PROVIDER 'icu' ICU_LOCALE 'en';
      \c i
      CREATE EXTENSION ltree;
      CREATE TABLE test(path ltree);
      CREATE FUNCTION gen() RETURNS TEXT LANGUAGE plpgsql AS $$
        declare
          s TEXT;
        begin
          s := '';
          for i in 1..5 loop
            s := s || case when random() > 0.5 then lower(U&'\00C1') else
    U&'\00C1' end;
            s := s || case when random() > 0.5 then lower(U&'\00C9') else
    U&'\00C9' end;
            s := s || case when random() > 0.5 then lower(U&'\00CD') else
    U&'\00CD' end;
            s := s || case when random() > 0.5 then lower(U&'\00D3') else
    U&'\00D3' end;
            s := s || case when random() > 0.5 then lower(U&'\00DA') else
    U&'\00DA' end;
          end loop;
          return s;
        end;
      $$;
      INSERT INTO test select ('a.'||gen()||'.z')::ltree
        FROM generate_series(1,10000);
      CREATE INDEX test_idx ON test USING gist (path);
      -- returns 10000
      SET enable_seqscan = true;
      SET enable_indexscan = false;
      SET enable_bitmapscan = false;
      SELECT COUNT(*) FROM test
        WHERE path ~ U&'a.áéíóúáéíóúáéíóúáéíóúáéíóú@.z'::lquery;
      -- returns fewer tuples when using index scan
      SET enable_seqscan = false;
      SET enable_indexscan = true;
      SET enable_bitmapscan = true;
      SELECT COUNT(*) FROM test
        WHERE path ~ U&'a.áéíóúáéíóúáéíóúáéíóúáéíóú@.z'::lquery;
    
    Probably a smaller case would do, but I think it requires page splits
    to hit the bug. 0004 fixes the bug.
    
    > The old code didn’t create a locale object and store in result, thus
    > it didn’t have a logic to free the created locale. This patch now
    > dose that, but I don’t see where the created locale object is free-
    > ed. I suppose newlocale() will allocate memory from the OS, so I
    > guess the memory should be free-ed somewhere.
    
    The pg_locale_t objects are cached for the life of the backend, and
    never freed. We may want to change that eventually, but in practice
    it's not much of a problem.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  85. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-12-16T21:19:12Z

    On Tue, 2025-12-16 at 12:04 -0800, Jeff Davis wrote:
    > Probably a smaller case would do, but I think it requires page splits
    > to hit the bug. 0004 fixes the bug.
    
    Because it's a clear bug, I elected to backport 0004 to v18, where the
    casefolding APIs were introduced. It's a bug before that as well, but
    backporting further would be more complex and/or invasive.
    
    Regards,
    	Jeff Davis
    
    
    
    
    
  86. Re: Remaining dependency on setlocale()

    Peter Eisentraut <peter@eisentraut.org> — 2025-12-17T10:39:05Z

    On 12.12.25 21:11, Jeff Davis wrote:
    >> case '\xc7':        /* C with cedilla */
    >>
    >> so the premise that "fuzzystrmatch is designed for ASCII" does not
    >> appear to be correct.  Needs more analysis.
    >>
    >> (But apparently it's not multibyte aware at all, so I don't know what
    >> to
    >> do about that.)
    > I didn't notice that, thank you. Agreed, we need a bit more discussion
    > around this case as well as soundex().
    
    Soundex is an ASCII-only algorithm, there is no expectation that the 
    algorithm does anything useful with non-ASCII characters, and it doesn't 
    do so now.  So I think using pg_ascii_toupper() is ok.  (Users could for 
    example use unaccent to preprocess text.)
    
    One might wonder if the presence of non-ASCII characters should be an 
    error, but that doesn't have to be the subject of this thread.  I 
    noticed that the Wikipedia page for Soundex even calls out PostgreSQL 
    for doing things slightly different than everyone else, but I haven't 
    studied the details.
    
    For Metaphone, I found the reference implementation linked from its 
    Wikipedia page, and it looks like our implementation is pretty closely 
    aligned to that.  That reference implementation also contains the 
    C-with-cedilla case explicitly.  The correct fix here would probably be 
    to change the implementation to work on wide characters.  But I think 
    for the moment you could try a shortcut like, use pg_ascii_toupper(), 
    but if the encoding is LATIN1 (or LATIN9 or whichever other encodings 
    also contain C-with-cedilla at that code point), then explicitly 
    uppercase that one as well.  This would preserve the existing behavior.
    
    Note that the documentation calls out: "At present, the soundex, 
    metaphone, dmetaphone, and dmetaphone_alt functions do not work well 
    with multibyte encodings (such as UTF-8)."
    
    
    
    
    
  87. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-12-23T20:09:08Z

    On Wed, 2025-12-17 at 11:39 +0100, Peter Eisentraut wrote:
    > For Metaphone, I found the reference implementation linked from its 
    > Wikipedia page, and it looks like our implementation is pretty
    > closely 
    > aligned to that.  That reference implementation also contains the 
    > C-with-cedilla case explicitly.  The correct fix here would probably
    > be 
    > to change the implementation to work on wide characters.  But I think
    > for the moment you could try a shortcut like, use pg_ascii_toupper(),
    > but if the encoding is LATIN1 (or LATIN9 or whichever other encodings
    > also contain C-with-cedilla at that code point), then explicitly 
    > uppercase that one as well.  This would preserve the existing
    > behavior.
    
    Done, attached new patches.
    
    Interestingly, WIN1256 encodes only the SMALL LETTER C WITH CEDILLA. I
    think, for the purposes here, we can still consider it to "uppercase"
    to \xc7, so that it can still be treated as the same sound. Technically
    I think that would be an improvement over the current code in this edge
    case, and suggests that case folding would be a better approach than
    uppercasing.
    
    Regards,
    	Jeff Davis