Re: Remaining dependency on setlocale()

Thomas Munro <thomas.munro@gmail.com>

From: Thomas Munro <thomas.munro@gmail.com>

To: Tom Lane <tgl@sss.pgh.pa.us>

Cc: Jeff Davis <pgsql@j-davis.com>, pgsql-hackers@postgresql.org

Date: 2024-08-14T22:43:50Z

Lists: pgsql-hackers

Commits

fuzzystrmatch: use pg_ascii_toupper().
- b96a9fd76f32 19 (unreleased) landed
Avoid global LC_CTYPE dependency in pg_locale_icu.c.
- 0a90df58cf38 19 (unreleased) landed
downcase_identifier(): use method table from locale provider.
- 87b2968df0f8 19 (unreleased) landed
ltree: fix case-insensitive matching.
- 806555e3000d 18.2 landed
- 7f007e4a044a 19 (unreleased) landed
Fix multibyte issue in ltree_strncasecmp().
- 898991966bc9 14.21 landed
- 335b2f30b468 15.16 landed
- b80227c0a54c 16.12 landed
- b8cfe9dc2e7f 17.8 landed
- f79e239e0bc6 18.2 landed
- 84d5efa7e3eb 19 (unreleased) landed
Use multibyte-aware extraction of pattern prefixes.
- 9c8de1596912 19 (unreleased) landed
Add pg_iswcased().
- 630706ced04e 19 (unreleased) landed
Remove char_tolower() API.
- 1e493158d3d2 19 (unreleased) landed
Make regex "max_chr" depend on encoding, not provider.
- 19b966243c38 19 (unreleased) landed
Change some callers to use pg_ascii_toupper().
- 99cd8890beca 19 (unreleased) landed
Allow pg_locale_t APIs to work when ctype_is_c.
- 147602822597 19 (unreleased) landed
Add #define for UNICODE_CASEMAP_BUFSZ.
- 8d299052fe58 19 (unreleased) landed
Inline pg_ascii_tolower() and pg_ascii_toupper().
- ec4997a9d733 19 (unreleased) landed
Avoid global LC_CTYPE dependency in pg_locale_libc.c.
- f81bf78ce12b 19 (unreleased) landed
Force LC_COLLATE to C in postmaster.
- 5e6e42e44fe1 19 (unreleased) landed
Change wchar2char() and char2wchar() to accept a locale_t.
- 53cd0b71ee2e 19 (unreleased) landed
Use pg_ascii_tolower()/pg_ascii_toupper() where appropriate.
- d81dcc8d6243 19 (unreleased) landed
inet_net_pton.c: use pg_ascii_tolower() rather than tolower().
- 8898082a5d3e 18.0 landed
isn.c: use pg_ascii_toupper() instead of toupper().
- 7a6880fadc17 18.0 landed
contrib/spi/refint.c: use pg_ascii_tolower() instead.
- 78bd364ee39c 18.0 landed
copyfromparse.c: use pg_ascii_tolower() rather than tolower().
- 4c787a24e7e2 18.0 landed
Revert "Tidy up locale thread safety in ECPG library."
- 3c8e463b0d88 18.0 cited
Tidy up locale thread safety in ECPG library.
- 8e993bff5326 18.0 cited
All supported systems have locale_t.
- 8d9a9f034e92 17.0 cited

On Wed, Aug 7, 2024 at 7:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Aug 7, 2024 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Jeff Davis <pgsql@j-davis.com> writes:
> > > 2. I don't see a good way to canonicalize a locale name, like in
> > > check_locale(), which uses the result of setlocale().
> >
> > What I can tell you about that is that check_locale's expectation
> > that setlocale does any useful canonicalization is mostly wishful
> > thinking [1].  On a lot of platforms you just get the input string
> > back again.  If that's the only thing keeping us on setlocale,
> > I think we could drop it.  (Perhaps we should do some canonicalization
> > of our own instead?)
>
> +1
>
> I know it does something on Windows (we know the EDB installer gives
> it strings like "Language,Country" and it converts them to
> "Language_Country.Encoding", see various threads about it all going
> wrong), but I'm not sure it does anything we actually want to
> encourage.  I'm hoping we can gradually screw it down so that we only
> have sane BCP 47 in the system on that OS, and I don't see why we
> wouldn't just use them verbatim.

Some more thoughts on check_locale() and canonicalisation:

I doubt the canonicalisation does anything useful on any Unix system,
as locale names there are basically just file names.  In the case of
glibc, the encoding part is munged before opening the file so it
tolerates .utf8 or .UTF-8 or .u---T----f------8 on input, but
setlocale() still returns whatever you gave it, so the return value
isn't cleaning the input or anything.
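
To make that concrete, here is a minimal sketch of how one could check
whether setlocale() hands back exactly the spelling it was given.  The
helper name setlocale_echoes_input() is made up for illustration and is
not PostgreSQL code; it only looks at LC_CTYPE and restores the previous
setting before returning.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not PostgreSQL code): reports whether setlocale()
 * returned exactly the spelling it was given for LC_CTYPE.
 *
 * Returns 1 if the name was echoed verbatim, 0 if it came back changed,
 * and -1 if the locale could not be set.  The previous LC_CTYPE setting
 * is restored before returning.
 */
static int
setlocale_echoes_input(const char *name)
{
    char        saved[256];
    const char *cur;
    const char *result;
    int         echoed;

    assert(name != NULL);

    /* Remember the current setting so we can put it back afterwards. */
    cur = setlocale(LC_CTYPE, NULL);
    if (cur == NULL || strlen(cur) >= sizeof(saved))
        return -1;
    strcpy(saved, cur);     /* copy: later calls may overwrite the buffer */

    result = setlocale(LC_CTYPE, name);
    if (result == NULL)
        return -1;

    /* Compare before the next setlocale() call invalidates "result". */
    echoed = (strcmp(result, name) == 0);

    setlocale(LC_CTYPE, saved);
    return echoed;
}
```

Going by the behaviour described above, I would expect this to return 1
on glibc for any accepted spelling of the encoding, but that depends on
which locales are installed, so treat it as something to try rather
than a guaranteed result.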

"" is a problem however... the special value for "native environment"
is returned as a real locale name, which we probably still need in
places.  We could change that to newlocale("") + query instead, but
there is a portability pipeline problem getting the name out of it:

1. POSIX only just added getlocalename_l() in 2024[1][2].
2. Glibc has non-standard nl_langinfo_l(NL_LOCALE_NAME(category), loc).
3. The <xlocale.h> systems (macOS/*BSD) have non-standard
querylocale(mask, loc).
4. AFAIK there is no way to do it on pure POSIX 2008 systems.
5. For Windows, there is a completely different thing to get the
user's default locale, see CF#3772.

The systems in category 4 would in practice be Solaris and (if it
comes back) AIX.  Given that, we probably just can't go that way soon.
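
The list above could be sketched as a single wrapper along these lines.
This is only an illustration under stated assumptions: the function name
locale_ctype_name() and the HAVE_GETLOCALENAME_L macro are hypothetical
(the latter standing in for a configure-time probe), the <xlocale.h>
branch is limited to macOS and FreeBSD, and the Windows case is left out
entirely.  Systems in category 4 fall through to NULL.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for glibc's NL_LOCALE_NAME() */
#endif
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#if defined(__GLIBC__)
#include <langinfo.h>
#elif defined(__APPLE__) || defined(__FreeBSD__)
#include <xlocale.h>
#endif

/*
 * Hypothetical wrapper (not PostgreSQL code): returns the LC_CTYPE name
 * stored in a locale_t, or NULL where no interface is available.
 *
 * HAVE_GETLOCALENAME_L is an assumed configure-style macro, not something
 * any libc defines for you.
 */
static const char *
locale_ctype_name(locale_t loc)
{
    assert(loc != (locale_t) 0);

#if defined(HAVE_GETLOCALENAME_L)
    /* 1. POSIX 2024 */
    return getlocalename_l(LC_CTYPE, loc);
#elif defined(__GLIBC__) && defined(NL_LOCALE_NAME)
    /* 2. glibc extension */
    return nl_langinfo_l(NL_LOCALE_NAME(LC_CTYPE), loc);
#elif defined(__APPLE__) || defined(__FreeBSD__)
    /* 3. <xlocale.h> systems */
    return querylocale(LC_CTYPE_MASK, loc);
#else
    /* 4. pure POSIX 2008: no way to get the name back out */
    (void) loc;
    return NULL;
#endif
}
```

A caller would pass the result of newlocale(LC_ALL_MASK, "", (locale_t) 0)
and copy the returned string before freeing the locale object, since the
name may point into it.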

So I think the solution could perhaps be something like: in some early
startup phase before there are any threads, we nail down all the
locale categories to "C" (or whatever we decide on for the permanent
global locale), and also query the "" categories and make a copy of
them in case anyone wants them later, and then never call setlocale()
again.
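
A rough sketch of what that early startup step might look like, purely
as an illustration: the names pin_global_locale() and saved_env_locale()
are invented here, the choice of "C" as the permanent global locale is
just the example value from above, and the list of categories is an
assumption.  It has to run before any threads exist, because setlocale()
changes process-wide state.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Categories whose "" (native environment) names we want to remember. */
static const int env_categories[] = {
    LC_COLLATE, LC_CTYPE, LC_MONETARY, LC_NUMERIC, LC_TIME
#ifdef LC_MESSAGES
    , LC_MESSAGES
#endif
};

#define N_ENV_CATEGORIES (sizeof(env_categories) / sizeof(env_categories[0]))

/* Private copies of the environment's locale names; NULL if unavailable. */
static char *env_locale_names[N_ENV_CATEGORIES];

/* Portable stand-in for strdup(); returns NULL on allocation failure. */
static char *
copy_string(const char *s)
{
    size_t      len;
    char       *copy;

    assert(s != NULL);
    len = strlen(s) + 1;
    copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}

/*
 * Hypothetical startup hook: query the "" categories, keep copies of
 * their names, then nail every category down to "C".  Must be called
 * while the process is still single-threaded.
 */
static void
pin_global_locale(void)
{
    size_t      i;

    for (i = 0; i < N_ENV_CATEGORIES; i++)
    {
        /* The returned buffer may be reused, so copy it immediately. */
        const char *name = setlocale(env_categories[i], "");

        env_locale_names[i] = (name != NULL) ? copy_string(name) : NULL;
    }

    if (setlocale(LC_ALL, "C") == NULL)
        abort();                /* "C" must always be available */
}

/* Look up the saved name for a category, or NULL if we have none. */
static const char *
saved_env_locale(int category)
{
    size_t      i;

    for (i = 0; i < N_ENV_CATEGORIES; i++)
    {
        if (env_categories[i] == category)
            return env_locale_names[i];
    }
    return NULL;
}
```

After pin_global_locale() returns, code that wants the user's native
locale would call saved_env_locale(LC_CTYPE) and so on, instead of
calling setlocale() again.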

[1] https://pubs.opengroup.org/onlinepubs/9799919799/functions/getlocalename_l.html
[2] https://www.austingroupbugs.net/view.php?id=1220