Re: Remaining dependency on setlocale()

Thomas Munro <thomas.munro@gmail.com>

From: Thomas Munro <thomas.munro@gmail.com>

To: Jeff Davis <pgsql@j-davis.com>

Cc: Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@postgresql.org

Date: 2024-08-15T08:46:15Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

fuzzystrmatch: use pg_ascii_toupper().
- b96a9fd76f32 19 (unreleased) landed
Avoid global LC_CTYPE dependency in pg_locale_icu.c.
- 0a90df58cf38 19 (unreleased) landed
downcase_identifier(): use method table from locale provider.
- 87b2968df0f8 19 (unreleased) landed
ltree: fix case-insensitive matching.
- 806555e3000d 18.2 landed
- 7f007e4a044a 19 (unreleased) landed
Fix multibyte issue in ltree_strncasecmp().
- 898991966bc9 14.21 landed
- 335b2f30b468 15.16 landed
- b80227c0a54c 16.12 landed
- b8cfe9dc2e7f 17.8 landed
- f79e239e0bc6 18.2 landed
- 84d5efa7e3eb 19 (unreleased) landed
Use multibyte-aware extraction of pattern prefixes.
- 9c8de1596912 19 (unreleased) landed
Add pg_iswcased().
- 630706ced04e 19 (unreleased) landed
Remove char_tolower() API.
- 1e493158d3d2 19 (unreleased) landed
Make regex "max_chr" depend on encoding, not provider.
- 19b966243c38 19 (unreleased) landed
Change some callers to use pg_ascii_toupper().
- 99cd8890beca 19 (unreleased) landed
Allow pg_locale_t APIs to work when ctype_is_c.
- 147602822597 19 (unreleased) landed
Add #define for UNICODE_CASEMAP_BUFSZ.
- 8d299052fe58 19 (unreleased) landed
Inline pg_ascii_tolower() and pg_ascii_toupper().
- ec4997a9d733 19 (unreleased) landed
Avoid global LC_CTYPE dependency in pg_locale_libc.c.
- f81bf78ce12b 19 (unreleased) landed
Force LC_COLLATE to C in postmaster.
- 5e6e42e44fe1 19 (unreleased) landed
Change wchar2char() and char2wchar() to accept a locale_t.
- 53cd0b71ee2e 19 (unreleased) landed
Use pg_ascii_tolower()/pg_ascii_toupper() where appropriate.
- d81dcc8d6243 19 (unreleased) landed
inet_net_pton.c: use pg_ascii_tolower() rather than tolower().
- 8898082a5d3e 18.0 landed
isn.c: use pg_ascii_toupper() instead of toupper().
- 7a6880fadc17 18.0 landed
contrib/spi/refint.c: use pg_ascii_tolower() instead.
- 78bd364ee39c 18.0 landed
copyfromparse.c: use pg_ascii_tolower() rather than tolower().
- 4c787a24e7e2 18.0 landed
Revert "Tidy up locale thread safety in ECPG library."
- 3c8e463b0d88 18.0 cited
Tidy up locale thread safety in ECPG library.
- 8e993bff5326 18.0 cited
All supported systems have locale_t.
- 8d9a9f034e92 17.0 cited

On Thu, Aug 15, 2024 at 11:00 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2024-08-15 at 10:43 +1200, Thomas Munro wrote:
> > So I think the solution could perhaps be something like: in some
> > early
> > startup phase before there are any threads, we nail down all the
> > locale categories to "C" (or whatever we decide on for the permanent
> > global locale), and also query the "" categories and make a copy of
> > them in case anyone wants them later, and then never call setlocale()
> > again.
>
> +1.

We currently nail down these categories:

    /* We keep these set to "C" always.  See pg_locale.c for explanation. */
    init_locale("LC_MONETARY", LC_MONETARY, "C");
    init_locale("LC_NUMERIC", LC_NUMERIC, "C");
    init_locale("LC_TIME", LC_TIME, "C");

CF #5170 has patches to make it so that we stop changing them even
transiently, using locale_t interfaces to feed our caches of stuff
needed to work with those categories, so they really stay truly nailed
down.

It sounds like someone needs to investigate doing the same thing for
these two, from CheckMyDatabase():

    if (pg_perm_setlocale(LC_COLLATE, collate) == NULL)
        ereport(FATAL,
                (errmsg("database locale is incompatible with
operating system"),
                 errdetail("The database was initialized with
LC_COLLATE \"%s\", "
                           " which is not recognized by setlocale().", collate),
                 errhint("Recreate the database with another locale or
install the missing locale.")));

    if (pg_perm_setlocale(LC_CTYPE, ctype) == NULL)
        ereport(FATAL,
                (errmsg("database locale is incompatible with
operating system"),
                 errdetail("The database was initialized with LC_CTYPE \"%s\", "
                           " which is not recognized by setlocale().", ctype),
                 errhint("Recreate the database with another locale or
install the missing locale.")));

How should that work?  Maybe we could imagine something like
MyDatabaseLocale, a locale_t with LC_COLLATE and LC_CTYPE categories
set appropriately.  Or should it be a pg_locale_t instead (if your
database default provider is ICU, then you don't even need a locale_t,
right?).

Then I think there is one quite gnarly category, from
assign_locale_messages() (a GUC assignment function):

    (void) pg_perm_setlocale(LC_MESSAGES, newval);

I have never really studied gettext(), but I know it was just
standardised in POSIX 2024, and the standardised interface has _l()
variants of all functions.  Current implementations don't have them
yet.  Clearly we absolutely couldn't call pg_perm_setlocale() after
early startup --  but if gettext() is relying on the current locale to
affect far away code, then maybe this is one place where we'd just
have to use uselocale().  Perhaps we could plan some transitional
strategy where NetBSD users lose the ability to change the GUC without
restarting the server and it has to be the same for all sessions, or
something like that, until they produce either gettext_l() or
uselocale(), but I haven't thought hard about this part at all yet...