Re: Remaining dependency on setlocale()

From: Thomas Munro <thomas.munro@gmail.com>
To: Jeff Davis <pgsql@j-davis.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@postgresql.org
Date: 2024-08-15T08:46:15Z
Lists: pgsql-hackers

Commits

  1. fuzzystrmatch: use pg_ascii_toupper().

  2. Avoid global LC_CTYPE dependency in pg_locale_icu.c.

  3. downcase_identifier(): use method table from locale provider.

  4. ltree: fix case-insensitive matching.

  5. Fix multibyte issue in ltree_strncasecmp().

  6. Use multibyte-aware extraction of pattern prefixes.

  7. Add pg_iswcased().

  8. Remove char_tolower() API.

  9. Make regex "max_chr" depend on encoding, not provider.

  10. Change some callers to use pg_ascii_toupper().

  11. Allow pg_locale_t APIs to work when ctype_is_c.

  12. Add #define for UNICODE_CASEMAP_BUFSZ.

  13. Inline pg_ascii_tolower() and pg_ascii_toupper().

  14. Avoid global LC_CTYPE dependency in pg_locale_libc.c.

  15. Force LC_COLLATE to C in postmaster.

  16. Change wchar2char() and char2wchar() to accept a locale_t.

  17. Use pg_ascii_tolower()/pg_ascii_toupper() where appropriate.

  18. inet_net_pton.c: use pg_ascii_tolower() rather than tolower().

  19. isn.c: use pg_ascii_toupper() instead of toupper().

  20. contrib/spi/refint.c: use pg_ascii_tolower() instead.

  21. copyfromparse.c: use pg_ascii_tolower() rather than tolower().

  22. Revert "Tidy up locale thread safety in ECPG library."

  23. Tidy up locale thread safety in ECPG library.

  24. All supported systems have locale_t.

On Thu, Aug 15, 2024 at 11:00 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2024-08-15 at 10:43 +1200, Thomas Munro wrote:
> > So I think the solution could perhaps be something like: in some
> > early
> > startup phase before there are any threads, we nail down all the
> > locale categories to "C" (or whatever we decide on for the permanent
> > global locale), and also query the "" categories and make a copy of
> > them in case anyone wants them later, and then never call setlocale()
> > again.
>
> +1.
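
To make the quoted idea concrete, here is a minimal sketch of such an
early-startup step.  It is not PostgreSQL code: nail_down_global_locale()
and saved_env_locale are hypothetical names, and the fallback to "C" for
unknown environment locales is an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <string.h>

#define NUM_SAVED_CATEGORIES 6

/* Categories whose environment ("") settings we remember. */
static const int saved_categories[NUM_SAVED_CATEGORIES] = {
    LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_TIME
};

/* Hypothetical storage for the names the environment asked for. */
static char saved_env_locale[NUM_SAVED_CATEGORIES][256];

/*
 * Hypothetical helper: call once, before any threads exist.  Records what
 * "" resolves to for each category, then pins the global locale to "C".
 * After this, nothing should need to call setlocale() again.
 */
static void
nail_down_global_locale(void)
{
    const char *pinned;

    for (int i = 0; i < NUM_SAVED_CATEGORIES; i++)
    {
        const char *name = setlocale(saved_categories[i], "");

        /* Assumption: fall back to "C" if the name is unknown or too long. */
        if (name == NULL || strlen(name) >= sizeof(saved_env_locale[i]))
            name = "C";
        strcpy(saved_env_locale[i], name);
    }

    pinned = setlocale(LC_ALL, "C");
    assert(pinned != NULL);     /* "C" is always available */
    (void) pinned;
}
```

The saved names could later be handed to newlocale() by whoever wants
the environment's settings.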

We currently nail down these categories:

    /* We keep these set to "C" always.  See pg_locale.c for explanation. */
    init_locale("LC_MONETARY", LC_MONETARY, "C");
    init_locale("LC_NUMERIC", LC_NUMERIC, "C");
    init_locale("LC_TIME", LC_TIME, "C");

CF #5170 has patches that stop us from changing them even transiently:
they use locale_t interfaces to fill the caches we need for those
categories, so the categories stay truly nailed down.
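
As a rough illustration of what "using locale_t interfaces" can look
like (this is not the code from the CF entry), the sketch below reads
the decimal-point string of a named locale via newlocale() and
nl_langinfo_l(), never touching the global locale.  lookup_radix_char()
is a hypothetical helper, and the includes assume a glibc-style
platform.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper (not PostgreSQL code): copy the decimal-point string
 * of the named locale into buf without calling setlocale().
 * Returns 0 on success, -1 if the locale is unknown or buf is too small.
 */
static int
lookup_radix_char(const char *locale_name, char *buf, size_t buflen)
{
    locale_t    loc;
    const char *radix;
    int         rc = -1;

    assert(locale_name != NULL && buf != NULL);

    /* Build a private locale object; the global locale is untouched. */
    loc = newlocale(LC_NUMERIC_MASK, locale_name, (locale_t) 0);
    if (loc == (locale_t) 0)
        return -1;

    /* nl_langinfo_l() reads from loc, not from the current locale. */
    radix = nl_langinfo_l(RADIXCHAR, loc);
    if (strlen(radix) < buflen)
    {
        strcpy(buf, radix);
        rc = 0;
    }

    freelocale(loc);
    return rc;
}
```

A cache keyed by locale name could be filled this way once and reused.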

It sounds like someone needs to investigate doing the same thing for
these two, from CheckMyDatabase():

    if (pg_perm_setlocale(LC_COLLATE, collate) == NULL)
        ereport(FATAL,
                (errmsg("database locale is incompatible with operating system"),
                 errdetail("The database was initialized with LC_COLLATE \"%s\", "
                           " which is not recognized by setlocale().", collate),
                 errhint("Recreate the database with another locale or install the missing locale.")));

    if (pg_perm_setlocale(LC_CTYPE, ctype) == NULL)
        ereport(FATAL,
                (errmsg("database locale is incompatible with operating system"),
                 errdetail("The database was initialized with LC_CTYPE \"%s\", "
                           " which is not recognized by setlocale().", ctype),
                 errhint("Recreate the database with another locale or install the missing locale.")));

How should that work?  Maybe we could imagine something like
MyDatabaseLocale, a locale_t with LC_COLLATE and LC_CTYPE categories
set appropriately.  Or should it be a pg_locale_t instead (if your
database default provider is ICU, then you don't even need a locale_t,
right?).
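
For the plain locale_t variant, a minimal sketch might look like the
following.  None of these names exist in PostgreSQL:
make_database_locale(), db_strcoll() and db_toupper() are hypothetical,
and this covers only the libc case, not ICU.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical: build one locale_t carrying the database's LC_COLLATE and
 * LC_CTYPE.  Other categories default to "POSIX".  Returns (locale_t) 0 if
 * either name is not recognized by the operating system.
 */
static locale_t
make_database_locale(const char *collate, const char *ctype)
{
    locale_t    base;
    locale_t    full;

    assert(collate != NULL && ctype != NULL);

    base = newlocale(LC_COLLATE_MASK, collate, (locale_t) 0);
    if (base == (locale_t) 0)
        return (locale_t) 0;

    /* On success newlocale() takes over base; on failure it is still ours. */
    full = newlocale(LC_CTYPE_MASK, ctype, base);
    if (full == (locale_t) 0)
        freelocale(base);
    return full;
}

/* Compare two strings under the database locale, not the global one. */
static int
db_strcoll(locale_t dbloc, const char *a, const char *b)
{
    return strcoll_l(a, b, dbloc);
}

/* Upper-case a single-byte character under the database locale. */
static int
db_toupper(locale_t dbloc, int c)
{
    return toupper_l(c, dbloc);
}
```

A failed make_database_locale() call would be the point at which to
raise the FATAL errors shown above.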

Then I think there is one quite gnarly category, from
assign_locale_messages() (a GUC assignment function):

    (void) pg_perm_setlocale(LC_MESSAGES, newval);

I have never really studied gettext(), but I know it was just
standardised in POSIX 2024, and the standardised interface has _l()
variants of all functions.  Current implementations don't have them
yet.  Clearly we couldn't call pg_perm_setlocale() after early
startup, but if gettext() relies on the current locale to affect
far-away code, then maybe this is one place where we'd just have to
use uselocale().  Perhaps we could plan some transitional strategy
where NetBSD users (who have neither interface today) lose the ability
to change the GUC without restarting the server, and it has to be the
same for all sessions, or something like that, until NetBSD produces
either gettext_l() or uselocale().  I haven't thought hard about this
part at all yet...
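
To show the shape of the uselocale() idea, here is a sketch under
assumptions: lookup_with_locale() is a hypothetical wrapper, and
stub_lookup() merely stands in for a gettext()-style function so the
example does not depend on libintl.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Shape of a gettext()-like lookup that consults the current locale. */
typedef const char *(*lookup_fn) (const char *msgid);

/*
 * Hypothetical wrapper: switch the calling thread's locale to msgloc for
 * the duration of one lookup, then restore it.  The global locale and
 * other threads are not affected.
 */
static const char *
lookup_with_locale(locale_t msgloc, lookup_fn fn, const char *msgid)
{
    locale_t    saved;
    const char *result;

    assert(fn != NULL);

    saved = uselocale(msgloc);
    if (saved == (locale_t) 0)
        return msgid;           /* could not switch: return untranslated */

    result = fn(msgid);
    (void) uselocale(saved);
    return result;
}

/* Demonstration only: records which locale was current during the call. */
static locale_t last_seen_locale;

static const char *
stub_lookup(const char *msgid)
{
    last_seen_locale = uselocale((locale_t) 0);
    return msgid;
}
```

In a real build the callback would be gettext() itself, with msgloc
built by newlocale() from the lc_messages GUC value.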