Re: Remaining dependency on setlocale()

From: Thomas Munro <thomas.munro@gmail.com>
To: Jeff Davis <pgsql@j-davis.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@postgresql.org
Date: 2024-08-07T20:52:41Z
Lists: pgsql-hackers

Commits

  1. fuzzystrmatch: use pg_ascii_toupper().

  2. Avoid global LC_CTYPE dependency in pg_locale_icu.c.

  3. downcase_identifier(): use method table from locale provider.

  4. ltree: fix case-insensitive matching.

  5. Fix multibyte issue in ltree_strncasecmp().

  6. Use multibyte-aware extraction of pattern prefixes.

  7. Add pg_iswcased().

  8. Remove char_tolower() API.

  9. Make regex "max_chr" depend on encoding, not provider.

  10. Change some callers to use pg_ascii_toupper().

  11. Allow pg_locale_t APIs to work when ctype_is_c.

  12. Add #define for UNICODE_CASEMAP_BUFSZ.

  13. Inline pg_ascii_tolower() and pg_ascii_toupper().

  14. Avoid global LC_CTYPE dependency in pg_locale_libc.c.

  15. Force LC_COLLATE to C in postmaster.

  16. Change wchar2char() and char2wchar() to accept a locale_t.

  17. Use pg_ascii_tolower()/pg_ascii_toupper() where appropriate.

  18. inet_net_pton.c: use pg_ascii_tolower() rather than tolower().

  19. isn.c: use pg_ascii_toupper() instead of toupper().

  20. contrib/spi/refint.c: use pg_ascii_tolower() instead.

  21. copyfromparse.c: use pg_ascii_tolower() rather than tolower().

  22. Revert "Tidy up locale thread safety in ECPG library."

  23. Tidy up locale thread safety in ECPG library.

  24. All supported systems have locale_t.

On Thu, Aug 8, 2024 at 5:16 AM Jeff Davis <pgsql@j-davis.com> wrote:
> There are a ton of calls to, for example, isspace(), used mostly for
> parsing.
>
> I wouldn't expect a lot of differences in behavior from locale to
> locale, like might be the case with iswspace(), but behavior can be
> different at least in theory.
>
> So I guess we're stuck with setlocale()/uselocale() for a while, unless
> we're able to move most of those call sites over to an ascii-only
> variant.

We do know of a few isspace() calls that are already questionable[1]
(should be scanner_isspace(), or something like that).  It's not only
weird that SELECT ROW('libertà!') is displayed with or without double
quotes depending (in theory) on your locale, it's also undefined
behaviour because we feed individual bytes of a multi-byte sequence to
isspace(), so OSes disagree, and in practice we know that macOS and
Windows think that the byte 0xa0 inside 'à' is a space while glibc and
FreeBSD don't.  Looking at the languages with many sequences
containing 0xa0, I guess you'd probably need to be processing CJK text
and cross-platform for the difference to become obvious (that was the
case for the problem report I analysed):

for i in range(1, 0xffff):
  if (i < 0xd800 or i > 0xdfff) and 0xa0 in chr(i).encode('UTF-8'):
    print("%04x: %s" % (i, chr(i)))
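To make the 'libertà!' case concrete, here is a minimal sketch of the
byte-at-a-time classification problem.  Everything in it is an assumption
for illustration: ascii_isspace() is only modelled on what a
scanner_isspace()-style helper might accept, nbsp_isspace() is a stand-in
for a libc isspace() that treats 0xa0 as a space, and needs_quotes() is a
simplified, hypothetical quoting rule rather than the real output code.

```python
# ASCII-only whitespace set. This is an assumption modelled on a
# scanner_isspace()-style helper, not copied from PostgreSQL.
ASCII_SPACE = frozenset(b" \t\n\r\f\v")


def ascii_isspace(byte: int) -> bool:
    """Locale-independent check: only ASCII whitespace bytes count."""
    return byte in ASCII_SPACE


def nbsp_isspace(byte: int) -> bool:
    """Stand-in for a libc isspace() that also accepts the byte 0xa0."""
    return ascii_isspace(byte) or byte == 0xA0


def needs_quotes(value: bytes, isspace) -> bool:
    """Hypothetical rule: quote the value if any single byte is a space."""
    return any(isspace(b) for b in value)


encoded = "libertà!".encode("UTF-8")

# 'à' is encoded as the two bytes c3 a0, so 0xa0 appears as a lone byte.
print(encoded.hex(" "))
print("ASCII-only predicate quotes it:", needs_quotes(encoded, ascii_isspace))
print("0xa0-as-space predicate quotes it:", needs_quotes(encoded, nbsp_isspace))
```

The two predicates disagree only because one of them classifies a
continuation byte of a multi-byte sequence on its own, which is the
cross-platform difference described above.  An ASCII-only check gives the
same answer everywhere.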

[1] https://www.postgresql.org/message-id/flat/CA%2BHWA9awUW0%2BRV_gO9r1ABZwGoZxPztcJxPy8vMFSTbTfi4jig%40mail.gmail.com