Re: Remaining dependency on setlocale()

Jeff Davis <pgsql@j-davis.com>

From: Jeff Davis <pgsql@j-davis.com>

To: Peter Eisentraut <peter@eisentraut.org>, Thomas Munro <thomas.munro@gmail.com>

Cc: Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@postgresql.org

Date: 2025-11-15T00:23:11Z

Lists: pgsql-hackers

Attachments

v7-0001-Avoid-global-LC_CTYPE-dependency-in-pg_locale_lib.patch (text/x-patch)
v7-0002-Define-char_tolower-char_toupper-for-all-locale-p.patch (text/x-patch)
v7-0003-Avoid-global-LC_CTYPE-dependency-in-like.c.patch (text/x-patch)
v7-0004-Avoid-global-LC_CTYPE-dependency-in-scansup.c.patch (text/x-patch)
v7-0005-Avoid-global-LC_CTYPE-dependency-in-pg_locale_icu.patch (text/x-patch)
v7-0006-Avoid-global-LC_CTYPE-dependency-in-ltree-crc32.c.patch (text/x-patch)
v7-0007-Avoid-global-LC_CTYPE-dependency-in-fuzzystrmatch.patch (text/x-patch)
v7-0008-Don-t-include-ICU-headers-in-pg_locale.h.patch (text/x-patch)
v7-0009-Avoid-global-LC_CTYPE-dependency-in-strcasecmp.c-.patch (text/x-patch)

On Wed, 2025-11-12 at 19:59 +0100, Peter Eisentraut wrote:
> I'm getting a bit confused by all these different variant function
> names.

One way of looking at it is that the functions in this patch series
mostly affect how identifiers are treated, whereas earlier collation-
related work affects how text data is treated. Ideally, they should be
similar, but for historical reasons they're not.

There are a lot of subtle behaviors for identifiers, which individually
make some sense, but over time have just become edge cases and sources
of inconsistency:

downcase_identifier() is a server function that casefolds unquoted
identifiers during parsing (it has other callers, too). For non-ASCII
characters in a single-byte encoding, it uses tolower(); otherwise the
lowercasing is ASCII-only. Note: if an application relies on the
casefolding of non-ASCII identifiers, such as SELECT * FROM É finding
the table named "é", that application would not work in UTF-8, even
after a dump/restore.
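
To make the two cases concrete, here is a minimal sketch of the folding
rule described above. It is not the server code: the name
sketch_downcase_identifier() and its signature are made up for
illustration, the encoding check is passed in as a flag rather than
looked up, and allocation and truncation handling are left out.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the folding loop in downcase_identifier().
 * "out" must have room for len + 1 bytes.
 */
static void
sketch_downcase_identifier(const char *ident, size_t len,
                           bool enc_is_single_byte, char *out)
{
    for (size_t i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) ident[i];

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';    /* ASCII: always folded, no locale involved */
        else if (enc_is_single_byte && (ch & 0x80) && isupper(ch))
            ch = (unsigned char) tolower(ch);   /* uses the global LC_CTYPE */
        /* high-bit bytes in a multibyte encoding are copied unchanged */
        out[i] = (char) ch;
    }
    out[len] = '\0';
}
```

With enc_is_single_byte set to false, the two UTF-8 bytes of "É" pass
through untouched, which is why the unquoted identifier É does not find
"é" in a UTF-8 database.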

pg_strcasecmp() and pg_tolower() are used from the server and the
client to do case-insensitive comparison of option names. They're
supposed to use the same casing semantics as downcase_identifier(), but
they don't differentiate between single-byte and multibyte encodings;
they just call tolower() on any non-ASCII byte. That difference
probably doesn't matter for UTF-8, because tolower() on a single byte in
a multibyte sequence should be a no-op, but perhaps it can matter in
non-UTF-8 multibyte encodings.
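
For comparison, a rough sketch of the per-byte behavior described
above. The sketch_ names are hypothetical and this is a simplified
illustration rather than the code in src/port; the point is only that
there is no encoding check before tolower() is reached.

```c
#include <ctype.h>

/* Hypothetical stand-in for pg_tolower(): no single-byte/multibyte check. */
static unsigned char
sketch_tolower(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    else if ((ch & 0x80) && isupper(ch))
        ch = (unsigned char) tolower(ch);   /* any high-bit byte, any encoding */
    return ch;
}

/* Hypothetical stand-in for pg_strcasecmp(): compare byte by byte. */
static int
sketch_strcasecmp(const char *s1, const char *s2)
{
    for (;;)
    {
        unsigned char ch1 = (unsigned char) *s1++;
        unsigned char ch2 = (unsigned char) *s2++;

        if (ch1 != ch2)
        {
            /* fold only when the raw bytes differ */
            ch1 = sketch_tolower(ch1);
            ch2 = sketch_tolower(ch2);
            if (ch1 != ch2)
                return (int) ch1 - (int) ch2;
        }
        if (ch1 == 0)
            break;
    }
    return 0;
}
```

The only difference from the downcase_identifier() rule is the missing
encoding test in the high-bit branch, so the two can disagree only for
high-bit bytes that the current LC_CTYPE considers uppercase.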

It's hard to avoid some confusion unless we're able to simplify some of
these behaviors. Let me know if you think we can tolerate some
simplifications in these edge cases without breaking anything too
badly.

> Many of these issues are pre-existing, but I just figured it has
> reached
> a point where we need to do something about it.

Starting from first principles, individual character operations should
be mostly for parsing (e.g. tsearch) or pattern matching. Case folding
and caseless matching should be done with string operations. And
obviously all of this should be multibyte aware and work consistently
in different encodings (to the extent possible given the
representational constraints).

Our APIs in pg_locale.c do a good job of offering that, and do not
depend on the global LC_CTYPE. (There are a few things I'd like to add
or clean up, but it offers most of what we need.)

The problem, of course, is migrating the callers to use pg_locale.c
APIs without breaking things. This patch series is intended to make
everything locale-sensitive in the backend go through pg_locale_t
without any behavior changes. The benefit is that it would at least
remove the global LC_CTYPE dependency, but it ends up with hacky
compatibility methods like char_tolower(), which add to the
already-confusing set of tolower-like functions.
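
As a rough illustration of what "going through a locale object instead
of the global LC_CTYPE" means at the libc level, here is a sketch built
on the POSIX.1-2008 newlocale() and tolower_l() calls. The sketch_locale
struct and the sketch_ function names are assumptions made up for this
example; they are not the pg_locale_t definition or the char_tolower()
signature from the patches.

```c
#define _POSIX_C_SOURCE 200809L
#include <ctype.h>
#include <locale.h>
#ifdef __APPLE__
#include <xlocale.h>
#endif

/* Hypothetical wrapper: carries its own locale_t, like a libc-provider locale. */
typedef struct sketch_locale
{
    locale_t    lt;
} sketch_locale;

/* Returns 1 on success, 0 if the named locale cannot be created. */
static int
sketch_locale_init(sketch_locale *loc, const char *ctype_name)
{
    loc->lt = newlocale(LC_CTYPE_MASK, ctype_name, (locale_t) 0);
    return loc->lt != (locale_t) 0;
}

/* Lowercase one byte using the locale object, not the setlocale() state. */
static unsigned char
sketch_char_tolower(const sketch_locale *loc, unsigned char ch)
{
    return (unsigned char) tolower_l(ch, loc->lt);
}

static void
sketch_locale_free(sketch_locale *loc)
{
    freelocale(loc->lt);
}
```

A caller written this way gives the same answer no matter what
setlocale(LC_CTYPE, ...) was last called with, which is the property the
series is after; it is still a per-byte compatibility shim rather than a
proper string operation.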

In an earlier approach:

https://www.postgresql.org/message-id/5f95b81af1e81b28b8a9ac5929f199b2f4091fdf.camel@j-davis.com

I added a strfold_ident() method. That's easier to understand for
downcase_identifier(), but it didn't solve the problems for other
callsites that depend on tolower(), so I'd need to add more methods
for those places, and that started to look unpleasant.

And earlier in this thread, I had tried the approach of using a global
variable to hold a locale representing datctype. That felt a bit weird,
though, because it mostly only matters when datlocprovider='c', and in
that case, there's already a locale_t initialized along with the
default collation. So why not find a way to go through the default
collation?

I still favor the approach used in the current patch series to remove
the dependency on the global LC_CTYPE, but I'm open to suggestion.
Whatever we do will probably require some additional hacking later
anyway.

I tried to improve the comments in pgstrcasecmp.c, and I rebased.

Regards,
Jeff Davis