Re: Remaining dependency on setlocale()
From: Jeff Davis <pgsql@j-davis.com>
To: Peter Eisentraut <peter@eisentraut.org>, Thomas Munro <thomas.munro@gmail.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@postgresql.org
Date: 2025-11-15T00:23:11Z
Lists: pgsql-hackers
Attachments
- v7-0001-Avoid-global-LC_CTYPE-dependency-in-pg_locale_lib.patch (text/x-patch)
- v7-0002-Define-char_tolower-char_toupper-for-all-locale-p.patch (text/x-patch)
- v7-0003-Avoid-global-LC_CTYPE-dependency-in-like.c.patch (text/x-patch)
- v7-0004-Avoid-global-LC_CTYPE-dependency-in-scansup.c.patch (text/x-patch)
- v7-0005-Avoid-global-LC_CTYPE-dependency-in-pg_locale_icu.patch (text/x-patch)
- v7-0006-Avoid-global-LC_CTYPE-dependency-in-ltree-crc32.c.patch (text/x-patch)
- v7-0007-Avoid-global-LC_CTYPE-dependency-in-fuzzystrmatch.patch (text/x-patch)
- v7-0008-Don-t-include-ICU-headers-in-pg_locale.h.patch (text/x-patch)
- v7-0009-Avoid-global-LC_CTYPE-dependency-in-strcasecmp.c-.patch (text/x-patch)
On Wed, 2025-11-12 at 19:59 +0100, Peter Eisentraut wrote:
> I'm getting a bit confused by all these different variant function
> names.

One way of looking at it is that the functions in this patch series mostly affect how identifiers are treated, whereas earlier collation-related work affects how text data is treated. Ideally, they should be similar, but for historical reasons they're not.

There are a lot of subtle behaviors for identifiers, which individually make some sense, but over time have just become edge cases and sources of inconsistency:

downcase_identifier() is a server function to casefold unquoted identifiers during parsing (used by other callers, too). For non-ascii characters in a single-byte encoding, it uses tolower(); otherwise the lowercasing is ascii-only. Note: if an application is reliant on the casefolding of non-ascii identifiers, such as SELECT * FROM É finding the table named "é", that application would not work in UTF-8 even with a dump/restore.

pg_strcasecmp() and pg_tolower() are used from the server and the client to do case-insensitive comparison of option names. They're supposed to use the same casing semantics as downcase_identifier(), but they don't differentiate between single-byte and multi-byte encodings; they just call tolower() on any non-ascii byte. That difference probably doesn't matter for UTF-8, because tolower() on a single byte in a multibyte sequence should be a no-op, but perhaps it can matter in non-UTF-8 multibyte encodings.

It's hard to avoid some confusion unless we're able to simplify some of these behaviors. Let me know if you think we can tolerate some simplifications in these edge cases without breaking anything too badly.

> Many of these issues are pre-existing, but I just figured it has
> reached a point where we need to do something about it.

Starting from first principles, individual character operations should be mostly for parsing (e.g. tsearch) or pattern matching.
Case folding and caseless matching should be done with string operations. And obviously all of this should be multibyte aware and work consistently in different encodings (to the extent possible given the representational constraints).

Our APIs in pg_locale.c do a good job of offering that, and do not depend on the global LC_CTYPE. (There are a few things I'd like to add or clean up, but it offers most of what we need.)

The problem, of course, is migrating the callers to use pg_locale.c APIs without breaking things.

This patch series is intended to make everything locale-sensitive in the backend go through pg_locale_t without any behavior changes. The benefit is that it would at least remove the global LC_CTYPE dependency, but it ends up with hacky compatibility methods like char_tolower(), which pile on to the already-confusing set of tolower-like functions.

In an earlier approach:

https://www.postgresql.org/message-id/5f95b81af1e81b28b8a9ac5929f199b2f4091fdf.camel@j-davis.com

I added a strfold_ident() method. That's easier to understand for downcase_identifier(), but it didn't solve the problems for other callsites that depend on tolower(), so I'd need to add more methods for those places, and it started to look unpleasant.

And earlier in this thread, I had tried the approach of using a global variable to hold a locale representing datctype. That felt a bit weird, though, because it mostly only matters when datlocprovider='c', and in that case, there's already a locale_t initialized along with the default collation. So why not find a way to go through the default collation?

I still favor the approach used in the current patch series to remove the dependency on the global LC_CTYPE, but I'm open to suggestions. Whatever we do will probably require some additional hacking later anyway.

I tried to improve the comments in pgstrcasecmp.c, and I rebased.

Regards,
	Jeff Davis
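
For illustration only, here is a rough sketch of the two identifier-folding behaviors described above. It is not code from the patch series: the names fold_like_identifier() and fold_like_pg_tolower() are hypothetical, and the single_byte_encoding flag is an assumed stand-in for however the caller determines the database encoding.

```c
#include <ctype.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of the downcase_identifier() behavior described
 * above: ASCII letters are always folded; high-bit bytes are passed to
 * tolower() only when the encoding is single-byte.
 */
static void
fold_like_identifier(char *s, bool single_byte_encoding)
{
    for (; *s != '\0'; s++)
    {
        unsigned char ch = (unsigned char) *s;

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        else if (single_byte_encoding && ch >= 0x80 && isupper(ch))
            ch = (unsigned char) tolower(ch);   /* depends on global LC_CTYPE */

        *s = (char) ch;
    }
}

/*
 * Hypothetical sketch of the pg_tolower() behavior described above:
 * same ASCII rule, but any high-bit byte goes through tolower() with
 * no check for single-byte vs. multibyte encodings.
 */
static unsigned char
fold_like_pg_tolower(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    else if (ch >= 0x80 && isupper(ch))
        ch = (unsigned char) tolower(ch);       /* depends on global LC_CTYPE */

    return ch;
}
```

With single_byte_encoding = false (as for a UTF-8 database), the two bytes of "É" (0xC3 0x89) pass through the first function untouched, which is why an unquoted É would not find a table named "é". The second function hands every high-bit byte to tolower() regardless of encoding, and in both functions the result of isupper()/tolower() depends on whatever LC_CTYPE the process currently has set.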