Thread

  1. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-11-15T00:23:11Z

    On Wed, 2025-11-12 at 19:59 +0100, Peter Eisentraut wrote:
    > I'm getting a bit confused by all these different variant function 
    > names. 
    
    One way of looking at it is that the functions in this patch series
    mostly affect how identifiers are treated, whereas earlier collation-
    related work affects how text data is treated. Ideally, they should be
    similar, but for historical reasons they're not.
    
    There are a lot of subtle behaviors for identifiers, which individually
    make some sense, but over time have just become edge cases and sources
    of inconsistency:
    
    downcase_identifier() is a server function to casefold unquoted
    identifiers during parsing (used by other callers, too). For non-ascii
    characters in a single-byte encoding, it uses tolower(); otherwise the
    lowercasing is ascii-only. Note: if an application is reliant on the
    casefolding of non-ascii identifiers, such as SELECT * FROM É finding
    the table named "é", that application would not work in UTF-8 even with
    a dump/restore.
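
    As a rough sketch of that rule (illustrative only, modeled on the
    description above rather than copied from the source; the name
    sketch_downcase() and its single_byte flag are made up, with the flag
    standing in for "the database encoding is single-byte"):

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the identifier-folding rule described above.
 * ASCII letters are always folded; bytes with the high bit set are passed
 * to tolower() only when the encoding is single-byte.
 */
static void
sketch_downcase(const char *src, char *dst, size_t dstsize, bool single_byte)
{
	size_t		i;

	assert(dstsize > 0);

	for (i = 0; src[i] != '\0' && i + 1 < dstsize; i++)
	{
		unsigned char ch = (unsigned char) src[i];

		if (ch >= 'A' && ch <= 'Z')
			ch += 'a' - 'A';	/* ASCII: independent of any locale */
		else if (single_byte && ch >= 0x80 && isupper(ch))
			ch = (unsigned char) tolower(ch);	/* uses the global LC_CTYPE */
		dst[i] = (char) ch;
	}
	dst[i] = '\0';
}
```

    With single_byte = false, the two bytes that encode "É" in UTF-8
    (0xC3 0x89) pass through untouched, which is why the unquoted
    identifier É would not find "é" there.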
    
    pg_strcasecmp() and pg_tolower() are used from the server and the
    client to do case-insensitive comparison of option names. They're
    supposed to use the same casing semantics as downcase_identifier(), but
    they don't differentiate between single-byte and multi-byte encodings;
    they just call tolower() on any non-ascii byte. That difference
    probably doesn't matter for UTF-8, because tolower() on a single byte in
    a multibyte sequence should be a no-op, but perhaps it can matter in
    non-UTF-8 multibyte encodings.
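
    For comparison, a minimal sketch of the byte-at-a-time behavior
    described here (again illustrative only; sketch_tolower() and
    sketch_strcasecmp() are made-up names, not the functions in
    pgstrcasecmp.c):

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical per-byte lowercasing: ASCII is folded directly, and any
 * byte with the high bit set goes to tolower(), with no check of whether
 * the encoding is single-byte or multibyte.
 */
static unsigned char
sketch_tolower(unsigned char ch)
{
	if (ch >= 'A' && ch <= 'Z')
		ch += 'a' - 'A';
	else if (ch >= 0x80 && isupper(ch))
		ch = (unsigned char) tolower(ch);	/* uses the global LC_CTYPE */
	return ch;
}

/* Hypothetical case-insensitive comparison built on the function above. */
static int
sketch_strcasecmp(const char *s1, const char *s2)
{
	assert(s1 != NULL && s2 != NULL);

	for (;;)
	{
		unsigned char c1 = sketch_tolower((unsigned char) *s1++);
		unsigned char c2 = sketch_tolower((unsigned char) *s2++);

		if (c1 != c2)
			return (int) c1 - (int) c2;
		if (c1 == '\0')
			return 0;
	}
}
```

    The only difference from the identifier-folding rule is the missing
    single-byte check, so every high-bit byte is handed to tolower()
    whatever the encoding.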
    
    It's hard to avoid some confusion unless we're able to simplify some of
    these behaviors. Let me know if you think we can tolerate some
    simplifications in these edge cases without breaking anything too
    badly.
    
    
    > Many of these issues are pre-existing, but I just figured it has
    > reached a point where we need to do something about it.
    
    Starting from first principles, individual character operations should
    be mostly for parsing (e.g. tsearch) or pattern matching. Case folding
    and caseless matching should be done with string operations. And
    obviously all of this should be multibyte aware and work consistently
    in different encodings (to the extent possible given the
    representational constraints).
    
    Our APIs in pg_locale.c do a good job of offering that, and do not
    depend on the global LC_CTYPE. (There are a few things I'd like to add
    or clean up, but it offers most of what we need.)
    
    The problem, of course, is migrating the callers to use pg_locale.c
    APIs without breaking things. This patch series is intended to make
    everything locale-sensitive in the backend go through pg_locale_t
    without any behavior changes. The benefit is that it would at least
    remove the global LC_CTYPE dependency, but it ends up with hacky
    compatibility methods like char_tolower(), which piles on to the
    already-confusing set of tolower-like functions.
    
    In an earlier approach:
    
    https://www.postgresql.org/message-id/5f95b81af1e81b28b8a9ac5929f199b2f4091fdf.camel@j-davis.com
    
    I added a strfold_ident() method. That's easier to understand for
    downcase_identifier(), but it didn't solve the problems for other
    call sites that depend on tolower(), so I'd need to add more methods
    for those places, and it started to look unpleasant.
    
    And earlier in this thread, I had tried the approach of using a global
    variable to hold a locale representing datctype. That felt a bit weird,
    though, because it mostly only matters when datlocprovider='c', and in
    that case, there's already a locale_t initialized along with the
    default collation. So why not find a way to go through the default
    collation?
    
    I still favor the approach used in the current patch series to remove
    the dependency on the global LC_CTYPE, but I'm open to suggestions.
    Whatever we do will probably require some additional hacking later
    anyway.
    
    I tried to improve the comments in pgstrcasecmp.c, and I rebased.
    
    Regards,
    	Jeff Davis