Thread

  1. Re: Remaining dependency on setlocale()

    Jeff Davis <pgsql@j-davis.com> — 2025-11-24T23:57:43Z

    On Thu, 2025-11-20 at 16:58 -0800, Jeff Davis wrote:
    > On Wed, 2025-11-12 at 19:59 +0100, Peter Eisentraut wrote:
    > > Many of these issues are pre-existing, but I just figured it has
    > > reached a point where we need to do something about it.
    > 
    > I tried to simplify things in this patch series, assuming that we
    > have some tolerance for small behavior changes.
    > 
    > 0001: No behavior change here, same patch as before. Uncontroversial
    > simplification, so I plan to commit this soon.
    
    Committed.
    
    New series attached, which I tried to put in an order that would be
    reasonable for commit.
    
    0001-0004: Pure refactoring patches. I intend to commit a couple of
    these soon.
    
    0005: No behavioral change, and not much change at all. Computes the
    "max_chr" for regexes (a performance optimization for low codepoints)
    more consistently and simply based on the encoding.
    
    0006: Fixes a longstanding ltree bug caused by an inconsistency between
    the database locale and the global LC_CTYPE setting when using a
    non-libc provider. The end result is also cleaner: use the database
    locale consistently, like tsearch. I don't intend to backport this,
    unless someone thinks it should be, but it should come with a release
    note to reindex ltree indexes if using a non-libc provider.
    
    0007: Removes the char_tolower() API completely. We'd lose a pattern
    matching optimization for single-byte encodings with libc and a non-C
    locale, but it's a significant simplification. We could go even further
    and change this to use casefolding rather than lower(), but that seems
    like a separate change.
    
    0008: Multibyte-aware extraction of pattern prefixes. The previous code
    gave up on any byte that it didn't understand, which made prefixes
    unnecessarily short. This patch is also cleaner.
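
    To illustrate the idea only (this is not the code in the patch): a
    minimal sketch of multibyte-aware prefix extraction for a LIKE
    pattern, assuming UTF-8 and the default backslash escape. The names
    utf8_seq_len() and like_fixed_prefix() are hypothetical, and the
    sketch does not reject overlong encodings.

```c
#include <stddef.h>
#include <string.h>

/*
 * Byte length of the UTF-8 sequence introduced by lead byte c,
 * or 0 if c cannot start a sequence (e.g. a stray continuation byte).
 */
static size_t
utf8_seq_len(unsigned char c)
{
    if (c < 0x80)
        return 1;
    if ((c & 0xE0) == 0xC0)
        return 2;
    if ((c & 0xF0) == 0xE0)
        return 3;
    if ((c & 0xF8) == 0xF0)
        return 4;
    return 0;
}

/*
 * Copy the literal prefix of LIKE pattern "pat" into dst (dstlen bytes,
 * including the terminator). Whole characters are copied, so a multibyte
 * character never ends the prefix early; only a wildcard, a malformed
 * sequence, or a full buffer does. Returns the prefix length in bytes.
 */
size_t
like_fixed_prefix(const char *pat, char *dst, size_t dstlen)
{
    size_t      patlen = strlen(pat);
    size_t      in = 0;
    size_t      out = 0;

    if (dstlen == 0)
        return 0;

    while (in < patlen)
    {
        unsigned char c = (unsigned char) pat[in];
        size_t      n;
        size_t      i;

        if (c == '%' || c == '_')
            break;              /* first wildcard ends the prefix */
        if (c == '\\')
        {
            if (++in >= patlen)
                break;          /* trailing escape: nothing to take */
            c = (unsigned char) pat[in];
        }

        n = utf8_seq_len(c);
        if (n == 0 || in + n > patlen || out + n >= dstlen)
            break;
        for (i = 1; i < n; i++)
            if (((unsigned char) pat[in + i] & 0xC0) != 0x80)
                break;          /* not a continuation byte */
        if (i < n)
            break;

        memcpy(dst + out, pat + in, n);
        in += n;
        out += n;
    }
    dst[out] = '\0';
    return out;
}
```

    With this, a pattern whose literal part ends in a two-byte character
    keeps that character in the prefix instead of stopping before it.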
    
    0009: Changes fuzzystrmatch to use pg_ascii_toupper(). Most functions
    in the extension are unaffected, but soundex() can be affected, and I'm
    not sure what exactly it's supposed to do with non-ASCII.
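
    For reference, a small sketch of what ASCII-only uppercasing means
    for non-ASCII input. ascii_toupper_sketch() and
    ascii_upcase_inplace() are hypothetical stand-ins written for this
    example, not the functions in the tree; the assumption is only that
    bytes outside a-z pass through unchanged.

```c
/*
 * Uppercase a single byte if it is an ASCII lowercase letter;
 * every other byte, including any byte >= 0x80, is returned as-is.
 */
static unsigned char
ascii_toupper_sketch(unsigned char ch)
{
    if (ch >= 'a' && ch <= 'z')
        ch = (unsigned char) (ch - 'a' + 'A');
    return ch;
}

/* Apply the ASCII-only mapping to a NUL-terminated string in place. */
void
ascii_upcase_inplace(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char) ascii_toupper_sketch((unsigned char) *s);
}
```

    So a UTF-8 string containing an accented lowercase letter keeps
    those bytes untouched while the surrounding ASCII letters are
    uppercased, and the result no longer depends on LC_CTYPE.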
    
    0010: For downcase_identifier(), use a new provider-specific
    pg_strfold_ident() method. The ICU version of this method is a
    work in progress, because right now it depends on libc. I suppose it
    should decode to UTF-32, then go through u_tolower(), then re-encode --
    but can the re-encoding fail? In any case, it would be a behavior
    change for identifier casefolding with ICU and a single-byte encoding,
    which is probably OK, but the risk is non-zero.
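
    A rough sketch of the re-encoding concern, assuming UTF-8 as the
    target. toy_tolower() is a hypothetical stand-in for u_tolower()
    that knows only ASCII plus two simple lowercase mappings from the
    Unicode data (U+023A to U+2C65 and U+0130 to U+0069); utf8_encode()
    is likewise written for this example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode code point cp as UTF-8 into out (at least 4 bytes).
 * Returns the number of bytes written, or 0 if cp is a surrogate
 * or beyond U+10FFFF and therefore cannot be encoded.
 */
size_t
utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000)
    {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;           /* surrogates are not encodable */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF)
    {
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Toy lowercase mapping: ASCII plus two hand-picked code points. */
uint32_t
toy_tolower(uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z')
        return cp + ('a' - 'A');
    if (cp == 0x023A)
        return 0x2C65;          /* 2 bytes in UTF-8 become 3 */
    if (cp == 0x0130)
        return 0x0069;          /* 2 bytes in UTF-8 become 1 */
    return cp;
}
```

    In UTF-8 a valid code point can always be re-encoded, but the byte
    length can grow or shrink, so the output buffer can't be sized from
    the input length alone. For a single-byte encoding the open question
    remains whether the lowercased code point is representable at all.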
    
    0011: POC patch to introduce lc_collate GUC. It would only affect
    extensions, PLs, libraries, or other non-core code that happens to call
    strcoll() or strxfrm(). This would address Daniel's complaint, but it's
    more flexible. And by being a GUC, it's clear that we shouldn't depend
    on it for any stored data. We can do something similar for LC_CTYPE
    after we eliminate dependencies in core code.
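
    To make the dependency concrete, here is a small standalone sketch
    (not from the patch) of how strcoll() follows the process-global
    LC_COLLATE setting, which is the setting non-core code would be
    relying on. collate_cmp_under() is a hypothetical helper; note that
    setlocale() is process-wide and not thread-safe.

```c
#include <locale.h>
#include <string.h>

/*
 * Compare a and b with strcoll() while LC_COLLATE is temporarily set
 * to "locale", then restore the previous setting.
 * Returns -1, 0, or 1 for the comparison, or 2 if the locale could
 * not be queried or set.
 */
int
collate_cmp_under(const char *locale, const char *a, const char *b)
{
    char        saved[256];
    const char *cur = setlocale(LC_COLLATE, NULL);
    int         r;

    /* Copy the current name: the returned buffer may be overwritten. */
    if (cur == NULL || strlen(cur) >= sizeof(saved))
        return 2;
    strcpy(saved, cur);

    if (setlocale(LC_COLLATE, locale) == NULL)
        return 2;

    r = strcoll(a, b);
    setlocale(LC_COLLATE, saved);

    return (r > 0) - (r < 0);
}
```

    Under the "C" locale this orders by byte value, so "B" sorts before
    "a"; under other locales the same call may answer differently, which
    is why nothing stored should depend on it.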
    
    Regards,
    	Jeff Davis