Re: Remaining dependency on setlocale()
From: Jeff Davis <pgsql@j-davis.com>
To: Peter Eisentraut <peter@eisentraut.org>, Thomas Munro <thomas.munro@gmail.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@postgresql.org
Date: 2025-11-24T23:57:43Z
Lists: pgsql-hackers
Attachments
- v9-0001-Inline-pg_ascii_tolower-and-pg_ascii_toupper.patch (text/x-patch)
- v9-0002-Add-define-for-UNICODE_CASEMAP_BUFSZ.patch (text/x-patch)
- v9-0003-Change-some-callers-to-use-pg_ascii_toupper.patch (text/x-patch)
- v9-0004-Allow-pg_locale_t-APIs-to-work-when-ctype_is_c.patch (text/x-patch)
- v9-0005-Make-regex-max_chr-depend-on-encoding-not-provide.patch (text/x-patch)
- v9-0006-Fix-inconsistency-between-ltree_strncasecmp-and-l.patch (text/x-patch)
- v9-0007-Remove-char_tolower-API.patch (text/x-patch)
- v9-0008-Use-multibyte-aware-extraction-of-pattern-prefixe.patch (text/x-patch)
- v9-0009-fuzzystrmatch-use-pg_ascii_toupper.patch (text/x-patch)
- v9-0010-downcase_identifier-use-method-table-from-locale-.patch (text/x-patch)
- v9-0011-Control-LC_COLLATE-with-GUC.patch (text/x-patch)
On Thu, 2025-11-20 at 16:58 -0800, Jeff Davis wrote:
> On Wed, 2025-11-12 at 19:59 +0100, Peter Eisentraut wrote:
> > Many of these issues are pre-existing, but I just figured it has
> > reached a point where we need to do something about it.
>
> I tried to simplify things in this patch series, assuming that we
> have some tolerance for small behavior changes.
>
> 0001: No behavior change here, same patch as before. Uncontroversial
> simplification, so I plan to commit this soon.

Committed.

New series attached, which I tried to put in an order that would be
reasonable for commit.

0001-0004: Pure refactoring patches. I intend to commit a couple of
these soon.

0005: No behavioral change, and not much change at all. Computes the
"max_chr" for regexes (a performance optimization for low codepoints)
more consistently and simply based on the encoding.

0006: fixes longstanding ltree bug due to inconsistency between the
database locale and the global LC_CTYPE setting when using a non-libc
provider. The end result is also cleaner: use the database locale
consistently, like tsearch. I don't intend to backport this, unless
someone thinks it should be, but it should come with a release note to
reindex ltree indexes if using a non-libc provider.

0007: remove the char_tolower() API completely. We'd lose a pattern
matching optimization for single-byte encodings with libc and a non-C
locale, but it's a significant simplification. We could go even further
and change this to use casefolding rather than lower(), but that seems
like a separate change.

0008: Multibyte-aware extraction of pattern prefixes. The previous code
gave up on any byte that it didn't understand, which made prefixes
unnecessarily short. This patch is also cleaner.

0009: Changes fuzzystrmatch to use pg_ascii_toupper(). Most functions
in the extension are unaffected, but soundex() can be affected, and I'm
not sure what exactly it's supposed to do with non-ASCII.

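To make the non-ASCII question in 0009 concrete, here is a minimal
sketch of what an ASCII-only upcase does to a multibyte string. It is
not taken from the patch: the sketch_* names are hypothetical stand-ins,
and I'm assuming the input is UTF-8. Bytes outside 'a'..'z' pass through
unchanged regardless of setlocale(), so a non-ASCII character keeps its
original bytes.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for an ASCII-only upcase (illustration only,
 * not copied from the patch). Only 'a'..'z' are changed.
 */
static inline unsigned char
sketch_ascii_toupper(unsigned char ch)
{
	if (ch >= 'a' && ch <= 'z')
		ch = (unsigned char) (ch - 'a' + 'A');
	return ch;
}

/*
 * Upcase a buffer in place, byte by byte. The result does not depend on
 * the global LC_CTYPE setting, and bytes >= 0x80 (for example the
 * continuation bytes of a UTF-8 sequence) are left untouched.
 */
static void
sketch_ascii_upcase(char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		buf[i] = (char) sketch_ascii_toupper((unsigned char) buf[i]);
}
```

With this sketch, the UTF-8 string "caf\xc3\xa9" becomes "CAF\xc3\xa9":
the ASCII letters are upcased and the two bytes of the accented
character are preserved as-is.
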
0010: For downcase_identifier(), use a new provider-specific
pg_strfold_ident() method. The ICU version of this method is a
work-in-progress, because right now it depends on libc. I suppose it
should decode to UTF-32, then go through u_tolower(), then re-encode --
but can the re-encoding fail? In any case, it would be a behavior
change for identifier casefolding with ICU and a single-byte encoding,
which is probably OK but the risk is non-zero.

0011: POC patch to introduce lc_collate GUC. It would only affect
extensions, PLs, libraries, or other non-core code that happens to call
strcoll() or strxfrm(). This would address Daniel's complaint, but it's
more flexible. And by being a GUC, it's clear that we shouldn't depend
on it for any stored data. We can do something similar for LC_CTYPE
after we eliminate dependencies in core code.

Regards,
	Jeff Davis