Re: Remaining dependency on setlocale()

Jeff Davis <pgsql@j-davis.com>

From: Jeff Davis <pgsql@j-davis.com>

To: Peter Eisentraut <peter@eisentraut.org>, Chao Li <li.evan.chao@gmail.com>

Cc: Thomas Munro <thomas.munro@gmail.com>, Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@postgresql.org

Date: 2025-12-12T20:11:40Z

Lists: pgsql-hackers

Attachments

v12-0001-Use-multibyte-aware-extraction-of-pattern-prefix.patch (text/x-patch)
v12-0002-Remove-unused-single-byte-char_is_cased-API.patch (text/x-patch)
v12-0003-Fix-multibyte-issue-in-ltree_strncasecmp.patch (text/x-patch)
v12-0004-Fix-inconsistency-between-ltree_strncasecmp-and-.patch (text/x-patch)
v12-0005-downcase_identifier-use-method-table-from-locale.patch (text/x-patch)
v12-0006-Avoid-global-LC_CTYPE-dependency-in-pg_locale_ic.patch (text/x-patch)
v12-0007-fuzzystrmatch-use-pg_ascii_toupper.patch (text/x-patch)
v12-0008-Control-LC_COLLATE-with-GUC.patch (text/x-patch)

On Fri, 2025-12-05 at 16:01 +0100, Peter Eisentraut wrote:
> v11-0003-Fix-inconsistency-between-ltree_strncasecmp-and-.patch
> 
> The function comment reads "Check if b has a prefix of a." -- Is that
> the same as "Check if a is a prefix of b."?  The latter might be
> clearer.

Yes, fixed.

Note: I separated this into two patches. 0003 fixes the multibyte
mishandling issue, and 0004 consistently performs case folding. 0003 is
backpatchable, I believe.

> but the patch removes SB_lower_char().

Fixed and committed.

> v11-0006-Use-multibyte-aware-extraction-of-pattern-prefix.patch
> 
> Might have a small typo in the commit message:
> 
> ; and preserve and char-at-a-time logic for bytea.

Fixed.

I also changed it into two functions: like_fixed_prefix(), which is
almost unchanged from the original; and like_fixed_prefix_ci(), which
is multibyte and locale-aware. It was too confusing to have single-byte
and multi-byte logic in the same function, and they didn't share much
code anyway.
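
For readers who have not looked at the patch, here is a rough sketch of
what byte-at-a-time prefix extraction from a LIKE pattern can look like.
It is not the code in 0001: the function name, signature, and buffer
handling are made up for illustration, and it only covers the simple
single-byte scan (stop at the first unescaped % or _, honor backslash
escapes). A case-insensitive variant would instead have to step through
whole characters and consult the locale, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch, not the patch's code: copy the literal prefix of a
 * LIKE pattern into out (capacity outsz bytes), stopping at the first
 * unescaped wildcard.  Returns the prefix length in bytes.
 */
size_t
sketch_like_fixed_prefix(const char *patt, char *out, size_t outsz)
{
    size_t      n = 0;

    assert(patt != NULL && out != NULL && outsz > 0);

    for (size_t pos = 0; patt[pos] != '\0' && n + 1 < outsz; pos++)
    {
        if (patt[pos] == '%' || patt[pos] == '_')
            break;              /* wildcard ends the fixed prefix */
        if (patt[pos] == '\\')
        {
            pos++;              /* backslash escapes the next byte */
            if (patt[pos] == '\0')
                break;          /* trailing backslash: nothing to copy */
        }
        out[n++] = patt[pos];
    }
    out[n] = '\0';
    return n;
}
```

With this sketch, the pattern abc%def yields the prefix "abc", and an
escaped wildcard such as a\%b_c yields "a%b".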

> case '\xc7':        /* C with cedilla */
> 
> so the premise that "fuzzystrmatch is designed for ASCII" does not
> appear to be correct.  Needs more analysis.
> 
> (But apparently it's not multibyte aware at all, so I don't know what
> to do about that.)

I didn't notice that, thank you. Agreed, we need a bit more discussion
around this case as well as soundex().
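
To make the concern concrete, here is a small sketch of an ASCII-only
upper-casing helper. The function below is a hypothetical stand-in, not
the actual helper used in 0007, and the reading of the bytes assumes a
LATIN1-style single-byte encoding, where 0xC7 is upper-case C with
cedilla and 0xE7 is its lower-case form.

```c
#include <assert.h>

/* Hypothetical stand-in for an ASCII-only upper-casing helper. */
unsigned char
sketch_ascii_toupper(unsigned char ch)
{
    if (ch >= 'a' && ch <= 'z')
        ch = (unsigned char) (ch - 'a' + 'A');
    return ch;
}

/* Self-check showing which bytes the helper does and does not change. */
void
sketch_ascii_toupper_check(void)
{
    assert(sketch_ascii_toupper('q') == 'Q');   /* ASCII letter is folded */
    assert(sketch_ascii_toupper(0xE7) == 0xE7); /* high-bit byte untouched */
    assert(sketch_ascii_toupper(0xC7) == 0xC7); /* already "upper" in LATIN1 */
}
```

Under that assumption, a branch like case '\xc7' would only be reached
for input that already contains the upper-case byte, whereas a
locale-aware toupper() in a LATIN1 locale could also map 0xE7 to it. That
is one way the behavior could change, which is why this needs discussion.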

> v11-0008-downcase_identifier-use-method-table-from-locale.patch
> 
> I'm confused here about the name of the function pg_strfold_ident().
> In general, case "folding" results in an opaque string that is really
> only useful for comparing against other case-folded strings.  But for
> identifiers we are actually interested in lower-casing.  I think this
> should be corrected in the API naming.

Agreed and fixed.

Also, I added 0006, which saves a locale_t object for ICU in this one
case where it's required. Surely that's not what we want in the long
term, but we don't have the infrastructure for decoding pg_wchar into
code points yet, and 0006 avoids the dependency on the global LC_CTYPE
setting.
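
As a rough illustration of the general idea (not the code in 0006), the
sketch below keeps a private locale_t handle in a struct and passes it to
a POSIX *_l function, so the result does not depend on whatever
setlocale() has installed for LC_CTYPE. The struct and function names are
hypothetical, and which libc routines the patch actually calls is not
something this sketch claims to show. On some platforms newlocale() is
declared in <xlocale.h> rather than <locale.h>.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical holder for a saved, non-global LC_CTYPE handle. */
typedef struct sketch_ctype
{
    locale_t    loc;            /* never installed as the global locale */
} sketch_ctype;

/* Create the handle once; returns 1 on success, 0 on failure. */
int
sketch_ctype_init(sketch_ctype *c, const char *name)
{
    assert(c != NULL && name != NULL);
    c->loc = newlocale(LC_CTYPE_MASK, name, (locale_t) 0);
    return c->loc != (locale_t) 0;
}

/* Lower-case a single-byte string in place using the saved handle. */
void
sketch_downcase(const sketch_ctype *c, char *s)
{
    for (; *s != '\0'; s++)
        *s = (char) tolower_l((unsigned char) *s, c->loc);
}

/* Release the handle. */
void
sketch_ctype_free(sketch_ctype *c)
{
    freelocale(c->loc);
    c->loc = (locale_t) 0;
}
```

A caller would initialize the handle once (for example with "C"), reuse
it for every call, and free it at shutdown.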

> v11-0009-Control-LC_COLLATE-with-GUC.patch
> 
> I know there were some complaints about compatibility with
> extensions, but I don't think anything concrete was presented.  I
> would like to see more evidence that we need this.
> 
> Also, recall that we used to have a lc_collate GUC, and in the end
> people got confused that it didn't actually show a meaningful value
> when you used ICU.  So we removed that.  It seems adding this back in
> would create a similar kind of confusion.  So to avoid that, maybe
> this should be called fallback_lc_collate or something like that.

Yes, this is a POC patch and needs more discussion.

What are your thoughts about a similar lc_ctype GUC, though? That has
slightly different trade-offs.

I believe v12 0001-0005 are about ready for commit, and 0003 should be
backported.

Regards,
	Jeff Davis