Re: Remaining dependency on setlocale()

Chao Li <li.evan.chao@gmail.com>

From: Chao Li <li.evan.chao@gmail.com>
To: Jeff Davis <pgsql@j-davis.com>
Cc: Peter Eisentraut <peter@eisentraut.org>, Thomas Munro <thomas.munro@gmail.com>, Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@postgresql.org
Date: 2025-11-26T01:50:54Z
Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. fuzzystrmatch: use pg_ascii_toupper().

  2. Avoid global LC_CTYPE dependency in pg_locale_icu.c.

  3. downcase_identifier(): use method table from locale provider.

  4. ltree: fix case-insensitive matching.

  5. Fix multibyte issue in ltree_strncasecmp().

  6. Use multibyte-aware extraction of pattern prefixes.

  7. Add pg_iswcased().

  8. Remove char_tolower() API.

  9. Make regex "max_chr" depend on encoding, not provider.

  10. Change some callers to use pg_ascii_toupper().

  11. Allow pg_locale_t APIs to work when ctype_is_c.

  12. Add #define for UNICODE_CASEMAP_BUFSZ.

  13. Inline pg_ascii_tolower() and pg_ascii_toupper().

  14. Avoid global LC_CTYPE dependency in pg_locale_libc.c.

  15. Force LC_COLLATE to C in postmaster.

  16. Change wchar2char() and char2wchar() to accept a locale_t.

  17. Use pg_ascii_tolower()/pg_ascii_toupper() where appropriate.

  18. inet_net_pton.c: use pg_ascii_tolower() rather than tolower().

  19. isn.c: use pg_ascii_toupper() instead of toupper().

  20. contrib/spi/refint.c: use pg_ascii_tolower() instead.

  21. copyfromparse.c: use pg_ascii_tolower() rather than tolower().

  22. Revert "Tidy up locale thread safety in ECPG library."

  23. Tidy up locale thread safety in ECPG library.

  24. All supported systems have locale_t.


> On Nov 26, 2025, at 01:20, Jeff Davis <pgsql@j-davis.com> wrote:
> 
> 
> New series attached with only these changes and a rebase.
> 
> Regards,
> Jeff Davis
> 
> <v10-0001-Inline-pg_ascii_tolower-and-pg_ascii_toupper.patch><v10-0002-Add-define-for-UNICODE_CASEMAP_BUFSZ.patch><v10-0003-Change-some-callers-to-use-pg_ascii_toupper.patch><v10-0004-Allow-pg_locale_t-APIs-to-work-when-ctype_is_c.patch><v10-0005-Make-regex-max_chr-depend-on-encoding-not-provid.patch><v10-0006-Fix-inconsistency-between-ltree_strncasecmp-and-.patch><v10-0007-Remove-char_tolower-API.patch><v10-0008-Use-multibyte-aware-extraction-of-pattern-prefix.patch><v10-0009-fuzzystrmatch-use-pg_ascii_toupper.patch><v10-0010-downcase_identifier-use-method-table-from-locale.patch><v10-0011-Control-LC_COLLATE-with-GUC.patch>

I continued reviewing 0005-0008.

0005 - no comment. The change looks correct to me. Before the patch, pg_regex_locale->ctype->max_chr <= MAX_SIMPLE_CHR should be the more common cases, with the patch, MAX_SIMPLE_CHR >= UCHAR_MAX should be the more common cases, thus there should be not a behavior change.

5 - 0006
```
@@ -77,10 +77,37 @@ compare_subnode(ltree_level *t, char *qn, int len, int (*cmpptr) (const char *,
 int
 ltree_strncasecmp(const char *a, const char *b, size_t s)
 {
-	char	   *al = str_tolower(a, s, DEFAULT_COLLATION_OID);
-	char	   *bl = str_tolower(b, s, DEFAULT_COLLATION_OID);
+	static pg_locale_t locale = NULL;
+	size_t		al_sz = s + 1;
+	char	   *al = palloc(al_sz);
+	size_t		bl_sz = s + 1;
+	char	   *bl = palloc(bl_sz);
+	size_t		needed;
 	int			res;
 
+	if (!locale)
+		locale = pg_database_locale();
+
+	needed = pg_strfold(al, al_sz, a, s, locale);
+	if (needed + 1 > al_sz)
+	{
+		/* grow buffer if needed and retry */
+		al_sz = needed + 1;
+		al = repalloc(al, al_sz);
+		needed = pg_strfold(al, al_sz, a, s, locale);
+		Assert(needed + 1 <= al_sz);
+	}
+
+	needed = pg_strfold(bl, bl_sz, b, s, locale);
+	if (needed + 1 > bl_sz)
+	{
+		/* grow buffer if needed and retry */
+		bl_sz = needed + 1;
+		bl = repalloc(bl, bl_sz);
+		needed = pg_strfold(bl, bl_sz, b, s, locale);
+		Assert(needed + 1 <= bl_sz);
+	}
+
 	res = strncmp(al, bl, s);
 
 	pfree(al);
```

I do think the new implementation has some problem.

* The retry logic implies that a single-byte char may become multiple bytes after folding, otherwise retry is not needed because you have allocated s+1 bytes for dest buffers. From this perspective, we should use two needed variables: neededA and neededB, if neededA != neededB, then the two strings are different; if neededA == neededB, then we should be perform strncmp, but here we should pass neededA (or neededB as they are identical) to strncmp(al, bl, neededA).

* Based on the logic you implemented in 0004, first pg_strfold() has copied as many chars as possible to dest buffer, so when retry, ideally we can should resume instead of start over. However, if single-byte->multi-byte folding happens, we have no information to decide from where to resume. From this perspective, in 0004, do we really need to take the try-the-best strategy for strlower_c()? If there are some other use cases that require data to be placed in dest buffer even if dest buffer doesn’t have enough space, then my patch [1] of changing strlower_libc_sb() should be considered.

6 - 0007
```
 	/*
-	 * For efficiency reasons, in the single byte case we don't call lower()
-	 * on the pattern and text, but instead call SB_lower_char on each
-	 * character.  In the multi-byte case we don't have much choice :-(. Also,
-	 * ICU does not support single-character case folding, so we go the long
-	 * way.
+	 * For efficiency reasons, in the C locale we don't call lower() on the
+	 * pattern and text, but instead call SB_lower_char on each character.
 	 */
```

SB_lower_char should be changed to C_IMatchText.

7 - 0007
```
/* setup to compile like_match.c for single byte case insensitive matches */
-#define MATCH_LOWER(t, locale) SB_lower_char((unsigned char) (t), locale)
+#define MATCH_LOWER
 ```

I think the comment should be updated accordingly, like “for ILIKE in the C locale”.


8 - 0008
```
+	/* for text types, use pg_wchar; for BYTEA, use char */
 	if (typeid != BYTEAOID)
 	{
-		patt = TextDatumGetCString(patt_const->constvalue);
-		pattlen = strlen(patt);
+		text	   *val = DatumGetTextPP(patt_const->constvalue);
+		pg_wchar   *wpatt;
+		pg_wchar   *wmatch;
+		char	   *match;
+
+		patt = VARDATA_ANY(val);
+		pattlen = VARSIZE_ANY_EXHDR(val);
+		wpatt = palloc((pattlen + 1) * sizeof(pg_wchar));
+		wmatch = palloc((pattlen + 1) * sizeof(pg_wchar));
+		pg_mb2wchar_with_len(patt, wpatt, pattlen);
+
+		match = palloc(pattlen + 1);
```

* match is allocated with pattlen+1 bytes, is it long enough to hold pattlen multiple-byte chars?

* match is not freed, but looks like it should be: 

*prefix_const = string_to_const(match, typeid);
 -> string_to_datum(str, datatype);
 -> CStringGetTextDatum(str);
 -> cstring_to_text(s)
 -> cstring_to_text_with_len(s, strlen(s));
 -> *result = (text *) palloc(len + VARHDRSZ);

So, match has been copied, it should be freed.

9 - 0008
```
-	}
+			/* Backslash escapes the next character */
+			if (patt[pos] == '\\')
+			{
+				pos++;
+				if (pos >= pattlen)
+					break;
+			}
 
-	match[match_pos] = '\0';
+			match[match_pos++] = pos;
+		}
```

Should “pos” be “part[pos]” assigning to match[match_pos++]?

I will review the rest 3 commits tomorrow.

[1] https://postgr.es/m/CAEoWx2m9mUN397neL=p9x0vaVcj5EGiKD53F1MNTwTDXizxiaA@mail.gmail.com

--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/