Re: C11: should we use char32_t for unicode code points?
Jeff Davis <pgsql@j-davis.com> — 2025-10-28T17:59:26Z
On Tue, 2025-10-28 at 15:40 +1300, Thomas Munro wrote:
> I was noticing that toupper_libc_mb() directly tests if a pg_wchar
> value is in the ASCII range, which only makes sense given knowledge
> of pg_wchar's encoding, so perhaps that should trigger this new
> coding rule.  But I agree that's pretty obscure... feel free to
> ignore that suggestion.

I'm not sure that casting it to char32_t would be an improvement
there. Perhaps if we can find some ways to generally clarify things
(some of which you suggest below), that could be part of a follow-up.

It looks like the current patch is a step in the right direction, so
I'll commit that soon and see what the buildfarm says.

> Hmm, the comment at the top explains that we apply that special ASCII
> treatment for default locales and not non-default locales, but it
> doesn't explain *why* we make that distinction.  Do you know?

It makes some sense: I suppose someone thought that non-ASCII behavior
in the default locale is just too likely to cause problems. But the
non-ASCII behavior is allowed if you use a COLLATE clause.

But the pattern wasn't followed quite the same way with ICU, which
uses the given locale for UPPER()/LOWER() regardless of whether it's
the default locale or not. And for regexes, ICU doesn't use the locale
at all, it just uses u_isalpha(), etc., even if you use a COLLATE
clause.

And there are still some places that call plain tolower()/toupper(),
such as fuzzystrmatch and ltree.

> Right, we do know the encoding of pg_wchar in every case (assuming
> that all pg_wchar values come from our transcoding routines).  We
> just don't know if that encoding is also the one used by libc's
> locale-sensitive functions that deal in wchar_t, except when the
> locale is one that uses UTF-8 for char encoding, in which case we
> assume that every libc must surely use Unicode codepoints in wchar_t.

Ah, right.
We create pg_wchars for any encoding, but we only pass a pg_wchar to a
libc multibyte function in the UTF-8 encoding.

(Aside: we do pass pg_wchars directly to ICU as UTF-32 codepoints,
regardless of encoding, which is a bug.)

> For locales that use UTF-8 for char, we expect libc to understand
> pg_wchar/wchar_t/wint_t values as UTF-32 or at a stretch UTF-16.  The
> expected source of these pg_wchar values is our various regexp code
> paths that will use our mbutils pg_wchar conversion to UTF-32, with a
> reasonable copying strategy for sizeof(wchar_t) == 2 (that's Windows
> and I think otherwise only AIX in 32 bit builds, if it comes back).
> If any libc didn't use Unicode codepoints in its locale-sensitive
> wchar_t functions for UTF-8 locales we'd get garbage results, but we
> don't know of any such system.

Check.

> It's a bit of a shame that C11 didn't introduce the obvious
> isualpha(char32_t) variants for a standard-supported version of that
> realpolitik we depend on, but perhaps one day...

Yeah...

> There is one minor quirk here that it might be nice to document in
> top comment section 2: on Windows we also expect wchar_t to be
> understood by system wctype functions as UTF-16 for locales that
> *don't* use UTF-8 for char (an assumption that definitely doesn't
> hold on many Unixen).  That is important because on Windows we allow
> non-UTF-8 locales to be used in UTF-8 databases for historical
> reasons.

Interesting.

> For single-byte encodings: pg_latin12wchar_with_len() just
> zero-extends the bytes to pg_wchar, so when the pg_locale_libc.c
> functions truncate them and call 8-bit ctype stuff eg isalpha_l(), it
> completes a perfect round trip inside our code.

So you're saying that pg_wchar is more like a union type?

  typedef union pg_wchar
  {
     char     ch;     /* single-byte encodings or
                       * non-UTF8 encodings on unix */
     char16_t utf16;  /* windows non-UTF8 encodings */
     char32_t utf32;  /* UTF-8 encoding */
  } pg_wchar;

(we'd have to be careful about the memory layout if we're casting,
though)

> (BTW pg_latin12wchar_with_len() has the same definition as
> pg_ascii2wchar_with_len(), and is used for many single-byte encodings
> other than LATIN1 which makes me wonder why we don't just have a
> single function pg_char2wchar_with_len() that is used by all "simple
> widening" cases.)

Sounds like a nice simplification.

> We never know or care which encoding libc would itself use for these
> locales' wchar_t, as we don't ever pass it a wchar_t.

Ah, that makes sense.

> Assuming I understood that correctly, I think it would be nice if the
> "100% correct for LATINn" comment stated the reason for that
> certainty explicitly, ie that it closes an information-preserving
> round-trip beginning with the coercion in pg_latin12wchar_with_len()
> and that libc never receives a wchar_t/wint_t that we fabricated.

Agreed, though I think some refactoring would be helpful to accompany
the comment. I've worked with this stuff a lot and I still find it
hard to keep everything in mind at once.

> A bit of a digression, which I *think* is out-of-scope for this
> module, but just while I'm working through all the implications:
> This could produce unspecified results if a wchar_t from another
> source ever arrived into these functions

Ugh. When I first started dealing with pg_wchar, I assumed it was just
a wider wchar_t to abstract away some of the complexity when
sizeof(wchar_t) == 2 (e.g. get rid of surrogate pairs). It's clearly
more complicated than that.

> For multi-byte encodings other than UTF-8, pg_locale_libc.c is
> basically giving up almost completely

Right.
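
For illustration, the single-byte round trip described above can be
sketched roughly as follows. This is a minimal standalone sketch, not
the PostgreSQL code: sketch_wchar, sketch_char2wchar_with_len() and
sketch_wc_isalpha() are hypothetical names, a plain uint32_t stands in
for pg_wchar, and isalpha() stands in for isalpha_l().

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for pg_wchar (assumed to be a 32-bit unsigned). */
typedef uint32_t sketch_wchar;

/*
 * Hypothetical "simple widening": zero-extend each input byte into a
 * sketch_wchar.  The caller must provide room for len + 1 elements in
 * "to".  Returns the number of characters written, excluding the
 * terminating zero.
 */
static size_t
sketch_char2wchar_with_len(const unsigned char *from, sketch_wchar *to,
                           size_t len)
{
    size_t      n = 0;

    assert(from != NULL && to != NULL);

    while (n < len && from[n] != '\0')
    {
        to[n] = from[n];        /* zero-extension, no transcoding */
        n++;
    }
    to[n] = 0;
    return n;
}

/*
 * Classification for a single-byte encoding: truncate back to the
 * original byte and use the 8-bit ctype function.  libc never sees a
 * wide character that we fabricated.
 */
static int
sketch_wc_isalpha(sketch_wchar wc)
{
    if (wc > 0xFF)
        return 0;               /* cannot have come from the widening above */
    return isalpha((unsigned char) wc) != 0;
}
```

The point is only that widening followed by truncation yields the
original byte, so the result depends on the 8-bit ctype tables and not
on whatever encoding libc uses for wchar_t in that locale.
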
> I believe we can ignore MULE internal, as no libc supports it (so you
> could only get here with the C locale where you'll get the garbage
> results you asked for... in fact I wonder why we need MULE internal
> at all... it seems to be a sort of double-encoding for multiplexing
> other encodings, so we can't exactly say it's not blessed by a
> standard, it's indirectly defined by "all the standards" in a sense,
> but it's also entirely obsoleted by Unicode's unification so I don't
> know what problem it solves for anyone, or if anyone ever needed it
> in any reasonable pg_upgrade window of history...).

I have never heard of someone using it in production, and I wouldn't
object if someone wants to deprecate it.

> 2.  More expensive but complete: handle ASCII range with existing
> 8-bit ctype functions, and otherwise convert our pg_wchar back to MB
> char format and then use libc's mbstowcs_l() to make a wchar_t that
> libc's wchar_t-based functions should understand.

Correct. Sounds painful, but perhaps we could just do it and measure
the performance.

> To avoid doing hard work for nothing (ideogram-based languages
> generally don't care about ctype stuff so that'd be the vast majority
> of characters appearing in Chinese/Japanese/Korean text) at the cost
> of having to do a bunch of research, we could short-circuit the core
> CJK character ranges, and do the extra CPU cycles for the rest,

I don't think we should start making a bunch of assumptions like that.

> 3.  I assume there are some good reasons we don't do this but... if
> we used char2wchar() in the first place (= libc native wchar_t) for
> the regexp stuff that calls this stuff (as we do already inside
> whole-string upper/lower, just not character upper/lower or character
> classification), then we could simply call the wchar_t libc functions
> directly and unconditionally in the libc provider for all cases,
> instead of the 8-bit variants with broken edge cases for non-UTF-8
> databases.

I'm not sure about that either, but I think it's because you can end
up with surrogate pairs, which can't be represented in UTF-8.

> I didn't try to find the historical discussions, but I can imagine
> already that we might not have done that because it has to copy to
> cope with non-NULL-terminated strings,

That's probably another reason.

> and it would only be appropriate for libc locales anyway and yet now
> we have other locale providers that certainly don't want some
> unspecified wchar_t encoding or libc involved.

We could fix that by making some of these APIs take a char pointer
instead. That would allow libc to decode to wchar_t, and other
providers to decode to UTF-32.

Or, we could say that pg_wchar is an opaque type that can only be
created by the provider, and passed back to the same provider.

> It's also likely that non-UTF-8 systems are of dwindling interest to
> anyone outside perhaps client encodings

That's been my experience -- haven't run into many non-UTF8 server
encodings.

> In passing, I wonder why _libc.c has that comment about ICU in
> parentheses.  Not relevant here.

I moved it in 4da12e9e2e.

> I haven't thought much about whether it's relevant in the ICU
> provider code (it may come back to that
> do-we-accept-pg_wchar-we-didn't-make? question), but if it is then it
> also applies to Windows and probably glibc in the libc provider and I
> don't immediately see any problem (assuming no-we-don't! answer).

It's relevant for the regc_wc_isalpha(), etc. functions:

https://www.postgresql.org/message-id/e7b67d24288f811aebada7c33f9ae629dde0def5.camel@j-davis.com

Regards,
	Jeff Davis
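
Footnote on the char-pointer idea above: a rough sketch of what a
libc-side classification function could look like if it took the bytes
of one character instead of a pg_wchar. This is only an illustration
under stated assumptions. sketch_libc_mb_isalpha() is a hypothetical
name, it uses standard mbrtowc() with the process locale rather than
mbstowcs_l() with an explicit locale, and it assumes the encoding is
ASCII-compatible so that a lead byte below 0x80 is an ASCII character.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical provider entry point: classify one character given as
 * "len" bytes in the current libc locale's multibyte encoding.
 */
static int
sketch_libc_mb_isalpha(const char *mb, size_t len)
{
    mbstate_t   state;
    wchar_t     wc;
    size_t      r;

    assert(mb != NULL);

    if (len == 0)
        return 0;

    /* ASCII range: use the 8-bit ctype function directly. */
    if ((unsigned char) mb[0] < 0x80)
        return isalpha((unsigned char) mb[0]) != 0;

    /*
     * Otherwise let libc decode into its own wchar_t representation, so
     * we never need to know what that representation is.  Where wchar_t
     * is 16 bits wide, characters outside the BMP are not handled by
     * this sketch.
     */
    memset(&state, 0, sizeof(state));
    r = mbrtowc(&wc, mb, len, &state);
    if (r == (size_t) -1 || r == (size_t) -2)
        return 0;               /* invalid or incomplete sequence */

    return iswalpha((wint_t) wc) != 0;
}
```

Another provider could implement the same signature by decoding the
bytes to UTF-32 instead, which is the split suggested above.
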