Thread

  1. Re: C11: should we use char32_t for unicode code points?

    Jeff Davis <pgsql@j-davis.com> — 2025-10-28T17:59:26Z

    On Tue, 2025-10-28 at 15:40 +1300, Thomas Munro wrote:
    > I was noticing that toupper_libc_mb() directly tests if a pg_wchar
    > value is in the ASCII range, which only makes sense given knowledge
    > of pg_wchar's encoding, so perhaps that should trigger this new
    > coding rule.  But I agree that's pretty obscure...  feel free to
    > ignore that suggestion.
    
    I'm not sure that casting it to char32_t would be an improvement there.
    Perhaps if we can find some ways to generally clarify things (some of
    which you suggest below), that could be part of a follow-up.
    
    It looks like the current patch is a step in the right direction, so
    I'll commit that soon and see what the buildfarm says.
    
    > Hmm, the comment at the top explains that we apply that special ASCII
    > treatment for default locales and not non-default locales, but it
    > doesn't explain *why* we make that distinction.  Do you know?
    
    It makes some sense: I suppose someone thought that non-ASCII behavior
    in the default locale is just too likely to cause problems. But the
    non-ASCII behavior is allowed if you use a COLLATE clause.
    
    But the pattern wasn't followed quite the same way with ICU, which uses
    the given locale for UPPER()/LOWER() regardless of whether it's the
    default locale or not. And for regexes, ICU doesn't use the locale at
    all, it just uses u_isalpha(), etc., even if you use a COLLATE clause.
    
    And there are still some places that call plain tolower()/toupper(),
    such as fuzzystrmatch and ltree.
    
    > 
    > Right, we do know the encoding of pg_wchar in every case (assuming
    > that all pg_wchar values come from our transcoding routines).  We
    > just don't know if that encoding is also the one used by libc's
    > locale-sensitive functions that deal in wchar_t, except when the
    > locale is one that uses UTF-8 for char encoding, in which case we
    > assume that every libc must surely use Unicode codepoints in wchar_t.
    
    Ah, right. We create pg_wchars for any encoding, but we only pass a
    pg_wchar to a libc multibyte function in the UTF-8 encoding.
    
    (Aside: we do pass pg_wchars directly to ICU as UTF-32 codepoints,
    regardless of encoding, which is a bug.)
    
    
    > For locales that use UTF-8 for char, we expect libc to understand
    > pg_wchar/wchar_t/wint_t values as UTF-32 or at a stretch UTF-16.  The
    > expected source of these pg_wchar values is our various regexp code
    > paths that will use our mbutils pg_wchar conversion to UTF-32, with a
    > reasonable copying strategy for sizeof(wchar_t) == 2 (that's Windows
    > and I think otherwise only AIX in 32 bit builds, if it comes back).
    > If any libc didn't use Unicode codepoints in its locale-sensitive
    > wchar_t functions for UTF-8 locales we'd get garbage results, but we
    > don't know of any such system.
    
    Check.
    
    >   It's a bit of a shame that C11 didn't
    > introduce the obvious isualpha(char32_t) variants for a
    > standard-supported version of that realpolitik we depend on, but
    > perhaps one day...
    
    Yeah...
    
    > There is one minor quirk here that it might be nice to document in top
    > comment section 2: on Windows we also expect wchar_t to be understood
    > by system wctype functions as UTF-16 for locales that *don't* use
    > UTF-8 for char (an assumption that definitely doesn't hold on many
    > Unixen).  That is important because on Windows we allow non-UTF-8
    > locales to be used in UTF-8 databases for historical reasons.
    
    Interesting.
    
    > For single-byte encodings: pg_latin12wchar_with_len() just
    > zero-extends the bytes to pg_wchar, so when the pg_locale_libc.c
    > functions truncate them and call 8-bit ctype stuff eg isalpha_l(), it
    > completes a perfect round trip inside our code.
    
    So you're saying that pg_wchar is more like a union type?
    
        typedef union pg_wchar
        {
            char     ch;     /* single-byte encodings, or
                                non-UTF8 encodings on Unix */
            char16_t utf16;  /* Windows non-UTF8 encodings */
            char32_t utf32;  /* UTF-8 encoding */
        } pg_wchar;
    
    (we'd have to be careful about the memory layout if we're casting,
    though)
    
    >   (BTW
    > pg_latin12wchar_with_len() has the same definition as
    > pg_ascii2wchar_with_len(), and is used for many single-byte encodings
    > other than LATIN1 which makes me wonder why we don't just have a
    > single function pg_char2wchar_with_len() that is used by all "simple
    > widening" cases.)
    
    Sounds like a nice simplification.
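
    To make the round trip concrete, here is a minimal sketch of what a
    single "simple widening" function could look like. The names
    sketch_wchar, sketch_char2wchar_with_len() and sketch_wc_isalpha() are
    hypothetical stand-ins, not existing PostgreSQL symbols, and plain
    isalpha() stands in for isalpha_l() so the example stays portable:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Stand-in for pg_wchar (an assumption made for this sketch). */
typedef unsigned int sketch_wchar;

/*
 * Hypothetical "simple widening": zero-extend each byte of a
 * single-byte-encoded string into one sketch_wchar.  Stops after len
 * bytes or at a NUL byte, NUL-terminates the output, and returns the
 * number of characters written.
 */
static int
sketch_char2wchar_with_len(const unsigned char *from, sketch_wchar *to, int len)
{
    int cnt = 0;

    assert(from != NULL && to != NULL);
    while (len > 0 && *from)
    {
        *to++ = *from++;    /* zero-extension: value stays in 0..255 */
        len--;
        cnt++;
    }
    *to = 0;
    return cnt;
}

/*
 * Closing the round trip: truncate back to a byte and call the 8-bit
 * ctype function.  libc is never handed a wchar_t that we fabricated.
 */
static int
sketch_wc_isalpha(sketch_wchar wc)
{
    if (wc > 0xFF)
        return 0;           /* cannot have come from the widening above */
    return isalpha((unsigned char) wc) != 0;
}
```

    Because the widening never produces a value above 0xFF, the truncation
    in sketch_wc_isalpha() loses nothing, which is the
    information-preserving property described above.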
    
    >   We never know or care which encoding libc would
    > itself use for these locales' wchar_t, as we don't ever pass it a
    > wchar_t.
    
    Ah, that makes sense.
    
    >   Assuming I understood that correctly, I think it would be
    > nice if the "100% correct for LATINn" comment stated the reason for
    > that certainty explicitly, ie that it closes an
    > information-preserving round-trip beginning with the coercion in
    > pg_latin12wchar_with_len() and that libc never receives a
    > wchar_t/wint_t that we fabricated.
    
    Agreed, though I think some refactoring would be helpful to accompany
    the comment. I've worked with this stuff a lot and I still find it hard
    to keep everything in mind at once.
    
    > A bit of a digression, which I *think* is out-of-scope for this
    > module, but just while I'm working through all the implications:
    > This could produce unspecified results if a wchar_t from another
    > source ever arrived into these functions
    
    Ugh.
    
    When I first started dealing with pg_wchar, I assumed it was just a
    wider wchar_t to abstract away some of the complexity when
    sizeof(wchar_t) == 2 (e.g. get rid of surrogate pairs). It's clearly
    more complicated than that.
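
    For reference, the surrogate-pair part of that assumption is the easy
    bit. A minimal sketch of folding a UTF-16 pair into one UTF-32 code
    point follows; the function names are hypothetical and this is not
    code from the tree:

```c
#include <assert.h>
#include <stdint.h>

/* UTF-16 high (lead) surrogates occupy 0xD800..0xDBFF. */
static int
sketch_is_high_surrogate(uint16_t u)
{
    return u >= 0xD800 && u <= 0xDBFF;
}

/* UTF-16 low (trail) surrogates occupy 0xDC00..0xDFFF. */
static int
sketch_is_low_surrogate(uint16_t u)
{
    return u >= 0xDC00 && u <= 0xDFFF;
}

/*
 * Combine a surrogate pair into a single code point.  Each surrogate
 * carries 10 payload bits; the result is offset by 0x10000 because the
 * pair mechanism only encodes code points above the BMP.
 */
static uint32_t
sketch_combine_surrogates(uint16_t hi, uint16_t lo)
{
    assert(sketch_is_high_surrogate(hi) && sketch_is_low_surrogate(lo));
    return 0x10000u +
        (((uint32_t) (hi - 0xD800) << 10) | (uint32_t) (lo - 0xDC00));
}
```

    That only covers the sizeof(wchar_t) == 2 case; it says nothing about
    which encoding the resulting value is in, which is the harder problem.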
    
    > For multi-byte encodings other than UTF-8, pg_locale_libc.c is
    > basically giving up almost completely
    
    Right.
    
    > I believe we can ignore MULE internal, as no libc supports it (so
    > you could only get here with the C locale where you'll get the
    > garbage results you asked for...  in fact I wonder why we need MULE
    > internal at all... it seems to be a sort of double-encoding for
    > multiplexing other encodings, so we can't exactly say it's not
    > blessed by a standard, it's indirectly defined by "all the
    > standards" in a sense, but it's also entirely obsoleted by
    > Unicode's unification so I don't know what problem it solves for
    > anyone, or if anyone ever needed it in any reasonable pg_upgrade
    > window of history...).
    
    I have never heard of someone using it in production, and I wouldn't
    object if someone wants to deprecate it.
    
    > 2.  More expensive but complete: handle ASCII range with existing
    > 8-bit ctype functions, and otherwise convert our pg_wchar back to MB
    > char format and then use libc's mbstowcs_l() to make a wchar_t that
    > libc's wchar_t-based functions should understand.
    
    Correct. Sounds painful, but perhaps we could just do it and measure
    the performance.
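
    A rough sketch of the shape option 2 might take, under some loud
    assumptions: the pg_wchar has already been converted back to its
    multibyte form, standard mbrtowc() in the current locale stands in for
    mbstowcs_l(), and sketch_mb_isalpha() is a hypothetical name:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Classify one character given in multibyte form.  ASCII takes the
 * cheap 8-bit path; everything else is decoded by libc itself, so the
 * wchar_t handed to iswalpha() is one that libc produced, not us.
 */
static int
sketch_mb_isalpha(const char *mbchar, size_t len)
{
    mbstate_t state;
    wchar_t   wc;
    size_t    n;

    assert(mbchar != NULL && len > 0);

    /* Fast path: ASCII range, existing 8-bit ctype function. */
    if ((unsigned char) mbchar[0] < 0x80)
        return isalpha((unsigned char) mbchar[0]) != 0;

    /* Slow path: let libc decode the bytes into its own wchar_t. */
    memset(&state, 0, sizeof(state));
    n = mbrtowc(&wc, mbchar, len, &state);
    if (n == (size_t) -1 || n == (size_t) -2 || n == 0)
        return 0;           /* invalid, incomplete, or NUL */

    return iswalpha((wint_t) wc) != 0;
}
```

    The extra decode on the slow path is the cost that would need
    measuring.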
    
    >   To avoid doing hard
    > work for nothing (ideogram-based languages generally don't care about
    > ctype stuff so that'd be the vast majority of characters appearing in
    > Chinese/Japanese/Korean text) at the cost of having to do a bunch of
    > research, we could short-circuit the core CJK character ranges,
    > and do the extra CPU cycles for the rest,
    
    I don't think we should start making a bunch of assumptions like that.
    
    > 3.  I assume there are some good reasons we don't do this but... if
    > we used char2wchar() in the first place (= libc native wchar_t) for the
    > regexp stuff that calls this stuff (as we do already inside
    > whole-string upper/lower, just not character upper/lower or character
    > classification), then we could simply call the wchar_t libc functions
    > directly and unconditionally in the libc provider for all cases,
    > instead of the 8-bit variants with broken edge cases for non-UTF-8
    > databases.
    
    I'm not sure about that either, but I think it's because you can end up
    with surrogate pairs, which can't be represented in UTF-8.
    
    >   I didn't try to find the historical discussions, but I can
    > imagine already that we might not have done that because it has to
    > copy to cope with non-NULL-terminated strings,
    
    That's probably another reason.
    
    > and it would only be appropriate for libc locales anyway and
    > yet now we have other locale providers that certainly don't want some
    > unspecified wchar_t encoding or libc involved.
    
    We could fix that by making some of these APIs take a char pointer
    instead. That would allow libc to decode to wchar_t, and other
    providers to decode to UTF-32. Or, we could say that pg_wchar is an
    opaque type that can only be created by the provider, and passed back
    to the same provider.
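
    Purely as an illustration of the first idea, a char-pointer-based
    interface might look something like the sketch below. The struct, its
    member, and the toy ASCII-only provider are all hypothetical; none of
    this is the existing provider API:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical provider interface: callers pass the character in the
 * database encoding, and each provider decodes it however it likes
 * (libc to wchar_t, others to UTF-32).
 */
typedef struct sketch_ctype_provider
{
    /* classify the character starting at s, at most len bytes long */
    int (*char_isalpha) (const char *s, size_t len);
} sketch_ctype_provider;

/* Toy provider: treats only ASCII letters as alphabetic. */
static int
sketch_ascii_isalpha(const char *s, size_t len)
{
    assert(s != NULL);
    if (len == 0 || (unsigned char) s[0] >= 0x80)
        return 0;
    return isalpha((unsigned char) s[0]) != 0;
}

static const sketch_ctype_provider sketch_ascii_provider = {
    sketch_ascii_isalpha
};
```

    The point is only that no pg_wchar crosses the provider boundary, so
    no provider has to guess what encoding a given value is in.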
    
    >   It's also likely that
    > non-UTF-8 systems are of dwindling interest to anyone outside perhaps
    > client encodings
    
    That's been my experience -- haven't run into many non-UTF8 server
    encodings.
    
    > In passing, I wonder why _libc.c has that comment about ICU in
    > parentheses.  Not relevant here.
    
    I moved it in 4da12e9e2e.
    
    >   I haven't thought much about whether
    > it's relevant in the ICU provider code (it may come back to that
    > do-we-accept-pg_wchar-we-didn't-make? question), but if it is then it
    > also applies to Windows and probably glibc in the libc provider and I
    > don't immediately see any problem (assuming no-we-don't! answer).
    
    It's relevant for the regc_wc_isalpha(), etc. functions:
    
    https://www.postgresql.org/message-id/e7b67d24288f811aebada7c33f9ae629dde0def5.camel@j-davis.com
    
    Regards,
    	Jeff Davis