Thread

  1. Re: C11: should we use char32_t for unicode code points?

    Jeff Davis <pgsql@j-davis.com> — 2025-10-29T15:12:01Z

    On Wed, 2025-10-29 at 14:00 +1300, Thomas Munro wrote:
    > I wonder if the logic to select the member/semantics could be turned
    > into an enum in the encoding table, to make it even clearer, and then
    > that could be used as an index into a table of ctype methods objects
    > in _libc.c.
    
    As long as we're able to isolate that logic in the libc provider,
    that's reasonable. The other providers don't need that complexity, they
    just need to decode straight to UTF-32.
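
As a rough sketch of the enum idea quoted above (not actual PostgreSQL code: the enum, struct, and function names are all hypothetical, and the encoding-to-semantics mapping is only illustrative), the encoding table could carry a semantics tag that the libc provider uses to index its own table of method objects:

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <wctype.h>

/* Hypothetical tag: how the libc provider should interpret a pg_wchar. */
typedef enum WcharSemantics
{
    WCHAR_SEM_ASCII_ONLY = 0,   /* only 7-bit values are meaningful */
    WCHAR_SEM_UTF32,            /* value is a Unicode code point */
    WCHAR_SEM_CUSTOM,           /* encoding-specific packing, e.g. EUC */
    WCHAR_SEM_COUNT
} WcharSemantics;

/* Hypothetical slice of an encoding table carrying the tag. */
typedef struct EncodingInfo
{
    const char *name;
    WcharSemantics semantics;
} EncodingInfo;

const EncodingInfo encoding_table[] = {
    {"SQL_ASCII", WCHAR_SEM_ASCII_ONLY},
    {"UTF8", WCHAR_SEM_UTF32},
    {"EUC_JP", WCHAR_SEM_CUSTOM},
};

/* Hypothetical method object; a real one would have many more members. */
typedef struct CtypeMethods
{
    bool (*is_alpha) (uint32_t wc);
} CtypeMethods;

bool is_alpha_ascii(uint32_t wc)
{
    return wc < 128 && isalpha((int) wc) != 0;
}

bool is_alpha_utf32(uint32_t wc)
{
    return iswalpha((wint_t) wc) != 0;
}

/* Placeholder: without a conversion step, only classify the ASCII range. */
bool is_alpha_custom(uint32_t wc)
{
    return is_alpha_ascii(wc);
}

/* The enum value is the index; this table would live in the libc provider. */
const CtypeMethods libc_ctype_methods[WCHAR_SEM_COUNT] = {
    [WCHAR_SEM_ASCII_ONLY] = {is_alpha_ascii},
    [WCHAR_SEM_UTF32] = {is_alpha_utf32},
    [WCHAR_SEM_CUSTOM] = {is_alpha_custom},
};

/* Look up the method object for an encoding name; NULL if unknown. */
const CtypeMethods *libc_methods_for_encoding(const char *name)
{
    size_t n = sizeof(encoding_table) / sizeof(encoding_table[0]);

    for (size_t i = 0; i < n; i++)
    {
        if (strcmp(encoding_table[i].name, name) == 0)
        {
            assert(encoding_table[i].semantics < WCHAR_SEM_COUNT);
            return &libc_ctype_methods[encoding_table[i].semantics];
        }
    }
    return NULL;
}
```

Only the libc provider consults the tag here; a provider that always decodes to UTF-32 would never need to look at it.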
    
    > You showed char16_t for Windows, but we don't ever get char16_t out
    > of wchar.c, it's always char32_t for UTF-8 input.  It's just that
    > _libc.c truncates to UTF-16 or short-circuits to avoid overflow on
    > that platform (and in the past AIX 32-bit and maybe more), so it
    > wouldn't belong in a hypothetical union or enum.
    
    Oh, I see.
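
To illustrate the short-circuit described in the quoted text, here is a minimal sketch, assuming the input is always a UTF-32 code point and the only platform difference is the width of wchar_t. The function names are hypothetical and this is not the code in _libc.c:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>
#include <wctype.h>

/*
 * True if a code point can be handed to an isw*() function as a single
 * wchar_t of the given size in bytes. With a 2-byte wchar_t, anything
 * above U+FFFF would overflow.
 */
bool fits_in_wchar(uint32_t cp, size_t wchar_size)
{
    return wchar_size >= 4 || cp <= 0xFFFF;
}

/* Hypothetical wrapper: classify a UTF-32 code point using libc. */
bool libc_isalpha_utf32(uint32_t cp)
{
    if (!fits_in_wchar(cp, sizeof(wchar_t)))
        return false;           /* short-circuit instead of truncating */
    return iswalpha((wint_t) cp) != 0;
}
```

The input type stays 32 bits wide on every platform; the 16-bit limit shows up only as a guard inside the libc-specific code, which is why it would not need its own entry in a union or enum.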
    
    > Perhaps we could at least put the conversion in a new encoding table
    > function pointer "pg_wchar_custom_to_wchar_t", so we could reserve a
    > place to put that sort of optimisation in
    
    That sounds like a good step forward. And maybe one to convert to
    UTF-32 for ICU, also?
    
    > If we do develop this idea though, one issue to contemplate is that
    > EUC code points might generate more than one wchar_t, looking at
    > EUC_JIS_2004[1].
    
    Wow, that's unfortunate.
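
A sketch of how a conversion function pointer in the encoding table could allow for that, assuming a signature that writes into a caller-supplied buffer and returns the number of wchar_t values produced. Only the name pg_wchar_custom_to_wchar_t comes from the discussion; the signature, the struct, the buffer bound, and the toy input value are made up for illustration:

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Assumed upper bound on outputs per character, for illustration only. */
#define MAX_WCHARS_PER_CHAR 2

/*
 * Hypothetical signature: convert one encoding-specific pg_wchar into
 * one or more wchar_t values. Returns the number written, or 0 if the
 * output buffer is too small.
 */
typedef size_t (*custom_to_wchar_t_fn) (uint32_t custom,
                                        wchar_t *out, size_t outlen);

/* Hypothetical slice of an encoding table entry. */
typedef struct EncodingEntry
{
    const char *name;
    custom_to_wchar_t_fn pg_wchar_custom_to_wchar_t;   /* NULL if unused */
} EncodingEntry;

/* Made-up input value standing in for a character with a two-unit result. */
#define TOY_TWO_UNIT_CHAR 0xFFFF0001u

/* Toy converter: one made-up input yields a base plus a combining mark. */
size_t toy_custom_to_wchar_t(uint32_t custom, wchar_t *out, size_t outlen)
{
    if (custom == TOY_TWO_UNIT_CHAR)
    {
        if (outlen < 2)
            return 0;
        out[0] = (wchar_t) 0x304B;  /* HIRAGANA LETTER KA */
        out[1] = (wchar_t) 0x309A;  /* COMBINING SEMI-VOICED SOUND MARK */
        return 2;
    }
    if (outlen < 1)
        return 0;
    out[0] = (wchar_t) custom;      /* toy pass-through for everything else */
    return 1;
}

/* Use the table's converter if present, else pass the value through. */
size_t convert_one(const EncodingEntry *enc, uint32_t wc,
                   wchar_t *out, size_t outlen)
{
    if (enc->pg_wchar_custom_to_wchar_t != NULL)
        return enc->pg_wchar_custom_to_wchar_t(wc, out, outlen);
    if (outlen < 1)
        return 0;
    out[0] = (wchar_t) wc;
    return 1;
}
```

Returning a count rather than a single wchar_t means callers have to cope with multi-unit results, which is exactly where the single-character isw*() functions stop being a good fit.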
    
    
    Regards,
    	Jeff Davis