Re: C11: should we use char32_t for unicode code points?
From: Jeff Davis <pgsql@j-davis.com>
To: Thomas Munro <thomas.munro@gmail.com>
Cc: Tatsuo Ishii <ishii@postgresql.org>, pgsql-hackers@postgresql.org
Date: 2025-10-29T15:12:01Z
Lists: pgsql-hackers

On Wed, 2025-10-29 at 14:00 +1300, Thomas Munro wrote:
> I wonder if the logic to select the member/semantics could be turned
> into an enum in the encoding table, to make it even clearer, and then
> that could be used as an index into a table of ctype methods objects
> in _libc.c.

As long as we're able to isolate that logic in the libc provider,
that's reasonable. The other providers don't need that complexity, they
just need to decode straight to UTF-32.

> You showed char16_t for Windows, but we don't ever get char16_t out of
> wchar.c, it's always char32_t for UTF-8 input. It's just that _libc.c
> truncates to UTF-16 or short-circuits to avoid overflow on that
> platform (and in the past AIX 32-bit and maybe more), so it wouldn't
> belong in a hypothetical union or enum.

Oh, I see.

> Perhaps we could at least put the conversion in a new encoding table
> function pointer "pg_wchar_custom_to_wchar_t", so we could reserve a
> place to put that sort of optimisation in

That sounds like a good step forward. And maybe one to convert to
UTF-32 for ICU, also?

> If we do develop this idea though, one issue to contemplate is that
> EUC code points might generate more than one wchar_t, looking at
> EUC_JIS_2004[1].

Wow, that's unfortunate.

Regards,
	Jeff Davis
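
For illustration only, the enum-in-the-encoding-table idea quoted above might look roughly like the sketch below. Every name in it (WcharSemantics, CtypeMethods, EncodingEntry, methods_for, and the two example encodings' assignments) is hypothetical and not taken from any posted patch; uint32_t stands in for char32_t so the sketch does not depend on <uchar.h>.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>
#include <wctype.h>

/* Hypothetical: how a decoded value should be interpreted by the libc provider. */
typedef enum WcharSemantics
{
    WCHAR_SEM_ASCII_ONLY = 0,   /* only values below 128 are handed to libc */
    WCHAR_SEM_UTF32,            /* value is a code point, handed to isw*() */
    WCHAR_SEM_COUNT
} WcharSemantics;

/* Hypothetical "ctype methods object": one set of callbacks per semantics. */
typedef struct CtypeMethods
{
    bool     (*is_alpha) (uint32_t c);
    uint32_t (*to_upper) (uint32_t c);
} CtypeMethods;

static bool
ascii_is_alpha(uint32_t c)
{
    return c < 128 && isalpha((unsigned char) c) != 0;
}

static uint32_t
ascii_to_upper(uint32_t c)
{
    /* leave anything outside ASCII untouched */
    return c < 128 ? (uint32_t) toupper((unsigned char) c) : c;
}

static bool
wide_is_alpha(uint32_t c)
{
    return iswalpha((wint_t) c) != 0;
}

static uint32_t
wide_to_upper(uint32_t c)
{
    return (uint32_t) towupper((wint_t) c);
}

/* The enum value is the index into this table. */
static const CtypeMethods ctype_methods[WCHAR_SEM_COUNT] = {
    [WCHAR_SEM_ASCII_ONLY] = {ascii_is_alpha, ascii_to_upper},
    [WCHAR_SEM_UTF32] = {wide_is_alpha, wide_to_upper},
};

/* Hypothetical slice of an encoding table carrying the new enum member. */
typedef struct EncodingEntry
{
    const char *name;
    WcharSemantics sem;
} EncodingEntry;

static const EncodingEntry encodings[] = {
    {"SQL_ASCII", WCHAR_SEM_ASCII_ONLY},    /* assignment is illustrative */
    {"UTF8", WCHAR_SEM_UTF32},
};

static const CtypeMethods *
methods_for(const EncodingEntry *enc)
{
    assert(enc->sem >= 0 && enc->sem < WCHAR_SEM_COUNT);
    return &ctype_methods[enc->sem];
}
```

A caller would do something like methods_for(&encodings[1])->to_upper(c). The sketch does not cover the Windows UTF-16 truncation, which per the discussion stays inside _libc.c, nor the EUC_JIS_2004 case where one code point may need more than one wchar_t.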