Re: C11: should we use char32_t for unicode code points?

Jeff Davis <pgsql@j-davis.com>

From: Jeff Davis <pgsql@j-davis.com>

To: Thomas Munro <thomas.munro@gmail.com>

Cc: Tatsuo Ishii <ishii@postgresql.org>, pgsql-hackers@postgresql.org

Date: 2025-10-29T15:12:01Z

Lists: pgsql-hackers

On Wed, 2025-10-29 at 14:00 +1300, Thomas Munro wrote:
> I wonder if the logic to select the member/semantics could be turned
> into an enum in the encoding table, to make it even clearer, and then
> that could be used as an index into a table of ctype methods obejcts
> in _libc.c.

As long as we're able to isolate that logic in the libc provider,
that's reasonable. The other providers don't need that complexity, they
just need to decode straight to UTF-32.

> You showed char16_t for Windows, but we don't ever get char16_t out
> of
> wchar.c, it's always char32_t for UTF-8 input.  It's just that
> _libc.c
> truncates to UTF-16 or short-circuits to avoid overflow on that
> platform (and in the past AIX 32-bit and maybe more), so it wouldn't
> belong in a hypothetical union or enum.

Oh, I see.

> > 
> Perhaps we could at least put the conversion in a new encoding table
> function pointer "pg_wchar_custom_to_wchar_t", so we could reserve a
> place to put that sort of optimisation in

That sounds like a good step forward. And maybe one to convert to UTF-
32 for ICU, also?

> If we do develop this idea though, one issue to contemplate is that
> EUC code points might generate more than one wchar_t, looking at
> EUC_JIS_2004[1].

Wow, that's unfortunate.


Regards,
	Jeff Davis