Re: C11: should we use char32_t for unicode code points?
Tatsuo Ishii <ishii@postgresql.org> — 2025-10-28T08:36:13Z
> The EUC family has direct encoding of 7-bit ASCII and then 3
> selectable character sets represented by sequences with the high bit
> set, with details varying between the Chinese (simplified Chinese),
> Taiwanese (traditional Chinese), Japanese (2 kinds) and Korean
> variants. I don't know if the pg_wchar encoding we're producing in
> pg_euc*2wchar_with_len() has a name, but it doesn't appear to match
> the description of the standard "fixed" representation on the
> Wikipedia page for Extended Unix Code (it's too wide for starters,
> looking at the shift distances).

Yes. pg_euc*2wchar_with_len() takes the "variable length"
representation of EUC, which ranges from 1 byte to 4 bytes per
character, and expands each character into a pg_wchar. The result can
also be converted back to the multibyte representation easily.

Note that the standard "fixed" representation of EUC includes ASCII
range bytes in *non* ASCII characters, so I think it is not easy to
use as a backend safe encoding.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
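
The packing described in the message above can be illustrated with a minimal sketch. It is not the PostgreSQL source: the names wide_char, euc_jp_to_wide() and wide_to_euc_jp() are hypothetical, it assumes already-validated EUC-JP input only (so 1 to 3 bytes per character, not the 4-byte EUC-TW case), and it assumes the usual single-shift values SS2 = 0x8e and SS3 = 0x8f. Each character's bytes are shifted together into one wide value, which is why the conversion back to multibyte form is simple.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for this sketch; not PostgreSQL's actual API. */
typedef uint32_t wide_char;

#define SS2 0x8e    /* single shift 2: introduces code set 2 */
#define SS3 0x8f    /* single shift 3: introduces code set 3 */

/*
 * Pack each EUC-JP character (1 to 3 bytes) into one wide_char by
 * shifting its bytes together.  Input is assumed to be validated.
 * dst needs room for len + 1 elements.  Returns the character count.
 */
static size_t
euc_jp_to_wide(const unsigned char *src, size_t len, wide_char *dst)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (len > 0 && *src != 0)
    {
        if (*src == SS2 && len >= 2)            /* SS2 + 1 byte */
        {
            dst[n] = ((wide_char) SS2 << 8) | src[1];
            src += 2; len -= 2;
        }
        else if (*src == SS3 && len >= 3)       /* SS3 + 2 bytes */
        {
            dst[n] = ((wide_char) SS3 << 16) |
                     ((wide_char) src[1] << 8) | src[2];
            src += 3; len -= 3;
        }
        else if ((*src & 0x80) && len >= 2)     /* 2-byte character */
        {
            dst[n] = ((wide_char) src[0] << 8) | src[1];
            src += 2; len -= 2;
        }
        else                                    /* ASCII */
        {
            dst[n] = src[0];
            src += 1; len -= 1;
        }
        n++;
    }
    dst[n] = 0;
    return n;
}

/*
 * Unpack wide_chars back to bytes: emit the non-zero leading bytes
 * from high to low.  dst needs room for 3 * n + 1 bytes.
 * Returns the byte count.
 */
static size_t
wide_to_euc_jp(const wide_char *src, size_t n, unsigned char *dst)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
    {
        wide_char c = src[i];

        if (c >> 16)
            dst[out++] = (unsigned char) ((c >> 16) & 0xff);
        if (c >> 8)
            dst[out++] = (unsigned char) ((c >> 8) & 0xff);
        dst[out++] = (unsigned char) (c & 0xff);
    }
    dst[out] = 0;
    return out;
}
```

With this sketch, the byte sequence 0x41, 0xA4 0xA2, 0x8E 0xB1, 0x8F 0xB0 0xA1 becomes the four wide values 0x41, 0xA4A2, 0x8EB1 and 0x8FB0A1, and unpacking them reproduces the original eight bytes. Every byte of a non-ASCII character keeps its high bit set in the multibyte form, which is the property the "fixed" representation lacks.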