Thread

  1. Re: C11: should we use char32_t for unicode code points?

    Tatsuo Ishii <ishii@postgresql.org> — 2025-10-28T08:36:13Z

    > The EUC family has direct encoding of 7-bit ASCII and then 3
    > selectable character sets represented by sequences with the high bit
    > set, with details varying between the Chinese (simplified Chinese),
    > Taiwanese (traditional Chinese), Japanese (2 kinds) and Korean
    > variants.  I don't know if the pg_wchar encoding we're producing in
    > pg_euc*2wchar_with_len() has a name, but it doesn't appear to match
    > the description of the standard "fixed" representation on the
    > Wikipedia page for Extended Unix Code (it's too wide for starters,
    > looking at the shift distances).
    
    Yes. pg_euc*2wchar_with_len() takes the "variable-length"
    representation of EUC, which uses 1 to 4 bytes per character, and
    expands each character into a pg_wchar. The result can also be
    converted back to the multibyte representation easily.
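
    As a rough illustration of the expansion described above, here is a
    minimal sketch for an EUC-JP-style input. It is not the PostgreSQL
    source: wide_t, euc_pack() and euc_unpack() are hypothetical names,
    and wide_t only stands in for pg_wchar. The sketch assumes the bytes
    of one character are packed big-endian into a 32-bit value, with the
    single-shift bytes SS2 (0x8e) and SS3 (0x8f) kept as the leading byte.

```c
#include <stddef.h>
#include <stdint.h>

#define SS2 0x8e                /* single shift 2 */
#define SS3 0x8f                /* single shift 3 */

typedef uint32_t wide_t;        /* hypothetical stand-in for pg_wchar */

/*
 * Pack one EUC-JP-style character starting at s into *out.
 * Returns the number of bytes consumed, or 0 if the input is truncated.
 */
static size_t
euc_pack(const unsigned char *s, size_t len, wide_t *out)
{
    if (len == 0)
        return 0;
    if (s[0] == SS3)            /* SS3 + two bytes */
    {
        if (len < 3)
            return 0;
        *out = ((wide_t) SS3 << 16) | ((wide_t) s[1] << 8) | s[2];
        return 3;
    }
    if (s[0] == SS2)            /* SS2 + one byte */
    {
        if (len < 2)
            return 0;
        *out = ((wide_t) SS2 << 8) | s[1];
        return 2;
    }
    if (s[0] & 0x80)            /* two bytes, both with the high bit set */
    {
        if (len < 2)
            return 0;
        *out = ((wide_t) s[0] << 8) | s[1];
        return 2;
    }
    *out = s[0];                /* plain ASCII */
    return 1;
}

/*
 * Convert a packed value back to multibyte form by emitting its bytes
 * from the most significant non-zero byte down. buf needs room for 4
 * bytes. Returns the number of bytes written.
 */
static size_t
euc_unpack(wide_t w, unsigned char *buf)
{
    size_t n = 0;

    if (w & 0xff000000)
        buf[n++] = (w >> 24) & 0xff;
    if (w & 0xffff0000)
        buf[n++] = (w >> 16) & 0xff;
    if (w & 0xffffff00)
        buf[n++] = (w >> 8) & 0xff;
    buf[n++] = w & 0xff;
    return n;
}
```

    The round trip works because the leading byte of a multibyte
    character is never zero, so the byte count can be recovered from the
    packed value alone. euc_pack() only covers sequences of up to 3
    bytes; a 4-byte sequence would need one more branch.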
    
    Note that the standard "fixed" representation of EUC includes
    ASCII-range bytes in *non*-ASCII characters, so I think it is not
    easy to use as a backend-safe encoding.
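
    To make the hazard concrete, here is a small sketch of a naive
    byte-wise search, the kind of scan that is only safe when no byte of
    a non-ASCII character falls in the ASCII range. find_byte() is a
    hypothetical helper, and the byte pair 0xc6 0x5c used in the comment
    is an assumed example of a character with an ASCII-range trailing
    byte, not taken from any particular code table.

```c
#include <stddef.h>

/*
 * Return the index of the first byte equal to target, or -1 if there
 * is none. The scan looks at raw bytes and knows nothing about
 * character boundaries.
 */
static int
find_byte(const unsigned char *s, size_t len, unsigned char target)
{
    for (size_t i = 0; i < len; i++)
    {
        if (s[i] == target)
            return (int) i;
    }
    return -1;
}

/*
 * With a variable-length EUC character such as {0xc6, 0xfc}, every
 * byte has the high bit set, so a search for '\\' (0x5c) cannot match
 * inside it. With the assumed pair {0xc6, 0x5c}, the same search
 * reports a backslash at index 1 that is really half of a character.
 */
```

    Code that scans for quotes, backslashes or path separators byte by
    byte would be misled in the second case.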
    
    Best regards,
    --
    Tatsuo Ishii
    SRA OSS K.K.
    English: http://www.sraoss.co.jp/index_en/
    Japanese:http://www.sraoss.co.jp