Thread

  1. Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8

    Tatsuo Ishii <ishii@postgresql.org> — 2026-05-06T12:19:07Z

    > It is in general not necessarily required that all text in all
    > non-UTF8 encodings must be convertible to UTF8.
    > 
    > (This is also a result of history: These encodings were implemented in
    > PostgreSQL before Unicode.)
    > 
    > That said, I can see how different behaviors might be desirable.
    > 
    > My first question would be, are these non-convertible byte sequences
    > just characters that don't map to Unicode, or are they invalid within
    > the definition of the EUC-* encodings themselves?
    
    Strictly speaking, the former. 0xA2A3 is the lowercase Roman
    numeral three (iii), which is not defined in the original GB2312
    (the character set of EUC_CN).
    
    > If the latter, then
    > we should just reject them (modulo some backward compatibility),
    > similar to how we reject certain Unicode code points that exist
    > "structurally" but are not valid for one reason or another.
    
    After GB2312, GB18030 was defined. (It is claimed that GB18030 is a
    superset of GB2312.) In GB18030, the lowercase Roman numerals and
    other characters (e.g. the Euro sign) were added.
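
To illustrate the difference between the two character sets, here is a
small Python sketch that decodes the bytes 0xA2 0xA3 with Python's
built-in gb2312 and gb18030 codecs. These are Python's own tables, used
only as a stand-in; PostgreSQL's conversion maps may differ in detail,
and try_decode is a helper name made up for this example.

```python
# Python's stdlib codecs stand in for the GB2312 and GB18030 character
# sets here; this is an assumption, not PostgreSQL's conversion code.
RAW = b"\xa2\xa3"


def try_decode(data: bytes, codec: str):
    """Return the decoded string, or None if the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


as_gb2312 = try_decode(RAW, "gb2312")
as_gb18030 = try_decode(RAW, "gb18030")

# The strict GB2312 table is expected to have no entry for these bytes,
# while GB18030 maps them to U+2172 (SMALL ROMAN NUMERAL THREE).
print("gb2312 :", repr(as_gb2312))
print("gb18030:", repr(as_gb18030))
```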
    
    So I suspect that a) those characters are sometimes used with EUC_CN
    even though they are not valid GB2312 characters, and b) some
    EUC_CN users may have already written those characters into EUC_CN
    databases. If so, tightening the validation may break such uses.
    This is just my guess. Please correct me if I am wrong.
    
    > Alternatively, if these byte sequences are valid characters but they
    > just didn't end up in Unicode for some reason, then rejecting them
    > might break valid uses.
    
    That's not the case, at least for 0xA2A3. It seems UCS_to_EUC_CN.pl
    explicitly rejects characters that are not part of GB2312, including
    0xA2A3, because the script uses GB18030 as its source data.
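
The gap between "accepted on input" and "convertible" can be sketched as
two separate checks. The code below is only an illustration under
stated assumptions: is_structurally_valid_euc_cn is a hypothetical
byte-range check (ASCII, or two bytes each in 0xA1..0xFE) and is not
PostgreSQL's actual verifier, and is_in_gb2312 relies on Python's gb2312
codec rather than on the generated conversion map.

```python
def is_structurally_valid_euc_cn(data: bytes) -> bool:
    """Hypothetical structural check: each character is either one ASCII
    byte or two bytes that both fall in 0xA1..0xFE. No table lookup."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            i += 1
            continue
        if i + 1 >= len(data):
            return False  # truncated two-byte character
        trail = data[i + 1]
        if not (0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE):
            return False
        i += 2
    return True


def is_in_gb2312(data: bytes) -> bool:
    """Table-based check, using Python's gb2312 codec as a stand-in."""
    try:
        data.decode("gb2312")
        return True
    except UnicodeDecodeError:
        return False


sample = b"\xa2\xa3"
print("structurally valid:", is_structurally_valid_euc_cn(sample))
print("defined in GB2312 :", is_in_gb2312(sample))
```

A byte sequence that passes the first check but fails the second is the
kind that could be stored in an EUC_CN database yet fail conversion to
UTF8.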
    
    Regards,
    --
    Tatsuo Ishii
    SRA OSS K.K.
    English: http://www.sraoss.co.jp/index_en/
    Japanese:http://www.sraoss.co.jp