Thread

  1. Re: BUG #19354: JOHAB rejects valid byte sequences

    Tom Lane <tgl@sss.pgh.pa.us> — 2025-12-16T15:41:46Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > ... So I went looking for
    > where we got the mapping tables from. UCS_to_JOHAB.pl expects to read
    > from a file JOHAB.TXT, of which the latest version seems to be found
    > here:
    > https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/JOHAB.TXT
    > And indeed, if I run UCS_to_JOHAB.pl on that JOHAB.TXT file, it
    > regenerates the current mapping files.
    
    Thanks for doing that research!
    
    > So apparently we've
    > got the "right" mappings, but you can only actually use the ones that
    > match the code's rules for something to be a valid multi-byte
    > character, which aren't actually in sync with the mapping table.
    
    Yeah.  Looking at the code in wchar.c, it's clear that it thinks
    that JOHAB has the same character-length rules as EUC_KR, which is
    something that one might guess based on available documentation that
    says it's related to that encoding.  So I can see how we got here.
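
    To make the mismatch concrete, here is a minimal sketch of an
    EUC-style length and verification rule next to a looser two-byte
    rule.  It is paraphrased for illustration and is not copied from
    wchar.c; the function names are hypothetical, and the JOHAB
    trailing-byte ranges are an assumption rather than a statement of
    the standard.

```c
#define SS2 0x8e				/* EUC single-shift 2 */
#define SS3 0x8f				/* EUC single-shift 3 */

/* EUC-style length rule: the first byte alone decides the length. */
static int
euc_style_mblen(const unsigned char *s)
{
	if (*s == SS2)
		return 2;
	if (*s == SS3)
		return 3;
	return (*s & 0x80) ? 2 : 1;
}

/* EUC-style verifier: every trailing byte must lie in 0xA1..0xFE. */
static int
euc_style_verifychar(const unsigned char *s, int len)
{
	int			mbl = euc_style_mblen(s);
	int			i;

	if (len < mbl)
		return -1;
	for (i = 1; i < mbl; i++)
	{
		if (s[i] < 0xa1 || s[i] > 0xfe)
			return -1;
	}
	return mbl;
}

/*
 * ASSUMPTION (illustrative only): a JOHAB-aware rule in which any
 * high-bit byte starts a two-byte character and the trailing byte may
 * be 0x31..0x7E or 0x81..0xFE.  The real ranges should be taken from
 * the standard or derived from JOHAB.TXT.
 */
static int
johab_style_verifychar(const unsigned char *s, int len)
{
	int			mbl = (*s & 0x80) ? 2 : 1;

	if (len < mbl)
		return -1;
	if (mbl == 2)
	{
		unsigned char c = s[1];

		if (!((c >= 0x31 && c <= 0x7e) || (c >= 0x81 && c <= 0xfe)))
			return -1;
	}
	return mbl;
}
```

    Under these assumptions, a two-byte sequence whose trailing byte is
    below 0xA1 fails the EUC-style check but passes the looser one, and
    a lead byte of 0x8F is treated as starting a three-byte character
    by the EUC-style rule.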
    
    However, that doesn't mean we can fix pg_johab_mblen() and we're done.
    I'm still quite afraid that we'd be introducing security-grade
    inconsistencies of interpretation between different PG versions.
    
    > I'm
    > left with the conclusions that (1) nobody ever actually tried using
    > this encoding for anything real until 3 days ago and (2) we don't have
    > any testing infrastructure that verifies that the characters in the
    > mapping tables are actually accepted by pg_verifymbstr(). I wonder how
    > many other encodings we have that don't actually work?
    
    Indeed.  Anyone want to do some testing?
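
    One possible shape for such a test, as a rough sketch: take each
    line of a mapping file in the "0xCODE 0xUCS" layout, turn the
    encoded value into its byte sequence, and ask a verifier whether it
    accepts those bytes as exactly one character.  Everything named
    here (check_mapping_line, verify_fn, strict_trail_verify) is
    hypothetical; a real test would call the server's own verification
    routine rather than the stand-in below.

```c
#include <stdio.h>

typedef int (*verify_fn) (const unsigned char *s, int len);

/*
 * Stand-in verifier (hypothetical): one byte for ASCII, otherwise a
 * high-bit lead byte followed by a trailing byte in 0xA1..0xFE.
 */
static int
strict_trail_verify(const unsigned char *s, int len)
{
	if (!(s[0] & 0x80))
		return 1;
	if (len < 2 || s[1] < 0xa1 || s[1] > 0xfe)
		return -1;
	return 2;
}

/*
 * Check one mapping-file line.  Returns 1 if the encoded bytes are
 * accepted as exactly one character, 0 if they are rejected, and -1 if
 * the line carries no mapping (comment or blank line).
 */
static int
check_mapping_line(const char *line, verify_fn verify)
{
	unsigned int code;
	unsigned int ucs;
	unsigned char buf[4];
	int			len = 0;

	/* %x accepts an optional "0x" prefix; comment lines fail to match */
	if (sscanf(line, "%x %x", &code, &ucs) != 2)
		return -1;

	/* Emit the encoded value as big-endian bytes, skipping leading zeros */
	if (code > 0xffffff)
		buf[len++] = (code >> 24) & 0xff;
	if (code > 0xffff)
		buf[len++] = (code >> 16) & 0xff;
	if (code > 0xff)
		buf[len++] = (code >> 8) & 0xff;
	buf[len++] = code & 0xff;

	return (verify(buf, len) == len) ? 1 : 0;
}
```

    Looping this over every line of a mapping file and counting the
    zero results would show how many mapped characters a given
    verifier refuses.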
    
    			regards, tom lane