Thread

  1. Re: BUG #19354: JOHAB rejects valid byte sequences

    Jeroen Vermeulen <jtvjtv@gmail.com> — 2025-12-16T00:07:12Z

    Hi Robert.  Thanks for following up.
    
    The original author of the support code in libpqxx also noted that there
    was a discrepancy.  Python does accept these 2-byte sequences, and decodes
    them to Hangul characters.
    
    The way I read the Wikipedia section, Johab isn't like the EUC encodings in
    that it adds characters that contain ASCII-like values in the second byte.
    I guess that was needed to support Chinese characters in addition to
    Hangul.  Unit-testing for the embedded-backslash hazard was what led me to
    find the problem.
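
    As a rough illustration of that hazard, here is a minimal Python sketch.
    It assumes the Wikipedia reading discussed in this thread (any byte with
    the high bit set starts a 2-byte character) and uses Python's built-in
    "johab" codec.  The helper name trail_backslash_offsets is made up for
    this example; it is not part of libpq or libpqxx.

```python
def trail_backslash_offsets(raw: bytes) -> list[int]:
    """Return offsets of 0x5c bytes that are trail bytes, not real backslashes.

    Assumption: every byte with the high bit set starts a 2-byte character.
    """
    offsets = []
    i = 0
    while i < len(raw):
        if raw[i] < 0x80:
            # Plain single-byte (ASCII-range) character.
            i += 1
        else:
            # Lead byte: the next byte belongs to the same character.
            if i + 1 < len(raw) and raw[i + 1] == 0x5C:
                offsets.append(i + 1)
            i += 2
    return offsets


# 'a', the 2-byte sequence from the bug report, 'b', then a real backslash.
sample = b"a\x8a\x5cb\\"
decoded = sample.decode("johab")
print(len(decoded))                     # number of decoded characters
print(trail_backslash_offsets(sample))  # offsets of embedded 0x5c trail bytes
```

    A byte-wise scan for backslashes would treat the byte at offset 2 as an
    escape character, even though it is the second half of a Hangul
    character.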
    
    This bit worries me: "Other, vendor-defined, Johab variants also exist",
    such as an EBCDIC-based one and a stateful one!
    
    
    Jeroen
    
    On Mon, Dec 15, 2025, 18:46 Robert Haas <robertmhaas@gmail.com> wrote:
    
    > On Sat, Dec 13, 2025 at 2:12 PM PG Bug reporting form
    > <noreply@postgresql.org> wrote:
    > > Calling libpq, connecting to a UTF8 database and successfully setting client
    > > encoding to JOHAB, this statement:
    > >
    > >     PQexec(connection, "SELECT '\x8a\x5c'");
    > >
    > > Returned an empty result with this error message:
    > >
    > >     ERROR:  invalid byte sequence for encoding "JOHAB": 0x8a 0x5c
    > >
    > > AFAICT, 0x8a 0x5c is a valid JOHAB sequence making up Hangul character "굎".
    > > Easily verified in Python:
    > >
    > >     print(b'\x8a\x5c'.decode('johab'))
    > >
    > > It's the same story for some other valid sequences I tried, including this
    > > character's "neighbours" 0x8a 0x5b and 0x8a 0x5d.
    >
    > My reading of pg_johab_verifystr() is that it accepts any character
    > without the high bit set as a single-byte character. Otherwise, it
    > calls pg_johab_mblen() to determine the length of the character, and
    > that in turn calls pg_euc_mblen(), which returns 3 if the first byte
    > is 0x8f and otherwise 2. Whatever the answer, it then wants each byte
    > to pass IS_EUC_RANGE_VALID() which allows for bytes from 0xa1 to 0xfe.
    > Your byte string doesn't match that rule, so it makes sense that it
    > fails.
    >
    > What confuses me is that
    > https://en.wikipedia.org/wiki/KS_X_1001#Johab_encoding seems to say
    > that the encoding is always a 2-byte encoding and that any 2-byte
    > sequence with the high bit set on the first character is a valid
    > character. So the rules we're implementing don't seem to match that at
    > all. But unfortunately the intent behind the current code is not
    > clear. It was introduced by Bruce in 2002 in commit
    > a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, but I don't see comments
    > there or elsewhere explaining what the thought was behind the way the
    > code works, so I don't know if this is some weird variant of JOHAB
    > that intentionally works differently or if this was just never
    > correct.
    >
    > --
    > Robert Haas
    > EDB: http://www.enterprisedb.com
    >
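
    The verification rule described in the quoted message can be modelled
    roughly as follows.  This is a Python sketch of the prose description
    only, not the PostgreSQL source: the function names are invented for the
    example, and it assumes the 0xa1..0xfe range check applies to the bytes
    after the first one (the message says "each byte").

```python
def euc_style_mblen(first: int) -> int:
    # As described: 3 bytes if the first byte is 0x8f, otherwise 2.
    return 3 if first == 0x8F else 2


def in_euc_range(b: int) -> bool:
    # Range attributed to IS_EUC_RANGE_VALID() in the message.
    return 0xA1 <= b <= 0xFE


def accepted_by_described_rule(data: bytes) -> bool:
    """Model of the described check; True if every character passes."""
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # High bit clear: accepted as a single-byte character.
            i += 1
            continue
        n = euc_style_mblen(data[i])
        trail = data[i + 1:i + n]
        # Assumption: only the trailing bytes are range-checked.
        if len(trail) != n - 1 or not all(in_euc_range(b) for b in trail):
            return False
        i += n
    return True


print(accepted_by_described_rule(b"\x8a\x5c"))  # sequence from the bug report
print(accepted_by_described_rule(b"\xb0\xa1"))  # trail byte inside 0xa1..0xfe
```

    Under this model the reported sequence is rejected because its second
    byte, 0x5c, lies outside 0xa1..0xfe, which matches the error in the
    report.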