Thread

  1. Re: BUG #19354: JOHAB rejects valid byte sequences

    Robert Haas <robertmhaas@gmail.com> — 2025-12-15T17:46:15Z

    On Sat, Dec 13, 2025 at 2:12 PM PG Bug reporting form
    <noreply@postgresql.org> wrote:
    > Calling libpq, connecting to a UTF8 database and successfully setting client
    > encoding to JOHAB, this statement:
    >
    >     PQexec(connection, "SELECT '\x8a\x5c'");
    >
    > Returned an empty result with this error message:
    >
    >     ERROR:  invalid byte sequence for encoding "JOHAB": 0x8a 0x5c
    >
    > AFAICT, 0x8a 0x5c is a valid JOHAB sequence making up Hangul character "굎".
    > Easily verified in Python:
    >
    >     print(b'\x8a\x5c'.decode('johab'))
    >
    > It's the same story for some other valid sequences I tried, including this
    > character's "neighbours" 0x8a 0x5b and 0x8a 0x5d.
    
    My reading of pg_johab_verifystr() is that it accepts any character
    without the high bit set as a single-byte character. Otherwise, it
    calls pg_johab_mblen() to determine the length of the character, and
    that in turn calls pg_euc_mblen(), which returns 3 if the first byte
    is 0x8f and otherwise 2. Whatever the answer, it then wants each byte
    to pass IS_EUC_RANGE_VALID() which allows for bytes from 0xa1 to 0xfe.
    Your byte string doesn't match that rule, so it makes sense that it
    fails.
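
    To make that concrete, here is a rough Python model of the rule as I
    read it. This is a sketch, not the PostgreSQL source: the function
    names are made up for illustration, and I'm assuming that only the
    bytes after the lead byte get the range check.

```python
def is_euc_range_valid(b: int) -> bool:
    """Model of the IS_EUC_RANGE_VALID() check: bytes 0xa1 through 0xfe."""
    return 0xA1 <= b <= 0xFE


def euc_style_mblen(first: int) -> int:
    """Length rule described above, for a lead byte with the high bit set."""
    return 3 if first == 0x8F else 2


def verify_char(data: bytes) -> int:
    """Return the length of the character at the start of data, or -1 if invalid."""
    if not data:
        return -1
    first = data[0]
    if first < 0x80:
        # No high bit: accepted as a single-byte character.
        return 1
    length = euc_style_mblen(first)
    if len(data) < length:
        # Truncated multibyte character.
        return -1
    # Assumption of this sketch: only the trailing bytes are range-checked.
    if not all(is_euc_range_valid(b) for b in data[1:length]):
        return -1
    return length


if __name__ == "__main__":
    # The reported sequence and its neighbours, plus one that fits the rule.
    for seq in (b"\x8a\x5b", b"\x8a\x5c", b"\x8a\x5d", b"\x8a\xa1"):
        print(seq.hex(" "), verify_char(seq))
```

    Under this model the second byte 0x5c falls below 0xa1, so the
    reported sequence and both neighbours are rejected.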
    
    What confuses me is that
    https://en.wikipedia.org/wiki/KS_X_1001#Johab_encoding seems to say
    that the encoding is always a 2-byte encoding and that any 2-byte
    sequence with the high bit set on the first byte is a valid
    character. So the rules we're implementing don't seem to match that at
    all. But unfortunately the intent behind the current code is not
    clear. It was introduced by Bruce in 2002 in commit
    a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, but I don't see comments
    there or elsewhere explaining what the thought was behind the way the
    code works, so I don't know if this is some weird variant of JOHAB
    that intentionally works differently or if this was just never
    correct.
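
    For contrast, the rule as I understand it from that Wikipedia page
    would look something like the sketch below. The function name is
    hypothetical, and the rule is deliberately as loose as my reading of
    the page; a real validator might well need to restrict the second
    byte further.

```python
def verify_char_two_byte(data: bytes) -> int:
    """Return the length of the character at the start of data, or -1 if invalid.

    Models the reading above: a byte without the high bit is a single-byte
    character, and anything else is the lead byte of a 2-byte character.
    """
    if not data:
        return -1
    if data[0] < 0x80:
        return 1
    # High bit set: always a 2-byte character, with no range check on byte 2.
    return 2 if len(data) >= 2 else -1


if __name__ == "__main__":
    # The sequences from the bug report are all accepted under this rule.
    for seq in (b"\x8a\x5b", b"\x8a\x5c", b"\x8a\x5d"):
        print(seq.hex(" "), verify_char_two_byte(seq))
```

    The sequences from the report all pass under this reading, which is
    the mismatch with the current behavior.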
    
    -- 
    Robert Haas
    EDB: http://www.enterprisedb.com