Re: BUG #19354: JOHAB rejects valid byte sequences

Robert Haas <robertmhaas@gmail.com> — 2025-12-15T17:46:15Z
On Sat, Dec 13, 2025 at 2:12 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
> Calling libpq, connecting to a UTF8 database and successfully setting client
> encoding to JOHAB, this statement:
>
>     PQexec(connection, "SELECT '\x8a\x5c'");
>
> Returned an empty result with this error message:
>
>     ERROR:  invalid byte sequence for encoding "JOHAB": 0x8a 0x5c
>
> AFAICT, 0x8a 0x5c is a valid JOHAB sequence making up Hangul character "굎".
> Easily verified in Python:
>
>     print(b'\x8a\x5c'.decode('johab'))
>
> It's the same story for some other valid sequences I tried, including this
> character's "neighbours" 0x8a 0x5b and 0x8a 0x5d.

My reading of pg_johab_verifystr() is that it accepts any character
without the high bit set as a single-byte character. Otherwise, it
calls pg_johab_mblen() to determine the length of the character, and
that in turn calls pg_euc_mblen(), which returns 3 if the first byte
is 0x8f and otherwise 2. Either way, it then wants each byte to pass
IS_EUC_RANGE_VALID(), which allows only bytes from 0xa1 to 0xfe.
Neither 0x8a nor 0x5c falls in that range, so it makes sense that it
fails.
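
As a rough illustration, the rule described above can be sketched in
Python. This is not PostgreSQL's C source, and the function names are
hypothetical. It also assumes the 0xa1..0xfe range check applies to the
bytes after the first one, which is one possible reading of "each byte".

```python
# Sketch only: paraphrases the rule described in the message above.
# euc_style_mblen and euc_style_verify are hypothetical names.

def euc_style_mblen(first_byte: int) -> int:
    """Length rule as described: 1 without the high bit, 3 after 0x8f, else 2."""
    if first_byte < 0x80:
        return 1
    return 3 if first_byte == 0x8F else 2


def euc_style_verify(data: bytes) -> bool:
    """Return True if every character in data passes the described checks."""
    i = 0
    while i < len(data):
        length = euc_style_mblen(data[i])
        char = data[i:i + length]
        if len(char) < length:
            return False  # truncated multibyte character
        # Assumption: only the bytes after the first are range-checked.
        if not all(0xA1 <= b <= 0xFE for b in char[1:]):
            return False  # a trailing byte is outside 0xa1..0xfe
        i += length
    return True


print(euc_style_verify(b"\x8a\x5c"))  # sequence from the bug report: False
print(euc_style_verify(b"\xb0\xa1"))  # trailing byte in range: True
```

Under this sketch the reported sequence is rejected because 0x5c is
below 0xa1, which is consistent with the error in the report.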

What confuses me is that
https://en.wikipedia.org/wiki/KS_X_1001#Johab_encoding seems to say
that the encoding is always a 2-byte encoding and that any 2-byte
sequence with the high bit set on the first byte is a valid
character. So the rules we're implementing don't seem to match that at
all. But unfortunately the intent behind the current code is not
clear. It was introduced by Bruce in 2002 in commit
a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, but I don't see comments
there or elsewhere explaining what the thought was behind the way the
code works, so I don't know if this is some weird variant of JOHAB
that intentionally works differently or if this was just never
correct.
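
For contrast, here is a minimal Python sketch of the looser rule that
the Wikipedia page seems to describe, as read above: a byte without the
high bit is a single-byte character, and any other byte starts a 2-byte
character whose second byte is unrestricted. The function name is
hypothetical, and the real JOHAB definition may exclude some sequences
that this sketch accepts.

```python
# Sketch only: encodes the reading of the Wikipedia page given above,
# not the JOHAB standard itself. two_byte_rule_verify is a hypothetical name.

def two_byte_rule_verify(data: bytes) -> bool:
    """Accept single bytes below 0x80 and any 2-byte pair led by a high-bit byte."""
    i = 0
    while i < len(data):
        step = 2 if data[i] >= 0x80 else 1
        if i + step > len(data):
            return False  # lead byte with no second byte
        i += step
    return True


print(two_byte_rule_verify(b"\x8a\x5c"))  # sequence from the bug report: True
print(two_byte_rule_verify(b"\x8a"))      # truncated pair: False
```

Under this reading the reported sequence and its neighbours 0x8a 0x5b
and 0x8a 0x5d would all be accepted.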

-- 
Robert Haas
EDB: http://www.enterprisedb.com