Re: BUG #19354: JOHAB rejects valid byte sequences
Robert Haas <robertmhaas@gmail.com> — 2025-12-15T17:46:15Z
On Sat, Dec 13, 2025 at 2:12 PM PG Bug reporting form <noreply@postgresql.org> wrote:
> Calling libpq, connecting to a UTF8 database and successfully setting client
> encoding to JOHAB, this statement:
>
> PQexec(connection, "SELECT '\x8a\x5c'");
>
> Returned an empty result with this error message:
>
> ERROR: invalid byte sequence for encoding "JOHAB": 0x8a 0x5c
>
> AFAICT, 0x8a 0x5c is a valid JOHAB sequence making up Hangul character "굎".
> Easily verified in Python:
>
> print(b'\x8a\x5c'.decode('johab'))
>
> It's the same story for some other valid sequences I tried, including this
> character's "neighbours" 0x8a 0x5b and 0x8a 0x5d.

My reading of pg_johab_verifystr() is that it accepts any character without the high bit set as a single-byte character. Otherwise, it calls pg_johab_mblen() to determine the length of the character, and that in turn calls pg_euc_mblen(), which returns 3 if the first byte is 0x8f and otherwise 2. Whatever the answer, it then wants each byte to pass IS_EUC_RANGE_VALID(), which allows for bytes from 0xa1 to 0xfe. Your byte string doesn't match that rule, so it makes sense that it fails.

What confuses me is that https://en.wikipedia.org/wiki/KS_X_1001#Johab_encoding seems to say that the encoding is always a 2-byte encoding and that any 2-byte sequence with the high bit set on the first character is a valid character. So the rules we're implementing don't seem to match that at all.

But unfortunately the intent behind the current code is not clear. It was introduced by Bruce in 2002 in commit a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, but I don't see comments there or elsewhere explaining what the thought was behind the way the code works, so I don't know if this is some weird variant of JOHAB that intentionally works differently or if this was just never correct.

--
Robert Haas
EDB: http://www.enterprisedb.com
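
To make the mismatch described above concrete, here is a minimal Python sketch of the two rules as this message describes them. It is not a port of the PostgreSQL source: the function names are hypothetical, and the sketch range-checks only the bytes after the lead byte, which is an assumption (a lead byte of 0x8f could never pass a 0xa1..0xfe check). For 0x8a 0x5c the outcome is the same either way, because the second byte 0x5c is outside the range.

```python
def euc_style_mblen(first_byte: int) -> int:
    """Length rule as described: 1 without the high bit, 3 after 0x8f, else 2."""
    if first_byte < 0x80:
        return 1
    return 3 if first_byte == 0x8F else 2


def euc_style_verify(data: bytes) -> bool:
    """Hypothetical model of the EUC-style check described in the message."""
    i = 0
    while i < len(data):
        length = euc_style_mblen(data[i])
        if i + length > len(data):
            return False  # truncated multibyte character
        # Assumption: only the bytes after the lead byte are range-checked.
        for b in data[i + 1:i + length]:
            if not 0xA1 <= b <= 0xFE:
                return False
        i += length
    return True


def two_byte_verify(data: bytes) -> bool:
    """Loose rule from the message's reading of the Wikipedia article:
    any lead byte with the high bit set starts a 2-byte character."""
    i = 0
    while i < len(data):
        step = 1 if data[i] < 0x80 else 2
        if i + step > len(data):
            return False
        i += step
    return True


if __name__ == "__main__":
    sample = b"\x8a\x5c"
    print("EUC-style rule accepts:", euc_style_verify(sample))
    print("2-byte rule accepts:   ", two_byte_verify(sample))
    print("Python johab codec:    ", sample.decode("johab"))
```

Under the EUC-style rule, the reported sequence and its neighbours 0x8a 0x5b and 0x8a 0x5d are all rejected because of the trailing byte, while the looser 2-byte rule accepts them.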