Thread
-
Re: BUG #19354: JOHAB rejects valid byte sequences
Jeroen Vermeulen <jtvjtv@gmail.com> — 2025-12-16T00:07:12Z
Hi Robert.

Thanks for following up. The original author of the support code in libpqxx also noted that there was a discrepancy. Python does accept these 2-byte sequences, and decodes them to Hangul characters.

The way I read the Wikipedia section, Johab isn't like the EUC encodings in that it adds characters that contain ASCII-like values in the second byte. I guess that was needed to support Chinese characters in addition to Hangul. Unit-testing for the embedded-backslash hazard was what led me to find the problem.

This bit worries me: "Other, vendor-defined, Johab variants also exist", such as an EBCDIC-based one and a stateful one!

Jeroen

On Mon, Dec 15, 2025, 18:46 Robert Haas <robertmhaas@gmail.com> wrote:

> On Sat, Dec 13, 2025 at 2:12 PM PG Bug reporting form
> <noreply@postgresql.org> wrote:
> > Calling libpq, connecting to a UTF8 database and successfully setting
> > client encoding to JOHAB, this statement:
> >
> > PQexec(connection, "SELECT '\x8a\x5c'");
> >
> > Returned an empty result with this error message:
> >
> > ERROR: invalid byte sequence for encoding "JOHAB": 0x8a 0x5c
> >
> > AFAICT, 0x8a 0x5c is a valid JOHAB sequence making up Hangul character
> > "굎".
> > Easily verified in Python:
> >
> > print(b'\x8a\x5c'.decode('johab'))
> >
> > It's the same story for some other valid sequences I tried, including
> > this character's "neighbours" 0x8a 0x5b and 0x8a 0x5d.
>
> My reading of pg_johab_verifystr() is that it accepts any character
> without the high bit set as a single-byte character. Otherwise, it
> calls pg_johab_mblen() to determine the length of the character, and
> that in turn calls pg_euc_mblen(), which returns 3 if the first byte
> is 0x8f and otherwise 2. Whatever the answer, it then wants each byte
> to pass IS_EUC_RANGE_VALID() which allows for bytes from 0xa1 to 0xfe.
> Your byte string doesn't match that rule, so it makes sense that it
> fails.
>
> What confuses me is that
> https://en.wikipedia.org/wiki/KS_X_1001#Johab_encoding seems to say
> that the encoding is always a 2-byte encoding and that any 2-byte
> sequence with the high bit set on the first character is a valid
> character. So the rules we're implementing don't seem to match that at
> all. But unfortunately the intent behind the current code is not
> clear. It was introduced by Bruce in 2002 in commit
> a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, but I don't see comments
> there or elsewhere explaining what the thought was behind the way the
> code works, so I don't know if this is some weird variant of JOHAB
> that intentionally works differently or if this was just never
> correct.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
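
For anyone who wants to reproduce the discrepancy without a server, here is a rough Python sketch. It is a simplified model of the rule as described in the quoted message, not the PostgreSQL source, and every function name in it is hypothetical. It also illustrates the embedded-backslash hazard: the second byte 0x5c is the ASCII backslash, so a byte-wise scanner sees a backslash where a Johab-aware one (assuming the Wikipedia reading that every high-bit lead byte starts a 2-byte character) does not.

```python
# Simplified model of the checks described in the quoted message.
# NOT the PostgreSQL source; all function names here are hypothetical.


def euc_style_mblen(first_byte: int) -> int:
    """Length rule from the thread: 1 for ASCII, 3 after 0x8f, otherwise 2."""
    if first_byte < 0x80:
        return 1
    return 3 if first_byte == 0x8F else 2


def euc_range_valid(byte: int) -> bool:
    """Byte range the thread attributes to IS_EUC_RANGE_VALID()."""
    return 0xA1 <= byte <= 0xFE


def accepted_by_described_rule(data: bytes) -> bool:
    """Apply the described rule to a whole byte string."""
    i = 0
    while i < len(data):
        n = euc_style_mblen(data[i])
        chunk = data[i:i + n]
        if len(chunk) < n:
            return False  # truncated multi-byte character
        if n > 1 and not all(euc_range_valid(b) for b in chunk):
            return False  # some byte falls outside 0xa1..0xfe
        i += n
    return True


def naive_backslash_offsets(data: bytes) -> list:
    """Offsets that a byte-wise scanner would treat as backslashes."""
    return [i for i, b in enumerate(data) if b == 0x5C]


def johab_aware_backslash_offsets(data: bytes) -> list:
    """Offsets of real backslashes, assuming every high-bit lead byte
    starts a 2-byte character (the Wikipedia reading cited above)."""
    offsets = []
    i = 0
    while i < len(data):
        if data[i] >= 0x80:
            i += 2  # skip the lead byte and its trail byte together
            continue
        if data[i] == 0x5C:
            offsets.append(i)
        i += 1
    return offsets


if __name__ == "__main__":
    for seq in (b"\x8a\x5b", b"\x8a\x5c", b"\x8a\x5d"):
        print(seq.hex(),
              "described rule accepts:", accepted_by_described_rule(seq),
              "| Python johab decodes to:", repr(seq.decode("johab")))

    query = b"SELECT '\x8a\x5c'"
    print("naive backslash offsets:", naive_backslash_offsets(query))
    print("Johab-aware backslash offsets:", johab_aware_backslash_offsets(query))
```

The model checks every byte of a multi-byte character against the 0xa1 to 0xfe range, following the wording of the quoted description; the actual server code may differ in detail, but either way the trail byte 0x5c falls outside that range.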