Re: BUG #19354: JOHAB rejects valid byte sequences
Tom Lane <tgl@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
To: Robert Haas <robertmhaas@gmail.com>
Cc: Jeroen Vermeulen <jtvjtv@gmail.com>, VASUKI M <vasukianand0119@gmail.com>,
pgsql-bugs@lists.postgresql.org
Date: 2025-12-16T15:41:46Z
Lists: pgsql-bugs
Robert Haas <robertmhaas@gmail.com> writes:
> ... So I went looking for
> where we got the mapping tables from. UCS_to_JOHAB.pl expects to read
> from a file JOHAB.TXT, of which the latest version seems to be found
> here:
> https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/JOHAB.TXT
> And indeed, if I run UCS_to_JOHAB.pl on that JOHAB.txt file, it
> regenerates the current mapping files.

Thanks for doing that research!

> So apparently we've
> got the "right" mappings, but you can only actually use the ones that
> match the code's rules for something to be a valid multi-byte
> character, which aren't actually in sync with the mapping table.

Yeah.  Looking at the code in wchar.c, it's clear that it thinks that
JOHAB has the same character-length rules as EUC_KR, which is something
that one might guess based on available documentation that says it's
related to that encoding.  So I can see how we got here.  However, that
doesn't mean we can fix pg_johab_mblen() and we're done.  I'm still
quite afraid that we'd be introducing security-grade inconsistencies of
interpretation between different PG versions.

> I'm
> left with the conclusions that (1) nobody ever actually tried using
> this encoding for anything real until 3 days ago and (2) we don't have
> any testing infrastructure that verifies that the characters in the
> mapping tables are actually accepted by pg_verifymbstr(). I wonder how
> many other encodings we have that don't actually work?

Indeed.  Anyone want to do some testing?

			regards, tom lane
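
For anyone who wants to see the kind of mismatch discussed above without building anything, here is a rough Python sketch. It is not the code in wchar.c: euc_style_mblen() and euc_style_valid() are hypothetical helpers that model a simplified EUC_KR-style rule (a high-bit lead byte means a two-byte character, and the assumption that trailing bytes must fall in 0xA1..0xFE). The sample JOHAB bytes come from Python's built-in "johab" codec, not from PostgreSQL's mapping files.

```python
# Sketch only: a simplified model of an EUC_KR-style length/validity rule.
# This is NOT a copy of PostgreSQL's wchar.c; the helper names are hypothetical.

def euc_style_mblen(first: int) -> int:
    """Character length an EUC_KR-style rule infers from the first byte."""
    return 2 if first & 0x80 else 1


def euc_style_valid(seq: bytes) -> bool:
    """True if seq is exactly one character under the simplified rule.

    Assumption: every trailing byte must lie in 0xA1..0xFE.
    """
    if not seq or len(seq) != euc_style_mblen(seq[0]):
        return False
    return all(0xA1 <= b <= 0xFE for b in seq[1:])


# U+AC00 (HANGUL SYLLABLE GA), encoded with Python's built-in codecs.
ga_johab = "\uac00".encode("johab")
ga_euc_kr = "\uac00".encode("euc_kr")

for name, raw in (("johab", ga_johab), ("euc_kr", ga_euc_kr)):
    verdict = "accepted" if euc_style_valid(raw) else "rejected"
    print(f"{name}: {raw.hex()} -> {verdict}")
```

The same loop could be pointed at every byte sequence listed in a mapping file such as JOHAB.TXT to count how many entries an EUC-style rule would refuse, which is roughly the missing test described above.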