Re: GB18030-2022 Support in PostgreSQL

Chao Li <li.evan.chao@gmail.com>

From: Chao Li <li.evan.chao@gmail.com>
To: John Naylor <johncnaylorls@gmail.com>
Cc: Peter Eisentraut <peter@eisentraut.org>, pgsql-hackers@lists.postgresql.org, Tom Lane <tgl@sss.pgh.pa.us>, Andrew Dunstan <andrew@dunslane.net>
Date: 2025-09-10T11:54:08Z
Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Generate EUC_CN mappings from gb18030-2022.ucm

  2. Update GB18030 encoding from version 2000 to 2022

  3. Generate GB18030 mappings from the Unicode Consortium's UCM file

Attachments

Hi John,

Thank you very much for taking care of this patch.

John Naylor <johncnaylorls@gmail.com> 于2025年9月10日周三 14:38写道:

>
> - The URL at the top currently points to a directory in Github, but v3
> changed it to point to the actual file. A directory can be navigated
> for inspection, so I used:
>
> 2000:
> https://github.com/unicode-org/icu-data/tree/main/charset/data/ucm
>
> 2022:
> https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/
>
>
Looks good.


> - I also made the regex a multiline regex for readability, even though
> the previous one was not.
>
>
Thank you very much for polishing the perl script. I am not an expert of
perl. I can make the script working, but not perfect.


> For 2022 version, I think it would be good to once run a test to
> verify that no mappings changed that we didn't expect. Perhaps the
> tests here can be used:
>
>
> https://www.postgresql.org/message-id/b9e3167f-f84b-7aa4-5738-be578a4db924%40iki.fi
>
>
I have manually run tested I had done before, everything works as expected.

I downloaded the tests from the referenced mail, but I cannot make the
tests to run. After extracting the 2 patch files, it added
src/test/encodings, but "make check" seems to not run them. I tried to copy
.out and .sql files to src/test/regress, but the tests still not running.
Did I miss anything?

The upstream correction to the 2000 version is not present in our
> mappings, so we should mention that, unless it was reverted in or
> before 2022.
>

I think the upstream correction to the 2000 version is just a few not
round-trip chars that are ignored by us. So I feel we don't need to mention
them.


>
> In the documentation (charset.sgml), do we want to mention the version
> e.g. the following?
>
>  <entry><literal>GB18030</literal></entry>
> -<entry>National Standard</entry>
> +<entry>National Standard, version 2022</entry>
>

That's a good idea. I updated the sgml file:

[image: image.png]


>
> I've whacked around the commit messages, so those should be reviewed
> for accuracy.
>
> Your draft commit message had "9 characters are no longer required by
> the new standard, but are retained in this patch for compatibility"
> ...but those nine were introduced in the 2005 version, right? In which
> case it doesn't affect us. Please confirm.
>

I don't find any hint about if the 9 characters were introduced in the 2005
version.

But without this patch, they can be properly converted:
```
evantest=# SELECT encode(convert_from(decode('FD9D', 'hex'),
'GB18030')::bytea, 'hex');
 encode
--------
 efa5b9
(1 row)
```
So they should be available in the version 2002 already.


>
> "Author: Zheng Tao <taoz@highgo.com>" -- I haven't seen any messages
> from this address in this thread, so could you confirm this was
> intentional?
>
>
Yes, Zheng Tao is my colleague. He worked with me for this patch, so I want
to credit him.

I am attaching v5 version. The only change is 0003, I added the SGML change.

Best regards,
Chao Li (Evan)
---------------------
HighGo Software Co., Ltd.
https://www.highgo.com/