Re: GB18030-2022 Support in PostgreSQL

Chao Li <li.evan.chao@gmail.com>

From: Chao Li <li.evan.chao@gmail.com>
To: John Naylor <johncnaylorls@gmail.com>
Cc: Peter Eisentraut <peter@eisentraut.org>, pgsql-hackers@lists.postgresql.org, Tom Lane <tgl@sss.pgh.pa.us>, Andrew Dunstan <andrew@dunslane.net>
Date: 2025-09-29T08:19:48Z
Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Generate EUC_CN mappings from gb18030-2022.ucm

  2. Update GB18030 encoding from version 2000 to 2022

  3. Generate GB18030 mappings from the Unicode Consortium's UCM file

Attachments

On Mon, Sep 29, 2025 at 12:03 PM John Naylor <johncnaylorls@gmail.com>
wrote:

> On Wed, Sep 24, 2025 at 4:18 PM Chao Li <li.evan.chao@gmail.com> wrote:
> > I am not sure if you should also upgrade the UCM file to 2022 version,
> but if we need, let’s do it with a separate commit.
>
> If they can all use the same file, we should just do that for the sake
> of simplicity, in which case a separate commit is just extra noise.
>
>
In v3, I have updated EUC_CN to use gb18030-2022.ucm. Fortunately, the map
files are unchanged, so we don't have to do much testing for EUC_CN.

For UHC, in the icu master branch
https://github.com/unicode-org/icu/tree/main/icu4c/source/data/mappings,
there is still windows-949-2000.ucm, thus only download URL is changed,
file content is unchanged.

```
% make utf8_to_uhc.map utf8_to_euc_cn.map
wget -O windows-949-2000.ucm --no-use-server-timestamps
https://raw.githubusercontent.com/unicode-org/icu/refs/heads/main/icu4c/source/data/mappings/windows-949-2000.ucm
--2025-09-29 16:00:40--
https://raw.githubusercontent.com/unicode-org/icu/refs/heads/main/icu4c/source/data/mappings/windows-949-2000.ucm
HTTP request sent, awaiting response... 200 OK
Length: 356253 (348K) [text/plain]
Saving to: ‘windows-949-2000.ucm’

windows-949-2000.ucm
100%[=========================================================================================================>]
347.90K   222KB/s    in 1.6s

2025-09-29 16:00:43 (222 KB/s) - ‘windows-949-2000.ucm’ saved
[356253/356253]

'/usr/bin/perl' -I . UCS_to_UHC.pl
- Writing UTF8=>UHC conversion table: utf8_to_uhc.map
- Writing UHC=>UTF8 conversion table: uhc_to_utf8.map
wget -O gb18030-2022.ucm --no-use-server-timestamps
https://raw.githubusercontent.com/unicode-org/icu/refs/heads/main/icu4c/source/data/mappings/gb18030-2022.ucm
--2025-09-29 16:00:43--
https://raw.githubusercontent.com/unicode-org/icu/refs/heads/main/icu4c/source/data/mappings/gb18030-2022.ucm
HTTP request sent, awaiting response... 200 OK
Length: 675312 (659K) [text/plain]
Saving to: ‘gb18030-2022.ucm’

gb18030-2022.ucm
100%[=========================================================================================================>]
659.48K  1.33MB/s    in 0.5s

2025-09-29 16:00:44 (1.33 MB/s) - ‘gb18030-2022.ucm’ saved [675312/675312]

'/usr/bin/perl' -I . UCS_to_EUC_CN.pl
- Writing UTF8=>EUC_CN conversion table: utf8_to_euc_cn.map
- Writing EUC_CN=>UTF8 conversion table: euc_cn_to_utf8.map
% git diff
%
```

Please note, I didn't include the deletion of gb-18030-2000.xml in v3,
because that will cause the patch file to be too big, thus requiring an
approval process for the email to land in the Mail Archive. Please delete
the xml file when you push the commit.

Best regards,
Chao Li (Evan)
---------------------
HighGo Software Co., Ltd.
https://www.highgo.com/