Re: GB18030-2022 Support in PostgreSQL

Chao Li <li.evan.chao@gmail.com>

From: Chao Li <li.evan.chao@gmail.com>
To: pgsql-hackers@lists.postgresql.org
Cc: Tom Lane <tgl@sss.pgh.pa.us>, Andrew Dunstan <andrew@dunslane.net>, John Naylor <johncnaylorls@gmail.com>
Date: 2025-08-11T02:01:08Z
Lists: pgsql-hackers

Commits

  1. Generate EUC_CN mappings from gb18030-2022.ucm

  2. Update GB18030 encoding from version 2000 to 2022

  3. Generate GB18030 mappings from the Unicode Consortium's UCM file

I have created a CommitFest entry for this patch:
https://commitfest.postgresql.org/patch/5954/
CommitFest requested a rebase, so I rebased the code and created the v2
patch.

BTW, I have tested all 66 new characters, the 9 no-longer-required
characters, and the 18 changed characters, each in the following way:

evantest=# SELECT encode(convert_from(decode('82359632', 'hex'), 
'GB18030')::bytea, 'hex');
  encode
--------
  e9bfab
(1 row)

All of them were converted correctly.
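
For anyone who wants to cross-check a result like the one above by hand,
here is a rough Python sketch of how a four-byte GB18030 code in a linear
range can be turned into a code point by calculation rather than by table
lookup. It is only an illustration, not the code in the patch: the function
names are made up, and the range start (0x82358F33 <-> U+9FA6) is an
assumption taken from the commonly published GB18030 range tables. The
sketch does not check the upper bound of the range.

```python
# Illustrative sketch only; not the PostgreSQL implementation.
# Assumption: the linear range containing the tested code starts at
# GB18030 0x82358F33, which corresponds to U+9FA6.

def gb18030_linear(code: bytes) -> int:
    """Return the linear index of a four-byte GB18030 code."""
    if len(code) != 4:
        raise ValueError("expected a four-byte code")
    b1, b2, b3, b4 = code
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a valid four-byte GB18030 code")
    # Bytes 1 and 3 have 126 possible values, bytes 2 and 4 have 10.
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)

RANGE_START_GB = bytes.fromhex("82358f33")   # assumed start of the range
RANGE_START_UCS = 0x9FA6                     # assumed matching code point

def decode_in_range(code: bytes) -> str:
    """Map a four-byte code to a character by its offset from the range start."""
    offset = gb18030_linear(code) - gb18030_linear(RANGE_START_GB)
    if offset < 0:
        raise ValueError("code lies before the assumed range start")
    return chr(RANGE_START_UCS + offset)

ch = decode_in_range(bytes.fromhex("82359632"))
# Print the code point and its UTF-8 bytes for comparison with the psql output.
print(f"U+{ord(ch):04X}", ch.encode("utf-8").hex())
```

Under those assumptions the sketch arrives at U+9FEB, whose UTF-8 bytes are
e9bfab, which agrees with the psql output above.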

Chao Li (Evan)

---------------------
HighGo Software Co., Ltd.
https://www.highgo.com/


On 2025/8/7 16:14, Chao Li wrote:
> I did more research on the changes in 2022 relative to 2000. Here is a
> summary:
>
> * 66 new characters have been added in 2022. All of them are 4-byte
> characters. As the map files store only 2-byte GB code mappings and
> the 4-byte GB code mappings are calculated, these characters can be
> properly encoded/decoded even without this patch; I tested that.
> * 9 characters are no longer required by 2022, but an application may
> decide whether to retain them. As the ucm file
> (https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/gb18030-2022.ucm)
> retains them, we also retain them.
> * Unicode mappings for 18 characters have changed. Only these changes
> can cause backward compatibility issues. However, half of them are
> rarely used punctuation marks and the rest are glyphs that I cannot
> recognize as a native Chinese speaker. So these changes should not
> significantly impact most existing databases.
>
> I added a test case using a character whose mapping changed, and the
> test passes:
>
> % make check
> ...
> # All 229 tests passed.
>
> For more details on the standard change, see 
> https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132
>
> I am attaching the patch file.
>
> Chao Li (Evan)
> ---------------------
> Highgo Software Co., Ltd.
> https://www.highgo.com/
>
>