Re: GB18030-2022 Support in PostgreSQL

John Naylor <johncnaylorls@gmail.com>

From: John Naylor <johncnaylorls@gmail.com>

To: JiaoShuntian <jiaoshuntian@highgo.com>

Cc: pgsql-hackers@lists.postgresql.org

Date: 2025-08-04T10:35:02Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Generate EUC_CN mappings from gb18030-2022.ucm
- 48566180efff 19 (unreleased) landed
Update GB18030 encoding from version 2000 to 2022
- 5334620eef8f 19 (unreleased) landed
Generate GB18030 mappings from the Unicode Consortium's UCM file
- cfa6cd29271e 19 (unreleased) landed

On Mon, Aug 4, 2025 at 3:08 PM JiaoShuntian <jiaoshuntian@highgo.com> wrote:
> I noticed that PostgreSQL currently supports GB18030 encoding based on the older GB18030-2000 standard (as seen in commits like extend GB18030 conversion). However, China has since updated its mandatory character set standard to GB18030-2022, which includes additional characters and stricter compliance requirements.GB18030-2022 is now the official standard in China, and ensuring PostgreSQL’s full compliance would be beneficial for users in Chinese-speaking regions.

This is a non-backwards-compatible change:

https://www.unicode.org/L2/L2022/22274-disruptive-changes.pdf
https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf

There is a risk of breaking applications, although only a few dozen
mappings changed. If it were added as a separate encoding, users could
opt in.

--
John Naylor
Amazon Web Services