Re: GB18030-2022 Support in PostgreSQL

From: John Naylor <johncnaylorls@gmail.com>
To: Chao Li <li.evan.chao@gmail.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>, Andrew Dunstan <andrew@dunslane.net>, JiaoShuntian <jiaoshuntian@highgo.com.w.kunlunaq.com>, pgsql-hackers@lists.postgresql.org
Date: 2025-08-05T10:25:27Z
Lists: pgsql-hackers

Commits

  1. Generate EUC_CN mappings from gb18030-2022.ucm

  2. Update GB18030 encoding from version 2000 to 2022

  3. Generate GB18030 mappings from the Unicode Consortium's UCM file

On Tue, Aug 5, 2025 at 1:22 PM Chao Li <li.evan.chao@gmail.com> wrote:
>
> On Aug 4, 2025, at 21:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > So on the whole I'd lean a bit towards just redefining GB18030 as
> > meaning the new standard.  The fact that we don't support it as a
> > server-side encoding perhaps makes that idea more tenable than it
> > would be if the encoding governed the interpretation of our own
> > stored data.

> I agree with Tom that we may just redefine GB18030 to comply with the 2022 standard.
>
> As John Naylor pointed out, 2022 is not backward compatible, that is true. However, I went through all the incompatible changes, and they all affect rarely used characters.

If that's the case then redefining is probably okay.

> One use case I am thinking of is, say, a database that uses the default encoding (UTF-8) and the ICU locale provider. ICU has supported GB18030-2022 since version 73.1.

ICU locales can only be used with server-side encodings.

> At the time when the new version is released, if some third-party migration tools are known to work well, the release notes may recommend those tools.

I highly doubt such a large hammer will be necessary. Whatever advice
we give for discovery and conversion of affected text is our
responsibility and can be in the form of example queries.
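
As a rough illustration of what discovery and conversion of affected text might look like outside the database, here is a minimal Python sketch. It assumes, purely for illustration, that the affected characters show up as Private Use Area code points in already-decoded text; the BMP PUA range is a stand-in and not the actual list of changed code points. The names find_pua and remap and the mapping entry are hypothetical placeholders, not taken from the standard or from any patch in this thread.

```python
# Assumption: affected characters appear as BMP Private Use Area code points.
# This range is a stand-in, not the real list of changed GB18030 mappings.
PUA_START = 0xE000
PUA_END = 0xF8FF


def find_pua(text):
    """Return (index, 'U+XXXX') pairs for each BMP PUA character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if PUA_START <= ord(ch) <= PUA_END
    ]


def remap(text, mapping):
    """Replace single characters using a caller-supplied {old: new} table."""
    return text.translate(str.maketrans(mapping))


# Hypothetical placeholder entry; a real table would come from comparing
# the old and new mapping files.
PLACEHOLDER_MAPPING = {"\ue000": "\u4e00"}

if __name__ == "__main__":
    sample = "a\ue000b"
    print(find_pua(sample))
    print(remap(sample, PLACEHOLDER_MAPPING) == "a\u4e00b")
```

The same two steps, find rows containing a code point from a known list and rewrite them using a lookup table, could presumably be expressed as queries as well.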

--
John Naylor
Amazon Web Services