Re: GB18030-2022 Support in PostgreSQL
From: Chao Li <li.evan.chao@gmail.com>
To: John Naylor <johncnaylorls@gmail.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>, Andrew Dunstan <andrew@dunslane.net>, JiaoShuntian <jiaoshuntian@highgo.com.w.kunlunaq.com>,
pgsql-hackers@lists.postgresql.org
Date: 2025-08-07T08:14:44Z
Lists: pgsql-hackers
Commits
- Generate EUC_CN mappings from gb18030-2022.ucm
  48566180efff 19 (unreleased) landed
- Update GB18030 encoding from version 2000 to 2022
  5334620eef8f 19 (unreleased) landed
- Generate GB18030 mappings from the Unicode Consortium's UCM file
  cfa6cd29271e 19 (unreleased) landed
Attachments
- v1-0001-Upgrade-GB18030-encoding-support-from-2000-to-202.patch (application/octet-stream) patch v1-0001
I did more research into what changed between the 2000 and 2022 versions. Here is a summary:

* 66 new characters have been added in 2022. All of them are 4-byte characters. The map files store only 2-byte GB code mappings, and 4-byte GB code mappings are calculated, so these characters can be properly encoded/decoded even without this patch. I tested that.
* 9 characters are no longer required by 2022, but applications may decide whether to retain them. As the ucm file (https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/gb18030-2022.ucm) retains them, we also retain them.
* Unicode mappings for 18 characters have changed. Only these changes will cause backward compatibility issues. However, half of them are rarely used punctuation marks and the rest are glyphs that I cannot recognize as a native Chinese speaker. So these changes should not significantly impact most existing databases.

I added a test case with a character whose mapping changed, and the test passes:

% make check
...
# All 229 tests passed.

For more details on the standard change, see https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132

I am attaching the patch file.

Chao Li (Evan)
---------------------
Highgo Software Co., Ltd.
https://www.highgo.com/

On Tue, Aug 5, 2025 at 18:25, John Naylor <johncnaylorls@gmail.com> wrote:

> On Tue, Aug 5, 2025 at 1:22 PM Chao Li <li.evan.chao@gmail.com> wrote:
> >
> > On Aug 4, 2025, at 21:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > So on the whole I'd lean a bit towards just redefining GB18030 as
> > meaning the new standard. The fact that we don't support it as a
> > server-side encoding perhaps makes that idea more tenable than it
> > would be if the encoding governed the interpretation of our own
> > stored data.
> >
> > I agree with Tom that we may just redefine GB18030 to comply with the
> > 2022 standard.
> >
> > As John Naylor pointed out, 2022 is not backward compatible, that is true.
> > However, I went through all the incompatible changes, and they all
> > affect rarely used characters.
>
> If that's the case then redefining is probably okay.
>
> > One use case I am thinking of is, say, a database that uses the default
> > encoding (UTF-8) and the ICU locale provider. As ICU started to support
> > GB18030-2022 since version 73.1.
>
> ICU locales can only be used with server-side encodings.
>
> > At the time when the new version is released, if some third-party
> > migration tools are known to work fine, the release notes may recommend
> > the tools.
>
> I highly doubt such a large hammer will be necessary. Whatever advice
> we give for discovery and conversion of affected text is our
> responsibility and can be in the form of example queries.
>
> --
> John Naylor
> Amazon Web Services
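
To illustrate the remark above that 4-byte GB codes are calculated rather than stored in the map files, here is a minimal Python sketch of the one part of that calculation that is purely arithmetic in the standard: the linear mapping between supplementary-plane code points (U+10000 to U+10FFFF) and 4-byte GB18030 codes starting at 0x90308130. The function names are hypothetical and this is not the PostgreSQL conversion code. It also does not cover 4-byte codes for BMP characters, which need range tables, and it makes no claim about which range the 66 new characters fall into.

```python
# Sketch only: linear 4-byte GB18030 mapping for U+10000..U+10FFFF.
# All names here are hypothetical; this is not PostgreSQL code.

FIRST_SUPPLEMENTARY = 0x10000
LAST_CODE_POINT = 0x10FFFF


def gb_linear(b1, b2, b3, b4):
    """Position of a 4-byte sequence in the 4-byte code space.

    Bytes 1 and 3 range over 0x81..0xFE (126 values),
    bytes 2 and 4 range over 0x30..0x39 (10 values).
    """
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


# U+10000 is assigned to the 4-byte code 0x90 0x30 0x81 0x30.
SUPPLEMENTARY_BASE = gb_linear(0x90, 0x30, 0x81, 0x30)


def encode_supplementary(code_point):
    """Compute the 4-byte GB18030 code for a supplementary-plane code point."""
    if not FIRST_SUPPLEMENTARY <= code_point <= LAST_CODE_POINT:
        raise ValueError("not a supplementary-plane code point")
    linear = code_point - FIRST_SUPPLEMENTARY + SUPPLEMENTARY_BASE
    linear, d4 = divmod(linear, 10)
    linear, d3 = divmod(linear, 126)
    d1, d2 = divmod(linear, 10)
    return bytes([0x81 + d1, 0x30 + d2, 0x81 + d3, 0x30 + d4])


def decode_supplementary(gb):
    """Inverse of encode_supplementary; rejects anything outside that range."""
    if len(gb) != 4:
        raise ValueError("expected exactly four bytes")
    b1, b2, b3, b4 = gb
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a well-formed 4-byte GB18030 sequence")
    code_point = gb_linear(b1, b2, b3, b4) - SUPPLEMENTARY_BASE + FIRST_SUPPLEMENTARY
    if not FIRST_SUPPLEMENTARY <= code_point <= LAST_CODE_POINT:
        raise ValueError("sequence is outside the supplementary-plane range")
    return code_point
```

With this sketch, encode_supplementary(0x10000) gives the bytes 90 30 81 30 and encode_supplementary(0x10FFFF) gives E3 32 9A 35, and decode_supplementary reverses both. No table lookup is involved for this range, which is why a new mapping file is not needed for such characters.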