Re: GB18030-2022 Support in PostgreSQL
From: Tom Lane <tgl@sss.pgh.pa.us>
To: Andrew Dunstan <andrew@dunslane.net>
Cc: John Naylor <johncnaylorls@gmail.com>,
JiaoShuntian <jiaoshuntian@highgo.com.w.kunlunaq.com>,
pgsql-hackers@lists.postgresql.org
Date: 2025-08-04T13:51:01Z
Lists: pgsql-hackers
Commits
- Generate EUC_CN mappings from gb18030-2022.ucm
  48566180efff, 19 (unreleased), landed
- Update GB18030 encoding from version 2000 to 2022
  5334620eef8f, 19 (unreleased), landed
- Generate GB18030 mappings from the Unicode Consortium's UCM file
  cfa6cd29271e, 19 (unreleased), landed

Andrew Dunstan <andrew@dunslane.net> writes:
> On 2025-08-04 Mo 6:35 AM, John Naylor wrote:
>> There is a risk of breaking applications, although only a few dozen
>> mappings changed. If it were added as a separate encoding, users could
>> opt in.

> That makes sense ... naming the new encoding so as to avoid confusion
> might be a challenge.

We have precedent for that in SHIFT_JIS_2004. Presumably if we make
this a new encoding, it'd be GB18030_2022.

However, adding a new encoding ID is not without breakage risks of its
own, stemming from some code knowing the new ID and others not.
I recall that we had some actual problems of that ilk when we added
SHIFT_JIS_2004, and some of them were pretty subtle. See e.g. this
comment from src/bin/initdb/Makefile:

# Note: it's important that we link to encnames.o from libpgcommon, not
# from libpq, else we have risks of version skew if we run with a libpq
# shared library from a different PG version. Define
# USE_PRIVATE_ENCODING_FUNCS to ensure that that happens.

That was long enough ago that I have little faith either that that fix
still does what it intended to (the code has been rejiggered
significantly since the issue was last battle-tested), or that there
are not similar hazards elsewhere.

So on the whole I'd lean a bit towards just redefining GB18030 as
meaning the new standard. The fact that we don't support it as a
server-side encoding perhaps makes that idea more tenable than it
would be if the encoding governed the interpretation of our own stored
data.

regards, tom lane