Thread
Commits
-
Generate EUC_CN mappings from gb18030-2022.ucm
- 48566180efff 19 (unreleased) landed
-
Update GB18030 encoding from version 2000 to 2022
- 5334620eef8f 19 (unreleased) landed
-
Generate GB18030 mappings from the Unicode Consortium's UCM file
- cfa6cd29271e 19 (unreleased) landed
-
GB18030-2022 Support in PostgreSQL
jiaoshuntian@highgo.com — 2025-08-04T08:08:24Z
Hi hackers, I noticed that PostgreSQL currently supports GB18030 encoding based on the older GB18030-2000 standard (as seen in commits like "extend GB18030 conversion"). However, China has since updated its mandatory character set standard to GB18030-2022, which includes additional characters and stricter compliance requirements. GB18030-2022 is now the official standard in China, and ensuring PostgreSQL’s full compliance would be beneficial for users in Chinese-speaking regions. I would like to ask: Are there any plans to upgrade PostgreSQL’s GB18030 support to the 2022 version? Would the community be open to contributions in this area? Best regards, JiaoShuntian HighGo Inc.
-
Re: GB18030-2022 Support in PostgreSQL
jiaoshuntian@highgo.com — 2025-08-04T09:27:15Z
> I would like to ask: > > Are there any plans to upgrade PostgreSQL’s GB18030 support to the 2022 version? Would the community be open to contributions in this area? I think we only need to update the Perl script and the map file to complete this task. JiaoShuntian HighGo Inc.
-
Re: GB18030-2022 Support in PostgreSQL
wenhui qiu <qiuwenhuifx@gmail.com> — 2025-08-04T09:34:48Z
Hi 😂, Not long ago, many people were rushing to remove this character set because of a security vulnerability. I was honestly quite shocked when I saw it. Thanks
-
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-08-04T10:35:02Z
On Mon, Aug 4, 2025 at 3:08 PM JiaoShuntian <jiaoshuntian@highgo.com> wrote: > I noticed that PostgreSQL currently supports GB18030 encoding based on the older GB18030-2000 standard (as seen in commits like extend GB18030 conversion). However, China has since updated its mandatory character set standard to GB18030-2022, which includes additional characters and stricter compliance requirements.GB18030-2022 is now the official standard in China, and ensuring PostgreSQL’s full compliance would be beneficial for users in Chinese-speaking regions. This is a non-backwards-compatible change: https://www.unicode.org/L2/L2022/22274-disruptive-changes.pdf https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf There is a risk of breaking applications, although only a few dozen mappings changed. If it were added as a separate encoding, users could opt in. -- John Naylor Amazon Web Services
-
Re: GB18030-2022 Support in PostgreSQL
Andrew Dunstan <andrew@dunslane.net> — 2025-08-04T12:33:00Z
On 2025-08-04 Mo 6:35 AM, John Naylor wrote: > On Mon, Aug 4, 2025 at 3:08 PM JiaoShuntian <jiaoshuntian@highgo.com> wrote: >> I noticed that PostgreSQL currently supports GB18030 encoding based on the older GB18030-2000 standard (as seen in commits like extend GB18030 conversion). However, China has since updated its mandatory character set standard to GB18030-2022, which includes additional characters and stricter compliance requirements.GB18030-2022 is now the official standard in China, and ensuring PostgreSQL’s full compliance would be beneficial for users in Chinese-speaking regions. > This is a non-backwards-compatible change: > > https://www.unicode.org/L2/L2022/22274-disruptive-changes.pdf > https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf > > There is a risk of breaking applications, although only a few dozen > mappings changed. If it were added as a separate encoding, users could > opt in. > That makes sense ... naming the new encoding so as to avoid confusion might be a challenge. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
-
Re: GB18030-2022 Support in PostgreSQL
Tom Lane <tgl@sss.pgh.pa.us> — 2025-08-04T13:51:01Z
Andrew Dunstan <andrew@dunslane.net> writes: > On 2025-08-04 Mo 6:35 AM, John Naylor wrote: >> There is a risk of breaking applications, although only a few dozen >> mappings changed. If it were added as a separate encoding, users could >> opt in. > That makes sense ... naming the new encoding so as to avoid confusion > might be a challenge. We have precedent for that in SHIFT_JIS_2004. Presumably if we make this a new encoding, it'd be GB18030_2022. However, adding a new encoding ID is not without breakage risks of its own, stemming from some code knowing the new ID and others not. I recall that we had some actual problems of that ilk when we added SHIFT_JIS_2004, and some of them were pretty subtle. See e.g. this comment from src/bin/initdb/Makefile: # Note: it's important that we link to encnames.o from libpgcommon, not # from libpq, else we have risks of version skew if we run with a libpq # shared library from a different PG version. Define # USE_PRIVATE_ENCODING_FUNCS to ensure that that happens. That was long enough ago that I have little faith either that that fix still does what it intended to (the code has been rejiggered significantly since the issue was last battle-tested), or that there are not similar hazards elsewhere. So on the whole I'd lean a bit towards just redefining GB18030 as meaning the new standard. The fact that we don't support it as a server-side encoding perhaps makes that idea more tenable than it would be if the encoding governed the interpretation of our own stored data. regards, tom lane
-
Re: GB18030-2022 Support in PostgreSQL
Kenneth Marshall <ktm@rice.edu> — 2025-08-04T16:55:07Z
On Mon, Aug 04, 2025 at 04:08:24PM +0800, JiaoShuntian wrote: > Hi hackers, > > I noticed that PostgreSQL currently supports GB18030 encoding based on the older GB18030-2000 standard (as seen in commits like extend GB18030 conversion). However, China has since updated its mandatory character set standard to GB18030-2022, which includes additional characters and stricter compliance requirements.GB18030-2022 is now the official standard in China, and ensuring PostgreSQL’s full compliance would be beneficial for users in Chinese-speaking regions. > > I would like to ask: > > Are there any plans to upgrade PostgreSQL’s GB18030 support to the 2022 version?Would the community be open to contributions in this area? > > Best regards, > > JiaoShuntian > HighGo Inc. Hi, I believe that it is in ICU already. You should be able to use that as your locale provider. Regards, Ken
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-08-05T06:22:18Z
> On Aug 4, 2025, at 21:51, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > So on the whole I'd lean a bit towards just redefining GB18030 as > meaning the new standard. The fact that we don't support it as a > server-side encoding perhaps makes that idea more tenable than it > would be if the encoding governed the interpretation of our own > stored data. > > regards, tom lane > I agree with Tom that we may just redefine GB18030 to comply with the 2022 standard. As John Naylor pointed out, 2022 is not backward compatible; that is true. However, I went through all the incompatible changes, and they all involve rarely used characters. So I would guess most existing databases won’t be impacted, and the rest that use GB18030 will need a data migration before upgrading to a PG version that switches to GB18030-2022. I think PG may delegate data migration tasks to third-party PG service vendors. They may develop simple or complex migration tools for different use cases. One use case I am thinking of: say a database uses the default encoding (UTF-8) and the ICU locale provider. ICU has supported GB18030-2022 since version 73.1. If the database worked with a pre-73.1 version of ICU and ICU is then upgraded to 73.1 or later, the database may face the same backward compatibility risk. That is because, say, a GB code (0xA6D9) maps to U+E78D with GB18030-2000 and changes to map to U+FE10 with GB18030-2022. If a character 0xA6D9 was given to the database, it would be stored as U+E78D on disk. After upgrading ICU to 73.1 or later, U+E78D would no longer be considered as “0xA6D9” by ICU. So to keep the data’s original meaning, a data migration has to be done to update U+E78D to U+FE10. In this example, the PG version is not changed, but the database still needs a data migration. The other reason I don’t think a new encoding GB18030_2022 is needed is that, as GB18030-2022 is a hard requirement from the government, most likely all commercial databases must comply with it. Thus a lot of current databases with GB18030 must be migrated to GB18030-2022. As PG doesn’t support changing a database’s encoding, if a new encoding is added, then an existing db must be migrated to a new db. If we only redefine GB18030, then existing databases only need some data migration, which should be easier. So, I think PG doesn’t need to worry about the backward compatibility problem too much; all PG needs to do is state clearly in the release notes that a data migration might be required. When the new version is released, if some third-party migration tools are known to work well, the release notes may recommend them. Regards, Chao Li (Evan) ------------------------------ HighGo Infra. Software Inc. https://www.highgo.com/
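To make the scenario in this message concrete, here is a minimal Python sketch. It models only the single code discussed above (0xA6D9) with two hand-written dictionaries and follows the scenario as the message describes it; it does not reproduce actual ICU or PostgreSQL behavior. The names `GB2000`, `GB2022`, `decode` and `encode` are illustrative assumptions, not real APIs.

```python
# Hypothetical toy tables covering only the one code discussed in the message.
GB2000 = {b"\xa6\xd9": "\ue78d"}   # GB18030-2000: 0xA6D9 <-> U+E78D
GB2022 = {b"\xa6\xd9": "\ufe10"}   # GB18030-2022: 0xA6D9 <-> U+FE10


def decode(table, raw):
    """Map a GB byte sequence to its Unicode character."""
    return table[raw]


def encode(table, char):
    """Map a Unicode character back to GB bytes, or None if the toy table has no entry."""
    for raw, mapped in table.items():
        if mapped == char:
            return raw
    return None


stored = decode(GB2000, b"\xa6\xd9")       # text written while the old mapping was in effect
round_trip_old = encode(GB2000, stored)    # old table returns the original bytes
round_trip_new = encode(GB2022, stored)    # this toy new table has no entry for U+E78D
migrated = decode(GB2022, b"\xa6\xd9")     # what the same bytes decode to after the change
```

In this toy model the stored character only round-trips under the old table, which is the reason the message argues that U+E78D would have to be rewritten to U+FE10.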
-
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-08-05T10:25:27Z
On Tue, Aug 5, 2025 at 1:22 PM Chao Li <li.evan.chao@gmail.com> wrote: > > On Aug 4, 2025, at 21:51, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > So on the whole I'd lean a bit towards just redefining GB18030 as > meaning the new standard. The fact that we don't support it as a > server-side encoding perhaps makes that idea more tenable than it > would be if the encoding governed the interpretation of our own > stored data. > I agree with Tom that we may just redefine GB18030 to comply with the 2022 standard. > > As John Naylor pointed out, 2022 is not backward compatible, that is true. However, I went through all the incompatible changes, those are all characters rarely used. If that's the case then redefining is probably okay. > One use case I am thinking is that, say a database uses default encoding (UTF-8) and ICU locale provider. As ICU started to support GB18030-2022 since version 73.1. ICU locales can only be used with server-side encodings. > At the time when the new version is released, if some third party migration tools are known working fine, the release note may recommend the tools. I highly doubt such a large hammer will be necessary. Whatever advice we give for discovery and conversion of affected text is our responsibility and can be in the form of example queries. -- John Naylor Amazon Web Services
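As a rough illustration of what "discovery and conversion of affected text" could look like outside the database, here is a small Python sketch. The 18 old/new code point pairs are transcribed from the .ucm diff posted in this thread (pairs that share the same two-byte GB code); the function names `find_affected` and `convert` are hypothetical, and whether rewriting stored text is the right remedy for a given installation is an assumption, not something the thread settles.

```python
# Old private-use code point -> code point that the same two-byte GB code
# maps to in the 2022 .ucm file, transcribed from the diff in this thread.
OLD_TO_NEW = {
    0xE78D: 0xFE10, 0xE78E: 0xFE12, 0xE78F: 0xFE11, 0xE790: 0xFE13,
    0xE791: 0xFE14, 0xE792: 0xFE15, 0xE793: 0xFE16, 0xE794: 0xFE17,
    0xE795: 0xFE18, 0xE796: 0xFE19,
    0xE81E: 0x9FB4, 0xE826: 0x9FB5, 0xE82B: 0x9FB6, 0xE82C: 0x9FB7,
    0xE832: 0x9FB8, 0xE843: 0x9FB9, 0xE854: 0x9FBA, 0xE864: 0x9FBB,
}


def find_affected(text):
    """Return (index, 'U+XXXX') for every character whose mapping changed."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) in OLD_TO_NEW
    ]


def convert(text):
    """Rewrite old code points to the ones now assigned to the same GB code."""
    return text.translate(OLD_TO_NEW)


sample = "a\ue78db"
hits = find_affected(sample)
fixed = convert(sample)
```

A query-based equivalent inside the database would follow the same idea: search text columns for the 18 old code points, then replace them.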
-
Re: GB18030-2022 Support in PostgreSQL
Peter Eisentraut <peter@eisentraut.org> — 2025-08-06T10:29:15Z
On 05.08.25 08:22, Chao Li wrote: > I agree with Tom that we may just redefine GB18030 to comply with the > 2022 standard. > > As John Naylor pointed, 2022 is not backward compatible, that is true. > However, I went through all the incompatible changes, those are all > characters rarely used. So I would guess most of the existing databases > won’t be impacted and the rest with encoding GB18030 need to do data > migration before upgrading to a PG version that switches to > GB18030-2022. I think PG may delegate data migration tasks to third > party PG service vendors. They may develop simple or complex migration > tools to help different use cases. Note that you can also create custom conversions using CREATE CONVERSION, so that would be something for those who would need the old behavior.
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-08-07T08:14:44Z
I did more research on the changes in 2022 relative to 2000. Here is a summary:
* 66 new characters have been added in 2022. All of them are 4-byte characters. As the map files store only 2-byte GB code mappings and 4-byte GB code mappings are calculated, these characters can be properly encoded/decoded without this patch; I tested that.
* 9 characters are no longer required by 2022, but applications may decide whether to retain them. As the ucm file (https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/gb18030-2022.ucm) retains them, we also retain them.
* Unicode mappings for 18 characters have changed. Only these changes will cause backward compatibility issues. However, half of them are rarely used punctuation marks and the rest are glyphs that I cannot recognize as a native Chinese speaker. So these changes should not significantly impact most existing databases.
I added a test case with a mapping-changed character, and the test passes: % make check ... # All 229 tests passed. For more details on the standard change, see https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132 I am attaching the patch file. Chao Li (Evan) --------------------- Highgo Software Co., Ltd. https://www.highgo.com/
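The remark that four-byte mappings are "calculated" can be illustrated with the general GB18030 four-byte arithmetic. The Python sketch below is not PostgreSQL's implementation; it is a minimal example that computes the linear index of a four-byte code and then derives a code point relative to a known anchor pair. The anchor (U+9FB4 and 0x82359037) and the test code 0x82359632 are values that appear elsewhere in this thread; the helper names are hypothetical, and the second helper assumes both codes lie in the same linearly mapped range.

```python
def gb18030_linear(code: bytes) -> int:
    """Return the linear index of a four-byte GB18030 code."""
    b1, b2, b3, b4 = code  # raises ValueError unless exactly four bytes
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a well-formed four-byte GB18030 code")
    # Bytes 2 and 4 have 10 possible values, byte 3 has 126.
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


# Anchor pair taken from the .ucm diff in this thread: U+9FB4 <-> 0x82359037.
ANCHOR_CODE = bytes.fromhex("82359037")
ANCHOR_CP = 0x9FB4


def codepoint_in_same_range(code: bytes) -> int:
    """Derive a code point by offset from the anchor.

    Assumption: `code` lies in the same linearly mapped range as the anchor.
    """
    return ANCHOR_CP + gb18030_linear(code) - gb18030_linear(ANCHOR_CODE)


cp = codepoint_in_same_range(bytes.fromhex("82359632"))
utf8_hex = chr(cp).encode("utf-8").hex()
```

With the input `82359632` the derived UTF-8 bytes agree with the `e9bfab` result reported for the test query in this thread, which is consistent with such codes being computed by offset rather than stored in a table.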
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-08-11T02:01:08Z
I have created a patch https://commitfest.postgresql.org/patch/5954/. CommitFest requested a rebase, so I rebased the code and created the v2 patch. BTW, I have tested all 66 new characters, the 9 not-required characters, and the 18 changed characters like this:
evantest=# SELECT encode(convert_from(decode('82359632', 'hex'), 'GB18030')::bytea, 'hex');
 encode
--------
 e9bfab
(1 row)
All were encoded correctly. Chao Li (Evan) --------------------- HighGo Software Co., Ltd. https://www.highgo.com/
-
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-08-11T05:50:48Z
On Mon, Aug 11, 2025 at 9:01 AM Chao Li <li.evan.chao@gmail.com> wrote: > > I have created a patch https://commitfest.postgresql.org/patch/5954/. CommitFests requested a rebase, so I rebased the code and created the v2 patch. > > BTW, I have tested all 66 new characters, 9 not-required characters and 18 changed characters in a way as: "9 characters are no longer required by the new standard, but are retained in this patch for compatibility" How is that done? > I added a test case with a mapping changed char, and the test passes: > > % make check > ... > # All 229 tests passed. > > For more details on the standard change, see https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132 > > I am attaching the patch file. Going from the old .xml file to the .ucm file makes it difficult to see the relevant changes. Also, there are nearly 1000 non-user-visible changes like this in the output file that are not explained: - /*** Three byte table, leaf: efa8xx - offset 0x07aba ***/ + /*** Three byte table, leaf: efa8xx - offset 0x07b3a ***/ The 2000 version is available in the .ucm format, so maybe converting to that first would be a good preparatory patch: https://github.com/unicode-org/icu-data/blob/main/charset/data/ucm/gb-18030-2000.ucm Looking at the history, it looks like that file has seen small revisions, so it may take some research to get the exact equivalent to the XML file we use. That will also tell us if anything will change for us besides the actual 2022 revision. -- John Naylor Amazon Web Services
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-08-11T08:22:09Z
Hi John, Thanks for your review. Yes, I did a diff between 2000.ucm and 2022.ucm when I worked on the patch. The diff between 2000.ucm and 2022.ucm is quite small:
```diff
- omit the comment part
> <U20AC> \x80 |3
> <U3000> \xA3\xA0 |3
> <UE5E5> \xA3\xA0 |4
28067a28099,28114
> <U9FB4> \xFE\x59 |0
> <U9FB4> \x82\x35\x90\x37 |3
> <U9FB5> \xFE\x61 |0
> <U9FB5> \x82\x35\x90\x38 |3
> <U9FB6> \xFE\x66 |0
> <U9FB6> \x82\x35\x90\x39 |3
> <U9FB7> \xFE\x67 |0
> <U9FB7> \x82\x35\x91\x30 |3
> <U9FB8> \xFE\x6D |0
> <U9FB8> \x82\x35\x91\x31 |3
> <U9FB9> \xFE\x7E |0
> <U9FB9> \x82\x35\x91\x32 |3
> <U9FBA> \xFE\x90 |0
> <U9FBA> \x82\x35\x91\x33 |3
> <U9FBB> \xFE\xA0 |0
> <U9FBB> \x82\x35\x91\x34 |3
29577c29624
< <UE5E5> \xA3\xA0 |0
---
> # <UE5E5> \xA3\xA0 |0
30001,30010c30048,30057
< <UE78D> \xA6\xD9 |0
< <UE78E> \xA6\xDA |0
< <UE78F> \xA6\xDB |0
< <UE790> \xA6\xDC |0
< <UE791> \xA6\xDD |0
< <UE792> \xA6\xDE |0
< <UE793> \xA6\xDF |0
< <UE794> \xA6\xEC |0
< <UE795> \xA6\xED |0
< <UE796> \xA6\xF3 |0
---
> <UE78D> \xA6\xD9 |1
> <UE78E> \xA6\xDA |1
> <UE78F> \xA6\xDB |1
> <UE790> \xA6\xDC |1
> <UE791> \xA6\xDD |1
> <UE792> \xA6\xDE |1
> <UE793> \xA6\xDF |1
> <UE794> \xA6\xEC |1
> <UE795> \xA6\xED |1
> <UE796> \xA6\xF3 |1
30146c30193
< <UE81E> \xFE\x59 |0
---
> <UE81E> \xFE\x59 |1
30154c30201
< <UE826> \xFE\x61 |0
---
> <UE826> \xFE\x61 |1
30159,30160c30206,30207
< <UE82B> \xFE\x66 |0
< <UE82C> \xFE\x67 |0
---
> <UE82B> \xFE\x66 |1
> <UE82C> \xFE\x67 |1
30166c30213
< <UE832> \xFE\x6D |0
---
> <UE832> \xFE\x6D |1
30183c30230
< <UE843> \xFE\x7E |0
---
> <UE843> \xFE\x7E |1
30200c30247
< <UE854> \xFE\x90 |0
---
> <UE854> \xFE\x90 |1
30216c30263
< <UE864> \xFE\xA0 |0
---
> <UE864> \xFE\xA0 |1
30470a30518,30537
> <UFE10> \xA6\xD9 |0
> <UFE10> \x84\x31\x82\x36 |3
> <UFE11> \xA6\xDB |0
> <UFE11> \x84\x31\x82\x37 |3
> <UFE12> \xA6\xDA |0
> <UFE12> \x84\x31\x82\x38 |3
> <UFE13> \xA6\xDC |0
> <UFE13> \x84\x31\x82\x39 |3
> <UFE14> \xA6\xDD |0
> <UFE14> \x84\x31\x83\x30 |3
> <UFE15> \xA6\xDE |0
> <UFE15> \x84\x31\x83\x31 |3
> <UFE16> \xA6\xDF |0
> <UFE16> \x84\x31\x83\x32 |3
> <UFE17> \xA6\xEC |0
> <UFE17> \x84\x31\x83\x33 |3
> <UFE18> \xA6\xED |0
> <UFE18> \x84\x31\x83\x34 |3
> <UFE19> \xA6\xF3 |0
> <UFE19> \x84\x31\x83\x35 |3
```
As you can see, the changes only affect the 18 changed characters plus 3 other Unicode code points (U+20AC, U+3000, U+E5E5). My code comment in UCS_to_GB18030.pl explains these changes:
```perl
# The |n is a flag, where n has values of 0, 1, 3, 4.
# With reference to https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132,
# the flag should mean the following:
#   0 - round-trip mapping
#   1 - there are 18 mappings with flag 1, those are mapping changes
#       from GB18030-2000 to GB18030-2022. Old mappings are marked
#       with flag 1, new mappings with flag 0. So we can ignore all
#       mappings with flag 1.
#   3 - there are 20 mappings with flag 3:
#       18 of them correspond to the 18 mappings with flag 1, but mean
#       the old mapping's unicode's new mapping with GB18030-2022.
#       These 18 new mappings have no actual glyphs in GB18030-2022.
#       So we can ignore these 18 mappings with flag 3.
#       The other 2 are: "<U20AC> \x80 |3" and "<U3000> \xA3\xA0 |3".
#       They are two reserved fallbacks for compatibility with GBK and
#       other web data as in WHATWG. Both U20AC and U3000 have
#       round-trip mappings in GB18030-2022, so we can ignore these two
#       mappings with flag 3.
#       So, we can ignore all mappings with flag 3.
#   4 - there is only one mapping with flag 4: <UE5E5> \xA3\xA0 |4.
#       This is a "good one-way" mapping from U+E5E5 to \xA3\xA0
#       for maximum compatibility with previous behavior. So we can
#       ignore this mapping as well.
```
For your question: > "9 characters are no longer required by the new standard, but are > retained in this patch for compatibility" > > How is that done? The 9 mappings are not changed between 2000.ucm and 2022.ucm. For example, GB18030 code 0xFD9C is one of the 9 not-required codes, but the mapping <UF92C> \xFD\x9C |0 still appears in 2022.ucm, so this character is retained. Chao Li (Evan) -------------------- HighGo Software Co., Ltd. https://www.highgo.com/
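The flag handling described in the code comment can be sketched in a few lines of Python. This is only an illustration of the idea (keep round-trip entries, skip the fallback flags), not the actual logic of UCS_to_GB18030.pl; the regular expression, the function names, and the choice to key the table by GB bytes are assumptions made for the example. The sample lines are copied from the diff in this message.

```python
import re

# One mapping line looks like: <UE78D> \xA6\xD9 |1
UCM_LINE = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)$"
)


def parse_ucm_line(line):
    """Return (codepoint, gb_bytes, flag), or None for comments and other lines."""
    m = UCM_LINE.match(line.strip())
    if m is None:
        return None
    codepoint = int(m.group(1), 16)
    gb_bytes = bytes.fromhex(m.group(2).replace("\\x", ""))
    return codepoint, gb_bytes, int(m.group(3))


def round_trip_mappings(lines):
    """Collect only flag-0 (round-trip) entries, keyed by GB byte sequence."""
    table = {}
    for line in lines:
        parsed = parse_ucm_line(line)
        if parsed is None:
            continue
        codepoint, gb_bytes, flag = parsed
        if flag == 0:
            table[gb_bytes] = codepoint
    return table


SAMPLE = [
    r"<UE78D> \xA6\xD9 |1",
    r"<UFE10> \xA6\xD9 |0",
    r"<UFE10> \x84\x31\x82\x36 |3",
    r"# <UE5E5> \xA3\xA0 |0",
    r"<UE5E5> \xA3\xA0 |4",
]
table = round_trip_mappings(SAMPLE)
```

With these five sample lines only the `<UFE10> \xA6\xD9 |0` entry survives, which matches the intent that the new mapping replaces the old one for the same two-byte code.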
-
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-08-11T09:15:00Z
On Mon, Aug 11, 2025 at 3:22 PM Chao Li <li.evan.chao@gmail.com> wrote: Hi, For future reference, please don't quote my entire message below yours -- it clutters the archives and also removes context. > Yes, I did a diff between 2000.ucm and 2022.ucm when I worked on the patch. The diff between 2000.ucm and 2022.ucm are quite small: That would match my expectation. In case it wasn't clear before, my preference is to split this patch into two patches: First convert to .ucm, then update to 2022 revision. Then the small diff will be obvious to everyone who looks at the second commit. > For your question: > > "9 characters are no longer required by the new standard, but are > retained in this patch for compatibility" > > How is that done? > > > The 9 mappings are not changed between 2000.ucm and 2022.ucm. For example, GB18030 code 0xFD9C is one of the 9 not-required code, but the mapping: > > <UF92C> \xFD\x9C |0 > > Still appears in 2022.ucm, so that this character is retained. Thanks for clarifying -- by saying "retained in the patch", the commit message implied to me that the patch added something not in the upstream file. -- John Naylor Amazon Web Services
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-08-11T09:25:07Z
> That would match my expectation. In case it wasn't clear before, my > preference is to split this patch into two patches: First convert to > .ucm, then update to 2022 revision. Then the small diff will be > obvious to everyone who looks at the second commit. Sure, I can split the patch into two. The first patch only changes the .xml file to a .ucm file and updates the Perl script. As a result, the map files should not change. The second patch will then update the ucm file, so it should be small, containing only ucm changes and map file changes. One thing to confirm with you: to achieve that, we will name the ucm file gb18030.ucm, without the “-2000” or “-2022” suffix; otherwise git won’t be able to show the diff. Is that what you meant? > > Thanks for clarifying -- by saying "retained in the patch", the commit > message implied to me that the patch added something not in the > upstream file. > I will update the commit message in the new patch. Chao Li (Evan) -------------------- HighGo Software Co., Ltd. https://www.highgo.com/
-
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-08-11T09:29:04Z
On Mon, Aug 11, 2025 at 4:25 PM Chao Li <li.evan.chao@gmail.com> wrote: > > Sure I can split the patch into two. The patch only changes the .xml file to .ucm file and updating the perl script. As a result, map files should not be changed. > > Then the second patch will update the ucm file, so that the second patch should be small, contains only ucm changes and map file changes. > > One thing to confirm with you. To achieve that, we will rename the ucm file as gb18030.ucm without suffix of “-2000” and “-2022”, otherwise git won’t be able to show the diff. Is that what you meant? Usually git is pretty smart about renames combined with small changes, so I would try keeping the original names and see what it does. -- John Naylor Amazon Web Services
-
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-08-12T04:57:45Z
On Tue, Aug 12, 2025 at 9:09 AM Chao Li <li.evan.chao@gmail.com> wrote: [bringing this back to the original thread] > So, I compared 2000 ucm with 2005 ucm also compared 2005 ucm with 2022 ucm. Then I found that some changed in 2005 is reverted in 2022, that why diff between 2000 and 2022 is small. For example, the following mappings Yes, this was mentioned in the "disruptive changes" document linked in my first email in this thread: "The 2005 edition included 6 characters with double mappings. The 2022 edition removes the double mappings. The 2005 edition included 9 characters from the CJK Compatibility Ideographs block. In Unicode/10646, these all have canonical decomposition mappings to characters in the URO. In the 2022 edition, these nine compatibility characters are removed." > So, for how to create patch 2, I think we have 3 options: > > 1. As planned, update to the latest version of 2000 ucm, then skip 2005 and directly upgrade to 2022 in patch 3. This way, we just honor 2000 ucm regardless that the change is actually introduced by 2005. > > 2. Skip the latest version of 2000 ucm and upgrade to 2005 ucm. This way will clearly show the upgrade path 2000->2005->2022. Downside is that 2005 introduced some changes that are reverted in 2022, which will cause some unnecessary changes in map files. > > 3. Skip patch 2, directly go to patch 3. So that, patch 3 will include changes introduced by both 2005 and 2022. This way makes minimum changes to map files. #3 is what I had in mind to begin with unless we found some reason not to. Minimizing churn is a lucky side effect that reinforces that choice. Before getting to that, I thought I'd bring this up to the community: +# Copyright (C) 2000-2009, International Business Machines Corporation and others. +# All Rights Reserved. The previous XML file didn't contain a copyright notice -- does anyone want to make a case for not checking unicode-org's source file into our tree because of this? 
The 2022 update changes it to # Copyright (C) 2016 and later: Unicode, Inc. and others. # License & terms of use: http://www.unicode.org/copyright.html # Copyright (C) 2000-2012, International Business Machines Corporation and others. # All Rights Reserved. ...and the above links to https://www.unicode.org/license.txt -- John Naylor Amazon Web Services
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-08-12T06:05:39Z
>> >> 3. Skip patch 2, directly go to patch 3. So that, patch 3 will include changes introduced by both 2005 and 2022. This way makes minimum changes to map files. > > #3 is what I had in mind to begin with unless we found some reason not > to. Minimizing churn is a lucky side effect that reinforces that > choice. > Cool, then I will take option 3. > Before getting to that, I thought I'd bring this up to the community: > > > The previous XML file didn't contain a copyright notice -- does anyone > want to make a case for not checking unicode-org's source file into > our tree because of this? The 2022 update changes it to > > Thanks for pointing out the Unicode license issue; I really hadn’t noticed it. I did some quick research. As we generate the mapping files from the ucm files, and the map files are built into the final executable binaries, we are redistributing Unicode-derived data, so we should still include the Unicode license. Thus, not checking in the ucm file won’t remove the license obligation. We can just add a license file, say named unicode_license.txt, with the proper content in the same folder as the ucm file. I guess that would address the license problem. The following is the ChatGPT-generated content of the license file:
```
Portions of this product include data from the Unicode Character Database and other Unicode® data files.

Copyright © 1991–2025 Unicode, Inc. All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of the Unicode data files and any associated documentation (the "Data Files") or Unicode software and any associated documentation (the "Software") to deal in the Data Files or Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies of the Data Files or Software, and to permit persons to whom the Data Files or Software are furnished to do so, provided that either: (a) this copyright and permission notice appear with all copies of the Data Files or Software, or (b) this copyright and permission notice appear in associated documentation.

THE DATA FILES AND SOFTWARE ARE PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THE DATA FILES OR SOFTWARE.

Unicode and the Unicode logo are trademarks of Unicode, Inc. in the United States and other countries. All third party trademarks referenced herein are the property of their respective owners.
```
Regards, Chao Li (Evan) -------------------- HighGo Software Co., Ltd. https://www.highgo.com/
-
Re: GB18030-2022 Support in PostgreSQL
Peter Eisentraut <peter@eisentraut.org> — 2025-08-12T19:41:47Z
On 12.08.25 06:57, John Naylor wrote: > Before getting to that, I thought I'd bring this up to the community: > > +# Copyright (C) 2000-2009, International Business Machines > Corporation and others. > +# All Rights Reserved. > > The previous XML file didn't contain a copyright notice -- does anyone > want to make a case for not checking unicode-org's source file into > our tree because of this? The 2022 update changes it to > > # Copyright (C) 2016 and later: Unicode, Inc. and others. > # License & terms of use:http://www.unicode.org/copyright.html > # Copyright (C) 2000-2012, International Business Machines Corporation > and others. > # All Rights Reserved. > > ...and the above links tohttps://www.unicode.org/license.txt Could we download this file on demand, like we do for the other input files for the conversion mappings?
-
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-08-13T07:17:03Z
On Wed, Aug 13, 2025 at 2:41 AM Peter Eisentraut <peter@eisentraut.org> wrote: > Could we download this file on demand, like we do for the other input > files for the conversion mappings? That sounds like the way to go. While poking around, I found that UCS_to_EUC_CN.pl also uses gb-18030-2000.xml for its input, so now it seems wrong to delete the XML file as a side effect of changing the source for GB18030. Maybe EUC_CN could use a downloaded-on-demand .ucm source as well (whether 2000 or 2022) but we can consider that later. For now let's leave the XML file alone. -- John Naylor Amazon Web Services
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-08-13T07:20:27Z
> On Aug 13, 2025, at 15:17, John Naylor <johncnaylorls@gmail.com> wrote: > > On Wed, Aug 13, 2025 at 2:41 AM Peter Eisentraut <peter@eisentraut.org> wrote: >> Could we download this file on demand, like we do for the other input >> files for the conversion mappings? > > That sounds like the way to go. > > While poking around, I found that UCS_to_EUC_CN.pl also uses > gb-18030-2000.xml for its input, so now it seems wrong to delete the > XML file as a side effect of changing the source for GB18030. Maybe > EUC_CN could use a downloaded-on-demand .ucm source as well (whether > 2000 or 2022) but we can consider that later. For now let's leave the > XML file alone. > Sounds good. Let me recreate the patch. Chao Li (Evan) -------------------- HighGo Software Co., Ltd. https://www.highgo.com/
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-08-13T08:08:45Z
On 2025/8/13 15:20, Chao Li wrote: > > > Sounds good. Let me recreate the patch. > > Attached is the new patch. It downloads the UCM file in make: ``` Unicode % make gb18030_to_utf8.map wget -O gb-18030-2000.ucm --no-use-server-timestamps https://raw.githubusercontent.com/unicode-org/icu-data/d9d3a6ed27bb98a7106763e940258f0be8cd995b/charset/data/ucm/gb-18030-2000.ucm --2025-08-13 15:54:53-- https://raw.githubusercontent.com/unicode-org/icu-data/d9d3a6ed27bb98a7106763e940258f0be8cd995b/charset/data/ucm/gb-18030-2000.ucm HTTP request sent, awaiting response... 200 OK Length: 672885 (657K) [text/plain] Saving to: ‘gb-18030-2000.ucm’ gb-18030-2000.ucm 100%[=====================================>] 657.11K 2.78MB/s in 0.2s 2025-08-13 15:54:54 (2.78 MB/s) - ‘gb-18030-2000.ucm’ saved [672885/672885] '/usr/bin/perl' -I . UCS_to_GB18030.pl - Writing UTF8=>GB18030 conversion table: utf8_to_gb18030.map - Writing GB18030=>UTF8 conversion table: gb18030_to_utf8.map Unicode % git diff Unicode % ``` After regenerating the map files, there is no change found in the map files. Best regards, Chao Li (Evan) -------------------- HighGo Software Co., Ltd. https://www.highgo.com/
-
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-08-18T05:18:25Z
On Wed, Aug 13, 2025 at 3:08 PM Chao Li <li.evan.chao@gmail.com> wrote: > Attached is the new patch. It downloads the UCM file in make: > After regenerating the map files, there is no change found in the map files. I can confirm, thanks. When we split a patch into multiple patches, it's customary to include all of them, since that process may result in unwelcome artifacts to sort out. (When the first step has architectural questions or a change in behavior, we may treat it as independent, possibly with a separate thread, but that's not the case here.) I do have some comments already, though: -my $in_file = "gb-18030-2000.xml"; - +my $in_file = "gb-18030-2000.ucm"; -while (<$in>) -{ +while (<$in>) { -# The lines we care about in the source file look like +# The lines we care about in the source file look like: These are spurious changes, which we try to avoid. - next if (!m/<a u="([0-9A-F]+)" b="([0-9A-F ]+)"/); + if (/^<U([0-9A-Fa-f]+)>\s+((?:\\x[0-9A-Fa-f]{2})+)\s*\|(\d+)/) { This change in style caused extra whitespace-only churn. That obscures what the actual changes are. + # Match lines like: <UXXXX> \xYY[\xYY...] |n, and use only (|0) mappings This is missing an explanation of why we skip non-zero mappings. Code-wise, this only matters for the output in the follow-on patch for 2022, but one of these patches needs to include a brief explanation. I did not like the detailed description that was present in one of the earlier 2022 patches that told how many characters were flagged a certain way -- that's irrelevant detail and will likely get out of date in some future version anyway. +# and n is a flag indicating the type of mapping having +# a single value of 0. This seems weird when combined with the logic to filter out non-zero mappings. We need to think about when and where to show relevant information. + next if ($flag ne '0'); # non-0 flags This comment is just repeating what the code is doing, and it's very obvious what it's doing. 
BTW, it sounds like your proposed Makefile changes are needed for the follow-on patch with .map changes to work at all, is that right? https://www.postgresql.org/message-id/1CA8625F-AA41-4ED2-B60F-E28AC71F37DC@highgo.com -- John Naylor Amazon Web Services -
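The filtering discussed in this review can be sketched in a few lines of Python. This is an illustration only: the pattern is a port of the Perl regex quoted above, the names `UCM_LINE` and `parse_roundtrip` are hypothetical, and the sample lines are made-up input rather than an excerpt of the real .ucm file. It is not the code in UCS_to_GB18030.pl.

```python
import re

# Matches UCM mapping lines of the form "<UXXXX> \xYY[\xYY...] |n".
# Python port of the Perl regex quoted in the review, for illustration only.
UCM_LINE = re.compile(
    r"""^<U([0-9A-Fa-f]+)>           # Unicode code point in hex
        \s+((?:\\x[0-9A-Fa-f]{2})+)  # one or more \xYY byte escapes
        \s*\|(\d+)                   # mapping flag
    """,
    re.VERBOSE,
)


def parse_roundtrip(lines):
    """Return {code_point: encoded_bytes} for lines flagged |0 only."""
    mapping = {}
    for line in lines:
        m = UCM_LINE.match(line)
        if m is None:
            continue  # header, comment, or blank line
        ucs, raw, flag = m.groups()
        if flag != "0":
            continue  # skip mappings that are not round-trip
        mapping[int(ucs, 16)] = bytes.fromhex(raw.replace("\\x", ""))
    return mapping


# Made-up sample input (assumption: not copied from the real file).
SAMPLE = [
    "# sample input for illustration",
    r"<U4E00> \xD2\xBB |0",
    r"<U3000> \xA3\xA0 |3",
    r"<UE5E5> \xA3\xA0 |4",
]
ROUNDTRIP = parse_roundtrip(SAMPLE)
```

Only the line flagged `|0` survives, which is why a byte sequence that appears solely with non-zero flags ends up with no entry in the generated map.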
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-08-18T06:35:42Z
On 2025/8/18 13:18, John Naylor wrote: > We split a patch into multiple patches, it's customary include all of > them, since that process may result in unwelcome artifacts to sort > out. (When the first step has architectural questions or change in > behavior, we may treat it as independent, possibly with a separate > thread, but that's not the case here.) Thanks for the explanation. I thought to make the second patch only after the first patch is pushed. I am new to PostgreSQL contribution, your guidance is very helpful for my future work. Now I attach the both patch files. For the second patch, I have tested it manually again. And "make check" test passed. > -# The lines we care about in the source file look like > +# The lines we care about in the source file look like: > > These are spurious changes, which we try to avoid. Updated. > - next if (!m/<a u="([0-9A-F]+)" b="([0-9A-F ]+)"/); > > + if (/^<U([0-9A-Fa-f]+)>\s+((?:\\x[0-9A-Fa-f]{2})+)\s*\|(\d+)/) { > > This change in style caused extra whitespace-only churn. That obscures > what the actual changes are. Updated. > + # Match lines like: <UXXXX> \xYY[\xYY...] |n, and use only (|0) mappings > > This is missing an explanation of why we skip non-zero mappings. > Code-wise, this only matters for the output in the follow-on patch for > 2022, but one of these patches needs to include a brief explanation. I > did not like the detailed description that was present in one of the > earlier 2022 patches that told how many characters were flagged a > certain way -- that's irrelevant detail and will likely get out of > date in some future version anyway. Okay, I kept a neat version of comment now. > +# and n is a flag indicating the type of mapping having > +# a single value of 0. > > This seems weird when combined with the logic to filter out non-zero > mappings. We need to think about when and where to show relevant > information. Updated the comment. 
> + next if ($flag ne '0'); # non-0 flags > > This comment is just repeating what the code is doing, and it's very > obvious what it's doing. Removed the useless comment. > > BTW, it sounds like your proposed Makefile changes are needed for the > follow-on patch with .map changes to work at all, is that right? > > https://www.postgresql.org/message-id/1CA8625F-AA41-4ED2-B60F-E28AC71F37DC@highgo.com > I think that patch could be separate, because the makefile changes are generic to all map files. The current GB18030 patch doesn't depend on that makefile patch at all. The makefile patch just makes build a little bit easier upon map file changes. Best regards, -- Chao Li (Evan) -------------------- HighGo Software Co., Ltd. https://www.highgo.com/ -
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-08-18T08:34:29Z
On Mon, Aug 18, 2025 at 1:36 PM Chao Li <li.evan.chao@gmail.com> wrote: > I think that patch could be separate, because the makefile changes are generic to all map files. The current GB18030 patch doesn't depend on that makefile patch at all. The makefile patch just makes build a little bit easier upon map file changes. I verified that both autoconf and meson builds pick up the change with these two patches, and the new test passes. I'm still not sure what circumstances you found where a change doesn't get picked up, but we can come back to that later if need be. BTW, the Commitfest shows these patches as "needs rebase". The reason for that is the naming. Commands like `git am` apply a series in order, and expects to find something like v3-0001-* v3-0002-* Your last attachment was v1-0001-* v2-0001-* ...and confusingly v2 needed to be applied first. To create a series from a branch, use `git format-patch master -v <version number>` and it will output an ordered series with one patch per commit. -- John Naylor Amazon Web Services
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-08-18T08:50:34Z
On 2025/8/18 16:34, John Naylor wrote: > I verified that both autoconf and meson builds pick up the change with > these two patches, and the new test passes. I'm still not sure what > circumstances you found where a change doesn't get picked up, but we > can come back to that later if need be. Let's talk about the makefile change separately. > ...and confusingly v2 needed to be applied first. To create a series > from a branch, use `git format-patch master -v <version number>` and > it will output an ordered series with one patch per commit. This is my first split patch. I was confused about the "0001" part in patch file names. Now I understand. I just recreated both patch files as v3: chaol@ChaodeMacBook-Air postgresql % git format-patch -v3 master v3-0001-GB18030-Switch-to-using-gb-18030-2000.ucm.patch v3-0002-Upgrade-GB18030-encoding-support-from-2000-to-202.patch Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-09-01T01:32:42Z
> On Aug 18, 2025, at 16:50, Chao Li <li.evan.chao@gmail.com> wrote: > > > <v3-0001-GB18030-Switch-to-using-gb-18030-2000.ucm.patch><v3-0002-Upgrade-GB18030-encoding-support-from-2000-to-202.patch> Hi John, Any follow up on this patch? Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/
-
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-09-10T06:38:40Z
On Mon, Aug 18, 2025 at 3:50 PM Chao Li <li.evan.chao@gmail.com> wrote: > This is my first spitted patch. I was confused about the "0001" part in patch file names. Now I understood. I just recreated the both patch files as v3: I've attached v4, in which I made some cosmetic changes to the perl script, mostly to make it resemble master more closely. These changes are separated out into a separate patch for visibility, but will be squashed in the final commit. Two things are worth calling out: - The URL at the top currently points to a directory in Github, but v3 changed it to point to the actual file. A directory can be navigated for inspection, so I used: 2000: https://github.com/unicode-org/icu-data/tree/main/charset/data/ucm 2022: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/ - I also made the regex a multiline regex for readability, even though the previous one was not. For 2022 version, I think it would be good to once run a test to verify that no mappings changed that we didn't expect. Perhaps the tests here can be used: https://www.postgresql.org/message-id/b9e3167f-f84b-7aa4-5738-be578a4db924%40iki.fi The upstream correction to the 2000 version is not present in our mappings, so we should mention that, unless it was reverted in or before 2022. In the documentation (charset.sgml), do we want to mention the version e.g. the following? <entry><literal>GB18030</literal></entry> -<entry>National Standard</entry> +<entry>National Standard, version 2022</entry> I've whacked around the commit messages, so those should be reviewed for accuracy. Your draft commit message had "9 characters are no longer required by the new standard, but are retained in this patch for compatibility" ...but those nine were introduced in the 2005 version, right? In which case it doesn't affect us. Please confirm. "Author: Zheng Tao <taoz@highgo.com>" -- I haven't seen any messages from this address in this thread, so could you confirm this was intentional? 
-- John Naylor Amazon Web Services
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-09-10T11:54:08Z
Hi John, Thank you very much for taking care of this patch. On Wed, Sep 10, 2025 at 14:38, John Naylor <johncnaylorls@gmail.com> wrote: > > - The URL at the top currently points to a directory in Github, but v3 > changed it to point to the actual file. A directory can be navigated > for inspection, so I used: > > 2000: > https://github.com/unicode-org/icu-data/tree/main/charset/data/ucm > > 2022: > https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/ > > Looks good. > - I also made the regex a multiline regex for readability, even though > the previous one was not. > > Thank you very much for polishing the Perl script. I am not a Perl expert; I can make the script work, but not perfectly. > For 2022 version, I think it would be good to once run a test to > verify that no mappings changed that we didn't expect. Perhaps the > tests here can be used: > > > https://www.postgresql.org/message-id/b9e3167f-f84b-7aa4-5738-be578a4db924%40iki.fi > > I have manually rerun the tests I had done before, and everything works as expected. I downloaded the tests from the referenced mail, but I cannot get them to run. Extracting the 2 patch files added src/test/encodings, but "make check" does not seem to run them. I tried copying the .out and .sql files to src/test/regress, but the tests still did not run. Did I miss anything? > The upstream correction to the 2000 version is not present in our > mappings, so we should mention that, unless it was reverted in or > before 2022. > I think the upstream correction to the 2000 version is just a few non-round-trip characters that we ignore. So I feel we don't need to mention them. > > In the documentation (charset.sgml), do we want to mention the version > e.g. the following? > > <entry><literal>GB18030</literal></entry> > -<entry>National Standard</entry> > +<entry>National Standard, version 2022</entry> > That's a good idea. 
I updated the SGML file. > > I've whacked around the commit messages, so those should be reviewed > for accuracy. > > Your draft commit message had "9 characters are no longer required by > the new standard, but are retained in this patch for compatibility" > ...but those nine were introduced in the 2005 version, right? In which > case it doesn't affect us. Please confirm. > I can't find any indication of whether the 9 characters were introduced in the 2005 version. But without this patch, they can be properly converted: ``` evantest=# SELECT encode(convert_from(decode('FD9D', 'hex'), 'GB18030')::bytea, 'hex'); encode -------- efa5b9 (1 row) ``` So they should already be available in the 2000 version. > > "Author: Zheng Tao <taoz@highgo.com>" -- I haven't seen any messages > from this address in this thread, so could you confirm this was > intentional? > > Yes, Zheng Tao is my colleague. He worked with me on this patch, so I want to credit him. I am attaching v5. The only change is in 0003, where I added the SGML change. Best regards, Chao Li (Evan) --------------------- HighGo Software Co., Ltd. https://www.highgo.com/ -
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-09-11T07:39:58Z
On Wed, Sep 10, 2025 at 6:54 PM Chao Li <li.evan.chao@gmail.com> wrote: > I downloaded the tests from the referenced mail, but I cannot make the tests to run. After extracting the 2 patch files, it added src/test/encodings, but "make check" seems to not run them. I tried to copy .out and .sql files to src/test/regress, but the tests still not running. Did I miss anything? Sorry, I'm not quite sure either how to get it to run like a normal test. I got it to show the result by doing psql -f src/test/encodings/sql/init.sql psql -f src/test/encodings/sql/gb18030.sql > patch.out diff -u src/test/encodings/expected/gb18030.out patch.out > v5-test.diff I've attached what I got with the v5 patches, renamed to avoid being picked up by CI. >> The upstream correction to the 2000 version is not present in our >> mappings, so we should mention that, unless it was reverted in or >> before 2022. > > > I think the upstream correction to the 2000 version is just a few not round-trip chars that are ignored by us. So I feel we don't need to mention them. This is the commit, and both of these are in the 2022 file as a round trip mapping. I don't see any mappings with non-zero flag in the 2000 file (in any upstream commit). https://github.com/unicode-org/icu-data/commit/91850aec0209fee0f91248731c9dd4b2b768d2b5 We should mention this correction for completeness. It seems to just move 'ḿ' out of the private use area. To be sure, likely almost no one will notice. >> Your draft commit message had "9 characters are no longer required by >> the new standard, but are retained in this patch for compatibility" >> ...but those nine were introduced in the 2005 version, right? In which >> case it doesn't affect us. Please confirm. > > > I don't find any hint about if the 9 characters were introduced in the 2005 version. Okay, I must have been confused by language "was included" in one of the linked references, which doesn't necessarily mean they were introduced there. 
The 66 new mappings required are not in the 2022 UCM file and we already cover them algorithmically in utf8_and_gb18030.c, so they already work without this patch (see below, the glyphs render on my OS but maybe not everyone can see them). The commit message needs to focus on what actually changed for users (I'll work on that). Related information should be an afterthought. # SELECT convert_from(decode('82358F33', 'hex'), 'GB18030'); convert_from -------------- 龦 (1 row) # SELECT convert_from(decode('82359636', 'hex'), 'GB18030'); convert_from -------------- 鿯 (1 row) While looking at utf8_and_gb18030.c, I see it refers to the XML file as the source of the algorithmic ranges. We'll want to keep some reference to the ranges independent of the XML file. I found https://htmlpreview.github.io/?https://github.com/unicode-org/icu-data/blob/main/charset/source/gb18030/gb18030.html ...which gives general info and mentions that U+10000 starts at GB+90308130, and also links to https://github.com/unicode-org/icu-data/blob/main/charset/source/gb18030/ranges.txt ...which has the same ranges we have below U+10000. Links can always disappear, but if the algorithmic ranges ever need to change (unlikely), we'll have new information about that. -- John Naylor Amazon Web Services -
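The arithmetic behind the algorithmic four-byte ranges mentioned above can be sketched in Python. This is a minimal illustration, not the code in utf8_and_gb18030.c: the function names are hypothetical, and the only anchor taken from the email is that U+10000 starts at GB+90308130. The byte ranges used (first and third byte 0x81..0xFE, second and fourth byte 0x30..0x39) are the standard four-byte GB18030 layout.

```python
def gb18030_linear(four_bytes: bytes) -> int:
    """Linear index of a four-byte GB18030 sequence.

    Bytes 1 and 3 range over 0x81..0xFE (126 values);
    bytes 2 and 4 range over 0x30..0x39 (10 values).
    """
    b1, b2, b3, b4 = four_bytes
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a four-byte GB18030 sequence")
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


# Anchor from the email: GB+90308130 corresponds to U+10000.
SUPPLEMENTARY_BASE = gb18030_linear(bytes.fromhex("90308130"))


def supplementary_code_point(four_bytes: bytes) -> int:
    """Code point for a four-byte sequence at or above GB+90308130."""
    return 0x10000 + gb18030_linear(four_bytes) - SUPPLEMENTARY_BASE


# Distance between the two byte sequences used in the queries above.
OFFSET = (gb18030_linear(bytes.fromhex("82359636"))
          - gb18030_linear(bytes.fromhex("82358F33")))
```

Within a single algorithmic range the mapping is linear, so `OFFSET` is also the distance in code points between the two characters returned by the queries, provided both fall in the same range.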
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-09-11T09:08:31Z
> On Sep 11, 2025, at 15:39, John Naylor <johncnaylorls@gmail.com> wrote: > > > On Wed, Sep 10, 2025 at 6:54 PM Chao Li <li.evan.chao@gmail.com <mailto:li.evan.chao@gmail.com>> wrote: > > > I downloaded the tests from the referenced mail, but I cannot make the tests to run. After extracting the 2 patch files, it added src/test/encodings, but "make check" seems to not run them. I tried to copy .out and .sql files to src/test/regress, but the tests still not running. Did I miss anything? > > Sorry, I'm not quite sure either how to get it to run like a normal test. I got it to show the result by doing > > psql -f src/test/encodings/sql/init.sql > psql -f src/test/encodings/sql/gb18030.sql > patch.out > diff -u src/test/encodings/expected/gb18030.out patch.out > v5-test.diff > > I've attached what I got with the v5 patches, renamed to avoid being picked up by CI. > > >> The upstream correction to the 2000 version is not present in our > >> mappings, so we should mention that, unless it was reverted in or > >> before 2022. > > > > > > I think the upstream correction to the 2000 version is just a few not round-trip chars that are ignored by us. So I feel we don't need to mention them. > > This is the commit, and both of these are in the 2022 file as a round trip mapping. I don't see any mappings with non-zero flag in the 2000 file (in any upstream commit). > > https://github.com/unicode-org/icu-data/commit/91850aec0209fee0f91248731c9dd4b2b768d2b5 I managed to get the encoding test to run. I didn’t find init.sql, so I had to manually create 3 functions on my own. But finally the test passed on the master branch. Then I switched to the patch branch, it got 21 different lines. 
After I updated the 18 known changes in the out file, only 3 lines differed: ``` - \x8135f437 | \xe1b8bf + \x8135f437 | \xee9f87 - \xa3a0 | \xee97a5 + \xa3a0 | character with byte sequence 0xa3 0xa0 in encoding "GB18030" has no equivalent in encoding "UTF8" - \xa8bc | \xee9f87 + \xa8bc | \xe1b8bf ``` Here, \x8135f437 and \xa8bc reflect the change referenced by the link above: \xA8BC used to map to unicode UE7C7, now \x8135f437 changed to map to UE7C7, and \xA8BC changed to map to U1E3F in version 2005. For \xa3a0, in 2022.ucm, it is not a round-trip mapping: ``` <U3000> \xA3\xA0 |3 <UE5E5> \xA3\xA0 |4 ``` So we ignored it. Then everything is clear. > > We should mention this correction for completeness. It seems to just move 'ḿ' out of the private use area. To be sure, likely almost no one will notice. > > >> Your draft commit message had "9 characters are no longer required by > >> the new standard, but are retained in this patch for compatibility" > >> ...but those nine were introduced in the 2005 version, right? In which > >> case it doesn't affect us. Please confirm. > > > > > > I don't find any hint about if the 9 characters were introduced in the 2005 version. > > Okay, I must have been confused by language "was included" in one of the linked references, which doesn't necessarily mean they were introduced there. > > The 66 new mappings required are not in the 2022 UCM file and we already cover them algorithmically in utf8_and_gb18030.c, so they already work without this patch (see below, the glyphs render on my OS but maybe not everyone can see them). The commit message needs to focus on what actually changed for users (I'll work on that). Related information should be an afterthought. 
> > # SELECT convert_from(decode('82358F33', 'hex'), 'GB18030'); > convert_from > -------------- > 龦 > (1 row) > > # SELECT convert_from(decode('82359636', 'hex'), 'GB18030'); > convert_from > -------------- > 鿯 > (1 row) > > While looking at utf8_and_gb18030.c, I see it refers to the XML file as the source of the algorithmic ranges. We'll want to keep some reference to the ranges independent of the XML file. I found > > https://htmlpreview.github.io/?https://github.com/unicode-org/icu-data/blob/main/charset/source/gb18030/gb18030.html > > ...which gives general info and mentions that U+10000 starts at GB+90308130, and also links to > > https://github.com/unicode-org/icu-data/blob/main/charset/source/gb18030/ranges.txt > > ...which has the same ranges we have below U+10000. Links can always disappear, but if the algorithmic ranges ever need to change (unlikely), we'll have new information about that. > > I will post v6 soon with an updated commit message. By the way, here is how I made the test work: 1. I copied gb18030.sql and gb18030.out to src/test/regress, under the sql and expected subfolders. 2. In src/test/regress/parallel_schedule, I added a line "test: gb18030". 3. Then "make check" runs the gb18030 test. Attached are my updated sql and out files. To test on the master branch, use the original out file; to test with the patch, use my updated out file, which will fail with the 3 different lines I mentioned above. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/ -
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-09-12T01:57:37Z
On Thu, Sep 11, 2025 at 17:08, Chao Li <li.evan.chao@gmail.com> wrote: > > > I will post v6 soon with updated commit message. > > I am attaching the v6 patch set: * Updated 0003's commit message. * In 0003, updated a function comment in utf8_and_gb18030.c to address John's comment about the reference to the XML file. Best regards, Chao Li (Evan) --------------------- HighGo Software Co., Ltd. https://www.highgo.com/
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-09-12T02:12:17Z
On Fri, Sep 12, 2025 at 9:57 AM Chao Li <li.evan.chao@gmail.com> wrote: > > > I am attaching the v6 patch set: > > * Updated 0003's commit comment. > * In 0003, updated a function comment in utf8_and_gb18030.c to address > John's comment about reference to the xml file. > > Best regards, > Chao Li (Evan) > --------------------- > HighGo Software Co., Ltd. > https://www.highgo.com/ > CF requested a rebase, so v7 is just a rebased version. Best regards, Chao Li (Evan) --------------------- HighGo Software Co., Ltd. https://www.highgo.com/
-
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-09-16T09:21:04Z
On Thu, Sep 11, 2025 at 4:09 PM Chao Li <li.evan.chao@gmail.com> wrote: > Then I switched to the patch branch, it got 21 different lines. After I updated the 18 known changes in the out file, then it got only 3 different lines: > > ``` > - \x8135f437 | \xe1b8bf > + \x8135f437 | \xee9f87 > > - \xa3a0 | \xee97a5 > + \xa3a0 | character with byte sequence 0xa3 0xa0 in encoding "GB18030" has no equivalent in encoding “UTF8" > > - \xa8bc | \xee9f87 > + \xa8bc | \xe1b8bf > ``` > > Where, \x8135f437 and \xa8bc reflect to the change pointed by above link: > > \xA8BC used to map to unicode UE7C7, now \x8135f437 changed to map to UE7C7, and \xA8BC changed to map to U1E3F in version 2005. Maybe we can phrase it like this: ``` There have been two corrections to the 2000 version that were carried forward to later versions. The following mappings were previously swapped: U+E7C7 (Private Use Area) now maps to \x8135f437 U+1E3F (Latin Small Letter M with Acute) now maps to \xA8BC ``` > For \xa3a0, in 2022.ucm, it is a not a roundtrip mapping: > > ``` > <U3000> \xA3\xA0 |3 > <UE5E5> \xA3\xA0 |4 > ``` > > So we ignored it. Then everything is clear. Yes, I see this in the file, but it's not described in any of the documents about the 2022 version, although they mention other cases regarding the Private Use Area. I'm not sure we need to worry too much, but we need to describe the behavior changes, maybe like this: ``` Previously, U+E5E5 (Private Use Area) was mapped to \xA3A0. This code point now maps to \x65356535. Attempting to convert \xA3A0 will now raise an error. ``` I'm open to suggestions. -- John Naylor Amazon Web Services
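The test output quoted above shows UTF-8 bytes in hex, while the proposed wording uses code points. A small Python sketch for translating between the two follows; the helper name `code_point_label` is hypothetical, and the three hex strings are the ones that appear in the diff.

```python
def code_point_label(utf8_hex: str) -> str:
    """Turn the hex of a single UTF-8 encoded character into 'U+XXXX'."""
    text = bytes.fromhex(utf8_hex).decode("utf-8")
    if len(text) != 1:
        raise ValueError("expected exactly one character")
    return f"U+{ord(text):04X}"


# UTF-8 byte strings taken from the regression diff in this thread.
LABELS = {h: code_point_label(h) for h in ("e1b8bf", "ee9f87", "ee97a5")}
```

This gives U+1E3F for `e1b8bf`, U+E7C7 for `ee9f87`, and U+E5E5 for `ee97a5`, matching the code points named in the proposed commit message text.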
-
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-09-16T09:36:02Z
On Fri, Sep 12, 2025 at 8:57 AM Chao Li <li.evan.chao@gmail.com> wrote: > * In 0003, updated a function comment in utf8_and_gb18030.c to address John's comment about reference to the xml file. Thanks, but the entire point of that comment change was to remove the reference to the XML file, yet it didn't actually do that. Also, the words in my email were to explain to you what should go there and why. That doesn't mean those words belong in the comment. The comment change seems like it belongs in the preparatory commit anyway, so I put the links there and pushed 0001 (along with the squashed 0002). -- John Naylor Amazon Web Services
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-09-17T02:08:28Z
Hi John, On Sep 16, 2025, at 17:36, John Naylor <johncnaylorls@gmail.com> wrote: The comment change seems like it belongs in the preparatory commit anyway, so I put the links there and pushed 0001 (along with the squashed 0002). Thank you very much for pushing 0001. I see you have updated the function comment in utf8_and_gb18030.c, so I removed it from the v8 patch. Attached is the v8 patch: * Updated the commit comment by taking your wording * Removed the change of utf8_and_gb18030.c Please take a look again, and thanks for your patience. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/
-
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-09-18T07:59:32Z
On Wed, Sep 17, 2025 at 9:08 AM Chao Li <li.evan.chao@gmail.com> wrote: > I see you have updated the function comment in utf8_and_gb18030.c, so I removed it from the v8 patch. > > Attached is the v8 patch: I've reworked the commit message I started in v5 to incorporate later discussions. (I was not a fan of including a complete table there, nor of using UTF-8 encoding instead of code points as a reference.) The only change I made for v9 is to reword the regression test addition from "upgrades" to "change". I'm planning to commit next week unless there are objections. (If anyone otherwise busy with the PG18 release wants a chance to weigh in, let me know and I'll hold off). It'll be a good idea to communicate how to detect (unlikely but not impossible) incompatibilities for existing systems, but I don't think committing needs to wait for that piece. -- John Naylor Amazon Web Services
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-09-18T08:16:05Z
Hi John, Thanks for working on v9. > On Sep 18, 2025, at 15:59, John Naylor <johncnaylorls@gmail.com> wrote: > > > It'll be a good idea to communicate how to detect (unlikely but not > impossible) incompatibilities for existing systems, but I don't think > committing needs to wait for that piece. > > -- > John Naylor > Amazon Web Services > <v9-0001-Update-GB18030-encoding-from-version-2000-to-2022.patch> V9 looks good to me. I am absolutely fine with removing the table of mapping changes. When you say “communicate how to detect incompatibility for existing systems”, what would be the communication channel? I am actually very new to the PG development community, your guidance will be greatly appreciated. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/
-
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-09-18T08:53:08Z
On Thu, Sep 18, 2025 at 3:16 PM Chao Li <li.evan.chao@gmail.com> wrote: > > When you say “communicate how to detect incompatibility for existing systems”, what would be the communication channel? I am actually very new to the PG development community, your guidance will be greatly appreciated. My first thought was to include a sample query in the release notes that filters on text with the affected code points, but I'd be happy to hear other ideas. We start working on release notes around April/May. -- John Naylor Amazon Web Services
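Editor's sketch of the detection idea mentioned above. This is not the query proposed for the release notes, only a minimal client-side illustration in Python, assuming the text values have already been fetched as Unicode strings. The function name `find_affected` and the contents of `AFFECTED_CODE_POINTS` are hypothetical placeholders; the real list of code points whose mapping changed would have to be taken from the commit message.

```python
from __future__ import annotations

from typing import Iterable

# Placeholder values only (assumption): substitute the code points whose
# GB18030 mapping actually changed between the 2000 and 2022 versions.
AFFECTED_CODE_POINTS = frozenset({0xE000, 0xE001})


def find_affected(
    values: Iterable[str],
    affected: frozenset[int] = AFFECTED_CODE_POINTS,
) -> list[tuple[int, str, list[str]]]:
    """Return (index, value, matching code points) for each value that
    contains at least one of the given code points."""
    hits = []
    for index, value in enumerate(values):
        # Intersect the code points present in the value with the watch list.
        found = sorted({ord(ch) for ch in value} & affected)
        if found:
            hits.append((index, value, [f"U+{cp:04X}" for cp in found]))
    return hits
```

A server-side query would apply the same filter directly to the text columns; the sketch only shows the shape of the check.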
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-09-18T09:44:43Z
> On Sep 18, 2025, at 16:53, John Naylor <johncnaylorls@gmail.com> wrote: > > On Thu, Sep 18, 2025 at 3:16 PM Chao Li <li.evan.chao@gmail.com> wrote: >> >> When you say “communicate how to detect incompatibility for existing systems”, what would be the communication channel? I am actually very new to the PG development community, your guidance will be greatly appreciated. > > My first thought was to include a sample query in the release notes > that filters on text with the affected code points, but I'd be happy > to hear other ideas. We start working on release notes around > April/May. > So, no immediate action to take, right? I may work out such a query before starting of release note work. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/
-
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-09-24T06:42:37Z
On Thu, Sep 18, 2025 at 2:59 PM John Naylor <johncnaylorls@gmail.com> wrote: > The only change I made for v9 is to reword the regression test > addition from "upgrades" to "change". I'm planning to commit next week > unless there are objections. (If anyone otherwise busy with the PG18 > release wants a chance to weigh in, let me know and I'll hold off). Pushed. On Thu, Sep 18, 2025 at 4:45 PM Chao Li <li.evan.chao@gmail.com> wrote: > So, no immediate action to take, right? I may work out such a query before starting of release note work. Sounds good. Were you also interested in seeing if EUC_CN can use the same UCM file? That would allow us to get rid of the XML file. -- John Naylor Amazon Web Services
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-09-24T07:04:07Z
> On Sep 24, 2025, at 14:42, John Naylor <johncnaylorls@gmail.com> wrote: > > > Sounds good. Were you also interested in seeing if EUC_CN can use the > same UCM file? That would allow us to get rid of the XML file. > Sure, let me take a look. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-09-24T09:18:40Z
On Sep 24, 2025, at 15:04, Chao Li <li.evan.chao@gmail.com> wrote: On Sep 24, 2025, at 14:42, John Naylor <johncnaylorls@gmail.com> wrote: Sounds good. Were you also interested in seeing if EUC_CN can use the same UCM file? That would allow us to get rid of the XML file. Sure, let me take a look. I found that both EUC_CN and UHC use the same XML file, so I updated both. I didn’t delete gb-18030-2000.xml in this patch, because it would make the patch file very large, you can just add the deletion to the commit when you push it. Basically, the changes are all borrowed from the previous commit. With this patch, regenerating the maps file lead to no map file change, which is expected: ``` % make utf8_to_uhc.map utf8_to_euc_cn.map '/usr/bin/perl' -I . UCS_to_UHC.pl - Writing UTF8=>UHC conversion table: utf8_to_uhc.map - Writing UHC=>UTF8 conversion table: uhc_to_utf8.map '/usr/bin/perl' -I . UCS_to_EUC_CN.pl - Writing UTF8=>EUC_CN conversion table: utf8_to_euc_cn.map - Writing EUC_CN=>UTF8 conversion table: euc_cn_to_utf8.map % git diff # no map file change % ``` I am not sure if you should also upgrade the UCM file to 2022 version, but if we need, let’s do it with a separate commit. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-09-24T09:31:39Z
On Wed, Sep 24, 2025 at 5:18 PM Chao Li <li.evan.chao@gmail.com> wrote: > > On Sep 24, 2025, at 15:04, Chao Li <li.evan.chao@gmail.com> wrote: > > On Sep 24, 2025, at 14:42, John Naylor <johncnaylorls@gmail.com> wrote: > > Sounds good. Were you also interested in seeing if EUC_CN can use the > same UCM file? That would allow us to get rid of the XML file. > > > Sure, let me take a look. > > > I found that both EUC_CN and UHC use the same XML file, so I updated both. > > I didn’t delete gb-18030-2000.xml in this patch, because it would make the > patch file very large, you can just add the deletion to the commit when you > push it. > > Basically, the changes are all borrowed from the previous commit. With > this patch, regenerating the maps file lead to no map file change, which is > expected: > > ``` > % make utf8_to_uhc.map utf8_to_euc_cn.map > '/usr/bin/perl' -I . UCS_to_UHC.pl > - Writing UTF8=>UHC conversion table: utf8_to_uhc.map > - Writing UHC=>UTF8 conversion table: uhc_to_utf8.map > '/usr/bin/perl' -I . UCS_to_EUC_CN.pl > - Writing UTF8=>EUC_CN conversion table: utf8_to_euc_cn.map > - Writing EUC_CN=>UTF8 conversion table: euc_cn_to_utf8.map > > % git diff # no map file change > % > ``` > > I am not sure if you should also upgrade the UCM file to 2022 version, but > if we need, let’s do it with a separate commit. > > I included deletion of the xml file in v2, which will help confirm that build will pass clearly. I realized that the patch files were huge because of the map file changes. Best regards, Chao Li (Evan) --------------------- HighGo Software Co., Ltd. https://www.highgo.com/
-
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-09-29T04:03:09Z
On Wed, Sep 24, 2025 at 4:18 PM Chao Li <li.evan.chao@gmail.com> wrote: > I am not sure if you should also upgrade the UCM file to 2022 version, but if we need, let’s do it with a separate commit. If they can all use the same file, we should just do that for the sake of simplicity, in which case a separate commit is just extra noise. -- John Naylor Amazon Web Services
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-09-29T08:19:48Z
On Mon, Sep 29, 2025 at 12:03 PM John Naylor <johncnaylorls@gmail.com> wrote: > On Wed, Sep 24, 2025 at 4:18 PM Chao Li <li.evan.chao@gmail.com> wrote: > > I am not sure if you should also upgrade the UCM file to 2022 version, > but if we need, let’s do it with a separate commit. > > If they can all use the same file, we should just do that for the sake > of simplicity, in which case a separate commit is just extra noise. > > In v3, I have updated EUC_CN to use gb18030-2022.ucm. Fortunately, the map files are unchanged, so we don't have to do much testing for EUC_CN. For UHC, in the icu master branch https://github.com/unicode-org/icu/tree/main/icu4c/source/data/mappings, there is still windows-949-2000.ucm, thus only download URL is changed, file content is unchanged. ``` % make utf8_to_uhc.map utf8_to_euc_cn.map wget -O windows-949-2000.ucm --no-use-server-timestamps https://raw.githubusercontent.com/unicode-org/icu/refs/heads/main/icu4c/source/data/mappings/windows-949-2000.ucm --2025-09-29 16:00:40-- https://raw.githubusercontent.com/unicode-org/icu/refs/heads/main/icu4c/source/data/mappings/windows-949-2000.ucm HTTP request sent, awaiting response... 200 OK Length: 356253 (348K) [text/plain] Saving to: ‘windows-949-2000.ucm’ windows-949-2000.ucm 100%[=========================================================================================================>] 347.90K 222KB/s in 1.6s 2025-09-29 16:00:43 (222 KB/s) - ‘windows-949-2000.ucm’ saved [356253/356253] '/usr/bin/perl' -I . UCS_to_UHC.pl - Writing UTF8=>UHC conversion table: utf8_to_uhc.map - Writing UHC=>UTF8 conversion table: uhc_to_utf8.map wget -O gb18030-2022.ucm --no-use-server-timestamps https://raw.githubusercontent.com/unicode-org/icu/refs/heads/main/icu4c/source/data/mappings/gb18030-2022.ucm --2025-09-29 16:00:43-- https://raw.githubusercontent.com/unicode-org/icu/refs/heads/main/icu4c/source/data/mappings/gb18030-2022.ucm HTTP request sent, awaiting response... 200 OK Length: 675312 (659K) [text/plain] Saving to: ‘gb18030-2022.ucm’ gb18030-2022.ucm 100%[=========================================================================================================>] 659.48K 1.33MB/s in 0.5s 2025-09-29 16:00:44 (1.33 MB/s) - ‘gb18030-2022.ucm’ saved [675312/675312] '/usr/bin/perl' -I . UCS_to_EUC_CN.pl - Writing UTF8=>EUC_CN conversion table: utf8_to_euc_cn.map - Writing EUC_CN=>UTF8 conversion table: euc_cn_to_utf8.map % git diff % ``` Please note, I didn't include the deletion of gb-18030-2000.xml in v3, because that will cause the patch file to be too big, thus requiring an approval process for the email to land in the Mail Archive. Please delete the xml file when you push the commit. Best regards, Chao Li (Evan) --------------------- HighGo Software Co., Ltd. https://www.highgo.com/
-
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-09-29T09:32:15Z
On Wed, Sep 24, 2025 at 4:18 PM Chao Li <li.evan.chao@gmail.com> wrote: > > I found that both EUC_CN and UHC use the same XML file, so I updated both. When you say "same file", that implies to me the file we have checked in our repo. They have different names and the UHC file is downloaded on demand, so it doesn't seem like we need to change UHC at all to delete gb-18030-2000.xml. Is that right? -- John Naylor Amazon Web Services
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-09-29T10:36:27Z
> On Sep 29, 2025, at 17:32, John Naylor <johncnaylorls@gmail.com> wrote: > > On Wed, Sep 24, 2025 at 4:18 PM Chao Li <li.evan.chao@gmail.com> wrote: >> >> I found that both EUC_CN and UHC use the same XML file, so I updated both. > > When you say "same file", that implies to me the file we have checked > in our repo. They have different names and the UHC file is downloaded > on demand, so it doesn't seem like we need to change UHC at all to > delete gb-18030-2000.xml. Is that right? > > -- > John Naylor > Amazon Web Services “same file" was a mistake. windows-949-2000.ucm is a different file from gb-18030-2000(2022).ucm. In theory, we don’t need to change UHC if our goal is to delete gb-18030-2000.xml. However, as you can see, with switching to use ucm, UHC, EUC_CN and GB18030 now share the same download URL in the Makefile, and their perl scripts use the same logic to process UCM files, so I think it would be good for maintenance. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/
-
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-09-30T06:05:42Z
On Mon, Sep 29, 2025 at 5:36 PM Chao Li <li.evan.chao@gmail.com> wrote: > “same file" was a mistake. windows-949-2000.ucm is a different file from gb-18030-2000(2022).ucm. > > In theory, we don’t need to change UHC if our goal is to delete gb-18030-2000.xml. That was my goal, yes. Let's stay focused on that and not change unrelated things. -- John Naylor Amazon Web Services
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-09-30T06:31:24Z
On Tue, Sep 30, 2025 at 2:05 PM John Naylor <johncnaylorls@gmail.com> wrote: > On Mon, Sep 29, 2025 at 5:36 PM Chao Li <li.evan.chao@gmail.com> wrote: > > “same file" was a mistake. windows-949-2000.ucm is a different file from > gb-18030-2000(2022).ucm. > > > > In theory, we don’t need to change UHC if our goal is to delete > gb-18030-2000.xml. > > That was my goal, yes. Let's stay focused on that and not change > unrelated things. > > Sure, no problem. Please see the attached v4, I reverted UHC change from v3. Again, please "git rm" the xml file when you push the commit. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/
-
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-10-02T05:44:21Z
On Tue, Sep 30, 2025 at 1:31 PM Chao Li <li.evan.chao@gmail.com> wrote: > Sure, no problem. Please see the attached v4, I reverted UHC change from v3. Again, please "git rm" the xml file when you push the commit. Thanks, pushed after correcting the file name in the perl script comment. I've marked the CF entry committed. -- John Naylor Amazon Web Services
-
Re: GB18030-2022 Support in PostgreSQL
Chao Li <li.evan.chao@gmail.com> — 2025-10-03T05:12:29Z
Hi John, Thank you again very much for your support. > On Oct 2, 2025, at 13:44, John Naylor <johncnaylorls@gmail.com> wrote: > > > Thanks, pushed after correcting the file name in the perl script > comment. I've marked the CF entry committed. > So the work for GB18030 is done. I just want to check with you on two more items: * Do we want to switch UHC from using xml to ucm? That would not lead to map file change, instead it just removes the code of parsing xml file, making future maintenance easier. * For the makefile changes: https://commitfest.postgresql.org/patch/5953/. Say the ucm file has some changes: make currently rebuilds only the map files, and even if the map files are regenerated with differences, the corresponding .o files are not automatically rebuilt. I encountered this problem when I started to work on the gb18030 task. I made the change, but because of this problem, the postgresql binary was not actually rebuilt to include my change, which led to confusion and wasted time. Please let me know. Your guidance is greatly appreciated. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/
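Editor's sketch of the staleness problem described in the second item. It is not part of the Makefile patch in the linked commitfest entry; it is only a minimal Python illustration of the check that a build dependency would perform, comparing modification times. The function name `stale_targets` and any paths passed to it are hypothetical.

```python
from __future__ import annotations

import os


def stale_targets(deps: dict[str, list[str]]) -> list[str]:
    """Return the targets that are missing or older than any of their
    existing inputs, in the order they appear in deps."""
    stale = []
    for target, inputs in deps.items():
        if not os.path.exists(target):
            # A target that was never built always needs building.
            stale.append(target)
            continue
        target_mtime = os.path.getmtime(target)
        # An input modified after the target means the target is out of date.
        if any(
            os.path.exists(path) and os.path.getmtime(path) > target_mtime
            for path in inputs
        ):
            stale.append(target)
    return stale
```

Called with a mapping from each object file to the map files it includes, this would list the object files that a regenerated map file has left out of date.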
-
Re: GB18030-2022 Support in PostgreSQL
John Naylor <johncnaylorls@gmail.com> — 2025-10-03T06:17:14Z
On Fri, Oct 3, 2025 at 12:12 PM Chao Li <li.evan.chao@gmail.com> wrote: > > * Do we want to switch UHC from using xml to ucm? That would not lead to map file change, instead it just removes the code of parsing xml file, making future maintenance easier. I seriously doubt there will be any future maintenance, in which case doing anything is worse than doing nothing. As for the other CF entry, that's a separate email thread, and I've already said all I want to say there. -- John Naylor Amazon Web Services