Thread

Commits

  1. Generate EUC_CN mappings from gb18030-2022.ucm

  2. Update GB18030 encoding from version 2000 to 2022

  3. Generate GB18030 mappings from the Unicode Consortium's UCM file

  1. GB18030-2022 Support in PostgreSQL

    jiaoshuntian@highgo.com — 2025-08-04T08:08:24Z

    Hi hackers,
    
    I noticed that PostgreSQL currently supports GB18030 encoding based on the older GB18030-2000 standard (as seen in commits like extend GB18030 conversion). However, China has since updated its mandatory character set standard to GB18030-2022, which includes additional characters and stricter compliance requirements. GB18030-2022 is now the official standard in China, and ensuring PostgreSQL’s full compliance would be beneficial for users in Chinese-speaking regions.
    
    I would like to ask:
    
    Are there any plans to upgrade PostgreSQL’s GB18030 support to the 2022 version? Would the community be open to contributions in this area?
    
    Best regards,
    
    
    
    
    JiaoShuntian
    
    HighGo Inc.
  2. Re: GB18030-2022 Support in PostgreSQL

    jiaoshuntian@highgo.com — 2025-08-04T09:27:15Z

    > I would like to ask:
    
    > 
    
    > Are there any plans to upgrade PostgreSQL’s GB18030 support to the 2022 version? Would the community be open to contributions in this area?
    
    
    
    I think we only need to update the Perl script and the map file to complete this task.
    
    
    JiaoShuntian
    HighGo Inc.
  3. Re: GB18030-2022 Support in PostgreSQL

    wenhui qiu <qiuwenhuifx@gmail.com> — 2025-08-04T09:34:48Z

    Hi
        😂 Not long ago, many people were rushing to remove this character set
    because of a security vulnerability. I was honestly quite shocked when I
    saw it.
    
    
    Thanks
    
    
  4. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-08-04T10:35:02Z

    On Mon, Aug 4, 2025 at 3:08 PM JiaoShuntian <jiaoshuntian@highgo.com> wrote:
    > I noticed that PostgreSQL currently supports GB18030 encoding based on the older GB18030-2000 standard (as seen in commits like extend GB18030 conversion). However, China has since updated its mandatory character set standard to GB18030-2022, which includes additional characters and stricter compliance requirements. GB18030-2022 is now the official standard in China, and ensuring PostgreSQL’s full compliance would be beneficial for users in Chinese-speaking regions.
    
    This is a non-backwards-compatible change:
    
    https://www.unicode.org/L2/L2022/22274-disruptive-changes.pdf
    https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf
    
    There is a risk of breaking applications, although only a few dozen
    mappings changed. If it were added as a separate encoding, users could
    opt in.
    
    --
    John Naylor
    Amazon Web Services
    
    
    
    
  5. Re: GB18030-2022 Support in PostgreSQL

    Andrew Dunstan <andrew@dunslane.net> — 2025-08-04T12:33:00Z

    On 2025-08-04 Mo 6:35 AM, John Naylor wrote:
    > On Mon, Aug 4, 2025 at 3:08 PM JiaoShuntian <jiaoshuntian@highgo.com> wrote:
    >> I noticed that PostgreSQL currently supports GB18030 encoding based on the older GB18030-2000 standard (as seen in commits like extend GB18030 conversion). However, China has since updated its mandatory character set standard to GB18030-2022, which includes additional characters and stricter compliance requirements. GB18030-2022 is now the official standard in China, and ensuring PostgreSQL’s full compliance would be beneficial for users in Chinese-speaking regions.
    > This is a non-backwards-compatible change:
    >
    > https://www.unicode.org/L2/L2022/22274-disruptive-changes.pdf
    > https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf
    >
    > There is a risk of breaking applications, although only a few dozen
    > mappings changed. If it were added as a separate encoding, users could
    > opt in.
    >
    
    That makes sense ... naming the new encoding so as to avoid confusion 
    might be a challenge.
    
    
    cheers
    
    
    andrew
    
    
    --
    Andrew Dunstan
    EDB: https://www.enterprisedb.com
    
    
    
    
    
  6. Re: GB18030-2022 Support in PostgreSQL

    Tom Lane <tgl@sss.pgh.pa.us> — 2025-08-04T13:51:01Z

    Andrew Dunstan <andrew@dunslane.net> writes:
    > On 2025-08-04 Mo 6:35 AM, John Naylor wrote:
    >> There is a risk of breaking applications, although only a few dozen
    >> mappings changed. If it were added as a separate encoding, users could
    >> opt in.
    
    > That makes sense ... naming the new encoding so as to avoid confusion 
    > might be a challenge.
    
    We have precedent for that in SHIFT_JIS_2004.  Presumably if we
    make this a new encoding, it'd be GB18030_2022.
    
    However, adding a new encoding ID is not without breakage risks
    of its own, stemming from some code knowing the new ID and others
    not.  I recall that we had some actual problems of that ilk when
    we added SHIFT_JIS_2004, and some of them were pretty subtle.
    See e.g. this comment from src/bin/initdb/Makefile:
    
    # Note: it's important that we link to encnames.o from libpgcommon, not
    # from libpq, else we have risks of version skew if we run with a libpq
    # shared library from a different PG version.  Define
    # USE_PRIVATE_ENCODING_FUNCS to ensure that that happens.
    
    That was long enough ago that I have little faith either that that
    fix still does what it intended to (the code has been rejiggered
    significantly since the issue was last battle-tested), or that
    there are not similar hazards elsewhere.
    
    So on the whole I'd lean a bit towards just redefining GB18030 as
    meaning the new standard.  The fact that we don't support it as a
    server-side encoding perhaps makes that idea more tenable than it
    would be if the encoding governed the interpretation of our own
    stored data.
    
    			regards, tom lane
    
    
    
    
  7. Re: GB18030-2022 Support in PostgreSQL

    Kenneth Marshall <ktm@rice.edu> — 2025-08-04T16:55:07Z

    On Mon, Aug 04, 2025 at 04:08:24PM +0800, JiaoShuntian wrote:
    > Hi hackers,
    > 
    > I noticed that PostgreSQL currently supports GB18030 encoding based on the older GB18030-2000 standard (as seen in commits like extend GB18030 conversion). However, China has since updated its mandatory character set standard to GB18030-2022, which includes additional characters and stricter compliance requirements. GB18030-2022 is now the official standard in China, and ensuring PostgreSQL’s full compliance would be beneficial for users in Chinese-speaking regions.
    > 
    > I would like to ask:
    > 
    > Are there any plans to upgrade PostgreSQL’s GB18030 support to the 2022 version? Would the community be open to contributions in this area?
    > 
    > Best regards,
    > 
    > JiaoShuntian
    > HighGo Inc.
    
    Hi,
    
    I believe that it is in ICU already. You should be able to use that as
    your locale provider.
    
    Regards,
    Ken
    
    
    
    
  8. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-08-05T06:22:18Z

    
    > On Aug 4, 2025, at 21:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > 
    > 
    > So on the whole I'd lean a bit towards just redefining GB18030 as
    > meaning the new standard.  The fact that we don't support it as a
    > server-side encoding perhaps makes that idea more tenable than it
    > would be if the encoding governed the interpretation of our own
    > stored data.
    > 
    > 			regards, tom lane
    > 
    
    I agree with Tom that we may just redefine GB18030 to comply with the 2022 standard.
    
    As John Naylor pointed out, the 2022 revision is not backward compatible. However, I went through all the incompatible changes, and they all involve rarely used characters. So I would guess that most existing databases won’t be impacted, and the rest that use GB18030 will need a data migration before upgrading to a PG version that switches to GB18030-2022. I think PG may delegate data migration tasks to third-party PG service vendors, who may develop simple or complex migration tools for different use cases.
    
    One use case I am thinking of: say a database uses the default encoding (UTF-8) and the ICU locale provider. ICU has supported GB18030-2022 since version 73.1. If the database worked with a pre-73.1 version of ICU and ICU is then upgraded to 73.1 or later, the database may face the same backward compatibility risk. For example, the GB code 0xA6D9 maps to U+E78D under the old GB18030 mapping and to U+FE10 under GB18030-2022. If a character 0xA6D9 was given to the database, it was stored as U+E78D on disk. After the ICU upgrade, U+E78D would no longer be treated as “0xA6D9” by ICU. So to keep the data’s original meaning, a data migration has to be done to update U+E78D to U+FE10. In this example the PG version does not change, but the database still needs a data migration.
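
The mapping change in this example can be sketched in a few lines. This is only an illustration, not PostgreSQL or ICU code: the two dictionaries and both helper functions are hypothetical stand-ins that hold nothing but the single mapping for byte sequence 0xA6D9 mentioned above.

```python
# Hypothetical one-entry excerpts of the two mapping tables (illustration only).
GB18030_2000 = {b"\xa6\xd9": "\ue78d"}  # old: private-use code point U+E78D
GB18030_2022 = {b"\xa6\xd9": "\ufe10"}  # new: U+FE10


def decode(table, raw):
    """Look up a two-byte GB18030 sequence in the given table."""
    return table[raw]


def encode(table, ch):
    """Reverse lookup; returns None when the table has no round-trip entry."""
    for raw, mapped in table.items():
        if mapped == ch:
            return raw
    return None


stored = decode(GB18030_2000, b"\xa6\xd9")    # value written before the switch
round_trip = encode(GB18030_2022, stored)     # None in this sketch: no entry for U+E78D
migrated = decode(GB18030_2022, b"\xa6\xd9")  # value a migration would store instead
```

In this toy model the stored value no longer maps back to 0xA6D9 once the new table is in use, which is the situation a data migration would have to address.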
    
    The other reason I don’t think a new encoding GB18030_2022 is needed is that, as GB18030-2022 is a hard requirement from the government, most likely all commercial databases must comply with it. Thus many current databases using GB18030 must be migrated to GB18030_2022. As PG doesn’t support changing a database’s encoding, if a new encoding is added, an existing db must be migrated to a new db. If we only redefine GB18030, existing databases need only some data migration, which should be easier.
    
    So, I think PG doesn’t need to worry about the backward compatibility problem too much. All PG needs to do is state clearly in the release notes that a data migration might be required. When the new version is released, if some third-party migration tools are known to work well, the release notes may recommend them.
    
    Regards,
    
    Chao Li (Evan)
    ------------------------------
    HighGo Infra. Software Inc.
    https://www.highgo.com/
  9. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-08-05T10:25:27Z

    On Tue, Aug 5, 2025 at 1:22 PM Chao Li <li.evan.chao@gmail.com> wrote:
    >
    > On Aug 4, 2025, at 21:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >
    > So on the whole I'd lean a bit towards just redefining GB18030 as
    > meaning the new standard.  The fact that we don't support it as a
    > server-side encoding perhaps makes that idea more tenable than it
    > would be if the encoding governed the interpretation of our own
    > stored data.
    
    > I agree with Tom that we may just redefine GB18030 to comply with the 2022 standard.
    >
    > As John Naylor pointed out, 2022 is not backward compatible, that is true. However, I went through all the incompatible changes, those are all characters rarely used.
    
    If that's the case then redefining is probably okay.
    
    > One use case I am thinking is that, say a database uses default encoding (UTF-8) and ICU locale provider. As ICU started to support GB18030-2022 since version 73.1.
    
    ICU locales can only be used with server-side encodings.
    
    > At the time when the new version is released, if some third party migration tools are known working fine, the release note may recommend the tools.
    
    I highly doubt such a large hammer will be necessary. Whatever advice
    we give for discovery and conversion of affected text is our
    responsibility and can be in the form of example queries.
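
One possible shape for such advice, sketched here in Python rather than as SQL queries. It assumes the 18 changed mappings are exactly the ones that appear in the 2000-to-2022 .ucm diff posted later in this thread, with old and new code points paired by their shared GB18030 byte sequence. The table name and both function names are hypothetical.

```python
# Old private-use code point -> code point assigned to the same GB18030 bytes
# in the 2022 mapping, paired by byte sequence from the .ucm diff in this thread.
CHANGED = {
    0xE78D: 0xFE10, 0xE78E: 0xFE12, 0xE78F: 0xFE11, 0xE790: 0xFE13,
    0xE791: 0xFE14, 0xE792: 0xFE15, 0xE793: 0xFE16, 0xE794: 0xFE17,
    0xE795: 0xFE18, 0xE796: 0xFE19,
    0xE81E: 0x9FB4, 0xE826: 0x9FB5, 0xE82B: 0x9FB6, 0xE82C: 0x9FB7,
    0xE832: 0x9FB8, 0xE843: 0x9FB9, 0xE854: 0x9FBA, 0xE864: 0x9FBB,
}


def find_affected(text):
    """Return (index, code point) for every character whose mapping changed."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if ord(ch) in CHANGED]


def convert_affected(text):
    """Replace each affected private-use character with its 2022 counterpart."""
    return text.translate(CHANGED)
```

For example, `find_affected("a\ue78db")` reports one hit at index 1. Whether a private-use character found this way really came from GB18030 input is something only the data owner can judge, so a discovery pass before any conversion seems prudent.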
    
    --
    John Naylor
    Amazon Web Services
    
    
    
    
  10. Re: GB18030-2022 Support in PostgreSQL

    Peter Eisentraut <peter@eisentraut.org> — 2025-08-06T10:29:15Z

    On 05.08.25 08:22, Chao Li wrote:
    > I agree with Tom that we may just redefine GB18030 to comply with the 
    > 2022 standard.
    > 
    > As John Naylor pointed out, 2022 is not backward compatible, that is true. 
    > However, I went through all the incompatible changes, those are all 
    > characters rarely used. So I would guess most of the existing databases 
    > won’t be impacted and the rest with encoding GB18030 need to do data 
    > migration before upgrading to a PG version that switches to 
    > GB18030-2022. I think PG may delegate data migration tasks to third 
    > party PG service vendors. They may develop simple or complex migration 
    > tools to help different use cases.
    
    Note that you can also create custom conversions using CREATE 
    CONVERSION, so that would be something for those who would need the old 
    behavior.
    
    
    
    
    
  11. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-08-07T08:14:44Z

    I did more research on the changes from 2000 to 2022; here is a
    summary:
    
    * 66 new characters were added in 2022. All of them are 4-byte
    characters. The map files store only 2-byte GB code mappings, and 4-byte
    GB code mappings are calculated, so these characters are already
    encoded and decoded properly without this patch. I tested that.
    * 9 characters are no longer required by 2022, but applications may decide
    whether to retain them. As the ucm file (
    https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/gb18030-2022.ucm)
    retains them, we also retain them.
    * Unicode mappings for 18 characters have changed. Only these changes
    cause backward compatibility issues. However, half of them are rarely
    used punctuation marks and the rest are glyphs that I, as a native
    Chinese speaker, cannot recognize. So these changes should not
    significantly impact most existing databases.
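
The first point relies on 4-byte GB18030 codes being computed rather than looked up. Below is a minimal sketch of that idea, not the PostgreSQL conversion code. It assumes the standard 4-byte layout (bytes 1 and 3 in 0x81..0xFE, bytes 2 and 4 in 0x30..0x39) and a linear range anchored at one pair taken from the .ucm diff later in this thread (U+9FB4 and 0x82359037). The helper names are hypothetical, and the sketch does not check where such a range begins or ends.

```python
def gb_linear(code):
    """Map a four-byte GB18030 code to its position in linear order."""
    b1, b2, b3, b4 = code
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


# Anchor pair from the thread: U+9FB4 <-> 82 35 90 37.
ANCHOR_GB = bytes.fromhex("82359037")
ANCHOR_UCS = 0x9FB4


def gb4_to_ucs(code):
    """Compute the code point by offset from the anchor (assumes same range)."""
    return ANCHOR_UCS + gb_linear(code) - gb_linear(ANCHOR_GB)


ch = chr(gb4_to_ucs(bytes.fromhex("82359632")))
utf8_hex = ch.encode("utf-8").hex()
```

Under these assumptions the code 0x82359632 comes out as U+9FEB, whose UTF-8 form is e9bfab. That matches the `convert_from()` result reported in a later message of this thread.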
    
    I added a test case with a character whose mapping changed, and the test passes:
    
    % make check
    ...
    # All 229 tests passed.
    
    For more details on the standard change, see
    https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132
    
    I am attaching the patch file.
    
    Chao Li (Evan)
    ---------------------
    Highgo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
  12. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-08-11T02:01:08Z

    I have created a patch entry: https://commitfest.postgresql.org/patch/5954/. 
    CommitFest requested a rebase, so I rebased the code and created the v2 
    patch.
    
    BTW, I have tested all 66 new characters, 9 not-required characters and 
    18 changed characters as follows:
    
    evantest=# SELECT encode(convert_from(decode('82359632', 'hex'), 
    'GB18030')::bytea, 'hex');
      encode
    --------
      e9bfab
    (1 row)
    
    All encoded correctly.
    
    Chao Li (Evan)
    
    ---------------------
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
  13. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-08-11T05:50:48Z

    On Mon, Aug 11, 2025 at 9:01 AM Chao Li <li.evan.chao@gmail.com> wrote:
    >
    > I have created a patch https://commitfest.postgresql.org/patch/5954/. CommitFests requested a rebase, so I rebased the code and created the v2 patch.
    >
    > BTW, I have tested all 66 new characters, 9 not-required characters and 18 changed characters in a way as:
    
    "9 characters are no longer required by the new standard, but are
    retained in this patch for compatibility"
    
    How is that done?
    
    > I added a test case with a mapping changed char, and the test passes:
    >
    > % make check
    > ...
    > # All 229 tests passed.
    >
    > For more details on the standard change, see https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132
    >
    > I am attaching the patch file.
    
    Going from the old .xml file to the .ucm file makes it difficult to
    see the relevant changes. Also, there are nearly 1000 non-user-visible
    changes like this in the output file that are not explained:
    
    -  /*** Three byte table, leaf: efa8xx - offset 0x07aba ***/
    +  /*** Three byte table, leaf: efa8xx - offset 0x07b3a ***/
    
    The 2000 version is available in the .ucm format, so maybe converting
    to that first would be a good preparatory patch:
    
    https://github.com/unicode-org/icu-data/blob/main/charset/data/ucm/gb-18030-2000.ucm
    
    Looking at the history, it looks like that file has seen small
    revisions, so it may take some research to get the exact equivalent to
    the XML file we use. That will also tell us if anything will change
    for us besides the actual 2022 revision.
    
    -- 
    John Naylor
    Amazon Web Services
    
    
    
    
  14. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-08-11T08:22:09Z

    Hi John,
    
    Thanks for your review.
    
    Yes, I did a diff between 2000.ucm and 2022.ucm when I worked on the patch. The diff between 2000.ucm and 2022.ucm is quite small:
    
    ```diff - omit the comment part
    > <U20AC> \x80 |3
    > <U3000> \xA3\xA0 |3
    > <UE5E5> \xA3\xA0 |4
    >
    28067a28099,28114
    > <U9FB4> \xFE\x59 |0
    > <U9FB4> \x82\x35\x90\x37 |3
    > <U9FB5> \xFE\x61 |0
    > <U9FB5> \x82\x35\x90\x38 |3
    > <U9FB6> \xFE\x66 |0
    > <U9FB6> \x82\x35\x90\x39 |3
    > <U9FB7> \xFE\x67 |0
    > <U9FB7> \x82\x35\x91\x30 |3
    > <U9FB8> \xFE\x6D |0
    > <U9FB8> \x82\x35\x91\x31 |3
    > <U9FB9> \xFE\x7E |0
    > <U9FB9> \x82\x35\x91\x32 |3
    > <U9FBA> \xFE\x90 |0
    > <U9FBA> \x82\x35\x91\x33 |3
    > <U9FBB> \xFE\xA0 |0
    > <U9FBB> \x82\x35\x91\x34 |3
    29577c29624
    < <UE5E5> \xA3\xA0 |0
    ---
    > # <UE5E5> \xA3\xA0 |0
    30001,30010c30048,30057
    < <UE78D> \xA6\xD9 |0
    < <UE78E> \xA6\xDA |0
    < <UE78F> \xA6\xDB |0
    < <UE790> \xA6\xDC |0
    < <UE791> \xA6\xDD |0
    < <UE792> \xA6\xDE |0
    < <UE793> \xA6\xDF |0
    < <UE794> \xA6\xEC |0
    < <UE795> \xA6\xED |0
    < <UE796> \xA6\xF3 |0
    ---
    > <UE78D> \xA6\xD9 |1
    > <UE78E> \xA6\xDA |1
    > <UE78F> \xA6\xDB |1
    > <UE790> \xA6\xDC |1
    > <UE791> \xA6\xDD |1
    > <UE792> \xA6\xDE |1
    > <UE793> \xA6\xDF |1
    > <UE794> \xA6\xEC |1
    > <UE795> \xA6\xED |1
    > <UE796> \xA6\xF3 |1
    30146c30193
    < <UE81E> \xFE\x59 |0
    ---
    > <UE81E> \xFE\x59 |1
    30154c30201
    < <UE826> \xFE\x61 |0
    ---
    > <UE826> \xFE\x61 |1
    30159,30160c30206,30207
    < <UE82B> \xFE\x66 |0
    < <UE82C> \xFE\x67 |0
    ---
    > <UE82B> \xFE\x66 |1
    > <UE82C> \xFE\x67 |1
    30166c30213
    < <UE832> \xFE\x6D |0
    ---
    > <UE832> \xFE\x6D |1
    30183c30230
    < <UE843> \xFE\x7E |0
    ---
    > <UE843> \xFE\x7E |1
    30200c30247
    < <UE854> \xFE\x90 |0
    ---
    > <UE854> \xFE\x90 |1
    30216c30263
    < <UE864> \xFE\xA0 |0
    ---
    > <UE864> \xFE\xA0 |1
    30470a30518,30537
    > <UFE10> \xA6\xD9 |0
    > <UFE10> \x84\x31\x82\x36 |3
    > <UFE11> \xA6\xDB |0
    > <UFE11> \x84\x31\x82\x37 |3
    > <UFE12> \xA6\xDA |0
    > <UFE12> \x84\x31\x82\x38 |3
    > <UFE13> \xA6\xDC |0
    > <UFE13> \x84\x31\x82\x39 |3
    > <UFE14> \xA6\xDD |0
    > <UFE14> \x84\x31\x83\x30 |3
    > <UFE15> \xA6\xDE |0
    > <UFE15> \x84\x31\x83\x31 |3
    > <UFE16> \xA6\xDF |0
    > <UFE16> \x84\x31\x83\x32 |3
    > <UFE17> \xA6\xEC |0
    > <UFE17> \x84\x31\x83\x33 |3
    > <UFE18> \xA6\xED |0
    > <UFE18> \x84\x31\x83\x34 |3
    > <UFE19> \xA6\xF3 |0
    > <UFE19> \x84\x31\x83\x35 |3
    ```
    
    As you can see, the changes cover only the 18 changed characters plus 3 other Unicode code points (U+20AC, U+3000, U+E5E5). My code comment in UCS_to_GB18030.pl explains these changes:
    
    ```code comment from UCS_to_GB18030.pl
    # The |n is a flag, where n has values of 0, 1, 3, 4.
    # With reference to https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132,
    # the flag should mean the following:
    #   0 - round-trip mapping
    #   1 - there are 18 mappings with flag 1, those are mapping changes
    #       from GB18030-2000 to GB18030-2022. Old mappings are marked
    #       with flag 1, new mappings with flag 0. So we can ignore all
    #       mappings with flag 1.
    #   3 - there are 20 mappings with flag 3:
    #         18 of them correspond to the 18 mappings with flag 1, but give
    #       the old mapping's Unicode code point's new mapping in GB18030-2022.
    #       These 18 new mappings have no actual glyphs in GB18030-2022.
    #       So we can ignore these 18 mappings with flag 3.
    #         The other 2 are: "<U20AC> \x80 |3" and "<U3000> \xA3\xA0 |3".
    #       They are two reserved fallbacks for compatibility with GBK and
    #       other web data as in WHATWG. Both U20AC and U3000 have round-
    #       trip mappings in GB18030-2022, so we can ignore these two
    #       mappings with flag 3.
    #         So, we can ignore all mappings with flag 3.
    #   4 - there is only one mapping with flag 4: <UE5E5> \xA3\xA0 |4.
    #       This is a "good one-way" mapping from U+E5E5 to \xA3\xA0
    #       for maximum compatibility with previous behavior. So we can
    #       ignore this mapping as well.
    ```
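
The filtering rule described in this comment can be sketched as follows. This is an illustration in Python, not the Perl code from the patch: the regular expression and the function name are hypothetical, and the sample lines are copied from the diff earlier in this message. It keeps flag 0 entries and drops entries flagged 1, 3 or 4, along with commented-out lines.

```python
import re

# One UCM mapping line: <UXXXX> \xNN[\xNN...] |flag
UCM_LINE = re.compile(r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)$")


def parse_ucm(lines):
    """Return {code point: GB18030 bytes} for round-trip (flag 0) mappings."""
    mapping = {}
    for line in lines:
        m = UCM_LINE.match(line.strip())
        if m is None:
            continue  # comments, headers, blank lines
        ucs, raw, flag = m.groups()
        if flag != "0":
            continue  # skip entries flagged 1, 3 or 4
        mapping[int(ucs, 16)] = bytes.fromhex(raw.replace("\\x", ""))
    return mapping


sample = [
    r"<UE78D> \xA6\xD9 |1",
    r"<UFE10> \xA6\xD9 |0",
    r"<UFE10> \x84\x31\x82\x36 |3",
    r"# <UE5E5> \xA3\xA0 |0",
    r"<UE5E5> \xA3\xA0 |4",
]
result = parse_ucm(sample)
```

Of the five sample lines only `<UFE10> \xA6\xD9 |0` survives, so 0xA6D9 ends up paired with U+FE10 alone.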
    
    For your question:
    
    > "9 characters are no longer required by the new standard, but are
    > retained in this patch for compatibility"
    > 
    > How is that done?
    
    
    The 9 mappings are not changed between 2000.ucm and 2022.ucm. For example, GB18030 code 0xFD9C is one of the 9 not-required codes, but the mapping:
    
    <UF92C> \xFD\x9C |0
    
    still appears in 2022.ucm, so this character is retained.
    
    
    Chao Li (Evan)
    --------------------
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    > On Aug 11, 2025, at 13:50, John Naylor <johncnaylorls@gmail.com> wrote:
    > 
    > On Mon, Aug 11, 2025 at 9:01 AM Chao Li <li.evan.chao@gmail.com> wrote:
    >> 
    >> I have created a patch https://commitfest.postgresql.org/patch/5954/. CommitFests requested a rebase, so I rebased the code and created the v2 patch.
    >> 
    >> BTW, I have tested all 66 new characters, 9 not-required characters and 18 changed characters in a way as:
    > 
    > "9 characters are no longer required by the new standard, but are
    > retained in this patch for compatibility"
    > 
    > How is that done?
    > 
    >> I added a test case with a mapping changed char, and the test passes:
    >> 
    >> % make check
    >> ...
    >> # All 229 tests passed.
    >> 
    >> For more details on the standard change, see https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132
    >> 
    >> I am attaching the patch file.
    > 
    > Going from the old .xml file to the .ucm file makes it difficult to
    > see the relevant changes. Also, there are nearly 1000 non-user-visible
    > changes like this in the output file that are not explained:
    > 
    > -  /*** Three byte table, leaf: efa8xx - offset 0x07aba ***/
    > +  /*** Three byte table, leaf: efa8xx - offset 0x07b3a ***/
    > 
    > The 2000 version is available in the .ucm format, so maybe converting
    > to that first would be a good preparatory patch:
    > 
    > https://github.com/unicode-org/icu-data/blob/main/charset/data/ucm/gb-18030-2000.ucm
    > 
    > Looking at the history, it looks like that file has seen small
    > revisions, so it may take some research to get the exact equivalent to
    > the XML file we use. That will also tell us if anything will change
    > for us besides the actual 2022 revision.
    > 
    > -- 
    > John Naylor
    > Amazon Web Services
    
    
  15. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-08-11T09:15:00Z

    On Mon, Aug 11, 2025 at 3:22 PM Chao Li <li.evan.chao@gmail.com> wrote:
    
    Hi,
    
    For future reference, please don't quote my entire message below yours
    -- it clutters the archives and also removes context.
    
    > Yes, I did a diff between 2000.ucm and 2022.ucm when I worked on the patch. The diff between 2000.ucm and 2022.ucm are quite small:
    
    That would match my expectation. In case it wasn't clear before, my
    preference is to split this patch into two patches: First convert to
    .ucm, then update to 2022 revision. Then the small diff will be
    obvious to everyone who looks at the second commit.
    
    > For your question:
    >
    > "9 characters are no longer required by the new standard, but are
    > retained in this patch for compatibility"
    >
    > How is that done?
    >
    >
    > The 9 mappings are not changed between 2000.ucm and 2022.ucm. For example, GB18030 code 0xFD9C is one of the 9 not-required code, but the mapping:
    >
    > <UF92C> \xFD\x9C |0
    >
    > Still appears in 2022.ucm, so that this character is retained.
    
    Thanks for clarifying -- by saying "retained in the patch", the commit
    message implied to me that the patch added something not in the
    upstream file.
    
    --
    John Naylor
    Amazon Web Services
    
    
    
    
  16. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-08-11T09:25:07Z

    > That would match my expectation. In case it wasn't clear before, my
    > preference is to split this patch into two patches: First convert to
    > .ucm, then update to 2022 revision. Then the small diff will be
    > obvious to everyone who looks at the second commit.
    
    Sure, I can split the patch into two. The first patch only switches from the .xml file to the .ucm file and updates the Perl script. As a result, the map files should not change.
    
    Then the second patch will update the ucm file, so the second patch should be small, containing only ucm changes and map file changes.
    
    One thing to confirm with you. To achieve that, we would rename the ucm file to gb18030.ucm, without the “-2000” or “-2022” suffix; otherwise git won’t be able to show the diff. Is that what you meant?
    
    
    > 
    > Thanks for clarifying -- by saying "retained in the patch", the commit
    > message implied to me that the patch added something not in the
    > upstream file.
    > 
    I will update the commit message in the new patch.
    
    
    Chao Li (Evan)
    --------------------
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
  17. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-08-11T09:29:04Z

    On Mon, Aug 11, 2025 at 4:25 PM Chao Li <li.evan.chao@gmail.com> wrote:
    >
    > Sure I can split the patch into two. The patch only changes the .xml file to .ucm file and updating the perl script. As a result, map files should not be changed.
    >
    > Then the second patch will update the ucm file, so that the second patch should be small, contains only ucm changes and map file changes.
    >
    > One thing to confirm with you. To achieve that, we will rename the ucm file as gb18030.ucm without suffix of “-2000” and “-2022”, otherwise git won’t be able to show the diff. Is that what you meant?
    
    Usually git is pretty smart about renames combined with small changes,
    so I would try keeping the original names and see what it does.
    
    -- 
    John Naylor
    Amazon Web Services
    
    
    
    
  18. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-08-12T04:57:45Z

    On Tue, Aug 12, 2025 at 9:09 AM Chao Li <li.evan.chao@gmail.com> wrote:
    
    [bringing this back to the original thread]
    
    > So, I compared the 2000 ucm with the 2005 ucm and also compared the 2005 ucm with the 2022 ucm. I found that some changes made in 2005 are reverted in 2022, which is why the diff between 2000 and 2022 is small. For example, the following mappings
    
    Yes, this was mentioned in the "disruptive changes" document linked in
    my first email in this thread:
    
    "The 2005 edition included 6 characters with double mappings. The 2022
    edition removes the
    double mappings.
    The 2005 edition included 9 characters from the CJK Compatibility
    Ideographs block. In
    Unicode/10646, these all have canonical decomposition mappings to
    characters in the URO. In
    the 2022 edition, these nine compatibility characters are removed."
    
    > So, for how to create patch 2, I think we have 3 options:
    >
    > 1. As planned, update to the latest version of 2000 ucm, then skip 2005 and directly upgrade to 2022 in patch 3. This way, we just honor 2000 ucm regardless that the change is actually introduced by 2005.
    >
    > 2. Skip the latest version of 2000 ucm and upgrade to 2005 ucm. This way will clearly show the upgrade path 2000->2005->2022. Downside is that 2005 introduced some changes that are reverted in 2022, which will cause some unnecessary changes in map files.
    >
    > 3. Skip patch 2, directly go to patch 3. So that, patch 3 will include changes introduced by both 2005 and 2022. This way makes minimum changes to map files.
    
    #3 is what I had in mind to begin with unless we found some reason not
    to. Minimizing churn is a lucky side effect that reinforces that
    choice.
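    
    As an illustration of why option 3 minimizes churn, here is a minimal Python sketch (not part of any patch in this thread). The three tables are hypothetical toy data, not real GB18030 mappings, and `diff_maps` is a made-up helper name. It only shows that a change made in one version and reverted in the next disappears when the two endpoints are compared directly.
    
    ```python
    def diff_maps(old, new):
        """Return {key: (old_value, new_value)} for every key whose value differs."""
        keys = set(old) | set(new)
        return {k: (old.get(k), new.get(k))
                for k in sorted(keys) if old.get(k) != new.get(k)}

    # Hypothetical toy tables (byte sequence -> code point), NOT real GB18030 data.
    v2000 = {"A1": 0x10, "A2": 0x20, "A3": 0x30}
    v2005 = {"A1": 0x10, "A2": 0x21, "A3": 0x30, "A4": 0x40}  # A2 changed, A4 added
    v2022 = {"A1": 0x10, "A2": 0x20, "A3": 0x31}              # A2 reverted, A4 removed, A3 changed

    # Going through the intermediate version touches more entries than going direct.
    two_step = len(diff_maps(v2000, v2005)) + len(diff_maps(v2005, v2022))
    one_step = len(diff_maps(v2000, v2022))
    print(f"entries touched via 2005: {two_step}, directly: {one_step}")
    ```
    
    With real data the same comparison would be run on the generated .map files rather than on hand-written dictionaries.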
    
    Before getting to that, I thought I'd bring this up to the community:
    
    +# Copyright (C) 2000-2009, International Business Machines Corporation and others.
    +# All Rights Reserved.
    
    The previous XML file didn't contain a copyright notice -- does anyone
    want to make a case for not checking unicode-org's source file into
    our tree because of this? The 2022 update changes it to
    
    # Copyright (C) 2016 and later: Unicode, Inc. and others.
    # License & terms of use: http://www.unicode.org/copyright.html
    # Copyright (C) 2000-2012, International Business Machines Corporation and others.
    # All Rights Reserved.
    
    ...and the above links to https://www.unicode.org/license.txt
    
    --
    John Naylor
    Amazon Web Services
    
    
    
    
  19. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-08-12T06:05:39Z

    >> 
    >> 3. Skip patch 2, directly go to patch 3. So that, patch 3 will include changes introduced by both 2005 and 2022. This way makes minimum changes to map files.
    > 
    > #3 is what I had in mind to begin with unless we found some reason not
    > to. Minimizing churn is a lucky side effect that reinforces that
    > choice.
    > 
    
    Cool, then I will take option 3.
    
    > Before getting to that, I thought I'd bring this up to the community:
    > 
    > 
    > The previous XML file didn't contain a copyright notice -- does anyone
    > want to make a case for not checking unicode-org's source file into
    > our tree because of this? The 2022 update changes it to
    > 
    > 
    
    Thanks for pointing out the Unicode license issue; I really hadn’t noticed that.
    
    I did some quick research. As we generate mapping files from the ucm files, and the map files are built into the final executable binaries, we are redistributing Unicode-derived data, so we should still include the Unicode license. Thus, not checking in the ucm file won’t make the license problem go away.
    
    We could just add a license file, say unicode_license.txt, with the proper content in the same folder as the ucm file. I guess that would address the license problem.
    
    The following is ChatGPT-generated content for the license file:
    
    ```
    Portions of this product include data from the Unicode Character Database
    and other Unicode® data files.
    
    Copyright © 1991–2025 Unicode, Inc.
    All rights reserved.
    
    Permission is hereby granted, free of charge, to any person obtaining a copy
    of the Unicode data files and any associated documentation (the "Data Files")
    or Unicode software and any associated documentation (the "Software") to deal
    in the Data Files or Software without restriction, including without limitation
    the rights to use, copy, modify, merge, publish, distribute, and/or sell copies
    of the Data Files or Software, and to permit persons to whom the Data Files or
    Software are furnished to do so, provided that either:
    
      (a) this copyright and permission notice appear with all copies of the Data
          Files or Software, or
    
      (b) this copyright and permission notice appear in associated documentation.
    
    THE DATA FILES AND SOFTWARE ARE PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
    EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
    MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD
    PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN
    THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL
    DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
    WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING
    OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THE DATA FILES OR
    SOFTWARE.
    
    Unicode and the Unicode logo are trademarks of Unicode, Inc. in the United
    States and other countries. All third party trademarks referenced herein are
    the property of their respective owners.
    ```
    
    Regards,
    
    Chao Li (Evan)
    --------------------
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
  20. Re: GB18030-2022 Support in PostgreSQL

    Peter Eisentraut <peter@eisentraut.org> — 2025-08-12T19:41:47Z

    On 12.08.25 06:57, John Naylor wrote:
    > Before getting to that, I thought I'd bring this up to the community:
    > 
    > +# Copyright (C) 2000-2009, International Business Machines Corporation and others.
    > +# All Rights Reserved.
    > 
    > The previous XML file didn't contain a copyright notice -- does anyone
    > want to make a case for not checking unicode-org's source file into
    > our tree because of this? The 2022 update changes it to
    > 
    > # Copyright (C) 2016 and later: Unicode, Inc. and others.
    > # License & terms of use: http://www.unicode.org/copyright.html
    > # Copyright (C) 2000-2012, International Business Machines Corporation and others.
    > # All Rights Reserved.
    > 
    > ...and the above links to https://www.unicode.org/license.txt
    
    Could we download this file on demand, like we do for the other input 
    files for the conversion mappings?
    
    
    
    
    
  21. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-08-13T07:17:03Z

    On Wed, Aug 13, 2025 at 2:41 AM Peter Eisentraut <peter@eisentraut.org> wrote:
    > Could we download this file on demand, like we do for the other input
    > files for the conversion mappings?
    
    That sounds like the way to go.
    
    While poking around, I found that UCS_to_EUC_CN.pl also uses
    gb-18030-2000.xml for its input, so now it seems wrong to delete the
    XML file as a side effect of changing the source for GB18030. Maybe
    EUC_CN could use a downloaded-on-demand .ucm source as well (whether
    2000 or 2022) but we can consider that later. For now let's leave the
    XML file alone.
    
    --
    John Naylor
    Amazon Web Services
    
    
    
    
  22. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-08-13T07:20:27Z

    
    > On Aug 13, 2025, at 15:17, John Naylor <johncnaylorls@gmail.com> wrote:
    > 
    > On Wed, Aug 13, 2025 at 2:41 AM Peter Eisentraut <peter@eisentraut.org> wrote:
    >> Could we download this file on demand, like we do for the other input
    >> files for the conversion mappings?
    > 
    > That sounds like the way to go.
    > 
    > While poking around, I found that UCS_to_EUC_CN.pl also uses
    > gb-18030-2000.xml for its input, so now it seems wrong to delete the
    > XML file as a side effect of changing the source for GB18030. Maybe
    > EUC_CN could use a downloaded-on-demand .ucm source as well (whether
    > 2000 or 2022) but we can consider that later. For now let's leave the
    > XML file alone.
    > 
    
    Sounds good. Let me recreate the patch.
    
    Chao Li (Evan)
    --------------------
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
  23. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-08-13T08:08:45Z

    On 2025/8/13 15:20, Chao Li wrote:
    >
    >
    > Sounds good. Let me recreate the patch.
    >
    >
    Attached is the new patch. It downloads the UCM file in make:
    
    
    ```
    Unicode % make gb18030_to_utf8.map
    wget -O gb-18030-2000.ucm --no-use-server-timestamps https://raw.githubusercontent.com/unicode-org/icu-data/d9d3a6ed27bb98a7106763e940258f0be8cd995b/charset/data/ucm/gb-18030-2000.ucm
    --2025-08-13 15:54:53--  https://raw.githubusercontent.com/unicode-org/icu-data/d9d3a6ed27bb98a7106763e940258f0be8cd995b/charset/data/ucm/gb-18030-2000.ucm
    HTTP request sent, awaiting response... 200 OK
    Length: 672885 (657K) [text/plain]
    Saving to: ‘gb-18030-2000.ucm’
    
    gb-18030-2000.ucm  100%[=====================================>] 657.11K  2.78MB/s    in 0.2s
    
    2025-08-13 15:54:54 (2.78 MB/s) - ‘gb-18030-2000.ucm’ saved [672885/672885]
    
    '/usr/bin/perl' -I . UCS_to_GB18030.pl
    - Writing UTF8=>GB18030 conversion table: utf8_to_gb18030.map
    - Writing GB18030=>UTF8 conversion table: gb18030_to_utf8.map
    Unicode % git diff
    Unicode %
    ```
    
    After regenerating the map files, git diff shows no changes to them.
    
    
    Best regards,
    
    Chao Li (Evan)
    --------------------
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
  24. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-08-18T05:18:25Z

    On Wed, Aug 13, 2025 at 3:08 PM Chao Li <li.evan.chao@gmail.com> wrote:
    > Attached is the new patch. It downloads the UCM file in make:
    
    > After regenerating the map files, there is no change found in the map files.
    
    I can confirm, thanks.
    
    When we split a patch into multiple patches, it's customary to include all of
    them, since that process may result in unwelcome artifacts to sort
    out. (When the first step has architectural questions or change in
    behavior, we may treat it as independent, possibly with a separate
    thread, but that's not the case here.) I do have some comments
    already, though:
    
    -my $in_file = "gb-18030-2000.xml";
    -
    +my $in_file = "gb-18030-2000.ucm";
    
    -while (<$in>)
    -{
    +while (<$in>) {
    
    -# The lines we care about in the source file look like
    +# The lines we care about in the source file look like:
    
    These are spurious changes, which we try to avoid.
    
    - next if (!m/<a u="([0-9A-F]+)" b="([0-9A-F ]+)"/);
    
    + if (/^<U([0-9A-Fa-f]+)>\s+((?:\\x[0-9A-Fa-f]{2})+)\s*\|(\d+)/) {
    
    This change in style caused extra whitespace-only churn. That obscures
    what the actual changes are.
    
    + # Match lines like: <UXXXX> \xYY[\xYY...] |n, and use only (|0) mappings
    
    This is missing an explanation of why we skip non-zero mappings.
    Code-wise, this only matters for the output in the follow-on patch for
    2022, but one of these patches needs to include a brief explanation. I
    did not like the detailed description that was present in one of the
    earlier 2022 patches that told how many characters were flagged a
    certain way -- that's irrelevant detail and will likely get out of
    date in some future version anyway.
    
    +# and n is a flag indicating the type of mapping having
    +# a single value of 0.
    
    This seems weird when combined with the logic to filter out non-zero
    mappings. We need to think about when and where to show relevant
    information.
    
    + next if ($flag ne '0'); # non-0 flags
    
    This comment is just repeating what the code is doing, and it's very
    obvious what it's doing.
    
    BTW, it sounds like your proposed Makefile changes are needed for the
    follow-on patch with .map changes to work at all, is that right?
    
    https://www.postgresql.org/message-id/1CA8625F-AA41-4ED2-B60F-E28AC71F37DC@highgo.com
    
    --
    John Naylor
    Amazon Web Services
    
    
    
    
  25. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-08-18T06:35:42Z

    On 2025/8/18 13:18, John Naylor wrote:
    > When we split a patch into multiple patches, it's customary to include all of
    > them, since that process may result in unwelcome artifacts to sort
    > out. (When the first step has architectural questions or change in
    > behavior, we may treat it as independent, possibly with a separate
    > thread, but that's not the case here.)
    
    Thanks for the explanation. I had planned to create the second patch only 
    after the first patch was pushed. I am new to contributing to PostgreSQL, so 
    your guidance is very helpful for my future work.
    
    I am now attaching both patch files.
    
    For the second patch, I have tested it manually again, and "make check" 
    passed.
    
    > -# The lines we care about in the source file look like
    > +# The lines we care about in the source file look like:
    >
    > These are spurious changes, which we try to avoid.
    
    Updated.
    
    > - next if (!m/<a u="([0-9A-F]+)" b="([0-9A-F ]+)"/);
    >
    > + if (/^<U([0-9A-Fa-f]+)>\s+((?:\\x[0-9A-Fa-f]{2})+)\s*\|(\d+)/) {
    >
    > This change in style caused extra whitespace-only churn. That obscures
    > what the actual changes are.
    
    Updated.
    
    > + # Match lines like: <UXXXX> \xYY[\xYY...] |n, and use only (|0) mappings
    >
    > This is missing an explanation of why we skip non-zero mappings.
    > Code-wise, this only matters for the output in the follow-on patch for
    > 2022, but one of these patches needs to include a brief explanation. I
    > did not like the detailed description that was present in one of the
    > earlier 2022 patches that told how many characters were flagged a
    > certain way -- that's irrelevant detail and will likely get out of
    > date in some future version anyway.
    
    Okay, I have kept only a concise version of the comment now.
    
    > +# and n is a flag indicating the type of mapping having
    > +# a single value of 0.
    >
    > This seems weird when combined with the logic to filter out non-zero
    > mappings. We need to think about when and where to show relevant
    > information.
    
    Updated the comment.
    
    > + next if ($flag ne '0'); # non-0 flags
    >
    > This comment is just repeating what the code is doing, and it's very
    > obvious what it's doing.
    
    Removed the useless comment.
    
    
    >
    > BTW, it sounds like your proposed Makefile changes are needed for the
    > follow-on patch with .map changes to work at all, is that right?
    >
    > https://www.postgresql.org/message-id/1CA8625F-AA41-4ED2-B60F-E28AC71F37DC@highgo.com
    >
    I think that patch could be separate, because the makefile changes are 
    generic to all map files. The current GB18030 patch doesn't depend on 
    that makefile patch at all. The makefile patch just makes build a little 
    bit easier upon map file changes.
    
    
    Best regards,
    
    --
    
    Chao Li (Evan)
    --------------------
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
  26. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-08-18T08:34:29Z

    On Mon, Aug 18, 2025 at 1:36 PM Chao Li <li.evan.chao@gmail.com> wrote:
    > I think that patch could be separate, because the makefile changes are generic to all map files. The current GB18030 patch doesn't depend on that makefile patch at all. The makefile patch just makes build a little bit easier upon map file changes.
    
    I verified that both autoconf and meson builds pick up the change with
    these two patches, and the new test passes. I'm still not sure what
    circumstances you found where a change doesn't get picked up, but we
    can come back to that later if need be.
    
    BTW, the Commitfest shows these patches as "needs rebase". The reason
    for that is the naming. Commands like `git am` apply a series in
    order, and expect to find something like
    v3-0001-*
    v3-0002-*
    
    Your last attachment was
    v1-0001-*
    v2-0001-*
    
    ...and confusingly v2 needed to be applied first. To create a series
    from a branch, use `git format-patch master -v <version number>` and
    it will output an ordered series with one patch per commit.
    
    --
    John Naylor
    Amazon Web Services
    
    
    
    
  27. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-08-18T08:50:34Z

    On 2025/8/18 16:34, John Naylor wrote:
    > I verified that both autoconf and meson builds pick up the change with
    > these two patches, and the new test passes. I'm still not sure what
    > circumstances you found where a change doesn't get picked up, but we
    > can come back to that later if need be.
    
    Let's talk about the makefile change separately.
    
    > ...and confusingly v2 needed to be applied first. To create a series
    > from a branch, use `git format-patch master -v <version number>` and
    > it will output an ordered series with one patch per commit.
    
    
    This is my first split patch. I was confused about the "0001" part in 
    patch file names. Now I understand. I have just recreated both patch 
    files as v3:
    
    chaol@ChaodeMacBook-Air postgresql % git format-patch -v3 master 
    v3-0001-GB18030-Switch-to-using-gb-18030-2000.ucm.patch 
    v3-0002-Upgrade-GB18030-encoding-support-from-2000-to-202.patch
    
    Best regards,
    
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
  28. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-09-01T01:32:42Z

    
    > On Aug 18, 2025, at 16:50, Chao Li <li.evan.chao@gmail.com> wrote:
    > 
    > 
    > <v3-0001-GB18030-Switch-to-using-gb-18030-2000.ucm.patch><v3-0002-Upgrade-GB18030-encoding-support-from-2000-to-202.patch>
    
    
    Hi John,
    
    Any follow up on this patch?
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
  29. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-09-10T06:38:40Z

    On Mon, Aug 18, 2025 at 3:50 PM Chao Li <li.evan.chao@gmail.com> wrote:
    > This is my first split patch. I was confused about the "0001" part in patch file names. Now I understand. I have just recreated both patch files as v3:
    
    I've attached v4, in which I made some cosmetic changes to the perl
    script, mostly to make it resemble master more closely. These changes
    are separated out into a separate patch for visibility, but will be
    squashed in the final commit. Two things are worth calling out:
    
    - The URL at the top currently points to a directory in Github, but v3
    changed it to point to the actual file. A directory can be navigated
    for inspection, so I used:
    
    2000:
    https://github.com/unicode-org/icu-data/tree/main/charset/data/ucm
    
    2022:
    https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/
    
    - I also made the regex a multiline regex for readability, even though
    the previous one was not.
    
    For 2022 version, I think it would be good to once run a test to
    verify that no mappings changed that we didn't expect. Perhaps the
    tests here can be used:
    
    https://www.postgresql.org/message-id/b9e3167f-f84b-7aa4-5738-be578a4db924%40iki.fi
    
    The upstream correction to the 2000 version is not present in our
    mappings, so we should mention that, unless it was reverted in or
    before 2022.
    
    In the documentation (charset.sgml), do we want to mention the version
    e.g. the following?
    
     <entry><literal>GB18030</literal></entry>
    -<entry>National Standard</entry>
    +<entry>National Standard, version 2022</entry>
    
    I've whacked around the commit messages, so those should be reviewed
    for accuracy.
    
    Your draft commit message had "9 characters are no longer required by
    the new standard, but are retained in this patch for compatibility"
    ...but those nine were introduced in the 2005 version, right? In which
    case it doesn't affect us. Please confirm.
    
    "Author: Zheng Tao <taoz@highgo.com>" -- I haven't seen any messages
    from this address in this thread, so could you confirm this was
    intentional?
    
    -- 
    John Naylor
    Amazon Web Services
    
  30. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-09-10T11:54:08Z

    Hi John,
    
    Thank you very much for taking care of this patch.
    
    On Wed, Sep 10, 2025 at 14:38, John Naylor <johncnaylorls@gmail.com> wrote:
    
    >
    > - The URL at the top currently points to a directory in Github, but v3
    > changed it to point to the actual file. A directory can be navigated
    > for inspection, so I used:
    >
    > 2000:
    > https://github.com/unicode-org/icu-data/tree/main/charset/data/ucm
    >
    > 2022:
    > https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/
    >
    >
    Looks good.
    
    
    > - I also made the regex a multiline regex for readability, even though
    > the previous one was not.
    >
    >
    Thank you very much for polishing the perl script. I am not a perl
    expert. I can get the script working, but not make it perfect.
    
    
    > For 2022 version, I think it would be good to once run a test to
    > verify that no mappings changed that we didn't expect. Perhaps the
    > tests here can be used:
    >
    >
    > https://www.postgresql.org/message-id/b9e3167f-f84b-7aa4-5738-be578a4db924%40iki.fi
    >
    >
    I have manually rerun the tests I had done before, and everything works as expected.
    
    I downloaded the tests from the referenced mail, but I cannot get the
    tests to run. Extracting the 2 patch files added src/test/encodings,
    but "make check" does not seem to run them. I tried copying the
    .out and .sql files to src/test/regress, but the tests still do not run.
    Did I miss anything?
    
    > The upstream correction to the 2000 version is not present in our
    > mappings, so we should mention that, unless it was reverted in or
    > before 2022.
    >
    
    I think the upstream correction to the 2000 version is just a few
    non-round-trip characters that we ignore. So I feel we don't need to mention
    them.
    
    
    >
    > In the documentation (charset.sgml), do we want to mention the version
    > e.g. the following?
    >
    >  <entry><literal>GB18030</literal></entry>
    > -<entry>National Standard</entry>
    > +<entry>National Standard, version 2022</entry>
    >
    
    That's a good idea. I updated the sgml file accordingly.
    
    
    >
    > I've whacked around the commit messages, so those should be reviewed
    > for accuracy.
    >
    > Your draft commit message had "9 characters are no longer required by
    > the new standard, but are retained in this patch for compatibility"
    > ...but those nine were introduced in the 2005 version, right? In which
    > case it doesn't affect us. Please confirm.
    >
    
    I can't find any indication that the 9 characters were introduced in the 2005
    version.
    
    But without this patch, they can be properly converted:
    ```
    evantest=# SELECT encode(convert_from(decode('FD9D', 'hex'), 'GB18030')::bytea, 'hex');
     encode
    --------
     efa5b9
    (1 row)
    ```
    So they should already be available in version 2000.
    
    
    >
    > "Author: Zheng Tao <taoz@highgo.com>" -- I haven't seen any messages
    > from this address in this thread, so could you confirm this was
    > intentional?
    >
    >
    Yes, Zheng Tao is my colleague. He worked with me on this patch, so I want
    to credit him.
    
    I am attaching v5. The only change is in 0003, where I added the SGML change.
    
    Best regards,
    Chao Li (Evan)
    ---------------------
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
  31. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-09-11T07:39:58Z

    On Wed, Sep 10, 2025 at 6:54 PM Chao Li <li.evan.chao@gmail.com> wrote:
    
    > I downloaded the tests from the referenced mail, but I cannot make the
    > tests to run. After extracting the 2 patch files, it added
    > src/test/encodings, but "make check" seems to not run them. I tried to copy
    > .out and .sql files to src/test/regress, but the tests still not running.
    > Did I miss anything?
    
    Sorry, I'm not quite sure either how to get it to run like a normal test. I
    got it to show the result by doing
    
    psql -f src/test/encodings/sql/init.sql
    psql -f src/test/encodings/sql/gb18030.sql > patch.out
    diff -u src/test/encodings/expected/gb18030.out patch.out > v5-test.diff
    
    I've attached what I got with the v5 patches, renamed to avoid being picked
    up by CI.
    
    >> The upstream correction to the 2000 version is not present in our
    >> mappings, so we should mention that, unless it was reverted in or
    >> before 2022.
    >
    >
    > I think the upstream correction to the 2000 version is just a few not
    > round-trip chars that are ignored by us. So I feel we don't need to mention
    > them.
    
    This is the commit, and both of these are in the 2022 file as a round trip
    mapping. I don't see any mappings with non-zero flag in the 2000 file (in
    any upstream commit).
    
    https://github.com/unicode-org/icu-data/commit/91850aec0209fee0f91248731c9dd4b2b768d2b5
    
    We should mention this correction for completeness. It seems to just move
    'ḿ' out of the private use area. To be sure, likely almost no one will
    notice.
    
    >> Your draft commit message had "9 characters are no longer required by
    >> the new standard, but are retained in this patch for compatibility"
    >> ...but those nine were introduced in the 2005 version, right? In which
    >> case it doesn't affect us. Please confirm.
    >
    >
    > I don't find any hint about if the 9 characters were introduced in the
    > 2005 version.
    
    Okay, I must have been confused by language "was included" in one of the
    linked references, which doesn't necessarily mean they were introduced
    there.
    
    The 66 new mappings required are not in the 2022 UCM file and we already
    cover them algorithmically in utf8_and_gb18030.c, so they already work
    without this patch (see below, the glyphs render on my OS but maybe not
    everyone can see them). The commit message needs to focus on what actually
    changed for users (I'll work on that). Related information should be an
    afterthought.
    
    # SELECT convert_from(decode('82358F33', 'hex'), 'GB18030');
     convert_from
    --------------
     龦
    (1 row)
    
    # SELECT convert_from(decode('82359636', 'hex'), 'GB18030');
     convert_from
    --------------
     鿯
    (1 row)
    
    While looking at utf8_and_gb18030.c, I see it refers to the XML file as the
    source of the algorithmic ranges. We'll want to keep some reference to the
    ranges independent of the XML file. I found
    
    https://htmlpreview.github.io/?https://github.com/unicode-org/icu-data/blob/main/charset/source/gb18030/gb18030.html
    
    ...which gives general info and mentions that U+10000 starts at
    GB+90308130, and also links to
    
    https://github.com/unicode-org/icu-data/blob/main/charset/source/gb18030/ranges.txt
    
    ...which has the same ranges we have below U+10000. Links can always
    disappear, but if the algorithmic ranges ever need to change (unlikely),
    we'll have new information about that.
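    
    The arithmetic behind the two queries above can be sketched in Python. This is only an illustration of the linear ordering of four-byte GB18030 sequences, not the code in utf8_and_gb18030.c. The helper names are made up, the code points U+9FA6 and U+9FEF for the two glyphs are my reading of the query output, and the start of the supplementary range is taken from the statement above that U+10000 starts at GB+90308130.
    
    ```python
    def gb18030_linear(four_bytes: bytes) -> int:
        """Linear index of a four-byte GB18030 sequence, counted from 0x81308130."""
        b1, b2, b3, b4 = four_bytes
        if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
                and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
            raise ValueError("not a well-formed four-byte GB18030 sequence")
        # Bytes 2 and 4 have 10 possible values, byte 3 has 126.
        return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)

    SUPPLEMENTARY_BASE = gb18030_linear(bytes.fromhex("90308130"))  # maps to U+10000

    def supplementary_code_point(four_bytes: bytes) -> int:
        """Code point for a sequence at or above GB+90308130 (U+10000 and up)."""
        offset = gb18030_linear(four_bytes) - SUPPLEMENTARY_BASE
        if not 0 <= offset <= 0x10FFFF - 0x10000:
            raise ValueError("outside the supplementary-plane range")
        return 0x10000 + offset

    # Within one algorithmic range, byte-sequence distance equals code point distance.
    start = gb18030_linear(bytes.fromhex("82358F33"))   # assumed to be U+9FA6
    other = gb18030_linear(bytes.fromhex("82359636"))
    print(f"U+{0x9FA6 + (other - start):04X}")
    print(f"U+{supplementary_code_point(bytes.fromhex('90308130')):04X}")
    ```
    
    Because the ranges are contiguous in both encodings, a range table only needs the first and last entries of each range, which is why new code points inside an existing range need no new table rows.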
    
    --
    John Naylor
    Amazon Web Services
    
  32. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-09-11T09:08:31Z

    
    > On Sep 11, 2025, at 15:39, John Naylor <johncnaylorls@gmail.com> wrote:
    > 
    > 
    > On Wed, Sep 10, 2025 at 6:54 PM Chao Li <li.evan.chao@gmail.com> wrote:
    > 
    > > I downloaded the tests from the referenced mail, but I cannot make the tests to run. After extracting the 2 patch files, it added src/test/encodings, but "make check" seems to not run them. I tried to copy .out and .sql files to src/test/regress, but the tests still not running. Did I miss anything?
    > 
    > Sorry, I'm not quite sure either how to get it to run like a normal test. I got it to show the result by doing
    > 
    > psql -f src/test/encodings/sql/init.sql 
    > psql -f src/test/encodings/sql/gb18030.sql > patch.out
    > diff -u src/test/encodings/expected/gb18030.out patch.out > v5-test.diff
    > 
    > I've attached what I got with the v5 patches, renamed to avoid being picked up by CI.
    > 
    > >> The upstream correction to the 2000 version is not present in our
    > >> mappings, so we should mention that, unless it was reverted in or
    > >> before 2022.
    > >
    > >
    > > I think the upstream correction to the 2000 version is just a few not round-trip chars that are ignored by us. So I feel we don't need to mention them.
    > 
    > This is the commit, and both of these are in the 2022 file as a round trip mapping. I don't see any mappings with non-zero flag in the 2000 file (in any upstream commit).
    > 
    > https://github.com/unicode-org/icu-data/commit/91850aec0209fee0f91248731c9dd4b2b768d2b5
    
    I managed to get the encoding test to run. I didn’t find init.sql, so I had to create 3 functions manually. In the end, the test passed on the master branch.
    
    Then I switched to the patch branch, and the output differed in 21 lines. After I updated the 18 known changes in the .out file, only 3 lines still differed:
    
    ```
    - \x8135f437   | \xe1b8bf
    + \x8135f437   | \xee9f87
    
    - \xa3a0       | \xee97a5
    + \xa3a0       | character with byte sequence 0xa3 0xa0 in encoding "GB18030" has no equivalent in encoding "UTF8"
    
    - \xa8bc       | \xee9f87
    + \xa8bc       | \xe1b8bf
    ```
    
    Here, \x8135f437 and \xa8bc reflect the change pointed to by the link above:
    
    \xA8BC used to map to Unicode U+E7C7. Now \x8135F437 maps to U+E7C7, and \xA8BC maps to U+1E3F, a change made in version 2005.
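    
    For readers who want to check the hex values in the diff, here is a small Python sketch that turns the UTF-8 bytes from the right-hand column into code points. It is only a reading aid, and `utf8_hex_to_code_point` is a made-up helper name.
    
    ```python
    def utf8_hex_to_code_point(hex_bytes: str) -> str:
        """Decode hex-encoded UTF-8 bytes that hold exactly one character."""
        text = bytes.fromhex(hex_bytes).decode("utf-8")
        if len(text) != 1:
            raise ValueError("expected exactly one character")
        return f"U+{ord(text):04X}"

    # Values taken from the diff above.
    for value in ("e1b8bf", "ee9f87", "ee97a5"):
        print(value, "->", utf8_hex_to_code_point(value))
    ```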
    
    For \xa3a0, in the 2022 ucm file, it is not a round-trip mapping:
    
    ```
    <U3000> \xA3\xA0 |3
    <UE5E5> \xA3\xA0 |4
    ```
    
    So we ignored it. Then everything is clear.
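    
    The filtering described here can be sketched in Python as follows. This is not the perl script from the patch. It reuses the regular expression quoted earlier in this thread, `parse_ucm` is a made-up name, and the third sample line is written for illustration from the description of the 2022 file above rather than copied from it.
    
    ```python
    import re

    # Same pattern as the one quoted from the patch: <UXXXX> \xYY[\xYY...] |n
    UCM_LINE = re.compile(r"^<U([0-9A-Fa-f]+)>\s+((?:\\x[0-9A-Fa-f]{2})+)\s*\|(\d+)")

    def parse_ucm(lines):
        """Return {code_point: encoded_bytes}, keeping only round-trip (|0) mappings."""
        mappings = {}
        for line in lines:
            match = UCM_LINE.match(line)
            if match is None:
                continue  # header, comment, or blank line
            code_point, byte_text, flag = match.groups()
            if flag != "0":
                continue  # not a round-trip mapping
            mappings[int(code_point, 16)] = bytes.fromhex(byte_text.replace("\\x", ""))
        return mappings

    sample = [
        "# comment lines do not match the pattern",
        r"<U3000> \xA3\xA0 |3",
        r"<UE5E5> \xA3\xA0 |4",
        r"<U1E3F> \xA8\xBC |0",  # illustrative line, see the note above
    ]
    for code_point, encoded in parse_ucm(sample).items():
        print(f"U+{code_point:04X} <-> {encoded.hex()}")
    ```
    
    Since both lines for \xA3\xA0 carry a non-zero flag, no entry for that byte sequence reaches the map, which matches the "has no equivalent" error in the test output.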
    
    > 
    > We should mention this correction for completeness. It seems to just move 'ḿ' out of the private use area. To be sure, likely almost no one will notice.
    > 
    > >> Your draft commit message had "9 characters are no longer required by
    > >> the new standard, but are retained in this patch for compatibility"
    > >> ...but those nine were introduced in the 2005 version, right? In which
    > >> case it doesn't affect us. Please confirm.
    > >
    > >
    > > I don't find any hint about if the 9 characters were introduced in the 2005 version.
    > 
    > Okay, I must have been confused by language "was included" in one of the linked references, which doesn't necessarily mean they were introduced there.
    > 
    > The 66 new mappings required are not in the 2022 UCM file and we already cover them algorithmically in utf8_and_gb18030.c, so they already work without this patch (see below, the glyphs render on my OS but maybe not everyone can see them). The commit message needs to focus on what actually changed for users (I'll work on that). Related information should be an afterthought.
    > 
    > # SELECT convert_from(decode('82358F33', 'hex'), 'GB18030');
    >  convert_from 
    > --------------
    >  龦
    > (1 row)
    > 
    > # SELECT convert_from(decode('82359636', 'hex'), 'GB18030');
    >  convert_from 
    > --------------
    >  鿯
    > (1 row)
    > 
    > While looking at utf8_and_gb18030.c, I see it refers to the XML file as the source of the algorithmic ranges. We'll want to keep some reference to the ranges independent of the XML file. I found 
    > 
    > https://htmlpreview.github.io/?https://github.com/unicode-org/icu-data/blob/main/charset/source/gb18030/gb18030.html
    > 
    > ...which gives general info and mentions that U+10000 starts at GB+90308130, and also links to
    > 
    > https://github.com/unicode-org/icu-data/blob/main/charset/source/gb18030/ranges.txt
    > 
    > ...which has the same ranges we have below U+10000. Links can always disappear, but if the algorithmic ranges ever need to change (unlikely), we'll have new information about that.
    > 
    > 
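
A rough Python sketch of how an algorithmic range can be evaluated, assuming the usual GB18030 four-byte layout (first and third bytes 0x81..0xFE, second and fourth bytes 0x30..0x39). It is not the code in utf8_and_gb18030.c; the function names are hypothetical. The only anchor taken from the thread is that U+10000 starts at GB+90308130.

```python
def gb4_linear(seq: bytes) -> int:
    """Map a four-byte GB18030 sequence to its position in linear order."""
    if len(seq) != 4:
        raise ValueError("expected exactly four bytes")
    b1, b2, b3, b4 = seq
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a valid four-byte GB18030 sequence")
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


# Per the thread, U+10000 starts at GB+90308130.
SUPPLEMENTARY_START = gb4_linear(bytes.fromhex("90308130"))


def decode_supplementary(seq: bytes) -> int:
    """Convert a four-byte sequence in the supplementary range to a code point."""
    code_point = 0x10000 + gb4_linear(seq) - SUPPLEMENTARY_START
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("sequence is outside the supplementary range")
    return code_point


print(hex(decode_supplementary(bytes.fromhex("90308130"))))  # 0x10000

# The two sequences queried in the thread are 73 positions apart.
delta = gb4_linear(bytes.fromhex("82359636")) - gb4_linear(bytes.fromhex("82358F33"))
print(delta)  # 73
```

Because each range is linear, the distance between two byte sequences in the same range equals the distance between their code points.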
    
    I will post v6 soon with updated commit message.
    
    By the way, for how I made the test work:
    
    1. I copied gb18030.sql and gb18030.out to src/test/regress, into the sql and expected subdirectories respectively.
    2. In src/test/regress/parallel_schedule, I added the line "test: gb18030".
    3. Then "make check" runs the gb18030 test.
    
    Attached are my updated sql and out files. To test on the master branch, use the original out file. To test with the patch, use my updated out file; it will fail with the 3 differing lines I mentioned above.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
  33. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-09-12T01:57:37Z

    On Thu, Sep 11, 2025 at 17:08, Chao Li <li.evan.chao@gmail.com> wrote:
    
    >
    >
    > I will post v6 soon with updated commit message.
    >
    >
    I am attaching the v6 patch set:
    
    * Updated 0003's commit comment.
    * In 0003, updated a function comment in utf8_and_gb18030.c to address
    John's comment about reference to the xml file.
    
    Best regards,
    Chao Li (Evan)
    ---------------------
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
  34. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-09-12T02:12:17Z

    On Fri, Sep 12, 2025 at 9:57 AM Chao Li <li.evan.chao@gmail.com> wrote:
    
    >
    >
    > I am attaching the v6 patch set:
    >
    > * Updated 0003's commit comment.
    > * In 0003, updated a function comment in utf8_and_gb18030.c to address
    > John's comment about reference to the xml file.
    >
    > Best regards,
    > Chao Li (Evan)
    > ---------------------
    > HighGo Software Co., Ltd.
    > https://www.highgo.com/
    >
    
    CF requested a rebase, so v7 is just a rebased version.
    
    Best regards,
    Chao Li (Evan)
    ---------------------
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
  35. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-09-16T09:21:04Z

    On Thu, Sep 11, 2025 at 4:09 PM Chao Li <li.evan.chao@gmail.com> wrote:
    > Then I switched to the patch branch, it got 21 different lines. After I updated the 18 known changes in the out file, then it got only 3 different lines:
    >
    > ```
    > - \x8135f437   | \xe1b8bf
    > + \x8135f437   | \xee9f87
    >
    > - \xa3a0       | \xee97a5
    > + \xa3a0       | character with byte sequence 0xa3 0xa0 in encoding "GB18030" has no equivalent in encoding "UTF8"
    >
    > - \xa8bc       | \xee9f87
    > + \xa8bc       | \xe1b8bf
    > ```
    >
    > Where, \x8135f437 and \xa8bc reflect to the change pointed by above link:
    >
    > \xA8BC used to map to unicode UE7C7, now \x8135f437 changed to map to UE7C7, and \xA8BC changed to map to U1E3F in version 2005.
    
    Maybe we can phrase it like this:
    
    ```
    There have been two corrections to the 2000 version that were carried
    forward to later versions. The following mappings were previously
    swapped:
    
    U+E7C7 (Private Use Area) now maps to \x8135f437
    U+1E3F (Latin Small Letter M with Acute) now maps to \xA8BC
    ```
    
    > For \xa3a0, the 2022 UCM file has no roundtrip mapping:
    >
    > ```
    > <U3000> \xA3\xA0 |3
    > <UE5E5> \xA3\xA0 |4
    > ```
    >
    > So we ignored it. Then everything is clear.
    
    Yes, I see this in the file, but it's not described in any of the
    documents about the 2022 version, although they mention other cases
    regarding the Private Use Area. I'm not sure we need to worry too
    much, but we need to describe the behavior changes, maybe like this:
    
    ```
    Previously, U+E5E5 (Private Use Area) was mapped to \xA3A0. This code
    point now maps to \x65356535. Attempting to convert \xA3A0 will now
    raise an error.
    ```
    
    I'm open to suggestions.
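
To relate the code points in the proposed wording to the UTF-8 byte strings shown in the regression diff, here is a small Python sketch. It uses only the standard UTF-8 encoder, not any GB18030 codec; the helper name is hypothetical.

```python
def utf8_hex(code_point: int) -> str:
    """Format the UTF-8 encoding of a code point the way the test output shows it."""
    return "\\x" + chr(code_point).encode("utf-8").hex()


# Code points named in the proposed commit message wording.
for cp in (0xE7C7, 0x1E3F, 0xE5E5):
    print(f"U+{cp:04X} -> {utf8_hex(cp)}")
```

This prints \xee9f87 for U+E7C7, \xe1b8bf for U+1E3F, and \xee97a5 for U+E5E5, which match the right-hand column of the three differing lines.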
    
    --
    John Naylor
    Amazon Web Services
    
    
    
    
  36. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-09-16T09:36:02Z

    On Fri, Sep 12, 2025 at 8:57 AM Chao Li <li.evan.chao@gmail.com> wrote:
    > * In 0003, updated a function comment in utf8_and_gb18030.c to address John's comment about reference to the xml file.
    
    Thanks, but the entire point of that comment change was to remove the
    reference to the XML file, yet it didn't actually do that. Also, the
    words in my email were to explain to you what should go there and why.
    That doesn't mean those words belong in the comment.
    
    The comment change seems like it belongs in the preparatory commit
    anyway, so I put the links there and pushed 0001 (along with the
    squashed 0002).
    
    --
    John Naylor
    Amazon Web Services
    
    
    
    
  37. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-09-17T02:08:28Z

    Hi John,
    
    On Sep 16, 2025, at 17:36, John Naylor <johncnaylorls@gmail.com> wrote:
    
    
    The comment change seems like it belongs in the preparatory commit
    anyway, so I put the links there and pushed 0001 (along with the
    squashed 0002).
    
    
    Thank you very much for pushing 0001.
    
    I see you have updated the function comment in utf8_and_gb18030.c, so I
    removed it from the v8 patch.
    
    Attached is the v8 patch:
    
    * Updated the commit comment by taking your wording
    * Removed the change of utf8_and_gb18030.c
    
    Please take a look again, and thanks for your patience.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
  38. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-09-18T07:59:32Z

    On Wed, Sep 17, 2025 at 9:08 AM Chao Li <li.evan.chao@gmail.com> wrote:
    > I see you have updated the function comment in utf8_and_gb18030.c, so I removed it from the v8 patch.
    >
    > Attached is the v8 patch:
    
    I've reworked the commit message I started in v5 to incorporate later
    discussions. (I was not a fan of including a complete table there, nor
    of using UTF-8 encoding instead of code points as a reference.)
    
    The only change I made for v9 is to reword the regression test
    addition from "upgrades" to "change". I'm planning to commit next week
    unless there are objections. (If anyone otherwise busy with the PG18
    release wants a chance to weigh in, let me know and I'll hold off).
    
    It'll be a good idea to communicate how to detect (unlikely but not
    impossible) incompatibilities for existing systems, but I don't think
    committing needs to wait for that piece.
    
    --
    John Naylor
    Amazon Web Services
    
  39. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-09-18T08:16:05Z

    Hi John,
    
    Thanks for working on v9.
    
    > On Sep 18, 2025, at 15:59, John Naylor <johncnaylorls@gmail.com> wrote:
    > 
    > 
    > It'll be a good idea to communicate how to detect (unlikely but not
    > impossible) incompatibilities for existing systems, but I don't think
    > committing needs to wait for that piece.
    > 
    > --
    > John Naylor
    > Amazon Web Services
    > <v9-0001-Update-GB18030-encoding-from-version-2000-to-2022.patch>
    
    
    V9 looks good to me. I am absolutely fine with removing the table of mapping changes.
    
    When you say “communicate how to detect incompatibility for existing systems”, what would be the communication channel? I am actually very new to the PG development community, so your guidance will be greatly appreciated.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
  40. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-09-18T08:53:08Z

    On Thu, Sep 18, 2025 at 3:16 PM Chao Li <li.evan.chao@gmail.com> wrote:
    >
    > When you say “communicate how to detect incompatibility for existing systems”, what would be the communication channel? I am actually very new to the PG development community, your guidance will be greatly appreciated.
    
    My first thought was to include a sample query in the release notes
    that filters on text with the affected code points, but I'd be happy
    to hear other ideas. We start working on release notes around
    April/May.
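
As an illustration of the detection idea only (not the release-note query itself), here is a minimal Python sketch that scans text for code points whose mapping is discussed in this thread. The set below contains just the three code points named above and is an assumption; the full list of changed mappings is not enumerated here. The names are hypothetical.

```python
# Assumption: only the three code points explicitly named in the thread.
AFFECTED_CODE_POINTS = {0xE5E5, 0xE7C7, 0x1E3F}


def affected_code_points(text: str) -> list[int]:
    """Return the sorted affected code points that occur in the given text."""
    return sorted({ord(ch) for ch in text} & AFFECTED_CODE_POINTS)


rows = ["plain ASCII", "contains \u1e3f", "private use \ue5e5 and \ue7c7"]
for row in rows:
    hits = affected_code_points(row)
    if hits:
        print(repr(row), [f"U+{cp:04X}" for cp in hits])
```

A real check would run inside the database against stored text; this sketch only shows the filtering logic.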
    
    --
    John Naylor
    Amazon Web Services
    
    
    
    
  41. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-09-18T09:44:43Z

    
    > On Sep 18, 2025, at 16:53, John Naylor <johncnaylorls@gmail.com> wrote:
    > 
    > On Thu, Sep 18, 2025 at 3:16 PM Chao Li <li.evan.chao@gmail.com> wrote:
    >> 
    >> When you say “communicate how to detect incompatibility for existing systems”, what would be the communication channel? I am actually very new to the PG development community, your guidance will be greatly appreciated.
    > 
    > My first thought was to include a sample query in the release notes
    > that filters on text with the affected code points, but I'd be happy
    > to hear other ideas. We start working on release notes around
    > April/May.
    > 
    
    So, no immediate action to take, right? I may work out such a query before the release note work starts.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
  42. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-09-24T06:42:37Z

    On Thu, Sep 18, 2025 at 2:59 PM John Naylor <johncnaylorls@gmail.com> wrote:
    > The only change I made for v9 is to reword the regression test
    > addition from "upgrades" to "change". I'm planning to commit next week
    > unless there are objections. (If anyone otherwise busy with the PG18
    > release wants a chance to weigh in, let me know and I'll hold off).
    
    Pushed.
    
    On Thu, Sep 18, 2025 at 4:45 PM Chao Li <li.evan.chao@gmail.com> wrote:
    > So, no immediate action to take, right? I may work out such a query before starting of release note work.
    
    Sounds good. Were you also interested in seeing if EUC_CN can use the
    same UCM file? That would allow us to get rid of the XML file.
    
    --
    John Naylor
    Amazon Web Services
    
    
    
    
  43. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-09-24T07:04:07Z

    
    > On Sep 24, 2025, at 14:42, John Naylor <johncnaylorls@gmail.com> wrote:
    > 
    > 
    > Sounds good. Were you also interested in seeing if EUC_CN can use the
    > same UCM file? That would allow us to get rid of the XML file.
    > 
    
    
    Sure, let me take a look.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
  44. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-09-24T09:18:40Z

    On Sep 24, 2025, at 15:04, Chao Li <li.evan.chao@gmail.com> wrote:
    
    On Sep 24, 2025, at 14:42, John Naylor <johncnaylorls@gmail.com> wrote:
    
    Sounds good. Were you also interested in seeing if EUC_CN can use the
    same UCM file? That would allow us to get rid of the XML file.
    
    
    Sure, let me take a look.
    
    
    I found that both EUC_CN and UHC use the same XML file, so I updated both.
    
    I didn’t delete gb-18030-2000.xml in this patch because that would make the
    patch file very large. You can just add the deletion to the commit when you
    push it.
    
    Basically, the changes are all borrowed from the previous commit. With this
    patch, regenerating the map files leads to no map file changes, which is
    expected:
    
    ```
    % make utf8_to_uhc.map utf8_to_euc_cn.map
    '/usr/bin/perl' -I . UCS_to_UHC.pl
    - Writing UTF8=>UHC conversion table: utf8_to_uhc.map
    - Writing UHC=>UTF8 conversion table: uhc_to_utf8.map
    '/usr/bin/perl' -I . UCS_to_EUC_CN.pl
    - Writing UTF8=>EUC_CN conversion table: utf8_to_euc_cn.map
    - Writing EUC_CN=>UTF8 conversion table: euc_cn_to_utf8.map
    
    % git diff # no map file change
    %
    ```
    
    I am not sure if you should also upgrade the UCM file to the 2022 version, but
    if we need to, let’s do it in a separate commit.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
  45. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-09-24T09:31:39Z

    On Wed, Sep 24, 2025 at 5:18 PM Chao Li <li.evan.chao@gmail.com> wrote:
    
    >
    > On Sep 24, 2025, at 15:04, Chao Li <li.evan.chao@gmail.com> wrote:
    >
    > On Sep 24, 2025, at 14:42, John Naylor <johncnaylorls@gmail.com> wrote:
    >
    > Sounds good. Were you also interested in seeing if EUC_CN can use the
    > same UCM file? That would allow us to get rid of the XML file.
    >
    >
    > Sure, let me take a look.
    >
    >
    > I found that both EUC_CN and UHC use the same XML file, so I updated both.
    >
    > I didn’t delete gb-18030-2000.xml in this patch, because it would make the
    > patch file very large, you can just add the deletion to the commit when you
    > push it.
    >
    > Basically, the changes are all borrowed from the previous commit. With
    > this patch, regenerating the maps file lead to no map file change, which is
    > expected:
    >
    > ```
    > % make utf8_to_uhc.map utf8_to_euc_cn.map
    > '/usr/bin/perl' -I . UCS_to_UHC.pl
    > - Writing UTF8=>UHC conversion table: utf8_to_uhc.map
    > - Writing UHC=>UTF8 conversion table: uhc_to_utf8.map
    > '/usr/bin/perl' -I . UCS_to_EUC_CN.pl
    > - Writing UTF8=>EUC_CN conversion table: utf8_to_euc_cn.map
    > - Writing EUC_CN=>UTF8 conversion table: euc_cn_to_utf8.map
    >
    > % git diff # no map file change
    > %
    > ```
    >
    > I am not sure if you should also upgrade the UCM file to 2022 version, but
    > if we need, let’s do it with a separate commit.
    >
    >
    I included the deletion of the xml file in v2, which will help confirm
    clearly that the build passes. I realized that the patch files were huge
    because of the map file changes.
    
    Best regards,
    Chao Li (Evan)
    ---------------------
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
  46. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-09-29T04:03:09Z

    On Wed, Sep 24, 2025 at 4:18 PM Chao Li <li.evan.chao@gmail.com> wrote:
    > I am not sure if you should also upgrade the UCM file to 2022 version, but if we need, let’s do it with a separate commit.
    
    If they can all use the same file, we should just do that for the sake
    of simplicity, in which case a separate commit is just extra noise.
    
    -- 
    John Naylor
    Amazon Web Services
    
    
    
    
  47. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-09-29T08:19:48Z

    On Mon, Sep 29, 2025 at 12:03 PM John Naylor <johncnaylorls@gmail.com>
    wrote:
    
    > On Wed, Sep 24, 2025 at 4:18 PM Chao Li <li.evan.chao@gmail.com> wrote:
    > > I am not sure if you should also upgrade the UCM file to 2022 version,
    > but if we need, let’s do it with a separate commit.
    >
    > If they can all use the same file, we should just do that for the sake
    > of simplicity, in which case a separate commit is just extra noise.
    >
    >
    In v3, I have updated EUC_CN to use gb18030-2022.ucm. Fortunately, the map
    files are unchanged, so we don't have to do much testing for EUC_CN.
    
    For UHC, the icu master branch
    https://github.com/unicode-org/icu/tree/main/icu4c/source/data/mappings
    still contains windows-949-2000.ucm, so only the download URL changes;
    the file content is unchanged.
    
    ```
    % make utf8_to_uhc.map utf8_to_euc_cn.map
    wget -O windows-949-2000.ucm --no-use-server-timestamps
    https://raw.githubusercontent.com/unicode-org/icu/refs/heads/main/icu4c/source/data/mappings/windows-949-2000.ucm
    --2025-09-29 16:00:40--
    https://raw.githubusercontent.com/unicode-org/icu/refs/heads/main/icu4c/source/data/mappings/windows-949-2000.ucm
    HTTP request sent, awaiting response... 200 OK
    Length: 356253 (348K) [text/plain]
    Saving to: ‘windows-949-2000.ucm’
    
    windows-949-2000.ucm
    100%[=========================================================================================================>]
    347.90K   222KB/s    in 1.6s
    
    2025-09-29 16:00:43 (222 KB/s) - ‘windows-949-2000.ucm’ saved
    [356253/356253]
    
    '/usr/bin/perl' -I . UCS_to_UHC.pl
    - Writing UTF8=>UHC conversion table: utf8_to_uhc.map
    - Writing UHC=>UTF8 conversion table: uhc_to_utf8.map
    wget -O gb18030-2022.ucm --no-use-server-timestamps
    https://raw.githubusercontent.com/unicode-org/icu/refs/heads/main/icu4c/source/data/mappings/gb18030-2022.ucm
    --2025-09-29 16:00:43--
    https://raw.githubusercontent.com/unicode-org/icu/refs/heads/main/icu4c/source/data/mappings/gb18030-2022.ucm
    HTTP request sent, awaiting response... 200 OK
    Length: 675312 (659K) [text/plain]
    Saving to: ‘gb18030-2022.ucm’
    
    gb18030-2022.ucm
    100%[=========================================================================================================>]
    659.48K  1.33MB/s    in 0.5s
    
    2025-09-29 16:00:44 (1.33 MB/s) - ‘gb18030-2022.ucm’ saved [675312/675312]
    
    '/usr/bin/perl' -I . UCS_to_EUC_CN.pl
    - Writing UTF8=>EUC_CN conversion table: utf8_to_euc_cn.map
    - Writing EUC_CN=>UTF8 conversion table: euc_cn_to_utf8.map
    % git diff
    %
    ```
    
    Please note, I didn't include the deletion of gb-18030-2000.xml in v3,
    because that will cause the patch file to be too big, thus requiring an
    approval process for the email to land in the Mail Archive. Please delete
    the xml file when you push the commit.
    
    Best regards,
    Chao Li (Evan)
    ---------------------
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
  48. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-09-29T09:32:15Z

    On Wed, Sep 24, 2025 at 4:18 PM Chao Li <li.evan.chao@gmail.com> wrote:
    >
    > I found that both EUC_CN and UHC use the same XML file, so I updated both.
    
    When you say "same file", that implies to me the file we have checked
    in our repo. They have different names and the UHC file is downloaded
    on demand, so it doesn't seem like we need to change UHC at all to
    delete gb-18030-2000.xml. Is that right?
    
    -- 
    John Naylor
    Amazon Web Services
    
    
    
    
  49. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-09-29T10:36:27Z

    
    > On Sep 29, 2025, at 17:32, John Naylor <johncnaylorls@gmail.com> wrote:
    > 
    > On Wed, Sep 24, 2025 at 4:18 PM Chao Li <li.evan.chao@gmail.com> wrote:
    >> 
    >> I found that both EUC_CN and UHC use the same XML file, so I updated both.
    > 
    > When you say "same file", that implies to me the file we have checked
    > in our repo. They have different names and the UHC file is downloaded
    > on demand, so it doesn't seem like we need to change UHC at all to
    > delete gb-18030-2000.xml. Is that right?
    > 
    > -- 
    > John Naylor
    > Amazon Web Services
    
    
    "same file" was a mistake. windows-949-2000.ucm is a different file from gb-18030-2000(2022).ucm.
    
    In theory, we don’t need to change UHC if our goal is to delete gb-18030-2000.xml. However, as you can see, after switching to ucm, UHC, EUC_CN and GB18030 share the same download URL in the Makefile, and their perl scripts use the same logic to process UCM files, so I think it would be good for maintenance.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
  50. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-09-30T06:05:42Z

    On Mon, Sep 29, 2025 at 5:36 PM Chao Li <li.evan.chao@gmail.com> wrote:
    > "same file" was a mistake. windows-949-2000.ucm is a different file from gb-18030-2000(2022).ucm.
    >
    > In theory, we don’t need to change UHC if our goal is to delete gb-18030-2000.xml.
    
    That was my goal, yes. Let's stay focused on that and not change
    unrelated things.
    
    -- 
    John Naylor
    Amazon Web Services
    
    
    
    
  51. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-09-30T06:31:24Z

    On Tue, Sep 30, 2025 at 2:05 PM John Naylor <johncnaylorls@gmail.com> wrote:
    
    > On Mon, Sep 29, 2025 at 5:36 PM Chao Li <li.evan.chao@gmail.com> wrote:
    > > “same file" was a mistake. windows-949-2000.ucm is a different file from
    > gb-18030-2000(2022).ucm.
    > >
    > > In theory, we don’t need to change UHC if our goal is to delete
    > gb-18030-2000.xml.
    >
    > That was my goal, yes. Let's stay focused on that and not change
    > unrelated things.
    >
    >
    Sure, no problem. Please see the attached v4, in which I reverted the UHC
    change from v3. Again, please "git rm" the xml file when you push the commit.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
  52. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-10-02T05:44:21Z

    On Tue, Sep 30, 2025 at 1:31 PM Chao Li <li.evan.chao@gmail.com> wrote:
    > Sure, no problem. Please see the attached v4, I reverted UHC change from v3. Again, please "git rm" the xml file when you push the commit.
    
    Thanks, pushed after correcting the file name in the perl script
    comment. I've marked the CF entry committed.
    
    -- 
    John Naylor
    Amazon Web Services
    
    
    
    
  53. Re: GB18030-2022 Support in PostgreSQL

    Chao Li <li.evan.chao@gmail.com> — 2025-10-03T05:12:29Z

    Hi John,
    
    Thank you again very much for your support.
    
    > On Oct 2, 2025, at 13:44, John Naylor <johncnaylorls@gmail.com> wrote:
    > 
    > 
    > Thanks, pushed after correcting the file name in the perl script
    > comment. I've marked the CF entry committed.
    > 
    
    
    So the work for GB18030 is done.
    
    I just want to check with you on two more items:
    
    * Do we want to switch UHC from using xml to ucm? That would not lead to any map file changes; it would just remove the XML-parsing code, making future maintenance easier.
    
    * For the makefile changes: https://commitfest.postgresql.org/patch/5953/. Suppose a ucm file changes: currently make rebuilds only the map files, and even if the regenerated map files differ, the corresponding .o files are not automatically rebuilt. I ran into this problem when I started working on the gb18030 task. I made my change, but the postgresql binary was not actually rebuilt to include it, which led to confusion and wasted time.
    
    Please let me know. Your guidance is greatly appreciated.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
  54. Re: GB18030-2022 Support in PostgreSQL

    John Naylor <johncnaylorls@gmail.com> — 2025-10-03T06:17:14Z

    On Fri, Oct 3, 2025 at 12:12 PM Chao Li <li.evan.chao@gmail.com> wrote:
    >
    > * Do we want to switch UHC from using xml to ucm? That would not lead to map file change, instead it just removes the code of parsing xml file, making future maintenance easier.
    
    I seriously doubt there will be any future maintenance, in which case
    doing anything is worse than doing nothing. As for the other CF entry,
    that's a separate email thread, and I've already said all I want to
    say there.
    
    -- 
    John Naylor
    Amazon Web Services