Filename: v6-0001-Generate-GB18030-mappings-from-the-Unicode-Consor.patch
Type: application/octet-stream
Part: 1
Message: Re: GB18030-2022 Support in PostgreSQL

Patch

Format: format-patch
Series: patch v6-0001
Subject: Generate GB18030 mappings from the Unicode Consortium's UCM file
File                                             +   -
src/backend/utils/mb/Unicode/Makefile            4   1
src/backend/utils/mb/Unicode/UCS_to_GB18030.pl  36  10
From 435c5ac8e07084e011e64fb205fb7b7e5aa1e335 Mon Sep 17 00:00:00 2001
From: "Chao Li (Evan)" <lic@highgo.com>
Date: Mon, 11 Aug 2025 18:06:07 +0800
Subject: [PATCH v6 1/3] Generate GB18030 mappings from the Unicode
 Consortium's UCM file

Previously, we built the .map files for GB18030 (version 2000) from an
XML file. The 2022 version of this encoding is only available as a
Unicode Character Mapping (UCM) file, so as preparatory refactoring,
switch to this format as the source for building version 2000.

As we do with most input files for the conversion mappings, download
the file on demand. To generate the same mappings we have now, we
must download from an earlier upstream commit rather than the head,
since the head contains a correction not present in our current .map
files.

The XML file is still used by EUC_CN, so we cannot delete it from our
repository. GB18030 is a superset of EUC_CN, so it may be possible to
build EUC_CN from the same UCM file, but that is left for future work.

Author: Chao Li <lic@highgo.com>
Discussion: https://postgr.es/m/966d9fc.169.198741fe60b.Coremail.jiaoshuntian%40highgo.com
---
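
Note on the UCM line format: the sketch below is not part of the patch. It
illustrates, under stated assumptions, how one CHARMAP line is split into a
code point, a GB18030 code, and a flag, using the same regular expression as
the script. The helper name parse_ucm_line and the sample lines are
hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the patch): parse one CHARMAP line of a UCM
# file.  Returns (code point, GB18030 code, flag) as a list, or an empty
# list if the line is not a mapping line.
sub parse_ucm_line
{
	my ($line) = @_;
	return
	  unless $line =~
	  /^<U([0-9A-Fa-f]+)>\s+((?:\\x[0-9A-Fa-f]{2})+)\s*\|(\d+)/;
	my ($u, $bytes, $flag) = ($1, $2, $3);

	# Drop the "\x" prefixes so the bytes form a single hex string.
	$bytes =~ s/\\x//g;
	return (hex($u), hex($bytes), $flag);
}

# Sample input written for illustration only; the first line reuses the
# U+009A example from the script's old header comment.
my @sample = ('<U009A> \x81\x30\x83\x36 |0', '# a comment line');

for my $line (@sample)
{
	# An empty list from the helper means "not a mapping line".
	my ($ucs, $code, $flag) = parse_ucm_line($line) or next;

	# Keep only round-trip mappings, as the script does.
	next if $flag ne '0';
	printf "U+%04X -> 0x%X\n", $ucs, $code;
}
```

The helper returns the flag unfiltered, so a caller that also wants the
non-round-trip entries can choose its own policy.
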
 src/backend/utils/mb/Unicode/Makefile         |  5 +-
 .../utils/mb/Unicode/UCS_to_GB18030.pl        | 46 +++++++++++++++----
 2 files changed, 40 insertions(+), 11 deletions(-)

diff --git a/src/backend/utils/mb/Unicode/Makefile b/src/backend/utils/mb/Unicode/Makefile
index ad789b31e54..27424b2a001 100644
--- a/src/backend/utils/mb/Unicode/Makefile
+++ b/src/backend/utils/mb/Unicode/Makefile
@@ -54,7 +54,7 @@ $(eval $(call map_rule,euc_cn,UCS_to_EUC_CN.pl,gb-18030-2000.xml))
 $(eval $(call map_rule,euc_kr,UCS_to_EUC_KR.pl,KSX1001.TXT))
 $(eval $(call map_rule,euc_tw,UCS_to_EUC_TW.pl,CNS11643.TXT))
 $(eval $(call map_rule,sjis,UCS_to_SJIS.pl,CP932.TXT))
-$(eval $(call map_rule,gb18030,UCS_to_GB18030.pl,gb-18030-2000.xml))
+$(eval $(call map_rule,gb18030,UCS_to_GB18030.pl,gb-18030-2000.ucm))
 $(eval $(call map_rule,big5,UCS_to_BIG5.pl,CP950.TXT BIG5.TXT CP950.TXT))
 $(eval $(call map_rule,euc_jis_2004,UCS_to_EUC_JIS_2004.pl,euc-jis-2004-std.txt))
 $(eval $(call map_rule,shift_jis_2004,UCS_to_SHIFT_JIS_2004.pl,sjis-0213-2004-std.txt))
@@ -78,6 +78,9 @@ euc-jis-2004-std.txt sjis-0213-2004-std.txt:
 gb-18030-2000.xml windows-949-2000.xml:
 	$(DOWNLOAD) https://raw.githubusercontent.com/unicode-org/icu-data/master/charset/data/xml/$(@F)
 
+gb-18030-2000.ucm:
+	$(DOWNLOAD) https://raw.githubusercontent.com/unicode-org/icu-data/d9d3a6ed27bb98a7106763e940258f0be8cd995b/charset/data/ucm/$(@F)
+
 GB2312.TXT:
 	$(DOWNLOAD) 'http://trac.greenstone.org/browser/trunk/gsdl/unicode/MAPPINGS/EASTASIA/GB/GB2312.TXT?rev=1842&format=txt'
 
diff --git a/src/backend/utils/mb/Unicode/UCS_to_GB18030.pl b/src/backend/utils/mb/Unicode/UCS_to_GB18030.pl
index ddcbd6ef0c4..658e0d59e2c 100755
--- a/src/backend/utils/mb/Unicode/UCS_to_GB18030.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_GB18030.pl
@@ -4,14 +4,17 @@
 #
 # src/backend/utils/mb/Unicode/UCS_to_GB18030.pl
 #
+
 # Generate UTF-8 <--> GB18030 code conversion tables from
-# "gb-18030-2000.xml", obtained from
-# http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/
+# "gb-18030-2000.ucm", a Unicode Character Mapping file (UCM) from ICU,
+# obtained from https://github.com/unicode-org/icu-data/blob/d9d3a6ed27bb98a7106763e940258f0be8cd995b/charset/data/ucm/gb-18030-2000.ucm
 #
 # The lines we care about in the source file look like
-#    <a u="009A" b="81 30 83 36"/>
-# where the "u" field is the Unicode code point in hex,
-# and the "b" field is the hex byte sequence for GB18030
+#   <UXXXX> \xYY[\xYY...] |n
+# where <UXXXX> is the Unicode code point in hex,
+# \xYY... is the hex byte sequence for GB18030,
+# and n is a flag indicating the type of mapping.
+#
 
 use strict;
 use warnings FATAL => 'all';
@@ -22,19 +25,42 @@ my $this_script = 'src/backend/utils/mb/Unicode/UCS_to_GB18030.pl';
 
 # Read the input
 
-my $in_file = "gb-18030-2000.xml";
+my $in_file = "gb-18030-2000.ucm";
 
 open(my $in, '<', $in_file) || die("cannot open $in_file");
 
 my @mapping;
+my $in_charmap = 0;
 
 while (<$in>)
 {
-	next if (!m/<a u="([0-9A-F]+)" b="([0-9A-F ]+)"/);
-	my ($u, $c) = ($1, $2);
-	$c =~ s/ //g;
+	chomp;
+	# Enter CHARMAP section
+	if (/^CHARMAP/) {
+		$in_charmap = 1;
+		next;
+	}
+	# Exit CHARMAP section
+	if (/^END CHARMAP/) {
+		$in_charmap = 0;
+		last;
+	}
+	next unless $in_charmap;
+	# Skip comments and empty lines
+	next if /^#/ || /^$/;
+
+	# Match lines like: <UXXXX> \xYY[\xYY...] |n
+	next if !/^<U([0-9A-Fa-f]+)>\s+((?:\\x[0-9A-Fa-f]{2})+)\s*\|(\d+)/;
+	my ($u, $c, $flag) = ($1, $2, $3);
+
+	# Flag 0 marks a round-trip mapping; those are the only ones we use
+	next if ($flag ne '0');
+
 	my $ucs = hex($u);
-	my $code = hex($c);
+	# Remove \x and concatenate bytes
+	my $c_hex = $c;
+	$c_hex =~ s/\\x//g;
+	my $code = hex($c_hex);
 	if ($code >= 0x80 && $ucs >= 0x0080)
 	{
 		push @mapping,
-- 
2.39.5 (Apple Git-154)