Thread

Retiring some encodings?

Michael Paquier <michael@paquier.xyz> — 2025-05-22T05:54:22Z

Hi all,

$subject is something that has been on my mind for a few weeks now,
following the recent events with CVE-2025-4207 (627acc3caa74) and
CVE-2025-1094 (5dc1e42b4fa6).

All the encodings supported are documented here:
https://www.postgresql.org/docs/devel/multibyte.html#MULTIBYTE-CHARSET-SUPPORTED

One pain point in the code is with encoding GB18030, which has the
particularity to require a look at the first two bytes of an input to
know what's the actual length of a multi-byte character sequence.
This is not documented, and it can be a trap in disguise,
particularly with the frontend code (see jsonapi.c).
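
To illustrate the point, here is a minimal sketch in Python (not the
PostgreSQL code itself) of how the length of a GB18030 sequence has to
be worked out.  The name gb18030_char_len is made up for this example,
and the byte ranges reflect my reading of the encoding rules: a lead
byte with the high bit set starts either a two-byte or a four-byte
sequence, and only the second byte (0x30 to 0x39 for the four-byte
form) tells which one it is.

```python
def gb18030_char_len(buf: bytes, pos: int = 0) -> int:
    """Return the byte length of the GB18030 character starting at pos.

    Hypothetical helper for illustration only; it does not validate the
    trailing bytes of the sequence.
    """
    first = buf[pos]
    if first < 0x80:
        # ASCII: the first byte alone is enough.
        return 1
    if pos + 1 >= len(buf):
        # The trap: the length cannot be known without the second byte,
        # so a caller must not read it blindly past the end of the input.
        raise ValueError("truncated GB18030 sequence")
    second = buf[pos + 1]
    # A digit (0x30..0x39) in second position marks a four-byte sequence.
    return 4 if 0x30 <= second <= 0x39 else 2


# One ASCII character, one common CJK character, one supplementary-plane one.
for ch in ("A", "\u4e2d", "\U0001F600"):
    encoded = ch.encode("gb18030")
    print(ascii(ch), encoded.hex(), gb18030_char_len(encoded))
```

Any code that only holds one byte of input has to stop and ask for
more before it can answer the length question, which is the kind of
bounds check that is easy to forget.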

With all that in mind, I have wanted to kick a discussion about
potentially removing one or more encodings from the core code,
including the backend part, the frontend part and the conversion
routines, coupled with checks in pg_upgrade to complain when databases
or collations use the said encoding (the collation part needs
to be checked when not using ICU).  Just being able to remove
GB18030 would do us a favor in the long-term, at least, but there's
more.

I have discussed the matter internally, with a few things pointed out:
- One thing that I was considering first would be the possibility to
add support for pluggable encodings in the backend code, giving an
option for retired encodings to be reloaded back to the server, with a
concept close to what we do for WAL RMGRs with IDs stuck in time once
defined, catalogs using pg_enc.  Encouraging users to have their own
encodings, particularly ones that we'd consider to be unsafe by design
like the GB one may not be a good idea.  But there is always the
argument that users may not want to pay the cost of a set of ALTER
DATABASE commands.  Nobody really liked this idea of putting the
encoding responsibility into an extension :D
- Another idea, which Jeff Davis has mentioned, is around the Unicode
code point U+FFFD (didn't know about this one), which can be used to
replace an incoming character whose value is unknown.  One strategy
would then be to map encodings whose internals are dropped to use
UTF-8 under the hood,
with this character as exit path when finding characters that cannot
be understood, meaning partial and silent data loss.
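
As a rough sketch of what the U+FFFD idea implies, using Python's
built-in codecs rather than anything from the PostgreSQL tree (the
helper name to_utf8_lossy is invented, and "ascii" merely stands in
for an encoding whose internals would have been dropped):

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def to_utf8_lossy(raw: bytes, source_encoding: str) -> bytes:
    """Convert raw bytes to UTF-8, substituting U+FFFD for undecodable input.

    Hypothetical helper; source_encoding stands in for a retired encoding.
    """
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("utf-8")


a = to_utf8_lossy(b"caf\xe9", "ascii")
b = to_utf8_lossy(b"caf\xe8", "ascii")
print(a)       # b'caf\xef\xbf\xbd'
print(a == b)  # True: two different inputs are now indistinguishable
```

The second print is the "silent data loss" part: once a byte has been
replaced, the original value cannot be recovered.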

Another set of things (also mentioned by Jeff as he's been diving into
this area a lot for the last few years with Jeremy Schneider), that
could also help $subject in the long-run, would be to try removing
some code used for non-UTF8 cases.  Some examples:
- downcase_identifier() and pgstrcasecmp.c mention the specific case
of Turkish with 'i' and 'I'.
- Simplify regc_pg_locale.c which is unable to support non-UTF8
encodings with characters of more than 2 bytes.
- pg_wchar's uint type could be removed, switched to a codepoint value
(?) (pointed out by Jeff).
- Varlena cases with non-UTF8, like text_position_setup().
In theory, what we could aim for here is to move forward with non-UTF8
encodings in the server, potentially moving away from libc.  That's a
larger project, so it may be better to try something with some of the
low-hanging fruits like the non-UTF8 cases. 

This last paragraph does not really change my opinion about GB18030: I'd like
to propose its removal for v19 because looking at the first two bytes
of a character sequence to know how long the full sequence is stands
as an exception compared to all the encodings supported by Postgres.
Anyway, at the end, all that is about removing code.  A large majority
of users use UTF-8, we could improve things, so feel free to comment.

Feel free to use this thread if you have different ideas or if you
have any comments.

Thanks,
--
Michael

Re: Retiring some encodings?

Laurenz Albe <laurenz.albe@cybertec.at> — 2025-05-22T08:26:33Z

The obvious question is how many people would suffer because
of that removal, as it would prevent them from using pg_upgrade.

Can anybody who works in a region that uses these encodings make
an educated guess?

Yours,
Laurenz Albe

Re: Retiring some encodings?

Heikki Linnakangas <hlinnaka@iki.fi> — 2025-05-22T11:44:39Z

On 22/05/2025 08:54, Michael Paquier wrote:
> With all that in mind, I have wanted to kick a discussion about
> potentially removing one or more encodings from the core code,
> including the backend part, the frontend part and the conversion
> routines, coupled with checks in pg_upgrade to complain with database
> or collations include the so-said encoding (the collation part needs
> to be checked when not using ICU).  Just being able to removing
> GB18030 would do us a favor in the long-term, at least, but there's
> more.

+1 at high level for deprecating and removing conversions that are not 
widely used anymore. As the first step, we can at least add a warning to 
the documentation, that they will be removed in the future.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Re: Retiring some encodings?

Pavel Stehule <pavel.stehule@gmail.com> — 2025-05-22T12:05:03Z

čt 22. 5. 2025 v 13:44 odesílatel Heikki Linnakangas <hlinnaka@iki.fi>
napsal:

> On 22/05/2025 08:54, Michael Paquier wrote:
> > With all that in mind, I have wanted to kick a discussion about
> > potentially removing one or more encodings from the core code,
> > including the backend part, the frontend part and the conversion
> > routines, coupled with checks in pg_upgrade to complain with database
> > or collations include the so-said encoding (the collation part needs
> > to be checked when not using ICU).  Just being able to removing
> > GB18030 would do us a favor in the long-term, at least, but there's
> > more.
>
> +1 at high level for deprecating and removing conversions that are not
> widely used anymore. As the first step, we can at least add a warning to
> the documentation, that they will be removed in the future.
>

+1

Pavel


> --
> Heikki Linnakangas
> Neon (https://neon.tech)
>
>
>
>

Re: Retiring some encodings?

Bruce Momjian <bruce@momjian.us> — 2025-05-22T14:02:16Z

On Thu, May 22, 2025 at 02:44:39PM +0300, Heikki Linnakangas wrote:
> On 22/05/2025 08:54, Michael Paquier wrote:
> > With all that in mind, I have wanted to kick a discussion about
> > potentially removing one or more encodings from the core code,
> > including the backend part, the frontend part and the conversion
> > routines, coupled with checks in pg_upgrade to complain with database
> > or collations include the so-said encoding (the collation part needs
> > to be checked when not using ICU).  Just being able to removing
> > GB18030 would do us a favor in the long-term, at least, but there's
> > more.
> 
> +1 at high level for deprecating and removing conversions that are not
> widely used anymore. As the first step, we can at least add a warning to the
> documentation, that they will be removed in the future.

Agreed on notification.  A radical idea would be to add a warning for
the use of such encodings in PG 18, and then mention their deprecation
in the PG 18 release notes so everyone is informed they will be removed
in PG 19.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.

Re: Retiring some encodings?

Michael Paquier <michael@paquier.xyz> — 2025-05-23T02:11:09Z

On Thu, May 22, 2025 at 10:02:16AM -0400, Bruce Momjian wrote:
> Agreed on notification.  A radical idea would be to add a warning for
> the use of such encodings in PG 18, and then mention their deprecation
> in the PG 18 release notes so everyone is informed they will be removed
> in PG 19.

With v18beta1 already out in the wild, I think that we are too late
for taking any action on this version at this stage.  Putting a
deprecation notice for a selected set of conversions and/or encodings
and doing the actual removal work when v20 opens up around July 2026
would sound like a better timing here, if the overall consensus goes
in this direction, of course.
--
Michael

Re: Retiring some encodings?

Heikki Linnakangas <hlinnaka@iki.fi> — 2025-05-23T07:18:34Z

On 23/05/2025 05:11, Michael Paquier wrote:
> On Thu, May 22, 2025 at 10:02:16AM -0400, Bruce Momjian wrote:
>> Agreed on notification.  A radical idea would be to add a warning for
>> the use of such encodings in PG 18, and then mention their deprecation
>> in the PG 18 release notes so everyone is informed they will be removed
>> in PG 19.
> 
> With v18beta1 already out in the wild, I think that we are too late
> for taking any action on this version at this stage.  Putting a
> deprecation notice for a selected set of conversions and/or encodings
> and do the actual removal work when v20 opens up around July 2026
> would sound like a better timing here, if the overall consensus goes
> in this direction, of course.

If we plan to remove something in the future, I think putting a 
deprecation notice in the docs in v18 is still a good idea. There's no 
point in hiding the plan by not documenting it sooner. The more advance 
notice people get the better.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Re: Retiring some encodings?

Daniel Gustafsson <daniel@yesql.se> — 2025-05-23T08:21:42Z

> On 23 May 2025, at 09:18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> If we plan to remove something in the future, I think putting a deprecation notice in the docs in v18 is still a good idea. There's no point in hiding the plan by not documenting it sooner. The more advance notice people get the better.

+1

--
Daniel Gustafsson

Re: Retiring some encodings?

wenhui qiu <qiuwenhuifx@gmail.com> — 2025-05-23T09:08:35Z

Hi,
> The obvious question is how many people would suffer because
> of that removal, as it would prevent them from using pg_upgrade.

> Can anybody who works in a region that uses these encodings make
> an educated guess?
+1, agreed.  GB18030 is an encoding standard in China.  Removing it
would affect the adoption of PostgreSQL in China, where PostgreSQL is
becoming more and more popular, so this needs to be considered carefully!

On Fri, May 23, 2025 at 4:22 PM Daniel Gustafsson <daniel@yesql.se> wrote:

> > On 23 May 2025, at 09:18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> > If we plan to remove something in the future, I think putting a
> deprecation notice in the docs in v18 is still a good idea. There's no
> point in hiding the plan by not documenting it sooner. The more advance
> notice people get the better.
>
> +1
>
> --
> Daniel Gustafsson
>
>
>
>

Re: Retiring some encodings?

Daniel Gustafsson <daniel@yesql.se> — 2025-05-23T09:28:32Z

> On 23 May 2025, at 11:08, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
> 
> HI 
> > The obvious question is how many people would suffer because
> > of that removal, as it would prevent them from using pg_upgrade.
> 
> > Can anybody who works in a region that uses these encodings make
> > an educated guess?
> +1 Agree ,GB18030 A coding standard in China, if deleted, will have an impact on the application of postgresql in China, and China is now experiencing more and more hot postgresql heat, need to consider carefully!

Thanks for the input, that's exactly what we need to make informed decisions.
How prevalent is GB18030 usage, is it used in all postgres installations in
China, most of them or in some particular cases?

--
Daniel Gustafsson

Re: Retiring some encodings?

Tatsuo Ishii <ishii@postgresql.org> — 2025-05-23T10:58:46Z

>> On 23 May 2025, at 11:08, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
>> 
>> HI 
>> > The obvious question is how many people would suffer because
>> > of that removal, as it would prevent them from using pg_upgrade.
>> 
>> > Can anybody who works in a region that uses these encodings make
>> > an educated guess?
>> +1 Agree ,GB18030 A coding standard in China, if deleted, will have an impact on the application of postgresql in China, and China is now experiencing more and more hot postgresql heat, need to consider carefully!
> 
> Thanks for the input, that's exactly what we need to make informed decisions.
> How prevalent is GB18030 usage, is it used in all postgres installations in
> China, most of them or in some particular cases?

Another point is, whether other DBMS support GB18030 or not. If they
support, but PostgreSQL would not in the future, that could be a
reason to move away from PostgreSQL.

As far as I know, MySQL, Oracle and SQL Server support GB18030.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

Re: Retiring some encodings?

Michael Paquier <michael@paquier.xyz> — 2025-05-24T00:13:33Z

On Fri, May 23, 2025 at 07:58:46PM +0900, Tatsuo Ishii wrote:
> Another point is, whether other DBMS support GB18030 or not. If they
> support, but PostgreSQL would not in the future, that could be a
> reason to move away from PostgreSQL.

Yeah, that's a good point.  I would also question what's the benefit
in using GB18030 over UTF-8, though.  An obvious one I can see is
that legacy applications never get updated.

On my side, I'll try to grab some actual numbers or at least a trend
of them.
--
Michael

Re: Retiring some encodings?

Tatsuo Ishii <ishii@postgresql.org> — 2025-05-24T02:23:23Z

> Yeah, that's a good point.  I would also question what's the benefit
> in using GB18030 over UTF-8, though.  An obvious one I can see is
> because legacy applications never get updated.

Plus users have too many GB18030 encoded files, I guess.
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

Re: Retiring some encodings?

DEVOPS_WwIT <devops@ww-it.cn> — 2025-05-25T00:58:13Z

Hi Michael

> Yeah, that's a good point.  I would also question what's the benefit
> in using GB18030 over UTF-8, though.  An obvious one I can see is
> because legacy applications never get updated.
>
The GB18030 encoding standard is a mandatory Chinese character encoding 
standard required by regulations. Software sold and used in China must 
support GB18030, with its latest version being the 2023 edition. The 
primary advantage of GB18030 is that most Chinese characters require 
only 2 bytes for storage, whereas UTF-8 necessitates 3 bytes for the 
same characters. This makes GB18030 significantly more storage-efficient 
compared to UTF-8 in terms of space utilization.
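
The size difference is easy to check with Python's built-in codecs.
This is only a small sketch; the function name encoded_sizes and the
sample strings are chosen for the example:

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (GB18030 byte count, UTF-8 byte count) for text."""
    return len(text.encode("gb18030")), len(text.encode("utf-8"))


samples = {
    "Chinese": "\u4e2d\u6587",  # two common Chinese characters
    "ASCII": "PostgreSQL",
}

for label, text in samples.items():
    gb, utf8 = encoded_sizes(text)
    print(f"{label}: {len(text)} chars, GB18030={gb} bytes, UTF-8={utf8} bytes")
```

The saving applies to characters that fall in the two-byte range.
ASCII takes one byte either way, and four-byte GB18030 sequences are
no smaller than their UTF-8 form.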

Tony

Re: Retiring some encodings?

Andrew Dunstan <andrew@dunslane.net> — 2025-05-26T16:07:02Z

On 2025-05-24 Sa 8:58 PM, DEVOPS_WwIT wrote:
>
> Hi Michael
>
>> Yeah, that's a good point.  I would also question what's the benefit
>> in using GB18030 over UTF-8, though.  An obvious one I can see is
>> because legacy applications never get updated.
>>
> The GB18030 encoding standard is a mandatory Chinese character 
> encoding standard required by regulations. Software sold and used in 
> China must support GB18030, with its latest version being the 2023 
> edition. The primary advantage of GB18030 is that most Chinese 
> characters require only 2 bytes for storage, whereas UTF-8 
> necessitates 3 bytes for the same characters. This makes GB18030 
> significantly more storage-efficient compared to UTF-8 in terms of 
> space utilization.
>
>

Given this, removing it seems like a non-starter.


cheers


andrew


--
Andrew Dunstan
EDB:https://www.enterprisedb.com

Re: Retiring some encodings?

Daniel Gustafsson <daniel@yesql.se> — 2025-05-26T16:54:49Z

> On 26 May 2025, at 18:07, Andrew Dunstan <andrew@dunslane.net> wrote:
> On 2025-05-24 Sa 8:58 PM, DEVOPS_WwIT wrote:

>> The GB18030 encoding standard is a mandatory Chinese character encoding standard required by regulations. Software sold and used in China must support GB18030, with its latest version being the 2023 edition. The primary advantage of GB18030 is that most Chinese characters require only 2 bytes for storage, whereas UTF-8 necessitates 3 bytes for the same characters. This makes GB18030 significantly more storage-efficient compared to UTF-8 in terms of space utilization.
> 
> Given this, removing it seems like a non-starter.

Agreed, it seems very unappealing to remove something so important to such a
large userbase.

--
Daniel Gustafsson

Re: Retiring some encodings?

Michael Paquier <michael@paquier.xyz> — 2025-05-27T00:07:13Z

On Mon, May 26, 2025 at 06:54:49PM +0200, Daniel Gustafsson wrote:
> Agreed, it seems very unappealing to remove something so important to such a
> large userbase.

Agreed, a state-level requirement like this makes the removal a
non-starter.
--
Michael

Re: Retiring some encodings?

Christoph Berg <myon@debian.org> — 2025-06-05T13:35:19Z

Re: Michael Paquier
> On Mon, May 26, 2025 at 06:54:49PM +0200, Daniel Gustafsson wrote:
> > Agreed, it seems very unappealing to remove something so important to such a
> > large userbase.
> 
> Agreed that the so-said "state" level requirement would be a
> non-starter.

Or maybe support for using these as server encodings could be
removed, keeping the client_encoding part intact?

Christoph

Re: Retiring some encodings?

Kenneth Marshall <ktm@rice.edu> — 2025-06-05T15:14:58Z

On Thu, Jun 05, 2025 at 03:35:19PM +0200, Christoph Berg wrote:
> Re: Michael Paquier
> > On Mon, May 26, 2025 at 06:54:49PM +0200, Daniel Gustafsson wrote:
> > > Agreed, it seems very unappealing to remove something so important to such a
> > > large userbase.
> > 
> > Agreed that the so-said "state" level requirement would be a
> > non-starter.
> 
> Or maybe support for using these as server encodings could be
> removed, keeping the client_encoding part intact?
> 
> Christoph
> 

Hi,

Doesn't the ICU system support this encoding? They could just use it if
you still want to remove our own implementation.

Regards,
Ken

Re: Retiring some encodings?

Tatsuo Ishii <ishii@postgresql.org> — 2025-06-05T23:50:56Z

>> Agreed that the so-said "state" level requirement would be a
>> non-starter.
> 
> Or maybe support for using these as server encodings could be
> removed, keeping the client_encoding part intact?

GB18030 is already a client-only encoding, and cannot be used as a
server encoding. The only way to store GB18030 data in a database is
to convert it from GB18030 to UTF-8 (which can be done automatically).
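
As a rough illustration of what that automatic conversion amounts to
(plain Python codecs here, not PostgreSQL's conversion routines, and
the function names are invented for the example): the client sends
GB18030 bytes, the server stores UTF-8, and the reverse conversion is
applied on the way back.

```python
def client_to_server(gb18030_bytes: bytes) -> bytes:
    """Convert client-side GB18030 bytes to the UTF-8 form that is stored."""
    return gb18030_bytes.decode("gb18030").encode("utf-8")


def server_to_client(utf8_bytes: bytes) -> bytes:
    """Convert stored UTF-8 bytes back to GB18030 for the client."""
    return utf8_bytes.decode("utf-8").encode("gb18030")


sent = "\u4e2d\u6587".encode("gb18030")
stored = client_to_server(sent)
print(sent.hex(), "->", stored.hex())
print(server_to_client(stored) == sent)  # round trip preserves the bytes
```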

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

Re: Retiring some encodings?

Andres Freund <andres@anarazel.de> — 2025-06-06T00:05:20Z

Hi,

On 2025-05-22 14:54:22 +0900, Michael Paquier wrote:
> All the encodings supported are documented here:
> https://www.postgresql.org/docs/devel/multibyte.html#MULTIBYTE-CHARSET-SUPPORTED

There has been plenty of discussion about GB18030, and it seems we aren't likely
to be able to drop that.

I think there are a lot easier cases though. The easiest probably is
MULE_INTERNAL - all discussions referencing it seem to be about oddities of
MULE_INTERNAL, not about using it.  I think it's been effectively unused since
its introduction.  Due to not even having a conversion path to UTF-8, it's
really not practically usable IMO.

Greetings,

Andres Freund

Re: Retiring some encodings?

Michael Paquier <michael@paquier.xyz> — 2025-06-06T01:42:20Z

On Thu, Jun 05, 2025 at 08:05:20PM -0400, Andres Freund wrote:
> There has been plenty discussion about GB18030, and it seems we aren't likely
> to be able to drop that.

Yes, as per upthread.

> I think there are a lot easier cases though. The easiest probably is
> MULE_INTERNAL - all discussions referencing it seem to be about oddities of
> MULE_INTERNAL, not about using it.  I think it's been effectively unused since
> it's introduction.  Due to not even having a conversion path to UTF-8 it's
> really not practically usable IMO.

Perhaps, yes.  I still need to do some homework here and gather some
data to share, FWIW.
--
Michael