Thread
-
Retiring some encodings?
Michael Paquier <michael@paquier.xyz> — 2025-05-22T05:54:22Z
Hi all, $subject is something that has been on my mind for a few weeks now, following the recent events with CVE-2025-4207 (627acc3caa74) and CVE-2025-1094 (5dc1e42b4fa6). All the encodings supported are documented here: https://www.postgresql.org/docs/devel/multibyte.html#MULTIBYTE-CHARSET-SUPPORTED One pain point in the code is with encoding GB18030, which has the particularity of requiring a look at the first two bytes of an input to know the actual length of a multi-byte character sequence. This is not documented, and it can be a trap in disguise, particularly with the frontend code (see jsonapi.c). With all that in mind, I wanted to kick off a discussion about potentially removing one or more encodings from the core code, including the backend part, the frontend part and the conversion routines, coupled with checks in pg_upgrade to complain when databases or collations use the said encoding (the collation part needs to be checked when not using ICU). Just being able to remove GB18030 would do us a favor in the long term, at least, but there's more. I have discussed the matter internally, with a few things pointed out: - One thing that I was considering first would be the possibility to add support for pluggable encodings in the backend code, giving an option for retired encodings to be reloaded back to the server, with a concept close to what we do for WAL RMGRs with IDs stuck in time once defined, catalogs using pg_enc. Encouraging users to have their own encodings, particularly ones that we'd consider to be unsafe by design like the GB one, may not be a good idea. But there is always the argument that users may not want to pay the cost of a set of ALTER DATABASE commands. Nobody really liked this idea of putting the encoding responsibility into an extension :D - Another idea, which Jeff Davis has mentioned, is around Unicode code point U+FFFD (didn't know about this one), which can be used to replace an incoming character whose value is unknown.
One strategy would then be to map encodings whose internals are dropped to use UTF-8 under the hood, with this character as an exit path when finding characters that cannot be understood, meaning partial and silent data loss. Another set of things (also mentioned by Jeff as he's been diving into this area a lot for the last few years with Jeremy Schneider), that could also help $subject in the long run, would be to try removing some code used for non-UTF8 cases. Some examples: - downcase_identifier() and pgstrcasecmp.c mention the specific case of Turkish with 'i' and 'I'. - Simplify regc_pg_locale.c which is unable to support non-UTF8 encodings with characters of more than 2 bytes. - pg_wchar's uint type could be removed, switched to a codepoint value (?) (pointed out by Jeff). - Varlena cases with non-UTF8, like text_position_setup(). In theory, what we could aim for here is to move forward with non-UTF8 encodings in the server, potentially moving away from libc. That's a larger project, so it may be better to try something with some of the low-hanging fruit like the non-UTF8 cases. This last paragraph does not really change my opinion about GB18030: I'd like to propose its removal for v19 because looking at the first two bytes of a character sequence to know how long the full sequence is stands as an exception compared to all the encodings supported by Postgres. Anyway, at the end, all that is about removing code. A large majority of users use UTF-8, we could improve things, so feel free to comment. Feel free to use this thread if you have different ideas or if you have any comments. Thanks, -- Michael
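A minimal sketch of the GB18030 length rule described above, written in Python for illustration only. It is not PostgreSQL's code; the function name gb18030_char_len is hypothetical, and the byte ranges are the ones from the GB18030 standard as I understand them (ASCII is one byte; a lead byte 0x81-0xFE followed by 0x30-0x39 starts a four-byte sequence, otherwise a two-byte one). The point is that the lead byte alone is not enough to know the length.

```python
def gb18030_char_len(buf: bytes, pos: int = 0) -> int:
    """Return the byte length of the GB18030 sequence starting at buf[pos].

    Hypothetical helper for illustration; raises ValueError when the
    sequence is truncated or its first two bytes are not valid.
    """
    first = buf[pos]
    if first <= 0x7F:
        return 1  # ASCII: the first byte is enough
    if not 0x81 <= first <= 0xFE:
        raise ValueError("invalid lead byte")
    if pos + 1 >= len(buf):
        # Unlike the other encodings, the length cannot be known yet.
        raise ValueError("truncated: second byte needed to know the length")
    second = buf[pos + 1]
    if 0x30 <= second <= 0x39:
        return 4  # four-byte form: second byte is an ASCII digit
    if 0x40 <= second <= 0xFE and second != 0x7F:
        return 2  # two-byte form
    raise ValueError("invalid second byte")


# Compare against the lengths produced by Python's own gb18030 codec.
for ch in ("a", "\u4e2d", "\U0001F600"):
    raw = ch.encode("gb18030")
    print(repr(ch), raw.hex(), gb18030_char_len(raw), len(raw))
```

A caller that only holds the first byte of a multi-byte character has to treat the input as incomplete, which is the kind of trap mentioned above for code that assumes the lead byte decides the length.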
-
Re: Retiring some encodings?
Laurenz Albe <laurenz.albe@cybertec.at> — 2025-05-22T08:26:33Z
The obvious question is how many people would suffer because of that removal, as it would prevent them from using pg_upgrade. Can anybody who works in a region that uses these encodings make an educated guess? Yours, Laurenz Albe
-
Re: Retiring some encodings?
Heikki Linnakangas <hlinnaka@iki.fi> — 2025-05-22T11:44:39Z
On 22/05/2025 08:54, Michael Paquier wrote: > With all that in mind, I have wanted to kick a discussion about > potentially removing one or more encodings from the core code, > including the backend part, the frontend part and the conversion > routines, coupled with checks in pg_upgrade to complain with database > or collations include the so-said encoding (the collation part needs > to be checked when not using ICU). Just being able to removing > GB18030 would do us a favor in the long-term, at least, but there's > more. +1 at high level for deprecating and removing conversions that are not widely used anymore. As the first step, we can at least add a warning to the documentation, that they will be removed in the future. -- Heikki Linnakangas Neon (https://neon.tech)
-
Re: Retiring some encodings?
Pavel Stehule <pavel.stehule@gmail.com> — 2025-05-22T12:05:03Z
On Thu, 22 May 2025 at 13:44, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > On 22/05/2025 08:54, Michael Paquier wrote: > > With all that in mind, I have wanted to kick a discussion about > > potentially removing one or more encodings from the core code, > > including the backend part, the frontend part and the conversion > > routines, coupled with checks in pg_upgrade to complain with database > > or collations include the so-said encoding (the collation part needs > > to be checked when not using ICU). Just being able to removing > > GB18030 would do us a favor in the long-term, at least, but there's > > more. > > +1 at high level for deprecating and removing conversions that are not > widely used anymore. As the first step, we can at least add a warning to > the documentation, that they will be removed in the future. > +1 Pavel > -- > Heikki Linnakangas > Neon (https://neon.tech)
-
Re: Retiring some encodings?
Bruce Momjian <bruce@momjian.us> — 2025-05-22T14:02:16Z
On Thu, May 22, 2025 at 02:44:39PM +0300, Heikki Linnakangas wrote: > On 22/05/2025 08:54, Michael Paquier wrote: > > With all that in mind, I have wanted to kick a discussion about > > potentially removing one or more encodings from the core code, > > including the backend part, the frontend part and the conversion > > routines, coupled with checks in pg_upgrade to complain with database > > or collations include the so-said encoding (the collation part needs > > to be checked when not using ICU). Just being able to removing > > GB18030 would do us a favor in the long-term, at least, but there's > > more. > > +1 at high level for deprecating and removing conversions that are not > widely used anymore. As the first step, we can at least add a warning to the > documentation, that they will be removed in the future. Agreed on notification. A radical idea would be to add a warning for the use of such encodings in PG 18, and then mention their deprecation in the PG 18 release notes so everyone is informed they will be removed in PG 19. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Do not let urgent matters crowd out time for investment in the future.
-
Re: Retiring some encodings?
Michael Paquier <michael@paquier.xyz> — 2025-05-23T02:11:09Z
On Thu, May 22, 2025 at 10:02:16AM -0400, Bruce Momjian wrote: > Agreed on notification. A radical idea would be to add a warning for > the use of such encodings in PG 18, and then mention their deprecation > in the PG 18 release notes so everyone is informed they will be removed > in PG 19. With v18beta1 already out in the wild, I think that we are too late for taking any action on this version at this stage. Putting a deprecation notice in place for a selected set of conversions and/or encodings, and doing the actual removal work when v20 opens up around July 2026, would sound like better timing here, if the overall consensus goes in this direction, of course. -- Michael
-
Re: Retiring some encodings?
Heikki Linnakangas <hlinnaka@iki.fi> — 2025-05-23T07:18:34Z
On 23/05/2025 05:11, Michael Paquier wrote: > On Thu, May 22, 2025 at 10:02:16AM -0400, Bruce Momjian wrote: >> Agreed on notification. A radical idea would be to add a warning for >> the use of such encodings in PG 18, and then mention their deprecation >> in the PG 18 release notes so everyone is informed they will be removed >> in PG 19. > > With v18beta1 already out in the wild, I think that we are too late > for taking any action on this version at this stage. Putting a > deprecation notice for a selected set of conversions and/or encodings > and do the actual removal work when v20 opens up around July 2026 > would sound like a better timing here, if the overall consensus goes > in this direction, of course. If we plan to remove something in the future, I think putting a deprecation notice in the docs in v18 is still a good idea. There's no point in hiding the plan by not documenting it sooner. The more advance notice people get the better. -- Heikki Linnakangas Neon (https://neon.tech)
-
Re: Retiring some encodings?
Daniel Gustafsson <daniel@yesql.se> — 2025-05-23T08:21:42Z
> On 23 May 2025, at 09:18, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > If we plan to remove something in the future, I think putting a deprecation notice in the docs in v18 is still a good idea. There's no point in hiding the plan by not documenting it sooner. The more advance notice people get the better. +1 -- Daniel Gustafsson
-
Re: Retiring some encodings?
wenhui qiu <qiuwenhuifx@gmail.com> — 2025-05-23T09:08:35Z
Hi, > The obvious question is how many people would suffer because > of that removal, as it would prevent them from using pg_upgrade. > Can anybody who works in a region that uses these encodings make > an educated guess? +1, agreed. GB18030 is a coding standard in China; removing it would affect the adoption of PostgreSQL in China, where PostgreSQL is becoming increasingly popular, so this needs to be considered carefully. On Fri, May 23, 2025 at 4:22 PM Daniel Gustafsson <daniel@yesql.se> wrote: > > On 23 May 2025, at 09:18, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > > If we plan to remove something in the future, I think putting a > deprecation notice in the docs in v18 is still a good idea. There's no > point in hiding the plan by not documenting it sooner. The more advance > notice people get the better. > > +1 > > -- > Daniel Gustafsson
-
Re: Retiring some encodings?
Daniel Gustafsson <daniel@yesql.se> — 2025-05-23T09:28:32Z
> On 23 May 2025, at 11:08, wenhui qiu <qiuwenhuifx@gmail.com> wrote: > > HI > > The obvious question is how many people would suffer because > > of that removal, as it would prevent them from using pg_upgrade. > > > Can anybody who works in a region that uses these encodings make > > an educated guess? > +1 Agree ,GB18030 A coding standard in China, if deleted, will have an impact on the application of postgresql in China, and China is now experiencing more and more hot postgresql heat, need to consider carefully! Thanks for the input, that's exactly what we need to make informed decisions. How prevalent is GB18030 usage, is it used in all postgres installations in China, most of them or in some particular cases? -- Daniel Gustafsson
-
Re: Retiring some encodings?
Tatsuo Ishii <ishii@postgresql.org> — 2025-05-23T10:58:46Z
>> On 23 May 2025, at 11:08, wenhui qiu <qiuwenhuifx@gmail.com> wrote: >> >> HI >> > The obvious question is how many people would suffer because >> > of that removal, as it would prevent them from using pg_upgrade. >> >> > Can anybody who works in a region that uses these encodings make >> > an educated guess? >> +1 Agree ,GB18030 A coding standard in China, if deleted, will have an impact on the application of postgresql in China, and China is now experiencing more and more hot postgresql heat, need to consider carefully! > > Thanks for the input, that's exactly what we need to make informed decisions. > How prevalent is GB18030 usage, is it used in all postgres installations in > China, most of them or in some particular cases? Another point is whether other DBMSs support GB18030 or not. If they support it but PostgreSQL does not in the future, that could be a reason to move away from PostgreSQL. As far as I know, MySQL, Oracle and SQL Server support GB18030. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Retiring some encodings?
Michael Paquier <michael@paquier.xyz> — 2025-05-24T00:13:33Z
On Fri, May 23, 2025 at 07:58:46PM +0900, Tatsuo Ishii wrote: > Another point is, whether other DBMS support GB18030 or not. If they > support, but PostgreSQL would not in the future, that could be a > reason to move away from PostgreSQL. Yeah, that's a good point. I would also question what's the benefit in using GB18030 over UTF-8, though. An obvious one I can see is because legacy applications never get updated. On my side, I'll try to grab some actual numbers or at least a trend of them. -- Michael
-
Re: Retiring some encodings?
Tatsuo Ishii <ishii@postgresql.org> — 2025-05-24T02:23:23Z
> Yeah, that's a good point. I would also question what's the benefit > in using GB18030 over UTF-8, though. An obvious one I can see is > because legacy applications never get updated. Plus users have too many GB18030 encoded files, I guess. -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Retiring some encodings?
DEVOPS_WwIT <devops@ww-it.cn> — 2025-05-25T00:58:13Z
Hi Michael > Yeah, that's a good point. I would also question what's the benefit > in using GB18030 over UTF-8, though. An obvious one I can see is > because legacy applications never get updated. > The GB18030 encoding standard is a mandatory Chinese character encoding standard required by regulations. Software sold and used in China must support GB18030, with its latest version being the 2023 edition. The primary advantage of GB18030 is that most Chinese characters require only 2 bytes for storage, whereas UTF-8 necessitates 3 bytes for the same characters. This makes GB18030 significantly more storage-efficient compared to UTF-8 in terms of space utilization. Tony
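A small sketch of the storage arithmetic described above, using Python's built-in gb18030 and utf-8 codecs. The sample strings are arbitrary choices for illustration; the five Chinese characters are common simplified characters, which take two bytes each in GB18030 and three bytes each in UTF-8.

```python
# Arbitrary sample: five common simplified Chinese characters
# ("database encoding"), written with escapes to stay ASCII-safe.
chinese = "\u6570\u636e\u5e93\u7f16\u7801"
ascii_text = "PostgreSQL"

for label, text in (("chinese", chinese), ("ascii", ascii_text)):
    gb_bytes = text.encode("gb18030")
    utf8_bytes = text.encode("utf-8")
    print(f"{label}: {len(text)} chars, "
          f"gb18030={len(gb_bytes)} bytes, utf-8={len(utf8_bytes)} bytes")
```

For text made up entirely of such characters, the GB18030 form is two thirds the size of the UTF-8 form. ASCII text is the same size in both, so the real-world saving depends on how much of the stored data is Chinese text.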
-
Re: Retiring some encodings?
Andrew Dunstan <andrew@dunslane.net> — 2025-05-26T16:07:02Z
On 2025-05-24 Sa 8:58 PM, DEVOPS_WwIT wrote: > > Hi Michael > >> Yeah, that's a good point. I would also question what's the benefit >> in using GB18030 over UTF-8, though. An obvious one I can see is >> because legacy applications never get updated. >> > The GB18030 encoding standard is a mandatory Chinese character > encoding standard required by regulations. Software sold and used in > China must support GB18030, with its latest version being the 2023 > edition. The primary advantage of GB18030 is that most Chinese > characters require only 2 bytes for storage, whereas UTF-8 > necessitates 3 bytes for the same characters. This makes GB18030 > significantly more storage-efficient compared to UTF-8 in terms of > space utilization. > > Given this, removing it seems like a non-starter. cheers andrew -- Andrew Dunstan EDB:https://www.enterprisedb.com
-
Re: Retiring some encodings?
Daniel Gustafsson <daniel@yesql.se> — 2025-05-26T16:54:49Z
> On 26 May 2025, at 18:07, Andrew Dunstan <andrew@dunslane.net> wrote: > On 2025-05-24 Sa 8:58 PM, DEVOPS_WwIT wrote: >> The GB18030 encoding standard is a mandatory Chinese character encoding standard required by regulations. Software sold and used in China must support GB18030, with its latest version being the 2023 edition. The primary advantage of GB18030 is that most Chinese characters require only 2 bytes for storage, whereas UTF-8 necessitates 3 bytes for the same characters. This makes GB18030 significantly more storage-efficient compared to UTF-8 in terms of space utilization. > > Given this, removing it seems like a non-starter. Agreed, it seems very unappealing to remove something so important to such a large userbase. -- Daniel Gustafsson
-
Re: Retiring some encodings?
Michael Paquier <michael@paquier.xyz> — 2025-05-27T00:07:13Z
On Mon, May 26, 2025 at 06:54:49PM +0200, Daniel Gustafsson wrote: > Agreed, it seems very unappealing to remove something so important to such a > large userbase. Agreed that the so-said "state" level requirement would be a non-starter. -- Michael
-
Re: Retiring some encodings?
Christoph Berg <myon@debian.org> — 2025-06-05T13:35:19Z
Re: Michael Paquier > On Mon, May 26, 2025 at 06:54:49PM +0200, Daniel Gustafsson wrote: > > Agreed, it seems very unappealing to remove something so important to such a > > large userbase. > > Agreed that the so-said "state" level requirement would be a > non-starter. Or maybe support for using these as server encodings could be removed, keeping the client_encoding part intact? Christoph
-
Re: Retiring some encodings?
Kenneth Marshall <ktm@rice.edu> — 2025-06-05T15:14:58Z
On Thu, Jun 05, 2025 at 03:35:19PM +0200, Christoph Berg wrote: > Re: Michael Paquier > > On Mon, May 26, 2025 at 06:54:49PM +0200, Daniel Gustafsson wrote: > > > Agreed, it seems very unappealing to remove something so important to such a > > > large userbase. > > > > Agreed that the so-said "state" level requirement would be a > > non-starter. > > Or maybe support for using these as server encodings could be > removed, keeping the client_encoding part intact? > > Christoph > Hi, Doesn't the ICU system support this encoding? They could just use it if you still want to remove our own implementation. Regards, Ken
-
Re: Retiring some encodings?
Tatsuo Ishii <ishii@postgresql.org> — 2025-06-05T23:50:56Z
>> Agreed that the so-said "state" level requirement would be a >> non-starter. > > Or maybe support for using these as server encodings could be > removed, keeping the client_encoding part intact? GB18030 is already client encoding only, and cannot be used as a server encoding. The only way to save GB18030 data into database is, converting GB18030 to UTF-8 (which can be done automatically). Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
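A rough sketch of the conversion step described above, with Python's codecs standing in for the server's conversion routines (an assumption for illustration, not how PostgreSQL implements it). The helper names are hypothetical. It also shows what happens to bytes that cannot be decoded: in strict mode the conversion fails, while errors="replace" substitutes U+FFFD, which loses the original data.

```python
def client_gb18030_to_utf8(payload: bytes, errors: str = "strict") -> bytes:
    """Hypothetical helper: convert client GB18030 bytes to UTF-8 for storage."""
    return payload.decode("gb18030", errors=errors).encode("utf-8")


def utf8_to_client_gb18030(stored: bytes) -> bytes:
    """Hypothetical helper: convert stored UTF-8 back to GB18030 for the client."""
    return stored.decode("utf-8").encode("gb18030")


# Valid input survives the round trip unchanged.
sent = "\u4e2d\u6587".encode("gb18030")
stored = client_gb18030_to_utf8(sent)
assert utf8_to_client_gb18030(stored) == sent

# Malformed input: one valid character followed by a stray 0xFF byte.
bad = b"\xd6\xd0\xff"
try:
    client_gb18030_to_utf8(bad)
except UnicodeDecodeError as exc:
    print("strict conversion rejected the input:", exc.reason)

lossy = client_gb18030_to_utf8(bad, errors="replace").decode("utf-8")
print("lossy conversion kept:", repr(lossy))
```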
-
Re: Retiring some encodings?
Andres Freund <andres@anarazel.de> — 2025-06-06T00:05:20Z
Hi, On 2025-05-22 14:54:22 +0900, Michael Paquier wrote: > All the encodings supported are documented here: > https://www.postgresql.org/docs/devel/multibyte.html#MULTIBYTE-CHARSET-SUPPORTED There has been plenty of discussion about GB18030, and it seems we aren't likely to be able to drop that. I think there are a lot of easier cases though. The easiest probably is MULE_INTERNAL - all discussions referencing it seem to be about oddities of MULE_INTERNAL, not about using it. I think it's been effectively unused since its introduction. Due to not even having a conversion path to UTF-8 it's really not practically usable IMO. Greetings, Andres Freund
-
Re: Retiring some encodings?
Michael Paquier <michael@paquier.xyz> — 2025-06-06T01:42:20Z
On Thu, Jun 05, 2025 at 08:05:20PM -0400, Andres Freund wrote: > There has been plenty discussion about GB18030, and it seems we aren't likely > to be able to drop that. Yes, as per upthread. > I think there are a lot easier cases though. The easiest probably is > MULE_INTERNAL - all discussions referencing it seem to be about oddities of > MULE_INTERNAL, not about using it. I think it's been effectively unused since > it's introduction. Due to not even having a conversion path to UTF-8 it's > really not practically usable IMO. Perhaps, yes. I still need to do some homework here and gather some data to share, FWIW. -- Michael