Thread

  1. Retiring some encodings?

    Michael Paquier <michael@paquier.xyz> — 2025-05-22T05:54:22Z

    Hi all,
    
    $subject is something that has been on my mind for a few weeks now,
    following the recent events with CVE-2025-4207 (627acc3caa74) and
    CVE-2025-1094 (5dc1e42b4fa6).
    
    All the encodings supported are documented here:
    https://www.postgresql.org/docs/devel/multibyte.html#MULTIBYTE-CHARSET-SUPPORTED
    
    One pain point in the code is the GB18030 encoding, which is peculiar
    in that the first two bytes of an input must be inspected to know the
    actual length of a multi-byte character sequence.
    This is not documented, and it can be a trap in disguise,
    particularly with the frontend code (see jsonapi.c).
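    
    As an illustration of the length rule described above, here is a minimal
    Python sketch.  It is not PostgreSQL's C code, and the function names are
    made up for this example.  It assumes well-formed input and does no
    validation: a byte below 0x80 stands alone, and for any other lead byte
    the second byte decides between a two-byte and a four-byte sequence.

```python
def gb18030_char_len(buf: bytes, pos: int = 0) -> int:
    """Return the byte length of the GB18030 sequence starting at buf[pos].

    Illustrative only: assumes valid GB18030 and performs no range checks
    on trail bytes.
    """
    first = buf[pos]
    if first < 0x80:
        return 1  # ASCII range: always a single byte
    if pos + 1 >= len(buf):
        # With only the lead byte available the length cannot be decided.
        raise ValueError("second byte needed to determine sequence length")
    second = buf[pos + 1]
    # A second byte in 0x30..0x39 (ASCII digits) marks a four-byte sequence;
    # otherwise the sequence is two bytes long.
    return 4 if 0x30 <= second <= 0x39 else 2


def char_lengths(buf: bytes) -> list:
    """Walk a buffer and collect the length of each character sequence."""
    lengths, pos = [], 0
    while pos < len(buf):
        n = gb18030_char_len(buf, pos)
        lengths.append(n)
        pos += n
    return lengths


# "a" is ASCII, the Chinese character takes two bytes, the emoji takes four.
print(char_lengths("a中😀".encode("gb18030")))  # [1, 2, 4]
```

    A caller that holds only the lead byte cannot tell how far to advance,
    which is the kind of trap the paragraph above refers to.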
    
    With all that in mind, I have wanted to kick off a discussion about
    potentially removing one or more encodings from the core code,
    including the backend part, the frontend part and the conversion
    routines, coupled with checks in pg_upgrade to complain when databases
    or collations use such an encoding (the collation part needs
    to be checked when not using ICU).  Just being able to remove
    GB18030 would do us a favor in the long-term, at least, but there's
    more.
    
    I have discussed the matter internally, with a few things pointed out:
    - One thing that I was considering first would be the possibility to
    add support for pluggable encodings in the backend code, giving an
    option for retired encodings to be reloaded back to the server, with a
    concept close to what we do for WAL RMGRs with IDs stuck in time once
    defined, catalogs using pg_enc.  Encouraging users to have their own
    encodings, particularly ones that we'd consider to be unsafe by design
    like the GB one may not be a good idea.  But there is always the
    argument that users may not want to pay the cost of a set of ALTER
    DATABASE commands.  Nobody really liked this idea of putting the
    encoding responsibility into an extension :D
    - Another idea, which Jeff Davis mentioned, revolves around the Unicode
    code point U+FFFD (didn't know about this one), which can be used to
    replace an incoming character whose value is unknown.  One strategy would
    then be to map encodings whose internals are dropped to use UTF-8 under
    the hood, with this character as the exit path when finding characters
    that cannot be understood, meaning partial and silent data loss.
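    
    The U+FFFD fallback can be sketched in a few lines of Python.  This is
    only a model of the idea, not a proposed implementation: to_utf8_lossy is
    a hypothetical helper, and plain ASCII stands in for an encoding whose
    internals would have been dropped.

```python
REPLACEMENT_CHARACTER = "\ufffd"  # U+FFFD


def to_utf8_lossy(raw: bytes, source_encoding: str) -> bytes:
    """Convert raw bytes to UTF-8, substituting U+FFFD for undecodable input."""
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("utf-8")


# Two different inputs that the decoder cannot understand...
a = to_utf8_lossy(b"caf\xe9", "ascii")
b = to_utf8_lossy(b"caf\xe8", "ascii")

# ...collapse to the same output, so the original bytes are unrecoverable.
print(a == b)  # True
print(a.decode("utf-8").count(REPLACEMENT_CHARACTER))  # 1
```

    The conversion never raises an error, which is what makes the data loss
    silent.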
    
    Another set of things (also mentioned by Jeff as he's been diving into
    this area a lot for the last few years with Jeremy Schneider), that
    could also help $subject in the long-run, would be to try removing
    some code used for non-UTF8 cases.  Some examples:
    - downcase_identifier() and pgstrcasecmp.c mention the specific case
    of Turkish with 'i' and 'I'.
    - Simplify regc_pg_locale.c which is unable to support non-UTF8
    encodings with characters of more than 2 bytes.
    - pg_wchar's uint type could be removed, switched to a codepoint value
    (?) (pointed out by Jeff).
    - Varlena cases with non-UTF8, like text_position_setup().
    In theory, what we could aim for here is to move forward with non-UTF8
    encodings in the server, potentially moving away from libc.  That's a
    larger project, so it may be better to try something with some of the
    low-hanging fruits like the non-UTF8 cases. 
    
    This last paragraph does not really change my opinion about GB18030: I'd like
    to propose its removal for v19 because looking at the first two bytes
    of a character sequence to know how long the full sequence is stands
    as an exception compared to all the encodings supported by Postgres.
    Anyway, at the end, all that is about removing code.  A large majority
    of users use UTF-8, we could improve things, so feel free to comment.
    
    Feel free to use this thread if you have different ideas or if you
    have any comments.
    
    Thanks,
    --
    Michael
    
  2. Re: Retiring some encodings?

    Laurenz Albe <laurenz.albe@cybertec.at> — 2025-05-22T08:26:33Z

    The obvious question is how many people would suffer because
    of that removal, as it would prevent them from using pg_upgrade.
    
    Can anybody who works in a region that uses these encodings make
    an educated guess?
    
    Yours,
    Laurenz Albe
    
    
    
    
  3. Re: Retiring some encodings?

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-05-22T11:44:39Z

    On 22/05/2025 08:54, Michael Paquier wrote:
    > With all that in mind, I have wanted to kick off a discussion about
    > potentially removing one or more encodings from the core code,
    > including the backend part, the frontend part and the conversion
    > routines, coupled with checks in pg_upgrade to complain when databases
    > or collations use such an encoding (the collation part needs
    > to be checked when not using ICU).  Just being able to remove
    > GB18030 would do us a favor in the long-term, at least, but there's
    > more.
    
    +1 at high level for deprecating and removing conversions that are not 
    widely used anymore. As the first step, we can at least add a warning to 
    the documentation, that they will be removed in the future.
    
    -- 
    Heikki Linnakangas
    Neon (https://neon.tech)
    
    
    
    
    
  4. Re: Retiring some encodings?

    Pavel Stehule <pavel.stehule@gmail.com> — 2025-05-22T12:05:03Z

    čt 22. 5. 2025 v 13:44 odesílatel Heikki Linnakangas <hlinnaka@iki.fi>
    napsal:
    
    > On 22/05/2025 08:54, Michael Paquier wrote:
    > > With all that in mind, I have wanted to kick off a discussion about
    > > potentially removing one or more encodings from the core code,
    > > including the backend part, the frontend part and the conversion
    > > routines, coupled with checks in pg_upgrade to complain when databases
    > > or collations use such an encoding (the collation part needs
    > > to be checked when not using ICU).  Just being able to remove
    > > GB18030 would do us a favor in the long-term, at least, but there's
    > > more.
    >
    > +1 at high level for deprecating and removing conversions that are not
    > widely used anymore. As the first step, we can at least add a warning to
    > the documentation, that they will be removed in the future.
    >
    
    +1
    
    Pavel
    
    
    > --
    > Heikki Linnakangas
    > Neon (https://neon.tech)
    >
    >
    >
    >
    
  5. Re: Retiring some encodings?

    Bruce Momjian <bruce@momjian.us> — 2025-05-22T14:02:16Z

    On Thu, May 22, 2025 at 02:44:39PM +0300, Heikki Linnakangas wrote:
    > On 22/05/2025 08:54, Michael Paquier wrote:
    > > With all that in mind, I have wanted to kick off a discussion about
    > > potentially removing one or more encodings from the core code,
    > > including the backend part, the frontend part and the conversion
    > > routines, coupled with checks in pg_upgrade to complain when databases
    > > or collations use such an encoding (the collation part needs
    > > to be checked when not using ICU).  Just being able to remove
    > > GB18030 would do us a favor in the long-term, at least, but there's
    > > more.
    > 
    > +1 at high level for deprecating and removing conversions that are not
    > widely used anymore. As the first step, we can at least add a warning to the
    > documentation, that they will be removed in the future.
    
    Agreed on notification.  A radical idea would be to add a warning for
    the use of such encodings in PG 18, and then mention their deprecation
    in the PG 18 release notes so everyone is informed they will be removed
    in PG 19.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      Do not let urgent matters crowd out time for investment in the future.
    
    
    
    
  6. Re: Retiring some encodings?

    Michael Paquier <michael@paquier.xyz> — 2025-05-23T02:11:09Z

    On Thu, May 22, 2025 at 10:02:16AM -0400, Bruce Momjian wrote:
    > Agreed on notification.  A radical idea would be to add a warning for
    > the use of such encodings in PG 18, and then mention their deprecation
    > in the PG 18 release notes so everyone is informed they will be removed
    > in PG 19.
    
    With v18beta1 already out in the wild, I think that we are too late
    for taking any action on this version at this stage.  Putting a
    deprecation notice for a selected set of conversions and/or encodings
    and doing the actual removal work when v20 opens up around July 2026
    would sound like a better timing here, if the overall consensus goes
    in this direction, of course.
    --
    Michael
    
  7. Re: Retiring some encodings?

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-05-23T07:18:34Z

    On 23/05/2025 05:11, Michael Paquier wrote:
    > On Thu, May 22, 2025 at 10:02:16AM -0400, Bruce Momjian wrote:
    >> Agreed on notification.  A radical idea would be to add a warning for
    >> the use of such encodings in PG 18, and then mention their deprecation
    >> in the PG 18 release notes so everyone is informed they will be removed
    >> in PG 19.
    > 
    > With v18beta1 already out in the wild, I think that we are too late
    > for taking any action on this version at this stage.  Putting a
    > deprecation notice for a selected set of conversions and/or encodings
    > and doing the actual removal work when v20 opens up around July 2026
    > would sound like a better timing here, if the overall consensus goes
    > in this direction, of course.
    
    If we plan to remove something in the future, I think putting a 
    deprecation notice in the docs in v18 is still a good idea. There's no 
    point in hiding the plan by not documenting it sooner. The more advance 
    notice people get the better.
    
    -- 
    Heikki Linnakangas
    Neon (https://neon.tech)
    
    
    
    
  8. Re: Retiring some encodings?

    Daniel Gustafsson <daniel@yesql.se> — 2025-05-23T08:21:42Z

    > On 23 May 2025, at 09:18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    
    > If we plan to remove something in the future, I think putting a deprecation notice in the docs in v18 is still a good idea. There's no point in hiding the plan by not documenting it sooner. The more advance notice people get the better.
    
    +1
    
    --
    Daniel Gustafsson
    
    
    
    
    
  9. Re: Retiring some encodings?

    wenhui qiu <qiuwenhuifx@gmail.com> — 2025-05-23T09:08:35Z

    Hi,
    > The obvious question is how many people would suffer because
    > of that removal, as it would prevent them from using pg_upgrade.
    
    > Can anybody who works in a region that uses these encodings make
    > an educated guess?
    +1, agreed.  GB18030 is a coding standard in China.  Removing it would
    affect the adoption of PostgreSQL in China, where PostgreSQL is becoming
    more and more popular, so this needs to be considered carefully!
    
    On Fri, May 23, 2025 at 4:22 PM Daniel Gustafsson <daniel@yesql.se> wrote:
    
    > > On 23 May 2025, at 09:18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    >
    > > If we plan to remove something in the future, I think putting a
    > deprecation notice in the docs in v18 is still a good idea. There's no
    > point in hiding the plan by not documenting it sooner. The more advance
    > notice people get the better.
    >
    > +1
    >
    > --
    > Daniel Gustafsson
    >
    >
    >
    >
    
  10. Re: Retiring some encodings?

    Daniel Gustafsson <daniel@yesql.se> — 2025-05-23T09:28:32Z

    > On 23 May 2025, at 11:08, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
    > 
    > HI 
    > > The obvious question is how many people would suffer because
    > > of that removal, as it would prevent them from using pg_upgrade.
    > 
    > > Can anybody who works in a region that uses these encodings make
    > > an educated guess?
    > +1, agreed.  GB18030 is a coding standard in China.  Removing it would affect the adoption of PostgreSQL in China, where PostgreSQL is becoming more and more popular, so this needs to be considered carefully!
    
    Thanks for the input, that's exactly what we need to make informed decisions.
    How prevalent is GB18030 usage, is it used in all postgres installations in
    China, most of them or in some particular cases?
    
    --
    Daniel Gustafsson
    
    
    
    
    
  11. Re: Retiring some encodings?

    Tatsuo Ishii <ishii@postgresql.org> — 2025-05-23T10:58:46Z

    >> On 23 May 2025, at 11:08, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
    >> 
    >> HI 
    >> > The obvious question is how many people would suffer because
    >> > of that removal, as it would prevent them from using pg_upgrade.
    >> 
    >> > Can anybody who works in a region that uses these encodings make
    >> > an educated guess?
    >> +1, agreed.  GB18030 is a coding standard in China.  Removing it would affect the adoption of PostgreSQL in China, where PostgreSQL is becoming more and more popular, so this needs to be considered carefully!
    > 
    > Thanks for the input, that's exactly what we need to make informed decisions.
    > How prevalent is GB18030 usage, is it used in all postgres installations in
    > China, most of them or in some particular cases?
    
    Another point is whether other DBMSs support GB18030.  If they do and
    PostgreSQL no longer does in the future, that could be a reason to move
    away from PostgreSQL.
    
    As far as I know, MySQL, Oracle and SQL Server support GB18030.
    
    Best regards,
    --
    Tatsuo Ishii
    SRA OSS K.K.
    English: http://www.sraoss.co.jp/index_en/
    Japanese:http://www.sraoss.co.jp
    
    
    
    
  12. Re: Retiring some encodings?

    Michael Paquier <michael@paquier.xyz> — 2025-05-24T00:13:33Z

    On Fri, May 23, 2025 at 07:58:46PM +0900, Tatsuo Ishii wrote:
    > Another point is whether other DBMSs support GB18030.  If they do and
    > PostgreSQL no longer does in the future, that could be a reason to move
    > away from PostgreSQL.
    
    Yeah, that's a good point.  I would also question what's the benefit
    in using GB18030 over UTF-8, though.  An obvious one I can see is
    because legacy applications never get updated.
    
    On my side, I'll try to grab some actual numbers or at least a trend
    of them.
    --
    Michael
    
  13. Re: Retiring some encodings?

    Tatsuo Ishii <ishii@postgresql.org> — 2025-05-24T02:23:23Z

    > Yeah, that's a good point.  I would also question what's the benefit
    > in using GB18030 over UTF-8, though.  An obvious one I can see is
    > because legacy applications never get updated.
    
    Plus users have a great many GB18030-encoded files, I guess.
    --
    Tatsuo Ishii
    SRA OSS K.K.
    English: http://www.sraoss.co.jp/index_en/
    Japanese:http://www.sraoss.co.jp
    
    
    
    
  14. Re: Retiring some encodings?

    DEVOPS_WwIT <devops@ww-it.cn> — 2025-05-25T00:58:13Z

    Hi Michael
    
    > Yeah, that's a good point.  I would also question what's the benefit
    > in using GB18030 over UTF-8, though.  An obvious one I can see is
    > because legacy applications never get updated.
    >
    The GB18030 encoding standard is a mandatory Chinese character encoding 
    standard required by regulations. Software sold and used in China must 
    support GB18030, with its latest version being the 2023 edition. The 
    primary advantage of GB18030 is that most Chinese characters require 
    only 2 bytes for storage, whereas UTF-8 necessitates 3 bytes for the 
    same characters. This makes GB18030 significantly more storage-efficient 
    compared to UTF-8 in terms of space utilization.
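    
    The per-character byte counts are easy to check with Python's built-in
    codecs.  A quick sketch, using arbitrary sample strings and a made-up
    helper name; it only compares encoded lengths and says nothing about
    on-disk size in any particular database.

```python
def byte_counts(text: str) -> tuple:
    """Return (GB18030 length, UTF-8 length) in bytes for the given text."""
    return len(text.encode("gb18030")), len(text.encode("utf-8"))


samples = {
    "common Chinese": "数据库编码",  # "database encoding", five characters
    "ASCII": "hello",
    "emoji": "😀",
}

for label, text in samples.items():
    gb, utf8 = byte_counts(text)
    print(f"{label}: GB18030={gb} bytes, UTF-8={utf8} bytes")
```

    The saving applies to characters in GB18030's two-byte range.  ASCII
    costs one byte in both encodings, and characters outside the Basic
    Multilingual Plane take four bytes in both.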
    
    Tony
    
  15. Re: Retiring some encodings?

    Andrew Dunstan <andrew@dunslane.net> — 2025-05-26T16:07:02Z

    On 2025-05-24 Sa 8:58 PM, DEVOPS_WwIT wrote:
    >
    > Hi Michael
    >
    >> Yeah, that's a good point.  I would also question what's the benefit
    >> in using GB18030 over UTF-8, though.  An obvious one I can see is
    >> because legacy applications never get updated.
    >>
    > The GB18030 encoding standard is a mandatory Chinese character 
    > encoding standard required by regulations. Software sold and used in 
    > China must support GB18030, with its latest version being the 2023 
    > edition. The primary advantage of GB18030 is that most Chinese 
    > characters require only 2 bytes for storage, whereas UTF-8 
    > necessitates 3 bytes for the same characters. This makes GB18030 
    > significantly more storage-efficient compared to UTF-8 in terms of 
    > space utilization.
    >
    >
    
    Given this, removing it seems like a non-starter.
    
    
    cheers
    
    
    andrew
    
    
    --
    Andrew Dunstan
    EDB:https://www.enterprisedb.com
    
  16. Re: Retiring some encodings?

    Daniel Gustafsson <daniel@yesql.se> — 2025-05-26T16:54:49Z

    > On 26 May 2025, at 18:07, Andrew Dunstan <andrew@dunslane.net> wrote:
    > On 2025-05-24 Sa 8:58 PM, DEVOPS_WwIT wrote:
    
    >> The GB18030 encoding standard is a mandatory Chinese character encoding standard required by regulations. Software sold and used in China must support GB18030, with its latest version being the 2023 edition. The primary advantage of GB18030 is that most Chinese characters require only 2 bytes for storage, whereas UTF-8 necessitates 3 bytes for the same characters. This makes GB18030 significantly more storage-efficient compared to UTF-8 in terms of space utilization.
    > 
    > Given this, removing it seems like a non-starter.
    
    Agreed, it seems very unappealing to remove something so important to such a
    large userbase.
    
    --
    Daniel Gustafsson
    
    
    
    
    
  17. Re: Retiring some encodings?

    Michael Paquier <michael@paquier.xyz> — 2025-05-27T00:07:13Z

    On Mon, May 26, 2025 at 06:54:49PM +0200, Daniel Gustafsson wrote:
    > Agreed, it seems very unappealing to remove something so important to such a
    > large userbase.
    
    Agreed, the "state"-level requirement makes the removal a
    non-starter.
    --
    Michael
    
  18. Re: Retiring some encodings?

    Christoph Berg <myon@debian.org> — 2025-06-05T13:35:19Z

    Re: Michael Paquier
    > On Mon, May 26, 2025 at 06:54:49PM +0200, Daniel Gustafsson wrote:
    > > Agreed, it seems very unappealing to remove something so important to such a
    > > large userbase.
    > 
    > Agreed that the so-said "state" level requirement would be a
    > non-starter.
    
    Or maybe support for using these as server encodings could be
    removed, keeping the client_encoding part intact?
    
    Christoph
    
    
    
    
  19. Re: Retiring some encodings?

    Kenneth Marshall <ktm@rice.edu> — 2025-06-05T15:14:58Z

    On Thu, Jun 05, 2025 at 03:35:19PM +0200, Christoph Berg wrote:
    > Re: Michael Paquier
    > > On Mon, May 26, 2025 at 06:54:49PM +0200, Daniel Gustafsson wrote:
    > > > Agreed, it seems very unappealing to remove something so important to such a
    > > > large userbase.
    > > 
    > > Agreed that the so-said "state" level requirement would be a
    > > non-starter.
    > 
    > Or maybe support for using these as server encodings could be
    > removed, keeping the client_encoding part intact?
    > 
    > Christoph
    > 
    
    Hi,
    
    Doesn't the ICU system support this encoding? They could just use it if
    you still want to remove our own implementation.
    
    Regards,
    Ken
    
    
    
    
  20. Re: Retiring some encodings?

    Tatsuo Ishii <ishii@postgresql.org> — 2025-06-05T23:50:56Z

    >> Agreed that the so-said "state" level requirement would be a
    >> non-starter.
    > 
    > Or maybe support for using these as server encodings could be
    > removed, keeping the client_encoding part intact?
    
    GB18030 is already a client-only encoding and cannot be used as a
    server encoding.  The only way to store GB18030 data in a database is
    to convert it from GB18030 to UTF-8 (which can be done automatically).
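    
    The automatic conversion can be pictured as a round trip between the two
    encodings.  The sketch below uses Python's codecs as a stand-in for the
    server's conversion routines, which is an assumption made for
    illustration (mappings may differ in edge cases between GB18030
    revisions), and the helper names are hypothetical.

```python
def gb18030_to_utf8(payload: bytes) -> bytes:
    """What happens conceptually when a GB18030 client sends data."""
    return payload.decode("gb18030").encode("utf-8")


def utf8_to_gb18030(payload: bytes) -> bytes:
    """What happens conceptually when stored data goes back to the client."""
    return payload.decode("utf-8").encode("gb18030")


text = "PostgreSQL 数据库 😀"
client_bytes = text.encode("gb18030")

stored = gb18030_to_utf8(client_bytes)   # bytes as kept in a UTF-8 database
returned = utf8_to_gb18030(stored)       # bytes handed back to the client

print(returned == client_bytes)  # True
```

    Because GB18030 can represent every Unicode code point, this round trip
    does not lose data for valid input.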
    
    Best regards,
    --
    Tatsuo Ishii
    SRA OSS K.K.
    English: http://www.sraoss.co.jp/index_en/
    Japanese:http://www.sraoss.co.jp
    
    
    
    
  21. Re: Retiring some encodings?

    Andres Freund <andres@anarazel.de> — 2025-06-06T00:05:20Z

    Hi,
    
    On 2025-05-22 14:54:22 +0900, Michael Paquier wrote:
    > All the encodings supported are documented here:
    > https://www.postgresql.org/docs/devel/multibyte.html#MULTIBYTE-CHARSET-SUPPORTED
    
    There has been plenty of discussion about GB18030, and it seems we aren't likely
    to be able to drop that.
    
    I think there are a lot easier cases though. The easiest probably is
    MULE_INTERNAL - all discussions referencing it seem to be about oddities of
    MULE_INTERNAL, not about using it.  I think it's been effectively unused since
    its introduction.  Due to not even having a conversion path to UTF-8 it's
    really not practically usable IMO.
    
    Greetings,
    
    Andres Freund
    
    
    
    
  22. Re: Retiring some encodings?

    Michael Paquier <michael@paquier.xyz> — 2025-06-06T01:42:20Z

    On Thu, Jun 05, 2025 at 08:05:20PM -0400, Andres Freund wrote:
    > There has been plenty of discussion about GB18030, and it seems we aren't likely
    > to be able to drop that.
    
    Yes, as per upthread.
    
    > I think there are a lot easier cases though. The easiest probably is
    > MULE_INTERNAL - all discussions referencing it seem to be about oddities of
    > MULE_INTERNAL, not about using it.  I think it's been effectively unused since
    > its introduction.  Due to not even having a conversion path to UTF-8 it's
    > really not practically usable IMO.
    
    Perhaps, yes.  I still need to do some homework here and gather some
    data to share, FWIW.
    --
    Michael