Thread

  1. Re: some utf8 breaking substring(txt,1,3) but not substring(txt from '^.{4}')

    Hannu Krosing <hannuk@google.com> — 2026-05-29T15:34:46Z

    But looks like I can not reproduce it on any other computer (I though
    I had verified it on vanill Ubuntu 22.04 as well), so no need to do
    any more investigation until I can reproduce it elsewhere
    
    On Fri, May 29, 2026 at 5:16 PM Hannu Krosing <hannuk@google.com> wrote:
    >
    > Body is valid, can be selected in full and the the body can be also
    > fed to other functions, including version of substring using regexes.
    >
    > badutf8=# select ctid, id, length(body), length(normalize(body)),
    > body=normalize(body), octet_length(body),
    > octet_length(normalize(body)) from  bademail;
    >  ctid  │ id │ length │ length │ ?column? │ octet_length │ octet_length
    > ───────┼────┼────────┼────────┼──────────┼──────────────┼──────────────
    >  (0,1) │  1 │   2314 │   2314 │ t        │         2323 │         2323
    > (1 row)
    >
    > The most confusing thing is that the byte it complains about in case
    > of substring(body, 1, 3) or substring(body, 1, 4) does not seem to be
    > present in the original string at all and definitely not within the
    > first few characters I ams asking to extract
    >
    >
    > badutf8=# select ctid, id, substring(body, 1, 2) from  bademail;
    >  ctid  │ id │ substring
    > ───────┼────┼───────────
    >  (0,1) │  1 │ Hi
    > (1 row)
    >
    > Time: 0.527 ms
    > badutf8=# select ctid, id, substring(body, 1, 3) from  bademail;
    > ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xc3
    > LOCATION:  report_invalid_encoding_int, mbutils.c:1847
    > Time: 0.638 ms
    > badutf8=# select ctid, id, substring(body, 1, 4) from  bademail;
    > ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xc2
    > LOCATION:  report_invalid_encoding_int, mbutils.c:1847
    > Time: 0.555 ms
    > badutf8=# select ctid, id, substring(body, 1, 5) from  bademail;
    >  ctid  │ id │ substring
    > ───────┼────┼───────────
    >  (0,1) │  1 │ Hi ev
    > (1 row)
    >
    > On Fri, May 29, 2026 at 3:29 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
    > >
    > > On 29/05/2026 14:44, Hannu Krosing wrote:
    > > > Hi hackers
    > > >
    > > > I was loading our mailing list into a database and noticed that some
    > > > text results int substring not working.
    > > >
    > > > Specifically calling substring with some specific values fails
    > > >
    > > > badutf8=# select ctid, id, substring(body, 1, 4) from  bademail;
    > > > ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xc2
    > > > LOCATION:  report_invalid_encoding_int, mbutils.c:1847
    > > >
    > > > Asking one byte longer substring works ok
    > > >
    > > > badutf8=# select ctid, id, substring(body, 1, 5) from  bademail;
    > > >   ctid  │ id │ substring
    > > > ───────┼────┼───────────
    > > >   (0,1) │  1 │ Hi ev
    > > > (1 row)
    > > >
    > > > as do other ways of getting the same 4 bytes
    > > >
    > > > badutf8=# select ctid, id, substring(body from '^.{4}') from  bademail;
    > > >   ctid  │ id │ substring
    > > > ───────┼────┼───────────
    > > >   (0,1) │  1 │ Hi e
    > > > (1 row)
    > > >
    > > > badutf8=# select ctid, id, substring(normalize(body), 1, 4) from  bademail;
    > > >   ctid  │ id │ substring
    > > > ───────┼────┼───────────
    > > >   (0,1) │  1 │ Hi e
    > > > (1 row)
    > > >
    > > > is this expected behaviour and I just have to always noirmalize when
    > > > loading exotic UTF8 strings ?
    > >
    > > Is the body valid UTF-8 or not? If it's not valid, then you shouldn't be
    > > able to load it into the database in the first place. If it is valid,
    > > then the substring() should work.
    > >
    > > > If you want to replicate this use attached python script to load data
    > >
    > > I could not reproduce this. That substring() query after running your
    > > script works fine for me.
    > >
    > > badutf8=# select ctid, id, substring(body, 1, 4) from  bademail;
    > >   ctid  | id | substring
    > > -------+----+-----------
    > >   (0,1) |  1 | Hi e
    > > (1 row)
    > >
    > > Which version did you use? What is the database's encoding and what is
    > > the client encoding? I used 'master', with UTF-8 as server and client
    > > encoding.
    > >
    > > - Heikki
    > >