Re: some utf8 breaking substring(txt,1,3) but not substring(txt from '^.{4}')

Heikki Linnakangas <hlinnaka@iki.fi>

From: Heikki Linnakangas <hlinnaka@iki.fi>
To: Hannu Krosing <hannuk@google.com>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Date: 2026-05-29T13:29:46Z
Lists: pgsql-hackers
On 29/05/2026 14:44, Hannu Krosing wrote:
> Hi hackers
> 
> I was loading our mailing list into a database and noticed that some
> text results int substring not working.
> 
> Specifically calling substring with some specific values fails
> 
> badutf8=# select ctid, id, substring(body, 1, 4) from  bademail;
> ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xc2
> LOCATION:  report_invalid_encoding_int, mbutils.c:1847
> 
> Asking one byte longer substring works ok
> 
> badutf8=# select ctid, id, substring(body, 1, 5) from  bademail;
>   ctid  │ id │ substring
> ───────┼────┼───────────
>   (0,1) │  1 │ Hi ev
> (1 row)
> 
> as do other ways of getting the same 4 bytes
> 
> badutf8=# select ctid, id, substring(body from '^.{4}') from  bademail;
>   ctid  │ id │ substring
> ───────┼────┼───────────
>   (0,1) │  1 │ Hi e
> (1 row)
> 
> badutf8=# select ctid, id, substring(normalize(body), 1, 4) from  bademail;
>   ctid  │ id │ substring
> ───────┼────┼───────────
>   (0,1) │  1 │ Hi e
> (1 row)
> 
> is this expected behaviour and I just have to always noirmalize when
> loading exotic UTF8 strings ?

Is the body valid UTF-8 or not? If it's not valid, then you shouldn't be 
able to load it into the database in the first place. If it is valid, 
then the substring() should work.

> If you want to replicate this use attached python script to load data

I could not reproduce this. That substring() query after running your 
script works fine for me.

badutf8=# select ctid, id, substring(body, 1, 4) from  bademail;
  ctid  | id | substring
-------+----+-----------
  (0,1) |  1 | Hi e
(1 row)

Which version did you use? What is the database's encoding and what is 
the client encoding? I used 'master', with UTF-8 as server and client 
encoding.

- Heikki