some utf8 breaking substring(txt,1,3) but not substring(txt from '^.{4}')

Hannu Krosing <hannuk@google.com>

From: Hannu Krosing <hannuk@google.com>
To: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Date: 2026-05-29T11:44:26Z
Lists: pgsql-hackers

Attachments

Hi hackers

I was loading our mailing list into a database and noticed that some
text results int substring not working.

Specifically calling substring with some specific values fails

badutf8=# select ctid, id, substring(body, 1, 4) from  bademail;
ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xc2
LOCATION:  report_invalid_encoding_int, mbutils.c:1847

Asking one byte longer substring works ok

badutf8=# select ctid, id, substring(body, 1, 5) from  bademail;
 ctid  │ id │ substring
───────┼────┼───────────
 (0,1) │  1 │ Hi ev
(1 row)

as do other ways of getting the same 4 bytes

badutf8=# select ctid, id, substring(body from '^.{4}') from  bademail;
 ctid  │ id │ substring
───────┼────┼───────────
 (0,1) │  1 │ Hi e
(1 row)

badutf8=# select ctid, id, substring(normalize(body), 1, 4) from  bademail;
 ctid  │ id │ substring
───────┼────┼───────────
 (0,1) │  1 │ Hi e
(1 row)

is this expected behaviour and I just have to always noirmalize when
loading exotic UTF8 strings ?


If you want to replicate this use attached python script to load data

I tried to come up with pure SQL to load the data but that always
produced values that did not fail.

Even shortening the body by removing some "normal" characters between
the \xc2\x... sequences froduces values which do not fail
substring(txt, a, b)

I would have shared the original code to load a mailbox, but this
fails to decode that specifioc email on some machines. I did not yet
go into why that happens