Thread
-
Re: some utf8 breaking substring(txt,1,3) but not substring(txt from '^.{4}')
Heikki Linnakangas <hlinnaka@iki.fi> — 2026-05-29T13:29:46Z
On 29/05/2026 14:44, Hannu Krosing wrote: > Hi hackers > > I was loading our mailing list into a database and noticed that some > text results int substring not working. > > Specifically calling substring with some specific values fails > > badutf8=# select ctid, id, substring(body, 1, 4) from bademail; > ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xc2 > LOCATION: report_invalid_encoding_int, mbutils.c:1847 > > Asking one byte longer substring works ok > > badutf8=# select ctid, id, substring(body, 1, 5) from bademail; > ctid │ id │ substring > ───────┼────┼─────────── > (0,1) │ 1 │ Hi ev > (1 row) > > as do other ways of getting the same 4 bytes > > badutf8=# select ctid, id, substring(body from '^.{4}') from bademail; > ctid │ id │ substring > ───────┼────┼─────────── > (0,1) │ 1 │ Hi e > (1 row) > > badutf8=# select ctid, id, substring(normalize(body), 1, 4) from bademail; > ctid │ id │ substring > ───────┼────┼─────────── > (0,1) │ 1 │ Hi e > (1 row) > > is this expected behaviour and I just have to always noirmalize when > loading exotic UTF8 strings ? Is the body valid UTF-8 or not? If it's not valid, then you shouldn't be able to load it into the database in the first place. If it is valid, then the substring() should work. > If you want to replicate this use attached python script to load data I could not reproduce this. That substring() query after running your script works fine for me. badutf8=# select ctid, id, substring(body, 1, 4) from bademail; ctid | id | substring -------+----+----------- (0,1) | 1 | Hi e (1 row) Which version did you use? What is the database's encoding and what is the client encoding? I used 'master', with UTF-8 as server and client encoding. - Heikki