Thread
-
some utf8 breaking substring(txt,1,3) but not substring(txt from '^.{4}')
Hannu Krosing <hannuk@google.com> — 2026-05-29T11:44:26Z
Hi hackers I was loading our mailing list into a database and noticed that some text results int substring not working. Specifically calling substring with some specific values fails badutf8=# select ctid, id, substring(body, 1, 4) from bademail; ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xc2 LOCATION: report_invalid_encoding_int, mbutils.c:1847 Asking one byte longer substring works ok badutf8=# select ctid, id, substring(body, 1, 5) from bademail; ctid │ id │ substring ───────┼────┼─────────── (0,1) │ 1 │ Hi ev (1 row) as do other ways of getting the same 4 bytes badutf8=# select ctid, id, substring(body from '^.{4}') from bademail; ctid │ id │ substring ───────┼────┼─────────── (0,1) │ 1 │ Hi e (1 row) badutf8=# select ctid, id, substring(normalize(body), 1, 4) from bademail; ctid │ id │ substring ───────┼────┼─────────── (0,1) │ 1 │ Hi e (1 row) is this expected behaviour and I just have to always noirmalize when loading exotic UTF8 strings ? If you want to replicate this use attached python script to load data I tried to come up with pure SQL to load the data but that always produced values that did not fail. Even shortening the body by removing some "normal" characters between the \xc2\x... sequences froduces values which do not fail substring(txt, a, b) I would have shared the original code to load a mailbox, but this fails to decode that specifioc email on some machines. I did not yet go into why that happens