Thread

  1. Re: some utf8 breaking substring(txt,1,3) but not substring(txt from '^.{4}')

    Heikki Linnakangas <hlinnaka@iki.fi> — 2026-05-29T13:29:46Z

    On 29/05/2026 14:44, Hannu Krosing wrote:
    > Hi hackers
    > 
    > I was loading our mailing list into a database and noticed that some
    > text results int substring not working.
    > 
    > Specifically calling substring with some specific values fails
    > 
    > badutf8=# select ctid, id, substring(body, 1, 4) from  bademail;
    > ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xc2
    > LOCATION:  report_invalid_encoding_int, mbutils.c:1847
    > 
    > Asking one byte longer substring works ok
    > 
    > badutf8=# select ctid, id, substring(body, 1, 5) from  bademail;
    >   ctid  │ id │ substring
    > ───────┼────┼───────────
    >   (0,1) │  1 │ Hi ev
    > (1 row)
    > 
    > as do other ways of getting the same 4 bytes
    > 
    > badutf8=# select ctid, id, substring(body from '^.{4}') from  bademail;
    >   ctid  │ id │ substring
    > ───────┼────┼───────────
    >   (0,1) │  1 │ Hi e
    > (1 row)
    > 
    > badutf8=# select ctid, id, substring(normalize(body), 1, 4) from  bademail;
    >   ctid  │ id │ substring
    > ───────┼────┼───────────
    >   (0,1) │  1 │ Hi e
    > (1 row)
    > 
    > is this expected behaviour and I just have to always noirmalize when
    > loading exotic UTF8 strings ?
    
    Is the body valid UTF-8 or not? If it's not valid, then you shouldn't be 
    able to load it into the database in the first place. If it is valid, 
    then the substring() should work.
    
    > If you want to replicate this use attached python script to load data
    
    I could not reproduce this. That substring() query after running your 
    script works fine for me.
    
    badutf8=# select ctid, id, substring(body, 1, 4) from  bademail;
      ctid  | id | substring
    -------+----+-----------
      (0,1) |  1 | Hi e
    (1 row)
    
    Which version did you use? What is the database's encoding and what is 
    the client encoding? I used 'master', with UTF-8 as server and client 
    encoding.
    
    - Heikki