Thread

  1. some utf8 breaking substring(txt,1,3) but not substring(txt from '^.{4}')

    Hannu Krosing <hannuk@google.com> — 2026-05-29T11:44:26Z

    Hi hackers
    
    I was loading our mailing list into a database and noticed that some
    text results int substring not working.
    
    Specifically calling substring with some specific values fails
    
    badutf8=# select ctid, id, substring(body, 1, 4) from  bademail;
    ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xc2
    LOCATION:  report_invalid_encoding_int, mbutils.c:1847
    
    Asking one byte longer substring works ok
    
    badutf8=# select ctid, id, substring(body, 1, 5) from  bademail;
     ctid  │ id │ substring
    ───────┼────┼───────────
     (0,1) │  1 │ Hi ev
    (1 row)
    
    as do other ways of getting the same 4 bytes
    
    badutf8=# select ctid, id, substring(body from '^.{4}') from  bademail;
     ctid  │ id │ substring
    ───────┼────┼───────────
     (0,1) │  1 │ Hi e
    (1 row)
    
    badutf8=# select ctid, id, substring(normalize(body), 1, 4) from  bademail;
     ctid  │ id │ substring
    ───────┼────┼───────────
     (0,1) │  1 │ Hi e
    (1 row)
    
    is this expected behaviour and I just have to always noirmalize when
    loading exotic UTF8 strings ?
    
    
    If you want to replicate this use attached python script to load data
    
    I tried to come up with pure SQL to load the data but that always
    produced values that did not fail.
    
    Even shortening the body by removing some "normal" characters between
    the \xc2\x... sequences froduces values which do not fail
    substring(txt, a, b)
    
    I would have shared the original code to load a mailbox, but this
    fails to decode that specifioc email on some machines. I did not yet
    go into why that happens