Thread

  1. Re: Small patch to improve safety of utf8_to_unicode().

    Jeff Davis <pgsql@j-davis.com> — 2025-12-15T20:23:44Z

    On Sun, 2025-12-14 at 07:22 +0800, Chao Li wrote:
    > This patch adds a length check to utf8_to_unicode(), I think which is
    > where “safety” comes from. Can you please add an a little bit more to
    > the commit message instead of only saying “improve safety”.
    
    Right: it does not read past pg_mblen(), nor past the supplied length.
    
    We could go further and check for valid continuation bytes, but the
    other routines don't do that.
    
    > It also deleted two redundant function declarations from pg_wchar.h,
    > which may also worth a quick note in the commit message.
    
    I committed that as an independent change and removed it from this
    patch.
    
    > +	/* Assume byte sequence has not been broken. */
    > +	c = utf8_to_unicode(str, MAX_MULTIBYTE_CHAR_LEN);
    > ```
    > 
    > Here we need an empty line above the new comment.
    
    Done, and I expanded the comment to explain why it's safe.
    
    >  pg_utf_dsplen(const unsigned char *s)
    >  {
    > -	return ucs_wcwidth(utf8_to_unicode(s));
    > +	/* trust that input is not a truncated byte sequence */
    > +	return ucs_wcwidth(utf8_to_unicode(s,
    > MAX_MULTIBYTE_CHAR_LEN));
    >  }
    > ```
    > 
    > For the new comment, as a code reader, I wonder why we “trust” that?
    
    We could use strlen(), but I was concerned that it might be used for
    string fragments that aren't NUL-terminated, because it's intended for
    a single character. A caller might reasonably assume that it wouldn't
    read past pg_mblen().
    
    So I changed the comment slightly to just say that it requires the
    input is a valid UTF-8 sequence. Let me know if you have another
    suggestion.
    
    Regards,
    	Jeff Davis