Thread
-
Re: Small patch to improve safety of utf8_to_unicode().
Jeff Davis <pgsql@j-davis.com> — 2025-12-15T20:23:44Z
On Sun, 2025-12-14 at 07:22 +0800, Chao Li wrote: > This patch adds a length check to utf8_to_unicode(), I think which is > where “safety” comes from. Can you please add an a little bit more to > the commit message instead of only saying “improve safety”. Right: it does not read past pg_mblen(), nor past the supplied length. We could go further and check for valid continuation bytes, but the other routines don't do that. > It also deleted two redundant function declarations from pg_wchar.h, > which may also worth a quick note in the commit message. I committed that as an independent change and removed it from this patch. > + /* Assume byte sequence has not been broken. */ > + c = utf8_to_unicode(str, MAX_MULTIBYTE_CHAR_LEN); > ``` > > Here we need an empty line above the new comment. Done, and I expanded the comment to explain why it's safe. > pg_utf_dsplen(const unsigned char *s) > { > - return ucs_wcwidth(utf8_to_unicode(s)); > + /* trust that input is not a truncated byte sequence */ > + return ucs_wcwidth(utf8_to_unicode(s, > MAX_MULTIBYTE_CHAR_LEN)); > } > ``` > > For the new comment, as a code reader, I wonder why we “trust” that? We could use strlen(), but I was concerned that it might be used for string fragments that aren't NUL-terminated, because it's intended for a single character. A caller might reasonably assume that it wouldn't read past pg_mblen(). So I changed the comment slightly to just say that it requires the input is a valid UTF-8 sequence. Let me know if you have another suggestion. Regards, Jeff Davis