Thread

Re: Small patch to improve safety of utf8_to_unicode().

Jeff Davis <pgsql@j-davis.com> — 2025-12-17T19:37:59Z

On Tue, 2025-12-16 at 07:34 +0800, Chao Li wrote:
> > <v2-0001-Make-utf8_to_unicode-safer.patch>
> 
> V2 LGTM.

On second thought, if we're going to change something here, we should
probably have a more flexible API for both utf8_to_unicode() and
unicode_to_utf8().

Looking at the callers, I think we want to have signatures something
like:

/* returns number of bytes consumed, or -1 */
static inline ssize_t
utf8_to_unicode(char32_t *cp, const unsigned char *src, size_t srclen)
{
    ...
}

/* returns number of bytes written, or -1 */
static inline ssize_t
unicode_to_utf8(unsigned char *dst, size_t dstsize, char32_t cp)
{
    ...
}

That would make both APIs safer, and the caller wouldn't need to call
unicode_utf8len() or pg_utf8_mblen() separately.
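
For illustration only, here is a rough sketch of what bodies for those two
signatures might look like. It is not the patch under discussion: the bodies
are assumptions filled in around the proposed signatures, and the decoder
checks only the lead byte and the available length, not the continuation
bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>			/* ssize_t */
#include <uchar.h>				/* char32_t */

/* Sketch only: returns number of bytes consumed, or -1 */
static inline ssize_t
utf8_to_unicode(char32_t *cp, const unsigned char *src, size_t srclen)
{
	size_t		len;
	char32_t	c;

	assert(cp != NULL && src != NULL);

	if (srclen == 0)
		return -1;

	/* The lead byte determines the sequence length and the payload bits */
	if (src[0] < 0x80)
	{
		len = 1;
		c = src[0];
	}
	else if ((src[0] & 0xE0) == 0xC0)
	{
		len = 2;
		c = src[0] & 0x1F;
	}
	else if ((src[0] & 0xF0) == 0xE0)
	{
		len = 3;
		c = src[0] & 0x0F;
	}
	else if ((src[0] & 0xF8) == 0xF0)
	{
		len = 4;
		c = src[0] & 0x07;
	}
	else
		return -1;				/* not a valid lead byte */

	/* Never read past the bytes the caller said are available */
	if (len > srclen)
		return -1;

	/* Assumption: continuation bytes are not validated in this sketch */
	for (size_t i = 1; i < len; i++)
		c = (c << 6) | (src[i] & 0x3F);

	*cp = c;
	return (ssize_t) len;
}

/* Sketch only: returns number of bytes written, or -1 */
static inline ssize_t
unicode_to_utf8(unsigned char *dst, size_t dstsize, char32_t cp)
{
	size_t		len;

	assert(dst != NULL);

	if (cp <= 0x7F)
		len = 1;
	else if (cp <= 0x7FF)
		len = 2;
	else if (cp <= 0xFFFF)
		len = 3;
	else if (cp <= 0x10FFFF)
		len = 4;
	else
		return -1;				/* beyond the Unicode code space */

	/* Never write past the space the caller said is available */
	if (len > dstsize)
		return -1;

	if (len == 1)
		dst[0] = (unsigned char) cp;
	else
	{
		/* Lead-byte markers for 2, 3 and 4 byte sequences */
		static const unsigned char lead[] = {0xC0, 0xE0, 0xF0};

		/* Fill continuation bytes from the end, six bits at a time */
		for (size_t i = len - 1; i > 0; i--)
		{
			dst[i] = (unsigned char) (0x80 | (cp & 0x3F));
			cp >>= 6;
		}
		dst[0] = (unsigned char) (lead[len - 2] | cp);
	}
	return (ssize_t) len;
}
```

With this shape, a caller can advance its source or destination pointer by
the return value and treat -1 as "truncated input, invalid lead byte, or not
enough room", without computing the length separately beforehand.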

We could also do more validation, but of course then the callers would
need to handle a failure return. We could also try to catch NUL
terminators in the middle of a sequence, which might be useful.
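
As a rough idea of what "more validation" could cover, the sketch below
decodes one sequence strictly. The function name utf8_decode_strict() is
hypothetical and the set of checks is an assumption, not something settled
in this thread. Checking that every continuation byte has the form 10xxxxxx
also rejects a NUL byte in the middle of a sequence.

```c
#include <stddef.h>
#include <sys/types.h>			/* ssize_t */
#include <uchar.h>				/* char32_t */

/*
 * Hypothetical strict decoder (name and checks are assumptions).
 * Returns number of bytes consumed, or -1 on any malformed input.
 */
static inline ssize_t
utf8_decode_strict(char32_t *cp, const unsigned char *src, size_t srclen)
{
	/* Smallest code point that legitimately needs 1, 2, 3, 4 bytes */
	static const char32_t min_cp[] = {0x0, 0x80, 0x800, 0x10000};
	size_t		len;
	char32_t	c;

	if (srclen == 0)
		return -1;

	if (src[0] < 0x80)
	{
		len = 1;
		c = src[0];
	}
	else if ((src[0] & 0xE0) == 0xC0)
	{
		len = 2;
		c = src[0] & 0x1F;
	}
	else if ((src[0] & 0xF0) == 0xE0)
	{
		len = 3;
		c = src[0] & 0x0F;
	}
	else if ((src[0] & 0xF8) == 0xF0)
	{
		len = 4;
		c = src[0] & 0x07;
	}
	else
		return -1;				/* stray continuation or invalid lead byte */

	if (len > srclen)
		return -1;				/* truncated sequence */

	for (size_t i = 1; i < len; i++)
	{
		/* Rejects NUL (0x00) and anything else that is not 10xxxxxx */
		if ((src[i] & 0xC0) != 0x80)
			return -1;
		c = (c << 6) | (src[i] & 0x3F);
	}

	if (c < min_cp[len - 1])
		return -1;				/* overlong encoding */
	if (c >= 0xD800 && c <= 0xDFFF)
		return -1;				/* UTF-16 surrogate */
	if (c > 0x10FFFF)
		return -1;				/* beyond the Unicode code space */

	*cp = c;
	return (ssize_t) len;
}
```

The cost is that every caller then has to decide what a -1 means in its
context, for example raising an error or substituting a replacement
character.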

Regards,
	Jeff Davis