Thread
-
Re: Small patch to improve safety of utf8_to_unicode().
Jeff Davis <pgsql@j-davis.com> — 2025-12-17T19:37:59Z
On Tue, 2025-12-16 at 07:34 +0800, Chao Li wrote:
> > <v2-0001-Make-utf8_to_unicode-safer.patch>
>
> V2 LGTM.

On second thought, if we're going to change something here, we should
probably have a more flexible API for both utf8_to_unicode() and
unicode_to_utf8(). Looking at the callers, I think we want to have
signatures something like:

    /* returns number of bytes consumed, or -1 */
    static inline ssize_t
    utf8_to_unicode(char32_t *cp, const unsigned char *src, size_t srclen)
    {
        ...
    }

    /* returns number of bytes written, or -1 */
    static inline ssize_t
    unicode_to_utf8(unsigned char *dst, size_t dstsize, char32_t cp)
    {
        ...
    }

That would make both APIs safer, and the caller wouldn't need to call
unicode_utf8len() or pg_utf8_mblen() separately.

We could also do more validation, but of course then the callers would
need to do something if they encounter a failure. We could also try to
catch NUL terminators in the middle of a sequence, which might be
useful.

Regards,
Jeff Davis
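
For illustration only, here is a minimal standalone sketch of how the two proposed signatures might be filled in. It is not the patch under discussion and not PostgreSQL code: the sketch_ prefix and the sketch_utf8_seqlen() helper are hypothetical names, and the level of validation is an assumption. It checks the lead byte, the available length, and the 10xxxxxx pattern of continuation bytes (which also rejects a NUL inside a sequence). It does not reject overlong encodings or surrogate code points.

```c
#include <stddef.h>
#include <sys/types.h>          /* ssize_t (POSIX) */
#include <uchar.h>              /* char32_t (C11) */

/* Hypothetical helper: sequence length implied by the lead byte, 0 if invalid. */
static inline size_t
sketch_utf8_seqlen(unsigned char lead)
{
    if (lead < 0x80)
        return 1;
    if ((lead & 0xE0) == 0xC0)
        return 2;
    if ((lead & 0xF0) == 0xE0)
        return 3;
    if ((lead & 0xF8) == 0xF0)
        return 4;
    return 0;                   /* stray continuation byte or invalid lead */
}

/* returns number of bytes consumed, or -1 */
static inline ssize_t
sketch_utf8_to_unicode(char32_t *cp, const unsigned char *src, size_t srclen)
{
    /* payload bits of the lead byte, indexed by sequence length */
    static const unsigned char lead_mask[] = {0x00, 0x7F, 0x1F, 0x0F, 0x07};
    size_t      len;
    char32_t    result;

    if (srclen == 0)
        return -1;
    len = sketch_utf8_seqlen(src[0]);
    if (len == 0 || len > srclen)
        return -1;              /* bad lead byte or truncated input */

    result = src[0] & lead_mask[len];
    for (size_t i = 1; i < len; i++)
    {
        /* every continuation byte must be 10xxxxxx; this also rejects NUL */
        if ((src[i] & 0xC0) != 0x80)
            return -1;
        result = (result << 6) | (src[i] & 0x3F);
    }
    *cp = result;
    return (ssize_t) len;
}

/* returns number of bytes written, or -1 */
static inline ssize_t
sketch_unicode_to_utf8(unsigned char *dst, size_t dstsize, char32_t cp)
{
    size_t      len;

    if (cp <= 0x7F)
        len = 1;
    else if (cp <= 0x7FF)
        len = 2;
    else if (cp <= 0xFFFF)
        len = 3;
    else if (cp <= 0x10FFFF)
        len = 4;
    else
        return -1;              /* beyond the Unicode code space */

    if (len > dstsize)
        return -1;              /* destination buffer too small */

    if (len == 1)
        dst[0] = (unsigned char) cp;
    else
    {
        /* lead-byte prefix for 2-, 3- and 4-byte sequences */
        static const unsigned char lead_prefix[] = {0, 0, 0xC0, 0xE0, 0xF0};

        /* fill continuation bytes from the end, six bits at a time */
        for (size_t i = len - 1; i > 0; i--)
        {
            dst[i] = (unsigned char) (0x80 | (cp & 0x3F));
            cp >>= 6;
        }
        dst[0] = (unsigned char) (lead_prefix[len] | cp);
    }
    return (ssize_t) len;
}
```

With this shape, a caller walking a buffer can advance by the return value and treat -1 as a failure, without a separate length lookup beforehand.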