Thread

  1. Re: Small patch to improve safety of utf8_to_unicode().

    Jeff Davis <pgsql@j-davis.com> — 2025-12-17T19:37:59Z

    On Tue, 2025-12-16 at 07:34 +0800, Chao Li wrote:
    > > <v2-0001-Make-utf8_to_unicode-safer.patch>
    > 
    > V2 LGTM.
    
    On second thought, if we're going to change something here, we should
    probably have a more flexible API for both utf8_to_unicode() and
    unicode_to_utf8().
    
    Looking at the callers, I think we want to have signatures something
    like:
    
    /* returns number of bytes consumed, or -1 */
    static inline ssize_t
    utf8_to_unicode(char32_t *cp, const unsigned char *src, size_t srclen)
    {
        ...
    }
    
    /* returns number of bytes written, or -1 */
    static inline ssize_t
    unicode_to_utf8(unsigned char *dst, size_t dstsize, char32_t cp)
    {
        ...
    }
    
    That would make both APIs safer, and the caller wouldn't need to call
    unicode_utf8len() or pg_utf8_mblen() separately.
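
    As a rough illustration of how the two signatures above could behave,
    here is a minimal sketch. The function bodies are assumptions and are
    not taken from any posted patch. It assumes <uchar.h> provides char32_t
    and <sys/types.h> provides ssize_t. It checks lengths and continuation
    bytes but does not reject overlong encodings or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>			/* ssize_t (POSIX) */
#include <uchar.h>				/* char32_t */

/*
 * Sketch only: decode one UTF-8 sequence from src, reading at most srclen
 * bytes. Returns number of bytes consumed, or -1 on a bad lead byte, a
 * truncated sequence, or a bad continuation byte.
 */
static inline ssize_t
utf8_to_unicode(char32_t *cp, const unsigned char *src, size_t srclen)
{
	size_t		len;
	char32_t	c;

	assert(cp != NULL && src != NULL);

	if (srclen == 0)
		return -1;

	if (src[0] < 0x80)
	{
		len = 1;
		c = src[0];
	}
	else if ((src[0] & 0xE0) == 0xC0)
	{
		len = 2;
		c = src[0] & 0x1F;
	}
	else if ((src[0] & 0xF0) == 0xE0)
	{
		len = 3;
		c = src[0] & 0x0F;
	}
	else if ((src[0] & 0xF8) == 0xF0)
	{
		len = 4;
		c = src[0] & 0x07;
	}
	else
		return -1;				/* stray continuation or invalid lead byte */

	if (len > srclen)
		return -1;				/* sequence runs past the end of the input */

	for (size_t i = 1; i < len; i++)
	{
		/* a NUL byte fails this test too, so it is rejected mid-sequence */
		if ((src[i] & 0xC0) != 0x80)
			return -1;
		c = (c << 6) | (src[i] & 0x3F);
	}

	*cp = c;
	return (ssize_t) len;
}

/*
 * Sketch only: encode cp into dst, writing at most dstsize bytes.
 * Returns number of bytes written, or -1 if cp is above U+10FFFF or the
 * buffer is too small.
 */
static inline ssize_t
unicode_to_utf8(unsigned char *dst, size_t dstsize, char32_t cp)
{
	size_t		len;

	assert(dst != NULL);

	if (cp < 0x80)
		len = 1;
	else if (cp < 0x800)
		len = 2;
	else if (cp < 0x10000)
		len = 3;
	else if (cp <= 0x10FFFF)
		len = 4;
	else
		return -1;

	if (len > dstsize)
		return -1;

	if (len == 1)
		dst[0] = (unsigned char) cp;
	else
	{
		/* lead byte markers for 2, 3 and 4 byte sequences */
		static const unsigned char lead[] = {0xC0, 0xE0, 0xF0};

		/* fill continuation bytes from the end, 6 bits at a time */
		for (size_t i = len - 1; i > 0; i--)
		{
			dst[i] = (unsigned char) (0x80 | (cp & 0x3F));
			cp >>= 6;
		}
		dst[0] = (unsigned char) (lead[len - 2] | cp);
	}

	return (ssize_t) len;
}
```

    With this shape a caller can advance its pointer by the return value
    and treat -1 as "stop and report", without a separate length lookup.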
    
    We could also do more validation, but of course then the callers would
    need to do something if they encounter a failure. We could also try to
    catch NUL terminators in the middle of a sequence, which might be
    useful.
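
    To make the "more validation" idea concrete, here is one possible
    sketch of a stricter check that tells the caller why a sequence was
    rejected, including a NUL byte in the middle of a sequence. The enum
    Utf8SeqStatus and the helper utf8_seq_check() are hypothetical names
    invented for this illustration. They do not exist in PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes, for illustration only. */
typedef enum
{
	UTF8_SEQ_OK,
	UTF8_SEQ_TRUNCATED,			/* input ended before the sequence did */
	UTF8_SEQ_EMBEDDED_NUL,		/* NUL byte where a continuation was expected */
	UTF8_SEQ_INVALID			/* bad byte, overlong form, surrogate, or
								 * value above U+10FFFF */
} Utf8SeqStatus;

/*
 * Hypothetical helper: validate the single UTF-8 sequence starting at src,
 * reading at most srclen bytes. On UTF8_SEQ_OK, *seqlen is set to the
 * sequence length in bytes.
 */
static inline Utf8SeqStatus
utf8_seq_check(const unsigned char *src, size_t srclen, size_t *seqlen)
{
	size_t			len;
	unsigned long	c;

	assert(src != NULL && seqlen != NULL);

	if (srclen == 0)
		return UTF8_SEQ_TRUNCATED;

	if (src[0] < 0x80)
	{
		len = 1;
		c = src[0];
	}
	else if ((src[0] & 0xE0) == 0xC0)
	{
		len = 2;
		c = src[0] & 0x1F;
	}
	else if ((src[0] & 0xF0) == 0xE0)
	{
		len = 3;
		c = src[0] & 0x0F;
	}
	else if ((src[0] & 0xF8) == 0xF0)
	{
		len = 4;
		c = src[0] & 0x07;
	}
	else
		return UTF8_SEQ_INVALID;

	for (size_t i = 1; i < len; i++)
	{
		if (i >= srclen)
			return UTF8_SEQ_TRUNCATED;
		if (src[i] == 0)
			return UTF8_SEQ_EMBEDDED_NUL;
		if ((src[i] & 0xC0) != 0x80)
			return UTF8_SEQ_INVALID;
		c = (c << 6) | (src[i] & 0x3F);
	}

	/* reject overlong encodings: value would fit in a shorter sequence */
	if ((len == 2 && c < 0x80) ||
		(len == 3 && c < 0x800) ||
		(len == 4 && c < 0x10000))
		return UTF8_SEQ_INVALID;

	/* reject UTF-16 surrogates and values outside the Unicode range */
	if ((c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
		return UTF8_SEQ_INVALID;

	*seqlen = len;
	return UTF8_SEQ_OK;
}
```

    Whether callers would want this much detail, or just a single -1, is
    exactly the open question: every distinct failure is something the
    caller then has to handle.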
    
    Regards,
    	Jeff Davis