Thread

  1. Re: C11: should we use char32_t for unicode code points?

    Jeff Davis <pgsql@j-davis.com> — 2025-10-29T20:14:21Z

    On Thu, 2025-10-30 at 04:25 +1300, Thomas Munro wrote:
    > Here are some sketch-quality patches to try out some of these ideas,
    > for discussion.  I gave them .txt endings so as not to hijack your
    > thread's CI.
    
    I like the direction this is going. I will commit the char32_t work
    anyway, so afterward feel free to hijack the thread (there's a lot of
    good information here so continuing here might be more productive than
    starting a new thread).
    
    Regarding 0002, IIUC, for PG_WCHAR_UTF32, surrogates are forbidden, but
    the comment about UTF-16 is a bit vague. I think we should add some
    asserts to make it clear.
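
    A minimal sketch of the kind of assertion meant here, not taken from
    the patch: the helper names (sketch_is_surrogate, sketch_is_valid_utf32,
    sketch_check_utf32) are hypothetical, and it assumes a platform that
    provides <uchar.h>. It only illustrates that a UTF-32 value must be at
    most U+10FFFF and must not fall in the surrogate range U+D800..U+DFFF.

```c
#include <assert.h>
#include <stdbool.h>
#include <uchar.h>

/* Hypothetical helper: true for the UTF-16 surrogate range U+D800..U+DFFF. */
static inline bool
sketch_is_surrogate(char32_t c)
{
	return c >= 0xD800 && c <= 0xDFFF;
}

/*
 * Hypothetical helper: true if c is a Unicode scalar value, i.e. something
 * that may legitimately appear in a UTF-32 string.
 */
static inline bool
sketch_is_valid_utf32(char32_t c)
{
	return c <= 0x10FFFF && !sketch_is_surrogate(c);
}

/*
 * Hypothetical checkpoint: assert the invariant where a value enters code
 * that assumes UTF-32, then pass the value through unchanged.
 */
static inline char32_t
sketch_check_utf32(char32_t c)
{
	assert(sketch_is_valid_utf32(c));
	return c;
}
```

    For the UTF-16 case the same range test would presumably be used the
    other way around, to document where surrogates are expected rather
    than forbidden.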
    
    The basic communication mechanism between the modules is the database
    encoding: it determines PgWcharEncodingScheme in both wchar.c and
    pg_locale_libc.c. That seems reasonable to me, and doesn't interfere
    with the other providers.
    
    I'm still not quite sure how this fits with ICU in a single-byte
    encoding, but it doesn't seem worse than what we do currently.
    
    Also, tangentially, I'm a bit anxious to do a permanent
    setlocale(LC_CTYPE, "C"), and we are very close. If these two threads
    are successful, I believe we can do it:
    
    https://www.postgresql.org/message-id/90f176c5b85b9da26a3265b2630ece3552068566.camel%40j-davis.com
    
    https://www.postgresql.org/message-id/d9657a6e51aa20702447bb2386b32fea6218670f.camel@j-davis.com
    
    That would be a big simplification because it would isolate libc ctype
    behavior to pg_locale_libc.c. That would make me feel generally more
    comfortable with additional work in this area.
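
    To illustrate the idea only (this is not code from either thread):
    the sketch below pins the global LC_CTYPE to "C" once, and performs
    any locale-sensitive classification through an explicit locale_t.
    The function names are hypothetical, and it assumes a POSIX.1-2008
    libc where newlocale() and isalpha_l() are declared in <locale.h>
    and <ctype.h>; some platforms need a different header.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: set the process-wide LC_CTYPE to "C" once at startup. */
static bool
sketch_pin_ctype_to_c(void)
{
	return setlocale(LC_CTYPE, "C") != NULL;
}

/*
 * Hypothetical: classify a byte using an explicit locale object, so the
 * result does not depend on the global LC_CTYPE setting.  Returns false
 * if the named locale cannot be loaded.
 */
static bool
sketch_isalpha_in(const char *locale_name, unsigned char ch)
{
	locale_t	loc;
	bool		result;

	assert(locale_name != NULL);

	loc = newlocale(LC_CTYPE_MASK, locale_name, (locale_t) 0);
	if (loc == (locale_t) 0)
		return false;

	result = isalpha_l(ch, loc) != 0;
	freelocale(loc);
	return result;
}
```

    A real implementation would cache the locale_t rather than create it
    per call; the point is only that no caller relies on the global
    setting.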
    
    Regards,
    	Jeff Davis