Thread

  1. Re: Remaining dependency on setlocale()

    Peter Eisentraut <peter@eisentraut.org> — 2025-12-17T10:39:05Z

    On 12.12.25 21:11, Jeff Davis wrote:
    >> case '\xc7':        /* C with cedilla */
    >>
    >> so the premise that "fuzzystrmatch is designed for ASCII" does not
    >> appear to be correct.  Needs more analysis.
    >>
    >> (But apparently it's not multibyte aware at all, so I don't know what
    >> to do about that.)
    > I didn't notice that, thank you. Agreed, we need a bit more discussion
    > around this case as well as soundex().
    
    Soundex is an ASCII-only algorithm; there is no expectation that the 
    algorithm does anything useful with non-ASCII characters, and it doesn't 
    do so now.  So I think using pg_ascii_toupper() is ok.  (Users could for 
    example use unaccent to preprocess text.)
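
    To illustrate why ASCII-only uppercasing is enough for Soundex, here is a
    minimal sketch of a simplified textbook Soundex. It is not the
    fuzzystrmatch code: ascii_toupper_sketch() is a hypothetical stand-in for
    pg_ascii_toupper(), and soundex_sketch() and its table are made up for
    this example. Any byte outside A-Z after uppercasing is simply skipped.

```c
#include <stddef.h>

/* Hypothetical stand-in for pg_ascii_toupper(): touches only a-z. */
static unsigned char
ascii_toupper_sketch(unsigned char ch)
{
    if (ch >= 'a' && ch <= 'z')
        ch = (unsigned char) (ch - 'a' + 'A');
    return ch;
}

/* Digit codes for 'A'..'Z'; '0' means the letter is not coded. */
static const char soundex_table_sketch[] = "01230120022455012623010202";

/*
 * Simplified textbook Soundex (an assumption for illustration, not
 * PostgreSQL's implementation).  Writes four characters plus a
 * terminating NUL into out.
 */
static void
soundex_sketch(const char *in, char out[5])
{
    size_t      n = 0;
    char        last = '0';

    for (; *in != '\0' && n < 4; in++)
    {
        unsigned char ch = ascii_toupper_sketch((unsigned char) *in);
        char        code;

        /* Non-ASCII bytes, digits, and punctuation contribute nothing. */
        if (ch < 'A' || ch > 'Z')
            continue;

        code = soundex_table_sketch[ch - 'A'];
        if (n == 0)
            out[n++] = (char) ch;       /* keep the first letter as-is */
        else if (code != '0' && code != last)
            out[n++] = code;            /* new, non-repeated consonant code */

        /* H and W do not separate repeated codes; vowels do. */
        if (ch != 'H' && ch != 'W')
            last = code;
    }

    while (n < 4)
        out[n++] = '0';                 /* pad short results with zeros */
    out[4] = '\0';
}
```

    In this sketch a LATIN1 e-acute byte (0xE9) is ignored rather than
    folded to E, which is where preprocessing with unaccent would come in.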
    
    One might wonder if the presence of non-ASCII characters should be an 
    error, but that doesn't have to be the subject of this thread.  I 
    noticed that the Wikipedia page for Soundex even calls out PostgreSQL 
    for doing things slightly differently from everyone else, but I haven't 
    studied the details.
    
    For Metaphone, I found the reference implementation linked from its 
    Wikipedia page, and it looks like our implementation is pretty closely 
    aligned to that.  That reference implementation also contains the 
    C-with-cedilla case explicitly.  The correct fix here would probably be 
    to change the implementation to work on wide characters.  But I think 
    for the moment you could try a shortcut: use pg_ascii_toupper(), but 
    if the encoding is LATIN1 (or LATIN9, or whichever other encodings 
    also contain C-with-cedilla at that code point), also uppercase that 
    one character explicitly.  This would preserve the existing behavior.
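
    A rough sketch of that shortcut, under some assumptions:
    ascii_toupper_sketch() is a hypothetical stand-in for pg_ascii_toupper(),
    and the caller passes a flag saying whether the encoding places
    C-with-cedilla at 0xC7 (lowercase at 0xE7), as LATIN1 and LATIN9 do. In
    the backend that flag would presumably come from a check on the database
    encoding; that part is not shown.

```c
#include <stdbool.h>

/* Hypothetical stand-in for pg_ascii_toupper(): touches only a-z. */
static unsigned char
ascii_toupper_sketch(unsigned char ch)
{
    if (ch >= 'a' && ch <= 'z')
        ch = (unsigned char) (ch - 'a' + 'A');
    return ch;
}

/*
 * Uppercase one byte for Metaphone.  ASCII letters are always handled;
 * lowercase c-with-cedilla (0xE7) is mapped to 0xC7 only when the caller
 * says the encoding has it at that code point (assumed: LATIN1, LATIN9).
 * Every other byte is returned unchanged.
 */
static unsigned char
metaphone_toupper_sketch(unsigned char ch, bool has_latin_cedilla)
{
    if (has_latin_cedilla && ch == 0xE7)
        return 0xC7;
    return ascii_toupper_sketch(ch);
}
```

    With the flag false, 0xE7 passes through untouched, so a UTF-8
    continuation byte with that value is not altered.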
    
    Note that the documentation calls out: "At present, the soundex, 
    metaphone, dmetaphone, and dmetaphone_alt functions do not work well 
    with multibyte encodings (such as UTF-8)."