Thread

  1. Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation

    Heikki Linnakangas <hlinnaka@iki.fi> — 2025-12-02T17:29:06Z

    On 02/12/2025 18:36, Heikki Linnakangas wrote:
    > On 02/12/2025 18:24, Laurenz Albe wrote:
    >> On Tue, 2025-12-02 at 10:03 +0000, PG Bug reporting form wrote:
    >>> PostgreSQL version: 18.1
    >>>
    >>> When using a nondeterministic ICU collation, the replace() function
    >>> fails to replace a substring when that substring appears at the end
    >>> of the input string.
    >>>
    >>> Occurrences of the same substring earlier in the string are replaced
    >>> normally.
    >>>
    >>> Specific collation used:
    >>> create collation test_nondeterministic (
    >>>      provider = icu,
    >>>      locale = 'und-u-ks-level2',
    >>>      deterministic = false
    >>> );
    >>>
    >>> -- Replace final character under nondeterministic collation
    >>> SELECT replace(
    >>>      'testx' COLLATE "test_nondeterministic",
    >>>      'x'     COLLATE "test_nondeterministic",
    >>>      'y') AS res1;
    >>
    >> I can reproduce the problem, and the attached patch fixes it for me.
    > 
    > +1, looks good to me. Let's also add a regression test for this.
    
    I added a simple test for this, and I think the patched behavior is
    still not quite right. I added the following to the collate.icu.utf8
    test (the lines marked with + are new):
    
      CREATE TABLE test4nfd (a int, b text);
      INSERT INTO test4nfd VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
      UPDATE test4nfd SET b = normalize(b, nfd);
      -- This shows why replace should be greedy.  Otherwise, in the NFD
      -- case, the match would stop before the decomposed accents, which
      -- would leave the accents in the results.
      SELECT a, b, replace(b COLLATE ignore_accents, 'co', 'ma') FROM test4;
       a |  b   | replace
      ---+------+---------
       1 | cote | mate
       2 | côte | mate
       3 | coté | maté
       4 | côté | maté
      (4 rows)
    
      SELECT a, b, replace(b COLLATE ignore_accents, 'co', 'ma') FROM test4nfd;
       a |  b   | replace
      ---+------+---------
       1 | cote | mate
       2 | côte | mate
       3 | coté | maté
       4 | côté | maté
      (4 rows)
    
    +-- Test for match at the end of the string.  (We had a bug on that
    +-- once)
    +SELECT a, b, replace(b COLLATE ignore_accents, 'te', 'ma') FROM test4nfd;
    + a |  b   | replace
    +---+------+---------
    + 1 | cote | coma
    + 2 | côte | coma
    + 3 | coté | coma
    + 4 | côté | coma
    +(4 rows)
    +
    
    In the added test query, the accents on the 'o' are stripped in rows 2
    and 4 ('côte' and 'côté' both become 'coma'), which doesn't look correct.
    
    - Heikki
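
The greedy-match comment in the test and the stripped-accent result in the message above can be explored with a small model outside PostgreSQL. The Python sketch below is not PostgreSQL's implementation and does not use ICU. It assumes that "ignoring accents" can be approximated by dropping combining marks after NFD normalization. The function names and the `start_on_base_only` flag are hypothetical, introduced only for illustration. Whether a match starting on a combining mark is what actually happens in the patched code is not established in this thread.

```python
import unicodedata


def strip_accents(s: str) -> str:
    """Approximate an accent-insensitive key: NFD-decompose, drop combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def replace_ignoring_accents(src: str, pattern: str, repl: str,
                             start_on_base_only: bool = True) -> str:
    """Replace accent-insensitive matches of `pattern` in `src` with `repl`.

    Matching is greedy: among all end positions that give an equal key,
    the longest one wins, so trailing combining marks are consumed.
    If start_on_base_only is True (an assumption of this sketch), a match
    may not begin on a combining mark. The result is returned in NFC.
    """
    src = unicodedata.normalize("NFD", src)
    key = strip_accents(pattern)
    if not key:
        # Nothing to search for: return the input unchanged (apart from NFC).
        return unicodedata.normalize("NFC", src)

    out = []
    i, n = 0, len(src)
    while i < n:
        match_end = None
        if not (start_on_base_only and unicodedata.combining(src[i])):
            # Try every end position and remember the last (longest) match.
            for j in range(i + 1, n + 1):
                if strip_accents(src[i:j]) == key:
                    match_end = j
        if match_end is None:
            out.append(src[i])
            i += 1
        else:
            out.append(repl)
            i = match_end
    return unicodedata.normalize("NFC", "".join(out))


if __name__ == "__main__":
    for word in ("cote", "c\u00f4te", "cot\u00e9", "c\u00f4t\u00e9"):
        strict = replace_ignoring_accents(word, "te", "ma")
        loose = replace_ignoring_accents(word, "te", "ma", start_on_base_only=False)
        print(word, strict, loose)
```

In this model, allowing a match to start on a combining mark turns 'côte' into 'coma', the same shape as the output in the added test, because the decomposed circumflex is swallowed by the match on 'te'. Restricting matches to start on a base character gives 'côma' instead, while 'coté' still becomes 'coma' in both modes because the greedy match consumes the trailing accent.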