Thread

  1. Re: Patch for collation using ICU

    Tatsuo Ishii <t-ishii@sra.co.jp> — 2005-05-09T14:32:00Z

    > > -----Original Message-----
    > > From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp] 
    > > Sent: Sunday, May 08, 2005 11:08 PM
    > > To: John Hansen
    > > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; pgsql-hackers@postgresql.org
    > > Subject: Re: [HACKERS] Patch for collation using ICU
    > > 
    > > > > I don't buy it. If the current conversion tables do the right
    > > > > thing, why do we need to replace them? Or if the conversion
    > > > > tables are not correct, why don't you fix them? I think the rules
    > > > > of character conversion will not change frequently, especially
    > > > > for LATIN languages. Thus the maintenance cost is not too high.
    > > > 
    > > > I never said we need to, but if we're going to implement ICU, then
    > > > we might as well go all the way.
    > > 
    > > So you admit there's no benefit using ICU for replacing 
    > > existing conversions?
    > > 
    > > Besides the fact that ICU does not support all existing
    > > conversions, I think ICU has a serious flaw when used for
    > > conversion. If I understand correctly, ICU uses UNICODE internally
    > > to do the conversion. For example, to implement SJIS->EUC_JP
    > > conversion, ICU first converts SJIS to UNICODE and then converts
    > > UNICODE to EUC_JP. The problem is that these conversions do not
    > > round-trip (conversion between SJIS/EUC_JP and UNICODE will lose
    > > some information). Thus SJIS->EUC_JP->SJIS conversion using ICU
    > > does not preserve the original text.
    > 
    > Just for the record, I fetched a web page encoded in sjis and converted
    > it to euc-jp and back using uconv from ICU 3.2, and the result is
    > identical to the original file.
    > 
    >  uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html
    >  uconv -f EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc
    >  diff index.html index.html.sjis
    
    Not all SJIS/EUC_JP characters have the problem. You might want to
    try: Shift_JIS 0x81e6, 0x879a, 0xfa5b.
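
    A minimal sketch of such a round-trip check, written in Python rather
    than against ICU. Python's built-in codecs (shift_jis, cp932, euc_jp)
    also pivot through Unicode, but their mapping tables are not ICU's, so
    the results may differ from what uconv reports. The helper names
    round_trip and survives are made up for this illustration, and cp932
    is tried only on the assumption that a strict Shift_JIS table may
    reject some of these bytes.

```python
from typing import Optional


def round_trip(data: bytes, src: str, dst: str) -> Optional[bytes]:
    """Convert src -> Unicode -> dst -> Unicode -> src.

    Returns the bytes obtained after the full trip, or None if any
    step cannot be decoded or encoded.
    """
    try:
        pivot = data.decode(src)              # src bytes -> Unicode
        there = pivot.encode(dst)             # Unicode -> dst bytes
        back = there.decode(dst).encode(src)  # dst bytes -> Unicode -> src bytes
    except UnicodeError:
        return None
    return back


def survives(data: bytes, src: str, dst: str) -> bool:
    """True only if the round trip reproduces the original bytes exactly."""
    return round_trip(data, src, dst) == data


if __name__ == "__main__":
    # 0x82a0 is an ordinary hiragana character used as a control;
    # the other three are the byte sequences mentioned above.
    samples = [bytes.fromhex(h) for h in ("82a0", "81e6", "879a", "fa5b")]
    for raw in samples:
        for src in ("shift_jis", "cp932"):
            result = round_trip(raw, src, "euc_jp")
            shown = "failed" if result is None else result.hex()
            print(f"{src:9} 0x{raw.hex()} -> euc_jp -> {shown} "
                  f"(preserved: {result == raw})")
```

    Any line reporting "preserved: False" marks a byte sequence that did
    not come back unchanged under that codec's tables.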
    
    BTW, I got this with ICU 3.2:
    
    $ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt
    Conversion from Unicode to codepage failed at input byte position 0. Unicode: 301c Error: Invalid character found
    
    The content of a.txt is 0xa1c1, which is a valid EUC_JP character.
    
    This makes me nervous about using ICU...
    --
    Tatsuo Ishii