Thread

  1. Re: Patch for collation using ICU

    Tatsuo Ishii <t-ishii@sra.co.jp> — 2005-05-10T07:44:48Z

    > Tatsuo Ishii wrote:
    > > Sent: Tuesday, May 10, 2005 12:32 AM
    > > To: John Hansen
    > > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; 
    > > pgsql-hackers@postgresql.org
    > > Subject: Re: [HACKERS] Patch for collation using ICU
    > > 
    > > > > -----Original Message-----
    > > > > From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
    > > > > Sent: Sunday, May 08, 2005 11:08 PM
    > > > > To: John Hansen
    > > > > Cc: pgman@candle.pha.pa.us; girgen@pingpong.net; 
    > > > > pgsql-hackers@postgresql.org
    > > > > Subject: Re: [HACKERS] Patch for collation using ICU
    > > > > 
    > > > > > > I don't buy it. If the current conversion tables do the right
    > > > > > > thing, why do we need to replace them? Or if the conversion
    > > > > > > tables are not correct, why don't you fix them? I think the
    > > > > > > rules of character conversion will not change frequently,
    > > > > > > especially for LATIN languages. Thus the maintenance cost is
    > > > > > > not too high.
    > > > > > 
    > > > > > I never said we need to, but if we're going to implement ICU,
    > > > > > then we might as well go all the way.
    > > > > 
    > > > > So you admit there's no benefit in using ICU to replace the
    > > > > existing conversions?
    > > > > 
    > > > > Besides the fact that ICU does not support all existing
    > > > > conversions, I think ICU has a serious flaw when used for
    > > > > conversion. If I understand correctly, ICU uses UNICODE internally
    > > > > to do the conversion. For example, to implement SJIS->EUC_JP
    > > > > conversion, ICU first converts SJIS to UNICODE, then converts
    > > > > UNICODE to EUC_JP. The problem is that these conversions are not
    > > > > round trip (conversion between SJIS/EUC_JP and UNICODE will lose
    > > > > some information). Thus SJIS->EUC_JP->SJIS conversion using ICU
    > > > > does not preserve the original text.
    > > > 
    > > > Just for the record, I fetched a web page encoded in sjis, and
    > > > converted it to euc-jp and back using uconv from ICU 3.2, and the
    > > > result is that the original is identical to the transformed file.
    > > > 
    > > >  uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html
    > > >  uconv -f EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc
    > > >  diff index.html index.html.sjis
    > > 
    > > Not all SJIS/EUC_JP characters have the problem. You might want to
    > > try: Shift_JIS 0x81e6, 0x879a, 0xfa5b.
    > > 
    > > BTW, I got this with ICU 3.2:
    > > 
    > > $ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt
    > > Conversion from Unicode to codepage failed at input byte position 0. Unicode: 301c Error: Invalid character found
    > > 
    > > The content of a.txt is 0xa1c1, which is a valid EUC_JP character.
    > 
    > That actually makes perfect sense, since according to unicode.org's
    > database:
    > 301C ~ WAVE DASH
    >        This character was encoded to match JIS C 6226-1978 1-33 "wave dash".
    >        The JIS standards and some industry practise disagree in mapping.
    >        - 3030 wavy dash
    >        - FF5E full width tilde
    > 
    > In PG FF5E is the mapping currently used. That is obviously wrong
    > (according to the standards), as that is only a 'similar character'.
    > 
    > Unfortunately, there is no mapping from 301C to shift_jis, as shift_jis
    > doesn't define "WAVE DASH".
    > In all, I believe this behaviour to be correct according to the
    > standards.
    > 
    > There'd be nothing to stop us from defining alternative mappings for the
    > cases where we deviate from the standard, but the question is, should we
    > be non-standard?
    
    You missed the point. EUC_JP 0xa1c1 is perfectly valid data, and
    uconv -f EUC_JP -t Shift_JIS should convert it to Shift_JIS 0x8160
    regardless of the internals of uconv.
    --
    Tatsuo Ishii
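
The disagreement above can be sketched in a few lines of Python. This is
only a toy model: the one-entry tables and the helper names
(convert_via_pivot, convert_direct, real_codec_views) are assumptions made
up for illustration, not ICU's or PostgreSQL's actual mapping data. The
"vendor" table mirrors the FF5E choice mentioned in the thread. The last
helper uses Python's bundled shift_jis and cp932 codecs as a stand-in to
show that two real tables decode Shift_JIS 0x8160 differently.

```python
# Toy model of the conversion discussed in this thread. The tables hold only
# the one character under debate and are illustrative assumptions, not the
# real mapping data of ICU or PostgreSQL.

EUC_JP_TO_UNICODE = {0xA1C1: 0x301C}         # JIS-standard view: WAVE DASH

# Two competing Unicode -> Shift_JIS tables for the same Shift_JIS code.
UNICODE_TO_SJIS_STANDARD = {0x301C: 0x8160}  # keyed on WAVE DASH
UNICODE_TO_SJIS_VENDOR = {0xFF5E: 0x8160}    # keyed on FULLWIDTH TILDE

# A direct table needs no pivot, so the Unicode disagreement cannot affect it.
EUC_JP_TO_SJIS_DIRECT = {0xA1C1: 0x8160}


def convert_via_pivot(code, to_unicode, from_unicode):
    """Convert one character through Unicode; return None if unmapped."""
    pivot = to_unicode.get(code)
    if pivot is None:
        return None
    return from_unicode.get(pivot)


def convert_direct(code, table):
    """Convert one character with a single source -> target table."""
    return table.get(code)


def real_codec_views():
    """Decode Shift_JIS 0x8160 with two of Python's bundled codecs."""
    raw = b"\x81\x60"
    return ord(raw.decode("shift_jis")), ord(raw.decode("cp932"))


if __name__ == "__main__":
    ok = convert_via_pivot(0xA1C1, EUC_JP_TO_UNICODE, UNICODE_TO_SJIS_STANDARD)
    bad = convert_via_pivot(0xA1C1, EUC_JP_TO_UNICODE, UNICODE_TO_SJIS_VENDOR)
    direct = convert_direct(0xA1C1, EUC_JP_TO_SJIS_DIRECT)
    print("pivot, matching tables:  ", hex(ok) if ok is not None else None)
    print("pivot, mismatched tables:", hex(bad) if bad is not None else None)
    print("direct table:            ", hex(direct))
    print("0x8160 decoded as:       ", [hex(cp) for cp in real_codec_views()])
```

In this sketch the pivot conversion succeeds only when both halves agree on
which Unicode code point stands for the character, while the direct table
always yields 0x8160. Both positions in the thread are visible here: the
failure is consistent with the tables in use, yet valid input still goes
unconverted.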