Thread

  1. Re: [HACKERS] Invalid unicode in COPY problem

    Tatsuo Ishii <t-ishii@sra.co.jp> — 2005-05-08T05:30:45Z

    > Tatsuo Ishii wrote:
    > > Sent: Sunday, May 08, 2005 12:01 PM
    > > To: linux@alteeve.com
    > > Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org
    > > Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
    > > 
    > > We have developed patches which relaxes the character 
    > > validation so that PostgreSQL accepts invalid characters. It 
    > > works like this:
    > 
    > That is just plain 100% wrong!!
    >
    > Under no circumstances should there be invalid data in a database.
    > And if you're trying to make a database of invalid data, then at 
    > least encode it using a valid encoding.
    > 
    > In fact, I've proposed strengthening the validation routines for UTF-8.
    
    Actually I myself thought as you are before. Later I found that it was
    not so good idea. People already have invalid encoded data in their
    precious database and have very hard time to migrate to newer version
    of PostgreSQL because of encoding validation.
    
    Think about this kind of situation:
    
    There is a table t1(member_id integer primary key, member_name text,
    address text, phone text, email text). I have to reach each member by
    either adress, phone or email. Unfortunately some of address field
    have wrong encoded data. In this case I will use phone or email to
    reach them. 
    
    Now I need to upgrade to newer PostgreSQL within 1 day. I know I have
    to fix wrong encoded field but it will take more than 1 day. So I
    would like to import the data first then fix wrong encoded field on
    running database since I can reach members by phone or email even with
    wrong encoded address field...
    
    I saw this kind of situation in the real world and that's why we
    developed the patches.
    --
    Tatsuo Ishii