Thread

  1. Re: Invalid unicode in COPY problem

    Tatsuo Ishii <t-ishii@sra.co.jp> — 2005-05-08T02:01:10Z

    We have developed patches which relaxes the character validation so
    that PostgreSQL accepts invalid characters. It works like this:
    
    1) new postgresql.conf item "mbstr_check" added.
    2) if mbstr_check = 0 then invalid characters are not accepted
       (same as current PostgreSQL behavior). This is the default.
    3) if mbstr_check = 1 then invalid characters are accepted with
       WARNING
    4) if mbstr_check = 2 then invalid characters are accepted without any
       warnings
    5) We have checked PostgreSQL source code if accepting invalid
       characters makes some troubles. We have found that we need to fix a
       place and the fix is included in the patches.
    
    Madison,
    If you are interested in the patches, I could send it to you.
    
    Hackers,
    Do you think the functionality something like above is worth to add to
    PostgreSQL?
    --
    Tatsuo Ishii
    
    > Hi all,
    > 
    >    I've been chasing down a bug and from what I have learned it may be 
    > because of how postgreSQL (8.0.2 on Fedora Core 4 test 2) handles 
    > invalid unicode. I've been given some ideas on how to try to catch 
    > invalid unicode but it seems expensive so I am hoping there is a 
    > postgresql way to deal with this problem.
    > 
    >    I've run into a problem where a bulk postgres "COPY..." statement is 
    > dieing because one of the lines contains a file name with an invalid 
    > unicode character. In nautilus this file has '(invalid encoding)' and 
    > the postgres error is 'CONTEXT:  COPY file_info_3, line 228287, column 
    > file_name: "Femme Fatal\uffff.url"'.
    > 
    >    To actually look at the file from the shell (bash) shows what appears 
    > to be a whitespace but when I copy/paste the file name I get the 
    > '\uffff' you see above.
    > 
    >    I could, with the help of the TLUG people, use regex to match for an 
    > invalid character and skip the file but that is not ideal. The reason is 
    > that this is for my backup program and invalid unicode or not, the 
    > contents of the file may still be important and I would prefer to have 
    > it in the database so that it is later copied. I can copy and move the 
    > file in the shell so the file isn't apparently in an of itself corrupt.
    > 
    >    So then, is there a way I can tell postresql to accept the invalid 
    > unicode name? Here is a copy of my schema:
    > 
    > tle-bu=> \d file_info_2
    >                                Table "public.file_info_2"
    >          Column        |         Type         |                Modifiers
    > ----------------------+----------------------+-----------------------------------------
    >   file_group_name      | text                 |
    >   file_group_uid       | bigint               | not null
    >   file_mod_time        | bigint               | not null
    >   file_name            | text                 | not null
    >   file_parent_dir      | text                 | not null
    >   file_perm            | text                 | not null
    >   file_size            | bigint               | not null
    >   file_type            | character varying(2) | not null default 
    > 'f'::character varying
    >   file_user_name       | text                 |
    >   file_user_uid        | bigint               | not null
    >   file_backup          | boolean              | not null default true
    >   file_display         | boolean              | not null default false
    >   file_restore_display | boolean              | not null default false
    >   file_restore         | boolean              | not null default false
    > Indexes:
    >      "file_info_2_display_idx" btree (file_type, file_parent_dir, file_name)
    > 
    >    'file_name' and 'file_parent_dir' are the columns that could have 
    > entries with the invalid unicode characters. Maybe I could/should use 
    > something other than 'text'? These columns could contain anything that a 
    > file or directory name could be.
    > 
    >    Thanks!
    > 
    > Madison
    > 
    > -- 
    > -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
    > Madison Kelly (Digimer)
    > TLE-BU, The Linux Experience; Back Up
    > http://tle-bu.thelinuxexperience.com
    > -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
    > 
    > ---------------------------(end of broadcast)---------------------------
    > TIP 7: don't forget to increase your free space map settings
    >