Thread

  1. Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2011-06-07T07:36:27Z

    Hi Tom,
    
    Issue is on Windows:
    
    If you see in attached failure.out file, (after running failure.sql) we are
    getting "ERROR:  invalid
    byte sequence for encoding "UTF8": 0xe59aff" error. Please note that byte
    sequence we got from database is e5 9a ff, where as actual byte sequence for
    the wide character '功' is e5 8a 9f.
    
    
    '功'      ==> UNICODE Character
    e5 8a 9f  ==> Original Byte Sequence for the given characters
    e5 9a ff  ==> downcase_truncate_identifier() result, which is invalid UTF8
    representation, stored in pg_catalog table
    
    While displaying on client, we receive this invalid byte sequence which
    throws an error. Note that UTF8 characters have predefined character ranges
    for each byte which is checked in pg_utf8_islegal() function. Here is the
    code snippet:
    
    ==
        a = source[2];
        if (a < 0x80 || a > 0xBF)
            return false;
    ==
    Note that source[2] = ff, which does not fall into the valid range which
    results in illegal UTF8 character sequence. If you carefully see the
    original one i.e. 9f, which falls within the range.
    
    since we smash the identifier to lower case using
    downcase_truncate_identifier() function, the solution is to make this
    function should be wide-char aware, like LOWER() function functionality.
    
    I see some discussion related to downcase_truncate_identifier() and
    wide-char aware function, but seems like we lost somewhere.
    (http://archives.postgresql.org/pgsql-hackers/2010-11/msg01385.php)
    This invalid byte sequence issue seems like a more serious issue, because it
    might lead e.g to pg_dump failures.
    
    I have tested this on PG9.0 beta4 (one click installers), BTW, we have
    observed same with earlier version as well.
    
    Attached is the .sql and its output (run on PG9.0 beta4).
    
    Any thoughts???
    
    Thanks
    
    -- 
    Jeevan B Chalke
    Senior Software Engineer, R&D
    EnterpriseDB Corporation
    The Enterprise PostgreSQL Company
    
    Phone: +91 20 30589500
    
    Website: www.enterprisedb.com
    EnterpriseDB Blog: http://blogs.enterprisedb.com/
    Follow us on Twitter: http://www.twitter.com/enterprisedb
    
    This e-mail message (and any attachment) is intended for the use of the
    individual or entity to whom it is addressed. This message contains
    information from EnterpriseDB Corporation that may be privileged,
    confidential, or exempt from disclosure under applicable law. If you are not
    the intended recipient or authorized to receive this for the intended
    recipient, any use, dissemination, distribution, retention, archiving, or
    copying of this communication is strictly prohibited. If you have received
    this e-mail in error, please notify the sender immediately by reply e-mail
    and delete this message.
    
  2. Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

    Robert Haas <robertmhaas@gmail.com> — 2011-06-08T00:52:05Z

    2011/6/7 Jeevan Chalke <jeevan.chalke@enterprisedb.com>:
    > since we smash the identifier to lower case using
    > downcase_truncate_identifier() function, the solution is to make this
    > function should be wide-char aware, like LOWER() function functionality.
    >
    > I see some discussion related to downcase_truncate_identifier() and
    > wide-char aware function, but seems like we lost somewhere.
    > (http://archives.postgresql.org/pgsql-hackers/2010-11/msg01385.php)
    > This invalid byte sequence issue seems like a more serious issue, because it
    > might lead e.g to pg_dump failures.
    
    It's a problem, but without an efficient algorithm for Unicode case
    folding, any fix we attempt to implement seems like it'll just be
    moving the problem around.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  3. Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2011-06-09T04:39:36Z

    On Wed, Jun 8, 2011 at 6:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
    
    > 2011/6/7 Jeevan Chalke <jeevan.chalke@enterprisedb.com>:
    > > since we smash the identifier to lower case using
    > > downcase_truncate_identifier() function, the solution is to make this
    > > function should be wide-char aware, like LOWER() function functionality.
    > >
    > > I see some discussion related to downcase_truncate_identifier() and
    > > wide-char aware function, but seems like we lost somewhere.
    > > (http://archives.postgresql.org/pgsql-hackers/2010-11/msg01385.php)
    > > This invalid byte sequence issue seems like a more serious issue, because
    > it
    > > might lead e.g to pg_dump failures.
    >
    > It's a problem, but without an efficient algorithm for Unicode case
    > folding, any fix we attempt to implement seems like it'll just be
    > moving the problem around.
    >
    
    Agree.
    
    I read on other mail thread that str_tolower() is a  wide-character-aware
    lower function but it is also a collation-aware and hence might change its
    behaviour wrt change in locale. However, Tom suggested that we need to have
    non-locale-dependent case folding algorithm.
    
    But still for same locale on same machine, where we can able to create a
    table, insert some data, we cannot retrieve it. Don't you think it is more
    serious and we need a quick solution here? As said earlier it may even lead
    to pg_dump failures. Given that str_tolower() functionality is locale
    dependent but still it will resolve this particular issue. Not sure, there
    might be a performance issue but at-least we are not giving an error.
    
    Please excuse me, if community already had a lot of discussion and kept this
    behaviour intentionally knowing all these errors and serious issues.
    
    Thanks
    
    
    
     --
    > Robert Haas
    > EnterpriseDB: http://www.enterprisedb.com
    > The Enterprise PostgreSQL Company
    >
    
    
    
    -- 
    Jeevan B Chalke
    Senior Software Engineer, R&D
    EnterpriseDB Corporation
    The Enterprise PostgreSQL Company
    
    Phone: +91 20 30589500
    
    Website: www.enterprisedb.com
    EnterpriseDB Blog: http://blogs.enterprisedb.com/
    Follow us on Twitter: http://www.twitter.com/enterprisedb
    
    This e-mail message (and any attachment) is intended for the use of the
    individual or entity to whom it is addressed. This message contains
    information from EnterpriseDB Corporation that may be privileged,
    confidential, or exempt from disclosure under applicable law. If you are not
    the intended recipient or authorized to receive this for the intended
    recipient, any use, dissemination, distribution, retention, archiving, or
    copying of this communication is strictly prohibited. If you have received
    this e-mail in error, please notify the sender immediately by reply e-mail
    and delete this message.
    
  4. Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

    Robert Haas <robertmhaas@gmail.com> — 2011-06-09T12:02:03Z

    On Thu, Jun 9, 2011 at 12:39 AM, Jeevan Chalke
    <jeevan.chalke@enterprisedb.com> wrote:
    >> It's a problem, but without an efficient algorithm for Unicode case
    >> folding, any fix we attempt to implement seems like it'll just be
    >> moving the problem around.
    >
    > Agree.
    >
    > I read on other mail thread that str_tolower() is a  wide-character-aware
    > lower function but it is also a collation-aware and hence might change its
    > behaviour wrt change in locale. However, Tom suggested that we need to have
    > non-locale-dependent case folding algorithm.
    >
    > But still for same locale on same machine, where we can able to create a
    > table, insert some data, we cannot retrieve it. Don't you think it is more
    > serious and we need a quick solution here? As said earlier it may even lead
    > to pg_dump failures. Given that str_tolower() functionality is locale
    > dependent but still it will resolve this particular issue. Not sure, there
    > might be a performance issue but at-least we are not giving an error.
    
    Well, as I understand it, the problem here is that if someone goes and
    changes the locale, then you might massively break the user's
    application.  For example, if the user says:
    
    CREATE TABLE FOO (...);
    SELECT * FROM FOO;
    
    ...that'll work, of course, because whatever you get when you downcase
    FOO will be the same both times.  But if the locale now changes, then
    the next...
    
    SELECT * FROM FOO;
    
    ...might fail, because the new downcasing of FOO might not match the old one.
    
    You could argue that that's better than the current situation, but
    it's not clear-cut.
    
    But now that I re-think about it, I guess what I'm confused about is
    this code here:
    
                    if (ch >= 'A' && ch <= 'Z')
                            ch += 'a' - 'A';
                    else if (IS_HIGHBIT_SET(ch) && isupper(ch))
                            ch = tolower(ch);
                    result[i] = (char) ch;
    
    It seems to me that we're downcasing the first byte of each wide
    character and ignoring the rest... which seems like it can't possibly
    be a good idea in a multi-byte encoding.  Perhaps we could keep that
    approach for single-byte encodings and just pass through multi-byte
    characters untouched?
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  5. Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-06-09T14:07:29Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > But now that I re-think about it, I guess what I'm confused about is
    > this code here:
    
    >                 if (ch >= 'A' && ch <= 'Z')
    >                         ch += 'a' - 'A';
    >                 else if (IS_HIGHBIT_SET(ch) && isupper(ch))
    >                         ch = tolower(ch);
    >                 result[i] = (char) ch;
    
    The expected behavior there is that case-folding of non-ASCII characters
    will occur in single-byte encodings but nothing will happen to
    multi-byte characters.  We are relying on isupper() to not return true
    when presented with a character fragment in a multibyte locale.
    
    			regards, tom lane
    
    
  6. Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

    Robert Haas <robertmhaas@gmail.com> — 2011-06-09T14:11:38Z

    On Thu, Jun 9, 2011 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Robert Haas <robertmhaas@gmail.com> writes:
    >> But now that I re-think about it, I guess what I'm confused about is
    >> this code here:
    >
    >>                 if (ch >= 'A' && ch <= 'Z')
    >>                         ch += 'a' - 'A';
    >>                 else if (IS_HIGHBIT_SET(ch) && isupper(ch))
    >>                         ch = tolower(ch);
    >>                 result[i] = (char) ch;
    >
    > The expected behavior there is that case-folding of non-ASCII characters
    > will occur in single-byte encodings but nothing will happen to
    > multi-byte characters.  We are relying on isupper() to not return true
    > when presented with a character fragment in a multibyte locale.
    
    Based on Jeevan's original message, it seems like that's not always
    the case, at least on Windows.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  7. Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-06-09T14:15:07Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > On Thu, Jun 9, 2011 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> We are relying on isupper() to not return true
    >> when presented with a character fragment in a multibyte locale.
    
    > Based on Jeevan's original message, it seems like that's not always
    > the case, at least on Windows.
    
    Hmm.  Now that you mention it, I think the same has been said about OSX.
    
    If we need to work around brain-dead isupper() tests, maybe the best
    thing is to implement two versions of the loop:
    
    	if (encoding is single byte)
    		... loop as it stands ...
    	else
    		... loop without the "else if" part
    
    			regards, tom lane
    
    
  8. Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

    Robert Haas <robertmhaas@gmail.com> — 2011-06-09T14:24:16Z

    On Thu, Jun 9, 2011 at 10:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Robert Haas <robertmhaas@gmail.com> writes:
    >> On Thu, Jun 9, 2011 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >>> We are relying on isupper() to not return true
    >>> when presented with a character fragment in a multibyte locale.
    >
    >> Based on Jeevan's original message, it seems like that's not always
    >> the case, at least on Windows.
    >
    > Hmm.  Now that you mention it, I think the same has been said about OSX.
    >
    > If we need to work around brain-dead isupper() tests, maybe the best
    > thing is to implement two versions of the loop:
    >
    >        if (encoding is single byte)
    >                ... loop as it stands ...
    >        else
    >                ... loop without the "else if" part
    
    That seems like a clear improvement.  It's a long way from perfect,
    but still worthwhile.
    
    Would we back-patch that?  Just do it in master?  Wait for 9.2?
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  9. Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-06-09T14:38:59Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > On Thu, Jun 9, 2011 at 10:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> If we need to work around brain-dead isupper() tests, maybe the best
    >> thing is to implement two versions of the loop:
    >> 
    >>        if (encoding is single byte)
    >>                ... loop as it stands ...
    >>        else
    >>                ... loop without the "else if" part
    
    > That seems like a clear improvement.  It's a long way from perfect,
    > but still worthwhile.
    
    > Would we back-patch that?  Just do it in master?  Wait for 9.2?
    
    It looks to me like a portability bug fix, so I'd say back-patch.
    
    I'll take care of it, unless you're hot to do so?
    
    			regards, tom lane
    
    
  10. Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

    Robert Haas <robertmhaas@gmail.com> — 2011-06-09T14:58:58Z

    On Thu, Jun 9, 2011 at 10:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Robert Haas <robertmhaas@gmail.com> writes:
    >> On Thu, Jun 9, 2011 at 10:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >>> If we need to work around brain-dead isupper() tests, maybe the best
    >>> thing is to implement two versions of the loop:
    >>>
    >>>        if (encoding is single byte)
    >>>                ... loop as it stands ...
    >>>        else
    >>>                ... loop without the "else if" part
    >
    >> That seems like a clear improvement.  It's a long way from perfect,
    >> but still worthwhile.
    >
    >> Would we back-patch that?  Just do it in master?  Wait for 9.2?
    >
    > It looks to me like a portability bug fix, so I'd say back-patch.
    >
    > I'll take care of it, unless you're hot to do so?
    
    Nope, have at it.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  11. Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-06-09T15:17:47Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > On Thu, Jun 9, 2011 at 10:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> If we need to work around brain-dead isupper() tests, maybe the best
    >> thing is to implement two versions of the loop:
    >>        if (encoding is single byte)
    
    > That seems like a clear improvement.  It's a long way from perfect,
    > but still worthwhile.
    
    Hmm ... while the above is easy enough to do in the backend, where we
    can look at pg_database_encoding_max_length, we have also got instances
    of this coding pattern in src/port/pgstrcasecmp.c.  It's a lot less
    obvious how to make the test in frontend environments.  Thoughts anyone?
    
    			regards, tom lane
    
    
  12. Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

    Robert Haas <robertmhaas@gmail.com> — 2011-06-09T15:28:20Z

    On Thu, Jun 9, 2011 at 11:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Robert Haas <robertmhaas@gmail.com> writes:
    >> On Thu, Jun 9, 2011 at 10:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >>> If we need to work around brain-dead isupper() tests, maybe the best
    >>> thing is to implement two versions of the loop:
    >>>        if (encoding is single byte)
    >
    >> That seems like a clear improvement.  It's a long way from perfect,
    >> but still worthwhile.
    >
    > Hmm ... while the above is easy enough to do in the backend, where we
    > can look at pg_database_encoding_max_length, we have also got instances
    > of this coding pattern in src/port/pgstrcasecmp.c.  It's a lot less
    > obvious how to make the test in frontend environments.  Thoughts anyone?
    
    I'm not sure if this helps at all, but an awful lot of those tests are
    against hard-coded strings that are known to contain only ASCII
    characters.  Is there some way we can optimize this for that case?
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  13. Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-06-09T17:22:42Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > On Thu, Jun 9, 2011 at 11:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> Hmm ... while the above is easy enough to do in the backend, where we
    >> can look at pg_database_encoding_max_length, we have also got instances
    >> of this coding pattern in src/port/pgstrcasecmp.c. It's a lot less
    >> obvious how to make the test in frontend environments. Thoughts anyone?
    
    > I'm not sure if this helps at all, but an awful lot of those tests are
    > against hard-coded strings that are known to contain only ASCII
    > characters.  Is there some way we can optimize this for that case?
    
    For the places where we're just looking for a match to a fixed all-ASCII
    string, an ASCII-only downcasing would be sufficient, and would
    eliminate the whole problem.  But I doubt all the callers fall into that
    class.
    
    What I'm particularly worried about at the moment is whether we are
    assuming anywhere that the frontend side can duplicate the backend's
    identifier downcasing behavior.  That seems like a complete morass,
    because (1) they might not have the same locale, (2) they might not
    have the same encoding, (3) even if they do, the "same" locale is known
    to behave differently on different platforms.
    
    			regards, tom lane
    
    
  14. Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

    Robert Haas <robertmhaas@gmail.com> — 2011-06-09T17:55:02Z

    On Thu, Jun 9, 2011 at 1:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Robert Haas <robertmhaas@gmail.com> writes:
    >> On Thu, Jun 9, 2011 at 11:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >>> Hmm ... while the above is easy enough to do in the backend, where we
    >>> can look at pg_database_encoding_max_length, we have also got instances
    >>> of this coding pattern in src/port/pgstrcasecmp.c.  It's a lot less
    >>> obvious how to make the test in frontend environments.  Thoughts anyone?
    >
    >> I'm not sure if this helps at all, but an awful lot of those tests are
    >> against hard-coded strings that are known to contain only ASCII
    >> characters.  Is there some way we can optimize this for that case?
    >
    > For the places where we're just looking for a match to a fixed all-ASCII
    > string, an ASCII-only downcasing would be sufficient, and would
    > eliminate the whole problem.  But I doubt all the callers fall into that
    > class.
    >
    > What I'm particularly worried about at the moment is whether we are
    > assuming anywhere that the frontend side can duplicate the backend's
    > identifier downcasing behavior.  That seems like a complete morass,
    > because (1) they might not have the same locale, (2) they might not
    > have the same encoding, (3) even if they do, the "same" locale is known
    > to behave differently on different platforms.
    
    Right.  Understood.  So let's look at the cases (from git grep
    pg_strcasecmp and pg_strncasecmp):
    
    contrib/dict_int: Fixed strings only, and it's all backend code anyway.
    contrib/dict_xsyn: Fixed strings only, and it's all backend code anyway.
    contrib/hstore: Fixed strings only, and it's all backend code anyway.
    contrib/pg_upgrade: Used to compare LC_COLLATE, LC_CTYPE, and encoding names.
    contrib/pgbench: Definitely front-end code, but it's all fixed strings.
    contrib/pgcrypto: All fixed strings except for one instance in
    px_find_digit.  But it's all backend
    contrib/spi: One instance, not a fixed string, but it's backend code.
    contrib/unaccent: One instance, not a fixed string, but it's backend code.
    src/backend/*: Backend code, obviously.
    src/bin/initdb: Strings from a constant lookup table
    (tsearch_config_languages) only.
    src/bin/pg_basebackup: Fixed strings only.
    src/bin/pg_ctl: Fixed strings only.
    src/bin/pg_dump: Fixed strings only.
    src/bin/psql: Fixed strings only.  In a couple of cases they are not
    constants - help.c uses strings from to generated file sql_help.h, and
    tab-complete.c uses strings from a constant array called
    words_after_create[].  But these are constant lookup tables.
    src/include: access/reloptions.h uses strncasecmp() as part of a
    macro.  That should be OK as long as no one tries to include this in
    frontend code, which seems rather impractical.
    src/interfaces/ecpg/ecpglib: Fixed strings.
    src/interfaces/ecpg/pgtypeslib: Fixed strings, and strings from a
    constant lookup table, only.
    src/interfaces/ecpg/preproc: This looks a bit worrisome.  It seems we
    might be using it on identifiers here.
    src/interfaces/libpq: This is attempting to match a wildcard
    certificate name against a hostname, in two different places.
    src/port/chklocale.c: Fixed strings or ones from a lookup table.
    src/timezone/pgtz.c: Matches input strings against filenames read from the OS.
    
    So mostly I think these are OK.  The instance in
    src/interfaces/ecpg/preproc looks like the most likely candidate for a
    problem spot.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  15. Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-06-09T18:58:40Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > Right.  Understood.  So let's look at the cases (from git grep
    > pg_strcasecmp and pg_strncasecmp):
    
    Also pg_toupper and pg_tolower.  Right offhand, it looks like psql
    *does* assume it can lower-case identifiers this way :-(
    
    			regards, tom lane
    
    
  16. Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

    Robert Haas <robertmhaas@gmail.com> — 2011-06-09T18:59:57Z

    On Thu, Jun 9, 2011 at 2:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Robert Haas <robertmhaas@gmail.com> writes:
    >> Right.  Understood.  So let's look at the cases (from git grep
    >> pg_strcasecmp and pg_strncasecmp):
    >
    > Also pg_toupper and pg_tolower.  Right offhand, it looks like psql
    > *does* assume it can lower-case identifiers this way :-(
    
    Blarg.  I dunno what to do about that - but surely we must have the
    encoding available somewhere?
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  17. Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-06-09T19:14:24Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > On Thu, Jun 9, 2011 at 2:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> Robert Haas <robertmhaas@gmail.com> writes:
    >>> Right. Understood. So let's look at the cases (from git grep
    >>> pg_strcasecmp and pg_strncasecmp):
    
    >> Also pg_toupper and pg_tolower. Right offhand, it looks like psql
    >> *does* assume it can lower-case identifiers this way :-(
    
    > Blarg.  I dunno what to do about that - but surely we must have the
    > encoding available somewhere?
    
    Well, psql does at least, but changing all the callers to pass encoding
    info would be a PITA.
    
    The idea I'd been mulling was to keep a static variable in
    pgstrcasecmp.c, which would control whether to allow the isupper/tolower
    path to be tried.  The static initializer value would be false, meaning
    you get ASCII-only case folding by default.  We'd also add a callable
    function to set the flag, which could be called during startup by
    whatever bit of logic is aware of the encoding we're using.  psql,
    pg_dump, and the backend could do that; offhand I suspect it doesn't
    matter for anything else.
    
    However, this is all still dependent on the assumption that our idea of
    the encoding matches the libc locale setting.  I'm prepared to believe
    that we have that locked down pretty well now in the backend, but I
    don't think I believe it for either psql or pg_dump.
    
    There's also the meta-problem that psql's locale might not match the
    backend's, leading to wrong case folding of identifiers compared to what
    the backend does, even if it's entirely correct for psql's locale.  I'm
    not sure that's a huge deal for psql, since most likely anybody who's
    typing non-ASCII identifiers has taken the trouble to set the locale the
    way she wants, and anyway the effects would only be seen in interactive
    commands and so are easily recovered from.  However, if pg_dump is
    trying to do this, the possible downsides seem a lot worse.
    
    			regards, tom lane