Thread

  1. XPATH vs. server_encoding != UTF-8

    Florian G. Pflug <fgp@phlo.org> — 2011-07-23T15:49:37Z

    Hi
    
    The current thread about JSON and the ensuing discussion about the
    XML types' behaviour in non-UTF8 databases made me try out how well
    XPATH() copes with that situation. The code, at least, looks
    suspicious - XPATH neither verifies that the server encoding is UTF-8,
    not does it pass the server encoding on to libxml's xpath functions.
    
    So I created a database with encoding ISO-8859-1 (LATIN1), and did
    (which aclient encoding matching my terminal's settings)
    
      CREATE TABLE X (d XML);
      INSERT INTO X VALUES ('<r a="ä"/>');
    
    i.e, I inserted the XML document <r a="&auml;"/>, but without using
    an entity reference for the german Umlaut-A. Then I attempted to extract
    the length of r's attribute "a" with the XPATH /r/@a, both with the XPath
    function string-length (which works now! yay!) and with postgres'
    LENGTH() function.
    
      SELECT
        (XPATH('string-length(/r/@a)', d))[1] AS xpath_length,
        LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length
      FROM X;
    
    The XPATH() function itself doesn't complain, but libxml does - it expects
    UTF-8 encoded data, and screams bloody murder when it encounters the
    ISO-8859-1-encoded Umlaut-A
    
      ERROR:  could not parse XML document
      DETAIL:  line 1: Input is not proper UTF-8, indicate encoding !
      Bytes: 0xE4 0x22 0x2F 0x3E
      <r a="ä"/>
    
    That might seem fine on the surface - we did, after all, error out instead
    of producing potentially non-sensical results. However, libxml's ability to
    detect this error relies on it's ability to distinguish between UTF-8 and
    non-UTF-8 encoded strings. Which, of course, doesn't work in the general case.
    
    So for my next try, I deliberately set client_encoding to ISO-8859-1, even
    though my terminal uses UTF-8, removed all data from table X, and did
    
      INSERT INTO X VALUES ('<r a="ä"/>');
    
    again. The effect is that is that X now contains ISO-8859-1 encoded data
    which *happens* to look like valid UTF-8. After changing the client_encoding
    back to UTF-8, the value we just inserted looks like that
    
      <r a="ä"/>
    
    Now I invoked the XPATH query from above again.
    
      SELECT
        (XPATH('string-length(/r/@a)', d))[1] AS xpath_length,
        LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length
      FROM X;
    
    As predicted, it doesn't raise an error this time, since libxml is unable
    to distinguish the ISO-8859-1 string '<r a="ä"/' from valid UTF-8. But the
    result is still wrongs, since the string-length() function counts 'ä' as just
    one character, when it reality it are of course contains two.
    
     xpath_length | pg_length 
    --------------+-----------
     1            |         2
    
    The easiest way to fix this would be to make XPATH() flat-out refuse to
    do anything if the server encoding isn't UTF-8. But that seems a bit harsh -
    things actually do work correctly as long as the XML document contains only
    ASCII characters, and existing applications might depend on that.
    
    So what I think we should do is tell libxml that the encoding is ASCII
    if the server encoding isn't UTF-8. With that change, the query above
    produces
    
      ERROR:  could not parse XML document
      DETAIL:  encoder error
    
    which seems sane. Replacing the data in X with ASCII-only data makes the
    error go away, and the result is then correct also.
    
      DELETE FROM X;
      INSERT INTO X VALUES ('<r a="a"/>');
      SELECT
        (XPATH('string-length(/r/@a)', d))[1] AS xpath_length,
        LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length
      FROM X;
    
    gives
    
     xpath_length | pg_length 
    --------------+-----------
     1            |         1
    
    Proof-of-concept patch attached, but doesn't yet include documentation
    updates.
    
    Comments? Thoughts? Suggestions?
    
    best regards,
    Florian Pflug