Thread

  1. Re: Initial Review: JSON contrib module was: Re: Another swing at JSON

    Joey Adams <joeyadams3.14159@gmail.com> — 2011-07-15T19:56:56Z

    On Mon, Jul 4, 2011 at 10:22 PM, Joseph Adams
    <joeyadams3.14159@gmail.com> wrote:
    > I'll try to submit a revised patch within the next couple days.
    
    Sorry this is later than I said.
    
    I addressed the issues covered in the review.  I also fixed a bug
    where "\u0022" would become """, which is invalid JSON, causing an
    assertion failure.
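
    To illustrate the "\u0022" case, here is a minimal sketch in Python
    rather than the module's C.  The names json_escape_char and requote
    are hypothetical and not taken from the patch.  The idea is that a
    character produced by decoding an escape is re-escaped on output
    whenever JSON requires it:

```python
import json

# Characters that JSON gives a short two-character escape.
SHORT_ESCAPES = {
    '"': '\\"', '\\': '\\\\', '\b': '\\b', '\f': '\\f',
    '\n': '\\n', '\r': '\\r', '\t': '\\t',
}

def json_escape_char(ch):
    """Return the JSON string-literal form of one decoded character."""
    if ch in SHORT_ESCAPES:
        return SHORT_ESCAPES[ch]
    if ord(ch) < 0x20:
        # Remaining control characters must stay escaped.
        return '\\u%04X' % ord(ch)
    return ch

def requote(decoded):
    """Wrap already-decoded text back into a valid JSON string literal."""
    return '"' + ''.join(json_escape_char(c) for c in decoded) + '"'

# "\u0022" decodes to a bare quote, which has to come back out escaped.
print(requote(json.loads('"\\u0022"')))
```

    Emitting the decoded quote without this step yields """, which no
    JSON parser accepts.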
    
    However, I want to put this back into WIP for a number of reasons:
    
     * The current code accepts invalid surrogate pairs (e.g.
    "\uD800\uD800").  Accepting them is inconsistent with PostgreSQL's
    Unicode support and with the Unicode standard itself.  In my
    opinion, as long as the server encoding is universal (i.e. UTF-8),
    decoding a JSON-encoded string should not fail (barring data
    corruption and resource limitations), so such input should be
    rejected up front.
    
     * I'd like to go ahead with the parser rewrite I mentioned earlier.
    The new parser will be able to construct a parse tree when needed, and
    it won't use those overkill parsing macros.
    
     * I recently learned that not all supported server encodings can be
    converted to Unicode losslessly.  The current code, on output,
    converts non-ASCII characters to Unicode escapes under some
    circumstances (see the comment above json_need_to_escape_unicode).
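
    The surrogate check from the first bullet can be sketched as
    follows.  This is Python for illustration only, with a hypothetical
    function name, and it assumes the input is the body of a string
    literal that has already passed basic syntax checks (every \u is
    followed by four hex digits):

```python
def surrogates_ok(body):
    """Check that \\uXXXX surrogate escapes in a JSON string body pair up.

    A high surrogate (D800-DBFF) must be followed immediately by a low
    surrogate (DC00-DFFF), and a low surrogate may not appear otherwise.
    """
    i = 0
    expect_low = False
    while i < len(body):
        cp = None
        if body.startswith('\\u', i):
            cp = int(body[i + 2:i + 6], 16)
            i += 6
        elif body[i] == '\\':
            i += 2  # two-character escape such as \n or \\
        else:
            i += 1  # ordinary character
        is_low = cp is not None and 0xDC00 <= cp <= 0xDFFF
        if is_low != expect_low:
            return False
        expect_low = cp is not None and 0xD800 <= cp <= 0xDBFF
    return not expect_low

print(surrogates_ok(r'\uD800\uD800'))  # two high surrogates in a row
print(surrogates_ok(r'\uD83D\uDE00'))  # a well-formed pair
```

    Note that an escaped backslash followed by "uD800" is skipped as a
    two-character escape, so it is not mistaken for a surrogate.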
    
    I'm having a really hard time figuring out how the JSON module should
    handle non-Unicode character sets.  \uXXXX escapes in JSON literals
    can be used to encode characters not available in the server encoding.
    On the other hand, the server encoding can encode characters not
    present in Unicode (see the third bullet point above).  This means
    JSON normalization and comparison (along with member lookup) are not
    possible in general.
    
    Even if I assume server -> UTF-8 -> server transcoding is lossless,
    the situation is still ugly.  Normalization could be implemented a few
    ways:
    
     * Convert from server encoding to UTF-8, and convert \uXXXX escapes
    to UTF-8 characters.  This is space-efficient, but the resulting text
    would not be compatible with the server encoding (which may or may not
    matter).
     * Convert from server encoding to UTF-8, and convert all non-ASCII
    characters to \uXXXX escapes, resulting in pure ASCII.  This bloats
    the text by a factor of three, in the worst case.
     * Convert \uXXXX escapes to characters in the server encoding, but
    only where possible.  This would be extremely inefficient.
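
    The second and third options above can be sketched as follows.
    This is Python for illustration only: all function names are
    hypothetical, and a Python codec name stands in for the server
    encoding, which is an assumption and not how the module would do
    the conversion.

```python
def u_escape(cp):
    """Return the \\uXXXX form of a code point, using a surrogate pair
    for code points outside the Basic Multilingual Plane."""
    if cp < 0x10000:
        return '\\u%04X' % cp
    cp -= 0x10000
    return '\\u%04X\\u%04X' % (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF))

def escape_non_ascii(text):
    """Option 2: escape every non-ASCII character, giving pure ASCII."""
    return ''.join(ch if ord(ch) < 0x80 else u_escape(ord(ch))
                   for ch in text)

def escape_unrepresentable(text, encoding):
    """Option 3: keep characters the target encoding can represent and
    escape the rest.  The per-character attempt is deliberately naive."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            out.append(u_escape(ord(ch)))
    return ''.join(out)

# A 2-byte UTF-8 character becomes a 6-byte escape: the factor of three.
sample = '\u00e9'
print(len(sample.encode('utf-8')), len(escape_non_ascii(sample)))
```

    A 4-byte UTF-8 character becomes a 12-byte surrogate pair, which is
    the same factor, while a 3-byte character only doubles in size.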
    
    The parse tree (for functions that need it) will need to store JSON
    member names and strings either in UTF-8 or in normalized JSON (which
    could be the same thing).
    
    Help and advice would be appreciated.  Thanks!
    
    - Joey