Thread

Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON

Joey Adams <joeyadams3.14159@gmail.com> — 2011-07-15T19:56:56Z
On Mon, Jul 4, 2011 at 10:22 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:
> I'll try to submit a revised patch within the next couple days.

Sorry this is later than I said.

I addressed the issues covered in the review.  I also fixed a bug
where "\u0022" would become """, which is invalid JSON, causing an
assertion failure.
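
To make the escaping bug concrete, here is a rough Python sketch of the rule involved. It is not the module's C code, and the function name unescape_bmp is made up for illustration. When a \uXXXX escape is turned back into a character, the characters JSON forbids unescaped inside a string have to stay escaped:

```python
def unescape_bmp(hex4: str) -> str:
    # Turn the four hex digits of a \uXXXX escape into output text.
    # Hypothetical helper, for illustration only.
    cp = int(hex4, 16)
    if cp == 0x22:
        # A raw double quote would terminate the string early.
        return '\\"'
    if cp == 0x5C:
        # A raw backslash would start a new escape sequence.
        return '\\\\'
    if cp < 0x20 or 0xD800 <= cp <= 0xDFFF:
        # Control characters and surrogates are left as escapes.
        return '\\u%04X' % cp
    return chr(cp)
```

With this rule, "\u0022" comes out as "\"" rather than as three bare quote characters.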

However, I want to put this back into WIP for a number of reasons:

 * The current code accepts invalid surrogate pairs (e.g.
"\uD800\uD800").  Accepting them is inconsistent with PostgreSQL's
Unicode support and with the Unicode standard itself.  In my opinion,
as long as the server encoding is universal (i.e. UTF-8), decoding a
JSON-encoded string should not fail (barring data corruption and
resource limitations).

 * I'd like to go ahead with the parser rewrite I mentioned earlier.
The new parser will be able to construct a parse tree when needed, and
it won't use those overkill parsing macros.

 * I recently learned that not all supported server encodings can be
converted to Unicode losslessly.  The current code, on output,
converts non-ASCII characters to Unicode escapes under some
circumstances (see the comment above json_need_to_escape_unicode).
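
On the surrogate pair point in the list above, a stricter check could look roughly like the following Python sketch. This is only an illustration of the UTF-16 rule, not the patch's code; the name combine_surrogates is hypothetical. A pair is valid only when a high surrogate (U+D800..U+DBFF) is followed by a low surrogate (U+DC00..U+DFFF):

```python
def combine_surrogates(hi: int, lo: int) -> int:
    # Combine the two code units of a \uXXXX\uXXXX pair into one
    # code point, rejecting anything that is not high-then-low.
    if not 0xD800 <= hi <= 0xDBFF:
        raise ValueError("expected a high surrogate, got U+%04X" % hi)
    if not 0xDC00 <= lo <= 0xDFFF:
        raise ValueError("expected a low surrogate, got U+%04X" % lo)
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
```

Under this rule "\uD834\uDD1E" maps to U+1D11E, while "\uD800\uD800" is rejected because the second unit is not a low surrogate.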

I'm having a really hard time figuring out how the JSON module should
handle non-Unicode character sets.  \uXXXX escapes in JSON literals
can be used to encode characters not available in the server encoding.
On the other hand, the server encoding can encode characters not
present in Unicode (see the third bullet point above).  This means
JSON normalization and comparison (along with member lookup) are not
possible in general.

Even if I assume server -> UTF-8 -> server transcoding is lossless,
the situation is still ugly.  Normalization could be implemented in a
few ways:

 * Convert from server encoding to UTF-8, and convert \uXXXX escapes
to UTF-8 characters.  This is space-efficient, but the resulting text
would not be compatible with the server encoding (which may or may not
matter).
 * Convert from server encoding to UTF-8, and convert all non-ASCII
characters to \uXXXX escapes, resulting in pure ASCII.  This bloats
the text by a factor of three, in the worst case.
 * Convert \uXXXX escapes to characters in the server encoding, but
only where possible.  This would be extremely inefficient.
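
The three options above can be sketched in Python as follows. This is a simplified illustration and not a proposal for the C implementation. It assumes the string content has already been decoded to Unicode, it ignores the mandatory escapes for quotes, backslashes and control characters, and it assumes the server encoding is an ASCII superset. All function names are made up, and Python codec names such as "latin-1" stand in for server encodings.

```python
def u_escape(ch: str) -> str:
    # Write one character as \uXXXX, or as a surrogate pair if it
    # lies outside the Basic Multilingual Plane.
    cp = ord(ch)
    if cp < 0x10000:
        return "\\u%04X" % cp
    cp -= 0x10000
    return "\\u%04X\\u%04X" % (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF))


def to_utf8(text: str) -> bytes:
    # Option 1: everything becomes UTF-8; no escapes remain.
    return text.encode("utf-8")


def to_ascii_escapes(text: str) -> str:
    # Option 2: every non-ASCII character becomes an escape.
    return "".join(ch if ord(ch) < 0x80 else u_escape(ch) for ch in text)


def to_server_where_possible(text: str, server_encoding: str) -> bytes:
    # Option 3: keep a character if the server encoding has it,
    # otherwise fall back to an escape.  Tested one character at a time.
    out = []
    for ch in text:
        try:
            out.append(ch.encode(server_encoding))
        except UnicodeEncodeError:
            out.append(u_escape(ch).encode("ascii"))
    return b"".join(out)
```

For example, U+00E9 takes two bytes in UTF-8 but six as "\u00E9", which is the factor-of-three case for the second option. With "latin-1" as the stand-in server encoding, the third option keeps U+00E9 as a single byte but has to escape U+20AC.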

The parse tree (for functions that need it) will need to store JSON
member names and strings either in UTF-8 or in normalized JSON (which
could be the same thing).

Help and advice would be appreciated.  Thanks!

- Joey