[PoC] Add CANONICAL option to xmlserialize

Jim Jones <jim.jones@uni-muenster.de>

From: Jim Jones <jim.jones@uni-muenster.de>
To: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Date: 2023-02-27T13:16:30Z
Lists: pgsql-hackers

Attachments

Hi,

In order to compare pairs of XML documents for equivalence it is 
necessary to convert them first to their canonical form, as described at 
W3C Canonical XML 1.1.[1] This spec basically defines a standard 
physical representation of xml documents that have more then one 
possible representation, so that it is possible to compare them, e.g. 
forcing UTF-8 encoding, entity reference replacement, attributes 
normalization, etc.

Although it is not part of the XML/SQL standard, it would be nice to 
have the option CANONICAL in xmlserialize. Additionally, we could also 
add the attribute WITH [NO] COMMENTS to keep or remove xml comments from 
the documents.

Something like this:

WITH t(col) AS (
  VALUES
   ('<?xml version="1.0" encoding="ISO-8859-1"?>
   <!DOCTYPE doc SYSTEM "doc.dtd" [
   <!ENTITY val "42">
   <!ATTLIST xyz attr CDATA "default">
   ]>

   <!-- ordering of attributes -->
   <foo ns:c = "3" ns:b = "2" ns:a = "1"
     xmlns:ns="http://postgresql.org">

     <!-- Normalization of whitespace in start and end tags -->
     <!-- Elimination of superfluous namespace declarations,
          as already declared in <foo> -->
  <bar     xmlns:ns="http://postgresql.org" >&val;</bar     >

     <!-- Empty element conversion to start-end tag pair -->
     <empty/>

     <!-- Effect of transcoding from a sample encoding to UTF-8 -->
     <iso8859>&#169;</iso8859>

     <!-- Addition of default attribute -->
     <!-- Whitespace inside tag preserved -->
     <xyz> 321 </xyz>
   </foo>
   <!-- comment outside doc -->'::xml)
)
SELECT xmlserialize(DOCUMENT col AS text CANONICAL) FROM t;
xmlserialize
--------------------------------------------------------------------------------------------------------------------------------------------------------
  <foo xmlns:ns="http://postgresql.org" ns:a="1" ns:b="2" 
ns:c="3"><bar>42</bar><empty></empty><iso8859>©</iso8859><xyz 
attr="default"> 321 </xyz></foo>
(1 row)

-- using WITH COMMENTS

WITH t(col) AS (
  VALUES
   (' <foo ns:c = "3" ns:b = "2" ns:a = "1"
     xmlns:ns="http://postgresql.org">
     <!-- very important comment -->
     <xyz> 321 </xyz>
   </foo>'::xml)
)
SELECT xmlserialize(DOCUMENT col AS text CANONICAL WITH COMMENTS) FROM t;
xmlserialize
------------------------------------------------------------------------------------------------------------------------
  <foo xmlns:ns="http://postgresql.org" ns:a="1" ns:b="2" ns:c="3"><!-- 
very important comment --><xyz> 321 </xyz></foo>
(1 row)


Another option would be to simply create a new function, e.g. 
xmlcanonical(doc xml, keep_comments boolean), but I'm not sure if this 
would be the right approach.

Attached a very short draft. What do you think?

Best, Jim

1- https://www.w3.org/TR/xml-c14n11/