Thread

  1. [PoC] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2023-02-27T13:16:30Z

    Hi,
    
    In order to compare pairs of XML documents for equivalence it is 
    necessary to convert them first to their canonical form, as described at 
    W3C Canonical XML 1.1.[1] This spec basically defines a standard 
    physical representation of xml documents that have more then one 
    possible representation, so that it is possible to compare them, e.g. 
    forcing UTF-8 encoding, entity reference replacement, attributes 
    normalization, etc.
    
    Although it is not part of the XML/SQL standard, it would be nice to 
    have the option CANONICAL in xmlserialize. Additionally, we could also 
    add the attribute WITH [NO] COMMENTS to keep or remove xml comments from 
    the documents.
    
    Something like this:
    
    WITH t(col) AS (
      VALUES
       ('<?xml version="1.0" encoding="ISO-8859-1"?>
       <!DOCTYPE doc SYSTEM "doc.dtd" [
       <!ENTITY val "42">
       <!ATTLIST xyz attr CDATA "default">
       ]>
    
       <!-- ordering of attributes -->
       <foo ns:c = "3" ns:b = "2" ns:a = "1"
         xmlns:ns="http://postgresql.org">
    
         <!-- Normalization of whitespace in start and end tags -->
         <!-- Elimination of superfluous namespace declarations,
              as already declared in <foo> -->
      <bar     xmlns:ns="http://postgresql.org" >&val;</bar     >
    
         <!-- Empty element conversion to start-end tag pair -->
         <empty/>
    
         <!-- Effect of transcoding from a sample encoding to UTF-8 -->
         <iso8859>&#169;</iso8859>
    
         <!-- Addition of default attribute -->
         <!-- Whitespace inside tag preserved -->
         <xyz> 321 </xyz>
       </foo>
       <!-- comment outside doc -->'::xml)
    )
    SELECT xmlserialize(DOCUMENT col AS text CANONICAL) FROM t;
    xmlserialize
    --------------------------------------------------------------------------------------------------------------------------------------------------------
      <foo xmlns:ns="http://postgresql.org" ns:a="1" ns:b="2" 
    ns:c="3"><bar>42</bar><empty></empty><iso8859>©</iso8859><xyz 
    attr="default"> 321 </xyz></foo>
    (1 row)
    
    -- using WITH COMMENTS
    
    WITH t(col) AS (
      VALUES
       (' <foo ns:c = "3" ns:b = "2" ns:a = "1"
         xmlns:ns="http://postgresql.org">
         <!-- very important comment -->
         <xyz> 321 </xyz>
       </foo>'::xml)
    )
    SELECT xmlserialize(DOCUMENT col AS text CANONICAL WITH COMMENTS) FROM t;
    xmlserialize
    ------------------------------------------------------------------------------------------------------------------------
      <foo xmlns:ns="http://postgresql.org" ns:a="1" ns:b="2" ns:c="3"><!-- 
    very important comment --><xyz> 321 </xyz></foo>
    (1 row)
    
    
    Another option would be to simply create a new function, e.g. 
    xmlcanonical(doc xml, keep_comments boolean), but I'm not sure if this 
    would be the right approach.
    
    Attached a very short draft. What do you think?
    
    Best, Jim
    
    1- https://www.w3.org/TR/xml-c14n11/
    
  2. [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2023-03-05T18:44:01Z

    On 27.02.23 14:16, I wrote:
    > Hi,
    >
    > In order to compare pairs of XML documents for equivalence it is 
    > necessary to convert them first to their canonical form, as described 
    > at W3C Canonical XML 1.1.[1] This spec basically defines a standard 
    > physical representation of xml documents that have more then one 
    > possible representation, so that it is possible to compare them, e.g. 
    > forcing UTF-8 encoding, entity reference replacement, attributes 
    > normalization, etc.
    >
    > Although it is not part of the XML/SQL standard, it would be nice to 
    > have the option CANONICAL in xmlserialize. Additionally, we could also 
    > add the attribute WITH [NO] COMMENTS to keep or remove xml comments 
    > from the documents.
    >
    > Something like this:
    >
    > WITH t(col) AS (
    >  VALUES
    >   ('<?xml version="1.0" encoding="ISO-8859-1"?>
    >   <!DOCTYPE doc SYSTEM "doc.dtd" [
    >   <!ENTITY val "42">
    >   <!ATTLIST xyz attr CDATA "default">
    >   ]>
    >
    >   <!-- ordering of attributes -->
    >   <foo ns:c = "3" ns:b = "2" ns:a = "1"
    >     xmlns:ns="http://postgresql.org">
    >
    >     <!-- Normalization of whitespace in start and end tags -->
    >     <!-- Elimination of superfluous namespace declarations,
    >          as already declared in <foo> -->
    >  <bar     xmlns:ns="http://postgresql.org" >&val;</bar     >
    >
    >     <!-- Empty element conversion to start-end tag pair -->
    >     <empty/>
    >
    >     <!-- Effect of transcoding from a sample encoding to UTF-8 -->
    >     <iso8859>&#169;</iso8859>
    >
    >     <!-- Addition of default attribute -->
    >     <!-- Whitespace inside tag preserved -->
    >     <xyz> 321 </xyz>
    >   </foo>
    >   <!-- comment outside doc -->'::xml)
    > )
    > SELECT xmlserialize(DOCUMENT col AS text CANONICAL) FROM t;
    > xmlserialize
    > -------------------------------------------------------------------------------------------------------------------------------------------------------- 
    >
    >  <foo xmlns:ns="http://postgresql.org" ns:a="1" ns:b="2" 
    > ns:c="3"><bar>42</bar><empty></empty><iso8859>©</iso8859><xyz 
    > attr="default"> 321 </xyz></foo>
    > (1 row)
    >
    > -- using WITH COMMENTS
    >
    > WITH t(col) AS (
    >  VALUES
    >   (' <foo ns:c = "3" ns:b = "2" ns:a = "1"
    >     xmlns:ns="http://postgresql.org">
    >     <!-- very important comment -->
    >     <xyz> 321 </xyz>
    >   </foo>'::xml)
    > )
    > SELECT xmlserialize(DOCUMENT col AS text CANONICAL WITH COMMENTS) FROM t;
    > xmlserialize
    > ------------------------------------------------------------------------------------------------------------------------ 
    >
    >  <foo xmlns:ns="http://postgresql.org" ns:a="1" ns:b="2" ns:c="3"><!-- 
    > very important comment --><xyz> 321 </xyz></foo>
    > (1 row)
    >
    >
    > Another option would be to simply create a new function, e.g. 
    > xmlcanonical(doc xml, keep_comments boolean), but I'm not sure if this 
    > would be the right approach.
    >
    > Attached a very short draft. What do you think?
    >
    > Best, Jim
    >
    > 1- https://www.w3.org/TR/xml-c14n11/
    
    The attached version includes documentation and tests to the patch.
    
    I hope things are clearer now :)
    
    Best, Jim
    
  3. Re: [PATCH] Add CANONICAL option to xmlserialize

    Thomas Munro <thomas.munro@gmail.com> — 2023-03-05T21:00:33Z

    On Mon, Mar 6, 2023 at 7:44 AM Jim Jones <jim.jones@uni-muenster.de> wrote:
    > The attached version includes documentation and tests to the patch.
    
    The CI run for that failed in an interesting way, only on Debian +
    Meson, 32 bit.  The diffs appear to show that psql has a different
    opinion of the column width, while building its header (the "------"
    you get at the top of psql's output), even though the actual column
    contents was the same.  regression.diff[2] shows that there is a "£1"
    in the output, which is how UTF-8 "£1" looks if you view it with
    Latin1 glasses on.  Clearly this patch involves transcoding, Latin1
    and UTF-8 and I haven't studied it, but it's pretty weird for the 32
    bit build to give a different result...  could be something to do with
    our environment, since .cirrus.yml sets LANG=C in the 32 bit test run
    -- maybe try that locally?
    
    That run also generated a core dump, but I think that's just our open
    SIGQUIT problem[3] and not relevant here.
    
    [1] https://cirrus-ci.com/build/6319462375227392
    [2] https://api.cirrus-ci.com/v1/artifact/task/5800598633709568/testrun/build-32/testrun/regress/regress/regression.diffs
    [3] https://www.postgresql.org/message-id/flat/20230214202927.xgb2w6b7gnhq6tvv%40awork3.anarazel.de
    
    
    
    
  4. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2023-03-05T22:20:19Z

    On 05.03.23 22:00, Thomas Munro wrote:
    > The CI run for that failed in an interesting way, only on Debian +
    > Meson, 32 bit.  The diffs appear to show that psql has a different
    > opinion of the column width, while building its header (the "------"
    > you get at the top of psql's output), even though the actual column
    > contents was the same.  regression.diff[2] shows that there is a "£1"
    > in the output, which is how UTF-8 "£1" looks if you view it with
    > Latin1 glasses on.  Clearly this patch involves transcoding, Latin1
    > and UTF-8
    One of the use cases of this patch is exactly the transcoding of a non 
    utf-8 document to utf-8 - as described in the XML canonical spec.
    > and I haven't studied it, but it's pretty weird for the 32
    > bit build to give a different result...
    Yeah, it's pretty weird indeed. I'll try to reproduce this environment 
    in a container to see if I get the same diff. Although I'm not sure that 
    by "fixing" the result set for this environment it won't break all the 
    others.
    > could be something to do with
    > our environment, since .cirrus.yml sets LANG=C in the 32 bit test run
    > -- maybe try that locally?
    Also using LANGUAGE=C the result is the same for me - all tests pass 
    just fine.
    > That run also generated a core dump, but I think that's just our open
    > SIGQUIT problem[3] and not relevant here.
    >
    > [1] https://cirrus-ci.com/build/6319462375227392
    > [2] https://api.cirrus-ci.com/v1/artifact/task/5800598633709568/testrun/build-32/testrun/regress/regress/regression.diffs
    > [3] https://www.postgresql.org/message-id/flat/20230214202927.xgb2w6b7gnhq6tvv%40awork3.anarazel.de
    
    Thanks for the quick reply. Much appreciated!
    
    
    
    
    
    
    
  5. Re: [PATCH] Add CANONICAL option to xmlserialize

    Thomas Munro <thomas.munro@gmail.com> — 2023-03-05T23:32:49Z

    On Mon, Mar 6, 2023 at 11:20 AM Jim Jones <jim.jones@uni-muenster.de> wrote:
    > On 05.03.23 22:00, Thomas Munro wrote:
    > > could be something to do with
    > > our environment, since .cirrus.yml sets LANG=C in the 32 bit test run
    > > -- maybe try that locally?
    
    > Also using LANGUAGE=C the result is the same for me - all tests pass
    > just fine.
    
    I couldn't reproduce that locally either, but I just tested on CI with
    your patch applied  saw the failure, and then removed
    "PYTHONCOERCECLOCALE=0 LANG=C" and it's all green:
    
    https://github.com/macdice/postgres/commit/91999f5d13ac2df6f7237a301ed6cf73f2bb5b6d
    
    Without looking too closely, my first guess would have been that this
    just isn't going to work without UTF-8 database encoding, so you might
    need to skip the test (see for example
    src/test/regress/expected/unicode_1.out).  It's annoying that "xml"
    already has 3 expected variants... hmm.  BTW shouldn't it be failing
    in a more explicit way somewhere sooner if the database encoding is
    not UTF-8, rather than getting confused?
    
    
    
    
  6. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2023-03-06T10:50:54Z

    On 06.03.23 00:32, Thomas Munro wrote:
    > I couldn't reproduce that locally either, but I just tested on CI with
    > your patch applied  saw the failure, and then removed
    > "PYTHONCOERCECLOCALE=0 LANG=C" and it's all green:
    >
    > https://github.com/macdice/postgres/commit/91999f5d13ac2df6f7237a301ed6cf73f2bb5b6d
    >
    > Without looking too closely, my first guess would have been that this
    > just isn't going to work without UTF-8 database encoding, so you might
    > need to skip the test (see for example
    > src/test/regress/expected/unicode_1.out).  It's annoying that "xml"
    > already has 3 expected variants... hmm.  BTW shouldn't it be failing
    > in a more explicit way somewhere sooner if the database encoding is
    > not UTF-8, rather than getting confused?
    
    I guess this confusion is happening because xml_parse() was being called 
    with the database encoding from GetDatabaseEncoding().
    
    I added a condition before calling xml_parse() to check if the xml 
    document has a different encoding than UTF-8
    
    parse_xml_decl(xml_text2xmlChar(data), NULL, NULL, &encodingStr, NULL);
    encoding = encodingStr ? xmlChar_to_encoding(encodingStr) : PG_UTF8;
    
    doc = xml_parse(data, XMLOPTION_DOCUMENT, false, encoding, NULL);
    
    v2 attached.
    
    Thanks!
    
    Best, Jim
    
  7. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2023-03-06T13:19:43Z

    On 06.03.23 11:50, I wrote:
    > I guess this confusion is happening because xml_parse() was being 
    > called with the database encoding from GetDatabaseEncoding().
    >
    > I added a condition before calling xml_parse() to check if the xml 
    > document has a different encoding than UTF-8
    >
    > parse_xml_decl(xml_text2xmlChar(data), NULL, NULL, &encodingStr, NULL);
    > encoding = encodingStr ? xmlChar_to_encoding(encodingStr) : PG_UTF8;
    >
    > doc = xml_parse(data, XMLOPTION_DOCUMENT, false, encoding, NULL);
    
    It seems that this bug fix didn't change the output of the CI on Debian 
    + Meson, 32bit.
    
    I slightly changed the test case to a character that both encodings can 
    deal with.
    
    v3 attached.
    
  8. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2023-03-14T07:49:19Z

    v4 attached fixes an encoding issue at the xml_parse call. It now uses 
    GetDatabaseEncoding().
    
    Best, Jim
    
  9. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2023-03-17T09:46:36Z

    v5 attached is a rebase over the latest changes in xmlserialize (INDENT 
    output).
  10. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2023-03-17T12:30:49Z

    After some more testing I realized that v5 was leaking the xmlDocPtr.
    
    Now fixed in v6.
    
  11. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2023-09-14T11:54:24Z

    The cfbot started complaining about this patch on "macOS - Ventura - Meson"
    
    'Persistent worker failed to start the task: tart isolation failed: 
    failed to create VM cloned from 
    "ghcr.io/cirruslabs/macos-ventura-base:latest": tart command returned 
    non-zero exit code: ""'
    
    Is this a problem in my code or in the CI itself?
    
    Thanks!
    
    Jim
    
  12. Re: [PATCH] Add CANONICAL option to xmlserialize

    Thomas Munro <thomas.munro@gmail.com> — 2023-09-14T21:43:51Z

    On Thu, Sep 14, 2023 at 11:54 PM Jim Jones <jim.jones@uni-muenster.de> wrote:
    > The cfbot started complaining about this patch on "macOS - Ventura - Meson"
    >
    > 'Persistent worker failed to start the task: tart isolation failed: failed to create VM cloned from "ghcr.io/cirruslabs/macos-ventura-base:latest": tart command returned non-zero exit code: ""'
    >
    > Is this a problem in my code or in the CI itself?
    
    There was a temporary glitch on one of the new Mac CI runner machines
    that caused a few tests to fail like that, but it was fixed so that
    should turn red again later today.
    
    
    
    
  13. Re: [PATCH] Add CANONICAL option to xmlserialize

    vignesh C <vignesh21@gmail.com> — 2023-10-04T09:39:44Z

    On Fri, 17 Mar 2023 at 18:01, Jim Jones <jim.jones@uni-muenster.de> wrote:
    >
    > After some more testing I realized that v5 was leaking the xmlDocPtr.
    >
    > Now fixed in v6.
    
    Few comments:
    1) Why the default option was chosen without comments shouldn't it be
    the other way round?
    +opt_xml_serialize_format:
    +                       INDENT
                             { $$ = XMLSERIALIZE_INDENT; }
    +                       | NO INDENT
                             { $$ = XMLSERIALIZE_NO_FORMAT; }
    +                       | CANONICAL
                             { $$ = XMLSERIALIZE_CANONICAL; }
    +                       | CANONICAL WITH NO COMMENTS
     { $$ = XMLSERIALIZE_CANONICAL; }
    +                       | CANONICAL WITH COMMENTS
             { $$ = XMLSERIALIZE_CANONICAL_WITH_COMMENTS; }
    +                       | /*EMPTY*/
                             { $$ = XMLSERIALIZE_NO_FORMAT; }
    
    2) This should be added to typedefs.list:
    +typedef enum XmlSerializeFormat
    +{
    +       XMLSERIALIZE_INDENT,                                    /*
    pretty-printed xml serialization  */
    +       XMLSERIALIZE_CANONICAL,                                 /*
    canonical form without xml comments */
    +       XMLSERIALIZE_CANONICAL_WITH_COMMENTS,   /* canonical form with
    xml comments */
    +       XMLSERIALIZE_NO_FORMAT                                  /*
    unformatted xml representation */
    +} XmlSerializeFormat;
    
    3) This change is not required:
            return result;
    +
     #else
            NO_XML_SUPPORT();
            return NULL;
    
    4) This comment body needs slight reformatting:
    +       /*
    +       * Parse the input according to the xmloption.
    +       * XML canonical expects a well-formed XML input, so here in case of
    +       * XMLSERIALIZE_CANONICAL or XMLSERIALIZE_CANONICAL_WITH_COMMENTS we
    +       * force xml_parse() to parse 'data' as XMLOPTION_DOCUMENT despite
    +       * of the XmlOptionType given in 'xmloption_arg'. This enables the
    +       * canonicalization of CONTENT fragments if they contain a singly-rooted
    +       * XML - xml_parse() will thrown an error otherwise.
    +       */
    
    5) Similarly here too:
    -                       if (newline == NULL || xmlerrcxt->err_occurred)
    +                       * Emit declaration only if the input had one.
    Note: some versions of
    +                       * xmlSaveToBuffer leak memory if a non-null
    encoding argument is
    +                       * passed, so don't do that.  We don't want any
    encoding conversion
    +                       * anyway.
    +                       */
    +                       if (decl_len == 0)
    
    6) Similarly here too:
    +                               /*
    +                               * Deal with the case where we have
    non-singly-rooted XML.
    +                               * libxml's dump functions don't work
    well for that without help.
    +                               * We build a fake root node that
    serves as a container for the
    +                               * content nodes, and then iterate over
    the nodes.
    +                               */
    
    7) Similarly here too:
    +                               /*
    +                               * We use this node to insert newlines
    in the dump.  Note: in at
    +                               * least some libxml versions,
    xmlNewDocText would not attach the
    +                               * node to the document even if we
    passed it.  Therefore, manage
    +                               * freeing of this node manually, and
    pass NULL here to make sure
    +                               * there's not a dangling link.
    +                               */
    
    8) Should this:
    +       * of the XmlOptionType given in 'xmloption_arg'. This enables the
    +       * canonicalization of CONTENT fragments if they contain a singly-rooted
    +       * XML - xml_parse() will thrown an error otherwise.
    Be:
    +       * of the XmlOptionType given in 'xmloption_arg'. This enables the
    +       * canonicalization of CONTENT fragments if they contain a singly-rooted
    +       * XML - xml_parse() will throw an error otherwise.
    
    Regards,
    Vignesh
    
    
    
    
  14. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2023-10-04T16:19:18Z

    Hi Vignesh
    
    Thanks for the thorough review!
    
    On 04.10.23 11:39, vignesh C wrote:
    > Few comments:
    > 1) Why the default option was chosen without comments shouldn't it be
    > the other way round?
    > +opt_xml_serialize_format:
    > +                       INDENT
    >                           { $$ = XMLSERIALIZE_INDENT; }
    > +                       | NO INDENT
    >                           { $$ = XMLSERIALIZE_NO_FORMAT; }
    > +                       | CANONICAL
    >                           { $$ = XMLSERIALIZE_CANONICAL; }
    > +                       | CANONICAL WITH NO COMMENTS
    >   { $$ = XMLSERIALIZE_CANONICAL; }
    > +                       | CANONICAL WITH COMMENTS
    >           { $$ = XMLSERIALIZE_CANONICAL_WITH_COMMENTS; }
    > +                       | /*EMPTY*/
    >                           { $$ = XMLSERIALIZE_NO_FORMAT; }
    I'm not sure it is the way to go. The main idea is to check if two 
    documents have the same content, and comments might be different even if 
    the contents of two documents are identical. What are your concerns 
    regarding this default behaviour?
    > 2) This should be added to typedefs.list:
    > +typedef enum XmlSerializeFormat
    > +{
    > +       XMLSERIALIZE_INDENT,                                    /*
    > pretty-printed xml serialization  */
    > +       XMLSERIALIZE_CANONICAL,                                 /*
    > canonical form without xml comments */
    > +       XMLSERIALIZE_CANONICAL_WITH_COMMENTS,   /* canonical form with
    > xml comments */
    > +       XMLSERIALIZE_NO_FORMAT                                  /*
    > unformatted xml representation */
    > +} XmlSerializeFormat;
    added.
    > 3) This change is not required:
    >          return result;
    > +
    >   #else
    >          NO_XML_SUPPORT();
    >          return NULL;
    removed.
    >
    > 4) This comment body needs slight reformatting:
    > +       /*
    > +       * Parse the input according to the xmloption.
    > +       * XML canonical expects a well-formed XML input, so here in case of
    > +       * XMLSERIALIZE_CANONICAL or XMLSERIALIZE_CANONICAL_WITH_COMMENTS we
    > +       * force xml_parse() to parse 'data' as XMLOPTION_DOCUMENT despite
    > +       * of the XmlOptionType given in 'xmloption_arg'. This enables the
    > +       * canonicalization of CONTENT fragments if they contain a singly-rooted
    > +       * XML - xml_parse() will thrown an error otherwise.
    > +       */
    reformatted.
    > 5) Similarly here too:
    > -                       if (newline == NULL || xmlerrcxt->err_occurred)
    > +                       * Emit declaration only if the input had one.
    > Note: some versions of
    > +                       * xmlSaveToBuffer leak memory if a non-null
    > encoding argument is
    > +                       * passed, so don't do that.  We don't want any
    > encoding conversion
    > +                       * anyway.
    > +                       */
    > +                       if (decl_len == 0)
    reformatted.
    > 6) Similarly here too:
    > +                               /*
    > +                               * Deal with the case where we have
    > non-singly-rooted XML.
    > +                               * libxml's dump functions don't work
    > well for that without help.
    > +                               * We build a fake root node that
    > serves as a container for the
    > +                               * content nodes, and then iterate over
    > the nodes.
    > +                               */
    reformatted.
    > 7) Similarly here too:
    > +                               /*
    > +                               * We use this node to insert newlines
    > in the dump.  Note: in at
    > +                               * least some libxml versions,
    > xmlNewDocText would not attach the
    > +                               * node to the document even if we
    > passed it.  Therefore, manage
    > +                               * freeing of this node manually, and
    > pass NULL here to make sure
    > +                               * there's not a dangling link.
    > +                               */
    reformatted.
    > 8) Should this:
    > +       * of the XmlOptionType given in 'xmloption_arg'. This enables the
    > +       * canonicalization of CONTENT fragments if they contain a singly-rooted
    > +       * XML - xml_parse() will thrown an error otherwise.
    > Be:
    > +       * of the XmlOptionType given in 'xmloption_arg'. This enables the
    > +       * canonicalization of CONTENT fragments if they contain a singly-rooted
    > +       * XML - xml_parse() will throw an error otherwise.
    
    I didn't understand the suggestion in 8) :)
    
    Thanks again for the review. Much appreciated!
    
    v7 attached.
    
    Best, Jim
    
  15. Re: [PATCH] Add CANONICAL option to xmlserialize

    Chapman Flack <chap@anastigmatix.net> — 2023-10-04T21:05:37Z

    On 2023-10-04 12:19, Jim Jones wrote:
    > On 04.10.23 11:39, vignesh C wrote:
    >> 1) Why the default option was chosen without comments shouldn't it be
    >> the other way round?
    > I'm not sure it is the way to go. The main idea is to check if two 
    > documents have the same content, and comments might be different even 
    > if the contents of two documents are identical. What are your concerns 
    > regarding this default behaviour?
    
    I hope I'm not butting in, but I too would be leery of any default
    behavior that's going to say thing1 and thing2 are the same thing
    but ignoring (name part of thing here). If that's the comparison
    I mean to make, and it's as easy as CANONICAL WITHOUT COMMENTS
    to say that's what I mean, I'd be happy to write that. It also means
    that the next person reading my code will know "oh, he means
    'same' in *that* way", without having to think about it.
    
    Regards,
    -Chap
    
    
    
    
  16. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2023-10-05T07:38:20Z

    Hi Chap
    
    On 04.10.23 23:05, Chapman Flack wrote:
    > I hope I'm not butting in, but I too would be leery of any default
    > behavior that's going to say thing1 and thing2 are the same thing
    > but ignoring (name part of thing here). If that's the comparison
    > I mean to make, and it's as easy as CANONICAL WITHOUT COMMENTS
    > to say that's what I mean, I'd be happy to write that. It also means
    > that the next person reading my code will know "oh, he means
    > 'same' in *that* way", without having to think about it.
    
    That's a very compelling argument. Thanks for that!
    
    It is indeed clearer to only remove items from the result set if 
    explicitly said so.
    
    v8 attached changes de default behaviour to WITH COMMENTS.
    
    Best,
    
    Jim
    
  17. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2024-02-09T13:19:13Z

    On 05.10.23 09:38, Jim Jones wrote:
    >
    > v8 attached changes de default behaviour to WITH COMMENTS.
    v9 attached with rebase due to changes done to primnodes.h in 615f5f6
    
    -- 
    Jim
    
  18. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2024-06-19T08:59:25Z

    
    On 09.02.24 14:19, Jim Jones wrote:
    > v9 attached with rebase due to changes done to primnodes.h in 615f5f6
    >
    v10 attached with rebase due to changes in primnodes, parsenodes.h, and
    gram.y
    
    -- 
    Jim
    
  19. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2024-07-09T18:48:47Z

    On 19.06.24 10:59, Jim Jones wrote:
    > On 09.02.24 14:19, Jim Jones wrote:
    >> v9 attached with rebase due to changes done to primnodes.h in 615f5f6
    >>
    > v10 attached with rebase due to changes in primnodes, parsenodes.h, and
    > gram.y
    >
    v11 attached with rebase due to changes in xml.c
    
    -- 
    Jim
    
  20. Re: [PATCH] Add CANONICAL option to xmlserialize

    Pavel Stehule <pavel.stehule@gmail.com> — 2024-08-25T18:57:30Z

    Hi
    
    so 24. 8. 2024 v 7:40 odesílatel Jim Jones <jim.jones@uni-muenster.de>
    napsal:
    
    >
    > On 19.06.24 10:59, Jim Jones wrote:
    > > On 09.02.24 14:19, Jim Jones wrote:
    > >> v9 attached with rebase due to changes done to primnodes.h in 615f5f6
    > >>
    > > v10 attached with rebase due to changes in primnodes, parsenodes.h, and
    > > gram.y
    > >
    > v11 attached with rebase due to changes in xml.c
    >
    
    I try to check this patch
    
    There is unwanted white space in the patch
    
    -<-><--><-->xmlFreeDoc(doc);
    +<->else if (format == XMLSERIALIZE_CANONICAL || format ==
    XMLSERIALIZE_CANONICAL_WITH_NO_COMMENTS)
    + <>{
    +<-><-->xmlChar    *xmlbuf = NULL;
    +<-><-->int         nbytes;
    +<-><-->int
    
    1. the xml is serialized to UTF8 string every time, but when target type is
    varchar or text, then it should be every time encoded to database encoding.
    Is not possible to hold utf8 string in latin2 database varchar.
    
    2. The proposed feature can increase some confusion in implementation of NO
    IDENT. I am not an expert on this area, so I checked other databases. DB2
    does not have anything similar. But Oracle's "NO IDENT" clause is very
    similar to the proposed "CANONICAL". Unfortunately, there is different
    behaviour of NO IDENT - Oracle's really removes formatting, Postgres does
    nothing.
    
    Regards
    
    Pavel
    
    
    
    > --
    > Jim
    >
    
  21. Re: [PATCH] Add CANONICAL option to xmlserialize

    Pavel Stehule <pavel.stehule@gmail.com> — 2024-08-25T19:49:24Z

    ne 25. 8. 2024 v 20:57 odesílatel Pavel Stehule <pavel.stehule@gmail.com>
    napsal:
    
    > Hi
    >
    > so 24. 8. 2024 v 7:40 odesílatel Jim Jones <jim.jones@uni-muenster.de>
    > napsal:
    >
    >>
    >> On 19.06.24 10:59, Jim Jones wrote:
    >> > On 09.02.24 14:19, Jim Jones wrote:
    >> >> v9 attached with rebase due to changes done to primnodes.h in 615f5f6
    >> >>
    >> > v10 attached with rebase due to changes in primnodes, parsenodes.h, and
    >> > gram.y
    >> >
    >> v11 attached with rebase due to changes in xml.c
    >>
    >
    > I try to check this patch
    >
    > There is unwanted white space in the patch
    >
    > -<-><--><-->xmlFreeDoc(doc);
    > +<->else if (format == XMLSERIALIZE_CANONICAL || format ==
    > XMLSERIALIZE_CANONICAL_WITH_NO_COMMENTS)
    > + <>{
    > +<-><-->xmlChar    *xmlbuf = NULL;
    > +<-><-->int         nbytes;
    > +<-><-->int
    >
    > 1. the xml is serialized to UTF8 string every time, but when target type
    > is varchar or text, then it should be every time encoded to database
    > encoding. Is not possible to hold utf8 string in latin2 database varchar.
    >
    > 2. The proposed feature can increase some confusion in implementation of
    > NO IDENT. I am not an expert on this area, so I checked other databases.
    > DB2 does not have anything similar. But Oracle's "NO IDENT" clause is very
    > similar to the proposed "CANONICAL". Unfortunately, there is different
    > behaviour of NO IDENT - Oracle's really removes formatting, Postgres does
    > nothing.
    >
    > Regards
    >
    
    I read https://www.w3.org/TR/xml-c14n11/ and if I understand this document,
    then CANONICAL <> "NO INDENT" ?
    
    Regards
    
    Pavel
    
    
    
    > Pavel
    >
    >
    >
    >> --
    >> Jim
    >>
    >
    
  22. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2024-08-26T09:32:02Z

    Hi Pavel
    
    On 25.08.24 20:57, Pavel Stehule wrote:
    >
    > There is unwanted white space in the patch
    >
    > -<-><--><-->xmlFreeDoc(doc);
    > +<->else if (format == XMLSERIALIZE_CANONICAL || format ==
    > XMLSERIALIZE_CANONICAL_WITH_NO_COMMENTS)
    > + <>{
    > +<-><-->xmlChar    *xmlbuf = NULL;
    > +<-><-->int         nbytes;
    > +<-><-->int    
    >
    I missed that one. Just removed it, thanks!
    > 1. the xml is serialized to UTF8 string every time, but when target
    > type is varchar or text, then it should be every time encoded to
    > database encoding. Is not possible to hold utf8 string in latin2
    > database varchar.
    I'm calling xml_parse using GetDatabaseEncoding(), so I thought I would
    be on the safe side
    
    if(format ==XMLSERIALIZE_CANONICAL ||format
    ==XMLSERIALIZE_CANONICAL_WITH_NO_COMMENTS)
    doc =xml_parse(data, XMLOPTION_DOCUMENT, false,
    GetDatabaseEncoding(), NULL, NULL, NULL);
    ... or you mean something else?
    
    > 2. The proposed feature can increase some confusion in implementation
    > of NO IDENT. I am not an expert on this area, so I checked other
    > databases. DB2 does not have anything similar. But Oracle's "NO IDENT"
    > clause is very similar to the proposed "CANONICAL". Unfortunately,
    > there is different behaviour of NO IDENT - Oracle's really removes
    > formatting, Postgres does nothing.
    
    Coincidentally, the [NO] INDENT support for xmlserialize is an old patch
    of mine.
    It basically "does nothing" and prints the xml as is, e.g.
    
    SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1"
    a="8"><![CDATA[0&1]]></val></bar></foo>' AS text INDENT);
                    xmlserialize                
    --------------------------------------------
     <foo>                                     +
       <bar>                                   +
         <val z="1" a="8"><![CDATA[0&1]]></val>+
       </bar>                                  +
     </foo>                                    +
     
    (1 row)
    
    SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1"
    a="8"><![CDATA[0&1]]></val></bar></foo>' AS text NO INDENT);
                             xmlserialize                         
    --------------------------------------------------------------
     <foo><bar><val z="1" a="8"><![CDATA[0&1]]></val></bar></foo>
    (1 row)
    
    SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1"
    a="8"><![CDATA[0&1]]></val></bar></foo>' AS text);
                             xmlserialize                         
    --------------------------------------------------------------
     <foo><bar><val z="1" a="8"><![CDATA[0&1]]></val></bar></foo>
    (1 row)
    
    .. while CANONICAL converts the xml to its canonical form,[1,2] e.g.
    sorting attributes and replacing CDATA strings by its value:
    
    SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1"
    a="8"><![CDATA[0&1]]></val></bar></foo>' AS text CANONICAL);
                         xmlserialize                     
    ------------------------------------------------------
     <foo><bar><val a="8" z="1">0&amp;1</val></bar></foo>
    (1 row)
    
    xmlserialize CANONICAL does not exist in any other database and it's not
    part of the SQL/XML standard.
    
    Regarding the different behaviour of NO INDENT in Oracle and PostgreSQL:
    it is not entirely clear to me if SQL/XML states that NO INDENT must
    remove the indentation from xml strings.
    It says:
    
    "INDENT — the choice of whether to “pretty-print” the serialized XML by
    means of indentation, either
    True or False.
    ....
    i) If <XML serialize indent> is specified and does not contain NO, then
    let IND be True.
    ii) Otherwise, let IND be False."
    
    When I wrote the patch I assumed it meant to leave the xml as is .. but
    I might be wrong.
    Perhaps it would be best if we open a new thread for this topic.
    
    Thank you for reviewing this patch. Much appreciated!
    
    Best,
    
    -- 
    Jim
    
    1 - https://www.w3.org/TR/xml-c14n11/
    2 - https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-c14n.html
    
    
    
    
    
  23. Re: [PATCH] Add CANONICAL option to xmlserialize

    Pavel Stehule <pavel.stehule@gmail.com> — 2024-08-26T10:30:28Z

    po 26. 8. 2024 v 11:32 odesílatel Jim Jones <jim.jones@uni-muenster.de>
    napsal:
    
    > Hi Pavel
    >
    > On 25.08.24 20:57, Pavel Stehule wrote:
    > >
    > > There is unwanted white space in the patch
    > >
    > > -<-><--><-->xmlFreeDoc(doc);
    > > +<->else if (format == XMLSERIALIZE_CANONICAL || format ==
    > > XMLSERIALIZE_CANONICAL_WITH_NO_COMMENTS)
    > > + <>{
    > > +<-><-->xmlChar    *xmlbuf = NULL;
    > > +<-><-->int         nbytes;
    > > +<-><-->int
    > >
    > I missed that one. Just removed it, thanks!
    > > 1. the xml is serialized to UTF8 string every time, but when target
    > > type is varchar or text, then it should be every time encoded to
    > > database encoding. Is not possible to hold utf8 string in latin2
    > > database varchar.
    > I'm calling xml_parse using GetDatabaseEncoding(), so I thought I would
    > be on the safe side
    >
    > if(format ==XMLSERIALIZE_CANONICAL ||format
    > ==XMLSERIALIZE_CANONICAL_WITH_NO_COMMENTS)
    > doc =xml_parse(data, XMLOPTION_DOCUMENT, false,
    > GetDatabaseEncoding(), NULL, NULL, NULL);
    > ... or you mean something else?
    >
    
    Maybe I was confused by the initial message.
    
    
    >
    > > 2. The proposed feature can increase some confusion in implementation
    > > of NO IDENT. I am not an expert on this area, so I checked other
    > > databases. DB2 does not have anything similar. But Oracle's "NO IDENT"
    > > clause is very similar to the proposed "CANONICAL". Unfortunately,
    > > there is different behaviour of NO IDENT - Oracle's really removes
    > > formatting, Postgres does nothing.
    >
    > Coincidentally, the [NO] INDENT support for xmlserialize is an old patch
    > of mine.
    > It basically "does nothing" and prints the xml as is, e.g.
    >
    > SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1"
    > a="8"><![CDATA[0&1]]></val></bar></foo>' AS text INDENT);
    >                 xmlserialize
    > --------------------------------------------
    >  <foo>                                     +
    >    <bar>                                   +
    >      <val z="1" a="8"><![CDATA[0&1]]></val>+
    >    </bar>                                  +
    >  </foo>                                    +
    >
    > (1 row)
    >
    > SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1"
    > a="8"><![CDATA[0&1]]></val></bar></foo>' AS text NO INDENT);
    >                          xmlserialize
    > --------------------------------------------------------------
    >  <foo><bar><val z="1" a="8"><![CDATA[0&1]]></val></bar></foo>
    > (1 row)
    >
    > SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1"
    > a="8"><![CDATA[0&1]]></val></bar></foo>' AS text);
    >                          xmlserialize
    > --------------------------------------------------------------
    >  <foo><bar><val z="1" a="8"><![CDATA[0&1]]></val></bar></foo>
    > (1 row)
    >
    > .. while CANONICAL converts the xml to its canonical form,[1,2] e.g.
    > sorting attributes and replacing CDATA strings by its value:
    >
    > SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1"
    > a="8"><![CDATA[0&1]]></val></bar></foo>' AS text CANONICAL);
    >                      xmlserialize
    > ------------------------------------------------------
    >  <foo><bar><val a="8" z="1">0&amp;1</val></bar></foo>
    > (1 row)
    >
    > xmlserialize CANONICAL does not exist in any other database and it's not
    > part of the SQL/XML standard.
    >
    > Regarding the different behaviour of NO INDENT in Oracle and PostgreSQL:
    > it is not entirely clear to me if SQL/XML states that NO INDENT must
    > remove the indentation from xml strings.
    > It says:
    >
    > "INDENT — the choice of whether to “pretty-print” the serialized XML by
    > means of indentation, either
    > True or False.
    > ....
    > i) If <XML serialize indent> is specified and does not contain NO, then
    > let IND be True.
    > ii) Otherwise, let IND be False."
    >
    > When I wrote the patch I assumed it meant to leave the xml as is .. but
    > I might be wrong.
    > Perhaps it would be best if we open a new thread for this topic.
    >
    
    I think so there should be specified the target of CANONICAL - it is a
    partial replacement of NO INDENT or it produces format  just for comparing?
    The CANONICAL format is not probably extra standardized, because libxml2
    removes indenting, but examples in https://www.w3.org/TR/xml-c14n11/
    doesn't do it. So this format makes sense just for local operations.
    
    I like this functionality, and it is great so the functionality from
    libxml2 can be used, but I think, so the fact that there are  four not
    compatible implementations of xmlserialize is messy. Can be nice, if  we
    find some intersection between SQL/XML, Oracle  instead of new proprietary
    syntax.
    
    In Oracle syntax the CANONICAL is +/- NO INDENT SHOW DEFAULT ?
    
    My objection against CANONICAL so SQL/XML and Oracle allows to parametrize
    XMLSERIALIZE more precious and before implementing new feature, we should
    to clean table and say, what we want to have in XMLSERIALIZE.
    
    An alternative of enhancing of XMLSERIALIZE I can imagine just function
    "to_canonical(xml, without_comments bool default false)". In this case we
    don't need to solve relations against SQL/XML or Oracle.
    
    
    > Thank you for reviewing this patch. Much appreciated!
    >
    > Best,
    >
    > --
    > Jim
    >
    > 1 - https://www.w3.org/TR/xml-c14n11/
    > 2 - https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-c14n.html
    >
    >
    
  24. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2024-08-26T11:28:22Z

    
    On 26.08.24 12:30, Pavel Stehule wrote:
    > I think so there should be specified the target of CANONICAL - it is a
    > partial replacement of NO INDENT or it produces format  just for
    > comparing? The CANONICAL format is not probably extra standardized,
    > because libxml2 removes indenting, but examples in
    > https://www.w3.org/TR/xml-c14n11/ doesn't do it. So this format makes
    > sense just for local operations.
    My idea with CANONICAL was not to replace NO INDENT. The intent was to
    format xml strings in an standardized way, so that they can be compared.
    For instance, removing comments, sorting attributes, converting CDATA
    strings, converting empty elements to start-end tag pairs, removing
    white spaces between elements, etc ...
    
    The W3C recommendation for Canonical XML[1] dictates the following
    regarding the removal of whitespaces between elements :
    
    * Whitespace outside of the document element and within start and end
    tags is normalized
    * All whitespace in character content is retained (excluding characters
    removed during line feed normalization)
    
    >
    > I like this functionality, and it is great so the functionality from
    > libxml2 can be used, but I think, so the fact that there are  four not
    > compatible implementations of xmlserialize is messy. Can be nice, if 
    > we find some intersection between SQL/XML, Oracle  instead of new
    > proprietary syntax. 
    >
    > In Oracle syntax the CANONICAL is +/- NO INDENT SHOW DEFAULT ?
    
    No.
    XMLSERIALIZE ... NO INDENT is supposed, as the name suggests, to
    serialize an xml string without indenting it. One could argue that not
    indenting can be translated as removing indentation, but I couldn't find
    anything concrete about this in the SQL/XML spec. If it's indeed the
    case, we should correct XMLSERIALIZE .. NO INDENT, but it is unrelated
    to this patch.
    
    CANONICAL serializes a physical representation of an xml document. In a
    nutshell, XMLSERIALIZE ... CANONICAL sort of "rewrites" the xml string
    with the following rules (list from the W3C recommendation):
    
    * The document is encoded in UTF-8
    * Line breaks normalized to #xA on input, before parsing
    * Attribute values are normalized, as if by a validating processor
    * Character and parsed entity references are replaced
    * CDATA sections are replaced with their character content
    * The XML declaration and document type declaration are removed
    * Empty elements are converted to start-end tag pairs
    * Whitespace outside of the document element and within start and end
    tags is normalized
    * All whitespace in character content is retained (excluding characters
    removed during line feed normalization)
    * Attribute value delimiters are set to quotation marks (double quotes)
    * Special characters in attribute values and character content are
    replaced by character references
    * Superfluous namespace declarations are removed from each element
    * Default attributes are added to each element
    * Fixup of xml:base attributes [C14N-Issues] is performed
    * Lexicographic order is imposed on the namespace declarations and
    attributes of each element
    
    btw: Oracle's SIZE =, HIDE DEFAULTS, and SHOW DEFAULTS are not part of
    the SQL/XML standard either :)
    
    > My objection against CANONICAL so SQL/XML and Oracle allows to
    > parametrize XMLSERIALIZE more precious and before implementing new
    > feature, we should to clean table and say, what we want to have in
    > XMLSERIALIZE.
    >
    > An alternative of enhancing of XMLSERIALIZE I can imagine just
    > function "to_canonical(xml, without_comments bool default false)". In
    > this case we don't need to solve relations against SQL/XML or Oracle.
    
    To create a separated serialization function would be IMHO way less
    elegant than to parametrize XMLSERIALIZE, but it would be something I
    could live with in case we decide to go down this path.
    
    Thanks!
    
    -- 
    Jim
    
    1 - https://www.w3.org/TR/xml-c14n11/
    
    
    
    
    
  25. Re: [PATCH] Add CANONICAL option to xmlserialize

    Pavel Stehule <pavel.stehule@gmail.com> — 2024-08-26T12:15:56Z

    po 26. 8. 2024 v 13:28 odesílatel Jim Jones <jim.jones@uni-muenster.de>
    napsal:
    
    >
    >
    > On 26.08.24 12:30, Pavel Stehule wrote:
    > > I think so there should be specified the target of CANONICAL - it is a
    > > partial replacement of NO INDENT or it produces format  just for
    > > comparing? The CANONICAL format is not probably extra standardized,
    > > because libxml2 removes indenting, but examples in
    > > https://www.w3.org/TR/xml-c14n11/ doesn't do it. So this format makes
    > > sense just for local operations.
    > My idea with CANONICAL was not to replace NO INDENT. The intent was to
    > format xml strings in an standardized way, so that they can be compared.
    > For instance, removing comments, sorting attributes, converting CDATA
    > strings, converting empty elements to start-end tag pairs, removing
    > white spaces between elements, etc ...
    >
    > The W3C recommendation for Canonical XML[1] dictates the following
    > regarding the removal of whitespaces between elements :
    >
    > * Whitespace outside of the document element and within start and end
    > tags is normalized
    > * All whitespace in character content is retained (excluding characters
    > removed during line feed normalization)
    >
    > >
    > > I like this functionality, and it is great so the functionality from
    > > libxml2 can be used, but I think, so the fact that there are  four not
    > > compatible implementations of xmlserialize is messy. Can be nice, if
    > > we find some intersection between SQL/XML, Oracle  instead of new
    > > proprietary syntax.
    > >
    > > In Oracle syntax the CANONICAL is +/- NO INDENT SHOW DEFAULT ?
    >
    > No.
    > XMLSERIALIZE ... NO INDENT is supposed, as the name suggests, to
    > serialize an xml string without indenting it. One could argue that not
    > indenting can be translated as removing indentation, but I couldn't find
    > anything concrete about this in the SQL/XML spec. If it's indeed the
    > case, we should correct XMLSERIALIZE .. NO INDENT, but it is unrelated
    > to this patch.
    >
    > CANONICAL serializes a physical representation of an xml document. In a
    > nutshell, XMLSERIALIZE ... CANONICAL sort of "rewrites" the xml string
    > with the following rules (list from the W3C recommendation):
    >
    > * The document is encoded in UTF-8
    > * Line breaks normalized to #xA on input, before parsing
    > * Attribute values are normalized, as if by a validating processor
    > * Character and parsed entity references are replaced
    > * CDATA sections are replaced with their character content
    > * The XML declaration and document type declaration are removed
    > * Empty elements are converted to start-end tag pairs
    > * Whitespace outside of the document element and within start and end
    > tags is normalized
    > * All whitespace in character content is retained (excluding characters
    > removed during line feed normalization)
    > * Attribute value delimiters are set to quotation marks (double quotes)
    > * Special characters in attribute values and character content are
    > replaced by character references
    > * Superfluous namespace declarations are removed from each element
    > * Default attributes are added to each element
    > * Fixup of xml:base attributes [C14N-Issues] is performed
    > * Lexicographic order is imposed on the namespace declarations and
    > attributes of each element
    >
    > btw: Oracle's SIZE =, HIDE DEFAULTS, and SHOW DEFAULTS are not part of
    > the SQL/XML standard either :)
    >
    
    I know - looks so this function is not well designed generally
    
    
    >
    > > My objection against CANONICAL so SQL/XML and Oracle allows to
    > > parametrize XMLSERIALIZE more precious and before implementing new
    > > feature, we should to clean table and say, what we want to have in
    > > XMLSERIALIZE.
    > >
    > > An alternative of enhancing of XMLSERIALIZE I can imagine just
    > > function "to_canonical(xml, without_comments bool default false)". In
    > > this case we don't need to solve relations against SQL/XML or Oracle.
    >
    > To create a separated serialization function would be IMHO way less
    > elegant than to parametrize XMLSERIALIZE, but it would be something I
    > could live with in case we decide to go down this path.
    >
    
    I am not strongly against enhancing XMLSERIALIZE, but it can be nice to see
    some wider concept first. Currently the state looks just random - and I
    didn't see any serious discussion about implementation fo SQL/XML. We don't
    need to be necessarily compatible with Oracle, but it can help if we have a
    functionality that can be used for conversions.
    
    
    
    > Thanks!
    >
    > --
    > Jim
    >
    > 1 - https://www.w3.org/TR/xml-c14n11/
    >
    >
    
  26. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2024-08-26T14:29:58Z

    
    On 26.08.24 14:15, Pavel Stehule wrote:
    > I am not strongly against enhancing XMLSERIALIZE, but it can be nice
    > to see some wider concept first. Currently the state looks just random
    > - and I didn't see any serious discussion about implementation fo
    > SQL/XML. We don't need to be necessarily compatible with Oracle, but
    > it can help if we have a functionality that can be used for conversions.
    
    Fair point. A road map definitely wouldn't hurt. Not quite sure how to
    start this motion though :D
    So far I've just picked the missing SQL/XML features that were listed in
    the PostgreSQL todo list and that I need for any of my projects. But I
    would gladly change the priorities if there is any interest in the
    community for specific features.
    
    -- 
    Jim
    
    
    
    
    
  27. Re: [PATCH] Add CANONICAL option to xmlserialize

    Pavel Stehule <pavel.stehule@gmail.com> — 2024-08-26T14:59:46Z

    po 26. 8. 2024 v 16:30 odesílatel Jim Jones <jim.jones@uni-muenster.de>
    napsal:
    
    >
    >
    > On 26.08.24 14:15, Pavel Stehule wrote:
    > > I am not strongly against enhancing XMLSERIALIZE, but it can be nice
    > > to see some wider concept first. Currently the state looks just random
    > > - and I didn't see any serious discussion about implementation fo
    > > SQL/XML. We don't need to be necessarily compatible with Oracle, but
    > > it can help if we have a functionality that can be used for conversions.
    >
    > Fair point. A road map definitely wouldn't hurt. Not quite sure how to
    > start this motion though :D
    > So far I've just picked the missing SQL/XML features that were listed in
    > the PostgreSQL todo list and that I need for any of my projects. But I
    > would gladly change the priorities if there is any interest in the
    > community for specific features.
    >
    
    yes, "like" road map and related questions - just for XMLSERIALIZE (in this
    thread). I see points
    
    1. what about behaviour of NO INDENT - the implementation is not too old,
    so it can be changed if we want (I think), and it is better to do early
    than too late
    
    2. Are we able to implement SQL/XML syntax with libxml2?
    
    3. Are we able to implement Oracle syntax with libxml2? And there are
    benefits other than higher possible compatibility?
    
    4. Can there be some possible collision (functionality, syntax) with
    CANONICAL?
    
    5. SQL/XML XMLSERIALIZE supports other target types than varchar. I can
    imagine XMLSERIALIZE with CANONICAL to bytea (then we don't need to force
    database encoding). Does it make sense? Are the results comparable?
    
    
    
    
    > --
    > Jim
    >
    >
    
  28. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2024-08-27T11:57:24Z

    
    On 26.08.24 16:59, Pavel Stehule wrote:
    >
    > 1. what about behaviour of NO INDENT - the implementation is not too
    > old, so it can be changed if we want (I think), and it is better to do
    > early than too late
    
    While checking the feasibility of removing indentation with NO INDENT I
    may have found a bug in XMLSERIALIZE ... INDENT.
    xmlSaveToBuffer seems to ignore elements if there are whitespaces
    between them:
    
    SELECT xmlserialize(DOCUMENT '<foo><bar>42</bar></foo>' AS text INDENT);
      xmlserialize   
    -----------------
     <foo>          +
       <bar>42</bar>+
     </foo>         +
     
    (1 row)
    
    SELECT xmlserialize(DOCUMENT '<foo> <bar>42</bar> </foo>'::xml AS text
    INDENT);
            xmlserialize        
    ----------------------------
     <foo> <bar>42</bar> </foo>+
     
    (1 row)
    
    I'll take a look at it.
    
    Regarding removing indentation: yes, it would be possible with libxml2.
    The question is if it would be right to do so.
    > 2. Are we able to implement SQL/XML syntax with libxml2?
    >
    > 3. Are we able to implement Oracle syntax with libxml2? And there are
    > benefits other than higher possible compatibility?
    I guess it would be beneficial if you're migrating from oracle to
    postgres - or the other way around. It certainly wouldn't hurt, but so
    far I personally had little use for the oracle's extra xmlserialize
    features.
    >
    > 4. Can there be some possible collision (functionality, syntax) with
    > CANONICAL?
    I couldn't find anything in the SQL/XML spec that might refer to
    canonocal xml.
    >
    > 5. SQL/XML XMLSERIALIZE supports other target types than varchar. I
    > can imagine XMLSERIALIZE with CANONICAL to bytea (then we don't need
    > to force database encoding). Does it make sense? Are the results
    > comparable?
    |
    As of pg16 bytea is not supported. Currently type| can be |character|,
    |character varying|, or |text - also their other flavours like 'name'.
    
    |
    
    -- 
    Jim
    
    
    
    
    
  29. Re: [PATCH] Add CANONICAL option to xmlserialize

    Pavel Stehule <pavel.stehule@gmail.com> — 2024-08-29T18:50:33Z

    út 27. 8. 2024 v 13:57 odesílatel Jim Jones <jim.jones@uni-muenster.de>
    napsal:
    
    >
    >
    > On 26.08.24 16:59, Pavel Stehule wrote:
    > >
    > > 1. what about behaviour of NO INDENT - the implementation is not too
    > > old, so it can be changed if we want (I think), and it is better to do
    > > early than too late
    >
    > While checking the feasibility of removing indentation with NO INDENT I
    > may have found a bug in XMLSERIALIZE ... INDENT.
    > xmlSaveToBuffer seems to ignore elements if there are whitespaces
    > between them:
    >
    > SELECT xmlserialize(DOCUMENT '<foo><bar>42</bar></foo>' AS text INDENT);
    >   xmlserialize
    > -----------------
    >  <foo>          +
    >    <bar>42</bar>+
    >  </foo>         +
    >
    > (1 row)
    >
    > SELECT xmlserialize(DOCUMENT '<foo> <bar>42</bar> </foo>'::xml AS text
    > INDENT);
    >         xmlserialize
    > ----------------------------
    >  <foo> <bar>42</bar> </foo>+
    >
    > (1 row)
    >
    > I'll take a look at it.
    >
    
    +1
    
    
    > Regarding removing indentation: yes, it would be possible with libxml2.
    > The question is if it would be right to do so.
    > > 2. Are we able to implement SQL/XML syntax with libxml2?
    > >
    > > 3. Are we able to implement Oracle syntax with libxml2? And there are
    > > benefits other than higher possible compatibility?
    > I guess it would be beneficial if you're migrating from oracle to
    > postgres - or the other way around. It certainly wouldn't hurt, but so
    > far I personally had little use for the oracle's extra xmlserialize
    > features.
    > >
    > > 4. Can there be some possible collision (functionality, syntax) with
    > > CANONICAL?
    > I couldn't find anything in the SQL/XML spec that might refer to
    > canonocal xml.
    > >
    > > 5. SQL/XML XMLSERIALIZE supports other target types than varchar. I
    > > can imagine XMLSERIALIZE with CANONICAL to bytea (then we don't need
    > > to force database encoding). Does it make sense? Are the results
    > > comparable?
    > |
    > As of pg16 bytea is not supported. Currently type| can be |character|,
    > |character varying|, or |text - also their other flavours like 'name'.
    >
    
    I know, but theoretically, there can be some benefit for CANONICAL if pg
    supports bytea there. Lot of databases still use non utf8 encoding.
    
    It is a more theoretical question - if pg supports different types there in
    future  (because SQL/XML or Oracle), then CANONICAL can be used without
    limit, or CANONICAL can be used just for text? And you are sure, so you can
    compare text X text, instead xml X xml?
    
    +SELECT xmlserialize(CONTENT doc AS text CANONICAL) = xmlserialize(CONTENT
    doc AS text CANONICAL WITH COMMENTS) FROM xmltest_serialize;
    + ?column?
    +----------
    + t
    + t
    +(2 rows)
    
    Maybe I am a little bit confused by these regress tests, because at the end
    it is not too useful - you compare two identical XML, and WITH COMMENTS and
    WITHOUT COMMENTS is tested elsewhere. I tried to search for a sense of this
    test.  Better to use really different documents (columns) instead.
    
    Regards
    
    Pavel
    
    
    >
    > |
    >
    > --
    > Jim
    >
    >
    
  30. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2024-08-29T21:54:18Z

    
    On 29.08.24 20:50, Pavel Stehule wrote:
    >
    > I know, but theoretically, there can be some benefit for CANONICAL if
    > pg supports bytea there. Lot of databases still use non utf8 encoding.
    >
    > It is a more theoretical question - if pg supports different types
    > there in future  (because SQL/XML or Oracle), then CANONICAL can be
    > used without limit,
    I like the idea of extending the feature to support bytea. I can
    definitely take a look at it, but perhaps in another patch? This change
    would most likely involve transformXmlSerialize in parse_expr.c, and I'm
    not sure of the impact in other usages of XMLSERIALIZE.
    > or CANONICAL can be used just for text? And you are sure, so you can
    > compare text X text, instead xml X xml?
    Yes, currently it only supports varchar or text - and their cousins. The
    idea is to format the xml and serialize it as text in a way that they
    can compared based on their content, independently of how they were
    written, e.g '<foo a="1" b="2"/>' is equal to '<foo b="2" a="1"/>'.
    
    >
    > +SELECT xmlserialize(CONTENT doc AS text CANONICAL) =
    > xmlserialize(CONTENT doc AS text CANONICAL WITH COMMENTS) FROM
    > xmltest_serialize;
    > + ?column?
    > +----------
    > + t
    > + t
    > +(2 rows)
    >
    > Maybe I am a little bit confused by these regress tests, because at
    > the end it is not too useful - you compare two identical XML, and WITH
    > COMMENTS and WITHOUT COMMENTS is tested elsewhere. I tried to search
    > for a sense of this test.  Better to use really different documents
    > (columns) instead.
    
    Yeah, I can see that it's confusing. In this example I actually just
    wanted to test that the default option of CANONICAL is CANONICAL WITH
    COMMENTS, even if you don't mention it. In the docs I mentioned it like
    this:
    
    "The optional parameters WITH COMMENTS (which is the default) or WITH NO
    COMMENTS, respectively, keep or remove XML comments from the given
    document."
    
    Perhaps I should rephrase it? Or maybe a comment in the regression tests
    would suffice?
    
    Thanks a lot for the input!
    
    -- 
    Jim
    
    
    
    
    
  31. Re: [PATCH] Add CANONICAL option to xmlserialize

    Pavel Stehule <pavel.stehule@gmail.com> — 2024-08-30T04:46:12Z

    čt 29. 8. 2024 v 23:54 odesílatel Jim Jones <jim.jones@uni-muenster.de>
    napsal:
    
    >
    >
    > On 29.08.24 20:50, Pavel Stehule wrote:
    > >
    > > I know, but theoretically, there can be some benefit for CANONICAL if
    > > pg supports bytea there. Lot of databases still use non utf8 encoding.
    > >
    > > It is a more theoretical question - if pg supports different types
    > > there in future  (because SQL/XML or Oracle), then CANONICAL can be
    > > used without limit,
    > I like the idea of extending the feature to support bytea. I can
    > definitely take a look at it, but perhaps in another patch? This change
    > would most likely involve transformXmlSerialize in parse_expr.c, and I'm
    > not sure of the impact in other usages of XMLSERIALIZE.
    > > or CANONICAL can be used just for text? And you are sure, so you can
    > > compare text X text, instead xml X xml?
    > Yes, currently it only supports varchar or text - and their cousins. The
    > idea is to format the xml and serialize it as text in a way that they
    > can compared based on their content, independently of how they were
    > written, e.g '<foo a="1" b="2"/>' is equal to '<foo b="2" a="1"/>'.
    >
    > >
    > > +SELECT xmlserialize(CONTENT doc AS text CANONICAL) =
    > > xmlserialize(CONTENT doc AS text CANONICAL WITH COMMENTS) FROM
    > > xmltest_serialize;
    > > + ?column?
    > > +----------
    > > + t
    > > + t
    > > +(2 rows)
    > >
    > > Maybe I am a little bit confused by these regress tests, because at
    > > the end it is not too useful - you compare two identical XML, and WITH
    > > COMMENTS and WITHOUT COMMENTS is tested elsewhere. I tried to search
    > > for a sense of this test.  Better to use really different documents
    > > (columns) instead.
    >
    > Yeah, I can see that it's confusing. In this example I actually just
    > wanted to test that the default option of CANONICAL is CANONICAL WITH
    > COMMENTS, even if you don't mention it. In the docs I mentioned it like
    > this:
    >
    > "The optional parameters WITH COMMENTS (which is the default) or WITH NO
    > COMMENTS, respectively, keep or remove XML comments from the given
    > document."
    >
    > Perhaps I should rephrase it? Or maybe a comment in the regression tests
    > would suffice?
    >
    
    comment will be enough
    
    
    
    >
    > Thanks a lot for the input!
    >
    > --
    > Jim
    >
    >
    
  32. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2024-08-30T06:05:27Z

    
    On 30.08.24 06:46, Pavel Stehule wrote:
    >
    >
    > čt 29. 8. 2024 v 23:54 odesílatel Jim Jones
    > <jim.jones@uni-muenster.de> napsal:
    >
    >
    >     > +SELECT xmlserialize(CONTENT doc AS text CANONICAL) =
    >     > xmlserialize(CONTENT doc AS text CANONICAL WITH COMMENTS) FROM
    >     > xmltest_serialize;
    >     > + ?column?
    >     > +----------
    >     > + t
    >     > + t
    >     > +(2 rows)
    >     >
    >     > Maybe I am a little bit confused by these regress tests, because at
    >     > the end it is not too useful - you compare two identical XML,
    >     and WITH
    >     > COMMENTS and WITHOUT COMMENTS is tested elsewhere. I tried to search
    >     > for a sense of this test.  Better to use really different documents
    >     > (columns) instead.
    >
    >     Yeah, I can see that it's confusing. In this example I actually just
    >     wanted to test that the default option of CANONICAL is CANONICAL WITH
    >     COMMENTS, even if you don't mention it. In the docs I mentioned it
    >     like
    >     this:
    >
    >     "The optional parameters WITH COMMENTS (which is the default) or
    >     WITH NO
    >     COMMENTS, respectively, keep or remove XML comments from the given
    >     document."
    >
    >     Perhaps I should rephrase it? Or maybe a comment in the regression
    >     tests
    >     would suffice?
    >
    >
    > comment will be enough
    >
    
    v12 attached adds a comment to this test.
    
    Thanks
    
    -- 
    Jim
    
  33. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2024-09-03T14:43:51Z

    v13 attached removes two variables that were left unused after
    refactoring parsenodes.h and primnodes.h, both booleans related to the
    INDENT feature of xmlserialize.
    
    On 30.08.24 08:05, Jim Jones wrote:
    >
    > On 30.08.24 06:46, Pavel Stehule wrote:
    >>
    >> čt 29. 8. 2024 v 23:54 odesílatel Jim Jones
    >> <jim.jones@uni-muenster.de> napsal:
    >>
    >>
    >>     > +SELECT xmlserialize(CONTENT doc AS text CANONICAL) =
    >>     > xmlserialize(CONTENT doc AS text CANONICAL WITH COMMENTS) FROM
    >>     > xmltest_serialize;
    >>     > + ?column?
    >>     > +----------
    >>     > + t
    >>     > + t
    >>     > +(2 rows)
    >>     >
    >>     > Maybe I am a little bit confused by these regress tests, because at
    >>     > the end it is not too useful - you compare two identical XML,
    >>     and WITH
    >>     > COMMENTS and WITHOUT COMMENTS is tested elsewhere. I tried to search
    >>     > for a sense of this test.  Better to use really different documents
    >>     > (columns) instead.
    >>
    >>     Yeah, I can see that it's confusing. In this example I actually just
    >>     wanted to test that the default option of CANONICAL is CANONICAL WITH
    >>     COMMENTS, even if you don't mention it. In the docs I mentioned it
    >>     like
    >>     this:
    >>
    >>     "The optional parameters WITH COMMENTS (which is the default) or
    >>     WITH NO
    >>     COMMENTS, respectively, keep or remove XML comments from the given
    >>     document."
    >>
    >>     Perhaps I should rephrase it? Or maybe a comment in the regression
    >>     tests
    >>     would suffice?
    >>
    >>
    >> comment will be enough
    >>
    > v12 attached adds a comment to this test.
    >
    > Thanks
    >
    
    -- 
    Jim
    
  34. Re: [PoC] Add CANONICAL option to xmlserialize

    Oliver Ford <ojford@gmail.com> — 2024-09-08T11:43:49Z

    The following review has been posted through the commitfest application:
    make installcheck-world:  tested, failed
    Implements feature:       tested, failed
    Spec compliant:           tested, failed
    Documentation:            tested, failed
    
    LGTM
    
    The new status of this patch is: Ready for Committer
    
  35. Re: [PoC] Add CANONICAL option to xmlserialize

    Laurenz Albe <laurenz.albe@cybertec.at> — 2024-09-08T13:44:01Z

    On Sun, 2024-09-08 at 11:43 +0000, Oliver Ford wrote:
    > The following review has been posted through the commitfest application:
    > make installcheck-world:  tested, failed
    > Implements feature:       tested, failed
    > Spec compliant:           tested, failed
    > Documentation:            tested, failed
    > 
    > LGTM
    > 
    > The new status of this patch is: Ready for Committer
    
    Huh?  Do you mean "tested, passes"?
    
    Yours,
    Laurenz Albe
    
    
    
    
  36. Re: [PoC] Add CANONICAL option to xmlserialize

    Oliver Ford <ojford@gmail.com> — 2024-09-08T13:56:47Z

    On Sun, Sep 8, 2024 at 2:44 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
    >
    > On Sun, 2024-09-08 at 11:43 +0000, Oliver Ford wrote:
    > > The following review has been posted through the commitfest application:
    > > make installcheck-world:  tested, failed
    > > Implements feature:       tested, failed
    > > Spec compliant:           tested, failed
    > > Documentation:            tested, failed
    > >
    > > LGTM
    > >
    > > The new status of this patch is: Ready for Committer
    >
    > Huh?  Do you mean "tested, passes"?
    >
    > Yours,
    > Laurenz Albe
    
    Whoops, yes all tests and docs pass!
    
    
    
    
  37. Re: [PoC] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2024-09-08T15:02:27Z

    Hi Oliver
    
    On 08.09.24 15:56, Oliver Ford wrote:
    > Whoops, yes all tests and docs pass!
    
    Thanks for the review!
    
    Best, Jim
    
    
    
    
  38. Re: [PATCH] Add CANONICAL option to xmlserialize

    Tom Lane <tgl@sss.pgh.pa.us> — 2024-09-10T17:43:36Z

    Jim Jones <jim.jones@uni-muenster.de> writes:
    > This patch introduces the CANONICAL option to xmlserialize, which
    > serializes xml documents in their canonical form - as described in
    > the W3C Canonical XML Version 1.1 specification. This option can
    > be used with the additional parameter WITH [NO] COMMENTS to keep
    > or remove xml comments from the canonical xml output.
    
    While I don't object to providing this functionality in some form,
    I think that doing it with this specific syntax is a seriously
    bad idea.  I think there's significant risk that at some point
    the SQL committee will either standardize this syntax with a
    somewhat different meaning or standardize some other syntax for
    the same functionality.
    
    How about instead introducing a plain function along the lines of
    "xml_canonicalize(xml, bool keep_comments) returns text" ?  The SQL
    committee will certainly never do that, but we won't regret having
    created a plain function whenever they get around to doing something
    in the same space.
    
    			regards, tom lane
    
    
    
    
  39. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2024-09-12T10:56:32Z

    On 10.09.24 19:43, Tom Lane wrote:
    > How about instead introducing a plain function along the lines of
    > "xml_canonicalize(xml, bool keep_comments) returns text" ?  The SQL
    > committee will certainly never do that, but we won't regret having
    > created a plain function whenever they get around to doing something
    > in the same space.
    A second function to serialize xml documents may sound a bit redundant,
    but I totally understand the concern of possibly conflicting with
    SQL/XMl spec in the feature. I guess we can always come back here and
    extend xmlserialize when the SQL committee moves in this direction.
    
    v14 attached adds the function xmlcanonicalize, as suggested.
    
    Thanks
    
    -- 
    Jim
    
  40. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2025-02-17T11:12:21Z

    On 12.09.24 12:56, Jim Jones wrote:
    > v14 attached adds the function xmlcanonicalize, as suggested.
    
    rebase.
    
    Best regards, Jim
    
  41. Re: [PoC] Add CANONICAL option to xmlserialize

    newtglobal postgresql_contributors <postgresql_contributors@newtglobalcorp.com> — 2025-03-12T07:21:47Z

    The following review has been posted through the commitfest application:
    make installcheck-world:  tested, failed
    Implements feature:       tested, failed
    Spec compliant:           tested, failed
    Documentation:            tested, failed
    
    Hi Jim,
    I tested the function with dynamically generated XML using `XMLELEMENT`, `XMLATTRIBUTES`, and `XMLCOMMENT`.
    
    Example:
    
    SELECT XMLCANONICALIZE(XMLELEMENT(
        NAME employee,
        XMLCOMMENT('Employee details start'),
        XMLELEMENT(NAME id, 101),
        XMLELEMENT(NAME name, 'Mahesh'),
        XMLELEMENT(NAME department, 'Engineering'),
        XMLCOMMENT('Employee details end')
    ), true);
    
    The function correctly handled comments and structured XML elements as expected.
    
    Testing with a Table Containing Employee Data
    Created a table and inserted sample employee records to verify XML generation.
    
    Table Creation and Data Insertion:
    
    CREATE TABLE employees (id INT, name TEXT, department TEXT);
    INSERT INTO employees VALUES (1, 'Alice', 'HR'), (2, 'Bob', 'IT');
    
    
    Canonicalizing Employee XML Data:
    
    SELECT XMLCANONICALIZE(XMLELEMENT(
        NAME employee,
        XMLATTRIBUTES(id AS id),
        XMLCOMMENT('Employee details start'),
        XMLELEMENT(NAME name, name),
        XMLELEMENT(NAME department, department),
        XMLCOMMENT('Employee details end')
    ), true)
    FROM employees;
    
    The patch correctly processes XML elements and attributes from database records.
    
    Testing with Dynamic Comments
    Added a column to store dynamic comments and verified if `XMLCANONICALIZE` handles them properly.
    
    Modifications:
    ALTER TABLE employees ADD COLUMN comment TEXT;
    UPDATE employees SET comment = 'Employee details for ' || name;
    
    
    Verification Query:
    
    SELECT XMLCANONICALIZE(XMLELEMENT(
        NAME employee,
        XMLATTRIBUTES(id AS id),
        XMLCOMMENT(comment),
        XMLELEMENT(NAME name, name),
        XMLELEMENT(NAME department, department)
    ), true)
    FROM employees;
    
    Dynamic comments were correctly included in the XML output.
    
    Testing with Manual Newlines
    Inserted manual newlines to check the function's behavior with formatted XML.
    
    
    SELECT XMLCANONICALIZE(
        XMLELEMENT(
            NAME employee,
            XMLATTRIBUTES(id AS id),
            XMLCOMMENT(comment),
            E'\n  ',
            XMLELEMENT(NAME name, name),
            E'\n  ',
            XMLELEMENT(NAME department, department),
            E'\n'
        ),
        true
    )
    FROM employees;
    
    Whitespace and newlines were correctly handled in XML output.
    
    After testing various scenarios, I found that `XMLCANONICALIZE` is working as expected. It:
    - Removes unnecessary whitespace and newlines.
    - Sorts attributes in a consistent order.
    - Converts empty elements to start-end tag pairs.
    - Preserves or removes comments based on the flag.
    - Works well with table data and dynamically generated XML.
    
    Regards,
    NewtGlobal PostgreSQL contributors
  42. Re: [PoC] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2025-03-12T07:49:53Z

    Hi
    
    On 12.03.25 08:21, newtglobal postgresql_contributors wrote:
    > The following review has been posted through the commitfest application:
    > make installcheck-world:  tested, failed
    > Implements feature:       tested, failed
    > Spec compliant:           tested, failed
    > Documentation:            tested, failed
    
    
    Did it really fail?
    
    > After testing various scenarios, I found that `XMLCANONICALIZE` is working as expected. It:
    > - Removes unnecessary whitespace and newlines.
    > - Sorts attributes in a consistent order.
    > - Converts empty elements to start-end tag pairs.
    > - Preserves or removes comments based on the flag.
    > - Works well with table data and dynamically generated XML.
    >
    
    Thanks a lot for the thorough review. Much appreciated!
    
    Best regards, Jim
    
    
    
    
    
  43. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2025-03-21T09:47:09Z

    While reviewing another patch, I noticed that the keep_comments argument
    was missing a default value. To allow calling the function with only the
    XML argument, I've added DEFAULT true to keep_comments.
    
    SELECT xmlcanonicalize(xmldoc);
    
    Updated patch (v16) attached.
    
    Best, Jim
  44. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2025-04-07T12:25:52Z

    v17 attached adds a (missing) reference to C14N 1.1 specification.
    
    Jim
  45. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2025-06-10T09:16:06Z

    rebase and add an extra check for xmlbuf after the xmlC14NDocDumpMemory
    call:
    
    if (nbytes < 0 || xmlbuf == NULL)
    	ereport(ERROR,
    		(errcode(ERRCODE_INTERNAL_ERROR),
    		 errmsg("could not canonicalize XML document")));
    -- 
    Jim
    
  46. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2025-08-06T16:13:45Z

    rebased due to recent changes in func-xml.sgml
    
    Additionally, v19 now wraps the xmlC14NDocDumpMemory call in a
    PG_TRY/PG_CATCH.
    
    Best regards, Jim
  47. Re: [PATCH] Add CANONICAL option to xmlserialize

    Jim Jones <jim.jones@uni-muenster.de> — 2026-05-26T09:46:44Z

    On 30/03/2026 13:27, Jim Jones wrote:
    > On 30/03/2026 11:44, Andrew Dunstan wrote:
    >> I note that your function returns xml, whereas Tom's suggestion was for
    >> a function returning text. I don't think there was any discussion on the
    >> point.
    > Indeed, there was no discussion regarding the return type.
    > 
    > My rationale for keeping it as xml was: the output is xml, callers can
    > immediately use the xml without casting, and nearly all other xml*
    > functions return xml. Is there a direct advantage of having this
    > function return text?
    
    
    After some consideration, I think returning text instead of xml is
    indeed the better choice here. The canonical form is a serialization
    artifact rather than a document intended for further XML processing.
    More practically, since xml has no = operator, the primary use case of
    comparing documents requires casting anyway -- returning text is indeed
    closer to real-world usage.
    
    I also noticed a correctness issue with database encoding: the C14N 1.1
    specification mandates UTF8 output, so xmlC14NDocDumpMemory always
    returns UTF8.[1] I added a pg_any_to_server call to convert the output
    to the server encoding before returning.
    
    Best, Jim
    
    1 -
    https://github.com/GNOME/libxml2/blob/174201f747da93167354287a7599d0b385552599/c14n.c#L1964