Re: Regression with large XML data input

From: Erik Wienhold <ewie@ewie.name>
To: Jim Jones <jim.jones@uni-muenster.de>
Cc: Michael Paquier <michael@paquier.xyz>, Tom Lane <tgl@sss.pgh.pa.us>, Robert Treat <rob@xzilla.net>, Postgres hackers <pgsql-hackers@lists.postgresql.org>
Date: 2025-07-28T10:49:02Z
Lists: pgsql-hackers

On 2025-07-28 09:45 +0200, Jim Jones wrote:
> 
> On 28.07.25 04:47, Michael Paquier wrote:
> > I understand that from the point of view of a maintainer this is
> > rather bad, but from the customer point of view the current
> > situation is also bad to deal with in the scope of a minor upgrade,
> > because applications suddenly break.
> 
> I totally get it --- from the user’s perspective, it’s hard to see
> this as a bugfix.
> 
> I was wondering whether using XML_PARSE_HUGE in xml_parse's options
> could help address this, for example:
> 
> options = XML_PARSE_NOENT | XML_PARSE_DTDATTR | XML_PARSE_HUGE
>           | (preserve_whitespace ? 0 : XML_PARSE_NOBLANKS);

This also came to my mind, but it was already tried and reverted soon
after for security reasons. [1]

> One idea would be to guard XML_PARSE_HUGE behind a GUC --- say,
> xml_enable_huge_parsing. That would at least allow controlled
> environments to opt in. But of course, that wouldn't help current
> releases.

+1 for new major releases.  But normal users must not be allowed to
enable that GUC, so its context should probably be PGC_SU_BACKEND.
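
A minimal sketch of how such an opt-in might gate the flag, assuming a
hypothetical boolean xml_enable_huge_parsing backed by the proposed GUC and
a hypothetical helper xml_parse_options().  Neither exists in xml.c today.
The SKETCH_* constants are stand-ins that mirror libxml2's xmlParserOption
values so the snippet compiles without libxml2 headers; real code would
include <libxml/parser.h> and use the library's enum directly.

```c
#include <stdbool.h>

/*
 * Stand-ins for libxml2's xmlParserOption values, so that this sketch
 * compiles without <libxml/parser.h>.  Real code would use the enum
 * provided by the library.
 */
enum
{
	SKETCH_XML_PARSE_NOENT = 1 << 1,
	SKETCH_XML_PARSE_DTDATTR = 1 << 3,
	SKETCH_XML_PARSE_NOBLANKS = 1 << 8,
	SKETCH_XML_PARSE_HUGE = 1 << 19
};

/* Hypothetical variable behind the proposed GUC; off by default. */
static bool xml_enable_huge_parsing = false;

/* Hypothetical helper: build the parser options for xml_parse. */
static int
xml_parse_options(bool preserve_whitespace)
{
	int			options;

	options = SKETCH_XML_PARSE_NOENT | SKETCH_XML_PARSE_DTDATTR
		| (preserve_whitespace ? 0 : SKETCH_XML_PARSE_NOBLANKS);

	/* Lift libxml2's hard-coded limits only when explicitly enabled. */
	if (xml_enable_huge_parsing)
		options |= SKETCH_XML_PARSE_HUGE;

	return options;
}
```

With the variable left at its default, the resulting options match what
xml_parse passes today; only a session started with the setting enabled
would see XML_PARSE_HUGE.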

I'm leaning towards Michael's proposal of adding a libxml2 version check
in the stable branches before REL_18_STABLE and parsing the content with
xmlParseBalancedChunkMemory on versions up to 2.12.x.
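
A rough sketch of the version check, written as a plain function so it can
be read and compiled in isolation.  It relies on libxml2's LIBXML_VERSION
encoding (major * 10000 + minor * 100 + micro, so 2.12.7 is 21207).  The
helper name and the SKETCH_LIBXML_2_13 constant are hypothetical.

```c
#include <stdbool.h>

/* First libxml2 version for which the old code path would not be used. */
#define SKETCH_LIBXML_2_13 21300

/*
 * Hypothetical helper: decide whether content should go through
 * xmlParseBalancedChunkMemory, i.e. on libxml2 versions up to 2.12.x.
 * The argument uses the same encoding as LIBXML_VERSION.
 */
static bool
use_balanced_chunk_parser(int libxml_version)
{
	return libxml_version < SKETCH_LIBXML_2_13;
}
```

In the backend this would more likely be a compile-time test along the
lines of "#if LIBXML_VERSION < 21300" around the two code paths, rather
than a runtime function.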

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f2743a7d70e7b2891277632121bb51e739743a47

-- 
Erik Wienhold