Thread

  1. Re: Regression with large XML data input

    Erik Wienhold <ewie@ewie.name> — 2025-07-24T18:50:04Z

    On 2025-07-24 20:10 +0200, Tom Lane wrote:
    > The supplied test case hides important details in the error message.
    > If you get rid of the exception block so that the error is reported
    > in full, what you see is
    > 
    > regression=# CREATE TEMP TABLE xmldata (id BIGINT PRIMARY KEY, message XML );
    > CREATE TABLE
    > regression=# DO $$ DECLARE size_40mb TEXT := repeat('X', 40000000);
    > regression$# BEGIN
    > regression$#    INSERT INTO xmldata (id, message) VALUES
    > regression$#      ( 1, (('<Root><Item><Name>Test40MB</Name><Content>' || size_40mb || '</Content></Item></Root>')::xml) );
    > regression$# END $$;
    > ERROR:  invalid XML content
    > DETAIL:  line 1: internal error: Huge input lookup
    > XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    >                                                                                ^
    > CONTEXT:  SQL statement "INSERT INTO xmldata (id, message) VALUES
    >      ( 1, (('<Root><Item><Name>Test40MB</Name><Content>' || size_40mb || '</Content></Item></Root>')::xml) )"
    > PL/pgSQL function inline_code_block line 3 at SQL statement
    > regression=# 
    > 
    > That is, what we are hitting is libxml2's internal protections
    > against processing "too large" input.  I am not really sure
    > why the other coding failed to hit this same thing, but I wonder
    > if we shouldn't leave well enough alone.  See commits 2197d0622
    > and f2743a7d7, where we tried to enable such cases and then
    > decided it was too risky.  I'm afraid now that our prior coding
    > might have allowed billion-laugh-like cases to be reachable.
    
    I was just looking into Michael's fix when I saw your message.  The fix
    works on libxml2 2.14.5.  But on 2.13.8, xmlParseBalancedChunkMemory
    returns XML_ERR_RESOURCE_LIMIT and I get this error:
    
    	ERROR:  invalid XML content
    	DETAIL:  line 1: Resource limit exceeded: Text node too long, try XML_PARSE_HUGE
    	XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    
    I'm not sure how to set XML_PARSE_HUGE without an xmlParserCtxtPtr at
    hand, though.  I also haven't looked into why 2.14.5 is not subject to
    that resource limit.  But as you've already noted, that code was heavily
    refactored in 2.13 [1].
    
    [1] https://www.postgresql.org/message-id/716736.1720376901%40sss.pgh.pa.us
    
    -- 
    Erik Wienhold