Thread

  1. Memory context can be its own parent and child in replication command

    Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> — 2025-03-07T15:07:06Z

    Hi,
    
    While running some tests with logical replication, I ran into a
    situation where a walsender process was stuck in an infinite loop with
    the following backtrace:
    
    #0  in MemoryContextDelete (...) at ../src/backend/utils/mmgr/mcxt.c:474
    #1  in exec_replication_command (cmd_string=... "BEGIN") at
    ../src/backend/replication/walsender.c:2005
    
    This matches the following while loop in MemoryContextDelete:
    while (curr->firstchild != NULL)
        curr = curr->firstchild;
    
    Inspecting the memory context, I have the following:
    
    $9 = (MemoryContext) 0xafb741c35360
    (gdb) p *curr
    $10 = {type = T_AllocSetContext, isReset = false, allowInCritSection =
    false, mem_allocated = 8192, methods = 0xafb7278a82d8
    <mcxt_methods+216>, parent = 0xafb741c35360, firstchild =
    0xafb741c35360, prevchild = 0x0, nextchild = 0x0, name =
    0xafb7275f68a8 "Replication command context", ident = 0x0, reset_cbs =
    0x0}
    
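    So the memory context is at 0xafb741c35360, which is also the value
    of both its parent and firstchild pointers: the context is its own
    parent and its own child. This explains the infinite loop, since
    MemoryContextDelete descends through firstchild looking for a leaf
    with no children and never finds one.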
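
    As a toy illustration of why the descent never terminates (this is
    not the PostgreSQL source; ToyContext and toy_find_leaf_bounded are
    hypothetical names, and the step limit exists only so the sketch
    can return instead of spinning):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for a memory context node. */
typedef struct ToyContext
{
    struct ToyContext *parent;
    struct ToyContext *firstchild;
} ToyContext;

/*
 * Follow firstchild pointers the same way the quoted loop does, but give
 * up after max_steps hops. Returns the leaf, or NULL when no leaf was
 * reached within the limit (which is what a self-referential node causes).
 */
ToyContext *
toy_find_leaf_bounded(ToyContext *curr, int max_steps)
{
    assert(curr != NULL);

    while (curr->firstchild != NULL)
    {
        if (max_steps-- <= 0)
            return NULL;        /* unbounded version would loop forever */
        curr = curr->firstchild;
    }
    return curr;
}
```

    With a node whose firstchild points back at itself, the unbounded
    loop has no exit condition; the bounded variant returns NULL.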
    
    I was able to get an rr recording of the issue and trace how this
    happened. It can be reproduced by running two replication commands,
    with the first one doing a snapshot export:
    
    CREATE_REPLICATION_SLOT "test_slot" LOGICAL "test_decoding" ( SNAPSHOT
    "export");
    DROP_REPLICATION_SLOT "test_slot";
    
    - CreateReplicationSlot starts a new transaction to handle the
    snapshot export.
    - This transaction saves the replication command context (let's
    call it ctx1) in its state.
    - ctx1 is deleted at the end of exec_replication_command.
    - During the next replication command, the transaction is aborted in
    SnapBuildClearExportedSnapshot.
    - Aborting the transaction restores ctx1, which has already been
    deleted, as CurrentMemoryContext.
    - AllocSetContextCreate is called with ctx1 as the parent, and it
    pulls ctx1 from the freelist.
    - ctx1's parent and firstchild are both set to ctx1, and ctx1 is
    returned as the new context.
    - When ctx1 is deleted, MemoryContextDelete gets stuck in an
    infinite loop.
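
    The sequence above can be sketched with a toy model. This is a
    simplification under stated assumptions, not the actual allocator
    code: ToyContext, toy_freelist, toy_delete and toy_create are
    hypothetical names, the freelist holds a single entry, and deletion
    only handles a context without a parent. The linking step in
    toy_create mimics how a new context is attached as its parent's
    first child.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced context node with the same four link fields. */
typedef struct ToyContext
{
    struct ToyContext *parent;
    struct ToyContext *firstchild;
    struct ToyContext *prevchild;
    struct ToyContext *nextchild;
} ToyContext;

/* Single-slot free list of recycled contexts (toy assumption). */
ToyContext *toy_freelist = NULL;

/* "Delete" a parentless context by parking it on the free list. */
void
toy_delete(ToyContext *ctx)
{
    toy_freelist = ctx;
}

/*
 * Create a context under parent. A recycled node is preferred over the
 * caller-provided fresh storage, as with a context free list.
 */
ToyContext *
toy_create(ToyContext *fresh, ToyContext *parent)
{
    ToyContext *node = fresh;

    if (toy_freelist != NULL)
    {
        node = toy_freelist;
        toy_freelist = NULL;
    }

    node->parent = parent;
    node->firstchild = NULL;
    node->prevchild = NULL;
    node->nextchild = NULL;

    if (parent != NULL)
    {
        /* Push node at the head of parent's child list. */
        node->nextchild = parent->firstchild;
        if (parent->firstchild != NULL)
            parent->firstchild->prevchild = node;
        parent->firstchild = node;
    }
    return node;
}
```

    If the stale pointer to ctx1 is passed as the parent right after
    ctx1 was deleted, the recycled node and the parent are the same
    object, so the result has parent and firstchild pointing at itself
    while prevchild and nextchild stay NULL, which matches the gdb
    output above.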
    
    I've added a TAP test to reproduce the issue, along with an assertion
    during context creation that checks that the parent and the returned
    context are not the same, so the test aborts immediately instead of
    staying stuck.
    
    To fix this, saving the memory context before the call to
    AbortCurrentTransaction and restoring it afterwards seemed like the
    best approach. It is similar to what's done with
    CurrentResourceOwner. I also considered switching to
    TopMemoryContext before exporting the snapshot, so that aborting the
    transaction would switch back to TopMemoryContext, but this would
    still require restoring the memory context after the transaction is
    aborted.
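
    A minimal sketch of the save/restore idea, using a toy model rather
    than the actual patch. All names here are hypothetical:
    toy_current stands in for CurrentMemoryContext and
    toy_abort_transaction for AbortCurrentTransaction, whose only
    modelled effect is to switch back to the context remembered at
    transaction start.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced context; only its identity matters here. */
typedef struct ToyContext
{
    const char *name;
} ToyContext;

ToyContext *toy_current = NULL;     /* stand-in for the current context */
ToyContext *toy_xact_saved = NULL;  /* context remembered by the transaction */

/* Starting a transaction remembers the context that was current. */
void
toy_start_transaction(void)
{
    toy_xact_saved = toy_current;
}

/* Aborting switches back to the remembered (possibly stale) context. */
void
toy_abort_transaction(void)
{
    toy_current = toy_xact_saved;
    toy_xact_saved = NULL;
}

/* Abort, but keep whatever context the caller was running in. */
void
toy_abort_preserving_context(void)
{
    ToyContext *saved = toy_current;

    toy_abort_transaction();
    toy_current = saved;
}
```

    In this model the plain abort leaves the caller in the context of
    the first command, while the wrapper leaves it in the context of
    the command that is currently running.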
    
    Regards,
    Anthonin Bonnefoy