Thread

  1. Segfault due to NULL ParamExecData value

    Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> — 2025-12-04T14:25:55Z

    Hi,
    
    We have seen multiple segfaults on PG 14.17. All coredumps show the
    following backtrace:
    
    #0:  postgres`toast_raw_datum_size(value=0) at detoast.c:550:6
    #1:  postgres`textne(fcinfo=0x0000c823e4759338) at varlena.c:1848:10
    #2:  postgres`ExecInterpExpr(state=0x0000c823e4758a48,
    econtext=0x0000c823e4757b88, isnull=<unavailable>) at execExprInterp.c:749:8
    #3:  postgres`ExecScan at executor.h:342:13
    #4:  postgres`ExecScan at executor.h:411:8
    #5:  postgres`ExecScan(node=0x0000c823e4757978,
    accessMtd=(postgres`FunctionNext at nodeFunctionscan.c:61:1),
    recheckMtd=(postgres`FunctionRecheck)) at execScan.c:226:23
    #6:  postgres`ExecSubPlan [inlined] ExecProcNode(node=0x0000c823e4757978)
    at executor.h:260:9
    #7:  postgres`ExecSubPlan at nodeSubplan.c:302:14
    #8:  postgres`ExecSubPlan(node=0x0000c823e47814e8,
    econtext=0x0000c823e475a658, isNull=0x0000c823e47814c0) at
    nodeSubplan.c:89:12
    #9:  postgres`ExecInterpExpr at execExprInterp.c:3954:18
    #10: postgres`ExecInterpExpr(state=0x0000c823e47813c0,
    econtext=0x0000c823e475a658, isnull=<unavailable>) at
    execExprInterp.c:1576:4
    #11: postgres`ExecNestLoop [inlined]
    ExecEvalExprSwitchContext(isNull=0x0000ffffebb9d637,
    econtext=0x0000c823e475a658, state=<unavailable>) at executor.h:342:13
    #12: postgres`ExecNestLoop [inlined] ExecProject(projInfo=<unavailable>) at
    executor.h:376:9
    #13: postgres`ExecNestLoop(pstate=<unavailable>) at nodeNestloop.c:241:12
    #14: postgres`EvalPlanQual at executor.h:260:9
    #15: postgres`ExecUpdate(mtstate=0x0000c823e4651a98,
    resultRelInfo=0x0000c823e4651ca8, tupleid=0x0000ffffebb9d858,
    oldtuple=0x0000000000000000, slot=<unavailable>,
    planSlot=0x0000c823e4661800, epqstate=0x0000c823e4651b80,
    estate=0x0000c823e46ace18, canSetTag=<unavailable>) at
    nodeModifyTable.c:2007:18
    #16: postgres`ExecModifyTable(pstate=0x0000c823e4651a98) at
    nodeModifyTable.c:2760:12
    #17: postgres`standard_ExecutorRun [inlined]
    ExecProcNode(node=0x0000c823e4651a98) at executor.h:260:9
    #18: postgres`standard_ExecutorRun at execMain.c:1555:10
    #19: postgres`standard_ExecutorRun(queryDesc=0x0000c823e45d51a0,
    direction=<unavailable>, count=0, execute_once=<unavailable>) at
    execMain.c:360:3
    
    The second argument of textne is a NULL pointer (a Datum of 0), which leads
    to the segfault in toast_raw_datum_size. All segfaults happened with the
    following query:
    
    WITH RECURSIVE
      params AS (SELECT $1::text AS schema, $2::text AS name,
                        $3::text AS version),
      seed AS (SELECT p.schema || E'\t^\t' || p.name AS node FROM params p),
      reach AS (SELECT v."schema" || E'\t^\t' || v."name" AS node
        FROM definitions v WHERE (SELECT node FROM seed) = ANY(v.used_tables)
        UNION
        SELECT v."schema" || E'\t^\t' || v."name" AS node
        FROM definitions v, reach r WHERE r.node = ANY(v.used_tables)),
      to_update AS (SELECT DISTINCT split_part(r.node, E'\t^\t', 1) AS "schema",
          split_part(r.node, E'\t^\t', 2) AS "name" FROM reach r),
      kv AS (SELECT (p.schema || '.' || p.name) AS key_str,
               (p.schema || '.' || p.name || ':' || p.version) AS new_entry
        FROM params p)
    UPDATE definitions v SET dependencies = array_cat(
      COALESCE(ARRAY(SELECT e
          FROM unnest(COALESCE(v.dependencies, ARRAY[]::text[])) AS e
          WHERE split_part(e, ':', 1) <> (SELECT key_str FROM kv)
        ), ARRAY[]::text[]
      ), ARRAY[(SELECT new_entry FROM kv)]
    )
    FROM to_update u
    WHERE v."schema" = u."schema" AND v."name" = u."name";
    
    The query has the following plan:
                                                             QUERY PLAN
    -----------------------------------------------------------------------------------------------------------------------------
     Update on definitions v  (cost=63318199175.74..63318200084.69 rows=0 width=0)
       CTE params
         ->  Result  (cost=0.00..0.01 rows=1 width=96)
       CTE reach
         ->  Recursive Union  (cost=0.03..61108645382.88 rows=73651793079 width=32)
               ->  Seq Scan on definitions v_1  (cost=0.03..370199.45 rows=27139 width=32)
                     Filter: ($2 = ANY (used_tables))
                     InitPlan 2 (returns $2)
                       ->  CTE Scan on params p  (cost=0.00..0.03 rows=1 width=32)
               ->  Nested Loop  (cost=0.00..5963523932.18 rows=7365176594 width=32)
                     Join Filter: (r_1.node = ANY (v_2.used_tables))
                     ->  Seq Scan on definitions v_2  (cost=0.00..363124.99 rows=555099 width=315)
                     ->  WorkTable Scan on reach r_1  (cost=0.00..5427.80 rows=271390 width=32)
       CTE kv
         ->  CTE Scan on params p_1  (cost=0.00..0.04 rows=1 width=64)
       InitPlan 7 (returns $7)
         ->  CTE Scan on kv kv_1  (cost=0.00..0.02 rows=1 width=32)
       ->  Nested Loop  (cost=2209553792.80..2209554701.75 rows=100 width=126)
             ->  Subquery Scan on u  (cost=2209553792.37..2209553797.37 rows=200 width=152)
                   ->  HashAggregate  (cost=2209553792.37..2209553795.37 rows=200 width=64)
                         Group Key: split_part(r.node, '     ^       '::text, 1), split_part(r.node, '       ^       '::text, 2)
                         ->  CTE Scan on reach r  (cost=0.00..1841294826.97 rows=73651793079 width=64)
             ->  Index Scan using definitions_schema_name_idx on definitions v  (cost=0.42..4.40 rows=3 width=68)
                   Index Cond: ((schema = u.schema) AND (name = u.name))
             SubPlan 6
               ->  Function Scan on unnest e  (cost=0.02..0.17 rows=9 width=32)
                     Filter: (split_part(e, ':'::text, 1) <> $5)
                     InitPlan 5 (returns $5)
                       ->  CTE Scan on kv  (cost=0.00..0.02 rows=1 width=32)
    
    Unfortunately, I wasn't able to reproduce the segfault, so the only
    information I have comes from the coredumps.
    
    The failure happens while evaluating the textne call for
    "WHERE split_part(e, ':', 1) <> (SELECT key_str FROM kv)". The ExprState
    has 7 steps with the following opcodes:
    0: EEOP_SCAN_FETCHSOME
    1: EEOP_SCAN_VAR
    2: EEOP_FUNCEXPR_STRICT
    3: EEOP_PARAM_EXEC
    4: EEOP_FUNCEXPR_STRICT
    5: EEOP_QUAL
    6: EEOP_DONE
    
    Step 3 fetches the output of InitPlan 5 (running the InitPlan first if it
    has not been executed yet) to fill the second argument of textne (step 4).
    If I look at step 3's param:
    p state->steps[3].d.param
    ((unnamed struct)) $219 = (paramid = 5, paramtype = 25)
    
    Then, looking at the matching ParamExecData:
    p econtext->ecxt_param_exec_vals[5]
    (ParamExecData) $220 = (execPlan = 0x0000000000000000, value = 0, isnull =
    false)
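
For readers less familiar with this step, here is a minimal standalone sketch of how a PARAM_EXEC step is expected to behave: the InitPlan is run lazily on first use, after which execPlan is reset to NULL and later evaluations simply read value and isnull. This is a simplified model, not the PostgreSQL source; ParamExecModel, run_init_plan_model, eval_param_exec_model and the key_str contents are names and values made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t Datum;

/* Simplified stand-in for PostgreSQL's ParamExecData (illustration only). */
typedef struct ParamExecModel
{
    void   *execPlan;   /* non-NULL while the InitPlan still has to be run */
    Datum   value;      /* InitPlan output once it has been evaluated */
    bool    isnull;     /* true if that output is SQL NULL */
} ParamExecModel;

/* Hypothetical InitPlan result, standing in for the key_str text value. */
static const char key_str[] = "public.t";

/* Hypothetical stand-in for running the InitPlan and storing its result. */
static void
run_init_plan_model(ParamExecModel *prm)
{
    prm->value = (Datum) key_str;
    prm->isnull = false;
    prm->execPlan = NULL;       /* marks the parameter as evaluated */
}

/* Model of a PARAM_EXEC step: run the InitPlan only on first use. */
static Datum
eval_param_exec_model(ParamExecModel *prm, bool *isnull)
{
    if (prm->execPlan != NULL)
        run_init_plan_model(prm);
    assert(prm->execPlan == NULL);

    /* No further validation: whatever is stored in the slot is returned. */
    *isnull = prm->isnull;
    return prm->value;
}
```

In this model, a slot in the state seen in the coredump (execPlan = NULL, value = 0, isnull = false) is treated as already evaluated, so the step hands a zero Datum to the next step without running anything.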
    
    When looking at the matching WAL records, we also see at least two updates
    before the segfault is triggered:
    rmgr: Heap        len (rec/tot):     59/  2139, tx: 2549003939, lsn:
    B4D/21956518, prev B4D/219564E8, desc: LOCK off 1: xid 2549003939: flags
    0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/16386/16899 blk
    160730 FPW
    rmgr: Heap        len (rec/tot):   2055/  2055, tx: 2549003939, lsn:
    B4D/21956D78, prev B4D/21956518, desc: UPDATE off 1 xmax 2549003939 flags
    0x11 KEYS_UPDATED ; new off 3 xmax 0, blkref #0: rel 1663/16386/16899 blk
    160730
    rmgr: Btree       len (rec/tot):     55/  1971, tx: 2549003939, lsn:
    B4D/21957698, prev B4D/21957668, desc: INSERT_LEAF off 122, blkref #0: rel
    1663/16386/16905 blk 3261 FPW
    rmgr: Btree       len (rec/tot):    104/   104, tx: 2549003939, lsn:
    B4D/21957E50, prev B4D/21957698, desc: INSERT_LEAF off 18, blkref #0: rel
    1663/16386/16907 blk 1517
    rmgr: Btree       len (rec/tot):    104/   104, tx: 2549003939, lsn:
    B4D/21957EB8, prev B4D/21957E50, desc: INSERT_LEAF off 90, blkref #0: rel
    1663/16386/53784856 blk 1020
    rmgr: Btree       len (rec/tot):     55/  1156, tx: 2549003939, lsn:
    B4D/21957F20, prev B4D/21957EB8, desc: INSERT_LEAF off 11, blkref #0: rel
    1663/16386/57258051 blk 7015 FPW
    rmgr: Btree       len (rec/tot):     55/   209, tx: 2549003939, lsn:
    B4D/219583C0, prev B4D/21957F20, desc: INSERT_LEAF off 4, blkref #0: rel
    1663/16386/57459940 blk 1921 FPW
    rmgr: Gin         len (rec/tot):    566/   566, tx: 2549003939, lsn:
    B4D/21958498, prev B4D/219583C0, desc: UPDATE_META_PAGE , blkref #0: rel
    1663/16386/57459942 blk 0, blkref #1: rel 1663/16386/57459942 blk 808
    rmgr: Heap        len (rec/tot):     54/    54, tx: 2549003939, lsn:
    B4D/25A4C7F0, prev B4D/25A4C7B8, desc: LOCK off 9: xid 2549003939: flags
    0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/16386/16899 blk 40
    rmgr: Heap        len (rec/tot):   1827/  1827, tx: 2549003939, lsn:
    B4D/25A4CAC8, prev B4D/25A4CA88, desc: HOT_UPDATE off 9 xmax 2549003939
    flags 0x10 ; new off 10 xmax 2549003939, blkref #0: rel 1663/16386/16899
    blk 40
    rmgr: Heap2       len (rec/tot):     56/    56, tx: 2549003939, lsn:
    B4D/25A4D1F0, prev B4D/25A4CAC8, desc: PRUNE latestRemovedXid 0 nredirected
    0 ndead 0, blkref #0: rel 1663/16386/16899 blk 100
    rmgr: Heap        len (rec/tot):     54/    54, tx: 2549003939, lsn:
    B4D/25A4D228, prev B4D/25A4D1F0, desc: LOCK off 1: xid 2549003939: flags
    0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/16386/16899 blk 19749
    
    In the logs, we see row lock contention before the segfault:
    2025-11-04T17:02:56.507Z,process 289871 still waiting for ShareLock on
    transaction 2549003939 after 1000.053 ms
    2025-11-04T17:02:56.507Z,Process holding the lock: 292365. Wait queue:
    289871.
    2025-11-04T17:02:58.938Z,process 292365 still waiting for ShareLock on
    transaction 2549003931 after 1000.052 ms
    2025-11-04T17:02:58.938Z,Process holding the lock: 292716. Wait queue:
    285801, 292365.
    2025-11-04T17:02:58.938Z,while updating tuple (40,8) in relation
    "definitions"
    2025-11-04T17:03:00.041Z,process 292365 acquired ShareLock on transaction
    2549003931 after 2102.985 ms
    2025-11-04T17:03:00.041Z,while updating tuple (40,8) in relation
    "definitions"
    2025-11-04T17:03:00.041Z,process 283964 acquired ExclusiveLock on tuple
    (40,8) of relation 16899 of database 16386 after 1997.621 ms
    2025-11-04T17:03:00.201Z,server process (PID 292365) was terminated by
    signal 11: Segmentation fault
    
    So it looks like InitPlan 5 had already been executed (execPlan is NULL),
    and its value was probably used during the first two updates. For the
    third update, however, the ParamExecData's value was 0 while isnull was
    still false, so textne received a NULL pointer and segfaulted. Frame #14
    of the backtrace is EvalPlanQual, so the failing evaluation ran as part of
    an EPQ recheck, which is consistent with the lock contention in the logs.
    All coredumps (or rather the matching WAL records) show a similar pattern
    of two updates before the segfault.
    Without a reproducer, I haven't been able to pinpoint what set the
    ParamExecData's value to 0.
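
To illustrate why this particular state crashes rather than yielding SQL NULL, the sketch below models the strictness check of a strict function step: it only looks at the isnull flags, so a zero Datum with isnull = false still reaches the function. This is a simplified model under the assumption that all arguments are pass-by-reference (as text is); ArgModel, StrictOutcome and strict_call_outcome are hypothetical names, not PostgreSQL APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;

/* One function argument as the expression step sees it (model only). */
typedef struct ArgModel
{
    Datum   value;
    bool    isnull;
} ArgModel;

/* Possible outcomes of a strict function step in this model. */
typedef enum StrictOutcome
{
    RESULT_SQL_NULL,            /* function skipped, result is SQL NULL */
    CALL_FUNCTION,              /* function called with valid pointers */
    WOULD_DEREF_NULL_POINTER    /* function called with a zero Datum */
} StrictOutcome;

static StrictOutcome
strict_call_outcome(const ArgModel *args, int nargs)
{
    /* The strictness check only inspects the isnull flags. */
    for (int i = 0; i < nargs; i++)
    {
        if (args[i].isnull)
            return RESULT_SQL_NULL;
    }

    /*
     * The function is now called. Assuming pass-by-reference arguments,
     * a zero Datum would be dereferenced as a NULL pointer.
     */
    for (int i = 0; i < nargs; i++)
    {
        if (args[i].value == 0)
            return WOULD_DEREF_NULL_POINTER;
    }
    return CALL_FUNCTION;
}
```

With arguments modeled as { valid pointer, isnull = false } and { 0, isnull = false }, the outcome is WOULD_DEREF_NULL_POINTER, which matches the toast_raw_datum_size(value=0) frame. Had isnull been true for the second argument, the step would have returned SQL NULL without calling textne.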
    
    Regards,
    Anthonin Bonnefoy