Thread

  1. Fix race during concurrent logical decoding activation

    Chao Li <li.evan.chao@gmail.com> — 2026-05-28T09:09:13Z

    Hi,
    
    While testing “Toggle logical decoding dynamically based on logical slot presence”, I hit an assertion failure with concurrent logical slot creation.
    
    Here is a repro:
    
    1. In session 1, attach the injection point locally and start creating a logical slot. The session blocks at logical-decoding-activation:
    ```
    evantest=# set application_name = 'slot_a';
    SET
    evantest=# select injection_points_set_local();
     injection_points_set_local
    ----------------------------
    
    (1 row)
    evantest=# select injection_points_attach('logical-decoding-activation', 'wait');
     injection_points_attach
    -------------------------
    
    (1 row)
    evantest=# select pg_create_logical_replication_slot('slot_a', 'pgoutput');
    ``` 
    
    2. In session 2, create another logical slot. This succeeds, and effective_wal_level becomes logical:
    ```
    evantest=# select pg_create_logical_replication_slot('slot_b', 'pgoutput');
     pg_create_logical_replication_slot
    ------------------------------------
     (slot_b,0/0902E418)
    (1 row)
    
    evantest=# show effective_wal_level;
     effective_wal_level
    ---------------------
     logical
    (1 row)
    ```
    
    3. In session 2, cancel session 1 instead of waking it up:
    ```
    evantest=# select pg_cancel_backend(pid) from pg_stat_activity where application_name = 'slot_a';
     pg_cancel_backend
    -------------------
     t
    (1 row)
    ```
    
    Then the server hits this assertion:
    ```
    TRAP: failed Assert("!LogicalDecodingCtl->logical_decoding_enabled"), File: "logicalctl.c", Line: 266, PID: 13768
    0   postgres                            0x00000001032b35d8 ExceptionalCondition + 216
    1   postgres                            0x0000000102f64600 abort_logical_decoding_activation + 120
    2   postgres                            0x0000000102f6451c EnsureLogicalDecodingEnabled + 412
    3   postgres                            0x0000000102f9f314 create_logical_replication_slot + 164
    4   postgres                            0x0000000102f9f1c4 pg_create_logical_replication_slot + 312
    5   postgres                            0x0000000102ce5f48 ExecInterpExpr + 3888
    6   postgres                            0x0000000102ce48b4 ExecInterpExprStillValid + 76
    7   postgres                            0x0000000102d57e94 ExecEvalExprNoReturn + 44
    8   postgres                            0x0000000102d57e54 ExecEvalExprNoReturnSwitchContext + 48
    9   postgres                            0x0000000102d57d18 ExecProject + 72
    10  postgres                            0x0000000102d57a9c ExecResult + 312
    11  postgres                            0x0000000102d06f1c ExecProcNodeFirst + 92
    12  postgres                            0x0000000102cfd8cc ExecProcNode + 60
    13  postgres                            0x0000000102cf83fc ExecutePlan + 244
    14  postgres                            0x0000000102cf8298 standard_ExecutorRun + 456
    15  postgres                            0x0000000102cf80c0 ExecutorRun + 84
    16  postgres                            0x000000010306fc64 PortalRunSelect + 296
    17  postgres                            0x000000010306f674 PortalRun + 656
    18  postgres                            0x000000010306a220 exec_simple_query + 1372
    19  postgres                            0x0000000103069348 PostgresMain + 3224
    20  postgres                            0x0000000103060a3c BackendInitialize + 0
    21  postgres                            0x0000000102f27db8 postmaster_child_launch + 464
    22  postgres                            0x0000000102f2f2ec BackendStartup + 304
    23  postgres                            0x0000000102f2d260 ServerLoop + 372
    24  postgres                            0x0000000102f2bd8c PostmasterMain + 6256
    25  postgres                            0x0000000102d99e84 main + 924
    26  dyld                                0x000000018cef7e00 start + 6992
    2026-05-28 13:28:32.526 CST [13753] LOG:  client backend (PID 13768) was terminated by signal 6: Abort trap: 6
    2026-05-28 13:28:32.526 CST [13753] DETAIL:  Failed process was running: select pg_create_logical_replication_slot('slot_a', 'pgoutput');
    ```
    
    From my tracing, when session 1 is cancelled, it enters abort_logical_decoding_activation(), which contains this assertion:
    ```
    Assert(!LogicalDecodingCtl->logical_decoding_enabled);
    ```
    
    But by that point session 2 has already created its slot and set LogicalDecodingCtl->logical_decoding_enabled to true, so the assertion fails. This is a race condition between the two concurrent activations.
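    
    The interleaving can be reduced to a small single-threaded model. This is only an illustrative sketch in plain C: RaceModelCtl and the model_* functions are hypothetical names, not PostgreSQL code, and the struct carries just the one flag that the assertion checks.
    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    
    /* Hypothetical stand-in for the shared LogicalDecodingCtl state. */
    typedef struct
    {
        bool logical_decoding_enabled;
    } RaceModelCtl;
    
    /* A session completes activation and publishes the enabled flag. */
    static void
    model_finish_activation(RaceModelCtl *ctl)
    {
        assert(ctl != NULL);
        ctl->logical_decoding_enabled = true;
    }
    
    /*
     * Abort path of a cancelled session. Returns true when the invariant
     * that the abort path asserts ("decoding is still disabled") holds.
     */
    static bool
    model_abort_invariant_holds(const RaceModelCtl *ctl)
    {
        assert(ctl != NULL);
        return !ctl->logical_decoding_enabled;
    }
    
    /* Replays the repro's interleaving; returns whether the invariant held. */
    static bool
    model_replay_repro(void)
    {
        RaceModelCtl ctl = { .logical_decoding_enabled = false };
    
        /* Session 1 starts activation and blocks: no shared state changes yet. */
    
        /* Session 2 runs a complete activation while session 1 is blocked. */
        model_finish_activation(&ctl);
    
        /* Session 1 is cancelled and runs its abort path. */
        return model_abort_invariant_holds(&ctl);
    }
    ```
    In this model, model_replay_repro() returns false: the abort path observes a flag that another session legitimately set, which is the condition the assertion rejects.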
    
    I might be overthinking this, but I feel the safest fix is to serialize EnableLogicalDecoding(). I tried serializing with LogicalDecodingControlLock and with a separate lock, but both approaches deadlocked around the barrier wait. I ended up adding an activation_in_progress flag in shared memory, protected by LogicalDecodingControlLock, together with a condition variable that other backends wait on until the in-progress activation finishes.
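    
    The flag-plus-condition-variable pattern described above can be sketched outside PostgreSQL with pthreads. This is a minimal model under stated assumptions, not the attached patch: SerializedCtl and the serialized_* functions are hypothetical names, a pthread mutex stands in for LogicalDecodingControlLock, and a pthread condition variable stands in for the PostgreSQL one. The point it illustrates is that the lock is held only while reading or updating the flags, never across the slow step (the barrier wait in the real code).
    ```c
    #include <assert.h>
    #include <pthread.h>
    #include <stdbool.h>
    
    /* Hypothetical model of the shared state; not the PostgreSQL struct. */
    typedef struct
    {
        pthread_mutex_t lock;           /* stands in for LogicalDecodingControlLock */
        pthread_cond_t  activation_cv;  /* signalled when an activation ends */
        bool            activation_in_progress;
        bool            logical_decoding_enabled;
    } SerializedCtl;
    
    static void
    serialized_init(SerializedCtl *ctl)
    {
        pthread_mutex_init(&ctl->lock, NULL);
        pthread_cond_init(&ctl->activation_cv, NULL);
        ctl->activation_in_progress = false;
        ctl->logical_decoding_enabled = false;
    }
    
    /*
     * Wait until no other activation is running. Returns true if the caller
     * is now the single activator, false if decoding is already enabled.
     */
    static bool
    serialized_begin_activation(SerializedCtl *ctl)
    {
        bool must_activate;
    
        pthread_mutex_lock(&ctl->lock);
        while (ctl->activation_in_progress)
            pthread_cond_wait(&ctl->activation_cv, &ctl->lock);
        must_activate = !ctl->logical_decoding_enabled;
        if (must_activate)
            ctl->activation_in_progress = true;
        pthread_mutex_unlock(&ctl->lock);
    
        return must_activate;
    }
    
    /*
     * Finish (success) or abort (!success) the caller's activation. Because
     * only one activator exists at a time, the flag is still false here.
     */
    static void
    serialized_end_activation(SerializedCtl *ctl, bool success)
    {
        pthread_mutex_lock(&ctl->lock);
        assert(ctl->activation_in_progress);
        assert(!ctl->logical_decoding_enabled);
        if (success)
            ctl->logical_decoding_enabled = true;
        ctl->activation_in_progress = false;
        pthread_cond_broadcast(&ctl->activation_cv);    /* wake all waiters */
        pthread_mutex_unlock(&ctl->lock);
    }
    ```
    A caller would invoke serialized_begin_activation(), do the slow work without holding the lock if it returned true, and then call serialized_end_activation() on both the success and the cancel path. A waiter that wakes up after a cancelled activation finds the flag still false and becomes the next activator.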
    
    With this fix, rerunning the repro makes session 2 wait while session 1 is blocked at the injection point. Since session 2 is now blocked as well, I cancelled session 1 from a third session; session 2 then continues, creates the slot successfully, and effective_wal_level becomes logical.
    
    I didn’t include a test in this patch, as I wasn’t sure whether one would be wanted. If others think it is worth adding, I can convert the repro into a TAP test.
    
    See the attached patch for details.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/