Thread
-
Fix race during concurrent logical decoding activation
Chao Li <li.evan.chao@gmail.com> — 2026-05-28T09:09:13Z
Hi,

While testing “Toggle logical decoding dynamically based on logical slot presence”, I hit an assertion failure with concurrent logical slot creation.

Here is a repro:

1. In session 1, attach the injection point locally and start creating a logical slot. The session blocks at logical-decoding-activation:

```
evantest=# set application_name = 'slot_a';
SET
evantest=# select injection_points_set_local();
 injection_points_set_local
----------------------------

(1 row)

evantest=# select injection_points_attach('logical-decoding-activation', 'wait');
 injection_points_attach
-------------------------

(1 row)

evantest=# select pg_create_logical_replication_slot('slot_a', 'pgoutput');
```

2. In session 2, create another logical slot. This succeeds, and effective_wal_level becomes logical:

```
evantest=# select pg_create_logical_replication_slot('slot_b', 'pgoutput');
 pg_create_logical_replication_slot
------------------------------------
 (slot_b,0/0902E418)
(1 row)

evantest=# show effective_wal_level;
 effective_wal_level
---------------------
 logical
(1 row)
```

3. In session 2, cancel session 1 instead of waking it up:

```
evantest=# select pg_cancel_backend(pid) from pg_stat_activity where application_name = 'slot_a';
 pg_cancel_backend
-------------------
 t
(1 row)
```

Then the server hits this assertion:

```
TRAP: failed Assert("!LogicalDecodingCtl->logical_decoding_enabled"), File: "logicalctl.c", Line: 266, PID: 13768
0   postgres  0x00000001032b35d8 ExceptionalCondition + 216
1   postgres  0x0000000102f64600 abort_logical_decoding_activation + 120
2   postgres  0x0000000102f6451c EnsureLogicalDecodingEnabled + 412
3   postgres  0x0000000102f9f314 create_logical_replication_slot + 164
4   postgres  0x0000000102f9f1c4 pg_create_logical_replication_slot + 312
5   postgres  0x0000000102ce5f48 ExecInterpExpr + 3888
6   postgres  0x0000000102ce48b4 ExecInterpExprStillValid + 76
7   postgres  0x0000000102d57e94 ExecEvalExprNoReturn + 44
8   postgres  0x0000000102d57e54 ExecEvalExprNoReturnSwitchContext + 48
9   postgres  0x0000000102d57d18 ExecProject + 72
10  postgres  0x0000000102d57a9c ExecResult + 312
11  postgres  0x0000000102d06f1c ExecProcNodeFirst + 92
12  postgres  0x0000000102cfd8cc ExecProcNode + 60
13  postgres  0x0000000102cf83fc ExecutePlan + 244
14  postgres  0x0000000102cf8298 standard_ExecutorRun + 456
15  postgres  0x0000000102cf80c0 ExecutorRun + 84
16  postgres  0x000000010306fc64 PortalRunSelect + 296
17  postgres  0x000000010306f674 PortalRun + 656
18  postgres  0x000000010306a220 exec_simple_query + 1372
19  postgres  0x0000000103069348 PostgresMain + 3224
20  postgres  0x0000000103060a3c BackendInitialize + 0
21  postgres  0x0000000102f27db8 postmaster_child_launch + 464
22  postgres  0x0000000102f2f2ec BackendStartup + 304
23  postgres  0x0000000102f2d260 ServerLoop + 372
24  postgres  0x0000000102f2bd8c PostmasterMain + 6256
25  postgres  0x0000000102d99e84 main + 924
26  dyld      0x000000018cef7e00 start + 6992
2026-05-28 13:28:32.526 CST [13753] LOG: client backend (PID 13768) was terminated by signal 6: Abort trap: 6
2026-05-28 13:28:32.526 CST [13753] DETAIL: Failed process was running: select pg_create_logical_replication_slot('slot_a', 'pgoutput');
```

From my tracing, when session 1 is cancelled, it enters abort_logical_decoding_activation(), which contains this assertion:

```
Assert(!LogicalDecodingCtl->logical_decoding_enabled);
```

But by then session 2 has already created its slot and set LogicalDecodingCtl->logical_decoding_enabled to true, so the assertion fails. This is a race condition between the two activations.

I might be overthinking this, but I feel the safest fix is to serialize EnableLogicalDecoding(). I tried serializing with LogicalDecodingControlLock and with a separate lock, but both approaches deadlocked around the barrier wait. I ended up adding an activation_in_progress flag in shared memory, protected by LogicalDecodingControlLock, together with a condition variable on which other backends wait for the in-progress activation to finish.

With this fix, rerunning the repro makes session 2 wait while session 1 is blocked at the injection point. After canceling session 1 from session 3, session 2 continues, creates the slot successfully, and effective_wal_level becomes logical.

I didn’t include a test in this patch, as I wasn’t sure such a test would be desirable. If others think it is worth adding, I can convert the repro into a TAP test.

See the attached patch for details.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
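
For readers following the thread, the "activation_in_progress flag plus condition variable" idea described in the message can be sketched roughly as below. This is not the attached patch. It is a minimal standalone model that assumes POSIX threads in place of PostgreSQL's LWLock and ConditionVariable, and the names DecodingCtl, decoding_ctl_init, begin_activation, and end_activation are hypothetical. The point it illustrates is that the lock is held only while reading or updating the flags, never across the long-running activation work, which is what lets waiters block without holding a lock across the barrier wait.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared-memory control struct. */
typedef struct DecodingCtl
{
    pthread_mutex_t lock;               /* models LogicalDecodingControlLock */
    pthread_cond_t  activation_done;    /* models the condition variable */
    bool            activation_in_progress;
    bool            logical_decoding_enabled;
} DecodingCtl;

void decoding_ctl_init(DecodingCtl *ctl)
{
    pthread_mutex_init(&ctl->lock, NULL);
    pthread_cond_init(&ctl->activation_done, NULL);
    ctl->activation_in_progress = false;
    ctl->logical_decoding_enabled = false;
}

/*
 * Wait until no other activation is running. Returns true if the caller
 * now owns the activation and must do the work, or false if decoding is
 * already enabled and nothing needs to be done.
 */
bool begin_activation(DecodingCtl *ctl)
{
    bool must_activate;

    pthread_mutex_lock(&ctl->lock);
    while (ctl->activation_in_progress)
        pthread_cond_wait(&ctl->activation_done, &ctl->lock);
    must_activate = !ctl->logical_decoding_enabled;
    if (must_activate)
        ctl->activation_in_progress = true;
    pthread_mutex_unlock(&ctl->lock);

    return must_activate;
}

/*
 * Finish (success = true) or abort (success = false) the activation that
 * the caller owns, then wake every waiter so one of them can retry.
 */
void end_activation(DecodingCtl *ctl, bool success)
{
    pthread_mutex_lock(&ctl->lock);
    assert(ctl->activation_in_progress);
    if (success)
        ctl->logical_decoding_enabled = true;
    else
        /* Holds here because no one else could activate concurrently. */
        assert(!ctl->logical_decoding_enabled);
    ctl->activation_in_progress = false;
    pthread_cond_broadcast(&ctl->activation_done);
    pthread_mutex_unlock(&ctl->lock);
}
```

In this model, a cancelled session calls end_activation(ctl, false), and a waiting session then returns from begin_activation() with true and performs the activation itself, which matches the behaviour reported for session 2 after the fix.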