Re: Fix race in ReplicationSlotRelease for ephemeral slots
Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2026-05-29T16:44:10Z
Hi,

On Wed, May 27, 2026 at 5:20 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> Hi,
>
> While testing the slot release logic, I noticed a bug in
> ReplicationSlotRelease() where it may access a replication slot array entry
> that has already been released by itself.
>
> The detail is: When releasing an ephemeral replication slot,
> ReplicationSlotRelease() first drops the slot via
> ReplicationSlotDropAcquired(). After this point, the slot's shared memory
> slot array entry can be immediately reused by another backend creating a
> new slot.
>
> However, ReplicationSlotRelease() continued executing common cleanup code
> that still dereferenced the old slot pointer and updated shared memory
> fields such as effective_xmin. If the slot array entry had already been
> reallocated, these writes could inadvertently affect a different, unrelated
> slot.
>
> I am attaching a patch that avoids touching slot shared-memory state after
> dropping an ephemeral slot. Keep the post-release shared-memory updates only
> for non-ephemeral slots, where the slot remains valid after release.
>
> To reproduce, we can use the following steps:
>
> 1. Attach gdb to the backend and set a breakpoint in
>    ReplicationSlotRelease() right after ReplicationSlotDropAcquired() is
>    called.
> 2. Create an ephemeral slot in the above backend with an invalid output
>    plugin:
>    SELECT pg_create_logical_replication_slot('test_slot_dropped',
>    'pgoutput2', false, false, true);
> 3. Once the breakpoint is hit, start another backend and create a new slot
>    named 'test_slot_created'.
> 4. Release the breakpoint and allow the first backend to continue. At this
>    point, you will see it updating the new slot 'test_slot_created' ->
>    active_proc (and effective_xmin, if a snapshot is being exported) to
>    invalid values.
> 5. Start a third backend and attempt to acquire the same slot
>    'test_slot_created'; this should not be possible under normal
>    circumstances, but the bug allows it.

patch LGTM.

> I haven't attached a test for this fix, as the change is straightforward
> and the likelihood of encountering this bug is low, so it may not be worth
> adding test cycles for it. However, if others feel differently, I'm OK to
> add one.

+1 for a test. The fix is just an else branch, so a future refactor could
easily undo it and silently reintroduce the corruption; because the bug
scribbles on an unrelated, reused slot, nothing else would catch it.
Injection points make the race deterministic. I've attached a diff patch
that adds a test which fails without the fix and passes with it.

--
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/