Thread

  1. Re: Fix race in ReplicationSlotRelease for ephemeral slots

    Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2026-05-29T16:44:10Z

    Hi,
    
    On Wed, May 27, 2026 at 5:20 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
    wrote:
    
    > Hi,
    >
    > While testing the slot release logic, I noticed a bug in
    > ReplicationSlotRelease() where it may access a replication slot array
    > entry that it has already released itself.
    >
    > In detail: when releasing an ephemeral replication slot,
    > ReplicationSlotRelease() first drops the slot via
    > ReplicationSlotDropAcquired(). After this point, the slot's entry in
    > the shared-memory slot array can be immediately reused by another
    > backend creating a new slot.
    >
    > However, ReplicationSlotRelease() continues executing common cleanup
    > code that still dereferences the old slot pointer and updates
    > shared-memory fields such as effective_xmin. If the slot array entry has
    > already been reallocated, these writes can inadvertently affect a
    > different, unrelated slot.
    >
    > I am attaching a patch that avoids touching the slot's shared-memory
    > state after dropping an ephemeral slot. It keeps the post-release
    > shared-memory updates only for non-ephemeral slots, where the slot
    > remains valid after release.
    >
    > To reproduce, we can use the following steps:
    >
    > 1. Attach gdb to the backend and set a breakpoint in
    >    ReplicationSlotRelease() right after ReplicationSlotDropAcquired()
    >    is called.
    > 2. Create an ephemeral slot in the above backend with an invalid output
    >    plugin:
    >    SELECT pg_create_logical_replication_slot('test_slot_dropped',
    >    'pgoutput2', false, false, true);
    > 3. Once the breakpoint is hit, start another backend and create a new slot
    >    named 'test_slot_created'.
    > 4. Release the breakpoint and allow the first backend to continue. At
    >    this point, you will see it updating
    >    'test_slot_created' -> active_proc (and effective_xmin, if a
    >    snapshot is being exported) to invalid values.
    > 5. Start a third backend and attempt to acquire the same slot
    >    'test_slot_created'; this should not be possible under normal
    >    circumstances, but the bug allows it.
    >
    
    patch LGTM.
    
    
    >
    > I haven't attached a test for this fix, as the change is straightforward
    > and the likelihood of encountering this bug is low, so it may not be
    > worth adding test cycles for it. However, if others feel differently,
    > I'm OK to add one.
    >
    
    +1 for a test. The fix is just an else, so a future refactor could change
    it and silently reintroduce the corruption. Because the bug scribbles on
    an unrelated, reused slot, nothing would catch it. Injection points make
    the race deterministic; I've attached a diff patch that adds a test that
    fails without the fix and passes with it.
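    
    To illustrate the interleaving for anyone skimming the thread, here is a
    minimal standalone sketch in plain C. It is a toy model, not PostgreSQL
    code and not the attached test: the Slot struct, slot_create(),
    slot_release(), the one-entry array and the pause hook are all
    hypothetical stand-ins. The hook plays the role of the breakpoint (or
    injection point) right after the drop, where a second backend reuses
    the freed array entry.

```c
/* Toy model of the race; all names are hypothetical, not PostgreSQL APIs. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_SLOTS    1      /* one entry, so reuse is guaranteed */
#define INVALID_PROC (-1)

typedef struct Slot {
    bool     in_use;
    bool     ephemeral;
    char     name[32];
    int      active_proc;     /* owning backend, INVALID_PROC if none */
    unsigned effective_xmin;  /* 0 stands for "invalid" in this model */
} Slot;

static Slot slots[MAX_SLOTS];

typedef void (*pause_hook)(void);

/* Claim the first free array entry, as a backend creating a slot would. */
static Slot *slot_create(const char *name, bool ephemeral, int proc, unsigned xmin)
{
    for (size_t i = 0; i < MAX_SLOTS; i++) {
        if (!slots[i].in_use) {
            slots[i].in_use = true;
            slots[i].ephemeral = ephemeral;
            snprintf(slots[i].name, sizeof(slots[i].name), "%s", name);
            slots[i].active_proc = proc;
            slots[i].effective_xmin = xmin;
            return &slots[i];
        }
    }
    return NULL;
}

/* The "common cleanup" writes that are only safe while the entry is ours. */
static void clear_shared_state(Slot *s)
{
    s->active_proc = INVALID_PROC;
    s->effective_xmin = 0;
}

/* fixed == false models the old flow, fixed == true the "else" version. */
static void slot_release(Slot *s, bool fixed, pause_hook hook)
{
    if (s->ephemeral) {
        s->in_use = false;          /* drop: entry is now reusable */
        if (hook)
            hook();                 /* another backend runs here */
        if (!fixed)
            clear_shared_state(s);  /* stale write through the old pointer */
    } else {
        clear_shared_state(s);      /* slot still valid, update is safe */
    }
}

/* Second backend: reuses the freed entry for a new, unrelated slot. */
static void other_backend_creates_slot(void)
{
    Slot *s = slot_create("test_slot_created", false, 2, 100);
    assert(s != NULL);
}

/* Returns true if the new slot's fields survive the first backend's release. */
static bool new_slot_intact_after_release(bool fixed)
{
    memset(slots, 0, sizeof(slots));
    Slot *old = slot_create("test_slot_dropped", true, 1, 50);
    assert(old != NULL);
    slot_release(old, fixed, other_backend_creates_slot);
    return slots[0].in_use
        && strcmp(slots[0].name, "test_slot_created") == 0
        && slots[0].active_proc == 2
        && slots[0].effective_xmin == 100;
}
```

    In this model, new_slot_intact_after_release(false) reports the new
    slot as clobbered, while new_slot_intact_after_release(true) leaves it
    untouched, which is the same fails-without/passes-with shape a real
    regression test would want.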
    
    
    -- 
    Thanks,
    Srinath Reddy Sadipiralla
    EDB: https://www.enterprisedb.com/