Thread

  1. Re: [PATCH] Fix ProcKill lock-group vs procLatch recycle race

    Michael Paquier <michael@paquier.xyz> — 2026-05-28T01:08:28Z

    On Wed, May 27, 2026 at 02:03:37PM +0300, Vlad Lesin wrote:
    > Attached is an updated patch with the test rebased onto the latest master.
    
    I have been looking at that, and after putting my HEAD on it I have
    concluded that this test is an anti-pattern based on its current
    shape.  The test relies on a wait, but the wait mechanism in
    injection_points will not be able to work as the latch of the backends
    get detached earlier after the fix done at 84b9d6bceab6 to prevent the
    recycle race.  What your proposed test was relying on here is a
    statement timeout to make sure that the leader stays long enough in
    the second point so as the followers are awaken and can finish their
    termination business first.  But we cannot really rely on that, and it
    makes the test much longer than necessary on fast machines, dependent
    on statement_timeout.
    
    Another thing that we could do is redesign the wait mechanism in
    injection_points to not rely on latches, using a shared memory flag
    where we don't depend on latches and conditional variables.  I cannot
    get excited enough about that for this specific case, but perhaps
    somebody would like to give it a shot?  The use of latches has been
    mentioned more than once as an annoying limitation, but I've always
    found that less efficient on fast machines as it would require an
    internal polling with a sleep or equivalent.  This thread would give
    an extra reason to do so, perhaps not in the stable branches but HEAD
    once v20 opens up?
    
    prockill_attach_injection_wait_pid() is not really necessary.  We
    could just reuse the existing attach() function and add an optional
    PID argument, defaulting to 0.
    
    Using two different points, one for the leader and one for the
    follower is indeed the correct way to do things.
    
    Another thing that one may try is to switch the test to use NOTICEs
    and server log lookups.  That may catch the PANIC, but one really
    needs to be lucky so it would be mostly a waste of cycles in the
    buildfarm due to the low reproducibility rate.
    
    The test was still a useful exercise to prove the bug, though.  If one
    reverts 84b9d6bceab6 and runs the test, we are able to reproduce the
    original bug.
    
    Anyway, if we are to move ahead with this test, I'd suggest a couple
    of patch pieces first:
    - Switch the wait mechanism to use something more primitive, with a
    shmem state.
    - Extend injection_points_attach() with an optional PID.
    - Add the test.
    All that would be quite invasise, especially for the 1st point, so
    introducing that only in v20 may be better.  I am not sure that the
    case of this thread is good enough to justify these changes, TBH.
    --
    Michael