Thread

  1. Re: [PATCH] Fix orphaned backend processes on Windows using Job Objects

    Bryan Green <dbryan.green@gmail.com> — 2025-11-03T22:12:36Z

    On 11/3/2025 9:29 AM, Andres Freund wrote:
    > On 2025-11-03 09:25:11 -0600, Bryan Green wrote:
    >> On 11/3/2025 9:19 AM, Andres Freund wrote:
    >>> Hi,
    >>>
    >>> On 2025-11-03 09:12:03 -0600, Bryan Green wrote:
    >>>> We just need to call CreateJobObject() in PostmasterMain(), configure
    >>>> with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, and assign the postmaster.
    >>>> Children inherit membership automatically. When the job handle closes on
    >>>> postmaster exit, the kernel terminates all children atomically. This is
    >>>> kernel-enforced with no polling and no race conditions.
    >>>
    >>> What happens if a postmaster child exits irregularly? Is postmaster terminated
    >>> as well?
    >>>
    >>
    >> No, Job Objects are unidirectional.
    > 
    > Great.
    > 
    > 
    >>>> The patch has been tested on Windows 10/11 with both MSVC and MinGW
    >>>> builds. Nested jobs fail gracefully as expected. Clean shutdown is
    >>>> unaffected. Crash tests with taskkill /F, debugger abort, and access
    >>>> violations all correctly terminate children immediately with zero orphans.
    >>>>
    >>>> This patch does not include automated tests because the core
    >>>> functionality (orphan prevention on crash) requires simulating process
    >>>> termination, which is difficult to test reliably in CI.
    >>>
    >>> Why is it difficult to test in CI? We do some related tests in
    >>> 013_crash_restart.pl, it doesn't seem like it ought to be hard to also add
    >>> tests for postmaster?
    >>>
    >>
    >> Fair point. I was hesitant because testing the actual orphan prevention
    >> requires killing the postmaster while backends are active, which seemed
    >> fragile. But you're right that we already test similar scenarios.
    >>
    >> I can add a test to 013_crash_restart.pl (or a new Windows-specific test
    >> file) that:
    >> 1. Starts server with active backend
    >> 2. Kills postmaster ungracefully (taskkill /F)
    >> 3. Verifies backend process terminates automatically
    >> 4. Confirms clean restart
    >>
    >> Would that be sufficient, or do you have other test scenarios in mind?
    > 
    > That's pretty much what I had in mind.
    > 
    > Greetings,
    > 
    > Andres Freund
    
    
    I've implemented the test in 013_crash_restart.pl.
    
    The test passes on Windows 10/11 with both MSVC and MinGW builds.
    Backends are typically terminated within 100-200 ms of the postmaster
    being killed, confirming that the Job Object KILL_ON_JOB_CLOSE mechanism
    works as intended.
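    
    For readers less familiar with the API, the mechanism discussed in this
    thread can be sketched roughly as below. This is an illustration only and
    is not the code from the attached patch: the helper name
    setup_kill_on_close_job() and its return convention are hypothetical, and
    the non-Windows branch exists only so the sketch compiles elsewhere.
    
    ```c
    #include <stdio.h>
    #ifdef _WIN32
    #include <windows.h>
    #endif
    
    /*
     * Hypothetical helper (not from the patch): put the calling process into
     * a new job object configured with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE.
     * Returns 1 if the job was set up, 0 if it was not (non-Windows build,
     * or any of the Win32 calls failed).
     */
    int
    setup_kill_on_close_job(void)
    {
    #ifdef _WIN32
        HANDLE      job;
        JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits;
    
        /* Anonymous job object with default security attributes. */
        job = CreateJobObject(NULL, NULL);
        if (job == NULL)
            return 0;
    
        /* Ask the kernel to kill all job members when the last handle closes. */
        ZeroMemory(&limits, sizeof(limits));
        limits.BasicLimitInformation.LimitFlags =
            JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE;
    
        if (!SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                                     &limits, sizeof(limits)) ||
            !AssignProcessToJobObject(job, GetCurrentProcess()))
        {
            /* The job has no members at this point, so closing it is harmless. */
            CloseHandle(job);
            return 0;
        }
    
        /*
         * The handle is intentionally left open. It is closed by the kernel
         * when this process exits, however it exits, and that close is what
         * terminates the remaining members of the job.
         */
        return 1;
    #else
        return 0;
    #endif
    }
    ```
    
    A caller would invoke this once early in startup, before any child
    processes are created, and treat a return value of 0 as non-fatal so that
    startup continues without orphan protection.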
    
    Updated patch (v2) attached.
    
    --
    Bryan