
  1. Re: [PATCH] Fix REPACK decoding worker not cleaned up on FATAL exit

    Sami Imseih <samimseih@gmail.com> — 2026-05-13T03:45:07Z

    Hi,
    
    Thanks for reporting. This indeed looks like a bug.
    
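    With pg_terminate_backend(), the backend exits through a FATAL error,
    which calls proc_exit() directly instead of unwinding through PG_TRY.
    The PG_FINALLY block is therefore never reached, and the decoding
    worker has no way to know that it needs to stop.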
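A toy model may make the control flow clearer. It is plain C, not PostgreSQL code, and every name in it is hypothetical. An ERROR longjmps to the innermost handler, so the PG_FINALLY-style cleanup runs. A FATAL (which is what pg_terminate_backend() leads to) is modeled as a jump straight to a process-exit point that skips every handler in between.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model, not PostgreSQL source. ERROR longjmps to the innermost
 * handler, so "finally" cleanup runs. FATAL leaves via proc_exit() without
 * unwinding, modeled as a jump straight to process_exit_point.
 */
typedef enum { LEVEL_ERROR, LEVEL_FATAL } Level;

static jmp_buf *innermost_handler;  /* plays the role of PG_exception_stack */
static jmp_buf process_exit_point;  /* stands in for proc_exit() */
static bool worker_running;         /* hypothetical decoding-worker state */

static void raise_error(Level level)
{
    if (level == LEVEL_FATAL)
        longjmp(process_exit_point, 1);  /* no handler gets control */
    assert(innermost_handler != NULL);
    longjmp(*innermost_handler, 1);      /* ERROR: unwind to the handler */
}

/* Hypothetical REPACK step: start the worker, fail, stop it "finally". */
static void run_repack(Level level)
{
    jmp_buf local;
    jmp_buf *saved = innermost_handler;

    worker_running = true;
    innermost_handler = &local;
    if (setjmp(local) == 0)
        raise_error(level);  /* e.g. SIGTERM from pg_terminate_backend() */

    /* "PG_FINALLY": reached only when control came back via longjmp(local) */
    worker_running = false;
    innermost_handler = saved;
}

/* Returns true if the worker is left running after the backend "exits". */
static bool worker_leaked(Level level)
{
    if (setjmp(process_exit_point) == 0)
        run_repack(level);
    innermost_handler = NULL;  /* any handler frames are gone by now */
    return worker_running;
}
```

worker_leaked(LEVEL_ERROR) returns false because the cleanup ran, while worker_leaked(LEVEL_FATAL) returns true: the worker is leaked.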
    
    I think registering a callback to terminate the worker is the proper fix,
    but I don't think on_proc_exit() is the right place to register the
    callback.
    
    With 0001 applied, on a build with assertions enabled, I see a segfault:
    
    ```
    postgres=# select pg_terminate_backend(26707);
     pg_terminate_backend
    ----------------------
     t
    (1 row)
    ```
    
    ```
    postgres=# select 1;
    WARNING:  terminating connection because of crash of another server process
    DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
    HINT:  In a moment you should be able to reconnect to the database and repeat your command.
    server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
    postgres=?#
    ```
    
    ```
    2026-05-12 21:50:33.866 CDT [26569] LOG:  client backend (PID 26707) was terminated by signal 11: Segmentation fault: 11
    2026-05-12 21:50:33.866 CDT [26569] LOG:  terminating any other active server processes
    2026-05-12 21:50:33.872 CDT [26569] LOG:  all server processes terminated; reinitializing
    2026-05-12 21:50:33.882 CDT [27131] LOG:  database system was interrupted; last known up at 2026-05-12 21:45:39 CDT
    2026-05-12 21:50:34.278 CDT [27131] LOG:  database system was not properly shut down; automatic recovery in progress
    2026-05-12 21:50:34.281 CDT [27131] LOG:  redo starts at 13/619E9470
    ```
    
    In lldb on my Mac, I see:
    
    ```
    Process 22683 stopped
    * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x7f7f7f7f7f7f7f7f)
        frame #0: 0x00000001044c607c postgres`TerminateBackgroundWorker(handle=0x7f7f7f7f7f7f7f7f) at bgworker.c:1324:2 [opt]
       1321               BackgroundWorkerSlot *slot;
       1322               bool            signal_postmaster = false;
       1323
    -> 1324               Assert(handle->slot < max_worker_processes);
       1325               slot = &BackgroundWorkerData->slot[handle->slot];
       1326
       1327               /* Set terminate flag in shared memory, unless slot has been reused. */
    ```
    
    The value 0x7f7f7f7f7f7f7f7f is the CLOBBER_FREED_MEMORY fill pattern
    written by wipe_mem(). The handle's memory context has already been
    destroyed by the time the on_proc_exit callbacks run, so the callback
    ends up dereferencing a pointer read from freed memory.
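To illustrate where that handle value comes from, here is a small standalone sketch (hypothetical types, not PostgreSQL code) that overwrites a struct holding a pointer with 0x7F bytes, the way wipe_mem() does when CLOBBER_FREED_MEMORY is defined (assert-enabled builds define it). It wipes a live buffer rather than reading truly freed memory, so the C stays well defined.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct ToyHandle { int slot; } ToyHandle;         /* hypothetical */
typedef struct ToyState { ToyHandle *handle; } ToyState;  /* hypothetical */

/* Mimics wipe_mem(): overwrite the chunk with the 0x7F fill byte. */
static void toy_wipe_mem(void *ptr, size_t size)
{
    memset(ptr, 0x7F, size);
}

/* Returns the pointer value a late callback would read from wiped state. */
static uintptr_t handle_after_wipe(void)
{
    ToyHandle h = {.slot = 3};
    ToyState state = {.handle = &h};

    assert(state.handle->slot == 3);      /* valid before the "free" */
    toy_wipe_mem(&state, sizeof(state));  /* memory context destroyed */
    return (uintptr_t) state.handle;      /* every byte is now 0x7f */
}
```

On a 64-bit build, handle_after_wipe() yields 0x7f7f7f7f7f7f7f7f, the same address lldb reports for handle.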
    
    A better fix is to use before_shmem_exit() instead, which is meant for
    user-level cleanup (from src/backend/storage/ipc/ipc.c):
    
    ```
    /* ----------------------------------------------------------------
     *      before_shmem_exit
     *
     *      Register early callback to perform user-level cleanup,
    ```
    
    If we do that, we can also wait for the worker to shut down, since
    shared memory is still attached at that point, so we can simply call
    stop_repack_decoding_worker().
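The ordering argument can be sketched as a toy model of the exit sequence. It is not PostgreSQL source: callbacks take no arguments here (the real ones take `int code, Datum arg`), all names are hypothetical, and it assumes, as the backtrace suggests, that the handle lives in memory released by the transaction abort in ShutdownPostgres(). Because before_shmem_exit callbacks run in reverse registration order, a callback registered during REPACK runs before ShutdownPostgres, which was registered at backend startup. An on_proc_exit callback only runs after that cleanup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*toy_callback) (void);

#define MAX_CALLBACKS 8
static toy_callback before_shmem_list[MAX_CALLBACKS];
static int n_before_shmem;
static toy_callback on_proc_list[MAX_CALLBACKS];
static int n_on_proc;

static bool handle_valid;       /* is the worker handle's memory alive? */
static bool worker_stopped;     /* did cleanup stop the worker? */
static bool used_freed_handle;  /* did cleanup touch freed memory? */

static void toy_before_shmem_exit(toy_callback cb)
{
    assert(n_before_shmem < MAX_CALLBACKS);
    before_shmem_list[n_before_shmem++] = cb;
}

static void toy_on_proc_exit(toy_callback cb)
{
    assert(n_on_proc < MAX_CALLBACKS);
    on_proc_list[n_on_proc++] = cb;
}

/* Exit sequence: before_shmem_exit list first, on_proc_exit list last. */
static void toy_proc_exit(void)
{
    while (n_before_shmem > 0)      /* reverse registration order */
        before_shmem_list[--n_before_shmem]();
    while (n_on_proc > 0)           /* also reverse order, but much later */
        on_proc_list[--n_on_proc]();
}

/* Stand-in for ShutdownPostgres(): transaction abort frees the handle. */
static void toy_shutdown_postgres(void)
{
    handle_valid = false;
}

/* Hypothetical cleanup: terminate the worker and wait for it. */
static void toy_stop_decoding_worker(void)
{
    if (!handle_valid)
    {
        used_freed_handle = true;   /* corresponds to the reported segfault */
        return;
    }
    worker_stopped = true;
}

/* Simulate a FATAL exit with the cleanup registered one way or the other. */
static void simulate_exit(bool use_before_shmem_exit)
{
    n_before_shmem = n_on_proc = 0;
    handle_valid = true;
    worker_stopped = used_freed_handle = false;

    toy_before_shmem_exit(toy_shutdown_postgres);   /* at backend startup */
    if (use_before_shmem_exit)                      /* later, during REPACK */
        toy_before_shmem_exit(toy_stop_decoding_worker);
    else
        toy_on_proc_exit(toy_stop_decoding_worker);

    toy_proc_exit();
}
```

simulate_exit(false) models the on_proc_exit() approach and ends with used_freed_handle set; simulate_exit(true) stops the worker while the handle is still valid.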
    
    What do you think?
    
    --
    Sami Imseih
    Amazon Web Services (AWS)