Re: [PATCH] Fix REPACK decoding worker not cleaned up on FATAL exit
Sami Imseih <samimseih@gmail.com> — 2026-05-13T03:45:07Z
Hi,

Thanks for reporting. This indeed looks like a bug. With pg_terminate_backend, the logical replication worker has no way to know that it needs to stop, because the PG_FINALLY block is not reached in this case.

I think registering a callback to terminate the worker is the proper fix, but I don't think on_proc_exit() is the right place to register the callback. With 0001 applied and a build with asserts enabled, I see a segfault.

postgres=# select pg_terminate_backend(26707);
 pg_terminate_backend
----------------------
 t
(1 row)

```
postgres=# select 1;
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
postgres=?#
```

```
2026-05-12 21:50:33.866 CDT [26569] LOG: client backend (PID 26707) was terminated by signal 11: Segmentation fault: 11
2026-05-12 21:50:33.866 CDT [26569] LOG: terminating any other active server processes
2026-05-12 21:50:33.872 CDT [26569] LOG: all server processes terminated; reinitializing
2026-05-12 21:50:33.882 CDT [27131] LOG: database system was interrupted; last known up at 2026-05-12 21:45:39 CDT
2026-05-12 21:50:34.278 CDT [27131] LOG: database system was not properly shut down; automatic recovery in progress
2026-05-12 21:50:34.281 CDT [27131] LOG: redo starts at 13/619E9470
```

From lldb on my Mac, I see:

```
Process 22683 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x7f7f7f7f7f7f7f7f)
    frame #0: 0x00000001044c607c postgres`TerminateBackgroundWorker(handle=0x7f7f7f7f7f7f7f7f) at bgworker.c:1324:2 [opt]
   1321     BackgroundWorkerSlot *slot;
   1322     bool signal_postmaster = false;
   1323
-> 1324     Assert(handle->slot < max_worker_processes);
   1325     slot = &BackgroundWorkerData->slot[handle->slot];
   1326
   1327     /* Set terminate flag in shared memory, unless slot has been reused. */
```

The 0x7f7f7f7f7f7f7f7f is the CLOBBER_FREED_MEMORY fill pattern from wipe_mem(). The handle's memory context has already been destroyed by the time the on_proc_exit callbacks run.

A better fix is to use before_shmem_exit instead, which is for user-level cleanup.

    /* ----------------------------------------------------------------
     *      before_shmem_exit
     *
     *      Register early callback to perform user-level cleanup,

If we do that, we can also wait for the worker to shut down, so we can use stop_repack_decoding_worker().

What do you think?

--
Sami Imseih
Amazon Web Services (AWS)
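
P.S. To illustrate the ordering argument above, here is a small standalone toy model. It is not PostgreSQL source: toy_handle, stop_worker_cb and simulate_fatal_exit are hypothetical names, and the point at which the handle's memory becomes invalid is an assumption taken from the lldb report. The only thing it relies on from the real code is that before_shmem_exit callbacks run before on_proc_exit callbacks during proc_exit().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*exit_callback) (int code, void *arg);

typedef struct
{
    exit_callback fn;
    void       *arg;
} callback_entry;

#define MAX_CALLBACKS 8

static callback_entry early_list[MAX_CALLBACKS];    /* models before_shmem_exit */
static int  early_count = 0;
static callback_entry late_list[MAX_CALLBACKS];     /* models on_proc_exit */
static int  late_count = 0;

/* Hypothetical stand-in for a worker handle in backend-local memory. */
typedef struct
{
    bool        memory_valid;   /* false once its context is "destroyed" */
    bool        stop_requested; /* true once cleanup reached the worker */
} toy_handle;

static void
register_cb(callback_entry *list, int *count, exit_callback fn, void *arg)
{
    assert(*count < MAX_CALLBACKS);
    list[*count].fn = fn;
    list[*count].arg = arg;
    (*count)++;
}

/* Run callbacks in reverse order of registration, emptying the list. */
static void
run_callbacks(callback_entry *list, int *count, int code)
{
    while (*count > 0)
    {
        (*count)--;
        list[*count].fn(code, list[*count].arg);
    }
}

/*
 * Cleanup callback. The toy can check memory_valid; real code cannot,
 * and would dereference freed memory instead.
 */
static void
stop_worker_cb(int code, void *arg)
{
    toy_handle *handle = (toy_handle *) arg;

    (void) code;
    if (handle->memory_valid)
        handle->stop_requested = true;
}

/*
 * Model a FATAL exit. use_early_phase selects which list the cleanup
 * callback is registered on. Returns true if the worker was told to stop.
 */
bool
simulate_fatal_exit(bool use_early_phase)
{
    toy_handle  handle = {true, false};

    early_count = late_count = 0;
    if (use_early_phase)
        register_cb(early_list, &early_count, stop_worker_cb, &handle);
    else
        register_cb(late_list, &late_count, stop_worker_cb, &handle);

    run_callbacks(early_list, &early_count, 1);     /* early phase */
    handle.memory_valid = false;    /* assumed: context destroyed here */
    run_callbacks(late_list, &late_count, 1);       /* late phase */

    return handle.stop_requested;
}
```

In this model, simulate_fatal_exit(true) returns true, because the callback runs while the handle is still usable. simulate_fatal_exit(false) returns false, which corresponds to the case where the real code crashes in TerminateBackgroundWorker().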