Fix LOCK_TIMEOUT handling in slotsync worker
From: "Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>
To: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Cc: Amit Kapila <amit.kapila16@gmail.com>
Date: 2025-12-08T02:04:27Z
Lists: pgsql-hackers
Attachments
- v1-0001-Fix-LOCK_TIMEOUT-handling-in-slotsync-worker.patch (application/octet-stream)
Hi,

Previously, the slotsync worker used SIGINT to receive a graceful shutdown signal from the startup process on promotion. However, SIGINT is also used by the LOCK_TIMEOUT handler to trigger a query-cancel interrupt. Given that the slotsync worker can access and lock catalog tables while parsing libpq tuples, this overlapping use of SIGINT led to the slotsync worker ignoring LOCK_TIMEOUT signals and consequently waiting indefinitely on locks.

I can reproduce the issue as follows:

1) Create a failover replication slot for slotsync on the primary.
2) Start the slotsync worker on the standby and use gdb to make it block before accessing the pg_type catalog via walrcv_exec -> libpqrcv_exec -> libpqrcv_processTuples -> TupleDescInitEntry -> SearchSysCache1.
3) Take an ACCESS EXCLUSIVE lock on pg_type on the primary.
4) Log a standby snapshot to replicate the lock to the standby.
5) Release the slotsync worker; it will start waiting for the lock on pg_type to be released.

On HEAD, this wait is not canceled by the lock_timeout setting.

Here is a patch to resolve this by replacing the current SIGINT handler in the slotsync worker with the appropriate StatementCancelHandler. Furthermore, it updates the startup process to send a SIGUSR1 signal to notify the slotsync worker that it needs to stop during promotion. The slotsync worker now stops upon detecting that the shared memory flag (stopSignaled) is set to true.

I did not add a TAP test in the patch for now. Although feasible, it requires a strong lock on a catalog and an injection point to control the process.

Best Regards,
Hou zj
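
For illustration only, the signal split described above can be sketched in standalone C. This is not code from the patch and does not use PostgreSQL's real handlers: DemoSlotSyncCtx and all demo_* names are hypothetical, only the flag name stopSignaled comes from the mail, and raise() stands in for the startup process signalling a separate worker process. The point is that SIGINT is left free to mean "cancel the current wait", while the stop request travels through a shared flag plus a SIGUSR1 wakeup.

```c
#include <signal.h>

/* Hypothetical stand-in for the shared-memory control struct.
 * Only the flag name stopSignaled is taken from the mail. */
typedef struct DemoSlotSyncCtx
{
    volatile sig_atomic_t stopSignaled;
} DemoSlotSyncCtx;

static DemoSlotSyncCtx demo_ctx;

/* Set by the SIGINT handler: "cancel the current wait" (e.g. lock timeout). */
static volatile sig_atomic_t demo_cancel_pending = 0;
/* Set by the SIGUSR1 handler: "wake up and re-check shared state". */
static volatile sig_atomic_t demo_wakeup_pending = 0;

static void
demo_cancel_handler(int signo)
{
    (void) signo;
    demo_cancel_pending = 1;
}

static void
demo_wakeup_handler(int signo)
{
    (void) signo;
    demo_wakeup_pending = 1;
}

/* Install one handler per purpose, so the two meanings never overlap. */
static void
demo_install_handlers(void)
{
    signal(SIGINT, demo_cancel_handler);
    signal(SIGUSR1, demo_wakeup_handler);
}

/* What the promoting side does in this sketch: set the flag, then wake
 * the worker. raise() is an assumption made to keep the demo in a single
 * process; a real sender would signal another process. */
static void
demo_request_stop(void)
{
    demo_ctx.stopSignaled = 1;
    raise(SIGUSR1);
}

typedef enum DemoAction
{
    DEMO_CONTINUE,
    DEMO_CANCEL_WAIT,
    DEMO_STOP
} DemoAction;

/* Polled by the worker loop. The stop request is read from the shared
 * flag; SIGUSR1 only serves to wake the worker so it looks at the flag. */
static DemoAction
demo_check_interrupts(void)
{
    demo_wakeup_pending = 0;

    if (demo_ctx.stopSignaled)
        return DEMO_STOP;

    if (demo_cancel_pending)
    {
        demo_cancel_pending = 0;
        return DEMO_CANCEL_WAIT;
    }

    return DEMO_CONTINUE;
}
```

In this sketch, a worker would call demo_install_handlers() once at startup and poll demo_check_interrupts() wherever it may block. A SIGINT then yields DEMO_CANCEL_WAIT without ending the worker, and demo_request_stop() yields DEMO_STOP.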