Thread

  1. Fix LOCK_TIMEOUT handling in slotsync worker

    Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> — 2025-12-08T02:04:27Z

    Hi,
    
    On HEAD, the slotsync worker uses SIGINT to receive a graceful shutdown
    signal from the startup process on promotion. However, SIGINT is also used by
    the LOCK_TIMEOUT handler to trigger a query-cancel interrupt. Because the
    slotsync worker can access and lock catalog tables while parsing libpq tuples,
    this overlapping use of SIGINT causes the slotsync worker to ignore
    LOCK_TIMEOUT signals and consequently wait indefinitely on locks.
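
    To illustrate the overlap, here is a minimal standalone sketch (plain C, not
    PostgreSQL code). It assumes the lock timeout is delivered by the process
    signaling itself with SIGINT, and that a lock wait is only interrupted when
    a query-cancel flag is set. All names (shutdown_handler, cancel_handler,
    fire_lock_timeout, lock_wait_interrupted_after_timeout) are hypothetical
    stand-ins, not the real functions.

    ```c
    #include <assert.h>
    #include <signal.h>
    #include <stdbool.h>

    /* Flags that the signal handlers set; both names are illustrative. */
    static volatile sig_atomic_t shutdown_requested = 0;
    static volatile sig_atomic_t query_cancel_pending = 0;

    /* Stand-in for a SIGINT handler that only records a shutdown request. */
    static void shutdown_handler(int signo)
    {
        (void) signo;
        shutdown_requested = 1;
    }

    /* Stand-in for a SIGINT handler that records a query cancel. */
    static void cancel_handler(int signo)
    {
        (void) signo;
        query_cancel_pending = 1;
    }

    /* Assumption: an expired lock timeout signals the process with SIGINT. */
    static void fire_lock_timeout(void)
    {
        raise(SIGINT);
    }

    /*
     * Install the given SIGINT handler, let the lock timeout fire once, and
     * report whether a lock wait that checks query_cancel_pending would stop.
     */
    static bool lock_wait_interrupted_after_timeout(void (*sigint_handler)(int))
    {
        shutdown_requested = 0;
        query_cancel_pending = 0;
        signal(SIGINT, sigint_handler);
        fire_lock_timeout();
        signal(SIGINT, SIG_DFL);
        return query_cancel_pending != 0;
    }
    ```

    With shutdown_handler installed, the timeout only sets shutdown_requested,
    so the modeled lock wait is never interrupted. With cancel_handler
    installed, the same SIGINT sets query_cancel_pending and the wait can end.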
    
    I can reproduce the issue by:
    
    1) Create a failover replication slot for slotsync on the primary.
    2) Start the slotsync worker on the standby and use gdb to make the worker
       block before it accesses the pg_type catalog via walrcv_exec ->
       libpqrcv_exec -> libpqrcv_processTuples -> TupleDescInitEntry ->
       SearchSysCache1.
    3) Take an ACCESS EXCLUSIVE lock on pg_type on the primary.
    4) Log a standby snapshot to replicate the lock to the standby.
    5) Release the slotsync worker. It starts waiting for the lock on pg_type
       to be released, and on HEAD the wait is not canceled by the
       lock_timeout setting.
    
    Here is a patch to resolve this by replacing the current SIGINT handler in
    the slotsync worker with StatementCancelHandler. Furthermore, it updates the
    startup process to send SIGUSR1 to notify the slotsync worker that it needs
    to stop during promotion. The slotsync worker now stops when it detects that
    the shared memory flag (stopSignaled) is set to true.
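
    The stop protocol described above can be modeled with a small standalone
    sketch (plain C, not the patch itself). Only the stopSignaled flag name comes
    from the description. SlotSyncShared, worker_init, startup_request_stop,
    worker_should_exit and the two handlers are hypothetical, and the real
    worker presumably reads the flag from shared memory under suitable locking
    rather than from a process-local struct.

    ```c
    #include <assert.h>
    #include <signal.h>
    #include <stdbool.h>

    /* Process-local stand-in for the shared memory area holding the flag. */
    typedef struct
    {
        volatile sig_atomic_t stopSignaled;
    } SlotSyncShared;

    static SlotSyncShared shared;
    static volatile sig_atomic_t wakeup_pending = 0;
    static volatile sig_atomic_t query_cancel_pending = 0;

    /* SIGUSR1 only wakes the worker; it carries no meaning by itself. */
    static void wakeup_handler(int signo)
    {
        (void) signo;
        wakeup_pending = 1;
    }

    /* SIGINT is now reserved for query cancel (for example, lock timeout). */
    static void cancel_handler(int signo)
    {
        (void) signo;
        query_cancel_pending = 1;
    }

    /* Worker side: reset state and install both handlers. */
    static void worker_init(void)
    {
        shared.stopSignaled = 0;
        wakeup_pending = 0;
        query_cancel_pending = 0;
        signal(SIGINT, cancel_handler);
        signal(SIGUSR1, wakeup_handler);
    }

    /* Startup side: set the flag first, then wake the worker with SIGUSR1. */
    static void startup_request_stop(void)
    {
        shared.stopSignaled = 1;
        raise(SIGUSR1);
    }

    /* Worker side: after any wakeup, decide based on the flag alone. */
    static bool worker_should_exit(void)
    {
        wakeup_pending = 0;
        return shared.stopSignaled != 0;
    }
    ```

    In this model a SIGINT never makes worker_should_exit() return true. Only
    the flag set by startup_request_stop() does, which keeps the cancel path
    and the stop path separate.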
    
    I did not add a TAP test in the patch for now. Although feasible, it would
    require a strong lock on a catalog and an injection point to control the
    process.
    
    Best Regards,
    Hou zj