Thread
-
Re: Exit walsender before confirming remote flush in logical replication
Alexander Lakhin <exclusion@gmail.com> — 2026-05-31T11:00:01Z
Hello Fujii-san, 06.04.2026 18:32, Andres Freund wrote: > Hi, > > On 2026-04-06 11:40:24 +0900, Fujii Masao wrote: >> I've pushed the patch. Thanks! > Failed on CI just now: > > https://cirrus-ci.com/task/6745359004729344?logs=test_world#L410 > https://api.cirrus-ci.com/v1/artifact/task/6745359004729344/testrun/build/testrun/subscription/038_walsnd_shutdown_timeout/log/regress_log_038_walsnd_shutdown_timeout A recent failure occurred on the buildfarm [1] indicates another stability issue, this time with the test itself, I guess: [21:12:03.301](0.068s) ok 3 - have walreceiver pid 2275585 ### Stopping node "publisher" using mode fast # Running: pg_ctl --pgdata /home/bf/bf-build/tamandua/HEAD/pgsql.build/testrun/subscription/038_walsnd_shutdown_timeout/data/t_038_walsnd_shutdown_timeout_publisher_data/pgdata --mode fast stop waiting for server to shut down.... done server stopped # No postmaster PID for node "publisher" [21:12:03.752](0.452s) not ok 4 - walsender exits due to wal_sender_shutdown_timeout even when both physical and logical replication are stalled [21:12:03.753](0.001s) # Failed test 'walsender exits due to wal_sender_shutdown_timeout even when both physical and logical replication are stalled' # at /home/bf/bf-build/tamandua/HEAD/pgsql/src/test/subscription/t/038_walsnd_shutdown_timeout.pl line 189. ### Stopping node "subscriber" using mode fast 038_walsnd_shutdown_timeout_subscriber.log doesn't really contain the expected warning: 2026-05-29 21:11:58.777 CEST [2273817][logical replication apply worker][124/2:0] LOG: logical replication apply worker for subscription "test_sub" has started 2026-05-29 21:12:03.232 CEST [2271870][client backend][4/6:0] LOG: statement: BEGIN; 2026-05-29 21:12:03.232 CEST [2271870][client backend][4/6:0] LOG: statement: LOCK TABLE test_tab IN EXCLUSIVE MODE; 2026-05-29 21:12:03.232 CEST [2271870][client backend][4/6:0] LOG: statement: ; 2026-05-29 21:12:03.268 CEST [2273817][logical replication apply worker][124/0:0] LOG: received message via replication: WARNING: replication slot "test_slot" specified in parameter "synchronized_standby_slots" does not have active_pid DETAIL: Logical replication is waiting on the standby associated with replication slot "test_slot". HINT: Start the standby associated with the replication slot "test_slot", or amend parameter "synchronized_standby_slots". 2026-05-29 21:12:03.433 CEST [2273817][logical replication apply worker][124/0:0] ERROR: could not receive data from WAL stream: ERROR: replication slot "test_slot" specified in parameter "synchronized_standby_slots" does not have active_pid DETAIL: Logical replication is waiting on the standby associated with replication slot "test_slot". HINT: Start the standby associated with the replication slot "test_slot", or amend parameter "synchronized_standby_slots". 2026-05-29 21:12:03.437 CEST [2271017][postmaster][:0] LOG: background worker "logical replication apply worker" (PID 2273817) exited with exit code 1 2026-05-29 21:12:03.439 CEST [2275785][logical replication apply worker][125/2:0] LOG: logical replication apply worker for subscription "test_sub" has started 2026-05-29 21:12:03.441 CEST [2275785][logical replication apply worker][125/0:0] ERROR: apply worker for subscription "test_sub" could not connect to the publisher: connection to server on socket "/tmp/3gDgOpAsXL/.s.PGSQL.27739" failed: FATAL: the database system is shutting down 2026-05-29 21:12:03.445 CEST [2271017][postmaster][:0] LOG: background worker "logical replication apply worker" (PID 2275785) exited with exit code 1 2026-05-29 21:12:03.763 CEST [2271870][client backend][:0] LOG: disconnection: session time: 0:00:08.180 user=bf database=postgres host=[local] 2026-05-29 21:12:03.767 CEST [2271017][postmaster][:0] LOG: received fast shutdown request I think this can be explained by the fact that walrcv->ready_to_display is set before WalReceiverMain's loop reached. I've reproduced this test failure with: --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -280,6 +280,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len) walrcv->sender_port = 0; walrcv->ready_to_display = true; SpinLockRelease(&walrcv->mutex); +pg_usleep(500 * 1000); /* Establish the connection to the primary for XLOG streaming */ appname = cluster_name[0] ? cluster_name : "walreceiver"; Could you please look if this can be improved? [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2026-05-29%2019%3A03%3A21 Best regards, Alexander