[PATCH] Fix fragile walreceiver test.
Bryan Green <dbryan.green@gmail.com> — 2025-11-05T06:03:29Z
The recovery/004_timeline_switch test has been failing for me on Windows. The test is wrong.

The test does this:

    $node_standby_2->restart;
    # ... timeline switch happens ...
    ok( !$node_standby_2->log_contains(
            "FATAL: .* terminating walreceiver process due to administrator command"
        ),
        'WAL receiver should not be stopped across timeline jumps');

Problem: restart() kills the walreceiver (as it should), which writes that exact FATAL message to the log. The test then searches the log and finds it.

The test has a comment claiming "a new log file is used on node restart". TAP tests use pg_ctl with a fixed filename that gets reused across restarts. No log rotation.

I added logging to confirm what's actually happening. The walreceiver works correctly - same PID handles both timelines:

    2025-11-04 23:05:28.539 CST walreceiver[83824] LOG: started streaming WAL from primary at 0/03000000 on timeline 1
    2025-11-04 23:05:28.543 CST startup[42764] LOG: new target timeline is 2
    2025-11-04 23:05:28.544 CST walreceiver[83824] LOG: restarted WAL streaming at 0/03000000 on timeline 2

That's PID 83824 throughout. Works fine.

Earlier in the same log, from the restart:

    2025-11-04 23:05:27.261 CST walreceiver[52440] FATAL: terminating walreceiver process due to administrator command

Different PID (52440), expected shutdown. This is what the test finds.

The fix is obvious: check that the walreceiver PID stays constant. That's what we actually care about anyway.

This matters because changes to I/O behavior elsewhere in the code can make this test fail spuriously. I hit it while working on O_CLOEXEC handling for Windows.

Patch attached.

-- 
Bryan Green
EDB: https://www.enterprisedb.com
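
For illustration only, the PID-constancy idea can be sketched against the log excerpts quoted above. This is a standalone Python sketch, not the Perl TAP code in the attached patch. The function name `same_walreceiver_across_switch` and the regular expressions are hypothetical, and the line format (`walreceiver[PID] LOG: ...`) is assumed to match the excerpts, which depend on the extra logging described above.

```python
import re

# Line shapes assumed from the excerpts quoted in this message; a different
# log_line_prefix would need different patterns.
STARTED = re.compile(
    r"walreceiver\[(\d+)\] LOG:\s+started streaming WAL from primary at \S+ on timeline \d+")
RESTARTED = re.compile(
    r"walreceiver\[(\d+)\] LOG:\s+restarted WAL streaming at \S+ on timeline \d+")

# Excerpts from this message, in the order they appear in the log file.
SAMPLE_LOG = """\
2025-11-04 23:05:27.261 CST walreceiver[52440] FATAL: terminating walreceiver process due to administrator command
2025-11-04 23:05:28.539 CST walreceiver[83824] LOG: started streaming WAL from primary at 0/03000000 on timeline 1
2025-11-04 23:05:28.543 CST startup[42764] LOG: new target timeline is 2
2025-11-04 23:05:28.544 CST walreceiver[83824] LOG: restarted WAL streaming at 0/03000000 on timeline 2
"""


def same_walreceiver_across_switch(log_text):
    """Return True if the PID that most recently started streaming is also
    the PID that restarted streaming on the new timeline."""
    current_pid = None
    for line in log_text.splitlines():
        started = STARTED.search(line)
        if started:
            current_pid = started.group(1)
            continue
        restarted = RESTARTED.search(line)
        if restarted:
            return current_pid is not None and restarted.group(1) == current_pid
    # No "restarted WAL streaming" line was found, so nothing was verified.
    return False
```

The FATAL line from PID 52440 is ignored here because it matches neither pattern; only the process that started streaming on timeline 1 and the process that resumed on timeline 2 are compared.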