Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
From: Masahiko Sawada <sawada.mshk@gmail.com>
To: Alexander Lakhin <exclusion@gmail.com>
Cc: Andres Freund <andres@anarazel.de>, Matthias van de Meent <boekewurm+postgres@gmail.com>,
Thomas Munro <thomas.munro@gmail.com>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
Heikki Linnakangas <hlinnaka@iki.fi>, Andrey Borodin <x4mmm@yandex-team.ru>
Date: 2026-05-14T21:47:26Z
Lists: pgsql-hackers
Attachments
- REL17_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch (text/x-patch)
- REL15_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch (text/x-patch)
- master_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch (text/x-patch)
- REL16_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch (text/x-patch)
- REL18_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch (text/x-patch)
On Thu, May 7, 2026 at 10:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, May 1, 2026 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> >
> > Dear Sawada-san,
> >
> > 01.05.2026 01:08, Masahiko Sawada wrote:
> >
> > On Wed, Apr 29, 2026 at 11:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> >
> > I was wondering why is that failure the only one of this kind on buildfarm
> > (in last two years, at least), so I've tried to reproduce it on
> > REL_18_STABLE... and failed.
> >
> > Then I've bisected it on the master branch and found (your) commit that
> > introduced this behavior: 67c20979c from 2025-12-23.
> >
> > I've confirmed that this race condition issue is present from v15 to
> > the master. In v14, we have the procsignal barrier code but don't use
> > it anywhere. In v18 or older, it could happen when executing DROP
> > DATABASE, DROP TABLESPACE etc, whereas in the master, it could happen
> > in more cases as we're using procsignal barrier more places. In any
> > case, if a process emits a signal barrier when another process is
> > between the initialization of slot->pss_barrierGeneration and
> > slot->pss_pid initialization, the subsequent
> > WaitForProcSignalBarrier() ends up waiting for that process forever.
> > So I think the patch should be backpatched to v15. Please review these
> > patches.
> >
> >
> > Yes, you're right -- it's not reproduced on REL_18_STABLE with
> > test_oat_hooks, which simply starts postgres node (as many other tests),
> > but when I tried the full test suite with the sleep inserted before
> > setting pss_pid, I discovered the following vulnerable tests:
> >
> > 030_stats_cleanup_replica_standby.log
> > 2026-05-01 06:00:58.789 UTC [2086579] LOG: still waiting for backend with PID 2086578 to accept ProcSignalBarrier
> > 2026-05-01 06:00:58.789 UTC [2086579] CONTEXT: WAL redo at 0/3410B00 for Database/DROP: dir 1663/16393
> >
> > 033_replay_tsp_drops_standby2_FILE_COPY.log
> > 2026-05-01 05:45:12.969 UTC [2030902] LOG: still waiting for backend with PID 2030901 to accept ProcSignalBarrier
> > 2026-05-01 05:45:12.969 UTC [2030902] CONTEXT: WAL redo at 0/30006A8 for Database/CREATE_FILE_COPY: copy dir 1663/1 to 16384/16389
> >
> > 040_standby_failover_slots_sync_publisher.log
> > 2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl LOG: still waiting for backend with PID 1538477 to accept ProcSignalBarrier
> > 2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl STATEMENT: DROP DATABASE slotsync_test_db;
> >
> > 002_compare_backups_pitr1.log
> > 2026-05-01 04:50:46.638 UTC [1829328] LOG: still waiting for backend with PID 1829396 to accept ProcSignalBarrier
> > 2026-05-01 04:50:46.638 UTC [1829328] CONTEXT: WAL redo at 0/30A1DE0 for Database/DROP: dir 1663/16414
> >
> > I've tried my repro with 033_replay_tsp_drops and it really fails on
> > REL_15_STABLE..master and doesn't fail on REL_14_STABLE.
> >
> > FYI I found that we had a similar report[1] last year, I'm not sure
> > it hit the exact same issue, though.
> >
> > Regards,
> >
> > [1] https://www.postgresql.org/message-id/CAGQGyDTaVkG3DbTEbtyxZLM48jMZR2BcvTeYBsWLV5HvwSb+2Q@mail.gmail.com
> >
> >
> > Yeah, and probably this one:
> > https://www.postgresql.org/message-id/EF98BB5B-CA83-443E-B8A6-AA58EE4A06BB%40yandex-team.ru
> >
> > By the way, mamba produced the same failure just yesterday:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2026-04-30%2005%3A10%3A39
> >
> > # Running: pg_ctl --wait --pgdata /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/t_004_restart_primary_data/pgdata --log /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/log/004_restart_primary.log --options --cluster-name=primary start
> > waiting for server to start.......... [...] .......... stopped waiting
> > pg_ctl: server did not start in time
> > 004_restart_primary.log
> > 2026-04-30 04:09:04.025 EDT [17814:2] LOG: still waiting for backend with PID 11506 to accept ProcSignalBarrier
> > ...
> > 2026-04-30 04:19:55.336 EDT [17814:132] LOG: still waiting for backend with PID 11506 to accept ProcSignalBarrier
> >
> > The proposed patches make the test pass reliably for me in all affected
> > branches. Thank you for working on this!
> >
>
> Thank you for checking this issue on stable branches too!
>
> Considering that this issue is not very visible in practice and we're
> going to release new minor versions next week, I'm planning to push
> these fixes to master and backbranches after the minor releases. That
> way, we can fix the issue on the master relatively soon and have
> enough time to verify that fix works well on backbranches.
>

While reviewing the patches, I realized that it would be better to use
pg_atomic_write_membarrier_u32() instead of pg_atomic_write_u32() +
pg_memory_barrier() where available. I've updated the patches for master
and v18 accordingly, and slightly revised the commit messages.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
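
For readers less familiar with the two forms mentioned in this message, here is a minimal sketch of how they differ in shape. It is written with C11 <stdatomic.h> rather than PostgreSQL's atomics API, so it compiles on its own. DemoSlot and both publish_pid_* helpers are hypothetical names made up for illustration, and the memory orderings are only a rough analogy for pg_atomic_write_u32() + pg_memory_barrier() versus pg_atomic_write_membarrier_u32(). This is not the code in the attached patches.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a procsignal slot. The field names only echo
 * pss_pid and pss_barrierGeneration; this is not PostgreSQL's struct.
 */
typedef struct DemoSlot
{
    _Atomic uint32_t pid;         /* 0 means "slot not in use" (assumption) */
    _Atomic uint64_t generation;  /* last barrier generation seen */
} DemoSlot;

/*
 * Two-step form: a relaxed atomic store followed by a separate full fence.
 * Rough analogy for pg_atomic_write_u32() + pg_memory_barrier().
 */
static inline void
publish_pid_two_step(DemoSlot *slot, uint32_t pid)
{
    assert(pid != 0);
    atomic_store_explicit(&slot->pid, pid, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
}

/*
 * One-step form: the store itself carries the ordering, so the caller
 * cannot forget or misplace the fence. Rough analogy for
 * pg_atomic_write_membarrier_u32().
 */
static inline void
publish_pid_one_step(DemoSlot *slot, uint32_t pid)
{
    assert(pid != 0);
    atomic_store_explicit(&slot->pid, pid, memory_order_seq_cst);
}
```

Both helpers leave the same value in the slot. The difference is only in how the ordering guarantee is expressed, which is why the one-step form reads as the tidier choice where it is available.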