Thread

  1. Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process

    Masahiko Sawada <sawada.mshk@gmail.com> — 2026-05-14T21:47:26Z

    On Thu, May 7, 2026 at 10:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
    >
    > On Fri, May 1, 2026 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
    > >
    > > Dear Sawada-san,
    > >
    > > 01.05.2026 01:08, Masahiko Sawada wrote:
    > >
    > > On Wed, Apr 29, 2026 at 11:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
    > >
    > > I was wondering why that failure is the only one of this kind on the buildfarm
    > > (in last two years, at least), so I've tried to reproduce it on
    > > REL_18_STABLE... and failed.
    > >
    > > Then I've bisected it on the master branch and found (your) commit that
    > > introduced this behavior: 67c20979c from 2025-12-23.
    > >
    > > I've confirmed that this race condition issue is present from v15 to
    > > the master. In v14, we have the procsignal barrier code but don't use
    > > it anywhere. In v18 and older, it could happen when executing DROP
    > > DATABASE, DROP TABLESPACE, etc., whereas in master it could happen
    > > in more cases, as we're using procsignal barriers in more places. In any
    > > case, if a process emits a signal barrier when another process is
    > > between the initialization of slot->pss_barrierGeneration and
    > > slot->pss_pid initialization, the subsequent
    > > WaitForProcSignalBarrier() ends up waiting for that process forever.
    > > So I think the patch should be backpatched to v15. Please review these
    > > patches.
    > >
    > >
    > > Yes, you're right -- it's not reproduced on REL_18_STABLE with
    > > test_oat_hooks, which simply starts a postgres node (as many other tests do),
    > > but when I tried the full test suite with the sleep inserted before
    > > setting pss_pid, I discovered the following vulnerable tests:
    > >
    > > 030_stats_cleanup_replica_standby.log
    > > 2026-05-01 06:00:58.789 UTC [2086579] LOG:  still waiting for backend with PID 2086578 to accept ProcSignalBarrier
    > > 2026-05-01 06:00:58.789 UTC [2086579] CONTEXT:  WAL redo at 0/3410B00 for Database/DROP: dir 1663/16393
    > >
    > > 033_replay_tsp_drops_standby2_FILE_COPY.log
    > > 2026-05-01 05:45:12.969 UTC [2030902] LOG:  still waiting for backend with PID 2030901 to accept ProcSignalBarrier
    > > 2026-05-01 05:45:12.969 UTC [2030902] CONTEXT:  WAL redo at 0/30006A8 for Database/CREATE_FILE_COPY: copy dir 1663/1 to 16384/16389
    > >
    > > 040_standby_failover_slots_sync_publisher.log
    > > 2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl LOG:  still waiting for backend with PID 1538477 to accept ProcSignalBarrier
    > > 2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl STATEMENT:  DROP DATABASE slotsync_test_db;
    > >
    > > 002_compare_backups_pitr1.log
    > > 2026-05-01 04:50:46.638 UTC [1829328] LOG:  still waiting for backend with PID 1829396 to accept ProcSignalBarrier
    > > 2026-05-01 04:50:46.638 UTC [1829328] CONTEXT:  WAL redo at 0/30A1DE0 for Database/DROP: dir 1663/16414
    > >
    > > I've tried my repro with 033_replay_tsp_drops and it really fails on
    > > REL_15_STABLE..master and doesn't fail on REL_14_STABLE.
    > >
    > > FYI, I found that we had a similar report[1] last year, though I'm not
    > > sure it hit the exact same issue.
    > >
    > > Regards,
    > >
    > > [1] https://www.postgresql.org/message-id/CAGQGyDTaVkG3DbTEbtyxZLM48jMZR2BcvTeYBsWLV5HvwSb+2Q@mail.gmail.com
    > >
    > >
    > > Yeah, and probably this one:
    > > https://www.postgresql.org/message-id/EF98BB5B-CA83-443E-B8A6-AA58EE4A06BB%40yandex-team.ru
    > >
    > > By the way, mamba produced the same failure just yesterday:
    > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2026-04-30%2005%3A10%3A39
    > >
    > > # Running: pg_ctl --wait --pgdata /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/t_004_restart_primary_data/pgdata --log /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/log/004_restart_primary.log --options --cluster-name=primary start
    > > waiting for server to start........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... stopped waiting
    > > pg_ctl: server did not start in time
    > > 004_restart_primary.log
    > > 2026-04-30 04:09:04.025 EDT [17814:2] LOG:  still waiting for backend with PID 11506 to accept ProcSignalBarrier
    > > ...
    > > 2026-04-30 04:19:55.336 EDT [17814:132] LOG:  still waiting for backend with PID 11506 to accept ProcSignalBarrier
    > >
    > > The proposed patches make the test pass reliably for me in all affected
    > > branches. Thank you for working on this!
    > >
    >
    > Thank you for checking this issue on stable branches too!
    >
    > Considering that this issue is not very visible in practice and we're
    > going to release new minor versions next week, I'm planning to push
    > these fixes to master and backbranches after the minor releases. That
    > way, we can fix the issue on master relatively soon and have
    > enough time to verify that the fix works well on backbranches.
    >
    
    While reviewing the patches, I realized that it would be better to use
    pg_atomic_write_membarrier_u32() instead of pg_atomic_write_u32() +
    pg_memory_barrier() where available. I've updated the patches for master
    and v18, and slightly adjusted the commit messages.
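
    For readers following along, the interleaving described earlier in the
    thread (a barrier emitted after the slot's generation is initialized but
    before its PID is set) can be illustrated with a minimal, single-threaded
    C11 sketch. This is not the patch and not PostgreSQL code: DemoSlot, the
    demo_* functions, and the generation and PID values are all hypothetical,
    and C11 seq_cst atomics only stand in for the PostgreSQL atomics API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-process signal slot (not the real struct). */
typedef struct DemoSlot
{
    atomic_uint_fast32_t pid;        /* 0 = PID not yet advertised */
    atomic_uint_fast64_t generation; /* last barrier generation absorbed */
} DemoSlot;

static atomic_uint_fast64_t demo_global_generation;

/* Setup step 1: copy the current global generation into the slot. */
static void
demo_init_generation(DemoSlot *slot)
{
    assert(slot != NULL);
    atomic_store(&slot->generation, atomic_load(&demo_global_generation));
}

/* Setup step 2: advertise the PID so that emitters can signal this slot. */
static void
demo_advertise_pid(DemoSlot *slot, uint32_t pid)
{
    atomic_store(&slot->pid, pid);   /* sequentially consistent store */
}

/* Emitter: bump the global generation; signal only if a PID is visible. */
static uint64_t
demo_emit_barrier(DemoSlot *slot, bool *signaled)
{
    uint64_t gen = atomic_fetch_add(&demo_global_generation, 1) + 1;

    *signaled = (atomic_load(&slot->pid) != 0);
    return gen;
}

/* Waiter: blocks for as long as the slot's generation is behind 'gen'. */
static bool
demo_waiter_blocks(DemoSlot *slot, uint64_t gen)
{
    return atomic_load(&slot->generation) < gen;
}

/* Replays the problematic interleaving; returns true if the waiter is stuck. */
bool
demo_race_leaves_waiter_stuck(void)
{
    DemoSlot slot;
    bool     signaled;
    uint64_t gen;

    atomic_init(&slot.pid, 0);
    atomic_init(&slot.generation, UINT64_MAX); /* sketch's "unused" marker */
    atomic_store(&demo_global_generation, 5);  /* arbitrary starting value */

    demo_init_generation(&slot);               /* slot generation is now 5 */
    gen = demo_emit_barrier(&slot, &signaled); /* gen is 6, PID still 0 */
    demo_advertise_pid(&slot, 1234);           /* too late to be signaled */

    /* Nobody signaled the slot, yet the waiter sees it lagging behind. */
    return !signaled && demo_waiter_blocks(&slot, gen);
}
```

    Calling demo_race_leaves_waiter_stuck() replays the three steps in the
    problematic order and reports whether the waiter would be left waiting
    on a slot that was never signaled.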
    
    Regards,
    
    -- 
    Masahiko Sawada
    Amazon Web Services: https://aws.amazon.com