Thread

  1. Startup PANIC on standby promotion due to zero-filled WAL segment

    Alena Vinter <dlaaren8@gmail.com> — 2025-12-23T07:02:15Z

    Hi hackers,
    
    During replication, when a new timeline is detected, PostgreSQL creates a
    new zero-filled WAL segment on the new timeline instead of copying the
    partial segment from the previous timeline. This diverges from the behavior
    during timeline switches at startup.
    This discrepancy can cause problems — especially under slow replication.
    Consider the following scenario:
    
     last record in TLI |     | timeline switch point
                        v     v
    |-----TLI N---------------|0000000000000000000
                              |
    |-----TLI N+1--00000000000|0000000000000000000
    
    If a standby is promoted before the WAL segment containing the last record
    of the previous timeline has been fully copied to the new timeline, startup
    may fail. We have observed this in production, where recovery aborts with
    "PANIC: invalid magic number 0000 in WAL segment ..."
    
    I’ve attached:
    * a patch and a TAP test that reproduce the issue;
    * a draft patch that, on timeline switch during recovery, copies the
    remainder of the current WAL segment from the old timeline — matching the
    behavior used after crash recovery at startup.
    All existing regression tests pass with the patch applied, but I plan to
    add more targeted test cases.
    
    I’d appreciate your feedback. In particular:
    * Is this behavior (not copying the segment during replication) intentional?
    * Are there edge cases I might be overlooking?
    
    ---
    Best wishes,
    Alena Vinter
    
  2. Re: Startup PANIC on standby promotion due to zero-filled WAL segment

    Michael Paquier <michael@paquier.xyz> — 2025-12-23T08:38:13Z

    On Tue, Dec 23, 2025 at 02:02:15PM +0700, Alena Vinter wrote:
    > If a standby is promoted before the WAL segment containing the last record
    > of the previous timeline has been fully copied to the new timeline, startup
    > may fail. We have observed this in production, where recovery aborts with
    > "PANIC: invalid magic number 0000 in WAL segment ..."
    > 
    > I’ve attached:
    > * a patch and a TAP test that reproduce the issue;
    > * a draft patch that, on timeline switch during recovery, copies the
    > remainder of the current WAL segment from the old timeline — matching the
    > behavior used after crash recovery at startup.
    > All existing regression tests pass with the patch applied, but I plan to
    > add more targeted test cases.
    > 
    > I’d appreciate your feedback. In particular:
    > * Is this behavior (not copying the segment during replication) intentional?
    > * Are there edge cases I might be overlooking?
    
    The failure pattern is different in v18/master vs the rest of the
    world.  v17 and older branches just wait for the standby node to start
    at the end with your test.  Anyway, the problem is the same as far as
    I can see, with the test generating the following post-patch:
    2025-12-23 17:08:37.494 JST startup[32689] LOG:  unexpected pageaddr 
    0/0305E000 in WAL segment 000000020000000000000003, LSN 0/03060000,
    offset 393216
    2025-12-23 17:08:37.494 JST startup[32689] FATAL:  according to
    history file, WAL location 0/0305FFD0 belongs to timeline 1, but
    previous recovered WAL file came from timeline 2
    
    This would be right, because you are losing the records of the first
    INSERT and TLI 1 diverges on the primary.  Now, the reason why you are
    losing these records is because of the way the test is set up.  fsync
    is off on the primary, hence you are forcing what looks like a
    corruption scenario by forcing a node to be promoted with some of its
    WAL records missing.  I am unconvinced with the problem the way you
    are showing it.  This primarily shows that setting fsync=off is a bad
    idea to force a divergence in timelines, with the segment missing
    while the records should be there.
    
    Perhaps it is a matter of proving your point in a cleaner way?  I am
    open to your potential arguments, but I don't see something here based
    on the test you are sending; I am just seeing something that should
    not be done.
    
    I am not asking how you are able to see these failures in your
    Postgres setups, but perhaps there is something in your HA flow that
    you should not do, especially if you do the same things as in this
    test..  Just saying.
    --
    Michael
    
  3. Re: Startup PANIC on standby promotion due to zero-filled WAL segment

    Alena Vinter <dlaaren8@gmail.com> — 2025-12-23T09:33:30Z

    Hi Michael,
    
    Thanks for the review. To clarify: TLI 1 does not diverge — it is fully
    replicated to the standby before the timeline switch. The test then
    intentionally slows down replication on TLI 2 (e.g., by delaying WAL
    shipping), reproducing the scenario I illustrated. As far as I’m aware,
    `fsync` is `on` by default, and the test does not modify it — so no WAL
    records are lost due to unsafe flushing.
    
    The core issue is that the new timeline’s segment is zero-initialized
    instead of copying the same segment from the previous timeline (as done in
    crash-recovery startup).  As a result, startup cannot finish recovery due
    to non-replicated end of WAL causing failures like “invalid magic number”.
    
    ---
    Alena Vinter
    
  4. Re: Startup PANIC on standby promotion due to zero-filled WAL segment

    Michael Paquier <michael@paquier.xyz> — 2025-12-23T09:47:37Z

    On Tue, Dec 23, 2025 at 04:33:30PM +0700, Alena Vinter wrote:
    > Thanks for the review. To clarify: TLI 1 does not diverge — it is fully
    > replicated to the standby before the timeline switch. The test then
    > intentionally slows down replication on TLI 2 (e.g., by delaying WAL
    > shipping), reproducing the scenario I illustrated. As far as I’m aware,
    > `fsync` is `on` by default, and the test does not modify it — so no WAL
    > records are lost due to unsafe flushing.
    
    Don't think so, based on what is in the tree:
    $ git grep "fsync = " -- *.pm
    src/test/perl/PostgreSQL/Test/Cluster.pm:   print $conf "fsync = off\n";
    
    > The core issue is that the new timeline’s segment is zero-initialized
    > instead of copying the same segment from the previous timeline (as done in
    > crash-recovery startup).  As a result, startup cannot finish recovery due
    > to non-replicated end of WAL causing failures like “invalid magic number”.
    
    The following addition to your proposed test is telling me an entirely
     different story, making the test pass as the records of TLI 1 are
     around:
     my $node_primary = PostgreSQL::Test::Cluster->new('primary');
     $node_primary->init(allows_streaming => 1);
    +#$node_primary->append_conf('postgresql.conf', 'fsync=on');
     $node_primary->start;
    --
    Michael
    
  5. Re: Startup PANIC on standby promotion due to zero-filled WAL segment

    Alena Vinter <dlaaren8@gmail.com> — 2025-12-23T10:24:56Z

    > Don't think so, based on what is in the tree:
    > $ git grep "fsync = " -- *.pm
    > src/test/perl/PostgreSQL/Test/Cluster.pm:   print $conf "fsync = off\n";
    
    Oh, didn’t know it's in the `init` function — I apologize.
    
    > The following addition to your proposed test is telling me an entirely
    > different story, making the test pass as the records of TLI 1 are
    > around:
    > my $node_primary = PostgreSQL::Test::Cluster->new('primary');
    > $node_primary->init(allows_streaming => 1);
    > +#$node_primary->append_conf('postgresql.conf', 'fsync=on');
    > $node_primary->start;
    
    I've tried this way, and yes, this works fine. Now I'm really interested in
    how this parameter prevents the scenario with startup panic. Thank you very
    much!!!
    But I’m still unclear why the segment isn’t copied during replication, as
    it is in crash recovery (I’d prefer uniform behavior across both
    paths). Could you help me figure out the answer to that question?
    ---
    Alena Vinter
    
  6. Re: Startup PANIC on standby promotion due to zero-filled WAL segment

    Alena Vinter <dlaaren8@gmail.com> — 2025-12-23T13:49:20Z

    Michael, I left my pipeline running the TAP test until it failed — and
    after some time, it did fail. I then changed the test slightly, and simply
    by adding a short sleep, I was able to reproduce the same failure more
    reliably. Moreover, attempting to restart the standby server after a failed
    promotion triggers startup PANIC again.
    
  7. Re: Startup PANIC on standby promotion due to zero-filled WAL segment

    Michael Paquier <michael@paquier.xyz> — 2025-12-24T05:55:58Z

    On Tue, Dec 23, 2025 at 08:49:20PM +0700, Alena Vinter wrote:
    > Michael, I left my pipeline running the TAP test until it failed — and
    > after some time, it did fail. I then changed the test slightly, and simply
    > by adding a short sleep, I was able to reproduce the same failure more
    > reliably. Moreover, attempting to restart the standby server after a failed
    > promotion triggers startup PANIC again.
    
    This is a better argument, yes.  ProcessInterrupts() is just a way to
    force the WAL receiver to do nothing.  We could see the same if a WAL
    receiver fails a palloc() or an allocation repeatedly, shutting it
    down before it is able to stream any changes, and we could also have a
    test with an injection point that forces an error based on a specific
    specific timeline number, or something like that.
    
    Hmm.  Like in the case where the WAL receiver is not able to connect
    to a primary, shouldn't we prevent the promotion request to process at
    all?  So while you have your finger on something here, I don't think
    that your suggested solution is a good nor correct one: it sounds to
    me that the startup process assumes that the WAL receiver is doing
    some work, then the promotion request comes it and we assume that it
    is OK to process through the promotion while we should obviously not
    do so, because the WAL receiver has streamed zero contents from TLI 2.
    It sounds to me that we should let the startup process know that
    something is wrong with the WAL receiver, meaning that it may be up to
    the WAL receiver to save some information in shared memory so as the
    startup process should not allow the promotion to go through at all.
    --
    Michael
    
  8. Re: Startup PANIC on standby promotion due to zero-filled WAL segment

    Alena Vinter <dlaaren8@gmail.com> — 2025-12-25T08:47:05Z

    I like the idea of preventing promotion to avoid such failures -- it sounds
    reasonable.
    
    However, we still have the problem: if the standby is stopped with
    non-replicated TLI 2, it will fail to start:
    "FATAL: according to history file, WAL location Y belongs to timeline X,
    but previous recovered WAL file came from timeline X+1".
    This happens even if no promotion is attempted — just a plain restart of
    the standby. So the issue isn’t only about when to allow promotion.
    
    Regarding my proposed solution: could you clarify why it isn’t correct? I’d
    appreciate more detail so I can address your concerns.
    
    ---
    Alena Vinter