Thread

  1. Re: Fix crash during recovery when redo segment is missing

    Nitin Jadhav <nitinjadhavpostgres@gmail.com> — 2025-12-04T06:36:30Z

    Apologies, I forgot to attach the patch earlier. Please find v2
    attached.
    
    Best Regards,
    Nitin Jadhav
    Azure Database for PostgreSQL
    Microsoft
    
    On Thu, Dec 4, 2025 at 12:01 PM Nitin Jadhav
    <nitinjadhavpostgres@gmail.com> wrote:
    >
    > The patch no longer applied cleanly on master, so I have rebased it and
    > added it to the PG19-4 CommitFest:
    > https://commitfest.postgresql.org/patch/6279/
    > Please review and share your feedback.
    >
    > Best Regards,
    > Nitin Jadhav
    > Azure Database for PostgreSQL
    > Microsoft
    >
    >
    > On Fri, Feb 21, 2025 at 4:29 PM Nitin Jadhav
    > <nitinjadhavpostgres@gmail.com> wrote:
    > >
    > > Hi,
    > >
    > > In [1], Andres reported a bug where PostgreSQL crashes during recovery
    > > if the segment containing the redo pointer does not exist. I have
    > > attempted to fix this issue and am sharing a patch.
    > >
    > > The problem is that PostgreSQL does not PANIC when the redo LSN and
    > > the checkpoint LSN are in separate segments and the file containing
    > > the redo LSN is missing; instead, it crashes later. The root cause is
    > > that the missing redo record is not detected, and reported with a
    > > PANIC, at the start of recovery. Andres has provided a detailed
    > > analysis of the behavior across different settings and versions;
    > > please refer to [1] for more information.
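
As a rough illustration of what "separate segments" means: the segment holding an LSN is the LSN divided by the WAL segment size, and the WAL file name encodes the timeline plus that segment number split into two 8-hex-digit halves. The standalone C sketch below models only this arithmetic. `lsn_to_segno()` and `wal_file_name()` are hypothetical helpers written for illustration (they mirror, but are not, PostgreSQL's `XLByteToSeg` and `XLogFileName` macros), and the LSNs used with them are made-up examples.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint64_t Lsn;      /* hypothetical stand-in for XLogRecPtr */
typedef uint32_t TimeLine; /* hypothetical stand-in for TimeLineID */

/* Segment number that contains a given LSN. */
static uint64_t lsn_to_segno(Lsn lsn, uint64_t seg_size)
{
    return lsn / seg_size;
}

/*
 * Build the 24-hex-digit WAL file name: timeline, then the segment number
 * split into a high part ("log") and a low part ("seg").
 * buf must hold at least 25 bytes.
 */
static void wal_file_name(char *buf, size_t len, TimeLine tli,
                          uint64_t segno, uint64_t seg_size)
{
    /* Segment size must divide 4 GiB evenly (true for power-of-two sizes). */
    assert(seg_size > 0 && UINT64_C(0x100000000) % seg_size == 0);
    uint64_t segs_per_id = UINT64_C(0x100000000) / seg_size;

    snprintf(buf, len, "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
             tli,
             (uint32_t) (segno / segs_per_id),
             (uint32_t) (segno % segs_per_id));
}
```

With the default 16 MB segment size, a redo pointer at 0/3FFFF28 lies in segment 3 (file 000000010000000000000003 on timeline 1), while a checkpoint record at 0/4000028 lies in segment 4 (000000010000000000000004). If only the second file is present, the checkpoint record can still be read but the redo record cannot, which is the situation described above.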
    > >
    > > The patch fixes this by verifying, in InitWalRecovery(), that the redo
    > > location exists as soon as the checkpoint record has been read
    > > successfully. This prevents control from reaching PerformWalRecovery()
    > > unless the WAL file containing the redo record exists.
    > >
    > > A new test script, 044_redo_segment_missing.pl, has been added to
    > > validate this. To place the redo record in a different WAL file from
    > > the checkpoint record, the test waits for the checkpoint start message
    > > and then issues pg_switch_wal(), which should happen before the
    > > checkpoint completes. It then crashes the server; on restart, the
    > > server should log an appropriate error indicating that it could not
    > > find the redo location. Please let me know if there is a better way to
    > > reproduce this behavior. I have tested and verified this with the
    > > various scenarios Andres pointed out in [1].
    > >
    > > Please note that this patch does not address error checking in
    > > StartupXLOG(), CreateCheckPoint(), etc., nor does it attempt to clean
    > > up existing code.
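
The idea of failing early, before any replay starts, can be modelled in a few lines of C. This is a simplified, hypothetical sketch and not the patch itself: rather than attempting to read the redo record, it only asks whether the segment that should contain the redo pointer is available, through a caller-supplied callback. `check_startup_records()`, `segment_in_list()`, `SegmentList` and the `StartupCheck` codes are invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn; /* hypothetical stand-in for XLogRecPtr */

/* Hypothetical outcome codes for the startup check. */
typedef enum
{
    STARTUP_OK = 0,
    STARTUP_CHECKPOINT_MISSING,
    STARTUP_REDO_MISSING
} StartupCheck;

/* Caller-supplied test for whether a WAL segment is available. */
typedef bool (*SegmentExistsFn) (uint64_t segno, void *ctx);

/* Simple availability source: a fixed list of present segment numbers. */
typedef struct
{
    const uint64_t *segnos;
    size_t count;
} SegmentList;

static bool segment_in_list(uint64_t segno, void *ctx)
{
    const SegmentList *list = ctx;

    for (size_t i = 0; i < list->count; i++)
        if (list->segnos[i] == segno)
            return true;
    return false;
}

/*
 * Decide, before replay starts, whether both the checkpoint record and the
 * redo pointer it refers back to are reachable. The redo pointer is at or
 * before the checkpoint record, so it may sit in an earlier segment.
 */
static StartupCheck check_startup_records(Lsn checkpoint_lsn, Lsn redo_lsn,
                                          uint64_t seg_size,
                                          SegmentExistsFn exists, void *ctx)
{
    assert(seg_size > 0 && redo_lsn <= checkpoint_lsn);

    uint64_t ckpt_seg = checkpoint_lsn / seg_size;
    uint64_t redo_seg = redo_lsn / seg_size;

    if (!exists(ckpt_seg, ctx))
        return STARTUP_CHECKPOINT_MISSING;

    /* Same segment: it was already found by the check above. */
    if (redo_seg != ckpt_seg && !exists(redo_seg, ctx))
        return STARTUP_REDO_MISSING; /* report now, not later during replay */

    return STARTUP_OK;
}
```

Passing the existence check as a callback keeps the decision logic testable without a real pg_wal directory. A real implementation would more likely try to read the redo record itself, which also catches a segment that is present but does not contain the expected record.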
    > >
    > > Attaching the patch. Please review and share your feedback. Thanks to
    > > Andres for spotting the bug and providing the detailed report [1].
    > >
    > > [1]: https://www.postgresql.org/message-id/20231023232145.cmqe73stvivsmlhs%40awork3.anarazel.de
    > >
    > > Best Regards,
    > > Nitin Jadhav
    > > Azure Database for PostgreSQL
    > > Microsoft