Thread

Commits

  1. Generate GUC tables from .dat file

  2. Skip WAL recycling and preallocation during archive recovery.

  3. Fix scenario where streaming standby gets stuck at a continuation record.

  1. Unnecessary delay in streaming replication due to replay lag

    Asim R P <apraveen@pivotal.io> — 2020-01-17T04:04:05Z

    Hi
    
    Standby does not start walreceiver process until startup process
    finishes WAL replay.  The more WAL there is to replay, longer is the
    delay in starting streaming replication.  If replication connection is
    temporarily disconnected, this delay becomes a major problem and we
    are proposing a solution to avoid the delay.
    
    WAL replay is likely to fall behind when master is processing
    write-heavy workload, because WAL is generated by concurrently running
    backends on master while only one startup process on standby replays WAL
    records in sequence as new WAL is received from master.
    
    Replication connection between walsender and walreceiver may break due
    to reasons such as transient network issue, standby going through
    restart, etc.  The delay in resuming replication connection leads to
    lack of high availability - only one copy of WAL is available during
    this period.
    
    The problem worsens when the replication is configured to be
    synchronous.  Commits on master must wait until the WAL replay is
    finished on standby, walreceiver is then started and it confirms flush
    of WAL up to the commit LSN.  If synchronous_commit GUC is set to
    remote_write, this behavior is equivalent to tacitly changing it to
    remote_apply until the replication connection is re-established!
    
    Has anyone encountered such a problem with streaming replication?
    
    We propose to address this by starting walreceiver without waiting for
    startup process to finish replay of WAL.  Please see attached
    patchset.  It can be summarized as follows:
    
        0001 - TAP test to demonstrate the problem.
    
        0002 - The standby startup sequence is changed such that
               walreceiver is started by startup process before it begins
               to replay WAL.
    
        0003 - Postmaster starts walreceiver if it finds that a
               walreceiver process is no longer running and the state
               indicates that it is operating as a standby.
    
    This is a POC, we are looking for early feedback on whether the
    problem is worth solving and if it makes sense to solve it along this
    route.
    
    Hao and Asim
    
  2. Re: Unnecessary delay in streaming replication due to replay lag

    Michael Paquier <michael@paquier.xyz> — 2020-01-17T05:37:56Z

    On Fri, Jan 17, 2020 at 09:34:05AM +0530, Asim R P wrote:
    > Standby does not start walreceiver process until startup process
    > finishes WAL replay.  The more WAL there is to replay, longer is the
    > delay in starting streaming replication.  If replication connection is
    > temporarily disconnected, this delay becomes a major problem and we
    > are proposing a solution to avoid the delay.
    
    Yeah, that's documented:
    https://www.postgresql.org/message-id/20190910062325.GD11737@paquier.xyz
    
    > We propose to address this by starting walreceiver without waiting for
    > startup process to finish replay of WAL.  Please see attached
    > patchset.  It can be summarized as follows:
    > 
    >     0001 - TAP test to demonstrate the problem.
    
    There is no real need for debug_replay_delay because we have already
    recovery_min_apply_delay, no?  That would count only after consistency
    has been reached, and only for COMMIT records, but your test would be
    enough with that.
    
    >     0002 - The standby startup sequence is changed such that
    >            walreceiver is started by startup process before it begins
    >            to replay WAL.
    
    See below.
    
    >     0003 - Postmaster starts walreceiver if it finds that a
    >            walreceiver process is no longer running and the state
    >            indicates that it is operating as a standby.
    
    I have not checked in details, but I smell some race conditions
    between the postmaster and the startup process here.
    
    > This is a POC, we are looking for early feedback on whether the
    > problem is worth solving and if it makes sense to solve it along this
    > route.
    
    You are not the first person interested in this problem, we have a
    patch registered in this CF to control the timing when a WAL receiver
    is started at recovery:
    https://commitfest.postgresql.org/26/1995/
    https://www.postgresql.org/message-id/b271715f-f945-35b0-d1f5-c9de3e56f65e@postgrespro.ru
    
    I am pretty sure that we should not change the default behavior to
    start the WAL receiver after replaying everything from the archives to
    avoid copying some WAL segments for nothing, so being able to use a
    GUC switch should be the way to go, and Konstantin's latest patch was
    using this approach.  Your patch 0002 adds visibly a third mode: start
    immediately on top of the two ones already proposed:
    - Start after replaying all WAL available locally and in the
    archives.
    - Start after reaching a consistent point.
    --
    Michael
    
  3. Re: Unnecessary delay in streaming replication due to replay lag

    Asim R P <apraveen@pivotal.io> — 2020-01-17T13:00:58Z

    On Fri, Jan 17, 2020 at 11:08 AM Michael Paquier <michael@paquier.xyz>
    wrote:
    >
    > On Fri, Jan 17, 2020 at 09:34:05AM +0530, Asim R P wrote:
    > >
    > >     0001 - TAP test to demonstrate the problem.
    >
    > There is no real need for debug_replay_delay because we have already
    > recovery_min_apply_delay, no?  That would count only after consistency
    > has been reached, and only for COMMIT records, but your test would be
    > enough with that.
    >
    
    Indeed, we didn't know about recovery_min_apply_delay.  Thank you for
    the suggestion, the updated test is attached.
    
    >
    > > This is a POC, we are looking for early feedback on whether the
    > > problem is worth solving and if it makes sense to solve it along this
    > > route.
    >
    > You are not the first person interested in this problem, we have a
    > patch registered in this CF to control the timing when a WAL receiver
    > is started at recovery:
    > https://commitfest.postgresql.org/26/1995/
    > https://www.postgresql.org/message-id/b271715f-f945-35b0-d1f5-c9de3e56f65e@postgrespro.ru
    >
    
    Great to know about this patch and the discussion.  The test case and
    the part that saves next start point in control file from our patch
    can be combined with Konstantin's patch to solve this problem.  Let me
    work on that.
    
    > I am pretty sure that we should not change the default behavior to
    > start the WAL receiver after replaying everything from the archives to
    > avoid copying some WAL segments for nothing, so being able to use a
    > GUC switch should be the way to go, and Konstantin's latest patch was
    > using this approach.  Your patch 0002 adds visibly a third mode: start
    > immediately on top of the two ones already proposed:
    > - Start after replaying all WAL available locally and in the
    > archives.
    > - Start after reaching a consistent point.
    
    Consistent point should be reached fairly quickly, in spite of large
    replay lag.  Min recovery point is updated during XLOG flush and that
    happens when a commit record is replayed.  Commits should occur
    frequently in the WAL stream.  So I do not see much value in starting
    WAL receiver immediately as compared to starting it after reaching a
    consistent point.  Does that make sense?
    
    That said, is there anything obviously wrong with starting WAL receiver
    immediately, even before reaching consistent state?  A consequence is
    that WAL receiver may overwrite a WAL segment while startup process is
    reading and replaying WAL from it.  But that doesn't appear to be a
    problem because the overwrite should happen with identical content as
    before.
    
    Asim
    
  4. Re: Unnecessary delay in streaming replication due to replay lag

    Asim Praveen <pasim@vmware.com> — 2020-08-09T05:54:32Z

    I would like to revive this thread by submitting a rebased patch to start streaming replication without waiting for startup process to finish replaying all WAL.  The start LSN for streaming is determined to be the LSN that points to the beginning of the most recently flushed WAL segment.
    
    The patch passes tests under src/test/recovery and top level “make check”.
    
    
  5. Re: Unnecessary delay in streaming replication due to replay lag

    Michael Paquier <michael@paquier.xyz> — 2020-08-09T08:41:15Z

    On Sun, Aug 09, 2020 at 05:54:32AM +0000, Asim Praveen wrote:
    > I would like to revive this thread by submitting a rebased patch to
    > start streaming replication without waiting for startup process to
    > finish replaying all WAL.  The start LSN for streaming is determined
    > to be the LSN that points to the beginning of the most recently
    > flushed WAL segment.
    > 
    > The patch passes tests under src/test/recovery and top level “make check”.
    
    I have not really looked at the proposed patch, but it would be good
    to have some documentation.
    --
    Michael
    
  6. Re: Unnecessary delay in streaming replication due to replay lag

    Asim Praveen <pasim@vmware.com> — 2020-08-10T04:31:05Z

    
    > On 09-Aug-2020, at 2:11 PM, Michael Paquier <michael@paquier.xyz> wrote:
    > 
    > I have not really looked at the proposed patch, but it would be good
    > to have some documentation.
    > 
    
    Ah, right.  The basic idea is to reuse the logic to allow read-only connections to also start WAL streaming.  The patch borrows a new GUC “wal_receiver_start_condition” introduced by another patch alluded to upthread.  It affects when to start WAL receiver process on a standby.  By default, the GUC is set to “replay”, which means no change in current behavior - WAL receiver is started only after replaying all WAL already available in pg_wal.  When set to “consistency”, WAL receiver process is started earlier, as soon as consistent state is reached during WAL replay.
    
    The LSN where to start streaming from is determined to be the LSN that points at the beginning of the WAL segment file that was most recently flushed in pg_wal.  To find the most recently flushed WAL segment, first blocks of all WAL segment files in pg_wal, starting from the segment that contains currently replayed record, are inspected.  The search stops when a first page with no valid header is found.
    
    The benefits of starting WAL receiver early are mentioned upthread but allow me to reiterate: as WAL streaming starts, any commits that are waiting for synchronous replication on the master are unblocked.  The benefit of this is apparent in situations where significant replay lag has been built up and the replication is configured to be synchronous.
    
    Asim
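
The start-point rule described above (stream from the beginning of the most recently flushed segment) comes down to simple arithmetic on the segment file name. The following is a minimal sketch of that arithmetic only, not code from the patch: the `sketch_*` names are hypothetical, and a 16 MB segment size (the default, configurable at initdb time) is assumed.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed segment size: 16 MB is the default, but it is configurable. */
#define SKETCH_WAL_SEGSZ ((uint64_t) 16 * 1024 * 1024)

/*
 * Parse a 24-hex-digit WAL segment file name (TTTTTTTTXXXXXXXXYYYYYYYY)
 * into a timeline ID and a segment number.  Returns 0 on success and -1
 * if the name is malformed.  Hypothetical helper, for illustration only.
 */
int
sketch_parse_wal_filename(const char *fname, uint64_t segsz,
                          uint32_t *tli, uint64_t *segno)
{
    unsigned int t, log, seg;
    uint64_t     segs_per_xlogid;

    assert(fname != NULL && segsz > 0);

    if (strlen(fname) != 24 || strspn(fname, "0123456789ABCDEF") != 24)
        return -1;
    if (sscanf(fname, "%8X%8X%8X", &t, &log, &seg) != 3)
        return -1;

    /* Number of segments covered by one 4 GB "xlogid" unit. */
    segs_per_xlogid = UINT64_C(0x100000000) / segsz;

    *tli = (uint32_t) t;
    *segno = (uint64_t) log * segs_per_xlogid + seg;
    return 0;
}

/* LSN of the first byte of the given segment. */
uint64_t
sketch_segment_start_lsn(uint64_t segno, uint64_t segsz)
{
    return segno * segsz;
}

/* Format an LSN in the usual "hi/lo" hexadecimal notation. */
void
sketch_format_lsn(uint64_t lsn, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32,
             (uint32_t) (lsn >> 32), (uint32_t) lsn);
}
```

With these assumptions, the file 000000010000000000000005 is on timeline 1, is segment 5, and starts at LSN 0/5000000.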
  7. Re: Unnecessary delay in streaming replication due to replay lag

    Masahiko Sawada <masahiko.sawada@2ndquadrant.com> — 2020-08-10T06:57:33Z

    On Sun, 9 Aug 2020 at 14:54, Asim Praveen <pasim@vmware.com> wrote:
    >
    > I would like to revive this thread by submitting a rebased patch to start streaming replication without waiting for startup process to finish replaying all WAL.  The start LSN for streaming is determined to be the LSN that points to the beginning of the most recently flushed WAL segment.
    >
    > The patch passes tests under src/test/recovery and top level “make check”.
    >
    
    The patch can be applied cleanly to the current HEAD but I got the
    error on building the code with this patch:
    
    xlog.c: In function 'StartupXLOG':
    xlog.c:7315:6: error: too few arguments to function 'RequestXLogStreaming'
     7315 |      RequestXLogStreaming(ThisTimeLineID,
          |      ^~~~~~~~~~~~~~~~~~~~
    In file included from xlog.c:59:
    ../../../../src/include/replication/walreceiver.h:463:13: note: declared here
      463 | extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
          |             ^~~~~~~~~~~~~~~~~~~~
    
    cfbot also complains about this.
    
    Could you please update the patch?
    
    Regards,
    
    --
    Masahiko Sawada            http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
    
    
    
    
  8. Re: Unnecessary delay in streaming replication due to replay lag

    Asim Praveen <pasim@vmware.com> — 2020-08-10T08:53:34Z

    
    > On 10-Aug-2020, at 12:27 PM, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
    > 
    > The patch can be applied cleanly to the current HEAD but I got the
    > error on building the code with this patch:
    > 
    > xlog.c: In function 'StartupXLOG':
    > xlog.c:7315:6: error: too few arguments to function 'RequestXLogStreaming'
    > 7315 |      RequestXLogStreaming(ThisTimeLineID,
    >      |      ^~~~~~~~~~~~~~~~~~~~
    > In file included from xlog.c:59:
    > ../../../../src/include/replication/walreceiver.h:463:13: note: declared here
    >  463 | extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
    >      |             ^~~~~~~~~~~~~~~~~~~~
    > 
    > cfbot also complains about this.
    > 
    > Could you please update the patch?
    > 
    
    Thank you for trying the patch and apologies for the compiler error.  I missed adding a hunk earlier, it should be fixed in the version attached here.
    
    
  9. Re: Unnecessary delay in streaming replication due to replay lag

    lchch1990@sina.cn — 2020-09-15T09:30:22Z

    Hello
    
    I read the code and tested the patch; it runs well on my side, and I have several questions
    about the patch.
    
    1. When calling RequestXLogStreaming() during replay, you pick the timeline straight from the
    control file; do you think it should pick the timeline from the timeline history file instead?
    
    2. In archive recovery mode, which never switches to streaming, I think the current code still
    calls RequestXLogStreaming(), which could be avoided.
    
    3. I found two 018_xxxxx.pl tests when I ran make check; maybe rename the new one?
    
    
    
    
    Regards,
    Highgo Software (Canada/China/Pakistan) 
    URL : www.highgo.ca 
    EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
    
  10. Re: Unnecessary delay in streaming replication due to replay lag

    Michael Paquier <michael@paquier.xyz> — 2020-11-20T08:21:06Z

    On Tue, Sep 15, 2020 at 05:30:22PM +0800, lchch1990@sina.cn wrote:
    > I read the code and tested the patch; it runs well on my side, and I have several questions
    > about the patch.
    
    +                   RequestXLogStreaming(ThisTimeLineID,
    +                                        startpoint,
    +                                        PrimaryConnInfo,
    +                                        PrimarySlotName,
    +                                        wal_receiver_create_temp_slot);
    
    This patch thinks that it is fine to request streaming even if
    PrimaryConnInfo is not set, but that's not fine.
    
    Anyway, I don't quite understand what you are trying to achieve here.
    "startpoint" is used to request the beginning of streaming.  It is
    roughly the consistency LSN + some alpha with some checks on WAL
    pages (those WAL page checks are not acceptable as they make
    maintenance harder).  What about the case where consistency is
    reached but there are many segments still ahead that need to be
    replayed?  Your patch would cause streaming to begin too early, and
    a manual copy of segments is not a rare thing as in some environments
    a bulk copy of segments can make the catchup of a standby faster than
    streaming.
    
    It seems to me that what you are looking for here is some kind of
    pre-processing before entering the redo loop to determine the LSN
    that could be reused for the fast streaming start, which should match
    the end of the WAL present locally.  In short, you would need a
    XLogReaderState that begins a scan of WAL from the redo point until it
    cannot find anything more, and use the last LSN found as a base to
    begin requesting streaming.  The question of timeline jumps can also
    be very tricky, but it could also be possible to not allow this option
    if a timeline jump happens while attempting to guess the end of WAL
    ahead of time.  Another thing: could it be useful to have an extra
    mode to begin streaming without waiting for consistency to finish?
    --
    Michael
    
  11. Re: Unnecessary delay in streaming replication due to replay lag

    Anastasia Lubennikova <a.lubennikova@postgrespro.ru> — 2020-12-01T14:21:51Z

    On 20.11.2020 11:21, Michael Paquier wrote:
    > On Tue, Sep 15, 2020 at 05:30:22PM +0800, lchch1990@sina.cn wrote:
    >> I read the code and tested the patch; it runs well on my side, and I have several questions
    >> about the patch.
    > +                   RequestXLogStreaming(ThisTimeLineID,
    > +                                        startpoint,
    > +                                        PrimaryConnInfo,
    > +                                        PrimarySlotName,
    > +                                        wal_receiver_create_temp_slot);
    >
    > This patch thinks that it is fine to request streaming even if
    > PrimaryConnInfo is not set, but that's not fine.
    >
    > Anyway, I don't quite understand what you are trying to achieve here.
    > "startpoint" is used to request the beginning of streaming.  It is
    > roughly the consistency LSN + some alpha with some checks on WAL
    > pages (those WAL page checks are not acceptable as they make
    > maintenance harder).  What about the case where consistency is
    > reached but there are many segments still ahead that need to be
    > replayed?  Your patch would cause streaming to begin too early, and
    > a manual copy of segments is not a rare thing as in some environments
    > a bulk copy of segments can make the catchup of a standby faster than
    > streaming.
    >
    > It seems to me that what you are looking for here is some kind of
    > pre-processing before entering the redo loop to determine the LSN
    > that could be reused for the fast streaming start, which should match
    > the end of the WAL present locally.  In short, you would need a
    > XLogReaderState that begins a scan of WAL from the redo point until it
    > cannot find anything more, and use the last LSN found as a base to
    > begin requesting streaming.  The question of timeline jumps can also
    > be very tricky, but it could also be possible to not allow this option
    > if a timeline jump happens while attempting to guess the end of WAL
    > ahead of time.  Another thing: could it be useful to have an extra
    > mode to begin streaming without waiting for consistency to finish?
    > --
    > Michael
    
    
    Status update for a commitfest entry.
    
    This entry was "Waiting On Author" during this CF, so I've marked it as 
    returned with feedback. Feel free to resubmit an updated version to a 
    future commitfest.
    
    -- 
    Anastasia Lubennikova
    Postgres Professional: http://www.postgrespro.com
    The Russian Postgres Company
    
    
    
    
    
  12. Re: Unnecessary delay in streaming replication due to replay lag

    Soumyadeep Chakraborty <soumyadeep2007@gmail.com> — 2021-08-25T04:51:25Z

    Hello,
    
    Ashwin and I recently got a chance to work on this and we addressed all
    outstanding feedback and suggestions. PFA a significantly reworked patch.
    
    On 20.11.2020 11:21, Michael Paquier wrote:
    
    > This patch thinks that it is fine to request streaming even if
    > PrimaryConnInfo is not set, but that's not fine.
    
    We introduced a check to ensure that PrimaryConnInfo is set up before we
    request the WAL stream eagerly.
    
    > Anyway, I don't quite understand what you are trying to achieve here.
    > "startpoint" is used to request the beginning of streaming.  It is
    > roughly the consistency LSN + some alpha with some checks on WAL
    > pages (those WAL page checks are not acceptable as they make
    > maintenance harder).  What about the case where consistency is
    > reached but there are many segments still ahead that need to be
    > replayed?  Your patch would cause streaming to begin too early, and
    > a manual copy of segments is not a rare thing as in some environments
    > a bulk copy of segments can make the catchup of a standby faster than
    > streaming.
    >
    > It seems to me that what you are looking for here is some kind of
    > pre-processing before entering the redo loop to determine the LSN
    > that could be reused for the fast streaming start, which should match
    > the end of the WAL present locally.  In short, you would need a
    > XLogReaderState that begins a scan of WAL from the redo point until it
    > cannot find anything more, and use the last LSN found as a base to
    > begin requesting streaming.  The question of timeline jumps can also
    > be very tricky, but it could also be possible to not allow this option
    > if a timeline jump happens while attempting to guess the end of WAL
    > ahead of time.  Another thing: could it be useful to have an extra
    > mode to begin streaming without waiting for consistency to finish?
    
    1. When wal_receiver_start_condition='consistency', we feel that the
    stream start point calculation should be done only when we reach
    consistency. Imagine the situation where consistency is reached 2 hours
    after start, and within that 2 hours a lot of WAL has been manually
    copied over into the standby's pg_wal. If we pre-calculated the stream
    start location before we entered the main redo apply loop, we would be
    starting the stream from a much earlier location (minus the 2 hours
    worth of WAL), leading to wasted work.
    
    2. We have significantly changed the code to calculate the WAL stream
    start location. We now traverse pg_wal, find the latest valid WAL
    segment and start the stream from the segment's start. This is much
    more performant than reading from the beginning of the locally available
    WAL.
    
    3. To perform the validation check, we no longer have duplicate code -
    as we can now rely on the XLogReaderState(), XLogReaderValidatePageHeader()
    and friends.
    
    4. We have an extra mode: wal_receiver_start_condition='startup', which
    will start the WAL receiver before the startup process reaches
    consistency. We don't fully understand the utility of having 'startup' over
    'consistency' though.
    
    5. During the traversal of pg_wal, if we find WAL segments on differing
    timelines, we bail out and abandon attempting to start the WAL stream
    eagerly.
    
    6. To handle the cases where a lot of WAL is copied over after the
    WAL receiver has started at consistency:
    i) Don't recommend wal_receiver_start_condition='startup|consistency'.
    
    ii) Copy over the WAL files and then start the standby, so that the WAL
    stream starts from a fresher point.
    
    iii) Have an LSN/segment# target to start the WAL receiver from?
    
    7. We have significantly changed the test. It is much more simplified
    and focused.
    
    8. We did not test wal_receiver_start_condition='startup' in the test.
    It's actually hard to assert that the walreceiver has started at
    startup. recovery_min_apply_delay only kicks in once we reach
    consistency, and thus there is no way I could think of to reliably halt
    the startup process and check: "Has the wal receiver started even
    though the standby hasn't reached consistency?" Only way we could think
    of is to generate a large workload during the course of the backup so
    that the standby has significant WAL to replay before it reaches
    consistency. But that will make the test flaky as we will have no
    absolutely precise wait condition. That said, we felt that checking
    for 'consistency' is enough as it covers the majority of the added
    code.
    
    9. We added a documentation section describing the GUC.
    
    
    Regards,
    Ashwin and Soumyadeep (VMware)
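
Points 2 and 5 above (start from the latest valid segment, and give up if pg_wal holds segments from more than one timeline) can be sketched as follows. This is an illustration under stated assumptions, not the patch's code: the `SketchSegment` struct and the function name are hypothetical, and the `header_valid` flag stands in for a real first-page header check such as the one the message attributes to XLogReaderValidatePageHeader().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of one segment file found in pg_wal. */
typedef struct SketchSegment
{
    uint32_t tli;           /* timeline ID taken from the file name */
    uint64_t segno;         /* segment number taken from the file name */
    bool     header_valid;  /* stand-in for a real first-page header check */
} SketchSegment;

/*
 * Choose the segment to start streaming from: the highest-numbered segment
 * whose first page validated.  Returns false, meaning "do not start the WAL
 * receiver eagerly", if no segments were found, if the segments span more
 * than one timeline, or if no segment validated.  *start_segno is written
 * only when the function returns true.
 */
bool
sketch_pick_stream_start(const SketchSegment *segs, size_t nsegs,
                         uint64_t *start_segno)
{
    bool     found = false;
    uint64_t best = 0;

    assert(start_segno != NULL);
    assert(segs != NULL || nsegs == 0);

    for (size_t i = 0; i < nsegs; i++)
    {
        /* Differing timelines: bail out rather than guess. */
        if (segs[i].tli != segs[0].tli)
            return false;

        if (segs[i].header_valid && (!found || segs[i].segno > best))
        {
            best = segs[i].segno;
            found = true;
        }
    }

    if (found)
        *start_segno = best;
    return found;
}
```

For segments 3, 4 and 5 on one timeline, where only 3 and 4 have a valid first page, the sketch selects segment 4; adding any segment from a second timeline makes it return false.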
    
  13. Re: Unnecessary delay in streaming replication due to replay lag

    Soumyadeep Chakraborty <soumyadeep2007@gmail.com> — 2021-10-25T19:11:07Z

    Rebased and added a CF entry for Nov CF:
    https://commitfest.postgresql.org/35/3376/.
    
  14. Re: Unnecessary delay in streaming replication due to replay lag

    Michael Paquier <michael@paquier.xyz> — 2021-11-08T09:41:04Z

    On Tue, Aug 24, 2021 at 09:51:25PM -0700, Soumyadeep Chakraborty wrote:
    > Ashwin and I recently got a chance to work on this and we addressed all
    > outstanding feedback and suggestions. PFA a significantly reworked patch.
    
    +static void
    +StartWALReceiverEagerly()
    +{
    The patch fails to apply because of the recent changes from Robert to
    eliminate ThisTimeLineID.  The correct thing to do would be to add one
    TimeLineID argument, passing down the local ThisTimeLineID in
    StartupXLOG() and using XLogCtl->lastReplayedTLI in
    CheckRecoveryConsistency().
    
    +	/*
    +	 * We should never reach here. We should have at least one valid WAL
    +	 * segment in our pg_wal, for the standby to have started.
    +	 */
    +	Assert(false);
    The reason behind that is not that we have a standby, but that we read
    at least the segment that included the checkpoint record we are
    replaying from, at least (it is possible for a standby to start
    without any contents in pg_wal/ as long as recovery is configured),
    and because StartWALReceiverEagerly() is called just after that.
    
    It would be better to make sure that StartWALReceiverEagerly() gets
    only called from the startup process, perhaps?
    
    +	RequestXLogStreaming(ThisTimeLineID, startptr, PrimaryConnInfo,
    +			     PrimarySlotName, wal_receiver_create_temp_slot);
    +	XLogReaderFree(state);
    XLogReaderFree() should happen before RequestXLogStreaming().  The
    tipping point of the patch is here, where the WAL receiver is started
    based on the location of the first valid WAL record found.
    
    wal_receiver_start_condition is missing in postgresql.conf.sample.
    
    +	/*
    +	 * Start WAL receiver eagerly if requested.
    +	 */
    +	if (StandbyModeRequested && !WalRcvStreaming() &&
    +		PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0 &&
    +		wal_receiver_start_condition == WAL_RCV_START_AT_STARTUP)
    +		StartWALReceiverEagerly();
    [...]
    +	if (StandbyModeRequested && !WalRcvStreaming() && reachedConsistency &&
    +		PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0 &&
    +		wal_receiver_start_condition == WAL_RCV_START_AT_CONSISTENCY)
    +		StartWALReceiverEagerly();
    This repeats two times the same set of conditions, which does not look
    like a good idea to me.  I think that you'd better add an extra
    argument to StartWALReceiverEagerly to track the start timing expected
    in this code path, that will be matched with the GUC in the routine.
    It would be better to document the reasons behind each check done, as
    well.
    
    +	/* Find the latest and earliest WAL segments in pg_wal */
    +	dir = AllocateDir("pg_wal");
    +	while ((de = ReadDir(dir, "pg_wal")) != NULL)
    +	{
    [ ... ]
    +	/* Find the latest valid WAL segment and request streaming from its start */
    +	while (endsegno >= startsegno)
    +	{
    [...]
    +		XLogReaderFree(state);
    +		endsegno--;
    +	}
    So, this reads the contents of pg_wal/ for any files that exist, then
    goes down to the first segment found with a valid beginning.  That's
    going to be expensive with a large max_wal_size.  When searching for a
    point like that, a dichotomy method would be better to calculate a LSN
    you'd like to start from.  Anyway, I think that there is a problem
    with the approach: what should we do if there are holes in the
    segments present in pg_wal/?  As of HEAD, or
    wal_receiver_start_condition = 'exhaust' in this patch, we would
    switch across local pg_wal/, archive and stream in a linear way,
    thanks to WaitForWALToBecomeAvailable().  For example, imagine that we
    have a standby with the following set of valid segments, because of
    the buggy way a base backup has been taken:
    000000010000000000000001
    000000010000000000000003
    000000010000000000000005
    What the patch would do is starting a WAL receiver from segment 5,
    which is in contradiction with the existing logic where we should try
    to look for the segment once we are waiting for something in segment
    2.  This would be dangerous once the startup process waits for some
    WAL to become available, because we have a WAL receiver started, but
    we cannot fetch the segment we have.  Perhaps a deployment has
    archiving, in which case it would be able to grab segment 2 (if no
    archiving, recovery would not be able to move on, so that would be
    game over).
     
             /*
              * Move to XLOG_FROM_STREAM state, and set to start a
    -         * walreceiver if necessary.
    +         * walreceiver if necessary. The WAL receiver may have
    +         * already started (if it was configured to start
    +         * eagerly).
              */
             currentSource = XLOG_FROM_STREAM;
    -        startWalReceiver = true;
    +        startWalReceiver = !WalRcvStreaming();
             break;
         case XLOG_FROM_ARCHIVE:
         case XLOG_FROM_PG_WAL:
     
    -        /*
    -         * WAL receiver must not be running when reading WAL from
    -         * archive or pg_wal.
    -         */
    -        Assert(!WalRcvStreaming());
    
    These parts should IMO not be changed.  They are strong assumptions we
    rely on in the startup process, and this comes down to the fact that
    it is not a good idea to mix a WAL receiver started while
    currentSource could be pointing at a WAL source completely different. 
    That's going to bring a lot of racy conditions, I am afraid, as we
    rely on currentSource a lot during recovery, in combination that we
    expect the code to be able to retrieve WAL in a linear fashion from
    the LSN position that recovery is looking for.
    
    So, I think that deciding if a WAL receiver should be started blindly
    outside of the code path deciding if the startup process is waiting
    for some WAL is not a good idea, and the position we may begin to
    stream from may be something that we may have zero need for at the
    end (this is going to be tricky if we detect a TLI jump while
    replaying the local WAL, also?).  The issue is that I am not sure what
    a good design for that should be.  We have no idea when the startup
    process will need WAL from a different source until replay comes
    around, but what we want here is to anticipate this LSN :)
    
    I am wondering if there should be a way to work out something with the
    control file, though, but things can get very fancy with HA
    and base backup deployments and the various cases we support thanks to
    the current way recovery works, as well.  We could also go simpler and
    rework the priority order if both archiving and streaming are options
    wanted by the user.
    --
    Michael
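
The concern about holes in pg_wal can be made concrete with a small sketch. Given the segments present locally and the segment replay starts from, it computes the first segment that recovery would actually have to fetch from elsewhere. This is only an illustration of the review comment, not something either the patch or HEAD does; the function name is hypothetical and the input is assumed to be sorted in ascending order without duplicates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the first segment number, at or after replay_segno, that is not
 * present in segnos[].  The array must be sorted ascending with no
 * duplicates.  Segments older than replay_segno are ignored.
 */
uint64_t
sketch_first_missing_segment(const uint64_t *segnos, size_t n,
                             uint64_t replay_segno)
{
    uint64_t next = replay_segno;

    assert(segnos != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        if (segnos[i] < next)
            continue;           /* older than what replay needs */
        if (segnos[i] > next)
            break;              /* hole: "next" is not available locally */
        next++;                 /* contiguous: keep walking forward */
    }
    return next;
}
```

With segments 1, 3 and 5 present and replay starting at segment 1, the first missing segment is 2, whereas picking the latest valid segment would start streaming at 5.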
    
  15. Re: Unnecessary delay in streaming replication due to replay lag

    Soumyadeep Chakraborty <soumyadeep2007@gmail.com> — 2021-11-09T23:41:09Z

    Hi Michael,
    
    Thanks for the detailed review! Attached is a rebased patch that addresses
    most of the feedback.
    
    On Mon, Nov 8, 2021 at 1:41 AM Michael Paquier <michael@paquier.xyz> wrote:
    
    > +static void
    > +StartWALReceiverEagerly()
    > +{
    > The patch fails to apply because of the recent changes from Robert to
    > eliminate ThisTimeLineID.  The correct thing to do would be to add one
    > TimeLineID argument, passing down the local ThisTimeLineID in
    > StartupXLOG() and using XLogCtl->lastReplayedTLI in
    > CheckRecoveryConsistency().
    
    Rebased.
    
    > +       /*
    > +        * We should never reach here. We should have at least one valid WAL
    > +        * segment in our pg_wal, for the standby to have started.
    > +        */
    > +       Assert(false);
    > The reason behind that is not that we have a standby, but that we read
    > at least the segment that included the checkpoint record we are
    > replaying from, at least (it is possible for a standby to start
    > without any contents in pg_wal/ as long as recovery is configured),
    > and because StartWALReceiverEagerly() is called just after that.
    
    Fair, comment updated.
    
    > It would be better to make sure that StartWALReceiverEagerly() gets
    > only called from the startup process, perhaps?
    
    Added Assert(AmStartupProcess()) at the beginning of
    StartWALReceiverEagerly().
    
    >
    > +       RequestXLogStreaming(ThisTimeLineID, startptr, PrimaryConnInfo,
    > +                            PrimarySlotName, wal_receiver_create_temp_slot);
    > +       XLogReaderFree(state);
    > XLogReaderFree() should happen before RequestXLogStreaming().  The
    > tipping point of the patch is here, where the WAL receiver is started
    > based on the location of the first valid WAL record found.
    
    Done.
    
    > wal_receiver_start_condition is missing in postgresql.conf.sample.
    
    Fixed.
    
    > +       /*
    > +        * Start WAL receiver eagerly if requested.
    > +        */
    > +       if (StandbyModeRequested && !WalRcvStreaming() &&
    > +               PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0 &&
    > +               wal_receiver_start_condition == WAL_RCV_START_AT_STARTUP)
    > +               StartWALReceiverEagerly();
    > [...]
    > +       if (StandbyModeRequested && !WalRcvStreaming() && reachedConsistency &&
    > +               PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0 &&
    > +               wal_receiver_start_condition == WAL_RCV_START_AT_CONSISTENCY)
    > +               StartWALReceiverEagerly();
    > This repeats two times the same set of conditions, which does not look
    > like a good idea to me.  I think that you'd better add an extra
    > argument to StartWALReceiverEagerly to track the start timing expected
    > in this code path, that will be matched with the GUC in the routine.
    > It would be better to document the reasons behind each check done, as
    > well.
    
    Done.
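
One way the shared check could be factored, as a minimal sketch and not the patch's actual code: each call site passes the timing it represents, and a single helper compares it against the GUC before testing the remaining conditions. The struct `EagerStartState`, the helper `eager_start_allowed`, and the enum value `WAL_RCV_START_AT_EXHAUST` are hypothetical names introduced for illustration; only `WAL_RCV_START_AT_STARTUP` and `WAL_RCV_START_AT_CONSISTENCY` appear in the quoted diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Possible start timings. The EXHAUST name is an assumption. */
typedef enum
{
    WAL_RCV_START_AT_STARTUP,
    WAL_RCV_START_AT_CONSISTENCY,
    WAL_RCV_START_AT_EXHAUST
} WalRcvStartCondition;

/* Hypothetical snapshot of the state both call sites were testing. */
typedef struct
{
    bool        standby_mode_requested;
    bool        walrcv_streaming;
    bool        reached_consistency;
    const char *primary_conninfo;   /* may be NULL */
    WalRcvStartCondition guc_value; /* configured start timing */
} EagerStartState;

/*
 * Return true if the WAL receiver may be started eagerly from the call
 * site identified by 'caller'.
 */
static bool
eager_start_allowed(const EagerStartState *st, WalRcvStartCondition caller)
{
    assert(st != NULL);

    /* Only the call site matching the configured timing may act. */
    if (st->guc_value != caller)
        return false;

    /* 'exhaust' is the pre-existing behaviour: never start eagerly. */
    if (caller == WAL_RCV_START_AT_EXHAUST)
        return false;

    /* Streaming only makes sense on a standby. */
    if (!st->standby_mode_requested)
        return false;

    /* Nothing to do if a WAL receiver is already running. */
    if (st->walrcv_streaming)
        return false;

    /* Without connection info there is no primary to stream from. */
    if (st->primary_conninfo == NULL || strcmp(st->primary_conninfo, "") == 0)
        return false;

    /* The 'consistency' timing additionally needs a consistent state. */
    if (caller == WAL_RCV_START_AT_CONSISTENCY && !st->reached_consistency)
        return false;

    return true;
}
```

With this shape each check carries its own comment, and the two call sites reduce to a single call with a different `caller` argument.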
    
    > So, this reads the contents of pg_wal/ for any files that exist, then
    > goes down to the first segment found with a valid beginning.  That's
    > going to be expensive with a large max_wal_size. When searching for a
    > point like that, a dichotomy method would be better to calculate a LSN
    > you'd like to start from.
    
    Even if there is a large max_wal_size, do we expect that there will be
    a lot of invalid high-numbered WAL files? If that is not the case, most
    of the time we would be looking at the last 1 or 2 WAL files to
    determine the start point, making it efficient?
    
    > Anyway, I think that there is a problem
    > with the approach: what should we do if there are holes in the
    > segments present in pg_wal/?  As of HEAD, or
    > wal_receiver_start_condition = 'exhaust' in this patch, we would
    > switch across local pg_wal/, archive and stream in a linear way,
    > thanks to WaitForWALToBecomeAvailable().  For example, imagine that we
    > have a standby with the following set of valid segments, because of
    > the buggy way a base backup has been taken:
    > 000000010000000000000001
    > 000000010000000000000003
    > 000000010000000000000005
    > What the patch would do is starting a WAL receiver from segment 5,
    > which is in contradiction with the existing logic where we should try
    > to look for the segment once we are waiting for something in segment
    > 2.  This would be dangerous once the startup process waits for some
    > WAL to become available, because we have a WAL receiver started, but
    > we cannot fetch the segment we have.  Perhaps a deployment has
    > archiving, in which case it would be able to grab segment 2 (if no
    > archiving, recovery would not be able to move on, so that would be
    > game over).
    
    We could easily check for holes while we are doing the ReadDir() and
    bail from the early start if there are holes, just like we do if there
    is a timeline jump in any of the WAL segments.
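
The hole check can be sketched as follows, assuming the segment numbers collected during the directory scan are distinct and all belong to one timeline. The function name `segments_have_holes` is hypothetical and this is not the patch's code; it only illustrates that tracking the minimum, maximum, and count during a single pass is enough to detect a gap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if the given segment numbers do not form one contiguous
 * range. Assumes the entries are distinct, as file names within a
 * directory are. The order of the entries does not matter.
 */
static bool
segments_have_holes(const uint64_t *segnos, size_t nsegs)
{
    uint64_t    minseg;
    uint64_t    maxseg;
    size_t      i;

    assert(segnos != NULL || nsegs == 0);

    if (nsegs == 0)
        return false;

    minseg = maxseg = segnos[0];
    for (i = 1; i < nsegs; i++)
    {
        if (segnos[i] < minseg)
            minseg = segnos[i];
        if (segnos[i] > maxseg)
            maxseg = segnos[i];
    }

    /* A contiguous range of distinct values spans exactly nsegs numbers. */
    return (maxseg - minseg + 1) != (uint64_t) nsegs;
}
```

For the example upthread (segments 1, 3, and 5), the range spans five numbers but only three files exist, so the check reports a hole and the eager start would be skipped.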
    
    >          /*
    >           * Move to XLOG_FROM_STREAM state, and set to start a
    > -         * walreceiver if necessary.
    > +         * walreceiver if necessary. The WAL receiver may have
    > +         * already started (if it was configured to start
    > +         * eagerly).
    >           */
    >          currentSource = XLOG_FROM_STREAM;
    > -        startWalReceiver = true;
    > +        startWalReceiver = !WalRcvStreaming();
    >          break;
    >      case XLOG_FROM_ARCHIVE:
    >      case XLOG_FROM_PG_WAL:
    >
    > -        /*
    > -         * WAL receiver must not be running when reading WAL from
    > -         * archive or pg_wal.
    > -         */
    > -        Assert(!WalRcvStreaming());
    >
    > These parts should IMO not be changed.  They are strong assumptions we
    > rely on in the startup process, and this comes down to the fact that
    > it is not a good idea to mix a WAL receiver started while
    > currentSource could be pointing at a WAL source completely different.
    > That's going to bring a lot of racy conditions, I am afraid, as we
    > rely on currentSource a lot during recovery, in combination that we
    > expect the code to be able to retrieve WAL in a linear fashion from
    > the LSN position that recovery is looking for.
    >
    > So, I think that deciding if a WAL receiver should be started blindly
    > outside of the code path deciding if the startup process is waiting
    > for some WAL is not a good idea, and the position we may begin to
    > stream from may be something that we may have zero need for at the
    > end (this is going to be tricky if we detect a TLI jump while
    > replaying the local WAL, also?).  The issue is that I am not sure what
    > a good design for that should be.  We have no idea when the startup
    > process will need WAL from a different source until replay comes
    > around, but what we want here is to anticipate this LSN :)
    
    Can you elaborate on the race conditions that you are thinking about?
    Do the race conditions manifest only when we mix archiving and streaming?
    If yes, how do you feel about making the GUC a no-op with a WARNING
    while we are in WAL archiving mode?
    
    > I am wondering if there should be a way to work out something with the
    > control file, though, but things can get very fancy with HA
    > and base backup deployments and the various cases we support thanks to
    > the current way recovery works, as well.  We could also go simpler and
    > rework the priority order if both archiving and streaming are options
    > wanted by the user.
    
    Agreed, it would be much better to depend on the state in pg_wal,
    namely the files that are available there.
    
    Reworking the priority order seems like an appealing fix - if we can say
    streaming > archiving in terms of priority, then the race that you are
    referring to will not happen?
    
    Also, what are some use cases where one would give priority to streaming
    replication over archive recovery, if both sources have the same WAL
    segments?
    
    Regards,
    Ashwin & Soumyadeep
    
  16. Re: Unnecessary delay in streaming replication due to replay lag

    Daniel Gustafsson <daniel@yesql.se> — 2021-11-15T09:59:04Z

    > On 10 Nov 2021, at 00:41, Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
    
    > Thanks for the detailed review! Attached is a rebased patch that addresses
    > most of the feedback.
    
    This patch no longer applies after e997a0c64 and associated follow-up commits,
    please submit a rebased version.
    
    --
    Daniel Gustafsson		https://vmware.com/
    
    
    
    
    
  17. Re: Unnecessary delay in streaming replication due to replay lag

    Soumyadeep Chakraborty <soumyadeep2007@gmail.com> — 2021-11-19T08:35:04Z

    Hi Daniel,
    
    Thanks for checking in on this patch.
    Attached rebased version.
    
    Regards,
    Soumyadeep (VMware)
    
  18. Re: Unnecessary delay in streaming replication due to replay lag

    Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> — 2021-11-22T10:53:28Z

    On Fri, Nov 19, 2021 at 2:05 PM Soumyadeep Chakraborty
    <soumyadeep2007@gmail.com> wrote:
    >
    > Hi Daniel,
    >
    > Thanks for checking in on this patch.
    > Attached rebased version.
    
    Hi, I've not gone through the patch or this thread entirely yet. Can
    you please confirm whether there's any relation between this thread and
    another one at [1]?
    
    [1] https://www.postgresql.org/message-id/CAFiTN-vzbcSM_qZ%2B-mhS3OWecxupDCR5DkhQUTy%2BTKfrCMQLKQ%40mail.gmail.com
    
    
    
    
  19. Re: Unnecessary delay in streaming replication due to replay lag

    Soumyadeep Chakraborty <soumyadeep2007@gmail.com> — 2021-11-22T20:09:22Z

    Hi Bharath,
    
    Yes, that thread has been discussed here. Asim had x-posted the patch to
    [1]. This thread was more recent when Ashwin and I picked up the patch in
    Aug 2021, so we continued here. The patch has been significantly updated
    by us, addressing Michael's long outstanding feedback.
    
    Regards,
    Soumyadeep (VMware)
    
    [1]
    https://www.postgresql.org/message-id/CANXE4TeinQdw%2BM2Or0kTR24eRgWCOg479N8%3DgRvj9Ouki-tZFg%40mail.gmail.com
    
  20. Re: Unnecessary delay in streaming replication due to replay lag

    Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> — 2021-11-28T02:35:52Z

    On Tue, Nov 23, 2021 at 1:39 AM Soumyadeep Chakraborty
    <soumyadeep2007@gmail.com> wrote:
    >
    > Hi Bharath,
    >
    > Yes, that thread has been discussed here. Asim had x-posted the patch to [1]. This thread
    > was more recent when Ashwin and I picked up the patch in Aug 2021, so we continued here.
    > The patch has been significantly updated by us, addressing Michael's long outstanding feedback.
    
    Thanks for the patch. I reviewed it a bit, here are some comments:
    
    1) A memory leak: add FreeDir(dir); before returning.
    + ereport(LOG,
    + (errmsg("Could not start streaming WAL eagerly"),
    + errdetail("There are timeline changes in the locally available WAL files."),
    + errhint("WAL streaming will begin once all local WAL and archives are exhausted.")));
    + return;
    + }
    
    2) Is there a guarantee that while we traverse the pg_wal directory to
    find startsegno and endsegno, the new wal files arrive from the
    primary or archive location or old wal files get removed/recycled by
    the standby? Especially when wal_receiver_start_condition=consistency?
    + startsegno = (startsegno == -1) ? logSegNo : Min(startsegno, logSegNo);
    + endsegno = (endsegno == -1) ? logSegNo : Max(endsegno, logSegNo);
    + }
    
    3) I think the errmsg text format isn't correct. Note that the errmsg
    text starts with lowercase and doesn't end with "." whereas errdetail
    or errhint starts with uppercase and ends with ".". Please check other
    messages for reference.
    The following should be changed.
    + errmsg("Requesting stream from beginning of: %s",
    + errmsg("Invalid WAL segment found while calculating stream start: %s. Skipping.",
    + (errmsg("Could not start streaming WAL eagerly"),
    
    4) I think you also need to have wal files names in double quotes,
    something like below:
    errmsg("could not close file \"%s\": %m", xlogfname)));
    
    5) It is ".....stream start: \"%s\", skipping..",
    + errmsg("Invalid WAL segment found while calculating stream start: %s. Skipping.",
    
    4) I think the patch can make the startup process significantly slow,
    especially when there are lots of wal files that exist in the standby
    pg_wal directory. This is because of the overhead
    StartWALReceiverEagerlyIfPossible adds i.e. doing two while loops to
    figure out the start position of the
    streaming in advance. This might end up the startup process doing the
    loop over in the directory rather than the important thing of doing
    crash recovery or standby recovery.
    
    5) What happens if this new GUC is enabled in case of a synchronous standby?
    What happens if this new GUC is enabled in case of a crash recovery?
    What happens if this new GUC is enabled in case a restore command is
    set i.e. standby performing archive recovery?
    
    6) How about bgwriter/checkpointer which gets started even before the
    startup process (or a new bg worker? of course it's going to be an
    overkill) finding out the new start pos for the startup process and
    then we could get rid of <literal>startup</literal> behaviour of the
    patch? This avoids an extra burden on the startup process. Many times,
    users will be complaining about why recovery is taking more time now,
    after the GUC wal_receiver_start_condition=startup.
    
    7) I think we can just have 'consistency' and 'exhaust' behaviours and
    let the bgwriter or checkpointer find out the start position for the
    startup process, so that whenever the startup process reaches a consistent
    point, it checks whether the other process has calculated the
    start pos for it. If yes, it starts the WAL receiver; otherwise it
    goes on with its usual recovery. I'm not sure if this will be a good
    idea.
    
    8) Can we have a better GUC name than wal_receiver_start_condition?
    Something like wal_receiver_start_at or wal_receiver_start or some
    other?
    
    Regards,
    Bharath Rupireddy.
    
    
    
    
  21. Re: Unnecessary delay in streaming replication due to replay lag

    Soumyadeep Chakraborty <soumyadeep2007@gmail.com> — 2021-12-16T01:01:24Z

    Hi Bharath,
    
    Thanks for the review!
    
    On Sat, Nov 27, 2021 at 6:36 PM Bharath Rupireddy <
    bharath.rupireddyforpostgres@gmail.com> wrote:
    
    > 1) A memory leak: add FreeDir(dir); before returning.
    > + ereport(LOG,
    > + (errmsg("Could not start streaming WAL eagerly"),
    > + errdetail("There are timeline changes in the locally available WAL files."),
    > + errhint("WAL streaming will begin once all local WAL and archives are exhausted.")));
    > + return;
    > + }
    >
    
    Thanks for catching that. Fixed.
    
    >
    >
    > 2) Is there a guarantee that while we traverse the pg_wal directory to
    > find startsegno and endsegno, the new wal files arrive from the
    > primary or archive location or old wal files get removed/recycled by
    > the standby? Especially when wal_receiver_start_condition=consistency?
    > + startsegno = (startsegno == -1) ? logSegNo : Min(startsegno, logSegNo);
    > + endsegno = (endsegno == -1) ? logSegNo : Max(endsegno, logSegNo);
    > + }
    >
    
    Even if newer wal files arrive after the snapshot of the dir listing
    taken by AllocateDir()/ReadDir(), we will in effect start from a
    slightly older location, which should be fine. It shouldn't matter if
    an older file is recycled. If the last valid WAL segment is recycled,
    we will ERROR out in StartWALReceiverEagerlyIfPossible() and the eager
    start can be retried by the startup process when
    CheckRecoveryConsistency() is called again.
    
    >
    >
    > 3) I think the errmsg text format isn't correct. Note that the errmsg
    > text starts with lowercase and doesn't end with "." whereas errdetail
    > or errhint starts with uppercase and ends with ".". Please check other
    > messages for reference.
    > The following should be changed.
    > + errmsg("Requesting stream from beginning of: %s",
    > + errmsg("Invalid WAL segment found while calculating stream start:
    > %s. Skipping.",
    > + (errmsg("Could not start streaming WAL eagerly"),
    
    Fixed.
    
    > 4) I think you also need to have wal files names in double quotes,
    > something like below:
    > errmsg("could not close file \"%s\": %m", xlogfname)));
    
    Fixed.
    
    >
    > 5) It is ".....stream start: \"%s\", skipping..",
    > + errmsg("Invalid WAL segment found while calculating stream start:
    > %s. Skipping.",
    
    Fixed.
    
    > 4) I think the patch can make the startup process significantly slow,
    > especially when there are lots of wal files that exist in the standby
    > pg_wal directory. This is because of the overhead
    > StartWALReceiverEagerlyIfPossible adds i.e. doing two while loops to
    > figure out the start position of the
    > streaming in advance. This might end up the startup process doing the
    > loop over in the directory rather than the important thing of doing
    > crash recovery or standby recovery.
    
    Well, 99% of the time we can expect that the second loop finishes after
    1 or 2 iterations, as the last valid WAL segment would most likely be
    the highest numbered WAL file or thereabouts. I don't think that the
    overhead will be significant as we are just looking up a directory
    listing and not reading any files.
    
    > 5) What happens if this new GUC is enabled in case of a synchronous standby?
    > What happens if this new GUC is enabled in case of a crash recovery?
    > What happens if this new GUC is enabled in case a restore command is
    > set i.e. standby performing archive recovery?
    
    The GUC would behave the same way for all of these cases. If we have
    chosen 'startup'/'consistency', we would be starting the WAL receiver
    eagerly. There might be certain race conditions when one combines this
    GUC with archive recovery, which was discussed upthread. [1]
    
    > 6) How about bgwriter/checkpointer which gets started even before the
    > startup process (or a new bg worker? of course it's going to be an
    > overkill) finding out the new start pos for the startup process and
    > then we could get rid of <literal>startup</literal> behaviour of the
    > patch? This avoids an extra burden on the startup process. Many times,
    > users will be complaining about why recovery is taking more time now,
    > after the GUC wal_receiver_start_condition=startup.
    
    Hmm, then we would be needing additional synchronization. There will
    also be an added dependency on checkpoint_timeout. I don't think that
    the performance hit is significant enough to warrant this change.
    
    > 8) Can we have a better GUC name than wal_receiver_start_condition?
    > Something like wal_receiver_start_at or wal_receiver_start or some
    > other?
    
    Sure, that makes more sense. Fixed.
    
    Regards,
    Soumyadeep (VMware)
    
    [1]
    https://www.postgresql.org/message-id/CAE-ML%2B-8KnuJqXKHz0mrC7-qFMQJ3ArDC78X3-AjGKos7Ceocw%40mail.gmail.com
    
  22. Re: Unnecessary delay in streaming replication due to replay lag

    Kyotaro Horiguchi <horikyota.ntt@gmail.com> — 2021-12-16T10:05:19Z

    At Wed, 15 Dec 2021 17:01:24 -0800, Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote in 
    > Sure, that makes more sense. Fixed.
    
    I played with this briefly.  I started a standby from a backup that
    has access to an archive.  I got the following log lines repeatedly.
    
    
    [139535:postmaster] LOG:  database system is ready to accept read-only connections
    [139542:walreceiver] LOG:  started streaming WAL from primary at 0/2000000 on timeline 1
    cp: cannot stat '/home/horiguti/data/arc_work/000000010000000000000003': No such file or directory
    [139542:walreceiver] FATAL:  could not open file "pg_wal/000000010000000000000003": No such file or directory
    cp: cannot stat '/home/horiguti/data/arc_work/00000002.history': No such file or directory
    cp: cannot stat '/home/horiguti/data/arc_work/000000010000000000000003': No such file or directory
    [139548:walreceiver] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
    
    The "FATAL:  could not open file" message from walreceiver means that
    the walreceiver was operationally prohibited to install a new wal
    segment at the time.  Thus the walreceiver ended as soon as started.
    In short, the eager replication is not working at all.
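
To relate the LSNs in the log above to the segment file names, here is a small sketch of the arithmetic, assuming the default 16MB `wal_segment_size`. The helper names `wal_segno_from_lsn` and `wal_file_name` are hypothetical stand-ins for illustration, not PostgreSQL's own macros. An LSN printed as `0/3000000` is the 64-bit value 0x3000000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed segment size: the default 16MB wal_segment_size. */
#define WAL_SEGMENT_SIZE    UINT64_C(0x1000000)

/* Segment number containing the given LSN. */
static uint64_t
wal_segno_from_lsn(uint64_t lsn)
{
    return lsn / WAL_SEGMENT_SIZE;
}

/*
 * Format the 24-character WAL file name for the segment containing 'lsn'
 * on timeline 'tli'. The name is three 8-digit hex fields: the timeline,
 * then the segment number split by the number of segments that fit in
 * 4GB of WAL.
 */
static void
wal_file_name(char *buf, size_t buflen, uint32_t tli, uint64_t lsn)
{
    uint64_t    segno = wal_segno_from_lsn(lsn);
    uint64_t    segs_per_id = UINT64_C(0x100000000) / WAL_SEGMENT_SIZE;

    assert(buflen >= 25);       /* 24 hex digits plus terminating NUL */

    snprintf(buf, buflen, "%08X%08X%08X",
             (unsigned int) tli,
             (unsigned int) (segno / segs_per_id),
             (unsigned int) (segno % segs_per_id));
}
```

Under that assumption, `0/3000000` on timeline 1 falls at the start of segment 3, which is the file `000000010000000000000003` that both the restore command and the walreceiver fail to find in the log.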
    
    
    I have a comment on the behavior and objective of this feature.
    
    In the case where archive recovery is started from a backup, this
    feature lets walreceiver start while the archive recovery is ongoing.
    If walreceiver (or the eager replication) worked as expected, it would
    write wal files while archive recovery writes the same set of WAL
    segments to the same directory. I don't think that is a sane behavior.
    Or, to put it more modestly, it is an unintended behavior.
    
    In common cases, I believe archive recovery is faster than
    replication.  If a segment is available from archive, we don't need to
    prefetch it via stream.
    
    If this feature is intended to be used only for crash recovery of a
    standby, it should fire only when it is needed.
    
    If not, that is, if it is intended to work also for archive recovery,
    I think the eager replication should start from the next segment of
    the last WAL in archive but that would invite more complex problems.
    
    regards.
    
    -- 
    Kyotaro Horiguchi
    NTT Open Source Software Center
    
    
    
    
  23. Re: Unnecessary delay in streaming replication due to replay lag

    sunil s <sunilfeb26@gmail.com> — 2025-07-08T18:31:55Z

    Hello Hackers,
    
    I recently had the opportunity to continue the effort originally led by a
    valued contributor.
    I’ve addressed most of the previously reported feedback and issues, and
    would like to share the updated patch with the community.
    
    IMHO starting the WAL receiver eagerly offers significant advantages, for
    the following reasons:

       1. If recovery_min_apply_delay is set high (for various operational
          reasons) and the primary crashes, the mirror can recover quickly,
          thereby improving overall High Availability.
       2. For setups without archive-based recovery, restore and recovery
          operations complete faster.
       3. When synchronous_commit is enabled, faster mirror recovery reduces
          offline time and helps avoid prolonged commit/query wait times during
          failover/recovery.
       4. This approach also improves resilience by limiting the impact of
          network interruptions on replication.
    
    
    > In common cases, I believe archive recovery is faster than
    > replication. If a segment is available from archive, we don't need to
    > prefetch it via stream.
    
    I completely agree — restoring from the archive is significantly faster
    than streaming.
    Attempting to stream from the last available WAL in the archive would
    introduce complexity and risk.
    Therefore, we can limit this feature to crash recovery scenarios and skip
    it when archiving is enabled.
    
    > The "FATAL: could not open file" message from walreceiver means that
    > the walreceiver was operationally prohibited to install a new wal
    > segment at the time.
    
    This was caused by an additional fix added upstream to address a race
    condition between the archiver and checkpointer.
    It has been resolved in the latest patch, which also includes a TAP test to
    verify the fix. Thanks for testing and bringing this to our attention.
    For now we will skip the early WAL receiver start, since enabling write
    access for the WAL receiver would reintroduce the bug that
    commit cc2c7d65fc27e877c9f407587b0b92d46cd6dd16
    <https://github.com/postgres/postgres/commit/cc2c7d65fc27e877c9f407587b0b92d46cd6dd16>
    fixed previously.
    
    
    I've attached the rebased patch with the necessary fix.
    
    Thanks & Regards,
    Sunil S (Broadcom)
    
    
    
  24. Re: Unnecessary delay in streaming replication due to replay lag

    sunil s <sunilfeb26@gmail.com> — 2025-07-10T06:35:07Z

    Added patch to upcoming commitfest
    https://commitfest.postgresql.org/patch/5908/
    
    Thanks & Regards,
    Sunil S
    
    
    On Wed, Jul 9, 2025 at 12:01 AM sunil s <sunilfeb26@gmail.com> wrote:
    
    > Hello Hackers,
    >
    > I recently had the opportunity to continue the effort originally led by a
    > valued contributor.
    > I’ve addressed most of the previously reported feedback and issues, and
    > would like to share the updated patch with the community.
    >
    > IMHO starting WAL receiver eagerly offers significant advantages because
    > of following reasons
    >
    >    1.
    >
    >    If recovery_min_apply_delay is set high (for various operational
    >    reasons) and the primary crashes, the mirror can recover quickly, thereby
    >    improving overall High Availability.
    >    2.
    >
    >    For setups without archive-based recovery, restore and recovery
    >    operations complete faster.
    >    3.
    >
    >    When synchronous_commit is enabled, faster mirror recovery reduces
    >    offline time and helps avoid prolonged commit/query wait times during
    >    failover/recovery.
    >    4.
    >
    >    This approach also improves resilience by limiting the impact of
    >    network interruptions on replication.
    >
    >
    > > In common cases, I believe archive recovery is faster than
    > replication. If a segment is available from archive, we don't need to
    > prefetch it via stream.
    >
    > I completely agree — restoring from the archive is significantly faster
    > than streaming.
    >  Attempting to stream from the last available WAL in the archive would
    > introduce complexity and risk.
    > Therefore, we can limit this feature to crash recovery scenarios and skip
    > it when archiving is enabled.
    >
    > > The "FATAL: could not open file" message from walreceiver means that
    > the walreceiver was operationally prohibited to install a new wal
    > segment at the time.
    > This was caused by an additional fix added in upstream to address a race
    > condition between the archiver and checkpointer.
    > It has been resolved in the latest patch, which also includes a TAP test
    > to verify the fix. Thanks for testing and bringing this to our attention.
    > For now we will skip wal receiver early start since enabling the write
    > access for wal receiver will reintroduce the bug, which the
    > commit cc2c7d65fc27e877c9f407587b0b92d46cd6dd16
    > <https://github.com/postgres/postgres/commit/cc2c7d65fc27e877c9f407587b0b92d46cd6dd16> fixed
    > previously.
    >
    >
    > I've attached the rebased patch with the necessary fix.
    >
    > Thanks & Regards,
    > Sunil S (Broadcom)
    >
    >
    > On Tue, Jul 8, 2025 at 11:01 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
    > wrote:
    >
    >> At Wed, 15 Dec 2021 17:01:24 -0800, Soumyadeep Chakraborty <
    >> soumyadeep2007@gmail.com> wrote in
    >> > Sure, that makes more sense. Fixed.
    >>
    >> As I played with this briefly.  I started a standby from a backup that
    >> has an access to archive.  I had the following log lines steadily.
    >>
    >>
    >> [139535:postmaster] LOG:  database system is ready to accept read-only
    >> connections
    >> [139542:walreceiver] LOG:  started streaming WAL from primary at
    >> 0/2000000 on timeline 1
    >> cp: cannot stat '/home/horiguti/data/arc_work/000000010000000000000003':
    >> No such file or directory
    >> [139542:walreceiver] FATAL:  could not open file
    >> "pg_wal/000000010000000000000003": No such file or directory
    >> cp: cannot stat '/home/horiguti/data/arc_work/00000002.history': No such
    >> file or directory
    >> cp: cannot stat '/home/horiguti/data/arc_work/000000010000000000000003':
    >> No such file or directory
    >> [139548:walreceiver] LOG:  started streaming WAL from primary at
    >> 0/3000000 on timeline 1
    >>
    >> The "FATAL:  could not open file" message from walreceiver means that
    >> the walreceiver was operationally prohibited to install a new wal
    >> segment at the time.  Thus the walreceiver ended as soon as started.
    >> In short, the eager replication is not working at all.
    >>
    >>
    >> I have a comment on the behavior and objective of this feature.
    >>
    >> In the case where archive recovery is started from a backup, this
    >> feature lets walreceiver start while the archive recovery is ongoing.
    >> If walreceiver (or the eager replication) worked as expected, it would
    >> write wal files while archive recovery writes the same set of WAL
    >> segments to the same directory. I don't think that is a sane behavior.
    >> Or, if putting more modestly, an unintended behavior.
    >>
    >> In common cases, I believe archive recovery is faster than
    >> replication.  If a segment is available from archive, we don't need to
    >> prefetch it via stream.
    >>
    >> If this feature is intended to use only for crash recovery of a
    >> standby, it should fire only when it is needed.
    >>
    >> If not, that is, if it is intended to work also for archive recovery,
    >> I think the eager replication should start from the next segment of
    >> the last WAL in archive but that would invite more complex problems.
    >>
    >> regards.
    >>
    >> --
    >> Kyotaro Horiguchi
    >> NTT Open Source Software Center
    
  25. Re: Unnecessary delay in streaming replication due to replay lag

    Huansong Fu <huansong.fu.info@gmail.com> — 2025-07-28T22:41:24Z

    The following review has been posted through the commitfest application:
    make installcheck-world:  not tested
    Implements feature:       tested, failed
    Spec compliant:           not tested
    Documentation:            not tested
    
    Hi,
    
    I've been playing with the patch. It worked as intended. I have a few minor review comments on the code and test:
    
    1. There was some indent issue when applying the v6-0001 patch:
    v6-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patch:342: space before tab in indent.
    		 	gettext_noop("When to start WAL receiver."),
    v6-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patch:343: space before tab in indent.
    		 	NULL,
    warning: 2 lines add whitespace errors.
    
    2. There was a whitespace issue when applying the v6-0002 test patch:
    v6-0002-Test-WAL-receiver-early-start-upon-reaching-consi.patch:126: new blank line at EOF.
    +
    warning: 1 line adds whitespace errors.
    
    3. Test number for "046_walreciver_start.pl" collided with a recently added test "046_checkpoint_logical_slot.pl" so needs another number.
    
    4. Some comment text needs rewrapping:
             * Archiving from the restore command does not holds the control lock
    -        * and enabling XLogCtl->InstallXLogFileSegmentActive for wal reciever early start
    -        * will create a race condition with the checkpointer process as fixed in cc2c7d65fc27e877c9f407587b0b92d46cd6dd16.
    -        * Hence skipping early start of the wal receiver in case of archive recovery.
    +        * and enabling XLogCtl->InstallXLogFileSegmentActive for wal reciever
    +        * early start will create a race condition with the checkpointer process
    +        * as fixed in cc2c7d65fc27e877c9f407587b0b92d46cd6dd16. Hence skipping
    +        * early start of the wal receiver in case of archive recovery.
             */
    
    5. Extra ";"
    @@ -3820,7 +3821,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                                             * eagerly).
                                             */
                                            currentSource = XLOG_FROM_STREAM;
    -                                       startWalReceiver = !WalRcvStreaming();;
    +                                       startWalReceiver = !WalRcvStreaming();
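
    A collision like the one in item 3 can be caught before posting. The
    following is only a rough sketch, not part of the patch:
    find_prefix_collisions is a hypothetical helper, "001_example.pl" is a
    made-up file name, and the three-digit prefix pattern is an assumption
    based on the two file names mentioned above.

```python
import re
from collections import defaultdict


def find_prefix_collisions(filenames):
    """Return {prefix: [files]} for every numeric prefix used more than once."""
    by_prefix = defaultdict(list)
    for name in filenames:
        # Assumed naming scheme: three digits, an underscore, then "<name>.pl".
        match = re.match(r"(\d{3})_.+\.pl$", name)
        if match:
            by_prefix[match.group(1)].append(name)
    return {
        prefix: sorted(names)
        for prefix, names in by_prefix.items()
        if len(names) > 1
    }


tests = [
    "001_example.pl",                  # hypothetical, for illustration only
    "046_checkpoint_logical_slot.pl",  # existing test named in item 3
    "046_walreciver_start.pl",         # test added by the v6-0002 patch
]
collisions = find_prefix_collisions(tests)
print(collisions)
```

    Running it over the list of file names in the test directory would
    report any prefix shared by more than one file.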
    
    Thanks,
    Huansong
    Broadcom Inc.
    
    The new status of this patch is: Waiting on Author
    
  26. Re: Unnecessary delay in streaming replication due to replay lag

    sunil s <sunilfeb26@gmail.com> — 2025-07-29T09:29:35Z

    Thanks, Huansong, for reviewing the patch. I have addressed all the
    above-mentioned points.
    
    Please find attached the rebased patch.
    
    Thanks & Regards,
    Sunil S
    
  27. Re: Unnecessary delay in streaming replication due to replay lag

    sunil s <sunilfeb26@gmail.com> — 2025-09-11T08:51:14Z

    Hello Hackers,
    
    Please find attached the patch, rebased over the code changes in
    upstream commit 63599896545c7869f7dd28cd593e8b548983d613
    <https://github.com/postgres/postgres/commit/63599896545c7869f7dd28cd593e8b548983d613>.
    
    The current status of the patch registered in Commit Fest
    <https://commitfest.postgresql.org/patch/5908/> is "*Ready for Committer*".
    
    Thanks & Regards,
    Sunil S
    Broadcom Inc
    
    
  28. Re: Unnecessary delay in streaming replication due to replay lag

    Fujii Masao <masao.fujii@gmail.com> — 2025-10-27T13:16:20Z

    On Thu, Sep 11, 2025 at 5:51 PM sunil s <sunilfeb26@gmail.com> wrote:
    >
    > Hello Hackers,
    >
    > PFA rebased patch due to the code changes done in upstream commit 63599896545c7869f7dd28cd593e8b548983d613.
    >
    > The current status of the patch registered in Commit Fest is "Ready for Committer".
    
    +        streamed WAL. Such environments can benefit from setting
    +        <varname>wal_receiver_start_at</varname> to
    +        <literal>startup</literal> or <literal>consistency</literal>. These
    +        values will lead to the WAL receiver starting much earlier, and from
    +        the end of locally available WAL.
    
    When this parameter is set to 'startup' or 'consistency', what happens
    if replication begins early and the startup process fails to replay
    a WAL record—say, due to corruption—before reaching the replication
    start point? In that case, the standby might fail to recover correctly
    because of missing WAL records, while a transaction waiting for
    synchronous replication may have already been acknowledged as committed.
    Wouldn't that lead to a serious problem?
    
    Regards,
    
    -- 
    Fujii Masao
    
    
    
    
  29. Re: Unnecessary delay in streaming replication due to replay lag

    Josef Šimánek <josef.simanek@gmail.com> — 2025-11-02T17:34:47Z

    On Sun, Nov 2, 2025 at 18:33, sunil s <sunilfeb26@gmail.com> wrote:
    >
    > Hello Hackers,
    >
    > PFA rebased patch due to the code changes done in upstream commit 63599896545c7869f7dd28cd593e8b548983d613.
    
    src/test/recovery/t/050_archive_enabled_standby.pl is missing the
    ending newline. Is that intentional?
    
    It can be seen at
    https://github.com/postgresql-cfbot/postgresql/commit/041e477fea9677fa6dee0736ffe4825f704c066e
    
    > The current status of the patch registered in Commit Fest is "Ready for Committer".
    >
    > Thanks & Regards,
    > Sunil S
    > Broadcom Inc
    
    
    
    
  30. Re: Unnecessary delay in streaming replication due to replay lag

    sunil s <sunilfeb26@gmail.com> — 2025-11-05T14:05:27Z

    > When this parameter is set to 'startup' or 'consistency', what happens
    > if replication begins early and the startup process fails to replay
    > a WAL record—say, due to corruption—before reaching the replication
    > start point? In that case, the standby might fail to recover correctly
    > because of missing WAL records,
    
    Let's compare the behavior with and without the patch.
    
    Without the patch:
    
    *Scenario 1:* With a large recovery_min_apply_delay (e.g., 2 hours)
    
    Even in this case, the flush acknowledgment for streamed WAL is sent, and
    the primary has already recycled those WAL files.
    If a corrupted record is encountered later during replay, those records
    can no longer be streamed.
    
    
    *Scenario 2:* With recovery_min_apply_delay = 0 or in normal standby
    operation
    
    In this case the restart_lsn is advanced based on flushPtr, allowing the
    primary to recycle the corresponding WAL files.
    
    If a corrupt record is encountered while replaying local WAL records,
    then streaming will also fail here, right?
    
    
    With this patch:
    
    Starting the WAL receiver early (say, at the consistent point) allows us
    to prefetch records earlier in the redo loop instead of waiting until we
    exhaust the locally available WAL.
    
    
    Even if the WAL receiver hadn't started early, those WAL segments would
    have been recycled, since the restart_lsn would have advanced.
    Therefore, the record corruption behavior is unchanged, but the benefit
    of this patch is reduced replay lag.
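
    The argument above rests on the slot's restart_lsn following the
    standby's flush position rather than its replay position. A toy model
    of that claim, not PostgreSQL code: Standby, restart_lsn_for,
    still_on_primary and the LSN values are made up for illustration, and
    WAL retention is reduced to "everything before restart_lsn may be
    recycled", ignoring segment boundaries and any other retention
    settings.

```python
from dataclasses import dataclass


@dataclass
class Standby:
    flush_lsn: int   # WAL durably written on the standby
    replay_lsn: int  # WAL already applied by the startup process


def restart_lsn_for(standby: Standby) -> int:
    # Assumption taken from the message above: the slot advances with the
    # flush position, independent of how far replay has progressed.
    return standby.flush_lsn


def still_on_primary(record_lsn: int, standby: Standby) -> bool:
    # Simplified model: the primary may recycle everything that precedes
    # restart_lsn, so only later records are guaranteed to be retained.
    return record_lsn >= restart_lsn_for(standby)


# Replay lags far behind flush, e.g. because of recovery_min_apply_delay.
lagging = Standby(flush_lsn=5000, replay_lsn=1200)
corrupt_record_lsn = 1500  # flushed on the standby, not yet replayed

print(still_on_primary(corrupt_record_lsn, lagging))
```

    In this model the record at LSN 1500 is no longer guaranteed to be on
    the primary, regardless of when the WAL receiver was started.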
    
    
       - Reduces replay lag when recovery_min_apply_delay is large, as
       reported in
       https://www.postgresql.org/message-id/201901301432.p3utg64hum27%40alvherre.pgsql
       [2].
       - Mitigates delay for standbys lagging due to network bandwidth or
       latency, or slow disk writes (HDD).
       - Faster recovery.
       - Currently, until the WAL receiver is started, the acknowledgment
       for a commit is not sent to the waiting transaction, since the WAL
       receiver is not running. With this change, the waiting transaction
       is unblocked as soon as we apply the record.
    
    
    In normal conditions, too, the slot is advanced based on flushPtr, even
    if the mode is remote_apply. We fixed a corruption scenario for a
    continuation record at the end of the last locally available segment.
    Previously we were starting at the last stage/corrupt record (like the
    continuation record [1]), but now we start much earlier.
    
    If the WAL record is still retained on the primary, then in case of a
    corrupt record we can restart the WAL receiver from an older LSN, one
    that is older than the LSN at which early streaming starts.
    The same mechanism is used on the standby when switching between WAL
    sources. I don't see any scenario where the new workflow would break
    existing behavior.
    Could you point out the specific case you're concerned about?
    Understanding that will help us refine the implementation.
    
    > while a transaction waiting for synchronous replication may have already
    > been acknowledged as committed.
    > Wouldn't that lead to a serious problem?
    
    Without the patch:
    
    If the synchronous replication mode is flush (on), then even with
    recovery_min_apply_delay set to a large value (e.g., 2 hours), the
    transaction is acknowledged as committed before the record is actually
    applied on the standby.
    
    If the mode is remote_apply, the primary waits until the record is applied
    on the standby, which includes waiting for the configured recovery delay.
    
    
    With the patch:
    
    The behavior remains the same with respect to synchronous_commit — it still
    depends on whether the mode is flush or remote_apply.
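
    The dependence on the mode can be written down as a small model. This
    is a sketch under stated assumptions, not code from PostgreSQL or the
    patch: commit_acknowledged is a hypothetical function, LSNs are plain
    integers, and only the three synchronous_commit levels that wait for
    the standby are modeled.

```python
def commit_acknowledged(mode: str, commit_lsn: int, write_lsn: int,
                        flush_lsn: int, replay_lsn: int) -> bool:
    """Return True once the standby has reported enough progress for `mode`."""
    # Which standby position each synchronous_commit level waits for.
    position_for_mode = {
        "remote_write": write_lsn,   # WAL written on the standby
        "on": flush_lsn,             # WAL flushed on the standby
        "remote_apply": replay_lsn,  # WAL replayed on the standby
    }
    if mode not in position_for_mode:
        raise ValueError(f"unsupported mode in this sketch: {mode}")
    return position_for_mode[mode] >= commit_lsn


# Standby has received and flushed the commit record but replay is delayed.
progress = dict(commit_lsn=1800, write_lsn=2000, flush_lsn=2000, replay_lsn=500)

print(commit_acknowledged("on", **progress))
print(commit_acknowledged("remote_apply", **progress))
```

    With flush-level waiting the commit is acknowledged while replay is
    still behind; with remote_apply it is not, whether or not the WAL
    receiver was started early.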
    
    
    So we can see a similar situation when recovery_min_apply_delay is set
    to a large value (e.g., 2 hours) or when apply is slow: all the WAL
    files are streamed but not yet replayed.
    
    *AFAIU this patch doesn't introduce any new behavior. In a normal
    situation where the WAL receiver is continuously streaming, we would
    have received those WAL segments anyway, without waiting for replay to
    finish, right?*
    
    The only difference is that we start the WAL receiver earlier in the
    recovery loop, which benefits us in several ways. On systems where
    replay is slow because of low-powered hardware or limited system
    resources, or where network bandwidth is low or disk writes are slow
    (HDD), the standby lags behind the primary.
    
    Prefetching WAL records early avoids WAL build-up on the primary, which
    helps avoid running out of disk space and also gives faster standby
    recovery.
    Faster recovery means faster application availability and lower
    downtime when synchronous commit is enabled.
    
    
    > src/test/recovery/t/050_archive_enabled_standby.pl is missing the
    > ending newline. Is that intentional?
    Thanks for reporting. Fixed in the new rebased patch.
    
    Reference:
    [1]
    https://github.com/postgres/postgres/commit/0668719801838aa6a8bda330ff9b3d20097ea844
    [2]
    https://www.postgresql.org/message-id/201901301432.p3utg64hum27%40alvherre.pgsql
    
    Thanks & Regards,
    Sunil S
    
  31. Re: Unnecessary delay in streaming replication due to replay lag

    sunil s <sunilfeb26@gmail.com> — 2025-11-23T06:42:23Z

    Hi,
    
    Attaching the rebased patch after resolving some recent conflicts.
    
    Thanks & Regards,
    Sunil Seetharama