Thread

  1. Re: Implement waiting for wal lsn replay: reloaded

    Alexander Korotkov <aekorotkov@gmail.com> — 2025-12-25T16:34:02Z

    On Thu, Dec 25, 2025 at 2:52 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
    > On Thu, Dec 25, 2025 at 7:13 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
    > >
    > > Hi, Xuneng!
    > >
    > > On Mon, Dec 22, 2025 at 9:57 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
    > > >
    > > > On Sun, Dec 21, 2025 at 12:37 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
    > > > >
    > > > > Hi Alexander,
    > > > >
    > > > > Thanks for your feedback!
    > > > >
    > > > > > I see that we can't specify WAIT_LSN_TYPE_PRIMARY_FLUSH by setting
    > > > > > the mode parameter.  Should we allow this?
    > > > >
    > > > > I think this constraint could be relaxed if needed. I was previously
    > > > > unsure about the use cases.
    > > >
    > > > Flush mode on the primary seems useful when synchronous_commit is set
    > > > to off [1]. In that mode, a transaction on the primary may return success
    > > > before its WAL is durably flushed to disk, trading durability for
    > > > lower latency. A “wait for primary flush” operation provides an
    > > > explicit durability barrier for cases where applications or tools
    > > > occasionally need stronger guarantees.
    > > >
    > > > [1] https://postgresqlco.nf/doc/en/param/synchronous_commit/
    > > >
    > > > > > If we allow specifying WAIT_LSN_TYPE_PRIMARY_FLUSH, should it be a
    > > > > > separate mode value or the same as WAIT_LSN_TYPE_STANDBY_FLUSH?  In
    > > > > > principle, we could encode both as just 'flush' mode, and detect which
    > > > > > WaitLSNType to pick by checking if recovery is in progress.  However,
    > > > > > how should we then react to an unreached flush location after standby
    > > > > > promotion (technically it could still be reached, but on a different
    > > > > > timeline)?
    > > > > >
    > > > >
    > > > > Technically, we can use 'flush' mode to specify WAIT FOR behavior in
    > > > > both primary and replica. Currently, WAIT FOR commands error out if
    > > > > promotion occurs because either the requested LSN type does not exist
    > > > > on the primary, or we do not yet have the infrastructure to support
    > > > > continuing the wait. If we allow waiting for flush on the primary as a
    > > > > user-visible command and the wake-up calls for flush in primary are
    > > > > introduced, the question becomes whether we should still abort the
    > > > > wait on promotion, or continue waiting—as you noted—given that the
    > > > > target LSN might still be reached, albeit on a different timeline. The
    > > > > question behind this might be: do users care about, and should they be
    > > > > aware of, a change in the server's state while waiting? If they do, then
    > > > > we had better stop waiting and report an error. In that case, I am
    > > > > inclined to split the unified flush mode into something like
    > > > > primary_flush/standby_flush modes, mapped to
    > > > > WAIT_LSN_TYPE_PRIMARY_FLUSH/WAIT_LSN_TYPE_STANDBY_FLUSH respectively.
    > > > >
    > > >
    > > > After further consideration, it also seems reasonable to use a single,
    > > > unified flush mode that works on both primary and standby servers,
    > > > provided its semantics are clearly documented to avoid the potential
    > > > confusion on failure. I don’t have a strong preference between these
    > > > two and would be interested in your thoughts.
    > > >
    > > > If a standby is promoted while a session is waiting, the command
    > > > should abort and return an error (or report “not in recovery” when
    > > > using NO_THROW). At that point, the meaning of the LSN being waited
    > > > for may have changed due to the timeline switch and the transition
    > > > from standby to primary. An LSN such as 0/5000000 on TLI 2 can
    > > > represent entirely different WAL content from 0/5000000 on TLI 1.
    > > > Allowing the wait to silently continue across promotion risks giving
    > > > users a false sense of safety—for example, interpreting “wait
    > > > completed” as “the original data is now durable,” which would no
    > > > longer be true.
    > >
    > > Agreed, but there is still a risk that promotion happens after the user
    > > sends the query but before we start to wait.  In this case we would still
    > > silently start to wait on the primary, while the user probably meant to
    > > wait on a replica.  It would probably be safer to have separate
    > > user-visible modes for waiting on the primary and on a replica?
    > >
    >
    > Thanks for your thoughts. You're right about the race condition. If
    > promotion happens between query submission and execution, a unified
    > 'flush' mode could silently switch semantics without the user knowing.
    > Separate modes like 'standby_flush' and 'primary_flush' would make
    > user intent explicit and catch this case with an error, which is
    > safer. Do these two terms look reasonable to you, or would you suggest
    > better names? If they look ok, I plan to update the implementation to
    > use these two modes.
    
    Thank you, Xuneng.  'standby_flush' and 'primary_flush' look good to
    me.  Please go ahead.  I think we should name the other modes
    'standby_write' and 'standby_replay' for the sake of consistency.
    
    ------
    Regards,
    Alexander Korotkov
    Supabase
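
The race discussed in the thread can be illustrated with a small model. This is only a sketch in Python, not PostgreSQL code: the functions `resolve_unified_flush` and `resolve_explicit`, the `WaitModeError` exception, and the `in_recovery` flag are hypothetical names introduced for illustration. Only the mode names ('standby_flush', 'primary_flush', 'standby_write', 'standby_replay') come from the thread. It shows why a unified 'flush' mode silently changes meaning after promotion, while explicit modes turn the same situation into an error.

```python
class WaitModeError(Exception):
    """Hypothetical error: the requested mode does not match the server's role."""


# Mode names as proposed in the thread; the role grouping is an assumption.
STANDBY_MODES = {"standby_replay", "standby_write", "standby_flush"}
PRIMARY_MODES = {"primary_flush"}


def resolve_unified_flush(in_recovery: bool) -> str:
    """Unified 'flush': the wait type is chosen from the server's current role.

    If promotion happens between query submission and execution, the caller
    gets a primary wait without being told.
    """
    return "standby_flush" if in_recovery else "primary_flush"


def resolve_explicit(mode: str, in_recovery: bool) -> str:
    """Explicit modes: the caller states the intent, a role mismatch is an error."""
    if mode in STANDBY_MODES:
        if not in_recovery:
            raise WaitModeError(f"mode '{mode}' requires a server in recovery")
        return mode
    if mode in PRIMARY_MODES:
        if in_recovery:
            raise WaitModeError(f"mode '{mode}' requires a primary server")
        return mode
    raise ValueError(f"unknown mode '{mode}'")


if __name__ == "__main__":
    # The user submits the query against a standby, but promotion completes
    # before execution, so in_recovery is already False when the wait starts.
    print(resolve_unified_flush(in_recovery=False))  # silently waits as primary
    try:
        resolve_explicit("standby_flush", in_recovery=False)
    except WaitModeError as exc:
        print("error:", exc)  # the mismatch is reported to the user
```

With explicit modes the role check happens at execution time, so the promotion race surfaces as an error rather than as a wait with different semantics.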