Re: Implement waiting for wal lsn replay: reloaded

Alexander Korotkov <aekorotkov@gmail.com> — 2025-12-25T16:34:02Z
On Thu, Dec 25, 2025 at 2:52 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> On Thu, Dec 25, 2025 at 7:13 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >
> > Hi, Xuneng!
> >
> > On Mon, Dec 22, 2025 at 9:57 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > >
> > > On Sun, Dec 21, 2025 at 12:37 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > >
> > > > Hi Alexander,
> > > >
> > > > Thanks for your feedback!
> > > >
> > > > > I see that we can't specify WAIT_LSN_TYPE_PRIMARY_FLUSH by setting
> > > > > mode parameter.  Should we allow this?
> > > >
> > > > I think this constraint could be relaxed if needed. I was previously
> > > > unsure about the use cases.
> > >
> > > Flush mode on the primary seems useful when synchronous_commit is set
> > > to off [1]. In that mode, a transaction on the primary may return success
> > > before its WAL is durably flushed to disk, trading durability for
> > > lower latency. A “wait for primary flush” operation provides an
> > > explicit durability barrier for cases where applications or tools
> > > occasionally need stronger guarantees.
> > >
> > > [1] https://postgresqlco.nf/doc/en/param/synchronous_commit/
> > >
> > > > > If we allow specifying WAIT_LSN_TYPE_PRIMARY_FLUSH, should it be
> > > > > separate mode value or the same with WAIT_LSN_TYPE_STANDBY_FLUSH?  In
> > > > > principle, we could encode both as just 'flush' mode, and detect which
> > > > > WaitLSNType to pick by checking if recovery is in progress.  However,
> > > > > how should we then react to unreached flush location after standby
> > > > > promotion (technically it could be still reached but on the different
> > > > > timeline)?
> > > > >
> > > >
> > > > Technically, we can use 'flush' mode to specify WAIT FOR behavior in
> > > > both primary and replica. Currently, WAIT FOR commands error out if
> > > > promotion occurs, because either the requested LSN type does not exist
> > > > on the primary, or we do not yet have the infrastructure to support
> > > > continuing the wait. If we allow waiting for flush on the primary as a
> > > > user-visible command and the wake-up calls for flush in primary are
> > > > introduced, the question becomes whether we should still abort the
> > > > wait on promotion, or continue waiting—as you noted—given that the
> > > > target LSN might still be reached, albeit on a different timeline. The
> > > > question behind this might be: do users care about, and should they be
> > > > made aware of, the state change of the server while waiting? If so, we
> > > > had better stop waiting and report the error. In this case, I am
> > > > inclined to break the unified flush mode into something like
> > > > primary_flush/standby_flush mode and
> > > > WAIT_LSN_TYPE_PRIMARY_FLUSH/WAIT_LSN_TYPE_STANDBY_FLUSH respectively.
> > > >
> > >
> > > After further consideration, it also seems reasonable to use a single,
> > > unified flush mode that works on both primary and standby servers,
> > > provided its semantics are clearly documented to avoid the potential
> > > confusion on failure. I don’t have a strong preference between these
> > > two and would be interested in your thoughts.
> > >
> > > If a standby is promoted while a session is waiting, the command
> > > should abort and return an error (or report “not in recovery” when
> > > using NO_THROW). At that point, the meaning of the LSN being waited
> > > for may have changed due to the timeline switch and the transition
> > > from standby to primary. An LSN such as 0/5000000 on TLI 2 can
> > > represent entirely different WAL content from 0/5000000 on TLI 1.
> > > Allowing the wait to silently continue across promotion risks giving
> > > users a false sense of safety—for example, interpreting “wait
> > > completed” as “the original data is now durable,” which would no
> > > longer be true.
> >
> > Agreed, but there is still a risk that promotion happens after the user
> > sends the query but before we start to wait.  In this case we would still
> > silently start to wait on the primary, while the user probably meant to
> > wait on the replica.  Probably it would be safer to have separate
> > user-visible modes for waiting on the primary and on the replica?
> >
>
> Thanks for your thoughts. You're right about the race condition. If
> promotion happens between query submission and execution, a unified
> 'flush' mode could silently switch semantics without the user knowing.
> Separate modes like 'standby_flush' and 'primary_flush' would make
> user intent explicit and catch this case with an error, which is
> safer. Do these two terms look reasonable to you, or would you suggest
> better names? If they look ok, I plan to update the implementation to
> use these two modes.

Thank you, Xuneng.  'standby_flush' and 'primary_flush' look good to
me.  Please go ahead.  I think we should name the other modes
'standby_write' and 'standby_replay' for the sake of consistency.
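
To illustrate why explicit mode names close the race discussed above,
here is a minimal standalone sketch, not the patch itself.  It assumes
the four mode names proposed in this thread; the enum names, the
function sketch_resolve_mode() and the in_recovery argument are all
hypothetical and only stand in for whatever the real implementation
uses to check the recovery state when the command executes.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical wait types, one per user-visible mode name. */
typedef enum SketchWaitType
{
    SKETCH_STANDBY_REPLAY,
    SKETCH_STANDBY_WRITE,
    SKETCH_STANDBY_FLUSH,
    SKETCH_PRIMARY_FLUSH
} SketchWaitType;

/* Hypothetical outcome of resolving a mode against the server state. */
typedef enum SketchModeStatus
{
    SKETCH_MODE_OK,
    SKETCH_MODE_UNKNOWN,            /* mode name not recognized */
    SKETCH_MODE_NOT_IN_RECOVERY,    /* standby_* requested on a primary */
    SKETCH_MODE_IN_RECOVERY         /* primary_* requested on a standby */
} SketchModeStatus;

/*
 * Map a mode name to a wait type and verify that it matches the server
 * state observed at execution time.  Because each name implies exactly one
 * role, a promotion between query submission and execution shows up as an
 * error instead of a silent change of semantics.
 */
static SketchModeStatus
sketch_resolve_mode(const char *mode, bool in_recovery,
                    SketchWaitType *type_out)
{
    static const struct
    {
        const char     *name;
        SketchWaitType  type;
        bool            needs_recovery;
    } modes[] = {
        {"standby_replay", SKETCH_STANDBY_REPLAY, true},
        {"standby_write", SKETCH_STANDBY_WRITE, true},
        {"standby_flush", SKETCH_STANDBY_FLUSH, true},
        {"primary_flush", SKETCH_PRIMARY_FLUSH, false},
    };

    assert(mode != NULL && type_out != NULL);

    for (size_t i = 0; i < sizeof(modes) / sizeof(modes[0]); i++)
    {
        if (strcmp(mode, modes[i].name) != 0)
            continue;
        if (modes[i].needs_recovery && !in_recovery)
            return SKETCH_MODE_NOT_IN_RECOVERY;
        if (!modes[i].needs_recovery && in_recovery)
            return SKETCH_MODE_IN_RECOVERY;
        *type_out = modes[i].type;
        return SKETCH_MODE_OK;
    }
    return SKETCH_MODE_UNKNOWN;
}
```

With this shape, a session that asked for 'standby_flush' and reaches
execution after promotion gets SKETCH_MODE_NOT_IN_RECOVERY rather than
a wait on the primary, and a bare 'flush' is rejected as unknown.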

------
Regards,
Alexander Korotkov
Supabase