Thread

  1. Re: Adding REPACK [concurrently]

    Amit Kapila <amit.kapila16@gmail.com> — 2026-05-08T12:25:12Z

    On Wed, May 6, 2026 at 1:55 PM Antonin Houska <ah@cybertec.at> wrote:
    >
    > Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
    >
    > > On 2026-May-05, Antonin Houska wrote:
    > >
    > > > However, I failed to notice that COMMIT record of
    > > > a transaction listed in the xl_running_xacts WAL record is not guaranteed to
    > > > follow the xl_running_xacts record in WAL. In other words, even if
    > > > xl_running_xacts is created before a COMMIT record of the contained
    > > > transaction, it may end up at higher LSN in WAL. So the cleanup I relied on
    > > > might not take place.
    > >
    > > That's pretty bad news.
    > >
    > > > I've got no good idea how to fix that.
    >
    > One idea occurred to me yet, effectively it's just a cleanup. Part of it was
    > already proposed [1].
    >
    
    Some issues/inefficiencies regarding this fix and base code related to
    db-specific snapshots built during decoding:
    *
    After fix, whenever a db-specific decoder sees a cluster-wide
    xl_running_xacts record, it unconditionally calls
    LogStandbySnapshot(MyDatabaseId) and returns. This triggers for every
    cluster-wide record the decoder encounters (including post snapbuild's
    CONSISTENT state) , for the entire duration of the decoding session.
    LogStandbySnapshot acquires ProcArrayLock + XidGenLock, calls
    GetRunningTransactionData, and writes WAL. With N active db-specific
    decoding sessions, each cluster-wide record now triggers N additional
    WAL writes.
    
    Additionally, LogStandbySnapshot also logs AccessExclusiveLocks before
    the running_xacts record. Physical standbys skip db-specific
    XLOG_RUNNING_XACTS records via standby_redo(), but they do process the
    preceding XLOG_STANDBY_LOCK records. The same locks may already have
    been logged in the most recent cluster-wide snapshot. Physical
    standbys could end up processing these lock records twice which may
    not be harmful because I think we avoid re-acquiring the lock but
    still it is a new overhead in the system.
    
    *
    When a cluster-wide running_xacts record arrives:
    SnapBuildProcessRunningXacts calls LogStandbySnapshot and returns
    early. ReorderBufferAbortOld is called, but with the cluster-wide
    oldestRunningXid, which could lag far behind the db-specific value
    (due to a long-running transaction in another database).
    When a db-specific record arrives: SnapBuildProcessRunningXacts
    processes it and advances builder->xmin with the db-specific (more
    current) oldestRunningXid. But ReorderBufferAbortOld is NOT called for
    db-specific records. This means the reorder buffer is cleaned up using
    a conservative, potentially very old, cluster-wide oldestRunningXid,
    even though builder->xmin has already advanced much further. The
    reorder buffer holds stale entries longer than necessary, increasing
    memory pressure.
    
    *
    I also see a design level problem with plugins that have
    need_shared_catalogs=false and use failover slots. IIUC, the
    db-specific optimization was designed around a live decoding session
    on the primary which can emit and immediately read its own db-specific
    records in the WAL stream to reach consistent state. The
    LogicalSlotAdvanceAndCheckSnapState path used by slotsync has a
    bounded WAL window and cannot exploit the feedback loop, making the
    two mechanisms fundamentally incompatible. I know the slot created by
    pgrepack doesn't enable failover option but we have not even added any
    guards or thought about db-specific snapbuilds with other parts of the
    system that rely on cluster-wide running_xact records, so there could
    be more problems which we don't see with the current set of tests.
    
    -- 
    With Regards,
    Amit Kapila.