Thread

  1. Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

    Matthias van de Meent <boekewurm+postgres@gmail.com> — 2025-12-02T12:02:32Z

    On Fri, 28 Nov 2025 at 20:08, Hannu Krosing <hannuk@google.com> wrote:
    >
    > On Fri, Nov 28, 2025 at 7:31 PM Matthias van de Meent
    > <boekewurm+postgres@gmail.com> wrote:
    > >
    > > On Fri, 28 Nov 2025 at 18:58, Hannu Krosing <hannuk@google.com> wrote:
    > > >
    > > > On Fri, Nov 28, 2025 at 5:58 PM Matthias van de Meent
    > > > <boekewurm+postgres@gmail.com> wrote:
    > > > >
    > > > ...
    > > > > I'm a bit worried, though, that LR may lose updates due to commit
    > > > > order differences between WAL and PGPROC. I don't know how that's
    > > > > handled in logical decoding, and can't find much literature about it
    > > > > in the repo either.
    > > >
    > > > Now the reference to logical decoding made me think that maybe the real
    > > > fix for CIC would be to leverage logical decoding for the 2nd pass of
    > > > CIC and not worry about in-page visibilities at all.
    > >
    > > -1: Requiring the logical decoding system just to reindex an index
    > > without O(tablesize) lock time adds too much overhead, and removes
    > > features we currently have (CIC on unlogged tables). wal_level=logical
    > > *must not* be required for these tasks if we can at all avoid it.
    > > I'm also not sure whether logical decoding gets access to the HOT
    > > information of the updated tuples involved, and therefore whether the
    > > index build can determine whether it must or can't insert the tuple.
    >
    > There are more and more cases (not just CIC here) where using logical
    > decoding would be the most efficient solution, so why not start
    > improving it instead of complicating the system in various places?
    
    Because Logical Replication implies Replication, which in turn implies
    (more) WAL generation. And if an unlogged table still generates WAL in
    DML, then it's not really an unlogged table, in which case we've
    broken a promise to the user [see: CREATE TABLE's UNLOGGED
    description]. Adding features to WAL which replicas can't (mustn't!)
    do anything with is always going to be bloat in my view.
    
    I also don't know how you measure efficiency, but I don't consider LR
    to be particularly efficient in any metric, apart from maybe "wasting
    DBA time with abandoned slots". LR parses WAL, which is a conveyor
    belt with _all_ changes, and given that WAL has no real upper boundary
    on how large it can grow, LR would have to touch an unbounded amount
    of data to get only the changes it needs. We already have ways to get
    those changes without parsing an unbounded amount of data, so why not
    use that instead?
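
    To illustrate the cost argument, here is a toy sketch (not a WAL
    parser, and not how logical decoding is implemented): collecting the
    changes for one relation from a shared stream means reading every
    record in that stream. The function name and the record shape are
    hypothetical.

```python
def changes_for_relation(wal_records, relation):
    """Return (matching changes, number of records scanned).

    `wal_records` is any iterable of (relation_name, change) pairs standing
    in for a decoded WAL stream; this is a toy model, not a WAL parser.
    """
    matching = []
    scanned = 0
    for rel, change in wal_records:
        scanned += 1  # every record is read, relevant or not
        if rel == relation:
            matching.append(change)
    return matching, scanned
```

    The scanned count grows with total WAL volume, not with the number
    of changes to the one table being indexed.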
    
    > We could even start selectively logging UNLOGGED and TEMP tables when
    > we start CIC if CIC has enough upsides.
    
    Which is why I hate this idea. There can't be enough upsides to
    counteract the enormous downside of increasing the size of the data we
    need to ship to replicas when the replicas can't ever use that data.
    Replicas were able to use the added data of LR before 17 when they
    were promoted, so it wasn't terrible to include more data in the WAL,
    but what's proposed here is to add data that literally nobody on the
    replica can use; wasting WAL storage and replication bandwidth.
    
    Lastly, LR requires replication slots, which are very expensive to
    maintain. Currently, you can do CIC/RIC with any number of backends
    you want up to max_backends, but this doesn't work if you'd want to
    use LR, as you'd now need to have max_replication_slots proportional
    to max_connections.
    
    Again, -1 on LR for UNLOGGED/TEMP tables. Or LR in general when the
    user explicitly asked for `wal_level NOT IN ('logical')`.
    
    > > I don't think logical decoding is sufficient, because we don't know
    > > which tuples were already inserted into the index by their own
    > > backends, so we don't know which tuples' index entries we must skip.
    >
    > The premise of pass2 in CIC is that we collect all the rows that were
    > inserted after CIC started for which we are not 100% sure that they
    > are inserted in the index. We can only be sure they are inserted for
    > transactions started after pass1 completed and the index became
    > visible and available for inserts.
    
    I'm not sure this is true; wouldn't it be possible for a transaction
    to start before the index became visible, but because of READ
    COMMITTED get access to the index after one statement? I.e. two
    statements that straddle the index becoming visible? That way, a
    transaction could start to see the index after it first modified some
    tuples; creating a hybrid visibility state.
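
    A toy model of the scenario I'm asking about (an illustration only;
    it says nothing about whether PostgreSQL's lock waits actually allow
    it): a transaction that re-reads the catalog at each statement
    maintains the index for its second statement but not its first. All
    names here are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy catalog: tracks whether the new index accepts inserts yet."""
    index_ready_for_inserts: bool = False


@dataclass
class Transaction:
    """Toy READ COMMITTED transaction: re-reads the catalog per statement."""
    catalog: Catalog
    indexed_rows: list = field(default_factory=list)
    unindexed_rows: list = field(default_factory=list)

    def insert(self, row):
        # Each statement sees the catalog state current at statement start.
        if self.catalog.index_ready_for_inserts:
            self.indexed_rows.append(row)
        else:
            self.unindexed_rows.append(row)


def straddle():
    catalog = Catalog()
    # The transaction starts before the index is visible.
    txn = Transaction(catalog)
    # Statement 1: no index maintenance happens.
    txn.insert("row-1")
    # The concurrent index build makes the index visible for inserts.
    catalog.index_ready_for_inserts = True
    # Statement 2: the same transaction now maintains the index.
    txn.insert("row-2")
    return txn.unindexed_rows, txn.indexed_rows
```

    In this model the one transaction ends up with one row the index
    never saw and one row it did, which is the hybrid state in question.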
    
    > And we do not care about HOT update chains during normal CREATE INDEX
    > or first pass of CIC - we just index what is visible NOW with no regard
    > for whether the tuple is at the end of a HOT update chain.
    
    We do care about HOT update chains, because the TID of the HOT root is
    indexed, and not necessarily the TID of the scanned tuple.
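
    As a minimal sketch of that point (hypothetical helper names, TIDs
    modelled as (block, offset) pairs, not the actual heapam code): the
    scan may see a heap-only tuple, but the index entry has to carry the
    TID of the chain's root.

```python
def hot_root_tids(chains):
    """Map every tuple TID to the TID of its HOT chain root.

    `chains` is a list of HOT chains, each a list of (block, offset) TIDs
    ordered from root to latest version.
    """
    root_of = {}
    for chain in chains:
        root = chain[0]
        for tid in chain:
            root_of[tid] = root
    return root_of


def index_entry_for(visible_tid, root_of):
    # The index entry points at the chain root, not the scanned tuple.
    # A tuple that is in no recorded chain is treated as its own root.
    return root_of.get(visible_tid, visible_tid)
```

    So a scan that finds the visible version at (0, 5) in a chain rooted
    at (0, 1) must index (0, 1), which is why the build needs the HOT
    information and not just the visible tuple.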
    
    > > PS. I think the same should be true for REPACK CONCURRENTLY, but
    > > that's a new command with yet-to-be-determined semantics, unlike CIC
    > > which has been part of PG for 6 years.
    >
    > CIC has been around way longer, since 8.2 released in 2006, so more
    > like 20 years :)
    
    Ah, so RIC wasn't introduced together with CIC? TIL.
    
    Kind regards,
    
    Matthias van de Meent
    Databricks (https://www.databricks.com)