Thread

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

Matthias van de Meent <boekewurm+postgres@gmail.com> — 2025-12-02T12:02:32Z
On Fri, 28 Nov 2025 at 20:08, Hannu Krosing <hannuk@google.com> wrote:
>
> On Fri, Nov 28, 2025 at 7:31 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> >
> > On Fri, 28 Nov 2025 at 18:58, Hannu Krosing <hannuk@google.com> wrote:
> > >
> > > On Fri, Nov 28, 2025 at 5:58 PM Matthias van de Meent
> > > <boekewurm+postgres@gmail.com> wrote:
> > > >
> > > ...
> > > > I'm a bit worried, though, that LR may lose updates due to commit
> > > > order differences between WAL and PGPROC. I don't know how that's
> > > > handled in logical decoding, and can't find much literature about it
> > > > in the repo either.
> > >
> > > Now the reference to logical decoding made me think that maybe the real
> > > fix for CIC would be to leverage logical decoding for the 2nd pass of
> > > CIC and not worry about in-page visibilities at all.
> >
> > -1: Requiring the logical decoding system just to reindex an index
> > without O(tablesize) lock time adds too much overhead, and removes
> > features we currently have (CIC on unlogged tables). wal_level=logical
> > *must not* be required for these tasks if we can at all avoid it.
> > I'm also not sure whether logical decoding gets access to the HOT
> > information of the updated tuples involved, and therefore whether the
> > index build can determine whether it must or can't insert the tuple.
>
> There are more and more cases (not just CIC here) where using logical
> decoding would be the most efficient solution, so why not start
> improving it instead of complicating the system in various
> places?

Because Logical Replication implies Replication, which in turn implies
(more) WAL generation. And if an unlogged table still generates WAL in
DML, then it's not really an unlogged table, in which case we've
broken a promise to the user [see: CREATE TABLE's UNLOGGED
description]. Adding features to WAL which replicas can't (mustn't!)
do anything with is always going to be bloat in my view.

I also don't know how you measure efficiency, but I don't consider LR
to be particularly efficient in any metric, apart from maybe "wasting
DBA time with abandoned slots". LR parses WAL, which is a conveyor
belt with _all_ changes, and given that WAL has no real upper boundary
on how large it can grow, LR would have to touch an unbounded amount
of data to get only the changes it needs. We already have ways to get
those changes without parsing an unbounded amount of data, so why not
use that instead?

> We could even start selectively logging UNLOGGED and TEMP tables when
> we start CIC if CIC has enough upsides.

Which is why I hate this idea. There can't be enough upsides to
counteract the enormous downside of increasing the size of the data we
need to ship to replicas when the replicas can't ever use that data.
Replicas were able to use the added data of LR before 17 when they
were promoted, so it wasn't terrible to include more data in the WAL,
but what's proposed here is to add data that literally nobody on the
replica can use; wasting WAL storage and replication bandwidth.

Lastly, LR requires replication slots, which are very expensive to
maintain. Currently, you can do CIC/RIC with any number of backends
you want up to max_backends, but this doesn't work if you'd want to
use LR, as you'd now need to have max_replication_slots proportional
to max_connections.

Again, -1 on LR for UNLOGGED/TEMP tables. Or LR in general when the
user explicitly asked for `wal_level NOT IN ('logical')`.

> > I don't think logical decoding is sufficient, because we don't know
> > which tuples were already inserted into the index by their own
> > backends, so we don't know which tuples' index entries we must skip.
>
> The premise of pass2 in CIC is that we collect all the rows that were
> inserted after CIC started for which we are not 100% sure that they
> are inserted in the index. We can only be sure they are inserted for
> transactions started after pass1 completed and the index became
> visible and available for inserts.

I'm not sure this is true; wouldn't it be possible for a transaction
to start before the index became visible, but because of READ
COMMITTED get access to the index after one statement? I.e. two
statements that straddle the index becoming visible? That way, a
transaction could start to see the index after it first modified some
tuples; creating a hybrid visibility state.
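
To make the scenario in question concrete, here is a toy Python model of
it. It is only a sketch under a loud assumption: that the set of indexes a
statement maintains is decided once, at statement start. ToyTable,
run_statement and straddling_transaction are hypothetical names, not
PostgreSQL code, and the model says nothing about whether CIC's waits
actually allow this interleaving.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ToyTable:
    # Flipped to True when the new index becomes visible to other sessions.
    index_visible: bool = False
    heap: List[str] = field(default_factory=list)
    index: Set[str] = field(default_factory=set)


def run_statement(table: ToyTable, row: str) -> None:
    # ASSUMPTION: the index list is sampled once, at statement start.
    maintain_index = table.index_visible
    table.heap.append(row)
    if maintain_index:
        table.index.add(row)


def straddling_transaction(table: ToyTable) -> List[str]:
    """Run two statements of one transaction around the visibility flip.

    Returns the rows this transaction wrote that have no index entry.
    """
    run_statement(table, "row-a")   # index not yet visible: heap only
    table.index_visible = True      # another session exposes the index
    run_statement(table, "row-b")   # index visible: heap and index
    return [row for row in table.heap if row not in table.index]
```

In this model the transaction leaves "row-a" without an index entry and
"row-b" with one, which is the hybrid state described above.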

> And we do not care about HOT update chains during normal CREATE INDEX
> or the first pass of CIC - we just index what is visible NOW, with no
> regard to whether the tuple is at the end of a HOT update chain.

We do care about HOT update chains, because the TID of the HOT root is
indexed, and not necessarily the TID of the scanned tuple.
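
As a rough illustration of that point, the following Python sketch models
a single heap page and maps each tuple to the root of its HOT chain. It is
a toy: ToyTuple, root_offsets and index_tid are hypothetical names, the
page is just a dict keyed by line pointer offset, and none of it mirrors
PostgreSQL's actual structures.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ToyTuple:
    offset: int                        # line pointer number within the page
    heap_only: bool = False            # True if reachable only via a HOT chain
    next_offset: Optional[int] = None  # next version on the same page, if any


def root_offsets(page: Dict[int, ToyTuple]) -> Dict[int, int]:
    """Map every tuple offset on the page to the offset of its chain root."""
    roots: Dict[int, int] = {}
    for tup in page.values():
        if tup.heap_only:
            continue  # not a root; it is reached by walking some chain
        cur: Optional[ToyTuple] = tup
        while cur is not None:
            roots[cur.offset] = tup.offset
            nxt = page.get(cur.next_offset) if cur.next_offset is not None else None
            # Only heap-only successors belong to this HOT chain.
            cur = nxt if (nxt is not None and nxt.heap_only) else None
    return roots


def index_tid(block: int, offset: int, page: Dict[int, ToyTuple]) -> Tuple[int, int]:
    """TID an index entry would carry for the tuple scanned at (block, offset)."""
    return (block, root_offsets(page)[offset])


# Offsets 1 -> 2 -> 3 form one HOT chain; offset 4 is a standalone tuple.
example_page = {
    1: ToyTuple(1, heap_only=False, next_offset=2),
    2: ToyTuple(2, heap_only=True, next_offset=3),
    3: ToyTuple(3, heap_only=True),
    4: ToyTuple(4),
}
```

Scanning the visible tuple at offset 3 of block 7 gives
index_tid(7, 3, example_page) == (7, 1): the entry points at the root, not
at the tuple that was scanned.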

> > PS. I think the same should be true for REPACK CONCURRENTLY, but
> > that's a new command with yet-to-be-determined semantics, unlike CIC
> > which has been part of PG for 6 years.
>
> CIC has been around way longer, since 8.2 released in 2006, so more
> like 20 years :)

Ah, so RIC wasn't introduced together with CIC? TIL.

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)