Thread

Re: [PATCH] Re: Proposal to Enable/Disable Index using ALTER INDEX

Robert Treat <rob@xzilla.net> — 2025-06-07T13:35:33Z
On Fri, Jun 6, 2025 at 8:04 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Fri, 6 Jun 2025 at 14:32, Robert Treat <rob@xzilla.net> wrote:
> > In production, you aren't watching to see what happen with pg_stat_all_indexes, because you will first be watching pg_stat_activity to see if the plans have flipped in some way that leads to an overloaded server (extra latency, poor caching effects, extra buffers usage, etc). And the replicated bit? Sadly someone launched some big DML operation so you're waiting for that to finish so the "quick rollback" can actually get to those other servers.
>
> I think you've misunderstood when you'd be looking at
> pg_stat_all_indexes. The time when you'd want to look at
> pg_stat_all_indexes is *before* you DROP INDEX and before you ALTER
> TABLE INVISIBLE the index. What you'd likely want to look for there
> are indexes that have the last_idx_scan set to something far in the
> past or set to NULL.
>

I guess you have never heard of the TREAT method of index management? :-D
- Test for duplicate indexes
- Reindex bloated indexes
- Eliminate unused indexes
- Add missing indexes
- Tune indexes for generic queries

The easy part of figuring out what to change, the hard part
(sometimes) is getting those changes into production safely; that's
the part I am focused on.

> I'm curious to know if you've ever had to drop an index out of
> production before? What did you think about when you'd just typed the
> DROP INDEX command and were contemplating your future? How long did
> you pause before pressing [Enter]?
>

ROFL... Uh... yes, I have had to do it at least a few times.

So, years ago I used to say things like "I wish we had a way to make
indexes invisible like they do in Oracle" on the regular; but as I
worked through several different implementations and their potential
effects, and had more and more exposure to more demanding Postgres
installations, my thinking evolved. I spoke with Sami a bit about this
off-list and he walked me through some of the Oracle documentation on
this (I had, at best, forgot the specifics), which I think was helpful
to better understand some of the allure of the alter index/guc method
for many people who are used to it (and this current version of the
implementation is very Oracle like), but it also crystalized my
feeling that an Oracle-style implementation would be a red herring
that can keep us from a better solution.

> Can you list your proposed series of steps you'd recommend to a DBA
> wishing to remove an index, assuming this feature exists in core as
> you'd like it to?
>

Well, the series of steps differs depending on the nature of the
system being managed. If you are running on a single node with normal
traffic and resources, you just set the GUC to include the index you
want to be invisible, wait for a few days (maybe no one runs monthly
reports on this system?), take a quick look at your monitoring/stats
to make sure things seem copacetic, and then you drop the index and
reset the GUC.

But of course the people who I am most worried about are the ones who
are operating on high scale, high transaction, high connection,
"mission critical" systems... ie. people operating in high risk
environments, where things can go very bad very fast. Where safety
considerations are a critical part of every deployment.

In that type of environment, the GUC-only method enables you to
control changes at very precise levels, so you can do things like:
- run it ad-hoc at the session level to confirm that the explain plans
you get in production match your expectations.
- you can stay ad-hoc at the session level and run explain analyze and
confirm acceptable performance within your workload, and see what kind
of buffer impact you are going to have (typically overlooked, but a
potential landmine for outages, but I'll come back to this)
- because we are operating at the session level, we can then add this
on a per query basis at the application level, and in really high
traffic scenarios, you can use canary releases and/or feature flags to
ramp up those new queries into the live system.
- depending on how much risk you are concerned about, you can use this
session level method across queries individually, or at some point
roll it up to a user/application level. And again, we can roll it out
to different users at different times if you want.
- at some point when you feel confident that you have covered enough
angles, you set the GUC globally and let that marinate for a few more
weeks as needed.

And the funny thing is, at this point, once you have the guc put in
globally, and it's run for some number of weeks or months and everyone
is confident, you don't actually need the ALTER INDEX part any more;
you can just drop the index and be done with it. Now of course if you
aren't running at this kind of scale or don't have this level of risk,
you can speed run this a bit and go directly to the user level or skip
right to adding it globally, so the ease of use is on par with using
ALTER. But in any case where you do have elevated levels of risk, this
is actually less steps (and less risk) that having to use the
ALTER/guc method.

Earlier I mentioned the idea of monitoring buffer impact; let's talk
about that. I often hear people say that you should be doing things
like confirming your explain plans in development or have some type of
staging system where you do these kind of "experiments", as if a test
on a secondary system could really give you absolute confidence when
deploying to a system that automatically updates its settings (ie
pg_stats) at semi-random times with randomly sampled values; but in
any case, most people will at least agree that there is no way to
match up buffer usage across machines. That means if we are making
production changes that might have a significant impact on buffers, we
are doing something inherently dangerous. Well, dropping an index is
one of those things.

Imagine a scenario where you have a large index on a column and a
similar partial index on the same column, which are both used in
production for different queries, and therefore taking up some amount
of space within the buffer pool. When you make the partial index
invisible, the index is still maintained, and therefore it likely
still needs to maintain pages within the buffer pool to stay updated.
However, with queries now shifting to the full index, the full index
may very well need to pull in additional pages into the buffer pool
that it didn't need before, and this action can cause other pages from
some unknown object to get evicted. If you are lucky, this all works
itself and nothing bad happens, if you aren't, you may end up with a
server overloaded by latency in queries that aren't even related to
the indexes you're working on. (If you have a hard time seeing it with
partial indexes, the same can happen with consolidating indexes with
different INCLUDE statements, and certainly will be a scenario when
people look to drop indexes by way of skip-scan based plans). Now, is
it possible to handle this with the ALTER/guc method? Well, you can
mitigate it somewhat, but ironically to do so requires pushing out the
guc part of the ALTER/guc to all the places you would have pushed out
the GUC-only method, and that has to have been done BEFORE running
ALTER INDEX, so what does it really buy you?

I suppose while we're here, let me also make some observations about
how these methods differ when dealing with replica clusters. You
mentioned that one of the things you liked about the ALTER/guc method
is that it replicates the changes across all systems which makes it
easy to revert, however I believe that thinking is flawed. For
starters, any change that has to occur across the WAL stream is not
something that can be relied on to happen quickly; there are too many
other items that traverse that space that could end up blocking a
rollback from being applied in a timely fashion. The more complex the
replica cluster, the worse this is. One very common use case is to run
different workloads on different nodes, with the ALTER/guc method, you
are forcing users to make changes on a primary when they want to
target a workload that only runs on a replica. This means I have to
account for all potential workloads on all clusters before I can
safely start making changes, and to the degree that the ALTER/guc
gives me a safety net, that safety net is... to deploy a guc globally,
one at a time, on each individual server.

I feel like this email is already long, and tbh I could go on even
more, but hopefully I've covered enough to help explain some of the
issues that are involved here. I'm not trying to say that GUC-only is
a perfect solution, but I do think it handles every use case on par
with ALTER/guc, and enables some use cases ALTER/guc can't, especially
for people who have to operate in risk-first environments. And I get
it that some people are going to want a thing that looks very simple
or is familiar to how Oracle did it, but I can't help but think this
is one of those cases like how people used to always ask us to
implement UPSERT because that's what MySQL had, but instead we gave
them INSERT ON CONFLICT because it was the better solution to the
problem they (actually) had.


Robert Treat
https://xzilla.net