Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

xml2: Improve error handling of libxml2 calls
- 732061150b00 19 (unreleased) cited

Proposal: Global Index for PostgreSQL

Dilip Kumar <dilipbalaut@gmail.com> — 2025-06-06T07:00:05Z

PostgreSQL’s table partitioning allows large tables to be broken into
smaller, more manageable pieces for better performance. However, a key
limitation currently is the absence of global indexes, which restricts
using partitioned tables, especially when you need unique constraints
on columns that aren't part of the partition key.  It's true that
global indexes might seem to go against the core idea of partitioning.
However, they become essential when business needs dictate
partitioning data on one key while also enforcing uniqueness on a
different column. Without global indexes, users simply couldn't
leverage partitioning at all in such scenarios.

I've been developing a design and a POC for global indexes to address
this. I'd like to share my work, including the challenges we still
face and my proposed solutions.

Storage
=======
As we know that since global indexes are covering tuple of multiple
partitions, TIDs are not sufficient to uniquely identifying the tuple
so for that I have introduced a partitioned identifier which is 4
bytes integer and we may argue more on this whether this can be a
variable length or not for saving space.  I will describe more on this
partition identifier in a later part of the email and what's the
reason for choosing this instead of just a relation Oid as that seems
much straightforward.

Partition identifier is stored as part of the index tuple right after
the last key column, storing it here after the last key column make a
lot of things work in very straight forward way without any special
handling for this partition identifier column, e.g. btree tuple are
arranged in key order and if key are duplicate its arranged in TID
order and now for global index we want if the keys are duplicates it
should be arranged in partitioned identifier order and if partition
identifier is also same then in TID order, so by storing identifier
after the last key column and if we consider that field as part of the
extended key column then tuple will automatically will be arranged in
(keys, partition identifier, TID) order as we desire.  Similarly
suffix truncation and deduplication will also be simpler with this
design choice.

Partition Identifier
==============
Before discussing anything about Create Index/Attach/Detach partition,
I think we need to discuss the partition identifier.  So the main
reason for not using relation Oid as partition identifier is because
if we do so we would be forced to remove all the tuple of the detached
partitions from the global index, otherwise it would be very hard to
identify the tuple which are not valid or we should not scan for
example if the same partition detaches and reattaches then we would
not be able to distinguish between old (which should be ignore) and
new tuples, and also if Oid got reassigned to some other partition old
partition is dropped and wraparound happened.

To solve this, we've introduced a partition identifier, a 4-byte
integer. We'll use a new catalog called pg_index_partitions to store a
mapping from (partition identifier, global index OID) to relation OID.

The mapping isn't a direct one from partition identifier to relation
ID because each partition will have a different partition identifier
for each global index it's associated with. This is crucial for
correctly handling global indexes in multi-level partition
hierarchies.

Consider this example:
- T is the top-level parent and has global index g.
- T1 and T2 are children of T. T2 is partitioned table itself and also
has global index g2.
- T22 is a child of T2.

Let's consider a scenario where if you assign a single partition
identifier to T22 and create a one-to-one mapping with its relation
IDs. What happens if we detach T2 from T? (While we haven't delved
into the detach process yet, our current thought is to invalidate the
partition identifier entry so that during a scan, when we look up the
partition identifier to find the corresponding reloid, we'd simply
ignore it.)

However, this approach presents a problem. We can't simply invalidate
the entry for T22 entirely, because while its partition identifier
should be considered invalid when scanning global index 'g' (since T2
is detached from T), it should still be valid when scanning global
index 'g2' (which belongs to T2).

To address this, each leaf partition needs to be assigned a separate
partition identifier for every global index it's associated with.
While a separate partition identifier for each level in the partition
hierarchy could suffice, we've opted for the simpler approach of
assigning a distinct identifier per global index. We made this choice
assuming users will be judicious in creating global indexes within
their partition hierarchies, meaning we won't have to worry about
rapidly consuming a large number of partition IDs. Additionally, the
partition identifier counter can be maintained per global index rather
than globally, this will avoid growing this counter rapidly.

Creating a global Index
=================
While creating an index if the index is marked as GLOBAL, then there
would be some differences compared to the partitioned index, 1) And
additional internal partition identifier key column will be added 2) A
partition id will be allocated to each leaf partitions exist under the
top level partition on which we are creating a global index and
mapping will be stored in pg_index_partition table  3) Unlike
partitioned index global index will not be created on each child
instead this will be created only on the partitioned relation on which
user chooses to create 4) Index build needs to be modified to scan
through all the leaf partitions and create a single sort space for
inserting into the global index.

Attach Partition
=============
While attaching a partition, if this is a leaf partition, we need to
assign a new partition id with respect to each global index present on
its parent and ancestors and insert the mapping in the
pg_index_partitions table.  If the partition being attached is itself
a partitioned table then this step has to be done for every leaf
partition present in the partition hierarchy under the partitioned
table being attached.

Currently we reindex the global indexes present on the parent parent
and ancestor to which we are attaching a partition(s), but this can be
optimized for example we may choose to selectively insert the tuple of
the partition being attached but in some cases it could be costly if
the attached partition itself has a lot of tuple compared to existing
partitions which are already attached.  So this could be a mixed
approach and can be decided based on the existing number of tuples
index the index vs new in coming tuples, and as of now I am listing
this as open for suggestion items.

Detach Partition
=============
When detaching or dropping a partition, we just need to invalidate the
corresponding mapping in pg_index_partitions. Specifically, for each
leaf partition being detached, we'll mark its reloid as invalid within
the (indexoid, partition id) entry. This ensures that during an index
scan, when the system attempts to convert a partition ID to its
reloid, it will recognize this entry as invalid. We'll also need to
consider a mechanism to clean up these invalidated entries at some
point.

Global Index Scan
==============
In short, the global index scan will be similar to the normal index.
The key difference is that we'll also need to track heapOid alongside
heapTid within BTScanPosItem. Although we store the partition ID
directly in the index tuple, we can readily convert it to a heapOid by
referencing the pg_index_partitions table. For performance, we'll
maintain a cache within the global index's Relcache entry; this cache
will simply store the mapping of partitions associated with that
specific global index.

Furthermore, we must account for new index access paths when global
indexes are present. Currently, set_append_rel_size() generates index
restriction information for each append relation. Now, this
restriction information must also be generated for the partitioned
table itself. Additionally, within set_append_rel_pathlist(), we'll
need to invoke create_index_paths() for the partitioned table. This
ensures that we consider index paths for the partitioned table, which,
in the case of partitioned tables, will be global index scan paths.

Locking consideration
=================
Currently, when we perform DML operations on a top-level partitioned
table, we acquire a lock on the entire partition hierarchy beneath it,
which works well. However, if we're directly operating on a leaf
relation, say, inserting a tuple, we only lock that specific relation.
This isn't sufficient for global indexes.

Since a global index can reside on a parent table, inserting a tuple
into a leaf relation also necessitates an insertion into the global
index. This means we also need to lock the parent table on which the
global index is created.

However, even that isn't enough. We actually need to lock the entire
partition hierarchy under the parent table that has the global index.
Here's why: when you insert a tuple into any leaf partition, you'll
also insert it into the global index. If it's a unique index, a
conflict might arise with a key belonging to another partition. To
validate whether the tuple is live or not, we might need to access
data in other partitions.

Based on our chosen design, we identify all these necessary locks
during the planning phase. We then acquire these locks during planning
and store them within the planned statement. This way, if the
statement is prepared and executed later, these locks can be
re-acquired during AcquireExecutorLocks.

Open Problems/Need suggestions
==========================
Here is the list of top few problems which would be good to discuss
sooner than later

Vacuum for global indexes
—----------------------------------
I think this has been discussed multiple times in the past whenever we
talked about the global index. The core issue is that, by default,
global indexes are vacuumed with each individual partition. This
becomes incredibly inefficient and costly when dealing with millions
of partitions, as it leads to repeated, expensive scans of a large
global index.

Ideally, global indexes should only be vacuumed once, after all
partitions have been processed. This is because a global index keeps
data from all partitions, meaning a single vacuum operation after all
partition vacuums would be sufficient and far more efficient. This
approach would significantly reduce the overhead associated with
maintaining global indexes in large, partitioned datasets.

Our testing highlights a significant performance bottleneck when
vacuuming with global indexes. In a scenario with 1,000 partitions and
a total of 50 million tuples, vacuuming without global indexes took a
mere 15 seconds. However, after introducing global indexes, the vacuum
time skyrocketed to 45 minutes, a 300-fold increase! This dramatic
slowdown is expected, as the global index, containing data from all
partitions, is effectively vacuumed with each of the 1,000 individual
partition vacuums.

We've made a quick but significant optimization: we now vacuum each
partition and its local indexes first, skipping the global index
vacuum and the second pass of heap vacuuming for a bit. Instead, we
just keep track of the dead-tid stores (dead tuples) for each
partition. Once all partitions are done, we then vacuum the global
index once and perform that second heap pass across all partitions.

This change slashed our vacuuming time from 45 minutes down to just 40
seconds! This clearly shows we have plenty of room for optimization,
and this challenge isn't a showstopper. I'll share more details on
other solutions I'm testing in an upcoming email to keep this one
concise.


Global Index rewrite
—-------------------------
One particular class of things that needs careful consideration is
table-rewriting operations. We have a few of those: CLUSTER, VACUUM
FULL..ALTER TABLE etc, When such an operation occurs, all the TIDs
potentially change, so tablecmds.c arranges to rebuild indexes. When
there’s a global index involved, we also need to rebuild global
indexes. Or, as a possible alternative strategy, we could allocate a
new partition ID and just leave the index entries to be cleaned out
eventually, but that seems to risk a lot of bloat. However, a key
point here is that we don’t want to be rebuilding the same global
indexes over and over. If someone does a table-rewriting operation
across a whole partitioning hierarchy, we don’t want to rebuild the
same global indexes multiple times.

We also need to consider what happens when relations are truncated.
Here, the two possible strategies are: (1) assign new partition IDs
and leave the old index entries around, (2) truncate the entire index
and then reinsert entries for any untruncated partitions. As above (1)
risks bloat, but it also makes TRUNCATE fast. Of course, if the whole
partitioning hierarchy is truncated all at once, then (2) is clearly
better.

I'm aiming to submit the first WIP patch set before the July
commitfest. It won't have all the issues ironed out yet, but the main
design will be functional.

Thanks, Robert, for many of the key design ideas and regular
discussion throughout designing this. I'd also like to thank Joe,
Peter Geoghegan, Alvaro, and Masahiko Sawada for discussing the
independent issues with me offlist.

-- 
Regards,
Dilip Kumar
Google

Re: Proposal: Global Index for PostgreSQL

wenhui qiu <qiuwenhuifx@gmail.com> — 2025-06-06T07:30:48Z

Hi Dilip Kumar
   Thank you for your working on this ,I remember six years ago there was
talk about global index ，You can see if this mailing list has any
references to (
https://www.postgresql.org/message-id/CALtqXTcurqy1PKXzP9XO%3DofLLA5wBSo77BnUnYVEZpmcA3V0ag%40mail.gmail.com
)



Thanks

On Fri, Jun 6, 2025 at 3:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> PostgreSQL’s table partitioning allows large tables to be broken into
> smaller, more manageable pieces for better performance. However, a key
> limitation currently is the absence of global indexes, which restricts
> using partitioned tables, especially when you need unique constraints
> on columns that aren't part of the partition key.  It's true that
> global indexes might seem to go against the core idea of partitioning.
> However, they become essential when business needs dictate
> partitioning data on one key while also enforcing uniqueness on a
> different column. Without global indexes, users simply couldn't
> leverage partitioning at all in such scenarios.
>
> I've been developing a design and a POC for global indexes to address
> this. I'd like to share my work, including the challenges we still
> face and my proposed solutions.
>
> Storage
> =======
> As we know that since global indexes are covering tuple of multiple
> partitions, TIDs are not sufficient to uniquely identifying the tuple
> so for that I have introduced a partitioned identifier which is 4
> bytes integer and we may argue more on this whether this can be a
> variable length or not for saving space.  I will describe more on this
> partition identifier in a later part of the email and what's the
> reason for choosing this instead of just a relation Oid as that seems
> much straightforward.
>
> Partition identifier is stored as part of the index tuple right after
> the last key column, storing it here after the last key column make a
> lot of things work in very straight forward way without any special
> handling for this partition identifier column, e.g. btree tuple are
> arranged in key order and if key are duplicate its arranged in TID
> order and now for global index we want if the keys are duplicates it
> should be arranged in partitioned identifier order and if partition
> identifier is also same then in TID order, so by storing identifier
> after the last key column and if we consider that field as part of the
> extended key column then tuple will automatically will be arranged in
> (keys, partition identifier, TID) order as we desire.  Similarly
> suffix truncation and deduplication will also be simpler with this
> design choice.
>
> Partition Identifier
> ==============
> Before discussing anything about Create Index/Attach/Detach partition,
> I think we need to discuss the partition identifier.  So the main
> reason for not using relation Oid as partition identifier is because
> if we do so we would be forced to remove all the tuple of the detached
> partitions from the global index, otherwise it would be very hard to
> identify the tuple which are not valid or we should not scan for
> example if the same partition detaches and reattaches then we would
> not be able to distinguish between old (which should be ignore) and
> new tuples, and also if Oid got reassigned to some other partition old
> partition is dropped and wraparound happened.
>
> To solve this, we've introduced a partition identifier, a 4-byte
> integer. We'll use a new catalog called pg_index_partitions to store a
> mapping from (partition identifier, global index OID) to relation OID.
>
> The mapping isn't a direct one from partition identifier to relation
> ID because each partition will have a different partition identifier
> for each global index it's associated with. This is crucial for
> correctly handling global indexes in multi-level partition
> hierarchies.
>
> Consider this example:
> - T is the top-level parent and has global index g.
> - T1 and T2 are children of T. T2 is partitioned table itself and also
> has global index g2.
> - T22 is a child of T2.
>
> Let's consider a scenario where if you assign a single partition
> identifier to T22 and create a one-to-one mapping with its relation
> IDs. What happens if we detach T2 from T? (While we haven't delved
> into the detach process yet, our current thought is to invalidate the
> partition identifier entry so that during a scan, when we look up the
> partition identifier to find the corresponding reloid, we'd simply
> ignore it.)
>
> However, this approach presents a problem. We can't simply invalidate
> the entry for T22 entirely, because while its partition identifier
> should be considered invalid when scanning global index 'g' (since T2
> is detached from T), it should still be valid when scanning global
> index 'g2' (which belongs to T2).
>
> To address this, each leaf partition needs to be assigned a separate
> partition identifier for every global index it's associated with.
> While a separate partition identifier for each level in the partition
> hierarchy could suffice, we've opted for the simpler approach of
> assigning a distinct identifier per global index. We made this choice
> assuming users will be judicious in creating global indexes within
> their partition hierarchies, meaning we won't have to worry about
> rapidly consuming a large number of partition IDs. Additionally, the
> partition identifier counter can be maintained per global index rather
> than globally, this will avoid growing this counter rapidly.
>
> Creating a global Index
> =================
> While creating an index if the index is marked as GLOBAL, then there
> would be some differences compared to the partitioned index, 1) And
> additional internal partition identifier key column will be added 2) A
> partition id will be allocated to each leaf partitions exist under the
> top level partition on which we are creating a global index and
> mapping will be stored in pg_index_partition table  3) Unlike
> partitioned index global index will not be created on each child
> instead this will be created only on the partitioned relation on which
> user chooses to create 4) Index build needs to be modified to scan
> through all the leaf partitions and create a single sort space for
> inserting into the global index.
>
> Attach Partition
> =============
> While attaching a partition, if this is a leaf partition, we need to
> assign a new partition id with respect to each global index present on
> its parent and ancestors and insert the mapping in the
> pg_index_partitions table.  If the partition being attached is itself
> a partitioned table then this step has to be done for every leaf
> partition present in the partition hierarchy under the partitioned
> table being attached.
>
> Currently we reindex the global indexes present on the parent parent
> and ancestor to which we are attaching a partition(s), but this can be
> optimized for example we may choose to selectively insert the tuple of
> the partition being attached but in some cases it could be costly if
> the attached partition itself has a lot of tuple compared to existing
> partitions which are already attached.  So this could be a mixed
> approach and can be decided based on the existing number of tuples
> index the index vs new in coming tuples, and as of now I am listing
> this as open for suggestion items.
>
> Detach Partition
> =============
> When detaching or dropping a partition, we just need to invalidate the
> corresponding mapping in pg_index_partitions. Specifically, for each
> leaf partition being detached, we'll mark its reloid as invalid within
> the (indexoid, partition id) entry. This ensures that during an index
> scan, when the system attempts to convert a partition ID to its
> reloid, it will recognize this entry as invalid. We'll also need to
> consider a mechanism to clean up these invalidated entries at some
> point.
>
> Global Index Scan
> ==============
> In short, the global index scan will be similar to the normal index.
> The key difference is that we'll also need to track heapOid alongside
> heapTid within BTScanPosItem. Although we store the partition ID
> directly in the index tuple, we can readily convert it to a heapOid by
> referencing the pg_index_partitions table. For performance, we'll
> maintain a cache within the global index's Relcache entry; this cache
> will simply store the mapping of partitions associated with that
> specific global index.
>
> Furthermore, we must account for new index access paths when global
> indexes are present. Currently, set_append_rel_size() generates index
> restriction information for each append relation. Now, this
> restriction information must also be generated for the partitioned
> table itself. Additionally, within set_append_rel_pathlist(), we'll
> need to invoke create_index_paths() for the partitioned table. This
> ensures that we consider index paths for the partitioned table, which,
> in the case of partitioned tables, will be global index scan paths.
>
> Locking consideration
> =================
> Currently, when we perform DML operations on a top-level partitioned
> table, we acquire a lock on the entire partition hierarchy beneath it,
> which works well. However, if we're directly operating on a leaf
> relation, say, inserting a tuple, we only lock that specific relation.
> This isn't sufficient for global indexes.
>
> Since a global index can reside on a parent table, inserting a tuple
> into a leaf relation also necessitates an insertion into the global
> index. This means we also need to lock the parent table on which the
> global index is created.
>
> However, even that isn't enough. We actually need to lock the entire
> partition hierarchy under the parent table that has the global index.
> Here's why: when you insert a tuple into any leaf partition, you'll
> also insert it into the global index. If it's a unique index, a
> conflict might arise with a key belonging to another partition. To
> validate whether the tuple is live or not, we might need to access
> data in other partitions.
>
> Based on our chosen design, we identify all these necessary locks
> during the planning phase. We then acquire these locks during planning
> and store them within the planned statement. This way, if the
> statement is prepared and executed later, these locks can be
> re-acquired during AcquireExecutorLocks.
>
> Open Problems/Need suggestions
> ==========================
> Here is the list of top few problems which would be good to discuss
> sooner than later
>
> Vacuum for global indexes
> —----------------------------------
> I think this has been discussed multiple times in the past whenever we
> talked about the global index. The core issue is that, by default,
> global indexes are vacuumed with each individual partition. This
> becomes incredibly inefficient and costly when dealing with millions
> of partitions, as it leads to repeated, expensive scans of a large
> global index.
>
> Ideally, global indexes should only be vacuumed once, after all
> partitions have been processed. This is because a global index keeps
> data from all partitions, meaning a single vacuum operation after all
> partition vacuums would be sufficient and far more efficient. This
> approach would significantly reduce the overhead associated with
> maintaining global indexes in large, partitioned datasets.
>
> Our testing highlights a significant performance bottleneck when
> vacuuming with global indexes. In a scenario with 1,000 partitions and
> a total of 50 million tuples, vacuuming without global indexes took a
> mere 15 seconds. However, after introducing global indexes, the vacuum
> time skyrocketed to 45 minutes, a 300-fold increase! This dramatic
> slowdown is expected, as the global index, containing data from all
> partitions, is effectively vacuumed with each of the 1,000 individual
> partition vacuums.
>
> We've made a quick but significant optimization: we now vacuum each
> partition and its local indexes first, skipping the global index
> vacuum and the second pass of heap vacuuming for a bit. Instead, we
> just keep track of the dead-tid stores (dead tuples) for each
> partition. Once all partitions are done, we then vacuum the global
> index once and perform that second heap pass across all partitions.
>
> This change slashed our vacuuming time from 45 minutes down to just 40
> seconds! This clearly shows we have plenty of room for optimization,
> and this challenge isn't a showstopper. I'll share more details on
> other solutions I'm testing in an upcoming email to keep this one
> concise.
>
>
> Global Index rewrite
> —-------------------------
> One particular class of things that needs careful consideration is
> table-rewriting operations. We have a few of those: CLUSTER, VACUUM
> FULL..ALTER TABLE etc, When such an operation occurs, all the TIDs
> potentially change, so tablecmds.c arranges to rebuild indexes. When
> there’s a global index involved, we also need to rebuild global
> indexes. Or, as a possible alternative strategy, we could allocate a
> new partition ID and just leave the index entries to be cleaned out
> eventually, but that seems to risk a lot of bloat. However, a key
> point here is that we don’t want to be rebuilding the same global
> indexes over and over. If someone does a table-rewriting operation
> across a whole partitioning hierarchy, we don’t want to rebuild the
> same global indexes multiple times.
>
> We also need to consider what happens when relations are truncated.
> Here, the two possible strategies are: (1) assign new partition IDs
> and leave the old index entries around, (2) truncate the entire index
> and then reinsert entries for any untruncated partitions. As above (1)
> risks bloat, but it also makes TRUNCATE fast. Of course, if the whole
> partitioning hierarchy is truncated all at once, then (2) is clearly
> better.
>
> I'm aiming to submit the first WIP patch set before the July
> commitfest. It won't have all the issues ironed out yet, but the main
> design will be functional.
>
> Thanks, Robert, for many of the key design ideas and regular
> discussion throughout designing this. I'd also like to thank Joe,
> Peter Geoghegan, Alvaro, and Masahiko Sawada for discussing the
> independent issues with me offlist.
>
> --
> Regards,
> Dilip Kumar
> Google
>
>
>

Re: Proposal: Global Index for PostgreSQL

Dilip Kumar <dilipbalaut@gmail.com> — 2025-06-07T04:36:50Z

On Fri, Jun 6, 2025 at 1:01 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
>
> Hi Dilip Kumar
>    Thank you for your working on this ,I remember six years ago there was talk about global index ，You can see if this mailing list has any references to (https://www.postgresql.org/message-id/CALtqXTcurqy1PKXzP9XO%3DofLLA5wBSo77BnUnYVEZpmcA3V0ag%40mail.gmail.com)
>
Sure Thanks.

-- 
Regards,
Dilip Kumar
Google

Re: Proposal: Global Index for PostgreSQL

Nikita Malakhov <hukutoc@gmail.com> — 2025-06-09T08:33:43Z

Hi Dilip!

Global Indexes is a very interesting functionality that has both
significant advantages
and drawbacks, and the community seems not ready to accept it without very
strong
motivation.
There was a more recent approach to Global index problem [1], please check
it out.

I've read you proposal and have several questions:
1) New catalog table with global index partitions would immediately affect
interaction
with user tables with global indexes because of corresponding locks that
should be
taken for [in]validation and attach/detach operations, this should be
investigated;
2) Changing relation OIDs (by, say, vacuum full) would immediately result
in index
inconsistency, what do you suppose to do with internal processes that could
change
relation OIDs? Also this question
3) Would single sort space be enough for a more typical case when we have
hundreds of partitions with hundreds of millions records in each? It is a
normal
production case for partitioned tables.
4) Update-heavy partitioned tables that should run vacuum frequently.
Significant
vacuum slowdown would result in going beyond SLAs without corresponding
significant improvements.

[1] Global Unique Index
<https://www.postgresql.org/message-id/flat/184879c5306.12490ea581628934.7312528450011769010%40highgo.ca>

Thank you!

-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

Re: Proposal: Global Index for PostgreSQL

Dilip Kumar <dilipbalaut@gmail.com> — 2025-06-09T09:58:38Z

On Mon, Jun 9, 2025 at 2:03 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
>
> Hi Dilip!

Thanks Nikita for your response and reading my proposal.

> Global Indexes is a very interesting functionality that has both significant advantages
> and drawbacks, and the community seems not ready to accept it without very strong
> motivation.

I understand that this is a hard problem and needs changes in many
critical modules. I don't think there should be a problem with the
motivation of this work, but I believe the main issue lies in the
project's complexity.

> There was a more recent approach to Global index problem [1], please check it out.

I've reviewed the proposal, and I understand it aims to address
uniqueness on partitioned tables for non-partition key columns.
However, I'm concerned about the basic design principles' scalability.
I believe it won't scale effectively beyond a relatively small number
of partitions, and this limitation will be quite surprising to users.
Specifically, checking uniqueness during inserts/updates across all
indexes on each partition (since there's no global index) will become
a significant bottleneck.

> I've read you proposal and have several questions:

> 1) New catalog table with global index partitions would immediately affect interaction
> with user tables with global indexes because of corresponding locks that should be
> taken for [in]validation and attach/detach operations, this should be investigated;

Yeah that's right, but logically the attach/detach are DDL and are not
most frequent operations, so are you worried about performance due to
locking?

> 2) Changing relation OIDs (by, say, vacuum full) would immediately result in index
> inconsistency, what do you suppose to do with internal processes that could change
> relation OIDs? Also this question

I want to clarify that we don't store relation OIDs directly in the
global index. Instead, the global index holds partition IDs, and the
mapping from partition ID to relation OID is managed in a new catalog
table, pg_index_partition. It's important to note that VACUUM FULL
operations only alter the relfilenumber (the disk file OID), not the
relation OID itself, which remains constant for the lifetime of the
relation.

What does change during a VACUUM FULL are the TIDs (tuple IDs).
Because of this, all indexes, including global indexes, are reindexed.
As I mentioned in my proposal, there's a significant opportunity for
optimization here. Reindexing large global indexes is a costly
operation, and I've proposed some ideas to improve this process.

> 3) Would single sort space be enough for a more typical case when we have
> hundreds of partitions with hundreds of millions records in each? It is a normal
> production case for partitioned tables.

In general, users ideally wouldn't use a global index everywhere. It
really comes down to their specific use case – they should only opt
for a global index when they can't effectively partition their data
without one. The idea is that the amount of data in the sort space
should essentially be the same as if the table wasn't partitioned at
all. That's a good point for consideration. I agree that global
indexes shouldn't be a default choice for every use case. They're most
beneficial when a user's data access patterns inherently prevent
effective partitioning without them. In such scenarios, the amount of
data in the sort space would ideally remain comparable to an
unpartitioned table.

> 4) Update-heavy partitioned tables that should run vacuum frequently. Significant
> vacuum slowdown would result in going beyond SLAs without corresponding
> significant improvements.
>
You've got it. I'm on board with prioritizing a VACUUM optimization
solution for partitioned tables with global indexes. My initial
proposal touched on a proof-of-concept experiment, which indicated no
significant performance hit with global index after the optimization.
I'll share the detailed VACUUM optimization proposal in this thread
within the next couple of days.

--
Regards,
Dilip Kumar
Google

Re: Proposal: Global Index for PostgreSQL

Bruce Momjian <bruce@momjian.us> — 2025-06-09T21:51:25Z

On Mon, Jun  9, 2025 at 03:28:38PM +0530, Dilip Kumar wrote:
> On Mon, Jun 9, 2025 at 2:03 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
> > Global Indexes is a very interesting functionality that has both significant advantages
> > and drawbacks, and the community seems not ready to accept it without very strong
> > motivation.
> 
> I understand that this is a hard problem and needs changes in many
> critical modules. I don't think there should be a problem with the
> motivation of this work, but I believe the main issue lies in the
> project's complexity.
...
> In general, users ideally wouldn't use a global index everywhere. It
> really comes down to their specific use case – they should only opt
> for a global index when they can't effectively partition their data
> without one. The idea is that the amount of data in the sort space
> should essentially be the same as if the table wasn't partitioned at
> all. That's a good point for consideration. I agree that global
> indexes shouldn't be a default choice for every use case. They're most
> beneficial when a user's data access patterns inherently prevent
> effective partitioning without them. In such scenarios, the amount of
> data in the sort space would ideally remain comparable to an
> unpartitioned table.

There are certainly use cases where this would be helpful, but I think
the big question is whether it would have so many negatives that most
people who try to use it would eventually remove it.  I have heard that
happened to other relational systems who support global indexes, so I
think we have to consider that possibility.  The problem is you might
need to actually write the patch to find out.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.

Re: Proposal: Global Index for PostgreSQL

Bruce Momjian <bruce@momjian.us> — 2025-06-10T19:38:11Z

On Mon, Jun  9, 2025 at 05:51:25PM -0400, Bruce Momjian wrote:
> On Mon, Jun  9, 2025 at 03:28:38PM +0530, Dilip Kumar wrote:
> There are certainly use cases where this would be helpful, but I think
> the big question is whether it would have so many negatives that most
> people who try to use it would eventually remove it.  I have heard that
> happened to other relational systems who support global indexes, so I
> think we have to consider that possibility.  The problem is you might
> need to actually write the patch to find out.

FYI, I wrote a blog about global indexes in 2020:

	https://momjian.us/main/blogs/pgblog/2020.html#July_1_2020

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.

Re: Proposal: Global Index for PostgreSQL

Dilip Kumar <dilipbalaut@gmail.com> — 2025-06-11T03:03:33Z

On Tue, Jun 10, 2025 at 3:21 AM Bruce Momjian <bruce@momjian.us> wrote:

Thanks Bruce for your thoughts on this.

> On Mon, Jun  9, 2025 at 03:28:38PM +0530, Dilip Kumar wrote:
> > On Mon, Jun 9, 2025 at 2:03 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
> > > Global Indexes is a very interesting functionality that has both significant advantages
> > > and drawbacks, and the community seems not ready to accept it without very strong
> > > motivation.
> >
> > I understand that this is a hard problem and needs changes in many
> > critical modules. I don't think there should be a problem with the
> > motivation of this work, but I believe the main issue lies in the
> > project's complexity.
> ...
> > In general, users ideally wouldn't use a global index everywhere. It
> > really comes down to their specific use case – they should only opt
> > for a global index when they can't effectively partition their data
> > without one. The idea is that the amount of data in the sort space
> > should essentially be the same as if the table wasn't partitioned at
> > all. That's a good point for consideration. I agree that global
> > indexes shouldn't be a default choice for every use case. They're most
> > beneficial when a user's data access patterns inherently prevent
> > effective partitioning without them. In such scenarios, the amount of
> > data in the sort space would ideally remain comparable to an
> > unpartitioned table.
>
> There are certainly use cases where this would be helpful, but I think
> the big question is whether it would have so many negatives that most
> people who try to use it would eventually remove it.

Yeah that's a very valid point.

  I have heard that
> happened to other relational systems who support global indexes, so I
> think we have to consider that possibility.  The problem is you might
> need to actually write the patch to find out.

I've actually drafted the patch, and while it still has open issues to
tackle, I believe it's ready for some interesting experimentation. For
instance, I've observed significant performance gains during index
scans on non-partition key columns, where scanning a single global
index outperforms appending scan results from thousands of local
indexes. However, I've also noted performance regression in
insert/update cases when a global index is present, which is expected
as we need to insert into a large index, so we need to evaluate what
is acceptable and what not.

My plan is to submit the patch in the next commitfest, including
performance data. Before submission, I'll clean up the code further
and add TODO comments wherever additional work is required.

-- 
Regards,
Dilip Kumar
Google

Re: Proposal: Global Index for PostgreSQL

Dilip Kumar <dilipbalaut@gmail.com> — 2025-06-11T03:04:24Z

On Wed, Jun 11, 2025 at 1:08 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Mon, Jun  9, 2025 at 05:51:25PM -0400, Bruce Momjian wrote:
> > On Mon, Jun  9, 2025 at 03:28:38PM +0530, Dilip Kumar wrote:
> > There are certainly use cases where this would be helpful, but I think
> > the big question is whether it would have so many negatives that most
> > people who try to use it would eventually remove it.  I have heard that
> > happened to other relational systems who support global indexes, so I
> > think we have to consider that possibility.  The problem is you might
> > need to actually write the patch to find out.
>
> FYI, I wrote a blog about global indexes in 2020:

Oh interesting, I would be happy to have a look at it, Thanks.

-- 
Regards,
Dilip Kumar
Google

Re: Proposal: Global Index for PostgreSQL

Dilip Kumar <dilipbalaut@gmail.com> — 2025-06-14T09:32:03Z

On Mon, Jun 9, 2025 at 3:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jun 9, 2025 at 2:03 PM Nikita Malakhov <hukutoc@gmail.com> wrote:

> > 4) Update-heavy partitioned tables that should run vacuum frequently. Significant
> > vacuum slowdown would result in going beyond SLAs without corresponding
> > significant improvements.
> >
> You've got it. I'm on board with prioritizing a VACUUM optimization
> solution for partitioned tables with global indexes. My initial
> proposal touched on a proof-of-concept experiment, which indicated no
> significant performance hit with global index after the optimization.
> I'll share the detailed VACUUM optimization proposal in this thread
> within the next couple of days.

As discussed earlier the one of main problems is that the global
indexes are vacuumed along with each partition whereas logically it
should be vacuumed only once when all the partitions are vacuum in an
ideal world.

So my proposal is that we make some infrastructure change in vacuum
api's such that there is option to tell heap_vacuum_rel() to skip the
global index vacuum and also return back the deadtid_store, vacuum
layer will store these deadtid_store, hash it by the reloid until it
vacuum all the partitions which are covered by a global index.  Once
that is done it will vacuum the global index, we also need to modify
the vac_tid_reaped() so that it can take additional input of reloid so
that it can find the appropriate deadtid_store by looking into the
hash and then look up the dead tid in respective deadtid_store.

We also need to do something for the autovacuum, because currently
autovacuum workers scan the pg_class and identify the relation which
needs to be vacuumed and vacuum it one at a time.  However once we
have educated vacuum machinery to first vacuum all the partitions and
then perform global index vacuum, the autovacuum worker should know
which all partitions are supposed to be vacuum together and the
autovacuum worker can pass a list of all those partitions together to
the vacuum machinery.

I believe this enhancement can be implemented in autovacuum without
significant difficulty. Currently, autovacuum scans pg_class to
generate a list of relations requiring vacuuming. For partitioned
tables that have a global index, we can extend this process by
additionally maintaining a list of all their leaf relations within the
parent table's entry.

I believe this enhancement can be implemented in autovacuum without
significant difficulty. Currently, autovacuum scans pg_class to
generate a list of relations requiring vacuuming. For partitioned
tables that have a global index, we can extend this process by
additionally maintaining a list of all their leaf relations within the
parent table's entry.

An autovacuum worker would then process all the leaf relations in this
complete hierarchy. To effectively prevent other workers from
attempting to vacuum the same hierarchy concurrently, the worker would
publish the top-most partitioned relation ID (which identifies the
table with the global index) as MyWorkerInfo->wi_tableoid. This
mechanism ensures that if one worker is processing a partitioned
table, the entire set of child relations under that top-level ID is
automatically skipped by other workers

While this approach offers benefits, it's important to acknowledge
certain potential drawbacks. One concern is a possible impact on
parallelism, as currently each partition might be vacuumed by a
separate worker, but with this change, all partitions covered by the
same global index would have to be processed by a single worker.
Another significant challenge lies in effectively choosing which
partitions to vacuum; for instance, if a table with a global index has
1000 partitions and only 10 meet the vacuum threshold in a given
cycle, this method might not be very efficient. Although we wouldn't
vacuum the global index once for every partition, we could still end
up vacuuming it numerous times by the time all partitions are
eventually processed. To mitigate this, an alternative strategy could
involve proactively vacuuming partitions that haven't fully met their
vacuum threshold but are close (e.g., reached 50% of the threshold).
This would allow us to combine more partitions for a single vacuum
operation, thereby reducing the number of times the global index needs
to be vacuumed.

-- 
Regards,
Dilip Kumar
Google

Re: Proposal: Global Index for PostgreSQL

Masahiko Sawada <sawada.mshk@gmail.com> — 2025-06-17T23:07:25Z

On Sat, Jun 14, 2025 at 2:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jun 9, 2025 at 3:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jun 9, 2025 at 2:03 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
>
> > > 4) Update-heavy partitioned tables that should run vacuum frequently. Significant
> > > vacuum slowdown would result in going beyond SLAs without corresponding
> > > significant improvements.
> > >
> > You've got it. I'm on board with prioritizing a VACUUM optimization
> > solution for partitioned tables with global indexes. My initial
> > proposal touched on a proof-of-concept experiment, which indicated no
> > significant performance hit with global index after the optimization.
> > I'll share the detailed VACUUM optimization proposal in this thread
> > within the next couple of days.
>
> As discussed earlier the one of main problems is that the global
> indexes are vacuumed along with each partition whereas logically it
> should be vacuumed only once when all the partitions are vacuum in an
> ideal world.
>
> So my proposal is that we make some infrastructure change in vacuum
> api's such that there is option to tell heap_vacuum_rel() to skip the
> global index vacuum and also return back the deadtid_store, vacuum
> layer will store these deadtid_store, hash it by the reloid until it
> vacuum all the partitions which are covered by a global index.  Once
> that is done it will vacuum the global index, we also need to modify
> the vac_tid_reaped() so that it can take additional input of reloid so
> that it can find the appropriate deadtid_store by looking into the
> hash and then look up the dead tid in respective deadtid_store.

Does it need to keep holding dead TIDs for each partition until it
completes vacuuming all partitions that are covered by the global
index? If so, it would end up holding a huge amount of memory in cases
where there are many partitions. How does maintanence_work_mem (or
autovacuum_work_mem) work in this context? Also, what if the
autovacuum worker who is processing the partitioned table with global
indexes gets cancelled? I guess that we would need to scan the
partitions again in order to collect dead TIDs to vacuum the global
index but it would be very expensive.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Proposal: Global Index for PostgreSQL

Dilip Kumar <dilipbalaut@gmail.com> — 2025-06-18T10:45:13Z

On Wed, Jun 18, 2025 at 4:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Jun 14, 2025 at 2:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >

Thanks for your opinion, Sawada-San.

> Does it need to keep holding dead TIDs for each partition until it
> completes vacuuming all partitions that are covered by the global
> index? If so, it would end up holding a huge amount of memory in cases
> where there are many partitions. How does maintanence_work_mem (or
> autovacuum_work_mem) work in this context?

So it will keep holding the deadtids until we vacuum all the
partitions or we run out of the 'maintanence_work_mem', consider this
similar to the case where you do not have concept of global index and
for supporting the unique constraint on not partitioned key column
user has to keep one giant table without partitioning it, then every
time we run out of the  'maintanence_work_mem' we need to vacuum the
indexes, same theory applies for the global index.

 Also, what if the
> autovacuum worker who is processing the partitioned table with global
> indexes gets cancelled? I guess that we would need to scan the
> partitions again in order to collect dead TIDs to vacuum the global
> index but it would be very expensive.

You're right, but that's comparable to the cost of managing a single,
unpartitioned giant table. I believe the primary value of global
indexes lies in enabling users to partition tables that otherwise
couldn't be. So, while certain maintenance tasks might incur similar
costs to a single large table, you'll gain significant advantages
during many other DML operations due to partitioning that wouldn't be
possible without global indexes.

-- 
Regards,
Dilip Kumar
Google

Re: Proposal: Global Index for PostgreSQL

Dilip Kumar <dilipbalaut@gmail.com> — 2025-07-01T09:27:07Z

On Wed, Jun 18, 2025 at 4:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jun 18, 2025 at 4:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
Here is the first WIP version of the patch set.  Commit message in
each patch explains in detail what exactly it does so not adding the
description in email.  Patches applies on the latest head
(732061150b004385810e522f8629f5bf91d977b7)

Open issues yet to be handled:
=========================
1) Vacuum optimization as I described above is still highly unstable
patch so not attached here, so with this patch set if you create
global index with large table with a lot of partition then there will
be a huge regression in vacuum performance, which should be resolved
after vacuum optimization patch which I am planning to post by this
month.
2) If you are attaching a partition which has a different column order
than the parent and create a global index on that, it will not work,
still working on it.
3) For unique checking in btree, global index may get conflicting
tuple belongs to different partitions so relation open is done on the
fly, ideally that description should have been done in executor and
pass down to btree but still identifying how to do that with minimum
changes in AM interfaces, or can it be done without AM changes.
4) Need to write more test cases for REINDEX TABLE, TRUNCATE TABLE and
need to tighten up the global index rebuild so that it doesn't get
rebuilt multiple times or doesn't skip rebuilding when needed.
5) ALTER COLUMN SET TYPE, which triggers global index rewrite is not
handled properly. This needs some more work, in the alter table
machinery where we identify the list of indexes to rebuild.
6) Need to perform a performance test, for SELECT/UPDATE/INSERT cases,
we already know the VACUUM performance.
7) global_index.sql is taking a long time to execute as I have kept a
high data load, so that needs to be simplified.

Note: Patches are still WIP and might have many loose ends, for now,
make check-world is passing, including the global index test cases

Credit:  I have already mentioned while sending the first design email
but missed some, so adding it again here.

1. Robert: for many of the key design ideas and regular discussion
throughout designing this.
2. Joe: Also has regular discussion on this and many suggestions,
specially related to vacuum
3. Peter Geoghegan, Alvaro, and Masahiko Sawada for discussing the
independent issues with me offlist.
4. I got the idea of syntax for creating global indexes and also
keeping a cache of mapping from reloid to relation descriptor during
the global index scan, from some of the very old thread related to
global index (not able to find the thread now).

--
Regards,
Dilip Kumar
Google

Re: Proposal: Global Index for PostgreSQL

amit <amitlangote09@gmail.com> — 2025-07-01T13:42:06Z

Hi Dilip,

Happy to see you working on this.  It’s clear a lot of thought has
gone into the design.

On Tue, Jul 1, 2025 at 6:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 6) Need to perform a performance test, for SELECT/UPDATE/INSERT cases,
> we already know the VACUUM performance.

One point I want to check my understanding of is around the locking
implications of global index scans, especially in prepared statement
scenarios.

I’ve been working on improving how we handle partition locking during
execution of generic plans. Specifically, I committed a patch to defer
locking of partitions until after pruning during ExecutorStart(), so
we avoid taking locks on partitions that aren’t actually needed --
even when the plan contains scans on all partitions. That patch was
later reverted, as Tom pointed out that the plan invalidation logic
wasn't cleanly handled. But the goal remains: to avoid locking
unnecessary partitions, particularly in high-partition-count OLTP
setups that use PREPARE/EXECUTE.

The proposed global index design, IIUC, requires locking all leaf
partitions up front during planning, and I guess during
AcquireExecutorLocks() when using a cached plan, because the index
scan could return tuples from any partition. This seems to directly
undercut that effort: we'd be back to generic plans causing broad
locking regardless of actual runtime needs.

I understand that this is currently necessary, given that a global
index scan is a single node without per-partition awareness. But it
might be worth considering whether the scan could opportunistically
defer heap relation locking until it returns a tuple that actually
belongs to a particular partition -- similar to how inserts into
partitioned tables only lock the target partition at execution time.
Or did I miss that inserts also need to lock all partitions up front
when global indexes are present, due to cross-partition uniqueness
checks?

Let me know if I’ve misunderstood the design.

-- 
Thanks, Amit Langote

Re: Proposal: Global Index for PostgreSQL

Dilip Kumar <dilipbalaut@gmail.com> — 2025-07-02T04:03:51Z

On Tue, Jul 1, 2025 at 7:12 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> Hi Dilip,
>
> Happy to see you working on this.  It’s clear a lot of thought has
> gone into the design.

Thanks, Amit.  And thanks for your comment.

> On Tue, Jul 1, 2025 at 6:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > 6) Need to perform a performance test, for SELECT/UPDATE/INSERT cases,
> > we already know the VACUUM performance.
>
> One point I want to check my understanding of is around the locking
> implications of global index scans, especially in prepared statement
> scenarios.

Sure

> I’ve been working on improving how we handle partition locking during
> execution of generic plans. Specifically, I committed a patch to defer
> locking of partitions until after pruning during ExecutorStart(), so
> we avoid taking locks on partitions that aren’t actually needed --
> even when the plan contains scans on all partitions. That patch was
> later reverted, as Tom pointed out that the plan invalidation logic
> wasn't cleanly handled.

Yes I was following that thread, and at times when I was working on
locking for global index I had in mind that I would have to do
something to the locking after that patch is in.  Unfortunately that
got reverted and then I didn't put any effort in reconsidering how
locking is handled for global index.

 But the goal remains: to avoid locking
> unnecessary partitions, particularly in high-partition-count OLTP
> setups that use PREPARE/EXECUTE.

That makes sense.

> The proposed global index design, IIUC, requires locking all leaf
> partitions up front during planning, and I guess during
> AcquireExecutorLocks() when using a cached plan, because the index
> scan could return tuples from any partition. This seems to directly
> undercut that effort: we'd be back to generic plans causing broad
> locking regardless of actual runtime needs.

Just trying to understand the locking difference more with/without
global index, let's assume we have normal partitioned index on non
partition key column, and if we issue a scan on the partitioned table
then internally from expand_partitioned_rtentry() we will take lock on
all the partitions, because we can not prune any partition.
Similarly, with the global index also all the child partitions under
the top partition on which we have global index will be locked.  So in
this case we do not have a difference.

> I understand that this is currently necessary, given that a global
> index scan is a single node without per-partition awareness. But it
> might be worth considering whether the scan could opportunistically
> defer heap relation locking until it returns a tuple that actually
> belongs to a particular partition -- similar to how inserts into
> partitioned tables only lock the target partition at execution time.
> Or did I miss that inserts also need to lock all partitions up front
> when global indexes are present, due to cross-partition uniqueness
> checks?
>
> Let me know if I’ve misunderstood the design.

So there difference is in the cases, where we are directly operating
on the leaf table, e.g. if you inserting directly on the leaf
relation, currently we just need to lock that partition, but if there
is global index we need to lock other siblings as well (in short all
the leaf under the parent which has global index) because if the
global index is unique we might need to check unique conflict in other
leafs as well.  I believe when the table is partitioned, it might not
be the most preferred way to operate directly on the leaf, and with
global index only this case will be impacted where we are doing DML
directly on the leaf.  I am not sure in this case how much delay we
can do in locking, because e.g. for insert we will only identify which
partition has a duplicate key while inserting in the btree.

-- 
Regards,
Dilip Kumar
Google

Re: Proposal: Global Index for PostgreSQL

amit <amitlangote09@gmail.com> — 2025-07-02T13:48:22Z

On Wed, Jul 2, 2025 at 1:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Tue, Jul 1, 2025 at 7:12 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > I’ve been working on improving how we handle partition locking during
> > execution of generic plans. Specifically, I committed a patch to defer
> > locking of partitions until after pruning during ExecutorStart(), so
> > we avoid taking locks on partitions that aren’t actually needed --
> > even when the plan contains scans on all partitions. That patch was
> > later reverted, as Tom pointed out that the plan invalidation logic
> > wasn't cleanly handled.
>
> Yes I was following that thread, and at times when I was working on
> locking for global index I had in mind that I would have to do
> something to the locking after that patch is in.  Unfortunately that
> got reverted and then I didn't put any effort in reconsidering how
> locking is handled for global index.
...
>  But the goal remains: to avoid locking
> > unnecessary partitions, particularly in high-partition-count OLTP
> > setups that use PREPARE/EXECUTE.
>
> That makes sense.

Ok, good to know you were keeping a tab on it.

> > The proposed global index design, IIUC, requires locking all leaf
> > partitions up front during planning, and I guess during
> > AcquireExecutorLocks() when using a cached plan, because the index
> > scan could return tuples from any partition. This seems to directly
> > undercut that effort: we'd be back to generic plans causing broad
> > locking regardless of actual runtime needs.
>
> Just trying to understand the locking difference more with/without
> global index, let's assume we have normal partitioned index on non
> partition key column, and if we issue a scan on the partitioned table
> then internally from expand_partitioned_rtentry() we will take lock on
> all the partitions, because we can not prune any partition.
> Similarly, with the global index also all the child partitions under
> the top partition on which we have global index will be locked.  So in
> this case we do not have a difference.

Just to clarify -- I was hoping that, at least for SELECTs, we
wouldn’t need to lock all leaf partitions up front.

One of the potential selling points of global indexes (compared to
partitioned indexes) is that we can avoid expand_partitioned_rtentry()
and the full scan path setup per partition, though that's admittedly
quite an undertaking.  So I was imagining we could just lock the
parent and the global index during planning, and only lock individual
heap relations at execution time -- once we know which partition the
returned tuple belongs to.

Locking isn’t cheap -- and in workloads with thousands of partitions,
it becomes a major source of overhead, especially when the query
doesn't contain pruning quals and ends up touching only a few
partitions in practice. So I think it’s worth seeing if we can avoid
planning-time locking of all partitions in at least the SELECT case,
even if INSERTs may require broader locking due to uniqueness checks,
but see below...

> > I understand that this is currently necessary, given that a global
> > index scan is a single node without per-partition awareness. But it
> > might be worth considering whether the scan could opportunistically
> > defer heap relation locking until it returns a tuple that actually
> > belongs to a particular partition -- similar to how inserts into
> > partitioned tables only lock the target partition at execution time.
> > Or did I miss that inserts also need to lock all partitions up front
> > when global indexes are present, due to cross-partition uniqueness
> > checks?
> >
> > Let me know if I’ve misunderstood the design.
>
> So there difference is in the cases, where we are directly operating
> on the leaf table, e.g. if you inserting directly on the leaf
> relation, currently we just need to lock that partition, but if there
> is global index we need to lock other siblings as well (in short all
> the leaf under the parent which has global index) because if the
> global index is unique we might need to check unique conflict in other
> leafs as well.  I believe when the table is partitioned, it might not
> be the most preferred way to operate directly on the leaf, and with
> global index only this case will be impacted where we are doing DML
> directly on the leaf.  I am not sure in this case how much delay we
> can do in locking, because e.g. for insert we will only identify which
> partition has a duplicate key while inserting in the btree.

Hmm, it’s my understanding that with a true global index, meaning a
single btree structure spanning all partitions, uniqueness conflicts
are detected by probing the index after inserting the tuple into the
heap. So unless we find a matching key in the index, there is no need
to consult any other partitions.

Even if a match is found, we only need to access the heap page for
that specific TID to check visibility, and that would involve just one
partition.

Why then do we need to lock all leaf partitions in advance? It seems
like we could defer locking until the uniqueness check identifies a
partition that actually contains a conflicting tuple, and only then
lock that one heap.

I understand that in some earlier floated ideas for enforcing global
uniqueness (perhaps only briefly mentioned in past discussions), the
approach was to scan all per-partition indexes, which would have
required locking everything up front. But with a unified global index,
that overhead seems avoidable.

Is there something subtle that I am missing that still requires
locking all heaps in advance?

-- 
Thanks, Amit Langote

Re: Proposal: Global Index for PostgreSQL

Dilip Kumar <dilipbalaut@gmail.com> — 2025-07-03T04:23:55Z

On Wed, Jul 2, 2025 at 7:18 PM Amit Langote <amitlangote09@gmail.com> wrote:

> Just to clarify -- I was hoping that, at least for SELECTs, we
> wouldn’t need to lock all leaf partitions up front.
>
> One of the potential selling points of global indexes (compared to
> partitioned indexes) is that we can avoid expand_partitioned_rtentry()
> and the full scan path setup per partition, though that's admittedly
> quite an undertaking.  So I was imagining we could just lock the
> parent and the global index during planning, and only lock individual
> heap relations at execution time -- once we know which partition the
> returned tuple belongs to.
>
> Locking isn’t cheap -- and in workloads with thousands of partitions,
> it becomes a major source of overhead, especially when the query
> doesn't contain pruning quals and ends up touching only a few
> partitions in practice. So I think it’s worth seeing if we can avoid
> planning-time locking of all partitions in at least the SELECT case,
> even if INSERTs may require broader locking due to uniqueness checks,
> but see below...
>
> > > I understand that this is currently necessary, given that a global
> > > index scan is a single node without per-partition awareness. But it
> > > might be worth considering whether the scan could opportunistically
> > > defer heap relation locking until it returns a tuple that actually
> > > belongs to a particular partition -- similar to how inserts into
> > > partitioned tables only lock the target partition at execution time.
> > > Or did I miss that inserts also need to lock all partitions up front
> > > when global indexes are present, due to cross-partition uniqueness
> > > checks?
> > >
> > > Let me know if I’ve misunderstood the design.
> >
> > So there difference is in the cases, where we are directly operating
> > on the leaf table, e.g. if you inserting directly on the leaf
> > relation, currently we just need to lock that partition, but if there
> > is global index we need to lock other siblings as well (in short all
> > the leaf under the parent which has global index) because if the
> > global index is unique we might need to check unique conflict in other
> > leafs as well.  I believe when the table is partitioned, it might not
> > be the most preferred way to operate directly on the leaf, and with
> > global index only this case will be impacted where we are doing DML
> > directly on the leaf.  I am not sure in this case how much delay we
> > can do in locking, because e.g. for insert we will only identify which
> > partition has a duplicate key while inserting in the btree.
>
> Hmm, it’s my understanding that with a true global index, meaning a
> single btree structure spanning all partitions, uniqueness conflicts
> are detected by probing the index after inserting the tuple into the
> heap. So unless we find a matching key in the index, there is no need
> to consult any other partitions.
>
> Even if a match is found, we only need to access the heap page for
> that specific TID to check visibility, and that would involve just one
> partition.
>
> Why then do we need to lock all leaf partitions in advance? It seems
> like we could defer locking until the uniqueness check identifies a
> partition that actually contains a conflicting tuple, and only then
> lock that one heap.
>
> I understand that in some earlier floated ideas for enforcing global
> uniqueness (perhaps only briefly mentioned in past discussions), the
> approach was to scan all per-partition indexes, which would have
> required locking everything up front. But with a unified global index,
> that overhead seems avoidable.
>
> Is there something subtle that I am missing that still requires
> locking all heaps in advance?

Thanks Amit for raising this question.  I understand your point, and I
think this is mainly about whether to lock all the tables upfront or
delay the locking and lock them when we need to.  I remember the very
first patch when I implemented at least in case of unique checking I
was locking the table on the fly whenever we find the duplicate tuple
and there I had a discussion with Robert regarding the same.  Here is
what was out thinking, although it's in my words and Robert might have
something else in mind but this is based on what I think he told.

Here's what I remember our thinking was at the time, though Robert
might recall it differently. We concluded that the code was designed
so the planner would upfront decide which tables were needed for the
plan, lock them, and keep them in the range table. If it was a
prepared plan, AcquireExecutorLock would handle locking all of them
before execution began.  And then I changed the logic of lock on the
fly to locking everything upfront.

I'm now inclined to agree with your point. When dealing with multiple
partitions, especially when it's uncertain if all of them will even be
needed, the traditional approach of locking every potential partition
upfront isn't the most efficient strategy.  But one problem is we will
only discovered which partitions to lock (whether doing scan or
checking unique while inserting in the index) only when we are in
Btree code, so as of now I am not finding it strategically correct to
lock and build the relation descriptor right then and there, I don't
have any reason why it would be bad but may be one reason is this
could be the first code to do it and another reason is if this has to
lock millions of partitions (in case we find matching tuple from
multiple partition during scan) then locking all of them on the fly
from index AM code might not be the best idea.

What's your thought on this?  I would appreciate if @Robert Haas can
also share his thoughts.

-- 
Regards,
Dilip Kumar
Google

Re: Proposal: Global Index for PostgreSQL

amit <amitlangote09@gmail.com> — 2025-07-11T06:51:57Z

Hi Dilip,

Sorry for the late reply.

On Thu, Jul 3, 2025 at 1:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jul 2, 2025 at 7:18 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> > Just to clarify -- I was hoping that, at least for SELECTs, we
> > wouldn’t need to lock all leaf partitions up front.
> >
> > One of the potential selling points of global indexes (compared to
> > partitioned indexes) is that we can avoid expand_partitioned_rtentry()
> > and the full scan path setup per partition, though that's admittedly
> > quite an undertaking.  So I was imagining we could just lock the
> > parent and the global index during planning, and only lock individual
> > heap relations at execution time -- once we know which partition the
> > returned tuple belongs to.
> >
> > Locking isn’t cheap -- and in workloads with thousands of partitions,
> > it becomes a major source of overhead, especially when the query
> > doesn't contain pruning quals and ends up touching only a few
> > partitions in practice. So I think it’s worth seeing if we can avoid
> > planning-time locking of all partitions in at least the SELECT case,
> > even if INSERTs may require broader locking due to uniqueness checks,
> > but see below...
> >
> > > > I understand that this is currently necessary, given that a global
> > > > index scan is a single node without per-partition awareness. But it
> > > > might be worth considering whether the scan could opportunistically
> > > > defer heap relation locking until it returns a tuple that actually
> > > > belongs to a particular partition -- similar to how inserts into
> > > > partitioned tables only lock the target partition at execution time.
> > > > Or did I miss that inserts also need to lock all partitions up front
> > > > when global indexes are present, due to cross-partition uniqueness
> > > > checks?
> > > >
> > > > Let me know if I’ve misunderstood the design.
> > >
> > > So there difference is in the cases, where we are directly operating
> > > on the leaf table, e.g. if you inserting directly on the leaf
> > > relation, currently we just need to lock that partition, but if there
> > > is global index we need to lock other siblings as well (in short all
> > > the leaf under the parent which has global index) because if the
> > > global index is unique we might need to check unique conflict in other
> > > leafs as well.  I believe when the table is partitioned, it might not
> > > be the most preferred way to operate directly on the leaf, and with
> > > global index only this case will be impacted where we are doing DML
> > > directly on the leaf.  I am not sure in this case how much delay we
> > > can do in locking, because e.g. for insert we will only identify which
> > > partition has a duplicate key while inserting in the btree.
> >
> > Hmm, it’s my understanding that with a true global index, meaning a
> > single btree structure spanning all partitions, uniqueness conflicts
> > are detected by probing the index after inserting the tuple into the
> > heap. So unless we find a matching key in the index, there is no need
> > to consult any other partitions.
> >
> > Even if a match is found, we only need to access the heap page for
> > that specific TID to check visibility, and that would involve just one
> > partition.
> >
> > Why then do we need to lock all leaf partitions in advance? It seems
> > like we could defer locking until the uniqueness check identifies a
> > partition that actually contains a conflicting tuple, and only then
> > lock that one heap.
> >
> > I understand that in some earlier floated ideas for enforcing global
> > uniqueness (perhaps only briefly mentioned in past discussions), the
> > approach was to scan all per-partition indexes, which would have
> > required locking everything up front. But with a unified global index,
> > that overhead seems avoidable.
> >
> > Is there something subtle that I am missing that still requires
> > locking all heaps in advance?
>
> Thanks Amit for raising this question.  I understand your point, and I
> think this is mainly about whether to lock all the tables upfront or
> delay the locking and lock them when we need to.  I remember the very
> first patch when I implemented at least in case of unique checking I
> was locking the table on the fly whenever we find the duplicate tuple
> and there I had a discussion with Robert regarding the same.  Here is
> what was out thinking, although it's in my words and Robert might have
> something else in mind but this is based on what I think he told.
>
> Here's what I remember our thinking was at the time, though Robert
> might recall it differently. We concluded that the code was designed
> so the planner would upfront decide which tables were needed for the
> plan, lock them, and keep them in the range table. If it was a
> prepared plan, AcquireExecutorLock would handle locking all of them
> before execution began.  And then I changed the logic of lock on the
> fly to locking everything upfront.
>
> I'm now inclined to agree with your point. When dealing with multiple
> partitions, especially when it's uncertain if all of them will even be
> needed, the traditional approach of locking every potential partition
> upfront isn't the most efficient strategy.  But one problem is we will
> only discovered which partitions to lock (whether doing scan or
> checking unique while inserting in the index) only when we are in
> Btree code, so as of now I am not finding it strategically correct to
> lock and build the relation descriptor right then and there, I don't
> have any reason why it would be bad but may be one reason is this
> could be the first code to do it and another reason is if this has to
> lock millions of partitions (in case we find matching tuple from
> multiple partition during scan) then locking all of them on the fly
> from index AM code might not be the best idea.
>
> What's your thought on this?  I would appreciate if @Robert Haas can
> also share his thoughts.

Thanks, I see that you've thought this through and opted for the safe
route of locking all possible partitions during planning, both for
SELECT and INSERT.

That seems like the only viable option today, given how the executor
assumes that each index scan targets a single heap relation which has
already been opened and locked before execution begins. But even so, I
think we should not assume locking is cheap. Even with improvements
like fast-path locking or a higher max_locks_per_transaction, locking
thousands of partitions still adds up. This can become noticeable even
in regular query execution, since one of the motivations for global
indexes is to reduce planning effort, for example by avoiding
per-partition scan node generation. In such cases, the relative cost
of locking can become a dominant part of query startup time.

I do not have a better solution right now, but it is worth keeping
this tradeoff in mind.

-- 
Thanks, Amit Langote

Re: Proposal: Global Index for PostgreSQL

Dilip Kumar <dilipbalaut@gmail.com> — 2025-07-11T11:17:27Z

On Fri, Jul 11, 2025 at 12:22 PM Amit Langote <amitlangote09@gmail.com> wrote:

> Thanks, I see that you've thought this through and opted for the safe
> route of locking all possible partitions during planning, both for
> SELECT and INSERT.
>
> That seems like the only viable option today, given how the executor
> assumes that each index scan targets a single heap relation which has
> already been opened and locked before execution begins. But even so, I
> think we should not assume locking is cheap. Even with improvements
> like fast-path locking or a higher max_locks_per_transaction, locking
> thousands of partitions still adds up. This can become noticeable even
> in regular query execution, since one of the motivations for global
> indexes is to reduce planning effort, for example by avoiding
> per-partition scan node generation. In such cases, the relative cost
> of locking can become a dominant part of query startup time.

That's right

> I do not have a better solution right now, but it is worth keeping
> this tradeoff in mind.

I agree.  Thanks for your opinion on this.

-- 
Regards,
Dilip Kumar
Google