Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. xml2: Improve error handling of libxml2 calls

  1. Proposal: Global Index for PostgreSQL

    Dilip Kumar <dilipbalaut@gmail.com> — 2025-06-06T07:00:05Z

    PostgreSQL’s table partitioning allows large tables to be broken into
    smaller, more manageable pieces for better performance. However, a key
    limitation currently is the absence of global indexes, which restricts
    using partitioned tables, especially when you need unique constraints
    on columns that aren't part of the partition key.  It's true that
    global indexes might seem to go against the core idea of partitioning.
    However, they become essential when business needs dictate
    partitioning data on one key while also enforcing uniqueness on a
    different column. Without global indexes, users simply couldn't
    leverage partitioning at all in such scenarios.
    
    I've been developing a design and a POC for global indexes to address
    this. I'd like to share my work, including the challenges we still
    face and my proposed solutions.
    
    Storage
    =======
    As we know that since global indexes are covering tuple of multiple
    partitions, TIDs are not sufficient to uniquely identifying the tuple
    so for that I have introduced a partitioned identifier which is 4
    bytes integer and we may argue more on this whether this can be a
    variable length or not for saving space.  I will describe more on this
    partition identifier in a later part of the email and what's the
    reason for choosing this instead of just a relation Oid as that seems
    much straightforward.
    
    Partition identifier is stored as part of the index tuple right after
    the last key column, storing it here after the last key column make a
    lot of things work in very straight forward way without any special
    handling for this partition identifier column, e.g. btree tuple are
    arranged in key order and if key are duplicate its arranged in TID
    order and now for global index we want if the keys are duplicates it
    should be arranged in partitioned identifier order and if partition
    identifier is also same then in TID order, so by storing identifier
    after the last key column and if we consider that field as part of the
    extended key column then tuple will automatically will be arranged in
    (keys, partition identifier, TID) order as we desire.  Similarly
    suffix truncation and deduplication will also be simpler with this
    design choice.
    
    Partition Identifier
    ==============
    Before discussing anything about Create Index/Attach/Detach partition,
    I think we need to discuss the partition identifier.  So the main
    reason for not using relation Oid as partition identifier is because
    if we do so we would be forced to remove all the tuple of the detached
    partitions from the global index, otherwise it would be very hard to
    identify the tuple which are not valid or we should not scan for
    example if the same partition detaches and reattaches then we would
    not be able to distinguish between old (which should be ignore) and
    new tuples, and also if Oid got reassigned to some other partition old
    partition is dropped and wraparound happened.
    
    To solve this, we've introduced a partition identifier, a 4-byte
    integer. We'll use a new catalog called pg_index_partitions to store a
    mapping from (partition identifier, global index OID) to relation OID.
    
    The mapping isn't a direct one from partition identifier to relation
    ID because each partition will have a different partition identifier
    for each global index it's associated with. This is crucial for
    correctly handling global indexes in multi-level partition
    hierarchies.
    
    Consider this example:
    - T is the top-level parent and has global index g.
    - T1 and T2 are children of T. T2 is partitioned table itself and also
    has global index g2.
    - T22 is a child of T2.
    
    Let's consider a scenario where if you assign a single partition
    identifier to T22 and create a one-to-one mapping with its relation
    IDs. What happens if we detach T2 from T? (While we haven't delved
    into the detach process yet, our current thought is to invalidate the
    partition identifier entry so that during a scan, when we look up the
    partition identifier to find the corresponding reloid, we'd simply
    ignore it.)
    
    However, this approach presents a problem. We can't simply invalidate
    the entry for T22 entirely, because while its partition identifier
    should be considered invalid when scanning global index 'g' (since T2
    is detached from T), it should still be valid when scanning global
    index 'g2' (which belongs to T2).
    
    To address this, each leaf partition needs to be assigned a separate
    partition identifier for every global index it's associated with.
    While a separate partition identifier for each level in the partition
    hierarchy could suffice, we've opted for the simpler approach of
    assigning a distinct identifier per global index. We made this choice
    assuming users will be judicious in creating global indexes within
    their partition hierarchies, meaning we won't have to worry about
    rapidly consuming a large number of partition IDs. Additionally, the
    partition identifier counter can be maintained per global index rather
    than globally, this will avoid growing this counter rapidly.
    
    Creating a global Index
    =================
    While creating an index if the index is marked as GLOBAL, then there
    would be some differences compared to the partitioned index, 1) And
    additional internal partition identifier key column will be added 2) A
    partition id will be allocated to each leaf partitions exist under the
    top level partition on which we are creating a global index and
    mapping will be stored in pg_index_partition table  3) Unlike
    partitioned index global index will not be created on each child
    instead this will be created only on the partitioned relation on which
    user chooses to create 4) Index build needs to be modified to scan
    through all the leaf partitions and create a single sort space for
    inserting into the global index.
    
    Attach Partition
    =============
    While attaching a partition, if this is a leaf partition, we need to
    assign a new partition id with respect to each global index present on
    its parent and ancestors and insert the mapping in the
    pg_index_partitions table.  If the partition being attached is itself
    a partitioned table then this step has to be done for every leaf
    partition present in the partition hierarchy under the partitioned
    table being attached.
    
    Currently we reindex the global indexes present on the parent parent
    and ancestor to which we are attaching a partition(s), but this can be
    optimized for example we may choose to selectively insert the tuple of
    the partition being attached but in some cases it could be costly if
    the attached partition itself has a lot of tuple compared to existing
    partitions which are already attached.  So this could be a mixed
    approach and can be decided based on the existing number of tuples
    index the index vs new in coming tuples, and as of now I am listing
    this as open for suggestion items.
    
    Detach Partition
    =============
    When detaching or dropping a partition, we just need to invalidate the
    corresponding mapping in pg_index_partitions. Specifically, for each
    leaf partition being detached, we'll mark its reloid as invalid within
    the (indexoid, partition id) entry. This ensures that during an index
    scan, when the system attempts to convert a partition ID to its
    reloid, it will recognize this entry as invalid. We'll also need to
    consider a mechanism to clean up these invalidated entries at some
    point.
    
    Global Index Scan
    ==============
    In short, the global index scan will be similar to the normal index.
    The key difference is that we'll also need to track heapOid alongside
    heapTid within BTScanPosItem. Although we store the partition ID
    directly in the index tuple, we can readily convert it to a heapOid by
    referencing the pg_index_partitions table. For performance, we'll
    maintain a cache within the global index's Relcache entry; this cache
    will simply store the mapping of partitions associated with that
    specific global index.
    
    Furthermore, we must account for new index access paths when global
    indexes are present. Currently, set_append_rel_size() generates index
    restriction information for each append relation. Now, this
    restriction information must also be generated for the partitioned
    table itself. Additionally, within set_append_rel_pathlist(), we'll
    need to invoke create_index_paths() for the partitioned table. This
    ensures that we consider index paths for the partitioned table, which,
    in the case of partitioned tables, will be global index scan paths.
    
    Locking consideration
    =================
    Currently, when we perform DML operations on a top-level partitioned
    table, we acquire a lock on the entire partition hierarchy beneath it,
    which works well. However, if we're directly operating on a leaf
    relation, say, inserting a tuple, we only lock that specific relation.
    This isn't sufficient for global indexes.
    
    Since a global index can reside on a parent table, inserting a tuple
    into a leaf relation also necessitates an insertion into the global
    index. This means we also need to lock the parent table on which the
    global index is created.
    
    However, even that isn't enough. We actually need to lock the entire
    partition hierarchy under the parent table that has the global index.
    Here's why: when you insert a tuple into any leaf partition, you'll
    also insert it into the global index. If it's a unique index, a
    conflict might arise with a key belonging to another partition. To
    validate whether the tuple is live or not, we might need to access
    data in other partitions.
    
    Based on our chosen design, we identify all these necessary locks
    during the planning phase. We then acquire these locks during planning
    and store them within the planned statement. This way, if the
    statement is prepared and executed later, these locks can be
    re-acquired during AcquireExecutorLocks.
    
    Open Problems/Need suggestions
    ==========================
    Here is the list of top few problems which would be good to discuss
    sooner than later
    
    Vacuum for global indexes
    —----------------------------------
    I think this has been discussed multiple times in the past whenever we
    talked about the global index. The core issue is that, by default,
    global indexes are vacuumed with each individual partition. This
    becomes incredibly inefficient and costly when dealing with millions
    of partitions, as it leads to repeated, expensive scans of a large
    global index.
    
    Ideally, global indexes should only be vacuumed once, after all
    partitions have been processed. This is because a global index keeps
    data from all partitions, meaning a single vacuum operation after all
    partition vacuums would be sufficient and far more efficient. This
    approach would significantly reduce the overhead associated with
    maintaining global indexes in large, partitioned datasets.
    
    Our testing highlights a significant performance bottleneck when
    vacuuming with global indexes. In a scenario with 1,000 partitions and
    a total of 50 million tuples, vacuuming without global indexes took a
    mere 15 seconds. However, after introducing global indexes, the vacuum
    time skyrocketed to 45 minutes, a 300-fold increase! This dramatic
    slowdown is expected, as the global index, containing data from all
    partitions, is effectively vacuumed with each of the 1,000 individual
    partition vacuums.
    
    We've made a quick but significant optimization: we now vacuum each
    partition and its local indexes first, skipping the global index
    vacuum and the second pass of heap vacuuming for a bit. Instead, we
    just keep track of the dead-tid stores (dead tuples) for each
    partition. Once all partitions are done, we then vacuum the global
    index once and perform that second heap pass across all partitions.
    
    This change slashed our vacuuming time from 45 minutes down to just 40
    seconds! This clearly shows we have plenty of room for optimization,
    and this challenge isn't a showstopper. I'll share more details on
    other solutions I'm testing in an upcoming email to keep this one
    concise.
    
    
    Global Index rewrite
    —-------------------------
    One particular class of things that needs careful consideration is
    table-rewriting operations. We have a few of those: CLUSTER, VACUUM
    FULL..ALTER TABLE etc, When such an operation occurs, all the TIDs
    potentially change, so tablecmds.c arranges to rebuild indexes. When
    there’s a global index involved, we also need to rebuild global
    indexes. Or, as a possible alternative strategy, we could allocate a
    new partition ID and just leave the index entries to be cleaned out
    eventually, but that seems to risk a lot of bloat. However, a key
    point here is that we don’t want to be rebuilding the same global
    indexes over and over. If someone does a table-rewriting operation
    across a whole partitioning hierarchy, we don’t want to rebuild the
    same global indexes multiple times.
    
    We also need to consider what happens when relations are truncated.
    Here, the two possible strategies are: (1) assign new partition IDs
    and leave the old index entries around, (2) truncate the entire index
    and then reinsert entries for any untruncated partitions. As above (1)
    risks bloat, but it also makes TRUNCATE fast. Of course, if the whole
    partitioning hierarchy is truncated all at once, then (2) is clearly
    better.
    
    I'm aiming to submit the first WIP patch set before the July
    commitfest. It won't have all the issues ironed out yet, but the main
    design will be functional.
    
    Thanks, Robert, for many of the key design ideas and regular
    discussion throughout designing this. I'd also like to thank Joe,
    Peter Geoghegan, Alvaro, and Masahiko Sawada for discussing the
    independent issues with me offlist.
    
    -- 
    Regards,
    Dilip Kumar
    Google
    
    
    
    
  2. Re: Proposal: Global Index for PostgreSQL

    wenhui qiu <qiuwenhuifx@gmail.com> — 2025-06-06T07:30:48Z

    Hi Dilip Kumar
       Thank you for your working on this ,I remember six years ago there was
    talk about global index ,You can see if this mailing list has any
    references to (
    https://www.postgresql.org/message-id/CALtqXTcurqy1PKXzP9XO%3DofLLA5wBSo77BnUnYVEZpmcA3V0ag%40mail.gmail.com
    )
    
    
    
    Thanks
    
    On Fri, Jun 6, 2025 at 3:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    
    > PostgreSQL’s table partitioning allows large tables to be broken into
    > smaller, more manageable pieces for better performance. However, a key
    > limitation currently is the absence of global indexes, which restricts
    > using partitioned tables, especially when you need unique constraints
    > on columns that aren't part of the partition key.  It's true that
    > global indexes might seem to go against the core idea of partitioning.
    > However, they become essential when business needs dictate
    > partitioning data on one key while also enforcing uniqueness on a
    > different column. Without global indexes, users simply couldn't
    > leverage partitioning at all in such scenarios.
    >
    > I've been developing a design and a POC for global indexes to address
    > this. I'd like to share my work, including the challenges we still
    > face and my proposed solutions.
    >
    > Storage
    > =======
    > As we know that since global indexes are covering tuple of multiple
    > partitions, TIDs are not sufficient to uniquely identifying the tuple
    > so for that I have introduced a partitioned identifier which is 4
    > bytes integer and we may argue more on this whether this can be a
    > variable length or not for saving space.  I will describe more on this
    > partition identifier in a later part of the email and what's the
    > reason for choosing this instead of just a relation Oid as that seems
    > much straightforward.
    >
    > Partition identifier is stored as part of the index tuple right after
    > the last key column, storing it here after the last key column make a
    > lot of things work in very straight forward way without any special
    > handling for this partition identifier column, e.g. btree tuple are
    > arranged in key order and if key are duplicate its arranged in TID
    > order and now for global index we want if the keys are duplicates it
    > should be arranged in partitioned identifier order and if partition
    > identifier is also same then in TID order, so by storing identifier
    > after the last key column and if we consider that field as part of the
    > extended key column then tuple will automatically will be arranged in
    > (keys, partition identifier, TID) order as we desire.  Similarly
    > suffix truncation and deduplication will also be simpler with this
    > design choice.
    >
    > Partition Identifier
    > ==============
    > Before discussing anything about Create Index/Attach/Detach partition,
    > I think we need to discuss the partition identifier.  So the main
    > reason for not using relation Oid as partition identifier is because
    > if we do so we would be forced to remove all the tuple of the detached
    > partitions from the global index, otherwise it would be very hard to
    > identify the tuple which are not valid or we should not scan for
    > example if the same partition detaches and reattaches then we would
    > not be able to distinguish between old (which should be ignore) and
    > new tuples, and also if Oid got reassigned to some other partition old
    > partition is dropped and wraparound happened.
    >
    > To solve this, we've introduced a partition identifier, a 4-byte
    > integer. We'll use a new catalog called pg_index_partitions to store a
    > mapping from (partition identifier, global index OID) to relation OID.
    >
    > The mapping isn't a direct one from partition identifier to relation
    > ID because each partition will have a different partition identifier
    > for each global index it's associated with. This is crucial for
    > correctly handling global indexes in multi-level partition
    > hierarchies.
    >
    > Consider this example:
    > - T is the top-level parent and has global index g.
    > - T1 and T2 are children of T. T2 is partitioned table itself and also
    > has global index g2.
    > - T22 is a child of T2.
    >
    > Let's consider a scenario where if you assign a single partition
    > identifier to T22 and create a one-to-one mapping with its relation
    > IDs. What happens if we detach T2 from T? (While we haven't delved
    > into the detach process yet, our current thought is to invalidate the
    > partition identifier entry so that during a scan, when we look up the
    > partition identifier to find the corresponding reloid, we'd simply
    > ignore it.)
    >
    > However, this approach presents a problem. We can't simply invalidate
    > the entry for T22 entirely, because while its partition identifier
    > should be considered invalid when scanning global index 'g' (since T2
    > is detached from T), it should still be valid when scanning global
    > index 'g2' (which belongs to T2).
    >
    > To address this, each leaf partition needs to be assigned a separate
    > partition identifier for every global index it's associated with.
    > While a separate partition identifier for each level in the partition
    > hierarchy could suffice, we've opted for the simpler approach of
    > assigning a distinct identifier per global index. We made this choice
    > assuming users will be judicious in creating global indexes within
    > their partition hierarchies, meaning we won't have to worry about
    > rapidly consuming a large number of partition IDs. Additionally, the
    > partition identifier counter can be maintained per global index rather
    > than globally, this will avoid growing this counter rapidly.
    >
    > Creating a global Index
    > =================
    > While creating an index if the index is marked as GLOBAL, then there
    > would be some differences compared to the partitioned index, 1) And
    > additional internal partition identifier key column will be added 2) A
    > partition id will be allocated to each leaf partitions exist under the
    > top level partition on which we are creating a global index and
    > mapping will be stored in pg_index_partition table  3) Unlike
    > partitioned index global index will not be created on each child
    > instead this will be created only on the partitioned relation on which
    > user chooses to create 4) Index build needs to be modified to scan
    > through all the leaf partitions and create a single sort space for
    > inserting into the global index.
    >
    > Attach Partition
    > =============
    > While attaching a partition, if this is a leaf partition, we need to
    > assign a new partition id with respect to each global index present on
    > its parent and ancestors and insert the mapping in the
    > pg_index_partitions table.  If the partition being attached is itself
    > a partitioned table then this step has to be done for every leaf
    > partition present in the partition hierarchy under the partitioned
    > table being attached.
    >
    > Currently we reindex the global indexes present on the parent parent
    > and ancestor to which we are attaching a partition(s), but this can be
    > optimized for example we may choose to selectively insert the tuple of
    > the partition being attached but in some cases it could be costly if
    > the attached partition itself has a lot of tuple compared to existing
    > partitions which are already attached.  So this could be a mixed
    > approach and can be decided based on the existing number of tuples
    > index the index vs new in coming tuples, and as of now I am listing
    > this as open for suggestion items.
    >
    > Detach Partition
    > =============
    > When detaching or dropping a partition, we just need to invalidate the
    > corresponding mapping in pg_index_partitions. Specifically, for each
    > leaf partition being detached, we'll mark its reloid as invalid within
    > the (indexoid, partition id) entry. This ensures that during an index
    > scan, when the system attempts to convert a partition ID to its
    > reloid, it will recognize this entry as invalid. We'll also need to
    > consider a mechanism to clean up these invalidated entries at some
    > point.
    >
    > Global Index Scan
    > ==============
    > In short, the global index scan will be similar to the normal index.
    > The key difference is that we'll also need to track heapOid alongside
    > heapTid within BTScanPosItem. Although we store the partition ID
    > directly in the index tuple, we can readily convert it to a heapOid by
    > referencing the pg_index_partitions table. For performance, we'll
    > maintain a cache within the global index's Relcache entry; this cache
    > will simply store the mapping of partitions associated with that
    > specific global index.
    >
    > Furthermore, we must account for new index access paths when global
    > indexes are present. Currently, set_append_rel_size() generates index
    > restriction information for each append relation. Now, this
    > restriction information must also be generated for the partitioned
    > table itself. Additionally, within set_append_rel_pathlist(), we'll
    > need to invoke create_index_paths() for the partitioned table. This
    > ensures that we consider index paths for the partitioned table, which,
    > in the case of partitioned tables, will be global index scan paths.
    >
    > Locking consideration
    > =================
    > Currently, when we perform DML operations on a top-level partitioned
    > table, we acquire a lock on the entire partition hierarchy beneath it,
    > which works well. However, if we're directly operating on a leaf
    > relation, say, inserting a tuple, we only lock that specific relation.
    > This isn't sufficient for global indexes.
    >
    > Since a global index can reside on a parent table, inserting a tuple
    > into a leaf relation also necessitates an insertion into the global
    > index. This means we also need to lock the parent table on which the
    > global index is created.
    >
    > However, even that isn't enough. We actually need to lock the entire
    > partition hierarchy under the parent table that has the global index.
    > Here's why: when you insert a tuple into any leaf partition, you'll
    > also insert it into the global index. If it's a unique index, a
    > conflict might arise with a key belonging to another partition. To
    > validate whether the tuple is live or not, we might need to access
    > data in other partitions.
    >
    > Based on our chosen design, we identify all these necessary locks
    > during the planning phase. We then acquire these locks during planning
    > and store them within the planned statement. This way, if the
    > statement is prepared and executed later, these locks can be
    > re-acquired during AcquireExecutorLocks.
    >
    > Open Problems/Need suggestions
    > ==========================
    > Here is the list of top few problems which would be good to discuss
    > sooner than later
    >
    > Vacuum for global indexes
    > —----------------------------------
    > I think this has been discussed multiple times in the past whenever we
    > talked about the global index. The core issue is that, by default,
    > global indexes are vacuumed with each individual partition. This
    > becomes incredibly inefficient and costly when dealing with millions
    > of partitions, as it leads to repeated, expensive scans of a large
    > global index.
    >
    > Ideally, global indexes should only be vacuumed once, after all
    > partitions have been processed. This is because a global index keeps
    > data from all partitions, meaning a single vacuum operation after all
    > partition vacuums would be sufficient and far more efficient. This
    > approach would significantly reduce the overhead associated with
    > maintaining global indexes in large, partitioned datasets.
    >
    > Our testing highlights a significant performance bottleneck when
    > vacuuming with global indexes. In a scenario with 1,000 partitions and
    > a total of 50 million tuples, vacuuming without global indexes took a
    > mere 15 seconds. However, after introducing global indexes, the vacuum
    > time skyrocketed to 45 minutes, a 300-fold increase! This dramatic
    > slowdown is expected, as the global index, containing data from all
    > partitions, is effectively vacuumed with each of the 1,000 individual
    > partition vacuums.
    >
    > We've made a quick but significant optimization: we now vacuum each
    > partition and its local indexes first, skipping the global index
    > vacuum and the second pass of heap vacuuming for a bit. Instead, we
    > just keep track of the dead-tid stores (dead tuples) for each
    > partition. Once all partitions are done, we then vacuum the global
    > index once and perform that second heap pass across all partitions.
    >
    > This change slashed our vacuuming time from 45 minutes down to just 40
    > seconds! This clearly shows we have plenty of room for optimization,
    > and this challenge isn't a showstopper. I'll share more details on
    > other solutions I'm testing in an upcoming email to keep this one
    > concise.
    >
    >
    > Global Index rewrite
    > —-------------------------
    > One particular class of things that needs careful consideration is
    > table-rewriting operations. We have a few of those: CLUSTER, VACUUM
    > FULL..ALTER TABLE etc, When such an operation occurs, all the TIDs
    > potentially change, so tablecmds.c arranges to rebuild indexes. When
    > there’s a global index involved, we also need to rebuild global
    > indexes. Or, as a possible alternative strategy, we could allocate a
    > new partition ID and just leave the index entries to be cleaned out
    > eventually, but that seems to risk a lot of bloat. However, a key
    > point here is that we don’t want to be rebuilding the same global
    > indexes over and over. If someone does a table-rewriting operation
    > across a whole partitioning hierarchy, we don’t want to rebuild the
    > same global indexes multiple times.
    >
    > We also need to consider what happens when relations are truncated.
    > Here, the two possible strategies are: (1) assign new partition IDs
    > and leave the old index entries around, (2) truncate the entire index
    > and then reinsert entries for any untruncated partitions. As above (1)
    > risks bloat, but it also makes TRUNCATE fast. Of course, if the whole
    > partitioning hierarchy is truncated all at once, then (2) is clearly
    > better.
    >
    > I'm aiming to submit the first WIP patch set before the July
    > commitfest. It won't have all the issues ironed out yet, but the main
    > design will be functional.
    >
    > Thanks, Robert, for many of the key design ideas and regular
    > discussion throughout designing this. I'd also like to thank Joe,
    > Peter Geoghegan, Alvaro, and Masahiko Sawada for discussing the
    > independent issues with me offlist.
    >
    > --
    > Regards,
    > Dilip Kumar
    > Google
    >
    >
    >
    
  3. Re: Proposal: Global Index for PostgreSQL

    Dilip Kumar <dilipbalaut@gmail.com> — 2025-06-07T04:36:50Z

    On Fri, Jun 6, 2025 at 1:01 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
    >
    > Hi Dilip Kumar
    >    Thank you for your working on this ,I remember six years ago there was talk about global index ,You can see if this mailing list has any references to (https://www.postgresql.org/message-id/CALtqXTcurqy1PKXzP9XO%3DofLLA5wBSo77BnUnYVEZpmcA3V0ag%40mail.gmail.com)
    >
    Sure Thanks.
    
    -- 
    Regards,
    Dilip Kumar
    Google
    
    
    
    
  4. Re: Proposal: Global Index for PostgreSQL

    Nikita Malakhov <hukutoc@gmail.com> — 2025-06-09T08:33:43Z

    Hi Dilip!
    
    Global Indexes is a very interesting functionality that has both
    significant advantages
    and drawbacks, and the community seems not ready to accept it without very
    strong
    motivation.
    There was a more recent approach to Global index problem [1], please check
    it out.
    
    I've read you proposal and have several questions:
    1) New catalog table with global index partitions would immediately affect
    interaction
    with user tables with global indexes because of corresponding locks that
    should be
    taken for [in]validation and attach/detach operations, this should be
    investigated;
    2) Changing relation OIDs (by, say, vacuum full) would immediately result
    in index
    inconsistency, what do you suppose to do with internal processes that could
    change
    relation OIDs? Also this question
    3) Would single sort space be enough for a more typical case when we have
    hundreds of partitions with hundreds of millions records in each? It is a
    normal
    production case for partitioned tables.
    4) Update-heavy partitioned tables that should run vacuum frequently.
    Significant
    vacuum slowdown would result in going beyond SLAs without corresponding
    significant improvements.
    
    [1] Global Unique Index
    <https://www.postgresql.org/message-id/flat/184879c5306.12490ea581628934.7312528450011769010%40highgo.ca>
    
    Thank you!
    
    -- 
    Regards,
    Nikita Malakhov
    Postgres Professional
    The Russian Postgres Company
    https://postgrespro.ru/
    
  5. Re: Proposal: Global Index for PostgreSQL

    Dilip Kumar <dilipbalaut@gmail.com> — 2025-06-09T09:58:38Z

    On Mon, Jun 9, 2025 at 2:03 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
    >
    > Hi Dilip!
    
    Thanks Nikita for your response and reading my proposal.
    
    > Global Indexes is a very interesting functionality that has both significant advantages
    > and drawbacks, and the community seems not ready to accept it without very strong
    > motivation.
    
    I understand that this is a hard problem and needs changes in many
    critical modules. I don't think there should be a problem with the
    motivation of this work, but I believe the main issue lies in the
    project's complexity.
    
    > There was a more recent approach to Global index problem [1], please check it out.
    
    I've reviewed the proposal, and I understand it aims to address
    uniqueness on partitioned tables for non-partition key columns.
    However, I'm concerned about the basic design principles' scalability.
    I believe it won't scale effectively beyond a relatively small number
    of partitions, and this limitation will be quite surprising to users.
    Specifically, checking uniqueness during inserts/updates across all
    indexes on each partition (since there's no global index) will become
    a significant bottleneck.
    
    > I've read you proposal and have several questions:
    
    > 1) New catalog table with global index partitions would immediately affect interaction
    > with user tables with global indexes because of corresponding locks that should be
    > taken for [in]validation and attach/detach operations, this should be investigated;
    
    Yeah that's right, but logically the attach/detach are DDL and are not
    most frequent operations, so are you worried about performance due to
    locking?
    
    > 2) Changing relation OIDs (by, say, vacuum full) would immediately result in index
    > inconsistency, what do you suppose to do with internal processes that could change
    > relation OIDs? Also this question
    
    I want to clarify that we don't store relation OIDs directly in the
    global index. Instead, the global index holds partition IDs, and the
    mapping from partition ID to relation OID is managed in a new catalog
    table, pg_index_partition. It's important to note that VACUUM FULL
    operations only alter the relfilenumber (the disk file OID), not the
    relation OID itself, which remains constant for the lifetime of the
    relation.
    
    What does change during a VACUUM FULL are the TIDs (tuple IDs).
    Because of this, all indexes, including global indexes, are reindexed.
    As I mentioned in my proposal, there's a significant opportunity for
    optimization here. Reindexing large global indexes is a costly
    operation, and I've proposed some ideas to improve this process.
    
    > 3) Would single sort space be enough for a more typical case when we have
    > hundreds of partitions with hundreds of millions records in each? It is a normal
    > production case for partitioned tables.
    
    In general, users ideally wouldn't use a global index everywhere. It
    really comes down to their specific use case – they should only opt
    for a global index when they can't effectively partition their data
    without one. The idea is that the amount of data in the sort space
    should essentially be the same as if the table wasn't partitioned at
    all. That's a good point for consideration. I agree that global
    indexes shouldn't be a default choice for every use case. They're most
    beneficial when a user's data access patterns inherently prevent
    effective partitioning without them. In such scenarios, the amount of
    data in the sort space would ideally remain comparable to an
    unpartitioned table.
    
    > 4) Update-heavy partitioned tables that should run vacuum frequently. Significant
    > vacuum slowdown would result in going beyond SLAs without corresponding
    > significant improvements.
    >
    You've got it. I'm on board with prioritizing a VACUUM optimization
    solution for partitioned tables with global indexes. My initial
    proposal touched on a proof-of-concept experiment, which indicated no
    significant performance hit with global index after the optimization.
    I'll share the detailed VACUUM optimization proposal in this thread
    within the next couple of days.
    
    --
    Regards,
    Dilip Kumar
    Google
    
    
    
    
  6. Re: Proposal: Global Index for PostgreSQL

    Bruce Momjian <bruce@momjian.us> — 2025-06-09T21:51:25Z

    On Mon, Jun  9, 2025 at 03:28:38PM +0530, Dilip Kumar wrote:
    > On Mon, Jun 9, 2025 at 2:03 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
    > > Global Indexes is a very interesting functionality that has both significant advantages
    > > and drawbacks, and the community seems not ready to accept it without very strong
    > > motivation.
    > 
    > I understand that this is a hard problem and needs changes in many
    > critical modules. I don't think there should be a problem with the
    > motivation of this work, but I believe the main issue lies in the
    > project's complexity.
    ...
    > In general, users ideally wouldn't use a global index everywhere. It
    > really comes down to their specific use case – they should only opt
    > for a global index when they can't effectively partition their data
    > without one. The idea is that the amount of data in the sort space
    > should essentially be the same as if the table wasn't partitioned at
    > all. That's a good point for consideration. I agree that global
    > indexes shouldn't be a default choice for every use case. They're most
    > beneficial when a user's data access patterns inherently prevent
    > effective partitioning without them. In such scenarios, the amount of
    > data in the sort space would ideally remain comparable to an
    > unpartitioned table.
    
    There are certainly use cases where this would be helpful, but I think
    the big question is whether it would have so many negatives that most
    people who try to use it would eventually remove it.  I have heard that
    happened to other relational systems who support global indexes, so I
    think we have to consider that possibility.  The problem is you might
    need to actually write the patch to find out.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      Do not let urgent matters crowd out time for investment in the future.
    
    
    
    
  7. Re: Proposal: Global Index for PostgreSQL

    Bruce Momjian <bruce@momjian.us> — 2025-06-10T19:38:11Z

    On Mon, Jun  9, 2025 at 05:51:25PM -0400, Bruce Momjian wrote:
    > On Mon, Jun  9, 2025 at 03:28:38PM +0530, Dilip Kumar wrote:
    > There are certainly use cases where this would be helpful, but I think
    > the big question is whether it would have so many negatives that most
    > people who try to use it would eventually remove it.  I have heard that
    > happened to other relational systems who support global indexes, so I
    > think we have to consider that possibility.  The problem is you might
    > need to actually write the patch to find out.
    
    FYI, I wrote a blog about global indexes in 2020:
    
    	https://momjian.us/main/blogs/pgblog/2020.html#July_1_2020
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        https://momjian.us
      EDB                                      https://enterprisedb.com
    
      Do not let urgent matters crowd out time for investment in the future.
    
    
    
    
  8. Re: Proposal: Global Index for PostgreSQL

    Dilip Kumar <dilipbalaut@gmail.com> — 2025-06-11T03:03:33Z

    On Tue, Jun 10, 2025 at 3:21 AM Bruce Momjian <bruce@momjian.us> wrote:
    
    Thanks Bruce for your thoughts on this.
    
    > On Mon, Jun  9, 2025 at 03:28:38PM +0530, Dilip Kumar wrote:
    > > On Mon, Jun 9, 2025 at 2:03 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
    > > > Global Indexes is a very interesting functionality that has both significant advantages
    > > > and drawbacks, and the community seems not ready to accept it without very strong
    > > > motivation.
    > >
    > > I understand that this is a hard problem and needs changes in many
    > > critical modules. I don't think there should be a problem with the
    > > motivation of this work, but I believe the main issue lies in the
    > > project's complexity.
    > ...
    > > In general, users ideally wouldn't use a global index everywhere. It
    > > really comes down to their specific use case – they should only opt
    > > for a global index when they can't effectively partition their data
    > > without one. The idea is that the amount of data in the sort space
    > > should essentially be the same as if the table wasn't partitioned at
    > > all. That's a good point for consideration. I agree that global
    > > indexes shouldn't be a default choice for every use case. They're most
    > > beneficial when a user's data access patterns inherently prevent
    > > effective partitioning without them. In such scenarios, the amount of
    > > data in the sort space would ideally remain comparable to an
    > > unpartitioned table.
    >
    > There are certainly use cases where this would be helpful, but I think
    > the big question is whether it would have so many negatives that most
    > people who try to use it would eventually remove it.
    
    Yeah that's a very valid point.
    
      I have heard that
    > happened to other relational systems who support global indexes, so I
    > think we have to consider that possibility.  The problem is you might
    > need to actually write the patch to find out.
    
    I've actually drafted the patch, and while it still has open issues to
    tackle, I believe it's ready for some interesting experimentation. For
    instance, I've observed significant performance gains during index
    scans on non-partition key columns, where scanning a single global
    index outperforms appending scan results from thousands of local
    indexes. However, I've also noted performance regression in
    insert/update cases when a global index is present, which is expected
    as we need to insert into a large index, so we need to evaluate what
    is acceptable and what not.
    
    My plan is to submit the patch in the next commitfest, including
    performance data. Before submission, I'll clean up the code further
    and add TODO comments wherever additional work is required.
    
    -- 
    Regards,
    Dilip Kumar
    Google
    
    
    
    
  9. Re: Proposal: Global Index for PostgreSQL

    Dilip Kumar <dilipbalaut@gmail.com> — 2025-06-11T03:04:24Z

    On Wed, Jun 11, 2025 at 1:08 AM Bruce Momjian <bruce@momjian.us> wrote:
    >
    > On Mon, Jun  9, 2025 at 05:51:25PM -0400, Bruce Momjian wrote:
    > > On Mon, Jun  9, 2025 at 03:28:38PM +0530, Dilip Kumar wrote:
    > > There are certainly use cases where this would be helpful, but I think
    > > the big question is whether it would have so many negatives that most
    > > people who try to use it would eventually remove it.  I have heard that
    > > happened to other relational systems who support global indexes, so I
    > > think we have to consider that possibility.  The problem is you might
    > > need to actually write the patch to find out.
    >
    > FYI, I wrote a blog about global indexes in 2020:
    
    Oh interesting, I would be happy to have a look at it, Thanks.
    
    -- 
    Regards,
    Dilip Kumar
    Google
    
    
    
    
  10. Re: Proposal: Global Index for PostgreSQL

    Dilip Kumar <dilipbalaut@gmail.com> — 2025-06-14T09:32:03Z

    On Mon, Jun 9, 2025 at 3:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    >
    > On Mon, Jun 9, 2025 at 2:03 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
    
    > > 4) Update-heavy partitioned tables that should run vacuum frequently. Significant
    > > vacuum slowdown would result in going beyond SLAs without corresponding
    > > significant improvements.
    > >
    > You've got it. I'm on board with prioritizing a VACUUM optimization
    > solution for partitioned tables with global indexes. My initial
    > proposal touched on a proof-of-concept experiment, which indicated no
    > significant performance hit with global index after the optimization.
    > I'll share the detailed VACUUM optimization proposal in this thread
    > within the next couple of days.
    
    As discussed earlier the one of main problems is that the global
    indexes are vacuumed along with each partition whereas logically it
    should be vacuumed only once when all the partitions are vacuum in an
    ideal world.
    
    So my proposal is that we make some infrastructure change in vacuum
    api's such that there is option to tell heap_vacuum_rel() to skip the
    global index vacuum and also return back the deadtid_store, vacuum
    layer will store these deadtid_store, hash it by the reloid until it
    vacuum all the partitions which are covered by a global index.  Once
    that is done it will vacuum the global index, we also need to modify
    the vac_tid_reaped() so that it can take additional input of reloid so
    that it can find the appropriate deadtid_store by looking into the
    hash and then look up the dead tid in respective deadtid_store.
    
    We also need to do something for the autovacuum, because currently
    autovacuum workers scan the pg_class and identify the relation which
    needs to be vacuumed and vacuum it one at a time.  However once we
    have educated vacuum machinery to first vacuum all the partitions and
    then perform global index vacuum, the autovacuum worker should know
    which all partitions are supposed to be vacuum together and the
    autovacuum worker can pass a list of all those partitions together to
    the vacuum machinery.
    
    I believe this enhancement can be implemented in autovacuum without
    significant difficulty. Currently, autovacuum scans pg_class to
    generate a list of relations requiring vacuuming. For partitioned
    tables that have a global index, we can extend this process by
    additionally maintaining a list of all their leaf relations within the
    parent table's entry.
    
    I believe this enhancement can be implemented in autovacuum without
    significant difficulty. Currently, autovacuum scans pg_class to
    generate a list of relations requiring vacuuming. For partitioned
    tables that have a global index, we can extend this process by
    additionally maintaining a list of all their leaf relations within the
    parent table's entry.
    
    An autovacuum worker would then process all the leaf relations in this
    complete hierarchy. To effectively prevent other workers from
    attempting to vacuum the same hierarchy concurrently, the worker would
    publish the top-most partitioned relation ID (which identifies the
    table with the global index) as MyWorkerInfo->wi_tableoid. This
    mechanism ensures that if one worker is processing a partitioned
    table, the entire set of child relations under that top-level ID is
    automatically skipped by other workers
    
    While this approach offers benefits, it's important to acknowledge
    certain potential drawbacks. One concern is a possible impact on
    parallelism, as currently each partition might be vacuumed by a
    separate worker, but with this change, all partitions covered by the
    same global index would have to be processed by a single worker.
    Another significant challenge lies in effectively choosing which
    partitions to vacuum; for instance, if a table with a global index has
    1000 partitions and only 10 meet the vacuum threshold in a given
    cycle, this method might not be very efficient. Although we wouldn't
    vacuum the global index once for every partition, we could still end
    up vacuuming it numerous times by the time all partitions are
    eventually processed. To mitigate this, an alternative strategy could
    involve proactively vacuuming partitions that haven't fully met their
    vacuum threshold but are close (e.g., reached 50% of the threshold).
    This would allow us to combine more partitions for a single vacuum
    operation, thereby reducing the number of times the global index needs
    to be vacuumed.
    
    
    -- 
    Regards,
    Dilip Kumar
    Google
    
    
    
    
  11. Re: Proposal: Global Index for PostgreSQL

    Masahiko Sawada <sawada.mshk@gmail.com> — 2025-06-17T23:07:25Z

    On Sat, Jun 14, 2025 at 2:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    >
    > On Mon, Jun 9, 2025 at 3:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    > >
    > > On Mon, Jun 9, 2025 at 2:03 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
    >
    > > > 4) Update-heavy partitioned tables that should run vacuum frequently. Significant
    > > > vacuum slowdown would result in going beyond SLAs without corresponding
    > > > significant improvements.
    > > >
    > > You've got it. I'm on board with prioritizing a VACUUM optimization
    > > solution for partitioned tables with global indexes. My initial
    > > proposal touched on a proof-of-concept experiment, which indicated no
    > > significant performance hit with global index after the optimization.
    > > I'll share the detailed VACUUM optimization proposal in this thread
    > > within the next couple of days.
    >
    > As discussed earlier the one of main problems is that the global
    > indexes are vacuumed along with each partition whereas logically it
    > should be vacuumed only once when all the partitions are vacuum in an
    > ideal world.
    >
    > So my proposal is that we make some infrastructure change in vacuum
    > api's such that there is option to tell heap_vacuum_rel() to skip the
    > global index vacuum and also return back the deadtid_store, vacuum
    > layer will store these deadtid_store, hash it by the reloid until it
    > vacuum all the partitions which are covered by a global index.  Once
    > that is done it will vacuum the global index, we also need to modify
    > the vac_tid_reaped() so that it can take additional input of reloid so
    > that it can find the appropriate deadtid_store by looking into the
    > hash and then look up the dead tid in respective deadtid_store.
    
    Does it need to keep holding dead TIDs for each partition until it
    completes vacuuming all partitions that are covered by the global
    index? If so, it would end up holding a huge amount of memory in cases
    where there are many partitions. How does maintanence_work_mem (or
    autovacuum_work_mem) work in this context? Also, what if the
    autovacuum worker who is processing the partitioned table with global
    indexes gets cancelled? I guess that we would need to scan the
    partitions again in order to collect dead TIDs to vacuum the global
    index but it would be very expensive.
    
    Regards,
    
    -- 
    Masahiko Sawada
    Amazon Web Services: https://aws.amazon.com
    
    
    
    
  12. Re: Proposal: Global Index for PostgreSQL

    Dilip Kumar <dilipbalaut@gmail.com> — 2025-06-18T10:45:13Z

    On Wed, Jun 18, 2025 at 4:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
    >
    > On Sat, Jun 14, 2025 at 2:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    > >
    
    Thanks for your opinion, Sawada-San.
    
    > Does it need to keep holding dead TIDs for each partition until it
    > completes vacuuming all partitions that are covered by the global
    > index? If so, it would end up holding a huge amount of memory in cases
    > where there are many partitions. How does maintanence_work_mem (or
    > autovacuum_work_mem) work in this context?
    
    So it will keep holding the deadtids until we vacuum all the
    partitions or we run out of the 'maintanence_work_mem', consider this
    similar to the case where you do not have concept of global index and
    for supporting the unique constraint on not partitioned key column
    user has to keep one giant table without partitioning it, then every
    time we run out of the  'maintanence_work_mem' we need to vacuum the
    indexes, same theory applies for the global index.
    
     Also, what if the
    > autovacuum worker who is processing the partitioned table with global
    > indexes gets cancelled? I guess that we would need to scan the
    > partitions again in order to collect dead TIDs to vacuum the global
    > index but it would be very expensive.
    
    You're right, but that's comparable to the cost of managing a single,
    unpartitioned giant table. I believe the primary value of global
    indexes lies in enabling users to partition tables that otherwise
    couldn't be. So, while certain maintenance tasks might incur similar
    costs to a single large table, you'll gain significant advantages
    during many other DML operations due to partitioning that wouldn't be
    possible without global indexes.
    
    -- 
    Regards,
    Dilip Kumar
    Google
    
    
    
    
  13. Re: Proposal: Global Index for PostgreSQL

    Dilip Kumar <dilipbalaut@gmail.com> — 2025-07-01T09:27:07Z

    On Wed, Jun 18, 2025 at 4:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    >
    > On Wed, Jun 18, 2025 at 4:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
    > >
    Here is the first WIP version of the patch set.  Commit message in
    each patch explains in detail what exactly it does so not adding the
    description in email.  Patches applies on the latest head
    (732061150b004385810e522f8629f5bf91d977b7)
    
    Open issues yet to be handled:
    =========================
    1) Vacuum optimization as I described above is still highly unstable
    patch so not attached here, so with this patch set if you create
    global index with large table with a lot of partition then there will
    be a huge regression in vacuum performance, which should be resolved
    after vacuum optimization patch which I am planning to post by this
    month.
    2) If you are attaching a partition which has a different column order
    than the parent and create a global index on that, it will not work,
    still working on it.
    3) For unique checking in btree, global index may get conflicting
    tuple belongs to different partitions so relation open is done on the
    fly, ideally that description should have been done in executor and
    pass down to btree but still identifying how to do that with minimum
    changes in AM interfaces, or can it be done without AM changes.
    4) Need to write more test cases for REINDEX TABLE, TRUNCATE TABLE and
    need to tighten up the global index rebuild so that it doesn't get
    rebuilt multiple times or doesn't skip rebuilding when needed.
    5) ALTER COLUMN SET TYPE, which triggers global index rewrite is not
    handled properly. This needs some more work, in the alter table
    machinery where we identify the list of indexes to rebuild.
    6) Need to perform a performance test, for SELECT/UPDATE/INSERT cases,
    we already know the VACUUM performance.
    7) global_index.sql is taking a long time to execute as I have kept a
    high data load, so that needs to be simplified.
    
    Note: Patches are still WIP and might have many loose ends, for now,
    make check-world is passing, including the global index test cases
    
    Credit:  I have already mentioned while sending the first design email
    but missed some, so adding it again here.
    
    1. Robert: for many of the key design ideas and regular discussion
    throughout designing this.
    2. Joe: Also has regular discussion on this and many suggestions,
    specially related to vacuum
    3. Peter Geoghegan, Alvaro, and Masahiko Sawada for discussing the
    independent issues with me offlist.
    4. I got the idea of syntax for creating global indexes and also
    keeping a cache of mapping from reloid to relation descriptor during
    the global index scan, from some of the very old thread related to
    global index (not able to find the thread now).
    
    --
    Regards,
    Dilip Kumar
    Google
    
  14. Re: Proposal: Global Index for PostgreSQL

    amit <amitlangote09@gmail.com> — 2025-07-01T13:42:06Z

    Hi Dilip,
    
    Happy to see you working on this.  It’s clear a lot of thought has
    gone into the design.
    
    On Tue, Jul 1, 2025 at 6:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    > 6) Need to perform a performance test, for SELECT/UPDATE/INSERT cases,
    > we already know the VACUUM performance.
    
    One point I want to check my understanding of is around the locking
    implications of global index scans, especially in prepared statement
    scenarios.
    
    I’ve been working on improving how we handle partition locking during
    execution of generic plans. Specifically, I committed a patch to defer
    locking of partitions until after pruning during ExecutorStart(), so
    we avoid taking locks on partitions that aren’t actually needed --
    even when the plan contains scans on all partitions. That patch was
    later reverted, as Tom pointed out that the plan invalidation logic
    wasn't cleanly handled. But the goal remains: to avoid locking
    unnecessary partitions, particularly in high-partition-count OLTP
    setups that use PREPARE/EXECUTE.
    
    The proposed global index design, IIUC, requires locking all leaf
    partitions up front during planning, and I guess during
    AcquireExecutorLocks() when using a cached plan, because the index
    scan could return tuples from any partition. This seems to directly
    undercut that effort: we'd be back to generic plans causing broad
    locking regardless of actual runtime needs.
    
    I understand that this is currently necessary, given that a global
    index scan is a single node without per-partition awareness. But it
    might be worth considering whether the scan could opportunistically
    defer heap relation locking until it returns a tuple that actually
    belongs to a particular partition -- similar to how inserts into
    partitioned tables only lock the target partition at execution time.
    Or did I miss that inserts also need to lock all partitions up front
    when global indexes are present, due to cross-partition uniqueness
    checks?
    
    Let me know if I’ve misunderstood the design.
    
    -- 
    Thanks, Amit Langote
    
    
    
    
  15. Re: Proposal: Global Index for PostgreSQL

    Dilip Kumar <dilipbalaut@gmail.com> — 2025-07-02T04:03:51Z

    On Tue, Jul 1, 2025 at 7:12 PM Amit Langote <amitlangote09@gmail.com> wrote:
    >
    > Hi Dilip,
    >
    > Happy to see you working on this.  It’s clear a lot of thought has
    > gone into the design.
    
    
    Thanks, Amit.  And thanks for your comment.
    
    > On Tue, Jul 1, 2025 at 6:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    > > 6) Need to perform a performance test, for SELECT/UPDATE/INSERT cases,
    > > we already know the VACUUM performance.
    >
    > One point I want to check my understanding of is around the locking
    > implications of global index scans, especially in prepared statement
    > scenarios.
    
    Sure
    
    > I’ve been working on improving how we handle partition locking during
    > execution of generic plans. Specifically, I committed a patch to defer
    > locking of partitions until after pruning during ExecutorStart(), so
    > we avoid taking locks on partitions that aren’t actually needed --
    > even when the plan contains scans on all partitions. That patch was
    > later reverted, as Tom pointed out that the plan invalidation logic
    > wasn't cleanly handled.
    
    Yes I was following that thread, and at times when I was working on
    locking for global index I had in mind that I would have to do
    something to the locking after that patch is in.  Unfortunately that
    got reverted and then I didn't put any effort in reconsidering how
    locking is handled for global index.
    
     But the goal remains: to avoid locking
    > unnecessary partitions, particularly in high-partition-count OLTP
    > setups that use PREPARE/EXECUTE.
    
    That makes sense.
    
    > The proposed global index design, IIUC, requires locking all leaf
    > partitions up front during planning, and I guess during
    > AcquireExecutorLocks() when using a cached plan, because the index
    > scan could return tuples from any partition. This seems to directly
    > undercut that effort: we'd be back to generic plans causing broad
    > locking regardless of actual runtime needs.
    
    Just trying to understand the locking difference more with/without
    global index, let's assume we have normal partitioned index on non
    partition key column, and if we issue a scan on the partitioned table
    then internally from expand_partitioned_rtentry() we will take lock on
    all the partitions, because we can not prune any partition.
    Similarly, with the global index also all the child partitions under
    the top partition on which we have global index will be locked.  So in
    this case we do not have a difference.
    
    > I understand that this is currently necessary, given that a global
    > index scan is a single node without per-partition awareness. But it
    > might be worth considering whether the scan could opportunistically
    > defer heap relation locking until it returns a tuple that actually
    > belongs to a particular partition -- similar to how inserts into
    > partitioned tables only lock the target partition at execution time.
    > Or did I miss that inserts also need to lock all partitions up front
    > when global indexes are present, due to cross-partition uniqueness
    > checks?
    >
    > Let me know if I’ve misunderstood the design.
    
    So there difference is in the cases, where we are directly operating
    on the leaf table, e.g. if you inserting directly on the leaf
    relation, currently we just need to lock that partition, but if there
    is global index we need to lock other siblings as well (in short all
    the leaf under the parent which has global index) because if the
    global index is unique we might need to check unique conflict in other
    leafs as well.  I believe when the table is partitioned, it might not
    be the most preferred way to operate directly on the leaf, and with
    global index only this case will be impacted where we are doing DML
    directly on the leaf.  I am not sure in this case how much delay we
    can do in locking, because e.g. for insert we will only identify which
    partition has a duplicate key while inserting in the btree.
    
    -- 
    Regards,
    Dilip Kumar
    Google
    
    
    
    
  16. Re: Proposal: Global Index for PostgreSQL

    amit <amitlangote09@gmail.com> — 2025-07-02T13:48:22Z

    On Wed, Jul 2, 2025 at 1:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    > On Tue, Jul 1, 2025 at 7:12 PM Amit Langote <amitlangote09@gmail.com> wrote:
    > > I’ve been working on improving how we handle partition locking during
    > > execution of generic plans. Specifically, I committed a patch to defer
    > > locking of partitions until after pruning during ExecutorStart(), so
    > > we avoid taking locks on partitions that aren’t actually needed --
    > > even when the plan contains scans on all partitions. That patch was
    > > later reverted, as Tom pointed out that the plan invalidation logic
    > > wasn't cleanly handled.
    >
    > Yes I was following that thread, and at times when I was working on
    > locking for global index I had in mind that I would have to do
    > something to the locking after that patch is in.  Unfortunately that
    > got reverted and then I didn't put any effort in reconsidering how
    > locking is handled for global index.
    ...
    >  But the goal remains: to avoid locking
    > > unnecessary partitions, particularly in high-partition-count OLTP
    > > setups that use PREPARE/EXECUTE.
    >
    > That makes sense.
    
    Ok, good to know you were keeping a tab on it.
    
    > > The proposed global index design, IIUC, requires locking all leaf
    > > partitions up front during planning, and I guess during
    > > AcquireExecutorLocks() when using a cached plan, because the index
    > > scan could return tuples from any partition. This seems to directly
    > > undercut that effort: we'd be back to generic plans causing broad
    > > locking regardless of actual runtime needs.
    >
    > Just trying to understand the locking difference more with/without
    > global index, let's assume we have normal partitioned index on non
    > partition key column, and if we issue a scan on the partitioned table
    > then internally from expand_partitioned_rtentry() we will take lock on
    > all the partitions, because we can not prune any partition.
    > Similarly, with the global index also all the child partitions under
    > the top partition on which we have global index will be locked.  So in
    > this case we do not have a difference.
    
    Just to clarify -- I was hoping that, at least for SELECTs, we
    wouldn’t need to lock all leaf partitions up front.
    
    One of the potential selling points of global indexes (compared to
    partitioned indexes) is that we can avoid expand_partitioned_rtentry()
    and the full scan path setup per partition, though that's admittedly
    quite an undertaking.  So I was imagining we could just lock the
    parent and the global index during planning, and only lock individual
    heap relations at execution time -- once we know which partition the
    returned tuple belongs to.
    
    Locking isn’t cheap -- and in workloads with thousands of partitions,
    it becomes a major source of overhead, especially when the query
    doesn't contain pruning quals and ends up touching only a few
    partitions in practice. So I think it’s worth seeing if we can avoid
    planning-time locking of all partitions in at least the SELECT case,
    even if INSERTs may require broader locking due to uniqueness checks,
    but see below...
    
    > > I understand that this is currently necessary, given that a global
    > > index scan is a single node without per-partition awareness. But it
    > > might be worth considering whether the scan could opportunistically
    > > defer heap relation locking until it returns a tuple that actually
    > > belongs to a particular partition -- similar to how inserts into
    > > partitioned tables only lock the target partition at execution time.
    > > Or did I miss that inserts also need to lock all partitions up front
    > > when global indexes are present, due to cross-partition uniqueness
    > > checks?
    > >
    > > Let me know if I’ve misunderstood the design.
    >
    > So there difference is in the cases, where we are directly operating
    > on the leaf table, e.g. if you inserting directly on the leaf
    > relation, currently we just need to lock that partition, but if there
    > is global index we need to lock other siblings as well (in short all
    > the leaf under the parent which has global index) because if the
    > global index is unique we might need to check unique conflict in other
    > leafs as well.  I believe when the table is partitioned, it might not
    > be the most preferred way to operate directly on the leaf, and with
    > global index only this case will be impacted where we are doing DML
    > directly on the leaf.  I am not sure in this case how much delay we
    > can do in locking, because e.g. for insert we will only identify which
    > partition has a duplicate key while inserting in the btree.
    
    Hmm, it’s my understanding that with a true global index, meaning a
    single btree structure spanning all partitions, uniqueness conflicts
    are detected by probing the index after inserting the tuple into the
    heap. So unless we find a matching key in the index, there is no need
    to consult any other partitions.
    
    Even if a match is found, we only need to access the heap page for
    that specific TID to check visibility, and that would involve just one
    partition.
    
    Why then do we need to lock all leaf partitions in advance? It seems
    like we could defer locking until the uniqueness check identifies a
    partition that actually contains a conflicting tuple, and only then
    lock that one heap.
    
    I understand that in some earlier floated ideas for enforcing global
    uniqueness (perhaps only briefly mentioned in past discussions), the
    approach was to scan all per-partition indexes, which would have
    required locking everything up front. But with a unified global index,
    that overhead seems avoidable.
    
    Is there something subtle that I am missing that still requires
    locking all heaps in advance?
    
    -- 
    Thanks, Amit Langote
    
    
    
    
  17. Re: Proposal: Global Index for PostgreSQL

    Dilip Kumar <dilipbalaut@gmail.com> — 2025-07-03T04:23:55Z

    On Wed, Jul 2, 2025 at 7:18 PM Amit Langote <amitlangote09@gmail.com> wrote:
    
    > Just to clarify -- I was hoping that, at least for SELECTs, we
    > wouldn’t need to lock all leaf partitions up front.
    >
    > One of the potential selling points of global indexes (compared to
    > partitioned indexes) is that we can avoid expand_partitioned_rtentry()
    > and the full scan path setup per partition, though that's admittedly
    > quite an undertaking.  So I was imagining we could just lock the
    > parent and the global index during planning, and only lock individual
    > heap relations at execution time -- once we know which partition the
    > returned tuple belongs to.
    >
    > Locking isn’t cheap -- and in workloads with thousands of partitions,
    > it becomes a major source of overhead, especially when the query
    > doesn't contain pruning quals and ends up touching only a few
    > partitions in practice. So I think it’s worth seeing if we can avoid
    > planning-time locking of all partitions in at least the SELECT case,
    > even if INSERTs may require broader locking due to uniqueness checks,
    > but see below...
    >
    > > > I understand that this is currently necessary, given that a global
    > > > index scan is a single node without per-partition awareness. But it
    > > > might be worth considering whether the scan could opportunistically
    > > > defer heap relation locking until it returns a tuple that actually
    > > > belongs to a particular partition -- similar to how inserts into
    > > > partitioned tables only lock the target partition at execution time.
    > > > Or did I miss that inserts also need to lock all partitions up front
    > > > when global indexes are present, due to cross-partition uniqueness
    > > > checks?
    > > >
    > > > Let me know if I’ve misunderstood the design.
    > >
    > > So there difference is in the cases, where we are directly operating
    > > on the leaf table, e.g. if you inserting directly on the leaf
    > > relation, currently we just need to lock that partition, but if there
    > > is global index we need to lock other siblings as well (in short all
    > > the leaf under the parent which has global index) because if the
    > > global index is unique we might need to check unique conflict in other
    > > leafs as well.  I believe when the table is partitioned, it might not
    > > be the most preferred way to operate directly on the leaf, and with
    > > global index only this case will be impacted where we are doing DML
    > > directly on the leaf.  I am not sure in this case how much delay we
    > > can do in locking, because e.g. for insert we will only identify which
    > > partition has a duplicate key while inserting in the btree.
    >
    > Hmm, it’s my understanding that with a true global index, meaning a
    > single btree structure spanning all partitions, uniqueness conflicts
    > are detected by probing the index after inserting the tuple into the
    > heap. So unless we find a matching key in the index, there is no need
    > to consult any other partitions.
    >
    > Even if a match is found, we only need to access the heap page for
    > that specific TID to check visibility, and that would involve just one
    > partition.
    >
    > Why then do we need to lock all leaf partitions in advance? It seems
    > like we could defer locking until the uniqueness check identifies a
    > partition that actually contains a conflicting tuple, and only then
    > lock that one heap.
    >
    > I understand that in some earlier floated ideas for enforcing global
    > uniqueness (perhaps only briefly mentioned in past discussions), the
    > approach was to scan all per-partition indexes, which would have
    > required locking everything up front. But with a unified global index,
    > that overhead seems avoidable.
    >
    > Is there something subtle that I am missing that still requires
    > locking all heaps in advance?
    
    Thanks Amit for raising this question.  I understand your point, and I
    think this is mainly about whether to lock all the tables upfront or
    delay the locking and lock them when we need to.  I remember the very
    first patch when I implemented at least in case of unique checking I
    was locking the table on the fly whenever we find the duplicate tuple
    and there I had a discussion with Robert regarding the same.  Here is
    what was out thinking, although it's in my words and Robert might have
    something else in mind but this is based on what I think he told.
    
    Here's what I remember our thinking was at the time, though Robert
    might recall it differently. We concluded that the code was designed
    so the planner would upfront decide which tables were needed for the
    plan, lock them, and keep them in the range table. If it was a
    prepared plan, AcquireExecutorLock would handle locking all of them
    before execution began.  And then I changed the logic of lock on the
    fly to locking everything upfront.
    
    I'm now inclined to agree with your point. When dealing with multiple
    partitions, especially when it's uncertain if all of them will even be
    needed, the traditional approach of locking every potential partition
    upfront isn't the most efficient strategy.  But one problem is we will
    only discovered which partitions to lock (whether doing scan or
    checking unique while inserting in the index) only when we are in
    Btree code, so as of now I am not finding it strategically correct to
    lock and build the relation descriptor right then and there, I don't
    have any reason why it would be bad but may be one reason is this
    could be the first code to do it and another reason is if this has to
    lock millions of partitions (in case we find matching tuple from
    multiple partition during scan) then locking all of them on the fly
    from index AM code might not be the best idea.
    
    What's your thought on this?  I would appreciate if @Robert Haas can
    also share his thoughts.
    
    -- 
    Regards,
    Dilip Kumar
    Google
    
    
    
    
  18. Re: Proposal: Global Index for PostgreSQL

    amit <amitlangote09@gmail.com> — 2025-07-11T06:51:57Z

    Hi Dilip,
    
    Sorry for the late reply.
    
    On Thu, Jul 3, 2025 at 1:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    >
    > On Wed, Jul 2, 2025 at 7:18 PM Amit Langote <amitlangote09@gmail.com> wrote:
    >
    > > Just to clarify -- I was hoping that, at least for SELECTs, we
    > > wouldn’t need to lock all leaf partitions up front.
    > >
    > > One of the potential selling points of global indexes (compared to
    > > partitioned indexes) is that we can avoid expand_partitioned_rtentry()
    > > and the full scan path setup per partition, though that's admittedly
    > > quite an undertaking.  So I was imagining we could just lock the
    > > parent and the global index during planning, and only lock individual
    > > heap relations at execution time -- once we know which partition the
    > > returned tuple belongs to.
    > >
    > > Locking isn’t cheap -- and in workloads with thousands of partitions,
    > > it becomes a major source of overhead, especially when the query
    > > doesn't contain pruning quals and ends up touching only a few
    > > partitions in practice. So I think it’s worth seeing if we can avoid
    > > planning-time locking of all partitions in at least the SELECT case,
    > > even if INSERTs may require broader locking due to uniqueness checks,
    > > but see below...
    > >
    > > > > I understand that this is currently necessary, given that a global
    > > > > index scan is a single node without per-partition awareness. But it
    > > > > might be worth considering whether the scan could opportunistically
    > > > > defer heap relation locking until it returns a tuple that actually
    > > > > belongs to a particular partition -- similar to how inserts into
    > > > > partitioned tables only lock the target partition at execution time.
    > > > > Or did I miss that inserts also need to lock all partitions up front
    > > > > when global indexes are present, due to cross-partition uniqueness
    > > > > checks?
    > > > >
    > > > > Let me know if I’ve misunderstood the design.
    > > >
    > > > So there difference is in the cases, where we are directly operating
    > > > on the leaf table, e.g. if you inserting directly on the leaf
    > > > relation, currently we just need to lock that partition, but if there
    > > > is global index we need to lock other siblings as well (in short all
    > > > the leaf under the parent which has global index) because if the
    > > > global index is unique we might need to check unique conflict in other
    > > > leafs as well.  I believe when the table is partitioned, it might not
    > > > be the most preferred way to operate directly on the leaf, and with
    > > > global index only this case will be impacted where we are doing DML
    > > > directly on the leaf.  I am not sure in this case how much delay we
    > > > can do in locking, because e.g. for insert we will only identify which
    > > > partition has a duplicate key while inserting in the btree.
    > >
    > > Hmm, it’s my understanding that with a true global index, meaning a
    > > single btree structure spanning all partitions, uniqueness conflicts
    > > are detected by probing the index after inserting the tuple into the
    > > heap. So unless we find a matching key in the index, there is no need
    > > to consult any other partitions.
    > >
    > > Even if a match is found, we only need to access the heap page for
    > > that specific TID to check visibility, and that would involve just one
    > > partition.
    > >
    > > Why then do we need to lock all leaf partitions in advance? It seems
    > > like we could defer locking until the uniqueness check identifies a
    > > partition that actually contains a conflicting tuple, and only then
    > > lock that one heap.
    > >
    > > I understand that in some earlier floated ideas for enforcing global
    > > uniqueness (perhaps only briefly mentioned in past discussions), the
    > > approach was to scan all per-partition indexes, which would have
    > > required locking everything up front. But with a unified global index,
    > > that overhead seems avoidable.
    > >
    > > Is there something subtle that I am missing that still requires
    > > locking all heaps in advance?
    >
    > Thanks Amit for raising this question.  I understand your point, and I
    > think this is mainly about whether to lock all the tables upfront or
    > delay the locking and lock them when we need to.  I remember the very
    > first patch when I implemented at least in case of unique checking I
    > was locking the table on the fly whenever we find the duplicate tuple
    > and there I had a discussion with Robert regarding the same.  Here is
    > what was out thinking, although it's in my words and Robert might have
    > something else in mind but this is based on what I think he told.
    >
    > Here's what I remember our thinking was at the time, though Robert
    > might recall it differently. We concluded that the code was designed
    > so the planner would upfront decide which tables were needed for the
    > plan, lock them, and keep them in the range table. If it was a
    > prepared plan, AcquireExecutorLock would handle locking all of them
    > before execution began.  And then I changed the logic of lock on the
    > fly to locking everything upfront.
    >
    > I'm now inclined to agree with your point. When dealing with multiple
    > partitions, especially when it's uncertain if all of them will even be
    > needed, the traditional approach of locking every potential partition
    > upfront isn't the most efficient strategy.  But one problem is we will
    > only discovered which partitions to lock (whether doing scan or
    > checking unique while inserting in the index) only when we are in
    > Btree code, so as of now I am not finding it strategically correct to
    > lock and build the relation descriptor right then and there, I don't
    > have any reason why it would be bad but may be one reason is this
    > could be the first code to do it and another reason is if this has to
    > lock millions of partitions (in case we find matching tuple from
    > multiple partition during scan) then locking all of them on the fly
    > from index AM code might not be the best idea.
    >
    > What's your thought on this?  I would appreciate if @Robert Haas can
    > also share his thoughts.
    
    Thanks, I see that you've thought this through and opted for the safe
    route of locking all possible partitions during planning, both for
    SELECT and INSERT.
    
    That seems like the only viable option today, given how the executor
    assumes that each index scan targets a single heap relation which has
    already been opened and locked before execution begins. But even so, I
    think we should not assume locking is cheap. Even with improvements
    like fast-path locking or a higher max_locks_per_transaction, locking
    thousands of partitions still adds up. This can become noticeable even
    in regular query execution, since one of the motivations for global
    indexes is to reduce planning effort, for example by avoiding
    per-partition scan node generation. In such cases, the relative cost
    of locking can become a dominant part of query startup time.
    
    I do not have a better solution right now, but it is worth keeping
    this tradeoff in mind.
    
    -- 
    Thanks, Amit Langote
    
    
    
    
  19. Re: Proposal: Global Index for PostgreSQL

    Dilip Kumar <dilipbalaut@gmail.com> — 2025-07-11T11:17:27Z

    On Fri, Jul 11, 2025 at 12:22 PM Amit Langote <amitlangote09@gmail.com> wrote:
    
    > Thanks, I see that you've thought this through and opted for the safe
    > route of locking all possible partitions during planning, both for
    > SELECT and INSERT.
    >
    > That seems like the only viable option today, given how the executor
    > assumes that each index scan targets a single heap relation which has
    > already been opened and locked before execution begins. But even so, I
    > think we should not assume locking is cheap. Even with improvements
    > like fast-path locking or a higher max_locks_per_transaction, locking
    > thousands of partitions still adds up. This can become noticeable even
    > in regular query execution, since one of the motivations for global
    > indexes is to reduce planning effort, for example by avoiding
    > per-partition scan node generation. In such cases, the relative cost
    > of locking can become a dominant part of query startup time.
    
    That's right
    
    > I do not have a better solution right now, but it is worth keeping
    > this tradeoff in mind.
    
    I agree.  Thanks for your opinion on this.
    
    -- 
    Regards,
    Dilip Kumar
    Google