Thread

  1. Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE

    Masahiko Sawada <sawada.mshk@gmail.com> — 2026-05-12T21:20:14Z

    Hi,
    
    On Sun, May 3, 2026 at 7:35 PM Bharath Rupireddy
    <bharath.rupireddyforpostgres@gmail.com> wrote:
    >
    > Hi,
    >
    > On Wed, Apr 29, 2026 at 12:15 AM Bharath Rupireddy
    > <bharath.rupireddyforpostgres@gmail.com> wrote:
    > >
    > > Fixed. Please find the attached v5 patch.
    > >
    > > The fix is needed only for PG16 and later, not PG15 or PG14. The bug
    > > was introduced by b7ae03953690 [1] in PG16, which added a table_open()
    > > call in pg_get_publication_tables(). PG15 and earlier only use
    > > get_rel_namespace() and syscache lookups, both of which gracefully
    > > handle dropped relations (returning InvalidOid/false rather than
    > > erroring).
    > >
    > > I verified the bug and the fix on all affected branches. Please find
    > > the attached version-specific patches for backpatching. Thank you!
    > >
    > > [1] b7ae03953690 - Ignore dropped and generated columns from the column list
    >
    > Please find the attached v6 patch, which fixes a test failure on
    > FreeBSD. This variant of the build forces parallel query via
    > debug_parallel_query=regress, causing the pg_get_publication_tables
    > injection point to fire in parallel workers as well. I disabled forced
    > parallel query for this test case.
    >
    
    I reviewed the patch and here are some comments:
    
    +/* State for pg_get_publication_tables SRF */
    +typedef struct
    +{
    + List    *table_infos; /* list of published_rel */
    + int curr_idx; /* current index into table_infos */
    +} publication_tables_state;
    
    I think we can define publication_table_state in
    pg_get_publication_tables() as it's used only in that function.
    
    ---
    + /* Skip if the relation has been concurrently dropped. */
    + if (!OidIsValid(schemaid))
    + continue;
    
    Although this check is done for all relations in table_infos, we also
    check the return value of try_table_open(), and these two checks have
    the same comment. I think we need more comments on why these two
    checks are required.
    
    In which case is schemaid InvalidOid and do we not call
    try_table_open() (i.e., nulls[2] is false)?
    
    It might make more sense to get-and-check the schemaid of each
    relation when adding the table information to table_infos.
    
    ---
    Looking at the regression tests in the patch, it tests the ALL TABLES
    publication cases and the concurrently-dropped table is handled when
    we call check the return value of try_table_open() but not
    get_rel_namespace(). I think the test case itself is fine but we can
    do the same test without adding a new injection point. For instance,
    
    backend-1: begin;
    backend-2: begin; lock table t_dropme in access exclusive mode;
    backend-1: select * from pg_publication_tables;
    backend-2: drop table t_dropme; commit;
    backend-1: get an error "could not open relation with XXX"
    
    ---
    If we use tuplestore instead of SRF, can we simplify the code as we
    would not need publication_tables_state and the above check? It would
    be only for the master, though.
    
    Regards,
    
    --
    Masahiko Sawada
    Amazon Web Services: https://aws.amazon.com