Thread
-
Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
Masahiko Sawada <sawada.mshk@gmail.com> — 2026-05-12T21:20:14Z
Hi, On Sun, May 3, 2026 at 7:35 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > Hi, > > On Wed, Apr 29, 2026 at 12:15 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > Fixed. Please find the attached v5 patch. > > > > The fix is needed only for PG16 and later, not PG15 or PG14. The bug > > was introduced by b7ae03953690 [1] in PG16, which added a table_open() > > call in pg_get_publication_tables(). PG15 and earlier only use > > get_rel_namespace() and syscache lookups, both of which gracefully > > handle dropped relations (returning InvalidOid/false rather than > > erroring). > > > > I verified the bug and the fix on all affected branches. Please find > > the attached version-specific patches for backpatching. Thank you! > > > > [1] b7ae03953690 - Ignore dropped and generated columns from the column list > > Please find the attached v6 patch, which fixes a test failure on > FreeBSD. This variant of the build forces parallel query via > debug_parallel_query=regress, causing the pg_get_publication_tables > injection point to fire in parallel workers as well. I disabled forced > parallel query for this test case. > I reviewed the patch and here are some comments: +/* State for pg_get_publication_tables SRF */ +typedef struct +{ + List *table_infos; /* list of published_rel */ + int curr_idx; /* current index into table_infos */ +} publication_tables_state; I think we can define publication_table_state in pg_get_publication_tables() as it's used only in that function. --- + /* Skip if the relation has been concurrently dropped. */ + if (!OidIsValid(schemaid)) + continue; Although this check is done for all relations in table_infos, we also check the return value of try_table_open(), and these two checks have the same comment. I think we need more comments on why these two checks are required. In which case is schemaid InvalidOid and do we not call try_table_open() (i.e., nulls[2] is false)? It might make more sense to get-and-check the schemaid of each relation when adding the table information to table_infos. --- Looking at the regression tests in the patch, it tests the ALL TABLES publication cases and the concurrently-dropped table is handled when we call check the return value of try_table_open() but not get_rel_namespace(). I think the test case itself is fine but we can do the same test without adding a new injection point. For instance, backend-1: begin; backend-2: begin; lock table t_dropme in access exclusive mode; backend-1: select * from pg_publication_tables; backend-2: drop table t_dropme; commit; backend-1: get an error "could not open relation with XXX" --- If we use tuplestore instead of SRF, can we simplify the code as we would not need publication_tables_state and the above check? It would be only for the master, though. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com