Thread

Re: Remove inner joins based on foreign keys

Richard Guo <guofenglinux@gmail.com> — 2026-05-01T08:25:13Z

On Thu, Apr 30, 2026 at 4:50 PM Tender Wang <tndrwang@gmail.com> wrote:
> The inner join between orders and users is not removed. It appears
> this can be removed in this query. This is my first question.

Right.  This is a mis-optimization case I was aware of.  The inner
join can be removed in this case.  inner_join_is_removable() is
currently overly conservative about ref_rel->joininfo: it bails as
long as the list is non-empty.  We can relax this to bail only when a
clause actually references ref_rel via clause_relids.  A clause can
appear in ref_rel->joininfo without its expression referencing
ref_rel, which connects to your second question.

> The second question may not be related to this $SUBJECT.
> My question is that the (n.id = o.id) clause seems not related to
> users. Why does the planner add it to the users' RelOptInfo's
> joininfo?
>
> I only know that the thing happens in the
> reconsider_outer_join_clauses(), the restrictinfo->required_relids
> contains {1,2,3}, where the rtindex = 3 is the users relation.

Because the clause "n.id = o.id" is an outer join ON clause that is
non-degenerate (one that references the non-nullable side of the join)
and we need to force it to be evaluated exactly at the level of the
outer join.  To handle that, we add the join's minimum input relid set
to required_relids.  That's why it appears in users' joininfo without
referencing users.

Attached patch relaxes the check against ref_rel->joininfo.  Nothing
else has changed.

- Richard

Re: Remove inner joins based on foreign keys

Tender Wang <tndrwang@gmail.com> — 2026-05-06T12:57:57Z

Richard Guo <guofenglinux@gmail.com> 于2026年5月1日周五 16:25写道：
> Because the clause "n.id = o.id" is an outer join ON clause that is
> non-degenerate (one that references the non-nullable side of the join)
> and we need to force it to be evaluated exactly at the level of the
> outer join.  To handle that, we add the join's minimum input relid set
> to required_relids.  That's why it appears in users' joininfo without
> referencing users.

Thanks for the explanation.

>
> Attached patch relaxes the check against ref_rel->joininfo.  Nothing
> else has changed.

CREATE TABLE users  (id int primary key, name text);
CREATE TABLE orders (id int primary key, user_id int not null
references users(id), amount int);
CREATE TABLE nation (id int primary key, name text);
postgres=# explain select  n.* from nation n left join (orders o join
users u on o.user_id = u.id) on n.id = o.id;
                               QUERY PLAN
-------------------------------------------------------------------------
 Hash Right Join  (cost=38.58..74.34 rows=1270 width=36)
   Hash Cond: (o.id = n.id)
   ->  Seq Scan on orders o  (cost=0.00..30.40 rows=2040 width=8)
   ->  Hash  (cost=22.70..22.70 rows=1270 width=36)
         ->  Seq Scan on nation n  (cost=0.00..22.70 rows=1270 width=36)
(5 rows)

postgres=# explain select n.* from nation n left join orders o on n.id = o.id;
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on nation n  (cost=0.00..22.70 rows=1270 width=36)
(1 row)

Recently, I encountered the two plans above.  The first plan can
continue to transform into the second plan after inner-join removing.
But in current logic, we cannot do this. Because we do left-join
removable first. Then we do other join(semi-join,
self-join,inner-join) removable.
We can only remove outer-join, whose min-righthand is single.
Maybe we can call remove_useless_joins() again to remove the outer
join that the function couldn't remove in the first call.
I'm not sure it is worth doing this.
Any thoughts?

Re: Remove inner joins based on foreign keys

Alexandra Wang <alexandra.wang.oss@gmail.com> — 2026-05-07T19:12:06Z

Hi Richard,

On Fri, May 1, 2026 at 1:25 AM Richard Guo <guofenglinux@gmail.com> wrote:
> Attached patch relaxes the check against ref_rel->joininfo.  Nothing
> else has changed.

Thanks for the patch! I've tested and reviewed v4. Overall, this looks
good to me. Please find individual feedback below.

On Wed, Apr 1, 2026 at 2:46 AM Richard Guo <guofenglinux@gmail.com> wrote:
> On Mon, Mar 23, 2026 at 3:12 PM Richard Guo <guofenglinux@gmail.com>
wrote:
> > Just for the sake of archives, the timeline of the trap during a
> > cascading delete looks like this:
> >
> > T0: DELETE FROM users WHERE id = 1;
> >
> > T1: The executor finds users row 1 and sets its xmax, physically
> > marking it as dead.
> >
> > T2: [The Gap] The executor pushes the RI trigger into a queue to deal
> > with orders later.  Right now, the orders row still exists, but its
> > referenced row in users is dead.
> >
> > T3: The statement finishes, the trigger fires, and the orders row is
> > finally deleted.
> >
> > The "T2 Gap" is small, but there are several ways to execute user code
> > inside that window, such as RETURNING clauses, volatile functions, or
> > user-defined AFTER ROW triggers.
>
> I've been exploring ways to detect whether a query is executing within
> the RI trigger gap.  It seems one promising approach is to leverage
> the lock manager.
>
> Every DML statement acquires RowExclusiveLock on its target table
> before modifying any rows, so if the trigger gap is active, the lock
> is guaranteed to be held.  We can verify this during planning by
> calling CheckRelationOidLockedByMe(), which is a quite efficient
> check.  (As a point of precedent, the planner already relies on the
> local lock manager's state in several existing code paths.)
>
> For prepared statements and cached plans, a generic plan that was
> built with FK join removal could be reused in a context where the
> trigger gap is active.  To handle this, we can modify
> choose_custom_plan() to check whether any relation involved in an
> FK-based join removal currently holds RowExclusiveLock, and choose
> custom plan if so.
>
> The trade-off is false positives: because RowExclusiveLock persists
> for the entire transaction while the trigger gap is intra-statement,
> the optimization is also skipped after the DML completes but within
> the same transaction, after ROLLBACK TO a savepoint, or when
> RowExclusiveLock is held for other reasons (e.g., LOCK TABLE).  These
> seem like uncommon cases, and the query simply falls back to executing
> the join normally as it would without the optimization at all, so I
> think this is an acceptable trade-off.

On Tue, Apr 28, 2026 at 3:09 AM Richard Guo <guofenglinux@gmail.com> wrote:
> On Wed, Apr 1, 2026 at 6:45 PM Richard Guo <guofenglinux@gmail.com> wrote:
> > Please see the v2 patch for the implementation details.
> >
> > I didn't find any mention of this approach in the 2014 thread.  I'd
> > appreciate any thoughts or feedback on this direction.
>
> Here is a new rebase over 20efbdffe.  It also tightens up some checks
> in inner_join_is_removable().
>
> I've reconsidered the trigger-gap issue and I think I now have a
> cleaner understanding.  A snapshot captured inside the trigger-gap
> window can outlive the window's closure -- for example via a STABLE
> function that inherits an older snapshot from its caller, as Tom
> pointed out back in 2015 [1].  A query executed against such a stale
> snapshot would still observe the inconsistency even after the trigger
> queue has drained.  So the predicate guarding the optimization has to
> remain positive at least as long as any in-transaction snapshot could
> still be referenced.
>
> I went through several alternatives (a per-statement counter
> incremented in ExecInitModifyTable, AfterTriggerPendingOnRel, etc.)
> and convinced myself that all of them go silent before the relevant
> snapshots are gone.  The lock-based predicate is still the best
> correct approach I can think of: RowExclusiveLock is released only at
> end of transaction, which trivially exceeds the lifetime of any
> in-transaction snapshot.  By the time the lock is released, every
> in-transaction snapshot has been released, so no stale gap-window
> snapshot can still be referenced.  Maybe the false positives are just
> the price we need to pay for that lifetime guarantee.

Checking RowExclusiveLock makes sense to me.

One nitpick on wording: the RowExclusiveLock acquired at the beginning
of the DML command can be released early by ROLLBACK TO a savepoint
established before that command, so it doesn't strictly "exceed the
lifetime of any in-transaction snapshot." In practice this doesn't
matter, though, because the lock is guaranteed to outlive any snapshot
taken during the trigger gap of the DML command, which is the only
window where the FK invariant can be violated.

IIUC, as long as we ensure that the snapshot of the join statement --
whether it is an after-row trigger, RETURNING clause, cursor, etc. --
is taken outside of the trigger gap, which IS guarded by
RowExclusiveLock, removing the join from its plan should be safe.

So, setting aside the false positives you mentioned, which I think are
acceptable, I believe this approach is correct.

I then tried to find edge cases similar to the 2014 threads and found
one bug with TABLESAMPLE:

-- If the patch removes the join when the referenced table uses TABLESAMPLE,
-- it will return MORE rows than correct (all child rows instead of only
-- those whose parent appeared in the sample).

CREATE TABLE exp_parent (id int PRIMARY KEY, data text);
CREATE TABLE exp_child (
    id int PRIMARY KEY,
    parent_id int NOT NULL REFERENCES exp_parent(id)
);

INSERT INTO exp_parent SELECT g, 'row_' || g FROM generate_series(1, 100) g;
INSERT INTO exp_child SELECT g, g FROM generate_series(1, 100) g;
ANALYZE exp_parent;
ANALYZE exp_child;

-- This join should NOT be removed because TABLESAMPLE limits the parent
rows
EXPLAIN (COSTS OFF)
SELECT c.id
FROM exp_child c
INNER JOIN exp_parent p TABLESAMPLE BERNOULLI(1) REPEATABLE(42) ON
c.parent_id = p.id;

-- Verify: if join removal fires, this would return ~100 rows.
-- With the join, it should return ~1 row (1% sample).
SELECT count(*)
FROM exp_child c
INNER JOIN exp_parent p TABLESAMPLE BERNOULLI(1) REPEATABLE(42) ON
c.parent_id = p.id;

We should add a condition for TABLESAMPLE in inner_join_is_removable()
to disallow join removal.

Minor things I think might be worth mentioning, but I don't have
strong opinions:

1. Self-join removal has a GUC: enable_self_join_elimination. I didn't
investigate why that was needed, and I don't know in what cases people
would want to turn it off, maybe for debugging purposes? Do you think
we need a similar GUC for inner joins? Disclaimer: I'm not a fan of
unnecessary GUCs, just putting the question out there.

2. Nitpick: should we consider renaming remove_useless_joins() to
remove_useless_left_joins() to reflect what it actually does?

Best,
Alex

-- 
Alexandra Wang
EDB: https://www.enterprisedb.com

Re: Remove inner joins based on foreign keys

Richard Guo <guofenglinux@gmail.com> — 2026-05-25T21:01:08Z

(CC'ing folks who discussed this with me at PGConf.dev last week;
thanks again for the conversations.)

After several discussions at PGConf.dev last week, there are concerns
with the lock-based predicate this thread has been converging on.  I'd
like to lay out the underlying issue and the available approaches in
detail on-list, so the wider community can weigh in before we go
further.  I'm not advocating any particular option here.  My goal in
this email is to make sure everyone evaluating the patch is starting
from the same picture.


1. The trigger gap, demonstrated
--------------------------------

On master, no patch applied:

CREATE TABLE users  (id int PRIMARY KEY, name text);
CREATE TABLE orders (id int PRIMARY KEY,
                     user_id int REFERENCES users(id) ON DELETE CASCADE);
INSERT INTO users  VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO orders VALUES (10, 1), (11, 1), (20, 2);

CREATE FUNCTION show_gap() RETURNS void LANGUAGE plpgsql AS $$
DECLARE u_rows text; o_rows text;
BEGIN
  SELECT string_agg(format('%s (id=%s)', name, id),
                    ', ' ORDER BY id)
    INTO u_rows FROM users;
  SELECT string_agg(format('(id=%s, user_id=%s)', id, user_id),
                    ', ' ORDER BY id)
    INTO o_rows FROM orders;
  RAISE NOTICE 'users:  %', u_rows;
  RAISE NOTICE 'orders: %', o_rows;
END $$;

DELETE FROM users WHERE id = 1 RETURNING show_gap();

NOTICE:  users:  Bob (id=2)
NOTICE:  orders: (id=10, user_id=1), (id=11, user_id=1), (id=20, user_id=2)

show_gap() runs inside the DELETE's RETURNING clause, after Alice has
been removed from users but before the cascade trigger has deleted her
orders.  In show_gap()'s snapshot, Alice is gone, but her two orders
(10 and 11) are still there, pointing at a user_id that no longer
exists.  The FK invariant is locally false.

Why this happens: FK constraints are enforced by AFTER ROW triggers
that fire at end of statement, after RETURNING evaluation.  Between
when the DELETE modifies the heap and when the cascade trigger fires,
the heap is transiently inconsistent.  Any user code that runs in that
window with a fresh snapshot observes the inconsistency.  In the demo
above, plpgsql's SPI call takes such a snapshot.

For the FK-based inner-join-removal optimization, this matters because
the rewrite from 'child JOIN parent ON c.fk = p.id' to 'child WHERE
c.fk IS NOT NULL' assumes the FK invariant holds for the executing
snapshot.  In show_gap()'s snapshot it doesn't, and the rewrite would
return Alice's orders (10 and 11) as if Alice were still present, when
the joined form correctly excludes them.


2. Approaches
-------------

What PG promises about FK constraints today is what the SQL standard
requires: INITIALLY IMMEDIATE constraints are satisfied at end-of-
statement, INITIALLY DEFERRED at commit.  Neither the standard nor the
PG docs explicitly promise that the FK invariant holds for every
snapshot a query inside the database can use.  The optimization
proposed in this thread needs the stronger property: it rewrites a
join on the assumption that every visible child row with non-null FK
has a corresponding visible parent in the same snapshot.  The trigger
gap is where that stronger property fails.

The approaches I have considered are listed below, with (B) being
the predicate currently in the v4 patch.

(A) Drop the optimization.

Don't pursue FK-based inner-join removal, or any other optimization
whose correctness depends on the FK invariant.

(B) Lock-based predicate.

inner_join_is_removable() and choose_custom_plan() consult
CheckRelationOidLockedByMe(RowExclusiveLock) for the FK rels; skip the
rewrite if held.  RowExclusiveLock is xact-scoped and exceeds the
lifetime of any in-xact snapshot, so the lock being present is
sufficient to block the optimization in any context where a
gap-bearing snapshot might exist.

Pros: No new infrastructure.  Predicate is local to two functions.
CheckRelationOidLockedByMe() is a cheap local hash lookup.  Correct
against every gap-bearing snapshot path.

Cons: Xact-scoped, therefore pessimistic.  Once a DML on either FK rel
runs in a transaction, the rewrite is suppressed for the rest of that
transaction, even when subsequent statements take fresh snapshots that
the FK invariant *does* hold for.  For cached plans that previously
used FK removal, this also produces a planner-replan storm:
choose_custom_plan trips on every consultation in the writing xact and
replans each time.

(C) Snapshot-anchored predicate.

Add a "captured during a gap window" bit to SnapshotData, populated in
CopySnapshot when AfterTriggerPendingOnAnyRel() returns true at
capture time.  inner_join_is_removable() and choose_custom_plan()
consult the bit on the active snapshot.

Pros: Precise.  Only snapshots actually captured inside a gap suppress
the optimization.  Sequential DML-then-SELECT in the same xact keeps
the rewrite.  Eliminates the replan storm and most of the plan-shape
cliff that (B) introduces.

Cons: SnapshotData, one of PG's most foundational abstractions, gains
a field that encodes trigger-subsystem state.  Every future reader of
snapshot.h has to learn what the gap is.

(D) Close the gap.

Make the FK invariant hold for every in-xact snapshot.  Either by
enforcing RI synchronously inside the DML rather than via AFTER ROW
triggers, or by arranging that the deleted parent tuple and the
trigger's modifications to children become visible atomically.

Pros: Strengthens what PG promises about constraints.  Benefits any
future optimization that needs the FK invariant.  No gating predicate
is needed in the planner or plan cache.

Cons: Invasive and difficult.

(E) Apply the optimization without gap handling, and document the
corner cases.

Skip the predicate entirely.  The optimization fires whenever the FK
constraint structurally permits it.  Queries that execute against a
gap-bearing snapshot can return wrong results, documented as a known
limitation.

Pros: Smallest patch.  Writing a join query that runs during a trigger
gap is unusual in practice for most user code.

Cons: Can have wrong results.

(F) Something else.


3. On splitting the patch
-------------------------

Independently of which option above wins, I'd like to separate the
patch into two parts so the optimization mechanics and the
gap-handling can be reviewed independently (see v5 patches):

Part 1. The structural inner-join-removal logic, assuming the FK
invariant holds.

Part 2. Whatever gap-handling we converge on from section 2 above.  If
(A) wins, Part 2 is replaced by dropping Part 1 entirely.

Part 1 alone would be unsafe to commit by itself; the value of the
split is reviewing-order, not commit-order.

(Attached v5 is the split version of v4, plus addressing Alex's two
comments.  0002 is still the lock-based predicate from v4, posted as
the concrete reference for option (B); it can be swapped for whichever
approach the gap-handling discussion settles on.)


I'd like input on which of the approaches in section 2 people would be
willing to live with.  Pointers to prior discussion I haven't found,
or design considerations I've missed, are also welcome.

- Richard