Re: Making Vars outer-join aware
Tom Lane <tgl@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
To: pgsql-hackers@lists.postgresql.org
Cc: Richard Guo <guofenglinux@gmail.com>,
"Finnerty, Jim" <jfinnert@amazon.com>
Date: 2022-08-20T22:52:11Z
Lists: pgsql-hackers
Commits
Same data as JSON:
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
Re-allow INDEX_VAR as rt_index in ChangeVarNodes().
- fbf80421ead5 16.0 landed
-
Fix thinkos in have_unsafe_outer_join_ref; reduce to Assert check.
- f50f029c497d 16.0 landed
-
Invent "join domains" to replace the below_outer_join hack.
- 3bef56e11650 16.0 landed
-
Do assorted mop-up in the planner.
- b448f1c8d83f 16.0 landed
-
Make Vars be outer-join-aware.
- 2489d76c4906 16.0 landed
-
Invent "multibitmapsets", and use them to speed up antijoin detection.
- e9e26b5e7166 16.0 landed
-
Add basic regression tests for semi/antijoin recognition.
- 0043aa6b8597 16.0 landed
-
Improve performance of adjust_appendrel_attrs_multilevel.
- 2f17b57017e5 16.0 landed
-
Refactor addition of PlaceHolderVars to joinrel targetlists.
- afa0ec30bfd1 16.0 landed
-
Use an explicit state flag to control PlaceHolderInfo creation.
- b3ff6c742f6c 16.0 landed
-
Make PlaceHolderInfo lookup O(1).
- 6569ca43973b 16.0 landed
Progress report on this ...
I've been trying to remove some of the cruftier aspects of
EquivalenceClasses (which really is the main point of this whole
effort), and gotten repeatedly blocked by the fact that the semantics
are still a bit crufty thanks to the hacking associated with outer
join identity 3. I think I see a path forward though.
To recap, the thing about identity 3 is that it says you can get
equivalent results from
(A leftjoin B on (Pab)) leftjoin C on (Pb*c)
A leftjoin (B leftjoin C on (Pbc)) on (Pab)
if Pbc is strict for B. Unlike what it says in optimizer/README,
I've written the first form as "Pb*c" to indicate that any B Vars
appearing in that clause will be marked as possibly nulled by
the A/B join. This makes the problem apparent: we cannot use
the same representation of Pbc for both join orders, because
in the second variant B's Vars are not nulled by anything.
We've been trying to get away with writing Pbc just one way,
and that leads directly to the semantic squishiness I've been
fighting.
What I'm thinking we should do about this, once we detect that
this identity is applicable, is to generate *both* forms of Pbc,
either adding or removing the varnullingrels bits depending on
which form we got from the parser. Then, when we come to forming
join paths, use the appropriate variant depending on which join
order we're considering. build_join_rel() already has the concept
that the join restriction list varies depending on which input
relations we're trying to join, so this doesn't require any
fundamental restructuring -- only a way to identify which
RestrictInfos to use or ignore for a particular join. That will
probably require some new field in RestrictInfo, but I'm not
fussed about that because there are other fields we'll be able
to remove as this work progresses.
Similarly, generate_join_implied_equalities() will need to generate
EquivalenceClass-derived join clauses with or without varnullingrels
marks, as appropriate. I'm not quite sure how to do that, but it
feels like just a small matter of programming, not a fundamental
problem with the model which is where things are right now.
We'll only need this for ECs that include source clauses coming
from a movable outer join clause (i.e., Pbc in identity 3).
An interesting point is that I think we want to force movable
outer joins into the second format for the purpose of generating
ECs, that is we want to use Pbc not Pb*c as the EC source form.
The reason for that is to allow generation of relation-scan-level
clauses from an EC, particularly an EC that includes a constant.
As an example, given
A leftjoin (B leftjoin C on (B.b = C.c)) on (A.a = B.b)
where A.a = constant
we can decide unconditionally that A.a, B.b, C.c, and the constant all
belong to the same equivalence class, and thereby generate relation
scan restrictions A.a = constant, B.b = constant, and C.c = constant.
If we start with the other join order, which will include "B.b* = C.c"
(ie Pb*c) then we'd have two separate ECs: {A.a, B.b, constant} and
{B.b*, C.c}. So we'll fail to produce any scan restriction for C, or
at least we can't do so in any principled way.
Furthermore, if the joins are done in the second order then we don't
need any additional join clauses -- both joins can act like "LEFT JOIN
ON TRUE". (Right now, we'll emit redundant B.b = C.c and A.a = B.b
join clauses in addition to the scan-level clauses, which is
inefficient.) However, if we make use of identity 3 to do the
joins in the other order, then we do need an extra join clause, like
(A leftjoin B on (true)) leftjoin C on (B.b* = C.c)
(or maybe we could just emit "B.b* IS NOT NULL" for Pb*c?)
Without any Pb*c join constraint we get wrong answers because
nulling of B fails to propagate to C.
So while there are lots of details to work out, it feels like
this line of thought can lead to something where setrefs.c
doesn't have to ignore varnullingrels mismatches (yay) and
there is no squishiness in EquivalenceClass semantics.
regards, tom lane