Re: should we have a fast-path planning for OLTP starjoins?

Tomas Vondra <tomas@vondra.me>

View thread

From: Tomas Vondra <tomas@vondra.me>

To: Robert Haas <robertmhaas@gmail.com>, Tom Lane <tgl@sss.pgh.pa.us>

Cc: Bruce Momjian <bruce@momjian.us>, PostgreSQL Hackers <pgsql-hackers@postgresql.org>

Date: 2026-05-28T22:11:32Z

Lists: pgsql-hackers

Attachments

v6-starjoin-fastpath-planning.patch (text/x-patch)
starjoin-scripts.tgz (application/x-compressed-tar)
throughput _ join_collapse_limit=1.png (image/png)
throughput _ join_collapse_limit=8.png (image/png)
throughput _ join_collapse_limit=16.png (image/png)
throughput _ join_collapse_limit=8 vs. master_limit=1.png (image/png)
throughput _ join_collapse_limit=16 vs. master_limit=1.png (image/png)
starjoins.csv (text/csv)

Hi,

I got into a couple discussions regarding this patch in Vancouver last week.

Some of the questions were about possible overlap with the "foreign key
joins" patch Joel is working on. That patch needs to prove some of the
same questions about joins (e.g. which joins can't change cardinality).
So it seems plausible some this join planning patch could leverage some
of the information eventually introduced by that patch, and maybe apply
the optimization in more cases. Not sure.

But re-reading the old thread, this doesn't seem to be why it got stuck.
We already can identify dimensions joined on foreign keys, and that
seems like a good start.

IIRC the thing that worried me was that just sticking the joins at the
end is pretty heavy-handed. It can easily end up making the plan worse,
if one of the other joins increases the cardinality. Would that be
common? Probably not, but it seems unnecessarily risky.

Ideally, we'd do join that reduce cardinality first (with the regular DP
join search), then join all the dimensions, and finally do all joins
that expand cardinality (again, using the regular DP). But the earlier
patches worked by adjusting the join tree in query_planner(), i.e. way
before we get to calculate join cardinalities.

My plan (mentioned even during the Vancouver hallway track) was to
invent a new type of jointree "entry" representing the dimensions, and
modifying the join_search_one_level() et al to "expand" it whenever we
decide to join it. So it'd get the join group, and replace it with a
sequence of joins of all the dimensions. We'd try it in various places,
and could pick the best plan. But it seemed pretty complex and invasive,
so I kept postponing the work.

I started hacking on it this week, but I also started thinking if there
might be a better solution, that would fit into the current join search
more naturally. And I think I actually found a good approach.

It works like this:

1) query_planner()

Determine if the query includes a starjoin (or multiple), and remember
the relids included in the starjoin cluster. Pick a "canonical" join
order for each starjoin cluster we found (e.g. with dimensions in relid
order).

2) standard_join_search()/join_search_one_level()

When constructing the join rels (e.g. in make_join_rel or right before
it's called), check that the new rel would violate the canonical order.
If it would, refuse to create it, just like we do for various other join
restrictions.

The new join restriction is that if the join result includes a subset of
the starjoin cluster, then it has to include the fact + prefix of the
list of dimensions (which is the canonical join order).

Note: It should be possible to make the restriction even more strict, if
needed (e.g. to effectively join all dimensions at once, with no other
joins in between).


The attached patch v6 does this. It's a bit of a PoC quality, but it
should be good enough for testing and experiments (it does pass
check-world for me, but some parts still need more care / cleanup).

I also did a buch of tests to see how effective it is, and it seems
pretty effective. It gives me ~80% throughput of join_collapse_limit=1,
even with extreme number of dimensions (e.g. 16), with disabled geqo and
join_collapse_limit=16 (so it's all in one join list). The current code
drops to ~1% of join_collapse_limit=1, so this is about two orders of
magnitude faster.

Yes, this is in situations when the join does nothing during execution
(the fact table is empty). So in practice the improvement will be
smaller, but it can still be a substantial speedup, depending on how
much other work it's doing.

Attached are a couple charts from a test with 1-15 dimensions (scripts
attached too). I was wondering how geqo affects this, so I tried with
geqo=on/off, and with join_collapse_limit=1/8/16.

With join_collapse_limit=1 there's no difference between any of the runs
(master, patches with on/off). Here's an example of results:

  dims   master(1)   master  sj/off    sj/on  master   sj/off   sj/on
  -------------------------------------------------------------------
  1          49485    48797   48966    49118     99%      99%     99%
  3          26886    22003   21319    24322     82%      79%     90%
  5          17759     7923    7634    15434     45%      43%     87%
  7          13110     2122    2071    11290     16%      16%     86%
  9          10390      462     445     8709      4%       4%     84%
  11          7781       87      86     6488      1%       1%     83%
  13          5948       14      14     5749      0%       0%     97%
  15          5237        1       1     4227      0%       0%     81%

The first column is for master with join_collapse_limit=1, which I take
as the best possible throughtput, if we eliminate the join order search
almost completely. The rest is with geqo=off and join_collapse_limit=16.

So that's the 80% of possible throughput, pretty good IMO. Especially
considering master is at 1 tps.

There are a couple more interesting charts, and a CSV with raw results.
A funny (but not entirely surprising) detail is that geqo=on helps a bit
(from 1tps to ~90tps for 15 dimensions), but it's still far from the
limit=1 throughput of ~5200 tps. The fastpath join order does better.


I'm sure the code needs cleanups and fixes, but I like the approach way
more than the original plan (inventing a new group join entry).

Any comments regarding this alternative approach?


regards

-- 
Tomas Vondra