Thread

Re: Is there value in having optimizer stats for joins/foreignkeys?

Chengpeng Yan <chengpeng_yan@outlook.com> — 2026-05-25T12:15:47Z
Hi,

> On May 15, 2026, at 23:30, Alexandra Wang <alexandra.wang.oss@gmail.com> wrote:
> 
> Here's v7, another attempt to fix the unstable tests.

Thanks for working on this. I have a few design comments about the lifecycle
and the intended scope of the join statistic.

First, the index dependency contract seems worth clarifying. CREATE STATISTICS
records a normal dependency on the selected index, so DROP INDEX is blocked
unless CASCADE is used, although that index is not part of the statistics
definition. But ANALYZE appears to re-discover a suitable index when refreshing
the statistic. Is the index intended to be part of the statistic's persistent
contract, or only a creation-time proof that index-based sampling is possible?
If the latter, should DROP INDEX still be blocked when another equivalent index
exists?

Second, this seems related to the earlier concern that ANALYZE is per-table.
The statistic is owned by the anchor relation, but its contents depend on the
probed relation too. In the current patch, ANALYZE on the probed relation can
refresh its own statistics without refreshing the join statistic. If the
probed relation has changed substantially, that leaves a possible staleness
gap where the planner combines fresh base-table statistics with stale
cross-relation skew information.

Third, the contract for non-unique indexes on the probed side seems worth
clarifying. The comments define raw_sel as anchor-relative:
P(join AND covered_filters) / anchor_totalrows, roughly Jf / anchor_totalrows,
where Jf is the number of joined rows satisfying the covered filters. But the
implementation computes raw_sel from MCV frequencies. Since the MCV list is
built from sampled joined rows, a plain MCV frequency is naturally measured
inside that joined sample, roughly Jf / J where J is the sampled join-result
size. It would be useful to clarify when those two quantities are expected to
be equivalent, especially when a non-unique probed-side index allows one
anchor row to contribute multiple joined rows.

These two measures are close in FK-like cases where the joined sample size
tracks the anchor sample size. With a non-unique lookup, one anchor row may
appear many times in the joined sample. For example, if one key matches many
red rows and another key matches only one blue row, red may dominate the
joined sample because of match multiplicity. That frequency describes the
distribution within the joined result, but not how many matching joined rows
are produced per anchor row. It would be useful to state whether such
one-to-many joins are outside the current supported scope, or how the
MCV-derived raw_sel accounts for how many joined rows each anchor row
contributed before it is converted into planner join selectivity.

--
Best regards,
Chengpeng Yan