Thread

  1. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Chengpeng Yan <chengpeng_yan@outlook.com> — 2026-05-25T12:15:47Z

    Hi,
    
    > On May 15, 2026, at 23:30, Alexandra Wang <alexandra.wang.oss@gmail.com> wrote:
    > 
    > Here's v7, another attempt to fix the unstable tests.
    
    Thanks for working on this. I have a few design comments about the lifecycle
    and the intended scope of the join statistic.
    
    First, the index dependency contract seems worth clarifying. CREATE STATISTICS
    records a normal dependency on the selected index, so DROP INDEX is blocked
    unless CASCADE is used, although that index is not part of the statistics
    definition. But ANALYZE appears to re-discover a suitable index when refreshing
    the statistic. Is the index intended to be part of the statistic's persistent
    contract, or only a creation-time proof that index-based sampling is possible?
    If the latter, should DROP INDEX still be blocked when another equivalent index
    exists?
    
    Second, this seems related to the earlier concern that ANALYZE is per-table.
    The statistic is owned by the anchor relation, but its contents depend on the
    probed relation too. In the current patch, ANALYZE on the probed relation can
    refresh its own statistics without refreshing the join statistic. If the
    probed relation has changed substantially, that leaves a possible staleness
    gap where the planner combines fresh base-table statistics with stale
    cross-relation skew information.
    
    Third, the contract for non-unique indexes on the probed side seems worth
    clarifying. The comments define raw_sel as anchor-relative:
    P(join AND covered_filters) / anchor_totalrows, roughly Jf / anchor_totalrows,
    where Jf is the number of joined rows satisfying the covered filters. But the
    implementation computes raw_sel from MCV frequencies. Since the MCV list is
    built from sampled joined rows, a plain MCV frequency is naturally measured
    inside that joined sample, roughly Jf / J where J is the sampled join-result
    size. It would be useful to clarify when those two quantities are expected to
    be equivalent, especially when a non-unique probed-side index allows one
    anchor row to contribute multiple joined rows.
    
    These two measures are close in FK-like cases where the joined sample size
    tracks the anchor sample size. With a non-unique lookup, one anchor row may
    appear many times in the joined sample. For example, if one key matches many
    red rows and another key matches only one blue row, red may dominate the
    joined sample because of match multiplicity. That frequency describes the
    distribution within the joined result, but not how many matching joined rows
    are produced per anchor row. It would be useful to state whether such
    one-to-many joins are outside the current supported scope, or how the
    MCV-derived raw_sel accounts for how many joined rows each anchor row
    contributed before it is converted into planner join selectivity.
    
    --
    Best regards,
    Chengpeng Yan