Thread

  1. Is there value in having optimizer stats for joins/foreignkeys?

    Corey Huinker <corey.huinker@gmail.com> — 2025-12-01T20:10:25Z

    Threads like [1] and [2] have gotten me thinking that there may be some
    value in storing statistics about joins.
    
    For the sake of argument, assume a table t1 with a column t2id which
    references the pk of table t2 that has columns t2.t2id, t2c1, t2c2, t2c3.
    In such a situation I can envision the following statistics being collected:
    
    * The % of values rows in t2 are referenced at least once in t1
    * The attribute stats (i.e. pg_statistic stats) for t2c1, t2c2, t2c3, but
    associated with t1 and weighted according to the frequency of that row
    being referenced, which means that values of unreferenced rows are filtered
    out entirely.
    * That's about it for direct statistics, but I could see creating extended
    statistics for correlations between a local column value and a remote
    column, or expressions on the remote columns, etc.
    
    The storage feels like it would be identical to pg_statistic but with a
    "starefrelid" field that identifies the referencing table.
    
    That much seems straightforward. A bigger problem is how we'd manage to
    collect these statistics. We could (as Jeff Davis has suggested) keep our
    tablesamples, but that wouldn't necessarily help in this case because the
    rows referenced, and their relative weightings would change since the last
    sampling. In a worst-case scenario, We would have to sample the joined-to
    tables as well,and that's an additional burden on an already IO intensive
    operation.
    
    In theory, we could do some of this without any additional stats
    collection. If the ndistinct of t1.t2id is, say, at least 75+% of the
    ndistinct of t2.t2id, we could just peek at the attribute stats on t2 and
    use them for estimates. However, that makes some assumptions that the stats
    on t2 are approximately as fresh as the stats on t1, and I don't think that
    will be the case most of the time.
    
    CCing people who have wondered out loud about this topic within earshot of
    me.
    
    Thoughts?
    
    [1]
    https://www.postgresql.org/message-id/flat/6fdc4dc5-8881-4987-9858-a9b484953185%40joeconway.com#5a93cd7a730691843a7700c770397baf
    [2]
    https://www.postgresql.org/message-id/flat/tencent_3018762E7D4C9BC470C821C829C1BF2F650A%40qq.com
    
  2. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Tomas Vondra <tomas@vondra.me> — 2025-12-01T21:01:58Z

    On 12/1/25 21:10, Corey Huinker wrote:
    > Threads like [1] and [2] have gotten me thinking that there may be some
    > value in storing statistics about joins.
    > 
    > For the sake of argument, assume a table t1 with a column t2id which
    > references the pk of table t2 that has columns t2.t2id, t2c1, t2c2,
    > t2c3. In such a situation I can envision the following statistics being
    > collected:
    > 
    > * The % of values rows in t2 are referenced at least once in t1
    > * The attribute stats (i.e. pg_statistic stats) for t2c1, t2c2, t2c3,
    > but associated with t1 and weighted according to the frequency of that
    > row being referenced, which means that values of unreferenced rows are
    > filtered out entirely.
    > * That's about it for direct statistics, but I could see creating
    > extended statistics for correlations between a local column value and a
    > remote column, or expressions on the remote columns, etc.
    > 
    
    Do I understand correctly you propose to collect such stats for every
    foreign key? I recall something like that was proposed in the past, and
    the argument against was that for many joins it'd be a waste because the
    estimates are good enough. And for OLTP systems that's probably true.
    
    Of course, it also depends on how expensive this would be. Maybe it's
    cheap enough? No idea.
    
    But I always assumed we'd have a way to explicitly enable such stats for
    certain joins only, and the extended stats were designed to make that
    possible.
    
    FWIW I'm not entirely sure what stats you propose to collect exactly. I
    mean, what does
    
       ... associated with t1 and weighted according to the frequency of
       that row being referenced, which means that values of unreferenced
       rows are filtered out entirely.
    
    mean? Are you suggesting to "do the join" and build the regular stats as
    if that was a regular table? I think that'd work, and it's mostly how I
    envisioned to handle joins in extended stats, restricted to joins of two
    relations.
    
    > The storage feels like it would be identical to pg_statistic but with a
    > "starefrelid" field that identifies the referencing table.
    > 
    > That much seems straightforward. A bigger problem is how we'd manage to
    > collect these statistics. We could (as Jeff Davis has suggested) keep
    > our tablesamples, but that wouldn't necessarily help in this case
    > because the rows referenced, and their relative weightings would change
    > since the last sampling. In a worst-case scenario, We would have to
    > sample the joined-to tables as well,and that's an additional burden on
    > an already IO intensive operation.
    > 
    
    Combining independent per-table samples does not work, unless the
    samples are huge. There's a nice paper [1] on how to do index-based join
    sampling efficiently.
    
    > In theory, we could do some of this without any additional stats
    > collection. If the ndistinct of t1.t2id is, say, at least 75+% of the
    > ndistinct of t2.t2id, we could just peek at the attribute stats on t2
    > and use them for estimates. However, that makes some assumptions that
    > the stats on t2 are approximately as fresh as the stats on t1, and I
    > don't think that will be the case most of the time.
    > 
    > CCing people who have wondered out loud about this topic within earshot
    > of me.
    > 
    > Thoughts?
    
    I think adding joins to extended stats would not be all that hard
    (famous last words, I know). For me the main challenge was figuring out
    how to store the join definition in the catalog, I always procrastinated
    and never gave that a serious try.
    
    FWIW I think we might start by actually using per-table extended stats
    on the joined tables. Just like we combine the scalar MCVs on joined
    columns, we could combine multicolumn MVCs.
    
    regards
    
    [1] https://www.cidrdb.org/cidr2017/papers/p9-leis-cidr17.pdf
    
    -- 
    Tomas Vondra
    
    
    
    
    
  3. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Tom Lane <tgl@sss.pgh.pa.us> — 2025-12-01T22:11:55Z

    Tomas Vondra <tomas@vondra.me> writes:
    > On 12/1/25 21:10, Corey Huinker wrote:
    >> Threads like [1] and [2] have gotten me thinking that there may be some
    >> value in storing statistics about joins.
    
    > Do I understand correctly you propose to collect such stats for every
    > foreign key? I recall something like that was proposed in the past, and
    > the argument against was that for many joins it'd be a waste because the
    > estimates are good enough. And for OLTP systems that's probably true.
    
    Yeah, I think that automated choices about this are unlikely to work
    well.  We chose the syntax for CREATE STATISTICS with an eye to
    allowing users to declaratively tell us to collect stats about
    specific joins, and I still think that's a more promising approach.
    But nobody's yet worked out any details.
    
    			regards, tom lane
    
    
    
    
  4. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Corey Huinker <corey.huinker@gmail.com> — 2025-12-02T03:16:25Z

    > Do I understand correctly you propose to collect such stats for every
    > foreign key? I recall something like that was proposed in the past, and
    > the argument against was that for many joins it'd be a waste because the
    > estimates are good enough. And for OLTP systems that's probably true.
    >
    
    Not every foreign key, they'd be declared like CREATE STATISTICS, but would
    be anchored to the constraint, not to the table.
    
    
    > But I always assumed we'd have a way to explicitly enable such stats for
    > certain joins only, and the extended stats were designed to make that
    > possible.
    >
    
    That's the intention, but the stats stored don't quite "fit" in the buckets
    that extended stats create. The attribute statistics seem much better
    suited, as this isn't about combinations, there's only ever the one
    combination, but rather about what can be known about the attributes in the
    far table before doing the actual join.
    
    
    > FWIW I'm not entirely sure what stats you propose to collect exactly. I
    > mean, what does
    >
    >    ... associated with t1 and weighted according to the frequency of
    >    that row being referenced, which means that values of unreferenced
    >    rows are filtered out entirely.
    >
    > mean? Are you suggesting to "do the join" and build the regular stats as
    > if that was a regular table? I think that'd work, and it's mostly how I
    > envisioned to handle joins in extended stats, restricted to joins of two
    > relations.
    >
    
    Right. We'd do the join from t1 to t2 as described earlier, and then we'd
    judge the null_frac, mcv, etc for each column of t2 (as defined by the
    scope of the stats declaration) according to the join. More commonly
    referenced values would show up as more frequent, hence "weighted".
    
    Just so I have an example to refer to later, say we have a table of colors:
    
    CREATE TABLE color(id bigint primary key, color_name text unique,
    color_family text null)
    
    and there's hundreds of colors in the table that are color_family='red'
    ('fire engine red', 'candy apple red', 'popular muppet red', etc). Some
    colors don't belong to any color_family.
    
    And we have a table of toys:
    
    CREATE TABLE toy(id bigint primary key, min_child_age integer, name text,
    color_id bigint REFERENCES color)
    
    And we declare a join stat on toy->color for the color_family attribute.
    We'd sample rows from the toy table, left join those to color, and then
    calculate the attribute stats of color_family as if it were a column in
    toys. Some toys might not have a color_id, and some color_ids might not
    belong to a color_family, so we'd want the null_frac to reflect those
    combined conditions. For the values that do join, and the colors that do
    belong to a family, we'd want to see regular MCV stats showing "red" as the
    most common color_family.
    
    But those stats aren't really a correlation or a dependency, they're just
    plain old attribute stats.
    
    I understand wanting to know the correlation between toys.min_child_age and
    colors.color_family, so that makes perfect sense for extended statistics,
    but color_family on its own just doesn't fit. Am I missing something?
    
    
    > Combining independent per-table samples does not work, unless the
    > samples are huge. There's a nice paper [1] on how to do index-based join
    > sampling efficiently.
    >
    
    Thanks, now I've got some light reading for the flight home.
    
    
    > I think adding joins to extended stats would not be all that hard
    > (famous last words, I know). For me the main challenge was figuring out
    > how to store the join definition in the catalog, I always procrastinated
    > and never gave that a serious try.
    >
    
    I envisioned keying the stats off the foreign key constraint id, or adding
    "starefrelid" (relation oid of the referencing table) to pg_statistic or a
    table roughly the same shape as pg_statistic.
    
    
    >
    > FWIW I think we might start by actually using per-table extended stats
    > on the joined tables. Just like we combine the scalar MCVs on joined
    > columns, we could combine multicolumn MVCs.
    >
    
    That's the other half of this - if the stats existed, do we have an obvious
    way to put them to use?
    
  5. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Corey Huinker <corey.huinker@gmail.com> — 2025-12-02T03:18:24Z

    >
    >
    > Yeah, I think that automated choices about this are unlikely to work
    > well.  We chose the syntax for CREATE STATISTICS with an eye to
    > allowing users to declaratively tell us to collect stats about
    > specific joins, and I still think that's a more promising approach.
    > But nobody's yet worked out any details.
    >
    >
    Per other response, no, I didn't envision stats on all possible joins or
    even all possible foreign keys, just the ones we declare as interesting,
    and even then only for the attributes that we say are interesting on the
    far side of the join.
    
  6. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Alexandra Wang <alexandra.wang.oss@gmail.com> — 2025-12-03T19:01:13Z

    Hi there,
    
    Thanks for raising this topic! I am currently working on a POC patch that
    adds extended statistics for joins. I am polishing the patch now and will
    post it soon with performance numbers, since there are interests!
    
    On Mon, Dec 1, 2025 at 7:16 PM Corey Huinker <corey.huinker@gmail.com>
    wrote:
    
    > On Mon, Dec 1, 2025 at 1:02 PM Tomas Vondra <tomas@vondra.me> wrote:
    >
    I think adding joins to extended stats would not be all that hard
    >> (famous last words, I know). For me the main challenge was figuring out
    >> how to store the join definition in the catalog, I always procrastinated
    >> and never gave that a serious try.
    >>
    >
    > I envisioned keying the stats off the foreign key constraint id, or adding
    > "starefrelid" (relation oid of the referencing table) to pg_statistic or a
    > table roughly the same shape as pg_statistic.
    >
    
    
     On Mon, Dec 1, 2025 at 1:02 PM Tomas Vondra <tomas@vondra.me> wrote:
    >
    >>
    >> FWIW I think we might start by actually using per-table extended stats
    >> on the joined tables. Just like we combine the scalar MCVs on joined
    >> columns, we could combine multicolumn MVCs.
    >>
    >
    > That's the other half of this - if the stats existed, do we have an
    > obvious way to put them to use?
    >
    
    I have indeed started by implementing MCV statistics for joins,
    because I have not found a case for joins that would benefit only from
    ndistinct or functional dependency stats that MCV stats wouldn't help.
    
    In my POC patch, I've made the following catalog changes:
    - Add *stxotherrel (oid) *and *stxjoinkeys (int2vector)* fields to
    *pg_statistic_ext*
    - Use the existing *stxkeys (int2vector)* to store the stats object
    attributes of *stxotherrel*
    - Create *pg_statistic_ext_otherrel_index* on *(stxrelid, stxotherrel)*
    - Add stxdjoinmcv* (pg_join_mcv_list)* to *pg_statistic_ext_data*
    
    To use them, we can let the planner detect patterns like this:
    
    /*
     * JoinStatsMatch - Information about a detected join pattern
     * Used internally to track what was matched in a join+filter pattern
     */
    typedef struct JoinStatsMatch
    {
        Oid          target_rel;       /* table OID of the estimation target */
        AttrNumber targetrel_joinkey;    /* target_rel's join column */
        Oid          other_rel;       /* table OID of the filter source */
        AttrNumber otherrel_joinkey;        /* other_rel's join column */
        List      *filter_attnums;       /* list of AttrNumbers for filter
    columns in other_rel */
        List      *filter_values;    /* list of Datum constant values
    being filtered */
        Oid          collation;       /* collation for comparisons */
    
        /* Additional info to avoid duplicate work */
        List      *join_rinfos;      /* list of join clause RestrictInfos */
        RestrictInfo *filter_rinfo;       /* the filter clause RestrictInfo */
    } JoinStatsMatch;
    
    and add the detection logic in clauselist_selectivity_ext() and
    get_foreign_key_join_selectivity().
    
    Statistics collection indeed needs the most thinking. For the
    purpose of a POC, I added MCV join stats collection as part of ANALYZE
    of one table (stxrel in pg_statistic_ext). I can do this because MCV
    join stats are somewhat asymmetric. It allows me to have a target
    table (referencing table for foreign key join) to ANALYZE, and we can
    use the already collected MCVs of the joinkey column on the target
    table to query the rows in the other table. This greatly mitigates
    performance impact compared to actually joining two tables. However,
    if we are to support more complex joins or other types of join stats
    such as ndistinct or functional dependency, I found it hard to define
    who's the target table (referencing table) and who's the other table
    (referenced table) outside of the foreign key join scenario. So I
    think for those more complex cases eventually we may as well
    perform the joins and collect the join stats separately. Alvaro
    Herrera suggested offline that we could have a dedicated autovacuum
    command option for collecting the join statistics.
    
    I have experimented with two ways to define the join statistics:
    
    1. Use CREATE STATISTICS:
    
    CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ] [ ( mcv ) ] ON {
    table_name1.column_name1 }, { table_name1.column_name2 } [, ...] FROM
    table_name1 JOIN table_name2 ON table_name1.column_name3 =
    table_name2.column_name4
    
    Examples:
    -- Create join MCV statistics on a single filter column (keyword)
    CREATE STATISTICS movie_keyword_keyword_join_stats (mcv)
    ON k.keyword
    FROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id);
    ANALYZE movie_keyword;
    
    -- Create join MCV statistics on multiple filter columns (keyword +
    phonetic_code):
    CREATE STATISTICS movie_keyword_keyword_multicols_join_stats (mcv)
    ON k.keyword, k.phonetic_code
    FROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id);
    ANALYZE movie_keyword;
    
    2. Auto join stats creation for Foreign Key + Functional Dependency stats
    
    Initially, I did not implement the CREATE TABLE STATISTICS command to
    create the join stats. Instead, I’ve implemented logic in ANALYZE to
    detect functional dependency stats on the referenced table through FKs
    and create join statistics implicitly for those cases.
    
    I've been using the Join Order Benchmark (JOB) [1] to measure
    performance gain. I will post the POC patch and performance numbers in
    a followup email.
    
    [1] https://www.vldb.org/pvldb/vol9/p204-leis.pdf
    
    Best,
    Alex
    
    -- 
    Alexandra Wang
    EDB: https://www.enterprisedb.com
    
  7. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Alexandra Wang <alexandra.wang.oss@gmail.com> — 2026-05-06T12:19:05Z

    Here's the rebased patch, it only needed a catalog version bump.
    
    Best,
    Alex
    
    -- 
    Alexandra Wang
    EDB: https://www.enterprisedb.com
    
  8. Re: Is there value in having optimizer stats for joins/foreignkeys?

    jian he <jian.universality@gmail.com> — 2026-05-13T15:15:31Z

    On Wed, May 6, 2026 at 8:19 PM Alexandra Wang
    <alexandra.wang.oss@gmail.com> wrote:
    >
    > Here's the rebased patch, it only needed a catalog version bump.
    >
    
    No need to update src/include/catalog/catversion.h during dev cycle.
    
    + if (!IsA(lfirst(l), OpExpr))
    + ereport(ERROR,
    + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
    + errmsg("join statistics require a single equijoin condition per pair
    of tables")));
    
    In the error message, should we replace "equijoin" with "equality join"?
    IMHO,  "equijoin" is kind of informal wording.
    
    -- Unintended error case1
    DROP TABLE IF EXISTS t1;
    CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL);
    CREATE STATISTICS x (mcv) ON t1.val FROM t1 JOIN t1 as t1s ON (t1.id = t1s.id);
    alter table t1 alter column id set data type int8;
    ERROR:  could not open relation with OID 0
    
    -- Unintended error case2
    DROP TABLE IF EXISTS t1, t2;
    CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL);
    CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER NOT NULL);
    CREATE STATISTICS xx (mcv) ON t1.val FROM t2 JOIN t1 ON (t2.t1_id = t1.id);
    alter table t1 alter column id set data type int8;
    ERROR:  missing FROM-clause entry for table "t1"
    LINE 1: alter table t1 alter column id set data type int8;
                                                 ^
    
    -- Unintended error case3
    DROP TABLE IF EXISTS t1, t2;
    CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL);
    CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER generated always as (1));
    CREATE STATISTICS xx (mcv) ON t1.val FROM t2 JOIN t1 ON (t2.t1_id = t1.id);
    alter table t2 alter column t1_id set expression as (2);
    ERROR:  missing FROM-clause entry for table "t1"
    LINE 1: alter table t2 alter column t1_id set expression as (2);
                                                 ^
    
    -- Unintended error case4
    DROP TABLE IF EXISTS t1, t2;
    CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL);
    CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER generated always as (1));
    CREATE STATISTICS xx (mcv) ON t1.val FROM t2 JOIN t1 ON (t2 = t1);
    ERROR:  cache lookup failed for attribute 0 of relation 18388
    
    In src/test/regress/sql/stats_ext_crossrel.sql, we need to add test cases for
    ALTER COLUMN SET DATA TYPE and ALTER COLUMN SET EXPRESSION on columns
    involved in join statistics.
    These ALTER TABLE operations will internally recreate the affected join stats.
    
    src/test/regress/sql/stats_ext_crossrel.sql currently lacks tests for equality
    joins where the join qual contains whole-row variable references.This is
    important because whole-row variables raise the question of whether join
    statistics should be recreated when any column in the relation is modified via
    ALTER COLUMN SET DATA TYPE or ALTER COLUMN SET EXPRESSION.
    
    
    
    --
    jian
    https://www.enterprisedb.com/
    
    
    
    
  9. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Alexandra Wang <alexandra.wang.oss@gmail.com> — 2026-05-13T17:30:56Z

    Hi Jian,
    
    Thanks for reviewing! At the current stage, I'm more interested in
    reviews around the overall design and the feasibility of the current
    approach, as I mentioned earlier:
    
    On Tue, Apr 28, 2026 at 10:59 PM Alexandra Wang <
    alexandra.wang.oss@gmail.com> wrote:
    > I'd appreciate feedback on:
    > - The catalog design (stxjoinrels, stxkeyrefs, stxjoinconds)
    > - The index-based sampling implementation
    > - The planner integration (clausesel.c, costsize.c)
    > - What should be in scope for v1 vs deferred
    
    I will incorporate your review feedback in the next revision, but for
    now, I'd like to focus the discussion on the above areas before
    polishing the existing patch.
    
    On Wed, May 13, 2026 at 8:16 AM jian he <jian.universality@gmail.com> wrote:
    > On Wed, May 6, 2026 at 8:19 PM Alexandra Wang
    > <alexandra.wang.oss@gmail.com> wrote:
    > >
    > > Here's the rebased patch, it only needed a catalog version bump.
    > >
    >
    > No need to update src/include/catalog/catversion.h during dev cycle.
    
    I needed to bump the catalog version because CI broke without that
    change.
    
    > + if (!IsA(lfirst(l), OpExpr))
    > + ereport(ERROR,
    > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
    > + errmsg("join statistics require a single equijoin condition per pair
    > of tables")));
    
    > In the error message, should we replace "equijoin" with "equality join"?
    > IMHO,  "equijoin" is kind of informal wording.
    
    OK. I'll update the error message in the next revision.
    
    > -- Unintended error case1
    > DROP TABLE IF EXISTS t1;
    > CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL);
    > CREATE STATISTICS x (mcv) ON t1.val FROM t1 JOIN t1 as t1s ON (t1.id =
    t1s.id);
    > alter table t1 alter column id set data type int8;
    > ERROR:  could not open relation with OID 0
    >
    > -- Unintended error case2
    > DROP TABLE IF EXISTS t1, t2;
    > CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL);
    > CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER NOT NULL);
    > CREATE STATISTICS xx (mcv) ON t1.val FROM t2 JOIN t1 ON (t2.t1_id = t1.id
    );
    > alter table t1 alter column id set data type int8;
    > ERROR:  missing FROM-clause entry for table "t1"
    > LINE 1: alter table t1 alter column id set data type int8;
    >                                              ^
    >
    > -- Unintended error case3
    > DROP TABLE IF EXISTS t1, t2;
    > CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL);
    > CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER generated always
    as (1));
    > CREATE STATISTICS xx (mcv) ON t1.val FROM t2 JOIN t1 ON (t2.t1_id = t1.id
    );
    > alter table t2 alter column t1_id set expression as (2);
    > ERROR:  missing FROM-clause entry for table "t1"
    > LINE 1: alter table t2 alter column t1_id set expression as (2);
    >                                              ^
    >
    > -- Unintended error case4
    > DROP TABLE IF EXISTS t1, t2;
    > CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL);
    > CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER generated always
    as (1));
    > CREATE STATISTICS xx (mcv) ON t1.val FROM t2 JOIN t1 ON (t2 = t1);
    > ERROR:  cache lookup failed for attribute 0 of relation 18388
    >
    > In src/test/regress/sql/stats_ext_crossrel.sql, we need to add test cases
    for
    > ALTER COLUMN SET DATA TYPE and ALTER COLUMN SET EXPRESSION on columns
    > involved in join statistics.
    > These ALTER TABLE operations will internally recreate the affected join
    stats.
    >
    > src/test/regress/sql/stats_ext_crossrel.sql currently lacks tests for
    equality
    > joins where the join qual contains whole-row variable references.This is
    > important because whole-row variables raise the question of whether join
    > statistics should be recreated when any column in the relation is
    modified via
    > ALTER COLUMN SET DATA TYPE or ALTER COLUMN SET EXPRESSION.
    
    Good call! I'll make sure to include these edge cases in the next
    revision.
    
    Best,
    Alex
    
    --
    Alexandra Wang
    EDB: https://www.enterprisedb.com
    
  10. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Alexandra Wang <alexandra.wang.oss@gmail.com> — 2026-05-14T23:13:26Z

    Patch needed a rebase, so here's v5.
    
    I've dropped the changes to catversion.h so hopefully I won't need to
    rebase too often. Sorry about the noise!
    
    While I was at it, I also addressed the issues Jian mentioned -- all
    minor fixes in parse_utilcmd.c (a one-line detection fix for the ALTER
    COLUMN error, an error guard for whole-row variables, and rewording
    "equijoin" to "equality join"). Other than that, there are no major
    changes in v5 compared to v2-v4. At this stage I'd really appreciate
    feedback on the overall design approach.
    
    Best,
    Alex
    
    -- 
    Alexandra Wang
    EDB: https://www.enterprisedb.com
    
  11. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Alexandra Wang <alexandra.wang.oss@gmail.com> — 2026-05-15T05:08:56Z

    Attached v6. It should fix the CI failures on NetBSD/Windows by
    changing the ALTER TYPE invalidation test added in v5 from
    text->varchar(100) to integer->numeric, avoiding encoding-dependent
    width estimates.
    
    Best,
    Alex
    
    
    -- 
    Alexandra Wang
    EDB: https://www.enterprisedb.com
    
  12. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Alexandra Wang <alexandra.wang.oss@gmail.com> — 2026-05-15T15:30:31Z

    Here's v7, another attempt to fix the unstable tests.
    
    Best,
    Alex
    
    On Thu, May 14, 2026 at 10:08 PM Alexandra Wang <
    alexandra.wang.oss@gmail.com> wrote:
    
    > Attached v6. It should fix the CI failures on NetBSD/Windows by
    > changing the ALTER TYPE invalidation test added in v5 from
    > text->varchar(100) to integer->numeric, avoiding encoding-dependent
    > width estimates.
    >
    > Best,
    > Alex
    >
    >
    > --
    > Alexandra Wang
    > EDB: https://www.enterprisedb.com
    >
    
    
    -- 
    Alexandra Wang
    EDB: https://www.enterprisedb.com
    
  13. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Tom Lane <tgl@sss.pgh.pa.us> — 2026-05-21T20:25:13Z

    Alexandra Wang <alexandra.wang.oss@gmail.com> writes:
    > Here's v7, another attempt to fix the unstable tests.
    
    Hi Alexandra,
    
    I signed up for an in-person review of this at PGConf.dev, but
    the schedule doesn't seem to be working in favor of making that
    happen.  If you see this and happen to run into me in the
    hallway, I'm happy to chat, but in any case here are my
    rather-hasty review notes.
    
    I think it's okay if v1 only handles 2-way joins, as long as the
    catalog representation is prepared for more.  Restricting to
    cases where we can do index-based sampling seems fine too.
    Those things could be relaxed later if it seems worthwhile,
    but we'd have a creditable feature even without.
    
    I didn't read the sampling code in any detail.  I think you will
    need to put more thought into what is user-friendly behavior
    in case the required index doesn't exist or doesn't have the
    right properties.  (I think the tests for that might not be
    strong enough, either.)
    
    I think you could simplify some code noticeably if you included the
    anchor rel's OID as the first element of stxjoinrels[].  Yeah,
    it'd be redundant with stxrelid, but so what?  It's not like 
    pg_statistic_ext rows are narrow enough that anyone would notice
    the extra 4 bytes.  I think this would simplify some of the
    relationships within the data structures, too, eg all varnos in
    the expressions could be considered to reference stxjoinrels[].
    
    I don't love stxkeyrefs[].  I wonder if it's time to throw away
    stxkeys[], represent all the target columns as regular expression
    trees in stxexprs, and then special-case columns that are simple
    Vars where appropriate at execution.
    
    (In the same vein, I dislike the grammar's separation of plain
    columns from expressions; I'd like to replace stats_params
    with expr_list and sort it all out later.  But perhaps that's
    material for a separate patch.)
    
    We will need to put more thought into permissions: I don't think
    requiring all the tables to have the same owner is workable.
    (What happens if someone tries to ALTER OWNER later?)  However,
    if they don't all have the same owner, there are potential security
    problems, so the right restriction is not obvious.  This is not
    necessary to solve now; there are bigger questions to worry about.
    But we'll need an answer before it's committable.
    
    It's not too soon to write some user-facing documentation.
    CREATE STATISTICS man page obviously needs attention, but
    also the discussion of extended stats in perform.sgml.
    And catalogs.sgml.  I find that writing that sort of stuff
    helps to clarify where one's design is weak.
    
    			regards, tom lane
    
    
    
    
  14. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Chengpeng Yan <chengpeng_yan@outlook.com> — 2026-05-25T12:15:47Z

    Hi,
    
    > On May 15, 2026, at 23:30, Alexandra Wang <alexandra.wang.oss@gmail.com> wrote:
    > 
    > Here's v7, another attempt to fix the unstable tests.
    
    Thanks for working on this. I have a few design comments about the lifecycle
    and the intended scope of the join statistic.
    
    First, the index dependency contract seems worth clarifying. CREATE STATISTICS
    records a normal dependency on the selected index, so DROP INDEX is blocked
    unless CASCADE is used, although that index is not part of the statistics
    definition. But ANALYZE appears to re-discover a suitable index when refreshing
    the statistic. Is the index intended to be part of the statistic's persistent
    contract, or only a creation-time proof that index-based sampling is possible?
    If the latter, should DROP INDEX still be blocked when another equivalent index
    exists?
    
    Second, this seems related to the earlier concern that ANALYZE is per-table.
    The statistic is owned by the anchor relation, but its contents depend on the
    probed relation too. In the current patch, ANALYZE on the probed relation can
    refresh its own statistics without refreshing the join statistic. If the
    probed relation has changed substantially, that leaves a possible staleness
    gap where the planner combines fresh base-table statistics with stale
    cross-relation skew information.
    
    Third, the contract for non-unique indexes on the probed side seems worth
    clarifying. The comments define raw_sel as anchor-relative:
    P(join AND covered_filters) / anchor_totalrows, roughly Jf / anchor_totalrows,
    where Jf is the number of joined rows satisfying the covered filters. But the
    implementation computes raw_sel from MCV frequencies. Since the MCV list is
    built from sampled joined rows, a plain MCV frequency is naturally measured
    inside that joined sample, roughly Jf / J where J is the sampled join-result
    size. It would be useful to clarify when those two quantities are expected to
    be equivalent, especially when a non-unique probed-side index allows one
    anchor row to contribute multiple joined rows.
    
    These two measures are close in FK-like cases where the joined sample size
    tracks the anchor sample size. With a non-unique lookup, one anchor row may
    appear many times in the joined sample. For example, if one key matches many
    red rows and another key matches only one blue row, red may dominate the
    joined sample because of match multiplicity. That frequency describes the
    distribution within the joined result, but not how many matching joined rows
    are produced per anchor row. It would be useful to state whether such
    one-to-many joins are outside the current supported scope, or how the
    MCV-derived raw_sel accounts for how many joined rows each anchor row
    contributed before it is converted into planner join selectivity.
    
    --
    Best regards,
    Chengpeng Yan
    
    
    
    
    
  15. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Tomas Vondra <tomas@vondra.me> — 2026-05-25T14:34:51Z

    On 5/21/26 22:25, Tom Lane wrote:
    > Alexandra Wang <alexandra.wang.oss@gmail.com> writes:
    >> Here's v7, another attempt to fix the unstable tests.
    > 
    > Hi Alexandra,
    > 
    > I signed up for an in-person review of this at PGConf.dev, but
    > the schedule doesn't seem to be working in favor of making that
    > happen.  If you see this and happen to run into me in the
    > hallway, I'm happy to chat, but in any case here are my
    > rather-hasty review notes.
    > 
    > I think it's okay if v1 only handles 2-way joins, as long as the
    > catalog representation is prepared for more.  Restricting to
    > cases where we can do index-based sampling seems fine too.
    > Those things could be relaxed later if it seems worthwhile,
    > but we'd have a creditable feature even without.
    > 
    
    +1
    
    My assumption is allowing larger joins would be a somewhat mechanical.
    I'm not aware of additional problems on top of 2-way joins.
    
    I think we should aim to support larger joins. At the unconference there
    were suggestions maybe it would be enough to support 2-way joins, but
    it's not hard to construct cases where that's no sufficient. Consider a
    fact table, joining to two dimensions. It's common for the dimensions to
    be correlated in various ways, and 2-way joins can't handle these cases.
    
    > I didn't read the sampling code in any detail.  I think you will
    > need to put more thought into what is user-friendly behavior
    > in case the required index doesn't exist or doesn't have the
    > right properties.  (I think the tests for that might not be
    > strong enough, either.)
    > 
    
    What would be the most "user-friendly behavior" in this case?
    
    I think we can either (a) refuse defining/building the join statistics
    in this case, or (b) fallback to sampling not requiring an index (but
    then it'll be way more expensive).
    
    I think (a) should be fine for now, i.e. we should require an index.
    Most of the joins will be on FK constraints, or something like that (the
    FK may not be defined, but there will be a PK on one side).
    
    Not sure if we should simply refuse building the stats, or if it's
    enough to detect this while building the statistics. I'd say enforcing
    this during DDL (CREATE STATISTICS, DROP INDEX, ...) is better,
    otherwise users may not notice the statistics stopped building.
    
    > I think you could simplify some code noticeably if you included the
    > anchor rel's OID as the first element of stxjoinrels[].  Yeah,
    > it'd be redundant with stxrelid, but so what?  It's not like 
    > pg_statistic_ext rows are narrow enough that anyone would notice
    > the extra 4 bytes.  I think this would simplify some of the
    > relationships within the data structures, too, eg all varnos in
    > the expressions could be considered to reference stxjoinrels[].
    > 
    > I don't love stxkeyrefs[].  I wonder if it's time to throw away
    > stxkeys[], represent all the target columns as regular expression
    > trees in stxexprs, and then special-case columns that are simple
    > Vars where appropriate at execution.
    > 
    
    +1, I don't see a reason to not store the anchor rel separately.
    
    > (In the same vein, I dislike the grammar's separation of plain
    > columns from expressions; I'd like to replace stats_params
    > with expr_list and sort it all out later.  But perhaps that's
    > material for a separate patch.)
    > 
    
    FWIW the extended stats copied this from pg_index, which also stores
    keys and expressions separately. I suppose there was a reason for that,
    most likely performance - is cheaper to compare attnums than
    expressions, and plain keys are much more common.
    
    Maybe that's no longer true, or maybe it's not as important for extended
    stats (there's likely fewer of those, compared to indexes).
    
    > We will need to put more thought into permissions: I don't think
    > requiring all the tables to have the same owner is workable.
    > (What happens if someone tries to ALTER OWNER later?)  However,
    > if they don't all have the same owner, there are potential security
    > problems, so the right restriction is not obvious.  This is not
    > necessary to solve now; there are bigger questions to worry about.
    > But we'll need an answer before it's committable.
    > 
    
    I have not thought about this at all, but what can we do if the tables
    have different owners? I suppose we could require the stxowner to have
    SELECT privilege on the joined relations (instead of owning them).
    
    > It's not too soon to write some user-facing documentation.
    > CREATE STATISTICS man page obviously needs attention, but
    > also the discussion of extended stats in perform.sgml.
    > And catalogs.sgml.  I find that writing that sort of stuff
    > helps to clarify where one's design is weak.
    > 
    
    +1
    
    
    regards
    
    -- 
    Tomas Vondra
    
    
    
    
    
  16. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Tom Lane <tgl@sss.pgh.pa.us> — 2026-05-25T15:03:57Z

    Tomas Vondra <tomas@vondra.me> writes:
    > On 5/21/26 22:25, Tom Lane wrote:
    >> I don't love stxkeyrefs[].  I wonder if it's time to throw away
    >> stxkeys[], represent all the target columns as regular expression
    >> trees in stxexprs, and then special-case columns that are simple
    >> Vars where appropriate at execution.
    >> (In the same vein, I dislike the grammar's separation of plain
    >> columns from expressions; I'd like to replace stats_params
    >> with expr_list and sort it all out later.  But perhaps that's
    >> material for a separate patch.)
    
    > FWIW the extended stats copied this from pg_index, which also stores
    > keys and expressions separately. I suppose there was a reason for that,
    > most likely performance - is cheaper to compare attnums than
    > expressions, and plain keys are much more common.
    
    I think I might be to blame for the separate storage of indexprs.
    If so, the motivation was to avoid breakage of older code that only
    knew about indkey[].  (Of course, such code would necessarily fail
    on indexes with expressions, but we wanted to avoid breakage for the
    common case of no-expressions.)  I don't think that consideration is
    nearly as pressing for extended stats.  There's probably a lot less
    client-side code that knows about extended stats at all, and what
    there is seems more likely to rely on the server-side display
    functions than to dig into the catalog details for itself.  Also,
    if there is anything that's looking at pg_statistic_ext details,
    it will need work anyway after this patch; there's no way around that.
    
    >> We will need to put more thought into permissions: I don't think
    >> requiring all the tables to have the same owner is workable.
    
    > I have not thought about this at all, but what can we do if the tables
    > have different owners? I suppose we could require the stxowner to have
    > SELECT privilege on the joined relations (instead of owning them).
    
    Yeah, the rough idea I had was to require ownership on the anchor
    table and SELECT on the rest.  But it's not terribly clear what
    to do if that SELECT privilege gets revoked.
    
    I also wonder to what extent we have a problem with users of the
    anchor table being able to infer something about the contents of the
    other tables via plan choices, and whether it matters if they can.
    They may well be able to make the same inferences anyway from query
    results.  For that matter, if a user is able to issue a query for
    which a set of extended join stats is relevant, it seems likely that
    that must mean she has SELECT on the other tables anyway.  But
    maybe I'm missing some case where that wouldn't be true.
    
    			regards, tom lane
    
    
    
    
  17. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Alexandra Wang <alexandra.wang.oss@gmail.com> — 2026-05-27T17:49:44Z

    Hi Tom and Tomas,
    
    Thank you so much for the feedback!
    
    On Mon, May 25, 2026 at 8:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Tomas Vondra <tomas@vondra.me> writes:
    > > On 5/21/26 22:25, Tom Lane wrote:
    > >> I don't love stxkeyrefs[].  I wonder if it's time to throw away
    > >> stxkeys[], represent all the target columns as regular expression
    > >> trees in stxexprs, and then special-case columns that are simple
    > >> Vars where appropriate at execution.
    > >> (In the same vein, I dislike the grammar's separation of plain
    > >> columns from expressions; I'd like to replace stats_params
    > >> with expr_list and sort it all out later.  But perhaps that's
    > >> material for a separate patch.)
    >
    > > FWIW the extended stats copied this from pg_index, which also stores
    > > keys and expressions separately. I suppose there was a reason for that,
    > > most likely performance - is cheaper to compare attnums than
    > > expressions, and plain keys are much more common.
    >
    > I think I might be to blame for the separate storage of indexprs.
    > If so, the motivation was to avoid breakage of older code that only
    > knew about indkey[].  (Of course, such code would necessarily fail
    > on indexes with expressions, but we wanted to avoid breakage for the
    > common case of no-expressions.)  I don't think that consideration is
    > nearly as pressing for extended stats.  There's probably a lot less
    > client-side code that knows about extended stats at all, and what
    > there is seems more likely to rely on the server-side display
    > functions than to dig into the catalog details for itself.  Also,
    > if there is anything that's looking at pg_statistic_ext details,
    > it will need work anyway after this patch; there's no way around that.
    
    I'm working on removing stxkeys[] as a prerequisite commit before the main
    join
    stats patch, representing all target columns as Var nodes in stxexprs, as
    you
    both suggested.
    
    One question about the pg_stats_ext view: currently it has two complementary
    columns:
    
    - attnames (name[]) — Names of the columns included in the statistics object
    - exprs (text[]) — Expressions included in the statistics object
    
    With stxkeys gone from the catalog, should the view:
    
    (a) Stay as-is: keep attnames and exprs as separate columns with the same
    semantics. Implemented via a helper function that extracts plain column
    names
    from the unified stxexprs.
    
    or
    
    (b) Mirror the catalog: remove attnames, make exprs show all entries (both
    column names and expressions together in one text[] array).
    
    Any preference?
    
    -- 
    Alexandra Wang
    EDB: https://www.enterprisedb.com
    
  18. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Tomas Vondra <tomas@vondra.me> — 2026-05-27T19:08:34Z

    
    On 5/27/26 19:49, Alexandra Wang wrote:
    > Hi Tom and Tomas,
    > 
    > Thank you so much for the feedback!
    > 
    > On Mon, May 25, 2026 at 8:04 AM Tom Lane <tgl@sss.pgh.pa.us
    > <mailto:tgl@sss.pgh.pa.us>> wrote:
    >> Tomas Vondra <tomas@vondra.me <mailto:tomas@vondra.me>> writes:
    >> > On 5/21/26 22:25, Tom Lane wrote:
    >> >> I don't love stxkeyrefs[].  I wonder if it's time to throw away
    >> >> stxkeys[], represent all the target columns as regular expression
    >> >> trees in stxexprs, and then special-case columns that are simple
    >> >> Vars where appropriate at execution.
    >> >> (In the same vein, I dislike the grammar's separation of plain
    >> >> columns from expressions; I'd like to replace stats_params
    >> >> with expr_list and sort it all out later.  But perhaps that's
    >> >> material for a separate patch.)
    >>
    >> > FWIW the extended stats copied this from pg_index, which also stores
    >> > keys and expressions separately. I suppose there was a reason for that,
    >> > most likely performance - is cheaper to compare attnums than
    >> > expressions, and plain keys are much more common.
    >>
    >> I think I might be to blame for the separate storage of indexprs.
    >> If so, the motivation was to avoid breakage of older code that only
    >> knew about indkey[].  (Of course, such code would necessarily fail
    >> on indexes with expressions, but we wanted to avoid breakage for the
    >> common case of no-expressions.)  I don't think that consideration is
    >> nearly as pressing for extended stats.  There's probably a lot less
    >> client-side code that knows about extended stats at all, and what
    >> there is seems more likely to rely on the server-side display
    >> functions than to dig into the catalog details for itself.  Also,
    >> if there is anything that's looking at pg_statistic_ext details,
    >> it will need work anyway after this patch; there's no way around that.
    > 
    > I'm working on removing stxkeys[] as a prerequisite commit before the
    > main join
    > stats patch, representing all target columns as Var nodes in stxexprs,
    > as you
    > both suggested.
    > 
    > One question about the pg_stats_ext view: currently it has two complementary
    > columns:
    > 
    > - attnames (name[]) — Names of the columns included in the statistics object
    > - exprs (text[]) — Expressions included in the statistics object
    > 
    > With stxkeys gone from the catalog, should the view:
    > 
    > (a) Stay as-is: keep attnames and exprs as separate columns with the same
    > semantics. Implemented via a helper function that extracts plain column
    > names
    > from the unified stxexprs.
    > 
    > or
    > 
    > (b) Mirror the catalog: remove attnames, make exprs show all entries (both
    > column names and expressions together in one text[] array).
    > 
    > Any preference?
    > 
    
    My 2c: AFAIR there's no fundamental reason to keep those two lists
    separate, other than that expressions were "bolted on" later, after we
    already had stats on plain attributes. In hindsight, it might have been
    better to just unify the view back then, probably.
    
    I personally would be OK with just unifying adjusting the view, and
    showing a single list (with both attributes and expressions). IIUC the
    plan is to just store a list of expressions anyway, with attributes
    represented as Vars. So the view would have to do more work just to
    produce the "old" output, with little benefit.
    
    The one argument against this that I can think of is possibly breaking
    tools that use this view. IIRC pg_dump is reading the view when
    exporting/importing the statistics. That might need some adjustments
    (and there's also pg_stats_ext_exprs), but maybe it's easier to keep the
    view consistent.
    
    Also, maybe this is one more argument against the "optimizer view" idea
    Corey mentioned in Vancouver last week? Because surely we'll want to
    include the join statistics in export/import, and for the view approach
    we'd need to invent a fair amount of code to do that. Maybe?
    
    
    regards
    
    -- 
    Tomas Vondra
    
    
    
    
    
  19. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Tom Lane <tgl@sss.pgh.pa.us> — 2026-05-27T20:31:16Z

    Tomas Vondra <tomas@vondra.me> writes:
    > On 5/27/26 19:49, Alexandra Wang wrote:
    >> One question about the pg_stats_ext view: currently it has two complementary
    >> columns:
    >> 
    >> - attnames (name[]) — Names of the columns included in the statistics object
    >> - exprs (text[]) — Expressions included in the statistics object
    >> 
    >> With stxkeys gone from the catalog, should the view:
    >> (a) Stay as-is: keep attnames and exprs as separate columns with the same
    >> semantics. Implemented via a helper function that extracts plain column
    >> names from the unified stxexprs.
    >> or
    >> (b) Mirror the catalog: remove attnames, make exprs show all entries (both
    >> column names and expressions together in one text[] array).
    
    > My 2c: AFAIR there's no fundamental reason to keep those two lists
    > separate, other than that expressions were "bolted on" later, after we
    > already had stats on plain attributes. In hindsight, it might have been
    > better to just unify the view back then, probably.
    
    Yeah.  There are some other oddities that arise from that: expressions
    get shoved to the end.  For example, if I put in
    
    create statistics my_stats (mcv) on ten, (ten+four), four from tenk1;
    
    pg_dump will regurgitate that as
    
    CREATE STATISTICS public.my_stats (mcv) ON four, ten, (ten + four) FROM public.tenk1;
    
    and I see that that column ordering is consistent with what appears in
    pg_stats_ext and perhaps other places.  I'd expect a rewritten version
    to stop doing that and preserve the user-written column order.
    So there are going to be some potential minor incompatibilities for
    anything that is looking too closely at this view, and it seems to
    me that it might be better for such code to fail noisily rather than
    perhaps silently mis-associate stats with columns.
    
    (It might be a good idea to have some test cases that exercise this
    kind of scenario in pg_upgrade, especially now that we are trying to
    transfer extended stats in upgrades...)
    
    			regards, tom lane
    
    
    
    
  20. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Corey Huinker <corey.huinker@gmail.com> — 2026-05-27T22:02:20Z

    >
    > Also, maybe this is one more argument against the "optimizer view" idea
    > Corey mentioned in Vancouver last week? Because surely we'll want to
    > include the join statistics in export/import, and for the view approach
    > we'd need to invent a fair amount of code to do that. Maybe?
    >
    
    We definitely want join statistics in the export/import.
    
    To my mind, having stats on certain views would be extremely simple for
    export/import, we'd simply have one more relkind that makes it into the
    system views pg_stats and pg_stats_ext, and the statistics for that new
    relation would plug into pg_statistic with no catalog change to
    pg_statistic whatsoever. Additionally, allowing certain kinds of views (or
    a relation relkind that's functionally equivalent to a view) to have
    statistics makes it easy to define extended statistics on those views, with
    no catalog change to pg_statistic_ext.
    
    Having something in pg_class to anchor existing per-attribute and
    multi-attribute stats off of seems like a big win to me. We'd get all
    pre-existing statistics types for free, statistics kinds provided by
    extensions for free, extended stats for free, import and export for free.
    Well, not free, we just have to remove the exclusion that views (or
    whatever relkind we create) can't have stats.
    
    Having said all that, the statistics import code reads from
    pg_statistic_ext but obviously never modifies it, it's all about
    constructing the pg_statistic_ext_data row, and it need to only concern
    itself with importing to the current version, so internal catalog changes
    to pg_statistic_ext wouldn't be that big of a deal.
    
    If, however, we want to stick with the notion that all non-relation,
    not-attribute stats are extended stats, then we have to remodel a bit:
    
    - as Tom and Tomas mentioned, we'd probably want to dispense with the
    negative attnums and stxexprs, as those are already a bit of a pain to deal
    with. That might be worth of a patch even without join statistics.
    
    - we'd want a way to express stats for all the individual
    columns/expressions defined in the join stats object, i.e. "what is the MCV
    of B.name for rows joined by A on A.b_id"?
    
    - we'd want some way of excluding the non-interesting combinations of
    columns. If we had a stat on A.qty, A.price, B.cust_name, C.sale_date,
    D.item_name, then we'd have 5-factorial MCV combinations in addition to the
    5 single-column stats, and the (A.qty, A.price) combinations can already be
    covered by a non-join extended stat.
    
    It's those things that make optimizer views so attractive to me - we
    already have a way to store individual column stats (pg_statistic) and shut
    them off when not interesting (pg_attribute.attstattarget), and extended
    statistics are a great way to describe which combinations of columns are
    interesting to us.
    
    Typing all this has me thinking that there may be a third way:
    
    - statistics objects become pg_class objects, stxkeys and stxexprs  become
    pg_attribute rows
    - pg_statisic_ext keeps (stxrelid, stxstattarget, stxkind), adds stxid
    (pointing to new pg_class object)  sort of like pg_index
    - pg_statistic_ext_data remains as-is
    
    and we'd set per-column stats collection like ALTER STATISTICS foo ALTER
    COLUMN.
    
  21. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Tomas Vondra <tomas@vondra.me> — 2026-05-28T01:29:15Z

    On 5/28/26 00:02, Corey Huinker wrote:
    >     Also, maybe this is one more argument against the "optimizer view" idea
    >     Corey mentioned in Vancouver last week? Because surely we'll want to
    >     include the join statistics in export/import, and for the view approach
    >     we'd need to invent a fair amount of code to do that. Maybe?
    > 
    > 
    > We definitely want join statistics in the export/import. 
    > 
    > To my mind, having stats on certain views would be extremely simple for
    > export/import, we'd simply have one more relkind that makes it into the
    > system views pg_stats and pg_stats_ext, and the statistics for that new
    > relation would plug into pg_statistic with no catalog change to
    > pg_statistic whatsoever. Additionally, allowing certain kinds of views
    > (or a relation relkind that's functionally equivalent to a view) to have
    > statistics makes it easy to define extended statistics on those views,
    > with no catalog change to pg_statistic_ext.
    > 
    
    Ah, so you're proposing supporting CREATE STATISTICS on some views? I
    guess that's one way to support stats on joins, without having to rework
    the schema. I'm still not a huge fan of it, because it just uses views
    as a workaround to store the join definition, nothing else. To me it
    seems a weird to require creating a new relation just for this, and we
    already envisioned CREATE STATISTICS would cover joins. My guess is that
    may be why DB2 did it this way, as they probably didn't have anything
    like extended stats at that point.
    
    > Having something in pg_class to anchor existing per-attribute and multi-
    > attribute stats off of seems like a big win to me. We'd get all pre-
    > existing statistics types for free, statistics kinds provided by
    > extensions for free, extended stats for free, import and export for
    > free. Well, not free, we just have to remove the exclusion that views
    > (or whatever relkind we create) can't have stats.
    > 
    
    TBH the grammar / catalog stuff seems like a relatively minor part of
    this patch. To me, the difficult part seems to be the sampling / analyze
    part, and then matching it to the query during planning. And all of this
    seems exactly the same no matter how the stats are defined.
    
    OTOH maybe allowing CREATE STATISTICS on views would be independently
    useful too, once we have the later parts. Not sure.
    
    > Having said all that, the statistics import code reads from
    > pg_statistic_ext but obviously never modifies it, it's all about
    > constructing the pg_statistic_ext_data row, and it need to only concern
    > itself with importing to the current version, so internal catalog
    > changes to pg_statistic_ext wouldn't be that big of a deal.
    > 
    > If, however, we want to stick with the notion that all non-relation,
    > not-attribute stats are extended stats, then we have to remodel a bit:
    > 
    > - as Tom and Tomas mentioned, we'd probably want to dispense with the
    > negative attnums and stxexprs, as those are already a bit of a pain to
    > deal with. That might be worth of a patch even without join statistics.
    > 
    > - we'd want a way to express stats for all the individual columns/
    > expressions defined in the join stats object, i.e. "what is the MCV of
    > B.name for rows joined by A on A.b_id"?
    > 
    
    Maybe I misunderstand, but don't we already do this for statistics on
    expressions?
    
    > - we'd want some way of excluding the non-interesting combinations of
    > columns. If we had a stat on A.qty, A.price, B.cust_name, C.sale_date,
    > D.item_name, then we'd have 5-factorial MCV combinations in addition to
    > the 5 single-column stats, and the (A.qty, A.price) combinations can
    > already be covered by a non-join extended stat.
    > 
    
    I don't follow. What 5-factorial combinations? We only ever build a
    single MCV for a given statistics object.
    
    > It's those things that make optimizer views so attractive to me - we
    > already have a way to store individual column stats (pg_statistic) and
    > shut them off when not interesting (pg_attribute.attstattarget), and
    > extended statistics are a great way to describe which combinations of
    > columns are interesting to us.
    > 
    
    I may be missing something, but I honestly the only benefit of views I
    can think of is already having the join definition in a catalog.
    
    > Typing all this has me thinking that there may be a third way:
    > 
    > - statistics objects become pg_class objects, stxkeys and stxexprs 
    > become pg_attribute rows
    > - pg_statisic_ext keeps (stxrelid, stxstattarget, stxkind), adds stxid
    > (pointing to new pg_class object)  sort of like pg_index
    > - pg_statistic_ext_data remains as-is
    > 
    > and we'd set per-column stats collection like ALTER STATISTICS foo ALTER
    > COLUMN.
    
    I don't follow. Why should statistics object be pg_class objects? In my
    mind pg_class is meant for "relations" (as in, table-like things), and
    statistics objects are not like that. I suppose this is related to your
    earlier suggestion
    
      Having something in pg_class to anchor existing per-attribute and
      multi-attribute stats off of seems like a big win to me.
    
    but it's not clear to me why would that be? All statistics are tied to a
    relation (or multiple relations) in the end, and I don't see why would
    the view make anything simpler. What are the wins?
    
    Could you elaborate? It's entirely possible I just don't see something
    obvious, or maybe you explained this in Vancouver and I managed to
    forget the details. Sorry about that.
    
    
    regards
    
    -- 
    Tomas Vondra
    
    
    
    
    
  22. Re: Is there value in having optimizer stats for joins/foreignkeys?

    Corey Huinker <corey.huinker@gmail.com> — 2026-05-28T19:03:26Z

    >
    > > To my mind, having stats on certain views would be extremely simple for
    > > export/import, we'd simply have one more relkind that makes it into the
    > > system views pg_stats and pg_stats_ext, and the statistics for that new
    > > relation would plug into pg_statistic with no catalog change to
    > > pg_statistic whatsoever. Additionally, allowing certain kinds of views
    > > (or a relation relkind that's functionally equivalent to a view) to have
    > > statistics makes it easy to define extended statistics on those views,
    > > with no catalog change to pg_statistic_ext.
    > >
    >
    > Ah, so you're proposing supporting CREATE STATISTICS on some views? I
    > guess that's one way to support stats on joins, without having to rework
    > the schema. I'm still not a huge fan of it, because it just uses views
    > as a workaround to store the join definition, nothing else. To me it
    > seems a weird to require creating a new relation just for this, and we
    > already envisioned CREATE STATISTICS would cover joins. My guess is that
    > may be why DB2 did it this way, as they probably didn't have anything
    > like extended stats at that point.
    >
    
    They did have extended stats.
    
    Here are some example RUNSTATS calls from
    https://www.ibm.com/docs/en/db2/11.5.x?topic=commands-runstats
    
        RUNSTATS ON TABLE employee ON ALL COLUMNS AND COLUMNS ((JOB, WORKDEPT,
    SEX))
            WITH DISTRIBUTION
    
    In our lingo, this is roughly:
    
        CREATE STATISTICS foo ON (job, workdept, sex) FROM employee;
        ANALYZE employee;
    
    And you can tweak individual columns
    
        RUNSTATS ON VIEW product_sales_view
         WITH DISTRIBUTION ON COLUMNS (category NUM_FREQVALUES 100
    NUM_QUANTILES 100,
        type, product_key) DEFAULT NUM_FREQVALUES 50 NUM_QUANTILES 50
    
    But the lesson I took away from this doc and corroborated by meeting with
    someone who used to work on DB2 planner is that they have an equivalent of
    extended stats.
    
    >
    > > Having something in pg_class to anchor existing per-attribute and multi-
    > > attribute stats off of seems like a big win to me. We'd get all pre-
    > > existing statistics types for free, statistics kinds provided by
    > > extensions for free, extended stats for free, import and export for
    > > free. Well, not free, we just have to remove the exclusion that views
    > > (or whatever relkind we create) can't have stats.
    > >
    >
    > TBH the grammar / catalog stuff seems like a relatively minor part of
    > this patch. To me, the difficult part seems to be the sampling / analyze
    > part, and then matching it to the query during planning. And all of this
    > seems exactly the same no matter how the stats are defined.
    >
    
    I agree that the catalog stuff is minor relative to the work matching
    queries to the join node trees associated with a set of join stats,
    wherever that resides.
    
    The sampling analyze part doesn't strike me as too hard. The worst case
    scenario is that we take the columns+join definition, swap the "anchor"
    table out replacing it with the EphemeralNamedRelation of the fetched row
    sample from the anchor table, and just run that, letting SPI figure out
    which indexes can be used.
    
    The question of _when_ to analyze is a bit trickier, as was pointed out by
    Chenpeng Yan, and I suspect that any "anchor" table with join stats on it
    will need to be analyzed once any one of the tables it joins to has hit the
    threshold.
    
    
    > OTOH maybe allowing CREATE STATISTICS on views would be independently
    > useful too, once we have the later parts. Not sure.
    >
    
    I think so as well, which is why I'd like to leave that option open.
    
    
    > > - we'd want a way to express stats for all the individual columns/
    > > expressions defined in the join stats object, i.e. "what is the MCV of
    > > B.name for rows joined by A on A.b_id"?
    > >
    >
    > Maybe I misunderstand, but don't we already do this for statistics on
    > expressions?
    >
    
    We can CREATE STATISTICS foo on (upper(name)) FROM table, if that's what
    you're asking.
    
    But that isn't what I'm saying. Given tables A, B, and C which can join on
    A.b_id = B.id and A.c_id = c.id, we already have pg_statistic stats for
    B.name, but that's across B in a vacuum. I'm assuming that there's also
    value in having the stats for B.name filtered and weighted by how often (if
    at all) the B row is joined to A, and those stats would be shaped exactly
    like the pg_statistic row for B.name. We could view that as correlating
    B.name against the primary key of A, but that seems odd to me.
    
    Additionally, we might want the ability to have correlative stats of B.name
    with c.purchase_date. With the views idea, that would just be:
    
          CREATE STATISTICS foo ON (b_name, c_purchase_date) FROM
    my_statistics_view.
    
    So we'd have regular stats to draw upon from my_statistics view, and those
    would have our cardinality estimates given the precondition of the row
    surviving the join criteria, and we'd further have the correlative
    statistics of two columns each from tabled joined to A. And that's
    something not even DB2 can do now given their 2-table join limitation.
    
    
    > I don't follow. What 5-factorial combinations? We only ever build a
    > single MCV for a given statistics object.
    >
    
    Sorry, I was thinking about pg_ndistinct and pg_dependencies.
    
    For MCV that would give us the most common tuples of (B.name,
    C.purchase_date, D.promotion_code, a.quantity, a.amount), but that only
    helps when we need all of those columns, not a subset of them.
    
    
    > I may be missing something, but I honestly the only benefit of views I
    > can think of is already having the join definition in a catalog.
    >
    
    That's the biggest benefit for sure, but the other benefit is we can have
    the per-column stats in pg_statistic as I detailed above, and further do
    extended statistics, as we both addressed above.
    
    
    > I don't follow. Why should statistics object be pg_class objects? In my
    > mind pg_class is meant for "relations" (as in, table-like things), and
    > statistics objects are not like that. I suppose this is related to your
    > earlier suggestion
    >
    >   Having something in pg_class to anchor existing per-attribute and
    >   multi-attribute stats off of seems like a big win to me.
    >
    
    It is, but it would also allow us to avoid having a declared view, and then
    altering that view to allow statistics.
    
    It would make it more complicated to express which elements of the join
    were worthy of correlative stats, so in that sense having a view with a
    name is conceptually simpler.
    
    
    > but it's not clear to me why would that be? All statistics are tied to a
    > relation (or multiple relations) in the end, and I don't see why would
    > the view make anything simpler. What are the wins?
    >
    
    The view would allow us to put a wider range of stats on all of the columns
    that could be found through that defined join, using the mechanisms we
    already have available to us.
    
    It would further allow us to declare that multiple subsets of those columns
    are interesting enough to warrant correlative (extended) statistics.
    
    
    > Could you elaborate? It's entirely possible I just don't see something
    > obvious, or maybe you explained this in Vancouver and I managed to
    > forget the details. Sorry about that.
    
    
    No worries, Vancouver was a whirlwind of ideas around this topic. My head
    is still swimming.
    
    Another way of thinking about this is that it would give us the stats that
    we could get from a materialized view, with the ability to define extended
    stats on top of that materialized view, but without the costs of
    maintaining a materialized view, and without the restriction that a query
    has to directly reference the materialized view.