Thread

Is there value in having optimizer stats for joins/foreignkeys?

Corey Huinker <corey.huinker@gmail.com> — 2025-12-01T20:10:25Z

Threads like [1] and [2] have gotten me thinking that there may be some
value in storing statistics about joins.

For the sake of argument, assume a table t1 with a column t2id which
references the pk of table t2 that has columns t2.t2id, t2c1, t2c2, t2c3.
In such a situation I can envision the following statistics being collected:

* The % of values rows in t2 are referenced at least once in t1
* The attribute stats (i.e. pg_statistic stats) for t2c1, t2c2, t2c3, but
associated with t1 and weighted according to the frequency of that row
being referenced, which means that values of unreferenced rows are filtered
out entirely.
* That's about it for direct statistics, but I could see creating extended
statistics for correlations between a local column value and a remote
column, or expressions on the remote columns, etc.

The storage feels like it would be identical to pg_statistic but with a
"starefrelid" field that identifies the referencing table.

That much seems straightforward. A bigger problem is how we'd manage to
collect these statistics. We could (as Jeff Davis has suggested) keep our
tablesamples, but that wouldn't necessarily help in this case because the
rows referenced, and their relative weightings would change since the last
sampling. In a worst-case scenario, We would have to sample the joined-to
tables as well,and that's an additional burden on an already IO intensive
operation.

In theory, we could do some of this without any additional stats
collection. If the ndistinct of t1.t2id is, say, at least 75+% of the
ndistinct of t2.t2id, we could just peek at the attribute stats on t2 and
use them for estimates. However, that makes some assumptions that the stats
on t2 are approximately as fresh as the stats on t1, and I don't think that
will be the case most of the time.

CCing people who have wondered out loud about this topic within earshot of
me.

Thoughts?

[1]
https://www.postgresql.org/message-id/flat/6fdc4dc5-8881-4987-9858-a9b484953185%40joeconway.com#5a93cd7a730691843a7700c770397baf
[2]
https://www.postgresql.org/message-id/flat/tencent_3018762E7D4C9BC470C821C829C1BF2F650A%40qq.com

Re: Is there value in having optimizer stats for joins/foreignkeys?

Tomas Vondra <tomas@vondra.me> — 2025-12-01T21:01:58Z

On 12/1/25 21:10, Corey Huinker wrote:
> Threads like [1] and [2] have gotten me thinking that there may be some
> value in storing statistics about joins.
> 
> For the sake of argument, assume a table t1 with a column t2id which
> references the pk of table t2 that has columns t2.t2id, t2c1, t2c2,
> t2c3. In such a situation I can envision the following statistics being
> collected:
> 
> * The % of values rows in t2 are referenced at least once in t1
> * The attribute stats (i.e. pg_statistic stats) for t2c1, t2c2, t2c3,
> but associated with t1 and weighted according to the frequency of that
> row being referenced, which means that values of unreferenced rows are
> filtered out entirely.
> * That's about it for direct statistics, but I could see creating
> extended statistics for correlations between a local column value and a
> remote column, or expressions on the remote columns, etc.
> 

Do I understand correctly you propose to collect such stats for every
foreign key? I recall something like that was proposed in the past, and
the argument against was that for many joins it'd be a waste because the
estimates are good enough. And for OLTP systems that's probably true.

Of course, it also depends on how expensive this would be. Maybe it's
cheap enough? No idea.

But I always assumed we'd have a way to explicitly enable such stats for
certain joins only, and the extended stats were designed to make that
possible.

FWIW I'm not entirely sure what stats you propose to collect exactly. I
mean, what does

   ... associated with t1 and weighted according to the frequency of
   that row being referenced, which means that values of unreferenced
   rows are filtered out entirely.

mean? Are you suggesting to "do the join" and build the regular stats as
if that was a regular table? I think that'd work, and it's mostly how I
envisioned to handle joins in extended stats, restricted to joins of two
relations.

> The storage feels like it would be identical to pg_statistic but with a
> "starefrelid" field that identifies the referencing table.
> 
> That much seems straightforward. A bigger problem is how we'd manage to
> collect these statistics. We could (as Jeff Davis has suggested) keep
> our tablesamples, but that wouldn't necessarily help in this case
> because the rows referenced, and their relative weightings would change
> since the last sampling. In a worst-case scenario, We would have to
> sample the joined-to tables as well,and that's an additional burden on
> an already IO intensive operation.
> 

Combining independent per-table samples does not work, unless the
samples are huge. There's a nice paper [1] on how to do index-based join
sampling efficiently.

> In theory, we could do some of this without any additional stats
> collection. If the ndistinct of t1.t2id is, say, at least 75+% of the
> ndistinct of t2.t2id, we could just peek at the attribute stats on t2
> and use them for estimates. However, that makes some assumptions that
> the stats on t2 are approximately as fresh as the stats on t1, and I
> don't think that will be the case most of the time.
> 
> CCing people who have wondered out loud about this topic within earshot
> of me.
> 
> Thoughts?

I think adding joins to extended stats would not be all that hard
(famous last words, I know). For me the main challenge was figuring out
how to store the join definition in the catalog, I always procrastinated
and never gave that a serious try.

FWIW I think we might start by actually using per-table extended stats
on the joined tables. Just like we combine the scalar MCVs on joined
columns, we could combine multicolumn MVCs.

regards

[1] https://www.cidrdb.org/cidr2017/papers/p9-leis-cidr17.pdf

-- 
Tomas Vondra

Re: Is there value in having optimizer stats for joins/foreignkeys?

Tom Lane <tgl@sss.pgh.pa.us> — 2025-12-01T22:11:55Z

Tomas Vondra <tomas@vondra.me> writes:
> On 12/1/25 21:10, Corey Huinker wrote:
>> Threads like [1] and [2] have gotten me thinking that there may be some
>> value in storing statistics about joins.

> Do I understand correctly you propose to collect such stats for every
> foreign key? I recall something like that was proposed in the past, and
> the argument against was that for many joins it'd be a waste because the
> estimates are good enough. And for OLTP systems that's probably true.

Yeah, I think that automated choices about this are unlikely to work
well.  We chose the syntax for CREATE STATISTICS with an eye to
allowing users to declaratively tell us to collect stats about
specific joins, and I still think that's a more promising approach.
But nobody's yet worked out any details.

			regards, tom lane

Re: Is there value in having optimizer stats for joins/foreignkeys?

Corey Huinker <corey.huinker@gmail.com> — 2025-12-02T03:16:25Z

> Do I understand correctly you propose to collect such stats for every
> foreign key? I recall something like that was proposed in the past, and
> the argument against was that for many joins it'd be a waste because the
> estimates are good enough. And for OLTP systems that's probably true.
>

Not every foreign key, they'd be declared like CREATE STATISTICS, but would
be anchored to the constraint, not to the table.


> But I always assumed we'd have a way to explicitly enable such stats for
> certain joins only, and the extended stats were designed to make that
> possible.
>

That's the intention, but the stats stored don't quite "fit" in the buckets
that extended stats create. The attribute statistics seem much better
suited, as this isn't about combinations, there's only ever the one
combination, but rather about what can be known about the attributes in the
far table before doing the actual join.


> FWIW I'm not entirely sure what stats you propose to collect exactly. I
> mean, what does
>
>    ... associated with t1 and weighted according to the frequency of
>    that row being referenced, which means that values of unreferenced
>    rows are filtered out entirely.
>
> mean? Are you suggesting to "do the join" and build the regular stats as
> if that was a regular table? I think that'd work, and it's mostly how I
> envisioned to handle joins in extended stats, restricted to joins of two
> relations.
>

Right. We'd do the join from t1 to t2 as described earlier, and then we'd
judge the null_frac, mcv, etc for each column of t2 (as defined by the
scope of the stats declaration) according to the join. More commonly
referenced values would show up as more frequent, hence "weighted".

Just so I have an example to refer to later, say we have a table of colors:

CREATE TABLE color(id bigint primary key, color_name text unique,
color_family text null)

and there's hundreds of colors in the table that are color_family='red'
('fire engine red', 'candy apple red', 'popular muppet red', etc). Some
colors don't belong to any color_family.

And we have a table of toys:

CREATE TABLE toy(id bigint primary key, min_child_age integer, name text,
color_id bigint REFERENCES color)

And we declare a join stat on toy->color for the color_family attribute.
We'd sample rows from the toy table, left join those to color, and then
calculate the attribute stats of color_family as if it were a column in
toys. Some toys might not have a color_id, and some color_ids might not
belong to a color_family, so we'd want the null_frac to reflect those
combined conditions. For the values that do join, and the colors that do
belong to a family, we'd want to see regular MCV stats showing "red" as the
most common color_family.

But those stats aren't really a correlation or a dependency, they're just
plain old attribute stats.

I understand wanting to know the correlation between toys.min_child_age and
colors.color_family, so that makes perfect sense for extended statistics,
but color_family on its own just doesn't fit. Am I missing something?


> Combining independent per-table samples does not work, unless the
> samples are huge. There's a nice paper [1] on how to do index-based join
> sampling efficiently.
>

Thanks, now I've got some light reading for the flight home.


> I think adding joins to extended stats would not be all that hard
> (famous last words, I know). For me the main challenge was figuring out
> how to store the join definition in the catalog, I always procrastinated
> and never gave that a serious try.
>

I envisioned keying the stats off the foreign key constraint id, or adding
"starefrelid" (relation oid of the referencing table) to pg_statistic or a
table roughly the same shape as pg_statistic.


>
> FWIW I think we might start by actually using per-table extended stats
> on the joined tables. Just like we combine the scalar MCVs on joined
> columns, we could combine multicolumn MVCs.
>

That's the other half of this - if the stats existed, do we have an obvious
way to put them to use?

Re: Is there value in having optimizer stats for joins/foreignkeys?

Corey Huinker <corey.huinker@gmail.com> — 2025-12-02T03:18:24Z

>
>
> Yeah, I think that automated choices about this are unlikely to work
> well.  We chose the syntax for CREATE STATISTICS with an eye to
> allowing users to declaratively tell us to collect stats about
> specific joins, and I still think that's a more promising approach.
> But nobody's yet worked out any details.
>
>
Per other response, no, I didn't envision stats on all possible joins or
even all possible foreign keys, just the ones we declare as interesting,
and even then only for the attributes that we say are interesting on the
far side of the join.

Re: Is there value in having optimizer stats for joins/foreignkeys?

Alexandra Wang <alexandra.wang.oss@gmail.com> — 2025-12-03T19:01:13Z

Hi there,

Thanks for raising this topic! I am currently working on a POC patch that
adds extended statistics for joins. I am polishing the patch now and will
post it soon with performance numbers, since there are interests!

On Mon, Dec 1, 2025 at 7:16 PM Corey Huinker <corey.huinker@gmail.com>
wrote:

> On Mon, Dec 1, 2025 at 1:02 PM Tomas Vondra <tomas@vondra.me> wrote:
>
I think adding joins to extended stats would not be all that hard
>> (famous last words, I know). For me the main challenge was figuring out
>> how to store the join definition in the catalog, I always procrastinated
>> and never gave that a serious try.
>>
>
> I envisioned keying the stats off the foreign key constraint id, or adding
> "starefrelid" (relation oid of the referencing table) to pg_statistic or a
> table roughly the same shape as pg_statistic.
>

 On Mon, Dec 1, 2025 at 1:02 PM Tomas Vondra <tomas@vondra.me> wrote:
>
>>
>> FWIW I think we might start by actually using per-table extended stats
>> on the joined tables. Just like we combine the scalar MCVs on joined
>> columns, we could combine multicolumn MVCs.
>>
>
> That's the other half of this - if the stats existed, do we have an
> obvious way to put them to use?
>

I have indeed started by implementing MCV statistics for joins,
because I have not found a case for joins that would benefit only from
ndistinct or functional dependency stats that MCV stats wouldn't help.

In my POC patch, I've made the following catalog changes:
- Add *stxotherrel (oid) *and *stxjoinkeys (int2vector)* fields to
*pg_statistic_ext*
- Use the existing *stxkeys (int2vector)* to store the stats object
attributes of *stxotherrel*
- Create *pg_statistic_ext_otherrel_index* on *(stxrelid, stxotherrel)*
- Add stxdjoinmcv* (pg_join_mcv_list)* to *pg_statistic_ext_data*

To use them, we can let the planner detect patterns like this:

/*
 * JoinStatsMatch - Information about a detected join pattern
 * Used internally to track what was matched in a join+filter pattern
 */
typedef struct JoinStatsMatch
{
    Oid          target_rel;       /* table OID of the estimation target */
    AttrNumber targetrel_joinkey;    /* target_rel's join column */
    Oid          other_rel;       /* table OID of the filter source */
    AttrNumber otherrel_joinkey;        /* other_rel's join column */
    List      *filter_attnums;       /* list of AttrNumbers for filter
columns in other_rel */
    List      *filter_values;    /* list of Datum constant values
being filtered */
    Oid          collation;       /* collation for comparisons */

    /* Additional info to avoid duplicate work */
    List      *join_rinfos;      /* list of join clause RestrictInfos */
    RestrictInfo *filter_rinfo;       /* the filter clause RestrictInfo */
} JoinStatsMatch;

and add the detection logic in clauselist_selectivity_ext() and
get_foreign_key_join_selectivity().

Statistics collection indeed needs the most thinking. For the
purpose of a POC, I added MCV join stats collection as part of ANALYZE
of one table (stxrel in pg_statistic_ext). I can do this because MCV
join stats are somewhat asymmetric. It allows me to have a target
table (referencing table for foreign key join) to ANALYZE, and we can
use the already collected MCVs of the joinkey column on the target
table to query the rows in the other table. This greatly mitigates
performance impact compared to actually joining two tables. However,
if we are to support more complex joins or other types of join stats
such as ndistinct or functional dependency, I found it hard to define
who's the target table (referencing table) and who's the other table
(referenced table) outside of the foreign key join scenario. So I
think for those more complex cases eventually we may as well
perform the joins and collect the join stats separately. Alvaro
Herrera suggested offline that we could have a dedicated autovacuum
command option for collecting the join statistics.

I have experimented with two ways to define the join statistics:

1. Use CREATE STATISTICS:

CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ] [ ( mcv ) ] ON {
table_name1.column_name1 }, { table_name1.column_name2 } [, ...] FROM
table_name1 JOIN table_name2 ON table_name1.column_name3 =
table_name2.column_name4

Examples:
-- Create join MCV statistics on a single filter column (keyword)
CREATE STATISTICS movie_keyword_keyword_join_stats (mcv)
ON k.keyword
FROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

-- Create join MCV statistics on multiple filter columns (keyword +
phonetic_code):
CREATE STATISTICS movie_keyword_keyword_multicols_join_stats (mcv)
ON k.keyword, k.phonetic_code
FROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

2. Auto join stats creation for Foreign Key + Functional Dependency stats

Initially, I did not implement the CREATE TABLE STATISTICS command to
create the join stats. Instead, I’ve implemented logic in ANALYZE to
detect functional dependency stats on the referenced table through FKs
and create join statistics implicitly for those cases.

I've been using the Join Order Benchmark (JOB) [1] to measure
performance gain. I will post the POC patch and performance numbers in
a followup email.

[1] https://www.vldb.org/pvldb/vol9/p204-leis.pdf

Best,
Alex

-- 
Alexandra Wang
EDB: https://www.enterprisedb.com

Re: Is there value in having optimizer stats for joins/foreignkeys?

Alexandra Wang <alexandra.wang.oss@gmail.com> — 2026-05-06T12:19:05Z
```
Here's the rebased patch, it only needed a catalog version bump.

Best,
Alex

-- 
Alexandra Wang
EDB: https://www.enterprisedb.com
```

Re: Is there value in having optimizer stats for joins/foreignkeys?

jian he <jian.universality@gmail.com> — 2026-05-13T15:15:31Z

On Wed, May 6, 2026 at 8:19 PM Alexandra Wang
<alexandra.wang.oss@gmail.com> wrote:
>
> Here's the rebased patch, it only needed a catalog version bump.
>

No need to update src/include/catalog/catversion.h during dev cycle.

+ if (!IsA(lfirst(l), OpExpr))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("join statistics require a single equijoin condition per pair
of tables")));

In the error message, should we replace "equijoin" with "equality join"?
IMHO,  "equijoin" is kind of informal wording.

-- Unintended error case1
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL);
CREATE STATISTICS x (mcv) ON t1.val FROM t1 JOIN t1 as t1s ON (t1.id = t1s.id);
alter table t1 alter column id set data type int8;
ERROR:  could not open relation with OID 0

-- Unintended error case2
DROP TABLE IF EXISTS t1, t2;
CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL);
CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER NOT NULL);
CREATE STATISTICS xx (mcv) ON t1.val FROM t2 JOIN t1 ON (t2.t1_id = t1.id);
alter table t1 alter column id set data type int8;
ERROR:  missing FROM-clause entry for table "t1"
LINE 1: alter table t1 alter column id set data type int8;
                                             ^

-- Unintended error case3
DROP TABLE IF EXISTS t1, t2;
CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL);
CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER generated always as (1));
CREATE STATISTICS xx (mcv) ON t1.val FROM t2 JOIN t1 ON (t2.t1_id = t1.id);
alter table t2 alter column t1_id set expression as (2);
ERROR:  missing FROM-clause entry for table "t1"
LINE 1: alter table t2 alter column t1_id set expression as (2);
                                             ^

-- Unintended error case4
DROP TABLE IF EXISTS t1, t2;
CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL);
CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER generated always as (1));
CREATE STATISTICS xx (mcv) ON t1.val FROM t2 JOIN t1 ON (t2 = t1);
ERROR:  cache lookup failed for attribute 0 of relation 18388

In src/test/regress/sql/stats_ext_crossrel.sql, we need to add test cases for
ALTER COLUMN SET DATA TYPE and ALTER COLUMN SET EXPRESSION on columns
involved in join statistics.
These ALTER TABLE operations will internally recreate the affected join stats.

src/test/regress/sql/stats_ext_crossrel.sql currently lacks tests for equality
joins where the join qual contains whole-row variable references.This is
important because whole-row variables raise the question of whether join
statistics should be recreated when any column in the relation is modified via
ALTER COLUMN SET DATA TYPE or ALTER COLUMN SET EXPRESSION.



--
jian
https://www.enterprisedb.com/

Re: Is there value in having optimizer stats for joins/foreignkeys?

Alexandra Wang <alexandra.wang.oss@gmail.com> — 2026-05-13T17:30:56Z

Hi Jian,

Thanks for reviewing! At the current stage, I'm more interested in
reviews around the overall design and the feasibility of the current
approach, as I mentioned earlier:

On Tue, Apr 28, 2026 at 10:59 PM Alexandra Wang <
alexandra.wang.oss@gmail.com> wrote:
> I'd appreciate feedback on:
> - The catalog design (stxjoinrels, stxkeyrefs, stxjoinconds)
> - The index-based sampling implementation
> - The planner integration (clausesel.c, costsize.c)
> - What should be in scope for v1 vs deferred

I will incorporate your review feedback in the next revision, but for
now, I'd like to focus the discussion on the above areas before
polishing the existing patch.

On Wed, May 13, 2026 at 8:16 AM jian he <jian.universality@gmail.com> wrote:
> On Wed, May 6, 2026 at 8:19 PM Alexandra Wang
> <alexandra.wang.oss@gmail.com> wrote:
> >
> > Here's the rebased patch, it only needed a catalog version bump.
> >
>
> No need to update src/include/catalog/catversion.h during dev cycle.

I needed to bump the catalog version because CI broke without that
change.

> + if (!IsA(lfirst(l), OpExpr))
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("join statistics require a single equijoin condition per pair
> of tables")));

> In the error message, should we replace "equijoin" with "equality join"?
> IMHO,  "equijoin" is kind of informal wording.

OK. I'll update the error message in the next revision.

> -- Unintended error case1
> DROP TABLE IF EXISTS t1;
> CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL);
> CREATE STATISTICS x (mcv) ON t1.val FROM t1 JOIN t1 as t1s ON (t1.id =
t1s.id);
> alter table t1 alter column id set data type int8;
> ERROR:  could not open relation with OID 0
>
> -- Unintended error case2
> DROP TABLE IF EXISTS t1, t2;
> CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL);
> CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER NOT NULL);
> CREATE STATISTICS xx (mcv) ON t1.val FROM t2 JOIN t1 ON (t2.t1_id = t1.id
);
> alter table t1 alter column id set data type int8;
> ERROR:  missing FROM-clause entry for table "t1"
> LINE 1: alter table t1 alter column id set data type int8;
>                                              ^
>
> -- Unintended error case3
> DROP TABLE IF EXISTS t1, t2;
> CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL);
> CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER generated always
as (1));
> CREATE STATISTICS xx (mcv) ON t1.val FROM t2 JOIN t1 ON (t2.t1_id = t1.id
);
> alter table t2 alter column t1_id set expression as (2);
> ERROR:  missing FROM-clause entry for table "t1"
> LINE 1: alter table t2 alter column t1_id set expression as (2);
>                                              ^
>
> -- Unintended error case4
> DROP TABLE IF EXISTS t1, t2;
> CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL);
> CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER generated always
as (1));
> CREATE STATISTICS xx (mcv) ON t1.val FROM t2 JOIN t1 ON (t2 = t1);
> ERROR:  cache lookup failed for attribute 0 of relation 18388
>
> In src/test/regress/sql/stats_ext_crossrel.sql, we need to add test cases
for
> ALTER COLUMN SET DATA TYPE and ALTER COLUMN SET EXPRESSION on columns
> involved in join statistics.
> These ALTER TABLE operations will internally recreate the affected join
stats.
>
> src/test/regress/sql/stats_ext_crossrel.sql currently lacks tests for
equality
> joins where the join qual contains whole-row variable references.This is
> important because whole-row variables raise the question of whether join
> statistics should be recreated when any column in the relation is
modified via
> ALTER COLUMN SET DATA TYPE or ALTER COLUMN SET EXPRESSION.

Good call! I'll make sure to include these edge cases in the next
revision.

Best,
Alex

--
Alexandra Wang
EDB: https://www.enterprisedb.com

Re: Is there value in having optimizer stats for joins/foreignkeys?

Alexandra Wang <alexandra.wang.oss@gmail.com> — 2026-05-14T23:13:26Z

Patch needed a rebase, so here's v5.

I've dropped the changes to catversion.h so hopefully I won't need to
rebase too often. Sorry about the noise!

While I was at it, I also addressed the issues Jian mentioned -- all
minor fixes in parse_utilcmd.c (a one-line detection fix for the ALTER
COLUMN error, an error guard for whole-row variables, and rewording
"equijoin" to "equality join"). Other than that, there are no major
changes in v5 compared to v2-v4. At this stage I'd really appreciate
feedback on the overall design approach.

Best,
Alex

-- 
Alexandra Wang
EDB: https://www.enterprisedb.com

Re: Is there value in having optimizer stats for joins/foreignkeys?

Alexandra Wang <alexandra.wang.oss@gmail.com> — 2026-05-15T05:08:56Z

Attached v6. It should fix the CI failures on NetBSD/Windows by
changing the ALTER TYPE invalidation test added in v5 from
text->varchar(100) to integer->numeric, avoiding encoding-dependent
width estimates.

Best,
Alex


-- 
Alexandra Wang
EDB: https://www.enterprisedb.com

Re: Is there value in having optimizer stats for joins/foreignkeys?

Alexandra Wang <alexandra.wang.oss@gmail.com> — 2026-05-15T15:30:31Z

Here's v7, another attempt to fix the unstable tests.

Best,
Alex

On Thu, May 14, 2026 at 10:08 PM Alexandra Wang <
alexandra.wang.oss@gmail.com> wrote:

> Attached v6. It should fix the CI failures on NetBSD/Windows by
> changing the ALTER TYPE invalidation test added in v5 from
> text->varchar(100) to integer->numeric, avoiding encoding-dependent
> width estimates.
>
> Best,
> Alex
>
>
> --
> Alexandra Wang
> EDB: https://www.enterprisedb.com
>


-- 
Alexandra Wang
EDB: https://www.enterprisedb.com

Re: Is there value in having optimizer stats for joins/foreignkeys?

Tom Lane <tgl@sss.pgh.pa.us> — 2026-05-21T20:25:13Z

Alexandra Wang <alexandra.wang.oss@gmail.com> writes:
> Here's v7, another attempt to fix the unstable tests.

Hi Alexandra,

I signed up for an in-person review of this at PGConf.dev, but
the schedule doesn't seem to be working in favor of making that
happen.  If you see this and happen to run into me in the
hallway, I'm happy to chat, but in any case here are my
rather-hasty review notes.

I think it's okay if v1 only handles 2-way joins, as long as the
catalog representation is prepared for more.  Restricting to
cases where we can do index-based sampling seems fine too.
Those things could be relaxed later if it seems worthwhile,
but we'd have a creditable feature even without.

I didn't read the sampling code in any detail.  I think you will
need to put more thought into what is user-friendly behavior
in case the required index doesn't exist or doesn't have the
right properties.  (I think the tests for that might not be
strong enough, either.)

I think you could simplify some code noticeably if you included the
anchor rel's OID as the first element of stxjoinrels[].  Yeah,
it'd be redundant with stxrelid, but so what?  It's not like 
pg_statistic_ext rows are narrow enough that anyone would notice
the extra 4 bytes.  I think this would simplify some of the
relationships within the data structures, too, eg all varnos in
the expressions could be considered to reference stxjoinrels[].

I don't love stxkeyrefs[].  I wonder if it's time to throw away
stxkeys[], represent all the target columns as regular expression
trees in stxexprs, and then special-case columns that are simple
Vars where appropriate at execution.

(In the same vein, I dislike the grammar's separation of plain
columns from expressions; I'd like to replace stats_params
with expr_list and sort it all out later.  But perhaps that's
material for a separate patch.)

We will need to put more thought into permissions: I don't think
requiring all the tables to have the same owner is workable.
(What happens if someone tries to ALTER OWNER later?)  However,
if they don't all have the same owner, there are potential security
problems, so the right restriction is not obvious.  This is not
necessary to solve now; there are bigger questions to worry about.
But we'll need an answer before it's committable.

It's not too soon to write some user-facing documentation.
CREATE STATISTICS man page obviously needs attention, but
also the discussion of extended stats in perform.sgml.
And catalogs.sgml.  I find that writing that sort of stuff
helps to clarify where one's design is weak.

			regards, tom lane

Re: Is there value in having optimizer stats for joins/foreignkeys?

Chengpeng Yan <chengpeng_yan@outlook.com> — 2026-05-25T12:15:47Z

Hi,

> On May 15, 2026, at 23:30, Alexandra Wang <alexandra.wang.oss@gmail.com> wrote:
> 
> Here's v7, another attempt to fix the unstable tests.

Thanks for working on this. I have a few design comments about the lifecycle
and the intended scope of the join statistic.

First, the index dependency contract seems worth clarifying. CREATE STATISTICS
records a normal dependency on the selected index, so DROP INDEX is blocked
unless CASCADE is used, although that index is not part of the statistics
definition. But ANALYZE appears to re-discover a suitable index when refreshing
the statistic. Is the index intended to be part of the statistic's persistent
contract, or only a creation-time proof that index-based sampling is possible?
If the latter, should DROP INDEX still be blocked when another equivalent index
exists?

Second, this seems related to the earlier concern that ANALYZE is per-table.
The statistic is owned by the anchor relation, but its contents depend on the
probed relation too. In the current patch, ANALYZE on the probed relation can
refresh its own statistics without refreshing the join statistic. If the
probed relation has changed substantially, that leaves a possible staleness
gap where the planner combines fresh base-table statistics with stale
cross-relation skew information.

Third, the contract for non-unique indexes on the probed side seems worth
clarifying. The comments define raw_sel as anchor-relative:
P(join AND covered_filters) / anchor_totalrows, roughly Jf / anchor_totalrows,
where Jf is the number of joined rows satisfying the covered filters. But the
implementation computes raw_sel from MCV frequencies. Since the MCV list is
built from sampled joined rows, a plain MCV frequency is naturally measured
inside that joined sample, roughly Jf / J where J is the sampled join-result
size. It would be useful to clarify when those two quantities are expected to
be equivalent, especially when a non-unique probed-side index allows one
anchor row to contribute multiple joined rows.

These two measures are close in FK-like cases where the joined sample size
tracks the anchor sample size. With a non-unique lookup, one anchor row may
appear many times in the joined sample. For example, if one key matches many
red rows and another key matches only one blue row, red may dominate the
joined sample because of match multiplicity. That frequency describes the
distribution within the joined result, but not how many matching joined rows
are produced per anchor row. It would be useful to state whether such
one-to-many joins are outside the current supported scope, or how the
MCV-derived raw_sel accounts for how many joined rows each anchor row
contributed before it is converted into planner join selectivity.

--
Best regards,
Chengpeng Yan

Re: Is there value in having optimizer stats for joins/foreignkeys?

Tomas Vondra <tomas@vondra.me> — 2026-05-25T14:34:51Z

On 5/21/26 22:25, Tom Lane wrote:
> Alexandra Wang <alexandra.wang.oss@gmail.com> writes:
>> Here's v7, another attempt to fix the unstable tests.
> 
> Hi Alexandra,
> 
> I signed up for an in-person review of this at PGConf.dev, but
> the schedule doesn't seem to be working in favor of making that
> happen.  If you see this and happen to run into me in the
> hallway, I'm happy to chat, but in any case here are my
> rather-hasty review notes.
> 
> I think it's okay if v1 only handles 2-way joins, as long as the
> catalog representation is prepared for more.  Restricting to
> cases where we can do index-based sampling seems fine too.
> Those things could be relaxed later if it seems worthwhile,
> but we'd have a creditable feature even without.
> 

+1

My assumption is allowing larger joins would be a somewhat mechanical.
I'm not aware of additional problems on top of 2-way joins.

I think we should aim to support larger joins. At the unconference there
were suggestions maybe it would be enough to support 2-way joins, but
it's not hard to construct cases where that's no sufficient. Consider a
fact table, joining to two dimensions. It's common for the dimensions to
be correlated in various ways, and 2-way joins can't handle these cases.

> I didn't read the sampling code in any detail.  I think you will
> need to put more thought into what is user-friendly behavior
> in case the required index doesn't exist or doesn't have the
> right properties.  (I think the tests for that might not be
> strong enough, either.)
> 

What would be the most "user-friendly behavior" in this case?

I think we can either (a) refuse defining/building the join statistics
in this case, or (b) fallback to sampling not requiring an index (but
then it'll be way more expensive).

I think (a) should be fine for now, i.e. we should require an index.
Most of the joins will be on FK constraints, or something like that (the
FK may not be defined, but there will be a PK on one side).

Not sure if we should simply refuse building the stats, or if it's
enough to detect this while building the statistics. I'd say enforcing
this during DDL (CREATE STATISTICS, DROP INDEX, ...) is better,
otherwise users may not notice the statistics stopped building.

> I think you could simplify some code noticeably if you included the
> anchor rel's OID as the first element of stxjoinrels[].  Yeah,
> it'd be redundant with stxrelid, but so what?  It's not like 
> pg_statistic_ext rows are narrow enough that anyone would notice
> the extra 4 bytes.  I think this would simplify some of the
> relationships within the data structures, too, eg all varnos in
> the expressions could be considered to reference stxjoinrels[].
> 
> I don't love stxkeyrefs[].  I wonder if it's time to throw away
> stxkeys[], represent all the target columns as regular expression
> trees in stxexprs, and then special-case columns that are simple
> Vars where appropriate at execution.
> 

+1, I don't see a reason to not store the anchor rel separately.

> (In the same vein, I dislike the grammar's separation of plain
> columns from expressions; I'd like to replace stats_params
> with expr_list and sort it all out later.  But perhaps that's
> material for a separate patch.)
> 

FWIW the extended stats copied this from pg_index, which also stores
keys and expressions separately. I suppose there was a reason for that,
most likely performance - is cheaper to compare attnums than
expressions, and plain keys are much more common.

Maybe that's no longer true, or maybe it's not as important for extended
stats (there's likely fewer of those, compared to indexes).

> We will need to put more thought into permissions: I don't think
> requiring all the tables to have the same owner is workable.
> (What happens if someone tries to ALTER OWNER later?)  However,
> if they don't all have the same owner, there are potential security
> problems, so the right restriction is not obvious.  This is not
> necessary to solve now; there are bigger questions to worry about.
> But we'll need an answer before it's committable.
> 

I have not thought about this at all, but what can we do if the tables
have different owners? I suppose we could require the stxowner to have
SELECT privilege on the joined relations (instead of owning them).

> It's not too soon to write some user-facing documentation.
> CREATE STATISTICS man page obviously needs attention, but
> also the discussion of extended stats in perform.sgml.
> And catalogs.sgml.  I find that writing that sort of stuff
> helps to clarify where one's design is weak.
> 

+1

regards

-- 
Tomas Vondra

Re: Is there value in having optimizer stats for joins/foreignkeys?

Tom Lane <tgl@sss.pgh.pa.us> — 2026-05-25T15:03:57Z

Tomas Vondra <tomas@vondra.me> writes:
> On 5/21/26 22:25, Tom Lane wrote:
>> I don't love stxkeyrefs[].  I wonder if it's time to throw away
>> stxkeys[], represent all the target columns as regular expression
>> trees in stxexprs, and then special-case columns that are simple
>> Vars where appropriate at execution.
>> (In the same vein, I dislike the grammar's separation of plain
>> columns from expressions; I'd like to replace stats_params
>> with expr_list and sort it all out later.  But perhaps that's
>> material for a separate patch.)

> FWIW the extended stats copied this from pg_index, which also stores
> keys and expressions separately. I suppose there was a reason for that,
> most likely performance - is cheaper to compare attnums than
> expressions, and plain keys are much more common.

I think I might be to blame for the separate storage of indexprs.
If so, the motivation was to avoid breakage of older code that only
knew about indkey[].  (Of course, such code would necessarily fail
on indexes with expressions, but we wanted to avoid breakage for the
common case of no-expressions.)  I don't think that consideration is
nearly as pressing for extended stats.  There's probably a lot less
client-side code that knows about extended stats at all, and what
there is seems more likely to rely on the server-side display
functions than to dig into the catalog details for itself.  Also,
if there is anything that's looking at pg_statistic_ext details,
it will need work anyway after this patch; there's no way around that.

>> We will need to put more thought into permissions: I don't think
>> requiring all the tables to have the same owner is workable.

> I have not thought about this at all, but what can we do if the tables
> have different owners? I suppose we could require the stxowner to have
> SELECT privilege on the joined relations (instead of owning them).

Yeah, the rough idea I had was to require ownership on the anchor
table and SELECT on the rest.  But it's not terribly clear what
to do if that SELECT privilege gets revoked.

I also wonder to what extent we have a problem with users of the
anchor table being able to infer something about the contents of the
other tables via plan choices, and whether it matters if they can.
They may well be able to make the same inferences anyway from query
results.  For that matter, if a user is able to issue a query for
which a set of extended join stats is relevant, it seems likely that
that must mean she has SELECT on the other tables anyway.  But
maybe I'm missing some case where that wouldn't be true.

			regards, tom lane

Re: Is there value in having optimizer stats for joins/foreignkeys?

Alexandra Wang <alexandra.wang.oss@gmail.com> — 2026-05-27T17:49:44Z

Hi Tom and Tomas,

Thank you so much for the feedback!

On Mon, May 25, 2026 at 8:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tomas Vondra <tomas@vondra.me> writes:
> > On 5/21/26 22:25, Tom Lane wrote:
> >> I don't love stxkeyrefs[].  I wonder if it's time to throw away
> >> stxkeys[], represent all the target columns as regular expression
> >> trees in stxexprs, and then special-case columns that are simple
> >> Vars where appropriate at execution.
> >> (In the same vein, I dislike the grammar's separation of plain
> >> columns from expressions; I'd like to replace stats_params
> >> with expr_list and sort it all out later.  But perhaps that's
> >> material for a separate patch.)
>
> > FWIW the extended stats copied this from pg_index, which also stores
> > keys and expressions separately. I suppose there was a reason for that,
> > most likely performance - is cheaper to compare attnums than
> > expressions, and plain keys are much more common.
>
> I think I might be to blame for the separate storage of indexprs.
> If so, the motivation was to avoid breakage of older code that only
> knew about indkey[].  (Of course, such code would necessarily fail
> on indexes with expressions, but we wanted to avoid breakage for the
> common case of no-expressions.)  I don't think that consideration is
> nearly as pressing for extended stats.  There's probably a lot less
> client-side code that knows about extended stats at all, and what
> there is seems more likely to rely on the server-side display
> functions than to dig into the catalog details for itself.  Also,
> if there is anything that's looking at pg_statistic_ext details,
> it will need work anyway after this patch; there's no way around that.

I'm working on removing stxkeys[] as a prerequisite commit before the main
join
stats patch, representing all target columns as Var nodes in stxexprs, as
you
both suggested.

One question about the pg_stats_ext view: currently it has two complementary
columns:

- attnames (name[]) — Names of the columns included in the statistics object
- exprs (text[]) — Expressions included in the statistics object

With stxkeys gone from the catalog, should the view:

(a) Stay as-is: keep attnames and exprs as separate columns with the same
semantics. Implemented via a helper function that extracts plain column
names
from the unified stxexprs.

or

(b) Mirror the catalog: remove attnames, make exprs show all entries (both
column names and expressions together in one text[] array).

Any preference?

-- 
Alexandra Wang
EDB: https://www.enterprisedb.com

Re: Is there value in having optimizer stats for joins/foreignkeys?

Tomas Vondra <tomas@vondra.me> — 2026-05-27T19:08:34Z


On 5/27/26 19:49, Alexandra Wang wrote:
> Hi Tom and Tomas,
> 
> Thank you so much for the feedback!
> 
> On Mon, May 25, 2026 at 8:04 AM Tom Lane <tgl@sss.pgh.pa.us
> <mailto:tgl@sss.pgh.pa.us>> wrote:
>> Tomas Vondra <tomas@vondra.me <mailto:tomas@vondra.me>> writes:
>> > On 5/21/26 22:25, Tom Lane wrote:
>> >> I don't love stxkeyrefs[].  I wonder if it's time to throw away
>> >> stxkeys[], represent all the target columns as regular expression
>> >> trees in stxexprs, and then special-case columns that are simple
>> >> Vars where appropriate at execution.
>> >> (In the same vein, I dislike the grammar's separation of plain
>> >> columns from expressions; I'd like to replace stats_params
>> >> with expr_list and sort it all out later.  But perhaps that's
>> >> material for a separate patch.)
>>
>> > FWIW the extended stats copied this from pg_index, which also stores
>> > keys and expressions separately. I suppose there was a reason for that,
>> > most likely performance - is cheaper to compare attnums than
>> > expressions, and plain keys are much more common.
>>
>> I think I might be to blame for the separate storage of indexprs.
>> If so, the motivation was to avoid breakage of older code that only
>> knew about indkey[].  (Of course, such code would necessarily fail
>> on indexes with expressions, but we wanted to avoid breakage for the
>> common case of no-expressions.)  I don't think that consideration is
>> nearly as pressing for extended stats.  There's probably a lot less
>> client-side code that knows about extended stats at all, and what
>> there is seems more likely to rely on the server-side display
>> functions than to dig into the catalog details for itself.  Also,
>> if there is anything that's looking at pg_statistic_ext details,
>> it will need work anyway after this patch; there's no way around that.
> 
> I'm working on removing stxkeys[] as a prerequisite commit before the
> main join
> stats patch, representing all target columns as Var nodes in stxexprs,
> as you
> both suggested.
> 
> One question about the pg_stats_ext view: currently it has two complementary
> columns:
> 
> - attnames (name[]) — Names of the columns included in the statistics object
> - exprs (text[]) — Expressions included in the statistics object
> 
> With stxkeys gone from the catalog, should the view:
> 
> (a) Stay as-is: keep attnames and exprs as separate columns with the same
> semantics. Implemented via a helper function that extracts plain column
> names
> from the unified stxexprs.
> 
> or
> 
> (b) Mirror the catalog: remove attnames, make exprs show all entries (both
> column names and expressions together in one text[] array).
> 
> Any preference?
> 

My 2c: AFAIR there's no fundamental reason to keep those two lists
separate, other than that expressions were "bolted on" later, after we
already had stats on plain attributes. In hindsight, it might have been
better to just unify the view back then, probably.

I personally would be OK with just unifying adjusting the view, and
showing a single list (with both attributes and expressions). IIUC the
plan is to just store a list of expressions anyway, with attributes
represented as Vars. So the view would have to do more work just to
produce the "old" output, with little benefit.

The one argument against this that I can think of is possibly breaking
tools that use this view. IIRC pg_dump is reading the view when
exporting/importing the statistics. That might need some adjustments
(and there's also pg_stats_ext_exprs), but maybe it's easier to keep the
view consistent.

Also, maybe this is one more argument against the "optimizer view" idea
Corey mentioned in Vancouver last week? Because surely we'll want to
include the join statistics in export/import, and for the view approach
we'd need to invent a fair amount of code to do that. Maybe?


regards

-- 
Tomas Vondra

Re: Is there value in having optimizer stats for joins/foreignkeys?

Tom Lane <tgl@sss.pgh.pa.us> — 2026-05-27T20:31:16Z

Tomas Vondra <tomas@vondra.me> writes:
> On 5/27/26 19:49, Alexandra Wang wrote:
>> One question about the pg_stats_ext view: currently it has two complementary
>> columns:
>> 
>> - attnames (name[]) — Names of the columns included in the statistics object
>> - exprs (text[]) — Expressions included in the statistics object
>> 
>> With stxkeys gone from the catalog, should the view:
>> (a) Stay as-is: keep attnames and exprs as separate columns with the same
>> semantics. Implemented via a helper function that extracts plain column
>> names from the unified stxexprs.
>> or
>> (b) Mirror the catalog: remove attnames, make exprs show all entries (both
>> column names and expressions together in one text[] array).

> My 2c: AFAIR there's no fundamental reason to keep those two lists
> separate, other than that expressions were "bolted on" later, after we
> already had stats on plain attributes. In hindsight, it might have been
> better to just unify the view back then, probably.

Yeah.  There are some other oddities that arise from that: expressions
get shoved to the end.  For example, if I put in

create statistics my_stats (mcv) on ten, (ten+four), four from tenk1;

pg_dump will regurgitate that as

CREATE STATISTICS public.my_stats (mcv) ON four, ten, (ten + four) FROM public.tenk1;

and I see that that column ordering is consistent with what appears in
pg_stats_ext and perhaps other places.  I'd expect a rewritten version
to stop doing that and preserve the user-written column order.
So there are going to be some potential minor incompatibilities for
anything that is looking too closely at this view, and it seems to
me that it might be better for such code to fail noisily rather than
perhaps silently mis-associate stats with columns.

(It might be a good idea to have some test cases that exercise this
kind of scenario in pg_upgrade, especially now that we are trying to
transfer extended stats in upgrades...)

			regards, tom lane

Re: Is there value in having optimizer stats for joins/foreignkeys?

Corey Huinker <corey.huinker@gmail.com> — 2026-05-27T22:02:20Z

>
> Also, maybe this is one more argument against the "optimizer view" idea
> Corey mentioned in Vancouver last week? Because surely we'll want to
> include the join statistics in export/import, and for the view approach
> we'd need to invent a fair amount of code to do that. Maybe?
>

We definitely want join statistics in the export/import.

To my mind, having stats on certain views would be extremely simple for
export/import, we'd simply have one more relkind that makes it into the
system views pg_stats and pg_stats_ext, and the statistics for that new
relation would plug into pg_statistic with no catalog change to
pg_statistic whatsoever. Additionally, allowing certain kinds of views (or
a relation relkind that's functionally equivalent to a view) to have
statistics makes it easy to define extended statistics on those views, with
no catalog change to pg_statistic_ext.

Having something in pg_class to anchor existing per-attribute and
multi-attribute stats off of seems like a big win to me. We'd get all
pre-existing statistics types for free, statistics kinds provided by
extensions for free, extended stats for free, import and export for free.
Well, not free, we just have to remove the exclusion that views (or
whatever relkind we create) can't have stats.

Having said all that, the statistics import code reads from
pg_statistic_ext but obviously never modifies it, it's all about
constructing the pg_statistic_ext_data row, and it need to only concern
itself with importing to the current version, so internal catalog changes
to pg_statistic_ext wouldn't be that big of a deal.

If, however, we want to stick with the notion that all non-relation,
not-attribute stats are extended stats, then we have to remodel a bit:

- as Tom and Tomas mentioned, we'd probably want to dispense with the
negative attnums and stxexprs, as those are already a bit of a pain to deal
with. That might be worth of a patch even without join statistics.

- we'd want a way to express stats for all the individual
columns/expressions defined in the join stats object, i.e. "what is the MCV
of B.name for rows joined by A on A.b_id"?

- we'd want some way of excluding the non-interesting combinations of
columns. If we had a stat on A.qty, A.price, B.cust_name, C.sale_date,
D.item_name, then we'd have 5-factorial MCV combinations in addition to the
5 single-column stats, and the (A.qty, A.price) combinations can already be
covered by a non-join extended stat.

It's those things that make optimizer views so attractive to me - we
already have a way to store individual column stats (pg_statistic) and shut
them off when not interesting (pg_attribute.attstattarget), and extended
statistics are a great way to describe which combinations of columns are
interesting to us.

Typing all this has me thinking that there may be a third way:

- statistics objects become pg_class objects, stxkeys and stxexprs  become
pg_attribute rows
- pg_statisic_ext keeps (stxrelid, stxstattarget, stxkind), adds stxid
(pointing to new pg_class object)  sort of like pg_index
- pg_statistic_ext_data remains as-is

and we'd set per-column stats collection like ALTER STATISTICS foo ALTER
COLUMN.

Re: Is there value in having optimizer stats for joins/foreignkeys?

Tomas Vondra <tomas@vondra.me> — 2026-05-28T01:29:15Z

On 5/28/26 00:02, Corey Huinker wrote:
>     Also, maybe this is one more argument against the "optimizer view" idea
>     Corey mentioned in Vancouver last week? Because surely we'll want to
>     include the join statistics in export/import, and for the view approach
>     we'd need to invent a fair amount of code to do that. Maybe?
> 
> 
> We definitely want join statistics in the export/import. 
> 
> To my mind, having stats on certain views would be extremely simple for
> export/import, we'd simply have one more relkind that makes it into the
> system views pg_stats and pg_stats_ext, and the statistics for that new
> relation would plug into pg_statistic with no catalog change to
> pg_statistic whatsoever. Additionally, allowing certain kinds of views
> (or a relation relkind that's functionally equivalent to a view) to have
> statistics makes it easy to define extended statistics on those views,
> with no catalog change to pg_statistic_ext.
> 

Ah, so you're proposing supporting CREATE STATISTICS on some views? I
guess that's one way to support stats on joins, without having to rework
the schema. I'm still not a huge fan of it, because it just uses views
as a workaround to store the join definition, nothing else. To me it
seems a weird to require creating a new relation just for this, and we
already envisioned CREATE STATISTICS would cover joins. My guess is that
may be why DB2 did it this way, as they probably didn't have anything
like extended stats at that point.

> Having something in pg_class to anchor existing per-attribute and multi-
> attribute stats off of seems like a big win to me. We'd get all pre-
> existing statistics types for free, statistics kinds provided by
> extensions for free, extended stats for free, import and export for
> free. Well, not free, we just have to remove the exclusion that views
> (or whatever relkind we create) can't have stats.
> 

TBH the grammar / catalog stuff seems like a relatively minor part of
this patch. To me, the difficult part seems to be the sampling / analyze
part, and then matching it to the query during planning. And all of this
seems exactly the same no matter how the stats are defined.

OTOH maybe allowing CREATE STATISTICS on views would be independently
useful too, once we have the later parts. Not sure.

> Having said all that, the statistics import code reads from
> pg_statistic_ext but obviously never modifies it, it's all about
> constructing the pg_statistic_ext_data row, and it need to only concern
> itself with importing to the current version, so internal catalog
> changes to pg_statistic_ext wouldn't be that big of a deal.
> 
> If, however, we want to stick with the notion that all non-relation,
> not-attribute stats are extended stats, then we have to remodel a bit:
> 
> - as Tom and Tomas mentioned, we'd probably want to dispense with the
> negative attnums and stxexprs, as those are already a bit of a pain to
> deal with. That might be worth of a patch even without join statistics.
> 
> - we'd want a way to express stats for all the individual columns/
> expressions defined in the join stats object, i.e. "what is the MCV of
> B.name for rows joined by A on A.b_id"?
> 

Maybe I misunderstand, but don't we already do this for statistics on
expressions?

> - we'd want some way of excluding the non-interesting combinations of
> columns. If we had a stat on A.qty, A.price, B.cust_name, C.sale_date,
> D.item_name, then we'd have 5-factorial MCV combinations in addition to
> the 5 single-column stats, and the (A.qty, A.price) combinations can
> already be covered by a non-join extended stat.
> 

I don't follow. What 5-factorial combinations? We only ever build a
single MCV for a given statistics object.

> It's those things that make optimizer views so attractive to me - we
> already have a way to store individual column stats (pg_statistic) and
> shut them off when not interesting (pg_attribute.attstattarget), and
> extended statistics are a great way to describe which combinations of
> columns are interesting to us.
> 

I may be missing something, but I honestly the only benefit of views I
can think of is already having the join definition in a catalog.

> Typing all this has me thinking that there may be a third way:
> 
> - statistics objects become pg_class objects, stxkeys and stxexprs 
> become pg_attribute rows
> - pg_statisic_ext keeps (stxrelid, stxstattarget, stxkind), adds stxid
> (pointing to new pg_class object)  sort of like pg_index
> - pg_statistic_ext_data remains as-is
> 
> and we'd set per-column stats collection like ALTER STATISTICS foo ALTER
> COLUMN.

I don't follow. Why should statistics object be pg_class objects? In my
mind pg_class is meant for "relations" (as in, table-like things), and
statistics objects are not like that. I suppose this is related to your
earlier suggestion

  Having something in pg_class to anchor existing per-attribute and
  multi-attribute stats off of seems like a big win to me.

but it's not clear to me why would that be? All statistics are tied to a
relation (or multiple relations) in the end, and I don't see why would
the view make anything simpler. What are the wins?

Could you elaborate? It's entirely possible I just don't see something
obvious, or maybe you explained this in Vancouver and I managed to
forget the details. Sorry about that.

regards

-- 
Tomas Vondra

Re: Is there value in having optimizer stats for joins/foreignkeys?

Corey Huinker <corey.huinker@gmail.com> — 2026-05-28T19:03:26Z

>
> > To my mind, having stats on certain views would be extremely simple for
> > export/import, we'd simply have one more relkind that makes it into the
> > system views pg_stats and pg_stats_ext, and the statistics for that new
> > relation would plug into pg_statistic with no catalog change to
> > pg_statistic whatsoever. Additionally, allowing certain kinds of views
> > (or a relation relkind that's functionally equivalent to a view) to have
> > statistics makes it easy to define extended statistics on those views,
> > with no catalog change to pg_statistic_ext.
> >
>
> Ah, so you're proposing supporting CREATE STATISTICS on some views? I
> guess that's one way to support stats on joins, without having to rework
> the schema. I'm still not a huge fan of it, because it just uses views
> as a workaround to store the join definition, nothing else. To me it
> seems a weird to require creating a new relation just for this, and we
> already envisioned CREATE STATISTICS would cover joins. My guess is that
> may be why DB2 did it this way, as they probably didn't have anything
> like extended stats at that point.
>

They did have extended stats.

Here are some example RUNSTATS calls from
https://www.ibm.com/docs/en/db2/11.5.x?topic=commands-runstats

    RUNSTATS ON TABLE employee ON ALL COLUMNS AND COLUMNS ((JOB, WORKDEPT,
SEX))
        WITH DISTRIBUTION

In our lingo, this is roughly:

    CREATE STATISTICS foo ON (job, workdept, sex) FROM employee;
    ANALYZE employee;

And you can tweak individual columns

    RUNSTATS ON VIEW product_sales_view
     WITH DISTRIBUTION ON COLUMNS (category NUM_FREQVALUES 100
NUM_QUANTILES 100,
    type, product_key) DEFAULT NUM_FREQVALUES 50 NUM_QUANTILES 50

But the lesson I took away from this doc and corroborated by meeting with
someone who used to work on DB2 planner is that they have an equivalent of
extended stats.

>
> > Having something in pg_class to anchor existing per-attribute and multi-
> > attribute stats off of seems like a big win to me. We'd get all pre-
> > existing statistics types for free, statistics kinds provided by
> > extensions for free, extended stats for free, import and export for
> > free. Well, not free, we just have to remove the exclusion that views
> > (or whatever relkind we create) can't have stats.
> >
>
> TBH the grammar / catalog stuff seems like a relatively minor part of
> this patch. To me, the difficult part seems to be the sampling / analyze
> part, and then matching it to the query during planning. And all of this
> seems exactly the same no matter how the stats are defined.
>

I agree that the catalog stuff is minor relative to the work matching
queries to the join node trees associated with a set of join stats,
wherever that resides.

The sampling analyze part doesn't strike me as too hard. The worst case
scenario is that we take the columns+join definition, swap the "anchor"
table out replacing it with the EphemeralNamedRelation of the fetched row
sample from the anchor table, and just run that, letting SPI figure out
which indexes can be used.

The question of _when_ to analyze is a bit trickier, as was pointed out by
Chenpeng Yan, and I suspect that any "anchor" table with join stats on it
will need to be analyzed once any one of the tables it joins to has hit the
threshold.


> OTOH maybe allowing CREATE STATISTICS on views would be independently
> useful too, once we have the later parts. Not sure.
>

I think so as well, which is why I'd like to leave that option open.


> > - we'd want a way to express stats for all the individual columns/
> > expressions defined in the join stats object, i.e. "what is the MCV of
> > B.name for rows joined by A on A.b_id"?
> >
>
> Maybe I misunderstand, but don't we already do this for statistics on
> expressions?
>

We can CREATE STATISTICS foo on (upper(name)) FROM table, if that's what
you're asking.

But that isn't what I'm saying. Given tables A, B, and C which can join on
A.b_id = B.id and A.c_id = c.id, we already have pg_statistic stats for
B.name, but that's across B in a vacuum. I'm assuming that there's also
value in having the stats for B.name filtered and weighted by how often (if
at all) the B row is joined to A, and those stats would be shaped exactly
like the pg_statistic row for B.name. We could view that as correlating
B.name against the primary key of A, but that seems odd to me.

Additionally, we might want the ability to have correlative stats of B.name
with c.purchase_date. With the views idea, that would just be:

      CREATE STATISTICS foo ON (b_name, c_purchase_date) FROM
my_statistics_view.

So we'd have regular stats to draw upon from my_statistics view, and those
would have our cardinality estimates given the precondition of the row
surviving the join criteria, and we'd further have the correlative
statistics of two columns each from tabled joined to A. And that's
something not even DB2 can do now given their 2-table join limitation.


> I don't follow. What 5-factorial combinations? We only ever build a
> single MCV for a given statistics object.
>

Sorry, I was thinking about pg_ndistinct and pg_dependencies.

For MCV that would give us the most common tuples of (B.name,
C.purchase_date, D.promotion_code, a.quantity, a.amount), but that only
helps when we need all of those columns, not a subset of them.


> I may be missing something, but I honestly the only benefit of views I
> can think of is already having the join definition in a catalog.
>

That's the biggest benefit for sure, but the other benefit is we can have
the per-column stats in pg_statistic as I detailed above, and further do
extended statistics, as we both addressed above.


> I don't follow. Why should statistics object be pg_class objects? In my
> mind pg_class is meant for "relations" (as in, table-like things), and
> statistics objects are not like that. I suppose this is related to your
> earlier suggestion
>
>   Having something in pg_class to anchor existing per-attribute and
>   multi-attribute stats off of seems like a big win to me.
>

It is, but it would also allow us to avoid having a declared view, and then
altering that view to allow statistics.

It would make it more complicated to express which elements of the join
were worthy of correlative stats, so in that sense having a view with a
name is conceptually simpler.


> but it's not clear to me why would that be? All statistics are tied to a
> relation (or multiple relations) in the end, and I don't see why would
> the view make anything simpler. What are the wins?
>

The view would allow us to put a wider range of stats on all of the columns
that could be found through that defined join, using the mechanisms we
already have available to us.

It would further allow us to declare that multiple subsets of those columns
are interesting enough to warrant correlative (extended) statistics.


> Could you elaborate? It's entirely possible I just don't see something
> obvious, or maybe you explained this in Vancouver and I managed to
> forget the details. Sorry about that.


No worries, Vancouver was a whirlwind of ideas around this topic. My head
is still swimming.

Another way of thinking about this is that it would give us the stats that
we could get from a materialized view, with the ability to define extended
stats on top of that materialized view, but without the costs of
maintaining a materialized view, and without the restriction that a query
has to directly reference the materialized view.