Thread
-
Is there value in having optimizer stats for joins/foreignkeys?
Corey Huinker <corey.huinker@gmail.com> — 2025-12-01T20:10:25Z
Threads like [1] and [2] have gotten me thinking that there may be some value in storing statistics about joins. For the sake of argument, assume a table t1 with a column t2id which references the pk of table t2 that has columns t2.t2id, t2c1, t2c2, t2c3. In such a situation I can envision the following statistics being collected: * The % of values rows in t2 are referenced at least once in t1 * The attribute stats (i.e. pg_statistic stats) for t2c1, t2c2, t2c3, but associated with t1 and weighted according to the frequency of that row being referenced, which means that values of unreferenced rows are filtered out entirely. * That's about it for direct statistics, but I could see creating extended statistics for correlations between a local column value and a remote column, or expressions on the remote columns, etc. The storage feels like it would be identical to pg_statistic but with a "starefrelid" field that identifies the referencing table. That much seems straightforward. A bigger problem is how we'd manage to collect these statistics. We could (as Jeff Davis has suggested) keep our tablesamples, but that wouldn't necessarily help in this case because the rows referenced, and their relative weightings would change since the last sampling. In a worst-case scenario, We would have to sample the joined-to tables as well,and that's an additional burden on an already IO intensive operation. In theory, we could do some of this without any additional stats collection. If the ndistinct of t1.t2id is, say, at least 75+% of the ndistinct of t2.t2id, we could just peek at the attribute stats on t2 and use them for estimates. However, that makes some assumptions that the stats on t2 are approximately as fresh as the stats on t1, and I don't think that will be the case most of the time. CCing people who have wondered out loud about this topic within earshot of me. Thoughts? [1] https://www.postgresql.org/message-id/flat/6fdc4dc5-8881-4987-9858-a9b484953185%40joeconway.com#5a93cd7a730691843a7700c770397baf [2] https://www.postgresql.org/message-id/flat/tencent_3018762E7D4C9BC470C821C829C1BF2F650A%40qq.com
-
Re: Is there value in having optimizer stats for joins/foreignkeys?
Tomas Vondra <tomas@vondra.me> — 2025-12-01T21:01:58Z
On 12/1/25 21:10, Corey Huinker wrote: > Threads like [1] and [2] have gotten me thinking that there may be some > value in storing statistics about joins. > > For the sake of argument, assume a table t1 with a column t2id which > references the pk of table t2 that has columns t2.t2id, t2c1, t2c2, > t2c3. In such a situation I can envision the following statistics being > collected: > > * The % of values rows in t2 are referenced at least once in t1 > * The attribute stats (i.e. pg_statistic stats) for t2c1, t2c2, t2c3, > but associated with t1 and weighted according to the frequency of that > row being referenced, which means that values of unreferenced rows are > filtered out entirely. > * That's about it for direct statistics, but I could see creating > extended statistics for correlations between a local column value and a > remote column, or expressions on the remote columns, etc. > Do I understand correctly you propose to collect such stats for every foreign key? I recall something like that was proposed in the past, and the argument against was that for many joins it'd be a waste because the estimates are good enough. And for OLTP systems that's probably true. Of course, it also depends on how expensive this would be. Maybe it's cheap enough? No idea. But I always assumed we'd have a way to explicitly enable such stats for certain joins only, and the extended stats were designed to make that possible. FWIW I'm not entirely sure what stats you propose to collect exactly. I mean, what does ... associated with t1 and weighted according to the frequency of that row being referenced, which means that values of unreferenced rows are filtered out entirely. mean? Are you suggesting to "do the join" and build the regular stats as if that was a regular table? I think that'd work, and it's mostly how I envisioned to handle joins in extended stats, restricted to joins of two relations. > The storage feels like it would be identical to pg_statistic but with a > "starefrelid" field that identifies the referencing table. > > That much seems straightforward. A bigger problem is how we'd manage to > collect these statistics. We could (as Jeff Davis has suggested) keep > our tablesamples, but that wouldn't necessarily help in this case > because the rows referenced, and their relative weightings would change > since the last sampling. In a worst-case scenario, We would have to > sample the joined-to tables as well,and that's an additional burden on > an already IO intensive operation. > Combining independent per-table samples does not work, unless the samples are huge. There's a nice paper [1] on how to do index-based join sampling efficiently. > In theory, we could do some of this without any additional stats > collection. If the ndistinct of t1.t2id is, say, at least 75+% of the > ndistinct of t2.t2id, we could just peek at the attribute stats on t2 > and use them for estimates. However, that makes some assumptions that > the stats on t2 are approximately as fresh as the stats on t1, and I > don't think that will be the case most of the time. > > CCing people who have wondered out loud about this topic within earshot > of me. > > Thoughts? I think adding joins to extended stats would not be all that hard (famous last words, I know). For me the main challenge was figuring out how to store the join definition in the catalog, I always procrastinated and never gave that a serious try. FWIW I think we might start by actually using per-table extended stats on the joined tables. Just like we combine the scalar MCVs on joined columns, we could combine multicolumn MVCs. regards [1] https://www.cidrdb.org/cidr2017/papers/p9-leis-cidr17.pdf -- Tomas Vondra
-
Re: Is there value in having optimizer stats for joins/foreignkeys?
Tom Lane <tgl@sss.pgh.pa.us> — 2025-12-01T22:11:55Z
Tomas Vondra <tomas@vondra.me> writes: > On 12/1/25 21:10, Corey Huinker wrote: >> Threads like [1] and [2] have gotten me thinking that there may be some >> value in storing statistics about joins. > Do I understand correctly you propose to collect such stats for every > foreign key? I recall something like that was proposed in the past, and > the argument against was that for many joins it'd be a waste because the > estimates are good enough. And for OLTP systems that's probably true. Yeah, I think that automated choices about this are unlikely to work well. We chose the syntax for CREATE STATISTICS with an eye to allowing users to declaratively tell us to collect stats about specific joins, and I still think that's a more promising approach. But nobody's yet worked out any details. regards, tom lane
-
Re: Is there value in having optimizer stats for joins/foreignkeys?
Corey Huinker <corey.huinker@gmail.com> — 2025-12-02T03:16:25Z
> Do I understand correctly you propose to collect such stats for every > foreign key? I recall something like that was proposed in the past, and > the argument against was that for many joins it'd be a waste because the > estimates are good enough. And for OLTP systems that's probably true. > Not every foreign key, they'd be declared like CREATE STATISTICS, but would be anchored to the constraint, not to the table. > But I always assumed we'd have a way to explicitly enable such stats for > certain joins only, and the extended stats were designed to make that > possible. > That's the intention, but the stats stored don't quite "fit" in the buckets that extended stats create. The attribute statistics seem much better suited, as this isn't about combinations, there's only ever the one combination, but rather about what can be known about the attributes in the far table before doing the actual join. > FWIW I'm not entirely sure what stats you propose to collect exactly. I > mean, what does > > ... associated with t1 and weighted according to the frequency of > that row being referenced, which means that values of unreferenced > rows are filtered out entirely. > > mean? Are you suggesting to "do the join" and build the regular stats as > if that was a regular table? I think that'd work, and it's mostly how I > envisioned to handle joins in extended stats, restricted to joins of two > relations. > Right. We'd do the join from t1 to t2 as described earlier, and then we'd judge the null_frac, mcv, etc for each column of t2 (as defined by the scope of the stats declaration) according to the join. More commonly referenced values would show up as more frequent, hence "weighted". Just so I have an example to refer to later, say we have a table of colors: CREATE TABLE color(id bigint primary key, color_name text unique, color_family text null) and there's hundreds of colors in the table that are color_family='red' ('fire engine red', 'candy apple red', 'popular muppet red', etc). Some colors don't belong to any color_family. And we have a table of toys: CREATE TABLE toy(id bigint primary key, min_child_age integer, name text, color_id bigint REFERENCES color) And we declare a join stat on toy->color for the color_family attribute. We'd sample rows from the toy table, left join those to color, and then calculate the attribute stats of color_family as if it were a column in toys. Some toys might not have a color_id, and some color_ids might not belong to a color_family, so we'd want the null_frac to reflect those combined conditions. For the values that do join, and the colors that do belong to a family, we'd want to see regular MCV stats showing "red" as the most common color_family. But those stats aren't really a correlation or a dependency, they're just plain old attribute stats. I understand wanting to know the correlation between toys.min_child_age and colors.color_family, so that makes perfect sense for extended statistics, but color_family on its own just doesn't fit. Am I missing something? > Combining independent per-table samples does not work, unless the > samples are huge. There's a nice paper [1] on how to do index-based join > sampling efficiently. > Thanks, now I've got some light reading for the flight home. > I think adding joins to extended stats would not be all that hard > (famous last words, I know). For me the main challenge was figuring out > how to store the join definition in the catalog, I always procrastinated > and never gave that a serious try. > I envisioned keying the stats off the foreign key constraint id, or adding "starefrelid" (relation oid of the referencing table) to pg_statistic or a table roughly the same shape as pg_statistic. > > FWIW I think we might start by actually using per-table extended stats > on the joined tables. Just like we combine the scalar MCVs on joined > columns, we could combine multicolumn MVCs. > That's the other half of this - if the stats existed, do we have an obvious way to put them to use? -
Re: Is there value in having optimizer stats for joins/foreignkeys?
Corey Huinker <corey.huinker@gmail.com> — 2025-12-02T03:18:24Z
> > > Yeah, I think that automated choices about this are unlikely to work > well. We chose the syntax for CREATE STATISTICS with an eye to > allowing users to declaratively tell us to collect stats about > specific joins, and I still think that's a more promising approach. > But nobody's yet worked out any details. > > Per other response, no, I didn't envision stats on all possible joins or even all possible foreign keys, just the ones we declare as interesting, and even then only for the attributes that we say are interesting on the far side of the join.
-
Re: Is there value in having optimizer stats for joins/foreignkeys?
Alexandra Wang <alexandra.wang.oss@gmail.com> — 2025-12-03T19:01:13Z
Hi there, Thanks for raising this topic! I am currently working on a POC patch that adds extended statistics for joins. I am polishing the patch now and will post it soon with performance numbers, since there are interests! On Mon, Dec 1, 2025 at 7:16 PM Corey Huinker <corey.huinker@gmail.com> wrote: > On Mon, Dec 1, 2025 at 1:02 PM Tomas Vondra <tomas@vondra.me> wrote: > I think adding joins to extended stats would not be all that hard >> (famous last words, I know). For me the main challenge was figuring out >> how to store the join definition in the catalog, I always procrastinated >> and never gave that a serious try. >> > > I envisioned keying the stats off the foreign key constraint id, or adding > "starefrelid" (relation oid of the referencing table) to pg_statistic or a > table roughly the same shape as pg_statistic. > On Mon, Dec 1, 2025 at 1:02 PM Tomas Vondra <tomas@vondra.me> wrote: > >> >> FWIW I think we might start by actually using per-table extended stats >> on the joined tables. Just like we combine the scalar MCVs on joined >> columns, we could combine multicolumn MVCs. >> > > That's the other half of this - if the stats existed, do we have an > obvious way to put them to use? > I have indeed started by implementing MCV statistics for joins, because I have not found a case for joins that would benefit only from ndistinct or functional dependency stats that MCV stats wouldn't help. In my POC patch, I've made the following catalog changes: - Add *stxotherrel (oid) *and *stxjoinkeys (int2vector)* fields to *pg_statistic_ext* - Use the existing *stxkeys (int2vector)* to store the stats object attributes of *stxotherrel* - Create *pg_statistic_ext_otherrel_index* on *(stxrelid, stxotherrel)* - Add stxdjoinmcv* (pg_join_mcv_list)* to *pg_statistic_ext_data* To use them, we can let the planner detect patterns like this: /* * JoinStatsMatch - Information about a detected join pattern * Used internally to track what was matched in a join+filter pattern */ typedef struct JoinStatsMatch { Oid target_rel; /* table OID of the estimation target */ AttrNumber targetrel_joinkey; /* target_rel's join column */ Oid other_rel; /* table OID of the filter source */ AttrNumber otherrel_joinkey; /* other_rel's join column */ List *filter_attnums; /* list of AttrNumbers for filter columns in other_rel */ List *filter_values; /* list of Datum constant values being filtered */ Oid collation; /* collation for comparisons */ /* Additional info to avoid duplicate work */ List *join_rinfos; /* list of join clause RestrictInfos */ RestrictInfo *filter_rinfo; /* the filter clause RestrictInfo */ } JoinStatsMatch; and add the detection logic in clauselist_selectivity_ext() and get_foreign_key_join_selectivity(). Statistics collection indeed needs the most thinking. For the purpose of a POC, I added MCV join stats collection as part of ANALYZE of one table (stxrel in pg_statistic_ext). I can do this because MCV join stats are somewhat asymmetric. It allows me to have a target table (referencing table for foreign key join) to ANALYZE, and we can use the already collected MCVs of the joinkey column on the target table to query the rows in the other table. This greatly mitigates performance impact compared to actually joining two tables. However, if we are to support more complex joins or other types of join stats such as ndistinct or functional dependency, I found it hard to define who's the target table (referencing table) and who's the other table (referenced table) outside of the foreign key join scenario. So I think for those more complex cases eventually we may as well perform the joins and collect the join stats separately. Alvaro Herrera suggested offline that we could have a dedicated autovacuum command option for collecting the join statistics. I have experimented with two ways to define the join statistics: 1. Use CREATE STATISTICS: CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ] [ ( mcv ) ] ON { table_name1.column_name1 }, { table_name1.column_name2 } [, ...] FROM table_name1 JOIN table_name2 ON table_name1.column_name3 = table_name2.column_name4 Examples: -- Create join MCV statistics on a single filter column (keyword) CREATE STATISTICS movie_keyword_keyword_join_stats (mcv) ON k.keyword FROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id); ANALYZE movie_keyword; -- Create join MCV statistics on multiple filter columns (keyword + phonetic_code): CREATE STATISTICS movie_keyword_keyword_multicols_join_stats (mcv) ON k.keyword, k.phonetic_code FROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id); ANALYZE movie_keyword; 2. Auto join stats creation for Foreign Key + Functional Dependency stats Initially, I did not implement the CREATE TABLE STATISTICS command to create the join stats. Instead, I’ve implemented logic in ANALYZE to detect functional dependency stats on the referenced table through FKs and create join statistics implicitly for those cases. I've been using the Join Order Benchmark (JOB) [1] to measure performance gain. I will post the POC patch and performance numbers in a followup email. [1] https://www.vldb.org/pvldb/vol9/p204-leis.pdf Best, Alex -- Alexandra Wang EDB: https://www.enterprisedb.com -
Re: Is there value in having optimizer stats for joins/foreignkeys?
Alexandra Wang <alexandra.wang.oss@gmail.com> — 2026-05-06T12:19:05Z
Here's the rebased patch, it only needed a catalog version bump. Best, Alex -- Alexandra Wang EDB: https://www.enterprisedb.com
-
Re: Is there value in having optimizer stats for joins/foreignkeys?
jian he <jian.universality@gmail.com> — 2026-05-13T15:15:31Z
On Wed, May 6, 2026 at 8:19 PM Alexandra Wang <alexandra.wang.oss@gmail.com> wrote: > > Here's the rebased patch, it only needed a catalog version bump. > No need to update src/include/catalog/catversion.h during dev cycle. + if (!IsA(lfirst(l), OpExpr)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("join statistics require a single equijoin condition per pair of tables"))); In the error message, should we replace "equijoin" with "equality join"? IMHO, "equijoin" is kind of informal wording. -- Unintended error case1 DROP TABLE IF EXISTS t1; CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL); CREATE STATISTICS x (mcv) ON t1.val FROM t1 JOIN t1 as t1s ON (t1.id = t1s.id); alter table t1 alter column id set data type int8; ERROR: could not open relation with OID 0 -- Unintended error case2 DROP TABLE IF EXISTS t1, t2; CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL); CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER NOT NULL); CREATE STATISTICS xx (mcv) ON t1.val FROM t2 JOIN t1 ON (t2.t1_id = t1.id); alter table t1 alter column id set data type int8; ERROR: missing FROM-clause entry for table "t1" LINE 1: alter table t1 alter column id set data type int8; ^ -- Unintended error case3 DROP TABLE IF EXISTS t1, t2; CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL); CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER generated always as (1)); CREATE STATISTICS xx (mcv) ON t1.val FROM t2 JOIN t1 ON (t2.t1_id = t1.id); alter table t2 alter column t1_id set expression as (2); ERROR: missing FROM-clause entry for table "t1" LINE 1: alter table t2 alter column t1_id set expression as (2); ^ -- Unintended error case4 DROP TABLE IF EXISTS t1, t2; CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL); CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER generated always as (1)); CREATE STATISTICS xx (mcv) ON t1.val FROM t2 JOIN t1 ON (t2 = t1); ERROR: cache lookup failed for attribute 0 of relation 18388 In src/test/regress/sql/stats_ext_crossrel.sql, we need to add test cases for ALTER COLUMN SET DATA TYPE and ALTER COLUMN SET EXPRESSION on columns involved in join statistics. These ALTER TABLE operations will internally recreate the affected join stats. src/test/regress/sql/stats_ext_crossrel.sql currently lacks tests for equality joins where the join qual contains whole-row variable references.This is important because whole-row variables raise the question of whether join statistics should be recreated when any column in the relation is modified via ALTER COLUMN SET DATA TYPE or ALTER COLUMN SET EXPRESSION. -- jian https://www.enterprisedb.com/ -
Re: Is there value in having optimizer stats for joins/foreignkeys?
Alexandra Wang <alexandra.wang.oss@gmail.com> — 2026-05-13T17:30:56Z
Hi Jian, Thanks for reviewing! At the current stage, I'm more interested in reviews around the overall design and the feasibility of the current approach, as I mentioned earlier: On Tue, Apr 28, 2026 at 10:59 PM Alexandra Wang < alexandra.wang.oss@gmail.com> wrote: > I'd appreciate feedback on: > - The catalog design (stxjoinrels, stxkeyrefs, stxjoinconds) > - The index-based sampling implementation > - The planner integration (clausesel.c, costsize.c) > - What should be in scope for v1 vs deferred I will incorporate your review feedback in the next revision, but for now, I'd like to focus the discussion on the above areas before polishing the existing patch. On Wed, May 13, 2026 at 8:16 AM jian he <jian.universality@gmail.com> wrote: > On Wed, May 6, 2026 at 8:19 PM Alexandra Wang > <alexandra.wang.oss@gmail.com> wrote: > > > > Here's the rebased patch, it only needed a catalog version bump. > > > > No need to update src/include/catalog/catversion.h during dev cycle. I needed to bump the catalog version because CI broke without that change. > + if (!IsA(lfirst(l), OpExpr)) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("join statistics require a single equijoin condition per pair > of tables"))); > In the error message, should we replace "equijoin" with "equality join"? > IMHO, "equijoin" is kind of informal wording. OK. I'll update the error message in the next revision. > -- Unintended error case1 > DROP TABLE IF EXISTS t1; > CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL); > CREATE STATISTICS x (mcv) ON t1.val FROM t1 JOIN t1 as t1s ON (t1.id = t1s.id); > alter table t1 alter column id set data type int8; > ERROR: could not open relation with OID 0 > > -- Unintended error case2 > DROP TABLE IF EXISTS t1, t2; > CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL); > CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER NOT NULL); > CREATE STATISTICS xx (mcv) ON t1.val FROM t2 JOIN t1 ON (t2.t1_id = t1.id ); > alter table t1 alter column id set data type int8; > ERROR: missing FROM-clause entry for table "t1" > LINE 1: alter table t1 alter column id set data type int8; > ^ > > -- Unintended error case3 > DROP TABLE IF EXISTS t1, t2; > CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL); > CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER generated always as (1)); > CREATE STATISTICS xx (mcv) ON t1.val FROM t2 JOIN t1 ON (t2.t1_id = t1.id ); > alter table t2 alter column t1_id set expression as (2); > ERROR: missing FROM-clause entry for table "t1" > LINE 1: alter table t2 alter column t1_id set expression as (2); > ^ > > -- Unintended error case4 > DROP TABLE IF EXISTS t1, t2; > CREATE TABLE t1 (id INTEGER PRIMARY KEY, val TEXT NOT NULL); > CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER generated always as (1)); > CREATE STATISTICS xx (mcv) ON t1.val FROM t2 JOIN t1 ON (t2 = t1); > ERROR: cache lookup failed for attribute 0 of relation 18388 > > In src/test/regress/sql/stats_ext_crossrel.sql, we need to add test cases for > ALTER COLUMN SET DATA TYPE and ALTER COLUMN SET EXPRESSION on columns > involved in join statistics. > These ALTER TABLE operations will internally recreate the affected join stats. > > src/test/regress/sql/stats_ext_crossrel.sql currently lacks tests for equality > joins where the join qual contains whole-row variable references.This is > important because whole-row variables raise the question of whether join > statistics should be recreated when any column in the relation is modified via > ALTER COLUMN SET DATA TYPE or ALTER COLUMN SET EXPRESSION. Good call! I'll make sure to include these edge cases in the next revision. Best, Alex -- Alexandra Wang EDB: https://www.enterprisedb.com -
Re: Is there value in having optimizer stats for joins/foreignkeys?
Alexandra Wang <alexandra.wang.oss@gmail.com> — 2026-05-14T23:13:26Z
Patch needed a rebase, so here's v5. I've dropped the changes to catversion.h so hopefully I won't need to rebase too often. Sorry about the noise! While I was at it, I also addressed the issues Jian mentioned -- all minor fixes in parse_utilcmd.c (a one-line detection fix for the ALTER COLUMN error, an error guard for whole-row variables, and rewording "equijoin" to "equality join"). Other than that, there are no major changes in v5 compared to v2-v4. At this stage I'd really appreciate feedback on the overall design approach. Best, Alex -- Alexandra Wang EDB: https://www.enterprisedb.com
-
Re: Is there value in having optimizer stats for joins/foreignkeys?
Alexandra Wang <alexandra.wang.oss@gmail.com> — 2026-05-15T05:08:56Z
Attached v6. It should fix the CI failures on NetBSD/Windows by changing the ALTER TYPE invalidation test added in v5 from text->varchar(100) to integer->numeric, avoiding encoding-dependent width estimates. Best, Alex -- Alexandra Wang EDB: https://www.enterprisedb.com
-
Re: Is there value in having optimizer stats for joins/foreignkeys?
Alexandra Wang <alexandra.wang.oss@gmail.com> — 2026-05-15T15:30:31Z
Here's v7, another attempt to fix the unstable tests. Best, Alex On Thu, May 14, 2026 at 10:08 PM Alexandra Wang < alexandra.wang.oss@gmail.com> wrote: > Attached v6. It should fix the CI failures on NetBSD/Windows by > changing the ALTER TYPE invalidation test added in v5 from > text->varchar(100) to integer->numeric, avoiding encoding-dependent > width estimates. > > Best, > Alex > > > -- > Alexandra Wang > EDB: https://www.enterprisedb.com > -- Alexandra Wang EDB: https://www.enterprisedb.com
-
Re: Is there value in having optimizer stats for joins/foreignkeys?
Tom Lane <tgl@sss.pgh.pa.us> — 2026-05-21T20:25:13Z
Alexandra Wang <alexandra.wang.oss@gmail.com> writes: > Here's v7, another attempt to fix the unstable tests. Hi Alexandra, I signed up for an in-person review of this at PGConf.dev, but the schedule doesn't seem to be working in favor of making that happen. If you see this and happen to run into me in the hallway, I'm happy to chat, but in any case here are my rather-hasty review notes. I think it's okay if v1 only handles 2-way joins, as long as the catalog representation is prepared for more. Restricting to cases where we can do index-based sampling seems fine too. Those things could be relaxed later if it seems worthwhile, but we'd have a creditable feature even without. I didn't read the sampling code in any detail. I think you will need to put more thought into what is user-friendly behavior in case the required index doesn't exist or doesn't have the right properties. (I think the tests for that might not be strong enough, either.) I think you could simplify some code noticeably if you included the anchor rel's OID as the first element of stxjoinrels[]. Yeah, it'd be redundant with stxrelid, but so what? It's not like pg_statistic_ext rows are narrow enough that anyone would notice the extra 4 bytes. I think this would simplify some of the relationships within the data structures, too, eg all varnos in the expressions could be considered to reference stxjoinrels[]. I don't love stxkeyrefs[]. I wonder if it's time to throw away stxkeys[], represent all the target columns as regular expression trees in stxexprs, and then special-case columns that are simple Vars where appropriate at execution. (In the same vein, I dislike the grammar's separation of plain columns from expressions; I'd like to replace stats_params with expr_list and sort it all out later. But perhaps that's material for a separate patch.) We will need to put more thought into permissions: I don't think requiring all the tables to have the same owner is workable. (What happens if someone tries to ALTER OWNER later?) However, if they don't all have the same owner, there are potential security problems, so the right restriction is not obvious. This is not necessary to solve now; there are bigger questions to worry about. But we'll need an answer before it's committable. It's not too soon to write some user-facing documentation. CREATE STATISTICS man page obviously needs attention, but also the discussion of extended stats in perform.sgml. And catalogs.sgml. I find that writing that sort of stuff helps to clarify where one's design is weak. regards, tom lane
-
Re: Is there value in having optimizer stats for joins/foreignkeys?
Chengpeng Yan <chengpeng_yan@outlook.com> — 2026-05-25T12:15:47Z
Hi, > On May 15, 2026, at 23:30, Alexandra Wang <alexandra.wang.oss@gmail.com> wrote: > > Here's v7, another attempt to fix the unstable tests. Thanks for working on this. I have a few design comments about the lifecycle and the intended scope of the join statistic. First, the index dependency contract seems worth clarifying. CREATE STATISTICS records a normal dependency on the selected index, so DROP INDEX is blocked unless CASCADE is used, although that index is not part of the statistics definition. But ANALYZE appears to re-discover a suitable index when refreshing the statistic. Is the index intended to be part of the statistic's persistent contract, or only a creation-time proof that index-based sampling is possible? If the latter, should DROP INDEX still be blocked when another equivalent index exists? Second, this seems related to the earlier concern that ANALYZE is per-table. The statistic is owned by the anchor relation, but its contents depend on the probed relation too. In the current patch, ANALYZE on the probed relation can refresh its own statistics without refreshing the join statistic. If the probed relation has changed substantially, that leaves a possible staleness gap where the planner combines fresh base-table statistics with stale cross-relation skew information. Third, the contract for non-unique indexes on the probed side seems worth clarifying. The comments define raw_sel as anchor-relative: P(join AND covered_filters) / anchor_totalrows, roughly Jf / anchor_totalrows, where Jf is the number of joined rows satisfying the covered filters. But the implementation computes raw_sel from MCV frequencies. Since the MCV list is built from sampled joined rows, a plain MCV frequency is naturally measured inside that joined sample, roughly Jf / J where J is the sampled join-result size. It would be useful to clarify when those two quantities are expected to be equivalent, especially when a non-unique probed-side index allows one anchor row to contribute multiple joined rows. These two measures are close in FK-like cases where the joined sample size tracks the anchor sample size. With a non-unique lookup, one anchor row may appear many times in the joined sample. For example, if one key matches many red rows and another key matches only one blue row, red may dominate the joined sample because of match multiplicity. That frequency describes the distribution within the joined result, but not how many matching joined rows are produced per anchor row. It would be useful to state whether such one-to-many joins are outside the current supported scope, or how the MCV-derived raw_sel accounts for how many joined rows each anchor row contributed before it is converted into planner join selectivity. -- Best regards, Chengpeng Yan
-
Re: Is there value in having optimizer stats for joins/foreignkeys?
Tomas Vondra <tomas@vondra.me> — 2026-05-25T14:34:51Z
On 5/21/26 22:25, Tom Lane wrote: > Alexandra Wang <alexandra.wang.oss@gmail.com> writes: >> Here's v7, another attempt to fix the unstable tests. > > Hi Alexandra, > > I signed up for an in-person review of this at PGConf.dev, but > the schedule doesn't seem to be working in favor of making that > happen. If you see this and happen to run into me in the > hallway, I'm happy to chat, but in any case here are my > rather-hasty review notes. > > I think it's okay if v1 only handles 2-way joins, as long as the > catalog representation is prepared for more. Restricting to > cases where we can do index-based sampling seems fine too. > Those things could be relaxed later if it seems worthwhile, > but we'd have a creditable feature even without. > +1 My assumption is allowing larger joins would be a somewhat mechanical. I'm not aware of additional problems on top of 2-way joins. I think we should aim to support larger joins. At the unconference there were suggestions maybe it would be enough to support 2-way joins, but it's not hard to construct cases where that's no sufficient. Consider a fact table, joining to two dimensions. It's common for the dimensions to be correlated in various ways, and 2-way joins can't handle these cases. > I didn't read the sampling code in any detail. I think you will > need to put more thought into what is user-friendly behavior > in case the required index doesn't exist or doesn't have the > right properties. (I think the tests for that might not be > strong enough, either.) > What would be the most "user-friendly behavior" in this case? I think we can either (a) refuse defining/building the join statistics in this case, or (b) fallback to sampling not requiring an index (but then it'll be way more expensive). I think (a) should be fine for now, i.e. we should require an index. Most of the joins will be on FK constraints, or something like that (the FK may not be defined, but there will be a PK on one side). Not sure if we should simply refuse building the stats, or if it's enough to detect this while building the statistics. I'd say enforcing this during DDL (CREATE STATISTICS, DROP INDEX, ...) is better, otherwise users may not notice the statistics stopped building. > I think you could simplify some code noticeably if you included the > anchor rel's OID as the first element of stxjoinrels[]. Yeah, > it'd be redundant with stxrelid, but so what? It's not like > pg_statistic_ext rows are narrow enough that anyone would notice > the extra 4 bytes. I think this would simplify some of the > relationships within the data structures, too, eg all varnos in > the expressions could be considered to reference stxjoinrels[]. > > I don't love stxkeyrefs[]. I wonder if it's time to throw away > stxkeys[], represent all the target columns as regular expression > trees in stxexprs, and then special-case columns that are simple > Vars where appropriate at execution. > +1, I don't see a reason to not store the anchor rel separately. > (In the same vein, I dislike the grammar's separation of plain > columns from expressions; I'd like to replace stats_params > with expr_list and sort it all out later. But perhaps that's > material for a separate patch.) > FWIW the extended stats copied this from pg_index, which also stores keys and expressions separately. I suppose there was a reason for that, most likely performance - is cheaper to compare attnums than expressions, and plain keys are much more common. Maybe that's no longer true, or maybe it's not as important for extended stats (there's likely fewer of those, compared to indexes). > We will need to put more thought into permissions: I don't think > requiring all the tables to have the same owner is workable. > (What happens if someone tries to ALTER OWNER later?) However, > if they don't all have the same owner, there are potential security > problems, so the right restriction is not obvious. This is not > necessary to solve now; there are bigger questions to worry about. > But we'll need an answer before it's committable. > I have not thought about this at all, but what can we do if the tables have different owners? I suppose we could require the stxowner to have SELECT privilege on the joined relations (instead of owning them). > It's not too soon to write some user-facing documentation. > CREATE STATISTICS man page obviously needs attention, but > also the discussion of extended stats in perform.sgml. > And catalogs.sgml. I find that writing that sort of stuff > helps to clarify where one's design is weak. > +1 regards -- Tomas Vondra
-
Re: Is there value in having optimizer stats for joins/foreignkeys?
Tom Lane <tgl@sss.pgh.pa.us> — 2026-05-25T15:03:57Z
Tomas Vondra <tomas@vondra.me> writes: > On 5/21/26 22:25, Tom Lane wrote: >> I don't love stxkeyrefs[]. I wonder if it's time to throw away >> stxkeys[], represent all the target columns as regular expression >> trees in stxexprs, and then special-case columns that are simple >> Vars where appropriate at execution. >> (In the same vein, I dislike the grammar's separation of plain >> columns from expressions; I'd like to replace stats_params >> with expr_list and sort it all out later. But perhaps that's >> material for a separate patch.) > FWIW the extended stats copied this from pg_index, which also stores > keys and expressions separately. I suppose there was a reason for that, > most likely performance - is cheaper to compare attnums than > expressions, and plain keys are much more common. I think I might be to blame for the separate storage of indexprs. If so, the motivation was to avoid breakage of older code that only knew about indkey[]. (Of course, such code would necessarily fail on indexes with expressions, but we wanted to avoid breakage for the common case of no-expressions.) I don't think that consideration is nearly as pressing for extended stats. There's probably a lot less client-side code that knows about extended stats at all, and what there is seems more likely to rely on the server-side display functions than to dig into the catalog details for itself. Also, if there is anything that's looking at pg_statistic_ext details, it will need work anyway after this patch; there's no way around that. >> We will need to put more thought into permissions: I don't think >> requiring all the tables to have the same owner is workable. > I have not thought about this at all, but what can we do if the tables > have different owners? I suppose we could require the stxowner to have > SELECT privilege on the joined relations (instead of owning them). Yeah, the rough idea I had was to require ownership on the anchor table and SELECT on the rest. But it's not terribly clear what to do if that SELECT privilege gets revoked. I also wonder to what extent we have a problem with users of the anchor table being able to infer something about the contents of the other tables via plan choices, and whether it matters if they can. They may well be able to make the same inferences anyway from query results. For that matter, if a user is able to issue a query for which a set of extended join stats is relevant, it seems likely that that must mean she has SELECT on the other tables anyway. But maybe I'm missing some case where that wouldn't be true. regards, tom lane
-
Re: Is there value in having optimizer stats for joins/foreignkeys?
Alexandra Wang <alexandra.wang.oss@gmail.com> — 2026-05-27T17:49:44Z
Hi Tom and Tomas, Thank you so much for the feedback! On Mon, May 25, 2026 at 8:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Tomas Vondra <tomas@vondra.me> writes: > > On 5/21/26 22:25, Tom Lane wrote: > >> I don't love stxkeyrefs[]. I wonder if it's time to throw away > >> stxkeys[], represent all the target columns as regular expression > >> trees in stxexprs, and then special-case columns that are simple > >> Vars where appropriate at execution. > >> (In the same vein, I dislike the grammar's separation of plain > >> columns from expressions; I'd like to replace stats_params > >> with expr_list and sort it all out later. But perhaps that's > >> material for a separate patch.) > > > FWIW the extended stats copied this from pg_index, which also stores > > keys and expressions separately. I suppose there was a reason for that, > > most likely performance - is cheaper to compare attnums than > > expressions, and plain keys are much more common. > > I think I might be to blame for the separate storage of indexprs. > If so, the motivation was to avoid breakage of older code that only > knew about indkey[]. (Of course, such code would necessarily fail > on indexes with expressions, but we wanted to avoid breakage for the > common case of no-expressions.) I don't think that consideration is > nearly as pressing for extended stats. There's probably a lot less > client-side code that knows about extended stats at all, and what > there is seems more likely to rely on the server-side display > functions than to dig into the catalog details for itself. Also, > if there is anything that's looking at pg_statistic_ext details, > it will need work anyway after this patch; there's no way around that. I'm working on removing stxkeys[] as a prerequisite commit before the main join stats patch, representing all target columns as Var nodes in stxexprs, as you both suggested. One question about the pg_stats_ext view: currently it has two complementary columns: - attnames (name[]) — Names of the columns included in the statistics object - exprs (text[]) — Expressions included in the statistics object With stxkeys gone from the catalog, should the view: (a) Stay as-is: keep attnames and exprs as separate columns with the same semantics. Implemented via a helper function that extracts plain column names from the unified stxexprs. or (b) Mirror the catalog: remove attnames, make exprs show all entries (both column names and expressions together in one text[] array). Any preference? -- Alexandra Wang EDB: https://www.enterprisedb.com
-
Re: Is there value in having optimizer stats for joins/foreignkeys?
Tomas Vondra <tomas@vondra.me> — 2026-05-27T19:08:34Z
On 5/27/26 19:49, Alexandra Wang wrote: > Hi Tom and Tomas, > > Thank you so much for the feedback! > > On Mon, May 25, 2026 at 8:04 AM Tom Lane <tgl@sss.pgh.pa.us > <mailto:tgl@sss.pgh.pa.us>> wrote: >> Tomas Vondra <tomas@vondra.me <mailto:tomas@vondra.me>> writes: >> > On 5/21/26 22:25, Tom Lane wrote: >> >> I don't love stxkeyrefs[]. I wonder if it's time to throw away >> >> stxkeys[], represent all the target columns as regular expression >> >> trees in stxexprs, and then special-case columns that are simple >> >> Vars where appropriate at execution. >> >> (In the same vein, I dislike the grammar's separation of plain >> >> columns from expressions; I'd like to replace stats_params >> >> with expr_list and sort it all out later. But perhaps that's >> >> material for a separate patch.) >> >> > FWIW the extended stats copied this from pg_index, which also stores >> > keys and expressions separately. I suppose there was a reason for that, >> > most likely performance - is cheaper to compare attnums than >> > expressions, and plain keys are much more common. >> >> I think I might be to blame for the separate storage of indexprs. >> If so, the motivation was to avoid breakage of older code that only >> knew about indkey[]. (Of course, such code would necessarily fail >> on indexes with expressions, but we wanted to avoid breakage for the >> common case of no-expressions.) I don't think that consideration is >> nearly as pressing for extended stats. There's probably a lot less >> client-side code that knows about extended stats at all, and what >> there is seems more likely to rely on the server-side display >> functions than to dig into the catalog details for itself. Also, >> if there is anything that's looking at pg_statistic_ext details, >> it will need work anyway after this patch; there's no way around that. > > I'm working on removing stxkeys[] as a prerequisite commit before the > main join > stats patch, representing all target columns as Var nodes in stxexprs, > as you > both suggested. > > One question about the pg_stats_ext view: currently it has two complementary > columns: > > - attnames (name[]) — Names of the columns included in the statistics object > - exprs (text[]) — Expressions included in the statistics object > > With stxkeys gone from the catalog, should the view: > > (a) Stay as-is: keep attnames and exprs as separate columns with the same > semantics. Implemented via a helper function that extracts plain column > names > from the unified stxexprs. > > or > > (b) Mirror the catalog: remove attnames, make exprs show all entries (both > column names and expressions together in one text[] array). > > Any preference? > My 2c: AFAIR there's no fundamental reason to keep those two lists separate, other than that expressions were "bolted on" later, after we already had stats on plain attributes. In hindsight, it might have been better to just unify the view back then, probably. I personally would be OK with just unifying adjusting the view, and showing a single list (with both attributes and expressions). IIUC the plan is to just store a list of expressions anyway, with attributes represented as Vars. So the view would have to do more work just to produce the "old" output, with little benefit. The one argument against this that I can think of is possibly breaking tools that use this view. IIRC pg_dump is reading the view when exporting/importing the statistics. That might need some adjustments (and there's also pg_stats_ext_exprs), but maybe it's easier to keep the view consistent. Also, maybe this is one more argument against the "optimizer view" idea Corey mentioned in Vancouver last week? Because surely we'll want to include the join statistics in export/import, and for the view approach we'd need to invent a fair amount of code to do that. Maybe? regards -- Tomas Vondra
-
Re: Is there value in having optimizer stats for joins/foreignkeys?
Tom Lane <tgl@sss.pgh.pa.us> — 2026-05-27T20:31:16Z
Tomas Vondra <tomas@vondra.me> writes: > On 5/27/26 19:49, Alexandra Wang wrote: >> One question about the pg_stats_ext view: currently it has two complementary >> columns: >> >> - attnames (name[]) — Names of the columns included in the statistics object >> - exprs (text[]) — Expressions included in the statistics object >> >> With stxkeys gone from the catalog, should the view: >> (a) Stay as-is: keep attnames and exprs as separate columns with the same >> semantics. Implemented via a helper function that extracts plain column >> names from the unified stxexprs. >> or >> (b) Mirror the catalog: remove attnames, make exprs show all entries (both >> column names and expressions together in one text[] array). > My 2c: AFAIR there's no fundamental reason to keep those two lists > separate, other than that expressions were "bolted on" later, after we > already had stats on plain attributes. In hindsight, it might have been > better to just unify the view back then, probably. Yeah. There are some other oddities that arise from that: expressions get shoved to the end. For example, if I put in create statistics my_stats (mcv) on ten, (ten+four), four from tenk1; pg_dump will regurgitate that as CREATE STATISTICS public.my_stats (mcv) ON four, ten, (ten + four) FROM public.tenk1; and I see that that column ordering is consistent with what appears in pg_stats_ext and perhaps other places. I'd expect a rewritten version to stop doing that and preserve the user-written column order. So there are going to be some potential minor incompatibilities for anything that is looking too closely at this view, and it seems to me that it might be better for such code to fail noisily rather than perhaps silently mis-associate stats with columns. (It might be a good idea to have some test cases that exercise this kind of scenario in pg_upgrade, especially now that we are trying to transfer extended stats in upgrades...) regards, tom lane
-
Re: Is there value in having optimizer stats for joins/foreignkeys?
Corey Huinker <corey.huinker@gmail.com> — 2026-05-27T22:02:20Z
> > Also, maybe this is one more argument against the "optimizer view" idea > Corey mentioned in Vancouver last week? Because surely we'll want to > include the join statistics in export/import, and for the view approach > we'd need to invent a fair amount of code to do that. Maybe? > We definitely want join statistics in the export/import. To my mind, having stats on certain views would be extremely simple for export/import, we'd simply have one more relkind that makes it into the system views pg_stats and pg_stats_ext, and the statistics for that new relation would plug into pg_statistic with no catalog change to pg_statistic whatsoever. Additionally, allowing certain kinds of views (or a relation relkind that's functionally equivalent to a view) to have statistics makes it easy to define extended statistics on those views, with no catalog change to pg_statistic_ext. Having something in pg_class to anchor existing per-attribute and multi-attribute stats off of seems like a big win to me. We'd get all pre-existing statistics types for free, statistics kinds provided by extensions for free, extended stats for free, import and export for free. Well, not free, we just have to remove the exclusion that views (or whatever relkind we create) can't have stats. Having said all that, the statistics import code reads from pg_statistic_ext but obviously never modifies it, it's all about constructing the pg_statistic_ext_data row, and it need to only concern itself with importing to the current version, so internal catalog changes to pg_statistic_ext wouldn't be that big of a deal. If, however, we want to stick with the notion that all non-relation, not-attribute stats are extended stats, then we have to remodel a bit: - as Tom and Tomas mentioned, we'd probably want to dispense with the negative attnums and stxexprs, as those are already a bit of a pain to deal with. That might be worth of a patch even without join statistics. - we'd want a way to express stats for all the individual columns/expressions defined in the join stats object, i.e. "what is the MCV of B.name for rows joined by A on A.b_id"? - we'd want some way of excluding the non-interesting combinations of columns. If we had a stat on A.qty, A.price, B.cust_name, C.sale_date, D.item_name, then we'd have 5-factorial MCV combinations in addition to the 5 single-column stats, and the (A.qty, A.price) combinations can already be covered by a non-join extended stat. It's those things that make optimizer views so attractive to me - we already have a way to store individual column stats (pg_statistic) and shut them off when not interesting (pg_attribute.attstattarget), and extended statistics are a great way to describe which combinations of columns are interesting to us. Typing all this has me thinking that there may be a third way: - statistics objects become pg_class objects, stxkeys and stxexprs become pg_attribute rows - pg_statisic_ext keeps (stxrelid, stxstattarget, stxkind), adds stxid (pointing to new pg_class object) sort of like pg_index - pg_statistic_ext_data remains as-is and we'd set per-column stats collection like ALTER STATISTICS foo ALTER COLUMN.
-
Re: Is there value in having optimizer stats for joins/foreignkeys?
Tomas Vondra <tomas@vondra.me> — 2026-05-28T01:29:15Z
On 5/28/26 00:02, Corey Huinker wrote: > Also, maybe this is one more argument against the "optimizer view" idea > Corey mentioned in Vancouver last week? Because surely we'll want to > include the join statistics in export/import, and for the view approach > we'd need to invent a fair amount of code to do that. Maybe? > > > We definitely want join statistics in the export/import. > > To my mind, having stats on certain views would be extremely simple for > export/import, we'd simply have one more relkind that makes it into the > system views pg_stats and pg_stats_ext, and the statistics for that new > relation would plug into pg_statistic with no catalog change to > pg_statistic whatsoever. Additionally, allowing certain kinds of views > (or a relation relkind that's functionally equivalent to a view) to have > statistics makes it easy to define extended statistics on those views, > with no catalog change to pg_statistic_ext. > Ah, so you're proposing supporting CREATE STATISTICS on some views? I guess that's one way to support stats on joins, without having to rework the schema. I'm still not a huge fan of it, because it just uses views as a workaround to store the join definition, nothing else. To me it seems a weird to require creating a new relation just for this, and we already envisioned CREATE STATISTICS would cover joins. My guess is that may be why DB2 did it this way, as they probably didn't have anything like extended stats at that point. > Having something in pg_class to anchor existing per-attribute and multi- > attribute stats off of seems like a big win to me. We'd get all pre- > existing statistics types for free, statistics kinds provided by > extensions for free, extended stats for free, import and export for > free. Well, not free, we just have to remove the exclusion that views > (or whatever relkind we create) can't have stats. > TBH the grammar / catalog stuff seems like a relatively minor part of this patch. To me, the difficult part seems to be the sampling / analyze part, and then matching it to the query during planning. And all of this seems exactly the same no matter how the stats are defined. OTOH maybe allowing CREATE STATISTICS on views would be independently useful too, once we have the later parts. Not sure. > Having said all that, the statistics import code reads from > pg_statistic_ext but obviously never modifies it, it's all about > constructing the pg_statistic_ext_data row, and it need to only concern > itself with importing to the current version, so internal catalog > changes to pg_statistic_ext wouldn't be that big of a deal. > > If, however, we want to stick with the notion that all non-relation, > not-attribute stats are extended stats, then we have to remodel a bit: > > - as Tom and Tomas mentioned, we'd probably want to dispense with the > negative attnums and stxexprs, as those are already a bit of a pain to > deal with. That might be worth of a patch even without join statistics. > > - we'd want a way to express stats for all the individual columns/ > expressions defined in the join stats object, i.e. "what is the MCV of > B.name for rows joined by A on A.b_id"? > Maybe I misunderstand, but don't we already do this for statistics on expressions? > - we'd want some way of excluding the non-interesting combinations of > columns. If we had a stat on A.qty, A.price, B.cust_name, C.sale_date, > D.item_name, then we'd have 5-factorial MCV combinations in addition to > the 5 single-column stats, and the (A.qty, A.price) combinations can > already be covered by a non-join extended stat. > I don't follow. What 5-factorial combinations? We only ever build a single MCV for a given statistics object. > It's those things that make optimizer views so attractive to me - we > already have a way to store individual column stats (pg_statistic) and > shut them off when not interesting (pg_attribute.attstattarget), and > extended statistics are a great way to describe which combinations of > columns are interesting to us. > I may be missing something, but I honestly the only benefit of views I can think of is already having the join definition in a catalog. > Typing all this has me thinking that there may be a third way: > > - statistics objects become pg_class objects, stxkeys and stxexprs > become pg_attribute rows > - pg_statisic_ext keeps (stxrelid, stxstattarget, stxkind), adds stxid > (pointing to new pg_class object) sort of like pg_index > - pg_statistic_ext_data remains as-is > > and we'd set per-column stats collection like ALTER STATISTICS foo ALTER > COLUMN. I don't follow. Why should statistics object be pg_class objects? In my mind pg_class is meant for "relations" (as in, table-like things), and statistics objects are not like that. I suppose this is related to your earlier suggestion Having something in pg_class to anchor existing per-attribute and multi-attribute stats off of seems like a big win to me. but it's not clear to me why would that be? All statistics are tied to a relation (or multiple relations) in the end, and I don't see why would the view make anything simpler. What are the wins? Could you elaborate? It's entirely possible I just don't see something obvious, or maybe you explained this in Vancouver and I managed to forget the details. Sorry about that. regards -- Tomas Vondra
-
Re: Is there value in having optimizer stats for joins/foreignkeys?
Corey Huinker <corey.huinker@gmail.com> — 2026-05-28T19:03:26Z
> > > To my mind, having stats on certain views would be extremely simple for > > export/import, we'd simply have one more relkind that makes it into the > > system views pg_stats and pg_stats_ext, and the statistics for that new > > relation would plug into pg_statistic with no catalog change to > > pg_statistic whatsoever. Additionally, allowing certain kinds of views > > (or a relation relkind that's functionally equivalent to a view) to have > > statistics makes it easy to define extended statistics on those views, > > with no catalog change to pg_statistic_ext. > > > > Ah, so you're proposing supporting CREATE STATISTICS on some views? I > guess that's one way to support stats on joins, without having to rework > the schema. I'm still not a huge fan of it, because it just uses views > as a workaround to store the join definition, nothing else. To me it > seems a weird to require creating a new relation just for this, and we > already envisioned CREATE STATISTICS would cover joins. My guess is that > may be why DB2 did it this way, as they probably didn't have anything > like extended stats at that point. > They did have extended stats. Here are some example RUNSTATS calls from https://www.ibm.com/docs/en/db2/11.5.x?topic=commands-runstats RUNSTATS ON TABLE employee ON ALL COLUMNS AND COLUMNS ((JOB, WORKDEPT, SEX)) WITH DISTRIBUTION In our lingo, this is roughly: CREATE STATISTICS foo ON (job, workdept, sex) FROM employee; ANALYZE employee; And you can tweak individual columns RUNSTATS ON VIEW product_sales_view WITH DISTRIBUTION ON COLUMNS (category NUM_FREQVALUES 100 NUM_QUANTILES 100, type, product_key) DEFAULT NUM_FREQVALUES 50 NUM_QUANTILES 50 But the lesson I took away from this doc and corroborated by meeting with someone who used to work on DB2 planner is that they have an equivalent of extended stats. > > > Having something in pg_class to anchor existing per-attribute and multi- > > attribute stats off of seems like a big win to me. We'd get all pre- > > existing statistics types for free, statistics kinds provided by > > extensions for free, extended stats for free, import and export for > > free. Well, not free, we just have to remove the exclusion that views > > (or whatever relkind we create) can't have stats. > > > > TBH the grammar / catalog stuff seems like a relatively minor part of > this patch. To me, the difficult part seems to be the sampling / analyze > part, and then matching it to the query during planning. And all of this > seems exactly the same no matter how the stats are defined. > I agree that the catalog stuff is minor relative to the work matching queries to the join node trees associated with a set of join stats, wherever that resides. The sampling analyze part doesn't strike me as too hard. The worst case scenario is that we take the columns+join definition, swap the "anchor" table out replacing it with the EphemeralNamedRelation of the fetched row sample from the anchor table, and just run that, letting SPI figure out which indexes can be used. The question of _when_ to analyze is a bit trickier, as was pointed out by Chenpeng Yan, and I suspect that any "anchor" table with join stats on it will need to be analyzed once any one of the tables it joins to has hit the threshold. > OTOH maybe allowing CREATE STATISTICS on views would be independently > useful too, once we have the later parts. Not sure. > I think so as well, which is why I'd like to leave that option open. > > - we'd want a way to express stats for all the individual columns/ > > expressions defined in the join stats object, i.e. "what is the MCV of > > B.name for rows joined by A on A.b_id"? > > > > Maybe I misunderstand, but don't we already do this for statistics on > expressions? > We can CREATE STATISTICS foo on (upper(name)) FROM table, if that's what you're asking. But that isn't what I'm saying. Given tables A, B, and C which can join on A.b_id = B.id and A.c_id = c.id, we already have pg_statistic stats for B.name, but that's across B in a vacuum. I'm assuming that there's also value in having the stats for B.name filtered and weighted by how often (if at all) the B row is joined to A, and those stats would be shaped exactly like the pg_statistic row for B.name. We could view that as correlating B.name against the primary key of A, but that seems odd to me. Additionally, we might want the ability to have correlative stats of B.name with c.purchase_date. With the views idea, that would just be: CREATE STATISTICS foo ON (b_name, c_purchase_date) FROM my_statistics_view. So we'd have regular stats to draw upon from my_statistics view, and those would have our cardinality estimates given the precondition of the row surviving the join criteria, and we'd further have the correlative statistics of two columns each from tabled joined to A. And that's something not even DB2 can do now given their 2-table join limitation. > I don't follow. What 5-factorial combinations? We only ever build a > single MCV for a given statistics object. > Sorry, I was thinking about pg_ndistinct and pg_dependencies. For MCV that would give us the most common tuples of (B.name, C.purchase_date, D.promotion_code, a.quantity, a.amount), but that only helps when we need all of those columns, not a subset of them. > I may be missing something, but I honestly the only benefit of views I > can think of is already having the join definition in a catalog. > That's the biggest benefit for sure, but the other benefit is we can have the per-column stats in pg_statistic as I detailed above, and further do extended statistics, as we both addressed above. > I don't follow. Why should statistics object be pg_class objects? In my > mind pg_class is meant for "relations" (as in, table-like things), and > statistics objects are not like that. I suppose this is related to your > earlier suggestion > > Having something in pg_class to anchor existing per-attribute and > multi-attribute stats off of seems like a big win to me. > It is, but it would also allow us to avoid having a declared view, and then altering that view to allow statistics. It would make it more complicated to express which elements of the join were worthy of correlative stats, so in that sense having a view with a name is conceptually simpler. > but it's not clear to me why would that be? All statistics are tied to a > relation (or multiple relations) in the end, and I don't see why would > the view make anything simpler. What are the wins? > The view would allow us to put a wider range of stats on all of the columns that could be found through that defined join, using the mechanisms we already have available to us. It would further allow us to declare that multiple subsets of those columns are interesting enough to warrant correlative (extended) statistics. > Could you elaborate? It's entirely possible I just don't see something > obvious, or maybe you explained this in Vancouver and I managed to > forget the details. Sorry about that. No worries, Vancouver was a whirlwind of ideas around this topic. My head is still swimming. Another way of thinking about this is that it would give us the stats that we could get from a materialized view, with the ability to define extended stats on top of that materialized view, but without the costs of maintaining a materialized view, and without the restriction that a query has to directly reference the materialized view.