Thread
-
Re: limit in subquery causes poor selectivity estimation
Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-02T16:45:19Z
Peter Eisentraut <peter_e@gmx.net> writes: > On lör, 2011-08-27 at 13:32 -0400, Tom Lane wrote: >> The larger problem is that if a subquery didn't get flattened, it's >> often because it's got LIMIT, or GROUP BY, or some similar clause that >> makes it highly suspect whether the statistics available for the table >> column are reasonable to use for the subquery outputs. It wouldn't be >> that hard to grab the stats for test2.sha1, but then how do you want >> to adjust them to reflect the LIMIT? > It turns out that this is a regression introduced in 8.4.8; Well, the fact that examine_variable punts on subqueries is certainly not a "regression introduced in 8.4.8"; it's always been like that. I think your observation that 8.4.8 is worse has to be blamed on commit 0ae8b300388c2a3eaf90e6e6f13d6be1f4d4ac2d, which introduced a fallback rule of assuming 0.5 selectivity for a semijoin if we didn't have non-default ndistinct estimates on both sides. Before that, 8.4.x would go ahead and apply its heuristic rule, essentially Min(nd2/nd1, 1), even when one or both ndistinct values were completely made-up. I'm not sure what we could do instead. Perhaps you could argue that we should just revert that commit on the grounds that it's doing more harm than good, but I don't really believe that --- I think reverting would just move the pain points around. It's pure luck that 8.4.8 is worse rather than better on the particular example you cite. On a longer-term basis, I'm looking into what we could do with extracting stats from subqueries, but that doesn't seem like material for a backpatch. I have a draft patch that I've been playing with (attached). The main thing I feel unsure about is whether it's reasonable to extract stats in this way from a subquery that has GROUP BY or DISTINCT. ISTM it's probably okay to ignore joining, sorting, or limiting in the subquery: those might affect the stats of the subquery output, but this is no worse than using the unmodified column statistics for any other join-level selectivity estimate (where we already ignore the effects of scan-level restriction clauses that will filter the column values). But GROUP BY or DISTINCT would entirely invalidate the column frequency statistics, which makes me think that ignoring the pg_statistic entry might be the thing to do. Comments? regards, tom lane