Thread
-
Re: Top N queries and disbursion
Tom Lane <tgl@sss.pgh.pa.us> — 1999-10-07T23:16:58Z
Roberto Cornacchia <rcorna@tin.it> writes: > ... since we are working on snapshots of the 6.6 > release (now we are using snapshot dated 9/13/99) we are afraid of > instability problems to affect the results. Could you give us any > suggestion about this? We are quite close to the degree day, so we have > to optimize time usage... If you don't want to spend time tracking development changes then you probably ought to stick with the snapshot you have. I don't see any reason that you should try to track changes right now... > We need to estimate the number of distinct values of an attribute. We > thought 1/disbursion was the right solution, but the results were quite > wrong: No, it's certainly not the right thing. To my understanding, disbursion is a measure of the frequency of the most common value of an attribute; but that tells you very little about how many other values there are. 1/disbursion is a lower bound on the number of values, but it wouldn't be a good estimate unless you had reason to think that the values were pretty evenly distributed. There could be a *lot* of very-infrequent values. > with 100 distinct values of an attribute uniformly distribuited in a > relation of 10000 tuples, disbursion was estimated as 0.002275, giving > us 440 distinct values. This is an illustration of the fact that Postgres' disbursion-estimator is pretty bad :-(. It usually underestimates the frequency of the most common value, unless the most common value is really frequent (probability > 0.2 or so). I've been trying to think of a more accurate way of figuring the statistic that wouldn't be unreasonably slow. Or, perhaps, we should forget all about disbursion and adopt some other statistic(s). regards, tom lane
-
Re: [HACKERS] Re: Top N queries and disbursion
Bruce Momjian <maillist@candle.pha.pa.us> — 1999-10-07T23:53:17Z
> No, it's certainly not the right thing. To my understanding, disbursion > is a measure of the frequency of the most common value of an attribute; > but that tells you very little about how many other values there are. > 1/disbursion is a lower bound on the number of values, but it wouldn't > be a good estimate unless you had reason to think that the values were > pretty evenly distributed. There could be a *lot* of very-infrequent > values. > > > with 100 distinct values of an attribute uniformly distribuited in a > > relation of 10000 tuples, disbursion was estimated as 0.002275, giving > > us 440 distinct values. > > This is an illustration of the fact that Postgres' disbursion-estimator > is pretty bad :-(. It usually underestimates the frequency of the most > common value, unless the most common value is really frequent > (probability > 0.2 or so). I've been trying to think of a more accurate > way of figuring the statistic that wouldn't be unreasonably slow. > Or, perhaps, we should forget all about disbursion and adopt some other > statistic(s). Yes, you have the crux of the issue. I wrote it because it was the best thing I could think of, but it is non-optimimal. Because all the optimal solutions seemed too slow to me, I couldn't think of a better one. Here is my narrative on it from vacuum.c: --------------------------------------------------------------------------- * We compute the column min, max, null and non-null counts. * Plus we attempt to find the count of the value that occurs most * frequently in each column * These figures are used to compute the selectivity of the column * * We use a three-bucked cache to get the most frequent item * The 'guess' buckets count hits. A cache miss causes guess1 * to get the most hit 'guess' item in the most recent cycle, and * the new item goes into guess2. Whenever the total count of hits * of a 'guess' entry is larger than 'best', 'guess' becomes 'best'. * * This method works perfectly for columns with unique values, and columns * with only two unique values, plus nulls. * * It becomes less perfect as the number of unique values increases and * their distribution in the table becomes more random. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
-
Re: [HACKERS] Re: Top N queries and disbursion
Roberto Cornacchia <rcorna@tin.it> — 1999-10-08T13:18:42Z
Bruce Momjian wrote: > > > No, it's certainly not the right thing. To my understanding, disbursion > > is a measure of the frequency of the most common value of an attribute; > > but that tells you very little about how many other values there are. > > 1/disbursion is a lower bound on the number of values, but it wouldn't > > be a good estimate unless you had reason to think that the values were > > pretty evenly distributed. There could be a *lot* of very-infrequent > > values. > > > > > with 100 distinct values of an attribute uniformly distribuited in a > > > relation of 10000 tuples, disbursion was estimated as 0.002275, giving > > > us 440 distinct values. > > > > This is an illustration of the fact that Postgres' disbursion-estimator > > is pretty bad :-(. It usually underestimates the frequency of the most > > common value, unless the most common value is really frequent > > (probability > 0.2 or so). I've been trying to think of a more accurate > > way of figuring the statistic that wouldn't be unreasonably slow. > > Or, perhaps, we should forget all about disbursion and adopt some other > > statistic(s). > > Yes, you have the crux of the issue. I wrote it because it was the best > thing I could think of, but it is non-optimimal. Because all the > optimal solutions seemed too slow to me, I couldn't think of a better > one. Thank you, Tom and Bruce. This is not a good news for us :-(. In any case, is 1/disbursion the best estimate we can have by now, even if not optimal? Roberto Cornacchia Andrea Ghidini
-
Re: [HACKERS] Re: Top N queries and disbursion
Bruce Momjian <maillist@candle.pha.pa.us> — 1999-10-08T16:19:31Z
> > Yes, you have the crux of the issue. I wrote it because it was the best > > thing I could think of, but it is non-optimimal. Because all the > > optimal solutions seemed too slow to me, I couldn't think of a better > > one. > > Thank you, Tom and Bruce. > This is not a good news for us :-(. In any case, is 1/disbursion the > best estimate we can have by now, even if not optimal? That is the best one maintained by the database. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026