Thread

Re: Top N queries and disbursion

Tom Lane <tgl@sss.pgh.pa.us> — 1999-10-07T23:16:58Z

Roberto Cornacchia <rcorna@tin.it> writes:
> ... since we are working on snapshots of the 6.6
> release (now we are using snapshot dated 9/13/99) we are afraid of
> instability problems to affect the results. Could you give us any
> suggestion about this? We are quite close to the degree day, so we have
> to optimize time usage... 

If you don't want to spend time tracking development changes then you
probably ought to stick with the snapshot you have.  I don't see any
reason that you should try to track changes right now...

> We need to estimate the number of distinct values of an attribute. We
> thought 1/disbursion was the right solution, but the results were quite
> wrong:

No, it's certainly not the right thing.  To my understanding, disbursion
is a measure of the frequency of the most common value of an attribute;
but that tells you very little about how many other values there are.
1/disbursion is a lower bound on the number of values, but it wouldn't
be a good estimate unless you had reason to think that the values were
pretty evenly distributed.  There could be a *lot* of very-infrequent
values.

> with 100 distinct values of an attribute uniformly distribuited in a
> relation of 10000 tuples, disbursion was estimated as 0.002275, giving
> us 440 distinct values.

This is an illustration of the fact that Postgres' disbursion-estimator
is pretty bad :-(.  It usually underestimates the frequency of the most
common value, unless the most common value is really frequent
(probability > 0.2 or so).  I've been trying to think of a more accurate
way of figuring the statistic that wouldn't be unreasonably slow.
Or, perhaps, we should forget all about disbursion and adopt some other
statistic(s).

			regards, tom lane

Re: [HACKERS] Re: Top N queries and disbursion

Bruce Momjian <maillist@candle.pha.pa.us> — 1999-10-07T23:53:17Z

> No, it's certainly not the right thing.  To my understanding, disbursion
> is a measure of the frequency of the most common value of an attribute;
> but that tells you very little about how many other values there are.
> 1/disbursion is a lower bound on the number of values, but it wouldn't
> be a good estimate unless you had reason to think that the values were
> pretty evenly distributed.  There could be a *lot* of very-infrequent
> values.
> 
> > with 100 distinct values of an attribute uniformly distribuited in a
> > relation of 10000 tuples, disbursion was estimated as 0.002275, giving
> > us 440 distinct values.
> 
> This is an illustration of the fact that Postgres' disbursion-estimator
> is pretty bad :-(.  It usually underestimates the frequency of the most
> common value, unless the most common value is really frequent
> (probability > 0.2 or so).  I've been trying to think of a more accurate
> way of figuring the statistic that wouldn't be unreasonably slow.
> Or, perhaps, we should forget all about disbursion and adopt some other
> statistic(s).

Yes, you have the crux of the issue.  I wrote it because it was the best
thing I could think of, but it is non-optimimal.  Because all the
optimal solutions seemed too slow to me, I couldn't think of a better
one.

Here is my narrative on it from vacuum.c:

---------------------------------------------------------------------------

 *  We compute the column min, max, null and non-null counts.
 *  Plus we attempt to find the count of the value that occurs most
 *  frequently in each column
 *  These figures are used to compute the selectivity of the column
 *
 *  We use a three-bucked cache to get the most frequent item
 *  The 'guess' buckets count hits.  A cache miss causes guess1
 *  to get the most hit 'guess' item in the most recent cycle, and
 *  the new item goes into guess2.  Whenever the total count of hits
 *  of a 'guess' entry is larger than 'best', 'guess' becomes 'best'.
 *
 *  This method works perfectly for columns with unique values, and columns
 *  with only two unique values, plus nulls.
 *
 *  It becomes less perfect as the number of unique values increases and
 *  their distribution in the table becomes more random.


-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: [HACKERS] Re: Top N queries and disbursion

Roberto Cornacchia <rcorna@tin.it> — 1999-10-08T13:18:42Z

Bruce Momjian wrote:
> 
> > No, it's certainly not the right thing.  To my understanding, disbursion
> > is a measure of the frequency of the most common value of an attribute;
> > but that tells you very little about how many other values there are.
> > 1/disbursion is a lower bound on the number of values, but it wouldn't
> > be a good estimate unless you had reason to think that the values were
> > pretty evenly distributed.  There could be a *lot* of very-infrequent
> > values.
> >
> > > with 100 distinct values of an attribute uniformly distribuited in a
> > > relation of 10000 tuples, disbursion was estimated as 0.002275, giving
> > > us 440 distinct values.
> >
> > This is an illustration of the fact that Postgres' disbursion-estimator
> > is pretty bad :-(.  It usually underestimates the frequency of the most
> > common value, unless the most common value is really frequent
> > (probability > 0.2 or so).  I've been trying to think of a more accurate
> > way of figuring the statistic that wouldn't be unreasonably slow.
> > Or, perhaps, we should forget all about disbursion and adopt some other
> > statistic(s).
> 
> Yes, you have the crux of the issue.  I wrote it because it was the best
> thing I could think of, but it is non-optimimal.  Because all the
> optimal solutions seemed too slow to me, I couldn't think of a better
> one.

Thank you, Tom and Bruce.
This is not a good news for us :-(. In any case, is 1/disbursion the
best estimate we can have by now, even if not optimal?

Roberto Cornacchia
Andrea Ghidini

Re: [HACKERS] Re: Top N queries and disbursion

Bruce Momjian <maillist@candle.pha.pa.us> — 1999-10-08T16:19:31Z

> > Yes, you have the crux of the issue.  I wrote it because it was the best
> > thing I could think of, but it is non-optimimal.  Because all the
> > optimal solutions seemed too slow to me, I couldn't think of a better
> > one.
> 
> Thank you, Tom and Bruce.
> This is not a good news for us :-(. In any case, is 1/disbursion the
> best estimate we can have by now, even if not optimal?

That is the best one maintained by the database.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026