Thread

  1. Re: Top N queries and disbursion

    Tom Lane <tgl@sss.pgh.pa.us> — 1999-10-07T23:16:58Z

    Roberto Cornacchia <rcorna@tin.it> writes:
    > ... since we are working on snapshots of the 6.6
    > release (now we are using snapshot dated 9/13/99) we are afraid of
    > instability problems to affect the results. Could you give us any
    > suggestion about this? We are quite close to the degree day, so we have
    > to optimize time usage... 
    
    If you don't want to spend time tracking development changes then you
    probably ought to stick with the snapshot you have.  I don't see any
    reason that you should try to track changes right now...
    
    
    > We need to estimate the number of distinct values of an attribute. We
    > thought 1/disbursion was the right solution, but the results were quite
    > wrong:
    
    No, it's certainly not the right thing.  To my understanding, disbursion
    is a measure of the frequency of the most common value of an attribute;
    but that tells you very little about how many other values there are.
    1/disbursion is a lower bound on the number of values, but it wouldn't
    be a good estimate unless you had reason to think that the values were
    pretty evenly distributed.  There could be a *lot* of very-infrequent
    values.
    
    > with 100 distinct values of an attribute uniformly distribuited in a
    > relation of 10000 tuples, disbursion was estimated as 0.002275, giving
    > us 440 distinct values.
    
    This is an illustration of the fact that Postgres' disbursion-estimator
    is pretty bad :-(.  It usually underestimates the frequency of the most
    common value, unless the most common value is really frequent
    (probability > 0.2 or so).  I've been trying to think of a more accurate
    way of figuring the statistic that wouldn't be unreasonably slow.
    Or, perhaps, we should forget all about disbursion and adopt some other
    statistic(s).
    
    			regards, tom lane
    
    
  2. Re: [HACKERS] Re: Top N queries and disbursion

    Bruce Momjian <maillist@candle.pha.pa.us> — 1999-10-07T23:53:17Z

    > No, it's certainly not the right thing.  To my understanding, disbursion
    > is a measure of the frequency of the most common value of an attribute;
    > but that tells you very little about how many other values there are.
    > 1/disbursion is a lower bound on the number of values, but it wouldn't
    > be a good estimate unless you had reason to think that the values were
    > pretty evenly distributed.  There could be a *lot* of very-infrequent
    > values.
    > 
    > > with 100 distinct values of an attribute uniformly distribuited in a
    > > relation of 10000 tuples, disbursion was estimated as 0.002275, giving
    > > us 440 distinct values.
    > 
    > This is an illustration of the fact that Postgres' disbursion-estimator
    > is pretty bad :-(.  It usually underestimates the frequency of the most
    > common value, unless the most common value is really frequent
    > (probability > 0.2 or so).  I've been trying to think of a more accurate
    > way of figuring the statistic that wouldn't be unreasonably slow.
    > Or, perhaps, we should forget all about disbursion and adopt some other
    > statistic(s).
    
    Yes, you have the crux of the issue.  I wrote it because it was the best
    thing I could think of, but it is non-optimimal.  Because all the
    optimal solutions seemed too slow to me, I couldn't think of a better
    one.
    
    Here is my narrative on it from vacuum.c:
    
    ---------------------------------------------------------------------------
    
     *  We compute the column min, max, null and non-null counts.
     *  Plus we attempt to find the count of the value that occurs most
     *  frequently in each column
     *  These figures are used to compute the selectivity of the column
     *
     *  We use a three-bucked cache to get the most frequent item
     *  The 'guess' buckets count hits.  A cache miss causes guess1
     *  to get the most hit 'guess' item in the most recent cycle, and
     *  the new item goes into guess2.  Whenever the total count of hits
     *  of a 'guess' entry is larger than 'best', 'guess' becomes 'best'.
     *
     *  This method works perfectly for columns with unique values, and columns
     *  with only two unique values, plus nulls.
     *
     *  It becomes less perfect as the number of unique values increases and
     *  their distribution in the table becomes more random.
    
    
    -- 
      Bruce Momjian                        |  http://www.op.net/~candle
      maillist@candle.pha.pa.us            |  (610) 853-3000
      +  If your life is a hard drive,     |  830 Blythe Avenue
      +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
    
    
  3. Re: [HACKERS] Re: Top N queries and disbursion

    Roberto Cornacchia <rcorna@tin.it> — 1999-10-08T13:18:42Z

    Bruce Momjian wrote:
    > 
    > > No, it's certainly not the right thing.  To my understanding, disbursion
    > > is a measure of the frequency of the most common value of an attribute;
    > > but that tells you very little about how many other values there are.
    > > 1/disbursion is a lower bound on the number of values, but it wouldn't
    > > be a good estimate unless you had reason to think that the values were
    > > pretty evenly distributed.  There could be a *lot* of very-infrequent
    > > values.
    > >
    > > > with 100 distinct values of an attribute uniformly distribuited in a
    > > > relation of 10000 tuples, disbursion was estimated as 0.002275, giving
    > > > us 440 distinct values.
    > >
    > > This is an illustration of the fact that Postgres' disbursion-estimator
    > > is pretty bad :-(.  It usually underestimates the frequency of the most
    > > common value, unless the most common value is really frequent
    > > (probability > 0.2 or so).  I've been trying to think of a more accurate
    > > way of figuring the statistic that wouldn't be unreasonably slow.
    > > Or, perhaps, we should forget all about disbursion and adopt some other
    > > statistic(s).
    > 
    > Yes, you have the crux of the issue.  I wrote it because it was the best
    > thing I could think of, but it is non-optimimal.  Because all the
    > optimal solutions seemed too slow to me, I couldn't think of a better
    > one.
    
    Thank you, Tom and Bruce.
    This is not a good news for us :-(. In any case, is 1/disbursion the
    best estimate we can have by now, even if not optimal?
    
    Roberto Cornacchia
    Andrea Ghidini
    
    
    
  4. Re: [HACKERS] Re: Top N queries and disbursion

    Bruce Momjian <maillist@candle.pha.pa.us> — 1999-10-08T16:19:31Z

    > > Yes, you have the crux of the issue.  I wrote it because it was the best
    > > thing I could think of, but it is non-optimimal.  Because all the
    > > optimal solutions seemed too slow to me, I couldn't think of a better
    > > one.
    > 
    > Thank you, Tom and Bruce.
    > This is not a good news for us :-(. In any case, is 1/disbursion the
    > best estimate we can have by now, even if not optimal?
    
    That is the best one maintained by the database.
    
    -- 
      Bruce Momjian                        |  http://www.op.net/~candle
      maillist@candle.pha.pa.us            |  (610) 853-3000
      +  If your life is a hard drive,     |  830 Blythe Avenue
      +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026