Thread

  1. WIP: collect frequency statistics for arrays

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-02-23T15:00:09Z

    WIP patch of statistics collection for arrays is attached. It generally
    copies statistics collection for tsvector, but there are following
    differencies:
    1) Default comparison, hash and equality function for element data type is
    used (from corresponding default operator classes).
    2) Operators @> and && don't takes care about element occurence count in
    array, i.e. '{1}':int[] @> '{1,1}':int[] and so on. That's why statistics
    collection and selectivity estimation functions takes care about uniqueness
    counting of array element.
    3) array_typanalyze collects frequency of null element into separate value
    (like maximum and minimum frequencies in ts_typanalyze). Currently it is not
    used in selectivity estimation, but it can be useful in future.
    
    Also I've faced with following problems:
    1) Do selectivity estimation for ANY and ALL keywords seems not so easy as
    for operators because their selectivity is estimating inside planner. So
    it's required to modify planner to do selectivity estimation for these
    keywords. Probably I'm missing something.
    2) I didn't implement selectivity estimation for "column <@ const"
    and "column == const" cases. The problem of "column <@ const" case is that
    we need to estimate frequency of occurence of any element not in const. We
    can try to collect statistics of frequency of all elements which is not in
    most common elements based on assumption of their independent occurence. But
    I'm not sure that this statistic will be precise enough. "column == const"
    case have also another problem. @> and && operators don't takes care about
    element occurence count and order while == operator require exact match.
    That's why statistics for @> and && operators can be applied to == very
    approximately.
     3) I need to test selectivity estimation for arrays. But it's hard to
    understand which distributions is typical for arrays. For example, we know
    that data in tsvector is based on natural language data, so we can assume
    something about data distribution in tsvector. But we don't know almost
    nothing about data in arrays because it can hold any data (tsvector also can
    holds any data, but it using for non nutural language data is out of
    purpose).
    
    ------
    With best regards,
    Alexander Korotkov.
    
  2. Re: WIP: collect frequency statistics for arrays

    Robert Haas <robertmhaas@gmail.com> — 2011-02-25T00:08:01Z

    On Wed, Feb 23, 2011 at 10:00 AM, Alexander Korotkov
    <aekorotkov@gmail.com> wrote:
    > WIP patch of statistics collection for arrays is attached.
    
    Please add this patch to the currently open CommitFest at
    
    https://commitfest.postgresql.org/action/commitfest_view/open
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  3. Re: WIP: collect frequency statistics for arrays

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-05-23T17:54:14Z

    Second version of patch attached. Main changes:
    1) ANY and ALL keywords handling.
    2) Collecting statistics of array length. It is used in "column <@ const".
    3) Relatively accurate estimation of "column <@ const" selectivity. This
    estimation is most hard part of patch. I'm warrying that there could be too
    expensibe calculations for estimation. Also, likely current comments don't
    clearify anything...
    
    ------
    With best regards,
    Alexander Korotkov.
    
  4. Re: WIP: collect frequency statistics for arrays

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-05-23T20:27:00Z

    I forgot to commit before diff. Here is correct version.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  5. Re: WIP: collect frequency statistics for arrays

    Simon Riggs <simon@2ndquadrant.com> — 2011-06-10T17:03:18Z

    On Mon, May 23, 2011 at 6:54 PM, Alexander Korotkov
    <aekorotkov@gmail.com> wrote:
    
    > Second version of patch attached. Main changes:
    > 1) ANY and ALL keywords handling.
    > 2) Collecting statistics of array length. It is used in "column <@ const".
    > 3) Relatively accurate estimation of "column <@ const" selectivity. This
    > estimation is most hard part of patch. I'm warrying that there could be too
    > expensibe calculations for estimation. Also, likely current comments don't
    > clearify anything...
    
    Just had a brief look at this ahead of starting review.
    
    Initial comments are that the code is well structured and I doubt
    there will be problems at the code level. Looks like a good patch.
    
    At the moment I see no tests. If this code will be exercised by
    existing tests then you should put some notes with the patch to
    explain that, or at least provide some pointers as to how I might test
    this.
    
    Also, I'd like to see some more explanation. Either in comments, or
    just as a post to hackers. That saves me time, but we need to be clear
    about what this does and does not do, what it might do in the future
    etc.. 3+ years from now we need to be able to remember what the code
    was supposed to do. You will forget yourself in time, if you write
    enough patches. Based on this, I think you'll be writing quite a few
    more.
    
    And of course, a few lines for the docs also.
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  6. Re: WIP: collect frequency statistics for arrays

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-12T16:17:25Z

    On Fri, Jun 10, 2011 at 9:03 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
    
    > Initial comments are that the code is well structured and I doubt
    > there will be problems at the code level. Looks like a good patch.
    >
    I'm worrying about perfomance of "column <@ const" estimation. It takes
    O(m*(n+m)) of time, where m - const length and n - statistics target.
    Probably, it can be too slow is some some cases.
    
    
    > At the moment I see no tests. If this code will be exercised by
    > existing tests then you should put some notes with the patch to
    > explain that, or at least provide some pointers as to how I might test
    > this.
    >
    I didn't find in existing tests which check selectivity estimation accuracy.
    And I found difficult to create them because regression tests gives binary
    result while estimation accuracy is quantitative value. Existing regression
    tests covers case if typanalyze or selectivity estimation function falls
    down. I've added "ANALYZE array_op_test;" line into array test in order to
    these tests covers falldown case for this patch functions too.
    Seems that, selectivity estimation accuracy should be tested manually on
    various distributions. I've done very small amount of such tests.
    Unfortunately, few months pass before I got idea about "column <@ const"
    case. And now, I don't have sufficient time for it due to my GSoC project.
    It would be great if you can help me with this tests.
    
    
    > Also, I'd like to see some more explanation. Either in comments, or
    > just as a post to hackers. That saves me time, but we need to be clear
    > about what this does and does not do, what it might do in the future
    > etc.. 3+ years from now we need to be able to remember what the code
    > was supposed to do. You will forget yourself in time, if you write
    > enough patches. Based on this, I think you'll be writing quite a few
    > more.
    >
    I've added some more comments. I'm afraid that it should be completely
    rewritten before committing due to my english. If some particular points
    should be clarified more, please, specify them.
    
    
    > And of course, a few lines for the docs also.
    >
    I found that in statistics patch for tsvector only article about pg_stats
    view was corrected. I've corrected this article a little bit too.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  7. Re: WIP: collect frequency statistics for arrays

    Robert Haas <robertmhaas@gmail.com> — 2011-06-13T04:16:43Z

    On Sun, Jun 12, 2011 at 12:17 PM, Alexander Korotkov
    <aekorotkov@gmail.com> wrote:
    > I'm worrying about perfomance of "column <@ const" estimation. It takes
    > O(m*(n+m)) of time, where m - const length and n - statistics target.
    > Probably, it can be too slow is some some cases.
    
    Hmm, that doesn't sound terribly promising.  The best thing to do here
    is probably to construct a pessimal case and test it.  So, say, look
    for a column <@ a 100-element array with a statistics target of 400...
     once you know how bad it is, then we can make intelligent decisions
    about where to go with it.
    
    If the data type is hashable, you could consider building a hash table
    on the MCVs and then do a probe for each element in the array.  I
    think that's better than the other way around because there can't be
    more than 10k MCVs, whereas the input constant could be arbitrarily
    long.  I'm not entirely sure whether this case is important enough to
    be worth spending a lot of code on, but then again it might not be
    that much code.
    
    Another option is to bound the number of operations you're willing to
    perform to some reasonable limit, say, 10 * default_statistics_target.
     Work out ceil((10 * default_statistics_target) /
    number-of-elements-in-const) and consider at most that many MCVs.
    When this limit kicks in you'll get a less-accurate selectivity
    estimate, but that's a reasonable price to pay for not blowing out
    planning time.
    
    >> At the moment I see no tests. If this code will be exercised by
    >> existing tests then you should put some notes with the patch to
    >> explain that, or at least provide some pointers as to how I might test
    >> this.
    >
    > I didn't find in existing tests which check selectivity estimation accuracy.
    > And I found difficult to create them because regression tests gives binary
    > result while estimation accuracy is quantitative value. Existing regression
    > tests covers case if typanalyze or selectivity estimation function falls
    > down. I've added "ANALYZE array_op_test;" line into array test in order to
    > these tests covers falldown case for this patch functions too.
    > Seems that, selectivity estimation accuracy should be tested manually on
    > various distributions. I've done very small amount of such tests.
    > Unfortunately, few months pass before I got idea about "column <@ const"
    > case. And now, I don't have sufficient time for it due to my GSoC project.
    > It would be great if you can help me with this tests.
    
    AFAIK, we don't have regression tests for any other selectivity
    estimator; except perhaps to the extent that we verify they don't
    crash.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  8. Re: WIP: collect frequency statistics for arrays

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-13T19:10:36Z

    On Mon, Jun 13, 2011 at 8:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
    
    > If the data type is hashable, you could consider building a hash table
    > on the MCVs and then do a probe for each element in the array.  I
    > think that's better than the other way around because there can't be
    > more than 10k MCVs, whereas the input constant could be arbitrarily
    > long.  I'm not entirely sure whether this case is important enough to
    > be worth spending a lot of code on, but then again it might not be
    > that much code.
    >
    Unfortunately, most time consuming operation isn't related to elements
    comparison. It is caused by complex computations in calc_distr function.
    
    
    > Another option is to bound the number of operations you're willing to
    > perform to some reasonable limit, say, 10 * default_statistics_target.
    >  Work out ceil((10 * default_statistics_target) /
    > number-of-elements-in-const) and consider at most that many MCVs.
    > When this limit kicks in you'll get a less-accurate selectivity
    > estimate, but that's a reasonable price to pay for not blowing out
    > planning time.
    
     Good option. I'm going to add such condition to my patch.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  9. Re: WIP: collect frequency statistics for arrays

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-14T14:25:09Z

    Version of patch with few more comments and some fixes.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  10. Re: WIP: collect frequency statistics for arrays

    Bruce Momjian <bruce@momjian.us> — 2011-10-14T19:24:10Z

    Alexander Korotkov wrote:
    > Version of patch with few more comments and some fixes.
    
    Where are we on this?
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +
    
    
  11. Re: WIP: collect frequency statistics for arrays

    Robert Haas <robertmhaas@gmail.com> — 2011-10-14T21:59:52Z

    On Fri, Oct 14, 2011 at 3:24 PM, Bruce Momjian <bruce@momjian.us> wrote:
    > Alexander Korotkov wrote:
    >> Version of patch with few more comments and some fixes.
    
    The CommitFest app page is here.
    
    https://commitfest.postgresql.org/action/patch_view?id=539
    
    It was reviewed in July by Nate Boley, and never updated.
    
    Looking now, I see that Alexander wasn't Cc'd on the review, so it's
    possible he missed the message?
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  12. Re: WIP: collect frequency statistics for arrays

    Nathan Boley <npboley@gmail.com> — 2011-10-15T10:47:16Z

    > Looking now, I see that Alexander wasn't Cc'd on the review, so it's
    > possible he missed the message?
    >
    
    We've corresponded off list and have discussed my review at some length.
    
    Alex submitted an updated patch on Sep 22 to me personally ( although
    not to the list? Alex? ), with the promise of a new version with
    improved comments.
    
    Best,
    Nathan
    
    
  13. Re: WIP: collect frequency statistics for arrays

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-10-15T15:13:51Z

    Hi!
    
    Thanks for your attention to my patch!
    
    On Sat, Oct 15, 2011 at 2:47 PM, Nathan Boley <npboley@gmail.com> wrote:
    
    > > Looking now, I see that Alexander wasn't Cc'd on the review, so it's
    > > possible he missed the message?
    > >
    >
    > We've corresponded off list and have discussed my review at some length.
    >
    > Alex submitted an updated patch on Sep 22 to me personally ( although
    > not to the list? Alex? ), with the promise of a new version with
    > improved comments.
    >
    
    Oh, I didn't noticed that I've posted updated patch off-list. So, there is
    repost of that version of patch. List of changes is below:
    1) Distinct slot is used for length histogram.
    2) Standard statistics is collected for arrays.
    3) Most common values and most common elements are mapped to distinct
    columns of pg_stats view, because both of them are calculated for arrays.
    
    Now, it's hard for me to give to it as much time as I would like to. But I
    hope to present improved comments and testing until end of october.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  14. Re: WIP: collect frequency statistics for arrays

    Bruce Momjian <bruce@momjian.us> — 2011-11-29T02:49:03Z

    Alexander Korotkov wrote:
    > Version of patch with few more comments and some fixes.
    
    Where are we on the idea of better statistics for arrays?
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +
    
    
  15. Re: WIP: collect frequency statistics for arrays

    Nathan Boley <npboley@gmail.com> — 2011-11-29T03:02:51Z

    >> Version of patch with few more comments and some fixes.
    >
    > Where are we on the idea of better statistics for arrays?
    
    I need to finish reviewing the patch: I'll try to send in something
    tomorrow night. So far it looks good.
    
    Best,
    Nathan