Thread

WIP: collect frequency statistics for arrays

Alexander Korotkov <aekorotkov@gmail.com> — 2011-02-23T15:00:09Z

WIP patch of statistics collection for arrays is attached. It generally
copies statistics collection for tsvector, but there are following
differencies:
1) Default comparison, hash and equality function for element data type is
used (from corresponding default operator classes).
2) Operators @> and && don't takes care about element occurence count in
array, i.e. '{1}':int[] @> '{1,1}':int[] and so on. That's why statistics
collection and selectivity estimation functions takes care about uniqueness
counting of array element.
3) array_typanalyze collects frequency of null element into separate value
(like maximum and minimum frequencies in ts_typanalyze). Currently it is not
used in selectivity estimation, but it can be useful in future.

Also I've faced with following problems:
1) Do selectivity estimation for ANY and ALL keywords seems not so easy as
for operators because their selectivity is estimating inside planner. So
it's required to modify planner to do selectivity estimation for these
keywords. Probably I'm missing something.
2) I didn't implement selectivity estimation for "column <@ const"
and "column == const" cases. The problem of "column <@ const" case is that
we need to estimate frequency of occurence of any element not in const. We
can try to collect statistics of frequency of all elements which is not in
most common elements based on assumption of their independent occurence. But
I'm not sure that this statistic will be precise enough. "column == const"
case have also another problem. @> and && operators don't takes care about
element occurence count and order while == operator require exact match.
That's why statistics for @> and && operators can be applied to == very
approximately.
 3) I need to test selectivity estimation for arrays. But it's hard to
understand which distributions is typical for arrays. For example, we know
that data in tsvector is based on natural language data, so we can assume
something about data distribution in tsvector. But we don't know almost
nothing about data in arrays because it can hold any data (tsvector also can
holds any data, but it using for non nutural language data is out of
purpose).

------
With best regards,
Alexander Korotkov.

Re: WIP: collect frequency statistics for arrays

Robert Haas <robertmhaas@gmail.com> — 2011-02-25T00:08:01Z

On Wed, Feb 23, 2011 at 10:00 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> WIP patch of statistics collection for arrays is attached.

Please add this patch to the currently open CommitFest at

https://commitfest.postgresql.org/action/commitfest_view/open

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: WIP: collect frequency statistics for arrays

Alexander Korotkov <aekorotkov@gmail.com> — 2011-05-23T17:54:14Z

Second version of patch attached. Main changes:
1) ANY and ALL keywords handling.
2) Collecting statistics of array length. It is used in "column <@ const".
3) Relatively accurate estimation of "column <@ const" selectivity. This
estimation is most hard part of patch. I'm warrying that there could be too
expensibe calculations for estimation. Also, likely current comments don't
clearify anything...

------
With best regards,
Alexander Korotkov.

Re: WIP: collect frequency statistics for arrays

Alexander Korotkov <aekorotkov@gmail.com> — 2011-05-23T20:27:00Z
```
I forgot to commit before diff. Here is correct version.

------
With best regards,
Alexander Korotkov.
```

Re: WIP: collect frequency statistics for arrays

Simon Riggs <simon@2ndquadrant.com> — 2011-06-10T17:03:18Z

On Mon, May 23, 2011 at 6:54 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

> Second version of patch attached. Main changes:
> 1) ANY and ALL keywords handling.
> 2) Collecting statistics of array length. It is used in "column <@ const".
> 3) Relatively accurate estimation of "column <@ const" selectivity. This
> estimation is most hard part of patch. I'm warrying that there could be too
> expensibe calculations for estimation. Also, likely current comments don't
> clearify anything...

Just had a brief look at this ahead of starting review.

Initial comments are that the code is well structured and I doubt
there will be problems at the code level. Looks like a good patch.

At the moment I see no tests. If this code will be exercised by
existing tests then you should put some notes with the patch to
explain that, or at least provide some pointers as to how I might test
this.

Also, I'd like to see some more explanation. Either in comments, or
just as a post to hackers. That saves me time, but we need to be clear
about what this does and does not do, what it might do in the future
etc.. 3+ years from now we need to be able to remember what the code
was supposed to do. You will forget yourself in time, if you write
enough patches. Based on this, I think you'll be writing quite a few
more.

And of course, a few lines for the docs also.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: collect frequency statistics for arrays

Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-12T16:17:25Z

On Fri, Jun 10, 2011 at 9:03 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

> Initial comments are that the code is well structured and I doubt
> there will be problems at the code level. Looks like a good patch.
>
I'm worrying about perfomance of "column <@ const" estimation. It takes
O(m*(n+m)) of time, where m - const length and n - statistics target.
Probably, it can be too slow is some some cases.


> At the moment I see no tests. If this code will be exercised by
> existing tests then you should put some notes with the patch to
> explain that, or at least provide some pointers as to how I might test
> this.
>
I didn't find in existing tests which check selectivity estimation accuracy.
And I found difficult to create them because regression tests gives binary
result while estimation accuracy is quantitative value. Existing regression
tests covers case if typanalyze or selectivity estimation function falls
down. I've added "ANALYZE array_op_test;" line into array test in order to
these tests covers falldown case for this patch functions too.
Seems that, selectivity estimation accuracy should be tested manually on
various distributions. I've done very small amount of such tests.
Unfortunately, few months pass before I got idea about "column <@ const"
case. And now, I don't have sufficient time for it due to my GSoC project.
It would be great if you can help me with this tests.


> Also, I'd like to see some more explanation. Either in comments, or
> just as a post to hackers. That saves me time, but we need to be clear
> about what this does and does not do, what it might do in the future
> etc.. 3+ years from now we need to be able to remember what the code
> was supposed to do. You will forget yourself in time, if you write
> enough patches. Based on this, I think you'll be writing quite a few
> more.
>
I've added some more comments. I'm afraid that it should be completely
rewritten before committing due to my english. If some particular points
should be clarified more, please, specify them.


> And of course, a few lines for the docs also.
>
I found that in statistics patch for tsvector only article about pg_stats
view was corrected. I've corrected this article a little bit too.

------
With best regards,
Alexander Korotkov.

Re: WIP: collect frequency statistics for arrays

Robert Haas <robertmhaas@gmail.com> — 2011-06-13T04:16:43Z

On Sun, Jun 12, 2011 at 12:17 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> I'm worrying about perfomance of "column <@ const" estimation. It takes
> O(m*(n+m)) of time, where m - const length and n - statistics target.
> Probably, it can be too slow is some some cases.

Hmm, that doesn't sound terribly promising.  The best thing to do here
is probably to construct a pessimal case and test it.  So, say, look
for a column <@ a 100-element array with a statistics target of 400...
 once you know how bad it is, then we can make intelligent decisions
about where to go with it.

If the data type is hashable, you could consider building a hash table
on the MCVs and then do a probe for each element in the array.  I
think that's better than the other way around because there can't be
more than 10k MCVs, whereas the input constant could be arbitrarily
long.  I'm not entirely sure whether this case is important enough to
be worth spending a lot of code on, but then again it might not be
that much code.

Another option is to bound the number of operations you're willing to
perform to some reasonable limit, say, 10 * default_statistics_target.
 Work out ceil((10 * default_statistics_target) /
number-of-elements-in-const) and consider at most that many MCVs.
When this limit kicks in you'll get a less-accurate selectivity
estimate, but that's a reasonable price to pay for not blowing out
planning time.

>> At the moment I see no tests. If this code will be exercised by
>> existing tests then you should put some notes with the patch to
>> explain that, or at least provide some pointers as to how I might test
>> this.
>
> I didn't find in existing tests which check selectivity estimation accuracy.
> And I found difficult to create them because regression tests gives binary
> result while estimation accuracy is quantitative value. Existing regression
> tests covers case if typanalyze or selectivity estimation function falls
> down. I've added "ANALYZE array_op_test;" line into array test in order to
> these tests covers falldown case for this patch functions too.
> Seems that, selectivity estimation accuracy should be tested manually on
> various distributions. I've done very small amount of such tests.
> Unfortunately, few months pass before I got idea about "column <@ const"
> case. And now, I don't have sufficient time for it due to my GSoC project.
> It would be great if you can help me with this tests.

AFAIK, we don't have regression tests for any other selectivity
estimator; except perhaps to the extent that we verify they don't
crash.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: WIP: collect frequency statistics for arrays

Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-13T19:10:36Z

On Mon, Jun 13, 2011 at 8:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> If the data type is hashable, you could consider building a hash table
> on the MCVs and then do a probe for each element in the array.  I
> think that's better than the other way around because there can't be
> more than 10k MCVs, whereas the input constant could be arbitrarily
> long.  I'm not entirely sure whether this case is important enough to
> be worth spending a lot of code on, but then again it might not be
> that much code.
>
Unfortunately, most time consuming operation isn't related to elements
comparison. It is caused by complex computations in calc_distr function.


> Another option is to bound the number of operations you're willing to
> perform to some reasonable limit, say, 10 * default_statistics_target.
>  Work out ceil((10 * default_statistics_target) /
> number-of-elements-in-const) and consider at most that many MCVs.
> When this limit kicks in you'll get a less-accurate selectivity
> estimate, but that's a reasonable price to pay for not blowing out
> planning time.

 Good option. I'm going to add such condition to my patch.

------
With best regards,
Alexander Korotkov.

Re: WIP: collect frequency statistics for arrays

Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-14T14:25:09Z
```
Version of patch with few more comments and some fixes.

------
With best regards,
Alexander Korotkov.
```

Re: WIP: collect frequency statistics for arrays

Bruce Momjian <bruce@momjian.us> — 2011-10-14T19:24:10Z

Alexander Korotkov wrote:
> Version of patch with few more comments and some fixes.

Where are we on this?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: WIP: collect frequency statistics for arrays

Robert Haas <robertmhaas@gmail.com> — 2011-10-14T21:59:52Z

On Fri, Oct 14, 2011 at 3:24 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Alexander Korotkov wrote:
>> Version of patch with few more comments and some fixes.

The CommitFest app page is here.

https://commitfest.postgresql.org/action/patch_view?id=539

It was reviewed in July by Nate Boley, and never updated.

Looking now, I see that Alexander wasn't Cc'd on the review, so it's
possible he missed the message?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: WIP: collect frequency statistics for arrays

Nathan Boley <npboley@gmail.com> — 2011-10-15T10:47:16Z

> Looking now, I see that Alexander wasn't Cc'd on the review, so it's
> possible he missed the message?
>

We've corresponded off list and have discussed my review at some length.

Alex submitted an updated patch on Sep 22 to me personally ( although
not to the list? Alex? ), with the promise of a new version with
improved comments.

Best,
Nathan

Re: WIP: collect frequency statistics for arrays

Alexander Korotkov <aekorotkov@gmail.com> — 2011-10-15T15:13:51Z

Hi!

Thanks for your attention to my patch!

On Sat, Oct 15, 2011 at 2:47 PM, Nathan Boley <npboley@gmail.com> wrote:

> > Looking now, I see that Alexander wasn't Cc'd on the review, so it's
> > possible he missed the message?
> >
>
> We've corresponded off list and have discussed my review at some length.
>
> Alex submitted an updated patch on Sep 22 to me personally ( although
> not to the list? Alex? ), with the promise of a new version with
> improved comments.
>

Oh, I didn't noticed that I've posted updated patch off-list. So, there is
repost of that version of patch. List of changes is below:
1) Distinct slot is used for length histogram.
2) Standard statistics is collected for arrays.
3) Most common values and most common elements are mapped to distinct
columns of pg_stats view, because both of them are calculated for arrays.

Now, it's hard for me to give to it as much time as I would like to. But I
hope to present improved comments and testing until end of october.

------
With best regards,
Alexander Korotkov.

Re: WIP: collect frequency statistics for arrays

Bruce Momjian <bruce@momjian.us> — 2011-11-29T02:49:03Z

Alexander Korotkov wrote:
> Version of patch with few more comments and some fixes.

Where are we on the idea of better statistics for arrays?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: WIP: collect frequency statistics for arrays

Nathan Boley <npboley@gmail.com> — 2011-11-29T03:02:51Z

>> Version of patch with few more comments and some fixes.
>
> Where are we on the idea of better statistics for arrays?

I need to finish reviewing the patch: I'll try to send in something
tomorrow night. So far it looks good.

Best,
Nathan