Thread

Parameterized aggregate subquery (was: Pull up aggregate subquery)

Hitoshi Harada <umi.tanuki@gmail.com> — 2011-06-09T06:28:22Z

Purpose & Goal
--------------
Allow planner to create NestLoop with parameterized aggregate subquery
just like inner IndexScan pattern. This helps to avoid unnecessary
aggregate process that would be filtered at the stage of upper join
filter in such case:

create table size_m as select i as id, repeat(i::text, i % 100) as val
from generate_series(1, 20000) i;
create table size_l as select i as id, m.id as m_id, repeat(i::text, i
% 100) as val from generate_series(1, 1000000) i, size_m m where i.i /
10 = m.id;
analyze size_m;
analyze size_l;
---- make this query faster
select m_id, sum_len from size_m m inner join(select m_id,
sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
l.m_id where val = '10101';

In addition, it might be the first step to take account of "general
parameterized scan" design, although this proposal contains only
narrow cases of aggregate subquery.


Related information
-------------------
This work is the next step of the previously proposed "Pull up
aggregate subqery" thread.
http://archives.postgresql.org/message-id/BANLkTimW-EqS_9hk5AYj14R8U%3DuQoc6tuA%40mail.gmail.com


Design
------
In contract to the previous design, I made aggregate subquery
"parameterized" instead of pulling it up. The design is based on the
discussion with Tom Lane and Rober Haas et al. It has some benefits
compared with the pulling up appoach as, 1) parameterizing any scan
node other than index scan as nest loop's inner is our way to go, 2)
pulling up or pushing down any aggregate query has potential risk that
the aggregate results are wrong (, which may be solvable by adding
some constraints like "only when unique" etc,) 3) the decision whether
to make parameterized or not can be made by only cost without any
heuristics.

The idea is very similar to inner IndexScan. When NestLoop path is
created in match_unsorted_outer(), parameterized SubqueryScan path is
also considered, similarly to inner IndexScan paths. While IndexScan
is simple since its information like costs are well known by the base
relation, SubqueryScan should re-plan its Query to gain that, which is
expensive.

To do so, I added SubqueryPath node to store each subquery
information. Before patching, subplan, subrtable and subrowmark is
stored in RelOptInfo, but now we need to save them separately for each
parameterized one. This part might be split from the original patch
for easy review.

Result
------
Without breakage of regression test, it successfully makes subquery as
parameterized as the inner of NestLoop. Query performance is as aimed.

(without patch)
db1=# explain analyze
select m_id, sum_len from size_m m inner join(select m_id,
sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
l.m_id where val = '10101';
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=79393.64..82937.42 rows=100 width=12) (actual
time=1256.569..1733.903 rows=1 loops=1)
   Join Filter: (m.id = size_l.m_id)
   ->  Seq Scan on size_m m  (cost=0.00..897.00 rows=1 width=4)
(actual time=25.182..32.237 rows=1 loops=1)
         Filter: (val = '10101'::text)
   ->  GroupAggregate  (cost=79393.64..81592.65 rows=19901 width=277)
(actual time=772.791..1694.848 rows=20000 loops=1)
         ->  Sort  (cost=79393.64..79893.64 rows=200000 width=277)
(actual time=772.754..929.225 rows=200000 loops=1)
               Sort Key: size_l.m_id
               Sort Method: external sort  Disk: 56848kB
               ->  Seq Scan on size_l  (cost=0.00..9830.00 rows=200000
width=277) (actual time=0.030..198.093 rows=200000 loops=1)
 Total runtime: 1754.111 ms
(10 rows)

(with patch)
db1=# explain analyze
select m_id, sum_len from size_m m inner join(select m_id,
sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
l.m_id where val = '10101';
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..11674.86 rows=100 width=12) (actual
time=119.386..122.478 rows=1 loops=1)
   Join Filter: (m.id = size_l.m_id)
   ->  Seq Scan on size_m m  (cost=0.00..897.00 rows=1 width=4)
(actual time=9.720..12.811 rows=1 loops=1)
         Filter: (val = '10101'::text)
   ->  GroupAggregate  (cost=0.00..10330.08 rows=1 width=277) (actual
time=109.639..109.640 rows=1 loops=1)
         ->  Seq Scan on size_l  (cost=0.00..10330.00 rows=10
width=277) (actual time=59.138..109.611 rows=10 loops=1)
               Filter: (m.id = m_id)
 Total runtime: 122.612 ms
(8 rows)

I tested some more cases like "where val IN ('1', '10101')", "where
val between '1' and '10101'", which shows as good as I expected.

Concern
-------
Although execution time will be shorter in many cases, planning time
will be longer. Especially re-planning subquery with pushing join
quals for each joinrel step. I didn't measure the additional cost in
planner stage.


BTW, as I changed title and design from the previous post, should I
throw away the old commit fest entry and make the new one?

Regards,

-- 
Hitoshi Harada

Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

Robert Haas <robertmhaas@gmail.com> — 2011-06-09T14:32:27Z

On Thu, Jun 9, 2011 at 2:28 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
> BTW, as I changed title and design from the previous post, should I
> throw away the old commit fest entry and make the new one?

Nah, just edit the existing entry and change the title.

Also add a link to the new patch, of course.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

Hitoshi Harada <umi.tanuki@gmail.com> — 2011-06-09T16:02:33Z

2011/6/9 Robert Haas <robertmhaas@gmail.com>:
> On Thu, Jun 9, 2011 at 2:28 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
>> BTW, as I changed title and design from the previous post, should I
>> throw away the old commit fest entry and make the new one?
>
> Nah, just edit the existing entry and change the title.
>
> Also add a link to the new patch, of course.

Ok, done.


-- 
Hitoshi Harada

Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

Hitoshi Harada <umi.tanuki@gmail.com> — 2011-06-17T07:54:06Z

2011/6/10 Hitoshi Harada <umi.tanuki@gmail.com>:
> 2011/6/9 Robert Haas <robertmhaas@gmail.com>:
>> On Thu, Jun 9, 2011 at 2:28 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
>>> BTW, as I changed title and design from the previous post, should I
>>> throw away the old commit fest entry and make the new one?
>>
>> Nah, just edit the existing entry and change the title.
>>
>> Also add a link to the new patch, of course.
>
> Ok, done.

While reviewing the gist/box patch, I found some planner APIs that can
replace parts in my patch. Also, comments in includes wasn't updated
appropriately. Revised patch attached.

Regards,

-- 
Hitoshi Harada

Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

Yeb Havinga <yebhavinga@gmail.com> — 2011-06-29T13:44:18Z

On 2011-06-17 09:54, Hitoshi Harada wrote:
> While reviewing the gist/box patch, I found some planner APIs that can
> replace parts in my patch. Also, comments in includes wasn't updated
> appropriately. Revised patch attached.

Hello Hitoshi-san,

I read your latest patch implementing parameterizing subquery scans.

1)
In the email from june 9 with the patch You wrote: "While IndexScan
is simple since its information like costs are well known by the base
relation, SubqueryScan should re-plan its Query to gain that, which is
expensive."

Initial concerns I had were caused by misinterpreting 'replanning' as: 
for each outer tuple, replan the subquery (it sounds a bit like 
'ReScan'). Though the general comments in the patch are helpful, I think 
it would be an improvement to describe why subqueries are planned more 
than once, i.e. something like
"best_inner_subqueryscan
    Try to find a better subqueryscan path and its associated plan for 
each join clause that can be pushed down, in addition to the path that 
is already calculated by set_subquery_pathlist."

The consequences of having multiple subquery paths and plans for each 
new subquery path makes the bulk of the remainder of the patch.

2)
Since 'subquery_is_pushdown_safe' is invariant under join clause push 
down, it might be possible to have it called only once in 
set_subquery_pathlist, i.e. by only attaching rel->preprocessed_subquery 
if the subquery_is_pushdown_safe.

3)
/*
  * set_subquery_pathlist
  *        Build the (single) access path for a subquery RTE
  */
This unchanged comment is still correct, but 'the (single) access path' 
might have become a bit misleading now there are multiple paths 
possible, though not by set_subquery_pathlist.

4) somewhere in the patch
s/regsitered/registered/

5) Regression tests are missing; I think at this point they'd aid in 
speeding up development/test cycles.

6) Before patching Postgres, I could execute the test with the size_l 
and size_m tables, after patching against current git HEAD (patch 
without errors), I get the following error when running the example query:
ERROR:  plan should not reference subplan's variable

I get the same error with the version from june 9 with current git HEAD.

7) Since both set_subquery_pathlist and best_inner_subqueryscan push 
down clauses, as well as add a path and subplan, could a generalized 
function be made to support both, to reduce duplicate code?

regards,
Yeb Havinga

-- 
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data

Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

Hitoshi Harada <umi.tanuki@gmail.com> — 2011-06-29T17:22:07Z

2011/6/29 Yeb Havinga <yebhavinga@gmail.com>:
>
> On 2011-06-17 09:54, Hitoshi Harada wrote:
>>
>> While reviewing the gist/box patch, I found some planner APIs that can
>> replace parts in my patch. Also, comments in includes wasn't updated
>> appropriately. Revised patch attached.
>
> Hello Hitoshi-san,

Hi Yeb,

> I read your latest patch implementing parameterizing subquery scans.

> 6) Before patching Postgres, I could execute the test with the size_l and
> size_m tables, after patching against current git HEAD (patch without
> errors), I get the following error when running the example query:
> ERROR:  plan should not reference subplan's variable
>
> I get the same error with the version from june 9 with current git HEAD.

It seems that something has changed since I developed the patch at
first. The last one and the one before are not so different with each
other, especially in terms of runtime but might not be tested enough.
I tried time-slip of the local git branch to around june 10, but the
same error occurs. The error message itself is not in parts changed
recently, so I guess some surrounding change would affect it.

> 7) Since both set_subquery_pathlist and best_inner_subqueryscan push down
> clauses, as well as add a path and subplan, could a generalized function be
> made to support both, to reduce duplicate code?

I tried it but decided as it is, for the future extensibility. The
subquery parameterizing will (can) be extended more complicatedly. I'm
not sure if we'd better gathering some small parts into one by
throwing future capability.

Other things are all good points. Thanks for elaborate review!
More than anything, I'm going to fix the 6) issue, at least to find the cause.

Regards,
-- 
Hitoshi Harada

Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

Yeb Havinga <yebhavinga@gmail.com> — 2011-06-30T07:39:39Z

On 2011-06-29 19:22, Hitoshi Harada wrote:
> Other things are all good points. Thanks for elaborate review!
> More than anything, I'm going to fix the 6) issue, at least to find the cause.
>
Some more questions:
8) why are cheapest start path and cheapest total path in 
best_inner_subqueryscan the same?

9) as remarked up a different thread, more than one clause could be 
pushed down a subquery. The current patch only considers alternative 
paths that have at most one clause pushed down. Is this because of the 
call site of best_inner_subqueryscan, i.e. under one join rel? Would it 
be an idea to consider, for each subquery, only a single alternative 
plan that tries to have all suitable clauses pushed down?

10) I have a hard time imagining use cases that will actually result in 
a alternative plan, especially since not all subqueries are allowed to 
have quals pushed down into, and like Simon Riggs pointed out that many 
users will write queries like this with the subqueries pulled up. If it 
is the case that the subqueries that can't be pulled up have a large 
overlap with the ones that are not pushdown safe (limit, set operations 
etc), there might be little actual use cases for this patch.

I think the most important thing for this patch to go forward is to have 
a few examples, from which it's clear that the patch is beneficial.

regards,

-- 

Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data

Re: Parameterized aggregate subquery

Yeb Havinga <yebhavinga@gmail.com> — 2011-06-30T08:03:22Z

On 2011-06-30 09:39, Yeb Havinga wrote:
>
> 9) as remarked up a different thread, more than one clause could be 
> pushed down a subquery. The current patch only considers alternative 
> paths that have at most one clause pushed down. Is this because of the 
> call site of best_inner_subqueryscan, i.e. under one join rel? Would 
> it be an idea to consider, for each subquery, only a single 
> alternative plan that tries to have all suitable clauses pushed down?

Ehm, please forget this remark, I've mistaken join rel for base rel.

-- Yeb

Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

Hitoshi Harada <umi.tanuki@gmail.com> — 2011-06-30T08:37:43Z

2011/6/30 Yeb Havinga <yebhavinga@gmail.com>:
> On 2011-06-29 19:22, Hitoshi Harada wrote:
>>
>> Other things are all good points. Thanks for elaborate review!
>> More than anything, I'm going to fix the 6) issue, at least to find the
>> cause.
>>
> Some more questions:
> 8) why are cheapest start path and cheapest total path in
> best_inner_subqueryscan the same?

Because best_inner_indexscan has the two. Actually one of them is
enough so far. I aligned it as the existing interface but they might
be one.

> 10) I have a hard time imagining use cases that will actually result in a
> alternative plan, especially since not all subqueries are allowed to have
> quals pushed down into, and like Simon Riggs pointed out that many users
> will write queries like this with the subqueries pulled up. If it is the
> case that the subqueries that can't be pulled up have a large overlap with
> the ones that are not pushdown safe (limit, set operations etc), there might
> be little actual use cases for this patch.

I have seen many cases that this planner hack would help
significantly, which were difficult to rewrite. Why were they
difficult to write? Because, quals on size_m (and they have quals on
size_l in fact) are usually very complicated (5-10 op clauses) and the
join+agg part itself is kind of subquery in other big query. Of course
there were workaround like split the statement to two, filtering
size_m then aggregate size_l by the result of the first statement.
However, it's against instinct. The reason why planner is in RDBMS is
to let users to write simple (as needed) statements. I don't know if
the example I raise here is common or not, but I believe the example
represents "one to many" relation simply, therefore there should be
many users who just don't find themselves currently in the slow query
performance.

> I think the most important thing for this patch to go forward is to have a
> few examples, from which it's clear that the patch is beneficial.

What will be good examples to show benefit of the patch? I guess the
test case of size_m/size_l shows it. What lacks on the case, do you
think?

Regards,

-- 
Hitoshi Harada

Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

Hitoshi Harada <umi.tanuki@gmail.com> — 2011-07-02T07:48:05Z

2011/6/29 Yeb Havinga <yebhavinga@gmail.com>:
>
> On 2011-06-17 09:54, Hitoshi Harada wrote:
>>
>> While reviewing the gist/box patch, I found some planner APIs that can
>> replace parts in my patch. Also, comments in includes wasn't updated
>> appropriately. Revised patch attached.
>
> Hello Hitoshi-san,
>
> I read your latest patch implementing parameterizing subquery scans.

Attached is revised version.

> 1)
> In the email from june 9 with the patch You wrote: "While IndexScan
> is simple since its information like costs are well known by the base
> relation, SubqueryScan should re-plan its Query to gain that, which is
> expensive."
>
> Initial concerns I had were caused by misinterpreting 'replanning' as: for
> each outer tuple, replan the subquery (it sounds a bit like 'ReScan').
> Though the general comments in the patch are helpful, I think it would be an
> improvement to describe why subqueries are planned more than once, i.e.
> something like
> "best_inner_subqueryscan
>   Try to find a better subqueryscan path and its associated plan for each
> join clause that can be pushed down, in addition to the path that is already
> calculated by set_subquery_pathlist."

I changed comments around set_subquery_pathlist and best_inner_subqueryscan.

> 2)
> Since 'subquery_is_pushdown_safe' is invariant under join clause push down,
> it might be possible to have it called only once in set_subquery_pathlist,
> i.e. by only attaching rel->preprocessed_subquery if the
> subquery_is_pushdown_safe.

I modified as you suggested.

> 3)
> /*
>  * set_subquery_pathlist
>  *        Build the (single) access path for a subquery RTE
>  */
> This unchanged comment is still correct, but 'the (single) access path'
> might have become a bit misleading now there are multiple paths possible,
> though not by set_subquery_pathlist.

As noted #1.

> 4) somewhere in the patch
> s/regsitered/registered/

Fixed.

> 5) Regression tests are missing; I think at this point they'd aid in
> speeding up development/test cycles.

I'm still thinking about it. I can add complex test but the concept of
regression test focuses on small pieces of simple cases. I don't want
take pg_regress much more than before.

> 6) Before patching Postgres, I could execute the test with the size_l and
> size_m tables, after patching against current git HEAD (patch without
> errors), I get the following error when running the example query:
> ERROR:  plan should not reference subplan's variable
>
> I get the same error with the version from june 9 with current git HEAD.

Fixed. Some variable initializing was wrong.

> 7) Since both set_subquery_pathlist and best_inner_subqueryscan push down
> clauses, as well as add a path and subplan, could a generalized function be
> made to support both, to reduce duplicate code?

No touch as answered before.

Although I still need to think about suitable regression test case,
the patch itself can be reviewed again. You may want to try some
additional tests as you imagine after finding my test case gets
quicker.

Thanks,

-- 
Hitoshi Harada

Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

Hitoshi Harada <umi.tanuki@gmail.com> — 2011-07-02T08:02:07Z

2011/7/2 Hitoshi Harada <umi.tanuki@gmail.com>:
> 2011/6/29 Yeb Havinga <yebhavinga@gmail.com>:
>>
>> On 2011-06-17 09:54, Hitoshi Harada wrote:
>>>
>>> While reviewing the gist/box patch, I found some planner APIs that can
>>> replace parts in my patch. Also, comments in includes wasn't updated
>>> appropriately. Revised patch attached.
>>
>> Hello Hitoshi-san,
>>
>> I read your latest patch implementing parameterizing subquery scans.
>
> Attached is revised version.

I failed to attached the patch. I'm trying again.

>> 1)
>> In the email from june 9 with the patch You wrote: "While IndexScan
>> is simple since its information like costs are well known by the base
>> relation, SubqueryScan should re-plan its Query to gain that, which is
>> expensive."
>>
>> Initial concerns I had were caused by misinterpreting 'replanning' as: for
>> each outer tuple, replan the subquery (it sounds a bit like 'ReScan').
>> Though the general comments in the patch are helpful, I think it would be an
>> improvement to describe why subqueries are planned more than once, i.e.
>> something like
>> "best_inner_subqueryscan
>>   Try to find a better subqueryscan path and its associated plan for each
>> join clause that can be pushed down, in addition to the path that is already
>> calculated by set_subquery_pathlist."
>
> I changed comments around set_subquery_pathlist and best_inner_subqueryscan.
>
>> 2)
>> Since 'subquery_is_pushdown_safe' is invariant under join clause push down,
>> it might be possible to have it called only once in set_subquery_pathlist,
>> i.e. by only attaching rel->preprocessed_subquery if the
>> subquery_is_pushdown_safe.
>
> I modified as you suggested.
>
>> 3)
>> /*
>>  * set_subquery_pathlist
>>  *        Build the (single) access path for a subquery RTE
>>  */
>> This unchanged comment is still correct, but 'the (single) access path'
>> might have become a bit misleading now there are multiple paths possible,
>> though not by set_subquery_pathlist.
>
> As noted #1.
>
>> 4) somewhere in the patch
>> s/regsitered/registered/
>
> Fixed.
>
>> 5) Regression tests are missing; I think at this point they'd aid in
>> speeding up development/test cycles.
>
> I'm still thinking about it. I can add complex test but the concept of
> regression test focuses on small pieces of simple cases. I don't want
> take pg_regress much more than before.
>
>> 6) Before patching Postgres, I could execute the test with the size_l and
>> size_m tables, after patching against current git HEAD (patch without
>> errors), I get the following error when running the example query:
>> ERROR:  plan should not reference subplan's variable
>>
>> I get the same error with the version from june 9 with current git HEAD.
>
> Fixed. Some variable initializing was wrong.
>
>> 7) Since both set_subquery_pathlist and best_inner_subqueryscan push down
>> clauses, as well as add a path and subplan, could a generalized function be
>> made to support both, to reduce duplicate code?
>
> No touch as answered before.
>
> Although I still need to think about suitable regression test case,
> the patch itself can be reviewed again. You may want to try some
> additional tests as you imagine after finding my test case gets
> quicker.
>
> Thanks,
>
> --
> Hitoshi Harada
>



-- 
Hitoshi Harada

Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

Yeb Havinga <yebhavinga@gmail.com> — 2011-07-05T11:15:44Z

Hello Hitosh, list,

>
> > Attached is revised version.
>
> I failed to attached the patch. I'm trying again.
>
> I'm currently unable to test, since on holiday. I'm happy to continue
testing once returned but it may not be within the bounds of the current
commitfest, sorry.


> >> 5) Regression tests are missing; I think at this point they'd aid in
> >> speeding up development/test cycles.
> >
> > I'm still thinking about it. I can add complex test but the concept of
> > regression test focuses on small pieces of simple cases. I don't want
> > take pg_regress much more than before.
>

IMHO, at least one query where the new code is touched is a good idea.

>> I get the same error with the version from june 9 with current git HEAD.
> >
> > Fixed. Some variable initializing was wrong.
>

Thanks - will test when I am back from holiday and other duties.

regards,
Yeb

Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

Hitoshi Harada <umi.tanuki@gmail.com> — 2011-07-05T11:43:46Z

2011/7/5 Yeb Havinga <yebhavinga@gmail.com>:
> Hello Hitosh, list,
>
>> >
>> > Attached is revised version.
>>
>> I failed to attached the patch. I'm trying again.
>>
> I'm currently unable to test, since on holiday. I'm happy to continue
> testing once returned but it may not be within the bounds of the current
> commitfest, sorry.

Thanks for replying. Regarding the time remained until the end of this
commitfest, I'd think we should mark this item "Returned with
Feedback" if no objection appears. I will be very happy if Yeb tests
my latest patch after he comes back. I'll make up my mind around
regression test.

Regards,

-- 
Hitoshi Harada

Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

Hitoshi Harada <umi.tanuki@gmail.com> — 2011-07-09T14:23:20Z

2011/7/5 Hitoshi Harada <umi.tanuki@gmail.com>:
> 2011/7/5 Yeb Havinga <yebhavinga@gmail.com>:
>> Hello Hitosh, list,
>>
>>> >
>>> > Attached is revised version.
>>>
>>> I failed to attached the patch. I'm trying again.
>>>
>> I'm currently unable to test, since on holiday. I'm happy to continue
>> testing once returned but it may not be within the bounds of the current
>> commitfest, sorry.
>
> Thanks for replying. Regarding the time remained until the end of this
> commitfest, I'd think we should mark this item "Returned with
> Feedback" if no objection appears. I will be very happy if Yeb tests
> my latest patch after he comes back. I'll make up my mind around
> regression test.

It seems that Yeb marked "Returned with Feedback". Thanks for the
review again. Still, I'd appreciate if further review is done on my
latest patch.

Thanks,
-- 
Hitoshi Harada

Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

Yeb Havinga <yebhavinga@gmail.com> — 2011-07-09T15:45:34Z

On 2011-07-09 16:23, Hitoshi Harada wrote:
> 2011/7/5 Hitoshi Harada<umi.tanuki@gmail.com>:
>> 2011/7/5 Yeb Havinga<yebhavinga@gmail.com>:
>>> Hello Hitosh, list,
>>>
>>>>> Attached is revised version.
>>>> I failed to attached the patch. I'm trying again.
>>>>
>>> I'm currently unable to test, since on holiday. I'm happy to continue
>>> testing once returned but it may not be within the bounds of the current
>>> commitfest, sorry.
>> Thanks for replying. Regarding the time remained until the end of this
>> commitfest, I'd think we should mark this item "Returned with
>> Feedback" if no objection appears. I will be very happy if Yeb tests
>> my latest patch after he comes back. I'll make up my mind around
>> regression test.
> It seems that Yeb marked "Returned with Feedback". Thanks for the
> review again. Still, I'd appreciate if further review is done on my
> latest patch.

Yes, I did so after you suggested to return it to feedback. I'll review 
your latest patch as soon as there is enough time to do a proper review, 
which is probably after next week.

regards,
Yeb Havinga

Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

Yeb Havinga <yebhavinga@gmail.com> — 2011-07-22T13:13:01Z

On 2011-07-02 10:02, Hitoshi Harada wrote:
>
>> Although I still need to think about suitable regression test case,
>> the patch itself can be reviewed again. You may want to try some
>> additional tests as you imagine after finding my test case gets
>> quicker.

Hello Hitoshi-san,

I took a look at your latest patch and it looks good, no comments. 
However I also tried it against current 9.2 HEAD and the test query of 
the start of this thread.

Before and after applying the patch, I get the same result for the test 
query.

postgres=# explain select m_id, sum_len from size_m m inner join(select 
m_id,
sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
l.m_id where val = '10101';
                                     QUERY PLAN
----------------------------------------------------------------------------------
  Nested Loop  (cost=79392.64..82938.05 rows=100 width=12)
    Join Filter: (m.id = size_l.m_id)
    ->  Seq Scan on size_m m  (cost=0.00..897.00 rows=1 width=4)
          Filter: (val = '10101'::text)
    ->  GroupAggregate  (cost=79392.64..81592.15 rows=19951 width=277)
          ->  Sort  (cost=79392.64..79892.64 rows=200000 width=277)
                Sort Key: size_l.m_id
                ->  Seq Scan on size_l  (cost=0.00..9829.00 rows=200000 
width=277)

I double checked that I had applied the patch (git diff shows the 
patch), installed and restarted postgres. The database is a fresh 
created database with no edits in postgresql.conf.

regards,

-- 
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data

Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

Hitoshi Harada <umi.tanuki@gmail.com> — 2011-07-22T14:17:01Z

2011/7/22 Yeb Havinga <yebhavinga@gmail.com>:
> On 2011-07-02 10:02, Hitoshi Harada wrote:
>>
>>> Although I still need to think about suitable regression test case,
>>> the patch itself can be reviewed again. You may want to try some
>>> additional tests as you imagine after finding my test case gets
>>> quicker.
>
> Hello Hitoshi-san,

Hi,

> I double checked that I had applied the patch (git diff shows the patch),
> installed and restarted postgres. The database is a fresh created database
> with no edits in postgresql.conf.

:(
I updated the patch. Could you try attached once more? The "issafe"
switch seems wrong.

Thanks,
-- 
Hitoshi Harada

Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

Yeb Havinga <yebhavinga@gmail.com> — 2011-07-22T15:16:30Z

On 2011-07-22 16:17, Hitoshi Harada wrote:
>
> :(
> I updated the patch. Could you try attached once more? The "issafe"
> switch seems wrong.
Works like a charm :-). However, now there is always a copyObject of a 
subquery even when the subquery is not safe for qual pushdown. The 
problem with the previous issafe was that it was only assigned for 
rel->baserestrictinfo != NIL. If it is assigned before the if statement, 
it still works. See attached patch that avoids subquery copy for unsafe 
subqueries, and also exits best_inner_subqueryscan before palloc of 
differenttypes in case of unsafe queries.

regards,
-- 

Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data

Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

Hitoshi Harada <umi.tanuki@gmail.com> — 2011-07-22T15:35:40Z

2011/7/23 Yeb Havinga <yebhavinga@gmail.com>:
> On 2011-07-22 16:17, Hitoshi Harada wrote:
>>
>> :(
>> I updated the patch. Could you try attached once more? The "issafe"
>> switch seems wrong.
>
> Works like a charm :-). However, now there is always a copyObject of a
> subquery even when the subquery is not safe for qual pushdown. The problem
> with the previous issafe was that it was only assigned for
> rel->baserestrictinfo != NIL. If it is assigned before the if statement, it
> still works. See attached patch that avoids subquery copy for unsafe
> subqueries, and also exits best_inner_subqueryscan before palloc of
> differenttypes in case of unsafe queries.

Ah, yeah, right. Too quick fix bloated my brain :P Thanks for testing!
I'll check it more.


-- 
Hitoshi Harada

Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery))

Yeb Havinga <yebhavinga@gmail.com> — 2011-07-26T19:32:11Z

On 2011-07-22 17:35, Hitoshi Harada wrote:
> 2011/7/23 Yeb Havinga<yebhavinga@gmail.com>:
>> Works like a charm :-). However, now there is always a copyObject of a
>> subquery even when the subquery is not safe for qual pushdown. The problem
>> with the previous issafe was that it was only assigned for
>> rel->baserestrictinfo != NIL. If it is assigned before the if statement, it
>> still works. See attached patch that avoids subquery copy for unsafe
>> subqueries, and also exits best_inner_subqueryscan before palloc of
>> differenttypes in case of unsafe queries.
>
> Ah, yeah, right. Too quick fix bloated my brain :P Thanks for testing!
> I'll check it more.

A few days ago I read Tomas Vondra's blog post about dss tpc-h queries 
on PostgreSQL at 
http://fuzzy.cz/en/articles/dss-tpc-h-benchmark-with-postgresql/ - in 
which he showed how to manually pull up a dss subquery to get a large 
speed up. Initially I thought: cool, this is probably now handled by 
Hitoshi's patch, but it turns out the subquery type in the dss query is 
different.

The original and rewritten queries are below. The debug_print_plan 
output shows the subquery is called from a opexpr (< l_quantity, 
subquery output) and the sublink type is EXPR_SUBLINK. Looking at the 
source code; pull_up_sublink only considers ANY and EXISTS sublinks. I'm 
wondering if this could be expanded to deal with EXPR sublinks. Clearly 
in the example Tomas has given this can be done. I'm wondering if there 
are show stoppers that prevent this to be possible in the general case, 
but can't think of any, other than the case of a sublink returning NULL 
and the opexpr is part of a larger OR expression or IS NULL; in which 
case it should not be pulled op, or perhaps it could be pulled up as 
outer join.

Thoughts?

regards,
Yeb


The original query:

tpch=# explain select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem,
part
where
p_partkey = l_partkey
and p_brand = 'Brand#13'
and p_container = 'JUMBO PKG'
and l_quantity < (
select
0.2 * avg(l_quantity)
from
lineitem
where
l_partkey = p_partkey
)
LIMIT 1;
                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------
  Limit  (cost=183345309.79..183345309.81 rows=1 width=8)
    ->  Aggregate  (cost=183345309.79..183345309.81 rows=1 width=8)
          ->  Hash Join  (cost=2839.99..183345307.76 rows=815 width=8)
                Hash Cond: (public.lineitem.l_partkey = part.p_partkey)
                Join Filter: (public.lineitem.l_quantity < (SubPlan 1))
                ->  Seq Scan on lineitem  (cost=0.00..68985.69 
rows=2399869 width=17)
                ->  Hash  (cost=2839.00..2839.00 rows=79 width=4)
                      ->  Seq Scan on part  (cost=0.00..2839.00 rows=79 
width=4)
                            Filter: ((p_brand = 'Brand#13'::bpchar) AND 
(p_container = 'JUMBO PKG'::bpchar))
                SubPlan 1
                  ->  Aggregate  (cost=74985.44..74985.46 rows=1 width=5)
                        ->  Seq Scan on lineitem  (cost=0.00..74985.36 
rows=31 width=5)
                              Filter: (l_partkey = part.p_partkey)

manually rewritten variant:

tpch=# explain select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem,
part,
(SELECT l_partkey AS agg_partkey,
                 0.2 * avg(l_quantity) AS avg_quantity
           FROM lineitem GROUP BY l_partkey) part_agg
where
p_partkey = l_partkey
and agg_partkey = l_partkey
and p_brand = 'Brand#13'
and p_container = 'JUMBO PKG'
and l_quantity < avg_quantity
LIMIT 1;
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=179643.88..179643.89 rows=1 width=8)
    ->  Aggregate  (cost=179643.88..179643.89 rows=1 width=8)
          ->  Hash Join  (cost=161865.21..178853.91 rows=315985 width=8)
                Hash Cond: (public.lineitem.l_partkey = part.p_partkey)
                Join Filter: (public.lineitem.l_quantity < ((0.2 * 
avg(public.lineitem.l_quantity))))
                ->  HashAggregate  (cost=80985.04..82148.65 rows=77574 
width=9)
                      ->  Seq Scan on lineitem  (cost=0.00..68985.69 
rows=2399869 width=9)
                ->  Hash  (cost=80849.63..80849.63 rows=2444 width=21)
                      ->  Hash Join  (cost=2839.99..80849.63 rows=2444 
width=21)
                            Hash Cond: (public.lineitem.l_partkey = 
part.p_partkey)
                            ->  Seq Scan on lineitem  
(cost=0.00..68985.69 rows=2399869 width=17)
                            ->  Hash  (cost=2839.00..2839.00 rows=79 
width=4)
                                  ->  Seq Scan on part  
(cost=0.00..2839.00 rows=79 width=4)
                                        Filter: ((p_brand = 
'Brand#13'::bpchar) AND (p_container = 'JUMBO PKG'::bpchar))

Re: Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery))

Tom Lane <tgl@sss.pgh.pa.us> — 2011-07-26T21:37:52Z

Yeb Havinga <yebhavinga@gmail.com> writes:
> A few days ago I read Tomas Vondra's blog post about dss tpc-h queries 
> on PostgreSQL at 
> http://fuzzy.cz/en/articles/dss-tpc-h-benchmark-with-postgresql/ - in 
> which he showed how to manually pull up a dss subquery to get a large 
> speed up. Initially I thought: cool, this is probably now handled by 
> Hitoshi's patch, but it turns out the subquery type in the dss query is 
> different.

Actually, I believe this example is the exact opposite of the
transformation Hitoshi proposes.  Tomas was manually replacing an
aggregated subquery by a reference to a grouped table, which can be
a win if the subquery would be executed enough times to amortize
calculation of the grouped table over all the groups (some of which
might never be demanded by the outer query).  Hitoshi was talking about
avoiding calculations of grouped-table elements that we don't need,
which would be a win in different cases.  Or at least that was the
thrust of his original proposal; I'm not sure where the patch went since
then.

This leads me to think that we need to represent both cases as the same
sort of query and make a cost-based decision as to which way to go.
Thinking of it as a pull-up or push-down transformation is the wrong
approach because those sorts of transformations are done too early to
be able to use cost comparisons.

			regards, tom lane

Re: Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery))

Robert Haas <robertmhaas@gmail.com> — 2011-07-27T14:16:21Z

On Tue, Jul 26, 2011 at 5:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeb Havinga <yebhavinga@gmail.com> writes:
>> A few days ago I read Tomas Vondra's blog post about dss tpc-h queries
>> on PostgreSQL at
>> http://fuzzy.cz/en/articles/dss-tpc-h-benchmark-with-postgresql/ - in
>> which he showed how to manually pull up a dss subquery to get a large
>> speed up. Initially I thought: cool, this is probably now handled by
>> Hitoshi's patch, but it turns out the subquery type in the dss query is
>> different.
>
> Actually, I believe this example is the exact opposite of the
> transformation Hitoshi proposes.  Tomas was manually replacing an
> aggregated subquery by a reference to a grouped table, which can be
> a win if the subquery would be executed enough times to amortize
> calculation of the grouped table over all the groups (some of which
> might never be demanded by the outer query).  Hitoshi was talking about
> avoiding calculations of grouped-table elements that we don't need,
> which would be a win in different cases.  Or at least that was the
> thrust of his original proposal; I'm not sure where the patch went since
> then.
>
> This leads me to think that we need to represent both cases as the same
> sort of query and make a cost-based decision as to which way to go.
> Thinking of it as a pull-up or push-down transformation is the wrong
> approach because those sorts of transformations are done too early to
> be able to use cost comparisons.

I think you're right.  OTOH, our estimates of what will pop out of an
aggregate are so poor that denying the user to control the plan on the
basis of how they write the query might be a net negative.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery))

Yeb Havinga <yebhavinga@gmail.com> — 2011-07-27T14:40:23Z

On 2011-07-27 16:16, Robert Haas wrote:
> On Tue, Jul 26, 2011 at 5:37 PM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>> Yeb Havinga<yebhavinga@gmail.com>  writes:
>>> A few days ago I read Tomas Vondra's blog post about dss tpc-h queries
>>> on PostgreSQL at
>>> http://fuzzy.cz/en/articles/dss-tpc-h-benchmark-with-postgresql/ - in
>>> which he showed how to manually pull up a dss subquery to get a large
>>> speed up. Initially I thought: cool, this is probably now handled by
>>> Hitoshi's patch, but it turns out the subquery type in the dss query is
>>> different.
>> Actually, I believe this example is the exact opposite of the
>> transformation Hitoshi proposes.  Tomas was manually replacing an
>> aggregated subquery by a reference to a grouped table, which can be
>> a win if the subquery would be executed enough times to amortize
>> calculation of the grouped table over all the groups (some of which
>> might never be demanded by the outer query).  Hitoshi was talking about
>> avoiding calculations of grouped-table elements that we don't need,
>> which would be a win in different cases.  Or at least that was the
>> thrust of his original proposal; I'm not sure where the patch went since
>> then.
>>
>> This leads me to think that we need to represent both cases as the same
>> sort of query and make a cost-based decision as to which way to go.
>> Thinking of it as a pull-up or push-down transformation is the wrong
>> approach because those sorts of transformations are done too early to
>> be able to use cost comparisons.
> I think you're right.  OTOH, our estimates of what will pop out of an
> aggregate are so poor that denying the user to control the plan on the
> basis of how they write the query might be a net negative.  :-(
>

Tom and Robert, thank you both for your replies. I think I'm having some 
blind spots and maybe false assumptions regarding the overal work in the 
optimizer, as it is not clear to me what 'the same sort of query' would 
look like. I was under the impression that using cost to select the best 
paths is only done per simple query, and fail to see how a total 
combined plan with pulled up subquery could be compared on cost with a 
total plan where the subquery is still a separate subplan, since the 
range tables / simple-queries to compare are different.

regards,
Yeb

Re: Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery))

Tom Lane <tgl@sss.pgh.pa.us> — 2011-07-27T14:51:13Z

Yeb Havinga <yebhavinga@gmail.com> writes:
> Tom and Robert, thank you both for your replies. I think I'm having some 
> blind spots and maybe false assumptions regarding the overal work in the 
> optimizer, as it is not clear to me what 'the same sort of query' would 
> look like. I was under the impression that using cost to select the best 
> paths is only done per simple query, and fail to see how a total 
> combined plan with pulled up subquery could be compared on cost with a 
> total plan where the subquery is still a separate subplan, since the 
> range tables / simple-queries to compare are different.

Well, you could look at what planagg.c does to decide whether an
indexscan optimization of MIN/MAX is worthwhile, or at the calculations
in planner.c that decide which way to do grouping/aggregation/ordering.

It could be fairly expensive to handle this type of problem because of
the need to cost out two fundamentally different scan/join trees, but
we're assuming the queries are expensive enough to make that worthwhile
...

			regards, tom lane

Re: Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery))

Hitoshi Harada <umi.tanuki@gmail.com> — 2011-07-27T15:34:21Z

2011/7/27 Yeb Havinga <yebhavinga@gmail.com>:
> On 2011-07-22 17:35, Hitoshi Harada wrote:
>>
>> 2011/7/23 Yeb Havinga<yebhavinga@gmail.com>:
>
> A few days ago I read Tomas Vondra's blog post about dss tpc-h queries on
> PostgreSQL at
> http://fuzzy.cz/en/articles/dss-tpc-h-benchmark-with-postgresql/ - in which
> he showed how to manually pull up a dss subquery to get a large speed up.
> Initially I thought: cool, this is probably now handled by Hitoshi's patch,
> but it turns out the subquery type in the dss query is different.
>
> The original and rewritten queries are below. The debug_print_plan output
> shows the subquery is called from a opexpr (< l_quantity, subquery output)
> and the sublink type is EXPR_SUBLINK. Looking at the source code;
> pull_up_sublink only considers ANY and EXISTS sublinks. I'm wondering if
> this could be expanded to deal with EXPR sublinks. Clearly in the example
> Tomas has given this can be done. I'm wondering if there are show stoppers
> that prevent this to be possible in the general case, but can't think of
> any, other than the case of a sublink returning NULL and the opexpr is part
> of a larger OR expression or IS NULL; in which case it should not be pulled
> op, or perhaps it could be pulled up as outer join.
>
> Thoughts?

Good catch. I was not aware of the sublink case so I'm not sure if it
is possible, but I believe it will be worth modifying the optimizer to
handle them in the same way.  Since my latest proposal is based on
parameterized NestLoop, the first step is how to transform the sublink
expression into join. I bet there are chances in simple cases since we
have Semi/Anti Join technique. On the other hand, those pseudo-join
types are easily failing to be transformed to join, in such cases
above like it have another filter clause than join qual expression.

If tpc bechmark can be speed up that's a good use case which you
pointed out I'm missing.

Regards,

-- 
Hitoshi Harada

Re: Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery))

Hitoshi Harada <umi.tanuki@gmail.com> — 2011-07-27T15:50:20Z

2011/7/27 Tom Lane <tgl@sss.pgh.pa.us>:
> Yeb Havinga <yebhavinga@gmail.com> writes:
>> A few days ago I read Tomas Vondra's blog post about dss tpc-h queries
>> on PostgreSQL at
>> http://fuzzy.cz/en/articles/dss-tpc-h-benchmark-with-postgresql/ - in
>> which he showed how to manually pull up a dss subquery to get a large
>> speed up. Initially I thought: cool, this is probably now handled by
>> Hitoshi's patch, but it turns out the subquery type in the dss query is
>> different.
>
> Actually, I believe this example is the exact opposite of the
> transformation Hitoshi proposes. Tomas was manually replacing an
> aggregated subquery by a reference to a grouped table, which can be
> a win if the subquery would be executed enough times to amortize
> calculation of the grouped table over all the groups (some of which
> might never be demanded by the outer query).  Hitoshi was talking about
> avoiding calculations of grouped-table elements that we don't need,
> which would be a win in different cases.  Or at least that was the
> thrust of his original proposal; I'm not sure where the patch went since
> then.

My first proposal which is about pulling up aggregate like sublink
expression is exact opposite of this (Tomas pushed down the sublink
expression to join subquery). But the latest proposal is upon
parameterized NestLoop, so I think my latest patch might help
something for the *second* query. Actually the problem is the same; We
want to reduce grouping operation which is not interesting to the
final output, by filtering other relations expression. In this case,
if the joined lineitem-part relation has very few rows by WHERE
conditions (p_brand, p_container), we don't want calculate avg of huge
lineitem because we know almost all of the avg result is not in the
upper result. However, the current optimizer cannot pass the upper
query's condition (like "it will have only few rows") down to the
lower aggregate query.

> This leads me to think that we need to represent both cases as the same
> sort of query and make a cost-based decision as to which way to go.
> Thinking of it as a pull-up or push-down transformation is the wrong
> approach because those sorts of transformations are done too early to
> be able to use cost comparisons.

Wrapping up my mind above and reading this paragraph, it might be
another work to make sublink expression look like the same as join.
But what we want to solve is the same goal, I think.

Regards,

-- 
Hitoshi Harada