Thread

  1. Parameterized aggregate subquery (was: Pull up aggregate subquery)

    Hitoshi Harada <umi.tanuki@gmail.com> — 2011-06-09T06:28:22Z

    Purpose & Goal
    --------------
    Allow planner to create NestLoop with parameterized aggregate subquery
    just like inner IndexScan pattern. This helps to avoid unnecessary
    aggregate process that would be filtered at the stage of upper join
    filter in such case:
    
    create table size_m as select i as id, repeat(i::text, i % 100) as val
    from generate_series(1, 20000) i;
    create table size_l as select i as id, m.id as m_id, repeat(i::text, i
    % 100) as val from generate_series(1, 1000000) i, size_m m where i.i /
    10 = m.id;
    analyze size_m;
    analyze size_l;
    ---- make this query faster
    select m_id, sum_len from size_m m inner join(select m_id,
    sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
    l.m_id where val = '10101';
    
    In addition, it might be the first step to take account of "general
    parameterized scan" design, although this proposal contains only
    narrow cases of aggregate subquery.
    
    
    Related information
    -------------------
    This work is the next step of the previously proposed "Pull up
    aggregate subqery" thread.
    http://archives.postgresql.org/message-id/BANLkTimW-EqS_9hk5AYj14R8U%3DuQoc6tuA%40mail.gmail.com
    
    
    Design
    ------
    In contract to the previous design, I made aggregate subquery
    "parameterized" instead of pulling it up. The design is based on the
    discussion with Tom Lane and Rober Haas et al. It has some benefits
    compared with the pulling up appoach as, 1) parameterizing any scan
    node other than index scan as nest loop's inner is our way to go, 2)
    pulling up or pushing down any aggregate query has potential risk that
    the aggregate results are wrong (, which may be solvable by adding
    some constraints like "only when unique" etc,) 3) the decision whether
    to make parameterized or not can be made by only cost without any
    heuristics.
    
    The idea is very similar to inner IndexScan. When NestLoop path is
    created in match_unsorted_outer(), parameterized SubqueryScan path is
    also considered, similarly to inner IndexScan paths. While IndexScan
    is simple since its information like costs are well known by the base
    relation, SubqueryScan should re-plan its Query to gain that, which is
    expensive.
    
    To do so, I added SubqueryPath node to store each subquery
    information. Before patching, subplan, subrtable and subrowmark is
    stored in RelOptInfo, but now we need to save them separately for each
    parameterized one. This part might be split from the original patch
    for easy review.
    
    Result
    ------
    Without breakage of regression test, it successfully makes subquery as
    parameterized as the inner of NestLoop. Query performance is as aimed.
    
    (without patch)
    db1=# explain analyze
    select m_id, sum_len from size_m m inner join(select m_id,
    sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
    l.m_id where val = '10101';
                                                                QUERY PLAN
    -----------------------------------------------------------------------------------------------------------------------------------
     Nested Loop  (cost=79393.64..82937.42 rows=100 width=12) (actual
    time=1256.569..1733.903 rows=1 loops=1)
       Join Filter: (m.id = size_l.m_id)
       ->  Seq Scan on size_m m  (cost=0.00..897.00 rows=1 width=4)
    (actual time=25.182..32.237 rows=1 loops=1)
             Filter: (val = '10101'::text)
       ->  GroupAggregate  (cost=79393.64..81592.65 rows=19901 width=277)
    (actual time=772.791..1694.848 rows=20000 loops=1)
             ->  Sort  (cost=79393.64..79893.64 rows=200000 width=277)
    (actual time=772.754..929.225 rows=200000 loops=1)
                   Sort Key: size_l.m_id
                   Sort Method: external sort  Disk: 56848kB
                   ->  Seq Scan on size_l  (cost=0.00..9830.00 rows=200000
    width=277) (actual time=0.030..198.093 rows=200000 loops=1)
     Total runtime: 1754.111 ms
    (10 rows)
    
    (with patch)
    db1=# explain analyze
    select m_id, sum_len from size_m m inner join(select m_id,
    sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
    l.m_id where val = '10101';
                                                          QUERY PLAN
    -----------------------------------------------------------------------------------------------------------------------
     Nested Loop  (cost=0.00..11674.86 rows=100 width=12) (actual
    time=119.386..122.478 rows=1 loops=1)
       Join Filter: (m.id = size_l.m_id)
       ->  Seq Scan on size_m m  (cost=0.00..897.00 rows=1 width=4)
    (actual time=9.720..12.811 rows=1 loops=1)
             Filter: (val = '10101'::text)
       ->  GroupAggregate  (cost=0.00..10330.08 rows=1 width=277) (actual
    time=109.639..109.640 rows=1 loops=1)
             ->  Seq Scan on size_l  (cost=0.00..10330.00 rows=10
    width=277) (actual time=59.138..109.611 rows=10 loops=1)
                   Filter: (m.id = m_id)
     Total runtime: 122.612 ms
    (8 rows)
    
    I tested some more cases like "where val IN ('1', '10101')", "where
    val between '1' and '10101'", which shows as good as I expected.
    
    Concern
    -------
    Although execution time will be shorter in many cases, planning time
    will be longer. Especially re-planning subquery with pushing join
    quals for each joinrel step. I didn't measure the additional cost in
    planner stage.
    
    
    BTW, as I changed title and design from the previous post, should I
    throw away the old commit fest entry and make the new one?
    
    Regards,
    
    -- 
    Hitoshi Harada
    
  2. Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

    Robert Haas <robertmhaas@gmail.com> — 2011-06-09T14:32:27Z

    On Thu, Jun 9, 2011 at 2:28 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
    > BTW, as I changed title and design from the previous post, should I
    > throw away the old commit fest entry and make the new one?
    
    Nah, just edit the existing entry and change the title.
    
    Also add a link to the new patch, of course.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  3. Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

    Hitoshi Harada <umi.tanuki@gmail.com> — 2011-06-09T16:02:33Z

    2011/6/9 Robert Haas <robertmhaas@gmail.com>:
    > On Thu, Jun 9, 2011 at 2:28 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
    >> BTW, as I changed title and design from the previous post, should I
    >> throw away the old commit fest entry and make the new one?
    >
    > Nah, just edit the existing entry and change the title.
    >
    > Also add a link to the new patch, of course.
    
    Ok, done.
    
    
    -- 
    Hitoshi Harada
    
    
  4. Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

    Hitoshi Harada <umi.tanuki@gmail.com> — 2011-06-17T07:54:06Z

    2011/6/10 Hitoshi Harada <umi.tanuki@gmail.com>:
    > 2011/6/9 Robert Haas <robertmhaas@gmail.com>:
    >> On Thu, Jun 9, 2011 at 2:28 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
    >>> BTW, as I changed title and design from the previous post, should I
    >>> throw away the old commit fest entry and make the new one?
    >>
    >> Nah, just edit the existing entry and change the title.
    >>
    >> Also add a link to the new patch, of course.
    >
    > Ok, done.
    
    While reviewing the gist/box patch, I found some planner APIs that can
    replace parts in my patch. Also, comments in includes wasn't updated
    appropriately. Revised patch attached.
    
    Regards,
    
    -- 
    Hitoshi Harada
    
  5. Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

    Yeb Havinga <yebhavinga@gmail.com> — 2011-06-29T13:44:18Z

    On 2011-06-17 09:54, Hitoshi Harada wrote:
    > While reviewing the gist/box patch, I found some planner APIs that can
    > replace parts in my patch. Also, comments in includes wasn't updated
    > appropriately. Revised patch attached.
    
    Hello Hitoshi-san,
    
    I read your latest patch implementing parameterizing subquery scans.
    
    1)
    In the email from june 9 with the patch You wrote: "While IndexScan
    is simple since its information like costs are well known by the base
    relation, SubqueryScan should re-plan its Query to gain that, which is
    expensive."
    
    Initial concerns I had were caused by misinterpreting 'replanning' as: 
    for each outer tuple, replan the subquery (it sounds a bit like 
    'ReScan'). Though the general comments in the patch are helpful, I think 
    it would be an improvement to describe why subqueries are planned more 
    than once, i.e. something like
    "best_inner_subqueryscan
        Try to find a better subqueryscan path and its associated plan for 
    each join clause that can be pushed down, in addition to the path that 
    is already calculated by set_subquery_pathlist."
    
    The consequences of having multiple subquery paths and plans for each 
    new subquery path makes the bulk of the remainder of the patch.
    
    2)
    Since 'subquery_is_pushdown_safe' is invariant under join clause push 
    down, it might be possible to have it called only once in 
    set_subquery_pathlist, i.e. by only attaching rel->preprocessed_subquery 
    if the subquery_is_pushdown_safe.
    
    3)
    /*
      * set_subquery_pathlist
      *        Build the (single) access path for a subquery RTE
      */
    This unchanged comment is still correct, but 'the (single) access path' 
    might have become a bit misleading now there are multiple paths 
    possible, though not by set_subquery_pathlist.
    
    
    4) somewhere in the patch
    s/regsitered/registered/
    
    5) Regression tests are missing; I think at this point they'd aid in 
    speeding up development/test cycles.
    
    6) Before patching Postgres, I could execute the test with the size_l 
    and size_m tables, after patching against current git HEAD (patch 
    without errors), I get the following error when running the example query:
    ERROR:  plan should not reference subplan's variable
    
    I get the same error with the version from june 9 with current git HEAD.
    
    7) Since both set_subquery_pathlist and best_inner_subqueryscan push 
    down clauses, as well as add a path and subplan, could a generalized 
    function be made to support both, to reduce duplicate code?
    
    regards,
    Yeb Havinga
    
    -- 
    Yeb Havinga
    http://www.mgrid.net/
    Mastering Medical Data
    
    
    
  6. Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

    Hitoshi Harada <umi.tanuki@gmail.com> — 2011-06-29T17:22:07Z

    2011/6/29 Yeb Havinga <yebhavinga@gmail.com>:
    >
    > On 2011-06-17 09:54, Hitoshi Harada wrote:
    >>
    >> While reviewing the gist/box patch, I found some planner APIs that can
    >> replace parts in my patch. Also, comments in includes wasn't updated
    >> appropriately. Revised patch attached.
    >
    > Hello Hitoshi-san,
    
    Hi Yeb,
    
    > I read your latest patch implementing parameterizing subquery scans.
    
    > 6) Before patching Postgres, I could execute the test with the size_l and
    > size_m tables, after patching against current git HEAD (patch without
    > errors), I get the following error when running the example query:
    > ERROR:  plan should not reference subplan's variable
    >
    > I get the same error with the version from june 9 with current git HEAD.
    
    It seems that something has changed since I developed the patch at
    first. The last one and the one before are not so different with each
    other, especially in terms of runtime but might not be tested enough.
    I tried time-slip of the local git branch to around june 10, but the
    same error occurs. The error message itself is not in parts changed
    recently, so I guess some surrounding change would affect it.
    
    > 7) Since both set_subquery_pathlist and best_inner_subqueryscan push down
    > clauses, as well as add a path and subplan, could a generalized function be
    > made to support both, to reduce duplicate code?
    
    I tried it but decided as it is, for the future extensibility. The
    subquery parameterizing will (can) be extended more complicatedly. I'm
    not sure if we'd better gathering some small parts into one by
    throwing future capability.
    
    Other things are all good points. Thanks for elaborate review!
    More than anything, I'm going to fix the 6) issue, at least to find the cause.
    
    Regards,
    -- 
    Hitoshi Harada
    
    
  7. Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

    Yeb Havinga <yebhavinga@gmail.com> — 2011-06-30T07:39:39Z

    On 2011-06-29 19:22, Hitoshi Harada wrote:
    > Other things are all good points. Thanks for elaborate review!
    > More than anything, I'm going to fix the 6) issue, at least to find the cause.
    >
    Some more questions:
    8) why are cheapest start path and cheapest total path in 
    best_inner_subqueryscan the same?
    
    9) as remarked up a different thread, more than one clause could be 
    pushed down a subquery. The current patch only considers alternative 
    paths that have at most one clause pushed down. Is this because of the 
    call site of best_inner_subqueryscan, i.e. under one join rel? Would it 
    be an idea to consider, for each subquery, only a single alternative 
    plan that tries to have all suitable clauses pushed down?
    
    10) I have a hard time imagining use cases that will actually result in 
    a alternative plan, especially since not all subqueries are allowed to 
    have quals pushed down into, and like Simon Riggs pointed out that many 
    users will write queries like this with the subqueries pulled up. If it 
    is the case that the subqueries that can't be pulled up have a large 
    overlap with the ones that are not pushdown safe (limit, set operations 
    etc), there might be little actual use cases for this patch.
    
    I think the most important thing for this patch to go forward is to have 
    a few examples, from which it's clear that the patch is beneficial.
    
    regards,
    
    -- 
    
    Yeb Havinga
    http://www.mgrid.net/
    Mastering Medical Data
    
    
    
  8. Re: Parameterized aggregate subquery

    Yeb Havinga <yebhavinga@gmail.com> — 2011-06-30T08:03:22Z

    On 2011-06-30 09:39, Yeb Havinga wrote:
    >
    > 9) as remarked up a different thread, more than one clause could be 
    > pushed down a subquery. The current patch only considers alternative 
    > paths that have at most one clause pushed down. Is this because of the 
    > call site of best_inner_subqueryscan, i.e. under one join rel? Would 
    > it be an idea to consider, for each subquery, only a single 
    > alternative plan that tries to have all suitable clauses pushed down?
    
    Ehm, please forget this remark, I've mistaken join rel for base rel.
    
    -- Yeb
    
    
    
  9. Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

    Hitoshi Harada <umi.tanuki@gmail.com> — 2011-06-30T08:37:43Z

    2011/6/30 Yeb Havinga <yebhavinga@gmail.com>:
    > On 2011-06-29 19:22, Hitoshi Harada wrote:
    >>
    >> Other things are all good points. Thanks for elaborate review!
    >> More than anything, I'm going to fix the 6) issue, at least to find the
    >> cause.
    >>
    > Some more questions:
    > 8) why are cheapest start path and cheapest total path in
    > best_inner_subqueryscan the same?
    
    Because best_inner_indexscan has the two. Actually one of them is
    enough so far. I aligned it as the existing interface but they might
    be one.
    
    > 10) I have a hard time imagining use cases that will actually result in a
    > alternative plan, especially since not all subqueries are allowed to have
    > quals pushed down into, and like Simon Riggs pointed out that many users
    > will write queries like this with the subqueries pulled up. If it is the
    > case that the subqueries that can't be pulled up have a large overlap with
    > the ones that are not pushdown safe (limit, set operations etc), there might
    > be little actual use cases for this patch.
    
    I have seen many cases that this planner hack would help
    significantly, which were difficult to rewrite. Why were they
    difficult to write? Because, quals on size_m (and they have quals on
    size_l in fact) are usually very complicated (5-10 op clauses) and the
    join+agg part itself is kind of subquery in other big query. Of course
    there were workaround like split the statement to two, filtering
    size_m then aggregate size_l by the result of the first statement.
    However, it's against instinct. The reason why planner is in RDBMS is
    to let users to write simple (as needed) statements. I don't know if
    the example I raise here is common or not, but I believe the example
    represents "one to many" relation simply, therefore there should be
    many users who just don't find themselves currently in the slow query
    performance.
    
    > I think the most important thing for this patch to go forward is to have a
    > few examples, from which it's clear that the patch is beneficial.
    
    What will be good examples to show benefit of the patch? I guess the
    test case of size_m/size_l shows it. What lacks on the case, do you
    think?
    
    Regards,
    
    
    -- 
    Hitoshi Harada
    
    
  10. Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

    Hitoshi Harada <umi.tanuki@gmail.com> — 2011-07-02T07:48:05Z

    2011/6/29 Yeb Havinga <yebhavinga@gmail.com>:
    >
    > On 2011-06-17 09:54, Hitoshi Harada wrote:
    >>
    >> While reviewing the gist/box patch, I found some planner APIs that can
    >> replace parts in my patch. Also, comments in includes wasn't updated
    >> appropriately. Revised patch attached.
    >
    > Hello Hitoshi-san,
    >
    > I read your latest patch implementing parameterizing subquery scans.
    
    Attached is revised version.
    
    > 1)
    > In the email from june 9 with the patch You wrote: "While IndexScan
    > is simple since its information like costs are well known by the base
    > relation, SubqueryScan should re-plan its Query to gain that, which is
    > expensive."
    >
    > Initial concerns I had were caused by misinterpreting 'replanning' as: for
    > each outer tuple, replan the subquery (it sounds a bit like 'ReScan').
    > Though the general comments in the patch are helpful, I think it would be an
    > improvement to describe why subqueries are planned more than once, i.e.
    > something like
    > "best_inner_subqueryscan
    >   Try to find a better subqueryscan path and its associated plan for each
    > join clause that can be pushed down, in addition to the path that is already
    > calculated by set_subquery_pathlist."
    
    I changed comments around set_subquery_pathlist and best_inner_subqueryscan.
    
    > 2)
    > Since 'subquery_is_pushdown_safe' is invariant under join clause push down,
    > it might be possible to have it called only once in set_subquery_pathlist,
    > i.e. by only attaching rel->preprocessed_subquery if the
    > subquery_is_pushdown_safe.
    
    I modified as you suggested.
    
    > 3)
    > /*
    >  * set_subquery_pathlist
    >  *        Build the (single) access path for a subquery RTE
    >  */
    > This unchanged comment is still correct, but 'the (single) access path'
    > might have become a bit misleading now there are multiple paths possible,
    > though not by set_subquery_pathlist.
    
    As noted #1.
    
    > 4) somewhere in the patch
    > s/regsitered/registered/
    
    Fixed.
    
    > 5) Regression tests are missing; I think at this point they'd aid in
    > speeding up development/test cycles.
    
    I'm still thinking about it. I can add complex test but the concept of
    regression test focuses on small pieces of simple cases. I don't want
    take pg_regress much more than before.
    
    > 6) Before patching Postgres, I could execute the test with the size_l and
    > size_m tables, after patching against current git HEAD (patch without
    > errors), I get the following error when running the example query:
    > ERROR:  plan should not reference subplan's variable
    >
    > I get the same error with the version from june 9 with current git HEAD.
    
    Fixed. Some variable initializing was wrong.
    
    > 7) Since both set_subquery_pathlist and best_inner_subqueryscan push down
    > clauses, as well as add a path and subplan, could a generalized function be
    > made to support both, to reduce duplicate code?
    
    No touch as answered before.
    
    Although I still need to think about suitable regression test case,
    the patch itself can be reviewed again. You may want to try some
    additional tests as you imagine after finding my test case gets
    quicker.
    
    Thanks,
    
    -- 
    Hitoshi Harada
    
    
  11. Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

    Hitoshi Harada <umi.tanuki@gmail.com> — 2011-07-02T08:02:07Z

    2011/7/2 Hitoshi Harada <umi.tanuki@gmail.com>:
    > 2011/6/29 Yeb Havinga <yebhavinga@gmail.com>:
    >>
    >> On 2011-06-17 09:54, Hitoshi Harada wrote:
    >>>
    >>> While reviewing the gist/box patch, I found some planner APIs that can
    >>> replace parts in my patch. Also, comments in includes wasn't updated
    >>> appropriately. Revised patch attached.
    >>
    >> Hello Hitoshi-san,
    >>
    >> I read your latest patch implementing parameterizing subquery scans.
    >
    > Attached is revised version.
    
    I failed to attached the patch. I'm trying again.
    
    >> 1)
    >> In the email from june 9 with the patch You wrote: "While IndexScan
    >> is simple since its information like costs are well known by the base
    >> relation, SubqueryScan should re-plan its Query to gain that, which is
    >> expensive."
    >>
    >> Initial concerns I had were caused by misinterpreting 'replanning' as: for
    >> each outer tuple, replan the subquery (it sounds a bit like 'ReScan').
    >> Though the general comments in the patch are helpful, I think it would be an
    >> improvement to describe why subqueries are planned more than once, i.e.
    >> something like
    >> "best_inner_subqueryscan
    >>   Try to find a better subqueryscan path and its associated plan for each
    >> join clause that can be pushed down, in addition to the path that is already
    >> calculated by set_subquery_pathlist."
    >
    > I changed comments around set_subquery_pathlist and best_inner_subqueryscan.
    >
    >> 2)
    >> Since 'subquery_is_pushdown_safe' is invariant under join clause push down,
    >> it might be possible to have it called only once in set_subquery_pathlist,
    >> i.e. by only attaching rel->preprocessed_subquery if the
    >> subquery_is_pushdown_safe.
    >
    > I modified as you suggested.
    >
    >> 3)
    >> /*
    >>  * set_subquery_pathlist
    >>  *        Build the (single) access path for a subquery RTE
    >>  */
    >> This unchanged comment is still correct, but 'the (single) access path'
    >> might have become a bit misleading now there are multiple paths possible,
    >> though not by set_subquery_pathlist.
    >
    > As noted #1.
    >
    >> 4) somewhere in the patch
    >> s/regsitered/registered/
    >
    > Fixed.
    >
    >> 5) Regression tests are missing; I think at this point they'd aid in
    >> speeding up development/test cycles.
    >
    > I'm still thinking about it. I can add complex test but the concept of
    > regression test focuses on small pieces of simple cases. I don't want
    > take pg_regress much more than before.
    >
    >> 6) Before patching Postgres, I could execute the test with the size_l and
    >> size_m tables, after patching against current git HEAD (patch without
    >> errors), I get the following error when running the example query:
    >> ERROR:  plan should not reference subplan's variable
    >>
    >> I get the same error with the version from june 9 with current git HEAD.
    >
    > Fixed. Some variable initializing was wrong.
    >
    >> 7) Since both set_subquery_pathlist and best_inner_subqueryscan push down
    >> clauses, as well as add a path and subplan, could a generalized function be
    >> made to support both, to reduce duplicate code?
    >
    > No touch as answered before.
    >
    > Although I still need to think about suitable regression test case,
    > the patch itself can be reviewed again. You may want to try some
    > additional tests as you imagine after finding my test case gets
    > quicker.
    >
    > Thanks,
    >
    > --
    > Hitoshi Harada
    >
    
    
    
    -- 
    Hitoshi Harada
    
  12. Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

    Yeb Havinga <yebhavinga@gmail.com> — 2011-07-05T11:15:44Z

    Hello Hitosh, list,
    
    >
    > > Attached is revised version.
    >
    > I failed to attached the patch. I'm trying again.
    >
    > I'm currently unable to test, since on holiday. I'm happy to continue
    testing once returned but it may not be within the bounds of the current
    commitfest, sorry.
    
    
    > >> 5) Regression tests are missing; I think at this point they'd aid in
    > >> speeding up development/test cycles.
    > >
    > > I'm still thinking about it. I can add complex test but the concept of
    > > regression test focuses on small pieces of simple cases. I don't want
    > > take pg_regress much more than before.
    >
    
    IMHO, at least one query where the new code is touched is a good idea.
    
    >> I get the same error with the version from june 9 with current git HEAD.
    > >
    > > Fixed. Some variable initializing was wrong.
    >
    
    Thanks - will test when I am back from holiday and other duties.
    
    regards,
    Yeb
    
  13. Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

    Hitoshi Harada <umi.tanuki@gmail.com> — 2011-07-05T11:43:46Z

    2011/7/5 Yeb Havinga <yebhavinga@gmail.com>:
    > Hello Hitosh, list,
    >
    >> >
    >> > Attached is revised version.
    >>
    >> I failed to attached the patch. I'm trying again.
    >>
    > I'm currently unable to test, since on holiday. I'm happy to continue
    > testing once returned but it may not be within the bounds of the current
    > commitfest, sorry.
    
    Thanks for replying. Regarding the time remained until the end of this
    commitfest, I'd think we should mark this item "Returned with
    Feedback" if no objection appears. I will be very happy if Yeb tests
    my latest patch after he comes back. I'll make up my mind around
    regression test.
    
    Regards,
    
    -- 
    Hitoshi Harada
    
    
  14. Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

    Hitoshi Harada <umi.tanuki@gmail.com> — 2011-07-09T14:23:20Z

    2011/7/5 Hitoshi Harada <umi.tanuki@gmail.com>:
    > 2011/7/5 Yeb Havinga <yebhavinga@gmail.com>:
    >> Hello Hitosh, list,
    >>
    >>> >
    >>> > Attached is revised version.
    >>>
    >>> I failed to attached the patch. I'm trying again.
    >>>
    >> I'm currently unable to test, since on holiday. I'm happy to continue
    >> testing once returned but it may not be within the bounds of the current
    >> commitfest, sorry.
    >
    > Thanks for replying. Regarding the time remained until the end of this
    > commitfest, I'd think we should mark this item "Returned with
    > Feedback" if no objection appears. I will be very happy if Yeb tests
    > my latest patch after he comes back. I'll make up my mind around
    > regression test.
    
    It seems that Yeb marked "Returned with Feedback". Thanks for the
    review again. Still, I'd appreciate if further review is done on my
    latest patch.
    
    Thanks,
    -- 
    Hitoshi Harada
    
    
  15. Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

    Yeb Havinga <yebhavinga@gmail.com> — 2011-07-09T15:45:34Z

    On 2011-07-09 16:23, Hitoshi Harada wrote:
    > 2011/7/5 Hitoshi Harada<umi.tanuki@gmail.com>:
    >> 2011/7/5 Yeb Havinga<yebhavinga@gmail.com>:
    >>> Hello Hitosh, list,
    >>>
    >>>>> Attached is revised version.
    >>>> I failed to attached the patch. I'm trying again.
    >>>>
    >>> I'm currently unable to test, since on holiday. I'm happy to continue
    >>> testing once returned but it may not be within the bounds of the current
    >>> commitfest, sorry.
    >> Thanks for replying. Regarding the time remained until the end of this
    >> commitfest, I'd think we should mark this item "Returned with
    >> Feedback" if no objection appears. I will be very happy if Yeb tests
    >> my latest patch after he comes back. I'll make up my mind around
    >> regression test.
    > It seems that Yeb marked "Returned with Feedback". Thanks for the
    > review again. Still, I'd appreciate if further review is done on my
    > latest patch.
    
    Yes, I did so after you suggested to return it to feedback. I'll review 
    your latest patch as soon as there is enough time to do a proper review, 
    which is probably after next week.
    
    regards,
    Yeb Havinga
    
    
    
  16. Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

    Yeb Havinga <yebhavinga@gmail.com> — 2011-07-22T13:13:01Z

    On 2011-07-02 10:02, Hitoshi Harada wrote:
    >
    >> Although I still need to think about suitable regression test case,
    >> the patch itself can be reviewed again. You may want to try some
    >> additional tests as you imagine after finding my test case gets
    >> quicker.
    
    Hello Hitoshi-san,
    
    I took a look at your latest patch and it looks good, no comments. 
    However I also tried it against current 9.2 HEAD and the test query of 
    the start of this thread.
    
    Before and after applying the patch, I get the same result for the test 
    query.
    
    postgres=# explain select m_id, sum_len from size_m m inner join(select 
    m_id,
    sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
    l.m_id where val = '10101';
                                         QUERY PLAN
    ----------------------------------------------------------------------------------
      Nested Loop  (cost=79392.64..82938.05 rows=100 width=12)
        Join Filter: (m.id = size_l.m_id)
        ->  Seq Scan on size_m m  (cost=0.00..897.00 rows=1 width=4)
              Filter: (val = '10101'::text)
        ->  GroupAggregate  (cost=79392.64..81592.15 rows=19951 width=277)
              ->  Sort  (cost=79392.64..79892.64 rows=200000 width=277)
                    Sort Key: size_l.m_id
                    ->  Seq Scan on size_l  (cost=0.00..9829.00 rows=200000 
    width=277)
    
    I double checked that I had applied the patch (git diff shows the 
    patch), installed and restarted postgres. The database is a fresh 
    created database with no edits in postgresql.conf.
    
    regards,
    
    -- 
    Yeb Havinga
    http://www.mgrid.net/
    Mastering Medical Data
    
    
    
  17. Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

    Hitoshi Harada <umi.tanuki@gmail.com> — 2011-07-22T14:17:01Z

    2011/7/22 Yeb Havinga <yebhavinga@gmail.com>:
    > On 2011-07-02 10:02, Hitoshi Harada wrote:
    >>
    >>> Although I still need to think about suitable regression test case,
    >>> the patch itself can be reviewed again. You may want to try some
    >>> additional tests as you imagine after finding my test case gets
    >>> quicker.
    >
    > Hello Hitoshi-san,
    
    Hi,
    
    > I double checked that I had applied the patch (git diff shows the patch),
    > installed and restarted postgres. The database is a fresh created database
    > with no edits in postgresql.conf.
    
    :(
    I updated the patch. Could you try attached once more? The "issafe"
    switch seems wrong.
    
    Thanks,
    -- 
    Hitoshi Harada
    
  18. Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

    Yeb Havinga <yebhavinga@gmail.com> — 2011-07-22T15:16:30Z

    On 2011-07-22 16:17, Hitoshi Harada wrote:
    >
    > :(
    > I updated the patch. Could you try attached once more? The "issafe"
    > switch seems wrong.
    Works like a charm :-). However, now there is always a copyObject of a 
    subquery even when the subquery is not safe for qual pushdown. The 
    problem with the previous issafe was that it was only assigned for 
    rel->baserestrictinfo != NIL. If it is assigned before the if statement, 
    it still works. See attached patch that avoids subquery copy for unsafe 
    subqueries, and also exits best_inner_subqueryscan before palloc of 
    differenttypes in case of unsafe queries.
    
    regards,
    -- 
    
    Yeb Havinga
    http://www.mgrid.net/
    Mastering Medical Data
    
    
  19. Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)

    Hitoshi Harada <umi.tanuki@gmail.com> — 2011-07-22T15:35:40Z

    2011/7/23 Yeb Havinga <yebhavinga@gmail.com>:
    > On 2011-07-22 16:17, Hitoshi Harada wrote:
    >>
    >> :(
    >> I updated the patch. Could you try attached once more? The "issafe"
    >> switch seems wrong.
    >
    > Works like a charm :-). However, now there is always a copyObject of a
    > subquery even when the subquery is not safe for qual pushdown. The problem
    > with the previous issafe was that it was only assigned for
    > rel->baserestrictinfo != NIL. If it is assigned before the if statement, it
    > still works. See attached patch that avoids subquery copy for unsafe
    > subqueries, and also exits best_inner_subqueryscan before palloc of
    > differenttypes in case of unsafe queries.
    
    Ah, yeah, right. Too quick fix bloated my brain :P Thanks for testing!
    I'll check it more.
    
    
    -- 
    Hitoshi Harada
    
    
  20. Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery))

    Yeb Havinga <yebhavinga@gmail.com> — 2011-07-26T19:32:11Z

    On 2011-07-22 17:35, Hitoshi Harada wrote:
    > 2011/7/23 Yeb Havinga<yebhavinga@gmail.com>:
    >> Works like a charm :-). However, now there is always a copyObject of a
    >> subquery even when the subquery is not safe for qual pushdown. The problem
    >> with the previous issafe was that it was only assigned for
    >> rel->baserestrictinfo != NIL. If it is assigned before the if statement, it
    >> still works. See attached patch that avoids subquery copy for unsafe
    >> subqueries, and also exits best_inner_subqueryscan before palloc of
    >> differenttypes in case of unsafe queries.
    >
    > Ah, yeah, right. Too quick fix bloated my brain :P Thanks for testing!
    > I'll check it more.
    
    A few days ago I read Tomas Vondra's blog post about dss tpc-h queries 
    on PostgreSQL at 
    http://fuzzy.cz/en/articles/dss-tpc-h-benchmark-with-postgresql/ - in 
    which he showed how to manually pull up a dss subquery to get a large 
    speed up. Initially I thought: cool, this is probably now handled by 
    Hitoshi's patch, but it turns out the subquery type in the dss query is 
    different.
    
    The original and rewritten queries are below. The debug_print_plan 
    output shows the subquery is called from a opexpr (< l_quantity, 
    subquery output) and the sublink type is EXPR_SUBLINK. Looking at the 
    source code; pull_up_sublink only considers ANY and EXISTS sublinks. I'm 
    wondering if this could be expanded to deal with EXPR sublinks. Clearly 
    in the example Tomas has given this can be done. I'm wondering if there 
    are show stoppers that prevent this to be possible in the general case, 
    but can't think of any, other than the case of a sublink returning NULL 
    and the opexpr is part of a larger OR expression or IS NULL; in which 
    case it should not be pulled op, or perhaps it could be pulled up as 
    outer join.
    
    Thoughts?
    
    regards,
    Yeb
    
    
    The original query:
    
    tpch=# explain select
    sum(l_extendedprice) / 7.0 as avg_yearly
    from
    lineitem,
    part
    where
    p_partkey = l_partkey
    and p_brand = 'Brand#13'
    and p_container = 'JUMBO PKG'
    and l_quantity < (
    select
    0.2 * avg(l_quantity)
    from
    lineitem
    where
    l_partkey = p_partkey
    )
    LIMIT 1;
                                                      QUERY PLAN
    ------------------------------------------------------------------------------------------------------------
      Limit  (cost=183345309.79..183345309.81 rows=1 width=8)
        ->  Aggregate  (cost=183345309.79..183345309.81 rows=1 width=8)
              ->  Hash Join  (cost=2839.99..183345307.76 rows=815 width=8)
                    Hash Cond: (public.lineitem.l_partkey = part.p_partkey)
                    Join Filter: (public.lineitem.l_quantity < (SubPlan 1))
                    ->  Seq Scan on lineitem  (cost=0.00..68985.69 
    rows=2399869 width=17)
                    ->  Hash  (cost=2839.00..2839.00 rows=79 width=4)
                          ->  Seq Scan on part  (cost=0.00..2839.00 rows=79 
    width=4)
                                Filter: ((p_brand = 'Brand#13'::bpchar) AND 
    (p_container = 'JUMBO PKG'::bpchar))
                    SubPlan 1
                      ->  Aggregate  (cost=74985.44..74985.46 rows=1 width=5)
                            ->  Seq Scan on lineitem  (cost=0.00..74985.36 
    rows=31 width=5)
                                  Filter: (l_partkey = part.p_partkey)
    
    manually rewritten variant:
    
    tpch=# explain select
    sum(l_extendedprice) / 7.0 as avg_yearly
    from
    lineitem,
    part,
    (SELECT l_partkey AS agg_partkey,
                     0.2 * avg(l_quantity) AS avg_quantity
               FROM lineitem GROUP BY l_partkey) part_agg
    where
    p_partkey = l_partkey
    and agg_partkey = l_partkey
    and p_brand = 'Brand#13'
    and p_container = 'JUMBO PKG'
    and l_quantity < avg_quantity
    LIMIT 1;
                                                            QUERY PLAN
    ------------------------------------------------------------------------------------------------------------------------
      Limit  (cost=179643.88..179643.89 rows=1 width=8)
        ->  Aggregate  (cost=179643.88..179643.89 rows=1 width=8)
              ->  Hash Join  (cost=161865.21..178853.91 rows=315985 width=8)
                    Hash Cond: (public.lineitem.l_partkey = part.p_partkey)
                    Join Filter: (public.lineitem.l_quantity < ((0.2 * 
    avg(public.lineitem.l_quantity))))
                    ->  HashAggregate  (cost=80985.04..82148.65 rows=77574 
    width=9)
                          ->  Seq Scan on lineitem  (cost=0.00..68985.69 
    rows=2399869 width=9)
                    ->  Hash  (cost=80849.63..80849.63 rows=2444 width=21)
                          ->  Hash Join  (cost=2839.99..80849.63 rows=2444 
    width=21)
                                Hash Cond: (public.lineitem.l_partkey = 
    part.p_partkey)
                                ->  Seq Scan on lineitem  
    (cost=0.00..68985.69 rows=2399869 width=17)
                                ->  Hash  (cost=2839.00..2839.00 rows=79 
    width=4)
                                      ->  Seq Scan on part  
    (cost=0.00..2839.00 rows=79 width=4)
                                            Filter: ((p_brand = 
    'Brand#13'::bpchar) AND (p_container = 'JUMBO PKG'::bpchar))
    
    
    
    
  21. Re: Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery))

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-07-26T21:37:52Z

    Yeb Havinga <yebhavinga@gmail.com> writes:
    > A few days ago I read Tomas Vondra's blog post about dss tpc-h queries 
    > on PostgreSQL at 
    > http://fuzzy.cz/en/articles/dss-tpc-h-benchmark-with-postgresql/ - in 
    > which he showed how to manually pull up a dss subquery to get a large 
    > speed up. Initially I thought: cool, this is probably now handled by 
    > Hitoshi's patch, but it turns out the subquery type in the dss query is 
    > different.
    
    Actually, I believe this example is the exact opposite of the
    transformation Hitoshi proposes.  Tomas was manually replacing an
    aggregated subquery by a reference to a grouped table, which can be
    a win if the subquery would be executed enough times to amortize
    calculation of the grouped table over all the groups (some of which
    might never be demanded by the outer query).  Hitoshi was talking about
    avoiding calculations of grouped-table elements that we don't need,
    which would be a win in different cases.  Or at least that was the
    thrust of his original proposal; I'm not sure where the patch went since
    then.
    
    This leads me to think that we need to represent both cases as the same
    sort of query and make a cost-based decision as to which way to go.
    Thinking of it as a pull-up or push-down transformation is the wrong
    approach because those sorts of transformations are done too early to
    be able to use cost comparisons.
    
    			regards, tom lane
    
    
  22. Re: Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery))

    Robert Haas <robertmhaas@gmail.com> — 2011-07-27T14:16:21Z

    On Tue, Jul 26, 2011 at 5:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Yeb Havinga <yebhavinga@gmail.com> writes:
    >> A few days ago I read Tomas Vondra's blog post about dss tpc-h queries
    >> on PostgreSQL at
    >> http://fuzzy.cz/en/articles/dss-tpc-h-benchmark-with-postgresql/ - in
    >> which he showed how to manually pull up a dss subquery to get a large
    >> speed up. Initially I thought: cool, this is probably now handled by
    >> Hitoshi's patch, but it turns out the subquery type in the dss query is
    >> different.
    >
    > Actually, I believe this example is the exact opposite of the
    > transformation Hitoshi proposes.  Tomas was manually replacing an
    > aggregated subquery by a reference to a grouped table, which can be
    > a win if the subquery would be executed enough times to amortize
    > calculation of the grouped table over all the groups (some of which
    > might never be demanded by the outer query).  Hitoshi was talking about
    > avoiding calculations of grouped-table elements that we don't need,
    > which would be a win in different cases.  Or at least that was the
    > thrust of his original proposal; I'm not sure where the patch went since
    > then.
    >
    > This leads me to think that we need to represent both cases as the same
    > sort of query and make a cost-based decision as to which way to go.
    > Thinking of it as a pull-up or push-down transformation is the wrong
    > approach because those sorts of transformations are done too early to
    > be able to use cost comparisons.
    
    I think you're right.  OTOH, our estimates of what will pop out of an
    aggregate are so poor that denying the user to control the plan on the
    basis of how they write the query might be a net negative.  :-(
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  23. Re: Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery))

    Yeb Havinga <yebhavinga@gmail.com> — 2011-07-27T14:40:23Z

    On 2011-07-27 16:16, Robert Haas wrote:
    > On Tue, Jul 26, 2011 at 5:37 PM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
    >> Yeb Havinga<yebhavinga@gmail.com>  writes:
    >>> A few days ago I read Tomas Vondra's blog post about dss tpc-h queries
    >>> on PostgreSQL at
    >>> http://fuzzy.cz/en/articles/dss-tpc-h-benchmark-with-postgresql/ - in
    >>> which he showed how to manually pull up a dss subquery to get a large
    >>> speed up. Initially I thought: cool, this is probably now handled by
    >>> Hitoshi's patch, but it turns out the subquery type in the dss query is
    >>> different.
    >> Actually, I believe this example is the exact opposite of the
    >> transformation Hitoshi proposes.  Tomas was manually replacing an
    >> aggregated subquery by a reference to a grouped table, which can be
    >> a win if the subquery would be executed enough times to amortize
    >> calculation of the grouped table over all the groups (some of which
    >> might never be demanded by the outer query).  Hitoshi was talking about
    >> avoiding calculations of grouped-table elements that we don't need,
    >> which would be a win in different cases.  Or at least that was the
    >> thrust of his original proposal; I'm not sure where the patch went since
    >> then.
    >>
    >> This leads me to think that we need to represent both cases as the same
    >> sort of query and make a cost-based decision as to which way to go.
    >> Thinking of it as a pull-up or push-down transformation is the wrong
    >> approach because those sorts of transformations are done too early to
    >> be able to use cost comparisons.
    > I think you're right.  OTOH, our estimates of what will pop out of an
    > aggregate are so poor that denying the user to control the plan on the
    > basis of how they write the query might be a net negative.  :-(
    >
    
    Tom and Robert, thank you both for your replies. I think I'm having some 
    blind spots and maybe false assumptions regarding the overal work in the 
    optimizer, as it is not clear to me what 'the same sort of query' would 
    look like. I was under the impression that using cost to select the best 
    paths is only done per simple query, and fail to see how a total 
    combined plan with pulled up subquery could be compared on cost with a 
    total plan where the subquery is still a separate subplan, since the 
    range tables / simple-queries to compare are different.
    
    regards,
    Yeb
    
    
    
  24. Re: Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery))

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-07-27T14:51:13Z

    Yeb Havinga <yebhavinga@gmail.com> writes:
    > Tom and Robert, thank you both for your replies. I think I'm having some 
    > blind spots and maybe false assumptions regarding the overal work in the 
    > optimizer, as it is not clear to me what 'the same sort of query' would 
    > look like. I was under the impression that using cost to select the best 
    > paths is only done per simple query, and fail to see how a total 
    > combined plan with pulled up subquery could be compared on cost with a 
    > total plan where the subquery is still a separate subplan, since the 
    > range tables / simple-queries to compare are different.
    
    Well, you could look at what planagg.c does to decide whether an
    indexscan optimization of MIN/MAX is worthwhile, or at the calculations
    in planner.c that decide which way to do grouping/aggregation/ordering.
    
    It could be fairly expensive to handle this type of problem because of
    the need to cost out two fundamentally different scan/join trees, but
    we're assuming the queries are expensive enough to make that worthwhile
    ...
    
    			regards, tom lane
    
    
  25. Re: Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery))

    Hitoshi Harada <umi.tanuki@gmail.com> — 2011-07-27T15:34:21Z

    2011/7/27 Yeb Havinga <yebhavinga@gmail.com>:
    > On 2011-07-22 17:35, Hitoshi Harada wrote:
    >>
    >> 2011/7/23 Yeb Havinga<yebhavinga@gmail.com>:
    >
    > A few days ago I read Tomas Vondra's blog post about dss tpc-h queries on
    > PostgreSQL at
    > http://fuzzy.cz/en/articles/dss-tpc-h-benchmark-with-postgresql/ - in which
    > he showed how to manually pull up a dss subquery to get a large speed up.
    > Initially I thought: cool, this is probably now handled by Hitoshi's patch,
    > but it turns out the subquery type in the dss query is different.
    >
    > The original and rewritten queries are below. The debug_print_plan output
    > shows the subquery is called from a opexpr (< l_quantity, subquery output)
    > and the sublink type is EXPR_SUBLINK. Looking at the source code;
    > pull_up_sublink only considers ANY and EXISTS sublinks. I'm wondering if
    > this could be expanded to deal with EXPR sublinks. Clearly in the example
    > Tomas has given this can be done. I'm wondering if there are show stoppers
    > that prevent this to be possible in the general case, but can't think of
    > any, other than the case of a sublink returning NULL and the opexpr is part
    > of a larger OR expression or IS NULL; in which case it should not be pulled
    > op, or perhaps it could be pulled up as outer join.
    >
    > Thoughts?
    
    Good catch. I was not aware of the sublink case so I'm not sure if it
    is possible, but I believe it will be worth modifying the optimizer to
    handle them in the same way.  Since my latest proposal is based on
    parameterized NestLoop, the first step is how to transform the sublink
    expression into join. I bet there are chances in simple cases since we
    have Semi/Anti Join technique. On the other hand, those pseudo-join
    types are easily failing to be transformed to join, in such cases
    above like it have another filter clause than join qual expression.
    
    If tpc bechmark can be speed up that's a good use case which you
    pointed out I'm missing.
    
    Regards,
    
    -- 
    Hitoshi Harada
    
    
  26. Re: Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery))

    Hitoshi Harada <umi.tanuki@gmail.com> — 2011-07-27T15:50:20Z

    2011/7/27 Tom Lane <tgl@sss.pgh.pa.us>:
    > Yeb Havinga <yebhavinga@gmail.com> writes:
    >> A few days ago I read Tomas Vondra's blog post about dss tpc-h queries
    >> on PostgreSQL at
    >> http://fuzzy.cz/en/articles/dss-tpc-h-benchmark-with-postgresql/ - in
    >> which he showed how to manually pull up a dss subquery to get a large
    >> speed up. Initially I thought: cool, this is probably now handled by
    >> Hitoshi's patch, but it turns out the subquery type in the dss query is
    >> different.
    >
    > Actually, I believe this example is the exact opposite of the
    > transformation Hitoshi proposes. Tomas was manually replacing an
    > aggregated subquery by a reference to a grouped table, which can be
    > a win if the subquery would be executed enough times to amortize
    > calculation of the grouped table over all the groups (some of which
    > might never be demanded by the outer query).  Hitoshi was talking about
    > avoiding calculations of grouped-table elements that we don't need,
    > which would be a win in different cases.  Or at least that was the
    > thrust of his original proposal; I'm not sure where the patch went since
    > then.
    
    My first proposal which is about pulling up aggregate like sublink
    expression is exact opposite of this (Tomas pushed down the sublink
    expression to join subquery). But the latest proposal is upon
    parameterized NestLoop, so I think my latest patch might help
    something for the *second* query. Actually the problem is the same; We
    want to reduce grouping operation which is not interesting to the
    final output, by filtering other relations expression. In this case,
    if the joined lineitem-part relation has very few rows by WHERE
    conditions (p_brand, p_container), we don't want calculate avg of huge
    lineitem because we know almost all of the avg result is not in the
    upper result. However, the current optimizer cannot pass the upper
    query's condition (like "it will have only few rows") down to the
    lower aggregate query.
    
    > This leads me to think that we need to represent both cases as the same
    > sort of query and make a cost-based decision as to which way to go.
    > Thinking of it as a pull-up or push-down transformation is the wrong
    > approach because those sorts of transformations are done too early to
    > be able to use cost comparisons.
    
    Wrapping up my mind above and reading this paragraph, it might be
    another work to make sublink expression look like the same as join.
    But what we want to solve is the same goal, I think.
    
    Regards,
    
    -- 
    Hitoshi Harada