Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Simplify relation_has_unique_index_for()

  2. Pathify RHS unique-ification for semijoin planning

  3. Convert varatt.h access macros to static inline functions.

  4. Re-export a few of createplan.c's make_xxx() functions.

  1. Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-05-22T07:05:55Z

    I came across a query where the plan includes some unnecessary Sort
    nodes.  Here's an example that shows the issue.
    
    create table t(a int, b int);
    insert into t select i%100, i from generate_series(1,10000)i;
    create index on t(a);
    vacuum analyze t;
    
    set enable_hashagg to off;
    
    explain (costs off)
    select * from t t1 where t1.a in
      (select a from t t2 where a < 10)
    order by t1.a;
                              QUERY PLAN
    ---------------------------------------------------------------
     Merge Join
       Merge Cond: (t1.a = t2.a)
       ->  Index Scan using t_a_idx on t t1
       ->  Sort
             Sort Key: t2.a
             ->  Unique
                   ->  Sort
                         Sort Key: t2.a
                         ->  Index Only Scan using t_a_idx on t t2
                               Index Cond: (a < 10)
    (10 rows)
    
    I believe the two Sort nodes are unnecessary.
    
    After some digging, it seems that this is related to one of the
    approaches we use to implement semijoins: unique-ifying the RHS and
    then doing a regular join.  The unique-ification is handled in
    create_unique_path(), which considers both hash-based and sort-based
    implementations.  However, in the case of sort-based implementation,
    this function pays no attention to the subpath's pathkeys or the
    pathkeys of the resulting output.
    
    In addition to this specific issue, it seems to me that there are
    other potential issues in create_unique_path().
    
    * Currently, we only consider the cheapest_total_path of the RHS when
    unique-ifying it.  I think this may cause us to miss some optimization
    opportunities.  For example, there might be a path with a better sort
    order that isn't the cheapest-total one.  Such a path could help avoid
    a sort at a higher level, potentially resulting in a cheaper overall
    plan.
    
    * In create_unique_path(), we currently rely on heuristics to decide
    whether to use a hash-based or sort-based method.  I think a better
    approach would be to create paths for both methods and let add_path()
    determine which one is better, similar to how we handle path selection
    in other parts of the planner.
    
    Therefore, I'm thinking that maybe we could create a new RelOptInfo
    for the RHS rel to represent its unique-ified version, and then
    generate all worthwhile paths for it, similar to how it's done in
    create_distinct_paths().  Since this is likely to be called repeatedly
    on the same rel, we can cache the new RelOptInfo in the rel struct,
    just like how we cache cheapest_unique_path currently.
    
    To be concrete, I'm envisioning something like the following:
    
                if (bms_equal(sjinfo->syn_righthand, rel2->relids) &&
    -               create_unique_path(root, rel2, rel2->cheapest_total_path,
    -                                  sjinfo) != NULL)
    +               (rel2_unique = create_unique_rel(root, rel2, sjinfo)) != NULL)
    
    ...
    
    -               add_paths_to_joinrel(root, joinrel, rel1, rel2,
    -                                    JOIN_UNIQUE_INNER, sjinfo,
    +               add_paths_to_joinrel(root, joinrel, rel1, rel2_unique,
    +                                    JOIN_INNER, sjinfo,
                                         restrictlist);
    -               add_paths_to_joinrel(root, joinrel, rel2, rel1,
    -                                    JOIN_UNIQUE_OUTER, sjinfo,
    +               add_paths_to_joinrel(root, joinrel, rel2_unique, rel1,
    +                                    JOIN_INNER, sjinfo,
                                         restrictlist);
    
    In addition, by changing the join from "rel1" and "rel2" using
    JOIN_UNIQUE_OUTER or JOIN_UNIQUE_INNER to a join between "rel1" and
    "rel2_unique" using a standard JOIN_INNER, we might be able to get
    rid of the JOIN_UNIQUE_OUTER and JOIN_UNIQUE_INNER jointypes.  This
    could simplify a lot of logic in joinpath.c, where we're increasingly
    adding special-case handling for these two jointypes.
    
    One last point, I doubt that the code related to UNIQUE_PATH_NOOP is
    reachable in practice.  create_unique_path() checks whether the input
    rel is an RTE_RELATION or RTE_SUBQUERY and is provably unique, and
    creates a UNIQUE_PATH_NOOP UniquePath if so.  However, in that case,
    the semijoin should have already been simplified to a plain inner join
    by analyzejoins.c.
    
    Any thoughts?
    
    Thanks
    Richard
    
    
    
    
  2. Re: Pathify RHS unique-ification for semijoin planning

    Andy Fan <zhihuifan1213@163.com> — 2025-05-22T09:27:48Z

    Richard Guo <guofenglinux@gmail.com> writes:
    
    Hi,
    
    > However, in the case of sort-based implementation,
    > this function pays no attention to the subpath's pathkeys or the
    > pathkeys of the resulting output.
    
    Good finding!
    
    >
    > In addition to this specific issue, it seems to me that there are
    > other potential issues in create_unique_path().
    >
    > * Currently, we only consider the cheapest_total_path of the RHS when
    > unique-ifying it.
    
    I think it is better have a check the tuple_fraction for the startup_cost
    factor, for some paths where the total cost is high but the required
    fraction is lower.
    
    > I think this may cause us to miss some optimization
    > opportunities.  For example, there might be a path with a better sort
    > order that isn't the cheapest-total one.  Such a path could help avoid
    > a sort at a higher level, potentially resulting in a cheaper overall
    > plan.
    
    > * In create_unique_path(), we currently rely on heuristics to decide
    > whether to use a hash-based or sort-based method.  I think a better
    > approach would be to create paths for both methods and let add_path()
    > determine which one is better, similar to how we handle path selection
    > in other parts of the planner.
    >
    > Therefore, I'm thinking that maybe we could create a new RelOptInfo
    > for the RHS rel to represent its unique-ified version, and then
    > generate all worthwhile paths for it,
    
    This sounds great for me. and I think we can keep the fraction
    cheapest path on the new RelOptInfo as well, then all the things should
    be on the way. 
    
    > To be concrete, I'm envisioning something like the following:
    >
    >             if (bms_equal(sjinfo->syn_righthand, rel2->relids) &&
    > -               create_unique_path(root, rel2, rel2->cheapest_total_path,
    > -                                  sjinfo) != NULL)
    > +               (rel2_unique = create_unique_rel(root, rel2, sjinfo)) != NULL)
    >
    > ...
    >
    > -               add_paths_to_joinrel(root, joinrel, rel1, rel2,
    > -                                    JOIN_UNIQUE_INNER, sjinfo,
    > +               add_paths_to_joinrel(root, joinrel, rel1, rel2_unique,
    > +                                    JOIN_INNER, sjinfo,
    >                                      restrictlist);
    > -               add_paths_to_joinrel(root, joinrel, rel2, rel1,
    > -                                    JOIN_UNIQUE_OUTER, sjinfo,
    > +               add_paths_to_joinrel(root, joinrel, rel2_unique, rel1,
    > +                                    JOIN_INNER, sjinfo,
    >                                      restrictlist);
    >
    > In addition, by changing the join from "rel1" and "rel2" using
    > JOIN_UNIQUE_OUTER or JOIN_UNIQUE_INNER to a join between "rel1" and
    > "rel2_unique" using a standard JOIN_INNER, we might be able to get
    > rid of the JOIN_UNIQUE_OUTER and JOIN_UNIQUE_INNER jointypes.
    
    if we can, +10.
    
    -- 
    Best Regards
    Andy Fan
    
    
    
    
    
  3. Re: Pathify RHS unique-ification for semijoin planning

    wenhui qiu <qiuwenhuifx@gmail.com> — 2025-05-22T13:51:32Z

    On Thu, 22 May 2025 at 17:28, Andy Fan <zhihuifan1213@163.com> wrote:
    
    > Richard Guo <guofenglinux@gmail.com> writes:
    >
    > Hi,
    >
    > > However, in the case of sort-based implementation,
    > > this function pays no attention to the subpath's pathkeys or the
    > > pathkeys of the resulting output.
    >
    > Good finding!
    >
    > >
    > > In addition to this specific issue, it seems to me that there are
    > > other potential issues in create_unique_path().
    > >
    > > * Currently, we only consider the cheapest_total_path of the RHS when
    > > unique-ifying it.
    >
    > I think it is better have a check the tuple_fraction for the startup_cost
    > factor, for some paths where the total cost is high but the required
    > fraction is lower.
    >
    > > I think this may cause us to miss some optimization
    > > opportunities.  For example, there might be a path with a better sort
    > > order that isn't the cheapest-total one.  Such a path could help avoid
    > > a sort at a higher level, potentially resulting in a cheaper overall
    > > plan.
    >
    > > * In create_unique_path(), we currently rely on heuristics to decide
    > > whether to use a hash-based or sort-based method.  I think a better
    > > approach would be to create paths for both methods and let add_path()
    > > determine which one is better, similar to how we handle path selection
    > > in other parts of the planner.
    > >
    > > Therefore, I'm thinking that maybe we could create a new RelOptInfo
    > > for the RHS rel to represent its unique-ified version, and then
    > > generate all worthwhile paths for it,
    >
    > This sounds great for me. and I think we can keep the fraction
    > cheapest path on the new RelOptInfo as well, then all the things should
    > be on the way.
    >
    > > To be concrete, I'm envisioning something like the following:
    > >
    > >             if (bms_equal(sjinfo->syn_righthand, rel2->relids) &&
    > > -               create_unique_path(root, rel2, rel2->cheapest_total_path,
    > > -                                  sjinfo) != NULL)
    > > +               (rel2_unique = create_unique_rel(root, rel2, sjinfo)) !=
    > NULL)
    > >
    > > ...
    > >
    > > -               add_paths_to_joinrel(root, joinrel, rel1, rel2,
    > > -                                    JOIN_UNIQUE_INNER, sjinfo,
    > > +               add_paths_to_joinrel(root, joinrel, rel1, rel2_unique,
    > > +                                    JOIN_INNER, sjinfo,
    > >                                      restrictlist);
    > > -               add_paths_to_joinrel(root, joinrel, rel2, rel1,
    > > -                                    JOIN_UNIQUE_OUTER, sjinfo,
    > > +               add_paths_to_joinrel(root, joinrel, rel2_unique, rel1,
    > > +                                    JOIN_INNER, sjinfo,
    > >                                      restrictlist);
    > >
    > > In addition, by changing the join from "rel1" and "rel2" using
    > > JOIN_UNIQUE_OUTER or JOIN_UNIQUE_INNER to a join between "rel1" and
    > > "rel2_unique" using a standard JOIN_INNER, we might be able to get
    > > rid of the JOIN_UNIQUE_OUTER and JOIN_UNIQUE_INNER jointypes.
    >
    > >>if we can, +10.
    
    Agree
    Pls kindly release a path for this?
    
    >
    >
    
    >
    >
    >
    >
    
  4. Re: Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-05-28T01:58:27Z

    On Thu, May 22, 2025 at 4:05 PM Richard Guo <guofenglinux@gmail.com> wrote:
    > Therefore, I'm thinking that maybe we could create a new RelOptInfo
    > for the RHS rel to represent its unique-ified version, and then
    > generate all worthwhile paths for it, similar to how it's done in
    > create_distinct_paths().  Since this is likely to be called repeatedly
    > on the same rel, we can cache the new RelOptInfo in the rel struct,
    > just like how we cache cheapest_unique_path currently.
    >
    > To be concrete, I'm envisioning something like the following:
    >
    >             if (bms_equal(sjinfo->syn_righthand, rel2->relids) &&
    > -               create_unique_path(root, rel2, rel2->cheapest_total_path,
    > -                                  sjinfo) != NULL)
    > +               (rel2_unique = create_unique_rel(root, rel2, sjinfo)) != NULL)
    >
    > ...
    >
    > -               add_paths_to_joinrel(root, joinrel, rel1, rel2,
    > -                                    JOIN_UNIQUE_INNER, sjinfo,
    > +               add_paths_to_joinrel(root, joinrel, rel1, rel2_unique,
    > +                                    JOIN_INNER, sjinfo,
    >                                      restrictlist);
    > -               add_paths_to_joinrel(root, joinrel, rel2, rel1,
    > -                                    JOIN_UNIQUE_OUTER, sjinfo,
    > +               add_paths_to_joinrel(root, joinrel, rel2_unique, rel1,
    > +                                    JOIN_INNER, sjinfo,
    >                                      restrictlist);
    
    Here is a WIP draft patch based on this idea.  It retains
    JOIN_UNIQUE_OUTER and JOIN_UNIQUE_INNER to help determine whether the
    inner relation is provably unique, but otherwise removes most of the
    code related to these two join types.
    
    Additionally, the T_Unique path now has the same meaning for both
    semijoins and DISTINCT clauses: it represents adjacent-duplicate
    removal on presorted input.  This patch unifies their handling by
    sharing the same data structures and functions.
    
    There are a few plan diffs in the regression tests.  As far as I can
    tell, the changes are improvements.  One of them is caused by the fact
    that we now consider parameterized paths in unique-ified cases.  The
    rest are mostly a result of now preserving pathkeys for unique paths.
    
    This patch is still a work in progress.  Before investing too much
    time into it, I'd like to get some feedback on whether it's heading in
    the right direction.
    
    Thanks
    Richard
    
  5. Re: Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-06-03T07:52:30Z

    On Wed, May 28, 2025 at 10:58 AM Richard Guo <guofenglinux@gmail.com> wrote:
    > This patch is still a work in progress.  Before investing too much
    > time into it, I'd like to get some feedback on whether it's heading in
    > the right direction.
    
    Here is an updated version of the patch, which is ready for review.
    I've fixed a cost estimation issue, improved some comments, and added
    a commit message.  Nothing essential has changed.
    
    Thanks
    Richard
    
  6. Re: Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-07-01T02:57:44Z

    On Tue, Jun 3, 2025 at 4:52 PM Richard Guo <guofenglinux@gmail.com> wrote:
    > Here is an updated version of the patch, which is ready for review.
    > I've fixed a cost estimation issue, improved some comments, and added
    > a commit message.  Nothing essential has changed.
    
    This patch does not apply anymore, and here is a new rebase.
    
    It may be argued that this patch introduces additional planning
    overhead by considering multiple unique-ification paths for the RHS.
    While that is true to some extent, I don't think this is a problem.
    Please bear with me a moment.
    
    * The additional path generation only occurs in specific semijoin
    cases where one input rel is exactly the RHS.  Queries without such
    semijoins are not affected.
    
    * This patch only considers the cheapest total path and presorted
    paths from the original RHS.  These are typically few in number, and
    each has a high likelihood of contributing to a lower overall cost for
    the final plan.  I think the cost-benefit trade-off is worthwhile.
    
    * This patch follows the convention in joinpath.c of exploring
    alternative join input paths, rather than introducing novel overhead.
    For example, when planning (A SEMIJOIN B), the planner considers
    multiple paths from B, including its cheapest total path and any paths
    with useful sort orders.  There is no clear reason why, in the
    analogous case of (A INNERJOIN unique-ified(B)), we should restrict
    ourselves to only one path from the unique-ified RHS.
    
    * On the other hand, if we insist on considering only a single path
    from the unique-ified RHS, we face a dilemma when the hash-based
    implementation has a cheaper total cost, but the sort-based
    implementation has a better sort order.  In such cases, what should
    our selection criteria be?  Currently, create_unique_path() simply
    compares total_cost to choose the cheaper one (even without applying a
    fuzz factor), and ignores sort order entirely.  I don't think this
    approach makes sense.  It is also inconsistent with the general
    pathification framework, where we rely on add_path() to retain the
    best path or set of paths based on cost and other metrics, rather than
    using such simple heuristics.
    
    Another point I'd like to mention is that this patch removes the
    UNIQUE_PATH_NOOP related code along the way, because I think it's dead
    code.  If the RHS rel is provably unique, the semijoin should have
    already been simplified to a plain inner join by analyzejoins.c.
    However, I might be overlooking something, and I'd appreciate any
    feedback or corrections.
    
    Thanks
    Richard
    
  7. Re: Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-07-03T10:06:47Z

    On Tue, Jul 1, 2025 at 11:57 AM Richard Guo <guofenglinux@gmail.com> wrote:
    > On Tue, Jun 3, 2025 at 4:52 PM Richard Guo <guofenglinux@gmail.com> wrote:
    > > Here is an updated version of the patch, which is ready for review.
    > > I've fixed a cost estimation issue, improved some comments, and added
    > > a commit message.  Nothing essential has changed.
    
    > This patch does not apply anymore, and here is a new rebase.
    
    This patch does not apply again, so here is a new rebase.
    
    This version also fixes an issue related to parameterized paths: if
    the RHS has LATERAL references to the LHS, unique-ification becomes
    meaningless because the RHS depends on the LHS, and such paths should
    not be generated.
    
    Thanks
    Richard
    
  8. Re: Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-07-04T01:41:35Z

    On Thu, Jul 3, 2025 at 7:06 PM Richard Guo <guofenglinux@gmail.com> wrote:
    > This patch does not apply again, so here is a new rebase.
    >
    > This version also fixes an issue related to parameterized paths: if
    > the RHS has LATERAL references to the LHS, unique-ification becomes
    > meaningless because the RHS depends on the LHS, and such paths should
    > not be generated.
    
    (The cc list is somehow lost; re-ccing.)
    
    FWIW, I noticed that the row/cost estimates for the unique-ification
    node on master can be very wrong.  For example:
    
    create table t(a int, b int);
    insert into t select i%100, i from generate_series(1,10000)i;
    vacuum analyze t;
    set enable_hashagg to off;
    
    explain (costs on)
    select * from t t1, t t2 where (t1.a, t2.b) in
        (select a, b from t t3 where t1.b is not null offset 0);
    
    And look at the snippet from the plan:
    
    (on master)
    ->  Unique  (cost=934.39..1009.39 rows=10000 width=8)
          ->  Sort  (cost=271.41..271.54 rows=50 width=8)
                Sort Key: "ANY_subquery".a, "ANY_subquery".b
                ->  Subquery Scan on "ANY_subquery"  (cost=0.00..270.00
    rows=50 width=8)
    
    The row estimate for the subpath is 50, but it increases to 10000
    after unique-ification.  How does that make sense?
    
    This issue does not occur with this patch:
    
    (on patched)
    ->  Unique  (cost=271.41..271.79 rows=50 width=8)
          ->  Sort  (cost=271.41..271.54 rows=50 width=8)
                Sort Key: "ANY_subquery".a, "ANY_subquery".b
                ->  Subquery Scan on "ANY_subquery"  (cost=0.00..270.00
    rows=50 width=8)
    
    Thanks
    Richard
    
    
    
    
  9. Re: Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-07-16T02:30:35Z

    On Fri, Jul 4, 2025 at 10:41 AM Richard Guo <guofenglinux@gmail.com> wrote:
    > On Thu, Jul 3, 2025 at 7:06 PM Richard Guo <guofenglinux@gmail.com> wrote:
    > > This patch does not apply again, so here is a new rebase.
    > >
    > > This version also fixes an issue related to parameterized paths: if
    > > the RHS has LATERAL references to the LHS, unique-ification becomes
    > > meaningless because the RHS depends on the LHS, and such paths should
    > > not be generated.
    
    > (The cc list is somehow lost; re-ccing.)
    
    The CI reports a test failure related to this patch, although I'm
    quite confident it's unrelated to the changes introduced here.
    
    The failure is: recovery/009_twophase time out (After 1000 seconds)
    
    In any case, here's a freshly rebased version.
    
    Hi Tom, I wonder if you've had a chance to look at this patch.  It
    would be great to have your input.
    
    Thanks
    Richard
    
  10. Re: Pathify RHS unique-ification for semijoin planning

    Álvaro Herrera <alvherre@kurilemu.de> — 2025-07-23T08:11:03Z

    Hello,
    
    As a very trivial test on this patch, I ran the query in your opening
    email, both with and without the patch, scaling up the size of the table
    a little bit.  So I did this
    
    	drop table if exists t;
    	create table t(a int, b int);
    	insert into t select i % 100000, i from generate_series(1,1e7) i;
    	create index on t(a);
    	vacuum analyze t;
    
    	set enable_hashagg to off;
    
    	explain (costs off, analyze, buffers)
    	select * from t t1 where t1.a in
    	  (select a from t t2 where a < 10000)
    	order by t1.a;
    
    
    This is the plan without the patch:
    
                                                        QUERY PLAN                                                     
    ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     Merge Join (actual time=289.262..700.761 rows=1000000.00 loops=1)
       Merge Cond: (t1.a = t2.a)
       Buffers: shared hit=1017728 read=3945 written=3361, temp read=1471 written=1476
       ->  Index Scan using t_a_idx on t t1 (actual time=0.011..320.747 rows=1000001.00 loops=1)
             Index Searches: 1
             Buffers: shared hit=997725 read=3112 written=2664
       ->  Sort (actual time=219.273..219.771 rows=10000.00 loops=1)
             Sort Key: t2.a
             Sort Method: quicksort  Memory: 385kB
             Buffers: shared hit=20003 read=833 written=697, temp read=1471 written=1476
             ->  Unique (actual time=128.173..218.708 rows=10000.00 loops=1)
                   Buffers: shared hit=20003 read=833 written=697, temp read=1471 written=1476
                   ->  Sort (actual time=128.170..185.461 rows=1000000.00 loops=1)
                         Sort Key: t2.a
                         Sort Method: external merge  Disk: 11768kB
                         Buffers: shared hit=20003 read=833 written=697, temp read=1471 written=1476
                         ->  Index Only Scan using t_a_idx on t t2 (actual time=0.024..74.171 rows=1000000.00 loops=1)
                               Index Cond: (a < 10000)
                               Heap Fetches: 0
                               Index Searches: 1
                               Buffers: shared hit=20003 read=833 written=697
     Planning:
       Buffers: shared hit=28 read=7
     Planning Time: 0.212 ms
     Execution Time: 732.840 ms
    
    
    and this is the plan with the patch:
    
                                                  QUERY PLAN                                               
    ───────────────────────────────────────────────────────────────────────────────────────────────────────
     Merge Join (actual time=70.310..595.116 rows=1000000.00 loops=1)
       Merge Cond: (t1.a = t2.a)
       Buffers: shared hit=1017750 read=3923 written=3586
       ->  Index Scan using t_a_idx on t t1 (actual time=0.020..341.257 rows=1000001.00 loops=1)
             Index Searches: 1
             Buffers: shared hit=996914 read=3923 written=3586
       ->  Unique (actual time=0.028..99.074 rows=10000.00 loops=1)
             Buffers: shared hit=20836
             ->  Index Only Scan using t_a_idx on t t2 (actual time=0.026..66.219 rows=1000000.00 loops=1)
                   Index Cond: (a < 10000)
                   Heap Fetches: 0
                   Index Searches: 1
                   Buffers: shared hit=20836
     Planning:
       Buffers: shared hit=55 read=15 written=14
     Planning Time: 0.391 ms
     Execution Time: 621.377 ms
    
    
    This is a really nice improvement.  I think we could find queries that
    are arbitrarily faster, by feeding enough tuples to the unnecessary Sort
    nodes.
    
    -- 
    Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
    "No necesitamos banderas
     No reconocemos fronteras"                  (Jorge González)
    
    
    
    
  11. Re: Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-07-24T03:40:50Z

    On Wed, Jul 23, 2025 at 5:11 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
    > As a very trivial test on this patch, I ran the query in your opening
    > email, both with and without the patch, scaling up the size of the table
    > a little bit.
    
    > This is a really nice improvement.  I think we could find queries that
    > are arbitrarily faster, by feeding enough tuples to the unnecessary Sort
    > nodes.
    
    Thank you for testing this patch!
    
    In addition to eliminating unnecessary sort nodes, this patch also
    allows us to exploit pre-sorted paths that aren't necessarily the
    cheapest total during the unique-ification step.  It also allows the
    use of parallel plans for that step on large tables.  I think we could
    also find queries that become faster as a result of these improvements.
    
    Thanks
    Richard
    
    
    
    
  12. Re: Pathify RHS unique-ification for semijoin planning

    Alexandra Wang <alexandra.wang.oss@gmail.com> — 2025-07-31T01:33:14Z

    Hi Richard,
    
    Thanks for the patch! I applied your patch and played around with it.
    
    I have a question about the following test case you added in
    subselect.sql:
    
    +-- Ensure that we can unique-ify expressions more complex than plain Vars
    +explain (verbose, costs off)
    +select * from semijoin_unique_tbl t1, semijoin_unique_tbl t2
    +where (t1.a, t2.a) in (select a+1, b+1 from semijoin_unique_tbl t3)
    +order by t1.a, t2.a;
    +                                           QUERY PLAN
    
    +------------------------------------------------------------------------------------------------
    + Incremental Sort
    +   Output: t1.a, t1.b, t2.a, t2.b
    +   Sort Key: t1.a, t2.a
    +   Presorted Key: t1.a
    +   ->  Merge Join
    +         Output: t1.a, t1.b, t2.a, t2.b
    +         Merge Cond: (t1.a = ((t3.a + 1)))
    +         ->  Index Only Scan using semijoin_unique_tbl_a_b_idx on
    public.semijoin_unique_tbl t1
    +               Output: t1.a, t1.b
    +         ->  Sort
    +               Output: t2.a, t2.b, t3.a, ((t3.a + 1))
    +               Sort Key: ((t3.a + 1))
    +               ->  Hash Join
    +                     Output: t2.a, t2.b, t3.a, (t3.a + 1)
    +                     Hash Cond: (t2.a = (t3.b + 1))
    +                     ->  Seq Scan on public.semijoin_unique_tbl t2
    +                           Output: t2.a, t2.b
    +                     ->  Hash
    +                           Output: t3.a, t3.b
    +                           ->  HashAggregate
    +                                 Output: t3.a, t3.b
    +                                 Group Key: (t3.a + 1), (t3.b + 1)
    +                                 ->  Seq Scan on
    public.semijoin_unique_tbl t3
    +                                       Output: t3.a, t3.b, (t3.a + 1),
    (t3.b + 1)
    +(24 rows)
    
    I was under the impression that we wanted Unique on top of a sorted
    node for the inner of the SIMI join. However, the expected output uses
    a HashAgg instead. Is this expected?
    
    While looking at the code, I also had a question about the following
    changes in costsize.c:
    
    --- a/src/backend/optimizer/path/costsize.c
    +++ b/src/backend/optimizer/path/costsize.c
    @@ -3963,7 +3963,9 @@ final_cost_mergejoin(PlannerInfo *root, MergePath
    *path,
      * The whole issue is moot if we are working from a unique-ified outer
      * input, or if we know we don't need to mark/restore at all.
      */
    - if (IsA(outer_path, UniquePath) || path->skip_mark_restore)
    + if (IsA(outer_path, UniquePath) ||
    + IsA(outer_path, AggPath) ||
    + path->skip_mark_restore)
    
    and
    
    @@ -4358,7 +4360,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
      * because we avoid contaminating the cache with a value that's wrong for
      * non-unique-ified paths.
      */
    - if (IsA(inner_path, UniquePath))
    + if (IsA(inner_path, UniquePath) || IsA(inner_path, AggPath))
    
    I'm curious why AggPath was added in these two cases. To investigate,
    I reverted these two changes regarding AggPath, and reran make
    installcheck, and got the following diff:
    
    diff -U3
    /Users/alex.wang/workspace/postgres/src/test/regress/expected/subselect.out
    /Users/alex.wang/workspace/postgres/src/test/regress/results/subselect.out
    ---
    /Users/alex.wang/workspace/postgres/src/test/regress/expected/subselect.out
    2025-07-30 14:47:02
    +++
    /Users/alex.wang/workspace/postgres/src/test/regress/results/subselect.out
    2025-07-30 17:35:33
    @@ -747,33 +747,32 @@
     select * from semijoin_unique_tbl t1, semijoin_unique_tbl t2
     where (t1.a, t2.a) in (select a+1, b+1 from semijoin_unique_tbl t3)
     order by t1.a, t2.a;
    -                                           QUERY PLAN
    
    -------------------------------------------------------------------------------------------------
    - Incremental Sort
    +                                              QUERY PLAN
    
    +------------------------------------------------------------------------------------------------------
    + Merge Join
        Output: t1.a, t1.b, t2.a, t2.b
    -   Sort Key: t1.a, t2.a
    -   Presorted Key: t1.a
    -   ->  Merge Join
    -         Output: t1.a, t1.b, t2.a, t2.b
    -         Merge Cond: (t1.a = ((t3.a + 1)))
    +   Merge Cond: ((t3.a + 1) = t1.a)
    +   ->  Nested Loop
    +         Output: t2.a, t2.b, t3.a
    +         ->  Unique
    +               Output: t3.a, t3.b, ((t3.a + 1)), ((t3.b + 1))
    +               ->  Sort
    +                     Output: t3.a, t3.b, ((t3.a + 1)), ((t3.b + 1))
    +                     Sort Key: ((t3.a + 1)), ((t3.b + 1))
    +                     ->  Seq Scan on public.semijoin_unique_tbl t3
    +                           Output: t3.a, t3.b, (t3.a + 1), (t3.b + 1)
    +         ->  Memoize
    +               Output: t2.a, t2.b
    +               Cache Key: (t3.b + 1)
    +               Cache Mode: logical
    +               ->  Index Only Scan using semijoin_unique_tbl_a_b_idx on
    public.semijoin_unique_tbl t2
    +                     Output: t2.a, t2.b
    +                     Index Cond: (t2.a = (t3.b + 1))
    +   ->  Materialize
    +         Output: t1.a, t1.b
              ->  Index Only Scan using semijoin_unique_tbl_a_b_idx on
    public.semijoin_unique_tbl t1
                    Output: t1.a, t1.b
    -         ->  Sort
    -               Output: t2.a, t2.b, t3.a, ((t3.a + 1))
    -               Sort Key: ((t3.a + 1))
    -               ->  Hash Join
    -                     Output: t2.a, t2.b, t3.a, (t3.a + 1)
    -                     Hash Cond: (t2.a = (t3.b + 1))
    -                     ->  Seq Scan on public.semijoin_unique_tbl t2
    -                           Output: t2.a, t2.b
    -                     ->  Hash
    -                           Output: t3.a, t3.b
    -                           ->  HashAggregate
    -                                 Output: t3.a, t3.b
    -                                 Group Key: (t3.a + 1), (t3.b + 1)
    -                                 ->  Seq Scan on
    public.semijoin_unique_tbl t3
    -                                       Output: t3.a, t3.b, (t3.a + 1),
    (t3.b + 1)
    -(24 rows)
    +(23 rows)
    
     -- encourage use of parallel plans
     set parallel_setup_cost=0;
    @@ -2818,21 +2817,23 @@
     SELECT * FROM tenk1 A INNER JOIN tenk2 B
     ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd)
     WHERE a.thousand < 750;
    -                   QUERY PLAN
    --------------------------------------------------
    +                         QUERY PLAN
    +-------------------------------------------------------------
      Hash Join
        Hash Cond: (c.odd = b.odd)
    -   ->  Hash Join
    -         Hash Cond: (a.hundred = c.hundred)
    -         ->  Seq Scan on tenk1 a
    -               Filter: (thousand < 750)
    -         ->  Hash
    -               ->  HashAggregate
    -                     Group Key: c.odd, c.hundred
    -                     ->  Seq Scan on tenk2 c
    +   ->  Nested Loop
    +         ->  HashAggregate
    +               Group Key: c.odd, c.hundred
    +               ->  Seq Scan on tenk2 c
    +         ->  Memoize
    +               Cache Key: c.hundred
    +               Cache Mode: logical
    +               ->  Index Scan using tenk1_hundred on tenk1 a
    +                     Index Cond: (hundred = c.hundred)
    +                     Filter: (thousand < 750)
        ->  Hash
              ->  Seq Scan on tenk2 b
    -(12 rows)
    +(14 rows)
    
    The second diff looks fine to me. However, the first diff in the
    subselect test happened to be the test that I asked about.
    
    Here's the comparison of the two EXPLAIN ANALYZE results:
    
    -- setup:
    create table semijoin_unique_tbl (a int, b int);
    insert into semijoin_unique_tbl select i%10, i%10 from
    generate_series(1,1000)i;
    create index on semijoin_unique_tbl(a, b);
    analyze semijoin_unique_tbl;
    
    -- query:
    EXPLAIN ANALYZE
    select * from semijoin_unique_tbl t1, semijoin_unique_tbl t2
    where (t1.a, t2.a) in (select a+1, b+1 from semijoin_unique_tbl t3)
    order by t1.a, t2.a;
    
    -- results:
    Output of your original v5 patch:
    
    
          QUERY PLAN
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Incremental Sort  (cost=93.94..297.51 rows=2500 width=16) (actual
    time=4.153..25.257 rows=90000.00 loops=1)
       Sort Key: t1.a, t2.a
       Presorted Key: t1.a
       Full-sort Groups: 9  Sort Method: quicksort  Average Memory: 27kB  Peak
    Memory: 27kB
       Pre-sorted Groups: 9  Sort Method: quicksort  Average Memory: 697kB
     Peak Memory: 697kB
       Buffers: shared hit=61
       ->  Merge Join  (cost=74.81..166.49 rows=2500 width=16) (actual
    time=0.964..13.341 rows=90000.00 loops=1)
             Merge Cond: (t1.a = ((t3.a + 1)))
             Buffers: shared hit=61
             ->  Index Only Scan using semijoin_unique_tbl_a_b_idx on
    semijoin_unique_tbl t1  (cost=0.15..43.08 rows=1000 width=8) (actual
    time=0.040..0.276 rows=1000.00 loops=1)
                   Heap Fetches: 1000
                   Index Searches: 1
                   Buffers: shared hit=51
             ->  Sort  (cost=74.66..75.91 rows=500 width=12) (actual
    time=0.867..4.366 rows=89901.00 loops=1)
                   Sort Key: ((t3.a + 1))
                   Sort Method: quicksort  Memory: 53kB
                   Buffers: shared hit=10
                   ->  Hash Join  (cost=27.25..52.25 rows=500 width=12) (actual
    time=0.401..0.711 rows=900.00 loops=1)
                         Hash Cond: (t2.a = (t3.b + 1))
                         Buffers: shared hit=10
                         ->  Seq Scan on semijoin_unique_tbl t2
     (cost=0.00..15.00 rows=1000 width=8) (actual time=0.027..0.106
    rows=1000.00 loops=1)
                               Buffers: shared hit=5
                         ->  Hash  (cost=26.00..26.00 rows=100 width=8) (actual
    time=0.361..0.361 rows=10.00 loops=1)
                               Buckets: 1024  Batches: 1  Memory Usage: 9kB
                               Buffers: shared hit=5
                               ->  HashAggregate  (cost=25.00..26.00 rows=100
    width=8) (actual time=0.349..0.352 rows=10.00 loops=1)
                                     Group Key: (t3.a + 1), (t3.b + 1)
                                     Batches: 1  Memory Usage: 32kB
                                     Buffers: shared hit=5
                                     ->  Seq Scan on semijoin_unique_tbl t3
     (cost=0.00..20.00 rows=1000 width=16) (actual time=0.012..0.150
    rows=1000.00 loops=1)
                                           Buffers: shared hit=5
     Planning Time: 0.309 ms
     Execution Time: 28.552 ms
    (33 rows)
    
    Output of the v5 patch + my modification that removes the changes in
    costsize.c about AggPath:
    
    
            QUERY PLAN
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Merge Join  (cost=70.14..316.44 rows=2500 width=16) (actual
    time=0.862..13.484 rows=90000.00 loops=1)
       Merge Cond: ((t3.a + 1) = t1.a)
       Buffers: shared hit=105 read=6
       ->  Nested Loop  (cost=69.99..227.12 rows=500 width=12) (actual
    time=0.778..1.225 rows=900.00 loops=1)
             Buffers: shared hit=54 read=6
             ->  Unique  (cost=69.83..77.33 rows=100 width=16) (actual
    time=0.685..0.782 rows=10.00 loops=1)
                   Buffers: shared read=5
                   ->  Sort  (cost=69.83..72.33 rows=1000 width=16) (actual
    time=0.684..0.723 rows=1000.00 loops=1)
                         Sort Key: ((t3.a + 1)), ((t3.b + 1))
                         Sort Method: quicksort  Memory: 56kB
                         Buffers: shared read=5
                         ->  Seq Scan on semijoin_unique_tbl t3
     (cost=0.00..20.00 rows=1000 width=16) (actual time=0.324..0.479
    rows=1000.00 loops=1)
                               Buffers: shared read=5
             ->  Memoize  (cost=0.16..2.19 rows=100 width=8) (actual
    time=0.010..0.035 rows=90.00 loops=10)
                   Cache Key: (t3.b + 1)
                   Cache Mode: logical
                   Hits: 0  Misses: 10  Evictions: 0  Overflows: 0  Memory
    Usage: 36kB
                   Buffers: shared hit=54 read=1
                   ->  Index Only Scan using semijoin_unique_tbl_a_b_idx on
    semijoin_unique_tbl t2  (cost=0.15..2.18 rows=100 width=8) (actual
    time=0.005..0.015 rows=90.00 loops=10)
                         Index Cond: (a = (t3.b + 1))
                         Heap Fetches: 900
                         Index Searches: 10
                         Buffers: shared hit=54 read=1
       ->  Materialize  (cost=0.15..45.58 rows=1000 width=8) (actual
    time=0.014..3.613 rows=90001.00 loops=1)
             Storage: Memory  Maximum Storage: 20kB
             Buffers: shared hit=51
             ->  Index Only Scan using semijoin_unique_tbl_a_b_idx on
    semijoin_unique_tbl t1  (cost=0.15..43.08 rows=1000 width=8) (actual
    time=0.010..0.126 rows=1000.00 loops=1)
                   Heap Fetches: 1000
                   Index Searches: 1
                   Buffers: shared hit=51
     Planning:
       Buffers: shared hit=69 read=20
     Planning Time: 3.558 ms
     Execution Time: 17.211 ms
    (34 rows)
    
    While both runs are fast (28ms vs. 17ms), the plan generated by the v5
    patch is slower in this case.
    
    The latter plan also seems closer to my expectation: Unique + Sort for
    the inner side of the SEMI join.
    
    What do you think?
    
    Best,
    Alex
    
  13. Re: Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-07-31T04:08:06Z

    On Thu, Jul 31, 2025 at 10:33 AM Alexandra Wang
    <alexandra.wang.oss@gmail.com> wrote:
    > Thanks for the patch! I applied your patch and played around with it.
    
    Thanks for trying out this patch.
    
    > I have a question about the following test case you added in
    > subselect.sql:
    
    > I was under the impression that we wanted Unique on top of a sorted
    > node for the inner of the SIMI join. However, the expected output uses
    > a HashAgg instead. Is this expected?
    
    Hmm, I don't think we have such expectation that "Sort+Unique" should
    be used for the unique-ification step in this query.  This patch
    considers both hash-based and sort-based implementations, and lets the
    cost comparison determine which one is preferable.  So I wouldn't be
    surprised if the planner ends up choosing the hash-based
    implementation in the final plan.
    
    > While looking at the code, I also had a question about the following
    > changes in costsize.c:
    >
    > --- a/src/backend/optimizer/path/costsize.c
    > +++ b/src/backend/optimizer/path/costsize.c
    > @@ -3963,7 +3963,9 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
    >   * The whole issue is moot if we are working from a unique-ified outer
    >   * input, or if we know we don't need to mark/restore at all.
    >   */
    > - if (IsA(outer_path, UniquePath) || path->skip_mark_restore)
    > + if (IsA(outer_path, UniquePath) ||
    > + IsA(outer_path, AggPath) ||
    > + path->skip_mark_restore)
    >
    > and
    >
    > @@ -4358,7 +4360,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
    >   * because we avoid contaminating the cache with a value that's wrong for
    >   * non-unique-ified paths.
    >   */
    > - if (IsA(inner_path, UniquePath))
    > + if (IsA(inner_path, UniquePath) || IsA(inner_path, AggPath))
    >
    > I'm curious why AggPath was added in these two cases.
    
    Well, in final_cost_hashjoin() and final_cost_mergejoin(), we have
    some special cases when the inner or outer relation has been
    unique-ified.  Previously, it was sufficient to check whether the path
    was a UniquePath, since both hash-based and sort-based implementations
    were represented that way.  However, with this patch, UniquePath now
    only represents the sort-based implementation, so we also need to
    check for AggPath to account for the hash-based case.
    
    > The second diff looks fine to me. However, the first diff in the
    > subselect test happened to be the test that I asked about.
    >
    > Here's the comparison of the two EXPLAIN ANALYZE results:
    
    > While both runs are fast (28ms vs. 17ms), the plan generated by the v5
    > patch is slower in this case.
    
    This is a very interesting observation.  In fact, with the original v5
    patch, you can produce both plans by setting enable_hashagg on and
    off.
    
    set enable_hashagg to on;
     Incremental Sort  (cost=91.95..277.59 rows=2500 width=16)
                       (actual time=31.960..147.040 rows=90000.00 loops=1)
    
    set enable_hashagg to off;
     Merge Join  (cost=70.14..294.34 rows=2500 width=16)
                 (actual time=4.303..71.891 rows=90000.00 loops=1)
    
    This seems to be another case where the planner chooses a suboptimal
    plan due to inaccurate cost estimates.
    
    Thanks
    Richard
    
    
    
    
  14. Re: Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-07-31T08:58:28Z

    On Thu, Jul 31, 2025 at 1:08 PM Richard Guo <guofenglinux@gmail.com> wrote:
    > On Thu, Jul 31, 2025 at 10:33 AM Alexandra Wang
    > <alexandra.wang.oss@gmail.com> wrote:
    > > While looking at the code, I also had a question about the following
    > > changes in costsize.c:
    > >
    > > --- a/src/backend/optimizer/path/costsize.c
    > > +++ b/src/backend/optimizer/path/costsize.c
    > > @@ -3963,7 +3963,9 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
    > >   * The whole issue is moot if we are working from a unique-ified outer
    > >   * input, or if we know we don't need to mark/restore at all.
    > >   */
    > > - if (IsA(outer_path, UniquePath) || path->skip_mark_restore)
    > > + if (IsA(outer_path, UniquePath) ||
    > > + IsA(outer_path, AggPath) ||
    > > + path->skip_mark_restore)
    > >
    > > and
    > >
    > > @@ -4358,7 +4360,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
    > >   * because we avoid contaminating the cache with a value that's wrong for
    > >   * non-unique-ified paths.
    > >   */
    > > - if (IsA(inner_path, UniquePath))
    > > + if (IsA(inner_path, UniquePath) || IsA(inner_path, AggPath))
    > >
    > > I'm curious why AggPath was added in these two cases.
    
    > Well, in final_cost_hashjoin() and final_cost_mergejoin(), we have
    > some special cases when the inner or outer relation has been
    > unique-ified.  Previously, it was sufficient to check whether the path
    > was a UniquePath, since both hash-based and sort-based implementations
    > were represented that way.  However, with this patch, UniquePath now
    > only represents the sort-based implementation, so we also need to
    > check for AggPath to account for the hash-based case.
    
    BTW, maybe a better way to determine whether a relation has been
    unique-ified is to check that the nominal jointype is JOIN_INNER and
    sjinfo->jointype is JOIN_SEMI, and the relation is exactly the RHS of
    the semijoin.  This approach is mentioned in a comment in joinpath.c:
    
     * Path cost estimation code may need to recognize that it's
     * dealing with such a case --- the combination of nominal jointype INNER
     * with sjinfo->jointype == JOIN_SEMI indicates that.
    
    ... but it seems we don't currently apply it in costsize.c.
    
    To be concrete, I'm imagining a check like the following:
    
    #define IS_UNIQUEIFIED_REL(rel, sjinfo, nominal_jointype) \
        ((nominal_jointype) == JOIN_INNER && (sjinfo)->jointype == JOIN_SEMI && \
         bms_equal((sjinfo)->syn_righthand, (rel)->relids))
    
    ... and then the check in final_cost_hashjoin() becomes:
    
        if (IS_UNIQUEIFIED_REL(inner_path->parent, extra->sjinfo,
                               path->jpath.jointype))
        {
            innerbucketsize = 1.0 / virtualbuckets;
            innermcvfreq = 0.0;
        }
    
    Would this be a better approach?  Any thoughts?
    
    Thanks
    Richard
    
    
    
    
  15. Re: Pathify RHS unique-ification for semijoin planning

    wenhui qiu <qiuwenhuifx@gmail.com> — 2025-07-31T12:49:38Z

    HI
    
    > This is a very interesting observation.  In fact, with the original v5
    >  patch, you can produce both plans by setting enable_hashagg on and
    > off.
    
    >  set enable_hashagg to on;
    >  Incremental Sort  (cost=91.95..277.59 rows=2500 width=16)
    >                    (actual time=31.960..147.040 rows=90000.00 loops=1)
    >
    > set enable_hashagg to off;
     > Merge Join  (cost=70.14..294.34 rows=2500 width=16)
     >             (actual time=4.303..71.891 rows=90000.00 loops=1)
    >
    > This seems to be another case where the planner chooses a suboptimal
    >  plan due to inaccurate cost estimates.
    Agree ,I increased some rows , set enable_hashagg to on and off ,There's no
    difference in execution time. The execution plan is the same.
    
    > #define IS_UNIQUEIFIED_REL(rel, sjinfo, nominal_jointype) \
     >   ((nominal_jointype) == JOIN_INNER && (sjinfo)->jointype == JOIN_SEMI
    && \
    >     bms_equal((sjinfo)->syn_righthand, (rel)->relids))
    >
    >... and then the check in final_cost_hashjoin() becomes:
    >
     >   if (IS_UNIQUEIFIED_REL(inner_path->parent, extra->sjinfo,
    >                           path->jpath.jointype))
    >    {
    >        innerbucketsize = 1.0 / virtualbuckets;
     >       innermcvfreq = 0.0;
    >    }
    >
    >Would this be a better approach?  Any thoughts?
    
    This approach does indeed more accurately capture the fact that the
    relation has been unique-ified, especially in cases where a semi join has
    been transformed into an inner join. Compared to the current heuristic
    checks in costsize.c that rely on inner_path->rows, this method is more
    semantically meaningful and robust.
    
    On Thu, Jul 31, 2025 at 4:58 PM Richard Guo <guofenglinux@gmail.com> wrote:
    
    > On Thu, Jul 31, 2025 at 1:08 PM Richard Guo <guofenglinux@gmail.com>
    > wrote:
    > > On Thu, Jul 31, 2025 at 10:33 AM Alexandra Wang
    > > <alexandra.wang.oss@gmail.com> wrote:
    > > > While looking at the code, I also had a question about the following
    > > > changes in costsize.c:
    > > >
    > > > --- a/src/backend/optimizer/path/costsize.c
    > > > +++ b/src/backend/optimizer/path/costsize.c
    > > > @@ -3963,7 +3963,9 @@ final_cost_mergejoin(PlannerInfo *root,
    > MergePath *path,
    > > >   * The whole issue is moot if we are working from a unique-ified outer
    > > >   * input, or if we know we don't need to mark/restore at all.
    > > >   */
    > > > - if (IsA(outer_path, UniquePath) || path->skip_mark_restore)
    > > > + if (IsA(outer_path, UniquePath) ||
    > > > + IsA(outer_path, AggPath) ||
    > > > + path->skip_mark_restore)
    > > >
    > > > and
    > > >
    > > > @@ -4358,7 +4360,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath
    > *path,
    > > >   * because we avoid contaminating the cache with a value that's wrong
    > for
    > > >   * non-unique-ified paths.
    > > >   */
    > > > - if (IsA(inner_path, UniquePath))
    > > > + if (IsA(inner_path, UniquePath) || IsA(inner_path, AggPath))
    > > >
    > > > I'm curious why AggPath was added in these two cases.
    >
    > > Well, in final_cost_hashjoin() and final_cost_mergejoin(), we have
    > > some special cases when the inner or outer relation has been
    > > unique-ified.  Previously, it was sufficient to check whether the path
    > > was a UniquePath, since both hash-based and sort-based implementations
    > > were represented that way.  However, with this patch, UniquePath now
    > > only represents the sort-based implementation, so we also need to
    > > check for AggPath to account for the hash-based case.
    >
    > BTW, maybe a better way to determine whether a relation has been
    > unique-ified is to check that the nominal jointype is JOIN_INNER and
    > sjinfo->jointype is JOIN_SEMI, and the relation is exactly the RHS of
    > the semijoin.  This approach is mentioned in a comment in joinpath.c:
    >
    >  * Path cost estimation code may need to recognize that it's
    >  * dealing with such a case --- the combination of nominal jointype INNER
    >  * with sjinfo->jointype == JOIN_SEMI indicates that.
    >
    > ... but it seems we don't currently apply it in costsize.c.
    >
    > To be concrete, I'm imagining a check like the following:
    >
    > #define IS_UNIQUEIFIED_REL(rel, sjinfo, nominal_jointype) \
    >     ((nominal_jointype) == JOIN_INNER && (sjinfo)->jointype == JOIN_SEMI
    > && \
    >      bms_equal((sjinfo)->syn_righthand, (rel)->relids))
    >
    > ... and then the check in final_cost_hashjoin() becomes:
    >
    >     if (IS_UNIQUEIFIED_REL(inner_path->parent, extra->sjinfo,
    >                            path->jpath.jointype))
    >     {
    >         innerbucketsize = 1.0 / virtualbuckets;
    >         innermcvfreq = 0.0;
    >     }
    >
    > Would this be a better approach?  Any thoughts?
    >
    > Thanks
    > Richard
    >
    
  16. Re: Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-08-01T01:53:37Z

    On Thu, Jul 31, 2025 at 9:49 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
    
    > > This seems to be another case where the planner chooses a suboptimal
    > >  plan due to inaccurate cost estimates.
    
    > Agree ,I increased some rows , set enable_hashagg to on and off ,There's no difference in execution time. The execution plan is the same.
    
    Yeah, I found that if you increase the total number of rows in the
    table from 1000 to 1083 or more, you consistently get the more
    efficient plan -- regardless of whether enable_hashagg is on or off.
    
    > > #define IS_UNIQUEIFIED_REL(rel, sjinfo, nominal_jointype) \
    >  >   ((nominal_jointype) == JOIN_INNER && (sjinfo)->jointype == JOIN_SEMI && \
    > >     bms_equal((sjinfo)->syn_righthand, (rel)->relids))
    > >
    > >... and then the check in final_cost_hashjoin() becomes:
    > >
    >  >   if (IS_UNIQUEIFIED_REL(inner_path->parent, extra->sjinfo,
    > >                           path->jpath.jointype))
    > >    {
    > >        innerbucketsize = 1.0 / virtualbuckets;
    >  >       innermcvfreq = 0.0;
    > >    }
    > >
    > >Would this be a better approach?  Any thoughts?
    
    > This approach does indeed more accurately capture the fact that the relation has been unique-ified, especially in cases where a semi join has been transformed into an inner join. Compared to the current heuristic checks in costsize.c that rely on inner_path->rows, this method is more semantically meaningful and robust.
    
    The current check in costsize.c relies on the path type, not the path
    rows as you mentioned.  However, I agree that the check I proposed is
    more robust and extensible: if additional path types are introduced to
    represent unique-ification, this check wouldn't need to change.  So I
    plan to go this way, unless there are any objections.
    
    Thanks
    Richard
    
    
    
    
  17. Re: Pathify RHS unique-ification for semijoin planning

    Alexandra Wang <alexandra.wang.oss@gmail.com> — 2025-08-01T14:58:01Z

    Hi Richard,
    
    On Wed, Jul 30, 2025 at 9:08 PM Richard Guo <guofenglinux@gmail.com> wrote:
    
    > On Thu, Jul 31, 2025 at 10:33 AM Alexandra Wang
    > <alexandra.wang.oss@gmail.com> wrote:
    > > Thanks for the patch! I applied your patch and played around with it.
    >
    > Thanks for trying out this patch.
    >
    > > I have a question about the following test case you added in
    > > subselect.sql:
    >
    > > I was under the impression that we wanted Unique on top of a sorted
    > > node for the inner of the SIMI join. However, the expected output uses
    > > a HashAgg instead. Is this expected?
    >
    > Hmm, I don't think we have such expectation that "Sort+Unique" should
    > be used for the unique-ification step in this query.  This patch
    > considers both hash-based and sort-based implementations, and lets the
    > cost comparison determine which one is preferable.  So I wouldn't be
    > surprised if the planner ends up choosing the hash-based
    > implementation in the final plan.
    >
    > > While looking at the code, I also had a question about the following
    > > changes in costsize.c:
    > >
    > > --- a/src/backend/optimizer/path/costsize.c
    > > +++ b/src/backend/optimizer/path/costsize.c
    > > @@ -3963,7 +3963,9 @@ final_cost_mergejoin(PlannerInfo *root, MergePath
    > *path,
    > >   * The whole issue is moot if we are working from a unique-ified outer
    > >   * input, or if we know we don't need to mark/restore at all.
    > >   */
    > > - if (IsA(outer_path, UniquePath) || path->skip_mark_restore)
    > > + if (IsA(outer_path, UniquePath) ||
    > > + IsA(outer_path, AggPath) ||
    > > + path->skip_mark_restore)
    > >
    > > and
    > >
    > > @@ -4358,7 +4360,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath
    > *path,
    > >   * because we avoid contaminating the cache with a value that's wrong
    > for
    > >   * non-unique-ified paths.
    > >   */
    > > - if (IsA(inner_path, UniquePath))
    > > + if (IsA(inner_path, UniquePath) || IsA(inner_path, AggPath))
    > >
    > > I'm curious why AggPath was added in these two cases.
    >
    > Well, in final_cost_hashjoin() and final_cost_mergejoin(), we have
    > some special cases when the inner or outer relation has been
    > unique-ified.  Previously, it was sufficient to check whether the path
    > was a UniquePath, since both hash-based and sort-based implementations
    > were represented that way.  However, with this patch, UniquePath now
    > only represents the sort-based implementation, so we also need to
    > check for AggPath to account for the hash-based case.
    >
    > > The second diff looks fine to me. However, the first diff in the
    > > subselect test happened to be the test that I asked about.
    > >
    > > Here's the comparison of the two EXPLAIN ANALYZE results:
    >
    > > While both runs are fast (28ms vs. 17ms), the plan generated by the v5
    > > patch is slower in this case.
    >
    > This is a very interesting observation.  In fact, with the original v5
    > patch, you can produce both plans by setting enable_hashagg on and
    > off.
    >
    > set enable_hashagg to on;
    >  Incremental Sort  (cost=91.95..277.59 rows=2500 width=16)
    >                    (actual time=31.960..147.040 rows=90000.00 loops=1)
    >
    > set enable_hashagg to off;
    >  Merge Join  (cost=70.14..294.34 rows=2500 width=16)
    >              (actual time=4.303..71.891 rows=90000.00 loops=1)
    >
    > This seems to be another case where the planner chooses a suboptimal
    > plan due to inaccurate cost estimates.
    >
    
    Thanks for explaining! It makes sense now!
    
    While reviewing the code, the following diff concerns me:
    
      /*
      * If the joinrel is parallel-safe, we may be able to consider a partial
    - * merge join.  However, we can't handle JOIN_UNIQUE_OUTER, because the
    - * outer path will be partial, and therefore we won't be able to properly
    - * guarantee uniqueness.  Similarly, we can't handle JOIN_FULL, JOIN_RIGHT
    - * and JOIN_RIGHT_ANTI, because they can produce false null extended rows.
    + * merge join.  However, we can't handle JOIN_FULL, JOIN_RIGHT and
    + * JOIN_RIGHT_ANTI, because they can produce false null extended rows.
      * Also, the resulting path must not be parameterized.
      */
      if (joinrel->consider_parallel &&
    - save_jointype != JOIN_UNIQUE_OUTER &&
    - save_jointype != JOIN_FULL &&
    - save_jointype != JOIN_RIGHT &&
    - save_jointype != JOIN_RIGHT_ANTI &&
    + jointype != JOIN_FULL &&
    + jointype != JOIN_RIGHT &&
    + jointype != JOIN_RIGHT_ANTI &&
      outerrel->partial_pathlist != NIL &&
      bms_is_empty(joinrel->lateral_relids))
    
    What has changed that enabled JOIN_UNIQUE_OUTER to handle partial
    paths? Or is it no longer possible for the outer path to be partial?
    
    Best,
    Alex
    
  18. Re: Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-08-04T01:55:48Z

    On Fri, Aug 1, 2025 at 11:58 PM Alexandra Wang
    <alexandra.wang.oss@gmail.com> wrote:
    > While reviewing the code, the following diff concerns me:
    
    >   if (joinrel->consider_parallel &&
    > - save_jointype != JOIN_UNIQUE_OUTER &&
    > - save_jointype != JOIN_FULL &&
    > - save_jointype != JOIN_RIGHT &&
    > - save_jointype != JOIN_RIGHT_ANTI &&
    > + jointype != JOIN_FULL &&
    > + jointype != JOIN_RIGHT &&
    > + jointype != JOIN_RIGHT_ANTI &&
    >   outerrel->partial_pathlist != NIL &&
    >   bms_is_empty(joinrel->lateral_relids))
    >
    > What has changed that enabled JOIN_UNIQUE_OUTER to handle partial
    > paths? Or is it no longer possible for the outer path to be partial?
    
    It's the latter, as indicated by the Assert in create_unique_paths():
    
    +       /*
    +        * There shouldn't be any partial paths for the unique relation;
    +        * otherwise, we won't be able to properly guarantee uniqueness.
    +        */
    +       Assert(unique_rel->partial_pathlist == NIL);
    
    Thanks
    Richard
    
    
    
    
  19. Re: Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-08-04T02:08:08Z

    The v5 patch does not apply anymore, and here is a new rebase.  There
    are two main changes in v6:
    
    * I choose to use the check I proposed earlier to determine whether a
    relation has been unique-ified in costsize.c.
    
    * Now that the only call to relation_has_unique_index_for() that
    supplied an exprlist and oprlist has been removed, the loop handling
    those lists is effectively dead code.  0002 removes that loop and
    simplifies the function accordingly.
    
    Thanks
    Richard
    
  20. Re: Pathify RHS unique-ification for semijoin planning

    wenhui qiu <qiuwenhuifx@gmail.com> — 2025-08-07T09:04:41Z

    HI Richard Guo
    +/*
    + * Is given relation unique-ified?
    + *
    + * When the nominal jointype is JOIN_INNER, sjinfo->jointype is JOIN_SEMI,
    and
    + * the given rel is exactly the RHS of the semijoin, it indicates that the
    rel
    + * has been unique-ified.
    + */
    +#define IS_UNIQUEIFIED_REL(rel, sjinfo, nominal_jointype) \
    + ((nominal_jointype) == JOIN_INNER && (sjinfo)->jointype == JOIN_SEMI && \
    + bms_equal((sjinfo)->syn_righthand, (rel)->relids))
    +
    
    In light of this commit (
    https://github.com/postgres/postgres/commit/e035863c9a04beeecc254c3bfe48dab58e389e10),
    I also recommend changing the macro to a static inline function. Macros are
    harder to debug and lack type safety.
    static inline bool
    is_uniqueified_rel(RelOptInfo *rel, SpecialJoinInfo *sjinfo, JoinType
    nominal_jointype)
    {
        return nominal_jointype == JOIN_INNER &&
               sjinfo->jointype == JOIN_SEMI &&
               bms_equal(sjinfo->syn_righthand, rel->relids);
    }
    
    Thanks
    
    On Mon, Aug 4, 2025 at 10:08 AM Richard Guo <guofenglinux@gmail.com> wrote:
    
    > The v5 patch does not apply anymore, and here is a new rebase.  There
    > are two main changes in v6:
    >
    > * I choose to use the check I proposed earlier to determine whether a
    > relation has been unique-ified in costsize.c.
    >
    > * Now that the only call to relation_has_unique_index_for() that
    > supplied an exprlist and oprlist has been removed, the loop handling
    > those lists is effectively dead code.  0002 removes that loop and
    > simplifies the function accordingly.
    >
    > Thanks
    > Richard
    >
    
  21. Re: Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-08-08T01:39:08Z

    On Thu, Aug 7, 2025 at 6:04 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
    > In light of this commit (https://github.com/postgres/postgres/commit/e035863c9a04beeecc254c3bfe48dab58e389e10), I also recommend changing the macro to a static inline function. Macros are harder to debug and lack type safety.
    
    I'm inclined not to do that.  We already have other macros for
    checking whether a relation is of a certain kind, and I'd prefer to
    keep the new check consistent with those.
    
    Thanks
    Richard
    
    
    
    
  22. Re: Pathify RHS unique-ification for semijoin planning

    wenhui qiu <qiuwenhuifx@gmail.com> — 2025-08-08T02:21:40Z

    Hi Richard
        Thanks for your feedback. I agree this approach is better for keeping
    the code style consistent.
    
    
    Thanks
    
    On Fri, Aug 8, 2025 at 9:39 AM Richard Guo <guofenglinux@gmail.com> wrote:
    
    > On Thu, Aug 7, 2025 at 6:04 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
    > > In light of this commit (
    > https://github.com/postgres/postgres/commit/e035863c9a04beeecc254c3bfe48dab58e389e10),
    > I also recommend changing the macro to a static inline function. Macros are
    > harder to debug and lack type safety.
    >
    > I'm inclined not to do that.  We already have other macros for
    > checking whether a relation is of a certain kind, and I'd prefer to
    > keep the new check consistent with those.
    >
    > Thanks
    > Richard
    >
    
  23. Re: Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-08-12T01:43:06Z

    On Mon, Aug 4, 2025 at 11:08 AM Richard Guo <guofenglinux@gmail.com> wrote:
    > The v5 patch does not apply anymore, and here is a new rebase.  There
    > are two main changes in v6:
    >
    > * I choose to use the check I proposed earlier to determine whether a
    > relation has been unique-ified in costsize.c.
    >
    > * Now that the only call to relation_has_unique_index_for() that
    > supplied an exprlist and oprlist has been removed, the loop handling
    > those lists is effectively dead code.  0002 removes that loop and
    > simplifies the function accordingly.
    
    Does anyone plan to review this patch further?  I intend to push it in
    two weeks unless there are any objections or additional comments.
    
    Thanks
    Richard
    
    
    
    
  24. Re: Pathify RHS unique-ification for semijoin planning

    Álvaro Herrera <alvherre@kurilemu.de> — 2025-08-12T15:58:24Z

    On 2025-Aug-12, Richard Guo wrote:
    
    > Does anyone plan to review this patch further?  I intend to push it in
    > two weeks unless there are any objections or additional comments.
    
    No review, but apparently "uniquify" is more widely accepted than
    "uniqueify".  No dictionary lists either words AFAICS, except that
    Wiktionary lists the former:
    https://en.wiktionary.org/wiki/uniquify
    but apparently adjectives ending in "-e" are prone to lose it when a
    verb is formed from them with "-ify", such as
    
    https://www.merriam-webster.com/dictionary/falsify
    https://www.merriam-webster.com/dictionary/intensify
    https://www.merriam-webster.com/dictionary/simplify
    
    There aren't many though.  Most in this list don't end in -e:
    https://en.wiktionary.org/w/index.php?title=Category:English_terms_suffixed_with_-ify
    
    -- 
    Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
    "El destino baraja y nosotros jugamos" (A. Schopenhauer)
    
    
    
    
  25. Re: Pathify RHS unique-ification for semijoin planning

    Tom Lane <tgl@sss.pgh.pa.us> — 2025-08-12T16:38:19Z

    =?utf-8?Q?=C3=81lvaro?= Herrera <alvherre@kurilemu.de> writes:
    > No review, but apparently "uniquify" is more widely accepted than
    > "uniqueify".
    
    Personally I'd write "unique-ify", seeing that neither of the forms
    without the dash are considered good English.  Of course, if you
    need to make identifiers out of this, that solution doesn't work;
    but you could just avoid the construction --- say, "make_path_unique"
    rather than "uniquify_path".
    
    			regards, tom lane
    
    
    
    
  26. Re: Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-08-13T02:17:52Z

    On Wed, Aug 13, 2025 at 1:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > =?utf-8?Q?=C3=81lvaro?= Herrera <alvherre@kurilemu.de> writes:
    > > No review, but apparently "uniquify" is more widely accepted than
    > > "uniqueify".
    
    > Personally I'd write "unique-ify", seeing that neither of the forms
    > without the dash are considered good English.  Of course, if you
    > need to make identifiers out of this, that solution doesn't work;
    > but you could just avoid the construction --- say, "make_path_unique"
    > rather than "uniquify_path".
    
    Some 'git grep' work shows that, currently on master, we commonly use
    the form "unique-ify" (with the dash) and its variants, such as:
    unique-ify, unique-ified, unique-ification, and unique-ifying.
    
    $ git grep -in 'unique-if' | wc -l
    50
    
    There is one instance of the form "uniquify":
    planner.c:5107: * check).  We can uniquify these tuples simply by just taking
    
    And one instance of "uniqueify" (without the dash):
    jsonb_util.c:65:static void uniqueifyJsonbObject()
    
    Given this, I'd prefer to stick with "unique-ify", for consistency
    with the majority usage in the codebase.
    
    In this patch, the only instance that doesn't follow the "unique-ify"
    form is the macro IS_UNIQUEIFIED_REL, as dashes are not allowed in C
    identifiers.  Maybe a better alternative is IS_RELATION_UNIQUE?  Any
    suggestions?
    
    Thanks
    Richard
    
    
    
    
  27. Re: Pathify RHS unique-ification for semijoin planning

    Tom Lane <tgl@sss.pgh.pa.us> — 2025-08-13T02:27:47Z

    Richard Guo <guofenglinux@gmail.com> writes:
    > Given this, I'd prefer to stick with "unique-ify", for consistency
    > with the majority usage in the codebase.
    
    +1.  (Not but what I might've been responsible for many of the
    existing usages, so my opinion is perhaps counting twice here.)
    
    > In this patch, the only instance that doesn't follow the "unique-ify"
    > form is the macro IS_UNIQUEIFIED_REL, as dashes are not allowed in C
    > identifiers.  Maybe a better alternative is IS_RELATION_UNIQUE?  Any
    > suggestions?
    
    Hm ... to my ear, "unique-ified" implies that we took some positive
    action to make the path's output unique, such as running it through
    a hashagg or Unique node.  IS_RELATION_UNIQUE only implies that the
    output is unique, so for example a scan of a primary key should
    satisfy such a predicate.  Not having read the patch (I do hope
    to get to that), I'm not sure which connotation you have in mind.
    If it's the latter, IS_RELATION_UNIQUE seems like a fine name.
    If it's the former, maybe something like "RELATION_WAS_MADE_UNIQUE"?
    That's not very pretty though ...
    
    			regards, tom lane
    
    
    
    
  28. Re: Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-08-13T03:17:00Z

    On Wed, Aug 13, 2025 at 11:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Richard Guo <guofenglinux@gmail.com> writes:
    > > In this patch, the only instance that doesn't follow the "unique-ify"
    > > form is the macro IS_UNIQUEIFIED_REL, as dashes are not allowed in C
    > > identifiers.  Maybe a better alternative is IS_RELATION_UNIQUE?  Any
    > > suggestions?
    
    > Hm ... to my ear, "unique-ified" implies that we took some positive
    > action to make the path's output unique, such as running it through
    > a hashagg or Unique node.  IS_RELATION_UNIQUE only implies that the
    > output is unique, so for example a scan of a primary key should
    > satisfy such a predicate.  Not having read the patch (I do hope
    > to get to that), I'm not sure which connotation you have in mind.
    > If it's the latter, IS_RELATION_UNIQUE seems like a fine name.
    > If it's the former, maybe something like "RELATION_WAS_MADE_UNIQUE"?
    > That's not very pretty though ...
    
    It's the former: this macro is to signal that we've explicitly taken
    steps to make the output of the relation unique.  IMO, "unique-ified"
    best describes this, but we cannot use it directly in the macro name
    because of the dash.
    
    Hmm, I think "RELATION_WAS_MADE_UNIQUE" works well because it clearly
    conveys that the relation has been explicitly unique-ified.  It's a
    bit verbose, but I found that we have similar names in our codebase,
    such as VAC_BLK_WAS_EAGER_SCANNED.
    
    Thanks
    Richard
    
    
    
    
  29. Re: Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-08-18T06:07:42Z

    On Tue, Aug 12, 2025 at 10:43 AM Richard Guo <guofenglinux@gmail.com> wrote:
    > On Mon, Aug 4, 2025 at 11:08 AM Richard Guo <guofenglinux@gmail.com> wrote:
    > > The v5 patch does not apply anymore, and here is a new rebase.  There
    > > are two main changes in v6:
    > >
    > > * I choose to use the check I proposed earlier to determine whether a
    > > relation has been unique-ified in costsize.c.
    > >
    > > * Now that the only call to relation_has_unique_index_for() that
    > > supplied an exprlist and oprlist has been removed, the loop handling
    > > those lists is effectively dead code.  0002 removes that loop and
    > > simplifies the function accordingly.
    
    > Does anyone plan to review this patch further?  I intend to push it in
    > two weeks unless there are any objections or additional comments.
    
    Here's the updated version of the patch, which renames the macro
    IS_UNIQUEIFIED_REL to RELATION_WAS_MADE_UNIQUE, and includes some
    comment updates as well.  I plan to push it soon, barring any
    objections.
    
    This patch removes the last call to make_sort_from_sortclauses(), so
    I'm wondering if we can safely remove the function itself.  Or should
    we keep it around in case it's used by extensions or might be needed
    in the future?
    
    Thanks
    Richard
    
  30. Re: Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-08-19T03:05:19Z

    On Mon, Aug 18, 2025 at 3:07 PM Richard Guo <guofenglinux@gmail.com> wrote:
    > Here's the updated version of the patch, which renames the macro
    > IS_UNIQUEIFIED_REL to RELATION_WAS_MADE_UNIQUE, and includes some
    > comment updates as well.  I plan to push it soon, barring any
    > objections.
    
    Pushed.
    
    > This patch removes the last call to make_sort_from_sortclauses(), so
    > I'm wondering if we can safely remove the function itself.  Or should
    > we keep it around in case it's used by extensions or might be needed
    > in the future?
    
    This function, along with two other make_xxx() functions from
    createplan.c, was exported in 570be1f73 because CitusDB was using
    them.
    
    commit 570be1f73f385abb557bda15b718d7aac616cc15
    Author: Tom Lane <tgl@sss.pgh.pa.us>
    Date:   Sat Mar 12 12:12:59 2016 -0500
    
        Re-export a few of createplan.c's make_xxx() functions.
    
        CitusDB is using these and don't wish to redesign their code right now.
        I am not on board with this being a good idea, or a good precedent,
        but I lack the energy to fight about it.
    
    I actually agree with Tom that it's not a good idea to create Plan
    nodes outside of createplan.c; instead, one should construct a Path
    tree and let create_plan() convert it into Plan nodes.
    
    I'm not sure whether CitusDB has redesigned their code in this way,
    but for now, I prefer not to remove make_sort_from_sortclauses() just
    to be safe.
    
    Thanks
    Richard
    
    
    
    
  31. Re: Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-09-02T10:10:23Z

    On Wed, Jul 23, 2025 at 5:11 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
    > As a very trivial test on this patch, I ran the query in your opening
    > email, both with and without the patch, scaling up the size of the table
    > a little bit.
    
    > This is a really nice improvement.  I think we could find queries that
    > are arbitrarily faster, by feeding enough tuples to the unnecessary Sort
    > nodes.
    
    FWIW, I'm looking for a query to better showcase the performance
    improvement from this patch.  Here is one I found.
    
    create table t (a int, b int);
    insert into t select i%10, i%10 from generate_series(1,50000) i;
    create index on t (a, b);
    analyze t;
    
    explain (analyze, costs on)
    select * from t t1, t t2 where (t1.a, t2.b) in (select a, b from t t3)
    order by t1.a, t2.b;
    
    Here are the planning and execution time on my snail-paced machine
    (best of 3), without and with this patch.
    
    -- without this patch
     Planning Time: 0.850 ms
     Execution Time: 108149.907 ms
    
    -- with this patch
     Planning Time: 0.728 ms
     Execution Time: 29229.748 ms
    
    So this specific case runs about 3.7 times faster, which is really
    nice.
    
    - Richard
    
    
    
    
  32. Re: Pathify RHS unique-ification for semijoin planning

    Andrei Lepikhov <lepihov@gmail.com> — 2025-09-02T10:55:59Z

    On 2/9/2025 12:10, Richard Guo wrote:
    > So this specific case runs about 3.7 times faster, which is really
    > nice.No questions, it is good enough optimisation. I'm worried only about 
    implementation: It creates one more RelOptInfo that may look like a 
    baserel, but we can't find it by find_base_rel or even find_join_rel. It 
    seems a little inconsistent to me.
    Don't think it is critical - just complicates life for extension 
    developers in some cases.
    
    -- 
    regards, Andrei Lepikhov
    
    
    
    
  33. Re: Pathify RHS unique-ification for semijoin planning

    Richard Guo <guofenglinux@gmail.com> — 2025-09-03T09:12:53Z

    On Tue, Sep 2, 2025 at 7:56 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
    > No questions, it is good enough optimisation. I'm worried only about
    > implementation: It creates one more RelOptInfo that may look like a
    > baserel, but we can't find it by find_base_rel or even find_join_rel. It
    > seems a little inconsistent to me.
    > Don't think it is critical - just complicates life for extension
    > developers in some cases.
    
    The RelOptInfo representing the unique-ified rel is intended to be
    used only internally during path generation for semi-joins, and should
    be opaque outside of that.  I don't think extensions should know about
    it.
    
    - Richard
    
    
    
    
  34. Re: Pathify RHS unique-ification for semijoin planning

    Andrei Lepikhov <lepihov@gmail.com> — 2025-09-03T10:00:18Z

    On 3/9/2025 11:12, Richard Guo wrote:
    > On Tue, Sep 2, 2025 at 7:56 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
    >> No questions, it is good enough optimisation. I'm worried only about
    >> implementation: It creates one more RelOptInfo that may look like a
    >> baserel, but we can't find it by find_base_rel or even find_join_rel. It
    >> seems a little inconsistent to me.
    >> Don't think it is critical - just complicates life for extension
    >> developers in some cases.
    > 
    > The RelOptInfo representing the unique-ified rel is intended to be
    > used only internally during path generation for semi-joins, and should
    > be opaque outside of that.  I don't think extensions should know about
    > it.
    I just stated the fact - it is not for debate ;).
    To understand how deeply developers utilise the core, take a look at the 
    pg_hint_plan. The extensibility of the Postgres planner isn't flexible 
    enough (and will never be) for the developers' purposes. So, they 
    exploit every exported function, variable and type.
    Every feature intended to be hidden from extensions should be wrapped 
    into an internal type, not exposed in .h files.
    It leads us to a discussion about the voice of extension developers in 
    the core decisions. I think it is worth debating at one of the 
    conferences in the near future.
    
    -- 
    regards, Andrei Lepikhov