Re: Pathify RHS unique-ification for semijoin planning

wenhui qiu <qiuwenhuifx@gmail.com>

From: wenhui qiu <qiuwenhuifx@gmail.com>

To: Andy Fan <zhihuifan1213@163.com>

Cc: Richard Guo <guofenglinux@gmail.com>, PostgreSQL-development <pgsql-hackers@postgresql.org>

Date: 2025-05-22T13:51:32Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Simplify relation_has_unique_index_for()
- bf9ee294e567 19 (unreleased) landed
Pathify RHS unique-ification for semijoin planning
- 24225ad9aafc 19 (unreleased) landed
Convert varatt.h access macros to static inline functions.
- e035863c9a04 19 (unreleased) cited
Re-export a few of createplan.c's make_xxx() functions.
- 570be1f73f38 9.6.0 cited

On Thu, 22 May 2025 at 17:28, Andy Fan <zhihuifan1213@163.com> wrote:

> Richard Guo <guofenglinux@gmail.com> writes:
>
> Hi,
>
> > However, in the case of sort-based implementation,
> > this function pays no attention to the subpath's pathkeys or the
> > pathkeys of the resulting output.
>
> Good finding!
>
> >
> > In addition to this specific issue, it seems to me that there are
> > other potential issues in create_unique_path().
> >
> > * Currently, we only consider the cheapest_total_path of the RHS when
> > unique-ifying it.
>
> I think it is better have a check the tuple_fraction for the startup_cost
> factor, for some paths where the total cost is high but the required
> fraction is lower.
>
> > I think this may cause us to miss some optimization
> > opportunities.  For example, there might be a path with a better sort
> > order that isn't the cheapest-total one.  Such a path could help avoid
> > a sort at a higher level, potentially resulting in a cheaper overall
> > plan.
>
> > * In create_unique_path(), we currently rely on heuristics to decide
> > whether to use a hash-based or sort-based method.  I think a better
> > approach would be to create paths for both methods and let add_path()
> > determine which one is better, similar to how we handle path selection
> > in other parts of the planner.
> >
> > Therefore, I'm thinking that maybe we could create a new RelOptInfo
> > for the RHS rel to represent its unique-ified version, and then
> > generate all worthwhile paths for it,
>
> This sounds great for me. and I think we can keep the fraction
> cheapest path on the new RelOptInfo as well, then all the things should
> be on the way.
>
> > To be concrete, I'm envisioning something like the following:
> >
> >             if (bms_equal(sjinfo->syn_righthand, rel2->relids) &&
> > -               create_unique_path(root, rel2, rel2->cheapest_total_path,
> > -                                  sjinfo) != NULL)
> > +               (rel2_unique = create_unique_rel(root, rel2, sjinfo)) !=
> NULL)
> >
> > ...
> >
> > -               add_paths_to_joinrel(root, joinrel, rel1, rel2,
> > -                                    JOIN_UNIQUE_INNER, sjinfo,
> > +               add_paths_to_joinrel(root, joinrel, rel1, rel2_unique,
> > +                                    JOIN_INNER, sjinfo,
> >                                      restrictlist);
> > -               add_paths_to_joinrel(root, joinrel, rel2, rel1,
> > -                                    JOIN_UNIQUE_OUTER, sjinfo,
> > +               add_paths_to_joinrel(root, joinrel, rel2_unique, rel1,
> > +                                    JOIN_INNER, sjinfo,
> >                                      restrictlist);
> >
> > In addition, by changing the join from "rel1" and "rel2" using
> > JOIN_UNIQUE_OUTER or JOIN_UNIQUE_INNER to a join between "rel1" and
> > "rel2_unique" using a standard JOIN_INNER, we might be able to get
> > rid of the JOIN_UNIQUE_OUTER and JOIN_UNIQUE_INNER jointypes.
>
> >>if we can, +10.

Agree
Pls kindly release a path for this？

>
>

>
>
>
>