Re: Pathify RHS unique-ification for semijoin planning

Richard Guo <guofenglinux@gmail.com>

From: Richard Guo <guofenglinux@gmail.com>

To: PostgreSQL-development <pgsql-hackers@postgresql.org>

Date: 2025-07-01T02:57:44Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Simplify relation_has_unique_index_for()
- bf9ee294e567 19 (unreleased) landed
Pathify RHS unique-ification for semijoin planning
- 24225ad9aafc 19 (unreleased) landed
Convert varatt.h access macros to static inline functions.
- e035863c9a04 19 (unreleased) cited
Re-export a few of createplan.c's make_xxx() functions.
- 570be1f73f38 9.6.0 cited

Attachments

v3-0001-Pathify-RHS-unique-ification-for-semijoin-plannin.patch (application/octet-stream) patch v3-0001

On Tue, Jun 3, 2025 at 4:52 PM Richard Guo <guofenglinux@gmail.com> wrote:
> Here is an updated version of the patch, which is ready for review.
> I've fixed a cost estimation issue, improved some comments, and added
> a commit message.  Nothing essential has changed.

This patch does not apply anymore, and here is a new rebase.

It may be argued that this patch introduces additional planning
overhead by considering multiple unique-ification paths for the RHS.
While that is true to some extent, I don't think this is a problem.
Please bear with me a moment.

* The additional path generation only occurs in specific semijoin
cases where one input rel is exactly the RHS.  Queries without such
semijoins are not affected.

* This patch only considers the cheapest total path and presorted
paths from the original RHS.  These are typically few in number, and
each has a high likelihood of contributing to a lower overall cost for
the final plan.  I think the cost-benefit trade-off is worthwhile.

* This patch follows the convention in joinpath.c of exploring
alternative join input paths, rather than introducing novel overhead.
For example, when planning (A SEMIJOIN B), the planner considers
multiple paths from B, including its cheapest total path and any paths
with useful sort orders.  There is no clear reason why, in the
analogous case of (A INNERJOIN unique-ified(B)), we should restrict
ourselves to only one path from the unique-ified RHS.

* On the other hand, if we insist on considering only a single path
from the unique-ified RHS, we face a dilemma when the hash-based
implementation has a cheaper total cost, but the sort-based
implementation has a better sort order.  In such cases, what should
our selection criteria be?  Currently, create_unique_path() simply
compares total_cost to choose the cheaper one (even without applying a
fuzz factor), and ignores sort order entirely.  I don't think this
approach makes sense.  It is also inconsistent with the general
pathification framework, where we rely on add_path() to retain the
best path or set of paths based on cost and other metrics, rather than
using such simple heuristics.

Another point I'd like to mention is that this patch removes the
UNIQUE_PATH_NOOP related code along the way, because I think it's dead
code.  If the RHS rel is provably unique, the semijoin should have
already been simplified to a plain inner join by analyzejoins.c.
However, I might be overlooking something, and I'd appreciate any
feedback or corrections.

Thanks
Richard