Re: Pathify RHS unique-ification for semijoin planning
Richard Guo <guofenglinux@gmail.com>
From: Richard Guo <guofenglinux@gmail.com>
To: PostgreSQL-development <pgsql-hackers@postgresql.org>
Date: 2025-07-01T02:57:44Z
Lists: pgsql-hackers
Commits
Same data as JSON:
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
Simplify relation_has_unique_index_for()
- bf9ee294e567 19 (unreleased) landed
-
Pathify RHS unique-ification for semijoin planning
- 24225ad9aafc 19 (unreleased) landed
-
Convert varatt.h access macros to static inline functions.
- e035863c9a04 19 (unreleased) cited
-
Re-export a few of createplan.c's make_xxx() functions.
- 570be1f73f38 9.6.0 cited
Attachments
- v3-0001-Pathify-RHS-unique-ification-for-semijoin-plannin.patch (application/octet-stream) patch v3-0001
On Tue, Jun 3, 2025 at 4:52 PM Richard Guo <guofenglinux@gmail.com> wrote: > Here is an updated version of the patch, which is ready for review. > I've fixed a cost estimation issue, improved some comments, and added > a commit message. Nothing essential has changed. This patch does not apply anymore, and here is a new rebase. It may be argued that this patch introduces additional planning overhead by considering multiple unique-ification paths for the RHS. While that is true to some extent, I don't think this is a problem. Please bear with me a moment. * The additional path generation only occurs in specific semijoin cases where one input rel is exactly the RHS. Queries without such semijoins are not affected. * This patch only considers the cheapest total path and presorted paths from the original RHS. These are typically few in number, and each has a high likelihood of contributing to a lower overall cost for the final plan. I think the cost-benefit trade-off is worthwhile. * This patch follows the convention in joinpath.c of exploring alternative join input paths, rather than introducing novel overhead. For example, when planning (A SEMIJOIN B), the planner considers multiple paths from B, including its cheapest total path and any paths with useful sort orders. There is no clear reason why, in the analogous case of (A INNERJOIN unique-ified(B)), we should restrict ourselves to only one path from the unique-ified RHS. * On the other hand, if we insist on considering only a single path from the unique-ified RHS, we face a dilemma when the hash-based implementation has a cheaper total cost, but the sort-based implementation has a better sort order. In such cases, what should our selection criteria be? Currently, create_unique_path() simply compares total_cost to choose the cheaper one (even without applying a fuzz factor), and ignores sort order entirely. I don't think this approach makes sense. It is also inconsistent with the general pathification framework, where we rely on add_path() to retain the best path or set of paths based on cost and other metrics, rather than using such simple heuristics. Another point I'd like to mention is that this patch removes the UNIQUE_PATH_NOOP related code along the way, because I think it's dead code. If the RHS rel is provably unique, the semijoin should have already been simplified to a plain inner join by analyzejoins.c. However, I might be overlooking something, and I'd appreciate any feedback or corrections. Thanks Richard