Thread
-
Add a greedy join search algorithm to handle large join problems
Chengpeng Yan <chengpeng_yan@outlook.com> — 2025-12-02T03:48:26Z
Hi hackers, This patch implements GOO (Greedy Operator Ordering), a greedy join-order search method for large join problems, based on Fegaras (DEXA ’98) [1]. The algorithm repeatedly selects, among all legal joins, the join pair with the lowest estimated total cost, merges them, and continues until a single join remains. Patch attached. To get an initial sense of performance, I reused the star join / snowflake examples and the testing script from the thread in [2]. The star-join GUC in that SQL workload was replaced with `enable_goo_join_search`, so the same tests can run under DP (standard dynamic programming) / GEQO(Genetic Query Optimizer) / GOO. For these tests, geqo_threshold was set to 15 for DP, and to 5 for both GEQO and GOO. Other planner settings, including join_collapse_limit, remained at their defaults. On my local machine, a single-client pgbench run produces the following throughput (tps): | DP | GEQO | GOO --------------------+----------+----------+----------- starjoin (inner) | 1762.52 | 192.13 | 6168.89 starjoin (outer) | 1683.92 | 173.90 | 5626.56 snowflake (inner) | 1829.04 | 133.40 | 3929.57 snowflake (outer) | 1397.93 | 99.65 | 3040.52 Open questions: * The paper ranks joins by estimated result size, while this prototype reuses the existing planner total_cost. This makes the implementation straightforward, but it may be worth exploring row-based rule. * The prototype allocates through multiple memory contexts, aiming to reduce memory usage during candidate evaluation. Suggestions on simplification are welcome. * Planning performance vs GEQO: On the star/snowflake benchmarks above, GOO reduces planning time significantly compared to GEQO. Broader evaluation on more representative workloads (e.g., TPC-H / TPC-DS / JOB) is TODO, covering both planning speed and plan quality. * There is no tuning knob like `geqo_seed` for GEQO if GOO produces a poor ordering. * The long-term integration is open: * fully replace GEQO, * keep GEQO available and optionally try both, * or use GOO as a starting point for GEQO. Comments and review would be appreciated. References: [1] Leonidas Fegaras, “A New Heuristic for Optimizing Large Queries”, DEXA ’98. ACM Digital Library: https://dl.acm.org/doi/10.5555/648311.754892 A publicly accessible PostScript version is available from the author's page: https://lambda.uta.edu/order.ps [2] Star/snowflake join thread and benchmarks: https://www.postgresql.org/message-id/a22ec6e0-92ae-43e7-85c1-587df2a65f51%40vondra.me -- Best regards, Chengpeng Yan