Thread

  1. Re: Use merge-based matching for MCVs in eqjoinsel

    Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> — 2025-09-16T15:52:47Z

    Hi hackers,
    
    On 10.09.2025 16:56, Ilia Evdokimov wrote:
    > Unfortunately, the JOB benchmark does not contain semi join nodes. 
    > However, TPC-DS does. I'll look for the queries with slowest planner 
    > times there and check them.
    >
    > I'll need some time to check both join and semi join cases with small 
    > and large default_statistics_target. I'll share the results later.
    
    JOIN
    ==============================
    
    I’ve benchmarked the new implementation of eqjoinsel() with different 
    values of default_statistics_target. On small targets (1, 5, 10, 25, 50, 
    75, 100) the results are all within statistical noise, and I did not 
    observe any regressions. In my view, it’s reasonable to keep the current 
    condition that the hash table is not used for default_statistics_target 
    = 1. Raising that threshold does not seem useful.
    
    Here are the results for JOB queries (where the effect of semi join is 
    not visible due to different data distributions):
    
    default_statistics_target | Planner Speedup (×) | Planner Before (ms) | 
    Planner After (ms)
    ------------------------------------------------------------------------------------------
    1                         | 1.00                | 1846.643            | 
    1847.409
    5                         | 1.00                | 1836.391            | 
    1828.318
    10                        | 0.95                | 1841.750            | 
    1929.722
    25                        | 0.99                | 1873.172            | 
    1890.741
    50                        | 0.98                | 1869.897            | 
    1898.470
    75                        | 1.02                | 1969.368            | 
    1929.521
    100                       | 0.97                | 1857.890            | 
    1921.207
    1000                      | 1.14                | 2279.700            | 
    1997.102
    2500                      | 1.78                | 4682.658            | 
    2636.202
    5000                      | 6.45                | 15943.696           | 
    2471.242
    7500                      | 12.45               | 34350.855           | 
    2758.565
    10000                     | 20.52               | 62519.342           | 
    3046.819
    
    SEMI JOIN
    ==============================
    
    Unfortunately, in TPC-DS it is not possible to clearly see improvements 
    for semi joins. To address this, I designed a synthetic example where 
    the data distribution forces the loop to run fully, without exiting 
    early, which makes the effect on semi joins more visible. In this setup, 
    I also ensured that the length of the MCV array is equal to the chosen 
    default_statistics_target.
    
    CREATE TABLE t1 AS
    SELECT CASE
              WHEN g <= 3000000 * 0.9 THEN (g % 10000) + 1
              ELSE (g % 1000000) + 10000
            END AS id
    FROM generate_series(1, 3000000) g;
    
    CREATE TABLE t2 AS
    SELECT CASE
              WHEN g <= 3000000 * 0.9 THEN (g % 10000) + 10001
              ELSE (g % 1000000) + 20000
            END AS id
    FROM generate_series(1, 3000000) g;
    
    ANALYZE t1, t2;
    
    The results of the query are:
    
    SELECT * FROM t1
    WHERE id IN (SELECT id FROM t2);
    
    default_statistics_target | Planner Speedup (×) | Planner Before (ms) | 
    Planner After (ms)
    ------------------------------------------------------------------------------------------
    1                         | 1.12                | 1.191               | 
    1.062
    5                         | 1.02                | 0.493               | 
    0.481
    10                        | 0.92                | 0.431               | 
    0.471
    25                        | 1.27                | 0.393               | 
    0.309
    50                        | 1.04                | 0.432               | 
    0.416
    75                        | 0.96                | 0.398               | 
    0.415
    100                       | 0.95                | 0.450               | 
    0.473
    1000                      | 9.42                | 6.742               | 
    0.716
    2500                      | 19.15               | 21.621              | 
    1.129
    5000                      | 46.74               | 85.667              | 
    1.833
    7500                      | 73.26               | 194.806             | 
    2.659
    10000                     | 107.95              | 349.981             | 
    3.242
    
    -- 
    Best regards,
    Ilia Evdokimov,
    Tantor Labs LLC,
    https://tantorlabs.com