Thread

  1. Re: Use merge-based matching for MCVs in eqjoinsel

    Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> — 2025-09-08T10:35:50Z

    On 08.09.2025 13:08, David Geier wrote:
    > Hi Ilia!
    >
    > On 05.09.2025 16:03, David Geier wrote:
    >>>> I propose an optimization: when the column datatype supports
    >>>> ordering(i.e., has < and >), we can sort both MCV lists and apply
    >>>> mege-style algorithm to detect matches. This reduces runtime from
    >>>> O(N^2) to O(NlogN), where N is the number of MCV entries. The patch
    >>>> also applies the same optimization to semi-join clauses, which show
    >>>> similar performance behavior.
    >> Why do you sort both lists and then merge instead of putting the smaller
    >> list into a hash map and then doing hash lookups (if the type is hashable)?
    > I've tested your and my code with the following script:
    >
    > CREATE TABLE foo(col TEXT);
    > CREATE TABLE bar(col TEXT);
    > SET default_statistics_target = 10000;
    >
    > -- Generate MCV values. PostgreSQL doesn't store MCVs if the table has
    > -- only a single value or every value has exactly the same cardinality.
    > DO $$
    > BEGIN
    >    FOR i IN 1..10000 LOOP
    >      FOR j IN 1..least(i, 50) LOOP
    >        INSERT INTO foo VALUES ('aaaaaaaaaaaaaaaaaaaa' || i::TEXT);
    >        INSERT INTO bar VALUES ('aaaaaaaaaaaaaaaaaaaa' || (i + 100000)::TEXT);
    >      END LOOP;
    >    END LOOP;
    > END;
    > $$;
    >
    > ANALYZE foo, bar;
    > \timing on
    > EXPLAIN SELECT * FROM foo f, bar b WHERE f.col = b.col;
    >
    > Results are:
    >
    > - master: 433 ms
    > - Order+Merge: 11 ms
    > - Hash map: 4 ms
    >
    > I've attached my draft patch.
    >
    > --
    > David Geier
    
    
    Hi David,
    
    Thank you for reviewing.
    
    I have read all the previous messages - and yes, you are right. I don’t 
    know why I didn’t consider using a hash table approach initially. Your 
    idea makes a lot of sense.
    
    To evaluate it, I ran benchmarks on JOB with three variants:
    
    $ ./benchmark.sh master
    $ ./benchmark.sh merge
    $ ./benchmark.sh hash
    
    I compared total planning time across all 113 queries.
    
    $ python3 planning_time.py master hash
    default_statistics_target | Planner Speedup (×) | Planner Before (ms) | 
    Planner After (ms)
    --------------------------------------------------------------------------------
    100                       | 1.00                | 1892.627            | 
    1886.969
    1000                      | 1.09                | 2286.922            | 
    2100.099
    2500                      | 1.94                | 4647.167            | 
    2400.711
    5000                      | 6.15                | 17964.779           | 
    2919.914
    7500                      | 10.58               | 38622.443           | 
    3650.375
    10000                     | 16.33               | 69538.085           | 
    4257.864
    
    $ python3 planning_time.py master merge
    default_statistics_target | Planner Speedup (×) | Planner Before (ms) | 
    Planner After (ms)
    --------------------------------------------------------------------------------
    100                       | 1.00                | 1892.627            | 
    1898.622
    1000                      | 1.12                | 2286.922            | 
    2033.553
    2500                      | 1.92                | 4647.167            | 
    2423.552
    5000                      | 5.94                | 17964.779           | 
    3025.739
    7500                      | 10.48               | 38622.443           | 
    3684.262
    10000                     | 16.72               | 69538.085           | 
    4159.418
    
    Based on these results, I’d prefer the hash lookup implementation, so I 
    think it makes sense to improve your patch further and bring it into 
    good shape. Shall I take care of that, or would you prefer to do it 
    yourself?
    
    --
    Best regards,
    Ilia Evdokimov,
    Tantor Labs LLC,
    https://tantorlabs.com