Thread

  1. Use merge-based matching for MCVs in eqjoinsel

    Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> — 2025-07-21T13:55:56Z

    Hi hackers,
    
    While analyzing planner performance on JOB with 
    default_statistics_target = 1000, I noticed that a significant portion 
    of planning time is spent inside the eqjoinsel() function. According to 
    perf, in most JOB queries at default_statistics_target = 1000, 
    eqjoinsel() is the most expensive function during planning, accounting 
    for approximately 8% of total CPU time. At default_statistics_target = 
    10000, the planner spend up to 75% of its time inside eqjoinsel(), 
    making it one of the primary bottlenecks.
    
    Еhis overhead is caused by the O(N^2) nested-loop comparison of MCVs in 
    var1 = var2 clauses.
    
    I propose an optimization: when the column datatype supports 
    ordering(i.e., has < and >), we can sort both MCV lists and apply 
    mege-style algorithm to detect matches. This reduces runtime from O(N^2) 
    to O(NlogN), where N is the number of MCV entries. The patch also 
    applies the same optimization to semi-join clauses, which show similar 
    performance behavior.
    
    On JOB, this changes reduce planner time in most queries with complex 
    joins and large MCVs with no observable effect on plan quality. I’ve 
    also attached bar charts showing per-query planner time before and after 
    the patch for default_statistics_target = 100, 1000, 10000 along with 
    query numbers for reference.
    
    Any feedback or suggestions are welcome!
    
    --
    Best regards,
    Ilia Evdokimov,
    Tantor Labs LLC.