Thread

  1. Re: Use merge-based matching for MCVs in eqjoinsel

    Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> — 2025-11-10T14:20:42Z

    Thanks for the detailed feedback!
    
    
    On 04.11.2025 00:55, Tom Lane wrote:
    
    > Hmm.  Those results sure look like there is a performance regression
    > up to at least 100 MCVs ... not a large one, but consistently a few
    > percent.  That's a bit sad for a patch purporting to improve
    > performance.  It looks to me like perhaps we should stick to the old
    > algorithm up to 100 or possibly even more MCVs.  Certainly the
    > threshold needs to be higher than 1, as you have it now.
    
    
    I re-ran the benchmark on JOB with a threshold of 100.Here are the 
    updated results:
    
    default_statistics_target | Planner Speedup (×) | Planner Before (ms) | 
    Planner After (ms)
    --------------------------------------------------------------------------------
    1                         | 1.00                | 2320.412            | 
    2318.377
    5                         | 0.99                | 2335.894            | 
    2360.890
    10                        | 1.00                | 2350.612            | 
    2347.154
    25                        | 1.01                | 2365.977            | 
    2342.312
    50                        | 0.99                | 2381.554            | 
    2405.262
    75                        | 1.00                | 2396.481            | 
    2399.828
    100                       | 1.00                | 2410.367            | 
    2412.456
    1000                      | 1.11                | 2850.853            | 
    2564.303
    2500                      | 2.04                | 5571.688            | 
    2731.545
    5000                      | 6.05                | 18850.084           | 
    3114.692
    7500                      | 11.96               | 39160.898           | 
    3273.688
    10000                     | 19.04               | 71334.113           | 
    3745.955
    
    > I eyeballed the patch itself very briefly, and have a couple
    > quick comments:
    >
    > * Is hash_msv a typo for hash_mcv?  If not, maybe spell out what
    > it's supposed to mean.
    
    
    Yes, that was a typo — fixed.
    
    > * The patch would be easier to read if it didn't reindent a couple
    > large chunks of existing code.  Can we change the factorization
    > to avoid that?  If not, I'd recommend submitting without that
    > reindentation, and reminding the committer to reindent at the last
    > moment.
    
    
    Fixed as well. I’ve removed all reindentation changes. I will keep that 
    in mind for future submissions.
    
    
    > * The calculation loops in eqjoinsel_inner and eqjoinsel_semi
    > are not identical, which makes it look quite weird to be
    > writing just one function that conditionally replaces both.
    > I wonder if we should refactor to have just one copy (and
    > just eat the extra cycles of calculating matchprodfreq).
    
    
    Agreed. I’ve dropped the attempt to merge them into a single function.
    
    
    >
    > * In fact ... I wonder if we should try harder to not do essentially
    > identical calculations twice, remembering that eqjoinsel_semi is
    > always used alongside eqjoinsel_inner.  (Of course, we could only do
    > that if clamped_nvalues2 is the same as sslot2->nvalues, but that's
    > frequently true.)  I think the reason it's duplicative right now
    > is that we regarded this semijoin calculation as an experiment and
    > so didn't want to invest a lot of effort in it ... but this patch
    > is exactly a lot of effort, so maybe it's time to deal with that
    > unfinished business.
    >
    > 			regards, tom lane
    
    
    Good point. I addressed this in a separate patch: eqjoinsel_inner() now 
    saves matchfreq1, matchfreq2, nmatches so that eqjoinsel_semi() can 
    reuse them when (clamped_nvalues2 == sslot2->nvalues). If the MCV list 
    on the RHS is clamped, we still recompute locally. If you have a cleaner 
    idea for how to share these values between the two functions without 
    passing them explicitly, I’d be happy to consider it.
    
    I’m attaching two patches:
    1. v4-0001-Avoid-duplicate-MCV-matching-in-eqjoinsel_semi-an.patch - 
    removes redundant MCV matching for semi/anti joins;
    2. v4-0002-Optimize-MCV-matching-in-eqjoinsel_inner-and-eqjo.patch - 
    adds hash-based MCV matching with a configurable threshold and includes 
    fixes based on your comments.
    
    --
    Best regards,
    Ilia Evdokimov,
    Tantor Labs LLC,
    https://tantorlabs.com/