Thread
-
Re: Use merge-based matching for MCVs in eqjoinsel
Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> — 2025-11-10T14:20:42Z
Thanks for the detailed feedback! On 04.11.2025 00:55, Tom Lane wrote: > Hmm. Those results sure look like there is a performance regression > up to at least 100 MCVs ... not a large one, but consistently a few > percent. That's a bit sad for a patch purporting to improve > performance. It looks to me like perhaps we should stick to the old > algorithm up to 100 or possibly even more MCVs. Certainly the > threshold needs to be higher than 1, as you have it now. I re-ran the benchmark on JOB with a threshold of 100.Here are the updated results: default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms) -------------------------------------------------------------------------------- 1 | 1.00 | 2320.412 | 2318.377 5 | 0.99 | 2335.894 | 2360.890 10 | 1.00 | 2350.612 | 2347.154 25 | 1.01 | 2365.977 | 2342.312 50 | 0.99 | 2381.554 | 2405.262 75 | 1.00 | 2396.481 | 2399.828 100 | 1.00 | 2410.367 | 2412.456 1000 | 1.11 | 2850.853 | 2564.303 2500 | 2.04 | 5571.688 | 2731.545 5000 | 6.05 | 18850.084 | 3114.692 7500 | 11.96 | 39160.898 | 3273.688 10000 | 19.04 | 71334.113 | 3745.955 > I eyeballed the patch itself very briefly, and have a couple > quick comments: > > * Is hash_msv a typo for hash_mcv? If not, maybe spell out what > it's supposed to mean. Yes, that was a typo — fixed. > * The patch would be easier to read if it didn't reindent a couple > large chunks of existing code. Can we change the factorization > to avoid that? If not, I'd recommend submitting without that > reindentation, and reminding the committer to reindent at the last > moment. Fixed as well. I’ve removed all reindentation changes. I will keep that in mind for future submissions. > * The calculation loops in eqjoinsel_inner and eqjoinsel_semi > are not identical, which makes it look quite weird to be > writing just one function that conditionally replaces both. > I wonder if we should refactor to have just one copy (and > just eat the extra cycles of calculating matchprodfreq). Agreed. I’ve dropped the attempt to merge them into a single function. > > * In fact ... I wonder if we should try harder to not do essentially > identical calculations twice, remembering that eqjoinsel_semi is > always used alongside eqjoinsel_inner. (Of course, we could only do > that if clamped_nvalues2 is the same as sslot2->nvalues, but that's > frequently true.) I think the reason it's duplicative right now > is that we regarded this semijoin calculation as an experiment and > so didn't want to invest a lot of effort in it ... but this patch > is exactly a lot of effort, so maybe it's time to deal with that > unfinished business. > > regards, tom lane Good point. I addressed this in a separate patch: eqjoinsel_inner() now saves matchfreq1, matchfreq2, nmatches so that eqjoinsel_semi() can reuse them when (clamped_nvalues2 == sslot2->nvalues). If the MCV list on the RHS is clamped, we still recompute locally. If you have a cleaner idea for how to share these values between the two functions without passing them explicitly, I’d be happy to consider it. I’m attaching two patches: 1. v4-0001-Avoid-duplicate-MCV-matching-in-eqjoinsel_semi-an.patch - removes redundant MCV matching for semi/anti joins; 2. v4-0002-Optimize-MCV-matching-in-eqjoinsel_inner-and-eqjo.patch - adds hash-based MCV matching with a configurable threshold and includes fixes based on your comments. -- Best regards, Ilia Evdokimov, Tantor Labs LLC, https://tantorlabs.com/