Re: Use merge-based matching for MCVs in eqjoinsel

Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>

From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: David Geier <geidav.pg@gmail.com>, pgsql-hackers@lists.postgresql.org
Date: 2025-11-10T14:20:42Z
Lists: pgsql-hackers

Attachments

Thanks for the detailed feedback!


On 04.11.2025 00:55, Tom Lane wrote:

> Hmm.  Those results sure look like there is a performance regression
> up to at least 100 MCVs ... not a large one, but consistently a few
> percent.  That's a bit sad for a patch purporting to improve
> performance.  It looks to me like perhaps we should stick to the old
> algorithm up to 100 or possibly even more MCVs.  Certainly the
> threshold needs to be higher than 1, as you have it now.


I re-ran the benchmark on JOB with a threshold of 100.Here are the 
updated results:

default_statistics_target | Planner Speedup (×) | Planner Before (ms) | 
Planner After (ms)
--------------------------------------------------------------------------------
1                         | 1.00                | 2320.412            | 
2318.377
5                         | 0.99                | 2335.894            | 
2360.890
10                        | 1.00                | 2350.612            | 
2347.154
25                        | 1.01                | 2365.977            | 
2342.312
50                        | 0.99                | 2381.554            | 
2405.262
75                        | 1.00                | 2396.481            | 
2399.828
100                       | 1.00                | 2410.367            | 
2412.456
1000                      | 1.11                | 2850.853            | 
2564.303
2500                      | 2.04                | 5571.688            | 
2731.545
5000                      | 6.05                | 18850.084           | 
3114.692
7500                      | 11.96               | 39160.898           | 
3273.688
10000                     | 19.04               | 71334.113           | 
3745.955

> I eyeballed the patch itself very briefly, and have a couple
> quick comments:
>
> * Is hash_msv a typo for hash_mcv?  If not, maybe spell out what
> it's supposed to mean.


Yes, that was a typo — fixed.

> * The patch would be easier to read if it didn't reindent a couple
> large chunks of existing code.  Can we change the factorization
> to avoid that?  If not, I'd recommend submitting without that
> reindentation, and reminding the committer to reindent at the last
> moment.


Fixed as well. I’ve removed all reindentation changes. I will keep that 
in mind for future submissions.


> * The calculation loops in eqjoinsel_inner and eqjoinsel_semi
> are not identical, which makes it look quite weird to be
> writing just one function that conditionally replaces both.
> I wonder if we should refactor to have just one copy (and
> just eat the extra cycles of calculating matchprodfreq).


Agreed. I’ve dropped the attempt to merge them into a single function.


>
> * In fact ... I wonder if we should try harder to not do essentially
> identical calculations twice, remembering that eqjoinsel_semi is
> always used alongside eqjoinsel_inner.  (Of course, we could only do
> that if clamped_nvalues2 is the same as sslot2->nvalues, but that's
> frequently true.)  I think the reason it's duplicative right now
> is that we regarded this semijoin calculation as an experiment and
> so didn't want to invest a lot of effort in it ... but this patch
> is exactly a lot of effort, so maybe it's time to deal with that
> unfinished business.
>
> 			regards, tom lane


Good point. I addressed this in a separate patch: eqjoinsel_inner() now 
saves matchfreq1, matchfreq2, nmatches so that eqjoinsel_semi() can 
reuse them when (clamped_nvalues2 == sslot2->nvalues). If the MCV list 
on the RHS is clamped, we still recompute locally. If you have a cleaner 
idea for how to share these values between the two functions without 
passing them explicitly, I’d be happy to consider it.

I’m attaching two patches:
1. v4-0001-Avoid-duplicate-MCV-matching-in-eqjoinsel_semi-an.patch - 
removes redundant MCV matching for semi/anti joins;
2. v4-0002-Optimize-MCV-matching-in-eqjoinsel_inner-and-eqjo.patch - 
adds hash-based MCV matching with a configurable threshold and includes 
fixes based on your comments.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/