Re: Use merge-based matching for MCVs in eqjoinsel

Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>

From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: David Geier <geidav.pg@gmail.com>, pgsql-hackers@lists.postgresql.org
Date: 2025-11-17T15:17:25Z
Lists: pgsql-hackers
On 14.11.2025 01:21, Tom Lane wrote:
> This didn't look very much like the refactorization that I had in
> mind: I thought we should have one copy of the matching code, not two.
> Also, after looking closer at your patch I realized you were just
> punting for cross-type comparison operators, which I felt was kind
> of sad.  It's a little bit tricky to get simplehash.h to go along
> with cross-type hashing, because it wants to use just one hash and
> one equality function.  But since those are interface routines we
> are going to supply anyway, we can make them deal with the insert
> and lookup cases differently.


I had considered the cross-type comparison operators and I didn’t see a 
clean way to support them, so I intentionally excluded cross-type cases 
from hash probing. Your suggestion to switch the hash function in probe 
mode is clearly a more correct approach than simply rejecting those 
cases. Thanks for the explanation.


>
> So after a bit of hacking I ended up with the attached.  I split up
> the refactorization into several steps to make it easier to review.
> (But I'd anticipate squashing these into one commit in the end,
> so I didn't spend a lot of time on the commit messages.)


I reviewed patches 0002-0004 with the refactoring, and I think the 
overall approach is excellent. However, I noticed one issue: in 
eqjoinsel_semi() the variable 'nmatches' is not initialized and can lead 
to undefined behavior when clamped_nvalues2 == sslot2->nvalues. Before 
the refactoring it was initialized by zero.


> Also, 0001 in this series is not meant to be committed; what it
> does is to add some debug logging to ease comparing runtimes of
> different versions of eqjoinsel.  I was able to use that to
> convince myself that the refactoring steps didn't cost anything
> meaningful in performance.  Perhaps we could use it to investigate
> the right hashing threshold more carefully, too.


With 0001 patch I tested the selectivity calculation time for SEMI JOIN 
after applying patches 0002-0004, and the time was cut in half. Thank 
you for the work on that.


>
> There are still a couple of XXX comments in the attached, denoting
> loose ends to look at.  In particular, I wondered whether the
> hash threshold check
>
>          if (Min(sslot1.nvalues, sslot2.nvalues) >= EQJOINSEL_MCV_HASH_THRESHOLD)
>
> should use Max() instead --- that is, it might be safer to hash
> if either MCV list is long.  Or, holding one's head at a different
> angle, perhaps the sum of the list lengths should be what's checked?
>
> 			regards, tom lane
>

Hmm… using the sum actually seems like a good idea for me. It may 
provide a smoother switch-over point between the two MCV-scanning 
algorithms when both list lengths are below the threshold. But this 
definitely needs to be validated by measuring different MCV lengths 
below the threshold using the 0001 patch.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/