Re: Use merge-based matching for MCVs in eqjoinsel
Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: David Geier <geidav.pg@gmail.com>, pgsql-hackers@lists.postgresql.org
Date: 2025-11-17T15:17:25Z
Lists: pgsql-hackers
On 14.11.2025 01:21, Tom Lane wrote: > This didn't look very much like the refactorization that I had in > mind: I thought we should have one copy of the matching code, not two. > Also, after looking closer at your patch I realized you were just > punting for cross-type comparison operators, which I felt was kind > of sad. It's a little bit tricky to get simplehash.h to go along > with cross-type hashing, because it wants to use just one hash and > one equality function. But since those are interface routines we > are going to supply anyway, we can make them deal with the insert > and lookup cases differently. I had considered the cross-type comparison operators and I didn’t see a clean way to support them, so I intentionally excluded cross-type cases from hash probing. Your suggestion to switch the hash function in probe mode is clearly a more correct approach than simply rejecting those cases. Thanks for the explanation. > > So after a bit of hacking I ended up with the attached. I split up > the refactorization into several steps to make it easier to review. > (But I'd anticipate squashing these into one commit in the end, > so I didn't spend a lot of time on the commit messages.) I reviewed patches 0002-0004 with the refactoring, and I think the overall approach is excellent. However, I noticed one issue: in eqjoinsel_semi() the variable 'nmatches' is not initialized and can lead to undefined behavior when clamped_nvalues2 == sslot2->nvalues. Before the refactoring it was initialized by zero. > Also, 0001 in this series is not meant to be committed; what it > does is to add some debug logging to ease comparing runtimes of > different versions of eqjoinsel. I was able to use that to > convince myself that the refactoring steps didn't cost anything > meaningful in performance. Perhaps we could use it to investigate > the right hashing threshold more carefully, too. With 0001 patch I tested the selectivity calculation time for SEMI JOIN after applying patches 0002-0004, and the time was cut in half. Thank you for the work on that. > > There are still a couple of XXX comments in the attached, denoting > loose ends to look at. In particular, I wondered whether the > hash threshold check > > if (Min(sslot1.nvalues, sslot2.nvalues) >= EQJOINSEL_MCV_HASH_THRESHOLD) > > should use Max() instead --- that is, it might be safer to hash > if either MCV list is long. Or, holding one's head at a different > angle, perhaps the sum of the list lengths should be what's checked? > > regards, tom lane > Hmm… using the sum actually seems like a good idea for me. It may provide a smoother switch-over point between the two MCV-scanning algorithms when both list lengths are below the threshold. But this definitely needs to be validated by measuring different MCV lengths below the threshold using the 0001 patch. -- Best regards, Ilia Evdokimov, Tantor Labs LLC, https://tantorlabs.com/