Thread

  1. Re: Use merge-based matching for MCVs in eqjoinsel

    Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> — 2025-11-17T15:17:25Z

    On 14.11.2025 01:21, Tom Lane wrote:
    > This didn't look very much like the refactorization that I had in
    > mind: I thought we should have one copy of the matching code, not two.
    > Also, after looking closer at your patch I realized you were just
    > punting for cross-type comparison operators, which I felt was kind
    > of sad.  It's a little bit tricky to get simplehash.h to go along
    > with cross-type hashing, because it wants to use just one hash and
    > one equality function.  But since those are interface routines we
    > are going to supply anyway, we can make them deal with the insert
    > and lookup cases differently.
    
    
    I had considered the cross-type comparison operators and I didn’t see a 
    clean way to support them, so I intentionally excluded cross-type cases 
    from hash probing. Your suggestion to switch the hash function in probe 
    mode is clearly a more correct approach than simply rejecting those 
    cases. Thanks for the explanation.
    
    
    >
    > So after a bit of hacking I ended up with the attached.  I split up
    > the refactorization into several steps to make it easier to review.
    > (But I'd anticipate squashing these into one commit in the end,
    > so I didn't spend a lot of time on the commit messages.)
    
    
    I reviewed patches 0002-0004 with the refactoring, and I think the 
    overall approach is excellent. However, I noticed one issue: in 
    eqjoinsel_semi() the variable 'nmatches' is not initialized and can lead 
    to undefined behavior when clamped_nvalues2 == sslot2->nvalues. Before 
    the refactoring it was initialized by zero.
    
    
    > Also, 0001 in this series is not meant to be committed; what it
    > does is to add some debug logging to ease comparing runtimes of
    > different versions of eqjoinsel.  I was able to use that to
    > convince myself that the refactoring steps didn't cost anything
    > meaningful in performance.  Perhaps we could use it to investigate
    > the right hashing threshold more carefully, too.
    
    
    With 0001 patch I tested the selectivity calculation time for SEMI JOIN 
    after applying patches 0002-0004, and the time was cut in half. Thank 
    you for the work on that.
    
    
    >
    > There are still a couple of XXX comments in the attached, denoting
    > loose ends to look at.  In particular, I wondered whether the
    > hash threshold check
    >
    >          if (Min(sslot1.nvalues, sslot2.nvalues) >= EQJOINSEL_MCV_HASH_THRESHOLD)
    >
    > should use Max() instead --- that is, it might be safer to hash
    > if either MCV list is long.  Or, holding one's head at a different
    > angle, perhaps the sum of the list lengths should be what's checked?
    >
    > 			regards, tom lane
    >
    
    Hmm… using the sum actually seems like a good idea for me. It may 
    provide a smoother switch-over point between the two MCV-scanning 
    algorithms when both list lengths are below the threshold. But this 
    definitely needs to be validated by measuring different MCV lengths 
    below the threshold using the 0001 patch.
    
    --
    Best regards,
    Ilia Evdokimov,
    Tantor Labs LLC,
    https://tantorlabs.com/