Thread

  1. Re: hashjoins vs. Bloom filters (yet again)

    Andrew Dunstan <andrew@dunslane.net> — 2026-05-31T15:03:47Z

    On 2026-05-30 Sa 2:14 PM, Tomas Vondra wrote:
    >
    >
    > Hi, Tomas
    >
    > This is terrific and very timely from my POV.
    >
    > I've been experimenting with a table AM (implemented as a
    > CustomScan scan provider), and bloom-filter pushdown from a hashjoin is one
    > of the bigger wins available to it: a fact-table scan joined to a filtered
    > dimension can use the filter to skip whole row groups and avoid
    > decompressing columns entirely, rather than just rejecting a tuple after
    > it's been produced. I'd hacked up a private version of this via a new
    > table-AM callback (the hashjoin walks the outer subtree, builds a filter
    > from the build side, and hands it to the AM's scan descriptor). Having now
    > read your PoC, I think your framework is the better foundation, and I'd
    > rather build on it than carry a parallel mechanism. But two things stand in
    > the way of a storage-level consumer using it, and I think both are
    > relatively
    > small.
    >
    > OK, good to hear. I was actually thinking about that use case too, i.e.
    > making it possible for the scan to do something smart with the filter
    > (like even pushing it even further down, to "storage"). Or maybe the
    > ForeignScan could push it to the remote side, so that it's actually
    > filtered there.
    >
    > I didn't mention that my message, and there are some difficulties:
    >
    > 1) We only build the hash (and bloom) with a delay, after the scan
    > already produces some tuples. That complicates the pushdown, whiich may
    > need to happen when starting the scan. Presumably, we'd need to allow
    > disabling this optimization, optionally.
    >
    > 2) We'd need some sort of "portable" Bloom filter, with serialization
    > and deserialization, etc.
    >
    > Both of these seem rather solvable.
    >
    >> 1) A CustomScan can't currently be a recipient.
    >>
    >> find_bloom_filter_recipient() only recognizes the stock scan tags, and the
    >> probe itself lives in ExecScanExtended(), which a CustomScan never calls
    >> (it dispatches to the provider's ExecCustomScan). The second part is
    >> actually a feature, not a bug: if a CustomScan provider does its own
    >> probing, it can choose the granularity -- per dictionary entry, per row
    >> group, or per row -- instead of being locked into the per-row,
    >> post-materialization probe that the stock nodes get. So all that's needed
    >> on your side is to let the planner attach a filter to a base-relation
    >> CustomScan; the provider takes care of consuming it.
    >>
    >> Concretely, that's adding T_CustomScan to the scan-leaf case in
    >> find_bloom_filter_recipient() (CustomScan embeds Scan first, so the
    >> scanrelid test is identical; non-leaf custom nodes have scanrelid == 0 and
    >> fall through to NULL), plus the matching fix_scan_bloom_filters() call in
    >> set_customscan_references(). The provider then calls ExecInitBloomFilters()
    >> in BeginCustomScan and ExecBloomFilters() (or a coarser-grained variant)
    >> inside its scan loop. Everything else -- producer registration, the
    >> es_bloom_producers lookup, the adaptive sampling, EXPLAIN -- is reused
    >> unchanged.
    >>
    > Yes, that should work and it's a mostly mechanical change.
    >
    > Maybe we'd want some sort of opt-in, so that the CustomScan can indicate
    > it can handle Bloom filters. Like, setting
    > CUSTOMPATH_SUPPORT_BLOOM_FILTERS to flags.
    >
    >> 2) The combined-hash filter can't be tested against a single column.
    >>
    >> You build one filter keyed on hash32() of all the join keys combined. For a
    >> single-key join that's ideal, and a column store can use it directly: hash
    >> each distinct dictionary value once per row group and skip groups whose
    >> values are all absent. For a multi-column join, though, the combined hash
    >> mixes the keys, so it can only ever be tested per-row (with all key columns
    >> in hand) -- it can't be checked against any one column's dictionary. The
    >> per-row probe is still useful, but the row-group/dictionary skipping, which
    >> is where most of the storage win comes from, isn't available.
    >>
    >> The obvious thought is to key a filter per column instead. But I don't
    >> think that should *replace* the combined filter, because per-column filters
    >> are strictly less selective on multi-column joins: they only test whether
    >> each column's value appears *somewhere* in the build side, not whether the
    >> combination does. With build pairs {(1,10),(2,20)}, an outer (1,20) passes
    >> both per-column filters even though it matches no build row, whereas the
    >> combined filter rejects it. So for the row-level probe -- and especially
    >> for plain heap -- the combined filter is the better one, and replacing it
    >> would be a regression.
    >>
    >> What I think would actually help is to let the framework *optionally* emit
    >> per-column filters in addition to the combined one, when a recipient
    >> signals it can use them. The combined filter stays the default and does the
    >> precise per-row rejection (unchanged for heap, and usable per-row by a
    >> column store too); the per-column filters are extra, built only on demand,
    >> and let a storage consumer cheaply eliminate whole row groups before the
    >> combined filter does the exact work. The cost is the build CPU and memory
    >> for the extra filters -- but only for consumers that ask, so your design is
    >> untouched when nobody does. For a single-key join the two filters
    >> coincide, so
    >> there'd be no reason to build both.
    >>
    > I think I speculated about this (having per-key filters) in some of the
    > comments in the patch, although the use case was different. I haven't
    > thought about TAM, but about different joins where the join keys come
    > from both sides. Consider a join like
    >
    >          HJ
    >        /    \
    >       A     HJ
    >           /    \
    >          B      C
    >
    > where A-(BC) is on (A.x = B.x AND A.y = C.y), so the complete filter
    > can't be pushed to either side. But we could:
    >
    > (1) Push the filter on top of the BC join (which in this example is not
    > really a push-down).
    >
    > (2) Build filters on (x) and (y) separately, and push-down these.
    >
    > Or we could do both, really.
    >
    > I suppose a variation of (2) would work for your use case too, except
    > we'd push all three filters (x,y), (x) and (y) to the same scan.
    >
    > I guess this could also be opt-in, enabled by some CUSTOMPATH_ flag.
    >
    > The question is how efficient can the smaller filters be. The complete
    > filter can be very selective, while the per-key filters are terrible.
    >
    >> I'd be happy to work on patches for these.
    >>
    > Great. It's and interesting experiment / area to explore.
    
    
    Here are 3 patches (developed using Claude) that sit on top of your POC.
    
    Patch 1 enables the pushdown filters for custom scans. As you say it's 
    fairly mechanical and is enabled by a CUSTOMPATH_SUPPORT_BLOOM_FILTERS 
    path flag.
    
    Patch 2 provides for building per-key filters in addition to the 
    multi-key filter if that flag is set. There may be other cases that 
    would want it, but this would suit my immediate use case.
    
    Patch 3 provides for eager creation of the filter(s) in such cases, 
    disabling the optimization you mentioned in point 1 above.
    
    
    >
    > FWIW I think the main difficulty for this PoC is going to be the
    > planning/costing stuff, and the impact on EXPLAIN.
    >
    >
    
    
    I haven't dealt with that or other issues you raise, but I think this is 
    enough for me to begin testing. I have adapted my TAM to it and verified 
    that it acts as expected. I will start doing some benchmarks.
    
    
    cheers
    
    
    andrew
    
    
    --
    Andrew Dunstan
    EDB: https://www.enterprisedb.com