Thread
-
Re: hashjoins vs. Bloom filters (yet again)
Andrew Dunstan <andrew@dunslane.net> — 2026-05-31T15:03:47Z
On 2026-05-30 Sa 2:14 PM, Tomas Vondra wrote: > > > Hi, Tomas > > This is terrific and very timely from my POV. > > I've been experimenting with a table AM (implemented as a > CustomScan scan provider), and bloom-filter pushdown from a hashjoin is one > of the bigger wins available to it: a fact-table scan joined to a filtered > dimension can use the filter to skip whole row groups and avoid > decompressing columns entirely, rather than just rejecting a tuple after > it's been produced. I'd hacked up a private version of this via a new > table-AM callback (the hashjoin walks the outer subtree, builds a filter > from the build side, and hands it to the AM's scan descriptor). Having now > read your PoC, I think your framework is the better foundation, and I'd > rather build on it than carry a parallel mechanism. But two things stand in > the way of a storage-level consumer using it, and I think both are > relatively > small. > > OK, good to hear. I was actually thinking about that use case too, i.e. > making it possible for the scan to do something smart with the filter > (like even pushing it even further down, to "storage"). Or maybe the > ForeignScan could push it to the remote side, so that it's actually > filtered there. > > I didn't mention that my message, and there are some difficulties: > > 1) We only build the hash (and bloom) with a delay, after the scan > already produces some tuples. That complicates the pushdown, whiich may > need to happen when starting the scan. Presumably, we'd need to allow > disabling this optimization, optionally. > > 2) We'd need some sort of "portable" Bloom filter, with serialization > and deserialization, etc. > > Both of these seem rather solvable. > >> 1) A CustomScan can't currently be a recipient. >> >> find_bloom_filter_recipient() only recognizes the stock scan tags, and the >> probe itself lives in ExecScanExtended(), which a CustomScan never calls >> (it dispatches to the provider's ExecCustomScan). The second part is >> actually a feature, not a bug: if a CustomScan provider does its own >> probing, it can choose the granularity -- per dictionary entry, per row >> group, or per row -- instead of being locked into the per-row, >> post-materialization probe that the stock nodes get. So all that's needed >> on your side is to let the planner attach a filter to a base-relation >> CustomScan; the provider takes care of consuming it. >> >> Concretely, that's adding T_CustomScan to the scan-leaf case in >> find_bloom_filter_recipient() (CustomScan embeds Scan first, so the >> scanrelid test is identical; non-leaf custom nodes have scanrelid == 0 and >> fall through to NULL), plus the matching fix_scan_bloom_filters() call in >> set_customscan_references(). The provider then calls ExecInitBloomFilters() >> in BeginCustomScan and ExecBloomFilters() (or a coarser-grained variant) >> inside its scan loop. Everything else -- producer registration, the >> es_bloom_producers lookup, the adaptive sampling, EXPLAIN -- is reused >> unchanged. >> > Yes, that should work and it's a mostly mechanical change. > > Maybe we'd want some sort of opt-in, so that the CustomScan can indicate > it can handle Bloom filters. Like, setting > CUSTOMPATH_SUPPORT_BLOOM_FILTERS to flags. > >> 2) The combined-hash filter can't be tested against a single column. >> >> You build one filter keyed on hash32() of all the join keys combined. For a >> single-key join that's ideal, and a column store can use it directly: hash >> each distinct dictionary value once per row group and skip groups whose >> values are all absent. For a multi-column join, though, the combined hash >> mixes the keys, so it can only ever be tested per-row (with all key columns >> in hand) -- it can't be checked against any one column's dictionary. The >> per-row probe is still useful, but the row-group/dictionary skipping, which >> is where most of the storage win comes from, isn't available. >> >> The obvious thought is to key a filter per column instead. But I don't >> think that should *replace* the combined filter, because per-column filters >> are strictly less selective on multi-column joins: they only test whether >> each column's value appears *somewhere* in the build side, not whether the >> combination does. With build pairs {(1,10),(2,20)}, an outer (1,20) passes >> both per-column filters even though it matches no build row, whereas the >> combined filter rejects it. So for the row-level probe -- and especially >> for plain heap -- the combined filter is the better one, and replacing it >> would be a regression. >> >> What I think would actually help is to let the framework *optionally* emit >> per-column filters in addition to the combined one, when a recipient >> signals it can use them. The combined filter stays the default and does the >> precise per-row rejection (unchanged for heap, and usable per-row by a >> column store too); the per-column filters are extra, built only on demand, >> and let a storage consumer cheaply eliminate whole row groups before the >> combined filter does the exact work. The cost is the build CPU and memory >> for the extra filters -- but only for consumers that ask, so your design is >> untouched when nobody does. For a single-key join the two filters >> coincide, so >> there'd be no reason to build both. >> > I think I speculated about this (having per-key filters) in some of the > comments in the patch, although the use case was different. I haven't > thought about TAM, but about different joins where the join keys come > from both sides. Consider a join like > > HJ > / \ > A HJ > / \ > B C > > where A-(BC) is on (A.x = B.x AND A.y = C.y), so the complete filter > can't be pushed to either side. But we could: > > (1) Push the filter on top of the BC join (which in this example is not > really a push-down). > > (2) Build filters on (x) and (y) separately, and push-down these. > > Or we could do both, really. > > I suppose a variation of (2) would work for your use case too, except > we'd push all three filters (x,y), (x) and (y) to the same scan. > > I guess this could also be opt-in, enabled by some CUSTOMPATH_ flag. > > The question is how efficient can the smaller filters be. The complete > filter can be very selective, while the per-key filters are terrible. > >> I'd be happy to work on patches for these. >> > Great. It's and interesting experiment / area to explore. Here are 3 patches (developed using Claude) that sit on top of your POC. Patch 1 enables the pushdown filters for custom scans. As you say it's fairly mechanical and is enabled by a CUSTOMPATH_SUPPORT_BLOOM_FILTERS path flag. Patch 2 provides for building per-key filters in addition to the multi-key filter if that flag is set. There may be other cases that would want it, but this would suit my immediate use case. Patch 3 provides for eager creation of the filter(s) in such cases, disabling the optimization you mentioned in point 1 above. > > FWIW I think the main difficulty for this PoC is going to be the > planning/costing stuff, and the impact on EXPLAIN. > > I haven't dealt with that or other issues you raise, but I think this is enough for me to begin testing. I have adapted my TAM to it and verified that it acts as expected. I will start doing some benchmarks. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com