Thread
-
[RFC] Secondary hash to split stuck hash-join batches
Ben Mejia <benjamin.arthur.mejia@gmail.com> — 2026-05-31T19:41:13Z
Hi, Tomas clearly describes the underlying problem in [1]: stuck batches cannot be split by batch doubling. This occurs when the tuples' hash values all map to the same batch, similar to all items ending up in a single hash bucket. The HashJoin executor currently faces two unappealing fallback options: a) Continually doubling nBatch in a futile attempt to split the unsplittable batch. b) Inflating spaceAllowed, which silently exceeds work_mem. While PG18 caps the first fallback to prevent runaway batch explosions, it attempts to manage the situation by incrementally increasing memory usage. Consequently, work_mem is routinely violated by HashJoin, particularly when encountering these unsplittable batches. I have been experimenting with a rehash-based approach to reduce the memory used by hash joins and would appreciate feedback on the direction before polishing it for submission. Sketch of Rehash Approach - When ExecHashIncreaseNumBatches detects a stuck batch, call the new function, ExecHashRehashBatch to attempt a rehash. - Rehash attempts to reassign each tuple in the stuck batch to a sub-batch using a secondary hash function. - Sub-batches are handled in the same way as normal batches, and by the same code. - Sub-batches are split, like normal batches until the sub-batch fits into spaceAllowed - If the rehash cannot split the batch, it fails and the current inflate-and-gut-it-out path is taken. In my prototype, I use a GUC, enable_hashjoin_rehash to turn on the rehash logic; otherwise the code works the same as before. Additional Heuristics Having the rehash available as a tool to control memory usage, I also explored further modifications: - Using planner estimates to detect skew vs. large tuples: This can avoid doing a batch split and immediately attempt a rehash. - Avoiding PG18's memory inflation: Based on skew detection above, avoids inflating spaceAllowed and attempts a rehash. - Diminishing returns detection: If batch splitting does less than a split-threshold attempt a rehash. Some Preliminary Results For the table: CREATE TABLE t_stuck_4m (id INT, pad TEXT); INSERT INTO t_stuck_4m SELECT a, repeat(md5(a::text), 5) FROM generate_series(1, 1000000000) s(a) WHERE hashint4(a)::bit(32) & x'FFFFE000'::bit(32) = 0::bit(32); INSERT INTO t_stuck_4m SELECT id, pad FROM t_stuck_4m; -- 2x INSERT INTO t_stuck_4m SELECT id, pad FROM t_stuck_4m; -- 4x INSERT INTO t_stuck_4m SELECT id, pad FROM t_stuck_4m; -- 8x INSERT INTO t_stuck_4m SELECT id, pad FROM t_stuck_4m; -- 16x SET work_mem = '4MB'; SET enable_hashjoin_rehash = off; EXPLAIN (ANALYZE, COSTS off, TIMING off, BUFFERS off) SELECT * FROM t_stuck_4m a JOIN t_stuck_4m b USING (id) LIMIT 1; QUERY PLAN -------------------------------------------------------------------------------------------------- Limit (actual rows=1.00 loops=1) -> Hash Join (actual rows=1.00 loops=1) Hash Cond: (a.id = b.id) -> Seq Scan on t_stuck_4m a (actual rows=1.00 loops=1) -> Hash (actual rows=29248.00 loops=1) Buckets: 32768 (originally 32768) Batches: 4 (originally 2) Memory Usage: 5974kB -> Seq Scan on t_stuck_4m b (actual rows=29248.00 loops=1) Planning Time: 0.243 ms Execution Time: 10.288 ms SET enable_hashjoin_rehash = on; EXPLAIN (ANALYZE, COSTS off, TIMING off, BUFFERS off) SELECT * FROM t_stuck_4m a JOIN t_stuck_4m b USING (id) LIMIT 1; QUERY PLAN -------------------------------------------------------------------------------------------------- Limit (actual rows=1.00 loops=1) -> Hash Join (actual rows=1.00 loops=1) Hash Cond: (a.id = b.id) -> Seq Scan on t_stuck_4m a (actual rows=4.00 loops=1) -> Hash (actual rows=29248.00 loops=1) Buckets: 32768 (originally 32768) Batches: 4 (originally 2) Memory Usage: 3083kB Rehash Repartitions: 1 (max sub-batches: 2) -> Seq Scan on t_stuck_4m b (actual rows=29248.00 loops=1) Planning Time: 0.110 ms Execution Time: 10.071 ms So the rehash drops memory usage from 5974 kB -> 3083 kB, under the 4MB limit. My observations: 1) The runtime when rehashing is used is nearly the same. 2) This execution split the one unsplittable batch into 2, reducing total memory used. I have more data and the patch to share if people are interested. Best regards, Ben Mejia [1] https://www.postgresql.org/message-id/7bed6c08-72a0-4ab9-a79c-e01fcdd0940f@vondra.me