Thread

[RFC] Secondary hash to split stuck hash-join batches

Ben Mejia <benjamin.arthur.mejia@gmail.com> — 2026-05-31T19:41:13Z
Hi,

Tomas clearly describes the underlying problem in [1]: stuck batches cannot
be split by batch doubling. This occurs when the tuples' hash values all
map to the same batch, similar to all items ending up in a single hash
bucket.

The HashJoin executor currently faces two unappealing fallback options:

  a) Continually doubling nBatch in a futile attempt to split the
unsplittable batch.

  b) Inflating spaceAllowed, which silently exceeds work_mem.

While PG18 caps the first fallback to prevent runaway batch explosions, it
attempts to manage the situation by incrementally increasing memory usage.
Consequently, work_mem is routinely violated by HashJoin, particularly when
encountering these unsplittable batches.

I have been experimenting with a rehash-based approach to reduce the memory
used by hash joins and would appreciate feedback on the direction before
polishing it for submission.

Sketch of Rehash Approach

 - When ExecHashIncreaseNumBatches detects a stuck batch, call the new
function, ExecHashRehashBatch to attempt a rehash.

 - Rehash attempts to reassign each tuple in the stuck batch to a sub-batch
using a secondary hash function.

 - Sub-batches are handled in the same way as normal batches, and by the
same code.

 - Sub-batches are split, like normal batches until the sub-batch fits into
spaceAllowed

 - If the rehash cannot split the batch, it fails and the current
inflate-and-gut-it-out path is taken.

In my prototype, I use a GUC, enable_hashjoin_rehash to turn on the rehash
logic; otherwise the code works the same as before.

Additional Heuristics

Having the rehash available as a tool to control memory usage, I also
explored further modifications:

 - Using planner estimates to detect skew vs. large tuples:
   This can avoid doing a batch split and immediately attempt a rehash.

 - Avoiding PG18's memory inflation:
   Based on skew detection above, avoids inflating spaceAllowed and
attempts a rehash.

 - Diminishing returns detection:
   If batch splitting does less than a split-threshold attempt a rehash.


Some Preliminary Results

For the table:

CREATE TABLE t_stuck_4m (id INT, pad TEXT);

INSERT INTO t_stuck_4m
  SELECT a, repeat(md5(a::text), 5)
  FROM generate_series(1, 1000000000) s(a)
  WHERE hashint4(a)::bit(32) & x'FFFFE000'::bit(32) = 0::bit(32);

INSERT INTO t_stuck_4m SELECT id, pad FROM t_stuck_4m;  -- 2x
INSERT INTO t_stuck_4m SELECT id, pad FROM t_stuck_4m;  -- 4x
INSERT INTO t_stuck_4m SELECT id, pad FROM t_stuck_4m;  -- 8x
INSERT INTO t_stuck_4m SELECT id, pad FROM t_stuck_4m;  -- 16x

SET work_mem = '4MB';
SET enable_hashjoin_rehash = off;

EXPLAIN (ANALYZE, COSTS off, TIMING off, BUFFERS off) SELECT
     * FROM t_stuck_4m a JOIN t_stuck_4m b USING (id) LIMIT 1;
                                             QUERY PLAN

--------------------------------------------------------------------------------------------------
   Limit (actual rows=1.00 loops=1)
     ->  Hash Join (actual rows=1.00 loops=1)
           Hash Cond: (a.id = b.id)
           ->  Seq Scan on t_stuck_4m a (actual rows=1.00 loops=1)
           ->  Hash (actual rows=29248.00 loops=1)
                 Buckets: 32768 (originally 32768)  Batches: 4 (originally
2)  Memory Usage: 5974kB
                 ->  Seq Scan on t_stuck_4m b (actual rows=29248.00 loops=1)
   Planning Time: 0.243 ms
   Execution Time: 10.288 ms

SET enable_hashjoin_rehash = on;
  EXPLAIN (ANALYZE, COSTS off, TIMING off, BUFFERS off) SELECT
     * FROM t_stuck_4m a JOIN t_stuck_4m b USING (id) LIMIT 1;
                                              QUERY PLAN

--------------------------------------------------------------------------------------------------
   Limit (actual rows=1.00 loops=1)
     ->  Hash Join (actual rows=1.00 loops=1)
           Hash Cond: (a.id = b.id)
           ->  Seq Scan on t_stuck_4m a (actual rows=4.00 loops=1)
           ->  Hash (actual rows=29248.00 loops=1)
                 Buckets: 32768 (originally 32768)  Batches: 4 (originally
2)  Memory Usage: 3083kB
                 Rehash Repartitions: 1 (max sub-batches: 2)
                 ->  Seq Scan on t_stuck_4m b (actual rows=29248.00 loops=1)
   Planning Time: 0.110 ms
   Execution Time: 10.071 ms

So the rehash drops memory usage from 5974 kB -> 3083 kB, under the 4MB
limit.

My observations:

1) The runtime when rehashing is used is nearly the same.

2) This execution split the one unsplittable batch into 2, reducing
total memory used.

I have more data and the patch to share if people are interested.

Best regards,
Ben Mejia

  [1]
https://www.postgresql.org/message-id/7bed6c08-72a0-4ab9-a79c-e01fcdd0940f@vondra.me