Re: Batching in executor

amit <amitlangote09@gmail.com>
View thread
From: Amit Langote <amitlangote09@gmail.com>
To: Tomas Vondra <tomas@vondra.me>
Cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
Date: 2025-12-20T14:12:03Z
Lists: pgsql-hackers
Attachments

v4-0001-Add-batch-table-AM-API-and-heapam-implementation.patch (application/octet-stream)
v4-0002-SeqScan-add-batch-driven-variants-returning-slots.patch (application/octet-stream)
v4-0003-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch (application/octet-stream)
bar_limit.sh (text/x-sh)
bar_limit_where_o.sh (text/x-sh)
bar_limit_where_a.sh (text/x-sh)
On Fri, Dec 5, 2025 at 12:54 AM Amit Langote <amitlangote09@gmail.com>
wrote:
> On Wed, Oct 29, 2025 at 3:37 PM Amit Langote <amitlangote09@gmail.com>
wrote:
> > On Tue, Oct 28, 2025 at 10:40 PM Amit Langote <amitlangote09@gmail.com>
wrote:
> > > That would be nice to see if you have the time, but maybe after I post
> > > a new version.
> >
> > I’ve created a CF entry marked WoA for this in the next CF under the
> > title “Batching in executor, part 1: add batch variant of table AM
> > scan API.” The idea is to track this piece separately so that later
> > parts can have their own entries and we don’t end up with a single
> > long-lived entry that never gets marked done. :-)
>
> I intend to continue working on this, so have just moved it into the
> next fest.  I will post a new patch version next week that addresses
> Daniil's comments and implements a few other things I mentioned I will
> in my reply to Tomas on Oct 28; sorry for the delay.

Before I go on vacation for a couple of weeks, here's an updated patch
set.  I am only including the patches that add TAM interface, add
TupleBatch executor wrapper for TAM batches, and use it in SeqScan as I had
posted before.  There is a new patch to add a BATCHES option to EXPLAIN.  I
renamed the testing GUC to executor_batch_rows (integer) from the boolean
executor_batching.  EXPLAIN (BATCHES) example:

+-- Basic batch stats output
+select explain_filter('explain (analyze, batches, buffers off, costs off)
select * from batch_test');
+                         explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+   Batches: N  Avg Rows: N.N  Max: N  Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)

What I have not included in this set are the patches that add
ExecProcNodeBatch() so that TupleBatch can be passed from one plan node to
another (parent), ExprEvalOps (EEOPs) for batched expression evaluation
(qual and aggregate transition).  I would like to focus on the patches that
allow reading batches from TAM into Scan nodes (only SeqScan for now).

After I'm back from vacation, I will post patches for batched qual
evaluation in SeqScan filter quals (once bugs are fixed and polished).
Batching in Agg node can wait for now.

In the meantime, what I would like to have someone's thoughts on:

* the shape of the TAM APIs -- should I add a TAMBatch or something that is
created, populated, and destroyed by the TAM instead of the current void
pointer and TupleBatchOps that are initialized in the executor like this
(excerpt from 0002):

+    /* Lazily create the AM batch payload. */
+    if (node->ss.ps.ps_Batch->am_payload == NULL)
+    {
+        const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY =
scandesc->rs_rd->rd_tableam;
+
+        Assert(tam && tam->scan_begin_batch);
+        node->ss.ps.ps_Batch->am_payload =
+            table_scan_begin_batch(scandesc,
node->ss.ps.ps_Batch->maxslots);
+        node->ss.ps.ps_Batch->ops =
table_batch_callbacks(node->ss.ss_currentRelation);
+    }

* the shape of TupleBatch itself -- its contents and operations defined in
execBatch.c/h.

* any other thoughts you might have on the project, patches.

Benchmark:

Scripts attached if you want to try them.

(Negative % = faster than master)

SELECT * FROM table LIMIT 1 OFFSET N:
Rows      Master    batch=0   vs master   batch=64   vs master
--------------------------------------------------------------
1M          11ms       11ms        -0%        8ms       -23%
2M          23ms       22ms        -1%       18ms       -23%
3M          36ms       34ms        -5%       27ms       -25%
4M          51ms       50ms        -2%       38ms       -26%
5M          64ms       64ms        -1%       48ms       -26%
10M        147ms      145ms        -1%      114ms       -22%

SELECT * FROM WHERE a > 0 LIMIT 1 OFFSET N:
Rows      Master    batch=0   vs master   batch=64   vs master
--------------------------------------------------------------
1M          31ms       31ms        +0%       16ms       -48%
2M          64ms       64ms        -0%       34ms       -47%
3M          67ms       66ms        -1%       50ms       -25%
4M          91ms       90ms        -1%       71ms       -22%
5M         119ms      113ms        -5%       88ms       -26%
10M        262ms      261ms        -0%      205ms       -21%

SELECT * FROM table WHERE o > 0 LIMIT 1 OFFSET N (last column -
deform-heavy):
Rows      Master    batch=0   vs master   batch=64   vs master
--------------------------------------------------------------
1M          38ms       37ms        -2%       38ms        +0%
2M          79ms       75ms        -6%       77ms        -4%
3M         182ms      186ms        +2%      160ms       -12%
4M         250ms      252ms        +1%      219ms       -12%
5M         314ms      316ms        +1%      273ms       -13%
10M        647ms      651ms        +1%      604ms        -7%

The smaller improvement with WHERE o > 0 is expected since accessing the
last column requires deforming most of the tuple, which dominates the
execution time. Future work on batched tuple deformation could help here.

Note on regressions with executor_batch_rows = 0 vs master:

I am not seeing the regressions with batch_rows=0 vs master as I did
before.  I think some of it might have to do with my removing some stray
fields from HeapScanData that were accidentally left there in the earlier
patches.  Also, the regressions I was observing earlier seemed more to have
to do with using gcc to compile master tree and clang to compile patched
tree, which resulted in code layout changes that seemed to cause patched
binary to regress.  Would be nice if these numbers can be verified by
others.

-- 
Thanks, Amit Langote