Thread

Re: Limit memory usage by postgres_fdw batches

Tomas Vondra <tomas@vondra.me> — 2025-12-27T13:18:56Z
Hi Alexander,

On 12/26/25 11:54, Alexander Pyhalov wrote:
> Hi.
> 
> We had some real cases when client set rather big batch_size on server
> level, but for some foreign table, containing large documents, it was
> inadequate and lead to OOM killer intervention. You can argue that
> batch_size can be set on foreign table level, but it can still be not
> flexible enough, when tuple size varies.

I agree this can be an issue with large tuples. It'd be good to consider
the size, not just the number of tuples.

I think you modified the right places, but I think there are two or
three issues we should improve:

1) It calls estimate_batch_length() for every ExecInsert() call, i.e.
for every tuples. And it walks all tuples up to that point, which makes
it O(N^2). I haven't measured how significant it is, but AFAICS we could
track the current size of the batch fairly easily, and use that.

2) work_mem is in kilobytes, while batch_len is in bytes, so the
comparison (batch_len > work_mem) is not quite right. I'll probably fire
every time, preventing any batching.

3) Isn't this consider the size of the new tuple in batch_len? Imagine
the tuples are 99% of the work_mem limit. We add the first one. When
adding the next one we check the current batch is below work_mem, and so
we proceed to add the second tuple. Now the batch is 1.98% of the limit.

I think it should work like this:

1) batch_len + tup_len < work_mem => add tuple to batch
2) tup_len < work_mem => flush batch, add tuple to batch
3) tup_len => work_mem => flush batch, insert tuple directly

What bothers me a little bit is that this is per relation. AFAICS when
inserting into a partitioned table with multiple foreign partitions,
each partition will have a separate limit. I wonder if we could do
better and have some sort of "global" limit for the whole insert.

But that's not the fault of this patch, of course.

> I suppose this case is also
> takes place for fetch_size. Issue here is that we can't somehow limit
> size of data (versus number of rows) while fetching from cursor. But we
> can use tuple store to preserve fetched results, so that they spill out
> to the disk.
> 

Perhaps. Seems like a separate issue. I haven't looked very closely, but
do we want to use the tuplestore always, or just when the tuples get too
large? It might even be on batch-by-batch, I guess. With fetch_size=1
it's hardly useful, right? Is the tuplestore management measurable?

> I'm attaching two patches which try to fix issues with possible huge
> memory usage by postgres_fdw batches.
> With fetched tuples we still can't use only tuplestore, as ctids are not
> preserved, and so have to store them separately.
> 
> The reproducer for insert is simple.
> 
> create extension postgres_fdw ;
> create server loopback foreign data wrapper postgres_fdw options (dbname
> 'postgres', port '5432', batch_size '100', fetch_size '100');
> create table base_table(i int, s bytea);
> create foreign table foreign_table (i int, s bytea) server loopback
> options(table_name 'base_table');
> create user mapping for public server loopback ;
> 
> insert into foreign_table select i, pg_read_binary_file('/some/big/
> file') from generate_series(1,1000) i;
> 
> will easily grow backend RSS to several gigabytes.
> The first patch fixes this problem.
> 
> The second patch alleviates the second issue - SELECT * queries also can
> grow backend memory to several GBs. Still memory usage can peak (on my
> toy examples) up to 3-4 GB, but at least it seams 1-2 GB less than non-
> patched version.
> 

How large are the tuples? How much higher was this RSS than the
theoretical minimum?


regards

-- 
Tomas Vondra