Thread
-
Re: Limit memory usage by postgres_fdw batches
Tomas Vondra <tomas@vondra.me> — 2025-12-27T13:18:56Z
Hi Alexander, On 12/26/25 11:54, Alexander Pyhalov wrote: > Hi. > > We had some real cases when client set rather big batch_size on server > level, but for some foreign table, containing large documents, it was > inadequate and lead to OOM killer intervention. You can argue that > batch_size can be set on foreign table level, but it can still be not > flexible enough, when tuple size varies. I agree this can be an issue with large tuples. It'd be good to consider the size, not just the number of tuples. I think you modified the right places, but I think there are two or three issues we should improve: 1) It calls estimate_batch_length() for every ExecInsert() call, i.e. for every tuples. And it walks all tuples up to that point, which makes it O(N^2). I haven't measured how significant it is, but AFAICS we could track the current size of the batch fairly easily, and use that. 2) work_mem is in kilobytes, while batch_len is in bytes, so the comparison (batch_len > work_mem) is not quite right. I'll probably fire every time, preventing any batching. 3) Isn't this consider the size of the new tuple in batch_len? Imagine the tuples are 99% of the work_mem limit. We add the first one. When adding the next one we check the current batch is below work_mem, and so we proceed to add the second tuple. Now the batch is 1.98% of the limit. I think it should work like this: 1) batch_len + tup_len < work_mem => add tuple to batch 2) tup_len < work_mem => flush batch, add tuple to batch 3) tup_len => work_mem => flush batch, insert tuple directly What bothers me a little bit is that this is per relation. AFAICS when inserting into a partitioned table with multiple foreign partitions, each partition will have a separate limit. I wonder if we could do better and have some sort of "global" limit for the whole insert. But that's not the fault of this patch, of course. > I suppose this case is also > takes place for fetch_size. Issue here is that we can't somehow limit > size of data (versus number of rows) while fetching from cursor. But we > can use tuple store to preserve fetched results, so that they spill out > to the disk. > Perhaps. Seems like a separate issue. I haven't looked very closely, but do we want to use the tuplestore always, or just when the tuples get too large? It might even be on batch-by-batch, I guess. With fetch_size=1 it's hardly useful, right? Is the tuplestore management measurable? > I'm attaching two patches which try to fix issues with possible huge > memory usage by postgres_fdw batches. > With fetched tuples we still can't use only tuplestore, as ctids are not > preserved, and so have to store them separately. > > The reproducer for insert is simple. > > create extension postgres_fdw ; > create server loopback foreign data wrapper postgres_fdw options (dbname > 'postgres', port '5432', batch_size '100', fetch_size '100'); > create table base_table(i int, s bytea); > create foreign table foreign_table (i int, s bytea) server loopback > options(table_name 'base_table'); > create user mapping for public server loopback ; > > insert into foreign_table select i, pg_read_binary_file('/some/big/ > file') from generate_series(1,1000) i; > > will easily grow backend RSS to several gigabytes. > The first patch fixes this problem. > > The second patch alleviates the second issue - SELECT * queries also can > grow backend memory to several GBs. Still memory usage can peak (on my > toy examples) up to 3-4 GB, but at least it seams 1-2 GB less than non- > patched version. > How large are the tuples? How much higher was this RSS than the theoretical minimum? regards -- Tomas Vondra