Thread

Re: [PATCH] Better Performance for PostgreSQL with large INSERTs

Jakub Wartak <jakub.wartak@enterprisedb.com> — 2025-12-12T12:54:38Z
On Wed, Nov 26, 2025 at 3:03 PM Filip Janus <fjanus@redhat.com> wrote:
>
>
>
>     -Filip-
>
>
> út 7. 10. 2025 v 16:54 odesílatel Andres Freund <andres@anarazel.de> napsal:
>>
>> Hi,
>>
>> On 2025-10-07 15:03:29 +0200, Philipp Marek wrote:
>> > > Have you tried to verify that this doesn't cause performance regressions
>> > > in
>> > > other workloads? pq_recvbuf() has this code:
>> > >
>> > ...
>> > >
>> > > I do seem to recall that just increasing the buffer size substantially
>> > > lead to
>> > > more time being spent inside that memmove() (likely due to exceeding
>> > > L1/L2).
>> >
>> >
>> > Do you have any pointers to discussions or other data about that?
>> >
>> >
>> > My (quick) analysis was that clients that send one request,
>> > wait for an answer, then send the next request wouldn't run that code
>> > as there's nothing behind the individual requests that could be moved.
>> >
>> >
>> > But yes, Pipeline Mode[1] might/would be affected.
>> >
>> > The interesting question is how much data can userspace copy before
>> > that means more load than doing a userspace-kernel-userspace round trip.
>> > (I guess that moving 64kB or 128kB should be quicker, especially since
>> > the various CPU mitigations.)
>>
>> I unfortunately don't remember the details of where I saw it
>> happening. Unfortunately I suspect it'll depend a lot on hardware and
>> operating system details (like the security mitigations you mention) when it
>> matters too.
>>
>>
>> > As long as there are complete requests in the buffer the memmove()
>> > could be avoided; only the initial part of the first incomplete request
>> > might need moving to the beginning.
>>
>> Right.  I'd be inclined that that ought to be addressed as part of this patch,
>> that way we can be sure that it's pretty sure it's not going to cause
>> regressions.
>
>
> I tried to benchmark the usage of memmove(), but I wasn’t able to hit the memmove() part of the code. This led me to a deeper investigation, and I realized that the memmove() call is probably in a dead part of the code.
> pq_recvbuf is called when PqRecvPointer >= PqRecvLength, while memmove() is called later only if PqRecvLength > PqRecvPointer.
> This results in a contradiction.
>
>>
>> > The documentation says
>> >
>> >   > Pipelining is less useful, and more complex,
>> >   > when a single pipeline contains multiple transactions
>> >   > (see Section 32.5.1.3).
>> >
>> > are there any benchmarks/usage statistics for pipeline mode?
>>
>> You can write benchmarks for it using pgbench's pipeline support, with a
>> custom script.
>>
>> Greetings,
>>
>> Andres Freund
>>
> I am also proposing the introduction of a new GUC variable for setting PQ_RECV_BUFFER_SIZE in the first patch. And the second patch removes the dead code.
>


Hi Filip,

Can you please how have you verified it is giving you that some perf. increase?

3 tries each, best:

@ pq_recv_buffers = 2MB best of 3:
    latency average = 2.594 ms
    latency stddev = 0.352 ms
    initial connection time = 9.419 ms
    tps = 385.431723 (without initial connection time)

@ pq_recv_buffers = default (8kB) best of 3:
    latency average = 2.629 ms
    latency stddev = 0.929 ms
    initial connection time = 9.937 ms
    tps = 380.336257 (without initial connection time)

/usr/pgsql19/bin/pgbench  -h xxx -U app -f insert.sql -c 1 -P 1 -n -T
5 -M prepared postgres
where insert.sql was:
    echo "CREATE TEMPORARY TABLE IF NOT EXISTS file_storage (data
BYTEA STORAGE EXTERNAL) ON COMMIT PRESERVE ROWS;" > insert.sql
    echo "INSERT INTO file_storage(data) VALUES ('" >> insert.sql
    perl -e 'print "A"x(1024*1024);' >> insert.sql # 1MB
    echo "');" >> insert.sql

Some description of the env I had:
- tuned TCP rmem/wmem and set congestion to BBR (to eliminate TCP as
being a bottlneck)
- low RTT (same AWS zone), max 25Gbps total, max 9.6 Gbps single stream TCP
- as TOAST pglz compression way taking most of CPU in my case , I've
changed it to lz4 also didn't help a lot, so I've changed it to avoid
*any* compression
- switched to temporary table to avoid I/O as much as possible,
wal_level=minimal too
- had to use prepared statements as otherwise I was hitting way too
much CPU in parser (yylex routines)

So I'm looking for a way to demonstrate the effect. I've also written
a simple psypong2 based LO upload benchmark as pgbench cannot
apparently benchmark this. Sadly of course, then you cannot I think
disable compression and/or load into TEMPORARY table so it's far worse
and hits I/O heavy (as it hit pg_largeobjects*)

-J.