Thread
-
Re: [PATCH] Better Performance for PostgreSQL with large INSERTs
Jakub Wartak <jakub.wartak@enterprisedb.com> — 2025-12-12T12:54:38Z
On Wed, Nov 26, 2025 at 3:03 PM Filip Janus <fjanus@redhat.com> wrote: > > > > -Filip- > > > út 7. 10. 2025 v 16:54 odesílatel Andres Freund <andres@anarazel.de> napsal: >> >> Hi, >> >> On 2025-10-07 15:03:29 +0200, Philipp Marek wrote: >> > > Have you tried to verify that this doesn't cause performance regressions >> > > in >> > > other workloads? pq_recvbuf() has this code: >> > > >> > ... >> > > >> > > I do seem to recall that just increasing the buffer size substantially >> > > lead to >> > > more time being spent inside that memmove() (likely due to exceeding >> > > L1/L2). >> > >> > >> > Do you have any pointers to discussions or other data about that? >> > >> > >> > My (quick) analysis was that clients that send one request, >> > wait for an answer, then send the next request wouldn't run that code >> > as there's nothing behind the individual requests that could be moved. >> > >> > >> > But yes, Pipeline Mode[1] might/would be affected. >> > >> > The interesting question is how much data can userspace copy before >> > that means more load than doing a userspace-kernel-userspace round trip. >> > (I guess that moving 64kB or 128kB should be quicker, especially since >> > the various CPU mitigations.) >> >> I unfortunately don't remember the details of where I saw it >> happening. Unfortunately I suspect it'll depend a lot on hardware and >> operating system details (like the security mitigations you mention) when it >> matters too. >> >> >> > As long as there are complete requests in the buffer the memmove() >> > could be avoided; only the initial part of the first incomplete request >> > might need moving to the beginning. >> >> Right. I'd be inclined that that ought to be addressed as part of this patch, >> that way we can be sure that it's pretty sure it's not going to cause >> regressions. > > > I tried to benchmark the usage of memmove(), but I wasn’t able to hit the memmove() part of the code. This led me to a deeper investigation, and I realized that the memmove() call is probably in a dead part of the code. > pq_recvbuf is called when PqRecvPointer >= PqRecvLength, while memmove() is called later only if PqRecvLength > PqRecvPointer. > This results in a contradiction. > >> >> > The documentation says >> > >> > > Pipelining is less useful, and more complex, >> > > when a single pipeline contains multiple transactions >> > > (see Section 32.5.1.3). >> > >> > are there any benchmarks/usage statistics for pipeline mode? >> >> You can write benchmarks for it using pgbench's pipeline support, with a >> custom script. >> >> Greetings, >> >> Andres Freund >> > I am also proposing the introduction of a new GUC variable for setting PQ_RECV_BUFFER_SIZE in the first patch. And the second patch removes the dead code. > Hi Filip, Can you please how have you verified it is giving you that some perf. increase? 3 tries each, best: @ pq_recv_buffers = 2MB best of 3: latency average = 2.594 ms latency stddev = 0.352 ms initial connection time = 9.419 ms tps = 385.431723 (without initial connection time) @ pq_recv_buffers = default (8kB) best of 3: latency average = 2.629 ms latency stddev = 0.929 ms initial connection time = 9.937 ms tps = 380.336257 (without initial connection time) /usr/pgsql19/bin/pgbench -h xxx -U app -f insert.sql -c 1 -P 1 -n -T 5 -M prepared postgres where insert.sql was: echo "CREATE TEMPORARY TABLE IF NOT EXISTS file_storage (data BYTEA STORAGE EXTERNAL) ON COMMIT PRESERVE ROWS;" > insert.sql echo "INSERT INTO file_storage(data) VALUES ('" >> insert.sql perl -e 'print "A"x(1024*1024);' >> insert.sql # 1MB echo "');" >> insert.sql Some description of the env I had: - tuned TCP rmem/wmem and set congestion to BBR (to eliminate TCP as being a bottlneck) - low RTT (same AWS zone), max 25Gbps total, max 9.6 Gbps single stream TCP - as TOAST pglz compression way taking most of CPU in my case , I've changed it to lz4 also didn't help a lot, so I've changed it to avoid *any* compression - switched to temporary table to avoid I/O as much as possible, wal_level=minimal too - had to use prepared statements as otherwise I was hitting way too much CPU in parser (yylex routines) So I'm looking for a way to demonstrate the effect. I've also written a simple psypong2 based LO upload benchmark as pgbench cannot apparently benchmark this. Sadly of course, then you cannot I think disable compression and/or load into TEMPORARY table so it's far worse and hits I/O heavy (as it hit pg_largeobjects*) -J.