Re: Speed up COPY FROM text/CSV parsing using SIMD
From: Andrew Dunstan <andrew@dunslane.net>
To: Nazir Bilal Yavuz <byavuz81@gmail.com>, KAZAR Ayoub <ma_kazar@esi.dz>
Cc: Shinya Kato <shinya11.kato@gmail.com>, pgsql-hackers@postgresql.org
Date: 2025-08-21T15:47:30Z
Lists: pgsql-hackers
Commits
- Optimize COPY FROM (FORMAT {text,csv}) using SIMD.
  e0a3a3fd5361, 19 (unreleased), landed
- Speedup COPY FROM with additional function inlining.
  dc592a41557b, 19 (unreleased), landed
- doc: Fix incorrect wording for --file in pg_dump
  07961ef86625, 19 (unreleased), cited

On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
> Hi,
>
> On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>> I am able to reproduce the regression you mentioned but both
>> regressions are %20 on my end. I found that (by experimenting) SIMD
>> causes a regression if it advances less than 5 characters.
>>
>> So, I implemented a small heuristic. It works like that:
>>
>> - If advance < 5 -> insert a sleep penalty (n cycles).
> 'sleep' might be a poor word choice here. I meant skipping SIMD for n
> number of times.
>

I was thinking a bit about that this morning. I wonder whether, instead
of having a constantly applied heuristic like this, it might be better
to do a little extra accounting in the first, say, 1000 lines of an
input file, and then switch to the SIMD code only if less than some
portion of the input is found to be special characters. What that
portion should be would need to be determined by some experimentation
with a variety of typical workloads, but given your findings 20% seems
like a good starting point.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
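
For illustration only, here is a minimal sketch of the sampling idea described above. It is not taken from any posted patch. The names (CopySampleState, sample_line, is_special), the set of characters treated as special, and both constants are assumptions; the 1000-line window and the 20% starting point come from the message, and the real threshold would need benchmarking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tuning knobs; both values are assumptions to be benchmarked. */
#define SAMPLE_LINES          1000	/* lines examined before deciding */
#define SPECIAL_PCT_THRESHOLD 20	/* use SIMD only below this percentage */

/* Hypothetical per-COPY accounting state, zero-initialized at start. */
typedef struct CopySampleState
{
	size_t		lines_seen;
	size_t		total_bytes;
	size_t		special_bytes;
	bool		decided;		/* true once the sample window is complete */
	bool		use_simd;		/* valid only when decided is true */
} CopySampleState;

/* Assumed definition of "special": delimiter, quote, escape, or EOL byte. */
static bool
is_special(char c, char delim, char quote, char escape)
{
	return c == delim || c == quote || c == escape || c == '\n' || c == '\r';
}

/*
 * Account for one input line. Returns true once the decision has been made;
 * after that, further calls are no-ops and st->use_simd holds the result.
 */
static bool
sample_line(CopySampleState *st, const char *line, size_t len,
			char delim, char quote, char escape)
{
	assert(st != NULL);

	if (st->decided)
		return true;

	for (size_t i = 0; i < len; i++)
	{
		if (is_special(line[i], delim, quote, escape))
			st->special_bytes++;
	}
	st->total_bytes += len;
	st->lines_seen++;

	if (st->lines_seen >= SAMPLE_LINES)
	{
		/* Integer comparison of special/total against the threshold. */
		st->use_simd = st->special_bytes * 100 <
			st->total_bytes * SPECIAL_PCT_THRESHOLD;
		st->decided = true;
	}
	return st->decided;
}
```

A caller would run the scalar path while sample_line() returns false, then pick the scalar or SIMD path once based on use_simd. The trade-off against the per-chunk skip heuristic is that a one-time decision costs nothing afterwards, but it cannot adapt if the character mix changes later in the file.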