Re: Speed up COPY FROM text/CSV parsing using SIMD
Nazir Bilal Yavuz <byavuz81@gmail.com>
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
To: Andrew Dunstan <andrew@dunslane.net>
Cc: KAZAR Ayoub <ma_kazar@esi.dz>, Shinya Kato <shinya11.kato@gmail.com>, pgsql-hackers@postgresql.org
Date: 2025-10-16T14:29:36Z
Lists: pgsql-hackers
Commits
- Optimize COPY FROM (FORMAT {text,csv}) using SIMD.
  e0a3a3fd5361 19 (unreleased) landed
- Speedup COPY FROM with additional function inlining.
  dc592a41557b 19 (unreleased) landed
- doc: Fix incorrect wording for --file in pg_dump
  07961ef86625 19 (unreleased) cited
Attachments
- v3-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch (text/x-patch) patch v3-0001
- v3-0002-COPY-SIMD-per-line-heuristic.patch (text/x-patch) patch v3-0002
Hi,

On Thu, 21 Aug 2025 at 18:47, Andrew Dunstan <andrew@dunslane.net> wrote:
>
>
> On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
> > Hi,
> >
> > On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >> I am able to reproduce the regression you mentioned but both
> >> regressions are %20 on my end. I found that (by experimenting) SIMD
> >> causes a regression if it advances less than 5 characters.
> >>
> >> So, I implemented a small heuristic. It works like that:
> >>
> >> - If advance < 5 -> insert a sleep penalty (n cycles).
> > 'sleep' might be a poor word choice here. I meant skipping SIMD for n
> > number of times.
> >
>
> I was thinking a bit about that this morning. I wonder if it might be
> better instead of having a constantly applied heuristic like this, it
> might be better to do a little extra accounting in the first, say, 1000
> lines of an input file, and if less than some portion of the input is
> found to be special characters then switch to the SIMD code. What that
> portion should be would need to be determined by some experimentation
> with a variety of typical workloads, but given your findings 20% seems
> like a good starting point.

I implemented a heuristic similar to this. It is a mix of the previous
heuristic and your idea, and it works like this:

The overall logic is that we will not run SIMD for the entire line, and
we decide whether it is worth running SIMD for the next lines.

1 - We try SIMD and decide whether it is worth running.
1.1 - If it is worth it, we continue to run SIMD and halve the
simd_last_sleep_cycle variable.
1.2 - If it is not worth it, we double simd_last_sleep_cycle and do
not run SIMD for that many lines.
1.3 - After skipping simd_last_sleep_cycle lines, we go back to step 1.

Note: simd_last_sleep_cycle cannot exceed 1024, so in the worst case we
still try SIMD once every 1024 lines.

With this heuristic, the regression is limited to 2% in the worst case.

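The back-off in steps 1 to 1.3 can be sketched roughly as follows. This is
an illustration only, not the code from the attached patch. Only
simd_last_sleep_cycle and the 1024 cap come from the description above; the
struct, the function names, lines_to_skip, and SIMD_SLEEP_MAX are
hypothetical. It also assumes that the counter starts at 0, that doubling
from 0 yields 1, and that the caller supplies the "worth it" verdict, since
the criterion is not spelled out here.

```c
#include <assert.h>
#include <stdbool.h>

#define SIMD_SLEEP_MAX 1024		/* cap mentioned in the note above */

/* Hypothetical per-COPY state; only simd_last_sleep_cycle is named in the mail. */
typedef struct SimdHeuristic
{
	int			simd_last_sleep_cycle;	/* current back-off length, in lines */
	int			lines_to_skip;	/* lines left before SIMD is tried again */
} SimdHeuristic;

/*
 * Step 1 / 1.3: decide whether the next line is parsed with SIMD.
 * While a back-off is active, one skipped line is consumed per call.
 */
static bool
simd_should_try(SimdHeuristic *h)
{
	if (h->lines_to_skip > 0)
	{
		h->lines_to_skip--;
		return false;
	}
	return true;
}

/*
 * Steps 1.1 and 1.2: record the outcome of a line parsed with SIMD.
 * "worth_it" is computed by the caller (assumption).
 */
static void
simd_report(SimdHeuristic *h, bool worth_it)
{
	if (worth_it)
	{
		/* 1.1: keep running SIMD and shrink the penalty */
		h->simd_last_sleep_cycle /= 2;
	}
	else
	{
		/* 1.2: grow the penalty (assumed to start at 1), capped at 1024 */
		if (h->simd_last_sleep_cycle == 0)
			h->simd_last_sleep_cycle = 1;
		else
			h->simd_last_sleep_cycle *= 2;

		if (h->simd_last_sleep_cycle > SIMD_SLEEP_MAX)
			h->simd_last_sleep_cycle = SIMD_SLEEP_MAX;

		h->lines_to_skip = h->simd_last_sleep_cycle;
	}

	assert(h->simd_last_sleep_cycle >= 0 &&
		   h->simd_last_sleep_cycle <= SIMD_SLEEP_MAX);
}
```

A caller would check simd_should_try() once per line, parse with SIMD or
the scalar path accordingly, and call simd_report() only after a SIMD
attempt. With the cap in place, input that never benefits from SIMD pays
for at most one SIMD attempt per 1024 skipped lines.
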
Patches are attached. The first patch is v2-0001 from Shinya with the
'-Werror=maybe-uninitialized' fixes and the pgindent changes. 0002 is
the actual heuristic patch.

--
Regards,
Nazir Bilal Yavuz
Microsoft