Re: Speed up COPY FROM text/CSV parsing using SIMD
From: Manni Wood <manni.wood@enterprisedb.com>
To: KAZAR Ayoub <ma_kazar@esi.dz>
Cc: Nathan Bossart <nathandbossart@gmail.com>,
Nazir Bilal Yavuz <byavuz81@gmail.com>, Andrew Dunstan <andrew@dunslane.net>, Shinya Kato <shinya11.kato@gmail.com>, PostgreSQL-development <pgsql-hackers@postgresql.org>
Date: 2025-12-06T01:39:56Z
Lists: pgsql-hackers
Attachments
- test.sh.txt (text/plain)
- v4-0002-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch (text/x-patch)
- v4-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch (text/x-patch)
On Wed, Nov 26, 2025 at 8:21 AM Manni Wood <manni.wood@enterprisedb.com> wrote:

> On Wed, Nov 26, 2025 at 5:51 AM KAZAR Ayoub <ma_kazar@esi.dz> wrote:
>
>> Hello,
>> On Wed, Nov 19, 2025 at 10:01 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>>
>>> On Tue, Nov 18, 2025 at 05:20:05PM +0300, Nazir Bilal Yavuz wrote:
>>> > Thanks, done.
>>>
>>> I took a look at the v3 patches. Here are my high-level thoughts:
>>>
>>> +    /*
>>> +     * Parse data and transfer into line_buf. To get benefit from inlining,
>>> +     * call CopyReadLineText() with the constant boolean variables.
>>> +     */
>>> +    if (cstate->simd_continue)
>>> +        result = CopyReadLineText(cstate, is_csv, true);
>>> +    else
>>> +        result = CopyReadLineText(cstate, is_csv, false);
>>>
>>> I'm curious whether this actually generates different code, and if it does,
>>> if it's actually faster. We're already branching on cstate->simd_continue
>>> here.
>>
>> I've compiled both versions with -O2 and confirmed they generate
>> different code. When simd_continue is passed as a constant to
>> CopyReadLineText, the compiler optimizes out the condition checks from the
>> SIMD path.
>> A small benchmark on a 1GB+ file shows the expected benefit which is
>> around 6% performance improvement.
>> I've attached the assembly outputs in case someone wants to check
>> something else.
>>
>>
>> Regards,
>> Ayoub Kazar
>>
>
> Correction to my last post:
>
> I also tried files that alternated lines with no special characters and
> lines with 1/3rd special characters, thinking I could force the algorithm
> to continually check whether or not it should use simd and therefore force
> more overhead in the try-simd/don't-try-simd housekeeping code. The text
> file was still 20% faster (not 50% faster as I originally stated --- that
> was a typo). The CSV file was still 13% faster.
>
> Also, apologies for posting at the top in my last e-mail.
> --
> -- Manni Wood EDB: https://www.enterprisedb.com
>

Hello, all.
Andrew, I tried your suggestion of just reading the first chunk of the copy file to determine if SIMD is worth using. Attached are v4 versions of the patches showing a first attempt at doing that.

I attached test.sh.txt to show how I've been testing, with 5 million lines of the various copy file variations introduced by Ayoub Kazar.

The text copy with no special chars is 30% faster.
The CSV copy with no special chars is 48% faster.
The text with 1/3rd escapes is 3% slower.
The CSV with 1/3rd quotes is 0.27% slower.

This set of patches follows the simplest suggestion of just testing the first N lines (actually first N bytes) of the file and then deciding whether or not to enable SIMD. This set of patches does not follow Andrew's later suggestion of maybe checking again every million lines or so.

--
-- Manni Wood EDB: https://www.enterprisedb.com
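
For readers who want the gist of the "sample the first N bytes, then decide" idea without opening the patches, here is a minimal sketch in C. It is not the code from the v4 patches. The function name sample_suggests_simd, the constants SAMPLE_BYTES and MAX_SPECIAL_PERCENT, and the choice of which bytes count as "special" (backslash in text mode, the quote and escape characters in CSV mode) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tuning knobs; the actual patches may use different values. */
#define SAMPLE_BYTES        4096   /* how much of the input to inspect */
#define MAX_SPECIAL_PERCENT 5      /* above this density, stay scalar */

/*
 * Inspect at most SAMPLE_BYTES from the start of the input and report
 * whether special characters are rare enough that a SIMD scan is likely
 * to pay off.  Assumption: "special" means backslash in text mode and the
 * quote or escape character in CSV mode.
 */
bool
sample_suggests_simd(const char *buf, size_t len, bool is_csv,
                     char quote, char escape)
{
    size_t n = (len < SAMPLE_BYTES) ? len : SAMPLE_BYTES;
    size_t special = 0;

    assert(buf != NULL || len == 0);

    /* Nothing to sample: be conservative and keep the scalar path. */
    if (n == 0)
        return false;

    for (size_t i = 0; i < n; i++)
    {
        char c = buf[i];

        if (is_csv ? (c == quote || c == escape) : (c == '\\'))
            special++;
    }

    /* Integer form of: special / n <= MAX_SPECIAL_PERCENT / 100 */
    return special * 100 <= n * MAX_SPECIAL_PERCENT;
}
```

In this sketch the decision is made once, up front, so a file whose character mix changes after the sampled prefix keeps whatever choice the prefix produced. That is the trade-off the periodic re-check suggestion would address.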