Re: Speed up COPY FROM text/CSV parsing using SIMD
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
To: KAZAR Ayoub <ma_kazar@esi.dz>
Cc: Manni Wood <manni.wood@enterprisedb.com>, Mark Wong <markwkm@gmail.com>, Nathan Bossart <nathandbossart@gmail.com>,
Andrew Dunstan <andrew@dunslane.net>, Shinya Kato <shinya11.kato@gmail.com>, PostgreSQL-development <pgsql-hackers@postgresql.org>
Date: 2025-12-31T13:04:15Z
Lists: pgsql-hackers
Hi,

On Wed, 24 Dec 2025 at 18:08, KAZAR Ayoub <ma_kazar@esi.dz> wrote:
>
> Hello,
> Following the same path of optimizing COPY FROM using SIMD, i found that COPY TO can also benefit from this.
>
> I attached a small patch that uses SIMD to skip data and advance as far as the first special character is found, then fallback to scalar processing for that character and re-enter the SIMD path again...
> There's two ways to do this:
> 1) Essentially we do SIMD until we find a special character, then continue scalar path without re-entering SIMD again.
> - This gives from 10% to 30% speedups depending on the weight of special characters in the attribute, we don't lose anything here since it advances with SIMD until it can't (using the previous scripts: 1/3, 2/3 specials chars).
>
> 2) Do SIMD path, then use scalar path when we hit a special character, keep re-entering the SIMD path each time.
> - This is equivalent to the COPY FROM story, we'll need to find the same heuristic to use for both COPY FROM/TO to reduce the regressions (same regressions: around from 20% to 30% with 1/3, 2/3 specials chars).
>
> Something else to note is that the scalar path for COPY TO isn't as heavy as the state machine in COPY FROM.
>
> So if we find the sweet spot for the heuristic, doing the same for COPY TO will be trivial and always beneficial.
> Attached is 0004 which is option 1 (SIMD without re-entering), 0005 is the second one.

Patches look correct to me. I think we could move these SIMD code portions into a shared function to remove duplication, although that might have a performance impact. I have not benchmarked these patches yet.

Another consideration is that these patches might need their own thread, though I am not completely sure about this yet.

One question: what do you think about having a 0004-style approach for COPY FROM?
What I have in mind is running SIMD for each line & column, stopping SIMD once it can no longer skip an entire chunk, and then continuing with the next line & column.

--
Regards,
Nazir Bilal Yavuz
Microsoft
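
For readers following the thread, the difference between the two options (0004-style: stop using the chunked path after the first special character; 0005-style: re-enter it after every special character) can be sketched roughly as below. This is only an illustration under stated assumptions, not code from the attached patches: the names `scan_attribute`, `skip_clean_chunks` and `chunk_has_special`, the 16-byte `CHUNK_SIZE`, and the set of special characters are all hypothetical, and the per-chunk check is a plain loop standing in for a real vector comparison.

```c
#include <stdbool.h>
#include <stddef.h>

#define CHUNK_SIZE 16			/* hypothetical vector width in bytes */

/* Hypothetical set of characters that need scalar handling. */
static bool
is_special(char c)
{
	return c == '\\' || c == '\n' || c == '\r' || c == '\t';
}

/* Stand-in for a vector compare: true if any byte in the chunk is special. */
static bool
chunk_has_special(const char *p)
{
	for (size_t i = 0; i < CHUNK_SIZE; i++)
		if (is_special(p[i]))
			return true;
	return false;
}

/* Skip whole chunks with no special byte, starting at pos; return new pos. */
static size_t
skip_clean_chunks(const char *buf, size_t len, size_t pos, size_t *chunks_skipped)
{
	while (len - pos >= CHUNK_SIZE && !chunk_has_special(buf + pos))
	{
		pos += CHUNK_SIZE;
		(*chunks_skipped)++;
	}
	return pos;
}

/*
 * Scan one attribute and return the number of special characters found.
 * reenter = false: chunked path only at the start (option 1).
 * reenter = true:  chunked path retried after each special (option 2).
 */
static size_t
scan_attribute(const char *buf, size_t len, bool reenter, size_t *chunks_skipped)
{
	size_t		pos;
	size_t		specials = 0;

	*chunks_skipped = 0;
	pos = skip_clean_chunks(buf, len, 0, chunks_skipped);

	while (pos < len)
	{
		if (is_special(buf[pos++]))
		{
			specials++;			/* real code would escape/process it here */
			if (reenter)
				pos = skip_clean_chunks(buf, len, pos, chunks_skipped);
		}
	}
	return specials;
}
```

With a 49-byte attribute whose only special character sits at offset 16, the `reenter = false` variant skips one chunk and handles the rest byte by byte, while the `reenter = true` variant skips three chunks. On input dense with special characters, the second variant pays for a failed chunk check after almost every byte, which would be consistent with the regressions reported above for option 2.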