Re: Speed up COPY FROM text/CSV parsing using SIMD
Shinya Kato <shinya11.kato@gmail.com>
From: Shinya Kato <shinya11.kato@gmail.com>
To: Nazir Bilal Yavuz <byavuz81@gmail.com>
Cc: pgsql-hackers@postgresql.org
Date: 2025-08-13T06:21:06Z
Lists: pgsql-hackers
Commits
Same data as JSON:
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
Optimize COPY FROM (FORMAT {text,csv}) using SIMD.
- e0a3a3fd5361 19 (unreleased) landed
-
Speedup COPY FROM with additional function inlining.
- dc592a41557b 19 (unreleased) landed
-
doc: Fix incorrect wording for --file in pg_dump
- 07961ef86625 19 (unreleased) cited
Attachments
- v2-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch (application/octet-stream) patch v2-0001
On Tue, Aug 12, 2025 at 4:25 PM Shinya Kato <shinya11.kato@gmail.com> wrote: > > + * However, SIMD optimization cannot be applied in the following cases: > > + * - Inside quoted fields, where escape sequences and closing quotes > > + * require sequential processing to handle correctly. > > > > I think you can continue SIMD inside quoted fields. Only important > > thing is you need to set last_was_esc to false when SIMD skipped the > > chunk. > > That's a clever point that last_was_esc should be reset to false when > a SIMD chunk is skipped. You're right about that specific case. > > However, the core challenge is not what happens when we skip a chunk, > but what happens when a chunk contains special characters like quotes > or escapes. The main reason we avoid SIMD inside quoted fields is that > the parsing logic becomes fundamentally sequential and > context-dependent. > > To correctly parse a "" as a single literal quote, we must perform a > lookahead to check the next character. This is an inherently > sequential operation that doesn't map well to SIMD's parallel nature. > > Trying to handle this stateful logic with SIMD would lead to > significant implementation complexity, especially with edge cases like > an escape character falling on the last byte of a chunk. Ah, you're right. My apologies, I misunderstood the implementation. It appears that SIMD can be used even within quoted strings. I think it would be better not to use the SIMD path when last_was_esc is true. The next character is likely to be a special character, and handling this case outside the SIMD loop would also improve readability by consolidating the last_was_esc toggle logic in one place. Furthermore, when inside a quote (in_quote) in CSV mode, the detection of \n and \r can be disabled. + last_was_esc = false; Regarding the implementation, I believe we must set last_was_esc to false when advancing input_buf_ptr, as shown in the code below. For this reason, I think it’s best to keep the current logic for toggling last_was_esc. + int advance = pg_rightmost_one_pos32(mask); + input_buf_ptr += advance; I've attached a new patch that includes these changes. Further modifications are still in progress. -- Best regards, Shinya Kato NTT OSS Center