Thread
-
Re: Speed up COPY FROM text/CSV parsing using SIMD
Nazir Bilal Yavuz <byavuz81@gmail.com> — 2025-12-09T13:40:19Z
Hi,

On Sat, 6 Dec 2025 at 10:55, Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> On Sat, 6 Dec 2025 at 04:40, Manni Wood <manni.wood@enterprisedb.com> wrote:
> > Hello, all.
> >
> > Andrew, I tried your suggestion of just reading the first chunk of the copy file to determine if SIMD is worth using. Attached are v4 versions of the patches showing a first attempt at doing that.
>
> Thank you for doing this!
>
> > I attached test.sh.txt to show how I've been testing, with 5 million lines of the various copy file variations introduced by Ayub Kazar.
> >
> > The text copy with no special chars is 30% faster. The CSV copy with no special chars is 48% faster. The text with 1/3rd escapes is 3% slower. The CSV with 1/3rd quotes is 0.27% slower.
> >
> > This set of patches follows the simplest suggestion of just testing the first N lines (actually first N bytes) of the file and then deciding whether or not to enable SIMD. This set of patches does not follow Andrew's later suggestion of maybe checking again every million lines or so.
>
> My input-generation script is not ready to share yet, but the inputs
> follow this format: text_${n}.input, where n represents the number of
> normal characters before the delimiter. For example:
>
> n = 0 -> "\n\n\n\n\n..." (no normal characters)
> n = 1 -> "a\n..." (1 normal character before the delimiter)
> ...
> n = 5 -> "aaaaa\n..."
> … continuing up to n = 32.
>
> Each line has 4096 chars and there are a total of 100000 lines in each
> input file.
>
> I only benchmarked the text format. I compared the latest heuristic I
> shared [1] with the current method. The benchmarks show roughly a ~16%
> regression at the worst case (n = 2), with regressions up to n = 5.
> For the remaining values, performance was similar.

I tried to improve the v4 patchset. My changes are:

1 - I changed CopyReadLineText() to an inline function and passed the
use_simd variable as an argument so that it benefits from inlining.

2 - The main for loop in CopyReadLineText() runs many times, so I moved
the use_simd check out of it and into CopyReadLine().

3 - Instead of 'bytes_processed', I used 'chars_processed', because
cstate->bytes_processed is increased before the bytes are actually
processed, which can give wrong results.

4 - Because of #2 and #3, instead of having
'SPECIAL_CHAR_SIMD_THRESHOLD', I used the ratio of 'chars_processed /
special_chars_encountered' to determine whether we want to use SIMD.

5 - cstate->special_chars_encountered is incremented incorrectly in the
CSV case: it is not incremented for the quote and escape delimiters. I
moved all increments of cstate->special_chars_encountered to one central
place and tried to optimize it, but it still causes a regression because
it adds one more branch.

With these changes, I am able to reduce the regression from 16% to 10%.
The regression drops to 7% if I modify #5 for text input only, but I did
not do that.

My changes are in 0003.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
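
The ratio check from #4 could look roughly like the sketch below. This is not code from the 0003 patch. The struct CopySimdStats, the function copy_should_use_simd(), and both threshold constants are hypothetical names and values chosen for illustration; only the two counter names, chars_processed and special_chars_encountered, come from the description above. The sketch also assumes that the scalar path is used until enough characters have been sampled, which the patch may handle differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two counters kept in the copy state. */
typedef struct CopySimdStats
{
	uint64_t	chars_processed;			/* characters already scanned */
	uint64_t	special_chars_encountered;	/* delimiters, escapes, quotes, ... */
} CopySimdStats;

/* Assumed values for illustration only; the patch may use different ones. */
#define SIMD_SAMPLE_MIN_CHARS		4096	/* sample size before deciding */
#define SIMD_MIN_CHARS_PER_SPECIAL	32		/* minimum chars per special char */

/*
 * Decide once per line (not once per loop iteration) whether the SIMD
 * path is likely to pay off, based on how sparse special characters
 * have been so far.
 */
static bool
copy_should_use_simd(const CopySimdStats *stats)
{
	/* Assumption: stay on the scalar path until enough input was sampled. */
	if (stats->chars_processed < SIMD_SAMPLE_MIN_CHARS)
		return false;

	/* No special characters seen at all: avoid dividing by zero. */
	if (stats->special_chars_encountered == 0)
		return true;

	/* Integer ratio of processed characters to special characters. */
	return stats->chars_processed / stats->special_chars_encountered
		>= SIMD_MIN_CHARS_PER_SPECIAL;
}
```

Calling such a helper from the per-line entry point and passing the result down as a plain bool keeps the decision out of the hot per-character loop, which is the motivation given in #1 and #2.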