Thread

  1. Re: Speed up COPY FROM text/CSV parsing using SIMD

    Nazir Bilal Yavuz <byavuz81@gmail.com> — 2025-12-09T13:40:19Z

    Hi,
    
    On Sat, 6 Dec 2025 at 10:55, Bilal Yavuz <byavuz81@gmail.com> wrote:
    >
    > Hi,
    >
    > On Sat, 6 Dec 2025 at 04:40, Manni Wood <manni.wood@enterprisedb.com> wrote:
    > > Hello, all.
    > >
    > > Andrew, I tried your suggestion of just reading the first chunk of the copy file to determine if SIMD is worth using. Attached are v4 versions of the patches showing a first attempt at doing that.
    >
    > Thank you for doing this!
    >
    > > I attached test.sh.txt to show how I've been testing, with 5 million lines of the various copy file variations introduced by Ayub Kazar.
    > >
    > > The text copy with no special chars is 30% faster. The CSV copy with no special chars is 48% faster. The text with 1/3rd escapes is 3% slower. The CSV with 1/3rd quotes is 0.27% slower.
    > >
    > > This set of patches follows the simplest suggestion of just testing the first N lines (actually first N bytes) of the file and then deciding whether or not to enable SIMD. This set of patches does not follow Andrew's later suggestion of maybe checking again every million lines or so.
    >
    > My input-generation script is not ready to share yet, but the inputs
    > follow this format: text_${n}.input, where n represents the number of
    > normal characters before the delimiter. For example:
    >
    > n = 0 -> "\n\n\n\n\n..." (no normal characters)
    > n = 1 -> "a\n..." (1 normal character before the delimiter)
    > ...
    > n = 5 -> "aaaaa\n..."
    > … continuing up to n = 32.
    >
    > Each line has 4096 chars and there are a total of 100000 lines in each
    > input file.
    >
    > I only benchmarked the text format. I compared the latest heuristic I
    > shared [1] with the current method. The benchmarks show roughly a ~16%
    > regression at the worst case (n = 2), with regressions up to n = 5.
    > For the remaining values, performance was similar.
    
    I tried to improve the v4 patchset. My changes are:
    
    1 - I changed CopyReadLineText() to an inline function and sent the
    use_simd variable as an argument to get help from inlining.
    
    2 - A main for loop in the CopyReadLineText() function is called many
    times, so I moved the use_simd check to the CopyReadLine() function.
    
    3 - Instead of 'bytes_processed', I used 'chars_processed' because
    cstate->bytes_processed is increased before we process them and this
    can cause wrong results.
    
    4 - Because of #2 and #3, instead of having
    'SPECIAL_CHAR_SIMD_THRESHOLD', I used the ratio of 'chars_processed /
    special_chars_encountered' to determine whether we want to use SIMD.
    
    5 - cstate->special_chars_encountered is incremented wrongly for the
    CSV case. It is not incremented for the quote and escape delimiters. I
    moved all increments of cstate->special_chars_encountered to the
    central place and tried to optimize it but it still causes a
    regression as it creates one more branching.
    
    With these changes, I am able to decrease the regression to %10 from
    %16. Regression decreases to %7 if I modify #5 for the only text input
    but I did not do that.
    
    My changes are in the 0003.
    
    -- 
    Regards,
    Nazir Bilal Yavuz
    Microsoft