Thread

  1. Re: Speed up COPY FROM text/CSV parsing using SIMD

    Manni Wood <manni.wood@enterprisedb.com> — 2025-11-13T02:40:35Z

    On Wed, Nov 12, 2025 at 8:44 AM KAZAR Ayoub <ma_kazar@esi.dz> wrote:
    
    > On Tue, Nov 11, 2025 at 11:23 PM Manni Wood <manni.wood@enterprisedb.com>
    > wrote:
    >
    >> Hello!
    >>
    >> I wanted to reproduce the results using files attached by Shinya Kato and
    >> Ayoub Kazar. I installed a postgres compiled from master, and then I
    >> installed a postgres built from master plus Nazir Bilal Yavuz's v3 patches
    >> applied.
    >>
    >> The master+v3patches postgres naturally performed better on copying into
    >> the database: anywhere from 11% better for the t.csv file produced by
    >> Shinya's test.sql, to 35% better copying in the t_4096_none.csv file
    >> created by Ayoub Kazar's simd-copy-from-bench.sql.
    >>
    >> But here's where it gets weird. The two files created by Ayoub Kazar's
    >> simd-copy-from-bench.sql that are supposed to be slower, t_4096_escape.txt,
    >> and t_4096_quote.csv, actually ran faster on my machine, by 11% and 5%
    >> respectively.
    >>
    >> This seems impossible.
    >>
    >> A few things I should note:
    >>
    >> I timed the commands using the Unix time command, like so:
    >>
    >> time psql -X -U mwood -h localhost -d postgres -c '\copy t from /tmp/t_4096_escape.txt'
    >>
    >> For each file, I timed the copy 6 times and took the average.
    >>
    >> This was done on my work Linux machine while also running Chrome and an
    >> Open Office spreadsheet; not a dedicated machine only running postgres.
    >>
    > Hello,
    > I think if you do a perf benchmark (if it still reproduces) it would
    > probably be possible to explain why it's performing like that looking at
    > the CPI and other metrics and compare it to my findings.
    > What I also suggest is to make the data even closer to the worst
    > case, i.e. more special characters where they hurt the switching between
    > SIMD and scalar processing (in the simd-copy-from-bench.sql file). If it
    > still does a good job, then there's something to look at.
    >
    >>
    >>
    >
    >> All of the copy results took between 4.5 seconds (Shinya's t.csv copied
    >> into postgres compiled from master) and 2 seconds (Ayoub
    >> Kazar's t_4096_none.csv copied into postgres compiled from master plus
    >> Nazir's v3 patches).
    >>
    >> Perhaps I need to fiddle with the provided SQL to produce larger files to
    >> get longer run times? Maybe sub-second differences won't tell as
    >> interesting a story as minutes-long copy commands?
    >>
    > I did try it on some GBs (around 2-5GB only), and the differences were not
    > that large, but if you can run this on more data (at least 10GB) it would be
    > good to look at, although I don't expect anything interesting since the
    > shape of the data is the same for the whole COPY.
    >
    >>
    >> Thanks for reading this.
    >> --
    >> -- Manni Wood EDB: https://www.enterprisedb.com
    >>
    > Thanks for the info.
    >
    >
    > Regards,
    > Ayoub Kazar.
    >
    
    Hello again!
    
    It looks like using 10 times the data removed the apparent speedup in the
    SIMD code when it has to deal with t_4096_escape.txt and t_4096_quote.csv.
    With 1,000,000 lines in each file, postgres master+v3patch imports them
    0.63% slower and 0.54% slower, respectively.
    For 1,000,000 lines of t_4096_none.txt, the v3 patch yields a 30% speedup.
    For 1,000,000 lines of t_4096_none.csv, the v3 patch yields a 33% speedup.
    
    I got these numbers just via simple timing, though this time I used psql's
    \timing feature. I left psql running rather than launching it each time as
    I did when I used the Unix "time" command. I ran the copy command 5 times
    for each file and averaged the results. Again, this happened on a Linux
    machine that also happened to be running Chrome and Open Office's
    spreadsheet.
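    
    For anyone repeating this, the arithmetic behind such percentages can be
    sketched as below. This is only an illustration: the function name is
    hypothetical, the timings in the usage lines are placeholders rather than
    measurements from this thread, and it assumes "speedup" means a reduction
    in mean elapsed time relative to master.

```python
from statistics import mean


def percent_change(baseline_ms, patched_ms):
    """Compare mean elapsed times of two builds.

    Returns a percentage relative to the baseline mean: negative means the
    patched build was faster, positive means it was slower.
    """
    base = mean(baseline_ms)
    patched = mean(patched_ms)
    return (patched - base) / base * 100.0


# Placeholder timings in milliseconds (hypothetical, NOT results from this thread).
master_runs = [2000.0, 2100.0, 1900.0, 2000.0, 2000.0]
patched_runs = [1400.0, 1500.0, 1300.0, 1400.0, 1400.0]
print(f"{percent_change(master_runs, patched_runs):+.2f}%")
```

    With only 5 runs per file on a machine that is not otherwise idle,
    differences well under 1% are probably within the noise, so reporting the
    spread of the runs next to the mean would help.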
    
    I should probably try to construct some .txt or .csv files that would trip
    up the SIMD on/off heuristic in the v3 patch.
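    
    One way to build such a file is sketched below. It rests on a guess that
    the heuristic reacts to how often special characters showed up in recently
    parsed input; I have not checked that against the v3 patch. The function
    names, the single text column, and the parameters `block` and
    `special_every` are all assumptions made for illustration. The script
    alternates blocks of plain rows with blocks of rows dense in double quotes,
    so that the "shape" of the data keeps flipping.

```python
import csv


def make_field(width, special_every):
    """Build one field of `width` characters.

    If special_every > 0, every special_every-th character (starting at
    index 0) is a double quote; otherwise the field is all 'a'.
    """
    if special_every <= 0:
        return "a" * width
    return "".join('"' if i % special_every == 0 else "a" for i in range(width))


def write_alternating_csv(path, rows, block, width=4096, special_every=8):
    """Write `rows` single-column CSV rows to `path`.

    Rows come in blocks of `block` rows: a plain block, then a block whose
    fields are dense with quotes, then plain again, and so on. The csv
    module quotes the field and doubles embedded quotes where needed.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for r in range(rows):
            dense = (r // block) % 2 == 1  # odd-numbered blocks are quote-heavy
            writer.writerow([make_field(width, special_every if dense else 0)])
```

    Calling write_alternating_csv() with rows=1000000 and a small block (for
    example 1, 10, or 1000) should give files comparable in size to the ones
    above; varying block changes how often the data flips between the two
    shapes.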
    
    If data "in the wild" tend to be roughly the same "shape" from row to row,
    as Andrew's experience has shown, I imagine these million row results bode
    well for the v3 patch...
    -- 
    -- Manni Wood EDB: https://www.enterprisedb.com