Thread

  1. Re: Speed up COPY FROM text/CSV parsing using SIMD

    Manni Wood <manni.wood@enterprisedb.com> — 2025-11-11T22:23:20Z

    On Wed, Oct 29, 2025 at 5:23 PM Andrew Dunstan <andrew@dunslane.net> wrote:
    
    >
    > On 2025-10-22 We 3:24 PM, Nathan Bossart wrote:
    > > On Wed, Oct 22, 2025 at 03:33:37PM +0300, Nazir Bilal Yavuz wrote:
    > >> On Tue, 21 Oct 2025 at 21:40, Nathan Bossart <nathandbossart@gmail.com> wrote:
    > >>> I wonder if we could mitigate the regression further by spacing out the
    > >>> checks a bit more.  It could be worth comparing a variety of values to
    > >>> identify what works best with the test data.
    > >> Do you mean that instead of doubling the SIMD sleep, we should
    > >> multiply it by 3 (or another factor)? Or are you referring to
    > >> increasing the maximum sleep from 1024? Or possibly both?
    > > I'm not sure of the precise details, but the main thrust of my suggestion
    > > is to assume that whatever sampling you do to determine whether to use SIMD
    > > is good for a larger chunk of data.  That is, if you are sampling 1K lines
    > > and then using the result to choose whether to use SIMD for the next 100K
    > > lines, we could instead bump the latter number to 1M lines (or something).
    > > That way we minimize the regression for relatively uniform data sets while
    > > retaining some ability to adapt in case things change halfway through a
    > > large table.
    > >
    >
    >
    > I'd be ok with numbers like this, although I suspect the numbers of
    > cases where we see shape shifts like this in the middle of a data set
    > would be vanishingly small.
    >
    >
    > cheers
    >
    >
    > andrew
    >
    >
    > --
    > Andrew Dunstan
    > EDB: https://www.enterprisedb.com
    >
    >
    >
    >
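As a rough illustration of the scheme Nathan describes in the quoted text (sample a small window of lines, then reuse the verdict for a much larger chunk), here is a minimal Python sketch. It is not the patch's code: the function name, the set of "special" characters, the 1% threshold, and the default window sizes are all assumptions picked only to make the idea concrete.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical: characters assumed to force the slower, non-SIMD path.
SPECIAL_CHARS = frozenset('\\"')


def plan_simd_use(
    lines: Iterable[str],
    sample_size: int = 1_000,     # lines inspected per sampling window
    apply_size: int = 100_000,    # lines that reuse each sampling verdict
    threshold: float = 0.01,      # assumed max special-char ratio for SIMD
) -> Iterator[Tuple[str, bool]]:
    """Yield (line, use_simd) pairs.

    Lines inside a sampling window are always handled without SIMD while
    their special-character ratio is measured; the resulting verdict is
    then applied to the next `apply_size` lines before sampling again.
    """
    use_simd = False
    remaining = 0                 # lines left in the current apply window
    sampled = special = total = 0

    for line in lines:
        if remaining > 0:
            remaining -= 1
            yield line, use_simd
            continue

        # Sampling phase: count special characters on the scalar path.
        special += sum(ch in SPECIAL_CHARS for ch in line)
        total += len(line)
        sampled += 1
        yield line, False

        if sampled == sample_size:
            ratio = special / total if total else 0.0
            use_simd = ratio < threshold
            remaining = apply_size
            sampled = special = total = 0
```

Raising `apply_size` from 100K to 1M is the change being suggested: the sampling cost is paid less often, at the price of reacting more slowly if the data changes shape partway through.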
    Hello!
    
    I wanted to reproduce the results using files attached by Shinya Kato and
    Ayoub Kazar. I installed a postgres compiled from master, and then I
    installed a postgres built from master with Nazir Bilal Yavuz's v3 patches
    applied.
    
    The master+v3patches postgres naturally performed better on copying into
    the database: anywhere from 11% better for the t.csv file produced by
    Shinya's test.sql, to 35% better copying in the t_4096_none.csv file
    created by Ayoub Kazar's simd-copy-from-bench.sql.
    
    But here's where it gets weird. The two files created by Ayoub Kazar's
    simd-copy-from-bench.sql that are supposed to be slower, t_4096_escape.txt,
    and t_4096_quote.csv, actually ran faster on my machine, by 11% and 5%
    respectively.
    
    This seems impossible.
    
    A few things I should note:
    
    I timed the commands using the Unix time command, like so:
    
    time psql -X -U mwood -h localhost -d postgres -c '\copy t from /tmp/t_4096_escape.txt'
    
    For each file, I timed the copy 6 times and took the average.
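
The repeat-and-average procedure above could be scripted along these lines. This is only a sketch under assumptions: the helper names `time_command` and `summarize` are made up for illustration, and reporting the median, minimum, and standard deviation alongside the mean is an addition meant to show how noisy the runs are on a non-dedicated machine.

```python
import statistics
import subprocess
import time


def time_command(argv, runs=6):
    """Run `argv` `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the command fails, so a failed run is not timed.
        subprocess.run(
            argv,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return mean, median, min, and sample standard deviation of the timings."""
    return {
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "min": min(timings),
        "stdev": statistics.stdev(timings) if len(timings) > 1 else 0.0,
    }


# Example argv matching the psql invocation above (requires a running server):
# argv = ["psql", "-X", "-U", "mwood", "-h", "localhost", "-d", "postgres",
#         "-c", r"\copy t from /tmp/t_4096_escape.txt"]
# print(summarize(time_command(argv, runs=6)))
```

If the standard deviation is comparable to the difference between the two builds, the averages alone probably cannot distinguish them.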
    
    This was done on my work Linux machine while also running Chrome and an
    Open Office spreadsheet; not a dedicated machine only running postgres.
    
    All of the copies took between 2 seconds (Ayoub Kazar's t_4096_none.csv
    copied into postgres compiled from master plus Nazir's v3 patches) and
    4.5 seconds (Shinya's t.csv copied into postgres compiled from master).
    
    Perhaps I need to fiddle with the provided SQL to produce larger files to
    get longer run times? Maybe sub-second differences won't tell as
    interesting a story as minutes-long copy commands?
    
    Thanks for reading this.
    -- 
    Manni Wood
    EDB: https://www.enterprisedb.com