Thread

  1. Re: Speed up COPY FROM text/CSV parsing using SIMD

    Manni Wood <manni.wood@enterprisedb.com> — 2025-12-06T01:39:56Z

    On Wed, Nov 26, 2025 at 8:21 AM Manni Wood <manni.wood@enterprisedb.com>
    wrote:
    
    >
    >
    > On Wed, Nov 26, 2025 at 5:51 AM KAZAR Ayoub <ma_kazar@esi.dz> wrote:
    >
    >> Hello,
    >> On Wed, Nov 19, 2025 at 10:01 PM Nathan Bossart <nathandbossart@gmail.com>
    >> wrote:
    >>
    >>> On Tue, Nov 18, 2025 at 05:20:05PM +0300, Nazir Bilal Yavuz wrote:
    >>> > Thanks, done.
    >>>
    >>> I took a look at the v3 patches.  Here are my high-level thoughts:
    >>>
    >>> +    /*
    >>> +     * Parse data and transfer into line_buf. To get benefit from inlining,
    >>> +     * call CopyReadLineText() with the constant boolean variables.
    >>> +     */
    >>> +    if (cstate->simd_continue)
    >>> +        result = CopyReadLineText(cstate, is_csv, true);
    >>> +    else
    >>> +        result = CopyReadLineText(cstate, is_csv, false);
    >>>
    >>> I'm curious whether this actually generates different code, and if it
    >>> does, if it's actually faster.  We're already branching on
    >>> cstate->simd_continue here.
    >>
    >> I've compiled both versions with -O2 and confirmed they generate
    >> different code. When simd_continue is passed as a constant to
    >> CopyReadLineText, the compiler optimizes out the condition checks from the
    >> SIMD path.
    >> A small benchmark on a 1GB+ file shows the expected benefit which is
    >> around 6% performance improvement.
    >> I've attached the assembly outputs in case someone wants to check
    >> something else.
    >>
    >>
    >> Regards,
    >> Ayoub Kazar
    >>
    >
    > Correction to my last post:
    >
    > I also tried files that alternated lines with no special characters and
    > lines with 1/3rd special characters, thinking I could force the algorithm
    > to continually check whether or not it should use simd and therefore force
    > more overhead in the try-simd/don't-try-simd housekeeping code. The text
    > file was still 20% faster (not 50% faster as I originally stated --- that
    > was a typo). The CSV file was still 13% faster.
    >
    > Also, apologies for posting at the top in my last e-mail.
    > --
    > -- Manni Wood EDB: https://www.enterprisedb.com
    >
    
    Hello, all.
    
    Andrew, I tried your suggestion of just reading the first chunk of the copy
    file to determine if SIMD is worth using. Attached are v4 versions of the
    patches showing a first attempt at doing that.
    
    I attached test.sh.txt to show how I've been testing, with 5 million lines
    of the various copy file variations introduced by Ayoub Kazar.
    
    The text copy with no special chars is 30% faster. The CSV copy with no
    special chars is 48% faster. The text with 1/3rd escapes is 3% slower. The
    CSV with 1/3rd quotes is 0.27% slower.
    
    This set of patches follows the simplest suggestion of just testing the
    first N lines (actually first N bytes) of the file and then deciding
    whether or not to enable SIMD. This set of patches does not follow Andrew's
    later suggestion of maybe checking again every million lines or so.
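
    To make the "test the first N bytes, then decide" idea concrete, here is
    a minimal sketch. It is not the code in the v4 patches: the function
    name, SAMPLE_BYTES, MAX_SPECIAL_PERCENT, and the choice of which
    characters count as "special" are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tuning knobs; neither value is taken from the patches. */
#define SAMPLE_BYTES        65536
#define MAX_SPECIAL_PERCENT 5

/* Assumes default QUOTE for CSV and backslash escapes for text format. */
static bool
is_special(char c, bool is_csv)
{
    return is_csv ? (c == '"') : (c == '\\');
}

/*
 * Inspect at most SAMPLE_BYTES of the first input chunk and report whether
 * special characters are rare enough that a SIMD scan looks worthwhile.
 */
bool
sample_says_use_simd(const char *buf, size_t len, bool is_csv)
{
    size_t n = (len < SAMPLE_BYTES) ? len : SAMPLE_BYTES;
    size_t special = 0;

    if (n == 0)
        return false;           /* nothing to sample, stay on scalar path */

    for (size_t i = 0; i < n; i++)
    {
        if (is_special(buf[i], is_csv))
            special++;
    }

    /* Integer form of: special / n <= MAX_SPECIAL_PERCENT / 100 */
    return special * 100 <= n * MAX_SPECIAL_PERCENT;
}
```

    In a sketch like this the decision is made once and never revisited,
    which matches the "simplest suggestion" described above; re-sampling
    every so many lines would mean calling the same check again later.
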
    -- 
    -- Manni Wood EDB: https://www.enterprisedb.com
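
    On the inlining question quoted earlier in this message: the pattern
    under discussion is calling an inline function with a literal boolean
    from each branch so the compiler can specialize each call site. The
    sketch below only illustrates that pattern. scan_line() is a
    hypothetical stand-in for CopyReadLineText(), and its 8-byte block loop
    is a plain-C placeholder for the SIMD path, not real SIMD code.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for CopyReadLineText(): returns the offset of the
 * first newline in buf, or len if there is none.
 */
static inline size_t
scan_line(const char *buf, size_t len, bool use_fast_path)
{
    size_t i = 0;

    if (use_fast_path)
    {
        /* Placeholder for the SIMD path: skip newline-free 8-byte blocks. */
        while (i + 8 <= len)
        {
            bool found = false;

            for (size_t j = 0; j < 8; j++)
            {
                if (buf[i + j] == '\n')
                    found = true;
            }
            if (found)
                break;
            i += 8;
        }
    }

    /* Scalar path: also finishes whatever the block loop left over. */
    while (i < len && buf[i] != '\n')
        i++;

    return i;
}

size_t
scan_line_dispatch(const char *buf, size_t len, bool simd_continue)
{
    /*
     * Each call passes a compile-time constant, so after inlining the
     * compiler may drop the use_fast_path test from each copy.
     */
    if (simd_continue)
        return scan_line(buf, len, true);
    else
        return scan_line(buf, len, false);
}
```

    Whether the two call sites really compile to different code depends on
    the compiler and optimization level, so comparing the assembly output,
    as Ayoub did with -O2, is the way to confirm it.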