Thread

  1. Re: Speed up COPY FROM text/CSV parsing using SIMD

    Andrew Dunstan <andrew@dunslane.net> — 2025-10-20T14:02:23Z

    On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
    > Hi,
    >
    > On Thu, 21 Aug 2025 at 18:47, Andrew Dunstan <andrew@dunslane.net> wrote:
    >>
    >> On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
    >>> Hi,
    >>>
    >>> On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
    >>>> I am able to reproduce the regression you mentioned, but both
    >>>> regressions are 20% on my end. I found (by experimenting) that SIMD
    >>>> causes a regression if it advances fewer than 5 characters.
    >>>>
    >>>> So, I implemented a small heuristic. It works like this:
    >>>>
    >>>> - If advance < 5 -> insert a sleep penalty (n cycles).
    >>> 'sleep' might be a poor word choice here. I meant skipping SIMD for n
    >>> number of times.
    >>>
    >> I was thinking a bit about that this morning. I wonder whether, instead
    >> of having a constantly applied heuristic like this, it might be better
    >> to do a little extra accounting in the first, say, 1000 lines of an
    >> input file, and if less than some portion of the input is found to be
    >> special characters then switch to the SIMD code. What that
    >> portion should be would need to be determined by some experimentation
    >> with a variety of typical workloads, but given your findings 20% seems
    >> like a good starting point.
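A minimal sketch of the sampling idea described above, not code from any posted patch. The names `CopySample`, `sample_line`, `SAMPLE_LINES` and `SPECIAL_PCT_LIMIT` are hypothetical, and the set of "special" characters is passed in by the caller because the message does not spell it out. The 1000-line window and the 20% threshold are the starting points suggested in the text.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SAMPLE_LINES      1000  /* sample window suggested in the message */
#define SPECIAL_PCT_LIMIT 20    /* suggested starting threshold, in percent */

/* Hypothetical per-COPY accounting state; zero-initialize before use. */
typedef struct CopySample
{
    size_t lines_seen;
    size_t total_chars;
    size_t special_chars;
    bool   decided;     /* true once the sample window is complete */
    bool   use_simd;    /* valid only when decided is true */
} CopySample;

/*
 * Account for one input line. "specials" is the caller-supplied set of
 * characters treated as special (an assumption: e.g. delimiter, backslash,
 * newline). After SAMPLE_LINES lines the decision is made once and kept.
 */
static void
sample_line(CopySample *s, const char *line, size_t len,
            const char *specials, size_t nspecials)
{
    if (s->decided)
        return;

    for (size_t i = 0; i < len; i++)
    {
        if (memchr(specials, line[i], nspecials) != NULL)
            s->special_chars++;
    }
    s->total_chars += len;
    s->lines_seen++;

    if (s->lines_seen >= SAMPLE_LINES)
    {
        s->decided = true;
        /* special / total < 20%, written without division */
        s->use_simd = s->special_chars * 100 <
                      s->total_chars * SPECIAL_PCT_LIMIT;
    }
}
```

With this sketch, a file whose sampled lines are mostly `\N` fields would stay on the scalar path for the rest of the load, while wide text fields with few delimiters would switch to SIMD.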
    > I implemented a heuristic similar to this. It is a mix of the
    > previous heuristic and your idea, and it works like this:
    >
    > Overall logic is that we will not run SIMD for the entire line and we
    > decide if it is worth it to run SIMD for the next lines.
    >
    > 1 - We will try SIMD and decide if it is worth it to run SIMD.
    > 1.1 - If it is worth it, we will continue to run SIMD and we will
    > halve the simd_last_sleep_cycle variable.
    > 1.2 - If it is not worth it, we will double the simd_last_sleep_cycle
    > and we will not run SIMD for that many lines.
    > 1.3 - After skipping simd_last_sleep_cycle lines, we will go back to #1.
    > Note: simd_last_sleep_cycle cannot exceed 1024, so we will try SIMD at
    > least once every 1024 lines.
    >
    > With this heuristic the regression is limited to 2% in the worst case.
    >
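The back-off steps quoted above can be sketched roughly as follows. This is an illustration under stated assumptions, not the patch itself: `SimdBackoff`, `simd_should_try` and `simd_report` are hypothetical names, `last_sleep_cycle` stands in for `simd_last_sleep_cycle`, and the "worth it" test is left to the caller because the message does not define it. Starting the doubling from 1 when the counter is 0 is also an assumption.

```c
#include <stdbool.h>

#define SIMD_SLEEP_MAX 1024     /* cap stated in the message */

/* Hypothetical state; zero-initialize so that SIMD is tried first. */
typedef struct SimdBackoff
{
    int last_sleep_cycle;   /* stand-in for simd_last_sleep_cycle */
    int lines_to_skip;      /* lines left before SIMD is tried again */
} SimdBackoff;

/* Step 1 / 1.3: should the next line be attempted with SIMD? */
static bool
simd_should_try(SimdBackoff *st)
{
    if (st->lines_to_skip > 0)
    {
        st->lines_to_skip--;    /* still backing off: use the scalar path */
        return false;
    }
    return true;
}

/* Steps 1.1 / 1.2: record whether the SIMD attempt on a line paid off. */
static void
simd_report(SimdBackoff *st, bool worth_it)
{
    if (worth_it)
    {
        /* 1.1: keep running SIMD and halve the back-off */
        st->last_sleep_cycle /= 2;
        return;
    }

    /* 1.2: double the back-off (assumed to start at 1), capped at 1024 */
    if (st->last_sleep_cycle == 0)
        st->last_sleep_cycle = 1;
    else
        st->last_sleep_cycle *= 2;
    if (st->last_sleep_cycle > SIMD_SLEEP_MAX)
        st->last_sleep_cycle = SIMD_SLEEP_MAX;

    st->lines_to_skip = st->last_sleep_cycle;
}
```

A caller would ask `simd_should_try()` once per line and call `simd_report()` only for lines where SIMD was actually attempted. On input where SIMD never pays off, the sketch settles at one SIMD attempt per 1024 skipped lines.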
    
    My worry is that the worst case is actually quite common. Sparse data 
    sets dominated by a lot of null values (and hence lots of special 
    characters) are very common. Are people prepared to accept a 2% 
    regression on load times for such data sets?
    
    
    cheers
    
    
    andrew
    
    --
    Andrew Dunstan
    EDB: https://www.enterprisedb.com