Thread

  1. Inserting heap tuples in bulk in COPY

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-12T19:16:50Z

    COPY is slow. Let's make it faster. One obvious optimization is to 
    insert heap tuples in bigger chunks, instead of calling heap_insert() 
    separately for every tuple. That saves the overhead of pinning and 
    locking the buffer for every tuple, and you only need to write one WAL 
    record for all the tuples written to the same page, instead of one for 
    each tuple.
    
    Attached is a WIP patch to do that. It adds a new function, 
    heap_multi_insert, which does the same thing as heap_insert, but works 
    in bulk. It takes an array of tuples as argument, and tries to cram as 
    many of them into the chosen targe page as it can, and only writes a 
    single WAL record of the operation.
    
    This gives a significant speedup to COPY, particularly for narrow 
    tables, with small tuples. Grouping multiple tuples into one WAL record 
    reduces the WAL volume significantly, and the time spent in writing that 
    WAL. The reduced overhead of repeatedly locking the buffer is also most 
    noticeable on narrow tables. On wider tables, the effects are smaller. 
    See copytest-results.txt, containing test results with three tables of 
    different widths. The scripts used to get those numbers are also attached.
    
    Triggers complicate this. I believe it is only safe to group tuples 
    together like this if the table has no triggers. A BEFORE ROW trigger 
    might run a SELECT on the table being copied to, and check if some of 
    the tuples we're about to insert exist. If we run BEFORE ROW triggers 
    for a bunch of tuples first, and only then insert them, none of the 
    trigger invocations will see the other rows as inserted yet. Similarly, 
    if we run AFTER ROW triggers after inserting a bunch of tuples, the 
    trigger for each of the insertions would see all the inserted rows. So 
    at least for now, the patch simply falls back to inserting one row at a 
    time if there are any triggers on the table.
    
    The patch is WIP, mainly because I didn't write the WAL replay routines 
    yet, but please let me know if you see any issues.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
  2. Re: Inserting heap tuples in bulk in COPY

    Gurjeet Singh <singh.gurjeet@gmail.com> — 2011-08-12T19:44:47Z

    On Fri, Aug 12, 2011 at 3:16 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > COPY is slow.
    
    
    No kidding!
    
    
    > So at least for now, the patch simply falls back to inserting one row at a
    > time if there are any triggers on the table.
    >
    
    Maybe we want to change that to "fall back to old ways if there are any FOR
    EACH ROW triggers", since FOR EACH STATEMENT triggers won't be bothered by
    this optimization.
    
    -- 
    Gurjeet Singh
    EnterpriseDB Corporation
    The Enterprise PostgreSQL Company
    
  3. Re: Inserting heap tuples in bulk in COPY

    Florian G. Pflug <fgp@phlo.org> — 2011-08-12T19:57:11Z

    On Aug12, 2011, at 21:16 , Heikki Linnakangas wrote:
    > Triggers complicate this. I believe it is only safe to group tuples together like this if the table has no triggers. A BEFORE ROW trigger might run a SELECT on the table being copied to, and check if some of the tuples we're about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples first, and only then insert them, none of the trigger invocations will see the other rows as inserted yet. Similarly, if we run AFTER ROW triggers after inserting a bunch of tuples, the trigger for each of the insertions would see all the inserted rows.
    
    Don't we run AFTER ROW triggers after inserting *all* the tuples anyway? At least this is what we do in the case of INSERT/UPDATE/DELETE if I'm not mistaken.
    
    best regards,
    Florian Pflug
    
    
    
  4. Re: Inserting heap tuples in bulk in COPY

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-12T20:11:05Z

    On 12.08.2011 22:57, Florian Pflug wrote:
    > On Aug12, 2011, at 21:16 , Heikki Linnakangas wrote:
    >> Triggers complicate this. I believe it is only safe to group tuples together like this if the table has no triggers. A BEFORE ROW trigger might run a SELECT on the table being copied to, and check if some of the tuples we're about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples first, and only then insert them, none of the trigger invocations will see the other rows as inserted yet. Similarly, if we run AFTER ROW triggers after inserting a bunch of tuples, the trigger for each of the insertions would see all the inserted rows.
    >
    > Don't we run AFTER ROW triggers after inserting *all* the tuples anyway? At least this is what we do in the case of INSERT/UPDATE/DELETE if I'm not mistaken.
    
    Um, yes, you're right. Now I feel silly. The above still applies to 
    BEFORE ROW triggers, though.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  5. Re: Inserting heap tuples in bulk in COPY

    Robert Haas <robertmhaas@gmail.com> — 2011-08-12T20:57:36Z

    On Fri, Aug 12, 2011 at 3:16 PM, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    > COPY is slow. Let's make it faster. One obvious optimization is to insert
    > heap tuples in bigger chunks, instead of calling heap_insert() separately
    > for every tuple. That saves the overhead of pinning and locking the buffer
    > for every tuple, and you only need to write one WAL record for all the
    > tuples written to the same page, instead of one for each tuple.
    >
    > Attached is a WIP patch to do that. It adds a new function,
    > heap_multi_insert, which does the same thing as heap_insert, but works in
    > bulk. It takes an array of tuples as argument, and tries to cram as many of
    > them into the chosen targe page as it can, and only writes a single WAL
    > record of the operation.
    >
    > This gives a significant speedup to COPY, particularly for narrow tables,
    > with small tuples. Grouping multiple tuples into one WAL record reduces the
    > WAL volume significantly, and the time spent in writing that WAL. The
    > reduced overhead of repeatedly locking the buffer is also most noticeable on
    > narrow tables. On wider tables, the effects are smaller. See
    > copytest-results.txt, containing test results with three tables of different
    > widths. The scripts used to get those numbers are also attached.
    >
    > Triggers complicate this. I believe it is only safe to group tuples together
    > like this if the table has no triggers. A BEFORE ROW trigger might run a
    > SELECT on the table being copied to, and check if some of the tuples we're
    > about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples
    > first, and only then insert them, none of the trigger invocations will see
    > the other rows as inserted yet. Similarly, if we run AFTER ROW triggers
    > after inserting a bunch of tuples, the trigger for each of the insertions
    > would see all the inserted rows. So at least for now, the patch simply falls
    > back to inserting one row at a time if there are any triggers on the table.
    >
    > The patch is WIP, mainly because I didn't write the WAL replay routines yet,
    > but please let me know if you see any issues.
    
    I thought about trying to do this at one point in the past, but I
    couldn't figure out exactly how to make it work.  I think the approach
    you've taken here is good.
    
    Aside from the point already raised about needing to worry only about
    BEFORE ROW triggers, I don't see any showstoppers.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  6. Re: Inserting heap tuples in bulk in COPY

    Andrew Dunstan <andrew@dunslane.net> — 2011-08-12T21:10:11Z

    
    On 08/12/2011 04:57 PM, Robert Haas wrote:
    > I thought about trying to do this at one point in the past, but I
    > couldn't figure out exactly how to make it work.  I think the approach
    > you've taken here is good.
    >
    > Aside from the point already raised about needing to worry only about
    > BEFORE ROW triggers, I don't see any showstoppers.
    
    
    Yeah, this looks very promising indeed. Well done. In fact, I'm asking 
    myself how hard it will be to backport for one particular customer of 
    ours, for whom the WAL load is a major hotspot.
    
    cheers
    
    andrew
    
    
  7. Re: Inserting heap tuples in bulk in COPY

    Simon Riggs <simon@2ndquadrant.com> — 2011-08-12T21:17:07Z

    On Fri, Aug 12, 2011 at 8:16 PM, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    
    > COPY is slow. Let's make it faster. One obvious optimization is to insert
    > heap tuples in bigger chunks, instead of calling heap_insert() separately
    > for every tuple. That saves the overhead of pinning and locking the buffer
    > for every tuple, and you only need to write one WAL record for all the
    > tuples written to the same page, instead of one for each tuple.
    
    We don't pin the buffer for every tuple, that optimisation is already done...
    
    When we discussed this before you said that it wasn't worth trying to
    do this additional work - it was certainly a smaller gain than the one
    we achieved by removing the pinning overhead.
    
    Also, we discussed that you would work on buffering the index inserts,
    which is where the main problem lies. The main heap is only a small
    part of the overhead if we have multiple indexes already built on a
    table - which is the use case that causes the most problem.
    
    So I'm a little surprised to see you working on this and I'm guessing
    that the COPY improvement with indexes is barely noticeable. This
    would be a nice improvement, but not until the bulk index inserts are
    done.
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  8. Re: Inserting heap tuples in bulk in COPY

    Merlin Moncure <mmoncure@gmail.com> — 2011-08-12T21:26:03Z

    On Fri, Aug 12, 2011 at 2:16 PM, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    > COPY is slow. Let's make it faster. One obvious optimization is to insert
    > heap tuples in bigger chunks, instead of calling heap_insert() separately
    > for every tuple. That saves the overhead of pinning and locking the buffer
    > for every tuple, and you only need to write one WAL record for all the
    > tuples written to the same page, instead of one for each tuple.
    >
    > Attached is a WIP patch to do that. It adds a new function,
    > heap_multi_insert, which does the same thing as heap_insert, but works in
    > bulk. It takes an array of tuples as argument, and tries to cram as many of
    > them into the chosen targe page as it can, and only writes a single WAL
    > record of the operation.
    >
    > This gives a significant speedup to COPY, particularly for narrow tables,
    > with small tuples. Grouping multiple tuples into one WAL record reduces the
    > WAL volume significantly, and the time spent in writing that WAL. The
    > reduced overhead of repeatedly locking the buffer is also most noticeable on
    > narrow tables. On wider tables, the effects are smaller. See
    > copytest-results.txt, containing test results with three tables of different
    > widths. The scripts used to get those numbers are also attached.
    >
    > Triggers complicate this. I believe it is only safe to group tuples together
    > like this if the table has no triggers. A BEFORE ROW trigger might run a
    > SELECT on the table being copied to, and check if some of the tuples we're
    > about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples
    > first, and only then insert them, none of the trigger invocations will see
    > the other rows as inserted yet. Similarly, if we run AFTER ROW triggers
    > after inserting a bunch of tuples, the trigger for each of the insertions
    > would see all the inserted rows. So at least for now, the patch simply falls
    > back to inserting one row at a time if there are any triggers on the table.
    
    But generic RI triggers would be ok, right?
    
    merlin
    
    
  9. Re: Inserting heap tuples in bulk in COPY

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-12T21:59:13Z

    On 13.08.2011 00:17, Simon Riggs wrote:
    > Also, we discussed that you would work on buffering the index inserts,
    > which is where the main problem lies. The main heap is only a small
    > part of the overhead if we have multiple indexes already built on a
    > table - which is the use case that causes the most problem.
    
    Sure, if you have indexes on the table already, then the overhead of 
    updating them is more significant. I am actually working on that too, I 
    will make a separate post about that.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  10. Re: Inserting heap tuples in bulk in COPY

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-12T22:19:35Z

    On 13.08.2011 00:26, Merlin Moncure wrote:
    > On Fri, Aug 12, 2011 at 2:16 PM, Heikki Linnakangas
    > <heikki.linnakangas@enterprisedb.com>  wrote:
    >> Triggers complicate this. I believe it is only safe to group tuples together
    >> like this if the table has no triggers. A BEFORE ROW trigger might run a
    >> SELECT on the table being copied to, and check if some of the tuples we're
    >> about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples
    >> first, and only then insert them, none of the trigger invocations will see
    >> the other rows as inserted yet. Similarly, if we run AFTER ROW triggers
    >> after inserting a bunch of tuples, the trigger for each of the insertions
    >> would see all the inserted rows. So at least for now, the patch simply falls
    >> back to inserting one row at a time if there are any triggers on the table.
    >
    > But generic RI triggers would be ok, right?
    
    RI triggers are AFTER ROW triggers, which we concluded to be OK after 
    all, so they would be ok.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  11. Re: Inserting heap tuples in bulk in COPY

    Dean Rasheed <dean.a.rasheed@gmail.com> — 2011-08-13T09:01:57Z

    On 12 August 2011 23:19, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    >>> Triggers complicate this. I believe it is only safe to group tuples
    >>> together
    >>> like this if the table has no triggers. A BEFORE ROW trigger might run a
    >>> SELECT on the table being copied to, and check if some of the tuples
    >>> we're
    >>> about to insert exist. If we run BEFORE ROW triggers for a bunch of
    >>> tuples
    >>> first, and only then insert them, none of the trigger invocations will
    >>> see
    >>> the other rows as inserted yet. Similarly, if we run AFTER ROW triggers
    >>> after inserting a bunch of tuples, the trigger for each of the insertions
    >>> would see all the inserted rows. So at least for now, the patch simply
    >>> falls
    >>> back to inserting one row at a time if there are any triggers on the
    >>> table.
    
    I guess DEFAULT values could also suffer from a similar problem to
    BEFORE ROW triggers though (in theory a DEFAULT could be based on a
    function that SELECTs from the table being copied to).
    
    Regards,
    Dean
    
    
  12. Re: Inserting heap tuples in bulk in COPY

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-08-13T14:33:53Z

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
    > The patch is WIP, mainly because I didn't write the WAL replay routines 
    > yet, but please let me know if you see any issues.
    
    Why do you need new WAL replay routines?  Can't you just use the existing
    XLOG_HEAP_NEWPAGE support?
    
    By any large, I think we should be avoiding special-purpose WAL entries
    as much as possible.
    
    Also, I strongly object to turning regular heap_insert into a wrapper
    around some other more complicated operation.  That will likely have
    bad consequences for the performance of every other operation.
    
    			regards, tom lane
    
    
  13. Re: Inserting heap tuples in bulk in COPY

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-09-14T10:52:09Z

    On 13.08.2011 17:33, Tom Lane wrote:
    > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
    >> The patch is WIP, mainly because I didn't write the WAL replay routines
    >> yet, but please let me know if you see any issues.
    >
    > Why do you need new WAL replay routines?  Can't you just use the existing
    > XLOG_HEAP_NEWPAGE support?
    >
    > By any large, I think we should be avoiding special-purpose WAL entries
    > as much as possible.
    
    I tried that, but most of the reduction in WAL-size melts away with 
    that. And if the page you're copying to is not empty, logging the whole 
    page is even more expensive. You'd need to fall back to retail inserts 
    in that case which complicates the logic.
    
    > Also, I strongly object to turning regular heap_insert into a wrapper
    > around some other more complicated operation.  That will likely have
    > bad consequences for the performance of every other operation.
    
    Ok. I doubt it makes any noticeable difference for performance, but 
    having done that, it's more readable, too, to duplicate the code.
    
    Attached is a new version of the patch. It is now complete, including 
    WAL replay code.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
  14. Re: Inserting heap tuples in bulk in COPY

    Kohei KaiGai <kaigai@kaigai.gr.jp> — 2011-09-25T08:43:13Z

    Hi Heikki,
    
    I checked your patch, then I have a comment and two questions here.
    
    The heap_prepare_insert() seems a duplication of code with earlier
    half of existing heap_insert(). I think it is a good question to
    consolidate these portion of the code.
    
    I'm not clear the reason why the argument of
    CheckForSerializableConflictIn() was
    changed from the one in heap_insert(). Its source code comment describes as:
            :
         * For a heap insert, we only need to check for table-level SSI locks.
         * Our new tuple can't possibly conflict with existing tuple locks, and
         * heap page locks are only consolidated versions of tuple locks; they do
         * not lock "gaps" as index page locks do.  So we don't need to identify
         * a buffer before making the call.
         */
    Is it feasible that newly inserted tuples conflict with existing tuple
    locks when
    we expand insert to support multiple tuples at once?
    It is a bit confusing for me.
    
    This patch disallow multiple-insertion not only when before-row
    trigger is configured,
    but volatile functions are used to compute a default value also.
    Is it a reasonable condition to avoid multiple-insertion?
    All the datums to be delivered to heap_form_tuple() is calculated in
    NextCopyFrom,
    and default values are also constructed here; sequentially.
    So, it seems to me we have no worry about volatile functions are not
    invoked toward
    each tuples. Or, am I missing something?
    
    Thanks,
    
    2011/9/14 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
    > On 13.08.2011 17:33, Tom Lane wrote:
    >>
    >> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
    >>>
    >>> The patch is WIP, mainly because I didn't write the WAL replay routines
    >>> yet, but please let me know if you see any issues.
    >>
    >> Why do you need new WAL replay routines?  Can't you just use the existing
    >> XLOG_HEAP_NEWPAGE support?
    >>
    >> By any large, I think we should be avoiding special-purpose WAL entries
    >> as much as possible.
    >
    > I tried that, but most of the reduction in WAL-size melts away with that.
    > And if the page you're copying to is not empty, logging the whole page is
    > even more expensive. You'd need to fall back to retail inserts in that case
    > which complicates the logic.
    >
    >> Also, I strongly object to turning regular heap_insert into a wrapper
    >> around some other more complicated operation.  That will likely have
    >> bad consequences for the performance of every other operation.
    >
    > Ok. I doubt it makes any noticeable difference for performance, but having
    > done that, it's more readable, too, to duplicate the code.
    >
    > Attached is a new version of the patch. It is now complete, including WAL
    > replay code.
    >
    > --
    >  Heikki Linnakangas
    >  EnterpriseDB   http://www.enterprisedb.com
    >
    >
    > --
    > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
    > To make changes to your subscription:
    > http://www.postgresql.org/mailpref/pgsql-hackers
    >
    >
    
    
    
    -- 
    KaiGai Kohei <kaigai@kaigai.gr.jp>
    
    
  15. Re: Inserting heap tuples in bulk in COPY

    Dean Rasheed <dean.a.rasheed@gmail.com> — 2011-09-25T13:03:19Z

    On 25 September 2011 09:43, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
    > Hi Heikki,
    >
    > I checked your patch, then I have a comment and two questions here.
    >
    > 2011/9/14 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
    >>
    >> Attached is a new version of the patch. It is now complete, including WAL
    >> replay code.
    
    Hi,
    
    I had a quick look at the patch as well and spotted an oversight: the
    multi-insert code branch fails to invoke AFTER ROW triggers.
    
    Regards,
    Dean
    
    
  16. Re: Inserting heap tuples in bulk in COPY

    Robert Haas <robertmhaas@gmail.com> — 2011-09-25T16:01:36Z

    On Wed, Sep 14, 2011 at 6:52 AM, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    >> Why do you need new WAL replay routines?  Can't you just use the existing
    >> XLOG_HEAP_NEWPAGE support?
    >>
    >> By any large, I think we should be avoiding special-purpose WAL entries
    >> as much as possible.
    >
    > I tried that, but most of the reduction in WAL-size melts away with that.
    > And if the page you're copying to is not empty, logging the whole page is
    > even more expensive. You'd need to fall back to retail inserts in that case
    > which complicates the logic.
    
    Where does it go?  I understand why it'd be a problem for partially
    filled pages, but it seems like it ought to be efficient for pages
    that are initially empty.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  17. Re: Inserting heap tuples in bulk in COPY

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-10-06T11:33:24Z

    On 25.09.2011 19:01, Robert Haas wrote:
    > On Wed, Sep 14, 2011 at 6:52 AM, Heikki Linnakangas
    > <heikki.linnakangas@enterprisedb.com>  wrote:
    >>> Why do you need new WAL replay routines?  Can't you just use the existing
    >>> XLOG_HEAP_NEWPAGE support?
    >>>
    >>> By any large, I think we should be avoiding special-purpose WAL entries
    >>> as much as possible.
    >>
    >> I tried that, but most of the reduction in WAL-size melts away with that.
    >> And if the page you're copying to is not empty, logging the whole page is
    >> even more expensive. You'd need to fall back to retail inserts in that case
    >> which complicates the logic.
    >
    > Where does it go?  I understand why it'd be a problem for partially
    > filled pages, but it seems like it ought to be efficient for pages
    > that are initially empty.
    
    A regular heap_insert record leaves out a lot of information that can be 
    deduced at replay time. It can leave out all the headers, including just 
    the null bitmap + data. In addition to that, there's just the location 
    of the tuple (RelFileNode+ItemPointer). At replay, xmin is taken from 
    the WAL record header.
    
    For a multi-insert record, you don't even need to store the RelFileNode 
    and the block number for every tuple, just the offsets.
    
    In comparison, a full-page image will include the full tuple header, and 
    also the line pointers. If I'm doing my math right, a full-page image 
    takes 25 bytes more data per tuple, than the special-purpose 
    multi-insert record.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  18. Re: Inserting heap tuples in bulk in COPY

    Robert Haas <robertmhaas@gmail.com> — 2011-10-06T12:11:52Z

    On Thu, Oct 6, 2011 at 7:33 AM, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    > A regular heap_insert record leaves out a lot of information that can be
    > deduced at replay time. It can leave out all the headers, including just the
    > null bitmap + data. In addition to that, there's just the location of the
    > tuple (RelFileNode+ItemPointer). At replay, xmin is taken from the WAL
    > record header.
    >
    > For a multi-insert record, you don't even need to store the RelFileNode and
    > the block number for every tuple, just the offsets.
    >
    > In comparison, a full-page image will include the full tuple header, and
    > also the line pointers. If I'm doing my math right, a full-page image takes
    > 25 bytes more data per tuple, than the special-purpose multi-insert record.
    
    Interesting.  It's always seemed to me fairly inefficient in general
    to store the whole RelFileNode.  For many people, the database and
    tablespace OID will be constants, and even if they aren't, there
    certainly aren't going to be 96 bits of entropy in the relfilenode.  I
    thought about whether we could create some sort of mapping layer,
    where say once per checkpoint we'd allocate a 4-byte integer to denote
    a relfilenode, and WAL-log that mapping.  Then after that everyone
    could just refer to the 4-byte integer instead of the whole
    relfilenode.  But it seems like a lot of work for 8 bytes per record.
    Then again, if you're getting that much benefit from shaving off 25
    bytes per tuple, maybe it is, although I feel like FPW is the elephant
    in the room.
    
    But I digress...
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  19. Re: Inserting heap tuples in bulk in COPY

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-10-06T14:24:07Z

    On 06.10.2011 15:11, Robert Haas wrote:
    > On Thu, Oct 6, 2011 at 7:33 AM, Heikki Linnakangas
    > <heikki.linnakangas@enterprisedb.com>  wrote:
    >> A regular heap_insert record leaves out a lot of information that can be
    >> deduced at replay time. It can leave out all the headers, including just the
    >> null bitmap + data. In addition to that, there's just the location of the
    >> tuple (RelFileNode+ItemPointer). At replay, xmin is taken from the WAL
    >> record header.
    >>
    >> For a multi-insert record, you don't even need to store the RelFileNode and
    >> the block number for every tuple, just the offsets.
    >>
    >> In comparison, a full-page image will include the full tuple header, and
    >> also the line pointers. If I'm doing my math right, a full-page image takes
    >> 25 bytes more data per tuple, than the special-purpose multi-insert record.
    >
    > Interesting.  It's always seemed to me fairly inefficient in general
    > to store the whole RelFileNode.  For many people, the database and
    > tablespace OID will be constants, and even if they aren't, there
    > certainly aren't going to be 96 bits of entropy in the relfilenode.  I
    > thought about whether we could create some sort of mapping layer,
    > where say once per checkpoint we'd allocate a 4-byte integer to denote
    > a relfilenode, and WAL-log that mapping.  Then after that everyone
    > could just refer to the 4-byte integer instead of the whole
    > relfilenode.  But it seems like a lot of work for 8 bytes per record.
    > Then again, if you're getting that much benefit from shaving off 25
    > bytes per tuple, maybe it is, although I feel like FPW is the elephant
    > in the room.
    
    A very simple optimization would be to leave out tablespace OID 
    altogether if it's DEFAULTTABLESPACE_OID, and just set a flag somewhere. 
    Then again, we could also just compress the WAL wholesale.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  20. Re: Inserting heap tuples in bulk in COPY

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-10-24T14:46:21Z

    On 25.09.2011 16:03, Dean Rasheed wrote:
    > On 25 September 2011 09:43, Kohei KaiGai<kaigai@kaigai.gr.jp>  wrote:
    >> Hi Heikki,
    >>
    >> I checked your patch, then I have a comment and two questions here.
    >>
    >> 2011/9/14 Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>:
    >>>
    >>> Attached is a new version of the patch. It is now complete, including WAL
    >>> replay code.
    >
    > Hi,
    >
    > I had a quick look at the patch as well and spotted an oversight: the
    > multi-insert code branch fails to invoke AFTER ROW triggers.
    
    Thanks! Here's an updated version of the patch, fixing that, and all the 
    other issues pointed out this far.
    
    I extracted the code that sets oid and tuple headers, and invokes the 
    toaster, into a new function that's shared by heap_insert() and 
    heap_multi_insert(). Tom objected to merging heap_insert() and 
    heap_multi_insert() into one complicated function, and I think he was 
    right on that, but sharing this code to prepare a tuple still makes 
    sense. IMHO it makes heap_insert() slightly more readable too.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
  21. Re: Inserting heap tuples in bulk in COPY

    Jeff Janes <jeff.janes@gmail.com> — 2011-11-25T20:53:19Z

    On Mon, Oct 24, 2011 at 7:46 AM, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    >
    > Thanks! Here's an updated version of the patch, fixing that, and all the
    > other issues pointed out this far.
    >
    > I extracted the code that sets oid and tuple headers, and invokes the
    > toaster, into a new function that's shared by heap_insert() and
    > heap_multi_insert(). Tom objected to merging heap_insert() and
    > heap_multi_insert() into one complicated function, and I think he was right
    > on that, but sharing this code to prepare a tuple still makes sense. IMHO it
    > makes heap_insert() slightly more readable too.
    
    Hi Heikki,
    
    Thanks for this patch.  Doing bulk copies in parallel for me is now
    limited by the IO subsystem rather than the CPU.
    
    This patch, commit number d326d9e8ea1d69, causes fillfactor to be
    ignored for the copy command.  Is this acceptable collateral damage?
    
    This can be seen by using "pgbench -i -s50 -F50" to create the table
    combined with and select
    pg_size_pretty(pg_table_size('pgbench_accounts')), or by using the
    relation_free_space extension to pageinspect proposed elsewhere.
    
    Thanks,
    
    Jeff
    
    
  22. Re: Inserting heap tuples in bulk in COPY

    Jeff Janes <jeff.janes@gmail.com> — 2011-11-25T21:32:15Z

    On Fri, Nov 25, 2011 at 12:53 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
    
    > Hi Heikki,
    >
    > Thanks for this patch.  Doing bulk copies in parallel for me is now
    > limited by the IO subsystem rather than the CPU.
    >
    > This patch, commit number d326d9e8ea1d69, causes fillfactor to be
    > ignored for the copy command.  Is this acceptable collateral damage?
    
    Having looked into it a little bit, I think this might be an acceptable fix.
    
    Cheers,
    
    Jeff
    
  23. Re: Inserting heap tuples in bulk in COPY

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-11-26T10:16:27Z

    On 25.11.2011 23:32, Jeff Janes wrote:
    > On Fri, Nov 25, 2011 at 12:53 PM, Jeff Janes<jeff.janes@gmail.com>  wrote:
    >
    >> Thanks for this patch.  Doing bulk copies in parallel for me is now
    >> limited by the IO subsystem rather than the CPU.
    >>
    >> This patch, commit number d326d9e8ea1d69, causes fillfactor to be
    >> ignored for the copy command.  Is this acceptable collateral damage?
    >
    > Having looked into it a little bit, I think this might be an acceptable fix.
    
    Thanks, applied. It was an oversight.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  24. Re: Inserting heap tuples in bulk in COPY

    Jeff Janes <jeff.janes@gmail.com> — 2012-08-07T19:58:03Z

    On Fri, Aug 12, 2011 at 2:59 PM, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    > On 13.08.2011 00:17, Simon Riggs wrote:
    >>
    >> Also, we discussed that you would work on buffering the index inserts,
    >> which is where the main problem lies. The main heap is only a small
    >> part of the overhead if we have multiple indexes already built on a
    >> table - which is the use case that causes the most problem.
    >
    >
    > Sure, if you have indexes on the table already, then the overhead of
    > updating them is more significant. I am actually working on that too, I will
    > make a separate post about that.
    
    Hi Heikki,
    
    Is the bulk index insert still an active area for you?
    
    If not, is there some kind of summary of design or analysis work
    already done, which would be a useful for someone else interested in
    picking it up?
    
    Thanks,
    
    Jeff
    
    
  25. Re: Inserting heap tuples in bulk in COPY

    Simon Riggs <simon@2ndquadrant.com> — 2012-08-07T20:52:25Z

    On 7 August 2012 20:58, Jeff Janes <jeff.janes@gmail.com> wrote:
    > On Fri, Aug 12, 2011 at 2:59 PM, Heikki Linnakangas
    > <heikki.linnakangas@enterprisedb.com> wrote:
    >> On 13.08.2011 00:17, Simon Riggs wrote:
    >>>
    >>> Also, we discussed that you would work on buffering the index inserts,
    >>> which is where the main problem lies. The main heap is only a small
    >>> part of the overhead if we have multiple indexes already built on a
    >>> table - which is the use case that causes the most problem.
    >>
    >>
    >> Sure, if you have indexes on the table already, then the overhead of
    >> updating them is more significant. I am actually working on that too, I will
    >> make a separate post about that.
    >
    > Hi Heikki,
    >
    > Is the bulk index insert still an active area for you?
    >
    > If not, is there some kind of summary of design or analysis work
    > already done, which would be a useful for someone else interested in
    > picking it up?
    
    The main cost comes from repeated re-seeking down the index tree to
    find the insertion point, but repeated lock and pin operations on
    index buffers are also high. That is optimisable if the index inserts
    are grouped, as they will be with monotonic indexes, (e.g. SERIAL), or
    with partial monotonic (i.e. with Parent/Child relationship, on Child
    table many inserts will be of the form (x,1), (x,2), (x, 3) etc, e.g.
    Order/OrderLine).
    
    All we need do is buffer the inserts in the inserts, before inserting
    them into the main index. As long as we flush the buffer before end of
    transaction, we're good.
    
    Incidentally, we can also optimise repeated inserts within a normal
    transaction using this method, by implementing deferred unique
    constraints. At present we say that unique constraints aren't
    deferrable, but there's no reason they can't be, if we allow buffering
    in the index. (Implementing a deferred constraint that won't fail if
    we do UPDATE foo SET pk = pk+1 requires a buffer the size of foo,
    which is probably a bad plan, plus you'd need to sort the inputs, so
    that particular nut is another issue altogether, AFAICS).
    
    Suggested implementation is to buffer index tuples at the generic
    index API, then implement a bulk insert index API call that can then
    be implemented for each AM separately. Suggested buffer size is
    work_mem. Note we must extract index tuple from heap tuples -
    refetching heap blocks to get rows is too costly.
    
    I think we need to implement buffering both to end of statement or end
    of transaction, not just one or the other.
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  26. Re: Inserting heap tuples in bulk in COPY

    Jeff Janes <jeff.janes@gmail.com> — 2012-08-08T02:44:47Z

    On Tue, Aug 7, 2012 at 1:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
    > On 7 August 2012 20:58, Jeff Janes <jeff.janes@gmail.com> wrote:
    >> Hi Heikki,
    >>
    >> Is the bulk index insert still an active area for you?
    >>
    >> If not, is there some kind of summary of design or analysis work
    >> already done, which would be a useful for someone else interested in
    >> picking it up?
    >
    > The main cost comes from repeated re-seeking down the index tree to
    > find the insertion point, but repeated lock and pin operations on
    > index buffers are also high. That is optimisable if the index inserts
    > are grouped, as they will be with monotonic indexes, (e.g. SERIAL), or
    > with partial monotonic (i.e. with Parent/Child relationship, on Child
    > table many inserts will be of the form (x,1), (x,2), (x, 3) etc, e.g.
    > Order/OrderLine).
    >
    > All we need do is buffer the inserts in the inserts, before inserting
    > them into the main index. As long as we flush the buffer before end of
    > transaction, we're good.
    >
    > Incidentally, we can also optimise repeated inserts within a normal
    > transaction using this method, by implementing deferred unique
    > constraints. At present we say that unique constraints aren't
    > deferrable, but there's no reason they can't be, if we allow buffering
    > in the index. (Implementing a deferred constraint that won't fail if
    > we do UPDATE foo SET pk = pk+1 requires a buffer the size of foo,
    > which is probably a bad plan, plus you'd need to sort the inputs, so
    > that particular nut is another issue altogether, AFAICS).
    >
    > Suggested implementation is to buffer index tuples at the generic
    > index API, then implement a bulk insert index API call that can then
    > be implemented for each AM separately. Suggested buffer size is
    > work_mem. Note we must extract index tuple from heap tuples -
    > refetching heap blocks to get rows is too costly.
    
    OK, thanks for the summary.  It looks like your plans are to improve
    the case where the index maintenance is already CPU limited.  I was
    more interested in cases where the regions of the index(es) undergoing
    active insertion do not fit into usable RAM, so the limit is the
    random IO needed to fetch the leaf pages in order to update them or to
    write them out once dirtied.  Here too buffering is probably the
    answer, but the size of the buffer needed to turn that IO from
    effectively random to effectively sequential is probably much larger
    than the size of the buffer you are contemplating.
    
    > I think we need to implement buffering both to end of statement or end
    > of transaction, not just one or the other.
    
    With the planner deciding which would be better, or explicit user action?
    
    Thanks,
    
    Jeff
    
    
  27. Re: Inserting heap tuples in bulk in COPY

    Simon Riggs <simon@2ndquadrant.com> — 2012-08-08T07:36:08Z

    On 8 August 2012 03:44, Jeff Janes <jeff.janes@gmail.com> wrote:
    > On Tue, Aug 7, 2012 at 1:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
    >> On 7 August 2012 20:58, Jeff Janes <jeff.janes@gmail.com> wrote:
    >>> Hi Heikki,
    >>>
    >>> Is the bulk index insert still an active area for you?
    >>>
    >>> If not, is there some kind of summary of design or analysis work
    >>> already done, which would be a useful for someone else interested in
    >>> picking it up?
    >>
    >> The main cost comes from repeated re-seeking down the index tree to
    >> find the insertion point, but repeated lock and pin operations on
    >> index buffers are also high. That is optimisable if the index inserts
    >> are grouped, as they will be with monotonic indexes, (e.g. SERIAL), or
    >> with partial monotonic (i.e. with Parent/Child relationship, on Child
    >> table many inserts will be of the form (x,1), (x,2), (x, 3) etc, e.g.
    >> Order/OrderLine).
    >>
    >> All we need do is buffer the inserts in the inserts, before inserting
    >> them into the main index. As long as we flush the buffer before end of
    >> transaction, we're good.
    >>
    >> Incidentally, we can also optimise repeated inserts within a normal
    >> transaction using this method, by implementing deferred unique
    >> constraints. At present we say that unique constraints aren't
    >> deferrable, but there's no reason they can't be, if we allow buffering
    >> in the index. (Implementing a deferred constraint that won't fail if
    >> we do UPDATE foo SET pk = pk+1 requires a buffer the size of foo,
    >> which is probably a bad plan, plus you'd need to sort the inputs, so
    >> that particular nut is another issue altogether, AFAICS).
    >>
    >> Suggested implementation is to buffer index tuples at the generic
    >> index API, then implement a bulk insert index API call that can then
    >> be implemented for each AM separately. Suggested buffer size is
    >> work_mem. Note we must extract index tuple from heap tuples -
    >> refetching heap blocks to get rows is too costly.
    >
    > OK, thanks for the summary.  It looks like your plans are to improve
    > the case where the index maintenance is already CPU limited.  I was
    > more interested in cases where the regions of the index(es) undergoing
    > active insertion do not fit into usable RAM, so the limit is the
    > random IO needed to fetch the leaf pages in order to update them or to
    > write them out once dirtied.  Here too buffering is probably the
    > answer, but the size of the buffer needed to turn that IO from
    > effectively random to effectively sequential is probably much larger
    > than the size of the buffer you are contemplating.
    
    The buffer size can be variable, yes. I was imagining a mechanism that
    worked for normal INSERTs as well as COPY. Perhaps we could say buffer
    is work_mem with INSERT and maintenance_work_mem with COPY.
    
    Very large index appends are useful, but currently not very easily
    usable because of the transactional nature of COPY. If we could reject
    rows without ERROR it would be more practical.
    
    I'm not planning to work on this, so all comments for your assistance.
    
    >> I think we need to implement buffering both to end of statement or end
    >> of transaction, not just one or the other.
    >
    > With the planner deciding which would be better, or explicit user action?
    
    Probably both: on/off/choose.
    
    Deferring unique check would change the point at which errors were
    reported in a transaction, which might not be desirable for some. I
    think SQL standard has something to say about this also, so that needs
    care. But in general, if your tables use generated PK values they
    should be able to benefit from this, so I would suggest a default
    setting of choose.
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  28. Re: Inserting heap tuples in bulk in COPY

    Robert Haas <robertmhaas@gmail.com> — 2012-08-08T19:34:07Z

    On Tue, Aug 7, 2012 at 4:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
    > Incidentally, we can also optimise repeated inserts within a normal
    > transaction using this method, by implementing deferred unique
    > constraints. At present we say that unique constraints aren't
    > deferrable, but there's no reason they can't be, if we allow buffering
    > in the index. (Implementing a deferred constraint that won't fail if
    > we do UPDATE foo SET pk = pk+1 requires a buffer the size of foo,
    > which is probably a bad plan, plus you'd need to sort the inputs, so
    > that particular nut is another issue altogether, AFAICS).
    
    We've had deferrable unique constraints since 9.0, courtesy of Dean Rasheed.
    
    > I think we need to implement buffering both to end of statement or end
    > of transaction, not just one or the other.
    
    Another (not necessarily better) idea is to use a buffer that's part
    of the index, like the GIN fastupdate stuff, so that there's no
    particular constraint on when the buffer has to be flushed, but
    competing index scans may be slower until it is.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  29. Re: Inserting heap tuples in bulk in COPY

    Simon Riggs <simon@2ndquadrant.com> — 2012-08-08T19:54:23Z

    On 8 August 2012 20:34, Robert Haas <robertmhaas@gmail.com> wrote:
    > On Tue, Aug 7, 2012 at 4:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
    >> Incidentally, we can also optimise repeated inserts within a normal
    >> transaction using this method, by implementing deferred unique
    >> constraints. At present we say that unique constraints aren't
    >> deferrable, but there's no reason they can't be, if we allow buffering
    >> in the index. (Implementing a deferred constraint that won't fail if
    >> we do UPDATE foo SET pk = pk+1 requires a buffer the size of foo,
    >> which is probably a bad plan, plus you'd need to sort the inputs, so
    >> that particular nut is another issue altogether, AFAICS).
    >
    > We've had deferrable unique constraints since 9.0, courtesy of Dean Rasheed.
    
    Yeh, but IIRC there was some issue I can't seem to find detail on
    about it not working quite right in production. Not sure now.
    
    
    >> I think we need to implement buffering both to end of statement or end
    >> of transaction, not just one or the other.
    >
    > Another (not necessarily better) idea is to use a buffer that's part
    > of the index, like the GIN fastupdate stuff, so that there's no
    > particular constraint on when the buffer has to be flushed, but
    > competing index scans may be slower until it is.
    
    I think that works very well for non-unique indexes, though it does
    increase WAL traffic since you need to do a double hop into the index.
    Its not possible for unique constraints/pk indexes since they need to
    fail the transaction in case of duplicates.
    
    The buffer I was imagining would be a private buffer within a
    transaction, so wouldn't increase WAL.
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  30. Re: Inserting heap tuples in bulk in COPY

    Jesper Krogh <jesper@krogh.cc> — 2012-08-09T06:59:52Z

    On 08/08/12 21:34, Robert Haas wrote:
    >> I think we need to implement buffering both to end of statement or end
    >> of transaction, not just one or the other.
    > Another (not necessarily better) idea is to use a buffer that's part
    > of the index, like the GIN fastupdate stuff, so that there's no
    > particular constraint on when the buffer has to be flushed, but
    > competing index scans may be slower until it is.
    If it is an implementation artifact or an result of this
    approach I dont know. But currently, when the GIN fastupdate
    code finally decides to "flush" the buffer, it is going to stall all
    other processes doing updates while doing it. If you only have
    one update process then this doesn't matter. But if you're trying to get
    user-interactive-updates to flow in with batch-updates from
    background processes, then you'd better kill off this feature,
    since you're gauranteed that the user-interactive process is
    either going to flush the buffer or wait on someone else doing
    it.
    
    I havent done the benchmarking, but I'm actually fairly sure that
    fastupdate isn't overall faster if you bump concurrency slightly and run of
    memory or SSD-based backends due to this cross-backend contention
    of the buffer.
    
    A buffer that is backend local, so you can use transactions to
    batch up changes would get around this, but that may have another
    huge set of consequenses I dont know if.
    
    ... based on my own real-world experience with this feature.
    
    
  31. Re: Inserting heap tuples in bulk in COPY

    Robert Haas <robertmhaas@gmail.com> — 2012-08-09T14:11:08Z

    On Thu, Aug 9, 2012 at 2:59 AM, Jesper Krogh <jesper@krogh.cc> wrote:
    > If it is an implementation artifact or an result of this
    > approach I dont know. But currently, when the GIN fastupdate
    > code finally decides to "flush" the buffer, it is going to stall all
    > other processes doing updates while doing it. If you only have
    > one update process then this doesn't matter. But if you're trying to get
    > user-interactive-updates to flow in with batch-updates from
    > background processes, then you'd better kill off this feature,
    > since you're gauranteed that the user-interactive process is
    > either going to flush the buffer or wait on someone else doing
    > it.
    >
    > I havent done the benchmarking, but I'm actually fairly sure that
    > fastupdate isn't overall faster if you bump concurrency slightly and run of
    > memory or SSD-based backends due to this cross-backend contention
    > of the buffer.
    
    Yeah, I've noticed that there are some things that are a little wonky
    about GIN fastupdate.  On the other hand, I believe that MySQL has
    something along these lines called secondary index buffering which
    apparently does very good things for random I/O.  I am not sure of the
    details or the implementation, though.
    
    > A buffer that is backend local, so you can use transactions to
    > batch up changes would get around this, but that may have another
    > huge set of consequenses I dont know if.
    >
    > ... based on my own real-world experience with this feature.
    
    Well, the main thing to worry about is transactional consistency.  If
    a backend which has postponed doing the index-inserts does an index
    scan after the command-counter-id has been bumped, it'll see
    inconsistent results.  We could avoid that by only using the
    optimization when some set of sanity checks passes and doing the
    deferred inserts at the end of the statement, or something like that.
    
    The other tricky part is figuring out how to actually get a
    performance improvement out of it.  I think Simon's probably right
    that a lot of the cost is in repeatedly walking the btree, looking up
    and pinning/unpinning/locking/unlocking buffers along the way.  Maybe
    we could sort the data in index order, walk down to the first
    insertion point, and the insert as many tuples in a row as precede the
    next key in the index.  Then lather, rinse, repeat.  If you're
    actually just adding everything at the tail of the index, this ought
    to work pretty well.  But if the inserts are all over the place it
    seems like it might not be any better, or actually a little worse.
    
    Of course it's probably premature to speculate too much until someone
    actually codes something up and tests it.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company