Thread

Inserting heap tuples in bulk in COPY

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-12T19:16:50Z

COPY is slow. Let's make it faster. One obvious optimization is to 
insert heap tuples in bigger chunks, instead of calling heap_insert() 
separately for every tuple. That saves the overhead of pinning and 
locking the buffer for every tuple, and you only need to write one WAL 
record for all the tuples written to the same page, instead of one for 
each tuple.

Attached is a WIP patch to do that. It adds a new function, 
heap_multi_insert, which does the same thing as heap_insert, but works 
in bulk. It takes an array of tuples as argument, and tries to cram as 
many of them into the chosen targe page as it can, and only writes a 
single WAL record of the operation.

This gives a significant speedup to COPY, particularly for narrow 
tables, with small tuples. Grouping multiple tuples into one WAL record 
reduces the WAL volume significantly, and the time spent in writing that 
WAL. The reduced overhead of repeatedly locking the buffer is also most 
noticeable on narrow tables. On wider tables, the effects are smaller. 
See copytest-results.txt, containing test results with three tables of 
different widths. The scripts used to get those numbers are also attached.

Triggers complicate this. I believe it is only safe to group tuples 
together like this if the table has no triggers. A BEFORE ROW trigger 
might run a SELECT on the table being copied to, and check if some of 
the tuples we're about to insert exist. If we run BEFORE ROW triggers 
for a bunch of tuples first, and only then insert them, none of the 
trigger invocations will see the other rows as inserted yet. Similarly, 
if we run AFTER ROW triggers after inserting a bunch of tuples, the 
trigger for each of the insertions would see all the inserted rows. So 
at least for now, the patch simply falls back to inserting one row at a 
time if there are any triggers on the table.

The patch is WIP, mainly because I didn't write the WAL replay routines 
yet, but please let me know if you see any issues.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Inserting heap tuples in bulk in COPY

Gurjeet Singh <singh.gurjeet@gmail.com> — 2011-08-12T19:44:47Z

On Fri, Aug 12, 2011 at 3:16 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

> COPY is slow.

No kidding!

> So at least for now, the patch simply falls back to inserting one row at a
> time if there are any triggers on the table.
>

Maybe we want to change that to "fall back to old ways if there are any FOR
EACH ROW triggers", since FOR EACH STATEMENT triggers won't be bothered by
this optimization.

-- 
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: Inserting heap tuples in bulk in COPY

Florian G. Pflug <fgp@phlo.org> — 2011-08-12T19:57:11Z

On Aug12, 2011, at 21:16 , Heikki Linnakangas wrote:
> Triggers complicate this. I believe it is only safe to group tuples together like this if the table has no triggers. A BEFORE ROW trigger might run a SELECT on the table being copied to, and check if some of the tuples we're about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples first, and only then insert them, none of the trigger invocations will see the other rows as inserted yet. Similarly, if we run AFTER ROW triggers after inserting a bunch of tuples, the trigger for each of the insertions would see all the inserted rows.

Don't we run AFTER ROW triggers after inserting *all* the tuples anyway? At least this is what we do in the case of INSERT/UPDATE/DELETE if I'm not mistaken.

best regards,
Florian Pflug

Re: Inserting heap tuples in bulk in COPY

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-12T20:11:05Z

On 12.08.2011 22:57, Florian Pflug wrote:
> On Aug12, 2011, at 21:16 , Heikki Linnakangas wrote:
>> Triggers complicate this. I believe it is only safe to group tuples together like this if the table has no triggers. A BEFORE ROW trigger might run a SELECT on the table being copied to, and check if some of the tuples we're about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples first, and only then insert them, none of the trigger invocations will see the other rows as inserted yet. Similarly, if we run AFTER ROW triggers after inserting a bunch of tuples, the trigger for each of the insertions would see all the inserted rows.
>
> Don't we run AFTER ROW triggers after inserting *all* the tuples anyway? At least this is what we do in the case of INSERT/UPDATE/DELETE if I'm not mistaken.

Um, yes, you're right. Now I feel silly. The above still applies to 
BEFORE ROW triggers, though.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Inserting heap tuples in bulk in COPY

Robert Haas <robertmhaas@gmail.com> — 2011-08-12T20:57:36Z

On Fri, Aug 12, 2011 at 3:16 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> COPY is slow. Let's make it faster. One obvious optimization is to insert
> heap tuples in bigger chunks, instead of calling heap_insert() separately
> for every tuple. That saves the overhead of pinning and locking the buffer
> for every tuple, and you only need to write one WAL record for all the
> tuples written to the same page, instead of one for each tuple.
>
> Attached is a WIP patch to do that. It adds a new function,
> heap_multi_insert, which does the same thing as heap_insert, but works in
> bulk. It takes an array of tuples as argument, and tries to cram as many of
> them into the chosen targe page as it can, and only writes a single WAL
> record of the operation.
>
> This gives a significant speedup to COPY, particularly for narrow tables,
> with small tuples. Grouping multiple tuples into one WAL record reduces the
> WAL volume significantly, and the time spent in writing that WAL. The
> reduced overhead of repeatedly locking the buffer is also most noticeable on
> narrow tables. On wider tables, the effects are smaller. See
> copytest-results.txt, containing test results with three tables of different
> widths. The scripts used to get those numbers are also attached.
>
> Triggers complicate this. I believe it is only safe to group tuples together
> like this if the table has no triggers. A BEFORE ROW trigger might run a
> SELECT on the table being copied to, and check if some of the tuples we're
> about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples
> first, and only then insert them, none of the trigger invocations will see
> the other rows as inserted yet. Similarly, if we run AFTER ROW triggers
> after inserting a bunch of tuples, the trigger for each of the insertions
> would see all the inserted rows. So at least for now, the patch simply falls
> back to inserting one row at a time if there are any triggers on the table.
>
> The patch is WIP, mainly because I didn't write the WAL replay routines yet,
> but please let me know if you see any issues.

I thought about trying to do this at one point in the past, but I
couldn't figure out exactly how to make it work.  I think the approach
you've taken here is good.

Aside from the point already raised about needing to worry only about
BEFORE ROW triggers, I don't see any showstoppers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inserting heap tuples in bulk in COPY

Andrew Dunstan <andrew@dunslane.net> — 2011-08-12T21:10:11Z

On 08/12/2011 04:57 PM, Robert Haas wrote:
> I thought about trying to do this at one point in the past, but I
> couldn't figure out exactly how to make it work.  I think the approach
> you've taken here is good.
>
> Aside from the point already raised about needing to worry only about
> BEFORE ROW triggers, I don't see any showstoppers.

Yeah, this looks very promising indeed. Well done. In fact, I'm asking 
myself how hard it will be to backport for one particular customer of 
ours, for whom the WAL load is a major hotspot.

cheers

andrew

Re: Inserting heap tuples in bulk in COPY

Simon Riggs <simon@2ndquadrant.com> — 2011-08-12T21:17:07Z

On Fri, Aug 12, 2011 at 8:16 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

> COPY is slow. Let's make it faster. One obvious optimization is to insert
> heap tuples in bigger chunks, instead of calling heap_insert() separately
> for every tuple. That saves the overhead of pinning and locking the buffer
> for every tuple, and you only need to write one WAL record for all the
> tuples written to the same page, instead of one for each tuple.

We don't pin the buffer for every tuple, that optimisation is already done...

When we discussed this before you said that it wasn't worth trying to
do this additional work - it was certainly a smaller gain than the one
we achieved by removing the pinning overhead.

Also, we discussed that you would work on buffering the index inserts,
which is where the main problem lies. The main heap is only a small
part of the overhead if we have multiple indexes already built on a
table - which is the use case that causes the most problem.

So I'm a little surprised to see you working on this and I'm guessing
that the COPY improvement with indexes is barely noticeable. This
would be a nice improvement, but not until the bulk index inserts are
done.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Inserting heap tuples in bulk in COPY

Merlin Moncure <mmoncure@gmail.com> — 2011-08-12T21:26:03Z

On Fri, Aug 12, 2011 at 2:16 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> COPY is slow. Let's make it faster. One obvious optimization is to insert
> heap tuples in bigger chunks, instead of calling heap_insert() separately
> for every tuple. That saves the overhead of pinning and locking the buffer
> for every tuple, and you only need to write one WAL record for all the
> tuples written to the same page, instead of one for each tuple.
>
> Attached is a WIP patch to do that. It adds a new function,
> heap_multi_insert, which does the same thing as heap_insert, but works in
> bulk. It takes an array of tuples as argument, and tries to cram as many of
> them into the chosen targe page as it can, and only writes a single WAL
> record of the operation.
>
> This gives a significant speedup to COPY, particularly for narrow tables,
> with small tuples. Grouping multiple tuples into one WAL record reduces the
> WAL volume significantly, and the time spent in writing that WAL. The
> reduced overhead of repeatedly locking the buffer is also most noticeable on
> narrow tables. On wider tables, the effects are smaller. See
> copytest-results.txt, containing test results with three tables of different
> widths. The scripts used to get those numbers are also attached.
>
> Triggers complicate this. I believe it is only safe to group tuples together
> like this if the table has no triggers. A BEFORE ROW trigger might run a
> SELECT on the table being copied to, and check if some of the tuples we're
> about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples
> first, and only then insert them, none of the trigger invocations will see
> the other rows as inserted yet. Similarly, if we run AFTER ROW triggers
> after inserting a bunch of tuples, the trigger for each of the insertions
> would see all the inserted rows. So at least for now, the patch simply falls
> back to inserting one row at a time if there are any triggers on the table.

But generic RI triggers would be ok, right?

merlin

Re: Inserting heap tuples in bulk in COPY

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-12T21:59:13Z

On 13.08.2011 00:17, Simon Riggs wrote:
> Also, we discussed that you would work on buffering the index inserts,
> which is where the main problem lies. The main heap is only a small
> part of the overhead if we have multiple indexes already built on a
> table - which is the use case that causes the most problem.

Sure, if you have indexes on the table already, then the overhead of 
updating them is more significant. I am actually working on that too, I 
will make a separate post about that.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Inserting heap tuples in bulk in COPY

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-12T22:19:35Z

On 13.08.2011 00:26, Merlin Moncure wrote:
> On Fri, Aug 12, 2011 at 2:16 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> Triggers complicate this. I believe it is only safe to group tuples together
>> like this if the table has no triggers. A BEFORE ROW trigger might run a
>> SELECT on the table being copied to, and check if some of the tuples we're
>> about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples
>> first, and only then insert them, none of the trigger invocations will see
>> the other rows as inserted yet. Similarly, if we run AFTER ROW triggers
>> after inserting a bunch of tuples, the trigger for each of the insertions
>> would see all the inserted rows. So at least for now, the patch simply falls
>> back to inserting one row at a time if there are any triggers on the table.
>
> But generic RI triggers would be ok, right?

RI triggers are AFTER ROW triggers, which we concluded to be OK after 
all, so they would be ok.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Inserting heap tuples in bulk in COPY

Dean Rasheed <dean.a.rasheed@gmail.com> — 2011-08-13T09:01:57Z

On 12 August 2011 23:19, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>>> Triggers complicate this. I believe it is only safe to group tuples
>>> together
>>> like this if the table has no triggers. A BEFORE ROW trigger might run a
>>> SELECT on the table being copied to, and check if some of the tuples
>>> we're
>>> about to insert exist. If we run BEFORE ROW triggers for a bunch of
>>> tuples
>>> first, and only then insert them, none of the trigger invocations will
>>> see
>>> the other rows as inserted yet. Similarly, if we run AFTER ROW triggers
>>> after inserting a bunch of tuples, the trigger for each of the insertions
>>> would see all the inserted rows. So at least for now, the patch simply
>>> falls
>>> back to inserting one row at a time if there are any triggers on the
>>> table.

I guess DEFAULT values could also suffer from a similar problem to
BEFORE ROW triggers though (in theory a DEFAULT could be based on a
function that SELECTs from the table being copied to).

Regards,
Dean

Re: Inserting heap tuples in bulk in COPY

Tom Lane <tgl@sss.pgh.pa.us> — 2011-08-13T14:33:53Z

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> The patch is WIP, mainly because I didn't write the WAL replay routines 
> yet, but please let me know if you see any issues.

Why do you need new WAL replay routines?  Can't you just use the existing
XLOG_HEAP_NEWPAGE support?

By any large, I think we should be avoiding special-purpose WAL entries
as much as possible.

Also, I strongly object to turning regular heap_insert into a wrapper
around some other more complicated operation.  That will likely have
bad consequences for the performance of every other operation.

			regards, tom lane

Re: Inserting heap tuples in bulk in COPY

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-09-14T10:52:09Z

On 13.08.2011 17:33, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> The patch is WIP, mainly because I didn't write the WAL replay routines
>> yet, but please let me know if you see any issues.
>
> Why do you need new WAL replay routines?  Can't you just use the existing
> XLOG_HEAP_NEWPAGE support?
>
> By any large, I think we should be avoiding special-purpose WAL entries
> as much as possible.

I tried that, but most of the reduction in WAL-size melts away with 
that. And if the page you're copying to is not empty, logging the whole 
page is even more expensive. You'd need to fall back to retail inserts 
in that case which complicates the logic.

> Also, I strongly object to turning regular heap_insert into a wrapper
> around some other more complicated operation.  That will likely have
> bad consequences for the performance of every other operation.

Ok. I doubt it makes any noticeable difference for performance, but 
having done that, it's more readable, too, to duplicate the code.

Attached is a new version of the patch. It is now complete, including 
WAL replay code.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Inserting heap tuples in bulk in COPY

Kohei KaiGai <kaigai@kaigai.gr.jp> — 2011-09-25T08:43:13Z

Hi Heikki,

I checked your patch, then I have a comment and two questions here.

The heap_prepare_insert() seems a duplication of code with earlier
half of existing heap_insert(). I think it is a good question to
consolidate these portion of the code.

I'm not clear the reason why the argument of
CheckForSerializableConflictIn() was
changed from the one in heap_insert(). Its source code comment describes as:
        :
     * For a heap insert, we only need to check for table-level SSI locks.
     * Our new tuple can't possibly conflict with existing tuple locks, and
     * heap page locks are only consolidated versions of tuple locks; they do
     * not lock "gaps" as index page locks do.  So we don't need to identify
     * a buffer before making the call.
     */
Is it feasible that newly inserted tuples conflict with existing tuple
locks when
we expand insert to support multiple tuples at once?
It is a bit confusing for me.

This patch disallow multiple-insertion not only when before-row
trigger is configured,
but volatile functions are used to compute a default value also.
Is it a reasonable condition to avoid multiple-insertion?
All the datums to be delivered to heap_form_tuple() is calculated in
NextCopyFrom,
and default values are also constructed here; sequentially.
So, it seems to me we have no worry about volatile functions are not
invoked toward
each tuples. Or, am I missing something?

Thanks,

2011/9/14 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
> On 13.08.2011 17:33, Tom Lane wrote:
>>
>> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>>>
>>> The patch is WIP, mainly because I didn't write the WAL replay routines
>>> yet, but please let me know if you see any issues.
>>
>> Why do you need new WAL replay routines?  Can't you just use the existing
>> XLOG_HEAP_NEWPAGE support?
>>
>> By any large, I think we should be avoiding special-purpose WAL entries
>> as much as possible.
>
> I tried that, but most of the reduction in WAL-size melts away with that.
> And if the page you're copying to is not empty, logging the whole page is
> even more expensive. You'd need to fall back to retail inserts in that case
> which complicates the logic.
>
>> Also, I strongly object to turning regular heap_insert into a wrapper
>> around some other more complicated operation.  That will likely have
>> bad consequences for the performance of every other operation.
>
> Ok. I doubt it makes any noticeable difference for performance, but having
> done that, it's more readable, too, to duplicate the code.
>
> Attached is a new version of the patch. It is now complete, including WAL
> replay code.
>
> --
>  Heikki Linnakangas
>  EnterpriseDB   http://www.enterprisedb.com
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>



-- 
KaiGai Kohei <kaigai@kaigai.gr.jp>

Re: Inserting heap tuples in bulk in COPY

Dean Rasheed <dean.a.rasheed@gmail.com> — 2011-09-25T13:03:19Z

On 25 September 2011 09:43, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
> Hi Heikki,
>
> I checked your patch, then I have a comment and two questions here.
>
> 2011/9/14 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
>>
>> Attached is a new version of the patch. It is now complete, including WAL
>> replay code.

Hi,

I had a quick look at the patch as well and spotted an oversight: the
multi-insert code branch fails to invoke AFTER ROW triggers.

Regards,
Dean

Re: Inserting heap tuples in bulk in COPY

Robert Haas <robertmhaas@gmail.com> — 2011-09-25T16:01:36Z

On Wed, Sep 14, 2011 at 6:52 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>> Why do you need new WAL replay routines?  Can't you just use the existing
>> XLOG_HEAP_NEWPAGE support?
>>
>> By any large, I think we should be avoiding special-purpose WAL entries
>> as much as possible.
>
> I tried that, but most of the reduction in WAL-size melts away with that.
> And if the page you're copying to is not empty, logging the whole page is
> even more expensive. You'd need to fall back to retail inserts in that case
> which complicates the logic.

Where does it go?  I understand why it'd be a problem for partially
filled pages, but it seems like it ought to be efficient for pages
that are initially empty.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inserting heap tuples in bulk in COPY

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-10-06T11:33:24Z

On 25.09.2011 19:01, Robert Haas wrote:
> On Wed, Sep 14, 2011 at 6:52 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>>> Why do you need new WAL replay routines?  Can't you just use the existing
>>> XLOG_HEAP_NEWPAGE support?
>>>
>>> By any large, I think we should be avoiding special-purpose WAL entries
>>> as much as possible.
>>
>> I tried that, but most of the reduction in WAL-size melts away with that.
>> And if the page you're copying to is not empty, logging the whole page is
>> even more expensive. You'd need to fall back to retail inserts in that case
>> which complicates the logic.
>
> Where does it go?  I understand why it'd be a problem for partially
> filled pages, but it seems like it ought to be efficient for pages
> that are initially empty.

A regular heap_insert record leaves out a lot of information that can be 
deduced at replay time. It can leave out all the headers, including just 
the null bitmap + data. In addition to that, there's just the location 
of the tuple (RelFileNode+ItemPointer). At replay, xmin is taken from 
the WAL record header.

For a multi-insert record, you don't even need to store the RelFileNode 
and the block number for every tuple, just the offsets.

In comparison, a full-page image will include the full tuple header, and 
also the line pointers. If I'm doing my math right, a full-page image 
takes 25 bytes more data per tuple, than the special-purpose 
multi-insert record.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Inserting heap tuples in bulk in COPY

Robert Haas <robertmhaas@gmail.com> — 2011-10-06T12:11:52Z

On Thu, Oct 6, 2011 at 7:33 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> A regular heap_insert record leaves out a lot of information that can be
> deduced at replay time. It can leave out all the headers, including just the
> null bitmap + data. In addition to that, there's just the location of the
> tuple (RelFileNode+ItemPointer). At replay, xmin is taken from the WAL
> record header.
>
> For a multi-insert record, you don't even need to store the RelFileNode and
> the block number for every tuple, just the offsets.
>
> In comparison, a full-page image will include the full tuple header, and
> also the line pointers. If I'm doing my math right, a full-page image takes
> 25 bytes more data per tuple, than the special-purpose multi-insert record.

Interesting.  It's always seemed to me fairly inefficient in general
to store the whole RelFileNode.  For many people, the database and
tablespace OID will be constants, and even if they aren't, there
certainly aren't going to be 96 bits of entropy in the relfilenode.  I
thought about whether we could create some sort of mapping layer,
where say once per checkpoint we'd allocate a 4-byte integer to denote
a relfilenode, and WAL-log that mapping.  Then after that everyone
could just refer to the 4-byte integer instead of the whole
relfilenode.  But it seems like a lot of work for 8 bytes per record.
Then again, if you're getting that much benefit from shaving off 25
bytes per tuple, maybe it is, although I feel like FPW is the elephant
in the room.

But I digress...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inserting heap tuples in bulk in COPY

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-10-06T14:24:07Z

On 06.10.2011 15:11, Robert Haas wrote:
> On Thu, Oct 6, 2011 at 7:33 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> A regular heap_insert record leaves out a lot of information that can be
>> deduced at replay time. It can leave out all the headers, including just the
>> null bitmap + data. In addition to that, there's just the location of the
>> tuple (RelFileNode+ItemPointer). At replay, xmin is taken from the WAL
>> record header.
>>
>> For a multi-insert record, you don't even need to store the RelFileNode and
>> the block number for every tuple, just the offsets.
>>
>> In comparison, a full-page image will include the full tuple header, and
>> also the line pointers. If I'm doing my math right, a full-page image takes
>> 25 bytes more data per tuple, than the special-purpose multi-insert record.
>
> Interesting.  It's always seemed to me fairly inefficient in general
> to store the whole RelFileNode.  For many people, the database and
> tablespace OID will be constants, and even if they aren't, there
> certainly aren't going to be 96 bits of entropy in the relfilenode.  I
> thought about whether we could create some sort of mapping layer,
> where say once per checkpoint we'd allocate a 4-byte integer to denote
> a relfilenode, and WAL-log that mapping.  Then after that everyone
> could just refer to the 4-byte integer instead of the whole
> relfilenode.  But it seems like a lot of work for 8 bytes per record.
> Then again, if you're getting that much benefit from shaving off 25
> bytes per tuple, maybe it is, although I feel like FPW is the elephant
> in the room.

A very simple optimization would be to leave out tablespace OID 
altogether if it's DEFAULTTABLESPACE_OID, and just set a flag somewhere. 
Then again, we could also just compress the WAL wholesale.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Inserting heap tuples in bulk in COPY

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-10-24T14:46:21Z

On 25.09.2011 16:03, Dean Rasheed wrote:
> On 25 September 2011 09:43, Kohei KaiGai<kaigai@kaigai.gr.jp>  wrote:
>> Hi Heikki,
>>
>> I checked your patch, then I have a comment and two questions here.
>>
>> 2011/9/14 Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>:
>>>
>>> Attached is a new version of the patch. It is now complete, including WAL
>>> replay code.
>
> Hi,
>
> I had a quick look at the patch as well and spotted an oversight: the
> multi-insert code branch fails to invoke AFTER ROW triggers.

Thanks! Here's an updated version of the patch, fixing that, and all the 
other issues pointed out this far.

I extracted the code that sets oid and tuple headers, and invokes the 
toaster, into a new function that's shared by heap_insert() and 
heap_multi_insert(). Tom objected to merging heap_insert() and 
heap_multi_insert() into one complicated function, and I think he was 
right on that, but sharing this code to prepare a tuple still makes 
sense. IMHO it makes heap_insert() slightly more readable too.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Inserting heap tuples in bulk in COPY

Jeff Janes <jeff.janes@gmail.com> — 2011-11-25T20:53:19Z

On Mon, Oct 24, 2011 at 7:46 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>
> Thanks! Here's an updated version of the patch, fixing that, and all the
> other issues pointed out this far.
>
> I extracted the code that sets oid and tuple headers, and invokes the
> toaster, into a new function that's shared by heap_insert() and
> heap_multi_insert(). Tom objected to merging heap_insert() and
> heap_multi_insert() into one complicated function, and I think he was right
> on that, but sharing this code to prepare a tuple still makes sense. IMHO it
> makes heap_insert() slightly more readable too.

Hi Heikki,

Thanks for this patch.  Doing bulk copies in parallel for me is now
limited by the IO subsystem rather than the CPU.

This patch, commit number d326d9e8ea1d69, causes fillfactor to be
ignored for the copy command.  Is this acceptable collateral damage?

This can be seen by using "pgbench -i -s50 -F50" to create the table
combined with and select
pg_size_pretty(pg_table_size('pgbench_accounts')), or by using the
relation_free_space extension to pageinspect proposed elsewhere.

Thanks,

Jeff

Re: Inserting heap tuples in bulk in COPY

Jeff Janes <jeff.janes@gmail.com> — 2011-11-25T21:32:15Z

On Fri, Nov 25, 2011 at 12:53 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

> Hi Heikki,
>
> Thanks for this patch.  Doing bulk copies in parallel for me is now
> limited by the IO subsystem rather than the CPU.
>
> This patch, commit number d326d9e8ea1d69, causes fillfactor to be
> ignored for the copy command.  Is this acceptable collateral damage?

Having looked into it a little bit, I think this might be an acceptable fix.

Cheers,

Jeff

Re: Inserting heap tuples in bulk in COPY

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-11-26T10:16:27Z

On 25.11.2011 23:32, Jeff Janes wrote:
> On Fri, Nov 25, 2011 at 12:53 PM, Jeff Janes<jeff.janes@gmail.com>  wrote:
>
>> Thanks for this patch.  Doing bulk copies in parallel for me is now
>> limited by the IO subsystem rather than the CPU.
>>
>> This patch, commit number d326d9e8ea1d69, causes fillfactor to be
>> ignored for the copy command.  Is this acceptable collateral damage?
>
> Having looked into it a little bit, I think this might be an acceptable fix.

Thanks, applied. It was an oversight.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Inserting heap tuples in bulk in COPY

Jeff Janes <jeff.janes@gmail.com> — 2012-08-07T19:58:03Z

On Fri, Aug 12, 2011 at 2:59 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 13.08.2011 00:17, Simon Riggs wrote:
>>
>> Also, we discussed that you would work on buffering the index inserts,
>> which is where the main problem lies. The main heap is only a small
>> part of the overhead if we have multiple indexes already built on a
>> table - which is the use case that causes the most problem.
>
>
> Sure, if you have indexes on the table already, then the overhead of
> updating them is more significant. I am actually working on that too, I will
> make a separate post about that.

Hi Heikki,

Is the bulk index insert still an active area for you?

If not, is there some kind of summary of design or analysis work
already done, which would be a useful for someone else interested in
picking it up?

Thanks,

Jeff

Re: Inserting heap tuples in bulk in COPY

Simon Riggs <simon@2ndquadrant.com> — 2012-08-07T20:52:25Z

On 7 August 2012 20:58, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Fri, Aug 12, 2011 at 2:59 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> On 13.08.2011 00:17, Simon Riggs wrote:
>>>
>>> Also, we discussed that you would work on buffering the index inserts,
>>> which is where the main problem lies. The main heap is only a small
>>> part of the overhead if we have multiple indexes already built on a
>>> table - which is the use case that causes the most problem.
>>
>>
>> Sure, if you have indexes on the table already, then the overhead of
>> updating them is more significant. I am actually working on that too, I will
>> make a separate post about that.
>
> Hi Heikki,
>
> Is the bulk index insert still an active area for you?
>
> If not, is there some kind of summary of design or analysis work
> already done, which would be a useful for someone else interested in
> picking it up?

The main cost comes from repeated re-seeking down the index tree to
find the insertion point, but repeated lock and pin operations on
index buffers are also high. That is optimisable if the index inserts
are grouped, as they will be with monotonic indexes, (e.g. SERIAL), or
with partial monotonic (i.e. with Parent/Child relationship, on Child
table many inserts will be of the form (x,1), (x,2), (x, 3) etc, e.g.
Order/OrderLine).

All we need do is buffer the inserts in the inserts, before inserting
them into the main index. As long as we flush the buffer before end of
transaction, we're good.

Incidentally, we can also optimise repeated inserts within a normal
transaction using this method, by implementing deferred unique
constraints. At present we say that unique constraints aren't
deferrable, but there's no reason they can't be, if we allow buffering
in the index. (Implementing a deferred constraint that won't fail if
we do UPDATE foo SET pk = pk+1 requires a buffer the size of foo,
which is probably a bad plan, plus you'd need to sort the inputs, so
that particular nut is another issue altogether, AFAICS).

Suggested implementation is to buffer index tuples at the generic
index API, then implement a bulk insert index API call that can then
be implemented for each AM separately. Suggested buffer size is
work_mem. Note we must extract index tuple from heap tuples -
refetching heap blocks to get rows is too costly.

I think we need to implement buffering both to end of statement or end
of transaction, not just one or the other.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Inserting heap tuples in bulk in COPY

Jeff Janes <jeff.janes@gmail.com> — 2012-08-08T02:44:47Z

On Tue, Aug 7, 2012 at 1:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 7 August 2012 20:58, Jeff Janes <jeff.janes@gmail.com> wrote:
>> Hi Heikki,
>>
>> Is the bulk index insert still an active area for you?
>>
>> If not, is there some kind of summary of design or analysis work
>> already done, which would be a useful for someone else interested in
>> picking it up?
>
> The main cost comes from repeated re-seeking down the index tree to
> find the insertion point, but repeated lock and pin operations on
> index buffers are also high. That is optimisable if the index inserts
> are grouped, as they will be with monotonic indexes, (e.g. SERIAL), or
> with partial monotonic (i.e. with Parent/Child relationship, on Child
> table many inserts will be of the form (x,1), (x,2), (x, 3) etc, e.g.
> Order/OrderLine).
>
> All we need do is buffer the inserts in the inserts, before inserting
> them into the main index. As long as we flush the buffer before end of
> transaction, we're good.
>
> Incidentally, we can also optimise repeated inserts within a normal
> transaction using this method, by implementing deferred unique
> constraints. At present we say that unique constraints aren't
> deferrable, but there's no reason they can't be, if we allow buffering
> in the index. (Implementing a deferred constraint that won't fail if
> we do UPDATE foo SET pk = pk+1 requires a buffer the size of foo,
> which is probably a bad plan, plus you'd need to sort the inputs, so
> that particular nut is another issue altogether, AFAICS).
>
> Suggested implementation is to buffer index tuples at the generic
> index API, then implement a bulk insert index API call that can then
> be implemented for each AM separately. Suggested buffer size is
> work_mem. Note we must extract index tuple from heap tuples -
> refetching heap blocks to get rows is too costly.

OK, thanks for the summary.  It looks like your plans are to improve
the case where the index maintenance is already CPU limited.  I was
more interested in cases where the regions of the index(es) undergoing
active insertion do not fit into usable RAM, so the limit is the
random IO needed to fetch the leaf pages in order to update them or to
write them out once dirtied.  Here too buffering is probably the
answer, but the size of the buffer needed to turn that IO from
effectively random to effectively sequential is probably much larger
than the size of the buffer you are contemplating.

> I think we need to implement buffering both to end of statement or end
> of transaction, not just one or the other.

With the planner deciding which would be better, or explicit user action?

Thanks,

Jeff

Re: Inserting heap tuples in bulk in COPY

Simon Riggs <simon@2ndquadrant.com> — 2012-08-08T07:36:08Z

On 8 August 2012 03:44, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Aug 7, 2012 at 1:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 7 August 2012 20:58, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> Hi Heikki,
>>>
>>> Is the bulk index insert still an active area for you?
>>>
>>> If not, is there some kind of summary of design or analysis work
>>> already done, which would be a useful for someone else interested in
>>> picking it up?
>>
>> The main cost comes from repeated re-seeking down the index tree to
>> find the insertion point, but repeated lock and pin operations on
>> index buffers are also high. That is optimisable if the index inserts
>> are grouped, as they will be with monotonic indexes, (e.g. SERIAL), or
>> with partial monotonic (i.e. with Parent/Child relationship, on Child
>> table many inserts will be of the form (x,1), (x,2), (x, 3) etc, e.g.
>> Order/OrderLine).
>>
>> All we need do is buffer the inserts in the inserts, before inserting
>> them into the main index. As long as we flush the buffer before end of
>> transaction, we're good.
>>
>> Incidentally, we can also optimise repeated inserts within a normal
>> transaction using this method, by implementing deferred unique
>> constraints. At present we say that unique constraints aren't
>> deferrable, but there's no reason they can't be, if we allow buffering
>> in the index. (Implementing a deferred constraint that won't fail if
>> we do UPDATE foo SET pk = pk+1 requires a buffer the size of foo,
>> which is probably a bad plan, plus you'd need to sort the inputs, so
>> that particular nut is another issue altogether, AFAICS).
>>
>> Suggested implementation is to buffer index tuples at the generic
>> index API, then implement a bulk insert index API call that can then
>> be implemented for each AM separately. Suggested buffer size is
>> work_mem. Note we must extract index tuple from heap tuples -
>> refetching heap blocks to get rows is too costly.
>
> OK, thanks for the summary.  It looks like your plans are to improve
> the case where the index maintenance is already CPU limited.  I was
> more interested in cases where the regions of the index(es) undergoing
> active insertion do not fit into usable RAM, so the limit is the
> random IO needed to fetch the leaf pages in order to update them or to
> write them out once dirtied.  Here too buffering is probably the
> answer, but the size of the buffer needed to turn that IO from
> effectively random to effectively sequential is probably much larger
> than the size of the buffer you are contemplating.

The buffer size can be variable, yes. I was imagining a mechanism that
worked for normal INSERTs as well as COPY. Perhaps we could say buffer
is work_mem with INSERT and maintenance_work_mem with COPY.

Very large index appends are useful, but currently not very easily
usable because of the transactional nature of COPY. If we could reject
rows without ERROR it would be more practical.

I'm not planning to work on this, so all comments for your assistance.

>> I think we need to implement buffering both to end of statement or end
>> of transaction, not just one or the other.
>
> With the planner deciding which would be better, or explicit user action?

Probably both: on/off/choose.

Deferring unique check would change the point at which errors were
reported in a transaction, which might not be desirable for some. I
think SQL standard has something to say about this also, so that needs
care. But in general, if your tables use generated PK values they
should be able to benefit from this, so I would suggest a default
setting of choose.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Inserting heap tuples in bulk in COPY

Robert Haas <robertmhaas@gmail.com> — 2012-08-08T19:34:07Z

On Tue, Aug 7, 2012 at 4:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Incidentally, we can also optimise repeated inserts within a normal
> transaction using this method, by implementing deferred unique
> constraints. At present we say that unique constraints aren't
> deferrable, but there's no reason they can't be, if we allow buffering
> in the index. (Implementing a deferred constraint that won't fail if
> we do UPDATE foo SET pk = pk+1 requires a buffer the size of foo,
> which is probably a bad plan, plus you'd need to sort the inputs, so
> that particular nut is another issue altogether, AFAICS).

We've had deferrable unique constraints since 9.0, courtesy of Dean Rasheed.

> I think we need to implement buffering both to end of statement or end
> of transaction, not just one or the other.

Another (not necessarily better) idea is to use a buffer that's part
of the index, like the GIN fastupdate stuff, so that there's no
particular constraint on when the buffer has to be flushed, but
competing index scans may be slower until it is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inserting heap tuples in bulk in COPY

Simon Riggs <simon@2ndquadrant.com> — 2012-08-08T19:54:23Z

On 8 August 2012 20:34, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 7, 2012 at 4:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Incidentally, we can also optimise repeated inserts within a normal
>> transaction using this method, by implementing deferred unique
>> constraints. At present we say that unique constraints aren't
>> deferrable, but there's no reason they can't be, if we allow buffering
>> in the index. (Implementing a deferred constraint that won't fail if
>> we do UPDATE foo SET pk = pk+1 requires a buffer the size of foo,
>> which is probably a bad plan, plus you'd need to sort the inputs, so
>> that particular nut is another issue altogether, AFAICS).
>
> We've had deferrable unique constraints since 9.0, courtesy of Dean Rasheed.

Yeh, but IIRC there was some issue I can't seem to find detail on
about it not working quite right in production. Not sure now.


>> I think we need to implement buffering both to end of statement or end
>> of transaction, not just one or the other.
>
> Another (not necessarily better) idea is to use a buffer that's part
> of the index, like the GIN fastupdate stuff, so that there's no
> particular constraint on when the buffer has to be flushed, but
> competing index scans may be slower until it is.

I think that works very well for non-unique indexes, though it does
increase WAL traffic since you need to do a double hop into the index.
Its not possible for unique constraints/pk indexes since they need to
fail the transaction in case of duplicates.

The buffer I was imagining would be a private buffer within a
transaction, so wouldn't increase WAL.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Inserting heap tuples in bulk in COPY

Jesper Krogh <jesper@krogh.cc> — 2012-08-09T06:59:52Z

On 08/08/12 21:34, Robert Haas wrote:
>> I think we need to implement buffering both to end of statement or end
>> of transaction, not just one or the other.
> Another (not necessarily better) idea is to use a buffer that's part
> of the index, like the GIN fastupdate stuff, so that there's no
> particular constraint on when the buffer has to be flushed, but
> competing index scans may be slower until it is.
If it is an implementation artifact or an result of this
approach I dont know. But currently, when the GIN fastupdate
code finally decides to "flush" the buffer, it is going to stall all
other processes doing updates while doing it. If you only have
one update process then this doesn't matter. But if you're trying to get
user-interactive-updates to flow in with batch-updates from
background processes, then you'd better kill off this feature,
since you're gauranteed that the user-interactive process is
either going to flush the buffer or wait on someone else doing
it.

I havent done the benchmarking, but I'm actually fairly sure that
fastupdate isn't overall faster if you bump concurrency slightly and run of
memory or SSD-based backends due to this cross-backend contention
of the buffer.

A buffer that is backend local, so you can use transactions to
batch up changes would get around this, but that may have another
huge set of consequenses I dont know if.

... based on my own real-world experience with this feature.

Re: Inserting heap tuples in bulk in COPY

Robert Haas <robertmhaas@gmail.com> — 2012-08-09T14:11:08Z

On Thu, Aug 9, 2012 at 2:59 AM, Jesper Krogh <jesper@krogh.cc> wrote:
> If it is an implementation artifact or an result of this
> approach I dont know. But currently, when the GIN fastupdate
> code finally decides to "flush" the buffer, it is going to stall all
> other processes doing updates while doing it. If you only have
> one update process then this doesn't matter. But if you're trying to get
> user-interactive-updates to flow in with batch-updates from
> background processes, then you'd better kill off this feature,
> since you're gauranteed that the user-interactive process is
> either going to flush the buffer or wait on someone else doing
> it.
>
> I havent done the benchmarking, but I'm actually fairly sure that
> fastupdate isn't overall faster if you bump concurrency slightly and run of
> memory or SSD-based backends due to this cross-backend contention
> of the buffer.

Yeah, I've noticed that there are some things that are a little wonky
about GIN fastupdate.  On the other hand, I believe that MySQL has
something along these lines called secondary index buffering which
apparently does very good things for random I/O.  I am not sure of the
details or the implementation, though.

> A buffer that is backend local, so you can use transactions to
> batch up changes would get around this, but that may have another
> huge set of consequenses I dont know if.
>
> ... based on my own real-world experience with this feature.

Well, the main thing to worry about is transactional consistency.  If
a backend which has postponed doing the index-inserts does an index
scan after the command-counter-id has been bumped, it'll see
inconsistent results.  We could avoid that by only using the
optimization when some set of sanity checks passes and doing the
deferred inserts at the end of the statement, or something like that.

The other tricky part is figuring out how to actually get a
performance improvement out of it.  I think Simon's probably right
that a lot of the cost is in repeatedly walking the btree, looking up
and pinning/unpinning/locking/unlocking buffers along the way.  Maybe
we could sort the data in index order, walk down to the first
insertion point, and the insert as many tuples in a row as precede the
next key in the index.  Then lather, rinse, repeat.  If you're
actually just adding everything at the tail of the index, this ought
to work pretty well.  But if the inserts are all over the place it
seems like it might not be any better, or actually a little worse.

Of course it's probably premature to speculate too much until someone
actually codes something up and tests it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company