Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Speed up conversion of signed integers to C strings.
- 4fc115b2e981 9.1.0 cited
Remove some unnecessary tests of pgstat_track_counts.
- f4d242ef9473 9.1.0 cited
Remove cvs keywords from all files.
- 9f2e21138693 9.1.0 cited
Code cleanup for function prototypes: change two K&R-style prototypes
- b9954fbb4ef2 8.3.0 cited
Use Min() instead of min() in qsort, for consistency and to avoid
- b38900c76776 8.2.0 cited
pgindent run for 8.2.
- f99a569a2ee3 8.2.0 cited
Switch over to using our own qsort() all the time, as has been proposed
- 6edd2b4a91bd 8.2.0 cited

Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-09-20T01:56:01Z

Recent discussions on the threads "Double sorting split patch" and
"CUDA sorting" raised the possibility that there could be significant
performance optimisation "low-hanging fruit" picked by having the
executor treat integers and floats as a special case during sorting,
avoiding going to the trouble of calling a comparator using the
built-in SQL function machinery, and taking advantage of inlining of
the comparator, which has been shown to have a considerable
performance advantage (at least compared to a general purpose c stdlib
qsort(), that takes a function pointer as its comparator, much like
tuplesort).

I've hacked together a sloppy POC implementation in a hurry
(basically, some code is shifted around) , which is attached - I felt
that it would be useful to determine if the community feels that this
is a worth-while undertaking in advance of a business trip that I'm
leaving on tomorrow lasting until Friday, during which I will be
mostly unavailable. The patch breaks the Postgres sorting executor
node (at least when it quicksorts) for any type other than int4. I
apologise for how rough the patch is, but the code itself isn't
important right now - the idea is. I anticipate that the value
state->datumType or something similar will be set in a near future
revision, so that tuplesort_performsort will know which type-specific
optimisation it can use for the type, while falling back on the
existing generic qsort_arg + qsort_arg_comparator, and sorting won't
actually be broken.

I've been doing some preliminary testing using the dell store 2 sample
database. I increase work_mem to '50MB', to ensure that a quicksort
will be performed for sorting (otherwise, I'm using the
postgresql.conf that initdb gave me). The query is:

explain analyze select * from orderlines order by prod_id;

Once the cache has been warmed, explain analyze very consistently
reports a runtime of 123ms for this query on master/HEAD, which varies
+/- 1 ms, with a few outliers of maybe +/- 2ms. However, when I apply
this patch, that goes down to 107ms +/- 1ms at -O0. I think that
that's a pretty good start. Funnily enough, the difference/advantage
vanishes at -O2 (I'm guessing that the higher optimisation level of
GCC 4.5 hyper-corrects away the inlining, but I don't have time to
check that right now).

I imagine the version that I actually submit for patch review will
have a macro-based infrastructure for inlining the sorting of various
built-in types, initially integers and floats. It will most likely
have some other optimisations - I haven't even used a profiler yet.

This performance patch differs from most in that it's difficult in
principle to imagine a performance regression occurring.

Thoughts?

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-20T02:51:32Z

Peter Geoghegan <peter@2ndquadrant.com> writes:
> Once the cache has been warmed, explain analyze very consistently
> reports a runtime of 123ms for this query on master/HEAD, which varies
> +/- 1 ms, with a few outliers of maybe +/- 2ms. However, when I apply
> this patch, that goes down to 107ms +/- 1ms at -O0. I think that
> that's a pretty good start. Funnily enough, the difference/advantage
> vanishes at -O2 (I'm guessing that the higher optimisation level of
> GCC 4.5 hyper-corrects away the inlining, but I don't have time to
> check that right now).

Considering that -O2 is our standard optimization level, that
observation seems to translate to "this patch will be useless in
practice".  I think you had better investigate that aspect in some
detail before spending more effort.

> This performance patch differs from most in that it's difficult in
> principle to imagine a performance regression occurring.

Really?  N copies of the same code could lead to performance loss just
due to code bloat (ie, less of a query's inner loops fitting in CPU
cache).  Not to mention the clear regression in maintainability.  So
I'm disinclined to consider this sort of change without a significantly
bigger win than you're suggesting above (no, I don't even consider the
-O0 number attractive, let alone what you're finding at -O2).

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Amit Kapila <amit.kapila@huawei.com> — 2011-09-20T13:57:24Z

>Recent discussions on the threads "Double sorting split patch" and

>"CUDA sorting" raised the possibility that there could be significant

>performance optimisation "low-hanging fruit" picked by having the

>executor treat integers and floats as a special case during sorting,

>avoiding going to the trouble of calling a comparator using the

>built-in SQL function machinery

 

Why only for integers and floats why not for char/varchar?

But I believe this can make code less maintainable as similar things can be
done at other places to avoid SQL function machinery.

 

>Once the cache has been warmed, explain analyze very consistently

>reports a runtime of 123ms for this query on master/HEAD, which varies

>+/- 1 ms, with a few outliers of maybe +/- 2ms. However, when I apply

>this patch, that goes down to 107ms +/- 1ms at -O0.

 

Time 123ms which is without your change is with which optimization -O2 or
O0?

 

 

****************************************************************************
***********

This e-mail and attachments contain confidential information from HUAWEI,
which is intended only for the person or entity whose address is listed
above. Any use of the information contained herein in any way (including,
but not limited to, total or partial disclosure, reproduction, or
dissemination) by persons other than the intended recipient's) is
prohibited. If you receive this e-mail in error, please notify the sender by
phone or email immediately and delete it!

 

 

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Peter Geoghegan
Sent: Tuesday, September 20, 2011 7:26 AM
To: PG Hackers
Subject: [HACKERS] Inlining comparators as a performance optimisation

 

Recent discussions on the threads "Double sorting split patch" and

"CUDA sorting" raised the possibility that there could be significant

performance optimisation "low-hanging fruit" picked by having the

executor treat integers and floats as a special case during sorting,

avoiding going to the trouble of calling a comparator using the

built-in SQL function machinery, and taking advantage of inlining of

the comparator, which has been shown to have a considerable

performance advantage (at least compared to a general purpose c stdlib

qsort(), that takes a function pointer as its comparator, much like

tuplesort).

 

I've hacked together a sloppy POC implementation in a hurry

(basically, some code is shifted around) , which is attached - I felt

that it would be useful to determine if the community feels that this

is a worth-while undertaking in advance of a business trip that I'm

leaving on tomorrow lasting until Friday, during which I will be

mostly unavailable. The patch breaks the Postgres sorting executor

node (at least when it quicksorts) for any type other than int4. I

apologise for how rough the patch is, but the code itself isn't

important right now - the idea is. I anticipate that the value

state->datumType or something similar will be set in a near future

revision, so that tuplesort_performsort will know which type-specific

optimisation it can use for the type, while falling back on the

existing generic qsort_arg + qsort_arg_comparator, and sorting won't

actually be broken.

 

I've been doing some preliminary testing using the dell store 2 sample

database. I increase work_mem to '50MB', to ensure that a quicksort

will be performed for sorting (otherwise, I'm using the

postgresql.conf that initdb gave me). The query is:

 

explain analyze select * from orderlines order by prod_id;

 

Once the cache has been warmed, explain analyze very consistently

reports a runtime of 123ms for this query on master/HEAD, which varies

+/- 1 ms, with a few outliers of maybe +/- 2ms. However, when I apply

this patch, that goes down to 107ms +/- 1ms at -O0. I think that

that's a pretty good start. Funnily enough, the difference/advantage

vanishes at -O2 (I'm guessing that the higher optimisation level of

GCC 4.5 hyper-corrects away the inlining, but I don't have time to

check that right now).

 

I imagine the version that I actually submit for patch review will

have a macro-based infrastructure for inlining the sorting of various

built-in types, initially integers and floats. It will most likely

have some other optimisations - I haven't even used a profiler yet.

 

This performance patch differs from most in that it's difficult in

principle to imagine a performance regression occurring.

 

Thoughts?

 

-- 

Peter Geoghegan       http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-09-20T23:53:16Z

On 20 September 2011 03:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Considering that -O2 is our standard optimization level, that
> observation seems to translate to "this patch will be useless in
> practice".  I think you had better investigate that aspect in some
> detail before spending more effort.

I don't think that the fact that that happens is at all significant at
this early stage, and it never even occurred to me that you'd think
that it might be. I was simply disclosing a quirk of this POC patch.
The workaround is probably to use a macro instead. For the benefit of
those that didn't follow the other threads, the macro-based qsort
implementation, which I found to perform significantly better than
regular qsort(), runs like this on my laptop when I built at 02 with
GCC 4.6 just now:

C stdlib quick-sort time elapsed: 2.092451 seconds
Inline quick-sort time elapsed: 1.587651 seconds

Does *that* look attractive to you? I've attached source code of the
program that produced these figures, which has been ported to C from
C++.

When I #define LARGE_SIZE 100000000, here's what I see:

[peter@peter inline_compar_test]$ ./a.out
C stdlib quick-sort time elapsed: 23.659411 seconds
Inline quick-sort time elapsed: 18.470611 seconds

Here, sorting with the function pointer/stdlib version takes about
1.28 times as long. In the prior test (with the smaller LARGE_SIZE),
it took about 1.32 times as long. Fairly predictable, linear, and not
to be sniffed at.

The variance I'm seeing across runs is low - a couple of hundredths of
a second at most. This is a Fedora 15 " Intel(R) Core(TM) i5-2540M CPU
@ 2.60GHz" machine. I'm not sure right now why the inline quick-sort
is less of a win than on my old Fedora 14 desktop (where it was 3.24
Vs 2.01), but it's still a significant win. Perhaps others can build
this simple program and tell me what they come up with.

>> This performance patch differs from most in that it's difficult in
>> principle to imagine a performance regression occurring.
>
> Really?  N copies of the same code could lead to performance loss just
> due to code bloat (ie, less of a query's inner loops fitting in CPU
> cache).

I did consider that. Of course inlining has an overhead, and I'll be
testing that each instance of inlining has a net benefit. I just meant
that many other performance patches have an obvious worst case, and I
think that it is policy to focus on that case, but I can't think of
one here. Does anyone else have any ideas?

> Not to mention the clear regression in maintainability.  So
> I'm disinclined to consider this sort of change without a significantly
> bigger win than you're suggesting above

Sure, there'll be some sort of regression in maintainability - I think
that HOT had a clear regression in maintainability too. The important
questions are obviously "how big is the loss of maintainability?", and
"is it worth it?". We'll know more when this work is actually shaped
into a proper patch. Perhaps I should have waited until I had
something along those lines before making an announcement, but I
wanted community input as early as possible. I think that there's
plenty of tweaking that can be done to get additional performance
improvements - all I've done so far is demonstrate that those
improvements are real and worth thinking about, in the fastest
possible way, partly because you expressed skepticism of the benefits
of inlining comparators to Greg Stark in an earlier thread.

Performance and maintainability are often somewhat in tension, but we
cannot ignore performance. If this work can bring us an improvement in
performance approaching the isolated macro Vs qsort() function pointer
benchmark, that's a *big* win. Sorting integers and floats is very
common and important.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

[PATCH] POC: inline int4 comparison in tuplesort

Dan McGee <dan@archlinux.org> — 2011-09-21T01:54:20Z

This attempts to be as simple as it gets while reducing function call
depth, and should be viewed as a proof of concept. It is also untested
as of now, but will try to do that and report back.

I'm hoping I followed the rabbit hole correctly and are correctly
comparing the right pointers to each other in order to short circuit the
case where we are using the int4 comparison operator.

Peter, if you want to compare stock vs. your patch vs. this patch, we might
be able to get some sort of read on where the maintainablity vs. performance
curve lies. Note that this version should still allow sorting of anything,
and simply shifts gears for int4 tuples...

---
 src/backend/utils/sort/tuplesort.c |   23 +++++++++++++++++++++--
 1 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 3505236..ddd5ced 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2652,6 +2652,22 @@ myFunctionCall2Coll(FmgrInfo *flinfo, Oid collation, Datum arg1, Datum arg2)
 	return result;
 }
 
+static inline
+int int4cmp(Datum first, Datum second)
+{
+	int32		a = DatumGetInt32(first);
+	int32		b = DatumGetInt32(second);
+
+	if (a > b)
+		return 1;
+	else if (a == b)
+		return 0;
+	else
+		return -1;
+}
+
+extern Datum btint4cmp(PG_FUNCTION_ARGS);
+
 /*
  * Apply a sort function (by now converted to fmgr lookup form)
  * and return a 3-way comparison result.  This takes care of handling
@@ -2683,8 +2699,11 @@ inlineApplySortFunction(FmgrInfo *sortFunction, int sk_flags, Oid collation,
 	}
 	else
 	{
-		compare = DatumGetInt32(myFunctionCall2Coll(sortFunction, collation,
-													datum1, datum2));
+		if (sortFunction->fn_addr == btint4cmp)
+			compare = int4cmp(datum1, datum2);
+		else
+			compare = DatumGetInt32(myFunctionCall2Coll(sortFunction, collation,
+														datum1, datum2));
 
 		if (sk_flags & SK_BT_DESC)
 			compare = -compare;
-- 
1.7.6.3

Re: Inlining comparators as a performance optimisation

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-09-21T06:51:32Z

On 21.09.2011 02:53, Peter Geoghegan wrote:
> C stdlib quick-sort time elapsed: 2.092451 seconds
> Inline quick-sort time elapsed: 1.587651 seconds
>
> Does *that* look attractive to you?

Not really, to be honest. That's a 25% speedup in pure qsorting speed. 
How much of a gain in a real query do you expect to get from that, in 
the best case? There's so many other sources of overhead that I'm afraid 
this will be lost in the noise. If you find a query that spends, say, 
50% of its time in qsort(), you will only get a 12.5% speedup on that 
query. And even 50% is really pushing it - I challenge you to find a 
query that spends any significant amount of time qsorting integers.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Inlining comparators as a performance optimisation

Simon Riggs <simon@2ndquadrant.com> — 2011-09-21T06:55:55Z

On Tue, Sep 20, 2011 at 3:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> This performance patch differs from most in that it's difficult in
>> principle to imagine a performance regression occurring.
>
> Really?  N copies of the same code could lead to performance loss just
> due to code bloat (ie, less of a query's inner loops fitting in CPU
> cache).  Not to mention the clear regression in maintainability.  So
> I'm disinclined to consider this sort of change without a significantly
> bigger win than you're suggesting above (no, I don't even consider the
> -O0 number attractive, let alone what you're finding at -O2).

More copies of the code are somewhat annoying, but its only 100 lines
of code in one module and we can easily have specific tests for each.
The extra code size is minor in comparison to the reams of code we add
elsewhere.

It's a surprisingly good win for such a common use case. Well done, Peter.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Inlining comparators as a performance optimisation

Simon Riggs <simon@2ndquadrant.com> — 2011-09-21T07:01:11Z

On Wed, Sep 21, 2011 at 7:51 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 21.09.2011 02:53, Peter Geoghegan wrote:
>>
>> C stdlib quick-sort time elapsed: 2.092451 seconds
>> Inline quick-sort time elapsed: 1.587651 seconds
>>
>> Does *that* look attractive to you?
>
> Not really, to be honest. That's a 25% speedup in pure qsorting speed. How
> much of a gain in a real query do you expect to get from that, in the best
> case? There's so many other sources of overhead that I'm afraid this will be
> lost in the noise. If you find a query that spends, say, 50% of its time in
> qsort(), you will only get a 12.5% speedup on that query. And even 50% is
> really pushing it - I challenge you to find a query that spends any
> significant amount of time qsorting integers.

How about almost every primary index creation?

Don't really see a reason for the negativity here. If you use that
argument no performance gain is worth it because all workloads are
mixed.

This is a marvellous win, a huge gain from a small, isolated and
easily tested change. By far the smallest amount of additional code to
sorting we will have added and yet one of the best gains.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Inlining comparators as a performance optimisation

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-09-21T07:08:42Z

On 21.09.2011 10:01, Simon Riggs wrote:
> On Wed, Sep 21, 2011 at 7:51 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> On 21.09.2011 02:53, Peter Geoghegan wrote:
>>>
>>> C stdlib quick-sort time elapsed: 2.092451 seconds
>>> Inline quick-sort time elapsed: 1.587651 seconds
>>>
>>> Does *that* look attractive to you?
>>
>> Not really, to be honest. That's a 25% speedup in pure qsorting speed. How
>> much of a gain in a real query do you expect to get from that, in the best
>> case? There's so many other sources of overhead that I'm afraid this will be
>> lost in the noise. If you find a query that spends, say, 50% of its time in
>> qsort(), you will only get a 12.5% speedup on that query. And even 50% is
>> really pushing it - I challenge you to find a query that spends any
>> significant amount of time qsorting integers.
>
> How about almost every primary index creation?

Nope. Swamped by everything else.

Also note that as soon as the sort grows big enough to not fit in 
maintenance_work_mem, you switch to the external sort algorithm, which 
doesn't use qsort. Perhaps you could do similar inlining in the heap 
sort & merge passes done in the external sort, but it's unlikely to be 
as big a win there.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Inlining comparators as a performance optimisation

Greg Stark <stark@mit.edu> — 2011-09-21T12:47:46Z

On Wed, Sep 21, 2011 at 8:08 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>> How about almost every primary index creation?
>
> Nope. Swamped by everything else.

Really? I think it's pretty common for shops to be able to dedicate
large amounts of RAM to building initial indexes on data loads or
reindex operations. Enough that they can cache the entire table for
the short time they're doing the index builds even if they're quite
large. Witness the recent pleas to allow maintenance_work_mem on the
order of tens of gigabytes. And it's also pretty common that shops can
dedicate very large I/O bandwidth, in many cases enough to saturate
the memory bandwidth, for doing these kinds of batch operations when
they get large enough to need to do an external sort.

There's still overhead of reading the pages, the tuples, finding the
sort keys in the tuple, etc. But I think the actual qsort or heap
operations in tapesort are pretty big portions of the work.

This is pretty easy to measure. Just run oprofile or gprof and see
what percentage of time for a big index build is spent in qsort.

-- 
greg

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-09-21T13:34:49Z

On Wed, Sep 21, 2011 at 8:47 AM, Greg Stark <stark@mit.edu> wrote:
> On Wed, Sep 21, 2011 at 8:08 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>>> How about almost every primary index creation?
>>
>> Nope. Swamped by everything else.
>
> Really? I think it's pretty common for shops to be able to dedicate
> large amounts of RAM to building initial indexes on data loads or
> reindex operations. Enough that they can cache the entire table for
> the short time they're doing the index builds even if they're quite
> large. Witness the recent pleas to allow maintenance_work_mem on the
> order of tens of gigabytes. And it's also pretty common that shops can
> dedicate very large I/O bandwidth, in many cases enough to saturate
> the memory bandwidth, for doing these kinds of batch operations when
> they get large enough to need to do an external sort.
>
> There's still overhead of reading the pages, the tuples, finding the
> sort keys in the tuple, etc. But I think the actual qsort or heap
> operations in tapesort are pretty big portions of the work.
>
> This is pretty easy to measure. Just run oprofile or gprof and see
> what percentage of time for a big index build is spent in qsort.

+1 for some actual measurements.

I don't think anyone on this thread is saying that if we can get big
performance gains from doing this we still shouldn't do it.  But at
this point it's unclear that we can get a consistent speedup that
isn't heavily dependent on the choice of compiler flags (to say
nothing of compiler and OS), and even if we can, it's not clear that
it will still be noticeable when you measure the run time of an entire
query rather than just the speed of qsort().  Like Tom and Heikki, I'm
a bit skeptical: it wouldn't surprise me to find out that qsort() is
5% of the runtime any realistic test case and the average qsort()
speedup based on tests on a couple different platforms is 10% and so
on average we're looking at a 0.5% improvement, in which case it might
be more trouble than it's worth, especially if it turns out that there
are OS/platform combinations where the inlined version is (for some
crazy reason) slower.  I've seen performance differences of up to 3%
from minor code rearrangements that don't seem like they should matter
at all, just because code and data shifts around across cache-line
boundaries and the new arrangement is slightly better or worse than
the old one.  So if the performance improvement turns out to be very
small, then validating that it actually IS an improvement in general
is likely to be kind of a pain in the ass.

On the other hand, the performance improvement might turn out to be
large.  Maybe there's a test case where, as Heikki suggests, 50% of
the time is spent in qsort().  If we can reliably make that 25%
faster, I wouldn't dismiss that out of hand; I think that would be
pretty good, assuming it didn't require massive amounts of spaghetti
code to make it work.  I don't see that that would be any more
marginal than the sorts of things we've optimized in, say, commit
4fc115b2e981f8c63165ca86a23215380a3fda66, or commit
f4d242ef94730c447d87b9840a40b0ec3371fe0f.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-09-21T14:20:55Z

On 21 September 2011 07:51, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 21.09.2011 02:53, Peter Geoghegan wrote:
>>
>> C stdlib quick-sort time elapsed: 2.092451 seconds
>> Inline quick-sort time elapsed: 1.587651 seconds
>>
>> Does *that* look attractive to you?
>
> Not really, to be honest. That's a 25% speedup in pure qsorting speed. How
> much of a gain in a real query do you expect to get from that, in the best
> case? There's so many other sources of overhead that I'm afraid this will be
> lost in the noise.

I'm surprised that you're dismissive of this. After all, we have in
the past indulged in micro-optimisation of qsort, or so it would seem
from this comment:

 * We have modified their original by adding a check for already-sorted input,
 * which seems to be a win per discussions on pgsql-hackers around 2006-03-21.

"Makes affected queries radically faster" (In the best case, a speedup
somewhat greater than 12.5%) is an unreasonably high standard for a
performance optimisation of the executor in general (such a high
standard might be sensible if it was due to a particular
maintainability downside, but you didn't mention one). Even still, I
think that the 12.5% figure is pretty pessimistic - I've already sped
up the dell store query by almost that much, and that's with a patch
that was, due to circumstances, cobbled together.

Not only are we benefiting from the effects of inlining, we're also
benefiting from the removal of unnecessary indirection. As Tom said,
"In concrete terms, there would be no reason to have tuplesort.c's
myFunctionCall2Coll, and maybe not inlineApplySortFunction either, if
the datatype-specific comparison functions had APIs that were closer
to what sorting wants rather than following the general
SQL-callable-function API." He was just referring to the benefits of
removing indirection here, so ISTM that this is really two performance
optimisations rolled into one - it's conceivable that the total
performance improvement will even exceed the isolated inlining
comparator benchmark.

As I've said, I believe this patch can be committed without
compromising the maintainability of the tuplesort code to an extent
that is not clearly worth it, through the use of a clean, macro-based
abstraction. Concerns about bloated binaries are probably not well
founded, because what I'm proposing is to a certain extent emulating
C++ templates, while using a very common pattern used with C++
templates. In the C++ world, algorithms are often generalised as
templates, so that they can be used equally well with any datatype
(that supports the interface of the template), while availing of
compiler optimisations per template instantiation (instance of using a
given type with a given template). I actually got the idea for this
patch in part from a book that I read years ago that described the
fact that counter-intuitively, std::sort() consistently outperforms
qsort(), because the comparator is often inlined, and the compiler can
generally avail of optimisations from knowing the comparator at
compile-time.

On 21 September 2011 13:47, Greg Stark <stark@mit.edu> wrote:
> This is pretty easy to measure. Just run oprofile or gprof and see
> what percentage of time for a big index build is spent in qsort.

I'll do so soon. I intend to get to this on Friday evening, and
perhaps have a proper patch to show next week.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-09-21T14:31:11Z

On 21.09.2011 17:20, Peter Geoghegan wrote:
> Even still, I
> think that the 12.5% figure is pretty pessimistic - I've already sped
> up the dell store query by almost that much, and that's with a patch
> that was, due to circumstances, cobbled together.

I'm not against making things faster, it's just that I haven't seen 
solid evidence yet that this will help. Just provide a best-case test 
case for this that shows a huge improvement, and I'll shut up. If the 
improvement is only modest, then let's discuss how big it is and whether 
it's worth the code ugliness this causes.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-21T14:50:22Z

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 21.09.2011 17:20, Peter Geoghegan wrote:
>> Even still, I
>> think that the 12.5% figure is pretty pessimistic - I've already sped
>> up the dell store query by almost that much, and that's with a patch
>> that was, due to circumstances, cobbled together.

> I'm not against making things faster, it's just that I haven't seen 
> solid evidence yet that this will help. Just provide a best-case test 
> case for this that shows a huge improvement, and I'll shut up. If the 
> improvement is only modest, then let's discuss how big it is and whether 
> it's worth the code ugliness this causes.

The other question that I'm going to be asking is whether it's not
possible to get most of the same improvement with a much smaller code
footprint.  I continue to suspect that getting rid of the SQL function
impedance-match layer (myFunctionCall2Coll etc) would provide most of
whatever gain is to be had here, without nearly as large a cost in code
size and maintainability, and with the extra benefit that the speedup
would also be available to non-core datatypes.

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Andrew Dunstan <andrew@dunslane.net> — 2011-09-21T15:03:15Z


On 09/21/2011 10:50 AM, Tom Lane wrote:
> The other question that I'm going to be asking is whether it's not
> possible to get most of the same improvement with a much smaller code
> footprint.  I continue to suspect that getting rid of the SQL function
> impedance-match layer (myFunctionCall2Coll etc) would provide most of
> whatever gain is to be had here, without nearly as large a cost in code
> size and maintainability, and with the extra benefit that the speedup
> would also be available to non-core datatypes.
>
> 			

Can we get a patch so we can do benchmarks on this?

cheers

andrew

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-21T15:22:50Z

Simon Riggs <simon@2ndQuadrant.com> writes:
> This is a marvellous win, a huge gain from a small, isolated and
> easily tested change. By far the smallest amount of additional code to
> sorting we will have added and yet one of the best gains.

I think you forgot your cheerleader uniform.  A patch along these lines
is not going to be small, isolated, easily maintained, nor beneficial
for any but a small number of predetermined datatypes.

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-09-21T15:43:31Z

On 21 September 2011 15:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> I'm not against making things faster, it's just that I haven't seen
>> solid evidence yet that this will help. Just provide a best-case test
>> case for this that shows a huge improvement, and I'll shut up. If the
>> improvement is only modest, then let's discuss how big it is and whether
>> it's worth the code ugliness this causes.

Fair enough.

> The other question that I'm going to be asking is whether it's not
> possible to get most of the same improvement with a much smaller code
> footprint.

That's a reasonable question, and I hope to be able to come up with a
good answer.

> I continue to suspect that getting rid of the SQL function
> impedance-match layer (myFunctionCall2Coll etc) would provide most of
> whatever gain is to be had here, without nearly as large a cost in code
> size and maintainability, and with the extra benefit that the speedup
> would also be available to non-core datatypes.

I'm fairly surprised that your view on that is mostly or entirely
unchanged, even after I've demonstrated a considerable performance
advantage from a macro-based qsort implementation over my OS vendor's
c std lib qsort(), using an isolated test-case, that does not have
anything to do with that impedance mismatch. I'm not sure why you
doubt that the same thing is happening within tuplesort.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-21T15:46:52Z

Andrew Dunstan <andrew@dunslane.net> writes:
> On 09/21/2011 10:50 AM, Tom Lane wrote:
>> The other question that I'm going to be asking is whether it's not
>> possible to get most of the same improvement with a much smaller code
>> footprint.  I continue to suspect that getting rid of the SQL function
>> impedance-match layer (myFunctionCall2Coll etc) would provide most of
>> whatever gain is to be had here, without nearly as large a cost in code
>> size and maintainability, and with the extra benefit that the speedup
>> would also be available to non-core datatypes.

> Can we get a patch so we can do benchmarks on this?

Well, we'd have to negotiate what the API ought to be.  What I'm
envisioning is that datatypes could provide alternate comparison
functions that are designed to be qsort-callable rather than
SQL-callable.  As such, they could not have entries in pg_proc, so
it seems like there's no ready way to represent them in the catalogs.

The idea that I was toying with was to allow the regular SQL-callable
comparison function to somehow return a function pointer to the
alternate comparison function, so that the first comparison in a given
sort run would be done the traditional way but then we'd notice the
provided function pointer and start using that.  It would not be too
hard to pass back the pointer using FunctionCallInfoData.context, say.
The downside is adding cycles to unoptimized cases to uselessly check
for a returned function pointer that's not there.  Perhaps it could be
hacked so that we only add cycles to the very first call, but I've not
looked closely at the code to see what would be involved.

Has anyone got a better idea for getting hold of the alternate function?

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-09-21T15:56:56Z

On 21.09.2011 18:46, Tom Lane wrote:
> Andrew Dunstan<andrew@dunslane.net>  writes:
>> On 09/21/2011 10:50 AM, Tom Lane wrote:
>>> The other question that I'm going to be asking is whether it's not
>>> possible to get most of the same improvement with a much smaller code
>>> footprint.  I continue to suspect that getting rid of the SQL function
>>> impedance-match layer (myFunctionCall2Coll etc) would provide most of
>>> whatever gain is to be had here, without nearly as large a cost in code
>>> size and maintainability, and with the extra benefit that the speedup
>>> would also be available to non-core datatypes.
>
>> Can we get a patch so we can do benchmarks on this?
>
> Well, we'd have to negotiate what the API ought to be.  What I'm
> envisioning is that datatypes could provide alternate comparison
> functions that are designed to be qsort-callable rather than
> SQL-callable.  As such, they could not have entries in pg_proc, so
> it seems like there's no ready way to represent them in the catalogs.
>
> The idea that I was toying with was to allow the regular SQL-callable
> comparison function to somehow return a function pointer to the
> alternate comparison function, so that the first comparison in a given
> sort run would be done the traditional way but then we'd notice the
> provided function pointer and start using that.  It would not be too
> hard to pass back the pointer using FunctionCallInfoData.context, say.
> The downside is adding cycles to unoptimized cases to uselessly check
> for a returned function pointer that's not there.  Perhaps it could be
> hacked so that we only add cycles to the very first call, but I've not
> looked closely at the code to see what would be involved.

You could have a new function with a pg_proc entry, that just returns a 
function pointer to the qsort-callback.

Or maybe the interface should be an even more radical replacement of 
qsort, not just the comparison function. Instead of calling qsort, 
tuplesort.c would call the new datatype-specific sort-function (which 
would be in pg_proc). The implementation could use an inlined version of 
qsort, like Peter is suggesting, or it could do something completely 
different, like a radix sort or a GPU-assisted sort or whatever.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Inlining comparators as a performance optimisation

Greg Stark <stark@mit.edu> — 2011-09-21T16:04:21Z

On Wed, Sep 21, 2011 at 4:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>  As such, they could not have entries in pg_proc, so
> it seems like there's no ready way to represent them in the catalogs.

Why couldn't they be in pg_proc with a bunch of opaque arguments like
the GIST opclass support functions?

I'm a bit puzzled what the arguments would look like. They would still
need to know the collation, nulls first/last flags, etc.
And calling it would still not be inlinable.  So they would have to
check those flags on each invocation instead of having a piece of
straightline code that hard codes the behaviour with the right
behaviour inline.  ISTM the hope for a speedup from the inlining
mostly came from the idea that the compiler might be able to hoist
this logic outside the loop (and I suppose implement n specialized
loops depending on the behaviour needed).

-- 
greg

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-21T16:13:07Z

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 21.09.2011 18:46, Tom Lane wrote:
>> The idea that I was toying with was to allow the regular SQL-callable
>> comparison function to somehow return a function pointer to the
>> alternate comparison function,

> You could have a new function with a pg_proc entry, that just returns a 
> function pointer to the qsort-callback.

Yeah, possibly.  That would be a much more invasive change, but cleaner
in some sense.  I'm not really prepared to do all the legwork involved
in that just to get to a performance-testable patch though.

> Or maybe the interface should be an even more radical replacement of 
> qsort, not just the comparison function. Instead of calling qsort, 
> tuplesort.c would call the new datatype-specific sort-function (which 
> would be in pg_proc). The implementation could use an inlined version of 
> qsort, like Peter is suggesting, or it could do something completely 
> different, like a radix sort or a GPU-assisted sort or whatever.

No.  In the first place, that only helps for in-memory sorts.  In the
second, it would absolutely destroy our ability to change the behavior
of sorting ever again.  Considering that we've added ASC/DESC, NULLS
FIRST/LAST, and collation support over the years, are you really
prepared to bet that the sort code will never need any more feature
upgrades?  (This concern is in fact the source of my beef with the whole
inlining proposal to begin with, but allowing the inlining to occur into
third-party code that we don't control at all would be a hundred times
worse.)

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-09-21T16:13:20Z

On 21.09.2011 18:46, Tom Lane wrote:
> Well, we'd have to negotiate what the API ought to be.  What I'm
> envisioning is that datatypes could provide alternate comparison
> functions that are designed to be qsort-callable rather than
> SQL-callable.  As such, they could not have entries in pg_proc, so
> it seems like there's no ready way to represent them in the catalogs.

Quite aside from this qsort-thing, it would be nice to have versions of 
all simple functions that could be called without the FunctionCall 
overhead. So instead of:

FunctionCall2(&flinfo_for_int4pl, 1, 2)

you could do simply

int4pl_fastpath(1,2)

I'm not sure how big an effect this would have, but it seems like it 
could shave some cycles across the system.

We could have an extended version of the PG_FUNCTION_INFO_V1 macro that 
would let you register the fastpath function:

PG_FUNCTION_INFO_V1(int4pl, int4pl_fastpath);

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-21T16:23:13Z

Greg Stark <stark@mit.edu> writes:
> On Wed, Sep 21, 2011 at 4:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> As such, they could not have entries in pg_proc, so
>> it seems like there's no ready way to represent them in the catalogs.

> Why couldn't they be in pg_proc with a bunch of opaque arguments like
> the GIST opclass support functions?

That does not mean the same thing at all.  Everything in pg_proc is
meant to be called through the V0 or V1 function call info protocols.

> I'm a bit puzzled what the arguments would look like. They would still
> need to know the collation, nulls first/last flags, etc.

No, I wasn't thinking that we should do that.  The datatype comparison
functions should have the exact same semantics they do now, just a
lower-overhead call mechanism.  If you try to push stuff like NULLS
FIRST/LAST into the per-datatype code, then you are up against a problem
when you want to add a new flag: you have to touch lots of code not all
of which you even control.

> And calling it would still not be inlinable.  So they would have to
> check those flags on each invocation instead of having a piece of
> straightline code that hard codes the behaviour with the right
> behaviour inline.  ISTM the hope for a speedup from the inlining
> mostly came from the idea that the compiler might be able to hoist
> this logic outside the loop (and I suppose implement n specialized
> loops depending on the behaviour needed).

None of that stuff is inlinable or constant-foldable today, nor would it
be with the patch that Peter was proposing AFAICS, because none of the
flags will ever be compile time constant values.

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-21T16:27:49Z

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 21.09.2011 18:46, Tom Lane wrote:
>> Well, we'd have to negotiate what the API ought to be.  What I'm
>> envisioning is that datatypes could provide alternate comparison
>> functions that are designed to be qsort-callable rather than
>> SQL-callable.  As such, they could not have entries in pg_proc, so
>> it seems like there's no ready way to represent them in the catalogs.

> Quite aside from this qsort-thing, it would be nice to have versions of 
> all simple functions that could be called without the FunctionCall 
> overhead.

Hmm, that's an interesting idea.  I think probably the important aspects
are (1) known number of arguments and (2) no null argument or result
values are allowed.  Not sure what we'd do with collations though.

> We could have an extended version of the PG_FUNCTION_INFO_V1 macro that 
> would let you register the fastpath function:
> PG_FUNCTION_INFO_V1(int4pl, int4pl_fastpath);

We don't use PG_FUNCTION_INFO_V1 for built-in functions ...

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Greg Stark <stark@mit.edu> — 2011-09-21T16:35:41Z

On Wed, Sep 21, 2011 at 5:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> None of that stuff is inlinable or constant-foldable today, nor would it
> be with the patch that Peter was proposing AFAICS, because none of the
> flags will ever be compile time constant values.

I was referring to transformations like this which I believe compilers
are already capable of doing:

v = ...;
while (...)
  if (v) {
    if (a < b) ... else ....;
  } else {
    if (a > b) ... else ...;
  }

turning it into code that looks like:

if (v) {
 while (....)
   if (a<b) ... else ...;
} else {
  while (....)
    if (a>b) ... else ...;
}

which may not look like much -- especially with branch prediction --
but then it's much more likely to be able to unroll the loop and do
clever instruction scheduling and so on than if there's an extra
branch in the middle of the loop. But if there's a function call to an
external function called through a function pointer in the middle of
the loop then the whole endeavour would be for naught.

-- 
greg

Re: Inlining comparators as a performance optimisation

Simon Riggs <simon@2ndquadrant.com> — 2011-09-21T19:21:38Z

On Wed, Sep 21, 2011 at 4:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> This is a marvellous win, a huge gain from a small, isolated and
>> easily tested change. By far the smallest amount of additional code to
>> sorting we will have added and yet one of the best gains.
>
> I think you forgot your cheerleader uniform.

LOL. I'm happy whoever and whenever we get large wins like that.

Go Postgres!

> A patch along these lines
> is not going to be small, isolated, easily maintained, nor beneficial
> for any but a small number of predetermined datatypes.

That was the starting premise.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-09-26T02:12:55Z

I've produced something much neater than my first patch, attached,
although I still consider this to be at the POC stage, not least since
I'm not exactly sure how I should be selecting the right
specialisation in tuplesort_performsort() (a hack is used right now
that does a direct pg_proc OID comparison), and also because I haven't
implemented anything other than qsort for heap tuples yet (a minor
detail, I suppose). I'm pleased to say that a much clearer picture of
what's really going on here has emerged.

Statistics: Total runtime according to explain analyze for query
"select * from orderlines order by prod_id" (dellstore2 sample db), at
GCC 4.5's -02 optimisation level, after warming the cache, on my
desktop:

Without the patch:

~82ms

With the patch, but with the "inline" keyword commented out for all
new functions/meta-functions:

~60ms

with the patch, unmodified:

~52ms

Recent experience suggests that using a macro rather than an inline
function would perhaps have been a mistake, as it would prevent us
from benefiting from the compiler's smarts in surmising just where we
should inline. As I've pointed out, individual call sites are inlined,
not individual functions.

Tom wanted to know if I could prove the benefit in inlining, as
against just resolving the impedance mismatch without bothering with
inlining (and, indeed, further optimisations that the compiler can
avail of as a result of knowing the comparator at compile-time). These
figures support my contention that these optimisations have an
important role to play. However, there's no question that resolving
the impedance mismatch is where the biggest benefit is seen,
particularly relative to maintainability. As Tom anticipated, the
question of whether or not we do the inlining is less than entirely
straightforward, as it's less of a win, and it has a higher price in
ugliness. I myself am very much of the opinion that it's still worth
it. As I've said, sorting integers and floats is very common and very
important, and this approach will likely yield benefits beyond
increasing the speed of executor sort nodes, as described below.

I accept that a more sophisticated benchmark is required here, but
I've been a little busy actually writing the patch. Any ideas anyone?
Independent benchmarks are welcome. If someone could suggest a worst
case, that would be particularly useful, as I cannot think of one - I
believe that each instance of inlining a call site more-or-less either
is or is not a net benefit here, regardless of the number of
comparisons performed, which is why the improvement is so predictable
across the size of the set of integers sorted for by qsort in
isolation (which is the big reason why that ~8ms decrease due to
inlining turns out to be not too shabby - it's a saving per
comparison, and they can add it pretty quickly). So, for example, when
I build the patch on my laptop (GCC 4.6), with an 48MB table (the same
orderlines table query, but I doubled up the data a few times
beforehand), we see an undiminished, proportional (to the prior,
isolated cost of qsorting, I think) decrease in runtime:

Without patch:

~615ms

With patch:

~415ms

One key insight that this work brings is that resolving the impedance
mismatch - making comparisons as inexpensive as possible (including
using inlining) - is the best possible strategy in improving sort
performance today, vastly more efficacious than, for example, tweaking
the sorting algorithm. If anyone can find some more ways of shaving
cycles there, it's one place where it really matters.

Changes are isolated to the extent that if you decide that you don't
like this one day, you can simply remove the calls to qsort_arg
specialisations, and once again switch back to just using what the
patch makes a fallback, qsort_arg itself. I know that we'd never
really get into that situation, but the fact that we could do that
serves to illustrate that these changes are fairly isolated.

Incidentally, if you find writing code that is heavily dependent on
macros to be a chore, I can highly recommend Clang 2.9 .

But what of the maintenance burden of mostly duplicating qsort_arg.c,
to produce a "template" version? Well, take a look at the *entire*
history for that file:

[peter@localhost port]$ git log qsort_arg.c
commit 9f2e211386931f7aee48ffbc2fcaef1632d8329f
Author: Magnus Hagander <magnus@hagander.net>
Date:   Mon Sep 20 22:08:53 2010 +0200

    Remove cvs keywords from all files.

commit b9954fbb4ef25fb1ea173d26017d4d128dd15be5
Author: Neil Conway <neilc@samurai.com>
Date:   Sun Mar 18 05:36:50 2007 +0000

    Code cleanup for function prototypes: change two K&R-style prototypes
    to ANSI-style, and change "()" -> "(void)". Patch from Stefan Huehner.

commit b38900c7677657a815e75781b776fb1e41054df3
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Thu Oct 12 15:04:55 2006 +0000

    Use Min() instead of min() in qsort, for consistency and to avoid
    redefined-macro warnings on some platforms.  Per gripe from Hiroshi Saito.

commit f99a569a2ee3763b4ae174e81250c95ca0fdcbb6
Author: Bruce Momjian <bruce@momjian.us>
Date:   Wed Oct 4 00:30:14 2006 +0000

    pgindent run for 8.2.

commit 6edd2b4a91bda90b7f0290203bf5c88a8a8504db
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Tue Oct 3 22:18:23 2006 +0000

    Switch over to using our own qsort() all the time, as has been proposed
    repeatedly.  ***SNIP MESSAGE***


I think that it is fair to say that the maintenance burden imposed by
this change is well worth it. Only one of these changes after the
initial commit is not mechanical, and that is still pretty trivial.

I've attached gprof flat profile output from unmodified PostgreSQL (at
master) when we create an index on a pgbench table:

pgbench -i -s 1500 index_test

That puts the number of rows in the pgbench_accounts table at 150
million, while the table is 19 GB in size. That’s a reasonable size,
and should usefully demonstrate the likely improvement we’ll see in
the performance of creating an index.

I increased maintenance_work_mem to 756MB, plus used a fairly
straightforward postgresql.conf, plus some other, more generic values
for other GUCs. The query is:

create INDEX on pgbench_accounts(abalance); -- abalance is of type integer

Here is the top of the flat profile, by far the most important part:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 43.56     79.29    79.29 6972025063     0.00     0.00  comparetup_index_btree
 22.39    120.04    40.75 150000000     0.00     0.00  tuplesort_heap_siftup
  4.36    127.97     7.93 2677057771     0.00     0.00  btint4cmp
  2.88    133.21     5.24 314766350     0.00     0.00  LWLockRelease
  1.81    136.50     3.29 314612660     0.00     0.00  LWLockAcquire
  1.65    139.50     3.00 450905247     0.00     0.00  AllocSetAlloc
  1.43    142.10     2.60 300000001     0.00     0.00  LogicalTapeWrite
  1.17    144.23     2.13                             comparetup_cluster
*** SNIP ***

This looks pretty top-heavy to me, and more time is spent in
comparetup_index_btree than any other function by some margin,
suggesting that it will be worthwhile to pursue similar optimisations
to make index creation faster in a later revision of this patch, or as
another patch. I didn't take care to ensure that I'd be caching the
entire table in memory, which probably would have been more useful
here.

Thoughts?

On the subject of highly ambitious optimisations to sorting, one
possibility I consider much more practicable than GPU-accelerated
sorting is simple threading; quicksort can be parallelised very
effectively, due to its divide-and-conquer nature. If we could agree
on a threading abstraction more sophisticated than forking, it's
something I'd be interested in looking at. To do so would obviously
entail lots of discussion about how that relates to whatever way we
eventually decide on implementing parallel query, and that's obviously
a difficult discussion.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-09-26T03:46:29Z

On Sun, Sep 25, 2011 at 10:12 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> [ new results ]

Nice results.  I think these are far more convincing than the last
set, because (1) the gains are bigger and (2) they survive -O2 and (3)
you tested an actual query, not just qsort() itself.

I don't want to take time to review this in detail right now, because
I don't think it would be fair to put this ahead of things that were
submitted for the current CommitFest, but I'm impressed.

> On the subject of highly ambitious optimisations to sorting, one
> possibility I consider much more practicable than GPU-accelerated
> sorting is simple threading; quicksort can be parallelised very
> effectively, due to its divide-and-conquer nature. If we could agree
> on a threading abstraction more sophisticated than forking, it's
> something I'd be interested in looking at. To do so would obviously
> entail lots of discussion about how that relates to whatever way we
> eventually decide on implementing parallel query, and that's obviously
> a difficult discussion.

I have the same feeling about about this that I do about almost every
executor optimization that anyone proposes: the whole project would be
entirely simple and relatively painless if it weren't for the need to
make planner changes.  I mean, deciding on a threading interface is
likely to be a somewhat contentious discussion, with differences of
opinion on whether we should do it and what the API should look like.
But at the end of the day it's not rocket science, and I expect that
we would end up with something reasonable.  What seems much harder is
figuring out how to decide when to perform quicksort in parallel vs.
single-threaded, and how much parallelism would be appropriate.  I
haven't seen anyone propose even a shadow of an idea about how to make
such decisions intelligently, either in general or in specific cases.

The other issue is that, while threading would be possibly suitable
for this particular case, at least for built-in datatypes with
comparison operations that basically reduce to single machine-language
comparison instructions, it's hard to see how we could take it much
further.  It would be unsafe for these multiple threads of execution
to do anything that could possibly throw an error or anything that
touches a lightweight lock or, really, just about anything at all.
Trying to make the entire backend thread-safe - or even any
significant portion of it - seems like a colossal effort that will
most likely fail, but maybe not without eating an enormous amount of
developer time first.  And without doing that, I don't think we could
even extend this as far as, say, numeric, whose functions do things
like palloc() and ereport() internally.  So I feel like this whole
approach might be a dead-end - there's a narrow range of cases where
it could be made to work, I think, but after that I think you hit a
titanium wall.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-09-26T13:53:49Z

On 26 September 2011 04:46, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't want to take time to review this in detail right now, because
> I don't think it would be fair to put this ahead of things that were
> submitted for the current CommitFest, but I'm impressed.

Thank you.

Now that I think about it, the if statement that determines if the
int4 specialisation will be used may be okay - it sort of documents
the conditions under which the int4 specialisation should be used with
reference to how the "else" generic case actually works, which is
perhaps no bad thing. I now realise that I should be using constants
from fmgroids.h though.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Bruce Momjian <bruce@momjian.us> — 2011-10-05T01:49:40Z

Peter Geoghegan wrote:
> On the subject of highly ambitious optimisations to sorting, one
> possibility I consider much more practicable than GPU-accelerated
> sorting is simple threading; quicksort can be parallelised very
> effectively, due to its divide-and-conquer nature. If we could agree
> on a threading abstraction more sophisticated than forking, it's
> something I'd be interested in looking at. To do so would obviously
> entail lots of discussion about how that relates to whatever way we
> eventually decide on implementing parallel query, and that's obviously
> a difficult discussion.

I agree that the next big challenge for Postgres is parallel operations.
With the number of cores increasing, and with increased memory and SSD,
parallel operation is even more important.  Rather than parallelizing
the entire backend, I imagine adding threading or helper processes for
things like sorts, index scans, executor nodes, and stored procedure
languages.  I expect final code to be 2-3 years in the future.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: Inlining comparators as a performance optimisation

Greg Stark <stark@mit.edu> — 2011-10-05T02:55:28Z

On Wed, Oct 5, 2011 at 2:49 AM, Bruce Momjian <bruce@momjian.us> wrote:

> Rather than parallelizing
> the entire backend, I imagine adding threading or helper processes for
> things like sorts, index scans, executor nodes, and stored procedure
> languages.  I expect final code to be 2-3 years in the future.

I don't actually see it would be a big problem quicksort to start up a
bunch of threads which do some of the work and go away when the
tuplesort is done. As long as the code they're executing is well
defined and limited to a small set of code that can be guaranteed to
be thread-safe it should be reasonably simple.

The problem is that in most of the Postgres core there aren't many
places where much code fits that description. Even in tuplesort it can
call out to arbitrary user-defined functions so we would need a way to
mark which functions are thread-safe. Beyond integer and floating
point comparisons it may not be much -- certainly not Numeric or
strings due to detoasting... And then there are things like handling
the various signals (including SI invalidations?) and so on.

I agree that if we wanted to farm out entire plan nodes we would
probably end up generating new partial plans which would be handed to
other backend processes. That's not unlike what Oracle did with
Parallel Query. But i'm a bit skeptical that we'll get much of that
done in 2-3 years. The main use case for Parallel Query in Oracle is
for partitioned tables -- and we haven't really polished that after
how many years?

-- 
greg

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-10-05T13:55:04Z

On Tue, Oct 4, 2011 at 10:55 PM, Greg Stark <stark@mit.edu> wrote:
> I agree that if we wanted to farm out entire plan nodes we would
> probably end up generating new partial plans which would be handed to
> other backend processes. That's not unlike what Oracle did with
> Parallel Query. But i'm a bit skeptical that we'll get much of that
> done in 2-3 years. The main use case for Parallel Query in Oracle is
> for partitioned tables -- and we haven't really polished that after
> how many years?

Partitioning hasn't been completely neglected; 9.1 adds support for
something called Merge Append.  You may have heard of (or, err,
authored) that particular bit of functionality.  :-)

Of course, it would be nice to have a better syntax, but I don't think
the lack of it should discourage us from working on parallel query,
which is a muti-part problem.  You need to:

- have planner support to decide when to parallelize
- have a mechanism for firing up worker processes and synchronizing
the relevant bits of state between the master and the workers
- have an IPC mechanism for streaming data between the master process
and the workers
- figure out how to structure the executor so that the workers neither
stall nor run too far ahead of the master (which might have LIMIT 10
or something)

Markus Wanner took a crack at generalizing the autovacuum machinery
that we have now into something that could be used to fire up
general-purpose worker processes, but it fell down mostly because I
(and, I think, others) weren't convinced that imessages were something
we wanted to suck into core, and Markus reasonably enough wasn't
interested in rewriting it to do something that wouldn't really help
his work with Postgres-R.  I'm not sure where Bruce is getting his
timeline from, but I think the limiting factor is not so much that we
don't have people who can write the code as that those people are
busy, and this is a big project.  But you can bet that if it gets to
the top of Tom's priority list (just for example) we'll see some
motion...!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Bruce Momjian <bruce@momjian.us> — 2011-10-05T15:31:57Z

Robert Haas wrote:
> Markus Wanner took a crack at generalizing the autovacuum machinery
> that we have now into something that could be used to fire up
> general-purpose worker processes, but it fell down mostly because I
> (and, I think, others) weren't convinced that imessages were something
> we wanted to suck into core, and Markus reasonably enough wasn't
> interested in rewriting it to do something that wouldn't really help
> his work with Postgres-R.  I'm not sure where Bruce is getting his
> timeline from, but I think the limiting factor is not so much that we
> don't have people who can write the code as that those people are
> busy, and this is a big project.  But you can bet that if it gets to
> the top of Tom's priority list (just for example) we'll see some
> motion...!

I was thinking of setting up a team to map out some strategies and get
community buy-in, and then we could attack each issue.  I got the 2-3
years from the Win32 timeline.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-11-18T05:20:26Z

On Sun, Sep 25, 2011 at 10:12 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> I've produced something much neater than my first patch, attached,
> although I still consider this to be at the POC stage, not least since
> I'm not exactly sure how I should be selecting the right
> specialisation in tuplesort_performsort() (a hack is used right now
> that does a direct pg_proc OID comparison), and also because I haven't
> implemented anything other than qsort for heap tuples yet (a minor
> detail, I suppose). I'm pleased to say that a much clearer picture of
> what's really going on here has emerged.
>
> Statistics: Total runtime according to explain analyze for query
> "select * from orderlines order by prod_id" (dellstore2 sample db), at
> GCC 4.5's -02 optimisation level, after warming the cache, on my
> desktop:
>
> Without the patch:
>
> ~82ms
>
> With the patch, but with the "inline" keyword commented out for all
> new functions/meta-functions:
>
> ~60ms
>
> with the patch, unmodified:
>
> ~52ms

Reviewing away when I should be sleeping, wahoo...!

I think that we should really consider doing with this patch what Tom
suggested upthread; namely, looking for a mechanism to allow
individual datatypes to offer up a comparator function that doesn't
require bouncing through FunctionCall2Coll().  It seems to me that
duplicating the entire qsort() algorithm is a little iffy.  Sure, in
this case it works out to a win.  But it's only a small win -
three-quarters of it is in the uncontroversial activity of reducing
the impedance mismatch - and how do we know it will always be a win?
Adding more copies of the same code can be an anti-optimization if it
means that a smaller percentage of it fits in the instruction cache,
and sometimes small changes in runtime are caused by random shifts in
the layout of memory that align things more or less favorably across
cache lines rather than by real effects.  Now it may well be that this
is a real effect, but will it still look as good when we do this for
10 data types?  For 100 data types?

In contrast, it seems to me that reducing the impedance mismatch is
something that we could go and do across our entire code base, and
every single data type would benefit from it.  It would also be
potentially usable by other sorting algorithms, not just quick sort.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Simon Riggs <simon@2ndquadrant.com> — 2011-11-18T08:53:09Z

On Fri, Nov 18, 2011 at 5:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> I think that we should really consider doing with this patch what Tom
> suggested upthread; namely, looking for a mechanism to allow
> individual datatypes to offer up a comparator function that doesn't
> require bouncing through FunctionCall2Coll().  It seems to me that
> duplicating the entire qsort() algorithm is a little iffy.  Sure, in
> this case it works out to a win.  But it's only a small win -
> three-quarters of it is in the uncontroversial activity of reducing
> the impedance mismatch - and how do we know it will always be a win?
> Adding more copies of the same code can be an anti-optimization if it
> means that a smaller percentage of it fits in the instruction cache,
> and sometimes small changes in runtime are caused by random shifts in
> the layout of memory that align things more or less favorably across
> cache lines rather than by real effects.  Now it may well be that this
> is a real effect, but will it still look as good when we do this for
> 10 data types?  For 100 data types?
>
> In contrast, it seems to me that reducing the impedance mismatch is
> something that we could go and do across our entire code base, and
> every single data type would benefit from it.  It would also be
> potentially usable by other sorting algorithms, not just quick sort.

I don't think its credible to implement that kind of generic
improvement at this stage of the release cycle. That has a much bigger
impact since it potentially effects all internal datatypes and
external ones also. Definitely a longer term way forward.

If individual datatypes offer up a comparator function that is easily
going to result in more code than is being suggested here. So the
argument about flooding the CPU cache works against your alternate
proposal, not in favour of it.

We have no proof that we need to do this for 10 or 100 data types. We
only currently have proof that there is gain for the most common
types. Of course, it sounds like it might be useful to allow any data
type to gain an advantage, but we shouldn't be blind to the point that
almost nobody will use such a facility, and if they do the code won't
be written for a long time yet. If this came as a request from custom
datatype authors complaining of slow sorts it would be different, but
it didn't so we don't even know if anybody would ever write user
defined comparator routines. Rejecting a patch because of a guessed
user requirement is not good.

Peter's suggested change adds very few lines of code and those compile
to some very terse code, a few hundred instructions at very most.
Requesting an extra few cachelines to improve qsort by so much is
still an easy overall win.

The OP change improves qsort dramatically, and is a small, isolated
patch. There is no significant downside. We also have it now, so lets
commit this, chalk up another very good performance improvement and
use our time on something else this commitfest, such as the flexlocks
idea.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-11-18T13:39:38Z

On Fri, Nov 18, 2011 at 3:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> We have no proof that we need to do this for 10 or 100 data types. We
> only currently have proof that there is gain for the most common
> types.

Well, that's kind of my point.  I think this needs more work before we
decide what the best approach is.  So far, the ONLY test result we
have that compares inlining to not inlining shows a speedup from 60 ms
to 52 ms.  I think that an 8 ms speedup on one test with one datatype
on one platform/compiler combination isn't sufficient evidence to
conclude that this is the best possible approach.

I think the way to look at this is that this patch contains two
possibly good ideas: one of which is to make the qsort() argument
match the qsort() calling convention, and the other of which is to
have multiple copies of the qsort() logic that inline the comparators
for their respective datatypes.  Tom hypothesized that most of the
benefit was in the first idea, and the numbers that Peter posted seem
to support that conclusion.  The first idea is also less invasive and
more broadly applicable, so to me that seems like the first thing to
pursue.

Now, that doesn't mean that we shouldn't consider the second idea as
well, but I don't believe that the evidence presented thus far is
sufficient to prove that we should go that route.  It seems entirely
possible that inlining any non-trivial comparator function could work
out to a loss, or that the exact choice of compiler flags could affect
whether inlining works out to a plus or a minus (what happens with -O3
vs. -O2?  what about -O1?  what about -O2 -fno-omit-frame-pointer?
what if they're using HP's aCC or Intel's icc rather than gcc?).
There's no point in installing an optimization that could easily be a
pessimization on some other workload, and that hasn't been tested, or
at least no results have been posted here.  On the other hand,
matching the calling convention of the comparator function to what
qsort() wants and eliminating the trampoline seems absolutely certain
to be a win in every case, and based on the work Peter has done it
seems like it might be a quite large win.

In fact, you have to ask yourself just exactly how much our
function-calling convention is costing us in general.  We use that
mechanism an awful lot and whatever loss of cycles is involved would
be spread all over the code base where oprofile or gprof won't easily
be able to pick it out.  Even the cost at any particular call site
will be split between the caller and the callee.  There might be more
stuff we could optimize here than just sorting...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-11-18T14:11:27Z

Simon Riggs <simon@2ndQuadrant.com> writes:
> On Fri, Nov 18, 2011 at 5:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think that we should really consider doing with this patch what Tom
>> suggested upthread; namely, looking for a mechanism to allow
>> individual datatypes to offer up a comparator function that doesn't
>> require bouncing through FunctionCall2Coll().

> I don't think its credible to implement that kind of generic
> improvement at this stage of the release cycle.

Er, *what*?  We're in mid development cycle, we are nowhere near
release.  When exactly would you have us make major changes?

In any case, what I understood Robert to be proposing was an add-on
feature that could be implemented in one datatype at a time.  Not
a global flag day.  We couldn't really do the latter anyway without
making life very unpleasant for authors of extension datatypes.

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-11-18T21:38:18Z

On 18 November 2011 05:20, Robert Haas <robertmhaas@gmail.com> wrote:
> I think that we should really consider doing with this patch what Tom
> suggested upthread; namely, looking for a mechanism to allow
> individual datatypes to offer up a comparator function that doesn't
> require bouncing through FunctionCall2Coll().  It seems to me that
> duplicating the entire qsort() algorithm is a little iffy.

I understand that we highly value extensibility and genericity (yes,
that's a real word). We may not always be well served by that
tendency.

> Sure, in this case it works out to a win.  But it's only a small win -
> three-quarters of it is in the uncontroversial activity of reducing
> the impedance mismatch - and how do we know it will always be a win?

Firstly, 1/4 of a quite large gain is still a pretty good gain.
Secondly, I probably didn't actually isolate the effects of inlining,
nor the overall benefit of the compiler knowing the comparator at
compile-time. I just removed the inline keyword. Those are two
different things. The inline keyword serves as a request to the
compiler to inline. The compiler can and often will ignore that
request. Most people know that. What isn't so widely known is that
modern compilers may equally well inline even when they haven't been
asked to (but only when they can). When you also consider, as I've
already pointed out several times, that individual call sites are
inlined, it becomes apparent that there may well still be a certain
amount of inlining and/or other optimisations like procedural
integration going on at some call sites even without the encouragement
of the inline keyword, that would not have been performed without the
benefit of compile-time comparator knowledge. The addition of the
inline keyword may just, in this particular case, have the compiler
inline even more call sites.  Posting the ~8ms difference was
motivated by a desire to prove that inlining had *some* role to play,
without actually going to the trouble of implementing Tom's idea as a
basis of comparison, because Tom was very sceptical of inlining.

The long and the short of it is that I'm going to have to get my hands
dirty with a dissembler before we really know exactly what's
happening. That, or I could use an optimisation fence of some type.

> Adding more copies of the same code can be an anti-optimization if it
> means that a smaller percentage of it fits in the instruction cache,
> and sometimes small changes in runtime are caused by random shifts in
> the layout of memory that align things more or less favorably across
> cache lines rather than by real effects.  Now it may well be that this
> is a real effect, but will it still look as good when we do this for
> 10 data types?  For 100 data types?

I'd favour limiting it to just the common integer and float types.

> In contrast, it seems to me that reducing the impedance mismatch is
> something that we could go and do across our entire code base, and
> every single data type would benefit from it.  It would also be
> potentially usable by other sorting algorithms, not just quick sort.

Suppose that we went ahead and added that infrastructure. What you
must acknowledge is that one reason that this speed-up is so dramatic
is that the essential expense of a comparison is already so low - a
single instruction - and therefore the overall per-comparison cost
goes way down, particularly if the qsort inner loop can store the code
across fewer cache lines. For that reason, any absolute improvement
that you'll see in complex datatypes will be smaller, maybe much
smaller, because for each comparison we'll execute many more
instructions that are essential to the comparison. In my estimation,
all of this work does not point to there being an undue overhead in
the function calling convention as you suggested. Still, I'm not
opposed to investigating generalising this in some way, reservations
notwithstanding, unless we have to block-wait on it. I don't want to
chase diminishing returns too far.

> Well, that's kind of my point.  I think this needs more work before we
> decide what the best approach is.

Agreed.

> So far, the ONLY test result we
> have that compares inlining to not inlining shows a speedup from 60 ms
> to 52 ms.  I think that an 8 ms speedup on one test with one datatype
> on one platform/compiler combination isn't sufficient evidence to
> conclude that this is the best possible approach.

Fair enough, but it's not the only test I did - I posted other numbers
for the same query when the table was 48mb, and we saw a proportional
improvement, consistent with a per-comparison win. I'm supposed to be
on leave for a few days at the moment, so I won't be very active this
weekend, but I'm rather curious as to where you or others would like
to see me go with benchmarks.

I should point out that we currently don't have much idea how big of a
win applying these principles could be for index creation times...it
could possibly be very significant. My profiling of index creation
makes this looks promising.

On 18 November 2011 13:39, Robert Haas <robertmhaas@gmail.com> wrote:
> It seems entirely possible that inlining any non-trivial comparator function could work
> out to a loss,

Compilers weigh the size of the function to be inlined heavily when
deciding, on a call site by call site basis, whether or not to inline.
Much effort has been expended on making these heuristics work well.
Besides, you're the one advocating pursuing this for types with
non-trivial comparators. This will certainly be less effective for
such non-trivial types, perhaps quite a lot less effective, as already
noted.

It's worth acknowledging that sorting of integers and floats is in
general very important, probably more important than any other sorting
case.

> or that the exact choice of compiler flags could affect
> whether inlining works out to a plus or a minus (what happens with -O3
> vs. -O2?  what about -O1?  what about -O2 -fno-omit-frame-pointer?
> what if they're using HP's aCC or Intel's icc rather than gcc?).

If you'd like to see numbers for another platform, I can get those.
How about Windows? I only have access to x86_64 boxes. I don't see
that the exact compiler flags used will influence whether or not
inlining is a win, so much as whether or not it occurs at all. As to
-fno-omit-frame-pointer or something nullifying the benefits, well,
that's a bit fantastical. We're already at the mercy of the compiler
to do the right thing at this level of granularity. I really just want
to help/enable it.

> There's no point in installing an optimization that could easily be a
> pessimization on some other workload, and that hasn't been tested, or
> at least no results have been posted here.

Hmm. Well, I accept the burden of proof lies with me here. I have a
hard time believing that inlining/compile-time optimisation benefits
(benefits which, to repeat for emphasis, have not been
isolated/measured yet) could be entirely nullified by CPU caching
effects on any plausible workload. What does a benchmark that
alleviates your concerns here look like?

The way that I understand inlining to interact with CPU cache is that
it will increase the number of cache misses if it causes a cache line
to be crossed, but will decrease the number of cache misses when
locality of reference is improved such that less cache lines are
crossed within an inner loop. Well, the overall cost of a sort is
measured in the number of comparisons performed, so locality of
reference is particularly important here - in general, the comparator
is called lots of times.

What I'm having a hard time with is general scepticism of the value of
inlining, when the self-contained test case I posted showed an
inlining qsort() radically out-performing my system's qsort(), without
any impedance mismatch to muddy the waters. Sure, you can hold me to
account and get me to prove my claim, but don't you think it's curious
that that effect was observed there?

Incidentally, I'd like to call this fast path sorting, if it ends up
being committed in a form that is not significantly different to what
we have now.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Martijn van Oosterhout <kleptog@svana.org> — 2011-11-18T22:08:18Z

On Fri, Nov 18, 2011 at 12:20:26AM -0500, Robert Haas wrote:
> I think that we should really consider doing with this patch what Tom
> suggested upthread; namely, looking for a mechanism to allow
> individual datatypes to offer up a comparator function that doesn't
> require bouncing through FunctionCall2Coll().  It seems to me that
> duplicating the entire qsort() algorithm is a little iffy.  Sure, in
> this case it works out to a win.  But it's only a small win -
> three-quarters of it is in the uncontroversial activity of reducing
> the impedance mismatch - and how do we know it will always be a win?

There's always the old idea of a data type providing a function mapping
f:(type -> int64) in such a way that it preserves order. That is, in the
sense that:

f(x) < f(y)  =>  x < y

When sorting, you add f(x) as hidden column and change "ORDER BY x" to
"ORDER BY f(x), x".  Then you only need to special case the int64
version.  This would mean that in most cases you may be able to skip
the call because you're comparing integers.  The downside is you need
to call f on each input.  It depends on the datatype if that's cheaper
or not, but for all numerics types I think it's an easy win.

I don't think anyone has written a proof of concept for this. It does
have the advantage of scaling better than coding a qsort for each
individual type.

Have a nice day,
-- 
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
   -- Arthur Schopenhauer

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-11-19T02:55:54Z

On Fri, Nov 18, 2011 at 4:38 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> I understand that we highly value extensibility and genericity (yes,
> that's a real word). We may not always be well served by that
> tendency.

True (except that genericity is not a synonym for generality AFAICT).
A good fraction of the optimizations that we've done over the last,
well, pick any arbitrary period of time consists in adding special
cases that occur frequently enough to be worth optimizing - e.g. HOT,
or fast relation locks - often to extremely good effect.

> Firstly, 1/4 of a quite large gain is still a pretty good gain.
> Secondly, I probably didn't actually isolate the effects of inlining,
> nor the overall benefit of the compiler knowing the comparator at
> compile-time. I just removed the inline keyword.

Maybe we should look at trying to isolate that a bit better.

It strikes me that we could probably create an API that would support
doing either of these things depending on the wishes of the underlying
datatype.  For example, imagine that we're sorting with <(int4, int4).
 We associate a PGPROC-callable function with that operator that
returns "internal", really a pointer to a struct.  The first element
of the struct is a pointer to a comparison function that qsort() (or a
tape sort) can invoke without a trampoline; the second is a wholesale
replacement for qsort(); either or both can be NULL.  Given that, it
seems to me that we could experiment with this pretty easily, and if
it turns out that only one of them is worth doing, it's easy to drop
one element out of the structure.

Or do you have another plan for how to do this?

> Fair enough, but it's not the only test I did - I posted other numbers
> for the same query when the table was 48mb, and we saw a proportional
> improvement, consistent with a per-comparison win. I'm supposed to be
> on leave for a few days at the moment, so I won't be very active this
> weekend, but I'm rather curious as to where you or others would like
> to see me go with benchmarks.
>
> I should point out that we currently don't have much idea how big of a
> win applying these principles could be for index creation times...it
> could possibly be very significant. My profiling of index creation
> makes this looks promising.

Have you done any benchmarks where this saves seconds or minutes,
rather than milliseconds?  That would certainly make it more exciting,
at least to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-11-20T03:29:07Z

On 19 November 2011 02:55, Robert Haas <robertmhaas@gmail.com> wrote:
> Maybe we should look at trying to isolate that a bit better.

Indeed. Fortunately, GCC has options to disable each optimisation.
Here's potentially relevant flags that we're already implicitly using
at -02:

-finline-small-functions <-- this is the one that inlined functions
without an "inline" keyword
-findirect-inlining
-finline-functions-called-once
-fearly-inlining
-fipa-sra (Perform interprocedural scalar replacement of aggregates,
removal of unused parameters and replacement of parameters passed by
reference by parameters passed by value).

In an effort to better isolate the effects of inlining, I tried this:

./configure CFLAGS="-fno-inline -fno-inline-small-functions" (I could
have disabled more -02 optimisations, but this proved sufficient to
make my point)

Unsurprisingly, this makes things slower than regular -02.

With the patch, the same query (once again, using explain analyze)
with the same fast path quicksort stabilises around 92ms +/- 5ms
(recall that the original figure was ~82ms). Our gains evaporate, and
then some. Take away the additional CLAGS, and we're predictably back
to big gains, with the query taking ~52ms just as before.

What happens to this query when we build an unmodified postgres with
these same CFLAGS? Well, we see the query take ~116ms after a few
runs. It seems that the impedance mismatch matters, but inlining and
other optimisations look to be at least as important. This isn't
surprising to me, given what I was able to do with the isolated test.
Maybe I should have tried it with additional disabling of
optimisations named above, but that would have perhaps made things
less clear.

I'd probably have been better off directly measuring qsort speed and
only passing those flags when compiling tuplesort.c (maybe impedance
mismatch issues would have proven to have been even less relevant),
but I wanted to do something that could be easily recreated, plus it's
late.

> It strikes me that we could probably create an API that would support
> doing either of these things depending on the wishes of the underlying
> datatype.  For example, imagine that we're sorting with <(int4, int4).
>  We associate a PGPROC-callable function with that operator that
> returns "internal", really a pointer to a struct.  The first element
> of the struct is a pointer to a comparison function that qsort() (or a
> tape sort) can invoke without a trampoline; the second is a wholesale
> replacement for qsort(); either or both can be NULL.  Given that, it
> seems to me that we could experiment with this pretty easily, and if
> it turns out that only one of them is worth doing, it's easy to drop
> one element out of the structure.
>
> Or do you have another plan for how to do this?

I haven't given it much thought. Let me get back to you on that next week.

> Have you done any benchmarks where this saves seconds or minutes,
> rather than milliseconds?  That would certainly make it more exciting,
> at least to me.

Right. Well, I thought I'd use pgbench to generate a large table in a
re-creatable way. That is:

pgbench -i -s 60

This puts pgbench_accounts at 769MB. Then, having increased work_mem
to 1GB (enough to qsort) and maintenance_work_mem to 756mb, I decided
to test this query with the patch:

explain analyze select * from pgbench_accounts order BY abalance;

This stabilised at ~3450ms, through repeatedly being executed.

How does this compare to unpatched postgres? Well, it stabilised at
about ~3780ms for the same query.

This patch is obviously less of a win as the number of tuples to sort
goes up. That's probably partly explained by the cost of everything
else going up at a greater rate than the number of comparisons. I
suspect that if we measure qsort in isolation, we'll see better
results, so we may still see a good win on index creation time as a
result of this work.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-11-20T04:13:36Z

On 20 November 2011 03:29, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> ./configure CFLAGS="-fno-inline -fno-inline-small-functions" (I could
> have disabled more -02 optimisations, but this proved sufficient to
> make my point)

I'll isolate this further to tuplesort.c soon, which ought to be a lot
more useful.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-11-21T02:02:06Z

Part of my problem with not producing specialisations that I really
neglected to complain about until now is the inconsistency between
types, and the need to deal with that.

We benefit from assuming in our specialisation that we're only dealing
with POD types, simply not considering things that we would otherwise
have had to for the benefit of types that have a Datum representation
that is pass-by-reference, or have collations, or have multiple
scankeys. Leaving aside the question of straight compiler-optimisation
benefits for the moment, the remaining benefit of what I've done comes
not so much from avoiding the usual function call machinery per se, as
from doing so *as well as* cutting down on what currently happens in
comparetup_heap to handle every single compound and scalar type.
Compare the function comparetup_heap with my meta-function
TYPE##AppSort to see what I mean. The function comparetup_heap is the
comparator directly used by qsort_arg when sorting heap tuples, and
qsort_arg outsources to comparetup_heap some things that you might not
expect it to (very little has changed about qsort_arg since we lifted
it from NetBSD back in 2006). So while you might imagine that that
loop in comparetup_heap and things like its use of the heap_getattr
macros won't expand to that many instructions, you really don't want
them in the inner loop of a long operation like qsorting, paid for up
to O(n ^ 2) times. There's also the obvious implications for compiler
optimisations, particularly relating to effective usage of CPU cache.

I'm trying to get you what you asked for: A straight choice between
what Tom suggested and what I suggested, with perhaps some compromises
between the two positions. That's sort of tricky though, especially
considering the issues raised above.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-11-21T22:55:18Z

On Tue, Sep 20, 2011 at 7:53 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> I don't think that the fact that that happens is at all significant at
> this early stage, and it never even occurred to me that you'd think
> that it might be. I was simply disclosing a quirk of this POC patch.
> The workaround is probably to use a macro instead. For the benefit of
> those that didn't follow the other threads, the macro-based qsort
> implementation, which I found to perform significantly better than
> regular qsort(), runs like this on my laptop when I built at 02 with
> GCC 4.6 just now:
>
> C stdlib quick-sort time elapsed: 2.092451 seconds
> Inline quick-sort time elapsed: 1.587651 seconds

Results on my machine, for what they're worth:

[rhaas inline_compar_test]$ gcc -O0 qsort-inline-benchmark.c
[rhaas inline_compar_test]$ ./a.out
C stdlib quick-sort time elapsed: 2.366762 seconds
Inline quick-sort time elapsed: 1.807951 seconds
[rhaas inline_compar_test]$ gcc -O1 qsort-inline-benchmark.c
[rhaas inline_compar_test]$ ./a.out
C stdlib quick-sort time elapsed: 1.970473 seconds
Inline quick-sort time elapsed: 1.002765 seconds
[rhaas inline_compar_test]$ gcc -O2 qsort-inline-benchmark.c
[rhaas inline_compar_test]$ ./a.out
C stdlib quick-sort time elapsed: 1.966408 seconds
Inline quick-sort time elapsed: 0.958999 seconds
[rhaas inline_compar_test]$ gcc -O3 qsort-inline-benchmark.c
[rhaas inline_compar_test]$ ./a.out
C stdlib quick-sort time elapsed: 1.988693 seconds
Inline quick-sort time elapsed: 0.975090 seconds

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-11-23T01:09:23Z

On 21 November 2011 22:55, Robert Haas <robertmhaas@gmail.com> wrote:
> Results on my machine, for what they're worth:

I was curious as to how much of an improvement I'd made to sorting in
isolation. I've adding simple sort timing instrumentation to the code
in a revised patch, attached, for the convenience of reviewers.

This patch adds an int8 specialisation, but it also supports fast path
sorting of timestamps by using that same int8 specialisation (for
HAVE_INT64_TIMESTAMP builds at least - didn't know if it was worth
handling the NaN crud for the other case). ISTM that various different
types with identical internal representations can share
specialisations like this, which I hope will alleviate some earlier
concerns about both the limited applicability of this optimisation and
the possible proliferation of specialisations.

Here's some raw qsort figures in seconds for the query "select * from
orderlines order by test" (where test is a timestamptz column I added
and updated with some data, executing repeatedly on the same machine
as before):

Before:

DEBUG:  elapsed: 0.031738
DEBUG:  elapsed: 0.031595
DEBUG:  elapsed: 0.031502
DEBUG:  elapsed: 0.031378
DEBUG:  elapsed: 0.031359
DEBUG:  elapsed: 0.031413
DEBUG:  elapsed: 0.031499
DEBUG:  elapsed: 0.031394
DEBUG:  elapsed: 0.031368

After:

DEBUG:  elapsed: 0.013341
DEBUG:  elapsed: 0.014608
DEBUG:  elapsed: 0.013809
DEBUG:  elapsed: 0.013244
DEBUG:  elapsed: 0.014307
DEBUG:  elapsed: 0.013207
DEBUG:  elapsed: 0.013227
DEBUG:  elapsed: 0.013264
DEBUG:  elapsed: 0.013143
DEBUG:  elapsed: 0.013455
DEBUG:  elapsed: 0.013447

I wonder, is it worth qualifying that the "Sort Method" was a
"quicksort (fast path)" sort within explain analyze output? This is a
rather large improvement, so It might actually be something that
people look out for, as it might be tricky to remember the exact
circumstances under which the optimisation kicks in by the time we're
done here.

I haven't had as much time as I'd like to polish this patch, or to get
clearer answers. I expect to spend more time on it over the weekend.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Simon Riggs <simon@2ndquadrant.com> — 2011-11-23T14:48:22Z

On Fri, Nov 18, 2011 at 2:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On Fri, Nov 18, 2011 at 5:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I think that we should really consider doing with this patch what Tom
>>> suggested upthread; namely, looking for a mechanism to allow
>>> individual datatypes to offer up a comparator function that doesn't
>>> require bouncing through FunctionCall2Coll().
>
>> I don't think its credible to implement that kind of generic
>> improvement at this stage of the release cycle.
>
> Er, *what*?  We're in mid development cycle, we are nowhere near
> release.  When exactly would you have us make major changes?
>
> In any case, what I understood Robert to be proposing was an add-on
> feature that could be implemented in one datatype at a time.  Not
> a global flag day.  We couldn't really do the latter anyway without
> making life very unpleasant for authors of extension datatypes.

Tom, whenever you think I've said something you really disagree with,
just assume there's a misunderstanding. Like here.

Of course it is OK to make such changes at this time.

Given we have <2 months to the last CF of this release, inventing a
generic infrastructure is unlikely to be finished and complete in this
dev cycle, so requesting that isn't a practical suggestion, IMHO.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-11-23T19:24:22Z

On Tue, Nov 22, 2011 at 8:09 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> I wonder, is it worth qualifying that the "Sort Method" was a
> "quicksort (fast path)" sort within explain analyze output? This is a
> rather large improvement, so It might actually be something that
> people look out for, as it might be tricky to remember the exact
> circumstances under which the optimisation kicks in by the time we're
> done here.

Well, right now the decision as to which mechanism should be used here
gets made in tuplesort_performsort(), which has no good way of
communicating with EXPLAIN anyway.  Actually, I think that's a
modularity violation; using the address of comparetup_heap as a flag
value seems quite ugly.  How about moving that logic up to
tuplesort_begin_heap() and having it set some state inside the
Tuplesort, maybe based on a flag in the opclass (or would it have to
attach to the individual operator)?

At least on my machine, your latest patch reliably crashes the
regression tests in multiple places.

The following test case also crashes them for me (perhaps for the same
reason the regression tests crash):

create table i4 (a int, b int);
insert into i4 values (4, 1), (2, 1), (0, 1), (null, 1), (-2, 1), (-7,
1), (4, 2), (4, 3), (4, 4);
select * from i4 order by 1, 2;
TRAP: FailedAssertion("!(state->nKeys == 1)", File: "tuplesort.c", Line: 1261);

The formatting of src/include/utils/template_qsort_arg.h is hard to
read.  At ts=8, the backslashes line up, but the code doesn't fit in
80 columns.  If you set ts=4, then it fits in 80 columns, but the
backslashes don't line up any more, and the variable declarations
don't either.  I believe ts=4 is project standard.

I still think it would be a good idea to provide a mechanism to
override heap_comparetup() with a type-specific function.  I don't
think that would take much extra code, and then any data type could
get at least that much benefit out of this.

It seems like it could be a good idea to do some
per-assembler-instruction profiling of this code, and perhaps also of
the original code.  I'm curious where the time is being spent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-11-25T01:05:15Z

On 23 November 2011 19:24, Robert Haas <robertmhaas@gmail.com> wrote:
> Well, right now the decision as to which mechanism should be used here
> gets made in tuplesort_performsort(), which has no good way of
> communicating with EXPLAIN anyway.

You could pretty easily add something to Tuplesortstate to accomplish
this. That isn't an endorsement of doing so, but I'm not sure that it
isn't appropriate.

> Actually, I think that's a
> modularity violation; using the address of comparetup_heap as a flag
> value seems quite ugly. How about moving that logic up to
> tuplesort_begin_heap()

I'll post a patch soon that does just that in the next day or two.
Tuplesortstate has a pointer to a sort specialisation.

> and having it set some state inside the
> Tuplesort, maybe based on a flag in the opclass (or would it have to
> attach to the individual operator)?

I'm not sure that there's much point in such a flag.

> At least on my machine, your latest patch reliably crashes the
> regression tests in multiple places.

> TRAP: FailedAssertion("!(state->nKeys == 1)", File: "tuplesort.c", Line: 1261);

Yes, sorry about that. Should have been discriminating against nKeys > 1 cases.

As of this evening, for sorts with multiple scankeys, I'm using
optimisations for the first scankey but not subsequent scankeys, which
is frequently almost as good as having optimisations for all scanKeys.

> The formatting of src/include/utils/template_qsort_arg.h is hard to
> read.  At ts=8, the backslashes line up, but the code doesn't fit in
> 80 columns.  If you set ts=4, then it fits in 80 columns, but the
> backslashes don't line up any more, and the variable declarations
> don't either.  I believe ts=4 is project standard.

Fair enough. My working copy and .vimrc have been updated.

> I still think it would be a good idea to provide a mechanism to
> override heap_comparetup() with a type-specific function.  I don't
> think that would take much extra code, and then any data type could
> get at least that much benefit out of this.

> It seems like it could be a good idea to do some
> per-assembler-instruction profiling of this code, and perhaps also of
> the original code.  I'm curious where the time is being spent.

How would you go about doing that? The instrumentation that profilers
use actually caused a big drop in performance here when I attempted it
a few weeks ago. There's a kind of Heisenberg effect.

This optimisation *more than doubles* raw sort performance for the
cases. There is nothing contrived or cherry picked about the query
that I selected to represent this optimisation - it was literally the
first one that I selected.

Sometimes, I see even a markedly better gain than a doubling of raw
sort performance - I think my earlier experiments that indicated a
much smaller improvement past a certain point may have been
methodologically flawed. Sorry about that.

If I double-up the data in the orderlines table a few times, until it
reaches 385 MB (duplicate ever tuple with an insert into ...select ),
then warm the cache, I get very interesting results. Here, we see a
few runs of the same old query unoptimised (note that I've excluded
some cold-cache runs before these runs):

Before optimisation
==============
 Total runtime: 7785.473 ms - 3.517310 secs just sorting
 Total runtime: 8203.533 ms - 3.577193 secs just sorting
 Total runtime: 8559.743 ms - 3.892719 secs just sorting

 Total runtime: 9032.564 ms - 3.844746 secs just sorting

 Total runtime: 9637.179 ms - 4.434431 secs just sorting
 Total runtime: 9647.215 ms - 4.440560 secs just sorting
 Total runtime: 9669.701 ms - 4.448572 secs just sorting

After optimisation
==============
 Total runtime: 5462.419 ms - 1.169963 secs just sorting
 Total runtime: 5510.660 ms - 1.234393 secs just sorting
 Total runtime: 5511.703 ms - 1.208377 secs just sorting

 Total runtime: 5588.604 ms - 1.175536 secs just sorting

 Total runtime: 5899.496 ms - 1.250403 secs just sorting
 Total runtime: 6023.132 ms - 1.338760 secs just sorting
 Total runtime: 6717.177 ms - 1.486602 secs just sorting

This is a 800109kB sort.

So, taking the median value as representative here, that looks to be
just shy of a 40% improvement, or 3.4 seconds. My /proc/cpuinfo is
attached on the off chance that someone is interested in that. More
work is needed here, but this seems promising.

It will be interesting to see how far all of this can be taken with
comparetup_index_btree. Certainly, I'm sure there's some gain to be
had there by applying lessons learned from comparetup_heap.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-11-26T18:58:00Z

3 items are attached:

1. A spreadsheet, results.ods, which has results for automated
multiple runs of the "doubled up" dellstore orderlines queries (where
orderlines is 385 MB). Results are sorted, with the median (or the
lower middle value, didn't get the mean of the two middle runs) result
for each run highlighted as representative. There are 3 builds of
Postgres (HEAD, not inlined, optimized), each on its own sheet in the
spreadsheet. The cache was warmed for each query, and subsequently the
query was run 20 times.

2. A python script I hacked together that should run on anything
around Python 2.5+, with a single dependency, psycopg2. Look at --help
to see how it works. This is the script that actually generated the
figures that went into the spreadsheet, by being run once for each
type of build. You can fairly easily play along at home with this, for
these and other queries. It will spit out CSV files. It is designed to
warm the cache (by default 3 times before each query) to eliminate
caching effects. You can add your own query to the python list to have
it run by the script, to generate the same set of figures for that
query. I'm rather curious as to how much of an impact this
optimisation will have on queries with unique nodes, joins and
grouping when they rely on a sort node for their input, in the real
world. Certainly, a query need not have an ORDER BY clause to see
benefits, perhaps substantial benefits. The logic to parse sort node
details is rather unsophisticated right now though, due to my focus on
the obvious ORDER BY case.

3. The revision of the patch that was actually tested, now with
inlining specialisations for the single scanKey case, and a
non-inlined specialisation for multiple scanKeys where we can still
benefit from eliding the SQL function call machinery for the first
scanKey, which is often almost as useful as being able to do so for
all scanKeys. It also selects a sorting specialisation when it can in
tuplesort_begin_heap, per Robert's suggestion.

Reviewers will want to comment out line 731 of tuplesort.c, "#define
optimize", to quickly get unoptimized behaviour for comparative
purposes. Furthermore, if you'd like to see what this would look like
without inlining, you can simply comment out assignments of inline
variants of specialisations (i.e. int4inlqsort_arg and
int8inlqsort_arg) in tuplesort_begin_heap.

It has been suggested that I'm chasing diminishing returns by
inlining, as I go further for a smaller benefit. Of course, that's
true. However, I believe that I'm not chasing them past the point
where that ceases to make sense, and these figures support that
contention - chasing diminishing returns in the nature of this kind of
work. Here, the additional benefit of inlining accounts for over an
entire second shaved off a query that was originally 7056ms, so that's
not to be sniffed at.

I'll reiterate that qsort_arg has only been modified 4 times after its
initial commit in 2006, and these were all trivial changes. While the
way that I ape generic programming with the preprocessor is on the
ugly side, what I've done can be much more easily understood with
reference to qsort_arg itself. Robert said that duplicating the sort
function was "iffy". However, that already happened long ago, as we're
already maintaining both qsort_arg.c and qsort.c, and comments already
urge maintainers to keep the two consistent. That's not been a problem
though, because, as I've said, there has never been any call to make
substantive changes to either in all those years.

We could probably further reduce the code footprint of all of this by
having template_qsort_arg.h generate comparators directly. That might
be inflexible in a way that turns out to matter though, like if we
wanted to use this for non-HAVE_INT64_TIMESTAMP timestamps and had to
add NaN crud.

My next step is to see how this goes with hundreds of sorts in the
hundreds of megabytes on a high-end server. I don't have immediate
access to one, but I'm working that out.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-11-28T02:15:13Z

Attached are the results from performing a similar process to the
prior benchmark, but on Greg Smith's high-end server, and with an
orderlines table that has been "doubled-up" until it is 1538 MB,
making the same old query perform a quicksort that's over 3GB. Short
version:  HEAD is 20468.0ms, with my patch it's 13689.698ms, so these
gains hold-up very well for very large in-memory sorts, possibly even
perfectly.

Note that the "doubling up" had an interesting and desirable effect
for the purposes of this patch review: It's approaching a worst case
due to the low cardinality of the product + quantity columns, which
crops up with the non-inlined functions only (int4regqsort_arg, etc).
All too frequently, evaluating the first scankey results in equality,
and so we cannot elide the SQL function call machinery. This is going
to dampen the improvement for the multiple scanKey case considerably
(and it looks like a smaller relative improvement than when the
orderlines table was quite a lot smaller). As I said from the outset,
there is no worst case for the single scanKey case that occurs to me.
Multiple scanKeys are likely to be a problem for any effort to offer
user-defined, per-type sort functions. I could probably make
int4regqsort_arg and friends perform a bit better than we see here by
increasing the cardinality of those two columns for the second query,
so the first scanKey comparison would frequently suffice.

Incidentally, I'm pretty sceptical of the idea of any effort being
made to provide user-defined per-type sort functions or anything like
that. No one has come forward with a novel sorting implementation that
looks useful, although there has been some talk of some. It's
relatively easy to hack on tuplesort to build a prototype, so I don't
think the lack of this functionality is the blocker here. Besides, we
will probably end up doing a terrible job of anticipating how invasive
such a facility needs to be to facilitate these novel sorting
implementations.

While I'm very encouraged by these results, I must admit that there
are a few things that I cannot currently explain. I don't know why the
second query on the "Optimized" sheet outperforms the same query on
the "Not inlined" page. When you consider how small the standard
deviation is for each set of results, it's fairly obvious that it
cannot be explained by noise. I thought it was weird, so I re-ran both
benchmarks, with much the same outcome. It may be that the compiler
gets lucky for the optimized case, resulting in a bit of an extra
gain, and that effect is ridiculously magnified. In all cases, the
regression tests pass.

The single key/inlining optimisation seems to account for a big part
of the gains, and the "Not inlined" figures for the first query would
seem to suggest that the inlining can account for more than half of
gains seen, whereas that was previously assumed to be less. What I
think is happening here is that we're benefiting both from inlining as
well as not having to even consider multiple scanKeys (we have to at
least consider them even if we know in advance from the nScankeys
variable that there'll never be multiple scanKeys). Again, if the
cardinality was higher, we'd probably see "Not inlined" do better
here.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Bruce Momjian <bruce@momjian.us> — 2011-11-29T15:31:45Z

Peter Geoghegan wrote:
> Attached are the results from performing a similar process to the
> prior benchmark, but on Greg Smith's high-end server, and with an
> orderlines table that has been "doubled-up" until it is 1538 MB,
> making the same old query perform a quicksort that's over 3GB. Short
> version:  HEAD is 20468.0ms, with my patch it's 13689.698ms, so these
> gains hold-up very well for very large in-memory sorts, possibly even
> perfectly.
> 
> Note that the "doubling up" had an interesting and desirable effect
> for the purposes of this patch review: It's approaching a worst case
> due to the low cardinality of the product + quantity columns, which
> crops up with the non-inlined functions only (int4regqsort_arg, etc).
> All too frequently, evaluating the first scankey results in equality,
> and so we cannot elide the SQL function call machinery. This is going
> to dampen the improvement for the multiple scanKey case considerably
> (and it looks like a smaller relative improvement than when the
> orderlines table was quite a lot smaller). As I said from the outset,
> there is no worst case for the single scanKey case that occurs to me.
> Multiple scanKeys are likely to be a problem for any effort to offer
> user-defined, per-type sort functions. I could probably make
> int4regqsort_arg and friends perform a bit better than we see here by
> increasing the cardinality of those two columns for the second query,
> so the first scanKey comparison would frequently suffice.

These are exciting advanced you are producing and I am hopeful we can
get this included in Postgres 9.2.  I have mentioned already that I
think parallelism is the next big Postgres challenge, and of course, one
of the first areas for parallelism is sorting.  If you can speed up
sorting as you have reported, this allows us to provide faster sorting
now, and get even larger benefits if we implement sorting parallelism.

In fact, because Postgres 9.2 has so many performance improvements, we
might find that we have a more limited need for parallelism.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-11-29T18:48:37Z

On 29 November 2011 15:31, Bruce Momjian <bruce@momjian.us> wrote:
> These are exciting advanced you are producing and I am hopeful we can
> get this included in Postgres 9.2.

Thanks Bruce.

>I have mentioned already that I
> think parallelism is the next big Postgres challenge, and of course, one
> of the first areas for parallelism is sorting.

I'm not sure that sorting has that much to recommend it as an initial
target of some new backend parallelism other than being easy to
implement. I've observed the qsort_arg specialisations in this patch
out-perform stock qsort_arg by as much as almost 3 times. However, the
largest decrease in a query's time that I've observed was 45%, and
that was for a contrived worst-case for quicksort, but about 25% is
much more typical of queries similar to the ones I've shown, for more
normative data distributions. While that's a respectable gain, it
isn't a paradigm shifting one, and it makes parallelising qsort itself
for further improvements quite a lot less attractive - there's too
many other sources of overhead.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Bruce Momjian <bruce@momjian.us> — 2011-11-29T20:34:40Z

Peter Geoghegan wrote:
> On 29 November 2011 15:31, Bruce Momjian <bruce@momjian.us> wrote:
> > These are exciting advanced you are producing and I am hopeful we can
> > get this included in Postgres 9.2.
> 
> Thanks Bruce.
> 
> >I have mentioned already that I
> > think parallelism is the next big Postgres challenge, and of course, one
> > of the first areas for parallelism is sorting.
> 
> I'm not sure that sorting has that much to recommend it as an initial
> target of some new backend parallelism other than being easy to
> implement. I've observed the qsort_arg specialisations in this patch
> out-perform stock qsort_arg by as much as almost 3 times. However, the
> largest decrease in a query's time that I've observed was 45%, and
> that was for a contrived worst-case for quicksort, but about 25% is
> much more typical of queries similar to the ones I've shown, for more
> normative data distributions. While that's a respectable gain, it
> isn't a paradigm shifting one, and it makes parallelising qsort itself
> for further improvements quite a lot less attractive - there's too
> many other sources of overhead.

Agreed.  I think your improvements make it likely we will address not
address sort parallelism first.

With all the improvements coming in Postgres 9.2, we might need to look
at I/O parallelism first.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: Inlining comparators as a performance optimisation

Andres Freund <andres@anarazel.de> — 2011-11-29T21:36:26Z

On Tuesday, November 29, 2011 07:48:37 PM Peter Geoghegan wrote:
> On 29 November 2011 15:31, Bruce Momjian <bruce@momjian.us> wrote:
> > These are exciting advanced you are producing and I am hopeful we can
> > get this included in Postgres 9.2.
> 
> Thanks Bruce.
> 
> >I have mentioned already that I
> >
> > think parallelism is the next big Postgres challenge, and of course, one
> > of the first areas for parallelism is sorting.
> 
> I'm not sure that sorting has that much to recommend it as an initial
> target of some new backend parallelism other than being easy to
> implement. I've observed the qsort_arg specialisations in this patch
> out-perform stock qsort_arg by as much as almost 3 times. However, the
> largest decrease in a query's time that I've observed was 45%, and
> that was for a contrived worst-case for quicksort, but about 25% is
> much more typical of queries similar to the ones I've shown, for more
> normative data distributions. While that's a respectable gain, it
> isn't a paradigm shifting one, and it makes parallelising qsort itself
> for further improvements quite a lot less attractive - there's too
> many other sources of overhead.
I think that logic is faulty.

For one I doubt that anybody is honestly suggesting paralellism inside qsort 
itself. It seems more likely/sensible to implement that on the level of 
mergesorting.
Index builds for example could hugely benefit from improvements on that level. 
With index build you often get pretty non-optimal data distributions btw...

I also seriously doubt that you will find an area inside pg's executor where 
you find that paralellizing them will provide a near linear scale without 
much, much more work.

Also I wouldn't consider sorting the easiest target - especially on a qsort 
level - for parallelization as you constantly need to execute user defined 
operators with multiple input tuples which has the usual problems.
COPY parsing + inserting or such seems to be way easier target for example. 
Even doing hashing + aggregation in different threads seems likely to be 
easier.

Andres

Re: Inlining comparators as a performance optimisation

Greg Jaskiewicz <gj@pointblue.com.pl> — 2011-11-29T21:43:40Z

On 28 Nov 2011, at 02:15, Peter Geoghegan wrote:

> Attached are the results from performing a similar process to the
> prior benchmark, but on Greg Smith's high-end server, and with an
> orderlines table that has been "doubled-up" until it is 1538 MB,
> making the same old query perform a quicksort that's over 3GB. Short
> version:  HEAD is 20468.0ms, with my patch it's 13689.698ms, so these
> gains hold-up very well for very large in-memory sorts, possibly even
> perfectly.
> 

This is some really good stuff. 
Has to be said, that there must be quite few other places where inlining like that could have quite a positive benefit. 
But - I also have to say that both template functions and template classes in C++ give you pretty much the same speed improvement, with much better clarity and readability of the code. (I've tested it with the example Peter presented, if someone is interested in the code). 
One more reason why PostgreSQL project should look into the future and get some of the bits simplified and optimised by switching to C++.

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-12-01T16:44:55Z

Attached is revision of my patch with some clean-ups. In particular,
I'm now using switch statements for greater readability, plus
supporting fast path sorting of the time datatype. I've also updated
the documentation on "Date/Time Types" to note the additional
disadvantage of using the deprecated "store timestamp + friends as
double precision floating-point numbers" compile time option.

There is one aspect to this optimisation that I haven't touched on,
which is the effect on memory consumption. I think that much of the
value that this patch will deliver will come from being able to
release sort memory earlier. Consider that the substantial
improvements in raw sorting speed (far more substantial than the
improvements in query runtime) will sometimes result in a concomitant
reduction in the time that the executor holds onto memory allocated
for sorting. Maybe the effect will only be really noticeable for plans
with a sort node as their root node, but that isn't exactly a rare
occurrence, particularly among large, expensive sorts.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-12-01T17:15:52Z

On Thu, Dec 1, 2011 at 11:44 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> Attached is revision of my patch with some clean-ups. In particular,
> I'm now using switch statements for greater readability, plus
> supporting fast path sorting of the time datatype. I've also updated
> the documentation on "Date/Time Types" to note the additional
> disadvantage of using the deprecated "store timestamp + friends as
> double precision floating-point numbers" compile time option.

One thing I'm starting to get a bit concerned about is the effect of
this patch on the size of the resulting binary.  The performance
results you've posted are getting steadily more impressive as you get
into this, which is cool, but it seems like the number of copies of
the qsort logic that you're proposing to include in the resulting
binary is also steadily climbing.  On my system, a stripped postgres
binary built with my usual compile options (except --enable-cassert,
which I took out) is 49336 bytes bigger with this patch applied, an
increase of close to 1%.  We might need to be a little bit choosy
about this, because I don't think that we want to end up with a
situation where some noticeable percentage of the final binary
consists of differently-inlined versions of the quicksort algorithm -
especially because there may be other places where we want to do
similar kinds of inlining.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-02T01:13:02Z

Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Dec 1, 2011 at 11:44 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>> Attached is revision of my patch with some clean-ups.

> One thing I'm starting to get a bit concerned about is the effect of
> this patch on the size of the resulting binary.

Regardless of that, I'm still of the opinion that this patch is
fundamentally misdesigned.  The way it's set up, it is only possible to
have a fast path for types that are hard-wired into tuplesort.c, and
that flies in the face of twenty years' worth of effort to make Postgres
an extensible system.  I really don't care about performance
measurements: this is not an acceptable design, period.

What I want to see is some mechanism that pushes this out to the
individual datatypes, so that both core and plug-in datatypes have
access to the benefits.  There are a number of ways that could be
attacked, but the most obvious one is to invent additional, optional
support function(s) for btree opclasses.

I also still think that we should pursue the idea of a lower-overhead
API for comparison functions.  It may be that it's worth the code bulk
to have an inlined copy of qsort for a small number of high-usage
datatypes, but I think there are going to be a lot for which we aren't
going to want to pay that price, and yet we could get some benefit from
cutting the call overhead.  A lower-overhead API would also save cycles
in usage of the comparison functions from btree itself (remember that?).

I think in particular that the proposed approach to multiple sort keys
is bogus and should be replaced by just using lower-overhead
comparators.  The gain per added code byte in doing it as proposed
has got to be too low to be reasonable.

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-12-02T01:29:56Z

On 1 December 2011 17:15, Robert Haas <robertmhaas@gmail.com> wrote:
> One thing I'm starting to get a bit concerned about is the effect of
> this patch on the size of the resulting binary.  The performance
> results you've posted are getting steadily more impressive as you get
> into this, which is cool, but it seems like the number of copies of
> the qsort logic that you're proposing to include in the resulting
> binary is also steadily climbing.  On my system, a stripped postgres
> binary built with my usual compile options (except --enable-cassert,
> which I took out) is 49336 bytes bigger with this patch applied, an
> increase of close to 1%.

Do your usual compile options include debug symbols? I've been using
standard compile options for development of this patch, for obvious
reasons. I get 36690 bytes (just under 36 KiB, or a 0.644% increase).

Binary bloat is a legitimate concern. I thought that I was
conservative in my choice of specialisations though.

> We might need to be a little bit choosy
> about this, because I don't think that we want to end up with a
> situation where some noticeable percentage of the final binary
> consists of differently-inlined versions of the quicksort algorithm -
> especially because there may be other places where we want to do
> similar kinds of inlining.
>
> Thoughts?

A simple caveat in a comment along the lines of "this mechanism
instantiates 2 copies of qsort_arg per type, please use judiciously"
sounds like the right balance. It could also be possible to be penny
wise and pound foolish here though.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-12-02T01:31:06Z

On Thu, Dec 1, 2011 at 8:29 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> Do your usual compile options include debug symbols? I've been using
> standard compile options for development of this patch, for obvious
> reasons. I get 36690 bytes (just under 36 KiB, or a 0.644% increase).

They do, but I ran "strip" on the resulting binary before taking those
measurements, so I thought that would undo it...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-12-02T02:15:14Z

On Thu, Dec 1, 2011 at 8:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What I want to see is some mechanism that pushes this out to the
> individual datatypes, so that both core and plug-in datatypes have
> access to the benefits.  There are a number of ways that could be
> attacked, but the most obvious one is to invent additional, optional
> support function(s) for btree opclasses.

I feel like what's missing here is a way for the catalogs to refer to
a function that doesn't take FunctionCallInfo as an argument.  But it
strikes me that it wouldn't be very hard to design such a mechanism.
It seems like all we really need to do is decide on a nomenclature.
The simplest possible approach would probably be to just refer to
functions by name.  In other words, we add a text or name column to
pg_amproc, and then inside the backend there's a function whose job it
is to map the string to a function address.  However, that would have
the annoying effect of causing catalog bloat, so I'm inclined to
instead propose an 8-byte integer.  (A 4-byte integer is enough, but
would increase the risk of collisions between values that might be
chosen by third party extensions.)

So, the API to this would look something like this:

void register_arbitrary_function_pointer(int64 handle, void *funcptr);
void *get_arbitrary_function_pointer(int64 handle);

So the idea is that if you need to refer to an arbitrary function in a
system catalog (such as pg_amproc), you generate a random non-zero
64-bit integer, stuff it into the catalog, and then register the same
64-bit integer with the appropriate function pointer.  To avoid
increasing backend startup time, we could have a Perl script generate
a prefab hash table for the built-in functions; loadable modules would
add their entries at PG_init() time using
register_arbitrary_function_pointer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-02T02:46:09Z

Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Dec 1, 2011 at 8:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What I want to see is some mechanism that pushes this out to the
>> individual datatypes, so that both core and plug-in datatypes have
>> access to the benefits.  There are a number of ways that could be
>> attacked, but the most obvious one is to invent additional, optional
>> support function(s) for btree opclasses.

> I feel like what's missing here is a way for the catalogs to refer to
> a function that doesn't take FunctionCallInfo as an argument.  But it
> strikes me that it wouldn't be very hard to design such a mechanism.

IMO, entries in pg_proc ought to refer to functions that are callable
by the standard interface: breaking that basic contract is not going to
lead anywhere pleasant.  Nor do I really want yet more columns in
pg_proc.  Nor does the "register a pointer" scheme you suggest make
any sense to me: you still need to connect these things to catalog
entries, pg_opclass entries in particular, and the intermediate handle
adds nothing to the situation except for creating a risk of collisions.

The scheme that was rolling around in my mind was about like this:

* Define btree opclasses to have an optional support function,
amprocnum 2, that has a SQL signature of func(internal) returns void.

* The actual argument represented by the "internal" parameter would
be a pointer to a struct which is to be filled in by the support
function.  We'd call the support function once, during tuplesort
setup or btree index relcache entry setup, and save the struct
somewhere.

* The struct contents would be pointers to functions that have a
non-FunctionCallInfo interface.  We know we need at least two:
a full qsort replacement, and a non-FCI comparator.  We might want
more in future, if someone convinces us that additional specializations
of sorting are worth the trouble.  So I imagine a struct like this:

	typedef struct SortSupportFunctions {
		void	(*inline_qsort) (Datum *elements, int nelements);
		int	(*comparator) (Datum a, Datum b);
	} SortSupportFunctions;

with the agreement that the caller must zero out the struct before call,
and then the btree support function sets the function pointers for any
specializations it's going to offer.  If we later add a third or fourth
function pointer, datatypes that know about that could fill in those
pointers, but datatypes that haven't been updated aren't broken.

One thing I'm not too certain about is whether to define the APIs just
as above, or to support a passthrough argument of some sort (and if so,
what does it reference)?  Possibly any datatype that we'd actually care
about this for is going to be simple enough to not need any state data.
Or possibly not.  And what about collations?

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-12-02T03:22:32Z

On Thu, Dec 1, 2011 at 9:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> IMO, entries in pg_proc ought to refer to functions that are callable
> by the standard interface: breaking that basic contract is not going to
> lead anywhere pleasant.  Nor do I really want yet more columns in
> pg_proc.

I wasn't proposing to create pg_proc entries for this.

> Nor does the "register a pointer" scheme you suggest make
> any sense to me: you still need to connect these things to catalog
> entries, pg_opclass entries in particular, and the intermediate handle
> adds nothing to the situation except for creating a risk of collisions.

I think you might be misinterpreting what I had in mind.  Right now,
pg_amproc entries have an "amproc" column that points to a pg_proc
entry that in turn points to a function that takes
FunctionCallInfoData as an argument.  What I'm proposing to do is add
an additional column to that catalog that points more or less directly
to a non-SQL-callable function, but it can't actually just be the
address of the function because that's not stable.  So what I'm
proposing is that we interpose the thinnest possible shim layer
between the catalog and a function pointer, and an int64 -> function
pointer mapping seemed to me like something that would fit the bill.

> The scheme that was rolling around in my mind was about like this:
>
> * Define btree opclasses to have an optional support function,
> amprocnum 2, that has a SQL signature of func(internal) returns void.
>
> * The actual argument represented by the "internal" parameter would
> be a pointer to a struct which is to be filled in by the support
> function.  We'd call the support function once, during tuplesort
> setup or btree index relcache entry setup, and save the struct
> somewhere.
>
> * The struct contents would be pointers to functions that have a
> non-FunctionCallInfo interface.  We know we need at least two:
> a full qsort replacement, and a non-FCI comparator.  We might want
> more in future, if someone convinces us that additional specializations
> of sorting are worth the trouble.  So I imagine a struct like this:
>
>        typedef struct SortSupportFunctions {
>                void    (*inline_qsort) (Datum *elements, int nelements);
>                int     (*comparator) (Datum a, Datum b);
>        } SortSupportFunctions;
>
> with the agreement that the caller must zero out the struct before call,
> and then the btree support function sets the function pointers for any
> specializations it's going to offer.  If we later add a third or fourth
> function pointer, datatypes that know about that could fill in those
> pointers, but datatypes that haven't been updated aren't broken.
>
> One thing I'm not too certain about is whether to define the APIs just
> as above, or to support a passthrough argument of some sort (and if so,
> what does it reference)?  Possibly any datatype that we'd actually care
> about this for is going to be simple enough to not need any state data.
> Or possibly not.  And what about collations?

Maybe there should be a comparator_setup function that gets the
collation OID and returns void *, and then that void * value gets
passed as a third argument to each call to the comparator function.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-02T03:48:07Z

Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Dec 1, 2011 at 9:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Nor does the "register a pointer" scheme you suggest make
>> any sense to me: you still need to connect these things to catalog
>> entries, pg_opclass entries in particular, and the intermediate handle
>> adds nothing to the situation except for creating a risk of collisions.

> I think you might be misinterpreting what I had in mind.  Right now,
> pg_amproc entries have an "amproc" column that points to a pg_proc
> entry that in turn points to a function that takes
> FunctionCallInfoData as an argument.  What I'm proposing to do is add
> an additional column to that catalog that points more or less directly
> to a non-SQL-callable function, but it can't actually just be the
> address of the function because that's not stable.  So what I'm
> proposing is that we interpose the thinnest possible shim layer
> between the catalog and a function pointer, and an int64 -> function
> pointer mapping seemed to me like something that would fit the bill.

But you still need a lot of mechanism to do that mapping, including an
initialization function that has noplace to be executed (unless every
.so that defines a btree opclass has to be listed in preload_libraries),
so it doesn't seem like the "thinnest possible shim" to me.

>> One thing I'm not too certain about is whether to define the APIs just
>> as above, or to support a passthrough argument of some sort (and if so,
>> what does it reference)? Possibly any datatype that we'd actually care
>> about this for is going to be simple enough to not need any state data.
>> Or possibly not. And what about collations?

> Maybe there should be a comparator_setup function that gets the
> collation OID and returns void *, and then that void * value gets
> passed as a third argument to each call to the comparator function.

Maybe.  Or perhaps we could merge that work into the
function-pointer-setup function --- that is, give it the collation and
some extra workspace to fool with.  We would always know the
collation at the time we're doing that setup.

At this point the struct filled in by the setup function is starting
to feel a lot like FmgrInfo, and history teaches us a lot about what
needs to be in there.  So maybe

typedef struct SortSupportInfoData *SortSupportInfo;

typedef struct SortSupportInfoData
{
	MemoryContext	info_cxt;	/* where to allocate subsidiary data */
	void		*extra;		/* any data needed by functions */
	Oid		collation;	/* provided by caller */

	void	(*inline_qsort) (Datum *elements, int nelements,
				 SortSupportInfo info);
	int	(*comparator) (Datum a, Datum b,
			       SortSupportInfo info);
	/* possibly other function pointers in future */	
} SortSupportInfoData;

I am thinking that the btree code, at least, would want to just
unconditionally do

	colsortinfo->comparator(datum1, datum2, colsortinfo)

so for an opclass that fails to supply the low-overhead comparator,
it would insert into the "comparator" pointer a shim function that
calls the opclass' old-style FCI-using comparator.  (Anybody who
complains about the added overhead would be told to get busy and
supply a low-overhead comparator for their datatype...)  But to do
that, we have to have enough infrastructure here to cover all cases,
so omitting collation or not having a place to stash an FmgrInfo
won't do.

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-12-02T04:00:36Z

On Thu, Dec 1, 2011 at 10:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> But you still need a lot of mechanism to do that mapping, including an
> initialization function that has noplace to be executed (unless every
> .so that defines a btree opclass has to be listed in preload_libraries),
> so it doesn't seem like the "thinnest possible shim" to me.

PG_init?

>>> One thing I'm not too certain about is whether to define the APIs just
>>> as above, or to support a passthrough argument of some sort (and if so,
>>> what does it reference)?  Possibly any datatype that we'd actually care
>>> about this for is going to be simple enough to not need any state data.
>>> Or possibly not.  And what about collations?
>
>> Maybe there should be a comparator_setup function that gets the
>> collation OID and returns void *, and then that void * value gets
>> passed as a third argument to each call to the comparator function.
>
> Maybe.  Or perhaps we could merge that work into the
> function-pointer-setup function --- that is, give it the collation and
> some extra workspace to fool with.  We would always know the
> collation at the time we're doing that setup.

Ah!  That seems quite nice.

> At this point the struct filled in by the setup function is starting
> to feel a lot like FmgrInfo, and history teaches us a lot about what
> needs to be in there.  So maybe
>
> typedef struct SortSupportInfoData *SortSupportInfo;
>
> typedef struct SortSupportInfoData
> {
>        MemoryContext   info_cxt;       /* where to allocate subsidiary data */
>        void            *extra;         /* any data needed by functions */
>        Oid             collation;      /* provided by caller */
>
>        void    (*inline_qsort) (Datum *elements, int nelements,
>                                 SortSupportInfo info);
>        int     (*comparator) (Datum a, Datum b,
>                               SortSupportInfo info);
>        /* possibly other function pointers in future */
> } SortSupportInfoData;
>
> I am thinking that the btree code, at least, would want to just
> unconditionally do
>
>        colsortinfo->comparator(datum1, datum2, colsortinfo)
>
> so for an opclass that fails to supply the low-overhead comparator,
> it would insert into the "comparator" pointer a shim function that
> calls the opclass' old-style FCI-using comparator.  (Anybody who
> complains about the added overhead would be told to get busy and
> supply a low-overhead comparator for their datatype...)  But to do
> that, we have to have enough infrastructure here to cover all cases,
> so omitting collation or not having a place to stash an FmgrInfo
> won't do.

I'm slightly worried about whether that'll be adding too much overhead
to the case where there is no non-FCI comparator.  But it may be no
worse than what we're doing now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-02T05:16:08Z

Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Dec 1, 2011 at 10:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I am thinking that the btree code, at least, would want to just
>> unconditionally do
>> 
>>        colsortinfo->comparator(datum1, datum2, colsortinfo)
>> 
>> so for an opclass that fails to supply the low-overhead comparator,
>> it would insert into the "comparator" pointer a shim function that
>> calls the opclass' old-style FCI-using comparator.  (Anybody who
>> complains about the added overhead would be told to get busy and
>> supply a low-overhead comparator for their datatype...)  But to do
>> that, we have to have enough infrastructure here to cover all cases,
>> so omitting collation or not having a place to stash an FmgrInfo
>> won't do.

> I'm slightly worried about whether that'll be adding too much overhead
> to the case where there is no non-FCI comparator.  But it may be no
> worse than what we're doing now.

It should be the same or better.  Right now, we are going through
FunctionCall2Coll to reach the FCI-style comparator.  The shim function
would be more or less equivalent to that, and since it's quite special
purpose I would hope we could shave a cycle or two.  For instance, we
could probably afford to set up a dedicated FunctionCallInfo struct
associated with the SortSupportInfo struct, and not have to reinitialize
one each time.

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-12-02T13:33:30Z

On 2 December 2011 01:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Regardless of that, I'm still of the opinion that this patch is
> fundamentally misdesigned.  The way it's set up, it is only possible to
> have a fast path for types that are hard-wired into tuplesort.c, and
> that flies in the face of twenty years' worth of effort to make Postgres
> an extensible system.

What the user doesn't know can't hurt them. I'm not of the opinion
that an internal optimisations flies in the face of anything - is it
really so hard for you to hold your nose for a fairly isolated patch
like this?

> I really don't care about performance
> measurements: this is not an acceptable design, period.

If ever there was a justification for doing something so distasteful,
I would think that a 60% decrease in the time taken to perform a raw
sort for POD types (plus common covariants) would be it.

> What I want to see is some mechanism that pushes this out to the
> individual datatypes, so that both core and plug-in datatypes have
> access to the benefits.  There are a number of ways that could be
> attacked, but the most obvious one is to invent additional, optional
> support function(s) for btree opclasses.
>
> I also still think that we should pursue the idea of a lower-overhead
> API for comparison functions.  It may be that it's worth the code bulk
> to have an inlined copy of qsort for a small number of high-usage
> datatypes, but I think there are going to be a lot for which we aren't
> going to want to pay that price, and yet we could get some benefit from
> cutting the call overhead.

I'm not opposed to that. It just seems fairly tangential to what I've
done here, as well as considerably less important to users,
particularly when you look at the numbers I've produced. It would be
nice to get a lesser gain for text and things like that too, but other
than that I don't see a huge demand.

> A lower-overhead API would also save cycles
> in usage of the comparison functions from btree itself (remember that?).

Yes, I remember that. Why not do something similar there, and get
similarly large benefits?

> I think in particular that the proposed approach to multiple sort keys
> is bogus and should be replaced by just using lower-overhead
> comparators.  The gain per added code byte in doing it as proposed
> has got to be too low to be reasonable.

Reasonable by what standard? We may well be better off with
lower-overhead comparators for the multiple scanKey case (though I
wouldn't like to bet on it) but we would most certainly not be better
off without discriminating against single scanKey and multiple scanKey
cases as I have.

Look at the spreadsheet results_server.ods . Compare the "not inlined"
and "optimized" sheets. It's clear from those numbers that a
combination of inlining and of simply not having to consider a case
that will never come up (multiple scanKeys), the inlining
specialisation far outperforms the non-inlining, multiple scanKey
specialization for the same data (I simply commented out inlining
specializations so that non-inlining specialisations would always be
called for int4 and friends, even if there was only a single scanKey).
That's where the really dramatic improvements are seen. It's well
worth having this additional benefit, because it's such a common case
and the benefit is so big, and while I will be quite happy to pursue a
plan to bring some of these benefits to all types such as the one you
describe, I do not want to jettison the additional benefits described
to do so. Isn't that reasonable?

We're talking about taking 6778.302ms off a 20468.0ms query, rather
than 1860.332ms. Not exactly a subtle difference. Assuming that those
figures are representative of the gains to be had by a generic
mechanism that does not inline/specialize across number of scanKeys,
are you sure that that's worthwhile?

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-12-02T13:57:48Z

On Fri, Dec 2, 2011 at 12:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It should be the same or better.  Right now, we are going through
> FunctionCall2Coll to reach the FCI-style comparator.  The shim function
> would be more or less equivalent to that, and since it's quite special
> purpose I would hope we could shave a cycle or two.  For instance, we
> could probably afford to set up a dedicated FunctionCallInfo struct
> associated with the SortSupportInfo struct, and not have to reinitialize
> one each time.

OK, but I think it's also going to cost you an extra syscache lookup,
no?  You're going to have to check for this new support function
first, and then if you don't find it, you'll have to check for the
original one.  I don't think there's any higher-level caching around
opfamilies to save our bacon here, is there?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-02T15:11:19Z

Robert Haas <robertmhaas@gmail.com> writes:
> OK, but I think it's also going to cost you an extra syscache lookup,
> no?  You're going to have to check for this new support function
> first, and then if you don't find it, you'll have to check for the
> original one.  I don't think there's any higher-level caching around
> opfamilies to save our bacon here, is there?

[ shrug... ] If you are bothered by that, get off your duff and provide
the function for your datatype.  But it's certainly going to be in the
noise for btree index usage, and I submit that query parsing/setup
involves quite a lot of syscache lookups already.  I think that as a
performance objection, the above is nonsensical.  (And I would also
comment that your proposal with a handle is going to involve a table
search that's at least as expensive as a syscache lookup.)

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Bruce Momjian <bruce@momjian.us> — 2011-12-02T19:35:32Z

Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > OK, but I think it's also going to cost you an extra syscache lookup,
> > no?  You're going to have to check for this new support function
> > first, and then if you don't find it, you'll have to check for the
> > original one.  I don't think there's any higher-level caching around
> > opfamilies to save our bacon here, is there?
> 
> [ shrug... ] If you are bothered by that, get off your duff and provide
> the function for your datatype.  But it's certainly going to be in the
> noise for btree index usage, and I submit that query parsing/setup
> involves quite a lot of syscache lookups already.  I think that as a
> performance objection, the above is nonsensical.  (And I would also
> comment that your proposal with a handle is going to involve a table
> search that's at least as expensive as a syscache lookup.)

Agreed.  Doing something once and doing something in the sort loop are
two different overheads.

I am excited by this major speedup Peter Geoghegan has found.  Postgres
doesn't have parallel query, which is often used for sorting, so we are
already behind some of the databases are compared against.  Getting this
speedup is definitely going to help us.  And I do like the generic
nature of where we are heading!

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: Inlining comparators as a performance optimisation

Pavel Stehule <pavel.stehule@gmail.com> — 2011-12-02T20:04:45Z

>>
>> [ shrug... ] If you are bothered by that, get off your duff and provide
>> the function for your datatype.  But it's certainly going to be in the
>> noise for btree index usage, and I submit that query parsing/setup
>> involves quite a lot of syscache lookups already.  I think that as a
>> performance objection, the above is nonsensical.  (And I would also
>> comment that your proposal with a handle is going to involve a table
>> search that's at least as expensive as a syscache lookup.)
>
> Agreed.  Doing something once and doing something in the sort loop are
> two different overheads.
>
> I am excited by this major speedup Peter Geoghegan has found.  Postgres
> doesn't have parallel query, which is often used for sorting, so we are
> already behind some of the databases are compared against.  Getting this
> speedup is definitely going to help us.  And I do like the generic
> nature of where we are heading!
>

Oracle has not or had not parallel sort too, and I have a reports so
Oracle does sort faster then PostgreSQL (but without any numbers). So
any solution is welcome

Regards

Pavel

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-12-02T20:21:50Z

On Fri, Dec 2, 2011 at 2:35 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Agreed.  Doing something once and doing something in the sort loop are
> two different overheads.

OK, so I tried to code this up.  Adding the new amproc wasn't too
difficult (see attached).  It wasn't obvious to me how to tie it into
the tuplesort infrastructure, though, so instead of wasting time
guessing what a sensible approach might be I'm going to use one of my
lifelines and poll the audience (or is that ask an expert?).
Currently the Tuplesortstate[1] has a pointer to an array of
ScanKeyData objects, one per column being sorted.  But now instead of
"FmgrInfo sk_func", the tuplesort code is going to want each scankey
to contain "SortSupportInfo(Data?) sk_sortsupport"[2], because that's
where we get the comparison function from.   Should I just go ahead
and add one more member to that struct, or is there some more
appropriate way to handle this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] Consistent capitalization is for wimps.
[2] Hey, we could abbreviate that "SSI"!  Oh, wait...

Re: Inlining comparators as a performance optimisation

Bruce Momjian <bruce@momjian.us> — 2011-12-02T20:42:21Z

Robert Haas wrote:
> On Fri, Dec 2, 2011 at 2:35 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > Agreed. ?Doing something once and doing something in the sort loop are
> > two different overheads.
> 
> OK, so I tried to code this up.  Adding the new amproc wasn't too
> difficult (see attached).  It wasn't obvious to me how to tie it into
> the tuplesort infrastructure, though, so instead of wasting time
> guessing what a sensible approach might be I'm going to use one of my
> lifelines and poll the audience (or is that ask an expert?).
> Currently the Tuplesortstate[1] has a pointer to an array of
> ScanKeyData objects, one per column being sorted.  But now instead of
> "FmgrInfo sk_func", the tuplesort code is going to want each scankey
> to contain "SortSupportInfo(Data?) sk_sortsupport"[2], because that's
> where we get the comparison function from.   Should I just go ahead
> and add one more member to that struct, or is there some more
> appropriate way to handle this?

Is this code immediately usable anywhere else in our codebasde, and if
so, is it generic enough?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-03T18:36:04Z

Robert Haas <robertmhaas@gmail.com> writes:
> OK, so I tried to code this up.  Adding the new amproc wasn't too
> difficult (see attached).  It wasn't obvious to me how to tie it into
> the tuplesort infrastructure, though, so instead of wasting time
> guessing what a sensible approach might be I'm going to use one of my
> lifelines and poll the audience (or is that ask an expert?).

OK, I'll take a whack at improving this over the weekend.

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-04T19:17:15Z

Robert Haas <robertmhaas@gmail.com> writes:
> OK, so I tried to code this up.  Adding the new amproc wasn't too
> difficult (see attached).  It wasn't obvious to me how to tie it into
> the tuplesort infrastructure, though, so instead of wasting time
> guessing what a sensible approach might be I'm going to use one of my
> lifelines and poll the audience (or is that ask an expert?).

Here's another cut at this.  I only went as far as converting the
heap-sort code path in tuplesort.c, so there's lots more to do, but
this does sort integers successfully.  Before expanding the patch
to do more, I think we need to have consensus on some API details,
in particular:

* I invented a "SortKey" struct that replaces ScanKey for tuplesort's
purposes.  Right now that's local in tuplesort.c, but we're more than
likely going to need it elsewhere as well.  Should we just define it
in sortsupport.h?  Or perhaps we should just add the few additional
fields to SortSupportInfoData, and not bother with two struct types?
Before you answer, consider the next point.

* I wonder whether it would be worthwhile to elide inlineApplyComparator
altogether, pushing what it does down to the level of the
datatype-specific functions.  That would require changing the
"comparator" API to include isnull flags, and exposing the
reverse/nulls_first sort flags to the comparators (presumably by
including them in SortSupportInfoData).  The main way in which that
could be a win would be if the setup function could choose one of four
comparator functions that are pre-specialized for each flags
combination; but that seems like it would add a lot of code bulk, and
the bigger problem is that we need to be able to change the flags after
sort initialization (cf. the reversedirection code in tuplesort.c), so
we'd also need some kind of "re-select the comparator" call API.  On the
whole this doesn't seem promising, but maybe somebody else has a
different idea.

* We're going to want to expose PrepareSortSupportComparisonShim
for use outside tuplesort.c too, and possibly refactor
tuplesort_begin_heap so that the SortKey setup logic inside it
can be extracted for use elsewhere.  Shall we just add those to
tuplesort's API, or would it be better to create a sortsupport.c
with these sorts of functions?

I have not done any performance testing on this patch, but it might be
interesting to check it with the same test cases Peter's been using.

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-12-05T00:14:33Z

On 4 December 2011 19:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I have not done any performance testing on this patch, but it might be
> interesting to check it with the same test cases Peter's been using.

I've attached a revision of exactly the same benchmark run to get the
results in results_server.ods .

You'll see very similar figures to results_server.ods for HEAD and for
my patch, as you'd expect. I think the results speak for themselves. I
maintain that we should use specialisations - that's where most of the
benefit is to be found.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-12-05T13:23:45Z

On 05.12.2011 02:14, Peter Geoghegan wrote:
> On 4 December 2011 19:17, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>> I have not done any performance testing on this patch, but it might be
>> interesting to check it with the same test cases Peter's been using.
>
> I've attached a revision of exactly the same benchmark run to get the
> results in results_server.ods .
>
> You'll see very similar figures to results_server.ods for HEAD and for
> my patch, as you'd expect. I think the results speak for themselves. I
> maintain that we should use specialisations - that's where most of the
> benefit is to be found.

I'm still not convinced the big gain is in inlining the comparison 
functions. Your patch contains a few orthogonal tricks, too:

* A fastpath for the case of just one scankey. No reason we can't do 
that without inlining.

* inlining the swap-functions within qsort. Thaẗ́'s different from 
inlining the comparison functions, and could be done regardless of the 
data type. The qsort element size in tuplesort.c is always sizeof(SortTuple)

And there's some additional specializations we can do, not implemented 
in your patch, that don't depend on inlining the comparison functions:

* A fastpath for the case of no NULLs
* separate comparetup functions for ascending and descending sorts (this 
allows tail call elimination of the call to type-specific comparison 
function in the ascending version)
* call CHECK_FOR_INTERRUPTS() less frequently.

To see how much difference those things make, I hacked together the 
attached patch. I didn't base this on Robert's/Tom's patch, but instead 
just added a quick & dirty FunctionCallInvoke-overhead-skipping version 
of btint4cmp. I believe that aspect of this patch it's similar in 
performance characteristics, though.

In my quick tests, it gets quite close in performance to your inlining 
patch, when sorting int4s and the input contains no NULLs. But please 
give it a try in your test environment, to get numbers comparable with 
your other tests.

Perhaps you can get even more gain by adding the no-NULLs, and asc/desc 
special cases to your inlining patch, too, but let's squeeze all the fat 
out of the non-inlined version first. One nice aspect of many of these 
non-inlining optimizations is that they help with external sorts, too.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-12-05T17:10:05Z

On 5 December 2011 13:23, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> I'm still not convinced the big gain is in inlining the comparison
> functions. Your patch contains a few orthogonal tricks, too:
>
> * A fastpath for the case of just one scankey. No reason we can't do that
> without inlining.

True, though Tom did seem to not like that idea very much.

> * inlining the swap-functions within qsort. Thaẗ́'s different from inlining
> the comparison functions, and could be done regardless of the data type. The
> qsort element size in tuplesort.c is always sizeof(SortTuple)

Sure. I think that Tom mostly objects to hard-coded intelligence about
which datatypes are used, rather than specialisations per se.

> And there's some additional specializations we can do, not implemented in
> your patch, that don't depend on inlining the comparison functions:
>
> * A fastpath for the case of no NULLs
> * separate comparetup functions for ascending and descending sorts (this
> allows tail call elimination of the call to type-specific comparison
> function in the ascending version)
> * call CHECK_FOR_INTERRUPTS() less frequently.

All of those had occurred to me, except the last - if you look at the
definition of that macro, it looks like more trouble than it's worth
to do less frequently. I didn't revise my patch with them though,
because the difference that they made, while clearly measurable,
seemed fairly small, or at least relatively so. I wasn't about to add
additional kludge for marginal benefits given the existing
controversy, though I have not dismissed the idea entirely - it could
be important on some other machine.

> Perhaps you can get even more gain by adding the no-NULLs, and asc/desc
> special cases to your inlining patch, too, but let's squeeze all the fat out
> of the non-inlined version first.

As I said, those things are simple enough to test, and were not found
to make a significant difference, at least to my exact test case +
hardware.

> One nice aspect of many of these
> non-inlining optimizations is that they help with external sorts, too.

I'd expect the benefits to be quite a lot smaller there, but fair point.

Results from running the same test on your patch are attached. I think
that while you're right to suggest that the inlining of the comparator
proper, rather than any other function or other optimisation isn't
playing a dominant role in these optimisations, I think that you're
underestimating the role of locality of reference. To illustrate this,
I've also included results for when I simply move the comparator to
another translation unit, logtape.c . There's clearly a regression
from doing so (I ran the numbers twice, in a clinical environment).
The point is, there is a basically unknown overhead paid for not
inlining - maybe inlining is not worth it, but that's a hard call to
make.

I didn't try to make the numbers look worse by putting the comparator
proper in some other translation unit, but it may be that I'd have
shown considerably worse numbers by doing so.

Why the strong aversion to what I've done? I accept that it's ugly,
but is it really worth worrying about that sort of regression in
maintainability for something that was basically untouched since 2006,
and will probably remain untouched after this work concludes, whatever
happens?

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-05T17:53:23Z

Peter Geoghegan <peter@2ndquadrant.com> writes:
> Why the strong aversion to what I've done? I accept that it's ugly,
> but is it really worth worrying about that sort of regression in
> maintainability for something that was basically untouched since 2006,
> and will probably remain untouched after this work concludes, whatever
> happens?

Maintainability is only part of the issue --- though it's definitely one
part, since this code has hardly gone "untouched since 2006", cf
http://git.postgresql.org/gitweb/?p=postgresql.git;a=history;f=src/backend/utils/sort/tuplesort.c;hb=HEAD

What is bothering me is that this approach is going to cause substantial
bloat of the executable code, and such bloat has got distributed costs,
which we don't have any really good way to measure but for sure
micro-benchmarks addressing only sort speed aren't going to reveal them.
Cache lines filled with sort code take cycles to flush and replace with
something else.

I think it's possibly reasonable to have inlined copies of qsort for a
small number of datatypes, but it seems much less reasonable to have
multiple copies per datatype in order to obtain progressively tinier
micro-benchmark speedups.  We need to figure out what the tradeoff
against executable size really is, but so far it seems like you've been
ignoring the fact that there is any such tradeoff at all.

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Nicolas Barbier <nicolas.barbier@gmail.com> — 2011-12-06T10:40:23Z

2011/12/5 Tom Lane <tgl@sss.pgh.pa.us>:

> What is bothering me is that this approach is going to cause substantial
> bloat of the executable code, and such bloat has got distributed costs,
> which we don't have any really good way to measure but for sure
> micro-benchmarks addressing only sort speed aren't going to reveal them.
> Cache lines filled with sort code take cycles to flush and replace with
> something else.
>
> I think it's possibly reasonable to have inlined copies of qsort for a
> small number of datatypes, but it seems much less reasonable to have
> multiple copies per datatype in order to obtain progressively tinier
> micro-benchmark speedups.  We need to figure out what the tradeoff
> against executable size really is, but so far it seems like you've been
> ignoring the fact that there is any such tradeoff at all.

[ Randomly spouting ideas here: ]

Might it not be a good idea to decide whether to use the inlined
copies vs. the normal version, based on how much data to sort? Surely
for a very large sort, the cache blow-out doesn't matter that much
relative to the amount of time that can be saved sorting?

Assuming that all types would have an inlined sort function, although
that would indeed result in a larger binary, most of that binary would
never touch the cache, because corresponding large sorts are never
performed. If they would sporadically occur (and assuming the points
at which inline sorting starts to get used are chosen wisely), the
possibly resulting cache blow-out would be a net win.

I am also assuming here that instruction cache lines are small enough
for case line aliasing not to become a problem; putting all sort
functions next to each other in the binary (so that they don't alias
with the rest of the backend code that might be used more often) might
alleviate that.

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-12-06T16:49:03Z

On Sun, Dec 4, 2011 at 2:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> * I invented a "SortKey" struct that replaces ScanKey for tuplesort's
> purposes.  Right now that's local in tuplesort.c, but we're more than
> likely going to need it elsewhere as well.  Should we just define it
> in sortsupport.h?  Or perhaps we should just add the few additional
> fields to SortSupportInfoData, and not bother with two struct types?
> Before you answer, consider the next point.

+1 for not bothering with two struct types.  We might want to consider
calling the resulting structure SortKey rather than
SortSupportInfoData, however.

> * I wonder whether it would be worthwhile to elide inlineApplyComparator
> altogether, pushing what it does down to the level of the
> datatype-specific functions.  That would require changing the
> "comparator" API to include isnull flags, and exposing the
> reverse/nulls_first sort flags to the comparators (presumably by
> including them in SortSupportInfoData).  The main way in which that
> could be a win would be if the setup function could choose one of four
> comparator functions that are pre-specialized for each flags
> combination; but that seems like it would add a lot of code bulk, and
> the bigger problem is that we need to be able to change the flags after
> sort initialization (cf. the reversedirection code in tuplesort.c), so
> we'd also need some kind of "re-select the comparator" call API.  On the
> whole this doesn't seem promising, but maybe somebody else has a
> different idea.

I thought about this, too, but it didn't seem promising to me, either.

> * We're going to want to expose PrepareSortSupportComparisonShim
> for use outside tuplesort.c too, and possibly refactor
> tuplesort_begin_heap so that the SortKey setup logic inside it
> can be extracted for use elsewhere.  Shall we just add those to
> tuplesort's API, or would it be better to create a sortsupport.c
> with these sorts of functions?

Why are we going to want to do that?  If it's because there are other
places in the code that can make use of a fast comparator that don't
go through tuplesort.c, then we should probably break it off into a
separate file (sortkey.c?).  But if it's because we think that clients
of the tuplesort code are going to need it for some reason, then we
may as well keep it in tuplesort.c.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-06T17:06:09Z

Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Dec 4, 2011 at 2:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> * We're going to want to expose PrepareSortSupportComparisonShim
>> for use outside tuplesort.c too, and possibly refactor
>> tuplesort_begin_heap so that the SortKey setup logic inside it
>> can be extracted for use elsewhere.  Shall we just add those to
>> tuplesort's API, or would it be better to create a sortsupport.c
>> with these sorts of functions?

> Why are we going to want to do that?  If it's because there are other
> places in the code that can make use of a fast comparator that don't
> go through tuplesort.c, then we should probably break it off into a
> separate file (sortkey.c?).  But if it's because we think that clients
> of the tuplesort code are going to need it for some reason, then we
> may as well keep it in tuplesort.c.

My expectation is that nbtree, as well as mergejoin and mergeappend,
would get converted over to use the fast comparator API.  I looked at
that a little bit but didn't push it far enough to be very sure about
whether they'd be able to share the initialization code from
tuplesort_begin_heap.  But they're definitely going to need the shim
function for backwards compatibility, and
PrepareSortSupportComparisonShim was my first cut at a wrapper that
would be generally useful.

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-12-06T17:48:35Z

On Tue, Dec 6, 2011 at 12:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sun, Dec 4, 2011 at 2:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> * We're going to want to expose PrepareSortSupportComparisonShim
>>> for use outside tuplesort.c too, and possibly refactor
>>> tuplesort_begin_heap so that the SortKey setup logic inside it
>>> can be extracted for use elsewhere.  Shall we just add those to
>>> tuplesort's API, or would it be better to create a sortsupport.c
>>> with these sorts of functions?
>
>> Why are we going to want to do that?  If it's because there are other
>> places in the code that can make use of a fast comparator that don't
>> go through tuplesort.c, then we should probably break it off into a
>> separate file (sortkey.c?).  But if it's because we think that clients
>> of the tuplesort code are going to need it for some reason, then we
>> may as well keep it in tuplesort.c.
>
> My expectation is that nbtree, as well as mergejoin and mergeappend,
> would get converted over to use the fast comparator API.  I looked at
> that a little bit but didn't push it far enough to be very sure about
> whether they'd be able to share the initialization code from
> tuplesort_begin_heap.  But they're definitely going to need the shim
> function for backwards compatibility, and
> PrepareSortSupportComparisonShim was my first cut at a wrapper that
> would be generally useful.

OK.  Well, then pushing it out to a separate file probably makes
sense.  Do you want to do that or shall I have a crack at it?  If the
latter, what do you think about using the name SortKey for everything
rather than SortSupport?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-06T18:07:08Z

Robert Haas <robertmhaas@gmail.com> writes:
> OK.  Well, then pushing it out to a separate file probably makes
> sense.  Do you want to do that or shall I have a crack at it?  If the
> latter, what do you think about using the name SortKey for everything
> rather than SortSupport?

I'll take another crack at it.  I'm not entirely sold yet on merging
the two structs; I think first we'd better look and see what the needs
are in the other potential callers I mentioned.  If we'd end up
cluttering the struct with half a dozen weird fields, it'd be better to
stick to a minimal interface struct with various wrapper structs, IMO.

OTOH it did seem that the names were getting a bit long.  If we do
keep the two-struct-levels approach, what do you think of
s/SortSupportInfo/SortSupport/g ?

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-12-06T18:08:29Z

On Tue, Dec 6, 2011 at 1:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> OK.  Well, then pushing it out to a separate file probably makes
>> sense.  Do you want to do that or shall I have a crack at it?  If the
>> latter, what do you think about using the name SortKey for everything
>> rather than SortSupport?
>
> I'll take another crack at it.  I'm not entirely sold yet on merging
> the two structs; I think first we'd better look and see what the needs
> are in the other potential callers I mentioned.  If we'd end up
> cluttering the struct with half a dozen weird fields, it'd be better to
> stick to a minimal interface struct with various wrapper structs, IMO.

OK.  I'll defer to whatever you come up with after looking at it.

> OTOH it did seem that the names were getting a bit long.  If we do
> keep the two-struct-levels approach, what do you think of
> s/SortSupportInfo/SortSupport/g ?

+1.  I had that thought when you originally suggested that name, but
it didn't seem worth arguing about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-06T21:23:17Z

Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Dec 6, 2011 at 1:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'll take another crack at it.  I'm not entirely sold yet on merging
>> the two structs; I think first we'd better look and see what the needs
>> are in the other potential callers I mentioned.  If we'd end up
>> cluttering the struct with half a dozen weird fields, it'd be better to
>> stick to a minimal interface struct with various wrapper structs, IMO.

> OK.  I'll defer to whatever you come up with after looking at it.

OK, it looks like nodeMergeAppend.c could use something exactly like the
draft SortKey struct, while nodeMergejoin.c could embed such a struct in
MergeJoinClauseData.  The btree stuff needs something more nearly
equivalent to a ScanKey, including a datum-to-compare-to and a flags
field.  I'm inclined to think the latter would be too specialized to put
in the generic struct.  On the other hand, including the reverse and
nulls_first flags in the generic struct is clearly a win since it allows
ApplyComparator() to be defined as a generic function.  So the only
thing that's really debatable is the attno field, and I'm not anal
enough to insist on a separate level of struct just for that.

I am however inclined to stick with the shortened struct name SortSupport
rather than using SortKey.  The presence of the function pointer fields
(especially the inlined-qsort pointers, assuming we adopt some form of
Peter's patch) changes the struct's nature in my view; it's not really
describing just a sort key (ie an ORDER BY column specification).

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-12-07T00:18:24Z

On Tue, Dec 6, 2011 at 4:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Dec 6, 2011 at 1:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I'll take another crack at it.  I'm not entirely sold yet on merging
>>> the two structs; I think first we'd better look and see what the needs
>>> are in the other potential callers I mentioned.  If we'd end up
>>> cluttering the struct with half a dozen weird fields, it'd be better to
>>> stick to a minimal interface struct with various wrapper structs, IMO.
>
>> OK.  I'll defer to whatever you come up with after looking at it.
>
> OK, it looks like nodeMergeAppend.c could use something exactly like the
> draft SortKey struct, while nodeMergejoin.c could embed such a struct in
> MergeJoinClauseData.  The btree stuff needs something more nearly
> equivalent to a ScanKey, including a datum-to-compare-to and a flags
> field.  I'm inclined to think the latter would be too specialized to put
> in the generic struct.  On the other hand, including the reverse and
> nulls_first flags in the generic struct is clearly a win since it allows
> ApplyComparator() to be defined as a generic function.  So the only
> thing that's really debatable is the attno field, and I'm not anal
> enough to insist on a separate level of struct just for that.
>
> I am however inclined to stick with the shortened struct name SortSupport
> rather than using SortKey.  The presence of the function pointer fields
> (especially the inlined-qsort pointers, assuming we adopt some form of
> Peter's patch) changes the struct's nature in my view; it's not really
> describing just a sort key (ie an ORDER BY column specification).

Works for me.  I think we should go ahead and get this part committed
first, and then we can look at the inlining stuff as a further
optimization for certain cases...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-12-07T01:13:59Z

On 7 December 2011 00:18, Robert Haas <robertmhaas@gmail.com> wrote:
> Works for me.  I think we should go ahead and get this part committed
> first, and then we can look at the inlining stuff as a further
> optimization for certain cases...

Do you mean just inlining, or inlining and the numerous other
optimisations that my patch had?

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-07T01:46:27Z

Robert Haas <robertmhaas@gmail.com> writes:
> Works for me.  I think we should go ahead and get this part committed
> first, and then we can look at the inlining stuff as a further
> optimization for certain cases...

Yeah.  Attached is a second-draft patch, which I'm fairly happy with
except that the documentation hasn't been updated.  It extends the
changes into the executor as well as analyze.c, with the result that
we can get rid of some of the old infrastructure altogether.
(inlineApplySortFunction is still there for the moment, though.)
Also, I adopted a style similar to what we've done for inlined list
functions to make ApplyComparatorFunction inline-able by all callers.

There are three somewhat-independent lines of work to pursue from here:

1. Adding sortsupport infrastructure for more datatypes.
2. Revising nbtree and related code to use this infrastructure.
3. Integrating Peter's work into this framework.

I'll try to take care of #1 for at least a few key datatypes before
I commit, but I think #2 is best done as a separate patch, so I'll
postpone that till later.

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-12-07T03:45:20Z

On Tue, Dec 6, 2011 at 8:13 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 7 December 2011 00:18, Robert Haas <robertmhaas@gmail.com> wrote:
>> Works for me.  I think we should go ahead and get this part committed
>> first, and then we can look at the inlining stuff as a further
>> optimization for certain cases...
>
> Do you mean just inlining, or inlining and the numerous other
> optimisations that my patch had?

Whichever you like.  But I think part of the point here is to
disentangle those optimizations from each other and decide how broadly
it makes sense to apply each one.  Avoiding the FunctionCallInfo stuff
is one, and it seems like we can apply that to a wide variety of data
types (maybe all of them) for both in-memory and on-disk sorting, plus
btree index ops, merge joins, and merge append.  The gains may be
modest, but they will benefit many use cases.  Your original patch
targets a much narrower use case (in-memory sorting of POD types) but
the benefits are larger.  We don't have to pick between a general but
small optimization and a narrower but larger one; we can do both.

In this regard, I think Heikki's remarks upthread are worth some
thought.  If inlining is a win just because it avoids saving and
restoring registers or allows better instruction scheduling, then
inlining is the (probably?) the only way to get the benefit.  But if
most of the benefit is in having a separate path for the
single-sort-key case, we can do that without duplicating the qsort()
code and get the benefit for every data type without much code bloat.
I'd like to see us dig into that a little, so that we get the broadest
possible benefit out of this work.  It doesn't bother me that not
every optimization will apply to every case, and I don't object to
optimizations that are intrinsically narrow (within some reasonable
limits).  But I'd rather not take what could be a fairly broad-based
optimization and apply it only narrowly, all things being equal.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-12-07T11:57:08Z

On 7 December 2011 03:45, Robert Haas <robertmhaas@gmail.com> wrote:
> In this regard, I think Heikki's remarks upthread are worth some
> thought.  If inlining is a win just because it avoids saving and
> restoring registers or allows better instruction scheduling, then
> inlining is the (probably?) the only way to get the benefit.  But if
> most of the benefit is in having a separate path for the
> single-sort-key case, we can do that without duplicating the qsort()
> code and get the benefit for every data type without much code bloat.
> I'd like to see us dig into that a little, so that we get the broadest
> possible benefit out of this work.  It doesn't bother me that not
> every optimization will apply to every case, and I don't object to
> optimizations that are intrinsically narrow (within some reasonable
> limits).  But I'd rather not take what could be a fairly broad-based
> optimization and apply it only narrowly, all things being equal.

I think we're in agreement then.

It's important to realise, though I'm sure you do, that once you've
done what I did with sorting single scanKey integers, that's it -
there isn't really any way that I can think of to speed up quick
sorting further, other than parallelism which has major challenges,
and isn't particularly appropriate at this granularity. The real
significance of inlining is, well, that's it, that's all there is -
after inlining, I predict that the well will run dry, or practically
dry. Inlining actually is a pretty great win when you look at sorting
in isolation, but it's just that there's been a number of other
significant wins in my patch, and of course there is overhead from a
number of other areas, so perhaps it's easy to lose sight of that.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-12-07T14:39:54Z

On Tue, Dec 6, 2011 at 8:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 1. Adding sortsupport infrastructure for more datatypes.
> 2. Revising nbtree and related code to use this infrastructure.
> 3. Integrating Peter's work into this framework.
>
> I'll try to take care of #1 for at least a few key datatypes before
> I commit, but I think #2 is best done as a separate patch, so I'll
> postpone that till later.

I see you've committed a chunk of this now.  Does it make sense to do
#1 for every data type we support, or should we be more selective than
that?  My gut feeling would be to add it across the board and
introduce an opr_sanity check for it.  The general utility of adding
it to deprecated types like abstime is perhaps questionable, but it
strikes me that the value of making everything consistent probably
outweighs the cost of a few extra lines of code.

Are you planning to do anything about #2 or #3?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-07T15:09:32Z

Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Dec 6, 2011 at 8:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 1. Adding sortsupport infrastructure for more datatypes.
>> 2. Revising nbtree and related code to use this infrastructure.
>> 3. Integrating Peter's work into this framework.
>> 
>> I'll try to take care of #1 for at least a few key datatypes before
>> I commit, but I think #2 is best done as a separate patch, so I'll
>> postpone that till later.

> I see you've committed a chunk of this now.  Does it make sense to do
> #1 for every data type we support, or should we be more selective than
> that?

Basically, I tried to do #1 for every datatype for which the comparator
was cheap enough that reducing the call overhead seemed likely to make a
useful difference.  I'm not in favor of adding sortsupport functions
where this is not true, as I think it'll be useless code and catalog
bloat.  I don't want to add 'em for cruft like abstime either.

There's some stuff that's debatable according to this criterion --- in
particular, I wondered whether it'd be worth having a fast path for
bttextcmp, especially if we pre-tested the collate_is_c condition and
had a separate version that just hardwired the memcmp code path.  (The
idea of doing that was one reason I insisted on collation being known at
the setup step.)  But it would still have to be prepared for detoasting,
so in the end I was unenthused.  Anyone who feels like testing could try
to prove me wrong about it though.

> Are you planning to do anything about #2 or #3?

I am willing to do #2, but not right now; I feel what I need to do next
is go review SPGist.  I don't believe that #2 blocks progress on #3
anyway.  I think #3 is in Peter's court, or yours if you want to do it.

(BTW, I agree with your comments yesterday about trying to break down
the different aspects of what Peter did, and put as many of them as we
can into the non-inlined code paths.)

			regards, tom lane

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-12-07T15:15:35Z

On Wed, Dec 7, 2011 at 10:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There's some stuff that's debatable according to this criterion --- in
> particular, I wondered whether it'd be worth having a fast path for
> bttextcmp, especially if we pre-tested the collate_is_c condition and
> had a separate version that just hardwired the memcmp code path.  (The
> idea of doing that was one reason I insisted on collation being known at
> the setup step.)  But it would still have to be prepared for detoasting,
> so in the end I was unenthused.  Anyone who feels like testing could try
> to prove me wrong about it though.

I think that'd definitely be worth investigating (although I'm not
sure I have the time to do it myself any time real soon).

>> Are you planning to do anything about #2 or #3?
>
> I am willing to do #2, but not right now; I feel what I need to do next
> is go review SPGist.

Yeah, makes sense.  That one seems likely to be a challenge to absorb.

> I don't believe that #2 blocks progress on #3
> anyway.  I think #3 is in Peter's court, or yours if you want to do it.
>
> (BTW, I agree with your comments yesterday about trying to break down
> the different aspects of what Peter did, and put as many of them as we
> can into the non-inlined code paths.)

Cool.  Peter, can you rebase your patch and integrate it into the
sortsupport framework that's now committed?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Peter Geoghegan <peter@2ndquadrant.com> — 2011-12-07T15:58:03Z

On 7 December 2011 15:15, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Dec 7, 2011 at 10:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> But it would still have to be prepared for detoasting,
>> so in the end I was unenthused.  Anyone who feels like testing could try
>> to prove me wrong about it though.
>
> I think that'd definitely be worth investigating (although I'm not
> sure I have the time to do it myself any time real soon).

I'll at least take a look at it. Sorting text is a fairly common case.
I'm not hugely enthused about spending too much time on work that will
only be useful if collate_is_c.

>> I don't believe that #2 blocks progress on #3
>> anyway.  I think #3 is in Peter's court, or yours if you want to do it.
>>
>> (BTW, I agree with your comments yesterday about trying to break down
>> the different aspects of what Peter did, and put as many of them as we
>> can into the non-inlined code paths.)

I'm confident that we should have everything for the simple case of
ordering by a single int4 and int8 column, and I think you'd probably
agree with that - they're extremely common cases. Anything beyond that
will need to be justified, probably in part by running additional
benchmarks.

> Cool.  Peter, can you rebase your patch and integrate it into the
> sortsupport framework that's now committed?

Yes, I'd be happy to, though I don't think I'm going to be getting
around to it this side of Friday. Since it isn't a blocker, I assume
that's okay.

The rebased revision will come complete with a well thought-out
rationale for my use of inlining specialisations, that takes account
of the trade-off against binary bloat that Tom highlighted. I wasn't
ignoring that issue, but I did fail to articulate my thoughts there,
mostly because I felt the need to do some additional research to
justify my position.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Inlining comparators as a performance optimisation

Robert Haas <robertmhaas@gmail.com> — 2011-12-07T16:22:30Z

On Wed, Dec 7, 2011 at 10:58 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 7 December 2011 15:15, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Dec 7, 2011 at 10:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> But it would still have to be prepared for detoasting,
>>> so in the end I was unenthused.  Anyone who feels like testing could try
>>> to prove me wrong about it though.
>>
>> I think that'd definitely be worth investigating (although I'm not
>> sure I have the time to do it myself any time real soon).
>
> I'll at least take a look at it. Sorting text is a fairly common case.
> I'm not hugely enthused about spending too much time on work that will
> only be useful if collate_is_c.

Well, text_pattern_ops is not exactly an uncommon case, but sure.  And
there might be some benefit for other collations, too.

>> Cool.  Peter, can you rebase your patch and integrate it into the
>> sortsupport framework that's now committed?
>
> Yes, I'd be happy to, though I don't think I'm going to be getting
> around to it this side of Friday. Since it isn't a blocker, I assume
> that's okay.

Sounds fine to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Inlining comparators as a performance optimisation

Greg Smith <greg@2ndquadrant.com> — 2011-12-16T19:55:12Z

I think we can call a new sorting infrastructure popping out and what 
looks to be over 90 messages on this topic as successful progress on 
this front.  Peter's going to rev a new patch, but with more performance 
results to review and followup discussion I can't see this one as 
wrapping for the current CommitFest.  Marking it returned and we'll 
return to this topic during or before the next CF.  With the majority of 
the comments here coming from committers, I think continuing progress in 
that direction won't be a problem.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us

Re: Inlining comparators as a performance optimisation

Pierre C <lists@peufeu.com> — 2012-01-13T09:48:56Z

On Wed, 21 Sep 2011 18:13:07 +0200, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> On 21.09.2011 18:46, Tom Lane wrote:
>>> The idea that I was toying with was to allow the regular SQL-callable
>>> comparison function to somehow return a function pointer to the
>>> alternate comparison function,
>
>> You could have a new function with a pg_proc entry, that just returns a
>> function pointer to the qsort-callback.
>
> Yeah, possibly.  That would be a much more invasive change, but cleaner
> in some sense.  I'm not really prepared to do all the legwork involved
> in that just to get to a performance-testable patch though.

A few years ago I had looked for a way to speed up COPY operations, and it  
turned out that COPY TO has a good optimization opportunity. At that time,  
for each datum, COPY TO would :

- test for nullness
- call an outfunc through fmgr
- outfunc pallocs() a bytea or text, fills it with data, and returns it  
(sometimes it uses an extensible string buffer which may be repalloc()d  
several times)
- COPY memcpy()s returned data to a buffer and eventually flushes the  
buffer to client socket.

I introduced a special write buffer with an on-flush callback (ie, a close  
relative of the existing string-buffer), in this case the callback was  
"flush to client socket", and several outfuncs (one per type) which took  
that buffer as argument, besides the datum to output, and simply put the  
datum inside the buffer, with appropriate transformations (like converting  
to bytea or text), and flushed if needed.

Then the COPY TO BINARY of a constant-size datum would turn to :
- one test for nullness
- one C function call
- one test to ensure appropriate space available in buffer (flush if  
needed)
- one htonl() and memcpy of constant size, which the compiler turns out  
into a couple of simple instructions

I recall measuring speedups of 2x - 8x on COPY BINARY, less for text, but  
still large gains.

Although eliminating fmgr call and palloc overhead was an important part  
of it, another large part was getting rid of memcpy()'s which the compiler  
turned into simple movs for known-size types, a transformation that can be  
done only if the buffer write functions are inlined inside the outfuncs.  
Compilers love constants...

Additionnally, code size growth was minimal since I moved the old outfuncs  
code into the new outfuncs, and replaced the old fmgr-callable outfuncs  
with "create buffer with on-full callback=extend_and_repalloc() - pass to  
new outfunc(buffer,datum) - return buffer". Which is basically equivalent  
to the previous palloc()-based code, maybe with a few extra instructions.

When I submitted the patch for review, Tom rightfully pointed out that my  
way of obtaining the C function pointer sucked very badly (I don't  
remember how I did it, only that it was butt-ugly) but the idea was to get  
a quick measurement of what could be gained, and the result was positive.  
Unfortunately I had no time available to finish it and make it into a real  
patch, I'm sorry about that.

So why do I post in this sorting topic ? It seems, by bypassing fmgr for  
functions which are small, simple, and called lots of times, there is a  
large gain to be made, not only because of fmgr overhead but also because  
of the opportunity for new compiler optimizations, palloc removal, etc.  
However, in my experiment the arguments and return types of the new  
functions were DIFFERENT from the old functions : the new ones do the same  
thing, but in a different manner. One manner was suited to sql-callable  
functions (ie, palloc and return a bytea) and another one to writing large  
amounts of data (direct buffer write). Since both have very different  
requirements, being fast at both is impossible for the same function.

Anyway, all that rant boils down to :

Some functions could benefit having two versions (while sharing almost all  
the code between them) :
- User-callable (fmgr) version (current one)
- C-callable version, usually with different parameters and return type

And it would be cool to have a way to grab a bare function pointer on the  
second one.

Maybe an extra column in pg_proc would do (but then, the proargtypes and  
friends would describe only the sql-callable version) ?
Or an extra table ? pg_cproc ?
Or an in-memory hash : hashtable[ fmgr-callable function ] => C version
- What happens if a C function has no SQL-callable equivalent ?
Or (ugly) introduce an extra per-type function type_get_function_ptr(  
function_kind ) which returns the requested function ptr

If one of those happens, I'll dust off my old copy-optimization patch ;)

Hmm... just my 2c

Regards
Pierre

Re: Inlining comparators as a performance optimisation

Bruce Momjian <bruce@momjian.us> — 2012-01-28T14:55:04Z

On Fri, Jan 13, 2012 at 10:48:56AM +0100, Pierre C wrote:
> Maybe an extra column in pg_proc would do (but then, the proargtypes
> and friends would describe only the sql-callable version) ?
> Or an extra table ? pg_cproc ?
> Or an in-memory hash : hashtable[ fmgr-callable function ] => C version
> - What happens if a C function has no SQL-callable equivalent ?
> Or (ugly) introduce an extra per-type function
> type_get_function_ptr( function_kind ) which returns the requested
> function ptr
> 
> If one of those happens, I'll dust off my old copy-optimization patch ;)

I agree that COPY is ripe for optimization, and I am excited you have
some ideas on this.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +