Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Speed up conversion of signed integers to C strings.

  2. Remove some unnecessary tests of pgstat_track_counts.

  3. Remove cvs keywords from all files.

  4. Code cleanup for function prototypes: change two K&R-style prototypes

  5. Use Min() instead of min() in qsort, for consistency and to avoid

  6. pgindent run for 8.2.

  7. Switch over to using our own qsort() all the time, as has been proposed

  1. Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-09-20T01:56:01Z

    Recent discussions on the threads "Double sorting split patch" and
    "CUDA sorting" raised the possibility that there could be significant
    performance optimisation "low-hanging fruit" picked by having the
    executor treat integers and floats as a special case during sorting,
    avoiding going to the trouble of calling a comparator using the
    built-in SQL function machinery, and taking advantage of inlining of
    the comparator, which has been shown to have a considerable
    performance advantage (at least compared to a general purpose c stdlib
    qsort(), that takes a function pointer as its comparator, much like
    tuplesort).
    
    I've hacked together a sloppy POC implementation in a hurry
    (basically, some code is shifted around) , which is attached - I felt
    that it would be useful to determine if the community feels that this
    is a worth-while undertaking in advance of a business trip that I'm
    leaving on tomorrow lasting until Friday, during which I will be
    mostly unavailable. The patch breaks the Postgres sorting executor
    node (at least when it quicksorts) for any type other than int4. I
    apologise for how rough the patch is, but the code itself isn't
    important right now - the idea is. I anticipate that the value
    state->datumType or something similar will be set in a near future
    revision, so that tuplesort_performsort will know which type-specific
    optimisation it can use for the type, while falling back on the
    existing generic qsort_arg + qsort_arg_comparator, and sorting won't
    actually be broken.
    
    I've been doing some preliminary testing using the dell store 2 sample
    database. I increase work_mem to '50MB', to ensure that a quicksort
    will be performed for sorting (otherwise, I'm using the
    postgresql.conf that initdb gave me). The query is:
    
    explain analyze select * from orderlines order by prod_id;
    
    Once the cache has been warmed, explain analyze very consistently
    reports a runtime of 123ms for this query on master/HEAD, which varies
    +/- 1 ms, with a few outliers of maybe +/- 2ms. However, when I apply
    this patch, that goes down to 107ms +/- 1ms at -O0. I think that
    that's a pretty good start. Funnily enough, the difference/advantage
    vanishes at -O2 (I'm guessing that the higher optimisation level of
    GCC 4.5 hyper-corrects away the inlining, but I don't have time to
    check that right now).
    
    I imagine the version that I actually submit for patch review will
    have a macro-based infrastructure for inlining the sorting of various
    built-in types, initially integers and floats. It will most likely
    have some other optimisations - I haven't even used a profiler yet.
    
    This performance patch differs from most in that it's difficult in
    principle to imagine a performance regression occurring.
    
    Thoughts?
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
  2. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-20T02:51:32Z

    Peter Geoghegan <peter@2ndquadrant.com> writes:
    > Once the cache has been warmed, explain analyze very consistently
    > reports a runtime of 123ms for this query on master/HEAD, which varies
    > +/- 1 ms, with a few outliers of maybe +/- 2ms. However, when I apply
    > this patch, that goes down to 107ms +/- 1ms at -O0. I think that
    > that's a pretty good start. Funnily enough, the difference/advantage
    > vanishes at -O2 (I'm guessing that the higher optimisation level of
    > GCC 4.5 hyper-corrects away the inlining, but I don't have time to
    > check that right now).
    
    Considering that -O2 is our standard optimization level, that
    observation seems to translate to "this patch will be useless in
    practice".  I think you had better investigate that aspect in some
    detail before spending more effort.
    
    > This performance patch differs from most in that it's difficult in
    > principle to imagine a performance regression occurring.
    
    Really?  N copies of the same code could lead to performance loss just
    due to code bloat (ie, less of a query's inner loops fitting in CPU
    cache).  Not to mention the clear regression in maintainability.  So
    I'm disinclined to consider this sort of change without a significantly
    bigger win than you're suggesting above (no, I don't even consider the
    -O0 number attractive, let alone what you're finding at -O2).
    
    			regards, tom lane
    
    
  3. Re: Inlining comparators as a performance optimisation

    Amit Kapila <amit.kapila@huawei.com> — 2011-09-20T13:57:24Z

    >Recent discussions on the threads "Double sorting split patch" and
    
    >"CUDA sorting" raised the possibility that there could be significant
    
    >performance optimisation "low-hanging fruit" picked by having the
    
    >executor treat integers and floats as a special case during sorting,
    
    >avoiding going to the trouble of calling a comparator using the
    
    >built-in SQL function machinery
    
     
    
    Why only for integers and floats why not for char/varchar?
    
    But I believe this can make code less maintainable as similar things can be
    done at other places to avoid SQL function machinery.
    
     
    
    >Once the cache has been warmed, explain analyze very consistently
    
    >reports a runtime of 123ms for this query on master/HEAD, which varies
    
    >+/- 1 ms, with a few outliers of maybe +/- 2ms. However, when I apply
    
    >this patch, that goes down to 107ms +/- 1ms at -O0.
    
     
    
    Time 123ms which is without your change is with which optimization -O2 or
    O0?
    
     
    
     
    
    ****************************************************************************
    ***********
    
    This e-mail and attachments contain confidential information from HUAWEI,
    which is intended only for the person or entity whose address is listed
    above. Any use of the information contained herein in any way (including,
    but not limited to, total or partial disclosure, reproduction, or
    dissemination) by persons other than the intended recipient's) is
    prohibited. If you receive this e-mail in error, please notify the sender by
    phone or email immediately and delete it!
    
     
    
     
    
    -----Original Message-----
    From: pgsql-hackers-owner@postgresql.org
    [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Peter Geoghegan
    Sent: Tuesday, September 20, 2011 7:26 AM
    To: PG Hackers
    Subject: [HACKERS] Inlining comparators as a performance optimisation
    
     
    
    Recent discussions on the threads "Double sorting split patch" and
    
    "CUDA sorting" raised the possibility that there could be significant
    
    performance optimisation "low-hanging fruit" picked by having the
    
    executor treat integers and floats as a special case during sorting,
    
    avoiding going to the trouble of calling a comparator using the
    
    built-in SQL function machinery, and taking advantage of inlining of
    
    the comparator, which has been shown to have a considerable
    
    performance advantage (at least compared to a general purpose c stdlib
    
    qsort(), that takes a function pointer as its comparator, much like
    
    tuplesort).
    
     
    
    I've hacked together a sloppy POC implementation in a hurry
    
    (basically, some code is shifted around) , which is attached - I felt
    
    that it would be useful to determine if the community feels that this
    
    is a worth-while undertaking in advance of a business trip that I'm
    
    leaving on tomorrow lasting until Friday, during which I will be
    
    mostly unavailable. The patch breaks the Postgres sorting executor
    
    node (at least when it quicksorts) for any type other than int4. I
    
    apologise for how rough the patch is, but the code itself isn't
    
    important right now - the idea is. I anticipate that the value
    
    state->datumType or something similar will be set in a near future
    
    revision, so that tuplesort_performsort will know which type-specific
    
    optimisation it can use for the type, while falling back on the
    
    existing generic qsort_arg + qsort_arg_comparator, and sorting won't
    
    actually be broken.
    
     
    
    I've been doing some preliminary testing using the dell store 2 sample
    
    database. I increase work_mem to '50MB', to ensure that a quicksort
    
    will be performed for sorting (otherwise, I'm using the
    
    postgresql.conf that initdb gave me). The query is:
    
     
    
    explain analyze select * from orderlines order by prod_id;
    
     
    
    Once the cache has been warmed, explain analyze very consistently
    
    reports a runtime of 123ms for this query on master/HEAD, which varies
    
    +/- 1 ms, with a few outliers of maybe +/- 2ms. However, when I apply
    
    this patch, that goes down to 107ms +/- 1ms at -O0. I think that
    
    that's a pretty good start. Funnily enough, the difference/advantage
    
    vanishes at -O2 (I'm guessing that the higher optimisation level of
    
    GCC 4.5 hyper-corrects away the inlining, but I don't have time to
    
    check that right now).
    
     
    
    I imagine the version that I actually submit for patch review will
    
    have a macro-based infrastructure for inlining the sorting of various
    
    built-in types, initially integers and floats. It will most likely
    
    have some other optimisations - I haven't even used a profiler yet.
    
     
    
    This performance patch differs from most in that it's difficult in
    
    principle to imagine a performance regression occurring.
    
     
    
    Thoughts?
    
     
    
    -- 
    
    Peter Geoghegan       http://www.2ndQuadrant.com/
    
    PostgreSQL Development, 24x7 Support, Training and Services
    
    
  4. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-09-20T23:53:16Z

    On 20 September 2011 03:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Considering that -O2 is our standard optimization level, that
    > observation seems to translate to "this patch will be useless in
    > practice".  I think you had better investigate that aspect in some
    > detail before spending more effort.
    
    I don't think that the fact that that happens is at all significant at
    this early stage, and it never even occurred to me that you'd think
    that it might be. I was simply disclosing a quirk of this POC patch.
    The workaround is probably to use a macro instead. For the benefit of
    those that didn't follow the other threads, the macro-based qsort
    implementation, which I found to perform significantly better than
    regular qsort(), runs like this on my laptop when I built at 02 with
    GCC 4.6 just now:
    
    C stdlib quick-sort time elapsed: 2.092451 seconds
    Inline quick-sort time elapsed: 1.587651 seconds
    
    Does *that* look attractive to you? I've attached source code of the
    program that produced these figures, which has been ported to C from
    C++.
    
    When I #define LARGE_SIZE 100000000, here's what I see:
    
    [peter@peter inline_compar_test]$ ./a.out
    C stdlib quick-sort time elapsed: 23.659411 seconds
    Inline quick-sort time elapsed: 18.470611 seconds
    
    Here, sorting with the function pointer/stdlib version takes about
    1.28 times as long. In the prior test (with the smaller LARGE_SIZE),
    it took about 1.32 times as long. Fairly predictable, linear, and not
    to be sniffed at.
    
    The variance I'm seeing across runs is low - a couple of hundredths of
    a second at most. This is a Fedora 15 " Intel(R) Core(TM) i5-2540M CPU
    @ 2.60GHz" machine. I'm not sure right now why the inline quick-sort
    is less of a win than on my old Fedora 14 desktop (where it was 3.24
    Vs 2.01), but it's still a significant win. Perhaps others can build
    this simple program and tell me what they come up with.
    
    >> This performance patch differs from most in that it's difficult in
    >> principle to imagine a performance regression occurring.
    >
    > Really?  N copies of the same code could lead to performance loss just
    > due to code bloat (ie, less of a query's inner loops fitting in CPU
    > cache).
    
    I did consider that. Of course inlining has an overhead, and I'll be
    testing that each instance of inlining has a net benefit. I just meant
    that many other performance patches have an obvious worst case, and I
    think that it is policy to focus on that case, but I can't think of
    one here. Does anyone else have any ideas?
    
    > Not to mention the clear regression in maintainability.  So
    > I'm disinclined to consider this sort of change without a significantly
    > bigger win than you're suggesting above
    
    Sure, there'll be some sort of regression in maintainability - I think
    that HOT had a clear regression in maintainability too. The important
    questions are obviously "how big is the loss of maintainability?", and
    "is it worth it?". We'll know more when this work is actually shaped
    into a proper patch. Perhaps I should have waited until I had
    something along those lines before making an announcement, but I
    wanted community input as early as possible. I think that there's
    plenty of tweaking that can be done to get additional performance
    improvements - all I've done so far is demonstrate that those
    improvements are real and worth thinking about, in the fastest
    possible way, partly because you expressed skepticism of the benefits
    of inlining comparators to Greg Stark in an earlier thread.
    
    Performance and maintainability are often somewhat in tension, but we
    cannot ignore performance. If this work can bring us an improvement in
    performance approaching the isolated macro Vs qsort() function pointer
    benchmark, that's a *big* win. Sorting integers and floats is very
    common and important.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
  5. [PATCH] POC: inline int4 comparison in tuplesort

    Dan McGee <dan@archlinux.org> — 2011-09-21T01:54:20Z

    This attempts to be as simple as it gets while reducing function call
    depth, and should be viewed as a proof of concept. It is also untested
    as of now, but will try to do that and report back.
    
    I'm hoping I followed the rabbit hole correctly and are correctly
    comparing the right pointers to each other in order to short circuit the
    case where we are using the int4 comparison operator.
    
    Peter, if you want to compare stock vs. your patch vs. this patch, we might
    be able to get some sort of read on where the maintainablity vs. performance
    curve lies. Note that this version should still allow sorting of anything,
    and simply shifts gears for int4 tuples...
    
    ---
     src/backend/utils/sort/tuplesort.c |   23 +++++++++++++++++++++--
     1 files changed, 21 insertions(+), 2 deletions(-)
    
    diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
    index 3505236..ddd5ced 100644
    --- a/src/backend/utils/sort/tuplesort.c
    +++ b/src/backend/utils/sort/tuplesort.c
    @@ -2652,6 +2652,22 @@ myFunctionCall2Coll(FmgrInfo *flinfo, Oid collation, Datum arg1, Datum arg2)
     	return result;
     }
     
    +static inline
    +int int4cmp(Datum first, Datum second)
    +{
    +	int32		a = DatumGetInt32(first);
    +	int32		b = DatumGetInt32(second);
    +
    +	if (a > b)
    +		return 1;
    +	else if (a == b)
    +		return 0;
    +	else
    +		return -1;
    +}
    +
    +extern Datum btint4cmp(PG_FUNCTION_ARGS);
    +
     /*
      * Apply a sort function (by now converted to fmgr lookup form)
      * and return a 3-way comparison result.  This takes care of handling
    @@ -2683,8 +2699,11 @@ inlineApplySortFunction(FmgrInfo *sortFunction, int sk_flags, Oid collation,
     	}
     	else
     	{
    -		compare = DatumGetInt32(myFunctionCall2Coll(sortFunction, collation,
    -													datum1, datum2));
    +		if (sortFunction->fn_addr == btint4cmp)
    +			compare = int4cmp(datum1, datum2);
    +		else
    +			compare = DatumGetInt32(myFunctionCall2Coll(sortFunction, collation,
    +														datum1, datum2));
     
     		if (sk_flags & SK_BT_DESC)
     			compare = -compare;
    -- 
    1.7.6.3
    
    
    
  6. Re: Inlining comparators as a performance optimisation

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-09-21T06:51:32Z

    On 21.09.2011 02:53, Peter Geoghegan wrote:
    > C stdlib quick-sort time elapsed: 2.092451 seconds
    > Inline quick-sort time elapsed: 1.587651 seconds
    >
    > Does *that* look attractive to you?
    
    Not really, to be honest. That's a 25% speedup in pure qsorting speed. 
    How much of a gain in a real query do you expect to get from that, in 
    the best case? There's so many other sources of overhead that I'm afraid 
    this will be lost in the noise. If you find a query that spends, say, 
    50% of its time in qsort(), you will only get a 12.5% speedup on that 
    query. And even 50% is really pushing it - I challenge you to find a 
    query that spends any significant amount of time qsorting integers.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  7. Re: Inlining comparators as a performance optimisation

    Simon Riggs <simon@2ndquadrant.com> — 2011-09-21T06:55:55Z

    On Tue, Sep 20, 2011 at 3:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    
    >> This performance patch differs from most in that it's difficult in
    >> principle to imagine a performance regression occurring.
    >
    > Really?  N copies of the same code could lead to performance loss just
    > due to code bloat (ie, less of a query's inner loops fitting in CPU
    > cache).  Not to mention the clear regression in maintainability.  So
    > I'm disinclined to consider this sort of change without a significantly
    > bigger win than you're suggesting above (no, I don't even consider the
    > -O0 number attractive, let alone what you're finding at -O2).
    
    More copies of the code are somewhat annoying, but its only 100 lines
    of code in one module and we can easily have specific tests for each.
    The extra code size is minor in comparison to the reams of code we add
    elsewhere.
    
    It's a surprisingly good win for such a common use case. Well done, Peter.
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  8. Re: Inlining comparators as a performance optimisation

    Simon Riggs <simon@2ndquadrant.com> — 2011-09-21T07:01:11Z

    On Wed, Sep 21, 2011 at 7:51 AM, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    > On 21.09.2011 02:53, Peter Geoghegan wrote:
    >>
    >> C stdlib quick-sort time elapsed: 2.092451 seconds
    >> Inline quick-sort time elapsed: 1.587651 seconds
    >>
    >> Does *that* look attractive to you?
    >
    > Not really, to be honest. That's a 25% speedup in pure qsorting speed. How
    > much of a gain in a real query do you expect to get from that, in the best
    > case? There's so many other sources of overhead that I'm afraid this will be
    > lost in the noise. If you find a query that spends, say, 50% of its time in
    > qsort(), you will only get a 12.5% speedup on that query. And even 50% is
    > really pushing it - I challenge you to find a query that spends any
    > significant amount of time qsorting integers.
    
    How about almost every primary index creation?
    
    Don't really see a reason for the negativity here. If you use that
    argument no performance gain is worth it because all workloads are
    mixed.
    
    This is a marvellous win, a huge gain from a small, isolated and
    easily tested change. By far the smallest amount of additional code to
    sorting we will have added and yet one of the best gains.
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  9. Re: Inlining comparators as a performance optimisation

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-09-21T07:08:42Z

    On 21.09.2011 10:01, Simon Riggs wrote:
    > On Wed, Sep 21, 2011 at 7:51 AM, Heikki Linnakangas
    > <heikki.linnakangas@enterprisedb.com>  wrote:
    >> On 21.09.2011 02:53, Peter Geoghegan wrote:
    >>>
    >>> C stdlib quick-sort time elapsed: 2.092451 seconds
    >>> Inline quick-sort time elapsed: 1.587651 seconds
    >>>
    >>> Does *that* look attractive to you?
    >>
    >> Not really, to be honest. That's a 25% speedup in pure qsorting speed. How
    >> much of a gain in a real query do you expect to get from that, in the best
    >> case? There's so many other sources of overhead that I'm afraid this will be
    >> lost in the noise. If you find a query that spends, say, 50% of its time in
    >> qsort(), you will only get a 12.5% speedup on that query. And even 50% is
    >> really pushing it - I challenge you to find a query that spends any
    >> significant amount of time qsorting integers.
    >
    > How about almost every primary index creation?
    
    Nope. Swamped by everything else.
    
    Also note that as soon as the sort grows big enough to not fit in 
    maintenance_work_mem, you switch to the external sort algorithm, which 
    doesn't use qsort. Perhaps you could do similar inlining in the heap 
    sort & merge passes done in the external sort, but it's unlikely to be 
    as big a win there.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  10. Re: Inlining comparators as a performance optimisation

    Greg Stark <stark@mit.edu> — 2011-09-21T12:47:46Z

    On Wed, Sep 21, 2011 at 8:08 AM, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    >> How about almost every primary index creation?
    >
    > Nope. Swamped by everything else.
    
    Really? I think it's pretty common for shops to be able to dedicate
    large amounts of RAM to building initial indexes on data loads or
    reindex operations. Enough that they can cache the entire table for
    the short time they're doing the index builds even if they're quite
    large. Witness the recent pleas to allow maintenance_work_mem on the
    order of tens of gigabytes. And it's also pretty common that shops can
    dedicate very large I/O bandwidth, in many cases enough to saturate
    the memory bandwidth, for doing these kinds of batch operations when
    they get large enough to need to do an external sort.
    
    There's still overhead of reading the pages, the tuples, finding the
    sort keys in the tuple, etc. But I think the actual qsort or heap
    operations in tapesort are pretty big portions of the work.
    
    This is pretty easy to measure. Just run oprofile or gprof and see
    what percentage of time for a big index build is spent in qsort.
    
    -- 
    greg
    
    
  11. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-09-21T13:34:49Z

    On Wed, Sep 21, 2011 at 8:47 AM, Greg Stark <stark@mit.edu> wrote:
    > On Wed, Sep 21, 2011 at 8:08 AM, Heikki Linnakangas
    > <heikki.linnakangas@enterprisedb.com> wrote:
    >>> How about almost every primary index creation?
    >>
    >> Nope. Swamped by everything else.
    >
    > Really? I think it's pretty common for shops to be able to dedicate
    > large amounts of RAM to building initial indexes on data loads or
    > reindex operations. Enough that they can cache the entire table for
    > the short time they're doing the index builds even if they're quite
    > large. Witness the recent pleas to allow maintenance_work_mem on the
    > order of tens of gigabytes. And it's also pretty common that shops can
    > dedicate very large I/O bandwidth, in many cases enough to saturate
    > the memory bandwidth, for doing these kinds of batch operations when
    > they get large enough to need to do an external sort.
    >
    > There's still overhead of reading the pages, the tuples, finding the
    > sort keys in the tuple, etc. But I think the actual qsort or heap
    > operations in tapesort are pretty big portions of the work.
    >
    > This is pretty easy to measure. Just run oprofile or gprof and see
    > what percentage of time for a big index build is spent in qsort.
    
    +1 for some actual measurements.
    
    I don't think anyone on this thread is saying that if we can get big
    performance gains from doing this we still shouldn't do it.  But at
    this point it's unclear that we can get a consistent speedup that
    isn't heavily dependent on the choice of compiler flags (to say
    nothing of compiler and OS), and even if we can, it's not clear that
    it will still be noticeable when you measure the run time of an entire
    query rather than just the speed of qsort().  Like Tom and Heikki, I'm
    a bit skeptical: it wouldn't surprise me to find out that qsort() is
    5% of the runtime any realistic test case and the average qsort()
    speedup based on tests on a couple different platforms is 10% and so
    on average we're looking at a 0.5% improvement, in which case it might
    be more trouble than it's worth, especially if it turns out that there
    are OS/platform combinations where the inlined version is (for some
    crazy reason) slower.  I've seen performance differences of up to 3%
    from minor code rearrangements that don't seem like they should matter
    at all, just because code and data shifts around across cache-line
    boundaries and the new arrangement is slightly better or worse than
    the old one.  So if the performance improvement turns out to be very
    small, then validating that it actually IS an improvement in general
    is likely to be kind of a pain in the ass.
    
    On the other hand, the performance improvement might turn out to be
    large.  Maybe there's a test case where, as Heikki suggests, 50% of
    the time is spent in qsort().  If we can reliably make that 25%
    faster, I wouldn't dismiss that out of hand; I think that would be
    pretty good, assuming it didn't require massive amounts of spaghetti
    code to make it work.  I don't see that that would be any more
    marginal than the sorts of things we've optimized in, say, commit
    4fc115b2e981f8c63165ca86a23215380a3fda66, or commit
    f4d242ef94730c447d87b9840a40b0ec3371fe0f.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  12. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-09-21T14:20:55Z

    On 21 September 2011 07:51, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    > On 21.09.2011 02:53, Peter Geoghegan wrote:
    >>
    >> C stdlib quick-sort time elapsed: 2.092451 seconds
    >> Inline quick-sort time elapsed: 1.587651 seconds
    >>
    >> Does *that* look attractive to you?
    >
    > Not really, to be honest. That's a 25% speedup in pure qsorting speed. How
    > much of a gain in a real query do you expect to get from that, in the best
    > case? There's so many other sources of overhead that I'm afraid this will be
    > lost in the noise.
    
    I'm surprised that you're dismissive of this. After all, we have in
    the past indulged in micro-optimisation of qsort, or so it would seem
    from this comment:
    
     * We have modified their original by adding a check for already-sorted input,
     * which seems to be a win per discussions on pgsql-hackers around 2006-03-21.
    
    "Makes affected queries radically faster" (In the best case, a speedup
    somewhat greater than 12.5%) is an unreasonably high standard for a
    performance optimisation of the executor in general (such a high
    standard might be sensible if it was due to a particular
    maintainability downside, but you didn't mention one). Even still, I
    think that the 12.5% figure is pretty pessimistic - I've already sped
    up the dell store query by almost that much, and that's with a patch
    that was, due to circumstances, cobbled together.
    
    Not only are we benefiting from the effects of inlining, we're also
    benefiting from the removal of unnecessary indirection. As Tom said,
    "In concrete terms, there would be no reason to have tuplesort.c's
    myFunctionCall2Coll, and maybe not inlineApplySortFunction either, if
    the datatype-specific comparison functions had APIs that were closer
    to what sorting wants rather than following the general
    SQL-callable-function API." He was just referring to the benefits of
    removing indirection here, so ISTM that this is really two performance
    optimisations rolled into one - it's conceivable that the total
    performance improvement will even exceed the isolated inlining
    comparator benchmark.
    
    As I've said, I believe this patch can be committed without
    compromising the maintainability of the tuplesort code to an extent
    that is not clearly worth it, through the use of a clean, macro-based
    abstraction. Concerns about bloated binaries are probably not well
    founded, because what I'm proposing is to a certain extent emulating
    C++ templates, while using a very common pattern used with C++
    templates. In the C++ world, algorithms are often generalised as
    templates, so that they can be used equally well with any datatype
    (that supports the interface of the template), while availing of
    compiler optimisations per template instantiation (instance of using a
    given type with a given template). I actually got the idea for this
    patch in part from a book that I read years ago that described the
    fact that counter-intuitively, std::sort() consistently outperforms
    qsort(), because the comparator is often inlined, and the compiler can
    generally avail of optimisations from knowing the comparator at
    compile-time.
    
    On 21 September 2011 13:47, Greg Stark <stark@mit.edu> wrote:
    > This is pretty easy to measure. Just run oprofile or gprof and see
    > what percentage of time for a big index build is spent in qsort.
    
    I'll do so soon. I intend to get to this on Friday evening, and
    perhaps have a proper patch to show next week.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
    
  13. Re: Inlining comparators as a performance optimisation

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-09-21T14:31:11Z

    On 21.09.2011 17:20, Peter Geoghegan wrote:
    > Even still, I
    > think that the 12.5% figure is pretty pessimistic - I've already sped
    > up the dell store query by almost that much, and that's with a patch
    > that was, due to circumstances, cobbled together.
    
    I'm not against making things faster, it's just that I haven't seen 
    solid evidence yet that this will help. Just provide a best-case test 
    case for this that shows a huge improvement, and I'll shut up. If the 
    improvement is only modest, then let's discuss how big it is and whether 
    it's worth the code ugliness this causes.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  14. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-21T14:50:22Z

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
    > On 21.09.2011 17:20, Peter Geoghegan wrote:
    >> Even still, I
    >> think that the 12.5% figure is pretty pessimistic - I've already sped
    >> up the dell store query by almost that much, and that's with a patch
    >> that was, due to circumstances, cobbled together.
    
    > I'm not against making things faster, it's just that I haven't seen 
    > solid evidence yet that this will help. Just provide a best-case test 
    > case for this that shows a huge improvement, and I'll shut up. If the 
    > improvement is only modest, then let's discuss how big it is and whether 
    > it's worth the code ugliness this causes.
    
    The other question that I'm going to be asking is whether it's not
    possible to get most of the same improvement with a much smaller code
    footprint.  I continue to suspect that getting rid of the SQL function
    impedance-match layer (myFunctionCall2Coll etc) would provide most of
    whatever gain is to be had here, without nearly as large a cost in code
    size and maintainability, and with the extra benefit that the speedup
    would also be available to non-core datatypes.
    
    			regards, tom lane
    
    
  15. Re: Inlining comparators as a performance optimisation

    Andrew Dunstan <andrew@dunslane.net> — 2011-09-21T15:03:15Z

    
    On 09/21/2011 10:50 AM, Tom Lane wrote:
    > The other question that I'm going to be asking is whether it's not
    > possible to get most of the same improvement with a much smaller code
    > footprint.  I continue to suspect that getting rid of the SQL function
    > impedance-match layer (myFunctionCall2Coll etc) would provide most of
    > whatever gain is to be had here, without nearly as large a cost in code
    > size and maintainability, and with the extra benefit that the speedup
    > would also be available to non-core datatypes.
    >
    > 			
    
    Can we get a patch so we can do benchmarks on this?
    
    cheers
    
    andrew
    
    
  16. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-21T15:22:50Z

    Simon Riggs <simon@2ndQuadrant.com> writes:
    > This is a marvellous win, a huge gain from a small, isolated and
    > easily tested change. By far the smallest amount of additional code to
    > sorting we will have added and yet one of the best gains.
    
    I think you forgot your cheerleader uniform.  A patch along these lines
    is not going to be small, isolated, easily maintained, nor beneficial
    for any but a small number of predetermined datatypes.
    
    			regards, tom lane
    
    
  17. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-09-21T15:43:31Z

    On 21 September 2011 15:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
    >> I'm not against making things faster, it's just that I haven't seen
    >> solid evidence yet that this will help. Just provide a best-case test
    >> case for this that shows a huge improvement, and I'll shut up. If the
    >> improvement is only modest, then let's discuss how big it is and whether
    >> it's worth the code ugliness this causes.
    
    Fair enough.
    
    > The other question that I'm going to be asking is whether it's not
    > possible to get most of the same improvement with a much smaller code
    > footprint.
    
    That's a reasonable question, and I hope to be able to come up with a
    good answer.
    
    > I continue to suspect that getting rid of the SQL function
    > impedance-match layer (myFunctionCall2Coll etc) would provide most of
    > whatever gain is to be had here, without nearly as large a cost in code
    > size and maintainability, and with the extra benefit that the speedup
    > would also be available to non-core datatypes.
    
    I'm fairly surprised that your view on that is mostly or entirely
    unchanged, even after I've demonstrated a considerable performance
    advantage from a macro-based qsort implementation over my OS vendor's
    c std lib qsort(), using an isolated test-case, that does not have
    anything to do with that impedance mismatch. I'm not sure why you
    doubt that the same thing is happening within tuplesort.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
    
  18. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-21T15:46:52Z

    Andrew Dunstan <andrew@dunslane.net> writes:
    > On 09/21/2011 10:50 AM, Tom Lane wrote:
    >> The other question that I'm going to be asking is whether it's not
    >> possible to get most of the same improvement with a much smaller code
    >> footprint.  I continue to suspect that getting rid of the SQL function
    >> impedance-match layer (myFunctionCall2Coll etc) would provide most of
    >> whatever gain is to be had here, without nearly as large a cost in code
    >> size and maintainability, and with the extra benefit that the speedup
    >> would also be available to non-core datatypes.
    
    > Can we get a patch so we can do benchmarks on this?
    
    Well, we'd have to negotiate what the API ought to be.  What I'm
    envisioning is that datatypes could provide alternate comparison
    functions that are designed to be qsort-callable rather than
    SQL-callable.  As such, they could not have entries in pg_proc, so
    it seems like there's no ready way to represent them in the catalogs.
    
    The idea that I was toying with was to allow the regular SQL-callable
    comparison function to somehow return a function pointer to the
    alternate comparison function, so that the first comparison in a given
    sort run would be done the traditional way but then we'd notice the
    provided function pointer and start using that.  It would not be too
    hard to pass back the pointer using FunctionCallInfoData.context, say.
    The downside is adding cycles to unoptimized cases to uselessly check
    for a returned function pointer that's not there.  Perhaps it could be
    hacked so that we only add cycles to the very first call, but I've not
    looked closely at the code to see what would be involved.
    
    Has anyone got a better idea for getting hold of the alternate function?
    
    			regards, tom lane
    
    
  19. Re: Inlining comparators as a performance optimisation

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-09-21T15:56:56Z

    On 21.09.2011 18:46, Tom Lane wrote:
    > Andrew Dunstan<andrew@dunslane.net>  writes:
    >> On 09/21/2011 10:50 AM, Tom Lane wrote:
    >>> The other question that I'm going to be asking is whether it's not
    >>> possible to get most of the same improvement with a much smaller code
    >>> footprint.  I continue to suspect that getting rid of the SQL function
    >>> impedance-match layer (myFunctionCall2Coll etc) would provide most of
    >>> whatever gain is to be had here, without nearly as large a cost in code
    >>> size and maintainability, and with the extra benefit that the speedup
    >>> would also be available to non-core datatypes.
    >
    >> Can we get a patch so we can do benchmarks on this?
    >
    > Well, we'd have to negotiate what the API ought to be.  What I'm
    > envisioning is that datatypes could provide alternate comparison
    > functions that are designed to be qsort-callable rather than
    > SQL-callable.  As such, they could not have entries in pg_proc, so
    > it seems like there's no ready way to represent them in the catalogs.
    >
    > The idea that I was toying with was to allow the regular SQL-callable
    > comparison function to somehow return a function pointer to the
    > alternate comparison function, so that the first comparison in a given
    > sort run would be done the traditional way but then we'd notice the
    > provided function pointer and start using that.  It would not be too
    > hard to pass back the pointer using FunctionCallInfoData.context, say.
    > The downside is adding cycles to unoptimized cases to uselessly check
    > for a returned function pointer that's not there.  Perhaps it could be
    > hacked so that we only add cycles to the very first call, but I've not
    > looked closely at the code to see what would be involved.
    
    You could have a new function with a pg_proc entry, that just returns a 
    function pointer to the qsort-callback.
    
    Or maybe the interface should be an even more radical replacement of 
    qsort, not just the comparison function. Instead of calling qsort, 
    tuplesort.c would call the new datatype-specific sort-function (which 
    would be in pg_proc). The implementation could use an inlined version of 
    qsort, like Peter is suggesting, or it could do something completely 
    different, like a radix sort or a GPU-assisted sort or whatever.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  20. Re: Inlining comparators as a performance optimisation

    Greg Stark <stark@mit.edu> — 2011-09-21T16:04:21Z

    On Wed, Sep 21, 2011 at 4:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >  As such, they could not have entries in pg_proc, so
    > it seems like there's no ready way to represent them in the catalogs.
    
    Why couldn't they be in pg_proc with a bunch of opaque arguments like
    the GIST opclass support functions?
    
    I'm a bit puzzled what the arguments would look like. They would still
    need to know the collation, nulls first/last flags, etc.
    And calling it would still not be inlinable.  So they would have to
    check those flags on each invocation instead of having a piece of
    straightline code that hard codes the behaviour with the right
    behaviour inline.  ISTM the hope for a speedup from the inlining
    mostly came from the idea that the compiler might be able to hoist
    this logic outside the loop (and I suppose implement n specialized
    loops depending on the behaviour needed).
    
    -- 
    greg
    
    
  21. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-21T16:13:07Z

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
    > On 21.09.2011 18:46, Tom Lane wrote:
    >> The idea that I was toying with was to allow the regular SQL-callable
    >> comparison function to somehow return a function pointer to the
    >> alternate comparison function,
    
    > You could have a new function with a pg_proc entry, that just returns a 
    > function pointer to the qsort-callback.
    
    Yeah, possibly.  That would be a much more invasive change, but cleaner
    in some sense.  I'm not really prepared to do all the legwork involved
    in that just to get to a performance-testable patch though.
    
    > Or maybe the interface should be an even more radical replacement of 
    > qsort, not just the comparison function. Instead of calling qsort, 
    > tuplesort.c would call the new datatype-specific sort-function (which 
    > would be in pg_proc). The implementation could use an inlined version of 
    > qsort, like Peter is suggesting, or it could do something completely 
    > different, like a radix sort or a GPU-assisted sort or whatever.
    
    No.  In the first place, that only helps for in-memory sorts.  In the
    second, it would absolutely destroy our ability to change the behavior
    of sorting ever again.  Considering that we've added ASC/DESC, NULLS
    FIRST/LAST, and collation support over the years, are you really
    prepared to bet that the sort code will never need any more feature
    upgrades?  (This concern is in fact the source of my beef with the whole
    inlining proposal to begin with, but allowing the inlining to occur into
    third-party code that we don't control at all would be a hundred times
    worse.)
    
    			regards, tom lane
    
    
  22. Re: Inlining comparators as a performance optimisation

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-09-21T16:13:20Z

    On 21.09.2011 18:46, Tom Lane wrote:
    > Well, we'd have to negotiate what the API ought to be.  What I'm
    > envisioning is that datatypes could provide alternate comparison
    > functions that are designed to be qsort-callable rather than
    > SQL-callable.  As such, they could not have entries in pg_proc, so
    > it seems like there's no ready way to represent them in the catalogs.
    
    Quite aside from this qsort-thing, it would be nice to have versions of 
    all simple functions that could be called without the FunctionCall 
    overhead. So instead of:
    
    FunctionCall2(&flinfo_for_int4pl, 1, 2)
    
    you could do simply
    
    int4pl_fastpath(1,2)
    
    I'm not sure how big an effect this would have, but it seems like it 
    could shave some cycles across the system.
    
    We could have an extended version of the PG_FUNCTION_INFO_V1 macro that 
    would let you register the fastpath function:
    
    PG_FUNCTION_INFO_V1(int4pl, int4pl_fastpath);
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  23. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-21T16:23:13Z

    Greg Stark <stark@mit.edu> writes:
    > On Wed, Sep 21, 2011 at 4:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> As such, they could not have entries in pg_proc, so
    >> it seems like there's no ready way to represent them in the catalogs.
    
    > Why couldn't they be in pg_proc with a bunch of opaque arguments like
    > the GIST opclass support functions?
    
    That does not mean the same thing at all.  Everything in pg_proc is
    meant to be called through the V0 or V1 function call info protocols.
    
    > I'm a bit puzzled what the arguments would look like. They would still
    > need to know the collation, nulls first/last flags, etc.
    
    No, I wasn't thinking that we should do that.  The datatype comparison
    functions should have the exact same semantics they do now, just a
    lower-overhead call mechanism.  If you try to push stuff like NULLS
    FIRST/LAST into the per-datatype code, then you are up against a problem
    when you want to add a new flag: you have to touch lots of code not all
    of which you even control.
    
    > And calling it would still not be inlinable.  So they would have to
    > check those flags on each invocation instead of having a piece of
    > straightline code that hard codes the behaviour with the right
    > behaviour inline.  ISTM the hope for a speedup from the inlining
    > mostly came from the idea that the compiler might be able to hoist
    > this logic outside the loop (and I suppose implement n specialized
    > loops depending on the behaviour needed).
    
    None of that stuff is inlinable or constant-foldable today, nor would it
    be with the patch that Peter was proposing AFAICS, because none of the
    flags will ever be compile time constant values.
    
    			regards, tom lane
    
    
  24. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-21T16:27:49Z

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
    > On 21.09.2011 18:46, Tom Lane wrote:
    >> Well, we'd have to negotiate what the API ought to be.  What I'm
    >> envisioning is that datatypes could provide alternate comparison
    >> functions that are designed to be qsort-callable rather than
    >> SQL-callable.  As such, they could not have entries in pg_proc, so
    >> it seems like there's no ready way to represent them in the catalogs.
    
    > Quite aside from this qsort-thing, it would be nice to have versions of 
    > all simple functions that could be called without the FunctionCall 
    > overhead.
    
    Hmm, that's an interesting idea.  I think probably the important aspects
    are (1) known number of arguments and (2) no null argument or result
    values are allowed.  Not sure what we'd do with collations though.
    
    > We could have an extended version of the PG_FUNCTION_INFO_V1 macro that 
    > would let you register the fastpath function:
    > PG_FUNCTION_INFO_V1(int4pl, int4pl_fastpath);
    
    We don't use PG_FUNCTION_INFO_V1 for built-in functions ...
    
    			regards, tom lane
    
    
  25. Re: Inlining comparators as a performance optimisation

    Greg Stark <stark@mit.edu> — 2011-09-21T16:35:41Z

    On Wed, Sep 21, 2011 at 5:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > None of that stuff is inlinable or constant-foldable today, nor would it
    > be with the patch that Peter was proposing AFAICS, because none of the
    > flags will ever be compile time constant values.
    
    I was referring to transformations like this which I believe compilers
    are already capable of doing:
    
    v = ...;
    while (...)
      if (v) {
        if (a < b) ... else ....;
      } else {
        if (a > b) ... else ...;
      }
    
    turning it into code that looks like:
    
    if (v) {
     while (....)
       if (a<b) ... else ...;
    } else {
      while (....)
        if (a>b) ... else ...;
    }
    
    which may not look like much -- especially with branch prediction --
    but then it's much more likely to be able to unroll the loop and do
    clever instruction scheduling and so on than if there's an extra
    branch in the middle of the loop. But if there's a function call to an
    external function called through a function pointer in the middle of
    the loop then the whole endeavour would be for naught.
    
    
    
    -- 
    greg
    
    
  26. Re: Inlining comparators as a performance optimisation

    Simon Riggs <simon@2ndquadrant.com> — 2011-09-21T19:21:38Z

    On Wed, Sep 21, 2011 at 4:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Simon Riggs <simon@2ndQuadrant.com> writes:
    >> This is a marvellous win, a huge gain from a small, isolated and
    >> easily tested change. By far the smallest amount of additional code to
    >> sorting we will have added and yet one of the best gains.
    >
    > I think you forgot your cheerleader uniform.
    
    LOL. I'm happy whoever and whenever we get large wins like that.
    
    Go Postgres!
    
    > A patch along these lines
    > is not going to be small, isolated, easily maintained, nor beneficial
    > for any but a small number of predetermined datatypes.
    
    That was the starting premise.
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  27. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-09-26T02:12:55Z

    I've produced something much neater than my first patch, attached,
    although I still consider this to be at the POC stage, not least since
    I'm not exactly sure how I should be selecting the right
    specialisation in tuplesort_performsort() (a hack is used right now
    that does a direct pg_proc OID comparison), and also because I haven't
    implemented anything other than qsort for heap tuples yet (a minor
    detail, I suppose). I'm pleased to say that a much clearer picture of
    what's really going on here has emerged.
    
    Statistics: Total runtime according to explain analyze for query
    "select * from orderlines order by prod_id" (dellstore2 sample db), at
    GCC 4.5's -02 optimisation level, after warming the cache, on my
    desktop:
    
    Without the patch:
    
    ~82ms
    
    With the patch, but with the "inline" keyword commented out for all
    new functions/meta-functions:
    
    ~60ms
    
    with the patch, unmodified:
    
    ~52ms
    
    Recent experience suggests that using a macro rather than an inline
    function would perhaps have been a mistake, as it would prevent us
    from benefiting from the compiler's smarts in surmising just where we
    should inline. As I've pointed out, individual call sites are inlined,
    not individual functions.
    
    Tom wanted to know if I could prove the benefit in inlining, as
    against just resolving the impedance mismatch without bothering with
    inlining (and, indeed, further optimisations that the compiler can
    avail of as a result of knowing the comparator at compile-time). These
    figures support my contention that these optimisations have an
    important role to play. However, there's no question that resolving
    the impedance mismatch is where the biggest benefit is seen,
    particularly relative to maintainability. As Tom anticipated, the
    question of whether or not we do the inlining is less than entirely
    straightforward, as it's less of a win, and it has a higher price in
    ugliness. I myself am very much of the opinion that it's still worth
    it. As I've said, sorting integers and floats is very common and very
    important, and this approach will likely yield benefits beyond
    increasing the speed of executor sort nodes, as described below.
    
    I accept that a more sophisticated benchmark is required here, but
    I've been a little busy actually writing the patch. Any ideas anyone?
    Independent benchmarks are welcome. If someone could suggest a worst
    case, that would be particularly useful, as I cannot think of one - I
    believe that each instance of inlining a call site more-or-less either
    is or is not a net benefit here, regardless of the number of
    comparisons performed, which is why the improvement is so predictable
    across the size of the set of integers sorted for by qsort in
    isolation (which is the big reason why that ~8ms decrease due to
    inlining turns out to be not too shabby - it's a saving per
    comparison, and they can add it pretty quickly). So, for example, when
    I build the patch on my laptop (GCC 4.6), with an 48MB table (the same
    orderlines table query, but I doubled up the data a few times
    beforehand), we see an undiminished, proportional (to the prior,
    isolated cost of qsorting, I think) decrease in runtime:
    
    Without patch:
    
    ~615ms
    
    With patch:
    
    ~415ms
    
    One key insight that this work brings is that resolving the impedance
    mismatch - making comparisons as inexpensive as possible (including
    using inlining) - is the best possible strategy in improving sort
    performance today, vastly more efficacious than, for example, tweaking
    the sorting algorithm. If anyone can find some more ways of shaving
    cycles there, it's one place where it really matters.
    
    Changes are isolated to the extent that if you decide that you don't
    like this one day, you can simply remove the calls to qsort_arg
    specialisations, and once again switch back to just using what the
    patch makes a fallback, qsort_arg itself. I know that we'd never
    really get into that situation, but the fact that we could do that
    serves to illustrate that these changes are fairly isolated.
    
    Incidentally, if you find writing code that is heavily dependent on
    macros to be a chore, I can highly recommend Clang 2.9 .
    
    But what of the maintenance burden of mostly duplicating qsort_arg.c,
    to produce a "template" version? Well, take a look at the *entire*
    history for that file:
    
    [peter@localhost port]$ git log qsort_arg.c
    commit 9f2e211386931f7aee48ffbc2fcaef1632d8329f
    Author: Magnus Hagander <magnus@hagander.net>
    Date:   Mon Sep 20 22:08:53 2010 +0200
    
        Remove cvs keywords from all files.
    
    commit b9954fbb4ef25fb1ea173d26017d4d128dd15be5
    Author: Neil Conway <neilc@samurai.com>
    Date:   Sun Mar 18 05:36:50 2007 +0000
    
        Code cleanup for function prototypes: change two K&R-style prototypes
        to ANSI-style, and change "()" -> "(void)". Patch from Stefan Huehner.
    
    commit b38900c7677657a815e75781b776fb1e41054df3
    Author: Tom Lane <tgl@sss.pgh.pa.us>
    Date:   Thu Oct 12 15:04:55 2006 +0000
    
        Use Min() instead of min() in qsort, for consistency and to avoid
        redefined-macro warnings on some platforms.  Per gripe from Hiroshi Saito.
    
    commit f99a569a2ee3763b4ae174e81250c95ca0fdcbb6
    Author: Bruce Momjian <bruce@momjian.us>
    Date:   Wed Oct 4 00:30:14 2006 +0000
    
        pgindent run for 8.2.
    
    commit 6edd2b4a91bda90b7f0290203bf5c88a8a8504db
    Author: Tom Lane <tgl@sss.pgh.pa.us>
    Date:   Tue Oct 3 22:18:23 2006 +0000
    
        Switch over to using our own qsort() all the time, as has been proposed
        repeatedly.  ***SNIP MESSAGE***
    
    
    I think that it is fair to say that the maintenance burden imposed by
    this change is well worth it. Only one of these changes after the
    initial commit is not mechanical, and that is still pretty trivial.
    
    I've attached gprof flat profile output from unmodified PostgreSQL (at
    master) when we create an index on a pgbench table:
    
    pgbench -i -s 1500 index_test
    
    That puts the number of rows in the pgbench_accounts table at 150
    million, while the table is 19 GB in size. That’s a reasonable size,
    and should usefully demonstrate the likely improvement we’ll see in
    the performance of creating an index.
    
    I increased maintenance_work_mem to 756MB, plus used a fairly
    straightforward postgresql.conf, plus some other, more generic values
    for other GUCs. The query is:
    
    create INDEX on pgbench_accounts(abalance); -- abalance is of type integer
    
    Here is the top of the flat profile, by far the most important part:
    
    Flat profile:
    
    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds    calls   s/call   s/call  name
     43.56     79.29    79.29 6972025063     0.00     0.00  comparetup_index_btree
     22.39    120.04    40.75 150000000     0.00     0.00  tuplesort_heap_siftup
      4.36    127.97     7.93 2677057771     0.00     0.00  btint4cmp
      2.88    133.21     5.24 314766350     0.00     0.00  LWLockRelease
      1.81    136.50     3.29 314612660     0.00     0.00  LWLockAcquire
      1.65    139.50     3.00 450905247     0.00     0.00  AllocSetAlloc
      1.43    142.10     2.60 300000001     0.00     0.00  LogicalTapeWrite
      1.17    144.23     2.13                             comparetup_cluster
    *** SNIP ***
    
    This looks pretty top-heavy to me, and more time is spent in
    comparetup_index_btree than any other function by some margin,
    suggesting that it will be worthwhile to pursue similar optimisations
    to make index creation faster in a later revision of this patch, or as
    another patch. I didn't take care to ensure that I'd be caching the
    entire table in memory, which probably would have been more useful
    here.
    
    Thoughts?
    
    On the subject of highly ambitious optimisations to sorting, one
    possibility I consider much more practicable than GPU-accelerated
    sorting is simple threading; quicksort can be parallelised very
    effectively, due to its divide-and-conquer nature. If we could agree
    on a threading abstraction more sophisticated than forking, it's
    something I'd be interested in looking at. To do so would obviously
    entail lots of discussion about how that relates to whatever way we
    eventually decide on implementing parallel query, and that's obviously
    a difficult discussion.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
  28. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-09-26T03:46:29Z

    On Sun, Sep 25, 2011 at 10:12 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
    > [ new results ]
    
    Nice results.  I think these are far more convincing than the last
    set, because (1) the gains are bigger and (2) they survive -O2 and (3)
    you tested an actual query, not just qsort() itself.
    
    I don't want to take time to review this in detail right now, because
    I don't think it would be fair to put this ahead of things that were
    submitted for the current CommitFest, but I'm impressed.
    
    > On the subject of highly ambitious optimisations to sorting, one
    > possibility I consider much more practicable than GPU-accelerated
    > sorting is simple threading; quicksort can be parallelised very
    > effectively, due to its divide-and-conquer nature. If we could agree
    > on a threading abstraction more sophisticated than forking, it's
    > something I'd be interested in looking at. To do so would obviously
    > entail lots of discussion about how that relates to whatever way we
    > eventually decide on implementing parallel query, and that's obviously
    > a difficult discussion.
    
    I have the same feeling about about this that I do about almost every
    executor optimization that anyone proposes: the whole project would be
    entirely simple and relatively painless if it weren't for the need to
    make planner changes.  I mean, deciding on a threading interface is
    likely to be a somewhat contentious discussion, with differences of
    opinion on whether we should do it and what the API should look like.
    But at the end of the day it's not rocket science, and I expect that
    we would end up with something reasonable.  What seems much harder is
    figuring out how to decide when to perform quicksort in parallel vs.
    single-threaded, and how much parallelism would be appropriate.  I
    haven't seen anyone propose even a shadow of an idea about how to make
    such decisions intelligently, either in general or in specific cases.
    
    The other issue is that, while threading would be possibly suitable
    for this particular case, at least for built-in datatypes with
    comparison operations that basically reduce to single machine-language
    comparison instructions, it's hard to see how we could take it much
    further.  It would be unsafe for these multiple threads of execution
    to do anything that could possibly throw an error or anything that
    touches a lightweight lock or, really, just about anything at all.
    Trying to make the entire backend thread-safe - or even any
    significant portion of it - seems like a colossal effort that will
    most likely fail, but maybe not without eating an enormous amount of
    developer time first.  And without doing that, I don't think we could
    even extend this as far as, say, numeric, whose functions do things
    like palloc() and ereport() internally.  So I feel like this whole
    approach might be a dead-end - there's a narrow range of cases where
    it could be made to work, I think, but after that I think you hit a
    titanium wall.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  29. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-09-26T13:53:49Z

    On 26 September 2011 04:46, Robert Haas <robertmhaas@gmail.com> wrote:
    > I don't want to take time to review this in detail right now, because
    > I don't think it would be fair to put this ahead of things that were
    > submitted for the current CommitFest, but I'm impressed.
    
    Thank you.
    
    Now that I think about it, the if statement that determines if the
    int4 specialisation will be used may be okay - it sort of documents
    the conditions under which the int4 specialisation should be used with
    reference to how the "else" generic case actually works, which is
    perhaps no bad thing. I now realise that I should be using constants
    from fmgroids.h though.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
    
  30. Re: Inlining comparators as a performance optimisation

    Bruce Momjian <bruce@momjian.us> — 2011-10-05T01:49:40Z

    Peter Geoghegan wrote:
    > On the subject of highly ambitious optimisations to sorting, one
    > possibility I consider much more practicable than GPU-accelerated
    > sorting is simple threading; quicksort can be parallelised very
    > effectively, due to its divide-and-conquer nature. If we could agree
    > on a threading abstraction more sophisticated than forking, it's
    > something I'd be interested in looking at. To do so would obviously
    > entail lots of discussion about how that relates to whatever way we
    > eventually decide on implementing parallel query, and that's obviously
    > a difficult discussion.
    
    I agree that the next big challenge for Postgres is parallel operations.
    With the number of cores increasing, and with increased memory and SSD,
    parallel operation is even more important.  Rather than parallelizing
    the entire backend, I imagine adding threading or helper processes for
    things like sorts, index scans, executor nodes, and stored procedure
    languages.  I expect final code to be 2-3 years in the future.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +
    
    
  31. Re: Inlining comparators as a performance optimisation

    Greg Stark <stark@mit.edu> — 2011-10-05T02:55:28Z

    On Wed, Oct 5, 2011 at 2:49 AM, Bruce Momjian <bruce@momjian.us> wrote:
    
    > Rather than parallelizing
    > the entire backend, I imagine adding threading or helper processes for
    > things like sorts, index scans, executor nodes, and stored procedure
    > languages.  I expect final code to be 2-3 years in the future.
    
    I don't actually see it would be a big problem quicksort to start up a
    bunch of threads which do some of the work and go away when the
    tuplesort is done. As long as the code they're executing is well
    defined and limited to a small set of code that can be guaranteed to
    be thread-safe it should be reasonably simple.
    
    The problem is that in most of the Postgres core there aren't many
    places where much code fits that description. Even in tuplesort it can
    call out to arbitrary user-defined functions so we would need a way to
    mark which functions are thread-safe. Beyond integer and floating
    point comparisons it may not be much -- certainly not Numeric or
    strings due to detoasting... And then there are things like handling
    the various signals (including SI invalidations?) and so on.
    
    I agree that if we wanted to farm out entire plan nodes we would
    probably end up generating new partial plans which would be handed to
    other backend processes. That's not unlike what Oracle did with
    Parallel Query. But i'm a bit skeptical that we'll get much of that
    done in 2-3 years. The main use case for Parallel Query in Oracle is
    for partitioned tables -- and we haven't really polished that after
    how many years?
    
    -- 
    greg
    
    
  32. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-10-05T13:55:04Z

    On Tue, Oct 4, 2011 at 10:55 PM, Greg Stark <stark@mit.edu> wrote:
    > I agree that if we wanted to farm out entire plan nodes we would
    > probably end up generating new partial plans which would be handed to
    > other backend processes. That's not unlike what Oracle did with
    > Parallel Query. But i'm a bit skeptical that we'll get much of that
    > done in 2-3 years. The main use case for Parallel Query in Oracle is
    > for partitioned tables -- and we haven't really polished that after
    > how many years?
    
    Partitioning hasn't been completely neglected; 9.1 adds support for
    something called Merge Append.  You may have heard of (or, err,
    authored) that particular bit of functionality.  :-)
    
    Of course, it would be nice to have a better syntax, but I don't think
    the lack of it should discourage us from working on parallel query,
    which is a muti-part problem.  You need to:
    
    - have planner support to decide when to parallelize
    - have a mechanism for firing up worker processes and synchronizing
    the relevant bits of state between the master and the workers
    - have an IPC mechanism for streaming data between the master process
    and the workers
    - figure out how to structure the executor so that the workers neither
    stall nor run too far ahead of the master (which might have LIMIT 10
    or something)
    
    Markus Wanner took a crack at generalizing the autovacuum machinery
    that we have now into something that could be used to fire up
    general-purpose worker processes, but it fell down mostly because I
    (and, I think, others) weren't convinced that imessages were something
    we wanted to suck into core, and Markus reasonably enough wasn't
    interested in rewriting it to do something that wouldn't really help
    his work with Postgres-R.  I'm not sure where Bruce is getting his
    timeline from, but I think the limiting factor is not so much that we
    don't have people who can write the code as that those people are
    busy, and this is a big project.  But you can bet that if it gets to
    the top of Tom's priority list (just for example) we'll see some
    motion...!
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  33. Re: Inlining comparators as a performance optimisation

    Bruce Momjian <bruce@momjian.us> — 2011-10-05T15:31:57Z

    Robert Haas wrote:
    > Markus Wanner took a crack at generalizing the autovacuum machinery
    > that we have now into something that could be used to fire up
    > general-purpose worker processes, but it fell down mostly because I
    > (and, I think, others) weren't convinced that imessages were something
    > we wanted to suck into core, and Markus reasonably enough wasn't
    > interested in rewriting it to do something that wouldn't really help
    > his work with Postgres-R.  I'm not sure where Bruce is getting his
    > timeline from, but I think the limiting factor is not so much that we
    > don't have people who can write the code as that those people are
    > busy, and this is a big project.  But you can bet that if it gets to
    > the top of Tom's priority list (just for example) we'll see some
    > motion...!
    
    I was thinking of setting up a team to map out some strategies and get
    community buy-in, and then we could attack each issue.  I got the 2-3
    years from the Win32 timeline.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +
    
    
  34. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-11-18T05:20:26Z

    On Sun, Sep 25, 2011 at 10:12 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
    > I've produced something much neater than my first patch, attached,
    > although I still consider this to be at the POC stage, not least since
    > I'm not exactly sure how I should be selecting the right
    > specialisation in tuplesort_performsort() (a hack is used right now
    > that does a direct pg_proc OID comparison), and also because I haven't
    > implemented anything other than qsort for heap tuples yet (a minor
    > detail, I suppose). I'm pleased to say that a much clearer picture of
    > what's really going on here has emerged.
    >
    > Statistics: Total runtime according to explain analyze for query
    > "select * from orderlines order by prod_id" (dellstore2 sample db), at
    > GCC 4.5's -02 optimisation level, after warming the cache, on my
    > desktop:
    >
    > Without the patch:
    >
    > ~82ms
    >
    > With the patch, but with the "inline" keyword commented out for all
    > new functions/meta-functions:
    >
    > ~60ms
    >
    > with the patch, unmodified:
    >
    > ~52ms
    
    Reviewing away when I should be sleeping, wahoo...!
    
    I think that we should really consider doing with this patch what Tom
    suggested upthread; namely, looking for a mechanism to allow
    individual datatypes to offer up a comparator function that doesn't
    require bouncing through FunctionCall2Coll().  It seems to me that
    duplicating the entire qsort() algorithm is a little iffy.  Sure, in
    this case it works out to a win.  But it's only a small win -
    three-quarters of it is in the uncontroversial activity of reducing
    the impedance mismatch - and how do we know it will always be a win?
    Adding more copies of the same code can be an anti-optimization if it
    means that a smaller percentage of it fits in the instruction cache,
    and sometimes small changes in runtime are caused by random shifts in
    the layout of memory that align things more or less favorably across
    cache lines rather than by real effects.  Now it may well be that this
    is a real effect, but will it still look as good when we do this for
    10 data types?  For 100 data types?
    
    In contrast, it seems to me that reducing the impedance mismatch is
    something that we could go and do across our entire code base, and
    every single data type would benefit from it.  It would also be
    potentially usable by other sorting algorithms, not just quick sort.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  35. Re: Inlining comparators as a performance optimisation

    Simon Riggs <simon@2ndquadrant.com> — 2011-11-18T08:53:09Z

    On Fri, Nov 18, 2011 at 5:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
    
    > I think that we should really consider doing with this patch what Tom
    > suggested upthread; namely, looking for a mechanism to allow
    > individual datatypes to offer up a comparator function that doesn't
    > require bouncing through FunctionCall2Coll().  It seems to me that
    > duplicating the entire qsort() algorithm is a little iffy.  Sure, in
    > this case it works out to a win.  But it's only a small win -
    > three-quarters of it is in the uncontroversial activity of reducing
    > the impedance mismatch - and how do we know it will always be a win?
    > Adding more copies of the same code can be an anti-optimization if it
    > means that a smaller percentage of it fits in the instruction cache,
    > and sometimes small changes in runtime are caused by random shifts in
    > the layout of memory that align things more or less favorably across
    > cache lines rather than by real effects.  Now it may well be that this
    > is a real effect, but will it still look as good when we do this for
    > 10 data types?  For 100 data types?
    >
    > In contrast, it seems to me that reducing the impedance mismatch is
    > something that we could go and do across our entire code base, and
    > every single data type would benefit from it.  It would also be
    > potentially usable by other sorting algorithms, not just quick sort.
    
    I don't think its credible to implement that kind of generic
    improvement at this stage of the release cycle. That has a much bigger
    impact since it potentially effects all internal datatypes and
    external ones also. Definitely a longer term way forward.
    
    If individual datatypes offer up a comparator function that is easily
    going to result in more code than is being suggested here. So the
    argument about flooding the CPU cache works against your alternate
    proposal, not in favour of it.
    
    We have no proof that we need to do this for 10 or 100 data types. We
    only currently have proof that there is gain for the most common
    types. Of course, it sounds like it might be useful to allow any data
    type to gain an advantage, but we shouldn't be blind to the point that
    almost nobody will use such a facility, and if they do the code won't
    be written for a long time yet. If this came as a request from custom
    datatype authors complaining of slow sorts it would be different, but
    it didn't so we don't even know if anybody would ever write user
    defined comparator routines. Rejecting a patch because of a guessed
    user requirement is not good.
    
    Peter's suggested change adds very few lines of code and those compile
    to some very terse code, a few hundred instructions at very most.
    Requesting an extra few cachelines to improve qsort by so much is
    still an easy overall win.
    
    The OP change improves qsort dramatically, and is a small, isolated
    patch. There is no significant downside. We also have it now, so lets
    commit this, chalk up another very good performance improvement and
    use our time on something else this commitfest, such as the flexlocks
    idea.
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  36. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-11-18T13:39:38Z

    On Fri, Nov 18, 2011 at 3:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
    > We have no proof that we need to do this for 10 or 100 data types. We
    > only currently have proof that there is gain for the most common
    > types.
    
    Well, that's kind of my point.  I think this needs more work before we
    decide what the best approach is.  So far, the ONLY test result we
    have that compares inlining to not inlining shows a speedup from 60 ms
    to 52 ms.  I think that an 8 ms speedup on one test with one datatype
    on one platform/compiler combination isn't sufficient evidence to
    conclude that this is the best possible approach.
    
    I think the way to look at this is that this patch contains two
    possibly good ideas: one of which is to make the qsort() argument
    match the qsort() calling convention, and the other of which is to
    have multiple copies of the qsort() logic that inline the comparators
    for their respective datatypes.  Tom hypothesized that most of the
    benefit was in the first idea, and the numbers that Peter posted seem
    to support that conclusion.  The first idea is also less invasive and
    more broadly applicable, so to me that seems like the first thing to
    pursue.
    
    Now, that doesn't mean that we shouldn't consider the second idea as
    well, but I don't believe that the evidence presented thus far is
    sufficient to prove that we should go that route.  It seems entirely
    possible that inlining any non-trivial comparator function could work
    out to a loss, or that the exact choice of compiler flags could affect
    whether inlining works out to a plus or a minus (what happens with -O3
    vs. -O2?  what about -O1?  what about -O2 -fno-omit-frame-pointer?
    what if they're using HP's aCC or Intel's icc rather than gcc?).
    There's no point in installing an optimization that could easily be a
    pessimization on some other workload, and that hasn't been tested, or
    at least no results have been posted here.  On the other hand,
    matching the calling convention of the comparator function to what
    qsort() wants and eliminating the trampoline seems absolutely certain
    to be a win in every case, and based on the work Peter has done it
    seems like it might be a quite large win.
    
    In fact, you have to ask yourself just exactly how much our
    function-calling convention is costing us in general.  We use that
    mechanism an awful lot and whatever loss of cycles is involved would
    be spread all over the code base where oprofile or gprof won't easily
    be able to pick it out.  Even the cost at any particular call site
    will be split between the caller and the callee.  There might be more
    stuff we could optimize here than just sorting...
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  37. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-11-18T14:11:27Z

    Simon Riggs <simon@2ndQuadrant.com> writes:
    > On Fri, Nov 18, 2011 at 5:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
    >> I think that we should really consider doing with this patch what Tom
    >> suggested upthread; namely, looking for a mechanism to allow
    >> individual datatypes to offer up a comparator function that doesn't
    >> require bouncing through FunctionCall2Coll().
    
    > I don't think its credible to implement that kind of generic
    > improvement at this stage of the release cycle.
    
    Er, *what*?  We're in mid development cycle, we are nowhere near
    release.  When exactly would you have us make major changes?
    
    In any case, what I understood Robert to be proposing was an add-on
    feature that could be implemented in one datatype at a time.  Not
    a global flag day.  We couldn't really do the latter anyway without
    making life very unpleasant for authors of extension datatypes.
    
    			regards, tom lane
    
    
  38. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-11-18T21:38:18Z

    On 18 November 2011 05:20, Robert Haas <robertmhaas@gmail.com> wrote:
    > I think that we should really consider doing with this patch what Tom
    > suggested upthread; namely, looking for a mechanism to allow
    > individual datatypes to offer up a comparator function that doesn't
    > require bouncing through FunctionCall2Coll().  It seems to me that
    > duplicating the entire qsort() algorithm is a little iffy.
    
    I understand that we highly value extensibility and genericity (yes,
    that's a real word). We may not always be well served by that
    tendency.
    
    > Sure, in this case it works out to a win.  But it's only a small win -
    > three-quarters of it is in the uncontroversial activity of reducing
    > the impedance mismatch - and how do we know it will always be a win?
    
    Firstly, 1/4 of a quite large gain is still a pretty good gain.
    Secondly, I probably didn't actually isolate the effects of inlining,
    nor the overall benefit of the compiler knowing the comparator at
    compile-time. I just removed the inline keyword. Those are two
    different things. The inline keyword serves as a request to the
    compiler to inline. The compiler can and often will ignore that
    request. Most people know that. What isn't so widely known is that
    modern compilers may equally well inline even when they haven't been
    asked to (but only when they can). When you also consider, as I've
    already pointed out several times, that individual call sites are
    inlined, it becomes apparent that there may well still be a certain
    amount of inlining and/or other optimisations like procedural
    integration going on at some call sites even without the encouragement
    of the inline keyword, that would not have been performed without the
    benefit of compile-time comparator knowledge. The addition of the
    inline keyword may just, in this particular case, have the compiler
    inline even more call sites.  Posting the ~8ms difference was
    motivated by a desire to prove that inlining had *some* role to play,
    without actually going to the trouble of implementing Tom's idea as a
    basis of comparison, because Tom was very sceptical of inlining.
    
    The long and the short of it is that I'm going to have to get my hands
    dirty with a dissembler before we really know exactly what's
    happening. That, or I could use an optimisation fence of some type.
    
    > Adding more copies of the same code can be an anti-optimization if it
    > means that a smaller percentage of it fits in the instruction cache,
    > and sometimes small changes in runtime are caused by random shifts in
    > the layout of memory that align things more or less favorably across
    > cache lines rather than by real effects.  Now it may well be that this
    > is a real effect, but will it still look as good when we do this for
    > 10 data types?  For 100 data types?
    
    I'd favour limiting it to just the common integer and float types.
    
    > In contrast, it seems to me that reducing the impedance mismatch is
    > something that we could go and do across our entire code base, and
    > every single data type would benefit from it.  It would also be
    > potentially usable by other sorting algorithms, not just quick sort.
    
    Suppose that we went ahead and added that infrastructure. What you
    must acknowledge is that one reason that this speed-up is so dramatic
    is that the essential expense of a comparison is already so low - a
    single instruction - and therefore the overall per-comparison cost
    goes way down, particularly if the qsort inner loop can store the code
    across fewer cache lines. For that reason, any absolute improvement
    that you'll see in complex datatypes will be smaller, maybe much
    smaller, because for each comparison we'll execute many more
    instructions that are essential to the comparison. In my estimation,
    all of this work does not point to there being an undue overhead in
    the function calling convention as you suggested. Still, I'm not
    opposed to investigating generalising this in some way, reservations
    notwithstanding, unless we have to block-wait on it. I don't want to
    chase diminishing returns too far.
    
    > Well, that's kind of my point.  I think this needs more work before we
    > decide what the best approach is.
    
    Agreed.
    
    > So far, the ONLY test result we
    > have that compares inlining to not inlining shows a speedup from 60 ms
    > to 52 ms.  I think that an 8 ms speedup on one test with one datatype
    > on one platform/compiler combination isn't sufficient evidence to
    > conclude that this is the best possible approach.
    
    Fair enough, but it's not the only test I did - I posted other numbers
    for the same query when the table was 48mb, and we saw a proportional
    improvement, consistent with a per-comparison win. I'm supposed to be
    on leave for a few days at the moment, so I won't be very active this
    weekend, but I'm rather curious as to where you or others would like
    to see me go with benchmarks.
    
    I should point out that we currently don't have much idea how big of a
    win applying these principles could be for index creation times...it
    could possibly be very significant. My profiling of index creation
    makes this looks promising.
    
    On 18 November 2011 13:39, Robert Haas <robertmhaas@gmail.com> wrote:
    > It seems entirely possible that inlining any non-trivial comparator function could work
    > out to a loss,
    
    Compilers weigh the size of the function to be inlined heavily when
    deciding, on a call site by call site basis, whether or not to inline.
    Much effort has been expended on making these heuristics work well.
    Besides, you're the one advocating pursuing this for types with
    non-trivial comparators. This will certainly be less effective for
    such non-trivial types, perhaps quite a lot less effective, as already
    noted.
    
    It's worth acknowledging that sorting of integers and floats is in
    general very important, probably more important than any other sorting
    case.
    
    > or that the exact choice of compiler flags could affect
    > whether inlining works out to a plus or a minus (what happens with -O3
    > vs. -O2?  what about -O1?  what about -O2 -fno-omit-frame-pointer?
    > what if they're using HP's aCC or Intel's icc rather than gcc?).
    
    If you'd like to see numbers for another platform, I can get those.
    How about Windows? I only have access to x86_64 boxes. I don't see
    that the exact compiler flags used will influence whether or not
    inlining is a win, so much as whether or not it occurs at all. As to
    -fno-omit-frame-pointer or something nullifying the benefits, well,
    that's a bit fantastical. We're already at the mercy of the compiler
    to do the right thing at this level of granularity. I really just want
    to help/enable it.
    
    > There's no point in installing an optimization that could easily be a
    > pessimization on some other workload, and that hasn't been tested, or
    > at least no results have been posted here.
    
    Hmm. Well, I accept the burden of proof lies with me here. I have a
    hard time believing that inlining/compile-time optimisation benefits
    (benefits which, to repeat for emphasis, have not been
    isolated/measured yet) could be entirely nullified by CPU caching
    effects on any plausible workload. What does a benchmark that
    alleviates your concerns here look like?
    
    The way that I understand inlining to interact with CPU cache is that
    it will increase the number of cache misses if it causes a cache line
    to be crossed, but will decrease the number of cache misses when
    locality of reference is improved such that less cache lines are
    crossed within an inner loop. Well, the overall cost of a sort is
    measured in the number of comparisons performed, so locality of
    reference is particularly important here - in general, the comparator
    is called lots of times.
    
    What I'm having a hard time with is general scepticism of the value of
    inlining, when the self-contained test case I posted showed an
    inlining qsort() radically out-performing my system's qsort(), without
    any impedance mismatch to muddy the waters. Sure, you can hold me to
    account and get me to prove my claim, but don't you think it's curious
    that that effect was observed there?
    
    Incidentally, I'd like to call this fast path sorting, if it ends up
    being committed in a form that is not significantly different to what
    we have now.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
    
  39. Re: Inlining comparators as a performance optimisation

    Martijn van Oosterhout <kleptog@svana.org> — 2011-11-18T22:08:18Z

    On Fri, Nov 18, 2011 at 12:20:26AM -0500, Robert Haas wrote:
    > I think that we should really consider doing with this patch what Tom
    > suggested upthread; namely, looking for a mechanism to allow
    > individual datatypes to offer up a comparator function that doesn't
    > require bouncing through FunctionCall2Coll().  It seems to me that
    > duplicating the entire qsort() algorithm is a little iffy.  Sure, in
    > this case it works out to a win.  But it's only a small win -
    > three-quarters of it is in the uncontroversial activity of reducing
    > the impedance mismatch - and how do we know it will always be a win?
    
    There's always the old idea of a data type providing a function mapping
    f:(type -> int64) in such a way that it preserves order. That is, in the
    sense that:
    
    f(x) < f(y)  =>  x < y
    
    When sorting, you add f(x) as hidden column and change "ORDER BY x" to
    "ORDER BY f(x), x".  Then you only need to special case the int64
    version.  This would mean that in most cases you may be able to skip
    the call because you're comparing integers.  The downside is you need
    to call f on each input.  It depends on the datatype if that's cheaper
    or not, but for all numerics types I think it's an easy win.
    
    I don't think anyone has written a proof of concept for this. It does
    have the advantage of scaling better than coding a qsort for each
    individual type.
    
    Have a nice day,
    -- 
    Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
    > He who writes carelessly confesses thereby at the very outset that he does
    > not attach much importance to his own thoughts.
       -- Arthur Schopenhauer
    
  40. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-11-19T02:55:54Z

    On Fri, Nov 18, 2011 at 4:38 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
    > I understand that we highly value extensibility and genericity (yes,
    > that's a real word). We may not always be well served by that
    > tendency.
    
    True (except that genericity is not a synonym for generality AFAICT).
    A good fraction of the optimizations that we've done over the last,
    well, pick any arbitrary period of time consists in adding special
    cases that occur frequently enough to be worth optimizing - e.g. HOT,
    or fast relation locks - often to extremely good effect.
    
    > Firstly, 1/4 of a quite large gain is still a pretty good gain.
    > Secondly, I probably didn't actually isolate the effects of inlining,
    > nor the overall benefit of the compiler knowing the comparator at
    > compile-time. I just removed the inline keyword.
    
    Maybe we should look at trying to isolate that a bit better.
    
    It strikes me that we could probably create an API that would support
    doing either of these things depending on the wishes of the underlying
    datatype.  For example, imagine that we're sorting with <(int4, int4).
     We associate a PGPROC-callable function with that operator that
    returns "internal", really a pointer to a struct.  The first element
    of the struct is a pointer to a comparison function that qsort() (or a
    tape sort) can invoke without a trampoline; the second is a wholesale
    replacement for qsort(); either or both can be NULL.  Given that, it
    seems to me that we could experiment with this pretty easily, and if
    it turns out that only one of them is worth doing, it's easy to drop
    one element out of the structure.
    
    Or do you have another plan for how to do this?
    
    > Fair enough, but it's not the only test I did - I posted other numbers
    > for the same query when the table was 48mb, and we saw a proportional
    > improvement, consistent with a per-comparison win. I'm supposed to be
    > on leave for a few days at the moment, so I won't be very active this
    > weekend, but I'm rather curious as to where you or others would like
    > to see me go with benchmarks.
    >
    > I should point out that we currently don't have much idea how big of a
    > win applying these principles could be for index creation times...it
    > could possibly be very significant. My profiling of index creation
    > makes this looks promising.
    
    Have you done any benchmarks where this saves seconds or minutes,
    rather than milliseconds?  That would certainly make it more exciting,
    at least to me.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  41. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-11-20T03:29:07Z

    On 19 November 2011 02:55, Robert Haas <robertmhaas@gmail.com> wrote:
    > Maybe we should look at trying to isolate that a bit better.
    
    Indeed. Fortunately, GCC has options to disable each optimisation.
    Here's potentially relevant flags that we're already implicitly using
    at -02:
    
    -finline-small-functions <-- this is the one that inlined functions
    without an "inline" keyword
    -findirect-inlining
    -finline-functions-called-once
    -fearly-inlining
    -fipa-sra (Perform interprocedural scalar replacement of aggregates,
    removal of unused parameters and replacement of parameters passed by
    reference by parameters passed by value).
    
    In an effort to better isolate the effects of inlining, I tried this:
    
    ./configure CFLAGS="-fno-inline -fno-inline-small-functions" (I could
    have disabled more -02 optimisations, but this proved sufficient to
    make my point)
    
    Unsurprisingly, this makes things slower than regular -02.
    
    With the patch, the same query (once again, using explain analyze)
    with the same fast path quicksort stabilises around 92ms +/- 5ms
    (recall that the original figure was ~82ms). Our gains evaporate, and
    then some. Take away the additional CLAGS, and we're predictably back
    to big gains, with the query taking ~52ms just as before.
    
    What happens to this query when we build an unmodified postgres with
    these same CFLAGS? Well, we see the query take ~116ms after a few
    runs. It seems that the impedance mismatch matters, but inlining and
    other optimisations look to be at least as important. This isn't
    surprising to me, given what I was able to do with the isolated test.
    Maybe I should have tried it with additional disabling of
    optimisations named above, but that would have perhaps made things
    less clear.
    
    I'd probably have been better off directly measuring qsort speed and
    only passing those flags when compiling tuplesort.c (maybe impedance
    mismatch issues would have proven to have been even less relevant),
    but I wanted to do something that could be easily recreated, plus it's
    late.
    
    > It strikes me that we could probably create an API that would support
    > doing either of these things depending on the wishes of the underlying
    > datatype.  For example, imagine that we're sorting with <(int4, int4).
    >  We associate a PGPROC-callable function with that operator that
    > returns "internal", really a pointer to a struct.  The first element
    > of the struct is a pointer to a comparison function that qsort() (or a
    > tape sort) can invoke without a trampoline; the second is a wholesale
    > replacement for qsort(); either or both can be NULL.  Given that, it
    > seems to me that we could experiment with this pretty easily, and if
    > it turns out that only one of them is worth doing, it's easy to drop
    > one element out of the structure.
    >
    > Or do you have another plan for how to do this?
    
    I haven't given it much thought. Let me get back to you on that next week.
    
    > Have you done any benchmarks where this saves seconds or minutes,
    > rather than milliseconds?  That would certainly make it more exciting,
    > at least to me.
    
    Right. Well, I thought I'd use pgbench to generate a large table in a
    re-creatable way. That is:
    
    pgbench -i -s 60
    
    This puts pgbench_accounts at 769MB. Then, having increased work_mem
    to 1GB (enough to qsort) and maintenance_work_mem to 756mb, I decided
    to test this query with the patch:
    
    explain analyze select * from pgbench_accounts order BY abalance;
    
    This stabilised at ~3450ms, through repeatedly being executed.
    
    How does this compare to unpatched postgres? Well, it stabilised at
    about ~3780ms for the same query.
    
    This patch is obviously less of a win as the number of tuples to sort
    goes up. That's probably partly explained by the cost of everything
    else going up at a greater rate than the number of comparisons. I
    suspect that if we measure qsort in isolation, we'll see better
    results, so we may still see a good win on index creation time as a
    result of this work.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
    
  42. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-11-20T04:13:36Z

    On 20 November 2011 03:29, Peter Geoghegan <peter@2ndquadrant.com> wrote:
    > ./configure CFLAGS="-fno-inline -fno-inline-small-functions" (I could
    > have disabled more -02 optimisations, but this proved sufficient to
    > make my point)
    
    I'll isolate this further to tuplesort.c soon, which ought to be a lot
    more useful.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
    
  43. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-11-21T02:02:06Z

    Part of my problem with not producing specialisations that I really
    neglected to complain about until now is the inconsistency between
    types, and the need to deal with that.
    
    We benefit from assuming in our specialisation that we're only dealing
    with POD types, simply not considering things that we would otherwise
    have had to for the benefit of types that have a Datum representation
    that is pass-by-reference, or have collations, or have multiple
    scankeys. Leaving aside the question of straight compiler-optimisation
    benefits for the moment, the remaining benefit of what I've done comes
    not so much from avoiding the usual function call machinery per se, as
    from doing so *as well as* cutting down on what currently happens in
    comparetup_heap to handle every single compound and scalar type.
    Compare the function comparetup_heap with my meta-function
    TYPE##AppSort to see what I mean. The function comparetup_heap is the
    comparator directly used by qsort_arg when sorting heap tuples, and
    qsort_arg outsources to comparetup_heap some things that you might not
    expect it to (very little has changed about qsort_arg since we lifted
    it from NetBSD back in 2006). So while you might imagine that that
    loop in comparetup_heap and things like its use of the heap_getattr
    macros won't expand to that many instructions, you really don't want
    them in the inner loop of a long operation like qsorting, paid for up
    to O(n ^ 2) times. There's also the obvious implications for compiler
    optimisations, particularly relating to effective usage of CPU cache.
    
    I'm trying to get you what you asked for: A straight choice between
    what Tom suggested and what I suggested, with perhaps some compromises
    between the two positions. That's sort of tricky though, especially
    considering the issues raised above.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
    
  44. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-11-21T22:55:18Z

    On Tue, Sep 20, 2011 at 7:53 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
    > I don't think that the fact that that happens is at all significant at
    > this early stage, and it never even occurred to me that you'd think
    > that it might be. I was simply disclosing a quirk of this POC patch.
    > The workaround is probably to use a macro instead. For the benefit of
    > those that didn't follow the other threads, the macro-based qsort
    > implementation, which I found to perform significantly better than
    > regular qsort(), runs like this on my laptop when I built at 02 with
    > GCC 4.6 just now:
    >
    > C stdlib quick-sort time elapsed: 2.092451 seconds
    > Inline quick-sort time elapsed: 1.587651 seconds
    
    Results on my machine, for what they're worth:
    
    [rhaas inline_compar_test]$ gcc -O0 qsort-inline-benchmark.c
    [rhaas inline_compar_test]$ ./a.out
    C stdlib quick-sort time elapsed: 2.366762 seconds
    Inline quick-sort time elapsed: 1.807951 seconds
    [rhaas inline_compar_test]$ gcc -O1 qsort-inline-benchmark.c
    [rhaas inline_compar_test]$ ./a.out
    C stdlib quick-sort time elapsed: 1.970473 seconds
    Inline quick-sort time elapsed: 1.002765 seconds
    [rhaas inline_compar_test]$ gcc -O2 qsort-inline-benchmark.c
    [rhaas inline_compar_test]$ ./a.out
    C stdlib quick-sort time elapsed: 1.966408 seconds
    Inline quick-sort time elapsed: 0.958999 seconds
    [rhaas inline_compar_test]$ gcc -O3 qsort-inline-benchmark.c
    [rhaas inline_compar_test]$ ./a.out
    C stdlib quick-sort time elapsed: 1.988693 seconds
    Inline quick-sort time elapsed: 0.975090 seconds
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  45. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-11-23T01:09:23Z

    On 21 November 2011 22:55, Robert Haas <robertmhaas@gmail.com> wrote:
    > Results on my machine, for what they're worth:
    
    I was curious as to how much of an improvement I'd made to sorting in
    isolation. I've adding simple sort timing instrumentation to the code
    in a revised patch, attached, for the convenience of reviewers.
    
    This patch adds an int8 specialisation, but it also supports fast path
    sorting of timestamps by using that same int8 specialisation (for
    HAVE_INT64_TIMESTAMP builds at least - didn't know if it was worth
    handling the NaN crud for the other case). ISTM that various different
    types with identical internal representations can share
    specialisations like this, which I hope will alleviate some earlier
    concerns about both the limited applicability of this optimisation and
    the possible proliferation of specialisations.
    
    Here's some raw qsort figures in seconds for the query "select * from
    orderlines order by test" (where test is a timestamptz column I added
    and updated with some data, executing repeatedly on the same machine
    as before):
    
    Before:
    
    DEBUG:  elapsed: 0.031738
    DEBUG:  elapsed: 0.031595
    DEBUG:  elapsed: 0.031502
    DEBUG:  elapsed: 0.031378
    DEBUG:  elapsed: 0.031359
    DEBUG:  elapsed: 0.031413
    DEBUG:  elapsed: 0.031499
    DEBUG:  elapsed: 0.031394
    DEBUG:  elapsed: 0.031368
    
    After:
    
    DEBUG:  elapsed: 0.013341
    DEBUG:  elapsed: 0.014608
    DEBUG:  elapsed: 0.013809
    DEBUG:  elapsed: 0.013244
    DEBUG:  elapsed: 0.014307
    DEBUG:  elapsed: 0.013207
    DEBUG:  elapsed: 0.013227
    DEBUG:  elapsed: 0.013264
    DEBUG:  elapsed: 0.013143
    DEBUG:  elapsed: 0.013455
    DEBUG:  elapsed: 0.013447
    
    I wonder, is it worth qualifying that the "Sort Method" was a
    "quicksort (fast path)" sort within explain analyze output? This is a
    rather large improvement, so It might actually be something that
    people look out for, as it might be tricky to remember the exact
    circumstances under which the optimisation kicks in by the time we're
    done here.
    
    I haven't had as much time as I'd like to polish this patch, or to get
    clearer answers. I expect to spend more time on it over the weekend.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
  46. Re: Inlining comparators as a performance optimisation

    Simon Riggs <simon@2ndquadrant.com> — 2011-11-23T14:48:22Z

    On Fri, Nov 18, 2011 at 2:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Simon Riggs <simon@2ndQuadrant.com> writes:
    >> On Fri, Nov 18, 2011 at 5:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
    >>> I think that we should really consider doing with this patch what Tom
    >>> suggested upthread; namely, looking for a mechanism to allow
    >>> individual datatypes to offer up a comparator function that doesn't
    >>> require bouncing through FunctionCall2Coll().
    >
    >> I don't think its credible to implement that kind of generic
    >> improvement at this stage of the release cycle.
    >
    > Er, *what*?  We're in mid development cycle, we are nowhere near
    > release.  When exactly would you have us make major changes?
    >
    > In any case, what I understood Robert to be proposing was an add-on
    > feature that could be implemented in one datatype at a time.  Not
    > a global flag day.  We couldn't really do the latter anyway without
    > making life very unpleasant for authors of extension datatypes.
    
    Tom, whenever you think I've said something you really disagree with,
    just assume there's a misunderstanding. Like here.
    
    Of course it is OK to make such changes at this time.
    
    Given we have <2 months to the last CF of this release, inventing a
    generic infrastructure is unlikely to be finished and complete in this
    dev cycle, so requesting that isn't a practical suggestion, IMHO.
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  47. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-11-23T19:24:22Z

    On Tue, Nov 22, 2011 at 8:09 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
    > I wonder, is it worth qualifying that the "Sort Method" was a
    > "quicksort (fast path)" sort within explain analyze output? This is a
    > rather large improvement, so It might actually be something that
    > people look out for, as it might be tricky to remember the exact
    > circumstances under which the optimisation kicks in by the time we're
    > done here.
    
    Well, right now the decision as to which mechanism should be used here
    gets made in tuplesort_performsort(), which has no good way of
    communicating with EXPLAIN anyway.  Actually, I think that's a
    modularity violation; using the address of comparetup_heap as a flag
    value seems quite ugly.  How about moving that logic up to
    tuplesort_begin_heap() and having it set some state inside the
    Tuplesort, maybe based on a flag in the opclass (or would it have to
    attach to the individual operator)?
    
    At least on my machine, your latest patch reliably crashes the
    regression tests in multiple places.
    
    The following test case also crashes them for me (perhaps for the same
    reason the regression tests crash):
    
    create table i4 (a int, b int);
    insert into i4 values (4, 1), (2, 1), (0, 1), (null, 1), (-2, 1), (-7,
    1), (4, 2), (4, 3), (4, 4);
    select * from i4 order by 1, 2;
    TRAP: FailedAssertion("!(state->nKeys == 1)", File: "tuplesort.c", Line: 1261);
    
    The formatting of src/include/utils/template_qsort_arg.h is hard to
    read.  At ts=8, the backslashes line up, but the code doesn't fit in
    80 columns.  If you set ts=4, then it fits in 80 columns, but the
    backslashes don't line up any more, and the variable declarations
    don't either.  I believe ts=4 is project standard.
    
    I still think it would be a good idea to provide a mechanism to
    override heap_comparetup() with a type-specific function.  I don't
    think that would take much extra code, and then any data type could
    get at least that much benefit out of this.
    
    It seems like it could be a good idea to do some
    per-assembler-instruction profiling of this code, and perhaps also of
    the original code.  I'm curious where the time is being spent.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  48. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-11-25T01:05:15Z

    On 23 November 2011 19:24, Robert Haas <robertmhaas@gmail.com> wrote:
    > Well, right now the decision as to which mechanism should be used here
    > gets made in tuplesort_performsort(), which has no good way of
    > communicating with EXPLAIN anyway.
    
    You could pretty easily add something to Tuplesortstate to accomplish
    this. That isn't an endorsement of doing so, but I'm not sure that it
    isn't appropriate.
    
    > Actually, I think that's a
    > modularity violation; using the address of comparetup_heap as a flag
    > value seems quite ugly. How about moving that logic up to
    > tuplesort_begin_heap()
    
    I'll post a patch soon that does just that in the next day or two.
    Tuplesortstate has a pointer to a sort specialisation.
    
    > and having it set some state inside the
    > Tuplesort, maybe based on a flag in the opclass (or would it have to
    > attach to the individual operator)?
    
    I'm not sure that there's much point in such a flag.
    
    > At least on my machine, your latest patch reliably crashes the
    > regression tests in multiple places.
    
    > TRAP: FailedAssertion("!(state->nKeys == 1)", File: "tuplesort.c", Line: 1261);
    
    Yes, sorry about that. Should have been discriminating against nKeys > 1 cases.
    
    As of this evening, for sorts with multiple scankeys, I'm using
    optimisations for the first scankey but not subsequent scankeys, which
    is frequently almost as good as having optimisations for all scanKeys.
    
    > The formatting of src/include/utils/template_qsort_arg.h is hard to
    > read.  At ts=8, the backslashes line up, but the code doesn't fit in
    > 80 columns.  If you set ts=4, then it fits in 80 columns, but the
    > backslashes don't line up any more, and the variable declarations
    > don't either.  I believe ts=4 is project standard.
    
    Fair enough. My working copy and .vimrc have been updated.
    
    > I still think it would be a good idea to provide a mechanism to
    > override heap_comparetup() with a type-specific function.  I don't
    > think that would take much extra code, and then any data type could
    > get at least that much benefit out of this.
    
    > It seems like it could be a good idea to do some
    > per-assembler-instruction profiling of this code, and perhaps also of
    > the original code.  I'm curious where the time is being spent.
    
    How would you go about doing that? The instrumentation that profilers
    use actually caused a big drop in performance here when I attempted it
    a few weeks ago. There's a kind of Heisenberg effect.
    
    This optimisation *more than doubles* raw sort performance for the
    cases. There is nothing contrived or cherry picked about the query
    that I selected to represent this optimisation - it was literally the
    first one that I selected.
    
    Sometimes, I see even a markedly better gain than a doubling of raw
    sort performance - I think my earlier experiments that indicated a
    much smaller improvement past a certain point may have been
    methodologically flawed. Sorry about that.
    
    If I double-up the data in the orderlines table a few times, until it
    reaches 385 MB (duplicate ever tuple with an insert into ...select ),
    then warm the cache, I get very interesting results. Here, we see a
    few runs of the same old query unoptimised (note that I've excluded
    some cold-cache runs before these runs):
    
    Before optimisation
    ==============
     Total runtime: 7785.473 ms - 3.517310 secs just sorting
     Total runtime: 8203.533 ms - 3.577193 secs just sorting
     Total runtime: 8559.743 ms - 3.892719 secs just sorting
    
     Total runtime: 9032.564 ms - 3.844746 secs just sorting
    
     Total runtime: 9637.179 ms - 4.434431 secs just sorting
     Total runtime: 9647.215 ms - 4.440560 secs just sorting
     Total runtime: 9669.701 ms - 4.448572 secs just sorting
    
    After optimisation
    ==============
     Total runtime: 5462.419 ms - 1.169963 secs just sorting
     Total runtime: 5510.660 ms - 1.234393 secs just sorting
     Total runtime: 5511.703 ms - 1.208377 secs just sorting
    
     Total runtime: 5588.604 ms - 1.175536 secs just sorting
    
     Total runtime: 5899.496 ms - 1.250403 secs just sorting
     Total runtime: 6023.132 ms - 1.338760 secs just sorting
     Total runtime: 6717.177 ms - 1.486602 secs just sorting
    
    This is a 800109kB sort.
    
    So, taking the median value as representative here, that looks to be
    just shy of a 40% improvement, or 3.4 seconds. My /proc/cpuinfo is
    attached on the off chance that someone is interested in that. More
    work is needed here, but this seems promising.
    
    It will be interesting to see how far all of this can be taken with
    comparetup_index_btree. Certainly, I'm sure there's some gain to be
    had there by applying lessons learned from comparetup_heap.
    
    --
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
  49. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-11-26T18:58:00Z

    3 items are attached:
    
    1. A spreadsheet, results.ods, which has results for automated
    multiple runs of the "doubled up" dellstore orderlines queries (where
    orderlines is 385 MB). Results are sorted, with the median (or the
    lower middle value, didn't get the mean of the two middle runs) result
    for each run highlighted as representative. There are 3 builds of
    Postgres (HEAD, not inlined, optimized), each on its own sheet in the
    spreadsheet. The cache was warmed for each query, and subsequently the
    query was run 20 times.
    
    2. A python script I hacked together that should run on anything
    around Python 2.5+, with a single dependency, psycopg2. Look at --help
    to see how it works. This is the script that actually generated the
    figures that went into the spreadsheet, by being run once for each
    type of build. You can fairly easily play along at home with this, for
    these and other queries. It will spit out CSV files. It is designed to
    warm the cache (by default 3 times before each query) to eliminate
    caching effects. You can add your own query to the python list to have
    it run by the script, to generate the same set of figures for that
    query. I'm rather curious as to how much of an impact this
    optimisation will have on queries with unique nodes, joins and
    grouping when they rely on a sort node for their input, in the real
    world. Certainly, a query need not have an ORDER BY clause to see
    benefits, perhaps substantial benefits. The logic to parse sort node
    details is rather unsophisticated right now though, due to my focus on
    the obvious ORDER BY case.
    
    3. The revision of the patch that was actually tested, now with
    inlining specialisations for the single scanKey case, and a
    non-inlined specialisation for multiple scanKeys where we can still
    benefit from eliding the SQL function call machinery for the first
    scanKey, which is often almost as useful as being able to do so for
    all scanKeys. It also selects a sorting specialisation when it can in
    tuplesort_begin_heap, per Robert's suggestion.
    
    Reviewers will want to comment out line 731 of tuplesort.c, "#define
    optimize", to quickly get unoptimized behaviour for comparative
    purposes. Furthermore, if you'd like to see what this would look like
    without inlining, you can simply comment out assignments of inline
    variants of specialisations (i.e. int4inlqsort_arg and
    int8inlqsort_arg) in tuplesort_begin_heap.
    
    It has been suggested that I'm chasing diminishing returns by
    inlining, as I go further for a smaller benefit. Of course, that's
    true. However, I believe that I'm not chasing them past the point
    where that ceases to make sense, and these figures support that
    contention - chasing diminishing returns in the nature of this kind of
    work. Here, the additional benefit of inlining accounts for over an
    entire second shaved off a query that was originally 7056ms, so that's
    not to be sniffed at.
    
    I'll reiterate that qsort_arg has only been modified 4 times after its
    initial commit in 2006, and these were all trivial changes. While the
    way that I ape generic programming with the preprocessor is on the
    ugly side, what I've done can be much more easily understood with
    reference to qsort_arg itself. Robert said that duplicating the sort
    function was "iffy". However, that already happened long ago, as we're
    already maintaining both qsort_arg.c and qsort.c, and comments already
    urge maintainers to keep the two consistent. That's not been a problem
    though, because, as I've said, there has never been any call to make
    substantive changes to either in all those years.
    
    We could probably further reduce the code footprint of all of this by
    having template_qsort_arg.h generate comparators directly. That might
    be inflexible in a way that turns out to matter though, like if we
    wanted to use this for non-HAVE_INT64_TIMESTAMP timestamps and had to
    add NaN crud.
    
    My next step is to see how this goes with hundreds of sorts in the
    hundreds of megabytes on a high-end server. I don't have immediate
    access to one, but I'm working that out.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
  50. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-11-28T02:15:13Z

    Attached are the results from performing a similar process to the
    prior benchmark, but on Greg Smith's high-end server, and with an
    orderlines table that has been "doubled-up" until it is 1538 MB,
    making the same old query perform a quicksort that's over 3GB. Short
    version:  HEAD is 20468.0ms, with my patch it's 13689.698ms, so these
    gains hold-up very well for very large in-memory sorts, possibly even
    perfectly.
    
    Note that the "doubling up" had an interesting and desirable effect
    for the purposes of this patch review: It's approaching a worst case
    due to the low cardinality of the product + quantity columns, which
    crops up with the non-inlined functions only (int4regqsort_arg, etc).
    All too frequently, evaluating the first scankey results in equality,
    and so we cannot elide the SQL function call machinery. This is going
    to dampen the improvement for the multiple scanKey case considerably
    (and it looks like a smaller relative improvement than when the
    orderlines table was quite a lot smaller). As I said from the outset,
    there is no worst case for the single scanKey case that occurs to me.
    Multiple scanKeys are likely to be a problem for any effort to offer
    user-defined, per-type sort functions. I could probably make
    int4regqsort_arg and friends perform a bit better than we see here by
    increasing the cardinality of those two columns for the second query,
    so the first scanKey comparison would frequently suffice.
    
    Incidentally, I'm pretty sceptical of the idea of any effort being
    made to provide user-defined per-type sort functions or anything like
    that. No one has come forward with a novel sorting implementation that
    looks useful, although there has been some talk of some. It's
    relatively easy to hack on tuplesort to build a prototype, so I don't
    think the lack of this functionality is the blocker here. Besides, we
    will probably end up doing a terrible job of anticipating how invasive
    such a facility needs to be to facilitate these novel sorting
    implementations.
    
    While I'm very encouraged by these results, I must admit that there
    are a few things that I cannot currently explain. I don't know why the
    second query on the "Optimized" sheet outperforms the same query on
    the "Not inlined" page. When you consider how small the standard
    deviation is for each set of results, it's fairly obvious that it
    cannot be explained by noise. I thought it was weird, so I re-ran both
    benchmarks, with much the same outcome. It may be that the compiler
    gets lucky for the optimized case, resulting in a bit of an extra
    gain, and that effect is ridiculously magnified. In all cases, the
    regression tests pass.
    
    The single key/inlining optimisation seems to account for a big part
    of the gains, and the "Not inlined" figures for the first query would
    seem to suggest that the inlining can account for more than half of
    gains seen, whereas that was previously assumed to be less. What I
    think is happening here is that we're benefiting both from inlining as
    well as not having to even consider multiple scanKeys (we have to at
    least consider them even if we know in advance from the nScankeys
    variable that there'll never be multiple scanKeys). Again, if the
    cardinality was higher, we'd probably see "Not inlined" do better
    here.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
  51. Re: Inlining comparators as a performance optimisation

    Bruce Momjian <bruce@momjian.us> — 2011-11-29T15:31:45Z

    Peter Geoghegan wrote:
    > Attached are the results from performing a similar process to the
    > prior benchmark, but on Greg Smith's high-end server, and with an
    > orderlines table that has been "doubled-up" until it is 1538 MB,
    > making the same old query perform a quicksort that's over 3GB. Short
    > version:  HEAD is 20468.0ms, with my patch it's 13689.698ms, so these
    > gains hold-up very well for very large in-memory sorts, possibly even
    > perfectly.
    > 
    > Note that the "doubling up" had an interesting and desirable effect
    > for the purposes of this patch review: It's approaching a worst case
    > due to the low cardinality of the product + quantity columns, which
    > crops up with the non-inlined functions only (int4regqsort_arg, etc).
    > All too frequently, evaluating the first scankey results in equality,
    > and so we cannot elide the SQL function call machinery. This is going
    > to dampen the improvement for the multiple scanKey case considerably
    > (and it looks like a smaller relative improvement than when the
    > orderlines table was quite a lot smaller). As I said from the outset,
    > there is no worst case for the single scanKey case that occurs to me.
    > Multiple scanKeys are likely to be a problem for any effort to offer
    > user-defined, per-type sort functions. I could probably make
    > int4regqsort_arg and friends perform a bit better than we see here by
    > increasing the cardinality of those two columns for the second query,
    > so the first scanKey comparison would frequently suffice.
    
    These are exciting advanced you are producing and I am hopeful we can
    get this included in Postgres 9.2.  I have mentioned already that I
    think parallelism is the next big Postgres challenge, and of course, one
    of the first areas for parallelism is sorting.  If you can speed up
    sorting as you have reported, this allows us to provide faster sorting
    now, and get even larger benefits if we implement sorting parallelism.
    
    In fact, because Postgres 9.2 has so many performance improvements, we
    might find that we have a more limited need for parallelism.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +
    
    
  52. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-11-29T18:48:37Z

    On 29 November 2011 15:31, Bruce Momjian <bruce@momjian.us> wrote:
    > These are exciting advanced you are producing and I am hopeful we can
    > get this included in Postgres 9.2.
    
    Thanks Bruce.
    
    >I have mentioned already that I
    > think parallelism is the next big Postgres challenge, and of course, one
    > of the first areas for parallelism is sorting.
    
    I'm not sure that sorting has that much to recommend it as an initial
    target of some new backend parallelism other than being easy to
    implement. I've observed the qsort_arg specialisations in this patch
    out-perform stock qsort_arg by as much as almost 3 times. However, the
    largest decrease in a query's time that I've observed was 45%, and
    that was for a contrived worst-case for quicksort, but about 25% is
    much more typical of queries similar to the ones I've shown, for more
    normative data distributions. While that's a respectable gain, it
    isn't a paradigm shifting one, and it makes parallelising qsort itself
    for further improvements quite a lot less attractive - there's too
    many other sources of overhead.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
    
  53. Re: Inlining comparators as a performance optimisation

    Bruce Momjian <bruce@momjian.us> — 2011-11-29T20:34:40Z

    Peter Geoghegan wrote:
    > On 29 November 2011 15:31, Bruce Momjian <bruce@momjian.us> wrote:
    > > These are exciting advanced you are producing and I am hopeful we can
    > > get this included in Postgres 9.2.
    > 
    > Thanks Bruce.
    > 
    > >I have mentioned already that I
    > > think parallelism is the next big Postgres challenge, and of course, one
    > > of the first areas for parallelism is sorting.
    > 
    > I'm not sure that sorting has that much to recommend it as an initial
    > target of some new backend parallelism other than being easy to
    > implement. I've observed the qsort_arg specialisations in this patch
    > out-perform stock qsort_arg by as much as almost 3 times. However, the
    > largest decrease in a query's time that I've observed was 45%, and
    > that was for a contrived worst-case for quicksort, but about 25% is
    > much more typical of queries similar to the ones I've shown, for more
    > normative data distributions. While that's a respectable gain, it
    > isn't a paradigm shifting one, and it makes parallelising qsort itself
    > for further improvements quite a lot less attractive - there's too
    > many other sources of overhead.
    
    Agreed.  I think your improvements make it likely we will address not
    address sort parallelism first.
    
    With all the improvements coming in Postgres 9.2, we might need to look
    at I/O parallelism first.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +
    
    
  54. Re: Inlining comparators as a performance optimisation

    Andres Freund <andres@anarazel.de> — 2011-11-29T21:36:26Z

    On Tuesday, November 29, 2011 07:48:37 PM Peter Geoghegan wrote:
    > On 29 November 2011 15:31, Bruce Momjian <bruce@momjian.us> wrote:
    > > These are exciting advanced you are producing and I am hopeful we can
    > > get this included in Postgres 9.2.
    > 
    > Thanks Bruce.
    > 
    > >I have mentioned already that I
    > >
    > > think parallelism is the next big Postgres challenge, and of course, one
    > > of the first areas for parallelism is sorting.
    > 
    > I'm not sure that sorting has that much to recommend it as an initial
    > target of some new backend parallelism other than being easy to
    > implement. I've observed the qsort_arg specialisations in this patch
    > out-perform stock qsort_arg by as much as almost 3 times. However, the
    > largest decrease in a query's time that I've observed was 45%, and
    > that was for a contrived worst-case for quicksort, but about 25% is
    > much more typical of queries similar to the ones I've shown, for more
    > normative data distributions. While that's a respectable gain, it
    > isn't a paradigm shifting one, and it makes parallelising qsort itself
    > for further improvements quite a lot less attractive - there's too
    > many other sources of overhead.
    I think that logic is faulty.
    
    For one I doubt that anybody is honestly suggesting paralellism inside qsort 
    itself. It seems more likely/sensible to implement that on the level of 
    mergesorting.
    Index builds for example could hugely benefit from improvements on that level. 
    With index build you often get pretty non-optimal data distributions btw...
    
    I also seriously doubt that you will find an area inside pg's executor where 
    you find that paralellizing them will provide a near linear scale without 
    much, much more work.
    
    Also I wouldn't consider sorting the easiest target - especially on a qsort 
    level - for parallelization as you constantly need to execute user defined 
    operators with multiple input tuples which has the usual problems.
    COPY parsing + inserting or such seems to be way easier target for example. 
    Even doing hashing + aggregation in different threads seems likely to be 
    easier.
    
    Andres
    
    
    
  55. Re: Inlining comparators as a performance optimisation

    Greg Jaskiewicz <gj@pointblue.com.pl> — 2011-11-29T21:43:40Z

    On 28 Nov 2011, at 02:15, Peter Geoghegan wrote:
    
    > Attached are the results from performing a similar process to the
    > prior benchmark, but on Greg Smith's high-end server, and with an
    > orderlines table that has been "doubled-up" until it is 1538 MB,
    > making the same old query perform a quicksort that's over 3GB. Short
    > version:  HEAD is 20468.0ms, with my patch it's 13689.698ms, so these
    > gains hold-up very well for very large in-memory sorts, possibly even
    > perfectly.
    > 
    
    This is some really good stuff. 
    Has to be said, that there must be quite few other places where inlining like that could have quite a positive benefit. 
    But - I also have to say that both template functions and template classes in C++ give you pretty much the same speed improvement, with much better clarity and readability of the code. (I've tested it with the example Peter presented, if someone is interested in the code). 
    One more reason why PostgreSQL project should look into the future and get some of the bits simplified and optimised by switching to C++. 
    
    
    
    
    
  56. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-12-01T16:44:55Z

    Attached is revision of my patch with some clean-ups. In particular,
    I'm now using switch statements for greater readability, plus
    supporting fast path sorting of the time datatype. I've also updated
    the documentation on "Date/Time Types" to note the additional
    disadvantage of using the deprecated "store timestamp + friends as
    double precision floating-point numbers" compile time option.
    
    There is one aspect to this optimisation that I haven't touched on,
    which is the effect on memory consumption. I think that much of the
    value that this patch will deliver will come from being able to
    release sort memory earlier. Consider that the substantial
    improvements in raw sorting speed (far more substantial than the
    improvements in query runtime) will sometimes result in a concomitant
    reduction in the time that the executor holds onto memory allocated
    for sorting. Maybe the effect will only be really noticeable for plans
    with a sort node as their root node, but that isn't exactly a rare
    occurrence, particularly among large, expensive sorts.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
  57. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-12-01T17:15:52Z

    On Thu, Dec 1, 2011 at 11:44 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
    > Attached is revision of my patch with some clean-ups. In particular,
    > I'm now using switch statements for greater readability, plus
    > supporting fast path sorting of the time datatype. I've also updated
    > the documentation on "Date/Time Types" to note the additional
    > disadvantage of using the deprecated "store timestamp + friends as
    > double precision floating-point numbers" compile time option.
    
    One thing I'm starting to get a bit concerned about is the effect of
    this patch on the size of the resulting binary.  The performance
    results you've posted are getting steadily more impressive as you get
    into this, which is cool, but it seems like the number of copies of
    the qsort logic that you're proposing to include in the resulting
    binary is also steadily climbing.  On my system, a stripped postgres
    binary built with my usual compile options (except --enable-cassert,
    which I took out) is 49336 bytes bigger with this patch applied, an
    increase of close to 1%.  We might need to be a little bit choosy
    about this, because I don't think that we want to end up with a
    situation where some noticeable percentage of the final binary
    consists of differently-inlined versions of the quicksort algorithm -
    especially because there may be other places where we want to do
    similar kinds of inlining.
    
    Thoughts?
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  58. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-02T01:13:02Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > On Thu, Dec 1, 2011 at 11:44 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
    >> Attached is revision of my patch with some clean-ups.
    
    > One thing I'm starting to get a bit concerned about is the effect of
    > this patch on the size of the resulting binary.
    
    Regardless of that, I'm still of the opinion that this patch is
    fundamentally misdesigned.  The way it's set up, it is only possible to
    have a fast path for types that are hard-wired into tuplesort.c, and
    that flies in the face of twenty years' worth of effort to make Postgres
    an extensible system.  I really don't care about performance
    measurements: this is not an acceptable design, period.
    
    What I want to see is some mechanism that pushes this out to the
    individual datatypes, so that both core and plug-in datatypes have
    access to the benefits.  There are a number of ways that could be
    attacked, but the most obvious one is to invent additional, optional
    support function(s) for btree opclasses.
    
    I also still think that we should pursue the idea of a lower-overhead
    API for comparison functions.  It may be that it's worth the code bulk
    to have an inlined copy of qsort for a small number of high-usage
    datatypes, but I think there are going to be a lot for which we aren't
    going to want to pay that price, and yet we could get some benefit from
    cutting the call overhead.  A lower-overhead API would also save cycles
    in usage of the comparison functions from btree itself (remember that?).
    
    I think in particular that the proposed approach to multiple sort keys
    is bogus and should be replaced by just using lower-overhead
    comparators.  The gain per added code byte in doing it as proposed
    has got to be too low to be reasonable.
    
    			regards, tom lane
    
    
  59. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-12-02T01:29:56Z

    On 1 December 2011 17:15, Robert Haas <robertmhaas@gmail.com> wrote:
    > One thing I'm starting to get a bit concerned about is the effect of
    > this patch on the size of the resulting binary.  The performance
    > results you've posted are getting steadily more impressive as you get
    > into this, which is cool, but it seems like the number of copies of
    > the qsort logic that you're proposing to include in the resulting
    > binary is also steadily climbing.  On my system, a stripped postgres
    > binary built with my usual compile options (except --enable-cassert,
    > which I took out) is 49336 bytes bigger with this patch applied, an
    > increase of close to 1%.
    
    Do your usual compile options include debug symbols? I've been using
    standard compile options for development of this patch, for obvious
    reasons. I get 36690 bytes (just under 36 KiB, or a 0.644% increase).
    
    Binary bloat is a legitimate concern. I thought that I was
    conservative in my choice of specialisations though.
    
    > We might need to be a little bit choosy
    > about this, because I don't think that we want to end up with a
    > situation where some noticeable percentage of the final binary
    > consists of differently-inlined versions of the quicksort algorithm -
    > especially because there may be other places where we want to do
    > similar kinds of inlining.
    >
    > Thoughts?
    
    A simple caveat in a comment along the lines of "this mechanism
    instantiates 2 copies of qsort_arg per type, please use judiciously"
    sounds like the right balance. It could also be possible to be penny
    wise and pound foolish here though.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
    
  60. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-12-02T01:31:06Z

    On Thu, Dec 1, 2011 at 8:29 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
    > Do your usual compile options include debug symbols? I've been using
    > standard compile options for development of this patch, for obvious
    > reasons. I get 36690 bytes (just under 36 KiB, or a 0.644% increase).
    
    They do, but I ran "strip" on the resulting binary before taking those
    measurements, so I thought that would undo it...
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  61. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-12-02T02:15:14Z

    On Thu, Dec 1, 2011 at 8:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > What I want to see is some mechanism that pushes this out to the
    > individual datatypes, so that both core and plug-in datatypes have
    > access to the benefits.  There are a number of ways that could be
    > attacked, but the most obvious one is to invent additional, optional
    > support function(s) for btree opclasses.
    
    I feel like what's missing here is a way for the catalogs to refer to
    a function that doesn't take FunctionCallInfo as an argument.  But it
    strikes me that it wouldn't be very hard to design such a mechanism.
    It seems like all we really need to do is decide on a nomenclature.
    The simplest possible approach would probably be to just refer to
    functions by name.  In other words, we add a text or name column to
    pg_amproc, and then inside the backend there's a function whose job it
    is to map the string to a function address.  However, that would have
    the annoying effect of causing catalog bloat, so I'm inclined to
    instead propose an 8-byte integer.  (A 4-byte integer is enough, but
    would increase the risk of collisions between values that might be
    chosen by third party extensions.)
    
    So, the API to this would look something like this:
    
    void register_arbitrary_function_pointer(int64 handle, void *funcptr);
    void *get_arbitrary_function_pointer(int64 handle);
    
    So the idea is that if you need to refer to an arbitrary function in a
    system catalog (such as pg_amproc), you generate a random non-zero
    64-bit integer, stuff it into the catalog, and then register the same
    64-bit integer with the appropriate function pointer.  To avoid
    increasing backend startup time, we could have a Perl script generate
    a prefab hash table for the built-in functions; loadable modules would
    add their entries at PG_init() time using
    register_arbitrary_function_pointer.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  62. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-02T02:46:09Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > On Thu, Dec 1, 2011 at 8:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> What I want to see is some mechanism that pushes this out to the
    >> individual datatypes, so that both core and plug-in datatypes have
    >> access to the benefits.  There are a number of ways that could be
    >> attacked, but the most obvious one is to invent additional, optional
    >> support function(s) for btree opclasses.
    
    > I feel like what's missing here is a way for the catalogs to refer to
    > a function that doesn't take FunctionCallInfo as an argument.  But it
    > strikes me that it wouldn't be very hard to design such a mechanism.
    
    IMO, entries in pg_proc ought to refer to functions that are callable
    by the standard interface: breaking that basic contract is not going to
    lead anywhere pleasant.  Nor do I really want yet more columns in
    pg_proc.  Nor does the "register a pointer" scheme you suggest make
    any sense to me: you still need to connect these things to catalog
    entries, pg_opclass entries in particular, and the intermediate handle
    adds nothing to the situation except for creating a risk of collisions.
    
    The scheme that was rolling around in my mind was about like this:
    
    * Define btree opclasses to have an optional support function,
    amprocnum 2, that has a SQL signature of func(internal) returns void.
    
    * The actual argument represented by the "internal" parameter would
    be a pointer to a struct which is to be filled in by the support
    function.  We'd call the support function once, during tuplesort
    setup or btree index relcache entry setup, and save the struct
    somewhere.
    
    * The struct contents would be pointers to functions that have a
    non-FunctionCallInfo interface.  We know we need at least two:
    a full qsort replacement, and a non-FCI comparator.  We might want
    more in future, if someone convinces us that additional specializations
    of sorting are worth the trouble.  So I imagine a struct like this:
    
    	typedef struct SortSupportFunctions {
    		void	(*inline_qsort) (Datum *elements, int nelements);
    		int	(*comparator) (Datum a, Datum b);
    	} SortSupportFunctions;
    
    with the agreement that the caller must zero out the struct before call,
    and then the btree support function sets the function pointers for any
    specializations it's going to offer.  If we later add a third or fourth
    function pointer, datatypes that know about that could fill in those
    pointers, but datatypes that haven't been updated aren't broken.
    
    One thing I'm not too certain about is whether to define the APIs just
    as above, or to support a passthrough argument of some sort (and if so,
    what does it reference)?  Possibly any datatype that we'd actually care
    about this for is going to be simple enough to not need any state data.
    Or possibly not.  And what about collations?
    
    			regards, tom lane
    
    
  63. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-12-02T03:22:32Z

    On Thu, Dec 1, 2011 at 9:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > IMO, entries in pg_proc ought to refer to functions that are callable
    > by the standard interface: breaking that basic contract is not going to
    > lead anywhere pleasant.  Nor do I really want yet more columns in
    > pg_proc.
    
    I wasn't proposing to create pg_proc entries for this.
    
    > Nor does the "register a pointer" scheme you suggest make
    > any sense to me: you still need to connect these things to catalog
    > entries, pg_opclass entries in particular, and the intermediate handle
    > adds nothing to the situation except for creating a risk of collisions.
    
    I think you might be misinterpreting what I had in mind.  Right now,
    pg_amproc entries have an "amproc" column that points to a pg_proc
    entry that in turn points to a function that takes
    FunctionCallInfoData as an argument.  What I'm proposing to do is add
    an additional column to that catalog that points more or less directly
    to a non-SQL-callable function, but it can't actually just be the
    address of the function because that's not stable.  So what I'm
    proposing is that we interpose the thinnest possible shim layer
    between the catalog and a function pointer, and an int64 -> function
    pointer mapping seemed to me like something that would fit the bill.
    
    > The scheme that was rolling around in my mind was about like this:
    >
    > * Define btree opclasses to have an optional support function,
    > amprocnum 2, that has a SQL signature of func(internal) returns void.
    >
    > * The actual argument represented by the "internal" parameter would
    > be a pointer to a struct which is to be filled in by the support
    > function.  We'd call the support function once, during tuplesort
    > setup or btree index relcache entry setup, and save the struct
    > somewhere.
    >
    > * The struct contents would be pointers to functions that have a
    > non-FunctionCallInfo interface.  We know we need at least two:
    > a full qsort replacement, and a non-FCI comparator.  We might want
    > more in future, if someone convinces us that additional specializations
    > of sorting are worth the trouble.  So I imagine a struct like this:
    >
    >        typedef struct SortSupportFunctions {
    >                void    (*inline_qsort) (Datum *elements, int nelements);
    >                int     (*comparator) (Datum a, Datum b);
    >        } SortSupportFunctions;
    >
    > with the agreement that the caller must zero out the struct before call,
    > and then the btree support function sets the function pointers for any
    > specializations it's going to offer.  If we later add a third or fourth
    > function pointer, datatypes that know about that could fill in those
    > pointers, but datatypes that haven't been updated aren't broken.
    >
    > One thing I'm not too certain about is whether to define the APIs just
    > as above, or to support a passthrough argument of some sort (and if so,
    > what does it reference)?  Possibly any datatype that we'd actually care
    > about this for is going to be simple enough to not need any state data.
    > Or possibly not.  And what about collations?
    
    Maybe there should be a comparator_setup function that gets the
    collation OID and returns void *, and then that void * value gets
    passed as a third argument to each call to the comparator function.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  64. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-02T03:48:07Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > On Thu, Dec 1, 2011 at 9:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> Nor does the "register a pointer" scheme you suggest make
    >> any sense to me: you still need to connect these things to catalog
    >> entries, pg_opclass entries in particular, and the intermediate handle
    >> adds nothing to the situation except for creating a risk of collisions.
    
    > I think you might be misinterpreting what I had in mind.  Right now,
    > pg_amproc entries have an "amproc" column that points to a pg_proc
    > entry that in turn points to a function that takes
    > FunctionCallInfoData as an argument.  What I'm proposing to do is add
    > an additional column to that catalog that points more or less directly
    > to a non-SQL-callable function, but it can't actually just be the
    > address of the function because that's not stable.  So what I'm
    > proposing is that we interpose the thinnest possible shim layer
    > between the catalog and a function pointer, and an int64 -> function
    > pointer mapping seemed to me like something that would fit the bill.
    
    But you still need a lot of mechanism to do that mapping, including an
    initialization function that has noplace to be executed (unless every
    .so that defines a btree opclass has to be listed in preload_libraries),
    so it doesn't seem like the "thinnest possible shim" to me.
    
    >> One thing I'm not too certain about is whether to define the APIs just
    >> as above, or to support a passthrough argument of some sort (and if so,
    >> what does it reference)? Possibly any datatype that we'd actually care
    >> about this for is going to be simple enough to not need any state data.
    >> Or possibly not. And what about collations?
    
    > Maybe there should be a comparator_setup function that gets the
    > collation OID and returns void *, and then that void * value gets
    > passed as a third argument to each call to the comparator function.
    
    Maybe.  Or perhaps we could merge that work into the
    function-pointer-setup function --- that is, give it the collation and
    some extra workspace to fool with.  We would always know the
    collation at the time we're doing that setup.
    
    At this point the struct filled in by the setup function is starting
    to feel a lot like FmgrInfo, and history teaches us a lot about what
    needs to be in there.  So maybe
    
    typedef struct SortSupportInfoData *SortSupportInfo;
    
    typedef struct SortSupportInfoData
    {
    	MemoryContext	info_cxt;	/* where to allocate subsidiary data */
    	void		*extra;		/* any data needed by functions */
    	Oid		collation;	/* provided by caller */
    
    	void	(*inline_qsort) (Datum *elements, int nelements,
    				 SortSupportInfo info);
    	int	(*comparator) (Datum a, Datum b,
    			       SortSupportInfo info);
    	/* possibly other function pointers in future */	
    } SortSupportInfoData;
    
    I am thinking that the btree code, at least, would want to just
    unconditionally do
    
    	colsortinfo->comparator(datum1, datum2, colsortinfo)
    
    so for an opclass that fails to supply the low-overhead comparator,
    it would insert into the "comparator" pointer a shim function that
    calls the opclass' old-style FCI-using comparator.  (Anybody who
    complains about the added overhead would be told to get busy and
    supply a low-overhead comparator for their datatype...)  But to do
    that, we have to have enough infrastructure here to cover all cases,
    so omitting collation or not having a place to stash an FmgrInfo
    won't do.
    
    			regards, tom lane
    
    
  65. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-12-02T04:00:36Z

    On Thu, Dec 1, 2011 at 10:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > But you still need a lot of mechanism to do that mapping, including an
    > initialization function that has noplace to be executed (unless every
    > .so that defines a btree opclass has to be listed in preload_libraries),
    > so it doesn't seem like the "thinnest possible shim" to me.
    
    PG_init?
    
    >>> One thing I'm not too certain about is whether to define the APIs just
    >>> as above, or to support a passthrough argument of some sort (and if so,
    >>> what does it reference)?  Possibly any datatype that we'd actually care
    >>> about this for is going to be simple enough to not need any state data.
    >>> Or possibly not.  And what about collations?
    >
    >> Maybe there should be a comparator_setup function that gets the
    >> collation OID and returns void *, and then that void * value gets
    >> passed as a third argument to each call to the comparator function.
    >
    > Maybe.  Or perhaps we could merge that work into the
    > function-pointer-setup function --- that is, give it the collation and
    > some extra workspace to fool with.  We would always know the
    > collation at the time we're doing that setup.
    
    Ah!  That seems quite nice.
    
    > At this point the struct filled in by the setup function is starting
    > to feel a lot like FmgrInfo, and history teaches us a lot about what
    > needs to be in there.  So maybe
    >
    > typedef struct SortSupportInfoData *SortSupportInfo;
    >
    > typedef struct SortSupportInfoData
    > {
    >        MemoryContext   info_cxt;       /* where to allocate subsidiary data */
    >        void            *extra;         /* any data needed by functions */
    >        Oid             collation;      /* provided by caller */
    >
    >        void    (*inline_qsort) (Datum *elements, int nelements,
    >                                 SortSupportInfo info);
    >        int     (*comparator) (Datum a, Datum b,
    >                               SortSupportInfo info);
    >        /* possibly other function pointers in future */
    > } SortSupportInfoData;
    >
    > I am thinking that the btree code, at least, would want to just
    > unconditionally do
    >
    >        colsortinfo->comparator(datum1, datum2, colsortinfo)
    >
    > so for an opclass that fails to supply the low-overhead comparator,
    > it would insert into the "comparator" pointer a shim function that
    > calls the opclass' old-style FCI-using comparator.  (Anybody who
    > complains about the added overhead would be told to get busy and
    > supply a low-overhead comparator for their datatype...)  But to do
    > that, we have to have enough infrastructure here to cover all cases,
    > so omitting collation or not having a place to stash an FmgrInfo
    > won't do.
    
    I'm slightly worried about whether that'll be adding too much overhead
    to the case where there is no non-FCI comparator.  But it may be no
    worse than what we're doing now.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  66. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-02T05:16:08Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > On Thu, Dec 1, 2011 at 10:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> I am thinking that the btree code, at least, would want to just
    >> unconditionally do
    >> 
    >>        colsortinfo->comparator(datum1, datum2, colsortinfo)
    >> 
    >> so for an opclass that fails to supply the low-overhead comparator,
    >> it would insert into the "comparator" pointer a shim function that
    >> calls the opclass' old-style FCI-using comparator.  (Anybody who
    >> complains about the added overhead would be told to get busy and
    >> supply a low-overhead comparator for their datatype...)  But to do
    >> that, we have to have enough infrastructure here to cover all cases,
    >> so omitting collation or not having a place to stash an FmgrInfo
    >> won't do.
    
    > I'm slightly worried about whether that'll be adding too much overhead
    > to the case where there is no non-FCI comparator.  But it may be no
    > worse than what we're doing now.
    
    It should be the same or better.  Right now, we are going through
    FunctionCall2Coll to reach the FCI-style comparator.  The shim function
    would be more or less equivalent to that, and since it's quite special
    purpose I would hope we could shave a cycle or two.  For instance, we
    could probably afford to set up a dedicated FunctionCallInfo struct
    associated with the SortSupportInfo struct, and not have to reinitialize
    one each time.
    
    			regards, tom lane
    
    
  67. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-12-02T13:33:30Z

    On 2 December 2011 01:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Regardless of that, I'm still of the opinion that this patch is
    > fundamentally misdesigned.  The way it's set up, it is only possible to
    > have a fast path for types that are hard-wired into tuplesort.c, and
    > that flies in the face of twenty years' worth of effort to make Postgres
    > an extensible system.
    
    What the user doesn't know can't hurt them. I'm not of the opinion
    that an internal optimisations flies in the face of anything - is it
    really so hard for you to hold your nose for a fairly isolated patch
    like this?
    
    > I really don't care about performance
    > measurements: this is not an acceptable design, period.
    
    If ever there was a justification for doing something so distasteful,
    I would think that a 60% decrease in the time taken to perform a raw
    sort for POD types (plus common covariants) would be it.
    
    > What I want to see is some mechanism that pushes this out to the
    > individual datatypes, so that both core and plug-in datatypes have
    > access to the benefits.  There are a number of ways that could be
    > attacked, but the most obvious one is to invent additional, optional
    > support function(s) for btree opclasses.
    >
    > I also still think that we should pursue the idea of a lower-overhead
    > API for comparison functions.  It may be that it's worth the code bulk
    > to have an inlined copy of qsort for a small number of high-usage
    > datatypes, but I think there are going to be a lot for which we aren't
    > going to want to pay that price, and yet we could get some benefit from
    > cutting the call overhead.
    
    I'm not opposed to that. It just seems fairly tangential to what I've
    done here, as well as considerably less important to users,
    particularly when you look at the numbers I've produced. It would be
    nice to get a lesser gain for text and things like that too, but other
    than that I don't see a huge demand.
    
    > A lower-overhead API would also save cycles
    > in usage of the comparison functions from btree itself (remember that?).
    
    Yes, I remember that. Why not do something similar there, and get
    similarly large benefits?
    
    > I think in particular that the proposed approach to multiple sort keys
    > is bogus and should be replaced by just using lower-overhead
    > comparators.  The gain per added code byte in doing it as proposed
    > has got to be too low to be reasonable.
    
    Reasonable by what standard? We may well be better off with
    lower-overhead comparators for the multiple scanKey case (though I
    wouldn't like to bet on it) but we would most certainly not be better
    off without discriminating against single scanKey and multiple scanKey
    cases as I have.
    
    Look at the spreadsheet results_server.ods . Compare the "not inlined"
    and "optimized" sheets. It's clear from those numbers that a
    combination of inlining and of simply not having to consider a case
    that will never come up (multiple scanKeys), the inlining
    specialisation far outperforms the non-inlining, multiple scanKey
    specialization for the same data (I simply commented out inlining
    specializations so that non-inlining specialisations would always be
    called for int4 and friends, even if there was only a single scanKey).
    That's where the really dramatic improvements are seen. It's well
    worth having this additional benefit, because it's such a common case
    and the benefit is so big, and while I will be quite happy to pursue a
    plan to bring some of these benefits to all types such as the one you
    describe, I do not want to jettison the additional benefits described
    to do so. Isn't that reasonable?
    
    We're talking about taking 6778.302ms off a 20468.0ms query, rather
    than 1860.332ms. Not exactly a subtle difference. Assuming that those
    figures are representative of the gains to be had by a generic
    mechanism that does not inline/specialize across number of scanKeys,
    are you sure that that's worthwhile?
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
    
  68. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-12-02T13:57:48Z

    On Fri, Dec 2, 2011 at 12:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > It should be the same or better.  Right now, we are going through
    > FunctionCall2Coll to reach the FCI-style comparator.  The shim function
    > would be more or less equivalent to that, and since it's quite special
    > purpose I would hope we could shave a cycle or two.  For instance, we
    > could probably afford to set up a dedicated FunctionCallInfo struct
    > associated with the SortSupportInfo struct, and not have to reinitialize
    > one each time.
    
    OK, but I think it's also going to cost you an extra syscache lookup,
    no?  You're going to have to check for this new support function
    first, and then if you don't find it, you'll have to check for the
    original one.  I don't think there's any higher-level caching around
    opfamilies to save our bacon here, is there?
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  69. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-02T15:11:19Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > OK, but I think it's also going to cost you an extra syscache lookup,
    > no?  You're going to have to check for this new support function
    > first, and then if you don't find it, you'll have to check for the
    > original one.  I don't think there's any higher-level caching around
    > opfamilies to save our bacon here, is there?
    
    [ shrug... ] If you are bothered by that, get off your duff and provide
    the function for your datatype.  But it's certainly going to be in the
    noise for btree index usage, and I submit that query parsing/setup
    involves quite a lot of syscache lookups already.  I think that as a
    performance objection, the above is nonsensical.  (And I would also
    comment that your proposal with a handle is going to involve a table
    search that's at least as expensive as a syscache lookup.)
    
    			regards, tom lane
    
    
  70. Re: Inlining comparators as a performance optimisation

    Bruce Momjian <bruce@momjian.us> — 2011-12-02T19:35:32Z

    Tom Lane wrote:
    > Robert Haas <robertmhaas@gmail.com> writes:
    > > OK, but I think it's also going to cost you an extra syscache lookup,
    > > no?  You're going to have to check for this new support function
    > > first, and then if you don't find it, you'll have to check for the
    > > original one.  I don't think there's any higher-level caching around
    > > opfamilies to save our bacon here, is there?
    > 
    > [ shrug... ] If you are bothered by that, get off your duff and provide
    > the function for your datatype.  But it's certainly going to be in the
    > noise for btree index usage, and I submit that query parsing/setup
    > involves quite a lot of syscache lookups already.  I think that as a
    > performance objection, the above is nonsensical.  (And I would also
    > comment that your proposal with a handle is going to involve a table
    > search that's at least as expensive as a syscache lookup.)
    
    Agreed.  Doing something once and doing something in the sort loop are
    two different overheads.
    
    I am excited by this major speedup Peter Geoghegan has found.  Postgres
    doesn't have parallel query, which is often used for sorting, so we are
    already behind some of the databases are compared against.  Getting this
    speedup is definitely going to help us.  And I do like the generic
    nature of where we are heading!
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +
    
    
  71. Re: Inlining comparators as a performance optimisation

    Pavel Stehule <pavel.stehule@gmail.com> — 2011-12-02T20:04:45Z

    >>
    >> [ shrug... ] If you are bothered by that, get off your duff and provide
    >> the function for your datatype.  But it's certainly going to be in the
    >> noise for btree index usage, and I submit that query parsing/setup
    >> involves quite a lot of syscache lookups already.  I think that as a
    >> performance objection, the above is nonsensical.  (And I would also
    >> comment that your proposal with a handle is going to involve a table
    >> search that's at least as expensive as a syscache lookup.)
    >
    > Agreed.  Doing something once and doing something in the sort loop are
    > two different overheads.
    >
    > I am excited by this major speedup Peter Geoghegan has found.  Postgres
    > doesn't have parallel query, which is often used for sorting, so we are
    > already behind some of the databases are compared against.  Getting this
    > speedup is definitely going to help us.  And I do like the generic
    > nature of where we are heading!
    >
    
    Oracle has not or had not parallel sort too, and I have a reports so
    Oracle does sort faster then PostgreSQL (but without any numbers). So
    any solution is welcome
    
    Regards
    
    Pavel
    
    
  72. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-12-02T20:21:50Z

    On Fri, Dec 2, 2011 at 2:35 PM, Bruce Momjian <bruce@momjian.us> wrote:
    > Agreed.  Doing something once and doing something in the sort loop are
    > two different overheads.
    
    OK, so I tried to code this up.  Adding the new amproc wasn't too
    difficult (see attached).  It wasn't obvious to me how to tie it into
    the tuplesort infrastructure, though, so instead of wasting time
    guessing what a sensible approach might be I'm going to use one of my
    lifelines and poll the audience (or is that ask an expert?).
    Currently the Tuplesortstate[1] has a pointer to an array of
    ScanKeyData objects, one per column being sorted.  But now instead of
    "FmgrInfo sk_func", the tuplesort code is going to want each scankey
    to contain "SortSupportInfo(Data?) sk_sortsupport"[2], because that's
    where we get the comparison function from.   Should I just go ahead
    and add one more member to that struct, or is there some more
    appropriate way to handle this?
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    [1] Consistent capitalization is for wimps.
    [2] Hey, we could abbreviate that "SSI"!  Oh, wait...
    
  73. Re: Inlining comparators as a performance optimisation

    Bruce Momjian <bruce@momjian.us> — 2011-12-02T20:42:21Z

    Robert Haas wrote:
    > On Fri, Dec 2, 2011 at 2:35 PM, Bruce Momjian <bruce@momjian.us> wrote:
    > > Agreed. ?Doing something once and doing something in the sort loop are
    > > two different overheads.
    > 
    > OK, so I tried to code this up.  Adding the new amproc wasn't too
    > difficult (see attached).  It wasn't obvious to me how to tie it into
    > the tuplesort infrastructure, though, so instead of wasting time
    > guessing what a sensible approach might be I'm going to use one of my
    > lifelines and poll the audience (or is that ask an expert?).
    > Currently the Tuplesortstate[1] has a pointer to an array of
    > ScanKeyData objects, one per column being sorted.  But now instead of
    > "FmgrInfo sk_func", the tuplesort code is going to want each scankey
    > to contain "SortSupportInfo(Data?) sk_sortsupport"[2], because that's
    > where we get the comparison function from.   Should I just go ahead
    > and add one more member to that struct, or is there some more
    > appropriate way to handle this?
    
    Is this code immediately usable anywhere else in our codebasde, and if
    so, is it generic enough?
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +
    
    
  74. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-03T18:36:04Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > OK, so I tried to code this up.  Adding the new amproc wasn't too
    > difficult (see attached).  It wasn't obvious to me how to tie it into
    > the tuplesort infrastructure, though, so instead of wasting time
    > guessing what a sensible approach might be I'm going to use one of my
    > lifelines and poll the audience (or is that ask an expert?).
    
    OK, I'll take a whack at improving this over the weekend.
    
    			regards, tom lane
    
    
  75. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-04T19:17:15Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > OK, so I tried to code this up.  Adding the new amproc wasn't too
    > difficult (see attached).  It wasn't obvious to me how to tie it into
    > the tuplesort infrastructure, though, so instead of wasting time
    > guessing what a sensible approach might be I'm going to use one of my
    > lifelines and poll the audience (or is that ask an expert?).
    
    Here's another cut at this.  I only went as far as converting the
    heap-sort code path in tuplesort.c, so there's lots more to do, but
    this does sort integers successfully.  Before expanding the patch
    to do more, I think we need to have consensus on some API details,
    in particular:
    
    * I invented a "SortKey" struct that replaces ScanKey for tuplesort's
    purposes.  Right now that's local in tuplesort.c, but we're more than
    likely going to need it elsewhere as well.  Should we just define it
    in sortsupport.h?  Or perhaps we should just add the few additional
    fields to SortSupportInfoData, and not bother with two struct types?
    Before you answer, consider the next point.
    
    * I wonder whether it would be worthwhile to elide inlineApplyComparator
    altogether, pushing what it does down to the level of the
    datatype-specific functions.  That would require changing the
    "comparator" API to include isnull flags, and exposing the
    reverse/nulls_first sort flags to the comparators (presumably by
    including them in SortSupportInfoData).  The main way in which that
    could be a win would be if the setup function could choose one of four
    comparator functions that are pre-specialized for each flags
    combination; but that seems like it would add a lot of code bulk, and
    the bigger problem is that we need to be able to change the flags after
    sort initialization (cf. the reversedirection code in tuplesort.c), so
    we'd also need some kind of "re-select the comparator" call API.  On the
    whole this doesn't seem promising, but maybe somebody else has a
    different idea.
    
    * We're going to want to expose PrepareSortSupportComparisonShim
    for use outside tuplesort.c too, and possibly refactor
    tuplesort_begin_heap so that the SortKey setup logic inside it
    can be extracted for use elsewhere.  Shall we just add those to
    tuplesort's API, or would it be better to create a sortsupport.c
    with these sorts of functions?
    
    I have not done any performance testing on this patch, but it might be
    interesting to check it with the same test cases Peter's been using.
    
    			regards, tom lane
    
    
    
  76. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-12-05T00:14:33Z

    On 4 December 2011 19:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > I have not done any performance testing on this patch, but it might be
    > interesting to check it with the same test cases Peter's been using.
    
    I've attached a revision of exactly the same benchmark run to get the
    results in results_server.ods .
    
    You'll see very similar figures to results_server.ods for HEAD and for
    my patch, as you'd expect. I think the results speak for themselves. I
    maintain that we should use specialisations - that's where most of the
    benefit is to be found.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
  77. Re: Inlining comparators as a performance optimisation

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-12-05T13:23:45Z

    On 05.12.2011 02:14, Peter Geoghegan wrote:
    > On 4 December 2011 19:17, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
    >> I have not done any performance testing on this patch, but it might be
    >> interesting to check it with the same test cases Peter's been using.
    >
    > I've attached a revision of exactly the same benchmark run to get the
    > results in results_server.ods .
    >
    > You'll see very similar figures to results_server.ods for HEAD and for
    > my patch, as you'd expect. I think the results speak for themselves. I
    > maintain that we should use specialisations - that's where most of the
    > benefit is to be found.
    
    I'm still not convinced the big gain is in inlining the comparison 
    functions. Your patch contains a few orthogonal tricks, too:
    
    * A fastpath for the case of just one scankey. No reason we can't do 
    that without inlining.
    
    * inlining the swap-functions within qsort. Thaẗ́'s different from 
    inlining the comparison functions, and could be done regardless of the 
    data type. The qsort element size in tuplesort.c is always sizeof(SortTuple)
    
    And there's some additional specializations we can do, not implemented 
    in your patch, that don't depend on inlining the comparison functions:
    
    * A fastpath for the case of no NULLs
    * separate comparetup functions for ascending and descending sorts (this 
    allows tail call elimination of the call to type-specific comparison 
    function in the ascending version)
    * call CHECK_FOR_INTERRUPTS() less frequently.
    
    To see how much difference those things make, I hacked together the 
    attached patch. I didn't base this on Robert's/Tom's patch, but instead 
    just added a quick & dirty FunctionCallInvoke-overhead-skipping version 
    of btint4cmp. I believe that aspect of this patch it's similar in 
    performance characteristics, though.
    
    In my quick tests, it gets quite close in performance to your inlining 
    patch, when sorting int4s and the input contains no NULLs. But please 
    give it a try in your test environment, to get numbers comparable with 
    your other tests.
    
    Perhaps you can get even more gain by adding the no-NULLs, and asc/desc 
    special cases to your inlining patch, too, but let's squeeze all the fat 
    out of the non-inlined version first. One nice aspect of many of these 
    non-inlining optimizations is that they help with external sorts, too.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
  78. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-12-05T17:10:05Z

    On 5 December 2011 13:23, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    > I'm still not convinced the big gain is in inlining the comparison
    > functions. Your patch contains a few orthogonal tricks, too:
    >
    > * A fastpath for the case of just one scankey. No reason we can't do that
    > without inlining.
    
    True, though Tom did seem to not like that idea very much.
    
    > * inlining the swap-functions within qsort. Thaẗ́'s different from inlining
    > the comparison functions, and could be done regardless of the data type. The
    > qsort element size in tuplesort.c is always sizeof(SortTuple)
    
    Sure. I think that Tom mostly objects to hard-coded intelligence about
    which datatypes are used, rather than specialisations per se.
    
    > And there's some additional specializations we can do, not implemented in
    > your patch, that don't depend on inlining the comparison functions:
    >
    > * A fastpath for the case of no NULLs
    > * separate comparetup functions for ascending and descending sorts (this
    > allows tail call elimination of the call to type-specific comparison
    > function in the ascending version)
    > * call CHECK_FOR_INTERRUPTS() less frequently.
    
    All of those had occurred to me, except the last - if you look at the
    definition of that macro, it looks like more trouble than it's worth
    to do less frequently. I didn't revise my patch with them though,
    because the difference that they made, while clearly measurable,
    seemed fairly small, or at least relatively so. I wasn't about to add
    additional kludge for marginal benefits given the existing
    controversy, though I have not dismissed the idea entirely - it could
    be important on some other machine.
    
    > Perhaps you can get even more gain by adding the no-NULLs, and asc/desc
    > special cases to your inlining patch, too, but let's squeeze all the fat out
    > of the non-inlined version first.
    
    As I said, those things are simple enough to test, and were not found
    to make a significant difference, at least to my exact test case +
    hardware.
    
    > One nice aspect of many of these
    > non-inlining optimizations is that they help with external sorts, too.
    
    I'd expect the benefits to be quite a lot smaller there, but fair point.
    
    Results from running the same test on your patch are attached. I think
    that while you're right to suggest that the inlining of the comparator
    proper, rather than any other function or other optimisation isn't
    playing a dominant role in these optimisations, I think that you're
    underestimating the role of locality of reference. To illustrate this,
    I've also included results for when I simply move the comparator to
    another translation unit, logtape.c . There's clearly a regression
    from doing so (I ran the numbers twice, in a clinical environment).
    The point is, there is a basically unknown overhead paid for not
    inlining - maybe inlining is not worth it, but that's a hard call to
    make.
    
    I didn't try to make the numbers look worse by putting the comparator
    proper in some other translation unit, but it may be that I'd have
    shown considerably worse numbers by doing so.
    
    Why the strong aversion to what I've done? I accept that it's ugly,
    but is it really worth worrying about that sort of regression in
    maintainability for something that was basically untouched since 2006,
    and will probably remain untouched after this work concludes, whatever
    happens?
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
  79. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-05T17:53:23Z

    Peter Geoghegan <peter@2ndquadrant.com> writes:
    > Why the strong aversion to what I've done? I accept that it's ugly,
    > but is it really worth worrying about that sort of regression in
    > maintainability for something that was basically untouched since 2006,
    > and will probably remain untouched after this work concludes, whatever
    > happens?
    
    Maintainability is only part of the issue --- though it's definitely one
    part, since this code has hardly gone "untouched since 2006", cf
    http://git.postgresql.org/gitweb/?p=postgresql.git;a=history;f=src/backend/utils/sort/tuplesort.c;hb=HEAD
    
    What is bothering me is that this approach is going to cause substantial
    bloat of the executable code, and such bloat has got distributed costs,
    which we don't have any really good way to measure but for sure
    micro-benchmarks addressing only sort speed aren't going to reveal them.
    Cache lines filled with sort code take cycles to flush and replace with
    something else.
    
    I think it's possibly reasonable to have inlined copies of qsort for a
    small number of datatypes, but it seems much less reasonable to have
    multiple copies per datatype in order to obtain progressively tinier
    micro-benchmark speedups.  We need to figure out what the tradeoff
    against executable size really is, but so far it seems like you've been
    ignoring the fact that there is any such tradeoff at all.
    
    			regards, tom lane
    
    
  80. Re: Inlining comparators as a performance optimisation

    Nicolas Barbier <nicolas.barbier@gmail.com> — 2011-12-06T10:40:23Z

    2011/12/5 Tom Lane <tgl@sss.pgh.pa.us>:
    
    > What is bothering me is that this approach is going to cause substantial
    > bloat of the executable code, and such bloat has got distributed costs,
    > which we don't have any really good way to measure but for sure
    > micro-benchmarks addressing only sort speed aren't going to reveal them.
    > Cache lines filled with sort code take cycles to flush and replace with
    > something else.
    >
    > I think it's possibly reasonable to have inlined copies of qsort for a
    > small number of datatypes, but it seems much less reasonable to have
    > multiple copies per datatype in order to obtain progressively tinier
    > micro-benchmark speedups.  We need to figure out what the tradeoff
    > against executable size really is, but so far it seems like you've been
    > ignoring the fact that there is any such tradeoff at all.
    
    [ Randomly spouting ideas here: ]
    
    Might it not be a good idea to decide whether to use the inlined
    copies vs. the normal version, based on how much data to sort? Surely
    for a very large sort, the cache blow-out doesn't matter that much
    relative to the amount of time that can be saved sorting?
    
    Assuming that all types would have an inlined sort function, although
    that would indeed result in a larger binary, most of that binary would
    never touch the cache, because corresponding large sorts are never
    performed. If they would sporadically occur (and assuming the points
    at which inline sorting starts to get used are chosen wisely), the
    possibly resulting cache blow-out would be a net win.
    
    I am also assuming here that instruction cache lines are small enough
    for case line aliasing not to become a problem; putting all sort
    functions next to each other in the binary (so that they don't alias
    with the rest of the backend code that might be used more often) might
    alleviate that.
    
    Nicolas
    
    -- 
    A. Because it breaks the logical sequence of discussion.
    Q. Why is top posting bad?
    
    
  81. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-12-06T16:49:03Z

    On Sun, Dec 4, 2011 at 2:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > * I invented a "SortKey" struct that replaces ScanKey for tuplesort's
    > purposes.  Right now that's local in tuplesort.c, but we're more than
    > likely going to need it elsewhere as well.  Should we just define it
    > in sortsupport.h?  Or perhaps we should just add the few additional
    > fields to SortSupportInfoData, and not bother with two struct types?
    > Before you answer, consider the next point.
    
    +1 for not bothering with two struct types.  We might want to consider
    calling the resulting structure SortKey rather than
    SortSupportInfoData, however.
    
    > * I wonder whether it would be worthwhile to elide inlineApplyComparator
    > altogether, pushing what it does down to the level of the
    > datatype-specific functions.  That would require changing the
    > "comparator" API to include isnull flags, and exposing the
    > reverse/nulls_first sort flags to the comparators (presumably by
    > including them in SortSupportInfoData).  The main way in which that
    > could be a win would be if the setup function could choose one of four
    > comparator functions that are pre-specialized for each flags
    > combination; but that seems like it would add a lot of code bulk, and
    > the bigger problem is that we need to be able to change the flags after
    > sort initialization (cf. the reversedirection code in tuplesort.c), so
    > we'd also need some kind of "re-select the comparator" call API.  On the
    > whole this doesn't seem promising, but maybe somebody else has a
    > different idea.
    
    I thought about this, too, but it didn't seem promising to me, either.
    
    > * We're going to want to expose PrepareSortSupportComparisonShim
    > for use outside tuplesort.c too, and possibly refactor
    > tuplesort_begin_heap so that the SortKey setup logic inside it
    > can be extracted for use elsewhere.  Shall we just add those to
    > tuplesort's API, or would it be better to create a sortsupport.c
    > with these sorts of functions?
    
    Why are we going to want to do that?  If it's because there are other
    places in the code that can make use of a fast comparator that don't
    go through tuplesort.c, then we should probably break it off into a
    separate file (sortkey.c?).  But if it's because we think that clients
    of the tuplesort code are going to need it for some reason, then we
    may as well keep it in tuplesort.c.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  82. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-06T17:06:09Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > On Sun, Dec 4, 2011 at 2:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> * We're going to want to expose PrepareSortSupportComparisonShim
    >> for use outside tuplesort.c too, and possibly refactor
    >> tuplesort_begin_heap so that the SortKey setup logic inside it
    >> can be extracted for use elsewhere.  Shall we just add those to
    >> tuplesort's API, or would it be better to create a sortsupport.c
    >> with these sorts of functions?
    
    > Why are we going to want to do that?  If it's because there are other
    > places in the code that can make use of a fast comparator that don't
    > go through tuplesort.c, then we should probably break it off into a
    > separate file (sortkey.c?).  But if it's because we think that clients
    > of the tuplesort code are going to need it for some reason, then we
    > may as well keep it in tuplesort.c.
    
    My expectation is that nbtree, as well as mergejoin and mergeappend,
    would get converted over to use the fast comparator API.  I looked at
    that a little bit but didn't push it far enough to be very sure about
    whether they'd be able to share the initialization code from
    tuplesort_begin_heap.  But they're definitely going to need the shim
    function for backwards compatibility, and
    PrepareSortSupportComparisonShim was my first cut at a wrapper that
    would be generally useful.
    
    			regards, tom lane
    
    
  83. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-12-06T17:48:35Z

    On Tue, Dec 6, 2011 at 12:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Robert Haas <robertmhaas@gmail.com> writes:
    >> On Sun, Dec 4, 2011 at 2:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >>> * We're going to want to expose PrepareSortSupportComparisonShim
    >>> for use outside tuplesort.c too, and possibly refactor
    >>> tuplesort_begin_heap so that the SortKey setup logic inside it
    >>> can be extracted for use elsewhere.  Shall we just add those to
    >>> tuplesort's API, or would it be better to create a sortsupport.c
    >>> with these sorts of functions?
    >
    >> Why are we going to want to do that?  If it's because there are other
    >> places in the code that can make use of a fast comparator that don't
    >> go through tuplesort.c, then we should probably break it off into a
    >> separate file (sortkey.c?).  But if it's because we think that clients
    >> of the tuplesort code are going to need it for some reason, then we
    >> may as well keep it in tuplesort.c.
    >
    > My expectation is that nbtree, as well as mergejoin and mergeappend,
    > would get converted over to use the fast comparator API.  I looked at
    > that a little bit but didn't push it far enough to be very sure about
    > whether they'd be able to share the initialization code from
    > tuplesort_begin_heap.  But they're definitely going to need the shim
    > function for backwards compatibility, and
    > PrepareSortSupportComparisonShim was my first cut at a wrapper that
    > would be generally useful.
    
    OK.  Well, then pushing it out to a separate file probably makes
    sense.  Do you want to do that or shall I have a crack at it?  If the
    latter, what do you think about using the name SortKey for everything
    rather than SortSupport?
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  84. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-06T18:07:08Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > OK.  Well, then pushing it out to a separate file probably makes
    > sense.  Do you want to do that or shall I have a crack at it?  If the
    > latter, what do you think about using the name SortKey for everything
    > rather than SortSupport?
    
    I'll take another crack at it.  I'm not entirely sold yet on merging
    the two structs; I think first we'd better look and see what the needs
    are in the other potential callers I mentioned.  If we'd end up
    cluttering the struct with half a dozen weird fields, it'd be better to
    stick to a minimal interface struct with various wrapper structs, IMO.
    
    OTOH it did seem that the names were getting a bit long.  If we do
    keep the two-struct-levels approach, what do you think of
    s/SortSupportInfo/SortSupport/g ?
    
    			regards, tom lane
    
    
  85. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-12-06T18:08:29Z

    On Tue, Dec 6, 2011 at 1:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Robert Haas <robertmhaas@gmail.com> writes:
    >> OK.  Well, then pushing it out to a separate file probably makes
    >> sense.  Do you want to do that or shall I have a crack at it?  If the
    >> latter, what do you think about using the name SortKey for everything
    >> rather than SortSupport?
    >
    > I'll take another crack at it.  I'm not entirely sold yet on merging
    > the two structs; I think first we'd better look and see what the needs
    > are in the other potential callers I mentioned.  If we'd end up
    > cluttering the struct with half a dozen weird fields, it'd be better to
    > stick to a minimal interface struct with various wrapper structs, IMO.
    
    OK.  I'll defer to whatever you come up with after looking at it.
    
    > OTOH it did seem that the names were getting a bit long.  If we do
    > keep the two-struct-levels approach, what do you think of
    > s/SortSupportInfo/SortSupport/g ?
    
    +1.  I had that thought when you originally suggested that name, but
    it didn't seem worth arguing about.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  86. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-06T21:23:17Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > On Tue, Dec 6, 2011 at 1:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> I'll take another crack at it.  I'm not entirely sold yet on merging
    >> the two structs; I think first we'd better look and see what the needs
    >> are in the other potential callers I mentioned.  If we'd end up
    >> cluttering the struct with half a dozen weird fields, it'd be better to
    >> stick to a minimal interface struct with various wrapper structs, IMO.
    
    > OK.  I'll defer to whatever you come up with after looking at it.
    
    OK, it looks like nodeMergeAppend.c could use something exactly like the
    draft SortKey struct, while nodeMergejoin.c could embed such a struct in
    MergeJoinClauseData.  The btree stuff needs something more nearly
    equivalent to a ScanKey, including a datum-to-compare-to and a flags
    field.  I'm inclined to think the latter would be too specialized to put
    in the generic struct.  On the other hand, including the reverse and
    nulls_first flags in the generic struct is clearly a win since it allows
    ApplyComparator() to be defined as a generic function.  So the only
    thing that's really debatable is the attno field, and I'm not anal
    enough to insist on a separate level of struct just for that.
    
    I am however inclined to stick with the shortened struct name SortSupport
    rather than using SortKey.  The presence of the function pointer fields
    (especially the inlined-qsort pointers, assuming we adopt some form of
    Peter's patch) changes the struct's nature in my view; it's not really
    describing just a sort key (ie an ORDER BY column specification).
    
    			regards, tom lane
    
    
  87. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-12-07T00:18:24Z

    On Tue, Dec 6, 2011 at 4:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Robert Haas <robertmhaas@gmail.com> writes:
    >> On Tue, Dec 6, 2011 at 1:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >>> I'll take another crack at it.  I'm not entirely sold yet on merging
    >>> the two structs; I think first we'd better look and see what the needs
    >>> are in the other potential callers I mentioned.  If we'd end up
    >>> cluttering the struct with half a dozen weird fields, it'd be better to
    >>> stick to a minimal interface struct with various wrapper structs, IMO.
    >
    >> OK.  I'll defer to whatever you come up with after looking at it.
    >
    > OK, it looks like nodeMergeAppend.c could use something exactly like the
    > draft SortKey struct, while nodeMergejoin.c could embed such a struct in
    > MergeJoinClauseData.  The btree stuff needs something more nearly
    > equivalent to a ScanKey, including a datum-to-compare-to and a flags
    > field.  I'm inclined to think the latter would be too specialized to put
    > in the generic struct.  On the other hand, including the reverse and
    > nulls_first flags in the generic struct is clearly a win since it allows
    > ApplyComparator() to be defined as a generic function.  So the only
    > thing that's really debatable is the attno field, and I'm not anal
    > enough to insist on a separate level of struct just for that.
    >
    > I am however inclined to stick with the shortened struct name SortSupport
    > rather than using SortKey.  The presence of the function pointer fields
    > (especially the inlined-qsort pointers, assuming we adopt some form of
    > Peter's patch) changes the struct's nature in my view; it's not really
    > describing just a sort key (ie an ORDER BY column specification).
    
    Works for me.  I think we should go ahead and get this part committed
    first, and then we can look at the inlining stuff as a further
    optimization for certain cases...
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  88. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-12-07T01:13:59Z

    On 7 December 2011 00:18, Robert Haas <robertmhaas@gmail.com> wrote:
    > Works for me.  I think we should go ahead and get this part committed
    > first, and then we can look at the inlining stuff as a further
    > optimization for certain cases...
    
    Do you mean just inlining, or inlining and the numerous other
    optimisations that my patch had?
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
    
  89. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-07T01:46:27Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > Works for me.  I think we should go ahead and get this part committed
    > first, and then we can look at the inlining stuff as a further
    > optimization for certain cases...
    
    Yeah.  Attached is a second-draft patch, which I'm fairly happy with
    except that the documentation hasn't been updated.  It extends the
    changes into the executor as well as analyze.c, with the result that
    we can get rid of some of the old infrastructure altogether.
    (inlineApplySortFunction is still there for the moment, though.)
    Also, I adopted a style similar to what we've done for inlined list
    functions to make ApplyComparatorFunction inline-able by all callers.
    
    There are three somewhat-independent lines of work to pursue from here:
    
    1. Adding sortsupport infrastructure for more datatypes.
    2. Revising nbtree and related code to use this infrastructure.
    3. Integrating Peter's work into this framework.
    
    I'll try to take care of #1 for at least a few key datatypes before
    I commit, but I think #2 is best done as a separate patch, so I'll
    postpone that till later.
    
    			regards, tom lane
    
    
  90. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-12-07T03:45:20Z

    On Tue, Dec 6, 2011 at 8:13 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
    > On 7 December 2011 00:18, Robert Haas <robertmhaas@gmail.com> wrote:
    >> Works for me.  I think we should go ahead and get this part committed
    >> first, and then we can look at the inlining stuff as a further
    >> optimization for certain cases...
    >
    > Do you mean just inlining, or inlining and the numerous other
    > optimisations that my patch had?
    
    Whichever you like.  But I think part of the point here is to
    disentangle those optimizations from each other and decide how broadly
    it makes sense to apply each one.  Avoiding the FunctionCallInfo stuff
    is one, and it seems like we can apply that to a wide variety of data
    types (maybe all of them) for both in-memory and on-disk sorting, plus
    btree index ops, merge joins, and merge append.  The gains may be
    modest, but they will benefit many use cases.  Your original patch
    targets a much narrower use case (in-memory sorting of POD types) but
    the benefits are larger.  We don't have to pick between a general but
    small optimization and a narrower but larger one; we can do both.
    
    In this regard, I think Heikki's remarks upthread are worth some
    thought.  If inlining is a win just because it avoids saving and
    restoring registers or allows better instruction scheduling, then
    inlining is the (probably?) the only way to get the benefit.  But if
    most of the benefit is in having a separate path for the
    single-sort-key case, we can do that without duplicating the qsort()
    code and get the benefit for every data type without much code bloat.
    I'd like to see us dig into that a little, so that we get the broadest
    possible benefit out of this work.  It doesn't bother me that not
    every optimization will apply to every case, and I don't object to
    optimizations that are intrinsically narrow (within some reasonable
    limits).  But I'd rather not take what could be a fairly broad-based
    optimization and apply it only narrowly, all things being equal.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  91. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-12-07T11:57:08Z

    On 7 December 2011 03:45, Robert Haas <robertmhaas@gmail.com> wrote:
    > In this regard, I think Heikki's remarks upthread are worth some
    > thought.  If inlining is a win just because it avoids saving and
    > restoring registers or allows better instruction scheduling, then
    > inlining is the (probably?) the only way to get the benefit.  But if
    > most of the benefit is in having a separate path for the
    > single-sort-key case, we can do that without duplicating the qsort()
    > code and get the benefit for every data type without much code bloat.
    > I'd like to see us dig into that a little, so that we get the broadest
    > possible benefit out of this work.  It doesn't bother me that not
    > every optimization will apply to every case, and I don't object to
    > optimizations that are intrinsically narrow (within some reasonable
    > limits).  But I'd rather not take what could be a fairly broad-based
    > optimization and apply it only narrowly, all things being equal.
    
    I think we're in agreement then.
    
    It's important to realise, though I'm sure you do, that once you've
    done what I did with sorting single scanKey integers, that's it -
    there isn't really any way that I can think of to speed up quick
    sorting further, other than parallelism which has major challenges,
    and isn't particularly appropriate at this granularity. The real
    significance of inlining is, well, that's it, that's all there is -
    after inlining, I predict that the well will run dry, or practically
    dry. Inlining actually is a pretty great win when you look at sorting
    in isolation, but it's just that there's been a number of other
    significant wins in my patch, and of course there is overhead from a
    number of other areas, so perhaps it's easy to lose sight of that.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
    
  92. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-12-07T14:39:54Z

    On Tue, Dec 6, 2011 at 8:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > 1. Adding sortsupport infrastructure for more datatypes.
    > 2. Revising nbtree and related code to use this infrastructure.
    > 3. Integrating Peter's work into this framework.
    >
    > I'll try to take care of #1 for at least a few key datatypes before
    > I commit, but I think #2 is best done as a separate patch, so I'll
    > postpone that till later.
    
    I see you've committed a chunk of this now.  Does it make sense to do
    #1 for every data type we support, or should we be more selective than
    that?  My gut feeling would be to add it across the board and
    introduce an opr_sanity check for it.  The general utility of adding
    it to deprecated types like abstime is perhaps questionable, but it
    strikes me that the value of making everything consistent probably
    outweighs the cost of a few extra lines of code.
    
    Are you planning to do anything about #2 or #3?
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  93. Re: Inlining comparators as a performance optimisation

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-12-07T15:09:32Z

    Robert Haas <robertmhaas@gmail.com> writes:
    > On Tue, Dec 6, 2011 at 8:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> 1. Adding sortsupport infrastructure for more datatypes.
    >> 2. Revising nbtree and related code to use this infrastructure.
    >> 3. Integrating Peter's work into this framework.
    >> 
    >> I'll try to take care of #1 for at least a few key datatypes before
    >> I commit, but I think #2 is best done as a separate patch, so I'll
    >> postpone that till later.
    
    > I see you've committed a chunk of this now.  Does it make sense to do
    > #1 for every data type we support, or should we be more selective than
    > that?
    
    Basically, I tried to do #1 for every datatype for which the comparator
    was cheap enough that reducing the call overhead seemed likely to make a
    useful difference.  I'm not in favor of adding sortsupport functions
    where this is not true, as I think it'll be useless code and catalog
    bloat.  I don't want to add 'em for cruft like abstime either.
    
    There's some stuff that's debatable according to this criterion --- in
    particular, I wondered whether it'd be worth having a fast path for
    bttextcmp, especially if we pre-tested the collate_is_c condition and
    had a separate version that just hardwired the memcmp code path.  (The
    idea of doing that was one reason I insisted on collation being known at
    the setup step.)  But it would still have to be prepared for detoasting,
    so in the end I was unenthused.  Anyone who feels like testing could try
    to prove me wrong about it though.
    
    > Are you planning to do anything about #2 or #3?
    
    I am willing to do #2, but not right now; I feel what I need to do next
    is go review SPGist.  I don't believe that #2 blocks progress on #3
    anyway.  I think #3 is in Peter's court, or yours if you want to do it.
    
    (BTW, I agree with your comments yesterday about trying to break down
    the different aspects of what Peter did, and put as many of them as we
    can into the non-inlined code paths.)
    
    			regards, tom lane
    
    
  94. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-12-07T15:15:35Z

    On Wed, Dec 7, 2011 at 10:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > There's some stuff that's debatable according to this criterion --- in
    > particular, I wondered whether it'd be worth having a fast path for
    > bttextcmp, especially if we pre-tested the collate_is_c condition and
    > had a separate version that just hardwired the memcmp code path.  (The
    > idea of doing that was one reason I insisted on collation being known at
    > the setup step.)  But it would still have to be prepared for detoasting,
    > so in the end I was unenthused.  Anyone who feels like testing could try
    > to prove me wrong about it though.
    
    I think that'd definitely be worth investigating (although I'm not
    sure I have the time to do it myself any time real soon).
    
    >> Are you planning to do anything about #2 or #3?
    >
    > I am willing to do #2, but not right now; I feel what I need to do next
    > is go review SPGist.
    
    Yeah, makes sense.  That one seems likely to be a challenge to absorb.
    
    > I don't believe that #2 blocks progress on #3
    > anyway.  I think #3 is in Peter's court, or yours if you want to do it.
    >
    > (BTW, I agree with your comments yesterday about trying to break down
    > the different aspects of what Peter did, and put as many of them as we
    > can into the non-inlined code paths.)
    
    Cool.  Peter, can you rebase your patch and integrate it into the
    sortsupport framework that's now committed?
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  95. Re: Inlining comparators as a performance optimisation

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-12-07T15:58:03Z

    On 7 December 2011 15:15, Robert Haas <robertmhaas@gmail.com> wrote:
    > On Wed, Dec 7, 2011 at 10:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> But it would still have to be prepared for detoasting,
    >> so in the end I was unenthused.  Anyone who feels like testing could try
    >> to prove me wrong about it though.
    >
    > I think that'd definitely be worth investigating (although I'm not
    > sure I have the time to do it myself any time real soon).
    
    I'll at least take a look at it. Sorting text is a fairly common case.
    I'm not hugely enthused about spending too much time on work that will
    only be useful if collate_is_c.
    
    >> I don't believe that #2 blocks progress on #3
    >> anyway.  I think #3 is in Peter's court, or yours if you want to do it.
    >>
    >> (BTW, I agree with your comments yesterday about trying to break down
    >> the different aspects of what Peter did, and put as many of them as we
    >> can into the non-inlined code paths.)
    
    I'm confident that we should have everything for the simple case of
    ordering by a single int4 and int8 column, and I think you'd probably
    agree with that - they're extremely common cases. Anything beyond that
    will need to be justified, probably in part by running additional
    benchmarks.
    
    > Cool.  Peter, can you rebase your patch and integrate it into the
    > sortsupport framework that's now committed?
    
    Yes, I'd be happy to, though I don't think I'm going to be getting
    around to it this side of Friday. Since it isn't a blocker, I assume
    that's okay.
    
    The rebased revision will come complete with a well thought-out
    rationale for my use of inlining specialisations, that takes account
    of the trade-off against binary bloat that Tom highlighted. I wasn't
    ignoring that issue, but I did fail to articulate my thoughts there,
    mostly because I felt the need to do some additional research to
    justify my position.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
    
  96. Re: Inlining comparators as a performance optimisation

    Robert Haas <robertmhaas@gmail.com> — 2011-12-07T16:22:30Z

    On Wed, Dec 7, 2011 at 10:58 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
    > On 7 December 2011 15:15, Robert Haas <robertmhaas@gmail.com> wrote:
    >> On Wed, Dec 7, 2011 at 10:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >>> But it would still have to be prepared for detoasting,
    >>> so in the end I was unenthused.  Anyone who feels like testing could try
    >>> to prove me wrong about it though.
    >>
    >> I think that'd definitely be worth investigating (although I'm not
    >> sure I have the time to do it myself any time real soon).
    >
    > I'll at least take a look at it. Sorting text is a fairly common case.
    > I'm not hugely enthused about spending too much time on work that will
    > only be useful if collate_is_c.
    
    Well, text_pattern_ops is not exactly an uncommon case, but sure.  And
    there might be some benefit for other collations, too.
    
    >> Cool.  Peter, can you rebase your patch and integrate it into the
    >> sortsupport framework that's now committed?
    >
    > Yes, I'd be happy to, though I don't think I'm going to be getting
    > around to it this side of Friday. Since it isn't a blocker, I assume
    > that's okay.
    
    Sounds fine to me.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  97. Re: Inlining comparators as a performance optimisation

    Greg Smith <greg@2ndquadrant.com> — 2011-12-16T19:55:12Z

    I think we can call a new sorting infrastructure popping out and what 
    looks to be over 90 messages on this topic as successful progress on 
    this front.  Peter's going to rev a new patch, but with more performance 
    results to review and followup discussion I can't see this one as 
    wrapping for the current CommitFest.  Marking it returned and we'll 
    return to this topic during or before the next CF.  With the majority of 
    the comments here coming from committers, I think continuing progress in 
    that direction won't be a problem.
    
    -- 
    Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
    PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
    
    
    
  98. Re: Inlining comparators as a performance optimisation

    Pierre C <lists@peufeu.com> — 2012-01-13T09:48:56Z

    On Wed, 21 Sep 2011 18:13:07 +0200, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    
    > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
    >> On 21.09.2011 18:46, Tom Lane wrote:
    >>> The idea that I was toying with was to allow the regular SQL-callable
    >>> comparison function to somehow return a function pointer to the
    >>> alternate comparison function,
    >
    >> You could have a new function with a pg_proc entry, that just returns a
    >> function pointer to the qsort-callback.
    >
    > Yeah, possibly.  That would be a much more invasive change, but cleaner
    > in some sense.  I'm not really prepared to do all the legwork involved
    > in that just to get to a performance-testable patch though.
    
    A few years ago I had looked for a way to speed up COPY operations, and it  
    turned out that COPY TO has a good optimization opportunity. At that time,  
    for each datum, COPY TO would :
    
    - test for nullness
    - call an outfunc through fmgr
    - outfunc pallocs() a bytea or text, fills it with data, and returns it  
    (sometimes it uses an extensible string buffer which may be repalloc()d  
    several times)
    - COPY memcpy()s returned data to a buffer and eventually flushes the  
    buffer to client socket.
    
    I introduced a special write buffer with an on-flush callback (ie, a close  
    relative of the existing string-buffer), in this case the callback was  
    "flush to client socket", and several outfuncs (one per type) which took  
    that buffer as argument, besides the datum to output, and simply put the  
    datum inside the buffer, with appropriate transformations (like converting  
    to bytea or text), and flushed if needed.
    
    Then the COPY TO BINARY of a constant-size datum would turn to :
    - one test for nullness
    - one C function call
    - one test to ensure appropriate space available in buffer (flush if  
    needed)
    - one htonl() and memcpy of constant size, which the compiler turns out  
    into a couple of simple instructions
    
    I recall measuring speedups of 2x - 8x on COPY BINARY, less for text, but  
    still large gains.
    
    Although eliminating fmgr call and palloc overhead was an important part  
    of it, another large part was getting rid of memcpy()'s which the compiler  
    turned into simple movs for known-size types, a transformation that can be  
    done only if the buffer write functions are inlined inside the outfuncs.  
    Compilers love constants...
    
    Additionnally, code size growth was minimal since I moved the old outfuncs  
    code into the new outfuncs, and replaced the old fmgr-callable outfuncs  
    with "create buffer with on-full callback=extend_and_repalloc() - pass to  
    new outfunc(buffer,datum) - return buffer". Which is basically equivalent  
    to the previous palloc()-based code, maybe with a few extra instructions.
    
    When I submitted the patch for review, Tom rightfully pointed out that my  
    way of obtaining the C function pointer sucked very badly (I don't  
    remember how I did it, only that it was butt-ugly) but the idea was to get  
    a quick measurement of what could be gained, and the result was positive.  
    Unfortunately I had no time available to finish it and make it into a real  
    patch, I'm sorry about that.
    
    So why do I post in this sorting topic ? It seems, by bypassing fmgr for  
    functions which are small, simple, and called lots of times, there is a  
    large gain to be made, not only because of fmgr overhead but also because  
    of the opportunity for new compiler optimizations, palloc removal, etc.  
    However, in my experiment the arguments and return types of the new  
    functions were DIFFERENT from the old functions : the new ones do the same  
    thing, but in a different manner. One manner was suited to sql-callable  
    functions (ie, palloc and return a bytea) and another one to writing large  
    amounts of data (direct buffer write). Since both have very different  
    requirements, being fast at both is impossible for the same function.
    
    Anyway, all that rant boils down to :
    
    Some functions could benefit having two versions (while sharing almost all  
    the code between them) :
    - User-callable (fmgr) version (current one)
    - C-callable version, usually with different parameters and return type
    
    And it would be cool to have a way to grab a bare function pointer on the  
    second one.
    
    Maybe an extra column in pg_proc would do (but then, the proargtypes and  
    friends would describe only the sql-callable version) ?
    Or an extra table ? pg_cproc ?
    Or an in-memory hash : hashtable[ fmgr-callable function ] => C version
    - What happens if a C function has no SQL-callable equivalent ?
    Or (ugly) introduce an extra per-type function type_get_function_ptr(  
    function_kind ) which returns the requested function ptr
    
    If one of those happens, I'll dust off my old copy-optimization patch ;)
    
    Hmm... just my 2c
    
    Regards
    Pierre
    
    
  99. Re: Inlining comparators as a performance optimisation

    Bruce Momjian <bruce@momjian.us> — 2012-01-28T14:55:04Z

    On Fri, Jan 13, 2012 at 10:48:56AM +0100, Pierre C wrote:
    > Maybe an extra column in pg_proc would do (but then, the proargtypes
    > and friends would describe only the sql-callable version) ?
    > Or an extra table ? pg_cproc ?
    > Or an in-memory hash : hashtable[ fmgr-callable function ] => C version
    > - What happens if a C function has no SQL-callable equivalent ?
    > Or (ugly) introduce an extra per-type function
    > type_get_function_ptr( function_kind ) which returns the requested
    > function ptr
    > 
    > If one of those happens, I'll dust off my old copy-optimization patch ;)
    
    I agree that COPY is ripe for optimization, and I am excited you have
    some ideas on this.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +