Thread

  1. WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-03T11:02:49Z

    Hackers,
    
    WIP patch of fast GiST index build is attached. Code is dirty and comments
    are lacking, but it works. Now it is ready for first benchmarks, which
    should prove efficiency of selected technique. It's time to compare fast
    GiST index build with repeat insert build on large enough datasets (datasets
    which don't fit to cache). There are following aims of testing:
    1) Measure acceleration of index build.
    2) Measure change in index quality.
    I'm going to do first testing using synthetic datasets. Everybody who have
    interesting real-life datasets for testing are welcome.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  2. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-06-04T15:13:10Z

    On 03.06.2011 14:02, Alexander Korotkov wrote:
    > Hackers,
    >
    > WIP patch of fast GiST index build is attached. Code is dirty and comments
    > are lacking, but it works. Now it is ready for first benchmarks, which
    > should prove efficiency of selected technique. It's time to compare fast
    > GiST index build with repeat insert build on large enough datasets (datasets
    > which don't fit to cache). There are following aims of testing:
    > 1) Measure acceleration of index build.
    > 2) Measure change in index quality.
    > I'm going to do first testing using synthetic datasets. Everybody who have
    > interesting real-life datasets for testing are welcome.
    
    I did some quick performance testing of this. I installed postgis 1.5, 
    and loaded an extract of the OpenStreetMap data covering Finland. The 
    biggest gist index in that data set is the idx_nodes_geom index on nodes 
    table. I have maintenance_work_mem and shared_buffers both set to 512 
    MB, and this laptop has 4GB of RAM.
    
    Without the patch, reindexing the index takes about 170 seconds and the 
    index size is 321 MB. And with the patch, it takes about 150 seconds, 
    and the resulting index size is 319 MB.
    
    The nodes table is 618MB in size, so it fits in RAM. I presume the gain 
    would be bigger if it doesn't, as the random I/O to update the index 
    starts to hurt more. But this shows that even when it does, this patch 
    helps a little bit, and the resulting index size is comparable.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  3. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-06-06T07:42:15Z

    On 03.06.2011 14:02, Alexander Korotkov wrote:
    > Hackers,
    >
    > WIP patch of fast GiST index build is attached. Code is dirty and comments
    > are lacking, but it works. Now it is ready for first benchmarks, which
    > should prove efficiency of selected technique. It's time to compare fast
    > GiST index build with repeat insert build on large enough datasets (datasets
    > which don't fit to cache). There are following aims of testing:
    > 1) Measure acceleration of index build.
    > 2) Measure change in index quality.
    > I'm going to do first testing using synthetic datasets. Everybody who have
    > interesting real-life datasets for testing are welcome.
    
    I ran another test with a simple table generated with:
    
    CREATE TABLE pointtest (p point);
    INSERT INTO pointtest SELECT point(random(), random()) FROM 
    generate_series(1,50000000);
    
    Generating a gist index with:
    
    CREATE INDEX i_pointtest ON pointtest USING gist (p);
    
    took about 15 hours without the patch, and 2 hours with it. That's quite 
    dramatic.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  4. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-06-06T10:51:45Z

    On 06.06.2011 10:42, Heikki Linnakangas wrote:
    > On 03.06.2011 14:02, Alexander Korotkov wrote:
    >> Hackers,
    >>
    >> WIP patch of fast GiST index build is attached. Code is dirty and
    >> comments
    >> are lacking, but it works. Now it is ready for first benchmarks, which
    >> should prove efficiency of selected technique. It's time to compare fast
    >> GiST index build with repeat insert build on large enough datasets
    >> (datasets
    >> which don't fit to cache). There are following aims of testing:
    >> 1) Measure acceleration of index build.
    >> 2) Measure change in index quality.
    >> I'm going to do first testing using synthetic datasets. Everybody who
    >> have
    >> interesting real-life datasets for testing are welcome.
    >
    > I ran another test with a simple table generated with:
    >
    > CREATE TABLE pointtest (p point);
    > INSERT INTO pointtest SELECT point(random(), random()) FROM
    > generate_series(1,50000000);
    >
    > Generating a gist index with:
    >
    > CREATE INDEX i_pointtest ON pointtest USING gist (p);
    >
    > took about 15 hours without the patch, and 2 hours with it. That's quite
    > dramatic.
    
    Oops, that was a rounding error, sorry. The run took about 2.7 hours 
    with the patch, which of course should be rounded to 3 hours, not 2. 
    Anyway, it is still a very impressive improvement.
    
    I'm glad you could get the patch ready for benchmarking this quickly. 
    Now you just need to get the patch into shape so that it can be 
    committed. That is always the more time-consuming part, so I'm glad you 
    have plenty of time left for it.
    
    Could you please create a TODO list on the wiki page, listing all the 
    missing features, known bugs etc. that will need to be fixed? That'll 
    make it easier to see how much work there is left. It'll also help 
    anyone looking at the patch to know which issues are known issues.
    
    Meanwhile, it would still be very valuable if others could test this 
    with different workloads. And Alexander, it would be good if at some 
    point you could write some benchmark scripts too, and put them on the 
    wiki page, just to see what kind of workloads have been taken into 
    consideration and tested already. Do you think there's some worst-case 
    data distributions where this algorithm would perform particularly badly?
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  5. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-06T12:00:28Z

    Hi!
    
    On Mon, Jun 6, 2011 at 2:51 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > On 06.06.2011 10:42, Heikki Linnakangas wrote:
    >>
    >> I ran another test with a simple table generated with:
    >>
    >> CREATE TABLE pointtest (p point);
    >> INSERT INTO pointtest SELECT point(random(), random()) FROM
    >> generate_series(1,50000000);
    >>
    >> Generating a gist index with:
    >>
    >> CREATE INDEX i_pointtest ON pointtest USING gist (p);
    >>
    >> took about 15 hours without the patch, and 2 hours with it. That's quite
    >> dramatic.
    >>
    >
    > Oops, that was a rounding error, sorry. The run took about 2.7 hours with
    > the patch, which of course should be rounded to 3 hours, not 2. Anyway, it
    > is still a very impressive improvement.
    >
    I have similar results on 100 millions of rows: 21.6 hours without patch and
    2 hours with patch. But I found a problem: index quality is worse. See
    following query plans. There test is relation where index was created in
    ordinal way, and test2 is relation where patch was used.
    
                                                            QUERY PLAN
    
    ---------------------------------------------------------------------------------------------------------------------------
     Bitmap Heap Scan on test  (cost=4391.01..270397.31 rows=100000 width=20)
    (actual time=1.257..2.147 rows=838 loops=1)
       Recheck Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
       Buffers: shared hit=968
       ->  Bitmap Index Scan on test_idx  (cost=0.00..4366.01 rows=100000
    width=0) (actual time=1.162..1.162 rows=838 loops=1)
             Index Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
             Buffers: shared hit=131
     Total runtime: 2.214 ms
    (7 rows)
    
                                                             QUERY PLAN
    
    ----------------------------------------------------------------------------------------------------------------------------
     Bitmap Heap Scan on test2  (cost=4370.84..270377.13 rows=100000 width=20)
    (actual time=5.252..6.056 rows=838 loops=1)
       Recheck Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
       Buffers: shared hit=1458
       ->  Bitmap Index Scan on test2_idx  (cost=0.00..4345.84 rows=100000
    width=0) (actual time=5.155..5.155 rows=838 loops=1)
             Index Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
             Buffers: shared hit=621
     Total runtime: 6.121 ms
    (7 rows)
    
                                                            QUERY PLAN
    
    ---------------------------------------------------------------------------------------------------------------------------
     Bitmap Heap Scan on test  (cost=4391.01..270397.31 rows=100000 width=20)
    (actual time=2.148..2.977 rows=850 loops=1)
       Recheck Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
       Buffers: shared hit=1099
       ->  Bitmap Index Scan on test_idx  (cost=0.00..4366.01 rows=100000
    width=0) (actual time=2.052..2.052 rows=850 loops=1)
             Index Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
             Buffers: shared hit=249
     Total runtime: 3.033 ms
    (7 rows)
    
                                                             QUERY PLAN
    
    ----------------------------------------------------------------------------------------------------------------------------
     Bitmap Heap Scan on test2  (cost=4370.84..270377.13 rows=100000 width=20)
    (actual time=6.806..7.602 rows=850 loops=1)
       Recheck Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
       Buffers: shared hit=1615
       ->  Bitmap Index Scan on test2_idx  (cost=0.00..4345.84 rows=100000
    width=0) (actual time=6.709..6.709 rows=850 loops=1)
             Index Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
             Buffers: shared hit=773
     Total runtime: 7.667 ms
    (7 rows)
    
    We can see that index scan requires read of several times more
    pages. Original paper denotes such effect. It explains it by the routing
    rectangles in less optimal ways. But this effect wasn't so dramatic in tests
    provided in the paper. So, I have following thoughts about this problem:
    
    1) Number of pages, which was readed from index is too large even with
    ordinal index build. Querying of small area requires read of hundred of
    pages. It probbably caused by picksplit implementation. I've version of
    picksplit algorithm which seems to be much more efficient. I'll do some
    benchmarks with my picksplit algorithm. I hope difference in index quality
    will be not so dramatic.
    
    2) I can try to do some enchancements in fast build alogrithms which could
    improve tree quality. In original paper Hilbert heuristic was used to achive
    even better tree quality than tree which was created in ordinal way. But
    since we use GiST we are restricted by it's interface (or we have to create
    new interface functions(s), but I like to avoid it). I would like to try to
    do some ordering by penalty value in buffer emptying process and buffers
    relocation on split.
    
    3) Probably, there is some bug which affects tree quality.
    
    
    > Could you please create a TODO list on the wiki page, listing all the
    > missing features, known bugs etc. that will need to be fixed? That'll make
    > it easier to see how much work there is left. It'll also help anyone looking
    > at the patch to know which issues are known issues.
    >
    Sure. I'll create such list on wiki page. I believe that currenlty most
    important issue is index quality.
    
    
    > Meanwhile, it would still be very valuable if others could test this with
    > different workloads. And Alexander, it would be good if at some point you
    > could write some benchmark scripts too, and put them on the wiki page, just
    > to see what kind of workloads have been taken into consideration and tested
    > already. Do you think there's some worst-case data distributions where this
    > algorithm would perform particularly badly?
    
    I don't expect some bad cases in terms in IO. My most worrying is about
    index quality which is strongly related to data distribution.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  6. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-06T12:14:07Z

    On Mon, Jun 6, 2011 at 2:51 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > Do you think there's some worst-case data distributions where this
    > algorithm would perform particularly badly?
    >
    I think there could be some worst-case GiST applications. Just now gist fast
    build algorithm invokes more penalty calls than repeatable insert algorithm.
    If I succeed then it will invoke even more such calls. So, if penalty
    function is very slow then gist fast build will be slover then
    repeatable insert.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  7. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-06T12:16:47Z

    On Mon, Jun 6, 2011 at 4:14 PM, Alexander Korotkov <aekorotkov@gmail.com>wrote:
    
    > If I succeed then it will invoke even more such calls.
    >
    I meant here that if I succeed in enhancements which improve index quality
    then fast build algorithm will invoke even more such calls.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  8. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-15T07:21:10Z

    I've tried index tuples sorting on penalty function before buffer relocation
    on split. But it was without any success. Index quality becomes even worse
    than without sorting.
    The next thing I've tried is buffer relocation between all neighbor buffers.
    Results of first tests is much more promising. Number of page accesses
    during index scan is similar to those without fast index build. I'm going to
    hold on this approach.
    
    test=# create index test_idx on test using gist(v);
    NOTICE:  Level step = 1, pagesPerBuffer = 406
    CREATE INDEX
    Time: 10002590,469 ms
    
    test=# select pg_size_pretty(pg_relation_size('test_idx'));
     pg_size_pretty
    ----------------
     6939 MB
    (1 row)
    
    test=# explain (analyze, buffers) select * from test where v <@
    '(0.903,0.203),(0.9,0.2)'::box;
                                                            QUERY PLAN
    
    ---------------------------------------------------------------------------------------------------------------------------
     Bitmap Heap Scan on test  (cost=4366.78..258752.22 rows=100000 width=16)
    (actual time=1.412..2.295 rows=897 loops=1)
       Recheck Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
       Buffers: shared hit=1038
       ->  Bitmap Index Scan on test_idx  (cost=0.00..4341.78 rows=100000
    width=0) (actual time=1.311..1.311 rows=897 loops=1)
             Index Cond: (v <@ '(0.903,0.203),(0.9,0.2)'::box)
             Buffers: shared hit=141
     Total runtime: 2.375 ms
    (7 rows)
    
    test=# explain (analyze, buffers) select * from test where v <@
    '(0.503,0.503),(0.5,0.5)'::box;
    
                                                            QUERY PLAN
    
    ---------------------------------------------------------------------------------------------------------------------------
     Bitmap Heap Scan on test  (cost=4366.78..258752.22 rows=100000 width=16)
    (actual time=2.113..2.972 rows=855 loops=1)
       Recheck Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
       Buffers: shared hit=1095
       ->  Bitmap Index Scan on test_idx  (cost=0.00..4341.78 rows=100000
    width=0) (actual time=2.016..2.016 rows=855 loops=1)
             Index Cond: (v <@ '(0.503,0.503),(0.5,0.5)'::box)
             Buffers: shared hit=240
     Total runtime: 3.043 ms
    (7 rows)
    
    
    ------
    With best regards,
    Alexander Korotkov.
    
  9. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-15T07:24:51Z

    On Wed, Jun 15, 2011 at 11:21 AM, Alexander Korotkov
    <aekorotkov@gmail.com>wrote:
    
    > I've tried index tuples sorting on penalty function before buffer
    > relocation on split. But it was without any success. Index quality becomes
    > even worse than without sorting.
    > The next thing I've tried is buffer relocation between all neighbor
    > buffers. Results of first tests is much more promising. Number of page
    > accesses during index scan is similar to those without fast index build. I'm
    > going to hold on this approach.
    >
    > test=# create index test_idx on test using gist(v);
    > NOTICE:  Level step = 1, pagesPerBuffer = 406
    > CREATE INDEX
    > Time: 10002590,469 ms
    >
    I forget to say that build time increases in about 40%, but it is still
    faster than ordinal build in about 10 times.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  10. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-06-15T08:03:59Z

    On 15.06.2011 10:24, Alexander Korotkov wrote:
    > On Wed, Jun 15, 2011 at 11:21 AM, Alexander Korotkov
    > <aekorotkov@gmail.com>wrote:
    >
    >> I've tried index tuples sorting on penalty function before buffer
    >> relocation on split. But it was without any success. Index quality becomes
    >> even worse than without sorting.
    >> The next thing I've tried is buffer relocation between all neighbor
    >> buffers. Results of first tests is much more promising. Number of page
    >> accesses during index scan is similar to those without fast index build. I'm
    >> going to hold on this approach.
    >>
    >> test=# create index test_idx on test using gist(v);
    >> NOTICE:  Level step = 1, pagesPerBuffer = 406
    >> CREATE INDEX
    >> Time: 10002590,469 ms
    >>
    > I forget to say that build time increases in about 40%, but it is still
    > faster than ordinal build in about 10 times.
    
    Is this relocation mechanism something that can be tuned, for different 
    tradeoffs between index quality and build time? In any case, it seems 
    that we're going to need a lot of testing with different data sets to 
    get a better picture of how this performs. But at least for now, it 
    looks like this approach is going to be acceptable.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  11. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-15T08:25:51Z

    On Wed, Jun 15, 2011 at 12:03 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > Is this relocation mechanism something that can be tuned, for different
    > tradeoffs between index quality and build time?
    >
    Yes, it can. I believe that it can be index parameter.
    
    > In any case, it seems that we're going to need a lot of testing with
    > different data sets to get a better picture of how this performs.
    >
    Sure. My problem is that I haven't large enough reallife datasets. Picture
    of syntetic datasets can be unrepresentative on reallife cases. On smaller
    datasets that I have I actually can compare only index quality. Also, tests
    with large datasets takes long time especially without fast build. Probably
    solution is to limit cache size during testing. It should allow to measure
    I/O benefit even on relatively small datasets. But while I don't know now to
    do that on Linux.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  12. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-16T09:33:15Z

    Actually, I would like to measure CPU and IO load independently for more
    comprehensive benchmarks. Can you advice me some appropriate tools for it?
    
    ------
    With best regards,
    Alexander Korotkov.
    
  13. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-16T18:13:02Z

    My current idea is to measure number of IO accesses by pg_stat_statements
    and measure CPU usage by /proc/PID/stat. Any thoughts?
    
    ------
    With best regards,
    Alexander Korotkov.
    
    
    On Thu, Jun 16, 2011 at 1:33 PM, Alexander Korotkov <aekorotkov@gmail.com>wrote:
    
    > Actually, I would like to measure CPU and IO load independently for more
    > comprehensive benchmarks. Can you advice me some appropriate tools for it?
    >
    > ------
    > With best regards,
    > Alexander Korotkov.
    >
    
  14. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-06-16T18:26:45Z

    On 16.06.2011 21:13, Alexander Korotkov wrote:
    > My current idea is to measure number of IO accesses by pg_stat_statements
    > and measure CPU usage by /proc/PID/stat. Any thoughts?
    
    Actually, you get both of those very easily with:
    
    set log_statement_stats=on
    
    LOG:  QUERY STATISTICS
    DETAIL:  ! system usage stats:
    	!	0.000990 elapsed 0.000000 user 0.000000 system sec
    	!	[0.000000 user 0.008000 sys total]
    	!	0/0 [32/0] filesystem blocks in/out
    	!	0/0 [0/959] page faults/reclaims, 0 [0] swaps
    	!	0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
    	!	0/0 [10/1] voluntary/involuntary context switches
    STATEMENT:  SELECT generate_series(1,100);
    
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  15. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-16T18:35:23Z

    Oh, actually it's so easy. Thanks.
    
    ------
    With best regards,
    Alexander Korotkov.
    
    On Thu, Jun 16, 2011 at 10:26 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > On 16.06.2011 21:13, Alexander Korotkov wrote:
    >
    >> My current idea is to measure number of IO accesses by pg_stat_statements
    >> and measure CPU usage by /proc/PID/stat. Any thoughts?
    >>
    >
    > Actually, you get both of those very easily with:
    >
    > set log_statement_stats=on
    >
    > LOG:  QUERY STATISTICS
    > DETAIL:  ! system usage stats:
    >        !       0.000990 elapsed 0.000000 user 0.000000 system sec
    >        !       [0.000000 user 0.008000 sys total]
    >        !       0/0 [32/0] filesystem blocks in/out
    >        !       0/0 [0/959] page faults/reclaims, 0 [0] swaps
    >        !       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
    >        !       0/0 [10/1] voluntary/involuntary context switches
    > STATEMENT:  SELECT generate_series(1,100);
    >
    >
    >
    > --
    >  Heikki Linnakangas
    >  EnterpriseDB   http://www.enterprisedb.com
    >
    
  16. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-20T07:46:28Z

    New version of patch. There are some bugfixes, minor refactoring, comments
    (unfortunatelly, not all the code is covered by comments yet). Also
    "fastbuild" parameter was added to the GiST index. It allows to test index
    building with and without fast build without postgres recompile.
    
    ------
    With best regards,
    Alexander Korotkov.
    
    On Thu, Jun 16, 2011 at 10:35 PM, Alexander Korotkov
    <aekorotkov@gmail.com>wrote:
    
    > Oh, actually it's so easy. Thanks.
    >
    > ------
    > With best regards,
    > Alexander Korotkov.
    >
    > On Thu, Jun 16, 2011 at 10:26 PM, Heikki Linnakangas <
    > heikki.linnakangas@enterprisedb.com> wrote:
    >
    >> On 16.06.2011 21:13, Alexander Korotkov wrote:
    >>
    >>> My current idea is to measure number of IO accesses by pg_stat_statements
    >>> and measure CPU usage by /proc/PID/stat. Any thoughts?
    >>>
    >>
    >> Actually, you get both of those very easily with:
    >>
    >> set log_statement_stats=on
    >>
    >> LOG:  QUERY STATISTICS
    >> DETAIL:  ! system usage stats:
    >>        !       0.000990 elapsed 0.000000 user 0.000000 system sec
    >>        !       [0.000000 user 0.008000 sys total]
    >>        !       0/0 [32/0] filesystem blocks in/out
    >>        !       0/0 [0/959] page faults/reclaims, 0 [0] swaps
    >>        !       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
    >>        !       0/0 [10/1] voluntary/involuntary context switches
    >> STATEMENT:  SELECT generate_series(1,100);
    >>
    >>
    >>
    >> --
    >>  Heikki Linnakangas
    >>  EnterpriseDB   http://www.enterprisedb.com
    >>
    >
    >
    
  17. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-21T10:08:25Z

    Hi!
    
    I've created section about testing in project wiki page:
    http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011#Testing_results
    Do you have any notes about table structure?
    As you can see I found that CPU usage might be much higher
    with gist_trgm_ops. I believe it's due to relatively expensive penalty
    method in that opclass. But, probably index build can be still faster when
    index doesn't fit cache even for gist_trgm_ops. Also with that opclass index
    quality is slightly worse but the difference is not dramatic.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  18. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-06-24T08:40:08Z

    On 21.06.2011 13:08, Alexander Korotkov wrote:
    > I've created section about testing in project wiki page:
    > http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011#Testing_results
    > Do you have any notes about table structure?
    
    It would be nice to have links to the datasets and scripts used, so that 
    others can reproduce the tests.
    
    It's surprising that the search time differs so much between the 
    point_ops tests with uniformly random data with 100M and 10M rows. Just 
    to be sure I'm reading it correctly: a small search time is good, right? 
    You might want to spell that out explicitly.
    
    > As you can see I found that CPU usage might be much higher
    > with gist_trgm_ops.
    
    Yeah, that is a bit worrysome. 6 minutes without the patch and 18 
    minutes with it.
    
    > I believe it's due to relatively expensive penalty
    > method in that opclass.
    
    Hmm, I wonder if it could be optimized. I did a quick test, creating a 
    gist_trgm_ops index on a list of English words from 
    /usr/share/dict/words. oprofile shows that with the patch, 60% of the 
    CPU time is spent in the makesign() function.
    
    > But, probably index build can be still faster when
    > index doesn't fit cache even for gist_trgm_ops.
    
    Yep.
    
    > Also with that opclass index
    > quality is slightly worse but the difference is not dramatic.
    
    5-10% difference should be acceptable
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  19. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-24T09:59:16Z

    On Fri, Jun 24, 2011 at 12:40 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > On 21.06.2011 13:08, Alexander Korotkov wrote:
    >
    >> I've created section about testing in project wiki page:
    >> http://wiki.postgresql.org/**wiki/Fast_GiST_index_build_**
    >> GSoC_2011#Testing_results<http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011#Testing_results>
    >> Do you have any notes about table structure?
    >>
    >
    > It would be nice to have links to the datasets and scripts used, so that
    > others can reproduce the tests.
    >
    Done.
    
    
    > It's surprising that the search time differs so much between the point_ops
    > tests with uniformly random data with 100M and 10M rows. Just to be sure I'm
    > reading it correctly: a small search time is good, right? You might want to
    > spell that out explicitly.
    
    Yes, you're reading this correctly. Detailed explanation was added to the
    wiki page. It's surprising for me too. I need some more insight into causes
    of index quality difference.
    
    Now I found some large enough real-life datasets (thanks to Oleg Bartunov)
    and I'm performing tests on them.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  20. Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-06-24T16:51:37Z

    On 24.06.2011 11:40, Heikki Linnakangas wrote:
    > On 21.06.2011 13:08, Alexander Korotkov wrote:
    >> I believe it's due to relatively expensive penalty
    >> method in that opclass.
    >
    > Hmm, I wonder if it could be optimized. I did a quick test, creating a
    > gist_trgm_ops index on a list of English words from
    > /usr/share/dict/words. oprofile shows that with the patch, 60% of the
    > CPU time is spent in the makesign() function.
    
    I couldn't resist looking into this, and came up with the attached 
    patch. I tested this with:
    
    CREATE TABLE words (word text);
    COPY words FROM '/usr/share/dict/words';
    CREATE INDEX i_words ON words USING gist (word gist_trgm_ops);
    
    And then ran "REINDEX INDEX i_words" a few times with and without the 
    patch. Without the patch, reindex takes about 4.7 seconds. With the 
    patch, 3.7 seconds. That's a worthwhile gain on its own, but becomes 
    even more important with Alexander's fast GiST build patch, which calls 
    the penalty function more.
    
    I used the attached showsign-debug.patch to verify that the patched 
    makesign function produces the same results as the existing code. I 
    haven't tested the big-endian code, however.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
  21. Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

    Robert Haas <robertmhaas@gmail.com> — 2011-06-24T18:24:21Z

    On Fri, Jun 24, 2011 at 12:51 PM, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    > On 24.06.2011 11:40, Heikki Linnakangas wrote:
    >>
    >> On 21.06.2011 13:08, Alexander Korotkov wrote:
    >>>
    >>> I believe it's due to relatively expensive penalty
    >>> method in that opclass.
    >>
    >> Hmm, I wonder if it could be optimized. I did a quick test, creating a
    >> gist_trgm_ops index on a list of English words from
    >> /usr/share/dict/words. oprofile shows that with the patch, 60% of the
    >> CPU time is spent in the makesign() function.
    >
    > I couldn't resist looking into this, and came up with the attached patch. I
    > tested this with:
    >
    > CREATE TABLE words (word text);
    > COPY words FROM '/usr/share/dict/words';
    > CREATE INDEX i_words ON words USING gist (word gist_trgm_ops);
    >
    > And then ran "REINDEX INDEX i_words" a few times with and without the patch.
    > Without the patch, reindex takes about 4.7 seconds. With the patch, 3.7
    > seconds. That's a worthwhile gain on its own, but becomes even more
    > important with Alexander's fast GiST build patch, which calls the penalty
    > function more.
    >
    > I used the attached showsign-debug.patch to verify that the patched makesign
    > function produces the same results as the existing code. I haven't tested
    > the big-endian code, however.
    
    Out of curiosity (and because there is no comment or Assert here), how
    can you be so sure of the input alignment?
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  22. Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-06-24T19:01:36Z

    On 24.06.2011 21:24, Robert Haas wrote:
    > Out of curiosity (and because there is no comment or Assert here), how
    > can you be so sure of the input alignment?
    
    The input TRGM to makesign() is a varlena, so it must be at least 4-byte 
    aligned. If it was not for some reason, the existing VARSIZE invocation 
    (within GETARR()) would already fail on platforms that are strict about 
    alignment.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  23. Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

    Robert Haas <robertmhaas@gmail.com> — 2011-06-24T19:22:15Z

    On Fri, Jun 24, 2011 at 3:01 PM, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    > On 24.06.2011 21:24, Robert Haas wrote:
    >>
    >> Out of curiosity (and because there is no comment or Assert here), how
    >> can you be so sure of the input alignment?
    >
    > The input TRGM to makesign() is a varlena, so it must be at least 4-byte
    > aligned. If it was not for some reason, the existing VARSIZE invocation
    > (within GETARR()) would already fail on platforms that are strict about
    > alignment.
    
    Hmm, OK.  Might be worth adding a comment, anyway...
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  24. Re: WIP: Fast GiST index build

    Jesper Krogh <jesper@krogh.cc> — 2011-06-25T07:03:12Z

    On 2011-06-06 09:42, Heikki Linnakangas wrote:
    > took about 15 hours without the patch, and 2 hours with it. That's 
    > quite dramatic.
    
    With the precense of robust consumer-class SSD-drives that can be
    found in sizes where they actually can fit "many" database usage
    scenarios. A PostgreSQL version is not likely to hit the streets before
    50% of PostgreSQL users are sitting on "some kind" of flash based
    storage (for the part where the entire dataset doesn't fit in memory
    any more). Point is:
    
    * Wouldn't it be natural to measure the performance benefits of
        disc-bound tests in an SSD setup?
    
    ... my understanding of Fast gi(n|st) index build is that it is
    more or less a challenge to transform a lot of random IO workload
    to be more sequential and collapse multiple changes into fewer.
    
    In terms of random IO an SSD can easily be x100 better than rotating
    drives and it would be a shame to optimize "against" that world?
    
    -- 
    Jesper
    
    
  25. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-25T08:23:13Z

    On Sat, Jun 25, 2011 at 11:03 AM, Jesper Krogh <jesper@krogh.cc> wrote:
    
    > * Wouldn't it be natural to measure the performance benefits of
    >   disc-bound tests in an SSD setup?
    >
    Sure, it would be great to run performance tests on SSD drives too.
    Unfortunately, I don't have corresponding test platform just now.
    
    ... my understanding of Fast gi(n|st) index build is that it is
    > more or less a challenge to transform a lot of random IO workload
    > to be more sequential and collapse multiple changes into fewer.
    >
    The main benefit of proposed algorithm is to greatly reduce number IO
    operations during index build due to dealing with great number of index
    tuples simultaneously. And it also makes some IO more sequential. I haven't
    precise measures yet, but I belive that contribution of making IO more
    sequantial is not very significant.
    
    
    > In terms of random IO an SSD can easily be x100 better than rotating
    > drives and it would be a shame to optimize "against" that world?
    >
    Actually, I'm not sure that IO is bottle neck of GiST index build on SSD
    drives. It's more likely for me that CPU becomes a bottle neck in this case
    and optimizing IO can't give much benefit. But anyway, the value of this
    work can be in producing better index in some cases and SSD drive lifetime
    economy due to less IO operations.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  26. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-06-26T10:18:06Z

    On 25.06.2011 11:23, Alexander Korotkov wrote:
    > On Sat, Jun 25, 2011 at 11:03 AM, Jesper Krogh<jesper@krogh.cc>  wrote:
    >
    >> * Wouldn't it be natural to measure the performance benefits of
    >>    disc-bound tests in an SSD setup?
    >>
    > Sure, it would be great to run performance tests on SSD drives too.
    > Unfortunately, I don't have corresponding test platform just now.
    
    Anyone have an SSD setup to run some quick tests with this?
    
    >> In terms of random IO an SSD can easily be x100 better than rotating
    >> drives and it would be a shame to optimize "against" that world?
    >
    > Actually, I'm not sure that IO is bottle neck of GiST index build on SSD
    > drives. It's more likely for me that CPU becomes a bottle neck in this case
    > and optimizing IO can't give much benefit. But anyway, the value of this
    > work can be in producing better index in some cases and SSD drive lifetime
    > economy due to less IO operations.
    
    Yeah, this patch probably doesn't give much benefit on SSDs, not the 
    order of magnitude improvements it gives on HDDs anyway. I would expect 
    there to still be a small gain, however. If you look at the comparison 
    of CPU times on Alexander's tests, the patch doesn't add that much CPU 
    overhead: about 5% on the point_ops tests. I/O isn't free on SSDs 
    either, so I would expect the patch to buy back that 5% increase in CPU 
    overhead by reduced time spent on I/O even on a SSD.
    
    It's much worse on the gist_trgm_ops test case, so this clearly depends 
    a lot on the opclass, but even that should be possible to optimize quite 
    a bit.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  27. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-27T10:45:34Z

    I've added information about testing on some real-life dataset to wiki page.
    This dataset have a speciality: data is ordered inside it. In this case
    tradeoff was inverse in comparison with expectations about "fast build"
    algrorithm. Index built is longer but index quality is significantly better.
    I think high speed of regular index built is because sequential inserts are
    into near tree parts. That's why number of actual page reads and writes is
    low. The difference in tree quality I can't *convincingly explain now.*
    I've also maked tests with shuffled data of this dataset. In this case
    results was similar to random generated data.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  28. Optimizing box_penalty (Re: WIP: Fast GiST index build)

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-06-27T13:33:04Z

    On 27.06.2011 13:45, Alexander Korotkov wrote:
    > I've added information about testing on some real-life dataset to wiki page.
    > This dataset have a speciality: data is ordered inside it. In this case
    > tradeoff was inverse in comparison with expectations about "fast build"
    > algrorithm. Index built is longer but index quality is significantly better.
    > I think high speed of regular index built is because sequential inserts are
    > into near tree parts. That's why number of actual page reads and writes is
    > low. The difference in tree quality I can't *convincingly explain now.*
    > I've also maked tests with shuffled data of this dataset. In this case
    > results was similar to random generated data.
    
    Hmm, I assume the CPU overhead is coming from the penalty calls in this 
    case too. There's some low-hanging optimization fruit in 
    gist_box_penalty(), see attached patch. I tested this with:
    
    CREATE TABLE points (a point);
    CREATE INDEX i_points ON points using gist (a);
    INSERT INTO points SELECT point(random(), random()) FROM 
    generate_series(1,1000000);
    
    and running "checkpoint; reindex index i_points;" a few times with and 
    without the patch. The patch reduced the runtime from about 17.5 s to 
    15.5 s. oprofile confirms that the time spent in gist_box_penalty() and 
    rt_box_union() is reduced significantly.
    
    This is all without the fast GiST index build patch, so this is 
    worthwhile on its own. If penalty function is called more, then this 
    becomes even more significant.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
  29. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-06-27T14:34:39Z

    On 27.06.2011 13:45, Alexander Korotkov wrote:
    > I've added information about testing on some real-life dataset to wiki page.
    > This dataset have a speciality: data is ordered inside it. In this case
    > tradeoff was inverse in comparison with expectations about "fast build"
    > algrorithm. Index built is longer but index quality is significantly better.
    > I think high speed of regular index built is because sequential inserts are
    > into near tree parts. That's why number of actual page reads and writes is
    > low. The difference in tree quality I can't *convincingly explain now.*
    > I've also maked tests with shuffled data of this dataset. In this case
    > results was similar to random generated data.
    
    Once again, interesting results.
    
    The penalty function is called whenever a tuple is routed to the next 
    level down, and the final tree has the same depth with and without the 
    patch, so I would expect the number of penalty calls to be roughly the 
    same. But clearly there's something wrong with that logic; can you 
    explain in layman's terms why the patch adds so many gist penalty calls? 
    And how many calls does it actually add, can you gather some numbers on 
    that? Any ides on how to mitigate that, or do we just have to live with 
    it? Or maybe use some heuristic to use the existing insertion method 
    when the patch is not expected to be helpful?
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  30. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-27T18:32:47Z

    On Mon, Jun 27, 2011 at 6:34 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > The penalty function is called whenever a tuple is routed to the next level
    > down, and the final tree has the same depth with and without the patch, so I
    > would expect the number of penalty calls to be roughly the same. But clearly
    > there's something wrong with that logic; can you explain in layman's terms
    > why the patch adds so many gist penalty calls? And how many calls does it
    > actually add, can you gather some numbers on that? Any ides on how to
    > mitigate that, or do we just have to live with it? Or maybe use some
    > heuristic to use the existing insertion method when the patch is not
    > expected to be helpful?
    >
    In short due to parralel routing of many index tuples routing can alter. In
    fast build algorithm index tuples are accumulating into node buffers. When
    corresponding node splits we have to repocate index tuples from it. In
    original algorithm we are relocating node buffers into buffers of new nodes
    produced by split. Even this requires additional penalty calls.
    But for improvement of index quality I modified algorithm. With my
    modification index tuple of splitted node buffer can be relocated also into
    other node buffers of same parent. It produces more penalty calls.
    I didn't have an estimate yet, but I'm working on it. Unfortunatelly, I
    haven't any idea about mitigating it except turning off my modification.
    Heuristic is possible, but I feel following problems. At first, we need to
    somehow estimate length of varlena keys. I avoid this estimate in fast
    algorithm itself just assumed worst case, but I believe we need some more
    precise for good heuristic. At second, the right decision is strongly depend
    on concurrent load. When there are no concurrent load (as in my experiments)
    fraction of tree which fits to effective cache is reasonable for estimating
    benefit of IO economy. But with high concurrent load part of cache occupied
    by tree should be considerable smaller than whole effective cache.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  31. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-28T07:47:43Z

    On Mon, Jun 27, 2011 at 10:32 PM, Alexander Korotkov
    <aekorotkov@gmail.com>wrote:
    
    > I didn't have an estimate yet, but I'm working on it.
    >
    
    Now, it seems that I have an estimate.
    N - total number of itups
    B - avg. number of itups in page
    H - height of tree
    K - avg. number of itups fitting in node buffer
    step - level step of buffers
    
    K = 2 * B^step
    avg. number of internal pages with buffers = 2*N/((2*B)^step - 1) (assume
    pages to be half-filled)
    avg. itups in node buffer = K / 2 (assume node buffers to be half-filled)
    Each internal page with buffers can be produces by split of another internal
    page with buffers.
    So, number of additional penalty calls = 2*N/((2*B)^step - 1) * K / 2
    =(approximately)= 2*N*(1/2)^step
    While number of regular penalty calls is H*N
    
    Seems that fraction of additional penalty calls should decrease with
    increase of level step (while I didn't do experiments with level step != 1).
    Also, we can try to broke K = 2 * B^step equation. This can increase number
    of IOs, but decrease number of additional penalty calls and, probably,
    increase tree quality in some cases.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  32. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-06-28T18:53:50Z

    New version of patch. Bug which caused falldown on trees with high number of
    levels was fixed. Also some more comments and refactoring.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  33. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-07-01T07:41:50Z

    New version of patch. Now, it seems that all code is covered by comments.
    Also I'm going to write readme with general description of algorithm. Some
    bugs were fixed.
    More options were added. Options description is below.
    1) fastbuild - whether to fast build algorithm. Default is true.
    2) levelstep - step of levels where buffers exists (if levelstep == 1 then
    there are buffers on each internal page, if levelstep == 2 then buffers are
    only on odd levels and so on). By default it's calculating by
    maintenance_work_mem and indexed types.
    3) buffersize - size of buffers in pages. By default it's calculating
    by levelstep and indexed types.
    4) neighborrelocation - whether to relocate buffer on split also between
    neighbor buffers (my modification for original algorithm). Improves tree
    quality, but produces additional penalty calls. Default is true.
    Varying of this options should allow me to undertand tradeoffs better.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  34. Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-07-02T03:01:46Z

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
    > I couldn't resist looking into this, and came up with the attached 
    > patch. I tested this with:
    
    > CREATE TABLE words (word text);
    > COPY words FROM '/usr/share/dict/words';
    > CREATE INDEX i_words ON words USING gist (word gist_trgm_ops);
    
    > And then ran "REINDEX INDEX i_words" a few times with and without the 
    > patch. Without the patch, reindex takes about 4.7 seconds. With the 
    > patch, 3.7 seconds. That's a worthwhile gain on its own, but becomes 
    > even more important with Alexander's fast GiST build patch, which calls 
    > the penalty function more.
    
    I tested this on HPPA (big-endian, picky about alignment) to verify that
    you had that code path right.  It's right, but on that platform the
    speedup in the "reindex dict/words" test is only about 6%.  I'm afraid
    that the benefit is a lot more machine-specific than we could wish.
    
    I tweaked the patch a bit further (don't pessimize the boundary case
    where there's exactly 4n+1 trigrams, avoid forcing trg1 into memory,
    improve the comment) and attach that version below.  This is a little
    bit faster but I still wonder if it's worth the price of adding an
    obscure dependency on the header size.
    
    > I used the attached showsign-debug.patch to verify that the patched 
    > makesign function produces the same results as the existing code. I 
    > haven't tested the big-endian code, however.
    
    That didn't look terribly convincing.  I added a direct validation that
    old and new code give the same results, a la
    
      	if (ISARRKEY(newval))
      	{
      		BITVEC		sign;
    + 		BITVEC		osign;
      
      		makesign(sign, newval);
    + 		origmakesign(osign, newval);
    + 		Assert(memcmp(sign, osign, sizeof(BITVEC)) == 0);
      
      		if (ISALLTRUE(origval))
      			*penalty = ((float) (SIGLENBIT - sizebitvec(sign))) / (float) (SIGLENBIT + 1);
    
    and ran the regression tests and the dict/words example with that.
    
    			regards, tom lane
    
    
  35. Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-07-02T18:07:24Z

    I wrote:
    > I tweaked the patch a bit further (don't pessimize the boundary case
    > where there's exactly 4n+1 trigrams, avoid forcing trg1 into memory,
    > improve the comment) and attach that version below.  This is a little
    > bit faster but I still wonder if it's worth the price of adding an
    > obscure dependency on the header size.
    
    It occurred to me that we could very easily remove that objection by
    making the code dynamically detect when it's reached a suitably aligned
    trigram.  The attached version of the patch does it that way.  It seems
    to be a percent or so slower than my previous version, but I think that
    making the function robust against header changes is probably well worth
    that price.
    
    BTW, I also tried wrapping the first two loops in an "if (len > 4)"
    test, reasoning that the added complexity is useless unless the main
    loop will be able to iterate at least once, and surely most words are
    less than 15 bytes long.  While this did indeed make back the extra
    percent on my HPPA box, it made things a percent or so slower yet on my
    Intel box with gcc 4.4.5.  I think the compiler must be getting confused
    about what to optimize.  So I left that out of this version of the
    patch, but perhaps it deserves closer investigation.
    
    			regards, tom lane
    
    
  36. Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-07-02T18:37:28Z

    On 02.07.2011 21:07, Tom Lane wrote:
    > I wrote:
    >> I tweaked the patch a bit further (don't pessimize the boundary case
    >> where there's exactly 4n+1 trigrams, avoid forcing trg1 into memory,
    >> improve the comment) and attach that version below.  This is a little
    >> bit faster but I still wonder if it's worth the price of adding an
    >> obscure dependency on the header size.
    >
    > It occurred to me that we could very easily remove that objection by
    > making the code dynamically detect when it's reached a suitably aligned
    > trigram.  The attached version of the patch does it that way.  It seems
    > to be a percent or so slower than my previous version, but I think that
    > making the function robust against header changes is probably well worth
    > that price.
    
    Ah, that opens the door to do something I considered earlier but 
    rejected because of alignment: instead of three 32-bit word fetches, we 
    could fetch one 64-bit word and 32-bit word. Might shave a few more 
    cycles...
    
    Meanwhile, I experimented with optimizing the other part of the loop: 
    the HASH() macros for setting the right bits in the signature. With the 
    default compile-time settings, the signature array is 95 bits. 
    Currently, it's stored in a BITVEC, which is a byte array, but we could 
    store it in two 64-bit integers, which makes it possible to write SETBIT 
    differently. I experimented with a few approaches, first was essentially 
    this:
    
    + /* Set the nth bit in the signature, in s1 or s2 */
    + #define HASH_S(h) \
    + 	do {									\
    + 		unsigned int n = HASHVAL(h);		\
    + 		if (((uint64) (n)) < 64)			\
    + 			s1 |= (uint64) 1<<(n);			\
    + 		else								\
    + 			s2 |= (uint64) 1<<((n) - 64);	\
    + 	} while(0)
    
    That was a bit faster on my x64 laptop, but slightly slower on my ia64 
    HP-UX box. My second try was to use lookup tables, patch attached. That 
    was yet faster on x64, and a small win on the ia64 box too. I'm not sure 
    it's worth the added code complexity, though.
    
    Here's a summary of the timings I got with different versions:
    
    ia64 HP-UX (anole):
    
    unpatched: 11194.038 ms
    fast_makesign-tom: 10064.980 ms
    fast_makesign-2int: 10649.726 ms
    fast_makesign-tbl: 9951.547 ms
    
    x64 laptop:
    
    unpatched: 4595,209 ms
    fast_makesign-tom: 3346,548 ms
    fast_makesign-2int: 3102,874 ms
    fast_makesign-tbl: 2997,854 ms
    
    I used the same "REINDEX INDEX i_words" test I used earlier, repeated 
    each run a couple of times, and took the lowest number. 
    fast_makesign-tom is the first patch you posted, I haven't tested your 
    latest one. fast_makesign-2int is with the HASH_S() macro above, and 
    has_makesign-tbl is with the attached patch.
    
    PS. in case you wonder why the HP-UX box is so much slower than my 
    laptop; this box isn't really meant for performance testing. It's just 
    something I happen to have access to, I think it's a virtual machine of 
    some sort. The numbers are very repeatable, however.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
  37. Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-07-02T18:46:10Z

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
    > Ah, that opens the door to do something I considered earlier but 
    > rejected because of alignment: instead of three 32-bit word fetches, we 
    > could fetch one 64-bit word and 32-bit word. Might shave a few more 
    > cycles...
    
    Hm ... I suspect that might be a small win on natively-64-bit machines,
    but almost certainly a loss on 32-bitters.
    
    > Meanwhile, I experimented with optimizing the other part of the loop: 
    > the HASH() macros for setting the right bits in the signature.
    
    Yeah, I was eyeing that too, but I'm hesitant to hard-wire assumptions
    about the size of the signature.
    
    			regards, tom lane
    
    
  38. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-07-07T10:47:31Z

    New version of patch with readme and some bugfixes.
    Some new tests with fast build parameters variations are on the wiki page.
    While I doubt to interpret some results of tests, because it looks strange
    for me. I particular I can't explain why decrease of buffer size affects
    index quality so much (in my expectation decrease of buffer size should make
    all build parameters closer to regular build). I'm going to recheck my
    experiments, probably I'm missing something.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  39. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-07-08T09:32:39Z

    I found that results of previous tests with USNOA2 were not fully correct.
    Tables for tests with various parameters contained tuples in different
    order. I assumed that query
    CREATE TABLE tab2 AS (SELECT * FROM tab1);
    creates tab2 as exact copy of tab1, i.e. tab2 contain tuples in same order
    as tab1. But actually, it isn't always so. In aggregate with only few used
    test cases it can cause significant error.
    I'm going to make use some more thought-out testing method. Probably, some
    more precise index quality measure exists (even for R-tree).
    
    ------
    With best regards,
    Alexander Korotkov.
    
  40. Re: WIP: Fast GiST index build

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-07-08T14:18:05Z

    Alexander Korotkov <aekorotkov@gmail.com> writes:
    > I found that results of previous tests with USNOA2 were not fully correct.
    > Tables for tests with various parameters contained tuples in different
    > order. I assumed that query
    > CREATE TABLE tab2 AS (SELECT * FROM tab1);
    > creates tab2 as exact copy of tab1, i.e. tab2 contain tuples in same order
    > as tab1. But actually, it isn't always so. In aggregate with only few used
    > test cases it can cause significant error.
    
    For test purposes, you could turn off synchronize_seqscans to prevent
    that.
    
    			regards, tom lane
    
    
  41. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-07-12T07:05:53Z

    On Fri, Jul 8, 2011 at 6:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    
    > For test purposes, you could turn off synchronize_seqscans to prevent
    > that.
    
    
    Thanks, it helps. I'm rerunning tests now.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  42. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-07-12T08:34:09Z

    New version of patch with a little more refactoring and comments.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  43. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-07-13T08:33:17Z

    Hi,
    
    Looking at the performance test results again on the wiki page 
    (http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011#Testing_results), 
    the patch can be summarized like this: it reduces the number of disk 
    accesses, at the cost of some extra CPU work.
    
    Is it possible to switch to the new buffering method in the middle of an 
    index build? We could use the plain insertion method until the index 
    grows to a certain size (effective_cache_size?), and switch to the 
    buffering method after that.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  44. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-07-13T08:40:20Z

    Hi!
    
    On Wed, Jul 13, 2011 at 12:33 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > Is it possible to switch to the new buffering method in the middle of an
    > index build? We could use the plain insertion method until the index grows
    > to a certain size (effective_cache_size?), and switch to the buffering
    > method after that.
    
    Yes, it seems to be possible. It also would be great to somehow detect case
    of ordered data when regular index build performs well.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  45. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-07-13T09:16:56Z

    On Wed, Jul 13, 2011 at 12:40 PM, Alexander Korotkov
    <aekorotkov@gmail.com>wrote:
    
    > On Wed, Jul 13, 2011 at 12:33 PM, Heikki Linnakangas <
    > heikki.linnakangas@enterprisedb.com> wrote:
    >
    >> Is it possible to switch to the new buffering method in the middle of an
    >> index build? We could use the plain insertion method until the index grows
    >> to a certain size (effective_cache_size?), and switch to the buffering
    >> method after that.
    >
    > Yes, it seems to be possible.
    >
    It also gives possibility to get estimate of varlena size by real data
    before start of buffering method using.
    
    
    > It also would be great to somehow detect case of ordered data when regular
    > index build performs well.
    >
    I think this case can be detected by the situation when most part of index
    tuples are inserting into few leaf pages which was recently used.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  46. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-07-13T13:59:39Z

    On 12.07.2011 11:34, Alexander Korotkov wrote:
    > New version of patch with a little more refactoring and comments.
    
    Great! The README helps tremendously to understand this, thanks for that.
    
    One thing that caught my eye is that when you empty a buffer, you load 
    the entire subtree below that buffer, down to the next buffered or leaf 
    level, into memory. Every page in that subtree is kept pinned. That is a 
    problem; in the general case, the buffer manager can only hold a modest 
    number of pages pinned at a time. Consider that the minimum value for 
    shared_buffers is just 16. That's unrealistically low for any real 
    system, but the default is only 32MB, which equals to just 4096 buffers. 
    A subtree could easily be larger than that.
    
    I don't think you're benefiting at all from the buffering that BufFile 
    does for you, since you're reading/writing a full block at a time 
    anyway. You might as well use the file API in fd.c directly, ie. 
    OpenTemporaryFile/FileRead/FileWrite.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  47. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-07-14T08:33:11Z

    On Wed, Jul 13, 2011 at 5:59 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > One thing that caught my eye is that when you empty a buffer, you load the
    > entire subtree below that buffer, down to the next buffered or leaf level,
    > into memory. Every page in that subtree is kept pinned. That is a problem;
    > in the general case, the buffer manager can only hold a modest number of
    > pages pinned at a time. Consider that the minimum value for shared_buffers
    > is just 16. That's unrealistically low for any real system, but the default
    > is only 32MB, which equals to just 4096 buffers. A subtree could easily be
    > larger than that.
    >
    With level step = 1 we need only 2 levels in subtree. With mininun index
    tuple size (12 bytes) each page can have at maximum 675. Thus I think
    default shared_buffers is enough for level step = 1. I believe it's enough
    to add check we have sufficient shared_buffers, isn't it?
    
    
    > I don't think you're benefiting at all from the buffering that BufFile does
    > for you, since you're reading/writing a full block at a time anyway. You
    > might as well use the file API in fd.c directly, ie.
    > OpenTemporaryFile/FileRead/**FileWrite.
    
    BufFile is distributing temporary data through several files. AFAICS
    postgres avoids working with files larger than 1GB. Size of tree buffers can
    easily be greater. Without BufFile I need to maintain set of files manually.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  48. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-07-14T08:42:01Z

    On 14.07.2011 11:33, Alexander Korotkov wrote:
    > On Wed, Jul 13, 2011 at 5:59 PM, Heikki Linnakangas<
    > heikki.linnakangas@enterprisedb.com>  wrote:
    >
    >> One thing that caught my eye is that when you empty a buffer, you load the
    >> entire subtree below that buffer, down to the next buffered or leaf level,
    >> into memory. Every page in that subtree is kept pinned. That is a problem;
    >> in the general case, the buffer manager can only hold a modest number of
    >> pages pinned at a time. Consider that the minimum value for shared_buffers
    >> is just 16. That's unrealistically low for any real system, but the default
    >> is only 32MB, which equals to just 4096 buffers. A subtree could easily be
    >> larger than that.
    >>
    > With level step = 1 we need only 2 levels in subtree. With mininun index
    > tuple size (12 bytes) each page can have at maximum 675. Thus I think
    > default shared_buffers is enough for level step = 1.
    
    Hundreds of buffer pins is still a lot. And with_level_step=2, the 
    number of pins required explodes to 675^2 = 455625.
    
    Pinning a buffer that's already in the shared buffer cache is cheap, I 
    doubt you're gaining much by keeping the private hash table in front of 
    the buffer cache. Also, it's possible that not all of the subtree is 
    actually required during the emptying, so in the worst case pre-loading 
    them is counter-productive.
    
    > I believe it's enough
    > to add check we have sufficient shared_buffers, isn't it?
    
    Well, what do you do if you deem that shared_buffers is too small? Fall 
    back to the old method? Also, shared_buffers is shared by all backends, 
    so you can't assume that you get to use all of it for the index build. 
    You'd need a wide safety margin.
    
    >> I don't think you're benefiting at all from the buffering that BufFile does
    >> for you, since you're reading/writing a full block at a time anyway. You
    >> might as well use the file API in fd.c directly, ie.
    >> OpenTemporaryFile/FileRead/**FileWrite.
    >
    > BufFile is distributing temporary data through several files. AFAICS
    > postgres avoids working with files larger than 1GB. Size of tree buffers can
    > easily be greater. Without BufFile I need to maintain set of files manually.
    
    Ah, I see. Makes sense.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  49. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-07-14T09:41:27Z

    On Thu, Jul 14, 2011 at 12:42 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > Pinning a buffer that's already in the shared buffer cache is cheap, I
    > doubt you're gaining much by keeping the private hash table in front of the
    > buffer cache.
    >
    Yes, I see. Pinning a lot of buffers don't gives singnificant benefits but
    produce a lot of problems. Also removing the hash table can simplify code.
    
    Also, it's possible that not all of the subtree is actually required during
    > the emptying, so in the worst case pre-loading them is counter-productive.
    
    What do you think about pre-fetching all of the subtree? It requires actual
    loading of level_step - 1 levels of it. I some cases it still can be
      counter-productive. But probably it is productive in average?
    
    
    > Well, what do you do if you deem that shared_buffers is too small? Fall
    > back to the old method? Also, shared_buffers is shared by all backends, so
    > you can't assume that you get to use all of it for the index build. You'd
    > need a wide safety margin.
    
    I assumed to check if there are enough of shared_buffers before switching to
    buffering method. But concurent backends makes this method unsafe.
    
    There are other difficulties with concurrent backends: it would be nice
    estimate usage of effective cache by other backeds before switching to
    buffering method. If don't take care about it then we can don't switch to
    buffering method which it can give significant benefit.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  50. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-07-14T20:41:01Z

    Do you think using "rightlink" as pointer to parent page is possible during
    index build? It would allow to simplify code significantly, because of no
    more need to maintain in-memory structures for parents memorizing.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  51. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-07-14T20:53:55Z

    On 14.07.2011 23:41, Alexander Korotkov wrote:
    > Do you think using "rightlink" as pointer to parent page is possible during
    > index build? It would allow to simplify code significantly, because of no
    > more need to maintain in-memory structures for parents memorizing.
    
    I guess, but where do you store the rightlink, then? Would you need a 
    final pass through the index to fix all the rightlinks?
    
    I think you could use the NSN field. It's used to detect concurrent page 
    splits, but those can't happen during index build, so you don't need 
    that field during index build. You just have to make it look like an 
    otherwise illegal NSN, so that it won't be mistaken for a real NSN after 
    the index is built. Maybe add a new flag to mean that the NSN is 
    actually invalid.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  52. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-07-15T10:04:44Z

    Fri, Jul 15, 2011 at 12:53 AM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > On 14.07.2011 23:41, Alexander Korotkov wrote:
    >
    >> Do you think using "rightlink" as pointer to parent page is possible
    >> during
    >> index build? It would allow to simplify code significantly, because of no
    >> more need to maintain in-memory structures for parents memorizing.
    >>
    >
    > I guess, but where do you store the rightlink, then? Would you need a final
    > pass through the index to fix all the rightlinks?
    >
    > I think you could use the NSN field. It's used to detect concurrent page
    > splits, but those can't happen during index build, so you don't need that
    > field during index build. You just have to make it look like an otherwise
    > illegal NSN, so that it won't be mistaken for a real NSN after the index is
    > built. Maybe add a new flag to mean that the NSN is actually invalid.
    
    
    Thank you for advice. But I didn't take into account that in this case I
    need to update parent link in many pages(which might be not in cache) on
    split. Seems that I still need to maintain some in-memory structures for
    parent finding.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  53. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-07-18T18:00:34Z

    Hi!
    
    New version of patch is attached. There are following changes.
    1) Since proposed tchnique is not always a "fast" build, it was renamed
    everywhere in the patch to "buffering" build.
    2) Parameter "buffering" now has 3 possible values "yes", "no" and "auto".
    "auto" means automatic switching from regular index build to buffering one.
    Currently it just switch when index size exceeds maintenance_work_mem.
    3) Holding of many buffers pinned is avoided.
    4) Rebased with head.
    
    TODO:
    1) Take care about ordered datasets in automatic switching.
    2) Take care about concurrent backends in automatic switching.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  54. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-07-22T09:38:35Z

    Hi!
    
    Patch with my try to detect ordered datasets is attached. The implemented
    idea is desribed below.
    Index tuples are divided by chunks of 128. On each chunk we measure how much
    leaf pages where index tuples was inserted don't match those of previous
    chunk. Based on statistics of several chunks we estimate distribution of
    accesses between lead pages (exponential distribution law is accumed and
    it's seems to be an error). After that we can estimate portion of index
    tuples which can be processed without actual IO. If this estimate exceeds
    threshold then we should switch to buffering build.
    Now my implementation successfully detects randomly mixed datasets and well
    ordered datasets, but it's seems to be too optimistic about intermediate
    cases. I believe it's due to wrong assumption about distribution law.
    Do you think this approach is acceptable? Probably there are some researches
    about distribution law for such cases (while I didn't find anything relevant
    in google scholar)?
    As an alternative I can propose take into account actual average IO
    operations per tuple rather then an estimate.
    
    ------
    With best regards,
    Alexander Korotkov.
    
    On Mon, Jul 18, 2011 at 10:00 PM, Alexander Korotkov
    <aekorotkov@gmail.com>wrote:
    
    > Hi!
    >
    > New version of patch is attached. There are following changes.
    > 1) Since proposed tchnique is not always a "fast" build, it was renamed
    > everywhere in the patch to "buffering" build.
    > 2) Parameter "buffering" now has 3 possible values "yes", "no" and "auto".
    > "auto" means automatic switching from regular index build to buffering one.
    > Currently it just switch when index size exceeds maintenance_work_mem.
    > 3) Holding of many buffers pinned is avoided.
    > 4) Rebased with head.
    >
    > TODO:
    > 1) Take care about ordered datasets in automatic switching.
    > 2) Take care about concurrent backends in automatic switching.
    >
    > ------
    > With best regards,
    > Alexander Korotkov.
    >
    
  55. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-07-25T18:52:16Z

    On 22.07.2011 12:38, Alexander Korotkov wrote:
    > Patch with my try to detect ordered datasets is attached. The implemented
    > idea is desribed below.
    > Index tuples are divided by chunks of 128. On each chunk we measure how much
    > leaf pages where index tuples was inserted don't match those of previous
    > chunk. Based on statistics of several chunks we estimate distribution of
    > accesses between lead pages (exponential distribution law is accumed and
    > it's seems to be an error). After that we can estimate portion of index
    > tuples which can be processed without actual IO. If this estimate exceeds
    > threshold then we should switch to buffering build.
    > Now my implementation successfully detects randomly mixed datasets and well
    > ordered datasets, but it's seems to be too optimistic about intermediate
    > cases. I believe it's due to wrong assumption about distribution law.
    > Do you think this approach is acceptable? Probably there are some researches
    > about distribution law for such cases (while I didn't find anything relevant
    > in google scholar)?
    
    Great! It would be nice to find a more scientific approach to this, but 
    that's probably fine for now. It's time to start cleaning up the patch 
    for eventual commit.
    
    You got rid of the extra page pins, which is good, but I wonder why you 
    still pre-create all the GISTLoadedPartItem structs for the whole 
    subtree in loadTreePart() ? Can't you create those structs on-the-fly, 
    when you descend the tree? I understand that it's difficult to update 
    all the parent-pointers as trees are split, but it feels like there's 
    way too much bookkeeping going on. Surely it's possible to simplify it 
    somehow..
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  56. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-07-26T17:24:10Z

    On 25.07.2011 21:52, Heikki Linnakangas wrote:
    > On 22.07.2011 12:38, Alexander Korotkov wrote:
    >> Patch with my try to detect ordered datasets is attached. The implemented
    >> idea is desribed below.
    >> Index tuples are divided by chunks of 128. On each chunk we measure
    >> how much
    >> leaf pages where index tuples was inserted don't match those of previous
    >> chunk. Based on statistics of several chunks we estimate distribution of
    >> accesses between lead pages (exponential distribution law is accumed and
    >> it's seems to be an error). After that we can estimate portion of index
    >> tuples which can be processed without actual IO. If this estimate exceeds
    >> threshold then we should switch to buffering build.
    >> Now my implementation successfully detects randomly mixed datasets and
    >> well
    >> ordered datasets, but it's seems to be too optimistic about intermediate
    >> cases. I believe it's due to wrong assumption about distribution law.
    >> Do you think this approach is acceptable? Probably there are some
    >> researches
    >> about distribution law for such cases (while I didn't find anything
    >> relevant
    >> in google scholar)?
    >
    > Great! It would be nice to find a more scientific approach to this, but
    > that's probably fine for now. It's time to start cleaning up the patch
    > for eventual commit.
    >
    > You got rid of the extra page pins, which is good, but I wonder why you
    > still pre-create all the GISTLoadedPartItem structs for the whole
    > subtree in loadTreePart() ? Can't you create those structs on-the-fly,
    > when you descend the tree? I understand that it's difficult to update
    > all the parent-pointers as trees are split, but it feels like there's
    > way too much bookkeeping going on. Surely it's possible to simplify it
    > somehow..
    
    That was a quite off-the-cuff remark, so I took the patch and culled out 
    loaded-tree bookkeeping. A lot of other changes fell off from that, so 
    it took me quite some time to get it working again, but here it is. This 
    is a *lot* smaller patch, although that's partly explained by the fact 
    that I left out some features: prefetching and the neighbor relocation 
    code is gone.
    
    I'm pretty exhausted by this, so I just wanted to send this out without 
    further analysis. Let me know if you have questions on the approach 
    taken. I'm also not sure how this performs compared to your latest 
    patch, I haven't done any performance testing. Feel free to use this as 
    is, or as a source of inspiration :-).
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
  57. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-07-26T19:34:13Z

    On Tue, Jul 26, 2011 at 9:24 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > That was a quite off-the-cuff remark, so I took the patch and culled out
    > loaded-tree bookkeeping. A lot of other changes fell off from that, so it
    > took me quite some time to get it working again, but here it is. This is a
    > *lot* smaller patch, although that's partly explained by the fact that I
    > left out some features: prefetching and the neighbor relocation code is
    > gone.
    >
    > I'm pretty exhausted by this, so I just wanted to send this out without
    > further analysis. Let me know if you have questions on the approach taken.
    > I'm also not sure how this performs compared to your latest patch, I haven't
    > done any performance testing. Feel free to use this as is, or as a source of
    > inspiration :-).
    
    I also was going to try to evade keeping loaded-tree hash. This might help
    me a lot. Thanks.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  58. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-07-27T12:29:04Z

    I found a problem in WAL with this patch. I use simplified insert algorithm
    in my patch which don't insert downlink one by one but insert them at once.
    Thus FollowRight flag is leaving uncleared when redoing from WAL, because
    only one flag can be cleared by one WAL record. Do you think modification of
    WAL record structure is possible or I have to insert downlink one by one in
    buffering build too?
    
    ------
    With best regards,
    Alexander Korotkov.
    
  59. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-07-27T14:05:50Z

    On 27.07.2011 15:29, Alexander Korotkov wrote:
    > I found a problem in WAL with this patch. I use simplified insert algorithm
    > in my patch which don't insert downlink one by one but insert them at once.
    > Thus FollowRight flag is leaving uncleared when redoing from WAL, because
    > only one flag can be cleared by one WAL record. Do you think modification of
    > WAL record structure is possible or I have to insert downlink one by one in
    > buffering build too?
    
    Dunno, both approaches seem reasonable to me. There's no rule against 
    changing WAL record structure across major releases, if that's what you 
    were worried about.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  60. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-07-27T14:43:26Z

    On Wed, Jul 27, 2011 at 6:05 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > Dunno, both approaches seem reasonable to me. There's no rule against
    > changing WAL record structure across major releases, if that's what you were
    > worried about.
    >
    
    OK, thanks. I also found behaviour of GiST(without patch) with streaming
    replication that seems strange for me. On master there are only few
    rightlinks are InvalidBlockNumber while on slave there are a lot of them. I
    hack gevel for getting index structure on slave (accessing tree without
    AccessExclusiveLock).
    
    On master:
    # create table test as (select point(random(),random()) from
    generate_series(1,100000));
    # create index test_idx on test using gist(point);
    # \copy (select gist_tree('test_idx')) to 'tree1r.txt';
    
    On slave:
    # \copy (select gist_tree('test_idx')) to 'tree2r.txt';
    
    In bash:
    # cat tree1r.txt | sed 's/\\n/\n/g' > tree1.txt
    # cat tree2r.txt | sed 's/\\n/\n/g' > tree2.txt
    # diff tree1.txt tree2.txt
    
    2,89c2,89
    <     1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:637 (OK)
    <         1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%) rightlink:319
    (OK)
    <         2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%) rightlink:153
    (OK)
    <         3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%) rightlink:551
    (OK)
    <         4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%) rightlink:555
    (OK)
    <         5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%) rightlink:74
    (OK)
    <         6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%) rightlink:294
    (OK)
    <         7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%) rightlink:567
    (OK)
    .....
    ---
    >     1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:4294967295
    (InvalidBlockNumber)
    >         1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%)
    rightlink:4294967295 (InvalidBlockNumber)
    >         2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%)
    rightlink:4294967295 (InvalidBlockNumber)
    >         3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%)
    rightlink:4294967295 (InvalidBlockNumber)
    >         4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%)
    rightlink:4294967295 (InvalidBlockNumber)
    >         5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%)
    rightlink:4294967295 (InvalidBlockNumber)
    >         6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%)
    rightlink:4294967295 (InvalidBlockNumber)
    >         7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%)
    rightlink:4294967295 (InvalidBlockNumber)
    .....
    
    Isn't it a bug?
    
    ------
    With best regards,
    Alexander Korotkov.
    
  61. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-07-27T15:30:02Z

    On 27.07.2011 17:43, Alexander Korotkov wrote:
    >>      1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:4294967295
    > (InvalidBlockNumber)
    >>          1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%)
    > rightlink:4294967295 (InvalidBlockNumber)
    >>          2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%)
    > rightlink:4294967295 (InvalidBlockNumber)
    >>          3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%)
    > rightlink:4294967295 (InvalidBlockNumber)
    >>          4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%)
    > rightlink:4294967295 (InvalidBlockNumber)
    >>          5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%)
    > rightlink:4294967295 (InvalidBlockNumber)
    >>          6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%)
    > rightlink:4294967295 (InvalidBlockNumber)
    >>          7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%)
    > rightlink:4294967295 (InvalidBlockNumber)
    > .....
    >
    > Isn't it a bug?
    
    Yeah, looks like a bug. I must've messed up the WAL logging in my recent 
    changes to this. I'll look into that.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  62. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-07-28T20:57:02Z

    On Tue, Jul 26, 2011 at 9:24 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    >
    > Let me know if you have questions on the approach taken.
    
    
    I realized that approach which comes as replace to loaded-subtrees keeping
    is unclear to me. We save paths between node buffers. But those paths can
    become invalid on page splits. It seems to me that approximately same volume
    of code for maintaining parent links should be added to this version of
    patch in order to get it working correctly.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  63. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-07-28T21:10:21Z

    On 28.07.2011 23:57, Alexander Korotkov wrote:
    > On Tue, Jul 26, 2011 at 9:24 PM, Heikki Linnakangas<
    > heikki.linnakangas@enterprisedb.com>  wrote:
    >>
    >> Let me know if you have questions on the approach taken.
    >
    >
    > I realized that approach which comes as replace to loaded-subtrees keeping
    > is unclear to me. We save paths between node buffers. But those paths can
    > become invalid on page splits. It seems to me that approximately same volume
    > of code for maintaining parent links should be added to this version of
    > patch in order to get it working correctly.
    
    gistFindCorrectParent() should take care of that.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  64. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-07-28T22:06:40Z

    On Fri, Jul 29, 2011 at 1:10 AM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > gistFindCorrectParent() should take care of that.
    >
    OK, now it seems that I understood. I need to verify amount memory needed
    for paths because it seems that they tends to accumulate. Also I need to
    verify final emptying, because IO guarantees of original paper is based on
    strict descending final emptying.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  65. Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-01T09:38:13Z

    On 27.07.2011 17:43, Alexander Korotkov wrote:
    > OK, thanks. I also found behaviour of GiST(without patch) with streaming
    > replication that seems strange for me. On master there are only few
    > rightlinks are InvalidBlockNumber while on slave there are a lot of them. I
    > hack gevel for getting index structure on slave (accessing tree without
    > AccessExclusiveLock).
    >
    > On master:
    > # create table test as (select point(random(),random()) from
    > generate_series(1,100000));
    > # create index test_idx on test using gist(point);
    > # \copy (select gist_tree('test_idx')) to 'tree1r.txt';
    >
    > On slave:
    > # \copy (select gist_tree('test_idx')) to 'tree2r.txt';
    >
    > In bash:
    > # cat tree1r.txt | sed 's/\\n/\n/g'>  tree1.txt
    > # cat tree2r.txt | sed 's/\\n/\n/g'>  tree2.txt
    > # diff tree1.txt tree2.txt
    >
    > 2,89c2,89
    > <      1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:637 (OK)
    > <          1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%) rightlink:319
    > (OK)
    > <          2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%) rightlink:153
    > (OK)
    > <          3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%) rightlink:551
    > (OK)
    > <          4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%) rightlink:555
    > (OK)
    > <          5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%) rightlink:74
    > (OK)
    > <          6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%) rightlink:294
    > (OK)
    > <          7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%) rightlink:567
    > (OK)
    > .....
    > ---
    >>      1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:4294967295
    > (InvalidBlockNumber)
    >>          1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%)
    > rightlink:4294967295 (InvalidBlockNumber)
    >>          2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%)
    > rightlink:4294967295 (InvalidBlockNumber)
    >>          3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%)
    > rightlink:4294967295 (InvalidBlockNumber)
    >>          4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%)
    > rightlink:4294967295 (InvalidBlockNumber)
    >>          5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%)
    > rightlink:4294967295 (InvalidBlockNumber)
    >>          6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%)
    > rightlink:4294967295 (InvalidBlockNumber)
    >>          7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%)
    > rightlink:4294967295 (InvalidBlockNumber)
    > .....
    >
    > Isn't it a bug?
    
    Yeah, it sure looks like a bug. I was certain that I had broken this in 
    the recent changes to GiST handling of page splits, but in fact it has 
    been like that forever.
    
    The rightlinks are not needed after crash recovery, because all the 
    downlinks should be there. A scan will find all pages through the 
    downlinks, and doesn't need to follow any rightlinks. I'm not sure why 
    we explicitly clear them, it's not like the rightlinks would do any harm 
    either, but for crash recovery that's harmless.
    
    But a scan during hot standby can see those intermediate states, just 
    like concurrent scans can on the master. The locking on replay of page 
    split needs to be fixed, too. At the moment, it locks and writes out 
    each page separately, so a concurrent scan could "overtake" the WAL 
    replay while following rightlinks, and miss tuples on the right half.
    
    Attached is a patch for that for 9.1/master. The 9.0 GiST replay code 
    was quite different, it will require a separate patch.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
  66. Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

    Simon Riggs <simon@2ndquadrant.com> — 2011-08-01T10:13:54Z

    On Mon, Aug 1, 2011 at 10:38 AM, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    > On 27.07.2011 17:43, Alexander Korotkov wrote:
    >>
    >> OK, thanks. I also found behaviour of GiST(without patch) with streaming
    >> replication that seems strange for me. On master there are only few
    >> rightlinks are InvalidBlockNumber while on slave there are a lot of them.
    >> I
    >> hack gevel for getting index structure on slave (accessing tree without
    >> AccessExclusiveLock).
    >>
    >> On master:
    >> # create table test as (select point(random(),random()) from
    >> generate_series(1,100000));
    >> # create index test_idx on test using gist(point);
    >> # \copy (select gist_tree('test_idx')) to 'tree1r.txt';
    >>
    >> On slave:
    >> # \copy (select gist_tree('test_idx')) to 'tree2r.txt';
    >>
    >> In bash:
    >> # cat tree1r.txt | sed 's/\\n/\n/g'>  tree1.txt
    >> # cat tree2r.txt | sed 's/\\n/\n/g'>  tree2.txt
    >> # diff tree1.txt tree2.txt
    >>
    >> 2,89c2,89
    >> <      1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%) rightlink:637
    >> (OK)
    >> <          1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%) rightlink:319
    >> (OK)
    >> <          2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%) rightlink:153
    >> (OK)
    >> <          3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%) rightlink:551
    >> (OK)
    >> <          4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%) rightlink:555
    >> (OK)
    >> <          5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%) rightlink:74
    >> (OK)
    >> <          6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%) rightlink:294
    >> (OK)
    >> <          7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%) rightlink:567
    >> (OK)
    >> .....
    >> ---
    >>>
    >>>     1(l:1) blk: 324 numTuple: 129 free: 2472b(69.71%)
    >>> rightlink:4294967295
    >>
    >> (InvalidBlockNumber)
    >>>
    >>>         1(l:2) blk: 242 numTuple: 164 free: 932b(88.58%)
    >>
    >> rightlink:4294967295 (InvalidBlockNumber)
    >>>
    >>>         2(l:2) blk: 525 numTuple: 121 free: 2824b(65.39%)
    >>
    >> rightlink:4294967295 (InvalidBlockNumber)
    >>>
    >>>         3(l:2) blk: 70 numTuple: 104 free: 3572b(56.23%)
    >>
    >> rightlink:4294967295 (InvalidBlockNumber)
    >>>
    >>>         4(l:2) blk: 384 numTuple: 106 free: 3484b(57.30%)
    >>
    >> rightlink:4294967295 (InvalidBlockNumber)
    >>>
    >>>         5(l:2) blk: 555 numTuple: 121 free: 2824b(65.39%)
    >>
    >> rightlink:4294967295 (InvalidBlockNumber)
    >>>
    >>>         6(l:2) blk: 564 numTuple: 109 free: 3352b(58.92%)
    >>
    >> rightlink:4294967295 (InvalidBlockNumber)
    >>>
    >>>         7(l:2) blk: 165 numTuple: 108 free: 3396b(58.38%)
    >>
    >> rightlink:4294967295 (InvalidBlockNumber)
    >> .....
    >>
    >> Isn't it a bug?
    >
    > Yeah, it sure looks like a bug. I was certain that I had broken this in the
    > recent changes to GiST handling of page splits, but in fact it has been like
    > that forever.
    >
    > The rightlinks are not needed after crash recovery, because all the
    > downlinks should be there. A scan will find all pages through the downlinks,
    > and doesn't need to follow any rightlinks. I'm not sure why we explicitly
    > clear them, it's not like the rightlinks would do any harm either, but for
    > crash recovery that's harmless.
    >
    > But a scan during hot standby can see those intermediate states, just like
    > concurrent scans can on the master. The locking on replay of page split
    > needs to be fixed, too. At the moment, it locks and writes out each page
    > separately, so a concurrent scan could "overtake" the WAL replay while
    > following rightlinks, and miss tuples on the right half.
    >
    > Attached is a patch for that for 9.1/master. The 9.0 GiST replay code was
    > quite different, it will require a separate patch.
    
    Hmm, I was assured no changes would be required for Hot Standby for
    GIST and GIN. Perhaps we should check GIN code also.
    
    Does the order of locking of the buffers matter? I'm sure it does.
    
    Did you want me to write the patch for 9.0?
    
    And what does NSN stand for? :-)
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  67. Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-01T10:44:23Z

    On 01.08.2011 13:13, Simon Riggs wrote:
    > On Mon, Aug 1, 2011 at 10:38 AM, Heikki Linnakangas
    > <heikki.linnakangas@enterprisedb.com>  wrote:
    >> Attached is a patch for that for 9.1/master. The 9.0 GiST replay code was
    >> quite different, it will require a separate patch.
    >
    > Hmm, I was assured no changes would be required for Hot Standby for
    > GIST and GIN. Perhaps we should check GIN code also.
    
    Yeah, we probably should.
    
    > Does the order of locking of the buffers matter? I'm sure it does.
    
    Yep.
    
    > Did you want me to write the patch for 9.0?
    
    I'm looking at it now.
    
    > And what does NSN stand for? :-)
    
    Hmm, I don't know actually know what NSN is an acronym for :-). But in 
    case you want an explanation of what it does:
    
    The NSN is field in the GiST page header, used to detect concurrent page 
    splits. Whenever a page is split, its NSN is set to the LSN of the page 
    split record. To be precise: the NSN of the resulting left page(s) is 
    set, the resulting rightmost half keeps the NSN of the original page.
    
    When you scan a parent page and decide to move down, it's possible that 
    the child page is split before you read it, but after you read the 
    parent page. So you didn't see the downlink for the right half when you 
    scanned the parent page. To reach the right half, you need to follow the 
    rightlink from the left page, but first you need to detect that the page 
    was split. To do that, when you scan the parent page you remember the 
    LSN on the parent. When you scan the child, you compare the parent's LSN 
    you saw with the NSN of the child. If the child's NSN > parent's LSN, 
    the page was split after you scanned the parent, so you need to follow 
    the rightlink.
    
    The B-tree code has similar move-right logic, but it uses the "high" key 
    on each page to decide when it needs to move right. There's no high key 
    on GiST pages, so we rely on the NSN for that.
    
    In 9.1, I added the F_FOLLOW_RIGHT flag to handle the case of an 
    incomplete split correctly. If the flag is set on a page, its rightlink 
    needs to be followed regardless of the NSN.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  68. Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

    Thom Brown <thom@linux.com> — 2011-08-01T10:47:18Z

    On 1 August 2011 11:44, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    > On 01.08.2011 13:13, Simon Riggs wrote:
    >> And what does NSN stand for? :-)
    >
    > Hmm, I don't know actually know what NSN is an acronym for :-).
    
    Node Sequence Number.
    
    -- 
    Thom Brown
    Twitter: @darkixion
    IRC (freenode): dark_ixion
    Registered Linux user: #516935
    
    EnterpriseDB UK: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  69. Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

    Simon Riggs <simon@2ndquadrant.com> — 2011-08-01T11:25:59Z

    On Mon, Aug 1, 2011 at 11:47 AM, Thom Brown <thom@linux.com> wrote:
    > On 1 August 2011 11:44, Heikki Linnakangas
    > <heikki.linnakangas@enterprisedb.com> wrote:
    >> On 01.08.2011 13:13, Simon Riggs wrote:
    >>> And what does NSN stand for? :-)
    >>
    >> Hmm, I don't know actually know what NSN is an acronym for :-).
    >
    > Node Sequence Number.
    
    Do you have a reference to support that explanation?
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  70. Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

    Thom Brown <thom@linux.com> — 2011-08-01T11:30:18Z

    On 1 August 2011 12:25, Simon Riggs <simon@2ndquadrant.com> wrote:
    > On Mon, Aug 1, 2011 at 11:47 AM, Thom Brown <thom@linux.com> wrote:
    >> On 1 August 2011 11:44, Heikki Linnakangas
    >> <heikki.linnakangas@enterprisedb.com> wrote:
    >>> On 01.08.2011 13:13, Simon Riggs wrote:
    >>>> And what does NSN stand for? :-)
    >>>
    >>> Hmm, I don't know actually know what NSN is an acronym for :-).
    >>
    >> Node Sequence Number.
    >
    > Do you have a reference to support that explanation?
    
    Here's one reference to it:
    http://archives.postgresql.org/pgsql-hackers/2005-06/msg00294.php
    
    -- 
    Thom Brown
    Twitter: @darkixion
    IRC (freenode): dark_ixion
    Registered Linux user: #516935
    
    EnterpriseDB UK: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  71. Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-01T11:32:59Z

    On Mon, Aug 1, 2011 at 3:25 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
    
    > On Mon, Aug 1, 2011 at 11:47 AM, Thom Brown <thom@linux.com> wrote:
    > > On 1 August 2011 11:44, Heikki Linnakangas
    > > <heikki.linnakangas@enterprisedb.com> wrote:
    > >> On 01.08.2011 13:13, Simon Riggs wrote:
    > >>> And what does NSN stand for? :-)
    > >>
    > >> Hmm, I don't know actually know what NSN is an acronym for :-).
    > >
    > > Node Sequence Number.
    >
    > Do you have a reference to support that explanation?
    
    
    See "Access Methods for Next-Generation Database Systems" by Marcel
    Kornacker, Chapter 4 "Concurrency and Recovery for GiSTs".
    http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/access-methods-for-next-generation.pdf.gz
    
    
    ------
    With best regards,
    Alexander Korotkov.
    
  72. Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

    Simon Riggs <simon@2ndquadrant.com> — 2011-08-01T11:35:26Z

    On Mon, Aug 1, 2011 at 11:44 AM, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    
    >> Does the order of locking of the buffers matter? I'm sure it does.
    >
    > Yep.
    
    Do you mean that the BlockNumbers are already in correct sequence, or
    that you will be adding this code to redo?
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  73. Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-01T13:29:42Z

    On 01.08.2011 14:35, Simon Riggs wrote:
    > On Mon, Aug 1, 2011 at 11:44 AM, Heikki Linnakangas
    > <heikki.linnakangas@enterprisedb.com>  wrote:
    >
    >>> Does the order of locking of the buffers matter? I'm sure it does.
    >>
    >> Yep.
    >
    > Do you mean that the BlockNumbers are already in correct sequence, or
    > that you will be adding this code to redo?
    
    I just meant that yes, the order of locking of the buffers does matter.
    
    I believe we code acquire the locks in right order already, and the 
    patch I posted fixes the premature release of locks at page split.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  74. Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

    Simon Riggs <simon@2ndquadrant.com> — 2011-08-01T14:26:57Z

    On Mon, Aug 1, 2011 at 2:29 PM, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    > On 01.08.2011 14:35, Simon Riggs wrote:
    >>
    >> On Mon, Aug 1, 2011 at 11:44 AM, Heikki Linnakangas
    >> <heikki.linnakangas@enterprisedb.com>  wrote:
    >>
    >>>> Does the order of locking of the buffers matter? I'm sure it does.
    >>>
    >>> Yep.
    >>
    >> Do you mean that the BlockNumbers are already in correct sequence, or
    >> that you will be adding this code to redo?
    >
    > I just meant that yes, the order of locking of the buffers does matter.
    >
    > I believe we code acquire the locks in right order already, and the patch I
    > posted fixes the premature release of locks at page split.
    
    Your patch is good, but it does rely on the idea that we're logging
    the blocks in the same order they were originally locked. That's a
    good assumption, but I would like to see that documented for general
    sanity, or just mine at least.
    
    I can't really see anything in the master-side code that attempts to
    lock things in a specific sequence, which bothers me also.
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  75. Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-01T14:34:47Z

    On 01.08.2011 17:26, Simon Riggs wrote:
    > On Mon, Aug 1, 2011 at 2:29 PM, Heikki Linnakangas
    > <heikki.linnakangas@enterprisedb.com>  wrote:
    >> I believe we code acquire the locks in right order already, and the patch I
    >> posted fixes the premature release of locks at page split.
    >
    > Your patch is good, but it does rely on the idea that we're logging
    > the blocks in the same order they were originally locked. That's a
    > good assumption, but I would like to see that documented for general
    > sanity, or just mine at least.
    >
    > I can't really see anything in the master-side code that attempts to
    > lock things in a specific sequence, which bothers me also.
    
    All but the first page are unused pages, grabbed with either P_NEW or 
    from the FSM. gistNewBuffer() uses ConditionalLockBuffer() to guard for 
    the case that someone else chooses the same victim buffer, and picks 
    another page.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  76. Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

    Simon Riggs <simon@2ndquadrant.com> — 2011-08-01T15:25:01Z

    On Mon, Aug 1, 2011 at 3:34 PM, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    > On 01.08.2011 17:26, Simon Riggs wrote:
    >>
    >> On Mon, Aug 1, 2011 at 2:29 PM, Heikki Linnakangas
    >> <heikki.linnakangas@enterprisedb.com>  wrote:
    >>>
    >>> I believe we code acquire the locks in right order already, and the patch
    >>> I
    >>> posted fixes the premature release of locks at page split.
    >>
    >> Your patch is good, but it does rely on the idea that we're logging
    >> the blocks in the same order they were originally locked. That's a
    >> good assumption, but I would like to see that documented for general
    >> sanity, or just mine at least.
    >>
    >> I can't really see anything in the master-side code that attempts to
    >> lock things in a specific sequence, which bothers me also.
    >
    > All but the first page are unused pages, grabbed with either P_NEW or from
    > the FSM. gistNewBuffer() uses ConditionalLockBuffer() to guard for the case
    > that someone else chooses the same victim buffer, and picks another page.
    
    Seems good. Thanks for checking some more for me.
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  77. Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-02T11:03:47Z

    On 01.08.2011 13:44, Heikki Linnakangas wrote:
    > On 01.08.2011 13:13, Simon Riggs wrote:
    >> Did you want me to write the patch for 9.0?
    >
    > I'm looking at it now.
    
    So, in 9.0, we currently leave the rightlink and NSN invalid when 
    replaying a page split. To set them correctly, we'd need the old 
    rightlink and NSN from the page being split, but the WAL record doesn't 
    currently include them. I can see two solutions to this:
    
    1. Add them to the page split WAL record. That's straightforward, but 
    breaks WAL format compatibility with older minor versions.
    
    2. Read the old page version, and copy the rightlink and NSN from there. 
    Since we're restoring what's basically a full-page image of the page 
    after the split, in crash recovery the old contents might be a torn 
    page, or a newer version of the page. I believe that's harmless, because 
    we only care about the NSN and rightlink in hot standby mode, but it's a 
    bit ugly.
    
    If we change the WAL record, we have to make it so that the new version 
    can still read the old format, which complicates the implementation a 
    bit. Neverthelss, I'm leaning towards option 1.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  78. Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

    Simon Riggs <simon@2ndquadrant.com> — 2011-08-02T11:36:30Z

    On Tue, Aug 2, 2011 at 12:03 PM, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    > On 01.08.2011 13:44, Heikki Linnakangas wrote:
    >>
    >> On 01.08.2011 13:13, Simon Riggs wrote:
    >>>
    >>> Did you want me to write the patch for 9.0?
    >>
    >> I'm looking at it now.
    >
    > So, in 9.0, we currently leave the rightlink and NSN invalid when replaying
    > a page split. To set them correctly, we'd need the old rightlink and NSN
    > from the page being split, but the WAL record doesn't currently include
    > them. I can see two solutions to this:
    >
    > 1. Add them to the page split WAL record. That's straightforward, but breaks
    > WAL format compatibility with older minor versions.
    >
    > 2. Read the old page version, and copy the rightlink and NSN from there.
    > Since we're restoring what's basically a full-page image of the page after
    > the split, in crash recovery the old contents might be a torn page, or a
    > newer version of the page. I believe that's harmless, because we only care
    > about the NSN and rightlink in hot standby mode, but it's a bit ugly.
    >
    > If we change the WAL record, we have to make it so that the new version can
    > still read the old format, which complicates the implementation a bit.
    > Neverthelss, I'm leaning towards option 1.
    
    We may as well do (1), with two versions of the WAL record.
    
    Hmm, the biggest issue is actually that existing GIST indexes are
    corrupted, from the perspective of being unusable during HS.
    
    We can fix the cause but that won't repair the existing damage. So the
    requirement is for us to re/create new indexes, which can then use a
    new WAL record format. We probably need to store some information in
    the metapage saying whether or not the index has been maintained only
    with v2 WAL records, or with a mixture of v1 and v2 records. If the
    latter, then issue a WARNING to rebuild the index.
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  79. Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-02T11:43:31Z

    On 02.08.2011 14:36, Simon Riggs wrote:
    > On Tue, Aug 2, 2011 at 12:03 PM, Heikki Linnakangas
    > <heikki.linnakangas@enterprisedb.com>  wrote:
    >> If we change the WAL record, we have to make it so that the new version can
    >> still read the old format, which complicates the implementation a bit.
    >> Neverthelss, I'm leaning towards option 1.
    >
    > We may as well do (1), with two versions of the WAL record.
    
    Actually I think we can append the new information to the end of the 
    page split record, so that an old version server can read WAL generated 
    by new version, too. It just won't set the right link and NSN correctly, 
    so hot standby will be broken like it is today.
    
    > Hmm, the biggest issue is actually that existing GIST indexes are
    > corrupted, from the perspective of being unusable during HS.
    >
    > We can fix the cause but that won't repair the existing damage. So the
    > requirement is for us to re/create new indexes, which can then use a
    > new WAL record format.
    
    No-no, it's not that bad. The right-links and NSNs are only needed to 
    handle scans concurrent with page splits. The existing indexes are fine, 
    you only have a problem if you run queries in hot standby mode, while 
    you replay page splits on it. As soon as you upgrade the master and 
    standby to new minor version with the fix, that will work too.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  80. Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

    Simon Riggs <simon@2ndquadrant.com> — 2011-08-02T12:18:57Z

    On Tue, Aug 2, 2011 at 12:43 PM, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    > On 02.08.2011 14:36, Simon Riggs wrote:
    >>
    >> On Tue, Aug 2, 2011 at 12:03 PM, Heikki Linnakangas
    >> <heikki.linnakangas@enterprisedb.com>  wrote:
    >>>
    >>> If we change the WAL record, we have to make it so that the new version
    >>> can
    >>> still read the old format, which complicates the implementation a bit.
    >>> Neverthelss, I'm leaning towards option 1.
    >>
    >> We may as well do (1), with two versions of the WAL record.
    >
    > Actually I think we can append the new information to the end of the page
    > split record, so that an old version server can read WAL generated by new
    > version, too.
    
    Not sure how that would work. Lengths, CRCs?
    
    Or do you mean we will support 2 versions, have them both called the
    same thing, just resolve which is which by the record length. Don't
    like that.
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  81. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-02T12:33:16Z

    Hi!
    
    I'm now working on adding features to your version of patch. Current version
    is attached. Somehow this version produce huge amount of WAL and that makes
    it slow. Though count and avg. length of WAL records is similar to that of
    non-buffering build.
    
    test=# create table points as (select point(random(),random()) from
    generate_series(1,1000000));
    SELECT 1000000
    test=# select pg_xlogfile_name_offset(pg_current_xlog_location());
           pg_xlogfile_name_offset
    -------------------------------------
     (000000010000004000000073,15005048)
    (1 row)
    
    test=# create index points_idx on points using gist (point) with
    (buffering=off);CREATE INDEX
    test=# select pg_xlogfile_name_offset(pg_current_xlog_location());
           pg_xlogfile_name_offset
    -------------------------------------
     (00000001000000400000007E,13764024)
     (1 row)
    
    test=# create index points_idx2 on points using gist (point) with
    (buffering=on, neighborrelocation=off);
    INFO:  Level step = 1, pagesPerBuffer = 406
    NOTICE:  final emptying
    NOTICE:  final emptying
    NOTICE:  final emptying
    NOTICE:  final emptying
    CREATE INDEX
    test=# select pg_xlogfile_name_offset(pg_current_xlog_location());
           pg_xlogfile_name_offset
    -------------------------------------
     (0000000100000040000000D2,10982288)
    (1 row)
    
    May be you have any ideas about it?
    
    ------
    With best regards,
    Alexander Korotkov.
    
  82. Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-02T15:59:24Z

    On 02.08.2011 15:18, Simon Riggs wrote:
    > On Tue, Aug 2, 2011 at 12:43 PM, Heikki Linnakangas
    > <heikki.linnakangas@enterprisedb.com>  wrote:
    >> On 02.08.2011 14:36, Simon Riggs wrote:
    >> Actually I think we can append the new information to the end of the page
    >> split record, so that an old version server can read WAL generated by new
    >> version, too.
    >
    > Not sure how that would work. Lengths, CRCs?
    >
    > Or do you mean we will support 2 versions, have them both called the
    > same thing, just resolve which is which by the record length. Don't
    > like that.
    
    Here's a patch to do what I meant. The new fields are stored at the very 
    end of the WAL record, and you check the length to see if they're there 
    or not. The nice thing about this is that it's compatible in both 
    directions.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
  83. Re: Hot standby and GiST page splits (was Re: WIP: Fast GiST index build)

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-02T17:13:53Z

    On 02.08.2011 20:06, Alvaro Herrera wrote:
    > Excerpts from Heikki Linnakangas's message of mar ago 02 11:59:24 -0400 2011:
    >> On 02.08.2011 15:18, Simon Riggs wrote:
    >>> On Tue, Aug 2, 2011 at 12:43 PM, Heikki Linnakangas
    >>> <heikki.linnakangas@enterprisedb.com>   wrote:
    >>>> On 02.08.2011 14:36, Simon Riggs wrote:
    >>>> Actually I think we can append the new information to the end of the page
    >>>> split record, so that an old version server can read WAL generated by new
    >>>> version, too.
    >>>
    >>> Not sure how that would work. Lengths, CRCs?
    >>>
    >>> Or do you mean we will support 2 versions, have them both called the
    >>> same thing, just resolve which is which by the record length. Don't
    >>> like that.
    >>
    >> Here's a patch to do what I meant. The new fields are stored at the very
    >> end of the WAL record, and you check the length to see if they're there
    >> or not. The nice thing about this is that it's compatible in both
    >> directions.
    >
    > Err, did you attach the wrong patch?
    
    Yes, sorry about that. Here's the right patch.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
  84. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-03T08:18:49Z

    I found that in previous version of patch I missed PageSetLSN
    and PageSetTLI, but huge amount of WAL is still here. Also I found that huge
    amount of WAL appears only with -O2. With -O0 amount of WAL is ok, but
    messages "FATAL:  xlog flush request BFF11148/809A600 is not satisfied ---
    flushed only to 44/9C518750" appears. Seems that there is some totally wrong
    use of WAL if even optimization level does matter...
    
    ------
    With best regards,
    Alexander Korotkov.
    
  85. Re: WIP: Fast GiST index build

    Robert Haas <robertmhaas@gmail.com> — 2011-08-03T12:59:47Z

    On Wed, Aug 3, 2011 at 4:18 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
    > I found that in previous version of patch I missed PageSetLSN
    > and PageSetTLI, but huge amount of WAL is still here. Also I found that huge
    > amount of WAL appears only with -O2. With -O0 amount of WAL is ok, but
    > messages "FATAL:  xlog flush request BFF11148/809A600 is not satisfied ---
    > flushed only to 44/9C518750" appears. Seems that there is some totally wrong
    > use of WAL if even optimization level does matter...
    
    Try setting wal_debug=true to see what records are getting emitted.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  86. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-03T16:31:20Z

    On 03.08.2011 11:18, Alexander Korotkov wrote:
    > I found that in previous version of patch I missed PageSetLSN
    > and PageSetTLI, but huge amount of WAL is still here. Also I found that huge
    > amount of WAL appears only with -O2. With -O0 amount of WAL is ok, but
    > messages "FATAL:  xlog flush request BFF11148/809A600 is not satisfied ---
    > flushed only to 44/9C518750" appears. Seems that there is some totally wrong
    > use of WAL if even optimization level does matter...
    
    Try this:
    
    diff --git a/src/backend/access/gist/gistbuild.c 
    b/src/backend/access/gist/gistbuild.c
    index 5099330..5a441e0 100644
    --- a/src/backend/access/gist/gistbuild.c
    +++ b/src/backend/access/gist/gistbuild.c
    @@ -478,7 +478,7 @@ bufferingbuildinsert(GISTInsertState *state,
      		/* Write the WAL record */
      		if (RelationNeedsWAL(state->r))
      		{
    -			gistXLogUpdate(state->r->rd_node, buffer, oldoffnum, noldoffnum,
    +			recptr = gistXLogUpdate(state->r->rd_node, buffer, oldoffnum, 
    noldoffnum,
      													itup, ntup,	InvalidBuffer);
      			PageSetLSN(page, recptr);
      			PageSetTLI(page, ThisTimeLineID);
    
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  87. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-04T08:52:52Z

    Uhh, my bad, really stupid bug. Many thanks.
    
    ------
    With best regards,
    Alexander Korotkov.
    
    On Wed, Aug 3, 2011 at 8:31 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > On 03.08.2011 11:18, Alexander Korotkov wrote:
    >
    >> I found that in previous version of patch I missed PageSetLSN
    >> and PageSetTLI, but huge amount of WAL is still here. Also I found that
    >> huge
    >> amount of WAL appears only with -O2. With -O0 amount of WAL is ok, but
    >> messages "FATAL:  xlog flush request BFF11148/809A600 is not satisfied ---
    >> flushed only to 44/9C518750" appears. Seems that there is some totally
    >> wrong
    >> use of WAL if even optimization level does matter...
    >>
    >
    > Try this:
    >
    > diff --git a/src/backend/access/gist/**gistbuild.c
    > b/src/backend/access/gist/**gistbuild.c
    > index 5099330..5a441e0 100644
    > --- a/src/backend/access/gist/**gistbuild.c
    > +++ b/src/backend/access/gist/**gistbuild.c
    > @@ -478,7 +478,7 @@ bufferingbuildinsert(**GISTInsertState *state,
    >                /* Write the WAL record */
    >                if (RelationNeedsWAL(state->r))
    >                {
    > -                       gistXLogUpdate(state->r->rd_**node, buffer,
    > oldoffnum, noldoffnum,
    > +                       recptr = gistXLogUpdate(state->r->rd_**node,
    > buffer, oldoffnum, noldoffnum,
    >
    >                            itup, ntup,     InvalidBuffer);
    >                        PageSetLSN(page, recptr);
    >                        PageSetTLI(page, ThisTimeLineID);
    >
    >
    >
    > --
    >  Heikki Linnakangas
    >  EnterpriseDB   http://www.enterprisedb.com
    >
    
  88. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-07T19:28:10Z

    Hi!
    
    There is last version of patch. There is the list of most significant
    changes in comparison with your version of patch:
    1) Reference counting of path items was added. It should helps against
    possible accumulation of path items.
    2) Neighbor relocation was added.
    3) Subtree prefetching was added.
    4) Final emptying algorithm was reverted to the original one. My experiments
    shows that typical number of passes in final emptying in your version of
    patch is about 5. It may be significant itself. Also I haven't estimate of
    number of passes and haven't guarantees that it will not be high in some
    corner cases. I.e. I prefer more predictable single-pass algorithm in spite
    of it's a little more complex.
    5) Unloading even last page of node buffer from main memory to the disk.
    Imagine that that with levelstep = 1 each inner node has buffer. It seems to
    me that keeping one page of each buffer in memory may be memory consuming.
    
    Open items I see at this moment:
    1) I dislike my switching to buffering build method because it's based on
    very unproven assumptions. And I didn't more reliable assumptions in
    scientific papers while. I would like to replace it with something much
    simplier. For example, switching to buffering build when regular build
    actually starts to produce a lot of IO. For this approach implementation I
    need to somehow detect actual IO (not just buffer read but miss of OS
    cache).
    2) I'm worrying about possible size of nodeBuffersTab and path items. If we
    imagine extremely large tree with levelstep = 1 size of this datastructures
    will be singnificant. And it's hard to predict that size without knowing of
    tree size.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  89. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-08T09:23:41Z

    On 07.08.2011 22:28, Alexander Korotkov wrote:
    > There is last version of patch. There is the list of most significant
    > changes in comparison with your version of patch:
    > 1) Reference counting of path items was added. It should helps against
    > possible accumulation of path items.
    
    Ok.
    
    > 2) Neighbor relocation was added.
    
    Ok. I think we're going to need some sort of a heuristic on when to 
    enable neighbor relocation. If I remember the performance tests 
    correctly, it improves the quality of the resulting index, but incurs a 
    significant CPU overhead.
    
    Actually, looking at the performance numbers on the wiki page again 
    (http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011), it 
    looks like neighbor relocation doesn't help very much with the index 
    quality - sometimes it even results in a slightly worse index. Based on 
    those results, shouldn't we just remove it? Or is there some other data 
    set where it helps significantly?
    
    > 3) Subtree prefetching was added.
    
    I'm inclined to leave out the prefetching code for now. Unless you have 
    some performance numbers that show that it's critical for the overall 
    performance. But I don't think that was the case, it's just an 
    additional optimization for servers with big RAID arrays. So, please 
    separate that into an add-on patch. It needs to be performance tests and 
    reviewed separately.
    
    > 4) Final emptying algorithm was reverted to the original one. My
    > experiments shows that typical number of passes in final emptying in
    > your version of patch is about 5. It may be significant itself. Also I
    > haven't estimate of number of passes and haven't guarantees that it will
    > not be high in some corner cases. I.e. I prefer more predictable
    > single-pass algorithm in spite of it's a little more complex.
    
    I was trying to get rid of that complexity during index build. Some 
    extra code in the final pass would be easier to understand than extra 
    work that needs to be done through the index build. It's not a huge 
    amount of code, but still.
    
    I'm not worried about the extra CPU overhead of scanning the data 
    structures at the final pass. I guess in my patch you had to do extra 
    I/O as well, because the buffers were not emptied in strict top-down 
    order, so let's avoid that. How about:
    
    Track all buffers in the lists, not only those that are non-empty. Add 
    the buffer to the right list at getNodeBuffer(). That way in the final 
    stage, you need to scan through all buffers instead of just the 
    non-empty ones. But the overhead of that to should be minimal in 
    practice, scanning some in-memory data structures is pretty cheap 
    compared to building an index. That way you wouldn't need to maintain 
    the lists during the index build, except for adding each buffer to 
    correct lists in getNodeBuffer().
    
    BTW, please use List for the linked lists. No need to re-implement the 
    wheel.
    
    > 5) Unloading even last page of node buffer from main memory to the disk.
    > Imagine that that with levelstep = 1 each inner node has buffer. It
    > seems to me that keeping one page of each buffer in memory may be memory
    > consuming.
    >
    > Open items I see at this moment:
    > 1) I dislike my switching to buffering build method because it's based
    > on very unproven assumptions. And I didn't more reliable assumptions in
    > scientific papers while. I would like to replace it with something much
    > simplier. For example, switching to buffering build when regular build
    > actually starts to produce a lot of IO. For this approach implementation
    > I need to somehow detect actual IO (not just buffer read but miss of OS
    > cache).
    
    Yeah, that's a surprisingly hard problem. I don't much like the method 
    used in the patch either.
    
    > 2) I'm worrying about possible size of nodeBuffersTab and path items. If
    > we imagine extremely large tree with levelstep = 1 size of this
    > datastructures will be singnificant. And it's hard to predict that size
    > without knowing of tree size.
    
    I'm not very worried about that in practice. If you have a very large 
    index, you presumably have a fair amount of memory too. Otherwise the 
    machine is horrendously underpowered to build or do anything useful with 
    the index anyway. Nevertheless it would nice to have some figures on 
    that. If you have, say, an index of 1 TB in size, how much memory will 
    building the index need?
    
    Miscellaneous observations:
    
    * Please run pgindent over the code, there's a lot of spurious 
    whitespace in the patch.
    * How about renaming GISTLoadedPartItem to something like 
    GISTBulkInsertStack, to resemble the GISTInsertStack struct used in the 
    normal insertion code. The "loaded part" nomenclature is obsolete, as 
    the patch doesn't explicitly load parts of the tree into memory anymore. 
    Think about the names of other structs, variables and functions too, 
    GISTLoadedPartItem just caught my eye first but there's probably others 
    that could have better names.
    * Any user-visible options need to be documented in the user manual.
    * And of course, make sure comments and the readme are up-to-date.
    * Compiler warning:
    
    reloptions.c:259: warning: initializer-string for array of chars is too long
    reloptions.c:259: warning: (near initialization for 
    ‘stringRelOpts[0].default_val’)
    
    I don't think there's a way to add an entry to stringRelOpts in a way 
    that works correctly. That's a design flaw in the reloptions.c code that 
    has never come up before, as there hasn't been any string-formatted 
    relopts before (actually buffering option might be better served by an 
    enum reloption too, if we had that). Please start a new thread on that 
    on pgsql-hackers.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  90. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-08T10:18:35Z

    On Mon, Aug 8, 2011 at 1:23 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    >
    > 2) Neighbor relocation was added.
    >>
    >
    > Ok. I think we're going to need some sort of a heuristic on when to enable
    > neighbor relocation. If I remember the performance tests correctly, it
    > improves the quality of the resulting index, but incurs a significant CPU
    > overhead.
    >
    > Actually, looking at the performance numbers on the wiki page again (
    > http://wiki.postgresql.org/**wiki/Fast_GiST_index_build_**GSoC_2011<http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011>),
    > it looks like neighbor relocation doesn't help very much with the index
    > quality - sometimes it even results in a slightly worse index. Based on
    > those results, shouldn't we just remove it? Or is there some other data set
    > where it helps significantly?
    
    Oh, actually I didn't add some results with neighborrelocation = off. I
    would like to rerun some tests with current version of patch.
    
    
    >  3) Subtree prefetching was added.
    >>
    >
    > I'm inclined to leave out the prefetching code for now. Unless you have
    > some performance numbers that show that it's critical for the overall
    > performance. But I don't think that was the case, it's just an additional
    > optimization for servers with big RAID arrays. So, please separate that into
    > an add-on patch. It needs to be performance tests and reviewed separately.
    
    I though that prefetch helps even on separate hard disks by ordering of IOs.
    
    
    >  4) Final emptying algorithm was reverted to the original one. My
    >> experiments shows that typical number of passes in final emptying in
    >> your version of patch is about 5. It may be significant itself. Also I
    >> haven't estimate of number of passes and haven't guarantees that it will
    >> not be high in some corner cases. I.e. I prefer more predictable
    >> single-pass algorithm in spite of it's a little more complex.
    >>
    >
    > I was trying to get rid of that complexity during index build. Some extra
    > code in the final pass would be easier to understand than extra work that
    > needs to be done through the index build. It's not a huge amount of code,
    > but still.
    >
    > I'm not worried about the extra CPU overhead of scanning the data
    > structures at the final pass. I guess in my patch you had to do extra I/O as
    > well, because the buffers were not emptied in strict top-down order, so
    > let's avoid that. How about:
    >
    > Track all buffers in the lists, not only those that are non-empty. Add the
    > buffer to the right list at getNodeBuffer(). That way in the final stage,
    > you need to scan through all buffers instead of just the non-empty ones. But
    > the overhead of that to should be minimal in practice, scanning some
    > in-memory data structures is pretty cheap compared to building an index.
    > That way you wouldn't need to maintain the lists during the index build,
    > except for adding each buffer to correct lists in getNodeBuffer().
    >
    > BTW, please use List for the linked lists. No need to re-implement the
    > wheel.
    
    Ok.
    
    
    >  5) Unloading even last page of node buffer from main memory to the disk.
    >> Imagine that that with levelstep = 1 each inner node has buffer. It
    >> seems to me that keeping one page of each buffer in memory may be memory
    >> consuming.
    >>
    >> Open items I see at this moment:
    >> 1) I dislike my switching to buffering build method because it's based
    >> on very unproven assumptions. And I didn't more reliable assumptions in
    >> scientific papers while. I would like to replace it with something much
    >> simplier. For example, switching to buffering build when regular build
    >> actually starts to produce a lot of IO. For this approach implementation
    >> I need to somehow detect actual IO (not just buffer read but miss of OS
    >> cache).
    >>
    >
    > Yeah, that's a surprisingly hard problem. I don't much like the method used
    > in the patch either.
    
    Is it possible to make buffering build a user defined option until we have a
    better idea?
    
    
    >  2) I'm worrying about possible size of nodeBuffersTab and path items. If
    >> we imagine extremely large tree with levelstep = 1 size of this
    >> datastructures will be singnificant. And it's hard to predict that size
    >> without knowing of tree size.
    >>
    >
    > I'm not very worried about that in practice. If you have a very large
    > index, you presumably have a fair amount of memory too. Otherwise the
    > machine is horrendously underpowered to build or do anything useful with the
    > index anyway. Nevertheless it would nice to have some figures on that. If
    > you have, say, an index of 1 TB in size, how much memory will building the
    > index need?
    >
    I think with points it would be about 1 million of buffers and about 100-300
    megabytes of RAM depending on space utilization. It may be ok, because 1 TB
    is really huge index. But if maintenance_work_mem is low we can run out of
    it. Though maintenance_work_mem is quite strange for system with 1 TB
    indexes.
    
    
    > Miscellaneous observations:
    >
    > * Please run pgindent over the code, there's a lot of spurious whitespace
    > in the patch.
    > * How about renaming GISTLoadedPartItem to something like
    > GISTBulkInsertStack, to resemble the GISTInsertStack struct used in the
    > normal insertion code. The "loaded part" nomenclature is obsolete, as the
    > patch doesn't explicitly load parts of the tree into memory anymore. Think
    > about the names of other structs, variables and functions too,
    > GISTLoadedPartItem just caught my eye first but there's probably others that
    > could have better names.
    > * Any user-visible options need to be documented in the user manual.
    > * And of course, make sure comments and the readme are up-to-date.
    > * Compiler warning:
    >
    > reloptions.c:259: warning: initializer-string for array of chars is too
    > long
    > reloptions.c:259: warning: (near initialization for
    > ‘stringRelOpts[0].default_val’**)
    >
    > I don't think there's a way to add an entry to stringRelOpts in a way that
    > works correctly. That's a design flaw in the reloptions.c code that has
    > never come up before, as there hasn't been any string-formatted relopts
    > before (actually buffering option might be better served by an enum
    > reloption too, if we had that). Please start a new thread on that on
    > pgsql-hackers.
    
    Ok.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  91. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-10T10:19:13Z

    Hi!
    
    Here is last verion of the patch.
    List of changes:
    1) Neighbor relocation and prefetch were removed. They will be supplied as
    separate patches.
    2) Final emptying now using standart lists of all buffers by levels.
    3) Automatic switching again use simple comparison of index size and
    effective_cache_size.
    4) Some renames. In particular GISTLoadedPartItem
    to GISTBufferingInsertStack.
    5) Some comments were corrected and some were added.
    6) pgindent
    7) rebased with head
    
    Readme update and user documentation coming soon.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  92. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-10T19:44:20Z

    Manual and readme updates.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  93. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-10T19:45:18Z

    On 10.08.2011 13:19, Alexander Korotkov wrote:
    > Hi!
    >
    > Here is last verion of the patch.
    > List of changes:
    > 1) Neighbor relocation and prefetch were removed. They will be supplied as
    > separate patches.
    
    unloadNodeBuffers() is now dead code.
    
    > 2) Final emptying now using standart lists of all buffers by levels.
    > 3) Automatic switching again use simple comparison of index size and
    > effective_cache_size.
    
    LEAF_PAGES_STATS_* are unused now. Should avoid calling smgrnblocks() on 
    every tuple, the overhead of that could add up.
    
    > 4) Some renames. In particular GISTLoadedPartItem
    > to GISTBufferingInsertStack.
    > 5) Some comments were corrected and some were added.
    > 6) pgindent
    > 7) rebased with head
    >
    > Readme update and user documentation coming soon.
    
    I wonder, how hard would it be to merge gistBufferingBuildPlaceToPage() 
    with the gistplacetopage() function used in the main codepath? There's 
    very little difference between them, and it would be nice to maintain 
    just one function. At the very least I think there should be a comment 
    in both along the lines of "NOTE: if you change this function, make sure 
    you update XXXX (the other function) as well!"
    
    In gistbuild(), in the final emptying stage, there's this special-case 
    handling for the root block before looping through the buffers in the 
    buffersOnLevels lists:
    
    > 		for (;;)
    > 		{
    > 			nodeBuffer = getNodeBuffer(gfbb, &buildstate.giststate, GIST_ROOT_BLKNO,
    > 									   InvalidOffsetNumber, NULL, false);
    > 			if (!nodeBuffer || nodeBuffer->blocksCount <= 0)
    > 				break;
    > 			MemoryContextSwitchTo(gfbb->context);
    > 			gfbb->bufferEmptyingStack = lcons(nodeBuffer, gfbb->bufferEmptyingStack);
    > 			MemoryContextSwitchTo(buildstate.tmpCtx);
    > 			processEmptyingStack(&buildstate.giststate, &insertstate);
    > 		}
    
    What's the purpose of that? Wouldn't the loop through buffersOnLevels 
    lists take care of the root node too?
    
    The calculations in initBuffering() desperately need comments. As does 
    the rest of the code too, but the heuristics in that function are 
    particularly hard to understand without some explanation.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  94. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-11T06:21:17Z

    Split of an internal node works like this:
    
    1. Gather all the existing tuples on the page, plus the new tuple being 
    inserted.
    2. Call picksplit on the tuples, to divide them into pages
    3. Go through all tuples on the buffer associated with the page, and 
    divide them into buffers on the new pages. This is done by calling 
    penalty function on each buffered tuple.
    
    I wonder if it would be better for index quality to pass the buffered 
    tuples to picksplit in the 2nd step, so that they too can affect the 
    split decision. Maybe it doesn't make much difference in practice..
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  95. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-11T07:35:46Z

    On Thu, Aug 11, 2011 at 10:21 AM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > Split of an internal node works like this:
    >
    > 1. Gather all the existing tuples on the page, plus the new tuple being
    > inserted.
    > 2. Call picksplit on the tuples, to divide them into pages
    > 3. Go through all tuples on the buffer associated with the page, and divide
    > them into buffers on the new pages. This is done by calling penalty function
    > on each buffered tuple.
    >
    > I wonder if it would be better for index quality to pass the buffered
    > tuples to picksplit in the 2nd step, so that they too can affect the split
    > decision. Maybe it doesn't make much difference in practice..
    
    
    I had this idea. But:
    1) Buffer contain much more tuples than page plus new tuple.
    2) Picksplit method can easily be quadratic for example.
    
    Let's see the complexity of picksplit algorithms:
    1) geometric datatypes (point, box etc) - O(n) (BTW, I have serious doubts
    about it, i.e. O(n*log(n)) algorithm can be in times better in many cases)
    2) pg_trgm and fts - O(n^2)
    3) seg - O(n*log(n))
    4) cube - O(n^2)
    
    Thus, I believe such feature should be an optional. We can try it as add-on
    patch.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  96. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-11T10:21:29Z

    On Wed, Aug 10, 2011 at 11:45 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > unloadNodeBuffers() is now dead code.
    >
    processEmptyingStack calls it.
    
    LEAF_PAGES_STATS_* are unused now.
    
    Removed.
    
    
    > Should avoid calling smgrnblocks() on every tuple, the overhead of that
    > could add up.
    
    Now calling at each BUFFERING_MODE_SWITCH_CHECK_STEP(256) tuples.
    
    
    > I wonder, how hard would it be to merge gistBufferingBuildPlaceToPage(**)
    > with the gistplacetopage() function used in the main codepath? There's very
    > little difference between them, and it would be nice to maintain just one
    > function. At the very least I think there should be a comment in both along
    > the lines of "NOTE: if you change this function, make sure you update XXXX
    > (the other function) as well!"
    >
    I doubt they can be effectively merged, but will try.
    
    
    > In gistbuild(), in the final emptying stage, there's this special-case
    > handling for the root block before looping through the buffers in the
    > buffersOnLevels lists:
    >
    >                 for (;;)
    >>                {
    >>                        nodeBuffer = getNodeBuffer(gfbb,
    >> &buildstate.giststate, GIST_ROOT_BLKNO,
    >>
    >> InvalidOffsetNumber, NULL, false);
    >>                        if (!nodeBuffer || nodeBuffer->blocksCount <= 0)
    >>                                break;
    >>                        MemoryContextSwitchTo(gfbb->**context);
    >>                        gfbb->bufferEmptyingStack = lcons(nodeBuffer,
    >> gfbb->bufferEmptyingStack);
    >>                        MemoryContextSwitchTo(**buildstate.tmpCtx);
    >>                        processEmptyingStack(&**buildstate.giststate,
    >> &insertstate);
    >>                }
    >>
    >
    > What's the purpose of that? Wouldn't the loop through buffersOnLevels lists
    > take care of the root node too?
    >
    I was worried about node buffer deletion from list while scanning that
    list gistbuild(). That's why I avoided deletion from lists.
    Now I've added additional check for root node in loop over list.
    
    
    > The calculations in initBuffering() desperately need comments. As does the
    > rest of the code too, but the heuristics in that function are particularly
    > hard to understand without some explanation.
    
    Some comments were added. I'm working on more of them.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  97. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-11T10:28:58Z

    On 10.08.2011 22:44, Alexander Korotkov wrote:
    > Manual and readme updates.
    
    Thanks, I'm reviewing these now.
    
    Do we want to expose the level-step and buffersize parameters to users? 
    They've been useful during testing, but I'm thinking we should be able 
    to guess good enough values for them automatically, and just remove the 
    options. It's pretty much impossible for a user to tune them correctly, 
    it would require deep knowledge of the buffering algorithm.
    
    I'm thinking that even when you explicitly turn buffering on, we should 
    still process the first 10000 or so tuples with simple inserts. That way 
    we always have a sample of tuples to calculate the average tuple size 
    from. It's plausible that if the input data is ordered, looking at the 
    first N tuples will give skewed sample, but I don't think there's much 
    danger of that in practice. Even if the data is ordered, the length of 
    GiST tuples shouldn't vary much.
    
    What happens if we get the levelstep and pagesPerBuffer estimates wrong? 
    How sensitive is the algorithm to that? Or will we run out of memory? 
    Would it be feasible to adjust those in the middle of the index build, 
    if we e.g exceed the estimated memory usage greatly?
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  98. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-11T13:45:25Z

    On 10.08.2011 22:44, Alexander Korotkov wrote:
    > Manual and readme updates.
    
    I went through these, and did some editing and rewording. Attached is an 
    updated README, and an updated patch of the doc changes. Let me know if 
    I screwed up something.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
  99. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-11T20:30:06Z

    On Thu, Aug 11, 2011 at 2:28 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > On 10.08.2011 22:44, Alexander Korotkov wrote:
    >
    >> Manual and readme updates.
    >>
    >
    > Thanks, I'm reviewing these now.
    >
    > Do we want to expose the level-step and buffersize parameters to users?
    > They've been useful during testing, but I'm thinking we should be able to
    > guess good enough values for them automatically, and just remove the
    > options. It's pretty much impossible for a user to tune them correctly, it
    > would require deep knowledge of the buffering algorithm.
    >
    > I'm thinking that even when you explicitly turn buffering on, we should
    > still process the first 10000 or so tuples with simple inserts. That way we
    > always have a sample of tuples to calculate the average tuple size from.
    > It's plausible that if the input data is ordered, looking at the first N
    > tuples will give skewed sample, but I don't think there's much danger of
    > that in practice. Even if the data is ordered, the length of GiST tuples
    > shouldn't vary much.
    >
    > What happens if we get the levelstep and pagesPerBuffer estimates wrong?
    > How sensitive is the algorithm to that? Or will we run out of memory? Would
    > it be feasible to adjust those in the middle of the index build, if we e.g
    > exceed the estimated memory usage greatly?
    
    
    I see the following risks.
    
    For levelstep:
    Too small: not so great IO benefit as can be
    Too large:
      1) If subtree doesn't fit effective_cache, much more IO then should be
    (because of cache misses during buffer emptying)
      2) If last pages of buffers don't fit to maintenance_work_mem, possible
    OOM
    
    For buffersize:
    Too small: less IO benefit, becuse buffer size is relatively small in
    comparison with sub-tree size.
    Too large: greater CPU overhead (because of more penalty calls) then can be
    with same IO benefit.
    
    Thereby I propose following.
    1) Too large levelstep is greatest risk. Let's use pessimistic estimate for
    it. Pessimistic estimate has following logic:
    largest sub-tree => maximal tuples per page => minimal tuple size
    Thereby always using minimal tuple size in levelstep calculation we exclude
    greatest risks.
    2) Risks of buffersize are comparable and not too critical. Thats why I
    propose to use size of first 10000 tuples for estimate.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  100. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-12T08:23:55Z

    On 11.08.2011 23:30, Alexander Korotkov wrote:
    > On Thu, Aug 11, 2011 at 2:28 PM, Heikki Linnakangas<
    > heikki.linnakangas@enterprisedb.com>  wrote:
    >
    >> On 10.08.2011 22:44, Alexander Korotkov wrote:
    >>
    >>> Manual and readme updates.
    >>>
    >>
    >> Thanks, I'm reviewing these now.
    >>
    >> Do we want to expose the level-step and buffersize parameters to users?
    >> They've been useful during testing, but I'm thinking we should be able to
    >> guess good enough values for them automatically, and just remove the
    >> options. It's pretty much impossible for a user to tune them correctly, it
    >> would require deep knowledge of the buffering algorithm.
    >>
    >> I'm thinking that even when you explicitly turn buffering on, we should
    >> still process the first 10000 or so tuples with simple inserts. That way we
    >> always have a sample of tuples to calculate the average tuple size from.
    >> It's plausible that if the input data is ordered, looking at the first N
    >> tuples will give skewed sample, but I don't think there's much danger of
    >> that in practice. Even if the data is ordered, the length of GiST tuples
    >> shouldn't vary much.
    >>
    >> What happens if we get the levelstep and pagesPerBuffer estimates wrong?
    >> How sensitive is the algorithm to that? Or will we run out of memory? Would
    >> it be feasible to adjust those in the middle of the index build, if we e.g
    >> exceed the estimated memory usage greatly?
    >
    >
    > I see the following risks.
    >
    > For levelstep:
    > Too small: not so great IO benefit as can be
    > Too large:
    >    1) If subtree doesn't fit effective_cache, much more IO then should be
    > (because of cache misses during buffer emptying)
    >    2) If last pages of buffers don't fit to maintenance_work_mem, possible
    > OOM
    
    Hmm, we could avoid running out of memory if we used a LRU cache 
    replacement policy on the buffer pages, instead of explicitly unloading 
    the buffers. 1) would still apply, though.
    
    > For buffersize:
    > Too small: less IO benefit, becuse buffer size is relatively small in
    > comparison with sub-tree size.
    > Too large: greater CPU overhead (because of more penalty calls) then can be
    > with same IO benefit.
    >
    > Thereby I propose following.
    > 1) Too large levelstep is greatest risk. Let's use pessimistic estimate for
    > it. Pessimistic estimate has following logic:
    > largest sub-tree =>  maximal tuples per page =>  minimal tuple size
    > Thereby always using minimal tuple size in levelstep calculation we exclude
    > greatest risks.
    > 2) Risks of buffersize are comparable and not too critical. Thats why I
    > propose to use size of first 10000 tuples for estimate.
    
    Yep, sounds reasonable.
    
    I think it would also be fairly simple to decrease levelstep and/or 
    adjust buffersize on-the-fly. The trick would be in figuring out the 
    heuristics on when to do that.
    
    Another thing occurred to me while looking at the buffer emptying 
    process: At the moment, we stop emptying after we've flushed 1/2 buffer 
    size worth of tuples. The point of that is to avoid overfilling a 
    lower-level buffer, in the case that the tuples we emptied all landed on 
    the same lower-level buffer. Wouldn't it be fairly simple to detect that 
    case explicitly, and stop the emptying process only if one of the 
    lower-level buffers really fills up? That should be more efficient, as 
    you would have "swap" between different subtrees less often.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  101. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-12T10:59:55Z

    On Fri, Aug 12, 2011 at 12:23 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > I think it would also be fairly simple to decrease levelstep and/or adjust
    > buffersize on-the-fly. The trick would be in figuring out the heuristics on
    > when to do that.
    >
    I would be simple to decrease levelstep to the it's divider. It seems quite
    hard to dicrease it, for example, from 3 to 2. Also, it's pretty hard to
    detect that sub-tree actually doen't fit to the cache. I don't see much
    difficulties in buffersize runtime tuning.
    
    
    > Another thing occurred to me while looking at the buffer emptying process:
    > At the moment, we stop emptying after we've flushed 1/2 buffer size worth of
    > tuples. The point of that is to avoid overfilling a lower-level buffer, in
    > the case that the tuples we emptied all landed on the same lower-level
    > buffer. Wouldn't it be fairly simple to detect that case explicitly, and
    > stop the emptying process only if one of the lower-level buffers really
    > fills up? That should be more efficient, as you would have "swap" between
    > different subtrees less often.
    
     Yes, it seems reasonable to me.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  102. Re: WIP: Fast GiST index build

    Robert Haas <robertmhaas@gmail.com> — 2011-08-12T15:04:55Z

    On Thu, Aug 11, 2011 at 6:21 AM, Alexander Korotkov
    <aekorotkov@gmail.com> wrote:
    > [ new patch ]
    
    Some random comments:
    
    - It appears that the "noFollowFight" flag is really supposed to be
    called "noFollowRight".
    
    - In gist_private.h you've written "halt-filled" where you really mean
    "half-filled".
    
    - It seems like you're using reloptions to set parameters that are
    only going to do anything at index creation time.  IIUC, "BUFFERING",
    "LEVELSTEP" and "BUFFERSIZE" have no permanent meaning for that index;
    they're just used ephemerally while constructing it.  If we're going
    to expose such things as options, maybe they should be GUCs, not
    reloptions.
    
    - Function names should begin with "gist" or some other, appropriate
    prefix, especially if they are non-static.  decreasePathRefcount(),
    getNodeBuffer(), relocateBuildBuffersOnSplit(), adn
    getNodeBufferBusySize() violate this rule, and it might be good to
    change the static functions to follow it, too, just for consistency,
    and to avoid renaming things if something that's currently static
    later needs to be made non-static.
    
    - validateBufferOption needs to use ereport(), not elog().
    
    - This needs a bit of attention:
    
    +               /* TODO: Write the WAL record */
    +               if (RelationNeedsWAL(state->r))
    +                       recptr = gistXLogSplit(state->r->rd_node,
    blkno, is_leaf,
    +                                                               dist,
    oldrlink, oldnsn, InvalidBuffer, true);
    +               else
    +                       recptr = GetXLogRecPtrForTemp();
    +
    
    I don't think the comment matches the code, since gistXLogSplit() does
    in fact call XLogInsert().  Also, you should probably move the
    RelationNeedsWAL() test inside gistXLogSplit().  Otherwise, every
    single caller of gistXLogSplit() is going to need to do the same
    dance.
    
    - In gistBufferingPlaceToPage, you've got a series of loops that look like this:
    
    +               for (ptr = dist; ptr; ptr = ptr->next)
    
    The first loop allocates a bunch of buffers.  The second loop sets up
    downlink pointers.   Then there's some other code.  Then there's a
    third loop, which adds items to each page in turn and sets up right
    links.  Then there's a fourth loop, which marks all those buffers
    dirty.  Then you write XLOG.  Then there's a fifth loop, which sets
    all the LSNs and TLIs, and a sixth loop, which does
    UnlockReleaseBuffer() on each valid buffer in the list.  All of this
    seems like it could be simplified.  In particular, the third and
    fourth loops can certainly be merged - you should set the dirty bit at
    the same time you're adding items to the page.  And the fifth and
    sixth loops can also be merged.  You certainly don't need to set all
    the LSNs and TLIs before releasing any of the buffer locks & pins.
    I'm not sure if there's any more merging that can be done than that,
    but you might want to have a look.
    
    I'm also wondering how long this linked list can be.  It's not good to
    have large numbers of buffers locked for a long period of time.  At
    the very least, some comments are in order here.
    
    Another general comment about this function is that it seems like it
    is backwards.  The overall flow of the function is:
    
    if (is_split)
    {
        /* complicated stuff */
    }
    else
    {
        /* simple case */
    }
    
    It seems like it might be better to flip that around and do this:
    
    if (!is_split)
    {
        /* simple case */
        return result;
    }
    /* complicated stuff */
    
    It's easier to read and avoids having the indentation get too deep.
    
    - As I look at this more, I see that a lot of the logic in
    gistBufferingBuildPlaceToPage is copied from gistplacetopage().  It
    would be nice to move the common bits to common subroutines that both
    functions can call, instead of duplicating the code.
    
    - On a related note, gistBufferingBuildPlaceToPage needs to do
    START_CRIT_SECTION and END_CRIT_SECTION at appropriate points in the
    sequence, as gistplacetopage() does.
    
    - gistFindCorrectParent() seems to rely heavily on the assumption that
    there's no concurrent activity going on in this index.  Otherwise,
    it's got to be unsafe to release the buffer lock before using the
    answer the function computes.  Some kind of comment seems like it
    would be a good idea.
    
    - On a more algorithmic note, I don't really understand why we attach
    buffers to all pages on a level or none of them.  If it isn't
    necessary to have buffers on every internal page in the tree, why do
    we have them on every other level or every third level rather than,
    say, creating them on the fly in whatever parts of the tree end up
    busy?
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  103. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-14T19:30:43Z

    Hi!
    
    Thank you for your notes.
    
    On Fri, Aug 12, 2011 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
    
    > On Thu, Aug 11, 2011 at 6:21 AM, Alexander Korotkov
    > <aekorotkov@gmail.com> wrote:
    > > [ new patch ]
    >
    > Some random comments:
    >
    > - It appears that the "noFollowFight" flag is really supposed to be
    > called "noFollowRight".
    >
    Fixed.
    
    - In gist_private.h you've written "halt-filled" where you really mean
    > "half-filled".
    >
    Fixed.
    
    
    > - It seems like you're using reloptions to set parameters that are
    > only going to do anything at index creation time.  IIUC, "BUFFERING",
    > "LEVELSTEP" and "BUFFERSIZE" have no permanent meaning for that index;
    > they're just used ephemerally while constructing it.  If we're going
    > to expose such things as options, maybe they should be GUCs, not
    > reloptions.
    >
    Having these as index parameters may be helpful when you reindex. It's
    likely that you would like to rebuild it with same parameters as it was
    created. Actually, we have the same situation with FILLFACTOR: it is used
    only during index creation.
    
    
    > - Function names should begin with "gist" or some other, appropriate
    > prefix, especially if they are non-static.  decreasePathRefcount(),
    > getNodeBuffer(), relocateBuildBuffersOnSplit(), adn
    > getNodeBufferBusySize() violate this rule, and it might be good to
    > change the static functions to follow it, too, just for consistency,
    > and to avoid renaming things if something that's currently static
    > later needs to be made non-static.
    >
    Fixed.
    
    
    > - validateBufferOption needs to use ereport(), not elog().
    >
    Fixed.
    
    
    > - This needs a bit of attention:
    >
    > +               /* TODO: Write the WAL record */
    > +               if (RelationNeedsWAL(state->r))
    > +                       recptr = gistXLogSplit(state->r->rd_node,
    > blkno, is_leaf,
    > +                                                               dist,
    > oldrlink, oldnsn, InvalidBuffer, true);
    > +               else
    > +                       recptr = GetXLogRecPtrForTemp();
    > +
    >
    > I don't think the comment matches the code, since gistXLogSplit() does
    > in fact call XLogInsert(). Also, you should probably move the
    
    RelationNeedsWAL() test inside gistXLogSplit().  Otherwise, every
    > single caller of gistXLogSplit() is going to need to do the same
    > dance.
    >
    
    > - In gistBufferingPlaceToPage, you've got a series of loops that look like
    > this:
    >
    > +               for (ptr = dist; ptr; ptr = ptr->next)
    >
    > The first loop allocates a bunch of buffers.  The second loop sets up
    > downlink pointers.   Then there's some other code.  Then there's a
    > third loop, which adds items to each page in turn and sets up right
    > links.  Then there's a fourth loop, which marks all those buffers
    > dirty.  Then you write XLOG.  Then there's a fifth loop, which sets
    > all the LSNs and TLIs, and a sixth loop, which does
    > UnlockReleaseBuffer() on each valid buffer in the list.  All of this
    > seems like it could be simplified.  In particular, the third and
    > fourth loops can certainly be merged - you should set the dirty bit at
    > the same time you're adding items to the page.  And the fifth and
    > sixth loops can also be merged.  You certainly don't need to set all
    > the LSNs and TLIs before releasing any of the buffer locks & pins.
    > I'm not sure if there's any more merging that can be done than that,
    > but you might want to have a look.
    >
    > I'm also wondering how long this linked list can be.  It's not good to
    > have large numbers of buffers locked for a long period of time.  At
    > the very least, some comments are in order here.
    >
    > Another general comment about this function is that it seems like it
    > is backwards.  The overall flow of the function is:
    >
    > if (is_split)
    > {
    >    /* complicated stuff */
    > }
    > else
    > {
    >    /* simple case */
    > }
    >
    > It seems like it might be better to flip that around and do this:
    >
    > if (!is_split)
    > {
    >    /* simple case */
    >    return result;
    > }
    > /* complicated stuff */
    >
    > It's easier to read and avoids having the indentation get too deep.
    >
    > - As I look at this more, I see that a lot of the logic in
    > gistBufferingBuildPlaceToPage is copied from gistplacetopage().  It
    > would be nice to move the common bits to common subroutines that both
    > functions can call, instead of duplicating the code.
    >
    > - On a related note, gistBufferingBuildPlaceToPage needs to do
    > START_CRIT_SECTION and END_CRIT_SECTION at appropriate points in the
    > sequence, as gistplacetopage() does.
    >
    While, I've merged gistplacetopage() and gistBufferingBuildPlaceToPage().
    Now I'm trying some more refactoring.
    
    
    > - gistFindCorrectParent() seems to rely heavily on the assumption that
    > there's no concurrent activity going on in this index.  Otherwise,
    > it's got to be unsafe to release the buffer lock before using the
    > answer the function computes.  Some kind of comment seems like it
    > would be a good idea.
    >
    Corresponding comment was added.
    
    
    > - On a more algorithmic note, I don't really understand why we attach
    > buffers to all pages on a level or none of them.  If it isn't
    > necessary to have buffers on every internal page in the tree, why do
    > we have them on every other level or every third level rather than,
    > say, creating them on the fly in whatever parts of the tree end up
    > busy?
    >
    Idea of having buffers on levels with some step is following. We have enough
    of cache to have a sub-tree of some height fits to cache. When we loaded
    such sub-tree once we can process index tuples inside it effectively
    (without actual IO). During buffer emptying we're flushing index tuples to
    undeflying buffers or leaf pages. Having buffers on levels with step we
    guarantee that flushing don't require loading(and writing) more then such
    sub-tree (which fits to cache). Thus, if we've processed many enough of
    index tuples during emptying, it's IO effective. It's possible that some
    more effective distribution of buffers exists, but it's currently unclear
    for me.
    
    Other changes:
    1) Levelstep and buffersize user options were removed.
    2) Buffer size is now run time tuned.
    3) Buffer emptying now stops when some child can't take index tuple anymore.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  104. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-16T08:10:33Z

    I found that I forgot to remove levelstep and buffersize from reloptions.c.
    Updated patch is attached.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  105. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-16T12:04:37Z

    Looking at the calculation of levelStep:
    
    > + 	/*
    > + 	 * Calculate levelStep by available amount of memory. We should be able to
    > + 	 * load into main memory one page of each underlying node buffer (which
    > + 	 * are in levelStep below). That give constraint over
    > + 	 * maintenance_work_mem. Also we should be able to have subtree of
    > + 	 * levelStep level in cache. That give constraint over
    > + 	 * effective_cache_size.
    > + 	 *
    > + 	 * i'th underlying level of sub-tree can consists of
    > + 	 * i^maxIndexTuplesPerPage pages at maximum. So, subtree of levelStep
    > + 	 * levels can't be greater then 2 * maxIndexTuplesPerPage ^ levelStep
    > + 	 * pages. We use some more reserve due to we probably can't take whole
    > + 	 * effective cache and use formula 4 * maxIndexTuplesPerPage ^ levelStep =
    > + 	 * effectiveCache. We use similar logic with maintenance_work_mem. We
    > + 	 * should be able to store at least last pages of all buffers where we are
    > + 	 * emptying current buffer to.
    > + 	 */
    > + 	effectiveMemory = Min(maintenance_work_mem * 1024 / BLCKSZ,
    > + 						  effective_cache_size);
    > + 	levelStep = (int) log((double) effectiveMemory / 4.0) /
    > + 		log((double) maxIndexTuplesPerPage);
    > +
    
    I can see that that's equal to the formula given in the paper, 
    log_B(M/4B), but I couldn't see any explanation for that formula in the 
    paper. Your explanation makes sense, but where did it come from?
    
    It seems a bit pessimistic: while it's true that the a subtree can't be 
    larger than 2 * maxIndexTuplesPerPage ^ levelStep, you can put a tighter 
    upper bound on it. The max size of a subtree of depth n can be 
    calculated as the geometric series:
    
    r^0 + r^1 + r^2 + ... + r^n = (1 - r^(n + 1)) / (1 - r)
    
    where r = maxIndexTuplesPerPage. For r=2 those formulas are equal, but 
    for a large r and small n (which is typical), the 2 * 
    maxIndexTuplesPerPage^levelStep formula gives a value that's almost 
    twice as large as the real max size of a subtree.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  106. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-16T13:19:14Z

    On Tue, Aug 16, 2011 at 4:04 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > I can see that that's equal to the formula given in the paper, log_B(M/4B),
    > but I couldn't see any explanation for that formula in the paper. Your
    > explanation makes sense, but where did it come from?
    >
    I didn't find it too. But it has to reservse memory for both sub-tree and
    active buffers. While we'are reserving memory for sub-tree in
    effective_cache_size and memory for last pages of buffers in
    maintenance_work_mem.
    
    
    > It seems a bit pessimistic: while it's true that the a subtree can't be
    > larger than 2 * maxIndexTuplesPerPage ^ levelStep, you can put a tighter
    > upper bound on it. The max size of a subtree of depth n can be calculated as
    > the geometric series:
    >
    > r^0 + r^1 + r^2 + ... + r^n = (1 - r^(n + 1)) / (1 - r)
    >
    > where r = maxIndexTuplesPerPage. For r=2 those formulas are equal, but for
    > a large r and small n (which is typical), the 2 * maxIndexTuplesPerPage^**levelStep
    > formula gives a value that's almost twice as large as the real max size of a
    > subtree.
    >
    Thus, we can calculate:
    levelstep = min(log_r(1 + effective_cache_size_in_pages*(r - 1)) - 1,
    log_r(maintenance_work_mem_in_pages - 1))
    and get more precise result. But also we need at least very rough estimate
    of memory occupied by node buffers hash tab and path items.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  107. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-16T17:43:37Z

    Why is there ever a buffer on the root node? It seems like a waste of 
    time to load N tuples from the heap into the root buffer, only to empty 
    the buffer after it fills up. You might as well pull tuples directly 
    from the heap.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  108. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-16T18:46:06Z

    On Tue, Aug 16, 2011 at 9:43 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > Why is there ever a buffer on the root node? It seems like a waste of time
    > to load N tuples from the heap into the root buffer, only to empty the
    > buffer after it fills up. You might as well pull tuples directly from the
    > heap.
    >
    Yes, seems reasonable. Buffer on the root node was in the paper. But now I
    don't see the need of it too.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  109. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-16T19:10:49Z

    On 16.08.2011 21:46, Alexander Korotkov wrote:
    > On Tue, Aug 16, 2011 at 9:43 PM, Heikki Linnakangas<
    > heikki.linnakangas@enterprisedb.com>  wrote:
    >
    >> Why is there ever a buffer on the root node? It seems like a waste of time
    >> to load N tuples from the heap into the root buffer, only to empty the
    >> buffer after it fills up. You might as well pull tuples directly from the
    >> heap.
    >>
    > Yes, seems reasonable. Buffer on the root node was in the paper. But now I
    > don't see the need of it too.
    
    Here's an version of the patch with a bunch of minor changes:
    
    * No more buffer on root node. Aside from the root buffer being 
    pointless, this simplifies gistRelocateBuildBuffersOnSplit slightly as 
    it doesn't need the special case for root block anymore.
    
    * Moved the code to create new root item from gistplacetopage() to 
    gistRelocateBuildBuffersOnSplit(). Seems better to keep the 
    buffering-related code away from the normal codepath, for the sake of 
    readability.
    
    * Changed the levelStep calculation to use the more accurate upper bound 
    on subtree size that we discussed.
    
    * Changed the levelStep calculation so that it doesn't do just 
    min(maintenance_work_mem, effective_cache_size) and calculate the 
    levelStep from that. Maintenance_work_mem matters determines the max. 
    number of page buffers that can be held in memory at a time, while 
    effective_cache_size determines the max size of the subtree. Those are 
    subtly different things.
    
    * Renamed NodeBuffer to GISTNodeBuffer, to avoid cluttering the namespace
    
    * Plus misc comment, whitespace, formatting and naming changes.
    
    I think this patch is in pretty good shape now. Could you re-run the 
    performance tests you have on the wiki page, please, to make sure the 
    performance hasn't regressed? It would also be nice to get some testing 
    on the levelStep and pagesPerBuffer estimates, and the point where we 
    switch to the buffering method. I'm particularly interested to know if 
    there's any corner-cases with very skewed data distributions or strange 
    GUC settings, where the estimates fails badly.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  110. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-16T19:15:15Z

    On 16.08.2011 22:10, Heikki Linnakangas wrote:
    > Here's an version of the patch with a bunch of minor changes:
    
    And here it really is, this time with an attachment...
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
  111. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-17T07:11:34Z

    On Tue, Aug 16, 2011 at 11:15 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > On 16.08.2011 22:10, Heikki Linnakangas wrote:
    >
    >> Here's an version of the patch with a bunch of minor changes:
    >>
    >
    > And here it really is, this time with an attachment...
    
    Thanks a lot. I'm going to start rerunning the tests now.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  112. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-22T10:23:32Z

    On Wed, Aug 17, 2011 at 11:11 AM, Alexander Korotkov
    <aekorotkov@gmail.com>wrote:
    
    > On Tue, Aug 16, 2011 at 11:15 PM, Heikki Linnakangas <
    > heikki.linnakangas@enterprisedb.com> wrote:
    >
    >> On 16.08.2011 22:10, Heikki Linnakangas wrote:
    >>
    >>> Here's an version of the patch with a bunch of minor changes:
    >>>
    >>
    >> And here it really is, this time with an attachment...
    >
    > Thanks a lot. I'm going to start rerunning the tests now.
    >
    
    First bunch of test results will be available soon (tests running and
    results processing take some time). While there is a patch with few small
    bugfixes.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  113. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-24T13:57:28Z

    I've added some testing results to the wiki page:
    http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011
    There are not all the results I planned for the first chunk because it takes
    more time than I expect.
    Some notes about it.
    
    Now I see two causes which accelerate regular build of GiST indexes:
    1) As it was noted before regular index build of pretty ordered dataset is
    fast.
    2) I found that worse index is faster to build. I mean worse index is index
    with higher overlaps. Function gistchoose selects the first index tuple with
    zero penalty if any. Thus, with higher overlap in root page only few index
    tuples of it will be choosed for insert. And, recursively, only small part
    of the tree will be used for actual inserts. And that part of tree can
    easier fit to the cache. Thus, high overlaps  makes inserts cheaper as much
    as searches expensiver.
    
    In the tests on the first version of patch I found index quality of regular
    build much better than it of buffering build (without neighborrelocation).
    Now it's similar, though it's because index quality of regular index build
    become worse. There by in current tests regular index build is faster than
    in previous. I see following possible causes of it:
     1) I didn't save source random data. So, now it's a new random data.
    2) Some environment parameters of my test setup may alters, though I doubt.
    Despite these possible explanation it seems quite strange for me.
    
    In order to compare index build methods on more qualitative indexes, I've
    tried to build indexes with my double sorting split method (see:
    http://syrcose.ispras.ru/2011/files/SYRCoSE2011_Proceedings.pdf#page=36). So
    on uniform dataset search is faster in about 10 times! And, as it was
    expected, regular index build becomes much slower. It runs more than 60
    hours and while only 50% of index is complete (estimated by file sizes).
    
    Also, automatic switching to buffering build shows better index quality
    results in all the tests. While it's hard for me to explain that.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  114. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-25T18:53:10Z

    On 24.08.2011 16:57, Alexander Korotkov wrote:
    > I've added some testing results to the wiki page:
    > http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011
    > There are not all the results I planned for the first chunk because it takes
    > more time than I expect.
    > Some notes about it.
    >
    > Now I see two causes which accelerate regular build of GiST indexes:
    > 1) As it was noted before regular index build of pretty ordered dataset is
    > fast.
    > 2) I found that worse index is faster to build. I mean worse index is index
    > with higher overlaps. Function gistchoose selects the first index tuple with
    > zero penalty if any. Thus, with higher overlap in root page only few index
    > tuples of it will be choosed for insert. And, recursively, only small part
    > of the tree will be used for actual inserts. And that part of tree can
    > easier fit to the cache. Thus, high overlaps  makes inserts cheaper as much
    > as searches expensiver.
    
    As an extreme case, a trivial penalty function that just always returns 
    0 will make index build fast - but the index will be useless for querying.
    
    > In the tests on the first version of patch I found index quality of regular
    > build much better than it of buffering build (without neighborrelocation).
    > Now it's similar, though it's because index quality of regular index build
    > become worse. There by in current tests regular index build is faster than
    > in previous. I see following possible causes of it:
    >   1) I didn't save source random data. So, now it's a new random data.
    > 2) Some environment parameters of my test setup may alters, though I doubt.
    > Despite these possible explanation it seems quite strange for me.
    
    That's pretty surprising. Assuming the data is truly random, I wouldn't 
    expect a big difference in the index quality of one random data set over 
    another. If the index quality depends so much on, say, the distribution 
    of the few first tuples that are inserted to it, that's a quite 
    interesting find on its own, and merits some further research.
    
    > In order to compare index build methods on more qualitative indexes, I've
    > tried to build indexes with my double sorting split method (see:
    > http://syrcose.ispras.ru/2011/files/SYRCoSE2011_Proceedings.pdf#page=36). So
    > on uniform dataset search is faster in about 10 times! And, as it was
    > expected, regular index build becomes much slower. It runs more than 60
    > hours and while only 50% of index is complete (estimated by file sizes).
    >
    > Also, automatic switching to buffering build shows better index quality
    > results in all the tests. While it's hard for me to explain that.
    
    Hmm, makes me a bit uneasy that we're testing with a modified page 
    splitting algorithm. But if the new algorithm is that good, could you 
    post that as a separate patch, please?
    
    That said, I don't see any new evidence that the buffering build 
    algorithm would be significantly worse. There's the case of ordered data 
    that we already knew about, and will have to just accept for now.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  115. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-25T19:08:10Z

    On 22.08.2011 13:23, Alexander Korotkov wrote:
    > On Wed, Aug 17, 2011 at 11:11 AM, Alexander Korotkov
    > <aekorotkov@gmail.com>wrote:
    >
    >> On Tue, Aug 16, 2011 at 11:15 PM, Heikki Linnakangas<
    >> heikki.linnakangas@enterprisedb.com>  wrote:
    >>
    >>> On 16.08.2011 22:10, Heikki Linnakangas wrote:
    >>>
    >>>> Here's an version of the patch with a bunch of minor changes:
    >>>>
    >>>
    >>> And here it really is, this time with an attachment...
    >>
    >> Thanks a lot. I'm going to start rerunning the tests now.
    >>
    >
    > First bunch of test results will be available soon (tests running and
    > results processing take some time). While there is a patch with few small
    > bugfixes.
    
    I've been mulling this through, and will continue working on this 
    tomorrow, but wanted to share this version meanwhile:
    
    * Moved all the buffering build logic from gistplacetopage() to a new 
    function in gistbuild.c. There's almost no changes to gistplacetopage() 
    now, it returns the SplitInfo struct as usual, and the new function 
    deals with that and handles the call to 
    gistRelocateBuildBuffersOnSplit(), and the recursion to insert downlinks.
    
    * Simplified the handling of buffersOnLevels lists a bit. There's now an 
    entry in buffersOnLevels array for all levels, even those that don't 
    have buffers because levelStep > 1. That wastes a few bytes in the 
    array, but it's more easy to debug and understand that way. Also, 
    there's no separate Len and Count variables for it anymore.
    
    * Moved validateBufferingOption() to gistbuild.c
    
    * Moved the code to add buffer to emptying queue to 
    gistPushItupToNodeBuffer() (was handled by the callers previously)
    
    * Removed gistGetNodeBufferBusySize(), it was unused
    
    * A lot of comment changes
    
    Could you share the test scripts, patches and data sets etc. needed to 
    reproduce the tests you've been running? I'd like to try them out on a 
    test server.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
  116. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-26T14:18:41Z

    On Thu, Aug 25, 2011 at 11:08 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > Could you share the test scripts, patches and data sets etc. needed to
    > reproduce the tests you've been running? I'd like to try them out on a test
    > server.
    >
    
    1) I've updated links to the datasets on the wiki page.
    2) Script for index quality testing fastbuild_test.php is in the attachment.
    In order to run it you need PHP with pdo and pdo_pgsql modules. Also
    plantuner moduler is required (it is used to force planer to use specific
    index). After running that script following query returns relative score of
    index quality:
    
    select indexname, avg(count::real/(select count from test_result a2 where
    a2.indexname = 'usnoa2_idx3' and a2.predicate = a1.predicate and
    a2.tablename = a1.tablename)::real) from test_result a1 where a1.tablename =
    'usnoa2' group by indexname;
    
    where 'usnoa2' - table name, 'usnoa2_idx3' - name of index which quality was
    assumed to be 1.
    3) Patch which makes plantuner work with HEAD is also in attachment.
    4) Patch with my split algorithm implementation is attached. Now it's form
    is appropriate only for testing purposes.
    5) For indexes creation I use simple script which is attached as
    'indexes.sql'. Also, similar script with different index names I'm running
    with my split patch.
    
    Feel free to ask questions about all this stuff.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  117. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-26T14:31:23Z

    On Thu, Aug 25, 2011 at 10:53 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    >
    >> In the tests on the first version of patch I found index quality of
    >> regular
    >> build much better than it of buffering build (without neighborrelocation).
    >> Now it's similar, though it's because index quality of regular index build
    >> become worse. There by in current tests regular index build is faster than
    >> in previous. I see following possible causes of it:
    >>  1) I didn't save source random data. So, now it's a new random data.
    >> 2) Some environment parameters of my test setup may alters, though I
    >> doubt.
    >> Despite these possible explanation it seems quite strange for me.
    >>
    >
    > That's pretty surprising. Assuming the data is truly random, I wouldn't
    > expect a big difference in the index quality of one random data set over
    > another. If the index quality depends so much on, say, the distribution of
    > the few first tuples that are inserted to it, that's a quite interesting
    > find on its own, and merits some further research.
    
    Yeah, it's pretty strange. Using same random datasets in different tests can
    help to exclude onepossible cause of difference.
    
    
    >  In order to compare index build methods on more qualitative indexes, I've
    >> tried to build indexes with my double sorting split method (see:
    >> http://syrcose.ispras.ru/2011/**files/SYRCoSE2011_Proceedings.**
    >> pdf#page=36<http://syrcose.ispras.ru/2011/files/SYRCoSE2011_Proceedings.pdf#page=36>).
    >> So
    >> on uniform dataset search is faster in about 10 times! And, as it was
    >> expected, regular index build becomes much slower. It runs more than 60
    >> hours and while only 50% of index is complete (estimated by file sizes).
    >>
    >> Also, automatic switching to buffering build shows better index quality
    >> results in all the tests. While it's hard for me to explain that.
    >>
    >
    > Hmm, makes me a bit uneasy that we're testing with a modified page
    > splitting algorithm. But if the new algorithm is that good, could you post
    > that as a separate patch, please?
    >
    I've post it in another message and I will try to get it into more
    appropriate form. Let me clarify this a little. I don't think my split
    algorithm is 10 times better than state of the art algorithms. I think that
    currently used new linear split shows unreasonably bad results in may cases.
    For example, uniformly distributed data is pretty easy case. And with almost
    any splitting algorithm we can get index with almost zero overlaps. But new
    linear split produces huge overlaps in this case. That's why I decided to
    make some experiments with another split algorithm.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  118. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-29T07:25:50Z

    On 26.08.2011 17:18, Alexander Korotkov wrote:
    > On Thu, Aug 25, 2011 at 11:08 PM, Heikki Linnakangas<
    > heikki.linnakangas@enterprisedb.com>  wrote:
    >
    >> Could you share the test scripts, patches and data sets etc. needed to
    >> reproduce the tests you've been running? I'd like to try them out on a test
    >> server.
    >>
    >
    > 1) I've updated links to the datasets on the wiki page.
    > 2) Script for index quality testing fastbuild_test.php is in the attachment.
    > In order to run it you need PHP with pdo and pdo_pgsql modules. Also
    > plantuner moduler is required (it is used to force planer to use specific
    > index). After running that script following query returns relative score of
    > index quality:
    >
    > select indexname, avg(count::real/(select count from test_result a2 where
    > a2.indexname = 'usnoa2_idx3' and a2.predicate = a1.predicate and
    > a2.tablename = a1.tablename)::real) from test_result a1 where a1.tablename =
    > 'usnoa2' group by indexname;
    >
    > where 'usnoa2' - table name, 'usnoa2_idx3' - name of index which quality was
    > assumed to be 1.
    > 3) Patch which makes plantuner work with HEAD is also in attachment.
    > 4) Patch with my split algorithm implementation is attached. Now it's form
    > is appropriate only for testing purposes.
    > 5) For indexes creation I use simple script which is attached as
    > 'indexes.sql'. Also, similar script with different index names I'm running
    > with my split patch.
    >
    > Feel free to ask questions about all this stuff.
    
    Thanks! Meanwhile, I hacked together a script of my own to do 
    performance testing. I let it run over the weekend, but I just realized 
    that I forgot to vacuum the test tables so the results are not worth 
    much. I'm rerunning them now, stay tuned!
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  119. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-30T09:08:28Z

    On 26.08.2011 17:18, Alexander Korotkov wrote:
    > On Thu, Aug 25, 2011 at 11:08 PM, Heikki Linnakangas<
    > heikki.linnakangas@enterprisedb.com>  wrote:
    >
    >> Could you share the test scripts, patches and data sets etc. needed to
    >> reproduce the tests you've been running? I'd like to try them out on a test
    >> server.
    >>
    >
    > 1) I've updated links to the datasets on the wiki page.
    > 2) Script for index quality testing fastbuild_test.php is in the attachment.
    > In order to run it you need PHP with pdo and pdo_pgsql modules. Also
    > plantuner moduler is required (it is used to force planer to use specific
    > index). After running that script following query returns relative score of
    > index quality:
    >
    > select indexname, avg(count::real/(select count from test_result a2 where
    > a2.indexname = 'usnoa2_idx3' and a2.predicate = a1.predicate and
    > a2.tablename = a1.tablename)::real) from test_result a1 where a1.tablename =
    > 'usnoa2' group by indexname;
    >
    > where 'usnoa2' - table name, 'usnoa2_idx3' - name of index which quality was
    > assumed to be 1.
    > 3) Patch which makes plantuner work with HEAD is also in attachment.
    > 4) Patch with my split algorithm implementation is attached. Now it's form
    > is appropriate only for testing purposes.
    > 5) For indexes creation I use simple script which is attached as
    > 'indexes.sql'. Also, similar script with different index names I'm running
    > with my split patch.
    >
    > Feel free to ask questions about all this stuff.
    
    Thanks. Meanwhile, I hacked together my own set of test scripts, and let 
    them run over the weekend. I'm still running tests with ordered data, 
    but here are some preliminary results:
    
                testname           |   nrows   |    duration     | accesses
    -----------------------------+-----------+-----------------+----------
      points unordered auto       | 250000000 | 08:08:39.174956 |  3757848
      points unordered buffered   | 250000000 | 09:29:16.47012  |  4049832
      points unordered unbuffered | 250000000 | 03:48:10.999861 |  4564986
    
    As you can see, the results are very disappointing :-(. The buffered 
    builds take a lot *longer* than unbuffered ones. I was expecting the 
    buffering to be very helpful at least in these unordered tests. On the 
    positive side, the buffering made index quality somewhat better 
    (accesses column, smaller is better), but that's not what we're aiming at.
    
    What's going on here? This data set was large enough to not fit in RAM, 
    the table was about 8.5 GB in size (and I think the index is even larger 
    than that), and the box has 4GB of RAM. Does the buffering only help 
    with even larger indexes that exceed the cache size even more?
    
    
    Test methodology
    ----------------
    
    These tests consist of creating a gist index using the point datatype.
    
          Table "public.points"
       Column |  Type   | Modifiers
    --------+---------+-----------
       x      | integer |
       y      | integer |
    
    CREATE INDEX testindex ON points_ordered USING gist (point(x,y)) WITH 
    (buffering = 'on');
    
    The points in the table are uniformly distributed. In the 'unordered' 
    tests, they are in random order. The ordered tests use the exact same 
    data, but sorted by x, y coordinates.
    
    The 'accesses' column measures the quality of the produced index. 
    Smaller is better. It is calculated by performing a million queries on 
    the table, selecting points within a small square at evenly spaced 
    locations. Like:
    
    (SELECT COUNT(*) FROM points WHERE point(x,y) <@ box(point(xx-20, 
    yy-20), point(xx+20, yy+20)));
    
    The number of index pages touched by those queries are count from 
    pg_statio_user_indexes, and that number is reported in the 'accesses' 
    column.
    
    I've attached the whole script used. Pass the number of rows to use in 
    the test as argument, and the script does the rest.
    
    -- 
        Heikki Linnakangas
        EnterpriseDB   http://www.enterprisedb.com
    
    
  120. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-30T09:13:41Z

    On 30.08.2011 12:08, Heikki Linnakangas wrote:
    > What's going on here? This data set was large enough to not fit in RAM,
    > the table was about 8.5 GB in size (and I think the index is even larger
    > than that), and the box has 4GB of RAM. Does the buffering only help
    > with even larger indexes that exceed the cache size even more?
    
    The tests are still running, so I decided to try oprofile. The build is 
    in the final emptying phase, according to the log, and has been for over 
    half an hour now. Oprofile output looks very interesting:
    
    samples  %        image name               symbol name
    228590   30.3045  postgres                 pg_qsort
    200558   26.5882  postgres                 gistBuffersFreeBlocksCmp
    49397     6.5486  postgres                 gistchoose
    45563     6.0403  postgres                 gist_box_penalty
    31425     4.1661  postgres                 AllocSetAlloc
    24182     3.2058  postgres                 FunctionCall3Coll
    22671     3.0055  postgres                 rt_box_union
    20147     2.6709  postgres                 gistpenalty
    17007     2.2546  postgres                 DirectFunctionCall2Coll
    15790     2.0933  no-vmlinux               /no-vmlinux
    14148     1.8756  postgres                 XLogInsert
    10612     1.4068  postgres                 gistdentryinit
    10542     1.3976  postgres                 MemoryContextAlloc
    9466      1.2549  postgres                 FunctionCall1Coll
    9190      1.2183  postgres                 gist_box_decompress
    6681      0.8857  postgres                 med3
    4941      0.6550  libc-2.12.so             isnanf
    
    So, over 50% of the CPU time is spent in choosing a block from the 
    temporary files. That should be pretty easy to improve..
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  121. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-30T10:29:02Z

    On Tue, Aug 30, 2011 at 1:13 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > So, over 50% of the CPU time is spent in choosing a block from the
    > temporary files. That should be pretty easy to improve..
    >
    Yes, probably we can just remove free blocks sorting.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  122. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-30T10:38:10Z

    On Tue, Aug 30, 2011 at 1:08 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    >
    > Thanks. Meanwhile, I hacked together my own set of test scripts, and let
    > them run over the weekend. I'm still running tests with ordered data, but
    > here are some preliminary results:
    >
    >           testname           |   nrows   |    duration     | accesses
    > -----------------------------+**-----------+-----------------+**----------
    >  points unordered auto       | 250000000 | 08:08:39.174956 |  3757848
    >  points unordered buffered   | 250000000 | 09:29:16.47012  |  4049832
    >  points unordered unbuffered | 250000000 | 03:48:10.999861 |  4564986
    >
    > As you can see, the results are very disappointing :-(. The buffered builds
    > take a lot *longer* than unbuffered ones. I was expecting the buffering to
    > be very helpful at least in these unordered tests. On the positive side, the
    > buffering made index quality somewhat better (accesses column, smaller is
    > better), but that's not what we're aiming at.
    >
    > What's going on here? This data set was large enough to not fit in RAM, the
    > table was about 8.5 GB in size (and I think the index is even larger than
    > that), and the box has 4GB of RAM. Does the buffering only help with even
    > larger indexes that exceed the cache size even more?
    >
    This seems pretty strange for me. Time of unbuffered index build shows that
    there is not bottleneck at IO. That radically differs from my
    experiments. I'm going to try your test script on my test setup.
    While I have only express assumption that random function appears to be
    somewhat bad. Thereby unordered dataset behave like the ordered one. Can you
    rerun tests on your test setup with dataset generation on the backend like
    this?
    CREATE TABLE points AS (SELECT point(random(), random() FROM
    generate_series(1,10000000));
    
    ------
    With best regards,
    Alexander Korotkov.
    
  123. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-30T10:41:00Z

    On 30.08.2011 13:29, Alexander Korotkov wrote:
    > On Tue, Aug 30, 2011 at 1:13 PM, Heikki Linnakangas<
    > heikki.linnakangas@enterprisedb.com>  wrote:
    >
    >> So, over 50% of the CPU time is spent in choosing a block from the
    >> temporary files. That should be pretty easy to improve..
    >>
    > Yes, probably we can just remove free blocks sorting.
    
    I'm re-running the tests with that change now. It seems like using the 
    list of free blocks as a simple stack would be better anyway, it 
    probably yields a better cache hit ratio when we re-use blocks that have 
    just been released.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  124. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-30T10:43:23Z

    On 30.08.2011 13:38, Alexander Korotkov wrote:
    > On Tue, Aug 30, 2011 at 1:08 PM, Heikki Linnakangas<
    > heikki.linnakangas@enterprisedb.com>  wrote:
    >
    >>
    >> Thanks. Meanwhile, I hacked together my own set of test scripts, and let
    >> them run over the weekend. I'm still running tests with ordered data, but
    >> here are some preliminary results:
    >>
    >>            testname           |   nrows   |    duration     | accesses
    >> -----------------------------+**-----------+-----------------+**----------
    >>   points unordered auto       | 250000000 | 08:08:39.174956 |  3757848
    >>   points unordered buffered   | 250000000 | 09:29:16.47012  |  4049832
    >>   points unordered unbuffered | 250000000 | 03:48:10.999861 |  4564986
    >>
    >> As you can see, the results are very disappointing :-(. The buffered builds
    >> take a lot *longer* than unbuffered ones. I was expecting the buffering to
    >> be very helpful at least in these unordered tests. On the positive side, the
    >> buffering made index quality somewhat better (accesses column, smaller is
    >> better), but that's not what we're aiming at.
    >>
    >> What's going on here? This data set was large enough to not fit in RAM, the
    >> table was about 8.5 GB in size (and I think the index is even larger than
    >> that), and the box has 4GB of RAM. Does the buffering only help with even
    >> larger indexes that exceed the cache size even more?
    >>
    > This seems pretty strange for me. Time of unbuffered index build shows that
    > there is not bottleneck at IO. That radically differs from my
    > experiments. I'm going to try your test script on my test setup.
    > While I have only express assumption that random function appears to be
    > somewhat bad. Thereby unordered dataset behave like the ordered one.
    
    Oh. Doing a simple "SELECT * FROM points LIMIT 10", it looks pretty 
    random to me. The data should be uniformly distributed in a rectangle 
    from (0, 0) to (100000, 100000).
    
    >  Can you
    > rerun tests on your test setup with dataset generation on the backend like
    > this?
    > CREATE TABLE points AS (SELECT point(random(), random() FROM
    > generate_series(1,10000000));
    
    Ok, I'll queue up that test after the ones I'm running now.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  125. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-08-30T17:29:13Z

    On 30.08.2011 13:29, Alexander Korotkov wrote:
    > On Tue, Aug 30, 2011 at 1:13 PM, Heikki Linnakangas<
    > heikki.linnakangas@enterprisedb.com>  wrote:
    >
    >> So, over 50% of the CPU time is spent in choosing a block from the
    >> temporary files. That should be pretty easy to improve..
    >>
    > Yes, probably we can just remove free blocks sorting.
    
    Ok, the first results are in for that:
    
              testname          |   nrows   |    duration     | accesses
    ---------------------------+-----------+-----------------+----------
      points unordered buffered | 250000000 | 06:00:23.707579 |  4049832
    
     From the previous test runs, the unbuffered index build took under 4 
    hours, so even though this is a lot better than with the sorting, it's 
    still a loss compared to non-buffered build.
    
    I had vmstat running during most of this index build. At a quick glance, 
    it doesn't seem to be CPU bound anymore. I suspect the buffers in the 
    temporary file gets very fragmented. Or, we're reading it in backwards 
    order because the buffers work in a LIFO fashion. The system seems to be 
    doing about 5 MB/s of I/O during the build, which sounds like a figure 
    you'd get for more or less random I/O.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  126. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-08-30T20:39:23Z

    On Tue, Aug 30, 2011 at 9:29 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > On 30.08.2011 13:29, Alexander Korotkov wrote:
    >
    >> On Tue, Aug 30, 2011 at 1:13 PM, Heikki Linnakangas<
    >> heikki.linnakangas@**enterprisedb.com<heikki.linnakangas@enterprisedb.com>>
    >>  wrote:
    >>
    >>  So, over 50% of the CPU time is spent in choosing a block from the
    >>> temporary files. That should be pretty easy to improve..
    >>>
    >>>  Yes, probably we can just remove free blocks sorting.
    >>
    >
    > Ok, the first results are in for that:
    >
    >         testname          |   nrows   |    duration     | accesses
    > ---------------------------+--**---------+-----------------+--**--------
    >  points unordered buffered | 250000000 | 06:00:23.707579 |  4049832
    >
    > From the previous test runs, the unbuffered index build took under 4 hours,
    > so even though this is a lot better than with the sorting, it's still a loss
    > compared to non-buffered build.
    >
    > I had vmstat running during most of this index build. At a quick glance, it
    > doesn't seem to be CPU bound anymore. I suspect the buffers in the temporary
    > file gets very fragmented. Or, we're reading it in backwards order because
    > the buffers work in a LIFO fashion. The system seems to be doing about 5
    > MB/s of I/O during the build, which sounds like a figure you'd get for more
    > or less random I/O.
    
    
    So, we still have two questions:
    1) Why buffering build algorithm doesn't show any benefit on these tests?
    2) Why test results on your test setup differs from test results on my test
    setup?
    
    I can propose following answers now:
    1) I think it's because high overlaps in the tree. As I mentioned before
    high overlaps can cause only fraction of the tree to be used for actual
    inserts. For comparison, with my split algorithm (which produce almost no
    overlaps on uniform dataset) buffering index build took 4 hours, while
    regular build is still running (already more than 8 days = 192 hours)!
    2) Probably it's because different behavour of OS cache. For example, on my
    test setup OS displace unused pages from cache too slowly. Thereby buffering
    algorithm showed benefit nevertheless.
    
    Also it seems to me that I start to understand problem of new linear
    splitting algorithm. On dataset with 1M rows it produces almost no overlaps
    while it produces significant overlaps already on 10M rows (drama!).
    Probably nobody tested it on large enough datasets (neither while original
    research or before commit). I'll dig it in more details and provide some
    testing results.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  127. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-09-01T08:59:11Z

    On 30.08.2011 13:38, Alexander Korotkov wrote:
    > On Tue, Aug 30, 2011 at 1:08 PM, Heikki Linnakangas<
    > heikki.linnakangas@enterprisedb.com>  wrote:
    >
    >>
    >> Thanks. Meanwhile, I hacked together my own set of test scripts, and let
    >> them run over the weekend. I'm still running tests with ordered data, but
    >> here are some preliminary results:
    >>
    >>            testname           |   nrows   |    duration     | accesses
    >> -----------------------------+**-----------+-----------------+**----------
    >>   points unordered auto       | 250000000 | 08:08:39.174956 |  3757848
    >>   points unordered buffered   | 250000000 | 09:29:16.47012  |  4049832
    >>   points unordered unbuffered | 250000000 | 03:48:10.999861 |  4564986
    >>
    >> As you can see, the results are very disappointing :-(. The buffered builds
    >> take a lot *longer* than unbuffered ones. I was expecting the buffering to
    >> be very helpful at least in these unordered tests. On the positive side, the
    >> buffering made index quality somewhat better (accesses column, smaller is
    >> better), but that's not what we're aiming at.
    >>
    >> What's going on here? This data set was large enough to not fit in RAM, the
    >> table was about 8.5 GB in size (and I think the index is even larger than
    >> that), and the box has 4GB of RAM. Does the buffering only help with even
    >> larger indexes that exceed the cache size even more?
    >>
    > This seems pretty strange for me. Time of unbuffered index build shows that
    > there is not bottleneck at IO. That radically differs from my
    > experiments. I'm going to try your test script on my test setup.
    > While I have only express assumption that random function appears to be
    > somewhat bad. Thereby unordered dataset behave like the ordered one. Can you
    > rerun tests on your test setup with dataset generation on the backend like
    > this?
    > CREATE TABLE points AS (SELECT point(random(), random() FROM
    > generate_series(1,10000000));
    
    So I changed the test script to generate the table as:
    
    CREATE TABLE points AS SELECT random() as x, random() as y FROM 
    generate_series(1, $NROWS);
    
    The unordered results are in:
    
               testname           |   nrows   |    duration     | accesses
    -----------------------------+-----------+-----------------+----------
      points unordered buffered   | 250000000 | 05:56:58.575789 |  2241050
      points unordered auto       | 250000000 | 05:34:12.187479 |  2246420
      points unordered unbuffered | 250000000 | 04:38:48.663952 |  2244228
    
    Although the buffered build doesn't lose as badly as it did with more 
    overlap, it still doesn't look good :-(. Any ideas?
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  128. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-09-01T09:23:51Z

    On Thu, Sep 1, 2011 at 12:59 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > So I changed the test script to generate the table as:
    >
    > CREATE TABLE points AS SELECT random() as x, random() as y FROM
    > generate_series(1, $NROWS);
    >
    > The unordered results are in:
    >
    >          testname           |   nrows   |    duration     | accesses
    > -----------------------------+**-----------+-----------------+**----------
    >  points unordered buffered   | 250000000 | 05:56:58.575789 |  2241050
    >  points unordered auto       | 250000000 | 05:34:12.187479 |  2246420
    >  points unordered unbuffered | 250000000 | 04:38:48.663952 |  2244228
    >
    > Although the buffered build doesn't lose as badly as it did with more
    > overlap, it still doesn't look good :-(. Any ideas?
    
    
    But it's still a lot of overlap. It's about 220 accesses per small area
    request. It's about 10 - 20 times greater than should be without overlaps.
    If we roughly assume that 10 times more overlap makes 1/10 of tree to be
    used for actual inserts, then that part of tree can easily fit to the cache.
    You can try my splitting algorithm on your test setup (it this case I advice
    to start from smaller number of rows, 100 M for example).
    I'm requesting real-life datasets which makes troubles in real life from
    Oleg. Probably those datasets is even larger or new linear split produce
    less overlaps on them.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  129. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-09-01T09:37:42Z

    On 01.09.2011 12:23, Alexander Korotkov wrote:
    > On Thu, Sep 1, 2011 at 12:59 PM, Heikki Linnakangas<
    > heikki.linnakangas@enterprisedb.com>  wrote:
    >
    >> So I changed the test script to generate the table as:
    >>
    >> CREATE TABLE points AS SELECT random() as x, random() as y FROM
    >> generate_series(1, $NROWS);
    >>
    >> The unordered results are in:
    >>
    >>           testname           |   nrows   |    duration     | accesses
    >> -----------------------------+**-----------+-----------------+**----------
    >>   points unordered buffered   | 250000000 | 05:56:58.575789 |  2241050
    >>   points unordered auto       | 250000000 | 05:34:12.187479 |  2246420
    >>   points unordered unbuffered | 250000000 | 04:38:48.663952 |  2244228
    >>
    >> Although the buffered build doesn't lose as badly as it did with more
    >> overlap, it still doesn't look good :-(. Any ideas?
    >
    >
    > But it's still a lot of overlap. It's about 220 accesses per small area
    > request. It's about 10 - 20 times greater than should be without overlaps.
    
    Hmm, those "accesses" numbers are actually quite bogus for this test. I 
    changed the creation of the table as you suggested, so that all x and y 
    values are in the range 0.0 - 1.0, but I didn't change the loop to 
    calculate those accesses, so it still queried for boxes in the range 0 - 
    100000. That makes me wonder, why does it need 220 accesses on average 
    to satisfy queries most of which lie completely outside the range of 
    actual values in the index? I would expect such queries to just look at 
    the root node, conclude that there can't be any matching tuples, and 
    return immediately.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  130. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-09-05T11:10:54Z

    On 01.09.2011 12:23, Alexander Korotkov wrote:
    > On Thu, Sep 1, 2011 at 12:59 PM, Heikki Linnakangas<
    > heikki.linnakangas@enterprisedb.com>  wrote:
    >
    >> So I changed the test script to generate the table as:
    >>
    >> CREATE TABLE points AS SELECT random() as x, random() as y FROM
    >> generate_series(1, $NROWS);
    >>
    >> The unordered results are in:
    >>
    >>           testname           |   nrows   |    duration     | accesses
    >> -----------------------------+**-----------+-----------------+**----------
    >>   points unordered buffered   | 250000000 | 05:56:58.575789 |  2241050
    >>   points unordered auto       | 250000000 | 05:34:12.187479 |  2246420
    >>   points unordered unbuffered | 250000000 | 04:38:48.663952 |  2244228
    >>
    >> Although the buffered build doesn't lose as badly as it did with more
    >> overlap, it still doesn't look good :-(. Any ideas?
    >
    > But it's still a lot of overlap. It's about 220 accesses per small area
    > request. It's about 10 - 20 times greater than should be without overlaps.
    > If we roughly assume that 10 times more overlap makes 1/10 of tree to be
    > used for actual inserts, then that part of tree can easily fit to the cache.
    > You can try my splitting algorithm on your test setup (it this case I advice
    > to start from smaller number of rows, 100 M for example).
    > I'm requesting real-life datasets which makes troubles in real life from
    > Oleg. Probably those datasets is even larger or new linear split produce
    > less overlaps on them.
    
    I made a small tweak to the patch, and got much better results (this is 
    with my original method of generating the data):
    
               testname           |   nrows   |    duration     | accesses
    -----------------------------+-----------+-----------------+----------
      points unordered buffered   | 250000000 | 03:34:23.488275 |  3945486
      points unordered auto       | 250000000 | 02:55:10.248722 |  3767548
      points unordered unbuffered | 250000000 | 04:02:04.168138 |  4564986
    
    The tweak I made was to the way buffers are emptied in the final 
    emptying phase. Previously, it repeatedly looped through all the buffers 
    at a level, until there were no more non-empty buffers at the level. 
    When a buffer was split while it was being emptied, processing that 
    buffer stopped, and the emptying process moved on to the next buffer. I 
    changed it so that when a buffer splits, we continue emptying that 
    buffer until it's completely empty. That behavior is much more 
    cache-friendly, which shows as much better overall performance.
    
    I probably changed that behavior for the worse in previous my rounds of 
    cleanup. Anyway, attached is the patch I used to get the above numbers. 
    Now that the performance problem is fixed, I'll continue reviewing and 
    cleaning up the patch.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
  131. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-09-05T16:30:09Z

    On 05.09.2011 14:10, Heikki Linnakangas wrote:
    > On 01.09.2011 12:23, Alexander Korotkov wrote:
    >> On Thu, Sep 1, 2011 at 12:59 PM, Heikki Linnakangas<
    >> heikki.linnakangas@enterprisedb.com> wrote:
    >>
    >>> So I changed the test script to generate the table as:
    >>>
    >>> CREATE TABLE points AS SELECT random() as x, random() as y FROM
    >>> generate_series(1, $NROWS);
    >>>
    >>> The unordered results are in:
    >>>
    >>> testname | nrows | duration | accesses
    >>> -----------------------------+**-----------+-----------------+**----------
    >>>
    >>> points unordered buffered | 250000000 | 05:56:58.575789 | 2241050
    >>> points unordered auto | 250000000 | 05:34:12.187479 | 2246420
    >>> points unordered unbuffered | 250000000 | 04:38:48.663952 | 2244228
    >>>
    >>> Although the buffered build doesn't lose as badly as it did with more
    >>> overlap, it still doesn't look good :-(. Any ideas?
    >>
    >> But it's still a lot of overlap. It's about 220 accesses per small area
    >> request. It's about 10 - 20 times greater than should be without
    >> overlaps.
    >> If we roughly assume that 10 times more overlap makes 1/10 of tree to be
    >> used for actual inserts, then that part of tree can easily fit to the
    >> cache.
    >> You can try my splitting algorithm on your test setup (it this case I
    >> advice
    >> to start from smaller number of rows, 100 M for example).
    >> I'm requesting real-life datasets which makes troubles in real life from
    >> Oleg. Probably those datasets is even larger or new linear split produce
    >> less overlaps on them.
    >
    > I made a small tweak to the patch, and got much better results (this is
    > with my original method of generating the data):
    >
    > testname | nrows | duration | accesses
    > -----------------------------+-----------+-----------------+----------
    > points unordered buffered | 250000000 | 03:34:23.488275 | 3945486
    > points unordered auto | 250000000 | 02:55:10.248722 | 3767548
    > points unordered unbuffered | 250000000 | 04:02:04.168138 | 4564986
    
    The full results of this test are in:
    
               testname           |   nrows   |    duration     | accesses
    -----------------------------+-----------+-----------------+----------
      points unordered buffered   | 250000000 | 03:34:23.488275 |  3945486
      points unordered auto       | 250000000 | 02:55:10.248722 |  3767548
      points unordered unbuffered | 250000000 | 04:02:04.168138 |  4564986
      points ordered buffered     | 250000000 | 02:00:10.467914 |  5572906
      points ordered auto         | 250000000 | 02:16:01.859039 |  5435673
      points ordered unbuffered   | 250000000 | 03:23:18.061742 |  1875826
    (6 rows)
    
    Interestingly, in this test case the buffered build was significantly 
    faster even in the case of ordered input - but the quality of the 
    produced index was much worse. I suspect it's because of the 
    last-in-first-out nature of the buffering, tuples that pushed into 
    buffers first are flushed to lower levels last. Tweaking the data 
    structures to make the buffer flushing a FIFO process might help with 
    that, but I'm afraid that might make our cache hit ratio worse when 
    reading pages from the temporary file.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  132. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-09-05T22:18:51Z

    Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum
    doesn't exceed maximal offset number.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  133. Re: WIP: Fast GiST index build

    Stefan Keller <sfkeller@gmail.com> — 2011-09-06T13:40:44Z

    Hi,
    
    Unlogged tables seems to me to follow a similar goal. Obviously GiST
    indexes are not supported there.
    Do you know the technical reason?
    Do you see some synergy in your work on fast GiST index building and
    unlogged tables?
    
    Yours, Stefan
    
    2011/9/6 Alexander Korotkov <aekorotkov@gmail.com>:
    > Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum
    > doesn't exceed maximal offset number.
    > ------
    > With best regards,
    > Alexander Korotkov.
    >
    > --
    > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
    > To make changes to your subscription:
    > http://www.postgresql.org/mailpref/pgsql-hackers
    >
    >
    
    
  134. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-09-06T13:51:50Z

    Hi!
    
    
    > Unlogged tables seems to me to follow a similar goal. Obviously GiST
    > indexes are not supported there.
    > Do you know the technical reason?
    >
    GiST using serial numbers of operations for concurrency. In current
    implementation xlog record ids are used in capacity of that numbers. In
    unlogged table no xlog records are produced. So, we haven't serial numbers
    of operations. AFAIK, it's enough to provide some other source of serial
    number in order to make GiST work with unlogged tables.
    
    
    > Do you see some synergy in your work on fast GiST index building and
    > unlogged tables?
    >
    With tecnhique discussing in this thread GiST build can win form unlogged as
    much as with regular build.
    
    ------
    With best regards,
    Alexander Korotkov.
    
  135. Re: WIP: Fast GiST index build

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-09-08T14:59:16Z

    On 06.09.2011 01:18, Alexander Korotkov wrote:
    > Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum
    > doesn't exceed maximal offset number.
    
    I've committed the patch now, including that fix. Thanks for a great 
    GSoC project!
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  136. Re: WIP: Fast GiST index build

    Robert Haas <robertmhaas@gmail.com> — 2011-09-08T15:28:28Z

    On Thu, Sep 8, 2011 at 10:59 AM, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    > On 06.09.2011 01:18, Alexander Korotkov wrote:
    >>
    >> Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum
    >> doesn't exceed maximal offset number.
    >
    > I've committed the patch now, including that fix. Thanks for a great GSoC
    > project!
    
    Wow, major congrats, Alexander!
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  137. Fast GiST index build - further improvements

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-09-08T16:35:37Z

    Now that the main GiST index build patch has been committed, there's a 
    few further improvements that could make it much faster still:
    
    Better management of the buffer pages on disk. At the moment, the 
    temporary file is used as a heap of pages belonging to all the buffers 
    in random order. I think we could speed up the algorithm considerably by 
    reading/writing the buffer pages sequentially. For example, when an 
    internal page is split, and all the tuples in its buffer are relocated, 
    that would be a great chance to write the new pages of the new buffers 
    in sequential order, instead of writing them back to the pages freed up 
    by the original buffer, which can be spread all around the temp file. I 
    wonder if we could use a separate file for each buffer? Or at least, a 
    separate file for all buffers that are larger than, say 100 MB in size.
    
    Better management of in-memory buffer pages. When we start emptying a 
    buffer, we currently flush all the buffer pages in memory to the 
    temporary file, to make room for new buffer pages. But that's a waste of 
    time, if some of the pages we had in memory belonged to the buffer we're 
    about to empty next, or that we empty tuples to. Also, if while emptying 
    a buffer, all the tuples go to just one or two lower level buffers, it 
    would be beneficial to keep more than one page in-memory for those buffers.
    
    Buffering leaf pages. This I posted on a separate thread already: 
    http://archives.postgresql.org/message-id/4E5350DB.3060209@enterprisedb.com
    
    
    Also, at the moment there is one issue with the algorithm that we have 
    glossed over this far: For each buffer, we keep some information in 
    memory, in the hash table, and in the auxiliary lists. That means that 
    the amount of memory needed for the build scales with the size of the 
    index. If you're dealing with very large indexes, hopefully you also 
    have a lot of RAM in your system, so I don't think this is a problem in 
    practice. Still, it would be nice to do something about that. A 
    straightforward idea would be to swap some of the information to disk. 
    Another idea that, simpler to implement, would be to completely destroy 
    a buffer, freeing all the memory it uses, when it becomes completely 
    empty. Then, if you're about to run out of memory (as defined by 
    maintenance_work_mem), you can empty some low level buffers to disk to 
    free up some.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  138. Re: WIP: Fast GiST index build

    Oleg Bartunov <oleg@sai.msu.su> — 2011-09-08T18:11:00Z

    My congratulations too, Alexander ! Hope to work on SP-GiST together !
    
    Oleg
    
    On Thu, 8 Sep 2011, Heikki Linnakangas wrote:
    
    > On 06.09.2011 01:18, Alexander Korotkov wrote:
    >> Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum
    >> doesn't exceed maximal offset number.
    >
    > I've committed the patch now, including that fix. Thanks for a great GSoC 
    > project!
    >
    >
    
     	Regards,
     		Oleg
    _____________________________________________________________
    Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
    Sternberg Astronomical Institute, Moscow University, Russia
    Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
    phone: +007(495)939-16-83, +007(495)939-23-83
    
    
  139. Re: WIP: Fast GiST index build

    Alexander Korotkov <aekorotkov@gmail.com> — 2011-09-08T19:25:58Z

    Thanks for congratulations!
    Tnanks to Heikki for mentoring and his work on patch!
    
    ------
    With best regards,
    Alexander Korotkov.
    
  140. Re: WIP: Fast GiST index build

    Stefan Keller <sfkeller@gmail.com> — 2011-09-13T22:00:18Z

    Robert,
    
    2011/9/6 Alexander Korotkov <aekorotkov@gmail.com>:
    > GiST use serial numbers of operations for concurrency. In current
    > implementation xlog record ids are used in capacity of that numbers. In
    > unlogged table no xlog records are produced. So, we haven't serial numbers
    > of operations. AFAIK, it's enough to provide some other source of serial
    > number in order to make GiST work with unlogged tables.
    
    GiST is IMHO quite broadly used. I use it for example for indexing
    geometry and hstore types and there's no other choice there.
    Do you know whether unlogged option in create table will support GiST
    in the next release?
    
    Stefan
    
    
  141. Re: WIP: Fast GiST index build

    Stefan Keller <sfkeller@gmail.com> — 2011-09-14T06:43:33Z

    I'm on the way to open a ticket for hash indexes (adding WAL support) anyway:
    May I open a ticket for adding GiST support to unlogged tables ?
    
    Stefan
    
    2011/9/14 Stefan Keller <sfkeller@gmail.com>:
    > Robert,
    >
    > 2011/9/6 Alexander Korotkov <aekorotkov@gmail.com>:
    >> GiST use serial numbers of operations for concurrency. In current
    >> implementation xlog record ids are used in capacity of that numbers. In
    >> unlogged table no xlog records are produced. So, we haven't serial numbers
    >> of operations. AFAIK, it's enough to provide some other source of serial
    >> number in order to make GiST work with unlogged tables.
    >
    > GiST is IMHO quite broadly used. I use it for example for indexing
    > geometry and hstore types and there's no other choice there.
    > Do you know whether unlogged option in create table will support GiST
    > in the next release?
    >
    > Stefan
    >
    
    
  142. Re: WIP: Fast GiST index build

    Robert Haas <robertmhaas@gmail.com> — 2011-09-14T12:12:15Z

    On Tue, Sep 13, 2011 at 5:00 PM, Stefan Keller <sfkeller@gmail.com> wrote:
    > 2011/9/6 Alexander Korotkov <aekorotkov@gmail.com>:
    >> GiST use serial numbers of operations for concurrency. In current
    >> implementation xlog record ids are used in capacity of that numbers. In
    >> unlogged table no xlog records are produced. So, we haven't serial numbers
    >> of operations. AFAIK, it's enough to provide some other source of serial
    >> number in order to make GiST work with unlogged tables.
    >
    > GiST is IMHO quite broadly used. I use it for example for indexing
    > geometry and hstore types and there's no other choice there.
    > Do you know whether unlogged option in create table will support GiST
    > in the next release?
    
    It's probably not a difficult patch to write, but I don't have any
    current plans to work on it.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  143. Re: Fast GiST index build - further improvements

    Alexander Korotkov <aekorotkov@gmail.com> — 2012-02-02T19:54:37Z

    On Thu, Sep 8, 2011 at 8:35 PM, Heikki Linnakangas <
    heikki.linnakangas@enterprisedb.com> wrote:
    
    > Now that the main GiST index build patch has been committed, there's a few
    > further improvements that could make it much faster still:
    >
    > Better management of the buffer pages on disk. At the moment, the
    > temporary file is used as a heap of pages belonging to all the buffers in
    > random order. I think we could speed up the algorithm considerably by
    > reading/writing the buffer pages sequentially. For example, when an
    > internal page is split, and all the tuples in its buffer are relocated,
    > that would be a great chance to write the new pages of the new buffers in
    > sequential order, instead of writing them back to the pages freed up by the
    > original buffer, which can be spread all around the temp file. I wonder if
    > we could use a separate file for each buffer? Or at least, a separate file
    > for all buffers that are larger than, say 100 MB in size.
    >
    > Better management of in-memory buffer pages. When we start emptying a
    > buffer, we currently flush all the buffer pages in memory to the temporary
    > file, to make room for new buffer pages. But that's a waste of time, if
    > some of the pages we had in memory belonged to the buffer we're about to
    > empty next, or that we empty tuples to. Also, if while emptying a buffer,
    > all the tuples go to just one or two lower level buffers, it would be
    > beneficial to keep more than one page in-memory for those buffers.
    >
    > Buffering leaf pages. This I posted on a separate thread already:
    > http://archives.postgresql.**org/message-id/4E5350DB.**
    > 3060209@enterprisedb.com<http://archives.postgresql.org/message-id/4E5350DB.3060209@enterprisedb.com>
    >
    >
    > Also, at the moment there is one issue with the algorithm that we have
    > glossed over this far: For each buffer, we keep some information in memory,
    > in the hash table, and in the auxiliary lists. That means that the amount
    > of memory needed for the build scales with the size of the index. If you're
    > dealing with very large indexes, hopefully you also have a lot of RAM in
    > your system, so I don't think this is a problem in practice. Still, it
    > would be nice to do something about that. A straightforward idea would be
    > to swap some of the information to disk. Another idea that, simpler to
    > implement, would be to completely destroy a buffer, freeing all the memory
    > it uses, when it becomes completely empty. Then, if you're about to run out
    > of memory (as defined by maintenance_work_mem), you can empty some low
    > level buffers to disk to free up some.
    
    Unfortunately, I hadn't enough of time to implement something of this
    before 9.2 release. Work on my Phd. thesis and web-projects takes too much
    time.
    
    But, I think there is one thing we should fix before 9.2 release. We assume
    that gist index build have at least effective_cache_size/4 of cache. This
    assumption could easily be false on high concurrency systems. I don't see
    the way for convincing estimate here, but we could document this behaviour.
    So, users could just tune effective_cache_size for gist index build on high
    concurrency.
    
    ------
    With best regards,
    Alexander Korotkov.