Re: WIP: Fast GiST index build

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>

From: Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>
To: Alexander Korotkov <aekorotkov@gmail.com>
Cc: Robert Haas <robertmhaas@gmail.com>, pgsql-hackers <pgsql-hackers@postgresql.org>
Date: 2011-08-16T12:04:37Z
Lists: pgsql-hackers
Looking at the calculation of levelStep:

> + 	/*
> + 	 * Calculate levelStep by available amount of memory. We should be able to
> + 	 * load into main memory one page of each underlying node buffer (which
> + 	 * are in levelStep below). That give constraint over
> + 	 * maintenance_work_mem. Also we should be able to have subtree of
> + 	 * levelStep level in cache. That give constraint over
> + 	 * effective_cache_size.
> + 	 *
> + 	 * i'th underlying level of sub-tree can consists of
> + 	 * i^maxIndexTuplesPerPage pages at maximum. So, subtree of levelStep
> + 	 * levels can't be greater then 2 * maxIndexTuplesPerPage ^ levelStep
> + 	 * pages. We use some more reserve due to we probably can't take whole
> + 	 * effective cache and use formula 4 * maxIndexTuplesPerPage ^ levelStep =
> + 	 * effectiveCache. We use similar logic with maintenance_work_mem. We
> + 	 * should be able to store at least last pages of all buffers where we are
> + 	 * emptying current buffer to.
> + 	 */
> + 	effectiveMemory = Min(maintenance_work_mem * 1024 / BLCKSZ,
> + 						  effective_cache_size);
> + 	levelStep = (int) log((double) effectiveMemory / 4.0) /
> + 		log((double) maxIndexTuplesPerPage);
> +

I can see that that's equal to the formula given in the paper, 
log_B(M/4B), but I couldn't see any explanation for that formula in the 
paper. Your explanation makes sense, but where did it come from?

It seems a bit pessimistic: while it's true that the a subtree can't be 
larger than 2 * maxIndexTuplesPerPage ^ levelStep, you can put a tighter 
upper bound on it. The max size of a subtree of depth n can be 
calculated as the geometric series:

r^0 + r^1 + r^2 + ... + r^n = (1 - r^(n + 1)) / (1 - r)

where r = maxIndexTuplesPerPage. For r=2 those formulas are equal, but 
for a large r and small n (which is typical), the 2 * 
maxIndexTuplesPerPage^levelStep formula gives a value that's almost 
twice as large as the real max size of a subtree.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com