Thread

Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-07-02T18:37:28Z
On 02.07.2011 21:07, Tom Lane wrote:
> I wrote:
>> I tweaked the patch a bit further (don't pessimize the boundary case
>> where there's exactly 4n+1 trigrams, avoid forcing trg1 into memory,
>> improve the comment) and attach that version below.  This is a little
>> bit faster but I still wonder if it's worth the price of adding an
>> obscure dependency on the header size.
>
> It occurred to me that we could very easily remove that objection by
> making the code dynamically detect when it's reached a suitably aligned
> trigram.  The attached version of the patch does it that way.  It seems
> to be a percent or so slower than my previous version, but I think that
> making the function robust against header changes is probably well worth
> that price.

Ah, that opens the door to do something I considered earlier but 
rejected because of alignment: instead of three 32-bit word fetches, we 
could fetch one 64-bit word and 32-bit word. Might shave a few more 
cycles...

Meanwhile, I experimented with optimizing the other part of the loop: 
the HASH() macros for setting the right bits in the signature. With the 
default compile-time settings, the signature array is 95 bits. 
Currently, it's stored in a BITVEC, which is a byte array, but we could 
store it in two 64-bit integers, which makes it possible to write SETBIT 
differently. I experimented with a few approaches, first was essentially 
this:

+ /* Set the nth bit in the signature, in s1 or s2 */
+ #define HASH_S(h) \
+ 	do {									\
+ 		unsigned int n = HASHVAL(h);		\
+ 		if (((uint64) (n)) < 64)			\
+ 			s1 |= (uint64) 1<<(n);			\
+ 		else								\
+ 			s2 |= (uint64) 1<<((n) - 64);	\
+ 	} while(0)

That was a bit faster on my x64 laptop, but slightly slower on my ia64 
HP-UX box. My second try was to use lookup tables, patch attached. That 
was yet faster on x64, and a small win on the ia64 box too. I'm not sure 
it's worth the added code complexity, though.

Here's a summary of the timings I got with different versions:

ia64 HP-UX (anole):

unpatched: 11194.038 ms
fast_makesign-tom: 10064.980 ms
fast_makesign-2int: 10649.726 ms
fast_makesign-tbl: 9951.547 ms

x64 laptop:

unpatched: 4595,209 ms
fast_makesign-tom: 3346,548 ms
fast_makesign-2int: 3102,874 ms
fast_makesign-tbl: 2997,854 ms

I used the same "REINDEX INDEX i_words" test I used earlier, repeated 
each run a couple of times, and took the lowest number. 
fast_makesign-tom is the first patch you posted, I haven't tested your 
latest one. fast_makesign-2int is with the HASH_S() macro above, and 
has_makesign-tbl is with the attached patch.

PS. in case you wonder why the HP-UX box is so much slower than my 
laptop; this box isn't really meant for performance testing. It's just 
something I happen to have access to, I think it's a virtual machine of 
some sort. The numbers are very repeatable, however.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com