Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

Hannu Krosing <hannu@trust.ee>

From: Hannu Krosing <hannu@trust.ee>
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Bruce Momjian <maillist@candle.pha.pa.us>, Daniel Kalchev <daniel@digsys.bg>, Hiroshi Inoue <Inoue@tpf.co.jp>, pgsql-hackers@postgreSQL.org
Date: 1999-06-09T18:32:03Z
Lists: pgsql-hackers

Tom Lane wrote:
> 
> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> > This certainly explains it.  With locale enabled, LIKE does not use
> > indexes because we can't figure out how to do the indexing trick with
> > non-ASCII character sets because we can't figure out the maximum
> > character value for a particular encoding.
> 
> We don't actually need the *maximum* character value, what we need is
> to be able to generate a *slightly larger* character value.
> 
> For example, what the parser is doing now:
>         fld LIKE 'abc%' ==> fld <= 'abc\377'
> is not even really right in ASCII locale, because it will reject a
> data value like 'abc\377x'.
> 
> I think what we really want is to generate the "next value of the
> same length" and use a < comparison.  In ASCII locale this means
>         fld LIKE 'abc%' ==> fld < 'abd'
> which is reliable regardless of what comes after abc in the data.
> The trick is to figure out a "next" value without assuming a lot
> about the local character set and collation sequence.

In single-byte locales it should be easy:

1. Sort a char[256] array holding the values 0-255 using the current
   locale settings. Do it once, either at startup or when first needed.

2. Use binary search on that sorted array to locate the last char
   before the % in the LIKE string:
   if (it is not the last char in the sorted array)
   then (replace that char with the one at index+1)
   else (
     if (it is not the first char in the LIKE string)
     then (discard the last char and goto 2)
     else (don't do the end restriction)
   )
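
A rough C sketch of the two steps above, not tested against the actual
parser code. The names sorted_chars, cmp_char, init_sorted_chars and
make_upper_bound are made up for illustration. It assumes a single-byte
locale in which comparing one-character strings with strcoll() gives the
order that matters; locales with multi-character collation elements or
multi-level weights would need more than this. Ties from strcoll() are
broken by byte value so that the order is total and bsearch() can use
the same comparator as qsort().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* All 256 byte values, ordered by the current LC_COLLATE setting. */
static unsigned char sorted_chars[256];
static int sorted_ready = 0;

/* Compare two bytes as one-character strings under the current locale. */
static int
cmp_char(const void *a, const void *b)
{
    unsigned char ca = *(const unsigned char *) a;
    unsigned char cb = *(const unsigned char *) b;
    char sa[2], sb[2];
    int r;

    sa[0] = (char) ca; sa[1] = '\0';
    sb[0] = (char) cb; sb[1] = '\0';
    r = strcoll(sa, sb);
    /* Break ties by byte value so every byte has a unique position. */
    return (r != 0) ? r : (int) ca - (int) cb;
}

/* Step 1: build the sorted array once. */
static void
init_sorted_chars(void)
{
    int i;

    for (i = 0; i < 256; i++)
        sorted_chars[i] = (unsigned char) i;
    qsort(sorted_chars, 256, 1, cmp_char);
    sorted_ready = 1;
}

/*
 * Step 2: turn the fixed prefix of a LIKE pattern (the part before %)
 * into an exclusive upper bound, so that
 *     fld LIKE 'abc%'  ==>  fld < make_upper_bound("abc")
 * Returns a malloc'd string, or NULL when no end restriction is possible.
 */
char *
make_upper_bound(const char *prefix)
{
    size_t len = strlen(prefix);
    char *result;

    if (!sorted_ready)
        init_sorted_chars();
    result = malloc(len + 1);
    if (result == NULL)
        return NULL;
    memcpy(result, prefix, len + 1);

    while (len > 0)
    {
        unsigned char c = (unsigned char) result[len - 1];
        unsigned char *pos = (unsigned char *)
            bsearch(&c, sorted_chars, 256, 1, cmp_char);

        assert(pos != NULL);    /* every byte value is in the array */
        if (pos < sorted_chars + 255)
        {
            /* Not the last char in collation order: take index+1. */
            result[len - 1] = (char) pos[1];
            return result;
        }
        /* Last char in collation order: discard it and retry. */
        result[--len] = '\0';
    }
    free(result);
    return NULL;                /* don't do the end restriction */
}
```

The array is built for whatever locale is active at first use, so it
would have to be rebuilt if LC_COLLATE ever changed afterwards.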

Locales where the byte values are already in collation order
(ASCII, CYRILLIC) could be special-cased, since no sorted array is
needed there.

> But I am worried whether this trick will work in multibyte locales ---
> incrementing the last byte might generate an invalid character sequence
> and produce unpredictable results from strcmp.  So we need some help
> from someone who knows a lot about collation orders and multibyte
> character representations.

For double-byte locales something similar should work, but building
the initial sorted array is probably tricky.

----------------
Hannu