Re: Row pattern recognition

Tatsuo Ishii <ishii@postgresql.org>

From: Tatsuo Ishii <ishii@postgresql.org>

To: david.g.johnston@gmail.com, vik@postgresfriends.org, jacob.champion@enterprisedb.com, er@xs4all.nl, peter@eisentraut.org

Cc: pgsql-hackers@postgresql.org

Date: 2024-12-30T13:37:18Z

Lists: pgsql-hackers

Commits


Add temporal FOREIGN KEY contraints
- 89f908a6d0ac 18.0 cited
Remove obsolete executor cleanup code
- d060e921ea5a 17.0 cited

Attachments

v26-0001-Row-pattern-recognition-patch-for-raw-parser.patch (text/x-patch)

I have added further optimization to the v25 patch.

While generating candidate input strings that may satisfy the pattern
string, it is possible to skip running the regexp in some cases. Since
regexp matching is a heavy operation, especially when it is applied to
a long string, it is beneficial for RPR to reduce the number of regexp
runs.

If the tail pattern variable has a '+' quantifier, the previous input
string was confirmed to match the pattern string, and the new input
string is the previous one with the tail pattern character appended,
we don't need to run the regexp match again because the new input
string surely matches the pattern string. Suppose the pattern string
is "ab+" and the current input string is "ab" (this satisfies "ab+").
If the new input string is "abb", then "abb" surely matches "ab+" too
and we don't need to run the regexp again.
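
The shortcut above can be sketched outside the patch as follows. This
is only a minimal illustration in Python, not the C code in v26: the
names make_matcher, matches and regexp_runs are hypothetical, and the
sketch assumes the pattern string consists only of lowercase letters
and the quantifiers '+', '*' and '?', as in the "ab+" example.

```python
import re


def make_matcher(pattern):
    """Return (matches, state) for checking successive input strings.

    Hypothetical helper for illustration; not part of the patch.
    """
    compiled = re.compile(pattern)
    # The shortcut is only enabled for patterns made of lowercase letters
    # and simple quantifiers that end in "<letter>+" (an assumption of
    # this sketch, mirroring the "ab+" example).
    if re.fullmatch(r"[a-z+*?]*[a-z]\+", pattern):
        tail = pattern[-2]
    else:
        tail = None
    state = {"prev": None, "prev_matched": False, "regexp_runs": 0}

    def matches(s):
        # Previous string matched and the new string is the previous one
        # plus the tail character: it must match too, so skip the regexp.
        if (tail is not None and state["prev_matched"]
                and s == state["prev"] + tail):
            state["prev"] = s
            return True
        # Otherwise fall back to a real regexp run.
        state["regexp_runs"] += 1
        ok = compiled.fullmatch(s) is not None
        state["prev"] = s
        state["prev_matched"] = ok
        return ok

    return matches, state
```

Calling matches("ab") and then matches("abb") for the pattern "ab+"
runs the regexp only once; the second call is answered by the shortcut.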

With the v26 patch, using the technique above, I get a performance
improvement.

>> EXPLAIN (ANALYZE)
>> SELECT aid, bid, count(*) OVER w
>> FROM pgbench_accounts WHERE aid <= 10000
>> WINDOW w AS (
>> PARTITION BY bid
>> ORDER BY aid
>> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
>> AFTER MATCH SKIP PAST LAST ROW
>> INITIAL
>> PATTERN (START UP+)
>> DEFINE
>> START AS TRUE,
>> UP AS aid > PREV(aid)
>> );

This SQL took 322.5913 ms (average of 3 runs) in v24. With the v26
patch, it takes 41.84 ms, which is over 7 times faster. I also ran the
SQL in the 100k row case. v23 took 26 seconds. With the v26 patch it
takes 1195.603 ms, which is over 21 times faster.
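
For reference, the ratios quoted above can be recomputed from the
reported timings. The variable names are only for illustration, and
"26 seconds" is taken as 26000 ms.

```python
# Timings as reported in this message, in milliseconds.
v24_10k_ms = 322.5913    # 10k rows, v24, average of 3 runs
v26_10k_ms = 41.84       # 10k rows, v26
v23_100k_ms = 26000.0    # 100k rows, v23 ("26 seconds")
v26_100k_ms = 1195.603   # 100k rows, v26

speedup_10k = v24_10k_ms / v26_10k_ms
speedup_100k = v23_100k_ms / v26_100k_ms

print(f"10k rows:  {speedup_10k:.1f}x")
print(f"100k rows: {speedup_100k:.1f}x")
```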

I think a pattern string ending with '+' is one of the common use
cases of RPR, and I believe the improvement is useful for many RPR
applications.

I also made the following changes to v25.

- Fix do_pattern_match to use the top memory context to store the
  compiled re cache. Before, it was in the per-query memory context.
  This caused trouble because do_pattern_match checks for the cache's
  existence using a static variable.

- Refactor search_str_set, which is the workhorse of pattern matching,
  into multiple functions to make the logic easier to understand.
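
The cache lifetime problem in the first item can be illustrated with a
toy model. This is one reading of the issue, not the actual code:
MemoryContext and cache_survives_second_query are hypothetical names,
and the sketch assumes the static variable keeps saying "cache exists"
after the per-query context holding the cache has been reset.

```python
class MemoryContext:
    """Toy stand-in for a memory context: reset() discards its contents."""

    def __init__(self):
        self.objects = {}

    def alloc(self, key, value):
        self.objects[key] = value

    def reset(self):
        self.objects.clear()


def cache_survives_second_query(use_top_context):
    """Run two queries; report whether the cache is still valid in the second."""
    top = MemoryContext()   # lives for the whole session
    initialized = False     # plays the role of the static variable
    cache_home = None
    alive = False
    for _ in range(2):
        per_query = MemoryContext()
        home = top if use_top_context else per_query
        if not initialized:
            # The cache is created once, guarded by the static flag.
            home.alloc("re_cache", {})
            cache_home = home
            initialized = True
        alive = "re_cache" in cache_home.objects
        per_query.reset()   # end of query: per-query memory is released
    return alive
```

In this model the cache is gone in the second query when it was placed
in the per-query context, and it survives when placed in the top one.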
  
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp