Re: Row pattern recognition

Tatsuo Ishii <ishii@postgresql.org>

From: Tatsuo Ishii <ishii@postgresql.org>

To: david.g.johnston@gmail.com, vik@postgresfriends.org, jacob.champion@enterprisedb.com, er@xs4all.nl, peter@eisentraut.org

Cc: pgsql-hackers@postgresql.org

Date: 2024-12-21T09:20:04Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Add temporal FOREIGN KEY contraints
- 89f908a6d0ac 18.0 cited
Remove obsolete executor cleanup code
- d060e921ea5a 17.0 cited

Attachments

v25-0001-Row-pattern-recognition-patch-for-raw-parser.patch (text/x-patch)

> I have looked into the performance of current RPR implementation,
> especially when the number of rows in a reduced frame is large (like
> over 10k). Below is a simple benchmark against pgbench database. The
> SQL will produce a reduced frame having 10k rows.
> 
> EXPLAIN (ANALYZE)
> SELECT aid, bid, count(*) OVER w
> FROM pgbench_accounts WHERE aid <= 10000
> WINDOW w AS (
> PARTITION BY bid
> ORDER BY aid
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> AFTER MATCH SKIP PAST LAST ROW
> INITIAL
> PATTERN (START UP+)
> DEFINE
> START AS TRUE,
> UP AS aid > PREV(aid)
> );
> 
> This took 722 ms on my laptop. It's not very quick. Moreover, if I
> expand the reduced frame size to 100k (aid <= 100000), OOM killer
> triggered. I looked into the code and found that do_pattern_match in
> nodeWindowAgg.c is one of the major problems. It calls regexp_instr to
> know whether the regular expression derived from a PATTERN clause
> (e.g. "ab+c+") matches an encoded row pattern variable string
> (e.g. "abbcc"). The latter string could be quite long: the length
> could be as same as the number of rows in the reduced frame. Thus, The
> length could become 100k if the frame size is 100k. Unfortunately
> regexp_instr needs to allocate and convert the input string to wchar
> (it's 4-byte long for each character), which uses 4x space bigger than
> the original input string. In RPR case the input string is always
> ASCII and does not need to be converted to wchar. So I decided to
> switch to the standard regular expression engine coming with OS. With
> this change, I got 2x speed up in the 10k case.
> 
> v23 patch: 722.618 ms (average of 3 runs)
> new patch: 322.5913 ms (average of 3 runs)
> 
> Also I tried the 100k rows reduced frame case. It was slow (took 26
> seconds) but it completed without OOM killer.  Attached is the
> patch. The change was in 0005 only. Other patches were not changed
> from v23.

The CFBot starts complaining about the patch. It fails in Windows
environment test because regex.h does not exist. I concluded that on
Windows it's not a good idea to use the standard regexp library (I am
not familiar with Windows. If my thought is not correct, please let me
know). So I switched to the PostgreSQL's builtin core regexp
library. Although the interface requires to use pg_wchar which spends
4x memory comparing with the standard regexp (that's why I wanted to
avoid using it), the result seems to be not so bad. It consumes only
10MB or so more memory when processing 100k rows in a frame. Good news
is, it runs slightly faster than the standard regexp (19 vs. 26
seconds).

Attached is the v25 patch to use the PostgreSQL's regexp library.
Most changes are in nodeWindowAgg.c, which is in the 5th patch.

In the patches I also fixed some memory leaks and run pgindent with
updated typedefs.list. Now the patch includes a patch for
typedefs.list (the 8th patch).

Best reagards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp