Re: Row pattern recognition

Tatsuo Ishii <ishii@sraoss.co.jp>
From: Tatsuo Ishii <ishii@sraoss.co.jp>
To: vik@postgresfriends.org, jchampion@timescale.com
Cc: pgsql-hackers@postgresql.org
Date: 2023-07-26T12:21:34Z
Lists: pgsql-hackers
Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
Add temporal FOREIGN KEY contraints
- 89f908a6d0ac 18.0 cited
Remove obsolete executor cleanup code
- d060e921ea5a 17.0 cited
Attachments

v3-0001-Row-pattern-recognition-patch-for-raw-parser.patch (text/x-patch)
Attached is the v3 patch. In this patch following changes are made.

(1) I completely changed the pattern matching engine so that it
performs backtracking. Now the engine evaluates all pattern elements
defined in PATTER against each row, saving matched pattern variables
in a string per row. For example if the pattern element A and B
evaluated to true, a string "AB" is created for current row.

This continues until all pattern matching fails or encounters the end
of full window frame/partition. After that, the pattern matching
engine creates all possible "pattern strings" and apply the regular
expression matching to each. For example if we have row 0 = "AB" row 1
= "C", possible pattern strings are: "AC" and "BC".

If it matches, the length of matching substring is saved. After all
possible trials are done, the longest matching substring is chosen and
it becomes the width (number of rows) in the reduced window frame.

See row_is_in_reduced_frame, search_str_set and search_str_set_recurse
in nodeWindowAggs.c for more details. For now I use a naive depth
first search and probably there is a lot of rooms for optimization
(for example rewriting it without using
recursion). Suggestions/patches are welcome.

Jacob Champion wrote:
> It looks like the + qualifier has trouble when it meets the end of the
> frame. For instance, try removing the last row of the 'stock' table in
> rpr.sql; some of the final matches will disappear unexpectedly. Or try a
> pattern like
> 
>     PATTERN ( a+ )
>      DEFINE a AS TRUE
> 
> which doesn't seem to match anything in my testing.
> 
> There's also the issue of backtracking in the face of reclassification,
> as I think Vik was alluding to upthread. The pattern
> 
>     PATTERN ( a+ b+ )
>      DEFINE a AS col = 2,
>             b AS col = 2

With the new engine, cases above do not fail anymore. See new
regression test cases. Thanks for providing valuable test cases!

(2) Make window functions RPR aware. Now first_value, last_value, and
nth_value recognize RPR (maybe first_value do not need any change?)

Vik Fearing wrote:
> I strongly disagree with this.  Window function do not need to know
> how the frame is defined, and indeed they should not.
> WinGetFuncArgInFrame should answer yes or no and the window function
> just works on that. Otherwise we will get extension (and possibly even
> core) functions that don't handle the frame properly.

So I moved row_is_in_reduce_frame into WinGetFuncArgInFrame so that
those window functions are not needed to be changed.

(3) Window function rpr was removed. We can use first_value instead.

(4) Remaining tasks/issues.

- For now I disable WinSetMarkPosition because RPR often needs to
  access a row before the mark is set. We need to fix this in the
  future.

- I am working on making window aggregates RPR aware now. The
  implementation is in progress and far from completeness. An example
  is below. I think row 2, 3, 4 of "count" column should be NULL
  instead of 3, 2, 0, 0. Same thing can be said to other
  rows. Probably this is an effect of moving aggregate but I still
  studying the window aggregation code.

SELECT company, tdate, first_value(price) OVER W, count(*) OVER w FROM stock
 WINDOW w AS (
 PARTITION BY company
 ORDER BY tdate
 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
 AFTER MATCH SKIP PAST LAST ROW
 INITIAL
 PATTERN (START UP+ DOWN+)
 DEFINE
  START AS TRUE,
  UP AS price > PREV(price),
  DOWN AS price < PREV(price)
);
 company  |   tdate    | first_value | count 
----------+------------+-------------+-------
 company1 | 2023-07-01 |         100 |     4
 company1 | 2023-07-02 |             |     3
 company1 | 2023-07-03 |             |     2
 company1 | 2023-07-04 |             |     0
 company1 | 2023-07-05 |             |     0
 company1 | 2023-07-06 |          90 |     4
 company1 | 2023-07-07 |             |     3
 company1 | 2023-07-08 |             |     2
 company1 | 2023-07-09 |             |     0
 company1 | 2023-07-10 |             |     0
 company2 | 2023-07-01 |          50 |     4
 company2 | 2023-07-02 |             |     3
 company2 | 2023-07-03 |             |     2
 company2 | 2023-07-04 |             |     0
 company2 | 2023-07-05 |             |     0
 company2 | 2023-07-06 |          60 |     4
 company2 | 2023-07-07 |             |     3
 company2 | 2023-07-08 |             |     2
 company2 | 2023-07-09 |             |     0
 company2 | 2023-07-10 |             |     0

- If attributes appearing in DEFINE are not used in the target list, query fails.

SELECT company, tdate, count(*) OVER w FROM stock
 WINDOW w AS (
 PARTITION BY company
 ORDER BY tdate
 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
 AFTER MATCH SKIP PAST LAST ROW
 INITIAL
 PATTERN (START UP+ DOWN+)
 DEFINE
  START AS TRUE,
  UP AS price > PREV(price),
  DOWN AS price < PREV(price)
);
ERROR:  attribute number 3 exceeds number of columns 2

This is because attributes appearing in DEFINE are not added to the
target list. I am looking for way to teach planner to add attributes
appearing in DEFINE.

I am going to add this thread to CommitFest and plan to add both of
you as reviewers. Thanks in advance.
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp