Thread
Commits
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
Add temporal FOREIGN KEY contraints
- 89f908a6d0ac 18.0 cited
-
Remove obsolete executor cleanup code
- d060e921ea5a 17.0 cited
-
Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-06-25T12:05:09Z
Attached is a PoC patch to implement "Row pattern recognition" (RPR) in SQL:2016 (I know SQL:2023 is already out, but I don't have access to it). Actually SQL:2016 defines RPR in two places[1]: Feature R010, “Row pattern recognition: FROM clause” Feature R020, “Row pattern recognition: WINDOW clause” The patch includes partial support for R020 part. - What is RPR? RPR provides a way to search series of data using regular expression patterns. Suppose you have a stock database. CREATE TABLE stock ( company TEXT, tdate DATE, price BIGINT); You want to find a "V-shaped" pattern: i.e. price goes down for 1 or more days, then goes up for 1 or more days. If we try to implement the query in PostgreSQL, it could be quite complex and inefficient. RPR provides convenient way to implement the query. SELECT company, tdate, price, rpr(price) OVER w FROM stock WINDOW w AS ( PARTITION BY company ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING PATTERN (START DOWN+ UP+) DEFINE START AS TRUE, UP AS price > PREV(price), DOWN AS price < PREV(price) ); "PATTERN" and "DEFINE" are the key clauses of RPR. DEFINE defines 3 "row pattern variables" namely START, UP and DOWN. They are associated with logical expressions namely "TRUE", "price > PREV(price)", and "price < PREV(price)". Note that "PREV" function returns price column in the previous row. So, UP is true if price is higher than previous day. On the other hand, DOWN is true if price is lower than previous day. PATTERN uses the row pattern variables to create a necessary pattern. In this case, the first row is always match because START is always true, and second or more rows match with "UP" ('+' is a regular expression representing one or more), and subsequent rows match with "DOWN". Here is the sample output. company | tdate | price | rpr ----------+------------+-------+------ company1 | 2023-07-01 | 100 | company1 | 2023-07-02 | 200 | 200 -- 200->150->140->150 company1 | 2023-07-03 | 150 | 150 -- 150->140->150 company1 | 2023-07-04 | 140 | company1 | 2023-07-05 | 150 | 150 -- 150->90->110->130 company1 | 2023-07-06 | 90 | company1 | 2023-07-07 | 110 | company1 | 2023-07-08 | 130 | company1 | 2023-07-09 | 120 | company1 | 2023-07-10 | 130 | rpr shows the first row if all the patterns are satisfied. In the example above 200, 150, 150 are the cases. Other rows are shown as NULL. For example, on 2023-07-02 price starts with 200, then goes down to 150 then 140 but goes up 150 next day. As far as I know, only Oracle implements RPR (only R010. R020 is not implemented) among OSS/commercial RDBMSs. There are a few DWH software having RPR. According to [2] they are Snowflake and MS Stream Analytics. It seems Trino is another one[3]. - Note about the patch The current implementation is not only a subset of the standard, but is different from it in some places. The example above is written as follows according to the standard: SELECT company, tdate, startprice OVER w FROM stock WINDOW w AS ( PARTITION BY company MEASURES START.price AS startprice ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING PATTERN (START DOWN+ UP+) DEFINE START AS TRUE, UP AS UP.price > PREV(UP.price), DOWN AS DOWN.price < PREV(DOWN.price) ); Notice that rpr(price) is written as START.price and startprice in the standard. MEASURES defines variable names used in the target list used with "OVER window". As OVER only allows functions in PostgreSQL, I had to make up a window function "rpr" which performs the row pattern recognition task. I was not able to find a way to implement expressions like START.price (START is not a table alias). Any suggestion is greatly appreciated. The differences from the standard include: o MEASURES is not supported o SUBSET is not supported o FIRST, LAST, CLASSIFIER are not supported o PREV/NEXT in the standard accept more complex arguments o Regular expressions other than "+" are not supported o Only AFTER MATCH SKIP TO NEXT ROW is supported (if AFTER MATCH is not specified, AFTER MATCH SKIP TO NEXT ROW is assumed. In the standard AFTER MATCH SKIP PAST LAST ROW is assumed in this case). I have a plan to implement AFTER MATCH SKIP PAST LAST ROW though. o INITIAL or SEEK are not supported ((behaves as if INITIAL is specified) o Aggregate functions associated with window clause using RPR do not respect RPR It seems RPR in the standard is quite complex. I think we can start with a small subset of RPR then we could gradually enhance the implementation. Comments and suggestions are welcome. [1] https://sqlperformance.com/2019/04/t-sql-queries/row-pattern-recognition-in-sql [2] https://link.springer.com/article/10.1007/s13222-022-00404-3 [3] https://trino.io/docs/current/sql/pattern-recognition-in-window.html -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Vik Fearing <vik@postgresfriends.org> — 2023-06-25T21:08:35Z
On 6/25/23 14:05, Tatsuo Ishii wrote: > Attached is a PoC patch to implement "Row pattern recognition" (RPR) > in SQL:2016 (I know SQL:2023 is already out, but I don't have access > to it). Actually SQL:2016 defines RPR in two places[1]: > > Feature R010, “Row pattern recognition: FROM clause” > Feature R020, “Row pattern recognition: WINDOW clause” > > The patch includes partial support for R020 part. I have been dreaming of and lobbying for someone to pick up this feature. I will be sure to review it from a standards perspective and will try my best to help with the technical aspect, but I am not sure to have the qualifications for that. THANK YOU! > (I know SQL:2023 is already out, but I don't have access to it) If you can, try to get ISO/IEC 19075-5 which is a guide to RPR instead of just its technical specification. https://www.iso.org/standard/78936.html > - What is RPR? > > RPR provides a way to search series of data using regular expression > patterns. Suppose you have a stock database. > > CREATE TABLE stock ( > company TEXT, > tdate DATE, > price BIGINT); > > You want to find a "V-shaped" pattern: i.e. price goes down for 1 or > more days, then goes up for 1 or more days. If we try to implement the > query in PostgreSQL, it could be quite complex and inefficient. > > RPR provides convenient way to implement the query. > > SELECT company, tdate, price, rpr(price) OVER w FROM stock > WINDOW w AS ( > PARTITION BY company > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > PATTERN (START DOWN+ UP+) > DEFINE > START AS TRUE, > UP AS price > PREV(price), > DOWN AS price < PREV(price) > ); > > "PATTERN" and "DEFINE" are the key clauses of RPR. DEFINE defines 3 > "row pattern variables" namely START, UP and DOWN. They are associated > with logical expressions namely "TRUE", "price > PREV(price)", and > "price < PREV(price)". Note that "PREV" function returns price column > in the previous row. So, UP is true if price is higher than previous > day. On the other hand, DOWN is true if price is lower than previous > day. PATTERN uses the row pattern variables to create a necessary > pattern. In this case, the first row is always match because START is > always true, and second or more rows match with "UP" ('+' is a regular > expression representing one or more), and subsequent rows match with > "DOWN". > > Here is the sample output. > > company | tdate | price | rpr > ----------+------------+-------+------ > company1 | 2023-07-01 | 100 | > company1 | 2023-07-02 | 200 | 200 -- 200->150->140->150 > company1 | 2023-07-03 | 150 | 150 -- 150->140->150 > company1 | 2023-07-04 | 140 | > company1 | 2023-07-05 | 150 | 150 -- 150->90->110->130 > company1 | 2023-07-06 | 90 | > company1 | 2023-07-07 | 110 | > company1 | 2023-07-08 | 130 | > company1 | 2023-07-09 | 120 | > company1 | 2023-07-10 | 130 | > > rpr shows the first row if all the patterns are satisfied. In the > example above 200, 150, 150 are the cases. Other rows are shown as > NULL. For example, on 2023-07-02 price starts with 200, then goes down > to 150 then 140 but goes up 150 next day. I don't understand this. RPR in a window specification limits the window to the matched rows, so this looks like your rpr() function is just the regular first_value() window function that we already have? > As far as I know, only Oracle implements RPR (only R010. R020 is not > implemented) among OSS/commercial RDBMSs. There are a few DWH software > having RPR. According to [2] they are Snowflake and MS Stream > Analytics. It seems Trino is another one[3]. I thought DuckDB had it already, but it looks like I was wrong. > - Note about the patch > > The current implementation is not only a subset of the standard, but > is different from it in some places. The example above is written as > follows according to the standard: > > SELECT company, tdate, startprice OVER w FROM stock > WINDOW w AS ( > PARTITION BY company > MEASURES > START.price AS startprice > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > PATTERN (START DOWN+ UP+) > DEFINE > START AS TRUE, > UP AS UP.price > PREV(UP.price), > DOWN AS DOWN.price < PREV(DOWN.price) > ); > > Notice that rpr(price) is written as START.price and startprice in the > standard. MEASURES defines variable names used in the target list used > with "OVER window". As OVER only allows functions in PostgreSQL, I had > to make up a window function "rpr" which performs the row pattern > recognition task. I was not able to find a way to implement > expressions like START.price (START is not a table alias). Any > suggestion is greatly appreciated. As in your example, you cannot have START.price outside of the window specification; it can only go in the MEASURES clause. Only startprice is allowed outside and it gets its qualification from the OVER. Using w.startprice might have been better but that would require window names to be in the same namespace as range tables. This currently works in Postgres: SELECT RANK() OVER w FROM (VALUES (1)) AS w (x) WINDOW w AS (ORDER BY w.x); > The differences from the standard include: > > o MEASURES is not supported > o FIRST, LAST, CLASSIFIER are not supported > o PREV/NEXT in the standard accept more complex arguments > o INITIAL or SEEK are not supported ((behaves as if INITIAL is specified) Okay, for now. > o SUBSET is not supported Is this because you haven't done it yet, or because you ran into problems trying to do it? > o Regular expressions other than "+" are not supported This is what I had a hard time imagining how to do while thinking about it. The grammar is so different here and we allow many more operators (like "?" which is also the standard parameter symbol). People more expert than me will have to help here. > o Only AFTER MATCH SKIP TO NEXT ROW is supported (if AFTER MATCH is > not specified, AFTER MATCH SKIP TO NEXT ROW is assumed. In the > standard AFTER MATCH SKIP PAST LAST ROW is assumed in this case). I > have a plan to implement AFTER MATCH SKIP PAST LAST ROW though. In this case, we should require the user to specify AFTER MATCH SKIP TO NEXT ROW so that behavior doesn't change when we implement the standard default. (Your patch might do this already.) > o Aggregate functions associated with window clause using RPR do not respect RPR I do not understand what this means. > It seems RPR in the standard is quite complex. I think we can start > with a small subset of RPR then we could gradually enhance the > implementation. I have no problem with that as long as we don't paint ourselves into a corner. > Comments and suggestions are welcome. I have not looked at the patch yet, but is the reason for doing R020 before R010 because you haven't done the MEASURES clause yet? In any case, I will be watching this with a close eye, and I am eager to help in any way I can. -- Vik Fearing -
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-06-26T01:05:20Z
> I have been dreaming of and lobbying for someone to pick up this > feature. I will be sure to review it from a standards perspective and > will try my best to help with the technical aspect, but I am not sure > to have the qualifications for that. > > THANK YOU! Thank you for looking into my proposal. >> (I know SQL:2023 is already out, but I don't have access to it) > > If you can, try to get ISO/IEC 19075-5 which is a guide to RPR instead > of just its technical specification. > > https://www.iso.org/standard/78936.html Thanks for the info. > I don't understand this. RPR in a window specification limits the > window to the matched rows, so this looks like your rpr() function is > just the regular first_value() window function that we already have? No, rpr() is different from first_value(). rpr() returns the argument value at the first row in a frame only when matched rows found. On the other hand first_value() returns the argument value at the first row in a frame unconditionally. company | tdate | price | rpr | first_value ----------+------------+-------+------+------------- company1 | 2023-07-01 | 100 | | 100 company1 | 2023-07-02 | 200 | 200 | 200 company1 | 2023-07-03 | 150 | 150 | 150 company1 | 2023-07-04 | 140 | | 140 company1 | 2023-07-05 | 150 | 150 | 150 company1 | 2023-07-06 | 90 | | 90 company1 | 2023-07-07 | 110 | | 110 company1 | 2023-07-08 | 130 | | 130 company1 | 2023-07-09 | 120 | | 120 company1 | 2023-07-10 | 130 | | 130 For example, a frame starting with (tdate = 2023-07-02, price = 200) consists of rows (price = 200, 150, 140, 150) satisfying the pattern, thus rpr returns 200. Since in this example frame option "ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING" is specified, next frame starts with (tdate = 2023-07-03, price = 150). This frame satisfies the pattern too (price = 150, 140, 150), and rpr retus 150... and so on. > As in your example, you cannot have START.price outside of the window > specification; it can only go in the MEASURES clause. Only startprice > is allowed outside and it gets its qualification from the OVER. Using > w.startprice might have been better but that would require window > names to be in the same namespace as range tables. > > This currently works in Postgres: > > SELECT RANK() OVER w > FROM (VALUES (1)) AS w (x) > WINDOW w AS (ORDER BY w.x); Interesting. >> o SUBSET is not supported > > Is this because you haven't done it yet, or because you ran into > problems trying to do it? Because it seems SUBSET is not useful without MEASURES support. Thus my plan is, firstly implement MEASURES, then SUBSET. What do you think? >> o Regular expressions other than "+" are not supported > > This is what I had a hard time imagining how to do while thinking > about it. The grammar is so different here and we allow many more > operators (like "?" which is also the standard parameter symbol). > People more expert than me will have to help here. Yes, that is a problem. > In this case, we should require the user to specify AFTER MATCH SKIP > TO NEXT ROW so that behavior doesn't change when we implement the > standard default. (Your patch might do this already.) Agreed. I will implement AFTER MATCH SKIP PAST LAST ROW in the next patch and I will change the default to AFTER MATCH SKIP PAST LAST ROW. >> o Aggregate functions associated with window clause using RPR do not >> respect RPR > > I do not understand what this means. Ok, let me explain. See example below. In my understanding "count" should retun the number of rows in a frame restriced by the match condition. For example at the first line (2023-07-01 | 100) count returns 10. I think this should be 0 because the "restriced" frame starting at the line contains no matched row. On the other hand the (restricted) frame starting at second line (2023-07-02 | 200) contains 4 rows, thus count should return 4, instead of 9. SELECT company, tdate, price, rpr(price) OVER w, count(*) OVER w FROM stock WINDOW w AS ( PARTITION BY company ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING PATTERN (START DOWN+ UP+) DEFINE START AS TRUE, UP AS price > PREV(price), DOWN AS price < PREV(price) ); company | tdate | price | rpr | count ----------+------------+-------+------+------- company1 | 2023-07-01 | 100 | | 10 company1 | 2023-07-02 | 200 | 200 | 9 company1 | 2023-07-03 | 150 | 150 | 8 company1 | 2023-07-04 | 140 | | 7 company1 | 2023-07-05 | 150 | 150 | 6 company1 | 2023-07-06 | 90 | | 5 company1 | 2023-07-07 | 110 | | 4 company1 | 2023-07-08 | 130 | | 3 company1 | 2023-07-09 | 120 | | 2 company1 | 2023-07-10 | 130 | | 1 >> It seems RPR in the standard is quite complex. I think we can start >> with a small subset of RPR then we could gradually enhance the >> implementation. > > I have no problem with that as long as we don't paint ourselves into a > corner. Totally agreed. >> Comments and suggestions are welcome. > > I have not looked at the patch yet, but is the reason for doing R020 > before R010 because you haven't done the MEASURES clause yet? One of the reasons is, implementing MATCH_RECOGNIZE (R010) looked harder for me because modifying main SELECT clause could be a hard work. Another reason is, I had no idea how to implement PREV/NEXT in other than in WINDOW clause. Other people might feel differently though. > In any case, I will be watching this with a close eye, and I am eager > to help in any way I can. Thank you! I am looking forward to comments on my patch. Also any idea how to implement MEASURES clause is welcome. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-06-26T08:45:07Z
>> In this case, we should require the user to specify AFTER MATCH SKIP >> TO NEXT ROW so that behavior doesn't change when we implement the >> standard default. (Your patch might do this already.) > > Agreed. I will implement AFTER MATCH SKIP PAST LAST ROW in the next > patch and I will change the default to AFTER MATCH SKIP PAST LAST ROW. Attached is the v2 patch to add support for AFTER MATCH SKIP PAST LAST ROW and AFTER MATCH SKIP PAST LAST ROW. The default is AFTER MATCH SKIP PAST LAST ROW as the standard default. Here are some examples to demonstrate how those clauses affect the query result. SELECT i, rpr(i) OVER w FROM (VALUES (1), (2), (3), (4)) AS v (i) WINDOW w AS ( ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING AFTER MATCH SKIP PAST LAST ROW PATTERN (A B) DEFINE A AS i <= 2, B AS i <= 3 ); i | rpr ---+----- 1 | 1 2 | 3 | 4 | (4 rows) In this example rpr starts from i = 1 and find that row i = 1 satisfies A, and row i = 2 satisfies B. Then rpr moves to row i = 3 and find that it does not satisfy A, thus the result is NULL. Same thing can be said to row i = 4. SELECT i, rpr(i) OVER w FROM (VALUES (1), (2), (3), (4)) AS v (i) WINDOW w AS ( ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING AFTER MATCH SKIP TO NEXT ROW PATTERN (A B) DEFINE A AS i <= 2, B AS i <= 3 ); i | rpr ---+----- 1 | 1 2 | 2 3 | 4 | (4 rows) In this example rpr starts from i = 1 and find that row i = 1 satisfies A, and row i = 2 satisfies B (same as above). Then rpr moves to row i = 2, rather than 3 because AFTER MATCH SKIP TO NEXT ROW is specified. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Vik Fearing <vik@postgresfriends.org> — 2023-06-26T22:38:20Z
On 6/26/23 03:05, Tatsuo Ishii wrote: >> I don't understand this. RPR in a window specification limits the >> window to the matched rows, so this looks like your rpr() function is >> just the regular first_value() window function that we already have? > > No, rpr() is different from first_value(). rpr() returns the argument > value at the first row in a frame only when matched rows found. On the > other hand first_value() returns the argument value at the first row > in a frame unconditionally. > > company | tdate | price | rpr | first_value > ----------+------------+-------+------+------------- > company1 | 2023-07-01 | 100 | | 100 > company1 | 2023-07-02 | 200 | 200 | 200 > company1 | 2023-07-03 | 150 | 150 | 150 > company1 | 2023-07-04 | 140 | | 140 > company1 | 2023-07-05 | 150 | 150 | 150 > company1 | 2023-07-06 | 90 | | 90 > company1 | 2023-07-07 | 110 | | 110 > company1 | 2023-07-08 | 130 | | 130 > company1 | 2023-07-09 | 120 | | 120 > company1 | 2023-07-10 | 130 | | 130 > > For example, a frame starting with (tdate = 2023-07-02, price = 200) > consists of rows (price = 200, 150, 140, 150) satisfying the pattern, > thus rpr returns 200. Since in this example frame option "ROWS BETWEEN > CURRENT ROW AND UNBOUNDED FOLLOWING" is specified, next frame starts > with (tdate = 2023-07-03, price = 150). This frame satisfies the > pattern too (price = 150, 140, 150), and rpr retus 150... and so on. Okay, I see the problem now, and why you need the rpr() function. You are doing this as something that happens over a window frame, but it is actually something that *reduces* the window frame. The pattern matching needs to be done when the frame is calculated and not when any particular function is applied over it. This query (with all the defaults made explicit): SELECT s.company, s.tdate, s.price, FIRST_VALUE(s.tdate) OVER w, LAST_VALUE(s.tdate) OVER w, lowest OVER w FROM stock AS s WINDOW w AS ( PARTITION BY s.company ORDER BY s.tdate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING EXCLUDE NO OTHERS MEASURES LAST(DOWN) AS lowest AFTER MATCH SKIP PAST LAST ROW INITIAL PATTERN (START DOWN+ UP+) DEFINE START AS TRUE, UP AS price > PREV(price), DOWN AS price < PREV(price) ); Should produce this result: company | tdate | price | first_value | last_value | lowest ----------+------------+-------+-------------+------------+-------- company1 | 07-01-2023 | 100 | | | company1 | 07-02-2023 | 200 | 07-02-2023 | 07-05-2023 | 140 company1 | 07-03-2023 | 150 | | | company1 | 07-04-2023 | 140 | | | company1 | 07-05-2023 | 150 | | | company1 | 07-06-2023 | 90 | | | company1 | 07-07-2023 | 110 | | | company1 | 07-08-2023 | 130 | 07-05-2023 | 07-05-2023 | 120 company1 | 07-09-2023 | 120 | | | company1 | 07-10-2023 | 130 | | | (10 rows) Or if we switch to AFTER MATCH SKIP TO NEXT ROW, then we get: company | tdate | price | first_value | last_value | lowest ----------+------------+-------+-------------+------------+-------- company1 | 07-01-2023 | 100 | | | company1 | 07-02-2023 | 200 | 07-02-2023 | 07-05-2023 | 140 company1 | 07-03-2023 | 150 | 07-03-2023 | 07-05-2023 | 140 company1 | 07-04-2023 | 140 | | | company1 | 07-05-2023 | 150 | 07-05-2023 | 07-08-2023 | 90 company1 | 07-06-2023 | 90 | | | company1 | 07-07-2023 | 110 | | | company1 | 07-08-2023 | 130 | 07-08-2023 | 07-10-2023 | 120 company1 | 07-09-2023 | 120 | | | company1 | 07-10-2023 | 130 | | | (10 rows) And then if we change INITIAL to SEEK: company | tdate | price | first_value | last_value | lowest ----------+------------+-------+-------------+------------+-------- company1 | 07-01-2023 | 100 | 07-02-2023 | 07-05-2023 | 140 company1 | 07-02-2023 | 200 | 07-02-2023 | 07-05-2023 | 140 company1 | 07-03-2023 | 150 | 07-03-2023 | 07-05-2023 | 140 company1 | 07-04-2023 | 140 | 07-05-2023 | 07-08-2023 | 90 company1 | 07-05-2023 | 150 | 07-05-2023 | 07-08-2023 | 90 company1 | 07-06-2023 | 90 | 07-08-2023 | 07-10-2023 | 120 company1 | 07-07-2023 | 110 | 07-08-2023 | 07-10-2023 | 120 company1 | 07-08-2023 | 130 | 07-08-2023 | 07-10-2023 | 120 company1 | 07-09-2023 | 120 | | | company1 | 07-10-2023 | 130 | | | (10 rows) Since the pattern recognition is part of the frame, the window aggregates should Just Work. >>> o SUBSET is not supported >> >> Is this because you haven't done it yet, or because you ran into >> problems trying to do it? > > Because it seems SUBSET is not useful without MEASURES support. Thus > my plan is, firstly implement MEASURES, then SUBSET. What do you > think? SUBSET elements can be used in DEFINE clauses, but I do not think this is important compared to other features. >>> Comments and suggestions are welcome. >> >> I have not looked at the patch yet, but is the reason for doing R020 >> before R010 because you haven't done the MEASURES clause yet? > > One of the reasons is, implementing MATCH_RECOGNIZE (R010) looked > harder for me because modifying main SELECT clause could be a hard > work. Another reason is, I had no idea how to implement PREV/NEXT in > other than in WINDOW clause. Other people might feel differently > though. I think we could do this with a single tuplesort if we use backtracking (which might be really slow for some patterns). I have not looked into it in any detail. We would need to be able to remove tuples from the end (even if only logically), and be able to update tuples inside the store. Both of those needs come from backtracking and possibly changing the classifier. Without backtracking, I don't see how we could do it without have a separate tuplestore for every current possible match. >> In any case, I will be watching this with a close eye, and I am eager >> to help in any way I can. > > Thank you! I am looking forward to comments on my patch. Also any > idea how to implement MEASURES clause is welcome. I looked at your v2 patches a little bit and the only comment that I currently have on the code is you spelled PERMUTE as PREMUTE. Everything else is hopefully explained above. -- Vik Fearing -
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-06-28T00:58:19Z
> Okay, I see the problem now, and why you need the rpr() function. > > You are doing this as something that happens over a window frame, but > it is actually something that *reduces* the window frame. The pattern > matching needs to be done when the frame is calculated and not when > any particular function is applied over it. Yes. (I think the standard calls the window frame as "full window frame" in context of RPR to make a contrast with the subset of the frame rows restricted by RPR. The paper I refered to as [2] claims that the latter window frame is called "reduced window frame" in the standard but I wasn't able to find the term in the standard.) I wanted to demonstate that pattern matching logic is basically correct in the PoC patch. Now what I need to do is, move the row pattern matching logic to somewhere inside nodeWindowAgg so that "restricted window frame" can be applied to all window functions and window aggregates. Currently I am looking into update_frameheadpos() and update_frametailpos() which calculate the frame head and tail against current row. What do you think? > This query (with all the defaults made explicit): > > SELECT s.company, s.tdate, s.price, > FIRST_VALUE(s.tdate) OVER w, > LAST_VALUE(s.tdate) OVER w, > lowest OVER w > FROM stock AS s > WINDOW w AS ( > PARTITION BY s.company > ORDER BY s.tdate > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > EXCLUDE NO OTHERS > MEASURES > LAST(DOWN) AS lowest > AFTER MATCH SKIP PAST LAST ROW > INITIAL PATTERN (START DOWN+ UP+) > DEFINE > START AS TRUE, > UP AS price > PREV(price), > DOWN AS price < PREV(price) > ); > > Should produce this result: [snip] Thanks for the examples. I agree with the expected query results. >>>> o SUBSET is not supported >>> >>> Is this because you haven't done it yet, or because you ran into >>> problems trying to do it? >> Because it seems SUBSET is not useful without MEASURES support. Thus >> my plan is, firstly implement MEASURES, then SUBSET. What do you >> think? > > > SUBSET elements can be used in DEFINE clauses, but I do not think this > is important compared to other features. Ok. >>> I have not looked at the patch yet, but is the reason for doing R020 >>> before R010 because you haven't done the MEASURES clause yet? >> One of the reasons is, implementing MATCH_RECOGNIZE (R010) looked >> harder for me because modifying main SELECT clause could be a hard >> work. Another reason is, I had no idea how to implement PREV/NEXT in >> other than in WINDOW clause. Other people might feel differently >> though. > > > I think we could do this with a single tuplesort if we use > backtracking (which might be really slow for some patterns). I have > not looked into it in any detail. > > We would need to be able to remove tuples from the end (even if only > logically), and be able to update tuples inside the store. Both of > those needs come from backtracking and possibly changing the > classifier. > > Without backtracking, I don't see how we could do it without have a > separate tuplestore for every current possible match. Maybe an insane idea but what about rewriting MATCH_RECOGNIZE clause into Window clause with RPR? > I looked at your v2 patches a little bit and the only comment that I > currently have on the code is you spelled PERMUTE as > PREMUTE. Everything else is hopefully explained above. Thanks. Will fix. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-06-28T12:17:00Z
Small question. > This query (with all the defaults made explicit): > > SELECT s.company, s.tdate, s.price, > FIRST_VALUE(s.tdate) OVER w, > LAST_VALUE(s.tdate) OVER w, > lowest OVER w > FROM stock AS s > WINDOW w AS ( > PARTITION BY s.company > ORDER BY s.tdate > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > EXCLUDE NO OTHERS > MEASURES > LAST(DOWN) AS lowest > AFTER MATCH SKIP PAST LAST ROW > INITIAL PATTERN (START DOWN+ UP+) > DEFINE > START AS TRUE, > UP AS price > PREV(price), > DOWN AS price < PREV(price) > ); > LAST(DOWN) AS lowest should be "LAST(DOWN.price) AS lowest"? Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Vik Fearing <vik@postgresfriends.org> — 2023-06-28T22:30:43Z
On 6/28/23 14:17, Tatsuo Ishii wrote: > Small question. > >> This query (with all the defaults made explicit): >> >> SELECT s.company, s.tdate, s.price, >> FIRST_VALUE(s.tdate) OVER w, >> LAST_VALUE(s.tdate) OVER w, >> lowest OVER w >> FROM stock AS s >> WINDOW w AS ( >> PARTITION BY s.company >> ORDER BY s.tdate >> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING >> EXCLUDE NO OTHERS >> MEASURES >> LAST(DOWN) AS lowest >> AFTER MATCH SKIP PAST LAST ROW >> INITIAL PATTERN (START DOWN+ UP+) >> DEFINE >> START AS TRUE, >> UP AS price > PREV(price), >> DOWN AS price < PREV(price) >> ); > >> LAST(DOWN) AS lowest > > should be "LAST(DOWN.price) AS lowest"? Yes, it should be. And the tdate='07-08-2023' row in the first resultset should have '07-08-2023' and '07-10-2023' as its 4th and 5th columns. Since my brain is doing the processing instead of postgres, I made some human errors. :-) -- Vik Fearing
-
Re: Row pattern recognition
Jacob Champion <jchampion@timescale.com> — 2023-07-19T16:30:40Z
Hello, Thanks for working on this! We're interested in RPR as well, and I've been trying to get up to speed with the specs, to maybe make myself useful. On 6/27/23 17:58, Tatsuo Ishii wrote: > Yes. (I think the standard calls the window frame as "full window > frame" in context of RPR to make a contrast with the subset of the > frame rows restricted by RPR. The paper I refered to as [2] claims > that the latter window frame is called "reduced window frame" in the > standard but I wasn't able to find the term in the standard.) 19075-5 discusses that, at least; not sure about other parts of the spec. > Maybe an insane idea but what about rewriting MATCH_RECOGNIZE clause > into Window clause with RPR? Are we guaranteed to always have an equivalent window clause? There seem to be many differences between the two, especially when it comes to ONE ROW/ALL ROWS PER MATCH. -- To add onto what Vik said above: >> It seems RPR in the standard is quite complex. I think we can start >> with a small subset of RPR then we could gradually enhance the >> implementation. > > I have no problem with that as long as we don't paint ourselves into a > corner. To me, PATTERN looks like an area where we may want to support a broader set of operations in the first version. The spec has a bunch of discussion around cases like empty matches, match order of alternation and permutation, etc., which are not possible to express or test with only the + quantifier. Those might be harder to get right in a v2, if we don't at least keep them in mind for v1? > +static List * > +transformPatternClause(ParseState *pstate, WindowClause *wc, WindowDef *windef) > +{ > + List *patterns; My compiler complains about the `patterns` variable here, which is returned without ever being initialized. (The caller doesn't seem to use it.) > +-- basic test using PREV > +SELECT company, tdate, price, rpr(price) OVER w FROM stock > + WINDOW w AS ( > + PARTITION BY company > + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > + INITIAL > + PATTERN (START UP+ DOWN+) > + DEFINE > + START AS TRUE, > + UP AS price > PREV(price), > + DOWN AS price < PREV(price) > +); nitpick: IMO the tests should be making use of ORDER BY in the window clauses. This is a very big feature. I agree with you that MEASURES seems like a very important "next piece" to add. Are there other areas where you would like reviewers to focus on right now (or avoid)? Thanks! --Jacob -
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-07-20T05:15:13Z
> Hello, > > Thanks for working on this! We're interested in RPR as well, and I've > been trying to get up to speed with the specs, to maybe make myself > useful. Thank you for being interested in this. > 19075-5 discusses that, at least; not sure about other parts of the spec. Thanks for the info. Unfortunately I don't have 19075-5 though. >> Maybe an insane idea but what about rewriting MATCH_RECOGNIZE clause >> into Window clause with RPR? > > Are we guaranteed to always have an equivalent window clause? There seem > to be many differences between the two, especially when it comes to ONE > ROW/ALL ROWS PER MATCH. You are right. I am not 100% sure if the rewriting is possible at this point. > To add onto what Vik said above: > >>> It seems RPR in the standard is quite complex. I think we can start >>> with a small subset of RPR then we could gradually enhance the >>> implementation. >> >> I have no problem with that as long as we don't paint ourselves into a >> corner. > > To me, PATTERN looks like an area where we may want to support a broader > set of operations in the first version. Me too but... > The spec has a bunch of > discussion around cases like empty matches, match order of alternation > and permutation, etc., which are not possible to express or test with > only the + quantifier. Those might be harder to get right in a v2, if we > don't at least keep them in mind for v1? Currently my patch has a limitation for the sake of simple implementation: a pattern like "A+" is parsed and analyzed in the raw parser. This makes subsequent process much easier because the pattern element variable (in this case "A") and the quantifier (in this case "+") is already identified by the raw parser. However there are much more cases are allowed in the standard as you already pointed out. For those cases probably we should give up to parse PATTERN items in the raw parser, and instead the raw parser just accepts the elements as Sconst? >> +static List * >> +transformPatternClause(ParseState *pstate, WindowClause *wc, WindowDef *windef) >> +{ >> + List *patterns; > > My compiler complains about the `patterns` variable here, which is > returned without ever being initialized. (The caller doesn't seem to use > it.) Will fix. >> +-- basic test using PREV >> +SELECT company, tdate, price, rpr(price) OVER w FROM stock >> + WINDOW w AS ( >> + PARTITION BY company >> + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING >> + INITIAL >> + PATTERN (START UP+ DOWN+) >> + DEFINE >> + START AS TRUE, >> + UP AS price > PREV(price), >> + DOWN AS price < PREV(price) >> +); > > nitpick: IMO the tests should be making use of ORDER BY in the window > clauses. Right. Will fix. > This is a very big feature. I agree with you that MEASURES seems like a > very important "next piece" to add. Are there other areas where you > would like reviewers to focus on right now (or avoid)? Any comments, especially on the PREV/NEXT implementation part is welcome. Currently the DEFINE expression like "price > PREV(price)" is prepared in ExecInitWindowAgg using ExecInitExpr,tweaking var->varno in Var node so that PREV uses OUTER_VAR, NEXT uses INNER_VAR. Then evaluate the expression in ExecWindowAgg using ExecEvalExpr, setting previous row TupleSlot to ExprContext->ecxt_outertuple, and next row TupleSlot to ExprContext->ecxt_innertuple. I think this is temporary hack and should be gotten ride of before v1 is committed. Better idea? Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Jacob Champion <jchampion@timescale.com> — 2023-07-20T23:36:37Z
Hi Ishii-san, On 7/19/23 22:15, Tatsuo Ishii wrote: > Currently my patch has a limitation for the sake of simple > implementation: a pattern like "A+" is parsed and analyzed in the raw > parser. This makes subsequent process much easier because the pattern > element variable (in this case "A") and the quantifier (in this case > "+") is already identified by the raw parser. However there are much > more cases are allowed in the standard as you already pointed out. For > those cases probably we should give up to parse PATTERN items in the > raw parser, and instead the raw parser just accepts the elements as > Sconst? Is there a concern that the PATTERN grammar can't be represented in Bison? I thought it was all context-free... Or is the concern that the parse tree of the pattern is hard to feed into a regex engine? > Any comments, especially on the PREV/NEXT implementation part is > welcome. Currently the DEFINE expression like "price > PREV(price)" is > prepared in ExecInitWindowAgg using ExecInitExpr,tweaking var->varno > in Var node so that PREV uses OUTER_VAR, NEXT uses INNER_VAR. Then > evaluate the expression in ExecWindowAgg using ExecEvalExpr, setting > previous row TupleSlot to ExprContext->ecxt_outertuple, and next row > TupleSlot to ExprContext->ecxt_innertuple. I think this is temporary > hack and should be gotten ride of before v1 is committed. Better idea? I'm not familiar enough with this code yet to offer very concrete suggestions, sorry... But at some point in the future, we need to be able to skip forward and backward from arbitrary points, like DEFINE B AS B.price > PREV(FIRST(A.price), 3) so there won't be just one pair of "previous and next" tuples. Maybe that can help clarify the design? It feels like it'll need to eventually be a "real" function that operates on the window state, even if it doesn't support all of the possible complexities in v1. -- Taking a closer look at the regex engine: It looks like the + qualifier has trouble when it meets the end of the frame. For instance, try removing the last row of the 'stock' table in rpr.sql; some of the final matches will disappear unexpectedly. Or try a pattern like PATTERN ( a+ ) DEFINE a AS TRUE which doesn't seem to match anything in my testing. There's also the issue of backtracking in the face of reclassification, as I think Vik was alluding to upthread. The pattern PATTERN ( a+ b+ ) DEFINE a AS col = 2, b AS col = 2 doesn't match a sequence of values (2 2 ...) with the current implementation, even with a dummy row at the end to avoid the end-of-frame bug. (I've attached two failing tests against v2, to hopefully better illustrate, along with what I _think_ should be the correct results.) I'm not quite understanding the match loop in evaluate_pattern(). It looks like we're building up a string to pass to the regex engine, but by the we call regexp_instr, don't we already know whether or not the pattern will match based on the expression evaluation we've done? Thanks, --Jacob -
Re: Row pattern recognition
Vik Fearing <vik@postgresfriends.org> — 2023-07-21T00:07:44Z
On 7/21/23 01:36, Jacob Champion wrote: > There's also the issue of backtracking in the face of reclassification, > as I think Vik was alluding to upthread. We definitely need some kind of backtracking or other reclassification method. > (I've attached two failing tests against v2, to hopefully better > illustrate, along with what I_think_ should be the correct results.) Almost. You are matching 07-01-2023 but the condition is "price > 100". -- Vik Fearing
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-07-21T06:16:48Z
Hi, > Hi Ishii-san, > > On 7/19/23 22:15, Tatsuo Ishii wrote: >> Currently my patch has a limitation for the sake of simple >> implementation: a pattern like "A+" is parsed and analyzed in the raw >> parser. This makes subsequent process much easier because the pattern >> element variable (in this case "A") and the quantifier (in this case >> "+") is already identified by the raw parser. However there are much >> more cases are allowed in the standard as you already pointed out. For >> those cases probably we should give up to parse PATTERN items in the >> raw parser, and instead the raw parser just accepts the elements as >> Sconst? > > Is there a concern that the PATTERN grammar can't be represented in > Bison? I thought it was all context-free... I don't know at this point. I think context-free is not enough to be repsented in Bison. The grammer also needs to be LALR(1). Moreover, adding the grammer to existing parser may generate shift/reduce errors. > Or is the concern that the > parse tree of the pattern is hard to feed into a regex engine? One small concern is how to convert pattern variables to regex expression which our regex enginer understands. Suppose, PATTERN UP+ Currently I convert "UP+" to "U+" so that it can be fed to the regexp engine. In order to do that, we need to know which part of the pattern (UP+) is the pattern variable ("UP"). For "UP+" it's quite easy. But for more complex regular expressions it would be not, unless PATTERN grammer can be analyzed by our parser to know which part is the pattern variable. >> Any comments, especially on the PREV/NEXT implementation part is >> welcome. Currently the DEFINE expression like "price > PREV(price)" is >> prepared in ExecInitWindowAgg using ExecInitExpr,tweaking var->varno >> in Var node so that PREV uses OUTER_VAR, NEXT uses INNER_VAR. Then >> evaluate the expression in ExecWindowAgg using ExecEvalExpr, setting >> previous row TupleSlot to ExprContext->ecxt_outertuple, and next row >> TupleSlot to ExprContext->ecxt_innertuple. I think this is temporary >> hack and should be gotten ride of before v1 is committed. Better idea? > > I'm not familiar enough with this code yet to offer very concrete > suggestions, sorry... But at some point in the future, we need to be > able to skip forward and backward from arbitrary points, like > > DEFINE B AS B.price > PREV(FIRST(A.price), 3) > > so there won't be just one pair of "previous and next" tuples. Yes, I know. > Maybe > that can help clarify the design? It feels like it'll need to eventually > be a "real" function that operates on the window state, even if it > doesn't support all of the possible complexities in v1. Unfortunately an window function can not call other window functions. > Taking a closer look at the regex engine: > > It looks like the + qualifier has trouble when it meets the end of the > frame. For instance, try removing the last row of the 'stock' table in > rpr.sql; some of the final matches will disappear unexpectedly. Or try a > pattern like > > PATTERN ( a+ ) > DEFINE a AS TRUE > > which doesn't seem to match anything in my testing. > > There's also the issue of backtracking in the face of reclassification, > as I think Vik was alluding to upthread. The pattern > > PATTERN ( a+ b+ ) > DEFINE a AS col = 2, > b AS col = 2 > > doesn't match a sequence of values (2 2 ...) with the current > implementation, even with a dummy row at the end to avoid the > end-of-frame bug. > > (I've attached two failing tests against v2, to hopefully better > illustrate, along with what I _think_ should be the correct results.) Thanks. I will look into this. > I'm not quite understanding the match loop in evaluate_pattern(). It > looks like we're building up a string to pass to the regex engine, but > by the we call regexp_instr, don't we already know whether or not the > pattern will match based on the expression evaluation we've done? For "+" yes. But for more complex regular expression like '{n}', we need to call our regexp engine to check if the pattern matches. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Jacob Champion <jchampion@timescale.com> — 2023-07-21T23:14:12Z
On 7/20/23 17:07, Vik Fearing wrote: > On 7/21/23 01:36, Jacob Champion wrote: >> (I've attached two failing tests against v2, to hopefully better >> illustrate, along with what I_think_ should be the correct results.) > > Almost. You are matching 07-01-2023 but the condition is "price > 100". D'oh. Correction attached. I think :) Thanks, --Jacob
-
Re: Row pattern recognition
Jacob Champion <jchampion@timescale.com> — 2023-07-21T23:16:18Z
On 7/20/23 23:16, Tatsuo Ishii wrote: > I don't know at this point. I think context-free is not enough to be > repsented in Bison. The grammer also needs to be LALR(1). Moreover, > adding the grammer to existing parser may generate shift/reduce > errors. Ah. It's been too long since my compilers classes; I will pipe down. > One small concern is how to convert pattern variables to regex > expression which our regex enginer understands. Suppose, > > PATTERN UP+ > > Currently I convert "UP+" to "U+" so that it can be fed to the regexp > engine. In order to do that, we need to know which part of the pattern > (UP+) is the pattern variable ("UP"). For "UP+" it's quite easy. But > for more complex regular expressions it would be not, unless PATTERN > grammer can be analyzed by our parser to know which part is the > pattern variable. Is the eventual plan to generate multiple alternatives, and run the regex against them one at a time? >> I'm not familiar enough with this code yet to offer very concrete >> suggestions, sorry... But at some point in the future, we need to be >> able to skip forward and backward from arbitrary points, like >> >> DEFINE B AS B.price > PREV(FIRST(A.price), 3) >> >> so there won't be just one pair of "previous and next" tuples. > > Yes, I know. I apologize. I got overexplain-y. >> Maybe >> that can help clarify the design? It feels like it'll need to eventually >> be a "real" function that operates on the window state, even if it >> doesn't support all of the possible complexities in v1. > > Unfortunately an window function can not call other window functions. Can that restriction be lifted for the EXPR_KIND_RPR_DEFINE case? Or does it make sense to split the pattern navigation "functions" into their own new concept, and maybe borrow some of the window function infrastructure for it? Thanks! --Jacob -
Re: Row pattern recognition
Vik Fearing <vik@postgresfriends.org> — 2023-07-21T23:38:01Z
On 7/22/23 01:14, Jacob Champion wrote: > On 7/20/23 17:07, Vik Fearing wrote: >> On 7/21/23 01:36, Jacob Champion wrote: >>> (I've attached two failing tests against v2, to hopefully better >>> illustrate, along with what I_think_ should be the correct results.) >> >> Almost. You are matching 07-01-2023 but the condition is "price > 100". > > D'oh. Correction attached. I think :) This looks correct to my human brain. Thanks! -- Vik Fearing
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-07-22T01:11:49Z
>> One small concern is how to convert pattern variables to regex >> expression which our regex enginer understands. Suppose, >> >> PATTERN UP+ >> >> Currently I convert "UP+" to "U+" so that it can be fed to the regexp >> engine. In order to do that, we need to know which part of the pattern >> (UP+) is the pattern variable ("UP"). For "UP+" it's quite easy. But >> for more complex regular expressions it would be not, unless PATTERN >> grammer can be analyzed by our parser to know which part is the >> pattern variable. > > Is the eventual plan to generate multiple alternatives, and run the > regex against them one at a time? Yes, that's my plan. >>> I'm not familiar enough with this code yet to offer very concrete >>> suggestions, sorry... But at some point in the future, we need to be >>> able to skip forward and backward from arbitrary points, like >>> >>> DEFINE B AS B.price > PREV(FIRST(A.price), 3) >>> >>> so there won't be just one pair of "previous and next" tuples. >> >> Yes, I know. > > I apologize. I got overexplain-y. No problem. Thank you for reminding me it. >>> Maybe >>> that can help clarify the design? It feels like it'll need to eventually >>> be a "real" function that operates on the window state, even if it >>> doesn't support all of the possible complexities in v1. >> >> Unfortunately an window function can not call other window functions. > > Can that restriction be lifted for the EXPR_KIND_RPR_DEFINE case? I am not sure at this point. Current PostgreSQL executor creates WindowStatePerFuncData for each window function and aggregate appearing in OVER clause. This means PREV/NEXT and other row pattern navigation operators cannot have their own WindowStatePerFuncData if they do not appear in OVER clauses in a query even if PREV/NEXT etc. are defined as window function. > Or > does it make sense to split the pattern navigation "functions" into > their own new concept, and maybe borrow some of the window function > infrastructure for it? Maybe. Suppose a window function executes row pattern matching using price > PREV(price). The window function already receives WindowStatePerFuncData. If we can pass the WindowStatePerFuncData to PREV, we could let PREV do the real work (getting previous tuple). I have not tried this yet, though. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Vik Fearing <vik@postgresfriends.org> — 2023-07-22T02:54:43Z
On 7/22/23 03:11, Tatsuo Ishii wrote: >>>> Maybe >>>> that can help clarify the design? It feels like it'll need to eventually >>>> be a "real" function that operates on the window state, even if it >>>> doesn't support all of the possible complexities in v1. >>> Unfortunately an window function can not call other window functions. >> Can that restriction be lifted for the EXPR_KIND_RPR_DEFINE case? > I am not sure at this point. Current PostgreSQL executor creates > WindowStatePerFuncData for each window function and aggregate > appearing in OVER clause. This means PREV/NEXT and other row pattern > navigation operators cannot have their own WindowStatePerFuncData if > they do not appear in OVER clauses in a query even if PREV/NEXT > etc. are defined as window function. > >> Or >> does it make sense to split the pattern navigation "functions" into >> their own new concept, and maybe borrow some of the window function >> infrastructure for it? > Maybe. Suppose a window function executes row pattern matching using > price > PREV(price). The window function already receives > WindowStatePerFuncData. If we can pass the WindowStatePerFuncData to > PREV, we could let PREV do the real work (getting previous tuple). > I have not tried this yet, though. I don't understand this logic. Window functions work over a window frame. What we are talking about here is *defining* a window frame. How can a window function execute row pattern matching? -- Vik Fearing
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-07-22T06:14:46Z
> On 7/22/23 03:11, Tatsuo Ishii wrote: >>>>> Maybe >>>>> that can help clarify the design? It feels like it'll need to >>>>> eventually >>>>> be a "real" function that operates on the window state, even if it >>>>> doesn't support all of the possible complexities in v1. >>>> Unfortunately an window function can not call other window functions. >>> Can that restriction be lifted for the EXPR_KIND_RPR_DEFINE case? > >> I am not sure at this point. Current PostgreSQL executor creates >> WindowStatePerFuncData for each window function and aggregate >> appearing in OVER clause. This means PREV/NEXT and other row pattern >> navigation operators cannot have their own WindowStatePerFuncData if >> they do not appear in OVER clauses in a query even if PREV/NEXT >> etc. are defined as window function. >> >>> Or >>> does it make sense to split the pattern navigation "functions" into >>> their own new concept, and maybe borrow some of the window function >>> infrastructure for it? > >> Maybe. Suppose a window function executes row pattern matching using >> price > PREV(price). The window function already receives >> WindowStatePerFuncData. If we can pass the WindowStatePerFuncData to >> PREV, we could let PREV do the real work (getting previous tuple). >> I have not tried this yet, though. > > > I don't understand this logic. Window functions work over a window > frame. Yes. > What we are talking about here is *defining* a window > frame. Well, we are defining a "reduced" window frame within a (full) window frame. A "reduced" window frame is calculated each time when a window function is called. > How can a window function execute row pattern matching? A window function is called for each row fed by an outer plan. It fetches current, previous and next row to execute pattern matching. If it matches, the window function moves to next row and repeat the process, until pattern match fails. Below is an example window function to execute pattern matching (I will include this in the v3 patch). row_is_in_reduced_frame() is a function to execute pattern matching. It returns the number of rows in the reduced frame if pattern match succeeds. If succeeds, the function returns the last row in the reduced frame instead of the last row in the full window frame. /* * last_value * return the value of VE evaluated on the last row of the * window frame, per spec. */ Datum window_last_value(PG_FUNCTION_ARGS) { WindowObject winobj = PG_WINDOW_OBJECT(); Datum result; bool isnull; int64 abspos; int num_reduced_frame; abspos = WinGetCurrentPosition(winobj); num_reduced_frame = row_is_in_reduced_frame(winobj, abspos); if (num_reduced_frame == 0) /* no RPR is involved */ result = WinGetFuncArgInFrame(winobj, 0, 0, WINDOW_SEEK_TAIL, true, &isnull, NULL); else if (num_reduced_frame > 0) /* get last row value in the reduced frame */ result = WinGetFuncArgInFrame(winobj, 0, num_reduced_frame - 1, WINDOW_SEEK_HEAD, true, &isnull, NULL); else /* RPR is involved and current row is unmatched or skipped */ isnull = true; if (isnull) PG_RETURN_NULL(); PG_RETURN_DATUM(result); } -
Re: Row pattern recognition
Vik Fearing <vik@postgresfriends.org> — 2023-07-23T21:29:46Z
On 7/22/23 08:14, Tatsuo Ishii wrote: >> On 7/22/23 03:11, Tatsuo Ishii wrote: >>> Maybe. Suppose a window function executes row pattern matching using >>> price > PREV(price). The window function already receives >>> WindowStatePerFuncData. If we can pass the WindowStatePerFuncData to >>> PREV, we could let PREV do the real work (getting previous tuple). >>> I have not tried this yet, though. >> >> I don't understand this logic. Window functions work over a window >> frame. > > Yes. > >> What we are talking about here is *defining* a window >> frame. > > Well, we are defining a "reduced" window frame within a (full) window > frame. A "reduced" window frame is calculated each time when a window > function is called. Why? It should only be recalculated when the current row changes and we need a new frame. The reduced window frame *is* the window frame for all functions over that window. >> How can a window function execute row pattern matching? > > A window function is called for each row fed by an outer plan. It > fetches current, previous and next row to execute pattern matching. If > it matches, the window function moves to next row and repeat the > process, until pattern match fails. > > Below is an example window function to execute pattern matching (I > will include this in the v3 patch). row_is_in_reduced_frame() is a > function to execute pattern matching. It returns the number of rows in > the reduced frame if pattern match succeeds. If succeeds, the function > returns the last row in the reduced frame instead of the last row in > the full window frame. I strongly disagree with this. Window function do not need to know how the frame is defined, and indeed they should not. WinGetFuncArgInFrame should answer yes or no and the window function just works on that. Otherwise we will get extension (and possibly even core) functions that don't handle the frame properly. -- Vik Fearing
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-07-24T00:22:40Z
>>> What we are talking about here is *defining* a window >>> frame. >> Well, we are defining a "reduced" window frame within a (full) window >> frame. A "reduced" window frame is calculated each time when a window >> function is called. > > > Why? It should only be recalculated when the current row changes and > we need a new frame. The reduced window frame *is* the window frame > for all functions over that window. We already recalculate a frame each time a row is processed even without RPR. See ExecWindowAgg. Also RPR always requires a frame option ROWS BETWEEN CURRENT ROW, which means the frame head is changed each time current row position changes. > I strongly disagree with this. Window function do not need to know > how the frame is defined, and indeed they should not. We already break the rule by defining *support functions. See windowfuncs.c. > WinGetFuncArgInFrame should answer yes or no and the window function > just works on that. Otherwise we will get extension (and possibly even > core) functions that don't handle the frame properly. Maybe I can move row_is_in_reduced_frame into WinGetFuncArgInFrame just for convenience. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Vik Fearing <vik@postgresfriends.org> — 2023-07-24T23:14:37Z
On 7/24/23 02:22, Tatsuo Ishii wrote: >>>> What we are talking about here is *defining* a window >>>> frame. >>> Well, we are defining a "reduced" window frame within a (full) window >>> frame. A "reduced" window frame is calculated each time when a window >>> function is called. >> >> >> Why? It should only be recalculated when the current row changes and >> we need a new frame. The reduced window frame *is* the window frame >> for all functions over that window. > > We already recalculate a frame each time a row is processed even > without RPR. See ExecWindowAgg. Yes, after each row. Not for each function. > Also RPR always requires a frame option ROWS BETWEEN CURRENT ROW, > which means the frame head is changed each time current row position > changes. Off topic for now: I wonder why this restriction is in place and whether we should respect or ignore it. That is a discussion for another time, though. >> I strongly disagree with this. Window function do not need to know >> how the frame is defined, and indeed they should not. > > We already break the rule by defining *support functions. See > windowfuncs.c. The support functions don't know anything about the frame, they just know when a window function is monotonically increasing and execution can either stop or be "passed through". >> WinGetFuncArgInFrame should answer yes or no and the window function >> just works on that. Otherwise we will get extension (and possibly even >> core) functions that don't handle the frame properly. > > Maybe I can move row_is_in_reduced_frame into WinGetFuncArgInFrame > just for convenience. I have two comments about this: It isn't just for convenience, it is for correctness. The window functions do not need to know which rows they are *not* operating on. There is no such thing as a "full" or "reduced" frame. The standard uses those terms to explain the difference between before and after RPR is applied, but window functions do not get to choose which frame they apply over. They only ever apply over the reduced window frame. -- Vik Fearing
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-07-25T12:35:04Z
Hi, > diff --git a/src/test/regress/expected/rpr.out b/src/test/regress/expected/rpr.out > index 6bf8818911..f3fd22de2a 100644 > --- a/src/test/regress/expected/rpr.out > +++ b/src/test/regress/expected/rpr.out > @@ -230,6 +230,79 @@ SELECT company, tdate, price, rpr(price) OVER w FROM stock > company2 | 07-10-2023 | 1300 | > (20 rows) > > +-- match everything > +SELECT company, tdate, price, rpr(price) OVER w FROM stock > + WINDOW w AS ( > + PARTITION BY company > + ORDER BY tdate > + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > + AFTER MATCH SKIP TO NEXT ROW It seems it's a result with AFTER MATCH SKIP PAST LAST ROW. > + INITIAL > + PATTERN (A+) > + DEFINE > + A AS TRUE > +); > + company | tdate | price | rpr > +----------+------------+-------+----- > + company1 | 07-01-2023 | 100 | 100 > + company1 | 07-02-2023 | 200 | > + company1 | 07-03-2023 | 150 | > + company1 | 07-04-2023 | 140 | > + company1 | 07-05-2023 | 150 | > + company1 | 07-06-2023 | 90 | > + company1 | 07-07-2023 | 110 | > + company1 | 07-08-2023 | 130 | > + company1 | 07-09-2023 | 120 | > + company1 | 07-10-2023 | 130 | > + company2 | 07-01-2023 | 50 | 50 > + company2 | 07-02-2023 | 2000 | > + company2 | 07-03-2023 | 1500 | > + company2 | 07-04-2023 | 1400 | > + company2 | 07-05-2023 | 1500 | > + company2 | 07-06-2023 | 60 | > + company2 | 07-07-2023 | 1100 | > + company2 | 07-08-2023 | 1300 | > + company2 | 07-09-2023 | 1200 | > + company2 | 07-10-2023 | 1300 | > +(20 rows) > + > +-- backtracking with reclassification of rows > +SELECT company, tdate, price, rpr(price) OVER w FROM stock > + WINDOW w AS ( > + PARTITION BY company > + ORDER BY tdate > + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > + AFTER MATCH SKIP TO NEXT ROW > + INITIAL > + PATTERN (A+ B+) > + DEFINE > + A AS price > 100, > + B AS price > 100 > +); > + company | tdate | price | rpr > +----------+------------+-------+------ > + company1 | 07-01-2023 | 100 | > + company1 | 07-02-2023 | 200 | 200 > + company1 | 07-03-2023 | 150 | > + company1 | 07-04-2023 | 140 | > + company1 | 07-05-2023 | 150 | > + company1 | 07-06-2023 | 90 | > + company1 | 07-07-2023 | 110 | 110 > + company1 | 07-08-2023 | 130 | > + company1 | 07-09-2023 | 120 | > + company1 | 07-10-2023 | 130 | > + company2 | 07-01-2023 | 50 | > + company2 | 07-02-2023 | 2000 | 2000 > + company2 | 07-03-2023 | 1500 | > + company2 | 07-04-2023 | 1400 | > + company2 | 07-05-2023 | 1500 | > + company2 | 07-06-2023 | 60 | > + company2 | 07-07-2023 | 1100 | 1100 > + company2 | 07-08-2023 | 1300 | > + company2 | 07-09-2023 | 1200 | > + company2 | 07-10-2023 | 1300 | > +(20 rows) > + > -- > -- Error cases > -- > diff --git a/src/test/regress/sql/rpr.sql b/src/test/regress/sql/rpr.sql > index 951c9abfe9..f1cd0369f4 100644 > --- a/src/test/regress/sql/rpr.sql > +++ b/src/test/regress/sql/rpr.sql > @@ -94,6 +94,33 @@ SELECT company, tdate, price, rpr(price) OVER w FROM stock > UPDOWN AS price > PREV(price) AND price > NEXT(price) > ); > > +-- match everything > +SELECT company, tdate, price, rpr(price) OVER w FROM stock > + WINDOW w AS ( > + PARTITION BY company > + ORDER BY tdate > + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > + AFTER MATCH SKIP TO NEXT ROW > + INITIAL > + PATTERN (A+) > + DEFINE > + A AS TRUE > +); > + > +-- backtracking with reclassification of rows > +SELECT company, tdate, price, rpr(price) OVER w FROM stock > + WINDOW w AS ( > + PARTITION BY company > + ORDER BY tdate > + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > + AFTER MATCH SKIP TO NEXT ROW > + INITIAL > + PATTERN (A+ B+) > + DEFINE > + A AS price > 100, > + B AS price > 100 > +); > + > -- > -- Error cases > --
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-07-26T12:21:34Z
Attached is the v3 patch. In this patch following changes are made. (1) I completely changed the pattern matching engine so that it performs backtracking. Now the engine evaluates all pattern elements defined in PATTER against each row, saving matched pattern variables in a string per row. For example if the pattern element A and B evaluated to true, a string "AB" is created for current row. This continues until all pattern matching fails or encounters the end of full window frame/partition. After that, the pattern matching engine creates all possible "pattern strings" and apply the regular expression matching to each. For example if we have row 0 = "AB" row 1 = "C", possible pattern strings are: "AC" and "BC". If it matches, the length of matching substring is saved. After all possible trials are done, the longest matching substring is chosen and it becomes the width (number of rows) in the reduced window frame. See row_is_in_reduced_frame, search_str_set and search_str_set_recurse in nodeWindowAggs.c for more details. For now I use a naive depth first search and probably there is a lot of rooms for optimization (for example rewriting it without using recursion). Suggestions/patches are welcome. Jacob Champion wrote: > It looks like the + qualifier has trouble when it meets the end of the > frame. For instance, try removing the last row of the 'stock' table in > rpr.sql; some of the final matches will disappear unexpectedly. Or try a > pattern like > > PATTERN ( a+ ) > DEFINE a AS TRUE > > which doesn't seem to match anything in my testing. > > There's also the issue of backtracking in the face of reclassification, > as I think Vik was alluding to upthread. The pattern > > PATTERN ( a+ b+ ) > DEFINE a AS col = 2, > b AS col = 2 With the new engine, cases above do not fail anymore. See new regression test cases. Thanks for providing valuable test cases! (2) Make window functions RPR aware. Now first_value, last_value, and nth_value recognize RPR (maybe first_value do not need any change?) Vik Fearing wrote: > I strongly disagree with this. Window function do not need to know > how the frame is defined, and indeed they should not. > WinGetFuncArgInFrame should answer yes or no and the window function > just works on that. Otherwise we will get extension (and possibly even > core) functions that don't handle the frame properly. So I moved row_is_in_reduce_frame into WinGetFuncArgInFrame so that those window functions are not needed to be changed. (3) Window function rpr was removed. We can use first_value instead. (4) Remaining tasks/issues. - For now I disable WinSetMarkPosition because RPR often needs to access a row before the mark is set. We need to fix this in the future. - I am working on making window aggregates RPR aware now. The implementation is in progress and far from completeness. An example is below. I think row 2, 3, 4 of "count" column should be NULL instead of 3, 2, 0, 0. Same thing can be said to other rows. Probably this is an effect of moving aggregate but I still studying the window aggregation code. SELECT company, tdate, first_value(price) OVER W, count(*) OVER w FROM stock WINDOW w AS ( PARTITION BY company ORDER BY tdate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING AFTER MATCH SKIP PAST LAST ROW INITIAL PATTERN (START UP+ DOWN+) DEFINE START AS TRUE, UP AS price > PREV(price), DOWN AS price < PREV(price) ); company | tdate | first_value | count ----------+------------+-------------+------- company1 | 2023-07-01 | 100 | 4 company1 | 2023-07-02 | | 3 company1 | 2023-07-03 | | 2 company1 | 2023-07-04 | | 0 company1 | 2023-07-05 | | 0 company1 | 2023-07-06 | 90 | 4 company1 | 2023-07-07 | | 3 company1 | 2023-07-08 | | 2 company1 | 2023-07-09 | | 0 company1 | 2023-07-10 | | 0 company2 | 2023-07-01 | 50 | 4 company2 | 2023-07-02 | | 3 company2 | 2023-07-03 | | 2 company2 | 2023-07-04 | | 0 company2 | 2023-07-05 | | 0 company2 | 2023-07-06 | 60 | 4 company2 | 2023-07-07 | | 3 company2 | 2023-07-08 | | 2 company2 | 2023-07-09 | | 0 company2 | 2023-07-10 | | 0 - If attributes appearing in DEFINE are not used in the target list, query fails. SELECT company, tdate, count(*) OVER w FROM stock WINDOW w AS ( PARTITION BY company ORDER BY tdate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING AFTER MATCH SKIP PAST LAST ROW INITIAL PATTERN (START UP+ DOWN+) DEFINE START AS TRUE, UP AS price > PREV(price), DOWN AS price < PREV(price) ); ERROR: attribute number 3 exceeds number of columns 2 This is because attributes appearing in DEFINE are not added to the target list. I am looking for way to teach planner to add attributes appearing in DEFINE. I am going to add this thread to CommitFest and plan to add both of you as reviewers. Thanks in advance. -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-07-26T20:22:30Z
> I am going to add this thread to CommitFest and plan to add both of > you as reviewers. Thanks in advance. Done. https://commitfest.postgresql.org/44/4460/ Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-07-28T07:09:53Z
>> We already recalculate a frame each time a row is processed even >> without RPR. See ExecWindowAgg. > > Yes, after each row. Not for each function. Ok, I understand now. Closer look at the code, I realized that each window function calls update_frameheadpos, which computes the frame head position. But actually it checks winstate->framehead_valid and if it's already true (probably by other window function), then it does nothing. >> Also RPR always requires a frame option ROWS BETWEEN CURRENT ROW, >> which means the frame head is changed each time current row position >> changes. > > Off topic for now: I wonder why this restriction is in place and > whether we should respect or ignore it. That is a discussion for > another time, though. My guess is, it is because other than ROWS BETWEEN CURRENT ROW has little or no meaning. Consider following example: SELECT i, first_value(i) OVER w FROM (VALUES (1), (2), (3), (4)) AS v (i) WINDOW w AS ( ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING AFTER MATCH SKIP PAST LAST ROW PATTERN (A) DEFINE A AS i = 1 OR i = 3 ); In this example ROWS BETWEEN CURRENT ROW gives frames with i = 1 and i = 3. i | first_value ---+------------- 1 | 1 2 | 3 | 3 4 | (4 rows) But what would happen with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING? Probably the frame i = 3 will be missed as at i = 2, PATTERN is not satisfied and compution of the reduced frame stops. i | first_value ---+------------- 1 | 1 2 | 3 | 4 | (4 rows) This is not very useful for users. >>> I strongly disagree with this. Window function do not need to know >>> how the frame is defined, and indeed they should not. >> We already break the rule by defining *support functions. See >> windowfuncs.c. > The support functions don't know anything about the frame, they just > know when a window function is monotonically increasing and execution > can either stop or be "passed through". I see following code in window_row_number_support: /* * The frame options can always become "ROWS BETWEEN UNBOUNDED * PRECEDING AND CURRENT ROW". row_number() always just increments by * 1 with each row in the partition. Using ROWS instead of RANGE * saves effort checking peer rows during execution. */ req->frameOptions = (FRAMEOPTION_NONDEFAULT | FRAMEOPTION_ROWS | FRAMEOPTION_START_UNBOUNDED_PRECEDING | FRAMEOPTION_END_CURRENT_ROW); I think it not only knows about frame but it even changes the frame options. This seems far from "don't know anything about the frame", no? > I have two comments about this: > > It isn't just for convenience, it is for correctness. The window > functions do not need to know which rows they are *not* operating on. > > There is no such thing as a "full" or "reduced" frame. The standard > uses those terms to explain the difference between before and after > RPR is applied, but window functions do not get to choose which frame > they apply over. They only ever apply over the reduced window frame. I agree that "full window frame" and "reduced window frame" do not exist at the same time, and in the end (after computation of reduced frame), only "reduced" frame is visible to window functions/aggregates. But I still do think that "full window frame" and "reduced window frame" are important concept to explain/understand how PRP works. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Vik Fearing <vik@postgresfriends.org> — 2023-07-28T08:56:26Z
On 7/28/23 09:09, Tatsuo Ishii wrote: >>> We already recalculate a frame each time a row is processed even >>> without RPR. See ExecWindowAgg. >> >> Yes, after each row. Not for each function. > > Ok, I understand now. Closer look at the code, I realized that each > window function calls update_frameheadpos, which computes the frame > head position. But actually it checks winstate->framehead_valid and if > it's already true (probably by other window function), then it does > nothing. > >>> Also RPR always requires a frame option ROWS BETWEEN CURRENT ROW, >>> which means the frame head is changed each time current row position >>> changes. >> >> Off topic for now: I wonder why this restriction is in place and >> whether we should respect or ignore it. That is a discussion for >> another time, though. > > My guess is, it is because other than ROWS BETWEEN CURRENT ROW has > little or no meaning. Consider following example: Yes, that makes sense. >>>> I strongly disagree with this. Window function do not need to know >>>> how the frame is defined, and indeed they should not. >>> We already break the rule by defining *support functions. See >>> windowfuncs.c. >> The support functions don't know anything about the frame, they just >> know when a window function is monotonically increasing and execution >> can either stop or be "passed through". > > I see following code in window_row_number_support: > > /* > * The frame options can always become "ROWS BETWEEN UNBOUNDED > * PRECEDING AND CURRENT ROW". row_number() always just increments by > * 1 with each row in the partition. Using ROWS instead of RANGE > * saves effort checking peer rows during execution. > */ > req->frameOptions = (FRAMEOPTION_NONDEFAULT | > FRAMEOPTION_ROWS | > FRAMEOPTION_START_UNBOUNDED_PRECEDING | > FRAMEOPTION_END_CURRENT_ROW); > > I think it not only knows about frame but it even changes the frame > options. This seems far from "don't know anything about the frame", no? That's the planner support function. The row_number() function itself is not even allowed to *have* a frame, per spec. We allow it, but as you can see from that support function, we completely replace it. So all of the partition-level window functions are not affected by RPR anyway. >> I have two comments about this: >> >> It isn't just for convenience, it is for correctness. The window >> functions do not need to know which rows they are *not* operating on. >> >> There is no such thing as a "full" or "reduced" frame. The standard >> uses those terms to explain the difference between before and after >> RPR is applied, but window functions do not get to choose which frame >> they apply over. They only ever apply over the reduced window frame. > > I agree that "full window frame" and "reduced window frame" do not > exist at the same time, and in the end (after computation of reduced > frame), only "reduced" frame is visible to window > functions/aggregates. But I still do think that "full window frame" > and "reduced window frame" are important concept to explain/understand > how PRP works. If we are just using those terms for documentation, then okay. -- Vik Fearing
-
Re: Row pattern recognition
Vik Fearing <vik@postgresfriends.org> — 2023-07-28T09:21:25Z
On 7/26/23 14:21, Tatsuo Ishii wrote: > Attached is the v3 patch. In this patch following changes are made. Excellent. Thanks! A few quick comments: - PERMUTE is still misspelled as PREMUTE - PATTERN variables do not have to exist in the DEFINE clause. They are considered TRUE if not present. > (1) I completely changed the pattern matching engine so that it > performs backtracking. Now the engine evaluates all pattern elements > defined in PATTER against each row, saving matched pattern variables > in a string per row. For example if the pattern element A and B > evaluated to true, a string "AB" is created for current row. > > This continues until all pattern matching fails or encounters the end > of full window frame/partition. After that, the pattern matching > engine creates all possible "pattern strings" and apply the regular > expression matching to each. For example if we have row 0 = "AB" row 1 > = "C", possible pattern strings are: "AC" and "BC". > > If it matches, the length of matching substring is saved. After all > possible trials are done, the longest matching substring is chosen and > it becomes the width (number of rows) in the reduced window frame. > > See row_is_in_reduced_frame, search_str_set and search_str_set_recurse > in nodeWindowAggs.c for more details. For now I use a naive depth > first search and probably there is a lot of rooms for optimization > (for example rewriting it without using > recursion). Suggestions/patches are welcome. My own reviews will only focus on correctness for now. Once we get a good set of regression tests all passing, I will focus more on optimization. Of course, others might want to review the performance now. > Vik Fearing wrote: >> I strongly disagree with this. Window function do not need to know >> how the frame is defined, and indeed they should not. >> WinGetFuncArgInFrame should answer yes or no and the window function >> just works on that. Otherwise we will get extension (and possibly even >> core) functions that don't handle the frame properly. > > So I moved row_is_in_reduce_frame into WinGetFuncArgInFrame so that > those window functions are not needed to be changed. > > (3) Window function rpr was removed. We can use first_value instead. Excellent. > (4) Remaining tasks/issues. > > - For now I disable WinSetMarkPosition because RPR often needs to > access a row before the mark is set. We need to fix this in the > future. Noted, and agreed. > - I am working on making window aggregates RPR aware now. The > implementation is in progress and far from completeness. An example > is below. I think row 2, 3, 4 of "count" column should be NULL > instead of 3, 2, 0, 0. Same thing can be said to other > rows. Probably this is an effect of moving aggregate but I still > studying the window aggregation code. This tells me again that RPR is not being run in the right place. All windowed aggregates and frame-level window functions should Just Work with no modification. > SELECT company, tdate, first_value(price) OVER W, count(*) OVER w FROM stock > WINDOW w AS ( > PARTITION BY company > ORDER BY tdate > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > AFTER MATCH SKIP PAST LAST ROW > INITIAL > PATTERN (START UP+ DOWN+) > DEFINE > START AS TRUE, > UP AS price > PREV(price), > DOWN AS price < PREV(price) > ); > company | tdate | first_value | count > ----------+------------+-------------+------- > company1 | 2023-07-01 | 100 | 4 > company1 | 2023-07-02 | | 3 > company1 | 2023-07-03 | | 2 > company1 | 2023-07-04 | | 0 > company1 | 2023-07-05 | | 0 > company1 | 2023-07-06 | 90 | 4 > company1 | 2023-07-07 | | 3 > company1 | 2023-07-08 | | 2 > company1 | 2023-07-09 | | 0 > company1 | 2023-07-10 | | 0 > company2 | 2023-07-01 | 50 | 4 > company2 | 2023-07-02 | | 3 > company2 | 2023-07-03 | | 2 > company2 | 2023-07-04 | | 0 > company2 | 2023-07-05 | | 0 > company2 | 2023-07-06 | 60 | 4 > company2 | 2023-07-07 | | 3 > company2 | 2023-07-08 | | 2 > company2 | 2023-07-09 | | 0 > company2 | 2023-07-10 | | 0 In this scenario, row 1's frame is the first 5 rows and specified SKIP PAST LAST ROW, so rows 2-5 don't have *any* frame (because they are skipped) and the result of the outer count should be 0 for all of them because there are no rows in the frame. When we get to adding count in the MEASURES clause, there will be a difference between no match and empty match, but that does not apply here. > I am going to add this thread to CommitFest and plan to add both of > you as reviewers. Thanks in advance. My pleasure. Thank you for working on this difficult feature. -- Vik Fearing
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-07-28T11:02:30Z
>> Attached is the v3 patch. In this patch following changes are made. > > Excellent. Thanks! You are welcome! > A few quick comments: > > - PERMUTE is still misspelled as PREMUTE Oops. Will fix. > - PATTERN variables do not have to exist in the DEFINE clause. They are > - considered TRUE if not present. Do you think we really need this? I found a criticism regarding this. https://link.springer.com/article/10.1007/s13222-022-00404-3 "3.2 Explicit Definition of All Row Pattern Variables" What do you think? >> - I am working on making window aggregates RPR aware now. The >> implementation is in progress and far from completeness. An example >> is below. I think row 2, 3, 4 of "count" column should be NULL >> instead of 3, 2, 0, 0. Same thing can be said to other >> rows. Probably this is an effect of moving aggregate but I still >> studying the window aggregation code. > > This tells me again that RPR is not being run in the right place. All > windowed aggregates and frame-level window functions should Just Work > with no modification. I am not touching each aggregate function. I am modifying eval_windowaggregates() in nodeWindowAgg.c, which calls each aggregate function. Do you think it's not the right place to make window aggregates RPR aware? >> SELECT company, tdate, first_value(price) OVER W, count(*) OVER w FROM >> stock >> WINDOW w AS ( >> PARTITION BY company >> ORDER BY tdate >> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING >> AFTER MATCH SKIP PAST LAST ROW >> INITIAL >> PATTERN (START UP+ DOWN+) >> DEFINE >> START AS TRUE, >> UP AS price > PREV(price), >> DOWN AS price < PREV(price) >> ); >> company | tdate | first_value | count >> ----------+------------+-------------+------- >> company1 | 2023-07-01 | 100 | 4 >> company1 | 2023-07-02 | | 3 >> company1 | 2023-07-03 | | 2 >> company1 | 2023-07-04 | | 0 >> company1 | 2023-07-05 | | 0 >> company1 | 2023-07-06 | 90 | 4 >> company1 | 2023-07-07 | | 3 >> company1 | 2023-07-08 | | 2 >> company1 | 2023-07-09 | | 0 >> company1 | 2023-07-10 | | 0 >> company2 | 2023-07-01 | 50 | 4 >> company2 | 2023-07-02 | | 3 >> company2 | 2023-07-03 | | 2 >> company2 | 2023-07-04 | | 0 >> company2 | 2023-07-05 | | 0 >> company2 | 2023-07-06 | 60 | 4 >> company2 | 2023-07-07 | | 3 >> company2 | 2023-07-08 | | 2 >> company2 | 2023-07-09 | | 0 >> company2 | 2023-07-10 | | 0 > > In this scenario, row 1's frame is the first 5 rows and specified SKIP > PAST LAST ROW, so rows 2-5 don't have *any* frame (because they are > skipped) and the result of the outer count should be 0 for all of them > because there are no rows in the frame. Ok. Just I want to make sure. If it's other aggregates like sum or avg, the result of the outer aggregates should be NULL. > When we get to adding count in the MEASURES clause, there will be a > difference between no match and empty match, but that does not apply > here. Can you elaborate more? I understand that "no match" and "empty match" are different things. But I do not understand how the difference affects the result of count. >> I am going to add this thread to CommitFest and plan to add both of >> you as reviewers. Thanks in advance. > > My pleasure. Thank you for working on this difficult feature. Thank you for accepting being registered as a reviewer. Your comments are really helpful. -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Vik Fearing <vik@postgresfriends.org> — 2023-07-28T12:36:58Z
On 7/28/23 13:02, Tatsuo Ishii wrote: >>> Attached is the v3 patch. In this patch following changes are made. >> >> - PATTERN variables do not have to exist in the DEFINE clause. They are >> - considered TRUE if not present. > > Do you think we really need this? I found a criticism regarding this. > > https://link.springer.com/article/10.1007/s13222-022-00404-3 > "3.2 Explicit Definition of All Row Pattern Variables" > > What do you think? I think that a large part of obeying the standard is to allow queries from other engines to run the same on ours. The standard does not require the pattern variables to be defined and so there are certainly queries out there without them, and that hurts migrating to PostgreSQL. >>> - I am working on making window aggregates RPR aware now. The >>> implementation is in progress and far from completeness. An example >>> is below. I think row 2, 3, 4 of "count" column should be NULL >>> instead of 3, 2, 0, 0. Same thing can be said to other >>> rows. Probably this is an effect of moving aggregate but I still >>> studying the window aggregation code. >> >> This tells me again that RPR is not being run in the right place. All >> windowed aggregates and frame-level window functions should Just Work >> with no modification. > > I am not touching each aggregate function. I am modifying > eval_windowaggregates() in nodeWindowAgg.c, which calls each aggregate > function. Do you think it's not the right place to make window > aggregates RPR aware? Oh, okay. >>> SELECT company, tdate, first_value(price) OVER W, count(*) OVER w FROM >>> stock >>> WINDOW w AS ( >>> PARTITION BY company >>> ORDER BY tdate >>> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING >>> AFTER MATCH SKIP PAST LAST ROW >>> INITIAL >>> PATTERN (START UP+ DOWN+) >>> DEFINE >>> START AS TRUE, >>> UP AS price > PREV(price), >>> DOWN AS price < PREV(price) >>> ); >>> company | tdate | first_value | count >>> ----------+------------+-------------+------- >>> company1 | 2023-07-01 | 100 | 4 >>> company1 | 2023-07-02 | | 3 >>> company1 | 2023-07-03 | | 2 >>> company1 | 2023-07-04 | | 0 >>> company1 | 2023-07-05 | | 0 >>> company1 | 2023-07-06 | 90 | 4 >>> company1 | 2023-07-07 | | 3 >>> company1 | 2023-07-08 | | 2 >>> company1 | 2023-07-09 | | 0 >>> company1 | 2023-07-10 | | 0 >>> company2 | 2023-07-01 | 50 | 4 >>> company2 | 2023-07-02 | | 3 >>> company2 | 2023-07-03 | | 2 >>> company2 | 2023-07-04 | | 0 >>> company2 | 2023-07-05 | | 0 >>> company2 | 2023-07-06 | 60 | 4 >>> company2 | 2023-07-07 | | 3 >>> company2 | 2023-07-08 | | 2 >>> company2 | 2023-07-09 | | 0 >>> company2 | 2023-07-10 | | 0 >> >> In this scenario, row 1's frame is the first 5 rows and specified SKIP >> PAST LAST ROW, so rows 2-5 don't have *any* frame (because they are >> skipped) and the result of the outer count should be 0 for all of them >> because there are no rows in the frame. > > Ok. Just I want to make sure. If it's other aggregates like sum or > avg, the result of the outer aggregates should be NULL. They all behave the same way as in a normal query when they receive no rows as input. >> When we get to adding count in the MEASURES clause, there will be a >> difference between no match and empty match, but that does not apply >> here. > > Can you elaborate more? I understand that "no match" and "empty match" > are different things. But I do not understand how the difference > affects the result of count. This query: SELECT v.a, wcnt OVER w, count(*) OVER w FROM (VALUES ('A')) AS v (a) WINDOW w AS ( ORDER BY v.a MEASURES count(*) AS wcnt ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING PATTERN (B) DEFINE B AS B.a = 'B' ) produces this result: a | wcnt | count ---+------+------- A | | 0 (1 row) Inside the window specification, *no match* was found and so all of the MEASURES are null. The count(*) in the target list however, still exists and operates over zero rows. This very similar query: SELECT v.a, wcnt OVER w, count(*) OVER w FROM (VALUES ('A')) AS v (a) WINDOW w AS ( ORDER BY v.a MEASURES count(*) AS wcnt ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING PATTERN (B?) DEFINE B AS B.a = 'B' ) produces this result: a | wcnt | count ---+------+------- A | 0 | 0 (1 row) In this case, the pattern is B? instead of just B, which produces an *empty match* for the MEASURES to be applied over. -- Vik Fearing -
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-07-29T03:05:08Z
>>> - PATTERN variables do not have to exist in the DEFINE clause. They are >>> - considered TRUE if not present. >> Do you think we really need this? I found a criticism regarding this. >> https://link.springer.com/article/10.1007/s13222-022-00404-3 >> "3.2 Explicit Definition of All Row Pattern Variables" >> What do you think? > > I think that a large part of obeying the standard is to allow queries > from other engines to run the same on ours. The standard does not > require the pattern variables to be defined and so there are certainly > queries out there without them, and that hurts migrating to > PostgreSQL. Yeah, migration is good point. I agree we should have the feature. >>> When we get to adding count in the MEASURES clause, there will be a >>> difference between no match and empty match, but that does not apply >>> here. >> Can you elaborate more? I understand that "no match" and "empty match" >> are different things. But I do not understand how the difference >> affects the result of count. > > This query: > > SELECT v.a, wcnt OVER w, count(*) OVER w > FROM (VALUES ('A')) AS v (a) > WINDOW w AS ( > ORDER BY v.a > MEASURES count(*) AS wcnt > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > PATTERN (B) > DEFINE B AS B.a = 'B' > ) > > produces this result: > > a | wcnt | count > ---+------+------- > A | | 0 > (1 row) > > Inside the window specification, *no match* was found and so all of > the MEASURES are null. The count(*) in the target list however, still > exists and operates over zero rows. > > This very similar query: > > SELECT v.a, wcnt OVER w, count(*) OVER w > FROM (VALUES ('A')) AS v (a) > WINDOW w AS ( > ORDER BY v.a > MEASURES count(*) AS wcnt > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > PATTERN (B?) > DEFINE B AS B.a = 'B' > ) > > produces this result: > > a | wcnt | count > ---+------+------- > A | 0 | 0 > (1 row) > > In this case, the pattern is B? instead of just B, which produces an > *empty match* for the MEASURES to be applied over. Thank you for the detailed explanation. I think I understand now. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-08-09T08:41:12Z
Attached is the v4 patch. Differences from previous patch include: > - PERMUTE is still misspelled as PREMUTE Fixed. > - PATTERN variables do not have to exist in the DEFINE clause. They are > - considered TRUE if not present. Fixed. Moreover new regression test case is added. - It was possible that tle nodes in DEFINE clause do not appear in the plan's target list. This makes impossible to evaluate expressions in the DEFINE because it does not appear in the outer plan's target list. To fix this, call findTargetlistEntrySQL99 (with resjunk is true) so that the missing TargetEntry is added to the outer plan later on. - I eliminated some hacks in handling the Var node in DEFINE clause. Previously I replaced varattno of Var node in a plan tree by hand so that it refers to correct varattno in the outer plan node. In this patch I modified set_upper_references so that it calls fix_upper_expr for those Var nodes in the DEFINE clause. See v4-0003 patch for more details. - I found a bug with pattern matching code. It creates a string for subsequent regular expression matching. It uses the initial letter of each define variable name. For example, if the varname is "foo", then "f" is used. Obviously this makes trouble if we have two or more variables starting with same "f" (e.g. "food"). To fix this, I assign [a-z] to each variable instead of its initial letter. However this way limits us not to have more than 26 variables. I hope 26 is enough for most use cases. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-09-02T06:52:35Z
Attached is the v5 patch. Differences from previous patch include: * Resolve complaint from "PostgreSQL Patch Tester" https://commitfest.postgresql.org/44/4460/ - Change gram.y to use PATTERN_P instead of PATTERN. Using PATTERN seems to make trouble with Visual Studio build. : : [10:07:57.853] FAILED: src/backend/parser/parser.a.p/meson-generated_.._gram.c.obj [10:07:57.853] "cl" "-Isrc\backend\parser\parser.a.p" "-Isrc\backend\parser" "-I..\src\backend\parser" "-Isrc\include" "-I..\src\include" "-Ic:\openssl\1.1\include" "-I..\src\include\port\win32" "-I..\src\include\port\win32_msvc" "/MDd" "/nologo" "/showIncludes" "/utf-8" "/W2" "/Od" "/Zi" "/DWIN32" "/DWINDOWS" "/D__WINDOWS__" "/D__WIN32__" "/D_CRT_SECURE_NO_DEPRECATE" "/D_CRT_NONSTDC_NO_DEPRECATE" "/wd4018" "/wd4244" "/wd4273" "/wd4101" "/wd4102" "/wd4090" "/wd4267" "-DBUILDING_DLL" "/Fdsrc\backend\parser\parser.a.p\meson-generated_.._gram.c.pdb" /Fosrc/backend/parser/parser.a.p/meson-generated_.._gram.c.obj "/c" src/backend/parser/gram.c [10:07:57.860] c:\cirrus\build\src\backend\parser\gram.h(379): error C2365: 'PATTERN': redefinition; previous definition was 'typedef' [10:07:57.860] C:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\um\wingdi.h(1375): note: see declaration of 'PATTERN' [10:07:57.860] c:\cirrus\build\src\backend\parser\gram.h(379): error C2086: 'yytokentype PATTERN': redefinition [10:07:57.860] c:\cirrus\build\src\backend\parser\gram.h(379): note: see declaration of 'PATTERN' [10:07:57.860] ninja: build stopped: subcommand failed. * Resolve complaint from "make headerscheck" - Change Windowapi.h and nodeWindowAgg.c to remove unecessary extern and public functions. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Erik Rijkers <er@xs4all.nl> — 2023-09-02T18:04:02Z
Op 9/2/23 om 08:52 schreef Tatsuo Ishii: > Attached is the v5 patch. Differences from previous patch include: > Hi, The patches compile & tests run fine but this statement from the documentation crashes an assert-enabled server: SELECT company, tdate, price, max(price) OVER w FROM stock WINDOW w AS ( PARTITION BY company ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING AFTER MATCH SKIP PAST LAST ROW INITIAL PATTERN (LOWPRICE UP+ DOWN+) DEFINE LOWPRICE AS price <= 100, UP AS price > PREV(price), DOWN AS price < PREV(price) ); server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. connection to server was lost Log file: TRAP: failed Assert("aggregatedupto_nonrestarted <= winstate->aggregatedupto"), File: "nodeWindowAgg.c", Line: 1054, PID: 68975 postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT(ExceptionalCondition+0x54)[0x9b0824] postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT[0x71ae8d] postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT(standard_ExecutorRun+0x13a)[0x6def9a] /home/aardvark/pg_stuff/pg_installations/pgsql.rpr/lib/pg_stat_statements.so(+0x55e5)[0x7ff3798b95e5] /home/aardvark/pg_stuff/pg_installations/pgsql.rpr/lib/auto_explain.so(+0x2680)[0x7ff3798ab680] postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT[0x88a4ff] postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT(PortalRun+0x240)[0x88bb50] postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT[0x887cca] postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT(PostgresMain+0x14dc)[0x88958c] postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT[0x7fb0da] postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT(PostmasterMain+0xd2d)[0x7fc01d] postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT(main+0x1e0)[0x5286d0] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7ff378e9dd0a] postgres: 17_rpr_d0ec_gulo: aardvark testdb ::1(34808) SELECT(_start+0x2a)[0x5289aa] 2023-09-02 19:59:05.329 CEST 46723 LOG: server process (PID 68975) was terminated by signal 6: Aborted 2023-09-02 19:59:05.329 CEST 46723 DETAIL: Failed process was running: SELECT company, tdate, price, max(price) OVER w FROM stock WINDOW w AS ( PARTITION BY company ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING AFTER MATCH SKIP PAST LAST ROW INITIAL PATTERN (LOWPRICE UP+ DOWN+) DEFINE LOWPRICE AS price <= 100, UP AS price > PREV(price), DOWN AS price < PREV(price) ); 2023-09-02 19:59:05.329 CEST 46723 LOG: terminating any other active server processes Erik Rijkers -
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-09-03T00:03:44Z
> Hi, > > The patches compile & tests run fine but this statement from the > documentation crashes an assert-enabled server: > > SELECT company, tdate, price, max(price) OVER w FROM stock > WINDOW w AS ( > PARTITION BY company > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > AFTER MATCH SKIP PAST LAST ROW > INITIAL > PATTERN (LOWPRICE UP+ DOWN+) > DEFINE > LOWPRICE AS price <= 100, > UP AS price > PREV(price), > DOWN AS price < PREV(price) > ); > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > connection to server was lost Thank you for the report. Currently the patch has an issue with aggregate functions including max. I have been working on aggregations in row pattern recognition but will take more time to complete the part. In the mean time if you want to play with RPR, you can try window functions. Examples can be found in src/test/regress/sql/rpr.sql. Here is one of this: -- the first row start with less than or equal to 100 SELECT company, tdate, price, first_value(price) OVER w, last_value(price) OVER w FROM stock WINDOW w AS ( PARTITION BY company ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING INITIAL PATTERN (LOWPRICE UP+ DOWN+) DEFINE LOWPRICE AS price <= 100, UP AS price > PREV(price), DOWN AS price < PREV(price) ); -- second row raises 120% SELECT company, tdate, price, first_value(price) OVER w, last_value(price) OVER w FROM stock WINDOW w AS ( PARTITION BY company ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING INITIAL PATTERN (LOWPRICE UP+ DOWN+) DEFINE LOWPRICE AS price <= 100, UP AS price > PREV(price) * 1.2, DOWN AS price < PREV(price) ); Sorry for inconvenience. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Jacob Champion <jchampion@timescale.com> — 2023-09-08T00:00:07Z
Hello! > (1) I completely changed the pattern matching engine so that it > performs backtracking. Now the engine evaluates all pattern elements > defined in PATTER against each row, saving matched pattern variables > in a string per row. For example if the pattern element A and B > evaluated to true, a string "AB" is created for current row. If I understand correctly, this strategy assumes that one row's membership in a pattern variable is independent of the other rows' membership. But I don't think that's necessarily true: DEFINE A AS PREV(CLASSIFIER()) IS DISTINCT FROM 'A', ... > See row_is_in_reduced_frame, search_str_set and search_str_set_recurse > in nodeWindowAggs.c for more details. For now I use a naive depth > first search and probably there is a lot of rooms for optimization > (for example rewriting it without using > recursion). The depth-first match is doing a lot of subtle work here. For example, with PATTERN ( A+ B+ ) DEFINE A AS TRUE, B AS TRUE (i.e. all rows match both variables), and three rows in the partition, our candidates will be tried in the order aaa aab aba abb ... bbb The only possible matches against our regex `^a+b+` are "aab" and "abb", and that order matches the preferment order, so it's fine. But it's easy to come up with a pattern where that's the wrong order, like PATTERN ( A+ (B|A)+ ) Now "aaa" will be considered before "aab", which isn't correct. Similarly, the assumption that we want to match the longest string only works because we don't allow alternation yet. > Suggestions/patches are welcome. Cool, I will give this piece some more thought. Do you mind if I try to add some more complicated pattern quantifiers to stress the architecture, or would you prefer to tackle that later? Just alternation by itself will open up a world of corner cases. > With the new engine, cases above do not fail anymore. See new > regression test cases. Thanks for providing valuable test cases! You're very welcome! On 8/9/23 01:41, Tatsuo Ishii wrote: > - I found a bug with pattern matching code. It creates a string for > subsequent regular expression matching. It uses the initial letter > of each define variable name. For example, if the varname is "foo", > then "f" is used. Obviously this makes trouble if we have two or > more variables starting with same "f" (e.g. "food"). To fix this, I > assign [a-z] to each variable instead of its initial letter. However > this way limits us not to have more than 26 variables. I hope 26 is > enough for most use cases. There are still plenty of alphanumerics left that could be assigned... But I'm wondering if we might want to just implement the NFA directly? The current implementation's Cartesian explosion can probably be pruned aggressively, but replaying the entire regex match once for every backtracked step will still duplicate a lot of work. -- I've attached another test case; it looks like last_value() is depending on some sort of side effect from either first_value() or nth_value(). I know the window frame itself is still under construction, so apologies if this is an expected failure. Thanks! --Jacob -
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-09-08T03:54:47Z
Hi, > Hello! > >> (1) I completely changed the pattern matching engine so that it >> performs backtracking. Now the engine evaluates all pattern elements >> defined in PATTER against each row, saving matched pattern variables >> in a string per row. For example if the pattern element A and B >> evaluated to true, a string "AB" is created for current row. > > If I understand correctly, this strategy assumes that one row's > membership in a pattern variable is independent of the other rows' > membership. But I don't think that's necessarily true: > > DEFINE > A AS PREV(CLASSIFIER()) IS DISTINCT FROM 'A', > ... But: UP AS price > PREV(price) also depends on previous row, no? Can you please elaborate how your example could break current implementation? I cannot test it because CLASSIFIER is not implemented yet. >> See row_is_in_reduced_frame, search_str_set and search_str_set_recurse >> in nodeWindowAggs.c for more details. For now I use a naive depth >> first search and probably there is a lot of rooms for optimization >> (for example rewriting it without using >> recursion). > > The depth-first match is doing a lot of subtle work here. For example, with > > PATTERN ( A+ B+ ) > DEFINE A AS TRUE, > B AS TRUE > > (i.e. all rows match both variables), and three rows in the partition, > our candidates will be tried in the order > > aaa > aab > aba > abb > ... > bbb > > The only possible matches against our regex `^a+b+` are "aab" and "abb", > and that order matches the preferment order, so it's fine. But it's easy > to come up with a pattern where that's the wrong order, like > > PATTERN ( A+ (B|A)+ ) > > Now "aaa" will be considered before "aab", which isn't correct. Can you explain a little bit more? I think 'aaa' matches a regular expression 'a+(b|a)+' and should be no problem before "aab" is considered. > Similarly, the assumption that we want to match the longest string only > works because we don't allow alternation yet. Can you please clarify more on this? > Cool, I will give this piece some more thought. Do you mind if I try to > add some more complicated pattern quantifiers to stress the > architecture, or would you prefer to tackle that later? Just alternation > by itself will open up a world of corner cases. Do you mean you want to provide a better patch for the pattern matching part? That will be helpfull. Because I am currently working on the aggregation part and have no time to do it. However, the aggregation work affects the v5 patch: it needs a refactoring. So can you wait until I release v6 patch? I hope it will be released in two weeks or so. > On 8/9/23 01:41, Tatsuo Ishii wrote: >> - I found a bug with pattern matching code. It creates a string for >> subsequent regular expression matching. It uses the initial letter >> of each define variable name. For example, if the varname is "foo", >> then "f" is used. Obviously this makes trouble if we have two or >> more variables starting with same "f" (e.g. "food"). To fix this, I >> assign [a-z] to each variable instead of its initial letter. However >> this way limits us not to have more than 26 variables. I hope 26 is >> enough for most use cases. > > There are still plenty of alphanumerics left that could be assigned... > > But I'm wondering if we might want to just implement the NFA directly? > The current implementation's Cartesian explosion can probably be pruned > aggressively, but replaying the entire regex match once for every > backtracked step will still duplicate a lot of work. Not sure if you mean implementing new regular expression engine besides src/backend/regexp. I am afraid it's not a trivial work. The current regexp code consists of over 10k lines. What do you think? > I've attached another test case; it looks like last_value() is depending > on some sort of side effect from either first_value() or nth_value(). I > know the window frame itself is still under construction, so apologies > if this is an expected failure. Thanks. Fortunately current code which I am working passes the new test. I will include it in the next (v6) patch. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Jacob Champion <jchampion@timescale.com> — 2023-09-08T19:27:05Z
On 9/7/23 20:54, Tatsuo Ishii wrote: >> DEFINE >> A AS PREV(CLASSIFIER()) IS DISTINCT FROM 'A', >> ... > > But: > > UP AS price > PREV(price) > > also depends on previous row, no? PREV(CLASSIFIER()) depends not on the value of the previous row but the state of the match so far. To take an example from the patch: > * Example: > * str_set[0] = "AB"; > * str_set[1] = "AC"; > * In this case at row 0 A and B are true, and A and C are true in row 1. With these str_sets and my example DEFINE, row[1] is only classifiable as 'A' if row[0] is *not* classified as 'A' at this point in the match. "AA" is not a valid candidate string, even if it matches the PATTERN. So if we don't reevaluate the pattern variable condition for the row, we at least have to prune the combinations that search_str_set() visits, so that we don't generate a logically impossible combination. That seems like it could be pretty fragile, and it may be difficult for us to prove compliance. >> But it's easy >> to come up with a pattern where that's the wrong order, like >> >> PATTERN ( A+ (B|A)+ ) >> >> Now "aaa" will be considered before "aab", which isn't correct. > > Can you explain a little bit more? I think 'aaa' matches a regular > expression 'a+(b|a)+' and should be no problem before "aab" is > considered. Assuming I've understood the rules correctly, we're not allowed to classify the last row as 'A' if it also matches 'B'. Lexicographic ordering takes precedence, so we have to try "aab" first. Otherwise our query could return different results compared to another implementation. >> Similarly, the assumption that we want to match the longest string only >> works because we don't allow alternation yet. > > Can you please clarify more on this? Sure: for the pattern PATTERN ( (A|B)+ ) we have to consider the candidate "a" over the candidate "ba", even though the latter is longer. Like the prior example, lexicographic ordering is considered more important than the greedy quantifier. Quoting ISO/IEC 9075-2:2016: More precisely, with both reluctant and greedy quantifiers, the set of matches is ordered lexicographically, but when one match is an initial substring of another match, reluctant quantifiers prefer the shorter match (the substring), whereas greedy quantifiers prefer the longer match (the “superstring”). Here, "ba" doesn't have "a" as a prefix, so "ba" doesn't get priority. ISO/IEC 19075-5:2021 has a big section on this (7.2) with worked examples. (The "lexicographic order matters more than greediness" concept was the most mind-bending part for me so far, probably because I haven't figured out how to translate the concept into POSIX EREs. It wouldn't make sense to say "the letter 't' can match 'a', 'B', or '3' in this regex", but that's what RPR is doing.) >> Cool, I will give this piece some more thought. Do you mind if I try to >> add some more complicated pattern quantifiers to stress the >> architecture, or would you prefer to tackle that later? Just alternation >> by itself will open up a world of corner cases. > > Do you mean you want to provide a better patch for the pattern > matching part? That will be helpfull. No guarantees that I'll find a better patch :D But yes, I will give it a try. > Because I am currently working > on the aggregation part and have no time to do it. However, the > aggregation work affects the v5 patch: it needs a refactoring. So can > you wait until I release v6 patch? I hope it will be released in two > weeks or so. Absolutely! >> But I'm wondering if we might want to just implement the NFA directly? >> The current implementation's Cartesian explosion can probably be pruned >> aggressively, but replaying the entire regex match once for every >> backtracked step will still duplicate a lot of work. > > Not sure if you mean implementing new regular expression engine > besides src/backend/regexp. I am afraid it's not a trivial work. The > current regexp code consists of over 10k lines. What do you think? Heh, I think it would be pretty foolish for me to code an NFA, from scratch, and then try to convince the community to maintain it. But: - I think we have to implement a parallel parser regardless (RPR PATTERN syntax looks incompatible with POSIX) - I suspect we need more control over the backtracking than the current pg_reg* API is going to give us, or else I'm worried performance is going to fall off a cliff with usefully-large partitions - there's a lot of stuff in POSIX EREs that we don't need, and of the features we do need, the + quantifier is probably one of the easiest - it seems easier to prove the correctness of a slow, naive, row-at-a-time engine, because we can compare it directly to the spec So what I'm thinking is: if I start by open-coding the + quantifier, and slowly add more pieces in, then it might be easier to see the parts of src/backend/regex that I've duplicated. We can try to expose those parts directly from the internal API to replace my bad implementation. And if there are parts that aren't duplicated, then it'll be easier to explain why we need something different from the current engine. Does that seem like a workable approach? (Worst-case, my code is just horrible, and we throw it in the trash.) >> I've attached another test case; it looks like last_value() is depending >> on some sort of side effect from either first_value() or nth_value(). I >> know the window frame itself is still under construction, so apologies >> if this is an expected failure. > > Thanks. Fortunately current code which I am working passes the new > test. I will include it in the next (v6) patch. Great! Thanks, --Jacob -
Re: Row pattern recognition
Vik Fearing <vik@postgresfriends.org> — 2023-09-08T21:43:10Z
On 9/8/23 21:27, Jacob Champion wrote: > On 9/7/23 20:54, Tatsuo Ishii wrote: >>> But it's easy >>> to come up with a pattern where that's the wrong order, like >>> >>> PATTERN ( A+ (B|A)+ ) >>> >>> Now "aaa" will be considered before "aab", which isn't correct. >> >> Can you explain a little bit more? I think 'aaa' matches a regular >> expression 'a+(b|a)+' and should be no problem before "aab" is >> considered. > > Assuming I've understood the rules correctly, we're not allowed to > classify the last row as 'A' if it also matches 'B'. Lexicographic > ordering takes precedence, so we have to try "aab" first. Otherwise our > query could return different results compared to another implementation. Your understanding is correct. -- Vik Fearing
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-09-09T11:21:21Z
Hi, >> But: >> >> UP AS price > PREV(price) >> >> also depends on previous row, no? > > PREV(CLASSIFIER()) depends not on the value of the previous row but the > state of the match so far. To take an example from the patch: > >> * Example: >> * str_set[0] = "AB"; >> * str_set[1] = "AC"; >> * In this case at row 0 A and B are true, and A and C are true in row 1. > > With these str_sets and my example DEFINE, row[1] is only classifiable > as 'A' if row[0] is *not* classified as 'A' at this point in the match. > "AA" is not a valid candidate string, even if it matches the PATTERN. Ok, Let me clarify my understanding. Suppose we have: PATTER (A B) DEFINE A AS PREV(CLASSIFIER()) IS DISTINCT FROM 'A', B AS price > 100 and the target table has price column values: row[0]: 110 row[1]: 110 row[2]: 110 row[3]: 110 Then we will get for str_set: r0: B r1: AB Because r0 only has classifier B, r1 can have A and B. Problem is, r2. If we choose A at r1, then r2 = B. But if we choose B at t1, then r2 = AB. I guess this is the issue you pointed out. > So if we don't reevaluate the pattern variable condition for the row, we > at least have to prune the combinations that search_str_set() visits, so > that we don't generate a logically impossible combination. That seems > like it could be pretty fragile, and it may be difficult for us to prove > compliance. Yeah, probably we have delay evaluation of such pattern variables like A, then reevaluate A after the first scan. What about leaving this (reevaluation) for now? Because: 1) we don't have CLASSIFIER 2) we don't allow to give CLASSIFIER to PREV as its arggument so I think we don't need to worry about this for now. >> Can you explain a little bit more? I think 'aaa' matches a regular >> expression 'a+(b|a)+' and should be no problem before "aab" is >> considered. > > Assuming I've understood the rules correctly, we're not allowed to > classify the last row as 'A' if it also matches 'B'. Lexicographic > ordering takes precedence, so we have to try "aab" first. Otherwise our > query could return different results compared to another implementation. > >>> Similarly, the assumption that we want to match the longest string only >>> works because we don't allow alternation yet. >> >> Can you please clarify more on this? > > Sure: for the pattern > > PATTERN ( (A|B)+ ) > > we have to consider the candidate "a" over the candidate "ba", even > though the latter is longer. Like the prior example, lexicographic > ordering is considered more important than the greedy quantifier. > Quoting ISO/IEC 9075-2:2016: > > More precisely, with both reluctant and greedy quantifiers, the set > of matches is ordered lexicographically, but when one match is an > initial substring of another match, reluctant quantifiers prefer the > shorter match (the substring), whereas greedy quantifiers prefer the > longer match (the “superstring”). > > Here, "ba" doesn't have "a" as a prefix, so "ba" doesn't get priority. > ISO/IEC 19075-5:2021 has a big section on this (7.2) with worked examples. > > (The "lexicographic order matters more than greediness" concept was the > most mind-bending part for me so far, probably because I haven't figured > out how to translate the concept into POSIX EREs. It wouldn't make sense > to say "the letter 't' can match 'a', 'B', or '3' in this regex", but > that's what RPR is doing.) Thanks for the explanation. Surprising concet of the standard:-) Is it different from SIMILAR TO REs too? What if we don't follow the standard, instead we follow POSIX EREs? I think this is better for users unless RPR's REs has significant merit for users. >> Do you mean you want to provide a better patch for the pattern >> matching part? That will be helpfull. > > No guarantees that I'll find a better patch :D But yes, I will give it a > try. Ok. >> Because I am currently working >> on the aggregation part and have no time to do it. However, the >> aggregation work affects the v5 patch: it needs a refactoring. So can >> you wait until I release v6 patch? I hope it will be released in two >> weeks or so. > > Absolutely! Thanks. > Heh, I think it would be pretty foolish for me to code an NFA, from > scratch, and then try to convince the community to maintain it. > > But: > - I think we have to implement a parallel parser regardless (RPR PATTERN > syntax looks incompatible with POSIX) I am not sure if we need to worry about this because of the reason I mentioned above. > - I suspect we need more control over the backtracking than the current > pg_reg* API is going to give us, or else I'm worried performance is > going to fall off a cliff with usefully-large partitions Agreed. > - there's a lot of stuff in POSIX EREs that we don't need, and of the > features we do need, the + quantifier is probably one of the easiest > - it seems easier to prove the correctness of a slow, naive, > row-at-a-time engine, because we can compare it directly to the spec > > So what I'm thinking is: if I start by open-coding the + quantifier, and > slowly add more pieces in, then it might be easier to see the parts of > src/backend/regex that I've duplicated. We can try to expose those parts > directly from the internal API to replace my bad implementation. And if > there are parts that aren't duplicated, then it'll be easier to explain > why we need something different from the current engine. > > Does that seem like a workable approach? (Worst-case, my code is just > horrible, and we throw it in the trash.) Yes, it seems workable. I think for the first cut of RPR needs at least the +quantifier with reasonable performance. The current naive implementation seems to have issue because of exhaustive search. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Vik Fearing <vik@postgresfriends.org> — 2023-09-09T13:32:41Z
On 9/9/23 13:21, Tatsuo Ishii wrote: > Thanks for the explanation. Surprising concet of the standard:-) <quote from 19075-5:2023> This leaves the choice between traditional NFA and Posix NFA. The difference between these is that a traditional NFA exits (declares a match) as soon as it finds the first possible match, whereas a Posix NFA is obliged to find all possible matches and then choose the “leftmost longest”. There are examples that show that, even for conventional regular expression matching on text strings and without back references, there are patterns for which a Posix NFA is orders of magnitude slower than a traditional NFA. In addition, reluctant quantifiers cannot be defined in a Posix NFA, because of the leftmost longest rule. Therefore it was decided not to use the Posix NFA model, which leaves the traditional NFA as the model for row pattern matching. Among available tools that use traditional NFA engines, Perl is the most influential; therefore Perl was adopted as the design target for pattern matching rules. </quote> > Is it different from SIMILAR TO REs too? Of course it is. :-) SIMILAR TO uses its own language and rules. > What if we don't follow the standard, instead we follow POSIX EREs? I > think this is better for users unless RPR's REs has significant merit > for users. This would get big pushback from me. -- Vik Fearing
-
Re: Row pattern recognition
Jacob Champion <jchampion@timescale.com> — 2023-09-11T22:13:43Z
On Sat, Sep 9, 2023 at 4:21 AM Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > Then we will get for str_set: > r0: B > r1: AB > > Because r0 only has classifier B, r1 can have A and B. Problem is, > r2. If we choose A at r1, then r2 = B. But if we choose B at t1, then > r2 = AB. I guess this is the issue you pointed out. Right. > Yeah, probably we have delay evaluation of such pattern variables like > A, then reevaluate A after the first scan. > > What about leaving this (reevaluation) for now? Because: > > 1) we don't have CLASSIFIER > 2) we don't allow to give CLASSIFIER to PREV as its arggument > > so I think we don't need to worry about this for now. Sure. I'm all for deferring features to make it easier to iterate; I just want to make sure the architecture doesn't hit a dead end. Or at least, not without being aware of it. Also: is CLASSIFIER the only way to run into this issue? > What if we don't follow the standard, instead we follow POSIX EREs? I > think this is better for users unless RPR's REs has significant merit > for users. Piggybacking off of what Vik wrote upthread, I think we would not be doing ourselves any favors by introducing a non-compliant implementation that performs worse than a traditional NFA. Those would be some awful bug reports. > > - I think we have to implement a parallel parser regardless (RPR PATTERN > > syntax looks incompatible with POSIX) > > I am not sure if we need to worry about this because of the reason I > mentioned above. Even if we adopted POSIX NFA semantics, we'd still have to implement our own parser for the PATTERN part of the query. I don't think there's a good way for us to reuse the parser in src/backend/regex. > > Does that seem like a workable approach? (Worst-case, my code is just > > horrible, and we throw it in the trash.) > > Yes, it seems workable. I think for the first cut of RPR needs at > least the +quantifier with reasonable performance. The current naive > implementation seems to have issue because of exhaustive search. +1 Thanks! --Jacob
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-09-12T06:18:43Z
>> What about leaving this (reevaluation) for now? Because: >> >> 1) we don't have CLASSIFIER >> 2) we don't allow to give CLASSIFIER to PREV as its arggument >> >> so I think we don't need to worry about this for now. > > Sure. I'm all for deferring features to make it easier to iterate; I > just want to make sure the architecture doesn't hit a dead end. Or at > least, not without being aware of it. Ok, let's defer this issue. Currently the patch already exceeds 3k lines. I am afraid too big patch cannot be reviewed by anyone, which means it will never be committed. > Also: is CLASSIFIER the only way to run into this issue? Good question. I would like to know. >> What if we don't follow the standard, instead we follow POSIX EREs? I >> think this is better for users unless RPR's REs has significant merit >> for users. > > Piggybacking off of what Vik wrote upthread, I think we would not be > doing ourselves any favors by introducing a non-compliant > implementation that performs worse than a traditional NFA. Those would > be some awful bug reports. What I am not sure about is, you and Vik mentioned that the traditional NFA is superior that POSIX NFA in terms of performance. But how "lexicographic ordering" is related to performance? >> I am not sure if we need to worry about this because of the reason I >> mentioned above. > > Even if we adopted POSIX NFA semantics, we'd still have to implement > our own parser for the PATTERN part of the query. I don't think > there's a good way for us to reuse the parser in src/backend/regex. Ok. >> > Does that seem like a workable approach? (Worst-case, my code is just >> > horrible, and we throw it in the trash.) >> >> Yes, it seems workable. I think for the first cut of RPR needs at >> least the +quantifier with reasonable performance. The current naive >> implementation seems to have issue because of exhaustive search. > > +1 BTW, attched is the v6 patch. The differences from v5 include: - Now aggregates can be used with RPR. Below is an example from the regression test cases, which is added by v6 patch. - Fix assersion error pointed out by Erik. SELECT company, tdate, price, first_value(price) OVER w, last_value(price) OVER w, max(price) OVER w, min(price) OVER w, sum(price) OVER w, avg(price) OVER w, count(price) OVER w FROM stock WINDOW w AS ( PARTITION BY company ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING AFTER MATCH SKIP PAST LAST ROW INITIAL PATTERN (START UP+ DOWN+) DEFINE START AS TRUE, UP AS price > PREV(price), DOWN AS price < PREV(price) ); company | tdate | price | first_value | last_value | max | min | sum | avg | count ----------+------------+-------+-------------+------------+------+-----+------+-----------------------+------- company1 | 07-01-2023 | 100 | 100 | 140 | 200 | 100 | 590 | 147.5000000000000000 | 4 company1 | 07-02-2023 | 200 | | | | | | | company1 | 07-03-2023 | 150 | | | | | | | company1 | 07-04-2023 | 140 | | | | | | | company1 | 07-05-2023 | 150 | | | | | | | company1 | 07-06-2023 | 90 | 90 | 120 | 130 | 90 | 450 | 112.5000000000000000 | 4 company1 | 07-07-2023 | 110 | | | | | | | company1 | 07-08-2023 | 130 | | | | | | | company1 | 07-09-2023 | 120 | | | | | | | company1 | 07-10-2023 | 130 | | | | | | | company2 | 07-01-2023 | 50 | 50 | 1400 | 2000 | 50 | 4950 | 1237.5000000000000000 | 4 company2 | 07-02-2023 | 2000 | | | | | | | company2 | 07-03-2023 | 1500 | | | | | | | company2 | 07-04-2023 | 1400 | | | | | | | company2 | 07-05-2023 | 1500 | | | | | | | company2 | 07-06-2023 | 60 | 60 | 1200 | 1300 | 60 | 3660 | 915.0000000000000000 | 4 company2 | 07-07-2023 | 1100 | | | | | | | company2 | 07-08-2023 | 1300 | | | | | | | company2 | 07-09-2023 | 1200 | | | | | | | company2 | 07-10-2023 | 1300 | | | | | | | (20 rows) Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-09-12T08:44:57Z
Regarding v6 patch: > SELECT company, tdate, price, > first_value(price) OVER w, > last_value(price) OVER w, > max(price) OVER w, > min(price) OVER w, > sum(price) OVER w, > avg(price) OVER w, > count(price) OVER w > FROM stock > WINDOW w AS ( > PARTITION BY company > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > AFTER MATCH SKIP PAST LAST ROW > INITIAL > PATTERN (START UP+ DOWN+) > DEFINE > START AS TRUE, > UP AS price > PREV(price), > DOWN AS price < PREV(price) > ); > company | tdate | price | first_value | last_value | max | min | sum | avg | count > ----------+------------+-------+-------------+------------+------+-----+------+-----------------------+------- > company1 | 07-01-2023 | 100 | 100 | 140 | 200 | 100 | 590 | 147.5000000000000000 | 4 > company1 | 07-02-2023 | 200 | | | | | | | > company1 | 07-03-2023 | 150 | | | | | | | > company1 | 07-04-2023 | 140 | | | | | | | > company1 | 07-05-2023 | 150 | | | | | | | > company1 | 07-06-2023 | 90 | 90 | 120 | 130 | 90 | 450 | 112.5000000000000000 | 4 > company1 | 07-07-2023 | 110 | | | | | | | > company1 | 07-08-2023 | 130 | | | | | | | > company1 | 07-09-2023 | 120 | | | | | | | > company1 | 07-10-2023 | 130 | | | | | | | > company2 | 07-01-2023 | 50 | 50 | 1400 | 2000 | 50 | 4950 | 1237.5000000000000000 | 4 > company2 | 07-02-2023 | 2000 | | | | | | | > company2 | 07-03-2023 | 1500 | | | | | | | > company2 | 07-04-2023 | 1400 | | | | | | | > company2 | 07-05-2023 | 1500 | | | | | | | > company2 | 07-06-2023 | 60 | 60 | 1200 | 1300 | 60 | 3660 | 915.0000000000000000 | 4 > company2 | 07-07-2023 | 1100 | | | | | | | > company2 | 07-08-2023 | 1300 | | | | | | | > company2 | 07-09-2023 | 1200 | | | | | | | > company2 | 07-10-2023 | 1300 | | | | | | | > (20 rows) count column for unmatched rows should have been 0, rather than NULL. i.e. company | tdate | price | first_value | last_value | max | min | sum | avg | count ----------+------------+-------+-------------+------------+------+-----+------+-----------------------+------- company1 | 07-01-2023 | 100 | 100 | 140 | 200 | 100 | 590 | 147.5000000000000000 | 4 company1 | 07-02-2023 | 200 | | | | | | | company1 | 07-03-2023 | 150 | | | | | | | company1 | 07-04-2023 | 140 | | | | | | | company1 | 07-05-2023 | 150 | | | | | | | 0 company1 | 07-06-2023 | 90 | 90 | 120 | 130 | 90 | 450 | 112.5000000000000000 | 4 company1 | 07-07-2023 | 110 | | | | | | | company1 | 07-08-2023 | 130 | | | | | | | company1 | 07-09-2023 | 120 | | | | | | | company1 | 07-10-2023 | 130 | | | | | | | 0 company2 | 07-01-2023 | 50 | 50 | 1400 | 2000 | 50 | 4950 | 1237.5000000000000000 | 4 company2 | 07-02-2023 | 2000 | | | | | | | company2 | 07-03-2023 | 1500 | | | | | | | company2 | 07-04-2023 | 1400 | | | | | | | company2 | 07-05-2023 | 1500 | | | | | | | 0 company2 | 07-06-2023 | 60 | 60 | 1200 | 1300 | 60 | 3660 | 915.0000000000000000 | 4 company2 | 07-07-2023 | 1100 | | | | | | | company2 | 07-08-2023 | 1300 | | | | | | | company2 | 07-09-2023 | 1200 | | | | | | | company2 | 07-10-2023 | 1300 | | | | | | | 0 (20 rows) Attached is the fix against v6 patch. I will include this in upcoming v7 patch. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Jacob Champion <jchampion@timescale.com> — 2023-09-12T22:09:29Z
On Mon, Sep 11, 2023 at 11:18 PM Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > What I am not sure about is, you and Vik mentioned that the > traditional NFA is superior that POSIX NFA in terms of performance. > But how "lexicographic ordering" is related to performance? I think they're only tangentially related. POSIX NFAs have to fully backtrack even after the first match is found, so that's where the performance difference comes in. (We would be introducing new ways to catastrophically backtrack if we used that approach.) But since you don't visit every possible path through the graph with a traditional NFA, it makes sense to define an order in which you visit the nodes, so that you can reason about which string is actually going to be matched in the end. > BTW, attched is the v6 patch. The differences from v5 include: > > - Now aggregates can be used with RPR. Below is an example from the > regression test cases, which is added by v6 patch. Great, thank you! --Jacob
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-09-13T05:14:07Z
> <quote from 19075-5:2023> I was looking for this but I only found ISO/IEC 19075-5:2021. https://www.iso.org/standard/78936.html Maybe 19075-5:2021 is the latest one? Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Vik Fearing <vik@postgresfriends.org> — 2023-09-13T11:28:45Z
On 9/13/23 07:14, Tatsuo Ishii wrote: >> <quote from 19075-5:2023> > > I was looking for this but I only found ISO/IEC 19075-5:2021. > https://www.iso.org/standard/78936.html > > Maybe 19075-5:2021 is the latest one? Yes, probably. Sorry. -- Vik Fearing
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-09-13T12:35:53Z
> On 9/13/23 07:14, Tatsuo Ishii wrote: >>> <quote from 19075-5:2023> >> I was looking for this but I only found ISO/IEC 19075-5:2021. >> https://www.iso.org/standard/78936.html >> Maybe 19075-5:2021 is the latest one? > > Yes, probably. Sorry. No problem. Thanks for confirmation. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-09-22T05:16:40Z
> Attached is the fix against v6 patch. I will include this in upcoming v7 patch. Attached is the v7 patch. It includes the fix mentioned above. Also this time the pattern matching engine is enhanced: previously it recursed to row direction, which means if the number of rows in a frame is large, it could exceed the limit of stack depth. The new version recurses over matched pattern variables in a row, which is at most 26 which should be small enough. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Erik Rijkers <er@xs4all.nl> — 2023-09-22T08:12:38Z
Op 9/22/23 om 07:16 schreef Tatsuo Ishii: >> Attached is the fix against v6 patch. I will include this in upcoming v7 patch. > > Attached is the v7 patch. It includes the fix mentioned above. Also Hi, In my hands, make check fails on the rpr test; see attached .diff file. In these two statements: -- using NEXT -- using AFTER MATCH SKIP TO NEXT ROW result of first_value(price) and next_value(price) are empty. Erik Rijkers > this time the pattern matching engine is enhanced: previously it > recursed to row direction, which means if the number of rows in a > frame is large, it could exceed the limit of stack depth. The new > version recurses over matched pattern variables in a row, which is at > most 26 which should be small enough. > > Best reagards, > -- > Tatsuo Ishii > SRA OSS LLC > English: http://www.sraoss.co.jp/index_en/ > Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Erik Rijkers <er@xs4all.nl> — 2023-09-22T08:23:11Z
Op 9/22/23 om 07:16 schreef Tatsuo Ishii: >> Attached is the fix against v6 patch. I will include this in upcoming v7 patch. > > Attached is the v7 patch. It includes the fix mentioned above. Also (Champion's address bounced; removed) Hi, In my hands, make check fails on the rpr test; see attached .diff file. In these two statements: -- using NEXT -- using AFTER MATCH SKIP TO NEXT ROW result of first_value(price) and next_value(price) are empty. Erik Rijkers > this time the pattern matching engine is enhanced: previously it > recursed to row direction, which means if the number of rows in a > frame is large, it could exceed the limit of stack depth. The new > version recurses over matched pattern variables in a row, which is at > most 26 which should be small enough. > > Best reagards, > -- > Tatsuo Ishii > SRA OSS LLC > English: http://www.sraoss.co.jp/index_en/ > Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Erik Rijkers <er@xs4all.nl> — 2023-09-22T08:26:49Z
Op 9/22/23 om 10:23 schreef Erik Rijkers: > Op 9/22/23 om 07:16 schreef Tatsuo Ishii: >>> Attached is the fix against v6 patch. I will include this in upcoming >>> v7 patch. >> >> Attached is the v7 patch. It includes the fix mentioned above. Also > (Champion's address bounced; removed) > Sorry, I forgot to re-attach the regression.diffs with resend... Erik > Hi, > > In my hands, make check fails on the rpr test; see attached .diff file. > In these two statements: > -- using NEXT > -- using AFTER MATCH SKIP TO NEXT ROW > result of first_value(price) and next_value(price) are empty. > > Erik Rijkers > > >> this time the pattern matching engine is enhanced: previously it >> recursed to row direction, which means if the number of rows in a >> frame is large, it could exceed the limit of stack depth. The new >> version recurses over matched pattern variables in a row, which is at >> most 26 which should be small enough. >> >> Best reagards, >> -- >> Tatsuo Ishii >> SRA OSS LLC >> English: http://www.sraoss.co.jp/index_en/ >> Japanese:http://www.sraoss.co.jp > >
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-09-22T10:12:50Z
> Op 9/22/23 om 07:16 schreef Tatsuo Ishii: >>> Attached is the fix against v6 patch. I will include this in upcoming >>> v7 patch. >> Attached is the v7 patch. It includes the fix mentioned above. Also > (Champion's address bounced; removed) On my side his adress bounced too:-< > Hi, > > In my hands, make check fails on the rpr test; see attached .diff > file. > In these two statements: > -- using NEXT > -- using AFTER MATCH SKIP TO NEXT ROW > result of first_value(price) and next_value(price) are empty. Strange. I have checked out fresh master branch and applied the v7 patches, then ran make check. All tests including the rpr test passed. This is Ubuntu 20.04. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Erik Rijkers <er@xs4all.nl> — 2023-09-22T11:28:11Z
Op 9/22/23 om 12:12 schreef Tatsuo Ishii: >> Op 9/22/23 om 07:16 schreef Tatsuo Ishii: >>>> Attached is the fix against v6 patch. I will include this in upcoming >>>> v7 patch. >>> Attached is the v7 patch. It includes the fix mentioned above. Also >> (Champion's address bounced; removed) > > On my side his adress bounced too:-< > >> Hi, >> >> In my hands, make check fails on the rpr test; see attached .diff >> file. >> In these two statements: >> -- using NEXT >> -- using AFTER MATCH SKIP TO NEXT ROW >> result of first_value(price) and next_value(price) are empty. > > Strange. I have checked out fresh master branch and applied the v7 > patches, then ran make check. All tests including the rpr test > passed. This is Ubuntu 20.04. The curious thing is that the server otherwise builds ok, and if I explicitly run on that server 'CREATE TEMP TABLE stock' + the 20 INSERTS (just to make sure to have known data), those two statements now both return the correct result. So maybe the testing/timing is wonky (not necessarily the server). Erik > > Best reagards, > -- > Tatsuo Ishii > SRA OSS LLC > English: http://www.sraoss.co.jp/index_en/ > Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Jacob Champion <champion.p@gmail.com> — 2023-09-22T14:48:22Z
On Fri, Sep 22, 2023, 3:13 AM Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > > Op 9/22/23 om 07:16 schreef Tatsuo Ishii: > >>> Attached is the fix against v6 patch. I will include this in upcoming > >>> v7 patch. > >> Attached is the v7 patch. It includes the fix mentioned above. Also > > (Champion's address bounced; removed) > > On my side his adress bounced too:-< > Yep. I'm still here, just lurking for now. It'll take a little time for me to get back to this thread, as my schedule has changed significantly. :D Thanks, --Jacob >
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-09-25T05:26:30Z
> On Fri, Sep 22, 2023, 3:13 AM Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > >> > Op 9/22/23 om 07:16 schreef Tatsuo Ishii: >> >>> Attached is the fix against v6 patch. I will include this in upcoming >> >>> v7 patch. >> >> Attached is the v7 patch. It includes the fix mentioned above. Also >> > (Champion's address bounced; removed) >> >> On my side his adress bounced too:-< >> > > Yep. I'm still here, just lurking for now. It'll take a little time for me > to get back to this thread, as my schedule has changed significantly. :D Hope you get back soon... By the way, I was thinking about eliminating recusrive calls in pattern matching. Attached is the first cut of the implementation. In the attached v8 patch: - No recursive calls anymore. search_str_set_recurse was removed. - Instead it generates all possible pattern variable name initial strings before pattern matching. Suppose we have "ab" (row 0) and "ac" (row 1). "a" and "b" represents the pattern variable names which are evaluated to true. In this case it will generate "aa", "ac", "ba" and "bc" and they are examined by do_pattern_match one by one, which performs pattern matching. - To implement this, an infrastructure string_set* are created. They take care of set of StringInfo. I found that the previous implementation did not search all possible cases. I believe the bug is fixed in v8. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-10-04T06:03:28Z
> By the way, I was thinking about eliminating recusrive calls in > pattern matching. Attached is the first cut of the implementation. In > the attached v8 patch: > > - No recursive calls anymore. search_str_set_recurse was removed. > > - Instead it generates all possible pattern variable name initial > strings before pattern matching. Suppose we have "ab" (row 0) and > "ac" (row 1). "a" and "b" represents the pattern variable names > which are evaluated to true. In this case it will generate "aa", > "ac", "ba" and "bc" and they are examined by do_pattern_match one by > one, which performs pattern matching. > > - To implement this, an infrastructure string_set* are created. They > take care of set of StringInfo. > > I found that the previous implementation did not search all possible > cases. I believe the bug is fixed in v8. The v8 patch does not apply anymore due to commit d060e921ea "Remove obsolete executor cleanup code". So I rebased and created v9 patch. The differences from the v8 include: - Fix bug with get_slots. It did not correctly detect the end of full frame. - Add test case using "ROWS BETWEEN CURRENT ROW AND offset FOLLOWING". Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-10-22T02:39:20Z
Attached is the v10 patch. This version enhances the performance of pattern matching. Previously it generated all possible pattern string candidates. This resulted in unnecessarily large number of candidates. For example if you have 2 pattern variables and the target frame includes 100 rows, the number of candidates can reach to 2^100 in the worst case. To avoid this, I do a pruning in the v10 patch. Suppose you have: PATTERN (A B+ C+) Candidates like "BAC" "CAB" cannot survive because they never satisfy the search pattern. To judge this, I assign sequence numbers (0, 1, 2) to (A B C). If the pattern generator tries to generate BA, this is not allowed because the sequence number for B is 1 and for A is 0, and 0 < 1: B cannot be followed by A. Note that this technique can be applied when the quantifiers are "+" or "*". Maybe other quantifiers such as '?' or '{n, m}' can be applied too but I don't confirm yet because I have not implemented them yet. Besides this improvement, I fixed a bug in the previous and older patches: when an expression in DEFINE uses text operators, it errors out: ERROR: could not determine which collation to use for string comparison HINT: Use the COLLATE clause to set the collation explicitly. This was fixed by adding assign_expr_collations() in transformDefineClause(). Also I have updated documentation "3.5. Window Functions" - It still mentioned about rpr(). It's not applied anymore. - Enhance the description about DEFINE and PATTERN. - Mention that quantifier '*' is supported. Finally I have added more test cases to the regression test. - same pattern variable appears twice - case for quantifier '*' Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Jacob Champion <champion.p@gmail.com> — 2023-10-24T18:51:19Z
On Sat, Oct 21, 2023 at 7:39 PM Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > Attached is the v10 patch. This version enhances the performance of > pattern matching. Nice! I've attached a couple of more stressful tests (window partitions of 1000 rows each). Beware that the second one runs my desktop out of memory fairly quickly with the v10 implementation. I was able to carve out some time this week to implement a very basic recursive NFA, which handles both the + and * qualifiers (attached). It's not production quality -- a frame on the call stack for every row isn't going to work -- but with only those two features, it's pretty tiny, and it's able to run the new stress tests with no issue. If I continue to have time, I hope to keep updating this parallel implementation as you add features to the StringSet implementation, and we can see how it evolves. I expect that alternation and grouping will ratchet up the complexity. Thanks! --Jacob
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-10-25T00:11:05Z
> On Sat, Oct 21, 2023 at 7:39 PM Tatsuo Ishii <ishii@sraoss.co.jp> wrote: >> Attached is the v10 patch. This version enhances the performance of >> pattern matching. > > Nice! I've attached a couple of more stressful tests (window > partitions of 1000 rows each). Beware that the second one runs my > desktop out of memory fairly quickly with the v10 implementation. > > I was able to carve out some time this week to implement a very basic > recursive NFA, which handles both the + and * qualifiers (attached). Great. I will look into this. > It's not production quality -- a frame on the call stack for every row > isn't going to work Yeah. > -- but with only those two features, it's pretty > tiny, and it's able to run the new stress tests with no issue. If I > continue to have time, I hope to keep updating this parallel > implementation as you add features to the StringSet implementation, > and we can see how it evolves. I expect that alternation and grouping > will ratchet up the complexity. Sounds like a plan. By the way, I tested my patch (v10) to handle more large data set and tried to following query with pgbench database. On my laptop it works with 100k rows pgbench_accounts table but with beyond the number I got OOM killer. I would like to enhance this in the next patch. SELECT aid, first_value(aid) OVER w, count(*) OVER w FROM pgbench_accounts WINDOW w AS ( PARTITION BY bid ORDER BY aid ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING AFTER MATCH SKIP PAST LAST ROW INITIAL PATTERN (START UP+) DEFINE START AS TRUE, UP AS aid > PREV(aid) ); Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-10-25T02:49:30Z
> Great. I will look into this. I am impressed the simple NFA implementation. It would be nicer if it could be implemented without using recursion. > By the way, I tested my patch (v10) to handle more large data set and > tried to following query with pgbench database. On my laptop it works > with 100k rows pgbench_accounts table but with beyond the number I got ~~~ I meant 10k. > OOM killer. I would like to enhance this in the next patch. > > SELECT aid, first_value(aid) OVER w, > count(*) OVER w > FROM pgbench_accounts > WINDOW w AS ( > PARTITION BY bid > ORDER BY aid > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > AFTER MATCH SKIP PAST LAST ROW > INITIAL > PATTERN (START UP+) > DEFINE > START AS TRUE, > UP AS aid > PREV(aid) > ); I ran this against your patch. It failed around > 60k rows. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Jacob Champion <champion.p@gmail.com> — 2023-10-30T19:49:18Z
On Tue, Oct 24, 2023 at 7:49 PM Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > I am impressed the simple NFA implementation. Thanks! > It would be nicer if it > could be implemented without using recursion. Yeah. If for some reason we end up going with a bespoke implementation, I assume we'd just convert the algorithm to an iterative one and optimize it heavily. But I didn't want to do that too early, since it'd probably make it harder to add new features... and anyway my goal is still to try to reuse src/backend/regex eventually. > > SELECT aid, first_value(aid) OVER w, > > count(*) OVER w > > FROM pgbench_accounts > > WINDOW w AS ( > > PARTITION BY bid > > ORDER BY aid > > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > > AFTER MATCH SKIP PAST LAST ROW > > INITIAL > > PATTERN (START UP+) > > DEFINE > > START AS TRUE, > > UP AS aid > PREV(aid) > > ); > > I ran this against your patch. It failed around > 60k rows. Nice, that's actually more frames than I expected. Looks like I have similar results here with my second test query (segfault at ~58k rows). Thanks, --Jacob
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-11-08T07:37:05Z
>> It would be nicer if it >> could be implemented without using recursion. > > Yeah. If for some reason we end up going with a bespoke > implementation, I assume we'd just convert the algorithm to an > iterative one and optimize it heavily. But I didn't want to do that > too early, since it'd probably make it harder to add new features... > and anyway my goal is still to try to reuse src/backend/regex > eventually. Ok. Attached is the v11 patch. Below are the summary of the changes from previous version. - rebase. - Reduce memory allocation in pattern matching (search_str_set()). But still Champion's second stress test gives OOM killer. - While keeping an old set to next round, move the StringInfo to new_str_set, rather than copying from old_str_set. This allows to run pgbench.sql against up to 60k rows on my laptop (previously 20k). - Use enlargeStringInfo to set the buffer size, rather than incrementally enlarge the buffer. This does not seem to give big enhancement but it should theoretically an enhancement. - Fix "variable not found in subplan target list" error if WITH is used. - To fix this apply pullup_replace_vars() against DEFINE clause in planning phase (perform_pullup_replace_vars()). Also add regression test cases for WITH that caused the error in the previous version. - Fix the case when no greedy quantifiers ('+' or '*') are included in PATTERN. - Previously update_reduced_frame() did not consider the case and produced wrong results. Add another code path which is dedicated to none greedy PATTERN (at this point, it means there's no quantifier case). Also add a test case for this. - Remove unnecessary check in transformPatternClause(). - Previously it checked if all pattern variables are defined in DEFINE clause. But currently RPR allows to "auto define" such variables as "varname AS TRUE". So the check was not necessary. - FYI here is the list to explain what was changed in each patch file. 0001-Row-pattern-recognition-patch-for-raw-parser.patch - same 0002-Row-pattern-recognition-patch-parse-analysis.patch - Add markTargetListOrigins() to transformFrameOffset(). - Change transformPatternClause(). 0003-Row-pattern-recognition-patch-planner.patch - Fix perform_pullup_replace_vars() 0004-Row-pattern-recognition-patch-executor.patch - Fix update_reduced_frame() - Fix search_str_set() 0005-Row-pattern-recognition-patch-docs.patch - same 0006-Row-pattern-recognition-patch-tests.patch - Add test case for non-greedy and WITH cases 0007-Allow-to-print-raw-parse-tree.patch - same Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-12-08T01:16:13Z
Sorry for posting v12 patch again. It seems the previous post of v12 patch email lost mail threading information and was not recognized as a part of the thread by CF application and CFbot. https://www.postgresql.org/message-id/20231204.204048.1998548830490453126.t-ishii%40sranhm.sra.co.jp Attached is the v12 patch. Below are the summary of the changes from previous version. - Rebase. CFbot says v11 patch needs rebase since Nov 30, 2023. - Apply preprocess_expression() to DEFINE clause in the planning phase. This is necessary to simply const expressions like: DEFINE A price < (99 + 1) to: DEFINE A price < 100 - Re-allow to use WinSetMarkPosition() in eval_windowaggregates(). - FYI here is the list to explain what were changed in each patch file. 0001-Row-pattern-recognition-patch-for-raw-parser.patch - Fix conflict. 0002-Row-pattern-recognition-patch-parse-analysis.patch - Same as before. 0003-Row-pattern-recognition-patch-planner.patch - Call preprocess_expression() for DEFINE clause in subquery_planner(). 0004-Row-pattern-recognition-patch-executor.patch - Re-allow to use WinSetMarkPosition() in eval_windowaggregates(). 0005-Row-pattern-recognition-patch-docs.patch - Same as before. 0006-Row-pattern-recognition-patch-tests.patch - Same as before. 0007-Allow-to-print-raw-parse-tree.patch - Same as before. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2023-12-08T22:22:58Z
> On 04.12.23 12:40, Tatsuo Ishii wrote: >> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y >> index d631ac89a9..5a77fca17f 100644 >> --- a/src/backend/parser/gram.y >> +++ b/src/backend/parser/gram.y >> @@ -251,6 +251,8 @@ static Node *makeRecursiveViewSelect(char >> *relname, List *aliases, Node *query); >> DefElem *defelt; >> SortBy *sortby; >> WindowDef *windef; >> + RPCommonSyntax *rpcom; >> + RPSubsetItem *rpsubset; >> JoinExpr *jexpr; >> IndexElem *ielem; >> StatsElem *selem; >> @@ -278,6 +280,7 @@ static Node *makeRecursiveViewSelect(char >> *relname, List *aliases, Node *query); >> MergeWhenClause *mergewhen; >> struct KeyActions *keyactions; >> struct KeyAction *keyaction; >> + RPSkipTo skipto; >> } >> %type <node> stmt toplevel_stmt schema_stmt routine_body_stmt > > It is usually not the style to add an entry for every node type to the > %union. Otherwise, we'd have hundreds of entries in there. Ok, I have removed the node types and used existing node types. Also I have moved RPR related %types to same place to make it easier to know what are added by RPR. >> @@ -866,6 +878,7 @@ static Node *makeRecursiveViewSelect(char >> *relname, List *aliases, Node *query); >> %nonassoc UNBOUNDED /* ideally would have same precedence as IDENT */ >> %nonassoc IDENT PARTITION RANGE ROWS GROUPS PRECEDING FOLLOWING CUBE >> %ROLLUP >> SET KEYS OBJECT_P SCALAR VALUE_P WITH WITHOUT >> +%nonassoc MEASURES AFTER INITIAL SEEK PATTERN_P >> %left Op OPERATOR /* multi-character ops and user-defined operators */ >> %left '+' '-' >> %left '*' '/' '%' > > It was recently discussed that these %nonassoc should ideally all have > the same precedence. Did you consider that here? No, I didn't realize it. Thanks for pointing it out. I have removed %nonassoc so that MEASURES etc. have the same precedence as IDENT etc. Attached is the new diff of gram.y against master branch. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
NINGWEI CHEN <chen@sraoss.co.jp> — 2024-01-22T05:51:49Z
On Sat, 09 Dec 2023 07:22:58 +0900 (JST) Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > > On 04.12.23 12:40, Tatsuo Ishii wrote: > >> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y > >> index d631ac89a9..5a77fca17f 100644 > >> --- a/src/backend/parser/gram.y > >> +++ b/src/backend/parser/gram.y > >> @@ -251,6 +251,8 @@ static Node *makeRecursiveViewSelect(char > >> *relname, List *aliases, Node *query); > >> DefElem *defelt; > >> SortBy *sortby; > >> WindowDef *windef; > >> + RPCommonSyntax *rpcom; > >> + RPSubsetItem *rpsubset; > >> JoinExpr *jexpr; > >> IndexElem *ielem; > >> StatsElem *selem; > >> @@ -278,6 +280,7 @@ static Node *makeRecursiveViewSelect(char > >> *relname, List *aliases, Node *query); > >> MergeWhenClause *mergewhen; > >> struct KeyActions *keyactions; > >> struct KeyAction *keyaction; > >> + RPSkipTo skipto; > >> } > >> %type <node> stmt toplevel_stmt schema_stmt routine_body_stmt > > > > It is usually not the style to add an entry for every node type to the > > %union. Otherwise, we'd have hundreds of entries in there. > > Ok, I have removed the node types and used existing node types. Also > I have moved RPR related %types to same place to make it easier to know > what are added by RPR. > > >> @@ -866,6 +878,7 @@ static Node *makeRecursiveViewSelect(char > >> *relname, List *aliases, Node *query); > >> %nonassoc UNBOUNDED /* ideally would have same precedence as IDENT */ > >> %nonassoc IDENT PARTITION RANGE ROWS GROUPS PRECEDING FOLLOWING CUBE > >> %ROLLUP > >> SET KEYS OBJECT_P SCALAR VALUE_P WITH WITHOUT > >> +%nonassoc MEASURES AFTER INITIAL SEEK PATTERN_P > >> %left Op OPERATOR /* multi-character ops and user-defined operators */ > >> %left '+' '-' > >> %left '*' '/' '%' > > > > It was recently discussed that these %nonassoc should ideally all have > > the same precedence. Did you consider that here? > > No, I didn't realize it. Thanks for pointing it out. I have removed > %nonassoc so that MEASURES etc. have the same precedence as IDENT etc. > > Attached is the new diff of gram.y against master branch. Thank you very much for providing the patch for the RPR implementation. After applying the v12-patches, I noticed an issue that the rpr related parts in window clauses were not displayed in the view definitions (the definition column of pg_views). To address this, I have taken the liberty of adding an additional patch that modifies the relevant rewriter source code. I have attached the rewriter patch for your review and would greatly appreciate your feedback. Thank you for your time and consideration. -- SRA OSS LLC Ningwei Chen <chen@sraoss.co.jp> TEL: 03-5979-2701 FAX: 03-5979-2702
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2024-01-22T06:22:11Z
> Thank you very much for providing the patch for the RPR implementation. > > After applying the v12-patches, I noticed an issue that > the rpr related parts in window clauses were not displayed in the > view definitions (the definition column of pg_views). > > To address this, I have taken the liberty of adding an additional patch > that modifies the relevant rewriter source code. > I have attached the rewriter patch for your review and would greatly appreciate your feedback. > > Thank you for your time and consideration. Thank you so much for spotting the issue and creating the patch. I confirmed that your patch applies cleanly and solve the issue. I will include the patches into upcoming v13 patches. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2024-01-22T10:26:18Z
Attached is the v13 patch. Below are the summary of the changes from previous version (besides rebase). 0001-Row-pattern-recognition-patch-for-raw-parser.patch - Fix raw paser per Peter Eisentraut's review. Remove the new node types and use existing ones. Also remove %nonassoc so that MEASURES etc. have the same precedence as IDENT etc. Peter's comment: > It is usually not the style to add an entry for every node type to the > %union. Otherwise, we'd have hundreds of entries in there. > It was recently discussed that these %nonassoc should ideally all have > the same precedence. Did you consider that here? 0002-Row-pattern-recognition-patch-parse-analysis.patch - Fix transformRPR so that SKIP variable name in the AFTER MATCH SKIP TO clause is tracked. This is added by Ningwei Chen. 0003-Row-pattern-recognition-patch-rewriter.patch This is a new patch for rewriter. Contributed by Ningwei Chen. Chen's comment: > After applying the v12-patches, I noticed an issue that > the rpr related parts in window clauses were not displayed in the > view definitions (the definition column of pg_views). 0004-Row-pattern-recognition-patch-planner.patch - same as before (previously it was 0003-Row-pattern-recognition-patch-planner.patch) 0005-Row-pattern-recognition-patch-executor.patch - same as before (previously it was 0004-Row-pattern-recognition-patch-executor.patch) 0006-Row-pattern-recognition-patch-docs.patch - Same as before. (previously it was 0005-Row-pattern-recognition-patch-docs.patch) 0007-Row-pattern-recognition-patch-tests.patch - Same as before. (previously it was 0006-Row-pattern-recognition-patch-tests.patch) 0008-Allow-to-print-raw-parse-tree.patch - Same as before. (previously it was 0007-Allow-to-print-raw-parse-tree.patch). Note that patch is not intended to be incorporated into main tree. This is just for debugging purpose. With this patch, raw parse tree is printed if debug_print_parse is enabled. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2024-02-29T00:19:54Z
Attached is the v14 patch. Below are the summary of the changes from previous version (besides rebase). V14 patches are mainly for coding style fixes. 0001-Row-pattern-recognition-patch-for-raw-parser.patch - Fold too long lines and run pgindent. 0002-Row-pattern-recognition-patch-parse-analysis.patch - Fold too long lines and run pgindent. 0003-Row-pattern-recognition-patch-rewriter.patch - Fold too long lines and run pgindent. 0004-Row-pattern-recognition-patch-planner.patch - Fold too long lines and run pgindent. 0005-Row-pattern-recognition-patch-executor.patch - Fold too long lines and run pgindent. - Surround debug lines using "ifdef RPR_DEBUG" so that logs are not contaminated by RPR debug logs when log_min_messages are set to DEBUG1 or higher. 0006-Row-pattern-recognition-patch-docs.patch - Same as before. (previously it was 0005-Row-pattern-recognition-patch-docs.patch) 0007-Row-pattern-recognition-patch-tests.patch - Same as before. (previously it was 0006-Row-pattern-recognition-patch-tests.patch) 0008-Allow-to-print-raw-parse-tree.patch - Same as before. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2024-03-28T10:59:25Z
Attached is the v15 patch. No changes are made except rebasing due to recent grammar changes. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2024-04-12T07:09:08Z
Attached is the v16 patch. No changes are made except rebasing due to recent grammar changes. Also I removed chen@sraoss.co.jp from the Cc: list. The email address is no longer valid. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2024-04-24T03:12:44Z
Hi Vik and Champion, I think the current RPR patch is not quite correct in handling count(*). (using slightly modified version of Vik's example query) SELECT v.a, count(*) OVER w FROM (VALUES ('A'),('B'),('B'),('C')) AS v (a) WINDOW w AS ( ORDER BY v.a ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING PATTERN (B+) DEFINE B AS a = 'B' ) a | count ---+------- A | 0 B | 2 B | C | 0 (4 rows) Here row 3 is skipped because the pattern B matches row 2 and 3. In this case I think cont(*) should return 0 rathern than NULL for row 3. What do you think? Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Jacob Champion <jacob.champion@enterprisedb.com> — 2024-04-24T17:55:29Z
On Tue, Apr 23, 2024 at 8:13 PM Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > SELECT v.a, count(*) OVER w > FROM (VALUES ('A'),('B'),('B'),('C')) AS v (a) > WINDOW w AS ( > ORDER BY v.a > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > PATTERN (B+) > DEFINE B AS a = 'B' > ) > a | count > ---+------- > A | 0 > B | 2 > B | > C | 0 > (4 rows) > > Here row 3 is skipped because the pattern B matches row 2 and 3. In > this case I think cont(*) should return 0 rathern than NULL for row 3. I think returning zero would match Vik's explanation upthread [1], yes. Unfortunately I don't have a spec handy to double-check for myself right now. --Jacob [1] https://www.postgresql.org/message-id/c9ebc3d0-c3d1-e8eb-4a57-0ec099cbda17%40postgresfriends.org -
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2024-04-26T06:09:32Z
> On Tue, Apr 23, 2024 at 8:13 PM Tatsuo Ishii <ishii@sraoss.co.jp> wrote: >> SELECT v.a, count(*) OVER w >> FROM (VALUES ('A'),('B'),('B'),('C')) AS v (a) >> WINDOW w AS ( >> ORDER BY v.a >> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING >> PATTERN (B+) >> DEFINE B AS a = 'B' >> ) >> a | count >> ---+------- >> A | 0 >> B | 2 >> B | >> C | 0 >> (4 rows) >> >> Here row 3 is skipped because the pattern B matches row 2 and 3. In >> this case I think cont(*) should return 0 rathern than NULL for row 3. > > I think returning zero would match Vik's explanation upthread [1], > yes. Unfortunately I don't have a spec handy to double-check for > myself right now. Ok. I believe you and Vik are correct. I am modifying the patch in this direction. Attached is the regression diff after modifying the patch. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2024-04-28T11:28:26Z
>> I think returning zero would match Vik's explanation upthread [1], >> yes. Unfortunately I don't have a spec handy to double-check for >> myself right now. > > Ok. I believe you and Vik are correct. > I am modifying the patch in this direction. Attached are the v17 patches in the direction. Differences from v16 include: - In 0005 executor patch, aggregation in RPR always restarts for each row. This is necessary to run aggregates on no matching (due to skipping) or empty matching (due to no pattern variables matches) rows to produce NULL (most aggregates) or 0 (count) properly. In v16 I had a hack using a flag to force the aggregation results to be NULL in case of no match or empty match in finalize_windowaggregate(). v17 eliminates the dirty hack. - 0006 docs and 0007 test patches are adjusted to reflect the RPR output chages in 0005. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2024-05-11T07:23:07Z
Attached are the v18 patches. To fix conflicts due to recent commit: 7d2c7f08d9 Fix query pullup issue with WindowClause runCondition Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2024-05-15T00:02:03Z
Attached are the v19 patches. Changes from v18 include: 0002: - add a check whether DEFINE clause includes subqueries. If so, error out. 0007: - fix wrong test (row pattern definition variable name must not appear more than once) - remove unnessary test (undefined define variable is not allowed). We have already allowed the undefined variables. - add tests: subqueries and aggregates in DEFINE clause are not supported. The standard allows them but I have not implemented them yet. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2024-05-24T02:39:19Z
Attached are the v20 patches. Just rebased. (The conflict was in 0001 patch.) Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@sraoss.co.jp> — 2024-06-13T00:25:01Z
I gave a talk on RPR in PGConf.dev 2024. https://www.pgevents.ca/events/pgconfdev2024/schedule/session/114-implementing-row-pattern-recognition/ (Slides are available from the link). Vik Faring and Jacob Champion were one of the audiences and we had a small discussion after the talk. We continued the discussion off list on how to move forward the RPR implementation project. One of the ideas is, to summarize what are in the patch and what are not from the SQL standard specification's point of view. This should help us to reach the consensus regarding "minimum viable" feature set if we want to bring the patch in upcoming PostgreSQL v18. Here is the first cut of the document. Comments/feedback are welcome. ------------------------------------------------------------------------- This memo describes the current status of implementation of SQL/RPR (Row Pattern Recognition), as of June 13, 2024 (the latest patch is v20). - RPR in FROM clause and WINDOW clause The SQL standard defines two features regarding SQL/RPR - R010 (RPR in FROM clause) and R020 (RPR in WINDOW clause). Only R020 is implemented. From now on, we discuss on R020. - Overview of R020 syntax WINDOW window_name AS ( [ PARTITION BY ... ] [ ORDER BY... ] [ MEASURES ... ] ROWS BETWEEN CURRENT ROW AND ... [ AFTER MATCH SKIP ... ] [ INITIAL|SEEK ] PATTERN (...) [ SUBSET ... ] DEFINE ... ) -- PARTITION BY and ORDER BY are not specific to RPR and has been already there in current PostgreSQL. -- What are (partially) implemented: AFTER MATCH SKIP INITIAL|SEEK PATTERN DEFINE -- What are not implemented at all: MEASURES SUBSET Followings are detailed status of the each clause. - AFTER MATCH SKIP -- Implemented: AFTER MATCH SKIP TO NEXT ROW AFTER MATCH SKIP PAST LAST ROW -- Not implemented: AFTER MATCH SKIP TO FIRST|LAST pattern_variable - INITIAL|SEEK --Implemented: INITIAL -- Not implemented: SEEK - DEFINE -- Partially implemented row pattern navigation operations are PREV and NEXT. FIRST and LAST are not implemented. -- The standard says PREV and NEXT accepts optional argument "offset" but it's not implemented. -- The standard says the row pattern navigation operations can be nested but it's not implemented. -- CLLASSIFIER, use of aggregate functions and subqueries in DEFINE clause are not implemented. - PATTERN -- Followings are implemented: +: 1 or more rows *: 0 or more rows -- Followings are not implemented: ?: 0 or 1 row A | B: OR condition (A B): grouping {n}: n rows {n,}: n or more rows {n,m}: greater or equal to n rows and less than or equal to m rows {,m}: more than 0 and less than or equal to m rows ------------------------------------------------------------------------- Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2024-08-26T04:39:47Z
Attached are the v21 patches. Just rebased. (The conflict was in 0001 patch.) The 0008 patch is just for debugging purpose. You can ignore it. This hasn't been changed, but I would like to notice just in case. Best reagards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2024-09-19T04:59:47Z
Attached are the v22 patches. Just rebased. The conflict was in 0001 patch due to commit 89f908a6d0 "Add temporal FOREIGN KEY contraints". The 0008 patch is just for debugging purpose. You can ignore it. This hasn't been changed, but I would like to notice just in case. Best reagards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Jacob Champion <jacob.champion@enterprisedb.com> — 2024-09-27T22:27:07Z
On Wed, Sep 18, 2024 at 10:00 PM Tatsuo Ishii <ishii@postgresql.org> wrote: > > Attached are the v22 patches. Just rebased. Thanks! With some bigger partitions, I hit an `ERROR: wrong pos: 1024`. A test that reproduces it is attached. While playing with the feature, I've been trying to identify runs of matched rows by eye. But it's pretty difficult -- the best I can do is manually count rows using a `COUNT(*) OVER ...`. So I'd like to suggest that MEASURES be part of the eventual v1 feature, if there's no other way to determine whether a row was skipped by a previous match. (That was less obvious to me before the fix in v17.) -- I've been working on an implementation [1] of SQL/RPR's "parenthesized language" and preferment order. (These are defined in SQL/Foundation 2023, section 9.41.) The tool gives you a way to figure out, for a given pattern, what matches are supposed to be attempted and in what order: $ ./src/test/modules/rpr/rpr_prefer "a b? a" ( ( a ( b ) ) a ) ( ( a ( ) ) a ) Many simple patterns result in an infinite set of possible matches. So if you use an unbounded quantifiers, you have to also use --max-rows to limit the size of the hypothetical window frame: $ ./src/test/modules/rpr/rpr_prefer --max-rows 2 "^ PERMUTE(a*, b+)? $" ( ( ^ ( ( ( ( ( ( a ) ( b ) ) ) - ) ) ) ) $ ) ( ( ^ ( ( ( ( ( ( ) ( b b ) ) ) - ) ) ) ) $ ) ( ( ^ ( ( ( ( ( ( ) ( b ) ) ) - ) ) ) ) $ ) ( ( ^ ( ( ( - ( ( ( b b ) ( ) ) ) ) ) ) ) $ ) ( ( ^ ( ( ( - ( ( ( b ) ( a ) ) ) ) ) ) ) $ ) ( ( ^ ( ( ( - ( ( ( b ) ( ) ) ) ) ) ) ) $ ) ( ( ^ ( ) ) $ ) I've found this useful to check my personal understanding of the spec and the match behavior, but it could also potentially be used to generate test cases, or to help users debug their own patterns. For example, a pattern that has a bunch of duplicate sequences in its PL is probably not very well optimized: $ ./src/test/modules/rpr/rpr_prefer --max-rows 4 "a+ a+" ( ( a a a ) ( a ) ) ( ( a a ) ( a a ) ) ( ( a a ) ( a ) ) ( ( a ) ( a a a ) ) ( ( a ) ( a a ) ) ( ( a ) ( a ) ) And patterns with catastrophic backtracking behavior tend to show a "sawtooth" pattern in the output, with a huge number of potential matches being generated relative to the number of rows in the frame. My implementation is really messy -- it leaks memory like a sieve, and I cannibalized the parser from ECPG, which just ended up as an exercise in teaching myself flex/bison. But if there's interest in having this kind of tool in the tree, I can work on making it reviewable. Either way, I should be able to use it to double-check more complicated test cases. A while back [2], you were wondering whether our Bison implementation would be able to parse the PATTERN grammar directly. I think this tool proves that the answer is "yes", but PERMUTE in particular causes a shift/reduce conflict. To fix it, I applied the same precedence workaround that we use for CUBE and ROLLUP. Thanks again! --Jacob [1] https://github.com/jchampio/postgres/tree/dev/rpr [2] https://www.postgresql.org/message-id/20230721.151648.412762379013769790.t-ishii%40sranhm.sra.co.jp -
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2024-09-28T10:43:59Z
> With some bigger partitions, I hit an `ERROR: wrong pos: 1024`. A > test that reproduces it is attached. Thanks for the report. Attached is a patch on top of v22 patches to fix the bug. We keep info in an array (WindowAggState.reduced_frame_map) to track the rpr pattern match result status for each row in a frame. If pattern match succeeds, the first row in the reduced frame has status RF_FRAME_HEAD and rest of rows have RF_SKIPPED state. A row with pattern match failure state has RF_UNMATCHED state. Any row which is never tested has state RF_NOT_DETERMINED. At begining the map is initialized with 1024 entries with all RF_NOT_DETERMINED state. Eventually they are replaced with other than RF_NOT_DETERMINED state. In the error case rpr engine tries to find 1024 th row's state in the map and failed because the row's state has not been tested yet. I think we should treat it as RF_NOT_DETERMINED rather than an error. Attached patch does it. > While playing with the feature, I've been trying to identify runs of > matched rows by eye. But it's pretty difficult -- the best I can do is > manually count rows using a `COUNT(*) OVER ...`. So I'd like to > suggest that MEASURES be part of the eventual v1 feature, if there's > no other way to determine whether a row was skipped by a previous > match. (That was less obvious to me before the fix in v17.) I think implementing MEASURES is challenging. Especially we need to find how our parser accepts "colname OVER window_definition". Currently PostgreSQL's parser only accepts "func() OVER window_definition" Even it is technically possible, I think the v1 patch size will become much larger than now due to this. How about inventing new window function that returns row state instead? - match found (yes/no) - skipped due to AFTER MATCH SKIP PAST LAST ROW (no match) For the rest of the mail I need more time to understand. I will reply back after studying it. For now, I just want to thank you for the valuable information! > -- > > I've been working on an implementation [1] of SQL/RPR's "parenthesized > language" and preferment order. (These are defined in SQL/Foundation > 2023, section 9.41.) The tool gives you a way to figure out, for a > given pattern, what matches are supposed to be attempted and in what > order: > > $ ./src/test/modules/rpr/rpr_prefer "a b? a" > ( ( a ( b ) ) a ) > ( ( a ( ) ) a ) > > Many simple patterns result in an infinite set of possible matches. So > if you use an unbounded quantifiers, you have to also use --max-rows > to limit the size of the hypothetical window frame: > > $ ./src/test/modules/rpr/rpr_prefer --max-rows 2 "^ PERMUTE(a*, b+)? $" > ( ( ^ ( ( ( ( ( ( a ) ( b ) ) ) - ) ) ) ) $ ) > ( ( ^ ( ( ( ( ( ( ) ( b b ) ) ) - ) ) ) ) $ ) > ( ( ^ ( ( ( ( ( ( ) ( b ) ) ) - ) ) ) ) $ ) > ( ( ^ ( ( ( - ( ( ( b b ) ( ) ) ) ) ) ) ) $ ) > ( ( ^ ( ( ( - ( ( ( b ) ( a ) ) ) ) ) ) ) $ ) > ( ( ^ ( ( ( - ( ( ( b ) ( ) ) ) ) ) ) ) $ ) > ( ( ^ ( ) ) $ ) > > I've found this useful to check my personal understanding of the spec > and the match behavior, but it could also potentially be used to > generate test cases, or to help users debug their own patterns. For > example, a pattern that has a bunch of duplicate sequences in its PL > is probably not very well optimized: > > $ ./src/test/modules/rpr/rpr_prefer --max-rows 4 "a+ a+" > ( ( a a a ) ( a ) ) > ( ( a a ) ( a a ) ) > ( ( a a ) ( a ) ) > ( ( a ) ( a a a ) ) > ( ( a ) ( a a ) ) > ( ( a ) ( a ) ) > > And patterns with catastrophic backtracking behavior tend to show a > "sawtooth" pattern in the output, with a huge number of potential > matches being generated relative to the number of rows in the frame. > > My implementation is really messy -- it leaks memory like a sieve, and > I cannibalized the parser from ECPG, which just ended up as an > exercise in teaching myself flex/bison. But if there's interest in > having this kind of tool in the tree, I can work on making it > reviewable. Either way, I should be able to use it to double-check > more complicated test cases. > > A while back [2], you were wondering whether our Bison implementation > would be able to parse the PATTERN grammar directly. I think this tool > proves that the answer is "yes", but PERMUTE in particular causes a > shift/reduce conflict. To fix it, I applied the same precedence > workaround that we use for CUBE and ROLLUP. > > Thanks again! > --Jacob > > [1] https://github.com/jchampio/postgres/tree/dev/rpr > [2] https://www.postgresql.org/message-id/20230721.151648.412762379013769790.t-ishii%40sranhm.sra.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2024-09-30T00:07:51Z
>> While playing with the feature, I've been trying to identify runs of >> matched rows by eye. But it's pretty difficult -- the best I can do is >> manually count rows using a `COUNT(*) OVER ...`. So I'd like to >> suggest that MEASURES be part of the eventual v1 feature, if there's >> no other way to determine whether a row was skipped by a previous >> match. (That was less obvious to me before the fix in v17.) > > I think implementing MEASURES is challenging. Especially we need to > find how our parser accepts "colname OVER > window_definition". Currently PostgreSQL's parser only accepts "func() > OVER window_definition" Even it is technically possible, I think the > v1 patch size will become much larger than now due to this. > > How about inventing new window function that returns row state instead? > > - match found (yes/no) > - skipped due to AFTER MATCH SKIP PAST LAST ROW (no match) Please disregard my proposal. Even if we make such a function, it would always return NULL for unmatched rows or skipped rows, and I think the function does not solve your problem. However, I wonder if supporting MEASURES solves the problem either because any columns defined by MEASURES will return NULL except the first row in a reduced frame. Can you please show an example how to identify runs of matched rows using MEASURES? > For the rest of the mail I need more time to understand. I will reply > back after studying it. For now, I just want to thank you for the > valuable information! > >> -- >> >> I've been working on an implementation [1] of SQL/RPR's "parenthesized >> language" and preferment order. (These are defined in SQL/Foundation >> 2023, section 9.41.) The tool gives you a way to figure out, for a >> given pattern, what matches are supposed to be attempted and in what >> order: >> >> $ ./src/test/modules/rpr/rpr_prefer "a b? a" >> ( ( a ( b ) ) a ) >> ( ( a ( ) ) a ) >> >> Many simple patterns result in an infinite set of possible matches. So >> if you use an unbounded quantifiers, you have to also use --max-rows >> to limit the size of the hypothetical window frame: >> >> $ ./src/test/modules/rpr/rpr_prefer --max-rows 2 "^ PERMUTE(a*, b+)? $" >> ( ( ^ ( ( ( ( ( ( a ) ( b ) ) ) - ) ) ) ) $ ) >> ( ( ^ ( ( ( ( ( ( ) ( b b ) ) ) - ) ) ) ) $ ) >> ( ( ^ ( ( ( ( ( ( ) ( b ) ) ) - ) ) ) ) $ ) >> ( ( ^ ( ( ( - ( ( ( b b ) ( ) ) ) ) ) ) ) $ ) >> ( ( ^ ( ( ( - ( ( ( b ) ( a ) ) ) ) ) ) ) $ ) >> ( ( ^ ( ( ( - ( ( ( b ) ( ) ) ) ) ) ) ) $ ) >> ( ( ^ ( ) ) $ ) I wonder how Oracle solves the problem (an infinite set of possible matches) without using "--max-rows" or something like that because in my understanding Oracle supports the regular expressions and PERMUTE. >> I've found this useful to check my personal understanding of the spec >> and the match behavior, but it could also potentially be used to >> generate test cases, or to help users debug their own patterns. For >> example, a pattern that has a bunch of duplicate sequences in its PL >> is probably not very well optimized: >> >> $ ./src/test/modules/rpr/rpr_prefer --max-rows 4 "a+ a+" >> ( ( a a a ) ( a ) ) >> ( ( a a ) ( a a ) ) >> ( ( a a ) ( a ) ) >> ( ( a ) ( a a a ) ) >> ( ( a ) ( a a ) ) >> ( ( a ) ( a ) ) >> >> And patterns with catastrophic backtracking behavior tend to show a >> "sawtooth" pattern in the output, with a huge number of potential >> matches being generated relative to the number of rows in the frame. >> >> My implementation is really messy -- it leaks memory like a sieve, and >> I cannibalized the parser from ECPG, which just ended up as an >> exercise in teaching myself flex/bison. But if there's interest in >> having this kind of tool in the tree, I can work on making it >> reviewable. Either way, I should be able to use it to double-check >> more complicated test cases. I definitely am interested in the tool! >> A while back [2], you were wondering whether our Bison implementation >> would be able to parse the PATTERN grammar directly. I think this tool >> proves that the answer is "yes", but PERMUTE in particular causes a >> shift/reduce conflict. To fix it, I applied the same precedence >> workaround that we use for CUBE and ROLLUP. That's a good news! Best reagards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Jacob Champion <jacob.champion@enterprisedb.com> — 2024-10-01T12:48:02Z
On Sun, Sep 29, 2024 at 5:08 PM Tatsuo Ishii <ishii@postgresql.org> wrote: > > I think implementing MEASURES is challenging. Especially we need to > > find how our parser accepts "colname OVER > > window_definition". Currently PostgreSQL's parser only accepts "func() > > OVER window_definition" Even it is technically possible, I think the > > v1 patch size will become much larger than now due to this. [resending, to the whole list this time] Yeah. In any case, I'm not the right person to bolt MEASURES onto the existing grammar... my misadventures in the PATTERN parser have highlighted how little I know about Bison. :D > Please disregard my proposal. Even if we make such a function, it > would always return NULL for unmatched rows or skipped rows, and I > think the function does not solve your problem. > > However, I wonder if supporting MEASURES solves the problem either > because any columns defined by MEASURES will return NULL except the > first row in a reduced frame. Can you please show an example how to > identify runs of matched rows using MEASURES? I think you're probably right; my suggestion can't distinguish between skipped (but previously matched) rows and entirely-unmatched rows. The test case I'd been working with returned an empty match as a fallback, so it wouldn't have had that problem in practice. I was hoping that one of the existing whole-partition window functions would allow me to cobble something together based on the COUNT(*) measure, but after searching for a while I haven't been able to come up with a solution. Maybe it's just too niche for the window-function version of this -- after all, it only makes sense when using both INITIAL and AFTER MATCH SKIP PAST LAST ROW. A more general solution could identify the row_number of the first and last rows of the window frame, perhaps? But a frame isn't guaranteed to be contiguous, so maybe that doesn't make sense either. Ugh. > I wonder how Oracle solves the problem (an infinite set of possible > matches) without using "--max-rows" or something like that because in > my understanding Oracle supports the regular expressions and PERMUTE. I chose a confusing way to describe it, sorry. The parenthesized language for a pattern can be an infinite set, because A+ could match "( A )" or "( A A )" or "( A A A )" and so on forever. But that doesn't apply to our regex engine in practice; our tables have a finite number of rows, and I *think* the PL for a finite number of rows is also finite, due to the complicated rules on where empty matches are allowed to appear in the language. (In any case, my tool doesn't guard against infinite recursion...) > >> My implementation is really messy -- it leaks memory like a sieve, and > >> I cannibalized the parser from ECPG, which just ended up as an > >> exercise in teaching myself flex/bison. But if there's interest in > >> having this kind of tool in the tree, I can work on making it > >> reviewable. Either way, I should be able to use it to double-check > >> more complicated test cases. > > I definitely am interested in the tool! Okay, good to know! I will need to clean it up considerably, and figure out whether I've duplicated more code than I should have. > >> A while back [2], you were wondering whether our Bison implementation > >> would be able to parse the PATTERN grammar directly. I think this tool > >> proves that the answer is "yes", but PERMUTE in particular causes a > >> shift/reduce conflict. To fix it, I applied the same precedence > >> workaround that we use for CUBE and ROLLUP. > > That's a good news! +1 Thanks, --Jacob
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2024-10-22T04:53:43Z
Hi, I wonder how "PREV(col + 1)" is different from "PREV(col) + 1". Currently my RPR implementation does not allow PREV(col + 1). If "PREV(col + 1)" is different from "PREV(col) + 1", it maybe worthwhile to implement "PREV(col + 1)". Best reagards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Row pattern recognition
David G. Johnston <david.g.johnston@gmail.com> — 2024-10-22T05:29:34Z
On Monday, October 21, 2024, Tatsuo Ishii <ishii@postgresql.org> wrote: > > I wonder how "PREV(col + 1)" is different from "PREV(col) + 1". > Currently my RPR implementation does not allow PREV(col + 1). If > "PREV(col + 1)" is different from "PREV(col) + 1", it maybe worthwhile > to implement "PREV(col + 1)". > Interesting feature that I’m now just seeing. The expression PREV(column_name) produces a value output taken from the given named column in the preceding frame row. It doesn’t make any sense to me to attempt to add the integer 1 to an identifier that is being used as a value input to a “function”. It would also seem quite odd if “+ 1” had something to do with row selection as opposed to simply being an operator “+(column_name%type, integer)” expression. Maybe RPR is defining something special here I haven't yet picked up on, in which case just ignore this. But if I read: “UP as price > prev(price + 1)” in the opening example it would be quite non-intuitive to reason out the meaning. “Price > prev(price) + 1” would mean my current row is at least one (e.g. dollar per share) more than the value of the previous period. David J.
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2024-10-22T10:19:09Z
> On Monday, October 21, 2024, Tatsuo Ishii <ishii@postgresql.org> wrote: >> >> I wonder how "PREV(col + 1)" is different from "PREV(col) + 1". >> Currently my RPR implementation does not allow PREV(col + 1). If >> "PREV(col + 1)" is different from "PREV(col) + 1", it maybe worthwhile >> to implement "PREV(col + 1)". >> > > Interesting feature that I’m now just seeing. > > The expression PREV(column_name) produces a value output taken from the > given named column in the preceding frame row. It doesn’t make any sense > to me to attempt to add the integer 1 to an identifier that is being used > as a value input to a “function”. It would also seem quite odd if “+ 1” > had something to do with row selection as opposed to simply being an > operator “+(column_name%type, integer)” expression. According to the ISO/IEC 9075-2:2016, 6.26 <row pattern navigation operation> (I don't have access to SQL 2023) PREV (and NEXT) is defined as: <row pattern navigation: physical> ::= <prev or next> <left paren> <value expression> [ <comma> <physical offset> ] <right paren> (Besides <row pattern navigation: physical>, there are <row pattern navigation: logical> and <row pattern navigation: compound> but I ignore them here). So PREV's first argument is a value expression (VE). VE shall contain at least one row pattern column reference. <set function specification>, <window function specification> or <row patter navigation operation> are not permitted. From this, I don't see any reason PREV(column_name + 1) is prohibited unless I miss something. I think even PREV(column_name1 + column_name2) is possible. I see similar example in ISO/IEC 19075-5:2021, 5.6.2 "PREV and NEXT". > Maybe RPR is defining something special here I haven't yet picked up on, in > which case just ignore this. But if I read: “UP as price > prev(price + > 1)” in the opening example it would be quite non-intuitive to reason out > the meaning. “Price > prev(price) + 1” would mean my current row is at > least one (e.g. dollar per share) more than the value of the previous > period. Acording to ISO/IEC 9075-2:2016 "4.21.2 Row pattern navigation operations", <row pattern navigation operation> evaluates a <value expression> VE in a row NR, which may be different than current row CR. From this I think PREV(col + 1) should be interpreted as: 1. go to the previous row. 2. evaluate "col + 1" at the current row (that was previous row). 3. return the result. If my understanding is correct, prev(price + 1) has the same meaning as prev(price) + 1. Best reagards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Vik Fearing <vik@postgresfriends.org> — 2024-10-22T13:12:43Z
On 22/10/2024 12:19, Tatsuo Ishii wrote: > Acording to ISO/IEC 9075-2:2016 "4.21.2 Row pattern navigation operations", > > <row pattern navigation operation> evaluates a <value expression> VE > in a row NR, which may be different than current row CR. > > From this I think PREV(col + 1) should be interpreted as: > > 1. go to the previous row. > 2. evaluate "col + 1" at the current row (that was previous row). > 3. return the result. > > If my understanding is correct, prev(price + 1) has the same meaning > as prev(price) + 1. This is how I read the specification also. -- Vik Fearing
-
Re: Row pattern recognition
David G. Johnston <david.g.johnston@gmail.com> — 2024-10-22T14:19:41Z
On Tue, Oct 22, 2024 at 6:12 AM Vik Fearing <vik@postgresfriends.org> wrote: > > On 22/10/2024 12:19, Tatsuo Ishii wrote: > > Acording to ISO/IEC 9075-2:2016 "4.21.2 Row pattern navigation operations", > > <row pattern navigation operation> evaluates a <value expression> VE > in a row NR, which may be different than current row CR. > > From this I think PREV(col + 1) should be interpreted as: > > 1. go to the previous row. > 2. evaluate "col + 1" at the current row (that was previous row). > 3. return the result. > > If my understanding is correct, prev(price + 1) has the same meaning > as prev(price) + 1. > > > > This is how I read the specification also. > > > That makes sense. Definitely much nicer to only have to write PREV once if the expression you are evaluating involves multiple columns. And is also consistent with window function "value" behavior. Thanks! David J.
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2024-10-25T04:04:53Z
> On Tue, Oct 22, 2024 at 6:12 AM Vik Fearing <vik@postgresfriends.org> wrote: > >> >> On 22/10/2024 12:19, Tatsuo Ishii wrote: >> >> Acording to ISO/IEC 9075-2:2016 "4.21.2 Row pattern navigation operations", >> >> <row pattern navigation operation> evaluates a <value expression> VE >> in a row NR, which may be different than current row CR. >> >> From this I think PREV(col + 1) should be interpreted as: >> >> 1. go to the previous row. >> 2. evaluate "col + 1" at the current row (that was previous row). >> 3. return the result. >> >> If my understanding is correct, prev(price + 1) has the same meaning >> as prev(price) + 1. >> >> >> >> This is how I read the specification also. >> >> >> > That makes sense. Definitely much nicer to only have to write PREV once if > the expression you are evaluating involves multiple columns. And is also > consistent with window function "value" behavior. Thanks to all who joined the discussion. I decided to support PREV and NEXT in my RPR patches to allow to have multiple columns and other expressions in their argument. e.g. CREATE TEMP TABLE rpr1 (id INTEGER, i SERIAL, j INTEGER); INSERT INTO rpr1(id, j) SELECT 1, g*2 FROM generate_series(1, 10) AS g; SELECT id, i, j, count(*) OVER w FROM rpr1 WINDOW w AS ( PARTITION BY id ORDER BY i ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING AFTER MATCH SKIP PAST LAST ROW INITIAL PATTERN (START COND+) DEFINE START AS TRUE, COND AS PREV(i + j + 1) < 10 ); id | i | j | count ----+----+----+------- 1 | 1 | 2 | 3 1 | 2 | 4 | 0 1 | 3 | 6 | 0 1 | 4 | 8 | 0 1 | 5 | 10 | 0 1 | 6 | 12 | 0 1 | 7 | 14 | 0 1 | 8 | 16 | 0 1 | 9 | 18 | 0 1 | 10 | 20 | 0 (10 rows) Attached are the v23 patches. V23 also includes the fix for the problem pointed out by Jacob Champion and test cases from him. Thank you, Jacob. https://www.postgresql.org/message-id/CAOYmi%2Bns3kHjC83ap_BCfJCL0wfO5BJ_sEByOEpgNOrsPhqQTg%40mail.gmail.com Best reagards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2024-12-19T06:19:50Z
I have looked into the performance of current RPR implementation, especially when the number of rows in a reduced frame is large (like over 10k). Below is a simple benchmark against pgbench database. The SQL will produce a reduced frame having 10k rows. EXPLAIN (ANALYZE) SELECT aid, bid, count(*) OVER w FROM pgbench_accounts WHERE aid <= 10000 WINDOW w AS ( PARTITION BY bid ORDER BY aid ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING AFTER MATCH SKIP PAST LAST ROW INITIAL PATTERN (START UP+) DEFINE START AS TRUE, UP AS aid > PREV(aid) ); This took 722 ms on my laptop. It's not very quick. Moreover, if I expand the reduced frame size to 100k (aid <= 100000), OOM killer triggered. I looked into the code and found that do_pattern_match in nodeWindowAgg.c is one of the major problems. It calls regexp_instr to know whether the regular expression derived from a PATTERN clause (e.g. "ab+c+") matches an encoded row pattern variable string (e.g. "abbcc"). The latter string could be quite long: the length could be as same as the number of rows in the reduced frame. Thus, The length could become 100k if the frame size is 100k. Unfortunately regexp_instr needs to allocate and convert the input string to wchar (it's 4-byte long for each character), which uses 4x space bigger than the original input string. In RPR case the input string is always ASCII and does not need to be converted to wchar. So I decided to switch to the standard regular expression engine coming with OS. With this change, I got 2x speed up in the 10k case. v23 patch: 722.618 ms (average of 3 runs) new patch: 322.5913 ms (average of 3 runs) Also I tried the 100k rows reduced frame case. It was slow (took 26 seconds) but it completed without OOM killer. Attached is the patch. The change was in 0005 only. Other patches were not changed from v23. Best reagards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2024-12-21T09:20:04Z
> I have looked into the performance of current RPR implementation, > especially when the number of rows in a reduced frame is large (like > over 10k). Below is a simple benchmark against pgbench database. The > SQL will produce a reduced frame having 10k rows. > > EXPLAIN (ANALYZE) > SELECT aid, bid, count(*) OVER w > FROM pgbench_accounts WHERE aid <= 10000 > WINDOW w AS ( > PARTITION BY bid > ORDER BY aid > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > AFTER MATCH SKIP PAST LAST ROW > INITIAL > PATTERN (START UP+) > DEFINE > START AS TRUE, > UP AS aid > PREV(aid) > ); > > This took 722 ms on my laptop. It's not very quick. Moreover, if I > expand the reduced frame size to 100k (aid <= 100000), OOM killer > triggered. I looked into the code and found that do_pattern_match in > nodeWindowAgg.c is one of the major problems. It calls regexp_instr to > know whether the regular expression derived from a PATTERN clause > (e.g. "ab+c+") matches an encoded row pattern variable string > (e.g. "abbcc"). The latter string could be quite long: the length > could be as same as the number of rows in the reduced frame. Thus, The > length could become 100k if the frame size is 100k. Unfortunately > regexp_instr needs to allocate and convert the input string to wchar > (it's 4-byte long for each character), which uses 4x space bigger than > the original input string. In RPR case the input string is always > ASCII and does not need to be converted to wchar. So I decided to > switch to the standard regular expression engine coming with OS. With > this change, I got 2x speed up in the 10k case. > > v23 patch: 722.618 ms (average of 3 runs) > new patch: 322.5913 ms (average of 3 runs) > > Also I tried the 100k rows reduced frame case. It was slow (took 26 > seconds) but it completed without OOM killer. Attached is the > patch. The change was in 0005 only. Other patches were not changed > from v23. The CFBot starts complaining about the patch. It fails in Windows environment test because regex.h does not exist. I concluded that on Windows it's not a good idea to use the standard regexp library (I am not familiar with Windows. If my thought is not correct, please let me know). So I switched to the PostgreSQL's builtin core regexp library. Although the interface requires to use pg_wchar which spends 4x memory comparing with the standard regexp (that's why I wanted to avoid using it), the result seems to be not so bad. It consumes only 10MB or so more memory when processing 100k rows in a frame. Good news is, it runs slightly faster than the standard regexp (19 vs. 26 seconds). Attached is the v25 patch to use the PostgreSQL's regexp library. Most changes are in nodeWindowAgg.c, which is in the 5th patch. In the patches I also fixed some memory leaks and run pgindent with updated typedefs.list. Now the patch includes a patch for typedefs.list (the 8th patch). Best reagards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2024-12-30T13:37:18Z
I have added further optimization to the v25 patch. While generating possible input strings that may satisfy the pattern string, it is possible to omit to run regexp in some cases. Since regexp matching is heavy operation, especially if it is applied to long string, it is beneficial for RPR to reduce the number of regexp runs. If the tail pattern variable has '+' quantifier and previously the input string was confirmed to be matched the pattern string, and the same character as the tail pattern string is added, we don't need run regexp match again because the new input string surely matches the pattern string. Suppose a pattern string is "ab+" and the current input string is "ab" (this satisfies "ab+"). If the new input string is "abb", then "abb" surely matches "ab+" too and we don't need to run regexp again. In v26 patch, by using the technique above I get performance improvement. >> EXPLAIN (ANALYZE) >> SELECT aid, bid, count(*) OVER w >> FROM pgbench_accounts WHERE aid <= 10000 >> WINDOW w AS ( >> PARTITION BY bid >> ORDER BY aid >> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING >> AFTER MATCH SKIP PAST LAST ROW >> INITIAL >> PATTERN (START UP+) >> DEFINE >> START AS TRUE, >> UP AS aid > PREV(aid) >> ); This SQL took 322.5913 ms (average in 3 runs) in v24. With v26 patch, it takes 41.84 ms, which is over 7 times improvement. Also I run the SQL in 100k row case. v23 took 26 seconds. With the v26 patch it takes 1195.603 ms, which is over 21 times improvement. I think a pattern string ended up with '+' is one of common use cases of RPR, and I believe the improvement is useful for many RPR applications. I also add following changes to v25. - Fix do_pattern_match to use the top memory context to store compiled re cache. Before it was in per query memory context. This causes a trouble because do_pattern_match checks the cache existence using a static variable. - Refactor search_str_set, which is a workhorse of pattern matching, into multiple functions to understand the logic easier. Best reagards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2024-12-30T23:57:07Z
> I have added further optimization to the v25 patch. > > While generating possible input strings that may satisfy the pattern > string, it is possible to omit to run regexp in some cases. Since > regexp matching is heavy operation, especially if it is applied to > long string, it is beneficial for RPR to reduce the number of regexp > runs. > > If the tail pattern variable has '+' quantifier and previously the > input string was confirmed to be matched the pattern string, and the > same character as the tail pattern string is added, we don't need run > regexp match again because the new input string surely matches the > pattern string. Suppose a pattern string is "ab+" and the current > input string is "ab" (this satisfies "ab+"). If the new input string > is "abb", then "abb" surely matches "ab+" too and we don't need to run > regexp again. > > In v26 patch, by using the technique above I get performance > improvement. > >>> EXPLAIN (ANALYZE) >>> SELECT aid, bid, count(*) OVER w >>> FROM pgbench_accounts WHERE aid <= 10000 >>> WINDOW w AS ( >>> PARTITION BY bid >>> ORDER BY aid >>> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING >>> AFTER MATCH SKIP PAST LAST ROW >>> INITIAL >>> PATTERN (START UP+) >>> DEFINE >>> START AS TRUE, >>> UP AS aid > PREV(aid) >>> ); > > This SQL took 322.5913 ms (average in 3 runs) in v24. With v26 patch, > it takes 41.84 ms, which is over 7 times improvement. Also I run the > SQL in 100k row case. v23 took 26 seconds. With the v26 patch it takes > 1195.603 ms, which is over 21 times improvement. > > I think a pattern string ended up with '+' is one of common use cases > of RPR, and I believe the improvement is useful for many RPR > applications. > > I also add following changes to v25. > > - Fix do_pattern_match to use the top memory context to store compiled > re cache. Before it was in per query memory context. This causes a > trouble because do_pattern_match checks the cache existence using > a static variable. > > - Refactor search_str_set, which is a workhorse of pattern matching, > into multiple functions to understand the logic easier. CFBot complains a compiler error in the v26 patch. Attached is v27 patch to fix this. Also some typo in comment are fixed. Best reagards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-01-11T05:46:11Z
Attached are the v28 patches to implement a subset of Row Pattern Recognition feature defined in the SQL standard. In this patch set: - Reduce the patch size. Comparing with v27, the patch size is trimmed from 5296 lines down to 5073 lines. This is mainly achieved in the raw parser patch so that non implemented features are removed from the grammar file. - Use newly introduced makeStringInfoExt() instead of makeStringInfo() in nodeWindowAgg.c, which removes a few unnecessary codes and should give slightly better performance. - Fix bugs in the doc and add it more description regarding RPR syntax. Best reagards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-03-13T12:51:29Z
Attached are the v29 patch sets to implement the subset of row pattern recognition feature defined in the SQL standard. In this patch set: - Adjust 0003 and 0004 to deal with the recent changes made by 8b1b342. - Add tests to 0007 for CREATE VIEW/pg_get_viewdef. Best reagards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-04-22T23:55:33Z
Attached are the v30 patches, just adding a patch to change the default io_method parameter to sync, expecting this affects something to the recent CFbot failure. https://commitfest.postgresql.org/patch/4460/ https://cirrus-ci.com/task/6078653164945408 which is similar to: https://www.postgresql.org/message-id/20250422.111139.1502127917165838403.ishii%40postgresql.org (Once the issue resolved, the patch should be removed, of course) Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-04-23T01:07:29Z
> Attached are the v30 patches, just adding a patch to change the > default io_method parameter to sync, expecting this affects something > to the recent CFbot failure. > https://commitfest.postgresql.org/patch/4460/ > https://cirrus-ci.com/task/6078653164945408 > which is similar to: > https://www.postgresql.org/message-id/20250422.111139.1502127917165838403.ishii%40postgresql.org CFBot has just run tests against v30 patches and now it turns to green again! I guess io_method = sync definitely did the trick. Note that previously only the Windows Server 2019, VS 2019 - Neason & ninja test had failed. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-05-06T00:53:51Z
> Attached are the v30 patches, just adding a patch to change the > default io_method parameter to sync, expecting this affects something > to the recent CFbot failure. > https://commitfest.postgresql.org/patch/4460/ > https://cirrus-ci.com/task/6078653164945408 > which is similar to: > https://www.postgresql.org/message-id/20250422.111139.1502127917165838403.ishii%40postgresql.org > > (Once the issue resolved, the patch should be removed, of course) It seems this has turned to green since May 2, 2025. https://commitfest.postgresql.org/patch/5679/. The last time it turned to red was April 16, 2025. Maybe something committed between the period solved the cause of red, but I don't know exactly which commit. Anyway v31 patches now remove the change to default io_method parameter to see if it passes Windows Server 2019, VS 2019 - Meson & ninja test. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-05-06T07:28:42Z
>> Attached are the v30 patches, just adding a patch to change the >> default io_method parameter to sync, expecting this affects something >> to the recent CFbot failure. >> https://commitfest.postgresql.org/patch/4460/ >> https://cirrus-ci.com/task/6078653164945408 >> which is similar to: >> https://www.postgresql.org/message-id/20250422.111139.1502127917165838403.ishii%40postgresql.org >> >> (Once the issue resolved, the patch should be removed, of course) > > It seems this has turned to green since May 2, 2025. > https://commitfest.postgresql.org/patch/5679/. > > The last time it turned to red was April 16, 2025. Maybe something > committed between the period solved the cause of red, but I don't know > exactly which commit. > > Anyway v31 patches now remove the change to default io_method > parameter to see if it passes Windows Server 2019, VS 2019 - Meson & > ninja test. Now I see it passes the test. https://cirrus-ci.com/build/5671612202090496 Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-08-16T08:45:52Z
Attached are the v32 patches for Row pattern recognition. Recent changes to doc/src/sgml/func.sgml required v31 to be rebased. Other than that, nothing has been changed. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-09-24T10:35:23Z
Attached are the v33 patches for Row pattern recognition. The difference from v32 is that the raw parse tree printing patch is not included in v33. This is because now that we have debug_print_raw_parse. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-11-17T06:57:15Z
Attached are the v34 patches for Row pattern recognition. Notihing has been changed. Just rebased. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-11-18T02:33:20Z
Attached are the v35 patches for Row pattern recognition. In v34-0001 gram.y patch, %type for RPR were misplaced. v35-0001 fixes this. Other patches are not changed. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Chao Li <li.evan.chao@gmail.com> — 2025-11-18T05:03:08Z
Hi Tatsuo-san, I just reviewed 0006 docs changes and 0001. I plan to slowly review the patch, maybe one commit a day. > On Nov 18, 2025, at 10:33, Tatsuo Ishii <ishii@postgresql.org> wrote: > > Attached are the v35 patches for Row pattern recognition. In v34-0001 > gram.y patch, %type for RPR were misplaced. v35-0001 fixes this. Other > patches are not changed. > > Best regards, > -- > Tatsuo Ishii > SRA OSS K.K. > English: http://www.sraoss.co.jp/index_en/ > Japanese:http://www.sraoss.co.jp > <v35-0001-Row-pattern-recognition-patch-for-raw-parser.patch><v35-0002-Row-pattern-recognition-patch-parse-analysis.patch><v35-0003-Row-pattern-recognition-patch-rewriter.patch><v35-0004-Row-pattern-recognition-patch-planner.patch><v35-0005-Row-pattern-recognition-patch-executor.patch><v35-0006-Row-pattern-recognition-patch-docs.patch><v35-0007-Row-pattern-recognition-patch-tests.patch><v35-0008-Row-pattern-recognition-patch-typedefs.list.patch> I got a few comments, maybe just questions: 1 - 0001 - kwlist.h ``` +PG_KEYWORD("define", DEFINE, RESERVED_KEYWORD, BARE_LABEL) ``` Why do we add “define” as a reserved keyword? From the SQL example you put in 0006: ``` <programlisting> SELECT company, tdate, price, first_value(price) OVER w, max(price) OVER w, count(price) OVER w FROM stock WINDOW w AS ( PARTITION BY company ORDER BY tdate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING AFTER MATCH SKIP PAST LAST ROW INITIAL PATTERN (LOWPRICE UP+ DOWN+) DEFINE LOWPRICE AS price <= 100, UP AS price > PREV(price), DOWN AS price < PREV(price) ); </programlisting> ``` PARTITION is at the same level as DEFINE, but it’s not defined as a reserved keyword: ``` PG_KEYWORD("partition", PARTITION, UNRESERVED_KEYWORD, BARE_LABEL) ``` Even in this patch,”initial”,”past”, “pattern” and “seek” are defined as unreserved, why? So I just want to clarify. 2 - 0001 - gram.y ``` opt_row_pattern_initial_or_seek: INITIAL { $$ = true; } | SEEK { ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("SEEK is not supported"), errhint("Use INITIAL instead."), parser_errposition(@1))); } | /*EMPTY*/ { $$ = true; } ; ``` As SEEK is specially listed here, I guess it might be supported in future. If that is true, would it be better to defer the semantic check to later parse phase, which would future work easier. 3 - 0001 - parsenodes.h ``` +/* + * RowPatternCommonSyntax - raw representation of row pattern common syntax + * + */ +typedef struct RPCommonSyntax +{ + NodeTag type; + RPSkipTo rpSkipTo; /* Row Pattern AFTER MATCH SKIP type */ + bool initial; /* true if <row pattern initial or seek> is + * initial */ + List *rpPatterns; /* PATTERN variables (list of A_Expr) */ + List *rpDefs; /* row pattern definitions clause (list of + * ResTarget) */ +} RPCommonSyntax; + /* * WindowDef - raw representation of WINDOW and OVER clauses * @@ -593,6 +618,7 @@ typedef struct WindowDef char *refname; /* referenced window name, if any */ List *partitionClause; /* PARTITION BY expression list */ List *orderClause; /* ORDER BY (list of SortBy) */ + RPCommonSyntax *rpCommonSyntax; /* row pattern common syntax */ ``` RP fields are directly defined in WindowClause, then why do we need a wrapper of RPCommonSyntax? Can we directly define RP fields in WindowRef as well? Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/ -
Re: Row pattern recognition
Chao Li <li.evan.chao@gmail.com> — 2025-11-18T05:05:52Z
> On Nov 18, 2025, at 13:03, Chao Li <li.evan.chao@gmail.com> wrote: > > Hi Tatsuo-san, > > I just reviewed 0006 docs changes and 0001. I plan to slowly review the patch, maybe one commit a day. > >> On Nov 18, 2025, at 10:33, Tatsuo Ishii <ishii@postgresql.org> wrote: >> >> Attached are the v35 patches for Row pattern recognition. In v34-0001 >> gram.y patch, %type for RPR were misplaced. v35-0001 fixes this. Other >> patches are not changed. >> >> Best regards, >> -- >> Tatsuo Ishii >> SRA OSS K.K. >> English: http://www.sraoss.co.jp/index_en/ >> Japanese:http://www.sraoss.co.jp >> <v35-0001-Row-pattern-recognition-patch-for-raw-parser.patch><v35-0002-Row-pattern-recognition-patch-parse-analysis.patch><v35-0003-Row-pattern-recognition-patch-rewriter.patch><v35-0004-Row-pattern-recognition-patch-planner.patch><v35-0005-Row-pattern-recognition-patch-executor.patch><v35-0006-Row-pattern-recognition-patch-docs.patch><v35-0007-Row-pattern-recognition-patch-tests.patch><v35-0008-Row-pattern-recognition-patch-typedefs.list.patch> > > I got a few comments, maybe just questions: > > 1 - 0001 - kwlist.h > ``` > +PG_KEYWORD("define", DEFINE, RESERVED_KEYWORD, BARE_LABEL) > ``` > > Why do we add “define” as a reserved keyword? From the SQL example you put in 0006: > ``` > <programlisting> > SELECT company, tdate, price, > first_value(price) OVER w, > max(price) OVER w, > count(price) OVER w > FROM stock > WINDOW w AS ( > PARTITION BY company > ORDER BY tdate > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > AFTER MATCH SKIP PAST LAST ROW > INITIAL > PATTERN (LOWPRICE UP+ DOWN+) > DEFINE > LOWPRICE AS price <= 100, > UP AS price > PREV(price), > DOWN AS price < PREV(price) > ); > </programlisting> > ``` > > PARTITION is at the same level as DEFINE, but it’s not defined as a reserved keyword: > ``` > PG_KEYWORD("partition", PARTITION, UNRESERVED_KEYWORD, BARE_LABEL) > ``` > > Even in this patch,”initial”,”past”, “pattern” and “seek” are defined as unreserved, why? > > So I just want to clarify. > > 2 - 0001 - gram.y > ``` > opt_row_pattern_initial_or_seek: > INITIAL { $$ = true; } > | SEEK > { > ereport(ERROR, > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > errmsg("SEEK is not supported"), > errhint("Use INITIAL instead."), > parser_errposition(@1))); > } > | /*EMPTY*/ { $$ = true; } > ; > ``` > > As SEEK is specially listed here, I guess it might be supported in future. If that is true, would it be better to defer the semantic check to later parse phase, which would future work easier. > > 3 - 0001 - parsenodes.h > ``` > +/* > + * RowPatternCommonSyntax - raw representation of row pattern common syntax > + * > + */ > +typedef struct RPCommonSyntax > +{ > + NodeTag type; > + RPSkipTo rpSkipTo; /* Row Pattern AFTER MATCH SKIP type */ > + bool initial; /* true if <row pattern initial or seek> is > + * initial */ > + List *rpPatterns; /* PATTERN variables (list of A_Expr) */ > + List *rpDefs; /* row pattern definitions clause (list of > + * ResTarget) */ > +} RPCommonSyntax; > + > /* > * WindowDef - raw representation of WINDOW and OVER clauses > * > @@ -593,6 +618,7 @@ typedef struct WindowDef > char *refname; /* referenced window name, if any */ > List *partitionClause; /* PARTITION BY expression list */ > List *orderClause; /* ORDER BY (list of SortBy) */ > + RPCommonSyntax *rpCommonSyntax; /* row pattern common syntax */ > ``` > > RP fields are directly defined in WindowClause, then why do we need a wrapper of RPCommonSyntax? Can we directly define RP fields in WindowRef as well? > 4 - 0001 - parsenodes.h ``` + /* Row Pattern AFTER MACH SKIP clause */ ``` Typo: MACH -> MATCH Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/ -
Re: Row pattern recognition
Vik Fearing <vik@postgresfriends.org> — 2025-11-18T11:19:46Z
On 18/11/2025 06:03, Chao Li wrote: > 1 - 0001 - kwlist.h > ``` > +PG_KEYWORD("define", DEFINE, RESERVED_KEYWORD, BARE_LABEL) > ``` > > Why do we add “define” as a reserved keyword? From the SQL example you put in 0006: > ``` > <programlisting> > SELECT company, tdate, price, > first_value(price) OVER w, > max(price) OVER w, > count(price) OVER w > FROM stock > WINDOW w AS ( > PARTITION BY company > ORDER BY tdate > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > AFTER MATCH SKIP PAST LAST ROW > INITIAL > PATTERN (LOWPRICE UP+ DOWN+) > DEFINE > LOWPRICE AS price <= 100, > UP AS price > PREV(price), > DOWN AS price < PREV(price) > ); > </programlisting> > ``` > > PARTITION is at the same level as DEFINE, but it’s not defined as a reserved keyword: > ``` > PG_KEYWORD("partition", PARTITION, UNRESERVED_KEYWORD, BARE_LABEL) > ``` > > Even in this patch,”initial”,”past”, “pattern” and “seek” are defined as unreserved, why? > > So I just want to clarify. Because of position. Without making DEFINE a reserved keyword, how do you know that it isn't another variable in the PATTERN clause? -- Vik Fearing -
Re: Row pattern recognition
Chao Li <li.evan.chao@gmail.com> — 2025-11-19T04:14:03Z
> On Nov 18, 2025, at 19:19, Vik Fearing <vik@postgresfriends.org> wrote: > > > Because of position. Without making DEFINE a reserved keyword, how do you know that it isn't another variable in the PATTERN clause? > Ah, thanks for the clarification. Now I got it. I’m continue to review 0002. 5 - 0002 - parse_clause.c ``` + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("FRAME must start at current row when row patttern recognition is used"))); ``` Nit typo: patttern -> pattern 6 - 0002 - parse_clause.c ``` + /* DEFINE variable name initials */ + static char *defineVariableInitials = "abcdefghijklmnopqrstuvwxyz”; ``` This string can also be const, so “static const char *” 7 - 0002 - parse_clause.c ``` + /* + * Create list of row pattern DEFINE variable name's initial. We assign + * [a-z] to them (up to 26 variable names are allowed). + */ + restargets = NIL; + i = 0; + initialLen = strlen(defineVariableInitials); + + foreach(lc, windef->rpCommonSyntax->rpDefs) + { + char initial[2]; + + restarget = (ResTarget *) lfirst(lc); + name = restarget->name; ``` Looks like “name” is not used inside “foreach”. 8 - 0002 - parse_clause.c ``` + /* + * Create list of row pattern DEFINE variable name's initial. We assign + * [a-z] to them (up to 26 variable names are allowed). + */ + restargets = NIL; + i = 0; + initialLen = strlen(defineVariableInitials); + + foreach(lc, windef->rpCommonSyntax->rpDefs) + { + char initial[2]; ``` I guess this “foreach” for build initial list can be combined into the previous foreach loop of checking duplication. If an def has no dup, then assign an initial to it. 9 - 0002 - parse_clause.c ``` +static void +transformPatternClause(ParseState *pstate, WindowClause *wc, + WindowDef *windef) +{ + ListCell *lc; + + /* + * Row Pattern Common Syntax clause exists? + */ + if (windef->rpCommonSyntax == NULL) + return; ``` Checking (windef->rpCommonSyntax == NULL) seems a duplicate, because transformPatternClause() is only called by transformRPR(), and transformRPR() has done the check. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/ -
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-11-19T06:51:02Z
> On 18/11/2025 06:03, Chao Li wrote: >> 1 - 0001 - kwlist.h >> ``` >> +PG_KEYWORD("define", DEFINE, RESERVED_KEYWORD, BARE_LABEL) >> ``` >> >> Why do we add “define” as a reserved keyword? From the SQL example >> you put in 0006: >> ``` >> <programlisting> >> SELECT company, tdate, price, >> first_value(price) OVER w, >> max(price) OVER w, >> count(price) OVER w >> FROM stock >> WINDOW w AS ( >> PARTITION BY company >> ORDER BY tdate >> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING >> AFTER MATCH SKIP PAST LAST ROW >> INITIAL >> PATTERN (LOWPRICE UP+ DOWN+) >> DEFINE >> LOWPRICE AS price <= 100, >> UP AS price > PREV(price), >> DOWN AS price < PREV(price) >> ); >> </programlisting> >> ``` >> >> PARTITION is at the same level as DEFINE, but it’s not defined as a >> reserved keyword: >> ``` >> PG_KEYWORD("partition", PARTITION, UNRESERVED_KEYWORD, BARE_LABEL) >> ``` >> >> Even in this patch,”initial”,”past”, “pattern” and “seek” are >> defined as unreserved, why? >> >> So I just want to clarify. > > > Because of position. Without making DEFINE a reserved keyword, how do > you know that it isn't another variable in the PATTERN clause? I think we don't need to worry about this because PATTERN_P is in the $nonassoc list in the patch, which gives PATTERN different precedence from DEFINE. @@ -888,6 +896,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %nonassoc UNBOUNDED NESTED /* ideally would have same precedence as IDENT */ %nonassoc IDENT PARTITION RANGE ROWS GROUPS PRECEDING FOLLOWING CUBE ROLLUP SET KEYS OBJECT_P SCALAR VALUE_P WITH WITHOUT PATH + AFTER INITIAL SEEK PATTERN_P And I think we could change DEFINE to an unreserved keyword. Attached is a patch to do that, on top of v35-0001. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-11-19T07:19:03Z
> 2 - 0001 - gram.y > ``` > opt_row_pattern_initial_or_seek: > INITIAL { $$ = true; } > | SEEK > { > ereport(ERROR, > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > errmsg("SEEK is not supported"), > errhint("Use INITIAL instead."), > parser_errposition(@1))); > } > | /*EMPTY*/ { $$ = true; } > ; > ``` > > As SEEK is specially listed here, I guess it might be supported in future. If that is true, would it be better to defer the semantic check to later parse phase, which would future work easier. From a comment in gram.y: /* * If we see PARTITION, RANGE, ROWS, GROUPS, AFTER, INITIAL, SEEK or PATTERN_P * as the first token after the '(' of a window_specification, we want the * assumption to be that there is no existing_window_name; but those keywords * are unreserved and so could be ColIds. We fix this by making them have the For this purpose, we want INITIAL and SEEK to be unreserved keywords. > 3 - 0001 - parsenodes.h > ``` > +/* > + * RowPatternCommonSyntax - raw representation of row pattern common syntax > + * > + */ > +typedef struct RPCommonSyntax > +{ > + NodeTag type; > + RPSkipTo rpSkipTo; /* Row Pattern AFTER MATCH SKIP type */ > + bool initial; /* true if <row pattern initial or seek> is > + * initial */ > + List *rpPatterns; /* PATTERN variables (list of A_Expr) */ > + List *rpDefs; /* row pattern definitions clause (list of > + * ResTarget) */ > +} RPCommonSyntax; > + > /* > * WindowDef - raw representation of WINDOW and OVER clauses > * > @@ -593,6 +618,7 @@ typedef struct WindowDef > char *refname; /* referenced window name, if any */ > List *partitionClause; /* PARTITION BY expression list */ > List *orderClause; /* ORDER BY (list of SortBy) */ > + RPCommonSyntax *rpCommonSyntax; /* row pattern common syntax */ > ``` > > RP fields are directly defined in WindowClause, then why do we need a wrapper of RPCommonSyntax? Can we directly define RP fields in WindowRef as well? The row pattern common syntax defined in the standard is not only used in the WINDOW clause, but in the FROM clause. If we would support RPR in FROM clause in the future, it would be better to use the same code of row pattern common syntax in WINDOW clause as much as possible. That's the reason I created RPCommonSyntax. In the parse/analysis phase, I am not sure how the parse/analysis code would be in FROM clause at this point. So I did not define yet another struct for it. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-11-19T07:20:46Z
> 4 - 0001 - parsenodes.h > ``` > + /* Row Pattern AFTER MACH SKIP clause */ > ``` > > Typo: MACH -> MATCH Thanks for pointing it out. Will fix. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Chao Li <li.evan.chao@gmail.com> — 2025-11-20T07:33:48Z
> On Nov 19, 2025, at 12:14, Chao Li <li.evan.chao@gmail.com> wrote: > > > 9 - 0002 - parse_clause.c I am continuing to review 0003 10 - 0003 ``` + Assert(list_length(patternVariables) == list_length(defineClause)); ``` Is this assert true? From what I learned, pattern length doesn’t have to equal to define length. For example, I got an example from [1]: ``` Example 4 -- This example has three different patterns. -- Comment two of them, to get error-free query. SELECT company, price_date, price FROM stock_price MATCH_RECOGNIZE ( partition by company order by price_date all rows per match pattern ( limit_50 up up* down down* ) define limit_50 as price <= 50.00, up as price > prev(price), down as price < prev(price) ) WHERE company = 'B' ORDER BY price_date; ``` In this example, pattern has 5 elements and define has only 3 elements. 11 - 0004 - plannodes.h ``` + /* Row Pattern Recognition AFTER MACH SKIP clause */ + RPSkipTo rpSkipTo; /* Row Pattern Skip To type */ ``` Typo: MACH -> MATCH I’d stop here, and continue 0005 tomorrow. [1] https://link.springer.com/article/10.1007/s13222-022-00404-3 Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/ -
Re: Row pattern recognition
Chao Li <li.evan.chao@gmail.com> — 2025-11-21T05:25:50Z
> On Nov 20, 2025, at 15:33, Chao Li <li.evan.chao@gmail.com> wrote: > > > I’d stop here, and continue 0005 tomorrow. > I reviewed 0005 today. I'm still not very familiar with the executor code, so going through 0005 has also been a valuable learning process for me. 12 - 0005 - nodeWindowAgg.c ``` + if (string_set_get_size(str_set) == 0) + { + /* no match found in the first row */ + register_reduced_frame_map(winstate, original_pos, RF_UNMATCHED); + destroyStringInfo(encoded_str); + return; + } ``` encoded_str should also be destroyed if not entering the “if” clause. 13 - 0005 - nodeWindowAgg.c ``` static int do_pattern_match(char *pattern, char *encoded_str, int len) { static regex_t *regcache = NULL; static regex_t preg; static char patbuf[1024]; /* most recent 'pattern' is cached here */ ``` Using static variable in executor seems something I have never seen, which may persistent data across queries. I think usually per query data are stored in state structures. 14 - 0005 - nodeWindowAgg.c ``` + MemoryContext oldContext = MemoryContextSwitchTo(TopMemoryContext); + + if (regcache != NULL) + pg_regfree(regcache); /* free previous re */ + + /* we need to convert to char to pg_wchar */ + plen = strlen(pattern); + data = (pg_wchar *) palloc((plen + 1) * sizeof(pg_wchar)); + data_len = pg_mb2wchar_with_len(pattern, data, plen); + /* compile re */ + sts = pg_regcomp(&preg, /* compiled re */ + data, /* target pattern */ + data_len, /* length of pattern */ + cflags, /* compile option */ + C_COLLATION_OID /* collation */ + ); + pfree(data); + + MemoryContextSwitchTo(oldContext); ``` Here in do_pattern_match, it switches to top memory context and compiles the re and stores to “preg". Based on the comment of pg_regcomp: ``` /* * pg_regcomp - compile regular expression * * Note: on failure, no resources remain allocated, so pg_regfree() * need not be applied to re. */ int pg_regcomp(regex_t *re, const chr *string, size_t len, int flags, Oid collation) ``` “preg” should be freed by pg_regfree(), given the memory is allocated in TopMemoryContext, this looks like a memory leak. Okay, I’d stop here and continue to review 0006 next week. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/ -
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-11-21T05:57:08Z
Hi Chao, >> On Nov 18, 2025, at 19:19, Vik Fearing <vik@postgresfriends.org> wrote: >> >> >> Because of position. Without making DEFINE a reserved keyword, how do you know that it isn't another variable in the PATTERN clause? >> > > Ah, thanks for the clarification. Now I got it. > > I’m continue to review 0002. Thanks for the review! > 5 - 0002 - parse_clause.c > ``` > + ereport(ERROR, > + (errcode(ERRCODE_SYNTAX_ERROR), > + errmsg("FRAME must start at current row when row patttern recognition is used"))); > ``` > > Nit typo: patttern -> pattern Will fix (this involves regression test change because this changes the ERROR messages in the expected file). > 6 - 0002 - parse_clause.c > ``` > + /* DEFINE variable name initials */ > + static char *defineVariableInitials = "abcdefghijklmnopqrstuvwxyz”; > ``` > > This string can also be const, so “static const char *” Agreed. Will fix. > 7 - 0002 - parse_clause.c > ``` > + /* > + * Create list of row pattern DEFINE variable name's initial. We assign > + * [a-z] to them (up to 26 variable names are allowed). > + */ > + restargets = NIL; > + i = 0; > + initialLen = strlen(defineVariableInitials); > + > + foreach(lc, windef->rpCommonSyntax->rpDefs) > + { > + char initial[2]; > + > + restarget = (ResTarget *) lfirst(lc); > + name = restarget->name; > ``` > > Looks like “name” is not used inside “foreach”. Oops. Will fix. > 8 - 0002 - parse_clause.c > ``` > + /* > + * Create list of row pattern DEFINE variable name's initial. We assign > + * [a-z] to them (up to 26 variable names are allowed). > + */ > + restargets = NIL; > + i = 0; > + initialLen = strlen(defineVariableInitials); > + > + foreach(lc, windef->rpCommonSyntax->rpDefs) > + { > + char initial[2]; > ``` > > I guess this “foreach” for build initial list can be combined into the previous foreach loop of checking duplication. If an def has no dup, then assign an initial to it. You are right. Will change. > 9 - 0002 - parse_clause.c > ``` > +static void > +transformPatternClause(ParseState *pstate, WindowClause *wc, > + WindowDef *windef) > +{ > + ListCell *lc; > + > + /* > + * Row Pattern Common Syntax clause exists? > + */ > + if (windef->rpCommonSyntax == NULL) > + return; > ``` > > Checking (windef->rpCommonSyntax == NULL) seems a duplicate, because transformPatternClause() is only called by transformRPR(), and transformRPR() has done the check. Yeah. transformDefineClause() already does the similar check using Assert. What about using Assert in transPatternClause() as well? Assert(windef->rpCommonSyntax != NULL); Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Chao Li <li.evan.chao@gmail.com> — 2025-11-24T00:59:57Z
> On Nov 21, 2025, at 13:25, Chao Li <li.evan.chao@gmail.com> wrote: > > > Okay, I’d stop here and continue to review 0006 next week. > I just finished reviewing 0006, and see some problems: 15 - 0006 - select.sgml ``` +[ <replaceable class="parameter">row_pattern_common_syntax</replaceable> ] ``` row_pattern_common_syntax doesn’t look like a good name. I searched over the docs, and couldn't find a name suffixed by “_syntax”. I think we can just name it as “row_pattern_recognition_clause” or a shorter name just “row_pattern”. 16 - 0006 - select.sgml ``` + <synopsis> + [ AFTER MATCH SKIP PAST LAST ROW | AFTER MATCH SKIP TO NEXT ROW ] `` I think the two values are mutually exclusive, so curly braces should added surround them: [ { AFTER MATCH SKIP PAST LAST ROW | AFTER MATCH SKIP TO NEXT ROW } ] [] means optional, and {} means choose one from all alternatives. 17 - 0006 - select.sgml ``` PATTERN <replaceable class="parameter">pattern_variable_name</replaceable>[+] [, ...] ``` PATTERN definition should be placed inside (), here you missed () 18 - 0006 - select.sgml The same code as 17, <replaceable class="parameter">pattern_variable_name</replaceable>[+], do you only put “+” here, I think SQL allows a lot of pattern quantifiers. From 0001, gram.y, looks like +, * and ? Are supported in this patch. So, maybe we can do: ``` PATTERN ( pattern_element, [pattern_element …] ) and pattern_element = variable name optionally followed by quantifier (reference to somewhere introducing pattern quantifier). ``` 19 - 0006 - select.sgml I don’t see INITIAL and SEEK are described. Even if SEEK is not supported currently, it’s worthy mentioning that. 20 - 0006 - select.sgml ``` + after a match found. With <literal>AFTER MATCH SKIP PAST LAST + ROW</literal> (the default) next row position is next to the last row of ``` 21 - 0006 - select.sgml ``` [ <replaceable class="parameter">frame_clause</replaceable> ] +[ <replaceable class="parameter">row_pattern_common_syntax</replaceable> ] ``` Looks like row_pattern_common_syntax belongs to frame_clause, and you have lately added row_pattern_common_syntax to frame_clause: ``` <synopsis> -{ RANGE | ROWS | GROUPS } <replaceable>frame_start</replaceable> [ <replaceable>frame_exclusion</replaceable> ] -{ RANGE | ROWS | GROUPS } BETWEEN <replaceable>frame_start</replaceable> AND <replaceable>frame_end</replaceable> [ <replaceable>frame_exclusion</replaceable> ] +{ RANGE | ROWS | GROUPS } <replaceable>frame_start</replaceable> [ <replaceable>frame_exclusion</replaceable> ] [row_pattern_common_syntax] +{ RANGE | ROWS | GROUPS } BETWEEN <replaceable>frame_start</replaceable> AND <replaceable>frame_end</replaceable> [ <replaceable>frame_exclusion</replaceable> ] [row_pattern_common_syntax] </synopsis> ``` So I guess row_pattern_common_syntax after frame_clause should not be added. 22 - 0006 - advance.sgml ``` <programlisting> DEFINE LOWPRICE AS price <= 100, UP AS price > PREV(price), DOWN AS price < PREV(price) </programlisting> Note that <function>PREV</function> returns the price column in the ``` In the “Note” line, “price” refers to a column from above example, so I think it should be tagged by <literal>. This comment applies to multiple places. I will proceed 0007 tomorrow. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/ -
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-11-24T04:57:25Z
Hi Chao, Thank you for the review! >> On Nov 20, 2025, at 15:33, Chao Li <li.evan.chao@gmail.com> wrote: >> >> >> I’d stop here, and continue 0005 tomorrow. >> > > I reviewed 0005 today. I'm still not very familiar with the executor code, so going through 0005 has also been a valuable learning process for me. > > 12 - 0005 - nodeWindowAgg.c > ``` > + if (string_set_get_size(str_set) == 0) > + { > + /* no match found in the first row */ > + register_reduced_frame_map(winstate, original_pos, RF_UNMATCHED); > + destroyStringInfo(encoded_str); > + return; > + } > ``` > > encoded_str should also be destroyed if not entering the “if” clause. Subsequent string_set_discard() will free encode_str since encoded_str is part of str_set. So no need to call destroyStringInfo(encoded_str) in not entering "if" clause. string_set_discard(str_set); So I think this is Ok. > 13 - 0005 - nodeWindowAgg.c > ``` > static > int > do_pattern_match(char *pattern, char *encoded_str, int len) > { > static regex_t *regcache = NULL; > static regex_t preg; > static char patbuf[1024]; /* most recent 'pattern' is cached here */ > ``` > > Using static variable in executor seems something I have never seen, which may persistent data across queries. I think usually per query data are stored in state structures. Yeah, such a usage maybe rare. But I hesitate to store the data (regex_t) in WindowAggState because: (1) it bloats WindowAggState which we want to avoid if possible (2) it requires execnodes.h to include regex.h. (maybe this itself is harmless, I am not sure) > 14 - 0005 - nodeWindowAgg.c > ``` > + MemoryContext oldContext = MemoryContextSwitchTo(TopMemoryContext); > + > + if (regcache != NULL) > + pg_regfree(regcache); /* free previous re */ > + > + /* we need to convert to char to pg_wchar */ > + plen = strlen(pattern); > + data = (pg_wchar *) palloc((plen + 1) * sizeof(pg_wchar)); > + data_len = pg_mb2wchar_with_len(pattern, data, plen); > + /* compile re */ > + sts = pg_regcomp(&preg, /* compiled re */ > + data, /* target pattern */ > + data_len, /* length of pattern */ > + cflags, /* compile option */ > + C_COLLATION_OID /* collation */ > + ); > + pfree(data); > + > + MemoryContextSwitchTo(oldContext); > ``` > > Here in do_pattern_match, it switches to top memory context and compiles the re and stores to “preg". Based on the comment of pg_regcomp: > ``` > /* > * pg_regcomp - compile regular expression > * > * Note: on failure, no resources remain allocated, so pg_regfree() > * need not be applied to re. > */ > int > pg_regcomp(regex_t *re, > const chr *string, > size_t len, > int flags, > Oid collation) > ``` > > “preg” should be freed by pg_regfree(), given the memory is allocated in TopMemoryContext, this looks like a memory leak. I admit current patch leaves the memory unfreed even after a query finishes. What about adding something like: static void do_pattern_match_end(void) { if (regcache != NULL) pg_regfree(regcache); } and let ExecEndWindowAgg() call it? > Okay, I’d stop here and continue to review 0006 next week. > > Best regards, > -- > Chao Li (Evan) > HighGo Software Co., Ltd. > https://www.highgo.com/ > > > > -
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-11-24T06:47:03Z
Hi Chao, >> On Nov 21, 2025, at 13:25, Chao Li <li.evan.chao@gmail.com> wrote: >> >> >> Okay, I’d stop here and continue to review 0006 next week. >> > > I just finished reviewing 0006, and see some problems: > > 15 - 0006 - select.sgml > ``` > +[ <replaceable class="parameter">row_pattern_common_syntax</replaceable> ] > ``` > > row_pattern_common_syntax doesn’t look like a good name. I searched over the docs, and couldn't find a name suffixed by “_syntax”. I think we can just name it as “row_pattern_recognition_clause” or a shorter name just “row_pattern”. I believe these names are based on the SQL standard. All syntax components do not necessary be suffixed by "clause". For example in select.sgml, [ WITH [ RECURSIVE ] <replaceable class="parameter">with_query</replaceable> [, ...] ] SELECT [ ALL | DISTINCT [ ON ( <replaceable class="parameter">expression</replaceable> [, ...] ) ] ] [ { * | <replaceable class="parameter">expression</replaceable> [ [ AS ] <replaceable class="parameter">output_name</replaceable> ] } [, ...] ] [ FROM <replaceable class="parameter">from_item</replaceable> [, ...] ] [ WHERE <replaceable class="parameter">condition</replaceable> ] [ GROUP BY { ALL | [ ALL | DISTINCT ] <replaceable class="parameter">grouping_element</replaceable> [, ...] } ] [ HAVING <replaceable class="parameter">condition</replaceable> ] [ WINDOW <replaceable class="parameter">window_name</replaceable> AS ( <replaceable class="parameter">window_definition</replaceable> ) [, ...] ] "window_definition" comes from the standard and does not suffixed by "clause". If you look into the window clause definition in the standard, you will see: <window clause> ::= WINDOW <window definition list> <window definition list> ::= <window definition> [ { <comma> <window definition> }... ] So the usage of terms in our document matches the standard. Likewise, we see the definition of row pattern common syntax in the stabdard: <window clause> ::= WINDOW <window definition list> <window definition list> ::= <window definition> [ { <comma> <window definition> }... ] <window definition> ::= <new window name> AS <window specification> <new window name> ::= <window name> <window specification> ::= <left paren> <window specification details> <right paren> <window specification details> ::= [ <existing window name> ] [ <window partition clause> ] [ <window order clause> ] [ <window frame clause> ] : : <window frame clause> ::= [ <row pattern measures> ] <window frame units> <window frame extent> [ <window frame exclusion> ] [ <row pattern common syntax> ] So I think "row pattern common syntax" is perfectly appropriate name. If we change it to “row_pattern_recognition_clause”, it will just bring confusion. In the standard “row pattern recognition clause” is defined RPR in FROM clause. SELECT ... FROM table MATCH RECOGNIZE (....); Here MATCH RECOGNIZE (....) is the “row pattern recognition clause”. > 16 - 0006 - select.sgml > ``` > + <synopsis> > + [ AFTER MATCH SKIP PAST LAST ROW | AFTER MATCH SKIP TO NEXT ROW ] > `` > > I think the two values are mutually exclusive, so curly braces should added surround them: > > [ { AFTER MATCH SKIP PAST LAST ROW | AFTER MATCH SKIP TO NEXT ROW } ] > > [] means optional, and {} means choose one from all alternatives. Agreed. Will fix. > 17 - 0006 - select.sgml > ``` > PATTERN <replaceable class="parameter">pattern_variable_name</replaceable>[+] [, ...] > ``` > > PATTERN definition should be placed inside (), here you missed () Good catch. Will fix. > 18 - 0006 - select.sgml > The same code as 17, <replaceable class="parameter">pattern_variable_name</replaceable>[+], do you only put “+” here, I think SQL allows a lot of pattern quantifiers. From 0001, gram.y, looks like +, * and ? Are supported in this patch. So, maybe we can do: > > ``` > PATTERN ( pattern_element, [pattern_element …] ) > > and pattern_element = variable name optionally followed by quantifier (reference to somewhere introducing pattern quantifier). > ``` Currently only *, + and ? are supported and I don't think it's worth to invent "pattern_element". (Actually the standard defines much more complex syntax for PATTERN. I think we can revisit this once the basic support for quantifier *,+ and ? are brought in the core PostgreSQL code. > 19 - 0006 - select.sgml > > I don’t see INITIAL and SEEK are described. Even if SEEK is not supported currently, it’s worthy mentioning that. Agreed. Will fix. > 20 - 0006 - select.sgml > ``` > + after a match found. With <literal>AFTER MATCH SKIP PAST LAST > + ROW</literal> (the default) next row position is next to the last row of > ``` > > 21 - 0006 - select.sgml > ``` > [ <replaceable class="parameter">frame_clause</replaceable> ] > +[ <replaceable class="parameter">row_pattern_common_syntax</replaceable> ] > ``` > > Looks like row_pattern_common_syntax belongs to frame_clause, and you have lately added row_pattern_common_syntax to frame_clause: > ``` > <synopsis> > -{ RANGE | ROWS | GROUPS } <replaceable>frame_start</replaceable> [ <replaceable>frame_exclusion</replaceable> ] > -{ RANGE | ROWS | GROUPS } BETWEEN <replaceable>frame_start</replaceable> AND <replaceable>frame_end</replaceable> [ <replaceable>frame_exclusion</replaceable> ] > +{ RANGE | ROWS | GROUPS } <replaceable>frame_start</replaceable> [ <replaceable>frame_exclusion</replaceable> ] [row_pattern_common_syntax] > +{ RANGE | ROWS | GROUPS } BETWEEN <replaceable>frame_start</replaceable> AND <replaceable>frame_end</replaceable> [ <replaceable>frame_exclusion</replaceable> ] [row_pattern_common_syntax] > </synopsis> > ``` > > So I guess row_pattern_common_syntax after frame_clause should not be added. Yes, row_pattern_common_syntax belongs to frame_clause. Will fix. > 22 - 0006 - advance.sgml > ``` > <programlisting> > DEFINE > LOWPRICE AS price <= 100, > UP AS price > PREV(price), > DOWN AS price < PREV(price) > </programlisting> > > Note that <function>PREV</function> returns the price column in the > ``` > > In the “Note” line, “price” refers to a column from above example, so I think it should be tagged by <literal>. This comment applies to multiple places. You are right. Will fix. > I will proceed 0007 tomorrow. Thanks! Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Chao Li <li.evan.chao@gmail.com> — 2025-11-25T06:16:10Z
> On Nov 24, 2025, at 08:59, Chao Li <li.evan.chao@gmail.com> wrote: > > I will proceed 0007 tomorrow. I just finished reviewing 0007 and 0008. This patch set really demonstrates the full lifecycle of adding a SQL feature, from changing the syntax in gram.y all the way down to the executor, including tests and docs. I learned a lot from it. Thanks! 23 - 0007 As you mentioned earlier, pattern regular expression support *, + and ?, but I don’t see ? is tested. 24 - 0007 I don’t see negative tests for unsupported quantifiers, like PATTERN (A+?). 25 - 0007 ``` +-- basic test with none greedy pattern ``` Typo: “none greedy” -> non-greedy No comment for 0008. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-11-27T01:16:42Z
> I just finished reviewing 0007 and 0008. This patch set really demonstrates the full lifecycle of adding a SQL feature, from changing the syntax in gram.y all the way down to the executor, including tests and docs. I learned a lot from it. Thanks! You are welcome! > 23 - 0007 > > As you mentioned earlier, pattern regular expression support *, + and ?, but I don’t see ? is tested. Good catch. I will add the test case. In the mean time I found a bug with gram.y part which handles '?' quantifier. gram.y can be successfully compiled but there's no way to put '?' quantifier in a SQL statement. We cannot write directly "ColId '?'" like other quantifier '*' and '+' in the grammer because '?' is not a "self" token. So we let '?' matches Op first then check if it's '?' or not. | ColId Op { if (strcmp("?", $2)) ereport(ERROR, errcode(ERRCODE_SYNTAX_ERROR), errmsg("unsupported quantifier"), parser_errposition(@2)); $$ = (Node *) makeSimpleA_Expr(AEXPR_OP, "?", (Node *)makeString($1), NULL, @1); } Another bug was with executor (nodeWindowAgg.c). The code to check greedy quantifiers missed the case '?'. > 24 - 0007 > > I don’t see negative tests for unsupported quantifiers, like PATTERN (A+?). I will add such test cases. > 25 - 0007 > ``` > +-- basic test with none greedy pattern > ``` > > Typo: “none greedy” -> non-greedy Will fix. > No comment for 0008. Thanks again for the review. I will post v35 patch set soon. Attached is the diff from v34. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-11-27T03:10:08Z
Hi Chao, Any comment on this? >> 13 - 0005 - nodeWindowAgg.c >> ``` >> static >> int >> do_pattern_match(char *pattern, char *encoded_str, int len) >> { >> static regex_t *regcache = NULL; >> static regex_t preg; >> static char patbuf[1024]; /* most recent 'pattern' is cached here */ >> ``` >> >> Using static variable in executor seems something I have never seen, which may persistent data across queries. I think usually per query data are stored in state structures. > >Yeah, such a usage maybe rare. But I hesitate to store the data >(regex_t) in WindowAggState because: > >(1) it bloats WindowAggState which we want to avoid if possible >(2) it requires execnodes.h to include regex.h. (maybe this itself is harmless, I am not sure) > >> 14 - 0005 - nodeWindowAgg.c >> ``` >> + MemoryContext oldContext = MemoryContextSwitchTo(TopMemoryContext); >> + >> + if (regcache != NULL) >> + pg_regfree(regcache); /* free previous re */ >> + >> + /* we need to convert to char to pg_wchar */ >> + plen = strlen(pattern); >> + data = (pg_wchar *) palloc((plen + 1) * sizeof(pg_wchar)); >> + data_len = pg_mb2wchar_with_len(pattern, data, plen); >> + /* compile re */ >> + sts = pg_regcomp(&preg, /* compiled re */ >> + data, /* target pattern */ >> + data_len, /* length of pattern */ >> + cflags, /* compile option */ >> + C_COLLATION_OID /* collation */ >> + ); >> + pfree(data); >> + >> + MemoryContextSwitchTo(oldContext); >> ``` >> >> Here in do_pattern_match, it switches to top memory context and compiles the re and stores to “preg". Based on the comment of pg_regcomp: >> ``` >> /* >> * pg_regcomp - compile regular expression >> * >> * Note: on failure, no resources remain allocated, so pg_regfree() >> * need not be applied to re. >> */ >> int >> pg_regcomp(regex_t *re, >> const chr *string, >> size_t len, >> int flags, >> Oid collation) >> ``` >> >> “preg” should be freed by pg_regfree(), given the memory is allocated in TopMemoryContext, this looks like a memory leak. > >I admit current patch leaves the memory unfreed even after a query >finishes. What about adding something like: > >static void do_pattern_match_end(void) >{ > if (regcache != NULL) > pg_regfree(regcache); >} > >and let ExecEndWindowAgg() call it? Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Chao Li <li.evan.chao@gmail.com> — 2025-11-27T06:56:02Z
> On Nov 27, 2025, at 11:10, Tatsuo Ishii <ishii@postgresql.org> wrote: > > Hi Chao, > > Any comment on this? > >>> 13 - 0005 - nodeWindowAgg.c >>> ``` >>> static >>> int >>> do_pattern_match(char *pattern, char *encoded_str, int len) >>> { >>> static regex_t *regcache = NULL; >>> static regex_t preg; >>> static char patbuf[1024]; /* most recent 'pattern' is cached here */ >>> ``` >>> >>> Using static variable in executor seems something I have never seen, which may persistent data across queries. I think usually per query data are stored in state structures. >> >> Yeah, such a usage maybe rare. But I hesitate to store the data >> (regex_t) in WindowAggState because: >> >> (1) it bloats WindowAggState which we want to avoid if possible >> (2) it requires execnodes.h to include regex.h. (maybe this itself is harmless, I am not sure) With the static function-scope variables, those data persist across queries, which is error prone. I am afraid that even if I let this pass, other reviewers might have the same concern. Maybe define a sub structure, say WindowAggCache, and put a pointer of WindowAggCache in WindowAggState, and only allocate memory when a query needs to. >> >>> 14 - 0005 - nodeWindowAgg.c >>> ``` >>> + MemoryContext oldContext = MemoryContextSwitchTo(TopMemoryContext); >>> + >>> + if (regcache != NULL) >>> + pg_regfree(regcache); /* free previous re */ >>> + >>> + /* we need to convert to char to pg_wchar */ >>> + plen = strlen(pattern); >>> + data = (pg_wchar *) palloc((plen + 1) * sizeof(pg_wchar)); >>> + data_len = pg_mb2wchar_with_len(pattern, data, plen); >>> + /* compile re */ >>> + sts = pg_regcomp(&preg, /* compiled re */ >>> + data, /* target pattern */ >>> + data_len, /* length of pattern */ >>> + cflags, /* compile option */ >>> + C_COLLATION_OID /* collation */ >>> + ); >>> + pfree(data); >>> + >>> + MemoryContextSwitchTo(oldContext); >>> ``` >>> >>> Here in do_pattern_match, it switches to top memory context and compiles the re and stores to “preg". Based on the comment of pg_regcomp: >>> ``` >>> /* >>> * pg_regcomp - compile regular expression >>> * >>> * Note: on failure, no resources remain allocated, so pg_regfree() >>> * need not be applied to re. >>> */ >>> int >>> pg_regcomp(regex_t *re, >>> const chr *string, >>> size_t len, >>> int flags, >>> Oid collation) >>> ``` >>> >>> “preg” should be freed by pg_regfree(), given the memory is allocated in TopMemoryContext, this looks like a memory leak. >> >> I admit current patch leaves the memory unfreed even after a query >> finishes. What about adding something like: >> >> static void do_pattern_match_end(void) >> { >> if (regcache != NULL) >> pg_regfree(regcache); >> } >> >> and let ExecEndWindowAgg() call it? > I’m not sure. But I think if we move the data into WindowAggState, then I guess we don’t have to use TopmemoryContext here. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/ -
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-12-01T06:57:02Z
Hi Chao, Sorry, I missed this email. >> 9 - 0002 - parse_clause.c > > I am continuing to review 0003 > > 10 - 0003 > ``` > + Assert(list_length(patternVariables) == list_length(defineClause)); > ``` > > Is this assert true? From what I learned, pattern length doesn’t have to equal to define length. For example, I got an example from [1]: You are right. If PATTERN clause uses the same pattern variable more than once (which is perfectly valid), the assertion fails. I will remove the assersion and fix the subsequent forboth loop. > ``` > Example 4 > -- This example has three different patterns. > -- Comment two of them, to get error-free query. > SELECT company, price_date, price > FROM stock_price > MATCH_RECOGNIZE ( > partition by company > order by price_date > all rows per match > pattern ( limit_50 up up* down down* ) > define > limit_50 as price <= 50.00, > up as price > prev(price), > down as price < prev(price) > ) > WHERE company = 'B' > ORDER BY price_date; > ``` > > In this example, pattern has 5 elements and define has only 3 elements. Yes. > 11 - 0004 - plannodes.h > ``` > + /* Row Pattern Recognition AFTER MACH SKIP clause */ > + RPSkipTo rpSkipTo; /* Row Pattern Skip To type */ > ``` > > Typo: MACH -> MATCH Will fix/ > I’d stop here, and continue 0005 tomorrow. Thanks! > [1] https://link.springer.com/article/10.1007/s13222-022-00404-3 > > Best regards, > -- > Chao Li (Evan) > HighGo Software Co., Ltd. > https://www.highgo.com/ > > > >
-
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-12-01T12:21:38Z
>> On Nov 27, 2025, at 11:10, Tatsuo Ishii <ishii@postgresql.org> wrote: >> >> Hi Chao, >> >> Any comment on this? >> >>>> 13 - 0005 - nodeWindowAgg.c >>>> ``` >>>> static >>>> int >>>> do_pattern_match(char *pattern, char *encoded_str, int len) >>>> { >>>> static regex_t *regcache = NULL; >>>> static regex_t preg; >>>> static char patbuf[1024]; /* most recent 'pattern' is cached here */ >>>> ``` >>>> >>>> Using static variable in executor seems something I have never seen, which may persistent data across queries. I think usually per query data are stored in state structures. >>> >>> Yeah, such a usage maybe rare. But I hesitate to store the data >>> (regex_t) in WindowAggState because: >>> >>> (1) it bloats WindowAggState which we want to avoid if possible >>> (2) it requires execnodes.h to include regex.h. (maybe this itself is harmless, I am not sure) > > With the static function-scope variables, those data persist across queries, which is error prone. I am afraid that even if I let this pass, other reviewers might have the same concern. > > Maybe define a sub structure, say WindowAggCache, and put a pointer of WindowAggCache in WindowAggState, and only allocate memory when a query needs to. > >>> >>>> 14 - 0005 - nodeWindowAgg.c >>>> ``` >>>> + MemoryContext oldContext = MemoryContextSwitchTo(TopMemoryContext); >>>> + >>>> + if (regcache != NULL) >>>> + pg_regfree(regcache); /* free previous re */ >>>> + >>>> + /* we need to convert to char to pg_wchar */ >>>> + plen = strlen(pattern); >>>> + data = (pg_wchar *) palloc((plen + 1) * sizeof(pg_wchar)); >>>> + data_len = pg_mb2wchar_with_len(pattern, data, plen); >>>> + /* compile re */ >>>> + sts = pg_regcomp(&preg, /* compiled re */ >>>> + data, /* target pattern */ >>>> + data_len, /* length of pattern */ >>>> + cflags, /* compile option */ >>>> + C_COLLATION_OID /* collation */ >>>> + ); >>>> + pfree(data); >>>> + >>>> + MemoryContextSwitchTo(oldContext); >>>> ``` >>>> >>>> Here in do_pattern_match, it switches to top memory context and compiles the re and stores to “preg". Based on the comment of pg_regcomp: >>>> ``` >>>> /* >>>> * pg_regcomp - compile regular expression >>>> * >>>> * Note: on failure, no resources remain allocated, so pg_regfree() >>>> * need not be applied to re. >>>> */ >>>> int >>>> pg_regcomp(regex_t *re, >>>> const chr *string, >>>> size_t len, >>>> int flags, >>>> Oid collation) >>>> ``` >>>> >>>> “preg” should be freed by pg_regfree(), given the memory is allocated in TopMemoryContext, this looks like a memory leak. >>> >>> I admit current patch leaves the memory unfreed even after a query >>> finishes. What about adding something like: >>> >>> static void do_pattern_match_end(void) >>> { >>> if (regcache != NULL) >>> pg_regfree(regcache); >>> } >>> >>> and let ExecEndWindowAgg() call it? >> > > I’m not sure. But I think if we move the data into WindowAggState, then I guess we don’t have to use TopmemoryContext here. I decided to add new fields to WindowAggState: /* regular expression compiled result cache. Used for RPR. */ char *patbuf; /* pattern to be compiled */ regex_t preg; /* compiled re pattern */ Then allocate the memory for them using winstate->ss.ps.ps_ExprContext->ecxt_per_query_memory; Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-12-01T12:42:18Z
>> I just finished reviewing 0007 and 0008. This patch set really demonstrates the full lifecycle of adding a SQL feature, from changing the syntax in gram.y all the way down to the executor, including tests and docs. I learned a lot from it. Thanks! > > You are welcome! > >> 23 - 0007 >> >> As you mentioned earlier, pattern regular expression support *, + and ?, but I don’t see ? is tested. > > Good catch. I will add the test case. In the mean time I found a bug > with gram.y part which handles '?' quantifier. gram.y can be > successfully compiled but there's no way to put '?' quantifier in a > SQL statement. We cannot write directly "ColId '?'" like other > quantifier '*' and '+' in the grammer because '?' is not a "self" > token. So we let '?' matches Op first then check if it's '?' or > not. > > | ColId Op > { > if (strcmp("?", $2)) > ereport(ERROR, > errcode(ERRCODE_SYNTAX_ERROR), > errmsg("unsupported quantifier"), > parser_errposition(@2)); > > $$ = (Node *) makeSimpleA_Expr(AEXPR_OP, "?", (Node *)makeString($1), NULL, @1); > } > > Another bug was with executor (nodeWindowAgg.c). The code to check > greedy quantifiers missed the case '?'. > >> 24 - 0007 >> >> I don’t see negative tests for unsupported quantifiers, like PATTERN (A+?). > > I will add such test cases. > >> 25 - 0007 >> ``` >> +-- basic test with none greedy pattern >> ``` >> >> Typo: “none greedy” -> non-greedy > > Will fix. > >> No comment for 0008. > > Thanks again for the review. I will post v35 patch set soon. Attached > is the diff from v34. Attached are the v35 patches for Row pattern recognition. Chao Li reviewed v34 thoroughly. Thank you! v35 reflects the review comments. Major differences from v34 include: - Make "DEFINE" an unreserved keyword. Previously it was a reserved keyword. - Refactor transformDefineClause() to make two foreach loops into single foreach loop. - Make '?' quantifier in PATTERN work as advertised. Test for '?' quantifier is added too. - Unsupported quantifier test added. - Fix get_rule_define(). - Fix memory leak related to regcomp. - Move regcomp compiled result cache from static data to WindowAggState. - Fix several typos. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp -
Re: Row pattern recognition
Tatsuo Ishii <ishii@postgresql.org> — 2025-12-23T13:26:54Z
Attached are the v36 patches, just for rebasing. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
-
Re: Row pattern recognition
Henson Choi <assam258@gmail.com> — 2025-12-31T12:10:53Z
SQL/RPR Patch Review Report ============================ Patch: Row Pattern Recognition (SQL:2016) Commitfest: https://commitfest.postgresql.org/patch/4460 Review Methodology: This review focused on quality assessment, not line-by-line code audit. Key code paths and quality issues were examined with surrounding context when concerns arose. Documentation files were reviewed with AI-assisted grammar and typo checking. Code coverage was measured using gcov and custom analysis tools. Limitations: - Not a security audit or formal verification - Parser and executor internals reviewed at module level, not exhaustively - Performance characteristics not benchmarked TABLE OF CONTENTS ----------------- 1. Executive Summary 2. Issues Found 2.1 Critical / Major 2.2 Minor Issues (Review Needed) 2.3 Minor Issues (Style) 2.4 Suggestions for Discussion 3. Test Coverage 3.1 Covered Areas 3.2 Untested Items 3.3 Unimplemented Features (No Test Needed) 4. Code Coverage Analysis 4.1 Overall Coverage 4.2 Coverage by File 4.3 Uncovered Code Risk Assessment 5. Commit Analysis 6. Recommendations 7. Conclusion 1. EXECUTIVE SUMMARY -------------------- Overall assessment: GOOD The SQL/RPR patch demonstrates solid implementation quality within its defined scope. Code follows PostgreSQL coding standards (with minor style issues), test coverage is comprehensive at 96.4%, and documentation is thorough with only minor typos. Rating by Area: - Code Quality: Good (PostgreSQL style compliant, 26 static style fixes recommended) - Test Coverage: Good (96.4% line coverage, 28 test cases) - Documentation: Good (Complete, 1 minor typo) - Build/Regress: Pass (make check-world, 753 tests passed) 2. ISSUES FOUND --------------- 2.1 Critical / Major #1 [Code] Greedy pattern combinatorial explosion Pattern: PATTERN (A+ B+ C+) with DEFINE A AS TRUE, B AS TRUE, C AS TRUE Impact: Memory exhaustion or infinite wait Recommendation: Add work_mem-based memory limit (error on exceed) Evidence - No memory limit in current code: - nodeWindowAgg.c:5718-5722 string_set_init(): palloc() unconditional - nodeWindowAgg.c:5741-5750 string_set_add(): set_size *= 2; repalloc() unlimited - nodeWindowAgg.c:5095-5174 generate_patterns(): Triple loop, no limit Only work_mem usage in RPR (nodeWindowAgg.c:1297): winstate->buffer = tuplestore_begin_heap(false, false, work_mem); -> For tuple buffer, unrelated to pattern combination memory (StringSet) 2.2 Minor Issues (Review Needed) #1 [Code] nodeWindowAgg.c:5849,5909,5912 pos > NUM_ALPHABETS check - intent unclear, causes error at 28 PATTERN elements Reproduction: - PATTERN (A B C ... Z A) (27 elements) -> OK - PATTERN (A B C ... Z A B) (28 elements) -> ERROR "initial is not valid char: b" 2.3 Minor Issues (Style) #1 [Code] nodeWindowAgg.c (25 blocks) #ifdef RPR_DEBUG -> recommend elog(DEBUG2, ...) or remove #2 [Code] src/backend/executor/nodeWindowAgg.c static keyword on separate line (26 functions) #3 [Code] src/backend/utils/adt/ruleutils.c Whitespace typo: "regexp !=NULL" -> "regexp != NULL" #4 [Code] nodeWindowAgg.c:4619 Error message case: "Unrecognized" -> "unrecognized" (PostgreSQL style) #5 [Doc] doc/src/sgml/ref/select.sgml:1128 Typo: "maximu" -> "maximum" 2.4 Suggestions for Discussion #1 Incremental matching with streaming NFA redesign Benefits: - O(n) time complexity per row (current: exponential in worst case) - Bounded memory via state merging and context absorption - Natural extension for OR patterns, {n,m} quantifiers, nested groups - Enables MEASURES clause with incremental aggregates - Proven approach in CEP engines (Flink, Esper) 3. TEST COVERAGE ---------------- 3.1 Covered Areas - PATTERN clause: +, *, ? quantifiers (line 41, 71, 86) - DEFINE clause: Variable conditions, auto-TRUE for missing (line 120-133) - PREV/NEXT functions: Single argument (line 44, 173) - AFTER MATCH SKIP: TO NEXT ROW (line 182), PAST LAST ROW (line 198) - Aggregate integration: sum, avg, count, max, min (line 258-277) - Window function integration: first_value, last_value, nth_value (line 34-35) - JOIN/CTE: JOIN (line 313), WITH (line 324) - View: VIEW creation, pg_get_viewdef (line 390-406) - Error cases: 7 error scenarios (line 409-532) - Large partition: 5000 row smoke test (line 360-387) - ROWS BETWEEN offset: CURRENT ROW AND 2 FOLLOWING (line 244) 3.2 Untested Items Feature Priority Recommendation ------------------------------------------------------------------------------- 26 variable limit Medium Test 26 variables success, 27th variable error NULL value handling Low Test PREV(col) where col or previous row is NULL Sample test for 26 variable limit: -- Should fail with 27th variable (parser error, no table needed) SELECT * FROM (SELECT 1 AS x) t WINDOW w AS ( ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING PATTERN (a b c d e f g h i j k l m n o p q r s t u v w x y z aa) DEFINE a AS TRUE, b AS TRUE, c AS TRUE, d AS TRUE, e AS TRUE, f AS TRUE, g AS TRUE, h AS TRUE, i AS TRUE, j AS TRUE, k AS TRUE, l AS TRUE, m AS TRUE, n AS TRUE, o AS TRUE, p AS TRUE, q AS TRUE, r AS TRUE, s AS TRUE, t AS TRUE, u AS TRUE, v AS TRUE, w AS TRUE, x AS TRUE, y AS TRUE, z AS TRUE, aa AS TRUE ); -- ERROR: number of row pattern definition variable names exceeds 26 Sample test for NULL handling: CREATE TEMP TABLE stock_null (company TEXT, tdate DATE, price INTEGER); INSERT INTO stock_null VALUES ('c1', '2023-07-01', 100); INSERT INTO stock_null VALUES ('c1', '2023-07-02', NULL); -- NULL in middle INSERT INTO stock_null VALUES ('c1', '2023-07-03', 200); INSERT INTO stock_null VALUES ('c1', '2023-07-04', 150); SELECT company, tdate, price, count(*) OVER w AS match_count FROM stock_null WINDOW w AS ( PARTITION BY company ORDER BY tdate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING PATTERN (START UP DOWN) DEFINE START AS TRUE, UP AS price > PREV(price), DOWN AS price < PREV(price) ); -- Result: all rows show match_count = 0 (NULL breaks pattern matching) 3.3 Unimplemented Features (No Test Needed) Per documentation (select.sgml:1133-1136): - MEASURES clause: Not implemented - SUBSET clause: Not implemented - AFTER MATCH variants: Only SKIP TO NEXT ROW, PAST LAST ROW supported Not in documentation, verified by testing: - {n,m} quantifier: Parser error ("syntax error at or near {") - (A|B) OR pattern: Not supported (parsed as single variable name "a|b") - (A B)+ compound repetition: Parser accepts, but execution returns 0 matches 4. CODE COVERAGE ANALYSIS ------------------------- 4.1 Overall Coverage 96.4% (732 / 759 lines) 4.2 Coverage by File (major RPR-modified files) nodeWindowAgg.c: 96.6% (560/580) - Pattern matching execution engine parse_clause.c: 96.7% (88/91) - PATTERN/DEFINE analysis ruleutils.c: 93.1% (54/58) - pg_get_viewdef output 4.3 Uncovered Code Risk Assessment Total: 27 lines uncovered Medium Risk (2 items) - Test addition recommended (see section 3.2): - parse_clause.c:4093: transformDefineClause - 27th variable error - nodeWindowAgg.c:5623: get_slots - null_slot for PREV at partition first row Low Risk (25 items) - Defensive code / Unreachable: - nodeWindowAgg.c:3074: attno_map_walker - Parser validates arg count - nodeWindowAgg.c:3137-3138: rpr_navigation_walker - switch default - nodeWindowAgg.c:3566: window_gettupleslot - Before mark position - nodeWindowAgg.c:4289: WinGetFuncArgInFrame - isout flag - nodeWindowAgg.c:4393: WinGetSlotInFrame - Boundary check - nodeWindowAgg.c:4618-4619: row_is_in_reduced_frame - switch default - nodeWindowAgg.c:4697: register_reduced_frame_map - pos < 0 check - nodeWindowAgg.c:5007: search_str_set - NULL return continue - nodeWindowAgg.c:5405: do_pattern_match - Regex compile error - nodeWindowAgg.c:5435,5437-5438: do_pattern_match - Regex exec error - nodeWindowAgg.c:5700: pattern_initial - Variable not found - nodeWindowAgg.c:5776: string_set_get - Index range check - nodeWindowAgg.c:5850: variable_pos_register - a-z range check - nodeWindowAgg.c:5910: variable_pos_fetch - a-z range check - nodeWindowAgg.c:5913: variable_pos_fetch - Index range check - parse_clause.c:3989: transformDefineClause - A_Expr type check - parse_clause.c:4145: transformPatternClause - A_Expr type check - ruleutils.c:6904-6908: get_rule_windowspec - SKIP TO NEXT ROW output Conclusion: Most uncovered code consists of defensive error handling or code unreachable due to parser pre-validation. No security or functional risk. 5. COMMIT ANALYSIS ------------------ 8 sequential commits: Commit Area Files +/- Key Content ------------------------------------------------------------------------------- 1 Raw Parser 4 +174/-16 gram.y grammar (PATTERN/DEFINE) 2 Parse/Analysis 4 +277/-1 parse_clause.c analysis logic 3 Rewriter 1 +109/-0 pg_get_viewdef extension 4 Planner 5 +73/-3 WindowAgg plan node extension 5 Executor 4 +1,942/-11 CORE: Pattern matching engine (+1,850) 6 Docs 3 +192/-7 advanced.sgml, func-window.sgml 7 Tests 3 +1,585/-1 rpr.sql (532), rpr.out (1,052) 8 typedefs 1 +6/-0 pgindent support Code Change Statistics: - Total files: 25 - Lines added: 4,358 - Lines deleted: 39 - Net increase: +4,319 lines 6. RECOMMENDATIONS ------------------ 6.1 Combinatorial Explosion Prevention (Major, Required) Add work_mem-based memory limit for StringSet allocation. Location: string_set_add() in nodeWindowAgg.c:5741-5750 Consistent with existing PostgreSQL approach (Hash Join, Sort, etc.) 6.2 Code Review Required (Minor, Decision Needed) Location: nodeWindowAgg.c:5849,5909,5912 Issue: pos > NUM_ALPHABETS check intent unclear Current: PATTERN with 28 elements causes error Question: Is 27 element limit intentional? 6.3 Code Style Fixes (Minor) - #ifdef RPR_DEBUG: Use elog(DEBUG2, ...) or remove (25 blocks) - static keyword on separate line: 26 functions to fix - Whitespace: "regexp !=NULL" -> "regexp != NULL" - Error message case: "Unrecognized" -> "unrecognized" 6.4 Documentation Fixes (Minor) - select.sgml: "maximu" -> "maximum" 6.5 Test Additions (Optional) Black-box Tests (Functional): Feature Test Case Priority ------------------------------------------------------------------------------- Variable limit 26 variables success, 27 error Medium NULL boundary PREV at partition first row Medium White-box Tests (Coverage) - covered by above black-box tests: File:Line Target Branch ------------------------------------------------------------------------------- parse_clause.c:4093 Limit error branch (Variable limit test) nodeWindowAgg.c:5623 null_slot usage (NULL boundary test) 7. CONCLUSION ------------- Test Quality: GOOD Core functionality is thoroughly tested with comprehensive error case coverage. The patch is well-implemented within its defined scope. Identified issues include one major concern (combinatorial explosion) and minor style/documentation items. Recommended actions before commit: 1. Add work_mem-based memory limit for pattern combinations (required) 2. Clarify pos > NUM_ALPHABETS check intent (review needed) 3. Fix code style issues (optional but recommended) 4. Fix documentation typo (trivial) 5. Add tests for variable limit and NULL handling (optional) Points for discussion (optional): 6. Incremental matching with streaming NFA redesign Attachment: - coverage.tgz (gcov HTML report, RPR-modified code only) --- End of Report 2025년 9월 24일 (수) PM 7:36, Tatsuo Ishii <ishii@postgresql.org>님이 작성: > Attached are the v33 patches for Row pattern recognition. The > difference from v32 is that the raw parse tree printing patch is not > included in v33. This is because now that we have > debug_print_raw_parse. > > Best regards, > -- > Tatsuo Ishii > SRA OSS K.K. > English: http://www.sraoss.co.jp/index_en/ > Japanese:http://www.sraoss.co.jp >