Thread
-
Re: Bypassing cursors in postgres_fdw to enable parallel plans
Robert Haas <robertmhaas@gmail.com> — 2025-12-10T17:43:48Z
On Thu, Nov 27, 2025 at 5:50 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote: > Thanks Kenan for these. So, it looks like the patch performs the same as in the local scan case. I wonder if you found any case of performance degradation with the patch. > > Per an off-list discussion with Robert, he suggested using the existing data structures for recording the state of last queries instead of inventing something new. > Makes sense, so I reworked the patch to include tuplestore in PgFdwScanState and then use PgFdwScanState as part of PgFdwConnState to keep track of previously > active cursors. Nothing else is changed in this version of the patch. Thanks for your continued work on this topic. As we discussed off-list, the regression tests in this patch need quite a bit more work. We shouldn't just repeat large numbers of tests -- we should repeat enough to provide meaningful test coverage of the new feature, and maybe add a few new ones that specifically target scenarios that the patch is intended to cover. One somewhat bizarre thing that I noticed scrolling through the regression test outputs is this: +-- Test with non-cursor mode +SET postgres_fdw.use_cursor = false; +SET postgres_fdw.use_cursor = true; This happens in multiple places in the regression tests, and doesn't really make any sense to me, because changing the GUC from to false, doing nothing, and then changing it back to true doesn't seem like a useful test scenario, and definitely doesn't seem like a test scenario that we need to repeat multiple times. Honestly, I wonder how this happened. Did you maybe run a script over the .sql files to generate the new versions and then not check what the output actually looked like? I really can't stress enough the need to be thoughtful about constructing test cases for a patch of this type. A related problem is that you've included 12,721 lines of useless output in the patch file, because somehow postgres_fdw.out.orig got checked into the repository. It's always a good idea, when posting a patch to the list, to check the diffstat to make sure that there's nothing in there that you don't expect to see. The presence of this line just below your commit message could have alerted you to this problem: .../expected/postgres_fdw.out.orig | 12721 ++++++++++++++++ Ideally, I really recommend scrolling through the patch, not just the diffstat, to make sure that everything is the way you want it to be, before giving it to others to look at. There are a number of other cosmetic problems with this patch that make it hard to review the actual code changes. For instance: + /* To be used only in non-cursor mode */ + Tuplestorestate *tuplestore; + TupleTableSlot *slot; + bool tuples_ready; + struct PgFdwScanState *next; The comment is good, but it only explains a general fact about all four of these fields. It doesn't explain specifically what each of these fields is intended to do. Naming a field very generally, like "next", when it must have some very specific purpose that is not documented, makes it really hard for someone reading the code -- in this case, me -- to understand what the point is. So, these fields should probably have comments, but also, some of them probably need to be renamed. Maybe "tuplestore" and "slot" are OK but "next" is not going to be good enough. +/* Fill the Tupleslot when another query needs to execute. */ +static void +fillTupleSlot(PgFdwScanState *fsstate, ForeignScanState *node) I think you're filling a tuple store, not a tuple slot. + initStringInfo(&buf); What seems to happen here is that you create an empty buffer, add fsstate->query to it and nothing else, and then use buf.data. So why not just get rid of buf and use fsstate->query directly? + /* Always fetch the last prev_query */ + for (; p1->next != NULL; p1 = p1->next); There are multiple problems here. First, this code is not per PostgreSQL style and would be reindented by pgindent, which you should make a habit of running before posting. Second, p1 is not a particularly informative or understandable variable name. Third, why are we using a linked list if the value we're going to need is at the end of the list? If we need to be able to access the last element of the list efficiently, maybe we should be keeping the list in reverse order, or maybe we should be using a List which permits efficient random access. But looking quickly through the patch, I have an even bigger question here. It doesn't really seem like we ever care about any element of the list other than the last. It looks like we go to a fair amount of trouble to maintain the list at various points in the code, but it seems like when we access the list, we just always go to the end. That makes me wonder why we even need a list. Perhaps instead of PgFdwConnState storing a pointer to "prev_queries", it should just store a pointer to the PgFdwScanState for the query that is currently "using" the connection, if there is one. That is, the query whose result set is currently being received, and which must be buffered into a tuplestore before using the connection for something else. When we've done that, we reset the field to NULL, and we don't maintain any list of the older "previous queries". Maybe there's some problem with that idea, but that gets back to my earlier point about comments: the comment for "next" (or however it got renamed) ought to be explaining something about why it exists and what it's for. Without such a comment, the only ways for me to be sure whether we really need "next" is to either (a) ask you or (b) spend a long time staring at the code to try to figure it out or (c) try removing it and see if it works. Ah, wait! I just found some code that cares about the list but not just about the last element: + for (; p1->next != NULL; p1 = p1->next) + /* find the correct prev_query */ + { + if ((p1->tuples_ready && fsstate->cursor_number == p1->cursor_number)) + cur_prev = p1; + } But I'm still confused. Why do we need to iterate through the prev_queries list to find the correct ForeignScanState? Isn't that exactly what fsstate is here? I'm inclined to think that this is just a complicated way of setting p1 = fsstate and then setting cur_prev = p1, so we could skip all this and just test whether fsstate->tuplestore != NULL and if so retrieve tuples from there. If there's some reason why fsstate and cur_prev would end up being different, then there should be some comment explaining it. I'm inclined to think that would be a sign of a design flaw: I think the idea here should be to make use of the ForeignScanState object that already exists to hold the details that we need for this feature, rather than creating a new one. But, if there were a comment telling me why it's like this, I might change my mind. + /* Clear the last query details, once tuples are retrieved. */ + if (fsstate->conn_state->prev_queries == cur_prev) + { + /* + * This is the first prev query in the list, check if there + * are more + */ + if (fsstate->conn_state->num_prev_queries > 1) + + /* + * fsstate->conn_state->prev_queries->next = + * cur_prev->next; + */ + fsstate->next = cur_prev->next; + clear_prev_query_state(cur_prev); + } This code can remove an item from the prev_queries list, but only if it's the first item. Now, if my analysis above is correct and we don't even need the prev_queries list, then it doesn't matter; all this logic can basically go away. If we do need the prev_queries list, then I don't think it can be right to only be able to remove the first element of the list. There's no reason why the oldest prev_query has to finish first. What if there are three queries in the list and the middle one finishes first? Then this will just leave it in the list, whereas if the first one had finished first, it would have been deleted from the list. That kind of inconsistency doesn't seem like a good idea. + /* + * This is to fetch all the tuples of this query and save + * them in Tuple Slot. Since it is using + * PQSetChunkedRowsMode, we only get the + * fsstate->fetch_size tuples in one run, so keep on + * executing till we get NULL in PGresult i.e. all the + * tuples are retrieved. + */ + p1->tuplestore = tuplestore_begin_heap(false, true, work_mem); + p1->slot = MakeSingleTupleTableSlot(fsstate->tupdesc, &TTSOpsMinimalTuple); The slot doesn't seem to be used anywhere in the code that follows. Is there a reason to initialize it here, rather than wherever it's needed? If so, maybe the comment could mention that. + i = 0; I don't think this does anything, because you reinitialize i in the inner loop where you use it. In general, try moving your variable declarations to the innermost scope where they are needed. It makes the code more clear and allows the compiler to tell you about stuff like this. + else if (PQresultStatus(res) == PGRES_TUPLES_OK) + { + while (res != NULL) + res = pgfdw_get_result(conn); + break; + } I doubt that that this is right. It seems like it's willing to throw away an infinite number of result sets without doing anything with them, or discard an infinite number of errors without processing them, but that doesn't seem likely to be the right thing. + else if (PQresultStatus(res) == PGRES_FATAL_ERROR) + { + clear_conn_state(fsstate->conn_state); + pgfdw_report_error(res, conn, fsstate->query); + } I also doubt that this is right. If it's necessary to call clear_conn_state() before reporting this error, then the error handling in this patch is probably wrong. We don't know where errors will be thrown in general and cannot rely on being able to do manual error cleanup before an error occurs. The goal should for it to be safe for pgfdw_report_error -- or just ereport(ERROR, ...) -- to happen anywhere without causing problems in the future. clear_conn_state() doesn't look right either. It does some cleanup on each element of the prev_queries list and decrements num_prev_queries, but it doesn't actually remove the queries from the list. So after calling this function the first time, I think that num_prev_queries might end up being 0 while prev_queries is still a list of the same length as before. After that, I imagine calling clear_conn_state() a second time would result in num_prev_queries going negative. I don't really understand why num_prev_queries exists -- it seems like it's just a recipe for mistakes to have both the list itself, and a separate field that gives you the list length. If you need that, the existing List data type will do that bookkeeping for you, and then there's less opportunity for mistakes. Overall, I think the direction of the patch set has some promise, but I think it needs a lot of cleanup: removal of unnecessary code, proper formatting, moving variables to inner scopes, explanatory comments, good names for variables and functions and structure members, removal of unnecessary files from the patch, cleanup of the regression test coverage so that it doesn't add more bloat than necessary, proper choice of data structures, and so on. Right now, the good things that you've done here are being hidden by these sorts of mechanical issues. That's not just an issue for me as a reviewer: I suspect it's also blocking you, as the patch author, from finding places where the code could be made better. Being able to find such opportunities for improvement and act on them is what will get this patch from "interesting proof of concept" to "potentially committable patch". -- Robert Haas EDB: http://www.enterprisedb.com