Thread

Re: Bypassing cursors in postgres_fdw to enable parallel plans

Robert Haas <robertmhaas@gmail.com> — 2025-12-10T17:43:48Z

On Thu, Nov 27, 2025 at 5:50 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> Thanks Kenan for these. So, it looks like the patch performs the same as in the local scan case. I wonder if you found any case of performance degradation with the patch.
>
> Per an off-list discussion with Robert, he suggested using the existing data structures for recording the state of last queries instead of inventing something new.
> Makes sense, so I reworked the patch to include tuplestore in PgFdwScanState and then use PgFdwScanState as part of PgFdwConnState to keep track of previously
> active cursors. Nothing else is changed in this version of the patch.

Thanks for your continued work on this topic.

As we discussed off-list, the regression tests in this patch need
quite a bit more work. We shouldn't just repeat large numbers of tests
-- we should repeat enough to provide meaningful test coverage of the
new feature, and maybe add a few new ones that specifically target
scenarios that the patch is intended to cover. One somewhat bizarre
thing that I noticed scrolling through the regression test outputs is
this:

+-- Test with non-cursor mode
+SET postgres_fdw.use_cursor = false;
+SET postgres_fdw.use_cursor = true;

This happens in multiple places in the regression tests, and doesn't
really make any sense to me, because changing the GUC from to false,
doing nothing, and then changing it back to true doesn't seem like a
useful test scenario, and definitely doesn't seem like a test scenario
that we need to repeat multiple times. Honestly, I wonder how this
happened. Did you maybe run a script over the .sql files to generate
the new versions and then not check what the output actually looked
like? I really can't stress enough the need to be thoughtful about
constructing test cases for a patch of this type.

A related problem is that you've included 12,721 lines of useless
output in the patch file, because somehow postgres_fdw.out.orig got
checked into the repository. It's always a good idea, when posting a
patch to the list, to check the diffstat to make sure that there's
nothing in there that you don't expect to see. The presence of this
line just below your commit message could have alerted you to this
problem:

.../expected/postgres_fdw.out.orig | 12721 ++++++++++++++++

Ideally, I really recommend scrolling through the patch, not just the
diffstat, to make sure that everything is the way you want it to be,
before giving it to others to look at.

There are a number of other cosmetic problems with this patch that
make it hard to review the actual code changes. For instance:

+ /* To be used only in non-cursor mode */
+ Tuplestorestate *tuplestore;
+ TupleTableSlot *slot;
+ bool tuples_ready;
+ struct PgFdwScanState *next;

The comment is good, but it only explains a general fact about all
four of these fields. It doesn't explain specifically what each of
these fields is intended to do. Naming a field very generally, like
"next", when it must have some very specific purpose that is not
documented, makes it really hard for someone reading the code -- in
this case, me -- to understand what the point is. So, these fields
should probably have comments, but also, some of them probably need to
be renamed. Maybe "tuplestore" and "slot" are OK but "next" is not
going to be good enough.

+/* Fill the Tupleslot when another query needs to execute. */
+static void
+fillTupleSlot(PgFdwScanState *fsstate, ForeignScanState *node)

I think you're filling a tuple store, not a tuple slot.

+ initStringInfo(&buf);

What seems to happen here is that you create an empty buffer, add
fsstate->query to it and nothing else, and then use buf.data. So why
not just get rid of buf and use fsstate->query directly?

+ /* Always fetch the last prev_query */
+ for (; p1->next != NULL; p1 = p1->next);

There are multiple problems here. First, this code is not per
PostgreSQL style and would be reindented by pgindent, which you should
make a habit of running before posting. Second, p1 is not a
particularly informative or understandable variable name. Third, why
are we using a linked list if the value we're going to need is at the
end of the list? If we need to be able to access the last element of
the list efficiently, maybe we should be keeping the list in reverse
order, or maybe we should be using a List which permits efficient
random access.

But looking quickly through the patch, I have an even bigger question
here. It doesn't really seem like we ever care about any element of
the list other than the last. It looks like we go to a fair amount of
trouble to maintain the list at various points in the code, but it
seems like when we access the list, we just always go to the end. That
makes me wonder why we even need a list. Perhaps instead of
PgFdwConnState storing a pointer to "prev_queries", it should just
store a pointer to the PgFdwScanState for the query that is currently
"using" the connection, if there is one. That is, the query whose
result set is currently being received, and which must be buffered
into a tuplestore before using the connection for something else. When
we've done that, we reset the field to NULL, and we don't maintain any
list of the older "previous queries". Maybe there's some problem with
that idea, but that gets back to my earlier point about comments: the
comment for "next" (or however it got renamed) ought to be explaining
something about why it exists and what it's for. Without such a
comment, the only ways for me to be sure whether we really need "next"
is to either (a) ask you or (b) spend a long time staring at the code
to try to figure it out or (c) try removing it and see if it works.

Ah, wait! I just found some code that cares about the list but not
just about the last element:

+ for (; p1->next != NULL; p1 = p1->next)
+ /* find the correct prev_query */
+ {
+ if ((p1->tuples_ready && fsstate->cursor_number == p1->cursor_number))
+ cur_prev = p1;
+ }

But I'm still confused. Why do we need to iterate through the
prev_queries list to find the correct ForeignScanState? Isn't that
exactly what fsstate is here? I'm inclined to think that this is just
a complicated way of setting p1 = fsstate and then setting cur_prev =
p1, so we could skip all this and just test whether
fsstate->tuplestore != NULL and if so retrieve tuples from there. If
there's some reason why fsstate and cur_prev would end up being
different, then there should be some comment explaining it. I'm
inclined to think that would be a sign of a design flaw: I think the
idea here should be to make use of the ForeignScanState object that
already exists to hold the details that we need for this feature,
rather than creating a new one. But, if there were a comment telling
me why it's like this, I might change my mind.

+ /* Clear the last query details, once tuples are retrieved. */
+ if (fsstate->conn_state->prev_queries == cur_prev)
+ {
+ /*
+ * This is the first prev query in the list, check if there
+ * are more
+ */
+ if (fsstate->conn_state->num_prev_queries > 1)
+
+ /*
+ * fsstate->conn_state->prev_queries->next =
+ * cur_prev->next;
+ */
+ fsstate->next = cur_prev->next;
+ clear_prev_query_state(cur_prev);
+ }

This code can remove an item from the prev_queries list, but only if
it's the first item. Now, if my analysis above is correct and we don't
even need the prev_queries list, then it doesn't matter; all this
logic can basically go away. If we do need the prev_queries list, then
I don't think it can be right to only be able to remove the first
element of the list. There's no reason why the oldest prev_query has
to finish first. What if there are three queries in the list and the
middle one finishes first? Then this will just leave it in the list,
whereas if the first one had finished first, it would have been
deleted from the list. That kind of inconsistency doesn't seem like a
good idea.

+ /*
+ * This is to fetch all the tuples of this query and save
+ * them in Tuple Slot. Since it is using
+ * PQSetChunkedRowsMode, we only get the
+ * fsstate->fetch_size tuples in one run, so keep on
+ * executing till we get NULL in PGresult i.e. all the
+ * tuples are retrieved.
+ */
+ p1->tuplestore = tuplestore_begin_heap(false, true, work_mem);
+ p1->slot = MakeSingleTupleTableSlot(fsstate->tupdesc, &TTSOpsMinimalTuple);

The slot doesn't seem to be used anywhere in the code that follows. Is
there a reason to initialize it here, rather than wherever it's
needed? If so, maybe the comment could mention that.

+ i = 0;

I don't think this does anything, because you reinitialize i in the
inner loop where you use it. In general, try moving your variable
declarations to the innermost scope where they are needed. It makes
the code more clear and allows the compiler to tell you about stuff
like this.

+ else if (PQresultStatus(res) == PGRES_TUPLES_OK)
+ {
+ while (res != NULL)
+ res = pgfdw_get_result(conn);
+ break;
+ }

I doubt that that this is right. It seems like it's willing to throw
away an infinite number of result sets without doing anything with
them, or discard an infinite number of errors without processing them,
but that doesn't seem likely to be the right thing.

+ else if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ {
+ clear_conn_state(fsstate->conn_state);
+ pgfdw_report_error(res, conn, fsstate->query);
+ }

I also doubt that this is right. If it's necessary to call
clear_conn_state() before reporting this error, then the error
handling in this patch is probably wrong. We don't know where errors
will be thrown in general and cannot rely on being able to do manual
error cleanup before an error occurs. The goal should for it to be
safe for pgfdw_report_error -- or just ereport(ERROR, ...) -- to
happen anywhere without causing problems in the future.

clear_conn_state() doesn't look right either. It does some cleanup on
each element of the prev_queries list and decrements num_prev_queries,
but it doesn't actually remove the queries from the list. So after
calling this function the first time, I think that num_prev_queries
might end up being 0 while prev_queries is still a list of the same
length as before. After that, I imagine calling clear_conn_state() a
second time would result in num_prev_queries going negative. I don't
really understand why num_prev_queries exists -- it seems like it's
just a recipe for mistakes to have both the list itself, and a
separate field that gives you the list length. If you need that, the
existing List data type will do that bookkeeping for you, and then
there's less opportunity for mistakes.

Overall, I think the direction of the patch set has some promise, but
I think it needs a lot of cleanup: removal of unnecessary code, proper
formatting, moving variables to inner scopes, explanatory comments,
good names for variables and functions and structure members, removal
of unnecessary files from the patch, cleanup of the regression test
coverage so that it doesn't add more bloat than necessary, proper
choice of data structures, and so on. Right now, the good things that
you've done here are being hidden by these sorts of mechanical issues.
That's not just an issue for me as a reviewer: I suspect it's also
blocking you, as the patch author, from finding places where the code
could be made better. Being able to find such opportunities for
improvement and act on them is what will get this patch from
"interesting proof of concept" to "potentially committable patch".

--
Robert Haas
EDB: http://www.enterprisedb.com