Thread

  1. Re: Bypassing cursors in postgres_fdw to enable parallel plans

    Robert Haas <robertmhaas@gmail.com> — 2025-12-10T17:43:48Z

    On Thu, Nov 27, 2025 at 5:50 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
    > Thanks Kenan for these. So, it looks like the patch performs the same as in the local scan case. I wonder if you found any case of performance degradation with the patch.
    >
    > Per an off-list discussion with Robert, he suggested using the existing data structures for recording the state of last queries instead of inventing something new.
    > Makes sense, so I reworked the patch to include tuplestore in PgFdwScanState and then use PgFdwScanState as part of PgFdwConnState to keep track of previously
    > active cursors. Nothing else is changed in this version of the patch.
    
    Thanks for your continued work on this topic.
    
    As we discussed off-list, the regression tests in this patch need
    quite a bit more work. We shouldn't just repeat large numbers of tests
    -- we should repeat enough to provide meaningful test coverage of the
    new feature, and maybe add a few new ones that specifically target
    scenarios that the patch is intended to cover. One somewhat bizarre
    thing that I noticed scrolling through the regression test outputs is
    this:
    
    +-- Test with non-cursor mode
    +SET postgres_fdw.use_cursor = false;
    +SET postgres_fdw.use_cursor = true;
    
    This happens in multiple places in the regression tests, and doesn't
    really make any sense to me, because changing the GUC from to false,
    doing nothing, and then changing it back to true doesn't seem like a
    useful test scenario, and definitely doesn't seem like a test scenario
    that we need to repeat multiple times. Honestly, I wonder how this
    happened. Did you maybe run a script over the .sql files to generate
    the new versions and then not check what the output actually looked
    like? I really can't stress enough the need to be thoughtful about
    constructing test cases for a patch of this type.
    
    A related problem is that you've included 12,721 lines of useless
    output in the patch file, because somehow postgres_fdw.out.orig got
    checked into the repository. It's always a good idea, when posting a
    patch to the list, to check the diffstat to make sure that there's
    nothing in there that you don't expect to see. The presence of this
    line just below your commit message could have alerted you to this
    problem:
    
     .../expected/postgres_fdw.out.orig            | 12721 ++++++++++++++++
    
    Ideally, I really recommend scrolling through the patch, not just the
    diffstat, to make sure that everything is the way you want it to be,
    before giving it to others to look at.
    
    There are a number of other cosmetic problems with this patch that
    make it hard to review the actual code changes. For instance:
    
    + /* To be used only in non-cursor mode */
    + Tuplestorestate *tuplestore;
    + TupleTableSlot *slot;
    + bool tuples_ready;
    + struct PgFdwScanState *next;
    
    The comment is good, but it only explains a general fact about all
    four of these fields. It doesn't explain specifically what each of
    these fields is intended to do. Naming a field very generally, like
    "next", when it must have some very specific purpose that is not
    documented, makes it really hard for someone reading the code -- in
    this case, me -- to understand what the point is. So, these fields
    should probably have comments, but also, some of them probably need to
    be renamed. Maybe "tuplestore" and "slot" are OK but "next" is not
    going to be good enough.
    
    +/* Fill the Tupleslot when another query needs to execute. */
    +static void
    +fillTupleSlot(PgFdwScanState *fsstate, ForeignScanState *node)
    
    I think you're filling a tuple store, not a tuple slot.
    
    + initStringInfo(&buf);
    
    What seems to happen here is that you create an empty buffer, add
    fsstate->query to it and nothing else, and then use buf.data. So why
    not just get rid of buf and use fsstate->query directly?
    
    + /* Always fetch the last prev_query */
    + for (; p1->next != NULL; p1 = p1->next);
    
    There are multiple problems here. First, this code is not per
    PostgreSQL style and would be reindented by pgindent, which you should
    make a habit of running before posting. Second, p1 is not a
    particularly informative or understandable variable name. Third, why
    are we using a linked list if the value we're going to need is at the
    end of the list? If we need to be able to access the last element of
    the list efficiently, maybe we should be keeping the list in reverse
    order, or maybe we should be using a List which permits efficient
    random access.
    
    But looking quickly through the patch, I have an even bigger question
    here. It doesn't really seem like we ever care about any element of
    the list other than the last. It looks like we go to a fair amount of
    trouble to maintain the list at various points in the code, but it
    seems like when we access the list, we just always go to the end. That
    makes me wonder why we even need a list. Perhaps instead of
    PgFdwConnState storing a pointer to "prev_queries", it should just
    store a pointer to the PgFdwScanState for the query that is currently
    "using" the connection, if there is one. That is, the query whose
    result set is currently being received, and which must be buffered
    into a tuplestore before using the connection for something else. When
    we've done that, we reset the field to NULL, and we don't maintain any
    list of the older "previous queries". Maybe there's some problem with
    that idea, but that gets back to my earlier point about comments: the
    comment for "next" (or however it got renamed) ought to be explaining
    something about why it exists and what it's for. Without such a
    comment, the only ways for me to be sure whether we really need "next"
    is to either (a) ask you or (b) spend a long time staring at the code
    to try to figure it out or (c) try removing it and see if it works.
    
    Ah, wait! I just found some code that cares about the list but not
    just about the last element:
    
    + for (; p1->next != NULL; p1 = p1->next)
    + /* find the correct prev_query */
    + {
    + if ((p1->tuples_ready && fsstate->cursor_number == p1->cursor_number))
    + cur_prev = p1;
    + }
    
    But I'm still confused. Why do we need to iterate through the
    prev_queries list to find the correct ForeignScanState? Isn't that
    exactly what fsstate is here? I'm inclined to think that this is just
    a complicated way of setting p1 = fsstate and then setting cur_prev =
    p1, so we could skip all this and just test whether
    fsstate->tuplestore != NULL and if so retrieve tuples from there. If
    there's some reason why fsstate and cur_prev would end up being
    different, then there should be some comment explaining it. I'm
    inclined to think that would be a sign of a design flaw: I think the
    idea here should be to make use of the ForeignScanState object that
    already exists to hold the details that we need for this feature,
    rather than creating a new one. But, if there were a comment telling
    me why it's like this, I might change my mind.
    
    + /* Clear the last query details, once tuples are retrieved. */
    + if (fsstate->conn_state->prev_queries == cur_prev)
    + {
    + /*
    + * This is the first prev query in the list, check if there
    + * are more
    + */
    + if (fsstate->conn_state->num_prev_queries > 1)
    +
    + /*
    + * fsstate->conn_state->prev_queries->next =
    + * cur_prev->next;
    + */
    + fsstate->next = cur_prev->next;
    + clear_prev_query_state(cur_prev);
    + }
    
    This code can remove an item from the prev_queries list, but only if
    it's the first item. Now, if my analysis above is correct and we don't
    even need the prev_queries list, then it doesn't matter; all this
    logic can basically go away. If we do need the prev_queries list, then
    I don't think it can be right to only be able to remove the first
    element of the list. There's no reason why the oldest prev_query has
    to finish first. What if there are three queries in the list and the
    middle one finishes first? Then this will just leave it in the list,
    whereas if the first one had finished first, it would have been
    deleted from the list. That kind of inconsistency doesn't seem like a
    good idea.
    
    + /*
    + * This is to fetch all the tuples of this query and save
    + * them in Tuple Slot. Since it is using
    + * PQSetChunkedRowsMode, we only get the
    + * fsstate->fetch_size tuples in one run, so keep on
    + * executing till we get NULL in PGresult i.e. all the
    + * tuples are retrieved.
    + */
    + p1->tuplestore = tuplestore_begin_heap(false, true, work_mem);
    + p1->slot = MakeSingleTupleTableSlot(fsstate->tupdesc, &TTSOpsMinimalTuple);
    
    The slot doesn't seem to be used anywhere in the code that follows. Is
    there a reason to initialize it here, rather than wherever it's
    needed? If so, maybe the comment could mention that.
    
    + i = 0;
    
    I don't think this does anything, because you reinitialize i in the
    inner loop where you use it. In general, try moving your variable
    declarations to the innermost scope where they are needed. It makes
    the code more clear and allows the compiler to tell you about stuff
    like this.
    
    + else if (PQresultStatus(res) == PGRES_TUPLES_OK)
    + {
    + while (res != NULL)
    + res = pgfdw_get_result(conn);
    + break;
    + }
    
    I doubt that that this is right. It seems like it's willing to throw
    away an infinite number of result sets without doing anything with
    them, or discard an infinite number of errors without processing them,
    but that doesn't seem likely to be the right thing.
    
    + else if (PQresultStatus(res) == PGRES_FATAL_ERROR)
    + {
    + clear_conn_state(fsstate->conn_state);
    + pgfdw_report_error(res, conn, fsstate->query);
    + }
    
    I also doubt that this is right. If it's necessary to call
    clear_conn_state() before reporting this error, then the error
    handling in this patch is probably wrong. We don't know where errors
    will be thrown in general and cannot rely on being able to do manual
    error cleanup before an error occurs. The goal should for it to be
    safe for pgfdw_report_error -- or just ereport(ERROR, ...) -- to
    happen anywhere without causing problems in the future.
    
    clear_conn_state() doesn't look right either. It does some cleanup on
    each element of the prev_queries list and decrements num_prev_queries,
    but it doesn't actually remove the queries from the list. So after
    calling this function the first time, I think that num_prev_queries
    might end up being 0 while prev_queries is still a list of the same
    length as before. After that, I imagine calling clear_conn_state() a
    second time would result in num_prev_queries going negative. I don't
    really understand why num_prev_queries exists -- it seems like it's
    just a recipe for mistakes to have both the list itself, and a
    separate field that gives you the list length. If you need that, the
    existing List data type will do that bookkeeping for you, and then
    there's less opportunity for mistakes.
    
    Overall, I think the direction of the patch set has some promise, but
    I think it needs a lot of cleanup: removal of unnecessary code, proper
    formatting, moving variables to inner scopes, explanatory comments,
    good names for variables and functions and structure members, removal
    of unnecessary files from the patch, cleanup of the regression test
    coverage so that it doesn't add more bloat than necessary, proper
    choice of data structures, and so on. Right now, the good things that
    you've done here are being hidden by these sorts of mechanical issues.
    That's not just an issue for me as a reviewer: I suspect it's also
    blocking you, as the patch author, from finding places where the code
    could be made better. Being able to find such opportunities for
    improvement and act on them is what will get this patch from
    "interesting proof of concept" to "potentially committable patch".
    
    -- 
    Robert Haas
    EDB: http://www.enterprisedb.com