Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Make dblink interruptible, via new libpqsrv APIs.

  2. Remove excess #include "utils/wait_event.h".

  3. dblink, postgres_fdw: Handle interrupts during connection establishment

  4. Fix old-fd issues using global barriers everywhere.

  1. dblink query interruptibility

    Noah Misch <noah@leadboat.com> — 2023-11-22T01:29:45Z

    === Background
    
    Something as simple as the following doesn't respond to cancellation.  In
    v15+, any DROP DATABASE will hang as long as it's running:
    
      SELECT dblink_exec(
        $$dbname='$$||current_database()||$$' port=$$||current_setting('port'),
        'SELECT pg_sleep(15)');
    
    https://postgr.es/m/4B584C99.8090004@enterprisedb.com proposed a fix back in
    2010.  Latches and the libpqsrv facility have changed the server programming
    environment since that patch.  The problem statement also came up here:
    
    On Thu, Dec 08, 2022 at 06:08:15PM -0800, Andres Freund wrote:
    > dblink.c uses a lot of other blocking libpq functions, which obviously also
    > isn't ok.
    
    
    === Key decisions
    
    This patch adds to libpqsrv facility.  It dutifully follows the existing
    naming scheme.  For greppability, I am favoring renaming new and old functions
    such that the libpq name is a substring of this facility's name.  That is,
    rename libpqsrv_disconnect to srvPQfinish or maybe libpqsrv_PQfinish().  Now
    is better than later, while pgxn contains no references to libpqsrv.  Does
    anyone else have a preference between naming schemes?  If nobody does, I'll
    keep today's libpqsrv_disconnect() style.
    
    I was tempted to add a timeout argument to each libpqsrv function, which would
    allow libpqsrv_get_result_last() to replace pgfdw_get_cleanup_result().  We
    can always add a timeout-accepting function later and make this thread's
    function name a thin wrapper around it.  Does anyone feel a mandatory timeout
    argument, accepting -1 for no timeout, would be the right thing?
    
    
    === Minor topics
    
    It would be nice to replace libpqrcv_PQgetResult() and friends with the new
    functions.  I refrained since they use ProcessWalRcvInterrupts(), not
    CHECK_FOR_INTERRUPTS().  Since walreceiver already reaches
    CHECK_FOR_INTERRUPTS() via libpqsrv_connect_params(), things might just work.
    
    This patch contains a PQexecParams() wrapper, called nowhere in
    postgresql.git.  It's inessential, but twelve pgxn modules do mention
    PQexecParams.  Just one mentions PQexecPrepared, and none mention PQdescribe*.
    
    The patch makes postgres_fdw adopt its functions, as part of confirming the
    functions are general enough.  postgres_fdw create_cursor() has been passing
    the "DECLARE CURSOR FOR inner_query" string for some error messages and just
    inner_query for others.  I almost standardized on the longer one, but the test
    suite checks that.  Hence, I standardized on just inner_query.
    
    I wrote this because pglogical needs these functions to cooperate with v15+
    DROP DATABASE (https://github.com/2ndQuadrant/pglogical/issues/418).
    
    Thanks,
    nm
    
  2. Re: dblink query interruptibility

    Fujii Masao <masao.fujii@gmail.com> — 2024-01-24T19:23:39Z

    On Wed, Nov 22, 2023 at 10:29 AM Noah Misch <noah@leadboat.com> wrote:
    >
    > === Background
    >
    > Something as simple as the following doesn't respond to cancellation.  In
    > v15+, any DROP DATABASE will hang as long as it's running:
    >
    >   SELECT dblink_exec(
    >     $$dbname='$$||current_database()||$$' port=$$||current_setting('port'),
    >     'SELECT pg_sleep(15)');
    >
    > https://postgr.es/m/4B584C99.8090004@enterprisedb.com proposed a fix back in
    > 2010.  Latches and the libpqsrv facility have changed the server programming
    > environment since that patch.  The problem statement also came up here:
    >
    > On Thu, Dec 08, 2022 at 06:08:15PM -0800, Andres Freund wrote:
    > > dblink.c uses a lot of other blocking libpq functions, which obviously also
    > > isn't ok.
    >
    >
    > === Key decisions
    >
    > This patch adds to libpqsrv facility.
    
    I found that this patch was committed at d3c5f37dd5 and changed the
    error message in postgres_fdw slightly. Here's an example:
    
    #1. Begin a new transaction.
    #2. Execute a query accessing to a foreign table, like SELECT * FROM
    <foreign table>
    #3. Terminate the *remote* session corresponding to the foreign table.
    #4. Commit the transaction, and then currently the following error
    message is output.
    
        ERROR:  FATAL:  terminating connection due to administrator command
        server closed the connection unexpectedly
            This probably means the server terminated abnormally
            before or while processing the request.
       invalid socket
    
    Previously, before commit d3c5f37dd5, the error message at #4 did not
    include "invalid socket." Now, after the commit, it does. Is this
    change intentional?
    
    + /* Consume whatever data is available from the socket */
    + if (PQconsumeInput(conn) == 0)
    + {
    + /* trouble; expect PQgetResult() to return NULL */
    + break;
    + }
    + }
    +
    + /* Now we can collect and return the next PGresult */
    + return PQgetResult(conn);
    
    This code appears to cause the change. When the remote session ends,
    PQconsumeInput() returns 0 and marks conn->socket as invalid.
    Subsequent PQgetResult() calls pqWait(), detecting the invalid socket
    and appending "invalid socket" to the error message.
    
    I think the "invalid socket" message is unsuitable in this scenario,
    and PQgetResult() should not be called after PQconsumeInput() returns
    0. Thought?
    
    Regards,
    
    -- 
    Fujii Masao
    
    
    
    
  3. Re: dblink query interruptibility

    Noah Misch <noah@leadboat.com> — 2024-01-24T20:45:32Z

    On Thu, Jan 25, 2024 at 04:23:39AM +0900, Fujii Masao wrote:
    > I found that this patch was committed at d3c5f37dd5 and changed the
    > error message in postgres_fdw slightly. Here's an example:
    > 
    > #1. Begin a new transaction.
    > #2. Execute a query accessing to a foreign table, like SELECT * FROM
    > <foreign table>
    > #3. Terminate the *remote* session corresponding to the foreign table.
    > #4. Commit the transaction, and then currently the following error
    > message is output.
    > 
    >     ERROR:  FATAL:  terminating connection due to administrator command
    >     server closed the connection unexpectedly
    >         This probably means the server terminated abnormally
    >         before or while processing the request.
    >    invalid socket
    > 
    > Previously, before commit d3c5f37dd5, the error message at #4 did not
    > include "invalid socket." Now, after the commit, it does. Is this
    > change intentional?
    
    No.  It's a consequence of an intentional change in libpq call sequence, but I
    was unaware I was changing an error message.
    
    > + /* Consume whatever data is available from the socket */
    > + if (PQconsumeInput(conn) == 0)
    > + {
    > + /* trouble; expect PQgetResult() to return NULL */
    > + break;
    > + }
    > + }
    > +
    > + /* Now we can collect and return the next PGresult */
    > + return PQgetResult(conn);
    > 
    > This code appears to cause the change. When the remote session ends,
    > PQconsumeInput() returns 0 and marks conn->socket as invalid.
    > Subsequent PQgetResult() calls pqWait(), detecting the invalid socket
    > and appending "invalid socket" to the error message.
    > 
    > I think the "invalid socket" message is unsuitable in this scenario,
    > and PQgetResult() should not be called after PQconsumeInput() returns
    > 0. Thought?
    
    The documentation is absolute about the necessity of PQgetResult():
    
      PQsendQuery cannot be called again (on the same connection) until
      PQgetResult has returned a null pointer, indicating that the command is
      done.
    
      PQgetResult must be called repeatedly until it returns a null pointer,
      indicating that the command is done. (If called when no command is active,
      PQgetResult will just return a null pointer at once.)
    
    Similar statements also appear in libpq-pipeline-results,
    libpq-pipeline-errors, and libpq-copy.
    
    
    So, unless the documentation or my reading of it is wrong there, I think the
    answer is something other than skipping PQgetResult().  Perhaps PQgetResult()
    should not append "invalid socket" in this case?  The extra line is a net
    negative, though it's not wrong and not awful.
    
    Thanks for reporting the change.
    
    
    
    
  4. Re: dblink query interruptibility

    Fujii Masao <masao.fujii@gmail.com> — 2024-01-25T03:28:43Z

    On Thu, Jan 25, 2024 at 5:45 AM Noah Misch <noah@leadboat.com> wrote:
    >
    > On Thu, Jan 25, 2024 at 04:23:39AM +0900, Fujii Masao wrote:
    > > I found that this patch was committed at d3c5f37dd5 and changed the
    > > error message in postgres_fdw slightly. Here's an example:
    > >
    > > #1. Begin a new transaction.
    > > #2. Execute a query accessing to a foreign table, like SELECT * FROM
    > > <foreign table>
    > > #3. Terminate the *remote* session corresponding to the foreign table.
    > > #4. Commit the transaction, and then currently the following error
    > > message is output.
    > >
    > >     ERROR:  FATAL:  terminating connection due to administrator command
    > >     server closed the connection unexpectedly
    > >         This probably means the server terminated abnormally
    > >         before or while processing the request.
    > >    invalid socket
    > >
    > > Previously, before commit d3c5f37dd5, the error message at #4 did not
    > > include "invalid socket." Now, after the commit, it does. Is this
    > > change intentional?
    >
    > No.  It's a consequence of an intentional change in libpq call sequence, but I
    > was unaware I was changing an error message.
    >
    > > + /* Consume whatever data is available from the socket */
    > > + if (PQconsumeInput(conn) == 0)
    > > + {
    > > + /* trouble; expect PQgetResult() to return NULL */
    > > + break;
    > > + }
    > > + }
    > > +
    > > + /* Now we can collect and return the next PGresult */
    > > + return PQgetResult(conn);
    > >
    > > This code appears to cause the change. When the remote session ends,
    > > PQconsumeInput() returns 0 and marks conn->socket as invalid.
    > > Subsequent PQgetResult() calls pqWait(), detecting the invalid socket
    > > and appending "invalid socket" to the error message.
    > >
    > > I think the "invalid socket" message is unsuitable in this scenario,
    > > and PQgetResult() should not be called after PQconsumeInput() returns
    > > 0. Thought?
    >
    > The documentation is absolute about the necessity of PQgetResult():
    
    The documentation looks unclear to me regarding what should be done
    when PQconsumeInput() returns 0. So I'm not sure if PQgetResult()
    must be called even in that case.
    
    As far as I read some functions like libpqrcv_PQgetResult() that use
    PQconsumeInput(), it appears that they basically report the error message
    using PQerrorMessage(), without calling PQgetResult(),
    when PQconsumeInput() returns 0.
    
    Regards,
    
    -- 
    Fujii Masao
    
    
    
    
  5. Re: dblink query interruptibility

    Noah Misch <noah@leadboat.com> — 2024-01-27T02:35:39Z

    On Thu, Jan 25, 2024 at 12:28:43PM +0900, Fujii Masao wrote:
    > On Thu, Jan 25, 2024 at 5:45 AM Noah Misch <noah@leadboat.com> wrote:
    > > On Thu, Jan 25, 2024 at 04:23:39AM +0900, Fujii Masao wrote:
    > > > I found that this patch was committed at d3c5f37dd5 and changed the
    > > > error message in postgres_fdw slightly. Here's an example:
    > > >
    > > > #1. Begin a new transaction.
    > > > #2. Execute a query accessing to a foreign table, like SELECT * FROM
    > > > <foreign table>
    > > > #3. Terminate the *remote* session corresponding to the foreign table.
    > > > #4. Commit the transaction, and then currently the following error
    > > > message is output.
    > > >
    > > >     ERROR:  FATAL:  terminating connection due to administrator command
    > > >     server closed the connection unexpectedly
    > > >         This probably means the server terminated abnormally
    > > >         before or while processing the request.
    > > >    invalid socket
    > > >
    > > > Previously, before commit d3c5f37dd5, the error message at #4 did not
    > > > include "invalid socket." Now, after the commit, it does.
    
    Other clients have witnessed the extra "invalid socket" message:
    https://dba.stackexchange.com/questions/335081/how-to-investigate-intermittent-postgres-connection-errors
    https://stackoverflow.com/questions/77781358/pgbackrest-backup-error-with-exit-code-57
    https://github.com/timescale/timescaledb/issues/102
    
    > > > + /* Consume whatever data is available from the socket */
    > > > + if (PQconsumeInput(conn) == 0)
    > > > + {
    > > > + /* trouble; expect PQgetResult() to return NULL */
    > > > + break;
    > > > + }
    > > > + }
    > > > +
    > > > + /* Now we can collect and return the next PGresult */
    > > > + return PQgetResult(conn);
    > > >
    > > > This code appears to cause the change. When the remote session ends,
    > > > PQconsumeInput() returns 0 and marks conn->socket as invalid.
    > > > Subsequent PQgetResult() calls pqWait(), detecting the invalid socket
    > > > and appending "invalid socket" to the error message.
    
    What do you think of making PQconsumeInput() set PGASYNC_READY and
    CONNECTION_BAD in this case?  Since libpq appended "server closed the
    connection unexpectedly", it knows those indicators are correct.  That way,
    PQgetResult() won't issue a pointless pqWait() call.
    
    > > > I think the "invalid socket" message is unsuitable in this scenario,
    > > > and PQgetResult() should not be called after PQconsumeInput() returns
    > > > 0. Thought?
    > >
    > > The documentation is absolute about the necessity of PQgetResult():
    > 
    > The documentation looks unclear to me regarding what should be done
    > when PQconsumeInput() returns 0. So I'm not sure if PQgetResult()
    > must be called even in that case.
    
    I agree PQconsumeInput() docs don't specify how to interpret it returning 0.
    
    > As far as I read some functions like libpqrcv_PQgetResult() that use
    > PQconsumeInput(), it appears that they basically report the error message
    > using PQerrorMessage(), without calling PQgetResult(),
    > when PQconsumeInput() returns 0.
    
    libpqrcv_PQgetResult() is part of walreceiver, where any ERROR becomes FATAL.
    Hence, it can't hurt anything by eagerly skipping to ERROR.  I designed
    libpqsrv_exec() to mimic PQexec() as closely as possible, so it would be a
    drop-in replacement for arbitrary callers.  Ideally, accepting interrupts
    would be the only caller-visible difference.
    
    I know of three ways PQconsumeInput() can return 0, along with my untested
    estimates of how they work:
    
    a. Protocol violation.  handleSyncLoss() sets PGASYNC_READY and
       CONNECTION_BAD.  PQgetResult() is optional.
    
    b. Connection broken.  PQgetResult() is optional.
    
    c. ENOMEM.  PGASYNC_ and CONNECTION_ status don't change.  Applications choose
       among (c1) free memory and retry, (c2) close the connection, or (c3) call
       PQgetResult() to break protocol sync and set PGASYNC_IDLE:
    
    Comparing PQconsumeInput() with the PQgetResult() block under "while
    (conn->asyncStatus == PGASYNC_BUSY)", there's a key difference that
    PQgetResult() sets PGASYNC_IDLE on most errors, including ENOMEM.  That
    prevents PQexec() subroutine PQexecFinish() from busy looping on ENOMEM, but I
    suspect that also silently breaks protocol sync.  While we could change it
    from (c3) to (c2) by dropping the connection via handleSyncLoss() or
    equivalent, I'm not confident about that being better.
    
    libpqsrv_exec() implements (c3) by way of calling PQgetResult() after
    PQconsumeInput() fails.  If PQisBusy(), the same ENOMEM typically will repeat,
    yielding (c3).  If memory became available in that brief time, PQgetResult()
    may instead block.  That blocking is unwanted but unimportant.
    
    
    
    
  6. Re: dblink query interruptibility

    Andreas Karlsson <andreas@proxel.se> — 2025-03-06T13:57:01Z

    On 11/22/23 2:29 AM, Noah Misch wrote:
    > Something as simple as the following doesn't respond to cancellation.  In
    > v15+, any DROP DATABASE will hang as long as it's running:
    
    Hi,
    
    One of our customers ran into this bug when upgrading from PostgreSQL 14 
    to PostgreSQL 16. Your commit[1] fixed this issue in PostgreSQL 17 but 
    the bugfix was not backported with the explanation below.
    
     > Code inspection identified the bug at least thirteen years ago, but 
    user complaints have not appeared. Hence, no back-patch for now.
    
    But that is as far as I can tell not the case because at least for 
    CREATE DATABASE the bug was introduced in a commit[2] in PostgeSQL 15. 
    And now that we actually have a user complaint what do you think about 
    backporting the fix?
    
    The patch seems small and relatively safe to backport and the functions 
    have had no bugfixes as far as I could see. And I did a quick git 
    cherry-pick myself on top of PG 16 (see attached patch) and the only 
    conflict was related to the introduction of custom wait events which was 
    easy to fix.
    
    Here is a small snippet which reproduces the bug. It hangs reliably on 
    PostgreSQL 15 and 16, but not 14, 17 or HEAD.
    
    CREATE EXTENSION dblink;
    
    CREATE FUNCTION f(text, int, text) RETURNS VOID LANGUAGE plpgsql AS $$
    BEGIN
         PERFORM dblink_connect(format('host=%s port=%s user=%s 
    dbname=postgres', $1, $2, $3));
         RAISE NOTICE 'dblink connected!';
         PERFORM dblink_exec('DROP DATABASE test');
         RAISE NOTICE 'Database dropped!';
         PERFORM dblink_disconnect();
         RAISE NOTICE 'dblink disconnected!';
    END
    $$;
    
    CREATE DATABASE test;
    SELECT f(:'HOST', :PORT, :'USER');
    
    
    Andreas
    
    1. 
    https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d3c5f37dd543498cc7c678815d3921823beec9e9
    2. 
    https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e2f65f42555ff531c6d7c8f151526b4ef7c016f8
    
  7. Re: dblink query interruptibility

    Noah Misch <noah@leadboat.com> — 2025-03-11T23:48:39Z

    On Thu, Mar 06, 2025 at 02:57:01PM +0100, Andreas Karlsson wrote:
    > On 11/22/23 2:29 AM, Noah Misch wrote:
    > > Something as simple as the following doesn't respond to cancellation.  In
    > > v15+, any DROP DATABASE will hang as long as it's running:
    
    > One of our customers ran into this bug when upgrading from PostgreSQL 14 to
    > PostgreSQL 16. Your commit[1] fixed this issue in PostgreSQL 17 but the
    > bugfix was not backported with the explanation below.
    > 
    > > Code inspection identified the bug at least thirteen years ago, but user
    > complaints have not appeared. Hence, no back-patch for now.
    > 
    > But that is as far as I can tell not the case because at least for CREATE
    > DATABASE the bug was introduced in a commit[2] in PostgeSQL 15.
    
    > 1. https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d3c5f37dd543498cc7c678815d3921823beec9e9
    > 2. https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e2f65f42555ff531c6d7c8f151526b4ef7c016f8
    
    The CREATE DATABASE hang is indeed new in v15.  The general dblink missed
    interrupt processing (e.g. pg_cancel_backend response delay) is an old bug.
    
    > And now that
    > we actually have a user complaint what do you think about backporting the
    > fix?
    
    Yes, that seems fine to do.  No PGXN module refers to libpq-be-fe-helpers.h so
    I'm unconcerned about a compatibility risk from adding it.  In the context of
    https://github.com/2ndQuadrant/pglogical/pull/454, I did test the functions
    against all versions v9.4+.
    
    Commit d3c5f37 used the new functions for postgres_fdw, not just dblink.  That
    caused message changes detailed in
    postgr.es/m/CAHGQGwGpDTXeg8K1oTmDv9nankaKTrCD-XW-tnkzo6%3DE9p5%3Duw%40mail.gmail.com
    so I'm inclined to omit postgres_fdw changes in back branches.  postgres_fdw
    was already interruptible, so the point of making postgres_fdw adopt the
    functions was to reduce code duplication.
    
    Overall, in the absence of objections, I will queue a task to back-patch the
    non-postgres_fdw portion of commit d3c5f37 to v13-v16.
    
    
    
    
  8. Re: dblink query interruptibility

    Andreas Karlsson <andreas@proxel.se> — 2025-03-12T10:03:41Z

    On 3/12/25 12:48 AM, Noah Misch wrote:
    > The CREATE DATABASE hang is indeed new in v15.  The general dblink missed
    > interrupt processing (e.g. pg_cancel_backend response delay) is an old bug.
    
    Aha, that was what you were referring to! My apologies, was reading your 
    mail a bit too quickly. :)
    
    > Commit d3c5f37 used the new functions for postgres_fdw, not just dblink.  That
    > caused message changes detailed in
    > postgr.es/m/CAHGQGwGpDTXeg8K1oTmDv9nankaKTrCD-XW-tnkzo6%3DE9p5%3Duw%40mail.gmail.com
    > so I'm inclined to omit postgres_fdw changes in back branches.  postgres_fdw
    > was already interruptible, so the point of making postgres_fdw adopt the
    > functions was to reduce code duplication.
    
    Makes sense.
    
    > Overall, in the absence of objections, I will queue a task to back-patch the
    > non-postgres_fdw portion of commit d3c5f37 to v13-v16.
    
    Thanks!
    
    Andreas
    
    
    
    
  9. Re: dblink query interruptibility

    Noah Misch <noah@leadboat.com> — 2025-04-03T21:29:34Z

    On Wed, Mar 12, 2025 at 11:03:41AM +0100, Andreas Karlsson wrote:
    > On 3/12/25 12:48 AM, Noah Misch wrote:
    > > Overall, in the absence of objections, I will queue a task to back-patch the
    > > non-postgres_fdw portion of commit d3c5f37 to v13-v16.
    
    Pushed (e.g. v16 has commit 82a8f0f).  Only v16 had libpq-be-fe-helpers.h at
    all, so I also back-patched 28a5917 to add it.  The original use case for
    libpq-be-fe-helpers.h was interrupting PQconnectdbParams(), commit e460248.  I
    decided not to back-patch that one, since connection-time delays are often
    limited in ways query runtime is not.  We could change that decision.
    
    
    
    
  10. Re: dblink query interruptibility

    Andreas Karlsson <andreas@proxel.se> — 2025-04-05T22:14:52Z

    On 4/3/25 11:29 PM, Noah Misch wrote:
    > Pushed (e.g. v16 has commit 82a8f0f).  Only v16 had libpq-be-fe-helpers.h at
    > all, so I also back-patched 28a5917 to add it.  The original use case for
    > libpq-be-fe-helpers.h was interrupting PQconnectdbParams(), commit e460248.  I
    > decided not to back-patch that one, since connection-time delays are often
    > limited in ways query runtime is not.  We could change that decision.
    
    Thanks!
    
    I am fine with that since there seem to be few real world complaints.
    
    Andreas