Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Flush logical mapping files with fd opened for read/write at checkpoint

  2. Use a fd opened for read/write when syncing slots during startup, take 2.

  3. Tighten use of OpenTransientFile and CloseTransientFile

  4. Use a fd opened for read/write when syncing slots during startup.

  1. BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

    PG Bug reporting form <noreply@postgresql.org> — 2019-10-04T19:28:28Z

    The following bug has been logged on the website:
    
    Bug reference:      16039
    Logged by:          Hans Buschmann
    Email address:      buschmann@nidsa.net
    PostgreSQL version: 12.0
    Operating system:   Windows Server 2019 64bit
    Description:        
    
    We just moved our production cluster from pg 11.5 to pg 12.0 under windows
    using pg_dump/initdb/pg_restore.
    
    When we activated the replication slots by  
    
    SELECT * FROM pg_create_physical_replication_slot('sam_repli_3');
    
    and tried restarting the server, we got a PANIC in error log:
    
    CPS PRD 2019-10-04 19:10:07 CEST  00000  1:> LOG:  database system was shut
    down at 2019-10-04 19:10:02 CEST
    CPS PRD 2019-10-04 19:10:07 CEST  XX000  2:> PANIC:  could not fsync file
    "pg_replslot/sam_repli_3/state": Bad file descriptor
    CPS PRD 2019-10-04 19:10:07 CEST  00000  6:> LOG:  startup process (PID
    4028) was terminated by exception 0xC0000409
    CPS PRD 2019-10-04 19:10:07 CEST  00000  7:> HINT:  See C include file
    "ntstatus.h" for a description of the hexadecimal value.
    CPS PRD 2019-10-04 19:10:07 CEST  00000  8:> LOG:  aborting startup due to
    startup process failure
    CPS PRD 2019-10-04 19:10:07 CEST  00000  9:> LOG:  database system is shut
    down
    
    We use the EDB distribution from the website under Windows Server 2019
    (September 2019 patch level).
    
    select version ();
                              version
    ------------------------------------------------------------
     PostgreSQL 12.0, compiled by Visual C++ build 1914, 64-bit
    (1 Zeile)
    
    This seems to me like a fatal bug which makes the streaming replication
    unusable under Windows x64 /pg12.
    
    The same configuration worked flawlessly under pg 11.x until pg 11.5
    
    By searching on google we encountered a similar error from 2015 under pg
    9.4.1 reported under BUG #13287:
    
    https://www.postgresql.org/message-id/flat/20150514105514.2691.67352%40wrigleys.postgresql.org
    
    We were able to remove the replication slot by deleting the directory
    pg_replslot/sam_repli_3.
    
    The server seems to be working fine now, but without streaming replication
    of course.
    
    
    BTW: Is it necessary to restart the cluster after creating the replication
    slot or is it a consequence of the error, that the 
    newly created slot is not active automatically?
    
    select * from pg_replication_slots;
      slot_name  | plugin | slot_type | datoid | database | temporary | active |
    active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
    -------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
     sam_repli_3 |        | physical  |        |          | f         | f      |
               |      |              |             |
     sam_repli_5 |        | physical  |        |          | f         | f      |
               |      |              |             |
    (2 Zeilen)
    
    
    Many thanks!
    
    
    Hans Buschmann
    
    
  2. Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

    Andres Freund <andres@anarazel.de> — 2019-10-04T20:06:05Z

    Hi,
    
    Thanks for the report!
    
    On 2019-10-04 19:28:28 +0000, PG Bug reporting form wrote:
    > We just moved our production cluster from pg 11.5 to pg 12.0 under windows
    > using pg_dump/initdb/pg_restore.
    >
    > When we activated the replication slots by
    >
    > SELECT * FROM pg_create_physical_replication_slot('sam_repli_3');
    >
    > and tried restarting the server, we got a PANIC in error log:
    >
    > CPS PRD 2019-10-04 19:10:07 CEST  00000  1:> LOG:  database system was shut
    > down at 2019-10-04 19:10:02 CEST
    > CPS PRD 2019-10-04 19:10:07 CEST  XX000  2:> PANIC:  could not fsync file
    > "pg_replslot/sam_repli_3/state": Bad file descriptor
    > CPS PRD 2019-10-04 19:10:07 CEST  00000  6:> LOG:  startup process (PID
    > 4028) was terminated by exception 0xC0000409
    > CPS PRD 2019-10-04 19:10:07 CEST  00000  7:> HINT:  See C include file
    > "ntstatus.h" for a description of the hexadecimal value.
    > CPS PRD 2019-10-04 19:10:07 CEST  00000  8:> LOG:  aborting startup due to
    > startup process failure
    > CPS PRD 2019-10-04 19:10:07 CEST  00000  9:> LOG:  database system is shut
    > down
    >
    > We use the EDB distribution from the website under Windows Server 2019
    > (September 2019 patch level).
    >
    > select version ();
    >                           version
    > ------------------------------------------------------------
    >  PostgreSQL 12.0, compiled by Visual C++ build 1914, 64-bit
    > (1 Zeile)
    >
    > This seems to me like a fatal bug which makes the streaming replication
    > unusable under Windows x64 /pg12.
    >
    > The same configuration worked flawlessly under pg 11.x until pg 11.5
    >
    > By searching on google we encountered a similar error from 2015 under pg
    > 9.4.1 reported under BUG #13287:
    >
    > https://www.postgresql.org/message-id/flat/20150514105514.2691.67352%40wrigleys.postgresql.org
    
    Uh, Michael? You just reintroduced this bug in
    
    commit 82a5649fb9dbef12d04cd24799be6bf298d889a6
    Author: Michael Paquier <michael@paquier.xyz>
    Date:   2019-03-09 08:50:55 +0900
    
        Tighten use of OpenTransientFile and CloseTransientFile
    
        This fixes two sets of issues related to the use of transient files in
        the backend:
        1) OpenTransientFile() has been used in some code paths with read-write
        flags while read-only is sufficient, so switch those calls to be
        read-only where necessary.  These have been reported by Joe Conway.
    
    You pretty much entirely reverted:
    
    commit dfbaed459754e71e01bb0cc90a12802bba3f9786
    Author: Andres Freund <andres@anarazel.de>
    Date:   2015-04-28 00:12:38 +0200
    
        Use a fd opened for read/write when syncing slots during startup.
    
        Some operating systems, including the reporter's windows, return EBADFD
        or similar when fsync() is invoked on a O_RDONLY file descriptor.
        Unfortunately RestoreSlotFromDisk() does exactly that; which causes
        failures after restarts in at least some scenarios.
    
        If you hit the bug the error message will be something like
        ERROR: could not fsync file "pg_replslot/$name/state": Bad file descriptor
    
        Simply use O_RDWR instead of O_RDONLY when opening the relevant file
        descriptor to fix the bug.  Unfortunately I have no way of verifying the
        fix, but we've seen similar problems in the past.
    
        This bug goes back to 9.4 where slots were introduced. Backpatch
        accordingly.
    
        Reported-By: Patrice Drolet
        Bug: #13143:
        Discussion: 20150424101006.2556.60897@wrigleys.postgresql.org
    
    I realize I perhaps should have added a comment explaining this, but
    this is far from the only location that knows we have to know open fds
    r/w to be able to fsync them.
    
    What were you even trying to fix by changing this?
    
    
    Seems also pretty clear that we need a few animals running with fsync
    enabled. Not sure how we best can write test infrastructure to make it
    easy to set that for all tests. Guess I best start a thread about it on
    -hackers.
    
    Greetings,
    
    Andres Freund
    
    
    
    
  3. Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

    Andres Freund <andres@anarazel.de> — 2019-10-04T20:49:29Z

    Hi Hans,
    
    On 2019-10-04 13:06:05 -0700, Andres Freund wrote:
    > On 2019-10-04 19:28:28 +0000, PG Bug reporting form wrote:
    > > CPS PRD 2019-10-04 19:10:07 CEST  XX000  2:> PANIC:  could not fsync file
    > > "pg_replslot/sam_repli_3/state": Bad file descriptor
    > > CPS PRD 2019-10-04 19:10:07 CEST  00000  6:> LOG:  startup process (PID
    > > 4028) was terminated by exception 0xC0000409
    
    Thanks again for the report! I've committed the fix. Unfortunately,
    unless you can compile yourself, you'll have to wait until the next
    minor release for the fix.  Given this is the .0 release of 12, and that
    we already have a number of bugs to fix, I'm sure that won't be too far
    away however.
    
    If you want to work around the problem, you can, but it's not entirely
    unproblematic: If, just when restarting, you set fsync to false, you'll
    avoid the problem. But then you'd have to immediately re-enable
    afterwards, and there's still a small data loss window.  If however all
    you want is to continue testing, fsync=off might be a reasonable
    workaround.
    
    Greetings,
    
    Andres Freund
    
    
    
    
  4. AW: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

    Hans Buschmann <buschmann@nidsa.net> — 2019-10-05T07:40:00Z

    ________________________________________
    Von: Andres Freund <andres@anarazel.de>
    Gesendet: Freitag, 4. Oktober 2019 22:49
    An: Hans Buschmann; pgsql-bugs@lists.postgresql.org; Michael Paquier
    Betreff: Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows
    
    >If you want to work around the problem, you can, but it's not entirely
    >unproblematic: If, just when restarting, you set fsync to false, you'll
    >avoid the problem. But then you'd have to immediately re-enable
    >afterwards, and there's still a small data loss window.  If however all
    >you want is to continue testing, fsync=off might be a reasonable
    >workaround.
    
    Hi Andres,
    
    Thanks for the quick fix. I reported it immediately to get it soon fixed before 12.1.
    
    In the meantime we continue to use pg 12 on the production System (quite confident in overall Software Quality!).
    
    I can't reenable it because there may be unexpected restarts (power failures for example) and then the unattended restart would not succeed.
    
    In quite a Long journey I have got deeper Knowledge of the whole System. So now I am able to recompile from source BUT:
    
    It is VERY difficult to get the whole Environment under Windows!
    
    I use Visual Studio Express 2019 (Version 16.3 from Oct 2019) and got the core (source tarball) compiled without error.
    
    But I have not managed to get all the additional packages downloaded and adopted to the most recent Visual Studio (e.g zlib, XML, ssl etc.)
    
    The Manual only states the Locations for download, but doesn't help in compiling.
    
    It would be VERY helpfull to provide a .ZIP file for download on the Postgres Website containing all the additional packages in a usable Directory tree and some config file for compiling. This (mostly) should exist already for the buildfarm or for Building the release.
    
    Because the upstream Software doesn't Change often, it would (as quick alternative) be sufficient to have the include files and the compiled libraries to easily compile Postgres from source.
    
    Thanks for considering this proposal!
    
    Hans Buschmann
    
    PS Uppercase Problems are from the mail System!
    
    
    
    
    
    
  5. Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

    Michael Paquier <michael@paquier.xyz> — 2019-10-06T04:55:48Z

    On Fri, Oct 04, 2019 at 01:06:05PM -0700, Andres Freund wrote:
    > I realize I perhaps should have added a comment explaining this, but
    > this is far from the only location that knows we have to know open fds
    > r/w to be able to fsync them.
    
    Sorry for the late reply here.  It looks like I messed up here, my
    apologies for that.  And thanks for fixing the issue.
    
    It would have been nice to add some sanity checks based on fcntl() but
    directory handling in pg_fsync() makes that annoying.  Anyway, I have
    checked the code with a little trick, and I have spotted a second bug:
    CheckPointLogicalRewriteHeap() fsyncs a logical rewrite mapping file
    with RDONLY.  This is incorrect since b89e151.
    
    > What were you even trying to fix by changing this?
    
    Hardening of the code.  Some code paths clearly relied on the
    operations to be read-only.
    
    > Seems also pretty clear that we need a few animals running with fsync
    > enabled. Not sure how we best can write test infrastructure to make it
    > easy to set that for all tests. Guess I best start a thread about it on
    > -hackers.
    
    I think that we would need more infrastructure here for TAP tests, aka
    how to be able to enforce some parameters when setting the
    configuration of a new node.
    
    Attached are two patches: the actual bug fix and an extra patch with
    the trick I have used to find it out (contrib/test_decoding/ was the
    part which has blown up).
    --
    Michael
    
  6. Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

    Michael Paquier <michael@paquier.xyz> — 2019-10-06T05:00:11Z

    On Sat, Oct 05, 2019 at 07:40:00AM +0000, Hans Buschmann wrote:
    > Thanks for the quick fix. I reported it immediately to get it soon
    > fixed before 12.1.
    
    Sorry that you got annoyed by that :/
    You can blame me about it..
    
    > But I have not managed to get all the additional packages downloaded
    > and adopted to the most recent Visual Studio (e.g zlib, XML, ssl
    > etc.)
    > 
    > The Manual only states the Locations for download, but doesn't help
    > in compiling.
    
    There is documentation on the matter:
    https://www.postgresql.org/docs/devel/install-windows-full.html#id-1.6.4.8.10
    
    > It would be VERY helpful to provide a .ZIP file for download on the
    > Postgres Website containing all the additional packages in a usable
    > Directory tree and some config file for compiling. This (mostly)
    > should exist already for the buildfarm or for Building the release.
    
    I think that this would be difficult, per licensing issues and such,
    no?  The maintenance of such a tarball would be annoying as well, for
    example OpenSSL version changes quite often and needs to be bundled
    with the installer so as DLLs loading it are able to work.
    --
    Michael
    
  7. Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

    Michael Paquier <michael@paquier.xyz> — 2019-10-08T00:32:40Z

    On Sun, Oct 06, 2019 at 01:55:48PM +0900, Michael Paquier wrote:
    > It would have been nice to add some sanity checks based on fcntl() but
    > directory handling in pg_fsync() makes that annoying.  Anyway, I have
    > checked the code with a little trick, and I have spotted a second bug:
    > CheckPointLogicalRewriteHeap() fsyncs a logical rewrite mapping file
    > with RDONLY.  This is incorrect since b89e151.
    
    Andres, others, any thoughts about this issue?  Are there any
    objections if I just fix it?
    --
    Michael
    
  8. Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

    Andres Freund <andres@anarazel.de> — 2019-10-08T16:09:53Z

    Hi,
    
    On 2019-10-08 09:32:40 +0900, Michael Paquier wrote:
    > On Sun, Oct 06, 2019 at 01:55:48PM +0900, Michael Paquier wrote:
    > > It would have been nice to add some sanity checks based on fcntl() but
    > > directory handling in pg_fsync() makes that annoying.
    
    I wondered about adding something like that too. Not sure what you mean
    by directory handling problems? Couldn't that just be solved by doing an
    fstat()?
    
    
    > > Anyway, I have checked the code with a little trick, and I have
    > > spotted a second bug: CheckPointLogicalRewriteHeap() fsyncs a
    > > logical rewrite mapping file with RDONLY.  This is incorrect since
    > > b89e151.
    
    Yuck :(. Luckily that's a pretty narrow case to hit.  We really need
    windows coverage for this stuff. And also just general buildfarm
    coverage, it's not like we're immune from bugs on unixoid OSs etiher.
    
    
    > Andres, others, any thoughts about this issue?  Are there any
    > objections if I just fix it?
    
    Not here.
    
    Greetings,
    
    Andres Freund
    
    
    
    
  9. Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

    Michael Paquier <michael@paquier.xyz> — 2019-10-09T04:42:07Z

    On Tue, Oct 08, 2019 at 09:09:53AM -0700, Andres Freund wrote:
    > I wondered about adding something like that too. Not sure what you mean
    > by directory handling problems? Couldn't that just be solved by doing an
    > fstat()?
    
    Yeah, we should just do that.
    
    > Yuck :(. Luckily that's a pretty narrow case to hit.  We really need
    > windows coverage for this stuff. And also just general buildfarm
    > coverage, it's not like we're immune from bugs on unixoid OSs etiher.
    
    test_decoding hits that easily for me, and we test that on Windows for
    some time now as there is a buildfarm module.  That's a bit depressing.
    
    >> Andres, others, any thoughts about this issue?  Are there any
    >> objections if I just fix it?
    > 
    > Not here.
    
    Done this one.
    --
    Michael
    
  10. Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

    Andres Freund <andres@anarazel.de> — 2019-10-10T15:25:15Z

    Hi,
    
    On 2019-10-09 13:42:07 +0900, Michael Paquier wrote:
    > > Yuck :(. Luckily that's a pretty narrow case to hit.  We really need
    > > windows coverage for this stuff. And also just general buildfarm
    > > coverage, it's not like we're immune from bugs on unixoid OSs etiher.
    > 
    > test_decoding hits that easily for me, and we test that on Windows for
    > some time now as there is a buildfarm module.  That's a bit depressing.
    
    Well, that's because it intentionally tries to hit that case ;)
    
    But yea, it's not good.
    
    I think we really ought to remove the -F from pg_regress.c, and leave
    that up to the caller. Right now it's just about impossible to
    configure without patching the source, unless I miss something.
    
    
    > >> Andres, others, any thoughts about this issue?  Are there any
    > >> objections if I just fix it?
    > > 
    > > Not here.
    > 
    > Done this one.
    
    Thanks.
    
    Greetings,
    
    Andres Freund
    
    
    
    
  11. Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

    Michael Paquier <michael@paquier.xyz> — 2019-10-11T00:28:14Z

    On Thu, Oct 10, 2019 at 08:25:15AM -0700, Andres Freund wrote:
    > I think we really ought to remove the -F from pg_regress.c, and leave
    > that up to the caller. Right now it's just about impossible to
    > configure without patching the source, unless I miss something.
    
    Well, I have just posted something about that:
    https://www.postgresql.org/message-id/20191009062640.GB21379@paquier.xyz
    
    Perhaps there are better ideas for that particular problem, let's
    see.
    --
    Michael