Thread

  1. block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-09T15:48:38Z

    Hi,
    
    Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
    developed technology that permits a block-level incremental backup to
    be taken from a PostgreSQL server.  I believe the idea in all of those
    cases is that non-relation files should be backed up in their
    entirety, but for relation files, only those blocks that have been
    changed need to be backed up.  I would like to propose that we should
    have a solution for this problem in core, rather than leaving it to
    each individual PostgreSQL company to develop and maintain their own
    solution. Generally my idea is:
    
    1. There should be a way to tell pg_basebackup to request from the
    server only those blocks where LSN >= threshold_value.  There are
    several possible ways for the server to implement this, the simplest
    of which is to just scan all the blocks and send only the ones that
    satisfy that criterion.  That might sound dumb, but it does still save
    network bandwidth, and it works even without any prior setup. It will
    probably be more efficient in many cases to instead scan all the WAL
    generated since that LSN and extract block references from it, but
    that is only possible if the server has all of that WAL available or
    can somehow get it from the archive.  We could also, as several people
    have proposed previously, have some kind of additional relation fork
    that stores either a single is-modified bit -- which only helps if the
    reference LSN for the is-modified bit is older than the requested LSN
    but not too much older -- or the highest LSN for each range of K
    blocks, or something like that.  I am at the moment not too concerned
    with the exact strategy we use here. I believe we may want to
    eventually support more than one, since they have different
    trade-offs.
    
    2. When you use pg_basebackup in this way, each relation file that is
    not sent in its entirety is replaced by a file with a different name.
    For example, instead of base/16384/16417, you might get
    base/16384/partial.16417 or however we decide to name them.  Each such
    file will store near the beginning of the file a list of all the
    blocks contained in that file, and the blocks themselves will follow
    at offsets that can be predicted from the metadata at the beginning of
    the file.  The idea is that you shouldn't have to read the whole file
    to figure out which blocks it contains, and if you know specifically
    what blocks you want, you should be able to reasonably efficiently
    read just those blocks.  A backup taken in this manner should also
    probably create some kind of metadata file in the root directory that
    stops the server from starting and lists other salient details of the
    backup.  In particular, you need the threshold LSN for the backup
    (i.e. contains blocks newer than this) and the start LSN for the
    backup (i.e. the LSN that would have been returned from
    pg_start_backup).
    
    3. There should be a new tool that knows how to merge a full backup
    with any number of incremental backups and produce a complete data
    directory with no remaining partial files.  The tool should check that
    the threshold LSN for each incremental backup is less than or equal to
    the start LSN of the previous backup; if not, there may be changes
    that happened in between which would be lost, so combining the backups
    is unsafe.  Running this tool can be thought of either as restoring
    the backup or as producing a new synthetic backup from any number of
    incremental backups.  This would allow for a strategy of unending
    incremental backups.  For instance, on day 1, you take a full backup.
    On every subsequent day, you take an incremental backup.  On day 9,
    you run pg_combinebackup day1 day2 -o full; rm -rf day1 day2; mv full
    day2.  On each subsequent day you do something similar.  Now you can
    always roll back to any of the last seven days by combining the oldest
    backup you have (which is always a synthetic full backup) with as many
    newer incrementals as you want, up to the point where you want to
    stop.
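
A minimal sketch of the safety check described in point 3, assuming each backup's metadata exposes a start LSN and, for incrementals, a threshold LSN as plain integers. The `Backup` class and `check_chain` function are hypothetical names for illustration only; no such code exists in this proposal.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Backup:
    """Hypothetical per-backup metadata; LSNs are plain 64-bit integers."""
    name: str
    start_lsn: int                       # LSN at which the backup started
    threshold_lsn: Optional[int] = None  # None marks a full backup


def check_chain(backups: List[Backup]) -> None:
    """Raise ValueError unless the backups can be combined safely, in order."""
    if not backups or backups[0].threshold_lsn is not None:
        raise ValueError("chain must begin with a full backup")
    for prev, cur in zip(backups, backups[1:]):
        if cur.threshold_lsn is None:
            raise ValueError(f"{cur.name}: only the first backup may be full")
        # Changes made between prev.start_lsn and cur.threshold_lsn would be
        # in neither backup, so such a gap makes the combination unsafe.
        if cur.threshold_lsn > prev.start_lsn:
            raise ValueError(
                f"{cur.name}: threshold LSN is past the start LSN of {prev.name}"
            )
```

Calling `check_chain` on a full backup followed by incrementals whose thresholds do not exceed the preceding start LSNs returns silently; any gap raises `ValueError` before files are merged.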
    
    Other random points:
    - If the server has multiple ways of finding blocks with an LSN
    greater than or equal to the threshold LSN, it could make a cost-based
    decision between those methods, or it could allow the client to
    specify the method to be used.
    - I imagine that the server would offer this functionality through a
    new replication command or a syntax extension to an existing command,
    so it could also be used by tools other than pg_basebackup if they
    wished.
    - Combining backups could also be done destructively rather than, as
    proposed above, non-destructively, but you have to be careful about
    what happens in case of a failure.
    - The pg_combinebackup tool (or whatever we call it) should probably
    have an option to exploit hard links to save disk space; this could in
    particular make construction of a new synthetic full backup much
    cheaper.  However you'd better be careful not to use this option when
    actually trying to restore, because if you start the server and run
    recovery, you don't want to change the copies of those same files that
    are in your backup directory.  I guess the server could be taught to
    complain about st_nlink > 1 but I'm not sure we want to go there.
    - It would also be possible to collapse multiple incremental backups
    into a single incremental backup, without combining with a full
    backup.  In the worst case, size(i1+i2) = size(i1) + size(i2), but if
    the same data is modified repeatedly collapsing backups would save
    lots of space.  This doesn't seem like a must-have for v1, though.
    - If you have a SAN and are taking backups using filesystem snapshots,
    then you don't need this, because your SAN probably already uses
    copy-on-write magic for those snapshots, and so you are already
    getting all of the same benefits in terms of saving storage space that
    you would get from something like this.  But not everybody has a SAN.
    - I know that there have been several previous efforts in this area,
    but none of them have gotten to the point of being committed.  I
    intend no disrespect to those efforts.  I believe I'm taking a
    slightly different view of the problem here than what has been done
    previously, trying to focus on the user experience rather than, e.g.,
    the technology that is used to decide which blocks need to be sent.
    However it's possible I've missed a promising patch that takes an
    approach very similar to what I'm outlining here, and if so, I don't
    mind a bit having that pointed out to me.
    - This is just a design proposal at this point; there is no code.  If
    this proposal, or some modified version of it, seems likely to be
    acceptable, I and/or my colleagues might try to implement it.
    - It would also be nice to support *parallel* backup, both for full
    backups as we can do them today and for incremental backups.  But that
    sounds like a separate effort.  pg_combinebackup could potentially
    support parallel operation as well, although that might be too
    ambitious for v1.
    - It would also be nice if pg_basebackup could write backups to places
    other than the local disk, like an object store, a tape drive, etc.
    But that also sounds like a separate effort.
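
One possible shape for the partial files described in point 2, sketched under stated assumptions: a 4-byte magic, a block count, the list of block numbers, then the blocks themselves at offsets computable from the header. The layout, the `PRTL` magic and all function names are invented for illustration; the proposal deliberately leaves the format undecided.

```python
import struct
from typing import BinaryIO, Dict, List

BLCKSZ = 8192     # default PostgreSQL block size, assumed here
MAGIC = b"PRTL"   # hypothetical file magic


def write_partial(out: BinaryIO, blocks: Dict[int, bytes]) -> None:
    """Write header (magic, count, sorted block numbers), then the blocks."""
    numbers = sorted(blocks)
    out.write(MAGIC)
    out.write(struct.pack("<I", len(numbers)))
    out.write(struct.pack(f"<{len(numbers)}I", *numbers))
    for blkno in numbers:
        if len(blocks[blkno]) != BLCKSZ:
            raise ValueError(f"block {blkno} is not {BLCKSZ} bytes")
        out.write(blocks[blkno])


def read_block_list(f: BinaryIO) -> List[int]:
    """Return the contained block numbers by reading only the header."""
    f.seek(0)
    if f.read(4) != MAGIC:
        raise ValueError("not a partial file")
    (count,) = struct.unpack("<I", f.read(4))
    return list(struct.unpack(f"<{count}I", f.read(4 * count)))


def read_block(f: BinaryIO, blkno: int) -> bytes:
    """Seek directly to one block; raises ValueError if it is not present."""
    numbers = read_block_list(f)
    index = numbers.index(blkno)
    header_size = 8 + 4 * len(numbers)
    f.seek(header_size + index * BLCKSZ)
    return f.read(BLCKSZ)
```

The point of the layout is that `read_block_list` touches only the header, and `read_block` needs one seek per block, which matches the stated goal of not reading the whole file to find out what it contains.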
    
    Thoughts?
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  2. Re: block-level incremental backup

    Arthur Zakirov <a.zakirov@postgrespro.ru> — 2019-04-09T16:32:30Z

    Hello,
    
    On 09.04.2019 18:48, Robert Haas wrote:
    > - It would also be nice if pg_basebackup could write backups to places
    > other than the local disk, like an object store, a tape drive, etc.
    > But that also sounds like a separate effort.
    > 
    > Thoughts? 
    
    (Just thinking out loud) It might also be useful to have a remote restore
    facility (i.e. if pg_combinebackup could write to non-local storage), so
    you don't need to restore the instance into a local place and then copy/move
    it to the remote machine. But it seems to me that this is the most nontrivial
    feature and requires much more effort than the other points.
    
    In pg_probackup we have remote restore via SSH in the beta state. But 
    SSH isn't an option for in-core approach I think.
    
    -- 
    Arthur Zakirov
    Postgres Professional: http://www.postgrespro.com
    Russian Postgres Company
    
    
    
    
  3. Re: block-level incremental backup

    Andres Freund <andres@anarazel.de> — 2019-04-09T16:35:00Z

    Hi,
    
    On 2019-04-09 11:48:38 -0400, Robert Haas wrote:
    > 2. When you use pg_basebackup in this way, each relation file that is
    > not sent in its entirety is replaced by a file with a different name.
    > For example, instead of base/16384/16417, you might get
    > base/16384/partial.16417 or however we decide to name them.
    
    Hm. But that means that files that are shipped nearly in their entirety,
    need to be fully rewritten. Wonder if it's better to ship them as files
    with holes, and have the metadata in a separate file. That'd then allow
    to just fill in the holes with data from the older version.  I'd assume
    that there's a lot of workloads where some significantly sized relations
    will get updated in nearly their entirety between backups.
    
    
    > Each such file will store near the beginning of the file a list of all the
    > blocks contained in that file, and the blocks themselves will follow
    > at offsets that can be predicted from the metadata at the beginning of
    > the file.  The idea is that you shouldn't have to read the whole file
    > to figure out which blocks it contains, and if you know specifically
    > what blocks you want, you should be able to reasonably efficiently
    > read just those blocks.  A backup taken in this manner should also
    > probably create some kind of metadata file in the root directory that
    > stops the server from starting and lists other salient details of the
    > backup.  In particular, you need the threshold LSN for the backup
    > (i.e. contains blocks newer than this) and the start LSN for the
    > backup (i.e. the LSN that would have been returned from
    > pg_start_backup).
    
    I wonder if we shouldn't just integrate that into pg_control or such. So
    that:
    
    > 3. There should be a new tool that knows how to merge a full backup
    > with any number of incremental backups and produce a complete data
    > directory with no remaining partial files.
    
    Could just be part of server startup?
    
    
    > - I imagine that the server would offer this functionality through a
    > new replication command or a syntax extension to an existing command,
    > so it could also be used by tools other than pg_basebackup if they
    > wished.
    
    Would this logic somehow be usable from tools that don't want to copy
    the data directory via pg_basebackup (e.g. for parallelism, to directly
    send to some backup service / SAN / whatnot)?
    
    
    > - It would also be nice if pg_basebackup could write backups to places
    > other than the local disk, like an object store, a tape drive, etc.
    > But that also sounds like a separate effort.
    
    Indeed seems separate. But worthwhile.
    
    Greetings,
    
    Andres Freund
    
    
    
    
  4. Re: block-level incremental backup

    Gary M <garym@oedata.com> — 2019-04-09T17:47:29Z

    Having worked in the data storage industry since the '80s, I think backup
    is an important capability. Having said that, the ideas should be expanded
    to an overall data management strategy combining local and remote storage
    including cloud.
    
    From my experience, record and transaction consistency is critical to any
    replication action, including backup.  The approach commonly includes a
    starting baseline, snapshot if you prefer, and a set of incremental changes
    to the snapshot.  I always used the transaction logs for both backup and
    remote replication to other DBMS. In standard ECMA-208 @94, you will note a
    file object with a transaction property. Although the language specifies
    files, a file may be any set of records.
    
    SAN-based snapshots usually occur on the SAN storage device, meaning that
    cached data (not yet written to disk) will either not be snapshotted or be
    inconsistently referenced, likely resulting in a corrupted database on restore.
    
    Snapshots are point-in-time states of storage objects. Between snapshot
    periods, any number of changes may occur.  If a record of "all changes"
    is required, snapshot methods must be augmented with a historical record:
    the transaction log.
    
    Delta block methods for backups have been in practice for many years. ZFS
    has adopted the practice for block management. The ability to restore from
    incremental backups, whether block, transaction or other methods, is dependent
    on prior data. Like primary storage, backup media can fail, become lost, or be
    inadvertently corrupted. The result of losing an incremental backup is that
    the data restored after the point of loss is likely corrupted.
    
    cheers,
    garym
    
    
  5. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-09T17:54:00Z

    On Tue, Apr 9, 2019 at 12:35 PM Andres Freund <andres@anarazel.de> wrote:
    > Hm. But that means that files that are shipped nearly in their entirety,
    > need to be fully rewritten. Wonder if it's better to ship them as files
    > with holes, and have the metadata in a separate file. That'd then allow
    > to just fill in the holes with data from the older version.  I'd assume
    > that there's a lot of workloads where some significantly sized relations
    > will get updated in nearly their entirety between backups.
    
    I don't want to rely on holes at the FS level.  I don't want to have
    to worry about what Windows does and what every Linux filesystem does
    and what NetBSD and FreeBSD and Dragonfly BSD and MacOS do.  And I
    don't want to have to write documentation for the fine manual
    explaining to people that they need to use a hole-preserving tool when
    they copy an incremental backup around.  And I don't want to have to
    listen to complaints from $USER that their backup tool, $THING, is not
    hole-aware.  Just - no.
    
    But what we could do is have some threshold (as git does), beyond
    which you just send the whole file.  For example if >90% of the blocks
    have changed, or >80% or whatever, then you just send everything.
    That way, if you have a database where you have lots and lots of 1GB
    segments with low churn (so that you can't just use full backups) and
    lots and lots of 1GB segments with high churn (to create the problem
    you're describing) you'll still be OK.
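
The threshold rule above can be sketched as a single decision function. The function name and the default of 0.9 are assumptions taken from the ">90%" example; the message leaves the actual cutoff open.

```python
def should_send_whole_file(changed_blocks: int, total_blocks: int,
                           threshold: float = 0.9) -> bool:
    """Decide whether a relation segment is sent whole or as a partial file.

    Returns True when the fraction of changed blocks exceeds the threshold,
    in which case a partial file would save little and the whole file is sent.
    """
    if total_blocks <= 0:
        # Empty file: there is nothing to gain from a partial representation.
        return True
    return changed_blocks / total_blocks > threshold
```

For a 1GB segment of 8kB blocks (131072 blocks), 120000 changed blocks is about 91.6% and crosses the default cutoff, while 1000 changed blocks does not.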
    
    > > 3. There should be a new tool that knows how to merge a full backup
    > > with any number of incremental backups and produce a complete data
    > > directory with no remaining partial files.
    >
    > Could just be part of server startup?
    
    Yes, but I think that sucks.  You might not want to start the server
    but rather just create a new synthetic backup.  And realistically,
    it's hard to imagine the server doing anything but synthesizing the
    backup first and then proceeding as normal.  In theory there's no
    reason why it couldn't be smart enough to construct the files it needs
    "on demand" in the background, but that sounds really hard and I don't
    think there's enough value to justify that level of effort.  YMMV, of
    course.
    
    > > - I imagine that the server would offer this functionality through a
    > > new replication command or a syntax extension to an existing command,
    > > so it could also be used by tools other than pg_basebackup if they
    > > wished.
    >
    > Would this logic somehow be usable from tools that don't want to copy
    > the data directory via pg_basebackup (e.g. for parallelism, to directly
    > send to some backup service / SAN / whatnot)?
    
    Well, I'm imagining it as a piece of server-side functionality that
    can figure out what has changed using one of several possible methods,
    and then send that stuff to you.  So I think if you don't have a
    server connection you are out of luck.  If you have a server
    connection but just want to be told what has changed rather than
    actually being given that data, that might be something that could be
    worked into the design.  I'm not sure whether that's a real need,
    though, or just extra work.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  6. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-09T17:56:23Z

    On Tue, Apr 9, 2019 at 12:32 PM Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
    > In pg_probackup we have remote restore via SSH in the beta state. But
    > SSH isn't an option for in-core approach I think.
    
    That's a little off-topic for this thread, but I think we should have
    some kind of extensible mechanism for pg_basebackup and maybe other
    tools, so that you can teach it to send backups to AWS or your
    teletype or etch them on stone tablets or whatever without having to
    modify core code.  But let's not design that mechanism on this thread,
    'cuz that will distract from what I want to talk about here.  Feel
    free to start a new thread for it, though, and I'll jump in.  :-)
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  7. Re: block-level incremental backup

    Peter Eisentraut <peter.eisentraut@2ndquadrant.com> — 2019-04-09T21:07:39Z

    On 2019-04-09 17:48, Robert Haas wrote:
    > It will
    > probably be more efficient in many cases to instead scan all the WAL
    > generated since that LSN and extract block references from it, but
    > that is only possible if the server has all of that WAL available or
    > can somehow get it from the archive.
    
    This could be a variant of a replication slot that preserves WAL between
    incremental backup runs.
    
    > 3. There should be a new tool that knows how to merge a full backup
    > with any number of incremental backups and produce a complete data
    > directory with no remaining partial files.
    
    Are there by any chance standard file formats and tools that describe a
    binary difference between directories?  That would be really useful here.
    
    -- 
    Peter Eisentraut              http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
    
    
    
    
  8. Re: block-level incremental backup

    Alvaro Herrera <alvherre@2ndquadrant.com> — 2019-04-09T21:28:38Z

    On 2019-Apr-09, Peter Eisentraut wrote:
    
    > On 2019-04-09 17:48, Robert Haas wrote:
    
    > > 3. There should be a new tool that knows how to merge a full backup
    > > with any number of incremental backups and produce a complete data
    > > directory with no remaining partial files.
    > 
    > Are there by any chance standard file formats and tools that describe a
    > binary difference between directories?  That would be really useful here.
    
    VCDIFF? https://tools.ietf.org/html/rfc3284
    
    -- 
    Álvaro Herrera                https://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
    
    
    
    
  9. Re: block-level incremental backup

    Andrey Borodin <x4mmm@yandex-team.ru> — 2019-04-10T11:51:01Z

    Hi!
    
    > On 9 Apr 2019, at 20:48, Robert Haas <robertmhaas@gmail.com> wrote:
    > 
    > Thoughts?
    Thanks for this long and thoughtful post!
    
    At Yandex, we have been using incremental backups for some years now. Initially, we used a patched pgbarman, then we implemented this functionality in WAL-G. And there are many things still to be done. We have more than 1 PB of clusters backed up with this technology.
    Most of the time we use this technology as part of an HA setup in a managed PostgreSQL service. So, for us the main goals are to operate backups cheaply and restore a new node quickly. Here's what I see from our perspective.
    
    1. Yes, this feature is important.
    
    2. This importance comes not from reduced disk storage; magnetic disks and object storage are very cheap.
    
    3. Incremental backups save a lot of network bandwidth. It is non-trivial for the storage system to ingest hundreds of TB daily.
    
    4. Incremental backups are a redundancy of WAL, intended for parallel application. An incremental backup applied sequentially is not very useful; in many cases it will not be much faster than simple WAL replay.
    
    5. As long as increments duplicate WAL functionality, it is not worth pursuing tradeoffs of storage utilization reduction. We scan WAL during archiving, extract the numbers of changed blocks and store a changemap for a group of WALs in the archive.
    
    6. These changemaps can be used for the increment of the visibility map (if I recall correctly). But you cannot compare LSNs on a page of the visibility map: some operations do not bump them.
    
    7. We use changemaps during backups and during WAL replay - we know which blocks will change far in advance and prefetch them into the page cache like pg_prefaulter does.
    
    8. There is similar functionality in RMAN for one well-known database. They used to store 8 sets of change maps. That database also has cool functionality "increment for catchup".
    
    9. We call incremental backup a "delta backup". This wording describes purpose more precisely: it is not "next version of DB", it is "difference between two DB states". But wording choice does not matter much.
    
    
    Here are slides from my talk at PgConf.APAC[0]. I've proposed a talk on this matter to PgCon, but it was not accepted. I will try next year :)
    
    > On 9 Apr 2019, at 20:48, Robert Haas <robertmhaas@gmail.com> wrote:
    > - This is just a design proposal at this point; there is no code.  If
    > this proposal, or some modified version of it, seems likely to be
    > acceptable, I and/or my colleagues might try to implement it.
    
    I'll be happy to help with code, discussion and patch review.
    
    Best regards, Andrey Borodin.
    
    [0] https://yadi.sk/i/Y_S1iqNN5WxS6A
    
    
    
  10. Re: block-level incremental backup

    Konstantin Knizhnik <k.knizhnik@postgrespro.ru> — 2019-04-10T14:22:38Z

    
    On 09.04.2019 18:48, Robert Haas wrote:
    > 1. There should be a way to tell pg_basebackup to request from the
    > server only those blocks where LSN >= threshold_value.
    
    Some time ago I implemented an alternative version of the ptrack utility
    (not the one used in pg_probackup)
    which detects updated blocks at the file level. It is very simple and maybe
    it can someday be integrated into master.
    I have attached a patch against vanilla to this mail.
    Right now it contains just two GUCs:
    
    ptrack_map_size: Size of ptrack map (number of elements) used for 
    incremental backup: 0 disabled.
    ptrack_block_log: Logarithm of ptrack block size (amount of pages)
    
    and one function:
    
    pg_ptrack_get_changeset(startlsn pg_lsn) returns 
    {relid,relfilenode,reltablespace,forknum,blocknum,segsize,updlsn,path}
    
    The idea is very simple: it creates a hash map of fixed size (ptrack_map_size)
    and stores the LSNs of written pages in this map.
    Since the default Postgres page size seems to be too small for a ptrack
    block (requiring too large a hash map or increasing the number of conflicts,
    as well as
    increasing the number of random reads), it is possible to configure a ptrack
    block to consist of multiple pages (a power of 2).
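
A rough illustration of the idea just described, not the attached patch: a fixed-size array of LSNs indexed by a hash of the ptrack block, where a ptrack block covers 2^block_log pages. The class, method names and hashing scheme are assumptions made for this sketch; the real patch uses a memory-mapped file and its own hash.

```python
class PtrackMap:
    """Fixed-size map from hashed (relfilenode, fork, ptrack block) to an LSN."""

    def __init__(self, map_size: int, block_log: int) -> None:
        self.map_size = map_size      # plays the role of ptrack_map_size
        self.block_log = block_log    # plays the role of ptrack_block_log
        self.slots = [0] * map_size   # highest LSN seen per slot, 0 = never

    def _slot(self, relfilenode: int, forknum: int, blocknum: int) -> int:
        # Pages are grouped into ptrack blocks of 2**block_log pages each.
        key = (relfilenode, forknum, blocknum >> self.block_log)
        return hash(key) % self.map_size

    def mark_written(self, relfilenode: int, forknum: int,
                     blocknum: int, lsn: int) -> None:
        i = self._slot(relfilenode, forknum, blocknum)
        if lsn > self.slots[i]:
            self.slots[i] = lsn

    def maybe_changed(self, relfilenode: int, forknum: int,
                      blocknum: int, start_lsn: int) -> bool:
        # Hash conflicts can report unchanged pages as changed (extra data
        # in the backup) but can never hide a page that was really written.
        return self.slots[self._slot(relfilenode, forknum, blocknum)] >= start_lsn
```

The trade-off mentioned above shows up directly: a larger `block_log` shrinks the number of distinct keys and so reduces conflicts for a given map size, at the cost of flagging whole groups of pages when only one of them changed.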
    
    This patch uses a memory mapping mechanism. Unfortunately there is no
    portable wrapper for it in Postgres, so I had to provide my own
    implementations for Unix/Windows. Certainly this is not good and should be
    rewritten.
    
    How to use?
    
    1. Define ptrack_map_size in postgresql.conf, for example (use a prime
    number for more uniform hashing):
    
    ptrack_map_size = 1000003
    
    2. Remember the current LSN.
    
    psql postgres -c "select pg_current_wal_lsn()"
      pg_current_wal_lsn
    --------------------
      0/224A268
    (1 row)
    
    3. Do some updates.
    
    $ pgbench -T 10 postgres
    
    4. Select changed blocks.
    
      select * from pg_ptrack_get_changeset('0/224A268');
     relid | relfilenode | reltablespace | forknum | blocknum | segsize |  updlsn   |       path
    -------+-------------+---------------+---------+----------+---------+-----------+------------------
     16390 |       16396 |          1663 |       0 |     1640 |       1 | 0/224FD88 | base/12710/16396
     16390 |       16396 |          1663 |       0 |     1641 |       1 | 0/2258680 | base/12710/16396
     16390 |       16396 |          1663 |       0 |     1642 |       1 | 0/22615A0 | base/12710/16396
    ...
    
    Certainly ptrack should be used as part of some backup tool (such as
    pg_basebackup or pg_probackup).
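
A backup tool consuming LSN values like those shown above has to compare them numerically, not as strings. The sketch below converts between the textual `X/Y` form (two hexadecimal halves) and a 64-bit integer; the helper names are hypothetical.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN in 'hi/lo' hexadecimal form to a 64-bit integer."""
    hi, lo = text.strip().split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value: int) -> str:
    """Convert a 64-bit integer back to the 'hi/lo' hexadecimal form."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"
```

Comparing the strings directly would be wrong: `"0/FFFFFF" > "0/1000000"` is true lexically, while the parsed values order the other way round.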
    
    
    -- 
    Konstantin Knizhnik
    Postgres Professional: http://www.postgrespro.com
    The Russian Postgres Company
    
    
  11. Re: block-level incremental backup

    Jehan-Guillaume de Rorthais <jgdr@dalibo.com> — 2019-04-10T14:57:11Z

    Hi,
    
    On Tue, 9 Apr 2019 11:48:38 -0400
    Robert Haas <robertmhaas@gmail.com> wrote:
    
    > Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
    > developed technology that permits a block-level incremental backup to
    > be taken from a PostgreSQL server.  I believe the idea in all of those
    > cases is that non-relation files should be backed up in their
    > entirety, but for relation files, only those blocks that have been
    > changed need to be backed up.  I would like to propose that we should
    > have a solution for this problem in core, rather than leaving it to
    > each individual PostgreSQL company to develop and maintain their own
    > solution. Generally my idea is:
    > 
    > 1. There should be a way to tell pg_basebackup to request from the
    > server only those blocks where LSN >= threshold_value.  There are
    > several possible ways for the server to implement this, the simplest
    > of which is to just scan all the blocks and send only the ones that
    > satisfy that criterion.  That might sound dumb, but it does still save
    > network bandwidth, and it works even without any prior setup.
    
    +1 this is a simple design and probably a first easy step bringing a lot of
    benefits already.
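
The simple approach quoted above (scan every block, keep those with LSN >= threshold) can be sketched as follows. It assumes 8kB pages, a little-endian server, and that the page LSN occupies the first 8 bytes of the page header as two 32-bit halves; the function names are hypothetical, and the sketch ignores concurrent writes and torn reads.

```python
import struct
from typing import BinaryIO, Iterator, Tuple

BLCKSZ = 8192  # default PostgreSQL block size, assumed here


def page_lsn(page: bytes) -> int:
    """Extract the page LSN from the first 8 bytes of the page header."""
    # Two unsigned 32-bit fields: high half, then low half (little-endian assumed).
    xlogid, xrecoff = struct.unpack_from("<II", page, 0)
    return (xlogid << 32) | xrecoff


def changed_blocks(f: BinaryIO, threshold_lsn: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (block number, page) for every page with LSN >= threshold_lsn."""
    blkno = 0
    while True:
        page = f.read(BLCKSZ)
        if len(page) < BLCKSZ:
            break  # end of file (a trailing short read is ignored)
        if page_lsn(page) >= threshold_lsn:
            yield blkno, page
        blkno += 1
```

Every block is still read from disk, so this saves only network bandwidth and backup storage, which is exactly the trade-off the proposal describes. All-zero pages carry LSN 0 and are skipped by this filter; a real implementation would have to decide how to treat them.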
    
    > It will probably be more efficient in many cases to instead scan all the WAL
    > generated since that LSN and extract block references from it, but
    > that is only possible if the server has all of that WAL available or
    > can somehow get it from the archive.
    
    Let me take this opportunity to discuss this on the fly.
    
    I've been playing with the idea of producing incremental backups from
    archives for many years. But I only started PoC'ing it this year.
    
    My idea would be to create a new tool working on archived WAL. No burden
    on the server side. The basic concept is:
    
    * parse archives
    * record latest relevant FPW for the incr backup
    * write new WALs with recorded FPW and removing/rewriting duplicated walrecords.
    
    It's just a PoC and I haven't finished the WAL writing part...not even talking
    about the replay part. I'm not even sure this project is a good idea, but it is
    a good educational exercise for me in the meantime.
    
    Anyway, using real life OLTP production archives, my stats were:
    
      # WAL   xlogrec kept     Size WAL kept
        127            39%               50%
        383            22%               38%
        639            20%               29%
    
    Based on these stats, I expect this would save a lot of time during recovery as
    a first step. If it matures, it might even save a lot of archive space or
    extend the retention period with degraded granularity. It would even help
    taking full backups with a lower frequency.
    
    Any thoughts about this design would be much appreciated. I suppose this should
    be offlist or in a new thread to avoid polluting this thread as this is a
    slightly different subject.
    
    Regards,
    
    
    PS: I was surprised to still find some existing piece of code related to
    pglesslog in core. This project has been discontinued and WAL format changed in
    the meantime.
    
    
    
    
  12. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-10T15:31:53Z

    On Tue, Apr 9, 2019 at 5:28 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
    > On 2019-Apr-09, Peter Eisentraut wrote:
    > > On 2019-04-09 17:48, Robert Haas wrote:
    > > > 3. There should be a new tool that knows how to merge a full backup
    > > > with any number of incremental backups and produce a complete data
    > > > directory with no remaining partial files.
    > >
    > > Are there by any chance standard file formats and tools that describe a
    > > binary difference between directories?  That would be really useful here.
    >
    > VCDIFF? https://tools.ietf.org/html/rfc3284
    
    I don't understand VCDIFF very well, but I see some potential problems
    with going in this direction.
    
    First, suppose we take a full backup on Monday.  Then, on Tuesday, we
    want to take an incremental backup.  In my proposal, the backup server
    only needs to provide the database with one piece of information: the
    start-LSN of the previous backup.  The server determines which blocks
    are recently modified and sends them to the client, which stores them.
    The end.  On the other hand, storing a maximally compact VCDIFF seems
    to require that, for each block modified in the Tuesday backup, we go
    read the corresponding block as it existed on Monday.  Assuming that
    the server is using some efficient method of locating modified blocks,
    this will approximately double the amount of read I/O required to
    complete the backup: either the server or the client must now read not
    only the current version of the block but the previous versions.  If
    the previous backup is an incremental backup that does not contain
    full block images but only VCDIFF content, whoever is performing the
    VCDIFF calculation will need to walk the entire backup chain and
    reconstruct the previous contents of the previous block so that it can
    compute the newest VCDIFF.  A customer who does an incremental backup
    every day and maintains a synthetic full backup from 1 week prior will
    see a roughly eightfold increase in read I/O compared to the design I
    proposed.
    
    The same problem exists at restore time.  In my design, the total read
    I/O required is equal to the size of the database, plus however much
    metadata needs to be read from older delta files -- and that should be
    fairly small compared to the actual data being read, at least in
    normal, non-extreme cases.  But if we are going to proceed by applying
    a series of delta files, we're going to need to read every older
    backup in its entirety.  If the turnover percentage is significant,
    say 20%/day, and if the backup chain is say 7 backups long to get back
    to a full backup, this is a huge difference.  Instead of having to
    read ~100% of the database size, as in my proposal, we'll need to read
    100% + (6 * 20%) = 220% of the database size.
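    
    The restore-time arithmetic above can be checked with a few lines. This
    is only a worked example of the figures in the paragraph; the function
    name is made up, and it ignores the (small) metadata reads mentioned for
    the block-level design.

```python
def delta_chain_restore_io(turnover: float, chain_length: int) -> float:
    """Read I/O at restore time, as a multiple of the database size.

    Assumes every backup in the chain must be read in full: one full
    backup plus (chain_length - 1) incrementals, each about `turnover`
    of the database size.
    """
    return 1.0 + (chain_length - 1) * turnover


# 20% daily turnover, chain of 7 backups (1 full + 6 incrementals)
ratio = delta_chain_restore_io(0.20, 7)
print(f"{ratio:.0%} of the database size")
```

    With whole blocks stored in each incremental, the same restore reads
    roughly 100% of the database size, whatever the chain length.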
    
    Since VCDIFF uses an add-copy-run language to describe differences,
    we could try to work around the problem that I just described by
    describing each changed data block as an 8192-byte add, and unchanged
    blocks as an 8192-byte copy.  If we did that, then I think that the
    problem at backup time goes away: we can write out a VCDIFF-format
    file for the changed blocks based just on knowing that those are the
    blocks that have changed, without needing to access the older file. Of
    course, if we do it this way, the file will be larger than it would be
    if we actually compared the old and new block contents and wrote out a
    minimal VCDIFF, but it does make taking a backup a lot simpler.  Even
    with this proposal, though, I think we still have trouble with restore
    time.  I proposed putting the metadata about which blocks are included
    in a delta file at the beginning of the file, which allows a restore
    of a new incremental backup to relatively efficiently flip through
    older backups to find just the blocks that it needs, without having to
    read the whole file.  But I think (although I am not quite sure) that
    in the VCDIFF format, the payload for an ADD instruction is stored
    near the instruction itself.  The result would be that you'd have to basically
    read the whole file at restore time to figure out which blocks were
    available from that file and which ones needed to be retrieved from an
    older backup.  So while this approach would fix the backup-time
    problem, I believe that it would still require significantly more read
    I/O at restore time than my proposal.
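    
    To illustrate why metadata at the beginning of the file helps at restore
    time, here is a sketch of one possible delta-file layout. The layout
    (a count, the sorted block numbers, then the pages in the same order) is
    an assumption made for this example, not a format proposed in the thread.

```python
import io
import struct

BLOCK_SIZE = 8192  # assumption: default block size


def write_delta(out, blocks):
    """Write {block_number: page} as: count, sorted block numbers, pages."""
    numbers = sorted(blocks)
    out.write(struct.pack("<I", len(numbers)))
    out.write(struct.pack(f"<{len(numbers)}I", *numbers))
    for n in numbers:
        assert len(blocks[n]) == BLOCK_SIZE
        out.write(blocks[n])


def read_index(f):
    """Read only the header; return the block numbers present in the file."""
    f.seek(0)
    (count,) = struct.unpack("<I", f.read(4))
    return list(struct.unpack(f"<{count}I", f.read(4 * count)))


def read_block(f, index, block_number):
    """Seek straight to one page using the index; None if it is absent."""
    if block_number not in index:
        return None
    header_size = 4 + 4 * len(index)
    f.seek(header_size + index.index(block_number) * BLOCK_SIZE)
    return f.read(BLOCK_SIZE)


# Usage: an in-memory file stands in for a delta file on disk.
delta = io.BytesIO()
write_delta(delta, {7: b"a" * BLOCK_SIZE, 2: b"b" * BLOCK_SIZE})
index = read_index(delta)
```

    A restore can read the small index of each older backup and fetch only
    the pages it still needs, instead of scanning every file end to end.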
    
    Furthermore, if, at backup time, we have to do anything that requires
    access to the old data, either the client or the server needs to have
    access to that data.  Notwithstanding the costs of reading it, that
    doesn't seem very desirable.  The server is quite unlikely to have
    access to the backups, because most users want to back up to a
    different server in order to guard against a hardware failure.  The
    client is more likely to be running on a machine where it has access
    to the data, because many users back up to the same machine every day,
    so the machine that is taking the current backup probably has the
    older one.  However, accessing that old backup might not be cheap.  It
    could be located in an object store in the cloud someplace, or it
    could have been written out to a tape drive and the tape removed from
    the drive.  In the design I'm proposing, that stuff doesn't matter,
    but if you want to run diffs, then it does.  Even if the client has
    efficient access to the data and even if it has so much read I/O
    bandwidth that the cost of reading that old data to run diffs doesn't
    matter, it's still pretty awkward for a tar-format backup.  The client
    would have to take the tar archive sent by the server apart and form a
    new one.
    
    Another advantage of storing whole blocks in the incremental backup is
    that there's no tight coupling between the full backup and the
    incremental backup.  Suppose you take a full backup A on S1, and then
    another full backup B, and then an incremental backup C based on A,
    and then an incremental backup D based on B.  If backup B is destroyed
    beyond retrieval, you can restore the chain A-C-D and get back to the
    same place that restoring B-D would have gotten you.  Backup D doesn't
    really know or care that it happens to be based on B.  It just knows
    that it can only give you those blocks that have LSN >= LSN_B.  You
    can get those blocks from anywhere that you like.  If D instead stored
    deltas between the blocks as they exist in backup B, then those deltas
    would have to be applied specifically to backup B, not some
    possibly-later version.
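    
    The A/B/C/D scenario can be simulated with a toy model. This is a sketch
    under simplifying assumptions: the database is a dictionary of
    block -> (LSN, contents), the LSN values are invented, and truncation or
    removal of blocks is ignored.

```python
def full_backup(state):
    """Copy every block."""
    return dict(state)


def incremental_backup(state, threshold_lsn):
    """Copy only the blocks whose LSN is >= threshold_lsn."""
    return {blk: (lsn, data)
            for blk, (lsn, data) in state.items()
            if lsn >= threshold_lsn}


def restore(full, incrementals):
    """Apply incrementals oldest first; a newer copy of a block wins."""
    result = dict(full)
    for inc in incrementals:
        result.update(inc)
    return result


def demo():
    state = {0: (1, "a0"), 1: (2, "b0"), 2: (3, "c0")}
    backup_a, lsn_a = full_backup(state), 10
    state[0] = (12, "a1")
    backup_b, lsn_b = full_backup(state), 20
    state[1] = (25, "b1")
    backup_c = incremental_backup(state, lsn_a)  # based on A
    state[2] = (35, "c1")
    backup_d = incremental_backup(state, lsn_b)  # based on B
    via_a = restore(backup_a, [backup_c, backup_d])
    via_b = restore(backup_b, [backup_d])
    return via_a, via_b, dict(state)
```

    In this model both chains reproduce the final state, because D only
    promises "every block with LSN >= LSN_B" and does not depend on the
    contents of B.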
    
    I think the way to think about this problem, or at least the way I
    think about this problem, is that we need to decide whether we want
    file-level incremental backup, block-level incremental backup, or
    byte-level incremental backup.  pgbackrest implements file-level
    incremental backup: if the file has changed, copy the whole thing.
    That has an appealing simplicity but risks copying 1GB of data for a
    1-byte change. What I'm proposing here is block-level incremental
    backup, which is more complicated and still risks copying 8kB of data
    for a 1-byte change.  Using VCDIFF would, I think, give us byte-level
    incremental backup.  That would probably do an excellent job of making
    incremental backups as small as they can possibly be, because we would
    not need to include in the backup image even a single byte of
    unmodified data.  It also seems like it does some other compression
    tricks which could shrink incremental backups further.  However, my
    intuition is that we won't gain enough in terms of backup size to make
    up for the downsides listed above.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  13. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-10T16:21:03Z

    On Wed, Apr 10, 2019 at 10:57 AM Jehan-Guillaume de Rorthais
    <jgdr@dalibo.com> wrote:
    > My idea would be create a new tool working on archived WAL. No burden
    > server side. Basic concept is:
    >
    > * parse archives
    > * record latest relevant FPW for the incr backup
    > * write new WALs with recorded FPW and removing/rewriting duplicated walrecords.
    >
    > It's just a PoC and I hadn't finished the WAL writing part...not even talking
    > about the replay part. I'm not even sure this project is a good idea, but it is
    > a good educational exercice to me in the meantime.
    >
    > Anyway, using real life OLTP production archives, my stats were:
    >
    >   # WAL   xlogrec kept     Size WAL kept
    >     127            39%               50%
    >     383            22%               38%
    >     639            20%               29%
    >
    > Based on this stats, I expect this would save a lot of time during recovery in
    > a first step. If it get mature, it might even save a lot of archives space or
    > extend the retention period with degraded granularity. It would even help
    > taking full backups with a lower frequency.
    >
    > Any thoughts about this design would be much appreciated. I suppose this should
    > be offlist or in a new thread to avoid polluting this thread as this is a
    > slightly different subject.
    
    Interesting idea, but I don't see how it can work if you only deal
    with the FPWs and not the other records.  For instance, suppose that
    you take a full backup at time T0, and then at time T1 there are two
    modifications to a certain block in quick succession.  That block is
    then never touched again.  Since no checkpoint intervenes between the
    modifications, the first one emits an FPI and the second does not.
    Capturing the FPI is fine as far as it goes, but unless you also do
    something with the non-FPI change, you lose that second modification.
    You could fix that by having your tool replicate the effects of WAL
    apply outside the server, but that sounds like a ton of work and a ton
    of possible bugs.
    
    I have a related idea, though.  Suppose that, as Peter says upthread,
    you have a replication slot that prevents old WAL from being removed.
    You also have a background worker that is connected to that slot.  It
    decodes WAL and produces summary files containing all block-references
    extracted from those WAL records and the associated LSN (or maybe some
    approximation of the LSN instead of the exact value, to allow for
    compression and combining of nearby references).  Then you hold onto
    those summary files after the actual WAL is removed.  Now, when
    somebody asks the server for all blocks changed since a certain LSN,
    it can use those summary files to figure out which blocks to send
    without having to read all the pages in the database.  Although I
    believe that a simple system that finds modified blocks by reading
    them all is good enough for a first version of this feature and useful
    in its own right, a more efficient system will be a lot more useful,
    and something like this seems to me to be probably the best way to
    implement it.
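    
    A rough sketch of the summary-file idea, assuming the WAL has already
    been decoded into (LSN, block references) pairs. The data shapes and
    function names are hypothetical; real WAL decoding is not shown.

```python
def summarize(wal_records):
    """Build one summary from decoded WAL records.

    wal_records: iterable of (lsn, [(relfile, block), ...]).
    Returns {(relfile, block): highest LSN that referenced the block}.
    """
    summary = {}
    for lsn, refs in wal_records:
        for ref in refs:
            if lsn > summary.get(ref, -1):
                summary[ref] = lsn
    return summary


def blocks_changed_since(summaries, threshold_lsn):
    """Return the sorted block references modified at or after the LSN."""
    changed = set()
    for summary in summaries:
        for ref, lsn in summary.items():
            if lsn >= threshold_lsn:
                changed.add(ref)
    return sorted(changed)


# Usage: two summaries kept after the underlying WAL was removed.
s1 = summarize([(100, [("rel1", 0)]), (150, [("rel1", 1), ("rel2", 0)])])
s2 = summarize([(220, [("rel1", 0)])])
to_send = blocks_changed_since([s1, s2], 200)
```

    Storing only the highest LSN per block is one of several possible
    approximations; it keeps a summary far smaller than the WAL it covers.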
    
    The reason why I think this is likely to be superior to other possible
    approaches, such as the ptrack approach Konstantin suggests elsewhere
    on this thread, is because it pushes the work of figuring out which
    blocks have been modified into the background.  With a ptrack-type
    approach, the server has to do some non-zero amount of extra work in
    the foreground every time it modifies a block.  With an approach based
    on WAL-scanning, the work is done in the background and nobody has to
    wait for it.  It's possible that there are other considerations which
    aren't occurring to me right now, though.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  14. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-10T16:51:27Z

    On Wed, Apr 10, 2019 at 10:22 AM Konstantin Knizhnik
    <k.knizhnik@postgrespro.ru> wrote:
    > Some times ago I have implemented alternative version of ptrack utility
    > (not one used in pg_probackup)
    > which detects updated block at file level. It is very simple and may be
    > it can be sometimes integrated in master.
    
    I don't think this is completely crash-safe.  It looks like it
    arranges to msync() the ptrack file at appropriate times (although I
    haven't exhaustively verified the logic), but it uses MS_ASYNC, so
    it's possible that the ptrack file could get updated on disk either
    before or after the relation file itself.  I think before is probably
    OK -- it just risks having some blocks look modified when they aren't
    really -- but after seems like it is very much not OK.  And changing
    this to use MS_SYNC would probably be really expensive.  Likely a
    better approach would be to hook into the new fsync queue machinery
    that Thomas Munro added to PostgreSQL 12.
    
    It looks like your system maps all the blocks in the system into a
    fixed-size map using hashing.  If the number of modified blocks
    between the full backup and the incremental backup is large compared
    to the size of the ptrack map, you'll start to get a lot of
    false-positives.  It will look as if much of the database needs to be
    backed up.  For example, in your sample configuration, you have
    ptrack_map_size = 1000003. If you've got a 100GB database with 20%
    daily turnover, that's about 2.6 million blocks.  If you bump a
    random entry ~2.6 million times in a map with 1000003 entries, on
    average ~92% of the entries end up getting bumped, so you will get
    very little benefit from incremental backup.  This problem drops off
    pretty fast if you raise the size of the map, but it's pretty critical
    that your map is large enough for the database you've got, or you may
    as well not bother.
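    
    The figure above follows from a standard occupancy estimate. A short
    check, assuming each modified block lands on a uniformly random map
    entry (the function name is made up for this example):

```python
def fraction_bumped(map_entries: int, bumps: int) -> float:
    """Expected fraction of map entries hit at least once.

    Each bump misses a given entry with probability (1 - 1/map_entries),
    so the entry survives all bumps with that probability to the power
    of `bumps`.
    """
    return 1.0 - (1.0 - 1.0 / map_entries) ** bumps


# 100GB database, 8kB blocks, 20% daily turnover
modified_blocks = int(100 * 2**30 * 0.20 / 8192)  # about 2.6 million
small_map = fraction_bumped(1000003, modified_blocks)
large_map = fraction_bumped(10 * 1000003, modified_blocks)
print(f"{small_map:.1%} vs {large_map:.1%}")
```

    The first value comes out a little under 93%, in line with the rough
    figure quoted above, and it falls sharply once the map is ten times
    larger.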
    
    It also appears that your system can't really handle resizing of the
    map in any friendly way.  So if your data size grows, you may be faced
    with either letting the map become progressively less effective, or
    throwing it out and losing all the data you have.
    
    None of that is to say that what you're presenting here has no value,
    but I think it's possible to do better (and I think we should try).
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  15. Re: block-level incremental backup

    Ashwin Agrawal <aagrawal@pivotal.io> — 2019-04-10T16:56:42Z

    On Wed, Apr 10, 2019 at 9:21 AM Robert Haas <robertmhaas@gmail.com> wrote:
    
    > I have a related idea, though.  Suppose that, as Peter says upthread,
    > you have a replication slot that prevents old WAL from being removed.
    > You also have a background worker that is connected to that slot.  It
    > decodes WAL and produces summary files containing all block-references
    > extracted from those WAL records and the associated LSN (or maybe some
    > approximation of the LSN instead of the exact value, to allow for
    > compression and combining of nearby references).  Then you hold onto
    > those summary files after the actual WAL is removed.  Now, when
    > somebody asks the server for all blocks changed since a certain LSN,
    > it can use those summary files to figure out which blocks to send
    > without having to read all the pages in the database.  Although I
    > believe that a simple system that finds modified blocks by reading
    > them all is good enough for a first version of this feature and useful
    > in its own right, a more efficient system will be a lot more useful,
    > and something like this seems to me to be probably the best way to
    > implement it.
    >
    
    Not to fork the conversation from incremental backups, but a similar approach
    is what we have been thinking about for pg_rewind. Currently, pg_rewind
    requires all the WAL to be present on the source side from the point of
    divergence in order to rewind. Instead, we could just parse the WAL and keep
    the changed blocks around on the source. Then we don't need to retain the WAL
    but can still rewind using the changed-block map. So rewind becomes very
    similar to the incremental backup proposed here, after performing the rewind
    activity using target-side WAL only.
    
  16. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-10T17:03:26Z

    On Wed, Apr 10, 2019 at 7:51 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
    > > 9 апр. 2019 г., в 20:48, Robert Haas <robertmhaas@gmail.com> написал(а):
    > > - This is just a design proposal at this point; there is no code.  If
    > > this proposal, or some modified version of it, seems likely to be
    > > acceptable, I and/or my colleagues might try to implement it.
    >
    > I'll be happy to help with code, discussion and patch review.
    
    That would be great!
    
    We should probably give this discussion some more time before we
    plunge into the implementation phase, but I'd love to have some help
    with that, whether it's with coding or review or whatever.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  17. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-10T17:08:55Z

    On Wed, Apr 10, 2019 at 12:56 PM Ashwin Agrawal <aagrawal@pivotal.io> wrote:
    > Not to fork the conversation from incremental backups, but similar approach is what we have been thinking for pg_rewind. Currently, pg_rewind requires all the WAL logs to be present on source side from point of divergence to rewind. Instead just parse the wal and keep the changed blocks around on sourece. Then don't need to retain the WAL but can still rewind using the changed block map. So, rewind becomes much similar to incremental backup proposed here after performing rewind activity using target side WAL only.
    
    Interesting.  So if we build a system like this for incremental
    backup, or for pg_rewind, the other one can use the same
    infrastructure.  That sounds excellent.  I'll start a new thread to
    talk about that, and hopefully you and Heikki and others will chime in
    with thoughts.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  18. Re: block-level incremental backup

    Jehan-Guillaume de Rorthais <jgdr@dalibo.com> — 2019-04-10T18:21:45Z

    Hi,
    
    First thank you for your answer!
    
    On Wed, 10 Apr 2019 12:21:03 -0400
    Robert Haas <robertmhaas@gmail.com> wrote:
    
    > On Wed, Apr 10, 2019 at 10:57 AM Jehan-Guillaume de Rorthais
    > <jgdr@dalibo.com> wrote:
    > > My idea would be create a new tool working on archived WAL. No burden
    > > server side. Basic concept is:
    > >
    > > * parse archives
    > > * record latest relevant FPW for the incr backup
    > > * write new WALs with recorded FPW and removing/rewriting duplicated
    > > walrecords.
    > >
    > > It's just a PoC and I hadn't finished the WAL writing part...not even
    > > talking about the replay part. I'm not even sure this project is a good
    > > idea, but it is a good educational exercice to me in the meantime.
    > >
    > > Anyway, using real life OLTP production archives, my stats were:
    > >
    > >   # WAL   xlogrec kept     Size WAL kept
    > >     127            39%               50%
    > >     383            22%               38%
    > >     639            20%               29%
    > >
    > > Based on this stats, I expect this would save a lot of time during recovery
    > > in a first step. If it get mature, it might even save a lot of archives
    > > space or extend the retention period with degraded granularity. It would
    > > even help taking full backups with a lower frequency.
    > >
    > > Any thoughts about this design would be much appreciated. I suppose this
    > > should be offlist or in a new thread to avoid polluting this thread as this
    > > is a slightly different subject.  
    > 
    > Interesting idea, but I don't see how it can work if you only deal
    > with the FPWs and not the other records.  For instance, suppose that
    > you take a full backup at time T0, and then at time T1 there are two
    > modifications to a certain block in quick succession.  That block is
    > then never touched again.  Since no checkpoint intervenes between the
    > modifications, the first one emits an FPI and the second does not.
    > Capturing the FPI is fine as far as it goes, but unless you also do
    > something with the non-FPI change, you lose that second modification.
    > You could fix that by having your tool replicate the effects of WAL
    > apply outside the server, but that sounds like a ton of work and a ton
    > of possible bugs.
    
    In my current design, the scan is done backward from end to start and I keep all
    the records appearing after the last occurrence of their respective FPI.
    
    The next challenge I have to tackle is dealing with multi-block records
    where some blocks need to be removed and others are FPIs to keep (e.g. UPDATE).
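    
    A sketch of the backward scan described above, assuming each record
    touches a single block (multi-block records are exactly the open
    challenge). The record tuples and the function name are hypothetical.

```python
def records_to_keep(wal_records):
    """Return the LSNs of the records worth keeping, in WAL order.

    wal_records: list of (lsn, block, is_fpi) in WAL order.
    For each block, keep its last FPI and every record after that FPI;
    anything older for that block is redundant.
    """
    sealed = set()  # blocks whose most recent FPI has already been seen
    keep = []
    for lsn, block, is_fpi in reversed(wal_records):
        if block in sealed:
            continue  # older than the last FPI for this block
        keep.append(lsn)
        if is_fpi:
            sealed.add(block)
    keep.reverse()
    return keep


# Usage: block "A" gets a second FPI at LSN 4, so LSNs 1 and 2 are dropped.
wal = [(1, "A", True), (2, "A", False), (3, "B", True),
       (4, "A", True), (5, "A", False), (6, "B", False)]
kept = records_to_keep(wal)
```

    Non-FPI records that follow the last FPI of a block are kept, which is
    what preserves the second modification in the scenario quoted above.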
    
    > I have a related idea, though.  Suppose that, as Peter says upthread,
    > you have a replication slot that prevents old WAL from being removed.
    > You also have a background worker that is connected to that slot.  It
    > decodes WAL and produces summary files containing all block-references
    > extracted from those WAL records and the associated LSN (or maybe some
    > approximation of the LSN instead of the exact value, to allow for
    > compression and combining of nearby references).  Then you hold onto
    > those summary files after the actual WAL is removed.  Now, when
    > somebody asks the server for all blocks changed since a certain LSN,
    > it can use those summary files to figure out which blocks to send
    > without having to read all the pages in the database.  Although I
    > believe that a simple system that finds modified blocks by reading
    > them all is good enough for a first version of this feature and useful
    > in its own right, a more efficient system will be a lot more useful,
    > and something like this seems to me to be probably the best way to
    > implement it.
    
    Summary files look like what Andrey Borodin described as delta-files and
    change maps.
    
    > With an approach based
    > on WAL-scanning, the work is done in the background and nobody has to
    > wait for it.
    
    Agree with this.
    
    
    
    
  19. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-10T18:38:43Z

    On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais
    <jgdr@dalibo.com> wrote:
    > In my current design, the scan is done backward from end to start and I keep all
    > the records appearing after the last occurrence of their respective FPI.
    
    Oh, interesting.  That seems like it would require pretty major
    surgery on the WAL stream.
    
    > Summary files looks like what Andrey Borodin described as delta-files and
    > change maps.
    
    Yep.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  20. Re: block-level incremental backup

    Andres Freund <andres@anarazel.de> — 2019-04-10T18:55:51Z

    Hi,
    
    On 2019-04-10 14:38:43 -0400, Robert Haas wrote:
    > On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais
    > <jgdr@dalibo.com> wrote:
    > > In my current design, the scan is done backward from end to start and I keep all
    > > the records appearing after the last occurrence of their respective FPI.
    > 
    > Oh, interesting.  That seems like it would require pretty major
    > surgery on the WAL stream.
    
    Can't you just read each segment forward, and then reverse? That's not
    that much memory? And sure, there's some inefficient cases where records
    span many segments, but that's rare enough that reading a few segments
    several times doesn't strike me as particularly bad?
    
    Greetings,
    
    Andres Freund
    
    
    
    
  21. Re: block-level incremental backup

    Peter Eisentraut <peter.eisentraut@2ndquadrant.com> — 2019-04-10T19:42:47Z

    On 2019-04-10 17:31, Robert Haas wrote:
    > I think the way to think about this problem, or at least the way I
    > think about this problem, is that we need to decide whether want
    > file-level incremental backup, block-level incremental backup, or
    > byte-level incremental backup.
    
    That is a great analysis.  Seems like block-level is the preferred way
    forward.
    
    -- 
    Peter Eisentraut              http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
    
    
    
    
  22. Re: block-level incremental backup

    Konstantin Knizhnik <k.knizhnik@postgrespro.ru> — 2019-04-10T19:57:38Z

    
    On 10.04.2019 19:51, Robert Haas wrote:
    > On Wed, Apr 10, 2019 at 10:22 AM Konstantin Knizhnik
    > <k.knizhnik@postgrespro.ru> wrote:
    >> Some times ago I have implemented alternative version of ptrack utility
    >> (not one used in pg_probackup)
    >> which detects updated block at file level. It is very simple and may be
    >> it can be sometimes integrated in master.
    > I don't think this is completely crash-safe.  It looks like it
    > arranges to msync() the ptrack file at appropriate times (although I
    > haven't exhaustively verified the logic), but it uses MS_ASYNC, so
    > it's possible that the ptrack file could get updated on disk either
    > before or after the relation file itself.  I think before is probably
    > OK -- it just risks having some blocks look modified when they aren't
    > really -- but after seems like it is very much not OK.  And changing
    > this to use MS_SYNC would probably be really expensive.  Likely a
    > better approach would be to hook into the new fsync queue machinery
    > that Thomas Munro added to PostgreSQL 12.
    
    I do not think that MS_SYNC or the fsync queue is needed here.
    If a power failure or OS crash causes the loss of some writes to the ptrack
    map, then in any case Postgres will perform recovery, and updating pages from
    WAL will mark them in the ptrack map once again. So, as with CLOG
    and many other Postgres files, it is not critical to lose some writes,
    because they will be restored from WAL. And before truncating WAL,
    Postgres performs a checkpoint, which flushes all changes to disk,
    including ptrack map updates.
    
    
    > It looks like your system maps all the blocks in the system into a
    > fixed-size map using hashing.  If the number of modified blocks
    > between the full backup and the incremental backup is large compared
    > to the size of the ptrack map, you'll start to get a lot of
    > false-positives.  It will look as if much of the database needs to be
    > backed up.  For example, in your sample configuration, you have
    > ptrack_map_size = 1000003. If you've got a 100GB database with 20%
    > daily turnover, that's about 2.6 million blocks.  If you set bump a
    > random entry ~2.6 million times in a map with 1000003 entries, on the
    > average ~92% of the entries end up getting bumped, so you will get
    > very little benefit from incremental backup.  This problem drops off
    > pretty fast if you raise the size of the map, but it's pretty critical
    > that your map is large enough for the database you've got, or you may
    > as well not bother.
    This is why the ptrack block size should be larger than the page size.
    Assume that it is 1MB. 1MB is considered to be the optimal amount of disk
    I/O, at which frequent seeks do not degrade read speed (this is most
    critical for HDDs). In other words, reading 10 random pages (20%) from
    this 1MB block will take almost the same amount of time as (or even
    longer than) reading the whole 1MB in one operation.
    
    There will be just 100000 used entries in the ptrack map, with a very small
    probability of collision.
    Actually, I chose this size (1000003) for the ptrack map because, with a
    1MB block size, it allows mapping a 1TB database without a noticeable number
    of collisions, which seems to be enough for most Postgres installations.
    But increasing the ptrack map size 10 or even 100 times should not
    cause problems with modern RAM sizes either.
    
    >
    > It also appears that your system can't really handle resizing of the
    > map in any friendly way.  So if your data size grows, you may be faced
    > with either letting the map become progressively less effective, or
    > throwing it out and losing all the data you have.
    >
    > None of that is to say that what you're presenting here has no value,
    > but I think it's possible to do better (and I think we should try).
    >
    I definitely didn't consider the proposed patch a perfect solution, and
    it certainly requires improvements (and maybe a complete redesign).
    I just wanted to present this approach (maintaining a hash of block LSNs in
    mapped memory) and keeping track of modified blocks at the file level
    (unlike the current ptrack implementation, which logs changes in all places
    in the Postgres code where data is updated).
    
    Also, despite the fact that this patch may be considered a rough
    prototype, I have spent some time thinking about all aspects of this
    approach, including fault tolerance and false positives.
    
    
    
    
    
  23. Re: block-level incremental backup

    Jehan-Guillaume de Rorthais <jgdr@dalibo.com> — 2019-04-10T20:46:03Z

    On Wed, 10 Apr 2019 14:38:43 -0400
    Robert Haas <robertmhaas@gmail.com> wrote:
    
    > On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais
    > <jgdr@dalibo.com> wrote:
    > > In my current design, the scan is done backward from end to start and I
    > > keep all the records appearing after the last occurrence of their
    > > respective FPI.  
    > 
    > Oh, interesting.  That seems like it would require pretty major
    > surgery on the WAL stream.
    
    Indeed.
    
    Presently, the surgery in my code is replacing redundant xlogrecords with noops.
    
    I now have to deal with multi-block records. So far, I tried to mark unneeded
    blocks with !BKPBLOCK_HAS_DATA and made a simple patch in core to ignore such
    marked blocks, but it doesn't play well with dependencies between xlogrecords,
    e.g. during UPDATE. So my plan is to rewrite them to remove unneeded blocks
    using e.g. XLOG_FPI.
    
    As I wrote, this is mainly a hobby project right now for my own education. I'm
    not sure where it will lead me, but I learn a lot while working on it.
    
    
    
    
  24. Re: block-level incremental backup

    Jehan-Guillaume de Rorthais <jgdr@dalibo.com> — 2019-04-10T20:54:18Z

    On Wed, 10 Apr 2019 11:55:51 -0700
    Andres Freund <andres@anarazel.de> wrote:
    
    > Hi,
    > 
    > On 2019-04-10 14:38:43 -0400, Robert Haas wrote:
    > > On Wed, Apr 10, 2019 at 2:21 PM Jehan-Guillaume de Rorthais
    > > <jgdr@dalibo.com> wrote:  
    > > > In my current design, the scan is done backward from end to start and I
    > > > keep all the records appearing after the last occurrence of their
    > > > respective FPI.  
    > > 
    > > Oh, interesting.  That seems like it would require pretty major
    > > surgery on the WAL stream.  
    > 
    > Can't you just read each segment forward, and then reverse?
    
    Not sure what you mean.
    
    I first look for the very last XLOG record by jumping to the last WAL and
    scanning it forward. 
    
    Then, I do a backward scan from there to record the LSNs of the xlogrecords to keep.
    
    Finally, I clone each WAL and edit them as needed (as described in my previous
    email). This is my current WIP though.
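    
    The backward pass described above could be sketched roughly as follows.
    This is only an illustration under simplifying assumptions: every record
    touches exactly one block (the multi-block case discussed in the previous
    email is not handled), and the `Rec` type and its fields are hypothetical
    stand-ins, not PostgreSQL structures.

```python
from typing import NamedTuple


class Rec(NamedTuple):
    lsn: int       # position of the record in the WAL stream
    block: str     # hypothetical block identifier
    is_fpi: bool   # True if the record carries a full-page image


def records_to_keep(records):
    """Scan from the end of WAL backward and return the LSNs to keep.

    A record is kept if it is the last FPI of its block or appears after it.
    """
    sealed = set()   # blocks whose most recent FPI has already been seen
    keep = []
    for rec in reversed(records):
        if rec.block in sealed:
            continue             # older than the block's last FPI: redundant
        keep.append(rec.lsn)
        if rec.is_fpi:
            sealed.add(rec.block)
    keep.reverse()               # return LSNs in WAL order
    return keep


wal = [
    Rec(100, "A", True),
    Rec(110, "A", False),
    Rec(120, "B", False),
    Rec(130, "A", True),
    Rec(140, "A", False),
    Rec(150, "B", True),
]
kept = records_to_keep(wal)
print(kept)
```

    In this toy stream only the records at 130, 140 and 150 survive; the
    others would be the candidates for replacement with no-ops.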
    
    > That's not that much memory?
    
    I don't know yet. I have not measured it.
    
    
    
    
  25. Re: block-level incremental backup

    Michael Paquier <michael@paquier.xyz> — 2019-04-11T04:22:28Z

    On Wed, Apr 10, 2019 at 09:42:47PM +0200, Peter Eisentraut wrote:
    > That is a great analysis.  Seems like block-level is the preferred way
    > forward.
    
    In every solution related to incremental backups that I have seen from
    the community, the tendency is to prefer block-level backups because of
    the filtering that is possible based on the LSN in the page header.  The
    holes in the middle of the page are also easier to handle, so the
    size of an incremental page is reduced in the actual backup.  My preference
    tends toward a block-level approach if we were to do something in this
    area, though I fear that performance will be bad if we begin to scan
    all the relation files to fetch a set of blocks since a past LSN.
    Hence we need some kind of LSN map so that it is possible to skip
    one block or a group of blocks (say one LSN every 8/16 blocks for
    example) at once for a given relation if the relation is mostly
    read-only.
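    
    A minimal sketch of such an LSN map, assuming one stored LSN per group
    of 8 blocks. The class name, its methods and the in-memory dictionary
    are hypothetical; nothing here reflects an existing PostgreSQL
    structure, and a real map would need to be persistent and crash-safe.

```python
BLOCKS_PER_GROUP = 8   # assumed granularity; 16 would work the same way


class LsnMap:
    """Tracks the highest LSN written to each group of blocks."""

    def __init__(self):
        self.max_lsn = {}   # group index -> highest LSN seen in that group

    def record_write(self, blkno, lsn):
        group = blkno // BLOCKS_PER_GROUP
        if lsn > self.max_lsn.get(group, 0):
            self.max_lsn[group] = lsn

    def groups_to_scan(self, nblocks, threshold_lsn):
        """Return the groups that may contain a block with LSN >= threshold.

        Groups with no entry are treated as never modified, which is an
        assumption of this sketch; a real map would have to be conservative.
        """
        ngroups = (nblocks + BLOCKS_PER_GROUP - 1) // BLOCKS_PER_GROUP
        return [g for g in range(ngroups)
                if self.max_lsn.get(g, 0) >= threshold_lsn]


lsn_map = LsnMap()
lsn_map.record_write(blkno=3, lsn=500)
lsn_map.record_write(blkno=20, lsn=900)
lsn_map.record_write(blkno=21, lsn=950)
print(lsn_map.groups_to_scan(nblocks=32, threshold_lsn=800))
```

    With a threshold of 800 only the third group (blocks 16 to 23) has to
    be read; the other three groups of the 32-block relation are skipped.
    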
    --
    Michael
    
  26. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-11T13:45:52Z

    On Thu, Apr 11, 2019 at 12:22 AM Michael Paquier <michael@paquier.xyz> wrote:
    > incremental page size is reduced in the actual backup.  My preference
    > tends toward a block-level approach if we were to do something in this
    > area, though I fear that performance will be bad if we begin to scan
    > all the relation files to fetch a set of blocks since a past LSN.
    > Hence we need some kind of LSN map so as it is possible to skip
    > one block or a group of blocks (say one LSN every 8/16 blocks for
    > example) at once for a given relation if the relation is mostly
    > read-only.
    
    So, in this thread, I want to focus on the UI and how the incremental
    backup is stored on disk.  Making the process of identifying modified
    blocks efficient is the subject of
    http://postgr.es/m/CA+TgmoahOeuuR4pmDP1W=JnRyp4fWhynTOsa68BfxJq-qB_53A@mail.gmail.com
    
    Over there, the merits of what you are describing here and the
    competing approaches are under discussion.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  27. Re: block-level incremental backup

    Anastasia Lubennikova <a.lubennikova@postgrespro.ru> — 2019-04-11T17:29:29Z

    09.04.2019 18:48, Robert Haas writes:
    > Thoughts?
    Hi,
    Thank you for bringing that up.
    In-core support of incremental backups is a long-awaited feature.
    Hopefully, this take will end up committed in PG13.
    
    Speaking of UI:
    1) I agree that it should be implemented as a new replication command.
    
    2) There should be a command to get only a map of changes without actual 
    data.
    
    Most backup tools establish server connection, so they can use this 
    protocol to get the list of changed blocks.
    Then they can use this information for any purpose. For example, 
    distribute files between parallel workers to copy the data,
    or estimate backup size before data is sent, or store metadata 
    separately from the data itself.
    Most methods (except straightforward LSN comparison) consist of two 
    steps: get a map of changes and read blocks.
    So it won't add much extra work.
    
    example commands:
    GET_FILELIST [lsn]
    returning json (or whatever) with filenames and maps of changed blocks
    
    Map format is also the subject of discussion.
    Now in pg_probackup we reuse code from pg_rewind/datapagemap,
    not sure if this format is good for sending data via the protocol, though.
    
    3) The API should provide functions to request data with a granularity 
    of file and block.
    It will be useful for parallelism and for various future projects.
    
    example commands:
    GET_DATAFILE [filename [map of blocks] ]
    GET_DATABLOCK [filename] [blkno]
    returning data in some format
    
    4) The algorithm of collecting changed blocks is another topic.
    Though, its API should be discussed here:
    
    Do we want to have multiple implementations?
    Personally, I think that it's good to provide several strategies,
    since they have different requirements and fit for different workloads.
    
    Maybe we can add a hook to allow custom implementations.
    
    Do we want to allow the backup client to tell what block collection 
    method to use?
    example commands:
    GET_FILELIST [lsn] [METHOD lsn | page | ptrack | etc]
    Or should it be server-side cost-based decision?
    
    5) The method based on LSN comparison stands out - it can be done in one 
    pass.
    So it probably requires special protocol commands.
    for example:
    GET_DATAFILES [lsn]
    GET_DATAFILE [filename] [lsn]
    
    This is pretty simple to implement and pg_basebackup can use this method,
    at least until we have something more advanced in-core.
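    
    To make the "map of changed blocks" idea from points 2) and 3) a bit
    more concrete, here is a rough sketch of what a GET_FILELIST reply
    could carry. Everything in it is an assumption for illustration: the
    JSON shape, the hex encoding and the function names are invented, and
    the bitmap is only loosely inspired by the per-block bitmap idea, not
    a copy of any existing format.

```python
import json


def blocks_to_bitmap(blocks):
    """Encode block numbers as a bitmap: bit (blkno % 8) of byte (blkno // 8)."""
    if not blocks:
        return b""
    bitmap = bytearray(max(blocks) // 8 + 1)
    for blkno in blocks:
        bitmap[blkno // 8] |= 1 << (blkno % 8)
    return bytes(bitmap)


def bitmap_to_blocks(bitmap):
    """Decode a bitmap back into a sorted list of block numbers."""
    return [i * 8 + bit
            for i, byte in enumerate(bitmap)
            for bit in range(8)
            if byte & (1 << bit)]


def filelist_reply(changed):
    """Build a hypothetical reply: file name -> hex-encoded bitmap."""
    return json.dumps({name: blocks_to_bitmap(blks).hex()
                       for name, blks in sorted(changed.items())})


reply = filelist_reply({"base/16384/16417": [0, 3, 9]})
decoded = {name: bitmap_to_blocks(bytes.fromhex(hexmap))
           for name, hexmap in json.loads(reply).items()}
print(reply)
print(decoded)
```

    A client could use such a map to split work between parallel workers
    or to estimate the backup size before requesting any data.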
    
    I'll be happy to help with design, code, review, and testing.
    Hope that my experience with pg_probackup will be useful.
    
    -- 
    Anastasia Lubennikova
    Postgres Professional: http://www.postgrespro.com
    The Russian Postgres Company
    
    
    
    
    
  28. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-04-15T13:01:11Z

    Greetings,
    
    * Robert Haas (robertmhaas@gmail.com) wrote:
    > Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
    > developed technology that permits a block-level incremental backup to
    > be taken from a PostgreSQL server.  I believe the idea in all of those
    > cases is that non-relation files should be backed up in their
    > entirety, but for relation files, only those blocks that have been
    > changed need to be backed up.
    
    I love the general idea of having additional facilities in core to
    support block-level incremental backups.  I've long been unhappy that
    any such approach ends up being limited to a subset of the files which
    need to be included in the backup, meaning the rest of the files have to
    be backed up in their entirety.  I don't think we have to solve for that
    as part of this, but I'd like to see a discussion for how to deal with
    the other files which are being backed up to avoid needing to just
    wholesale copy them.
    
    > I would like to propose that we should
    > have a solution for this problem in core, rather than leaving it to
    > each individual PostgreSQL company to develop and maintain their own
    > solution. 
    
    I'm certainly a fan of improving our in-core backup solutions.
    
    I'm quite concerned that trying to graft this on to pg_basebackup
    (which, as you note later, is missing an awful lot of what users expect
    from a real backup solution already- retention handling, parallel
    capabilities, WAL archive management, and many more... but also is just
    not nearly as developed a tool as the external solutions) is going to
    make things unnecessarily difficult when what we really want here is
    better support from core for block-level incremental backup for the
    existing external tools to leverage.
    
    Perhaps there's something here which can be done with pg_basebackup to
    have it work with the block-level approach, but I certainly don't see
    it as a natural next step for it, and it really does seem like limiting the
    way this is implemented to something that pg_basebackup can easily
    digest might make it less useful for the more developed tools.
    
    As an example, I believe all of the other tools mentioned (at least,
    those that are open source I'm pretty sure all do) support parallel
    backup and therefore having a way to get the block-level changes in a
    parallel fashion would be a pretty big thing that those tools will want
    and pg_basebackup is single-threaded today and this proposal doesn't
    seem to be contemplating changing that, implying that a serial-based
    block-level protocol would be fine but that'd be a pretty awful
    restriction for the other tools.
    
    > Generally my idea is:
    > 
    > 1. There should be a way to tell pg_basebackup to request from the
    > server only those blocks where LSN >= threshold_value.  There are
    > several possible ways for the server to implement this, the simplest
    > of which is to just scan all the blocks and send only the ones that
    > satisfy that criterion.  That might sound dumb, but it does still save
    > network bandwidth, and it works even without any prior setup. It will
    > probably be more efficient in many cases to instead scan all the WAL
    > generated since that LSN and extract block references from it, but
    > that is only possible if the server has all of that WAL available or
    > can somehow get it from the archive.  We could also, as several people
    > have proposed previously, have some kind of additional relation for
    > that stores either a single is-modified bit -- which only helps if the
    > reference LSN for the is-modified bit is older than the requested LSN
    > but not too much older -- or the highest LSN for each range of K
    > blocks, or something like that.  I am at the moment not too concerned
    > with the exact strategy we use here. I believe we may want to
    > eventually support more than one, since they have different
    > trade-offs.
    
    This part of the discussion is another example of how we're limiting
    ourselves in this implementation to the "pg_basebackup can work with
    this" case- by only considering the options of "scan all the files" or
    "use the WAL- if the request is for WAL we have available on the
    server."  The other backup solutions mentioned in your initial email,
    and others that weren't, have a WAL archive which includes a lot more
    WAL than just what the primary currently has.  When I've thought about
    how WAL could be used to build a differential or incremental backup, the
    question of "do we have all the WAL we need" hasn't ever been a
    consideration- because the backup tool manages the WAL archive and has
    WAL going back across, most likely, weeks or even months.  Having a tool
    which can essentially "compress" WAL would be fantastic and would be
    able to be leveraged by all of the different backup solutions.
    
    > 2. When you use pg_basebackup in this way, each relation file that is
    > not sent in its entirety is replaced by a file with a different name.
    > For example, instead of base/16384/16417, you might get
    > base/16384/partial.16417 or however we decide to name them.  Each such
    > file will store near the beginning of the file a list of all the
    > blocks contained in that file, and the blocks themselves will follow
    > at offsets that can be predicted from the metadata at the beginning of
    > the file.  The idea is that you shouldn't have to read the whole file
    > to figure out which blocks it contains, and if you know specifically
    > what blocks you want, you should be able to reasonably efficiently
    > read just those blocks.  A backup taken in this manner should also
    > probably create some kind of metadata file in the root directory that
    > stops the server from starting and lists other salient details of the
    > backup.  In particular, you need the threshold LSN for the backup
    > (i.e. contains blocks newer than this) and the start LSN for the
    > backup (i.e. the LSN that would have been returned from
    > pg_start_backup).
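    
    The partial-file layout quoted above (a block list near the start,
    followed by the blocks at predictable offsets) might look something
    like the sketch below. The exact header fields, the little-endian
    32-bit encoding and the function names are assumptions made for
    illustration; the proposal does not fix a format.

```python
import io
import struct

BLCKSZ = 8192   # default PostgreSQL block size


def write_partial(out, blocks):
    """Write header (count, then block numbers) followed by the block data."""
    blknos = sorted(blocks)
    out.write(struct.pack("<I", len(blknos)))
    out.write(struct.pack(f"<{len(blknos)}I", *blknos))
    for blkno in blknos:
        if len(blocks[blkno]) != BLCKSZ:
            raise ValueError(f"block {blkno} is not {BLCKSZ} bytes")
        out.write(blocks[blkno])


def read_block(f, wanted):
    """Fetch one block by reading only the header and that block."""
    f.seek(0)
    (count,) = struct.unpack("<I", f.read(4))
    blknos = struct.unpack(f"<{count}I", f.read(4 * count))
    if wanted not in blknos:
        return None   # block not in this partial file; look in an older backup
    offset = 4 + 4 * count + blknos.index(wanted) * BLCKSZ
    f.seek(offset)
    return f.read(BLCKSZ)


buf = io.BytesIO()
write_partial(buf, {5: b"\x05" * BLCKSZ, 2: b"\x02" * BLCKSZ})
block5 = read_block(buf, 5)
missing = read_block(buf, 3)
```

    Because the offset of each block follows from its position in the
    header list, a reader never has to scan the block data itself.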
    
    Two things here- having some file that "stops the server from starting"
    is just going to cause a lot of pain, in my experience.  Users do a lot
    of really rather.... curious things, and then come asking questions
    about them, and removing the file that stopped the server from starting
    is going to quickly become one of those questions on stack overflow that
    people just follow the highest-ranked question for, even though everyone
    who follows this list will know that doing so results in corruption of
    the database.
    
    An alternative approach in developing this feature would be to have
    pg_basebackup have an option to run against an *existing* backup, with
    the entire point being that the existing backup is updated with these
    incremental changes, instead of having some independent tool which takes
    the result of multiple pg_basebackup runs and then combines them.
    
    An alternative tool might be one which simply reads the WAL and keeps
    track of the FPIs and the updates and then eliminates any duplication
    which exists in the set of WAL provided (that is, multiple FPIs for the
    same page would be merged into one, and only the delta changes to that
    page are preserved, across the entire set of WAL being combined).  Of
    course, that's complicated by having to deal with the other files in the
    database, so it wouldn't really work on its own.
    
    > 3. There should be a new tool that knows how to merge a full backup
    > with any number of incremental backups and produce a complete data
    > directory with no remaining partial files.  The tool should check that
    > the threshold LSN for each incremental backup is less than or equal to
    > the start LSN of the previous backup; if not, there may be changes
    > that happened in between which would be lost, so combining the backups
    > is unsafe.  Running this tool can be thought of either as restoring
    > the backup or as producing a new synthetic backup from any number of
    > incremental backups.  This would allow for a strategy of unending
    > incremental backups.  For instance, on day 1, you take a full backup.
    > On every subsequent day, you take an incremental backup.  On day 9,
    > you run pg_combinebackup day1 day2 -o full; rm -rf day1 day2; mv full
    > day2.  On each subsequent day you do something similar.  Now you can
    > always roll back to any of the last seven days by combining the oldest
    > backup you have (which is always a synthetic full backup) with as many
    > newer incrementals as you want, up to the point where you want to
    > stop.
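    
    The safety check in the quoted point 3 can be written down in a few
    lines. This is a sketch only: the `Backup` record and `check_chain`
    function are hypothetical, and LSNs are modeled as plain integers.

```python
from typing import NamedTuple, Optional


class Backup(NamedTuple):
    name: str
    start_lsn: int                 # LSN at which the backup started
    threshold_lsn: Optional[int]   # None for a full backup


def check_chain(backups):
    """Raise ValueError if combining the backups in this order is unsafe."""
    if not backups or backups[0].threshold_lsn is not None:
        raise ValueError("chain must begin with a full backup")
    for prev, cur in zip(backups, backups[1:]):
        if cur.threshold_lsn is None:
            raise ValueError(f"{cur.name}: full backup in the middle of a chain")
        if cur.threshold_lsn > prev.start_lsn:
            # Changes made between prev.start_lsn and cur.threshold_lsn
            # are in neither backup, so they would be lost.
            raise ValueError(f"{cur.name}: gap after {prev.name}")


day1 = Backup("day1", start_lsn=1000, threshold_lsn=None)
day2 = Backup("day2", start_lsn=2000, threshold_lsn=1000)
day3 = Backup("day3", start_lsn=3000, threshold_lsn=2000)
check_chain([day1, day2, day3])   # passes silently

try:
    check_chain([day1, day3])     # day2 is missing from the chain
    gap_detected = False
except ValueError:
    gap_detected = True
```

    Skipping day2 is rejected because day3 only contains blocks newer than
    LSN 2000, while day1 stops at LSN 1000.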
    
    I'd really prefer that we avoid adding in another low-level tool like
    the one described here.  Users, imv anyway, don't want to deal with
    *more* tools for handling this aspect of backup/recovery.  If we had a
    tool in core today which managed multiple backups, kept track of them,
    and all of the WAL during and between them, then we could add options to
    that tool to do what's being described here in a way that makes sense
    and provides a good interface to users.  I don't know that we're going
    to be able to do that with pg_basebackup when, really, the goal here
    isn't actually to make pg_basebackup into an enterprise backup tool,
    it's to make things easier for the external tools to do block-level
    backups.
    
    Thanks!
    
    Stephen
    
  29. Re: block-level incremental backup

    Bruce Momjian <bruce@momjian.us> — 2019-04-15T16:48:57Z

    On Mon, Apr 15, 2019 at 09:01:11AM -0400, Stephen Frost wrote:
    > Greetings,
    > 
    > * Robert Haas (robertmhaas@gmail.com) wrote:
    > > Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
    > > developed technology that permits a block-level incremental backup to
    > > be taken from a PostgreSQL server.  I believe the idea in all of those
    > > cases is that non-relation files should be backed up in their
    > > entirety, but for relation files, only those blocks that have been
    > > changed need to be backed up.
    > 
    > I love the general idea of having additional facilities in core to
    > support block-level incremental backups.  I've long been unhappy that
    > any such approach ends up being limited to a subset of the files which
    > need to be included in the backup, meaning the rest of the files have to
    > be backed up in their entirety.  I don't think we have to solve for that
    > as part of this, but I'd like to see a discussion for how to deal with
    > the other files which are being backed up to avoid needing to just
    > wholesale copy them.
    
    I assume you are talking about non-heap/index files.  Which of those are
    large enough to benefit from incremental backup?
    
    > > I would like to propose that we should
    > > have a solution for this problem in core, rather than leaving it to
    > > each individual PostgreSQL company to develop and maintain their own
    > > solution. 
    > 
    > I'm certainly a fan of improving our in-core backup solutions.
    > 
    > I'm quite concerned that trying to graft this on to pg_basebackup
    > (which, as you note later, is missing an awful lot of what users expect
    > from a real backup solution already- retention handling, parallel
    > capabilities, WAL archive management, and many more... but also is just
    > not nearly as developed a tool as the external solutions) is going to
    > make things unnecessarily difficult when what we really want here is
    > better support from core for block-level incremental backup for the
    > existing external tools to leverage.
    
    I think there is some interesting complexity brought up in this thread. 
    Which options are going to minimize storage I/O, network I/O, have only
    background overhead, allow parallel operation, integrate with
    pg_basebackup.  Eventually we will need to evaluate the incremental
    backup options against these criteria.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
    + As you are, so once was I.  As I am, so you will be. +
    +                      Ancient Roman grave inscription +
    
    
    
    
  30. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-15T18:14:31Z

    On Thu, Apr 11, 2019 at 1:29 PM Anastasia Lubennikova
    <a.lubennikova@postgrespro.ru> wrote:
    > 2) There should be a command to get only a map of changes without actual
    > data.
    
    Good idea.
    
    > 4) The algorithm of collecting changed blocks is another topic.
    > Though, its API should be discussed here:
    >
    > Do we want to have multiple implementations?
    > Personally, I think that it's good to provide several strategies,
    > since they have different requirements and fit for different workloads.
    >
    > Maybe we can add a hook to allow custom implementations.
    
    I'm not sure a hook is going to be practical, but I do think we want
    more than one strategy.
    
    > I'll be happy to help with design, code, review, and testing.
    > Hope that my experience with pg_probackup will be useful.
    
    Great, thanks!
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  31. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-15T18:52:32Z

    On Mon, Apr 15, 2019 at 9:01 AM Stephen Frost <sfrost@snowman.net> wrote:
    > I love the general idea of having additional facilities in core to
    > support block-level incremental backups.  I've long been unhappy that
    > any such approach ends up being limited to a subset of the files which
    > need to be included in the backup, meaning the rest of the files have to
    > be backed up in their entirety.  I don't think we have to solve for that
    > as part of this, but I'd like to see a discussion for how to deal with
    > the other files which are being backed up to avoid needing to just
    > wholesale copy them.
    
    Ideas?  Generally, I don't think that anything other than the main
    forks of relations are worth worrying about, because the files are too
    small to really matter.  Even if they're big, the main forks of
    relations will be much bigger.  I think.
    
    > I'm quite concerned that trying to graft this on to pg_basebackup
    > (which, as you note later, is missing an awful lot of what users expect
    > from a real backup solution already- retention handling, parallel
    > capabilities, WAL archive management, and many more... but also is just
    > not nearly as developed a tool as the external solutions) is going to
    > make things unnecessarily difficult when what we really want here is
    > better support from core for block-level incremental backup for the
    > existing external tools to leverage.
    >
    > Perhaps there's something here which can be done with pg_basebackup to
    > have it work with the block-level approach, but I certainly don't see
    > it as a natural next step for it and really does seem like limiting the
    > way this is implemented to something that pg_basebackup can easily
    > digest might make it less useful for the more developed tools.
    
    I agree that there are a bunch of things that pg_basebackup does not
    do, such as backup management.  I think a lot of users do not want
    PostgreSQL to do backup management for them.  They have an existing
    solution that they use to manage backups, and they want PostgreSQL to
    interoperate with it. I think it makes sense for pg_basebackup to be
    in charge of taking the backup, and then other tools can either use it
    as a building block or use the streaming replication protocol to send
    approximately the same commands to the server.  I certainly would not
    want to expose server capabilities that let you take an incremental
    backup and NOT teach pg_basebackup to use them -- then we'd be in a
    situation of saying that PostgreSQL has incremental backup, but you
    have to get external tool XYZ to use it.  That will be perceived as
    PostgreSQL does NOT have incremental backup and this external tool
    adds it.
    
    > As an example, I believe all of the other tools mentioned (at least,
    > those that are open source I'm pretty sure all do) support parallel
    > backup and therefore having a way to get the block-level changes in a
    > parallel fashion would be a pretty big thing that those tools will want
    > and pg_basebackup is single-threaded today and this proposal doesn't
    > seem to be contemplating changing that, implying that a serial-based
    > block-level protocol would be fine but that'd be a pretty awful
    > restriction for the other tools.
    
    I mentioned this exact issue in my original email.  I spoke positively
    of it.  But I think it is different from what is being proposed here.
    We could have parallel backup without incremental backup, and that
    would be a good feature.  We could have incremental backup without
    parallel backup, and that would also be a good feature.  We could also have
    both, which would be best of all.  I don't see that my proposal throws
    up any architectural obstacle to parallelism.  I assume parallel
    backup, whether full or incremental, would be implemented by dividing
    up the files that need to be sent across the available connections; if
    incremental backup exists, each connection then has to decide whether
    to send the whole file or only part of it.
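    
    As a rough illustration of that division of work, the sketch below
    spreads files across connections by size and lets each connection
    decide per file between a whole-file and a partial send. The greedy
    strategy, the function names and the decision rule are all assumptions
    for the example, not part of the proposal.

```python
import heapq


def divide_files(files, nconn):
    """Assign files (name -> size) to nconn connections, largest first,
    always giving the next file to the least-loaded connection."""
    heap = [(0, i) for i in range(nconn)]   # (bytes assigned, connection)
    assignment = [[] for _ in range(nconn)]
    for name, size in sorted(files.items(), key=lambda kv: (-kv[1], kv[0])):
        load, conn = heapq.heappop(heap)
        assignment[conn].append(name)
        heapq.heappush(heap, (load + size, conn))
    return assignment


def send_mode(nblocks, changed_blocks):
    """Decide how to send one file; changed_blocks is None if unknown."""
    if changed_blocks is None or len(changed_blocks) >= nblocks:
        return "whole"
    return "partial"


plan = divide_files({"a": 100, "b": 60, "c": 50, "d": 10}, nconn=2)
print(plan)
print(send_mode(nblocks=128, changed_blocks={4, 9}))
```

    Here both connections end up with 110 units of data, and a file with
    only two changed blocks out of 128 would be sent as a partial file.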
    
    > This part of the discussion is another example of how we're limiting
    > ourselves in this implementation to the "pg_basebackup can work with
    > this" case- by only considering the options of "scan all the files" or
    > "use the WAL- if the request is for WAL we have available on the
    > server."  The other backup solutions mentioned in your initial email,
    > and others that weren't, have a WAL archive which includes a lot more
    > WAL than just what the primary currently has.  When I've thought about
    > how WAL could be used to build a differential or incremental backup, the
    > question of "do we have all the WAL we need" hasn't ever been a
    > consideration- because the backup tool manages the WAL archive and has
    > WAL going back across, most likely, weeks or even months.  Having a tool
    > which can essentially "compress" WAL would be fantastic and would be
    > able to be leveraged by all of the different backup solutions.
    
    I don't think this is a case of limiting ourselves; I think it's a
    case of keeping separate considerations properly separate.  As I said
    in my original email, the client doesn't really need to know how the
    server is identifying the blocks that have been modified.  That is the
    server's job.  I started a separate thread on the WAL-scanning
    approach, so we should take that part of the discussion over there.  I
    see no reason why the server couldn't be taught to reach back into an
    available archive for WAL that it no longer has locally, but that's
    really independent of the design ideas being discussed on this thread.
    
    > Two things here- having some file that "stops the server from starting"
    > is just going to cause a lot of pain, in my experience.  Users do a lot
    > of really rather.... curious things, and then come asking questions
    > about them, and removing the file that stopped the server from starting
    > is going to quickly become one of those questions on stack overflow that
    > people just follow the highest-ranked question for, even though everyone
    > who follows this list will know that doing so results in corruption of
    > the database.
    
    Wait, you want to make it maximally easy for users to start the server
    in a state that is 100% certain to result in a corrupted and unusable
    database?  Why?? I'd like to make that a tiny bit difficult.  If
    they really want a corrupted database, they can remove the file.
    
    > An alternative approach in developing this feature would be to have
    > pg_basebackup have an option to run against an *existing* backup, with
    > the entire point being that the existing backup is updated with these
    > incremental changes, instead of having some independent tool which takes
    > the result of multiple pg_basebackup runs and then combines them.
    
    That would be really unsafe, because if the tool is interrupted before
    it finishes (and fsyncs everything), you no longer have any usable
    backup.  It also doesn't lend itself to several of the scenarios I
    described in my original email -- like endless incrementals that are
    merged into the full backup after some number of days -- a capability
    upon which others have already remarked positively.
    
    > An alternative tool might be one which simply reads the WAL and keeps
    > track of the FPIs and the updates and then eliminates any duplication
    > which exists in the set of WAL provided (that is, multiple FPIs for the
    > same page would be merged into one, and only the delta changes to that
    > page are preserved, across the entire set of WAL being combined).  Of
    > course, that's complicated by having to deal with the other files in the
    > database, so it wouldn't really work on its own.
    
    You've jumped back to solving the server's problem (which blocks
    should I send?) rather than the client's problem (what does an
    incremental backup look like once I've taken it and how do I manage
    and restore them?).  It does seem possible to figure out the contents
    of modified blocks strictly from looking at the WAL, without any
    examination of the current database contents.  However, it also seems
    very complicated, because the tool that is figuring out the current
    block contents just by looking at the WAL would have to know how to
    apply any type of WAL record, not just one that contains an FPI.  And
    I really don't want to build a client-side tool that knows how to
    apply WAL.
    
    > I'd really prefer that we avoid adding in another low-level tool like
    > the one described here.  Users, imv anyway, don't want to deal with
    > *more* tools for handling this aspect of backup/recovery.  If we had a
    > tool in core today which managed multiple backups, kept track of them,
    > and all of the WAL during and between them, then we could add options to
    > that tool to do what's being described here in a way that makes sense
    > and provides a good interface to users.  I don't know that we're going
    > to be able to do that with pg_basebackup when, really, the goal here
    > isn't actually to make pg_basebackup into an enterprise backup tool,
    > it's to make things easier for the external tools to do block-level
    > backups.
    
    Well, I agree with you that the goal is not to make pg_basebackup an
    enterprise backup tool.  However, I don't see teaching it to take
    incremental backups as opposed to that goal.  I think backup
    management and retention should remain firmly outside the purview of
    pg_basebackup and left either to some other in-core tool or maybe even
    to out-of-core tools.  However, I don't see any reason why the
    task of taking an incremental and/or parallel backup should also be
    left to another tool.
    
    There is a very close relationship between the thing that
    pg_basebackup already does (copy everything) and the thing that we
    want to do here (copy everything except blocks that we know haven't
    changed). If we made it the job of some other tool to take parallel
    and/or incremental backups, that other tool would need to reimplement
    a lot of things that pg_basebackup has already got, like tar vs. plain
    format, fast vs. spread checkpoint, rate-limiting, compression levels,
    etc.  That seems like a waste.  Better to give pg_basebackup the
    capability to do those things, and then any backup management tool
    that anyone writes can take advantage of those capabilities.
    
    I come at this, BTW, from the perspective of having just spent a bunch
    of time working on EDB's Backup And Recovery Tool (BART).  That tool
    works in exactly the manner you seem to be advocating: it knows how to
    do incremental and parallel full backups, and it also does backup
    management.  However, this has not turned out to be the best division
    of labor.  People who don't want to use the backup management
    capabilities may still want the parallel or incremental backup
    capabilities, and if all of that is within the envelope of an
    "enterprise backup tool," they don't have that option.  So I want to
    split it up.  I want pg_basebackup to take all the kinds of backups
    that PostgreSQL supports -- full, incremental, parallel, serial,
    whatever -- and I want some other tool -- pgBackRest, BART, barman, or
    some yet-to-be-invented core thing to do the management of those
    backups.  Then everybody can use exactly the bits they want.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  32. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-04-16T21:44:32Z

    Greetings,
    
    * Bruce Momjian (bruce@momjian.us) wrote:
    > On Mon, Apr 15, 2019 at 09:01:11AM -0400, Stephen Frost wrote:
    > > * Robert Haas (robertmhaas@gmail.com) wrote:
    > > > Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
    > > > developed technology that permits a block-level incremental backup to
    > > > be taken from a PostgreSQL server.  I believe the idea in all of those
    > > > cases is that non-relation files should be backed up in their
    > > > entirety, but for relation files, only those blocks that have been
    > > > changed need to be backed up.
    > > 
    > > I love the general idea of having additional facilities in core to
    > > support block-level incremental backups.  I've long been unhappy that
    > > any such approach ends up being limited to a subset of the files which
    > > need to be included in the backup, meaning the rest of the files have to
    > > be backed up in their entirety.  I don't think we have to solve for that
    > > as part of this, but I'd like to see a discussion for how to deal with
    > > the other files which are being backed up to avoid needing to just
    > > wholesale copy them.
    > 
    > I assume you are talking about non-heap/index files.  Which of those are
    > large enough to benefit from incremental backup?
    
    Based on discussions I had with Andrey, specifically the visibility map
    is an issue for them with WAL-G.  I haven't spent a lot of time thinking
    about it, but I can understand how that could be an issue.
    
    > > I'm quite concerned that trying to graft this on to pg_basebackup
    > > (which, as you note later, is missing an awful lot of what users expect
    > > from a real backup solution already- retention handling, parallel
    > > capabilities, WAL archive management, and many more... but also is just
    > > not nearly as developed a tool as the external solutions) is going to
    > > make things unnecessarily difficult when what we really want here is
    > > better support from core for block-level incremental backup for the
    > > existing external tools to leverage.
    > 
    > I think there is some interesting complexity brought up in this thread. 
    > Which options are going to minimize storage I/O, network I/O, have only
    > background overhead, allow parallel operation, integrate with
    > pg_basebackup.  Eventually we will need to evaluate the incremental
    > backup options against these criteria.
    
    This presumes that we're going to have multiple competing incremental
    backup options presented, doesn't it?  Are you aware of another effort
    going on which aims for inclusion in core?  There's been past attempts
    made, but I don't believe there's anyone else currently planning to or
    working on something for inclusion in core.
    
    Just to be clear- we're not currently working on one, but I'd really
    like to see core provide good support for incremental block-level backup
    so that we can leverage it when it is there.
    
    Thanks!
    
    Stephen
    
  33. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-16T22:40:44Z

    On Tue, Apr 16, 2019 at 5:44 PM Stephen Frost <sfrost@snowman.net> wrote:
    > > > I love the general idea of having additional facilities in core to
    > > > support block-level incremental backups.  I've long been unhappy that
    > > > any such approach ends up being limited to a subset of the files which
    > > > need to be included in the backup, meaning the rest of the files have to
    > > > be backed up in their entirety.  I don't think we have to solve for that
    > > > as part of this, but I'd like to see a discussion for how to deal with
    > > > the other files which are being backed up to avoid needing to just
    > > > wholesale copy them.
    > >
    > > I assume you are talking about non-heap/index files.  Which of those are
    > > large enough to benefit from incremental backup?
    >
    > Based on discussions I had with Andrey, specifically the visibility map
    > is an issue for them with WAL-G.  I haven't spent a lot of time thinking
    > about it, but I can understand how that could be an issue.
    
    If I understand correctly, the VM contains 1 byte per 4 heap pages and
    the FSM contains 1 byte per heap page (plus some overhead for higher
    levels of the tree).  Since the FSM is not WAL-logged, I'm not sure
    there's a whole lot we can do to avoid having to back it up, although
    maybe there's some clever idea I'm not quite seeing.  The VM is
    WAL-logged, albeit with some strange warts that I have the honor of
    inventing, so there's more possibilities there.
    
    Before worrying about it too much, it would be useful to hear more
    about the concerns related to these forks, so that we make sure we're
    solving the right problem.  It seems difficult for a single relation
    to be big enough for these to be much of an issue.  For example, on a
    1TB relation, we have 2^40 bytes = 2^27 pages = ~2^25 bytes of VM fork
    = 32MB.  Not nothing, but 32MB of useless overhead every time you back
    up a 1TB database probably isn't going to break the bank.  It might be
    more of a concern for users with many small tables.  For example, if
    somebody has got a million tables with 1 page in each one, they'll
    have a million data pages, a million VM pages, and 3 million FSM pages
    (unless the new don't-create-the-FSM-for-small-tables stuff in v12
    kicks in).  I don't know if it's worth going to a lot of trouble to
    optimize that case.  Creating a million tables with 100 tuples (or
    whatever) in each one sounds like terrible database design to me.
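
    The arithmetic above can be checked with a rough sketch.  It assumes
    the default 8kB block size and uses only the ratios stated above (1
    VM byte per 4 heap pages, 1 FSM byte per heap page); it ignores page
    headers and the upper levels of the FSM tree, so the results are
    lower bounds rather than exact fork sizes.  The function name is
    made up for illustration.

```python
BLOCK_SIZE = 8192             # assumed default PostgreSQL block size
HEAP_PAGES_PER_VM_BYTE = 4    # "1 byte per 4 heap pages"
FSM_BYTES_PER_HEAP_PAGE = 1   # "1 byte per heap page", leaf level only


def fork_overhead(relation_bytes):
    """Return (heap_pages, vm_bytes, fsm_bytes) for a relation of the given size."""
    heap_pages = relation_bytes // BLOCK_SIZE
    vm_bytes = heap_pages // HEAP_PAGES_PER_VM_BYTE
    fsm_bytes = heap_pages * FSM_BYTES_PER_HEAP_PAGE
    return heap_pages, vm_bytes, fsm_bytes


heap_pages, vm_bytes, fsm_bytes = fork_overhead(2**40)  # a 1TB relation
print(heap_pages == 2**27)            # True
print(vm_bytes // 2**20, "MB VM")     # 32 MB VM
print(fsm_bytes // 2**20, "MB FSM")   # 128 MB FSM
```

    For a 1TB relation this gives 2^27 heap pages and 2^25 bytes (32MB)
    of VM data, in line with the estimate above.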
    
    > > > I'm quite concerned that trying to graft this on to pg_basebackup
    > > > (which, as you note later, is missing an awful lot of what users expect
    > > > from a real backup solution already- retention handling, parallel
    > > > capabilities, WAL archive management, and many more... but also is just
    > > > not nearly as developed a tool as the external solutions) is going to
    > > > make things unnecessarily difficult when what we really want here is
    > > > better support from core for block-level incremental backup for the
    > > > existing external tools to leverage.
    > >
    > > I think there is some interesting complexity brought up in this thread.
    > > Which options are going to minimize storage I/O, network I/O, have only
    > > background overhead, allow parallel operation, integrate with
    > > pg_basebackup.  Eventually we will need to evaluate the incremental
    > > backup options against these criteria.
    >
    > This presumes that we're going to have multiple competing incremental
    > backup options presented, doesn't it?  Are you aware of another effort
    > going on which aims for inclusion in core?  There's been past attempts
    > made, but I don't believe there's anyone else currently planning to or
    > working on something for inclusion in core.
    
    Yeah, I really hope we don't end up with dueling patches.  I want to
    come up with an approach that can be widely-endorsed and then have
    everybody rowing in the same direction.  On the other hand, I do think
    that we may support multiple options in certain places which may have
    the kinds of trade-offs that Bruce mentions.  For instance,
    identifying changed blocks by scanning the whole cluster and checking
    the LSN of each block has an advantage in that it requires no prior
    setup or extra configuration.  Like a sequential scan, it always
    works, and that is an advantage.  Of course, for many people, the
    competing advantage of a WAL-scanning approach that can save a lot of
    I/O will appear compelling, but maybe not for everyone.  I think
    there's room for two or three approaches there -- not in the sense of
    competing patches, but in the sense of giving users a choice based on
    their needs.
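
    As a rough illustration of the scan-everything option, here is a
    minimal sketch that reads one relation segment file page by page and
    reports the blocks whose page LSN is at or above a threshold.  It
    assumes 8kB pages and that the first 8 bytes of each page header
    hold the LSN as two native-endian 32-bit halves (high, then low).
    The function names are made up, and the sketch ignores everything a
    real implementation would have to handle, such as concurrent writes,
    multiple segments, and checksums.

```python
import struct

BLOCK_SIZE = 8192  # assumed default block size


def page_lsn(page):
    """Extract the page LSN from the first 8 bytes of a page header."""
    xlogid, xrecoff = struct.unpack_from("=II", page, 0)
    return (xlogid << 32) | xrecoff


def changed_blocks(path, threshold_lsn):
    """Return the block numbers in one file whose LSN >= threshold_lsn."""
    changed = []
    with open(path, "rb") as f:
        blkno = 0
        while True:
            page = f.read(BLOCK_SIZE)
            if len(page) < BLOCK_SIZE:
                break  # end of file (or a trailing partial page)
            if page_lsn(page) >= threshold_lsn:
                changed.append(blkno)
            blkno += 1
    return changed
```

    Every page is still read from disk, which is the cost of this
    approach; only the amount of data sent is reduced.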
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  34. Re: block-level incremental backup

    Bruce Momjian <bruce@momjian.us> — 2019-04-17T15:57:35Z

    On Tue, Apr 16, 2019 at 06:40:44PM -0400, Robert Haas wrote:
    > Yeah, I really hope we don't end up with dueling patches.  I want to
    > come up with an approach that can be widely-endorsed and then have
    > everybody rowing in the same direction.  On the other hand, I do think
    > that we may support multiple options in certain places which may have
    > the kinds of trade-offs that Bruce mentions.  For instance,
    > identifying changed blocks by scanning the whole cluster and checking
    > the LSN of each block has an advantage in that it requires no prior
    > setup or extra configuration.  Like a sequential scan, it always
    > works, and that is an advantage.  Of course, for many people, the
    > competing advantage of a WAL-scanning approach that can save a lot of
    > I/O will appear compelling, but maybe not for everyone.  I think
    > there's room for two or three approaches there -- not in the sense of
    > competing patches, but in the sense of giving users a choice based on
    > their needs.
    
    Well, by having a separate modblock file for each WAL file, you can keep
    both WAL and modblock files and use the modblock list to pull pages from
    each WAL file, or from the heap/index files, and it can be done in
    parallel.  Having WAL and modblock files in the same directory makes
    retention simpler.
    
    In fact, you can do an incremental backup just using the modblock files
    and the heap/index files, so you don't even need the WAL.
    
    Also, instead of storing the file name and block number in the modblock
    file, using the database oid, relfilenode, and block number (3 int32
    values) should be sufficient.
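
    To make the record layout concrete, here is a small sketch of how
    such entries could be packed and unpacked.  The on-disk format is
    purely an assumption for illustration: three little-endian unsigned
    32-bit integers per entry (database oid, relfilenode, block number),
    sorted and de-duplicated, with no header.  Nothing like this exists
    in core today, and the function names are hypothetical.

```python
import struct

# Hypothetical record: database oid, relfilenode, block number (4 bytes each).
RECORD = struct.Struct("<III")


def encode_modblocks(entries):
    """Pack (dboid, relfilenode, blkno) tuples into bytes, sorted and unique."""
    return b"".join(RECORD.pack(*entry) for entry in sorted(set(entries)))


def decode_modblocks(data):
    """Unpack bytes produced by encode_modblocks back into a list of tuples."""
    return [RECORD.unpack_from(data, offset)
            for offset in range(0, len(data), RECORD.size)]
```

    Each entry takes 12 bytes in this layout, regardless of how long the
    corresponding file name would be.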
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
    + As you are, so once was I.  As I am, so you will be. +
    +                      Ancient Roman grave inscription +
    
    
    
    
  35. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-04-17T21:20:03Z

    Greetings,
    
    * Robert Haas (robertmhaas@gmail.com) wrote:
    > On Tue, Apr 16, 2019 at 5:44 PM Stephen Frost <sfrost@snowman.net> wrote:
    > > > > I love the general idea of having additional facilities in core to
    > > > > support block-level incremental backups.  I've long been unhappy that
    > > > > any such approach ends up being limited to a subset of the files which
    > > > > need to be included in the backup, meaning the rest of the files have to
    > > > > be backed up in their entirety.  I don't think we have to solve for that
    > > > > as part of this, but I'd like to see a discussion for how to deal with
    > > > > the other files which are being backed up to avoid needing to just
    > > > > wholesale copy them.
    > > >
    > > > I assume you are talking about non-heap/index files.  Which of those are
    > > > large enough to benefit from incremental backup?
    > >
    > > Based on discussions I had with Andrey, specifically the visibility map
    > > is an issue for them with WAL-G.  I haven't spent a lot of time thinking
    > > about it, but I can understand how that could be an issue.
    > 
    > If I understand correctly, the VM contains 1 byte per 4 heap pages and
    > the FSM contains 1 byte per heap page (plus some overhead for higher
    > levels of the tree).  Since the FSM is not WAL-logged, I'm not sure
    > there's a whole lot we can do to avoid having to back it up, although
    > maybe there's some clever idea I'm not quite seeing.  The VM is
    > WAL-logged, albeit with some strange warts that I have the honor of
    > inventing, so there's more possibilities there.
    > 
    > Before worrying about it too much, it would be useful to hear more
    > about the concerns related to these forks, so that we make sure we're
    > solving the right problem.  It seems difficult for a single relation
    > to be big enough for these to be much of an issue.  For example, on a
    > 1TB relation, we have 2^40 bytes = 2^27 pages = ~2^25 bytes of VM fork
    > = 32MB.  Not nothing, but 32MB of useless overhead every time you back
    > up a 1TB database probably isn't going to break the bank.  It might be
    > more of a concern for users with many small tables.  For example, if
    > somebody has got a million tables with 1 page in each one, they'll
    > have a million data pages, a million VM pages, and 3 million FSM pages
    > (unless the new don't-create-the-FSM-for-small-tables stuff in v12
    > kicks in).  I don't know if it's worth going to a lot of trouble to
    > optimize that case.  Creating a million tables with 100 tuples (or
    > whatever) in each one sounds like terrible database design to me.
    
    As I understand it, the problem is not with backing up an individual
    database or cluster, but rather dealing with backing up thousands of
    individual clusters with thousands of tables in each, leading to an
    awful lot of tables with lots of FSMs/VMs, all of which end up having to
    get copied and stored wholesale.  I'll point this thread out to him and
    hopefully he'll have a chance to share more specific information.
    
    > > > > I'm quite concerned that trying to graft this on to pg_basebackup
    > > > > (which, as you note later, is missing an awful lot of what users expect
    > > > > from a real backup solution already- retention handling, parallel
    > > > > capabilities, WAL archive management, and many more... but also is just
    > > > > not nearly as developed a tool as the external solutions) is going to
    > > > > make things unnecessarily difficult when what we really want here is
    > > > > better support from core for block-level incremental backup for the
    > > > > existing external tools to leverage.
    > > >
    > > > I think there is some interesting complexity brought up in this thread.
    > > > Which options are going to minimize storage I/O, network I/O, have only
    > > > background overhead, allow parallel operation, integrate with
    > > > pg_basebackup.  Eventually we will need to evaluate the incremental
    > > > backup options against these criteria.
    > >
    > > This presumes that we're going to have multiple competing incremental
    > > backup options presented, doesn't it?  Are you aware of another effort
    > > going on which aims for inclusion in core?  There's been past attempts
    > > made, but I don't believe there's anyone else currently planning to or
    > > working on something for inclusion in core.
    > 
    > Yeah, I really hope we don't end up with dueling patches.  I want to
    > come up with an approach that can be widely-endorsed and then have
    > everybody rowing in the same direction.  On the other hand, I do think
    > that we may support multiple options in certain places which may have
    > the kinds of trade-offs that Bruce mentions.  For instance,
    > identifying changed blocks by scanning the whole cluster and checking
    > the LSN of each block has an advantage in that it requires no prior
    > setup or extra configuration.  Like a sequential scan, it always
    > works, and that is an advantage.  Of course, for many people, the
    > competing advantage of a WAL-scanning approach that can save a lot of
    > I/O will appear compelling, but maybe not for everyone.  I think
    > there's room for two or three approaches there -- not in the sense of
    > competing patches, but in the sense of giving users a choice based on
    > their needs.
    
    I can agree with the idea of having multiple options for how to collect
    up the set of changed blocks, though I continue to feel that a
    WAL-scanning approach isn't something that we'd have implemented in the
    backend at all since it doesn't require the backend and a given backend
    might not even have all of the WAL that is relevant.  I certainly don't
    think it makes sense to have a backend go get WAL from the archive to
    then merge the WAL to provide the result to a client asking for it-
    that's adding entirely unnecessary load to the database server.
    
    As such, only the LSN-based scanning of relation files to produce the
    set of changed blocks seems to make sense to me to implement in the
    backend.
    
    Just to be clear- I don't have any problem with a tool being implemented
    in core to support the scanning of WAL to produce a changeset, I just
    don't think that's something we'd have built into the *backend*, nor do
    I think it would make sense to add that functionality to the replication
    (or any other) protocol, at least not with support for arbitrary LSN
    starting and ending points.
    
    A thought that occurs to me is to have the functions for supporting the
    WAL merging be included in libcommon and available to both the
    independent executable that's available for doing WAL merging, and to
    the backend to be able to do WAL merging itself- but for a specific
    purpose: having a way to reduce the amount of WAL that needs to be sent
    to a replica which has a replication slot but that's been disconnected
    for a while.  Of course, there'd have to be some way to handle the other
    files for that to work to update a long out-of-date replica.  Now, if we
    taught the backup tool about having a replication slot then perhaps we
    could have the backend effectively have the same capability proposed
    above, but without the need to go get the WAL from the archive
    repository.
    
    I'm still not entirely sure that this makes sense to do in the backend
    due to the additional load; this is really just some brainstorming.
    
    Thanks!
    
    Stephen
    
  36. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-04-17T22:43:10Z

    Greetings,
    
    * Robert Haas (robertmhaas@gmail.com) wrote:
    > On Mon, Apr 15, 2019 at 9:01 AM Stephen Frost <sfrost@snowman.net> wrote:
    > > I love the general idea of having additional facilities in core to
    > > support block-level incremental backups.  I've long been unhappy that
    > > any such approach ends up being limited to a subset of the files which
    > > need to be included in the backup, meaning the rest of the files have to
    > > be backed up in their entirety.  I don't think we have to solve for that
    > > as part of this, but I'd like to see a discussion for how to deal with
    > > the other files which are being backed up to avoid needing to just
    > > wholesale copy them.
    > 
    > Ideas?  Generally, I don't think that anything other than the main
    > forks of relations are worth worrying about, because the files are too
    > small to really matter.  Even if they're big, the main forks of
    > relations will be much bigger.  I think.
    
    Sadly, I haven't got any great ideas today.  I do know that the WAL-G
    folks have specifically mentioned issues with the visibility map being
    large enough across enough of their systems that it kinda sucks to deal
    with.  Perhaps we could do something like the rsync binary-diff protocol
    for non-relation files?  This is clearly just hand-waving but maybe
    there's something reasonable in that idea.
    
    > > I'm quite concerned that trying to graft this on to pg_basebackup
    > > (which, as you note later, is missing an awful lot of what users expect
    > > from a real backup solution already- retention handling, parallel
    > > capabilities, WAL archive management, and many more... but also is just
    > > not nearly as developed a tool as the external solutions) is going to
    > > make things unnecessarily difficult when what we really want here is
    > > better support from core for block-level incremental backup for the
    > > existing external tools to leverage.
    > >
    > > Perhaps there's something here which can be done with pg_basebackup to
    > > have it work with the block-level approach, but I certainly don't see
    > > it as a natural next step for it and really does seem like limiting the
    > > way this is implemented to something that pg_basebackup can easily
    > > digest might make it less useful for the more developed tools.
    > 
    > I agree that there are a bunch of things that pg_basebackup does not
    > do, such as backup management.  I think a lot of users do not want
    > PostgreSQL to do backup management for them.  They have an existing
    > solution that they use to manage backups, and they want PostgreSQL to
    > interoperate with it. I think it makes sense for pg_basebackup to be
    > in charge of taking the backup, and then other tools can either use it
    > as a building block or use the streaming replication protocol to send
    > approximately the same commands to the server.  
    
    There's something like 6 different backup tools, at least, for
    PostgreSQL that provide backup management, so I have a really hard time
    agreeing with this idea that users don't want a PG backup management
    system.  Maybe that's not what you're suggesting here, but that's what
    came across to me.
    
    Yes, there are some users who have an existing backup solution and
    they'd like a better way to integrate PostgreSQL into that solution,
    but that's usually something like filesystem snapshots or an enterprise
    backup tool which has a PostgreSQL agent or similar to do the start/stop
    and collect up the WAL, not something that's just calling pg_basebackup.
    
    Those are typically not things we have any visibility into though and
    aren't open source either (and, at least as often as not, they don't
    seem to be very well thought through, based on my experience with those
    tools...).
    
    Unless maybe I'm misunderstanding and what you're suggesting here is
    that the "existing solution" is something like the external PG-specific
    backup tools?  But then the rest doesn't seem to make sense, as only
    maybe one or two of those tools use pg_basebackup internally.
    
    > I certainly would not
    > want to expose server capabilities that let you take an incremental
    > backup and NOT teach pg_basebackup to use them -- then we'd be in a
    > situation of saying that PostgreSQL has incremental backup, but you
    > have to get external tool XYZ to use it.  That will be perceived as
    > PostgreSQL does NOT have incremental backup and this external tool
    > adds it.
    
    ... but this is exactly the situation we're in already with all of the
    *other* features around backup (parallel backup, backup management, WAL
    management, etc).  Users want those features, pg_basebackup/PG core
    doesn't provide them, and therefore there's a bunch of other tools which
    have been written that do.  In addition, saying that PG has incremental
    backup but no built-in management of those full-vs-incremental backups
    and telling users that they basically have to build that themselves
    really feels a lot like we're trying to address a check-box requirement
    rather than making something that our users are going to be happy with.
    
    > > As an example, I believe all of the other tools mentioned (at least,
    > > those that are open source I'm pretty sure all do) support parallel
    > > backup and therefore having a way to get the block-level changes in a
    > > parallel fashion would be a pretty big thing that those tools will want
    > > and pg_basebackup is single-threaded today and this proposal doesn't
    > > seem to be contemplating changing that, implying that a serial-based
    > > block-level protocol would be fine but that'd be a pretty awful
    > > restriction for the other tools.
    > 
    > I mentioned this exact issue in my original email.  I spoke positively
    > of it.  But I think it is different from what is being proposed here.
    > We could have parallel backup without incremental backup, and that
    > would be a good feature.  We could have parallel backup without full
    > backup, and that would also be a good feature.  We could also have
    > both, which would be best of all.  I don't see that my proposal throws
    > up any architectural obstacle to parallelism.  I assume parallel
    > backup, whether full or incremental, would be implemented by dividing
    > up the files that need to be sent across the available connections; if
    > incremental backup exists, each connection then has to decide whether
    > to send the whole file or only part of it.
    
    I don't think that I was very clear in what my specific concern here
    was.  I'm not asking for pg_basebackup to have parallel backup (at
    least, not in this part of the discussion), I'm asking for the
    incremental block-based protocol that's going to be built-in to core to
    be able to be used in a parallel fashion.
    
    The existing protocol that pg_basebackup uses is basically, connect to
    the server and then say "please give me a tarball of the data directory"
    and that is then streamed on that connection, making that protocol
    impossible to use for parallel backup.  That's fine as far as it goes
    because only pg_basebackup actually uses that protocol (note that nearly
    all of the other tools for doing backups of PostgreSQL don't...).  If
    we're expecting the external tools to use the block-level incremental
    protocol then that protocol really needs to have a way to be
    parallelized, otherwise we're just going to end up with all of the
    individual tools doing their own thing for block-level incremental
    (though perhaps they'd reimplement whatever is done in core but in a way
    that they could parallelize it...), if possible (which I add just in
    case there's some idea that we end up in a situation where the
    block-level incremental backup has to coordinate with the backend in
    some fashion to work...  which would mean that *everyone* has to use the
    protocol even if it isn't parallel and that would be really bad, imv).
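
    For illustration only, here is one way a client could divide files
    across connections, as described in the quoted text above.  Neither
    that text nor this message specifies a scheduling scheme; the greedy
    largest-first assignment below is an assumption, and the file names
    in the usage example are invented.

```python
import heapq


def assign_files(file_sizes, n_connections):
    """Assign each file to the connection with the least bytes queued so far."""
    heap = [(0, idx) for idx in range(n_connections)]  # (queued bytes, connection)
    assignment = [[] for _ in range(n_connections)]
    # Largest files first; ties are broken by name for a stable result.
    for name, size in sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        assignment[idx].append(name)
        heapq.heappush(heap, (load + size, idx))
    return assignment


sizes = {"base/1/a": 10, "base/1/b": 7, "base/1/c": 5, "base/1/d": 3}
print(assign_files(sizes, 2))
# [['base/1/a', 'base/1/d'], ['base/1/b', 'base/1/c']]
```

    The point is only that the unit of work has to be something a client
    can request per connection (a file, or part of one), which a single
    streamed tarball does not allow.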
    
    > > This part of the discussion is another example of how we're limiting
    > > ourselves in this implementation to the "pg_basebackup can work with
    > > this" case- by only considering the options of "scan all the files" or
    > > "use the WAL- if the request is for WAL we have available on the
    > > server."  The other backup solutions mentioned in your initial email,
    > > and others that weren't, have a WAL archive which includes a lot more
    > > WAL than just what the primary currently has.  When I've thought about
    > > how WAL could be used to build a differential or incremental backup, the
    > > question of "do we have all the WAL we need" hasn't ever been a
    > > consideration- because the backup tool manages the WAL archive and has
    > > WAL going back across, most likely, weeks or even months.  Having a tool
    > > which can essentially "compress" WAL would be fantastic and would be
    > > able to be leveraged by all of the different backup solutions.
    > 
    > I don't think this is a case of limiting ourselves; I think it's a
    > case of keeping separate considerations properly separate.  As I said
    > in my original email, the client doesn't really need to know how the
    > server is identifying the blocks that have been modified.  That is the
    > server's job.  I started a separate thread on the WAL-scanning
    > approach, so we should take that part of the discussion over there.  I
    > see no reason why the server couldn't be taught to reach back into an
    > available archive for WAL that it no longer has locally, but that's
    > really independent of the design ideas being discussed on this thread.
    
    I've provided thoughts on that other thread, I'm happy to discuss
    further there.
    
    > > Two things here- having some file that "stops the server from starting"
    > > is just going to cause a lot of pain, in my experience.  Users do a lot
    > > of really rather.... curious things, and then come asking questions
    > > about them, and removing the file that stopped the server from starting
    > > is going to quickly become one of those questions on stack overflow that
    > > people just follow the highest-ranked question for, even though everyone
    > > who follows this list will know that doing so results in corruption of
    > > the database.
    > 
    > Wait, you want to make it maximally easy for users to start the server
    > in a state that is 100% certain to result in a corrupted and unusable
    > database?  Why?? I'd like to make that a tiny bit difficult.  If
    > they really want a corrupted database, they can remove the file.
    
    No, I don't want it to be easy for users to start the server in a state
    that's going to result in a corrupted cluster.  That's basically the
    complete opposite of what I was going for- having a file that can be
    trivially removed to start up the cluster is *going* to result in people
    having corrupted clusters, no matter how much we tell them "don't do
    that".  This is exactly the problem we have with backup_label today.
    I'd really rather not double-down on that.
    
    > > An alternative approach in developing this feature would be to have
    > > pg_basebackup have an option to run against an *existing* backup, with
    > > the entire point being that the existing backup is updated with these
    > > incremental changes, instead of having some independent tool which takes
    > > the result of multiple pg_basebackup runs and then combines them.
    > 
    > That would be really unsafe, because if the tool is interrupted before
    > it finishes (and fsyncs everything), you no longer have any usable
    > backup.  It also doesn't lend itself to several of the scenarios I
    > described in my original email -- like endless incrementals that are
    > merged into the full backup after some number of days -- a capability
    > upon which others have already remarked positively.
    
    There's really two things here- the first is that I agree with the
    concern about potentially destroying the existing backup if the
    pg_basebackup doesn't complete, but there's some ways to address that
    (such as filesystem snapshotting), so I'm not sure that the idea is
    quite that bad, but it would need to be more than just what
    pg_basebackup does in this case in order to be trustworthy (at least,
    for most).
    
    The other part here is the idea of endless incrementals where the blocks
    which don't appear to have changed are never re-validated against what's
    in the backup.  Unfortunately, latent corruption happens and you really
    want to have a way to check for that.  In past discussions that I've had
    with David, there's been some idea to check some percentage of the
    blocks that didn't appear to change for each backup against what's in
    the backup.
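
    A minimal sketch of that sampling idea, under assumptions of my own:
    the set of apparently unchanged blocks is already known, and the
    actual comparison against the backup is done elsewhere.  The
    function name and the seed parameter are hypothetical.

```python
import math
import random


def blocks_to_revalidate(unchanged_blocks, fraction, seed=None):
    """Pick a random fraction of the unchanged blocks to re-check this backup."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    blocks = list(unchanged_blocks)
    count = math.ceil(len(blocks) * fraction)
    return sorted(random.Random(seed).sample(blocks, count))
```

    With a different sample on each run, every block eventually gets
    compared against the backup, at a cost proportional to the chosen
    fraction.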
    
    I share this just to point out that there's some risk to that approach,
    not to say that we shouldn't do it or that we should discourage the
    development of such a feature.
    
    > > An alternative tool might be one which simply reads the WAL and keeps
    > > track of the FPIs and the updates and then eliminates any duplication
    > > which exists in the set of WAL provided (that is, multiple FPIs for the
    > > same page would be merged into one, and only the delta changes to that
    > > page are preserved, across the entire set of WAL being combined).  Of
    > > course, that's complicated by having to deal with the other files in the
    > > database, so it wouldn't really work on its own.
    > 
    > You've jumped back to solving the server's problem (which blocks
    > should I send?) rather than the client's problem (what does an
    > incremental backup look like once I've taken it and how do I manage
    > and restore them?).  It does seem possible to figure out the contents
    > of modified blocks strictly from looking at the WAL, without any
    > examination of the current database contents.  However, it also seems
    > very complicated, because the tool that is figuring out the current
    > block contents just by looking at the WAL would have to know how to
    > apply any type of WAL record, not just one that contains an FPI.  And
    > I really don't want to build a client-side tool that knows how to
    > apply WAL.
    
    Wow.  I have to admit that I feel completely opposite of that- I'd
    *love* to have an independent tool (which ideally uses the same code
    through the common library, or similar) that can be run to apply WAL.
    
    In other words, I don't agree that it's the server's problem at all to
    solve that, or, at least, I don't believe that it needs to be.
    
    > > I'd really prefer that we avoid adding in another low-level tool like
    > > the one described here.  Users, imv anyway, don't want to deal with
    > > *more* tools for handling this aspect of backup/recovery.  If we had a
    > > tool in core today which managed multiple backups, kept track of them,
    > > and all of the WAL during and between them, then we could add options to
    > > that tool to do what's being described here in a way that makes sense
    > > and provides a good interface to users.  I don't know that we're going
    > > to be able to do that with pg_basebackup when, really, the goal here
    > > isn't actually to make pg_basebackup into an enterprise backup tool,
    > > it's to make things easier for the external tools to do block-level
    > > backups.
    > 
    > Well, I agree with you that the goal is not to make pg_basebackup an
    > enterprise backup tool.  However, I don't see teaching it to take
    > incremental backups as opposed to that goal.  I think backup
    > management and retention should remain firmly outside the purview of
    > pg_basebackup and left either to some other in-core tool or maybe even
    > to out-of-core tools.  However, I don't see any reason why the
    > task of taking an incremental and/or parallel backup should also be
    > left to another tool.
    
    I've tried to outline how the incremental backup capability and backup
    management are really very closely related and having those be
    implemented by independent tools is not a good interface for our users
    to have to live with.
    
    > There is a very close relationship between the thing that
    > pg_basebackup already does (copy everything) and the thing that we
    > want to do here (copy everything except blocks that we know haven't
    > changed). If we made it the job of some other tool to take parallel
    > and/or incremental backups, that other tool would need to reimplement
    > a lot of things that pg_basebackup has already got, like tar vs. plain
    > format, fast vs. spread checkpoint, rate-limiting, compression levels,
    > etc.  That seems like a waste.  Better to give pg_basebackup the
    > capability to do those things, and then any backup management tool
    > that anyone writes can take advantage of those capabilities.
    
    I don't believe any of the external tools which do backups of PostgreSQL
    support tar format.  Fast-vs-spread checkpointing isn't in the purview
    of the external tools, they just have to accept the option and pass it
    to pg_start_backup(), which they already know how to do.  Rate-limiting
    and compression are implemented by those other tools already, where it's
    been desired.
    
    Most of the external tools don't use pg_basebackup, nor the base backup
    protocol (or, if they do, it's only as an option among others).  In my
    opinion, that's a pretty clear indication that pg_basebackup and the base
    backup protocol aren't sufficient to cover any but the simplest of
    use-cases (though those simple use-cases are handled rather well).
    We're talking about adding on a capability that's much more complicated
    and is one that a lot of tools have already taken a stab at, let's try
    to do it in a way that those tools can leverage it and avoid having to
    implement it themselves.
    
    > I come at this, BTW, from the perspective of having just spent a bunch
    > of time working on EDB's Backup And Recovery Tool (BART).  That tool
    > works in exactly the manner you seem to be advocating: it knows how to
    > do incremental and parallel full backups, and it also does backup
    > management.  However, this has not turned out to be the best division
    > of labor.  People who don't want to use the backup management
    > capabilities may still want the parallel or incremental backup
    > capabilities, and if all of that is within the envelope of an
    > "enterprise backup tool," they don't have that option.  So I want to
    > split it up.  I want pg_basebackup to take all the kinds of backups
    > that PostgreSQL supports -- full, incremental, parallel, serial,
    > whatever -- and I want some other tool -- pgBackRest, BART, barman, or
    > some yet-to-be-invented core thing to do the management of those
    > backups.  Then everybody can use exactly the bits they want.
    
    I come at this from years of working with David on pgBackRest, listening
    to what users want, what features they like, what they'd like to see
    added, and what they don't like about how it works today.
    
    It's an interesting idea to add in everything to pg_basebackup that
    users doing backups would like to see, but that's quite a list:
    
    - full backups
    - differential backups
    - incremental backups / block-level backups
    - (server-side) compression
    - (server-side) encryption
    - page-level checksum validation
    - calculating checksums (on the whole file)
    - External object storage (S3, et al)
    - more things...
    
    I'm really not convinced that I agree with the division of labor as
    you've outlined it, where all of the above is done by pg_basebackup,
    where just archiving and backup retention are handled by some external
    tool (except that we already have pg_receivewal, so archiving isn't
    really an externally handled thing either, unless you want features like
    parallel archive-push or parallel archive-get...).
    
    What would really help me, at least, understand the idea here would be
    to understand exactly what the existing tools do that the subset of
    users you're thinking about doesn't like/want, but which pg_basebackup,
    today, does.  Is the issue that there's a repository instead of just a
    plain PG directory or set of tar files, like what pg_basebackup produces
    today?  But how would we do things like have compression, or encryption,
    or block-based incremental backups without some kind of repository or
    directory that doesn't actually look exactly like a PG data directory?
    
    Another thing I really don't understand from this discussion, and part of
    why it's taken me a while to respond, is this, from above:
    
    > I think a lot of users do not want
    > PostgreSQL to do backup management for them.
    
    Followed by:
    
    > I come at this, BTW, from the perspective of having just spent a bunch
    > of time working on EDB's Backup And Recovery Tool (BART).  That tool
    > works in exactly the manner you seem to be advocating: it knows how to
    > do incremental and parallel full backups, and it also does backup
    > management.
    
    I certainly can understand that there are PostgreSQL users who want to
    leverage incremental backups without having to use BART or another tool
    outside of whatever enterprise backup system they've got, but surely
    there's a large pool of users who *do* want a PG backup tool that manages
    backups, or you wouldn't have spent a considerable amount of your very
    valuable time hacking on BART.  I've certainly seen a fair share of both
    and I don't think we should set out to exclude either.
    
    Perhaps that's what we're both saying too and just talking past each
    other, but I feel like the approach here is "make it work just for the
    simple pg_basebackup case and not worry too much about the other tools,
    since what we do for pg_basebackup will work for them too" while where
    I'm coming from is "focus on what the other tools need first, and then
    make pg_basebackup work with that if there's a sensible way to do so."
    
    A third possibility is that it's just too early to be talking about this
    since it means we've gotta be awful vague about it.
    
    Thanks!
    
    Stephen
    
  37. Re: block-level incremental backup

    David Fetter <david@fetter.org> — 2019-04-18T15:32:57Z

    On Wed, Apr 17, 2019 at 11:57:35AM -0400, Bruce Momjian wrote:
    > On Tue, Apr 16, 2019 at 06:40:44PM -0400, Robert Haas wrote:
    > > Yeah, I really hope we don't end up with dueling patches.  I want to
    > > come up with an approach that can be widely-endorsed and then have
    > > everybody rowing in the same direction.  On the other hand, I do think
    > > that we may support multiple options in certain places which may have
    > > the kinds of trade-offs that Bruce mentions.  For instance,
    > > identifying changed blocks by scanning the whole cluster and checking
    > > the LSN of each block has an advantage in that it requires no prior
    > > setup or extra configuration.  Like a sequential scan, it always
    > > works, and that is an advantage.  Of course, for many people, the
    > > competing advantage of a WAL-scanning approach that can save a lot of
    > > I/O will appear compelling, but maybe not for everyone.  I think
    > > there's room for two or three approaches there -- not in the sense of
    > > competing patches, but in the sense of giving users a choice based on
    > > their needs.
    > 
    > Well, by having a separate modblock file for each WAL file, you can keep
    > both WAL and modblock files and use the modblock list to pull pages from
    > each WAL file, or from the heap/index files, and it can be done in
    > parallel.  Having WAL and modblock files in the same directory makes
    > retention simpler.
    > 
    > In fact, you can do an incremental backup just using the modblock files
    > and the heap/index files, so you don't even need the WAL.
    > 
    > Also, instead of storing the file name and block number in the modblock
    > file, using the database oid, relfilenode, and block number (3 int32
    > values) should be sufficient.
    
    Would doing it that way constrain the design of new table access
    methods in some meaningful way?
    
    Best,
    David.
    -- 
    David Fetter <david(at)fetter(dot)org> http://fetter.org/
    Phone: +1 415 235 3778
    
    Remember to vote!
    Consider donating to Postgres: http://www.postgresql.org/about/donate
    
    
    
    
  38. Re: block-level incremental backup

    Bruce Momjian <bruce@momjian.us> — 2019-04-18T15:34:32Z

    On Thu, Apr 18, 2019 at 05:32:57PM +0200, David Fetter wrote:
    > On Wed, Apr 17, 2019 at 11:57:35AM -0400, Bruce Momjian wrote:
    > > Also, instead of storing the file name and block number in the modblock
    > > file, using the database oid, relfilenode, and block number (3 int32
    > > values) should be sufficient.
    > 
    > Would doing it that way constrain the design of new table access
    > methods in some meaningful way?
    
    I think these are the values used in WAL, so I assume table access
    methods already have to map to those, unless they use their own.
    I actually don't know.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
    + As you are, so once was I.  As I am, so you will be. +
    +                      Ancient Roman grave inscription +
    
    
    
    
  39. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-18T16:56:10Z

    On Wed, Apr 17, 2019 at 5:20 PM Stephen Frost <sfrost@snowman.net> wrote:
    > As I understand it, the problem is not with backing up an individual
    > database or cluster, but rather dealing with backing up thousands of
    > individual clusters with thousands of tables in each, leading to an
    > awful lot of tables with lots of FSMs/VMs, all of which end up having to
    > get copied and stored wholesale.  I'll point this thread out to him and
    > hopefully he'll have a chance to share more specific information.
    
    Sounds good.
    
    > I can agree with the idea of having multiple options for how to collect
    > up the set of changed blocks, though I continue to feel that a
    > WAL-scanning approach isn't something that we'd have implemented in the
    > backend at all since it doesn't require the backend and a given backend
    > might not even have all of the WAL that is relevant.  I certainly don't
    > think it makes sense to have a backend go get WAL from the archive to
    > then merge the WAL to provide the result to a client asking for it-
    > that's adding entirely unnecessary load to the database server.
    
    My motivation for wanting to include it in the database server was twofold:
    
    1. I was hoping to leverage the background worker machinery.  The
    WAL-scanner would just run all the time in the background, and start
    up and shut down along with the server.  If it's a standalone tool,
    then it can run on a different server or when the server is down, both
    of which are nice.  The downside though is that now you probably have
    to put it in crontab or under systemd or something, instead of just
    setting a couple of GUCs and letting the server handle the rest.  For
    me that downside seems rather significant, but YMMV.
    
    2. In order for the information produced by the WAL-scanner to be
    useful, it's got to be available to the server when the server is
    asked for an incremental backup.  If the information is constructed by
    a standalone frontend tool, and stored someplace other than under
    $PGDATA, then the server won't have convenient access to it.  I guess
    we could make it the client's job to provide that information to the
    server, but I kind of liked the simplicity of not needing to give the
    server anything more than an LSN.
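
    To make the WAL-scanner idea concrete, here is a minimal sketch of what
    it would compute.  It assumes WAL has already been decoded into records
    that carry an LSN and a list of block references.  `WalRecord` and its
    fields are hypothetical stand-ins, not the real WAL record layout, and
    LSNs are plain integers here.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for a decoded WAL record:
# its LSN plus the (relfilenode, block number) pairs it touches.
WalRecord = namedtuple("WalRecord", ["lsn", "block_refs"])


def modified_blocks_since(records, threshold_lsn):
    """Collect every block referenced by a record at or after threshold_lsn."""
    changed = set()
    for rec in records:
        if rec.lsn >= threshold_lsn:
            changed.update(rec.block_refs)
    return changed
```

    If a summary like this is kept up to date in the background, the server
    can answer an incremental backup request from nothing more than the
    client's LSN, without reading every block in the cluster.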
    
    > A thought that occurs to me is to have the functions for supporting the
    > WAL merging be included in libcommon and available to both the
    > independent executable that's available for doing WAL merging, and to
    > the backend to be able to do WAL merging itself-
    
    Yeah, that might be possible.
    
    > but for a specific
    > purpose: having a way to reduce the amount of WAL that needs to be sent
    > to a replica which has a replication slot but that's been disconnected
    > for a while.  Of course, there'd have to be some way to handle the other
    > files for that to work to update a long out-of-date replica.  Now, if we
    > taught the backup tool about having a replication slot then perhaps we
    > could have the backend effectively have the same capability proposed
    > above, but without the need to go get the WAL from the archive
    > repository.
    
    Hmm, but you can't just skip over WAL records or segments because
    there are checksums and previous-record pointers and things....
    
    > I'm still not entirely sure that this makes sense to do in the backend
    > due to the additional load, this is really just some brainstorming.
    
    Would it really be that much load?
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  40. Re: block-level incremental backup

    Andres Freund <andres@anarazel.de> — 2019-04-18T17:00:53Z

    Hi,
    
    On 2019-04-18 11:34:32 -0400, Bruce Momjian wrote:
    > On Thu, Apr 18, 2019 at 05:32:57PM +0200, David Fetter wrote:
    > > On Wed, Apr 17, 2019 at 11:57:35AM -0400, Bruce Momjian wrote:
    > > > Also, instead of storing the file name and block number in the modblock
    > > > file, using the database oid, relfilenode, and block number (3 int32
    > > > values) should be sufficient.
    > > 
    > > Would doing it that way constrain the design of new table access
    > > methods in some meaningful way?
    > 
    > I think these are the values used in WAL, so I assume table access
    > methods already have to map to those, unless they use their own.
    > I actually don't know.
    
    I don't think it'd be a meaningful restriction. Given that we use those
    for shared_buffer descriptors, WAL etc.
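
    For reference, the (database oid, relfilenode, block number) entry
    discussed above could be laid out as in the following minimal sketch.
    The fixed 12-byte little-endian layout and the helper names are
    assumptions for illustration only; no modblock file format has been
    defined in this thread.  Unsigned 32-bit fields are used because OIDs
    and block numbers are unsigned.

```python
import struct

# Hypothetical on-disk entry: database oid, relfilenode, block number,
# each an unsigned 32-bit integer, little-endian, no padding.
MODBLOCK_ENTRY = struct.Struct("<III")


def pack_entries(entries):
    """Serialize (dboid, relfilenode, blkno) tuples into one byte string."""
    return b"".join(MODBLOCK_ENTRY.pack(*entry) for entry in entries)


def unpack_entries(data):
    """Parse a byte string produced by pack_entries back into tuples."""
    return [
        MODBLOCK_ENTRY.unpack_from(data, offset)
        for offset in range(0, len(data), MODBLOCK_ENTRY.size)
    ]
```

    Each entry costs 12 bytes regardless of how long the file name would
    have been.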
    
    Greetings,
    
    Andres Freund
    
    
    
    
  41. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-18T18:05:40Z

    On Wed, Apr 17, 2019 at 6:43 PM Stephen Frost <sfrost@snowman.net> wrote:
    > Sadly, I haven't got any great ideas today.  I do know that the WAL-G
    > folks have specifically mentioned issues with the visibility map being
    > large enough across enough of their systems that it kinda sucks to deal
    > with.  Perhaps we could do something like the rsync binary-diff protocol
    > for non-relation files?  This is clearly just hand-waving but maybe
    > there's something reasonable in that idea.
    
    I guess it all comes down to how complicated you're willing to make
    the client-server protocol.  With the very simple protocol that I
    proposed -- client provides a threshold LSN and server sends blocks
    modified since then -- the client need not have access to the old
    incremental backup to take a new one.  Of course, if it happens to
    have access to the old backup then it can delta-compress however it
    likes after-the-fact, but that doesn't help with the amount of network
    transfer.  That problem could be solved by doing something like what
    you're talking about (with some probably-negligible false match rate)
    but I have no intention of trying to implement anything that
    complicated, and I don't really think it's necessary, at least not for
    a first version.  What I proposed would already allow, for most users,
    a large reduction in transfer and storage costs; what you are talking
    about here would help more, but also be a lot more work and impose
    some additional requirements on the system.  I don't object to you
    implementing the more complex system, but I'll pass.
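
    The simple protocol described above amounts to a filter on each page's
    LSN, as in this minimal sketch.  It assumes the server already knows
    the LSN of every page as text in the usual "high/low" hexadecimal
    form.  `blocks_to_send` and the dict input are hypothetical; only the
    LSN text format is taken from PostgreSQL.

```python
def parse_lsn(text):
    """Convert an LSN such as '16/B374D848' into a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def blocks_to_send(page_lsns, threshold):
    """Return block numbers whose page LSN is >= the client's threshold.

    page_lsns maps block number -> LSN text (hypothetical input shape).
    """
    limit = parse_lsn(threshold)
    return [
        blkno
        for blkno, lsn in sorted(page_lsns.items())
        if parse_lsn(lsn) >= limit
    ]
```

    The client never has to present the previous backup.  The threshold
    LSN is the only state it carries from one backup to the next.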
    
    > There's something like 6 different backup tools, at least, for
    > PostgreSQL that provide backup management, so I have a really hard time
    > agreeing with this idea that users don't want a PG backup management
    > system.  Maybe that's not what you're suggesting here, but that's what
    > came across to me.
    
    Let me be a little more clear.  Different users want different things.
    Some people want a canned PostgreSQL backup solution, while other
    people just want access to a reasonable set of facilities from which
    they can construct their own solution.  I believe that the proposal I
    am making here could be used either by backup tool authors to enhance
    their offerings, or by individuals who want to build up their own
    solution using facilities provided by core.
    
    > Unless maybe I'm misunderstanding and what you're suggesting here is
    > that the "existing solution" is something like the external PG-specific
    > backup tools?  But then the rest doesn't seem to make sense, as only
    > maybe one or two of those tools use pg_basebackup internally.
    
    Well, what I'm really talking about is in two pieces: providing some
    new facilities via the replication protocol, and making pg_basebackup
    able to use those facilities.  Nothing would stop other tools from
    using those facilities directly if they wish.
    
    > ... but this is exactly the situation we're in already with all of the
    > *other* features around backup (parallel backup, backup management, WAL
    > management, etc).  Users want those features, pg_basebackup/PG core
    > doesn't provide it, and therefore there's a bunch of other tools which
    > have been written that do.  In addition, saying that PG has incremental
    > backup but no built-in management of those full-vs-incremental backups
    > and telling users that they basically have to build that themselves
    > really feels a lot like we're trying to address a check-box requirement
    > rather than making something that our users are going to be happy with.
    
    I disagree.  Yes, parallel backup, like incremental backup, needs to
    go in core.  And pg_basebackup should be able to do a parallel backup.
    I will fight tooth, nail, and claw any suggestion that the server
    should know how to do a parallel backup but pg_basebackup should not
    have an option to exploit that capability.  And similarly for
    incremental.
    
    > I don't think that I was very clear in what my specific concern here
    > was.  I'm not asking for pg_basebackup to have parallel backup (at
    > least, not in this part of the discussion), I'm asking for the
    > incremental block-based protocol that's going to be built-in to core to
    > be able to be used in a parallel fashion.
    >
    > The existing protocol that pg_basebackup uses is basically, connect to
    > the server and then say "please give me a tarball of the data directory"
    > and that is then streamed on that connection, making that protocol
    > impossible to use for parallel backup.  That's fine as far as it goes
    > because only pg_basebackup actually uses that protocol (note that nearly
    > all of the other tools for doing backups of PostgreSQL don't...).  If
    > we're expecting the external tools to use the block-level incremental
    > protocol then that protocol really needs to have a way to be
    > parallelized, otherwise we're just going to end up with all of the
    > individual tools doing their own thing for block-level incremental
    > (though perhaps they'd reimplement whatever is done in core but in a way
    > that they could parallelize it...), if possible (which I add just in
    > case there's some idea that we end up in a situation where the
    > block-level incremental backup has to coordinate with the backend in
    > some fashion to work...  which would mean that *everyone* has to use the
    > protocol even if it isn't parallel and that would be really bad, imv).
    
    The obvious way of extending this system to parallel backup is to have
    N connections each streaming a separate tarfile such that when you
    combine them all you recreate the original data directory.  That would
    be perfectly compatible with what I'm proposing for incremental
    backup.  Maybe you have another idea in mind, but I don't know what it
    is exactly.
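
    One way to picture the N-connection scheme is the minimal sketch below.
    It assumes the file list and sizes are known up front and uses a greedy
    largest-first assignment.  The function name and the balancing strategy
    are assumptions for illustration, not part of the proposal.  The only
    property the proposal needs is that the per-connection sets are
    disjoint and together cover the data directory.

```python
import heapq


def partition_files(file_sizes, n_streams):
    """Split files into n_streams disjoint lists with roughly equal bytes.

    file_sizes maps a relative path to its size in bytes.
    """
    # Min-heap of (bytes assigned so far, stream index).
    heap = [(0, idx) for idx in range(n_streams)]
    heapq.heapify(heap)
    streams = [[] for _ in range(n_streams)]
    # Largest files first; ties broken by path so the result is deterministic.
    ordered = sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for path, size in ordered:
        total, idx = heapq.heappop(heap)
        streams[idx].append(path)
        heapq.heappush(heap, (total + size, idx))
    return streams
```

    Each list would then be streamed as its own tarfile on its own
    connection, and extracting all of them into one directory recreates
    the original tree.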
    
    > > Wait, you want to make it maximally easy for users to start the server
    > > in a state that is 100% certain to result in a corrupted and unusable
    > > database?  Why?? I'd like to make that a tiny bit difficult.  If
    > > they really want a corrupted database, they can remove the file.
    >
    > No, I don't want it to be easy for users to start the server in a state
    > that's going to result in a corrupted cluster.  That's basically the
    > complete opposite of what I was going for- having a file that can be
    > trivially removed to start up the cluster is *going* to result in people
    > having corrupted clusters, no matter how much we tell them "don't do
    > that".  This is exactly the problem we have with backup_label today.
    > I'd really rather not double-down on that.
    
    Well, OK, but short of scanning the entire directory tree on startup,
    I don't see how to achieve that.
    
    > There's really two things here- the first is that I agree with the
    > concern about potentially destroying the existing backup if the
    > pg_basebackup doesn't complete, but there's some ways to address that
    > (such as filesystem snapshotting), so I'm not sure that the idea is
    > quite that bad, but it would need to be more than just what
    > pg_basebackup does in this case in order to be trustworthy (at least,
    > for most).
    
    Well, I did mention in my original email that there could be a
    combine-backups-destructively option.  I guess this is just taking
    that to the next level: merge a backup being taken into an existing
    backup on-the-fly.  Given your remarks above, it is worth noting that
    this GREATLY increases the chances of people accidentally causing
    corruption in ways that are almost undetectable.  All they have to do
    is kill -9 the backup tool half way through and then start postgres on
    the resulting directory.
    
    > The other part here is the idea of endless incrementals where the blocks
    > which don't appear to have changed are never re-validated against what's
    > in the backup.  Unfortunately, latent corruption happens and you really
    > want to have a way to check for that.  In past discussions that I've had
    > with David, there's been some idea to check some percentage of the
    > blocks that didn't appear to change for each backup against what's in
    > the backup.
    
    Sure, I'm not trying to block anybody from developing something like
    that, and I acknowledge that there is risk in a system like this,
    but...
    
    > I share this just to point out that there's some risk to that approach,
    > not to say that we shouldn't do it or that we should discourage the
    > development of such a feature.
    
    ...it seems we are viewing this, at least, from the same perspective.
    
    > Wow.  I have to admit that I feel completely opposite of that- I'd
    > *love* to have an independent tool (which ideally uses the same code
    > through the common library, or similar) that can be run to apply WAL.
    >
    > In other words, I don't agree that it's the server's problem at all to
    > solve that, or, at least, I don't believe that it needs to be.
    
    I mean, I guess I'd love to have that if I could get it by waving a
    magic wand, but I wouldn't love it if I had to write the code or
    maintain it.  The routines for applying WAL currently all assume that
    you have a whole bunch of server infrastructure present; that code
    wouldn't run in a frontend environment, I think.  I wouldn't want to
    have a second copy of every WAL apply routine that might have its own
    set of bugs.
    
    > I've tried to outline how the incremental backup capability and backup
    > management are really very closely related and having those be
    > implemented by independent tools is not a good interface for our users
    > to have to live with.
    
    I disagree.  I think the "existing backup tools don't use
    pg_basebackup" argument isn't very compelling, because the reason
    those tools don't use pg_basebackup is because it can't do what they
    need.  If it did, they'd probably use it.  People don't write a whole
    separate engine for running backups just because it's fun to not reuse
    code -- they do it because there's no other way to get what they want.
    
    > Most of the external tools don't use pg_basebackup, nor the base backup
    > protocol (or, if they do, it's only as an option among others).  In my
    > opinion, that's a pretty clear indication that pg_basebackup and the base
    > backup protocol aren't sufficient to cover any but the simplest of
    > use-cases (though those simple use-cases are handled rather well).
    > We're talking about adding on a capability that's much more complicated
    > and is one that a lot of tools have already taken a stab at, let's try
    > to do it in a way that those tools can leverage it and avoid having to
    > implement it themselves.
    
    I mean, again, if it were part of pg_basebackup and available via the
    replication protocol, they could do exactly that, through either
    method.  I don't get it.  You seem to be arguing that we shouldn't add
    the necessary capabilities to the replication protocol or
    pg_basebackup, but at the same time arguing that pg_basebackup is
    inadequate because it's missing important capabilities.  This confuses
    me.
    
    > It's an interesting idea to add in everything to pg_basebackup that
    > users doing backups would like to see, but that's quite a list:
    >
    > - full backups
    > - differential backups
    > - incremental backups / block-level backups
    > - (server-side) compression
    > - (server-side) encryption
    > - page-level checksum validation
    > - calculating checksums (on the whole file)
    > - External object storage (S3, et al)
    > - more things...
    >
    > I'm really not convinced that I agree with the division of labor as
    > you've outlined it, where all of the above is done by pg_basebackup,
    > where just archiving and backup retention are handled by some external
    > tool (except that we already have pg_receivewal, so archiving isn't
    > really an externally handled thing either, unless you want features like
    > parallel archive-push or parallel archive-get...).
    
    Yeah, if it were up to me, I'd choose to put most of that in the server
    and make it available via the replication protocol, and then make
    pg_basebackup able to use that functionality.  And external tools
    could use that functionality via pg_basebackup or by using the
    replication protocol directly.  I actually don't really understand
    what the alternative is.  If you want server-side compression, for
    example, that really has to be done on the server.  And how would the
    server expose that, except through the replication protocol?  Sure, we
    could design a new protocol for it. Call it... say... the
    shmeplication protocol.  And then you could use the replication
    protocol for what it does today and the shmeplication protocol for all
    the cool bits.  But why would that be better?
    
    > What would really help me, at least, understand the idea here would be
    > to understand exactly what the existing tools do that the subset of
    > users you're thinking about doesn't like/want, but which pg_basebackup,
    > today, does.  Is the issue that there's a repository instead of just a
    > plain PG directory or set of tar files, like what pg_basebackup produces
    > today?  But how would we do things like have compression, or encryption,
    > or block-based incremental backups without some kind of repository or
    > directory that doesn't actually look exactly like a PG data directory?
    
    I guess we're still wallowing in the same confusion here.
    pg_basebackup, for me, is just a convenient place to stick this
    functionality.  If the server has the ability to construct and send an
    incremental backup by some means, then it needs a client on the other
    end to receive and store that backup, and since pg_basebackup already
    knows how to do that for full backups, extending it to incremental
    backups (and/or parallel, encrypted, compressed, and validated
    backups) seems very natural to me.  Otherwise I add server-side
    functionality to allow $X and then have to write an entirely new
    client to interact with that instead of just using the client I've
    already got.  That's more work, and I'm lazy.
    
    Now it's true that if we wanted to build something like the rsync
    protocol into PostgreSQL, jamming that into pg_basebackup might well
    be a bridge too far.  That would involve taking backups via a method
    so different from what we're currently doing that it would probably
    make sense to at least consider creating a whole new tool for that
    purpose.  But that wasn't my proposal...
    
    > I certainly can understand that there are PostgreSQL users who want to
    > leverage incremental backups without having to use BART or another tool
    > outside of whatever enterprise backup system they've got, but surely
    > there's a large pool of users who *do* want a PG backup tool that manages
    > backups, or you wouldn't have spent a considerable amount of your very
    > valuable time hacking on BART.  I've certainly seen a fair share of both
    > and I don't think we should set out to exclude either.
    
    Sure, I agree.
    
    > Perhaps that's what we're both saying too and just talking past each
    > other, but I feel like the approach here is "make it work just for the
    > simple pg_basebackup case and not worry too much about the other tools,
    > since what we do for pg_basebackup will work for them too" while where
    > I'm coming from is "focus on what the other tools need first, and then
    > make pg_basebackup work with that if there's a sensible way to do so."
    
    I think perhaps the disconnect is that I just don't see how it can
    fail to work for the external tools if it works for pg_basebackup.
    Any given piece of functionality is either available in the
    replication stream, or it's not.  I suspect that for both BART and
    pg_backrest, they won't be able to completely give up on having their
    own backup engines solely because core has incremental backup, but I
    don't know what the alternative to adding features to core one at a
    time is.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  42. Re: block-level incremental backup

    Andres Freund <andres@anarazel.de> — 2019-04-18T18:21:50Z

    Hi,
    
    > > Wow.  I have to admit that I feel completely opposite of that- I'd
    > > *love* to have an independent tool (which ideally uses the same code
    > > through the common library, or similar) that can be run to apply WAL.
    > >
    > > In other words, I don't agree that it's the server's problem at all to
    > > solve that, or, at least, I don't believe that it needs to be.
    > 
    > I mean, I guess I'd love to have that if I could get it by waving a
    > magic wand, but I wouldn't love it if I had to write the code or
    > maintain it.  The routines for applying WAL currently all assume that
    > you have a whole bunch of server infrastructure present; that code
    > wouldn't run in a frontend environment, I think.  I wouldn't want to
    > have a second copy of every WAL apply routine that might have its own
    > set of bugs.
    
    I'll fight tooth and nail not to have a second implementation of replay,
    even if it's just portions.  The code we have is complicated and fragile
    enough, having a [partial] second version would be way worse.  There's
    already plenty of improvements we need to make to speed up replay, and a
    lot of them require multiple execution threads (be it processes or OS
    threads), something not easily feasible in a standalone tool. And
    without the already existing concurrent work during replay (primarily
    checkpointer doing a lot of the necessary IO), it'd also be pretty
    unattractive to use any separate tool.
    
    Unless you just define the server binary as that "independent tool".
    Which I think is entirely reasonable. With the 'consistent' and LSN
    recovery targets one already can get most of what's needed from such a
    tool, anyway.  I'd argue the biggest issue there is that there's no
    equivalent to starting postgres with a private socket directory on
    windows, and perhaps an option or two making it easier to start postgres
    in a "private" mode for things like this.
    
    Greetings,
    
    Andres Freund
    
    
    
    
  43. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-04-18T20:59:12Z

    Greetings,
    
    * Robert Haas (robertmhaas@gmail.com) wrote:
    > On Wed, Apr 17, 2019 at 5:20 PM Stephen Frost <sfrost@snowman.net> wrote:
    > > As I understand it, the problem is not with backing up an individual
    > > database or cluster, but rather dealing with backing up thousands of
    > > individual clusters with thousands of tables in each, leading to an
    > > awful lot of tables with lots of FSMs/VMs, all of which end up having to
    > > get copied and stored wholesale.  I'll point this thread out to him and
    > > hopefully he'll have a chance to share more specific information.
    > 
    > Sounds good.
    
    Ok, done.
    
    > > I can agree with the idea of having multiple options for how to collect
    > > up the set of changed blocks, though I continue to feel that a
    > > WAL-scanning approach isn't something that we'd have implemented in the
    > > backend at all since it doesn't require the backend and a given backend
    > > might not even have all of the WAL that is relevant.  I certainly don't
    > > think it makes sense to have a backend go get WAL from the archive to
    > > then merge the WAL to provide the result to a client asking for it-
    > > that's adding entirely unnecessary load to the database server.
    > 
    > My motivation for wanting to include it in the database server was twofold:
    > 
    > 1. I was hoping to leverage the background worker machinery.  The
    > WAL-scanner would just run all the time in the background, and start
    > up and shut down along with the server.  If it's a standalone tool,
    > then it can run on a different server or when the server is down, both
    > of which are nice.  The downside though is that now you probably have
    > to put it in crontab or under systemd or something, instead of just
    > setting a couple of GUCs and letting the server handle the rest.  For
    > me that downside seems rather significant, but YMMV.
    
    Background workers can be used to do pretty much anything.  I'm not
    suggesting that's a bad thing- just that it's such a completely generic
    tool that could be used to put anything/everything into the backend, so
    I'm not sure how much it makes sense as an argument when it comes to
    designing a new capability/feature.  Yes, there's an advantage there
    when it comes to configuration since that means we don't need to set up
    a cronjob and can, instead, just set a few GUCs...  but it also means
    that it *must* be done on the server and there's no option to do it
    elsewhere, as you say.
    
    When it comes to "this is something that I can do on the DB server or on
    some other server", the usual preference is to use another system for
    it, to reduce load on the server.
    
    If it comes down to something that needs to/should be an ongoing
    process, then the packaging can package that as a daemon-type tool which
    handles the systemd component to it, assuming the stand-alone tool
    supports that, which it hopefully would.
    
    > 2. In order for the information produced by the WAL-scanner to be
    > useful, it's got to be available to the server when the server is
    > asked for an incremental backup.  If the information is constructed by
    > a standalone frontend tool, and stored someplace other than under
    > $PGDATA, then the server won't have convenient access to it.  I guess
    > we could make it the client's job to provide that information to the
    > server, but I kind of liked the simplicity of not needing to give the
    > server anything more than an LSN.
    
    If the WAL-scanner tool is a stand-alone tool, and it handles picking
    out all of the FPIs and incremental page changes for each relation, then
    what does the tool to build out the "new" backup really need to tell the
    backend?  I feel like it mainly needs to ask the backend for the
    non-relation files, which gets into at least one approach that I've
    thought about for redesigning the backup protocol:
    
    1. Ask for a list of files and metadata about them
    2. Allow asking for individual files
    3. Support multiple connections asking for individual files
    
    Quite a few of the existing backup tools for PG use a model along these
    lines (or use tools underneath which do).
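
    A minimal sketch of the three steps above, purely for illustration.
    Everything in it is an assumption: SERVER_FILES stands in for a data
    directory, and list_files() and fetch_file() are hypothetical names,
    not part of any existing protocol.  Worker threads play the role of
    separate connections.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a server's data directory: path -> contents.
SERVER_FILES = {
    "global/pg_control": b"control",
    "base/1/1259": b"x" * 8192,
    "base/1/1259_vm": b"v" * 8192,
}


def list_files():
    """Step 1: return (path, size) metadata for every file."""
    return [(path, len(data)) for path, data in sorted(SERVER_FILES.items())]


def fetch_file(path):
    """Step 2: fetch one file by name; each call could use its own connection."""
    return path, SERVER_FILES[path]


def parallel_backup(workers=2):
    """Step 3: fetch the listed files over several concurrent 'connections'."""
    manifest = list_files()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        fetched = dict(pool.map(fetch_file, [path for path, _ in manifest]))
    # Check every fetched file against the size reported in the manifest.
    for path, size in manifest:
        if len(fetched[path]) != size:
            raise ValueError(f"size mismatch for {path}")
    return fetched
```

    Because each file is requested by name, the client decides how the
    work is split across connections.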
    
    > > A thought that occurs to me is to have the functions for supporting the
    > > WAL merging be included in libcommon and available to both the
    > > independent executable that's available for doing WAL merging, and to
    > > the backend to be able to do WAL merging itself-
    > 
    > Yeah, that might be possible.
    
    I feel like this would be necessary, as it's certainly delicate and
    critical code and having multiple implementations of it will be
    difficult to manage.
    
    That said...  we already have independent work going on to do WAL
    merging (WAL-G, at least), and if we insist that the WAL replay code
    only exists in the backend, I strongly suspect we'll end up with
    independent implementations of that too.  Sure, we can distance
    ourselves from that and say that we don't have to deal with any bugs
    from it... but it seems like the better approach would be to have a
    common library that provides it.
    
    > > but for a specific
    > > purpose: having a way to reduce the amount of WAL that needs to be sent
    > > to a replica which has a replication slot but that's been disconnected
    > > for a while.  Of course, there'd have to be some way to handle the other
    > > files for that to work to update a long out-of-date replica.  Now, if we
    > > taught the backup tool about having a replication slot then perhaps we
    > > could have the backend effectively have the same capability proposed
    > > above, but without the need to go get the WAL from the archive
    > > repository.
    > 
    > Hmm, but you can't just skip over WAL records or segments because
    > there are checksums and previous-record pointers and things....
    
    Those aren't what I would be worried about, I'd think?  Maybe we're
    talking about different things, but if there's a way to scan/compress
    WAL so that we have less work to do when replaying, then we should
    leverage that for replicas that have been disconnected for a while too.
    
    One important bit here is that the replica wouldn't be able to answer
    queries while it's working through this compressed WAL, since it
    wouldn't reach a consistent state until more-or-less the end of WAL, but
    I am not sure that's a bad thing; who wants to get responses back from a
    very out-of-date replica?
    
    > > I'm still not entirely sure that this makes sense to do in the backend
    > > due to the additional load, this is really just some brainstorming.
    > 
    > Would it really be that much load?
    
    Well, it'd clearly be more than zero.  There may be an argument to be
    made that it's worth it to reduce the overall throughput of the system
    in order to add this capability, but I don't think we've got enough
    information at this point to know.  My gut feeling, at least, is that
    tracking enough information to do WAL-compression on a high-write system
    is going to be pretty expensive as you'd need to have a data structure
    that makes it easy to identify every page in the system, and be able to
    find each of them later on in the stream, and then throw away the old
    FPI in favor of the new one, and then track all the incremental page
    updates to that page, more-or-less, right?
    
    On a large system, given how much information has to be tracked, it
    seems like it could be a fair bit of load, but perhaps you've got some
    ideas as to how to reduce it..?
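
    To make the bookkeeping concrete, here is a rough sketch of the kind
    of per-page tracking described above.  The WalRecord shape and the
    compress_wal() name are assumptions made up for this example; real
    WAL records are far more involved.

```python
from collections import namedtuple

# Hypothetical, simplified WAL record: which page it touches, whether it
# carries a full-page image (FPI), and an opaque payload.
WalRecord = namedtuple("WalRecord", "lsn page is_fpi payload")


def compress_wal(records):
    """Keep, per page, only the latest FPI plus the changes that follow it."""
    state = {}  # page id -> list of records still needed for that page
    for rec in records:
        if rec.is_fpi:
            # A newer FPI supersedes everything seen so far for this page.
            state[rec.page] = [rec]
        else:
            state.setdefault(rec.page, []).append(rec)
    # Emit the surviving records in LSN order.
    survivors = (rec for recs in state.values() for rec in recs)
    return sorted(survivors, key=lambda rec: rec.lsn)
```

    The state dictionary holds an entry for every distinct page touched,
    which is where the memory and lookup cost on a large, write-heavy
    system would come from.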
    
    Thanks!
    
    Stephen
    
  44. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-04-18T21:17:02Z

    Greetings,
    
    I wanted to respond to this point specifically as I feel like it'll
    really help clear things up when it comes to the point of view I'm
    seeing this from.
    
    * Robert Haas (robertmhaas@gmail.com) wrote:
    > > Perhaps that's what we're both saying too and just talking past each
    > > other, but I feel like the approach here is "make it work just for the
    > > simple pg_basebackup case and not worry too much about the other tools,
    > > since what we do for pg_basebackup will work for them too" while where
    > > I'm coming from is "focus on what the other tools need first, and then
    > > make pg_basebackup work with that if there's a sensible way to do so."
    > 
    > I think perhaps the disconnect is that I just don't see how it can
    > fail to work for the external tools if it works for pg_basebackup.
    
    The existing backup protocol that pg_basebackup uses *does* *not* *work*
    for the external backup tools.  If it worked, they'd use it, but they
    don't and that's because you can't do things like a parallel backup,
    which we *know* users want because there's a number of tools which
    implement that exact capability.
    
    I do *not* want another piece of functionality added in this space which
    is limited in the same way because it does *not* help the external
    backup tools at all.
    
    > Any given piece of functionality is either available in the
    > replication stream, or it's not.  I suspect that for both BART and
    > pg_backrest, they won't be able to completely give up on having their
    > own backup engines solely because core has incremental backup, but I
    > don't know what the alternative to adding features to core one at a
    > time is.
    
    This idea that it's either "in the replication system" or "not in the
    replication system" is really bad, in my view, because it can be "in the
    replication system" and at the same time not at all useful to the
    existing external backup tools, but users and others will see the
    "checkbox" as ticked and assume that it's available in a useful fashion
    by the backend and then get upset when they discover the limitations.
    
    The existing base backup/replication protocol that's used by
    pg_basebackup is *not* useful to most of the backup tools, that's quite
    clear since they *don't* use it.  Building on to that an incremental
    backup solution that is similarly limited isn't going to make things
    easier for the external tools.
    
    If the goal is to make things easier for the external tools by providing
    capability in the backend / replication protocol then we need to be
    looking at what those tools require and not at what would be minimally
    sufficient for pg_basebackup.  If we don't care about the external tools
    and *just* care about making it work for pg_basebackup, then let's be
    clear about that, and accept that it'll have to be, most likely, ripped
    out and rewritten when we go to add parallel capabilities, for example,
    to pg_basebackup down the road.  That's clearly the case for the
    existing "base backup" protocol, so I don't see why it'd be different
    for an incremental backup system that is similarly designed and
    implemented.
    
    To be clear, I'm all for adding features to core one at a time, but
    there's different ways to implement features and that's really what
    we're talking about here- what's the best way to implement this
    feature, ideally in a way that it's useful, practically, to both
    pg_basebackup and the other external backup utilities.
    
    Thanks!
    
    Stephen
    
  45. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-04-18T22:39:46Z

    Greetings,
    
    Ok, responding to the rest of this email.
    
    * Robert Haas (robertmhaas@gmail.com) wrote:
    > On Wed, Apr 17, 2019 at 6:43 PM Stephen Frost <sfrost@snowman.net> wrote:
    > > Sadly, I haven't got any great ideas today.  I do know that the WAL-G
    > > folks have specifically mentioned issues with the visibility map being
    > > large enough across enough of their systems that it kinda sucks to deal
    > > with.  Perhaps we could do something like the rsync binary-diff protocol
    > > for non-relation files?  This is clearly just hand-waving but maybe
    > > there's something reasonable in that idea.
    > 
    > I guess it all comes down to how complicated you're willing to make
    > the client-server protocol.  With the very simple protocol that I
    > proposed -- client provides a threshold LSN and server sends blocks
    > modified since then -- the client need not have access to the old
    > incremental backup to take a new one.
    
    Where is the client going to get the threshold LSN from?
    
    > Of course, if it happens to
    > have access to the old backup then it can delta-compress however it
    > likes after-the-fact, but that doesn't help with the amount of network
    > transfer.
    
    If it doesn't have access to the old backup, then I'm a bit confused as
    to how an incremental backup would be possible?  Isn't that a requirement
    here?
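
    For reference, the threshold-LSN scheme quoted above amounts to
    something like the following sketch.  The page_lsns mapping is a
    hypothetical input (a real server would read the LSN from each page
    header), and the threshold is simply passed in, since where it comes
    from is exactly the open question.

```python
def parse_lsn(text):
    """Convert PostgreSQL's textual LSN form 'XXXXXXXX/YYYYYYYY' to an integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def blocks_to_send(page_lsns, threshold):
    """Return block numbers whose page LSN is at or past the threshold.

    page_lsns maps block number -> textual page LSN (hypothetical input).
    """
    limit = parse_lsn(threshold)
    return sorted(blk for blk, lsn in page_lsns.items() if parse_lsn(lsn) >= limit)
```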
    
    > That problem could be solved by doing something like what
    > you're talking about (with some probably-negligible false match rate)
    > but I have no intention of trying to implement anything that
    > complicated, and I don't really think it's necessary, at least not for
    > a first version.  What I proposed would already allow, for most users,
    > a large reduction in transfer and storage costs; what you are talking
    > about here would help more, but also be a lot more work and impose
    > some additional requirements on the system.  I don't object to you
    > implementing the more complex system, but I'll pass.
    
    I was talking about the rsync binary-diff specifically for the files
    that aren't easy to deal with in the WAL stream.  I wouldn't think we'd
    use it for other files, and there is definitely a question there of if
    there's a way to do better than a binary-diff approach for those files.
    
    > > There's something like 6 different backup tools, at least, for
    > > PostgreSQL that provide backup management, so I have a really hard time
    > > agreeing with this idea that users don't want a PG backup management
    > > system.  Maybe that's not what you're suggesting here, but that's what
    > > came across to me.
    > 
    > Let me be a little more clear.  Different users want different things.
    > Some people want a canned PostgreSQL backup solution, while other
    > people just want access to a reasonable set of facilities from which
    > they can construct their own solution.  I believe that the proposal I
    > am making here could be used either by backup tool authors to enhance
    > their offerings, or by individuals who want to build up their own
    > solution using facilities provided by core.
    
    The last thing that I think users really want is to build up their own
    solution.  There may be some organizations who would like to provide
    their own tool, but that's a bit different.  Personally, I'd *really*
    like PG to have a good tool in this area and I've been working, as I've
    said before, to try to get to a point where we at least have the option
    to add in such a tool that meets our various requirements.
    
    Further, I'm concerned that the approach being presented here won't be
    interesting to most of the external tools because it's limited and can't
    be used in a parallel fashion.
    
    > > Unless maybe I'm misunderstanding and what you're suggesting here is
    > > that the "existing solution" is something like the external PG-specific
    > > backup tools?  But then the rest doesn't seem to make sense, as only
    > > maybe one or two of those tools use pg_basebackup internally.
    > 
    > Well, what I'm really talking about is in two pieces: providing some
    > new facilities via the replication protocol, and making pg_basebackup
    > able to use those facilities.  Nothing would stop other tools from
    > using those facilities directly if they wish.
    
    If those facilities are developed and implemented in the same way as the
    protocol used by pg_basebackup works, then I strongly suspect that the
    existing backup tools will treat it similarly- which is to say, they'll
    largely end up ignoring it.
    
    > > ... but this is exactly the situation we're in already with all of the
    > > *other* features around backup (parallel backup, backup management, WAL
    > > management, etc).  Users want those features, pg_basebackup/PG core
    > > doesn't provide it, and therefore there's a bunch of other tools which
    > > have been written that do.  In addition, saying that PG has incremental
    > > backup but no built-in management of those full-vs-incremental backups
    > > and telling users that they basically have to build that themselves
    > > really feels a lot like we're trying to address a check-box requirement
    > > rather than making something that our users are going to be happy with.
    > 
    > I disagree.  Yes, parallel backup, like incremental backup, needs to
    > go in core.  And pg_basebackup should be able to do a parallel backup.
    > I will fight tooth, nail, and claw any suggestion that the server
    > should know how to do a parallel backup but pg_basebackup should not
    > have an option to exploit that capability.  And similarly for
    > incremental.
    
    These aren't independent things though, the way it seems like you're
    portraying them, because there are ways we can implement incremental
    backup that would support it being parallelized, and ways we can
    implement it that wouldn't work with parallelism at all, and all I'm
    arguing for is that we add in this feature in a way that it can be
    parallelized (since that's what most of the external tools do today...),
    even though pg_basebackup can't be, but in a way that pg_basebackup can
    also use it (albeit in a serial fashion).
    
    > > I don't think that I was very clear in what my specific concern here
    > > was.  I'm not asking for pg_basebackup to have parallel backup (at
    > > least, not in this part of the discussion), I'm asking for the
    > > incremental block-based protocol that's going to be built-in to core to
    > > be able to be used in a parallel fashion.
    > >
    > > The existing protocol that pg_basebackup uses is basically, connect to
    > > the server and then say "please give me a tarball of the data directory"
    > > and that is then streamed on that connection, making that protocol
    > > impossible to use for parallel backup.  That's fine as far as it goes
    > > because only pg_basebackup actually uses that protocol (note that nearly
    > > all of the other tools for doing backups of PostgreSQL don't...).  If
    > > we're expecting the external tools to use the block-level incremental
    > > protocol then that protocol really needs to have a way to be
    > > parallelized, otherwise we're just going to end up with all of the
    > > individual tools doing their own thing for block-level incremental
    > > (though perhaps they'd reimplement whatever is done in core but in a way
    > > that they could parallelize it...), if possible (which I add just in
    > > case there's some idea that we end up in a situation where the
    > > block-level incremental backup has to coordinate with the backend in
    > > some fashion to work...  which would mean that *everyone* has to use the
    > > protocol even if it isn't parallel and that would be really bad, imv).
    > 
    > The obvious way of extending this system to parallel backup is to have
    > N connections each streaming a separate tarfile such that when you
    > combine them all you recreate the original data directory.  That would
    > be perfectly compatible with what I'm proposing for incremental
    > backup.  Maybe you have another idea in mind, but I don't know what it
    > is exactly.
    
    So, while that's an obvious approach, it isn't the most sensible- and
    we know that from experience in actually implementing parallel backup of
    PG files.  I'm happy to discuss the approach we use in pgBackRest if
    you'd like to discuss this further, but it seems a bit far afield from
    the topic of discussion here and it seems like you're not interested or
    offering to work on supporting parallel backup in core.
    
    I don't think what you're proposing here wouldn't, technically, work for
    the various external tools, what I'm saying is that they aren't going to
    actually use it, which means that you're really implementing it *only*
    for pg_basebackup's benefit... and only for as long as pg_basebackup is
    serial in nature.
    
    > > > Wait, you want to make it maximally easy for users to start the server
    > > > in a state that is 100% certain to result in a corrupted and unusable
    > > > database?  Why?? I'd like to make that a tiny bit difficult.  If
    > > > they really want a corrupted database, they can remove the file.
    > >
    > > No, I don't want it to be easy for users to start the server in a state
    > > that's going to result in a corrupted cluster.  That's basically the
    > > complete opposite of what I was going for- having a file that can be
    > > trivially removed to start up the cluster is *going* to result in people
    > > having corrupted clusters, no matter how much we tell them "don't do
    > > that".  This is exactly the problem we have with backup_label today.
    > > I'd really rather not double-down on that.
    > 
    > Well, OK, but short of scanning the entire directory tree on startup,
    > I don't see how to achieve that.
    
    Ok, so, this is a bit of spit-balling, just to be clear, but we
    currently track things like "where we know the heap files are
    consistent" by storing it in the control file as a checkpoint LSN, and
    then we have a backup_label file to say where we need to get to in order
    to be consistent from a backup.  Perhaps there's a way to use those to
    cross-validate while we are updating a data directory to be consistent?
    Maybe we update those files as we go, and add a cross-check flag between
    them, so that we know from two places that we're restoring from a backup
    (incremental or full), and then also know where we need to start from
    and where we need to get to, in order to be consistent.
    
    Of course, users can still get past this by hacking these files around
    and maybe we can provide a tool along the lines of pg_resetwal which
    lets them force the files to agree, but then we can at least throw big
    glaring warnings and tell users "this is really bad, type YES to
    continue".
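
    In the same spit-balling spirit, the cross-check could look roughly
    like this.  RestoreMarker and may_start() are hypothetical names for
    this sketch only; nothing like them exists in the control file or
    backup_label today.

```python
from dataclasses import dataclass


@dataclass
class RestoreMarker:
    """Hypothetical restore state recorded in one of the two files."""
    in_restore: bool
    start_lsn: int
    target_lsn: int


def may_start(control, label, replayed_lsn, force=False):
    """Decide whether startup should proceed, based on both markers."""
    if control != label:
        # The two files disagree: refuse unless the user explicitly forces it.
        return force
    if not control.in_restore:
        return True
    # Both agree a restore is under way: require replay to reach the target.
    return replayed_lsn >= control.target_lsn
```

    The force flag corresponds to the "type YES to continue" escape
    hatch; in this sketch it only overrides a disagreement between the
    two files, not an unfinished replay.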
    
    > > There's really two things here- the first is that I agree with the
    > > concern about potentially destroying the existing backup if the
    > > pg_basebackup doesn't complete, but there's some ways to address that
    > > (such as filesystem snapshotting), so I'm not sure that the idea is
    > > quite that bad, but it would need to be more than just what
    > > pg_basebackup does in this case in order to be trustworthy (at least,
    > > for most).
    > 
    > Well, I did mention in my original email that there could be a
    > combine-backups-destructively option.  I guess this is just taking
    > that to the next level: merge a backup being taken into an existing
    > backup on-the-fly.  Given your remarks above, it is worth noting that
    > this GREATLY increases the chances of people accidentally causing
    > corruption in ways that are almost undetectable.  All they have to do
    > is kill -9 the backup tool half way through and then start postgres on
    > the resulting directory.
    
    Right, we need to come up with a way to detect if that happens and
    complain loudly, and not continue to move forward unless and until the
    user explicitly insists that it's the right thing to do.
    
    > > The other part here is the idea of endless incrementals where the blocks
    > > which don't appear to have changed are never re-validated against what's
    > > in the backup.  Unfortunately, latent corruption happens and you really
    > > want to have a way to check for that.  In past discussions that I've had
    > > with David, there's been some idea to check some percentage of the
    > > blocks that didn't appear to change for each backup against what's in
    > > the backup.
    > 
    > Sure, I'm not trying to block anybody from developing something like
    > that, and I acknowledge that there is risk in a system like this,
    > but...
    > 
    > > I share this just to point out that there's some risk to that approach,
    > > not to say that we shouldn't do it or that we should discourage the
    > > development of such a feature.
    > 
    > ...it seems we are viewing this, at least, from the same perspective.
    
    Great, but I feel like the question here is if we're comfortable putting
    out this capability *without* some mechanism to verify that the existing
    blocks are clean/not corrupted/changed, or if we feel like this risk is
    enough that we want to include a check of the existing blocks, in some
    fashion, as part of the incremental backup feature.
    
    Personally, and in discussion with David, we've generally felt like we
    don't want this feature until we have a way to verify the blocks that
    aren't being backed up every time and we are assuming are clean/correct,
    (at least some portion of them anyway, with a way to make sure we
    eventually check them all) because we are concerned that users will get
    bit by latent corruption and then be quite unhappy with us for not
    picking up on that.
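
    One simple way to check "some portion of them, with a way to make
    sure we eventually check them all" is a rotating selection, sketched
    below.  This is only an illustration under assumed names
    (blocks_to_verify, backup_seq, cycle), not a description of what any
    existing tool does.

```python
def blocks_to_verify(unchanged_blocks, backup_seq, cycle):
    """Pick the unchanged blocks to re-check during backup number backup_seq.

    Each block number is selected exactly once in every run of `cycle`
    consecutive backups, so roughly 1/cycle of the blocks are checked
    per backup.
    """
    return [blk for blk in unchanged_blocks if (blk + backup_seq) % cycle == 0]
```

    With cycle = 4, a block that stays unchanged is compared against the
    backup once every four incrementals.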
    
    > > Wow.  I have to admit that I feel completely opposite of that- I'd
    > > *love* to have an independent tool (which ideally uses the same code
    > > through the common library, or similar) that can be run to apply WAL.
    > >
    > > In other words, I don't agree that it's the server's problem at all to
    > > solve that, or, at least, I don't believe that it needs to be.
    > 
    > I mean, I guess I'd love to have that if I could get it by waving a
    > magic wand, but I wouldn't love it if I had to write the code or
    > maintain it.  The routines for applying WAL currently all assume that
    > you have a whole bunch of server infrastructure present; that code
    > wouldn't run in a frontend environment, I think.  I wouldn't want to
    > have a second copy of every WAL apply routine that might have its own
    > set of bugs.
    
    I agree that we don't want to have multiple implementations or copies of
    the WAL apply routines.  On the other hand, while I agree that there's
    some server infrastructure they depend on today, I feel like a lot of
    that infrastructure is things that we'd actually like to have in at
    least some of the client tools (and likely pg_basebackup specifically).
    I understand that it's not trivial to implement, of course, or to pull
    out into a common library.  We are already seeing some efforts to
    consolidate common routines in the client libraries (Peter E's recent
    work around the error messaging being a good example) and I feel like
    that's something we should encourage and expect to see happening more in
    the future as we add more sophisticated client utilities.
    
    > > I've tried to outline how the incremental backup capability and backup
    > > management are really very closely related and having those be
    > > implemented by independent tools is not a good interface for our users
    > > to have to live with.
    > 
    > I disagree.  I think the "existing backup tools don't use
    > pg_basebackup" argument isn't very compelling, because the reason
    > those tools don't use pg_basebackup is because it can't do what they
    > need.  If it did, they'd probably use it.  People don't write a whole
    > separate engine for running backups just because it's fun to not reuse
    > code -- they do it because there's no other way to get what they want.
    
    I understand that you disagree, but I don't clearly understand the
    subsequent justification for why you disagree.  As I understand it, you
    disagree that an incremental backup capability and backup management are
    closely related, and your reason is that the existing tools don't
    leverage pg_basebackup (or the backup protocol), but aren't those pretty
    distinct things?  I accept that perhaps it's my fault for implying that
    these topics were related in the emails I've sent while replying to
    various parts of the discussion, which has traveled across a number of
    topics, some related and some not.  I see incremental backups and backup
    management as related because, in part, of expiration- if you expire out
    a 'full' backup then you must expire out any incremental or differential
    backups based on it.  Just generally that association of which
    incremental depends on which full (or prior differential, or prior
    incremental) is extremely important and necessary to avoid corrupt
    systems (consider that you might apply an incremental to a full backup,
    but the incremental taken was actually based on another incremental and
    not based on the full, or variations of that...).
    
    In short, I don't think I could confidently trust any incremental backup
    that's taken without having a clear link to the backup it's based on,
    and having it be expired when the backup it depends on is expired.
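    
    As an illustration of the dependency tracking described above, here is a
    minimal sketch.  It is not taken from any existing tool; the Backup
    record, the catalog dictionary, and the function names are all
    hypothetical.  Each backup records the backup it is based on, expiring a
    backup also expires everything that depends on it, and a restore refuses
    to proceed if any link in the chain is missing.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Backup:
    label: str             # hypothetical backup identifier
    parent: Optional[str]  # label of the backup this one is based on; None for a full


def expire(catalog: Dict[str, Backup], label: str) -> Set[str]:
    """Return every label that must be expired together with `label`."""
    doomed = {label}
    changed = True
    while changed:
        changed = False
        for backup in catalog.values():
            # Anything based on an expired backup is unusable, so expire it too.
            if backup.parent in doomed and backup.label not in doomed:
                doomed.add(backup.label)
                changed = True
    return doomed


def restore_chain(catalog: Dict[str, Backup], label: str) -> List[str]:
    """Return the backups to apply, full first; fail if a link is missing."""
    chain: List[str] = []
    current: Optional[str] = label
    while current is not None:
        if current not in catalog:
            raise LookupError("backup %r is missing from the catalog" % current)
        chain.append(current)
        current = catalog[current].parent
    chain.reverse()
    return chain
```

    With a full, a differential based on it, and an incremental based on the
    differential, expiring the full returns all three labels, while
    restore_chain() on the incremental lists them in the order they would
    have to be applied.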
    
    > > Most of the external tools don't use pg_basebackup, nor the base backup
    > > protocol (or, if they do, it's only as an option among others).  In my
    > > opinion, that's pretty clear indication that pg_basebackup and the base
    > > backup protocol aren't sufficient to cover any but the simplest of
    > > use-cases (though those simple use-cases are handled rather well).
    > > We're talking about adding on a capability that's much more complicated
    > > and is one that a lot of tools have already taken a stab at, let's try
    > > to do it in a way that those tools can leverage it and avoid having to
    > > implement it themselves.
    > 
    > I mean, again, if it were part of pg_basebackup and available via the
    > replication protocol, they could do exactly that, through either
    > method.  I don't get it.
    
    No, they can't.  Today there exists *exactly* this situation:
    pg_basebackup uses the base backup protocol for doing backups, and the
    external tools don't use it.
    
    Why?
    
    Because it can't be used in a parallel manner, making it largely
    uninteresting as a mechanism for doing backups of systems at any scale.
    
    Yes, sure, they *could* technically use it, but from a *practical*
    standpoint they don't because it *sucks*.  Let's not do that for
    incremental backups.
    
    > You seem to be arguing that we shouldn't add
    > the necessary capabilities to the replication protocol or
    > pg_basebackup, but at the same time arguing that pg_basebackup is
    > inadequate because it's missing important capabilities.  This confuses
    > me.
    
    I'm sorry for not being clear.  I'm not arguing that we *shouldn't* add
    such capabilities.  I *want* these capabilities to be added, but I want
    them added in a way that's actually useful to the external tools and not
    something that only works for pg_basebackup (which is currently
    single-threaded).
    
    I hope that's the kind of feedback you've been looking for on this
    thread.
    
    > > It's an interesting idea to add in everything to pg_basebackup that
    > > users doing backups would like to see, but that's quite a list:
    > >
    > > - full backups
    > > - differential backups
    > > - incremental backups / block-level backups
    > > - (server-side) compression
    > > - (server-side) encryption
    > > - page-level checksum validation
    > > - calculating checksums (on the whole file)
    > > - External object storage (S3, et al)
    > > - more things...
    > >
    > > I'm really not convinced that I agree with the division of labor as
    > > you've outlined it, where all of the above is done by pg_basebackup,
    > > where just archiving and backup retention are handled by some external
    > > tool (except that we already have pg_receivewal, so archiving isn't
    > > really an externally handled thing either, unless you want features like
    > > parallel archive-push or parallel archive-get...).
    > 
    > Yeah, if it were up to me, I'd choose to put most of that in the server
    > and make it available via the replication protocol, and then make
    > pg_basebackup able to use that functionality.
    
    I'm all about that.  I don't know that the client-side tool would still
    be called 'pg_basebackup' at that point, but I definitely want to get to
    a point where we have all of these capabilities available in core.
    
    > And external tools
    > could use that functionality via pg_basebackup or by using the
    > replication protocol directly.  I actually don't really understand
    > what the alternative is.  If you want server-side compression, for
    > example, that really has to be done on the server.  And how would the
    > server expose that, except through the replication protocol?  Sure, we
    > could design a new protocol for it. Call it... say... the
    > shmeplication protocol.  And then you could use the replication
    > protocol for what it does today and the shmeplication protocol for all
    > the cool bits.  But why would that be better?
    
    The replication protocol (or base backup protocol, really..) is what we
    make it, in the end.  Of course server-side compression needs to be done
    on the server and we need a way to tell the server "please compress this
    for us before sending it".  I'm not suggesting there's some alternative
    to that.  What I'm suggesting is that when we go to implement the
    incremental backup protocol that we have a way for that to be
    parallelized (at least...  maybe other things too) because that's what
    the external tools would really like.
    
    Even pg_dump works in the way that it connects and builds a list of
    things to run against and then farms that out to the parallel processes,
    so we have an example of how this is done in core today.
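    
    The "build a list, then farm it out" pattern can be sketched roughly as
    follows.  This is only an illustration under stated assumptions: a local
    directory copy stands in for fetching files from the server, worker
    threads stand in for pg_dump's parallel worker processes, and
    build_file_list(), copy_one(), and parallel_copy() are hypothetical
    names.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def build_file_list(src_root):
    """Step 1: a single pass enumerates the work items (relative file paths)."""
    items = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            items.append(os.path.relpath(full, src_root))
    return sorted(items)


def copy_one(src_root, dst_root, rel_path):
    """Step 2: one worker handles one item, independently of the others."""
    dst = os.path.join(dst_root, rel_path)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copyfile(os.path.join(src_root, rel_path), dst)
    return rel_path


def parallel_copy(src_root, dst_root, jobs=4):
    """Build the list once, then hand the items to `jobs` workers."""
    items = build_file_list(src_root)
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # map() returns results in the order of `items`.
        return list(pool.map(lambda p: copy_one(src_root, dst_root, p), items))
```

    The point of the sketch is only that the unit of work is an individual
    item that any worker can request, rather than one stream that a single
    connection must consume from start to finish.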
    
    > > What would really help me, at least, understand the idea here would be
    > > to understand exactly what the existing tools do that the subset of
    > > users you're thinking about doesn't like/want, but which pg_basebackup,
    > > today, does.  Is the issue that there's a repository instead of just a
    > > plain PG directory or set of tar files, like what pg_basebackup produces
    > > today?  But how would we do things like have compression, or encryption,
    > > or block-based incremental backups without some kind of repository or
    > > directory that doesn't actually look exactly like a PG data directory?
    > 
    > I guess we're still wallowing in the same confusion here.
    > pg_basebackup, for me, is just a convenient place to stick this
    > functionality.  If the server has the ability to construct and send an
    > incremental backup by some means, then it needs a client on the other
    > end to receive and store that backup, and since pg_basebackup already
    > knows how to do that for full backups, extending it to incremental
    > backups (and/or parallel, encrypted, compressed, and validated
    > backups) seems very natural to me.  Otherwise I add server-side
    > functionality to allow $X and then have to write an entirely new
    > client to interact with that instead of just using the client I've
    > already got.  That's more work, and I'm lazy.
    
    I'm not suggesting that we don't add this functionality to
    pg_basebackup, I'm just saying that we should be thinking about how the
    external tools will want to leverage this new capability because it's
    materially different from the basic minimum that pg_basebackup requires.
    Yes, it'd be a bit more work and a somewhat more complicated protocol
    than the simple approach needed by pg_basebackup, but that's what those
    other tools will want.  If we don't care about them, ok, I get that, but
    I thought the idea here was to build something that's useful to both the
    external tools and pg_basebackup.  We won't get that if we focus on just
    implementing a protocol for pg_basebackup to use.
    
    > Now it's true that if we wanted to build something like the rsync
    > protocol into PostgreSQL, jamming that into pg_basebackup might well
    > be a bridge too far.  That would involve taking backups via a method
    > so different from what we're currently doing that it would probably
    > make sense to at least consider creating a whole new tool for that
    > purpose.  But that wasn't my proposal...
    
    The idea around the rsync binary-diff protocol was *specifically* for
    things that we can't do through block-level updates with WAL scanning,
    just to be clear.  I wasn't thinking that would be good for the relation
    files since we have more information for those in the LSN, et al.
    
    Thanks!
    
    Stephen
    
  46. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-04-20T00:04:41Z

    Greetings,
    
    * Andres Freund (andres@anarazel.de) wrote:
    > > > Wow.  I have to admit that I feel completely opposite of that- I'd
    > > > *love* to have an independent tool (which ideally uses the same code
    > > > through the common library, or similar) that can be run to apply WAL.
    > > >
    > > > In other words, I don't agree that it's the server's problem at all to
    > > > solve that, or, at least, I don't believe that it needs to be.
    > > 
    > > I mean, I guess I'd love to have that if I could get it by waving a
    > > magic wand, but I wouldn't love it if I had to write the code or
    > > maintain it.  The routines for applying WAL currently all assume that
    > > you have a whole bunch of server infrastructure present; that code
    > > wouldn't run in a frontend environment, I think.  I wouldn't want to
    > > have a second copy of every WAL apply routine that might have its own
    > > set of bugs.
    > 
    > I'll fight tooth and nail not to have a second implementation of replay,
    > even if it's just portions.  The code we have is complicated and fragile
    > enough, having a [partial] second version would be way worse.  There's
    > already plenty improvements we need to make to speed up replay, and a
    > lot of them require multiple execution threads (be it processes or OS
    > threads), something not easily feasible in a standalone tool. And
    > without the already existing concurrent work during replay (primarily
    > checkpointer doing a lot of the necessary IO), it'd also be pretty
    > unattractive to use any separate tool.
    
    I agree that we don't want another implementation and that there's a lot
    that we want to do to improve replay performance.  We've already got
    frontend tools which work with multiple execution threads, so I'm not
    sure I get the "not easily feasible" bit, and the argument about the
    checkpointer seems largely related to that (as in- if we didn't have
    multiple threads/processes then things would perform quite badly...  but
    we can and do have multiple threads/processes in frontend tools today,
    even in pg_basebackup).
    
    You certainly bring up some good concerns though and they make me think
    of other bits that would seem like they'd possibly be larger issues for
    a frontend tool- like having a large pool of memory for caching (aka
    shared buffers) the changes.  If what we're talking about here is *just*
    replay though, without having the system available for reads, I wonder
    if we might want a different solution there.
    
    > Unless you just define the server binary as that "independent tool".
    
    That's certainly an interesting idea.
    
    > Which I think is entirely reasonable. With the 'consistent' and LSN
    > recovery targets one already can get most of what's needed from such a
    > tool, anyway.  I'd argue the biggest issue there is that there's no
    > equivalent to starting postgres with a private socket directory on
    > windows, and perhaps an option or two making it easier to start postgres
    > in a "private" mode for things like this.
    
    This would mean building in a way to do parallel WAL replay into the
    server binary though, as discussed above, and it seems like making that
    work in a way that allows us to still be available as a read-only
    standby would be quite a bit more difficult.  We could possibly support
    parallel WAL replay only when we aren't a replica but from the same
    binary.  The concerns mentioned about making it easier to start PG in a
    private mode don't seem too bad but I am not entirely sure that the
    tools which want to leverage that kind of capability would want to have
    to exec out to the PG binary to use it.
    
    A lot of this part of the discussion feels like a tangent though, unless
    I'm missing something.  The "WAL compression" tool contemplated
    previously would be much simpler and not the full-blown WAL replay
    capability, which would be left to the server, unless you're suggesting
    that even that should be exclusively the purview of the backend?  Though
    that ship's already sailed, given that external projects have
    implemented it.  Having a library to provide that which external
    projects could leverage would be nicer than having everyone write their
    own version.
    
    Thanks!
    
    Stephen
    
  47. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-20T04:05:35Z

    On Thu, Apr 18, 2019 at 6:39 PM Stephen Frost <sfrost@snowman.net> wrote:
    > Where is the client going to get the threshold LSN from?
    >
    > If it doesn't have access to the old backup, then I'm a bit confused as
    > to how an incremental backup would be possible?  Isn't that a requirement
    > here?
    
    I explained this in the very first email that I wrote on this thread,
    and then wrote a very extensive further reply on this exact topic to
    Peter Eisentraut.  It's a bit disheartening to see you arguing against
    my ideas when it's not clear that you've actually read and understood
    them.
    
    > > The obvious way of extending this system to parallel backup is to have
    > > N connections each streaming a separate tarfile such that when you
    > > combine them all you recreate the original data directory.  That would
    > > be perfectly compatible with what I'm proposing for incremental
    > > backup.  Maybe you have another idea in mind, but I don't know what it
    > > is exactly.
    >
    > So, while that's an obvious approach, it isn't the most sensible- and
    > we know that from experience in actually implementing parallel backup of
    > PG files.  I'm happy to discuss the approach we use in pgBackRest if
    > you'd like to discuss this further, but it seems a bit far afield from
    > the topic of discussion here and it seems like you're not interested or
    > offering to work on supporting parallel backup in core.
    
    If there's some way of modifying my proposal so that it makes life
    better for external backup tools, I'm certainly willing to consider
    that, but you're going to have to tell me what you have in mind.  If
    that means describing what pgbackrest does, then do it.
    
    My concern here is that you seem to want a lot of complicated stuff
    that will require *significant* setup in order for people to be able
    to use it.  From what I am able to gather from your remarks so far,
    you think people should archive their WAL to a separate machine, and
    then the WAL-summarizer should run there, and then data from that
    should be fed back to the backup client, which should then give the
    server a list of modified files (and presumably, someday, blocks) and
    the server then returns that data, which the client then
    cross-verifies with checksums and awesome sauce.
    
    Which is all fine, but actually requires quite a bit of set-up and
    quite a bit of buy-in to the tool.  And I have no problem with people
    having that level of buy-in to the tool.  EnterpriseDB offers a number
    of tools which require similar levels of setup and configuration, and
    it's not inappropriate for an enterprise-grade backup tool to have all
    that stuff.  However, for those who may not want to do all that, my
    original proposal lets you take an incremental backup by doing the
    following list of steps:
    
    1. Take an incremental backup.
    
    If you'd like, you can also:
    
    0. Enable the WAL-scanning background worker to make incremental
    backups much faster.
    
    You do not need a WAL archive, and you do not need EITHER the backup
    tool or the server to have access to previous backups, and you do not
    need the client to have any access to archived WAL or the summary
    files produced from it.  The only thing you need to know is the
    start-of-backup LSN for the previous backup.
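    
    To make the "only the previous start-of-backup LSN" point concrete, here
    is a minimal sketch of the simplest approach mentioned at the start of
    the thread: scan every block of a relation file and keep those whose
    page LSN is >= the threshold.  It is an illustration, not the proposed
    server code.  It assumes the default 8192-byte block size, reads pd_lsn
    from the first 8 bytes of each page header in the machine's native byte
    order, and ignores new (all-zero) pages, torn reads, and checksums; the
    function names are hypothetical.

```python
import struct

BLCKSZ = 8192  # assumption: PostgreSQL's default block size


def parse_lsn(text):
    """Convert the textual 'hi/lo' hexadecimal LSN form to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def page_lsn(page):
    """Read pd_lsn: two native-endian uint32 values at the start of the page."""
    xlogid, xrecoff = struct.unpack_from("=II", page, 0)
    return (xlogid << 32) | xrecoff


def changed_blocks(path, threshold):
    """Yield (block_number, page_bytes) for each page with LSN >= threshold."""
    with open(path, "rb") as f:
        blkno = 0
        while True:
            page = f.read(BLCKSZ)
            if len(page) < BLCKSZ:
                break  # end of file (or a partial trailing page, ignored here)
            if page_lsn(page) >= threshold:
                yield blkno, page
            blkno += 1
```

    A caller would pass the previous backup's start LSN, for example
    changed_blocks(path, parse_lsn("0/3000028")), where the LSN value is
    made up for the example.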
    
    I expect you to reply with a long complaint about how my proposal is
    totally inadequate, but actually I think for most people, most of the
    time, it would not only be adequate, but extremely convenient.  And
    despite your protestations to the contrary, it does not block
    parallelism, checksum verification, or any other cool features that
    somebody may want to add later.  It'll work just fine with those
    things.
    
    And for the record, I am willing to put some effort into parallelism.
    I just think that it makes more sense to do the incremental part
    first.  I think that incremental backup is likely to have less effect
    on parallel backup than the other way around.  What I'm NOT willing to
    do is build a whole bunch of infrastructure that will help pgbackrest
    do amazing things but will not provide a simple and convenient way of
    taking incremental backups using only core tools.  I do care about
    having something that's good for pgbackrest and other out-of-core
    tools.  I just care about it MUCH LESS than I care about making
    PostgreSQL core awesome.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  48. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-04-20T04:19:51Z

    Greetings,
    
    * Robert Haas (robertmhaas@gmail.com) wrote:
    > What I'm NOT willing to
    > do is build a whole bunch of infrastructure that will help pgbackrest
    > do amazing things but will not provide a simple and convenient way of
    > taking incremental backups using only core tools.  I do care about
    > having something that's good for pgbackrest and other out-of-core
    > tools.  I just care about it MUCH LESS than I care about making
    > PostgreSQL core awesome.
    
    Then I misunderstood your original proposal where you talked about
    providing something that the various external tools could use.  If you'd
    like to *just* provide a mechanism for pg_basebackup to be able to do a
    trivial incremental backup, great, but it's not going to be useful or
    used by the external tools, just like the existing base backup protocol
    isn't used by the external tools because it can't be used in a parallel
    fashion.
    
    As such, and with all the other missing bits from pg_basebackup, it
    looks likely to me that such a feature is going to be lackluster, at
    best, and end up being only marginally interesting, when it could have
    been much more and leveraged by all of the existing tools.  I agree that
    making a parallel-supporting protocol work is harder but I actually
    don't think it would be *that* much more difficult to do.
    
    That's frankly discouraging, but I'm not going to tell you where to
    spend your time.
    
    Making PG core awesome when it comes to backup is going to involve so
    much more than just marginal improvements to pg_basebackup, but it's
    also something that I'm very much supportive of and have invested a
    great deal in, by spending time and resources working to build a tool
    that gets closer to what an in-core solution would look like than
    anything that exists today.
    
    Thanks,
    
    Stephen
    
  49. Re: block-level incremental backup

    Andrey Borodin <x4mmm@yandex-team.ru> — 2019-04-20T16:44:35Z

    Hi!
    
    Sorry for the delay.
    
    > On 18 Apr 2019, at 21:56, Robert Haas <robertmhaas@gmail.com> wrote:
    > 
    > On Wed, Apr 17, 2019 at 5:20 PM Stephen Frost <sfrost@snowman.net> wrote:
    >> As I understand it, the problem is not with backing up an individual
    >> database or cluster, but rather dealing with backing up thousands of
    >> individual clusters with thousands of tables in each, leading to an
    >> awful lot of tables with lots of FSMs/VMs, all of which end up having to
    >> get copied and stored wholesale.  I'll point this thread out to him and
    >> hopefully he'll have a chance to share more specific information.
    > 
    > Sounds good.
    
    While introducing WAL-delta backups, we ran into two things:
    1. A heavy spike in network load. We randomize the start time of each backup, but the variation is not large: the night is short, and we want to take the big backups during the low-RPS period. With so little variation in start times, the many small backups create a big network spike.
    2. Incremental backups became very cheap when measured in the resources used by a single cluster.
    
    The first is not a big problem, actually, but we realized that we can take incremental backups not just at night but, for example, 4 times a day. Or every hour. Or every minute. Why not, if they are cheap enough?
    
    An incremental backup of a 1 TB DB taken a few minutes after the previous one (a small change set) is a few GB. All of that size is made up of FSM (no LSN) and VM (hard to use LSN).
    Sure, this overhead is fine if we take a daily backup. But at some backup frequency it will be too much.
    
    I think the problem of incrementally backing up FSM and VM is too distant for now.
    But if I had to implement it right now, I'd choose the following approach: do not back up FSM and VM, and recreate them during restore. It looks possible, but too AM-specific.
    That is hard when you write a backup tool in Go and cannot simply link with PG.
    
    > On 15 Apr 2019, at 18:01, Stephen Frost <sfrost@snowman.net> wrote:
    > ...the goal here
    > isn't actually to make pg_basebackup into an enterprise backup tool,
    > ...
    
    BTW, I'm all for extensibility and "hackability". But, personally, I'd be happy if pg_basebackup were ubiquitous and sufficient, and tools like WAL-G and others became a part of history. There is no fundamental reason why an external backup tool can be better than a backup tool in core. (Unlike many PLs, data types, hooks, tuners, etc.)
    
    
    Here are 53 mentions of "parallel backup". I want to note that there can be parallel reads from disk and parallel network transmission. The things between these two are negligible and can be single-threaded. From my POV, it's not about threads, it's about saturated IO controllers.
    Also, I think parallel restore matters more than parallel backup. Backups themselves can be slow; on many clusters we even throttle disk IO. But users may want parallel backup to catch up a standby.
    
    Thanks.
    
    Best regards, Andrey Borodin.
    
    
    
  50. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-20T20:11:11Z

    On Sat, Apr 20, 2019 at 12:19 AM Stephen Frost <sfrost@snowman.net> wrote:
    > * Robert Haas (robertmhaas@gmail.com) wrote:
    > > What I'm NOT willing to
    > > do is build a whole bunch of infrastructure that will help pgbackrest
    > > do amazing things but will not provide a simple and convenient way of
    > > taking incremental backups using only core tools.  I do care about
    > > having something that's good for pgbackrest and other out-of-core
    > > tools.  I just care about it MUCH LESS than I care about making
    > > PostgreSQL core awesome.
    >
    > Then I misunderstood your original proposal where you talked about
    > providing something that the various external tools could use.  If you'd
    > like to *just* provide a mechanism for pg_basebackup to be able to do a
    > trivial incremental backup, great, but it's not going to be useful or
    > used by the external tools, just like the existing base backup protocol
    > isn't used by the external tools because it can't be used in a parallel
    > fashion.
    
    Well, what I meant - and perhaps I wasn't clear enough about this - is
    that it could be used by an external solution for *managing* backups,
    not so much an external engine for *taking* backups.  But actually, I
    really don't see any reason why the latter wouldn't also be possible.
    It was already suggested upthread by Anastasia that there should be a
    way to ask the server to give only the identity of the modified blocks
    without the contents of those blocks; if we provide that, then a tool
    can get those and do whatever it likes with them, including fetching
    them in parallel by some other means.  Another obvious extension would
    be to add a command that says 'give me this file' or 'give me this
    file but only this list of blocks' which would give clients lots of
    options: they could provide their own lists of blocks to fetch
    computed by whatever internal magic they have, or they could request
    the server's modified-block map information first and then schedule
    fetching those blocks in parallel using this new command.  So it seems
    like with some pretty straightforward extensions this can be made
    usable by and valuable to people wanting to build external backup
    engines, too.  I do not necessarily feel obliged to implement every
    feature that might help with that kind of thing just because I've
    expressed an interest in this general area, but I might do some of
    them, and maybe people like you or Anastasia who want to make these
    facilities available to external tools can help with some of the work,
    too.
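    
    A rough sketch of the client side of such an extension follows.  Neither
    command exists in the replication protocol; the sketch only assumes that
    the client has somehow obtained a list of (file, block number) pairs,
    and it shows how those could be turned into per-file "this file, but
    only these blocks" requests and spread over several connections.  All
    names, and the sample paths in the usage note, are hypothetical.

```python
from collections import defaultdict


def group_by_file(modified):
    """Turn (path, block) pairs into {path: sorted list of unique blocks}."""
    per_file = defaultdict(set)
    for path, blkno in modified:
        per_file[path].add(blkno)
    return {path: sorted(blocks) for path, blocks in per_file.items()}


def split_requests(per_file, max_blocks):
    """Build (path, blocks) requests holding at most max_blocks blocks each."""
    requests = []
    for path in sorted(per_file):
        blocks = per_file[path]
        for i in range(0, len(blocks), max_blocks):
            requests.append((path, blocks[i:i + max_blocks]))
    return requests


def assign(requests, workers):
    """Distribute the requests round-robin over `workers` queues."""
    queues = [[] for _ in range(workers)]
    for i, request in enumerate(requests):
        queues[i % workers].append(request)
    return queues
```

    For example, blocks 5, 2, and 9 of "base/1/100" plus block 0 of
    "base/1/200", with at most two blocks per request and two workers,
    become three requests, two of which land on the first worker and one on
    the second.  A client could equally feed in a block list computed by its
    own means instead of one obtained from the server.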
    
    That being said, as long as there is significant demand for
    value-added backup features over and above what is in core, there are
    probably going to be non-core backup tools that do things their own
    way instead of just leaning on whatever the server provides natively.
    In a certain sense that's regrettable, because it means that somebody
    - or perhaps multiple somebodys - goes to the trouble of doing
    something outside core and then somebody else puts something in core
    that obsoletes it and therein lies duplication of effort.  On the
    other hand, it also allows people to innovate way faster than can be
    done in core, it allows competition among different possible designs,
    and it's just kinda the way we roll around here.  I can't get very
    worked up about it.
    
    One thing I'm definitely not going to do here is abandon my goal of
    producing a *simple* incremental backup solution that can be deployed
    *easily* by users. I understand from your remarks that such a solution
    will not suit everybody.  However, unlike you, I do not believe that
    pg_basebackup was a failure.  I certainly agree that it has some
    limitations that mean that it is hard to use in large deployments, but
    it's also *extremely* convenient for people with a fairly small
    database when they just need a quick and easy backup.  Adding some
    more features to it - such as incremental backup - will make it useful
    to more people in more cases.  There will doubtless still be people
    who need more, and that's OK: those people can use a third-party tool.
    I will not get anywhere trying to solve every problem at once.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  51. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-20T20:13:42Z

    On Sat, Apr 20, 2019 at 12:44 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
    > An incremental backup of a 1 TB DB taken a few minutes after the previous one (a small change set) is a few GB. All of that size is made up of FSM (no LSN) and VM (hard to use LSN).
    > Sure, this overhead is fine if we take a daily backup. But at some backup frequency it will be too much.
    
    It seems like if the backups are only a few minutes apart, PITR might
    be a better choice than super-frequent incremental backups.  What do
    you think about that?
    
    > I think the problem of incrementally backing up FSM and VM is too distant for now.
    > But if I had to implement it right now, I'd choose the following approach: do not back up FSM and VM, and recreate them during restore. It looks possible, but too AM-specific.
    
    Interesting idea - that's worth some more thought.
    
    > BTW, I'm all for extensibility and "hackability". But, personally, I'd be happy if pg_basebackup were ubiquitous and sufficient, and tools like WAL-G and others became a part of history. There is no fundamental reason why an external backup tool can be better than a backup tool in core. (Unlike many PLs, data types, hooks, tuners, etc.)
    
    +1
    
    > Here are 53 mentions of "parallel backup". I want to note that there can be parallel reads from disk and parallel network transmission. The things between these two are negligible and can be single-threaded. From my POV, it's not about threads, it's about saturated IO controllers.
    > Also, I think parallel restore matters more than parallel backup. Backups themselves can be slow; on many clusters we even throttle disk IO. But users may want parallel backup to catch up a standby.
    
    I'm not sure I entirely understand your point here -- are you saying
    that parallel backup is important, or that it's not important, or
    something in between?  Do you think it's more or less important than
    incremental backup?
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  52. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-04-20T20:32:32Z

    Greetings,
    
    * Robert Haas (robertmhaas@gmail.com) wrote:
    > On Sat, Apr 20, 2019 at 12:19 AM Stephen Frost <sfrost@snowman.net> wrote:
    > > * Robert Haas (robertmhaas@gmail.com) wrote:
    > > > What I'm NOT willing to
    > > > do is build a whole bunch of infrastructure that will help pgbackrest
    > > > do amazing things but will not provide a simple and convenient way of
    > > > taking incremental backups using only core tools.  I do care about
    > > > having something that's good for pgbackrest and other out-of-core
    > > > tools.  I just care about it MUCH LESS than I care about making
    > > > PostgreSQL core awesome.
    > >
    > > Then I misunderstood your original proposal where you talked about
    > > providing something that the various external tools could use.  If you'd
    > > like to *just* provide a mechanism for pg_basebackup to be able to do a
    > > trivial incremental backup, great, but it's not going to be useful or
    > > used by the external tools, just like the existing base backup protocol
    > > isn't used by the external tools because it can't be used in a parallel
    > > fashion.
    > 
    > Well, what I meant - and perhaps I wasn't clear enough about this - is
    > that it could be used by an external solution for *managing* backups,
    > not so much an external engine for *taking* backups.  But actually, I
    > really don't see any reason why the latter wouldn't also be possible.
    > It was already suggested upthread by Anastasia that there should be a
    > way to ask the server to give only the identity of the modified blocks
    > without the contents of those blocks; if we provide that, then a tool
    > can get those and do whatever it likes with them, including fetching
    > them in parallel by some other means.  Another obvious extension would
    > be to add a command that says 'give me this file' or 'give me this
    > file but only this list of blocks' which would give clients lots of
    > options: they could provide their own lists of blocks to fetch
    > computed by whatever internal magic they have, or they could request
    > the server's modified-block map information first and then schedule
    > fetching those blocks in parallel using this new command.  So it seems
    > like with some pretty straightforward extensions this can be made
    > usable by and valuable to people wanting to build external backup
    > engines, too.  I do not necessarily feel obliged to implement every
    > feature that might help with that kind of thing just because I've
    > expressed an interest in this general area, but I might do some of
    > them, and maybe people like you or Anastasia who want to make these
    > facilities available to external tools can help with some of the work,
    > too.
    
    Yes, if we spend a bit of time thinking about how this could be
    implemented in a way that could be used by multiple connections
    concurrently then we could provide something that both pg_basebackup and
    the external tools could use.  Getting a list first and then supporting
    a 'give me this file' API, or 'give me these blocks from this file'
    would be very similar to what many of the external tools do today.  I agree
    that I don't think it'd be hard to do.  I'm suggesting that we do that
    instead of, at a protocol level, something similar to what was done with
    pg_basebackup which prevents that.
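
A 'give me these blocks from this file' request could be served roughly as sketched below. This is only an illustration, not anything proposed in the thread: the helper names `coalesce` and `fetch_blocks` are hypothetical, the block size is assumed to be the default 8192 bytes, and an in-memory file stands in for a relation segment. The sketch groups the requested block numbers into contiguous ranges so that each range needs one seek and one read.

```python
import io

BLCKSZ = 8192  # assumed block size (the PostgreSQL default)


def coalesce(blocks):
    """Turn block numbers into sorted (first_block, block_count) ranges."""
    ranges = []
    for blkno in sorted(set(blocks)):
        if ranges and blkno == ranges[-1][0] + ranges[-1][1]:
            # Block directly follows the previous range: extend it.
            ranges[-1] = (ranges[-1][0], ranges[-1][1] + 1)
        else:
            ranges.append((blkno, 1))
    return ranges


def fetch_blocks(relation_file, blocks):
    """Return {block_number: page}, reading each contiguous range in one go."""
    result = {}
    for first, count in coalesce(blocks):
        relation_file.seek(first * BLCKSZ)
        chunk = relation_file.read(count * BLCKSZ)
        for i in range(count):
            result[first + i] = chunk[i * BLCKSZ:(i + 1) * BLCKSZ]
    return result


# Demonstration: a fake 10-block relation where block n is filled with byte n.
relation_file = io.BytesIO(b"".join(bytes([n]) * BLCKSZ for n in range(10)))
pages = fetch_blocks(relation_file, [7, 3, 4, 5, 9, 8])
```

Because every request names its file and blocks explicitly, independent connections could serve different requests at the same time, which is the property being asked for here.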
    
    I don't really agree that implementing "give me a list of files" and
    "give me this file" is really somehow an 'extension' to the tar-based
    approach that pg_basebackup uses today, it's really a rather different
    thing, and I mention that as a parallel (hah!) to what we're discussing
    here regarding the incremental backup approach.
    
    Having been around for a while working on backup-related things, if I
    was to implement the protocol for pg_basebackup today, I'd definitely
    implement "give me a list" and "give me this file" rather than the
    tar-based approach, because I've learned that people want to be
    able to do parallel backups and that's a decent way to do that.  I
    wouldn't set out and implement something new that there's just no hope
    of making parallel.  Maybe the first write of pg_basebackup would still
    be simple and serial since it's certainly more work to make a frontend
    tool like that work in parallel, but at least the protocol would be
    ready to support a parallel option being added later without being
    rewritten.
    
    And that's really what I was trying to get at here- if we've got the
    choice now to decide what this is going to look like from a protocol
    level, it'd be great if we could make it able to support being used in a
    parallel fashion, even if pg_basebackup is still single-threaded.
    
    > That being said, as long as there is significant demand for
    > value-added backup features over and above what is in core, there are
    > probably going to be non-core backup tools that do things their own
    > way instead of just leaning on whatever the server provides natively.
    > In a certain sense that's regrettable, because it means that somebody
    > - or perhaps multiple somebodys - goes to the trouble of doing
    > something outside core and then somebody else puts something in core
    > that obsoletes it and therein lies duplication of effort.  On the
    > other hand, it also allows people to innovate way faster than can be
    > done in core, it allows competition among different possible designs,
    > and it's just kinda the way we roll around here.  I can't get very
    > worked up about it.
    
    Yes, that's largely the tack we've taken with it- build something
    outside of core, where we can move a lot faster with the implementation
    and innovate quickly, until we get to a stable system that's as portable
    as, and in a language compatible with, what's in core today.  I don't have any
    problem with new things going into core, in fact, I'm all for it, but if
    someone asks me "I'd like to do this thing in core and I'd like it to be
    useful for external tools" then I'll do my best to share my experiences
    with what's been done in core vs. what's been done in this space outside
    of core and what some lessons learned from that have been and ways that
    we could at least try to make it so that external tools will be able to
    use whatever is implemented in core.
    
    > One thing I'm definitely not going to do here is abandon my goal of
    > producing a *simple* incremental backup solution that can be deployed
    > *easily* by users. I understand from your remarks that such a solution
    > will not suit everybody.  However, unlike you, I do not believe that
    > pg_basebackup was a failure.  I certainly agree that it has some
    > limitations that mean that it is hard to use in large deployments, but
    > it's also *extremely* convenient for people with a fairly small
    > database when they just need a quick and easy backup.  Adding some
    > more features to it - such as incremental backup - will make it useful
    > to more people in more cases.  There will doubtless still be people
    > who need more, and that's OK: those people can use a third-party tool.
    > I will not get anywhere trying to solve every problem at once.
    
    I don't get this at all.  What I've really been focused on has been the
    protocol-level questions of what this is going to look like, because
    that's what I see the external tools potentially using.  pg_basebackup
    itself could remain single-threaded and could provide exactly the same
    interface, no matter if the protocol is "give me all the blocks across
    the entire cluster as a single compressed stream" or the protocol is
    "give me a list of files that changed" and "give me a list of these
    blocks in this file" or even "give me all the blocks that changed in
    this file".
    
    I also don't think pg_basebackup is a failure, and I didn't mean to
    imply that, and I'm sorry for some of the hyperbole which led to that
    impression coming across.  pg_basebackup is great, for what it is, and I
    regularly recommend it in certain use-cases as being a simple tool that
    does one thing and does it pretty well, for smaller clusters.  The
    protocol it uses is unfortunately only useful in a single-threaded
    manner though and it'd be great if we could avoid implementing similar
    things in the protocol in the future.
    
    Thanks,
    
    Stephen
    
  53. Re: block-level incremental backup

    Andrey Borodin <x4mmm@yandex-team.ru> — 2019-04-21T09:05:02Z

    
    > On 21 Apr 2019, at 1:13, Robert Haas <robertmhaas@gmail.com> wrote:
    > 
    > On Sat, Apr 20, 2019 at 12:44 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
    >> An incremental backup of a 1 TB DB taken a few minutes after the previous one (small change set) is a few GB. All of that size is made up of FSM (no LSN) and VM (hard to use LSN).
    >> Sure, this overhead is fine if we make daily backups. But at some frequency of backups it will be too much.
    > 
    > It seems like if the backups are only a few minutes apart, PITR might
    > be a better choice than super-frequent incremental backups.  What do
    > you think about that?
    PITR is painfully slow on heavily loaded clusters. I have observed restores where 5 seconds of WAL took 4 seconds to replay. The backup was only a few hours behind the primary node, but it could catch up only at night.
    And during this process only one of 56 CPU cores was used, and the SSD RAID throughput was not 100% utilized.
    
    Block-level delta backups can be restored very efficiently: if we restore from the newest step back to the oldest, we write no more than the size of the cluster at the last backup.
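
The newest-to-oldest restore order can be sketched as follows. This is a simplified model under stated assumptions, not the format of any existing tool: each backup step is a hypothetical dict mapping block numbers to page contents, the chain is ordered oldest (full) to newest, and `final_nblocks` is the relation size recorded by the newest step.

```python
def restore_newest_first(chain, final_nblocks):
    """Restore a relation from a backup chain, writing each block at most once.

    chain: list of {block_number: page} dicts, oldest (full backup) first.
    final_nblocks: relation length in blocks as of the newest backup.
    Returns (restored_blocks, number_of_writes).
    """
    restored = {}
    writes = 0
    for step in reversed(chain):  # newest step first
        for blkno, page in step.items():
            # Skip blocks truncated away later, and blocks for which a
            # newer version has already been written.
            if blkno < final_nblocks and blkno not in restored:
                restored[blkno] = page
                writes += 1
    return restored, writes


# Demonstration: a 4-block full backup, two increments, then truncation to 3 blocks.
chain = [
    {0: b"a0", 1: b"b0", 2: b"c0", 3: b"d0"},  # full backup
    {1: b"b1"},                                # first increment
    {1: b"b2", 2: b"c2"},                      # second (newest) increment
]
restored, writes = restore_newest_first(chain, final_nblocks=3)
```

In this model the number of writes can never exceed `final_nblocks`, because a block is written only the first time it is seen, and older versions of it are skipped.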
    
    >> I think that problem of incrementing FSM and VM is too distant now.
    >> But if I had to implement it right now I'd choose the following way: do not back up FSM and VM, recreate them during restore. It looks possible, but too AM-specific.
    > 
    > Interesting idea - that's worth some more thought.
    
    Core routines to recreate VM and FSM would be cool :) But this needs to be done without extra IO, which is not an easy trick.
    
    >> Here's 53 mentions of "parallel backup". I want to note that there may be parallel read from disk and parallel network transmission. Things between these two are negligible and can be single-threaded. From my POV, it's not about threads, it's about saturated IO controllers.
    >> Also I think parallel restore matters more than parallel backup. Backups themselves can be slow; on many clusters we even throttle disk IO. But users may want parallel backup to catch up a standby.
    > 
    > I'm not sure I entirely understand your point here -- are you saying
    > that parallel backup is important, or that it's not important, or
    > something in between?  Do you think it's more or less important than
    > incremental backup?
    I think that there is no such thing as parallel backup. Backup creation is a composite process made up of many subprocesses.
    
    In my experience, parallel network transmission is cool and very important; it makes upload 3 times faster. But my experience is limited to cloud storage. Would this hold if the storage backend is a local FS? I have no idea.
    Parallel reading from disk has the same effect. Compression and encryption can be single-threaded; I think they will not be a bottleneck (unless one uses lzma's neighborhood on the Pareto frontier).
    
    For me, the most important thing is incremental backups (with parallel merging of steps), and then parallel backup.
    But there is a huge fraction of users who can benefit from parallel backup and do not need incremental backup at all.
    
    
    Best regards, Andrey Borodin.
    
    
    
  54. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-21T23:02:26Z

    On Sat, Apr 20, 2019 at 4:32 PM Stephen Frost <sfrost@snowman.net> wrote:
    > Having been around for a while working on backup-related things, if I
    > was to implement the protocol for pg_basebackup today, I'd definitely
    > implement "give me a list" and "give me this file" rather than the
    > tar-based approach, because I've learned that people want to be
    > able to do parallel backups and that's a decent way to do that.  I
    > wouldn't set out and implement something new that there's just no hope
    > of making parallel.  Maybe the first write of pg_basebackup would still
    > be simple and serial since it's certainly more work to make a frontend
    > tool like that work in parallel, but at least the protocol would be
    > ready to support a parallel option being added later without being
    > rewritten.
    >
    > And that's really what I was trying to get at here- if we've got the
    > choice now to decide what this is going to look like from a protocol
    > level, it'd be great if we could make it able to support being used in a
    > parallel fashion, even if pg_basebackup is still single-threaded.
    
    I think we're getting closer to a meeting of the minds here, but I
    don't think it's intrinsically necessary to rewrite the whole method
    of operation of pg_basebackup to implement incremental backup in a
    sensible way.  One could instead just do a straightforward extension
    to the existing BASE_BACKUP command to enable incremental backup.
    Then, to enable parallel full backup and all sorts of out-of-core
    hacking, one could expand the command language to allow tools to
    access individual steps: START_BACKUP, SEND_FILE_LIST,
    SEND_FILE_CONTENTS, STOP_BACKUP, or whatever.  The second thing makes
    for an appealing project, but I do not think there is a technical
    reason why it has to be done first.  Or for that matter why it has to
    be done second.  As I keep saying, incremental backup and full backup
    are separate projects and I believe it's completely reasonable for
    whoever is doing the work to decide on the order in which they would
    like to do the work.
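
To make the step-wise command idea concrete, here is a minimal sketch of a client driving it. Everything in it is an assumption for illustration: the command names come from the paragraph above (where they are themselves only suggestions), and `FakeServer` is an in-memory stand-in, not a real replication-protocol client. A real tool would presumably give each worker its own connection.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeServer:
    """In-memory stand-in for a server offering the hypothetical commands."""

    def __init__(self, files):
        self._files = dict(files)
        self.in_backup = False

    def start_backup(self):
        self.in_backup = True

    def send_file_list(self):
        self._require_backup()
        return sorted(self._files)

    def send_file_contents(self, path):
        self._require_backup()
        return self._files[path]

    def stop_backup(self):
        self.in_backup = False

    def _require_backup(self):
        if not self.in_backup:
            raise RuntimeError("no backup in progress")


def parallel_backup(server, workers=4):
    """Fetch the file list once, then fetch file contents concurrently."""
    server.start_backup()
    try:
        paths = server.send_file_list()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            contents = list(pool.map(server.send_file_contents, paths))
    finally:
        server.stop_backup()  # always end the backup, even on error
    return dict(zip(paths, contents))


server = FakeServer({"base/1/1259": b"relation data", "global/pg_control": b"control"})
backup = parallel_backup(server, workers=2)
```

A single-threaded client fits the same command sequence by setting `workers=1`, which is why exposing the steps does not force the client tool itself to become parallel.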
    
    Having said that, I'm curious what people other than Stephen (and
    other pgbackrest hackers) think about the relative value of parallel
    backup vs. incremental backup.  Stephen appears quite convinced that
    parallel backup is full of win and incremental backup is a bit of a
    yawn by comparison, and while I certainly would not want to discount
    the value of his experience in this area, it sometimes happens on this
    mailing list that [ drum roll please ] not everybody agrees about
    everything.  So, what do other people think?
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  55. Re: block-level incremental backup

    Konstantin Knizhnik <k.knizhnik@postgrespro.ru> — 2019-04-22T07:38:18Z

    
    On 22.04.2019 2:02, Robert Haas wrote:
    > On Sat, Apr 20, 2019 at 4:32 PM Stephen Frost <sfrost@snowman.net> wrote:
    >> Having been around for a while working on backup-related things, if I
    >> was to implement the protocol for pg_basebackup today, I'd definitely
    >> implement "give me a list" and "give me this file" rather than the
    >> tar-based approach, because I've learned that people want to be
    >> able to do parallel backups and that's a decent way to do that.  I
    >> wouldn't set out and implement something new that there's just no hope
    >> of making parallel.  Maybe the first write of pg_basebackup would still
    >> be simple and serial since it's certainly more work to make a frontend
    >> tool like that work in parallel, but at least the protocol would be
    >> ready to support a parallel option being added later without being
    >> rewritten.
    >>
    >> And that's really what I was trying to get at here- if we've got the
    >> choice now to decide what this is going to look like from a protocol
    >> level, it'd be great if we could make it able to support being used in a
    >> parallel fashion, even if pg_basebackup is still single-threaded.
    > I think we're getting closer to a meeting of the minds here, but I
    > don't think it's intrinsically necessary to rewrite the whole method
    > of operation of pg_basebackup to implement incremental backup in a
    > sensible way.  One could instead just do a straightforward extension
    > to the existing BASE_BACKUP command to enable incremental backup.
    > Then, to enable parallel full backup and all sorts of out-of-core
    > hacking, one could expand the command language to allow tools to
    > access individual steps: START_BACKUP, SEND_FILE_LIST,
    > SEND_FILE_CONTENTS, STOP_BACKUP, or whatever.  The second thing makes
    > for an appealing project, but I do not think there is a technical
    > reason why it has to be done first.  Or for that matter why it has to
    > be done second.  As I keep saying, incremental backup and full backup
    > are separate projects and I believe it's completely reasonable for
    > whoever is doing the work to decide on the order in which they would
    > like to do the work.
    >
    > Having said that, I'm curious what people other than Stephen (and
    > other pgbackrest hackers) think about the relative value of parallel
    > backup vs. incremental backup.  Stephen appears quite convinced that
    > parallel backup is full of win and incremental backup is a bit of a
    > yawn by comparison, and while I certainly would not want to discount
    > the value of his experience in this area, it sometimes happens on this
    > mailing list that [ drum roll please ] not everybody agrees about
    > everything.  So, what do other people think?
    >
    
    Based on the experience of pg_probackup users, I can say that there is
    no 100% winner: depending on the use case, either parallel or
    incremental backup is preferable.
    - If the database is not so large and the intensity of updates is high
    enough, then parallel backup within one data center is definitely the
    more efficient solution.
    - If the database is very large and data is rarely updated, or the
    database is mostly append-only, then incremental backup is preferable.
    - Some customers need to collect, at a central server, backups of
    databases installed on many nodes with slow and unreliable connections
    (imagine a DBMS installed on locomotives). Parallelism definitely cannot
    help here, unlike support for incremental backup.
    - Parallel backup consumes system resources more aggressively,
    interfering with the normal work of the application. So performing a
    parallel backup may significantly degrade application speed.
    
    pg_probackup supports both features, parallel and incremental backups,
    and it is up to the user to choose the more efficient way to use them
    for a particular configuration.
    
    
    
    -- 
    Konstantin Knizhnik
    Postgres Professional: http://www.postgrespro.com
    The Russian Postgres Company
    
    
    
    
    
  56. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-04-22T17:08:05Z

    Greetings,
    
    * Robert Haas (robertmhaas@gmail.com) wrote:
    > On Sat, Apr 20, 2019 at 4:32 PM Stephen Frost <sfrost@snowman.net> wrote:
    > > Having been around for a while working on backup-related things, if I
    > > was to implement the protocol for pg_basebackup today, I'd definitely
    > > implement "give me a list" and "give me this file" rather than the
    > > tar-based approach, because I've learned that people want to be
    > > able to do parallel backups and that's a decent way to do that.  I
    > > wouldn't set out and implement something new that there's just no hope
    > > of making parallel.  Maybe the first write of pg_basebackup would still
    > > be simple and serial since it's certainly more work to make a frontend
    > > tool like that work in parallel, but at least the protocol would be
    > > ready to support a parallel option being added later without being
    > > rewritten.
    > >
    > > And that's really what I was trying to get at here- if we've got the
    > > choice now to decide what this is going to look like from a protocol
    > > level, it'd be great if we could make it able to support being used in a
    > > parallel fashion, even if pg_basebackup is still single-threaded.
    > 
    > I think we're getting closer to a meeting of the minds here, but I
    > don't think it's intrinsically necessary to rewrite the whole method
    > of operation of pg_basebackup to implement incremental backup in a
    > sensible way.  
    
    It wasn't my intent to imply that the whole method of operation of
    pg_basebackup would have to change for this.
    
    > One could instead just do a straightforward extension
    > to the existing BASE_BACKUP command to enable incremental backup.
    
    Ok, how do you envision that?  As I mentioned up-thread, I am concerned
    that we're talking too high-level here and it's making the discussion
    more difficult than it would be if we were to put together specific
    ideas and then discuss them.
    
    One way I can imagine to extend BASE_BACKUP is by adding LSN as an
    optional parameter and then having the database server scan the entire
    cluster and send a tarball which contains essentially a 'diff' file of
    some kind for each file where we can construct a diff based on the LSN,
    and then the complete contents of the file for everything else that
    needs to be in the backup.
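
The per-file scan behind such an LSN parameter might look roughly like the sketch below. It relies on the page LSN (`pd_lsn`) being stored as two 32-bit words at the very start of each page header and on the default 8192-byte block size. The function names are hypothetical, the data is read in the byte order of the machine that wrote it, and edge cases such as uninitialized pages are ignored.

```python
import struct

BLCKSZ = 8192  # PostgreSQL's default block size; configurable at build time


def parse_lsn(text):
    """Convert the textual 'X/X' LSN form to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def page_lsn(page):
    """Read pd_lsn: high and low 32-bit words at the start of the page."""
    xlogid, xrecoff = struct.unpack_from("=II", page, 0)
    return (xlogid << 32) | xrecoff


def changed_blocks(relation, threshold):
    """Yield (block_number, page) for every page whose LSN is >= threshold."""
    for offset in range(0, len(relation) - BLCKSZ + 1, BLCKSZ):
        page = relation[offset:offset + BLCKSZ]
        if page_lsn(page) >= threshold:
            yield offset // BLCKSZ, page


def make_page(lsn):
    """Build a dummy page carrying only an LSN (for the demonstration)."""
    return struct.pack("=II", lsn >> 32, lsn & 0xFFFFFFFF) + bytes(BLCKSZ - 8)


# Demonstration: three pages with increasing LSNs, filtered at 0/200.
relation = (make_page(parse_lsn("0/100"))
            + make_page(parse_lsn("0/200"))
            + make_page(parse_lsn("1/0")))
modified = [blkno for blkno, _ in changed_blocks(relation, parse_lsn("0/200"))]
```

The 'diff' file for a relation would then carry the yielded block numbers and pages, while files without page LSNs would still have to be sent in full.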
    
    So, sure, that would work, but it wouldn't be able to be parallelized
    and I don't think it'd end up being very exciting for the external tools
    because of that, but it would be fine for pg_basebackup.
    
    On the other hand, if you added new commands for 'list of files changed
    since this LSN' and 'give me this file' and 'give me this file with the
    changes in it since this LSN', then pg_basebackup could work with that
    pretty easily in a single-threaded model (maybe with two connections to
    the backend, but still in a single process, or maybe just by slurping up
    the file list and then asking for each one) and the external tools could
    leverage those new capabilities too for their backups, both full backups
    and incremental ones.  This also wouldn't have to change how
    pg_basebackup does full backups today one bit, so what we're really
    talking about here is the direction to take the new code that's being
    written, not about rewriting existing code.  I agree that it'd be a bit
    more work...  but hopefully not *that* much more, and it would mean we
    could later add parallel backup to pg_basebackup more easily too, if we
    wanted to.
    
    > Then, to enable parallel full backup and all sorts of out-of-core
    > hacking, one could expand the command language to allow tools to
    > access individual steps: START_BACKUP, SEND_FILE_LIST,
    > SEND_FILE_CONTENTS, STOP_BACKUP, or whatever.  The second thing makes
    > for an appealing project, but I do not think there is a technical
    > reason why it has to be done first.  Or for that matter why it has to
    > be done second.  As I keep saying, incremental backup and full backup
    > are separate projects and I believe it's completely reasonable for
    > whoever is doing the work to decide on the order in which they would
    > like to do the work.
    
    I didn't mean to imply that one had to be done before the other from a
    technical standpoint.  I agree that they don't depend on each other.
    
    You're certainly welcome to do what you would like, I simply wanted to
    share my experiences and try to help move this in a direction that would
    involve less code rewrite in the future and to have a feature that would
    be more appealing to the external tools.
    
    > Having said that, I'm curious what people other than Stephen (and
    > other pgbackrest hackers) 
    
    While David and I do talk, we haven't really discussed this proposal all
    that much, so please don't assume that he shares my thoughts here.  I'd
    also like to hear what others think, particularly those who have been
    working in this area.
    
    > think about the relative value of parallel
    > backup vs. incremental backup.  Stephen appears quite convinced that
    > parallel backup is full of win and incremental backup is a bit of a
    > yawn by comparison, and while I certainly would not want to discount
    > the value of his experience in this area, it sometimes happens on this
    > mailing list that [ drum roll please ] not everybody agrees about
    > everything.  So, what do other people think?
    
    I'm afraid this is painting my position here with an extremely broad
    brush and so I'd like to clarify a bit: I'm *all* for incremental
    backups.  Incremental and differential backups were supported by
    pgBackRest very early on and are used extensively.  Today's pgBackRest
    does that at a file level, but I would very much like to get to a block
    level shortly after we finish rewriting it into C and porting it to
    Windows (and probably the other platforms PG runs on today), which isn't
    very far off now.  I'd like to make sure that whatever core ends up with
    as an incremental backup solution also matches very closely what we do
    with pgBackRest too, but everything that's been discussed here seems
    pretty reasonable when it comes to the bits around how the blocks are
    detected and the files get stitched back together, so I don't expect
    there to be too much of an issue there.
    
    What I'm afraid will be lackluster is adding block-level incremental
    backup support to pg_basebackup without any support for managing
    backups or anything else.  I'm also concerned that it's going to mean
    that people who want to use incremental backup with pg_basebackup are
    going to have to write a lot of their own management code (probably in
    shell scripts and such...) around that and if they get anything wrong
    there then people are going to end up with bad backups that they can't
    restore from, or they'll have corrupted clusters if they do manage to
    get them restored.
    
    It'd also be nice to have as much exposed through the common library as
    possible when it comes to, well, everything being discussed, so that the
    external tools could leverage that code and avoid having to write their
    own.  This would probably apply more to the WAL-scanning discussion, but
    figured I'd mention it here too.
    
    If the protocol was implemented in a way that we could leverage it from
    external tools in a parallel fashion then I'd be more excited about the
    overall body of work, although, thinking about it a bit more, I have to
    admit that I'm not sure that pgBackRest would end up using it in any
    case, no matter how it's implemented, since it wouldn't support
    compression or encryption, both of which we support doing in-stream
    before the data leaves the server, though the external tools which don't
    support those options likely would find the parallel option more
    appealing.
    
    Thanks,
    
    Stephen
    
  57. Re: block-level incremental backup

    Andres Freund <andres@anarazel.de> — 2019-04-22T17:36:44Z

    Hi,
    
    On 2019-04-19 20:04:41 -0400, Stephen Frost wrote:
    > I agree that we don't want another implementation and that there's a lot
    > that we want to do to improve replay performance.  We've already got
    > frontend tools which work with multiple execution threads, so I'm not
    > sure I get the "not easily feasible" bit, and the argument about the
    > checkpointer seems largely related to that (as in- if we didn't have
    > multiple threads/processes then things would perform quite badly...  but
    > we can and do have multiple threads/processes in frontend tools today,
    > even in pg_basebackup).
    
    You need not just multiple execution threads, but basically a new
    implementation of shared buffers, locking, process monitoring, with most
    of the related infrastructure. You're literally talking about
    reimplementing a very substantial portion of the backend.  I'm not sure
    I can transport in written words - via a public medium - how bad an idea
    it would be to go there.
    
    
    > You certainly bring up some good concerns though and they make me think
    > of other bits that would seem like they'd possibly be larger issues for
    > a frontend tool- like having a large pool of memory for caching (aka
    > shared buffers) the changes.  If what we're talking about here is *just*
    > replay though, without having the system available for reads, I wonder
    > if we might want a different solution there.
    
    No.
    
    
    > > Which I think is entirely reasonable. With the 'consistent' and LSN
    > > recovery targets one already can get most of what's needed from such a
    > > tool, anyway.  I'd argue the biggest issue there is that there's no
    > > equivalent to starting postgres with a private socket directory on
    > > windows, and perhaps an option or two making it easier to start postgres
    > > in a "private" mode for things like this.
    > 
    > This would mean building in a way to do parallel WAL replay into the
    > server binary though, as discussed above, and it seems like making that
    > work in a way that allows us to still be available as a read-only
    > standby would be quite a bit more difficult.  We could possibly support
    > parallel WAL replay only when we aren't a replica but from the same
    > binary.
    
    I'm doubtful that we should try to implement parallel WAL apply that
    can't support HS - a substantial portion of the logic to avoid
    issues around relfilenode reuse, consistency etc is going to be
    necessary for non-HS aware apply anyway.  But if somebody had a concrete
    proposal for something that's fundamentally only doable without HS, I
    could be convinced.
    
    
    > The concerns mentioned about making it easier to start PG in a
    > private mode don't seem too bad but I am not entirely sure that the
    > tools which want to leverage that kind of capability would want to have
    > to exec out to the PG binary to use it.
    
    Tough luck.  But even leaving infeasibility aside, it seems like a quite
    bad idea to do this in-process inside a tool that manages backup &
    recovery. Creating threads / sub-processes with complicated needs (like
    any pared down version of pg to do just recovery would have) from within
    a library has substantial complications. So you'd not want to do this
    in-process anyway.
    
    
    > A lot of this part of the discussion feels like a tangent though, unless
    > I'm missing something.
    
    I'm replying to:
    
    On 2019-04-17 18:43:10 -0400, Stephen Frost wrote:
    > Wow.  I have to admit that I feel completely opposite of that- I'd
    > *love* to have an independent tool (which ideally uses the same code
    > through the common library, or similar) that can be run to apply WAL.
    
    And I'm basically saying that anything that starts from this premise is
    fatally flawed (in the ex falso quodlibet kind of sense ;)).
    
    
    > The "WAL compression" tool contemplated
    > previously would be much simpler and not the full-blown WAL replay
    > capability, which would be left to the server, unless you're suggesting
    > that even that should be exclusively the purview of the backend?  Though
    > that ship's already sailed, given that external projects have
    > implemented it.
    
    I'm extremely doubtful of such tools (but it's not what I was responding
    to, see above). I'd be extremely surprised if even one of them came
    close to being correct. The old FPI removal tool had data corrupting
    bugs left and right.
    
    
    > Having a library to provide that which external
    > projects could leverage would be nicer than having everyone write their
    > own version.
    
    No, I don't think that's necessarily true. Something complicated that's
    hard to get right doesn't have to be provided by core. Even if other
    projects decide that their risk/reward assessment is different than core
    postgres'. We don't have to take on all kinds of work and complexity for
    external tools.
    
    Greetings,
    
    Andres Freund
    
    
    
    
  58. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-22T17:44:25Z

    On Mon, Apr 22, 2019 at 1:08 PM Stephen Frost <sfrost@snowman.net> wrote:
    > > I think we're getting closer to a meeting of the minds here, but I
    > > don't think it's intrinsically necessary to rewrite the whole method
    > > of operation of pg_basebackup to implement incremental backup in a
    > > sensible way.
    >
    > It wasn't my intent to imply that the whole method of operation of
    > pg_basebackup would have to change for this.
    
    Cool.
    
    > > One could instead just do a straightforward extension
    > > to the existing BASE_BACKUP command to enable incremental backup.
    >
    > Ok, how do you envision that?  As I mentioned up-thread, I am concerned
    > that we're talking too high-level here and it's making the discussion
    > more difficult than it would be if we were to put together specific
    > ideas and then discuss them.
    >
    > One way I can imagine to extend BASE_BACKUP is by adding LSN as an
    > optional parameter and then having the database server scan the entire
    > cluster and send a tarball which contains essentially a 'diff' file of
    > some kind for each file where we can construct a diff based on the LSN,
    > and then the complete contents of the file for everything else that
    > needs to be in the backup.
    
    /me scratches head.  Isn't that pretty much what I described in my
    original post?  I even described what that "'diff' file of some kind"
    would look like in some detail in the paragraph of that email
    numbered "2.", and I described the reasons for that choice at length
    in http://postgr.es/m/CA+TgmoZrqdV-tB8nY9P+1pQLqKXp5f1afghuoHh5QT6ewdkJ6g@mail.gmail.com
    
    I can't figure out how I'm managing to be so unclear about things
    about which I thought I'd been rather explicit.
    
    > So, sure, that would work, but it wouldn't be able to be parallelized
    > and I don't think it'd end up being very exciting for the external tools
    > because of that, but it would be fine for pg_basebackup.
    
    Stop being such a pessimist.  Yes, if we only add the option to the
    BASE_BACKUP command, it won't directly be very exciting for external
    tools, but a lot of the work that is needed to do things that ARE
    exciting for external tools will have been done.  For instance, if the
    work to figure out which blocks have been modified via WAL-scanning
    gets done, and initially that's only exposed via BASE_BACKUP, it won't
    be much work for somebody to write new code that exposes
    that information directly through some new replication command.
    There's a difference between something that's going in the wrong
    direction and something that's going in the right direction but not as
    far or as fast as you'd like.  And I'm 99% sure that everything I'm
    proposing here falls in the latter category rather than the former.
    
    > On the other hand, if you added new commands for 'list of files changed
    > since this LSN' and 'give me this file' and 'give me this file with the
    > changes in it since this LSN', then pg_basebackup could work with that
    > pretty easily in a single-threaded model (maybe with two connections to
    > the backend, but still in a single process, or maybe just by slurping up
    > the file list and then asking for each one) and the external tools could
    > leverage those new capabilities too for their backups, both full backups
    > and incremental ones.  This also wouldn't have to change how
    > pg_basebackup does full backups today one bit, so what we're really
    > talking about here is the direction to take the new code that's being
    > written, not about rewriting existing code.  I agree that it'd be a bit
    > more work...  but hopefully not *that* much more, and it would mean we
    > could later add parallel backup to pg_basebackup more easily too, if we
    > wanted to.
    
    For purposes of implementing parallel pg_basebackup, it would probably
    be better if the server rather than the client decided which files to
    send via which connection.  If the client decides, then every time the
    server finishes sending a file, the client has to request another
    file, and that introduces some latency: after the server finishes
    sending each file, it has to wait for the client to finish receiving
    the data, and it has to wait for the client to tell it what file to
    send next.  If the server decides, then it can just send data at top
    speed without a break.  So the ideal interface for pg_basebackup would
    really be something like:
    
    START_PARALLEL_BACKUP blah blah PARTICIPANTS 4;
    
    ...returning a cookie that can then be used by each participant as
    an argument to a new command:
    
    JOIN_PARALLEL_BACKUP 'cookie';
    
    However, that is obviously extremely inconvenient for third-party
    tools.  It's possible we need both an interface like this -- for use
    by parallel pg_basebackup -- and a
    START_BACKUP/SEND_FILE_LIST/SEND_FILE_CONTENTS/STOP_BACKUP type
    interface for use by external tools.  On the other hand, maybe the
    additional overhead caused by managing the list of files to be fetched
    on the client side is negligible.  It'd be interesting to see, though,
    how busy the server is when running an incremental backup managed by
    an external tool like BART or pgbackrest on a cluster with a gazillion
    little-tiny relations.  I wonder if we'd find that it spends most of
    its time waiting for the client.
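    
    A rough way to put numbers on the latency argument above is a toy
    model, sketched here in Python.  It assumes one network round trip of
    idle time per file when the client picks the files, and a single
    request when the server picks them.  The file count, bandwidth, and
    round-trip time are made-up illustrative values, not measurements.

```python
def client_driven_seconds(file_sizes, bandwidth, round_trip):
    """Client requests each file; the server idles one round trip per file."""
    transfer = sum(file_sizes) / bandwidth
    return transfer + len(file_sizes) * round_trip


def server_driven_seconds(file_sizes, bandwidth, round_trip):
    """Server chooses the files and streams them back to back after one request."""
    return sum(file_sizes) / bandwidth + round_trip


# Hypothetical cluster: 100,000 tiny relations of 8 kB each.
file_sizes = [8192] * 100_000
bandwidth = 100 * 1024 * 1024  # bytes per second (assumed)
round_trip = 0.0005            # seconds per request (assumed)

client = client_driven_seconds(file_sizes, bandwidth, round_trip)
server = server_driven_seconds(file_sizes, bandwidth, round_trip)
print(f"client-driven: {client:.2f}s, server-driven: {server:.2f}s")
```

    In this model the gap grows linearly with the number of files, which
    is why a cluster with very many small relations is the interesting
    case to measure.  Real clients can hide some of that cost by
    pipelining requests, which the model ignores.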
    
    > What I'm afraid will be lackluster is adding block-level incremental
    > backup support to pg_basebackup without any support for managing
    > backups or anything else.  I'm also concerned that it's going to mean
    > that people who want to use incremental backup with pg_basebackup are
    > going to have to write a lot of their own management code (probably in
    > shell scripts and such...) around that and if they get anything wrong
    > there then people are going to end up with bad backups that they can't
    > restore from, or they'll have corrupted clusters if they do manage to
    > get them restored.
    
    I think that this is another complaint that basically falls into the
    category of saying that this proposal might not fix everything for
    everybody, but that complaint could be levied against any reasonable
    development proposal.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  59. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-04-22T18:03:41Z

    Greetings,
    
    * Andres Freund (andres@anarazel.de) wrote:
    > On 2019-04-19 20:04:41 -0400, Stephen Frost wrote:
    > > I agree that we don't want another implementation and that there's a lot
    > > that we want to do to improve replay performance.  We've already got
    > > frontend tools which work with multiple execution threads, so I'm not
    > > sure I get the "not easily feasible" bit, and the argument about the
    > > checkpointer seems largely related to that (as in- if we didn't have
    > > multiple threads/processes then things would perform quite badly...  but
    > > we can and do have multiple threads/processes in frontend tools today,
    > > even in pg_basebackup).
    > 
    > You need not just multiple execution threads, but basically a new
    > implementation of shared buffers, locking, process monitoring, with most
    > of the related infrastructure. You're literally talking about
    > reimplementing a very substantial portion of the backend.  I'm not sure
    > I can transport in written words - via a public medium - how bad an idea
    > it would be to go there.
    
    Yes, there'd be some need for locking and process monitoring, though if
    we aren't supporting ongoing read queries at the same time, there's a
    whole bunch of things that we don't need from the existing backend.
    
    > > > Which I think is entirely reasonable. With the 'consistent' and LSN
    > > > recovery targets one already can get most of what's needed from such a
    > > > tool, anyway.  I'd argue the biggest issue there is that there's no
    > > > equivalent to starting postgres with a private socket directory on
    > > > windows, and perhaps an option or two making it easier to start postgres
    > > > in a "private" mode for things like this.
    > > 
    > > This would mean building in a way to do parallel WAL replay into the
    > > server binary though, as discussed above, and it seems like making that
    > > work in a way that allows us to still be available as a read-only
    > > standby would be quite a bit more difficult.  We could possibly support
    > > parallel WAL replay only when we aren't a replica but from the same
    > > binary.
    > 
    > I'm doubtful that we should try to implement parallel WAL apply that
    > can't support HS - a substantial portion of the logic to avoid
    > issues around relfilenode reuse, consistency etc is going to be
    > necessary for non-HS aware apply anyway.  But if somebody had a concrete
    > proposal for something that's fundamentally only doable without HS, I
    > could be convinced.
    
    I'd certainly prefer that we support parallel WAL replay *with* HS, that
    just seems like a much larger problem, but I'd be quite happy to be told
    that it wouldn't be that much harder.
    
    > > A lot of this part of the discussion feels like a tangent though, unless
    > > I'm missing something.
    > 
    > I'm replying to:
    > 
    > On 2019-04-17 18:43:10 -0400, Stephen Frost wrote:
    > > Wow.  I have to admit that I feel completely opposite of that- I'd
    > > *love* to have an independent tool (which ideally uses the same code
    > > through the common library, or similar) that can be run to apply WAL.
    > 
    > And I'm basically saying that anything that starts from this premise is
    > fatally flawed (in the ex falso quodlibet kind of sense ;)).
    
    I'd just say that it'd be... difficult. :)
    
    > > The "WAL compression" tool contemplated
    > > previously would be much simpler and not the full-blown WAL replay
    > > capability, which would be left to the server, unless you're suggesting
    > > that even that should be exclusively the purview of the backend?  Though
    > > that ship's already sailed, given that external projects have
    > > implemented it.
    > 
    > I'm extremely doubtful of such tools (but it's not what I was responding
    > to, see above). I'd be extremely surprised if even one of them came
    > close to being correct. The old FPI removal tool had data corrupting
    > bugs left and right.
    
    I have concerns about it myself, which is why I'd actually really like
    to see something in core that does it, and does it the right way, that
    other projects could then leverage (ideally by just linking into the
    library without having to rewrite what's in core, though that might not
    be an option for things like WAL-G that are in Go and possibly don't
    want to link in some C library).
    
    > > Having a library to provide that which external
    > > projects could leverage would be nicer than having everyone write their
    > > own version.
    > 
    > No, I don't think that's necessarily true. Something complicated that's
    > hard to get right doesn't have to be provided by core. Even if other
    > projects decide that their risk/reward assessment is different than core
    > postgres'. We don't have to take on all kinds of work and complexity for
    > external tools.
    
    No, it doesn't have to be provided by core, but I sure would like it to
    be and I'd be much more comfortable if it was because then we'd also
    take care to not break whatever assumptions are made (or to do so in a
    way that can be detected and/or handled) as new code is written.  As
    discussed above, as long as it isn't provided by core, it's not going to
    be trusted, likely will have bugs, and probably will be broken by things
    happening in core moving forward.  The only option left is "well, we
    just won't have that capability at all".  Maybe that's what you're
    getting at here, but not sure I agree with that as the result.
    
    Thanks,
    
    Stephen
    
  60. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-04-22T18:26:40Z

    Greetings,
    
    * Robert Haas (robertmhaas@gmail.com) wrote:
    > On Mon, Apr 22, 2019 at 1:08 PM Stephen Frost <sfrost@snowman.net> wrote:
    > > > One could instead just do a straightforward extension
    > > > to the existing BASE_BACKUP command to enable incremental backup.
    > >
    > > Ok, how do you envision that?  As I mentioned up-thread, I am concerned
    > > that we're talking too high-level here and it's making the discussion
    > > more difficult than it would be if we were to put together specific
    > > ideas and then discuss them.
    > >
    > > One way I can imagine to extend BASE_BACKUP is by adding LSN as an
    > > optional parameter and then having the database server scan the entire
    > > cluster and send a tarball which contains essentially a 'diff' file of
    > > some kind for each file where we can construct a diff based on the LSN,
    > > and then the complete contents of the file for everything else that
    > > needs to be in the backup.
    > 
    > /me scratches head.  Isn't that pretty much what I described in my
    > original post?  I even described what that "'diff' file of some kind"
    > would look like in some detail in the paragraph of that email
    > numbered "2.", and I described the reasons for that choice at length
    > in http://postgr.es/m/CA+TgmoZrqdV-tB8nY9P+1pQLqKXp5f1afghuoHh5QT6ewdkJ6g@mail.gmail.com
    > 
    > I can't figure out how I'm managing to be so unclear about things
    > about which I thought I'd been rather explicit.
    
    There was basically zero discussion about what things would look like at
    a protocol level (I went back and skimmed over the thread before sending
    my last email to specifically see if I was going to get this response
    back..).  I get the idea behind the diff file, the contents of which I
    wasn't getting into above.
    
    > > So, sure, that would work, but it wouldn't be able to be parallelized
    > > and I don't think it'd end up being very exciting for the external tools
    > > because of that, but it would be fine for pg_basebackup.
    > 
    > Stop being such a pessimist.  Yes, if we only add the option to the
    > BASE_BACKUP command, it won't directly be very exciting for external
    > tools, but a lot of the work that is needed to do things that ARE
    > exciting for external tools will have been done.  For instance, if the
    > work to figure out which blocks have been modified via WAL-scanning
    > gets done, and initially that's only exposed via BASE_BACKUP, it won't
    > be much work for somebody to write new code that exposes
    > that information directly through some new replication command.
    > There's a difference between something that's going in the wrong
    > direction and something that's going in the right direction but not as
    > far or as fast as you'd like.  And I'm 99% sure that everything I'm
    > proposing here falls in the latter category rather than the former.
    
    I didn't mean to imply that you're going in the wrong direction here and
    I thought I said somewhere in my last email more-or-less exactly the
    same, that a great deal of the work needed for block-level incremental
    backup would be done, but specifically that this proposal wouldn't allow
    external tools to leverage that.  It sounds like what you're suggesting
    now is that you're happy to implement the backend code, expose it in a
    way that works just for pg_basebackup, and that if someone else wants to
    add things to the protocol to make it easier for external tools to
    leverage, great.  All I can say is that that's basically how we ended up
    in the situation we're in today where pg_basebackup doesn't support
    parallel backup but a bunch of external tools do and they don't go
    through the backend to get there, even though they'd probably prefer to.
    
    > > On the other hand, if you added new commands for 'list of files changed
    > > since this LSN' and 'give me this file' and 'give me this file with the
    > > changes in it since this LSN', then pg_basebackup could work with that
    > > pretty easily in a single-threaded model (maybe with two connections to
    > > the backend, but still in a single process, or maybe just by slurping up
    > > the file list and then asking for each one) and the external tools could
    > > leverage those new capabilities too for their backups, both full backups
    > > and incremental ones.  This also wouldn't have to change how
    > > pg_basebackup does full backups today one bit, so what we're really
    > > talking about here is the direction to take the new code that's being
    > > written, not about rewriting existing code.  I agree that it'd be a bit
    > > more work...  but hopefully not *that* much more, and it would mean we
    > > could later add parallel backup to pg_basebackup more easily too, if we
    > > wanted to.
    > 
    > For purposes of implementing parallel pg_basebackup, it would probably
    > be better if the server rather than the client decided which files to
    > send via which connection.  If the client decides, then every time the
    > server finishes sending a file, the client has to request another
    > file, and that introduces some latency: after the server finishes
    > sending each file, it has to wait for the client to finish receiving
    > the data, and it has to wait for the client to tell it what file to
    > send next.  If the server decides, then it can just send data at top
    > speed without a break.  So the ideal interface for pg_basebackup would
    > really be something like:
    > 
    > START_PARALLEL_BACKUP blah blah PARTICIPANTS 4;
    > 
    > ...returning a cookie that can then be used by each participant as
    > an argument to a new command:
    > 
    > JOIN_PARALLEL_BACKUP 'cookie';
    > 
    > However, that is obviously extremely inconvenient for third-party
    > tools.  It's possible we need both an interface like this -- for use
    > by parallel pg_basebackup -- and a
    > START_BACKUP/SEND_FILE_LIST/SEND_FILE_CONTENTS/STOP_BACKUP type
    > interface for use by external tools.  On the other hand, maybe the
    > additional overhead caused by managing the list of files to be fetched
    > on the client side is negligible.  It'd be interesting to see, though,
    > how busy the server is when running an incremental backup managed by
    > an external tool like BART or pgbackrest on a cluster with a gazillion
    > little-tiny relations.  I wonder if we'd find that it spends most of
    > its time waiting for the client.
    
    Thanks for sharing your thoughts on that, certainly having the backend
    able to be more intelligent about streaming files to avoid latency is
    good and possibly the best approach.  Another alternative to reducing
    the latency would be to have a way for the client to request a set of
    files, but I don't know that it'd be better.
    
    I'm not really sure why the above is extremely inconvenient for
    third-party tools, beyond just that they've already been written to work
    with an assumption that the server-side of things isn't as intelligent
    as PG is.
    
    > > What I'm afraid will be lackluster is adding block-level incremental
    > > backup support to pg_basebackup without any support for managing
    > > backups or anything else.  I'm also concerned that it's going to mean
    > > that people who want to use incremental backup with pg_basebackup are
    > > going to have to write a lot of their own management code (probably in
    > > shell scripts and such...) around that and if they get anything wrong
    > > there then people are going to end up with bad backups that they can't
    > > restore from, or they'll have corrupted clusters if they do manage to
    > > get them restored.
    > 
    > I think that this is another complaint that basically falls into the
    > category of saying that this proposal might not fix everything for
    > everybody, but that complaint could be levied against any reasonable
    > development proposal.
    
    I'm disappointed that the concerns about the trouble that end users are
    likely to have with this didn't garner more discussion.
    
    Thanks,
    
    Stephen
    
  61. Re: block-level incremental backup

    Andres Freund <andres@anarazel.de> — 2019-04-22T18:33:46Z

    Hi,
    
    On 2019-04-22 14:26:40 -0400, Stephen Frost wrote:
    > I'm disappointed that the concerns about the trouble that end users are
    > likely to have with this didn't garner more discussion.
    
    My impression is that endusers are having a lot more trouble due to
    important backup/restore features not being in core/pg_basebackup, than
    due to external tools having a harder time to implement certain
    features. Focusing on external tools being able to provide all those
    features, because core hasn't yet, is imo entirely the wrong thing to
    concentrate upon.  And it's not like things largely haven't been
    implemented in pg_basebackup for fundamental architectural reasons.
    It's because we've built like 5 different external tools with randomly
    differing featureset and licenses.
    
    Greetings,
    
    Andres Freund
    
    
    
    
  62. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-04-22T19:03:04Z

    Greetings,
    
    * Andres Freund (andres@anarazel.de) wrote:
    > On 2019-04-22 14:26:40 -0400, Stephen Frost wrote:
    > > I'm disappointed that the concerns about the trouble that end users are
    > > likely to have with this didn't garner more discussion.
    > 
    > My impression is that endusers are having a lot more trouble due to
    > important backup/restore features not being in core/pg_basebackup, than
    > due to external tools having a harder time to implement certain
    > features.
    
    I had been referring specifically to the concern I raised about
    incremental block-level backups being added to pg_basebackup and how
    that'll make using pg_basebackup more complicated and therefore more
    difficult for end-users to get right, particularly if the end user is
    having to handle management of the association between the full backup
    and the incremental backups.  I wasn't referring to anything regarding
    external tools.
    
    > Focusing on external tools being able to provide all those
    > features, because core hasn't yet, is imo entirely the wrong thing to
    > concentrate upon.  And it's not like things largely haven't been
    > implemented in pg_basebackup for fundamental architectural reasons.
    > It's because we've built like 5 different external tools with randomly
    > differing featureset and licenses.
    
    There are a few challenges when it comes to adding backup features to
    core.  One is that core naturally moves slower when it
    comes to development than external projects do, as was discussed
    earlier on this thread.  Another is that, when it comes to backup,
    specifically, people want to back up their *existing* systems, which
    means that they need a backup tool that's going to work with whatever
    version of PG they've currently got deployed and that's often a few
    years old already.  Certainly when I've thought about features that we'd
    like to see and considered if there's something that could be
    implemented in core vs. implemented outside of core, the answer often
    ends up being "well, if we do it ourselves then we can make it work for
    PG 9.2 and above, and have it working for existing users, but if we work
    it in as part of core, it won't be available until next year and only
    for version 12 and above, and users can only use it once they've
    upgraded.."
    
    Thanks,
    
    Stephen
    
  63. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-22T20:08:18Z

    On Mon, Apr 22, 2019 at 2:26 PM Stephen Frost <sfrost@snowman.net> wrote:
    > There was basically zero discussion about what things would look like at
    > a protocol level (I went back and skimmed over the thread before sending
    > my last email to specifically see if I was going to get this response
    > back..).  I get the idea behind the diff file, the contents of which I
    > wasn't getting into above.
    
    Well, I wrote:
    
    "There should be a way to tell pg_basebackup to request from the
    server only those blocks where LSN >= threshold_value."
    
    I guess I assumed that people interested in the details would take
    that to mean "and therefore the protocol would grow an option for this
    type of request in whatever way is the most straightforward possible
    extension of the current functionality," which is indeed how you
    eventually interpreted it when you said we could "extend BASE_BACKUP
    is by adding LSN as an optional parameter."
    
    I could have been more explicit, but sometimes people tell me that my
    emails are too long.
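    
    For illustration, the simplest server-side strategy for "only those
    blocks where LSN >= threshold_value" (scan every block and keep the
    qualifying ones) could look roughly like the Python sketch below.  It
    assumes the default 8 kB block size and a little-endian build, and
    relies on the page LSN being stored in the first eight bytes of each
    page header as two 32-bit halves.  The helper names and the fake
    relation data are invented for the example; new or all-zero pages,
    torn reads, and checksums are ignored.

```python
import struct

BLOCK_SIZE = 8192  # default PostgreSQL block size (assumed here)


def page_lsn(page):
    """Read the page LSN from the first 8 bytes of the page header.

    Assumes little-endian storage: high 32 bits first, then low 32 bits.
    """
    hi, lo = struct.unpack_from("<II", page, 0)
    return (hi << 32) | lo


def parse_lsn(text):
    """Convert the textual 'X/Y' form (two hex halves) to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def changed_blocks(data, threshold):
    """Yield (block_number, page) for every page whose LSN >= threshold."""
    for offset in range(0, len(data) - BLOCK_SIZE + 1, BLOCK_SIZE):
        page = data[offset:offset + BLOCK_SIZE]
        if page_lsn(page) >= threshold:
            yield offset // BLOCK_SIZE, page


def make_page(lsn):
    """Build a fake page for the demo: LSN in the header, zero-filled body."""
    header = struct.pack("<II", lsn >> 32, lsn & 0xFFFFFFFF)
    return header.ljust(BLOCK_SIZE, b"\0")


# Fake three-block relation file with page LSNs 0/100, 1/234 and 1/300.
relation = b"".join(make_page(parse_lsn(s)) for s in ("0/100", "1/234", "1/300"))
selected = [blkno for blkno, _ in changed_blocks(relation, parse_lsn("1/234"))]
print(selected)
```

    A full scan like this reads every block but sends only the matching
    ones, which is the bandwidth-saving, no-prior-setup variant described
    in the original proposal.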
    
    > external tools to leverage that.  It sounds like what you're suggesting
    > now is that you're happy to implement the backend code, expose it in a
    > way that works just for pg_basebackup, and that if someone else wants to
    > add things to the protocol to make it easier for external tools to
    > leverage, great.
    
    Yep, that's more or less it, although I am potentially willing to do
    some modest amount of that other work along the way.  I just don't
    want to prioritize it higher than getting the actual thing I want to
    build built, which I think is a pretty fair position for me to take.
    
    > All I can say is that that's basically how we ended up
    > in the situation we're in today where pg_basebackup doesn't support
    > parallel backup but a bunch of external tools do and they don't go
    > through the backend to get there, even though they'd probably prefer to.
    
    I certainly agree that core should try to do things in a way that is
    useful to external tools when that can be done without undue effort,
    but only if it can actually be done without undue effort.  Let's see
    whether that's the case here:
    
    - Anastasia wants a command added that dumps out whatever the server
    knows about what files have changed, which I already agreed was a
    reasonable extension of my initial proposal.
    
    - You said that for this to be useful to pgbackrest, it'd have to use
    a whole different mechanism that includes commands to request
    individual files and blocks within those files, which would be a
    significant rewrite of pg_basebackup that you agreed is more closely
    related to parallel backup than to the project under discussion on
    this thread.  And that even then pgbackrest probably wouldn't use it
    because it also does server-side compression and encryption which are
    not included in this proposal.
    
    It seems to me that the first one falls into the category of reasonable
    additional effort and the second one falls into the category of lots
    of extra and unrelated work that wouldn't even get used.
    
    > Thanks for sharing your thoughts on that, certainly having the backend
    > able to be more intelligent about streaming files to avoid latency is
    > good and possibly the best approach.  Another alternative to reducing
    > the latency would be to have a way for the client to request a set of
    > files, but I don't know that it'd be better.
    
    I don't know either.  This is an area that needs more thought, I
    think, although as discussed, it's more related to parallel backup
    than $SUBJECT.
    
    > I'm not really sure why the above is extremely inconvenient for
    > third-party tools, beyond just that they've already been written to work
    > with an assumption that the server-side of things isn't as intelligent
    > as PG is.
    
    Well, one thing you might want to do is have a tool that connects to
    the server, enters backup mode, requests information on what blocks
    have changed, copies those blocks via direct filesystem access, and
    then exits backup mode.  Such a tool would really benefit from a
    START_BACKUP / SEND_FILE_LIST / SEND_FILE_CONTENTS / STOP_BACKUP
    command language, because it would just skip ever issuing the
    SEND_FILE_CONTENTS command in favor of doing that part of the work via
    other means.  On the other hand, a START_PARALLEL_BACKUP LSN '1/234'
    command is useless to such a tool.
    
    Contrariwise, a tool that has its own magic - perhaps based on
    WAL-scanning or something like ptrack - to know which files currently
    exist and which blocks are modified could use SEND_FILE_CONTENTS but
    not SEND_FILE_LIST.  And a filesystem-snapshot based technique might
    use START_BACKUP and STOP_BACKUP but nothing else.
    
    In short, providing granular commands like this lets the client be
    really intelligent even if the server isn't, and lets the client have
    fine-grained control of the process.  This is very good if you're an
    out-of-core tool maintainer and your tool is trying to be smarter than
    - or even just differently-designed than - core.
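    
    The three kinds of client described above can be summarized as
    different subsets of the same command sequence.  The Python sketch
    below only restates that mapping; the command names are the
    hypothetical ones used in this discussion, and it assumes that every
    style still brackets its work with START_BACKUP and STOP_BACKUP.

```python
# Hypothetical command names from the discussion, in protocol order.
COMMANDS = ("START_BACKUP", "SEND_FILE_LIST", "SEND_FILE_CONTENTS", "STOP_BACKUP")

# Which steps each style of client would ask the server to perform.
CLIENT_STYLES = {
    # Copies blocks itself via direct filesystem access.
    "direct filesystem copy": {"START_BACKUP", "SEND_FILE_LIST", "STOP_BACKUP"},
    # Already knows what changed (WAL scanning, ptrack, ...).
    "own change tracking": {"START_BACKUP", "SEND_FILE_CONTENTS", "STOP_BACKUP"},
    # Takes a filesystem snapshot between start and stop.
    "filesystem snapshot": {"START_BACKUP", "STOP_BACKUP"},
    # Lets the server do every step.
    "plain client": set(COMMANDS),
}


def plan(style):
    """Return the commands a client style issues, in protocol order."""
    wanted = CLIENT_STYLES[style]
    return [cmd for cmd in COMMANDS if cmd in wanted]


for style in CLIENT_STYLES:
    print(f"{style}: {' -> '.join(plan(style))}")
```

    A single monolithic command offers no equivalent way to skip a step,
    which is the flexibility argument being made here.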
    
    But if what you really want is just a maximally-efficient parallel
    backup, you don't need the commands to be fine-grained like this.  You
    don't even really *want* the commands to be fine-grained like this,
    because it's better if the server works it all out so as to avoid
    unnecessary network round-trips.  You just want to tell the server
    "hey, I want to do a parallel backup with 5 participants - hit me!"
    and have it do that in the most efficient way that it knows how,
    without forcing the client to make any decisions that can be made just
    as well, and perhaps more efficiently, on the server.
    
    On the third hand, one advantage of having the fine-grained commands
    is that it would not only make it easier for out-of-core tools to do
    cool things, but also in-core tools.  For instance, you can imagine
    being able to do something like:
    
    pg_basebackup -D outputdir -d conninfo --copy-files-from=$PGDATA
    
    If the client is using what I'm calling fine-grained commands, this is
    easy to implement.  If it's just calling a piece of server side
    functionality that sends back a tarball as a blob, it's not.
    
    So each approach has some pros and cons.
    
    > I'm disappointed that the concerns about the trouble that end users are
    > likely to have with this didn't garner more discussion.
    
    Well, we can keep discussing things.  I've tried to reply to as many
    of your concerns as I can, but I believe you've written more email on
    this thread than everyone else combined, so perhaps I haven't entirely
    been able to keep up.
    
    That being said, as far as I can tell, those concerns were not
    seconded by anyone else.  Also, if I understand correctly, when I
    asked how we could avoid that problem, you said that you didn't know.  And
    I said it seemed like we would need to do a very expensive operation at
    server startup, or magic.  So I feel that perhaps it is a problem that
    (1) is not of great general concern and (2) to which no really
    superior engineering solution is possible.
    
    I may, however, be mistaken.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  64. Re: block-level incremental backup

    Anastasia Lubennikova <a.lubennikova@postgrespro.ru> — 2019-04-23T11:08:12Z

    22.04.2019 2:02, Robert Haas wrote:
    > I think we're getting closer to a meeting of the minds here, but I
    > don't think it's intrinsically necessary to rewrite the whole method
    > of operation of pg_basebackup to implement incremental backup in a
    > sensible way.  One could instead just do a straightforward extension
    > to the existing BASE_BACKUP command to enable incremental backup.
    > Then, to enable parallel full backup and all sorts of out-of-core
    > hacking, one could expand the command language to allow tools to
    > access individual steps: START_BACKUP, SEND_FILE_LIST,
    > SEND_FILE_CONTENTS, STOP_BACKUP, or whatever.  The second thing makes
    > for an appealing project, but I do not think there is a technical
    > reason why it has to be done first.  Or for that matter why it has to
    > be done second.  As I keep saying, incremental backup and full backup
    > are separate projects and I believe it's completely reasonable for
    > whoever is doing the work to decide on the order in which they would
    > like to do the work.
    >
    > Having said that, I'm curious what people other than Stephen (and
    > other pgbackrest hackers) think about the relative value of parallel
    > backup vs. incremental backup.  Stephen appears quite convinced that
    > parallel backup is full of win and incremental backup is a bit of a
    > yawn by comparison, and while I certainly would not want to discount
    > the value of his experience in this area, it sometimes happens on this
    > mailing list that [ drum roll please ] not everybody agrees about
    > everything.  So, what do other people think?
    >
    Personally, I believe that incremental backups are more useful to implement
    first, since they benefit both backup speed and the space taken by a backup.
    Frankly speaking, I'm a bit surprised that the discussion of parallel backups
    took up so much of this thread.
    Of course, we must keep it in mind while designing the API, to avoid
    introducing any architectural obstacles, but any further discussion of
    parallelism is a topic for another thread.
    
    
    I understand Stephen's concerns about the difficulties of incremental backup
    management.
    Even assuming that the user is ready to manage backup chains, retention,
    and other stuff, we must consider the format of backup metadata that will
    allow us to perform some primitive commands:
    
    1) Tell whether this backup is full or incremental.
    
    2) Tell which backup is the parent of this incremental backup.
    Probably, we can limit it to just returning "start_lsn", which later can be
    compared to "stop_lsn" of the parent backup.
    
    3) Take an incremental backup based on this backup.
    Here we must help a backup manager to retrieve the LSN to pass it to
    pg_basebackup.
    
    4) Restore an incremental backup into a directory (on top of an already
    restored full backup).
    One may use it to perform "merge" or "restore" of the incremental backup,
    depending on the destination directory.
    I wonder if it is possible to integrate it into any existing tool, or whether
    we will end up with something like pg_basebackup/pg_baserestore, as in the
    case of pg_dump/pg_restore.
    
    Have you designed these? I can only recall "pg_combinebackup" from the very
    first message in this thread, which looks more like a sketch to explain the
    idea than a thought-out feature design. I also found a page
    https://wiki.postgresql.org/wiki/Incremental_backup that raises the same
    questions.
    I'm volunteering to write a draft patch or, more likely, a set of patches,
    which will allow us to discuss the subject in more detail.
    To do that, I'd like us to agree on the API and data format (at least
    broadly).
    Looking forward to hearing your thoughts.
    
    
    As I see it, ideally the backup management tools should concentrate more on
    managing multiple backups, while all the logic of taking a single backup (of
    any kind) should be integrated into the core. It means that any out-of-core
    client won't have to walk the PGDATA directory and care about all the
    postgres-specific knowledge of data files consisting of blocks with headers
    and LSNs and so on. It simply requests data and gets it.
    Understandably, it won't be implemented in one take and, more probably,
    it is not fully reachable.
    Still, it will be great to do our best to provide such tools (both existing
    and future) with conveniently formatted data and an API to get it.
    
    -- 
    Anastasia Lubennikova
    Postgres Professional: http://www.postgrespro.com
    The Russian Postgres Company
    
    
    
    
    
  65. Re: block-level incremental backup

    Adam Brusselback <adambrusselback@gmail.com> — 2019-04-23T19:12:27Z

    I hope it's alright to throw in my $0.02 as a user. I've been following
    this (and the other thread on reading WAL to find modified blocks,
    prefaulting, whatever else) since the start with great excitement and would
    love to see the built-in backup capabilities in Postgres greatly improved.
    I know this is not completely on-topic for just incremental backups, so I
    apologize in advance. It just seemed like the most apt place to chime in.
    
    
    Just to preface where I am coming from, I have been using pgBackRest for
    the past couple of years and used wal-e prior to that.  I am not a big *nix
    user apart from running all my servers on it; I do all my development on
    Windows and use primarily Java. The command line is not where I feel most
    comfortable despite my best efforts over the last 5-6 years. Prior to
    Postgres, I used SQL Server for quite a few years at previous companies but
    had more of a junior / intermediate skill set back then. I just wanted to
    put that out there so you can see where my biases are.
    
    
    
    
    With all that said, I would not be comfortable using pg_basebackup as my
    main backup tool simply because I’d have to cobble together numerous tools
    to get backups stored in a safe (not on the same server) location, I’d have
    to manage expiring backups and the WAL which is no longer needed, along
    with the rest of the stuff that makes these backup management tools useful.
    
    
    The command line scares me, and even if I was able to get all that working,
    I would not feel warm and fuzzy that I hadn’t messed something up horribly,
    and I might hit an edge case which destroys backups, silently corrupts data,
    etc.
    
    I love that there are tools that manage all of it: backups, WAL archiving,
    remote storage, integration with cloud storage (S3 and the like), and
    retention of these backups with all their dependencies, with all the
    restore options necessary built in as well.
    
    
    Block level incremental backup would be amazing for my use case. I have
    small updates / deletes that happen to data all over some of my largest
    tables. With pgBackRest, since the diff/incremental backups are at the file
    level, I can have a single update / delete which touched a random spot in a
    table and now requires that whole 1GB file to be backed up again. That
    said, even if pg_basebackup was the only tool that did incremental block
    level backup tomorrow, I still wouldn’t start using it directly. I went
    into the issues I’d have to deal with if I used pg_basebackup above, and
    incremental backups without a management tool make me think using it
    correctly would be much harder.
    
    
    I know this thread is just about incremental backup, and that pretty much
    everything in core is built up from small features into larger more complex
    ones. I understand that and am not trying to dump on any efforts, I am
    super excited to see work being done in this area! I just wanted to share
    my perspective on how crucial good backup management is to me (and I’m sure
    a few others may share my sentiment considering how popular all the
    external tools are).
    
    I would never put a system in production unless I have some backup
    management in place. If core builds a backup management tool which uses
    pg_basebackup as building blocks for its solution…awesome! That may be
    something I’d use.  If pg_basebackup can be improved so it can be used as
    the basis most external backup management tools can build on top of, that’s
    also great. All the external tools which practically every Postgres company
    has built show that it’s obviously a need for a lot of users. Core will
    never solve every single problem for all users, I know that. It would just
    be great to see some of the fundamental features of backup management baked
    into core in an extensible way.
    
    With that, there could be a recommended way to set up backups
    (full/incremental, parallel, compressed), point in time recovery, backup
    retention, and perform restores (to a point in time, on a replica server,
    etc) with just the tooling within core with a nice and simple user
    interface, and great performance.
    
    If those features core supports in the internal tooling are built in an
    extensible way (as has been discussed), there could be much less
    duplication of work implementing the same base features over and over for
    each external tool. Those companies can focus on more value-added features
    to their own products that core would never support, or on improving the
    tooling/performance/features core provides.
    
    
    Well, this is way longer and a lot less coherent than I was hoping, so I
    apologize for that. Hopefully my stream of thoughts made a little bit of
    sense to someone.
    
    
    -Adam
    
  66. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-04-24T13:28:15Z

    Greetings,
    
    * Robert Haas (robertmhaas@gmail.com) wrote:
    > On Mon, Apr 22, 2019 at 2:26 PM Stephen Frost <sfrost@snowman.net> wrote:
    > > There was basically zero discussion about what things would look like at
    > > a protocol level (I went back and skimmed over the thread before sending
    > > my last email to specifically see if I was going to get this response
    > > back..).  I get the idea behind the diff file, the contents of which I
    > > wasn't getting into above.
    > 
    > Well, I wrote:
    > 
    > "There should be a way to tell pg_basebackup to request from the
    > server only those blocks where LSN >= threshold_value."
    > 
    > I guess I assumed that people interested in the details would take
    > that to mean "and therefore the protocol would grow an option for this
    > type of request in whatever way is the most straightforward possible
    > extension of the current functionality," which is indeed how you
    > eventually interpreted it when you said we could "extend BASE_BACKUP
    > is by adding LSN as an optional parameter."
    
    Looking at it from where I'm sitting, I brought up two ways that we
    could extend the protocol to "request from the server only those blocks
    where LSN >= threshold_value" with one being the modification to
    BASE_BACKUP and the other being a new set of commands that could be
    parallelized.  If I had assumed that you'd be thinking the same way I am
    about extending the backup protocol, I wouldn't have said anything now
    and then would have complained after you wrote a patch that just
    extended the BASE_BACKUP command, at which point I likely would have
    been told that it's now been done and that I should have mentioned it
    earlier.
    
    > > external tools to leverage that.  It sounds like what you're suggesting
    > > now is that you're happy to implement the backend code, expose it in a
    > > way that works just for pg_basebackup, and that if someone else wants to
    > > add things to the protocol to make it easier for external tools to
    > > leverage, great.
    > 
    > Yep, that's more or less it, although I am potentially willing to do
    > some modest amount of that other work along the way.  I just don't
    > want to prioritize it higher than getting the actual thing I want to
    > build built, which I think is a pretty fair position for me to take.
    
    At least in part then it seems like we're viewing the level of effort
    around what I'm talking about quite differently, and I feel like that's
    largely because every time I mention parallel anything there's this
    assumption that I'm asking you to parallelize pg_basebackup or write a
    whole bunch more code to provide a fully optimized server-side parallel
    implementation for backups.  That really wasn't what I was going for.  I
    was thinking it would be a modest amount of additional work to add
    incremental backup via a few new commands, instead of through the
    BASE_BACKUP protocol command, that would make parallelization possible.
    
    Now, through this discussion, you've brought up some really good points
    about how the initial thoughts I had around how we could add some
    relatively simple commands, as part of this work, to make it easier for
    someone to later add parallel support to pg_basebackup (either full or
    incremental), or for external tools to leverage, might not be the best
    solution when it comes to having parallel backup in core, and therefore
    wouldn't actually end up being useful towards that end.  That's
    certainly a fair point and possibly enough to justify not spending even
    the modest time I was thinking it'd need, but I'm not convinced.  Now,
    that said, if you are convinced that's the case, and you're doing the
    work, then it's certainly your prerogative to go in the direction you're
    convinced of.  I don't mean any of this discussion to imply that I'd
    object to a commit that extended BASE_BACKUP in the way outlined above,
    but I understood the question to be "what do people think of this idea?"
    and to that I'm still of the opinion that spending a modest amount of
    time to provide a way to parallelize an incremental backup is worth it,
    even if it isn't optimal and isn't the direct goal of this effort.
    
    There's a tangent on all of this that's pretty key though, which is the
    question around just how the blocks are identified.  If the WAL scanning
    is done to figure out the blocks, then that's quite a bit different from
    the other idea of "open this relation and scan it, but only give me the
    blocks after this LSN".  It's the latter case that I've been mostly
    thinking about in this thread, which is part of why I was thinking it'd
    be a modest amount of work to have protocol commands that accepted a
    file (or perhaps a relation..) to scan and return blocks from instead of
    baking this into BASE_BACKUP which by definition just serially scans the
    data directory and returns things as it finds them.  For the case where
    we have WAL scanning happening and modfiles which are being read and
    used to figure out the blocks to send, it seems like it might be more
    complicated and therefore potentially quite a bit more work to have a
    parallel version of that.
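
    As a rough illustration of the "open this relation and scan it, but only
    give me the blocks after this LSN" idea, here is a minimal Python sketch.
    It assumes the default 8192-byte block size and relies on the page LSN
    (pd_lsn) occupying the first 8 bytes of each page header as two 32-bit
    values in the server's native byte order.  The function names are
    hypothetical, and the sketch ignores things a real implementation would
    have to handle, such as torn reads of pages that are being written
    concurrently.

```python
import struct

# Assumption: the server was built with the default block size.
BLCKSZ = 8192


def page_lsn(page: bytes) -> int:
    """Return the LSN stored at the start of a page header.

    pd_lsn is stored as two unsigned 32-bit integers (high half, low half)
    in native byte order, so this must run on a machine with the same
    endianness as the server that wrote the file.
    """
    xlogid, xrecoff = struct.unpack_from("=II", page, 0)
    return (xlogid << 32) | xrecoff


def changed_blocks(path: str, threshold_lsn: int):
    """Yield (block_number, page_bytes) for blocks with LSN >= threshold_lsn."""
    with open(path, "rb") as f:
        blkno = 0
        while True:
            page = f.read(BLCKSZ)
            if len(page) < BLCKSZ:
                break  # end of file (a trailing partial block is skipped)
            if page_lsn(page) >= threshold_lsn:
                yield blkno, page
            blkno += 1
```

    Because each call works on a single file, several such scans could in
    principle run side by side, which is the property under discussion here.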
    
    > > All I can say is that that's basically how we ended up
    > > in the situation we're in today where pg_basebackup doesn't support
    > > parallel backup but a bunch of external tools do and they don't go
    > > through the backend to get there, even though they'd probably prefer to.
    > 
    > I certainly agree that core should try to do things in a way that is
    > useful to external tools when that can be done without undue effort,
    > but only if it can actually be done without undue effort.  Let's see
    > whether that's the case here:
    > 
    > - Anastasia wants a command added that dumps out whatever the server
    > knows about what files have changed, which I already agreed was a
    > reasonable extension of my initial proposal.
    
    That seems like a useful thing to have, I agree.
    
    > - You said that for this to be useful to pgbackrest, it'd have to use
    > a whole different mechanism that includes commands to request
    > individual files and blocks within those files, which would be a
    > significant rewrite of pg_basebackup that you agreed is more closely
    > related to parallel backup than to the project under discussion on
    > this thread.  And that even then pgbackrest probably wouldn't use it
    > because it also does server-side compression and encryption which are
    > not included in this proposal.
    
    Yes, having thought about it a bit more, without adding in the other
    features that we already support in pgBackRest, it's unlikely we'd use
    it in the form that I was contemplating.  That said, it'd at least be
    closer to something we could use and adding those other features, such
    as compression and encryption, would almost certainly be simpler and
    easier if there were already protocol commands like those we discussed
    for parallel work.
    
    > > Thanks for sharing your thoughts on that, certainly having the backend
    > > able to be more intelligent about streaming files to avoid latency is
    > > good and possibly the best approach.  Another alternative to reducing
    > > the latency would be to have a way for the client to request a set of
    > > files, but I don't know that it'd be better.
    > 
    > I don't know either.  This is an area that needs more thought, I
    > think, although as discussed, it's more related to parallel backup
    > than $SUBJECT.
    
    Yes, I agree with that.
    
    > > I'm not really sure why the above is extremely inconvenient for
    > > third-party tools, beyond just that they've already been written to work
    > > with an assumption that the server-side of things isn't as intelligent
    > > as PG is.
    > 
    > Well, one thing you might want to do is have a tool that connects to
    > the server, enters backup mode, requests information on what blocks
    > have changed, copies those blocks via direct filesystem access, and
    > then exits backup mode.  Such a tool would really benefit from a
    > START_BACKUP / SEND_FILE_LIST / SEND_FILE_CONTENTS / STOP_BACKUP
    > command language, because it would just skip ever issuing the
    > SEND_FILE_CONTENTS command in favor of doing that part of the work via
    > other means.  On the other hand, a START_PARALLEL_BACKUP LSN '1/234'
    > command is useless to such a tool.
    
    That's true, but I hardly ever hear people talking about how wonderful
    it is that pgBackRest uses SSH to grab the data.  What I hear, often, is
    that people would really like backups to be done over the PG protocol on
    the same port that replication is done on.  A possible compromise is
    having a dedicated port for the backup agent to use, but it's definitely
    not the preference.
    
    > Contrariwise, a tool that has its own magic - perhaps based on
    > WAL-scanning or something like ptrack - to know which files currently
    > exist and which blocks are modified could use SEND_FILE_CONTENTS but
    > not SEND_FILE_LIST.  And a filesystem-snapshot based technique might
    > use START_BACKUP and STOP_BACKUP but nothing else.
    > 
    > In short, providing granular commands like this lets the client be
    > really intelligent even if the server isn't, and lets the client have
    > fine-grained control of the process.  This is very good if you're an
    > out-of-core tool maintainer and your tool is trying to be smarter than
    > - or even just differently-designed than - core.
    > 
    > But if what you really want is just a maximally-efficient parallel
    > backup, you don't need the commands to be fine-grained like this.  You
    > don't even really *want* the commands to be fine-grained like this,
    > because it's better if the server works it all out so as to avoid
    > unnecessary network round-trips.  You just want to tell the server
    > "hey, I want to do a parallel backup with 5 participants - hit me!"
    > and have it do that in the most efficient way that it knows how,
    > without forcing the client to make any decisions that can be made just
    > as well, and perhaps more efficiently, on the server.
    > 
    > On the third hand, one advantage of having the fine-grained commands
    > is that it would not only make it easier for out-of-core tools to do
    > cool things, but also in-core tools.  For instance, you can imagine
    > being able to do something like:
    > 
    > pg_basebackup -D outputdir -d conninfo --copy-files-from=$PGDATA
    > 
    > If the client is using what I'm calling fine-grained commands, this is
    > easy to implement.  If it's just calling a piece of server side
    > functionality that sends back a tarball as a blob, it's not.
    > 
    > So each approach has some pros and cons.
    
    I agree that each has some pros and cons.  Certainly one of the big
    'cons' here is that it'd be a lot more backend work to implement the
    'maximally-efficient parallel backup', while the fine-grained commands
    wouldn't require nearly as much but would still allow a great deal of
    the benefit for both in-core and out-of-core tools, potentially.
    
    > > I'm disappointed that the concerns about the trouble that end users are
    > > likely to have with this didn't garner more discussion.
    > 
    > Well, we can keep discussing things.  I've tried to reply to as many
    > of your concerns as I can, but I believe you've written more email on
    > this thread than everyone else combined, so perhaps I haven't entirely
    > been able to keep up.
    >
    > That being said, as far as I can tell, those concerns were not
    > seconded by anyone else.  Also, if I understand correctly, when I
    > asked how we could avoid that problem, you said that you didn't know.  And
    > I said it seemed like we would need to do a very expensive operation at
    > server startup, or magic.  So I feel that perhaps it is a problem that
    > (1) is not of great general concern and (2) to which no really
    > superior engineering solution is possible.
    
    The comments that Anastasia had around the issues with being able to
    identify the full backup that goes with a given incremental backup, et
    al, certainly echoed some of my concerns regarding this part of the
    discussion.
    
    As for the concerns about trying to avoid corruption from starting up an
    invalid cluster, I didn't see much discussion about the idea of some
    kind of cross-check between pg_control and backup_label.  That was all
    very hand-wavy, so I'm not too surprised, but I don't think it's
    completely impossible to have something better than "well, if you just
    remove this one file, then you get a non-obviously corrupt cluster that
    you can happily start up".  I'll certainly accept that it requires more
    thought though and if we're willing to continue a discussion around
    that, great.
    
    Thanks,
    
    Stephen
    
  67. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-24T15:58:59Z

    On Wed, Apr 24, 2019 at 9:28 AM Stephen Frost <sfrost@snowman.net> wrote:
    > Looking at it from where I'm sitting, I brought up two ways that we
    > could extend the protocol to "request from the server only those blocks
    > where LSN >= threshold_value" with one being the modification to
    > BASE_BACKUP and the other being a new set of commands that could be
    > parallelized.  If I had assumed that you'd be thinking the same way I am
    > about extending the backup protocol, I wouldn't have said anything now
    > and then would have complained after you wrote a patch that just
    > extended the BASE_BACKUP command, at which point I likely would have
    > been told that it's now been done and that I should have mentioned it
    > earlier.
    
    Fair enough.
    
    > At least in part then it seems like we're viewing the level of effort
    > around what I'm talking about quite differently, and I feel like that's
    > largely because every time I mention parallel anything there's this
    > assumption that I'm asking you to parallelize pg_basebackup or write a
    > whole bunch more code to provide a fully optimized server-side parallel
    > implementation for backups.  That really wasn't what I was going for.  I
    > was thinking it would be a modest amount of additional work to add
    > incremental backup via a few new commands, instead of through the
    > BASE_BACKUP protocol command, that would make parallelization possible.
    
    I'm not sure about that.  It doesn't seem crazy difficult, but there
    are a few wrinkles.  One is that if the client is requesting files one
    at a time, it's got to have a list of all the files that it needs to
    request, and that means that it has to ask the server to make a
    preparatory pass over the whole PGDATA directory to get a list of all
    the files that exist.  That overhead is not otherwise needed.  Another
    is that the list of files might be really large, and that means that
    the client would either use a lot of memory to hold that great big
    list, or need to deal with spilling the list to a spool file
    someplace, or else have a server protocol that lets the list be
    fetched incrementally in chunks.  A third is that, as you mention
    further on, it means that the client has to care a lot more about
    exactly how the server is figuring out which blocks have been
    modified.  If it just says BASE_BACKUP ..., the server can be
    internally reading each block and checking the LSN, or using
    WAL-scanning or ptrack or whatever and the client doesn't need to know
    or care.  But if the client is asking for a list of modified files or
    blocks, then that presumes the information is available, and not too
    expensively, without actually reading the files.  Fourth, MAX_RATE
    probably won't actually limit to the correct rate overall if the limit
    is applied separately to each file.
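
    The MAX_RATE point can be illustrated with a small Python sketch.  The
    Throttle class below is hypothetical and is not how the server implements
    MAX_RATE; it only shows that a limit tracked separately per file allows
    the aggregate rate to grow with the number of files being sent at once,
    whereas one shared counter does not.

```python
import time


class Throttle:
    """Pace a byte stream so that it averages at most max_rate bytes/second."""

    def __init__(self, max_rate, clock=time.monotonic):
        self.max_rate = max_rate
        self.clock = clock          # injectable so the sketch can be tested
        self.started = clock()
        self.sent = 0

    def account(self, nbytes):
        """Record nbytes as sent; return how long the caller should sleep."""
        self.sent += nbytes
        # Earliest moment at which `sent` bytes are allowed to have gone out.
        earliest = self.started + self.sent / self.max_rate
        return max(0.0, earliest - self.clock())


# Two files of 100 bytes each, limit 100 bytes/second, clock frozen at 0.
frozen = lambda: 0.0

shared = Throttle(100, frozen)
shared_delays = [shared.account(100), shared.account(100)]

per_file = [Throttle(100, frozen), Throttle(100, frozen)]
per_file_delays = [t.account(100) for t in per_file]
```

    With the shared throttle the second file has to wait until the 2-second
    mark, so 200 bytes take 2 seconds.  With one throttle per file, both
    transfers finish after 1 second, which is twice the intended rate.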
    
    I'd be afraid that a patch that tried to handle all that as part of
    this project would get rejected on the grounds that it was trying to
    solve too many unrelated problems.  Also, though not everybody has to
    agree on what constitutes a "modest amount of additional work," I
    would not describe solving all of those problems as a modest effort,
    but rather a pretty substantial one.
    
    > There's a tangent on all of this that's pretty key though, which is the
    > question around just how the blocks are identified.  If the WAL scanning
    > is done to figure out the blocks, then that's quite a bit different from
    > the other idea of "open this relation and scan it, but only give me the
    > blocks after this LSN".  It's the latter case that I've been mostly
    > thinking about in this thread, which is part of why I was thinking it'd
    > be a modest amount of work to have protocol commands that accepted a
    > file (or perhaps a relation..) to scan and return blocks from instead of
    > baking this into BASE_BACKUP which by definition just serially scans the
    > data directory and returns things as it finds them.  For the case where
    > we have WAL scanning happening and modfiles which are being read and
    > used to figure out the blocks to send, it seems like it might be more
    > complicated and therefore potentially quite a bit more work to have a
    > parallel version of that.
    
    Yeah.  I don't entirely agree that the first one is simple, as per the
    above, but I definitely agree that the second one is more complicated
    than the first one.
    
    > > Well, one thing you might want to do is have a tool that connects to
    > > the server, enters backup mode, requests information on what blocks
    > > have changed, copies those blocks via direct filesystem access, and
    > > then exits backup mode.  Such a tool would really benefit from a
    > > START_BACKUP / SEND_FILE_LIST / SEND_FILE_CONTENTS / STOP_BACKUP
    > > command language, because it would just skip ever issuing the
    > > SEND_FILE_CONTENTS command in favor of doing that part of the work via
    > > other means.  On the other hand, a START_PARALLEL_BACKUP LSN '1/234'
    > > command is useless to such a tool.
    >
    > That's true, but I hardly ever hear people talking about how wonderful
    > it is that pgBackRest uses SSH to grab the data.  What I hear, often, is
    > that people would really like backups to be done over the PG protocol on
    > the same port that replication is done on.  A possible compromise is
    > having a dedicated port for the backup agent to use, but it's definitely
    > not the preference.
    
    If you happen to be on the same system where the backup is running,
    reading straight from the data directory might be a lot faster.
    Otherwise, I tend to agree with you that using libpq is probably best.
    
    > I agree that each has some pros and cons.  Certainly one of the big
    > 'cons' here is that it'd be a lot more backend work to implement the
    > 'maximally-efficient parallel backup', while the fine-grained commands
    > wouldn't require nearly as much but would still allow a great deal of
    > the benefit for both in-core and out-of-core tools, potentially.
    
    I agree.
    
    > The comments that Anastasia had around the issues with being able to
    > identify the full backup that goes with a given incremental backup, et
    > al, certainly echoed some of my concerns regarding this part of the
    > discussion.
    >
    > As for the concerns about trying to avoid corruption from starting up an
    > invalid cluster, I didn't see much discussion about the idea of some
    > kind of cross-check between pg_control and backup_label.  That was all
    > very hand-wavy, so I'm not too surprised, but I don't think it's
    > completely impossible to have something better than "well, if you just
    > remove this one file, then you get a non-obviously corrupt cluster that
    > you can happily start up".  I'll certainly accept that it requires more
    > thought though and if we're willing to continue a discussion around
    > that, great.
    
    I think there are three different issues here that need to be
    considered separately.
    
    Issue #1: If you manually add files to your backup, remove files from
    your backup, or change files in your backup, bad things will happen.
    There is fundamentally nothing we can do to prevent this completely,
    but it may be possible to make the system more resilient against
    ham-handed modifications, at least to the extent of detecting them.
    That's maybe a topic for another thread, but it's an interesting one:
    Andres and I were brainstorming about it at some point.
    
    Issue #2: You can only restore an LSN-based incremental backup
    correctly if you have a base backup whose start-of-backup LSN is
    greater than or equal to the threshold LSN used to take the
    incremental backup.  If #1 is not in play, this is just a simple
    cross-check at restoration time: retrieve the 'START WAL LOCATION'
    from the prior backup's backup_label file and the threshold LSN for
    the incremental backup from wherever you decide to store it and
    compare them; if they do not have the right relationship, ERROR.  As
    to whether #1 might end up in play here, anything's possible, but
    wouldn't manually editing LSNs in backup metadata files be pretty
    obviously a bad idea?  (Then again, I didn't really think the whole
    backup_label thing was that confusing either, and obviously I was
    wrong about that.  Still, editing a file requires a little more work
    than removing it... you have to not only lie to the system, you have
    to decide which lie to tell!)
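    
    A minimal sketch of the cross-check described above, in C.  It assumes the
    prior backup's backup_label contains a line of the form
    "START WAL LOCATION: X/Y ..." and that the threshold LSN is available as
    text in the same X/Y hexadecimal form.  The function names are
    hypothetical and not part of any existing tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Parse an LSN written as two hexadecimal halves, e.g. "1/234". */
static bool
parse_lsn(const char *text, uint64_t *lsn)
{
    unsigned int hi;
    unsigned int lo;

    if (sscanf(text, "%X/%X", &hi, &lo) != 2)
        return false;
    *lsn = ((uint64_t) hi << 32) | lo;
    return true;
}

/* Extract the START WAL LOCATION value from backup_label contents. */
static bool
backup_label_start_lsn(const char *label, uint64_t *lsn)
{
    static const char key[] = "START WAL LOCATION: ";
    const char *p = strstr(label, key);

    if (p == NULL)
        return false;
    return parse_lsn(p + strlen(key), lsn);
}

/*
 * Rule as stated in the text: the prior backup's start-of-backup LSN must
 * be greater than or equal to the incremental backup's threshold LSN.
 * Anything that cannot be parsed is treated as a failed check.
 */
static bool
incremental_is_restorable(const char *prior_label, const char *threshold_text)
{
    uint64_t start_lsn;
    uint64_t threshold_lsn;

    if (!backup_label_start_lsn(prior_label, &start_lsn))
        return false;
    if (!parse_lsn(threshold_text, &threshold_lsn))
        return false;
    return start_lsn >= threshold_lsn;
}
```

    A restore tool would call incremental_is_restorable() with the contents
    of the prior backup's backup_label and the stored threshold LSN, and
    raise an error when it returns false.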
    
    Issue #3: Even if you clearly understand the rule articulated in #2,
    you might find it hard to follow in practice.  If you take a full
    backup on Sunday and an incremental against Sunday's backup or against
    the previous day's backup on each subsequent day, it's not really that
    hard to understand.  But in more complex scenarios it could be hard to
    get right.  For example, if you've been removing your backups when they
    are a month old and then you start doing the same thing once you
    add incrementals to the picture you might easily remove a full backup
    upon which a newer incremental depends.  I see the need for good tools
    to manage this kind of complexity, but have no plan as part of this
    project to provide them.  I think that just requires too many
    assumptions about where those backups are being stored and how they
    are being catalogued and managed; I don't believe I currently am
    knowledgeable enough to design something that would be good enough to
    meet core standards for inclusion, and I don't want to waste energy
    trying.  If someone else wants to try, that's OK with me, but I think
    it's probably better to let this be a thing that people experiment
    with outside of core for a while until we see what ends up being a
    winner.  I realize that this is a debatable position, but as I'm sure
    you realize by now, I have a strong desire to limit the scope of this
    project in such a way that I can get it done, 'cuz a bird in the hand
    is worth two in the bush.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  68. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-04-24T16:57:36Z

    Greetings,
    
    * Robert Haas (robertmhaas@gmail.com) wrote:
    > On Wed, Apr 24, 2019 at 9:28 AM Stephen Frost <sfrost@snowman.net> wrote:
    > > At least in part then it seems like we're viewing the level of effort
    > > around what I'm talking about quite differently, and I feel like that's
    > > largely because every time I mention parallel anything there's this
    > > assumption that I'm asking you to parallelize pg_basebackup or write a
    > > whole bunch more code to provide a fully optimized server-side parallel
    > > implementation for backups.  That really wasn't what I was going for.  I
    > > was thinking it would be a modest amount of additional work to add
    > > incremental backup via a few new commands, instead of through the
    > > BASE_BACKUP protocol command, that would make parallelization possible.
    > 
    > I'm not sure about that.  It doesn't seem crazy difficult, but there
    > are a few wrinkles.  One is that if the client is requesting files one
    > at a time, it's got to have a list of all the files that it needs to
    > request, and that means that it has to ask the server to make a
    > preparatory pass over the whole PGDATA directory to get a list of all
    > the files that exist.  That overhead is not otherwise needed.  Another
    > is that the list of files might be really large, and that means that
    > the client would either use a lot of memory to hold that great big
    > list, or need to deal with spilling the list to a spool file
    > someplace, or else have a server protocol that lets the list be
    > fetched incrementally in chunks.
    
    So, I had a thought about that when I was composing the last email and
    while I'm still unsure about it, maybe it'd be useful to mention it
    here- do we really need a list of every *file*, or could we reduce that
    down to a list of relations + forks for the main data directory, and
    then always include whatever other directories/files are appropriate?
    
    When it comes to operating in chunks, well, if we're getting a list of
    relations instead of files, we do have this thing called cursors..
    
    > A third is that, as you mention
    > further on, it means that the client has to care a lot more about
    > exactly how the server is figuring out which blocks have been
    > modified.  If it just says BASE_BACKUP ..., the server can be
    > internally reading each block and checking the LSN, or using
    > WAL-scanning or ptrack or whatever and the client doesn't need to know
    > or care.  But if the client is asking for a list of modified files or
    > blocks, then that presumes the information is available, and not too
    > expensively, without actually reading the files.
    
    I would think the client would be able to just ask for the list of
    modified files, when it comes to building up the list of files to ask
    for, which could potentially be done based on mtime instead of by WAL
    scanning or by scanning the files themselves.  Don't get me wrong, I'd
    prefer that we work based on the WAL, since I have more confidence in
    that, but certainly quite a few of the tools do work off mtime these
    days and while it's not perfect, the risk/reward there is pretty
    palatable to a lot of people.
    
    > Fourth, MAX_RATE
    > probably won't actually limit to the correct rate overall if the limit
    > is applied separately to each file.
    
    Sure, I hadn't been thinking about MAX_RATE and that would certainly
    complicate things if we're offering to provide MAX_RATE-type
    capabilities as part of this new set of commands.
    
    > I'd be afraid that a patch that tried to handle all that as part of
    > this project would get rejected on the grounds that it was trying to
    > solve too many unrelated problems.  Also, though not everybody has to
    > agree on what constitutes a "modest amount of additional work," I
    > would not describe solving all of those problems as a modest effort,
    > but rather a pretty substantial one.
    
    I suspect some of that's driven by how they get solved and if we decide
    we have to solve all of them.  With things like MAX_RATE + incremental
    backups, I wonder how that's going to end up working, when you have the
    option to apply the limit to the network, or to the disk I/O.  You might
    have addressed that elsewhere, I've not looked, and I'm not too
    particular about it personally either, but a definition could be "max
    rate at which we'll read the file you asked for on this connection" and
    that would be pretty straight-forward, I'd think.
    
    > > > Well, one thing you might want to do is have a tool that connects to
    > > > the server, enters backup mode, requests information on what blocks
    > > > have changed, copies those blocks via direct filesystem access, and
    > > > then exits backup mode.  Such a tool would really benefit from a
    > > > START_BACKUP / SEND_FILE_LIST / SEND_FILE_CONTENTS / STOP_BACKUP
    > > > command language, because it would just skip ever issuing the
    > > > SEND_FILE_CONTENTS command in favor of doing that part of the work via
    > > > other means.  On the other hand, a START_PARALLEL_BACKUP LSN '1/234'
    > > > command is useless to such a tool.
    > >
    > > That's true, but I hardly ever hear people talking about how wonderful
    > > it is that pgBackRest uses SSH to grab the data.  What I hear, often, is
    > > that people would really like backups to be done over the PG protocol on
    > > the same port that replication is done on.  A possible compromise is
    > > having a dedicated port for the backup agent to use, but it's definitely
    > > not the preference.
    > 
    > If you happen to be on the same system where the backup is running,
    > reading straight from the data directory might be a lot faster.
    
    Yes, that's certainly true.
    
    > > The comments that Anastasia had around the issues with being able to
    > > identify the full backup that goes with a given incremental backup, et
    > > al, certainly echoed some of my concerns regarding this part of the
    > > discussion.
    > >
    > > As for the concerns about trying to avoid corruption from starting up an
    > > invalid cluster, I didn't see much discussion about the idea of some
    > > kind of cross-check between pg_control and backup_label.  That was all
    > > very hand-wavy, so I'm not too surprised, but I don't think it's
    > > completely impossible to have something better than "well, if you just
    > > remove this one file, then you get a non-obviously corrupt cluster that
    > > you can happily start up".  I'll certainly accept that it requires more
    > > thought though and if we're willing to continue a discussion around
    > > that, great.
    > 
    > I think there are three different issues here that need to be
    > considered separately.
    > 
    > Issue #1: If you manually add files to your backup, remove files from
    > your backup, or change files in your backup, bad things will happen.
    > There is fundamentally nothing we can do to prevent this completely,
    > but it may be possible to make the system more resilient against
    > ham-handed modifications, at least to the extent of detecting them.
    > That's maybe a topic for another thread, but it's an interesting one:
    > Andres and I were brainstorming about it at some point.
    
    I'd certainly be interested in hearing about ways we can improve on
    that.  I'm alright with it being on another thread as it's a broader
    concern than just what we're talking about here.
    
    > Issue #2: You can only restore an LSN-based incremental backup
    > correctly if you have a base backup whose start-of-backup LSN is
    > greater than or equal to the threshold LSN used to take the
    > incremental backup.  If #1 is not in play, this is just a simple
    > cross-check at restoration time: retrieve the 'START WAL LOCATION'
    > from the prior backup's backup_label file and the threshold LSN for
    > the incremental backup from wherever you decide to store it and
    > compare them; if they do not have the right relationship, ERROR.  As
    > to whether #1 might end up in play here, anything's possible, but
    > wouldn't manually editing LSNs in backup metadata files be pretty
    > obviously a bad idea?  (Then again, I didn't really think the whole
    > backup_label thing was that confusing either, and obviously I was
    > wrong about that.  Still, editing a file requires a little more work
    > than removing it... you have to not only lie to the system, you have
    > to decide which lie to tell!)
    
    Yes, that'd certainly be at least one cross-check, but what if you've
    got an incremental backup based on a prior incremental backup that's
    based on a prior full, and you skip the incremental backup in between
    somehow?  Or are we just going to state outright that we don't support
    incremental-on-incremental (in which case, all backups would actually be
    either 'full' or 'differential' in the pgBackRest parlance, anyway, and
    that parlance comes from my recollection of how other tools describe the
    different backup types, but that was from many moons ago and might be
    entirely wrong)?
    
    > Issue #3: Even if you clearly understand the rule articulated in #2,
    > you might find it hard to follow in practice.  If you take a full
    > backup on Sunday and an incremental against Sunday's backup or against
    > the previous day's backup on each subsequent day, it's not really that
    > hard to understand.  But in more complex scenarios it could be hard to
    > get right.  For example, if you've been removing your backups when they
    > are a month old and then you start doing the same thing once you
    > add incrementals to the picture you might easily remove a full backup
    > upon which a newer incremental depends.  I see the need for good tools
    > to manage this kind of complexity, but have no plan as part of this
    > project to provide them.  I think that just requires too many
    > assumptions about where those backups are being stored and how they
    > are being catalogued and managed; I don't believe I currently am
    > knowledgeable enough to design something that would be good enough to
    > meet core standards for inclusion, and I don't want to waste energy
    > trying.  If someone else wants to try, that's OK with me, but I think
    > it's probably better to let this be a thing that people experiment
    > with outside of core for a while until we see what ends up being a
    > winner.  I realize that this is a debatable position, but as I'm sure
    > you realize by now, I have a strong desire to limit the scope of this
    > project in such a way that I can get it done, 'cuz a bird in the hand
    > is worth two in the bush.
    
    Even if what we're talking about here is really only "differentials", or
    backups where the incremental contains all the changes from a prior full
    backup, if the only check is "full LSN is greater than or equal to the
    incremental backup LSN", then you have a potential problem that's larger
    than just the incrementals no longer being valid because you removed the
    full backup on which they were taken- you might think that an *earlier*
    full backup is the one for a given incremental and perform a restore
    with the wrong full/incremental matchup and end up with a corrupted
    database.
    
    These are exactly the kind of issues that make me really wonder if this
    is the right natural progression for pg_basebackup or any backup tool to
    go in.  Maybe there are some additional things we can do to make it harder
    for someone to end up with a corrupted database when they restore, but
    it's really hard to get things like expiration correct.  We see users
    already ending up with problems because they don't manage expiration of
    their WAL correctly, and now we're adding another level of serious
    complication to the expiration requirements that, as we've seen even on
    this thread, some users are just not going to ever feel comfortable
    with doing on their own.
    
    Perhaps it's not relevant and I get that you want to build this cool
    incremental backup capability into pg_basebackup and I'm not going to
    stop you from doing it, but if I was going to build a backup tool,
    adding support for block-level incremental backup wouldn't be where I'd
    start, and, in fact, I might not even get to it even after investing
    over 5 years in the project and even after building in proper backup
    management.  The idea of implementing block-level incrementals while
    pushing the backup management, expiration, and dependency between
    incrementals and fulls on to the user to figure out just strikes me as
    entirely backwards and, frankly, to be gratuitously 'itch scratching' at
    the expense of what users really want and need here.
    
    One of the great things about pg_basebackup is its simplicity and
    ability to be a one-time "give me a snapshot of the database" and this
    is building in a complicated feature to it that *requires* users to
    build their own basic capabilities externally in order to be able to use
    it.  I've tried to avoid getting into that here and I won't go on about
    it, since it's your time to do with as you feel appropriate, but I do
    worry that it makes us, as a project, look a bit more cavalier about
    what users are asking for vs. what cool new thing we want to play with
    than I, at least, would like us to be (so, I'll caveat that with "in
    this area anyway", since I suspect saying this will probably come back
    to bite me in some other discussion later ;).
    
    Thanks,
    
    Stephen
    
  69. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-04-25T11:32:13Z

    On Wed, Apr 24, 2019 at 12:57 PM Stephen Frost <sfrost@snowman.net> wrote:
    > So, I had a thought about that when I was composing the last email and
    > while I'm still unsure about it, maybe it'd be useful to mention it
    > here- do we really need a list of every *file*, or could we reduce that
    > down to a list of relations + forks for the main data directory, and
    > then always include whatever other directories/files are appropriate?
    
    I'm not quite sure what the difference is here.  I agree that we could
    try to compact the list of file names by saying 16384 (24 segments)
    instead of 16384, 16384.1, ..., 16384.23, but I doubt that saves
    anything meaningful.  I don't see how we can leave anything out
    altogether.  If there's a filename called boaty.mcboatface in the
    server directory, I think we've got to back it up, and that won't
    happen unless the client knows that it is there, and it won't know
    unless we include it in a list.
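    
    To make the compaction idea concrete, here is a small C sketch of how
    the per-segment names in the example (16384, 16384.1, ..., 16384.23) can
    be derived from a relation file number and a segment count.  The helper
    name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the name of one segment of a relation file.  Segment 0 has no
 * suffix; later segments are named "<relfilenode>.<segno>".
 * Returns the value of snprintf().
 */
static int
segment_file_name(char *buf, size_t buflen,
                  unsigned int relfilenode, unsigned int segno)
{
    if (segno == 0)
        return snprintf(buf, buflen, "%u", relfilenode);
    return snprintf(buf, buflen, "%u.%u", relfilenode, segno);
}
```

    A client given "16384 (24 segments)" would loop segno from 0 to 23 to
    rebuild the full list, which is why the compact form carries the same
    information as the long one.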
    
    > When it comes to operating in chunks, well, if we're getting a list of
    > relations instead of files, we do have this thing called cursors..
    
    Sure... but they don't work for replication commands and I am
    definitely not volunteering to change that.
    
    > I would think the client would be able to just ask for the list of
    > modified files, when it comes to building up the list of files to ask
    > for, which could potentially be done based on mtime instead of by WAL
    > scanning or by scanning the files themselves.  Don't get me wrong, I'd
    > prefer that we work based on the WAL, since I have more confidence in
    > that, but certainly quite a few of the tools do work off mtime these
    > days and while it's not perfect, the risk/reward there is pretty
    > palatable to a lot of people.
    
    That approach, as with a few others that have been suggested, requires
    that the client have access to the previous backup, which makes me
    uninterested in implementing it.  I want a version of incremental
    backup where the client needs to know the LSN of the previous backup
    and nothing else.  That way, if you store your actual backups on a
    tape drive in an airless vault at the bottom of the Pacific Ocean, you
    can still take incremental backups against them, as long as you
    remember to note the LSNs before you ship the backups to the vault.
    Woohoo!  It also allows for the wire protocol to be very simple and
    the client to be very simple; neither of those things is essential,
    but both are nice.
    
    Also, I think using mtimes is just asking to get burned.  Yeah, almost
    nobody will, but an LSN-based approach is more granular (block level)
    and more reliable (can't be fooled by resetting a clock backward, or
    by a filesystem being careless with file metadata), so I think it
    makes sense to focus on getting that to work.  It's worth keeping in
    mind that there may be somewhat different expectations for an external
    tool vs. a core feature.  Stupid as it may sound, I think people using
    an external tool are more likely to do things like read the directions, and
    those directions can say things like "use a reasonable filesystem and
    don't set your clock backward."  When stuff goes into core, people
    assume that they should be able to run it on any filesystem on any
    hardware where they can get it to work and it should just work.  And
    you also get a lot more users, so even if the percentage of people not
    reading the directions were to stay constant, the actual number of
    such people will go up a lot. So picking what we seem to both agree to
    be the most robust way of detecting changes seems like the way to go
    from here.
    
    > I suspect some of that's driven by how they get solved and if we decide
    > we have to solve all of them.  With things like MAX_RATE + incremental
    > backups, I wonder how that's going to end up working, when you have the
    > option to apply the limit to the network, or to the disk I/O.  You might
    > have addressed that elsewhere, I've not looked, and I'm not too
    > particular about it personally either, but a definition could be "max
    > rate at which we'll read the file you asked for on this connection" and
    > that would be pretty straight-forward, I'd think.
    
    I mean, it's just so people can tell pg_basebackup what rate they want
    via a command-line option and have it happen like that.  They don't
    care about the rates for individual files.
    
    > > Issue #1: If you manually add files to your backup, remove files from
    > > your backup, or change files in your backup, bad things will happen.
    > > There is fundamentally nothing we can do to prevent this completely,
    > > but it may be possible to make the system more resilient against
    > > ham-handed modifications, at least to the extent of detecting them.
    > > That's maybe a topic for another thread, but it's an interesting one:
    > > Andres and I were brainstorming about it at some point.
    >
    > I'd certainly be interested in hearing about ways we can improve on
    > that.  I'm alright with it being on another thread as it's a broader
    > concern than just what we're talking about here.
    
    Might be a good topic to chat about at PGCon.
    
    > > Issue #2: You can only restore an LSN-based incremental backup
    > > correctly if you have a base backup whose start-of-backup LSN is
    > > greater than or equal to the threshold LSN used to take the
    > > incremental backup.  If #1 is not in play, this is just a simple
    > > cross-check at restoration time: retrieve the 'START WAL LOCATION'
    > > from the prior backup's backup_label file and the threshold LSN for
    > > the incremental backup from wherever you decide to store it and
    > > compare them; if they do not have the right relationship, ERROR.  As
    > > to whether #1 might end up in play here, anything's possible, but
    > > wouldn't manually editing LSNs in backup metadata files be pretty
    > > obviously a bad idea?  (Then again, I didn't really think the whole
    > > backup_label thing was that confusing either, and obviously I was
    > > wrong about that.  Still, editing a file requires a little more work
    > > than removing it... you have to not only lie to the system, you have
    > > to decide which lie to tell!)
    >
    > Yes, that'd certainly be at least one cross-check, but what if you've
    > got an incremental backup based on a prior incremental backup that's
    > based on a prior full, and you skip the incremental backup in between
    > somehow?  Or are we just going to state outright that we don't support
    > incremental-on-incremental (in which case, all backups would actually be
    > either 'full' or 'differential' in the pgBackRest parlance, anyway, and
    > that parlance comes from my recollection of how other tools describe the
    > different backup types, but that was from many moons ago and might be
    > entirely wrong)?
    
    I have every intention of supporting that case, just as I described in
    my original email, and the algorithm that I just described handles it.
    You just have to repeat the checks for every backup in the chain.   If
    you have a backup A, and a backup B intended as an incremental vs. A,
    and a backup C intended as an incremental vs. B, then the threshold
    LSN for C is presumably the starting LSN for B, and the threshold LSN
    for B is presumably the starting LSN for A.  If you try to restore
    A-B-C you'll check C vs. B and find that all is well and similarly for
    B vs. A.  If you try to restore A-C, you'll find out that A's start
    LSN precedes C's threshold LSN and error out.
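    
    The chain check can be sketched in C as follows.  The BackupInfo struct
    and the function name are hypothetical, and the sketch assumes that a
    full backup is recorded with a threshold LSN of 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct BackupInfo
{
    uint64_t    start_lsn;      /* start-of-backup LSN of this backup */
    uint64_t    threshold_lsn;  /* LSN it was taken against; 0 = full (assumption) */
} BackupInfo;

/*
 * chain[0] must be a full backup and each later entry an incremental.
 * Every link is checked with the same rule: the earlier backup's start
 * LSN must be greater than or equal to the later backup's threshold LSN.
 */
static bool
chain_is_restorable(const BackupInfo *chain, size_t n)
{
    size_t      i;

    if (n == 0 || chain[0].threshold_lsn != 0)
        return false;
    for (i = 1; i < n; i++)
    {
        if (chain[i - 1].start_lsn < chain[i].threshold_lsn)
            return false;       /* a backup in between is missing */
    }
    return true;
}
```

    With A = {100, 0}, B = {200, 100}, and C = {300, 200}, the chain A-B-C
    passes, while A-C fails because A's start LSN (100) precedes C's
    threshold LSN (200).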
    
    > Even if what we're talking about here is really only "differentials", or
    > backups where the incremental contains all the changes from a prior full
    > backup, if the only check is "full LSN is greater than or equal to the
    > incremental backup LSN", then you have a potential problem that's larger
    > than just the incrementals no longer being valid because you removed the
    > full backup on which they were taken- you might think that an *earlier*
    > full backup is the one for a given incremental and perform a restore
    > with the wrong full/incremental matchup and end up with a corrupted
    > database.
    
    No, the proposed check is explicitly designed to prevent that.  You'd
    get a restore failure (which is not great either, of course).
    
    > management.  The idea of implementing block-level incrementals while
    > pushing the backup management, expiration, and dependency between
    > incrementals and fulls on to the user to figure out just strikes me as
    > entirely backwards and, frankly, to be gratuitously 'itch scratching' at
    > the expense of what users really want and need here.
    
    Well, not everybody needs or wants the same thing.  I wouldn't be
    proposing it if my employer didn't think it was gonna solve a real
    problem...
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  70. Re: block-level incremental backup

    Anastasia Lubennikova <a.lubennikova@postgrespro.ru> — 2019-07-10T18:16:59Z

    23.04.2019 14:08, Anastasia Lubennikova wrote:
    > I'm volunteering to write a draft patch or, more likely, set of 
    > patches, which
    > will allow us to discuss the subject in more detail.
    > And to do that I wish we agree on the API and data format (at least 
    > broadly).
    > Looking forward to hearing your thoughts. 
    
    Though the previous discussion stalled,
    I still hope that we could agree on basic points such as a map file 
    format and protocol extension,
    which are necessary to start implementing the feature.
    
    --------- Proof Of Concept patch ---------
    
    Attached you can find a prototype of incremental pg_basebackup,
    which consists of 2 features:
    
    1) To perform an incremental backup, call pg_basebackup with a
    new argument:
    
    pg_basebackup -D 'basedir' --prev-backup-start-lsn 'lsn'
    
    where lsn is the start_lsn of the parent backup (it can be found in the
    "backup_label" file).
    
    It calls the BASE_BACKUP replication command with a new argument,
    PREV_BACKUP_START_LSN 'lsn'.
    
    For datafiles, only pages with LSN > prev_backup_start_lsn will be
    included in the backup.
    They are saved into a 'filename.partial' file, and a 'filename.blockmap'
    file contains an array of BlockNumbers.
    For example, if we backed up blocks 1, 3, and 5, filename.partial will
    contain 3 blocks, and 'filename.blockmap' will contain the array {1,3,5}.
    
    Non-datafiles use the same format as before.
    
    2) To merge an incremental backup into a full backup, call
    
    pg_basebackup -D 'basedir' --incremental-pgdata 'incremental_basedir' 
    --merge-backups
    
    It will move all files from 'incremental_basedir' to 'basedir' handling 
    '.partial' files correctly.
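    
    As an illustration of what handling a '.partial' file involves, here is
    a minimal in-memory sketch in C.  It is not taken from the attached
    patch: the function name is hypothetical, the block size is a parameter
    so the example stays small, and a relation that grew since the full
    backup (a block number past the end of the full file) is simply rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t BlockNumber;

/*
 * Overwrite blocks of a full relation file image with the blocks stored in
 * a .partial image.  blockmap[i] is the block number of the i-th block in
 * 'partial'.  Nothing is modified unless every block number is in range.
 */
static bool
apply_partial(unsigned char *full, size_t full_nblocks,
              const unsigned char *partial,
              const BlockNumber *blockmap, size_t nchanged,
              size_t blcksz)
{
    size_t      i;

    /* Validate the whole map first so a bad entry leaves 'full' untouched. */
    for (i = 0; i < nchanged; i++)
    {
        if (blockmap[i] >= full_nblocks)
            return false;
    }

    for (i = 0; i < nchanged; i++)
        memcpy(full + (size_t) blockmap[i] * blcksz,
               partial + i * blcksz,
               blcksz);
    return true;
}
```

    With the {1,3,5} example above, the three blocks in filename.partial
    replace blocks 1, 3, and 5 of the full file, and all other blocks are
    left as they were.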
    
    
    --------- Questions to discuss ---------
    
    Please note that it is just a proof-of-concept patch and it can be 
    optimized in many ways.
    Let's concentrate on issues that affect the protocol or data format.
    
    1) Whether we collect block maps using a simple "read everything page by
    page" approach, WAL scanning, or any other page tracking algorithm, we
    must choose a map format.
    I implemented the simplest one, but there are more ideas:
    
    - We can have a map not per file, but per relation or maybe per tablespace,
    which will make the implementation more complex, but probably more optimal.
    The only problem I see with the existing implementation is that even if
    only a few blocks changed,
    we still must pad the map to 512 bytes per tar format requirements.
    
    - We can save LSNs into the block map.
    
    typedef struct BlockMapItem {
         BlockNumber blkno;
         XLogRecPtr lsn;
    } BlockMapItem;
    
    In my implementation, an invalid prev_backup_start_lsn means falling back
    to a regular base backup
    without any block maps. Alternatively, we can define another meaning for
    this value and send a block map for all files.
    Backup utilities can use these maps to speed up backup merge or restore.
    
    2) We can implement a BASE_BACKUP SEND_FILELIST replication command,
    which will return a list of filenames with file sizes and block maps if
    an lsn was provided.
    
    To avoid changing the format, we can simply send tar headers for each file:
    - tarHeader("filename.blockmap") followed by the blockmap for relation files
    if prev_backup_start_lsn is provided;
    - tarHeader("filename") without actual file content for non-relation
    files or for all files in a "FULL" backup
    
    The caller can parse messages and use them for any purpose, for example, 
    to perform a parallel backup.
    
    Thoughts?
    
    -- 
    Anastasia Lubennikova
    Postgres Professional: http://www.postgrespro.com
    The Russian Postgres Company
    
    
  71. Re: block-level incremental backup

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2019-07-11T11:30:22Z

    Hi Anastasia,
    
    On Wed, Jul 10, 2019 at 11:47 PM Anastasia Lubennikova <
    a.lubennikova@postgrespro.ru> wrote:
    
    > 23.04.2019 14:08, Anastasia Lubennikova wrote:
    > > I'm volunteering to write a draft patch or, more likely, set of
    > > patches, which
    > > will allow us to discuss the subject in more detail.
    > > And to do that I wish we agree on the API and data format (at least
    > > broadly).
    > > Looking forward to hearing your thoughts.
    >
    > Though the previous discussion stalled,
    > I still hope that we could agree on basic points such as a map file
    > format and protocol extension,
    > which is necessary to start implementing the feature.
    >
    
    It's great that you too have come up with a PoC patch. I haven't looked at
    your changes in much detail, but we at EnterpriseDB are also working on
    this feature and have started implementing it.
    
    Attached is the series of patches I have so far (which still needs further
    optimization and adjustments).
    
    Here is the overall design (as proposed by Robert) we are trying to
    implement:
    
    1. Extend the BASE_BACKUP command that can be used with replication
    connections. Add a new [ LSN 'lsn' ] option.
    
    2. Extend pg_basebackup with a new --lsn=LSN option that causes it to send
    the option added to the server in #1.
    
    Here are the implementation details when we have a valid LSN:
    
    sendFile() in basebackup.c is the function that does most of the work for
    us. If the filename looks like a relation file, then we'll need to consider
    sending only a partial file. The way to do that is probably:
    
    A. Read the whole file into memory.
    
    B. Check the LSN of each block. Build a bitmap indicating which blocks have
    an LSN greater than or equal to the threshold LSN.
    
    C. If more than 90% of the bits in the bitmap are set, send the whole file
    just as if this were a full backup. This 90% is a constant now; we might
    make it a GUC later.
    
    D. Otherwise, send a file with .partial added to the name. The .partial
    file contains an indication of which blocks were changed at the beginning,
    followed by the data blocks. It also includes a checksum/CRC.
    Currently, a .partial file format looks like:
     - start with a 4-byte magic number
     - then store a 4-byte CRC covering the header
     - then a 4-byte count of the number of blocks included in the file
     - then the block numbers, each as a 4-byte quantity
     - then the data blocks
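
    As a rough illustration of steps C and D and the layout listed above, here
    is a minimal Python sketch. It is not taken from the attached patches: the
    magic value, the little-endian byte order, the use of zlib's CRC-32, and the
    choice to have the CRC cover the magic, the count and the block numbers are
    all assumptions made for the example, as are the function names.

```python
import struct
import zlib

BLCKSZ = 8192                  # PostgreSQL's default block size
PARTIAL_MAGIC = 0x50415254     # hypothetical magic value, not from the patch
WHOLE_FILE_PERCENT = 90        # the constant from step C


def should_send_whole_file(changed_blocks, total_blocks):
    """Step C: send the whole file when more than 90% of its blocks changed."""
    return len(changed_blocks) * 100 > WHOLE_FILE_PERCENT * total_blocks


def build_partial(changed_blocks, read_block):
    """Step D: serialize magic, CRC, block count, block numbers, block data.

    read_block(blkno) must return exactly BLCKSZ bytes for that block.
    Assumption: the CRC covers the magic, the count and the block numbers.
    """
    blocks = sorted(changed_blocks)
    magic = struct.pack("<I", PARTIAL_MAGIC)
    body = struct.pack("<I", len(blocks))
    body += struct.pack(f"<{len(blocks)}I", *blocks)
    crc = zlib.crc32(magic + body) & 0xFFFFFFFF
    header = magic + struct.pack("<I", crc) + body
    return header + b"".join(read_block(b) for b in blocks)
```

    With this layout the header is 12 + 4 * N bytes long for N blocks, so a
    reader can locate any data block from the header alone.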
    
    
    We are also working on combining these incremental backups with the full
    backup, and for that we are planning to add a new utility called
    pg_combinebackup. I will post the details on that later, once we are on the
    same page about taking the backup.
    
    Thanks
    -- 
    Jeevan Chalke
    Technical Architect, Product Development
    EnterpriseDB Corporation
    
  72. Re: block-level incremental backup

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2019-07-17T05:21:51Z

    On Thu, Jul 11, 2019 at 5:00 PM Jeevan Chalke <
    jeevan.chalke@enterprisedb.com> wrote:
    
    > Hi Anastasia,
    >
    > On Wed, Jul 10, 2019 at 11:47 PM Anastasia Lubennikova <
    > a.lubennikova@postgrespro.ru> wrote:
    >
    >> 23.04.2019 14:08, Anastasia Lubennikova wrote:
    >> > I'm volunteering to write a draft patch or, more likely, set of
    >> > patches, which
    >> > will allow us to discuss the subject in more detail.
    >> > And to do that I wish we agree on the API and data format (at least
    >> > broadly).
    >> > Looking forward to hearing your thoughts.
    >>
    >> Though the previous discussion stalled,
    >> I still hope that we could agree on basic points such as a map file
    >> format and protocol extension,
    >> which is necessary to start implementing the feature.
    >>
    >
    > It's great that you too come up with the PoC patch. I didn't look at your
    > changes in much details but we at EnterpriseDB too working on this feature
    > and started implementing it.
    >
    > Attached series of patches I had so far... (which needed further
    > optimization and adjustments though)
    >
    > Here is the overall design (as proposed by Robert) we are trying to
    > implement:
    >
    > 1. Extend the BASE_BACKUP command that can be used with replication
    > connections. Add a new [ LSN 'lsn' ] option.
    >
    > 2. Extend pg_basebackup with a new --lsn=LSN option that causes it to send
    > the option added to the server in #1.
    >
    > Here are the implementation details when we have a valid LSN
    >
    > sendFile() in basebackup.c is the function which mostly does the thing for
    > us. If the filename looks like a relation file, then we'll need to consider
    > sending only a partial file. The way to do that is probably:
    >
    > A. Read the whole file into memory.
    >
    > B. Check the LSN of each block. Build a bitmap indicating which blocks
    > have an LSN greater than or equal to the threshold LSN.
    >
    > C. If more than 90% of the bits in the bitmap are set, send the whole file
    > just as if this were a full backup. This 90% is a constant now; we might
    > make it a GUC later.
    >
    > D. Otherwise, send a file with .partial added to the name. The .partial
    > file contains an indication of which blocks were changed at the beginning,
    > followed by the data blocks. It also includes a checksum/CRC.
    > Currently, a .partial file format looks like:
    >  - start with a 4-byte magic number
    >  - then store a 4-byte CRC covering the header
    >  - then a 4-byte count of the number of blocks included in the file
    >  - then the block numbers, each as a 4-byte quantity
    >  - then the data blocks
    >
    >
    > We are also working on combining these incremental back-ups with the full
    > backup and for that, we are planning to add a new utility called
    > pg_combinebackup. Will post the details on that later once we have on the
    > same page for taking backup.
    >
    
    For combining a full backup with one or more incremental backups, we are
    adding a new utility called pg_combinebackup in src/bin.
    
    Here is the overall design as proposed by Robert.
    
    pg_combinebackup starts from the LAST backup specified and works backward. It
    must NOT start with the full backup and work forward. This is important both
    for reasons of efficiency and of correctness. For example, if you start by
    copying over the full backup and then later apply the incremental backups on
    top of it, then you'll copy data and later end up overwriting it or removing
    it. Any files that are left over at the end that aren't in the final
    incremental backup, even as .partial files, need to be removed, or the result
    is wrong. We should aim for a system where every block in the output
    directory is written exactly once and nothing ever has to be created and
    then removed.
    
    To make that work, we should start by examining the final incremental
    backup. We should proceed with one file at a time. For each file:
    
    1. If the complete file is present in the incremental backup, then just
    copy it to the output directory and move on to the next file.
    
    2. Otherwise, we have a .partial file. Work backward through the backup
    chain until we find a complete version of the file. That might happen when
    we get back to the full backup at the start of the chain, but it might also
    happen sooner, at which point we do not need to and should not look at
    earlier backups for that file. During this phase, we should read only the
    HEADER of each .partial file, building a map of which blocks we're
    ultimately going to need to read from each backup. We can also compute the
    offset within each file where that block is stored at this stage, again
    using the header information.
    
    3. Now, we can write the output file, reading each block in turn from the
    correct backup and writing it to the output file, using the map we
    constructed in the previous step. We should probably keep all of the input
    files open over steps 2 and 3 and then close them at the end, because
    repeatedly closing and opening them is going to be expensive. When that's
    done, go on to the next file and start over at step 1.
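
    Step 2 above can be sketched roughly as follows in Python. This is only an
    illustration under stated assumptions, not the planned implementation: the
    chain is modelled in memory rather than read from disk, the header is
    assumed to be three 4-byte fields followed by 4-byte block numbers, and the
    name plan_file and the tuple shapes are invented for the example. File
    truncation between backups is not handled here.

```python
BLCKSZ = 8192
HEADER_FIXED = 12  # magic + CRC + block count, 4 bytes each (assumed layout)


def plan_file(chain):
    """Build {blkno: (backup_index, byte_offset)} for one relation file.

    chain lists that file's entry in each backup, oldest first. An entry is
    ("full", nblocks) for a complete copy, or ("partial", [blkno, ...]) for
    the block numbers found in a .partial header.
    """
    plan = {}
    for idx in range(len(chain) - 1, -1, -1):    # newest backup first
        kind, info = chain[idx]
        if kind == "partial":
            data_start = HEADER_FIXED + 4 * len(info)
            for pos, blkno in enumerate(info):
                # If a newer backup already supplied this block, keep that one.
                plan.setdefault(blkno, (idx, data_start + pos * BLCKSZ))
        else:
            for blkno in range(info):
                plan.setdefault(blkno, (idx, blkno * BLCKSZ))
            return plan    # complete copy found: ignore all older backups
    raise ValueError("backup chain has no complete copy of this file")
```

    Step 3 would then walk the plan in ascending block-number order, so that
    each output block is read from exactly one backup and written exactly once.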
    
    
    We have already started working on this design.
    
    -- 
    Jeevan Chalke
    Technical Architect, Product Development
    EnterpriseDB Corporation
    
  73. Re: block-level incremental backup

    Ibrar Ahmed <ibrar.ahmad@gmail.com> — 2019-07-17T08:44:44Z

    On Wed, Jul 17, 2019 at 10:22 AM Jeevan Chalke <
    jeevan.chalke@enterprisedb.com> wrote:
    
    >
    >
    > On Thu, Jul 11, 2019 at 5:00 PM Jeevan Chalke <
    > jeevan.chalke@enterprisedb.com> wrote:
    >
    >> Hi Anastasia,
    >>
    >> On Wed, Jul 10, 2019 at 11:47 PM Anastasia Lubennikova <
    >> a.lubennikova@postgrespro.ru> wrote:
    >>
    >>> 23.04.2019 14:08, Anastasia Lubennikova wrote:
    >>> > I'm volunteering to write a draft patch or, more likely, set of
    >>> > patches, which
    >>> > will allow us to discuss the subject in more detail.
    >>> > And to do that I wish we agree on the API and data format (at least
    >>> > broadly).
    >>> > Looking forward to hearing your thoughts.
    >>>
    >>> Though the previous discussion stalled,
    >>> I still hope that we could agree on basic points such as a map file
    >>> format and protocol extension,
    >>> which is necessary to start implementing the feature.
    >>>
    >>
    >> It's great that you too come up with the PoC patch. I didn't look at your
    >> changes in much details but we at EnterpriseDB too working on this feature
    >> and started implementing it.
    >>
    >> Attached series of patches I had so far... (which needed further
    >> optimization and adjustments though)
    >>
    >> Here is the overall design (as proposed by Robert) we are trying to
    >> implement:
    >>
    >> 1. Extend the BASE_BACKUP command that can be used with replication
    >> connections. Add a new [ LSN 'lsn' ] option.
    >>
    >> 2. Extend pg_basebackup with a new --lsn=LSN option that causes it to
    >> send the option added to the server in #1.
    >>
    >> Here are the implementation details when we have a valid LSN
    >>
    >> sendFile() in basebackup.c is the function which mostly does the thing
    >> for us. If the filename looks like a relation file, then we'll need to
    >> consider sending only a partial file. The way to do that is probably:
    >>
    >> A. Read the whole file into memory.
    >>
    >> B. Check the LSN of each block. Build a bitmap indicating which blocks
    >> have an LSN greater than or equal to the threshold LSN.
    >>
    >> C. If more than 90% of the bits in the bitmap are set, send the whole
    >> file just as if this were a full backup. This 90% is a constant now; we
    >> might make it a GUC later.
    >>
    >> D. Otherwise, send a file with .partial added to the name. The .partial
    >> file contains an indication of which blocks were changed at the beginning,
    >> followed by the data blocks. It also includes a checksum/CRC.
    >> Currently, a .partial file format looks like:
    >>  - start with a 4-byte magic number
    >>  - then store a 4-byte CRC covering the header
    >>  - then a 4-byte count of the number of blocks included in the file
    >>  - then the block numbers, each as a 4-byte quantity
    >>  - then the data blocks
    >>
    >>
    >> We are also working on combining these incremental back-ups with the full
    >> backup and for that, we are planning to add a new utility called
    >> pg_combinebackup. Will post the details on that later once we have on the
    >> same page for taking backup.
    >>
    >
    > For combining a full backup with one or more incremental backup, we are
    > adding
    > a new utility called pg_combinebackup in src/bin.
    >
    > Here is the overall design as proposed by Robert.
    >
    > pg_combinebackup starts from the LAST backup specified and work backward.
    > It
    > must NOT start with the full backup and work forward. This is important
    > both
    > for reasons of efficiency and of correctness. For example, if you start by
    > copying over the full backup and then later apply the incremental backups
    > on
    > top of it then you'll copy data and later end up overwriting it or removing
    > it. Any files that are leftover at the end that aren't in the final
    > incremental backup even as .partial files need to be removed, or the
    > result is
    > wrong. We should aim for a system where every block in the output
    > directory is
    > written exactly once and nothing ever has to be created and then removed.
    >
    > To make that work, we should start by examining the final incremental
    > backup.
    > We should proceed with one file at a time. For each file:
    >
    > 1. If the complete file is present in the incremental backup, then just
    > copy it
    > to the output directory - and move on to the next file.
    >
    > 2. Otherwise, we have a .partial file. Work backward through the backup
    > chain
    > until we find a complete version of the file. That might happen when we get
    > \back to the full backup at the start of the chain, but it might also
    > happen
    > sooner - at which point we do not need to and should not look at earlier
    > backups for that file. During this phase, we should read only the HEADER of
    > each .partial file, building a map of which blocks we're ultimately going
    > to
    > need to read from each backup. We can also compute the offset within each
    > file
    > where that block is stored at this stage, again using the header
    > information.
    >
    > 3. Now, we can write the output file - reading each block in turn from the
    > correct backup and writing it to the write output file, using the map we
    > constructed in the previous step. We should probably keep all of the input
    > files open over steps 2 and 3 and then close them at the end because
    > repeatedly closing and opening them is going to be expensive. When that's
    > done,
    > go on to the next file and start over at step 1.
    >
    >
    At what stage will you apply the WAL generated between the START/STOP
    backup?
    
    
    > We are already started working on this design.
    >
    > --
    > Jeevan Chalke
    > Technical Architect, Product Development
    > EnterpriseDB Corporation
    >
    >
    
    -- 
    Ibrar Ahmed
    
  74. Re: block-level incremental backup

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2019-07-17T13:43:36Z

    On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
    
    >
    > At what stage you will apply the WAL generated in between the START/STOP
    > backup.
    >
    
    In this design, we are not touching any WAL-related code. The WAL files will
    get copied with each backup, either full or incremental. Thus, the last
    incremental backup will have the final WAL files, which will be copied as-is
    into the combined full backup and will get applied automatically if that
    data directory is used to start the server.
    
    
    > --
    > Ibrar Ahmed
    >
    
    -- 
    Jeevan Chalke
    Technical Architect, Product Development
    EnterpriseDB Corporation
    
  75. Re: block-level incremental backup

    Ibrar Ahmed <ibrar.ahmad@gmail.com> — 2019-07-17T14:08:07Z

    On Wed, Jul 17, 2019 at 6:43 PM Jeevan Chalke <
    jeevan.chalke@enterprisedb.com> wrote:
    
    > On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
    >
    >>
    >> At what stage you will apply the WAL generated in between the START/STOP
    >> backup.
    >>
    >
    > In this design, we are not touching any WAL related code. The WAL files
    > will
    > get copied with each backup either full or incremental. And thus, the last
    > incremental backup will have the final WAL files which will be copied as-is
    > in the combined full-backup and they will get apply automatically if that
    > the data directory is used to start the server.
    >
    
    Ok, so you keep all the WAL files since the first backup, right?
    
    >
    >
    >> --
    >> Ibrar Ahmed
    >>
    >
    > --
    > Jeevan Chalke
    > Technical Architect, Product Development
    > EnterpriseDB Corporation
    >
    >
    
    -- 
    Ibrar Ahmed
    
  76. Re: block-level incremental backup

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2019-07-17T14:41:53Z

    On Wed, Jul 17, 2019 at 7:38 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
    
    >
    >
    > On Wed, Jul 17, 2019 at 6:43 PM Jeevan Chalke <
    > jeevan.chalke@enterprisedb.com> wrote:
    >
    >> On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed <ibrar.ahmad@gmail.com>
    >> wrote:
    >>
    >>>
    >>> At what stage you will apply the WAL generated in between the START/STOP
    >>> backup.
    >>>
    >>
    >> In this design, we are not touching any WAL related code. The WAL files
    >> will
    >> get copied with each backup either full or incremental. And thus, the last
    >> incremental backup will have the final WAL files which will be copied
    >> as-is
    >> in the combined full-backup and they will get apply automatically if that
    >> the data directory is used to start the server.
    >>
    >
    > Ok, so you keep all the WAL files since the first backup, right?
    >
    
    The WAL files will be copied anyway while taking a backup (full or
    incremental), but only the last incremental backup's WAL files are copied to
    the combined synthetic full backup.
    
    
    >>
    >>> --
    >>> Ibrar Ahmed
    >>>
    >>
    >> --
    >> Jeevan Chalke
    >> Technical Architect, Product Development
    >> EnterpriseDB Corporation
    >>
    >>
    >
    > --
    > Ibrar Ahmed
    >
    
    
    -- 
    Jeevan Chalke
    Technical Architect, Product Development
    EnterpriseDB Corporation
    
  77. Re: block-level incremental backup

    vignesh C <vignesh21@gmail.com> — 2019-07-20T17:52:34Z

    Hi Jeevan,
    
    The idea is very nice.
    When insert/update/delete and truncate/drop happen in various
    combinations, how does the incremental backup handle the copying of the
    blocks?
    
    
    
    
    --
    Regards,
    vignesh
    EnterpriseDB: http://www.enterprisedb.com
    
    
    
    
  78. Re: block-level incremental backup

    Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> — 2019-07-23T17:48:50Z

    Hi Vignesh,
    
    This backup technology extends pg_basebackup itself, which means we can
    still take online backups. Internally this is done using pg_start_backup and
    pg_stop_backup. pg_start_backup performs a checkpoint, and this checkpoint is
    used in the recovery process when starting the cluster from a backup image.
    The only thing incremental backup changes (compared to traditional
    pg_basebackup) is this: after doing the checkpoint, instead of copying the
    entire relation files, it takes an input LSN, scans all the blocks in all
    relation files, and stores the blocks having LSN >= InputLSN. This means it
    considers all the changes that are already written into relation files,
    including insert/update/delete etc., up to the checkpoint performed by
    pg_start_backup internally, and as Jeevan Chalke mentioned upthread, the
    incremental backup will also contain a copy of the WAL files. Once this
    incremental backup is combined with the parent backup by means of the new
    combine process (which will be introduced as part of this feature itself),
    the result should ideally look like a full pg_basebackup. Note that any
    changes made by these insert/delete/update operations while the incremental
    backup was being taken will still be available via the WAL files and, as in
    the normal restore process, will be replayed from the checkpoint onwards up
    to a consistent point.
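
    The block scan described above (keep blocks having LSN >= InputLSN) could
    look roughly like this in Python. It is a sketch under assumptions, not the
    patch's code: it relies on each 8 kB page starting with pd_lsn stored as two
    32-bit halves (high half first), assumes a little-endian server, and treats
    the file as already read into memory. The function names are made up for
    the example, and special cases such as all-zero pages are not considered.

```python
import struct

BLCKSZ = 8192


def parse_lsn(text):
    """Convert the textual 'hi/lo' hexadecimal LSN form to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def page_lsn(page, byteorder="<"):
    """Read pd_lsn from the first 8 bytes of a page (two 32-bit halves)."""
    xlogid, xrecoff = struct.unpack_from(byteorder + "II", page, 0)
    return (xlogid << 32) | xrecoff


def changed_blocks(data, threshold_lsn):
    """Return the block numbers whose page LSN is >= threshold_lsn."""
    return [
        blkno
        for blkno in range(len(data) // BLCKSZ)
        if page_lsn(data[blkno * BLCKSZ:(blkno + 1) * BLCKSZ]) >= threshold_lsn
    ]
```

    Changes made after the scan has passed a block are not captured by it; as
    the message notes, those are recovered by WAL replay at restore time.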
    
    My two cents!
    
    Regards,
    Jeevan Ladhe
    
    
  79. Re: block-level incremental backup

    vignesh C <vignesh21@gmail.com> — 2019-07-24T04:03:34Z

    Thanks Jeevan.
    
    1) If a relation file has changed due to truncate or vacuum,
        the new files will be copied during incremental backup.
        There is a chance that both the old file and the new file
        will be present. I'm not sure whether cleaning up of the
        old file is handled.
    2) Just a small thought on building the bitmap:
        can the bitmap be built and maintained as
        and when the changes happen in the system?
        If we build the bitmap while doing the incremental backup,
        scanning through each file might take more time.
        This can be a configurable parameter: the system can run
        without capturing this information by default, but users
        who take incremental backups frequently can enable this
        configuration, which would track the modified blocks.
    
        What are your thoughts on this?
    -- 
    Regards,
    Vignesh
    EnterpriseDB: http://www.enterprisedb.com
    
    
    
    
    
    
  80. Re: block-level incremental backup

    Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> — 2019-07-26T05:51:43Z

    Hi Vignesh,
    
    Please find my comments inline below:
    
    > 1) If relation file has changed due to truncate or vacuum.
    >     During incremental backup the new files will be copied.
    >     There are chances that both the old  file and new file
    >     will be present. I'm not sure if cleaning up of the
    >     old file is handled.
    >
    
    When an incremental backup is taken, it either copies the file in its
    entirety if more than 90% of the file has changed, or writes a .partial file
    with the changed-blocks bitmap and the actual data. For files that are
    unchanged, it still creates a .partial file, but writes 0 bytes to it. This
    means there is a .partial file for every file that is to be looked up in the
    full backup. While composing a synthetic backup from an incremental backup,
    the pg_combinebackup tool will only look in the full (parent) backup for
    those relation files that have .partial files in the incremental backup. So,
    if a vacuum/truncate happened between the full and incremental backups, the
    incremental backup image will not have a 0-length .partial file for that
    relation, and so the synthetic backup that is restored using pg_combinebackup
    will not have that file either.
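
    The file-selection rule described above can be illustrated with a small
    Python sketch. It assumes a two-level chain (one full parent backup and one
    incremental backup) and works on plain lists of file names; the function
    name and the labels it returns are invented for the example and are not
    part of pg_combinebackup.

```python
PARTIAL_SUFFIX = ".partial"


def resolve_sources(incremental_files, parent_files):
    """Map each output file name to where its content comes from.

    Only names present in the incremental backup, whole or as .partial,
    reach the output; parent-only files are dropped.
    """
    parent = set(parent_files)
    sources = {}
    for name in incremental_files:
        if name.endswith(PARTIAL_SUFFIX):
            base = name[: -len(PARTIAL_SUFFIX)]
            if base not in parent:
                raise ValueError(f"{name} has no base file in the parent backup")
            # Base blocks come from the parent, changed blocks from .partial.
            sources[base] = "parent+partial"
        else:
            # A complete copy in the incremental backup is used as-is.
            sources[name] = "incremental"
    return sources
```

    A relation file that exists only in the parent backup never appears in the
    result, which is how a file removed between the two backups is left out of
    the synthetic backup.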
    
    
    > 2) Just a small thought on building the bitmap,
    >     can the bitmap be built and maintained as
    >     and when the changes are happening in the system.
    >     If we are building the bitmap while doing the incremental backup,
    >     Scanning through each file might take more time.
    >     This can be a configurable parameter, the system can run
    >     without capturing this information by default, but if there are some
    >     of them who will be taking incremental backup frequently this
    >     configuration can be enabled which should track the modified blocks.
    
    
    IIUC, this will need changes in the backend. Honestly, I think backup is a
    maintenance task, and hampering the backend for this does not look like a
    good idea. Having said that, even if we have to provide this as a switch for
    some users, it will need different infrastructure from what we are building
    here for constructing the bitmap, where we scan all the files one by one.
    Maybe for the initial version we can go with the current proposal that Robert
    has suggested, and add this switch at a later point as an enhancement.
    - My thoughts.
    
    Regards,
    Jeevan Ladhe
    
  81. Re: block-level incremental backup

    vignesh C <vignesh21@gmail.com> — 2019-07-26T07:53:57Z

    On Fri, Jul 26, 2019 at 11:21 AM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com>
    wrote:
    
    > Hi Vignesh,
    >
    > Please find my comments inline below:
    >
    > 1) If relation file has changed due to truncate or vacuum.
    >>     During incremental backup the new files will be copied.
    >>     There are chances that both the old  file and new file
    >>     will be present. I'm not sure if cleaning up of the
    >>     old file is handled.
    >>
    >
    > When an incremental backup is taken it either copies the file in its
    > entirety if
    > a file is changed more than 90%, or writes .partial with changed blocks
    > bitmap
    > and actual data. For the files that are unchanged, it writes 0 bytes and
    > still
    > creates a .partial file for unchanged files too. This means there is a
    > .partitial
    > file for all the files that are to be looked up in full backup.
    > While composing a synthetic backup from incremental backup the
    > pg_combinebackup
    > tool will only look for those relation files in full(parent) backup which
    > are
    > having .partial files in the incremental backup. So, if vacuum/truncate
    > happened
    > between full and incremental backup, then the incremental backup image
    > will not
    > have a 0-length .partial file for that relation, and so the synthetic
    > backup
    > that is restored using pg_combinebackup will not have that file as well.
    >
    >
    Thanks Jeevan for the update, I feel this logic is good.
    It will handle the case of deleting the old relation files.
    
    >
    >
    >> 2) Just a small thought on building the bitmap,
    >>     can the bitmap be built and maintained as
    >>     and when the changes are happening in the system.
    >>     If we are building the bitmap while doing the incremental backup,
    >>     Scanning through each file might take more time.
    >>     This can be a configurable parameter, the system can run
    >>     without capturing this information by default, but if there are some
    >>     of them who will be taking incremental backup frequently this
    >>     configuration can be enabled which should track the modified blocks.
    >
    >
    > IIUC, this will need changes in the backend. Honestly, I think backup is a
    > maintenance task and hampering the backend for this does not look like a
    > good
    > idea. But, having said that even if we have to provide this as a switch
    > for some
    > of the users, it will need a different infrastructure than what we are
    > building
    > here for constructing bitmap, where we scan all the files one by one.
    > Maybe for
    > the initial version, we can go with the current proposal that Robert has
    > suggested,
    > and add this switch at a later point as an enhancement.
    >
    >
    That sounds fair to me.
    
    
    Regards,
    vignesh
    EnterpriseDB: http://www.enterprisedb.com
    
  82. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-07-29T20:28:03Z

    On Wed, Jul 10, 2019 at 2:17 PM Anastasia Lubennikova
    <a.lubennikova@postgrespro.ru> wrote:
    > In attachments, you can find a prototype of incremental pg_basebackup,
    > which consists of 2 features:
    >
    > 1) To perform incremental backup one should call pg_basebackup with a
    > new argument:
    >
    > pg_basebackup -D 'basedir' --prev-backup-start-lsn 'lsn'
    >
    > where lsn is a start_lsn of parent backup (can be found in
    > "backup_label" file)
    >
    > It calls BASE_BACKUP replication command with a new argument
    > PREV_BACKUP_START_LSN 'lsn'.
    >
    > For datafiles, only pages with LSN > prev_backup_start_lsn will be
    > included in the backup.
    > They are saved into 'filename.partial' file, 'filename.blockmap' file
    > contains an array of BlockNumbers.
    > For example, if we backuped blocks 1,3,5, filename.partial will contain
    > 3 blocks, and 'filename.blockmap' will contain array {1,3,5}.
    
    I think it's better to keep both the information about changed blocks
    and the contents of the changed blocks in a single file.  The list of
    changed blocks is probably quite short, and I don't really want to
    double the number of files in the backup if there's no real need. I
    suspect it's just overall a bit simpler to keep everything together.
    I don't think this is a make-or-break thing, and welcome contrary
    arguments, but that's my preference.
    
    > 2) To merge incremental backup into a full backup call
    >
    > pg_basebackup -D 'basedir' --incremental-pgdata 'incremental_basedir'
    > --merge-backups
    >
    > It will move all files from 'incremental_basedir' to 'basedir' handling
    > '.partial' files correctly.
    
    This, to me, looks like it's much worse than the design that I
    proposed originally.  It means that:
    
    1. You can't take an incremental backup without having the full backup
    available at the time you want to take the incremental backup.
    
    2. You're always storing a full backup, which means that you need more
    disk space, and potentially much more I/O while taking the backup.
    You save on transfer bandwidth, but you add a lot of disk reads and
    writes, costs which have to be paid even if the backup is never
    restored.
    
    > 1) Whether we collect block maps using simple "read everything page by
    > page" approach
    > or WAL scanning or any other page tracking algorithm, we must choose a
    > map format.
    > I implemented the simplest one, while there are more ideas:
    
    I think we should start simple.
    
    I haven't had a chance to look at Jeevan's patch at all, or yours in
    any detail, as yet, so these are just some very preliminary comments.
    It will be good, however, if we can agree on who is going to do what
    part of this as we try to drive this forward together.  I'm sorry that
    I didn't communicate EDB's plans to work on this more clearly;
    duplicated effort serves nobody well.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  83. Re: block-level incremental backup

    Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> — 2019-07-30T01:39:06Z

    Hi Jeevan
    
    
    I reviewed the first two patches,
    0001-Add-support-for-command-line-option-to-pass-LSN.patch and
    0002-Add-TAP-test-to-test-LSN-option.patch, from the set of incremental
    backup patches, and the changes look good to me.

    I had some concerns about the way we are working around the fact that
    pg_lsn_in() accepts an LSN of 0 as valid, which I think is itself
    contradictory to the definition of InvalidXLogRecPtr. I have started a
    separate new thread[1] for the same.
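
    To make the concern concrete, here is a minimal sketch of parsing an
    "X/Y" LSN string and then separately rejecting the value 0. It is not
    PostgreSQL's pg_lsn_in(); the names parse_hex32, parse_lsn,
    parse_threshold_lsn and INVALID_LSN are hypothetical, and the only fact
    taken from the discussion is that 0 is the invalid LSN.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for InvalidXLogRecPtr, which is 0. */
#define INVALID_LSN ((uint64_t) 0)

/* Parse 1 to 8 hex digits at *p, advancing *p past them. */
static bool
parse_hex32(const char **p, uint32_t *out)
{
    uint32_t value = 0;
    int ndigits = 0;

    while (ndigits < 8 && isxdigit((unsigned char) **p))
    {
        int c = tolower((unsigned char) **p);

        value = (value << 4) | (uint32_t) (isdigit(c) ? c - '0' : c - 'a' + 10);
        (*p)++;
        ndigits++;
    }
    *out = value;
    return ndigits > 0;
}

/* Parse "hi/lo" (both hex) into a 64-bit LSN; false on a syntax error. */
static bool
parse_lsn(const char *str, uint64_t *lsn)
{
    const char *p = str;
    uint32_t hi, lo;

    if (!parse_hex32(&p, &hi) || *p != '/')
        return false;
    p++;
    if (!parse_hex32(&p, &lo) || *p != '\0')
        return false;
    *lsn = ((uint64_t) hi << 32) | lo;
    return true;
}

/* A backup threshold must be syntactically valid AND not the invalid LSN. */
static bool
parse_threshold_lsn(const char *str, uint64_t *lsn)
{
    return parse_lsn(str, lsn) && *lsn != INVALID_LSN;
}
```

    With this split, "0/0" is accepted by the plain parser but rejected as a
    threshold, which is the distinction the option handling has to make.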
    
    
    Also, I observe that commit 21f428eb has already moved the LSN decoding
    logic to a separate function, pg_lsn_in_internal(), so the function
    decode_lsn_internal() from patch 0001 will go away and the dependent code
    needs to be modified.

    I shall review the rest of the patches and post my comments.
    
    
    Regards,
    
    Jeevan Ladhe
    
    
    [1]
    https://www.postgresql.org/message-id/CAOgcT0NOM9oR0Hag_3VpyW0uF3iCU=BDUFSPfk9JrWXRcWQHqw@mail.gmail.com
    
    On Thu, Jul 11, 2019 at 5:00 PM Jeevan Chalke <
    jeevan.chalke@enterprisedb.com> wrote:
    
    > Hi Anastasia,
    >
    > On Wed, Jul 10, 2019 at 11:47 PM Anastasia Lubennikova <
    > a.lubennikova@postgrespro.ru> wrote:
    >
    >> 23.04.2019 14:08, Anastasia Lubennikova wrote:
    >> > I'm volunteering to write a draft patch or, more likely, set of
    >> > patches, which
    >> > will allow us to discuss the subject in more detail.
    >> > And to do that I wish we agree on the API and data format (at least
    >> > broadly).
    >> > Looking forward to hearing your thoughts.
    >>
    >> Though the previous discussion stalled,
    >> I still hope that we could agree on basic points such as a map file
    >> format and protocol extension,
    >> which is necessary to start implementing the feature.
    >>
    >
    > It's great that you too come up with the PoC patch. I didn't look at your
    > changes in much details but we at EnterpriseDB too working on this feature
    > and started implementing it.
    >
    > Attached series of patches I had so far... (which needed further
    > optimization and adjustments though)
    >
    > Here is the overall design (as proposed by Robert) we are trying to
    > implement:
    >
    > 1. Extend the BASE_BACKUP command that can be used with replication
    > connections. Add a new [ LSN 'lsn' ] option.
    >
    > 2. Extend pg_basebackup with a new --lsn=LSN option that causes it to send
    > the option added to the server in #1.
    >
    > Here are the implementation details when we have a valid LSN
    >
    > sendFile() in basebackup.c is the function which mostly does the thing for
    > us. If the filename looks like a relation file, then we'll need to consider
    > sending only a partial file. The way to do that is probably:
    >
    > A. Read the whole file into memory.
    >
    > B. Check the LSN of each block. Build a bitmap indicating which blocks
    > have an LSN greater than or equal to the threshold LSN.
    >
    > C. If more than 90% of the bits in the bitmap are set, send the whole file
    > just as if this were a full backup. This 90% is a constant now; we might
    > make it a GUC later.
    >
    > D. Otherwise, send a file with .partial added to the name. The .partial
    > file contains an indication of which blocks were changed at the beginning,
    > followed by the data blocks. It also includes a checksum/CRC.
    > Currently, a .partial file format looks like:
    >  - start with a 4-byte magic number
    >  - then store a 4-byte CRC covering the header
    >  - then a 4-byte count of the number of blocks included in the file
    >  - then the block numbers, each as a 4-byte quantity
    >  - then the data blocks
    >
    >
    > We are also working on combining these incremental back-ups with the full
    > backup and for that, we are planning to add a new utility called
    > pg_combinebackup. Will post the details on that later once we have on the
    > same page for taking backup.
    >
    > Thanks
    > --
    > Jeevan Chalke
    > Technical Architect, Product Development
    > EnterpriseDB Corporation
    >
    >
    
  84. Re: block-level incremental backup

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2019-07-30T04:09:37Z

    On Tue, Jul 30, 2019 at 1:58 AM Robert Haas <robertmhaas@gmail.com> wrote:
    
    >
    > I haven't had a chance to look at Jeevan's patch at all, or yours in
    > any detail, as yet, so these are just some very preliminary comments.
    > It will be good, however, if we can agree on who is going to do what
    > part of this as we try to drive this forward together.  I'm sorry that
    > I didn't communicate EDB's plans to work on this more clearly;
    > duplicated effort serves nobody well.
    >
    
    I had a look over Anastasia's PoC patch to understand the approach she has
    taken and here are my observations.
    
    1.
    The patch first creates a .blockmap file for each relation file, containing
    an array of all modified block numbers. This is done by reading the file in
    a loop, four blocks (32kB in total) at a time, and comparing each page LSN
    with the given LSN. Later, to create the .partial file, the relation file is
    opened again and all blocks are read again, four at a time, in a loop. Each
    block found to be modified is copied into another buffer, and after all four
    blocks have been scanned, the copied blocks are sent to the .partial file.

    In this approach, each file is opened and read twice, which looks more
    expensive to me, whereas my patch does that just once. However, I read the
    entire file into memory to check which blocks are modified, while in
    Anastasia's design at most TAR_SEND_SIZE (32kB) is read at a time, in a
    loop. I need to do that because we want to know how heavily the file was
    modified, so that we can send the entire file if it was modified beyond the
    threshold (currently 90%).
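
    The threshold decision can be sketched roughly as follows. This is only
    an illustration, not code from either patch: it assumes the page LSNs
    have already been extracted into an array, and the names
    build_changed_bitmap, should_send_whole_file and
    WHOLE_FILE_THRESHOLD_PCT are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The 90% constant mentioned in the thread; the macro name is made up. */
#define WHOLE_FILE_THRESHOLD_PCT 90

/*
 * Set one bit per block whose LSN is >= threshold_lsn and return how many
 * bits were set.  The caller must pass a zeroed bitmap of at least
 * (nblocks + 7) / 8 bytes.
 */
static size_t
build_changed_bitmap(const uint64_t *page_lsns, size_t nblocks,
                     uint64_t threshold_lsn, uint8_t *bitmap)
{
    size_t changed = 0;

    for (size_t i = 0; i < nblocks; i++)
    {
        if (page_lsns[i] >= threshold_lsn)
        {
            bitmap[i / 8] |= (uint8_t) (1u << (i % 8));
            changed++;
        }
    }
    return changed;
}

/* True if strictly more than 90% of the blocks changed (integer math). */
static bool
should_send_whole_file(size_t changed, size_t nblocks)
{
    return nblocks > 0 &&
        changed * 100 > nblocks * WHOLE_FILE_THRESHOLD_PCT;
}
```

    Because the decision needs the total count of changed blocks, it cannot
    be made until the whole file has been examined, which is why a single
    pass has to hold on to either the file contents or the bitmap.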
    
    2.
    Also, while sending modified blocks, they are copied into another buffer;
    instead, they could be sent directly from the contents already read from
    the file (in BLCKSZ-sized blocks). Here, the .blockmap created earlier was
    not used. In my implementation, we send just a .partial file with a header
    containing all the required details, namely the number of changed blocks,
    the block numbers and a CRC, followed by the blocks themselves.
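
    As an illustration of such a header, the sketch below lays out a 4-byte
    magic number, a 4-byte CRC, a 4-byte block count and then the block
    numbers, following the field order given earlier in the thread.
    Everything else is an assumption made for the example: the magic value,
    the native byte order, the choice of CRC-32C, and the decision to let
    the CRC cover the count and block numbers. The function names are
    hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical magic value; the real one is not given in the thread. */
#define PARTIAL_MAGIC 0x50415254u

/* Plain bitwise CRC-32C (Castagnoli), chosen here only for illustration. */
static uint32_t
crc32c(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/*
 * Write magic, CRC, block count and block numbers into buf, in native
 * byte order.  Returns the header size in bytes, or 0 if buf is too small.
 * Assumption: the CRC covers the count and the block numbers.
 */
static size_t
build_partial_header(const uint32_t *blocks, uint32_t nblocks,
                     uint8_t *buf, size_t buflen)
{
    size_t need = 12 + (size_t) nblocks * 4;
    uint32_t magic = PARTIAL_MAGIC;
    uint32_t crc;

    if (buflen < need)
        return 0;
    memcpy(buf, &magic, 4);
    memcpy(buf + 8, &nblocks, 4);
    if (nblocks > 0)
        memcpy(buf + 12, blocks, (size_t) nblocks * 4);
    crc = crc32c(buf + 8, need - 8);
    memcpy(buf + 4, &crc, 4);
    return need;
}
```

    The data blocks would follow the header in the same order as the block
    numbers, so a reader can locate block i of the list at a fixed offset.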
    
    3.
    I tried compiling Anastasia's patch but got an error, so I could not see
    or test how it behaves. Also, like a normal backup, the incremental backup
    option needs to verify checksums if requested.
    
    4.
    While combining full and incremental backups, files from the incremental
    backup are just copied into the full backup directory. In the design I
    posted earlier, we are doing it the other way round to avoid overwriting
    and other issues, as I explained earlier.
    
    I am almost done writing the patch for pg_combinebackup and will post soon.
    
    
    >
    > --
    > Robert Haas
    > EnterpriseDB: http://www.enterprisedb.com
    > The Enterprise PostgreSQL Company
    >
    >
    >
    Thanks
    -- 
    Jeevan Chalke
    Technical Architect, Product Development
    EnterpriseDB Corporation
    The Enterprise PostgreSQL Company
    
  85. Re: block-level incremental backup

    Ibrar Ahmed <ibrar.ahmad@gmail.com> — 2019-07-30T13:27:07Z

    On Tue, Jul 30, 2019 at 1:28 AM Robert Haas <robertmhaas@gmail.com> wrote:
    
    > On Wed, Jul 10, 2019 at 2:17 PM Anastasia Lubennikova
    > <a.lubennikova@postgrespro.ru> wrote:
    > > In attachments, you can find a prototype of incremental pg_basebackup,
    > > which consists of 2 features:
    > >
    > > 1) To perform incremental backup one should call pg_basebackup with a
    > > new argument:
    > >
    > > pg_basebackup -D 'basedir' --prev-backup-start-lsn 'lsn'
    > >
    > > where lsn is a start_lsn of parent backup (can be found in
    > > "backup_label" file)
    > >
    > > It calls BASE_BACKUP replication command with a new argument
    > > PREV_BACKUP_START_LSN 'lsn'.
    > >
    > > For datafiles, only pages with LSN > prev_backup_start_lsn will be
    > > included in the backup.
    > > They are saved into 'filename.partial' file, 'filename.blockmap' file
    > > contains an array of BlockNumbers.
    > > For example, if we backuped blocks 1,3,5, filename.partial will contain
    > > 3 blocks, and 'filename.blockmap' will contain array {1,3,5}.
    >
    > I think it's better to keep both the information about changed blocks
    > and the contents of the changed blocks in a single file.  The list of
    > changed blocks is probably quite short, and I don't really want to
    > double the number of files in the backup if there's no real need. I
    > suspect it's just overall a bit simpler to keep everything together.
    > I don't think this is a make-or-break thing, and welcome contrary
    > arguments, but that's my preference.
    >
    
    I have experience working on a similar product, and I agree with Robert:
    keeping the changed-block info and the changed blocks in a single file
    makes more sense.
    +1
    
    >
    > > 2) To merge incremental backup into a full backup call
    > >
    > > pg_basebackup -D 'basedir' --incremental-pgdata 'incremental_basedir'
    > > --merge-backups
    > >
    > > It will move all files from 'incremental_basedir' to 'basedir' handling
    > > '.partial' files correctly.
    >
    > This, to me, looks like it's much worse than the design that I
    > proposed originally.  It means that:
    >
    > 1. You can't take an incremental backup without having the full backup
    > available at the time you want to take the incremental backup.
    >
    > 2. You're always storing a full backup, which means that you need more
    > disk space, and potentially much more I/O while taking the backup.
    > You save on transfer bandwidth, but you add a lot of disk reads and
    > writes, costs which have to be paid even if the backup is never
    > restored.
    >
    > > 1) Whether we collect block maps using simple "read everything page by
    > > page" approach
    > > or WAL scanning or any other page tracking algorithm, we must choose a
    > > map format.
    > > I implemented the simplest one, while there are more ideas:
    >
    > I think we should start simple.
    >
    > I haven't had a chance to look at Jeevan's patch at all, or yours in
    > any detail, as yet, so these are just some very preliminary comments.
    > It will be good, however, if we can agree on who is going to do what
    > part of this as we try to drive this forward together.  I'm sorry that
    > I didn't communicate EDB's plans to work on this more clearly;
    > duplicated effort serves nobody well.
    >
    > --
    > Robert Haas
    > EnterpriseDB: http://www.enterprisedb.com
    > The Enterprise PostgreSQL Company
    >
    >
    >
    
    -- 
    Ibrar Ahmed
    
  86. Re: block-level incremental backup

    vignesh C <vignesh21@gmail.com> — 2019-07-31T17:59:30Z

    On Tue, Jul 30, 2019 at 1:58 AM Robert Haas <robertmhaas@gmail.com> wrote:
    >
    > On Wed, Jul 10, 2019 at 2:17 PM Anastasia Lubennikova
    > <a.lubennikova@postgrespro.ru> wrote:
    > > In attachments, you can find a prototype of incremental pg_basebackup,
    > > which consists of 2 features:
    > >
    > > 1) To perform incremental backup one should call pg_basebackup with a
    > > new argument:
    > >
    > > pg_basebackup -D 'basedir' --prev-backup-start-lsn 'lsn'
    > >
    > > where lsn is a start_lsn of parent backup (can be found in
    > > "backup_label" file)
    > >
    > > It calls BASE_BACKUP replication command with a new argument
    > > PREV_BACKUP_START_LSN 'lsn'.
    > >
    > > For datafiles, only pages with LSN > prev_backup_start_lsn will be
    > > included in the backup.

    One thought: if the file has not been modified, there is no need to check
    the LSN.

    > > They are saved into 'filename.partial' file, 'filename.blockmap' file
    > > contains an array of BlockNumbers.
    > > For example, if we backuped blocks 1,3,5, filename.partial will contain
    > > 3 blocks, and 'filename.blockmap' will contain array {1,3,5}.
    >
    > I think it's better to keep both the information about changed blocks
    > and the contents of the changed blocks in a single file.  The list of
    > changed blocks is probably quite short, and I don't really want to
    > double the number of files in the backup if there's no real need. I
    > suspect it's just overall a bit simpler to keep everything together.
    > I don't think this is a make-or-break thing, and welcome contrary
    > arguments, but that's my preference.
    >
    I feel Robert's suggestion is good.
    We could also keep one metadata file for each backup with some basic
    information about all the files being backed up. This metadata file would
    be useful in the cases below:
    Table dropped before incremental backup
    Table truncated and Insert/Update/Delete operations before incremental backup

    I feel that if we have the metadata, we can add an optimization that uses
    it to identify file deletions in the above scenarios, avoiding the write
    and delete in pg_combinebackup that Jeevan described in his previous mail.

    It could probably also help us decide which work each worker needs to do
    if we are planning to back up in parallel.
    
    Regards,
    vignesh
    EnterpriseDB: http://www.enterprisedb.com
    
    
    
    
  87. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-07-31T20:03:01Z

    On Wed, Jul 31, 2019 at 1:59 PM vignesh C <vignesh21@gmail.com> wrote:
    > I feel Robert's suggestion is good.
    > We can probably keep one meta file for each backup with some basic information
    > of all the files being backed up, this metadata file will be useful in the
    > below case:
    > Table dropped before incremental backup
    > Table truncated and Insert/Update/Delete operations before incremental backup
    
    There's really no need for this with the design I proposed.  The files
    that should exist when you restore an incremental backup are exactly
    the set of files that exist in the final incremental backup, except
    that any .partial files need to be replaced with a correct
    reconstruction of the underlying file.  You don't need to know what
    got dropped or truncated; you only need to know what's supposed to be
    there at the end.
    
    You may be thinking, as I once did, that restoring an incremental
    backup would consist of restoring the full backup first and then
    layering the incrementals over it, but if you read what I proposed, it
    actually works the other way around: you restore the files that are
    present in the incremental, and as needed, pull pieces of them from
    earlier incremental and/or full backups.  I think this is a *much*
    better design than doing it the other way; it avoids any risk of
    getting the wrong answer due to truncations or drops, and it also is
    faster, because you only read older backups to the extent that you
    actually need their contents.
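
    A rough sketch of that newest-first lookup, under simplifying
    assumptions: each backup's copy of a file is modelled as either a whole
    file or a list of block numbers, and a whole file is assumed to supply
    every block asked of it. The BackupFile type and find_block_source are
    hypothetical names, not part of any posted patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One backup's view of a single relation file (hypothetical model). */
typedef struct BackupFile
{
    bool            is_partial; /* false: the whole file is present */
    const uint32_t *blocks;     /* block numbers stored, if partial */
    size_t          nblocks;
} BackupFile;

/*
 * chain[0] is the newest incremental and chain[nchain - 1] the oldest
 * backup (normally the full one).  Return the index of the newest backup
 * that holds blockno, or -1 if no backup in the chain has it.
 */
static int
find_block_source(const BackupFile *chain, size_t nchain, uint32_t blockno)
{
    for (size_t i = 0; i < nchain; i++)
    {
        if (!chain[i].is_partial)
            return (int) i;     /* whole file: stop looking any further back */

        for (size_t j = 0; j < chain[i].nblocks; j++)
        {
            if (chain[i].blocks[j] == blockno)
                return (int) i;
        }
    }
    return -1;
}
```

    The search stops at the first backup that has the block, so older
    backups are only consulted for blocks the newer ones do not contain.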
    
    I think it's a good idea to try to keep all the information about a
    single file being backed up in one place. It's just less confusing.  If,
    for example, you have a metadata file that tells you which files are
    dropped - that is, which files you DON'T have - then what happens if
    one of those files is present in the data directory after all?  Well,
    then you have inconsistent information and are confused, and maybe
    your code won't even notice the inconsistency.  Similarly, if the
    metadata file is separate from the block data, then what happens if
    one file is missing, or isn't from the same backup as the other file?
    That shouldn't happen, of course, but if it does, you'll get confused.
    There's no perfect solution to these kinds of problems: if we suppose
    that the backup can be corrupted by having missing or extra files, why
    not also corruption within a single file? Still, on balance I tend to
    think that keeping related stuff together minimizes the surface area
    for bugs.  I realize that's arguable, though.
    
    One consideration that goes the other way: if you have a manifest file
    that says what files are supposed to be present in the backup, then
    you can detect a disappearing file, which is impossible with the
    design I've proposed (and with the current full backup machinery).
    That might be worth fixing, but it's a separate feature that has
    little to do with incremental backup.
    
    > Probably it can also help us to decide which work the worker needs to do
    > if we are planning to backup in parallel.
    
    I don't think we need a manifest file for parallel backup.  One
    process or thread can scan the directory tree, make a list of which
    files are present, and then hand individual files off to other
    processes or threads. In short, the directory listing serves as the
    manifest.
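
    As a very small sketch of that idea, the function below scans a single
    directory with POSIX opendir()/readdir() and assigns each entry to a
    worker. It is deliberately simplified: it does not recurse, it treats
    every entry as a file, and the round-robin assignment, the WorkItem
    type and the name plan_directory are assumptions made for the example.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

#define MAX_NAME 256

/* One unit of work: a directory entry and the worker it is assigned to. */
typedef struct WorkItem
{
    char name[MAX_NAME];
    int  worker;
} WorkItem;

/*
 * List up to max_items entries of path (skipping "." and "..") and assign
 * them to workers 0 .. nworkers-1 in round-robin order.  Returns the
 * number of items filled in, or -1 on error.
 */
static int
plan_directory(const char *path, int nworkers, WorkItem *items, int max_items)
{
    DIR *dir;
    struct dirent *de;
    int n = 0;

    if (nworkers <= 0)
        return -1;
    dir = opendir(path);
    if (dir == NULL)
        return -1;

    while (n < max_items && (de = readdir(dir)) != NULL)
    {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        snprintf(items[n].name, MAX_NAME, "%s", de->d_name);
        items[n].worker = n % nworkers;
        n++;
    }
    closedir(dir);
    return n;
}
```

    A real implementation would more likely balance by file size than by
    count, but the listing itself is all the coordinating process needs.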
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  88. Re: block-level incremental backup

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2019-08-01T11:36:25Z

    On Tue, Jul 30, 2019 at 9:39 AM Jeevan Chalke <
    jeevan.chalke@enterprisedb.com> wrote:
    
    >
    >
    >
    > I am almost done writing the patch for pg_combinebackup and will post soon.
    >
    
    Attached patch which implements the pg_combinebackup utility used to combine
    full basebackup with one or more incremental backups.
    
    I have tested it manually and it works for all best cases.
    
    Let me know if you have any inputs/suggestions/review comments?
    
    Thanks
    -- 
    Jeevan Chalke
    Technical Architect, Product Development
    EnterpriseDB Corporation
    The Enterprise PostgreSQL Company
    
  89. Re: block-level incremental backup

    vignesh C <vignesh21@gmail.com> — 2019-08-02T13:12:49Z

    On Thu, Aug 1, 2019 at 5:06 PM Jeevan Chalke
    <jeevan.chalke@enterprisedb.com> wrote:
    >
    > On Tue, Jul 30, 2019 at 9:39 AM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
    >>
    >> I am almost done writing the patch for pg_combinebackup and will post soon.
    >
    >
    > Attached patch which implements the pg_combinebackup utility used to combine
    > full basebackup with one or more incremental backups.
    >
    > I have tested it manually and it works for all best cases.
    >
    > Let me know if you have any inputs/suggestions/review comments?
    >
    Some comments:
    1) There will be some link files created for tablespace, we might
    require some special handling for it
    
    2)
    + while (numretries <= maxretries)
    + {
    + rc = system(copycmd);
    + if (rc == 0)
    + return;
    +
    + pg_log_info("could not copy, retrying after %d seconds",
    + sleeptime);
    + pg_usleep(numretries++ * sleeptime * 1000000L);
    + }
    Retry functionality is handled only for copying of full files; should
    we also handle retry for copying of partial files?
    
    3)
    + maxretries = atoi(optarg);
    + if (maxretries < 0)
    + {
    + pg_log_error("invalid value for maxretries");
    + fprintf(stderr, _("%s: -r maxretries must be >= 0\n"), progname);
    + exit(1);
    + }
    + break;
    + case 's':
    + sleeptime = atoi(optarg);
    + if (sleeptime <= 0 || sleeptime > 60)
    + {
    + pg_log_error("invalid value for sleeptime");
    + fprintf(stderr, _("%s: -s sleeptime must be between 1 and 60\n"), progname);
    + exit(1);
    + }
    + break;
    We could enforce a range for maxretries, similar to sleeptime.
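
    One way to apply the same range check to both options is a small helper
    built on strtol(). This is a sketch with a hypothetical name,
    parse_int_in_range, not code from the patch. Unlike atoi(), it also
    rejects non-numeric input such as "abc", which atoi() would silently
    turn into 0.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Parse arg as a base-10 integer in [min, max].  On success store it in
 * *result and return true; otherwise return false and leave *result alone.
 */
static bool
parse_int_in_range(const char *arg, int min, int max, int *result)
{
    char *end;
    long val;

    errno = 0;
    val = strtol(arg, &end, 10);

    /* No digits, trailing garbage, or overflow of long. */
    if (end == arg || *end != '\0' || errno == ERANGE)
        return false;
    if (val < min || val > max)
        return false;

    *result = (int) val;
    return true;
}
```

    The option handling could then call it once for -r and once for -s,
    with only the bounds and the error message differing.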
    
    4)
    + fp = fopen(filename, "r");
    + if (fp == NULL)
    + {
    + pg_log_error("could not read file \"%s\": %m", filename);
    + exit(1);
    + }
    +
    + labelfile = malloc(statbuf.st_size + 1);
    + if (fread(labelfile, 1, statbuf.st_size, fp) != statbuf.st_size)
    + {
    + pg_log_error("corrupted file \"%s\": %m", filename);
    + free(labelfile);
    + exit(1);
    + }
    Should we check for malloc failure?
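
    For illustration, a version of the read that checks the allocation
    might look like the sketch below. The function name read_whole_file is
    hypothetical, and the size is passed in by the caller (in the quoted
    hunk it comes from statbuf.st_size).

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Read exactly size bytes of filename into a freshly allocated,
 * NUL-terminated buffer.  Returns NULL if the file cannot be opened, the
 * allocation fails, or fewer than size bytes could be read.  The caller
 * frees the result.
 */
static char *
read_whole_file(const char *filename, size_t size)
{
    FILE *fp = fopen(filename, "rb");
    char *buf;

    if (fp == NULL)
        return NULL;

    buf = malloc(size + 1);
    if (buf == NULL)            /* the check missing from the quoted hunk */
    {
        fclose(fp);
        return NULL;
    }

    if (fread(buf, 1, size, fp) != size)
    {
        free(buf);
        fclose(fp);
        return NULL;
    }
    buf[size] = '\0';
    fclose(fp);
    return buf;
}
```

    In PostgreSQL frontend code, pg_malloc() is the usual alternative: it
    reports the out-of-memory error and exits, so callers do not need the
    explicit NULL check.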
    
    5) Should we add a progress display, as the backup may take some time?
    This can be added as an enhancement. We can get others' opinions on this.
    
    6)
    + if (nIncrDir == MAX_INCR_BK_COUNT)
    + {
    + pg_log_error("too many incremental backups to combine");
    + fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
    + exit(1);
    + }
    +
    + IncrDirs[nIncrDir] = optarg;
    + nIncrDir++;
    + break;
    
    If the backup count increases, providing the input may become difficult.
    Could the user instead pass a parent folder containing all the incremental
    backups, and could we handle the ordering of the incremental backups
    internally?
    
    7)
    + if (isPartialFile)
    + {
    + if (verbose)
    + pg_log_info("combining partial file \"%s.partial\"", fn);
    +
    + combine_partial_files(fn, IncrDirs, nIncrDir, subdirpath, outfn);
    + }
    + else
    + copy_whole_file(infn, outfn);
    
    Add verbose output for copying a whole file as well.
    
    8) We could also check whether approximately enough disk space is available
    before starting to combine backups; this can be added as an enhancement.
    We can get others' opinions on this.
    
    9)
    + printf(_("  -i, --incr-backup=DIRECTORY incremental backup directory
    (maximum %d)\n"), MAX_INCR_BK_COUNT);
    + printf(_("  -o, --output-dir=DIRECTORY  combine backup into directory\n"));
    + printf(_("\nGeneral options:\n"));
    + printf(_("  -n, --no-clean              do not clean up after errors\n"));
    
    The help text "combine backup into directory" could be "combine backup
    directory".
    
    10)
    +/* Max number of incremental backups to be combined. */
    +#define MAX_INCR_BK_COUNT 10
    +
    +/* magic number in incremental backup's .partial file */
    
    MAX_INCR_BK_COUNT could be increased a little; some applications take 1
    full backup at the beginning of the month and 30 incremental
    backups on the remaining days of the month.
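
    An alternative to raising the constant is to drop the fixed-size array
    and grow the list on demand. The sketch below shows one way to do that
    with realloc(); the DirList type and dirlist_append are hypothetical
    names and are not taken from the patch.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Growable list of incremental backup directories (hypothetical type). */
typedef struct DirList
{
    const char **dirs;
    size_t       count;
    size_t       capacity;
} DirList;

/*
 * Append dir to the list, doubling the capacity when it is full.
 * Returns false (leaving the list unchanged) if memory runs out.
 */
static bool
dirlist_append(DirList *list, const char *dir)
{
    if (list->count == list->capacity)
    {
        size_t newcap = list->capacity ? list->capacity * 2 : 8;
        const char **grown = realloc(list->dirs, newcap * sizeof(*grown));

        if (grown == NULL)
            return false;
        list->dirs = grown;
        list->capacity = newcap;
    }
    list->dirs[list->count++] = dir;
    return true;
}
```

    Each -i option would then call dirlist_append() with optarg, and a month
    of daily incrementals needs no special limit.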
    
    Regards,
    Vignesh
    EnterpriseDB: http://www.enterprisedb.com
    
    
    
    
  90. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-08-05T13:42:59Z

    On Fri, Aug 2, 2019 at 9:13 AM vignesh C <vignesh21@gmail.com> wrote:
    > + rc = system(copycmd);
    
    I don't think this patch should be calling system() in the first place.
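
    For comparison, a local file copy can be done with ordinary stdio calls
    and no shell involved. The sketch below is only an illustration with a
    hypothetical name, copy_file; it is not the routine any of the patches
    ended up using.

```c
#include <stdio.h>

/* Copy src to dst in 8kB chunks.  Returns 0 on success, -1 on failure. */
static int
copy_file(const char *src, const char *dst)
{
    FILE *in = fopen(src, "rb");
    FILE *out;
    char buf[8192];
    size_t n;
    int rc = 0;

    if (in == NULL)
        return -1;
    out = fopen(dst, "wb");
    if (out == NULL)
    {
        fclose(in);
        return -1;
    }

    while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
    {
        if (fwrite(buf, 1, n, out) != n)
        {
            rc = -1;
            break;
        }
    }
    if (ferror(in))
        rc = -1;

    fclose(in);
    if (fclose(out) != 0)       /* catches errors flushing the last chunk */
        rc = -1;
    return rc;
}
```

    Doing the copy in-process also means the failure reason is available
    through errno rather than being hidden behind a shell exit status.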
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  91. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-08-05T14:22:32Z

    Greetings,
    
    * Robert Haas (robertmhaas@gmail.com) wrote:
    > On Fri, Aug 2, 2019 at 9:13 AM vignesh C <vignesh21@gmail.com> wrote:
    > > + rc = system(copycmd);
    > 
    > I don't think this patch should be calling system() in the first place.
    
    +1.
    
    Thanks,
    
    Stephen
    
  92. Re: block-level incremental backup

    Ibrar Ahmed <ibrar.ahmad@gmail.com> — 2019-08-06T18:31:50Z

    I have not looked at the patch in detail, but just some nits from my side.
    
    On Fri, Aug 2, 2019 at 6:13 PM vignesh C <vignesh21@gmail.com> wrote:
    
    > On Thu, Aug 1, 2019 at 5:06 PM Jeevan Chalke
    > <jeevan.chalke@enterprisedb.com> wrote:
    > >
    > > On Tue, Jul 30, 2019 at 9:39 AM Jeevan Chalke <
    > jeevan.chalke@enterprisedb.com> wrote:
    > >>
    > >> I am almost done writing the patch for pg_combinebackup and will post
    > soon.
    > >
    > >
    > > Attached patch which implements the pg_combinebackup utility used to
    > combine
    > > full basebackup with one or more incremental backups.
    > >
    > > I have tested it manually and it works for all best cases.
    > >
    > > Let me know if you have any inputs/suggestions/review comments?
    > >
    > Some comments:
    > 1) There will be some link files created for tablespace, we might
    > require some special handling for it
    >
    > 2)
    > + while (numretries <= maxretries)
    > + {
    > + rc = system(copycmd);
    > + if (rc == 0)
    > + return;
    >
    Use an API to copy the file instead of system(); better to use the secure
    copy.
    
    
    > + pg_log_info("could not copy, retrying after %d seconds",
    > + sleeptime);
    > + pg_usleep(numretries++ * sleeptime * 1000000L);
    > + }
    > Retry functionality is handled only for copying of full files, should
    > we handle retry for copying of partial files
    >
    The log message and the sleep time do not match: you are multiplying
    sleeptime by numretries++ but logging only "sleeptime".

    Why are we retrying here? Capture the proper copy error and act accordingly.
    Blindly retrying does not make sense.
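
    The mismatch between the logged and the actual delay goes away if the
    delay is computed once and used for both. The sketch below shows only
    that point; it still retries on any failure, so it does not address the
    second concern. The callback types and function names are hypothetical,
    and the copy and sleep operations are passed in as function pointers.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical callback types; the patch under review calls system(). */
typedef int (*copy_fn) (const char *src, const char *dst);
typedef void (*sleep_fn) (unsigned int seconds);

/* Linear backoff: the Nth retry waits N * sleeptime seconds. */
static unsigned int
retry_delay_seconds(unsigned int retry, unsigned int sleeptime)
{
    return retry * sleeptime;
}

/*
 * Try the copy once, then up to maxretries more times.  Returns 0 on
 * success and -1 once all attempts have failed.  do_sleep may be NULL to
 * skip the actual wait (useful in tests).
 */
static int
copy_with_retry(copy_fn copy, sleep_fn do_sleep,
                const char *src, const char *dst,
                unsigned int maxretries, unsigned int sleeptime)
{
    for (unsigned int retry = 0;; retry++)
    {
        unsigned int delay;

        if (copy(src, dst) == 0)
            return 0;
        if (retry >= maxretries)
            return -1;

        /* Compute the delay once so the log and the sleep cannot diverge. */
        delay = retry_delay_seconds(retry + 1, sleeptime);
        fprintf(stderr, "could not copy \"%s\", retrying after %u seconds\n",
                src, delay);
        if (do_sleep != NULL)
            do_sleep(delay);
    }
}
```

    To act on the error rather than retry blindly, the copy callback could
    return an error code and the loop could give up at once on failures
    that a retry cannot fix.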
    
    3)
    > + maxretries = atoi(optarg);
    > + if (maxretries < 0)
    > + {
    > + pg_log_error("invalid value for maxretries");
    > + fprintf(stderr, _("%s: -r maxretries must be >= 0\n"), progname);
    > + exit(1);
    > + }
    > + break;
    > + case 's':
    > + sleeptime = atoi(optarg);
    > + if (sleeptime <= 0 || sleeptime > 60)
    > + {
    > + pg_log_error("invalid value for sleeptime");
    > + fprintf(stderr, _("%s: -s sleeptime must be between 1 and 60\n"),
    > progname);
    > + exit(1);
    > + }
    > + break;
    > we can have some range for maxretries similar to sleeptime
    >
    > 4)
    > + fp = fopen(filename, "r");
    > + if (fp == NULL)
    > + {
    > + pg_log_error("could not read file \"%s\": %m", filename);
    > + exit(1);
    > + }
    > +
    > + labelfile = malloc(statbuf.st_size + 1);
    > + if (fread(labelfile, 1, statbuf.st_size, fp) != statbuf.st_size)
    > + {
    > + pg_log_error("corrupted file \"%s\": %m", filename);
    > + free(labelfile);
    > + exit(1);
    > + }
    > Should we check for malloc failure
    >
    Use pg_malloc instead of malloc.
    
    
    > 5) Should we add display of progress as backup may take some time,
    > this can be added as enhancement. We can get other's opinion on this.
    >
    Yes, we should, but this is not the right time to do that.
    
    
    > 6)
    > + if (nIncrDir == MAX_INCR_BK_COUNT)
    > + {
    > + pg_log_error("too many incremental backups to combine");
    > + fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
    > progname);
    > + exit(1);
    > + }
    > +
    > + IncrDirs[nIncrDir] = optarg;
    > + nIncrDir++;
    > + break;
    >
    > If the backup count increases providing the input may be difficult,
    > Shall user provide all the incremental backups from a parent folder
    > and can we handle the ordering of incremental backup internally
    >
    Why do we have that limit in the first place?
    
    
    > 7)
    > + if (isPartialFile)
    > + {
    > + if (verbose)
    > + pg_log_info("combining partial file \"%s.partial\"", fn);
    > +
    > + combine_partial_files(fn, IncrDirs, nIncrDir, subdirpath, outfn);
    > + }
    > + else
    > + copy_whole_file(infn, outfn);
    >
    > Add verbose for copying whole file
    >
    > 8) We can also check if approximate space is available in disk before
    > starting combine backup, this can be added as enhancement. We can get
    > other's opinion on this.
    >
    > 9)
    > + printf(_("  -i, --incr-backup=DIRECTORY incremental backup directory
    > (maximum %d)\n"), MAX_INCR_BK_COUNT);
    > + printf(_("  -o, --output-dir=DIRECTORY  combine backup into
    > directory\n"));
    > + printf(_("\nGeneral options:\n"));
    > + printf(_("  -n, --no-clean              do not clean up after
    > errors\n"));
    >
    > Combine backup into directory can be combine backup directory
    >
    > 10)
    > +/* Max number of incremental backups to be combined. */
    > +#define MAX_INCR_BK_COUNT 10
    > +
    > +/* magic number in incremental backup's .partial file */
    >
    > MAX_INCR_BK_COUNT can be increased little, some applications use 1
    > full backup at the beginning of the month and use 30 incremental
    > backups rest of the days in the month
    >
    > Regards,
    > Vignesh
    > EnterpriseDB: http://www.enterprisedb.com
    >
    >
    >
    
    -- 
    Ibrar Ahmed
    
  93. Re: block-level incremental backup

    Ibrar Ahmed <ibrar.ahmad@gmail.com> — 2019-08-06T18:37:21Z

    On Tue, Aug 6, 2019 at 11:31 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
    
    >
    > I have not looked at the patch in detail, but just some nits from my side.
    >
    > On Fri, Aug 2, 2019 at 6:13 PM vignesh C <vignesh21@gmail.com> wrote:
    >
    >> On Thu, Aug 1, 2019 at 5:06 PM Jeevan Chalke
    >> <jeevan.chalke@enterprisedb.com> wrote:
    >> >
    >> > On Tue, Jul 30, 2019 at 9:39 AM Jeevan Chalke <
    >> jeevan.chalke@enterprisedb.com> wrote:
    >> >>
    >> >> I am almost done writing the patch for pg_combinebackup and will post
    >> soon.
    >> >
    >> >
    >> > Attached patch which implements the pg_combinebackup utility used to
    >> combine
    >> > full basebackup with one or more incremental backups.
    >> >
    >> > I have tested it manually and it works for all best cases.
    >> >
    >> > Let me know if you have any inputs/suggestions/review comments?
    >> >
    >> Some comments:
    >> 1) There will be some link files created for tablespace, we might
    >> require some special handling for it
    >>
    >> 2)
    >> + while (numretries <= maxretries)
    >> + {
    >> + rc = system(copycmd);
    >> + if (rc == 0)
    >> + return;
    >>
    >> Use API to copy the file instead of "system", better to use the secure
    > copy.
    >
    Ah, it is a local copy, so a simple copy API is enough.
    
    >
    >
    >> + pg_log_info("could not copy, retrying after %d seconds",
    >> + sleeptime);
    >> + pg_usleep(numretries++ * sleeptime * 1000000L);
    >> + }
    >> Retry functionality is handled only for copying of full files, should
    >> we handle retry for copying of partial files
    >>
    >> The log and the sleep time does not match, you are multiplying sleeptime
    > with numretries++ and logging only "sleeptime"
    >
    > Why are we retrying here? Capture the proper copy error and act accordingly.
    > Blindly retrying does not make sense.
    >
    > 3)
    >> + maxretries = atoi(optarg);
    >> + if (maxretries < 0)
    >> + {
    >> + pg_log_error("invalid value for maxretries");
    >> + fprintf(stderr, _("%s: -r maxretries must be >= 0\n"), progname);
    >> + exit(1);
    >> + }
    >> + break;
    >> + case 's':
    >> + sleeptime = atoi(optarg);
    >> + if (sleeptime <= 0 || sleeptime > 60)
    >> + {
    >> + pg_log_error("invalid value for sleeptime");
    >> + fprintf(stderr, _("%s: -s sleeptime must be between 1 and 60\n"),
    >> progname);
    >> + exit(1);
    >> + }
    >> + break;
    >> we can have some range for maxretries similar to sleeptime
    >>
    >> 4)
    >> + fp = fopen(filename, "r");
    >> + if (fp == NULL)
    >> + {
    >> + pg_log_error("could not read file \"%s\": %m", filename);
    >> + exit(1);
    >> + }
    >> +
    >> + labelfile = malloc(statbuf.st_size + 1);
    >> + if (fread(labelfile, 1, statbuf.st_size, fp) != statbuf.st_size)
    >> + {
    >> + pg_log_error("corrupted file \"%s\": %m", filename);
    >> + free(labelfile);
    >> + exit(1);
    >> + }
    >> Should we check for malloc failure
    >>
    >> Use pg_malloc instead of malloc
    >
    >
    >> 5) Should we add display of progress as backup may take some time,
    >> this can be added as enhancement. We can get other's opinion on this.
    >>
    >> Yes, we should, but this is not the right time to do that.
    >
    >
    >> 6)
    >> + if (nIncrDir == MAX_INCR_BK_COUNT)
    >> + {
    >> + pg_log_error("too many incremental backups to combine");
    >> + fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
    >> progname);
    >> + exit(1);
    >> + }
    >> +
    >> + IncrDirs[nIncrDir] = optarg;
    >> + nIncrDir++;
    >> + break;
    >>
    >> If the backup count increases providing the input may be difficult,
    >> Shall user provide all the incremental backups from a parent folder
    >> and can we handle the ordering of incremental backup internally
    >>
    >> Why we have that limit at first place?
    >
    >
    >> 7)
    >> + if (isPartialFile)
    >> + {
    >> + if (verbose)
    >> + pg_log_info("combining partial file \"%s.partial\"", fn);
    >> +
    >> + combine_partial_files(fn, IncrDirs, nIncrDir, subdirpath, outfn);
    >> + }
    >> + else
    >> + copy_whole_file(infn, outfn);
    >>
    >> Add verbose for copying whole file
    >>
    >> 8) We can also check if approximate space is available in disk before
    >> starting combine backup, this can be added as enhancement. We can get
    >> other's opinion on this.
    >>
    >> 9)
    >> + printf(_("  -i, --incr-backup=DIRECTORY incremental backup directory
    >> (maximum %d)\n"), MAX_INCR_BK_COUNT);
    >> + printf(_("  -o, --output-dir=DIRECTORY  combine backup into
    >> directory\n"));
    >> + printf(_("\nGeneral options:\n"));
    >> + printf(_("  -n, --no-clean              do not clean up after
    >> errors\n"));
    >>
    >> Combine backup into directory can be combine backup directory
    >>
    >> 10)
    >> +/* Max number of incremental backups to be combined. */
    >> +#define MAX_INCR_BK_COUNT 10
    >> +
    >> +/* magic number in incremental backup's .partial file */
    >>
    >> MAX_INCR_BK_COUNT can be increased little, some applications use 1
    >> full backup at the beginning of the month and use 30 incremental
    >> backups rest of the days in the month
    >>
    >> Regards,
    >> Vignesh
    >> EnterpriseDB: http://www.enterprisedb.com
    >>
    >>
    >>
    >
    > --
    > Ibrar Ahmed
    >
    
    
    -- 
    Ibrar Ahmed
    
  94. Re: block-level incremental backup

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2019-08-07T09:46:43Z

    On Mon, Aug 5, 2019 at 7:13 PM Robert Haas <robertmhaas@gmail.com> wrote:
    
    > On Fri, Aug 2, 2019 at 9:13 AM vignesh C <vignesh21@gmail.com> wrote:
    > > + rc = system(copycmd);
    >
    > I don't think this patch should be calling system() in the first place.
    >
    
    So, do you mean we should just do fread() and fwrite() for the whole file?
    
    I thought it would be better if this were done by the OS itself, instead of
    reading 1GB into memory and writing the same to the file.
    
    
    > --
    > Robert Haas
    > EnterpriseDB: http://www.enterprisedb.com
    > The Enterprise PostgreSQL Company
    >
    
    
    -- 
    Jeevan Chalke
    Technical Architect, Product Development
    EnterpriseDB Corporation
    The Enterprise PostgreSQL Company
    
  95. Re: block-level incremental backup

    Ibrar Ahmed <ibrar.ahmad@gmail.com> — 2019-08-07T09:52:12Z

    On Wed, Aug 7, 2019 at 2:47 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com>
    wrote:
    
    >
    >
    > On Mon, Aug 5, 2019 at 7:13 PM Robert Haas <robertmhaas@gmail.com> wrote:
    >
    >> On Fri, Aug 2, 2019 at 9:13 AM vignesh C <vignesh21@gmail.com> wrote:
    >> > + rc = system(copycmd);
    >>
    >> I don't think this patch should be calling system() in the first place.
    >>
    >
    > So, do you mean we should just do fread() and fwrite() for the whole file?
    >
    > I thought it is better if it was done by the OS itself instead of reading
    > 1GB
    > into the memory and writing the same to the file.
    >
    It is not necessary to read the whole 1GB into RAM.
    
    
    >
    >> --
    >> Robert Haas
    >> EnterpriseDB: http://www.enterprisedb.com
    >> The Enterprise PostgreSQL Company
    >>
    >
    >
    > --
    > Jeevan Chalke
    > Technical Architect, Product Development
    > EnterpriseDB Corporation
    > The Enterprise PostgreSQL Company
    >
    >
    
    -- 
    Ibrar Ahmed
    
  96. Re: block-level incremental backup

    Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> — 2019-08-09T00:37:14Z

    Hi Jeevan,
    
    I have reviewed the backup part at code level and am still looking into the
    restore (combine) and functional parts of it. But here are my comments so
    far:
    
    The patches need rebase.
    ----------------------------------------------------
    +       if (!XLogRecPtrIsInvalid(previous_lsn))
    +           appendStringInfo(labelfile, "PREVIOUS WAL LOCATION: %X/%X\n",
    +                            (uint32) (previous_lsn >> 32), (uint32) previous_lsn);
    
    Maybe we should rename it to something like
    "INCREMENTAL BACKUP START WAL LOCATION" or simply
    "INCREMENTAL BACKUP START LOCATION" to make it more intuitive?
    
    ----------------------------------------------------
    
    +typedef struct
    +{
    +   uint32      magic;
    +   pg_crc32c   checksum;
    +   uint32      nblocks;
    +   uint32      blocknumbers[FLEXIBLE_ARRAY_MEMBER];
    +} partial_file_header;
    
    
    File header structure is defined in both the files basebackup.c and
    pg_combinebackup.c. I think it is better to move this to
    replication/basebackup.h.
    
    ----------------------------------------------------
    
    +   bool        isrelfile = false;
    
    I think we can avoid having flag isrelfile in sendFile().
    Something like this:
    
    if (startincrptr && OidIsValid(dboid) && looks_like_rel_name(filename))
    {
        /* include the code here that is under the "if (isrelfile)" block */
    }
    else
    {
        _tarWriteHeader(tarfilename, NULL, statbuf, false);
        while ((cnt = fread(buf, 1, Min(sizeof(buf), statbuf->st_size - len), fp)) > 0)
        {
            ...
        }
    }
    
    ----------------------------------------------------
    
    Also, having isrelfile as part of the following condition:
    {code}
    +   while (!isrelfile &&
    +          (cnt = fread(buf, 1, Min(sizeof(buf), statbuf->st_size - len), fp)) > 0)
    {code}

    is confusing: even the relation files in a full backup are backed up by this
    very loop, yet the condition reads '(!isrelfile && ...)'.
    
    ----------------------------------------------------
    
    verify_page_checksum()
    {
        while(1)
        {
            ....
            break;
        }
    }
    
    IMHO, while goto labels are not advisable in general, it may be better to use
    a label here rather than a while(1) loop, so that we can jump to the label in
    case we want to retry once. I think the current form opens the door to future
    bugs: if someone adds code here and the break ends up under some condition,
    we are left with an infinite loop.
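
A third option besides the goto label and the open-ended while(1) is a loop with an explicit attempt bound. The following is only a minimal sketch of that idea: verify_with_retry(), the callback types, MAX_VERIFY_ATTEMPTS and the demo callbacks are all hypothetical names, and it does not reproduce the patch's actual verify_page_checksum() logic.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_VERIFY_ATTEMPTS 2   /* assumed: initial attempt plus one retry */

/* Hypothetical callbacks standing in for checksum verification and re-read. */
typedef bool (*page_check_fn) (void *page);
typedef void (*page_reread_fn) (void *page);

/*
 * Verify a page, re-reading it at most once on failure.  The bound lives in
 * the for statement itself, so code added to the body later cannot turn
 * this into an infinite loop.
 */
static bool
verify_with_retry(void *page, page_check_fn check, page_reread_fn reread)
{
    for (int attempt = 0; attempt < MAX_VERIFY_ATTEMPTS; attempt++)
    {
        if (check(page))
            return true;
        if (attempt + 1 < MAX_VERIFY_ATTEMPTS)
            reread(page);       /* the page may have been torn; fetch it again */
    }
    return false;
}

/* Demo callbacks: the "page" is an int counting how many re-reads it needs. */
static bool
demo_check(void *page)
{
    return *(int *) page <= 0;
}

static void
demo_reread(void *page)
{
    (*(int *) page)--;
}
```

With the demo callbacks, a page that needs one re-read passes and a page that needs two is reported as failed.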
    
    ----------------------------------------------------
    
    +/* magic number in incremental backup's .partial file */
    +#define INCREMENTAL_BACKUP_MAGIC   0x494E4352
    
    Similar to the partial_file_header structure, I think the above macro can
    also be moved to basebackup.h instead of being defined twice.
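
To illustrate what a single shared definition could carry, here is a minimal standalone sketch. It assumes plain uint32_t in place of PostgreSQL's uint32 and pg_crc32c so that it compiles outside the tree; partial_header_size() and magic_to_text() are hypothetical helpers, not functions from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* magic number in incremental backup's .partial file: ASCII "INCR" */
#define INCREMENTAL_BACKUP_MAGIC    0x494E4352

/* Stand-in for the patch's partial_file_header, using stdint types. */
typedef struct
{
    uint32_t    magic;
    uint32_t    checksum;
    uint32_t    nblocks;
    uint32_t    blocknumbers[]; /* FLEXIBLE_ARRAY_MEMBER in PostgreSQL */
} partial_file_header;

/* Bytes occupied by a header that lists nblocks block numbers. */
static size_t
partial_header_size(uint32_t nblocks)
{
    return offsetof(partial_file_header, blocknumbers) +
        (size_t) nblocks * sizeof(uint32_t);
}

/* Spell out the magic number as four characters, most significant byte first. */
static void
magic_to_text(uint32_t magic, char out[5])
{
    for (int i = 0; i < 4; i++)
        out[i] = (char) ((magic >> (24 - 8 * i)) & 0xFF);
    out[4] = '\0';
}
```

Both the server side and the combine tool would then include the one header, so the layout and the magic value cannot drift apart.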
    
    ----------------------------------------------------
    
    In sendFile():
    
    +       buf = (char *) malloc(RELSEG_SIZE * BLCKSZ);
    
    I think this is a huge memory request (1GB) and may fail on a busy/loaded
    server at times. We should check for malloc failure, and maybe throw an
    error on getting ENOMEM as errno.
    
    ----------------------------------------------------
    
    +       /* Perform incremenatl backup stuff here. */
    +       if ((cnt = fread(buf, 1, Min(RELSEG_SIZE * BLCKSZ, statbuf->st_size), fp)) > 0)
    +       {
    
    Here, should we not expect statbuf->st_size < (RELSEG_SIZE * BLCKSZ), so that
    it is always safe to read just statbuf->st_size bytes? But I am OK with
    having this extra guard here.
    
    ----------------------------------------------------
    
    In sendFile(), I am sorry if I am missing something, but I am not able to
    understand why 'cnt' and 'i' should have different values when they are
    being passed to verify_page_checksum(). I think passing only one of them
    should be sufficient.
    
    ----------------------------------------------------
    
    +               XLogRecPtr  pglsn;
    +
    +               for (i = 0; i < cnt / BLCKSZ; i++)
    +               {
    
    Maybe we should just have a variable no_of_blocks to store the number of
    blocks, rather than calculating it up to RELSEG_SIZE (i.e. 131072) times in
    the worst case.
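
A minimal sketch of the loop with the block count computed once, under stated assumptions: page_lsn() and collect_changed_blocks() are hypothetical helpers, BLCKSZ is taken as the default 8192, and the only thing relied on from PostgreSQL is that a page header begins with pd_lsn stored as two 32-bit halves, high half first. The ">= threshold" test follows the proposal at the top of the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size */

/* Read the page LSN (two 32-bit halves, high then low) from a page's start. */
static uint64_t
page_lsn(const char *page)
{
    uint32_t    hi;
    uint32_t    lo;

    memcpy(&hi, page, sizeof(hi));
    memcpy(&lo, page + sizeof(hi), sizeof(lo));
    return ((uint64_t) hi << 32) | lo;
}

/*
 * Collect the indexes of blocks in buf whose LSN is >= threshold.
 * Returns how many were found; their indexes are written to changed[].
 */
static uint32_t
collect_changed_blocks(const char *buf, size_t cnt, uint64_t threshold,
                       uint32_t *changed)
{
    uint32_t    no_of_blocks = (uint32_t) (cnt / BLCKSZ);   /* computed once */
    uint32_t    nchanged = 0;

    for (uint32_t i = 0; i < no_of_blocks; i++)
    {
        if (page_lsn(buf + (size_t) i * BLCKSZ) >= threshold)
            changed[nchanged++] = i;
    }
    return nchanged;
}
```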
    
    ----------------------------------------------------
    +               len += cnt;
    +               throttle(cnt);
    +           }
    
    Sorry if I am missing something, but, should not it be just:
    
    len = cnt;
    
    ----------------------------------------------------
    
    As I said in my previous email, we no longer need decode_lsn_internal(),
    as this is already taken care of by the introduction of the function
    pg_lsn_in_internal().
    
    Regards,
    Jeevan Ladhe
    
  97. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-08-09T13:06:26Z

    On Wed, Aug 7, 2019 at 5:46 AM Jeevan Chalke
    <jeevan.chalke@enterprisedb.com> wrote:
    > So, do you mean we should just do fread() and fwrite() for the whole file?
    >
    > I thought it is better if it was done by the OS itself instead of reading 1GB
    > into the memory and writing the same to the file.
    
    Well, 'cp' is just a C program.  If they can write code to copy a
    file, so can we, and then we're not dependent on 'cp' being installed,
    working properly, being in the user's path or at the hard-coded
    pathname we expect, etc.  There's an existing copy_file() function in
    src/backend/storage/file/copydir.c which I'd probably look into
    adapting for frontend use.  I'm not sure whether it would be important
    to adapt the data-flushing code that's present in that routine or
    whether we could get by with just the loop to read() and write() data.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  98. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-08-09T13:10:40Z

    On Thu, Aug 8, 2019 at 8:37 PM Jeevan Ladhe
    <jeevan.ladhe@enterprisedb.com> wrote:
    > +       if (!XLogRecPtrIsInvalid(previous_lsn))
    > +           appendStringInfo(labelfile, "PREVIOUS WAL LOCATION: %X/%X\n",
    > +                            (uint32) (previous_lsn >> 32), (uint32) previous_lsn);
    >
    > May be we should rename to something like:
    > "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP START LOCATION"
    > to make it more intuitive?
    
    So, I think that you are right that PREVIOUS WAL LOCATION might not be
    entirely clear, but at least in my view, INCREMENTAL BACKUP START WAL
    LOCATION is definitely not clear.  This backup is an incremental
    backup, and it has a start WAL location, so you'd end up with START
    WAL LOCATION and INCREMENTAL BACKUP START WAL LOCATION and those sound
    like they ought to both be the same thing, but they're not.  Perhaps
    something like REFERENCE WAL LOCATION or REFERENCE WAL LOCATION FOR
    INCREMENTAL BACKUP would be clearer.
    
    > File header structure is defined in both the files basebackup.c and
    > pg_combinebackup.c. I think it is better to move this to replication/basebackup.h.
    
    Or some other header, but yeah, definitely don't duplicate the struct
    definition (or any other kind of definition).
    
    > IMHO, while labels are not advisable in general, it may be better to use a label
    > here rather than a while(1) loop, so that we can move to the label in case we
    > want to retry once. I think here it opens doors for future bugs if someone
    > happens to add code here, ending up adding some condition and then the
    > break becomes conditional. That will leave us in an infinite loop.
    
    I'm not sure which style is better here, but I don't really buy this argument.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  99. Re: block-level incremental backup

    Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> — 2019-08-09T18:25:47Z

    Hi Robert,
    
    On Fri, Aug 9, 2019 at 6:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
    
    > On Thu, Aug 8, 2019 at 8:37 PM Jeevan Ladhe
    > <jeevan.ladhe@enterprisedb.com> wrote:
    > > +       if (!XLogRecPtrIsInvalid(previous_lsn))
    > > +           appendStringInfo(labelfile, "PREVIOUS WAL LOCATION: %X/%X\n",
    > > +                            (uint32) (previous_lsn >> 32), (uint32)
    > previous_lsn);
    > >
    > > May be we should rename to something like:
    > > "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP
    > START LOCATION"
    > > to make it more intuitive?
    >
    > So, I think that you are right that PREVIOUS WAL LOCATION might not be
    > entirely clear, but at least in my view, INCREMENTAL BACKUP START WAL
    > LOCATION is definitely not clear.  This backup is an incremental
    > backup, and it has a start WAL location, so you'd end up with START
    > WAL LOCATION and INCREMENTAL BACKUP START WAL LOCATION and those sound
    > like they ought to both be the same thing, but they're not.  Perhaps
    > something like REFERENCE WAL LOCATION or REFERENCE WAL LOCATION FOR
    > INCREMENTAL BACKUP would be clearer.
    >
    
    Agree, how about INCREMENTAL BACKUP REFERENCE WAL LOCATION ?
    
    
    > > File header structure is defined in both the files basebackup.c and
    > > pg_combinebackup.c. I think it is better to move this to
    > replication/basebackup.h.
    >
    > Or some other header, but yeah, definitely don't duplicate the struct
    > definition (or any other kind of definition).
    >
    
    Thanks.
    
    
    > > IMHO, while labels are not advisable in general, it may be better to use
    > a label
    > > here rather than a while(1) loop, so that we can move to the label in
    > case we
    > > want to retry once. I think here it opens doors for future bugs if
    > someone
    > > happens to add code here, ending up adding some condition and then the
    > > break becomes conditional. That will leave us in an infinite loop.
    >
    > I'm not sure which style is better here, but I don't really buy this
    > argument.
    
    
    No issues. I am ok either way.
    
    Regards,
    Jeevan Ladhe
    
  100. Re: block-level incremental backup

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2019-08-12T11:57:29Z

    On Fri, Aug 9, 2019 at 6:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
    
    > On Wed, Aug 7, 2019 at 5:46 AM Jeevan Chalke
    > <jeevan.chalke@enterprisedb.com> wrote:
    > > So, do you mean we should just do fread() and fwrite() for the whole
    > file?
    > >
    > > I thought it is better if it was done by the OS itself instead of
    > reading 1GB
    > > into the memory and writing the same to the file.
    >
    > Well, 'cp' is just a C program.  If they can write code to copy a
    > file, so can we, and then we're not dependent on 'cp' being installed,
    > working properly, being in the user's path or at the hard-coded
    > pathname we expect, etc.  There's an existing copy_file() function in
    > src/backend/storage/file/copydir.c which I'd probably look into
    > adapting for frontend use.  I'm not sure whether it would be important
    > to adapt the data-flushing code that's present in that routine or
    > whether we could get by with just the loop to read() and write() data.
    >
    
    Agreed that we can certainly use open(), read(), write(), and close() here,
    but given that pg_basebackup.c and basebackup.c are using stdio file
    operations, I think using fopen(), fread(), fwrite(), and fclose() will be
    better here, at least for consistency.
    
    Let me know if we still want to go with native OS calls.
    
    
    >
    > --
    > Robert Haas
    > EnterpriseDB: http://www.enterprisedb.com
    > The Enterprise PostgreSQL Company
    >
    
    
    -- 
    Jeevan Chalke
    Technical Architect, Product Development
    EnterpriseDB Corporation
    The Enterprise PostgreSQL Company
    
  101. Re: block-level incremental backup

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2019-08-12T11:59:50Z

    On Fri, Aug 9, 2019 at 11:56 PM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com>
    wrote:
    
    > Hi Robert,
    >
    > On Fri, Aug 9, 2019 at 6:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
    >
    >> On Thu, Aug 8, 2019 at 8:37 PM Jeevan Ladhe
    >> <jeevan.ladhe@enterprisedb.com> wrote:
    >> > +       if (!XLogRecPtrIsInvalid(previous_lsn))
    >> > +           appendStringInfo(labelfile, "PREVIOUS WAL LOCATION:
    >> %X/%X\n",
    >> > +                            (uint32) (previous_lsn >> 32), (uint32)
    >> previous_lsn);
    >> >
    >> > May be we should rename to something like:
    >> > "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP
    >> START LOCATION"
    >> > to make it more intuitive?
    >>
    >> So, I think that you are right that PREVIOUS WAL LOCATION might not be
    >> entirely clear, but at least in my view, INCREMENTAL BACKUP START WAL
    >> LOCATION is definitely not clear.  This backup is an incremental
    >> backup, and it has a start WAL location, so you'd end up with START
    >> WAL LOCATION and INCREMENTAL BACKUP START WAL LOCATION and those sound
    >> like they ought to both be the same thing, but they're not.  Perhaps
    >> something like REFERENCE WAL LOCATION or REFERENCE WAL LOCATION FOR
    >> INCREMENTAL BACKUP would be clearer.
    >>
    >
    > Agree, how about INCREMENTAL BACKUP REFERENCE WAL LOCATION ?
    >
    
    +1 for INCREMENTAL BACKUP REFERENCE WA.
    
    
    >
    
    -- 
    Jeevan Chalke
    Technical Architect, Product Development
    EnterpriseDB Corporation
    The Enterprise PostgreSQL Company
    
  102. Re: block-level incremental backup

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2019-08-12T12:03:21Z

    On Mon, Aug 12, 2019 at 5:29 PM Jeevan Chalke <
    jeevan.chalke@enterprisedb.com> wrote:
    
    >
    >
    > On Fri, Aug 9, 2019 at 11:56 PM Jeevan Ladhe <
    > jeevan.ladhe@enterprisedb.com> wrote:
    >
    >> Hi Robert,
    >>
    >> On Fri, Aug 9, 2019 at 6:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
    >>
    >>> On Thu, Aug 8, 2019 at 8:37 PM Jeevan Ladhe
    >>> <jeevan.ladhe@enterprisedb.com> wrote:
    >>> > +       if (!XLogRecPtrIsInvalid(previous_lsn))
    >>> > +           appendStringInfo(labelfile, "PREVIOUS WAL LOCATION:
    >>> %X/%X\n",
    >>> > +                            (uint32) (previous_lsn >> 32), (uint32)
    >>> previous_lsn);
    >>> >
    >>> > May be we should rename to something like:
    >>> > "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP
    >>> START LOCATION"
    >>> > to make it more intuitive?
    >>>
    >>> So, I think that you are right that PREVIOUS WAL LOCATION might not be
    >>> entirely clear, but at least in my view, INCREMENTAL BACKUP START WAL
    >>> LOCATION is definitely not clear.  This backup is an incremental
    >>> backup, and it has a start WAL location, so you'd end up with START
    >>> WAL LOCATION and INCREMENTAL BACKUP START WAL LOCATION and those sound
    >>> like they ought to both be the same thing, but they're not.  Perhaps
    >>> something like REFERENCE WAL LOCATION or REFERENCE WAL LOCATION FOR
    >>> INCREMENTAL BACKUP would be clearer.
    >>>
    >>
    >> Agree, how about INCREMENTAL BACKUP REFERENCE WAL LOCATION ?
    >>
    >
    > +1 for INCREMENTAL BACKUP REFERENCE WA.
    >
    
    Sorry for the typo:
    +1 for the INCREMENTAL BACKUP REFERENCE WAL LOCATION.
    
    
    >
    >>
    >
    > --
    > Jeevan Chalke
    > Technical Architect, Product Development
    > EnterpriseDB Corporation
    > The Enterprise PostgreSQL Company
    >
    >
    
    -- 
    Jeevan Chalke
    Technical Architect, Product Development
    EnterpriseDB Corporation
    The Enterprise PostgreSQL Company
    
  103. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-08-12T12:11:50Z

    On Mon, Aug 12, 2019 at 7:57 AM Jeevan Chalke
    <jeevan.chalke@enterprisedb.com> wrote:
    > Agree that we can certainly use open(), read(), write(), and close() here, but
    > given that pg_basebackup.c and basebackup.c are using file operations, I think
    > using fopen(), fread(), fwrite(), and fclose() will be better here, at least
    > for consistency.
    
    Oh, that's fine.  Whatever's more consistent with the pre-existing
    code. Just, let's not use system().
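
For reference, a chunked copy along the lines agreed above might look like the following. This is a minimal sketch, not the patch's code: copy_whole_file_sketch() and COPY_BUF_SIZE are hypothetical, and the data-flushing (fsync) question raised earlier in the thread is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define COPY_BUF_SIZE (64 * 1024)   /* arbitrary chunk size for this sketch */

/*
 * Copy srcpath to dstpath in fixed-size chunks, so memory use stays at
 * COPY_BUF_SIZE no matter how large the file is.  Returns true on success.
 */
static bool
copy_whole_file_sketch(const char *srcpath, const char *dstpath)
{
    static char buf[COPY_BUF_SIZE];
    FILE       *src;
    FILE       *dst;
    size_t      nread;
    bool        ok = true;

    if ((src = fopen(srcpath, "rb")) == NULL)
        return false;
    if ((dst = fopen(dstpath, "wb")) == NULL)
    {
        fclose(src);
        return false;
    }

    while ((nread = fread(buf, 1, sizeof(buf), src)) > 0)
    {
        if (fwrite(buf, 1, nread, dst) != nread)
        {
            ok = false;         /* short write, e.g. disk full */
            break;
        }
    }
    if (ferror(src))
        ok = false;             /* read error rather than end of file */

    fclose(src);
    if (fclose(dst) != 0)
        ok = false;             /* buffered data could not be flushed */
    return ok;
}
```

Because the function reports failure through its return value, the caller can log the actual error once instead of retrying blindly.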
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  104. Re: block-level incremental backup

    Ibrar Ahmed <ibrar.ahmad@gmail.com> — 2019-08-13T20:47:26Z

    On Mon, Aug 12, 2019 at 4:57 PM Jeevan Chalke <
    jeevan.chalke@enterprisedb.com> wrote:
    
    >
    >
    > On Fri, Aug 9, 2019 at 6:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
    >
    >> On Wed, Aug 7, 2019 at 5:46 AM Jeevan Chalke
    >> <jeevan.chalke@enterprisedb.com> wrote:
    >> > So, do you mean we should just do fread() and fwrite() for the whole
    >> file?
    >> >
    >> > I thought it is better if it was done by the OS itself instead of
    >> reading 1GB
    >> > into the memory and writing the same to the file.
    >>
    >> Well, 'cp' is just a C program.  If they can write code to copy a
    >> file, so can we, and then we're not dependent on 'cp' being installed,
    >> working properly, being in the user's path or at the hard-coded
    >> pathname we expect, etc.  There's an existing copy_file() function in
    >> src/backend/storage/file/copydir.c which I'd probably look into
    >> adapting for frontend use.  I'm not sure whether it would be important
    >> to adapt the data-flushing code that's present in that routine or
    >> whether we could get by with just the loop to read() and write() data.
    >>
    >
    > Agree that we can certainly use open(), read(), write(), and close() here,
    > but given that pg_basebackup.c and basebackup.c are using file operations,
    > I think using fopen(), fread(), fwrite(), and fclose() will be better here,
    > at least for consistency.
    >
    
    +1 for using fopen(), fread(), fwrite(), and fclose()
    
    
    > Let me know if we still want to go with native OS calls.
    >
    >
    
    -1 for OS call
    
    
    >
    >> --
    >> Robert Haas
    >> EnterpriseDB: http://www.enterprisedb.com
    >> The Enterprise PostgreSQL Company
    >>
    >
    >
    > --
    > Jeevan Chalke
    > Technical Architect, Product Development
    > EnterpriseDB Corporation
    > The Enterprise PostgreSQL Company
    >
    >
    
    -- 
    Ibrar Ahmed
    
  105. Re: block-level incremental backup

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2019-08-16T10:23:35Z

    On Fri, Aug 2, 2019 at 6:43 PM vignesh C <vignesh21@gmail.com> wrote:
    
    > Some comments:
    > 1) There will be some link files created for tablespace, we might
    > require some special handling for it
    >
    
    Yep. I have that in my ToDo.
    Will start working on that soon.
    
    
    > 2)
    > Retry functionality is handled only for copying of full files, should
    > we handle retry for copying of partial files
    > 3)
    > we can have some range for maxretries similar to sleeptime
    >
    
    I borrowed the maxretries and sleeptime handling from the pg_standby code.
    
    However, as we don't want to use system() call now, I have
    removed all this kludge and just used fread/fwrite as discussed.
    
    
    > 4)
    > Should we check for malloc failure
    >
    
    Used pg_malloc() instead. Same is also suggested by Ibrar.
    
    
    >
    > 5) Should we add display of progress as backup may take some time,
    > this can be added as enhancement. We can get other's opinion on this.
    >
    
    Can be done afterward once we have the functionality in place.
    
    
    >
    > 6)
    > If the backup count increases providing the input may be difficult,
    > Shall user provide all the incremental backups from a parent folder
    > and can we handle the ordering of incremental backup internally
    >
    
    I am not sure of this yet. We need to provide the tablespace mapping too.
    But thanks for putting a point here. Will keep that in mind when I revisit
    this.
    
    
    >
    > 7)
    > Add verbose for copying whole file
    >
    Done
    
    
    >
    > 8) We can also check if approximate space is available in disk before
    > starting combine backup, this can be added as enhancement. We can get
    > other's opinion on this.
    >
    
    Hmm... will leave it for now. User will get an error anyway.
    
    
    >
    > 9)
    > Combine backup into directory can be combine backup directory
    >
    Done
    
    
    >
    > 10)
    > MAX_INCR_BK_COUNT can be increased little, some applications use 1
    > full backup at the beginning of the month and use 30 incremental
    > backups rest of the days in the month
    >
    
    Yeah, agreed. But any particular number here is debatable.
    Let's see others' opinions too.
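
One way to avoid picking a number at all would be to grow the directory array on demand. A minimal sketch, assuming a hypothetical incr_dir_list type and incr_dir_list_add() helper in place of the fixed IncrDirs[MAX_INCR_BK_COUNT] array:

```c
#include <assert.h>
#include <stdlib.h>

/* Growable list of incremental backup directories (hypothetical type). */
typedef struct
{
    const char **dirs;
    size_t      count;
    size_t      capacity;
} incr_dir_list;

/* Append dir, doubling the array when full.  Returns 0 on success, -1 on OOM. */
static int
incr_dir_list_add(incr_dir_list *list, const char *dir)
{
    if (list->count == list->capacity)
    {
        size_t      newcap = list->capacity ? list->capacity * 2 : 8;
        const char **newdirs = realloc(list->dirs, newcap * sizeof(*newdirs));

        if (newdirs == NULL)
            return -1;
        list->dirs = newdirs;
        list->capacity = newcap;
    }
    list->dirs[list->count++] = dir;
    return 0;
}
```

With this shape, one full backup plus 30 daily incrementals needs no special limit; each -i option simply appends to the list.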
    
    
    > Regards,
    > Vignesh
    > EnterpriseDB: http://www.enterprisedb.com
    >
    
    
    Attached new sets of patches with refactoring done separately.
    Incremental backup patch became small now and hopefully more
    readable than the first version.
    
    -- 
    Jeevan Chalke
    Technical Architect, Product Development
    EnterpriseDB Corporation
    The Enterprise PostgreSQL Company
    
  106. Re: block-level incremental backup

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2019-08-16T10:42:52Z

    On Fri, Aug 9, 2019 at 6:07 AM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com>
    wrote:
    
    > Hi Jeevan,
    >
    > I have reviewed the backup part at code level and still looking into the
    > restore(combine) and functional part of it. But, here are my comments so
    > far:
    >
    
    Thank you Jeevan Ladhe for reviewing the changes.
    
    
    >
    > The patches need rebase.
    >
    
    Done.
    
    
    > May be we should rename to something like:
    > "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP
    > START LOCATION"
    > to make it more intuitive?
    >
    
    As discussed, used "INCREMENTAL BACKUP REFERENCE WAL LOCATION".
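
As a worked example of the agreed label line, the sketch below prints a 64-bit LSN as two hexadecimal halves, the same split the quoted appendStringInfo() call performs. format_reference_lsn() is a hypothetical standalone helper that uses snprintf() instead of the backend's StringInfo API.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the reference LSN line for the backup label.  The LSN is printed
 * as two hexadecimal halves: high 32 bits, a slash, then low 32 bits.
 */
static int
format_reference_lsn(char *out, size_t outlen, uint64_t lsn)
{
    return snprintf(out, outlen,
                    "INCREMENTAL BACKUP REFERENCE WAL LOCATION: %" PRIX32 "/%" PRIX32 "\n",
                    (uint32_t) (lsn >> 32), (uint32_t) lsn);
}
```

For instance, the LSN 0x116B3D4F8 is written as 1/16B3D4F8.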
    
    > File header structure is defined in both the files basebackup.c and
    > pg_combinebackup.c. I think it is better to move this to
    > replication/basebackup.h.
    >
    
    Yep. That was on my cleanup list. Done now.
    
    
    > I think we can avoid having flag isrelfile in sendFile().
    > Something like this:
    >
    > Also, having isrelfile as part of following condition:
    > is confusing, because even the relation files in full backup are going to
    > be
    > backed up by this loop only, but still, the condition reads '(!isrelfile
    > &&...)'.
    >
    
    In the refactored patch I have moved the full backup code into a separate
    function, and all the incremental backup code now lives in its own function
    as well. Hopefully, the code is now more readable.
    
    
    >
    > IMHO, while labels are not advisable in general, it may be better to use a
    > label
    > here rather than a while(1) loop, so that we can move to the label in case
    > we
    > want to retry once. I think here it opens doors for future bugs if someone
    > happens to add code here, ending up adding some condition and then the
    > break becomes conditional. That will leave us in an infinite loop.
    >
    
    I kept it as is as I don't see any correctness issue here.
    
    > Similar to structure partial_file_header, I think above macro can also be
    > moved
    > to basebackup.h instead of defining it twice.
    >
    
    Yes. Done.
    
    
    > I think this is a huge memory request (1GB) and may fail on busy/loaded
    > server at
    > times. We should check for failures of malloc, maybe throw some error on
    > getting ENOMEM as errno.
    >
    
    Agree. Done.
    
    
    > Here, should not we expect statbuf->st_size < (RELSEG_SIZE * BLCKSZ), and
    > it
    > should be safe to read just statbuf->st_size always I guess? But, I am ok
    > with
    > having this extra guard here.
    >
    
    Yes, we can do it this way. Added an Assert() before that and used just
    statbuf->st_size.
    
    > In sendFile(), I am sorry if I am missing something, but I am not able to
    > understand why 'cnt' and 'i' should have different values when they are
    > being
    > passed to verify_page_checksum(). I think passing only one of them should
    > be
    > sufficient.
    >
    
    As discussed offline, you meant to say i and blkno.
    These two are different. i represents the current block's offset within the
    read buffer, whereas blkno is the block's offset from the start of the file.
    For incremental backup they are the same, as we read the whole file at once,
    but they differ in the case of a regular full backup, where we read 4 blocks
    at a time; there, the value of i will be between 0 and 3.
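
A minimal sketch of the relationship between i and blkno described above, reading blkno as a block number counted from the start of the file. The helper name block_number_in_file and its bytes_before_buffer parameter are made up for illustration and are not taken from the patch; BLCKSZ is set to the default 8192.

```c
#include <stddef.h>

#define BLCKSZ 8192             /* default PostgreSQL block size */

/*
 * Hypothetical helper: map the index of a block inside the current read
 * buffer (i) to its block number within the file (blkno), given how many
 * bytes of the file were consumed before this buffer was filled.
 */
static unsigned int
block_number_in_file(size_t bytes_before_buffer, int i)
{
    return (unsigned int) (bytes_before_buffer / BLCKSZ) + (unsigned int) i;
}
```

When the whole file is read in one go, bytes_before_buffer is 0 and the two values coincide; when the file is read 4 blocks at a time, i stays in the range 0..3 while blkno keeps growing.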
    
    
    > Maybe we should just have a variable no_of_blocks to store a number of
    > blocks,
    > rather than calculating this say RELSEG_SIZE(i.e. 131072) times in the
    > worst
    > case.
    >
    
    OK. Done.
    
    
    > Sorry if I am missing something, but, should not it be just:
    >
    > len = cnt;
    >
    
    Yeah. Done.
    
    
    > As I said earlier in my previous email, we now do not need
    > +decode_lsn_internal()
    > as it is already taken care by the introduction of function
    > pg_lsn_in_internal().
    >
    
    Yes. Done, and rebased on latest HEAD.
    
    
    >
    > Regards,
    > Jeevan Ladhe
    >
    
    Patches attached in the previous reply.
    
    -- 
    Jeevan Chalke
    Technical Architect, Product Development
    EnterpriseDB Corporation
    The Enterprise PostgreSQL Company
    
  107. Re: block-level incremental backup

    Ibrar Ahmed <ibrar.ahmad@gmail.com> — 2019-08-16T11:12:32Z

    On Fri, Aug 16, 2019 at 3:24 PM Jeevan Chalke <
    jeevan.chalke@enterprisedb.com> wrote:
    
    >
    >
    > On Fri, Aug 2, 2019 at 6:43 PM vignesh C <vignesh21@gmail.com> wrote:
    >
    >> Some comments:
    >> 1) There will be some link files created for tablespace, we might
    >> require some special handling for it
    >>
    >
    > Yep. I have that in my ToDo.
    > Will start working on that soon.
    >
    >
    >> 2)
    >> Retry functionality is handled only for copying of full files, should
    >> we handle retry for copying of partial files
    >> 3)
    >> we can have some range for maxretries similar to sleeptime
    >>
    >
    > I took help from pg_standby code related to maxretries and sleeptime.
    >
    > However, as we don't want to use system() call now, I have
    > removed all this kludge and just used fread/fwrite as discussed.
    >
    >
    >> 4)
    >> Should we check for malloc failure
    >>
    >
    > Used pg_malloc() instead. Same is also suggested by Ibrar.
    >
    >
    >>
    >> 5) Should we add display of progress as backup may take some time,
    >> this can be added as enhancement. We can get other's opinion on this.
    >>
    >
    > Can be done afterward once we have the functionality in place.
    >
    >
    >>
    >> 6)
    >> If the backup count increases providing the input may be difficult,
    >> Shall user provide all the incremental backups from a parent folder
    >> and can we handle the ordering of incremental backup internally
    >>
    >
    > I am not sure of this yet. We need to provide the tablespace mapping too.
    > But thanks for putting a point here. Will keep that in mind when I revisit
    > this.
    >
    >
    >>
    >> 7)
    >> Add verbose for copying whole file
    >>
    > Done
    >
    >
    >>
    >> 8) We can also check if approximate space is available in disk before
    >> starting combine backup, this can be added as enhancement. We can get
    >> other's opinion on this.
    >>
    >
    > Hmm... will leave it for now. User will get an error anyway.
    >
    >
    >>
    >> 9)
    >> Combine backup into directory can be combine backup directory
    >>
    > Done
    >
    >
    >>
    >> 10)
    >> MAX_INCR_BK_COUNT can be increased little, some applications use 1
    >> full backup at the beginning of the month and use 30 incremental
    >> backups rest of the days in the month
    >>
    >
    > Yeah, agree. But using any number here is debatable.
    > Let's see others opinion too.
    >
    Why not use a list?
    
    
    >
    >
    >> Regards,
    >> Vignesh
    >> EnterpriseDB: http://www.enterprisedb.com
    >>
    >
    >
    > Attached new sets of patches with refactoring done separately.
    > Incremental backup patch became small now and hopefully more
    > readable than the first version.
    >
    > --
    > Jeevan Chalke
    > Technical Architect, Product Development
    > EnterpriseDB Corporation
    > The Enterprise PostgreSQL Company
    >
    >
    
    +       buf = (char *) malloc(statbuf->st_size);
    +       if (buf == NULL)
    +               ereport(ERROR,
    +                               (errcode(ERRCODE_OUT_OF_MEMORY),
    +                                errmsg("out of memory")));
    
    Why are you using malloc? You can use palloc here.
    
    
    
    
    -- 
    Ibrar Ahmed
    
  108. Re: block-level incremental backup

    Ibrar Ahmed <ibrar.ahmad@gmail.com> — 2019-08-16T14:36:50Z

    On Fri, Aug 16, 2019 at 4:12 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
    
    >
    >
    >
    >
    > On Fri, Aug 16, 2019 at 3:24 PM Jeevan Chalke <
    > jeevan.chalke@enterprisedb.com> wrote:
    >
    >>
    >>
    >> On Fri, Aug 2, 2019 at 6:43 PM vignesh C <vignesh21@gmail.com> wrote:
    >>
    >>> Some comments:
    >>> 1) There will be some link files created for tablespace, we might
    >>> require some special handling for it
    >>>
    >>
    >> Yep. I have that in my ToDo.
    >> Will start working on that soon.
    >>
    >>
    >>> 2)
    >>> Retry functionality is handled only for copying of full files, should
    >>> we handle retry for copying of partial files
    >>> 3)
    >>> we can have some range for maxretries similar to sleeptime
    >>>
    >>
    >> I took help from pg_standby code related to maxretries and sleeptime.
    >>
    >> However, as we don't want to use system() call now, I have
    >> removed all this kludge and just used fread/fwrite as discussed.
    >>
    >>
    >>> 4)
    >>> Should we check for malloc failure
    >>>
    >>
    >> Used pg_malloc() instead. Same is also suggested by Ibrar.
    >>
    >>
    >>>
    >>> 5) Should we add display of progress as backup may take some time,
    >>> this can be added as enhancement. We can get other's opinion on this.
    >>>
    >>
    >> Can be done afterward once we have the functionality in place.
    >>
    >>
    >>>
    >>> 6)
    >>> If the backup count increases providing the input may be difficult,
    >>> Shall user provide all the incremental backups from a parent folder
    >>> and can we handle the ordering of incremental backup internally
    >>>
    >>
    >> I am not sure of this yet. We need to provide the tablespace mapping too.
    >> But thanks for putting a point here. Will keep that in mind when I
    >> revisit this.
    >>
    >>
    >>>
    >>> 7)
    >>> Add verbose for copying whole file
    >>>
    >> Done
    >>
    >>
    >>>
    >>> 8) We can also check if approximate space is available in disk before
    >>> starting combine backup, this can be added as enhancement. We can get
    >>> other's opinion on this.
    >>>
    >>
    >> Hmm... will leave it for now. User will get an error anyway.
    >>
    >>
    >>>
    >>> 9)
    >>> Combine backup into directory can be combine backup directory
    >>>
    >> Done
    >>
    >>
    >>>
    >>> 10)
    >>> MAX_INCR_BK_COUNT can be increased little, some applications use 1
    >>> full backup at the beginning of the month and use 30 incremental
    >>> backups rest of the days in the month
    >>>
    >>
    >> Yeah, agree. But using any number here is debatable.
    >> Let's see others opinion too.
    >>
    > Why not use a list?
    >
    >
    >>
    >>
    >>> Regards,
    >>> Vignesh
    >>> EnterpriseDB: http://www.enterprisedb.com
    >>>
    >>
    >>
    >> Attached new sets of patches with refactoring done separately.
    >> Incremental backup patch became small now and hopefully more
    >> readable than the first version.
    >>
    >> --
    >> Jeevan Chalke
    >> Technical Architect, Product Development
    >> EnterpriseDB Corporation
    >> The Enterprise PostgreSQL Company
    >>
    >>
    >
    > +       buf = (char *) malloc(statbuf->st_size);
    >
    > +       if (buf == NULL)
    >
    > +               ereport(ERROR,
    >
    > +                               (errcode(ERRCODE_OUT_OF_MEMORY),
    >
    > +                                errmsg("out of memory")));
    >
    > Why are you using malloc, you can use palloc here.
    >
    >
    >
    Hi, I gave another look at the patch and have some quick comments.
    
    
    -
    > char       *extptr = strstr(fn, ".partial");
    
    I think there should be a better and stricter way to check the file
    extension.
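
One way to make the check stricter, as a sketch only: strstr() matches ".partial" anywhere in the name (for example in "16384.partial.bak"), whereas comparing the tail of the string accepts it only as a real suffix. The helper name has_partial_suffix is hypothetical and does not come from the patch.

```c
#include <stdbool.h>
#include <string.h>

#define PARTIAL_SUFFIX ".partial"

/* Return true only when fn ends with ".partial" and has a non-empty stem. */
static bool
has_partial_suffix(const char *fn)
{
    size_t      fnlen = strlen(fn);
    size_t      suflen = strlen(PARTIAL_SUFFIX);

    if (fnlen <= suflen)
        return false;
    return strcmp(fn + fnlen - suflen, PARTIAL_SUFFIX) == 0;
}
```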
    
    -
    > +               extptr = strstr(outfn, ".partial");
    > +               Assert (extptr != NULL);
    
    Why are you checking that again? You just appended it in the statement
    above.
    
    -
    > +       if (verbose && statbuf.st_size > (RELSEG_SIZE * BLCKSZ))
    > +               pg_log_info("found big file \"%s\" (size: %.2lfGB): %m", fromfn,
    > +                                       (double) statbuf.st_size / (RELSEG_SIZE * BLCKSZ));
    
    This should not be just a log message: if you find a file that is bigger
    than a segment, something is surely wrong.
    
    -
    > +        * We do read entire 1GB file in memory while taking incremental backup; so
    > +        * I don't see any reason why can't we do that here.  Also, copying data in
    > +        * chunks is expensive.  However, for bigger files, we still slice at 1GB
    > +        * border.
    
    
    What do you mean by a bigger file, a file greater than 1GB? In which case
    would you get a file > 1GB?
    
    
    -- 
    Ibrar Ahmed
    
  109. Re: block-level incremental backup

    vignesh C <vignesh21@gmail.com> — 2019-08-27T11:16:32Z

    On Fri, Aug 16, 2019 at 8:07 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
    >
    > What do you mean by bigger file, a file greater than 1GB? In which case you get file > 1GB?
    >
    >
    >
    A few comments:
    Comment:
    + buf = (char *) malloc(statbuf->st_size);
    + if (buf == NULL)
    +     ereport(ERROR,
    +             (errcode(ERRCODE_OUT_OF_MEMORY),
    +              errmsg("out of memory")));
    +
    + if ((cnt = fread(buf, 1, statbuf->st_size, fp)) > 0)
    + {
    +     Bitmapset  *mod_blocks = NULL;
    +     int         nmodblocks = 0;
    +
    +     if (cnt % BLCKSZ != 0)
    +     {
    
    We can use the same size as the full page size.
    After pg_start_backup(), full page writes will be enabled.
    We can use the same file size to maintain data consistency.
    
    Comment:
    /* Validate given LSN and convert it into XLogRecPtr. */
    + opt->lsn = pg_lsn_in_internal(strVal(defel->arg), &have_error);
    + if (XLogRecPtrIsInvalid(opt->lsn))
    +     ereport(ERROR,
    +             (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
    +              errmsg("invalid value for LSN")));
    
    We should validate that the input LSN is less than the current system LSN.
    
    Comment:
    /* Validate given LSN and convert it into XLogRecPtr. */
    + opt->lsn = pg_lsn_in_internal(strVal(defel->arg), &have_error);
    + if (XLogRecPtrIsInvalid(opt->lsn))
    +     ereport(ERROR,
    +             (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
    +              errmsg("invalid value for LSN")));
    
    Should we check that it is on the same timeline as the system's timeline?
    
    Comment:
    + if (fread(blkdata, 1, BLCKSZ, infp) != BLCKSZ)
    + {
    +     pg_log_error("could not read from file \"%s\": %m", outfn);
    +     cleanup_filemaps(filemaps, fmindex + 1);
    +     exit(1);
    + }
    +
    + /* Finally write one block to the output file */
    + if (fwrite(blkdata, 1, BLCKSZ, outfp) != BLCKSZ)
    + {
    +     pg_log_error("could not write to file \"%s\": %m", outfn);
    +     cleanup_filemaps(filemaps, fmindex + 1);
    +     exit(1);
    + }
    
    Should we support the compression formats supported by pg_basebackup?
    This can be an enhancement after the functionality is completed.
    
    Comment:
    We should provide some mechanism to validate the backup, to identify
    whether a backup is corrupt or some file is missing (deleted) from a
    backup.
    
    Comment:
    + ofp = fopen(tofn, "wb");
    + if (ofp == NULL)
    + {
    +     pg_log_error("could not create file \"%s\": %m", tofn);
    +     exit(1);
    + }
    
    ifp should be closed in the error flow.
    
    Comment:
    + fp = fopen(filename, "r");
    + if (fp == NULL)
    + {
    +     pg_log_error("could not read file \"%s\": %m", filename);
    +     exit(1);
    + }
    +
    + labelfile = pg_malloc(statbuf.st_size + 1);
    + if (fread(labelfile, 1, statbuf.st_size, fp) != statbuf.st_size)
    + {
    +     pg_log_error("corrupted file \"%s\": %m", filename);
    +     pg_free(labelfile);
    +     exit(1);
    + }
    
    fclose can be moved above.
    
    Comment:
    + if (!modifiedblockfound)
    + {
    +     copy_whole_file(fm->filename, outfn);
    +     cleanup_filemaps(filemaps, fmindex + 1);
    +     return;
    + }
    +
    + /* Write all blocks to the output file */
    +
    + if (fstat(fileno(fm->fp), &statbuf) != 0)
    + {
    +     pg_log_error("could not stat file \"%s\": %m", fm->filename);
    +     pg_free(filemaps);
    +     exit(1);
    + }
    
    In some error flows, cleanup_filemaps needs to be called to close the file
    descriptors that are open.
    
    Comment:
    +/*
    + * When to send the whole file, % blocks modified (90%)
    + */
    +#define WHOLE_FILE_THRESHOLD 0.9
    +
    
    This can be a user-configurable value.
    This can be an enhancement after the functionality is completed.
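
A rough sketch of how such a threshold could be applied if it were passed in rather than hard-coded. The function should_send_whole_file, its parameters, the use of >= for the comparison, and the fallback for an empty file are all assumptions made for illustration; only the 0.9 value comes from the patch hunk quoted above.

```c
#include <stdbool.h>

#define WHOLE_FILE_THRESHOLD 0.9    /* default taken from the patch */

/*
 * Hypothetical helper: decide whether a file should be sent whole, given
 * the number of modified blocks, the total number of blocks, and a
 * caller-supplied threshold (fraction of blocks modified).
 */
static bool
should_send_whole_file(int nmodblocks, int nblocks, double threshold)
{
    /* Assumption: with no blocks to compare, fall back to sending the file. */
    if (nblocks <= 0)
        return true;
    return (double) nmodblocks / nblocks >= threshold;
}
```

A caller that has no user setting would simply pass WHOLE_FILE_THRESHOLD.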
    
    
    Comment:
    We can add a readme file with all the details regarding incremental
    backup and combine backup.
    
    Regards,
    Vignesh
    EnterpriseDB: http://www.enterprisedb.com
    
    
    
    
  110. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-08-27T18:29:34Z

    On Fri, Aug 16, 2019 at 6:23 AM Jeevan Chalke
    <jeevan.chalke@enterprisedb.com> wrote:
    > [ patches ]
    
    Reviewing 0002 and 0003:
    
    - Commit message for 0003 claims magic number and checksum are 0, but
    that (fortunately) doesn't seem to be the case.
    
    - looks_like_rel_name actually checks whether it looks like a
    *non-temporary* relation name; suggest adjusting the function name.
    
    - The names do_full_backup and do_incremental_backup are quite
    confusing because you're really talking about what to do with one
    file.  I suggest sendCompleteFile() and sendPartialFile().
    
    - Is there any good reason to have 'refptr' as a global variable, or
    could we just pass the LSN around via function arguments?  I know it's
    just mimicking startptr, but storing startptr in a global variable
    doesn't seem like a great idea either, so if it's not too annoying,
    let's pass it down via function arguments instead.  Also, refptr is a
    crappy name (even worse than startptr); whether we end up with a
    global variable or a bunch of local variables, let's make the name(s)
    clear and unambiguous, like incremental_reference_lsn.  Yeah, I know
    that's long, but I still think it's better than being unclear.
    
    - do_incremental_backup looks like it can never report an error from
    fread(), which is bad.  But I see that this is just copied from the
    existing code which has the same problem, so I started a separate
    thread about that.
    
    - I think that passing cnt and blkindex to verify_page_checksum()
    doesn't look very good from an abstraction point of view.  Granted,
    the existing code isn't great either, but I think this makes the
    problem worse.  I suggest passing "int backup_distance" to this
    function, computed as cnt - BLCKSZ * blkindex.  Then, you can
    fseek(-backup_distance), fread(BLCKSZ), and then fseek(backup_distance
    - BLCKSZ).
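
The seek arithmetic in the previous point can be sketched as follows. This is only an illustration of the fseek/fread/fseek sequence: the helper name reread_block is hypothetical, BLCKSZ is fixed at the default 8192, and the stream is assumed to be positioned just past a buffer of cnt bytes that has already been read.

```c
#include <stdio.h>

#define BLCKSZ 8192

/*
 * Hypothetical helper: re-read one block that lies backup_distance bytes
 * behind the current stream position, where
 *     backup_distance = cnt - BLCKSZ * blkindex
 * and then restore the original position.  Returns 0 on success, -1 on
 * any seek or read failure.
 */
static int
reread_block(FILE *fp, long backup_distance, char *page)
{
    /* Step back to the start of the block to be re-read. */
    if (fseek(fp, -backup_distance, SEEK_CUR) != 0)
        return -1;
    /* Read exactly one block; this advances the position by BLCKSZ. */
    if (fread(page, 1, BLCKSZ, fp) != (size_t) BLCKSZ)
        return -1;
    /* Skip forward over the rest of the buffer we had already consumed. */
    if (fseek(fp, backup_distance - BLCKSZ, SEEK_CUR) != 0)
        return -1;
    return 0;
}
```

With this shape the caller computes a single distance and the helper needs neither cnt nor blkindex.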
    
    - While I generally support the use of while and for loops rather than
    goto for flow control, a while (1) loop that ends with a break is
    functionally a goto anyway.  I think there are several ways this could
    be revised.  The most obvious one is probably to use goto, but I vote
    for inverting the sense of the test: if (PageIsNew(page) ||
    PageGetLSN(page) >= startptr) break; This approach also saves a level
    of indentation for more than half of the function.
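
To illustrate the inverted test, here is a simplified stand-in for the retry loop. It is not the patch's code: SketchPage and verify_page_sketch are invented names, the fields replace PageIsNew(), PageGetLSN() and the real checksum computation, and the re-read of the block is simulated by a precomputed flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a page and the outcome of checking it. */
typedef struct
{
    bool        is_new;                 /* stands in for PageIsNew() */
    uint64_t    lsn;                    /* stands in for PageGetLSN() */
    bool        checksum_ok;            /* checksum result on the first read */
    bool        checksum_ok_on_reread;  /* checksum result after a re-read */
} SketchPage;

/* Returns true if the page verifies or does not need verification. */
static bool
verify_page_sketch(SketchPage *page, uint64_t startptr)
{
    bool        retried = false;

    for (;;)
    {
        /* Inverted test: pages we must not verify leave the loop at once. */
        if (page->is_new || page->lsn >= startptr)
            break;

        if (page->checksum_ok)
            break;

        /* Mismatch: allow exactly one retry, then give up. */
        if (retried)
            return false;
        page->checksum_ok = page->checksum_ok_on_reread;    /* simulated re-read */
        retried = true;
    }
    return true;
}
```

The body after the early break sits one indentation level shallower than it would inside an if block, which is the saving mentioned above.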
    
    - I am not sure that it's a good idea for sendwholefile = true to
    result in dumping the entire file onto the wire in a single CopyData
    message.  I don't know of a concrete problem in typical
    configurations, but someone who increases RELSEG_SIZE might be able to
    overflow CopyData's length word.  At 2GB the length word would be
    negative, which might break, and at 4GB it would wrap around, which
    would certainly break.  See CopyData in
    https://www.postgresql.org/docs/12/protocol-message-formats.html  To
    avoid this issue, and maybe some others, I suggest defining a
    reasonably large chunk size, say 1MB as a constant in this file
    someplace, and sending the data as a series of chunks of that size.
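
A sketch of the chunking idea, under stated assumptions: SEND_CHUNK_SIZE, chunk_sink, send_in_chunks and count_bytes are names invented here, and the sink callback stands in for whatever would emit one CopyData message in the server. The point is only that no single message ever carries more than the chunk size, so the message length stays far below the 2GB and 4GB limits discussed above regardless of RELSEG_SIZE.

```c
#include <stddef.h>

#define SEND_CHUNK_SIZE (1024 * 1024)   /* 1MB, the size suggested above */

/* Hypothetical sink; in the server this would emit one CopyData message. */
typedef void (*chunk_sink) (const char *data, size_t len, void *arg);

/*
 * Hand len bytes to the sink in pieces of at most SEND_CHUNK_SIZE bytes.
 * Returns the number of chunks produced.
 */
static size_t
send_in_chunks(const char *buf, size_t len, chunk_sink sink, void *arg)
{
    size_t      sent = 0;
    size_t      nchunks = 0;

    while (sent < len)
    {
        size_t      n = len - sent;

        if (n > SEND_CHUNK_SIZE)
            n = SEND_CHUNK_SIZE;
        sink(buf + sent, n, arg);
        sent += n;
        nchunks++;
    }
    return nchunks;
}

/* Example sink that only adds up the bytes it is handed. */
static void
count_bytes(const char *data, size_t len, void *arg)
{
    (void) data;
    *(size_t *) arg += len;
}
```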
    
    - I don't think that the way concurrent truncation is handled is
    correct for partial files.  Right now it just falls through to code
    which appends blocks of zeroes in either the complete-file or
    partial-file case.  I think that logic should be moved into the
    function that handles the complete-file case.  In the partial-file
    case, the blocks that we actually send need to match the list of block
    numbers we promised to send.  We can't just send the promised blocks
    and then tack a bunch of zero-filled blocks onto the end that the file
    header doesn't know about.
    
    - For reviewer convenience, please use the -v option to git
    format-patch when posting and reposting a patch series.  Using -v2,
    -v3, etc. on successive versions really helps.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  111. Re: block-level incremental backup

    Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> — 2019-08-29T14:41:04Z

    Due to the inherent nature of pg_basebackup, the incremental backup also
    allows taking backup in tar and compressed format. But, pg_combinebackup
    does not understand how to restore this. I think we should either make
    pg_combinebackup support restoration of tar incremental backup or restrict
    taking the incremental backup in tar format until pg_combinebackup
    supports the restoration by making option '--lsn' and '-Ft' exclusive.
    
    It is arguable that one can take the incremental backup in tar format,
    extract that manually and then give the resultant directory as input to the
    pg_combinebackup, but I think that kills the purpose of having
    pg_combinebackup utility.
    
    Thoughts?
    
    Regards,
    Jeevan Ladhe
    
  112. Re: block-level incremental backup

    Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> — 2019-08-30T12:56:31Z

    Hi,
    
    I am doing some testing on the pg_basebackup and pg_combinebackup patches. I
    have also tried to create TAP tests for pg_combinebackup, taking the
    pg_basebackup TAP cases as a reference.
    Attaching a first draft of the test patch.
    
    I have done some testing with the compression options; both -z and -Z level
    work with incremental backup.
    
    A minor comment: the pg_combinebackup help says that a maximum of 10
    incremental backups can be given with the -i option, but I found that at most
    9 incremental backup directories can be given at a time.
    
    Thanks & Regards,
    Rajkumar Raghuwanshi
    QMG, EnterpriseDB Corporation
    
    
    
  113. Re: block-level incremental backup

    Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> — 2019-08-30T13:21:50Z

    Here are some comments:
    
    
    +/* The reference XLOG position for the incremental backup. */
    +static XLogRecPtr refptr;
    
    As Robert already pointed out, we may want to pass this around as a
    parameter instead of using a global variable. Also, it can be renamed to
    something like incr_backup_refptr. I see that in your earlier version of the
    patch this was named startincrptr, which I think was more meaningful.
    
    ---------
    
        /*
         * If incremental backup, see whether the filename is a relation filename
         * or not.
         */
    
    This can be reworded to something like:
    "If incremental backup, check if it is a relation file and can be sent
    partially."
    
    ---------
    
    +           if (verify_checksum)
    +           {
    +               ereport(WARNING,
    +                       (errmsg("cannot verify checksum in file \"%s\", block "
    +                               "%d: read buffer size %d and page size %d "
    +                               "differ",
    +                               readfilename, blkno, (int) cnt, BLCKSZ)));
    +               verify_checksum = false;
    +           }
    
    For do_incremental_backup(), it does not make sense to show the block number
    in the warning, as it is always going to be 0 when we throw this warning.
    Further, I think this can be rephrased as:
    "cannot verify checksum in file \"%s\", read file size %d is not multiple of
    page size %d".
    
    Or maybe we can just say "cannot verify checksum in file \"%s\"" if checksum
    verification was requested, disable checksum verification, and leave the
    rest to the following message:
    
    +           ereport(WARNING,
    +                   (errmsg("file size (%d) not in multiple of page size (%d), sending whole file",
    +                           (int) cnt, BLCKSZ)));
    
    ---------
    
    If you agree with the above comment about blkno, then we can move the
    declaration of blkno inside the condition "if (!sendwholefile)" in
    do_incremental_backup(), or avoid it altogether and just pass "i" as both
    blkindex and blkno to verify_page_checksum(). Maybe add a comment explaining
    why they are the same in the case of incremental backup.
    
    ---------
    
    I think we should give the user a hint about where to read the input LSN for
    an incremental backup, in the --help output as well as in the documentation.
    Something like: "To take an incremental backup, please provide the value of
    "--lsn" as the "START WAL LOCATION" of the previously taken full or
    incremental backup, from its backup_label file."
    
    ---------
    
    pg_combinebackup:
    
    +static bool made_new_outputdata = false;
    +static bool found_existing_outputdata = false;
    
    Both of these are global; I understand that we need them to be global so that
    they are accessible in cleanup_directories_atexit(). But they are also passed
    to verify_dir_is_empty_or_create() as parameters, which I think is not
    needed. Instead, verify_dir_is_empty_or_create() can change the globals
    directly.
    
    ---------
    
    I see that checksum_failure is never set and always remains false. Maybe
    it is something that you wanted to set in combine_partial_files() when a
    corrupted partial file is detected?
    
    ---------
    
    I think the logic for verifying the backup chain should be moved out of the
    main() function into a separate function.
    
    ---------
    
    + /*
    + * Verify the backup chain.  INCREMENTAL BACKUP REFERENCE WAL LOCATION of
    + * the incremental backup must match with the START WAL LOCATION of the
    + * previous backup, until we reach a full backup in which there is no
    + * INCREMENTAL BACKUP REFERENCE WAL LOCATION.
    + */
    
    The current logic assumes that the incremental backup directories are
    provided as input in the order in which the backups were taken. This is a bit
    confusing unless clarified in the pg_combinebackup help output or the
    documentation. I think we should clarify it in both places.
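
The chain rule in the quoted comment can be sketched as a standalone check. Everything named here (SketchLSN, SketchBackup, backup_chain_is_valid) is hypothetical, and the sketch assumes the backups are given oldest first, with the full backup at index 0 and an invalid (zero) reference LSN marking a full backup.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t SketchLSN;
#define SKETCH_INVALID_LSN ((SketchLSN) 0)

typedef struct
{
    SketchLSN   start_lsn;      /* START WAL LOCATION */
    SketchLSN   reference_lsn;  /* INCREMENTAL BACKUP REFERENCE WAL LOCATION,
                                 * or SKETCH_INVALID_LSN for a full backup */
} SketchBackup;

/*
 * backups[0] must be the full backup; each later entry must reference the
 * start LSN of the entry just before it.
 */
static bool
backup_chain_is_valid(const SketchBackup *backups, int n)
{
    if (n < 1 || backups[0].reference_lsn != SKETCH_INVALID_LSN)
        return false;

    for (int i = 1; i < n; i++)
    {
        if (backups[i].reference_lsn == SKETCH_INVALID_LSN ||
            backups[i].reference_lsn != backups[i - 1].start_lsn)
            return false;
    }
    return true;
}
```

A check of this shape rejects directories given out of order, which is why the expected ordering is worth spelling out for users.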
    
    ---------
    
    I think scan_directory() should rather be renamed to do_combinebackup().
    
    Regards,
    Jeevan Ladhe
    
    
  114. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-08-31T02:58:53Z

    On Thu, Aug 29, 2019 at 10:41 AM Jeevan Ladhe
    <jeevan.ladhe@enterprisedb.com> wrote:
    > Due to the inherent nature of pg_basebackup, the incremental backup also
    > allows taking backup in tar and compressed format. But, pg_combinebackup
    > does not understand how to restore this. I think we should either make
    > pg_combinebackup support restoration of tar incremental backup or restrict
    > taking the incremental backup in tar format until pg_combinebackup
    > supports the restoration by making option '--lsn' and '-Ft' exclusive.
    >
    > It is arguable that one can take the incremental backup in tar format, extract
    > that manually and then give the resultant directory as input to the
    > pg_combinebackup, but I think that kills the purpose of having
    > pg_combinebackup utility.
    
    I don't agree. You're right that you would have to untar (and
    uncompress) the backup to run pg_combinebackup, but you would also
    have to do that to restore a non-incremental backup, so it doesn't
    seem much different.  It's true that for an incremental backup you
    might need to untar and uncompress multiple prior backups rather than
    just one, but that's just the nature of an incremental backup.  And,
    on a practical level, if you want compression, which is pretty likely
    if you're thinking about incremental backups, the way to get that is
    to use tar format with -z or -Z.
    
    It might be interesting to teach pg_combinebackup to be able to read
    tar-format backups, but I think that there are several variants of the
    tar format, and I suspect it would need to read them all.  If someone
    un-tars and re-tars a backup with a different tar tool, we don't want
    it to become unreadable.  So we'd either have to write our own
    de-tarring library or add an external dependency on one.  I don't
    think it's worth doing that at this point; I definitely don't think it
    needs to be part of the first patch.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  115. Re: block-level incremental backup

    Ibrar Ahmed <ibrar.ahmad@gmail.com> — 2019-08-31T19:40:28Z

    On Sat, Aug 31, 2019 at 7:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
    
    > On Thu, Aug 29, 2019 at 10:41 AM Jeevan Ladhe
    > <jeevan.ladhe@enterprisedb.com> wrote:
    > > Due to the inherent nature of pg_basebackup, the incremental backup also
    > > allows taking backup in tar and compressed format. But, pg_combinebackup
    > > does not understand how to restore this. I think we should either make
    > > pg_combinebackup support restoration of tar incremental backup or
    > restrict
    > > taking the incremental backup in tar format until pg_combinebackup
    > > supports the restoration by making option '--lsn' and '-Ft' exclusive.
    > >
    > > It is arguable that one can take the incremental backup in tar format,
    > extract
    > > that manually and then give the resultant directory as input to the
    > > pg_combinebackup, but I think that kills the purpose of having
    > > pg_combinebackup utility.
    >
    > I don't agree. You're right that you would have to untar (and
    > uncompress) the backup to run pg_combinebackup, but you would also
    > have to do that to restore a non-incremental backup, so it doesn't
    > seem much different.  It's true that for an incremental backup you
    > might need to untar and uncompress multiple prior backups rather than
    > just one, but that's just the nature of an incremental backup.  And,
    > on a practical level, if you want compression, which is pretty likely
    > if you're thinking about incremental backups, the way to get that is
    > to use tar format with -z or -Z.
    >
    > It might be interesting to teach pg_combinebackup to be able to read
    > tar-format backups, but I think that there are several variants of the
    > tar format, and I suspect it would need to read them all.  If someone
    > un-tars and re-tars a backup with a different tar tool, we don't want
    > it to become unreadable.  So we'd either have to write our own
    > de-tarring library or add an external dependency on one.
    
    
    Are we using any tar library in pg_basebackup.c? We already have the
    capability in pg_basebackup to do that.
    
    
    
    > I don't
    > think it's worth doing that at this point; I definitely don't think it
    > needs to be part of the first patch.
    >
    > --
    > Robert Haas
    > EnterpriseDB: http://www.enterprisedb.com
    > The Enterprise PostgreSQL Company
    >
    >
    >
    
    -- 
    Ibrar Ahmed
    
  116. Re: block-level incremental backup

    Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> — 2019-09-02T07:39:39Z

    Hi Robert,
    
    On Sat, Aug 31, 2019 at 8:29 AM Robert Haas <robertmhaas@gmail.com> wrote:
    
    > On Thu, Aug 29, 2019 at 10:41 AM Jeevan Ladhe
    > <jeevan.ladhe@enterprisedb.com> wrote:
    > > Due to the inherent nature of pg_basebackup, the incremental backup also
    > > allows taking backup in tar and compressed format. But, pg_combinebackup
    > > does not understand how to restore this. I think we should either make
    > > pg_combinebackup support restoration of tar incremental backup or
    > > restrict taking the incremental backup in tar format until
    > > pg_combinebackup supports the restoration by making option '--lsn' and
    > > '-Ft' exclusive.
    > >
    > > It is arguable that one can take the incremental backup in tar format,
    > > extract that manually and then give the resultant directory as input to the
    > > pg_combinebackup, but I think that kills the purpose of having
    > > pg_combinebackup utility.
    >
    > I don't agree. You're right that you would have to untar (and
    > uncompress) the backup to run pg_combinebackup, but you would also
    > have to do that to restore a non-incremental backup, so it doesn't
    > seem much different.
    >
    
    Thanks. Yes I agree about the similarity between restoring non-incremental
    and incremental backup in this case.
    
    
    > I don't think it's worth doing that at this point; I definitely don't
    > think it needs to be part of the first patch.
    >
    
    Makes sense.
    
    Regards,
    Jeevan Ladhe
    
  117. Re: block-level incremental backup

    Dilip Kumar <dilipbalaut@gmail.com> — 2019-09-03T06:41:38Z

    On Fri, Aug 16, 2019 at 3:54 PM Jeevan Chalke
    <jeevan.chalke@enterprisedb.com> wrote:
    >
    0003:
    +/*
    + * When to send the whole file, % blocks modified (90%)
    + */
    +#define WHOLE_FILE_THRESHOLD 0.9
    
    How was this threshold selected?  Was it determined by some test?
    
    
    - magic number, currently 0 (4 bytes)
    I think in the patch we are using  (#define INCREMENTAL_BACKUP_MAGIC
    0x494E4352) as a magic number, not 0
    
    
    + Assert(statbuf->st_size <= (RELSEG_SIZE * BLCKSZ));
    +
    + buf = (char *) malloc(statbuf->st_size);
    + if (buf == NULL)
    + ereport(ERROR,
    + (errcode(ERRCODE_OUT_OF_MEMORY),
    + errmsg("out of memory")));
    +
    + if ((cnt = fread(buf, 1, statbuf->st_size, fp)) > 0)
    + {
    + Bitmapset  *mod_blocks = NULL;
    + int nmodblocks = 0;
    +
    + if (cnt % BLCKSZ != 0)
    + {
    
    It would be good to add some comments for the if block and also for the
    assert. Actually, the intent is not very clear from the code.
    
    0004:
    +#include <time.h>
    +#include <sys/stat.h>
    +#include <unistd.h>
    Header file include order (sys/stat.h should come before time.h)
    
    
    
    + printf(_("%s combines full backup with incremental backup.\n\n"), progname);
    /backup/backups
    
    
    + * scan_file
    + *
    + * Checks whether given file is partial file or not.  If partial, then combines
    + * it into a full backup file, else copies as is to the output directory.
    + */
    
    /If partial, then combines/ If partial, then combine
    
    
    
    +static void
    +combine_partial_files(const char *fn, char **IncrDirs, int nIncrDir,
    +   const char *subdirpath, const char *outfn)
    + /*
    + * Open all files from all incremental backup directories and create a file
    + * map.
    + */
    + basefilefound = false;
    + for (i = (nIncrDir - 1), fmindex = 0; i >= 0; i--, fmindex++)
    + {
    + fm = &filemaps[fmindex];
    +
    .....
    + }
    +
    +
    + /* Process all opened files. */
    + lastblkno = 0;
    + modifiedblockfound = false;
    + for (i = 0; i < fmindex; i++)
    + {
    + char    *buf;
    + int hsize;
    + int k;
    + int blkstartoffset;
    ......
    + }
    +
    + for (i = 0; i <= lastblkno; i++)
    + {
    + char blkdata[BLCKSZ];
    + FILE    *infp;
    + int offset;
    ...
    + }
    }
    
    Can we break this function down into 2-3 functions?  At least creating
    the file map can go directly into a separate function.
    
    I have read the 0003 and 0004 patches and have a few cosmetic comments.
    
    
    -- 
    Regards,
    Dilip Kumar
    EnterpriseDB: http://www.enterprisedb.com
    
    
    
    
  118. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-09-03T12:59:53Z

    On Sat, Aug 31, 2019 at 3:41 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
    > Are we using any tar library in pg_basebackup.c? We already have the capability
    > in pg_basebackup to do that.
    
    I think pg_basebackup is using homebrew code to generate tar files,
    but I'm reluctant to do that for reading tar files.  For generating a
    file, you can always emit the newest and "best" tar format, but for
    reading a file, you probably want to be prepared for older or cruftier
    variants.  Maybe not -- I'm not super-familiar with the tar on-disk
    format.  But I think there must be a reason why tar libraries exist,
    and I don't want to write a new one.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  119. Re: block-level incremental backup

    Ibrar Ahmed <ibrar.ahmad@gmail.com> — 2019-09-03T14:04:59Z

    On Tue, Sep 3, 2019 at 6:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
    
    > On Sat, Aug 31, 2019 at 3:41 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
    > > Are we using any tar library in pg_basebackup.c? We already have the
    > capability
    > > in pg_basebackup to do that.
    >
    > I think pg_basebackup is using homebrew code to generate tar files,
    > but I'm reluctant to do that for reading tar files.  For generating a
    > file, you can always emit the newest and "best" tar format, but for
    > reading a file, you probably want to be prepared for older or cruftier
    > variants.  Maybe not -- I'm not super-familiar with the tar on-disk
    > format.  But I think there must be a reason why tar libraries exist,
    > and I don't want to write a new one.
    >
    +1 using the library to tar. But I think reason not using tar library is TAR is
    one of the most simple file format. What is the best/newest format of TAR?
    
    >
    > --
    > Robert Haas
    > EnterpriseDB: http://www.enterprisedb.com
    > The Enterprise PostgreSQL Company
    >
    
    
    -- 
    Ibrar Ahmed
    
  120. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-09-03T14:39:11Z

    On Tue, Sep 3, 2019 at 10:05 AM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
    > +1 using the library to tar. But I think reason not using tar library is TAR is
    > one of the most simple file format. What is the best/newest format of TAR?
    
    So, I don't really want to go down this path at all, as I already
    said.  You can certainly do your own research on this topic if you
    wish.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  121. Re: block-level incremental backup

    Tom Lane <tgl@sss.pgh.pa.us> — 2019-09-03T15:00:22Z

    Ibrar Ahmed <ibrar.ahmad@gmail.com> writes:
    > +1 using the library to tar.
    
    Uh, *what* library?
    
    pg_dump's pg_backup_tar.c is about 1300 lines, a very large fraction
    of which is boilerplate for interfacing to pg_backup_archiver's APIs.
    The stuff that actually knows specifically about tar looks to be maybe
    a couple hundred lines, plus there's another couple hundred lines of
    (rather duplicative?) code in src/port/tar.c.  None of it is rocket
    science.
    
    I can't believe that it'd be a good tradeoff to create a new external
    dependency to replace that amount of code.  In case you haven't noticed,
    our luck with depending on external libraries has been abysmal.
    
    Possibly there's an argument for refactoring things so that there's
    more stuff in tar.c and less elsewhere, but let's not go looking
    for external code to depend on.
    
    			regards, tom lane
    
    
    
    
  122. Re: block-level incremental backup

    Ibrar Ahmed <ibrar.ahmad@gmail.com> — 2019-09-03T16:44:25Z

    On Tue, Sep 3, 2019 at 8:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
    
    > Ibrar Ahmed <ibrar.ahmad@gmail.com> writes:
    > > +1 using the library to tar.
    >
    > Uh, *what* library?
    >
    
    I was just replying to Robert, who said:
    
    "But I think there must be a reason why tar libraries exist,
    and I don't want to write a new one."
    
    I said I am OK with using a library, if that is what he is proposing,
    but explained to him that TAR is one of the simplest formats, which is
    why PG has its own code.
    
    
    > pg_dump's pg_backup_tar.c is about 1300 lines, a very large fraction
    > of which is boilerplate for interfacing to pg_backup_archiver's APIs.
    > The stuff that actually knows specifically about tar looks to be maybe
    > a couple hundred lines, plus there's another couple hundred lines of
    > (rather duplicative?) code in src/port/tar.c.  None of it is rocket
    > science.
    >
    > I can't believe that it'd be a good tradeoff to create a new external
    > dependency to replace that amount of code.  In case you haven't noticed,
    > our luck with depending on external libraries has been abysmal.
    >
    > Possibly there's an argument for refactoring things so that there's
    > more stuff in tar.c and less elsewhere, but let's not go looking
    > for external code to depend on.
    >
    >                         regards, tom lane
    >
    
    
    -- 
    Ibrar Ahmed
    
  123. Re: block-level incremental backup

    Ibrar Ahmed <ibrar.ahmad@gmail.com> — 2019-09-03T16:46:13Z

    On Tue, Sep 3, 2019 at 7:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
    
    > On Tue, Sep 3, 2019 at 10:05 AM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
    > > +1 using the library to tar. But I think reason not using tar library is
    > TAR is
    > > one of the most simple file format. What is the best/newest format of
    > TAR?
    >
    > So, I don't really want to go down this path at all, as I already
    > said.  You can certainly do your own research on this topic if you
    > wish.
    >
    
    I did that and have experience working on the TAR format.  I was curious
    about what "best/newest" you are talking about.
    
    
    
    > --
    > Robert Haas
    > EnterpriseDB: http://www.enterprisedb.com
    > The Enterprise PostgreSQL Company
    >
    
    
    -- 
    Ibrar Ahmed
    
  124. Re: block-level incremental backup

    Dilip Kumar <dilipbalaut@gmail.com> — 2019-09-04T11:51:36Z

    On Tue, Sep 3, 2019 at 12:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    >
    > On Fri, Aug 16, 2019 at 3:54 PM Jeevan Chalke
    > <jeevan.chalke@enterprisedb.com> wrote:
    > >
    > 0003:
    > +/*
    > + * When to send the whole file, % blocks modified (90%)
    > + */
    > +#define WHOLE_FILE_THRESHOLD 0.9
    >
    > How this threshold is selected.  Is it by some test?
    >
    >
    > - magic number, currently 0 (4 bytes)
    > I think in the patch we are using  (#define INCREMENTAL_BACKUP_MAGIC
    > 0x494E4352) as a magic number, not 0
    >
    >
    > + Assert(statbuf->st_size <= (RELSEG_SIZE * BLCKSZ));
    > +
    > + buf = (char *) malloc(statbuf->st_size);
    > + if (buf == NULL)
    > + ereport(ERROR,
    > + (errcode(ERRCODE_OUT_OF_MEMORY),
    > + errmsg("out of memory")));
    > +
    > + if ((cnt = fread(buf, 1, statbuf->st_size, fp)) > 0)
    > + {
    > + Bitmapset  *mod_blocks = NULL;
    > + int nmodblocks = 0;
    > +
    > + if (cnt % BLCKSZ != 0)
    > + {
    >
    > It will be good to add some comments for the if block and also for the
    > assert. Actually, it's not very clear from the code.
    >
    > 0004:
    > +#include <time.h>
    > +#include <sys/stat.h>
    > +#include <unistd.h>
    > Header file include order (sys/stat.h should be before time.h)
    >
    >
    >
    > + printf(_("%s combines full backup with incremental backup.\n\n"), progname);
    > /backup/backups
    >
    >
    > + * scan_file
    > + *
    > + * Checks whether given file is partial file or not.  If partial, then combines
    > + * it into a full backup file, else copies as is to the output directory.
    > + */
    >
    > /If partial, then combines/ If partial, then combine
    >
    >
    >
    > +static void
    > +combine_partial_files(const char *fn, char **IncrDirs, int nIncrDir,
    > +   const char *subdirpath, const char *outfn)
    > + /*
    > + * Open all files from all incremental backup directories and create a file
    > + * map.
    > + */
    > + basefilefound = false;
    > + for (i = (nIncrDir - 1), fmindex = 0; i >= 0; i--, fmindex++)
    > + {
    > + fm = &filemaps[fmindex];
    > +
    > .....
    > + }
    > +
    > +
    > + /* Process all opened files. */
    > + lastblkno = 0;
    > + modifiedblockfound = false;
    > + for (i = 0; i < fmindex; i++)
    > + {
    > + char    *buf;
    > + int hsize;
    > + int k;
    > + int blkstartoffset;
    > ......
    > + }
    > +
    > + for (i = 0; i <= lastblkno; i++)
    > + {
    > + char blkdata[BLCKSZ];
    > + FILE    *infp;
    > + int offset;
    > ...
    > + }
    > }
    >
    > Can we breakdown this function in 2-3 functions.  At least creating a
    > file map can directly go to a separate function.
    >
    > I have read 0003 and 0004 patch and there are few cosmetic comments.
    >
    I have not yet completed the review for 0004, but I have a few more
    comments.  Tomorrow I will try to complete the review and do some
    testing as well.
    
    1. It seems that the output full backup generated with
    pg_combinebackup also contains the "INCREMENTAL BACKUP REFERENCE WAL
    LOCATION".  It seems confusing
    because now this is a full backup, not the incremental backup.
    
    2.
    + FILE    *outfp;
    + FileOffset outblocks[RELSEG_SIZE];
    + int i;
    + FileMap    *filemaps;
    + int fmindex;
    + bool basefilefound;
    + bool modifiedblockfound;
    + uint32 lastblkno;
    + FileMap    *fm;
    + struct stat statbuf;
    + uint32 nblocks;
    +
    + memset(outblocks, 0, sizeof(FileOffset) * RELSEG_SIZE);
    
    I don't think you need to memset this explicitly, since you can
    initialize the array itself, no?
    FileOffset outblocks[RELSEG_SIZE] = {{0}};
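    
    As a minimal sketch of the point above: in C, an aggregate initializer
    such as {{0}} zero-initializes every member of every element that is not
    explicitly listed, so a separate memset is not needed to zero the
    members.  The struct layout and the small array size below are
    assumptions for illustration only; the real FileOffset in the patch may
    look different.

```c
#include <assert.h>
#include <stddef.h>

#define RELSEG_SIZE_SKETCH 8	/* tiny stand-in for RELSEG_SIZE */

/* Hypothetical layout; the FileOffset type in the patch may differ. */
typedef struct FileOffset
{
	int			fmindex;
	long		offset;
} FileOffset;

/* Returns 1 if every member of every element is zero, else 0. */
static int
all_zero(const FileOffset *blocks, size_t n)
{
	size_t		i;

	assert(blocks != NULL);
	for (i = 0; i < n; i++)
	{
		if (blocks[i].fmindex != 0 || blocks[i].offset != 0)
			return 0;
	}
	return 1;
}

/* Declares the array with an initializer only, with no memset call. */
static int
initializer_zeroes_array(void)
{
	FileOffset	outblocks[RELSEG_SIZE_SKETCH] = {{0}};

	return all_zero(outblocks, RELSEG_SIZE_SKETCH);
}
```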
    
    -- 
    Regards,
    Dilip Kumar
    EnterpriseDB: http://www.enterprisedb.com
    
    
    
    
  125. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-09-04T13:31:44Z

    On Tue, Sep 3, 2019 at 12:46 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
    > I did that and have experience working on the TAR format.  I was curious about what
    > "best/newest" you are talking.
    
    Well, why not go look it up?
    
    On my MacBook, tar is documented to understand three different tar
    formats: gnutar, ustar, and v7, and two sets of extensions to the tar
    format: numeric extensions required by POSIX, and Solaris extensions.
    It also understands the pax and restricted-pax formats which are
    derived from the ustar format.  I don't know what your system
    supports, but it's probably not hugely different; the fact that there
    are multiple tar formats has been documented in the tar man page on
    every machine I've checked for the past 20 years.  Here, 'man tar'
    refers the reader to 'man libarchive-formats', which contains the
    details mentioned above.
    
    A quick Google search for 'multiple tar formats' also finds
    https://en.wikipedia.org/wiki/Tar_(computing)#File_format and
    https://www.gnu.org/software/tar/manual/html_chapter/tar_8.html each
    of which explains a good deal of the complexity in this area.
    
    I don't really understand why I have to explain to you what I mean
    when I say there are multiple tar formats when you can look it up on
    Google and find that there are multiple tar formats.  Again, the point
    is that the current code only generates tar archives and therefore
    only needs to generate one format, but if we add code that reads a tar
    archive, it probably needs to read several formats, because there are
    several formats that are popular enough to be widely-supported.
    
    It's possible that somebody else here knows more about this topic and
    could make better judgements than I can, but my view at present is
    that if we want to read tar archives, we probably would want to do it
    by depending on libarchive.  And I don't think we should do that for
    this project because I don't think it would provide much value.
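    
    To illustrate why a reader has to care about more than one format, here
    is a rough sketch that inspects the magic field of a 512-byte tar header.
    POSIX ustar headers carry "ustar\0" followed by the version "00" at byte
    offset 257, old GNU headers carry "ustar  \0" in the same place, and v7
    headers have no magic at all.  The function and enum names are
    hypothetical, and a real reader would have to handle far more than this
    (pax extended headers, long names, checksums, and so on).

```c
#include <assert.h>
#include <string.h>

#define TAR_BLOCK_SIZE		512
#define TAR_MAGIC_OFFSET	257

typedef enum
{
	TAR_V7_OR_UNKNOWN,			/* no recognizable magic */
	TAR_POSIX_USTAR,			/* "ustar\0" + version "00" */
	TAR_GNU						/* "ustar  \0" */
} TarFlavor;

/* Classify one header block by its 8 magic/version bytes. */
static TarFlavor
sniff_tar_header(const unsigned char *hdr)
{
	assert(hdr != NULL);

	/* The literal is split so that "\0" is not read as an octal escape. */
	if (memcmp(hdr + TAR_MAGIC_OFFSET, "ustar\0" "00", 8) == 0)
		return TAR_POSIX_USTAR;
	if (memcmp(hdr + TAR_MAGIC_OFFSET, "ustar  \0", 8) == 0)
		return TAR_GNU;
	return TAR_V7_OR_UNKNOWN;
}
```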
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  126. Re: block-level incremental backup

    Michael Paquier <michael@paquier.xyz> — 2019-09-05T02:07:25Z

    On Tue, Sep 03, 2019 at 08:59:53AM -0400, Robert Haas wrote:
    > I think pg_basebackup is using homebrew code to generate tar files,
    > but I'm reluctant to do that for reading tar files.
    
    Yes.  This code has not actually changed since its introduction.
    Please note that we also have code which reads directly data from a
    tarball in pg_basebackup.c when appending the recovery parameters to
    postgresql.auto.conf for -R.  There could be some consolidation here
    with what you are doing.
    
    > For generating a
    > file, you can always emit the newest and "best" tar format, but for
    > reading a file, you probably want to be prepared for older or cruftier
    > variants.  Maybe not -- I'm not super-familiar with the tar on-disk
    > format.  But I think there must be a reason why tar libraries exist,
    > and I don't want to write a new one.
    
    We need to be sure as well that the library chosen does not block
    access to a feature in all the various platforms we have.
    --
    Michael
    
  127. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-09-05T03:25:46Z

    On Wed, Sep 4, 2019 at 10:08 PM Michael Paquier <michael@paquier.xyz> wrote:
    > > For generating a
    > > file, you can always emit the newest and "best" tar format, but for
    > > reading a file, you probably want to be prepared for older or cruftier
    > > variants.  Maybe not -- I'm not super-familiar with the tar on-disk
    > > format.  But I think there must be a reason why tar libraries exist,
    > > and I don't want to write a new one.
    >
    > We need to be sure as well that the library chosen does not block
    > access to a feature in all the various platforms we have.
    
    Well, again, my preference is to just not make this particular feature
    work natively with tar files.  Then I don't need to choose a library,
    so the question is moot.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  128. Re: block-level incremental backup

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2019-09-09T11:08:15Z

    Hi,
    
    Attached is a new set of patches adding support for tablespace handling.
    
    This patchset also fixes the issues reported by Vignesh, Robert, Jeevan
    Ladhe, and Dilip Kumar.
    
    Please have a look and let me know if I missed any comments.
    
    Thanks
    -- 
    Jeevan Chalke
    Technical Architect, Product Development
    EnterpriseDB Corporation
    The Enterprise PostgreSQL Company
    
  129. Re: block-level incremental backup

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2019-09-09T11:21:34Z

    On Tue, Aug 27, 2019 at 4:46 PM vignesh C <vignesh21@gmail.com> wrote:
    
    > Few comments:
    > Comment:
    > + buf = (char *) malloc(statbuf->st_size);
    > + if (buf == NULL)
    > + ereport(ERROR,
    > + (errcode(ERRCODE_OUT_OF_MEMORY),
    > + errmsg("out of memory")));
    > +
    > + if ((cnt = fread(buf, 1, statbuf->st_size, fp)) > 0)
    > + {
    > + Bitmapset  *mod_blocks = NULL;
    > + int nmodblocks = 0;
    > +
    > + if (cnt % BLCKSZ != 0)
    > + {
    >
    > We can use same size as full page size.
    > After pg start backup full page write will be enabled.
    > We can use the same file size to maintain data consistency.
    >
    
    Can you please explain which size?
    The aim here is to read the entire file into memory, which is why
    statbuf->st_size is used.
    
    > Comment:
    > Should we check if it is same timeline as the system's timeline.
    >
    
    At the time of taking the incremental backup, we can't check that.
    However, while combining, I made sure that the timeline is the same for all
    backups.
    
    
    >
    > Comment:
    >
    > Should we support compression formats supported by pg_basebackup.
    > This can be an enhancement after the functionality is completed.
    >
    
    For the incremental backup, it just works out of the box.
    For combining backups, as discussed up-thread, the user has to
    uncompress them first, combine them, and then compress the result if
    required.
    
    
    > Comment:
    > We should provide some mechanism to validate the backup. To identify
    > if some backup is corrupt or some file is missing(deleted) in a
    > backup.
    >
    
    Maybe, but not for the first version.
    
    
    > Comment:
    > +/*
    > + * When to send the whole file, % blocks modified (90%)
    > + */
    > +#define WHOLE_FILE_THRESHOLD 0.9
    > +
    > This can be user configured value.
    > This can be an enhancement after the functionality is completed.
    >
    
    Yes.
    
    
    > Comment:
    > We can add a readme file with all the details regarding incremental
    > backup and combine backup.
    >
    
    Will have a look.
    
    
    >
    > Regards,
    > Vignesh
    > EnterpriseDB: http://www.enterprisedb.com
    >
    
    Thanks
    -- 
    Jeevan Chalke
    Technical Architect, Product Development
    EnterpriseDB Corporation
    The Enterprise PostgreSQL Company
    
  130. Re: block-level incremental backup

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2019-09-09T11:30:33Z

    On Tue, Aug 27, 2019 at 11:59 PM Robert Haas <robertmhaas@gmail.com> wrote:
    
    > On Fri, Aug 16, 2019 at 6:23 AM Jeevan Chalke
    > <jeevan.chalke@enterprisedb.com> wrote:
    > > [ patches ]
    >
    > Reviewing 0002 and 0003:
    >
    > - Commit message for 0003 claims magic number and checksum are 0, but
    > that (fortunately) doesn't seem to be the case.
    >
    
    Oops, updated commit message.
    
    
    >
    > - looks_like_rel_name actually checks whether it looks like a
    > *non-temporary* relation name; suggest adjusting the function name.
    >
    > - The names do_full_backup and do_incremental_backup are quite
    > confusing because you're really talking about what to do with one
    > file.  I suggest sendCompleteFile() and sendPartialFile().
    >
    
    Changed function names.
    
    
    >
    > - Is there any good reason to have 'refptr' as a global variable, or
    > could we just pass the LSN around via function arguments?  I know it's
    > just mimicking startptr, but storing startptr in a global variable
    > doesn't seem like a great idea either, so if it's not too annoying,
    > let's pass it down via function arguments instead.  Also, refptr is a
    > crappy name (even worse than startptr); whether we end up with a
    > global variable or a bunch of local variables, let's make the name(s)
    > clear and unambiguous, like incremental_reference_lsn.  Yeah, I know
    > that's long, but I still think it's better than being unclear.
    >
    
    Renamed variable.
    However, I have kept it as a global, since passing it down would require
    changing the signatures of many functions, such as sendFile(), sendDir(),
    sendTablespace(), etc.
    
    
    > - do_incremental_backup looks like it can never report an error from
    > fread(), which is bad.  But I see that this is just copied from the
    > existing code which has the same problem, so I started a separate
    > thread about that.
    >
    > - I think that passing cnt and blkindex to verify_page_checksum()
    > doesn't look very good from an abstraction point of view.  Granted,
    > the existing code isn't great either, but I think this makes the
    > problem worse.  I suggest passing "int backup_distance" to this
    > function, computed as cnt - BLCKSZ * blkindex.  Then, you can
    > fseek(-backup_distance), fread(BLCKSZ), and then fseek(backup_distance
    > - BLCKSZ).
    >
    
    Yep. Made these changes in the refactoring patch.
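    
    The seek arithmetic suggested above can be sketched as follows, assuming
    the stream is positioned just past the cnt bytes that were read and that
    backup_distance = cnt - BLCKSZ * blkindex.  reread_block is a
    hypothetical helper, not the function from the patch, and BLCKSZ is
    hard-coded here to the default of 8192.

```c
#include <assert.h>
#include <stdio.h>

#define BLCKSZ 8192

/*
 * Re-read one block that lies backup_distance bytes before the current file
 * position, then restore the original position.  Returns 0 on success and
 * -1 if any seek or read fails.
 */
static int
reread_block(FILE *fp, char *page, long backup_distance)
{
	assert(fp != NULL && page != NULL);
	assert(backup_distance >= BLCKSZ);

	/* Step back to the start of the block. */
	if (fseek(fp, -backup_distance, SEEK_CUR) != 0)
		return -1;
	/* Read exactly one block. */
	if (fread(page, 1, BLCKSZ, fp) != BLCKSZ)
		return -1;
	/* Skip forward over the rest to get back to where we started. */
	if (fseek(fp, backup_distance - BLCKSZ, SEEK_CUR) != 0)
		return -1;
	return 0;
}
```

    With this shape the caller only has to compute one number, and the helper
    does not need to know about cnt or blkindex at all.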
    
    
    >
    > - While I generally support the use of while and for loops rather than
    > goto for flow control, a while (1) loop that ends with a break is
    > functionally a goto anyway.  I think there are several ways this could
    > be revised.  The most obvious one is probably to use goto, but I vote
    > for inverting the sense of the test: if (PageIsNew(page) ||
    > PageGetLSN(page) >= startptr) break; This approach also saves a level
    > of indentation for more than half of the function.
    >
    
    I have used this new inverted condition, but we still need a while(1) loop.
    
    
    > - I am not sure that it's a good idea for sendwholefile = true to
    > result in dumping the entire file onto the wire in a single CopyData
    > message.  I don't know of a concrete problem in typical
    > configurations, but someone who increases RELSEG_SIZE might be able to
    > overflow CopyData's length word.  At 2GB the length word would be
    > negative, which might break, and at 4GB it would wrap around, which
    > would certainly break.  See CopyData in
    > https://www.postgresql.org/docs/12/protocol-message-formats.html  To
    > avoid this issue, and maybe some others, I suggest defining a
    > reasonably large chunk size, say 1MB as a constant in this file
    > someplace, and sending the data as a series of chunks of that size.
    >
    
    OK. Done as per the suggestions.
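    
    A minimal sketch of the chunking idea, assuming a 1MB chunk size as
    suggested above.  The callback type, the ChunkStats struct, and the
    function names are hypothetical; in the server the callback would be
    whatever emits a single CopyData message, and record_chunk below only
    records sizes so the loop can be exercised on its own.

```c
#include <assert.h>
#include <stddef.h>

#define SEND_CHUNK_SIZE (1024 * 1024)	/* 1MB per message */

/* Hypothetical stand-in for "emit one CopyData message". */
typedef void (*emit_chunk_fn) (const char *data, size_t len, void *arg);

typedef struct ChunkStats
{
	size_t		nmsgs;			/* number of messages emitted */
	size_t		total;			/* total bytes emitted */
	size_t		largest;		/* size of the largest message */
} ChunkStats;

/* Example callback that only records what it was asked to send. */
static void
record_chunk(const char *data, size_t len, void *arg)
{
	ChunkStats *stats = (ChunkStats *) arg;

	(void) data;
	stats->nmsgs++;
	stats->total += len;
	if (len > stats->largest)
		stats->largest = len;
}

/*
 * Emit buf as a series of messages of at most SEND_CHUNK_SIZE bytes each.
 * Returns the number of messages emitted.
 */
static size_t
send_in_chunks(const char *buf, size_t len, emit_chunk_fn emit, void *arg)
{
	size_t		sent = 0;
	size_t		nmsgs = 0;

	assert(emit != NULL);
	while (sent < len)
	{
		size_t		n = len - sent;

		if (n > SEND_CHUNK_SIZE)
			n = SEND_CHUNK_SIZE;
		emit(buf + sent, n, arg);
		sent += n;
		nmsgs++;
	}
	return nmsgs;
}
```

    Because no single message exceeds the chunk size, the length word stays
    far below 2GB regardless of how large RELSEG_SIZE is configured.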
    
    
    >
    > - I don't think that the way concurrent truncation is handled is
    > correct for partial files.  Right now it just falls through to code
    > which appends blocks of zeroes in either the complete-file or
    > partial-file case.  I think that logic should be moved into the
    > function that handles the complete-file case.  In the partial-file
    > case, the blocks that we actually send need to match the list of block
    > numbers we promised to send.  We can't just send the promised blocks
    > and then tack a bunch of zero-filled blocks onto the end that the file
    > header doesn't know about.
    >
    
    Well, in the partial-file case we won't end up inside that block. So we
    never send zeroes at the end of a partial file.
    
    
    > - For reviewer convenience, please use the -v option to git
    > format-patch when posting and reposting a patch series.  Using -v2,
    > -v3, etc. on successive versions really helps.
    >
    
    Sure. Thanks for letting me know about this option.
    
    
    >
    > --
    > Robert Haas
    > EnterpriseDB: http://www.enterprisedb.com
    > The Enterprise PostgreSQL Company
    >
    
    Thanks
    -- 
    Jeevan Chalke
    Technical Architect, Product Development
    EnterpriseDB Corporation
    The Enterprise PostgreSQL Company
    
  131. Re: block-level incremental backup

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2019-09-09T11:42:39Z

    On Fri, Aug 30, 2019 at 6:52 PM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com>
    wrote:
    
    > Here are some comments:
    > Or maybe we can just say:
    > "cannot verify checksum in file \"%s\"" if checksum requested, disable the
    > checksum and leave it to the following message:
    >
    > +           ereport(WARNING,
    > +                   (errmsg("file size (%d) not in multiple of page size
    > (%d), sending whole file",
    > +                           (int) cnt, BLCKSZ)));
    >
    >
    Opted for the above suggestion.
    
    
    >
    > I think we should give the user hint from where he should be reading the
    > input lsn for incremental backup in the --help option as well as
    > documentation? Something like - "To take an incremental backup, please
    > provide value of "--lsn" as the "START WAL LOCATION" of previously taken
    > full backup or incremental backup from backup_label file.
    >
    
    Added this to the documentation. In the --help output, it would be too
    crowded.
    
    
    > pg_combinebackup:
    >
    > +static bool made_new_outputdata = false;
    > +static bool found_existing_outputdata = false;
    >
    > Both of these are global, I understand that we need them global so that
    > they are accessible in cleanup_directories_atexit(). But they are passed
    > to verify_dir_is_empty_or_create() as parameters, which I think is not
    > needed.
    > Instead verify_dir_is_empty_or_create() can directly change the globals.
    >
    
    After adding support for a tablespace, these two functions take different
    values depending upon the context.
    
    
    > The current logic assumes the incremental backup directories are to be
    > provided as input in the serial order the backups were taken. This is bit
    > confusing unless clarified in pg_combinebackup help menu or
    > documentation. I think we should clarify it at both the places.
    >
    
    Added in doc.
    
    
    >
    > I think scan_directory() should be rather renamed as do_combinebackup().
    >
    
    I am not sure about this renaming. scan_directory() is called recursively
    to scan each sub-directory too. If we rename it, the name would suggest
    that each recursive call performs a combinebackup, which is not the case.
    Combining backups is a single process as a whole.
    
    -- 
    Jeevan Chalke
    Technical Architect, Product Development
    EnterpriseDB Corporation
    The Enterprise PostgreSQL Company
    
  132. Re: block-level incremental backup

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2019-09-09T11:47:39Z

    On Tue, Sep 3, 2019 at 12:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    
    > On Fri, Aug 16, 2019 at 3:54 PM Jeevan Chalke
    > <jeevan.chalke@enterprisedb.com> wrote:
    > >
    > 0003:
    > +/*
    > + * When to send the whole file, % blocks modified (90%)
    > + */
    > +#define WHOLE_FILE_THRESHOLD 0.9
    >
    > How this threshold is selected.  Is it by some test?
    >
    
    Currently, it is set arbitrarily. If required, we will make it a GUC.
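    
    For readers following along, the decision that this constant drives can
    be sketched roughly as below.  The function name and the use of >=
    rather than > are assumptions; the patch may compute the ratio
    differently.

```c
#include <assert.h>
#include <stdbool.h>

#define WHOLE_FILE_THRESHOLD 0.9

/*
 * Decide whether to send the whole file instead of only the modified blocks.
 * Assumption: the file is sent whole once the modified fraction reaches the
 * threshold.
 */
static bool
should_send_whole_file(int nmodblocks, int nblocks)
{
	assert(nblocks > 0);
	assert(nmodblocks >= 0 && nmodblocks <= nblocks);

	return (double) nmodblocks / (double) nblocks >= WHOLE_FILE_THRESHOLD;
}
```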
    
    
    >
    > - magic number, currently 0 (4 bytes)
    > I think in the patch we are using  (#define INCREMENTAL_BACKUP_MAGIC
    > 0x494E4352) as a magic number, not 0
    >
    
    Yes. Robert too reported this. Updated the commit message.
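    
    As a small aside, the magic value quoted above spells "INCR" when its
    four bytes are read from most significant to least significant.  The
    helper below is only a sketch to show that; it is not from the patch,
    and it says nothing about the byte order in which the value is actually
    written to disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INCREMENTAL_BACKUP_MAGIC 0x494E4352

/*
 * Write the four bytes of magic, most significant first, into out as a
 * NUL-terminated string.  out must have room for 5 bytes.
 */
static void
magic_to_ascii(uint32_t magic, char *out)
{
	assert(out != NULL);

	out[0] = (char) ((magic >> 24) & 0xFF);
	out[1] = (char) ((magic >> 16) & 0xFF);
	out[2] = (char) ((magic >> 8) & 0xFF);
	out[3] = (char) (magic & 0xFF);
	out[4] = '\0';
}
```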
    
    
    >
    > Can we breakdown this function in 2-3 functions.  At least creating a
    > file map can directly go to a separate function.
    >
    
    Moved the file map creation into a separate function. The rest is kept
    as is so that the code remains easy to follow.
    
    
    >
    > I have read 0003 and 0004 patch and there are few cosmetic comments.
    >
    
    Can you please post those too?
    
    Other comments are fixed.
    
    
    >
    >
    > --
    > Regards,
    > Dilip Kumar
    > EnterpriseDB: http://www.enterprisedb.com
    >
    
    Thanks
    -- 
    Jeevan Chalke
    Technical Architect, Product Development
    EnterpriseDB Corporation
    The Enterprise PostgreSQL Company
    
  133. Re: block-level incremental backup

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2019-09-09T11:51:34Z

    On Wed, Sep 4, 2019 at 5:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
    
    >
    >  I have not yet completed the review for 0004, but I have few more
    > comments.  Tomorrow I will try to complete the review and some testing
    > as well.
    >
    > 1. It seems that the output full backup generated with
    > pg_combinebackup also contains the "INCREMENTAL BACKUP REFERENCE WAL
    > LOCATION".  It seems confusing
    > because now this is a full backup, not the incremental backup.
    >
    
    Yes, that was still pending on my TODO list.
    It is done in the new patchset. pg_combinebackup now also takes --label
    as an input, like pg_basebackup.
    
    
    >
    > 2.
    > + memset(outblocks, 0, sizeof(FileOffset) * RELSEG_SIZE);
    >
    > I don't think you need to memset this explicitly as you can initialize
    > the array itself no?
    > FileOffset outblocks[RELSEG_SIZE] = {{0}}
    >
    
    I didn't see any issue with memset either but changed this per your
    suggestion.
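
    The two forms can be compared with a small sketch. The FileOffset layout
    and the DEMO_NBLOCKS constant below are stand-ins chosen for illustration
    (the real FileOffset and RELSEG_SIZE come from the patch and the server
    headers). Members are compared one by one because an initializer is not
    guaranteed to zero padding bytes, while memset zeroes every byte.

```c
#include <stdbool.h>
#include <string.h>

#define DEMO_NBLOCKS 8          /* stand-in for RELSEG_SIZE */

/* Hypothetical layout; the patch's FileOffset may differ. */
typedef struct FileOffset
{
    int  file_index;
    long offset;
} FileOffset;

/* Returns true if both ways of zeroing the array give the same member values. */
bool
initializer_matches_memset(void)
{
    FileOffset by_init[DEMO_NBLOCKS] = {{0}};   /* the suggested form */
    FileOffset by_memset[DEMO_NBLOCKS];
    int        i;

    memset(by_memset, 0, sizeof(FileOffset) * DEMO_NBLOCKS);

    for (i = 0; i < DEMO_NBLOCKS; i++)
    {
        if (by_init[i].file_index != by_memset[i].file_index ||
            by_init[i].offset != by_memset[i].offset)
            return false;
    }
    return true;
}
```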
    
    
    >
    > --
    > Regards,
    > Dilip Kumar
    > EnterpriseDB: http://www.enterprisedb.com
    >
    
    
    -- 
    Jeevan Chalke
    Technical Architect, Product Development
    EnterpriseDB Corporation
    The Enterprise PostgreSQL Company
    
  134. Re: block-level incremental backup

    Jeevan Chalke <jeevan.chalke@enterprisedb.com> — 2019-09-12T13:13:18Z

    Hi,
    
    One of my colleagues at EDB, Rajkumar Raghuwanshi, reported an issue
    while testing this feature. If a full base backup is taken, then a
    database is created, and then an incremental backup is taken, combining
    the full backup with the incremental backup fails.
    
    I had a look at this issue and observed that when a new database is
    created, the catalog files are copied as-is into the directory
    corresponding to the newly created database. Because they are just
    copied, the LSNs on those pages are not changed. As a result, the
    incremental backup thinks that these are existing files and does not
    copy the blocks from these new files, leading to the failure.
    
    I was surprised to learn that even though we are creating new files from
    old files, we keep the LSN unmodified. I didn't see any other parameter
    in basebackup which tells us that a file is new since the last LSN, or
    anything similar.
    
    I tried looking for any other DDL that does something similar, such as
    creating a new page with an existing LSN, but I could not find any
    commands other than CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE.
    
    Suggestions/thoughts?
    
    -- 
    Jeevan Chalke
    Technical Architect, Product Development
    EnterpriseDB Corporation
    The Enterprise PostgreSQL Company
    
  135. Re: block-level incremental backup

    vignesh C <vignesh21@gmail.com> — 2019-09-13T17:08:12Z

    On Mon, Sep 9, 2019 at 4:51 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
    >
    >
    >
    > On Tue, Aug 27, 2019 at 4:46 PM vignesh C <vignesh21@gmail.com> wrote:
    >>
    >> Few comments:
    >> Comment:
    >> + buf = (char *) malloc(statbuf->st_size);
    >> + if (buf == NULL)
    >> + ereport(ERROR,
    >> + (errcode(ERRCODE_OUT_OF_MEMORY),
    >> + errmsg("out of memory")));
    >> +
    >> + if ((cnt = fread(buf, 1, statbuf->st_size, fp)) > 0)
    >> + {
    >> + Bitmapset  *mod_blocks = NULL;
    >> + int nmodblocks = 0;
    >> +
    >> + if (cnt % BLCKSZ != 0)
    >> + {
    >>
    >> We can use same size as full page size.
    >> After pg start backup full page write will be enabled.
    >> We can use the same file size to maintain data consistency.
    >
    >
    > Can you please explain which size?
    > The aim here is to read entire file in-memory and thus used
    > statbuf->st_size.
    >
    Instead of reading the whole file here, we can read the file page by page.
    There is a possibility of data inconsistency if the data is not read page
    by page. The data will be consistent if it is read page by page, as full
    page writes will be enabled at this time.
    
    Regards,
    Vignesh
    EnterpriseDB: http://www.enterprisedb.com
    
  136. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-09-16T01:36:39Z

    On Fri, Sep 13, 2019 at 1:08 PM vignesh C <vignesh21@gmail.com> wrote:
    > Instead of reading the whole file here, we can read the file page by page. There is a possibility of data inconsistency if data is not read page by page, data will be consistent if read page by page as full page write will be enabled at this time.
    
    I think you are confused about what "full page writes" means. It has
    to do with what gets written to the write-ahead log, not the way that the
    pages themselves are written. There is no portable way to ensure that
    an 8kB read or write is atomic, and generally it isn't.
    
    It shouldn't matter whether the file is read all at once, page by
    page, or byte by byte, except for performance. Recovery is going to
    run when that backup is restored, and any inconsistencies should get
    fixed up at that time.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  137. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-09-16T01:44:59Z

    On Thu, Sep 12, 2019 at 9:13 AM Jeevan Chalke
    <jeevan.chalke@enterprisedb.com> wrote:
    > I had a look over this issue and observed that when the new database is
    > created, the catalog files are copied as-is into the new directory
    > corresponding to a newly created database. And as they are just copied,
    > the LSN on those pages are not changed. Due to this incremental backup
    > thinks that its an existing file and thus do not copy the blocks from
    > these new files, leading to the failure.
    
    *facepalm*
    
    Well, this shoots a pretty big hole in my design for this feature. I
    don't know why I didn't think of this when I wrote out that design
    originally. Ugh.
    
    Unless we change the way that CREATE DATABASE and any similar
    operations work so that they always stamp pages with new LSNs, I think
    we have to give up on the idea of being able to take an incremental
    backup by just specifying an LSN. We'll instead need to get a list of
    files from the server first, and then request the entirety of any that
    we don't have, plus the changed blocks from the ones that we do have.
    I guess that will make Stephen happy, since it's more like the design
    he wanted originally (and should generalize more simply to parallel
    backup).
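
    The revised flow can be sketched as a per-file decision on the client
    side, assuming the client holds the list of paths from the prior backup.
    The names FetchMode and choose_fetch_mode are hypothetical and only
    illustrate the idea; this is not code from any patch.

```c
#include <stddef.h>
#include <string.h>

typedef enum
{
    FETCH_WHOLE_FILE,       /* file was not in the prior backup */
    FETCH_CHANGED_BLOCKS    /* file exists already; ask only for changed blocks */
} FetchMode;

/*
 * Hypothetical helper: decide what to request for one file reported by the
 * server, given the paths present in the prior backup.
 */
FetchMode
choose_fetch_mode(const char *path,
                  const char *const *prior_files, size_t nprior)
{
    size_t i;

    for (i = 0; i < nprior; i++)
    {
        if (strcmp(prior_files[i], path) == 0)
            return FETCH_CHANGED_BLOCKS;
    }
    return FETCH_WHOLE_FILE;
}
```

    Under this rule, catalog files copied by CREATE DATABASE into a new
    directory would be fetched in full regardless of their page LSNs, because
    their paths do not appear in the prior backup.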
    
    One question I have is: is there any scenario in which an existing
    page gets modified after the full backup and before the incremental
    backup but does not end up with an LSN that follows the full backup's
    start LSN? If there is, then the whole concept of using LSNs to tell
    which blocks have been modified doesn't really work. I can't think of
    a way that can happen off-hand, but then, I thought my last design was
    good, too.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  138. Re: block-level incremental backup

    Amit Kapila <amit.kapila16@gmail.com> — 2019-09-16T08:31:21Z

    On Mon, Sep 16, 2019 at 7:22 AM Robert Haas <robertmhaas@gmail.com> wrote:
    >
    > On Thu, Sep 12, 2019 at 9:13 AM Jeevan Chalke
    > <jeevan.chalke@enterprisedb.com> wrote:
    > > I had a look over this issue and observed that when the new database is
    > > created, the catalog files are copied as-is into the new directory
    > > corresponding to a newly created database. And as they are just copied,
    > > the LSN on those pages are not changed. Due to this incremental backup
    > > thinks that its an existing file and thus do not copy the blocks from
    > > these new files, leading to the failure.
    >
    > *facepalm*
    >
    > Well, this shoots a pretty big hole in my design for this feature. I
    > don't know why I didn't think of this when I wrote out that design
    > originally. Ugh.
    >
    > Unless we change the way that CREATE DATABASE and any similar
    > operations work so that they always stamp pages with new LSNs, I think
    > we have to give up on the idea of being able to take an incremental
    > backup by just specifying an LSN.
    >
    
    This seems to be a blocking problem for the LSN based design.  Can we
    think of using the file's creation time?  Basically, if the file
    creation time is later than the backup label's "START TIME:", then
    include that file entirely.  I think one big point against this is
    clock skew, like what if somebody tinkers with the clock.  Also, this
    can cover cases like the one Jeevan has pointed out but might not cover
    other cases which we found problematic.
    
    >  We'll instead need to get a list of
    > files from the server first, and then request the entirety of any that
    > we don't have, plus the changed blocks from the ones that we do have.
    > I guess that will make Stephen happy, since it's more like the design
    > he wanted originally (and should generalize more simply to parallel
    > backup).
    >
    > One question I have is: is there any scenario in which an existing
    > page gets modified after the full backup and before the incremental
    > backup but does not end up with an LSN that follows the full backup's
    > start LSN?
    >
    
    I think the operations covered by WAL flag XLR_SPECIAL_REL_UPDATE will
    have similar problems.
    
    One related point is how do incremental backups handle the case where
    vacuum truncates the relation partially?  Basically, with current
    patch/design, it doesn't appear that such information can be passed
    via incremental backup.  I am not sure if this is a problem, but it
    would be good if we can somehow handle this.
    
    Won't operations that end by directly calling heap_sync without writing
    WAL have a similar problem as well?  Similarly, it is not very clear
    whether unlogged relations are handled in some way; if not, that could
    be documented.
    
    -- 
    With Regards,
    Amit Kapila.
    EnterpriseDB: http://www.enterprisedb.com
    
    
    
    
  139. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-09-16T13:30:06Z

    On Mon, Sep 16, 2019 at 4:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > This seems to be a blocking problem for the LSN based design.
    
    Well, only the simplest version of it, I think.
    
    > Can we think of using creation time for file?  Basically, if the file
    > creation time is later than backup-labels "START TIME:", then include
    > that file entirely.  I think one big point against this is clock skew
    > like what if somebody tinkers with the clock.  And also, this can
    > cover cases like
    > what Jeevan has pointed but might not cover other cases which we found
    > problematic.
    
    Well that would mean, for example, that if you copied the data
    directory from one machine to another, the next "incremental" backup
    would turn into a full backup. That sucks. And in other situations,
    like resetting the clock, it could mean that you end up with a corrupt
    backup without any real ability for PostgreSQL to detect it. I'm not
    saying that it is impossible to create a practically useful system
    based on file time stamps, but I really don't like it.
    
    > I think the operations covered by WAL flag XLR_SPECIAL_REL_UPDATE will
    > have similar problems.
    
    I'm not sure quite what you mean by that.  Can you elaborate? It
    appears to me that the XLR_SPECIAL_REL_UPDATE operations are all
    things that create files, remove files, or truncate files, and the
    sketch in my previous email would handle the first two of those cases
    correctly.  See below for the third.
    
    > One related point is how do incremental backups handle the case where
    > vacuum truncates the relation partially?  Basically, with current
    > patch/design, it doesn't appear that such information can be passed
    > via incremental backup.  I am not sure if this is a problem, but it
    > would be good if we can somehow handle this.
    
    As to this, if you're taking a full backup of a particular file,
    there's no problem.  If you're taking a partial backup of a particular
    file, you need to include the current length of the file and the
    identity and contents of each modified block.  Then you're fine.
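
    A sketch of why the current length is enough to handle truncation,
    assuming a partial file carries its length in blocks plus the number and
    contents of each modified block. PartialFile, apply_partial_file, and the
    tiny DEMO_BLCKSZ are hypothetical and do not reflect the patch's on-disk
    format.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_BLCKSZ 4           /* stand-in for BLCKSZ, kept tiny for clarity */

/* Hypothetical description of one partially backed-up file. */
typedef struct PartialFile
{
    unsigned int        nblocks;    /* current length of the file, in blocks */
    unsigned int        nmodified;  /* number of modified blocks carried */
    const unsigned int *blocknos;   /* identity of each modified block */
    const char         *contents;   /* nmodified * DEMO_BLCKSZ bytes */
} PartialFile;

/*
 * Rebuild a file from its older copy plus a partial file. "out" must have
 * room for pf->nblocks * DEMO_BLCKSZ bytes. Returns the new length in blocks.
 */
unsigned int
apply_partial_file(const char *base, unsigned int base_nblocks,
                   const PartialFile *pf, char *out)
{
    unsigned int keep = base_nblocks < pf->nblocks ? base_nblocks : pf->nblocks;
    unsigned int i;

    /* Blocks past the recorded length are dropped; this handles truncation. */
    memset(out, 0, (size_t) pf->nblocks * DEMO_BLCKSZ);
    memcpy(out, base, (size_t) keep * DEMO_BLCKSZ);

    /* Overlay each modified block at its recorded position. */
    for (i = 0; i < pf->nmodified; i++)
    {
        if (pf->blocknos[i] < pf->nblocks)
            memcpy(out + (size_t) pf->blocknos[i] * DEMO_BLCKSZ,
                   pf->contents + (size_t) i * DEMO_BLCKSZ,
                   DEMO_BLCKSZ);
    }
    return pf->nblocks;
}
```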
    
    > Isn't some operations where at the end we directly call heap_sync
    > without writing WAL will have a similar problem as well?
    
    Maybe.  Can you give an example?
    
    > Similarly,
    > it is not very clear if unlogged relations are handled in some way if
    > not, the same could be documented.
    
    I think that we don't need to back up the contents of unlogged
    relations at all, right? Restoration from an online backup always
    involves running recovery, and so unlogged relations will anyway get
    zapped.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  140. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-09-16T14:38:17Z

    Greetings,
    
    * Robert Haas (robertmhaas@gmail.com) wrote:
    > On Mon, Sep 16, 2019 at 4:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > > Can we think of using creation time for file?  Basically, if the file
    > > creation time is later than backup-labels "START TIME:", then include
    > > that file entirely.  I think one big point against this is clock skew
    > > like what if somebody tinkers with the clock.  And also, this can
    > > cover cases like
    > > what Jeevan has pointed but might not cover other cases which we found
    > > problematic.
    > 
    > Well that would mean, for example, that if you copied the data
    > directory from one machine to another, the next "incremental" backup
    > would turn into a full backup. That sucks. And in other situations,
    > like resetting the clock, it could mean that you end up with a corrupt
    > backup without any real ability for PostgreSQL to detect it. I'm not
    > saying that it is impossible to create a practically useful system
    > based on file time stamps, but I really don't like it.
    
    In a number of cases, trying to make sure that on a failover or copy of
    the backup the next 'incremental' is really an 'incremental' is
    dangerous.  A better strategy to address this, and the other issues
    realized on this thread recently, is to:
    
    - Have a manifest of every file in each backup
    - Always back up new files that weren't in the prior backup
    - Keep a checksum of each file
    - Track the timestamp of each file as of when it was backed up
    - Track the file size of each file
    - Track the starting timestamp of each backup
    - Always include files with a modification time after the starting
      timestamp of the prior backup, or if the file size has changed
    - In the event of any anomalies (which includes things like a timeline
      switch), use checksum matching (aka 'delta checksum backup') to
      perform the backup instead of using timestamps (or just always do that
      if you want to be particularly careful- having an option for it is
      great)
    - Probably other things I'm not thinking of off-hand, but this is at
      least a good start.  Make sure to checksum this information too.
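
    The timestamp and size rules in the list above can be sketched as a
    single check against a manifest entry. ManifestEntry and file_needs_copy
    are hypothetical names, and treating an equal timestamp as "changed" is a
    conservative assumption made here because of coarse timestamp
    granularity; none of this is an existing format or API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical manifest entry for one file in the prior backup. */
typedef struct ManifestEntry
{
    const char *path;
    uint64_t    size;       /* file size when it was backed up */
    time_t      mtime;      /* file timestamp when it was backed up */
    uint32_t    checksum;   /* checksum of the backed-up file */
} ManifestEntry;

/*
 * Decide whether a file must be included in the next backup. "prior" is the
 * file's entry in the prior backup's manifest, or NULL if it had none.
 */
bool
file_needs_copy(const ManifestEntry *prior, uint64_t cur_size,
                time_t cur_mtime, time_t prior_backup_start)
{
    if (prior == NULL)
        return true;            /* new file: always back it up */
    if (cur_size != prior->size)
        return true;            /* size changed */

    /* Modified at or after the start of the prior backup. */
    return cur_mtime >= prior_backup_start;
}
```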
    
    I agree entirely that it is dangerous to simply rely on creation time as
    compared to some other time, or to rely on modification time of a given
    file across multiple backups (which has been shown to reliably cause
    corruption, at least with rsync and its 1-second granularity on
    modification time).
    
    By having a manifest for each backed up file for each backup, you also
    gain the ability to validate that a backup in the repository hasn't been
    corrupted post-backup, a feature that at least some other database
    backup and restore systems have (referring specifically to the big O in
    this particular case, but I bet others do too).
    
    Having a system of keeping track of which backups are full and which are
    differential in an overall system also gives you the ability to do
    things like expiration in a sensible way, including handling WAL
    expiration.
    
    As also mentioned up-thread, this likely also allows you to have a
    simpler approach to parallelizing the overall backup.
    
    I'd like to clarify that while I would like to have an easier way to
    parallelize backups, that's a relatively minor complaint- the much
    bigger issue that I have with this feature is that trying to address
    everything correctly while having only the amount of information that
    could be passed on the command-line about the prior full/incremental is
    going to be extremely difficult, complicated, and likely to lead to
    subtle bugs in the actual code, and probably less than subtle bugs in
    how users end up using it, since they'll have to implement the
    expiration and tracking of information between backups themselves
    (unless something's changed in that part during this discussion- I admit
    that I've not read every email in this thread).
    
    > > One related point is how do incremental backups handle the case where
    > > vacuum truncates the relation partially?  Basically, with current
    > > patch/design, it doesn't appear that such information can be passed
    > > via incremental backup.  I am not sure if this is a problem, but it
    > > would be good if we can somehow handle this.
    > 
    > As to this, if you're taking a full backup of a particular file,
    > there's no problem.  If you're taking a partial backup of a particular
    > file, you need to include the current length of the file and the
    > identity and contents of each modified block.  Then you're fine.
    
    I would also expect this to be fine but if there's an example of where
    this is an issue, please share.  The only issue that I can think of
    off-hand is orphaned-file risk, whereby you have something like CREATE
    DATABASE or perhaps ALTER TABLE .. SET TABLESPACE or such, take a
    backup while that's happening, but that doesn't complete during the
    backup (or recovery, or perhaps even in some other scenarios, it's
    unfortunately quite complicated).  This orphaned file risk isn't newly
    discovered but fixing it is pretty complicated- would love to discuss
    ideas around how to handle it.
    
    > > Isn't some operations where at the end we directly call heap_sync
    > > without writing WAL will have a similar problem as well?
    > 
    > Maybe.  Can you give an example?
    
    I'd be curious to hear what the concern is here also.
    
    > > Similarly,
    > > it is not very clear if unlogged relations are handled in some way if
    > > not, the same could be documented.
    > 
    > I think that we don't need to back up the contents of unlogged
    > relations at all, right? Restoration from an online backup always
    > involves running recovery, and so unlogged relations will anyway get
    > zapped.
    
    Unlogged relations shouldn't be in the backup at all, since, yes, they
    get zapped at the start of recovery.  We recently taught pg_basebackup
    how to avoid backing them up so this shouldn't be an issue, as they
    should be skipped for incrementals as well as fulls.  I expect the
    orphaned file problem also exists for UNLOGGED->LOGGED transitions.
    
    Thanks,
    
    Stephen
    
  141. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-09-16T15:52:56Z

    On Mon, Sep 16, 2019 at 9:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
    > > Isn't some operations where at the end we directly call heap_sync
    > > without writing WAL will have a similar problem as well?
    >
    > Maybe.  Can you give an example?
    
    Looking through the code, I found two cases where we do this.  One is
    a bulk insert operation with wal_level = minimal, and the other is
    CLUSTER or VACUUM FULL with wal_level = minimal. In both of these
    cases we are generating new blocks whose LSNs will be 0/0. So, I think
    we need a rule that if the server is asked to back up all blocks in a
    file with LSNs > some threshold LSN, it must also include any blocks
    whose LSN is 0/0. Those blocks are either uninitialized or are
    populated without WAL logging, so they always need to be copied.
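
    The proposed rule can be sketched as a small predicate. XLogRecPtr is a
    64-bit unsigned value in PostgreSQL and 0/0 corresponds to
    InvalidXLogRecPtr; they are redefined locally so the sketch stands alone.
    The function name block_needs_backup is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins mirroring PostgreSQL's definitions. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Hypothetical predicate: should this block be included in an incremental
 * backup taken relative to threshold_lsn?
 */
bool
block_needs_backup(XLogRecPtr page_lsn, XLogRecPtr threshold_lsn)
{
    /* LSN 0/0: uninitialized, or populated without WAL logging. */
    if (page_lsn == InvalidXLogRecPtr)
        return true;

    return page_lsn > threshold_lsn;
}
```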
    
    Outside of unlogged and temporary tables, I don't know of any case
    where we make a critical modification to an already-existing block
    without bumping the LSN. I hope there is no such case.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  142. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-09-16T16:23:28Z

    On Mon, Sep 16, 2019 at 10:38 AM Stephen Frost <sfrost@snowman.net> wrote:
    > In a number of cases, trying to make sure that on a failover or copy of
    > the backup the next 'incremental' is really an 'incremental' is
    > dangerous.  A better strategy to address this, and the other issues
    > realized on this thread recently, is to:
    >
    > - Have a manifest of every file in each backup
    > - Always back up new files that weren't in the prior backup
    > - Keep a checksum of each file
    > - Track the timestamp of each file as of when it was backed up
    > - Track the file size of each file
    > - Track the starting timestamp of each backup
    > - Always include files with a modification time after the starting
    >   timestamp of the prior backup, or if the file size has changed
    > - In the event of any anomolies (which includes things like a timeline
    >   switch), use checksum matching (aka 'delta checksum backup') to
    >   perform the backup instead of using timestamps (or just always do that
    >   if you want to be particularly careful- having an option for it is
    >   great)
    > - Probably other things I'm not thinking of off-hand, but this is at
    >   least a good start.  Make sure to checksum this information too.
    
    I agree with some of these ideas but not all of them.  I think having
    a backup manifest is a good idea; that would allow taking a new
    incremental backup to work from the manifest rather than the data
    directory, which could be extremely useful, because it might be a lot
    faster and the manifest could also be copied to a machine other than
    the one where the entire backup is stored. If the backup itself has
    been pushed off to S3 or whatever, you can't access it quickly, but
    you could keep the manifest around.
    
    I also agree that backing up all files that weren't in the previous
    backup is a good strategy.  I proposed that fairly explicitly a few
    emails back; but also, the contrary is obviously nonsense. And I also
    agree with, and proposed, that we record the size along with the file.
    
    I don't really agree with your comments about checksums and
    timestamps.  I think that, if possible, there should be ONE method of
    determining whether a block has changed in some important way, and I
    think if we can make LSN work, that would be for the best. If you use
    multiple methods of detecting changes without any clearly-defined
    reason for so doing, maybe what you're saying is that you don't really
    believe that any of the methods are reliable but if we throw the
    kitchen sink at the problem it should come out OK. Any bugs in one
    mechanism are likely to be masked by one of the others, but that's not
    as good as one method that is known to be altogether reliable.
    
    > By having a manifest for each backed up file for each backup, you also
    > gain the ability to validate that a backup in the repository hasn't been
    > corrupted post-backup, a feature that at least some other database
    > backup and restore systems have (referring specifically to the big O in
    > this particular case, but I bet others do too).
    
    Agreed. The manifest only lets you validate to a limited extent, but
    that's still useful.
    
    > Having a system of keeping track of which backups are full and which are
    > differential in an overall system also gives you the ability to do
    > things like expiration in a sensible way, including handling WAL
    > expiration.
    
    True, but I'm not sure that functionality belongs in core. It
    certainly needs to be possible for out-of-core code to do this part of
    the work if desired, because people want to integrate with enterprise
    backup systems, and we can't come in and say, well, you back up
    everything else using Netbackup or Tivoli, but for PostgreSQL you have
    to use pg_backrest. I mean, maybe you can win that argument, but I
    know I can't.
    
    > I'd like to clarify that while I would like to have an easier way to
    > parallelize backups, that's a relatively minor complaint- the much
    > bigger issue that I have with this feature is that trying to address
    > everything correctly while having only the amount of information that
    > could be passed on the command-line about the prior full/incremental is
    > going to be extremely difficult, complicated, and likely to lead to
    > subtle bugs in the actual code, and probably less than subtle bugs in
    > how users end up using it, since they'll have to implement the
    > expiration and tracking of information between backups themselves
    > (unless something's changed in that part during this discussion- I admit
    > that I've not read every email in this thread).
    
    Well, the evidence seems to show that you are right, at least to some
    extent. I consider it a positive good if the client needs to give the
    server only a limited amount of information. After all, you could
    always take an incremental backup by shipping every byte of the
    previous backup to the server, having it compare everything to the
    current contents, and having it then send you back the stuff that is
    new or different. But that would be dumb, because most of the point of
    an incremental backup is to save on sending lots of data over the
    network unnecessarily. Now, it seems that I took that goal to an
    unhealthy extreme, because as we've now realized, sending only an LSN
    and nothing else isn't enough to get a correct backup. So we need to
    send more, and it doesn't have to be the absolutely most
    stripped-down, bare-bones version of what could be sent. But it should
    be fairly minimal, I think; that's kinda the point of the feature.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  143. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-09-16T17:10:50Z

    Greetings,
    
    * Robert Haas (robertmhaas@gmail.com) wrote:
    > On Mon, Sep 16, 2019 at 10:38 AM Stephen Frost <sfrost@snowman.net> wrote:
    > > In a number of cases, trying to make sure that on a failover or copy of
    > > the backup the next 'incremental' is really an 'incremental' is
    > > dangerous.  A better strategy to address this, and the other issues
    > > realized on this thread recently, is to:
    > >
    > > - Have a manifest of every file in each backup
    > > - Always back up new files that weren't in the prior backup
    > > - Keep a checksum of each file
    > > - Track the timestamp of each file as of when it was backed up
    > > - Track the file size of each file
    > > - Track the starting timestamp of each backup
    > > - Always include files with a modification time after the starting
    > >   timestamp of the prior backup, or if the file size has changed
    > > - In the event of any anomolies (which includes things like a timeline
    > >   switch), use checksum matching (aka 'delta checksum backup') to
    > >   perform the backup instead of using timestamps (or just always do that
    > >   if you want to be particularly careful- having an option for it is
    > >   great)
    > > - Probably other things I'm not thinking of off-hand, but this is at
    > >   least a good start.  Make sure to checksum this information too.
    > 
    > I agree with some of these ideas but not all of them.  I think having
    > a backup manifest is a good idea; that would allow taking a new
    > incremental backup to work from the manifest rather than the data
    > directory, which could be extremely useful, because it might be a lot
    > faster and the manifest could also be copied to a machine other than
    > the one where the entire backup is stored. If the backup itself has
    > been pushed off to S3 or whatever, you can't access it quickly, but
    > you could keep the manifest around.
    
    Yes, those are also good reasons for having a manifest.
    
    > I also agree that backing up all files that weren't in the previous
    > backup is a good strategy.  I proposed that fairly explicitly a few
    > emails back; but also, the contrary is obviously nonsense. And I also
    > agree with, and proposed, that we record the size along with the file.
    
    Sure, I didn't mean to imply that there was something wrong with that.
    Including the checksum and other metadata is also valuable, both for
    helping to identify corruption in the backup archive and for forensics,
    if not for other reasons.
    
    > I don't really agree with your comments about checksums and
    > timestamps.  I think that, if possible, there should be ONE method of
    > determining whether a block has changed in some important way, and I
    > think if we can make LSN work, that would be for the best. If you use
    > multiple methods of detecting changes without any clearly-defined
    > reason for so doing, maybe what you're saying is that you don't really
    > believe that any of the methods are reliable but if we throw the
    > kitchen sink at the problem it should come out OK. Any bugs in one
    > mechanism are likely to be masked by one of the others, but that's not
    > as as good as one method that is known to be altogether reliable.
    
    I disagree with this on a couple of levels.  The first is pretty simple-
    we don't have all of the information.  The user may have some reason to
    believe that timestamp-based is a bad idea, for example, and therefore
    having an option to perform a checksum-based backup makes sense.  rsync
    is a pretty good tool in my view and it has a very similar option-
    because there are trade-offs to be made.  LSN is great, if you don't
    mind reading every file of your database start-to-finish every time, but
    in a running system which hasn't suffered from clock skew or other odd
    issues (some of which we can also detect), it's pretty painful to scan
    absolutely everything like that for an incremental.
    
    Perhaps the discussion has already moved on to having some way of our
    own to track if a given file has changed without having to scan all of
    it- if so, that's a discussion I'd be interested in.  I'm not against
    other approaches here besides timestamps if there's a solid reason why
    they're better and they're also able to avoid scanning the entire
    database.
    
    > > By having a manifest for each backed up file for each backup, you also
    > > gain the ability to validate that a backup in the repository hasn't been
    > > corrupted post-backup, a feature that at least some other database
    > > backup and restore systems have (referring specifically to the big O in
    > > this particular case, but I bet others do too).
    > 
    > Agreed. The manifest only lets you validate to a limited extent, but
    > that's still useful.
    
    If you track the checksum of the file in the manifest then it's a pretty
    strong validation that the backup repo hasn't been corrupted between the
    backup and the restore.  Of course, the database could have been
    corrupted at the source, and perhaps that's what you were getting at
    with your 'limited extent' but that isn't what I was referring to.
    
    Claiming that the backup has been 'validated' by only looking at file
    sizes certainly wouldn't be acceptable.  I can't imagine you were
    suggesting that as you're certainly capable of realizing that, but I got
    the feeling you weren't agreeing that having the checksum of the file
    made sense to include in the manifest, so I feel like I'm missing
    something here.
    
    > > Having a system of keeping track of which backups are full and which are
    > > differential in an overall system also gives you the ability to do
    > > things like expiration in a sensible way, including handling WAL
    > > expiration.
    > 
    > True, but I'm not sure that functionality belongs in core. It
    > certainly needs to be possible for out-of-core code to do this part of
    > the work if desired, because people want to integrate with enterprise
    > backup systems, and we can't come in and say, well, you back up
    > everything else using Netbackup or Tivoli, but for PostgreSQL you have
    > to use pg_backrest. I mean, maybe you can win that argument, but I
    > know I can't.
    
    I'm pretty baffled by this argument, particularly in this context.  We
    already have tooling around trying to manage WAL archives in core- see
    pg_archivecleanup.  Further, we're talking about pg_basebackup here, not
    about Netbackup or Tivoli, and the results of a pg_basebackup (that is,
    a set of tar files, or a data directory) could happily be backed up
    using whatever Enterprise tool folks want to use- in much the same way
    that a pgbackrest repo is also able to be backed up using whatever
    Enterprise tools someone wishes to use.  We designed it quite carefully
    to work with exactly that use-case, so the distinction here is quite
    lost on me.  Perhaps you could clarify what use-case these changes to
    pg_basebackup solve, when working with a Netbackup or Tivoli system,
    that pgbackrest doesn't, since you bring it up here?
    
    > > I'd like to clarify that while I would like to have an easier way to
    > > parallelize backups, that's a relatively minor complaint- the much
    > > bigger issue that I have with this feature is that trying to address
    > > everything correctly while having only the amount of information that
    > > could be passed on the command-line about the prior full/incremental is
    > > going to be extremely difficult, complicated, and likely to lead to
    > > subtle bugs in the actual code, and probably less than subtle bugs in
    > > how users end up using it, since they'll have to implement the
    > > expiration and tracking of information between backups themselves
    > > (unless something's changed in that part during this discussion- I admit
    > > that I've not read every email in this thread).
    > 
    > Well, the evidence seems to show that you are right, at least to some
    > extent. I consider it a positive good if the client needs to give the
    > server only a limited amount of information. After all, you could
    > always take an incremental backup by shipping every byte of the
    > previous backup to the server, having it compare everything to the
    > current contents, and having it then send you back the stuff that is
    > new or different. But that would be dumb, because most of the point of
    > an incremental backup is to save on sending lots of data over the
    > network unnecessarily. Now, it seems that I took that goal to an
    > unhealthy extreme, because as we've now realized, sending only an LSN
    > and nothing else isn't enough to get a correct backup. So we need to
    > send more, and it doesn't have to be the absolutely most
    > stripped-down, bare-bones version of what could be sent. But it should
    > be fairly minimal, I think; that's kinda the point of the feature.
    
    Right- much of the point of an incremental backup feature is to try and
    minimize the amount of work that's done while still getting a good
    backup.  I don't agree that we should focus solely on network bandwidth
    as there are also trade-offs to be made around disk bandwidth to
    consider, see above discussion regarding timestamps vs. checksum'ing
    every file.
    
    As for if we should be sending more to the server, or asking the server
    to send more to us, I don't really have a good feel for what's "best".
    At least one implementation I'm familiar with builds a manifest on the
    PG server side and then compares the results of that to the manifest
    stored with the backup (where that comparison is actually done is on
    whatever system the "backup" was started from, typically a backup
    server).  Perhaps there's an argument for sending the manifest from the
    backup repository to PostgreSQL for it to then compare against the data
    directory but I'm not really sure how it could possibly do that more
    efficiently and that's moving work to the PG server that it doesn't
    really need to do.
    
    Thanks,
    
    Stephen
    
  144. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-09-16T17:39:33Z

    Greetings,
    
    * Robert Haas (robertmhaas@gmail.com) wrote:
    > On Mon, Sep 16, 2019 at 9:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
    > > > Won't some operations where at the end we directly call heap_sync
    > > > without writing WAL have a similar problem as well?
    > >
    > > Maybe.  Can you give an example?
    > 
    > Looking through the code, I found two cases where we do this.  One is
    > a bulk insert operation with wal_level = minimal, and the other is
    > CLUSTER or VACUUM FULL with wal_level = minimal. In both of these
    > cases we are generating new blocks whose LSNs will be 0/0. So, I think
    > we need a rule that if the server is asked to back up all blocks in a
    > file with LSNs > some threshold LSN, it must also include any blocks
    > whose LSN is 0/0. Those blocks are either uninitialized or are
    > populated without WAL logging, so they always need to be copied.
    
    I'm not sure I see a way around it but this seems pretty unfortunate-
    every single incremental backup will have all of those included even
    though the full backup likely also does (I say likely since someone
    could do a full backup, set the WAL to minimal, load a bunch of data,
    and then restart back to a WAL level where we can do a new backup, and
    then do an incremental, so we don't *know* that the full includes those
    blocks unless we also track a block-level checksum or similar).  Then
    again, doing these kinds of server bounces to change the WAL level
    around is, hopefully, relatively rare..
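
    A minimal sketch of the selection rule quoted above, in Python for
    illustration only: a block is copied if its LSN is above the threshold
    or if its LSN is 0/0.  The names Block, parse_lsn and blocks_to_copy are
    hypothetical and not part of any PostgreSQL API; LSNs are modeled as
    plain 64-bit integers.

```python
from typing import Iterable, List, NamedTuple

INVALID_LSN = 0  # stands in for LSN 0/0


class Block(NamedTuple):
    blkno: int  # block number within the relation file
    lsn: int    # page LSN as a 64-bit integer


def parse_lsn(text: str) -> int:
    """Convert the 'X/Y' hexadecimal LSN notation to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def blocks_to_copy(blocks: Iterable[Block], threshold_lsn: int) -> List[int]:
    """Return block numbers whose LSN exceeds the threshold or is 0/0."""
    return [
        b.blkno
        for b in blocks
        if b.lsn == INVALID_LSN or b.lsn > threshold_lsn
    ]
```

    Blocks written without WAL match the 0/0 branch on every incremental,
    which is the repeated cost described above.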
    
    > Outside of unlogged and temporary tables, I don't know of any case
    > where we make a critical modification to an already-existing block
    > without bumping the LSN. I hope there is no such case.
    
    I believe we all do. :)
    
    Thanks,
    
    Stephen
    
  145. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-09-16T19:02:41Z

    On Mon, Sep 16, 2019 at 1:10 PM Stephen Frost <sfrost@snowman.net> wrote:
    > I disagree with this on a couple of levels.  The first is pretty simple-
    > we don't have all of the information.  The user may have some reason to
    > believe that timestamp-based is a bad idea, for example, and therefore
    > having an option to perform a checksum-based backup makes sense.  rsync
    > is a pretty good tool in my view and it has a very similar option-
    > because there are trade-offs to be made.  LSN is great, if you don't
    > mind reading every file of your database start-to-finish every time, but
    > in a running system which hasn't suffered from clock skew or other odd
    > issues (some of which we can also detect), it's pretty painful to scan
    > absolutely everything like that for an incremental.
    
    There's a separate thread on using WAL-scanning to avoid having to
    scan all the data every time. I pointed it out to you early in this
    thread, too.
    
    > If you track the checksum of the file in the manifest then it's a pretty
    > strong validation that the backup repo hasn't been corrupted between the
    > backup and the restore.  Of course, the database could have been
    > corrupted at the source, and perhaps that's what you were getting at
    > with your 'limited extent' but that isn't what I was referring to.
    
    Yeah, that all seems fair. Without the checksum, you can only validate
    that you have the right files and that they are the right sizes, which
    is not bad, but the checksums certainly make it stronger. But,
    wouldn't having to checksum all of the files add significantly to the
    cost of taking the backup? If so, I can imagine that some people might
    want to pay that cost but others might not. If it's basically free to
    checksum the data while we have it in memory anyway, then I guess
    there's little to be lost.
    
    > I'm pretty baffled by this argument, particularly in this context.  We
    > already have tooling around trying to manage WAL archives in core- see
    > pg_archivecleanup.  Further, we're talking about pg_basebackup here, not
    > about Netbackup or Tivoli, and the results of a pg_basebackup (that is,
    > a set of tar files, or a data directory) could happily be backed up
    > using whatever Enterprise tool folks want to use- in much the same way
    > that a pgbackrest repo is also able to be backed up using whatever
    > Enterprise tools someone wishes to use.  We designed it quite carefully
    > to work with exactly that use-case, so the distinction here is quite
    > lost on me.  Perhaps you could clarify what use-case these changes to
    > pg_basebackup solve, when working with a Netbackup or Tivoli system,
    > that pgbackrest doesn't, since you bring it up here?
    
    I'm not an expert on any of those systems, but I doubt that
    everybody's OK with backing everything up to a pgbackrest repository
    and then separately backing up that repository to some other system.
    That sounds like a pretty large storage cost.
    
    > As for if we should be sending more to the server, or asking the server
    > to send more to us, I don't really have a good feel for what's "best".
    > At least one implementation I'm familiar with builds a manifest on the
    > PG server side and then compares the results of that to the manifest
    > stored with the backup (where that comparison is actually done is on
    > whatever system the "backup" was started from, typically a backup
    > server).  Perhaps there's an argument for sending the manifest from the
    > backup repository to PostgreSQL for it to then compare against the data
    > directory but I'm not really sure how it could possibly do that more
    > efficiently and that's moving work to the PG server that it doesn't
    > really need to do.
    
    I agree with all that, but... if the server builds a manifest on the
    PG server that is to be compared with the backup's manifest, the one
    the PG server builds can't really include checksums, I think. To get
    the checksums, it would have to read the entire cluster while building
    the manifest, which sounds insane. Presumably it would have to build a
    checksum-free version of the manifest, and then the client could
    checksum the files as they're streamed down and write out a revised
    manifest that adds the checksums.
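
    A rough sketch of that last step, assuming the client receives each
    file as a byte stream: the checksum is computed on the chunks already
    in memory while they are written out, and the manifest entry is then
    revised.  The function receive_file, the entry layout and the choice of
    SHA-256 are all assumptions made for illustration, not an existing
    pg_basebackup interface.

```python
import hashlib
from typing import BinaryIO, Dict


def receive_file(entry: Dict, src: BinaryIO, dst: BinaryIO,
                 chunk_size: int = 65536) -> Dict:
    """Copy src to dst and return a manifest entry with size and checksum."""
    digest = hashlib.sha256()
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)  # checksum the data while it is in memory
        dst.write(chunk)
        total += len(chunk)
    revised = dict(entry)
    revised["size"] = total  # record what was actually received
    revised["sha256"] = digest.hexdigest()
    return revised
```

    The server-side manifest stays checksum-free in this arrangement, so
    the server never has to read files it is not going to send.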
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  146. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-09-16T19:38:47Z

    Greetings,
    
    * Robert Haas (robertmhaas@gmail.com) wrote:
    > On Mon, Sep 16, 2019 at 1:10 PM Stephen Frost <sfrost@snowman.net> wrote:
    > > I disagree with this on a couple of levels.  The first is pretty simple-
    > > we don't have all of the information.  The user may have some reason to
    > > believe that timestamp-based is a bad idea, for example, and therefore
    > > having an option to perform a checksum-based backup makes sense.  rsync
    > > is a pretty good tool in my view and it has a very similar option-
    > > because there are trade-offs to be made.  LSN is great, if you don't
    > > mind reading every file of your database start-to-finish every time, but
    > > in a running system which hasn't suffered from clock skew or other odd
    > > issues (some of which we can also detect), it's pretty painful to scan
    > > absolutely everything like that for an incremental.
    > 
    > There's a separate thread on using WAL-scanning to avoid having to
    > scan all the data every time. I pointed it out to you early in this
    > thread, too.
    
    As discussed nearby, not everything that needs to be included in the
    backup is actually going to be in the WAL though, right?  How would that
    ever be able to handle the case where someone starts the server under
    wal_level = logical, takes a full backup, then restarts with wal_level =
    minimal, writes out a bunch of new data, and then restarts back to
    wal_level = logical and takes an incremental?
    
    How would we even detect that such a thing happened?
    
    > > If you track the checksum of the file in the manifest then it's a pretty
    > > strong validation that the backup repo hasn't been corrupted between the
    > > backup and the restore.  Of course, the database could have been
    > > corrupted at the source, and perhaps that's what you were getting at
    > > with your 'limited extent' but that isn't what I was referring to.
    > 
    > Yeah, that all seems fair. Without the checksum, you can only validate
    > that you have the right files and that they are the right sizes, which
    > is not bad, but the checksums certainly make it stronger. But,
    > wouldn't having to checksum all of the files add significantly to the
    > cost of taking the backup? If so, I can imagine that some people might
    > want to pay that cost but others might not. If it's basically free to
    > checksum the data while we have it in memory anyway, then I guess
    > there's little to be lost.
    
    On larger systems, so many of the files are 1GB in size that checking
    the file size is quite close to meaningless.  Yes, having to checksum
    all of the files definitely adds to the cost of taking the backup, but
    to avoid it we need strong assurances that a given file hasn't been
    changed since our last full backup.  WAL, today at least, isn't quite
    that, and timestamps can possibly be fooled with, so if you'd like to be
    particularly careful, there doesn't seem to be a lot of alternatives.
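
    To illustrate why a size check alone says little, here is a small
    sketch in which a file is corrupted without changing its length.  The
    manifest entry layout and the helper names make_entry and verify are
    hypothetical, and SHA-256 is just one possible checksum.

```python
import hashlib
from typing import Dict


def make_entry(path: str, data: bytes) -> Dict:
    """Build a manifest entry recording size and checksum."""
    return {
        "path": path,
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def verify(entry: Dict, data: bytes, check_checksum: bool = True) -> bool:
    """Validate data against the entry, by size only or size plus checksum."""
    if len(data) != entry["size"]:
        return False
    if check_checksum:
        return hashlib.sha256(data).hexdigest() == entry["sha256"]
    return True
```

    A same-length file with one flipped byte passes verify() with
    check_checksum=False and fails it with the checksum enabled.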
    
    > > I'm pretty baffled by this argument, particularly in this context.  We
    > > already have tooling around trying to manage WAL archives in core- see
    > > pg_archivecleanup.  Further, we're talking about pg_basebackup here, not
    > > about Netbackup or Tivoli, and the results of a pg_basebackup (that is,
    > > a set of tar files, or a data directory) could happily be backed up
    > > using whatever Enterprise tool folks want to use- in much the same way
    > > that a pgbackrest repo is also able to be backed up using whatever
    > > Enterprise tools someone wishes to use.  We designed it quite carefully
    > > to work with exactly that use-case, so the distinction here is quite
    > > lost on me.  Perhaps you could clarify what use-case these changes to
    > > pg_basebackup solve, when working with a Netbackup or Tivoli system,
    > > that pgbackrest doesn't, since you bring it up here?
    > 
    > I'm not an expert on any of those systems, but I doubt that
    > everybody's OK with backing everything up to a pgbackrest repository
    > and then separately backing up that repository to some other system.
    > That sounds like a pretty large storage cost.
    
    I'm not asking you to be an expert on those systems, just to help me
    understand the statements you're making.  How is backing up to a
    pgbackrest repo different than running a pg_basebackup in the context of
    using some other Enterprise backup system?  In both cases, you'll have a
    full copy of the backup (presumably compressed) somewhere out on a disk
    or filesystem which is then backed up by the Enterprise tool.
    
    > > As for if we should be sending more to the server, or asking the server
    > > to send more to us, I don't really have a good feel for what's "best".
    > > At least one implementation I'm familiar with builds a manifest on the
    > > PG server side and then compares the results of that to the manifest
    > > stored with the backup (where that comparison is actually done is on
    > > whatever system the "backup" was started from, typically a backup
    > > server).  Perhaps there's an argument for sending the manifest from the
    > > backup repository to PostgreSQL for it to then compare against the data
    > > directory but I'm not really sure how it could possibly do that more
    > > efficiently and that's moving work to the PG server that it doesn't
    > > really need to do.
    > 
    > I agree with all that, but... if the server builds a manifest on the
    > PG server that is to be compared with the backup's manifest, the one
    > the PG server builds can't really include checksums, I think. To get
    > the checksums, it would have to read the entire cluster while building
    > the manifest, which sounds insane. Presumably it would have to build a
    > checksum-free version of the manifest, and then the client could
    > checksum the files as they're streamed down and write out a revised
    > manifest that adds the checksums.
    
    Unless files can be excluded based on some relatively strong criteria,
    then yes, the approach would be to use checksums of the files and would
    necessarily include all files, meaning that you'd have to read them all.
    
    That's not great, of course, which is why there are trade-offs to be
    made, one of which typically involves using timestamps, but doing so
    quite carefully, to perform the file exclusion.  Other ideas are great
    but it seems like WAL isn't really a great idea unless we make some
    changes there and we, as in PG, haven't got a robust "we know this file
    changed as of this point" to work from.  I worry that we're putting too
    much faith into a system to do something independent of what it was
    actually built and designed to do, and thinking that because we could
    trust it for X, we can trust it for Y.
    
    Thanks,
    
    Stephen
    
  147. Re: block-level incremental backup

    Amit Kapila <amit.kapila16@gmail.com> — 2019-09-17T09:21:38Z

    On Mon, Sep 16, 2019 at 11:09 PM Stephen Frost <sfrost@snowman.net> wrote:
    >
    > Greetings,
    >
    > * Robert Haas (robertmhaas@gmail.com) wrote:
    > > On Mon, Sep 16, 2019 at 9:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
    > > > > Won't some operations where at the end we directly call heap_sync
    > > > > without writing WAL have a similar problem as well?
    > > >
    > > > Maybe.  Can you give an example?
    > >
    > > Looking through the code, I found two cases where we do this.  One is
    > > a bulk insert operation with wal_level = minimal, and the other is
    > > CLUSTER or VACUUM FULL with wal_level = minimal. In both of these
    > > cases we are generating new blocks whose LSNs will be 0/0. So, I think
    > > we need a rule that if the server is asked to back up all blocks in a
    > > file with LSNs > some threshold LSN, it must also include any blocks
    > > whose LSN is 0/0. Those blocks are either uninitialized or are
    > > populated without WAL logging, so they always need to be copied.
    >
    > I'm not sure I see a way around it but this seems pretty unfortunate-
    > every single incremental backup will have all of those included even
    > though the full backup likely also does
    >
    
    Yeah, this is quite unfortunate.  One more thing to note is that the
    same is true for other operations like 'create index' (e.g. nbtree
    bypasses the buffer manager while creating the index, doesn't write WAL
    for wal_level=minimal, and then syncs at the end).
    
    -- 
    With Regards,
    Amit Kapila.
    EnterpriseDB: http://www.enterprisedb.com
    
    
    
    
  148. Re: block-level incremental backup

    Amit Kapila <amit.kapila16@gmail.com> — 2019-09-17T09:24:11Z

    On Mon, Sep 16, 2019 at 7:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
    >
    > On Mon, Sep 16, 2019 at 4:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
    > > This seems to be a blocking problem for the LSN based design.
    >
    > Well, only the simplest version of it, I think.
    >
    > > Can we think of using creation time for file?  Basically, if the file
    > > creation time is later than backup_label's "START TIME:", then include
    > > that file entirely.  I think one big point against this is clock skew
    > > like what if somebody tinkers with the clock.  And also, this can
    > > cover cases like what Jeevan has pointed out but might not cover
    > > other cases which we found problematic.
    >
    > Well that would mean, for example, that if you copied the data
    > directory from one machine to another, the next "incremental" backup
    > would turn into a full backup. That sucks. And in other situations,
    > like resetting the clock, it could mean that you end up with a corrupt
    > backup without any real ability for PostgreSQL to detect it. I'm not
    > saying that it is impossible to create a practically useful system
    > based on file time stamps, but I really don't like it.
    >
    > > I think the operations covered by WAL flag XLR_SPECIAL_REL_UPDATE will
    > > have similar problems.
    >
    > I'm not sure quite what you mean by that.  Can you elaborate? It
    > appears to me that the XLR_SPECIAL_REL_UPDATE operations are all
    > things that create files, remove files, or truncate files, and the
    > sketch in my previous email would handle the first two of those cases
    > correctly.  See below for the third.
    >
    > > One related point is how do incremental backups handle the case where
    > > vacuum truncates the relation partially?  Basically, with current
    > > patch/design, it doesn't appear that such information can be passed
    > > via incremental backup.  I am not sure if this is a problem, but it
    > > would be good if we can somehow handle this.
    >
    > As to this, if you're taking a full backup of a particular file,
    > there's no problem.  If you're taking a partial backup of a particular
    > file, you need to include the current length of the file and the
    > identity and contents of each modified block.  Then you're fine.
    >
    
    Right, this should address that point.
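
    A minimal sketch of how such a partial-file record could be applied at
    restore time, assuming the incremental carries the file's current
    length in blocks plus the modified blocks keyed by block number.  The
    function apply_incremental and its arguments are hypothetical; the
    8192-byte block size is the PostgreSQL default.

```python
from typing import Dict

BLOCK_SIZE = 8192  # default PostgreSQL block size


def apply_incremental(prior: bytes, length_blocks: int,
                      changed: Dict[int, bytes]) -> bytes:
    """Rebuild a file from its prior copy, new length, and changed blocks."""
    new_size = length_blocks * BLOCK_SIZE
    out = bytearray(prior[:new_size])        # drop blocks truncated away
    out.extend(bytes(new_size - len(out)))   # zero-fill if the file grew
    for blkno, data in changed.items():
        if not 0 <= blkno < length_blocks:
            raise ValueError(f"block {blkno} is beyond the recorded length")
        if len(data) != BLOCK_SIZE:
            raise ValueError(f"block {blkno} has the wrong size")
        out[blkno * BLOCK_SIZE:(blkno + 1) * BLOCK_SIZE] = data
    return bytes(out)
```

    Because the recorded length is applied first, a relation truncated by
    vacuum is shortened on restore even when none of its remaining blocks
    changed.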
    
    -- 
    With Regards,
    Amit Kapila.
    EnterpriseDB: http://www.enterprisedb.com
    
    
    
    
  149. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-09-17T14:55:04Z

    On Mon, Sep 16, 2019 at 3:38 PM Stephen Frost <sfrost@snowman.net> wrote:
    > As discussed nearby, not everything that needs to be included in the
    > backup is actually going to be in the WAL though, right?  How would that
    > ever be able to handle the case where someone starts the server under
    > wal_level = logical, takes a full backup, then restarts with wal_level =
    > minimal, writes out a bunch of new data, and then restarts back to
    > wal_level = logical and takes an incremental?
    
    Fair point. I think the WAL-scanning approach can only work if
    wal_level > minimal. But, I also think that few people run with
    wal_level = minimal in this era where the default has been changed to
    replica; and I think we can detect the WAL level in use while scanning
    WAL. It can only change at a checkpoint.
    
    > On larger systems, so many of the files are 1GB in size that checking
    > the file size is quite close to meaningless.  Yes, having to checksum
    > all of the files definitely adds to the cost of taking the backup, but
    > to avoid it we need strong assurances that a given file hasn't been
    > changed since our last full backup.  WAL, today at least, isn't quite
    > that, and timestamps can possibly be fooled with, so if you'd like to be
    > particularly careful, there doesn't seem to be a lot of alternatives.
    
    I see your points, but it feels like you're trying to talk down the
    WAL-based approach over what seem to me to be fairly manageable corner
    cases.
    
    > I'm not asking you to be an expert on those systems, just to help me
    > understand the statements you're making.  How is backing up to a
    > pgbackrest repo different than running a pg_basebackup in the context of
    > using some other Enterprise backup system?  In both cases, you'll have a
    > full copy of the backup (presumably compressed) somewhere out on a disk
    > or filesystem which is then backed up by the Enterprise tool.
    
    Well, I think that what people really want is to be able to back up
    straight into the enterprise tool, without an intermediate step.
    
    My basic point here is: As with practically all PostgreSQL
    development, I think we should try to expose capabilities and avoid
    making policy on behalf of users.
    
    I'm not objecting to the idea of having tools that can help users
    figure out how much WAL they need to retain -- but insofar as we can
    do it, such tools should work regardless of where that WAL is actually
    stored. I dislike the idea that PostgreSQL would provide something
    akin to a "pgbackrest repository" in core, or at least I think it
    would be important that we're careful about how much functionality
    gets tied to the presence and use of such a thing, because, at least
    based on my experience working at EnterpriseDB, larger customers often
    don't want to do it that way.
    
    > That's not great, of course, which is why there are trade-offs to be
    > made, one of which typically involves using timestamps, but doing so
    > quite carefully, to perform the file exclusion.  Other ideas are great
    > but it seems like WAL isn't really a great idea unless we make some
    > changes there and we, as in PG, haven't got a robust "we know this file
    > changed as of this point" to work from.  I worry that we're putting too
    > much faith into a system to do something independent of what it was
    > actually built and designed to do, and thinking that because we could
    > trust it for X, we can trust it for Y.
    
    That seems like a considerable overreaction to me based on the
    problems reported thus far. The fact is, WAL was originally intended
    for crash recovery and has subsequently been generalized to be usable
    for point-in-time recovery, standby servers, and logical decoding.
    It's clearly established at this point as the canonical way that you
    know what in the database has changed, which is the same need that we
    have for incremental backup.
    
    At any rate, the same criticism can be leveled - IMHO with a lot more
    validity - at timestamps. Last-modification timestamps are completely
    outside of our control; they are owned by the OS and various operating
    systems can and do have varying behavior. They can go backwards when
    things have changed; they can go forwards when things have not
    changed. They were clearly not intended to meet this kind of
    requirement; at the very least, they were intended for that purpose
    much less than WAL, which was actually designed for a requirement in
    this general ballpark, if not this thing precisely.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  150. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-09-17T16:09:08Z

    Greetings,
    
    * Robert Haas (robertmhaas@gmail.com) wrote:
    > On Mon, Sep 16, 2019 at 3:38 PM Stephen Frost <sfrost@snowman.net> wrote:
    > > As discussed nearby, not everything that needs to be included in the
    > > backup is actually going to be in the WAL though, right?  How would that
    > > ever be able to handle the case where someone starts the server under
    > > wal_level = logical, takes a full backup, then restarts with wal_level =
    > > minimal, writes out a bunch of new data, and then restarts back to
    > > wal_level = logical and takes an incremental?
    > 
    > Fair point. I think the WAL-scanning approach can only work if
    > wal_level > minimal. But, I also think that few people run with
    > wal_level = minimal in this era where the default has been changed to
    > replica; and I think we can detect the WAL level in use while scanning
    > WAL. It can only change at a checkpoint.
    
    We need to be sure that we can detect if the WAL level has ever been set
    to minimal between a full and an incremental and, if so, either refuse
    to run the incremental, or promote it to a full, or make it a
    checksum-based incremental instead of trusting the WAL stream.
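
    The three options can be sketched as a small decision function.  This
    assumes some earlier step has already collected every wal_level value
    in effect between the full backup and the incremental; how that list
    is obtained from WAL is not shown, and the names Policy and plan_backup
    are hypothetical.

```python
from enum import Enum
from typing import Iterable


class Policy(Enum):
    REFUSE = "refuse"
    PROMOTE_TO_FULL = "promote"
    CHECKSUM_INCREMENTAL = "checksum"


def plan_backup(observed_wal_levels: Iterable[str], policy: Policy) -> str:
    """Choose a backup type given the wal_level values seen since the full."""
    if "minimal" not in set(observed_wal_levels):
        return "wal-incremental"  # the WAL stream can be trusted
    if policy is Policy.REFUSE:
        raise RuntimeError("wal_level was minimal since the prior backup")
    if policy is Policy.PROMOTE_TO_FULL:
        return "full"
    return "checksum-incremental"
```

    Which policy should be the default is left open here, as it is in the
    discussion above.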
    
    I'm also glad that we ended up changing the default though and I do hope
    that there's relatively few people running with minimal and that there's
    even fewer who play around with flipping it back and forth.
    
    > > On larger systems, so many of the files are 1GB in size that checking
    > > the file size is quite close to meaningless.  Yes, having to checksum
    > > all of the files definitely adds to the cost of taking the backup, but
    > > to avoid it we need strong assurances that a given file hasn't been
    > > changed since our last full backup.  WAL, today at least, isn't quite
    > > that, and timestamps can possibly be fooled with, so if you'd like to be
    > > particularly careful, there doesn't seem to be a lot of alternatives.
    > 
    > I see your points, but it feels like you're trying to talk down the
    > WAL-based approach over what seem to me to be fairly manageable corner
    > cases.
    
    Just to be clear, I see your points and I like the general idea of
    finding solutions, but it seems like the issues are likely to be pretty
    complex and I'm not sure that's being appreciated very well.
    
    > > I'm not asking you to be an expert on those systems, just to help me
    > > understand the statements you're making.  How is backing up to a
    > > pgbackrest repo different than running a pg_basebackup in the context of
    > > using some other Enterprise backup system?  In both cases, you'll have a
    > > full copy of the backup (presumably compressed) somewhere out on a disk
    > > or filesystem which is then backed up by the Enterprise tool.
    > 
    > Well, I think that what people really want is to be able to back up
    > straight into the enterprise tool, without an intermediate step.
    
    Ok..  I can understand that but I don't get how these changes to
    pg_basebackup will help facilitate that.  If they don't and what you're
    talking about here is independent, then great, that clarifies things,
    but if you're saying that these changes to pg_basebackup are to help
    with backing up directly into those Enterprise systems then I'm just
    asking for some help in understanding how- what's the use-case here that
    we're adding to pg_basebackup that makes it work with these Enterprise
    systems?
    
    I'm not trying to be difficult here, I'm just trying to understand.
    
    > My basic point here is: As with practically all PostgreSQL
    > development, I think we should try to expose capabilities and avoid
    > making policy on behalf of users.
    > 
    > I'm not objecting to the idea of having tools that can help users
    > figure out how much WAL they need to retain -- but insofar as we can
    > do it, such tools should work regardless of where that WAL is actually
    > stored. 
    
    How would that tool work, if it's to be able to work regardless of where
    the WAL is actually stored..?  Today, pg_archivecleanup just works
    against a POSIX filesystem- are you thinking that the tool would have a
    pluggable storage system, so that it could work with, say, a POSIX
    filesystem, or a CIFS mount, or an s3-like system?
    
    > I dislike the idea that PostgreSQL would provide something
    > akin to a "pgbackrest repository" in core, or at least I think it
    > would be important that we're careful about how much functionality
    > gets tied to the presence and use of such a thing, because, at least
    > based on my experience working at EnterpriseDB, larger customers often
    > don't want to do it that way.
    
    This seems largely independent of the above discussion, but since we're
    discussing it, I've certainly had various experiences in this area too-
    some larger customers would like to use an s3-like store (which
    pgbackrest already supports and will be supporting others going forward
    as it has a pluggable storage mechanism for the repo...), and then
    there's customers who would like to point their Enterprise backup
    solution at a directory on disk to back it up (which pgbackrest also
    supports, as mentioned previously), and lastly there's customers who
    really want to just backup the PG data directory and they'd like it to
    "just work", thank you, and no they don't have any thought or concern
    about how to handle WAL, but surely it can't be that important, can it?
    
    The last is tongue-in-cheek and I'm half-kidding there, but this is why
    I was trying to understand the comments above about what the use-case is
    here that we're trying to solve for that answers the call for the
    Enterprise software crowd, and ideally what distinguishes that from
    pgbackrest, but just the clear cut "this is what this change will do to
    make pg_basebackup work for Enterprise customers" would be great, or
    even a "well, pg_basebackup today works for them because it does X and
    it'll continue to be able to do X even after this change."
    
    I'll take a wild shot in the dark to try to help move us through this-
    is it that pg_basebackup can stream out to stdout in some cases..?
    Though that's quite limited since it means you can't have additional
    tablespaces and you can't stream the WAL, and how would that work with
    the manifest idea that's being discussed..?  If there's a directory
    that's got manifest files in it for each backup, so we have the file
    sizes for them, those would need to be accessible when we go to do the
    incremental backup and couldn't be stored off somewhere else, I wouldn't
    think..
    
    > > That's not great, of course, which is why there are trade-offs to be
    > > made, one of which typically involves using timestamps, but doing so
    > > quite carefully, to perform the file exclusion.  Other ideas are great
    > > but it seems like WAL isn't really a great idea unless we make some
    > > changes there and we, as in PG, haven't got a robust "we know this file
    > > changed as of this point" to work from.  I worry that we're putting too
    > > much faith into a system to do something independent of what it was
    > > actually built and designed to do, and thinking that because we could
    > > trust it for X, we can trust it for Y.
    > 
    > That seems like a considerable overreaction to me based on the
    > problems reported thus far. The fact is, WAL was originally intended
    > for crash recovery and has subsequently been generalized to be usable
    > for point-in-time recovery, standby servers, and logical decoding.
    > It's clearly established at this point as the canonical way that you
    > know what in the database has changed, which is the same need that we
    > have for incremental backup.
    
    Provided the WAL level is at the level that you need it to be that will
    be true for things which are actually supported with PITR, replication
    to standby servers, et al.  I can see how it might come across as an
    overreaction but this strikes me as a pretty glaring issue and I worry
    that if it was overlooked until now that there'll be other more subtle
    issues, and backups are just plain complicated to get right, just to
    begin with already, something that I don't think people appreciate until
    they've been dealing with them for quite a while.
    
    Not that this would be the first time we've had issues in this area, and
    we'd likely work through them over time, but I'm sure we'd all prefer to
    get it as close to right as possible the first time around, and that's
    going to require some pretty in depth review.
    
    > At any rate, the same criticism can be leveled - IMHO with a lot more
    > validity - at timestamps. Last-modification timestamps are completely
    > outside of our control; they are owned by the OS and various operating
    > systems can and do have varying behavior. They can go backwards when
    > things have changed; they can go forwards when things have not
    > changed. They were clearly not intended to meet this kind of
    > requirement. At the very least, they were intended for that purpose
    > much less so than WAL, which was actually designed for a requirement in this
    > general ballpark, if not this thing precisely.
    
    While I understand that timestamps may be used for a lot of things and
    that the time on a system could go forward or backward, the actual
    requirement is:
    
    - If the file was modified after the backup was done, the timestamp (or
      the size) needs to be different.  Doesn't actually matter if it's
      forwards, or backwards, different is all that's needed.  The timestamp
      also needs to be before the backup started for it to be considered an
      option to skip it.
    
    Is it possible for that to be fooled?  Yes, of course, but it isn't as
    easily fooled as your typical "just copy files newer than X" issue that
    other tools have, at least, if you're keeping a manifest of all of the
    files, et al, as discussed earlier.
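    
    As a rough sketch of the rule above (not code from pgbackrest or
    pg_basebackup), the skip decision might look like the following.  The
    names ManifestEntry, can_skip and prior_backup_start_ns are hypothetical,
    and reading "before the backup started" as the start of the prior backup
    is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    """Hypothetical per-file record kept in the prior backup's manifest."""
    size: int       # file size in bytes at the time of the prior backup
    mtime_ns: int   # last-modification time recorded in the manifest


def can_skip(prior: ManifestEntry, size: int, mtime_ns: int,
             prior_backup_start_ns: int) -> bool:
    """Return True only if the file may be left out of the incremental."""
    # Any size change means the file has to be copied again.
    if size != prior.size:
        return False
    # Direction does not matter: a timestamp that moved forwards or
    # backwards is simply "different" and forces a copy.
    if mtime_ns != prior.mtime_ns:
        return False
    # Assumption: the timestamp must also be strictly older than the start
    # of the prior backup, otherwise the file may have changed while it was
    # being copied and is not eligible for skipping.
    return mtime_ns < prior_backup_start_ns
```

    Everything that fails any of the three tests gets copied, so the failure
    mode is an unnecessarily large incremental rather than a missed change,
    short of a modification that leaves both size and timestamp untouched.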
    
    Thanks,
    
    Stephen
    
  151. Re: block-level incremental backup

    Robert Haas <robertmhaas@gmail.com> — 2019-09-17T16:58:23Z

    On Tue, Sep 17, 2019 at 12:09 PM Stephen Frost <sfrost@snowman.net> wrote:
    > We need to be sure that we can detect if the WAL level has ever been set
    > to minimal between a full and an incremental and, if so, either refuse
    > to run the incremental, or promote it to a full, or make it a
    > checksum-based incremental instead of trusting the WAL stream.
    
    Sure. What about checksum collisions?
    
    > Just to be clear, I see your points and I like the general idea of
    > finding solutions, but it seems like the issues are likely to be pretty
    > complex and I'm not sure that's being appreciated very well.
    
    Definitely possible, but it's more helpful if you can point out the
    actual issues.
    
    > Ok..  I can understand that but I don't get how these changes to
    > pg_basebackup will help facilitate that.  If they don't and what you're
    > talking about here is independent, then great, that clarifies things,
    > but if you're saying that these changes to pg_basebackup are to help
    > with backing up directly into those Enterprise systems then I'm just
    > asking for some help in understanding how- what's the use-case here that
    > we're adding to pg_basebackup that makes it work with these Enterprise
    > systems?
    >
    > I'm not trying to be difficult here, I'm just trying to understand.
    
    Man, I feel like we're totally drifting off into the weeds here.  I'm
    not arguing that these changes to pg_basebackup will help enterprise
    users except insofar as those users want incremental backup.  All of
    this discussion started with this comment from you:
    
    "Having a system of keeping track of which backups are full and which
    are differential in an overall system also gives you the ability to do
    things like expiration in a sensible way, including handling WAL
    expiration."
    
    All I was doing was saying that for an enterprise user, the overall
    system might be something entirely outside of our control, like
    NetBackup or Tivoli. Therefore, whatever functionality we provide to
    do that kind of thing should be able to be used in such contexts. That
    hardly seems like a controversial proposition.
    
    > How would that tool work, if it's to be able to work regardless of where
    > the WAL is actually stored..?  Today, pg_archivecleanup just works
    > against a POSIX filesystem- are you thinking that the tool would have a
    > pluggable storage system, so that it could work with, say, a POSIX
    > filesystem, or a CIFS mount, or an s3-like system?
    
    Again, I was making a general statement about design goals -- "we
    should try to work nicely with enterprise backup products" -- not
    proposing a specific design for a specific thing. I don't think the
    idea of some pluggability in that area is a bad one, but it's not even
    slightly what this thread is about.
    
    > Provided the WAL level is at the level that you need it to be that will
    > be true for things which are actually supported with PITR, replication
    > to standby servers, et al.  I can see how it might come across as an
    > overreaction but this strikes me as a pretty glaring issue and I worry
    > that if it was overlooked until now that there'll be other more subtle
    > issues, and backups are just plain complicated to get right, just to
    > begin with already, something that I don't think people appreciate until
    > they've been dealing with them for quite a while.
    
    Permit me to be unpersuaded. If it was such a glaring issue, and if
    experience is the key to spotting such issues, then why didn't YOU
    spot it?
    
    I'm not arguing that this stuff isn't hard. It is. Nor am I arguing
    that I didn't screw up. I did. But designs need to be accepted or
    rejected based on facts, not FUD. You've raised some good technical
    points and if you've got more concerns, I'd like to hear them, but I
    don't think arguing vaguely that a certain approach will probably run
    into trouble gets us anywhere.
    
    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  152. Re: block-level incremental backup

    Stephen Frost <sfrost@snowman.net> — 2019-09-17T17:48:00Z

    Greetings,
    
    * Robert Haas (robertmhaas@gmail.com) wrote:
    > On Tue, Sep 17, 2019 at 12:09 PM Stephen Frost <sfrost@snowman.net> wrote:
    > > We need to be sure that we can detect if the WAL level has ever been set
    > > to minimal between a full and an incremental and, if so, either refuse
    > > to run the incremental, or promote it to a full, or make it a
    > > checksum-based incremental instead of trusting the WAL stream.
    > 
    > Sure. What about checksum collisions?
    
    Certainly possible, of course, but a sha256 of each file is at least
    somewhat better than, say, our page-level checksums.  I do agree that
    having the option to just say "promote it to a full", or "do a
    byte-by-byte comparison against the prior backed up file" would be
    useful for those who are concerned about sha256 collision probabilities.
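    
    To illustrate the two options (a sketch only, not how any existing tool
    does it), a sha256 comparison with an optional byte-by-byte check against
    the prior backed up file could look like this.  The function names and the
    idea of passing the prior copy's path are assumptions made for the
    example.

```python
import filecmp
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the sha256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_unchanged(current_path, prior_sha256, prior_copy_path=None):
    """Decide whether a file still matches what the prior backup holds.

    prior_sha256 is the digest recorded for the file in the prior backup.
    If prior_copy_path is given, a matching digest is confirmed with a full
    byte-by-byte comparison, for those worried about collisions.
    """
    if sha256_of_file(current_path) != prior_sha256:
        return False
    if prior_copy_path is not None:
        # shallow=False forces a comparison of the file contents instead
        # of trusting os.stat() signatures.
        return filecmp.cmp(current_path, prior_copy_path, shallow=False)
    return True
```

    The byte-by-byte path obviously requires the prior copy to be readable
    (and uncompressed) wherever the comparison runs, which is part of the
    cost of being that careful.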
    
    Having a cross-check of "does this X% of files that we decided not to
    back up due to whatever really still match what we think is in the
    backup?" is definitely a valuable feature and one which I'd hope we get
    to at some point.
    
    > > Ok..  I can understand that but I don't get how these changes to
    > > pg_basebackup will help facilitate that.  If they don't and what you're
    > > talking about here is independent, then great, that clarifies things,
    > > but if you're saying that these changes to pg_basebackup are to help
    > > with backing up directly into those Enterprise systems then I'm just
    > > asking for some help in understanding how- what's the use-case here that
    > > we're adding to pg_basebackup that makes it work with these Enterprise
    > > systems?
    > >
    > > I'm not trying to be difficult here, I'm just trying to understand.
    > 
    > Man, I feel like we're totally drifting off into the weeds here.  I'm
    > not arguing that these changes to pg_basebackup will help enterprise
    > users except insofar as those users want incremental backup.  All of
    > this discussion started with this comment from you:
    > 
    > "Having a system of keeping track of which backups are full and which
    > are differential in an overall system also gives you the ability to do
    > things like expiration in a sensible way, including handling WAL
    > expiration."
    > 
    > All I was doing was saying that for an enterprise user, the overall
    > system might be something entirely outside of our control, like
    > NetBackup or Tivoli. Therefore, whatever functionality we provide to
    > do that kind of thing should be able to be used in such contexts. That
    > hardly seems like a controversial proposition.
    
    And all I was trying to understand was how what pg_basebackup does in
    this context is really different from what can be done with pgbackrest,
    since you brought it up:
    
    "True, but I'm not sure that functionality belongs in core. It
    certainly needs to be possible for out-of-core code to do this part of
    the work if desired, because people want to integrate with enterprise
    backup systems, and we can't come in and say, well, you back up
    everything else using Netbackup or Tivoli, but for PostgreSQL you have
    to use pg_backrest. I mean, maybe you can win that argument, but I
    know I can't."
    
    What it sounds like you're arguing here is that what pg_basebackup
    "has" in it is that it specifically doesn't include expiration
    management of any kind, and that's somehow helpful to people
    who want to use Enterprise backup solutions.  Maybe that's what you were
    getting at, in which case, I'm sorry for misunderstanding and dragging
    it out, and thanks for helping me understand.
    
    > > How would that tool work, if it's to be able to work regardless of where
    > > the WAL is actually stored..?  Today, pg_archivecleanup just works
    > > against a POSIX filesystem- are you thinking that the tool would have a
    > > pluggable storage system, so that it could work with, say, a POSIX
    > > filesystem, or a CIFS mount, or an s3-like system?
    > 
    > Again, I was making a general statement about design goals -- "we
    > should try to work nicely with enterprise backup products" -- not
    > proposing a specific design for a specific thing. I don't think the
    > idea of some pluggability in that area is a bad one, but it's not even
    > slightly what this thread is about.
    
    Well, I agree with you, as I said up-thread, that this seemed to be
    going in a different and perhaps not entirely relevant direction.
    
    > > Provided the WAL level is at the level that you need it to be that will
    > > be true for things which are actually supported with PITR, replication
    > > to standby servers, et al.  I can see how it might come across as an
    > > overreaction but this strikes me as a pretty glaring issue and I worry
    > > that if it was overlooked until now that there'll be other more subtle
    > > issues, and backups are just plain complicated to get right, just to
    > > begin with already, something that I don't think people appreciate until
    > > they've been dealing with them for quite a while.
    > 
    > Permit me to be unpersuaded. If it was such a glaring issue, and if
    > experience is the key to spotting such issues, then why didn't YOU
    > spot it?
    
    I'm not designing the feature..?  Sure, I agreed earlier with the
    general idea that we might be able to use WAL scanning and/or the LSN to
    figure out if a page had changed, but the next step would have been, I
    would have thought anyway, for someone to go do the analysis that has
    only recently been started, to look at where and in which cases we
    write WAL, and actually build up confidence that this
    approach isn't missing anything.  Instead, we seem to have come a long
    way in the development of this without having done that, and that does
    shake my confidence in this effort.
    
    > I'm not arguing that this stuff isn't hard. It is. Nor am I arguing
    > that I didn't screw up. I did. But designs need to be accepted or
    > rejected based on facts, not FUD. You've raised some good technical
    > points and if you've got more concerns, I'd like to hear them, but I
    > don't think arguing vaguely that a certain approach will probably run
    > into trouble gets us anywhere.
    
    This just gets back to what I was saying earlier.  It seems like we're
    presuming this is going to 'just work' because, say, replication works
    great, or crash recovery works great, and those are based on WAL.  I'm
    still hopeful that we can do something based on WAL or LSN here, but it
    needs a careful review of when we are, and when we aren't, writing out
    WAL for basically everything we do, an effort that I'm glad to see might
    be starting to happen, but a quick "oh, this is why in this one case
    with this one thing, and we're all good now" doesn't instill confidence
    in me, at least.
    
    Thanks,
    
    Stephen