Re: backup manifests


From: Robert Haas <robertmhaas@gmail.com>
To: Andres Freund <andres@anarazel.de>
Cc: David Steele <david@pgmasters.net>, Noah Misch <noah@leadboat.com>, Stephen Frost <sfrost@snowman.net>, Amit Kapila <amit.kapila16@gmail.com>, Suraj Kharage <suraj.kharage@enterprisedb.com>, tushar <tushar.ahuja@enterprisedb.com>, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com>, Rushabh Lathia <rushabh.lathia@gmail.com>, Tels <nospam-pg-abuse@bloodgate.com>, Andrew Dunstan <andrew.dunstan@2ndquadrant.com>, PostgreSQL Hackers <pgsql-hackers@postgresql.org>, Jeevan Chalke <jeevan.chalke@enterprisedb.com>, vignesh C <vignesh21@gmail.com>
Date: 2020-03-31T18:10:34Z
Lists: pgsql-hackers

Commits

  1. Try to avoid compiler warnings in optimized builds.

  2. Fix option related issues in pg_verifybackup.

  3. Add index term for backup manifest in documentation.

  4. Code review for backup manifest.

  5. Document the backup manifest file format.

  6. Fix typo in pg_validatebackup documentation.

  7. Exclude backup_manifest file that existed in database, from BASE_BACKUP.

  8. Msys2 tweaks for pg_validatebackup corruption test

  9. Fix resource management bug with replication=database.

  10. Be more careful about time_t vs. pg_time_t in basebackup.c.

  11. pg_validatebackup: Fix 'make clean' to remove tmp_check.

  12. pg_validatebackup: Also use perl2host in TAP tests.

  13. Generate backup manifests for base backups, and validate them.

  14. Add checksum helper functions.

  15. pg_waldump: Add a --quiet option.

  16. Catversion bump for b9b408c48724

  17. pg_basebackup: Refactor code for reading COPY and tar data.

  18. Use a ResourceOwner to track buffer pins in all cases.

  19. Use ARMv8 CRC instructions where available.

  20. Logical replication support for initial data copy

  21. Use Intel SSE 4.2 CRC instructions where available.

  22. Switch to CRC-32C in WAL and other places.

  23. Remove support for 64-bit CRC.

  24. Change CRCs in WAL records from 64bit to 32bit for performance reasons.


On Mon, Mar 30, 2020 at 2:59 PM Andres Freund <andres@anarazel.de> wrote:
> I think it wouldn't be too hard to compute that information while taking
> the base backup. We know the end timeline (ThisTimeLineID), so we can
> just call readTimeLineHistory(ThisTimeLineID). Which should then allow
> for something pretty trivial along the lines of
>
> timelines = readTimeLineHistory(ThisTimeLineID);
> last_start = InvalidXLogRecPtr;
> foreach(lc, timelines)
> {
>     TimeLineHistoryEntry *he = lfirst(lc);
>
>     if (he->end < startptr)
>         continue;
>
>     //
>     manifest_emit_wal_range(Min(he->begin, startptr), he->end);
>     last_start = he->end;
> }
>
> if (last_start == InvalidXLogRecPtr)
>    start = startptr;
> else
>    start = last_start;
>
> manifest_emit_wal_range(start, endptr);

I made an attempt to implement this. In the attached patch set, 0001
and 0002 are (I think) unmodified from the last version. 0003 is a
slightly-rejiggered version of your new pg_waldump option. 0004 whacks
0002 around so that the WAL ranges are included in the manifest and
pg_validatebackup tries to run pg_waldump for each WAL range. It
appears to work in light testing, but I haven't yet (1) tested it
extensively, (2) written good regression tests for it above and beyond
what pg_validatebackup had already, or (3) updated the documentation.
I'm going to work on those things. I would appreciate *very timely*
feedback on anything people do or do not like about this, because I
want to commit this patch set by the end of the work week and that
isn't very far away. I would also appreciate it if people would bear in
mind the principle that half a loaf is better than none, and further
improvements can be made in future releases.
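
To make the "run pg_waldump for each WAL range" step concrete, here is
a minimal sketch of how one such invocation could be assembled. It is
not the code from the patch: format_waldump_command and the XLogRecPtr
typedef are stand-ins made up for illustration, and the exact set of
options used is an assumption. The options shown (--quiet, --path,
--timeline, --start, --end) are pg_waldump options, with --quiet being
the one added by 0003.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for PostgreSQL's 64-bit WAL position type (assumption). */
typedef uint64_t XLogRecPtr;

/*
 * Hypothetical helper: build a pg_waldump command line that parses one
 * WAL range on one timeline.  LSNs are printed in the usual "hi/lo"
 * hexadecimal form.  Returns what snprintf returns, so a result that is
 * negative or >= buflen means the command did not fit in the buffer.
 *
 * Note: this does no shell escaping; paths containing double quotes
 * would need proper quoting in real code.
 */
static int
format_waldump_command(char *buf, size_t buflen,
                       const char *waldump_path, const char *wal_dir,
                       uint32_t tli, XLogRecPtr start, XLogRecPtr end)
{
    return snprintf(buf, buflen,
                    "\"%s\" --quiet --path=\"%s\" --timeline=%" PRIu32
                    " --start=%" PRIX32 "/%" PRIX32
                    " --end=%" PRIX32 "/%" PRIX32,
                    waldump_path, wal_dir, tli,
                    (uint32_t) (start >> 32), (uint32_t) start,
                    (uint32_t) (end >> 32), (uint32_t) end);
}
```

A caller would loop over the ranges read from the manifest, run each
command, and treat a nonzero exit status as a verification failure for
that range.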

As part of my light testing, I tried promoting a standby that was
running pg_basebackup, and found that pg_basebackup failed like this:

pg_basebackup: error: could not get COPY data stream: ERROR:  the standby was promoted during online backup
HINT:  This means that the backup being taken is corrupt and should not be used. Try taking another online backup.
pg_basebackup: removing data directory "/Users/rhaas/pgslave2"

My first thought was that this error message is hard to reconcile with
this comment:

        /*
         * Send timeline history files too. Only the latest timeline history
         * file is required for recovery, and even that only if there happens
         * to be a timeline switch in the first WAL segment that contains the
         * checkpoint record, or if we're taking a base backup from a standby
         * server and the target timeline changes while the backup is taken.
         * But they are small and highly useful for debugging purposes, so
         * better include them all, always.
         */

But then it occurred to me that this might be a cascading standby.
Maybe the original master died and this machine's master got promoted,
so it has to follow a timeline switch but doesn't itself get promoted.
I think I might try to test out that scenario and see what happens,
but I haven't done so as of this writing. Regardless, it seems like a
really good idea to store a list of WAL ranges rather than a single
start/end/timeline, because even if a backup that spans multiple
timelines is impossible today, it might become possible in the future.
Still, unless there's an easy way to
set up a test scenario where multiple WAL ranges need to be verified,
it may be hard to test that this code actually behaves properly.
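
For illustration only, here is a self-contained sketch of what
"a list of WAL ranges" means when a backup crosses a timeline switch.
It is not the patch's code. HistoryEntry, WalRange and
compute_wal_ranges are hypothetical names, the history is assumed to be
ordered oldest first (the real list may be ordered differently), and an
invalid end pointer is assumed to mark the timeline that is still
current.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL types (assumptions for this sketch). */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

typedef struct
{
    uint32_t    tli;    /* timeline ID */
    XLogRecPtr  begin;  /* where this timeline starts (invalid = from 0) */
    XLogRecPtr  end;    /* where it was left; invalid if still current */
} HistoryEntry;

typedef struct
{
    uint32_t    tli;
    XLogRecPtr  start;
    XLogRecPtr  end;
} WalRange;

/*
 * Clamp each timeline's span to [startptr, endptr) and keep the
 * non-empty pieces.  Returns the number of ranges written to out,
 * or -1 if out is too small.
 */
static int
compute_wal_ranges(const HistoryEntry *hist, int nhist,
                   XLogRecPtr startptr, XLogRecPtr endptr,
                   WalRange *out, int maxout)
{
    int         n = 0;

    assert(startptr <= endptr);

    for (int i = 0; i < nhist; i++)
    {
        /* The current timeline has no recorded end; the backup ends it. */
        XLogRecPtr  tl_end = (hist[i].end == InvalidXLogRecPtr)
            ? endptr : hist[i].end;
        XLogRecPtr  lo = (hist[i].begin > startptr) ? hist[i].begin : startptr;
        XLogRecPtr  hi = (tl_end < endptr) ? tl_end : endptr;

        if (lo >= hi)
            continue;           /* timeline does not overlap the backup */
        if (n == maxout)
            return -1;

        out[n].tli = hist[i].tli;
        out[n].start = lo;
        out[n].end = hi;
        n++;
    }
    return n;
}
```

With a history of timeline 1 ending at 0/3000000 and timeline 2
starting there, a backup from 0/2000028 to 0/4000100 yields two ranges,
one per timeline, while a backup that starts and ends on timeline 2
yields just one.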

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company