Re: backup manifests

Robert Haas <robertmhaas@gmail.com>

From: Robert Haas <robertmhaas@gmail.com>

To: David Steele <david@pgmasters.net>

Cc: Andrew Dunstan <andrew.dunstan@2ndquadrant.com>, Rushabh Lathia <rushabh.lathia@gmail.com>, PostgreSQL Hackers <pgsql-hackers@postgresql.org>, Jeevan Chalke <jeevan.chalke@enterprisedb.com>, vignesh C <vignesh21@gmail.com>

Date: 2019-11-25T17:11:53Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Try to avoid compiler warnings in optimized builds.
- 05021a2c0cd2 13.0 landed
Fix option related issues in pg_verifybackup.
- 0a89e93bfaa6 13.0 landed
Add index term for backup manifest in documentation.
- 4db819ba4039 13.0 landed
Code review for backup manifest.
- a2ac73e7be7a 13.0 landed
Document the backup manifest file format.
- 149f2ae88ab0 13.0 landed
Fix typo in pg_validatebackup documentation.
- c4f82a779d26 13.0 landed
Exclude backup_manifest file that existed in database, from BASE_BACKUP.
- 1ec50a81ec0a 13.0 landed
Msys2 tweaks for pg_validatebackup corruption test
- c3e4cbaab936 13.0 landed
Fix resource management bug with replication=database.
- 3e0d80fd8d3d 13.0 cited
Be more careful about time_t vs. pg_time_t in basebackup.c.
- db1531cae009 13.0 cited
pg_validatebackup: Fix 'make clean' to remove tmp_check.
- 9f8f881caa0f 13.0 landed
pg_validatebackup: Also use perl2host in TAP tests.
- 460314db08e8 13.0 landed
Generate backup manifests for base backups, and validate them.
- 0d8c9c1210c4 13.0 landed
Add checksum helper functions.
- c12e43a2e0d4 13.0 landed
pg_waldump: Add a --quiet option.
- ac44367efbef 13.0 landed
Catversion bump for b9b408c48724
- afb5465e0cfc 13.0 cited
pg_basebackup: Refactor code for reading COPY and tar data.
- 431ba7bebf13 13.0 landed
Use a ResourceOwner to track buffer pins in all cases.
- 3cb646264e8c 12.0 cited
Use ARMv8 CRC instructions where available.
- f044d71e331d 11.0 cited
Logical replication support for initial data copy
- 7c4f52409a8c 10.0 cited
Use Intel SSE 4.2 CRC instructions where available.
- 3dc2d62d0486 9.5.0 cited
Switch to CRC-32C in WAL and other places.
- 5028f22f6eb0 9.5.0 cited
Remove support for 64-bit CRC.
- 404bc51cde9d 9.5.0 cited
Change CRCs in WAL records from 64bit to 32bit for performance reasons.
- 21fda22ec46d 8.1.0 cited

On Fri, Nov 22, 2019 at 2:29 PM David Steele <david@pgmasters.net> wrote:
> See:
> https://www.nist.gov/system/files/documents/2017/04/26/lrdc_systems_part2_032713.pdf
> Search for "The maximum block size"

Hmm, so it says: "The maximum block size that can be protected by a
32-bit CRC is 512MB." My problem is that (1) it doesn't back this up
with a citation or any kind of logical explanation and (2) it's not
very clear what "protected" means. Tels replies downthread to explain
that the internal state of the 32-bit CRC calculation is also limited
to 32 bits, and changes once per bit, so that after processing 512MB =
2^29 bytes = 2^32 bits of data, you're guaranteed to start repeating
internal states. Perhaps this is also what the NIST folks had in mind,
though it's hard to know.

This link provides some more details:

https://community.arm.com/developer/tools-software/tools/f/keil-forum/17467/crc-for-256-byte-data

Not everyone on the thread agrees with everybody else, but it seems
like there are size limits below which a CRC-n is guaranteed to detect
all 1-bit and 2-bit errors, and above which this is no longer
guaranteed. They put the limit *lower* than what NIST supposes, namely
2^(n-1)-1 bits, which would be 256MB, not 512MB, if I'm doing math
correctly. However, they also say that above that value, you are still
likely to detect most errors. Absent an intelligent adversary, the
chance of a random collision when corruption is present is still about
1 in 4 billion (2^-32).

To me, guaranteed detection of 1-bit and 2-bit errors (and the other
kinds of specific things CRC is designed to catch) doesn't seem like a
principle design consideration. It's nice if we can get it and I'm not
against it, but these are algorithms that are designed to be used when
data undergoes a digital-to-analog-to-digital conversion, where for
example it's possible that that the conversion back to digital loses
sync and reads 9 bits or 7 bits rather than 8 bits. And that's not
really what we're doing here: we all know that bits get flipped
sometimes, but nobody uses scp to copy a 1GB file and ends up with a
file that is 1GB +/- a few bits. Some lower-level part of the
communication stack is handling that part of the work; you're going to
get exactly 1GB. So it seems to me that here, as with XLOG, we're not
relying on the specific CRC properties that were intended to be used
to catch and in some cases repair bit flips caused by wrinkles in an
A-to-D conversion, but just on its general tendency to probably not
match if any bits got flipped. And those properties hold regardless of
input length.

That being said, having done some reading on this, I am a little
concerned that we're getting further and further from the design
center of the CRC algorithm. Like relation segment files, XLOG records
are not packets subject to bit insertions, but at least they're small,
and relation files are not. Using a 40-year-old algorithm that was
intended to be used for things like making sure the modem hadn't lost
framing in the last second to verify 1GB files feels, in some nebulous
way, like we might be stretching. That being said, I'm not sure what
we think the reasonable alternatives are. Users aren't going to be
better off if we say that, because CRC-32C might not do a great job
detecting errors, we're not going to check for errors at all. If we go
the other way and say we're going to use some variant of SHA, they
will be better off, but at the price of what looks like a
*significant* hit in terms of backup time.

> > "Typically an n-bit CRC applied to a data block of arbitrary length
> > will detect any single error burst not longer than n bits, and the
> > fraction of all longer error bursts that it will detect is (1 −
> > 2^−n)."
>
> I'm not sure how encouraging I find this -- a four-byte error not a lot
> and 2^32 is only 4 billion.  We have individual users who have backed up
> more than 4 billion files over the last few years.

I agree that people have a lot more than 4 billion files backed up,
but I'm not sure it matters very much given the use case I'm trying to
enable. There's a lot of difference between delta restore and backup
integrity checking. For backup integrity checking, my goal is that, on
those occasions when a file gets corrupted, the chances that we notice
that it has been corrupted. For that purpose, a 32-bit checksum is
probably sufficient. If a file gets corrupted, we have about a
1-in-4-billion chance of being unable to detect it. If 4 billion files
get corrupted, we'll miss, on average, one of those corruption events.
That's sad, but so is the fact that you had *4 billion corrupted
files*. This is not the total number of files backed up; this is the
number of those that got corrupted. I don't really know how common it
is to copy a file and end up with a corrupt copy, but if you say it's
one-in-a-million, which I suspect is far too high, then you'd have to
back up something like 4 quadrillion files before you missed a
corruption event, and that's a *very* big number.

Now delta restore is a whole different kettle of fish. The birthday
problem is huge here. If you've got a 32-bit checksum for file A, and
you go and look it up in a database of checksums, and that database
has even 1 billion things in it, you've got a pretty decent shot of
latching onto a file that is not actually the same as file A. The
problem goes away almost entirely if you only compare against previous
versions of that file from that database cluster. You've probably only
got tens or maybe at the very outside hundreds or thousands of backups
of that particular file, and a collision is unlikely even with only a
32-bit checksum -- though even there maybe you'd like to use something
larger just to be on the safe side. But if you're going to compare to
other files from the same cluster, or even worse any file from any
cluster, 32 bits is *woefully* inadequate. TBH even using SHA for such
use cases feels a little scary to me. It's probably good enough --
2^160 for SHA-1 is a *lot* bigger than 2^32, and 2^512 for SHA-512 is
enormous. But I'd want to spend time thinking very carefully about the
math before designing such a system.

> OK, I'll buy that.  But I *don't* think CRCs should be allowed for
> deltas (when we have them) and I *do* think we should caveat their
> effectiveness (assuming we can agree on them).

Sounds good.

> In general the answer to faster backups should be more cores/faster
> network/faster disk, not compromising backup integrity.  I understand
> we'll need to wait until we have parallelism in pg_basebackup to justify
> that answer.

I would like to dispute that characterization of what we're talking
about here. If we added a 1-bit checksum (parity bit) it would be
*strictly better* than what we're doing right now, which is nothing.
That's not a serious proposal because it's obvious we can do a lot
better for trivial additional cost, but deciding that we're going to
use a weaker kind of checksum to avoid adding too much overhead is not
wimping out, because it's still going to be strong enough to catch the
overwhelming majority of problems that go undetected today. Even an
*8-bit* checksum would give us a >99% chance of catching a corrupted
file, which would be noticeably better than the 0% chance we have
today. Even a manifest with no checksums at all that just checked the
presence and size of files would catch tons of operator error, e.g.

- wait, that database had tablespaces?
- were those logs in pg_clog anything important?
- oh, i wasn't supposed to start postgres on the copy of the database
stored in the backup directory?

So I don't think we're talking about whether to compromise backup
integrity. I think we're talking about - if we're going to make backup
integrity better than it is today, how much better should we try to
make it, and what are the trade-offs there? The straw man here is that
we could make the database infinitely secure if we put it in a
concrete bunker and sunk it to the bottom of the ocean, with the small
price that we'd no longer be able to access it either. Somewhere
between that extreme and the other extreme of setting the
authentication method to 0.0.0.0/0 trust there's a happy medium where
security is tolerably good but ease of access isn't crippled, and the
same thing applies here. We could (probably) be the first database on
the planet to store a 1024-bit encrypted checksum of every 8kB block,
but that seems like it's going too far in the "concrete bunker"
direction. IMHO, at least, we should be aiming for something that has
a high probability of catching real problems and a low probability of
being super-annoying.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company