Re: backup manifests

Robert Haas <robertmhaas@gmail.com>

From: Robert Haas <robertmhaas@gmail.com>

To: Suraj Kharage <suraj.kharage@enterprisedb.com>

Cc: Rushabh Lathia <rushabh.lathia@gmail.com>, Tels <nospam-pg-abuse@bloodgate.com>, David Steele <david@pgmasters.net>, Andrew Dunstan <andrew.dunstan@2ndquadrant.com>, PostgreSQL Hackers <pgsql-hackers@postgresql.org>, Jeevan Chalke <jeevan.chalke@enterprisedb.com>, vignesh C <vignesh21@gmail.com>

Date: 2019-12-30T18:22:59Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Try to avoid compiler warnings in optimized builds.
- 05021a2c0cd2 13.0 landed
Fix option related issues in pg_verifybackup.
- 0a89e93bfaa6 13.0 landed
Add index term for backup manifest in documentation.
- 4db819ba4039 13.0 landed
Code review for backup manifest.
- a2ac73e7be7a 13.0 landed
Document the backup manifest file format.
- 149f2ae88ab0 13.0 landed
Fix typo in pg_validatebackup documentation.
- c4f82a779d26 13.0 landed
Exclude backup_manifest file that existed in database, from BASE_BACKUP.
- 1ec50a81ec0a 13.0 landed
Msys2 tweaks for pg_validatebackup corruption test
- c3e4cbaab936 13.0 landed
Fix resource management bug with replication=database.
- 3e0d80fd8d3d 13.0 cited
Be more careful about time_t vs. pg_time_t in basebackup.c.
- db1531cae009 13.0 cited
pg_validatebackup: Fix 'make clean' to remove tmp_check.
- 9f8f881caa0f 13.0 landed
pg_validatebackup: Also use perl2host in TAP tests.
- 460314db08e8 13.0 landed
Generate backup manifests for base backups, and validate them.
- 0d8c9c1210c4 13.0 landed
Add checksum helper functions.
- c12e43a2e0d4 13.0 landed
pg_waldump: Add a --quiet option.
- ac44367efbef 13.0 landed
Catversion bump for b9b408c48724
- afb5465e0cfc 13.0 cited
pg_basebackup: Refactor code for reading COPY and tar data.
- 431ba7bebf13 13.0 landed
Use a ResourceOwner to track buffer pins in all cases.
- 3cb646264e8c 12.0 cited
Use ARMv8 CRC instructions where available.
- f044d71e331d 11.0 cited
Logical replication support for initial data copy
- 7c4f52409a8c 10.0 cited
Use Intel SSE 4.2 CRC instructions where available.
- 3dc2d62d0486 9.5.0 cited
Switch to CRC-32C in WAL and other places.
- 5028f22f6eb0 9.5.0 cited
Remove support for 64-bit CRC.
- 404bc51cde9d 9.5.0 cited
Change CRCs in WAL records from 64bit to 32bit for performance reasons.
- 21fda22ec46d 8.1.0 cited

On Tue, Dec 24, 2019 at 5:42 AM Suraj Kharage
<suraj.kharage@enterprisedb.com> wrote:
> Made the change in backup manifest as well in backup validatort patch. Thanks to Rushabh Lathia for the offline discussion and help.
>
> To examine the first word of each line, I am using below check:
> if (strncmp(line, "File", 4) == 0)
> {
> ..
> }
> else if (strncmp(line, "Manifest-Checksum", 17) == 0)
> {
> ..
> }
> else
>     error
>
> strncmp might be not right here, but we can not put '\0' in between the line (to find out first word)
> before we recognize the line type.
> All the lines expect line last one (where we have manifest checksum) are feed to the checksum machinary to calculate manifest checksum.
> so update_checksum() should be called after recognizing the type, i.e: if it is a File type record. Do you see any issues with this?

I see the problem, but I don't think your solution is right, because
the first test would pass if the line said FiletMignon rather than
just File, which we certainly don't want. You've got to write the test
so that you're checking against the whole first word, not just some
prefix of it. There are several possible ways to accomplish that, but
this isn't one of them.

>> + pg_log_error("invalid record found in \"%s\"", manifest_path);
>>
>> Error message needs work.

Looks better now, but you have a messages that say "invalid checksums
type \"%s\" found in \"%s\"". This is wrong because checksums would
need to be singular in this context (checksum). Also, I think it could
be better phrased as "manifest file \"%s\" specifies unknown checksum
algorithm \"%s\" at line %d".

>> Your function names should be consistent with the surrounding style,
>> and with each other, as far as possible. Three different conventions
>> within the same patch and source file seems over the top.

This appears to be fixed.

>> Also keep in mind that you're not writing code in a vacuum. There's a
>> whole file of code here, and around that, a whole project.
>> scan_data_directory() is a good example of a function whose name is
>> clearly too generic. It's not a general-purpose function for scanning
>> the data directory; it's specifically a support function for verifying
>> a backup. Yet, the name gives no hint of this.

But this appears not to be fixed.

>> if (strcmp(newpathsuffix, "/pg_wal") == 0 || strcmp(newpathsuffix,
>> "/backup_manifest") == 0)
>>     continue;
>
> Thanks for the suggestion. Corrected as per the above inputs.

You need a comment here, like "Ignore the possible presence of a
backup_manifest file and/or a pg_wal directory in the backup being
verified." and then maybe another sentence explaining why that's the
right thing to do.

+             * The forth parameter to VerifyFile() will pass the relative path
+             * of file to match exactly with the filename present in manifest.

I don't know what this comment is trying to tell me, which might be
something you want to try to fix. However, I'm pretty sure it's
supposed to say "fourth" not "forth".

>> and the result would be that everything inside that long if-block is
>> now at the top level of the function and indented one level less. And
>> I think if you look at this function you'll see a way that you can
>> save a *second* level of indentation for much of that code. Please
>> check the rest of the patch for similar cases, too.
>
> Make sense. corrected.

I don't agree. A large chunk of VerifyFile() is still subject to a
quite unnecessary level of indentation.

> I have added a check for EOF, but not sure whether that woule be right here.
> Do we need to check the length of buffer as well?

That's really, really not right. EOF is not a character that can
appear in the buffer. It's chosen on purpose to be a value that never
matches any actual character when both the character and the EOF value
are regarded as values of type 'int'. That guarantee doesn't apply
here though because you're dealing with values of type 'char'. So what
this code is doing is searching for an impossible value using
incorrect logic, which has very little to do with the actual need
here, which is to avoid running off the end of the buffer. To see what
the problem is, try creating a file with no terminating newline, like
this:

echo -n this file has no terminating newline >> some-file

I doubt it will be very hard to make this patch crash horribly. Even
if you can't, it seems pretty clear that the logic isn't right.

I don't really know what the \0 tests in NextLine() and NextWord()
think they're doing either. If there's a \0 in the buffer before you
add one, it was in the original input data, and pretending like that
marks a word or line boundary seems like a fairly arbitrary choice.

What I suggest is:

(1) Allocate one byte more than the file size for the buffer that's
going to hold the file, so that if you write a \0 just after the last
byte of the file, you don't overrun the allocated buffer.

(2) Compute char *endptr = buf + len.

(3) Pass endptr to NextLine and NextWord and write the loop condition
something like while (*buf != '\n' && buf < endptr).

Other notes:

- The error handling in ReadFileIntoBuffer() does not seem to consider
the case of a short read. If you look through the source tree, you can
find examples of how we normally handle that.

- Putting string_hash_sdbm() into encode.c seems like a surprising
choice. What does this have to do with encoding anything? And why is
it going into src/common at all if it's only intended for frontend
use?

- It seems like whether or not any problems were found while verifying
the manifest ought to affect the exit status of pg_basebackup. I'm not
exactly sure what exit codes ought to be used, but you could look for
similar precedents. Document this, too.

- As much as possible let's have errors in the manifest file report
the line number, and let's also try to make them more specific, e.g.
instead of "invalid manifest record found in \"%s\"", perhaps
"manifest file \"%s\" contains invalid keyword \"%s\" at line %d".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company