Thread

  1. Re: Add some more corruption error codes to relcache

    Kirk Wolak <wolakk@gmail.com> — 2023-06-27T03:32:52Z

    On Fri, Jun 16, 2023 at 9:18 AM Andrey M. Borodin <x4mmm@yandex-team.ru>
    wrote:
    
    > Hi hackers,
    >
    > Relcache errors detect catalog corruption from time to time. For example,
    > I recently observed the following:
    > 1. The filesystem or NVMe disk zeroed out the leading 160 kB of a catalog
    > index. This type of corruption passes through data_checksums undetected.
    > 2. RelationBuildTupleDesc() was failing with "catalog is missing 1
    > attribute(s) for relid 2662".
    > 3. We monitor corruption error codes and alert on-call DBAs when we see one,
    > but the message is not marked as XX001 or XX002. It is XX000, which also
    > occurs from time to time for less critical reasons than data corruption.
    > 4. High-availability automation switched the primary to another host, and
    > other monitoring checks did not fire either.
    >
    > This particular case is not very illustrative. In fact we had index
    > corruption that looked like catalog corruption.
    > But still it looks to me that catalog inconsistencies (like relnatts !=
    > number of pg_attribute rows) could be marked with ERRCODE_DATA_CORRUPTED.
    > This particular error code in my experience proved to be a good indicator
    > for early corruption detection.
    >
    > What do you think?
    > What other subsystems can be improved in the same manner?
    >
    > Best regards, Andrey Borodin.
    >
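
    To make item 3 concrete, here is a minimal sketch of the kind of check a
    monitoring script might apply to the SQLSTATE field of each logged error.
    The function name is hypothetical; the codes are the standard ones
    (XX000 internal_error, XX001 data_corrupted, XX002 index_corrupted).

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper for a log-monitoring tool: returns true only for
 * the SQLSTATEs that PostgreSQL uses to signal corruption.  XX000
 * (internal_error) is deliberately excluded because it is raised for
 * many reasons unrelated to corruption.
 */
static bool
sqlstate_is_corruption(const char *sqlstate)
{
    return strcmp(sqlstate, "XX001") == 0    /* data_corrupted */
        || strcmp(sqlstate, "XX002") == 0;   /* index_corrupted */
}
```

    With a filter like this, the "catalog is missing 1 attribute(s)" error
    above would be skipped, since it is currently reported as XX000.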
    
    Andrey, I think this is a good idea.  But your #1 item sounds familiar.
    There was a thread about someone creating/dropping lots of databases, who
    found some kind of race condition that would zero out pg_ catalog entries,
    just like you are mentioning.  I think the symptom he found was that
    relations could not be found and/or the DB did not want to start.  I just
    spent 30 minutes looking for it, but my "search-fu" is apparently failing.
    
    Which leads me to ask whether there is a way to detect the corrupting write
    (writing all zeroes to the file when we know better?  A zeroed-out header
    when one cannot exist?).  Hoping this triggers a bright idea on your end...
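
    As a rough sketch of the "all zeroes" test, something like the following
    could be run over a page buffer.  The names are hypothetical and the
    8192-byte size just assumes the default block size.  As far as I know, an
    all-zero page is accepted as a valid new page, which would explain why
    checksums do not catch this, so the check is only meaningful where we
    know a page cannot legitimately be empty.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumption: the default PostgreSQL block size (BLCKSZ) of 8 kB. */
#define SKETCH_BLOCK_SIZE 8192

/*
 * Hypothetical helper: returns true if every byte of the page buffer
 * is zero, i.e. the page header and contents have been wiped (or the
 * page was never initialized).
 */
static bool
page_is_all_zeros(const unsigned char *page, size_t len)
{
    for (size_t i = 0; i < len; i++)
    {
        if (page[i] != 0)
            return false;
    }
    return true;
}
```

    A checker could then flag, say, block 0 of an existing index coming back
    all zeroes, on the assumption that the first block holds a header that
    must exist once the index has been built.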
    
    Kirk...