Thread

  1. Re: Making pg_rewind faster

    Srinath Reddy Sadipiralla <srinath2133@gmail.com> — 2025-10-15T14:26:56Z

    Hi,
    
    On Wed, Oct 15, 2025 at 4:49 AM John H <johnhyvr@gmail.com> wrote:
    
    > Hi,
    >
    > On Fri, Oct 10, 2025 at 12:45 AM Srinath Reddy Sadipiralla
    > <srinath2133@gmail.com> wrote:
    > >
    > > ...
    > > XLogFilePath , then validate this new path with the given path
    > > ,this helps to catch invalid xlog files like
    > pg_wal/00000001FFFFFFFFFFFFFF10.
    > > ...
    >
    > Are you concerned that somehow these files, which are named like XLog
    > files but actually
    > aren't, are somehow created therefore we should sync them in case?
    > I'm trying to understand how these files would be generated in the first
    > place.
    >
    
    the problem is not when server generates them because
    filename is properly calculated using the XLogRecPtr in
    XLogWrite->XLogFileInit->XLogFileInitInternal->XLogFilePath
    ,the main problem is when if someone manually places an invalid WAL file
    in pg_wal like 00000001FFFFFFFFFFFFFF10, IsXLogFileName will
    consider it as valid ,so with the approach as i mentioned earlier we can
    catch such cases.
    
    
    >
    > >
    > > instead of passing last_common_segno down the call stack directly,
    > > we can have a struct maybe called "rewindState" which will have the
    > common
    > > information related to both clusters involved in rewind ,so that in
    > future
    > > if we want to pass more such information down we won't be increasing
    > > the number of arguments that need to be passed down to all functions in
    > stack.
    > >
    >
    > I don't feel too strongly about this. Not sure how we view
    > future-proofing things,
    > as in do we proactively wrap around structs or wait until there's an
    > actual need.
    >
    > This is the first time it's been touched since it was introduced in 14 for
    > what
    > it's worth.
    >
    
    TBH first it seemed to me a good coding practice for future proofing,
    then i checked if there's any place in postgres we are doing
    something similar ,then found 1 in src/include/access/gist.h
    
    typedef struct GISTDeletedPageContents
    {
    /* last xid which could see the page in a scan */
    FullTransactionId deleteXid;
    } GISTDeletedPageContents;
    
    
    >
    > >
    > > we can improve the TAP test file header comment as
    > > # Test the situation where we skip copying the same WAL files from
    > source to target
    > > # except if WAL file size differs.
    > >
    > > let me put it this way
    > > 1) WALs are written to 000000010000000000000002 in primary.
    > > ...
    > > 7) so the last common WAL segment was 000000010000000000000003.
    >
    > Updated comments.
    >
    > >
    > > ...
    > > where entry->target_size != entry->source_size and we do copy,
    > > like if stat->mtime is changed (which means file has been copied)
    > > and target_stat->size != soucrce_stat->size here no error expected.
    > >
    >
    > Updated patch with one additional segment has a byte appended to it
    > on target but not source.
    >
    
    thanks for updating, i have attached a diff patch on
    top of v9-0001 patch , where i tried to add more tests
    to improve the validation that we copy the WAL file even
    when it exists on both source and target but the size differs.
    I also tried to rename some variables for readability,while
    testing this out i found that usleep(1000); is not enough
    to show that there has been a copy which changed the
    stat->mtime of the file because i think the copy was very
    fast (less than 1 millisecond) so the mtime of the
    file after rewind's copy didn't change or filesystem precision
    didn't caught the change, the copying of the file is confirmed
    because the same file has different size before rewind and
    same after rewind which these tests proved
    
    ok($corrupt_wal_seg_stat_before_rewind->size !=
    $corrupt_wal_seg_source_stat->size,"WAL segment $corrupt_wal_seg has
    different size in source vs target before rewind");
    ok 11 - WAL segment 000000010000000000000004 has different size in source
    vs target before rewind
    
    ok($corrupt_wal_seg_stat_after_rewind->size ==
    $corrupt_wal_seg_source_stat->size, "WAL segment $corrupt_wal_seg was
    copied: file sizes are same between target and source after rewind");
    ok 12 - WAL segment 000000010000000000000004 was copied: file sizes are
    same between target and source after rewind
    
    and more over the pg_rewind confirmed this with a new debug
    message which I added in this diff patch
    
    diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
    index 69d7728ce44..743484f4833 100644
    --- a/src/bin/pg_rewind/filemap.c
    +++ b/src/bin/pg_rewind/filemap.c
    @@ -732,11 +732,11 @@ decide_wal_file_action(const char *fname, XLogSegNo
    last_common_segno,
      */
      if (file_segno < last_common_segno && source_size == target_size)
      {
      pg_log_debug("WAL segment \"%s\" not copied to target", fname);
      return FILE_ACTION_NONE;
      }
    
    + pg_log_debug("WAL segment \"%s\" is copied to target", fname);
      return FILE_ACTION_COPY;
     }
    ok 6 - run pg_rewind stderr /(?^:WAL segment \"000000010000000000000004\"
    is copied to target)/
    
    also did this instead of using catfile
    -my $wal_skipped_path = File::Spec->catfile($node_primary->data_dir,
    'pg_wal', $wal_seg_skipped);
    +my $wal_skipped_path = $node_primary->data_dir . '/pg_wal/' .
    $wal_seg_skipped;
    seems consistent with other TAP tests.
    
    -- 
    Thanks,
    Srinath Reddy Sadipiralla
    EDB: https://www.enterprisedb.com/