Thread

  1. Re: pg_recvlogical: Prevent flushed data from being re-sent after restarting replication

    Mircea Cadariu <cadariu.mircea@gmail.com> — 2025-11-26T13:25:31Z

    Hi,
    
    On 25/11/2025 17:16, Fujii Masao wrote:
    > Thanks for writing the test case and turning it into a patch. I agree that
    > we should add a regression test to ensure the reported issue doesn't recur.
    Thanks for your feedback, updated patch is attached. Again, I checked 
    that it fails in master, but passes with your patch.
    > It looks like the v1 patch you attached accidentally includes
    > the patch file itself. Could you remove that?
    Whoops, not sure what happened there, fixed.
    > +    '--fsync-interval', '1',
    > +    '--status-interval', '1',
    >
    > Wouldn't it be safer to use a larger value (e.g., 100) for --status-interval?
    > With a very small interval, the status feedback might happen before
    > the walsender is terminated and pg_recvlogical reconnects, which could
    > prevent the duplicate data from appearing even without the patch.
    Yes indeed good one. I actually had it set to 60 in the previous version 
    I sent earlier.
    
    > +use IPC::Run qw(start);
    > <snip>
    > +my $recv = start [
    >
    > For simplicity, would it be better to avoid "use IPC::Run qw(start)" and
    > just call "IPC::Run::start" directly?
    Indeed, done.
    
    > +# Wait only for initial connection
    > +$node->poll_query_until('postgres',
    > +    "SELECT active_pid IS NOT NULL FROM pg_replication_slots WHERE
    > slot_name = 'reconnect_test'");
    >
    > This is unlikely, but pg_recvlogical's connection could be terminated
    > immediately after connecting, before receiving any data. If that happens,
    > the test might behave unexpectedly. To make the test more robust,
    > should we instead poll on:
    >
    >          SELECT pg_read_file('$outfile') ~ 'INSERT'
    >
    > instead, to ensure that some data has actually been received before
    > terminating the backend?
    
    > +# Wait only for initial connection
    > +$node->poll_query_until('postgres',
    > +    "SELECT active_pid IS NOT NULL FROM pg_replication_slots WHERE
    > slot_name = 'reconnect_test'");
    >
    > This is unlikely, but pg_recvlogical's connection could be terminated
    > immediately after connecting, before receiving any data. If that happens,
    > the test might behave unexpectedly. To make the test more robust,
    > should we instead poll on:
    >
    >          SELECT pg_read_file('$outfile') ~ 'INSERT'
    >
    > instead, to ensure that some data has actually been received before
    > terminating the backend?
    >
    >
    >
    >> Secondly I noticed in function sendFeedback at line 166, the startpos is
    >> set to output_written_lsn. This seems to counter conceptually the change
    >> you made in the patch, however it seems to not affect correctness. Shall
    >> we remove this line as to avoid confusion?
    > Isn't this necessary when - is specified for --file, causing 
    > OutputFsync() to be skipped? Regards, 
    Yes, for sure. Would really like to avoid introducing flake in CI due to 
    this test.
    
    > Isn't this necessary when - is specified for --file, causing 
    > OutputFsync() to be skipped? 
    Upon another look, indeed. When writing to a regular file (--file -) 
    that assignment is redundant but harmless. But like you said, when 
    writing to stdout, without that line, startpos would never be updated.
    
    > Additionally, when the --no-loop option is used, I found that 
    > pg_recvlogical
    > could previously exit without flushing written data, risking data loss.
    > The attached patch fixes this issue by also ensuring that all data is 
    > flushed
    > to disk before exiting with --no-loop.
    
    Should we think of some kind of test also for this part?
    
    -- 
    Thanks,
    Mircea Cadariu