Thread
-
Re: pg_recvlogical: Prevent flushed data from being re-sent after restarting replication
Mircea Cadariu <cadariu.mircea@gmail.com> — 2025-11-26T13:25:31Z
Hi, On 25/11/2025 17:16, Fujii Masao wrote: > Thanks for writing the test case and turning it into a patch. I agree that > we should add a regression test to ensure the reported issue doesn't recur. Thanks for your feedback, updated patch is attached. Again, I checked that it fails in master, but passes with your patch. > It looks like the v1 patch you attached accidentally includes > the patch file itself. Could you remove that? Whoops, not sure what happened there, fixed. > + '--fsync-interval', '1', > + '--status-interval', '1', > > Wouldn't it be safer to use a larger value (e.g., 100) for --status-interval? > With a very small interval, the status feedback might happen before > the walsender is terminated and pg_recvlogical reconnects, which could > prevent the duplicate data from appearing even without the patch. Yes indeed good one. I actually had it set to 60 in the previous version I sent earlier. > +use IPC::Run qw(start); > <snip> > +my $recv = start [ > > For simplicity, would it be better to avoid "use IPC::Run qw(start)" and > just call "IPC::Run::start" directly? Indeed, done. > +# Wait only for initial connection > +$node->poll_query_until('postgres', > + "SELECT active_pid IS NOT NULL FROM pg_replication_slots WHERE > slot_name = 'reconnect_test'"); > > This is unlikely, but pg_recvlogical's connection could be terminated > immediately after connecting, before receiving any data. If that happens, > the test might behave unexpectedly. To make the test more robust, > should we instead poll on: > > SELECT pg_read_file('$outfile') ~ 'INSERT' > > instead, to ensure that some data has actually been received before > terminating the backend? > +# Wait only for initial connection > +$node->poll_query_until('postgres', > + "SELECT active_pid IS NOT NULL FROM pg_replication_slots WHERE > slot_name = 'reconnect_test'"); > > This is unlikely, but pg_recvlogical's connection could be terminated > immediately after connecting, before receiving any data. If that happens, > the test might behave unexpectedly. To make the test more robust, > should we instead poll on: > > SELECT pg_read_file('$outfile') ~ 'INSERT' > > instead, to ensure that some data has actually been received before > terminating the backend? > > > >> Secondly I noticed in function sendFeedback at line 166, the startpos is >> set to output_written_lsn. This seems to counter conceptually the change >> you made in the patch, however it seems to not affect correctness. Shall >> we remove this line as to avoid confusion? > Isn't this necessary when - is specified for --file, causing > OutputFsync() to be skipped? Regards, Yes, for sure. Would really like to avoid introducing flake in CI due to this test. > Isn't this necessary when - is specified for --file, causing > OutputFsync() to be skipped? Upon another look, indeed. When writing to a regular file (--file -) that assignment is redundant but harmless. But like you said, when writing to stdout, without that line, startpos would never be updated. > Additionally, when the --no-loop option is used, I found that > pg_recvlogical > could previously exit without flushing written data, risking data loss. > The attached patch fixes this issue by also ensuring that all data is > flushed > to disk before exiting with --no-loop. Should we think of some kind of test also for this part? -- Thanks, Mircea Cadariu