Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

Amit Kapila <amit.kapila16@gmail.com>

From: Amit Kapila <amit.kapila16@gmail.com>

To: "Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>

Cc: "ocean_li_996@163.com" <ocean_li_996@163.com>, "pgsql-bugs@lists.postgresql.org" <pgsql-bugs@lists.postgresql.org>

Date: 2023-08-17T10:07:10Z

Lists: pgsql-bugs

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Fix uninitialized access to InitialRunningXacts during decoding after ERROR.
- c7256e6564fa 15.5 landed
- f7d25117ba87 14.10 landed
- c570bb4d61b6 13.13 landed
- 7e57208ed51a 12.17 landed
- feb4e218e5f9 11.22 landed

On Thu, Aug 17, 2023 at 3:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 17, 2023 at 12:21 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form <noreply@postgresql.org> wrote:
> > >
> > > The following bug has been logged on the website:
> > >
> > > Bug reference:      18055
> > > Logged by:          ocean li
> > > Email address:      ocean_li_996@163.com
> > > PostgreSQL version: 11.9
> > > Operating system:   centos7 5.10.84 x86_64
> > > Description:
> > >
> > > For testing logical decoding module, *pg_logical_slot_get_changes* function
> > > is used. Sometimes i got an core whose stack was like that:
> > > ...
> > > And in level #3 of stack above, NInitialRunningXacts is 2 and
> > > InitialRunningXacts is not NULL observed in one of  cores.
> > >
> > > Using of NInitialRunningXacts and InitialRunningXacts are clear. Currently, the
> > > core, as far as i know, maybe caused by this way: an ERROR raised when calling
> > > *pg_logical_slot_get_changes_guts* function. The code part of
> > > PG_CATCH() doses not reset  NInitialRunningXacts and InitialRunningXacts.
> > > Then, calling pg_logical_slot_get_changes_guts again, the core may occur.
> > > Unfortunately, I couldn't find a minimal reproduction case. However, I
> > > observed an *ERROR: canceling statement due to statement timeout*  logged
> > > before each core occurred. (For some reason, I can't provide the information of
> > > log)
> >
> > Thanks for the report and the fix!
> >
> > I can also reproduce by the steps[1] in PG15~11(Note that we need to change
> > LOG_SNAPSHOT_INTERVAL_MS to a bigger number to avoid extra running xacts wal
> > records).
> >
> > About the patch:
> >
> > ---
> >                 /* free context, call shutdown callback */
> >                 FreeDecodingContext(ctx);
> >
> >                 ReplicationSlotRelease();
> >                 InvalidateSystemCaches();
> >         }
> >         PG_CATCH();
> >         {
> > +
> > +               /* free context, call shutdown callback */
> > +               FreeDecodingContext(ctx);
> > +
> > +               ReplicationSlotRelease();
> >
> > I think we could not directly call the cleanup functions here. There could be
> > two risks:
> > 1) it's possible either 'ctx' hasn't been initialized before the error happen,
> > we don't need to clean it up in this case
> > 2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up normally
> > before entering the catch() block which means we will double cleanup(free) it.
> >
> > To improve this, I think we can use PG_FINALLY() here and move all these cleanup functions
> > in it and do the null check for 'ctx' before cleaning up.
> >
> > Just share one small patch based on your fix for reference, feel free to update it.
> >
>
> - LogicalDecodingContext *ctx;
> + LogicalDecodingContext *ctx = NULL;
>
> Don't we need to use volatile for ctx? See the following comment in
> elog.h: "Note: if a local variable of the function containing PG_TRY
> is modified in the PG_TRY section and used in the PG_CATCH section,
> that variable must be declared "volatile" for POSIX compliance."?
>

BTW, don't we need a similar change in pg_logical_replication_slot_advance()?

-- 
With Regards,
Amit Kapila.