Thread

Re: Bound memory usage during manual slot sync retries

Xuneng Zhou <xunengzhou@gmail.com> — 2026-05-15T12:53:50Z
Hi Shveta, Zhijie,

On Fri, May 15, 2026 at 4:41 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, May 15, 2026 2:36 PM shveta malik <shveta.malik@gmail.com> wrote:
> > On Fri, May 15, 2026 at 11:03 AM Xuneng Zhou <xunengzhou@gmail.com>
> > wrote:
> > >
> > > Hi Hackers,
> > >
> > > pg_sync_replication_slots() now retries inside a single SQL function
> > > call.  Unlike the slotsync worker, it does not get a transaction
> > > boundary between retry cycles, so allocations made while fetching and
> > > synchronizing remote slots can accumulate until the function returns.
> > >
> > > The existing list_free_deep(remote_slots) is not enough to bound this.
> > > It frees the List cells and the RemoteSlot structs stored as list
> > > elements, but it does not recursively free allocations owned by those
> > > structs, such as the copied slot name, plugin name, and database name.
> > > It also does not release unrelated per-cycle allocations made while
> > > fetching and processing the remote slots.
> > >
> > > This is probably modest in normal cases, as the retained memory is
> > > roughly proportional to the number of retries times the number of remote
> > slots.
> > > Still, the function can wait for a long time on a lagging or
> > > misconfigured standby, so the growth is unbounded for that call.
> > >
> > > The attached patch runs each manual retry cycle in a short-lived
> > > memory context and resets it before the next attempt.  The slot name
> > > list needed across retries is copied outside that context.
> > >
> > > It also adds two local cleanups.  Tuple slots created with
> > > MakeSingleTupleTableSlot() are explicitly released with
> > > ExecDropSingleTupleTableSlot() before clearing the walreceiver result.
> > > The memory context would reclaim the storage eventually, but the slot
> > > also holds a reference to the result tuple descriptor, so releasing it
> > > at the matching ownership boundary seems clearer and follows nearby
> > code.
> > >
> > > drop_local_obsolete_slots() now frees the temporary list container it
> > > builds.  The retry-cycle context would reclaim this list too, but
> > > freeing it locally would make the helper self-contained.
> > >
> > > Politely CCed the original authors.
> > >
> > >
> >
> > I like the core idea of this patch. It makes the flow cleaner and protects the
> > flow from memory-leaks when dealing with a high number of slots. I think
> > during implementation, we considered having a separate memory context,
> > but since we were freeing the remote_slots then, we thought allocating a new
> > memory context might not be required. But on re-thinking and reading the
> > details above, IMO, it is a good improvement.

Yeah, we need to stop the memory accumulation, either by using a
dedicated memory context or by freeing the memory manually.

> +1 to the general idea of avoiding memory accumulation.
>
> >
> > A couple of minor comments:
> >
> > 1. I think we need to delete the new memory context in
> > slotsync_failure_callback() as well.
>
> I think we can avoid doing that, because the new memory context should be
> destroyed along with its parent context on transaction abort (if any error
> occurs).

Yeah, normal error/transaction memory context cleanup seems enough. It
avoids adding callback state only to delete cycle_ctx.

> Just one comment:
>
> @@ -579,6 +579,8 @@ drop_local_obsolete_slots(List *remote_slot_list)
>                                                    local_slot->data.database));
>                 }
>         }
> +
> +       list_free(local_slots);
>  }
>
> I think we prefer to avoid freeing memory explicitly, since this is a
> static function and we already have a memory context in place to prevent leaks.
> (We've seen this discussion about explicit freeing multiple times before, and
> the previous conclusion was to rely on the memory context management rather than
> adding more free code.)

Thanks for pointing out. I wasn't aware of the prior discussion. I
mainly wanted to avoid potential issues in case future callers invoke
it without handling the cleanup properly.


--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.