Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Andres Freund <andres@anarazel.de>
From: Andres Freund <andres@anarazel.de>
To: hubert depesz lubaczewski <depesz@depesz.com>, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
Cc: Amit Kapila <amit.kapila16@gmail.com>, pgsql-bugs mailing list <pgsql-bugs@postgresql.org>
Date: 2022-11-17T07:25:12Z
Lists: pgsql-bugs
Commits
Same data as JSON:
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
Fix a possibility of logical replication slot's restart_lsn going backwards.
- e5ed873b1b4a 18.0 landed
- 568e78a653ee 17.2 landed
- f353911337cf 16.6 landed
- 91771b3fbbc3 15.10 landed
- 26c4e8968690 14.15 landed
- 15dc1abb17dd 13.18 landed
Hi,
On 2022-11-11 15:50:40 +0100, hubert depesz lubaczewski wrote:
> Out of 8 servers that we prepared upgrade for, one failed with the
> same/similar problem.
Perhaps a stupid question - are you using max_slot_wal_keep_size? And what's
your wal_segment_size? And wal_keep_size?
I'm a bit confused by the changes the changes c6550776394e made to
KeepLogSeg() etc.
Before:
/* then check whether slots limit removal further */
if (max_replication_slots > 0 && keep != InvalidXLogRecPtr)
{
XLogSegNo slotSegNo;
XLByteToSeg(keep, slotSegNo, wal_segment_size);
if (slotSegNo <= 0)
segno = 1;
else if (slotSegNo < segno)
segno = slotSegNo;
}
Current:
keep = XLogGetReplicationSlotMinimumLSN();
if (keep != InvalidXLogRecPtr)
{
XLByteToSeg(keep, segno, wal_segment_size);
/* Cap by max_slot_wal_keep_size ... */
if (max_slot_wal_keep_size_mb >= 0)
{
uint64 slot_keep_segs;
slot_keep_segs =
ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size);
if (currSegNo - segno > slot_keep_segs)
segno = currSegNo - slot_keep_segs;
}
}
Which looks to me to have lost the protection against segno being 0. Which
afaict would break the callers, because they do _logSegNo--. Which would then
allow all WAL to be reused, no? I guess the argument for changing this was
that we should never have such a dangerous value returned by
XLogGetReplicationSlotMinimumLSN()?
It seems decidedly not great to not log at least a debug1 (but probably it
should be LOG) message when KeepLogSeg() decides to limit based on
max_slot_wal_keep_size.
It feels wrong to subtract max_slot_wal_keep_size from recptr - that's the end
of the checkpoint record. Given that we, leaving max_slot_wal_keep_size aside,
only actually remove WAL if older than the segment that RedoRecPtr (the
logical start of the checkpoint) is in. If the checkpoint is large we'll end
up removing replication slots even though they potentially would only have
retained one additional WAL segment.
Isn't it problematic to use ConvertToXSegs() to implement
max_slot_wal_keep_size, given that it rounds *down*? Particularly for a large
wal_segment_size that'd afaict lead to being much more aggressive invalidating
slots.
Also, why do we do something as expensive as
InvalidateObsoleteReplicationSlots() even when max_slot_wal_keep_size had no
effect?
Greetings,
Andres Freund