Thread

Re: Optimize SnapBuild by maintaining committed.xip in sorted order

Xuneng Zhou <xunengzhou@gmail.com> — 2025-12-16T10:52:13Z
Hi,

On Tue, Dec 16, 2025 at 6:20 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> On Fri, Nov 7, 2025 at 12:55 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi hackers,
> >
> > I'd like to propose an optimization for logical decoding's snapshot building
> > mechanism that eliminates repeated sorting overhead once a replication slot
> > reaches the CONSISTENT state.
> >
> > 1) Problem
> >
> > Currently, SnapBuildBuildSnapshot() sorts the entire committed.xip array on
> > every call using qsort(). Once a logical decoding slot reaches CONSISTENT
> > state, snapshots are built frequently—after every catalog-modifying transaction
> > commit. This repeated O(n log n) sorting becomes a performance bottleneck as
> > the committed transaction array grows.
> >
> > The existing code even has a TODO comment acknowledging this inefficiency:
> > "TODO: It's unclear whether that reasoning has much merit. Every time we add
> > something here after becoming consistent will also require distributing a
> > snapshot."
> >
> > 2) Proposed Solution
> >
> > Maintain sorted order via batch merge on each commit:
> >
> > 1. Collect all relevant XIDs (including subtransactions) into a batch array
> > 2. Sort the batch: O(m log m) where m is typically small (1 + nsubxacts)
> > 3. Merge into the global array: O(n + m) via reverse in-place merge
> >
> > While per-commit cost increases from O(m) to O(m log m + n), this is offset by
> > eliminating O(n log n) sorts at each snapshot build. Since CONSISTENT state
> > builds snapshots after each catalog-modifying commit, the amortized cost is
> > significantly lower.
> >
> > 3) Performance Impact
> >
> > Expected improvements in CONSISTENT state:
> > - Eliminates O(n log n) qsort on every snapshot build
> > - Replaces with O(m log m + n) merge per commit
> > - For typical cases where m << n and snapshots are frequent, this is a win
> >
> >
> > The benchmark results for this patch are shown below.
> >
> > 4) Configuration
> >
> > shared_buffers = '4GB'
> > wal_level = logical
> > max_replication_slots = 10
> > max_wal_senders = 10
> > log_min_messages = warning
> > max_connections = 600
> > autovacuum = off
> > checkpoint_timeout = 15min
> > max_wal_size = 4GB
> >
> >
> > 5) Workloads
> >
> > # Workload 1: DDL - High frequency, small commits
> > create_ddl_workload() {
> >   local ROOT="$1"
> >   cat >"$ROOT/ddl.sql" <<'SQL'
> > -- High-frequency small catalog commits
> > -- Each commit triggers SnapBuildBuildSnapshot
> > DO $$
> > DECLARE
> >   tbl text := format('temp_%s_%s',
> >                      current_setting('application_name', true),
> >                      floor(random()*1e9)::int);
> > BEGIN
> >   EXECUTE format('CREATE TEMP TABLE %I (id int, data text) ON COMMIT
> > DROP', tbl);
> >   EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
> >   EXECUTE format('SELECT * FROM %I', tbl);
> > END$$;
> > SQL
> > }
> >
> > # Workload 2: MIXED - mix of DDL and DML, 50%-50%
> > create_mixed_workload() {
> >   local ROOT="$1"
> >   cat >"$ROOT/mixed_ddl.sql" <<'SQL'
> > -- DDL workload (catalog changes)
> > DO $$
> > DECLARE
> >   tbl text := format('t_%s_%s',
> >                      current_setting('application_name', true),
> >                      floor(random()*1e9)::int);
> > BEGIN
> >   EXECUTE format('CREATE TABLE %I (id int, data text) ON COMMIT DROP', tbl);
> >   EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
> > END$$;
> >
> > SQL
> >   cat >"$ROOT/mixed_dml.sql" <<'SQL'
> > -- DML workload (no catalog changes)
> > INSERT INTO app_data (id, data)
> > VALUES (floor(random()*1e6)::int, repeat('x', 100))
> > ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
> > SQL
> > }
> >
> > # Workload 3: CONTROL - Pure DML, no catalog changes
> > create_control_workload() {
> >   local ROOT="$1"
> >   cat >"$ROOT/control.sql" <<'SQL'
> >
> > -- Pure DML, no catalog changes
> > -- Should show no difference between baseline and patched
> > INSERT INTO control_data (id, data)
> > VALUES (floor(random()*1e6)::int, repeat('x', 100))
> > ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
> > SQL
> > }
> >
> >
> > 6) Test strategy
> >
> > Clients: 100 concurrent connections per workload
> > Duration: 40 seconds per run
> > Repetitions: 1 run per workload type
> > Replication:  Logical replication slot using `test_decoding` plugin
> > with `EXPORT_SNAPSHOT=true`
> >
> > Workload Types:
> > 1. DDL - Pure catalog churn (temp table create/drop)
> > 2. Mixed- 50% DDL + 50% DML (workload)
> > 3. Control - Pure DML (no catalog changes)
> >
> > Measurement:
> > - Metrics captured from pg_stat_replication_slots before/after each run
> > - Primary metrics: total_txns (transactions decoded) and total_bytes
> > (data volume)
> > - Compared baseline (vanilla PostgreSQL) vs patched (sorted
> > committed.xip optimization)
> >
> >
> > 7) Performance results
> >
> > DDL Workload: +235% Decoder Improvement
> > Decoder throughput:  713.76 → 2396.52 txns/sec  (+235%)
> > Throughput (MB/s):   672.67 → 1747.22 MB/s     (+159%)
> > Decode efficiency:   9.46% → 33.29%             (+23.83 points)
> >
> > Mixed Workload: No Change
> > Decoder throughput:  2751.10 → 2730.00 txns/sec  (0%)
> > Decode efficiency:   40.47% → 40.47%             (unchanged)
> >
> > We can see that the qsort overhead in SnapBuild has been eliminated in
> > the flamegraphs in the mixed workload. However, the performance
> > improvement was not observed as in the DDL workload. My guess is that,
> > for DML workloads, ReorderBufferApplyChange has become the new
> > hotspot.
> >
> > DML Workload: No Change
> > Decoder throughput:  3062.57 → 3066.37 txns/sec  (0%)
> > Decode efficiency:   49.97% → 49.97%             (unchanged)
> >
> > === Workload: ddl ===
> > Client commits/sec:
> >   Baseline:  7545.76 commits/sec
> >   Patched:   7198.21 commits/sec
> >
> > Decoder throughput (from pg_stat_replication_slots):
> >   Baseline:  713.76 txns/sec  (672.67 MB/s)
> >   Patched:   2396.52 txns/sec  (1747.22 MB/s)
> >
> > Transaction efficiency (decoded vs committed):
> >   Baseline:  309376 committed  →  29264 decoded  (9.46%)
> >   Patched:   302325 committed  →  100654 decoded  (33.29%)
> >
> > Total decoded (all reps):
> >   Baseline:  29264 txns  (27579.53 MB)
> >   Patched:   100654 txns  (73383.10 MB)
> >
> >   Decoder improvement: +235.00% (txns/sec)
> >   Decoder improvement: +159.00% (MB/s)
> >   Efficiency improvement: +23.83% points (more transactions decoded
> > per committed)
> >
> > === Workload: mixed ===
> > Client commits/sec:
> >   Baseline:  6797.50 commits/sec
> >   Patched:   6745.26 commits/sec
> >
> > Decoder throughput (from pg_stat_replication_slots):
> >   Baseline:  2751.10 txns/sec  (210.35 MB/s)
> >   Patched:   2730.00 txns/sec  (205.64 MB/s)
> >
> > Transaction efficiency (decoded vs committed):
> >   Baseline:  285495 committed  →  115546 decoded  (40.47%)
> >   Patched:   283301 committed  →  114660 decoded  (40.47%)
> >
> > Total decoded (all reps):
> >   Baseline:  115546 txns  (8834.71 MB)
> >   Patched:   114660 txns  (8636.96 MB)
> >
> >   ≈  Decoder unchanged: 0.00% (txns/sec)
> >
> > === Workload: DML ===
> > Client commits/sec:
> >   Baseline:  6129.24 commits/sec
> >   Patched:   6136.93 commits/sec
> >
> > Decoder throughput (from pg_stat_replication_slots):
> >   Baseline:  3062.57 txns/sec  (0.26 MB/s)
> >   Patched:   3066.37 txns/sec  (0.26 MB/s)
> >
> > Transaction efficiency (decoded vs committed):
> >   Baseline:  257428 committed  →  128628 decoded  (49.97%)
> >   Patched:   251614 committed  →  125721 decoded  (49.97%)
> >
> > Total decoded (all reps):
> >   Baseline:  128628 txns  (10.98 MB)
> >   Patched:   125721 txns  (10.69 MB)
> >
> >   ≈  Decoder unchanged: 0.00% (txns/sec)
> >
> > 8) Potential regression
> >
> > The potential regression point could be before the slot reaches the
> > CONSISTENT state, particularly when building_full_snapshot is set to
> > true. In this phase, all transactions including those that don’t
> > modify the catalog — must be added to the committed.xip array. These
> > XIDs don’t require later snapshot builds or sorting, so the
> > batch-insert logic increases the per-insert cost from O(1) to O(m + n)
> > without providing a direct benefit.
> >
> > However, the impact of this regression could be limited. The system
> > remains in the pre-CONSISTENT phase only briefly during initial
> > snapshot building, and the building_full_snapshot = true case is rare,
> > mainly used when creating replication slots with the EXPORT_SNAPSHOT
> > option.
> >
> > Once the slot becomes CONSISTENT, only catalog-modifying transactions
> > are tracked in committed.xip, and the patch reduces overall
> > snapshot-building overhead by eliminating repeated full-array sorts.
> >
> > We could also adopt a two-phase approach — keeping the current
> > behavior before reaching the CONSISTENT state and maintaining a sorted
> > array only after that point. This would preserve the performance
> > benefits while avoiding potential regressions. However, it would
> > introduce additional complexity and potential risks in handling the
> > state transitions.
> >
> > if (builder->state < SNAPBUILD_CONSISTENT)
> > {
> > /* ensure that only commits after this are getting replayed */
> > if (builder->start_decoding_at <= lsn)
> > builder->start_decoding_at = lsn + 1;
> >
> > /*
> > * If building an exportable snapshot, force xid to be tracked, even
> > * if the transaction didn't modify the catalog.
> > */
> > if (builder->building_full_snapshot)
> > {
> > needs_timetravel = true;
> > }
> > }
>
> After some consideration, a two-phase sorting strategy seems feasible
> to implement in a relatively straightforward manner. So it's done in
> v2. I also plan to run benchmarks to evaluate potential regressions of
> the original “sort-at-all-stages” approach of v1.
>
> > 9) Additional benefit
> > With this patch applied, we can optimize the SnapBuildPurgeOlderTxn to use
> > two binary searchs rather than interating and copying to find the interval
> > to keep in the sorted commited.xip array. [1]
> >
> > Feedbacks welcomed.
> >
> > [1] https://www.postgresql.org/message-id/flat/CABPTF7V9gcpTLrOY0fG4YontoHjVg8YrbmiH4XB_5PT6K56xhg%40mail.gmail.com
>
> V2 fixes several issues in v1, including a potential memory leak, type
> inconsistencies, and applies pgindent to the files.
>

V3 fixes a critical issue where the snapshot->xip array in
SnapBuildBuildSnapshot might not be sorted before reaching the
consistent state. Sorry for the noise here.

-- 
Best,
Xuneng