Thread

  1. Re: Optimize SnapBuild by maintaining committed.xip in sorted order

    Xuneng Zhou <xunengzhou@gmail.com> — 2025-12-16T10:52:13Z

    Hi,
    
    On Tue, Dec 16, 2025 at 6:20 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
    >
    > Hi,
    >
    > On Fri, Nov 7, 2025 at 12:55 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
    > >
    > > Hi hackers,
    > >
    > > I'd like to propose an optimization for logical decoding's snapshot building
    > > mechanism that eliminates repeated sorting overhead once a replication slot
    > > reaches the CONSISTENT state.
    > >
    > > 1) Problem
    > >
    > > Currently, SnapBuildBuildSnapshot() sorts the entire committed.xip array on
    > > every call using qsort(). Once a logical decoding slot reaches CONSISTENT
    > > state, snapshots are built frequently—after every catalog-modifying transaction
    > > commit. This repeated O(n log n) sorting becomes a performance bottleneck as
    > > the committed transaction array grows.
    > >
    > > The existing code even has a TODO comment acknowledging this inefficiency:
    > > "TODO: It's unclear whether that reasoning has much merit. Every time we add
    > > something here after becoming consistent will also require distributing a
    > > snapshot."
    > >
    > > 2) Proposed Solution
    > >
    > > Maintain sorted order via batch merge on each commit:
    > >
    > > 1. Collect all relevant XIDs (including subtransactions) into a batch array
    > > 2. Sort the batch: O(m log m) where m is typically small (1 + nsubxacts)
    > > 3. Merge into the global array: O(n + m) via reverse in-place merge
    > >
    > > While per-commit cost increases from O(m) to O(m log m + n), this is offset by
    > > eliminating O(n log n) sorts at each snapshot build. Since CONSISTENT state
    > > builds snapshots after each catalog-modifying commit, the amortized cost is
    > > significantly lower.
    > >
    > > 3) Performance Impact
    > >
    > > Expected improvements in CONSISTENT state:
    > > - Eliminates O(n log n) qsort on every snapshot build
    > > - Replaces with O(m log m + n) merge per commit
    > > - For typical cases where m << n and snapshots are frequent, this is a win
    > >
    > >
    > > The benchmark results for this patch are shown below.
    > >
    > > 4) Configuration
    > >
    > > shared_buffers = '4GB'
    > > wal_level = logical
    > > max_replication_slots = 10
    > > max_wal_senders = 10
    > > log_min_messages = warning
    > > max_connections = 600
    > > autovacuum = off
    > > checkpoint_timeout = 15min
    > > max_wal_size = 4GB
    > >
    > >
    > > 5) Workloads
    > >
    > > # Workload 1: DDL - High frequency, small commits
    > > create_ddl_workload() {
    > >   local ROOT="$1"
    > >   cat >"$ROOT/ddl.sql" <<'SQL'
    > > -- High-frequency small catalog commits
    > > -- Each commit triggers SnapBuildBuildSnapshot
    > > DO $$
    > > DECLARE
    > >   tbl text := format('temp_%s_%s',
    > >                      current_setting('application_name', true),
    > >                      floor(random()*1e9)::int);
    > > BEGIN
    > >   EXECUTE format('CREATE TEMP TABLE %I (id int, data text) ON COMMIT
    > > DROP', tbl);
    > >   EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
    > >   EXECUTE format('SELECT * FROM %I', tbl);
    > > END$$;
    > > SQL
    > > }
    > >
    > > # Workload 2: MIXED - mix of DDL and DML, 50%-50%
    > > create_mixed_workload() {
    > >   local ROOT="$1"
    > >   cat >"$ROOT/mixed_ddl.sql" <<'SQL'
    > > -- DDL workload (catalog changes)
    > > DO $$
    > > DECLARE
    > >   tbl text := format('t_%s_%s',
    > >                      current_setting('application_name', true),
    > >                      floor(random()*1e9)::int);
    > > BEGIN
    > >   EXECUTE format('CREATE TABLE %I (id int, data text) ON COMMIT DROP', tbl);
    > >   EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
    > > END$$;
    > >
    > > SQL
    > >   cat >"$ROOT/mixed_dml.sql" <<'SQL'
    > > -- DML workload (no catalog changes)
    > > INSERT INTO app_data (id, data)
    > > VALUES (floor(random()*1e6)::int, repeat('x', 100))
    > > ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
    > > SQL
    > > }
    > >
    > > # Workload 3: CONTROL - Pure DML, no catalog changes
    > > create_control_workload() {
    > >   local ROOT="$1"
    > >   cat >"$ROOT/control.sql" <<'SQL'
    > >
    > > -- Pure DML, no catalog changes
    > > -- Should show no difference between baseline and patched
    > > INSERT INTO control_data (id, data)
    > > VALUES (floor(random()*1e6)::int, repeat('x', 100))
    > > ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
    > > SQL
    > > }
    > >
    > >
    > > 6) Test strategy
    > >
    > > Clients: 100 concurrent connections per workload
    > > Duration: 40 seconds per run
    > > Repetitions: 1 run per workload type
    > > Replication:  Logical replication slot using `test_decoding` plugin
    > > with `EXPORT_SNAPSHOT=true`
    > >
    > > Workload Types:
    > > 1. DDL - Pure catalog churn (temp table create/drop)
    > > 2. Mixed- 50% DDL + 50% DML (workload)
    > > 3. Control - Pure DML (no catalog changes)
    > >
    > > Measurement:
    > > - Metrics captured from pg_stat_replication_slots before/after each run
    > > - Primary metrics: total_txns (transactions decoded) and total_bytes
    > > (data volume)
    > > - Compared baseline (vanilla PostgreSQL) vs patched (sorted
    > > committed.xip optimization)
    > >
    > >
    > > 7) Performance results
    > >
    > > DDL Workload: +235% Decoder Improvement
    > > Decoder throughput:  713.76 → 2396.52 txns/sec  (+235%)
    > > Throughput (MB/s):   672.67 → 1747.22 MB/s     (+159%)
    > > Decode efficiency:   9.46% → 33.29%             (+23.83 points)
    > >
    > > Mixed Workload: No Change
    > > Decoder throughput:  2751.10 → 2730.00 txns/sec  (0%)
    > > Decode efficiency:   40.47% → 40.47%             (unchanged)
    > >
    > > We can see that the qsort overhead in SnapBuild has been eliminated in
    > > the flamegraphs in the mixed workload. However, the performance
    > > improvement was not observed as in the DDL workload. My guess is that,
    > > for DML workloads, ReorderBufferApplyChange has become the new
    > > hotspot.
    > >
    > > DML Workload: No Change
    > > Decoder throughput:  3062.57 → 3066.37 txns/sec  (0%)
    > > Decode efficiency:   49.97% → 49.97%             (unchanged)
    > >
    > > === Workload: ddl ===
    > > Client commits/sec:
    > >   Baseline:  7545.76 commits/sec
    > >   Patched:   7198.21 commits/sec
    > >
    > > Decoder throughput (from pg_stat_replication_slots):
    > >   Baseline:  713.76 txns/sec  (672.67 MB/s)
    > >   Patched:   2396.52 txns/sec  (1747.22 MB/s)
    > >
    > > Transaction efficiency (decoded vs committed):
    > >   Baseline:  309376 committed  →  29264 decoded  (9.46%)
    > >   Patched:   302325 committed  →  100654 decoded  (33.29%)
    > >
    > > Total decoded (all reps):
    > >   Baseline:  29264 txns  (27579.53 MB)
    > >   Patched:   100654 txns  (73383.10 MB)
    > >
    > >   Decoder improvement: +235.00% (txns/sec)
    > >   Decoder improvement: +159.00% (MB/s)
    > >   Efficiency improvement: +23.83% points (more transactions decoded
    > > per committed)
    > >
    > > === Workload: mixed ===
    > > Client commits/sec:
    > >   Baseline:  6797.50 commits/sec
    > >   Patched:   6745.26 commits/sec
    > >
    > > Decoder throughput (from pg_stat_replication_slots):
    > >   Baseline:  2751.10 txns/sec  (210.35 MB/s)
    > >   Patched:   2730.00 txns/sec  (205.64 MB/s)
    > >
    > > Transaction efficiency (decoded vs committed):
    > >   Baseline:  285495 committed  →  115546 decoded  (40.47%)
    > >   Patched:   283301 committed  →  114660 decoded  (40.47%)
    > >
    > > Total decoded (all reps):
    > >   Baseline:  115546 txns  (8834.71 MB)
    > >   Patched:   114660 txns  (8636.96 MB)
    > >
    > >   ≈  Decoder unchanged: 0.00% (txns/sec)
    > >
    > > === Workload: DML ===
    > > Client commits/sec:
    > >   Baseline:  6129.24 commits/sec
    > >   Patched:   6136.93 commits/sec
    > >
    > > Decoder throughput (from pg_stat_replication_slots):
    > >   Baseline:  3062.57 txns/sec  (0.26 MB/s)
    > >   Patched:   3066.37 txns/sec  (0.26 MB/s)
    > >
    > > Transaction efficiency (decoded vs committed):
    > >   Baseline:  257428 committed  →  128628 decoded  (49.97%)
    > >   Patched:   251614 committed  →  125721 decoded  (49.97%)
    > >
    > > Total decoded (all reps):
    > >   Baseline:  128628 txns  (10.98 MB)
    > >   Patched:   125721 txns  (10.69 MB)
    > >
    > >   ≈  Decoder unchanged: 0.00% (txns/sec)
    > >
    > > 8) Potential regression
    > >
    > > The potential regression point could be before the slot reaches the
    > > CONSISTENT state, particularly when building_full_snapshot is set to
    > > true. In this phase, all transactions including those that don’t
    > > modify the catalog — must be added to the committed.xip array. These
    > > XIDs don’t require later snapshot builds or sorting, so the
    > > batch-insert logic increases the per-insert cost from O(1) to O(m + n)
    > > without providing a direct benefit.
    > >
    > > However, the impact of this regression could be limited. The system
    > > remains in the pre-CONSISTENT phase only briefly during initial
    > > snapshot building, and the building_full_snapshot = true case is rare,
    > > mainly used when creating replication slots with the EXPORT_SNAPSHOT
    > > option.
    > >
    > > Once the slot becomes CONSISTENT, only catalog-modifying transactions
    > > are tracked in committed.xip, and the patch reduces overall
    > > snapshot-building overhead by eliminating repeated full-array sorts.
    > >
    > > We could also adopt a two-phase approach — keeping the current
    > > behavior before reaching the CONSISTENT state and maintaining a sorted
    > > array only after that point. This would preserve the performance
    > > benefits while avoiding potential regressions. However, it would
    > > introduce additional complexity and potential risks in handling the
    > > state transitions.
    > >
    > > if (builder->state < SNAPBUILD_CONSISTENT)
    > > {
    > > /* ensure that only commits after this are getting replayed */
    > > if (builder->start_decoding_at <= lsn)
    > > builder->start_decoding_at = lsn + 1;
    > >
    > > /*
    > > * If building an exportable snapshot, force xid to be tracked, even
    > > * if the transaction didn't modify the catalog.
    > > */
    > > if (builder->building_full_snapshot)
    > > {
    > > needs_timetravel = true;
    > > }
    > > }
    >
    > After some consideration, a two-phase sorting strategy seems feasible
    > to implement in a relatively straightforward manner. So it's done in
    > v2. I also plan to run benchmarks to evaluate potential regressions of
    > the original “sort-at-all-stages” approach of v1.
    >
    > > 9) Additional benefit
    > > With this patch applied, we can optimize the SnapBuildPurgeOlderTxn to use
    > > two binary searchs rather than interating and copying to find the interval
    > > to keep in the sorted commited.xip array. [1]
    > >
    > > Feedbacks welcomed.
    > >
    > > [1] https://www.postgresql.org/message-id/flat/CABPTF7V9gcpTLrOY0fG4YontoHjVg8YrbmiH4XB_5PT6K56xhg%40mail.gmail.com
    >
    > V2 fixes several issues in v1, including a potential memory leak, type
    > inconsistencies, and applies pgindent to the files.
    >
    
    V3 fixes a critical issue where the snapshot->xip array in
    SnapBuildBuildSnapshot might not be sorted before reaching the
    consistent state. Sorry for the noise here.
    
    -- 
    Best,
    Xuneng