Thread

  1. Re: Parallel Apply

    Konstantin Knizhnik <knizhnik@garret.ru> — 2025-08-18T14:49:56Z

    On 18/08/2025 9:56 AM, Nisha Moond wrote:
    > On Wed, Aug 13, 2025 at 4:17 PM Zhijie Hou (Fujitsu)
    > <houzj.fnst@fujitsu.com> wrote:
    >> Here is the initial POC patch for this idea.
    >>
    > Thank you Hou-san for the patch.
    >
    > I did some performance benchmarking for the patch and overall, the
    > results show substantial performance improvements.
    > Please find the details as follows:
    >
    > Source code:
    > ----------------
    > pgHead (572c0f1b0e) and v1-0001 patch
    >
    > Setup:
    > ---------
    > Pub --> Sub
    >   - Two nodes created in pub-sub logical replication setup.
    >   - Both nodes have the same set of pgbench tables created with scale=300.
    >   - The sub node is subscribed to all the changes from the pub node's
    > pgbench tables.
    >
    > Workload Run:
    > --------------------
    >   - Disable the subscription on Sub node
    >   - Run default pgbench(read-write) only on Pub node with #clients=40
    > and run duration=10 minutes
    >   - Enable the subscription on Sub once pgbench completes and then
    > measure time taken in replication.
    > ~~~
    >
    > Test-01: Measure Replication lag
    > ----------------------------------------
    > Observations:
    > ---------------
    >   - Replication time improved as the number of parallel workers
    > increased with the patch.
    >   - On pgHead, replicating a 10-minute publisher workload took ~46 minutes.
    >   - With just 2 parallel workers (default), replication time was cut in
    > half, and with 8 workers it completed in ~13 minutes(3.5x faster).
    >   - With 16 parallel workers, achieved ~3.7x speedup over pgHead.
    >   - With 32 workers, the gains plateaued (slightly slower than with 16),
    > likely because the extra workers compete for the machine's resources and
    > the workload does not offer enough parallel work to benefit from them.
    >
    > Detailed Result:
    > -----------------
    > Case                    Time_taken_in_replication(sec)  rep_time_in_minutes  faster_than_head
    > 1. pgHead               2760.791                        46.01318333          -
    > 2. patched_#worker=2    1463.853                        24.3975              1.88 times
    > 3. patched_#worker=4    1031.376                        17.1896              2.68 times
    > 4. patched_#worker=8     781.007                        13.0168              3.54 times
    > 5. patched_#worker=16    741.108                        12.3518              3.73 times
    > 6. patched_#worker=32    787.203                        13.1201              3.51 times
    > ~~~~
    >
    > Test-02: Measure number of transactions parallelized
    > -----------------------------------------------------
    >   - Used a top-up patch to LOG the number of transactions applied by
    > parallel workers, applied by the leader, and found to be dependent.
    >   - The LOG output e.g. -
    >    ```
    > LOG:  parallelized_nxact: 11497254 dependent_nxact: 0 leader_applied_nxact: 600
    > ```
    >   - parallelized_nxact: gives the number of parallelized transactions
    >   - dependent_nxact: gives the dependent transactions
    >   - leader_applied_nxact: gives the transactions applied by leader worker
    >   (the required top-up v1-002 patch is attached.)
    >
    >   Observations:
    > ----------------
    >   - With 4 to 8 parallel workers, ~80%-98% of transactions are parallelized
    >   - As the number of workers increased, the parallelized percentage
    > increased and reached 99.99% with 16 and 32 workers.
    >
    > Detailed Result:
    > -----------------
    > case1: #parallel_workers = 2(default)
    >    #total_pgbench_txns = 24745648
    >      parallelized_nxact = 14439480 (58.35%)
    >      dependent_nxact    = 16 (0.00006%)
    >      leader_applied_nxact = 10306153 (41.64%)
    >
    > case2: #parallel_workers = 4
    >    #total_pgbench_txns = 24776108
    >      parallelized_nxact = 19666593 (79.37%)
    >      dependent_nxact    = 212 (0.0008%)
    >      leader_applied_nxact = 5109304 (20.62%)
    >
    > case3: #parallel_workers = 8
    >    #total_pgbench_txns = 24821333
    >      parallelized_nxact = 24397431 (98.29%)
    >      dependent_nxact    = 282 (0.001%)
    >      leader_applied_nxact = 423621 (1.71%)
    >
    > case4: #parallel_workers = 16
    >    #total_pgbench_txns = 24938255
    >      parallelized_nxact = 24937754 (99.99%)
    >      dependent_nxact    = 142 (0.0005%)
    >      leader_applied_nxact = 360 (0.0014%)
    >
    > case5: #parallel_workers = 32
    >    #total_pgbench_txns = 24769474
    >      parallelized_nxact = 24769135 (99.99%)
    >      dependent_nxact    = 312 (0.0013%)
    >      leader_applied_nxact = 28 (0.0001%)
    >
    > ~~~~~
    > The scripts used for above tests are attached.
    >
    > Next, I plan to extend the testing to larger workloads by running
    > pgbench for 20–30 minutes.
    > We will also benchmark performance across different workload types to
    > evaluate the improvements once the patch has matured further.
    >
    > --
    > Thanks,
    > Nisha
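
    The percentages in the quoted Test-02 results can be recomputed from the
    raw counters. What follows is only a minimal sketch, assuming each
    percentage is the counter divided by #total_pgbench_txns; the COUNTERS
    table and the shares() helper are names made up for this illustration and
    are not part of the patch or the attached scripts.

```python
# Counters copied from the quoted Test-02 results, keyed by the number of
# parallel apply workers: (total, parallelized, dependent, leader_applied).
COUNTERS = {
    2: (24745648, 14439480, 16, 10306153),
    4: (24776108, 19666593, 212, 5109304),
    8: (24821333, 24397431, 282, 423621),
    16: (24938255, 24937754, 142, 360),
    32: (24769474, 24769135, 312, 28),
}


def shares(total, parallelized, dependent, leader_applied):
    """Return each counter as a percentage of the total transaction count.

    Assumption: the reported percentages use #total_pgbench_txns as the
    denominator.
    """
    return {
        "parallelized": 100.0 * parallelized / total,
        "dependent": 100.0 * dependent / total,
        "leader_applied": 100.0 * leader_applied / total,
    }


if __name__ == "__main__":
    for workers, counts in COUNTERS.items():
        s = shares(*counts)
        print(
            f"workers={workers:2d} "
            f"parallelized={s['parallelized']:.2f}% "
            f"leader_applied={s['leader_applied']:.4f}%"
        )
```

    Under that assumption the recomputed shares agree with the quoted ones to
    within 0.01 percentage points.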
    
    
    I also benchmarked the proposed parallel apply patch and compared it
    with my prewarming (prefetch) approach.
    As expected, parallel apply is significantly more efficient than
    prefetch.
    
    I ran two tests (more details here):
    
    https://www.postgresql.org/message-id/flat/84ed36b8-7d06-4945-9a6b-3826b3f999a6%40garret.ru#70b45c44814c248d3d519a762f528753
    
    One performs random updates and the other performs inserts with random keys.
    I stop the subscriber, run the workload on the publisher for 100 seconds,
    and then measure how long it takes the subscriber to catch up.
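
    The catch-up measurement can be sketched roughly as below. This is not
    the script used for these tests, only an illustration under assumptions:
    the target LSN would be taken on the publisher once the workload ends
    (for example with SELECT pg_current_wal_lsn()), and get_applied_lsn is a
    hypothetical callable that returns the position the subscriber has
    reached (for example replay_lsn from pg_stat_replication on the
    publisher). The names parse_lsn and measure_catchup are made up for
    this example.

```python
import time


def parse_lsn(text):
    """Convert a PostgreSQL LSN string such as '16/B374D848' to an integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def measure_catchup(target_lsn, get_applied_lsn, poll_interval=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Return the seconds elapsed until get_applied_lsn() reaches target_lsn.

    target_lsn      -- LSN string captured on the publisher after the workload
    get_applied_lsn -- hypothetical callable returning the subscriber's
                       current position as an LSN string
    clock, sleep    -- injectable so the loop can be exercised without a server
    """
    target = parse_lsn(target_lsn)
    start = clock()
    # Poll until the subscriber has applied everything up to the target.
    while parse_lsn(get_applied_lsn()) < target:
        sleep(poll_interval)
    return clock() - start
```

    The timer would be started right after re-enabling the subscription
    (ALTER SUBSCRIPTION ... ENABLE), so the result covers only the apply
    phase and not the time spent generating the workload.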
    
    update test (with 8 parallel apply workers):
    
         master:         8:30 min
         prefetch:       2:05 min
         parallel apply: 1:30 min
    
    insert test (with 8 parallel apply workers):
    
         master:         9:20 min
         prefetch:       3:08 min
         parallel apply: 1:54 min
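
    To make the comparison easier to read, the timings above can be turned
    into speedup factors relative to master. A small sketch; the RESULTS
    table just restates the numbers above, and to_seconds() and speedups()
    are helper names invented for this example.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string such as '8:30' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Catch-up times from the two tests above (8 parallel apply workers).
RESULTS = {
    "update": {"master": "8:30", "prefetch": "2:05", "parallel apply": "1:30"},
    "insert": {"master": "9:20", "prefetch": "3:08", "parallel apply": "1:54"},
}


def speedups(timings):
    """Return the speedup of each variant relative to the 'master' timing."""
    base = to_seconds(timings["master"])
    return {
        name: base / to_seconds(value)
        for name, value in timings.items()
        if name != "master"
    }


if __name__ == "__main__":
    for test, timings in RESULTS.items():
        for name, factor in speedups(timings).items():
            print(f"{test:6s} {name:15s} {factor:.2f}x faster than master")
```

    That gives roughly 4.1x (prefetch) and 5.7x (parallel apply) for the
    update test, and roughly 3.0x and 4.9x for the insert test.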