Thread

Re: Parallel Apply

Konstantin Knizhnik <knizhnik@garret.ru> — 2025-08-18T14:49:56Z
On 18/08/2025 9:56 AM, Nisha Moond wrote:
> On Wed, Aug 13, 2025 at 4:17 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
>> Here is the initial POC patch for this idea.
>>
> Thank you Hou-san for the patch.
>
> I did some performance benchmarking for the patch and overall, the
> results show substantial performance improvements.
> Please find the details as follows:
>
> Source code:
> ----------------
> pgHead (572c0f1b0e) and v1-0001 patch
>
> Setup:
> ---------
> Pub --> Sub
>   - Two nodes created in pub-sub logical replication setup.
>   - Both nodes have the same set of pgbench tables created with scale=300.
>   - The sub node is subscribed to all the changes from the pub node's
> pgbench tables.
>
> Workload Run:
> --------------------
>   - Disable the subscription on Sub node
>   - Run default pgbench(read-write) only on Pub node with #clients=40
> and run duration=10 minutes
>   - Enable the subscription on Sub once pgbench completes and then
> measure time taken in replication.
> ~~~
>
> Test-01: Measure Replication lag
> ----------------------------------------
> Observations:
> ---------------
>   - Replication time improved as the number of parallel workers
> increased with the patch.
>   - On pgHead, replicating a 10-minute publisher workload took ~46 minutes.
>   - With just 2 parallel workers (default), replication time was cut
> nearly in half (1.88x), and with 8 workers it completed in ~13 minutes
> (3.5x faster).
>   - With 16 parallel workers, achieved ~3.7x speedup over pgHead.
>   - With 32 workers, the gains plateaued (3.51x, slightly below the
> 16-worker result), likely because more workers were competing on the
> machine and there was not enough parallelizable work left to show
> further improvement.
>
> Detailed Result:
> -----------------
> Case                     Time_taken_in_replication(sec)   rep_time_in_minutes   faster_than_head
> 1. pgHead                2760.791                         46.01318333           -
> 2. patched_#worker=2     1463.853                         24.3975               1.88 times
> 3. patched_#worker=4     1031.376                         17.1896               2.68 times
> 4. patched_#worker=8      781.007                         13.0168               3.54 times
> 5. patched_#worker=16     741.108                         12.3518               3.73 times
> 6. patched_#worker=32     787.203                         13.1201               3.51 times
> ~~~~
>
> Test-02: Measure number of transactions parallelized
> -----------------------------------------------------
>   - Used a top-up patch to LOG the number of transactions applied by
> parallel workers, applied by the leader, and found to be dependent.
>   - The LOG output e.g. -
>    ```
> LOG:  parallelized_nxact: 11497254 dependent_nxact: 0 leader_applied_nxact: 600
> ```
>   - parallelized_nxact: gives the number of parallelized transactions
>   - dependent_nxact: gives the dependent transactions
>   - leader_applied_nxact: gives the transactions applied by leader worker
>   (the required top-up v1-002 patch is attached.)
>
>   Observations:
> ----------------
>   - With 4 to 8 parallel workers, ~80%-98% transactions are parallelized
>   - As the number of workers increased, the parallelized percentage
> increased and reached 99.99% with 32 workers.
>
> Detailed Result:
> -----------------
> case1: #parallel_workers = 2(default)
>    #total_pgbench_txns = 24745648
>      parallelized_nxact = 14439480 (58.35%)
>      dependent_nxact    = 16 (0.00006%)
>      leader_applied_nxact = 10306153 (41.64%)
>
> case2: #parallel_workers = 4
>    #total_pgbench_txns = 24776108
>      parallelized_nxact = 19666593 (79.37%)
>      dependent_nxact    = 212 (0.0008%)
>      leader_applied_nxact = 5109304 (20.62%)
>
> case3: #parallel_workers = 8
>    #total_pgbench_txns = 24821333
>      parallelized_nxact = 24397431 (98.29%)
>      dependent_nxact    = 282 (0.001%)
>      leader_applied_nxact = 423621 (1.71%)
>
> case4: #parallel_workers = 16
>    #total_pgbench_txns = 24938255
>      parallelized_nxact = 24937754 (99.99%)
>      dependent_nxact    = 142 (0.0005%)
>      leader_applied_nxact = 360 (0.0014%)
>
> case5: #parallel_workers = 32
>    #total_pgbench_txns = 24769474
>      parallelized_nxact = 24769135 (99.99%)
>      dependent_nxact    = 312 (0.0013%)
>      leader_applied_nxact = 28 (0.0001%)
>
> ~~~~~
> The scripts used for above tests are attached.
>
> Next, I plan to extend the testing to larger workloads by running
> pgbench for 20–30 minutes.
> We will also benchmark performance across different workload types to
> evaluate the improvements once the patch has matured further.
>
> --
> Thanks,
> Nisha
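
As a quick sanity check, the speedup column in the quoted Test-01 table can be recomputed from the raw replication times. This is only an illustrative sketch: the dictionary and function names are made up for this example, and the numbers are copied from the table above.

```python
# Replication times in seconds, copied from the quoted Test-01 table.
REPLICATION_SECONDS = {
    "pgHead": 2760.791,
    "patched_#worker=2": 1463.853,
    "patched_#worker=4": 1031.376,
    "patched_#worker=8": 781.007,
    "patched_#worker=16": 741.108,
    "patched_#worker=32": 787.203,
}


def speedup_over_head(times, baseline="pgHead"):
    """Return {case: baseline_time / case_time} for every non-baseline case."""
    base = times[baseline]
    return {
        case: base / secs
        for case, secs in times.items()
        if case != baseline
    }


if __name__ == "__main__":
    for case, ratio in speedup_over_head(REPLICATION_SECONDS).items():
        minutes = REPLICATION_SECONDS[case] / 60
        print(f"{case:<20} {minutes:8.2f} min  {ratio:.2f}x faster than head")
```

The ratios rise up to 16 workers and drop slightly at 32, which matches the plateau described in the observations.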


I also benchmarked the proposed parallel apply patch and compared it
with my prefetch (prewarming) approach.
Parallel apply is significantly more efficient than prefetch, as
expected.

I ran two tests (more details here):

https://www.postgresql.org/message-id/flat/84ed36b8-7d06-4945-9a6b-3826b3f999a6%40garret.ru#70b45c44814c248d3d519a762f528753

One performs random updates, the other performs inserts with random keys.
I stop the subscriber, run the workload on the publisher for 100 seconds,
and then measure how long it takes the subscriber to catch up.

update test (with 8 parallel apply workers):

     master:         8:30 min
     prefetch:       2:05 min
     parallel apply: 1:30 min

insert test (with 8 parallel apply workers):

     master:         9:20 min
     prefetch:       3:08 min
     parallel apply: 1:54 min
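
To make the comparison easier to read, the catch-up times above can be converted to seconds and expressed as ratios against master. The snippet below is just a small sketch for that arithmetic; the helper names and the dictionary layout are assumptions made for this example, and only the timings come from the tests above.

```python
# Catch-up times (minutes:seconds) copied from the two tests above.
CATCH_UP = {
    "update": {"master": "8:30", "prefetch": "2:05", "parallel apply": "1:30"},
    "insert": {"master": "9:20", "prefetch": "3:08", "parallel apply": "1:54"},
}


def to_seconds(mmss):
    """Convert an 'm:ss' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def speedups(results, baseline="master"):
    """Return {approach: baseline_time / approach_time} for one test."""
    base = to_seconds(results[baseline])
    return {
        name: base / to_seconds(value)
        for name, value in results.items()
        if name != baseline
    }


if __name__ == "__main__":
    for test, results in CATCH_UP.items():
        for name, ratio in speedups(results).items():
            print(f"{test:<7} {name:<15} {ratio:.2f}x faster than master")
```

With these numbers, parallel apply catches up roughly 5.7x faster than master in the update test and roughly 4.9x faster in the insert test, compared with about 4.1x and 3.0x for prefetch.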