RE: Parallel Apply
Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> — 2025-12-16T11:35:34Z
Dear hackers,

I have been spending time benchmarking the patch set. Here is an updated
report.

First, I want to reply to a few points raised by Tomas.

> 5) It's not clear to me how did you measure the TPS in your benchmark.
> Did you measure how long it takes for the standby to catch up, or what
> did you do?

Since the approach was not straightforward, we changed the metric: the latency
for replication was measured. See the "Workload" section for more details.

> 2) If I understand correctly, the patch maintains a "replica_identity"
> hash table, with replica identity keys for all changes for all
> concurrent transactions. How expensive can this be, in terms of CPU and
> memory? What if I have multiple large batch transactions, each updating
> millions of rows?

I have profiled large-transaction cases and confirmed that cleanup is not
CPU-costly. E.g., the attached .dat file shows the profile for the leader
worker, with the 1 M update workload and a parallelism of 16. We can see that
the leader worker spends most of its time reading data from the stream, while
the cleanup function takes only around 5%.
Also, I temporarily removed the dependency tracking part and ran the tests
again, but the performance did not change. Based on that, the CPU consumption
of dependency tracking can be ignored.
I have not attached the profiles for other cases; tell me if they are needed.
We are still analyzing the memory consumption and will share the results later.

> 6) Did you investigate why the speedup is just ~2.1 with 4 workers, i.e.
> about half of the "ideal" speedup? Is it bottlenecked on WAL, leader
> having to determine dependencies, or something else?

Even in the 1 M insert/update workloads with a replica identity, performance
did not keep improving as the parallelism was raised. My theory was that the
parallel workers were fast enough, and four workers could finish applying all
transactions. Thus, I ran a further experiment, which removed the primary key
and used REPLICA IDENTITY FULL for applying UPDATEs.
It increased the apply time, and performance could be improved up to w=16.
See the "Result" part.

The details of the benchmarks follow.

Abstract
----------
I did benchmarks with two workloads: 1) 1 million tuples are inserted in
total, and 2) 1 million tuples are updated in total. Overall, we can say that
parallel apply can improve performance, especially when transactions are long
and need time to apply.

Regarding the INSERT workload, the patch applies changes about 10% faster than
HEAD, but results remain constant regardless of parallelism. IIUC, because
applying transactions was relatively fast, fewer parallel workers could be
launched.
Another point is that performance worsens when the number of workers is set to
0. We may be able to skip the additional checks in this case.

Regarding the UPDATE workload, performance could be improved up to
max_parallel_apply_workers_per_subscription=4, but it was stable for the
{8, 16} cases. This is because four workers are enough to apply all changes.
When the leader tries to assign a new transaction, the first parallel worker
has already finished its task.

Additionally, I ran the UPDATE workload with REPLICA IDENTITY FULL, and this
allowed performance to improve up to the w=16 case. This also shows that each
parallel worker spent more time per transaction, so the leader kept assigning
workers from the pool.

Machine details
----------------
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
- CPU(s): 88 cores
- 503 GiB RAM

Source code:
----------------
pgHead (19b966243c) and v4 patch set

Setup:
---------
Pub --> Sub
- Two nodes were created in a pub-sub logical replication setup.
- Both instances had a table "foo (id int PRIMARY KEY, value double precision)"
  and it was included in the publication.

Workload:
----------------
Two workloads were run:

Case 1) INSERT 1 million tuples
1. Disabled the subscription on the Sub node
2. Ran 1000 transactions. Each transaction inserted 1000 tuples, i.e., there
   were 1 million tuples on the publisher.
3. Enabled the subscription on Sub and measured the time taken for replication.

Case 2) UPDATE 1 million tuples
1. Inserted one million tuples on the Pub node
2. Waited until the tuples were replicated
3. Disabled the subscription on the Sub node
4. Ran 1000 transactions. Each transaction updated 1000 tuples.
   Note that each transaction modified different tuples.
5. Enabled the subscription on Sub and measured the time taken for replication.

Furthermore, I ran one additional case that performed the 1 M update without
a PK.

Result:
---------------------
I measured while varying the parallelism of the apply,
max_parallel_apply_workers_per_subscription.

Case 1) 1 M insert

Each cell is the median of 5 runs. For reference, inserting 1 million tuples
takes *8.28 seconds* on the publisher side.
(w means max_parallel_apply_workers_per_subscription)

Used source     elapsed time [s]
--------------------------------
HEAD            6.750675
patched, w=0    7.215072
patched, w=1    5.674886
patched, w=2    5.566869
patched, w=4    5.491499
patched, w=8    5.541768
patched, w=16   5.556885

We can see a regression if the number of workers is set to zero, because the
leader worker checks dependencies even in that case. We may be able to discuss
optimizing this part; one idea is to skip the checks if parallelism is
disabled.

The w=1 case has better performance, because the leader can concentrate on
receiving the changes while the parallel worker applies them in parallel. This
looks like what streaming replication does.

In the case of w=2 and larger, the performance did not change. After the
benchmark, I found that only one parallel apply worker had been launched. The
reason was that the launched parallel worker can finish applying a transaction
before the leader worker receives further changes. When the leader worker
tries to assign a transaction, it finds that the parallel worker has already
finished its task, and thus the leader re-uses it.
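
The re-use behaviour described above can be illustrated with a toy model. This is only a sketch and not code from the patch: the function name, the fixed per-transaction receive and apply times, and the example numbers are all hypothetical, and when the pool is exhausted the sketch simply makes the leader wait, which is a simplification rather than what the patch does.

```python
import heapq


def workers_launched(num_txns, receive_time, apply_time, max_workers):
    """Toy model: count how many pool workers a leader ever launches.

    Hypothetical assumptions: every transaction takes `receive_time` for the
    leader to receive and `apply_time` for a worker to apply.
    """
    free_at = []  # min-heap: time at which each launched worker becomes idle
    now = 0.0     # leader's clock
    for _ in range(num_txns):
        now += receive_time  # leader finishes receiving one transaction
        if free_at and free_at[0] <= now:
            # A launched worker is already idle: re-use it.
            heapq.heapreplace(free_at, now + apply_time)
        elif len(free_at) < max_workers:
            # Every launched worker is busy, but the pool may still grow.
            heapq.heappush(free_at, now + apply_time)
        else:
            # Pool exhausted: wait for the earliest worker (simplification).
            now = free_at[0]
            heapq.heapreplace(free_at, now + apply_time)
    return len(free_at)


# Applying is faster than receiving: one worker is always idle in time.
print(workers_launched(1000, receive_time=1.0, apply_time=0.5, max_workers=16))  # 1
# Applying takes 3.5x the receive time: the pool settles at four workers.
print(workers_launched(1000, receive_time=1.0, apply_time=3.5, max_workers=16))  # 4
```

In this model the number of launched workers depends only on the ratio of apply time to receive time, so raising max_workers beyond that ratio changes nothing.
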
This scenario means that parallelism can work effectively if transactions
have dependencies or if applying a transaction takes longer than the leader
needs to receive the next one. Also, I think it is OK that the performance
cannot be improved linearly, because such a workload can be applied very
quickly. In this experiment, applying on the subscriber takes about the same
time as (or less than) the work on the publisher.

Case 2) 1 M update

Used source     elapsed time [s]
--------------------------------
HEAD            17.180169
patched, w=0    18.284964
patched, w=1    13.390546
patched, w=2    11.978078
patched, w=4    8.906887
patched, w=8    9.004753
patched, w=16   8.974946

As in the INSERT case, w=0 performs worse than HEAD, and w=1 performs better
than HEAD. In the case of updates, performance could be improved up to the w=4
case. Per my analysis, at most four parallel apply workers (p.a.) were
launched in this workload. Before the leader received the 5th transaction, the
first p.a. had finished its task and could start applying the next one.

Additionally, I ran the same workload as case 2), without a PK on both nodes.
REPLICA IDENTITY was set to FULL on the publisher node to replicate UPDATE
commands. Since HEAD and w=0 need more than 2 hours, I did not run these
cases.

Used source     elapsed time [s]
--------------------------------
patched, w=1    7571.225952
patched, w=2    2688.792047
patched, w=4    1681.862011
patched, w=8    995.177401
patched, w=16   718.488441

Unlike the cases above, performance improved with every increase of
max_parallel_apply_workers_per_subscription. This means that the leader fully
used the worker pool in all cases. I checked the perf report at that time and
found that the leader spent most of its time in RelationFindReplTupleSeq; this
means the leader could not assign transactions to parallel workers and applied
them by itself.

The scripts I used are attached; you can run them to reproduce the same
workloads.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
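
For readers comparing the three result tables, the speedup ratios can be derived directly from the reported medians. The following is a minimal sketch and is not one of the attached scripts; the dictionary and function names are hypothetical, and the numbers are copied from the tables above. The REPLICA IDENTITY FULL case uses w=1 as its baseline because HEAD and w=0 were not run.

```python
# Median elapsed times in seconds, copied from the result tables above.
insert_times = {"HEAD": 6.750675, "w=0": 7.215072, "w=1": 5.674886,
                "w=2": 5.566869, "w=4": 5.491499, "w=8": 5.541768,
                "w=16": 5.556885}
update_times = {"HEAD": 17.180169, "w=0": 18.284964, "w=1": 13.390546,
                "w=2": 11.978078, "w=4": 8.906887, "w=8": 9.004753,
                "w=16": 8.974946}
full_times = {"w=1": 7571.225952, "w=2": 2688.792047, "w=4": 1681.862011,
              "w=8": 995.177401, "w=16": 718.488441}


def speedups(times, baseline):
    """Return baseline_time / time for every entry (above 1.0 means faster)."""
    base = times[baseline]
    return {name: base / elapsed for name, elapsed in times.items()}


for title, times, baseline in [("1 M insert", insert_times, "HEAD"),
                               ("1 M update", update_times, "HEAD"),
                               ("1 M update, REPLICA IDENTITY FULL",
                                full_times, "w=1")]:
    print(f"{title} (relative to {baseline})")
    for name, ratio in speedups(times, baseline).items():
        print(f"  {name:5s} {ratio:6.2f}x")
```

A ratio below 1.0 for w=0 corresponds to the regression noted for that setting.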