RE: Parallel Apply

Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> — 2025-12-16T11:35:34Z

Dear hackers,

I have been spending time benchmarking the patch set. Here is an updated
report. First, I want to reply to a few points raised by Tomas.

> 5) It's not clear to me how did you measure the TPS in your benchmark.
> Did you measure how long it takes for the standby to catch up, or what
> did you do?

Since that approach was not straightforward, we changed the metric: replication
latency is now measured. See the "Workload" section for more details.

> 2) If I understand correctly, the patch maintains a "replica_identity"
> hash table, with replica identity keys for all changes for all
> concurrent transactions. How expensive can this be, in terms of CPU and
> memory? What if I have multiple large batch transactions, each updating
> millions of rows?

I profiled the large-transaction cases and confirmed that cleanup is not CPU
costly. E.g., the attached .dat file shows the profile of the leader worker
for the 1 M update workload with 16 parallel workers. The leader worker
spends most of its time reading data from the stream, while the cleanup function
accounts for only around 5%. Also, I temporarily removed the dependency tracking
part and reran the tests, and the performance did not change. Based on that, the
CPU cost of dependency tracking appears negligible.
I have not attached profiles for the other cases; tell me if they are needed.

We are still analyzing the memory consumption and will share the results later.

> 6) Did you investigate why the speedup is just ~2.1 with 4 workers, i.e.
> about half of the "ideal" speedup? Is it bottlenecked on WAL, leader
> having to determine dependencies, or something else?

Even in the 1 M insert/update workloads with a replica identity, performance
did not keep improving as parallelism increased. My theory was that the parallel
workers were fast enough that four workers could finish applying all transactions.
Thus, I ran a further experiment that removed the replica identity key and used
REPLICA IDENTITY FULL for applying UPDATEs. This increased the apply time, and
performance improved up to w=16. See the "Result" part.

Details of the benchmarks follow.

Abstract
----------
I ran benchmarks with two workloads: 1) 1 million tuples are inserted in total,
and 2) 1 million tuples are updated in total. Overall, we can say that parallel
apply can improve performance, especially when transactions are long and
take time to apply.

Regarding the INSERT workload, the patch applies changes about 10% faster than
HEAD, but results remain constant regardless of parallelism. IIUC, because
applying transactions was relatively fast, only a few parallel workers were
launched. Another point is that performance worsens when the number of workers
is set to 0. We may be able to skip the additional dependency checks in this case.
Regarding the UPDATE workload, performance improved up to
max_parallel_apply_workers_per_subscription=4, but was flat for the {8, 16} cases.
This is because four workers are enough to apply all changes. When the leader tries
to assign a new transaction, the first parallel worker has already finished its task.

Additionally, I ran the UPDATE workload with REPLICA IDENTITY FULL, and there
performance improved up to the w=16 case. This shows that when each parallel
worker spends more time per transaction, the leader assigns more workers from
the pool.

Machine details
----------------
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz, 88 cores, 503 GiB RAM

Source code:
----------------
pgHead (19b966243c) and v4 patch set

Setup:
---------
Pub --> Sub
- Two nodes were created in a pub-sub logical replication setup.
- Both instances had a table "foo (id int PRIMARY KEY, value double precision)",
  and it was included in the publication.

Workload:
----------------
Two workloads were run:

Case 1) INSERT 1 million tuples

1. Disabled the subscription on the Sub node
2. Ran 1000 transactions. Each transaction inserted 1000 tuples.
   I.e., there were 1 million tuples on the publisher.
3. Enabled the subscription on Sub and measured the time taken for replication.

Case 2) UPDATE 1 million tuples

1. Inserted one million tuples on the Pub node
2. Waited until tuples were replicated
3. Disabled the subscription on the Sub node
4. Ran 1000 transactions. Each transaction updated 1000 tuples.
   Note that each transaction modified different tuples.
5. Enabled the subscription on Sub and measured the time taken for replication.

Furthermore, I ran one additional case that performed a 1 M update without PK.

Result:
---------------------
I varied the apply parallelism, max_parallel_apply_workers_per_subscription, across runs.

Case 1) 1 M insert
Each cell is the median of five runs. For reference, inserting 1 million tuples
took 8.28 seconds on the publisher side.
(w means max_parallel_apply_workers_per_subscription)

Used source       elapsed time [s]
----------------------------------
HEAD              6.750675
patched, w=0      7.215072
patched, w=1      5.674886
patched, w=2      5.566869
patched, w=4      5.491499
patched, w=8      5.541768
patched, w=16     5.556885

We can see a regression when the number of workers is set to zero, because the
leader worker checks dependencies even in that case. We could discuss optimizing
this part; one idea is to skip the checks if parallelism is disabled.

The w=1 case has better performance because the leader can concentrate on
receiving the changes while the parallel worker applies them. This looks like
what streaming replication does.

For w=2 and larger, the performance did not change. After the benchmark I found
that only one parallel apply worker had been launched. The reason is that the
launched parallel worker can finish applying a transaction before the leader
worker receives further changes. When the leader worker tries to assign the next
transaction, it finds that the parallel worker has already finished its task, so
the leader re-uses it.
This means that parallelism works effectively when transactions have
dependencies or when applying a transaction takes longer than the leader needs
to receive the next one.
Also, I think it is OK that the performance does not improve linearly, because
such a workload can be applied very quickly. In this experiment, applying on the
subscriber takes about the same time as (or less than) on the publisher.

Case 2) 1 M update

Used source       elapsed time [s]
----------------------------------
HEAD              17.180169
patched, w=0      18.284964
patched, w=1      13.390546
patched, w=2      11.978078
patched, w=4      8.906887
patched, w=8      9.004753
patched, w=16     8.974946

As in the INSERT case, w=0 performs worse than HEAD, and w=1 performs better.
For updates, performance improved up to the w=4 case.
Per my analysis, at most four parallel apply workers were launched in this
workload. Before the leader received the 5th transaction, the first parallel
apply worker had finished its task and could start applying the next one.
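
To illustrate the reuse behaviour described above, here is a toy model (my own
sketch, not code from the patch). It assumes the leader receives one transaction
every receive_interval seconds, each apply takes a fixed apply_time, and the
leader always prefers an idle worker over launching a new one. The function name
and the numbers are hypothetical and are not measured values.

```python
import heapq

def workers_needed(num_txns, receive_interval, apply_time):
    """Count workers launched when the leader always reuses an idle worker.

    Toy model only: one transaction arrives every `receive_interval` seconds
    and keeps the assigned worker busy for `apply_time` seconds.
    """
    busy_until = []  # min-heap: times at which launched workers become idle
    for i in range(num_txns):
        now = i * receive_interval
        if busy_until and busy_until[0] <= now:
            # The earliest-finishing worker is already idle, so reuse it.
            heapq.heapreplace(busy_until, now + apply_time)
        else:
            # Every launched worker is still busy, so launch another one.
            heapq.heappush(busy_until, now + apply_time)
    return len(busy_until)

# Illustrative inputs only (not taken from the benchmark).
print(workers_needed(1000, receive_interval=2, apply_time=1))  # -> 1
print(workers_needed(1000, receive_interval=2, apply_time=7))  # -> 4
```

In this model the number of launched workers settles at roughly
apply_time / receive_interval rounded up, so raising the worker limit beyond
that point cannot change the result.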

Additionally, I ran the same workload as case 2), but without the PK on both nodes.
REPLICA IDENTITY was set to FULL on the publisher node so that UPDATE commands
could be replicated. Since HEAD and w=0 need more than 2 hrs, I did not run
those cases.

Used source       elapsed time [s]
----------------------------------
patched, w=1      7571.225952
patched, w=2      2688.792047
patched, w=4      1681.862011
patched, w=8      995.177401
patched, w=16     718.488441

Unlike the cases above, performance improved at every
max_parallel_apply_workers_per_subscription setting. This means that the leader
fully used the worker pool in all cases. I checked the perf report for these
runs and found that the leader spent most of its time in
RelationFindReplTupleSeq. This means the leader could not assign transactions
to parallel workers and applied them by itself.
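
For readers who want the scaling at a glance, the sketch below derives speedups
from the REPLICA IDENTITY FULL table above. It uses w=1 as the baseline, since
HEAD and w=0 were not run for this case. The helper name and the "efficiency"
column (speedup divided by w) are my own illustration, not part of the patch or
the attached scripts.

```python
# Elapsed times [s] copied from the REPLICA IDENTITY FULL table, keyed by w.
elapsed = {
    1: 7571.225952,
    2: 2688.792047,
    4: 1681.862011,
    8: 995.177401,
    16: 718.488441,
}

def speedup_vs_baseline(times, baseline_w=1):
    """Return {w: baseline_time / time} for each measured w."""
    base = times[baseline_w]
    return {w: base / t for w, t in times.items()}

for w, s in speedup_vs_baseline(elapsed).items():
    # Efficiency here is simply speedup per configured worker.
    print(f"w={w:<2} speedup={s:5.2f}x  efficiency={s / w:4.2f}")
```

Relative to w=1, the w=16 run is a little over 10 times faster, so the gain
continues up to w=16 but is less than proportional to the number of workers.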

The scripts I used are attached; you can run them to reproduce the same workload.

Best regards,
Hayato Kuroda
FUJITSU LIMITED