Thread

RE: Parallel Apply

Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> — 2025-11-18T08:16:18Z
Dear hackers,

> I think it is better to enable preserve order by default - for safety reasons.

Following the discussion on -hackers, I implemented a patch that preserves the
publisher's commit ordering. Let me explain from the beginning.

Background
==========
The current patch (v1) does not preserve the commit ordering of the publisher node.
After the leader worker sends a COMMIT message to a parallel apply worker, the
leader does not wait for the transaction to be applied and continues reading
messages from the publisher node. As a result, a parallel apply worker that was
assigned later may commit earlier, which breaks the commit ordering of the pub node.
 
Proposal
========
We decided to preserve the commit ordering by default so that data does not
diverge between nodes [1]. The basic idea is that the leader apply worker caches
the remote_xid when it sends a commit record to a parallel apply worker (p.a.).
Before the leader sends the next COMMIT message to a p.a., it first sends an
INTERNAL_DEPENDENCY message carrying the cached xid. The p.a. reads the DEPENDENCY
message and waits until that transaction finishes. The cached xid is updated
after the leader sends COMMIT.
This approach requires little additional code because the DEPENDENCY message was
already introduced by v1, but it increases the number of messages exchanged for
each transaction.
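
To illustrate the message flow described above, here is a rough Python sketch. It
is not the patch code: the function name, the one-thread-per-transaction model,
and the sleep that stands in for applying changes are all assumptions made for
illustration. Only the ordering of the INTERNAL_DEPENDENCY and COMMIT messages
and the cached xid follow the description.

```python
import queue
import threading
import time

INTERNAL_DEPENDENCY = "INTERNAL_DEPENDENCY"
COMMIT = "COMMIT"


def replay(transactions, preserve_order=True):
    """Toy model: transactions is a list of (remote_xid, apply_seconds)
    in publisher commit order. Returns the order in which they commit."""
    finished = {xid: threading.Event() for xid, _ in transactions}
    commit_log = []
    log_lock = threading.Lock()

    def parallel_apply_worker(inbox, apply_seconds):
        time.sleep(apply_seconds)  # stand-in for applying the changes
        while True:
            kind, xid = inbox.get()
            if kind == INTERNAL_DEPENDENCY:
                # Wait until the transaction we depend on has finished.
                finished[xid].wait()
            elif kind == COMMIT:
                with log_lock:
                    commit_log.append(xid)
                finished[xid].set()
                return

    workers = []
    cached_xid = None  # leader's cache of the last xid it sent COMMIT for
    for xid, apply_seconds in transactions:
        inbox = queue.Queue()
        worker = threading.Thread(
            target=parallel_apply_worker, args=(inbox, apply_seconds)
        )
        worker.start()
        workers.append(worker)
        if preserve_order and cached_xid is not None:
            # Sent before COMMIT, so the worker waits for the previous commit.
            inbox.put((INTERNAL_DEPENDENCY, cached_xid))
        inbox.put((COMMIT, xid))
        cached_xid = xid  # updated after the leader sends COMMIT
        # The leader does not wait here; it moves on to the next transaction.

    for worker in workers:
        worker.join()
    return commit_log


if __name__ == "__main__":
    # xid 100 takes longest to apply, yet it still commits first.
    print(replay([(100, 0.05), (101, 0.0), (102, 0.02)]))
```

With preserve_order=False the sketch behaves like v1: the commit order then
depends on how long each worker takes, so a later transaction can commit first.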


Performance testing
===================
I confirmed that even if we preserve the commit ordering, parallel apply is still
more than 2x faster than HEAD. Details follow.

Machine details
---------------
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz, 88 cores, 503 GiB RAM

Used patch
----------
v1 is the same as what Hou posted on -hackers [1], and v2 adds the
preserve-commit-order part. The attached patch is what I used here.

Workload
--------
Setup:
Pub --> Sub
 - Two nodes were created in a pub-sub synchronous logical replication setup.
 - Both nodes have the same set of pgbench tables, created with scale=100.
 - The Sub node is subscribed to all changes to the Pub's pgbench tables.

Workload Run:
 - Run the built-in pgbench script (simple-update) [2] only on Pub, with #clients=40 and a run duration of 5 minutes.
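
For anyone who wants to reproduce a similar run, the setup above roughly
corresponds to the pgbench invocations built by the sketch below. This is an
assumption about the command lines, not a copy of what was actually executed:
the database name is hypothetical, and connection options and thread count are
left out because they are not stated here.

```python
import shlex

SCALE = 100                 # scale=100, used to initialize both nodes
CLIENTS = 40                # #clients=40
DURATION_SECONDS = 5 * 60   # run duration = 5 minutes
DBNAME = "postgres"         # hypothetical database name


def init_command(dbname):
    # pgbench -i creates and fills the pgbench tables at the given scale.
    return ["pgbench", "-i", "-s", str(SCALE), dbname]


def run_command(dbname):
    # -b simple-update selects the built-in simple-update script.
    return [
        "pgbench",
        "-b", "simple-update",
        "-c", str(CLIENTS),
        "-T", str(DURATION_SECONDS),
        dbname,
    ]


if __name__ == "__main__":
    print(shlex.join(init_command(DBNAME)))  # run on both Pub and Sub
    print(shlex.join(run_command(DBNAME)))   # run on Pub only
```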

This means that the same tuples are rarely modified by concurrent transactions.
I expect that the v1 patch works mostly without waits, and that v2 (the 0002
patch) is slower because every transaction waits until the previous commit is done.

Results:
The number of workers was fixed at 4. v2 was 2.1 times faster than HEAD, and
v1 was 2.6 times faster than HEAD. I think this is a very good improvement.
I can continue with other benchmarks using different workloads and parameters.

          HEAD      v1         v2
TPS       6134.7    16194.8    12944.4
          6030.5    16303.9    13043.0
          6181.9    16251.5    12815.7
          6108.1    16173.3    12771.8
          6035.6    16180.3    13054.5
AVE       6098.2    16220.8    12925.8
MEDIAN    6108.1    16194.8    12944.4
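
The speedup figures can be rechecked from the per-run TPS values with a few
lines of Python. This is only a convenience sketch; the function name is made
up for illustration, and the numbers are copied from the table above.

```python
from statistics import mean, median

# TPS of the five runs, copied from the table above.
TPS = {
    "HEAD": [6134.7, 6030.5, 6181.9, 6108.1, 6035.6],
    "v1": [16194.8, 16303.9, 16251.5, 16173.3, 16180.3],
    "v2": [12944.4, 13043.0, 12815.7, 12771.8, 13054.5],
}


def speedup_over_head(name):
    # Ratio of average TPS relative to HEAD.
    return mean(TPS[name]) / mean(TPS["HEAD"])


if __name__ == "__main__":
    for name, runs in TPS.items():
        print(name, "mean:", mean(runs), "median:", median(runs))
    print("v1 / HEAD:", speedup_over_head("v1"))
    print("v2 / HEAD:", speedup_over_head("v2"))
```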

[1]: https://www.postgresql.org/message-id/CADzfLwXnJ1H4HncFugGPdnm8t%2BaUAU4E-yfi1j3BbiP5VfXD8g%40mail.gmail.com
[2]: https://www.postgresql.org/docs/current/pgbench.html#PGBENCH-OPTION-BUILTIN

Best regards,
Hayato Kuroda
FUJITSU LIMITED