RE: Parallel Apply
Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> — 2025-11-18T08:16:18Z
Dear hackers,

> I think it is better to enable preserve order by default - for safety reasons.

Per some discussions on -hackers, I implemented a patch that preserves the publisher's commit ordering. Let me explain from the beginning.

Background
==========

The current patch, say v1, does not preserve the commit ordering of the publisher node. After the leader apply worker sends a COMMIT message to a parallel apply worker, the leader does not wait for the transaction to be applied and continues reading messages from the publisher node. As a result, a parallel apply worker assigned later may commit earlier, which breaks the commit ordering of the pub node.

Proposal
========

We decided to preserve the commit ordering by default so as not to break data between nodes [1].

The basic idea is that the leader apply worker caches the remote_xid when it sends a commit record to a parallel apply worker. Before sending a COMMIT message to a parallel apply worker (p.a.), the leader sends it an INTERNAL_DEPENDENCY message with the cached xid. The p.a. reads the DEPENDENCY message and waits until that transaction finishes. The cached xid is updated after the leader sends COMMIT.

This approach requires less code because the DEPENDENCY message was already introduced by v1, but it increases the number of transaction messages.

Performance testing
===================

I confirmed that even when we preserve the commit ordering, parallel apply still shows a 2.x times improvement over HEAD. Details are below.

Machine details
---------------

Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
CPU(s): 88 cores
RAM: 503 GiB

Used patch
----------

v1 is the same as the one Hou posted on -hackers [1], and v2 implements the preserve-commit-order part. The attached patch is what I used here.

Workload
--------

Setup: Pub --> Sub
- Two nodes created in a pub-sub synchronous logical replication setup.
- Both nodes have the same set of pgbench tables, created with scale=100.
- The Sub node is subscribed to all the changes from the Pub's pgbench tables.

Workload Run:
- Run built-in pgbench (simple-update) [2] only on Pub with #clients=40 and run duration=5 minutes.

This means that the same tuples are rarely modified by different transactions. I expect that the v1 patch works mostly without waits, and that v2 (0002) is slower because every transaction waits until the previous commit is done.

Results:
The number of workers is fixed to 4. v2 was 2.1 times faster than HEAD, and v1 was 2.6 times faster than HEAD. I think it is a very good improvement. I can continue with other benchmarks using different workloads and parameters.

         HEAD      v1        v2
TPS      6134.7    16194.8   12944.4
         6030.5    16303.9   13043.0
         6181.9    16251.5   12815.7
         6108.1    16173.3   12771.8
         6035.6    16180.3   13054.5
AVE      6098.2    16220.8   12925.8
MEDIAN   6108.1    16194.8   12944.4

[1]: https://www.postgresql.org/message-id/CADzfLwXnJ1H4HncFugGPdnm8t%2BaUAU4E-yfi1j3BbiP5VfXD8g%40mail.gmail.com
[2]: https://www.postgresql.org/docs/current/pgbench.html#PGBENCH-OPTION-BUILTIN

Best regards,
Hayato Kuroda
FUJITSU LIMITED
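
P.S. For readers who want to see the ordering idea from the Proposal section in isolation, here is a toy Python model. It is only a sketch and is not taken from the patch: the function names are hypothetical, one thread per transaction stands in for a parallel apply worker, and a threading.Event replaces the real INTERNAL_DEPENDENCY message and shared-memory wait. It assumes each remote xid appears once.

```python
import threading


def apply_in_commit_order(remote_xids):
    """Return the order in which the simulated workers commit."""
    finished = {xid: threading.Event() for xid in remote_xids}
    commit_log = []

    def parallel_apply_worker(xid, depends_on):
        # DEPENDENCY: wait until the transaction the leader named has finished.
        if depends_on is not None:
            finished[depends_on].wait()
        # COMMIT: record the commit, then wake up whoever waits on this xid.
        commit_log.append(xid)
        finished[xid].set()

    # Leader side: pair each transaction with the cached xid of the previous
    # commit, then update the cache once COMMIT has been "sent".
    last_commit_xid = None
    plan = []
    for xid in remote_xids:
        plan.append((xid, last_commit_xid))
        last_commit_xid = xid

    # Start workers in reverse so later transactions get a head start;
    # the dependency chain must still force the publisher's order.
    threads = [threading.Thread(target=parallel_apply_worker, args=p)
               for p in reversed(plan)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return commit_log


if __name__ == "__main__":
    print(apply_in_commit_order([701, 702, 703, 704, 705]))
```

Even though the later transactions start first, each commit waits on the previously cached xid, so the commit log comes out in the publisher's order. The model also shows where the cost comes from: every commit is serialized behind the previous one.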