Thread

  1. RE: Parallel Apply

    Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> — 2025-11-18T08:16:18Z

    Dear hackers,
    
    > I think it is better to enable preserve order by default - for safety reasons.
    
    Per some discussions on -hackers, I implemented a patch that preserves the
    publisher's commit ordering. Let me explain from the beginning.
    
    Background
    ==========
    The current patch, say v1, does not preserve the commit ordering of the publisher node.
    After the leader worker sends a COMMIT message to a parallel apply worker, the
    leader does not wait for the transaction to be applied and continues reading messages
    from the publisher node. As a result, a parallel apply worker assigned later may
    commit earlier, which breaks the commit ordering of the pub node.
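
    To illustrate the problem, here is a minimal sketch (not code from the
    patch). It assumes hypothetical dispatch times and apply costs for three
    transactions and shows the order in which they would commit on the
    subscriber if no worker waits for any other. The Txn class and its fields
    are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    xid: int              # remote xid, listed in publisher commit order
    dispatch_time: float  # hypothetical time the leader hands COMMIT to a worker
    apply_cost: float     # hypothetical time the worker still needs to finish


def subscriber_commit_order(txns):
    """Return xids in the order they commit when workers never wait for each other."""
    finish = [(t.dispatch_time + t.apply_cost, t.xid) for t in txns]
    return [xid for _, xid in sorted(finish)]


# Publisher commit order is 100, 101, 102.
txns = [
    Txn(xid=100, dispatch_time=0.0, apply_cost=5.0),  # large transaction
    Txn(xid=101, dispatch_time=1.0, apply_cost=1.0),  # small, assigned later
    Txn(xid=102, dispatch_time=2.0, apply_cost=1.0),
]
order = subscriber_commit_order(txns)
print(order)
```

    In this toy run the large transaction 100 commits last on the subscriber
    even though it committed first on the publisher.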
     
    Proposal
    ========
    We decided to preserve the commit ordering by default so as not to break data
    consistency between nodes [1]. The basic idea is that the leader apply worker
    caches the remote_xid when it sends a commit record to a parallel apply worker
    (p.a.). Before the leader sends the next COMMIT message to a p.a., it sends an
    INTERNAL_DEPENDENCY message with the cached xid to that p.a. The p.a. reads the
    DEPENDENCY message and waits until that transaction finishes. The cached xid is
    updated after the leader sends COMMIT.
    This approach requires less code because the DEPENDENCY message has already been
    introduced by v1, but the number of transaction messages increases.
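
    The idea can be sketched as follows. This is only a toy model under my own
    assumptions, not the patch itself: leader_messages() and commit_times() are
    hypothetical names, messages are plain tuples, and ready_at holds made-up
    times at which each worker has applied all changes of its transaction.

```python
def leader_messages(commit_xids):
    """List the (target_xid, message) pairs the leader sends, given remote
    xids in publisher commit order."""
    last_committed_xid = None  # cached xid of the previously dispatched COMMIT
    out = []
    for xid in commit_xids:
        if last_committed_xid is not None:
            # Tell the worker to wait for the previous transaction first.
            out.append((xid, ("INTERNAL_DEPENDENCY", last_committed_xid)))
        out.append((xid, ("COMMIT", xid)))
        last_committed_xid = xid  # updated after COMMIT has been sent
    return out


def commit_times(messages, ready_at):
    """A worker commits once it is ready and every xid it depends on has committed."""
    committed, deps = {}, {}
    for target, (kind, arg) in messages:
        if kind == "INTERNAL_DEPENDENCY":
            deps.setdefault(target, []).append(arg)
        elif kind == "COMMIT":
            wait_until = max((committed[d] for d in deps.get(target, [])), default=0.0)
            committed[target] = max(ready_at[target], wait_until)
    return committed


msgs = leader_messages([100, 101, 102])
times = commit_times(msgs, ready_at={100: 5.0, 101: 2.0, 102: 3.0})
print(len(msgs), times)
```

    With n transactions the leader sends 2n - 1 messages instead of n in this
    model, which is the extra message cost mentioned above, and no transaction
    commits before its predecessor.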
    
    
    Performance testing
    ===================
    I confirmed that even if we preserve the commit ordering, parallel apply is still
    more than 2x faster than HEAD. Details are below.
    
    Machine details
    ---------------
    Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz, CPU(s): 88 cores, 503 GiB RAM
    
    Used patch
    ----------
    v1 is the same as what Hou posted on -hackers [1], and v2 implements the
    preserve-commit-order part. The attached patch is what I used here.
    
    Workload
    --------
    Setup:
    Pub --> Sub
     - Two nodes were created in a pub-sub synchronous logical replication setup.
     - Both nodes have the same set of pgbench tables, created with scale=100.
     - The Sub node is subscribed to all changes to the Pub's pgbench tables.
    
    Workload Run:
     - Run the built-in pgbench script (simple-update) [2] only on Pub, with #clients=40 and run duration=5 minutes.
    
    This means that the same tuples are rarely modified by different transactions.
    I expect that the v1 patch works mostly without waits, and that 0002 is
    slower because every transaction waits until the previous commit is done.
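
    For reference, the pgbench invocations for the workload above could be
    built roughly like this. It is a sketch under assumptions: the database
    name and port are placeholders, the helper names are hypothetical, and any
    options not stated in this mail (such as the thread count) are left at
    their defaults. The flags themselves (-i, -s, -b, -c, -T, -p) are standard
    pgbench options.

```python
def pgbench_init_cmd(dbname, port, scale=100):
    """argv that initializes the pgbench tables on one node."""
    return ["pgbench", "-i", "-s", str(scale), "-p", str(port), dbname]


def pgbench_run_cmd(dbname, port, clients=40, duration_s=300):
    """argv that runs the built-in simple-update script on the publisher."""
    return [
        "pgbench",
        "-b", "simple-update",  # built-in script [2]
        "-c", str(clients),     # number of clients
        "-T", str(duration_s),  # run duration in seconds (5 minutes)
        "-p", str(port),
        dbname,
    ]


# Placeholder connection settings; adjust to the actual nodes.
init_cmd = pgbench_init_cmd("postgres", 5432)
run_cmd = pgbench_run_cmd("postgres", 5432)
print(" ".join(init_cmd))
print(" ".join(run_cmd))
```

    The resulting lists can be passed to subprocess.run() once the publication
    and subscription have been created.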
    
    Results:
    The number of workers was fixed at 4. v2 was 2.1 times faster than HEAD, and
    v1 was 2.6 times faster than HEAD. I think this is a very good improvement.
    I can continue with other benchmarks using different workloads and parameters.
    
              HEAD      v1         v2
    TPS       6134.7    16194.8    12944.4
              6030.5    16303.9    13043.0
              6181.9    16251.5    12815.7
              6108.1    16173.3    12771.8
              6035.6    16180.3    13054.5
    AVE       6098.2    16220.8    12925.8
    MEDIAN    6108.1    16194.8    12944.4
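
    The speedup figures can be re-derived from the per-run TPS values in the
    table. The snippet below is just a convenience check using Python's
    statistics module; the speedup() helper is a name chosen for this example.

```python
from statistics import mean, median

# Per-run TPS values copied from the table above.
tps = {
    "HEAD": [6134.7, 6030.5, 6181.9, 6108.1, 6035.6],
    "v1": [16194.8, 16303.9, 16251.5, 16173.3, 16180.3],
    "v2": [12944.4, 13043.0, 12815.7, 12771.8, 13054.5],
}


def speedup(name, stat=mean):
    """Ratio of the chosen statistic for `name` to the same statistic for HEAD."""
    return stat(tps[name]) / stat(tps["HEAD"])


for name in ("v1", "v2"):
    print(f"{name}: {speedup(name):.2f}x by mean, {speedup(name, median):.2f}x by median")
```

    With these figures the mean-based ratios come out at about 2.66 for v1 and
    about 2.12 for v2.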
    
    [1]: https://www.postgresql.org/message-id/CADzfLwXnJ1H4HncFugGPdnm8t%2BaUAU4E-yfi1j3BbiP5VfXD8g%40mail.gmail.com
    [2]: https://www.postgresql.org/docs/current/pgbench.html#PGBENCH-OPTION-BUILTIN
    
    Best regards,
    Hayato Kuroda
    FUJITSU LIMITED