Thread

RE: Parallel Apply

Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> — 2025-11-11T11:09:33Z
Dear Hackers,

> I measured the performance data for the shared hash table approach. Based on
> the result,
> local hash table approach seems better.

I analyzed the tests in a bit more detail. Let me share from the beginning...

Background and current implementation
=====================================
Even when the apply worker is parallelized, a transaction that depends on
other transactions must wait until those transactions are committed.

In the first version of the PoC, the leader apply worker has a local hash table
keyed by {txid, replica identity}. When the leader sends a replication message
to one of the parallel apply workers, it checks for an existing entry:
(a) If there is no match: add the entry and proceed; (b) If there is a match:
instruct the worker to wait until the transaction it depends on completes.
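
The check in (a)/(b) can be sketched roughly as follows. This is only an
illustration in Python, not the patch code: the class and method names are
hypothetical, and it assumes the lookup is done by replica identity with the
txid stored alongside, and that a match overwrites the entry with the newer txid.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Optional


@dataclass
class DependencyTracker:
    """Hypothetical leader-local table: replica identity -> last txid that changed it."""
    last_writer: Dict[Hashable, int] = field(default_factory=dict)

    def check_and_record(self, txid: int, replica_identity: Hashable) -> Optional[int]:
        """Return the txid the worker must wait for, or None if there is no dependency."""
        previous = self.last_writer.get(replica_identity)
        # Assumption: the entry always points at the most recent writer.
        self.last_writer[replica_identity] = txid
        if previous is None or previous == txid:
            return None      # (a) no match (or our own change): proceed
        return previous      # (b) match: wait until this transaction completes


tracker = DependencyTracker()
first = tracker.check_and_record(100, ("tbl", 1))   # no entry yet
second = tracker.check_and_record(101, ("tbl", 1))  # depends on txid 100
```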

One possible downside of this approach is the need to clean up the
dependency-tracking hash table. The first PoC does so when: a) the leader worker
sends feedback to the walsender, or b) the number of entries exceeds the limit
(1024). The leader worker cannot receive replication messages destined for other
workers while it is cleaning up entries, so this might be a bottleneck.
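
The two cleanup triggers could look roughly like the sketch below. The function
names are hypothetical, and which entries get dropped (here: those belonging to
already-committed transactions) is an assumption made for illustration; only the
two triggers and the 1024 limit come from the PoC description.

```python
MAX_ENTRIES = 1024  # limit quoted above


def cleanup(table: dict, committed_txids: set) -> int:
    """Drop entries whose transaction has committed; return the number removed.

    Assumption: `table` maps a replica identity to the txid that last changed it.
    """
    stale = [key for key, txid in table.items() if txid in committed_txids]
    for key in stale:
        del table[key]
    return len(stale)


def maybe_cleanup(table: dict, committed_txids: set, sending_feedback: bool) -> int:
    """Run the cleanup only on trigger a) feedback or b) too many entries."""
    if sending_feedback or len(table) > MAX_ENTRIES:
        return cleanup(table, committed_txids)
    return 0
```

While `cleanup` is scanning the table the caller does nothing else, which is the
potential bottleneck described above.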

Proposal
========
Based on the above, one idea to improve performance was to make the dependency
hash table a shared one. The leader worker and the parallel apply workers
assigned by the leader would attach to the same shared hash table.
The leader worker would use the hash table in the same way as before when it
sends replication messages. The one difference was that when a parallel apply
worker commits a transaction, it removes the entries it used from the shared
hash table. This would shrink the table continuously, so the leader would not
have to clean it up.

The downside of this approach was the additional overhead of accessing the
shared hash table.
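
As a rough illustration of the shared variant, the sketch below uses a
lock-protected Python dict as a stand-in for a shared-memory hash table. All
names are hypothetical, and scanning the table on commit is a simplification;
the point is only that every access now goes through a lock, and that the
worker, not the leader, removes entries.

```python
import threading


class SharedDependencyTable:
    """Hypothetical stand-in for a hash table shared by leader and workers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}  # replica identity -> txid (assumed layout)

    def leader_check_and_record(self, txid, replica_identity):
        """Leader side: same check as with the local table, but under the lock."""
        with self._lock:
            previous = self._entries.get(replica_identity)
            self._entries[replica_identity] = txid
        return None if previous in (None, txid) else previous

    def worker_on_commit(self, txid):
        """Worker side: remove the entries this transaction still owns."""
        with self._lock:
            mine = [k for k, v in self._entries.items() if v == txid]
            for k in mine:
                del self._entries[k]
        return len(mine)

    def __len__(self):
        with self._lock:
            return len(self._entries)
```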


Results and considerations
==========================
As I shared on -hackers, there is no performance improvement from making the
hash table shared. I found that the reason is that the cleanup task is not very
expensive.

I profiled the leader worker during the benchmark and found that the cleanup
function `cleanup_replica_identity_table` consumes only 0.84% of CPU time.
(I tried to attach the results, but the file was too large.)

The attached histogram (simple_cleanup) shows the time spent in the cleanup for
each patch. The average elapsed time was 1.2 microseconds with the 0001 patch.
The time needed per transaction is around 74 microseconds (derived from TPS), so
the cleanup is unlikely to affect overall performance.
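
As a back-of-the-envelope check of these two numbers (a sketch only; it
pessimistically assumes one cleanup call per transaction, which is not stated
above):

```python
cleanup_us = 1.2    # average time per cleanup call, 0001 patch
per_txn_us = 74.0   # time per transaction, derived from TPS

# TPS implied by 74 microseconds per transaction.
implied_tps = 1_000_000 / per_txn_us

# Share of the per-transaction time spent in cleanup, if cleanup ran once
# per transaction (assumption: an upper bound, since it is triggered less often).
cleanup_share = cleanup_us / per_txn_us

print(f"implied TPS: {implied_tps:.0f}")
print(f"cleanup share: {cleanup_share:.1%}")
```

Even under that assumption the cleanup stays below 2% of the per-transaction
time, the same order of magnitude as the profile result.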

Another experiment - contains 2000 changes per transaction
===========================================================
The first experiment used the built-in simple-update workload, and it was
possible that the trend would be different if each transaction contained more
changes, because each cleanup might take more time.
Based on that, the second workload had 1000 deletions and 1000 insertions per
transaction.

The table below shows the results (with #worker = 4). Both patch sets have
almost the same TPS, the same trend as in the simple-update workload case.
A histogram for this case is also attached.

        0001            0001+0002       diff
TPS     10297.58551     10146.71342     1%
        10046.75987     9865.730785     2%
        9970.800272     9977.835592     0%
        9927.863416     9909.675726     0%
        10033.03796     9886.181373     1%
AVE     10055.209405    9957.227380     1%
MEDIAN  10033.037957    9909.675726     1%
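
The AVE, MEDIAN and diff rows can be reproduced from the five runs with a few
lines of Python. This is a sketch for checking the table; the variable names
are made up, and diff is assumed to be the relative drop from 0001 to
0001+0002.

```python
from statistics import mean, median

# TPS of the five runs, copied from the table above.
tps_0001 = [10297.58551, 10046.75987, 9970.800272, 9927.863416, 10033.03796]
tps_0002 = [10146.71342, 9865.730785, 9977.835592, 9909.675726, 9886.181373]


def relative_drop(base: float, other: float) -> float:
    """Fraction by which `other` is lower than `base` (assumed meaning of diff)."""
    return (base - other) / base


avg_drop = relative_drop(mean(tps_0001), mean(tps_0002))
med_drop = relative_drop(median(tps_0001), median(tps_0002))

print(f"AVE    {mean(tps_0001):.6f} {mean(tps_0002):.6f} {avg_drop:.0%}")
print(f"MEDIAN {median(tps_0001):.6f} {median(tps_0002):.6f} {med_drop:.0%}")
```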

Overall, I think the local hash approach is sufficient for now, unless we find
a better approach or problematic corner cases.

Best regards,
Hayato Kuroda
FUJITSU LIMITED