Thread

Re: Logical replication prefetch

Konstantin Knizhnik <knizhnik@garret.ru> — 2025-07-13T12:29:35Z
On 13/07/2025 9:28 am, Amit Kapila wrote:
> I didn't understand your scenario. pa_launch_parallel_worker() should
> spawn a new worker only if all the workers in the pool are busy, and
> then it will free the worker if the pool already has enough workers.
> So, do you mean to say that the workers in the pool are always busy in
> your workload which lead spawn/exit of new workers? Can you please
> explain your scenario in some more detail?
>
The current LR apply logic does not work well for small OLTP
transactions.
First of all, by default the reorder buffer at the publisher buffers
them, which prevents parallel apply at the subscriber.
The publisher switches to streaming mode only if a transaction is too
large (it exceeds `logical_decoding_work_mem`) or if
`debug_logical_replication_streaming=immediate` is set.
But even if we force the publisher to stream short transactions, the
subscriber will try to launch a new parallel apply worker for each
transaction (if all existing workers are busy).
If there are 100 active backends at the publisher, then the subscriber
will try to launch 100 parallel apply workers.
Most likely this fails because of the limit on the maximum number of
workers. In this case the leader will serialize such transactions.
So if there are 100 streamed transactions and 10 parallel apply workers,
then 10 transactions are started in parallel and 90 are serialized to
disk.
This is not efficient for short transactions. It would be better to
wait for some time until one of the workers becomes free.
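
To make the arithmetic above concrete, here is a toy model of the
leader's decision (not PostgreSQL code). The function name `dispatch`
and its parameters are made up for illustration, and it assumes all
transactions arrive before any worker finishes:

```python
def dispatch(num_transactions: int, max_workers: int) -> tuple[int, int]:
    """Toy model: return (applied_in_parallel, serialized_to_disk).

    Assumption: every streamed transaction arrives while all previously
    started workers are still busy, so no worker is ever reused.
    """
    busy = 0
    parallel = 0
    serialized = 0
    for _ in range(num_transactions):
        if busy < max_workers:
            # A worker slot is available: launch a parallel apply worker.
            busy += 1
            parallel += 1
        else:
            # Worker limit reached: the leader spills the transaction to disk.
            serialized += 1
    return parallel, serialized


print(dispatch(100, 10))  # (10, 90)
```

Under the "wait for a free worker" alternative, the `else` branch would
block instead of serializing, so nothing would be written to disk.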

But the worst thing happens when a parallel apply worker completes its
transaction. If the number of parallel apply workers in the pool exceeds
`max_parallel_apply_workers_per_subscription / 2`,
then this parallel apply worker is terminated. So instead of having
`max_parallel_apply_workers_per_subscription` workers applying
transactions at the maximum possible speed, with a leader
that distributes transactions between them and stops receiving new data
from the publisher when there is no free worker, we have a leader
serializing transactions to disk
(and inevitably reading them back from disk later) while constantly
starting and terminating parallel apply worker processes. This leads to
awful performance.
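
The churn can be illustrated with a small simulation of the pool policy
described above. This is only a sketch: `simulate` and its parameters
are hypothetical, and it assumes transactions arrive in bursts of
`concurrency`, with every worker finishing before the next burst:

```python
def simulate(rounds: int, concurrency: int, max_workers: int) -> tuple[int, int, int]:
    """Toy model: return (workers_launched, workers_terminated, serialized)."""
    pool = 0  # workers currently alive (idle between bursts)
    launched = terminated = serialized = 0
    for _ in range(rounds):
        reused = min(pool, concurrency)                      # idle workers picked up
        new = min(concurrency - reused, max_workers - pool)  # launched up to the limit
        pool += new
        launched += new
        serialized += concurrency - reused - new             # the rest go to disk
        # Each finishing worker is terminated while the pool holds more
        # than half of the configured maximum.
        for _ in range(reused + new):
            if pool > max_workers // 2:
                pool -= 1
                terminated += 1
    return launched, terminated, serialized


print(simulate(3, 100, 10))  # (20, 15, 270)
```

In this model every burst after the first launches 5 new workers and
terminates 5, while 90 of 100 transactions are still serialized.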


Certainly the originally intended use case was different: parallel apply
is performed only for large transactions. The number of such transactions
is not so big, so there should be enough parallel apply workers in the
pool to process them. And if there are not enough workers, it is not a
problem to spawn a new one and terminate it after the transaction
completes (because the transaction is long, the overhead of spawning a
process is small compared with the cost of applying a large transaction).
But if we want to efficiently replicate an OLTP workload, then we
definitely need some other approach.

Prefetch is actually more compatible with the current implementation,
because prefetch operations don't need to be grouped by transaction and
can be executed by any prefetch worker.