Thread

  1. Logical replication prefetch

    Konstantin Knizhnik <knizhnik@garret.ru> — 2025-07-08T06:36:22Z

    There is well known Postgres problem that logical replication subscriber 
    can not caught-up with publisher just because LR changes are applied by 
    single worker and at publisher changes are made by
    multiple concurrent backends. The problem is not logical replication 
    specific: physical replication stream is also handled by single 
    walreceiver. But for physical replication Postgres now implements 
    prefetch: looking at WAL record blocks it is quite easy to predict which 
    pages will be required for redo and prefetch them. With logical 
    replication situation is much more complicated.
    
    My first idea was to implement parallel apply of transactions. But to do 
    it we need to track dependencies between transactions. Right now 
    Postgres can apply transactions in parallel, but only if they are 
    streamed  (which is done only for large transactions) and serialize them 
    by commits. It is possible to enforce parallel apply of short 
    transactions using `debug_logical_replication_streaming` but then 
    performance is ~2x times slower than in case of sequential apply by 
    single worker. By removing serialization by commits, it is possible to 
    speedup apply 3x times and make subscriber apply changes faster then 
    producer can produce them even with multiple clients. But it is possible 
    only if transactions are independent and it can be enforced only by 
    tracking dependencies which seems to be very non-trivial and invasive.
    
    I still do not completely give up with tracking dependencies approach, 
    but decided first to try more simple solution - prefetching. It is 
    already used for physical replication. Certainly in case of physical 
    replication it is much simpler, because each WAL record contains list of 
    accessed blocks.
    
    In case of logical replication prefetching can be done either by 
    prefetching access to replica identity index (usually primary key), 
    either by executing replication command by some background worker
    Certainly first case is much more easy. We just perform index lookup in 
    prefetch worker and it loads accessed index and heap pages in shared 
    buffer, so main apply worker does not need to read something from disk.
    But it works well only for DELETE and HOT UPDATE operations.
    
    In the second case we normally execute the LR command in background 
    worker and then abort transaction. Certainly in this case we are doing 
    the same work twice. But assumption is the same: parallel prefetch 
    workers should load affected pages, speeding up work of the main apply 
    worker.
    
    I have implemented some PoC (see attached patch). And get first results 
    of efficiency of such prefetching.
    
    *** First scenario (update-only).
    
    Publisher:
    ```
    create table t(pk integer primary key, counter integer, filler text 
    default repeat('x', 1000)) with (fillfactor=10);
    insert into t values (generate_series(1,100000), 0);
    create publication pub1 for table t;
    ```
    
    Subscriber:
    ```
    create table t(pk integer primary key, counter integer, filler text 
    default repeat('x', 1000)) with (fillfactor=10);
    create subscription sub1 connection 'port=54321 dbname=postgres' 
    publication pub1;
    ```
    
    Then I wait until replication is synced, stop subscriber and do random 
    dot updates in 10 sessions at publisher:
    
    ```
    pgbench -T 100 -c 10 -M prepared -n -f update.sql -p 54321 -d postgres
    ```
    
    where update.sql is:
    
    ```
    \set pk random(1, 100000)
    update t set counter=counter+1 where pk=:pk;
    ```
    
    Then I start subscriber and measure how much time is needed for it to 
    caught up.
    Results:
    
    no prefetch: 2:00 min
    prefetch (replica identity only): 0:55 min
    prefetch (all): 1:10 min
    
    This is definitely the best case for replica-identity index only 
    prefetch (update-only and no other indexes).
    How to interpret this results?
    
    Without prefetch applying updates takes about two times  more at 
    subscriber than performing this updates at publisher.
    It means that under huge workload subscriber has no chances to caught up.
    
    With prefetching replica identity index, apply time is even smaller than 
    time needed to perform updates at publisher.
    Performing the whole operation and transaction abort certainly adds more 
    overhead. But still improvement is quite significant.
    
    Please also notice that this results were obtains at the system with 
    larger amount of RAM (64Gb) and fast SSD.
    With data set not fitting in RAM and much slower disks, the difference 
    is expected to be more significant.
    I have tried to simulate it be adding 0.1msec delay to pg_preadv.
    When I add artificial 0.1msec `preadv` delay, I got the following results:
    
    no prefetch: 7:40
    prefetch (replica identity only): 3:10 min
    prefetch (all): 3:09
    
    
    In this case apply takes much more time than 100 seconds during which 
    updates are performed at publisher. Prefetch can improve speed about two 
    times,
    but it doesn't allow subcriber to caught-up.
    
    
    
    *** Second scenario: inserts with secondary random key.
    
    
    Publisher:
    
    ```
    create table t(pk serial primary key, sk integer, counter integer default 0)
    insert into t (sk) select random()*10000000 from generate_series(1,10000000)
    create index on t(sk)
    create publication pub1 for table t
    ```
    
    Subscriber:
    ```
    
    create table t(pk integer primary key, sk integer, counter integer)
    create index on t(sk)
    create subscription sub1 connection 'port=54321 dbname=postgres' 
    publication pub1
    ```
    
    workload:
    
    ```
    pgbench -T 100 -c 10 -M prepared -n -f insert.sql -p 54321 -d postgres
    
    ```
    
    where insert.sql:
    
    ```
    INSERT INTO t (sk) VALUES (random()*10000000);
    ```
    
    Results (with 0.1msec delay)  are the followingL
    
    no prefetch: 10:10 min
    prefetch (identity): 8:25 min
    prefetch (full): 5:50min
    
    Here as expected prefetching only primary key doesn't provide some big 
    improvement. But replaying insert command in prefetch worker allows to 
    speedup apply almost twice.
    
    Please notice that this approach requires minimal changes in Postgres, 
    because all infrastructure of parallel apply workers is already present 
    and we can reuse the same apply code (with minimal changes) for 
    performing prefetch. I only have to introduce extra tuple lock types 
    (no-lock and try-lock) to minimize overhead and lock conflicts between 
    prefetch and main apply workers. Still it can not completely prevent 
    locks conflicts and deadlocks in prefetch workers. Looks like more work 
    is needed here. Also I set `wal_level=minimal` in prefetch workers to 
    avoid  WAL-logging overhead.
    
    Number of prefetch workers is specified by 
    `max_parallel_prefetch_workers_per_subscription` GUC. If it is zero 
    (default) then no prefetching is performed.
    Prefetch mode is controlled by `prefetch_replica_identity_only` GUC . By 
    default it is true which makes prefetch efficient for hot updates, 
    deletes or inserts in table with just one index (primary key).
    
    
    Attached please find patch and two shell scripts used to produce this 
    test results.
    Also it may be more convenient to inspect this patch as PR: 
    https://github.com/knizhnik/postgres/pull/3
    
    I wonder if such LR prefetching approach is considered to be useful?
    Or it is better to investigate other ways to improve LR apply speed 
    (parallel apply)?