Thread

  1. MPTCP - multiplexing many TCP connections through one socket to get better bandwidth

    Jakub Wartak <jakub.wartak@enterprisedb.com> — 2025-09-04T10:56:06Z

    Hi -hackers,
    
    With the attached patch PostgreSQL could possibly gain built-in MPTCP
    support which would allow multiplexing (aggregating) multiple
    kernel-based TCP streams into one MPTCP socket. This allows bypassing
    any "chokepoints" on the network transparently for libpq, especially
    if having *multiple* TCP streams could achieve higher bandwidth than a
    single one. One can think of transparent aggregation of bandwidth over
    multiple WAN links/tunnels and so on. In short it works like this:
        libpq_client <--MPTCP--> client_kernel <==multiple TCP connections==> server_kernel <--MPTCP--> postgres_server
    
    Without much rework of PostgreSQL, this means accelerating any
    libpq-based use case. Most obvious beneficiaries could be any
    libpq-based heavy network transfers, especially in enterprise
    networks. Those come to my mind:
    - pg_basebackup (over e.g. WAN or multiple interfaces; but also one
    can think of using 2x 10GigE over LAN)
    - streaming replication or logical replication [years ago I was
    able to use MPTCP with colleagues in production to bypass the
    single-TCP-stream limitation of streaming replication]
    - COPY (both upload and download)
    - postgres_fdw/dblinks?
    
    MPTCP is an IETF standard and has been included in Linux kernels for
    some time (realistically 5.16+?), and it's *enabled* by default in most
    modern distributions. One could use it with mptcpize (an LD_PRELOAD
    wrapper that hijacks socket()), but that's not elegant and would require
    altering systemd startup scripts (the same story as with NUMA: literally
    nobody hacks those just to include numactl --interleave there, or to
    adjust ulimits).
    
    The patch right now just assumes IPPROTO_MPTCP is there, so it is not
    portable, but not that many OSes support it at all -- I think #ifdef
    would be good enough for now. I don't have access to macOS to develop
    this further there, nor do I think it would add benefit there, but I
    may be wrong. So as such the proposed patch is trivial and Linux-only,
    although there is RFC8684[1][2]. I suspect it is way easier and
    simpler to support it than to try to solve the same problem for
    each of the listed use-cases.
    
    Simulation, basic-use and tests:
    
    1. Strictly for demo purposes here, we need to ARTIFICIALLY limit
    outbound bandwidth for each new flow (TCP connection) to 10 Mbit/s
    using `tc` on the server where PostgreSQL is going to be running later
    on (this simulates some chokepoints, multiple WAN paths):
        DEV=enp0s31f6
        tc qdisc add dev $DEV root handle 1: htb
        tc class add dev $DEV parent 1: classid 1:1 htb rate 100mbit
        for i in `seq 1 9`; do
            tc class add dev $DEV parent 1:1 classid 1:$i htb rate 10mbit ceil 10mbit
        done
        # see tc-flow(8) for details; classify each flow (including ports) into a separate class (1:X)
        tc filter add dev $DEV parent 1: protocol ip prio 1 handle 1 flow hash keys src,dst,proto,proto-src,proto-dst divisor 8 baseclass 1
    
    2. From client, verify that single TCP bandwidth is really limited:
        iperf3 -P 1 -R -c <server> # a single TCP stream should be limited (~10 Mbit/s) instead of getting full bandwidth
        iperf3 -P 8 -R -c <server> # 8 parallel streams should get more bandwidth than the above
    
    3. Check that MPTCP is enabled and configured on both sides:
        uname -r # at least 5.10+ according to [4] to get this balancing working, but 6.1+ LTS highly recommended (I've used 6.14.x)
        sysctl net.mptcp.enabled # should be 1 on both sides by default
        ip mptcp limits set subflows 8 add_addr_accepted 8  # but feel free to set up max limits
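
    The checks in step 3 can be wrapped in a small script. This is only a
    sketch: kernel_at_least and mptcp_enabled are hypothetical helper names
    (not part of the patch or of any tool), the 5.10 threshold is simply the
    one quoted from [4] above, and /proc/sys/net/mptcp/enabled is the procfs
    file behind the net.mptcp.enabled sysctl.

```shell
#!/bin/sh
# kernel_at_least VERSION MAJOR MINOR
# Succeeds if VERSION (e.g. the output of `uname -r`) is >= MAJOR.MINOR.
# Hypothetical helper, for illustration only.
kernel_at_least() {
    ver=$1
    major=${ver%%.*}            # "6.14.0-29-generic" -> "6"
    rest=${ver#*.}              # -> "14.0-29-generic"
    minor=${rest%%[!0-9]*}      # -> "14"
    [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}

# mptcp_enabled: succeeds if the sysctl file exists and contains 1.
mptcp_enabled() {
    [ -r /proc/sys/net/mptcp/enabled ] &&
        [ "$(cat /proc/sys/net/mptcp/enabled)" = "1" ]
}

if kernel_at_least "$(uname -r)" 5 10; then
    echo "kernel version: OK ($(uname -r))"
else
    echo "kernel version: too old for this setup ($(uname -r))"
fi

if mptcp_enabled; then
    echo "net.mptcp.enabled: 1"
else
    echo "net.mptcp.enabled: not 1 (or MPTCP not available in this kernel)"
fi
```

    It needs to be run on both the client and the server, since both sides
    must have MPTCP enabled.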
    
    4. Configure MPTCP endpoints on the server (registers some dedicated
    listening ports for MPTCP use so that there's no need to use multiple
    IP aliases or PBR):
        ps uaxw | grep -i mptcpd # check whether the mptcp daemon (path manager) is running; it is NOT required in this case
        ip addr ls # let's assume 10.0.1.240 is my main IP on the eno1 device; no need to add new IPs thanks to the trick below
        ip mptcp endpoint show # to verify
        #ip mptcp endpoint flush # if necessary
        # below registers ports 5202..5205 as LISTENing by the kernel and dedicated to MPTCP subflows
        ip mptcp endpoint add 10.0.1.240 dev eno1 port 5202 signal
        ip mptcp endpoint add 10.0.1.240 dev eno1 port 5203 signal
        ip mptcp endpoint add 10.0.1.240 dev eno1 port 5204 signal
        ip mptcp endpoint add 10.0.1.240 dev eno1 port 5205 signal
        ip mptcp endpoint show # to verify
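
    The four endpoint add commands above differ only in the port, so they
    can be generated with a loop. The sketch below only prints the commands
    (a dry run); mptcp_endpoint_cmds is a hypothetical helper name, and the
    address, device and port range are the ones from this example setup.

```shell
#!/bin/sh
# mptcp_endpoint_cmds ADDR DEV FIRST_PORT LAST_PORT
# Prints one "ip mptcp endpoint add ... signal" command per port.
# Hypothetical helper, for illustration only; it changes nothing by itself.
mptcp_endpoint_cmds() {
    addr=$1
    dev=$2
    port=$3
    last=$4
    while [ "$port" -le "$last" ]; do
        echo "ip mptcp endpoint add $addr dev $dev port $port signal"
        port=$((port + 1))
    done
}

# Same values as in step 4 above.
mptcp_endpoint_cmds 10.0.1.240 eno1 5202 5205
```

    Piping the output to sh as root would apply it; widening the port range
    is then a one-argument change, provided the subflow limits allow it.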
    
    5. Configure the client:
        ip addr ls # here I got 10.0.1.250
        ip mptcp endpoint show
        ip mptcp endpoint add 10.0.1.250 dev enp0s31f6 subflow fullmesh # not sure fullmesh is necessary, probably not
        ip mptcp limits set add_addr_accepted 8 subflows 8
    
    6. Verify that MPTCP works, rerun tests with mptcpize, e.g.:
        on server: mptcpize run iperf3 -s
        on client: mptcpize run -d iperf3 -P 1 -R -c <server> # should get better bandwidth while using just 1 MPTCP connection
        on server: run PostgreSQL with listen_mptcp='on'
        on server: ss -Mtlnp sport 5432 # mptcp should be displayed
        on client: run basebackup/psql/..
    
    Sample results for an 82MB table copy, it's ~3x faster:
        $ time PGMPTCP=0 /usr/pgsql19/bin/psql -h 10.0.1.240  -c '\copy pgbench_accounts TO '/dev/null';'
        COPY 500000
        real    0m42.123s
    
        $ time PGMPTCP=1 /usr/pgsql19/bin/psql -h 10.0.1.240  -c '\copy pgbench_accounts TO '/dev/null';'
        enabling MPTCP client
        COPY 500000
        real    0m14.416s
    
    Sample results for pg_basebackup of a DB created with pgbench -i -s 5
    (~1076MB total due to WALs):
        $ time /usr/pgsql19/bin/pg_basebackup -h 10.0.1.240 -c fast  -D /tmp/test -v
        pg_basebackup: initiating base backup, waiting for checkpoint to complete
        pg_basebackup: checkpoint completed
        [..]
        pg_basebackup: base backup completed
        real    1m26.786s
    
    With PGMPTCP=1 set, it gets ~3x faster:
        $ time PGMPTCP=1 /usr/pgsql19/bin/pg_basebackup -h 10.0.1.240 -c fast  -D /tmp/test -v
        enabling MPTCP client
        pg_basebackup: initiating base backup, waiting for checkpoint to complete
        [..]
        pg_basebackup: starting background WAL receiver
        enabling MPTCP client
        [..]
        pg_basebackup: base backup completed
        real    0m30.460s
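
    For reference, the "~3x" figures follow directly from the wall-clock
    times reported above. A small sketch of the arithmetic (speedup is a
    hypothetical helper name; the inputs are the `real` values from the runs
    above, converted to seconds):

```shell
#!/bin/sh
# speedup BASELINE_SECONDS MPTCP_SECONDS
# Prints the ratio baseline/mptcp with two decimals.
# Hypothetical helper, for illustration only.
speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", base / new }'
}

# \copy of pgbench_accounts: 0m42.123s vs 0m14.416s
speedup 42.123 14.416
# pg_basebackup: 1m26.786s (= 86.786 s) vs 0m30.460s
speedup 86.786 30.460
```

    This prints 2.92 and 2.85, i.e. a bit under 3x in both cases.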
    
    Because in the above case we have advertised 4 IP address/port pairs of
    the server to the client, we got the bump on a single socket (note: how
    flows end up being hashed into the various HTB classes is random and
    depends on the ports used, so you usually get 2x .. 4x here). Also, as
    there are two independent application-level connections here in
    basebackup (transfer + WALs), both get multiplexed (each with 4
    subflows). If I added more ip mptcp ports (server-side), we could of
    course get even more juice out of it, but that assumes one has that
    many paths. Some more advanced setups, including separate
    policy-based-routed (ip rule) paths, are possible, and so is stuff like
    keeping the TCP connection highly available even across ISP/interface
    (WiFi?) outages. It works transparently with SSL/TLS too - tested. Of
    course it won't remove the single-CPU limitation of the tools involved
    (that's a completely different problem).
    
    If it sounds interesting, I was thinking about adding to the patch
    something like contrib/mptcpinfo (a pg_stat_mptcp view to mimic
    pg_stat_ssl). Also, as for the patch, there were some other places
    where a socket is created with socket() (the libpq cancel packet), but
    I think there's no point in adding MPTCP there.
    
    It is important to mention that there are two implementations of MPTCP
    on Linux, so anyone googling will find lots of conflicting information:
    1) The earlier one required kernel patching (up to <= 5.6) and had the
    "ndiffports" multiplexer built in, which worked mostly out of the box.
    2) The newer one [3], already merged into the kernel today, is a little
    bit different and does not come with a built-in ndiffports path manager.
    In this newer one, as shown above, some more manual steps (ip mptcp
    endpoint) may be required, but the mptcpd daemon, which manages
    (sub)flows, seems to be evolving as usage of this protocol rises. So I
    hope that in the future all of those ip mptcp commands will become
    optional.
    
    Thoughts?
    
    -Jakub Wartak.
    
    [1] - https://en.wikipedia.org/wiki/Multipath_TCP
    [2] - https://www.rfc-editor.org/rfc/rfc8684.html
    [3] - https://www.mptcp.dev/
    [4] - https://github.com/multipath-tcp/mptcp_net-next/wiki/#changelog