Re: WIP/PoC for parallel backup
From: David Zhang <david.zhang@highgo.ca>
To: Suraj Kharage <suraj.kharage@enterprisedb.com>,
Amit Kapila <amit.kapila16@gmail.com>
Cc: Ahsan Hadi <ahsan.hadi@gmail.com>, Asif Rehman <asifr.rehman@gmail.com>,
Kashif Zeeshan <kashif.zeeshan@enterprisedb.com>,
Robert Haas <robertmhaas@gmail.com>,
Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com>,
Jeevan Chalke <jeevan.chalke@enterprisedb.com>,
PostgreSQL Hackers <pgsql-hackers@postgresql.org>
Date: 2020-04-30T06:26:16Z
Lists: pgsql-hackers
Commits
- Fix failures in incremental_sort due to number of workers (23ba3b5ee278, 13.0, cited)
- In jsonb_plpython.c, suppress warning message from gcc 10. (a06921816370, 13.0, cited)
- Fix minor problems with non-exclusive backup cleanup. (303640199d04, 13.0, cited)
Attachments
- perf_report.tar.gz (application/x-gzip)
Hi,

Thanks a lot for sharing the test results. Here are our test results
using perf on three AWS t2.xlarge instances with the configuration below.

Machine configuration:
      Instance Type        :t2.xlarge
      Volume type          :io1
      Memory (MiB)         :16GB
      vCPU #               :4
      Architecture         :x86_64
      IOP                  :6000
      Database Size (GB)   :45 (Server)

case 1: postgres server: without patch and without load

* Disk I/O:

# Samples: 342K of event 'block:block_rq_insert'
# Event count (approx.): 342834
#
# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  .....................
#
    97.65%  postgres         [kernel.kallsyms]  [k] __elv_add_request
     2.27%  kworker/u30:0    [kernel.kallsyms]  [k] __elv_add_request

* CPU:

# Samples: 6M of event 'cpu-clock'
# Event count (approx.): 1559444750000
#
# Overhead  Command          Shared Object       Symbol
# ........  ...............  ..................  ..............................
#
    64.73%  swapper          [kernel.kallsyms]   [k] native_safe_halt
    10.89%  postgres         [vdso]              [.] __vdso_gettimeofday
     5.64%  postgres         [kernel.kallsyms]   [k] do_syscall_64
     5.43%  postgres         libpthread-2.26.so  [.] __libc_recv
     1.72%  postgres         [kernel.kallsyms]   [k] pvclock_clocksource_read

* Network:

# Samples: 2M of event 'skb:consume_skb'
# Event count (approx.): 2739785
#
# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  ...........................
#
    91.58%  swapper          [kernel.kallsyms]  [k] consume_skb
     7.09%  postgres         [kernel.kallsyms]  [k] consume_skb
     0.61%  kswapd0          [kernel.kallsyms]  [k] consume_skb
     0.44%  ksoftirqd/3      [kernel.kallsyms]  [k] consume_skb

case 1: pg_basebackup client: without patch and without load

* Disk I/O:

# Samples: 371K of event 'block:block_rq_insert'
# Event count (approx.): 371362
#
# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  .....................
#
    96.78%  kworker/u30:0    [kernel.kallsyms]  [k] __elv_add_request
     2.82%  pg_basebackup    [kernel.kallsyms]  [k] __elv_add_request
     0.29%  kworker/u30:1    [kernel.kallsyms]  [k] __elv_add_request
     0.09%  xfsaild/xvda1    [kernel.kallsyms]  [k] __elv_add_request

* CPU:

# Samples: 3M of event 'cpu-clock'
# Event count (approx.): 903527000000
#
# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  ..............................
#
    87.99%  swapper          [kernel.kallsyms]  [k] native_safe_halt
     3.14%  swapper          [kernel.kallsyms]  [k] __lock_text_start
     0.48%  swapper          [kernel.kallsyms]  [k] __softirqentry_text_start
     0.37%  pg_basebackup    [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
     0.35%  swapper          [kernel.kallsyms]  [k] do_csum

* Network:

# Samples: 12M of event 'skb:consume_skb'
# Event count (approx.): 12260713
#
# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  ...........................
#
    95.12%  swapper          [kernel.kallsyms]  [k] consume_skb
     3.23%  pg_basebackup    [kernel.kallsyms]  [k] consume_skb
     0.83%  ksoftirqd/1      [kernel.kallsyms]  [k] consume_skb
     0.45%  kswapd0          [kernel.kallsyms]  [k] consume_skb

case 2: postgres server: with patch and with load, 4 backup workers on
client side

* Disk I/O:

# Samples: 3M of event 'block:block_rq_insert'
# Event count (approx.): 3634542
#
# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  .....................
#
    98.88%  postgres         [kernel.kallsyms]  [k] __elv_add_request
     0.66%  perf             [kernel.kallsyms]  [k] __elv_add_request
     0.42%  kworker/u30:1    [kernel.kallsyms]  [k] __elv_add_request
     0.01%  sshd             [kernel.kallsyms]  [k] __elv_add_request

* CPU:

# Samples: 9M of event 'cpu-clock'
# Event count (approx.): 2299129250000
#
# Overhead  Command          Shared Object       Symbol
# ........  ...............  ..................  ..............................
#
    52.73%  swapper          [kernel.kallsyms]   [k] native_safe_halt
     8.31%  postgres         [vdso]              [.] __vdso_gettimeofday
     4.46%  postgres         [kernel.kallsyms]   [k] do_syscall_64
     4.16%  postgres         libpthread-2.26.so  [.] __libc_recv
     1.58%  postgres         [kernel.kallsyms]   [k] __lock_text_start
     1.52%  postgres         [kernel.kallsyms]   [k] pvclock_clocksource_read
     0.81%  postgres         [kernel.kallsyms]   [k] copy_user_enhanced_fast_string

* Network:

# Samples: 6M of event 'skb:consume_skb'
# Event count (approx.): 6048795
#
# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  ...........................
#
    85.81%  postgres         [kernel.kallsyms]  [k] consume_skb
    12.03%  swapper          [kernel.kallsyms]  [k] consume_skb
     0.97%  postgres         [kernel.kallsyms]  [k] __consume_stateless_skb
     0.85%  ksoftirqd/3      [kernel.kallsyms]  [k] consume_skb
     0.24%  perf             [kernel.kallsyms]  [k] consume_skb

case 2: pg_basebackup 4 workers: with patch and with load

* Disk I/O:

# Samples: 372K of event 'block:block_rq_insert'
# Event count (approx.): 372360
#
# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  .....................
#
    97.26%  kworker/u30:0    [kernel.kallsyms]  [k] __elv_add_request
     1.45%  pg_basebackup    [kernel.kallsyms]  [k] __elv_add_request
     0.95%  kworker/u30:1    [kernel.kallsyms]  [k] __elv_add_request
     0.14%  xfsaild/xvda1    [kernel.kallsyms]  [k] __elv_add_request

* CPU:

# Samples: 4M of event 'cpu-clock'
# Event count (approx.): 1234071000000
#
# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  ..............................
#
    89.25%  swapper          [kernel.kallsyms]  [k] native_safe_halt
     0.93%  pg_basebackup    [kernel.kallsyms]  [k] __lock_text_start
     0.91%  swapper          [kernel.kallsyms]  [k] __lock_text_start
     0.69%  pg_basebackup    [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
     0.45%  swapper          [kernel.kallsyms]  [k] do_csum

* Network:

# Samples: 6M of event 'skb:consume_skb'
# Event count (approx.): 6449013
#
# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  ...........................
#
    90.28%  pg_basebackup    [kernel.kallsyms]  [k] consume_skb
     9.09%  swapper          [kernel.kallsyms]  [k] consume_skb
     0.29%  ksoftirqd/1      [kernel.kallsyms]  [k] consume_skb
     0.21%  sshd             [kernel.kallsyms]  [k] consume_skb

The detailed perf report is attached and covers the different scenarios:
without patch (with and without load, for server and client) and with
patch (with and without load, for 1, 2, 4 and 8 workers, for both server
and client). The file names should be self-explanatory.

Let me know if more information is required.

Best regards,

David

On 2020-04-29 5:41 a.m., Suraj Kharage wrote:
> Hi,
>
> We at EnterpriseDB did some performance testing around this parallel
> backup to check how beneficial it is, and below are the results. In
> this testing, we ran the backup:
> 1) Without Asif’s patch
> 2) With Asif’s patch and a combination of 1, 2, 4 and 8 workers.
>
> We ran those tests on two setups:
>
> 1) Client and server both on the same machine (local backups)
>
> 2) Client and server on different machines (remote backups)
>
>
> *Machine details:*
>
> 1: Server (on which local backups were performed and which was used as
> the server for remote backups)
>
> 2: Client (used as a client for remote backups)
>
>
> *Server:*
>
> RAM: 500 GB
> CPU details:
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Byte Order: Little Endian
> CPU(s): 128
> On-line CPU(s) list: 0-127
> Thread(s) per core: 2
> Core(s) per socket: 8
> Socket(s): 8
> NUMA node(s): 8
> Filesystem: ext4
>
>
> *Client:*
> RAM: 490 GB
> CPU details:
> Architecture: ppc64le
> Byte Order: Little Endian
> CPU(s): 192
> On-line CPU(s) list: 0-191
> Thread(s) per core: 8
> Core(s) per socket: 1
> Socket(s): 24
> Filesystem: ext4
>
> Below are the results for the local test. The last column is the
> % performance increased/decreased compared to normal backup (without
> patch).
>
> 10 GB (10 tables - each table around 1.05 GB)
>                            real        user       sys         vs. normal backup
>   without parallel patch   0m27.016s   0m3.378s   0m23.059s
>   parallel, 1 worker       0m30.314s   0m3.575s   0m22.946s   12% performance decreased
>   parallel, 2 workers      0m20.400s   0m3.622s   0m29.670s   27% performance increased
>   parallel, 4 workers      0m15.331s   0m3.706s   0m39.189s   43% performance increased
>   parallel, 8 workers      0m15.094s   0m3.915s   1m23.350s   44% performance increased
>
> 50 GB (50 tables - each table around 1.05 GB)
>                            real        user       sys         vs. normal backup
>   without parallel patch   2m11.049s   0m16.464s  2m1.757s
>   parallel, 1 worker       2m26.621s   0m18.497s  2m4.792s    21% performance decreased
>   parallel, 2 workers      1m9.581s    0m18.298s  2m12.030s   46% performance increased
>   parallel, 4 workers      0m53.894s   0m18.588s  2m47.390s   58% performance increased
>   parallel, 8 workers      0m55.373s   0m18.423s  5m57.470s   57% performance increased
>
> 100 GB (100 tables - each table around 1.05 GB)
>                            real        user       sys         vs. normal backup
>   without parallel patch   4m4.776s    0m33.699s  3m27.777s
>   parallel, 1 worker       4m20.862s   0m35.753s  3m28.262s   6% performance decreased
>   parallel, 2 workers      2m37.411s   0m36.440s  4m16.424s   35% performance increased
>   parallel, 4 workers      1m49.503s   0m37.200s  5m58.077s   55% performance increased
>   parallel, 8 workers      1m36.762s   0m36.987s  9m36.906s   60% performance increased
>
> 200 GB (200 tables - each table around 1.05 GB)
>                            real        user       sys         vs. normal backup
>   without parallel patch   10m34.998s  1m8.471s   7m21.520s
>   parallel, 1 worker       11m30.899s  1m12.933s  8m14.496s   8% performance decreased
>   parallel, 2 workers      6m8.481s    1m13.771s  9m31.216s   41% performance increased
>   parallel, 4 workers      4m2.403s    1m18.331s  12m29.661s  61% performance increased
>   parallel, 8 workers      4m3.768s    1m24.547s  15m21.421s  61% performance increased
>
>
> Results for the remote test:
>
> 10 GB (10 tables - each table around 1.05 GB)
>                            real        user       sys         vs. normal backup
>   without parallel patch   1m36.829s   0m2.124s   0m14.004s
>   parallel, 1 worker       1m37.598s   0m3.272s   0m11.110s   0.8% performance decreased
>   parallel, 2 workers      1m36.753s   0m2.627s   0m15.312s   0.08% performance increased
>   parallel, 4 workers      1m37.212s   0m3.835s   0m13.221s   0.3% performance decreased
>   parallel, 8 workers      1m36.977s   0m4.475s   0m17.937s   0.1% performance decreased
>
> 50 GB (50 tables - each table around 1.05 GB)
>                            real        user       sys         vs. normal backup
>   without parallel patch   7m54.211s   0m10.826s  1m10.435s
>   parallel, 1 worker       7m55.603s   0m16.535s  1m8.147s    0.2% performance decreased
>   parallel, 2 workers      7m53.499s   0m18.131s  1m8.822s    0.1% performance increased
>   parallel, 4 workers      7m54.687s   0m15.818s  1m30.991s   0.1% performance decreased
>   parallel, 8 workers      7m54.658s   0m20.783s  1m34.460s   0.1% performance decreased
>
> 100 GB (100 tables - each table around 1.05 GB)
>                            real        user       sys         vs. normal backup
>   without parallel patch   15m45.776s  0m21.802s  2m59.006s
>   parallel, 1 worker       15m46.315s  0m32.499s  2m47.245s   0.05% performance decreased
>   parallel, 2 workers      15m46.065s  0m28.877s  2m21.181s   0.03% performance decreased
>   parallel, 4 workers      15m47.793s  0m30.932s  2m36.708s   0.2% performance decreased
>   parallel, 8 workers      15m47.129s  0m35.151s  3m23.572s   0.14% performance decreased
>
> 200 GB (200 tables - each table around 1.05 GB)
>                            real        user       sys         vs. normal backup
>   without parallel patch   32m55.720s  0m50.602s  5m38.875s
>   parallel, 1 worker       31m30.602s  0m45.377s  4m57.405s   4% performance increased
>   parallel, 2 workers      31m30.214s  0m55.023s  5m8.689s    4% performance increased
>   parallel, 4 workers      31m31.187s  1m13.390s  5m40.861s   4% performance increased
>   parallel, 8 workers      31m31.729s  1m4.955s   6m35.774s   4% performance decreased
>
>
> With client and server on the same machine, the results show around
> 50% improvement in parallel runs with 4 and 8 workers. We don’t see a
> huge performance improvement as more workers are added.
>
> Whereas, when the client and server are on different machines, we
> don’t see any major benefit in performance. This result matches the
> testing results posted by David Zhang up thread.
>
> We ran the test for a 100 GB backup with 4 parallel workers to see the
> CPU usage and other information. What we noticed is that the server is
> consuming almost 100% CPU the whole time, and pg_stat_activity shows
> that the server is busy with ClientWrite most of the time.
>
> Attaching captured output for:
>
> 1) Top command output on the server, every 5 seconds
>
> 2) pg_stat_activity output, every 5 seconds
>
> 3) Top command output on the client, every 5 seconds
>
> Do let me know if anyone has further questions/inputs for the
> benchmarking.
>
> Thanks to Rushabh Lathia for helping me with this testing.
>
> On Tue, Apr 28, 2020 at 8:46 AM Amit Kapila <amit.kapila16@gmail.com
> <mailto:amit.kapila16@gmail.com>> wrote:
>
>     On Mon, Apr 27, 2020 at 10:23 PM David Zhang
>     <david.zhang@highgo.ca <mailto:david.zhang@highgo.ca>> wrote:
>     >
>     > Hi,
>     >
>     > Here are the parallel backup performance test results with and
>     > without the patch "parallel_backup_v15" on the AWS cloud
>     > environment. Two "t2.xlarge" machines were used: one for the
>     > Postgres server and the other one for pg_basebackup, with the
>     > same machine configuration shown below.
>     >
>     > Machine configuration:
>     >      Instance Type        :t2.xlarge
>     >      Volume type          :io1
>     >      Memory (MiB)         :16GB
>     >      vCPU #               :4
>     >      Architecture         :x86_64
>     >      IOP                  :6000
>     >      Database Size (GB)   :108
>     >
>     > Performance test results:
>     > without patch:
>     >      real 18m49.346s
>     >      user 1m24.178s
>     >      sys 7m2.966s
>     >
>     > 1 worker with patch:
>     >      real 18m43.201s
>     >      user 1m55.787s
>     >      sys 7m24.724s
>     >
>     > 2 worker with patch:
>     >      real 18m47.373s
>     >      user 2m22.970s
>     >      sys 11m23.891s
>     >
>     > 4 worker with patch:
>     >      real 18m46.878s
>     >      user 2m26.791s
>     >      sys 13m14.716s
>     >
>     > As required, I didn't have the pgbench running in parallel like
>     > we did in the previous benchmark.
>     >
>
>     So, there doesn't seem to be any significant improvement in this
>     scenario. Now, it is not clear why there was a significant
>     improvement in the previous run where pgbench was also running
>     simultaneously. I am not sure but maybe it is because when a lot of
>     other backends were running (performing read-only workload) the
>     backend that was responsible for doing backup was getting frequently
>     scheduled out and it slowed down the overall backup process. And when
>     we start using multiple backends for backup one or other backup
>     process is always running making the overall backup faster. One idea
>     to find this out is to check how much time backup takes when we run it
>     with and without pgbench workload on HEAD (aka unpatched code). Even
>     if what I am saying is true or there is some other reason due to which
>     we are seeing speedup in some cases (where there is a concurrent
>     workload), it might not make the case for using multiple backends for
>     backup but still, it is good to find that information as it might help
>     in designing this feature better.
>
>     > The perf report files for both Postgres server and pg_basebackup
>     > sides are attached.
>     >
>
>     It is not clear which functions are taking more time or for which
>     functions time is reduced as function symbols are not present in the
>     reports. I think you can refer to
>     "https://wiki.postgresql.org/wiki/Profiling_with_perf" to see how to
>     take profiles and additionally use -fno-omit-frame-pointer during
>     configure (you can use CFLAGS="-fno-omit-frame-pointer" during
>     configure).
>
>
>     --
>     With Regards,
>     Amit Kapila.
>     EnterpriseDB: http://www.enterprisedb.com
>
>
>
> --
> --
>
> Thanks & Regards,
> Suraj kharage,
> EnterpriseDB Corporation,
> The Postgres Database Company.

--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca
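
The percentage column in the quoted local and remote results appears to compare each run's `real` time against the unpatched baseline. A minimal Python sketch of that arithmetic follows; it assumes the percentage is (baseline - run) / baseline, which the thread does not state explicitly, and the helper names `parse_time` and `percent_faster` are hypothetical, chosen only for this illustration.

```python
import re


def parse_time(text):
    """Convert a `time`-style duration such as '2m11.049s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def percent_faster(baseline, candidate):
    """Positive result: candidate beat the baseline; negative: it was slower.

    Assumed formula (not confirmed by the thread):
    (baseline - candidate) / baseline * 100.
    """
    base = parse_time(baseline)
    return (base - parse_time(candidate)) / base * 100.0


# 'real' times taken from the quoted 10 GB local test.
baseline = "0m27.016s"
for label, real in [("1 worker", "0m30.314s"), ("4 workers", "0m15.331s")]:
    print(f"{label}: {percent_faster(baseline, real):+.1f}%")
```

Rounded to whole percent, these two rows agree with the quoted "12% performance decreased" and "43% performance increased" entries. Only `real` (wall-clock) time is compared here; the `sys` column grows with the worker count and would need to be looked at separately.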