Thread

  1. pgcon unconference / impact of block size on performance

    Tomas Vondra <tomas.vondra@enterprisedb.com> — 2022-06-04T23:22:48Z

    Hi,
    
    At on of the pgcon unconference sessions a couple days ago, I presented
    a bunch of benchmark results comparing performance with different
    data/WAL block size. Most of the OLTP results showed significant gains
    (up to 50%) with smaller (4k) data pages.
    
    This opened a long discussion about possible explanations - I claimed
    one of the main factors is the adoption of flash storage, due to pretty
    fundamental differences between HDD and SSD systems. But the discussion
    concluded with an agreement to continue investigating this, so here's an
    attempt to support the claim with some measurements/data.
    
    Let me present results of low-level fio benchmarks on a couple different
    HDD and SSD drives. This should eliminate any postgres-related influence
    (e.g. FPW), and demonstrates inherent HDD/SSD differences.
    
    Each of the PDF pages shows results for five basic workloads:
    
     - random read
     - random write
     - random r/w
     - sequential read
     - sequential write
    
    The chars on the left show IOPS, charts on right bandwidth. The x-axis
    shows I/O depth - number of concurrent I/O requests or queue length,
    with values 1, 2, 4, 8, 64, 128. And each "group" shows results for
    different page size (1K, 2K, 4K, 8K, 16K, 32K). The colored page size is
    the default value (8K).
    
    This makes it clear how a page size affects performance (IOPS and BW)
    for a given I/O depth, and also the impact of higher I/O depth.
    
    
    I do think the difference between HDD and SSD storage is pretty clearly
    visible (even though there is some variability between the SSD devices).
    
    IMHO the crucial difference is that for HDD, the page size has almost no
    impact on IOPS (in the random workloads). If you look at the random read
    results, the page size does not matter - once you fix the I/O depth, the
    result are pretty much exactly the same. For the random write test it's
    even clearer, because the I/O depth does not matter and you get 350 IOPS
    no matter the page size or I/O depth.
    
    This makes perfect sense, because for "spinning rust" the dominant part
    is seeking to the right part of the platter. And once you've seeked to
    the right place, it does not matter much if you read 1K or 32K - the
    cost is much lower than the seek.
    
    And the economy is pretty simple - you can't really improve IOPS, but
    you can improve bandwidth by using larger pages. If you do 350 IOPS, it
    can be either 350kB/s with 1K pages or 11MB/s with 32KB pages).
    
    So we'd gain very little by using smaller pages, and larger pages
    improve bandwidth - not just for random tests, but sequential too. And
    8KB seems like a reasonable compromise - bandwidth with 32KB pages is
    better, but with higher I/O depths (8 or more) we get pretty close,
    likely due to hitting SATA limits.
    
    
    Now, compare this to the SSD. There are some differences between the
    models, manufacturers, interface etc. but the impact of page size on
    IOPS is pretty clear. On the Optane you can get +20-30% by using 4K
    pages, on the Samsung it's even more, etc. This means that workloads
    dominated by random I/O get significant benefit from smaller pages.
    
    Another consequence of this is that for sequential workloads, the
    difference between page sizes is smaller, because when smaller pages
    reach better IOPS this reduces the difference in bandwidth.
    
    
    If you imagine two extremes:
    
      1) different pages yield the same IOPS
    
      2) different pages yield the same bandwidth
    
    then old-school HDDs are pretty close to (1), while future storage
    systems (persistent memory) is likely close to (2).
    
    This matters, because various trade-offs we've made in the past are
    reasonable for (1), but will be inefficient for (2). And as the results
    I shared during the pgcon session suggest, we might do so much better
    even for current SSDs, which are somewhere between (1) and (2).
    
    
    The other important factor is the native SSD page, which is similar to
    sectors on HDD. SSDs however don't allow in-place updates, and have to
    reset/rewrite of the whole native page. It's actually more complicated,
    because the reset happens at a much larger scale (~8MB block), so it
    does matter how quickly we "dirty" the data. The consequence is that
    using data pages smaller than the native page (depends on the device,
    but seems 4K is the common value) either does not help or actually hurts
    the write performance.
    
    All the SSD results show this behavior - the Optane and Samsung nicely
    show that 4K is much better (in random write IOPS) than 8K, but 1-2K
    pages make it worse.
    
    
    I'm sure there are other important factors - for example, eliminating
    the very expensive "seek" cost (SSDs can do 10k-100k IOPS easily, while
    HDDs did ~100-400 IOPS), other steps start to play much bigger role. I
    wouldn't be surprised if memcpy() started to matter, for example.
    
    
    regards
    
    -- 
    Tomas Vondra
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  2. Re: pgcon unconference / impact of block size on performance

    Roberto Mello <roberto.mello@gmail.com> — 2022-06-05T00:21:07Z

    On Sat, Jun 4, 2022 at 5:23 PM Tomas Vondra <tomas.vondra@enterprisedb.com>
    wrote:
    
    > Hi,
    >
    > At on of the pgcon unconference sessions a couple days ago, I presented
    > a bunch of benchmark results comparing performance with different
    > data/WAL block size. Most of the OLTP results showed significant gains
    > (up to 50%) with smaller (4k) data pages.
    
    
    Thanks for sharing this Thomas.
    
    We’ve been doing similar tests with different storage classes in kubernetes
    clusters.
    
    Roberto
    
    >
    
  3. Re: pgcon unconference / impact of block size on performance

    Tomas Vondra <tomas.vondra@enterprisedb.com> — 2022-06-05T09:51:32Z

    
    On 6/5/22 02:21, Roberto Mello wrote:
    > On Sat, Jun 4, 2022 at 5:23 PM Tomas Vondra
    > <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>>
    > wrote:
    > 
    >     Hi,
    > 
    >     At on of the pgcon unconference sessions a couple days ago, I presented
    >     a bunch of benchmark results comparing performance with different
    >     data/WAL block size. Most of the OLTP results showed significant gains
    >     (up to 50%) with smaller (4k) data pages.
    > 
    > 
    > Thanks for sharing this Thomas.
    > 
    > We’ve been doing similar tests with different storage classes in
    > kubernetes clusters. 
    > 
    
    Can you share some of the results? Might be interesting, particularly if
    you use network-attached storage (like EBS, etc.).
    
    
    regards
    
    -- 
    Tomas Vondra
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  4. RE: pgcon unconference / impact of block size on performance

    Jakub Wartak <jakub.wartak@tomtom.com> — 2022-06-06T14:27:06Z

    Hi Tomas,
    
    > Hi,
    > 
    > At on of the pgcon unconference sessions a couple days ago, I presented a
    > bunch of benchmark results comparing performance with different data/WAL
    > block size. Most of the OLTP results showed significant gains (up to 50%) with
    > smaller (4k) data pages.
    
    Nice. I just saw this https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference , do you have any plans for publishing those other graphs too (e.g. WAL block size impact)?
    
    > This opened a long discussion about possible explanations - I claimed one of the
    > main factors is the adoption of flash storage, due to pretty fundamental
    > differences between HDD and SSD systems. But the discussion concluded with an
    > agreement to continue investigating this, so here's an attempt to support the
    > claim with some measurements/data.
    > 
    > Let me present results of low-level fio benchmarks on a couple different HDD
    > and SSD drives. This should eliminate any postgres-related influence (e.g. FPW),
    > and demonstrates inherent HDD/SSD differences.
    > All the SSD results show this behavior - the Optane and Samsung nicely show
    > that 4K is much better (in random write IOPS) than 8K, but 1-2K pages make it
    > worse.
    > 
    [..]
    Can you share what Linux kernel version, what filesystem , it's mount options and LVM setup were you using if any(?)
    
    I've hastily tried your script on 4VCPU/32GB RAM/1xNVMe device @ ~900GB (AWS i3.xlarge), kernel 5.x, ext4 defaults, no LVM, libaio only, fio deviations: runtime -> 1min, 64GB file, 1 iteration only. Results are attached, w/o graphs. 
    
    > Now, compare this to the SSD. There are some differences between the models, manufacturers, interface etc. but the impact of page size on IOPS is pretty clear. On the Optane you can get +20-30% by using 4K pages, on the Samsung it's even more, etc. This means that workloads dominated by random I/O get significant benefit from smaller pages.
    
    Yup, same here, reproduced, 1.42x faster on writes:
    [root@x ~]# cd libaio/nvme/randwrite/128/ # 128=queue depth
    [root@x 128]# grep -r "write:" * | awk '{print $1, $4, $5}'  | sort -n
    1k/1.txt: bw=24162KB/s, iops=24161,
    2k/1.txt: bw=47164KB/s, iops=23582,
    4k/1.txt: bw=280450KB/s, iops=70112, <<<
    8k/1.txt: bw=393082KB/s, iops=49135,
    16k/1.txt: bw=393103KB/s, iops=24568,
    32k/1.txt: bw=393283KB/s, iops=12290,
    BTW it's interesting to compare to your's Optane 900P result (same two high bars for IOPS @ 4,8kB), but in my case it's even more import to select 4kB so it behaves more like Samsung 860 in your case
    
    # 1.41x on randreads
    [root@x ~]# cd libaio/nvme/randread/128/ # 128=queue depth
    [root@x 128]# grep -r "read :" | awk '{print $1, $5, $6}' | sort -n
    1k/1.txt: bw=169938KB/s, iops=169937,
    2k/1.txt: bw=376653KB/s, iops=188326,
    4k/1.txt: bw=691529KB/s, iops=172882, <<<
    8k/1.txt: bw=976916KB/s, iops=122114,
    16k/1.txt: bw=990524KB/s, iops=61907,
    32k/1.txt: bw=974318KB/s, iops=30447,
    
    I think that the above just a demonstration of device bandwidth saturation: 32k*30k IOPS =~ 1GB/s random reads. Given that DB would be tuned @ 4kB for app(OLTP), but once upon a time Parallel Seq Scans "critical reports" could only achieve 70% of what it could achieve on 8kB, correct? (I'm assuming most real systems are really OLTP but with some reporting/data exporting needs). One way or another it would be very nice to be able to select the tradeoff using initdb(1) without the need to recompile, which then begs for some initdb --calibrate /mnt/nvme (effective_io_concurrency, DB page size, ...).
    
    Do you envision any plans for this we still in a need to gather more info exactly why this happens? (perf reports?)
    
    Also have you guys discussed on that meeting any long-term future plans on storage layer by any chance ? If sticking to 4kB pages on DB/page size/hardware sector size, wouldn't it be possible to win also disabling FPWs in the longer run using uring (assuming O_DIRECT | O_ATOMIC one day?)
    I recall that Thomas M. was researching O_ATOMIC, I think he wrote some of that pretty nicely in [1] 
    
    [1] - https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO
    
  5. Re: pgcon unconference / impact of block size on performance

    Tomas Vondra <tomas.vondra@enterprisedb.com> — 2022-06-06T15:00:56Z

    On 6/6/22 16:27, Jakub Wartak wrote:
    > Hi Tomas,
    > 
    >> Hi,
    >>
    >> At on of the pgcon unconference sessions a couple days ago, I presented a
    >> bunch of benchmark results comparing performance with different data/WAL
    >> block size. Most of the OLTP results showed significant gains (up to 50%) with
    >> smaller (4k) data pages.
    > 
    > Nice. I just saw this
    https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference , do
    you have any plans for publishing those other graphs too (e.g. WAL block
    size impact)?
    > 
    
    Well, there's plenty of charts in the github repositories, including the
    charts I think you're asking for:
    
    https://github.com/tvondra/pg-block-bench-pgbench/blob/master/process/heatmaps/xeon/20220406-fpw/16/heatmap-tps.png
    
    https://github.com/tvondra/pg-block-bench-pgbench/blob/master/process/heatmaps/i5/20220427-fpw/16/heatmap-io-tps.png
    
    
    I admit the charts may not be documented very clearly :-(
    
    >> This opened a long discussion about possible explanations - I claimed one of the
    >> main factors is the adoption of flash storage, due to pretty fundamental
    >> differences between HDD and SSD systems. But the discussion concluded with an
    >> agreement to continue investigating this, so here's an attempt to support the
    >> claim with some measurements/data.
    >>
    >> Let me present results of low-level fio benchmarks on a couple different HDD
    >> and SSD drives. This should eliminate any postgres-related influence (e.g. FPW),
    >> and demonstrates inherent HDD/SSD differences.
    >> All the SSD results show this behavior - the Optane and Samsung nicely show
    >> that 4K is much better (in random write IOPS) than 8K, but 1-2K pages make it
    >> worse.
    >>
    > [..]
    > Can you share what Linux kernel version, what filesystem , it's
    > mount options and LVM setup were you using if any(?)
    > 
    
    The PostgreSQL benchmarks were with 5.14.x kernels, with either ext4 or
    xfs filesystems.
    
    i5 uses LVM on the 6x SATA SSD devices, with this config:
    
    bench ~ # mdadm --detail /dev/md0
    /dev/md0:
               Version : 0.90
         Creation Time : Thu Feb  8 15:05:49 2018
            Raid Level : raid0
            Array Size : 586106880 (558.96 GiB 600.17 GB)
          Raid Devices : 6
         Total Devices : 6
       Preferred Minor : 0
           Persistence : Superblock is persistent
    
           Update Time : Thu Feb  8 15:05:49 2018
                 State : clean
        Active Devices : 6
       Working Devices : 6
        Failed Devices : 0
         Spare Devices : 0
    
            Chunk Size : 512K
    
    Consistency Policy : none
    
                  UUID : 24c6158c:36454b38:529cc8e5:b4b9cc9d (local to host
    bench)
                Events : 0.1
    
        Number   Major   Minor   RaidDevice State
           0       8        1        0      active sync   /dev/sda1
           1       8       17        1      active sync   /dev/sdb1
           2       8       33        2      active sync   /dev/sdc1
           3       8       49        3      active sync   /dev/sdd1
           4       8       65        4      active sync   /dev/sde1
           5       8       81        5      active sync   /dev/sdf1
    
    bench ~ # mount | grep md0
    /dev/md0 on /mnt/raid type xfs
    (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=16,swidth=96,noquota)
    
    
    and the xeon just uses ext4 on the device directly:
    
    /dev/nvme0n1p1 on /mnt/data type ext4 (rw,relatime)
    
    
    > I've hastily tried your script on 4VCPU/32GB RAM/1xNVMe device @ 
    > ~900GB (AWS i3.xlarge), kernel 5.x, ext4 defaults, no LVM, libaio
    > only, fio deviations: runtime -> 1min, 64GB file, 1 iteration only.
    > Results are attached, w/o graphs.
    > 
    >> Now, compare this to the SSD. There are some differences between 
    >> the models, manufacturers, interface etc. but the impact of page
    >> size on IOPS is pretty clear. On the Optane you can get +20-30% by
    >> using 4K pages, on the Samsung it's even more, etc. This means that
    >> workloads dominated by random I/O get significant benefit from
    >> smaller pages.
    > 
    > Yup, same here, reproduced, 1.42x faster on writes:
    > [root@x ~]# cd libaio/nvme/randwrite/128/ # 128=queue depth
    > [root@x 128]# grep -r "write:" * | awk '{print $1, $4, $5}'  | sort -n
    > 1k/1.txt: bw=24162KB/s, iops=24161,
    > 2k/1.txt: bw=47164KB/s, iops=23582,
    > 4k/1.txt: bw=280450KB/s, iops=70112, <<<
    > 8k/1.txt: bw=393082KB/s, iops=49135,
    > 16k/1.txt: bw=393103KB/s, iops=24568,
    > 32k/1.txt: bw=393283KB/s, iops=12290,
    >
    > BTW it's interesting to compare to your's Optane 900P result (same 
    > two high bars for IOPS @ 4,8kB), but in my case it's even more import
    > to select 4kB so it behaves more like Samsung 860 in your case
    > 
    
    Thanks. Interesting!
    
    > # 1.41x on randreads
    > [root@x ~]# cd libaio/nvme/randread/128/ # 128=queue depth
    > [root@x 128]# grep -r "read :" | awk '{print $1, $5, $6}' | sort -n
    > 1k/1.txt: bw=169938KB/s, iops=169937,
    > 2k/1.txt: bw=376653KB/s, iops=188326,
    > 4k/1.txt: bw=691529KB/s, iops=172882, <<<
    > 8k/1.txt: bw=976916KB/s, iops=122114,
    > 16k/1.txt: bw=990524KB/s, iops=61907,
    > 32k/1.txt: bw=974318KB/s, iops=30447,
    > 
    > I think that the above just a demonstration of device bandwidth 
    > saturation: 32k*30k IOPS =~ 1GB/s random reads. Given that DB would
    > be tuned @ 4kB for app(OLTP), but once upon a time Parallel Seq
    > Scans "critical reports" could only achieve 70% of what it could
    > achieve on 8kB, correct? (I'm assuming most real systems are really
    > OLTP but with some reporting/data exporting needs).
    > 
    
    Right, that's roughly my thinking too. Also, OLAP queries often do a lot
    of random I/O, due to index scans etc.
    
    I also wonder how is this related to filesystem page size - in all the
    benchmarks I did I used the default (4k), but maybe it'd behave if the
    filesystem page matched the data page.
    
    > One way or another it would be very nice to be able to select the
    > tradeoff using initdb(1) without the need to recompile, which then
    > begs for some initdb --calibrate /mnt/nvme (effective_io_concurrency,
    > DB page size, ...).>
    > Do you envision any plans for this we still in a need to gather more
    > info exactly why this happens? (perf reports?)
    > 
    
    Not sure I follow. Plans for what? Something that calibrates cost
    parameters? That might be useful, but that's a rather separate issue
    from what's discussed here - page size, which needs to happen before
    initdb (at least with how things work currently).
    
    The other issue (e.g. with effective_io_concurrency) is that it very
    much depends on the access pattern - random pages and sequential pages
    will require very different e_i_c values. But again, that's something to
    discuss in a separate thread (e.g. [1])
    
    [1]: https://postgr.es/m/Yl92RVoXVfs+z2Yj@momjian.us
    
    > Also have you guys discussed on that meeting any long-term future 
    > plans on storage layer by any chance ? If sticking to 4kB pages on 
    > DB/page size/hardware sector size, wouldn't it be possible to win
    > also disabling FPWs in the longer run using uring (assuming O_DIRECT
    > | O_ATOMIC one day?)>
    > I recall that Thomas M. was researching O_ATOMIC, I think he wrote
    > some of that pretty nicely in [1] 
    > 
    > [1] - https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO
    
    No, no such discussion - at least no in this unconference slot.
    
    
    regards
    
    -- 
    Tomas Vondra
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  6. Re: pgcon unconference / impact of block size on performance

    Tomas Vondra <tomas.vondra@enterprisedb.com> — 2022-06-06T15:52:46Z

    
    On 6/6/22 17:00, Tomas Vondra wrote:
    > 
    > On 6/6/22 16:27, Jakub Wartak wrote:
    >> Hi Tomas,
    >>
    >>> Hi,
    >>>
    >>> At on of the pgcon unconference sessions a couple days ago, I presented a
    >>> bunch of benchmark results comparing performance with different data/WAL
    >>> block size. Most of the OLTP results showed significant gains (up to 50%) with
    >>> smaller (4k) data pages.
    >>
    >> Nice. I just saw this
    > https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference , do
    > you have any plans for publishing those other graphs too (e.g. WAL block
    > size impact)?
    >>
    > 
    > Well, there's plenty of charts in the github repositories, including the
    > charts I think you're asking for:
    > 
    > https://github.com/tvondra/pg-block-bench-pgbench/blob/master/process/heatmaps/xeon/20220406-fpw/16/heatmap-tps.png
    > 
    > https://github.com/tvondra/pg-block-bench-pgbench/blob/master/process/heatmaps/i5/20220427-fpw/16/heatmap-io-tps.png
    > 
    > 
    > I admit the charts may not be documented very clearly :-(
    > 
    >>> This opened a long discussion about possible explanations - I claimed one of the
    >>> main factors is the adoption of flash storage, due to pretty fundamental
    >>> differences between HDD and SSD systems. But the discussion concluded with an
    >>> agreement to continue investigating this, so here's an attempt to support the
    >>> claim with some measurements/data.
    >>>
    >>> Let me present results of low-level fio benchmarks on a couple different HDD
    >>> and SSD drives. This should eliminate any postgres-related influence (e.g. FPW),
    >>> and demonstrates inherent HDD/SSD differences.
    >>> All the SSD results show this behavior - the Optane and Samsung nicely show
    >>> that 4K is much better (in random write IOPS) than 8K, but 1-2K pages make it
    >>> worse.
    >>>
    >> [..]
    >> Can you share what Linux kernel version, what filesystem , it's
    >> mount options and LVM setup were you using if any(?)
    >>
    > 
    > The PostgreSQL benchmarks were with 5.14.x kernels, with either ext4 or
    > xfs filesystems.
    > 
    
    I realized I mentioned just two of the devices, used for the postgres
    test, but this thread is dealing mostly with about fio results. So let
    me list info about all the devices/filesystems:
    
    i5
    --
    
    Intel SSD 320 120GB SATA (SSDSA2CW12)
    /dev/sdh1 on /mnt/data type ext4 (rw,noatime)
    
    6x Intel SSD DC S3700 100GB SATA (SSDSC2BA10), LVM RAID0
    /dev/md0 on /mnt/raid type xfs
    (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=16,swidth=96,noquota)
    
    
    xeon
    ----
    
    Samsung SSD 860 EVO 2TB SATA (RVT04B6Q)
    /dev/sde1 on /mnt/samsung type ext4 (rw,relatime)
    
    Intel Optane 900P 280GB NVMe (SSDPED1D280GA)
    /dev/nvme0n1p1 on /mnt/data type ext4 (rw,relatime)
    
    3x Maxtor DiamondMax 21 500B 7.2k SATA (STM350063), LVM RAID0
    /dev/md0 on /mnt/raid type ext4 (rw,relatime,stripe=48)
    
    # mdadm --detail /dev/md0
    /dev/md0:
               Version : 1.2
         Creation Time : Fri Aug 31 21:11:48 2018
            Raid Level : raid0
            Array Size : 1464763392 (1396.91 GiB 1499.92 GB)
          Raid Devices : 3
         Total Devices : 3
           Persistence : Superblock is persistent
    
           Update Time : Fri Aug 31 21:11:48 2018
                 State : clean
        Active Devices : 3
       Working Devices : 3
        Failed Devices : 0
         Spare Devices : 0
    
            Chunk Size : 64K
    
    Consistency Policy : none
    
                  Name : bench2:0  (local to host bench2)
                  UUID : 72e48e7b:a75554ea:05952b34:810ed6bc
                Events : 0
    
        Number   Major   Minor   RaidDevice State
           0       8       17        0      active sync   /dev/sdb1
           1       8       33        1      active sync   /dev/sdc1
           2       8       49        2      active sync   /dev/sdd1
    
    
    Hopefully this is more complete ...
    
    -- 
    Tomas Vondra
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  7. Re: pgcon unconference / impact of block size on performance

    Fabien COELHO <coelho@cri.ensmp.fr> — 2022-06-06T20:39:17Z

    Hello Tomas,
    
    > At on of the pgcon unconference sessions a couple days ago, I presented
    > a bunch of benchmark results comparing performance with different
    > data/WAL block size. Most of the OLTP results showed significant gains
    > (up to 50%) with smaller (4k) data pages.
    
    You wrote something about SSD a long time ago, but the link is now dead:
    
    http://www.fuzzy.cz/en/articles/ssd-benchmark-results-read-write-pgbench/
    
    See also:
    
    http://www.cybertec.at/postgresql-block-sizes-getting-started/
    http://blog.coelho.net/database/2014/08/08/postgresql-page-size-for-SSD.html
    
    [...]
    
    > The other important factor is the native SSD page, which is similar to
    > sectors on HDD. SSDs however don't allow in-place updates, and have to
    > reset/rewrite of the whole native page. It's actually more complicated,
    > because the reset happens at a much larger scale (~8MB block), so it
    > does matter how quickly we "dirty" the data. The consequence is that
    > using data pages smaller than the native page (depends on the device,
    > but seems 4K is the common value) either does not help or actually hurts
    > the write performance.
    >
    > All the SSD results show this behavior - the Optane and Samsung nicely
    > show that 4K is much better (in random write IOPS) than 8K, but 1-2K
    > pages make it worse.
    
    Yep. ISTM that uou should also consider the underlying FS block size. Ext4 
    uses 4 KiB by default, so if you write 2 KiB it will write 4 KiB anyway.
    
    There is no much doubt that with SSD we should reduce the default page 
    size. There are some negative impacts (eg more space is lost because of 
    headers and the number of tuples that can be fitted), but I guess the 
    should be an overall benefit. It would help a lot if it would be possible 
    to initdb with a different block size, without recompiling.
    
    -- 
    Fabien.
    
    
    
    
  8. Re: pgcon unconference / impact of block size on performance

    David Rowley <dgrowleyml@gmail.com> — 2022-06-06T22:39:53Z

    On Sun, 5 Jun 2022 at 11:23, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
    > At on of the pgcon unconference sessions a couple days ago, I presented
    > a bunch of benchmark results comparing performance with different
    > data/WAL block size. Most of the OLTP results showed significant gains
    > (up to 50%) with smaller (4k) data pages.
    
    A few years ago when you and I were doing analysis into the TPC-H
    benchmark, we found that larger page sizes helped various queries,
    especially Q1.  It would be good to see what the block size changes in
    performance in a query such as: SELECT sum(value) FROM
    table_with_tuples_of_several_hundred_bytes;.  I don't recall the
    reason why 32k pages helped there, but it seems reasonable that doing
    more work for each lookup in shared buffers might be 1 reason.
    
    Maybe some deeper analysis into various workloads might convince us
    that it might be worth having an initdb option to specify the
    blocksize. There'd be various hurdles to get over in the code to make
    that work. I doubt we could ever make the default smaller than it is
    today as it would nobody would be able to insert rows larger than 4
    kilobytes into a table anymore. Plus pg_upgrade issues.
    
    David
    
    [1] https://www.postgresql.org/docs/current/limits.html
    
    
    
    
  9. RE: pgcon unconference / impact of block size on performance

    Jakub Wartak <jakub.wartak@tomtom.com> — 2022-06-07T09:46:51Z

    Hi Tomas,
    
    > Well, there's plenty of charts in the github repositories, including the charts I
    > think you're asking for:
    
    Thanks.
    
    > I also wonder how is this related to filesystem page size - in all the benchmarks I
    > did I used the default (4k), but maybe it'd behave if the filesystem page matched
    > the data page.
    
    That may be it - using fio on raw NVMe device (without fs/VFS at all) shows:
    
    [root@x libaio-raw]# grep -r -e 'write:' -e 'read :' *
    nvme/randread/128/1k/1.txt:  read : io=7721.9MB, bw=131783KB/s, iops=131783, runt= 60001msec [b]
    nvme/randread/128/2k/1.txt:  read : io=15468MB, bw=263991KB/s, iops=131995, runt= 60001msec [b] 
    nvme/randread/128/4k/1.txt:  read : io=30142MB, bw=514408KB/s, iops=128602, runt= 60001msec [b]
    nvme/randread/128/8k/1.txt:  read : io=56698MB, bw=967635KB/s, iops=120954, runt= 60001msec
    nvme/randwrite/128/1k/1.txt:  write: io=4140.9MB, bw=70242KB/s, iops=70241, runt= 60366msec [a]
    nvme/randwrite/128/2k/1.txt:  write: io=8271.5MB, bw=141161KB/s, iops=70580, runt= 60002msec [a]
    nvme/randwrite/128/4k/1.txt:  write: io=16543MB, bw=281164KB/s, iops=70291, runt= 60248msec
    nvme/randwrite/128/8k/1.txt:  write: io=22924MB, bw=390930KB/s, iops=48866, runt= 60047msec
    
    So, I've found out two interesting things while playing with raw vs ext4:
    a) I've got 70k IOPS always randwrite even on 1k,2k,4k without ext4 (so as expected, this was ext4 4kb default fs page size impact as you was thinking about when fio 1k was hitting ext4 4kB block)
    b) Another thing that you could also include in testing is that I've spotted a couple of times single-threaded fio might could be limiting factor (numjobs=1 by default), so I've tried with numjobs=2,group_reporting=1 and got this below ouput on ext4 defaults even while dropping caches (echo 3) each loop iteration -- something that I cannot explain (ext4 direct I/O caching effect? how's that even possible? reproduced several times even with numjobs=1) - the point being 206643 1kb IOPS @ ext4 direct-io > 131783 1kB IOPS @ raw, smells like some caching effect because for randwrite it does not happen. I've triple-checked with iostat -x... it cannot be any internal device cache as with direct I/O that doesn't happen:
    
    [root@x libaio-ext4]# grep -r -e 'write:' -e 'read :' *
    nvme/randread/128/1k/1.txt:  read : io=12108MB, bw=206644KB/s, iops=206643, runt= 60001msec [b]
    nvme/randread/128/2k/1.txt:  read : io=18821MB, bw=321210KB/s, iops=160604, runt= 60001msec [b]
    nvme/randread/128/4k/1.txt:  read : io=36985MB, bw=631208KB/s, iops=157802, runt= 60001msec [b]
    nvme/randread/128/8k/1.txt:  read : io=57364MB, bw=976923KB/s, iops=122115, runt= 60128msec
    nvme/randwrite/128/1k/1.txt:  write: io=1036.2MB, bw=17683KB/s, iops=17683, runt= 60001msec [a, as before]
    nvme/randwrite/128/2k/1.txt:  write: io=2023.2MB, bw=34528KB/s, iops=17263, runt= 60001msec [a, as before]
    nvme/randwrite/128/4k/1.txt:  write: io=16667MB, bw=282977KB/s, iops=70744, runt= 60311msec [reproduced benefit, as per earlier email]
    nvme/randwrite/128/8k/1.txt:  write: io=22997MB, bw=391839KB/s, iops=48979, runt= 60099msec
    
    > > One way or another it would be very nice to be able to select the
    > > tradeoff using initdb(1) without the need to recompile, which then
    > > begs for some initdb --calibrate /mnt/nvme (effective_io_concurrency,
    > > DB page size, ...).> Do you envision any plans for this we still in a
    > > need to gather more info exactly why this happens? (perf reports?)
    > >
    > 
    > Not sure I follow. Plans for what? Something that calibrates cost parameters?
    > That might be useful, but that's a rather separate issue from what's discussed
    > here - page size, which needs to happen before initdb (at least with how things
    > work currently).
    [..]
    
    Sorry, I was too far teched and assumed you guys were talking very long term. 
    
    -J.
    
    
  10. RE: pgcon unconference / impact of block size on performance

    Jakub Wartak <jakub.wartak@tomtom.com> — 2022-06-07T09:46:53Z

    [..]
    >I doubt we could ever
    > make the default smaller than it is today as it would nobody would be able to
    > insert rows larger than 4 kilobytes into a table anymore. 
    
    Add error "values larger than 1/3 of a buffer page cannot be indexed" to that list...
    
    -J.
    
    
    
    
  11. Re: pgcon unconference / impact of block size on performance

    Tomas Vondra <tomas.vondra@enterprisedb.com> — 2022-06-07T13:29:46Z

    On 6/7/22 11:46, Jakub Wartak wrote:
    > Hi Tomas,
    > 
    >> Well, there's plenty of charts in the github repositories, including the charts I
    >> think you're asking for:
    > 
    > Thanks.
    > 
    >> I also wonder how is this related to filesystem page size - in all the benchmarks I
    >> did I used the default (4k), but maybe it'd behave if the filesystem page matched
    >> the data page.
    > 
    > That may be it - using fio on raw NVMe device (without fs/VFS at all) shows:
    > 
    > [root@x libaio-raw]# grep -r -e 'write:' -e 'read :' *
    > nvme/randread/128/1k/1.txt:  read : io=7721.9MB, bw=131783KB/s, iops=131783, runt= 60001msec [b]
    > nvme/randread/128/2k/1.txt:  read : io=15468MB, bw=263991KB/s, iops=131995, runt= 60001msec [b] 
    > nvme/randread/128/4k/1.txt:  read : io=30142MB, bw=514408KB/s, iops=128602, runt= 60001msec [b]
    > nvme/randread/128/8k/1.txt:  read : io=56698MB, bw=967635KB/s, iops=120954, runt= 60001msec
    > nvme/randwrite/128/1k/1.txt:  write: io=4140.9MB, bw=70242KB/s, iops=70241, runt= 60366msec [a]
    > nvme/randwrite/128/2k/1.txt:  write: io=8271.5MB, bw=141161KB/s, iops=70580, runt= 60002msec [a]
    > nvme/randwrite/128/4k/1.txt:  write: io=16543MB, bw=281164KB/s, iops=70291, runt= 60248msec
    > nvme/randwrite/128/8k/1.txt:  write: io=22924MB, bw=390930KB/s, iops=48866, runt= 60047msec
    > 
    > So, I've found out two interesting things while playing with raw vs ext4:
    > a) I've got 70k IOPS always randwrite even on 1k,2k,4k without ext4 (so as expected, this was ext4 4kb default fs page size impact as you was thinking about when fio 1k was hitting ext4 4kB block)
    
    Right. Interesting, so for randread we get a consistent +30% speedup on
    raw devices with all page sizes, while on randwrite it's about 1.0x for
    4K. The really puzzling thing is why is the filesystem so much slower
    for smaller pages. I mean, why would writing 1K be 1/3 of writing 4K?
    Why would a filesystem have such effect?
    
    > b) Another thing that you could also include in testing is that I've spotted a couple of times single-threaded fio might could be limiting factor (numjobs=1 by default), so I've tried with numjobs=2,group_reporting=1 and got this below ouput on ext4 defaults even while dropping caches (echo 3) each loop iteration -- something that I cannot explain (ext4 direct I/O caching effect? how's that even possible? reproduced several times even with numjobs=1) - the point being 206643 1kb IOPS @ ext4 direct-io > 131783 1kB IOPS @ raw, smells like some caching effect because for randwrite it does not happen. I've triple-checked with iostat -x... it cannot be any internal device cache as with direct I/O that doesn't happen:
    > 
    > [root@x libaio-ext4]# grep -r -e 'write:' -e 'read :' *
    > nvme/randread/128/1k/1.txt:  read : io=12108MB, bw=206644KB/s, iops=206643, runt= 60001msec [b]
    > nvme/randread/128/2k/1.txt:  read : io=18821MB, bw=321210KB/s, iops=160604, runt= 60001msec [b]
    > nvme/randread/128/4k/1.txt:  read : io=36985MB, bw=631208KB/s, iops=157802, runt= 60001msec [b]
    > nvme/randread/128/8k/1.txt:  read : io=57364MB, bw=976923KB/s, iops=122115, runt= 60128msec
    > nvme/randwrite/128/1k/1.txt:  write: io=1036.2MB, bw=17683KB/s, iops=17683, runt= 60001msec [a, as before]
    > nvme/randwrite/128/2k/1.txt:  write: io=2023.2MB, bw=34528KB/s, iops=17263, runt= 60001msec [a, as before]
    > nvme/randwrite/128/4k/1.txt:  write: io=16667MB, bw=282977KB/s, iops=70744, runt= 60311msec [reproduced benefit, as per earlier email]
    > nvme/randwrite/128/8k/1.txt:  write: io=22997MB, bw=391839KB/s, iops=48979, runt= 60099msec
    > 
    
    No idea what might be causing this. BTW so you're not using direct-io to
    access the raw device? Or am I just misreading this?
    
    >>> One way or another it would be very nice to be able to select the
    >>> tradeoff using initdb(1) without the need to recompile, which then
    >>> begs for some initdb --calibrate /mnt/nvme (effective_io_concurrency,
    >>> DB page size, ...).> Do you envision any plans for this we still in a
    >>> need to gather more info exactly why this happens? (perf reports?)
    >>>
    >>
    >> Not sure I follow. Plans for what? Something that calibrates cost parameters?
    >> That might be useful, but that's a rather separate issue from what's discussed
    >> here - page size, which needs to happen before initdb (at least with how things
    >> work currently).
    > [..]
    > 
    > Sorry, I was too far teched and assumed you guys were talking very long term. 
    > 
    
    Np, I think that'd be an useful tool, but it seems more like a
    completely separate discussion.
    
    
    regards
    
    -- 
    Tomas Vondra
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  12. RE: pgcon unconference / impact of block size on performance

    Jakub Wartak <jakub.wartak@tomtom.com> — 2022-06-07T13:48:09Z

    Hi,
    
    > The really
    > puzzling thing is why is the filesystem so much slower for smaller pages. I mean,
    > why would writing 1K be 1/3 of writing 4K?
    > Why would a filesystem have such effect?
    
    Ha! I don't care at this point as 1 or 2kB seems too small to handle many real world scenarios ;)
    
    > > b) Another thing that you could also include in testing is that I've spotted a
    > couple of times single-threaded fio might could be limiting factor (numjobs=1 by
    > default), so I've tried with numjobs=2,group_reporting=1 and got this below
    > ouput on ext4 defaults even while dropping caches (echo 3) each loop iteration -
    > - something that I cannot explain (ext4 direct I/O caching effect? how's that
    > even possible? reproduced several times even with numjobs=1) - the point being
    > 206643 1kb IOPS @ ext4 direct-io > 131783 1kB IOPS @ raw, smells like some
    > caching effect because for randwrite it does not happen. I've triple-checked with
    > iostat -x... it cannot be any internal device cache as with direct I/O that doesn't
    > happen:
    > >
    > > [root@x libaio-ext4]# grep -r -e 'write:' -e 'read :' *
    > > nvme/randread/128/1k/1.txt:  read : io=12108MB, bw=206644KB/s,
    > > iops=206643, runt= 60001msec [b]
    > > nvme/randread/128/2k/1.txt:  read : io=18821MB, bw=321210KB/s,
    > > iops=160604, runt= 60001msec [b]
    > > nvme/randread/128/4k/1.txt:  read : io=36985MB, bw=631208KB/s,
    > > iops=157802, runt= 60001msec [b]
    > > nvme/randread/128/8k/1.txt:  read : io=57364MB, bw=976923KB/s,
    > > iops=122115, runt= 60128msec
    > > nvme/randwrite/128/1k/1.txt:  write: io=1036.2MB, bw=17683KB/s,
    > > iops=17683, runt= 60001msec [a, as before]
    > > nvme/randwrite/128/2k/1.txt:  write: io=2023.2MB, bw=34528KB/s,
    > > iops=17263, runt= 60001msec [a, as before]
    > > nvme/randwrite/128/4k/1.txt:  write: io=16667MB, bw=282977KB/s,
    > > iops=70744, runt= 60311msec [reproduced benefit, as per earlier email]
    > > nvme/randwrite/128/8k/1.txt:  write: io=22997MB, bw=391839KB/s,
    > > iops=48979, runt= 60099msec
    > >
    > 
    > No idea what might be causing this. BTW so you're not using direct-io to access
    > the raw device? Or am I just misreading this?
    
    Both scenarios (raw and fs) have had direct=1 set. I just cannot understand how having direct I/O enabled (which disables caching) achieves better read IOPS on ext4 than on raw device... isn't it contradiction?
    
    -J.
    
    
    
    
    
  13. Re: pgcon unconference / impact of block size on performance

    Tomas Vondra <tomas.vondra@enterprisedb.com> — 2022-06-07T14:00:18Z

    
    On 6/7/22 15:48, Jakub Wartak wrote:
    > Hi,
    > 
    >> The really
    >> puzzling thing is why is the filesystem so much slower for smaller pages. I mean,
    >> why would writing 1K be 1/3 of writing 4K?
    >> Why would a filesystem have such effect?
    > 
    > Ha! I don't care at this point as 1 or 2kB seems too small to handle many real world scenarios ;)
    > 
    
    I think that's not quite true - a lot of OLTP works with fairly narrow
    rows, and if they use more data, it's probably in TOAST, so again split
    into smaller rows. It's true smaller pages would cut some of the limits
    (columns, index tuple, ...) of course, and that might be an issue.
    
    Independently of that, it seems like an interesting behavior and it
    might tell us something about how to optimize for larger pages.
    
    >>> b) Another thing that you could also include in testing is that I've spotted a
    >> couple of times single-threaded fio might could be limiting factor (numjobs=1 by
    >> default), so I've tried with numjobs=2,group_reporting=1 and got this below
    >> ouput on ext4 defaults even while dropping caches (echo 3) each loop iteration -
    >> - something that I cannot explain (ext4 direct I/O caching effect? how's that
    >> even possible? reproduced several times even with numjobs=1) - the point being
    >> 206643 1kb IOPS @ ext4 direct-io > 131783 1kB IOPS @ raw, smells like some
    >> caching effect because for randwrite it does not happen. I've triple-checked with
    >> iostat -x... it cannot be any internal device cache as with direct I/O that doesn't
    >> happen:
    >>>
    >>> [root@x libaio-ext4]# grep -r -e 'write:' -e 'read :' *
    >>> nvme/randread/128/1k/1.txt:  read : io=12108MB, bw=206644KB/s,
    >>> iops=206643, runt= 60001msec [b]
    >>> nvme/randread/128/2k/1.txt:  read : io=18821MB, bw=321210KB/s,
    >>> iops=160604, runt= 60001msec [b]
    >>> nvme/randread/128/4k/1.txt:  read : io=36985MB, bw=631208KB/s,
    >>> iops=157802, runt= 60001msec [b]
    >>> nvme/randread/128/8k/1.txt:  read : io=57364MB, bw=976923KB/s,
    >>> iops=122115, runt= 60128msec
    >>> nvme/randwrite/128/1k/1.txt:  write: io=1036.2MB, bw=17683KB/s,
    >>> iops=17683, runt= 60001msec [a, as before]
    >>> nvme/randwrite/128/2k/1.txt:  write: io=2023.2MB, bw=34528KB/s,
    >>> iops=17263, runt= 60001msec [a, as before]
    >>> nvme/randwrite/128/4k/1.txt:  write: io=16667MB, bw=282977KB/s,
    >>> iops=70744, runt= 60311msec [reproduced benefit, as per earlier email]
    >>> nvme/randwrite/128/8k/1.txt:  write: io=22997MB, bw=391839KB/s,
    >>> iops=48979, runt= 60099msec
    >>>
    >>
    >> No idea what might be causing this. BTW so you're not using direct-io to access
    >> the raw device? Or am I just misreading this?
    > 
    > Both scenarios (raw and fs) have had direct=1 set. I just cannot understand how having direct I/O enabled (which disables caching) achieves better read IOPS on ext4 than on raw device... isn't it contradiction?
    > 
    
    Thanks for the clarification. Not sure what might be causing this. Did
    you use the same parameters (e.g. iodepth) in both cases?
    
    
    regards
    
    -- 
    Tomas Vondra
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  14. Re: pgcon unconference / impact of block size on performance

    Robert Haas <robertmhaas@gmail.com> — 2022-06-07T16:26:01Z

    On Sat, Jun 4, 2022 at 7:23 PM Tomas Vondra
    <tomas.vondra@enterprisedb.com> wrote:
    > This opened a long discussion about possible explanations - I claimed
    > one of the main factors is the adoption of flash storage, due to pretty
    > fundamental differences between HDD and SSD systems. But the discussion
    > concluded with an agreement to continue investigating this, so here's an
    > attempt to support the claim with some measurements/data.
    
    Interesting. I wonder if the fact that x86 machines have a 4kB page
    size matters here. It seems hard to be sure because it's not something
    you can really change. But there are a few of your graphs where 4kB
    spikes up above any higher or lower value, and maybe that's why?
    
    -- 
    Robert Haas
    EDB: http://www.enterprisedb.com
    
    
    
    
  15. Re: pgcon unconference / impact of block size on performance

    Tomas Vondra <tomas.vondra@enterprisedb.com> — 2022-06-07T17:47:11Z

    On 6/7/22 18:26, Robert Haas wrote:
    > On Sat, Jun 4, 2022 at 7:23 PM Tomas Vondra
    > <tomas.vondra@enterprisedb.com> wrote:
    >> This opened a long discussion about possible explanations - I claimed
    >> one of the main factors is the adoption of flash storage, due to pretty
    >> fundamental differences between HDD and SSD systems. But the discussion
    >> concluded with an agreement to continue investigating this, so here's an
    >> attempt to support the claim with some measurements/data.
    > 
    > Interesting. I wonder if the fact that x86 machines have a 4kB page
    > size matters here. It seems hard to be sure because it's not something
    > you can really change. But there are a few of your graphs where 4kB
    > spikes up above any higher or lower value, and maybe that's why?
    >
    
    Possibly, but why would that be the case? Maybe there are places that do
    stuff with memory and have different optimizations based on length? I'd
    bet the 4k page is way more optimized than the other cases.
    
    But honestly, I think the SSD page size matters much more, and the main
    bump between 4k and 8k comes from having to deal with just a single
    page. Imagine you write 8k postgres page - the filesystem splits that
    into two 4k pages, and then eventually writes them to storage. It may
    happen the writeback flushes them separately, possibly even to different
    places on the device. Which might be more expensive to read later, etc.
    
    I'm just speculating, of course. Maybe the storage is smarter and can
    figure some of this internally, or maybe the locality will remain high.
    
    regards
    
    -- 
    Tomas Vondra
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  16. Re: pgcon unconference / impact of block size on performance

    Robert Haas <robertmhaas@gmail.com> — 2022-06-07T19:15:14Z

    On Tue, Jun 7, 2022 at 1:47 PM Tomas Vondra
    <tomas.vondra@enterprisedb.com> wrote:
    > Possibly, but why would that be the case? Maybe there are places that do
    > stuff with memory and have different optimizations based on length? I'd
    > bet the 4k page is way more optimized than the other cases.
    
    I don't really know. It was just a thought. It feels like the fact
    that the page sizes are different could be hurting us somehow, but I
    don't really know what the mechanism would be.
    
    -- 
    Robert Haas
    EDB: http://www.enterprisedb.com
    
    
    
    
  17. RE: pgcon unconference / impact of block size on performance

    Jakub Wartak <jakub.wartak@tomtom.com> — 2022-06-08T14:15:17Z

    Hi, got some answers!
    
    TL;DR for fio it would make sense to use many stressfiles (instead of 1) and same for numjobs ~ VCPU to avoid various pitfails.
    
    > >> The really
    > >> puzzling thing is why is the filesystem so much slower for smaller
    > >> pages. I mean, why would writing 1K be 1/3 of writing 4K?
    > >> Why would a filesystem have such effect?
    > >
    > > Ha! I don't care at this point as 1 or 2kB seems too small to handle
    > > many real world scenarios ;)
    [..]
    > Independently of that, it seems like an interesting behavior and it might tell us
    > something about how to optimize for larger pages.
    
    OK, curiosity won:
    
    With randwrite on ext4 directio using 4kb the avgqu-sz reaches ~90-100 (close to fio's 128 queue depth?) and I'm getting ~70k IOPS [with maxdepth=128]
    With randwrite on ext4 directio using 1kb the avgqu-sz is just 0.7 and I'm getting just ~17-22k IOPS [with maxdepth=128] ->  conclusion: something is being locked thus preventing queue to build up
    With randwrite on ext4 directio using 4kb the avgqu-sz reaches ~2.3 (so something is queued) and I'm also getting ~70k IOPS with minimal possible maxdepth=4 ->  conclusion: I just need to split the lock contention by 4.
    
    The 1kB (slow) profile top function is aio_write() -> .... -> iov_iter_get_pages() -> internal_get_user_pages_fast() and there's sadly plenty of "lock" keywords inside {related to memory manager, padding to full page size, inode locking} also one can find some articles / commits related to it [1] which didn't made a good feeling to be honest as the fio is using just 1 file (even while I'm on kernel 5.10.x). So I've switched to 4x files and numjobs=4 and got easily 60k IOPS, contention solved whatever it was :) So I would assume PostgreSQL (with it's splitting data files by default on 1GB boundaries and multiprocess architecture) should be relatively safe from such ext4 inode(?)/mm(?) contentions even with smallest 1kb block sizes on Direct I/O some day.  
    
    [1] - https://www.phoronix.com/scan.php?page=news_item&px=EXT4-DIO-Faster-DBs
    
    > > Both scenarios (raw and fs) have had direct=1 set. I just cannot understand
    > how having direct I/O enabled (which disables caching) achieves better read
    > IOPS on ext4 than on raw device... isn't it contradiction?
    > >
    > 
    > Thanks for the clarification. Not sure what might be causing this. Did you use the
    > same parameters (e.g. iodepth) in both cases?
    
    Explanation: it's the CPU scheduler migrations mixing the performance result during the runs of fio  (as you have in your framework). Various VCPUs seem to be having varying max IOPS characteristics (sic!) and CPU scheduler seems to be unaware of it. At least on 1kB and 4kB blocksize this happens also notice that some VCPUs [XXXX marker] don't reach 100% CPU reaching almost twice the result; while cores 0, 3 do reach 100% and lack CPU power to perform more. The only thing that I don't get is that it doesn't make sense from extened lscpu output (but maybe it's AWS XEN mixing real CPU mappings, who knows). 
    
    [root@x ~]# for((x=0; x<=3; x++)) ; do echo "$x:"; taskset -c $x fio fio.ext4 | grep -e 'read :' -e 'cpu '; done
    0:
      read : io=2416.8MB, bw=123730KB/s, iops=123730, runt= 20001msec
      cpu          : usr=42.98%, sys=56.52%, ctx=2317, majf=0, minf=41 [XXXX: 100% cpu bottleneck and just 123k IOPS]
    1:
      read : io=4077.9MB, bw=208774KB/s, iops=208773, runt= 20001msec
      cpu          : usr=29.47%, sys=51.43%, ctx=2993, majf=0, minf=42  [XXXX, some idle power and 208k IOPS just by switching to core1...]
    2:
      read : io=4036.7MB, bw=206636KB/s, iops=206636, runt= 20001msec
      cpu          : usr=31.00%, sys=52.41%, ctx=2815, majf=0, minf=42 [XXXX]
    3:
      read : io=2398.4MB, bw=122791KB/s, iops=122791, runt= 20001msec
      cpu          : usr=44.20%, sys=55.20%, ctx=2522, majf=0, minf=41
    [root@x ~]# for((x=0; x<=3; x++)) ; do echo "$x:"; taskset -c $x fio fio.raw | grep -e 'read :' -e 'cpu '; done
    0:
      read : io=2512.3MB, bw=128621KB/s, iops=128620, runt= 20001msec
      cpu          : usr=47.62%, sys=51.58%, ctx=2365, majf=0, minf=42
    1:
      read : io=4070.2MB, bw=206748KB/s, iops=206748, runt= 20159msec
      cpu          : usr=29.52%, sys=42.86%, ctx=2808, majf=0, minf=42 [XXXX]
    2:
      read : io=4101.3MB, bw=209975KB/s, iops=209975, runt= 20001msec
      cpu          : usr=28.05%, sys=45.09%, ctx=3419, majf=0, minf=42 [XXXX]
    3:
      read : io=2519.4MB, bw=128985KB/s, iops=128985, runt= 20001msec
      cpu          : usr=46.59%, sys=52.70%, ctx=2371, majf=0, minf=41
    
    [root@x ~]# lscpu --extended
    CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ    MINMHZ
    0   0    0      0    0:0:0:0       yes    3000.0000 1200.0000
    1   0    0      1    1:1:1:0       yes    3000.0000 1200.0000
    2   0    0      0    0:0:0:0       yes    3000.0000 1200.0000
    3   0    0      1    1:1:1:0       yes    3000.0000 1200.0000
    [root@x ~]# lscpu | grep -e ^Model -e ^NUMA -e ^Hyper
    NUMA node(s):        1
    Model:               79
    Model name:          Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    Hypervisor vendor:   Xen
    NUMA node0 CPU(s):   0-3
    [root@x ~]# diff -u fio.raw fio.ext4
    --- fio.raw     2022-06-08 12:32:26.603482453 +0000
    +++ fio.ext4    2022-06-08 12:32:36.071621708 +0000
    @@ -1,5 +1,5 @@
     [global]
    -filename=/dev/nvme0n1
    +filename=/mnt/nvme/fio/data.file
     size=256GB
     direct=1
     ioengine=libaio
    [root@x ~]# cat fio.raw
    [global]
    filename=/dev/nvme0n1
    size=256GB
    direct=1
    ioengine=libaio
    runtime=20
    numjobs=1
    group_reporting=1
    
    [job]
    rw=randread
    iodepth=128
    bs=1k
    size=64GB
    [root@x ~]#
    
    -J.
    
    
    
    
  18. Re: pgcon unconference / impact of block size on performance

    Tomas Vondra <tomas.vondra@enterprisedb.com> — 2022-06-08T14:51:41Z

    On 6/8/22 16:15, Jakub Wartak wrote:
    > Hi, got some answers!
    > 
    > TL;DR for fio it would make sense to use many stressfiles (instead of 1) and same for numjobs ~ VCPU to avoid various pitfails.
    > >>>> The really
    >>>> puzzling thing is why is the filesystem so much slower for smaller
    >>>> pages. I mean, why would writing 1K be 1/3 of writing 4K?
    >>>> Why would a filesystem have such effect?
    >>>
    >>> Ha! I don't care at this point as 1 or 2kB seems too small to handle
    >>> many real world scenarios ;)
    > [..]
    >> Independently of that, it seems like an interesting behavior and it might tell us
    >> something about how to optimize for larger pages.
    > 
    > OK, curiosity won:
    > 
    > With randwrite on ext4 directio using 4kb the avgqu-sz reaches ~90-100 (close to fio's 128 queue depth?) and I'm getting ~70k IOPS [with maxdepth=128]
    > With randwrite on ext4 directio using 1kb the avgqu-sz is just 0.7 and I'm getting just ~17-22k IOPS [with maxdepth=128] ->  conclusion: something is being locked thus preventing queue to build up
    > With randwrite on ext4 directio using 4kb the avgqu-sz reaches ~2.3 (so something is queued) and I'm also getting ~70k IOPS with minimal possible maxdepth=4 ->  conclusion: I just need to split the lock contention by 4.
    > 
    > The 1kB (slow) profile top function is aio_write() -> .... -> iov_iter_get_pages() -> internal_get_user_pages_fast() and there's sadly plenty of "lock" keywords inside {related to memory manager, padding to full page size, inode locking} also one can find some articles / commits related to it [1] which didn't made a good feeling to be honest as the fio is using just 1 file (even while I'm on kernel 5.10.x). So I've switched to 4x files and numjobs=4 and got easily 60k IOPS, contention solved whatever it was :) So I would assume PostgreSQL (with it's splitting data files by default on 1GB boundaries and multiprocess architecture) should be relatively safe from such ext4 inode(?)/mm(?) contentions even with smallest 1kb block sizes on Direct I/O some day.  
    > 
    
    Interesting. So what parameter values would you suggest?
    
    FWIW some of the tests I did were on xfs, so I wonder if that might be
    hitting similar/other bottlenecks.
    
    > [1] - https://www.phoronix.com/scan.php?page=news_item&px=EXT4-DIO-Faster-DBs
    > 
    >>> Both scenarios (raw and fs) have had direct=1 set. I just cannot understand
    >> how having direct I/O enabled (which disables caching) achieves better read
    >> IOPS on ext4 than on raw device... isn't it contradiction?
    >>>
    >>
    >> Thanks for the clarification. Not sure what might be causing this. Did you use the
    >> same parameters (e.g. iodepth) in both cases?
    > 
    > Explanation: it's the CPU scheduler migrations mixing the performance result during the runs of fio  (as you have in your framework). Various VCPUs seem to be having varying max IOPS characteristics (sic!) and CPU scheduler seems to be unaware of it. At least on 1kB and 4kB blocksize this happens also notice that some VCPUs [XXXX marker] don't reach 100% CPU reaching almost twice the result; while cores 0, 3 do reach 100% and lack CPU power to perform more. The only thing that I don't get is that it doesn't make sense from extened lscpu output (but maybe it's AWS XEN mixing real CPU mappings, who knows).
    
    Uh, that's strange. I haven't seen anything like that, but I'm running
    on physical HW and not AWS, so it's either that or maybe I just didn't
    do the same test.
    
    
    regards
    
    -- 
    Tomas Vondra
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  19. RE: pgcon unconference / impact of block size on performance

    Jakub Wartak <jakub.wartak@tomtom.com> — 2022-06-09T11:23:36Z

    > > >>>> The really
    > >>>> puzzling thing is why is the filesystem so much slower for smaller
    > >>>> pages. I mean, why would writing 1K be 1/3 of writing 4K?
    > >>>> Why would a filesystem have such effect?
    > >>>
    > >>> Ha! I don't care at this point as 1 or 2kB seems too small to handle
    > >>> many real world scenarios ;)
    > > [..]
    > >> Independently of that, it seems like an interesting behavior and it
    > >> might tell us something about how to optimize for larger pages.
    > >
    > > OK, curiosity won:
    > >
    > > With randwrite on ext4 directio using 4kb the avgqu-sz reaches ~90-100
    > > (close to fio's 128 queue depth?) and I'm getting ~70k IOPS [with
    > > maxdepth=128] With randwrite on ext4 directio using 1kb the avgqu-sz is just
    > 0.7 and I'm getting just ~17-22k IOPS [with maxdepth=128] ->  conclusion:
    > something is being locked thus preventing queue to build up With randwrite on
    > ext4 directio using 4kb the avgqu-sz reaches ~2.3 (so something is queued) and
    > I'm also getting ~70k IOPS with minimal possible maxdepth=4 ->  conclusion: I
    > just need to split the lock contention by 4.
    > >
    > > The 1kB (slow) profile top function is aio_write() -> .... -> iov_iter_get_pages()
    > -> internal_get_user_pages_fast() and there's sadly plenty of "lock" keywords
    > inside {related to memory manager, padding to full page size, inode locking}
    > also one can find some articles / commits related to it [1] which didn't made a
    > good feeling to be honest as the fio is using just 1 file (even while I'm on kernel
    > 5.10.x). So I've switched to 4x files and numjobs=4 and got easily 60k IOPS,
    > contention solved whatever it was :) So I would assume PostgreSQL (with it's
    > splitting data files by default on 1GB boundaries and multiprocess architecture)
    > should be relatively safe from such ext4 inode(?)/mm(?) contentions even with
    > smallest 1kb block sizes on Direct I/O some day.
    > >
    > 
    > Interesting. So what parameter values would you suggest?
    
    At least have 4x filename= entries and numjobs=4
    
    > FWIW some of the tests I did were on xfs, so I wonder if that might be hitting
    > similar/other bottlenecks.
    
    Apparently XFS also shows same contention on single file for 1..2kb randwrite, see [ZZZ]. 
    
    [root@x ~]# mount|grep /mnt/nvme
    /dev/nvme0n1 on /mnt/nvme type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
    
    # using 1 fio job and 1 file
    [root@x ~]# grep -r -e 'read :' -e 'write:' libaio
    libaio/nvme/randread/128/1k/1.txt:  read : io=5779.1MB, bw=196573KB/s, iops=196573, runt= 30109msec
    libaio/nvme/randread/128/2k/1.txt:  read : io=10335MB, bw=352758KB/s, iops=176379, runt= 30001msec
    libaio/nvme/randread/128/4k/1.txt:  read : io=22220MB, bw=758408KB/s, iops=189601, runt= 30001msec
    libaio/nvme/randread/128/8k/1.txt:  read : io=28914MB, bw=986896KB/s, iops=123361, runt= 30001msec
    libaio/nvme/randwrite/128/1k/1.txt:  write: io=694856KB, bw=23161KB/s, iops=23161, runt= 30001msec [ZZZ]
    libaio/nvme/randwrite/128/2k/1.txt:  write: io=1370.7MB, bw=46782KB/s, iops=23390, runt= 30001msec [ZZZ]
    libaio/nvme/randwrite/128/4k/1.txt:  write: io=8261.3MB, bw=281272KB/s, iops=70318, runt= 30076msec [OK]
    libaio/nvme/randwrite/128/8k/1.txt:  write: io=11598MB, bw=394320KB/s, iops=49289, runt= 30118msec
    
    # but it's all ok using 4 fio jobs and 4 files
    [root@x ~]# grep -r -e 'read :' -e 'write:' libaio
    libaio/nvme/randread/128/1k/1.txt:  read : io=6174.6MB, bw=210750KB/s, iops=210750, runt= 30001msec
    libaio/nvme/randread/128/2k/1.txt:  read : io=12152MB, bw=413275KB/s, iops=206637, runt= 30110msec
    libaio/nvme/randread/128/4k/1.txt:  read : io=24382MB, bw=832116KB/s, iops=208028, runt= 30005msec
    libaio/nvme/randread/128/8k/1.txt:  read : io=29281MB, bw=985831KB/s, iops=123228, runt= 30415msec
    libaio/nvme/randwrite/128/1k/1.txt:  write: io=1692.2MB, bw=57748KB/s, iops=57748, runt= 30003msec
    libaio/nvme/randwrite/128/2k/1.txt:  write: io=3601.9MB, bw=122940KB/s, iops=61469, runt= 30001msec
    libaio/nvme/randwrite/128/4k/1.txt:  write: io=8470.8MB, bw=285857KB/s, iops=71464, runt= 30344msec
    libaio/nvme/randwrite/128/8k/1.txt:  write: io=11449MB, bw=390603KB/s, iops=48825, runt= 30014msec
     
    
    > >>> Both scenarios (raw and fs) have had direct=1 set. I just cannot
    > >>> understand
    > >> how having direct I/O enabled (which disables caching) achieves
    > >> better read IOPS on ext4 than on raw device... isn't it contradiction?
    > >>>
    > >>
    > >> Thanks for the clarification. Not sure what might be causing this.
    > >> Did you use the same parameters (e.g. iodepth) in both cases?
    > >
    > > Explanation: it's the CPU scheduler migrations mixing the performance result
    > during the runs of fio  (as you have in your framework). Various VCPUs seem to
    > be having varying max IOPS characteristics (sic!) and CPU scheduler seems to be
    > unaware of it. At least on 1kB and 4kB blocksize this happens also notice that
    > some VCPUs [XXXX marker] don't reach 100% CPU reaching almost twice the
    > result; while cores 0, 3 do reach 100% and lack CPU power to perform more.
    > The only thing that I don't get is that it doesn't make sense from extened lscpu
    > output (but maybe it's AWS XEN mixing real CPU mappings, who knows).
    > 
    > Uh, that's strange. I haven't seen anything like that, but I'm running on physical
    > HW and not AWS, so it's either that or maybe I just didn't do the same test.
    
    I couldn't belived it until I've checked via taskset 😊 BTW: I don't have real HW with NVMe , but we might be with worth checking if placing (taskset -c ...) fio on hyperthreading VCPU is not causing (there's /sys/devices/system/cpu/cpu0/topology/thread_siblings and maybe lscpu(1) output). On AWS I have feeling that lscpu might simply lie and I cannot identify which VCPU is HT and which isn't.
    
    -J.
    
  20. Re: pgcon unconference / impact of block size on performance

    Tomas Vondra <tomas.vondra@enterprisedb.com> — 2022-06-09T22:24:29Z

    
    
    On 6/9/22 13:23, Jakub Wartak wrote:
    >>>>>>> The really
    >>>>>> puzzling thing is why is the filesystem so much slower for smaller
    >>>>>> pages. I mean, why would writing 1K be 1/3 of writing 4K?
    >>>>>> Why would a filesystem have such effect?
    >>>>>
    >>>>> Ha! I don't care at this point as 1 or 2kB seems too small to handle
    >>>>> many real world scenarios ;)
    >>> [..]
    >>>> Independently of that, it seems like an interesting behavior and it
    >>>> might tell us something about how to optimize for larger pages.
    >>>
    >>> OK, curiosity won:
    >>>
    >>> With randwrite on ext4 directio using 4kb the avgqu-sz reaches ~90-100
    >>> (close to fio's 128 queue depth?) and I'm getting ~70k IOPS [with
    >>> maxdepth=128] With randwrite on ext4 directio using 1kb the avgqu-sz is just
    >> 0.7 and I'm getting just ~17-22k IOPS [with maxdepth=128] ->  conclusion:
    >> something is being locked thus preventing queue to build up With randwrite on
    >> ext4 directio using 4kb the avgqu-sz reaches ~2.3 (so something is queued) and
    >> I'm also getting ~70k IOPS with minimal possible maxdepth=4 ->  conclusion: I
    >> just need to split the lock contention by 4.
    >>>
    >>> The 1kB (slow) profile top function is aio_write() -> .... -> iov_iter_get_pages()
    >> -> internal_get_user_pages_fast() and there's sadly plenty of "lock" keywords
    >> inside {related to memory manager, padding to full page size, inode locking}
    >> also one can find some articles / commits related to it [1] which didn't made a
    >> good feeling to be honest as the fio is using just 1 file (even while I'm on kernel
    >> 5.10.x). So I've switched to 4x files and numjobs=4 and got easily 60k IOPS,
    >> contention solved whatever it was :) So I would assume PostgreSQL (with it's
    >> splitting data files by default on 1GB boundaries and multiprocess architecture)
    >> should be relatively safe from such ext4 inode(?)/mm(?) contentions even with
    >> smallest 1kb block sizes on Direct I/O some day.
    >>>
    >>
    >> Interesting. So what parameter values would you suggest?
    > 
    > At least have 4x filename= entries and numjobs=4
    > 
    >> FWIW some of the tests I did were on xfs, so I wonder if that might be hitting
    >> similar/other bottlenecks.
    > 
    > Apparently XFS also shows same contention on single file for 1..2kb randwrite, see [ZZZ]. 
    > 
    
    I don't have any results yet, but after thinking about this a bit I find
    this really strange. Why would there be any contention with a single fio
    job? Doesn't contention imply multiple processes competing for the same
    resource/lock etc.?
    
    Isn't this simply due to the iodepth increase? IIUC with multiple fio
    jobs, each will use a separate iodepth value. So with numjobs=4, we'll
    really use iodepth*4, which can make a big difference.
    
    
    >>>
    >>> Explanation: it's the CPU scheduler migrations mixing the performance result
    >> during the runs of fio  (as you have in your framework). Various VCPUs seem to
    >> be having varying max IOPS characteristics (sic!) and CPU scheduler seems to be
    >> unaware of it. At least on 1kB and 4kB blocksize this happens also notice that
    >> some VCPUs [XXXX marker] don't reach 100% CPU reaching almost twice the
    >> result; while cores 0, 3 do reach 100% and lack CPU power to perform more.
    >> The only thing that I don't get is that it doesn't make sense from extened lscpu
    >> output (but maybe it's AWS XEN mixing real CPU mappings, who knows).
    >>
    >> Uh, that's strange. I haven't seen anything like that, but I'm running on physical
    >> HW and not AWS, so it's either that or maybe I just didn't do the same test.
    > 
    > I couldn't belived it until I've checked via taskset 😊 BTW: I don't 
    > have real HW with NVMe , but we might be with worth checking if
    > placing (taskset -c ...) fio on hyperthreading VCPU is not causing
    > (there's /sys/devices/system/cpu/cpu0/topology/thread_siblings and
    > maybe lscpu(1) output). On AWS I have feeling that lscpu might simply
    > lie and I cannot identify which VCPU is HT and which isn't.
    
    Did you see the same issue with io_uring?
    
    
    regards
    
    -- 
    Tomas Vondra
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
    
    
  21. RE: pgcon unconference / impact of block size on performance

    Jakub Wartak <jakub.wartak@tomtom.com> — 2022-06-10T08:52:49Z

    > On 6/9/22 13:23, Jakub Wartak wrote:
    > >>>>>>> The really
    > >>>>>> puzzling thing is why is the filesystem so much slower for
    > >>>>>> smaller pages. I mean, why would writing 1K be 1/3 of writing 4K?
    > >>>>>> Why would a filesystem have such effect?
    > >>>>>
    > >>>>> Ha! I don't care at this point as 1 or 2kB seems too small to
    > >>>>> handle many real world scenarios ;)
    > >>> [..]
    > >>>> Independently of that, it seems like an interesting behavior and it
    > >>>> might tell us something about how to optimize for larger pages.
    > >>>
    > >>> OK, curiosity won:
    > >>>
    > >>> With randwrite on ext4 directio using 4kb the avgqu-sz reaches
    > >>> ~90-100 (close to fio's 128 queue depth?) and I'm getting ~70k IOPS
    > >>> [with maxdepth=128] With randwrite on ext4 directio using 1kb the
    > >>> avgqu-sz is just
    > >> 0.7 and I'm getting just ~17-22k IOPS [with maxdepth=128] ->  conclusion:
    > >> something is being locked thus preventing queue to build up With
    > >> randwrite on
    > >> ext4 directio using 4kb the avgqu-sz reaches ~2.3 (so something is
    > >> queued) and I'm also getting ~70k IOPS with minimal possible
    > >> maxdepth=4 ->  conclusion: I just need to split the lock contention by 4.
    > >>>
    > >>> The 1kB (slow) profile top function is aio_write() -> .... ->
    > >>> iov_iter_get_pages()
    > >> -> internal_get_user_pages_fast() and there's sadly plenty of "lock"
    > >> -> keywords
    > >> inside {related to memory manager, padding to full page size, inode
    > >> locking} also one can find some articles / commits related to it [1]
    > >> which didn't made a good feeling to be honest as the fio is using
    > >> just 1 file (even while I'm on kernel 5.10.x). So I've switched to 4x
    > >> files and numjobs=4 and got easily 60k IOPS, contention solved
    > >> whatever it was :) So I would assume PostgreSQL (with it's splitting
    > >> data files by default on 1GB boundaries and multiprocess
    > >> architecture) should be relatively safe from such ext4 inode(?)/mm(?)
    > contentions even with smallest 1kb block sizes on Direct I/O some day.
    > >>>
    > >>
    > >> Interesting. So what parameter values would you suggest?
    > >
    > > At least have 4x filename= entries and numjobs=4
    > >
    > >> FWIW some of the tests I did were on xfs, so I wonder if that might
    > >> be hitting similar/other bottlenecks.
    > >
    > > Apparently XFS also shows same contention on single file for 1..2kb randwrite,
    > see [ZZZ].
    > >
    > 
    > I don't have any results yet, but after thinking about this a bit I find this really
    > strange. Why would there be any contention with a single fio job?  Doesn't contention 
    > imply multiple processes competing for the same resource/lock etc.?
    
    Maybe 1 job throws a lot of concurrent random I/Os that contend against the same inodes / pages (?) 
     
    > Isn't this simply due to the iodepth increase? IIUC with multiple fio jobs, each
    > will use a separate iodepth value. So with numjobs=4, we'll really use iodepth*4,
    > which can make a big difference.
    
    I was thinking the same (it should be enough to have big queue depth), but apparently one needs many files (inodes?) too:
    
    On 1 file I'm not getting a lot of IOPS on small blocksize (even with numjobs), < 20k IOPS always:
    numjobs=1/ext4/io_uring/nvme/randwrite/128/1k/1.txt:  write: IOPS=13.5k, BW=13.2MiB/s (13.8MB/s)(396MiB/30008msec); 0 zone resets
    numjobs=1/ext4/io_uring/nvme/randwrite/128/4k/1.txt:  write: IOPS=49.1k, BW=192MiB/s (201MB/s)(5759MiB/30001msec); 0 zone resets
    numjobs=1/ext4/libaio/nvme/randwrite/128/1k/1.txt:  write: IOPS=16.8k, BW=16.4MiB/s (17.2MB/s)(494MiB/30001msec); 0 zone resets
    numjobs=1/ext4/libaio/nvme/randwrite/128/4k/1.txt:  write: IOPS=62.5k, BW=244MiB/s (256MB/s)(7324MiB/30001msec); 0 zone resets
    numjobs=1/xfs/io_uring/nvme/randwrite/128/1k/1.txt:  write: IOPS=14.7k, BW=14.3MiB/s (15.0MB/s)(429MiB/30008msec); 0 zone resets
    numjobs=1/xfs/io_uring/nvme/randwrite/128/4k/1.txt:  write: IOPS=46.4k, BW=181MiB/s (190MB/s)(5442MiB/30002msec); 0 zone resets
    numjobs=1/xfs/libaio/nvme/randwrite/128/1k/1.txt:  write: IOPS=22.3k, BW=21.8MiB/s (22.9MB/s)(654MiB/30001msec); 0 zone resets
    numjobs=1/xfs/libaio/nvme/randwrite/128/4k/1.txt:  write: IOPS=59.6k, BW=233MiB/s (244MB/s)(6988MiB/30001msec); 0 zone resets
    numjobs=4/ext4/io_uring/nvme/randwrite/128/1k/1.txt:  write: IOPS=13.9k, BW=13.6MiB/s (14.2MB/s)(407MiB/30035msec); 0 zone resets [FAIL 4*qdepth]
    numjobs=4/ext4/io_uring/nvme/randwrite/128/4k/1.txt:  write: IOPS=52.9k, BW=207MiB/s (217MB/s)(6204MiB/30010msec); 0 zone resets
    numjobs=4/ext4/libaio/nvme/randwrite/128/1k/1.txt:  write: IOPS=17.9k, BW=17.5MiB/s (18.4MB/s)(525MiB/30001msec); 0 zone resets [FAIL 4*qdepth]
    numjobs=4/ext4/libaio/nvme/randwrite/128/4k/1.txt:  write: IOPS=63.3k, BW=247MiB/s (259MB/s)(7417MiB/30001msec); 0 zone resets
    numjobs=4/xfs/io_uring/nvme/randwrite/128/1k/1.txt:  write: IOPS=14.3k, BW=13.9MiB/s (14.6MB/s)(419MiB/30033msec); 0 zone resets [FAIL 4*qdepth]
    numjobs=4/xfs/io_uring/nvme/randwrite/128/4k/1.txt:  write: IOPS=50.5k, BW=197MiB/s (207MB/s)(5917MiB/30010msec); 0 zone resets
    numjobs=4/xfs/libaio/nvme/randwrite/128/1k/1.txt:  write: IOPS=19.6k, BW=19.1MiB/s (20.1MB/s)(574MiB/30001msec); 0 zone resets [FAIL 4*qdepth]
    numjobs=4/xfs/libaio/nvme/randwrite/128/4k/1.txt:  write: IOPS=63.6k, BW=248MiB/s (260MB/s)(7448MiB/30001msec); 0 zone resets
    
    Now with 4 files: It is necessary to have *both* 4 files and bigger processes to get the result, irrespective of IO interface and fs to get closer to at least half of IOPS max
    numjobs=1/ext4/io_uring/nvme/randwrite/128/1k/1.txt:  write: IOPS=28.3k, BW=27.6MiB/s (28.9MB/s)(834MiB/30230msec); 0 zone resets
    numjobs=1/ext4/io_uring/nvme/randwrite/128/4k/1.txt:  write: IOPS=57.8k, BW=226MiB/s (237MB/s)(6772MiB/30001msec); 0 zone resets
    numjobs=1/ext4/libaio/nvme/randwrite/128/1k/1.txt:  write: IOPS=17.3k, BW=16.9MiB/s (17.7MB/s)(506MiB/30001msec); 0 zone resets
    numjobs=1/ext4/libaio/nvme/randwrite/128/4k/1.txt:  write: IOPS=61.6k, BW=240MiB/s (252MB/s)(7215MiB/30001msec); 0 zone resets
    numjobs=1/xfs/io_uring/nvme/randwrite/128/1k/1.txt:  write: IOPS=24.3k, BW=23.8MiB/s (24.9MB/s)(713MiB/30008msec); 0 zone resets
    numjobs=1/xfs/io_uring/nvme/randwrite/128/4k/1.txt:  write: IOPS=54.7k, BW=214MiB/s (224MB/s)(6408MiB/30002msec); 0 zone resets
    numjobs=1/xfs/libaio/nvme/randwrite/128/1k/1.txt:  write: IOPS=22.1k, BW=21.6MiB/s (22.6MB/s)(648MiB/30001msec); 0 zone resets
    numjobs=1/xfs/libaio/nvme/randwrite/128/4k/1.txt:  write: IOPS=65.7k, BW=257MiB/s (269MB/s)(7705MiB/30001msec); 0 zone resets
    numjobs=4/ext4/io_uring/nvme/randwrite/128/1k/1.txt:  write: IOPS=34.1k, BW=33.3MiB/s (34.9MB/s)(999MiB/30020msec); 0 zone resets [OK?]
    numjobs=4/ext4/io_uring/nvme/randwrite/128/4k/1.txt:  write: IOPS=64.5k, BW=252MiB/s (264MB/s)(7565MiB/30003msec); 0 zone resets
    numjobs=4/ext4/libaio/nvme/randwrite/128/1k/1.txt:  write: IOPS=49.7k, BW=48.5MiB/s (50.9MB/s)(1456MiB/30001msec); 0 zone resets [OK]
    numjobs=4/ext4/libaio/nvme/randwrite/128/4k/1.txt:  write: IOPS=67.1k, BW=262MiB/s (275MB/s)(7874MiB/30037msec); 0 zone resets
    numjobs=4/xfs/io_uring/nvme/randwrite/128/1k/1.txt:  write: IOPS=33.9k, BW=33.1MiB/s (34.7MB/s)(994MiB/30026msec); 0 zone resets [OK?]
    numjobs=4/xfs/io_uring/nvme/randwrite/128/4k/1.txt:  write: IOPS=67.7k, BW=264MiB/s (277MB/s)(7933MiB/30007msec); 0 zone resets
    numjobs=4/xfs/libaio/nvme/randwrite/128/1k/1.txt:  write: IOPS=61.0k, BW=59.5MiB/s (62.4MB/s)(1786MiB/30001msec); 0 zone resets [OK]
    numjobs=4/xfs/libaio/nvme/randwrite/128/4k/1.txt:  write: IOPS=69.2k, BW=270MiB/s (283MB/s)(8111MiB/30004msec); 0 zone resets
    
    It makes me thing this looks like some file/inode<->process kind of a locking (reminder: Direct I/O case) -- note that even with files=4 and numjobs=1 it doesn't reach those levels it should. One way or another PostgreSQL should be safe on OLTP - that's the first though, but on 2nd thought - when thinking about extreme IOPS and single-threaded checkpointer / bgwriter / walrecovery on standbys I'm not so sure. In potential future IO API implementations - with Direct I/O (???) - the 1kb, 2kb apparently would seem to be limited unless you parallelize those processes due to some internal kernel locking (sigh! - at least that's what the result the 4 files/numjobs=1/../1k cases indicate; this may vary across kernel versions as per earlier link). 
     
    > >>>
    > >>> Explanation: it's the CPU scheduler migrations mixing the
    > >>> performance result
    > >> during the runs of fio  (as you have in your framework). Various
    > >> VCPUs seem to be having varying max IOPS characteristics (sic!) and
    > >> CPU scheduler seems to be unaware of it. At least on 1kB and 4kB
    > >> blocksize this happens also notice that some VCPUs [XXXX marker]
    > >> don't reach 100% CPU reaching almost twice the result; while cores 0, 3 do
    > reach 100% and lack CPU power to perform more.
    > >> The only thing that I don't get is that it doesn't make sense from
    > >> extened lscpu output (but maybe it's AWS XEN mixing real CPU mappings,
    > who knows).
    > >>
    > >> Uh, that's strange. I haven't seen anything like that, but I'm
    > >> running on physical HW and not AWS, so it's either that or maybe I just didn't
    > do the same test.
    > >
    > > I couldn't belived it until I've checked via taskset 😊 BTW: I don't
    > > have real HW with NVMe , but we might be with worth checking if
    > > placing (taskset -c ...) fio on hyperthreading VCPU is not causing
    > > (there's /sys/devices/system/cpu/cpu0/topology/thread_siblings and
    > > maybe lscpu(1) output). On AWS I have feeling that lscpu might simply
    > > lie and I cannot identify which VCPU is HT and which isn't.
    > 
    > Did you see the same issue with io_uring?
    
    Yes, tested today, got similar results (io_uring doesn’t change a thing and BTW it looks like hypervisor shifts real HW CPUs to logical VCPUs ) After reading this https://wiki.xenproject.org/wiki/Hyperthreading  (section: Is Xen hyperthreading aware), I think solid NVMe testing shouldn't be conducted on anything virtualized - I have no control over potentially noisy CPU-heavy neighbors. So please take my results for with grain of salt, unless somebody reproduces this taskset -c .. fio tests on proper isolated HW, but another thing: that's where PostgreSQL runs in reality.
    
    -J.
    
  22. Re: pgcon unconference / impact of block size on performance

    Tomas Vondra <tomas.vondra@enterprisedb.com> — 2022-06-13T14:06:56Z

    I did a couple tests to evaluate the impact of filesystem overhead and
    block size, so here are some preliminary results. I'm running a more
    extensive set of tests, but some of this seems interesting.
    
    I did two sets of tests:
    
    1) fio test on raw devices
    
    2) fio tests on ext4/xfs with different fs block size
    
    Both sets of tests were executed with varying iodepth (1, 2, 4, ...) and
    number of processes (1, 8).
    
    The results are attached - CSV file with results, and PDF with pivot
    tables showing them in more readable format.
    
    
    1) raw device tests
    
    The results for raw devices have regular patterns, with smaller blocks
    giving better performance - particularly for read workloads. For write
    workloads, it's similar, except that 4K blocks perform better than 1-2K
    ones (this applies especially to the NVMe device).
    
    
    2) fs tests
    
    This shows how the tests perform on ext4/xfs filesystems with different
    block sizes (1K-4K). Overall the patterns are fairly similar to raw
    devices. There are a couple strange things, though.
    
    For example, ext4 often behaves like this on the "write" (i.e.
    sequential write) benchmark:
    
       fs block     1K       2K       4K       8K      16K      32K
      --------------------------------------------------------------
       1024      33374    28290    27286    26453    22341    19568
       2048      33420    38595    75741    63790    48474    33474
       4096      33959    38913    73949    63940    49217    33017
    
    It's somewhat expected that 1-2K blocks perform worse than 4K (the raw
    device behaves the same way), but notice how the behavior differs
    depending on the fs block. For 2k and 4K fs blocks the throughput
    improves, but for 1K blocks it just goes down. For higher iodepth values
    this is even more visible:
    
       fs block     1K      2K       4K       8K      16K     32K
      ------------------------------------------------------------
       1024      34879   25708    24744    23937    22527   19357
       2048      31648   50348   282696   236118   121750   60646
       4096      34273   39890   273395   214817   135072   66943
    
    The interesting thing is xfs does not have this issue.
    
    Furthermore, it seems interesting to compare iops on a filesystem to the
    raw device, which might be seen as "best case" without the fs overhead.
    The "comparison" attachmens do exactly that.
    
    There are two interesting observations, here:
    
    1) ext4 seems to have some issue with 1-2K random writes (randrw and
    randwrite tests) with larger 2-4K filesystem blocks. Consider for
    example this:
    
       fs block      1K        2K        4K        8K      16K      32K
      ------------------------------------------------------------------
       1024      214765    143564    108075     83098    58238    38569
       2048       66010    216287    260116    214541   113848    57045
       4096       66656     64155    268141    215860   109175    54877
    
    Agian, the xfs does not behave like this.
    
    2) Interestingly enough, compe cases can actually perform better on a
    filesystem than directly on the raw device - I'm not sure what's the
    explanation, but it only happens on the SSD RAID (not on the NVMe), and
    with higher iodepth values.
    
    
    regards
    
    -- 
    Tomas Vondra
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  23. Re: pgcon unconference / impact of block size on performance

    Merlin Moncure <mmoncure@gmail.com> — 2022-06-13T15:42:31Z

    On Sat, Jun 4, 2022 at 6:23 PM Tomas Vondra <tomas.vondra@enterprisedb.com>
    wrote:
    
    > Hi,
    >
    > At on of the pgcon unconference sessions a couple days ago, I presented
    > a bunch of benchmark results comparing performance with different
    > data/WAL block size. Most of the OLTP results showed significant gains
    > (up to 50%) with smaller (4k) data pages.
    >
    
    Wow.  Random numbers are fantastic,  Significant reduction in sequential
    throughput is a little painful though, I see 40% reduction in some cases if
    I'm reading that right.  Any thoughts on why that's the case?  Are there
    mitigations possible?
    
    merlin
    
  24. Re: pgcon unconference / impact of block size on performance

    Tomas Vondra <tomas.vondra@enterprisedb.com> — 2022-06-13T16:05:38Z

    On 6/13/22 17:42, Merlin Moncure wrote:
    > On Sat, Jun 4, 2022 at 6:23 PM Tomas Vondra
    > <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>>
    > wrote:
    > 
    >     Hi,
    > 
    >     At on of the pgcon unconference sessions a couple days ago, I presented
    >     a bunch of benchmark results comparing performance with different
    >     data/WAL block size. Most of the OLTP results showed significant gains
    >     (up to 50%) with smaller (4k) data pages.
    > 
    > 
    > Wow.  Random numbers are fantastic,  Significant reduction in sequential
    > throughput is a little painful though, I see 40% reduction in some cases
    > if I'm reading that right.  Any thoughts on why that's the case?  Are
    > there mitigations possible?
    > 
    
    I think you read that right - given a fixed I/O depth, the throughput
    for sequential access gets reduced. Consider for example the attached
    chart with sequential read/write results for the Optane 900P. The IOPS
    increases for smaller blocks, but not enough to compensate for the
    bandwidth drop.
    
    Regarding the mitigations - I think prefetching (read-ahead) should do
    the trick. Just going to iodepth=2 mostly makes up for the bandwidth
    difference. You might argue prefetching would improve the random I/O
    results too, but I don't think that's the same thing - read-ahead for
    sequential workloads is much easier to implement (even transparently).
    
    
    regards
    
    -- 
    Tomas Vondra
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company