Thread

  1. Re: Proposal: Adding compression of temporary files

    Tomas Vondra <tomas@vondra.me> — 2026-05-12T14:13:57Z

    On 5/11/26 09:09, Filip Janus wrote:
    > 
    > 
    > Hi Tomas,
    > 
    > Thanks for the thorough benchmark and the script -- it was very helpful
    > as a starting point for my testing. I understand the results on
    > your machine were discouraging, and I appreciate the honest assessment.
    > 
    > I ran a similar benchmark on different x86_64 hardware to see how the
    > results change under more I/O pressure. The short version: lz4 and
    > zstd show significant speedups once storage or page cache becomes a
    > bottleneck.
    > 
    
    I'm glad you didn't just give up, and decided to run some more tests instead.
    
    > Setup
    > -----
    > 
    > I used your run-hashjoins.sh as a base, with the same parameters:
    > 100M rows, d in {1, 10, 100, 1000}, w in {1, 4, 8}, drop-caches
    > between runs. I also added zstd to the compression methods tested,
    > and tested with a larger compression block size (32 KB instead of
    > the default 8 KB BLCKSZ).
    > 
    > Two x86_64 machines:
    > 
    >   (A) HPE BL460c Gen10, 2x Xeon Gold 6148, 64 GB RAM,
    >       rotational HDD (5 disks), io_uring, Fedora 43
    > 
    >   (B) Dell MX840c, Xeon Gold 6148, SATA SSD (~224 GB),
    >       RAM capped to 16 GB via systemd MemoryMax
    > 
    > Both use 32 KB compression blocks (COMPRESS_BLCKSZ = 4*BLCKSZ).
    > 
    
    What is COMPRESS_BLCKSZ? I don't see that in the patch anywhere. What am
    I missing?
    
    > Results
    > -------
    > 
    > Below are the relative timings (% of uncompressed baseline), directly
    > comparable to your table. Values below 100% mean compression is faster.
    > 
    > Your results (Xeon, 64 GB, SSD/NVMe, 8 KB blocks):
    > 
    >                      pglz              lz4
    >   rows  rep    1    4    8       1    4    8
    >   -------------------------------------------------
    >    10     1  661  688  300     144  148   86
    >    10  1000  460  472  234     119  119   58
    >   100     1  471  303  204     132  135  102
    >   100  1000  378  262  164     107   91   81
    > 
    > Our results, machine A -- x86 HDD, 64 GB, 32 KB blocks:
    > 
    >                      pglz              lz4              zstd
    >   rows  rep    1    4    8       1    4    8       1    4    8
    >   ----------------------------------------------------------------
    >   100     1  200  119   69      91   82   67      80   50   35
    >   100    10  204  101   70      91   64   66      83   44   39
    >   100   100  220  104   72      94   75   69      85   50   34
    >   100  1000  170   92   54      79   58   52      74   42   28
    > 
    > Our results, machine B -- x86 SATA SSD, 16 GB cap, 32 KB blocks:
    > 
    >                      pglz              lz4              zstd
    >   rows  rep    1    4    8       1    4    8       1    4    8
    >   ----------------------------------------------------------------
    >   100     1  284  103   79      92   81   82      98   59   53
    >   100    10  262   99   77      92   80   85      96   57   50
    >   100   100  221   89   67      80   70   64      85   49   44
    >   100  1000  155   51   42      72   39   39      77   27   29
    > 
    > Analysis
    > --------
    > 
    > I think the key difference is page cache pressure. Your machine has
    > 64 GB RAM with 8 GB shared_buffers, leaving ~56 GB for the OS page
    > cache. Even with 8 connections x ~10 GB temp files = ~80 GB, a large
    > portion stays cached and synchronous I/O to storage is limited.
    > 
    > On our machines, I/O is a real bottleneck:
    >   - Machine A: rotational HDD with 8 concurrent streams
    >   - Machine B: SATA SSD but only 16 GB RAM, so the page cache
    >     cannot absorb 8 x 12 GB of temp data
    > 
    > Under these conditions, reducing the bytes written translates
    > directly into wall-clock savings.
    > 
    
    It seems so. It's not a huge surprise that this matters more on
    systems with memory pressure and slower storage. I should have tested
    that on my machines too.
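    
    To put rough numbers on it, here is a back-of-the-envelope sketch in
    Python comparing the temp file volume to the memory available for
    caching it. The helper name is made up for illustration. For machine B
    I treat the whole 16 GB cap as page cache, because shared_buffers is
    not stated above, so that ratio is a lower bound.

```python
def temp_to_cache_ratio(backends: int, temp_gb_each: float, cache_gb: float) -> float:
    """Ratio of total temp file volume to memory available for caching it."""
    return backends * temp_gb_each / cache_gb


# My machine: 64 GB RAM minus 8 GB shared_buffers, 8 backends x ~10 GB each.
mine = temp_to_cache_ratio(8, 10, 64 - 8)

# Machine B: 16 GB cap (assumed to be all page cache), 8 backends x 12 GB each.
machine_b = temp_to_cache_ratio(8, 12, 16)

print(f"mine:      temp files are about {mine:.1f}x the cache")
print(f"machine B: temp files are at least {machine_b:.1f}x the cache")
```

    So on my machine the temp files overflow the cache by well under 2x,
    while on machine B they exceed it by at least 6x.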
    
    I was going to question how common such systems are nowadays, when
    people can just spin up a VM with plenty of RAM and SSDs. But given the
    current RAM shortage / costs, and relatively slow network storage (even
    if temporary files can use ephemeral disks), maybe it's not all that
    uncommon ...
    
    > Both your results and ours confirm that pglz is simply too slow for
    > this use case. Your benchmark shows 164-688% of the baseline time; ours
    > shows 155-284% with w=1. Even under heavy I/O contention (w=8 on HDD)
    > where pglz eventually wins, it never outperforms lz4 or zstd. I
    > would recommend against offering pglz for temp file compression
    > altogether -- it creates a trap for users who might try it expecting
    > reasonable performance.
    > 
    
    Right.
    
    > lz4 looks safe: the worst case in our data is 94% (w=1, d=100 on
    > HDD) -- barely distinguishable from noise. Under I/O pressure it
    > delivers 39-52% of baseline time (2-2.5x speedup).
    > 
    > zstd is the most compelling option: it achieves the best compression
    > ratios (down to 22% of original size on the SATA SSD) and the best
    > speedups (27-28% of baseline, roughly 3.5x faster), with a worst case
    > of 98% of baseline on x86_64, i.e. no regression. I would recommend
    > zstd as the primary option to document, with lz4 as a lighter-weight
    > alternative.
    > 
    
    Agreed. lz4 seems safe, zstd is good too. I wonder how much this depends
    on the particular data set (e.g. if we generate data differently, how
    much would it affect the results).
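    
    For reference, converting the relative timings into speedup factors is
    just 100 / pct. A quick sketch, with the percentages taken from the
    tables above (the function name and labels are mine):

```python
def speedup(pct_of_baseline: float) -> float:
    """Convert a timing given as % of the uncompressed baseline to a speedup."""
    return 100.0 / pct_of_baseline


# (label, % of baseline) pairs read off the tables for machines A and B.
samples = [
    ("lz4, best case", 39),
    ("lz4, w=8 on HDD, d=1000", 52),
    ("zstd, best case", 27),
    ("lz4, worst case", 94),
    ("zstd, worst case", 98),
]

for label, pct in samples:
    print(f"{label}: {pct}% of baseline -> {speedup(pct):.2f}x")
```

    Anything above 100% gives a factor below 1.0, i.e. a slowdown.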
    
    > Compression block size
    > ----------------------
    > 
    > I also tested 8 KB, 32 KB, and 64 KB compression block sizes.
    > 32 KB appears to be the sweet spot. Example for lz4, d=1000, w=8
    > on HDD:
    > 
    >    COMPRESS_BLCKSZ    time (% of uncompressed)    compressed bytes
    >    ----------------------------------------------------------------
    >     8 KB (BLCKSZ)              58%                   7.47 GB
    >    32 KB (4*BLCKSZ)            52%                   7.22 GB
    >    64 KB (8*BLCKSZ)            56%                   7.14 GB
    > 
    > The 8K-to-32K improvement comes from fewer compress/decompress calls
    > (4x fewer), less per-block header overhead, and better compression
    > ratios. Going to 64K shows diminishing returns and slightly worse
    > timings, possibly due to increased cache pressure.
    > 
    
    I'm still not quite sure what "compression block size" means here, and
    how you changed it.
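    
    My guess, and it is only a guess since I don't see COMPRESS_BLCKSZ in
    the patch, is that it means the size of the chunk passed to a single
    compress call. A minimal Python sketch of that reading, using zlib from
    the standard library as a stand-in for lz4/zstd and a made-up 4-byte
    length prefix as framing (neither is what the patch does):

```python
import struct
import zlib

BLCKSZ = 8192  # PostgreSQL's default block size


def compress_blocks(data: bytes, block_size: int) -> bytes:
    """Compress data in independent chunks of block_size bytes.

    Each chunk is stored as a 4-byte little-endian length followed by the
    compressed payload. This framing is hypothetical.
    """
    out = bytearray()
    for off in range(0, len(data), block_size):
        payload = zlib.compress(data[off:off + block_size])
        out += struct.pack("<I", len(payload))
        out += payload
    return bytes(out)


def decompress_blocks(blob: bytes) -> bytes:
    """Reverse compress_blocks by walking the length-prefixed chunks."""
    out = bytearray()
    pos = 0
    while pos < len(blob):
        (length,) = struct.unpack_from("<I", blob, pos)
        pos += 4
        out += zlib.decompress(blob[pos:pos + length])
        pos += length
    return bytes(out)


def count_blocks(data_len: int, block_size: int) -> int:
    """Number of compress calls needed for data_len bytes."""
    return -(-data_len // block_size)  # ceiling division
```

    With 1 MB of input this needs 128 compress calls at 8 KB and 32 calls
    at 4*BLCKSZ, which would match the "4x fewer" calls mentioned above.
    If the patch does something else, please correct me.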
    
    > Conclusion
    > ----------
    > 
    > I think the data shows that the benefit of temporary file compression
    > depends heavily on the I/O characteristics of the system. On machines
    > with fast storage and ample page cache, compression is neutral -- it
    > adds negligible overhead, which is a good outcome on its own. On
    > systems with real I/O pressure -- slower storage, limited RAM, or
    > concurrent workloads competing for page cache -- compression delivers
    > substantial speedups.
    > 
    
    True.
    
    > The feature does not need to be enabled by default. Compression is
    > controlled by the temp_file_compression GUC, which defaults to "none".
    > That means there is no risk of regression for existing users. But for
    > administrators who know their systems are I/O-constrained -- spinning
    > disks, limited memory, heavy concurrent spilling -- having the option
    > to enable lz4 or zstd can make a real difference. The data above shows
    > up to a 3.5x speedup in those scenarios, with no downside when the
    > setting is left at its default.
    > 
    
    Yes, having it as opt-in for systems where it matters helps.
    
    What bothers me a little bit is that systems generally are not under
    such pressure 24/7, but only for part of the day. But people will
    mostly set the GUC in the config file, so it stays enabled all the
    time. I don't have a better solution to this, though.
    
    
    FYI I won't be able to do much work on this until ~June.
    
    
    regards
    
    -- 
    Tomas Vondra