Re: Proposal: Adding compression of temporary files

Tomas Vondra <tomas@vondra.me> — 2026-05-12T14:13:57Z
On 5/11/26 09:09, Filip Janus wrote:
> 
> 
> Hi Tomas,
> 
> Thanks for the thorough benchmark and the script -- it was very helpful
> as a starting point for my testing. I understand the results on
> your machine were discouraging, and I appreciate the honest assessment.
> 
> I ran a similar benchmark on different x86_64 hardware to see how the
> results change under more I/O pressure. The short version: lz4 and
> zstd show significant speedups once storage or page cache becomes a
> bottleneck.
> 

I'm glad you didn't just give up, and decided to run some more tests.

> Setup
> -----
> 
> I used your run-hashjoins.sh as a base, with the same parameters:
> 100M rows, d in {1, 10, 100, 1000}, w in {1, 4, 8}, drop-caches
> between runs. I also added zstd to the compression methods tested,
> and tested with a larger compression block size (32 KB instead of
> the default 8 KB BLCKSZ).
> 
> Two x86_64 machines:
> 
>   (A) HPE BL460c Gen10, 2x Xeon Gold 6148, 64 GB RAM,
>       rotational HDD (5 disks), io_uring, Fedora 43
> 
>   (B) Dell MX840c, Xeon Gold 6148, SATA SSD (~224 GB),
>       RAM capped to 16 GB via systemd MemoryMax
> 
> Both use 32 KB compression blocks (COMPRESS_BLCKSZ = 4*BLCKSZ).
> 

What is COMPRESS_BLCKSZ? I don't see that in the patch anywhere. What am
I missing?

> Results
> -------
> 
> Below are the relative timings (% of uncompressed baseline), directly
> comparable to your table. Values below 100% mean compression is faster.
> 
> Your results (Xeon, 64 GB, SSD/NVMe, 8 KB blocks):
> 
>                      pglz              lz4
>   rows  rep    1    4    8       1    4    8
>   -------------------------------------------------
>    10     1  661  688  300     144  148   86
>    10  1000  460  472  234     119  119   58
>   100     1  471  303  204     132  135  102
>   100  1000  378  262  164     107   91   81
> 
> Our results, machine A -- x86 HDD, 64 GB, 32 KB blocks:
> 
>                      pglz              lz4              zstd
>   rows  rep    1    4    8       1    4    8       1    4    8
>   ----------------------------------------------------------------
>   100     1  200  119   69      91   82   67      80   50   35
>   100    10  204  101   70      91   64   66      83   44   39
>   100   100  220  104   72      94   75   69      85   50   34
>   100  1000  170   92   54      79   58   52      74   42   28
> 
> Our results, machine B -- x86 SATA SSD, 16 GB cap, 32 KB blocks:
> 
>                      pglz              lz4              zstd
>   rows  rep    1    4    8       1    4    8       1    4    8
>   ----------------------------------------------------------------
>   100     1  284  103   79      92   81   82      98   59   53
>   100    10  262   99   77      92   80   85      96   57   50
>   100   100  221   89   67      80   70   64      85   49   44
>   100  1000  155   51   42      72   39   39      77   27   29
> 
> Analysis
> --------
> 
> I think the key difference is page cache pressure. Your machine has
> 64 GB RAM with 8 GB shared_buffers, leaving ~56 GB for the OS page
> cache. Even with 8 connections x ~10 GB temp files = ~80 GB, a large
> portion stays cached and synchronous I/O to storage is limited.
> 
> On our machines, I/O is a real bottleneck:
>   - Machine A: rotational HDD with 8 concurrent streams
>   - Machine B: SATA SSD but only 16 GB RAM, so the page cache
>     cannot absorb 8 x 12 GB of temp data
> 
> Under these conditions, reducing the bytes written translates
> directly into wall-clock savings.
> 

That seems to be it. It's not a huge surprise that this matters more on
systems with memory pressure and slower storage. I should have tested
that on my machines too.
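
To make the page cache arithmetic explicit, here is a back-of-the-envelope
sketch in Python. The function name is made up, and shared_buffers for
machine B is not stated anywhere above, so I assume 0 there (the most
optimistic case). It ignores everything else competing for memory, so
treat the numbers as rough estimates only.

```python
def uncached_temp_gb(ram_gb, shared_buffers_gb, sessions, temp_gb_per_session):
    """Rough estimate of temp file data (GB) that cannot fit in the OS page cache."""
    # Assumption: all RAM not used by shared_buffers is available as page cache.
    page_cache_gb = max(ram_gb - shared_buffers_gb, 0)
    total_temp_gb = sessions * temp_gb_per_session
    return max(total_temp_gb - page_cache_gb, 0)


# My machine: 64 GB RAM, 8 GB shared_buffers, 8 sessions x ~10 GB of temp files.
print(uncached_temp_gb(64, 8, 8, 10))   # 24

# Machine B: 16 GB cap, 8 sessions x 12 GB; shared_buffers unknown, 0 assumed.
print(uncached_temp_gb(16, 0, 8, 12))   # 80
```

So by this estimate roughly 24 GB out of 80 GB has to hit storage on my
machine, versus about 80 GB out of 96 GB on machine B.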

I was going to question how common such systems are nowadays, when
people can just spin up a VM with plenty of RAM and SSDs. But given the
current RAM shortage / costs, and relatively slow network storage (even
if temporary files can use ephemeral disks), maybe it's not all that
uncommon ...

> Both your results and ours confirm that pglz is simply too slow for
> this use case. Your benchmark shows 164-688% overhead; ours shows
> 155-284% with w=1. Even under heavy I/O contention (w=8 on HDD)
> where pglz eventually wins, it never outperforms lz4 or zstd. I
> would recommend against offering pglz for temp file compression
> altogether -- it creates a trap for users who might try it expecting
> reasonable performance.
> 

Right.

> lz4 looks safe: the worst case in our data is 94% (w=1, d=100 on
> HDD) -- barely distinguishable from noise. Under I/O pressure it
> delivers 39-52% of baseline time (2-2.5x speedup).
> 
> zstd is the most compelling option: it achieves the best compression
> ratios (down to 22% of original size on the SATA SSD) and the best
> speedups (27-28% of baseline = 3.5x faster), with no regression
> exceeding 98% on x86_64. I would recommend zstd as the primary
> option to document, with lz4 as a lighter-weight alternative.
> 

Agreed. lz4 seems safe, zstd is good too. I wonder how much this depends
on the particular data set (e.g. if we generate data differently, how
much would it affect the results).
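
For reference, converting the "% of baseline" numbers from the tables
into speedup factors is just 100 / pct. A tiny Python sketch (the helper
name is mine, the percentages are the d=1000 rows from the tables above):

```python
def speedup(relative_time_pct):
    """Convert a timing given as % of the uncompressed baseline into a speedup factor."""
    return 100.0 / relative_time_pct


# Values taken from the d=1000 rows of the machine A / machine B tables.
samples = [
    ("lz4,  machine A, w=8", 52),
    ("lz4,  machine B, w=8", 39),
    ("zstd, machine A, w=8", 28),
    ("zstd, machine B, w=4", 27),
]

for label, pct in samples:
    print(f"{label}: {pct}% of baseline -> {speedup(pct):.2f}x")
```

Anything above 100% comes out below 1.0x, i.e. a slowdown.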

> Compression block size
> ----------------------
> 
> I also tested 8 KB, 32 KB, and 64 KB compression block sizes.
> 32 KB appears to be the sweet spot. Example for lz4, d=1000, w=8
> on HDD:
> 
>    COMPRESS_BLCKSZ    time (% of no)    compressed bytes
>    --------------------------------------------------------
>     8 KB (BLCKSZ)         58%             7.47 GB
>    32 KB (4*BLCKSZ)       52%             7.22 GB
>    64 KB (8*BLCKSZ)       56%             7.14 GB
> 
> The 8K-to-32K improvement comes from fewer compress/decompress calls
> (4x fewer), less per-block header overhead, and better compression
> ratios. Going to 64K shows diminishing returns and slightly worse
> timings, possibly due to increased cache pressure.
> 

I'm still not quite sure what "compression block size" means here, and
how you changed it.
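
My guess (and it is only a guess, since I don't see COMPRESS_BLCKSZ in
the patch) is that it's the amount of data handed to each compress call.
If so, the effect can be sketched in Python like this. Note that zlib is
only a stand-in here because it ships with Python; it is not lz4 or
zstd, the data is synthetic, and the helper name is made up, so the
numbers say nothing about the actual patch.

```python
import zlib

BLCKSZ = 8192  # PostgreSQL default block size


def compress_in_blocks(data: bytes, block_size: int):
    """Compress data as independent blocks; return (number of calls, compressed bytes)."""
    calls = 0
    compressed = 0
    for off in range(0, len(data), block_size):
        # Each block is compressed on its own, with no shared dictionary.
        compressed += len(zlib.compress(data[off:off + block_size], 1))
        calls += 1
    return calls, compressed


# Synthetic, repetitive records standing in for temp file contents (made up).
data = b"".join(b"%010d|some padding text|" % (i % 1000) for i in range(100_000))

for mult in (1, 4, 8):
    calls, size = compress_in_blocks(data, mult * BLCKSZ)
    print(f"{mult * BLCKSZ // 1024:>2} KB blocks: {calls} calls, {size} bytes")
```

If that's what the parameter does, larger blocks mean fewer calls and
more context per call, which would match the explanation above. But I'd
like that confirmed.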

> Conclusion
> ----------
> 
> I think the data shows that the benefit of temporary file compression
> depends heavily on the I/O characteristics of the system. On machines
> with fast storage and ample page cache, compression is neutral -- it
> means negligible overhead, which is a good outcome on its own. On
> systems with real I/O pressure -- slower storage, limited RAM, or
> concurrent workloads competing for page cache -- compression delivers
> substantial speedups.
> 

True.

> The feature does not need to be enabled by default. Compression is
> controlled by the temp_file_compression GUC, which defaults to "none".
> That means there is no risk of regression for existing users. But for
> administrators who know their systems are I/O-constrained -- spinning
> disks, limited memory, heavy concurrent spilling -- having the option
> to enable lz4 or zstd can make a real difference. The data above shows
> up to 3.5x speedup in those scenarios, with no downside when the
> setting is left at its default.
> 

Yes, having it as opt-in for systems where it matters helps.

What bothers me a little bit is that systems generally are not under
such pressure 24/7, but only for some part of a day. But people will
mostly set the GUC in the config file. I don't have a better solution to
this, though.


FYI I won't be able to do much work on this until ~June.


regards

-- 
Tomas Vondra