Thread
-
Re: Proposal: Adding compression of temporary files
Tomas Vondra <tomas@vondra.me> — 2026-05-12T14:13:57Z
On 5/11/26 09:09, Filip Janus wrote: > > > Hi Tomas, > > Thanks for the thorough benchmark and the script -- it was very helpful > as a starting point for my testing. I understand the results on > your machine were discouraging, and I appreciate the honest assessment. > > I ran a similar benchmark on different x86_64 hardware to see how the > results change under more I/O pressure. The short version: lz4 and > zstd show significant speedups once storage or page cache becomes a > bottleneck. > I'm glad you didn't just give up and decided to run some more tests. > Setup > ----- > > I used your run-hashjoins.sh as a base, with the same parameters: > 100M rows, d in {1, 10, 100, 1000}, w in {1, 4, 8}, drop-caches > between runs. I also added zstd to the compression methods tested, > and tested with a larger compression block size (32 KB instead of > the default 8 KB BLCKSZ). > > Two x86_64 machines: > > (A) HPE BL460c Gen10, 2x Xeon Gold 6148, 64 GB RAM, > rotational HDD (5 disks), io_uring, Fedora 43 > > (B) Dell MX840c, Xeon Gold 6148, SATA SSD (~224 GB), > RAM capped to 16 GB via systemd MemoryMax > > Both use 32 KB compression blocks (COMPRESS_BLCKSZ = 4*BLCKSZ). > What is COMPRESS_BLCKSZ? I don't see that in the patch anywhere. What am I missing? > Results > ------- > > Below are the relative timings (% of uncompressed baseline), directly > comparable to your table. Values below 100% mean compression is faster. > > Your results (Xeon, 64 GB, SSD/NVMe, 8 KB blocks): > > pglz lz4 > rows rep 1 4 8 1 4 8 > ------------------------------------------------- > 10 1 661 688 300 144 148 86 > 10 1000 460 472 234 119 119 58 > 100 1 471 303 204 132 135 102 > 100 1000 378 262 164 107 91 81 > > Our results, machine A -- x86 HDD, 64 GB, 32 KB blocks: > > pglz lz4 zstd > rows rep 1 4 8 1 4 8 1 4 8 > ---------------------------------------------------------------- > 100 1 200 119 69 91 82 67 80 50 35 > 100 10 204 101 70 91 64 66 83 44 39 > 100 100 220 104 72 94 75 69 85 50 34 > 100 1000 170 92 54 79 58 52 74 42 28 > > Our results, machine B -- x86 SATA SSD, 16 GB cap, 32 KB blocks: > > pglz lz4 zstd > rows rep 1 4 8 1 4 8 1 4 8 > ---------------------------------------------------------------- > 100 1 284 103 79 92 81 82 98 59 53 > 100 10 262 99 77 92 80 85 96 57 50 > 100 100 221 89 67 80 70 64 85 49 44 > 100 1000 155 51 42 72 39 39 77 27 29 > > Analysis > -------- > > I think the key difference is page cache pressure. Your machine has > 64 GB RAM with 8 GB shared_buffers, leaving ~56 GB for the OS page > cache. Even with 8 connections x ~10 GB temp files = ~80 GB, a large > portion stays cached and synchronous I/O to storage is limited. > > On our machines, I/O is a real bottleneck: > - Machine A: rotational HDD with 8 concurrent streams > - Machine B: SATA SSD but only 16 GB RAM, so the page cache > cannot absorb 8 x 12 GB of temp data > > Under these conditions, reducing the bytes written translates > directly into wall-clock savings. > Seems like that. It's not a huge surprise that this matters more on systems with memory pressure and slower storage. I should have tested that on my machines too. I was going to question how common such systems are nowadays, when people can just spin a VM with plenty of RAM and SSDs. But given the current RAM shortage / costs, and relatively slow network storage (even if temporary files can use ephemeral disks), maybe it's not all that uncommon ... > Both your results and ours confirm that pglz is simply too slow for > this use case. Your benchmark shows 164-688% overhead; ours shows > 155-284% with w=1. Even under heavy I/O contention (w=8 on HDD) > where pglz eventually wins, it never outperforms lz4 or zstd. I > would recommend against offering pglz for temp file compression > altogether -- it creates a trap for users who might try it expecting > reasonable performance. > Right. > lz4 looks safe: the worst case in our data is 94% (w=1, d=100 on > HDD) -- barely distinguishable from noise. Under I/O pressure it > delivers 39-52% of baseline time (2-2.5x speedup). > > zstd is the most compelling option: it achieves the best compression > ratios (down to 22% of original size on the SATA SSD) and the best > speedups (27-28% of baseline = 3.5x faster), with no regression > exceeding 98% on x86_64. I would recommend zstd as the primary > option to document, with lz4 as a lighter-weight alternative. > Agreed. lz4 seems safe, zstd is good too. I wonder how much this depends on the particular data set (e.g. if we generate data differently, how much would it affect the results). > Compression block size > ---------------------- > > I also tested 8 KB, 32 KB, and 64 KB compression block sizes. > 32 KB appears to be the sweet spot. Example for lz4, d=1000, w=8 > on HDD: > > COMPRESS_BLCKSZ time (% of no) compressed bytes > -------------------------------------------------------- > 8 KB (BLCKSZ) 58% 7.47 GB > 32 KB (4*BLCKSZ) 52% 7.22 GB > 64 KB (8*BLCKSZ) 56% 7.14 GB > > The 8K-to-32K improvement comes from fewer compress/decompress calls > (4x fewer), less per-block header overhead, and better compression > ratios. Going to 64K shows diminishing returns and slightly worse > timings, possibly due to increased cache pressure. > I'm still not quite sure what "compression block size" means here, and how did you change it. > Conclusion > ---------- > > I think the data shows that the benefit of temporary file compression > depends heavily on the I/O characteristics of the system. On machines > with fast storage and ample page cache, compression is neutral -- it > means negligible overhead, which is a good outcome on its own. On > systems with real I/O pressure -- slower storage, limited RAM, or > concurrent workloads competing for page cache -- compression delivers > substantial speedups. > True. > The feature does not need to be enabled by default. Compression is > controlled by the temp_file_compression GUC, which defaults to "none". > That means there is no risk of regression for existing users. But for > administrators who know their systems are I/O-constrained -- spinning > disks, limited memory, heavy concurrent spilling -- having the option > to enable lz4 or zstd can make a real difference. The data above shows > up to 3.5x speedup in those scenarios, with no > downside when the setting is left at its default. > Yes, having it as opt-in for systems where it matters helps. What bothers me a little bit is that systems generally are not under such pressure 24/7, but only for some part of a day. But people will mostly set the GUC in the config file. I don't have a better solution to this, though. FYI I won't be able to do much work on this until ~June. regards -- Tomas Vondra