Thread
-
Re: Improve pg_stat_statements scalability
Sami Imseih <samimseih@gmail.com> — 2026-05-30T02:15:31Z
>> I've created a new wiki page combining the prior 2025 discussion, and >> notes from today: >> >> https://wiki.postgresql.org/wiki/Scalability_of_pg_stat_statements >Thank you, Lukas! Thanks for the feedback on v2, and for the productive unconference discussion in PGConf.dev. I've been iterating on the patches/benchmarks and wanted to post an updated series that addresses the major design points raised. Addressing the feedback from the unconference session: 1/ Eviction The concern in the design presented in which entries are skipped until a single threaded eviction completed was that on a steady workload with max=5000, the 5001st unique query would always be immediately evicted. This was considered unacceptable, and rightly so. It may be fine for high churn, pathological cases, but for slow churn cases it would be unacceptable. Andres mentioned that eviction should occur in parallel, which I understood as multiple backends should evict a subset of the entries in the dshash concurrently. To do this, I implemented a parallel clock-sweep in which eviction sweeps a partition decrementing refcounts, and only entries that have decayed to zero are evicted. A genuinely new query that just arrived has a fresh refcount and survives the sweep, and if it becomes popular in the meantime, it will survive for longer. So the algorithm wants an entry to prove itself to remain considered "hot", else it's swept. There's a 5% headroom using the already existing USAGE_DEALLOC_PERCENT in which we evict until count (across all partitions) drops to 95% of max, so there's always room for new arrivals before the next sweep. I chose not to use a background worker for eviction, as discussed earlier, based on the consensus that backpressure is important, and a background worker being asynchronous in nature will not provide that. Also, it could be complicated where a background worker cannot be spun up for whatever reason. However, to implement parallel eviction as described above, I needed 2 core changes: - pgstat_drop_entry must be able to optionally tolerate a dropped entry, as there could be a delay between the time an entry is marked as dropped and garbage collection, and within that time multiple evictions may attempt to drop the same entry. - Implement a seq scan API for dshash that scans a specific partition. This will allow eviction to cycle through partitions, i.e. clock-sweep. The eviction does not attempt to make room in the bucket for an entry that triggered it. The main point is to keep forward progress in making room. We don't need to be more strict here. Also the fact that pg_stat_statements.max is dynamic means a user can increase this value to manage high churn without a restart. The dealloc counter in pg_stat_statements_info now counts individual entries evicted rather than the number of times eviction was invoked. I think this is more useful, but it does change the semantics between versions. With the new design, a single eviction pass can remove many entries across a partition, so counting evictions no longer tells you much. Open to other thoughts here on this. Profiling revealed that pgstat_request_entry_refs_gc(), which is the standard pattern for less frequently removed entries in other places, was too expensive when called on every entry drop or every partition sweep. Now it is only called once per full rotation across all partitions, which showed much better results in benchmarks. 2/ DSA only query text storage Rather than a "performance cliff" that Lukas mentioned above when we switch between DSA and disk for query text storage, the consensus is to just store all query text in memory. Andres made a point that even now, transient memory usage for loading the query text for the purpose of garbage-collection or reading the query stats in pg_stat_statements means that a user's machine must have enough memory to handle this. So, why not just throw all query text in memory. The memory is capped by a new GUC pg_stat_statements.query_text_memory (default of 4MB but up for discussion). It is also possible to store more entries due to .max than .query_text_memory can support, so empty query text columns could be possible. In the case a user increases .query_text_memory if they observe empty query columns, the next time an entry is touched, it will backfill the query text and normalize the string if it can. The last part could be improved and actually guaranteed if we make JumbleState available to all hooks (I did not work on this part, but open for discussion). Some other comments from Lukas's earlier review: > It appears you've moved the equivalent of the "if (!entry)" check > into the pgss_store_query_text function, and we now unconditionally > call generate_normalized_query. Fixed. v3 series is now 5 patches: 0001: pgstat: Introduce pg_stat_report_anytime() Nothing changes from v2. 0002: pgstat: tolerate already-dropped entries in pgstat_drop_entry() Required for the parallel eviction design. With multiple backends sweeping different partitions concurrently, the same entry can be targeted for drop more than once before garbage collection runs. A skip_dropped flag makes this safe rather than throwing ERROR. 0003: dshash: add partition-scoped sequential scan Adds dshash_seq_init_partition() to restrict a scan to a single partition. This is the building block for per-partition clock-sweep; a backend only locks and sweeps one partition at a time. 0004: pg_stat_statements: modernize entry storage with pgstat kind The main patch. Replaces ShmemInitHash with dshash via DSM registry, and replaces per-entry spinlock counter updates with a custom pgstat kind that uses the core pgstat infrastructure. Eviction changes from qsort-all-entries to clock-sweep with an atomic rotating hand. Each entry carries a refcount (capped at 10) that decays on sweep; entries reaching zero are evicted. Hot queries keep their refcount topped up proportionally to access frequency. pg_stat_statements.max becomes PGC_SIGHUP. 0005: pg_stat_statements: store query text in DSA instead of file Moves query text from pgss_query_texts.stat into a DSA area via GetNamedDSA. Adds pg_stat_statements.query_text_memory (PGC_SIGHUP, default 4MB) controlling DSA size. Eliminates the GC machinery entirely. When DSA is exhausted, entries are still tracked but query text is stored as NULL. A backfill mechanism recovers text on subsequent executions once space becomes available. Benchmark: Attached are the benchmark scripts I used for v3 (and will keep using going forward) with the results in the benchmark_v3.txt file comparing patch vs upstream. The benchmark performs various workloads: "high churn", "light churn", "multi stmt", and simple "select1". I am also tracking query retention (hot/cold entry retention) to verify the clock-sweep behaves as expected. I attached the .sql scripts used to benchmark. The select1 result shows a 1.0% regression. This workload has no contention, so it purely measures pgstat infrastructure overhead; perf profiling shows pgstat_get_entry_ref() at the top. However, on machines with higher core count the upstream spinlock on the counters becomes a bottleneck, which is where the dshash design should win back this overhead and maybe more. I still plan to benchmark this on a larger machine. You will notice that cold_calls in the patched churn case are much lower (805 vs 4,458). This is because entries get evicted sooner under per-partition sweep. Hot and cold query retention lines up w ith current upstream (1000/1000 hot entries survive continuous churn). The deallocs count is much higher in the patch (11.9M vs 38.5K in high churn). per-partition sweep fires frequently to keep the table at target capacity, whereas upstream batches fewer, larger deallocations. We can maybe look into reducing USAGE_DEALLOC_PERCENT to increase retention of "colder" entries. We also see much less LWLock contention in the patched churn case. The top wait is PgStatsDSA (502 total) vs pg_stat_statements (7,757) in upstream, a 15x reduction. Looking forward to your feedback! -- Sami Imseih Amazon Web Services (AWS)