Thread
-
Improve pg_stat_statements scalability
Sami Imseih <samimseih@gmail.com> — 2026-05-11T23:53:53Z
Hi, pg_stat_statements has well-known scaling problems under high concurrency. This patch series is an initial proposal for how to $SUBJECT. The three scaling problems: 1. Per-entry SpinLock contention. Every counter update acquires a SpinLock on the entry. Hot queries executed across many backends contend on the same lock, and the hold time grows as we add more counters to the struct. Higher core counts mean more CPUs contending for the same spinlock, making the problem worse on modern hardware. See this complaint here [3]. 2. Deallocation under exclusive LWLock. The hash table is fixed-size, bounded by pg_stat_statements.max. When full, a single backend performs least-frequently-used eviction of the bottom 5% while holding an exclusive LWLock, blocking all other backends from reading or writing pg_stat_statements. This happens inline during query execution, and workloads with many unique queries trigger it frequently. Making matters worse, pg_stat_statements.max is a static GUC, requiring a full server restart to change, and that is disruptive in production systems. 3. Query text file bloat and GC. Deallocated entries leave dead text in the external file. When the file grows to 2x the size of live text, GC rewrites the entire file under exclusive lock. Disk I/O under exclusive lock is the worst combination, and frequent deallocations (problem #2) trigger more GC. Using the pgstat, "statistics collector", system to improve pg_stat_statements scalability has been discussed previously [4]. It writes stats locally first and flushes periodically to shared memory, avoiding spinlock contention on the write path. Also, the storage is a dshash (partitioned hash table), which reduces contention on lookups and allows dynamic resizing without restart. The prerequisite work to make this usable by extensions has been building over two release cycles: - PG18: Pluggable cumulative statistics API (pgstat_register_kind) [7949d959] - PG19: Serialization callbacks (to_serialized_data, from_serialized_data, finish) for custom stats kinds [4ba012a8ed9] This patch builds on the prerequisite work to address the scalability issues mentioned above. The series consists of two patches: --- [0001] pgstat: add pgstat_flush_pending() and pg_stat_flush_pending(pid) Adds infrastructure for flushing pending statistics to shared memory on demand. pgstat_flush_pending() flushes all pending entries in the calling backend immediately. Unlike pgstat_report_stat(), it can be called mid-transaction, making it suitable for view functions that need fresh shared stats before the transaction ends. pg_stat_flush_pending(pid) is the SQL-callable interface to the same function. This patch is related to the discussion in [1] about flushing stats within running transactions. The pg_stat_statements modernization provides a good example where this is useful. At least for the pg_stat_statements tests, we need a way to flush statistics within a transaction. I plan to spin this off to a dedicated thread, but include it here for now so this patchset can be tested. [0002] pg_stat_statements: modernize entry storage with pgstat kind This is the main patch. It replaces the private shared-memory hash table with the pgstat subsystem's dshash (registered as a custom stats kind). The pgstats hash key has an objid: typedef struct PgStat_HashKey { PgStat_Kind kind; /* statistics entry kind */ Oid dboid; /* database ID. InvalidOid for shared objects. */ uint64 objid; /* object ID (table, function, etc.), or * identifier. */ } PgStat_HashKey; So the entry objid is computed by combining hashes from userid, queryid and toplevel. Note that because of 0001 there are no changes required to the existing regression tests. We may need to add some new tests, but I have not thought too much about that yet. Key design changes: - No SpinLock on the execution path. Counters accumulate per-backend and merge into shared memory on flush. Stddev and related fields use Welford's algorithm as before, but since the stats are first updated locally, we need a way to merge the stats on flush, so we use Chan's parallel algorithm for this purpose [5]. - No fixed shared memory allocation. Entries live in the pgstat dshash, which grows dynamically. pg_stat_statements.max becomes PGC_SIGHUP, and therefore can be changed dynamically. - Throttled inline eviction. When entry count reaches pgss_max, a backend attempts eviction using a conditional lock and a shared timestamp that ensures at most one eviction cycle per 10 seconds. Other backends simply skip entry creation without blocking. This is acceptable because the current upstream behavior already suffers from the same data loss under heavy churn. Newly created entries are immediately candidates for eviction and are frequently removed in the next deallocation cycle. The throttled approach makes this trade-off explicit and avoids the cost of blocking all backends behind an exclusive lock to create space for entries that may just be removed shortly. See benchmark results below for measuring the retention of "hot" entries. The aging decay mechanism remains in place, but the sticky entries are no longer needed for this patch, and we simply don't evict calls == 0. A background worker for eviction was also considered. The current inline approach was chosen because it avoids the added complexity while still preventing other backends from blocking. If a background worker has more merit, I am open to that discussion. - Query texts in DSA memory with disk fallback. A new GUC pg_stat_statements.query_text_memory (default 64 MB) controls DSA shared memory for query text storage. When enabled and not exhausted, new texts are stored in DSA instead of the external file, eliminating file I/O on the read path. If DSA is disabled (set to 0) or full, texts fall back to the existing file-based storage. Entry eviction and reset properly free DSA-allocated texts, and GC of the text file skips DSA-backed entries. - Serialization via pgstat callbacks. to_serialized_data / from_serialized_data handle saving and restoring stats across restarts, replacing the module's bespoke shutdown/startup logic. - pg_stat_statements_info gains columns. num_entries (current count), last_eviction_time, query_text_memory_bytes (DSA currently allocated), and query_text_file_bytes (overflow file size) provide operational visibility into eviction behavior and memory usage. Some of these fields were discussed in this thread [6] as they are relevant even to the current state of pg_stat_statements. The LWLock is retained but narrowed to protect only query text file operations and eviction. Entry-level locking is handled by the pgstat subsystem's built-in mechanism. --- Benchmark: Attached is a benchmark script that runs three workloads: - 5k: 80% hot (1000 distinct) + 20% churn (4000 distinct). Total fits within max, no eviction expected. - 100k: 80% hot (1000 distinct) + 20% churn (100000 distinct). Exceeds max, continuous eviction pressure. - spinlock: single SELECT; from all 256 clients. Pure contention on one entry. The benchmark collects pg_stat_activity in the background to Measure wait events. It also polls pg_stat_statements to check For hot and churn query retention, so we can measure hot Query retention and also measure the overhead of querying pg_stat_statements. Benchmark Results: Environment: x86_64, 16 CPUs, 29 GB RAM, Linux 6.1 (EC2). PostgreSQL 19devel, release build, cassert=off. pg_stat_statements.max = 5500, 256 clients, 16 threads, 5 min per test. Test | un-patched TPS | Patch TPS | Delta ---------+----------------+-----------+------ 5k | 241,078 | 239,001 | -0.9% 100k | 180,817 | 240,314 | +33% spinlock | 328,075 | 319,466 | -3% Wait event samples (collected every 1s for 300s): Test | un-patched | Patch ---------+-----------------------------+----------------- 5k | ClientRead 1907 | ClientRead 2106 100k | pg_stat_statements 20416 | ClientRead 1978 | ClientRead 12874 | spinlock | ClientRead 1804 | ClientRead 1749 Notice the huge difference in wait events. Essentially, the pg_stat_statements waits are eliminated with this design and there is a 33% performance improvement. Entry retention under 100k (heavy eviction): query type | un-patched | Patch -----------+-----------------------------+--------------------------- hot | 1000 entries, 42-44k calls | 1000 entries, 52-58k calls churn | 4436 entries, 1-5 calls | 4430 entries, 1-192 calls deallocs | 37,972 | 30 Under heavy churn, the unpatched eviction behavior retains hot entries and discards churn entries after 1-5 calls. The patched design throttles eviction to at most once per 10 seconds, so entries survive longer between cycles and accumulate more calls. In the patched design, the churn entries reach 1-192 calls before eviction. Hot entries benefit from the elimination of exclusive-lock blocking, accumulating 52-58k calls versus 42-44k. In the non-churn case (5k test), both designs are equivalent in performance. More benchmarking across different patterns is needed, but I believe this is a good start in terms of numbers and accuracy. --- Open questions: 1. The attached patches use PGSTAT_KIND_EXPERIMENTAL for the custom kind ID. For commit, we'd want a proper kind number. 2. The 10-second eviction throttle is a compile-time constant (EVICTION_INTERVAL_MS). Should this be a GUC, or is a fixed interval sufficient? 3. The current design skips entry creation when at capacity and eviction is throttled. An alternative would be to allow temporary overshoot (soft limit) and never reject entries. Thoughts on the trade-off? However, for heavy churn and many unique entries we can easily overshoot the max by magnitudes higher which I don't think is a good idea. 4. The default size of query text memory. The patch has it at 64MB. [1] https://www.postgresql.org/message-id/acNTfL1xO_UUXkZQ%40paquier.xyz [3] https://www.postgresql.org/message-id/btsjlfnqge3y6yypkwe7yvhv2tcopt6pug7gigz6xaha2iemkw@lflv3psi7xoz [4] https://www.postgresql.org/message-id/aKF0V-T8-XAxj47T@paquier.xyz [5] https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm [6] https://www.postgresql.org/message-id/CAP53PkzYZ8YxH0o+Garw9fWdFRoEtmQKT09-q=2RVMW8uVS5Nw@mail.gmail.com Patches and benchmark script attached. -- Sami Imseih Amazon Web Services (AWS)