Thread

Improve pg_stat_statements scalability

Sami Imseih <samimseih@gmail.com> — 2026-05-11T23:53:53Z

Hi,

pg_stat_statements has well-known scaling problems under
high concurrency. This patch series is an initial proposal
for how to $SUBJECT.

The three scaling problems:

1. Per-entry SpinLock contention. Every counter update acquires
a SpinLock on the entry. Hot queries executed across many
backends contend on the same lock, and the hold time grows as
we add more counters to the struct. Higher core counts mean
more CPUs contending for the same spinlock, making the problem
worse on modern hardware. See this complaint here [3].

2. Deallocation under exclusive LWLock. The hash table is
fixed-size, bounded by pg_stat_statements.max. When full, a
single backend performs least-frequently-used eviction of
the bottom 5% while holding an exclusive LWLock, blocking all
other backends from reading or writing pg_stat_statements.
This happens inline during query execution, and workloads with
many unique queries trigger it frequently. Making matters
worse, pg_stat_statements.max is a static GUC, requiring a
full server restart to change, and that is disruptive in
production systems.

3. Query text file bloat and GC. Deallocated entries leave dead
text in the external file. When the file grows to 2x the size
of live text, GC rewrites the entire file under exclusive lock.
Disk I/O under exclusive lock is the worst combination, and
frequent deallocations (problem #2) trigger more GC.

Using the pgstat, "statistics collector", system to improve
pg_stat_statements scalability has been discussed previously
[4]. It writes stats locally first and flushes periodically
to shared memory, avoiding spinlock contention on the write
path. Also, the storage is a dshash (partitioned hash table),
which reduces contention on lookups and allows dynamic resizing
without restart. The prerequisite work to make this usable by
extensions has been building over two release cycles:

- PG18: Pluggable cumulative statistics API
(pgstat_register_kind) [7949d959]

- PG19: Serialization callbacks (to_serialized_data,
from_serialized_data, finish) for custom stats kinds
[4ba012a8ed9]

This patch builds on the prerequisite work to address the
scalability issues mentioned above.

The series consists of two patches:

---

[0001] pgstat: add pgstat_flush_pending() and pg_stat_flush_pending(pid)

Adds infrastructure for flushing pending statistics to shared
memory on demand. pgstat_flush_pending() flushes all pending
entries in the calling backend immediately. Unlike
pgstat_report_stat(), it can be called mid-transaction, making
it suitable for view functions that need fresh shared stats
before the transaction ends.

pg_stat_flush_pending(pid) is the SQL-callable interface to
the same function.

This patch is related to the discussion in [1] about flushing
stats within running transactions. The pg_stat_statements
modernization provides a good example where this is useful.
At least for the pg_stat_statements tests, we need a way to
flush statistics within a transaction. I plan to spin this off
to a dedicated thread, but include it here for now so this
patchset can be tested.

[0002] pg_stat_statements: modernize entry storage with
pgstat kind

This is the main patch. It replaces the private shared-memory
hash table with the pgstat subsystem's dshash (registered as a
custom stats kind).

The pgstats hash key has an objid:

typedef struct PgStat_HashKey
{
PgStat_Kind kind; /* statistics entry kind */
Oid dboid; /* database ID. InvalidOid for shared
objects. */
uint64 objid; /* object ID (table, function, etc.), or
* identifier. */
} PgStat_HashKey;

So the entry objid is computed by combining hashes from userid, queryid and
toplevel.

Note that because of 0001 there are no changes required to the
existing regression tests. We may need to add some new tests,
but I have not thought too much about that yet.

Key design changes:

- No SpinLock on the execution path. Counters accumulate
per-backend and merge into shared memory on flush. Stddev
and related fields use Welford's algorithm as before, but since the stats
are first updated locally, we need
a way to merge the stats on flush, so we use Chan's parallel
algorithm for this purpose [5].

- No fixed shared memory allocation. Entries live in the pgstat
dshash, which grows dynamically. pg_stat_statements.max
becomes PGC_SIGHUP, and therefore can be changed dynamically.

- Throttled inline eviction. When entry count reaches
pgss_max, a backend attempts eviction using a conditional
lock and a shared timestamp that ensures at most one eviction
cycle per 10 seconds. Other backends simply skip entry
creation without blocking. This is acceptable because the
current upstream behavior already suffers from the same data
loss under heavy churn. Newly created entries are immediately
candidates for eviction and are frequently removed in the
next deallocation cycle. The throttled approach makes this
trade-off explicit and avoids the cost of blocking all backends
behind an exclusive lock to create space for entries that may
just be removed shortly. See benchmark results below for
measuring the retention of "hot" entries.

The aging decay mechanism remains in place, but the sticky
entries are no longer needed for this patch, and we simply
don't evict calls == 0.

A background worker for eviction was also considered. The
current inline approach was chosen because it avoids the
added complexity while still preventing other backends from
blocking. If a background worker has more merit, I am open
to that discussion.

- Query texts in DSA memory with disk fallback. A new GUC
pg_stat_statements.query_text_memory (default 64 MB) controls
DSA shared memory for query text storage. When enabled and
not exhausted, new texts are stored in DSA instead of the
external file, eliminating file I/O on the read path. If DSA
is disabled (set to 0) or full, texts fall back to the
existing file-based storage. Entry eviction and reset properly
free DSA-allocated texts, and GC of the text file skips
DSA-backed entries.

- Serialization via pgstat callbacks. to_serialized_data /
from_serialized_data handle saving and restoring stats across
restarts, replacing the module's bespoke shutdown/startup
logic.

- pg_stat_statements_info gains columns. num_entries (current
count), last_eviction_time, query_text_memory_bytes (DSA
currently allocated), and query_text_file_bytes (overflow
file size) provide operational visibility into eviction
behavior and memory usage. Some of these fields were discussed
in this thread [6] as they are relevant even to the current
state of pg_stat_statements.

The LWLock is retained but narrowed to protect only query text
file operations and eviction. Entry-level locking is handled
by the pgstat subsystem's built-in mechanism.

---

Benchmark:

Attached is a benchmark script that runs three workloads:

- 5k: 80% hot (1000 distinct) + 20% churn (4000 distinct).
Total fits within max, no eviction expected.
- 100k: 80% hot (1000 distinct) + 20% churn (100000 distinct).
Exceeds max, continuous eviction pressure.
- spinlock: single SELECT; from all 256 clients. Pure
contention on one entry.

The benchmark collects pg_stat_activity in the background to
Measure wait events. It also polls pg_stat_statements to check
For hot and churn query retention, so we can measure hot
Query retention and also measure the overhead of querying
pg_stat_statements.

Benchmark Results:

Environment: x86_64, 16 CPUs, 29 GB RAM, Linux 6.1 (EC2).
PostgreSQL 19devel, release build, cassert=off.
pg_stat_statements.max = 5500, 256 clients, 16 threads,
5 min per test.

Test | un-patched TPS | Patch TPS | Delta
---------+----------------+-----------+------
5k | 241,078 | 239,001 | -0.9%
100k | 180,817 | 240,314 | +33%
spinlock | 328,075 | 319,466 | -3%

Wait event samples (collected every 1s for 300s):

Notice the huge difference in wait events. Essentially, the
pg_stat_statements waits are eliminated with this design and
there is a 33% performance improvement.

Entry retention under 100k (heavy eviction):

Under heavy churn, the unpatched eviction behavior retains
hot entries and discards churn entries after 1-5 calls. The patched
design throttles eviction to at most once per 10 seconds, so entries
survive longer between cycles and accumulate more calls. In the patched
design, the churn entries reach 1-192 calls before eviction. Hot entries
benefit from the elimination of exclusive-lock blocking, accumulating 52-58k
calls versus 42-44k.

In the non-churn case (5k test), both designs are equivalent
in performance.

More benchmarking across different patterns is needed, but I believe this
is a good start in terms of numbers and accuracy.

---

Open questions:

1. The attached patches use PGSTAT_KIND_EXPERIMENTAL for the
custom kind ID. For commit, we'd want a proper kind number.

2. The 10-second eviction throttle is a compile-time constant
(EVICTION_INTERVAL_MS). Should this be a GUC, or is a fixed
interval sufficient?

3. The current design skips entry creation when at capacity and
eviction is throttled. An alternative would be to allow
temporary overshoot (soft limit) and never reject entries.
Thoughts on the trade-off? However, for heavy churn and
many unique entries we can easily overshoot the max by
magnitudes higher which I don't think is a good idea.

4. The default size of query text memory. The patch has it at 64MB.

[1] https://www.postgresql.org/message-id/acNTfL1xO_UUXkZQ%40paquier.xyz
[3] https://www.postgresql.org/message-id/btsjlfnqge3y6yypkwe7yvhv2tcopt6pug7gigz6xaha2iemkw@lflv3psi7xoz
[4] https://www.postgresql.org/message-id/aKF0V-T8-XAxj47T@paquier.xyz
[5] https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
[6] https://www.postgresql.org/message-id/CAP53PkzYZ8YxH0o+Garw9fWdFRoEtmQKT09-q=2RVMW8uVS5Nw@mail.gmail.com

Patches and benchmark script attached.

--
Sami Imseih
Amazon Web Services (AWS)