Thread

  1. Improve pg_stat_statements scalability

    Sami Imseih <samimseih@gmail.com> — 2026-05-11T23:53:53Z

    Hi,
    
    pg_stat_statements has well-known scaling problems under
    high concurrency. This patch series is an initial proposal
    for how to $SUBJECT.
    
    The three scaling problems:
    
    1. Per-entry SpinLock contention. Every counter update acquires
    a SpinLock on the entry. Hot queries executed across many
    backends contend on the same lock, and the hold time grows as
    we add more counters to the struct. Higher core counts mean
    more CPUs contending for the same spinlock, making the problem
    worse on modern hardware. See this complaint here [3].
    
    2. Deallocation under exclusive LWLock. The hash table is
    fixed-size, bounded by pg_stat_statements.max. When full, a
    single backend performs least-frequently-used eviction of
    the bottom 5% while holding an exclusive LWLock, blocking all
    other backends from reading or writing pg_stat_statements.
    This happens inline during query execution, and workloads with
    many unique queries trigger it frequently. Making matters
    worse, pg_stat_statements.max is a static GUC, requiring a
    full server restart to change, and that is disruptive in
    production systems.
    
    3. Query text file bloat and GC. Deallocated entries leave dead
    text in the external file. When the file grows to 2x the size
    of live text, GC rewrites the entire file under exclusive lock.
    Disk I/O under exclusive lock is the worst combination, and
    frequent deallocations (problem #2) trigger more GC.
    
    Using the pgstat, "statistics collector", system to improve
    pg_stat_statements scalability has been discussed previously
    [4]. It writes stats locally first and flushes periodically
    to shared memory, avoiding spinlock contention on the write
    path. Also, the storage is a dshash (partitioned hash table),
    which reduces contention on lookups and allows dynamic resizing
    without restart. The prerequisite work to make this usable by
    extensions has been building over two release cycles:
    
    - PG18: Pluggable cumulative statistics API
    (pgstat_register_kind) [7949d959]
    
    - PG19: Serialization callbacks (to_serialized_data,
    from_serialized_data, finish) for custom stats kinds
    [4ba012a8ed9]
    
    This patch builds on the prerequisite work to address the
    scalability issues mentioned above.
    
    The series consists of two patches:
    
    ---
    
    [0001] pgstat: add pgstat_flush_pending() and pg_stat_flush_pending(pid)
    
    Adds infrastructure for flushing pending statistics to shared
    memory on demand. pgstat_flush_pending() flushes all pending
    entries in the calling backend immediately. Unlike
    pgstat_report_stat(), it can be called mid-transaction, making
    it suitable for view functions that need fresh shared stats
    before the transaction ends.
    
    pg_stat_flush_pending(pid) is the SQL-callable interface to
    the same function.
    
    This patch is related to the discussion in [1] about flushing
    stats within running transactions. The pg_stat_statements
    modernization provides a good example where this is useful.
    At least for the pg_stat_statements tests, we need a way to
    flush statistics within a transaction. I plan to spin this off
    to a dedicated thread, but include it here for now so this
    patchset can be tested.
    
    [0002] pg_stat_statements: modernize entry storage with
    pgstat kind
    
    This is the main patch. It replaces the private shared-memory
    hash table with the pgstat subsystem's dshash (registered as a
    custom stats kind).
    
    The pgstats hash key has an objid:
    
    typedef struct PgStat_HashKey
    {
        PgStat_Kind kind;           /* statistics entry kind */
        Oid         dboid;          /* database ID. InvalidOid for shared
    objects. */
        uint64      objid;          /* object ID (table, function, etc.), or
                                     * identifier. */
    } PgStat_HashKey;
    
    So the entry objid is computed by combining hashes from userid, queryid and
    toplevel.
    
    Note that because of 0001 there are no changes required to the
    existing regression tests. We may need to add some new tests,
    but I have not thought too much about that yet.
    
    Key design changes:
    
    - No SpinLock on the execution path. Counters accumulate
    per-backend and merge into shared memory on flush. Stddev
    and related fields use Welford's algorithm as before, but since the stats
    are first updated locally, we need
    a way to merge the stats on flush, so we use Chan's parallel
    algorithm for this purpose [5].
    
    - No fixed shared memory allocation. Entries live in the pgstat
    dshash, which grows dynamically. pg_stat_statements.max
    becomes PGC_SIGHUP, and therefore can be changed dynamically.
    
    - Throttled inline eviction. When entry count reaches
    pgss_max, a backend attempts eviction using a conditional
    lock and a shared timestamp that ensures at most one eviction
    cycle per 10 seconds. Other backends simply skip entry
    creation without blocking. This is acceptable because the
    current upstream behavior already suffers from the same data
    loss under heavy churn. Newly created entries are immediately
    candidates for eviction and are frequently removed in the
    next deallocation cycle. The throttled approach makes this
    trade-off explicit and avoids the cost of blocking all backends
    behind an exclusive lock to create space for entries that may
    just be removed shortly. See benchmark results below for
    measuring the retention of "hot" entries.
    
    The aging decay mechanism remains in place, but the sticky
    entries are no longer needed for this patch, and we simply
    don't evict calls == 0.
    
    A background worker for eviction was also considered. The
    current inline approach was chosen because it avoids the
    added complexity while still preventing other backends from
    blocking. If a background worker has more merit, I am open
    to that discussion.
    
    - Query texts in DSA memory with disk fallback. A new GUC
    pg_stat_statements.query_text_memory (default 64 MB) controls
    DSA shared memory for query text storage. When enabled and
    not exhausted, new texts are stored in DSA instead of the
    external file, eliminating file I/O on the read path. If DSA
    is disabled (set to 0) or full, texts fall back to the
    existing file-based storage. Entry eviction and reset properly
    free DSA-allocated texts, and GC of the text file skips
    DSA-backed entries.
    
    - Serialization via pgstat callbacks. to_serialized_data /
    from_serialized_data handle saving and restoring stats across
    restarts, replacing the module's bespoke shutdown/startup
    logic.
    
    - pg_stat_statements_info gains columns. num_entries (current
    count), last_eviction_time, query_text_memory_bytes (DSA
    currently allocated), and query_text_file_bytes (overflow
    file size) provide operational visibility into eviction
    behavior and memory usage. Some of these fields were discussed
    in this thread [6] as they are relevant even to the current
    state of pg_stat_statements.
    
    The LWLock is retained but narrowed to protect only query text
    file operations and eviction. Entry-level locking is handled
    by the pgstat subsystem's built-in mechanism.
    
    ---
    
    Benchmark:
    
    Attached is a benchmark script that runs three workloads:
    
    - 5k: 80% hot (1000 distinct) + 20% churn (4000 distinct).
    Total fits within max, no eviction expected.
    - 100k: 80% hot (1000 distinct) + 20% churn (100000 distinct).
    Exceeds max, continuous eviction pressure.
    - spinlock: single SELECT; from all 256 clients. Pure
    contention on one entry.
    
    The benchmark collects pg_stat_activity in the background to
    Measure wait events. It also polls pg_stat_statements to check
    For hot and churn query retention, so we can measure hot
    Query retention and also measure the overhead of querying
    pg_stat_statements.
    
    Benchmark Results:
    
    Environment: x86_64, 16 CPUs, 29 GB RAM, Linux 6.1 (EC2).
    PostgreSQL 19devel, release build, cassert=off.
    pg_stat_statements.max = 5500, 256 clients, 16 threads,
    5 min per test.
    
    
    Test     | un-patched TPS | Patch TPS | Delta
    ---------+----------------+-----------+------
    5k       |        241,078 |   239,001 | -0.9%
    100k     |        180,817 |   240,314 | +33%
    spinlock |        328,075 |   319,466 | -3%
    
    Wait event samples (collected every 1s for 300s):
    
    Test     | un-patched                  | Patch
    ---------+-----------------------------+-----------------
    5k       | ClientRead 1907             | ClientRead 2106
    100k   | pg_stat_statements 20416    | ClientRead 1978
               | ClientRead 12874            |
    spinlock | ClientRead 1804             | ClientRead 1749
    
    Notice the huge difference in wait events. Essentially, the
    pg_stat_statements waits are eliminated with this design and
    there is a 33% performance improvement.
    
    Entry retention under 100k (heavy eviction):
    
    query type | un-patched                  | Patch
    -----------+-----------------------------+---------------------------
    hot        | 1000 entries, 42-44k calls  | 1000 entries, 52-58k calls
    churn      | 4436 entries, 1-5 calls     | 4430 entries, 1-192 calls
    deallocs   | 37,972                      | 30
    
    Under heavy churn, the unpatched eviction behavior retains
    hot entries and discards churn entries after 1-5 calls. The patched
    design throttles eviction to at most once per 10 seconds, so entries
    survive longer between cycles and accumulate more calls. In the patched
    design, the churn entries reach 1-192 calls before eviction. Hot entries
    benefit from the elimination of exclusive-lock blocking, accumulating 52-58k
    calls versus 42-44k.
    
    In the non-churn case (5k test), both designs are equivalent
    in performance.
    
    More benchmarking across different patterns is needed, but I believe this
    is a good start in terms of numbers and accuracy.
    
    ---
    
    Open questions:
    
    1. The attached patches use PGSTAT_KIND_EXPERIMENTAL for the
    custom kind ID. For commit, we'd want a proper kind number.
    
    2. The 10-second eviction throttle is a compile-time constant
    (EVICTION_INTERVAL_MS). Should this be a GUC, or is a fixed
    interval sufficient?
    
    3. The current design skips entry creation when at capacity and
    eviction is throttled. An alternative would be to allow
    temporary overshoot (soft limit) and never reject entries.
    Thoughts on the trade-off? However, for heavy churn and
    many unique entries we can easily overshoot the max by
    magnitudes higher which I don't think is a good idea.
    
    4. The default size of query text memory. The patch has it at 64MB.
    
    [1] https://www.postgresql.org/message-id/acNTfL1xO_UUXkZQ%40paquier.xyz
    [3] https://www.postgresql.org/message-id/btsjlfnqge3y6yypkwe7yvhv2tcopt6pug7gigz6xaha2iemkw@lflv3psi7xoz
    [4] https://www.postgresql.org/message-id/aKF0V-T8-XAxj47T@paquier.xyz
    [5] https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
    [6] https://www.postgresql.org/message-id/CAP53PkzYZ8YxH0o+Garw9fWdFRoEtmQKT09-q=2RVMW8uVS5Nw@mail.gmail.com
    
    Patches and benchmark script attached.
    
    --
    Sami Imseih
    Amazon Web Services (AWS)