Thread

  1. Re: Improve pg_stat_statements scalability

    Sami Imseih <samimseih@gmail.com> — 2026-05-30T02:15:31Z

    >> I've created a new wiki page combining the prior 2025 discussion, and
    >> notes from today:
    >>
    >> https://wiki.postgresql.org/wiki/Scalability_of_pg_stat_statements
    
    >Thank you, Lukas!
    
    Thanks for the feedback on v2, and for the productive unconference
    discussion in PGConf.dev. I've been iterating on the patches/benchmarks
    and wanted to post an updated series that addresses the major design
    points raised.
    
    Addressing the feedback from the unconference session:
    
    1/ Eviction
    
    The concern in the design presented in which entries are skipped until
    a single threaded eviction completed was that on a steady workload with
    max=5000, the 5001st unique query would always be immediately evicted.
    This was considered unacceptable, and rightly so. It may be fine for
    high churn, pathological cases, but for slow churn cases it would be
    unacceptable.
    
    Andres mentioned that eviction should occur in parallel, which I
    understood as multiple backends should evict a subset of the entries in
    the dshash concurrently. To do this, I implemented a parallel
    clock-sweep in which eviction sweeps a partition decrementing
    refcounts, and only entries that have decayed to zero are evicted. A
    genuinely new query that just arrived has a fresh refcount and survives
    the sweep, and if it becomes popular in the meantime, it will survive
    for longer. So the algorithm wants an entry to prove itself to remain
    considered "hot", else it's swept.
    
    There's a 5% headroom using the already existing
    USAGE_DEALLOC_PERCENT in which we evict until count (across all
    partitions) drops to 95% of max, so there's always room for new
    arrivals before the next sweep.
    
    I chose not to use a background worker for eviction, as discussed
    earlier, based on the consensus that backpressure is important, and a
    background worker being asynchronous in nature will not provide that.
    Also, it could be complicated where a background worker cannot be spun
    up for whatever reason.
    
    However, to implement parallel eviction as described above, I needed 2
    core changes:
    
    - pgstat_drop_entry must be able to optionally tolerate a dropped
    entry, as there could be a delay between the time an entry is marked
    as dropped and garbage collection, and within that time multiple
    evictions may attempt to drop the same entry.
    
    - Implement a seq scan API for dshash that scans a specific partition.
    This will allow eviction to cycle through partitions, i.e.
    clock-sweep.
    
    The eviction does not attempt to make room in the bucket for an entry
    that triggered it. The main point is to keep forward progress in
    making room. We don't need to be more strict here. Also the fact that
    pg_stat_statements.max is dynamic means a user can increase this value
    to manage high churn without a restart.
    
    The dealloc counter in pg_stat_statements_info now counts individual
    entries evicted rather than the number of times eviction was invoked.
    I think this is more useful, but it does change the semantics between
    versions. With the new design, a single eviction pass can remove many
    entries across a partition, so counting evictions no longer tells
    you much. Open to other thoughts here on this.
    
    Profiling revealed that pgstat_request_entry_refs_gc(), which is the
    standard pattern for less frequently removed entries in other places,
    was too expensive when called on every entry drop or every partition
    sweep. Now it is only called once per full rotation across all
    partitions, which showed much better results in benchmarks.
    
    2/ DSA only query text storage
    
    Rather than a "performance cliff" that Lukas mentioned above when we
    switch between DSA and disk for query text storage, the consensus is
    to just store all query text in memory. Andres made a point that even
    now, transient memory usage for loading the query text for the purpose
    of garbage-collection or reading the query stats in
    pg_stat_statements means that a user's machine must have enough memory
    to handle this. So, why not just throw all query text in memory. The
    memory is capped by a new GUC pg_stat_statements.query_text_memory
    (default of 4MB but up for discussion).
    
    It is also possible to store more entries due to .max than
    .query_text_memory can support, so empty query text columns could be
    possible. In the case a user increases .query_text_memory if they
    observe empty query columns, the next time an entry is touched, it
    will backfill the query text and normalize the string if it can. The
    last part could be improved and actually guaranteed if we make
    JumbleState available to all hooks (I did not work on this part, but
    open for discussion).
    
    Some other comments from Lukas's earlier review:
    
    > It appears you've moved the equivalent of the "if (!entry)" check
    > into the pgss_store_query_text function, and we now unconditionally
    > call generate_normalized_query.
    
    Fixed.
    
    v3 series is now 5 patches:
    
    0001: pgstat: Introduce pg_stat_report_anytime()
    
    Nothing changes from v2.
    
    0002: pgstat: tolerate already-dropped entries in pgstat_drop_entry()
    
    Required for the parallel eviction design. With multiple backends
    sweeping different partitions concurrently, the same entry can be
    targeted for drop more than once before garbage collection runs.
    A skip_dropped flag makes this safe rather than throwing ERROR.
    
    0003: dshash: add partition-scoped sequential scan
    
    Adds dshash_seq_init_partition() to restrict a scan to a single
    partition. This is the building block for per-partition clock-sweep;
    a backend only locks and sweeps one partition at a time.
    
    0004: pg_stat_statements: modernize entry storage with pgstat kind
    
    The main patch. Replaces ShmemInitHash with dshash via DSM registry,
    and replaces per-entry spinlock counter updates with a custom pgstat
    kind that uses the core pgstat infrastructure.
    
    Eviction changes from qsort-all-entries to clock-sweep with an atomic
    rotating hand. Each entry carries a refcount (capped at 10) that
    decays on sweep; entries reaching zero are evicted. Hot queries keep
    their refcount topped up proportionally to access frequency.
    
    pg_stat_statements.max becomes PGC_SIGHUP.
    
    0005: pg_stat_statements: store query text in DSA instead of file
    
    Moves query text from pgss_query_texts.stat into a DSA area via
    GetNamedDSA. Adds pg_stat_statements.query_text_memory (PGC_SIGHUP,
    default 4MB) controlling DSA size. Eliminates the GC machinery
    entirely. When DSA is exhausted, entries are still tracked but query
    text is stored as NULL. A backfill mechanism recovers text on
    subsequent executions once space becomes available.
    
    Benchmark:
    
    Attached are the benchmark scripts I used for v3 (and will keep
    using going forward) with the results in the benchmark_v3.txt file comparing
    patch vs upstream.
    
    The benchmark performs various workloads: "high churn", "light churn",
    "multi stmt", and simple "select1". I am also tracking query retention
    (hot/cold entry retention) to verify the clock-sweep behaves as
    expected. I attached the .sql scripts used to benchmark.
    
    The select1 result shows a 1.0% regression. This workload has no
    contention, so it purely measures pgstat infrastructure overhead;
    perf profiling shows pgstat_get_entry_ref() at the top. However, on
    machines with higher core count the upstream spinlock on the counters
    becomes a bottleneck, which is where the dshash design should win
    back this overhead and maybe more. I still plan to benchmark this on
    a larger machine.
    
    You will notice that cold_calls in the patched churn case are much
    lower (805 vs 4,458). This is because entries get evicted sooner
    under per-partition sweep. Hot and cold query retention lines up w
    ith current upstream (1000/1000 hot entries survive continuous churn).
    
    The deallocs count is much higher in the patch (11.9M vs 38.5K in
    high churn). per-partition sweep fires frequently to keep the table
    at target capacity, whereas upstream batches fewer, larger deallocations.
    We can maybe look into reducing USAGE_DEALLOC_PERCENT to
    increase retention of "colder" entries.
    
    We also see much less LWLock contention in the patched churn case.
    The top wait is PgStatsDSA (502 total) vs pg_stat_statements (7,757)
    in upstream, a 15x reduction.
    
    Looking forward to your feedback!
    
    --
    Sami Imseih
    Amazon Web Services (AWS)