v3-0004-pg_stat_statements-modernize-entry-storage-with-p.patch

application/x-patch
Filename: v3-0004-pg_stat_statements-modernize-entry-storage-with-p.patch
Type: application/x-patch
Part: 9
Message: Re: Improve pg_stat_statements scalability
From 715935c78ba9f7cd16806df12e19cc7d605a3a4c Mon Sep 17 00:00:00 2001
From: Sami Imseih <samimseih@gmail.com>
Date: Fri, 29 May 2026 09:20:13 -0500
Subject: [PATCH v3 4/5] pg_stat_statements: modernize entry storage with
 pgstat kind

Replace the shared-memory hash table (ShmemInitHash) with a dshash
table registered via DSM registry, and replace per-entry spinlock
counter updates with a custom pgstat kind that uses the core pgstat
infrastructure.

Eviction switches from sort-based (qsort all entries by usage factor)
to clock-sweep with an atomic rotating hand.  Each entry carries a
refcount (capped at PGSS_REF_CAP = 10) that is decremented on sweep;
entries reaching zero are evicted.  Hot queries keep their refcount
topped up proportionally to access frequency.  Eviction is guaranteed
to make forward progress.

The pg_stat_statements.max GUC is changed from PGC_POSTMASTER to
PGC_SIGHUP so the limit can be adjusted without a restart.

Query text storage remains file-based (pgss_query_texts.stat) as in
the original implementation, with the same qtext_store / qtext_load_file /
gc_qtexts infrastructure adapted to work with the dshash entries.

The extension is bumped to version 1.14 with PARALLEL RESTRICTED
marking on the main function, since parallel workers cannot flush
stats accumulated by the leader via the anytime API.
---
 .../expected/oldextversions.out               |   67 +
 contrib/pg_stat_statements/meson.build        |    3 +-
 .../pg_stat_statements--1.13--1.14.sql        |   81 +
 .../pg_stat_statements/pg_stat_statements.c   | 2359 ++++++++---------
 .../pg_stat_statements.conf                   |    1 +
 .../pg_stat_statements.control                |    2 +-
 .../pg_stat_statements/sql/oldextversions.sql |    5 +
 doc/src/sgml/pgstatstatements.sgml            |   39 +-
 8 files changed, 1229 insertions(+), 1328 deletions(-)
 create mode 100644 contrib/pg_stat_statements/pg_stat_statements--1.13--1.14.sql

diff --git a/contrib/pg_stat_statements/expected/oldextversions.out b/contrib/pg_stat_statements/expected/oldextversions.out
index 726383a99d7..5d65eb6b521 100644
--- a/contrib/pg_stat_statements/expected/oldextversions.out
+++ b/contrib/pg_stat_statements/expected/oldextversions.out
@@ -474,4 +474,71 @@ SELECT count(*) > 0 AS has_data FROM pg_stat_statements;
  t
 (1 row)
 
+-- Functions marked PARALLEL RESTRICTED in 1.14
+AlTER EXTENSION pg_stat_statements UPDATE TO '1.14';
+\d pg_stat_statements
+                            View "public.pg_stat_statements"
+           Column           |           Type           | Collation | Nullable | Default 
+----------------------------+--------------------------+-----------+----------+---------
+ userid                     | oid                      |           |          | 
+ dbid                       | oid                      |           |          | 
+ toplevel                   | boolean                  |           |          | 
+ queryid                    | bigint                   |           |          | 
+ query                      | text                     |           |          | 
+ plans                      | bigint                   |           |          | 
+ total_plan_time            | double precision         |           |          | 
+ min_plan_time              | double precision         |           |          | 
+ max_plan_time              | double precision         |           |          | 
+ mean_plan_time             | double precision         |           |          | 
+ stddev_plan_time           | double precision         |           |          | 
+ calls                      | bigint                   |           |          | 
+ total_exec_time            | double precision         |           |          | 
+ min_exec_time              | double precision         |           |          | 
+ max_exec_time              | double precision         |           |          | 
+ mean_exec_time             | double precision         |           |          | 
+ stddev_exec_time           | double precision         |           |          | 
+ rows                       | bigint                   |           |          | 
+ shared_blks_hit            | bigint                   |           |          | 
+ shared_blks_read           | bigint                   |           |          | 
+ shared_blks_dirtied        | bigint                   |           |          | 
+ shared_blks_written        | bigint                   |           |          | 
+ local_blks_hit             | bigint                   |           |          | 
+ local_blks_read            | bigint                   |           |          | 
+ local_blks_dirtied         | bigint                   |           |          | 
+ local_blks_written         | bigint                   |           |          | 
+ temp_blks_read             | bigint                   |           |          | 
+ temp_blks_written          | bigint                   |           |          | 
+ shared_blk_read_time       | double precision         |           |          | 
+ shared_blk_write_time      | double precision         |           |          | 
+ local_blk_read_time        | double precision         |           |          | 
+ local_blk_write_time       | double precision         |           |          | 
+ temp_blk_read_time         | double precision         |           |          | 
+ temp_blk_write_time        | double precision         |           |          | 
+ wal_records                | bigint                   |           |          | 
+ wal_fpi                    | bigint                   |           |          | 
+ wal_bytes                  | numeric                  |           |          | 
+ wal_buffers_full           | bigint                   |           |          | 
+ jit_functions              | bigint                   |           |          | 
+ jit_generation_time        | double precision         |           |          | 
+ jit_inlining_count         | bigint                   |           |          | 
+ jit_inlining_time          | double precision         |           |          | 
+ jit_optimization_count     | bigint                   |           |          | 
+ jit_optimization_time      | double precision         |           |          | 
+ jit_emission_count         | bigint                   |           |          | 
+ jit_emission_time          | double precision         |           |          | 
+ jit_deform_count           | bigint                   |           |          | 
+ jit_deform_time            | double precision         |           |          | 
+ parallel_workers_to_launch | bigint                   |           |          | 
+ parallel_workers_launched  | bigint                   |           |          | 
+ generic_plan_calls         | bigint                   |           |          | 
+ custom_plan_calls          | bigint                   |           |          | 
+ stats_since                | timestamp with time zone |           |          | 
+ minmax_stats_since         | timestamp with time zone |           |          | 
+
+SELECT count(*) > 0 AS has_data FROM pg_stat_statements;
+ has_data 
+----------
+ t
+(1 row)
+
 DROP EXTENSION pg_stat_statements;
diff --git a/contrib/pg_stat_statements/meson.build b/contrib/pg_stat_statements/meson.build
index 9d78cb88b7d..7ffc8964494 100644
--- a/contrib/pg_stat_statements/meson.build
+++ b/contrib/pg_stat_statements/meson.build
@@ -21,6 +21,7 @@ contrib_targets += pg_stat_statements
 install_data(
   'pg_stat_statements.control',
   'pg_stat_statements--1.4.sql',
+  'pg_stat_statements--1.13--1.14.sql',
   'pg_stat_statements--1.12--1.13.sql',
   'pg_stat_statements--1.11--1.12.sql',
   'pg_stat_statements--1.10--1.11.sql',
@@ -69,7 +70,7 @@ tests += {
   },
   'tap': {
     'tests': [
-      't/010_restart.pl',
+      't/001_restart.pl',
     ],
   },
 }
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.13--1.14.sql b/contrib/pg_stat_statements/pg_stat_statements--1.13--1.14.sql
new file mode 100644
index 00000000000..7ed4c19eb5a
--- /dev/null
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.13--1.14.sql
@@ -0,0 +1,81 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.13--1.14.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.14'" to load this file. \quit
+
+/* First we have to remove them from the extension */
+ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements;
+ALTER EXTENSION pg_stat_statements DROP FUNCTION pg_stat_statements(boolean);
+
+/* Then we can drop them */
+DROP VIEW pg_stat_statements;
+DROP FUNCTION pg_stat_statements(boolean);
+
+/* Now redefine with PARALLEL RESTRICTED */
+CREATE FUNCTION pg_stat_statements(IN showtext boolean,
+    OUT userid oid,
+    OUT dbid oid,
+    OUT toplevel bool,
+    OUT queryid bigint,
+    OUT query text,
+    OUT plans int8,
+    OUT total_plan_time float8,
+    OUT min_plan_time float8,
+    OUT max_plan_time float8,
+    OUT mean_plan_time float8,
+    OUT stddev_plan_time float8,
+    OUT calls int8,
+    OUT total_exec_time float8,
+    OUT min_exec_time float8,
+    OUT max_exec_time float8,
+    OUT mean_exec_time float8,
+    OUT stddev_exec_time float8,
+    OUT rows int8,
+    OUT shared_blks_hit int8,
+    OUT shared_blks_read int8,
+    OUT shared_blks_dirtied int8,
+    OUT shared_blks_written int8,
+    OUT local_blks_hit int8,
+    OUT local_blks_read int8,
+    OUT local_blks_dirtied int8,
+    OUT local_blks_written int8,
+    OUT temp_blks_read int8,
+    OUT temp_blks_written int8,
+    OUT shared_blk_read_time float8,
+    OUT shared_blk_write_time float8,
+    OUT local_blk_read_time float8,
+    OUT local_blk_write_time float8,
+    OUT temp_blk_read_time float8,
+    OUT temp_blk_write_time float8,
+    OUT wal_records int8,
+    OUT wal_fpi int8,
+    OUT wal_bytes numeric,
+    OUT wal_buffers_full int8,
+    OUT jit_functions int8,
+    OUT jit_generation_time float8,
+    OUT jit_inlining_count int8,
+    OUT jit_inlining_time float8,
+    OUT jit_optimization_count int8,
+    OUT jit_optimization_time float8,
+    OUT jit_emission_count int8,
+    OUT jit_emission_time float8,
+    OUT jit_deform_count int8,
+    OUT jit_deform_time float8,
+    OUT parallel_workers_to_launch int8,
+    OUT parallel_workers_launched int8,
+    OUT generic_plan_calls int8,
+    OUT custom_plan_calls int8,
+    OUT stats_since timestamp with time zone,
+    OUT minmax_stats_since timestamp with time zone
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_stat_statements_1_13'
+LANGUAGE C STRICT VOLATILE PARALLEL RESTRICTED;
+
+CREATE VIEW pg_stat_statements AS
+  SELECT * FROM pg_stat_statements(true);
+
+GRANT SELECT ON pg_stat_statements TO PUBLIC;
+
+/* Mark reset functions as PARALLEL RESTRICTED */
+ALTER FUNCTION pg_stat_statements_reset(Oid, Oid, bigint, boolean) PARALLEL RESTRICTED;
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 92315627916..2004cad91f7 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -5,8 +5,9 @@
  *		usage across a whole database cluster.
  *
  * Execution costs are totaled for each distinct source query, and kept in
- * a shared hashtable.  (We track only as many distinct queries as will fit
- * in the designated amount of shared memory.)
+ * a dshash table registered via DSM registry.  (We attempt to keep no more
+ * distinct queries than the configured limit, but because dynamic shared
+ * memory is used, the count may briefly exceed it.)
  *
  * Starting in Postgres 9.2, this module normalized query entries.  As of
  * Postgres 14, the normalization is done by the core if compute_query_id is
@@ -14,25 +15,35 @@
  *
  * To facilitate presenting entries to users, we create "representative" query
  * strings in which constants are replaced with parameter symbols ($n), to
- * make it clearer what a normalized entry can represent.  To save on shared
- * memory, and to avoid having to truncate oversized query strings, we store
- * these strings in a temporary external query-texts file.  Offsets into this
- * file are kept in shared memory.
+ * make it clearer what a normalized entry can represent.  To avoid having
+ * to truncate oversized query strings, we store these strings in a temporary
+ * external query-texts file.  Offsets into this file are kept in the dshash
+ * entries.
  *
- * Note about locking issues: to create or delete an entry in the shared
- * hashtable, one must hold pgss->lock exclusively.  Modifying any field
- * in an entry except the counters requires the same.  To look up an entry,
- * one must hold the lock shared.  To read or update the counters within
- * an entry, one must hold the lock shared or exclusive (so the entry doesn't
- * disappear!) and also take the entry's mutex spinlock.
- * The shared state variable pgss->extent (the next free spot in the external
- * query-text file) should be accessed only while holding either the
- * pgss->mutex spinlock, or exclusive lock on pgss->lock.  We use the mutex to
- * allow reserving file space while holding only shared lock on pgss->lock.
- * Rewriting the entire external query-text file, eg for garbage collection,
- * requires holding pgss->lock exclusively; this allows individual entries
- * in the file to be read or written while holding only shared lock.
+ * The dshash serves as the source-of-truth registry of tracked queries,
+ * storing the key, file offsets to the query text, and an atomic refcount
+ * used for clock-sweep eviction.  Actual counters (calls, timing, buffers,
+ * etc.) are maintained via a custom pgstat kind using the pending/flush
+ * infrastructure, so backends accumulate stats locally and flush them
+ * without contending on shared state.
  *
+ * Note about locking: to look up an existing entry, a backend takes a
+ * shared lock on the entry's dshash partition and bumps the refcount
+ * atomically.  To insert a new entry, an exclusive partition lock is
+ * required but only on that single partition.  When the entry count exceeds
+ * pg_stat_statements.max, a clock-sweep eviction pass claims a partition
+ * via an atomic rotating hand and sweeps it under exclusive lock, evicting
+ * entries whose refcount has reached zero.
+ *
+ * A backend that triggers eviction does not necessarily evict from the same
+ * partition it is inserting into.  The point of the clock-sweep is to
+ * continually make progress on keeping the total entry count in check,
+ * spreading eviction work across all partitions over time.  The refcount
+ * is capped at PGSS_REF_CAP, so any entry will reach zero after at most
+ * that many sweep passes regardless of how frequently it is accessed.  No
+ * entry can become permanently immune to eviction.  Furthermore, because
+ * the sweep holds the partition's exclusive lock, no backend can re-bump
+ * the refcount of an entry while it is being considered for eviction.
  *
  * Copyright (c) 2008-2026, PostgreSQL Global Development Group
  *
@@ -50,24 +61,27 @@
 #include "access/htup_details.h"
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
+#include "common/hashfn.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
 #include "jit/jit.h"
+#include "lib/dshash.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "nodes/queryjumble.h"
 #include "optimizer/planner.h"
 #include "parser/analyze.h"
-#include "pgstat.h"
+#include "storage/dsm_registry.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/lwlock.h"
-#include "storage/shmem.h"
 #include "storage/spin.h"
 #include "tcop/utility.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
-#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/numeric.h"
+#include "utils/pgstat_internal.h"
 #include "utils/timestamp.h"
 #include "utils/tuplestore.h"
 
@@ -76,29 +90,18 @@ PG_MODULE_MAGIC_EXT(
 					.version = PG_VERSION
 );
 
-/* Location of permanent stats file (valid when database is shut down) */
-#define PGSS_DUMP_FILE	PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
-
 /*
  * Location of external query text file.
  */
 #define PGSS_TEXT_FILE	PG_STAT_TMP_DIR "/pgss_query_texts.stat"
 
-/* Magic number identifying the stats file format */
-static const uint32 PGSS_FILE_HEADER = 0x20250731;
+/* Custom pgstat kind ID */
+#define PGSTAT_KIND_PGSS	25
 
-/* PostgreSQL major version number, changes in which invalidate all entries */
-static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100;
-
-/* XXX: Should USAGE_EXEC reflect execution time and/or buffer usage? */
-#define USAGE_EXEC(duration)	(1.0)
-#define USAGE_INIT				(1.0)	/* including initial planning */
-#define ASSUMED_MEDIAN_INIT		(10.0)	/* initial assumed median usage */
-#define ASSUMED_LENGTH_INIT		1024	/* initial assumed mean query length */
-#define USAGE_DECREASE_FACTOR	(0.99)	/* decreased every entry_dealloc */
-#define STICKY_DECREASE_FACTOR	(0.50)	/* factor for sticky entries */
-#define USAGE_DEALLOC_PERCENT	5	/* free this % of entries at once */
-#define IS_STICKY(c)	((c.calls[PGSS_PLAN] + c.calls[PGSS_EXEC]) == 0)
+/* Clock-sweep settings */
+#define USAGE_DEALLOC_PERCENT  5	/* evict until total entries drop to this
+									 * % below max */
+#define PGSS_REF_CAP	10		/* maximum refcount for clock-sweep */
 
 /*
  * Extension version number, for supporting older extension versions' objects
@@ -133,12 +136,9 @@ typedef enum pgssStoreKind
 #define PGSS_NUMKIND (PGSS_EXEC + 1)
 
 /*
- * Hashtable key that defines the identity of a hashtable entry.  We separate
+ * Hashtable key that defines the identity of a tracked statement.
+ * This is used in the dshash (source-of-truth registry).  We separate
  * queries by user and by database even if they are otherwise identical.
- *
- * If you add a new key to this struct, make sure to teach pgss_store() to
- * zero the padding bytes.  Otherwise, things will break, because pgss_hash is
- * created using HASH_BLOBS, and thus tag_hash is used to hash this.
  */
 typedef struct pgssHashKey
 {
@@ -149,9 +149,9 @@ typedef struct pgssHashKey
 } pgssHashKey;
 
 /*
- * The actual stats counters kept within pgssEntry.
+ * The actual stats counters kept within the custom pgstat kind.
  */
-typedef struct Counters
+typedef struct pgssCounters
 {
 	int64		calls[PGSS_NUMKIND];	/* # of times planned/executed */
 	double		total_time[PGSS_NUMKIND];	/* total planning/execution time,
@@ -212,65 +212,81 @@ typedef struct Counters
 											 * launched */
 	int64		generic_plan_calls; /* number of calls using a generic plan */
 	int64		custom_plan_calls;	/* number of calls using a custom plan */
-} Counters;
+}			pgssCounters;
 
-/*
- * Global statistics for pg_stat_statements
- */
-typedef struct pgssGlobalStats
+/* Shared pgstat entry */
+typedef struct PgStatShared_Pgss
 {
-	int64		dealloc;		/* # of times entries were deallocated */
-	TimestampTz stats_reset;	/* timestamp with all stats reset */
-} pgssGlobalStats;
+	PgStatShared_Common header;
+	pgssHashKey key;
+	pgssCounters counters;
+}			PgStatShared_Pgss;
 
 /*
- * Statistics per statement
+ * Entry in the dshash source-of-truth registry.
  *
  * Note: in event of a failure in garbage collection of the query text file,
  * we reset query_offset to zero and query_len to -1.  This will be seen as
  * an invalid state by qtext_fetch().
+ *
+ * Note: Key must be first for dshash.
  */
 typedef struct pgssEntry
 {
-	pgssHashKey key;			/* hash key of entry - MUST BE FIRST */
-	Counters	counters;		/* the statistics for this query */
+	pgssHashKey key;
+	pg_atomic_uint32 refcount;	/* clock-sweep: decremented on sweep, evict at
+								 * 0 */
 	Size		query_offset;	/* query text offset in external file */
 	int			query_len;		/* # of valid bytes in query string, or -1 */
 	int			encoding;		/* query text encoding */
 	TimestampTz stats_since;	/* timestamp of entry allocation */
 	TimestampTz minmax_stats_since; /* timestamp of last min/max values reset */
-	slock_t		mutex;			/* protects the counters only */
 } pgssEntry;
 
 /*
- * Global shared state
+ * Global shared state stored in DSM segment
  */
 typedef struct pgssSharedState
 {
-	LWLockPadded lock;			/* protects hashtable search/modification */
-	double		cur_median_usage;	/* current median usage in hashtable */
-	Size		mean_query_len; /* current mean entry text length */
+	pg_atomic_uint64 nentries;	/* current number of tracked entries */
+	pg_atomic_uint64 dealloc;	/* total # of entries evicted */
+	pg_atomic_uint32 sweep_partition;	/* rotating hand: next partition to
+										 * sweep */
+	TimestampTz stats_reset;	/* timestamp with all stats reset */
 	slock_t		mutex;			/* protects following fields only: */
 	Size		extent;			/* current extent of query file */
 	int			n_writers;		/* number of active writers to query file */
 	int			gc_count;		/* query file garbage collection cycle count */
-	pgssGlobalStats stats;		/* global statistics for pgss */
 } pgssSharedState;
 
-/* Links to shared memory state */
-static pgssSharedState *pgss;
-static HTAB *pgss_hash;
+/* Backend-local pending entry */
+typedef struct PgStat_PgssPending
+{
+	pgssHashKey key;
+	pgssCounters counters;
+}			PgStat_PgssPending;
 
-static void pgss_shmem_request(void *arg);
-static void pgss_shmem_init(void *arg);
+/*
+ * source-of-truth registry stored in dshash
+ */
 
-static const ShmemCallbacks pgss_shmem_callbacks = {
-	.request_fn = pgss_shmem_request,
-	.init_fn = pgss_shmem_init,
+static const dshash_parameters pgss_dsh_params = {
+	sizeof(pgssHashKey),
+	sizeof(pgssEntry),
+	dshash_memcmp,
+	dshash_memhash,
+	dshash_memcpy,
+	LWTRANCHE_FIRST_USER_DEFINED,
 };
 
 /*---- Local variables ----*/
 
+/* Global shared state */
+static pgssSharedState *pgss_shared = NULL;
+
+/* source-of-truth dshash table */
+static dshash_table *pgss_hash = NULL;
+
 /* Current nesting depth of planner/ExecutorRun/ProcessUtility calls */
 static int	nesting_level = 0;
 
@@ -314,9 +330,9 @@ static bool pgss_save = true;	/* whether to save stats across shutdown */
 
 #define record_gc_qtexts() \
 	do { \
-		SpinLockAcquire(&pgss->mutex); \
-		pgss->gc_count++; \
-		SpinLockRelease(&pgss->mutex); \
+		SpinLockAcquire(&pgss_shared->mutex); \
+		pgss_shared->gc_count++; \
+		SpinLockRelease(&pgss_shared->mutex); \
 	} while(0)
 
 /*---- Function declarations ----*/
@@ -335,7 +351,6 @@ PG_FUNCTION_INFO_V1(pg_stat_statements_1_13);
 PG_FUNCTION_INFO_V1(pg_stat_statements);
 PG_FUNCTION_INFO_V1(pg_stat_statements_info);
 
-static void pgss_shmem_shutdown(int code, Datum arg);
 static void pgss_post_parse_analyze(ParseState *pstate, Query *query,
 									const JumbleState *jstate);
 static PlannedStmt *pgss_planner(Query *parse,
@@ -365,12 +380,11 @@ static void pgss_store(const char *query, int64 queryId,
 					   int parallel_workers_to_launch,
 					   int parallel_workers_launched,
 					   PlannedStmtOrigin planOrigin);
-static void pg_stat_statements_internal(FunctionCallInfo fcinfo,
-										pgssVersion api_version,
-										bool showtext);
-static pgssEntry *entry_alloc(pgssHashKey *key, Size query_offset, int query_len,
-							  int encoding, bool sticky);
+
 static void entry_dealloc(void);
+static TimestampTz entry_reset(Oid userid, Oid dbid, int64 queryid, bool minmax_only);
+static uint64 pgss_hash_key(pgssHashKey *key);
+
 static bool qtext_store(const char *query, int query_len,
 						Size *query_offset, int *gc_count);
 static char *qtext_load_file(Size *buffer_size);
@@ -378,37 +392,99 @@ static char *qtext_fetch(Size query_offset, int query_len,
 						 char *buffer, Size buffer_size);
 static bool need_gc_qtexts(void);
 static void gc_qtexts(void);
-static TimestampTz entry_reset(Oid userid, Oid dbid, int64 queryid, bool minmax_only);
 static char *generate_normalized_query(const JumbleState *jstate,
 									   const char *query,
 									   int query_loc, int *query_len_p);
 
+static void pg_stat_statements_internal(FunctionCallInfo fcinfo,
+										pgssVersion api_version,
+										bool showtext);
+
+static bool pgss_flush_pending_cb(PgStat_EntryRef *entry_ref, bool nowait);
+static void pgss_to_serialized_data(const PgStat_HashKey *key,
+									const PgStatShared_Common *header,
+									FILE *statfile);
+static bool pgss_from_serialized_data(const PgStat_HashKey *key,
+									  PgStatShared_Common *header,
+									  FILE *statfile);
+static void pgss_finish(PgStat_StatsFileOp status);
+
+/*--------------------------------------------------------------------------
+ * Custom pgstat kind definition
+ *--------------------------------------------------------------------------
+ */
+
+static const PgStat_KindInfo pgss_kind_info = {
+	.name = "pg_stat_statements",
+	.fixed_amount = false,
+	.write_to_file = true,
+	.track_entry_count = true,
+	.accessed_across_databases = true,
+	.shared_size = sizeof(PgStatShared_Pgss),
+	.shared_data_off = offsetof(PgStatShared_Pgss, counters),
+	.shared_data_len = sizeof(pgssCounters),
+	.pending_size = sizeof(PgStat_PgssPending),
+	.flush_pending_cb = pgss_flush_pending_cb,
+	.to_serialized_data = pgss_to_serialized_data,
+	.from_serialized_data = pgss_from_serialized_data,
+	.finish = pgss_finish,
+};
+
+static uint64
+pgss_hash_key(pgssHashKey *key)
+{
+	return hash_bytes_extended((const unsigned char *) key,
+							   sizeof(pgssHashKey), 0);
+}
+
+static void
+pgss_init_shmem(void *ptr, void *arg)
+{
+	pgssSharedState *state = (pgssSharedState *) ptr;
+
+	pg_atomic_init_u64(&state->nentries, 0);
+	pg_atomic_init_u64(&state->dealloc, 0);
+	pg_atomic_init_u32(&state->sweep_partition, 0);
+	state->stats_reset = GetCurrentTimestamp();
+	SpinLockInit(&state->mutex);
+	state->extent = 0;
+	state->n_writers = 0;
+	state->gc_count = 0;
+}
+
+static void
+pgss_attach_shmem(void)
+{
+	bool		found;
+
+	if (pgss_shared != NULL)
+		return;
+
+	pgss_shared = GetNamedDSMSegment("pg_stat_statements_state",
+									 sizeof(pgssSharedState),
+									 pgss_init_shmem,
+									 &found, NULL);
+
+	if (pgss_hash == NULL)
+		pgss_hash = GetNamedDSHash("pg_stat_statements", &pgss_dsh_params,
+								   &found);
+}
+
 /*
  * Module load callback
  */
 void
 _PG_init(void)
 {
-	/*
-	 * In order to create our shared memory area, we have to be loaded via
-	 * shared_preload_libraries.  If not, fall out without hooking into any of
-	 * the main system.  (We don't throw error here because it seems useful to
-	 * allow the pg_stat_statements functions to be created even when the
-	 * module isn't active.  The functions must protect themselves against
-	 * being called then, however.)
-	 */
 	if (!process_shared_preload_libraries_in_progress)
 		return;
 
-	/*
-	 * Inform the postmaster that we want to enable query_id calculation if
-	 * compute_query_id is set to auto.
-	 */
 	EnableQueryId();
 
-	/*
-	 * Define (or redefine) custom GUC variables.
-	 */
+	/* Register custom pgstat kind */
+	pgstat_register_kind(PGSTAT_KIND_PGSS, &pgss_kind_info);
+
+	/* Define GUCs */
 	DefineCustomIntVariable("pg_stat_statements.max",
 							"Sets the maximum number of statements tracked by pg_stat_statements.",
 							NULL,
@@ -416,7 +492,7 @@ _PG_init(void)
 							5000,
 							100,
 							INT_MAX / 2,
-							PGC_POSTMASTER,
+							PGC_SIGHUP,
 							0,
 							NULL,
 							NULL,
@@ -457,7 +533,7 @@ _PG_init(void)
 							 NULL);
 
 	DefineCustomBoolVariable("pg_stat_statements.save",
-							 "Save pg_stat_statements statistics across server shutdowns.",
+							 "Save pg_stat_statements data across server shutdowns.",
 							 NULL,
 							 &pgss_save,
 							 true,
@@ -469,11 +545,6 @@ _PG_init(void)
 
 	MarkGUCPrefixReserved("pg_stat_statements");
 
-	/*
-	 * Register our shared memory needs.
-	 */
-	RegisterShmemCallbacks(&pgss_shmem_callbacks);
-
 	/*
 	 * Install hooks.
 	 */
@@ -493,369 +564,568 @@ _PG_init(void)
 	ProcessUtility_hook = pgss_ProcessUtility;
 }
 
-/*
- * shmem request callback: Request shared memory resources.
- *
- * This is called at postmaster startup.  Note that the shared memory isn't
- * allocated here yet, this merely register our needs.
- *
- * In EXEC_BACKEND mode, this is also called in each backend, to re-attach to
- * the shared memory area that was already initialized.
- */
-static void
-pgss_shmem_request(void *arg)
-{
-	ShmemRequestHash(.name = "pg_stat_statements hash",
-					 .nelems = pgss_max,
-					 .hash_info.keysize = sizeof(pgssHashKey),
-					 .hash_info.entrysize = sizeof(pgssEntry),
-					 .hash_flags = HASH_ELEM | HASH_BLOBS,
-					 .ptr = &pgss_hash,
-		);
-	ShmemRequestStruct(.name = "pg_stat_statements",
-					   .size = sizeof(pgssSharedState),
-					   .ptr = (void **) &pgss,
-		);
-}
-
-/*
- * shmem init callback: Initialize our shared memory data structures at
- * postmaster startup.
- *
- * Load any pre-existing statistics from file.  Also create and load the
- * query-texts file, which is expected to exist (even if empty) while the
- * module is enabled.
+/*--------------------------------------------------------------------------
+ * pgstat flush helpers: merge per-kind timing counters into shared memory
+ *--------------------------------------------------------------------------
  */
 static void
-pgss_shmem_init(void *arg)
+pgss_flush_kind(pgssCounters * shared, pgssCounters * pending, pgssStoreKind kind)
 {
-	int			tranche_id;
-	FILE	   *file = NULL;
-	FILE	   *qfile = NULL;
-	uint32		header;
-	int32		num;
-	int32		pgver;
-	int32		i;
-	int			buffer_size;
-	char	   *buffer = NULL;
-
-	/*
-	 * We already checked that we're loaded from shared_preload_libraries in
-	 * _PG_init(), so we should not get here after postmaster startup.
-	 */
-	Assert(!IsUnderPostmaster);
-
-	/*
-	 * Initialize the shmem area with no statistics.
-	 */
-	tranche_id = LWLockNewTrancheId("pg_stat_statements");
-	LWLockInitialize(&pgss->lock.lock, tranche_id);
-	pgss->cur_median_usage = ASSUMED_MEDIAN_INIT;
-	pgss->mean_query_len = ASSUMED_LENGTH_INIT;
-	SpinLockInit(&pgss->mutex);
-	pgss->extent = 0;
-	pgss->n_writers = 0;
-	pgss->gc_count = 0;
-	pgss->stats.dealloc = 0;
-	pgss->stats.stats_reset = GetCurrentTimestamp();
-
-	/* The hash table must've also been initialized by now */
-	Assert(pgss_hash != NULL);
-
-	/*
-	 * Set up a shmem exit hook to dump the statistics to disk on postmaster
-	 * (or standalone backend) exit.
-	 */
-	on_shmem_exit(pgss_shmem_shutdown, (Datum) 0);
+	int64		n_a,
+				n_b;
+	double		delta;
 
-	/*
-	 * Load any pre-existing statistics from file.
-	 *
-	 * Note: we don't bother with locks here, because there should be no other
-	 * processes running when this code is reached.
-	 */
+	n_a = shared->calls[kind];
+	n_b = pending->calls[kind];
 
-	/* Unlink query text file possibly left over from crash */
-	unlink(PGSS_TEXT_FILE);
-
-	/* Allocate new query text temp file */
-	qfile = AllocateFile(PGSS_TEXT_FILE, PG_BINARY_W);
-	if (qfile == NULL)
-		goto write_error;
+	shared->calls[kind] += n_b;
+	shared->total_time[kind] += pending->total_time[kind];
 
-	/*
-	 * If we were told not to load old statistics, we're done.  (Note we do
-	 * not try to unlink any old dump file in this case.  This seems a bit
-	 * questionable but it's the historical behavior.)
-	 */
-	if (!pgss_save)
+	if (n_a == 0)
 	{
-		FreeFile(qfile);
-		return;
+		shared->min_time[kind] = pending->min_time[kind];
+		shared->max_time[kind] = pending->max_time[kind];
+		shared->mean_time[kind] = pending->mean_time[kind];
+		shared->sum_var_time[kind] = pending->sum_var_time[kind];
 	}
-
-	/*
-	 * Attempt to load old statistics from the dump file.
-	 */
-	file = AllocateFile(PGSS_DUMP_FILE, PG_BINARY_R);
-	if (file == NULL)
+	else
 	{
-		if (errno != ENOENT)
-			goto read_error;
-		/* No existing persisted stats file, so we're done */
-		FreeFile(qfile);
-		return;
+		if (pending->min_time[kind] < shared->min_time[kind])
+			shared->min_time[kind] = pending->min_time[kind];
+		if (pending->max_time[kind] > shared->max_time[kind])
+			shared->max_time[kind] = pending->max_time[kind];
+
+		/*
+		 * Chan's parallel variance algorithm: combine two sets of (count,
+		 * mean, sum_of_squared_deviations). See
+		 * <http://www.johndcook.com/blog/standard_deviation/>
+		 */
+		delta = pending->mean_time[kind] - shared->mean_time[kind];
+		shared->sum_var_time[kind] +=
+			pending->sum_var_time[kind] +
+			delta * delta * (double) n_a * (double) n_b / (double) (n_a + n_b);
+		shared->mean_time[kind] =
+			shared->total_time[kind] / shared->calls[kind];
 	}
+}
 
-	buffer_size = 2048;
-	buffer = (char *) palloc(buffer_size);
+static bool
+pgss_flush_pending_cb(PgStat_EntryRef *entry_ref, bool nowait)
+{
+	PgStat_PgssPending *pending;
+	PgStatShared_Pgss *shared;
 
-	if (fread(&header, sizeof(uint32), 1, file) != 1 ||
-		fread(&pgver, sizeof(uint32), 1, file) != 1 ||
-		fread(&num, sizeof(int32), 1, file) != 1)
-		goto read_error;
+	pending = (PgStat_PgssPending *) entry_ref->pending;
+	shared = (PgStatShared_Pgss *) entry_ref->shared_stats;
 
-	if (header != PGSS_FILE_HEADER ||
-		pgver != PGSS_PG_MAJOR_VERSION)
-		goto data_error;
+	if (!pgstat_lock_entry(entry_ref, nowait))
+		return false;
 
-	for (i = 0; i < num; i++)
-	{
-		pgssEntry	temp;
-		pgssEntry  *entry;
-		Size		query_offset;
+	shared->key = pending->key;
+
+	pgss_flush_kind(&shared->counters, &pending->counters, PGSS_EXEC);
+
+	if (pgss_track_planning && pending->counters.calls[PGSS_PLAN] > 0)
+		pgss_flush_kind(&shared->counters, &pending->counters, PGSS_PLAN);
+
+	shared->counters.rows += pending->counters.rows;
+	shared->counters.shared_blks_hit += pending->counters.shared_blks_hit;
+	shared->counters.shared_blks_read += pending->counters.shared_blks_read;
+	shared->counters.shared_blks_dirtied += pending->counters.shared_blks_dirtied;
+	shared->counters.shared_blks_written += pending->counters.shared_blks_written;
+	shared->counters.local_blks_hit += pending->counters.local_blks_hit;
+	shared->counters.local_blks_read += pending->counters.local_blks_read;
+	shared->counters.local_blks_dirtied += pending->counters.local_blks_dirtied;
+	shared->counters.local_blks_written += pending->counters.local_blks_written;
+	shared->counters.temp_blks_read += pending->counters.temp_blks_read;
+	shared->counters.temp_blks_written += pending->counters.temp_blks_written;
+	shared->counters.shared_blk_read_time += pending->counters.shared_blk_read_time;
+	shared->counters.shared_blk_write_time += pending->counters.shared_blk_write_time;
+	shared->counters.local_blk_read_time += pending->counters.local_blk_read_time;
+	shared->counters.local_blk_write_time += pending->counters.local_blk_write_time;
+	shared->counters.temp_blk_read_time += pending->counters.temp_blk_read_time;
+	shared->counters.temp_blk_write_time += pending->counters.temp_blk_write_time;
+	shared->counters.wal_records += pending->counters.wal_records;
+	shared->counters.wal_fpi += pending->counters.wal_fpi;
+	shared->counters.wal_bytes += pending->counters.wal_bytes;
+	shared->counters.wal_buffers_full += pending->counters.wal_buffers_full;
+	shared->counters.jit_functions += pending->counters.jit_functions;
+	shared->counters.jit_generation_time += pending->counters.jit_generation_time;
+	shared->counters.jit_inlining_count += pending->counters.jit_inlining_count;
+	shared->counters.jit_inlining_time += pending->counters.jit_inlining_time;
+	shared->counters.jit_optimization_count += pending->counters.jit_optimization_count;
+	shared->counters.jit_optimization_time += pending->counters.jit_optimization_time;
+	shared->counters.jit_emission_count += pending->counters.jit_emission_count;
+	shared->counters.jit_emission_time += pending->counters.jit_emission_time;
+	shared->counters.jit_deform_count += pending->counters.jit_deform_count;
+	shared->counters.jit_deform_time += pending->counters.jit_deform_time;
+	shared->counters.parallel_workers_to_launch += pending->counters.parallel_workers_to_launch;
+	shared->counters.parallel_workers_launched += pending->counters.parallel_workers_launched;
+	shared->counters.generic_plan_calls += pending->counters.generic_plan_calls;
+	shared->counters.custom_plan_calls += pending->counters.custom_plan_calls;
+
+	pgstat_unlock_entry(entry_ref);
 
-		if (fread(&temp, sizeof(pgssEntry), 1, file) != 1)
-			goto read_error;
+	return true;
+}
 
-		/* Encoding is the only field we can easily sanity-check */
-		if (!PG_VALID_BE_ENCODING(temp.encoding))
-			goto data_error;
+/*
+ * Serialize dshash entry data + query text inline alongside each pgstat entry.
+ * On restart, from_serialized_data reconstructs both.
+ */
+static void
+pgss_to_serialized_data(const PgStat_HashKey *key,
+						const PgStatShared_Common *header,
+						FILE *statfile)
+{
+	static char *qbuffer = NULL;
+	static Size qbuffer_size = 0;
 
-		/* Resize buffer as needed */
-		if (temp.query_len >= buffer_size)
-		{
-			buffer_size = Max(buffer_size * 2, temp.query_len + 1);
-			buffer = repalloc(buffer, buffer_size);
-		}
+	PgStatShared_Pgss *shpgss = (PgStatShared_Pgss *) header;
+	pgssEntry  *entry;
+	bool		found = false;
+	pgssEntry	serialized;
+	char	   *qtext = NULL;
+	int			qtext_len = 0;
 
-		if (fread(buffer, 1, temp.query_len + 1, file) != temp.query_len + 1)
-			goto read_error;
+	if (!pgss_save)
+	{
+		pgstat_write_chunk_s(statfile, &found);
+		return;
+	}
 
-		/* Should have a trailing null, but let's make sure */
-		buffer[temp.query_len] = '\0';
+	pgss_attach_shmem();
 
-		/* Skip loading "sticky" entries */
-		if (IS_STICKY(temp.counters))
-			continue;
+	memset(&serialized, 0, sizeof(pgssEntry));
 
-		/* Store the query text */
-		query_offset = pgss->extent;
-		if (fwrite(buffer, 1, temp.query_len + 1, qfile) != temp.query_len + 1)
-			goto write_error;
-		pgss->extent += temp.query_len + 1;
-
-		/* make the hashtable entry (discards old entries if too many) */
-		entry = entry_alloc(&temp.key, query_offset, temp.query_len,
-							temp.encoding,
-							false);
-
-		/* copy in the actual stats */
-		entry->counters = temp.counters;
-		entry->stats_since = temp.stats_since;
-		entry->minmax_stats_since = temp.minmax_stats_since;
+	entry = dshash_find(pgss_hash, &shpgss->key, false);
+	if (entry)
+	{
+		serialized = *entry;
+		found = true;
+		dshash_release_lock(pgss_hash, entry);
 	}
 
-	/* Read global statistics for pg_stat_statements */
-	if (fread(&pgss->stats, sizeof(pgssGlobalStats), 1, file) != 1)
-		goto read_error;
+	pgstat_write_chunk_s(statfile, &found);
 
-	pfree(buffer);
-	FreeFile(file);
-	FreeFile(qfile);
+	if (!found)
+		return;
 
-	/*
-	 * Remove the persisted stats file so it's not included in
-	 * backups/replication standbys, etc.  A new file will be written on next
-	 * shutdown.
-	 *
-	 * Note: it's okay if the PGSS_TEXT_FILE is included in a basebackup,
-	 * because we remove that file on startup; it acts inversely to
-	 * PGSS_DUMP_FILE, in that it is only supposed to be around when the
-	 * server is running, whereas PGSS_DUMP_FILE is only supposed to be around
-	 * when the server is not running.  Leaving the file creates no danger of
-	 * a newly restored database having a spurious record of execution costs,
-	 * which is what we're really concerned about here.
-	 */
-	unlink(PGSS_DUMP_FILE);
+	pgstat_write_chunk_s(statfile, &serialized);
 
-	return;
+	/* Load query text file once, reuse across all entries */
+	if (!qbuffer)
+		qbuffer = qtext_load_file(&qbuffer_size);
 
-read_error:
-	ereport(LOG,
-			(errcode_for_file_access(),
-			 errmsg("could not read file \"%s\": %m",
-					PGSS_DUMP_FILE)));
-	goto fail;
-data_error:
-	ereport(LOG,
-			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-			 errmsg("ignoring invalid data in file \"%s\"",
-					PGSS_DUMP_FILE)));
-	goto fail;
-write_error:
-	ereport(LOG,
-			(errcode_for_file_access(),
-			 errmsg("could not write file \"%s\": %m",
-					PGSS_TEXT_FILE)));
-fail:
-	if (buffer)
-		pfree(buffer);
-	if (file)
-		FreeFile(file);
-	if (qfile)
-		FreeFile(qfile);
-	/* If possible, throw away the bogus file; ignore any error */
-	unlink(PGSS_DUMP_FILE);
+	if (serialized.query_len >= 0 && qbuffer)
+		qtext = qtext_fetch(serialized.query_offset, serialized.query_len,
+							qbuffer, qbuffer_size);
 
-	/*
-	 * Don't unlink PGSS_TEXT_FILE here; it should always be around while the
-	 * server is running with pg_stat_statements enabled
-	 */
+	if (qtext)
+	{
+		qtext_len = serialized.query_len;
+		pgstat_write_chunk_s(statfile, &qtext_len);
+		pgstat_write_chunk(statfile, qtext, qtext_len + 1);
+	}
+	else
+	{
+		qtext_len = -1;
+		pgstat_write_chunk_s(statfile, &qtext_len);
+	}
 }
 
 /*
- * shmem_shutdown hook: Dump statistics into file.
- *
- * Note: we don't bother with acquiring lock, because there should be no
- * other processes running when this is called.
+ * Deserialize auxiliary data: recreate the dshash entry and store query
+ * text in the file.
  */
-static void
-pgss_shmem_shutdown(int code, Datum arg)
+static bool
+pgss_from_serialized_data(const PgStat_HashKey *key,
+						  PgStatShared_Common *header,
+						  FILE *statfile)
 {
-	FILE	   *file;
-	char	   *qbuffer = NULL;
-	Size		qbuffer_size = 0;
-	HASH_SEQ_STATUS hash_seq;
-	int32		num_entries;
-	pgssEntry  *entry;
-
-	/* Don't try to dump during a crash. */
-	if (code)
-		return;
+	bool		had_entry;
+	pgssEntry	serialized;
+	pgssEntry  *dsh_entry;
+	bool		found;
+	int			qtext_len;
 
-	/* Safety check ... shouldn't get here unless shmem is set up. */
-	if (!pgss || !pgss_hash)
-		return;
+	if (!pgstat_read_chunk_s(statfile, &had_entry))
+		return false;
 
-	/* Don't dump if told not to. */
-	if (!pgss_save)
-		return;
+	if (!had_entry)
+	{
+		pgstat_drop_entry(PGSTAT_KIND_PGSS, key->dboid, key->objid, false);
+		return true;
+	}
 
-	file = AllocateFile(PGSS_DUMP_FILE ".tmp", PG_BINARY_W);
-	if (file == NULL)
-		goto error;
+	if (!pgstat_read_chunk_s(statfile, &serialized))
+		return false;
 
-	if (fwrite(&PGSS_FILE_HEADER, sizeof(uint32), 1, file) != 1)
-		goto error;
-	if (fwrite(&PGSS_PG_MAJOR_VERSION, sizeof(uint32), 1, file) != 1)
-		goto error;
-	num_entries = hash_get_num_entries(pgss_hash);
-	if (fwrite(&num_entries, sizeof(int32), 1, file) != 1)
-		goto error;
+	if (!pgstat_read_chunk_s(statfile, &qtext_len))
+		return false;
 
-	qbuffer = qtext_load_file(&qbuffer_size);
-	if (qbuffer == NULL)
-		goto error;
+	pgss_attach_shmem();
 
-	/*
-	 * When serializing to disk, we store query texts immediately after their
-	 * entry data.  Any orphaned query texts are thereby excluded.
-	 */
-	hash_seq_init(&hash_seq, pgss_hash);
-	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	dsh_entry = dshash_find_or_insert(pgss_hash, &serialized.key, &found);
+	if (!found)
 	{
-		int			len = entry->query_len;
-		char	   *qstr = qtext_fetch(entry->query_offset, len,
-									   qbuffer, qbuffer_size);
+		pg_atomic_init_u32(&dsh_entry->refcount, pg_atomic_read_u32(&serialized.refcount));
+		dsh_entry->encoding = serialized.encoding;
+		dsh_entry->stats_since = serialized.stats_since;
+		dsh_entry->minmax_stats_since = serialized.minmax_stats_since;
+
+		if (qtext_len >= 0)
+		{
+			char	   *qtext;
+			Size		query_offset;
+
+			qtext = palloc(qtext_len + 1);
+			if (!pgstat_read_chunk(statfile, qtext, qtext_len + 1))
+			{
+				dshash_release_lock(pgss_hash, dsh_entry);
+				pfree(qtext);
+				return false;
+			}
 
-		if (qstr == NULL)
-			continue;			/* Ignore any entries with bogus texts */
+			/* Store in the text file */
+			if (qtext_store(qtext, qtext_len, &query_offset, NULL))
+			{
+				dsh_entry->query_offset = query_offset;
+				dsh_entry->query_len = qtext_len;
+			}
+			else
+			{
+				dsh_entry->query_offset = 0;
+				dsh_entry->query_len = -1;
+			}
 
-		if (fwrite(entry, sizeof(pgssEntry), 1, file) != 1 ||
-			fwrite(qstr, 1, len + 1, file) != len + 1)
+			pfree(qtext);
+		}
+		else
 		{
-			/* note: we assume hash_seq_term won't change errno */
-			hash_seq_term(&hash_seq);
-			goto error;
+			dsh_entry->query_offset = 0;
+			dsh_entry->query_len = -1;
 		}
+
+		pg_atomic_fetch_add_u64(&pgss_shared->nentries, 1);
 	}
+	else
+	{
+		/* Entry already exists (race), skip the text */
+		if (qtext_len >= 0)
+		{
+			char	   *discard = palloc(qtext_len + 1);
 
-	/* Dump global statistics for pg_stat_statements */
-	if (fwrite(&pgss->stats, sizeof(pgssGlobalStats), 1, file) != 1)
-		goto error;
+			pgstat_read_chunk(statfile, discard, qtext_len + 1);
+			pfree(discard);
+		}
+	}
 
-	pfree(qbuffer);
-	qbuffer = NULL;
+	dshash_release_lock(pgss_hash, dsh_entry);
 
-	if (FreeFile(file))
-	{
-		file = NULL;
-		goto error;
-	}
+	return true;
+}
 
-	/*
-	 * Rename file into place, so we atomically replace any old one.
-	 */
-	(void) durable_rename(PGSS_DUMP_FILE ".tmp", PGSS_DUMP_FILE, LOG);
+static void
+pgss_finish(PgStat_StatsFileOp status)
+{
+	switch (status)
+	{
+		case STATS_WRITE:
 
-	/* Unlink query-texts file; it's not needed while shutdown */
-	unlink(PGSS_TEXT_FILE);
+			/*
+			 * Text has been serialized into the pgstat file; remove the
+			 * original.
+			 */
+			unlink(PGSS_TEXT_FILE);
+			break;
 
-	return;
+		case STATS_READ:
+			/* Text file was rebuilt by from_serialized_data; keep it. */
+			break;
 
-error:
-	ereport(LOG,
-			(errcode_for_file_access(),
-			 errmsg("could not write file \"%s\": %m",
-					PGSS_DUMP_FILE ".tmp")));
-	if (qbuffer)
-		pfree(qbuffer);
-	if (file)
-		FreeFile(file);
-	unlink(PGSS_DUMP_FILE ".tmp");
-	unlink(PGSS_TEXT_FILE);
+		case STATS_DISCARD:
+			/* Stats discarded; remove orphaned text file. */
+			unlink(PGSS_TEXT_FILE);
+			break;
+	}
 }
 
-/*
- * Post-parse-analysis hook: mark query with a queryId
+/*--------------------------------------------------------------------------
+ * pgss_store: Record statistics for one statement execution.
+ *
+ * Creates dshash entry if new, then accumulates counters in the pending
+ * pgstat entry.
+ *--------------------------------------------------------------------------
  */
 static void
-pgss_post_parse_analyze(ParseState *pstate, Query *query, const JumbleState *jstate)
+pgss_store(const char *query, int64 queryId,
+		   int query_location, int query_len,
+		   pgssStoreKind kind,
+		   double total_time, uint64 rows,
+		   const BufferUsage *bufusage,
+		   const WalUsage *walusage,
+		   const struct JitInstrumentation *jitusage,
+		   const JumbleState *jstate,
+		   int parallel_workers_to_launch,
+		   int parallel_workers_launched,
+		   PlannedStmtOrigin planOrigin)
 {
-	if (prev_post_parse_analyze_hook)
-		prev_post_parse_analyze_hook(pstate, query, jstate);
+	pgssHashKey key;
+	pgssEntry  *dsh_entry;
+	bool		found;
+	uint64		objid;
+	PgStat_EntryRef *entry_ref;
+	PgStat_PgssPending *pending;
+	char	   *norm_query = NULL;
+	int			encoding = GetDatabaseEncoding();
+
+	Assert(query != NULL);
 
-	/* Safety check... */
-	if (!pgss || !pgss_hash || !pgss_enabled(nesting_level))
+	if (queryId == INT64CONST(0))
 		return;
 
 	/*
-	 * If it's EXECUTE, clear the queryId so that stats will accumulate for
-	 * the underlying PREPARE.  But don't do this if we're not tracking
-	 * utility statements, to avoid messing up another extension that might be
-	 * tracking them.
+	 * Confine our attention to the relevant part of the string, if the query
+	 * is a portion of a multi-statement source string, and update query
+	 * location and length if needed.
 	 */
-	if (query->utilityStmt)
+	query = CleanQuerytext(query, &query_location, &query_len);
+
+	/* Build key */
+	memset(&key, 0, sizeof(pgssHashKey));
+	key.userid = GetUserId();
+	key.dbid = MyDatabaseId;
+	key.queryid = queryId;
+	key.toplevel = (nesting_level == 0);
+
+	/* Create or update dshash registry entry */
+	pgss_attach_shmem();
+
+	/* Fast path: look up with shared lock only */
+	dsh_entry = dshash_find(pgss_hash, &key, false);
+	if (dsh_entry)
 	{
-		if (pgss_track_utility && IsA(query->utilityStmt, ExecuteStmt))
-		{
-			query->queryId = INT64CONST(0);
+		/* Bump refcount atomically under shared lock */
+		uint32		cur = pg_atomic_read_u32(&dsh_entry->refcount);
+
+		if (cur < PGSS_REF_CAP)
+			pg_atomic_compare_exchange_u32(&dsh_entry->refcount, &cur, cur + 1);
+
+		dshash_release_lock(pgss_hash, dsh_entry);
+
+		/* If jstate is provided and kind is PGSS_INVALID, nothing more to do */
+		if (jstate && kind == PGSS_INVALID)
 			return;
-		}
 	}
-
-	/*
+	else
+	{
+		Size		query_offset;
+		int			gc_count;
+		bool		stored;
+		bool		do_gc;
+
+		/* Slow path: insert with exclusive lock */
+		if (jstate)
+		{
+			norm_query = generate_normalized_query(jstate, query,
+												   query_location,
+												   &query_len);
+		}
+
+		/* Store query text in the file */
+		stored = qtext_store(norm_query ? norm_query : query, query_len,
+							 &query_offset, &gc_count);
+
+		/*
+		 * Determine whether we need to garbage collect external query texts.
+		 */
+		do_gc = need_gc_qtexts();
+
+		dsh_entry = dshash_find_or_insert(pgss_hash, &key, &found);
+		if (!found)
+		{
+			pg_atomic_init_u32(&dsh_entry->refcount, 1);
+			dsh_entry->stats_since = GetCurrentTimestamp();
+			dsh_entry->minmax_stats_since = dsh_entry->stats_since;
+			dsh_entry->encoding = encoding;
+
+			if (stored)
+			{
+				dsh_entry->query_offset = query_offset;
+				dsh_entry->query_len = query_len;
+			}
+			else
+			{
+				dsh_entry->query_offset = 0;
+				dsh_entry->query_len = -1;
+			}
+
+			dshash_release_lock(pgss_hash, dsh_entry);
+
+			pg_atomic_fetch_add_u64(&pgss_shared->nentries, 1);
+			entry_dealloc();
+
+			/* If needed, perform garbage collection */
+			if (do_gc)
+				gc_qtexts();
+
+			if (jstate && kind == PGSS_INVALID)
+			{
+				if (norm_query)
+					pfree(norm_query);
+				return;
+			}
+		}
+		else
+		{
+			/* Another backend inserted it concurrently */
+			uint32		cur = pg_atomic_read_u32(&dsh_entry->refcount);
+
+			if (cur < PGSS_REF_CAP)
+				pg_atomic_compare_exchange_u32(&dsh_entry->refcount, &cur, cur + 1);
+			dshash_release_lock(pgss_hash, dsh_entry);
+
+			if (jstate && kind == PGSS_INVALID)
+			{
+				if (norm_query)
+					pfree(norm_query);
+				return;
+			}
+		}
+	}
+
+	if (norm_query)
+		pfree(norm_query);
+
+	/* Accumulate counters in pending pgstat entry */
+	Assert(kind == PGSS_PLAN || kind == PGSS_EXEC);
+
+	objid = pgss_hash_key(&key);
+
+	entry_ref = pgstat_prep_pending_entry(PGSTAT_KIND_PGSS, key.dbid,
+										  objid, NULL);
+	pending = (PgStat_PgssPending *) entry_ref->pending;
+	pending->key = key;
+
+	pending->counters.calls[kind]++;
+	pending->counters.total_time[kind] += total_time;
+
+	if (pending->counters.calls[kind] == 1)
+	{
+		pending->counters.min_time[kind] = total_time;
+		pending->counters.max_time[kind] = total_time;
+		pending->counters.mean_time[kind] = total_time;
+	}
+	else
+	{
+		/*
+		 * Welford's online algorithm for accumulating mean and sum of squared
+		 * deviations. See <http://www.johndcook.com/blog/standard_deviation/>
+		 */
+		double		old_mean = pending->counters.mean_time[kind];
+
+		pending->counters.mean_time[kind] +=
+			(total_time - old_mean) / pending->counters.calls[kind];
+		pending->counters.sum_var_time[kind] +=
+			(total_time - old_mean) * (total_time - pending->counters.mean_time[kind]);
+
+		if (pending->counters.min_time[kind] > total_time)
+			pending->counters.min_time[kind] = total_time;
+		if (pending->counters.max_time[kind] < total_time)
+			pending->counters.max_time[kind] = total_time;
+	}
+
+	pending->counters.rows += rows;
+
+	if (bufusage)
+	{
+		pending->counters.shared_blks_hit += bufusage->shared_blks_hit;
+		pending->counters.shared_blks_read += bufusage->shared_blks_read;
+		pending->counters.shared_blks_dirtied += bufusage->shared_blks_dirtied;
+		pending->counters.shared_blks_written += bufusage->shared_blks_written;
+		pending->counters.local_blks_hit += bufusage->local_blks_hit;
+		pending->counters.local_blks_read += bufusage->local_blks_read;
+		pending->counters.local_blks_dirtied += bufusage->local_blks_dirtied;
+		pending->counters.local_blks_written += bufusage->local_blks_written;
+		pending->counters.temp_blks_read += bufusage->temp_blks_read;
+		pending->counters.temp_blks_written += bufusage->temp_blks_written;
+		pending->counters.shared_blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->shared_blk_read_time);
+		pending->counters.shared_blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->shared_blk_write_time);
+		pending->counters.local_blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->local_blk_read_time);
+		pending->counters.local_blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->local_blk_write_time);
+		pending->counters.temp_blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->temp_blk_read_time);
+		pending->counters.temp_blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->temp_blk_write_time);
+	}
+
+	if (walusage)
+	{
+		pending->counters.wal_records += walusage->wal_records;
+		pending->counters.wal_fpi += walusage->wal_fpi;
+		pending->counters.wal_bytes += walusage->wal_bytes;
+		pending->counters.wal_buffers_full += walusage->wal_buffers_full;
+	}
+
+	if (jitusage)
+	{
+		pending->counters.jit_functions += jitusage->created_functions;
+		pending->counters.jit_generation_time += INSTR_TIME_GET_MILLISEC(jitusage->generation_counter);
+
+		if (INSTR_TIME_GET_MILLISEC(jitusage->deform_counter))
+			pending->counters.jit_deform_count++;
+		pending->counters.jit_deform_time += INSTR_TIME_GET_MILLISEC(jitusage->deform_counter);
+
+		if (INSTR_TIME_GET_MILLISEC(jitusage->inlining_counter))
+			pending->counters.jit_inlining_count++;
+		pending->counters.jit_inlining_time += INSTR_TIME_GET_MILLISEC(jitusage->inlining_counter);
+
+		if (INSTR_TIME_GET_MILLISEC(jitusage->optimization_counter))
+			pending->counters.jit_optimization_count++;
+		pending->counters.jit_optimization_time += INSTR_TIME_GET_MILLISEC(jitusage->optimization_counter);
+
+		if (INSTR_TIME_GET_MILLISEC(jitusage->emission_counter))
+			pending->counters.jit_emission_count++;
+		pending->counters.jit_emission_time += INSTR_TIME_GET_MILLISEC(jitusage->emission_counter);
+	}
+
+	pending->counters.parallel_workers_to_launch += parallel_workers_to_launch;
+	pending->counters.parallel_workers_launched += parallel_workers_launched;
+
+	if (planOrigin == PLAN_STMT_CACHE_GENERIC)
+		pending->counters.generic_plan_calls++;
+	else if (planOrigin == PLAN_STMT_CACHE_CUSTOM)
+		pending->counters.custom_plan_calls++;
+}
+
+/*--------------------------------------------------------------------------
+ * Hook implementations
+ *--------------------------------------------------------------------------
+ */
+
+static void
+pgss_post_parse_analyze(ParseState *pstate, Query *query,
+						const JumbleState *jstate)
+{
+	if (prev_post_parse_analyze_hook)
+		prev_post_parse_analyze_hook(pstate, query, jstate);
+
+	if (!pgss_enabled(nesting_level))
+		return;
+
+	/*
+	 * Clear queryId for EXECUTE so stats accumulate under the PREPARE's
+	 * queryId instead.
+	 */
+	if (query->utilityStmt)
+	{
+		if (pgss_track_utility && IsA(query->utilityStmt, ExecuteStmt))
+		{
+			query->queryId = INT64CONST(0);
+			return;
+		}
+	}
+
+	/*
 	 * If query jumbling were able to identify any ignorable constants, we
 	 * immediately create a hash table entry for the query, so that we can
 	 * record the normalized form of the query string.  If there were no such
@@ -879,10 +1149,6 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, const JumbleState *jst
 				   PLAN_STMT_UNKNOWN);
 }
 
-/*
- * Planner hook: forward to regular planner, but measure planning time
- * if needed.
- */
 static PlannedStmt *
 pgss_planner(Query *parse,
 			 const char *query_string,
@@ -908,13 +1174,7 @@ pgss_planner(Query *parse,
 		WalUsage	walusage_start,
 					walusage;
 
-		/* We need to track buffer usage as the planner can access them. */
 		bufusage_start = pgBufferUsage;
-
-		/*
-		 * Similarly the planner could write some WAL records in some cases
-		 * (e.g. setting a hint bit with those being WAL-logged)
-		 */
 		walusage_start = pgWalUsage;
 		INSTR_TIME_SET_CURRENT(start);
 
@@ -937,11 +1197,9 @@ pgss_planner(Query *parse,
 		INSTR_TIME_SET_CURRENT(duration);
 		INSTR_TIME_SUBTRACT(duration, start);
 
-		/* calc differences of buffer counters. */
 		memset(&bufusage, 0, sizeof(BufferUsage));
 		BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
 
-		/* calc differences of WAL counters. */
 		memset(&walusage, 0, sizeof(WalUsage));
 		WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
 
@@ -956,17 +1214,11 @@ pgss_planner(Query *parse,
 				   &walusage,
 				   NULL,
 				   NULL,
-				   0,
-				   0,
+				   0, 0,
 				   result->planOrigin);
 	}
 	else
 	{
-		/*
-		 * Even though we're not tracking plan time for this statement, we
-		 * must still increment the nesting level, to ensure that functions
-		 * evaluated during planning are not seen as top-level calls.
-		 */
 		nesting_level++;
 		PG_TRY();
 		{
@@ -987,20 +1239,12 @@ pgss_planner(Query *parse,
 	return result;
 }
 
-/*
- * ExecutorStart hook: start up tracking if needed
- */
 static void
 pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 {
-	/*
-	 * If query has queryId zero, don't track it.  This prevents double
-	 * counting of optimizable statements that are directly contained in
-	 * utility statements.
-	 */
-	if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != INT64CONST(0))
+	if (pgss_enabled(nesting_level) &&
+		queryDesc->plannedstmt->queryId != INT64CONST(0))
 	{
-		/* Request all summary instrumentation, i.e. timing, buffers and WAL */
 		queryDesc->query_instr_options |= INSTRUMENT_ALL;
 	}
 
@@ -1010,9 +1254,6 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 		standard_ExecutorStart(queryDesc, eflags);
 }
 
-/*
- * ExecutorRun hook: all we need do is track nesting depth
- */
 static void
 pgss_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, uint64 count)
 {
@@ -1031,9 +1272,6 @@ pgss_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, uint64 count)
 	PG_END_TRY();
 }
 
-/*
- * ExecutorFinish hook: all we need do is track nesting depth
- */
 static void
 pgss_ExecutorFinish(QueryDesc *queryDesc)
 {
@@ -1052,9 +1290,6 @@ pgss_ExecutorFinish(QueryDesc *queryDesc)
 	PG_END_TRY();
 }
 
-/*
- * ExecutorEnd hook: store results if needed
- */
 static void
 pgss_ExecutorEnd(QueryDesc *queryDesc)
 {
@@ -1085,9 +1320,6 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
 		standard_ExecutorEnd(queryDesc);
 }
 
-/*
- * ProcessUtility hook
- */
 static void
 pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 					bool readOnlyTree,
@@ -1102,36 +1334,9 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 	PlannedStmtOrigin saved_planOrigin = pstmt->planOrigin;
 	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level);
 
-	/*
-	 * Force utility statements to get queryId zero.  We do this even in cases
-	 * where the statement contains an optimizable statement for which a
-	 * queryId could be derived (such as EXPLAIN or DECLARE CURSOR).  For such
-	 * cases, runtime control will first go through ProcessUtility and then
-	 * the executor, and we don't want the executor hooks to do anything,
-	 * since we are already measuring the statement's costs at the utility
-	 * level.
-	 *
-	 * Note that this is only done if pg_stat_statements is enabled and
-	 * configured to track utility statements, in the unlikely possibility
-	 * that user configured another extension to handle utility statements
-	 * only.
-	 */
 	if (enabled)
 		pstmt->queryId = INT64CONST(0);
 
-	/*
-	 * If it's an EXECUTE statement, we don't track it and don't increment the
-	 * nesting level.  This allows the cycles to be charged to the underlying
-	 * PREPARE instead (by the Executor hooks), which is much more useful.
-	 *
-	 * We also don't track execution of PREPARE.  If we did, we would get one
-	 * hash table entry for the PREPARE (with hash calculated from the query
-	 * string), and then a different one with the same query string (but hash
-	 * calculated from the query tree) would be used to accumulate costs of
-	 * ensuing EXECUTEs.  This would be confusing.  Since PREPARE doesn't
-	 * actually run the planner (only parse+rewrite), its costs are generally
-	 * pretty negligible and it seems okay to just ignore it.
-	 */
 	if (enabled &&
 		!IsA(parsetree, ExecuteStmt) &&
 		!IsA(parsetree, PrepareStmt))
@@ -1166,36 +1371,20 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 		}
 		PG_END_TRY();
 
-		/*
-		 * CAUTION: do not access the *pstmt data structure again below here.
-		 * If it was a ROLLBACK or similar, that data structure may have been
-		 * freed.  We must copy everything we still need into local variables,
-		 * which we did above.
-		 *
-		 * For the same reason, we can't risk restoring pstmt->queryId to its
-		 * former value, which'd otherwise be a good idea.
-		 */
 		pstmt = NULL;
 
 		INSTR_TIME_SET_CURRENT(duration);
 		INSTR_TIME_SUBTRACT(duration, start);
 
-		/*
-		 * Track the total number of rows retrieved or affected by the utility
-		 * statements of COPY, FETCH, CREATE TABLE AS, CREATE MATERIALIZED
-		 * VIEW, REFRESH MATERIALIZED VIEW and SELECT INTO.
-		 */
 		rows = (qc && (qc->commandTag == CMDTAG_COPY ||
 					   qc->commandTag == CMDTAG_FETCH ||
 					   qc->commandTag == CMDTAG_SELECT ||
 					   qc->commandTag == CMDTAG_REFRESH_MATERIALIZED_VIEW)) ?
 			qc->nprocessed : 0;
 
-		/* calc differences of buffer counters. */
 		memset(&bufusage, 0, sizeof(BufferUsage));
 		BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
 
-		/* calc differences of WAL counters. */
 		memset(&walusage, 0, sizeof(WalUsage));
 		WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
 
@@ -1210,24 +1399,11 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 				   &walusage,
 				   NULL,
 				   NULL,
-				   0,
-				   0,
+				   0, 0,
 				   saved_planOrigin);
 	}
 	else
 	{
-		/*
-		 * Even though we're not tracking execution time for this statement,
-		 * we must still increment the nesting level, to ensure that functions
-		 * evaluated within it are not seen as top-level calls.  But don't do
-		 * so for EXECUTE; that way, when control reaches pgss_planner or
-		 * pgss_ExecutorStart, we will treat the costs as top-level if
-		 * appropriate.  Likewise, don't bump for PREPARE, so that parse
-		 * analysis will treat the statement as top-level if appropriate.
-		 *
-		 * To be absolutely certain we don't mess up the nesting level,
-		 * evaluate the bump_level condition just once.
-		 */
 		bool		bump_level =
 			!IsA(parsetree, ExecuteStmt) &&
 			!IsA(parsetree, PrepareStmt);
@@ -1254,251 +1430,33 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 	}
 }
 
-/*
- * Store some statistics for a statement.
- *
- * If jstate is not NULL then we're trying to create an entry for which
- * we have no statistics as yet; we just want to record the normalized
- * query string.  total_time, rows, bufusage and walusage are ignored in this
- * case.
- *
- * If kind is PGSS_PLAN or PGSS_EXEC, its value is used as the array position
- * for the arrays in the Counters field.
+/*--------------------------------------------------------------------------
+ * SQL-callable functions
+ *--------------------------------------------------------------------------
  */
-static void
-pgss_store(const char *query, int64 queryId,
-		   int query_location, int query_len,
-		   pgssStoreKind kind,
-		   double total_time, uint64 rows,
-		   const BufferUsage *bufusage,
-		   const WalUsage *walusage,
-		   const struct JitInstrumentation *jitusage,
-		   const JumbleState *jstate,
-		   int parallel_workers_to_launch,
-		   int parallel_workers_launched,
-		   PlannedStmtOrigin planOrigin)
-{
-	pgssHashKey key;
-	pgssEntry  *entry;
-	char	   *norm_query = NULL;
-	int			encoding = GetDatabaseEncoding();
-
-	Assert(query != NULL);
-
-	/* Safety check... */
-	if (!pgss || !pgss_hash)
-		return;
-
-	/*
-	 * Nothing to do if compute_query_id isn't enabled and no other module
-	 * computed a query identifier.
-	 */
-	if (queryId == INT64CONST(0))
-		return;
-
-	/*
-	 * Confine our attention to the relevant part of the string, if the query
-	 * is a portion of a multi-statement source string, and update query
-	 * location and length if needed.
-	 */
-	query = CleanQuerytext(query, &query_location, &query_len);
-
-	/* Set up key for hashtable search */
-
-	/* clear padding */
-	memset(&key, 0, sizeof(pgssHashKey));
-
-	key.userid = GetUserId();
-	key.dbid = MyDatabaseId;
-	key.queryid = queryId;
-	key.toplevel = (nesting_level == 0);
-
-	/* Lookup the hash table entry with shared lock. */
-	LWLockAcquire(&pgss->lock.lock, LW_SHARED);
-
-	entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_FIND, NULL);
-
-	/* Create new entry, if not present */
-	if (!entry)
-	{
-		Size		query_offset;
-		int			gc_count;
-		bool		stored;
-		bool		do_gc;
-
-		/*
-		 * Create a new, normalized query string if caller asked.  We don't
-		 * need to hold the lock while doing this work.  (Note: in any case,
-		 * it's possible that someone else creates a duplicate hashtable entry
-		 * in the interval where we don't hold the lock below.  That case is
-		 * handled by entry_alloc.)
-		 */
-		if (jstate)
-		{
-			LWLockRelease(&pgss->lock.lock);
-			norm_query = generate_normalized_query(jstate, query,
-												   query_location,
-												   &query_len);
-			LWLockAcquire(&pgss->lock.lock, LW_SHARED);
-		}
 
-		/* Append new query text to file with only shared lock held */
-		stored = qtext_store(norm_query ? norm_query : query, query_len,
-							 &query_offset, &gc_count);
-
-		/*
-		 * Determine whether we need to garbage collect external query texts
-		 * while the shared lock is still held.  This micro-optimization
-		 * avoids taking the time to decide this while holding exclusive lock.
-		 */
-		do_gc = need_gc_qtexts();
-
-		/* Need exclusive lock to make a new hashtable entry - promote */
-		LWLockRelease(&pgss->lock.lock);
-		LWLockAcquire(&pgss->lock.lock, LW_EXCLUSIVE);
-
-		/*
-		 * A garbage collection may have occurred while we weren't holding the
-		 * lock.  In the unlikely event that this happens, the query text we
-		 * stored above will have been garbage collected, so write it again.
-		 * This should be infrequent enough that doing it while holding
-		 * exclusive lock isn't a performance problem.
-		 */
-		if (!stored || pgss->gc_count != gc_count)
-			stored = qtext_store(norm_query ? norm_query : query, query_len,
-								 &query_offset, NULL);
-
-		/* If we failed to write to the text file, give up */
-		if (!stored)
-			goto done;
-
-		/* OK to create a new hashtable entry */
-		entry = entry_alloc(&key, query_offset, query_len, encoding,
-							jstate != NULL);
-
-		/* If needed, perform garbage collection while exclusive lock held */
-		if (do_gc)
-			gc_qtexts();
-	}
-
-	/* Increment the counts, except when jstate is not NULL */
-	if (!jstate)
-	{
-		Assert(kind == PGSS_PLAN || kind == PGSS_EXEC);
-
-		/*
-		 * Grab the spinlock while updating the counters (see comment about
-		 * locking rules at the head of the file)
-		 */
-		SpinLockAcquire(&entry->mutex);
-
-		/* "Unstick" entry if it was previously sticky */
-		if (IS_STICKY(entry->counters))
-			entry->counters.usage = USAGE_INIT;
-
-		entry->counters.calls[kind] += 1;
-		entry->counters.total_time[kind] += total_time;
-
-		if (entry->counters.calls[kind] == 1)
-		{
-			entry->counters.min_time[kind] = total_time;
-			entry->counters.max_time[kind] = total_time;
-			entry->counters.mean_time[kind] = total_time;
-		}
-		else
-		{
-			/*
-			 * Welford's method for accurately computing variance. See
-			 * <http://www.johndcook.com/blog/standard_deviation/>
-			 */
-			double		old_mean = entry->counters.mean_time[kind];
-
-			entry->counters.mean_time[kind] +=
-				(total_time - old_mean) / entry->counters.calls[kind];
-			entry->counters.sum_var_time[kind] +=
-				(total_time - old_mean) * (total_time - entry->counters.mean_time[kind]);
-
-			/*
-			 * Calculate min and max time. min = 0 and max = 0 means that the
-			 * min/max statistics were reset
-			 */
-			if (entry->counters.min_time[kind] == 0
-				&& entry->counters.max_time[kind] == 0)
-			{
-				entry->counters.min_time[kind] = total_time;
-				entry->counters.max_time[kind] = total_time;
-			}
-			else
-			{
-				if (entry->counters.min_time[kind] > total_time)
-					entry->counters.min_time[kind] = total_time;
-				if (entry->counters.max_time[kind] < total_time)
-					entry->counters.max_time[kind] = total_time;
-			}
-		}
-		entry->counters.rows += rows;
-		entry->counters.shared_blks_hit += bufusage->shared_blks_hit;
-		entry->counters.shared_blks_read += bufusage->shared_blks_read;
-		entry->counters.shared_blks_dirtied += bufusage->shared_blks_dirtied;
-		entry->counters.shared_blks_written += bufusage->shared_blks_written;
-		entry->counters.local_blks_hit += bufusage->local_blks_hit;
-		entry->counters.local_blks_read += bufusage->local_blks_read;
-		entry->counters.local_blks_dirtied += bufusage->local_blks_dirtied;
-		entry->counters.local_blks_written += bufusage->local_blks_written;
-		entry->counters.temp_blks_read += bufusage->temp_blks_read;
-		entry->counters.temp_blks_written += bufusage->temp_blks_written;
-		entry->counters.shared_blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->shared_blk_read_time);
-		entry->counters.shared_blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->shared_blk_write_time);
-		entry->counters.local_blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->local_blk_read_time);
-		entry->counters.local_blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->local_blk_write_time);
-		entry->counters.temp_blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->temp_blk_read_time);
-		entry->counters.temp_blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->temp_blk_write_time);
-		entry->counters.usage += USAGE_EXEC(total_time);
-		entry->counters.wal_records += walusage->wal_records;
-		entry->counters.wal_fpi += walusage->wal_fpi;
-		entry->counters.wal_bytes += walusage->wal_bytes;
-		entry->counters.wal_buffers_full += walusage->wal_buffers_full;
-		if (jitusage)
-		{
-			entry->counters.jit_functions += jitusage->created_functions;
-			entry->counters.jit_generation_time += INSTR_TIME_GET_MILLISEC(jitusage->generation_counter);
-
-			if (INSTR_TIME_GET_MILLISEC(jitusage->deform_counter))
-				entry->counters.jit_deform_count++;
-			entry->counters.jit_deform_time += INSTR_TIME_GET_MILLISEC(jitusage->deform_counter);
-
-			if (INSTR_TIME_GET_MILLISEC(jitusage->inlining_counter))
-				entry->counters.jit_inlining_count++;
-			entry->counters.jit_inlining_time += INSTR_TIME_GET_MILLISEC(jitusage->inlining_counter);
-
-			if (INSTR_TIME_GET_MILLISEC(jitusage->optimization_counter))
-				entry->counters.jit_optimization_count++;
-			entry->counters.jit_optimization_time += INSTR_TIME_GET_MILLISEC(jitusage->optimization_counter);
-
-			if (INSTR_TIME_GET_MILLISEC(jitusage->emission_counter))
-				entry->counters.jit_emission_count++;
-			entry->counters.jit_emission_time += INSTR_TIME_GET_MILLISEC(jitusage->emission_counter);
-		}
-
-		/* parallel worker counters */
-		entry->counters.parallel_workers_to_launch += parallel_workers_to_launch;
-		entry->counters.parallel_workers_launched += parallel_workers_launched;
-
-		/* plan cache counters */
-		if (planOrigin == PLAN_STMT_CACHE_GENERIC)
-			entry->counters.generic_plan_calls++;
-		else if (planOrigin == PLAN_STMT_CACHE_CUSTOM)
-			entry->counters.custom_plan_calls++;
-
-		SpinLockRelease(&entry->mutex);
-	}
+/* Number of output arguments (columns) for various API versions */
+#define PG_STAT_STATEMENTS_COLS_V1_0	14
+#define PG_STAT_STATEMENTS_COLS_V1_1	18
+#define PG_STAT_STATEMENTS_COLS_V1_2	19
+#define PG_STAT_STATEMENTS_COLS_V1_3	23
+#define PG_STAT_STATEMENTS_COLS_V1_8	32
+#define PG_STAT_STATEMENTS_COLS_V1_9	33
+#define PG_STAT_STATEMENTS_COLS_V1_10	43
+#define PG_STAT_STATEMENTS_COLS_V1_11	49
+#define PG_STAT_STATEMENTS_COLS_V1_12	52
+#define PG_STAT_STATEMENTS_COLS_V1_13	54
+#define PG_STAT_STATEMENTS_COLS			54	/* maximum of above */
 
-done:
-	LWLockRelease(&pgss->lock.lock);
+/*
+ * Reset statement statistics.
+ */
+Datum
+pg_stat_statements_reset(PG_FUNCTION_ARGS)
+{
+	entry_reset(0, 0, 0, false);
 
-	/* We postpone this clean-up until we're out of the lock */
-	if (norm_query)
-		pfree(norm_query);
+	PG_RETURN_VOID();
 }
 
 /*
@@ -1536,123 +1494,80 @@ pg_stat_statements_reset_1_11(PG_FUNCTION_ARGS)
 	PG_RETURN_TIMESTAMPTZ(entry_reset(userid, dbid, queryid, minmax_only));
 }
 
-/*
- * Reset statement statistics.
- */
-Datum
-pg_stat_statements_reset(PG_FUNCTION_ARGS)
-{
-	entry_reset(0, 0, 0, false);
-
-	PG_RETURN_VOID();
-}
-
-/* Number of output arguments (columns) for various API versions */
-#define PG_STAT_STATEMENTS_COLS_V1_0	14
-#define PG_STAT_STATEMENTS_COLS_V1_1	18
-#define PG_STAT_STATEMENTS_COLS_V1_2	19
-#define PG_STAT_STATEMENTS_COLS_V1_3	23
-#define PG_STAT_STATEMENTS_COLS_V1_8	32
-#define PG_STAT_STATEMENTS_COLS_V1_9	33
-#define PG_STAT_STATEMENTS_COLS_V1_10	43
-#define PG_STAT_STATEMENTS_COLS_V1_11	49
-#define PG_STAT_STATEMENTS_COLS_V1_12	52
-#define PG_STAT_STATEMENTS_COLS_V1_13	54
-#define PG_STAT_STATEMENTS_COLS			54	/* maximum of above */
-
-/*
- * Retrieve statement statistics.
- *
- * The SQL API of this function has changed multiple times, and will likely
- * do so again in future.  To support the case where a newer version of this
- * loadable module is being used with an old SQL declaration of the function,
- * we continue to support the older API versions.  For 1.2 and later, the
- * expected API version is identified by embedding it in the C name of the
- * function.  Unfortunately we weren't bright enough to do that for 1.1.
- */
 Datum
-pg_stat_statements_1_13(PG_FUNCTION_ARGS)
+pg_stat_statements_1_2(PG_FUNCTION_ARGS)
 {
 	bool		showtext = PG_GETARG_BOOL(0);
 
-	pg_stat_statements_internal(fcinfo, PGSS_V1_13, showtext);
-
+	pg_stat_statements_internal(fcinfo, PGSS_V1_2, showtext);
 	return (Datum) 0;
 }
 
 Datum
-pg_stat_statements_1_12(PG_FUNCTION_ARGS)
+pg_stat_statements_1_3(PG_FUNCTION_ARGS)
 {
 	bool		showtext = PG_GETARG_BOOL(0);
 
-	pg_stat_statements_internal(fcinfo, PGSS_V1_12, showtext);
-
+	pg_stat_statements_internal(fcinfo, PGSS_V1_3, showtext);
 	return (Datum) 0;
 }
 
 Datum
-pg_stat_statements_1_11(PG_FUNCTION_ARGS)
+pg_stat_statements_1_8(PG_FUNCTION_ARGS)
 {
 	bool		showtext = PG_GETARG_BOOL(0);
 
-	pg_stat_statements_internal(fcinfo, PGSS_V1_11, showtext);
-
+	pg_stat_statements_internal(fcinfo, PGSS_V1_8, showtext);
 	return (Datum) 0;
 }
 
 Datum
-pg_stat_statements_1_10(PG_FUNCTION_ARGS)
+pg_stat_statements_1_9(PG_FUNCTION_ARGS)
 {
 	bool		showtext = PG_GETARG_BOOL(0);
 
-	pg_stat_statements_internal(fcinfo, PGSS_V1_10, showtext);
-
+	pg_stat_statements_internal(fcinfo, PGSS_V1_9, showtext);
 	return (Datum) 0;
 }
 
 Datum
-pg_stat_statements_1_9(PG_FUNCTION_ARGS)
+pg_stat_statements_1_10(PG_FUNCTION_ARGS)
 {
 	bool		showtext = PG_GETARG_BOOL(0);
 
-	pg_stat_statements_internal(fcinfo, PGSS_V1_9, showtext);
-
+	pg_stat_statements_internal(fcinfo, PGSS_V1_10, showtext);
 	return (Datum) 0;
 }
 
 Datum
-pg_stat_statements_1_8(PG_FUNCTION_ARGS)
+pg_stat_statements_1_11(PG_FUNCTION_ARGS)
 {
 	bool		showtext = PG_GETARG_BOOL(0);
 
-	pg_stat_statements_internal(fcinfo, PGSS_V1_8, showtext);
-
+	pg_stat_statements_internal(fcinfo, PGSS_V1_11, showtext);
 	return (Datum) 0;
 }
 
 Datum
-pg_stat_statements_1_3(PG_FUNCTION_ARGS)
+pg_stat_statements_1_12(PG_FUNCTION_ARGS)
 {
 	bool		showtext = PG_GETARG_BOOL(0);
 
-	pg_stat_statements_internal(fcinfo, PGSS_V1_3, showtext);
-
+	pg_stat_statements_internal(fcinfo, PGSS_V1_12, showtext);
 	return (Datum) 0;
 }
 
 Datum
-pg_stat_statements_1_2(PG_FUNCTION_ARGS)
+pg_stat_statements_1_13(PG_FUNCTION_ARGS)
 {
-	bool		showtext = PG_GETARG_BOOL(0);
-
-	pg_stat_statements_internal(fcinfo, PGSS_V1_2, showtext);
+	bool		showtext = PG_GETARG_BOOL(0);
 
+	pg_stat_statements_internal(fcinfo, PGSS_V1_13, showtext);
 	return (Datum) 0;
 }
 
 /*
  * Legacy entry point for pg_stat_statements() API versions 1.0 and 1.1.
- * This can be removed someday, perhaps.
  */
 Datum
 pg_stat_statements(PG_FUNCTION_ARGS)
@@ -1663,33 +1578,31 @@ pg_stat_statements(PG_FUNCTION_ARGS)
 	return (Datum) 0;
 }
 
-/* Common code for all versions of pg_stat_statements() */
+/*
+ * pg_stat_statements_internal
+ *
+ * Scan the dshash for all tracked entries, fetch their counters from
+ * pgstat, and return the combined result set. Version-aware column emission.
+ */
 static void
 pg_stat_statements_internal(FunctionCallInfo fcinfo,
 							pgssVersion api_version,
 							bool showtext)
 {
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	dshash_seq_status status;
+	pgssEntry  *entry;
 	Oid			userid = GetUserId();
-	bool		is_allowed_role = false;
+	bool		is_allowed_role;
 	char	   *qbuffer = NULL;
 	Size		qbuffer_size = 0;
-	Size		extent = 0;
-	int			gc_count = 0;
-	HASH_SEQ_STATUS hash_seq;
-	pgssEntry  *entry;
 
-	/*
-	 * Superusers or roles with the privileges of pg_read_all_stats members
-	 * are allowed
-	 */
 	is_allowed_role = has_privs_of_role(userid, ROLE_PG_READ_ALL_STATS);
 
-	/* hash table must exist already */
-	if (!pgss || !pgss_hash)
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("pg_stat_statements must be loaded via \"shared_preload_libraries\"")));
+	pgss_attach_shmem();
+
+	/* Flush pending stats so we can read up-to-date counters */
+	pgstat_report_anytime_stat();
 
 	InitMaterializedSRF(fcinfo, 0);
 
@@ -1746,82 +1659,39 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
 			elog(ERROR, "incorrect number of output arguments");
 	}
 
-	/*
-	 * We'd like to load the query text file (if needed) while not holding any
-	 * lock on pgss->lock.  In the worst case we'll have to do this again
-	 * after we have the lock, but it's unlikely enough to make this a win
-	 * despite occasional duplicated work.  We need to reload if anybody
-	 * writes to the file (either a retail qtext_store(), or a garbage
-	 * collection) between this point and where we've gotten shared lock.  If
-	 * a qtext_store is actually in progress when we look, we might as well
-	 * skip the speculative load entirely.
-	 */
-	if (showtext)
-	{
-		int			n_writers;
-
-		/* Take the mutex so we can examine variables */
-		SpinLockAcquire(&pgss->mutex);
-		extent = pgss->extent;
-		n_writers = pgss->n_writers;
-		gc_count = pgss->gc_count;
-		SpinLockRelease(&pgss->mutex);
-
-		/* No point in loading file now if there are active writers */
-		if (n_writers == 0)
-			qbuffer = qtext_load_file(&qbuffer_size);
-	}
-
-	/*
-	 * Get shared lock, load or reload the query text file if we must, and
-	 * iterate over the hashtable entries.
-	 *
-	 * With a large hash table, we might be holding the lock rather longer
-	 * than one could wish.  However, this only blocks creation of new hash
-	 * table entries, and the larger the hash table the less likely that is to
-	 * be needed.  So we can hope this is okay.  Perhaps someday we'll decide
-	 * we need to partition the hash table to limit the time spent holding any
-	 * one lock.
-	 */
-	LWLockAcquire(&pgss->lock.lock, LW_SHARED);
-
+	/* Load the query text file */
 	if (showtext)
-	{
-		/*
-		 * Here it is safe to examine extent and gc_count without taking the
-		 * mutex.  Note that although other processes might change
-		 * pgss->extent just after we look at it, the strings they then write
-		 * into the file cannot yet be referenced in the hashtable, so we
-		 * don't care whether we see them or not.
-		 *
-		 * If qtext_load_file fails, we just press on; we'll return NULL for
-		 * every query text.
-		 */
-		if (qbuffer == NULL ||
-			pgss->extent != extent ||
-			pgss->gc_count != gc_count)
-		{
-			if (qbuffer)
-				pfree(qbuffer);
-			qbuffer = qtext_load_file(&qbuffer_size);
-		}
-	}
+		qbuffer = qtext_load_file(&qbuffer_size);
 
-	hash_seq_init(&hash_seq, pgss_hash);
-	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	dshash_seq_init(&status, pgss_hash, false);
+	while ((entry = dshash_seq_next(&status)) != NULL)
 	{
 		Datum		values[PG_STAT_STATEMENTS_COLS];
 		bool		nulls[PG_STAT_STATEMENTS_COLS];
 		int			i = 0;
-		Counters	tmp;
+		uint64		objid;
+		pgssCounters *counters;
+		pgssCounters tmp;
 		double		stddev;
-		int64		queryid = entry->key.queryid;
-		TimestampTz stats_since;
-		TimestampTz minmax_stats_since;
+		bool		may_free = false;
 
 		memset(values, 0, sizeof(values));
 		memset(nulls, 0, sizeof(nulls));
 
+		/* Fetch counters from pgstat */
+		objid = pgss_hash_key((pgssHashKey *) &entry->key);
+		counters = (pgssCounters *)
+			pgstat_fetch_entry(PGSTAT_KIND_PGSS, entry->key.dbid, objid,
+							   &may_free);
+
+		if (!counters)
+			continue;
+
+		tmp = *counters;
+		if (may_free)
+			pfree(counters);
+
+
 		values[i++] = ObjectIdGetDatum(entry->key.userid);
 		values[i++] = ObjectIdGetDatum(entry->key.dbid);
 		if (api_version >= PGSS_V1_9)
@@ -1830,7 +1700,7 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
 		if (is_allowed_role || entry->key.userid == userid)
 		{
 			if (api_version >= PGSS_V1_2)
-				values[i++] = Int64GetDatumFast(queryid);
+				values[i++] = Int64GetDatumFast(entry->key.queryid);
 
 			if (showtext)
 			{
@@ -1841,62 +1711,30 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
 
 				if (qstr)
 				{
-					char	   *enc;
-
-					enc = pg_any_to_server(qstr,
-										   entry->query_len,
-										   entry->encoding);
+					char	   *enc = pg_any_to_server(qstr, entry->query_len, entry->encoding);
 
 					values[i++] = CStringGetTextDatum(enc);
-
 					if (enc != qstr)
 						pfree(enc);
 				}
 				else
-				{
-					/* Just return a null if we fail to find the text */
 					nulls[i++] = true;
-				}
 			}
 			else
-			{
-				/* Query text not requested */
 				nulls[i++] = true;
-			}
 		}
 		else
 		{
-			/* Don't show queryid */
 			if (api_version >= PGSS_V1_2)
 				nulls[i++] = true;
 
-			/*
-			 * Don't show query text, but hint as to the reason for not doing
-			 * so if it was requested
-			 */
 			if (showtext)
 				values[i++] = CStringGetTextDatum("<insufficient privilege>");
 			else
 				nulls[i++] = true;
 		}
 
-		/* copy counters to a local variable to keep locking time short */
-		SpinLockAcquire(&entry->mutex);
-		tmp = entry->counters;
-		SpinLockRelease(&entry->mutex);
-
-		/*
-		 * The spinlock is not required when reading these two as they are
-		 * always updated when holding pgss->lock exclusively.
-		 */
-		stats_since = entry->stats_since;
-		minmax_stats_since = entry->minmax_stats_since;
-
-		/* Skip entry if unexecuted (ie, it's a pending "sticky" entry) */
-		if (IS_STICKY(tmp))
-			continue;
-
-		/* Note that we rely on PGSS_PLAN being 0 and PGSS_EXEC being 1. */
+		/* Note: PGSS_PLAN is 0, PGSS_EXEC is 1 */
 		for (int kind = 0; kind < PGSS_NUMKIND; kind++)
 		{
 			if (kind == PGSS_EXEC || api_version >= PGSS_V1_8)
@@ -1912,12 +1750,6 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
 				values[i++] = Float8GetDatumFast(tmp.max_time[kind]);
 				values[i++] = Float8GetDatumFast(tmp.mean_time[kind]);
 
-				/*
-				 * Note we are calculating the population variance here, not
-				 * the sample variance, as we have data for the whole
-				 * population, so Bessel's correction is not used, and we
-				 * don't divide by tmp.calls - 1.
-				 */
 				if (tmp.calls[kind] > 1)
 					stddev = sqrt(tmp.sum_var_time[kind] / tmp.calls[kind]);
 				else
@@ -1925,6 +1757,7 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
 				values[i++] = Float8GetDatumFast(stddev);
 			}
 		}
+
 		values[i++] = Int64GetDatumFast(tmp.rows);
 		values[i++] = Int64GetDatumFast(tmp.shared_blks_hit);
 		values[i++] = Int64GetDatumFast(tmp.shared_blks_read);
@@ -1962,8 +1795,6 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
 			values[i++] = Int64GetDatumFast(tmp.wal_fpi);
 
 			snprintf(buf, sizeof buf, UINT64_FORMAT, tmp.wal_bytes);
-
-			/* Convert to numeric. */
 			wal_bytes = DirectFunctionCall3(numeric_in,
 											CStringGetDatum(buf),
 											ObjectIdGetDatum(0),
@@ -1971,9 +1802,7 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
 			values[i++] = wal_bytes;
 		}
 		if (api_version >= PGSS_V1_12)
-		{
 			values[i++] = Int64GetDatumFast(tmp.wal_buffers_full);
-		}
 		if (api_version >= PGSS_V1_10)
 		{
 			values[i++] = Int64GetDatumFast(tmp.jit_functions);
@@ -2002,8 +1831,8 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
 		}
 		if (api_version >= PGSS_V1_11)
 		{
-			values[i++] = TimestampTzGetDatum(stats_since);
-			values[i++] = TimestampTzGetDatum(minmax_stats_since);
+			values[i++] = TimestampTzGetDatum(entry->stats_since);
+			values[i++] = TimestampTzGetDatum(entry->minmax_stats_since);
 		}
 
 		Assert(i == (api_version == PGSS_V1_0 ? PG_STAT_STATEMENTS_COLS_V1_0 :
@@ -2020,8 +1849,7 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
 
 		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 	}
-
-	LWLockRelease(&pgss->lock.lock);
+	dshash_seq_term(&status);
 
 	if (qbuffer)
 		pfree(qbuffer);
@@ -2036,196 +1864,296 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
 Datum
 pg_stat_statements_info(PG_FUNCTION_ARGS)
 {
-	pgssGlobalStats stats;
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_STATEMENTS_INFO_COLS] = {0};
 	bool		nulls[PG_STAT_STATEMENTS_INFO_COLS] = {0};
 
-	if (!pgss || !pgss_hash)
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("pg_stat_statements must be loaded via \"shared_preload_libraries\"")));
-
-	/* Build a tuple descriptor for our result type */
 	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
 		elog(ERROR, "return type must be a row type");
 
-	/* Read global statistics for pg_stat_statements */
-	SpinLockAcquire(&pgss->mutex);
-	stats = pgss->stats;
-	SpinLockRelease(&pgss->mutex);
+	pgss_attach_shmem();
 
-	values[0] = Int64GetDatum(stats.dealloc);
-	values[1] = TimestampTzGetDatum(stats.stats_reset);
+	values[0] = Int64GetDatum((int64) pg_atomic_read_u64(&pgss_shared->dealloc));
+	values[1] = TimestampTzGetDatum(pgss_shared->stats_reset);
 
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
 
 /*
- * Allocate a new hashtable entry.
- * caller must hold an exclusive lock on pgss->lock
+ * Evict entries using clock-sweep with reference count decay.
  *
- * "query" need not be null-terminated; we rely on query_len instead
+ * Multiple backends can sweep concurrently, each claiming the next partition
+ * via atomic increment of sweep_partition (rotating hand).  Each backend
+ * sweeps one partition per invocation: decrementing refcounts and evicting
+ * entries that reach zero via dshash_delete_current.  Entries still above 0
+ * survive.  Hot queries that execute frequently keep their refcount topped
+ * up (capped at PGSS_REF_CAP), giving them proportional protection.
  *
- * If "sticky" is true, make the new entry artificially sticky so that it will
- * probably still be there when the query finishes execution.  We do this by
- * giving it a median usage value rather than the normal value.  (Strictly
- * speaking, query strings are normalized on a best effort basis, though it
- * would be difficult to demonstrate this even under artificial conditions.)
+ * The hand advances across invocations so each sweep picks up where the last
+ * one left off, providing clock-sweep behavior.  Eviction stops within a
+ * partition as soon as nentries drops back to pgss_max.
  *
- * Note: despite needing exclusive lock, it's not an error for the target
- * entry to already exist.  This is because pgss_store releases and
- * reacquires lock after failing to find a match; so someone else could
- * have made the entry while we waited to get exclusive lock.
+ * Note: the partition swept is determined by the rotating hand, not by the
+ * key just inserted.  Eviction is a continuous background pressure mechanism
+ * to keep total entries near pgss_max, not a targeted operation to make room
+ * for the current entry specifically.
+ *
+ * Guaranteed eviction: any entry that stops being accessed will reach 0
+ * after at most PGSS_REF_CAP sweeps.
  */
-static pgssEntry *
-entry_alloc(pgssHashKey *key, Size query_offset, int query_len, int encoding,
-			bool sticky)
+static void
+entry_dealloc(void)
 {
+	dshash_seq_status status;
 	pgssEntry  *entry;
-	bool		found;
+	int			nentries;
+	int			evicted = 0;
+	uint32		my_partition;
+	uint32		sweep_val;
+	uint32		cur;
+
+	nentries = (int) pg_atomic_read_u64(&pgss_shared->nentries);
+	if (nentries <= pgss_max)
+		return;
 
-	/* Make space if needed */
-	while (hash_get_num_entries(pgss_hash) >= pgss_max)
-		entry_dealloc();
+	/*
+	 * Claim the next partition by atomically advancing the hand.  Multiple
+	 * backends can sweep concurrently, each on a different partition.
+	 */
+	sweep_val = pg_atomic_fetch_add_u32(&pgss_shared->sweep_partition, 1);
+	my_partition = sweep_val % DSHASH_NUM_PARTITIONS;
 
-	/* Find or create an entry with desired hash code */
-	entry = (pgssEntry *) hash_search(pgss_hash, key, HASH_ENTER, &found);
+	/* Re-check: another backend may have already resolved the overshoot */
+	if ((int) pg_atomic_read_u64(&pgss_shared->nentries) <= pgss_max)
+		return;
 
-	if (!found)
+	/*
+	 * Sweep one partition: decrement each entry's refcount, evict entries
+	 * that reach zero.  Stop if nentries drops back to pgss_max.
+	 */
+	dshash_seq_init_partition(&status, pgss_hash, true, my_partition);
+	while ((entry = dshash_seq_next(&status)) != NULL)
 	{
-		/* New entry, initialize it */
-
-		/* reset the statistics */
-		memset(&entry->counters, 0, sizeof(Counters));
-		/* set the appropriate initial usage count */
-		entry->counters.usage = sticky ? pgss->cur_median_usage : USAGE_INIT;
-		/* re-initialize the mutex each time ... we assume no one using it */
-		SpinLockInit(&entry->mutex);
-		/* ... and don't forget the query text metadata */
-		Assert(query_len >= 0);
-		entry->query_offset = query_offset;
-		entry->query_len = query_len;
-		entry->encoding = encoding;
-		entry->stats_since = GetCurrentTimestamp();
-		entry->minmax_stats_since = entry->stats_since;
-	}
+		if ((int) pg_atomic_read_u64(&pgss_shared->nentries) <= pgss_max * (100 - USAGE_DEALLOC_PERCENT) / 100)
+			break;
 
-	return entry;
-}
+		/*
+		 * Decrement refcount; evict if it reaches zero.  Guard against
+		 * underflow — a prior sweep may have already decremented to zero
+		 * without evicting (because nentries dropped below threshold). No CAS
+		 * needed: we hold the partition lock exclusively.
+		 */
+		cur = pg_atomic_read_u32(&entry->refcount);
+		if (cur > 0)
+		{
+			pg_atomic_write_u32(&entry->refcount, cur - 1);
+			if (cur - 1 > 0)
+				continue;
+		}
 
-/*
- * qsort comparator for sorting into increasing usage order
- */
-static int
-entry_cmp(const void *lhs, const void *rhs)
-{
-	double		l_usage = (*(pgssEntry *const *) lhs)->counters.usage;
-	double		r_usage = (*(pgssEntry *const *) rhs)->counters.usage;
+		/* Evict: drop from pgstat, then delete from dshash */
+		pgstat_drop_entry(PGSTAT_KIND_PGSS, entry->key.dbid,
+						  pgss_hash_key(&entry->key), true);
 
-	if (l_usage < r_usage)
-		return -1;
-	else if (l_usage > r_usage)
-		return +1;
-	else
-		return 0;
+		dshash_delete_current(&status);
+		pg_atomic_fetch_sub_u64(&pgss_shared->nentries, 1);
+		evicted++;
+	}
+	dshash_seq_term(&status);
+
+	if (evicted > 0)
+	{
+		pg_atomic_fetch_add_u64(&pgss_shared->dealloc, evicted);
+
+		/* Request pgstat entry ref GC once per full rotation */
+		if (my_partition == DSHASH_NUM_PARTITIONS - 1)
+			pgstat_request_entry_refs_gc();
+	}
 }
 
 /*
- * Deallocate least-used entries.
- *
- * Caller must hold an exclusive lock on pgss->lock.
+ * Reset entries corresponding to parameters passed.
+ * Iterates the dshash to find matching entries, deletes them from the
+ * dshash, and drops their pgstat entries.
  */
-static void
-entry_dealloc(void)
+static TimestampTz
+entry_reset(Oid userid, Oid dbid, int64 queryid, bool minmax_only)
 {
-	HASH_SEQ_STATUS hash_seq;
-	pgssEntry **entries;
+	dshash_seq_status status;
 	pgssEntry  *entry;
-	int			nvictims;
-	int			i;
-	Size		tottextlen;
-	int			nvalidtexts;
+	TimestampTz stats_reset;
+	int64		num_entries;
+	int64		num_remove = 0;
 
-	/*
-	 * Sort entries by usage and deallocate USAGE_DEALLOC_PERCENT of them.
-	 * While we're scanning the table, apply the decay factor to the usage
-	 * values, and update the mean query length.
-	 *
-	 * Note that the mean query length is almost immediately obsolete, since
-	 * we compute it before not after discarding the least-used entries.
-	 * Hopefully, that doesn't affect the mean too much; it doesn't seem worth
-	 * making two passes to get a more current result.  Likewise, the new
-	 * cur_median_usage includes the entries we're about to zap.
-	 */
+	pgss_attach_shmem();
 
-	entries = palloc(hash_get_num_entries(pgss_hash) * sizeof(pgssEntry *));
+	/* Flush pending stats before reset so we don't re-create dropped entries */
+	pgstat_report_anytime_stat();
 
-	i = 0;
-	tottextlen = 0;
-	nvalidtexts = 0;
+	stats_reset = GetCurrentTimestamp();
+	num_entries = (int64) pg_atomic_read_u64(&pgss_shared->nentries);
 
-	hash_seq_init(&hash_seq, pgss_hash);
-	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	if (minmax_only)
 	{
-		entries[i++] = entry;
-		/* "Sticky" entries get a different usage decay rate. */
-		if (IS_STICKY(entry->counters))
-			entry->counters.usage *= STICKY_DECREASE_FACTOR;
-		else
-			entry->counters.usage *= USAGE_DECREASE_FACTOR;
-		/* In the mean length computation, ignore dropped texts. */
-		if (entry->query_len >= 0)
+		/*
+		 * Reset only min/max timing values and update minmax_stats_since.
+		 * Iterate matching entries in the dshash, reset the corresponding
+		 * pgstat counters' min/max fields.
+		 */
+		dshash_seq_init(&status, pgss_hash, true);
+		while ((entry = dshash_seq_next(&status)) != NULL)
 		{
-			tottextlen += entry->query_len + 1;
-			nvalidtexts++;
+			if ((!userid || entry->key.userid == userid) &&
+				(!dbid || entry->key.dbid == dbid) &&
+				(!queryid || entry->key.queryid == queryid))
+			{
+				uint64		objid = pgss_hash_key(&entry->key);
+				PgStat_EntryRef *entry_ref;
+
+				entry->minmax_stats_since = stats_reset;
+
+				entry_ref = pgstat_get_entry_ref(PGSTAT_KIND_PGSS,
+												 entry->key.dbid, objid,
+												 false, NULL);
+				if (entry_ref)
+				{
+					PgStatShared_Pgss *shared;
+
+					shared = (PgStatShared_Pgss *) entry_ref->shared_stats;
+					if (!pgstat_lock_entry(entry_ref, false))
+						continue;
+
+					for (int kind = 0; kind < PGSS_NUMKIND; kind++)
+					{
+						shared->counters.min_time[kind] = 0;
+						shared->counters.max_time[kind] = 0;
+						shared->counters.mean_time[kind] = 0;
+						shared->counters.sum_var_time[kind] = 0;
+					}
+
+					pgstat_unlock_entry(entry_ref);
+				}
+			}
 		}
+		dshash_seq_term(&status);
+
+		return stats_reset;
 	}
 
-	/* Sort into increasing order by usage */
-	qsort(entries, i, sizeof(pgssEntry *), entry_cmp);
+	if (userid != 0 && dbid != 0 && queryid != INT64CONST(0))
+	{
+		/*
+		 * Fast path: specific entry identified by all three key components.
+		 * Try both toplevel=true and toplevel=false.
+		 */
+		pgssHashKey key;
 
-	/* Record the (approximate) median usage */
-	if (i > 0)
-		pgss->cur_median_usage = entries[i / 2]->counters.usage;
-	/* Record the mean query length */
-	if (nvalidtexts > 0)
-		pgss->mean_query_len = tottextlen / nvalidtexts;
-	else
-		pgss->mean_query_len = ASSUMED_LENGTH_INIT;
+		memset(&key, 0, sizeof(pgssHashKey));
+		key.userid = userid;
+		key.dbid = dbid;
+		key.queryid = queryid;
+
+		key.toplevel = false;
+		entry = dshash_find(pgss_hash, &key, true);
+		if (entry)
+		{
+			uint64		objid = pgss_hash_key(&key);
+
+			dshash_delete_entry(pgss_hash, entry);
+			pg_atomic_fetch_sub_u64(&pgss_shared->nentries, 1);
+			num_remove++;
 
-	/* Now zap an appropriate fraction of lowest-usage entries */
-	nvictims = Max(10, i * USAGE_DEALLOC_PERCENT / 100);
-	nvictims = Min(nvictims, i);
+			if (!pgstat_drop_entry(PGSTAT_KIND_PGSS, key.dbid, objid, true))
+				pgstat_request_entry_refs_gc();
+		}
+
+		key.toplevel = true;
+		entry = dshash_find(pgss_hash, &key, true);
+		if (entry)
+		{
+			uint64		objid = pgss_hash_key(&key);
 
-	for (i = 0; i < nvictims; i++)
+			dshash_delete_entry(pgss_hash, entry);
+			pg_atomic_fetch_sub_u64(&pgss_shared->nentries, 1);
+			num_remove++;
+
+			if (!pgstat_drop_entry(PGSTAT_KIND_PGSS, key.dbid, objid, true))
+				pgstat_request_entry_refs_gc();
+		}
+	}
+	else
 	{
-		hash_search(pgss_hash, &entries[i]->key, HASH_REMOVE, NULL);
+		/* Iterate all entries and remove those matching the filter */
+		dshash_seq_init(&status, pgss_hash, true);
+		while ((entry = dshash_seq_next(&status)) != NULL)
+		{
+			if ((!userid || entry->key.userid == userid) &&
+				(!dbid || entry->key.dbid == dbid) &&
+				(!queryid || entry->key.queryid == queryid))
+			{
+				pgssHashKey ekey = entry->key;
+				uint64		objid = pgss_hash_key(&ekey);
+
+				dshash_delete_current(&status);
+				pg_atomic_fetch_sub_u64(&pgss_shared->nentries, 1);
+				num_remove++;
+
+				if (!pgstat_drop_entry(PGSTAT_KIND_PGSS, ekey.dbid, objid, true))
+					pgstat_request_entry_refs_gc();
+			}
+		}
+		dshash_seq_term(&status);
 	}
 
-	pfree(entries);
+	/* If all entries were removed, reset global statistics and text file */
+	if (num_remove > 0 && num_entries == num_remove)
+	{
+		FILE	   *qfile;
+
+		pg_atomic_write_u64(&pgss_shared->dealloc, 0);
+		pgss_shared->stats_reset = stats_reset;
+
+		/* Write new empty query file */
+		qfile = AllocateFile(PGSS_TEXT_FILE, PG_BINARY_W);
+		if (qfile == NULL)
+		{
+			ereport(LOG,
+					(errcode_for_file_access(),
+					 errmsg("could not create file \"%s\": %m",
+							PGSS_TEXT_FILE)));
+		}
+		else
+		{
+			if (ftruncate(fileno(qfile), 0) != 0)
+				ereport(LOG,
+						(errcode_for_file_access(),
+						 errmsg("could not truncate file \"%s\": %m",
+								PGSS_TEXT_FILE)));
+			FreeFile(qfile);
+		}
+
+		SpinLockAcquire(&pgss_shared->mutex);
+		pgss_shared->extent = 0;
+		SpinLockRelease(&pgss_shared->mutex);
+		record_gc_qtexts();
+	}
 
-	/* Increment the number of times entries are deallocated */
-	SpinLockAcquire(&pgss->mutex);
-	pgss->stats.dealloc += 1;
-	SpinLockRelease(&pgss->mutex);
+	return stats_reset;
 }
 
+/*--------------------------------------------------------------------------
+ * Query text file management
+ *--------------------------------------------------------------------------
+ */
+
 /*
- * Given a query string (not necessarily null-terminated), allocate a new
- * entry in the external query text file and store the string there.
+ * Store query text in the external file.
  *
  * If successful, returns true, and stores the new entry's offset in the file
  * into *query_offset.  Also, if gc_count isn't NULL, *gc_count is set to the
  * number of garbage collections that have occurred so far.
  *
  * On failure, returns false.
- *
- * At least a shared lock on pgss->lock must be held by the caller, so as
- * to prevent a concurrent garbage collection.  Share-lock-holding callers
- * should pass a gc_count pointer to obtain the number of garbage collections,
- * so that they can recheck the count after obtaining exclusive lock to
- * detect whether a garbage collection occurred (and removed this entry).
  */
 static bool
 qtext_store(const char *query, int query_len,
@@ -2238,13 +2166,13 @@ qtext_store(const char *query, int query_len,
 	 * We use a spinlock to protect extent/n_writers/gc_count, so that
 	 * multiple processes may execute this function concurrently.
 	 */
-	SpinLockAcquire(&pgss->mutex);
-	off = pgss->extent;
-	pgss->extent += query_len + 1;
-	pgss->n_writers++;
+	SpinLockAcquire(&pgss_shared->mutex);
+	off = pgss_shared->extent;
+	pgss_shared->extent += query_len + 1;
+	pgss_shared->n_writers++;
 	if (gc_count)
-		*gc_count = pgss->gc_count;
-	SpinLockRelease(&pgss->mutex);
+		*gc_count = pgss_shared->gc_count;
+	SpinLockRelease(&pgss_shared->mutex);
 
 	*query_offset = off;
 
@@ -2255,7 +2183,7 @@ qtext_store(const char *query, int query_len,
 	 */
 	if (unlikely(query_len >= MaxAllocHugeSize - off))
 	{
-		errno = EFBIG;			/* not quite right, but it'll do */
+		errno = EFBIG;
 		fd = -1;
 		goto error;
 	}
@@ -2273,9 +2201,9 @@ qtext_store(const char *query, int query_len,
 	CloseTransientFile(fd);
 
 	/* Mark our write complete */
-	SpinLockAcquire(&pgss->mutex);
-	pgss->n_writers--;
-	SpinLockRelease(&pgss->mutex);
+	SpinLockAcquire(&pgss_shared->mutex);
+	pgss_shared->n_writers--;
+	SpinLockRelease(&pgss_shared->mutex);
 
 	return true;
 
@@ -2289,9 +2217,9 @@ error:
 		CloseTransientFile(fd);
 
 	/* Mark our write complete */
-	SpinLockAcquire(&pgss->mutex);
-	pgss->n_writers--;
-	SpinLockRelease(&pgss->mutex);
+	SpinLockAcquire(&pgss_shared->mutex);
+	pgss_shared->n_writers--;
+	SpinLockRelease(&pgss_shared->mutex);
 
 	return false;
 }
@@ -2303,9 +2231,6 @@ error:
  * file not there or insufficient memory.
  *
  * On success, the buffer size is also returned into *buffer_size.
- *
- * This can be called without any lock on pgss->lock, but in that case
- * the caller is responsible for verifying that the result is sane.
  */
 static char *
 qtext_load_file(Size *buffer_size)
@@ -2354,22 +2279,14 @@ qtext_load_file(Size *buffer_size)
 	}
 
 	/*
-	 * OK, slurp in the file.  Windows fails if we try to read more than
-	 * INT_MAX bytes at once, and other platforms might not like that either,
-	 * so read a very large file in 1GB segments.
+	 * OK, slurp in the file.  Read in 1GB segments to avoid issues on some
+	 * platforms.
 	 */
 	nread = 0;
 	while (nread < stat.st_size)
 	{
 		int			toread = Min(1024 * 1024 * 1024, stat.st_size - nread);
 
-		/*
-		 * If we get a short read and errno doesn't get set, the reason is
-		 * probably that garbage collection truncated the file since we did
-		 * the fstat(), so we don't log a complaint --- but we don't return
-		 * the data, either, since it's most likely corrupt due to concurrent
-		 * writes from garbage collection.
-		 */
 		errno = 0;
 		if (read(fd, buf + nread, toread) != toread)
 		{
@@ -2421,37 +2338,39 @@ qtext_fetch(Size query_offset, int query_len,
 /*
  * Do we need to garbage-collect the external query text file?
  *
- * Caller should hold at least a shared lock on pgss->lock.
+ * We check whether the file has grown excessively relative to the number
+ * of entries tracked.
  */
 static bool
 need_gc_qtexts(void)
 {
 	Size		extent;
+	int			nentries;
 
 	/* Read shared extent pointer */
-	SpinLockAcquire(&pgss->mutex);
-	extent = pgss->extent;
-	SpinLockRelease(&pgss->mutex);
+	SpinLockAcquire(&pgss_shared->mutex);
+	extent = pgss_shared->extent;
+	SpinLockRelease(&pgss_shared->mutex);
+
+	nentries = (int) pg_atomic_read_u64(&pgss_shared->nentries);
 
 	/*
 	 * Don't proceed if file does not exceed 512 bytes per possible entry.
-	 *
-	 * Here and in the next test, 32-bit machines have overflow hazards if
-	 * pgss_max and/or mean_query_len are large.  Force the multiplications
-	 * and comparisons to be done in uint64 arithmetic to forestall trouble.
 	 */
 	if ((uint64) extent < (uint64) 512 * pgss_max)
 		return false;
 
 	/*
-	 * Don't proceed if file is less than about 50% bloat.  Nothing can or
-	 * should be done in the event of unusually large query texts accounting
-	 * for file's large size.  We go to the trouble of maintaining the mean
-	 * query length in order to prevent garbage collection from thrashing
-	 * uselessly.
+	 * Don't proceed if file is less than about 50% bloat.  We estimate mean
+	 * query length from the file size and entry count.
 	 */
-	if ((uint64) extent < (uint64) pgss->mean_query_len * pgss_max * 2)
-		return false;
+	if (nentries > 0)
+	{
+		Size		mean_query_len = extent / nentries;
+
+		if ((uint64) extent < (uint64) mean_query_len * pgss_max * 2)
+			return false;
+	}
 
 	return true;
 }
@@ -2459,18 +2378,14 @@ need_gc_qtexts(void)
 /*
  * Garbage-collect orphaned query texts in external file.
  *
- * This won't be called often in the typical case, since it's likely that
- * there won't be too much churn, and besides, a similar compaction process
- * occurs when serializing to disk at shutdown or as part of resetting.
- * Despite this, it seems prudent to plan for the edge case where the file
- * becomes unreasonably large, with no other method of compaction likely to
- * occur in the foreseeable future.
+ * This rewrites the query text file, keeping only texts referenced by
+ * current dshash entries, and updates their offsets accordingly.
  *
- * The caller must hold an exclusive lock on pgss->lock.
- *
- * At the first sign of trouble we unlink the query text file to get a clean
- * slate (although existing statistics are retained), rather than risk
- * thrashing by allowing the same problem case to recur indefinitely.
+ * Note: unlike the upstream implementation which required LWLock exclusive,
+ * this uses dshash partition-level exclusive locks via sequential scan with
+ * exclusive=true.  This is safe because we're updating entry offsets in-place
+ * and the file is guaranteed not to grow during GC (no new writers can get
+ * offsets into the old region after we've reset the extent).
  */
 static void
 gc_qtexts(void)
@@ -2478,25 +2393,16 @@ gc_qtexts(void)
 	char	   *qbuffer;
 	Size		qbuffer_size;
 	FILE	   *qfile = NULL;
-	HASH_SEQ_STATUS hash_seq;
+	dshash_seq_status status;
 	pgssEntry  *entry;
 	Size		extent;
-	int			nentries;
 
-	/*
-	 * When called from pgss_store, some other session might have proceeded
-	 * with garbage collection in the no-lock-held interim of lock strength
-	 * escalation.  Check once more that this is actually necessary.
-	 */
 	if (!need_gc_qtexts())
 		return;
 
 	/*
 	 * Load the old texts file.  If we fail (out of memory, for instance),
-	 * invalidate query texts.  Hopefully this is rare.  It might seem better
-	 * to leave things alone on an OOM failure, but the problem is that the
-	 * file is only going to get bigger; hoping for a future non-OOM result is
-	 * risky and can easily lead to complete denial of service.
+	 * invalidate query texts.
 	 */
 	qbuffer = qtext_load_file(&qbuffer_size);
 	if (qbuffer == NULL)
@@ -2504,9 +2410,7 @@ gc_qtexts(void)
 
 	/*
 	 * We overwrite the query texts file in place, so as to reduce the risk of
-	 * an out-of-disk-space failure.  Since the file is guaranteed not to get
-	 * larger, this should always work on traditional filesystems; though we
-	 * could still lose on copy-on-write filesystems.
+	 * an out-of-disk-space failure.
 	 */
 	qfile = AllocateFile(PGSS_TEXT_FILE, PG_BINARY_W);
 	if (qfile == NULL)
@@ -2519,10 +2423,9 @@ gc_qtexts(void)
 	}
 
 	extent = 0;
-	nentries = 0;
 
-	hash_seq_init(&hash_seq, pgss_hash);
-	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	dshash_seq_init(&status, pgss_hash, true);
+	while ((entry = dshash_seq_next(&status)) != NULL)
 	{
 		int			query_len = entry->query_len;
 		char	   *qry = qtext_fetch(entry->query_offset,
@@ -2535,7 +2438,6 @@ gc_qtexts(void)
 			/* Trouble ... drop the text */
 			entry->query_offset = 0;
 			entry->query_len = -1;
-			/* entry will not be counted in mean query length computation */
 			continue;
 		}
 
@@ -2545,18 +2447,17 @@ gc_qtexts(void)
 					(errcode_for_file_access(),
 					 errmsg("could not write file \"%s\": %m",
 							PGSS_TEXT_FILE)));
-			hash_seq_term(&hash_seq);
+			dshash_seq_term(&status);
 			goto gc_fail;
 		}
 
 		entry->query_offset = extent;
 		extent += query_len + 1;
-		nentries++;
 	}
+	dshash_seq_term(&status);
 
 	/*
-	 * Truncate away any now-unused space.  If this fails for some odd reason,
-	 * we log it, but there's no need to fail.
+	 * Truncate away any now-unused space.
 	 */
 	if (ftruncate(fileno(qfile), extent) != 0)
 		ereport(LOG,
@@ -2575,29 +2476,15 @@ gc_qtexts(void)
 	}
 
 	elog(DEBUG1, "pgss gc of queries file shrunk size from %zu to %zu",
-		 pgss->extent, extent);
+		 pgss_shared->extent, extent);
 
 	/* Reset the shared extent pointer */
-	pgss->extent = extent;
-
-	/*
-	 * Also update the mean query length, to be sure that need_gc_qtexts()
-	 * won't still think we have a problem.
-	 */
-	if (nentries > 0)
-		pgss->mean_query_len = extent / nentries;
-	else
-		pgss->mean_query_len = ASSUMED_LENGTH_INIT;
+	SpinLockAcquire(&pgss_shared->mutex);
+	pgss_shared->extent = extent;
+	SpinLockRelease(&pgss_shared->mutex);
 
 	pfree(qbuffer);
 
-	/*
-	 * OK, count a garbage collection cycle.  (Note: even though we have
-	 * exclusive lock on pgss->lock, we must take pgss->mutex for this, since
-	 * other processes may examine gc_count while holding only the mutex.
-	 * Also, we have to advance the count *after* we've rewritten the file,
-	 * else other processes might not realize they read a stale file.)
-	 */
 	record_gc_qtexts();
 
 	return;
@@ -2611,14 +2498,15 @@ gc_fail:
 
 	/*
 	 * Since the contents of the external file are now uncertain, mark all
-	 * hashtable entries as having invalid texts.
+	 * dshash entries as having invalid texts.
 	 */
-	hash_seq_init(&hash_seq, pgss_hash);
-	while ((entry = hash_seq_search(&hash_seq)) != NULL)
+	dshash_seq_init(&status, pgss_hash, true);
+	while ((entry = dshash_seq_next(&status)) != NULL)
 	{
 		entry->query_offset = 0;
 		entry->query_len = -1;
 	}
+	dshash_seq_term(&status);
 
 	/*
 	 * Destroy the query text file and create a new, empty one
@@ -2634,160 +2522,11 @@ gc_fail:
 		FreeFile(qfile);
 
 	/* Reset the shared extent pointer */
-	pgss->extent = 0;
-
-	/* Reset mean_query_len to match the new state */
-	pgss->mean_query_len = ASSUMED_LENGTH_INIT;
-
-	/*
-	 * Bump the GC count even though we failed.
-	 *
-	 * This is needed to make concurrent readers of file without any lock on
-	 * pgss->lock notice existence of new version of file.  Once readers
-	 * subsequently observe a change in GC count with pgss->lock held, that
-	 * forces a safe reopen of file.  Writers also require that we bump here,
-	 * of course.  (As required by locking protocol, readers and writers don't
-	 * trust earlier file contents until gc_count is found unchanged after
-	 * pgss->lock acquired in shared or exclusive mode respectively.)
-	 */
-	record_gc_qtexts();
-}
-
-#define SINGLE_ENTRY_RESET(e) \
-if (e) { \
-	if (minmax_only) { \
-		/* When requested reset only min/max statistics of an entry */ \
-		for (int kind = 0; kind < PGSS_NUMKIND; kind++) \
-		{ \
-			e->counters.max_time[kind] = 0; \
-			e->counters.min_time[kind] = 0; \
-		} \
-		e->minmax_stats_since = stats_reset; \
-	} \
-	else \
-	{ \
-		/* Remove the key otherwise  */ \
-		hash_search(pgss_hash, &e->key, HASH_REMOVE, NULL); \
-		num_remove++; \
-	} \
-}
-
-/*
- * Reset entries corresponding to parameters passed.
- */
-static TimestampTz
-entry_reset(Oid userid, Oid dbid, int64 queryid, bool minmax_only)
-{
-	HASH_SEQ_STATUS hash_seq;
-	pgssEntry  *entry;
-	FILE	   *qfile;
-	int64		num_entries;
-	int64		num_remove = 0;
-	pgssHashKey key;
-	TimestampTz stats_reset;
-
-	if (!pgss || !pgss_hash)
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("pg_stat_statements must be loaded via \"shared_preload_libraries\"")));
-
-	LWLockAcquire(&pgss->lock.lock, LW_EXCLUSIVE);
-	num_entries = hash_get_num_entries(pgss_hash);
-
-	stats_reset = GetCurrentTimestamp();
-
-	if (userid != 0 && dbid != 0 && queryid != INT64CONST(0))
-	{
-		/* If all the parameters are available, use the fast path. */
-		memset(&key, 0, sizeof(pgssHashKey));
-		key.userid = userid;
-		key.dbid = dbid;
-		key.queryid = queryid;
-
-		/*
-		 * Reset the entry if it exists, starting with the non-top-level
-		 * entry.
-		 */
-		key.toplevel = false;
-		entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_FIND, NULL);
-
-		SINGLE_ENTRY_RESET(entry);
-
-		/* Also reset the top-level entry if it exists. */
-		key.toplevel = true;
-		entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_FIND, NULL);
-
-		SINGLE_ENTRY_RESET(entry);
-	}
-	else if (userid != 0 || dbid != 0 || queryid != INT64CONST(0))
-	{
-		/* Reset entries corresponding to valid parameters. */
-		hash_seq_init(&hash_seq, pgss_hash);
-		while ((entry = hash_seq_search(&hash_seq)) != NULL)
-		{
-			if ((!userid || entry->key.userid == userid) &&
-				(!dbid || entry->key.dbid == dbid) &&
-				(!queryid || entry->key.queryid == queryid))
-			{
-				SINGLE_ENTRY_RESET(entry);
-			}
-		}
-	}
-	else
-	{
-		/* Reset all entries. */
-		hash_seq_init(&hash_seq, pgss_hash);
-		while ((entry = hash_seq_search(&hash_seq)) != NULL)
-		{
-			SINGLE_ENTRY_RESET(entry);
-		}
-	}
-
-	/* All entries are removed? */
-	if (num_entries != num_remove)
-		goto release_lock;
-
-	/*
-	 * Reset global statistics for pg_stat_statements since all entries are
-	 * removed.
-	 */
-	SpinLockAcquire(&pgss->mutex);
-	pgss->stats.dealloc = 0;
-	pgss->stats.stats_reset = stats_reset;
-	SpinLockRelease(&pgss->mutex);
-
-	/*
-	 * Write new empty query file, perhaps even creating a new one to recover
-	 * if the file was missing.
-	 */
-	qfile = AllocateFile(PGSS_TEXT_FILE, PG_BINARY_W);
-	if (qfile == NULL)
-	{
-		ereport(LOG,
-				(errcode_for_file_access(),
-				 errmsg("could not create file \"%s\": %m",
-						PGSS_TEXT_FILE)));
-		goto done;
-	}
-
-	/* If ftruncate fails, log it, but it's not a fatal problem */
-	if (ftruncate(fileno(qfile), 0) != 0)
-		ereport(LOG,
-				(errcode_for_file_access(),
-				 errmsg("could not truncate file \"%s\": %m",
-						PGSS_TEXT_FILE)));
+	SpinLockAcquire(&pgss_shared->mutex);
+	pgss_shared->extent = 0;
+	SpinLockRelease(&pgss_shared->mutex);
 
-	FreeFile(qfile);
-
-done:
-	pgss->extent = 0;
-	/* This counts as a query text garbage collection for our purposes */
 	record_gc_qtexts();
-
-release_lock:
-	LWLockRelease(&pgss->lock.lock);
-
-	return stats_reset;
 }
 
 /*
diff --git a/contrib/pg_stat_statements/pg_stat_statements.conf b/contrib/pg_stat_statements/pg_stat_statements.conf
index 0e900d7119b..21a10c41d09 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.conf
+++ b/contrib/pg_stat_statements/pg_stat_statements.conf
@@ -1,2 +1,3 @@
 shared_preload_libraries = 'pg_stat_statements'
 max_prepared_transactions = 5
+max_parallel_workers_per_gather = 0
diff --git a/contrib/pg_stat_statements/pg_stat_statements.control b/contrib/pg_stat_statements/pg_stat_statements.control
index 2eee0ceffa8..61ae41efc14 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.control
+++ b/contrib/pg_stat_statements/pg_stat_statements.control
@@ -1,5 +1,5 @@
 # pg_stat_statements extension
 comment = 'track planning and execution statistics of all SQL statements executed'
-default_version = '1.13'
+default_version = '1.14'
 module_pathname = '$libdir/pg_stat_statements'
 relocatable = true
diff --git a/contrib/pg_stat_statements/sql/oldextversions.sql b/contrib/pg_stat_statements/sql/oldextversions.sql
index e416efe9ffb..b2090be6267 100644
--- a/contrib/pg_stat_statements/sql/oldextversions.sql
+++ b/contrib/pg_stat_statements/sql/oldextversions.sql
@@ -68,4 +68,9 @@ AlTER EXTENSION pg_stat_statements UPDATE TO '1.13';
 \d pg_stat_statements
 SELECT count(*) > 0 AS has_data FROM pg_stat_statements;
 
+-- Functions marked PARALLEL RESTRICTED in 1.14
+AlTER EXTENSION pg_stat_statements UPDATE TO '1.14';
+\d pg_stat_statements
+SELECT count(*) > 0 AS has_data FROM pg_stat_statements;
+
 DROP EXTENSION pg_stat_statements;
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index d753de5836e..8470250ff66 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -16,12 +16,13 @@
  <para>
   The module must be loaded by adding <literal>pg_stat_statements</literal> to
   <xref linkend="guc-shared-preload-libraries"/> in
-  <filename>postgresql.conf</filename>, because it requires additional shared memory.
-  This means that a server restart is needed to add or remove the module.
-  In addition, query identifier calculation must be enabled in order for the
-  module to be active, which is done automatically if <xref linkend="guc-compute-query-id"/>
-  is set to <literal>auto</literal> or <literal>on</literal>, or any third-party
-  module that calculates query identifiers is loaded.
+  <filename>postgresql.conf</filename>, because it must register hooks and a
+  custom statistics kind at server start. This means that a server restart is
+  needed to add or remove the module. In addition, query identifier calculation
+  must be enabled in order for the module to be active, which is done automatically
+  if <xref linkend="guc-compute-query-id"/> is set to <literal>auto</literal> or
+  <literal>on</literal>, or any third-party module that calculates query identifiers
+  is loaded.
  </para>
 
  <para>
@@ -794,10 +795,12 @@ calls | 2
        <structfield>dealloc</structfield> <type>bigint</type>
       </para>
       <para>
-       Total number of times <structname>pg_stat_statements</structname>
-       entries about the least-executed statements were deallocated
-       because more distinct statements than
-       <varname>pg_stat_statements.max</varname> were observed
+       Total number of <structname>pg_stat_statements</structname>
+       entries evicted because more distinct statements than
+       <varname>pg_stat_statements.max</varname> were observed.
+       A high value relative to <varname>pg_stat_statements.max</varname>
+       indicates significant query churn and that
+       <varname>pg_stat_statements.max</varname> should be increased
       </para></entry>
      </row>
      <row>
@@ -910,12 +913,15 @@ calls | 2
       <varname>pg_stat_statements.max</varname> is the maximum number of
       statements tracked by the module (i.e., the maximum number of rows
       in the <structname>pg_stat_statements</structname> view).  If more distinct
-      statements than that are observed, information about the least-executed
+      statements than that are observed, information about the least-recently-used
       statements is discarded.  The number of times such information was
       discarded can be seen in the
       <structname>pg_stat_statements_info</structname> view.
+      This is a soft limit; the actual number of tracked statements may
+      briefly exceed it until eviction reclaims space.
       The default value is 5000.
-      This parameter can only be set at server start.
+      This parameter can only be set in the <filename>postgresql.conf</filename>
+      file or on the server command line.
      </para>
     </listitem>
    </varlistentry>
@@ -1007,10 +1013,11 @@ calls | 2
   </variablelist>
 
   <para>
-   The module requires additional shared memory proportional to
-   <varname>pg_stat_statements.max</varname>.  Note that this
-   memory is consumed whenever the module is loaded, even if
-   <varname>pg_stat_statements.track</varname> is set to <literal>none</literal>.
+   The module uses dynamic shared memory that grows as statements are
+   tracked, up to the limit set by
+   <varname>pg_stat_statements.max</varname>.  Note that this memory is
+   not reclaimed when entries are deallocated; it is reused for new
+   entries but the overall shared memory footprint does not shrink.
   </para>
 
   <para>
-- 
2.50.1 (Apple Git-155)