v7-0004-Add-skip-scan-to-nbtree.patch
application/octet-stream
Filename: v7-0004-Add-skip-scan-to-nbtree.patch
Type: application/octet-stream
Part: 3
Patch
Same data as JSON:
GET /api/v1/attachments/:id/patch
the parsed metadata as JSON — format, series position, per-file stats; never the diff bytes.
API reference →
Format: format-patch
Series: patch v7-0004
Subject: Add skip scan to nbtree.
| File | + | − |
|---|---|---|
| doc/src/sgml/btree.sgml | 13 | 0 |
| doc/src/sgml/indexam.sgml | 2 | 1 |
| doc/src/sgml/indices.sgml | 22 | 18 |
| doc/src/sgml/xindex.sgml | 13 | 3 |
| src/backend/access/index/indexam.c | 2 | 1 |
| src/backend/access/nbtree/nbtcompare.c | 273 | 0 |
| src/backend/access/nbtree/nbtree.c | 177 | 28 |
| src/backend/access/nbtree/nbtsearch.c | 86 | 7 |
| src/backend/access/nbtree/nbtutils.c | 1309 | 100 |
| src/backend/access/nbtree/nbtvalidate.c | 4 | 0 |
| src/backend/commands/opclasscmds.c | 25 | 0 |
| src/backend/storage/lmgr/lwlock.c | 1 | 0 |
| src/backend/utils/activity/wait_event_names.txt | 1 | 0 |
| src/backend/utils/adt/date.c | 46 | 0 |
| src/backend/utils/adt/Makefile | 1 | 0 |
| src/backend/utils/adt/meson.build | 1 | 0 |
| src/backend/utils/adt/selfuncs.c | 265 | 103 |
| src/backend/utils/adt/skipsupport.c | 60 | 0 |
| src/backend/utils/adt/uuid.c | 71 | 0 |
| src/backend/utils/misc/guc_tables.c | 23 | 0 |
| src/include/access/amapi.h | 2 | 1 |
| src/include/access/nbtree.h | 24 | 3 |
| src/include/catalog/pg_amproc.dat | 16 | 0 |
| src/include/catalog/pg_proc.dat | 24 | 0 |
| src/include/storage/lwlock.h | 1 | 0 |
| src/include/utils/skipsupport.h | 109 | 0 |
| src/test/regress/expected/alter_generic.out | 3 | 3 |
| src/test/regress/expected/create_index.out | 2 | 2 |
| src/test/regress/expected/join.out | 35 | 26 |
| src/test/regress/expected/psql.out | 2 | 1 |
| src/test/regress/expected/union.out | 7 | 8 |
| src/test/regress/sql/alter_generic.sql | 1 | 1 |
| src/test/regress/sql/create_index.sql | 2 | 2 |
| src/tools/pgindent/typedefs.list | 3 | 0 |
From 336f8e6ac66bf930add19873040634bbe55985fe Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 16 Apr 2024 13:21:36 -0400
Subject: [PATCH v7 4/4] Add skip scan to nbtree.
Skip scan allows nbtree index scans to efficiently use a composite index
on the columns (a, b) for queries with a qual "WHERE b = 5". This is
useful in cases where the total number of distinct values in the column
'a' is reasonably small (think hundreds, possibly thousands). In
effect, a skip scan treats the composite index on (a, b) as if it was a
series of disjunct subindexes -- one subindex per distinct 'a' value.
The scan exhaustively "searches every subindex" by using a qual that
behaves like "WHERE a = ANY(<every possible 'a' value>) AND b = 5".
This works by extending the design for arrays established by commit
5bf748b8. "Skip arrays" generate their array values procedurally and
on-demand, but otherwise work just like conventional SAOP arrays.
The core B-Tree operator classes on most discrete types generate their
array elements with help from their own custom skip support routine.
This infrastructure gives nbtree a way to generate the next required
array element by incrementing (or decrementing) the current array value.
It can reduce the number of index descents in cases where the very next
indexable value frequently turns out to be the next indexed value. In
practice, this is likely whenever the scan skips with an input opclass
where "dense" indexed values occur naturally, such as btree/date_ops.
Opclasses that lack a skip support routine fall back on having nbtree
"increment" (or "decrement") a skip array's current element by setting
the NEXTPRIOR scan key flag, without directly changing its sk_argument.
The presence of NEXTPRIOR makes the scan interpret the key's sk_argument
as coming immediately after (or coming immediately before) sk_argument
in the key space. The key value must still come before (or still come
after) any possible greater-than (or less-than) indexable/non-sentinel
value. Obviously, the scan will never locate any exactly equal tuples.
But attempting to locate a match serves to make the scan locate the true
next value in whatever way it determines is most efficient, without any
need for special cases in high level scan-related code. In particular,
this design obviates the need for explicit "next key" index probes.
Though it's typical for nbtree preprocessing to cons up skip arrays when
it will allow the scan to apply one or more omitted-from-query leading
key columns when skipping, that's never a requirement. There are hardly
any limitations around where skip arrays/scan keys may appear relative
to conventional/input scan keys. This is no less true in the presence
of conventional SAOP array scan keys, which may both roll over and be
rolled over by skip arrays. For example, a skip array on the column "b"
is generated with quals such as "WHERE a = 42 AND c IN (1, 2, 3)". As
with any nbtree scan involving arrays, whether or not we actually skip
depends on the physical characteristics of the index during the scan.
The optimizer doesn't use distinct new index paths to represent index
skip scans. Skipping isn't an either/or question. It's possible for
individual index scans to conspicuously vary how and when they skip in
order to deal with variation in how leading column values cluster
together over the key space of the index. A dynamic strategy seems to
work best. Skipping can be used during nbtree bitmap index scans,
nbtree index scans, and nbtree index-only scans. Parallel index skip
scan is also supported.
Preprocessing will never cons up a skip array for an index column that
has an equality strategy scan key on input, but will do so for indexed
columns that are only constrained by inequality type input scan keys.
This allows the scan to skip using a range skip array. These are just
like conventional skip arrays, but only generate values from within a
given range. The range is constrained by input inequality scan keys.
For example, a skip array on "a" can only ever use array element values
1 and 2 when generated for a qual "WHERE a BETWEEN 1 AND 2 AND b = 66".
Such a skip array works very much like the conventional SAOP array that
would be generated given the qual "WHERE a = ANY('{1, 2}') AND b = 66".
Such transformations only happen when they enable later preprocessing to
mark the copied-from-input scan key on "b" required to continue the scan
(otherwise, preprocessing directly outputs the >= and <= keys on "a" in
the traditional way, without adding a superseding skip array on "a").
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiro Ikeda <masahiro.ikeda@nttdata.com>
Reviewed-By: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
---
src/include/access/amapi.h | 3 +-
src/include/access/nbtree.h | 27 +-
src/include/catalog/pg_amproc.dat | 16 +
src/include/catalog/pg_proc.dat | 24 +
src/include/storage/lwlock.h | 1 +
src/include/utils/skipsupport.h | 109 ++
src/backend/access/index/indexam.c | 3 +-
src/backend/access/nbtree/nbtcompare.c | 273 ++++
src/backend/access/nbtree/nbtree.c | 205 ++-
src/backend/access/nbtree/nbtsearch.c | 93 +-
src/backend/access/nbtree/nbtutils.c | 1409 +++++++++++++++--
src/backend/access/nbtree/nbtvalidate.c | 4 +
src/backend/commands/opclasscmds.c | 25 +
src/backend/storage/lmgr/lwlock.c | 1 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/adt/Makefile | 1 +
src/backend/utils/adt/date.c | 46 +
src/backend/utils/adt/meson.build | 1 +
src/backend/utils/adt/selfuncs.c | 368 +++--
src/backend/utils/adt/skipsupport.c | 60 +
src/backend/utils/adt/uuid.c | 71 +
src/backend/utils/misc/guc_tables.c | 23 +
doc/src/sgml/btree.sgml | 13 +
doc/src/sgml/indexam.sgml | 3 +-
doc/src/sgml/indices.sgml | 40 +-
doc/src/sgml/xindex.sgml | 16 +-
src/test/regress/expected/alter_generic.out | 6 +-
src/test/regress/expected/create_index.out | 4 +-
src/test/regress/expected/join.out | 61 +-
src/test/regress/expected/psql.out | 3 +-
src/test/regress/expected/union.out | 15 +-
src/test/regress/sql/alter_generic.sql | 2 +-
src/test/regress/sql/create_index.sql | 4 +-
src/tools/pgindent/typedefs.list | 3 +
34 files changed, 2626 insertions(+), 308 deletions(-)
create mode 100644 src/include/utils/skipsupport.h
create mode 100644 src/backend/utils/adt/skipsupport.c
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index c51de742e..0826c124f 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -202,7 +202,8 @@ typedef void (*amrestrpos_function) (IndexScanDesc scan);
*/
/* estimate size of parallel scan descriptor */
-typedef Size (*amestimateparallelscan_function) (int nkeys, int norderbys);
+typedef Size (*amestimateparallelscan_function) (Relation indexRelation,
+ int nkeys, int norderbys);
/* prepare for parallel index scan */
typedef void (*aminitparallelscan_function) (void *target);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d709fe08d..e1b6b883e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -24,6 +24,7 @@
#include "lib/stringinfo.h"
#include "storage/bufmgr.h"
#include "storage/shm_toc.h"
+#include "utils/skipsupport.h"
/* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
typedef uint16 BTCycleId;
@@ -709,7 +710,8 @@ BTreeTupleGetMaxHeapTID(IndexTuple itup)
#define BTINRANGE_PROC 3
#define BTEQUALIMAGE_PROC 4
#define BTOPTIONS_PROC 5
-#define BTNProcs 5
+#define BTSKIPSUPPORT_PROC 6
+#define BTNProcs 6
/*
* We need to be able to tell the difference between read and write
@@ -1031,10 +1033,22 @@ typedef BTScanPosData *BTScanPos;
/* We need one of these for each equality-type SK_SEARCHARRAY scan key */
typedef struct BTArrayKeyInfo
{
+ /* fields used by both kinds of array (standard arrays and skip arrays) */
int scan_key; /* index of associated key in keyData */
+ int num_elems; /* number of elems (-1 for skip array) */
+
+ /* fields for standard arrays that store elements in memory */
int cur_elem; /* index of current element in elem_values */
- int num_elems; /* number of elems in current array value */
Datum *elem_values; /* array of num_elems Datums */
+
+ /* fields for skip arrays, which generate their elements procedurally */
+ bool use_sksup; /* sksup set to valid routine? */
+ bool null_elem; /* lowest/highest element actually NULL? */
+ SkipSupportData sksup; /* opclass skip scan support, when use_sksup */
+ ScanKey low_compare; /* array's > or >= lower bound */
+ ScanKey high_compare; /* array's < or <= upper bound */
+ FmgrInfo order_low; /* low_compare's ORDER proc */
+ FmgrInfo order_high; /* high_compare's ORDER proc */
} BTArrayKeyInfo;
typedef struct BTScanOpaqueData
@@ -1124,6 +1138,9 @@ typedef struct BTReadPageState
*/
#define SK_BT_REQFWD 0x00010000 /* required to continue forward scan */
#define SK_BT_REQBKWD 0x00020000 /* required to continue backward scan */
+#define SK_BT_SKIP 0x00040000 /* skip array, for skip scan */
+#define SK_BT_NEGPOSINF 0x00080000 /* no sk_argument, -inf/+inf key */
+#define SK_BT_NEXTPRIOR 0x00100000 /* sk_argument is next/prior key */
#define SK_BT_INDOPTION_SHIFT 24 /* must clear the above bits */
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
@@ -1160,6 +1177,10 @@ typedef struct BTOptions
#define PROGRESS_BTREE_PHASE_PERFORMSORT_2 4
#define PROGRESS_BTREE_PHASE_LEAF_LOAD 5
+/* GUC parameters (just a temporary convenience for reviewers) */
+extern PGDLLIMPORT int skipscan_prefix_cols;
+extern PGDLLIMPORT bool skipscan_skipsupport_enabled;
+
/*
* external entry points for btree, in nbtree.c
*/
@@ -1170,7 +1191,7 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
bool indexUnchanged,
struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
-extern Size btestimateparallelscan(int nkeys, int norderbys);
+extern Size btestimateparallelscan(Relation rel, int nkeys, int norderbys);
extern void btinitparallelscan(void *target);
extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index 5d7fe292b..c0ccdfdfb 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -21,6 +21,8 @@
amprocrighttype => 'bit', amprocnum => '4', amproc => 'btequalimage' },
{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
amprocrighttype => 'bool', amprocnum => '1', amproc => 'btboolcmp' },
+{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
+ amprocrighttype => 'bool', amprocnum => '6', amproc => 'btboolskipsupport' },
{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
amprocrighttype => 'bool', amprocnum => '4', amproc => 'btequalimage' },
{ amprocfamily => 'btree/bpchar_ops', amproclefttype => 'bpchar',
@@ -41,12 +43,16 @@
amprocrighttype => 'char', amprocnum => '1', amproc => 'btcharcmp' },
{ amprocfamily => 'btree/char_ops', amproclefttype => 'char',
amprocrighttype => 'char', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/char_ops', amproclefttype => 'char',
+ amprocrighttype => 'char', amprocnum => '6', amproc => 'btcharskipsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '1', amproc => 'date_cmp' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '2', amproc => 'date_sortsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
+ amprocrighttype => 'date', amprocnum => '6', amproc => 'date_skipsupport' },
{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
amprocrighttype => 'timestamp', amprocnum => '1',
amproc => 'date_cmp_timestamp' },
@@ -122,6 +128,8 @@
amprocrighttype => 'int2', amprocnum => '2', amproc => 'btint2sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
+ amprocrighttype => 'int2', amprocnum => '6', amproc => 'btint2skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
amprocrighttype => 'int4', amprocnum => '1', amproc => 'btint24cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
@@ -141,6 +149,8 @@
amprocrighttype => 'int4', amprocnum => '2', amproc => 'btint4sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
+ amprocrighttype => 'int4', amprocnum => '6', amproc => 'btint4skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
amprocrighttype => 'int8', amprocnum => '1', amproc => 'btint48cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
@@ -160,6 +170,8 @@
amprocrighttype => 'int8', amprocnum => '2', amproc => 'btint8sortsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
amprocrighttype => 'int8', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
+ amprocrighttype => 'int8', amprocnum => '6', amproc => 'btint8skipsupport' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
amprocrighttype => 'int4', amprocnum => '1', amproc => 'btint84cmp' },
{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
@@ -193,6 +205,8 @@
amprocrighttype => 'oid', amprocnum => '2', amproc => 'btoidsortsupport' },
{ amprocfamily => 'btree/oid_ops', amproclefttype => 'oid',
amprocrighttype => 'oid', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/oid_ops', amproclefttype => 'oid',
+ amprocrighttype => 'oid', amprocnum => '6', amproc => 'btoidskipsupport' },
{ amprocfamily => 'btree/oidvector_ops', amproclefttype => 'oidvector',
amprocrighttype => 'oidvector', amprocnum => '1',
amproc => 'btoidvectorcmp' },
@@ -261,6 +275,8 @@
amprocrighttype => 'uuid', amprocnum => '2', amproc => 'uuid_sortsupport' },
{ amprocfamily => 'btree/uuid_ops', amproclefttype => 'uuid',
amprocrighttype => 'uuid', amprocnum => '4', amproc => 'btequalimage' },
+{ amprocfamily => 'btree/uuid_ops', amproclefttype => 'uuid',
+ amprocrighttype => 'uuid', amprocnum => '6', amproc => 'uuid_skipsupport' },
{ amprocfamily => 'btree/record_ops', amproclefttype => 'record',
amprocrighttype => 'record', amprocnum => '1', amproc => 'btrecordcmp' },
{ amprocfamily => 'btree/record_image_ops', amproclefttype => 'record',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2513c36fc..d288aced9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1004,18 +1004,27 @@
{ oid => '3129', descr => 'sort support',
proname => 'btint2sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint2sortsupport' },
+{ oid => '9290', descr => 'skip support',
+ proname => 'btint2skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint2skipsupport' },
{ oid => '351', descr => 'less-equal-greater',
proname => 'btint4cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'int4 int4', prosrc => 'btint4cmp' },
{ oid => '3130', descr => 'sort support',
proname => 'btint4sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint4sortsupport' },
+{ oid => '9291', descr => 'skip support',
+ proname => 'btint4skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint4skipsupport' },
{ oid => '842', descr => 'less-equal-greater',
proname => 'btint8cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'int8 int8', prosrc => 'btint8cmp' },
{ oid => '3131', descr => 'sort support',
proname => 'btint8sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btint8sortsupport' },
+{ oid => '9292', descr => 'skip support',
+ proname => 'btint8skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btint8skipsupport' },
{ oid => '354', descr => 'less-equal-greater',
proname => 'btfloat4cmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'float4 float4', prosrc => 'btfloat4cmp' },
@@ -1034,12 +1043,18 @@
{ oid => '3134', descr => 'sort support',
proname => 'btoidsortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'btoidsortsupport' },
+{ oid => '9293', descr => 'skip support',
+ proname => 'btoidskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btoidskipsupport' },
{ oid => '404', descr => 'less-equal-greater',
proname => 'btoidvectorcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'oidvector oidvector', prosrc => 'btoidvectorcmp' },
{ oid => '358', descr => 'less-equal-greater',
proname => 'btcharcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'char char', prosrc => 'btcharcmp' },
+{ oid => '9294', descr => 'skip support',
+ proname => 'btcharskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btcharskipsupport' },
{ oid => '359', descr => 'less-equal-greater',
proname => 'btnamecmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'name name', prosrc => 'btnamecmp' },
@@ -2250,6 +2265,9 @@
{ oid => '3136', descr => 'sort support',
proname => 'date_sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'date_sortsupport' },
+{ oid => '9295', descr => 'skip support',
+ proname => 'date_skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'date_skipsupport' },
{ oid => '4133', descr => 'window RANGE support',
proname => 'in_range', prorettype => 'bool',
proargtypes => 'date date interval bool bool',
@@ -4437,6 +4455,9 @@
{ oid => '1693', descr => 'less-equal-greater',
proname => 'btboolcmp', proleakproof => 't', prorettype => 'int4',
proargtypes => 'bool bool', prosrc => 'btboolcmp' },
+{ oid => '9296', descr => 'skip support',
+ proname => 'btboolskipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'btboolskipsupport' },
{ oid => '1688', descr => 'hash',
proname => 'time_hash', prorettype => 'int4', proargtypes => 'time',
@@ -9294,6 +9315,9 @@
{ oid => '3300', descr => 'sort support',
proname => 'uuid_sortsupport', prorettype => 'void',
proargtypes => 'internal', prosrc => 'uuid_sortsupport' },
+{ oid => '9297', descr => 'skip support',
+ proname => 'uuid_skipsupport', prorettype => 'void',
+ proargtypes => 'internal', prosrc => 'uuid_skipsupport' },
{ oid => '2961', descr => 'I/O',
proname => 'uuid_recv', prorettype => 'uuid', proargtypes => 'internal',
prosrc => 'uuid_recv' },
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d70e6d37e..5b58739ad 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -192,6 +192,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_LOCK_MANAGER,
LWTRANCHE_PREDICATE_LOCK_MANAGER,
LWTRANCHE_PARALLEL_HASH_JOIN,
+ LWTRANCHE_PARALLEL_BTREE_SCAN,
LWTRANCHE_PARALLEL_QUERY_DSA,
LWTRANCHE_PER_SESSION_DSA,
LWTRANCHE_PER_SESSION_RECORD_TYPE,
diff --git a/src/include/utils/skipsupport.h b/src/include/utils/skipsupport.h
new file mode 100644
index 000000000..d91390fc6
--- /dev/null
+++ b/src/include/utils/skipsupport.h
@@ -0,0 +1,109 @@
+/*-------------------------------------------------------------------------
+ *
+ * skipsupport.h
+ * Support routines for B-Tree skip scan.
+ *
+ * B-Tree operator classes for discrete types can optionally provide a support
+ * function for skipping. This is used during skip scans.
+ *
+ * A B-tree operator class that implements skip support provides B-tree index
+ * scans with a way of enumerating and iterating through every possible value
+ * from the domain of indexable values. This gives scans a way to determine
+ * the next value in line for a given skip array/scan key/skipped attribute.
+ * This happens at the point where the scan determines that another primitive
+ * index scan is required. The next value is used (in combination with at
+ * least one additional lower-order non-skip key, taken from the SQL query) to
+ * relocate the scan, skipping over many irrelevant leaf pages in the process.
+ *
+ * Skip support generally works best with discrete types such as integer,
+ * date, and boolean; types where there is a decent chance that indexes will
+ * contain contiguous values (given a leading attributes using the opclass).
+ * When gaps/discontinuities are naturally rare (e.g., a leading identity
+ * column in a composite index, a date column preceding a product_id column),
+ * then it makes sense for skip scans to optimistically assume that the next
+ * distinct indexable value will find directly matching index tuples.
+ *
+ * The B-Tree code can fall back on next-key sentinel values for any opclass
+ * that doesn't provide its own skip support function. There is no point in
+ * providing skip support unless the next indexed key value is often the next
+ * indexable value (at least with some workloads). Opclasses where that never
+ * works out in practice should just rely on the B-Tree AM's generic next-key
+ * fallback strategy. Opclasses where adding skip support is infeasible or
+ * hard (e.g., an opclass for a continuous type) can also use the fallback.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/skipsupport.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SKIPSUPPORT_H
+#define SKIPSUPPORT_H
+
+#include "utils/relcache.h"
+
+typedef struct SkipSupportData *SkipSupport;
+typedef Datum (*SkipSupportIncDec) (Relation rel,
+ Datum existing,
+ bool *overflow);
+
+/*
+ * State/callbacks used by skip arrays to procedurally generate elements.
+ *
+ * A BTSKIPSUPPORT_PROC function must set each and every field when called.
+ * If an opclass can only set some of the fields, then it cannot safely
+ * provide a skip support routine.
+ */
+typedef struct SkipSupportData
+{
+ /*
+ * low_elem and high_elem must be set with the lowest and highest possible
+ * values from the domain of indexable values (assuming standard ascending
+ * order). This helps the B-Tree code with finding its initial position
+ * at the leaf level (during the skip scan's first primitive index scan).
+ * In other words, it gives the B-Tree code a useful value to start from,
+ * before any data has been read from the index.
+ *
+ * low_elem and high_elem are also used by skip scans to determine when
+ * they've reached the final possible value (in the current direction).
+ * It's typical for the scan to run out of leaf pages before it runs out
+ * of unscanned indexable values, but it's still useful for the scan to
+ * have a way to recognize when it has reached the last possible value
+ * (this saves us a useless probe that just lands on the final leaf page).
+ */
+ Datum low_elem; /* lowest sorting/leftmost non-NULL value */
+ Datum high_elem; /* highest sorting/rightmost non-NULL value */
+
+ /*
+ * Decrement/increment functions.
+ *
+ * Returns a decremented/incremented copy of caller's existing datum,
+ * allocated in caller's memory context (in the case of pass-by-reference
+ * types). It's not okay for these functions to leak any memory.
+ *
+ * Both decrement and increment callbacks are guaranteed to never be
+ * called with a NULL "existing" arg.
+ *
+ * When the decrement function (or increment function) is called with a
+ * value that already matches low_elem (or high_elem), function must set
+ * the *overflow argument. The return value is undefined, and the B-Tree
+ * code is entitled to assume that no memory will have been allocated.
+ *
+ * The B-Tree skip scan caller's "existing" datum is often just a straight
+ * copy of a value from an index tuple. Operator classes must be liberal
+ * in accepting every possible representational variation within the
+ * underlying data type. On the other hand, opclasses are _not_ expected
+ * to preserve any information that doesn't affect how datums are sorted
+ * (e.g., skip support for a fixed precision numeric type isn't required
+ * to preserve datum display scale).
+ */
+ SkipSupportIncDec decrement;
+ SkipSupportIncDec increment;
+} SkipSupportData;
+
+extern bool PrepareSkipSupportFromOpclass(Oid opfamily, Oid opcintype,
+ bool reverse, SkipSupport sksup);
+
+#endif /* SKIPSUPPORT_H */
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index dcd04b813..dc99dad29 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -470,7 +470,8 @@ index_parallelscan_estimate(Relation indexRelation, int nkeys, int norderbys,
*/
if (indexRelation->rd_indam->amestimateparallelscan != NULL)
nbytes = add_size(nbytes,
- indexRelation->rd_indam->amestimateparallelscan(nkeys,
+ indexRelation->rd_indam->amestimateparallelscan(indexRelation,
+ nkeys,
norderbys));
return nbytes;
diff --git a/src/backend/access/nbtree/nbtcompare.c b/src/backend/access/nbtree/nbtcompare.c
index 1c72867c8..26ebbf776 100644
--- a/src/backend/access/nbtree/nbtcompare.c
+++ b/src/backend/access/nbtree/nbtcompare.c
@@ -58,6 +58,7 @@
#include <limits.h>
#include "utils/fmgrprotos.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
#ifdef STRESS_SORT_INT_MIN
@@ -78,6 +79,51 @@ btboolcmp(PG_FUNCTION_ARGS)
PG_RETURN_INT32((int32) a - (int32) b);
}
+static Datum
+bool_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ bool bexisting = DatumGetBool(existing);
+
+ if (bexisting == false)
+ {
+ /* return value is undefined */
+ *underflow = true;
+ return (Datum) 0;
+ }
+
+ *underflow = false;
+ return BoolGetDatum(bexisting - 1);
+}
+
+static Datum
+bool_increment(Relation rel, Datum existing, bool *overflow)
+{
+ bool bexisting = DatumGetBool(existing);
+
+ if (bexisting == true)
+ {
+ /* return value is undefined */
+ *overflow = true;
+ return (Datum) 0;
+ }
+
+ *overflow = false;
+ return BoolGetDatum(bexisting + 1);
+}
+
+Datum
+btboolskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = bool_decrement;
+ sksup->increment = bool_increment;
+ sksup->low_elem = BoolGetDatum(false);
+ sksup->high_elem = BoolGetDatum(true);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint2cmp(PG_FUNCTION_ARGS)
{
@@ -105,6 +151,51 @@ btint2sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int2_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ int16 iexisting = DatumGetInt16(existing);
+
+ if (iexisting == PG_INT16_MIN)
+ {
+ /* return value is undefined */
+ *underflow = true;
+ return (Datum) 0;
+ }
+
+ *underflow = false;
+ return Int16GetDatum(iexisting - 1);
+}
+
+static Datum
+int2_increment(Relation rel, Datum existing, bool *overflow)
+{
+ int16 iexisting = DatumGetInt16(existing);
+
+ if (iexisting == PG_INT16_MAX)
+ {
+ /* return value is undefined */
+ *overflow = true;
+ return (Datum) 0;
+ }
+
+ *overflow = false;
+ return Int16GetDatum(iexisting + 1);
+}
+
+Datum
+btint2skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int2_decrement;
+ sksup->increment = int2_increment;
+ sksup->low_elem = Int16GetDatum(PG_INT16_MIN);
+ sksup->high_elem = Int16GetDatum(PG_INT16_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint4cmp(PG_FUNCTION_ARGS)
{
@@ -128,6 +219,51 @@ btint4sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int4_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ int32 iexisting = DatumGetInt32(existing);
+
+ if (iexisting == PG_INT32_MIN)
+ {
+ /* return value is undefined */
+ *underflow = true;
+ return (Datum) 0;
+ }
+
+ *underflow = false;
+ return Int32GetDatum(iexisting - 1);
+}
+
+static Datum
+int4_increment(Relation rel, Datum existing, bool *overflow)
+{
+ int32 iexisting = DatumGetInt32(existing);
+
+ if (iexisting == PG_INT32_MAX)
+ {
+ /* return value is undefined */
+ *overflow = true;
+ return (Datum) 0;
+ }
+
+ *overflow = false;
+ return Int32GetDatum(iexisting + 1);
+}
+
+Datum
+btint4skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int4_decrement;
+ sksup->increment = int4_increment;
+ sksup->low_elem = Int32GetDatum(PG_INT32_MIN);
+ sksup->high_elem = Int32GetDatum(PG_INT32_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint8cmp(PG_FUNCTION_ARGS)
{
@@ -171,6 +307,51 @@ btint8sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+int8_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ int64 iexisting = DatumGetInt64(existing);
+
+ if (iexisting == PG_INT64_MIN)
+ {
+ /* return value is undefined */
+ *underflow = true;
+ return (Datum) 0;
+ }
+
+ *underflow = false;
+ return Int64GetDatum(iexisting - 1);
+}
+
+static Datum
+int8_increment(Relation rel, Datum existing, bool *overflow)
+{
+ int64 iexisting = DatumGetInt64(existing);
+
+ if (iexisting == PG_INT64_MAX)
+ {
+ /* return value is undefined */
+ *overflow = true;
+ return (Datum) 0;
+ }
+
+ *overflow = false;
+ return Int64GetDatum(iexisting + 1);
+}
+
+Datum
+btint8skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = int8_decrement;
+ sksup->increment = int8_increment;
+ sksup->low_elem = Int64GetDatum(PG_INT64_MIN);
+ sksup->high_elem = Int64GetDatum(PG_INT64_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btint48cmp(PG_FUNCTION_ARGS)
{
@@ -292,6 +473,51 @@ btoidsortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+oid_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ Oid oexisting = DatumGetObjectId(existing);
+
+ if (oexisting == InvalidOid)
+ {
+ /* return value is undefined */
+ *underflow = true;
+ return (Datum) 0;
+ }
+
+ *underflow = false;
+ return ObjectIdGetDatum(oexisting - 1);
+}
+
+static Datum
+oid_increment(Relation rel, Datum existing, bool *overflow)
+{
+ Oid oexisting = DatumGetObjectId(existing);
+
+ if (oexisting == OID_MAX)
+ {
+ /* return value is undefined */
+ *overflow = true;
+ return (Datum) 0;
+ }
+
+ *overflow = false;
+ return ObjectIdGetDatum(oexisting + 1);
+}
+
+Datum
+btoidskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = oid_decrement;
+ sksup->increment = oid_increment;
+ sksup->low_elem = ObjectIdGetDatum(InvalidOid);
+ sksup->high_elem = ObjectIdGetDatum(OID_MAX);
+
+ PG_RETURN_VOID();
+}
+
Datum
btoidvectorcmp(PG_FUNCTION_ARGS)
{
@@ -325,3 +551,50 @@ btcharcmp(PG_FUNCTION_ARGS)
/* Be careful to compare chars as unsigned */
PG_RETURN_INT32((int32) ((uint8) a) - (int32) ((uint8) b));
}
+
+static Datum
+char_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ uint8 cexisting = UInt8GetDatum(existing);
+
+ if (cexisting == 0)
+ {
+ /* return value is undefined */
+ *underflow = true;
+ return (Datum) 0;
+ }
+
+ *underflow = false;
+ return CharGetDatum((uint8) cexisting - 1);
+}
+
+static Datum
+char_increment(Relation rel, Datum existing, bool *overflow)
+{
+ uint8 cexisting = UInt8GetDatum(existing);
+
+ if (cexisting == UCHAR_MAX)
+ {
+ /* return value is undefined */
+ *overflow = true;
+ return (Datum) 0;
+ }
+
+ *overflow = false;
+ return CharGetDatum((uint8) cexisting + 1);
+}
+
+Datum
+btcharskipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = char_decrement;
+ sksup->increment = char_increment;
+
+ /* btcharcmp compares chars as unsigned */
+ sksup->low_elem = UInt8GetDatum(0);
+ sksup->high_elem = UInt8GetDatum(UCHAR_MAX);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4ecf883d3..e4e2bb741 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -32,6 +32,7 @@
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
+#include "utils/datum.h"
#include "utils/fmgrprotos.h"
#include "utils/index_selfuncs.h"
#include "utils/memutils.h"
@@ -71,7 +72,7 @@ typedef struct BTParallelScanDescData
* available for scan. see above for
* possible states of parallel scan. */
uint64 btps_nsearches; /* instrumentation */
- slock_t btps_mutex; /* protects above variables, btps_arrElems */
+ LWLock btps_lock; /* protects above variables, btps_arrElems */
ConditionVariable btps_cv; /* used to synchronize parallel scan */
/*
@@ -79,11 +80,21 @@ typedef struct BTParallelScanDescData
* index scan. Holds BTArrayKeyInfo.cur_elem offsets for scan keys.
*/
int btps_arrElems[FLEXIBLE_ARRAY_MEMBER];
+
+ /*
+ * The reset of the space allocated in shared memory is also used when
+ * scans need to schedule another primitive index scan. It holds a
+ * flattened representation of the backend's skip array datums, if any.
+ */
} BTParallelScanDescData;
typedef struct BTParallelScanDescData *BTParallelScanDesc;
+static void _bt_parallel_serialize_arrays(Relation rel, BTParallelScanDesc btscan,
+ BTScanOpaque so);
+static void _bt_parallel_restore_arrays(Relation rel, BTParallelScanDesc btscan,
+ BTScanOpaque so);
static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state,
BTCycleId cycleid);
@@ -538,10 +549,155 @@ btrestrpos(IndexScanDesc scan)
* btestimateparallelscan -- estimate storage for BTParallelScanDescData
*/
Size
-btestimateparallelscan(int nkeys, int norderbys)
+btestimateparallelscan(Relation rel, int nkeys, int norderbys)
{
+ int16 nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ Size estnbtreeshared,
+ genericattrspace;
+
/* Pessimistically assume all input scankeys will be output with arrays */
- return offsetof(BTParallelScanDescData, btps_arrElems) + sizeof(int) * nkeys;
+ estnbtreeshared = offsetof(BTParallelScanDescData, btps_arrElems) +
+ sizeof(int) * nkeys;
+
+ /*
+ * Assume every index attribute might require that we generate a skip scan
+ * key
+ */
+ genericattrspace = datumEstimateSpace((Datum) 0, false, true,
+ sizeof(Datum));
+ for (int attnum = 1; attnum <= nkeyatts; attnum++)
+ {
+ Form_pg_attribute attr;
+
+ /* Every skip array needs a space for storing sk_flags */
+ estnbtreeshared = add_size(estnbtreeshared, sizeof(int));
+
+ attr = TupleDescAttr(rel->rd_att, attnum - 1);
+ if (attr->attlen > 0)
+ {
+ /* Fixed length datum */
+ Size estfixed = datumEstimateSpace((Datum) 0, false,
+ attr->attbyval,
+ attr->attlen);
+
+ estnbtreeshared = add_size(estnbtreeshared, estfixed);
+ continue;
+ }
+
+ /*
+ * Varlena (or other variable-length) datum.
+ *
+ * Assume that serializing the arrays will use just as much space as a
+ * pass-by-value datum, in addition to a full third of a page of
+ * space.
+ */
+ estnbtreeshared = add_size(estnbtreeshared, genericattrspace);
+ estnbtreeshared = add_size(estnbtreeshared, BLCKSZ / 3);
+ }
+
+ return estnbtreeshared;
+}
+
+/*
+ * _bt_parallel_serialize_arrays() -- Serialize parallel array state.
+ *
+ * Caller must have exclusively locked btscan->btps_lock when called.
+ */
+static void
+_bt_parallel_serialize_arrays(Relation rel, BTParallelScanDesc btscan,
+ BTScanOpaque so)
+{
+ char *datumshared;
+
+ /* Space for serialized datums begins immediately after btps_arrElems[] */
+ datumshared = ((char *) &btscan->btps_arrElems[so->numArrayKeys]);
+ for (int i = 0; i < so->numArrayKeys; i++)
+ {
+ BTArrayKeyInfo *array = &so->arrayKeys[i];
+ ScanKey skey = &so->keyData[array->scan_key];
+ Form_pg_attribute attr;
+
+ if (array->num_elems != -1)
+ {
+ /* Serialize regular (non-skip) array */
+ Assert(!(skey->sk_flags & SK_BT_SKIP));
+ btscan->btps_arrElems[i] = array->cur_elem;
+ continue;
+ }
+
+ /* Serialize skip array */
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ memcpy(datumshared, &skey->sk_flags, sizeof(int));
+ datumshared += sizeof(int);
+
+ if (skey->sk_flags & SK_BT_NEGPOSINF)
+ {
+ /* No sk_argument to serialize */
+ Assert(skey->sk_argument == 0);
+ continue;
+ }
+
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ datumSerialize(skey->sk_argument, (skey->sk_flags & SK_ISNULL) != 0,
+ attr->attbyval, attr->attlen, &datumshared);
+ }
+}
+
+/*
+ * _bt_parallel_restore_arrays() -- Restore serialized parallel array state.
+ *
+ * Caller must have exclusively locked btscan->btps_lock when called.
+ */
+static void
+_bt_parallel_restore_arrays(Relation rel, BTParallelScanDesc btscan,
+ BTScanOpaque so)
+{
+ char *datumshared;
+
+ /* Space for serialized datums begins immediately after btps_arrElems[] */
+ datumshared = ((char *) &btscan->btps_arrElems[so->numArrayKeys]);
+ for (int i = 0; i < so->numArrayKeys; i++)
+ {
+ BTArrayKeyInfo *array = &so->arrayKeys[i];
+ ScanKey skey = &so->keyData[array->scan_key];
+ bool isnull;
+ Form_pg_attribute attr;
+
+ if (array->num_elems != -1)
+ {
+ /* Restore regular (non-skip) array */
+ Assert(!(skey->sk_flags & SK_BT_SKIP));
+ array->cur_elem = btscan->btps_arrElems[i];
+ skey->sk_argument = array->elem_values[array->cur_elem];
+ continue;
+ }
+
+ /* Restore skip array */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+ skey->sk_argument = (Datum) 0;
+
+ /* Now that old sk_argument memory is freed, copy over sk_flags */
+ memcpy(&skey->sk_flags, datumshared, sizeof(int));
+ datumshared += sizeof(int);
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+
+ if (skey->sk_flags & SK_BT_NEGPOSINF)
+ {
+ /* No sk_argument to serialize */
+ continue;
+ }
+
+ skey->sk_argument = datumRestore(&datumshared, &isnull);
+ if (isnull)
+ {
+ Assert(skey->sk_argument == 0);
+ Assert(skey->sk_flags & SK_SEARCHNULL);
+ Assert(skey->sk_flags & SK_ISNULL);
+ }
+ }
}
/*
@@ -552,7 +708,8 @@ btinitparallelscan(void *target)
{
BTParallelScanDesc bt_target = (BTParallelScanDesc) target;
- SpinLockInit(&bt_target->btps_mutex);
+ LWLockInitialize(&bt_target->btps_lock,
+ LWTRANCHE_PARALLEL_BTREE_SCAN);
bt_target->btps_scanPage = InvalidBlockNumber;
bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
bt_target->btps_nsearches = 0;
@@ -574,15 +731,15 @@ btparallelrescan(IndexScanDesc scan)
parallel_scan->ps_offset);
/*
- * In theory, we don't need to acquire the spinlock here, because there
+ * In theory, we don't need to acquire the LWLock here, because there
* shouldn't be any other workers running at this point, but we do so for
* consistency.
*/
- SpinLockAcquire(&btscan->btps_mutex);
+ LWLockAcquire(&btscan->btps_lock, LW_EXCLUSIVE);
btscan->btps_scanPage = InvalidBlockNumber;
btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
/* deliberately don't reset btps_nsearches (matches index_rescan) */
- SpinLockRelease(&btscan->btps_mutex);
+ LWLockRelease(&btscan->btps_lock);
}
/*
@@ -609,6 +766,7 @@ btparallelrescan(IndexScanDesc scan)
bool
_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool exit_loop = false;
bool status = true;
@@ -642,7 +800,7 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
while (1)
{
- SpinLockAcquire(&btscan->btps_mutex);
+ LWLockAcquire(&btscan->btps_lock, LW_EXCLUSIVE);
if (btscan->btps_pageStatus == BTPARALLEL_DONE)
{
@@ -657,14 +815,9 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
{
/* Can start scheduled primitive scan right away, so do so */
btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
- for (int i = 0; i < so->numArrayKeys; i++)
- {
- BTArrayKeyInfo *array = &so->arrayKeys[i];
- ScanKey skey = &so->keyData[array->scan_key];
- array->cur_elem = btscan->btps_arrElems[i];
- skey->sk_argument = array->elem_values[array->cur_elem];
- }
+ /* Restore scan's array keys from serialized values */
+ _bt_parallel_restore_arrays(rel, btscan, so);
*pageno = InvalidBlockNumber;
exit_loop = true;
}
@@ -701,7 +854,7 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
*pageno = btscan->btps_scanPage;
exit_loop = true;
}
- SpinLockRelease(&btscan->btps_mutex);
+ LWLockRelease(&btscan->btps_lock);
if (exit_loop || !status)
break;
ConditionVariableSleep(&btscan->btps_cv, WAIT_EVENT_BTREE_PAGE);
@@ -731,10 +884,10 @@ _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
parallel_scan->ps_offset);
- SpinLockAcquire(&btscan->btps_mutex);
+ LWLockAcquire(&btscan->btps_lock, LW_EXCLUSIVE);
btscan->btps_scanPage = scan_page;
btscan->btps_pageStatus = BTPARALLEL_IDLE;
- SpinLockRelease(&btscan->btps_mutex);
+ LWLockRelease(&btscan->btps_lock);
ConditionVariableSignal(&btscan->btps_cv);
}
@@ -771,7 +924,7 @@ _bt_parallel_done(IndexScanDesc scan)
* Mark the parallel scan as done, unless some other process did so
* already
*/
- SpinLockAcquire(&btscan->btps_mutex);
+ LWLockAcquire(&btscan->btps_lock, LW_EXCLUSIVE);
Assert(btscan->btps_pageStatus != BTPARALLEL_NEED_PRIMSCAN);
if (btscan->btps_pageStatus != BTPARALLEL_DONE)
{
@@ -780,7 +933,7 @@ _bt_parallel_done(IndexScanDesc scan)
}
/* Copy the authoritative shared primitive scan counter to local field */
scan->nsearches = btscan->btps_nsearches;
- SpinLockRelease(&btscan->btps_mutex);
+ LWLockRelease(&btscan->btps_lock);
/* wake up all the workers associated with this parallel scan */
if (status_changed)
@@ -798,6 +951,7 @@ _bt_parallel_done(IndexScanDesc scan)
void
_bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber prev_scan_page)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
BTParallelScanDesc btscan;
@@ -807,7 +961,7 @@ _bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber prev_scan_page)
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
parallel_scan->ps_offset);
- SpinLockAcquire(&btscan->btps_mutex);
+ LWLockAcquire(&btscan->btps_lock, LW_EXCLUSIVE);
if (btscan->btps_scanPage == prev_scan_page &&
btscan->btps_pageStatus == BTPARALLEL_IDLE)
{
@@ -816,14 +970,9 @@ _bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber prev_scan_page)
btscan->btps_nsearches++;
/* Serialize scan's current array keys */
- for (int i = 0; i < so->numArrayKeys; i++)
- {
- BTArrayKeyInfo *array = &so->arrayKeys[i];
-
- btscan->btps_arrElems[i] = array->cur_elem;
- }
+ _bt_parallel_serialize_arrays(rel, btscan, so);
}
- SpinLockRelease(&btscan->btps_mutex);
+ LWLockRelease(&btscan->btps_lock);
}
/*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index e5f941e0a..ed5593c62 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -883,7 +883,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
Buffer buf;
BTStack stack;
OffsetNumber offnum;
- StrategyNumber strat;
BTScanInsertData inskey;
ScanKey startKeys[INDEX_MAX_KEYS];
ScanKeyData notnullkeys[INDEX_MAX_KEYS];
@@ -975,7 +974,20 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* a > or < boundary or find an attribute with no boundary (which can be
* thought of as the same as "> -infinity"), we can't use keys for any
* attributes to its right, because it would break our simplistic notion
- * of what initial positioning strategy to use.
+ * of what initial positioning strategy to use. In practice skip scan
+ * typically enables us to use all scan keys here, even with a set of
+ * input keys that leave a "gap" between two index attributes (cases with
+ * multiple gaps will even manage this without any special restrictions).
+ *
+ * Skip scan works by having _bt_preprocess_keys cons up = boundary keys
+ * for any index columns that were missing a = key in scan->keyData[], the
+ * input scan keys passed to us by the executor. This happens for index
+ * attributes prior to the attribute of our final input scan key. The
+ * underlying = keys use skip arrays. The keys can be thought of as the
+ * same as "col = ANY('{every possible col value}')". Note that this
+ * often includes the array element NULL, which the scan will treat as an
+ * IS NULL qual (the skip array's scan key is already marked SK_SEARCHNULL
+ * when we're called, so we need no special handling for this case here).
*
* When the scan keys include cross-type operators, _bt_preprocess_keys
* may not be able to eliminate redundant keys; in such cases we will
@@ -1050,6 +1062,47 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
{
if (i >= so->numberOfKeys || cur->sk_attno != curattr)
{
+ if (chosen && (chosen->sk_flags & SK_BT_NEGPOSINF))
+ {
+ /* -inf/+inf element from a skip array's scan key */
+ ScanKey origchosen = chosen;
+ BTArrayKeyInfo *array = NULL;
+
+ for (int arridx = 0; arridx < so->numArrayKeys; arridx++)
+ {
+ array = &so->arrayKeys[arridx];
+ if (array->scan_key == chosen - so->keyData)
+ break;
+ }
+
+ /* use array's inequality key in startKeys[] */
+ if (ScanDirectionIsForward(dir))
+ chosen = array->low_compare;
+ else
+ chosen = array->high_compare;
+
+ Assert(!chosen ||
+ chosen->sk_attno == origchosen->sk_attno);
+
+ if (!array->null_elem)
+ {
+ /*
+ * The array does not include a NULL element (meaning
+ * array advancement never generates an IS NULL qual).
+ * We'll deduce a NOT NULL key to skip over any NULLs
+ * when there's no usable low_compare (or no usable
+ * high_compare, during a backwards scan).
+ *
+ * Note: this also handles an explicit NOT NULL key
+ * that preprocessing folded into the skip array (it
+ * doesn't save them in low_compare/high_compare).
+ */
+ impliesNN = origchosen;
+ }
+ else
+ Assert(chosen == NULL && impliesNN == NULL);
+ }
+
/*
* Done looking at keys for curattr. If we didn't find a
* usable boundary key, see if we can deduce a NOT NULL key.
@@ -1083,16 +1136,42 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
break;
startKeys[keysz++] = chosen;
+ if (chosen->sk_flags & SK_BT_NEXTPRIOR)
+ {
+ /*
+ * Next/prior key element from a skip array's scan key.
+ * 'chosen' could be SK_ISNULL, in which case startKeys[]
+ * positions us at the first tuple > NULL (for backwards
+ * scans it's the first tuple < NULL instead).
+ *
+ * Adjust strat_total, so that our = key gets treated like
+ * a > key (or like a < key) within _bt_search.
+ */
+ Assert(strat_total == BTEqualStrategyNumber);
+ if (ScanDirectionIsForward(dir))
+ strat_total = BTGreaterStrategyNumber;
+ else
+ strat_total = BTLessStrategyNumber;
+
+ /*
+ * We'll never find an exact = match for a NEXTPRIOR
+ * sentinel sk_argument value, so there's no reason to
+ * save any later would-be boundary keys in startKeys[]
+ * (besides, doing so would confuse _bt_search, since it
+ * isn't directly aware of NEXTPRIOR sentinel values)
+ */
+ break;
+ }
+
/*
* Adjust strat_total, and quit if we have stored a > or <
* key.
*/
- strat = chosen->sk_strategy;
- if (strat != BTEqualStrategyNumber)
+ if (chosen->sk_strategy != BTEqualStrategyNumber)
{
- strat_total = strat;
- if (strat == BTGreaterStrategyNumber ||
- strat == BTLessStrategyNumber)
+ strat_total = chosen->sk_strategy;
+ if (chosen->sk_strategy == BTGreaterStrategyNumber ||
+ chosen->sk_strategy == BTLessStrategyNumber)
break;
}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index d1423bd85..ab8f8f8c5 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -29,9 +29,37 @@
#include "utils/memutils.h"
#include "utils/rel.h"
+/*
+ * GUC parameters (temporary convenience for reviewers).
+ *
+ * To disable all skipping, set skipscan_prefix_cols=0. Otherwise set it to
+ * the attribute number that you wish to make the last attribute number that
+ * we can add a skip scan key for. For example, skipscan_prefix_cols=1 makes
+ * an index scan with qual "WHERE b = 1 AND c > 42" generate a skip scan key
+ * on the column 'a' (which is attnum 1) only, preventing us from adding one
+ * for the column 'c' (and so 'c' will still have an inequality scan key,
+ * required in only one direction -- 'c' won't be output as a "range" skip
+ * key/array).
+ */
+int skipscan_prefix_cols = INDEX_MAX_KEYS;
+
+/*
+ * skipscan_skipsupport_enabled can be used to avoid using opclass skip
+ * support routines. This can be used to quantify the peformance benefit that
+ * comes from having dedicated skip support, with a given test query.
+ */
+bool skipscan_skipsupport_enabled = true;
+
#define LOOK_AHEAD_REQUIRED_RECHECKS 3
#define LOOK_AHEAD_DEFAULT_DISTANCE 5
+typedef struct BTSkipPreproc
+{
+ SkipSupportData sksup; /* opclass skip scan support (optional) */
+ bool use_sksup; /* sksup set to valid routine? */
+ Oid eq_op; /* InvalidOid means don't skip */
+} BTSkipPreproc;
+
typedef struct BTSortArrayContext
{
FmgrInfo *sortproc;
@@ -64,15 +92,38 @@ static bool _bt_compare_array_scankey_args(IndexScanDesc scan,
bool *qual_ok);
static ScanKey _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys);
static void _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap);
+static int _bt_decide_skipatts(IndexScanDesc scan, BTSkipPreproc *skipatts);
+static bool _bt_skipsupport(Relation rel, int add_skip_attno,
+ BTSkipPreproc *skipatts);
static int _bt_compare_array_elements(const void *a, const void *b, void *arg);
static inline int32 _bt_compare_array_skey(FmgrInfo *orderproc,
Datum tupdatum, bool tupnull,
- Datum arrdatum, ScanKey cur);
+ Datum arrdatum, bool arrnull,
+ ScanKey cur);
+static void _bt_array_preproc_shrink(ScanKey arraysk, ScanKey skey,
+ FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok);
+static bool _bt_skip_preproc_shrink(IndexScanDesc scan, ScanKey arraysk,
+ ScanKey skey, FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok);
static int _bt_binsrch_array_skey(FmgrInfo *orderproc,
bool cur_elem_trig, ScanDirection dir,
Datum tupdatum, bool tupnull,
BTArrayKeyInfo *array, ScanKey cur,
int32 *set_elem_result);
+static void _bt_binsrch_skiparray_skey(FmgrInfo *orderproc,
+ bool cur_elem_trig, ScanDirection dir,
+ Datum tupdatum, bool tupnull,
+ BTArrayKeyInfo *array, ScanKey cur,
+ int32 *set_elem_result);
+static void _bt_scankey_set_low_or_high(Relation rel, ScanKey skey,
+ BTArrayKeyInfo *array, bool low_not_high);
+static void _bt_scankey_set_element(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ Datum tupdatum, bool tupnull);
+static void _bt_scankey_unset_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static void _bt_scankey_set_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static bool _bt_scankey_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
+static bool _bt_scankey_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array);
static bool _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir);
static void _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir);
static bool _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
@@ -258,11 +309,19 @@ _bt_freestack(BTStack stack)
* preprocessing steps are complete. This will convert the scan key offset
* references into references to the scan's so->keyData[] output scan keys.
*
+ * We're also responsible for generating skip arrays (and their associated
+ * scan keys) here. This enables skip scan. We do this for index attributes
+ * that initially lacked an equality condition within scan->keyData[], iff
+ * doing so allows a later scan key (that was passed to us in scan->keyData[])
+ * to be marked required by later preprocessing on output.
+ * _bt_decide_skipatts decides which attributes receive skip arrays.
+ *
* Caller must pass *numberOfKeys to give us a way to change the number of
* input scan keys (our output is caller's input). The returned array can be
* smaller than scan->keyData[] when we eliminated a redundant array scan key
- * (redundant with some other array scan key, for the same attribute). Caller
- * uses this to allocate so->keyData[] for the current btrescan.
+ * (redundant with some other array scan key, for the same attribute). It can
+ * also be larger when we added a skip array/skip scan key. Caller uses this
+ * to allocate so->keyData[] for the current btrescan.
*
* Note: the reason we need to return a temp scan key array, rather than just
* scribbling on scan->keyData, is that callers are permitted to call btrescan
@@ -275,8 +334,11 @@ _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
Relation rel = scan->indexRelation;
int numArrayKeyData = scan->numberOfKeys;
int16 *indoption = rel->rd_indoption;
+ BTSkipPreproc skipatts[INDEX_MAX_KEYS];
int numArrayKeys,
+ numSkipArrayKeys,
output_ikey = 0;
+ AttrNumber attno_skip = 1;
int origarrayatt = InvalidAttrNumber,
origarraykey = -1;
Oid origelemtype = InvalidOid;
@@ -286,7 +348,10 @@ _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
Assert(scan->numberOfKeys);
- /* Quick check to see if there are any array keys */
+ /*
+ * Quick check to see if there are any array keys, or any missing keys we
+ * can generate a "skip scan" array key for ourselves
+ */
numArrayKeys = 0;
for (int i = 0; i < scan->numberOfKeys; i++)
{
@@ -304,6 +369,15 @@ _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
}
}
+ numSkipArrayKeys = _bt_decide_skipatts(scan, skipatts);
+ if (numSkipArrayKeys)
+ {
+ /* At least one skip array scan key must be added to arrayKeyData[] */
+ numArrayKeys += numSkipArrayKeys;
+ /* output scan key buffer allocation needs space for skip scan keys */
+ numArrayKeyData += numSkipArrayKeys;
+ }
+
/* Quit if nothing to do. */
if (numArrayKeys == 0)
return NULL;
@@ -330,7 +404,12 @@ _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
/* Allocate space for ORDER procs used to help _bt_checkkeys */
so->orderProcs = (FmgrInfo *) palloc(numArrayKeyData * sizeof(FmgrInfo));
- /* Now process each array key */
+ /*
+ * Process each array key, and generate skip arrays as needed. Also copy
+ * every scan->keyData[] input scan key (whether it's an array or not)
+ * into the arrayKeyData array we'll return to our caller (barring any
+ * array scan keys that we could eliminate early through array merging).
+ */
numArrayKeys = 0;
for (int input_ikey = 0; input_ikey < scan->numberOfKeys; input_ikey++)
{
@@ -348,8 +427,76 @@ _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
int num_nonnulls;
int j;
+ /* Create a skip array and scan key where indicated by skipatts */
+ while (numSkipArrayKeys &&
+ attno_skip <= scan->keyData[input_ikey].sk_attno)
+ {
+ Oid opcintype = rel->rd_opcintype[attno_skip - 1];
+ Oid collation = rel->rd_indcollation[attno_skip - 1];
+ Oid eq_op = skipatts[attno_skip - 1].eq_op;
+ RegProcedure cmp_proc;
+
+ if (!OidIsValid(eq_op))
+ {
+ /* won't skip using this attribute */
+ attno_skip++;
+ continue;
+ }
+
+ cmp_proc = get_opcode(eq_op);
+ if (!RegProcedureIsValid(cmp_proc))
+ elog(ERROR, "missing oprcode for skipping equals operator %u", eq_op);
+
+ cur = &arrayKeyData[output_ikey];
+ Assert(attno_skip <= scan->keyData[input_ikey].sk_attno);
+ ScanKeyEntryInitialize(cur,
+ SK_SEARCHARRAY | SK_BT_SKIP, /* flags */
+ attno_skip, /* skipped att number */
+ BTEqualStrategyNumber, /* equality strategy */
+ InvalidOid, /* opclass input subtype */
+ collation, /* index column's collation */
+ cmp_proc, /* equality operator's proc */
+ (Datum) 0); /* constant */
+
+ /* Initialize array fields */
+ so->arrayKeys[numArrayKeys].scan_key = output_ikey;
+ so->arrayKeys[numArrayKeys].num_elems = -1;
+ so->arrayKeys[numArrayKeys].cur_elem = 0;
+ so->arrayKeys[numArrayKeys].elem_values = NULL; /* unusued */
+ so->arrayKeys[numArrayKeys].use_sksup = skipatts[attno_skip - 1].use_sksup;
+ so->arrayKeys[numArrayKeys].null_elem = true; /* for now */
+ so->arrayKeys[numArrayKeys].sksup = skipatts[attno_skip - 1].sksup;
+ so->arrayKeys[numArrayKeys].low_compare = NULL; /* for now */
+ so->arrayKeys[numArrayKeys].high_compare = NULL; /* for now */
+
+ /*
+ * Temporary testing GUC can disable the use of skip support
+ * routines
+ */
+ if (!skipscan_skipsupport_enabled)
+ so->arrayKeys[numArrayKeys].use_sksup = false;
+
+ /*
+ * We'll need a 3-way ORDER proc to determine when and how the
+ * consed-up "array" will advance inside _bt_advance_array_keys.
+ * Set one up now.
+ */
+ _bt_setup_array_cmp(scan, cur, opcintype,
+ &so->orderProcs[output_ikey], NULL);
+
+ /*
+ * Prepare to output next scan key (might be another skip scan
+ * key, or it could be an input scan key from scan->keyData[])
+ */
+ numSkipArrayKeys--;
+ numArrayKeys++;
+ attno_skip++;
+ output_ikey++; /* keep this scan key/array */
+ }
+
/*
- * Copy input scan key into temp arrayKeyData scan key array
+ * Copy input scan key into temp arrayKeyData scan key array. (From
+ * here on, cur points at our copy of the input scan key.)
*/
cur = &arrayKeyData[output_ikey];
*cur = scan->keyData[input_ikey];
@@ -521,6 +668,10 @@ _bt_preprocess_array_keys(IndexScanDesc scan, int *numberOfKeys)
so->arrayKeys[numArrayKeys].scan_key = output_ikey;
so->arrayKeys[numArrayKeys].num_elems = num_elems;
so->arrayKeys[numArrayKeys].elem_values = elem_values;
+ so->arrayKeys[numArrayKeys].null_elem = false; /* unused */
+ so->arrayKeys[numArrayKeys].use_sksup = false; /* redundant */
+ so->arrayKeys[numArrayKeys].low_compare = NULL; /* unused */
+ so->arrayKeys[numArrayKeys].high_compare = NULL; /* unused */
numArrayKeys++;
output_ikey++; /* keep this scan key/array */
}
@@ -634,7 +785,8 @@ _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap)
{
BTArrayKeyInfo *array = &so->arrayKeys[arrayidx];
- Assert(array->num_elems > 0);
+ Assert(array->num_elems > 0 || array->num_elems == -1);
+ Assert(array->num_elems != -1 || outkey->sk_flags & SK_BT_REQFWD);
if (array->scan_key == input_ikey)
{
@@ -685,7 +837,7 @@ _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap)
* Parallel index scans require space in shared memory to store the
* current array elements (for arrays kept by preprocessing) to schedule
* the next primitive index scan. The underlying structure is protected
- * using a spinlock, so defensively limit its size. In practice this can
+ * using an LWLock, so defensively limit its size. In practice this can
* only affect parallel scans that use an incomplete opfamily.
*/
if (scan->parallel_scan && so->numArrayKeys > INDEX_MAX_KEYS)
@@ -695,6 +847,192 @@ _bt_preprocess_array_keys_final(IndexScanDesc scan, int *keyDataMap)
so->numArrayKeys, INDEX_MAX_KEYS)));
}
+/*
+ * _bt_decide_skipatts() -- set index attributes requiring skip arrays
+ *
+ * _bt_preprocess_array_keys helper function. Determines which attributes
+ * will require skip arrays/scan keys. Also sets up skip support callbacks
+ * for attributes whose input opclass have skip support (opclasses without
+ * skip support will fall back on using next-key sentinel values when
+ * advancing the skip array to its next array element).
+ *
+ * Return value is the total number of scan keys to add as "input" scan keys
+ * for further processing within _bt_preprocess_keys.
+ */
+static int
+_bt_decide_skipatts(IndexScanDesc scan, BTSkipPreproc *skipatts)
+{
+ Relation rel = scan->indexRelation;
+ ScanKey inputsk;
+ AttrNumber attno_inputsk = 1,
+ attno_skip = 1;
+ bool attno_has_equal = false,
+ attno_has_rowcompare = false;
+ int numSkipArrayKeys = 0;
+
+ Assert(scan->numberOfKeys);
+
+ /*
+ * Only add skip arrays (and associated scan keys) when doing so will
+ * enable _bt_preprocess_keys to mark one or more lower-order input scan
+ * keys (user-visible scan keys taken from scan->keyData[] input array) as
+ * required to continue the scan
+ */
+ inputsk = &scan->keyData[0];
+ for (int i = 0;; inputsk++, i++)
+ {
+ int prev_numSkipArrayKeys = numSkipArrayKeys;
+
+ /*
+ * Backfill skip arrays for any wholly omitted attributes prior to
+ * attno_inputsk
+ */
+ while (attno_skip < attno_inputsk)
+ {
+ if (!_bt_skipsupport(rel, attno_skip, &skipatts[attno_skip - 1]))
+ {
+ /*
+ * Cannot generate a skip array for this or later attributes
+ * (input opclass lacks an equality strategy operator)
+ */
+ return prev_numSkipArrayKeys;
+ }
+
+ /* plan on adding a backfill skip array for this attribute */
+ numSkipArrayKeys++;
+ attno_skip++;
+ }
+
+ /*
+ * Stop once past the final input scan key. We deliberately never add
+ * a skip attribute for the attribute of the last input scan key.
+ *
+ * If the last input scan key(s) use equality strategy, then a skip
+ * attribute is superfluous at best. If the last input scan key uses
+ * an inequality strategy, then adding a skip scan array/scan key is a
+ * valid though suboptimal transformation. It is better to arrange
+ * for preprocessing to allow such an input inequality scan key to
+ * remain an inequality on output. That way _bt_checkkeys will be
+ * able to make best use of both of its precheck optimizations, but
+ * _bt_first will be no less capable of efficiently finding the
+ * starting position for each primitive index scan.
+ */
+ if (i >= scan->numberOfKeys)
+ break;
+
+ /*
+ * Cannot keep adding skip arrays after a RowCompare
+ */
+ if (attno_has_rowcompare)
+ break;
+
+ /*
+ * Apply temporary testing GUC that can be used to disable skipping
+ * (either in part or in whole)
+ */
+ if (attno_inputsk > skipscan_prefix_cols)
+ break;
+
+ /*
+ * Now consider next attno_inputsk (or keep going if this is an
+ * additional scan key against the same attribute)
+ */
+ if (attno_inputsk < inputsk->sk_attno)
+ {
+ /*
+ * Now add skip array for previous scan key's attribute, though
+ * only if the attribute has no equality strategy scan keys.
+ *
+ * Adding skip arrays to an attribute that has one or more
+ * inequality scan keys will cause preprocessing to output a range
+ * skip array. This will happen when preprocessing proper deals
+ * with the redundancy between the array and its inequalities.
+ */
+ skipatts[attno_skip - 1].eq_op = InvalidOid;
+ if (attno_has_equal)
+ {
+ /* Don't skip, attribute already has an input equality key */
+ }
+ else if (_bt_skipsupport(rel, attno_skip, &skipatts[attno_skip - 1]))
+ {
+ /*
+ * Saw no equalities for the prior attribute, so add a range
+ * skip array for this attribute
+ */
+ numSkipArrayKeys++;
+ }
+ else
+ {
+ /*
+ * Cannot generate a skip array for this or later attributes
+ * (input opclass lacks an equality strategy operator)
+ */
+ return numSkipArrayKeys;
+ }
+
+ /* Set things up for this new attribute */
+ attno_skip++;
+ attno_inputsk = inputsk->sk_attno;
+ attno_has_equal = false;
+ }
+
+ /*
+ * Track if this attribute's scan keys have any equality strategy scan
+ * keys (while counting IS NULL scan keys as equality scan keys)
+ */
+ if (inputsk->sk_strategy == BTEqualStrategyNumber ||
+ (inputsk->sk_flags & SK_SEARCHNULL))
+ attno_has_equal = true;
+
+ /*
+ * We don't support RowCompare transformation. Remember that we saw a
+ * RowCompare, so that we don't keep adding skip attributes. (We may
+ * still backfill skip attributes before the RowCompare, so that it
+ * will be marked required.)
+ */
+ if (inputsk->sk_flags & SK_ROW_HEADER)
+ attno_has_rowcompare = true;
+ }
+
+ return numSkipArrayKeys;
+}
+
+/*
+ * _bt_skipsupport() -- set up skip support function in *skipatts
+ *
+ * Returns true on success, indicating that we set *skipatts with input
+ * opclass's equality operator. Otherwise returns false (only possible for an
+ * index attribute whose input opclass comes from an incomplete opfamily).
+ */
+static bool
+_bt_skipsupport(Relation rel, int add_skip_attno, BTSkipPreproc *skipatts)
+{
+ int16 *indoption = rel->rd_indoption;
+ Oid opfamily = rel->rd_opfamily[add_skip_attno - 1];
+ Oid opcintype = rel->rd_opcintype[add_skip_attno - 1];
+ bool reverse;
+
+ /* Look up input opclass's equality operator (might fail) */
+ skipatts->eq_op = get_opfamily_member(opfamily, opcintype, opcintype,
+ BTEqualStrategyNumber);
+
+ /*
+ * We don't really expect opclasses that lack an equality operator, but
+ * they're supported. Cope with them by having caller not use skip scan.
+ */
+ if (!OidIsValid(skipatts->eq_op))
+ return false;
+
+ /* Have skip support infrastructure set all SkipSupport fields */
+ reverse = (indoption[add_skip_attno - 1] & INDOPTION_DESC) != 0;
+ skipatts->use_sksup = PrepareSkipSupportFromOpclass(opfamily, opcintype,
+ reverse,
+ &skipatts->sksup);
+
+ /* might not have set up skip support routine, but can skip either way */
+ return true;
+}
+
/*
* _bt_setup_array_cmp() -- Set up array comparison functions
*
@@ -987,17 +1325,15 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
FmgrInfo *orderproc, BTArrayKeyInfo *array,
bool *qual_ok)
{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
Oid opcintype = rel->rd_opcintype[arraysk->sk_attno - 1];
- int cmpresult = 0,
- cmpexact = 0,
- matchelem,
- new_nelems = 0;
FmgrInfo crosstypeproc;
FmgrInfo *orderprocp = orderproc;
+ MemoryContext oldContext;
+ bool eliminated;
Assert(arraysk->sk_attno == skey->sk_attno);
- Assert(array->num_elems > 0);
Assert(!(arraysk->sk_flags & (SK_ISNULL | SK_ROW_HEADER | SK_ROW_MEMBER)));
Assert((arraysk->sk_flags & SK_SEARCHARRAY) &&
arraysk->sk_strategy == BTEqualStrategyNumber);
@@ -1010,8 +1346,8 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
* datum of opclass input type for the index's attribute (on-disk type).
* We can reuse the array's ORDER proc whenever the non-array scan key's
* type is a match for the corresponding attribute's input opclass type.
- * Otherwise, we have to do another ORDER proc lookup so that our call to
- * _bt_binsrch_array_skey applies the correct comparator.
+ * Otherwise, we have to do another ORDER proc lookup. We have to be sure
+ * that _bt_compare_array_skey/_bt_binsrch_array_skey use the right proc.
*
* Note: we have to support the convention that sk_subtype == InvalidOid
* means the opclass input type; this is a hack to simplify life for
@@ -1042,11 +1378,65 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
return false;
}
- /* We have all we need to determine redundancy/contradictoriness */
+ /* We successfully looked up the required cross-type ORDER proc */
orderprocp = &crosstypeproc;
fmgr_info(cmp_proc, orderprocp);
}
+ oldContext = MemoryContextSwitchTo(so->arrayContext);
+
+ /*
+ * Perform preprocessing of the array based on whether it's a conventional
+ * array, or a skip array. Sets *qual_ok correctly in passing.
+ */
+ if (array->num_elems != -1)
+ {
+ _bt_array_preproc_shrink(arraysk, skey, orderprocp, array, qual_ok);
+
+ /*
+ * We successfully looked up the required cross-type ORDER proc, which
+ * ensured that the scalar scan key could be eliminated as redundant
+ */
+ eliminated = true;
+ }
+ else
+ {
+ /*
+ * With a skip array it's possible that we won't be able to eliminate
+ * the scalar scan key, despite looking up the required ORDER proc.
+ * This happens when earlier preprocessing wasn't able to eliminate a
+ * redundant scan key inequality due to a lack of cross-type support.
+ */
+ eliminated = _bt_skip_preproc_shrink(scan, arraysk, skey, orderprocp,
+ array, qual_ok);
+ }
+
+ MemoryContextSwitchTo(oldContext);
+
+ return eliminated;
+}
+
+/*
+ * Finish off preprocessing of conventional (non-skip) array scan key when it
+ * is redundant with (or contradicted by) a non-array scalar scan key.
+ * _bt_compare_array_scankey_args helper function, called after the relevant
+ * (potentially cross-type) ORDER proc has been looked up successfully.
+ *
+ * Rewrites caller's array in-place as needed to eliminate redundant array
+ * elements. Calling here always renders caller's scalar scan key redundant.
+ */
+static void
+_bt_array_preproc_shrink(ScanKey arraysk, ScanKey skey, FmgrInfo *orderprocp,
+ BTArrayKeyInfo *array, bool *qual_ok)
+{
+ int cmpresult = 0,
+ cmpexact = 0,
+ matchelem,
+ new_nelems = 0;
+
+ Assert(array->num_elems > 0);
+ Assert(!(arraysk->sk_flags & SK_BT_SKIP));
+
matchelem = _bt_binsrch_array_skey(orderprocp, false,
NoMovementScanDirection,
skey->sk_argument, false, array,
@@ -1098,6 +1488,137 @@ _bt_compare_array_scankey_args(IndexScanDesc scan, ScanKey arraysk, ScanKey skey
array->num_elems = new_nelems;
*qual_ok = new_nelems > 0;
+}
+
+/*
+ * Finish off preprocessing of skip array scan key when it is "redundant with"
+ * a non-array scalar scan key. The scalar scan key must be an inequality.
+ * _bt_compare_array_scankey_args helper function, called after the relevant
+ * (potentially cross-type) ORDER proc has been looked up successfully.
+ *
+ * Unlike _bt_array_preproc_shrink, we cannot really modify caller's array
+ * in-place. Skip arrays work by procedurally generating their elements as
+ * needed, so our approach is to store a copy of the inequality in the skip
+ * array, allowing its elements to be generated within the limits of a range.
+ * Calling here always renders caller's scalar scan key redundant (the key is
+ * applied when the array advances, but that's just an implementation detail).
+ *
+ * Return value indicates if the array already had a lower/upper bound
+ * (whichever caller's scalar scan key was expected to be). We return true in
+ * the common case where caller's scan key could be successfully rolled into
+ * the skip array. We return false when we can't do that due to the presence
+ * of a conflicting inequality.
+ */
+static bool
+_bt_skip_preproc_shrink(IndexScanDesc scan, ScanKey arraysk, ScanKey skey,
+ FmgrInfo *orderprocp, BTArrayKeyInfo *array,
+ bool *qual_ok)
+{
+ bool test_result;
+
+ /*
+ * We don't expect to have to deal with NULLs in non-array/non-skip scan
+ * key. We expect _bt_preprocess_array_keys to avoid generating a skip
+ * array for an index attribute with an IS NULL input scan key. (It will
+ * still do so in the presence of IS NOT NULL input scan keys, but
+ * _bt_compare_scankey_args is expected to handle those for us.)
+ */
+ Assert(arraysk->sk_flags & SK_BT_SKIP);
+ Assert(arraysk->sk_flags & SK_SEARCHARRAY);
+ Assert(arraysk->sk_strategy == BTEqualStrategyNumber);
+ Assert(array->num_elems == -1);
+
+ /* Scalar scan key must be a B-Tree inequality, which are always strict */
+ Assert(!(skey->sk_flags & SK_ISNULL));
+ Assert(skey->sk_strategy != BTEqualStrategyNumber);
+
+ /*
+ * Array must not generate a NULL array element (for "IS NULL" qual). Its
+ * index attribute is constrained by a strict operator, so NULL elements
+ * must not be returned by the scan (it would be wrong to allow it).
+ */
+ array->null_elem = false;
+ *qual_ok = true;
+
+ /*
+ * Store a copy of caller's scalar scan key, plus a copy of the operator's
+ * corresponding 3-way ORDER proc.
+ *
+ * A skip array scan key always uses the underlying index attribute's
+ * input opclass, but it's possible that caller's scalar scan key uses a
+ * cross-type operator. In cross-type scenarios, skey.sk_argument doesn't
+ * use the same type as later array elements (which are all just copies of
+ * datums taken from index tuples, possibly modified by skip support).
+ *
+ * We represent the lowest (and highest) possible value in the array using
+ * the sentinel value -inf (+inf for high_compare). The only exceptions
+ * apply when the opclass has skip support: there we can use a copy of the
+ * skip support routine's low_elem/high_elem instead -- though only when
+ * there is no corresponding low_compare/high_compare inequality.
+ *
+ * _bt_first understands that -inf/+inf indicate that it should use the
+ * low_compare/high_compare inequality for initial positioning purposes
+ * when it sees either value (unless there is no corresponding inequality,
+ * in which case the values are literally interpreted as -inf or +inf).
+ * _bt_first can therefore vary in whether it uses a cross-type operator,
+ * or an input-opclass-only operator (it can vary across primitive scans
+ * for the same index attribute/skip array).
+ *
+ * _bt_scankey_decrement/_bt_scankey_increment both make sure that each
+ * newly generated element is constrained by low_compare/high_compare.
+ * This must happen without skey.sk_argument ever being treated as a true
+ * array element (that wouldn't always work because array elements are
+ * only ever supposed to use the opclass input type).
+ */
+ switch (skey->sk_strategy)
+ {
+ case BTLessStrategyNumber:
+ case BTLessEqualStrategyNumber:
+ if (array->high_compare)
+ {
+ /* try to keep only one high_compare inequality */
+ if (!_bt_compare_scankey_args(scan, array->high_compare, skey,
+ array->high_compare, NULL, NULL,
+ &test_result))
+ return false; /* can't make new high_compare redundant */
+
+ if (!test_result)
+ return true; /* discard new high_compare */
+
+ /* replace old high_compare with new one */
+ }
+ else
+ array->high_compare = palloc(sizeof(ScanKeyData));
+
+ memcpy(array->high_compare, skey, sizeof(ScanKeyData));
+ array->order_high = *orderprocp;
+ break;
+ case BTGreaterEqualStrategyNumber:
+ case BTGreaterStrategyNumber:
+ if (array->low_compare)
+ {
+ /* try to keep only one low_compare inequality */
+ if (!_bt_compare_scankey_args(scan, array->low_compare, skey,
+ array->low_compare, NULL, NULL,
+ &test_result))
+ return false; /* can't make new low_compare redundant */
+
+ if (!test_result)
+ return true; /* discard new low_compare */
+
+ /* replace old low_compare with new one */
+ }
+ else
+ array->low_compare = palloc(sizeof(ScanKeyData));
+
+ memcpy(array->low_compare, skey, sizeof(ScanKeyData));
+ array->order_low = *orderprocp;
+ break;
+ default:
+ elog(ERROR, "unrecognized StrategyNumber: %d",
+ (int) skey->sk_strategy);
+ break;
+ }
return true;
}
@@ -1140,7 +1661,8 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
static inline int32
_bt_compare_array_skey(FmgrInfo *orderproc,
Datum tupdatum, bool tupnull,
- Datum arrdatum, ScanKey cur)
+ Datum arrdatum, bool arrnull,
+ ScanKey cur)
{
int32 result = 0;
@@ -1148,14 +1670,14 @@ _bt_compare_array_skey(FmgrInfo *orderproc,
if (tupnull) /* NULL tupdatum */
{
- if (cur->sk_flags & SK_ISNULL)
+ if (arrnull)
result = 0; /* NULL "=" NULL */
else if (cur->sk_flags & SK_BT_NULLS_FIRST)
result = -1; /* NULL "<" NOT_NULL */
else
result = 1; /* NULL ">" NOT_NULL */
}
- else if (cur->sk_flags & SK_ISNULL) /* NOT_NULL tupdatum, NULL arrdatum */
+ else if (arrnull) /* NOT_NULL tupdatum, NULL arrdatum */
{
if (cur->sk_flags & SK_BT_NULLS_FIRST)
result = 1; /* NOT_NULL ">" NULL */
@@ -1221,6 +1743,8 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
Datum arrdatum;
Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(!(cur->sk_flags & SK_BT_SKIP));
+ Assert(!(cur->sk_flags & SK_ISNULL)); /* plain arrays can't do this */
Assert(cur->sk_strategy == BTEqualStrategyNumber);
if (cur_elem_trig)
@@ -1256,7 +1780,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
{
arrdatum = array->elem_values[low_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result <= 0)
{
@@ -1284,7 +1808,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
{
arrdatum = array->elem_values[high_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result >= 0)
{
@@ -1311,7 +1835,7 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
arrdatum = array->elem_values[mid_elem];
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- arrdatum, cur);
+ arrdatum, false, cur);
if (result == 0)
{
@@ -1336,13 +1860,102 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
*/
if (low_elem != mid_elem)
result = _bt_compare_array_skey(orderproc, tupdatum, tupnull,
- array->elem_values[low_elem], cur);
+ array->elem_values[low_elem], false,
+ cur);
*set_elem_result = result;
return low_elem;
}
+/*
+ * _bt_binsrch_skiparray_skey() -- "Binary search" within a skip array
+ *
+ * This routine doesn't return an index into the array, because the array
+ * doesn't actually have any elements (it generates its array elements
+ * procedurally instead). Note that this may include a NULL value/an IS NULL
+ * qual.
+ *
+ * Sets *set_elem_result just like _bt_binsrch_array_skey would with a true
+ * array. The value 0 indicates that tupdatum/tupnull is within the range of
+ * the skip array. Other values indicate what _bt_compare_array_skey returned
+ * for the best available match to tupdatum/tupnull (in practice this means
+ * either the lowest item or the highest item in the range of the array).
+ *
+ * cur_elem_trig indicates if array advancement was triggered by this array's
+ * scan key. We use this to optimize-away comparisons that are known by our
+ * caller to be unnecessary from context, just like _bt_binsrch_array_skey.
+ */
+static void
+_bt_binsrch_skiparray_skey(FmgrInfo *orderproc,
+ bool cur_elem_trig, ScanDirection dir,
+ Datum tupdatum, bool tupnull,
+ BTArrayKeyInfo *array, ScanKey cur,
+ int32 *set_elem_result)
+{
+ Assert(cur->sk_flags & SK_BT_SKIP);
+ Assert(cur->sk_flags & SK_SEARCHARRAY);
+ Assert(cur->sk_flags & SK_BT_REQFWD);
+ Assert(array->num_elems == -1);
+ Assert(!ScanDirectionIsNoMovement(dir));
+
+ if (tupnull) /* NULL tupdatum */
+ {
+ if (array->null_elem)
+ *set_elem_result = 0; /* NULL "=" NULL */
+ else if (cur->sk_flags & SK_BT_NULLS_FIRST)
+ *set_elem_result = -1; /* NULL "<" NOT_NULL */
+ else
+ *set_elem_result = 1; /* NULL ">" NOT_NULL */
+
+ return;
+ }
+
+ /*
+ * Array inequalities determine whether tupdatum is within the range of
+ * caller's skip array
+ */
+ *set_elem_result = 0;
+ if (ScanDirectionIsForward(dir))
+ {
+ /*
+ * Evaluate low_compare first (unless cur_elem_trig tells us that it
+ * cannot possibly fail to be satisfied), then evaluate high_compare
+ */
+ if (!cur_elem_trig && array->low_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->low_compare->sk_func,
+ array->low_compare->sk_collation,
+ tupdatum,
+ array->low_compare->sk_argument)))
+ *set_elem_result = -1;
+ else if (array->high_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->high_compare->sk_func,
+ array->high_compare->sk_collation,
+ tupdatum,
+ array->high_compare->sk_argument)))
+ *set_elem_result = 1;
+ }
+ else
+ {
+ /*
+ * Evaluate high_compare first (unless cur_elem_trig tells us that it
+ * cannot possibly fail to be satisfied), then evaluate low_compare
+ */
+ if (!cur_elem_trig && array->high_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->high_compare->sk_func,
+ array->high_compare->sk_collation,
+ tupdatum,
+ array->high_compare->sk_argument)))
+ *set_elem_result = 1;
+ else if (array->low_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->low_compare->sk_func,
+ array->low_compare->sk_collation,
+ tupdatum,
+ array->low_compare->sk_argument)))
+ *set_elem_result = -1;
+ }
+}
+
/*
* _bt_start_array_keys() -- Initialize array keys at start of a scan
*
@@ -1352,29 +1965,486 @@ _bt_binsrch_array_skey(FmgrInfo *orderproc,
void
_bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
- int i;
Assert(so->numArrayKeys);
Assert(so->qual_ok);
- for (i = 0; i < so->numArrayKeys; i++)
+ for (int i = 0; i < so->numArrayKeys; i++)
{
BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
ScanKey skey = &so->keyData[curArrayKey->scan_key];
- Assert(curArrayKey->num_elems > 0);
Assert(skey->sk_flags & SK_SEARCHARRAY);
- if (ScanDirectionIsBackward(dir))
- curArrayKey->cur_elem = curArrayKey->num_elems - 1;
- else
- curArrayKey->cur_elem = 0;
- skey->sk_argument = curArrayKey->elem_values[curArrayKey->cur_elem];
+ _bt_scankey_set_low_or_high(rel, skey, curArrayKey,
+ ScanDirectionIsForward(dir));
}
so->scanBehind = so->oppoDirCheck = false; /* reset */
}
+/*
+ * _bt_scankey_set_low_or_high() -- Set array scan key to lowest/highest element
+ *
+ * Caller also passes associated scan key, which will have its argument set to
+ * the lowest/highest array value in passing.
+ */
+static void
+_bt_scankey_set_low_or_high(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ bool low_not_high)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+
+ if (array->num_elems != -1)
+ {
+ /* set low or high element for conventional array */
+ int set_elem = 0;
+
+ Assert(!(skey->sk_flags & SK_BT_SKIP));
+
+ if (!low_not_high)
+ set_elem = array->num_elems - 1;
+
+ /*
+ * Just copy over array datum (only skip arrays require freeing and
+ * allocating memory for sk_argument)
+ */
+ array->cur_elem = set_elem;
+ skey->sk_argument = array->elem_values[set_elem];
+
+ return;
+ }
+
+ /* set low or high element for skip array */
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(array->num_elems == -1);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ /* Clear possibly-irrelevant flags */
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL |
+ SK_BT_NEGPOSINF | SK_BT_NEXTPRIOR);
+
+ if (array->null_elem &&
+ (low_not_high == ((skey->sk_flags & SK_BT_NULLS_FIRST) != 0)))
+ {
+ /* Lowest (or highest) element is NULL, so set scan key to NULL */
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+ }
+ else if (low_not_high)
+ {
+ /* Lowest array element isn't NULL */
+ if (array->use_sksup && !array->low_compare)
+ skey->sk_argument = datumCopy(array->sksup.low_elem,
+ attr->attbyval, attr->attlen);
+ else
+ skey->sk_flags |= SK_BT_NEGPOSINF;
+ }
+ else
+ {
+ /* Highest array element isn't NULL */
+ if (array->use_sksup && !array->high_compare)
+ skey->sk_argument = datumCopy(array->sksup.high_elem,
+ attr->attbyval, attr->attlen);
+ else
+ skey->sk_flags |= SK_BT_NEGPOSINF;
+ }
+}
+
+/*
+ * _bt_scankey_set_element() -- Set skip array scan key's sk_argument
+ *
+ * Sets scan key to "IS NULL" when required, and handles memory management for
+ * pass-by-reference types.
+ */
+static void
+_bt_scankey_set_element(Relation rel, ScanKey skey, BTArrayKeyInfo *array,
+ Datum tupdatum, bool tupnull)
+{
+ /* tupdatum within the range of low_value/high_value */
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(!(tupnull && !array->null_elem));
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL |
+ SK_BT_NEGPOSINF | SK_BT_NEXTPRIOR);
+ if (!tupnull)
+ skey->sk_argument = datumCopy(tupdatum, attr->attbyval, attr->attlen);
+ else
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+}
+
+/*
+ * _bt_scankey_unset_isnull() -- increment/decrement scan key from NULL
+ *
+ * Unsets scan key's "IS NULL" marking, and sets the non-NULL value from the
+ * array immediately before (or immediate after) NULL in the key space.
+ */
+static void
+_bt_scankey_unset_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(skey->sk_flags & SK_SEARCHNULL);
+ Assert(skey->sk_flags & SK_ISNULL);
+ Assert(!(skey->sk_flags & (SK_BT_NEGPOSINF | SK_BT_NEXTPRIOR)));
+ Assert(skey->sk_argument == 0);
+ Assert(array->use_sksup && array->null_elem &&
+ !array->low_compare && !array->high_compare);
+
+ /*
+ * sk_argument must be set to whatever non-NULL value comes immediately
+ * before or after NULL
+ */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ skey->sk_flags &= ~(SK_SEARCHNULL | SK_ISNULL);
+ if (skey->sk_flags & SK_BT_NULLS_FIRST)
+ skey->sk_argument = datumCopy(array->sksup.low_elem,
+ attr->attbyval, attr->attlen);
+ else
+ skey->sk_argument = datumCopy(array->sksup.high_elem,
+ attr->attbyval, attr->attlen);
+}
+
+/*
+ * _bt_scankey_set_isnull() -- decrement/increment scan key to NULL
+ */
+static void
+_bt_scankey_set_isnull(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_BT_SKIP);
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(!(skey->sk_flags & (SK_SEARCHNULL | SK_ISNULL |
+ SK_BT_NEGPOSINF | SK_BT_NEXTPRIOR)));
+ Assert(array->null_elem);
+ Assert(!array->low_compare && !array->high_compare);
+
+ /* Free memory previously allocated for sk_argument if needed */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+
+ /* Set sk_argument to NULL */
+ skey->sk_argument = (Datum) 0;
+ skey->sk_flags |= (SK_SEARCHNULL | SK_ISNULL);
+}
+
+/*
+ * _bt_scankey_decrement() -- decrement array scan key's sk_argument
+ *
+ * Return value indicates whether caller's array was successfully decremented.
+ * Cannot decrement an array whose current element is already the first one.
+ */
+static bool
+_bt_scankey_decrement(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ bool underflow = false;
+ Datum dec_sk_argument;
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(!(skey->sk_flags & SK_BT_NEXTPRIOR));
+
+ /* Regular (non-skip) array? */
+ if (array->num_elems != -1)
+ {
+ Assert(!(skey->sk_flags & SK_BT_SKIP));
+ if (array->cur_elem > 0)
+ {
+ /*
+ * Just copy over array datum (only skip arrays require freeing
+ * and allocating memory for sk_argument)
+ */
+ array->cur_elem--;
+ skey->sk_argument = array->elem_values[array->cur_elem];
+
+ /* Successfully decremented array */
+ return true;
+ }
+
+ /* Cannot decrement to before first array element */
+ return false;
+ }
+
+ /* Nope, this is a skip array */
+ Assert(skey->sk_flags & SK_BT_SKIP);
+
+ /* The sentinel value -inf is never decrementable */
+ if (skey->sk_flags & SK_BT_NEGPOSINF)
+ return false;
+
+ /*
+ * When the current array element is NULL, and the lowest sorting value in
+ * the index is also NULL, we cannot decrement before first array element
+ */
+ if ((skey->sk_flags & SK_ISNULL) && (skey->sk_flags & SK_BT_NULLS_FIRST))
+ return false;
+
+ /*
+ * Opclasses without skip support "decrement" the scan key's current
+ * element by setting the NEXTPRIOR flag. The true prior value can only
+ * be determined when the scan reads lower sorting tuples.
+ *
+ * When the current array element is NULL, and the highest sorting value
+ * in the index is also NULL, _bt_first can find the highest non-NULL.
+ */
+ if (!array->use_sksup)
+ {
+ /*
+ * Determine as best we can (given the lack of skip support) whether
+ * the prior element will turn out to be out of bounds for the skip
+ * array.
+ *
+ * Skip arrays (that lack skip support) can only do this when their
+ * low_compare is for an >= inequality; if the current array element
+ * is == the inequality's sk_argument, then the true prior value
+ * cannot possibly satisfy low_compare. We can give up right away.
+ */
+ if (array->low_compare &&
+ array->low_compare->sk_strategy == BTGreaterEqualStrategyNumber &&
+ _bt_compare_array_skey(&array->order_low,
+ array->low_compare->sk_argument, false,
+ skey->sk_argument, false,
+ skey) == 0)
+ return false;
+
+ /* else the scan must figure out the true prior value */
+ skey->sk_flags |= SK_BT_NEXTPRIOR;
+ return true;
+ }
+
+ /*
+ * Opclasses with skip support decrement the scan key's current element
+ * using a callback
+ */
+ if (skey->sk_flags & SK_ISNULL)
+ {
+ Assert(!(skey->sk_flags & SK_BT_NULLS_FIRST));
+
+ /*
+ * Existing sk_argument/array element is NULL (for an IS NULL qual).
+ *
+ * Decrement current array element to the high_elem value provided by
+ * opclass skip support routine.
+ */
+ _bt_scankey_unset_isnull(rel, skey, array);
+ return true;
+ }
+
+ /*
+ * Ask opclass support routine to provide decremented copy of existing
+ * non-NULL sk_argument
+ */
+ dec_sk_argument = array->sksup.decrement(rel, skey->sk_argument, &underflow);
+
+ if (underflow)
+ {
+ /* dec_sk_argument has undefined value due to underflow */
+ if (array->null_elem && (skey->sk_flags & SK_BT_NULLS_FIRST))
+ {
+ /* "Decrement" sk_argument to NULL */
+ _bt_scankey_set_isnull(rel, skey, array);
+ return true;
+ }
+
+ /* Cannot decrement before first array element */
+ return false;
+ }
+
+ /*
+ * Successfully decremented sk_argument to a non-NULL value. Make sure
+ * that the decremented value is still within the range of the skip array.
+ */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (array->low_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->low_compare->sk_func,
+ array->low_compare->sk_collation,
+ dec_sk_argument,
+ array->low_compare->sk_argument)))
+ {
+ /* Keep existing sk_argument after all */
+ if (!attr->attbyval)
+ pfree(DatumGetPointer(dec_sk_argument));
+
+ /* Cannot decrement before first array element */
+ return false;
+ }
+
+ /* Accept non-NULL datum value from opclass decrement callback */
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+ skey->sk_argument = dec_sk_argument;
+
+ return true;
+}
+
+/*
+ * _bt_scankey_increment() -- increment array scan key's sk_argument
+ *
+ * Return value indicates whether caller's array was successfully incremented.
+ * Cannot increment an array whose current element is already the final one.
+ */
+static bool
+_bt_scankey_increment(Relation rel, ScanKey skey, BTArrayKeyInfo *array)
+{
+ bool overflow = false;
+ Datum inc_sk_argument;
+ Form_pg_attribute attr;
+
+ Assert(skey->sk_flags & SK_SEARCHARRAY);
+ Assert(!(skey->sk_flags & SK_BT_NEXTPRIOR));
+
+ /* Regular (non-skip) array? */
+ if (array->num_elems != -1)
+ {
+ Assert(!(skey->sk_flags & SK_BT_SKIP));
+ if (array->cur_elem < array->num_elems - 1)
+ {
+ /*
+ * Just copy over array datum (only skip arrays require freeing
+ * and allocating memory for sk_argument)
+ */
+ array->cur_elem++;
+ skey->sk_argument = array->elem_values[array->cur_elem];
+
+ /* Successfully incremented array */
+ return true;
+ }
+
+ /* Cannot increment past final array element */
+ return false;
+ }
+
+ /* Nope, this is a skip array */
+ Assert(skey->sk_flags & SK_BT_SKIP);
+
+ /* The sentinel value +inf is never incrementable */
+ if (skey->sk_flags & SK_BT_NEGPOSINF)
+ return false;
+
+ /*
+ * When the current array element is NULL, and the highest sorting value
+ * in the index is also NULL, we cannot increment past the final element
+ */
+ if ((skey->sk_flags & SK_ISNULL) && !(skey->sk_flags & SK_BT_NULLS_FIRST))
+ return false;
+
+ /*
+ * Opclasses without skip support "increment" the scan key's current
+ * element by setting the NEXTPRIOR flag. The true next value can only be
+ * determined when the scan reads higher sorting tuples.
+ *
+ * When the current array element is NULL, and the lowest sorting value in
+ * the index is also NULL, _bt_first can find the lowest non-NULL.
+ */
+ if (!array->use_sksup)
+ {
+ /*
+ * Determine as best we can (given the lack of skip support) whether
+ * the next element will turn out to be out of bounds for the skip
+ * array.
+ *
+ * Skip arrays (that lack skip support) can only do this when their
+ * high_compare is for an <= inequality; if the current array element
+ * is == the inequality's sk_argument, then the true next value cannot
+ * possibly satisfy high_compare. We can give up right away.
+ */
+ if (array->high_compare &&
+ array->high_compare->sk_strategy == BTLessEqualStrategyNumber &&
+ _bt_compare_array_skey(&array->order_high,
+ array->high_compare->sk_argument, false,
+ skey->sk_argument, false,
+ skey) == 0)
+ return false;
+
+ /* else the scan must figure out the true next value */
+ skey->sk_flags |= SK_BT_NEXTPRIOR;
+ return true;
+ }
+
+ /*
+ * Opclasses with skip support increment the scan key's current element
+ * using a callback
+ */
+ if (skey->sk_flags & SK_ISNULL)
+ {
+ Assert(skey->sk_flags & SK_BT_NULLS_FIRST);
+
+ /*
+ * Existing sk_argument/array element is NULL (for an IS NULL qual).
+ *
+ * Increment current array element to the low_elem value provided by
+ * opclass skip support routine.
+ */
+ _bt_scankey_unset_isnull(rel, skey, array);
+ return true;
+ }
+
+ /*
+ * Ask opclass support routine to provide incremented copy of existing
+ * non-NULL sk_argument
+ */
+ inc_sk_argument = array->sksup.increment(rel, skey->sk_argument, &overflow);
+
+ if (overflow)
+ {
+ /* inc_sk_argument has undefined value due to overflow */
+ if (array->null_elem && !(skey->sk_flags & SK_BT_NULLS_FIRST))
+ {
+ /* "Increment" sk_argument to NULL */
+ _bt_scankey_set_isnull(rel, skey, array);
+ return true;
+ }
+
+ /* Cannot increment past final array element */
+ return false;
+ }
+
+ /*
+ * Successfully incremented sk_argument to a non-NULL value. Make sure
+ * that the incremented value is still within the range of the skip array.
+ */
+ attr = TupleDescAttr(RelationGetDescr(rel), skey->sk_attno - 1);
+ if (array->high_compare &&
+ !DatumGetBool(FunctionCall2Coll(&array->high_compare->sk_func,
+ array->high_compare->sk_collation,
+ inc_sk_argument,
+ array->high_compare->sk_argument)))
+ {
+ /* Keep existing sk_argument after all */
+ if (!attr->attbyval)
+ pfree(DatumGetPointer(inc_sk_argument));
+
+ /* Cannot increment past final array element */
+ return false;
+ }
+
+ /* Accept non-NULL datum value from opclass increment callback */
+ if (!attr->attbyval && skey->sk_argument)
+ pfree(DatumGetPointer(skey->sk_argument));
+ skey->sk_argument = inc_sk_argument;
+
+ return true;
+}
+
/*
* _bt_advance_array_keys_increment() -- Advance to next set of array elements
*
@@ -1390,6 +2460,7 @@ _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir)
static bool
_bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
/*
@@ -1399,29 +2470,30 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
*/
for (int i = so->numArrayKeys - 1; i >= 0; i--)
{
- BTArrayKeyInfo *curArrayKey = &so->arrayKeys[i];
- ScanKey skey = &so->keyData[curArrayKey->scan_key];
- int cur_elem = curArrayKey->cur_elem;
- int num_elems = curArrayKey->num_elems;
- bool rolled = false;
+ BTArrayKeyInfo *array = &so->arrayKeys[i];
+ ScanKey skey = &so->keyData[array->scan_key];
- if (ScanDirectionIsForward(dir) && ++cur_elem >= num_elems)
+ if (ScanDirectionIsForward(dir))
{
- cur_elem = 0;
- rolled = true;
+ if (_bt_scankey_increment(rel, skey, array))
+ return true;
}
- else if (ScanDirectionIsBackward(dir) && --cur_elem < 0)
+ else
{
- cur_elem = num_elems - 1;
- rolled = true;
+ if (_bt_scankey_decrement(rel, skey, array))
+ return true;
}
- curArrayKey->cur_elem = cur_elem;
- skey->sk_argument = curArrayKey->elem_values[cur_elem];
- if (!rolled)
- return true;
+ /*
+ * Couldn't increment (or decrement) array. Handle array roll over.
+ *
+ * Start over at the array's lowest sorting value (or its highest
+ * value, for backward scans)...
+ */
+ _bt_scankey_set_low_or_high(rel, skey, array,
+ ScanDirectionIsForward(dir));
- /* Need to advance next array key, if any */
+ /* ...then increment (or decrement) next most significant array */
}
/*
@@ -1476,6 +2548,7 @@ _bt_advance_array_keys_increment(IndexScanDesc scan, ScanDirection dir)
static void
_bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
{
+ Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
int arrayidx = 0;
@@ -1483,7 +2556,6 @@ _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
{
ScanKey cur = so->keyData + ikey;
BTArrayKeyInfo *array = NULL;
- int first_elem_dir;
if (!(cur->sk_flags & SK_SEARCHARRAY) ||
cur->sk_strategy != BTEqualStrategyNumber)
@@ -1495,16 +2567,10 @@ _bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
if ((cur->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
continue;
- if (ScanDirectionIsForward(dir))
- first_elem_dir = 0;
- else
- first_elem_dir = array->num_elems - 1;
+ Assert(array->num_elems != -1); /* No non-required skip arrays */
- if (array->cur_elem != first_elem_dir)
- {
- array->cur_elem = first_elem_dir;
- cur->sk_argument = array->elem_values[first_elem_dir];
- }
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsForward(dir));
}
}
@@ -1568,6 +2634,8 @@ _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
for (int ikey = sktrig; ikey < so->numberOfKeys; ikey++)
{
ScanKey cur = so->keyData + ikey;
+ Datum sk_argument = cur->sk_argument;
+ bool sk_isnull = (cur->sk_flags & SK_ISNULL) != 0;
Datum tupdatum;
bool tupnull;
int32 result;
@@ -1629,9 +2697,66 @@ _bt_tuple_before_array_skeys(IndexScanDesc scan, ScanDirection dir,
tupdatum = index_getattr(tuple, cur->sk_attno, tupdesc, &tupnull);
- result = _bt_compare_array_skey(&so->orderProcs[ikey],
- tupdatum, tupnull,
- cur->sk_argument, cur);
+ if (!(cur->sk_flags & SK_BT_NEGPOSINF))
+ {
+ /* The scankey has a conventional sk_argument/element value */
+ result = _bt_compare_array_skey(&so->orderProcs[ikey],
+ tupdatum, tupnull,
+ sk_argument, sk_isnull, cur);
+
+ /*
+ * When scan key is marked NEXTPRIOR, the current array element is
+ * "sk_argument + infinitesimal" (or the current array element is
+ * "sk_argument - infinitesimal", during backwards scans)
+ */
+ if (result == 0 && (cur->sk_flags & SK_BT_NEXTPRIOR))
+ {
+ /*
+ * tupdatum is actually still < "sk_argument + infinitesimal"
+ * (or it's actually still > "sk_argument - infinitesimal")
+ */
+ return true;
+ }
+ }
+ else
+ {
+ /*
+ * The scankey searches for the sentinel value -inf/+inf.
+ *
+ * Note: -inf could mean "absolute" -inf, or it could represent
+ * the lowest possible value that still satisfies the array's
+ * low_compare. +inf and high_compare work similarly.
+ */
+ BTArrayKeyInfo *array = NULL;
+
+ for (int arrayidx = 0; arrayidx < so->numArrayKeys; arrayidx++)
+ {
+ array = &so->arrayKeys[arrayidx];
+ if (array->scan_key == ikey)
+ break;
+ }
+
+ /*
+ * Compare tupdatum against -inf using array's low_compare, if any
+ * (or compare it against +inf using array's high_compare).
+ *
+ * Optimization: avoid uselessly evaluating array's high_compare
+ * (or uselessly evaluating array's low_compare) by passing
+ * cur_elem_trig=true, along with an inverted scan direction.
+ */
+ _bt_binsrch_skiparray_skey(&so->orderProcs[ikey], true, -dir,
+ tupdatum, tupnull, array, cur,
+ &result);
+
+ if (result == 0)
+ {
+ /*
+ * tupdatum is > -inf sk_argument (or < +inf sk_argument).
+ * It's time for caller to advance the scan's array keys.
+ */
+ return false;
+ }
+ }
/*
* Does this comparison indicate that caller must _not_ advance the
@@ -1963,18 +3088,9 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
if (beyond_end_advance)
{
- int final_elem_dir;
-
- if (ScanDirectionIsBackward(dir) || !array)
- final_elem_dir = 0;
- else
- final_elem_dir = array->num_elems - 1;
-
- if (array && array->cur_elem != final_elem_dir)
- {
- array->cur_elem = final_elem_dir;
- cur->sk_argument = array->elem_values[final_elem_dir];
- }
+ if (array)
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsBackward(dir));
continue;
}
@@ -1999,18 +3115,9 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
if (!all_required_satisfied || cur->sk_attno > tupnatts)
{
- int first_elem_dir;
-
- if (ScanDirectionIsForward(dir) || !array)
- first_elem_dir = 0;
- else
- first_elem_dir = array->num_elems - 1;
-
- if (array && array->cur_elem != first_elem_dir)
- {
- array->cur_elem = first_elem_dir;
- cur->sk_argument = array->elem_values[first_elem_dir];
- }
+ if (array)
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsForward(dir));
continue;
}
@@ -2028,15 +3135,27 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
/*
* Binary search for closest match that's available from the array
*/
- set_elem = _bt_binsrch_array_skey(&so->orderProcs[ikey],
- cur_elem_trig, dir,
- tupdatum, tupnull, array, cur,
- &result);
+ if (array->num_elems != -1)
+ set_elem = _bt_binsrch_array_skey(&so->orderProcs[ikey],
+ cur_elem_trig, dir,
+ tupdatum, tupnull, array, cur,
+ &result);
- Assert(set_elem >= 0 && set_elem < array->num_elems);
+ /*
+ * "Binary search" by checking if tupdatum/tupnull are within the
+ * range of the skip array
+ */
+ else
+ _bt_binsrch_skiparray_skey(&so->orderProcs[ikey],
+ cur_elem_trig, dir,
+ tupdatum, tupnull, array, cur,
+ &result);
}
else
{
+ Datum sk_argument = cur->sk_argument;
+ bool sk_isnull = (cur->sk_flags & SK_ISNULL) != 0;
+
Assert(sktrig_required && required);
/*
@@ -2050,7 +3169,7 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
*/
result = _bt_compare_array_skey(&so->orderProcs[ikey],
tupdatum, tupnull,
- cur->sk_argument, cur);
+ sk_argument, sk_isnull, cur);
}
/*
@@ -2109,11 +3228,65 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
}
}
- /* Advance array keys, even when set_elem isn't an exact match */
- if (array && array->cur_elem != set_elem)
+ if (!array)
+ continue; /* cannot advance a non-array */
+
+ /* Advance array keys, even when we don't have an exact match */
+ if (array->num_elems != -1)
{
- array->cur_elem = set_elem;
- cur->sk_argument = array->elem_values[set_elem];
+ /* Conventional array, use set_elem... */
+ if (array->cur_elem != set_elem)
+ {
+ array->cur_elem = set_elem;
+ cur->sk_argument = array->elem_values[set_elem];
+ }
+
+ continue;
+ }
+
+ /*
+ * ...or skip array, which doesn't advance using a set_elem offset.
+ *
+ * Array "contains" elements for every possible datum from a given
+ * range of values. This is often the range -inf through to +inf.
+ * "Binary searching" a skip array only determines whether tupdatum is
+ * beyond its range, before its range, or within its range.
+ *
+ * Note: conventional arrays cannot use this approach. They need
+ * "beyond end of array element" advancement to distinguish between
+ * the final array element (where incremental advancement rolls over
+ * to the next most significant array), and some earlier array element
+ * (where incremental advancement just increments set_elem/cur_elem).
+ * That distinction doesn't exist when dealing with range skip arrays.
+ */
+ if (beyond_end_advance)
+ {
+ /*
+ * tupdatum/tupnull is > the skip array's "final element"
+ * (tupdatum/tupnull is < the "first element" for backwards scans)
+ */
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsBackward(dir));
+ }
+ else if (!all_required_satisfied)
+ {
+ /*
+ * tupdatum/tupnull is < the skip array's "first element"
+ * (tupdatum/tupnull is > the "final element" for backwards scans)
+ */
+ Assert(sktrig < ikey); /* check on _bt_tuple_before_array_skeys */
+ _bt_scankey_set_low_or_high(rel, cur, array,
+ ScanDirectionIsForward(dir));
+ }
+ else
+ {
+ /*
+ * tupdatum/tupnull is == some particular skip array element.
+ *
+ * Set scan key's sk_argument to tupdatum. If tupdatum is null,
+ * we'll set IS NULL flags in scan key's sk_flags instead.
+ */
+ _bt_scankey_set_element(rel, cur, array, tupdatum, tupnull);
}
}
@@ -2464,6 +3637,8 @@ end_toplevel_scan:
* within each attribute may be done as a byproduct of the processing here.
* That process must leave array scan keys (within an attribute) in the same
* order as corresponding entries from the scan's BTArrayKeyInfo array info.
+ * We might also cons up skip array scan keys that weren't present in the
+ * original input keys; these are also output in standard attribute order.
*
* The output keys are marked with flags SK_BT_REQFWD and/or SK_BT_REQBKWD
* if they must be satisfied in order to continue the scan forward or backward
@@ -2587,8 +3762,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
inputsk = scan->keyData;
/*
- * Now that we have an estimate of the number of output scan keys,
- * allocate space for them
+ * Now that we have an estimate of the number of output scan keys
+ * (including any skip array scan keys), allocate space for them
*/
so->keyData = palloc(sizeof(ScanKeyData) * numberOfKeys);
@@ -2724,7 +3899,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
return;
}
/* else discard the redundant non-equality key */
- Assert(!array || array->num_elems > 0);
+ Assert(!array || array->num_elems > 0 ||
+ array->num_elems == -1);
xform[j].skey = NULL;
xform[j].ikey = -1;
}
@@ -2887,7 +4063,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
/* Have all we need to determine redundancy */
if (test_result)
{
- Assert(!array || array->num_elems > 0);
+ Assert(!array || array->num_elems > 0 ||
+ array->num_elems == -1);
/*
* New key is more restrictive, and so replaces old key...
@@ -3029,10 +4206,11 @@ _bt_verify_keys_with_arraykeys(IndexScanDesc scan)
if (array->scan_key != ikey)
return false;
- if (array->num_elems <= 0)
+ if (array->num_elems == 0 || array->num_elems < -1)
return false;
- if (cur->sk_argument != array->elem_values[array->cur_elem])
+ if (array->num_elems != -1 &&
+ cur->sk_argument != array->elem_values[array->cur_elem])
return false;
if (last_sk_attno > cur->sk_attno)
return false;
@@ -3107,6 +4285,22 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
bool leftnull,
rightnull;
+ /* Handle skip array comparison with IS NOT NULL scan key */
+ if ((leftarg->sk_flags | rightarg->sk_flags) & SK_BT_SKIP)
+ {
+ /* Shouldn't generate skip array in presence of IS NULL key */
+ Assert(!((leftarg->sk_flags | rightarg->sk_flags) & SK_SEARCHNULL));
+ Assert((leftarg->sk_flags | rightarg->sk_flags) & SK_SEARCHNOTNULL);
+
+ /* Skip array will have no NULL element/IS NULL scan key */
+ Assert(array->num_elems == -1);
+ array->null_elem = false;
+
+ /* IS NOT NULL key (could be leftarg or rightarg) now redundant */
+ *result = true;
+ return true;
+ }
+
if (leftarg->sk_flags & SK_ISNULL)
{
Assert(leftarg->sk_flags & (SK_SEARCHNULL | SK_SEARCHNOTNULL));
@@ -3180,6 +4374,7 @@ _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
{
/* Can't make the comparison */
*result = false; /* suppress compiler warnings */
+ Assert(!((leftarg->sk_flags | rightarg->sk_flags) & SK_BT_SKIP));
return false;
}
@@ -3743,6 +4938,20 @@ _bt_check_compare(IndexScanDesc scan, ScanDirection dir,
continue;
}
+ /*
+ * A skip array scan key might be negative/positive infinity. Might
+ * also be next key/prior key sentinel, which we don't deal with.
+ */
+ if (key->sk_flags & (SK_BT_NEGPOSINF | SK_BT_NEXTPRIOR))
+ {
+ Assert(key->sk_flags & SK_SEARCHARRAY);
+ Assert(key->sk_flags & SK_BT_SKIP);
+ Assert(requiredSameDir);
+
+ *continuescan = false;
+ return false;
+ }
+
/* row-comparison keys need special processing */
if (key->sk_flags & SK_ROW_HEADER)
{
diff --git a/src/backend/access/nbtree/nbtvalidate.c b/src/backend/access/nbtree/nbtvalidate.c
index e9d4cd60d..96d0d9185 100644
--- a/src/backend/access/nbtree/nbtvalidate.c
+++ b/src/backend/access/nbtree/nbtvalidate.c
@@ -114,6 +114,10 @@ btvalidate(Oid opclassoid)
case BTOPTIONS_PROC:
ok = check_amoptsproc_signature(procform->amproc);
break;
+ case BTSKIPSUPPORT_PROC:
+ ok = check_amproc_signature(procform->amproc, VOIDOID, true,
+ 1, 1, INTERNALOID);
+ break;
default:
ereport(INFO,
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
diff --git a/src/backend/commands/opclasscmds.c b/src/backend/commands/opclasscmds.c
index b8b5c147c..a86dbf71b 100644
--- a/src/backend/commands/opclasscmds.c
+++ b/src/backend/commands/opclasscmds.c
@@ -1330,6 +1330,31 @@ assignProcTypes(OpFamilyMember *member, Oid amoid, Oid typeoid,
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
errmsg("btree equal image functions must not be cross-type")));
}
+ else if (member->number == BTSKIPSUPPORT_PROC)
+ {
+ if (procform->pronargs != 1 ||
+ procform->proargtypes.values[0] != INTERNALOID)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must accept type \"internal\"")));
+ if (procform->prorettype != VOIDOID)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must return void")));
+
+ /*
+ * pg_amproc functions are indexed by (lefttype, righttype), but a
+ * skip support function doesn't make sense in cross-type
+ * scenarios. The same opclass opcintype OID is always used for
+ * lefttype and righttype. Providing a cross-type routine isn't
+ * sensible. Reject cross-type ALTER OPERATOR FAMILY ... ADD
+ * FUNCTION 6 statements here.
+ */
+ if (member->lefttype != member->righttype)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("btree skip support functions must not be cross-type")));
+ }
}
else if (amoid == HASH_AM_OID)
{
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index db6ed784a..86a5de8f4 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -143,6 +143,7 @@ static const char *const BuiltinTrancheNames[] = {
[LWTRANCHE_LOCK_MANAGER] = "LockManager",
[LWTRANCHE_PREDICATE_LOCK_MANAGER] = "PredicateLockManager",
[LWTRANCHE_PARALLEL_HASH_JOIN] = "ParallelHashJoin",
+ [LWTRANCHE_PARALLEL_BTREE_SCAN] = "ParallelBtreeScan",
[LWTRANCHE_PARALLEL_QUERY_DSA] = "ParallelQueryDSA",
[LWTRANCHE_PER_SESSION_DSA] = "PerSessionDSA",
[LWTRANCHE_PER_SESSION_RECORD_TYPE] = "PerSessionRecordType",
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8efb4044d..6efa3e353 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -372,6 +372,7 @@ BufferMapping "Waiting to associate a data block with a buffer in the buffer poo
LockManager "Waiting to read or update information about <quote>heavyweight</quote> locks."
PredicateLockManager "Waiting to access predicate lock information used by serializable transactions."
ParallelHashJoin "Waiting to synchronize workers during Parallel Hash Join plan execution."
+ParallelBtreeScan "Waiting to synchronize workers during a parallel B-tree scan."
ParallelQueryDSA "Waiting for parallel query dynamic shared memory allocation."
PerSessionDSA "Waiting for parallel query dynamic shared memory allocation."
PerSessionRecordType "Waiting to access a parallel query's information about composite types."
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index edb09d4e3..e945686c8 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -96,6 +96,7 @@ OBJS = \
rowtypes.o \
ruleutils.o \
selfuncs.o \
+ skipsupport.o \
tid.o \
timestamp.o \
trigfuncs.o \
diff --git a/src/backend/utils/adt/date.c b/src/backend/utils/adt/date.c
index 8130f3e8a..256350007 100644
--- a/src/backend/utils/adt/date.c
+++ b/src/backend/utils/adt/date.c
@@ -34,6 +34,7 @@
#include "utils/date.h"
#include "utils/datetime.h"
#include "utils/numeric.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
/*
@@ -455,6 +456,51 @@ date_sortsupport(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static Datum
+date_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ DateADT dexisting = DatumGetDateADT(existing);
+
+ if (dexisting == DATEVAL_NOBEGIN)
+ {
+ /* return value is undefined */
+ *underflow = true;
+ return (Datum) 0;
+ }
+
+ *underflow = false;
+ return DateADTGetDatum(dexisting - 1);
+}
+
+static Datum
+date_increment(Relation rel, Datum existing, bool *overflow)
+{
+ DateADT dexisting = DatumGetDateADT(existing);
+
+ if (dexisting == DATEVAL_NOEND)
+ {
+ /* return value is undefined */
+ *overflow = true;
+ return (Datum) 0;
+ }
+
+ *overflow = false;
+ return DateADTGetDatum(dexisting + 1);
+}
+
+Datum
+date_skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+
+ sksup->decrement = date_decrement;
+ sksup->increment = date_increment;
+ sksup->low_elem = DateADTGetDatum(DATEVAL_NOBEGIN);
+ sksup->high_elem = DateADTGetDatum(DATEVAL_NOEND);
+
+ PG_RETURN_VOID();
+}
+
Datum
hashdate(PG_FUNCTION_ARGS)
{
diff --git a/src/backend/utils/adt/meson.build b/src/backend/utils/adt/meson.build
index 8c6fc80c3..91682edd5 100644
--- a/src/backend/utils/adt/meson.build
+++ b/src/backend/utils/adt/meson.build
@@ -83,6 +83,7 @@ backend_sources += files(
'rowtypes.c',
'ruleutils.c',
'selfuncs.c',
+ 'skipsupport.c',
'tid.c',
'timestamp.c',
'trigfuncs.c',
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 03d7fb5f4..78864b15d 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -192,6 +192,8 @@ static double convert_timevalue_to_scalar(Datum value, Oid typid,
bool *failure);
static void examine_simple_variable(PlannerInfo *root, Var *var,
VariableStatData *vardata);
+static void examine_indexcol_variable(PlannerInfo *root, IndexOptInfo *index,
+ int indexcol, VariableStatData *vardata);
static bool get_variable_range(PlannerInfo *root, VariableStatData *vardata,
Oid sortop, Oid collation,
Datum *min, Datum *max);
@@ -213,6 +215,8 @@ static bool get_actual_variable_endpoint(Relation heapRel,
MemoryContext outercontext,
Datum *endpointDatum);
static RelOptInfo *find_join_input_rel(PlannerInfo *root, Relids relids);
+static double btcost_correlation(IndexOptInfo *index,
+ VariableStatData *vardata);
/*
@@ -5733,6 +5737,92 @@ examine_simple_variable(PlannerInfo *root, Var *var,
}
}
+/*
+ * examine_indexcol_variable
+ * Try to look up statistical data about an index column/expression.
+ * Fill in a VariableStatData struct to describe the column.
+ *
+ * Inputs:
+ * root: the planner info
+ * index: the index whose column we're interested in
+ * indexcol: 0-based index column number (subscripts index->indexkeys[])
+ *
+ * Outputs: *vardata is filled as follows:
+ * var: the input expression (with any binary relabeling stripped, if
+ * it is or contains a variable; but otherwise the type is preserved)
+ * rel: RelOptInfo for table relation containing variable.
+ * statsTuple: the pg_statistic entry for the variable, if one exists;
+ * otherwise NULL.
+ * freefunc: pointer to a function to release statsTuple with.
+ *
+ * Caller is responsible for doing ReleaseVariableStats() before exiting.
+ */
+static void
+examine_indexcol_variable(PlannerInfo *root, IndexOptInfo *index,
+ int indexcol, VariableStatData *vardata)
+{
+ AttrNumber colnum;
+ Oid relid;
+
+ if (index->indexkeys[indexcol] != 0)
+ {
+ /* Simple variable --- look to stats for the underlying table */
+ RangeTblEntry *rte = planner_rt_fetch(index->rel->relid, root);
+
+ Assert(rte->rtekind == RTE_RELATION);
+ relid = rte->relid;
+ Assert(relid != InvalidOid);
+ colnum = index->indexkeys[indexcol];
+ vardata->rel = index->rel;
+
+ if (get_relation_stats_hook &&
+ (*get_relation_stats_hook) (root, rte, colnum, vardata))
+ {
+ /*
+ * The hook took control of acquiring a stats tuple. If it did
+ * supply a tuple, it'd better have supplied a freefunc.
+ */
+ if (HeapTupleIsValid(vardata->statsTuple) &&
+ !vardata->freefunc)
+ elog(ERROR, "no function provided to release variable stats with");
+ }
+ else
+ {
+ vardata->statsTuple = SearchSysCache3(STATRELATTINH,
+ ObjectIdGetDatum(relid),
+ Int16GetDatum(colnum),
+ BoolGetDatum(rte->inh));
+ vardata->freefunc = ReleaseSysCache;
+ }
+ }
+ else
+ {
+ /* Expression --- maybe there are stats for the index itself */
+ relid = index->indexoid;
+ colnum = indexcol + 1;
+
+ if (get_index_stats_hook &&
+ (*get_index_stats_hook) (root, relid, colnum, vardata))
+ {
+ /*
+ * The hook took control of acquiring a stats tuple. If it did
+ * supply a tuple, it'd better have supplied a freefunc.
+ */
+ if (HeapTupleIsValid(vardata->statsTuple) &&
+ !vardata->freefunc)
+ elog(ERROR, "no function provided to release variable stats with");
+ }
+ else
+ {
+ vardata->statsTuple = SearchSysCache3(STATRELATTINH,
+ ObjectIdGetDatum(relid),
+ Int16GetDatum(colnum),
+ BoolGetDatum(false));
+ vardata->freefunc = ReleaseSysCache;
+ }
+ }
+}
+
/*
* Check whether it is permitted to call func_oid passing some of the
* pg_statistic data in vardata. We allow this either if the user has SELECT
@@ -6791,6 +6881,54 @@ add_predicate_to_index_quals(IndexOptInfo *index, List *indexQuals)
return list_concat(predExtraQuals, indexQuals);
}
+/*
+ * Estimate correlation of btree index's first column.
+ *
+ * If we can get an estimate of the first column's ordering correlation C
+ * from pg_statistic, estimate the index correlation as C for a
+ * single-column index, or C * 0.75 for multiple columns. (The idea here
+ * is that multiple columns dilute the importance of the first column's
+ * ordering, but don't negate it entirely. Before 8.0 we divided the
+ * correlation by the number of columns, but that seems too strong.)
+ *
+ * We already filled in the stats tuple for *vardata when called.
+ */
+static double
+btcost_correlation(IndexOptInfo *index, VariableStatData *vardata)
+{
+ Oid sortop;
+ AttStatsSlot sslot;
+ double indexCorrelation = 0;
+
+ Assert(HeapTupleIsValid(vardata->statsTuple));
+
+ sortop = get_opfamily_member(index->opfamily[0],
+ index->opcintype[0],
+ index->opcintype[0],
+ BTLessStrategyNumber);
+ if (OidIsValid(sortop) &&
+ get_attstatsslot(&sslot, vardata->statsTuple,
+ STATISTIC_KIND_CORRELATION, sortop,
+ ATTSTATSSLOT_NUMBERS))
+ {
+ double varCorrelation;
+
+ Assert(sslot.nnumbers == 1);
+ varCorrelation = sslot.numbers[0];
+
+ if (index->reverse_sort[0])
+ varCorrelation = -varCorrelation;
+
+ if (index->nkeycolumns > 1)
+ indexCorrelation = varCorrelation * 0.75;
+ else
+ indexCorrelation = varCorrelation;
+
+ free_attstatsslot(&sslot);
+ }
+
+ return indexCorrelation;
+}
void
btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
@@ -6800,17 +6938,21 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
{
IndexOptInfo *index = path->indexinfo;
GenericCosts costs = {0};
- Oid relid;
- AttrNumber colnum;
VariableStatData vardata = {0};
double numIndexTuples;
Cost descentCost;
List *indexBoundQuals;
int indexcol;
bool eqQualHere;
+ bool upperInequalHere;
+ bool lowerInequalHere;
+ bool have_correlation = false;
+ bool found_skip;
bool found_saop;
bool found_is_null_op;
+ double inequalselectivity = 1.0;
double num_sa_scans;
+ double correlation = 0;
ListCell *lc;
/*
@@ -6826,13 +6968,17 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* For a RowCompareExpr, we consider only the first column, just as
* rowcomparesel() does.
*
- * If there's a ScalarArrayOpExpr in the quals, we'll actually perform up
- * to N index descents (not just one), but the ScalarArrayOpExpr's
- * operator can be considered to act the same as it normally does.
+ * If there's a ScalarArrayOpExpr in the quals, or if we expect to
+ * generate a skip scan array, then we'll actually perform up to N index
+ * descents (not just one), but the underlying operator can be considered
+ * to act the same as it normally does.
*/
indexBoundQuals = NIL;
indexcol = 0;
eqQualHere = false;
+ upperInequalHere = false;
+ lowerInequalHere = false;
+ found_skip = false;
found_saop = false;
found_is_null_op = false;
num_sa_scans = 1;
@@ -6843,13 +6989,81 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
if (indexcol != iclause->indexcol)
{
+ bool first = true;
+
/* Beginning of a new column's quals */
- if (!eqQualHere)
- break; /* done if no '=' qual for indexcol */
+ if (eqQualHere)
+ indexcol++; /* don't skip the previous '=' qual's column */
+
+ /*
+ * Now estimate number of "array elements" using ndistinct.
+ *
+ * Internally, nbtree treats skip scans as scans with SAOP style
+ * arrays that generate elements procedurally. We effectively
+ * assume a "col = ANY('{every possible col value}')" qual.
+ */
+ while (indexcol < iclause->indexcol)
+ {
+ double ndistinct = DEFAULT_NUM_DISTINCT;
+ bool isdefault = true;
+
+ /* Attain ndistinct for index column/indexed expression */
+ examine_indexcol_variable(root, index, indexcol, &vardata);
+ if (HeapTupleIsValid(vardata.statsTuple))
+ {
+ ndistinct = get_variable_numdistinct(&vardata, &isdefault);
+
+ if (indexcol == 0)
+ {
+ /*
+ * Get an estimate of the leading column's correlation
+ * in passing (avoids rereading variable stats below)
+ */
+ correlation = btcost_correlation(index, &vardata);
+ have_correlation = true;
+ }
+ }
+
+ ReleaseVariableStats(vardata);
+
+ if (first)
+ {
+ first = false;
+
+ /*
+ * Apply the selectivities of any inequalities to
+ * ndistinct (unless ndistinct is only a default estimate)
+ */
+ if (!isdefault)
+ ndistinct *= inequalselectivity;
+
+ /*
+ * Skip scan will likely require an initial index descent
+ * to find out what the real first element is..
+ */
+ if (!upperInequalHere)
+ ndistinct += 1;
+
+ /*
+ * ...and another extra descent to confirm no further
+ * groupings/matches
+ */
+ if (!lowerInequalHere)
+ ndistinct += 1;
+ }
+
+ num_sa_scans *= ndistinct;
+
+ /* Done costing skipping for this index column */
+ indexcol++;
+ found_skip = true;
+ }
+
+ /* new index column resets tracking variables */
eqQualHere = false;
- indexcol++;
- if (indexcol != iclause->indexcol)
- break; /* no quals at all for indexcol */
+ upperInequalHere = false;
+ lowerInequalHere = false;
+ inequalselectivity = 1.0;
}
/* Examine each indexqual associated with this index clause */
@@ -6891,7 +7105,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
if (nt->nulltesttype == IS_NULL)
{
found_is_null_op = true;
- /* IS NULL is like = for selectivity purposes */
+ /* IS NULL is like = for selectivity/skipping purposes */
eqQualHere = true;
}
}
@@ -6907,6 +7121,38 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
Assert(op_strategy != 0); /* not a member of opfamily?? */
if (op_strategy == BTEqualStrategyNumber)
eqQualHere = true;
+ else if (rinfo->norm_selec >= 0)
+ {
+ bool useinequal = true;
+
+ /*
+ * Skip scan requires tracking inequality selectivities to
+ * compute an adjusted whole-column ndistinct
+ */
+ if (op_strategy < BTEqualStrategyNumber)
+ {
+ if (upperInequalHere)
+ useinequal = false;
+ upperInequalHere = true;
+ }
+ else
+ {
+ if (lowerInequalHere)
+ useinequal = false;
+ lowerInequalHere = true;
+ }
+
+ /*
+ * Assume inequality selectivities are _not_ independent,
+ * but only track up to one upper bound inequality and up
+ * to one lower bound inequality. This avoids wildly
+ * wrong estimates given redundant operators.
+ */
+ if (useinequal)
+ inequalselectivity =
+ Max(inequalselectivity - (1.0 - rinfo->norm_selec),
+ DEFAULT_RANGE_INEQ_SEL);
+ }
}
indexBoundQuals = lappend(indexBoundQuals, rinfo);
@@ -6922,6 +7168,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
if (index->unique &&
indexcol == index->nkeycolumns - 1 &&
eqQualHere &&
+ !found_skip &&
!found_saop &&
!found_is_null_op)
numIndexTuples = 1.0;
@@ -7030,104 +7277,19 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
costs.indexStartupCost += descentCost;
costs.indexTotalCost += costs.num_sa_scans * descentCost;
- /*
- * If we can get an estimate of the first column's ordering correlation C
- * from pg_statistic, estimate the index correlation as C for a
- * single-column index, or C * 0.75 for multiple columns. (The idea here
- * is that multiple columns dilute the importance of the first column's
- * ordering, but don't negate it entirely. Before 8.0 we divided the
- * correlation by the number of columns, but that seems too strong.)
- */
- if (index->indexkeys[0] != 0)
+ if (!have_correlation)
{
- /* Simple variable --- look to stats for the underlying table */
- RangeTblEntry *rte = planner_rt_fetch(index->rel->relid, root);
-
- Assert(rte->rtekind == RTE_RELATION);
- relid = rte->relid;
- Assert(relid != InvalidOid);
- colnum = index->indexkeys[0];
-
- if (get_relation_stats_hook &&
- (*get_relation_stats_hook) (root, rte, colnum, &vardata))
- {
- /*
- * The hook took control of acquiring a stats tuple. If it did
- * supply a tuple, it'd better have supplied a freefunc.
- */
- if (HeapTupleIsValid(vardata.statsTuple) &&
- !vardata.freefunc)
- elog(ERROR, "no function provided to release variable stats with");
- }
- else
- {
- vardata.statsTuple = SearchSysCache3(STATRELATTINH,
- ObjectIdGetDatum(relid),
- Int16GetDatum(colnum),
- BoolGetDatum(rte->inh));
- vardata.freefunc = ReleaseSysCache;
- }
+ examine_indexcol_variable(root, index, 0, &vardata);
+ if (HeapTupleIsValid(vardata.statsTuple))
+ costs.indexCorrelation = btcost_correlation(index, &vardata);
+ ReleaseVariableStats(vardata);
}
else
{
- /* Expression --- maybe there are stats for the index itself */
- relid = index->indexoid;
- colnum = 1;
-
- if (get_index_stats_hook &&
- (*get_index_stats_hook) (root, relid, colnum, &vardata))
- {
- /*
- * The hook took control of acquiring a stats tuple. If it did
- * supply a tuple, it'd better have supplied a freefunc.
- */
- if (HeapTupleIsValid(vardata.statsTuple) &&
- !vardata.freefunc)
- elog(ERROR, "no function provided to release variable stats with");
- }
- else
- {
- vardata.statsTuple = SearchSysCache3(STATRELATTINH,
- ObjectIdGetDatum(relid),
- Int16GetDatum(colnum),
- BoolGetDatum(false));
- vardata.freefunc = ReleaseSysCache;
- }
+ /* get_variable_index_correlation called earlier */
+ costs.indexCorrelation = correlation;
}
- if (HeapTupleIsValid(vardata.statsTuple))
- {
- Oid sortop;
- AttStatsSlot sslot;
-
- sortop = get_opfamily_member(index->opfamily[0],
- index->opcintype[0],
- index->opcintype[0],
- BTLessStrategyNumber);
- if (OidIsValid(sortop) &&
- get_attstatsslot(&sslot, vardata.statsTuple,
- STATISTIC_KIND_CORRELATION, sortop,
- ATTSTATSSLOT_NUMBERS))
- {
- double varCorrelation;
-
- Assert(sslot.nnumbers == 1);
- varCorrelation = sslot.numbers[0];
-
- if (index->reverse_sort[0])
- varCorrelation = -varCorrelation;
-
- if (index->nkeycolumns > 1)
- costs.indexCorrelation = varCorrelation * 0.75;
- else
- costs.indexCorrelation = varCorrelation;
-
- free_attstatsslot(&sslot);
- }
- }
-
- ReleaseVariableStats(vardata);
-
*indexStartupCost = costs.indexStartupCost;
*indexTotalCost = costs.indexTotalCost;
*indexSelectivity = costs.indexSelectivity;
diff --git a/src/backend/utils/adt/skipsupport.c b/src/backend/utils/adt/skipsupport.c
new file mode 100644
index 000000000..d91471e26
--- /dev/null
+++ b/src/backend/utils/adt/skipsupport.c
@@ -0,0 +1,60 @@
+/*-------------------------------------------------------------------------
+ *
+ * skipsupport.c
+ * Support routines for B-Tree skip scan.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/utils/adt/skipsupport.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "utils/lsyscache.h"
+#include "utils/skipsupport.h"
+
+/*
+ * Fill in SkipSupport given an operator class (opfamily + opcintype).
+ *
+ * On success, returns true, and initializes all SkipSupport fields for
+ * caller. Otherwise returns false, indicating that operator class has no
+ * skip support function.
+ */
+bool
+PrepareSkipSupportFromOpclass(Oid opfamily, Oid opcintype, bool reverse,
+ SkipSupport sksup)
+{
+ Oid skipSupportFunction;
+
+ /* Look for a skip support function */
+ skipSupportFunction = get_opfamily_proc(opfamily, opcintype, opcintype,
+ BTSKIPSUPPORT_PROC);
+ if (!OidIsValid(skipSupportFunction))
+ return false;
+
+ OidFunctionCall1(skipSupportFunction, PointerGetDatum(sksup));
+
+ if (reverse)
+ {
+ /*
+ * DESC/reverse case: swap low_elem with high_elem, and swap decrement
+ * with increment
+ */
+ Datum low_elem = sksup->low_elem;
+ SkipSupportIncDec decrement = sksup->decrement;
+
+ sksup->low_elem = sksup->high_elem;
+ sksup->decrement = sksup->increment;
+
+ sksup->high_elem = low_elem;
+ sksup->increment = decrement;
+ }
+
+ return true;
+}
diff --git a/src/backend/utils/adt/uuid.c b/src/backend/utils/adt/uuid.c
index 5284d23dc..654bda638 100644
--- a/src/backend/utils/adt/uuid.c
+++ b/src/backend/utils/adt/uuid.c
@@ -13,12 +13,15 @@
#include "postgres.h"
+#include <limits.h>
+
#include "common/hashfn.h"
#include "lib/hyperloglog.h"
#include "libpq/pqformat.h"
#include "port/pg_bswap.h"
#include "utils/fmgrprotos.h"
#include "utils/guc.h"
+#include "utils/skipsupport.h"
#include "utils/sortsupport.h"
#include "utils/timestamp.h"
#include "utils/uuid.h"
@@ -384,6 +387,74 @@ uuid_abbrev_convert(Datum original, SortSupport ssup)
return res;
}
+static Datum
+uuid_decrement(Relation rel, Datum existing, bool *underflow)
+{
+ pg_uuid_t *uuid;
+
+ uuid = (pg_uuid_t *) palloc(UUID_LEN);
+ memcpy(uuid, DatumGetUUIDP(existing), UUID_LEN);
+ for (int i = UUID_LEN - 1; i >= 0; i--)
+ {
+ if (uuid->data[i] > 0)
+ {
+ uuid->data[i]--;
+ *underflow = false;
+ return UUIDPGetDatum(uuid);
+ }
+ uuid->data[i] = UCHAR_MAX;
+ }
+
+ pfree(uuid); /* cannot leak memory */
+
+ /* return value is undefined */
+ *underflow = true;
+ return (Datum) 0;
+}
+
+static Datum
+uuid_increment(Relation rel, Datum existing, bool *overflow)
+{
+ pg_uuid_t *uuid;
+
+ uuid = (pg_uuid_t *) palloc(UUID_LEN);
+ memcpy(uuid, DatumGetUUIDP(existing), UUID_LEN);
+ for (int i = UUID_LEN - 1; i >= 0; i--)
+ {
+ if (uuid->data[i] < UCHAR_MAX)
+ {
+ uuid->data[i]++;
+ *overflow = false;
+ return UUIDPGetDatum(uuid);
+ }
+ uuid->data[i] = 0;
+ }
+
+ pfree(uuid); /* cannot leak memory */
+
+ /* return value is undefined */
+ *overflow = true;
+ return (Datum) 0;
+}
+
+Datum
+uuid_skipsupport(PG_FUNCTION_ARGS)
+{
+ SkipSupport sksup = (SkipSupport) PG_GETARG_POINTER(0);
+ pg_uuid_t *uuid_min = palloc(UUID_LEN);
+ pg_uuid_t *uuid_max = palloc(UUID_LEN);
+
+ memset(uuid_min->data, 0x00, UUID_LEN);
+ memset(uuid_max->data, 0xFF, UUID_LEN);
+
+ sksup->decrement = uuid_decrement;
+ sksup->increment = uuid_increment;
+ sksup->low_elem = UUIDPGetDatum(uuid_min);
+ sksup->high_elem = UUIDPGetDatum(uuid_max);
+
+ PG_RETURN_VOID();
+}
+
/* hash index support */
Datum
uuid_hash(PG_FUNCTION_ARGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 686309db5..faa3a678f 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -28,6 +28,7 @@
#include "access/commit_ts.h"
#include "access/gin.h"
+#include "access/nbtree.h"
#include "access/slru.h"
#include "access/toast_compression.h"
#include "access/twophase.h"
@@ -1751,6 +1752,17 @@ struct config_bool ConfigureNamesBool[] =
},
#endif
+ /* XXX Remove before commit */
+ {
+ {"skipscan_skipsupport_enabled", PGC_SUSET, DEVELOPER_OPTIONS,
+ NULL, NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &skipscan_skipsupport_enabled,
+ true,
+ NULL, NULL, NULL
+ },
+
{
{"integer_datetimes", PGC_INTERNAL, PRESET_OPTIONS,
gettext_noop("Shows whether datetimes are integer based."),
@@ -3590,6 +3602,17 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ /* XXX Remove before commit */
+ {
+ {"skipscan_prefix_cols", PGC_SUSET, DEVELOPER_OPTIONS,
+ NULL, NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &skipscan_prefix_cols,
+ INDEX_MAX_KEYS, 0, INDEX_MAX_KEYS,
+ NULL, NULL, NULL
+ },
+
{
/* Can't be set in postgresql.conf */
{"server_version_num", PGC_INTERNAL, PRESET_OPTIONS,
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 2b3997988..9662fb2ba 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -583,6 +583,19 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
</para>
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><function>skipsupport</function></term>
+ <listitem>
+ <para>
+ Optionally, a btree operator family may provide a <firstterm>skip
+ support</firstterm> function, registered under support function
+ number 6. These functions allow the B-tree code to more efficiently
+ navigate the index structure via an index <quote>skip scan</quote>. The
+ APIs involved in this are defined in
+ <filename>src/include/utils/skipsupport.h</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</sect2>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index dc7d14b60..3116c6350 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -825,7 +825,8 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
Size
-amestimateparallelscan (int nkeys,
+amestimateparallelscan (Relation indexRelation,
+ int nkeys,
int norderbys);
</programlisting>
Estimate and return the number of bytes of dynamic shared memory which
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 6d731e070..433e108b8 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -457,23 +457,26 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
<para>
A multicolumn B-tree index can be used with query conditions that
involve any subset of the index's columns, but the index is most
- efficient when there are constraints on the leading (leftmost) columns.
- The exact rule is that equality constraints on leading columns, plus
- any inequality constraints on the first column that does not have an
- equality constraint, will be used to limit the portion of the index
- that is scanned. Constraints on columns to the right of these columns
- are checked in the index, so they save visits to the table proper, but
- they do not reduce the portion of the index that has to be scanned.
+ efficient when there are equality constraints on the leading (leftmost) columns.
+ B-Tree index scans can use the index skip scan strategy to generate
+ equality constraints on prefix columns that were wholly omitted from the
+ query predicate, as well as prefix columns whose values were constrained by
+ inequality conditions.
For example, given an index on <literal>(a, b, c)</literal> and a
query condition <literal>WHERE a = 5 AND b >= 42 AND c < 77</literal>,
the index would have to be scanned from the first entry with
<literal>a</literal> = 5 and <literal>b</literal> = 42 up through the last entry with
- <literal>a</literal> = 5. Index entries with <literal>c</literal> >= 77 would be
- skipped, but they'd still have to be scanned through.
+ <literal>a</literal> = 5. Intevening groups of index entries with
+ <literal>c</literal> >= 77 would not need to be returned by the scan,
+ and can be skipped over entirely by applying the skip scan strategy.
This index could in principle be used for queries that have constraints
on <literal>b</literal> and/or <literal>c</literal> with no constraint on <literal>a</literal>
- — but the entire index would have to be scanned, so in most cases
- the planner would prefer a sequential table scan over using the index.
+ — but that approach is generally only taken when there are so few
+ distinct <literal>a</literal> values that the planner expects the skip scan
+ strategy to allow the scan to skip over most individual index leaf pages.
+ If there are many distinct <literal>a</literal> values, then the entire
+ index will have to be scanned, so in most cases the planner will prefer a
+ sequential table scan over using the index.
</para>
<para>
@@ -508,10 +511,7 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
</para>
<para>
- Multicolumn indexes should be used sparingly. In most situations,
- an index on a single column is sufficient and saves space and time.
- Indexes with more than three columns are unlikely to be helpful
- unless the usage of the table is extremely stylized. See also
+ Multicolumn indexes should be used judiciously. See
<xref linkend="indexes-bitmap-scans"/> and
<xref linkend="indexes-index-only-scans"/> for some discussion of the
merits of different index configurations.
@@ -669,9 +669,13 @@ CREATE INDEX test3_desc_index ON test3 (id DESC NULLS LAST);
multicolumn index on <literal>(x, y)</literal>. This index would typically be
more efficient than index combination for queries involving both
columns, but as discussed in <xref linkend="indexes-multicolumn"/>, it
- would be almost useless for queries involving only <literal>y</literal>, so it
- should not be the only index. A combination of the multicolumn index
- and a separate index on <literal>y</literal> would serve reasonably well. For
+ would be less useful for queries involving only <literal>y</literal>. Just
+ how useful might depend on how effective the B-Tree index skip scan
+ optimization is; if <literal>x</literal> has no more than several hundred
+ distinct values, skip scan will make searches for specific
+ <literal>y</literal> values execute reasonably efficiently. A combination
+ of a multicolumn index on <literal>(x, y)</literal> and a separate index on
+ <literal>y</literal> might also serve reasonably well. For
queries involving only <literal>x</literal>, the multicolumn index could be
used, though it would be larger and hence slower than an index on
<literal>x</literal> alone. The last alternative is to create all three
diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml
index 3a19dab15..ee26dbecc 100644
--- a/doc/src/sgml/xindex.sgml
+++ b/doc/src/sgml/xindex.sgml
@@ -461,6 +461,13 @@
</entry>
<entry>5</entry>
</row>
+ <row>
+ <entry>
+ Return the addresses of C-callable skip support function(s)
+ (optional)
+ </entry>
+ <entry>6</entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -1062,7 +1069,8 @@ DEFAULT FOR TYPE int8 USING btree FAMILY integer_ops AS
FUNCTION 1 btint8cmp(int8, int8) ,
FUNCTION 2 btint8sortsupport(internal) ,
FUNCTION 3 in_range(int8, int8, int8, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint8skipsupport(internal) ;
CREATE OPERATOR CLASS int4_ops
DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
@@ -1075,7 +1083,8 @@ DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
FUNCTION 1 btint4cmp(int4, int4) ,
FUNCTION 2 btint4sortsupport(internal) ,
FUNCTION 3 in_range(int4, int4, int4, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint4skipsupport(internal) ;
CREATE OPERATOR CLASS int2_ops
DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
@@ -1088,7 +1097,8 @@ DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
FUNCTION 1 btint2cmp(int2, int2) ,
FUNCTION 2 btint2sortsupport(internal) ,
FUNCTION 3 in_range(int2, int2, int2, boolean, boolean) ,
- FUNCTION 4 btequalimage(oid) ;
+ FUNCTION 4 btequalimage(oid) ,
+ FUNCTION 6 btint2skipsupport(internal) ;
ALTER OPERATOR FAMILY integer_ops USING btree ADD
-- cross-type comparisons int8 vs int2
diff --git a/src/test/regress/expected/alter_generic.out b/src/test/regress/expected/alter_generic.out
index ae54cb254..8b6b775c1 100644
--- a/src/test/regress/expected/alter_generic.out
+++ b/src/test/regress/expected/alter_generic.out
@@ -362,9 +362,9 @@ ERROR: invalid operator number 0, must be between 1 and 5
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 1 < ; -- operator without argument types
ERROR: operator argument types must be specified in ALTER OPERATOR FAMILY
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 0 btint42cmp(int4, int2); -- invalid options parsing function
-ERROR: invalid function number 0, must be between 1 and 5
-ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 6 btint42cmp(int4, int2); -- function number should be between 1 and 5
-ERROR: invalid function number 6, must be between 1 and 5
+ERROR: invalid function number 0, must be between 1 and 6
+ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 7 btint42cmp(int4, int2); -- function number should be between 1 and 6
+ERROR: invalid function number 7, must be between 1 and 6
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD STORAGE invalid_storage; -- Ensure STORAGE is not a part of ALTER OPERATOR FAMILY
ERROR: STORAGE cannot be specified in ALTER OPERATOR FAMILY
DROP OPERATOR FAMILY alt_opf4 USING btree;
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index cf6eac573..f7b3ecef4 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1938,7 +1938,7 @@ ORDER BY unique1;
42
(3 rows)
--- Non-required array scan key on "tenthous":
+-- Skip array on "thousand", SAOP array on "tenthous":
explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
@@ -1958,7 +1958,7 @@ ORDER BY thousand;
1 | 1001
(2 rows)
--- Non-required array scan key on "tenthous", backward scan:
+-- Skip array on "thousand", SAOP array on "tenthous", backward scan:
explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 31fb7d142..8c2a939b0 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -4370,24 +4370,25 @@ select b.unique1 from
join int4_tbl i1 on b.thousand = f1
right join int4_tbl i2 on i2.f1 = b.tenthous
order by 1;
- QUERY PLAN
------------------------------------------------------------------------------------------
+ QUERY PLAN
+--------------------------------------------------------------------------------
Sort
Sort Key: b.unique1
-> Nested Loop Left Join
-> Seq Scan on int4_tbl i2
- -> Nested Loop Left Join
- Join Filter: (b.unique1 = 42)
- -> Nested Loop
+ -> Nested Loop
+ Join Filter: (b.thousand = i1.f1)
+ -> Nested Loop Left Join
+ Join Filter: (b.unique1 = 42)
-> Nested Loop
- -> Seq Scan on int4_tbl i1
-> Index Scan using tenk1_thous_tenthous on tenk1 b
- Index Cond: ((thousand = i1.f1) AND (tenthous = i2.f1))
- -> Index Scan using tenk1_unique1 on tenk1 a
- Index Cond: (unique1 = b.unique2)
- -> Index Only Scan using tenk1_thous_tenthous on tenk1 c
- Index Cond: (thousand = a.thousand)
-(15 rows)
+ Index Cond: (tenthous = i2.f1)
+ -> Index Scan using tenk1_unique1 on tenk1 a
+ Index Cond: (unique1 = b.unique2)
+ -> Index Only Scan using tenk1_thous_tenthous on tenk1 c
+ Index Cond: (thousand = a.thousand)
+ -> Seq Scan on int4_tbl i1
+(16 rows)
select b.unique1 from
tenk1 a join tenk1 b on a.unique1 = b.unique2
@@ -7482,19 +7483,23 @@ select * from fkest f1
join fkest f2 on (f1.x = f2.x and f1.x10 = f2.x10b and f1.x100 = f2.x100)
join fkest f3 on f1.x = f3.x
where f1.x100 = 2;
- QUERY PLAN
------------------------------------------------------------
+ QUERY PLAN
+-------------------------------------------------------------------
Nested Loop
-> Hash Join
Hash Cond: ((f2.x = f1.x) AND (f2.x10b = f1.x10))
- -> Seq Scan on fkest f2
- Filter: (x100 = 2)
+ -> Bitmap Heap Scan on fkest f2
+ Recheck Cond: (x100 = 2)
+ -> Bitmap Index Scan on fkest_x_x10_x100_idx
+ Index Cond: (x100 = 2)
-> Hash
- -> Seq Scan on fkest f1
- Filter: (x100 = 2)
+ -> Bitmap Heap Scan on fkest f1
+ Recheck Cond: (x100 = 2)
+ -> Bitmap Index Scan on fkest_x_x10_x100_idx
+ Index Cond: (x100 = 2)
-> Index Scan using fkest_x_x10_x100_idx on fkest f3
Index Cond: (x = f1.x)
-(10 rows)
+(14 rows)
alter table fkest add constraint fk
foreign key (x, x10b, x100) references fkest (x, x10, x100);
@@ -7503,20 +7508,24 @@ select * from fkest f1
join fkest f2 on (f1.x = f2.x and f1.x10 = f2.x10b and f1.x100 = f2.x100)
join fkest f3 on f1.x = f3.x
where f1.x100 = 2;
- QUERY PLAN
------------------------------------------------------
+ QUERY PLAN
+-------------------------------------------------------------------
Hash Join
Hash Cond: ((f2.x = f1.x) AND (f2.x10b = f1.x10))
-> Hash Join
Hash Cond: (f3.x = f2.x)
-> Seq Scan on fkest f3
-> Hash
- -> Seq Scan on fkest f2
- Filter: (x100 = 2)
+ -> Bitmap Heap Scan on fkest f2
+ Recheck Cond: (x100 = 2)
+ -> Bitmap Index Scan on fkest_x_x10_x100_idx
+ Index Cond: (x100 = 2)
-> Hash
- -> Seq Scan on fkest f1
- Filter: (x100 = 2)
-(11 rows)
+ -> Bitmap Heap Scan on fkest f1
+ Recheck Cond: (x100 = 2)
+ -> Bitmap Index Scan on fkest_x_x10_x100_idx
+ Index Cond: (x100 = 2)
+(15 rows)
rollback;
--
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 6aeb7cb96..f4c696ca5 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5193,9 +5193,10 @@ List of access methods
btree | uuid_ops | uuid | uuid | 1 | uuid_cmp
btree | uuid_ops | uuid | uuid | 2 | uuid_sortsupport
btree | uuid_ops | uuid | uuid | 4 | btequalimage
+ btree | uuid_ops | uuid | uuid | 6 | uuid_skipsupport
hash | uuid_ops | uuid | uuid | 1 | uuid_hash
hash | uuid_ops | uuid | uuid | 2 | uuid_hash_extended
-(5 rows)
+(6 rows)
-- check \dconfig
set work_mem = 10240;
diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 0456d48c9..39aa1f89e 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1461,18 +1461,17 @@ select t1.unique1 from tenk1 t1
inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0
union all
(values(1)) limit 1;
- QUERY PLAN
---------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------
Limit
-> Append
-> Nested Loop
- Join Filter: (t1.tenthous = t2.tenthous)
- -> Seq Scan on tenk1 t1
- -> Materialize
- -> Seq Scan on tenk2 t2
- Filter: (thousand = 0)
+ -> Seq Scan on tenk2 t2
+ Filter: (thousand = 0)
+ -> Index Scan using tenk1_thous_tenthous on tenk1 t1
+ Index Cond: (tenthous = t2.tenthous)
-> Result
-(9 rows)
+(8 rows)
-- Ensure there is no problem if cheapest_startup_path is NULL
explain (costs off)
diff --git a/src/test/regress/sql/alter_generic.sql b/src/test/regress/sql/alter_generic.sql
index de58d268d..4246afefd 100644
--- a/src/test/regress/sql/alter_generic.sql
+++ b/src/test/regress/sql/alter_generic.sql
@@ -310,7 +310,7 @@ ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 6 < (int4, int2); -- ope
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 0 < (int4, int2); -- operator number should be between 1 and 5
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 1 < ; -- operator without argument types
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 0 btint42cmp(int4, int2); -- invalid options parsing function
-ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 6 btint42cmp(int4, int2); -- function number should be between 1 and 5
+ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 7 btint42cmp(int4, int2); -- function number should be between 1 and 6
ALTER OPERATOR FAMILY alt_opf4 USING btree ADD STORAGE invalid_storage; -- Ensure STORAGE is not a part of ALTER OPERATOR FAMILY
DROP OPERATOR FAMILY alt_opf4 USING btree;
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index e296891ca..1d269dc30 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -766,7 +766,7 @@ SELECT unique1 FROM tenk1
WHERE unique1 IN (1,42,7)
ORDER BY unique1;
--- Non-required array scan key on "tenthous":
+-- Skip array on "thousand", SAOP array on "tenthous":
explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
@@ -776,7 +776,7 @@ SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
--- Non-required array scan key on "tenthous", backward scan:
+-- Skip array on "thousand", SAOP array on "tenthous", backward scan:
explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ace5414fa..bdaf4d62a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -218,6 +218,7 @@ BTScanPos
BTScanPosData
BTScanPosItem
BTShared
+BTSkipPreproc
BTSortArrayContext
BTSpool
BTStack
@@ -2662,6 +2663,8 @@ SimpleStringListCell
SingleBoundSortItem
Size
SkipPages
+SkipSupport
+SkipSupportData
SlabBlock
SlabContext
SlabSlot
--
2.45.2