Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Make row compares robust during nbtree array scans.
- bd3f59fdb717 19 (unreleased) cited

Multicolumn index scan efficiency

Vitalii Tymchyshyn <vit@tym.im> — 2025-11-10T02:44:25Z

Hi.

Since Friday I have been trying to diagnose the slowness of scanning the
multicolumn index in Postgres. I figured out that multicolumns conditions
like (a,b,c) > (x,y,z) with a,b and c being part of primary index work slow
in postgresql 9.6 (original version in prod, I know it's old). I thought
that it would be fast in 15 (one I had handy), but it was still slow. It
was finally fast in 18. Note that all 3 are Google CloudSQL, but I believe
those are pretty vanilla in terms of index scan internals.

I am wondering about 2 things:
1) Does anyone know which specific change / version made it fast?
2) What was the proper way to do a range index scan like WHERE (a,b,c)
between (x1,y1,z1) and (x2,y2,z2) before the improvement.
Note that my tests can mostly be rewritten as equality at least for some
columns (and this is what we'll do), but sometimes we do need a range scan
like above, so understanding it would be important. Also I am curious :).

The full test will be towards the end, but the query I started from is
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE (namespace, application, version) > ('default',
'test_application_pipeline_with__simulated_long_name_075000', '')
  AND (namespace, application) <= ('default',
'test_application_pipeline_with__simulated_long_name_075000')
  AND (latest = true OR latest is NULL);

Plans:
--- 9.6
 Index Scan using application_specs_pkey on application_specs
 (cost=0.56..6.33 rows=1 width=198) (actual rows=1 loops=1)
   Index Cond: ((ROW(namespace, application, version) >
ROW('default'::text,
'test_application_pipeline_with__simulated_long_name_075000'::text,
''::text)) AND (ROW(namespace, application) <= ROW('default'::text,
'test_application_pipeline_with__simulated_long_name_075000'::text)))
   Filter: (latest OR (latest IS NULL))
   Rows Removed by Filter: 29
   Buffers: shared hit=35442
 Planning time: 0.153 ms
 Execution time: 5049.123 ms
(7 rows)
--- 15
 Index Scan using application_specs_pkey on application_specs
 (cost=0.56..6.33 rows=1 width=206) (actual rows=1 loops=1)
   Index Cond: ((ROW(namespace, application, version) >
ROW('default'::text,
'test_application_pipeline_with__simulated_long_name_075000'::text,
''::text)) AND (ROW(namespace, application) <= ROW('default'::text,
'test_application_pipeline_with__simulated_long_name_075000'::text)))
   Filter: (latest OR (latest IS NULL))
   Rows Removed by Filter: 29
   Buffers: shared hit=45839
 Planning Time: 0.171 ms
 Execution Time: 8190.116 ms
(7 rows)

--- 18
 Index Scan using application_specs_pkey on application_specs
 (cost=0.56..6.33 rows=1 width=206) (actual rows=1.00 loops=1)
   Index Cond: ((ROW(namespace, application, version) >
ROW('default'::text,
'test_application_pipeline_with__simulated_long_name_075000'::text,
''::text)) AND (ROW(namespace, application) <= ROW('default'::text,
'test_application_pipeline_with__simulated_long_name_075000'::text)))
   Filter: (latest OR (latest IS NULL))
   Rows Removed by Filter: 29
   Index Searches: 1
   Buffers: shared hit=35
 Planning:
   Buffers: shared hit=29
 Planning Time: 0.243 ms
 Execution Time: 0.155 ms
(10 rows)

So, it's index scan in all cases, but for some reason index scan is very
inefficient in the old versions and it has to scan a huge portion of index
to find a few rows.

--- Full test

-- 0. DISABLE SEQSCAN to ensure index is used
SET ENABLE_SEQSCAN=false;

-- 1. SETUP: Clean slate
DROP TABLE IF EXISTS application_specs;

-- 2. SCHEMA
CREATE TABLE application_specs (
    namespace text NOT NULL,
    application text NOT NULL,
    version text NOT NULL,
    latest boolean,
    data text,
    PRIMARY KEY (namespace, application, version) WITH (fillfactor = 90)
);

-- 3. GENERATE DATA: ~4.5 Million Rows Total
-- 150,000 Applications * ~30 versions each
INSERT INTO application_specs (namespace, application, version, latest,
data)
SELECT
    'default',
    'test_application_pipeline_with__simulated_long_name_' || lpad(a::text,
6, '0'),
    md5(random()::text || clock_timestamp()::text)::uuid::text,
    (v = 30), -- Make the 30th version the 'latest' one
    repeat('x', 100)
FROM generate_series(1, 150000) a
CROSS JOIN generate_series(1, 30) v;

-- 4. ANALYZE to ensure planner has accurate stats
VACUUM (ANALYZE, FREEZE) application_specs;

-- =========================================
-- THE TEST
-- Target a middle application: '...075000'
-- =========================================

---------------------------------------------------
-- TEST A: Original Row-wise Query (Should be slow)
---------------------------------------------------
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE (namespace, application, version) > ('default',
'test_application_pipeline_with__simulated_long_name_075000', '')
  AND (namespace, application) <= ('default',
'test_application_pipeline_with__simulated_long_name_075000')
  AND (latest = true OR latest is NULL);

---------------------------------------------------
-- TEST B: Equality Query (Should be fast)
---------------------------------------------------
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE namespace = 'default'
  AND application =
'test_application_pipeline_with__simulated_long_name_075000'
  AND version > ''
  AND (latest = true OR latest is NULL);

---------------------------------------------------
-- TEST C: (Should be slow)
---------------------------------------------------
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE (namespace, application) >= ('default',
'test_application_pipeline_with__simulated_long_name_075000')
  AND (namespace, application) <= ('default',
'test_application_pipeline_with__simulated_long_name_075000')
  AND (latest = true OR latest is NULL) AND version > '';

---------------------------------------------------
-- TEST D: (Should be fast)
---------------------------------------------------
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE (namespace, application) = ('default',
'test_application_pipeline_with__simulated_long_name_075000')
  AND (latest = true OR latest is NULL) AND version > '';

--- End of test

Best regards, Vitalii Tymchyshyn

Re: Multicolumn index scan efficiency

Peter Geoghegan <pg@bowt.ie> — 2025-11-10T04:20:39Z

On Sun, Nov 9, 2025 at 9:44 PM Vitalii Tymchyshyn <vit@tym.im> wrote:
> I am wondering about 2 things:
> 1) Does anyone know which specific change / version made it fast?
> 2) What was the proper way to do a range index scan like WHERE (a,b,c) between (x1,y1,z1) and (x2,y2,z2) before the improvement.
> Note that my tests can mostly be rewritten as equality at least for some columns (and this is what we'll do), but sometimes we do need a range scan like above, so understanding it would be important. Also I am curious :).

This improvement you're seeing here is down to work in commit
bd3f59fd. The short version is that the way we used to decide when a
condition like "WHERE (a,b,c) <= (x2,y2,z2)" was needlessly
conservative. If there were many "a" values equal to x2, we'd have to
scan the index until we got to the next distinct/non-equal "a" value
-- without realizing that we're already past the point where there
cannot possibly be any more matches.

See the discussion on this thread which complained about the problem,
particularly my response to the complaint:

https://www.postgresql.org/message-id/flat/CAH2-WzmLREy6r68A6SEHXnstg01kNs1HiQtOvSO5cTvWuaducw%40mail.gmail.com#62e393ac8bbf06f0f73598ba2ceeab69

--
Peter Geoghegan

Re: Multicolumn index scan efficiency

Vitalii Tymchyshyn <vit@tym.im> — 2025-11-10T05:12:09Z

Thank you so much for both clarifying and fixing it!
In our case (FYI, this is from http://github.com/cdapio/cdap) a lot of
users have just a single namespace, so it effectively means scanning till
the end of the index.
We'll fix
https://github.com/cdapio/cdap/blob/develop/cdap-data-fabric/src/main/java/io/cdap/cdap/spi/data/sql/PostgreSqlStructuredTable.java
to detect equality scan prefixes and make corresponding SQL. That would fix
it for all postgres versions.

Best regards, Vitalii Tymchyshyn

нд, 9 лист. 2025 р. о 20:21 Peter Geoghegan <pg@bowt.ie> пише:

> On Sun, Nov 9, 2025 at 9:44 PM Vitalii Tymchyshyn <vit@tym.im> wrote:
> > I am wondering about 2 things:
> > 1) Does anyone know which specific change / version made it fast?
> > 2) What was the proper way to do a range index scan like WHERE (a,b,c)
> between (x1,y1,z1) and (x2,y2,z2) before the improvement.
> > Note that my tests can mostly be rewritten as equality at least for some
> columns (and this is what we'll do), but sometimes we do need a range scan
> like above, so understanding it would be important. Also I am curious :).
>
> This improvement you're seeing here is down to work in commit
> bd3f59fd. The short version is that the way we used to decide when a
> condition like "WHERE (a,b,c) <= (x2,y2,z2)" was needlessly
> conservative. If there were many "a" values equal to x2, we'd have to
> scan the index until we got to the next distinct/non-equal "a" value
> -- without realizing that we're already past the point where there
> cannot possibly be any more matches.
>
> See the discussion on this thread which complained about the problem,
> particularly my response to the complaint:
>
>
> https://www.postgresql.org/message-id/flat/CAH2-WzmLREy6r68A6SEHXnstg01kNs1HiQtOvSO5cTvWuaducw%40mail.gmail.com#62e393ac8bbf06f0f73598ba2ceeab69
>
> --
> Peter Geoghegan
>

Re: Multicolumn index scan efficiency

Peter Geoghegan <pg@bowt.ie> — 2025-11-10T17:00:02Z

On Mon, Nov 10, 2025 at 12:12 AM Vitalii Tymchyshyn <vit@tym.im> wrote:
> Thank you so much for both clarifying and fixing it!

FWIW the problem is limited to row compares/row constructor
comparisons that are used to decide when to end the scan. Note in
particular that row compares that decide where in the index (what leaf
page) the scan should *begin* from were never affected -- only those
that determine where the scan should end. In other words, for a
forwards scan, > and >= row compares aren't affected (but < and <= row
compares are). For backwards scans/with ORDER BY a DESC, b DESC, it's
exactly the other way around (it's > and >= row compares that'll end
the scan/that had this problem).

My guess is that this issue wasn't noticed sooner because in practice
a lot of users of row compares only use them to determine where each
scan begins from, in the context of apply row compares to implement
keyset pagination [1]. I think that it's typical to use an ORDER BY
... LIMIT, or a FETCH FIRST ... ROWS WITH TIES to limit the size of
the result set on each individual query. It was a nasty and surprising
issue, but it didn't actually come up all that often.

After all, if you use a < or a <= condition to end each scan, the
total number of rows that'll be returned each time is unpredictable --
and potentially very large. That isn't generally desirable with keyset
pagination; what users usually do is have Postgres return a more or
less uniform number of rows for each individual query that fetches the
next portion of the "total result set". That's kinda the natural way
to do it.

[1] https://wiki.postgresql.org/images/3/35/Pagination_Done_the_PostgreSQL_Way.pdf
-- 
Peter Geoghegan