Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Make row compares robust during nbtree array scans.

  1. Multicolumn index scan efficiency

    Vitalii Tymchyshyn <vit@tym.im> — 2025-11-10T02:44:25Z

    Hi.
    
    Since Friday I have been trying to diagnose the slowness of scanning the
    multicolumn index in Postgres. I figured out that multicolumns conditions
    like (a,b,c) > (x,y,z) with a,b and c being part of primary index work slow
    in postgresql 9.6 (original version in prod, I know it's old). I thought
    that it would be fast in 15 (one I had handy), but it was still slow. It
    was finally fast in 18. Note that all 3 are Google CloudSQL, but I believe
    those are pretty vanilla in terms of index scan internals.
    
    I am wondering about 2 things:
    1) Does anyone know which specific change / version made it fast?
    2) What was the proper way to do a range index scan like WHERE (a,b,c)
    between (x1,y1,z1) and (x2,y2,z2) before the improvement.
    Note that my tests can mostly be rewritten as equality at least for some
    columns (and this is what we'll do), but sometimes we do need a range scan
    like above, so understanding it would be important. Also I am curious :).
    
    The full test will be towards the end, but the query I started from is
    EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
    SELECT * FROM application_specs
    WHERE (namespace, application, version) > ('default',
    'test_application_pipeline_with__simulated_long_name_075000', '')
      AND (namespace, application) <= ('default',
    'test_application_pipeline_with__simulated_long_name_075000')
      AND (latest = true OR latest is NULL);
    
    Plans:
    --- 9.6
     Index Scan using application_specs_pkey on application_specs
     (cost=0.56..6.33 rows=1 width=198) (actual rows=1 loops=1)
       Index Cond: ((ROW(namespace, application, version) >
    ROW('default'::text,
    'test_application_pipeline_with__simulated_long_name_075000'::text,
    ''::text)) AND (ROW(namespace, application) <= ROW('default'::text,
    'test_application_pipeline_with__simulated_long_name_075000'::text)))
       Filter: (latest OR (latest IS NULL))
       Rows Removed by Filter: 29
       Buffers: shared hit=35442
     Planning time: 0.153 ms
     Execution time: 5049.123 ms
    (7 rows)
    --- 15
     Index Scan using application_specs_pkey on application_specs
     (cost=0.56..6.33 rows=1 width=206) (actual rows=1 loops=1)
       Index Cond: ((ROW(namespace, application, version) >
    ROW('default'::text,
    'test_application_pipeline_with__simulated_long_name_075000'::text,
    ''::text)) AND (ROW(namespace, application) <= ROW('default'::text,
    'test_application_pipeline_with__simulated_long_name_075000'::text)))
       Filter: (latest OR (latest IS NULL))
       Rows Removed by Filter: 29
       Buffers: shared hit=45839
     Planning Time: 0.171 ms
     Execution Time: 8190.116 ms
    (7 rows)
    
    --- 18
     Index Scan using application_specs_pkey on application_specs
     (cost=0.56..6.33 rows=1 width=206) (actual rows=1.00 loops=1)
       Index Cond: ((ROW(namespace, application, version) >
    ROW('default'::text,
    'test_application_pipeline_with__simulated_long_name_075000'::text,
    ''::text)) AND (ROW(namespace, application) <= ROW('default'::text,
    'test_application_pipeline_with__simulated_long_name_075000'::text)))
       Filter: (latest OR (latest IS NULL))
       Rows Removed by Filter: 29
       Index Searches: 1
       Buffers: shared hit=35
     Planning:
       Buffers: shared hit=29
     Planning Time: 0.243 ms
     Execution Time: 0.155 ms
    (10 rows)
    
    So, it's index scan in all cases, but for some reason index scan is very
    inefficient in the old versions and it has to scan a huge portion of index
    to find a few rows.
    
    --- Full test
    
    -- 0. DISABLE SEQSCAN to ensure index is used
    SET ENABLE_SEQSCAN=false;
    
    -- 1. SETUP: Clean slate
    DROP TABLE IF EXISTS application_specs;
    
    -- 2. SCHEMA
    CREATE TABLE application_specs (
        namespace text NOT NULL,
        application text NOT NULL,
        version text NOT NULL,
        latest boolean,
        data text,
        PRIMARY KEY (namespace, application, version) WITH (fillfactor = 90)
    );
    
    -- 3. GENERATE DATA: ~4.5 Million Rows Total
    -- 150,000 Applications * ~30 versions each
    INSERT INTO application_specs (namespace, application, version, latest,
    data)
    SELECT
        'default',
        'test_application_pipeline_with__simulated_long_name_' || lpad(a::text,
    6, '0'),
        md5(random()::text || clock_timestamp()::text)::uuid::text,
        (v = 30), -- Make the 30th version the 'latest' one
        repeat('x', 100)
    FROM generate_series(1, 150000) a
    CROSS JOIN generate_series(1, 30) v;
    
    -- 4. ANALYZE to ensure planner has accurate stats
    VACUUM (ANALYZE, FREEZE) application_specs;
    
    -- =========================================
    -- THE TEST
    -- Target a middle application: '...075000'
    -- =========================================
    
    ---------------------------------------------------
    -- TEST A: Original Row-wise Query (Should be slow)
    ---------------------------------------------------
    EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
    SELECT * FROM application_specs
    WHERE (namespace, application, version) > ('default',
    'test_application_pipeline_with__simulated_long_name_075000', '')
      AND (namespace, application) <= ('default',
    'test_application_pipeline_with__simulated_long_name_075000')
      AND (latest = true OR latest is NULL);
    
    ---------------------------------------------------
    -- TEST B: Equality Query (Should be fast)
    ---------------------------------------------------
    EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
    SELECT * FROM application_specs
    WHERE namespace = 'default'
      AND application =
    'test_application_pipeline_with__simulated_long_name_075000'
      AND version > ''
      AND (latest = true OR latest is NULL);
    
    ---------------------------------------------------
    -- TEST C: (Should be slow)
    ---------------------------------------------------
    EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
    SELECT * FROM application_specs
    WHERE (namespace, application) >= ('default',
    'test_application_pipeline_with__simulated_long_name_075000')
      AND (namespace, application) <= ('default',
    'test_application_pipeline_with__simulated_long_name_075000')
      AND (latest = true OR latest is NULL) AND version > '';
    
    ---------------------------------------------------
    -- TEST D: (Should be fast)
    ---------------------------------------------------
    EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
    SELECT * FROM application_specs
    WHERE (namespace, application) = ('default',
    'test_application_pipeline_with__simulated_long_name_075000')
      AND (latest = true OR latest is NULL) AND version > '';
    
    --- End of test
    
    Best regards, Vitalii Tymchyshyn
    
  2. Re: Multicolumn index scan efficiency

    Peter Geoghegan <pg@bowt.ie> — 2025-11-10T04:20:39Z

    On Sun, Nov 9, 2025 at 9:44 PM Vitalii Tymchyshyn <vit@tym.im> wrote:
    > I am wondering about 2 things:
    > 1) Does anyone know which specific change / version made it fast?
    > 2) What was the proper way to do a range index scan like WHERE (a,b,c) between (x1,y1,z1) and (x2,y2,z2) before the improvement.
    > Note that my tests can mostly be rewritten as equality at least for some columns (and this is what we'll do), but sometimes we do need a range scan like above, so understanding it would be important. Also I am curious :).
    
    This improvement you're seeing here is down to work in commit
    bd3f59fd. The short version is that the way we used to decide when a
    condition like "WHERE (a,b,c) <= (x2,y2,z2)" was needlessly
    conservative. If there were many "a" values equal to x2, we'd have to
    scan the index until we got to the next distinct/non-equal "a" value
    -- without realizing that we're already past the point where there
    cannot possibly be any more matches.
    
    See the discussion on this thread which complained about the problem,
    particularly my response to the complaint:
    
    https://www.postgresql.org/message-id/flat/CAH2-WzmLREy6r68A6SEHXnstg01kNs1HiQtOvSO5cTvWuaducw%40mail.gmail.com#62e393ac8bbf06f0f73598ba2ceeab69
    
    --
    Peter Geoghegan
    
    
    
    
  3. Re: Multicolumn index scan efficiency

    Vitalii Tymchyshyn <vit@tym.im> — 2025-11-10T05:12:09Z

    Thank you so much for both clarifying and fixing it!
    In our case (FYI, this is from http://github.com/cdapio/cdap) a lot of
    users have just a single namespace, so it effectively means scanning till
    the end of the index.
    We'll fix
    https://github.com/cdapio/cdap/blob/develop/cdap-data-fabric/src/main/java/io/cdap/cdap/spi/data/sql/PostgreSqlStructuredTable.java
    to detect equality scan prefixes and make corresponding SQL. That would fix
    it for all postgres versions.
    
    Best regards, Vitalii Tymchyshyn
    
    нд, 9 лист. 2025 р. о 20:21 Peter Geoghegan <pg@bowt.ie> пише:
    
    > On Sun, Nov 9, 2025 at 9:44 PM Vitalii Tymchyshyn <vit@tym.im> wrote:
    > > I am wondering about 2 things:
    > > 1) Does anyone know which specific change / version made it fast?
    > > 2) What was the proper way to do a range index scan like WHERE (a,b,c)
    > between (x1,y1,z1) and (x2,y2,z2) before the improvement.
    > > Note that my tests can mostly be rewritten as equality at least for some
    > columns (and this is what we'll do), but sometimes we do need a range scan
    > like above, so understanding it would be important. Also I am curious :).
    >
    > This improvement you're seeing here is down to work in commit
    > bd3f59fd. The short version is that the way we used to decide when a
    > condition like "WHERE (a,b,c) <= (x2,y2,z2)" was needlessly
    > conservative. If there were many "a" values equal to x2, we'd have to
    > scan the index until we got to the next distinct/non-equal "a" value
    > -- without realizing that we're already past the point where there
    > cannot possibly be any more matches.
    >
    > See the discussion on this thread which complained about the problem,
    > particularly my response to the complaint:
    >
    >
    > https://www.postgresql.org/message-id/flat/CAH2-WzmLREy6r68A6SEHXnstg01kNs1HiQtOvSO5cTvWuaducw%40mail.gmail.com#62e393ac8bbf06f0f73598ba2ceeab69
    >
    > --
    > Peter Geoghegan
    >
    
  4. Re: Multicolumn index scan efficiency

    Peter Geoghegan <pg@bowt.ie> — 2025-11-10T17:00:02Z

    On Mon, Nov 10, 2025 at 12:12 AM Vitalii Tymchyshyn <vit@tym.im> wrote:
    > Thank you so much for both clarifying and fixing it!
    
    FWIW the problem is limited to row compares/row constructor
    comparisons that are used to decide when to end the scan. Note in
    particular that row compares that decide where in the index (what leaf
    page) the scan should *begin* from were never affected -- only those
    that determine where the scan should end. In other words, for a
    forwards scan, > and >= row compares aren't affected (but < and <= row
    compares are). For backwards scans/with ORDER BY a DESC, b DESC, it's
    exactly the other way around (it's > and >= row compares that'll end
    the scan/that had this problem).
    
    My guess is that this issue wasn't noticed sooner because in practice
    a lot of users of row compares only use them to determine where each
    scan begins from, in the context of apply row compares to implement
    keyset pagination [1]. I think that it's typical to use an ORDER BY
    ... LIMIT, or a FETCH FIRST ... ROWS WITH TIES to limit the size of
    the result set on each individual query. It was a nasty and surprising
    issue, but it didn't actually come up all that often.
    
    After all, if you use a < or a <= condition to end each scan, the
    total number of rows that'll be returned each time is unpredictable --
    and potentially very large. That isn't generally desirable with keyset
    pagination; what users usually do is have Postgres return a more or
    less uniform number of rows for each individual query that fetches the
    next portion of the "total result set". That's kinda the natural way
    to do it.
    
    [1] https://wiki.postgresql.org/images/3/35/Pagination_Done_the_PostgreSQL_Way.pdf
    -- 
    Peter Geoghegan