RE: Adding skip scan (including MDAM style range skip scan) to nbtree

masahiro.ikeda@nttdata.com

From: <Masahiro.Ikeda@nttdata.com>
To: <pg@bowt.ie>, <pgsql-hackers@lists.postgresql.org>
Cc: <Masao.Fujii@nttdata.com>
Date: 2024-07-12T05:18:56Z
Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. nbtree: Always set skipScan flag on rescan.

  2. meson: Build numeric.c with -ftree-vectorize.

  3. Fix "variable not found in subplan target lists" in semijoin de-duplication.

  4. Revert "nbtree: Remove useless row compare arg."

  5. nbtree: Remove useless row compare arg.

  6. Prevent premature nbtree array advancement.

  7. nbtree: tighten up array recheck rules.

  8. Avoid treating nonrequired nbtree keys as required.

  9. Adjust overstrong nbtree skip array assertion.

  10. Make NULL tuple values always advance skip arrays.

  11. Avoid extra index searches through preprocessing.

  12. Improve nbtree skip scan primitive scan scheduling.

  13. Further optimize nbtree search scan key comparisons.

  14. Add nbtree skip scan optimization.

  15. Improve nbtree array primitive scan scheduling.

  16. nbtree: Make BTMaxItemSize into object-like macro.

  17. Show index search count in EXPLAIN ANALYZE, take 2.

  18. Make parallel nbtree index scans use an LWLock.

  19. Show index search count in EXPLAIN ANALYZE.

  20. Avoid nbtree parallel scan currPos confusion.

  21. nbtree: Remove useless 'strat' local variable.

  22. Normalize nbtree truncated high key array behavior.

  23. Refactor handling of nbtree array redundancies.

  24. Fix nbtree pgstats accounting with parallel scans.

  25. Avoid parallel nbtree index scan hangs with SAOPs.

  26. Show Parallel Bitmap Heap Scan worker stats in EXPLAIN ANALYZE

  27. Enhance nbtree ScalarArrayOp execution.

  28. Skip checking of scan keys required for directional scan in B-tree

  29. Instead of using a numberOfRequiredKeys count to distinguish required

Attachments

Hi,

Since I'd like to understand the skip scan to improve the EXPLAIN output
for multicolumn B-Tree Index[1], I began to try the skip scan with some
queries and look into the source code.

I have some feedback and comments.

(1)

At first, I was surprised to look at your benchmark result because the skip scan
index can improve much performance. I agree that there are many users to be
happy with the feature for especially OLAP use-case. I expected to use v18.


(2)

I found the cost is estimated to much higher if the number of skipped attributes
is more than two. Is it expected behavior?

# Test result. The attached file is the detail of tests.

-- Index Scan
-- The actual time is low since the skip scan works well
-- But the cost is higher than one of seqscan
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM test WHERE id3 = 101;
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_id1_id2_id3 on public.test  (cost=0.42..26562.77 rows=984 width=20) (actual time=0.051..15.533 rows=991 loops=1)
   Output: id1, id2, id3, value
   Index Cond: (test.id3 = 101)
   Buffers: shared hit=4402
 Planning:
   Buffers: shared hit=7
 Planning Time: 0.234 ms
 Execution Time: 15.711 ms
(8 rows)

-- Seq Scan
-- actual time is high, but the cost is lower than one of the above Index Scan.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM test WHERE id3 = 101;
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..12676.73 rows=984 width=20) (actual time=0.856..113.861 rows=991 loops=1)
   Output: id1, id2, id3, value
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=6370
   ->  Parallel Seq Scan on public.test  (cost=0.00..11578.33 rows=410 width=20) (actual time=0.061..102.016 rows=330 loops=3)
         Output: id1, id2, id3, value
         Filter: (test.id3 = 101)
         Rows Removed by Filter: 333003
         Buffers: shared hit=6370
         Worker 0:  actual time=0.099..98.014 rows=315 loops=1
           Buffers: shared hit=2066
         Worker 1:  actual time=0.054..97.162 rows=299 loops=1
           Buffers: shared hit=1858
 Planning:
   Buffers: shared hit=19
 Planning Time: 0.194 ms
 Execution Time: 114.129 ms
(18 rows)


I look at btcostestimate() to find the reason and found the bound quals
and cost.num_sa_scans are different from my expectation.

My assumption is
* bound quals is id3=XXX (and id1 and id2 are skipped attributes)
* cost.num_sa_scans = 100 (=10*10 because assuming 10 primitive index scans
                          per skipped attribute)

But it's wrong. The above index scan result is
* bound quals is NULL
* cost.num_sa_scans = 1


As I know you said the below, but I'd like to know the above is expected or not.

> That approach seems far more practicable than preempting the problem
> during planning or during nbtree preprocessing. It seems like it'd be
> very hard to model the costs statistically. We need revisions to
> btcostestimate, of course, but the less we can rely on btcostestimate
> the better. As I said, there are no new index paths generated by the
> optimizer for any of this.

I couldn't understand why there is the below logic well.

  btcostestimate()
  (...omit...)
  			if (indexcol != iclause->indexcol)
  			{
  				/* no quals at all for indexcol */
  				found_skip = true;
  				if (index->pages < 100)
  					break;
  				num_sa_scans += 10 * (indexcol - iclause->indexcol);     // why add minus value?
  				continue;                                                  // why skip to add bound quals?
  			}


(3)

Currently, there is an assumption that "there will be 10 primitive index scans
per skipped attribute". Is any chance to use pg_stats.n_distinct?

[1] Improve EXPLAIN output for multicolumn B-Tree Index
https://www.postgresql.org/message-id/flat/TYWPR01MB1098260B694D27758FE2BA46FB1C92%40TYWPR01MB10982.jpnprd01.prod.outlook.com

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION