Thread

  1. Re: Allow to collect statistics on virtual generated columns

    Yugo Nagata <nagata@sraoss.co.jp> — 2025-12-31T10:44:55Z

    On Tue, 2 Sep 2025 16:33:41 +0900
    Yugo Nagata <nagata@sraoss.co.jp> wrote:
    
    > > > > This allows to collect statistics on virtual generated columns.
    > > > > 
    > > > > During ANALYZE, generation expressions are expanded, and statistics are computed
    > > > > using compute_expr_stats(). To support this, both compute_expr_stats() and AnlExprData
    > > > > are now exported from extended_stats.c. However, since they are no longer specific
    > > > > to extended statistics, it might be better to move them to analyze.c and vacuum.h.
    > > > > 
    > > > > To enable the optimizer to make use of these statistics, a new field named
    > > > > virtual_gencols is added to RelOptInfo. This field holds the expressions of
    > > > > virtual generated columns in the table. In examine_variable(), if an expression
    > > > > in a WHERE clause matches a virtual generated column, the corresponding statistics
    > > > > are used for that expression.
    > > > > 
    > > > > Example:
    > > > > 
    > > > > - Before applying the patch, the cardinality estimate is erroneous.
    > > > > 
    > > > > test=# create table t (i int, j int generated always as (i*10) virtual);
    > > > > CREATE TABLE
    > > > > test=# insert into t select generate_series(1,1000);
    > > > > INSERT 0 1000
    > > > > test=# insert into t select 1 from generate_series(1,1000);
    > > > > INSERT 0 1000
    > > > > test=# analyze t;
    > > > > ANALYZE
    > > > > test=# explain analyze select * from t where j = 10;
    > > > >                                            QUERY PLAN                                            
    > > > > -------------------------------------------------------------------------------------------------
    > > > >  Seq Scan on t  (cost=0.00..0.02 rows=1 width=8) (actual time=0.031..0.806 rows=1001.00 loops=1)
    > > > >    Filter: ((i * 10) = 10)
    > > > >    Rows Removed by Filter: 999
    > > > >    Buffers: shared hit=9
    > > > >  Planning:
    > > > >    Buffers: shared hit=10
    > > > >  Planning Time: 0.299 ms
    > > > >  Execution Time: 0.948 ms
    > > > > (8 rows)
    > > > > 
    > > > > 
    > > > > - After applying the patch, the cardinality estimate is correct.
    > > > > 
    > > > > test=# analyze t;
    > > > > ANALYZE
    > > > > test=# explain analyze select * from t where j = 10;
    > > > >                                              QUERY PLAN                                              
    > > > > -----------------------------------------------------------------------------------------------------
    > > > >  Seq Scan on t  (cost=0.00..41.50 rows=1001 width=8) (actual time=0.034..0.871 rows=1001.00 loops=1)
    > > > >    Filter: ((i * 10) = 10)
    > > > >    Rows Removed by Filter: 999
    > > > >    Buffers: shared hit=9
    > > > >  Planning:
    > > > >    Buffers: shared hit=6
    > > > >  Planning Time: 0.374 ms
    > > > >  Execution Time: 1.028 ms
    > > > > (8 rows)
    > > > > 
    > > > > 
    > > > > Note that the patch is still a work in progress, so documentation and tests are not included.
    > > > 
    > > > I've attached an updated patch.
    > > > 
    > > > I modified the documentation to remove the statement that virtual generated columns
    > > > do not have statistics.
    > > > 
    > > > In addition, I added a test to ensure that statistics on virtual generated columns
    > > > are available.
    > 
    
    I've attached an updated patch that fixes the broken test since 10c4fe074a.
    
    Regards,
    Yugo Nagata
    
    -- 
    Yugo Nagata <nagata@sraoss.co.jp>