Thread

  1. Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

    Lukas Fittl <lukas@fittl.com> — 2023-03-05T00:20:59Z

    Hi,
    
    I was debugging a planner problem on Postgres 14.4 the other day - and the
    involved "bad" plan was including Memoize - though I don't necessarily
    think that Memoize is to blame (and this isn't any of the problems recently
    fixed in Memoize costing).
    
    However, what I noticed whilst trying different ways to fix the plan, is
    that the Memoize output was a bit hard to reason about - especially since
    the plan involving Memoize was expensive to run, and so I was mostly
    running EXPLAIN without ANALYZE to look at the costing.
    
    Here is an example of the output I was looking at:
    
         ->  Nested Loop  (cost=1.00..971672.56 rows=119623 width=0)
               ->  Index Only Scan using table1_idx on table1
    (cost=0.43..372676.50 rows=23553966 width=8)
               ->  Memoize  (cost=0.57..0.61 rows=1 width=8)
                     Cache Key: table1.table2_id
                     Cache Mode: logical
                     ->  Index Scan using table2_idx on table2
    (cost=0.56..0.60 rows=1 width=8)
                           Index Cond: (id = table1.table2_id)
    
    The other plan I was comparing with (that I wanted the planner to choose
    instead), had a total cost of 1,451,807.35 -- and so I was trying to figure
    out why the Nested Loop was costed as 971,672.56.
    
    Simple math makes me expect the Nested Loop should roughly have a total
    cost of14,740,595.76 here (372,676.50 + 23,553,966 * 0.61), ignoring a lot
    of the smaller costs. Thus, in this example, it appears Memoize made the
    plan cost significantly cheaper (roughly 6% of the regular cost).
    
    Essentially this comes down to the "cost reduction" performed by Memoize
    only being implicitly visible in the Nested Loop's total cost - and with
    nothing useful on the Memoize node itself - since the rescan costs are not
    shown.
    
    I think explicitly adding the estimated cache hit ratio for Memoize nodes
    might make this easier to reason about, like this:
    
    ->  Memoize  (cost=0.57..0.61 rows=1 width=8)
         Cache Key: table1.table2_id
         Cache Mode: logical
         Cache Hit Ratio Estimated: 0.94
    
    Alternatively (or in addition) we could consider showing the "ndistinct"
    value that is calculated in cost_memoize_rescan - since that's the most
    significant contributor to the cache hit ratio (and you can influence that
    directly by improving the ndistinct statistics).
    
    See attached a patch that implements showing the cache hit ratio as a
    discussion starter.
    
    I'll park this in the July commitfest for now.
    
    Thanks,
    Lukas
    
    -- 
    Lukas Fittl