Thread

  1. Estimating bloat for very large tables: what is the state of art?

    Dmytro Astapov <dastapov@gmail.com> — 2022-09-12T20:14:06Z

    Hi!
    
    I am trying to solve the problem of estimating the table bloat (and index
    bloat, though I am mostly focusing on tables at the moment).
    
    After searching far and wide, it seems that the choice is to be made
    between two methods:
    1. Slow, but very precise pgstattuple
    2. Fast, but somewhat imprecise "bloat query" which is attributed to
    check_postgres <https://bucardo.org/check_postgres/> project, though there
    are numerous
    <https://www.citusdata.com/blog/2017/10/20/monitoring-your-bloat-in-postgres/>
    variations <https://github.com/pgexperts/pgx_scripts/tree/master/bloat> in
    existence.
    
    pgstattuple is beautiful and accurate but rather slow. If tables are large,
    pgstattuple_approx could easily take 5-10 minutes, and if that were the
    case, you can see pgstattuple to take 30-60 minutes on the same table
    easily.
    
    "Bloat query", on the other hand, is wonderfully fast, but rather
    imprecise. It tries to estimate the table data size as pg_class.reltuples *
    row_width, where row_width is taken, roughly, to be (24 bytes for the
    header + size of NULL map + (sum( (1 - null_frac)*avg_width ) for all
    columns in the table, as reported by pg_statistics)).
    
    This, of course, completely ignores the question of padding and so on
    tables with a large number of columns the query tends to underestimate the
    size of live data by some 10-20% (unless schema was explicitly created to
    minimize padding).
    
    I'd like to ask you:
    1. Are these indeed two approaches the only options on the table, or am I
    missing something?
    
    2. I am considering my own approach where, after looking at pg_attributes
    and pg_stats, I am constructing "an example row from this table with no
    nulls" (so, max amount of data + max amount of padding) and "an example row
    from the table with all the NULLs" (so, as little padding as possible), do
    pg_column_size() on both these rows (so that pg_column_size could compute
    size+padding for me) and then take an average between them, perhaps
    weighted somehow by examining null_frac of table columns. Quick experiments
    show that this yields a more accurate estimate of row size for tables with
    large numbers of columns than what the "bloat query" does.  Question: can I
    do anything better/easier here without sacrificing speed?
    
    -- 
    D. Astapov