Thread

Estimating bloat for very large tables: what is the state of art?

Dmytro Astapov <dastapov@gmail.com> — 2022-09-12T20:14:06Z
Hi!

I am trying to solve the problem of estimating the table bloat (and index
bloat, though I am mostly focusing on tables at the moment).

After searching far and wide, it seems that the choice is to be made
between two methods:
1. Slow, but very precise pgstattuple
2. Fast, but somewhat imprecise "bloat query" which is attributed to
check_postgres <https://bucardo.org/check_postgres/> project, though there
are numerous
<https://www.citusdata.com/blog/2017/10/20/monitoring-your-bloat-in-postgres/>
variations <https://github.com/pgexperts/pgx_scripts/tree/master/bloat> in
existence.

pgstattuple is beautiful and accurate but rather slow. If tables are large,
pgstattuple_approx could easily take 5-10 minutes, and if that were the
case, you can see pgstattuple to take 30-60 minutes on the same table
easily.

"Bloat query", on the other hand, is wonderfully fast, but rather
imprecise. It tries to estimate the table data size as pg_class.reltuples *
row_width, where row_width is taken, roughly, to be (24 bytes for the
header + size of NULL map + (sum( (1 - null_frac)*avg_width ) for all
columns in the table, as reported by pg_statistics)).

This, of course, completely ignores the question of padding and so on
tables with a large number of columns the query tends to underestimate the
size of live data by some 10-20% (unless schema was explicitly created to
minimize padding).

I'd like to ask you:
1. Are these indeed two approaches the only options on the table, or am I
missing something?

2. I am considering my own approach where, after looking at pg_attributes
and pg_stats, I am constructing "an example row from this table with no
nulls" (so, max amount of data + max amount of padding) and "an example row
from the table with all the NULLs" (so, as little padding as possible), do
pg_column_size() on both these rows (so that pg_column_size could compute
size+padding for me) and then take an average between them, perhaps
weighted somehow by examining null_frac of table columns. Quick experiments
show that this yields a more accurate estimate of row size for tables with
large numbers of columns than what the "bloat query" does.  Question: can I
do anything better/easier here without sacrificing speed?

-- 
D. Astapov