Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com>

From: Corey Huinker <corey.huinker@gmail.com>

To: Michael Paquier <michael@paquier.xyz>

Cc: jian he <jian.universality@gmail.com>, Tomas Vondra <tomas@vondra.me>, pgsql-hackers@lists.postgresql.org, tgl@sss.pgh.pa.us

Date: 2025-10-22T11:55:31Z

Lists: pgsql-hackers

Commits

Add test doing some cloning of extended statistics data
- fc365e4fccc4 19 (unreleased) landed
Add test for pg_restore_extended_stats() with multiranges
- 0b7beec42ae2 19 (unreleased) landed
Add support for "mcv" in pg_restore_extended_stats()
- efbebb4e8587 19 (unreleased) landed
Include extended statistics data in pg_dump
- c32fb29e979d 19 (unreleased) landed
Add support for "dependencies" in pg_restore_extended_stats()
- 302879bd68d1 19 (unreleased) landed
Add test for MAINTAIN permission with pg_restore_extended_stats()
- d9abd9e1050d 19 (unreleased) landed
Add pg_restore_extended_stats()
- 0e80f3f88dea 19 (unreleased) landed
Add routine to free MCVList
- 7ebb64c55757 19 (unreleased) landed
Improve pg_clear_extended_stats() with incorrect relation/stats combination
- 395b73c045e0 19 (unreleased) landed
Add pg_clear_extended_stats()
- d756fa1019ff 19 (unreleased) landed
Introduce routines to validate and free MVNDistinct and MVDependencies
- 32e27bd32082 19 (unreleased) landed
Fix typo in stat_utils.c
- eee19a30d60d 19 (unreleased) landed
Move attribute statistics functions to stat_utils.c
- 213a1b895270 19 (unreleased) landed
Improve error messages of input functions for pg_dependencies and pg_ndistinct
- f68597ee777d 19 (unreleased) landed
Improve test output of extended statistics for ndistinct and dependencies
- 2f04110225ab 19 (unreleased) landed
Fix some compiler warnings
- 7bc88c3d6f3a 19 (unreleased) landed
Add input function for data type pg_dependencies
- e1405aa5e3ac 19 (unreleased) landed
Add input function for data type pg_ndistinct
- 44eba8f06e55 19 (unreleased) landed
Rework output format of pg_dependencies
- e76defbcf09e 19 (unreleased) landed
Rework output format of pg_ndistinct
- 1f927cce4498 19 (unreleased) landed
Fix comments of output routines for pg_ndistinct and pg_dependencies
- 040a39ed25bf 19 (unreleased) landed
Move code specific to pg_dependencies to new file
- 2ddc8d9e9baa 19 (unreleased) landed
Move code specific to pg_ndistinct to new file
- a5523123430f 19 (unreleased) landed
Document some structures in attribute_stats.c
- d6c132d83bff 19 (unreleased) landed
Fix FATAL message for invalid recovery timeline at beginning of recovery
- 71f17823ba01 18.0 cited

>
> The functions exposed in 0003 should be renamed to match more with the
> style of the rest, aka it is a bit hard to figure out what they do at
> first sight.  Presumably, these should be prefixed with some
> "statext_", except text_to_stavalues() which could still be named the
> same.
>

That prefix would probably be statatt_ or statattr_.


>
> Do you have some numbers regarding the increase in size this generates
> for the catalogs?
>

Sorry, I don't understand. There shouldn't be any increase inside the
catalogs as the internal storage of the datatypes hasn't changed, so I can
only conclude that you're referring to something else.


>
> 0004 has been designed following the same model as the relation and
> attribute stats.  That sounds OK here.
>
> +enum extended_stats_argnum
> [...]
> +enum extended_stats_exprs_element
>
> It would be nice to document why such things are around.  That would
> be less guessing for somebody reading the code.
>

The equivalent structures in attribute_stats.c will need documenting too.




>
> Reusing this small sequence from your pg_dump patch, executed on a v14
> backend:
> create schema dump_test;
> CREATE TABLE dump_test.has_ext_stats
>   AS SELECT g.g AS x, g.g / 2 AS y FROM generate_series(1,100) AS g(g);
> CREATE STATISTICS dump_test.es1 ON x, (y % 2) FROM dump_test.has_ext_stats;
> ANALYZE dump_test.has_ext_stats;
>
> Then pg_dump fails:
> pg_dump: error: query failed: ERROR:  column e.inherited does not exist
> LINE 2: ...hemaname = $1 AND e.statistics_name = $2 ORDER BY e.inherite...
>

Noted.
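
One way to avoid this kind of failure is to reference the column only when
the server has it. The following is a standalone sketch, not pg_dump code:
build_ext_stats_query() is a hypothetical helper, the query text is
abbreviated, and the 150000 cutoff assumes pg_stats_ext.inherited first
appeared in v15 (the report above only establishes that v14 lacks it).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build an abbreviated extended-stats query for a
 * server reporting the given server_version_num.
 *
 * Assumption: pg_stats_ext.inherited exists only on v15 and later, so older
 * servers get a constant false in its place.
 *
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int
build_ext_stats_query(char *buf, size_t buflen, int remote_version)
{
	const char *inherited_expr;
	int			n;

	assert(buf != NULL);

	if (remote_version >= 150000)
		inherited_expr = "e.inherited";
	else
		inherited_expr = "false AS inherited";

	/* ORDER BY 1 sorts on the first output column in both variants. */
	n = snprintf(buf, buflen,
				 "SELECT %s, e.n_distinct, e.dependencies "
				 "FROM pg_catalog.pg_stats_ext e "
				 "WHERE e.statistics_schemaname = $1 "
				 "AND e.statistics_name = $2 "
				 "ORDER BY 1",
				 inherited_expr);

	if (n < 0 || (size_t) n >= buflen)
		return -1;
	return n;
}
```

Substituting a constant keeps the result shape identical across versions, so
the code consuming the rows would not need its own version check.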


>
>
> +        * TODO: Until v18 is released the master branch has a
> +        * server_version_num of 180000. We will update this to 190000
> +        * as soon as the master branch updates.
>
> This part has not been updated.
>
> +       Assert(item.nattributes > 0);   /* TODO: elog? */
> [...]
> +       Assert(dependency->nattributes > 1);    /* TODO: elog? */
> Yes and yes.  It seems like it should be possible to craft some input
> that triggers these.
>

+1
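
The reason these need to be runtime checks is that Assert() is compiled out
of production builds, so crafted input would pass straight through. As a
standalone sketch, not PostgreSQL code: DepItem and validate_dep_item() are
hypothetical stand-ins, and in the backend the failure branch would raise an
error through elog/ereport rather than hand back a message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one parsed functional-dependency item. */
typedef struct DepItem
{
	int			nattributes;	/* number of attributes in the dependency */
} DepItem;

/*
 * Validate an item that was built from user-supplied input.
 *
 * Returns true if the item is usable.  Otherwise returns false and points
 * *errmsg at a static message describing the problem.
 */
static bool
validate_dep_item(const DepItem *item, const char **errmsg)
{
	/* A NULL pointer here is a caller bug, so an assertion is fine. */
	assert(item != NULL && errmsg != NULL);

	/* The attribute count comes from the input, so check it at runtime. */
	if (item->nattributes < 2)
	{
		*errmsg = "dependency must reference at least two attributes";
		return false;
	}

	*errmsg = NULL;
	return true;
}
```

The split is the usual one: assertions guard conditions only a programming
mistake can violate, while anything reachable from input gets a real error.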


>
> +void
> +free_pg_dependencies(MVDependencies *dependencies);
>
> Double declaration of this routine in dependencies.c.
>
> Perhaps some of the regression tests could use some jsonb_pretty() in
> the outputs generated.  Some of the results generated are very hard to
> parse, something that would become harder in the buildfarm.  This
> comment starts with 0001 for stxdndistinct.
>

Can do.


>
> I have mixed feelings about 0005, FWIW.  I am wondering if we should
> not lift the needle a bit here and only support the dump of extended
> statistics when dealing with a backend of at least v19.  This would
> mean that we would only get the full benefit of this feature once
> people upgrade to v20 or dump from a pg_dump with --statistics from at
> least v19, but with the long-term picture in mind this would also make
> the dump/restore picture of the patch dead simple (spoiler: I like
> simple).
>

I also like simple.

Right now the vast majority of databases can carry forward all of their
stats via pg_upgrade; the exception is databases that have extended stats.
The trouble is that most customers don't know whether their database uses
extended statistics, and those whose databases do are in for some bad query
plans if they haven't run vacuumdb --missing-stats-only. Explaining that to
customers is complicated, especially when most of them do not know what
extended stats are, let alone whether they have them. It would be a lot
simpler to just say "all stats are carried over on upgrade"; vacuumdb would
then be unnecessary, making upgrades one step simpler as well.

Given that, I think the admittedly ugly transformation is worth it, and
sequestering it inside pg_dump gives it the smallest possible footprint.
Earlier in this thread I posted some functions that translate the existing
formats to the proposed new formats. We could include those as new system
functions, and that would make the dump code very simple. Having said that,
I don't know that there would be a use for those functions except inside
pg_dump, hence the decision to do the transforms right in the dump query.
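
To give a sense of the size of the transformation, here is a rough
standalone sketch for the ndistinct case. It is not the set of functions
posted earlier in the thread: translate_ndistinct() and emit() are
hypothetical names, and the target shape (an array of objects with
"attributes" and "ndistinct" keys) is my assumption about the new format.
The key text is copied through without validating the attribute numbers.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Append formatted text to out; returns -1 if it does not fit. */
static int
emit(char *out, size_t outlen, size_t *used, const char *fmt, ...)
{
	va_list		ap;
	int			n;

	va_start(ap, fmt);
	n = vsnprintf(out + *used, outlen - *used, fmt, ap);
	va_end(ap);
	if (n < 0 || (size_t) n >= outlen - *used)
		return -1;
	*used += (size_t) n;
	return 0;
}

/*
 * Hypothetical translator from the old pg_ndistinct text form,
 *   {"1, 2": 4, "1, 3": 5}
 * to the array-of-objects form assumed here for the new format,
 *   [{"attributes": [1, 2], "ndistinct": 4}, ...]
 * Returns 0 on success, -1 on malformed input or a too-small buffer.
 */
static int
translate_ndistinct(const char *in, char *out, size_t outlen)
{
	const char *p = in;
	size_t		used = 0;
	int			nitems = 0;

	assert(in != NULL && out != NULL);

	p += strspn(p, " \t\n");
	if (*p++ != '{')
		return -1;
	if (emit(out, outlen, &used, "[") < 0)
		return -1;

	for (;;)
	{
		const char *key_end;
		const char *q;
		char	   *num_end;
		long		ndistinct;

		p += strspn(p, " \t\n");
		if (*p == '}' && nitems == 0)
			break;				/* empty object */
		if (*p++ != '"')
			return -1;
		key_end = strchr(p, '"');
		if (key_end == NULL)
			return -1;

		/* After the key: optional spaces, a colon, then the integer. */
		q = key_end + 1;
		q += strspn(q, " \t\n");
		if (*q++ != ':')
			return -1;
		ndistinct = strtol(q, &num_end, 10);
		if (num_end == q)
			return -1;

		if (emit(out, outlen, &used,
				 "%s{\"attributes\": [%.*s], \"ndistinct\": %ld}",
				 nitems > 0 ? ", " : "",
				 (int) (key_end - p), p, ndistinct) < 0)
			return -1;
		nitems++;

		/* Either the object ends here or another item follows. */
		p = num_end + strspn(num_end, " \t\n");
		if (*p == '}')
			break;
		if (*p++ != ',')
			return -1;
	}
	return emit(out, outlen, &used, "]");
}
```

The dependencies case would need the same treatment plus splitting each key
on "=>", which is where most of the ugliness in the dump query comes from.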

If the format translation is a barrier to fetching existing extended stats,
then I'd be more inclined to keep the existing pg_ndistinct and
pg_dependencies data formats as they are now.