Re: Extended Statistics set/restore/clear functions.

From: Tomas Vondra <tomas@vondra.me>

To: Corey Huinker <corey.huinker@gmail.com>, pgsql-hackers@lists.postgresql.org

Date: 2025-01-22T22:50:57Z

Lists: pgsql-hackers

Commits

Add test doing some cloning of extended statistics data
- fc365e4fccc4 19 (unreleased) landed
Add test for pg_restore_extended_stats() with multiranges
- 0b7beec42ae2 19 (unreleased) landed
Add support for "mcv" in pg_restore_extended_stats()
- efbebb4e8587 19 (unreleased) landed
Include extended statistics data in pg_dump
- c32fb29e979d 19 (unreleased) landed
Add support for "dependencies" in pg_restore_extended_stats()
- 302879bd68d1 19 (unreleased) landed
Add test for MAINTAIN permission with pg_restore_extended_stats()
- d9abd9e1050d 19 (unreleased) landed
Add pg_restore_extended_stats()
- 0e80f3f88dea 19 (unreleased) landed
Add routine to free MCVList
- 7ebb64c55757 19 (unreleased) landed
Improve pg_clear_extended_stats() with incorrect relation/stats combination
- 395b73c045e0 19 (unreleased) landed
Add pg_clear_extended_stats()
- d756fa1019ff 19 (unreleased) landed
Introduce routines to validate and free MVNDistinct and MVDependencies
- 32e27bd32082 19 (unreleased) landed
Fix typo in stat_utils.c
- eee19a30d60d 19 (unreleased) landed
Move attribute statistics functions to stat_utils.c
- 213a1b895270 19 (unreleased) landed
Improve error messages of input functions for pg_dependencies and pg_ndistinct
- f68597ee777d 19 (unreleased) landed
Improve test output of extended statistics for ndistinct and dependencies
- 2f04110225ab 19 (unreleased) landed
Fix some compiler warnings
- 7bc88c3d6f3a 19 (unreleased) landed
Add input function for data type pg_dependencies
- e1405aa5e3ac 19 (unreleased) landed
Add input function for data type pg_ndistinct
- 44eba8f06e55 19 (unreleased) landed
Rework output format of pg_dependencies
- e76defbcf09e 19 (unreleased) landed
Rework output format of pg_ndistinct
- 1f927cce4498 19 (unreleased) landed
Fix comments of output routines for pg_ndistinct and pg_dependencies
- 040a39ed25bf 19 (unreleased) landed
Move code specific to pg_dependencies to new file
- 2ddc8d9e9baa 19 (unreleased) landed
Move code specific to pg_ndistinct to new file
- a5523123430f 19 (unreleased) landed
Document some structures in attribute_stats.c
- d6c132d83bff 19 (unreleased) landed
Fix FATAL message for invalid recovery timeline at beginning of recovery
- 71f17823ba01 18.0 cited

Hi,

Thanks for continuing to work on this.

On 1/22/25 19:17, Corey Huinker wrote:
> This is a separate thread for work started in [1] but focused purely on
> getting the following functions working:
> 
> * pg_set_extended_stats
> * pg_clear_extended_stats
> * pg_restore_extended_stats
> 
> These functions are analogous to their relation/attribute counterparts,
> use the same calling conventions, and build upon the same basic
> infrastructure.
> 
> I think it is important that we get these implemented because they close
> the gap that was left in terms of the ability to modify existing
> statistics and to round out the work being done to carry over statistics
> via dump/restore and pg_upgrade in [1].
> 
> The purpose of each patch is as follows (adapted from previous thread):
> 
> 0001 - This makes the input function for pg_ndistinct functional.
> 
> 0002 - This makes the input function for pg_dependencies functional.
> 

I only quickly skimmed the patches, but a couple comments:

1) I think it makes perfect sense to use the JSON parsing for the input
functions, but maybe it'd be better to adjust the format a bit to make
that even easier?

Right now the JSON "keys" have structure, which means we need some ad
hoc parsing. Maybe we should make it "proper JSON" by moving that into
separate key/value, e.g. for ndistinct we might replace this:

  {"1, 2": 2323, "1, 3" : 3232, ...}

with this:

  [ {"keys": [1, 2], "ndistinct" : 2323},
    {"keys": [1, 3], "ndistinct" : 3232},
    ... ]

so a regular JSON array of objects, with "keys" being an array. And
similarly for dependencies.

Yes, it's more verbose, but maybe better for "mechanical" processing?
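
To illustrate the difference, here is a minimal sketch (in Python rather
than C, just to keep it short) that converts the current ndistinct format
into the proposed one. The function name is made up for this example and
is not part of any patch. Note that the only non-JSON step is splitting
the key string, which is exactly the ad hoc parsing the new format would
avoid:

```python
import json


def ndistinct_to_array_format(legacy_text):
    """Convert {"1, 2": 2323, ...} into [{"keys": [1, 2], "ndistinct": 2323}, ...].

    Hypothetical helper, for illustration only.
    """
    legacy = json.loads(legacy_text)
    items = []
    for key, value in legacy.items():
        # The attnums are packed into the key string, so they have to be
        # split and converted by hand; int() tolerates surrounding spaces.
        attnums = [int(part) for part in key.split(",")]
        items.append({"keys": attnums, "ndistinct": value})
    return items


legacy = '{"1, 2": 2323, "1, 3": 3232}'
converted = ndistinct_to_array_format(legacy)
print(json.dumps(converted))
```

With the array format, a consumer gets the attnums as a list of integers
directly from the JSON parser, with no second parsing step.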

2) Do we need some sort of validation? Perhaps this was discussed in the
other thread and I missed that, but isn't it a problem that we happily
accept e.g. this?

  {"6666, 6666" : 1, "1, -222": 14, ...}

That has duplicate keys with bogus attribute numbers, stats on (bogus)
system attributes, etc. I suspect this may easily cause problems during
planning (if it happens to visit those statistics).

Maybe that's acceptable - ultimately the user could import something
broken in a much subtler way, of course. But pg_set_attribute_stats
seems somewhat more protected against this, because it gets the attribute
as a separate argument.
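
As a rough idea of the kind of checks I have in mind, here is a sketch
in Python (not how the patch does it, and the function name is invented).
It assumes the array-of-objects format from point 1, and it assumes that
only plain user columns numbered 1..natts are valid, so statistics on
expressions would need their own rule:

```python
def validate_ndistinct_items(items, natts):
    """Return a list of problems found in array-format ndistinct items.

    Hypothetical helper. natts is the number of columns in the target
    table; anything outside 1..natts is treated as bogus (assumption).
    """
    problems = []
    seen = set()
    for item in items:
        keys = item["keys"]
        for attnum in keys:
            if attnum < 1 or attnum > natts:
                problems.append(f"attnum {attnum} out of range 1..{natts}")
        # The same attnum must not appear twice within one combination.
        if len(set(keys)) != len(keys):
            problems.append(f"duplicate attnum in {keys}")
        # The same combination must not be listed twice.
        combo = tuple(sorted(keys))
        if combo in seen:
            problems.append(f"combination {keys} listed more than once")
        seen.add(combo)
    return problems


bogus = [
    {"keys": [6666, 6666], "ndistinct": 1},
    {"keys": [1, -222], "ndistinct": 14},
]
for problem in validate_ndistinct_items(bogus, natts=3):
    print(problem)
```

Checks like these need the relation (to know natts), which the type
input function does not have, so they would have to live in the
set/restore functions rather than in the input function itself.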

I recall I wished to have the attnum in the output function, but that
was not quite possible because we don't know the relid (and thus the
descriptor) in that function.

Is it a good idea to rely on the input/output format directly? How will
that deal with cross-version differences? Doesn't it mean the in/out
format is effectively fixed, or at least has to be backwards compatible
(i.e. new version has to handle any value from older versions)?

Or what if I want to import the stats for a table with slightly
different structure (e.g. because dump/restore skips dropped columns).
Won't that be a problem with the format containing raw attnums? Or is
this a use case we don't expect to work?
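
To make the dropped-column case concrete, here is a small Python sketch
of remapping attnums through column names. This is purely illustrative:
the function and the two column maps are hypothetical, and nothing like
this exists in the patch as far as I can tell. It again assumes the
array-of-objects format from point 1:

```python
def remap_attnums(items, source_columns, target_columns):
    """Translate attnums from the source table to the target table.

    Hypothetical helper. source_columns maps attnum -> column name in the
    table the stats were exported from; target_columns maps column name
    -> attnum in the table they are imported into. A column missing on
    either side raises KeyError.
    """
    remapped = []
    for item in items:
        new_keys = [target_columns[source_columns[attnum]] for attnum in item["keys"]]
        remapped.append({**item, "keys": new_keys})
    return remapped


# Source table: a (1), a dropped column (2), b (3), c (4).
source_columns = {1: "a", 3: "b", 4: "c"}
# After dump/restore the dropped column is gone and attnums are compacted.
target_columns = {"a": 1, "b": 2, "c": 3}

exported = [{"keys": [1, 3], "ndistinct": 2323}, {"keys": [3, 4], "ndistinct": 3232}]
print(remap_attnums(exported, source_columns, target_columns))
```

Without a step like this, the raw attnums (1, 3) from the source would
silently point at columns a and c in the restored table instead of a and b.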

For the per-attribute stats it's probably fine, because that's mostly
just a collection of regular data types (scalar values or arrays of
values, ...) and we're not modifying them except for maybe adding new
fields. But extended stats seem more complex, so maybe it's different?

I remember a long discussion about the format at the very beginning of
this patch series, and the conclusion clearly was to have a function
that imports stats for one attribute at a time. And that seems to be
working fine, but extended stats values have more internal structure, so
perhaps they need to do something more complicated.

> 0003 - Makes several static functions in attribute_stats.c public for use
> by extended stats. One of those is get_stat_attr_type(), which in the last
> patchset was modified to take an attribute name rather than attnum, thus
> saving a syscache lookup. However, extended stats identifies attributes by
> attnum not name, so that optimization had to be set aside, at least
> temporarily.
> 
> 0004 - These implement the functions pg_set_extended_stats(),
> pg_clear_extended_stats(), and pg_restore_extended_stats() and behave like
> their relation/attribute equivalents. If we can get these committed and
> used by pg_dump, then we don't have to debate how to handle post-upgrade
> steps for users who happen to have extended stats vs the approximately
> 99.75% of users who do not have extended stats.
> 

I see there are a couple of MCV-specific functions in extended_stats.c.
Shouldn't those go into mcv.c instead?

FWIW there's a bunch of whitespace issues during git apply.

> This patchset does not presently include any work to integrate these
> functions into pg_dump, but may do so once that work is settled, or it
> may become its own thread.
> 

OK. Thanks for the patch!

regards

-- 
Tomas Vondra