Re: Extended Statistics set/restore/clear functions.

From: Michael Paquier <michael@paquier.xyz>

To: Corey Huinker <corey.huinker@gmail.com>

Cc: jian he <jian.universality@gmail.com>, Tomas Vondra <tomas@vondra.me>, pgsql-hackers@lists.postgresql.org, tgl@sss.pgh.pa.us

Date: 2025-11-17T23:19:58Z

Lists: pgsql-hackers

Commits

Add test doing some cloning of extended statistics data
- fc365e4fccc4 19 (unreleased) landed
Add test for pg_restore_extended_stats() with multiranges
- 0b7beec42ae2 19 (unreleased) landed
Add support for "mcv" in pg_restore_extended_stats()
- efbebb4e8587 19 (unreleased) landed
Include extended statistics data in pg_dump
- c32fb29e979d 19 (unreleased) landed
Add support for "dependencies" in pg_restore_extended_stats()
- 302879bd68d1 19 (unreleased) landed
Add test for MAINTAIN permission with pg_restore_extended_stats()
- d9abd9e1050d 19 (unreleased) landed
Add pg_restore_extended_stats()
- 0e80f3f88dea 19 (unreleased) landed
Add routine to free MCVList
- 7ebb64c55757 19 (unreleased) landed
Improve pg_clear_extended_stats() with incorrect relation/stats combination
- 395b73c045e0 19 (unreleased) landed
Add pg_clear_extended_stats()
- d756fa1019ff 19 (unreleased) landed
Introduce routines to validate and free MVNDistinct and MVDependencies
- 32e27bd32082 19 (unreleased) landed
Fix typo in stat_utils.c
- eee19a30d60d 19 (unreleased) landed
Move attribute statistics functions to stat_utils.c
- 213a1b895270 19 (unreleased) landed
Improve error messages of input functions for pg_dependencies and pg_ndistinct
- f68597ee777d 19 (unreleased) landed
Improve test output of extended statistics for ndistinct and dependencies
- 2f04110225ab 19 (unreleased) landed
Fix some compiler warnings
- 7bc88c3d6f3a 19 (unreleased) landed
Add input function for data type pg_dependencies
- e1405aa5e3ac 19 (unreleased) landed
Add input function for data type pg_ndistinct
- 44eba8f06e55 19 (unreleased) landed
Rework output format of pg_dependencies
- e76defbcf09e 19 (unreleased) landed
Rework output format of pg_ndistinct
- 1f927cce4498 19 (unreleased) landed
Fix comments of output routines for pg_ndistinct and pg_dependencies
- 040a39ed25bf 19 (unreleased) landed
Move code specific to pg_dependencies to new file
- 2ddc8d9e9baa 19 (unreleased) landed
Move code specific to pg_ndistinct to new file
- a5523123430f 19 (unreleased) landed
Document some structures in attribute_stats.c
- d6c132d83bff 19 (unreleased) landed
Fix FATAL message for invalid recovery timeline at beginning of recovery
- 71f17823ba01 18.0 cited

On Mon, Nov 17, 2025 at 12:18:55PM -0500, Corey Huinker wrote:
> On Mon, Nov 17, 2025 at 1:56 AM Michael Paquier <michael@paquier.xyz> wrote:
> Though, I was thinking some more about the output format. Using
> jsonb_pretty() makes it readable in one way, and very clumsy in other ways.
> Instead, I'm going to try doing the following:
> 
> replace (ndist_string_value, '},', E'}\n,')
> 
> This will result in the output value being formatted in exactly the way
> described in the commit messages.
> 
> Of course, we could make that the actual default format by changing
> appendStringInfoString(&str, ", ") instead.

This feels like yet another pretty-but-still-compressed output format for JSON.
I don't think we should change the output functions to do that, but if
you want to add a function that filters these contents a bit in the
tests for the input functions, sure, why not.
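
For illustration, here is a minimal standalone sketch of what such a
test-side filter does, equivalent in effect to the quoted replace()
call.  The function name break_after_objects() is hypothetical and is
not part of the patch set; it only shows the transformation.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Return a newly allocated copy of "in" where every "}," becomes "}\n,",
 * mirroring replace(value, '},', E'}\n,').  The caller frees the result.
 * Hypothetical helper, for illustration only.
 */
char *
break_after_objects(const char *in)
{
    size_t      len = strlen(in);
    /* Worst case: one extra byte per input byte, plus the terminator. */
    char       *out = malloc(2 * len + 1);
    char       *dst = out;

    if (out == NULL)
        return NULL;

    for (const char *src = in; *src != '\0'; src++)
    {
        *dst++ = *src;
        /* src[1] is always readable: at worst it is the terminator. */
        if (src[0] == '}' && src[1] == ',')
            *dst++ = '\n';
    }
    *dst = '\0';
    return out;
}
```

The output functions stay untouched; only the text shown in the
regression output is reshaped.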

> One might argue that the output shouldn't get too flowery, but we're
> already adding spaces between items and array elements, and we've already
> made extensive changes favoring readability over compactness.

I'd still keep it without newlines, FWIW.  So what we have in the
output functions is OK for me.  The key names could be updated to
something else for this release; I'm open to suggestions, and we have
time before this release.  It would be nice not to do rename tweaks
several times.

> I'm curious about the re-parameterization of error messages involving
> PG_NDISTINCT_KEY_ATTRIBUTES, PG_NDISTINCT_KEY_NDISTINCT, and similar keys
> in dependencies. I like the parameterized version better, and was confused
> as to why it was removed. Did you change your mind, or was it done for ease
> of translation?

Yes, this one is to reduce the translation work, and because the
messages are quite the same across the board and deal with the same
requirements:
- Single integer expected after a key (attnum or actual value).
- Array of attributes expected after a key.
- For the degree key, float value.
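
A minimal sketch of that idea, assuming the key names "attributes",
"ndistinct", "dependency" and "degree" (they may still be renamed), and
with message wording that is illustrative only, not the committed
strings.  The enum and function names are hypothetical.

```c
#include <string.h>

/* Kinds of value a key may be followed by (hypothetical names). */
typedef enum
{
    VALUE_INTEGER,              /* attnum or actual value */
    VALUE_ATTR_ARRAY,           /* array of attributes */
    VALUE_FLOAT,                /* degree of a dependency */
    VALUE_UNKNOWN
} ExpectedValue;

/* Key names are assumptions and may change before release. */
ExpectedValue
expected_value_for_key(const char *key)
{
    if (strcmp(key, "attributes") == 0)
        return VALUE_ATTR_ARRAY;
    if (strcmp(key, "ndistinct") == 0 || strcmp(key, "dependency") == 0)
        return VALUE_INTEGER;
    if (strcmp(key, "degree") == 0)
        return VALUE_FLOAT;
    return VALUE_UNKNOWN;
}

/* One fixed string per kind: three strings to translate, not one per key. */
const char *
generic_error_for(ExpectedValue kind)
{
    switch (kind)
    {
        case VALUE_INTEGER:
            return "expected an integer value after key";
        case VALUE_ATTR_ARRAY:
            return "expected an array of attributes after key";
        case VALUE_FLOAT:
            return "expected a float value after key";
        default:
            return "unexpected key";
    }
}
```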

> I had a feeling that was going to be requested. My question would be if
> that we want to stick to modeling the other combinations after the first
> longest combination, last longest, or if we want to defer those checks
> altogether until we have to validate against an actual stats object?

I would tend to think that performing one round of validation once the
whole set of objects has been parsed is going to be cheaper than
periodic checks.
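
As a rough sketch of a single validation pass for ndistinct, assume
each parsed item has already been reduced to a bitmask of attribute
positions within the statistics object, and that a complete set holds
every group of two or more attributes exactly once.  The names below
are hypothetical and the upper bound of 8 attributes is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_ATTRS 8             /* assumed upper bound on attributes */

/* Number of set bits in a small mask. */
static int
count_bits(unsigned mask)
{
    int         n = 0;

    for (; mask != 0; mask &= mask - 1)
        n++;
    return n;
}

/*
 * Check, in one pass over the parsed items, that every group of two or
 * more attributes out of "nattrs" appears exactly once.
 */
bool
ndistinct_set_is_complete(const unsigned *items, size_t nitems, int nattrs)
{
    bool        seen[1 << MAX_ATTRS];
    size_t      expected;

    if (nattrs < 2 || nattrs > MAX_ATTRS)
        return false;

    memset(seen, 0, sizeof(seen));
    for (size_t i = 0; i < nitems; i++)
    {
        unsigned    mask = items[i];

        if (mask >= (1u << nattrs))
            return false;       /* refers to an unknown attribute */
        if (count_bits(mask) < 2)
            return false;       /* single-attribute or empty group */
        if (seen[mask])
            return false;       /* duplicate group */
        seen[mask] = true;
    }

    /* All subsets, minus the empty set and the nattrs singletons. */
    expected = ((size_t) 1 << nattrs) - (size_t) nattrs - 1;
    return nitems == expected;
}
```

Since duplicates and invalid groups are rejected inside the loop, the
final count comparison is enough to detect a missing combination.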

One other thing would be to force a sort of the elements in the array
so that they match the order in which they are generated when the stats
are created.  We cannot do that in the input functions because at that
point we have no idea about the order of the attributes in the
statistics object.  Applying a sort also sounds important to me so that
the stats are ordered the way the group generation functions (aka
generate_combinations(), etc.) produce them, which would enforce
stronger binary compatibility, once we are sure, of course, that the
input function has given us a full set of attributes in the array.  I
have briefly looked at the planner code where extended stats are used,
like selfuncs.c, and the ordering does not seem to strictly matter
there, but it's cheap enough to enforce a stricter ordering based on
the K groups of N elements generated in the import function.
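
A minimal sketch of such a sort, under the assumption that the
generators emit groups by increasing size and, within one size, in
lexicographic order of attribute position in the statistics object.
That assumed order should be checked against generate_combinations()
before relying on it.  StatsGroup and sort_groups() are hypothetical
names, and each group is assumed to hold positions in ascending order.

```c
#include <stdlib.h>

#define MAX_ATTRS 8             /* assumed upper bound on attributes */

/* One group of attributes, as positions in the statistics object. */
typedef struct
{
    int         nattrs;
    int         attrs[MAX_ATTRS];   /* ascending positions */
} StatsGroup;

/* Order by group size first, then lexicographically by position. */
static int
compare_groups(const void *a, const void *b)
{
    const StatsGroup *ga = a;
    const StatsGroup *gb = b;

    if (ga->nattrs != gb->nattrs)
        return (ga->nattrs < gb->nattrs) ? -1 : 1;

    for (int i = 0; i < ga->nattrs; i++)
    {
        if (ga->attrs[i] != gb->attrs[i])
            return (ga->attrs[i] < gb->attrs[i]) ? -1 : 1;
    }
    return 0;
}

/* Sort imported groups into the assumed generation order. */
void
sort_groups(StatsGroup *groups, size_t ngroups)
{
    qsort(groups, ngroups, sizeof(StatsGroup), compare_groups);
}
```

Positions are used rather than raw attnums because the order of the
attributes is only known once the statistics object is at hand.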

>> Except for this argument, the input of pg_ndistinct feels OK in terms
>> of the guarantees that we'd want to enforce on an import.  The same
>> argument applies in terms of attribute number guarantees for
>> pg_dependencies, based on DependencyGenerator_init() & friends in
>> dependencies.c.  Could you look at that?
> 
> Yes. I had already looked at it to verify that _all_ combinations were
> always generated (they are), because I had some vague memory of the
> generator dropping combinations that were statistically insignificant. In
> retrospect, I have no idea where I got that idea.

Hmm.  I would need to double-check the code to be sure, but I don't
think that we drop combinations, because the code prevents duplicates
to begin with, even for expressions:
create table aa (a int, b int);
create statistics stats (ndistinct) ON a, a, b, b from aa;
ERROR:  42701: duplicate column name in statistics definition
create statistics stats (ndistinct) ON (a + a), ((a + a)) from aa;
ERROR:  42701: duplicate expression in statistics definition

These don't make sense anyway because they have a predictable and
perfectly matching correlation relationship.

> This is fairly simple to do. The dependency attnum is just appended to the
> list of attnums, and the combinations are generated the same as ndistinct,
> though obviously there are no single elements.

Yeah.  That should not be bad, I hope.
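
To make the quoted idea concrete, here is a small sketch that folds the
dependency attribute into the set of attributes it depends on, so the
result can be checked like an ndistinct group.  Attributes are
bitmask positions, the function names are hypothetical, and the count
covers candidate dependencies only; it makes no claim about which of
them end up stored.

```c
#include <stddef.h>

#define MAX_ATTRS 8             /* assumed upper bound on attributes */

/*
 * Combine the determinant attributes and the implied attribute into a
 * single mask.  Returns 0 if the item is malformed: no determinants, or
 * the implied attribute is out of range or already a determinant.
 */
unsigned
dependency_as_combination(unsigned determinants, int implied)
{
    unsigned    bit;

    if (implied < 0 || implied >= MAX_ATTRS)
        return 0;

    bit = 1u << implied;
    if (determinants == 0 || (determinants & bit) != 0)
        return 0;

    return determinants | bit;
}

/*
 * Number of candidate dependencies over "nattrs" attributes: each
 * attribute can be implied by any non-empty subset of the others,
 * hence nattrs * (2^(nattrs - 1) - 1).
 */
size_t
dependency_candidate_count(int nattrs)
{
    if (nattrs < 2 || nattrs > MAX_ATTRS)
        return 0;
    return (size_t) nattrs * (((size_t) 1 << (nattrs - 1)) - 1);
}
```

Every combined mask has at least two bits set, which matches the
remark that there are no single elements.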

> There's probably some common code between the lists to be shared, differing
> only in how they report missing combinations.

I would like to agree with that, but it did not look that obvious to me
yesterday.  If you think that something could be refactored, I'd
suggest a refactoring patch that applies on top of the rest of the
patch set, with new generic facilities in stat_utils.c, or even a
new separate file, if that leads to a cleaner result (okay, the
definition of "clean" is up to one's taste).
--
Michael