Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Add test doing some cloning of extended statistics data
- fc365e4fccc4 19 (unreleased) landed
Add test for pg_restore_extended_stats() with multiranges
- 0b7beec42ae2 19 (unreleased) landed
Add support for "mcv" in pg_restore_extended_stats()
- efbebb4e8587 19 (unreleased) landed
Include extended statistics data in pg_dump
- c32fb29e979d 19 (unreleased) landed
Add support for "dependencies" in pg_restore_extended_stats()
- 302879bd68d1 19 (unreleased) landed
Add test for MAINTAIN permission with pg_restore_extended_stats()
- d9abd9e1050d 19 (unreleased) landed
Add pg_restore_extended_stats()
- 0e80f3f88dea 19 (unreleased) landed
Add routine to free MCVList
- 7ebb64c55757 19 (unreleased) landed
Improve pg_clear_extended_stats() with incorrect relation/stats combination
- 395b73c045e0 19 (unreleased) landed
Add pg_clear_extended_stats()
- d756fa1019ff 19 (unreleased) landed
Introduce routines to validate and free MVNDistinct and MVDependencies
- 32e27bd32082 19 (unreleased) landed
Fix typo in stat_utils.c
- eee19a30d60d 19 (unreleased) landed
Move attribute statistics functions to stat_utils.c
- 213a1b895270 19 (unreleased) landed
Improve error messages of input functions for pg_dependencies and pg_ndistinct
- f68597ee777d 19 (unreleased) landed
Improve test output of extended statistics for ndistinct and dependencies
- 2f04110225ab 19 (unreleased) landed
Fix some compiler warnings
- 7bc88c3d6f3a 19 (unreleased) landed
Add input function for data type pg_dependencies
- e1405aa5e3ac 19 (unreleased) landed
Add input function for data type pg_ndistinct
- 44eba8f06e55 19 (unreleased) landed
Rework output format of pg_dependencies
- e76defbcf09e 19 (unreleased) landed
Rework output format of pg_ndistinct
- 1f927cce4498 19 (unreleased) landed
Fix comments of output routines for pg_ndistinct and pg_dependencies
- 040a39ed25bf 19 (unreleased) landed
Move code specific to pg_dependencies to new file
- 2ddc8d9e9baa 19 (unreleased) landed
Move code specific to pg_ndistinct to new file
- a5523123430f 19 (unreleased) landed
Document some structures in attribute_stats.c
- d6c132d83bff 19 (unreleased) landed
Fix FATAL message for invalid recovery timeline at beginning of recovery
- 71f17823ba01 18.0 cited

Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-01-22T18:17:27Z

This is a separate thread for work started in [1] but focused purely on
getting the following functions working:

* pg_set_extended_stats
* pg_clear_extended_stats
* pg_restore_extended_stats

These functions are analogous to their relation/attribute counterparts, use
the same calling conventions, and build upon the same basic infrastructure.

I think it is important that we get these implemented because they close
the gap that was left in terms of the ability to modify existing statistics
and to round out the work being done to carry over statistics via
dump/restore and pg_upgrade i [1].

The purpose of each patch is as follows (adapted from previous thread):

0001 - This makes the input function for pg_ndistinct functional.

0002 - This makes the input function for pg_dependencies functional.

0003 - Makes several static functions in attribute_stats.c public for use
by extended stats. One of those is get_stat_attr_type(), which in the last
patchset was modified to take an attribute name rather than attnum, thus
saving a syscache lookup. However, extended stats identifies attributes by
attnum not name, so that optimization had to be set aside, at least
temporarily.

0004 - These implement the functions pg_set_extended_stats(),
pg_clear_extended_stats(), and pg_restore_extended_stats() and behave like
their relation/attribute equivalents. If we can get these committed and
used by pg_dump, then we don't have to debate how to handle post-upgrade
steps for users who happen to have extended stats vs the approximately
99.75% of users who do not have extended stats.

This patchset does not presently include any work to integrate these
functions into pg_dump, but may do so once that work is settled, or it may
become its own thread.

[1]
https://www.postgresql.org/message-id/CADkLM=c6NHdXU+d+m1yASZ4NT_qye1LCnaU2Vgf8YJ80jJT-Qg@mail.gmail.com

Re: Extended Statistics set/restore/clear functions.

Tomas Vondra <tomas@vondra.me> — 2025-01-22T22:50:57Z

Hi,

Thanks for continuing to work on this.

On 1/22/25 19:17, Corey Huinker wrote:
> This is a separate thread for work started in [1] but focused purely on
> getting the following functions working:
> 
> * pg_set_extended_stats
> * pg_clear_extended_stats
> * pg_restore_extended_stats
> 
> These functions are analogous to their relation/attribute counterparts,
> use the same calling conventions, and build upon the same basic
> infrastructure.
> 
> I think it is important that we get these implemented because they close
> the gap that was left in terms of the ability to modify existing
> statistics and to round out the work being done to carry over statistics
> via dump/restore and pg_upgrade i [1].
> 
> The purpose of each patch is as follows (adapted from previous thread):
> 
> 0001 - This makes the input function for pg_ndistinct functional.
> 
> 0002 - This makes the input function for pg_dependencies functional.
> 

I only quickly skimmed the patches, but a couple comments:

1) I think it makes perfect sense to use the JSON parsing for the input
functions, but maybe it'd be better to adjust the format a bit to make
that even easier?

Right now the JSON "keys" have structure, which means we need some ad
hoc parsing. Maybe we should make it "proper JSON" by moving that into
separate key/value, e.g. for ndistinct we might replace this:

  {"1, 2": 2323, "1, 3" : 3232, ...}

with this:

  [ {"keys": [1, 2], "ndistinct" : 2323},
    {"keys": [1, 3], "ndistinct" : 3232},
    ... ]

so a regular JSON array of objects, with keys an "array". And similarly
for dependencies.

Yes, it's more verbose, but maybe better for "mechanical" processing?

2) Do we need some sort of validation? Perhaps this was discussed in the
other thread and I missed that, but isn't it a problem that happily
accept e.g. this?

  {"6666, 6666" : 1, "1, -222": 14, ...}

That has duplicate keys with bogus attribute numbers, stats on (bogus)
system attributes, etc. I suspect this may easily cause problems during
planning (if it happens to visit those statistics).

Maybe that's acceptable - ultimately the user could import something
broken in a much subtler way, of course. But the pg_set_attribute_stats
seems somewhat more protected against this, because it gets the attr as
a separate argument.

I recall I wished to have the attnum in the output function, but that
was not quite possible because we don't know the relid (and thus the
descriptor) in that function.

Is it a good idea to rely on the input/output format directly? How will
that deal with cross-version differences? Doesn't it mean the in/out
format is effectively fixed, or at least has to be backwards compatible
(i.e. new version has to handle any value from older versions)?

Or what if I want to import the stats for a table with slightly
different structure (e.g. because dump/restore skips dropped columns).
Won't that be a problem with the format containing raw attnums? Or is
this a use case we don't expect to work?

For the per-attribute stats it's probably fine, because that's mostly
just a collection of regular data types (scalar values or arrays of
values, ...) and we're not modifying them except for maybe adding new
fields. But extended stats seem more complex, so maybe it's different?

I remember a long discussion about the format at the very beginning of
this patch series, and the conclusion clearly was to have a function
that import stats for one attribute at a time. And that seems to be
working fine, but extended stats values have more internal structure, so
perhaps they need to do something more complicated.

> 0003 - Makes several static functions in attribute_stats.c public for use
> by extended stats. One of those is get_stat_attr_type(), which in the last
> patchset was modified to take an attribute name rather than attnum, thus
> saving a syscache lookup. However, extended stats identifies attributes by
> attnum not name, so that optimization had to be set aside, at least
> temporarily.
> 
> 0004 - These implement the functions pg_set_extended_stats(),
> pg_clear_extended_stats(), and pg_restore_extended_stats() and behave like
> their relation/attribute equivalents. If we can get these committed and
> used by pg_dump, then we don't have to debate how to handle post-upgrade
> steps for users who happen to have extended stats vs the approximately
> 99.75% of users who do not have extended stats.
> 

I see there's a couple MCV-specific functions in the extended_stats.c.
Shouldn't those go into mvc.c instead?

FWIW there's a bunch of whitespace issues during git apply.

> This patchset does not presently include any work to integrate these
> functions into pg_dump, but may do so once that work is settled, or it
> may become its own thread.
> 

OK. Thanks for the patch!

regards

-- 
Tomas Vondra

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-01-23T14:51:08Z

On Wed, Jan 22, 2025 at 5:50 PM Tomas Vondra <tomas@vondra.me> wrote:

> Hi,
>
> Thanks for continuing to work on this.
>
> On 1/22/25 19:17, Corey Huinker wrote:
> > This is a separate thread for work started in [1] but focused purely on
> > getting the following functions working:
> >
> > * pg_set_extended_stats
> > * pg_clear_extended_stats
> > * pg_restore_extended_stats
> >
> > These functions are analogous to their relation/attribute counterparts,
> > use the same calling conventions, and build upon the same basic
> > infrastructure.
> >
> > I think it is important that we get these implemented because they close
> > the gap that was left in terms of the ability to modify existing
> > statistics and to round out the work being done to carry over statistics
> > via dump/restore and pg_upgrade i [1].
> >
> > The purpose of each patch is as follows (adapted from previous thread):
> >
> > 0001 - This makes the input function for pg_ndistinct functional.
> >
> > 0002 - This makes the input function for pg_dependencies functional.
> >
>
> I only quickly skimmed the patches, but a couple comments:
>
>
> 1) I think it makes perfect sense to use the JSON parsing for the input
> functions, but maybe it'd be better to adjust the format a bit to make
> that even easier?
>
> Right now the JSON "keys" have structure, which means we need some ad
> hoc parsing. Maybe we should make it "proper JSON" by moving that into
> separate key/value, e.g. for ndistinct we might replace this:
>
>   {"1, 2": 2323, "1, 3" : 3232, ...}
>
> with this:
>
>   [ {"keys": [1, 2], "ndistinct" : 2323},
>     {"keys": [1, 3], "ndistinct" : 3232},
>     ... ]
>
> so a regular JSON array of objects, with keys an "array". And similarly
> for dependencies.
>

That is almost exactly what I did back when the stats import functions took
a nested JSON argument.

The biggest problem with changing that format is that the old format would
still show up in the system being exported, so we would have to process
that format as well as the new one.

> Yes, it's more verbose, but maybe better for "mechanical" processing?
>

It absolutely would be better for processing, but we'd still have to read
the old format from older systems.  I suppose the pg_dump code could do
some SQL gymnastics to convert the old json-but-sad format into the
processing-friendly format of the future, and I could easily adapt what
I've already written over a year ago to that task. I suppose it's just a
matter of having the community behind changing the output format to enable
a better input format.

> 2) Do we need some sort of validation? Perhaps this was discussed in the
> other thread and I missed that, but isn't it a problem that happily
> accept e.g. this?
>
>   {"6666, 6666" : 1, "1, -222": 14, ...}
>
> That has duplicate keys with bogus attribute numbers, stats on (bogus)
> system attributes, etc. I suspect this may easily cause problems during
> planning (if it happens to visit those statistics).
>

We used to have _lots_ of validation for data quality issues, much of which
was removed at the request of reviewers. However, much of that discussion
was about the health of the statistics, but these are standalone data
types, maybe they're held to a higher standard. If so, what sort of checks
do you think would be needed and/or wanted?

So far, I can imagine the following:

* no negative attnums in key list
* no duplicate attnums in key list

anything else?

>
> Maybe that's acceptable - ultimately the user could import something
> broken in a much subtler way, of course. But the pg_set_attribute_stats
> seems somewhat more protected against this, because it gets the attr as
> a separate argument.
>

The datatype itself is in isolation, but once we've created a valid
pg_ndistinct or pg_dependencies, there's nothing stopping us from
validating the values in the datatype against the statistics object and the
relation it belongs to, but that might get the same resistance that I got
to say, ensuring that frequency lists were monotonically decreasing.

> I recall I wished to have the attnum in the output function, but that
> was not quite possible because we don't know the relid (and thus the
> descriptor) in that function.
>
> Is it a good idea to rely on the input/output format directly? How will
> that deal with cross-version differences? Doesn't it mean the in/out
> format is effectively fixed, or at least has to be backwards compatible
> (i.e. new version has to handle any value from older versions)?
>

Presently there are no cross-version differences, though earlier I address
the pros and cons of changing it. No matter what, the burden of having a
valid format is on the user in the case of pg_set_extended_stats() and
pg_restore_extended_stats() has a server version number associated, so the
option of handling a format change could be baked in, but then we're doing
version tests and input typecasts like we do with ANYARRAY types. Not
impossible, but definitely more work.

> Or what if I want to import the stats for a table with slightly
> different structure (e.g. because dump/restore skips dropped columns).
> Won't that be a problem with the format containing raw attnums? Or is
> this a use case we don't expect to work?
>

The family of pg_set_*_stats functions expect the input to be meaningful
and correct for that relation on that server version. Any attnum
translation would have to be done by the user to adapt to the new or
changed relation.

The family of pg_restore_*_stats functions are designed to be forward
compatible, and to work across versions but for basically the same relation
or relation of the same shape. Basically, they're for pg_restore and
pg_upgrade, so no changes in attnums would be expected.

>
> For the per-attribute stats it's probably fine, because that's mostly
> just a collection of regular data types (scalar values or arrays of
> values, ...) and we're not modifying them except for maybe adding new
> fields. But extended stats seem more complex, so maybe it's different?
>

I had that working by attname matching way back in the early days, but it
would involve creating an intermediate format for translating the attnums
to attnames on export, and then re-translating them on the way back in.

I suppose someone could write the following utility functions

    pg_xlat_ndistinct_to_attnames(relation reloid, ndist pg_ndistinct) ->
json
    pg_xlat_ndistinct_from_attnames(relation reloid, ndist json) ->
pg_ndistinct

and that would bridge the gap for the special case where you want to adapt
pg_ndistinct from one table structure to a slightly different one.

>
> I remember a long discussion about the format at the very beginning of
> this patch series, and the conclusion clearly was to have a function
> that import stats for one attribute at a time. And that seems to be
> working fine, but extended stats values have more internal structure, so
> perhaps they need to do something more complicated.
>

I believe this *is* doing something more complicated, especially with the
MCVList, though I was very pleased to see how much of the existing
infrastructure I was able to reuse.

>
> > 0003 - Makes several static functions in attribute_stats.c public for use
> > by extended stats. One of those is get_stat_attr_type(), which in the
> last
> > patchset was modified to take an attribute name rather than attnum, thus
> > saving a syscache lookup. However, extended stats identifies attributes
> by
> > attnum not name, so that optimization had to be set aside, at least
> > temporarily.
> >
> > 0004 - These implement the functions pg_set_extended_stats(),
> > pg_clear_extended_stats(), and pg_restore_extended_stats() and behave
> like
> > their relation/attribute equivalents. If we can get these committed and
> > used by pg_dump, then we don't have to debate how to handle post-upgrade
> > steps for users who happen to have extended stats vs the approximately
> > 99.75% of users who do not have extended stats.
> >
>
> I see there's a couple MCV-specific functions in the extended_stats.c.
> Shouldn't those go into mvc.c instead?
>

I wanted to put it there, but there was a reason I didn't and I've now
forgotten what it was. I'll make an effort to relocate it to mcv.c.

For that matter, it might make sense to break out the expressions code into
its own file, because every other stat attribute has its own. Thoughts on
that?

>
> FWIW there's a bunch of whitespace issues during git apply.
>

+1

>
> OK. Thanks for the patch!
>

Thanks for the feedback, please keep it coming. I think it's really
important that extended stats, though used rarely, be included in our
dump/restore/upgrade changes so as to make for a more consistent user
experience.

Re: Extended Statistics set/restore/clear functions.

Tomas Vondra <tomas@vondra.me> — 2025-01-23T20:15:53Z


On 1/23/25 15:51, Corey Huinker wrote:
> On Wed, Jan 22, 2025 at 5:50 PM Tomas Vondra <tomas@vondra.me
> <mailto:tomas@vondra.me>> wrote:
> 
>     Hi,
> 
>     Thanks for continuing to work on this.
> 
>     On 1/22/25 19:17, Corey Huinker wrote:
>     > This is a separate thread for work started in [1] but focused
>     purely on
>     > getting the following functions working:
>     >
>     > * pg_set_extended_stats
>     > * pg_clear_extended_stats
>     > * pg_restore_extended_stats
>     >
>     > These functions are analogous to their relation/attribute
>     counterparts,
>     > use the same calling conventions, and build upon the same basic
>     > infrastructure.
>     >
>     > I think it is important that we get these implemented because they
>     close
>     > the gap that was left in terms of the ability to modify existing
>     > statistics and to round out the work being done to carry over
>     statistics
>     > via dump/restore and pg_upgrade i [1].
>     >
>     > The purpose of each patch is as follows (adapted from previous
>     thread):
>     >
>     > 0001 - This makes the input function for pg_ndistinct functional.
>     >
>     > 0002 - This makes the input function for pg_dependencies functional.
>     >
> 
>     I only quickly skimmed the patches, but a couple comments:
> 
> 
>     1) I think it makes perfect sense to use the JSON parsing for the input
>     functions, but maybe it'd be better to adjust the format a bit to make
>     that even easier?
> 
>     Right now the JSON "keys" have structure, which means we need some ad
>     hoc parsing. Maybe we should make it "proper JSON" by moving that into
>     separate key/value, e.g. for ndistinct we might replace this:
> 
>       {"1, 2": 2323, "1, 3" : 3232, ...}
> 
>     with this:
> 
>       [ {"keys": [1, 2], "ndistinct" : 2323},
>         {"keys": [1, 3], "ndistinct" : 3232},
>         ... ]
> 
>     so a regular JSON array of objects, with keys an "array". And similarly
>     for dependencies.
> 
> 
> That is almost exactly what I did back when the stats import functions
> took a nested JSON argument.
> 
> The biggest problem with changing that format is that the old format
> would still show up in the system being exported, so we would have to
> process that format as well as the new one.
>  
> 
>     Yes, it's more verbose, but maybe better for "mechanical" processing?
> 
> 
> It absolutely would be better for processing, but we'd still have to
> read the old format from older systems.  I suppose the pg_dump code
> could do some SQL gymnastics to convert the old json-but-sad format into
> the processing-friendly format of the future, and I could easily adapt
> what I've already written over a year ago to that task. I suppose it's
> just a matter of having the community behind changing the output format
> to enable a better input format.
> 

D'oh! I always forget about the backwards compatibility issue, i.e. that
we still need to ingest values from already released versions. Yeah,
that makes the format change less beneficial.

>  
> 
>     2) Do we need some sort of validation? Perhaps this was discussed in the
>     other thread and I missed that, but isn't it a problem that happily
>     accept e.g. this?
> 
>       {"6666, 6666" : 1, "1, -222": 14, ...}
> 
>     That has duplicate keys with bogus attribute numbers, stats on (bogus)
>     system attributes, etc. I suspect this may easily cause problems during
>     planning (if it happens to visit those statistics).
> 
> 
> We used to have _lots_ of validation for data quality issues, much of
> which was removed at the request of reviewers. However, much of that
> discussion was about the health of the statistics, but these are
> standalone data types, maybe they're held to a higher standard. If so,
> what sort of checks do you think would be needed and/or wanted?
> 
> So far, I can imagine the following:
> 
> * no negative attnums in key list
> * no duplicate attnums in key list
> 
> anything else?
> 

Yeah, I recall there were a lot of checks initially and we dropped them
over time. I'm not asking to reinstate all of those thorough checks.

At this point I was really thinking only about validating the attnums,
i.e. to make sure it's a valid attribute in the table / statistics. That
is something the pg_set_attribute_stats() enforce too, thanks to having
a separate argument for the attribute name.

That's where I'd stop. I don't want to do checks on the statistics
content, like verifying the frequencies in the MCV sum up to 1.0 or
stuff like that. I think we're not doing that for pg_set_attribute_stats
either (and I'd bet one could cause a lot of "fun" this way).

> 
>     Maybe that's acceptable - ultimately the user could import something
>     broken in a much subtler way, of course. But the pg_set_attribute_stats
>     seems somewhat more protected against this, because it gets the attr as
>     a separate argument.
> 
> 
> The datatype itself is in isolation, but once we've created a valid
> pg_ndistinct or pg_dependencies, there's nothing stopping us from
> validating the values in the datatype against the statistics object and
> the relation it belongs to, but that might get the same resistance that
> I got to say, ensuring that frequency lists were monotonically decreasing.
>  

Understood. IMHO it's fine to say we're not validating the statistics
are "consistent" but I think we should check it matches the definition.

> 
>     I recall I wished to have the attnum in the output function, but that
>     was not quite possible because we don't know the relid (and thus the
>     descriptor) in that function.
> 
>     Is it a good idea to rely on the input/output format directly? How will
>     that deal with cross-version differences? Doesn't it mean the in/out
>     format is effectively fixed, or at least has to be backwards compatible
>     (i.e. new version has to handle any value from older versions)?
> 
> 
> Presently there are no cross-version differences, though earlier I
> address the pros and cons of changing it. No matter what, the burden of
> having a valid format is on the user in the case of
> pg_set_extended_stats() and pg_restore_extended_stats() has a server
> version number associated, so the option of handling a format change
> could be baked in, but then we're doing version tests and input
> typecasts like we do with ANYARRAY types. Not impossible, but definitely
> more work.
> 

OK, makes sense.
 
> 
>     Or what if I want to import the stats for a table with slightly
>     different structure (e.g. because dump/restore skips dropped columns).
>     Won't that be a problem with the format containing raw attnums? Or is
>     this a use case we don't expect to work?
> 
> 
> The family of pg_set_*_stats functions expect the input to be meaningful
> and correct for that relation on that server version. Any attnum
> translation would have to be done by the user to adapt to the new or
> changed relation.
> 
> The family of pg_restore_*_stats functions are designed to be forward
> compatible, and to work across versions but for basically the same
> relation or relation of the same shape. Basically, they're for
> pg_restore and pg_upgrade, so no changes in attnums would be expected.
>  
> 

OK

> 
>     For the per-attribute stats it's probably fine, because that's mostly
>     just a collection of regular data types (scalar values or arrays of
>     values, ...) and we're not modifying them except for maybe adding new
>     fields. But extended stats seem more complex, so maybe it's different?
> 
> 
> I had that working by attname matching way back in the early days, but
> it would involve creating an intermediate format for translating the
> attnums to attnames on export, and then re-translating them on the way
> back in.
> 
> I suppose someone could write the following utility functions
> 
>     pg_xlat_ndistinct_to_attnames(relation reloid, ndist pg_ndistinct) -
>> json
>     pg_xlat_ndistinct_from_attnames(relation reloid, ndist json) ->
> pg_ndistinct
> 
> and that would bridge the gap for the special case where you want to
> adapt pg_ndistinct from one table structure to a slightly different one.
> 
> 

OK
 
> 
> 
>     I remember a long discussion about the format at the very beginning of
>     this patch series, and the conclusion clearly was to have a function
>     that import stats for one attribute at a time. And that seems to be
>     working fine, but extended stats values have more internal structure, so
>     perhaps they need to do something more complicated.
> 
> 
> I believe this *is* doing something more complicated, especially with
> the MCVList, though I was very pleased to see how much of the existing
> infrastructure I was able to reuse.
> 
> 

OK

>  
> 
> 
>     > 0003 - Makes several static functions in attribute_stats.c public
>     for use
>     > by extended stats. One of those is get_stat_attr_type(), which in
>     the last
>     > patchset was modified to take an attribute name rather than
>     attnum, thus
>     > saving a syscache lookup. However, extended stats identifies
>     attributes by
>     > attnum not name, so that optimization had to be set aside, at least
>     > temporarily.
>     >
>     > 0004 - These implement the functions pg_set_extended_stats(),
>     > pg_clear_extended_stats(), and pg_restore_extended_stats() and
>     behave like
>     > their relation/attribute equivalents. If we can get these
>     committed and
>     > used by pg_dump, then we don't have to debate how to handle post-
>     upgrade
>     > steps for users who happen to have extended stats vs the approximately
>     > 99.75% of users who do not have extended stats.
>     >
> 
>     I see there's a couple MCV-specific functions in the extended_stats.c.
>     Shouldn't those go into mvc.c instead?
> 
> 
> I wanted to put it there, but there was a reason I didn't and I've now
> forgotten what it was. I'll make an effort to relocate it to mcv.c.
> 
> For that matter, it might make sense to break out the expressions code
> into its own file, because every other stat attribute has its own.
> Thoughts on that?
>  

+1 to that, if it reduced unnecessary code duplication

> 
> 
>     FWIW there's a bunch of whitespace issues during git apply.
> 
> 
> +1
>  
> 
> 
>     OK. Thanks for the patch!
> 
> 
> Thanks for the feedback, please keep it coming. I think it's really
> important that extended stats, though used rarely, be included in our
> dump/restore/upgrade changes so as to make for a more consistent user
> experience.
>  

I agree, and I appreciate you working on it.



-- 
Tomas Vondra

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-01-23T20:52:55Z

>
>
>
> > * no negative attnums in key list
>

Disregard this suggestion - negative attnums mean the Nth expression in the
extended stats object, though it boggles the mind how we could have 222
expressions...



> > * no duplicate attnums in key list
>

This one is still live, am considering.

At this point I was really thinking only about validating the attnums,
> i.e. to make sure it's a valid attribute in the table / statistics. That
> is something the pg_set_attribute_stats() enforce too, thanks to having
> a separate argument for the attribute name.
>
> That's where I'd stop. I don't want to do checks on the statistics
> content, like verifying the frequencies in the MCV sum up to 1.0 or
> stuff like that. I think we're not doing that for pg_set_attribute_stats
>

Agreed.


> either (and I'd bet one could cause a lot of "fun" this way).
>

If by "fun" you mean "create a fuzzing tool", then yes.

As an aside, the "big win" in all these functions is the ability to dump a
database --no-data, but have all the schema and statistics, thus allowing
for checking query plans on existing databases with sensitive data while
not actually exposing the data (except mcv, obvs), nor spending the I/O to
load that data.


> Understood. IMHO it's fine to say we're not validating the statistics
> are "consistent" but I think we should check it matches the definition.
>

+1



> > I suppose someone could write the following utility functions
> >
> >     pg_xlat_ndistinct_to_attnames(relation reloid, ndist pg_ndistinct) -
> >> json
> >     pg_xlat_ndistinct_from_attnames(relation reloid, ndist json) ->
> > pg_ndistinct
> >
> > and that would bridge the gap for the special case where you want to
> > adapt pg_ndistinct from one table structure to a slightly different one.
> >
> >
>
> OK
>

As they'll be pure-SQL functions, I'll likely post the definitions here,
but not put them into a patch unless it draws interest.
 > For that matter, it might make sense to break out the expressions code

> > into its own file, because every other stat attribute has its own.
> > Thoughts on that?
> >
>
> +1 to that, if it reduced unnecessary code duplication
>

I'm uncertain that it actually would deduplicate any code, but I'll
certainly try.

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-01-24T00:06:48Z

>
> I see there's a couple MCV-specific functions in the extended_stats.c.
>> Shouldn't those go into mvc.c instead?
>>
>
> I wanted to put it there, but there was a reason I didn't and I've now
> forgotten what it was. I'll make an effort to relocate it to mcv.c.
>

Looking at it now, I see that check_mcvlist_array() expects access to
extarginfo, so I either I add a bunch of Datum-aware and extarginfo-aware
code to mcv.c, or I do those checks outside of import_mcvlist() and leave
check_mcvlist_array() where it is, which is my current inclination, though
obviously it's not a perfectly clean break.

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-01-24T05:20:59Z

>
>
> I agree, and I appreciate you working on it.
>

I've tried to incorporate all your feedback:

* duplicates in pg_ndistinct and pg_dependencies now cause the input
function to fail, with tests for each

* all the non-array, non-parametery stuff moved out of import_mcvlist so
that import_mcvlist can move to mcv.c in a sane way

* whitespace issues

* validating attnums inside pg_ndistinct. I held off on doing
pg_dependencies pending feedback on how I handled pg_ndistinct. tests for
invalid positive (attnum isn't involved in the extended stats object), 0
attnum (just wrong) and invalid negative (more than the number of
expressions in the object)

Still to do:

* attnum checking for dependencies, should be simple, once we've locked
down how to do it for pg_ndistinct

* move import_expressions code to expressions.c along with some other
expression-specific stuff.

* more tests, probably

I'd like to merge these down to 3 patches again, but I'm keeping them
separate for this patchset to isolate the attnum-checking code for this
go-round.

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-01-27T19:52:15Z

>
> I'd like to merge these down to 3 patches again, but I'm keeping them
> separate for this patchset to isolate the attnum-checking code for this
> go-round.
>

These are mock-ups of the to/from JSON functions, but building from/to text
rather than the not-yet-committed pg_ndistinct and pg_dependencies data
types. Currently they're done with JSON rather than JSONB because I assume
that the ordering within the datatype matters. We can probably preserve
order by adding an "order" field populated by WITH ORDINALITY.

To get all Jurrassic about this: I've spent some time thinking about
whether we CAN make these functions, it's time to consider whether we
SHOULD. And that leads me to a couple of points:

p1. We could switch to the new formats without any change to the internal
representation, but pg_dump would always need to know about the old formats.

p2. The JSON format is both more understandable and easier to manipulate.

p3. If we thought the number of people using extended stats was small, the
number of people tweaking extended stats is going to be smaller.

So that gives us a few paths forward:

o1. Switch to the new input/output format, and the queries inside these
functions get incorporated into some future pg_dump queries.

o2. Keep the old formats, create these functions inside the system and
whoever wants to use them can use them.

o3. Keep old formats, and make these functions work as the CASTs to and
from JSON/JSONB.

o4. Keep old formats, could create these functions in an extension.

o5. Keep old formats, leave these function definitions here for a future
intrepid hacker.

Re: Extended Statistics set/restore/clear functions.

jian he <jian.universality@gmail.com> — 2025-01-28T16:24:59Z

hi.
I reviewed 0001 only.

in src/backend/statistics/mvdistinct.c

no need #include "nodes/pg_list.h" since
src/include/statistics/statistics.h sub level include "nodes/pg_list.h"

no need #include "utils/palloc.h"
sicne #include "postgres.h"
already included it.


 select '[{"6, -32768,,": -11}]'::pg_ndistinct;
ERROR:  malformed pg_ndistinct: "[{"6, -32768,,": -11}]"
LINE 1: select '[{"6, -32768,,": -11}]'::pg_ndistinct;
               ^
DETAIL:  All ndistinct count values are scalar doubles.
imho, this errdetail message is not good.


select '{}'::pg_ndistinct ;
segfault


select '{"1,":"1"}'::pg_ndistinct ;
ERROR:  malformed pg_ndistinct: "{"1,":"1"}"
LINE 1: select '{"1,":"1"}'::pg_ndistinct ;
               ^
DETAIL:  All ndistinct attnum lists must be a comma separated list of attnums.

imho, this errdetail message is not good. would be better saying that
"length of list of attnums must be larger than 1".



select  pt1.typnamespace, pt1.typarray, pt1.typcategory, pt1.typname
from    pg_type pt1
where pt1.typname ~*'distinct';

 typnamespace | typarray | typcategory |   typname
--------------+----------+-------------+--------------
           11 |        0 | Z           | pg_ndistinct

typcategory (Z) marked as Internal-use types. and there is no
pg_ndistinct array type,
not sure this is fine.


all the errcode one pair of the parenthesis is unnecessary.
for example
                (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                 errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
                 errdetail("Must begin with \"{\"")));
can change to
                errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                 errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
                 errdetail("Must begin with \"{\""));

see https://www.postgresql.org/docs/current/error-message-reporting.html

Re: Extended Statistics set/restore/clear functions.

jian he <jian.universality@gmail.com> — 2025-01-29T07:50:09Z

hi.

select '{"1, 0B100101":"NaN"}'::pg_ndistinct;
      pg_ndistinct
------------------------
 {"1, 37": -2147483648}
(1 row)

this is not what we expected?
For the VALUE part of pg_ndistinct, float8 has 3 special values: inf, -inf, NaN.

For the key part of pg_ndistinct, see example.
select '{"1, 16\t":"1"}'::pg_ndistinct;
here \t is not tab character, ascii 9. it's two characters: backslash
and character "t".
so here it should error out?
(apply this to \n, \r, \b)


pg_ndistinct_in(PG_FUNCTION_ARGS)
ending part should be:

    freeJsonLexContext(lex);
    if (result == JSON_SUCCESS)
    {
        ......
    }
    else
    {
       ereturn(parse_state.escontext, (Datum) 0,
                    errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                    errmsg("malformed pg_ndistinct: \"%s\"", str),
                    errdetail("Must be valid JSON."));
       PG_RETURN_NULL();
    }
result should be either JSON_SUCCESS or anything else.



all these functions:
ndistinct_object_start, ndistinct_array_start,
ndistinct_object_field_start, ndistinct_array_element_start
have
ndistinctParseState *parse = state;

do we need to change it to
ndistinctParseState *parse = (ndistinctParseState *)state;
?



ndistinctParseState need to add to src/tools/pgindent/typedefs.list
probably change it to "typedef struct ndistinctParseState".
also struct ndistinctParseState need placed in the top of
src/backend/statistics/mvdistinct.c
not in line 340?

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-01-29T19:56:33Z

On Tue, Jan 28, 2025 at 11:25 AM jian he <jian.universality@gmail.com>
wrote:

> hi.
> I reviewed 0001 only.
>
> in src/backend/statistics/mvdistinct.c
>
> no need #include "nodes/pg_list.h" since
> src/include/statistics/statistics.h sub level include "nodes/pg_list.h"
>
> no need #include "utils/palloc.h"
> sicne #include "postgres.h"
> already included it.
>


Noted.


>  select '[{"6, -32768,,": -11}]'::pg_ndistinct;
> ERROR:  malformed pg_ndistinct: "[{"6, -32768,,": -11}]"
> LINE 1: select '[{"6, -32768,,": -11}]'::pg_ndistinct;
>                ^
> DETAIL:  All ndistinct count values are scalar doubles.
> imho, this errdetail message is not good.
>

What error message do you think is appropriate in that situation?


> select '{}'::pg_ndistinct ;
> segfault
>

Mmm, gotta look into that!



>
>
> select '{"1,":"1"}'::pg_ndistinct ;
> ERROR:  malformed pg_ndistinct: "{"1,":"1"}"
> LINE 1: select '{"1,":"1"}'::pg_ndistinct ;
>                ^
> DETAIL:  All ndistinct attnum lists must be a comma separated list of
> attnums.
>
> imho, this errdetail message is not good. would be better saying that
> "length of list of attnums must be larger than 1".
>

That sounds better.



> typcategory (Z) marked as Internal-use types. and there is no
> pg_ndistinct array type,
> not sure this is fine.
>

I think it's probably ok for now. The datatype currently has no utility
other than extended statistics, and I'm doubtful that it ever will.

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-01-29T20:04:40Z

On Wed, Jan 29, 2025 at 2:50 AM jian he <jian.universality@gmail.com> wrote:

> hi.
>
> select '{"1, 0B100101":"NaN"}'::pg_ndistinct;
>       pg_ndistinct
> ------------------------
>  {"1, 37": -2147483648}
> (1 row)
>

I think my initial reaction is to just refuse those special values, but
I'll look into the parsing code to see what can be done.



this is not what we expected?
> For the VALUE part of pg_ndistinct, float8 has 3 special values: inf,
> -inf, NaN.
>
> For the key part of pg_ndistinct, see example.
> select '{"1, 16\t":"1"}'::pg_ndistinct;
> here \t is not tab character, ascii 9. it's two characters: backslash
> and character "t".
> so here it should error out?
> (apply this to \n, \r, \b)
>

I don't have a good answer as to what should happen here. Special cases
like this make Tomas' suggestion to change the in/out format more
attractive.




>
>
> pg_ndistinct_in(PG_FUNCTION_ARGS)
> ending part should be:
>
>     freeJsonLexContext(lex);
>     if (result == JSON_SUCCESS)
>     {
>         ......
>     }
>     else
>     {
>        ereturn(parse_state.escontext, (Datum) 0,
>                     errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
>                     errmsg("malformed pg_ndistinct: \"%s\"", str),
>                     errdetail("Must be valid JSON."));
>        PG_RETURN_NULL();
>     }
> result should be either JSON_SUCCESS or anything else.
>
>
>
> all these functions:
> ndistinct_object_start, ndistinct_array_start,
> ndistinct_object_field_start, ndistinct_array_element_start
> have
> ndistinctParseState *parse = state;
>
> do we need to change it to
> ndistinctParseState *parse = (ndistinctParseState *)state;
> ?
>

The compiler isn't complaining so far, but I see no harm in it.

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-02-20T04:43:16Z

>
> select '{"1, 0B100101":"NaN"}'::pg_ndistinct;
>>       pg_ndistinct
>> ------------------------
>>  {"1, 37": -2147483648}
>> (1 row)
>>
>
> I think my initial reaction is to just refuse those special values, but
> I'll look into the parsing code to see what can be done.
>

I noticed that the output function for pg_ndistinct casts that value to an
integer before formatting it %d, so it's being treated as an integer even
if it is not stored as one. After some consultation with Tomas, it made the
most sense to just replicate this on the input side as well, and that is
addressed in the patches below.

I've updated and rebased the patches.

The existing pg_ndistinct and pg_dependences formats were kept as-is. The
formats are clumsy, more processing-friendly formats would be easier, but
the need for such processing is minimal bordering on theoretical, so there
is little impact in keeping the historical format.

There are now checks to ensure that the pg_ndistinct or pg_dependencies
value assigned to an extended statistics object actually makes sense for
that object. What this amounts to is checking that for every attnum cited,
the positive attnums are also ones found the in the stxkeys of the
pg_statistic_ext tuple, and the negative attnums correspond do not exceed
the number of expressions in the attnum. In other words, if the stats
object has no expressions in it, then no negative numbers will be accepted,
if it has 2 expressions than any value -3 or lower will be rejected, etc.

All patches rebased to 71f17823ba010296da9946bd906bb8bcad6325bc.

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-03-04T20:30:11Z

>
>
>> I think my initial reaction is to just refuse those special values, but
>> I'll look into the parsing code to see what can be done.
>>
>
> I noticed that the output function for pg_ndistinct casts that value to an
> integer before formatting it %d, so it's being treated as an integer even
> if it is not stored as one. After some consultation with Tomas, it made the
> most sense to just replicate this on the input side as well, and that is
> addressed in the patches below.
>
> I've updated and rebased the patches.
>
> The existing pg_ndistinct and pg_dependences formats were kept as-is. The
> formats are clumsy, more processing-friendly formats would be easier, but
> the need for such processing is minimal bordering on theoretical, so there
> is little impact in keeping the historical format.
>
> There are now checks to ensure that the pg_ndistinct or pg_dependencies
> value assigned to an extended statistics object actually makes sense for
> that object. What this amounts to is checking that for every attnum cited,
> the positive attnums are also ones found the in the stxkeys of the
> pg_statistic_ext tuple, and the negative attnums correspond do not exceed
> the number of expressions in the attnum. In other words, if the stats
> object has no expressions in it, then no negative numbers will be accepted,
> if it has 2 expressions than any value -3 or lower will be rejected, etc.
>
> All patches rebased to 71f17823ba010296da9946bd906bb8bcad6325bc.
>

A rebasing, and a few changes
* regnamespace and name parameters changed to statistics_schemaname as text
and statistics_name as text, so that there's one less thing that can
potentially fail in an upgrade
* schema lookup and stat name lookup failures now issue a warning and
return false, rather than ERROR
* elevel replaced with hardcoded WARNING most everywhere, as has been done
with relation/attribute stats

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-03-31T05:10:23Z
```
Just rebasing.
```

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-05-29T22:32:17Z

On Mon, Mar 31, 2025 at 1:10 AM Corey Huinker <corey.huinker@gmail.com>
wrote:

> Just rebasing.
>

At pgconf.dev this year, the subject of changing the formats of
pg_ndistinct and pg_depdentencies came up again.

To recap: presently these datatypes have no working input function, but
would need one for statistics import to work on extended statistics. The
existing input formats are technically JSON, but the keys themselves are a
comma-separated list of attnums, so they require additional parsing. That
parsing is already done in the patches in this thread, but overall the
format is terrible for any sort of manipulation, like the manipulation that
people might want to do to translate the values to a table with a different
column order (say, after a restore of a table that had dropped columns), or
to do query planner experiments.

Because the old formats don't have a corresponding input function, there is
no risk of the ouptut not matching required inputs, but there will be once
we add new input functions, so this is our last chance to change the format
to something we like better.

The old format can be trivially translated via functions posted earlier in
this thread back in January (pg_xlat_ndistinct_to_attnames,
pg_xlat_dependencies_to_attnames) as well as the reverse (s/_to_/_from_/),
so dumping values from older versions will not be difficult.

I believe that we should take this opportunity to make the change. While we
don't have a pressing need to manipulate these structures now, we might in
the future and failing to do so now makes a later change much harder.

With that in mind, I'd like people to have a look at the proposed format
change if pg_ndistinct (the changes to pg_dependencies are similar), to see
if they want to make any improvements or comments. As you can see, the new
format is much less compact (about 3x as large), which could get bad if the
number of elements grew by a lot, but the number of elements is tied to the
number of factors in the extended support (N choose N, then N choose N-1,
etc, excluding choose 1), so this can't get too out of hand.

Existing format (newlines/formatting added by me to make head-to-head
comparison easier):

'{"2, 3": 4,
  "2, -1": 4,
  "2, -2": 4,
  "3, -1": 4,
  "3, -2": 4,
  "-1, -2": 3,
  "2, 3, -1": 4,
  "2, 3, -2": 4,
  "2, -1, -2": 4,
  "3, -1, -2": 4}'::pg_ndistinct

Proposed new format (again, all formatting here is just for ease of humans
reading):

' [ {"attributes" : [2,3], "ndistinct" : 4},
    {"attributes" : [2,-1], "ndistinct" : 4},
    {"attributes" : [2,-2], "ndistinct" : 4},
    {"attributes" : [3,-1], "ndistinct" : 4},
    {"attributes" : [3,-2], "ndistinct" : 4},
    {"attributes" : [-1,-2], "ndistinct" : 3},
    {"attributes" : [2,3,-1], "ndistinct" : 4},
    {"attributes" : [2,3,-2], "ndistinct" : 4},
    {"attributes" : [2,-1,-2], "ndistinct" : 4},
    {"attributes" : [3,-1,-2], "ndistinct" : 4}]'::pg_ndistinct

The pg_dependencies structure is only slightly more complex:

An abbreviated example:

{"2 => 1": 1.000000, "2 => -1": 1.000000, ..., "2, -2 => -1": 1.000000, "3,
-1 => 2": 1.000000},

Becomes:

[ {"attributes": [2], "dependency": 1, "degree": 1.000000},
  {"attributes": [2], "dependency": -1, "degree": 1.000000},
  {"attributes": [2, -2], "dependency":  -1, "degree": 1.000000},
   ...,
   {"attributes": [2, -2], "dependency": -1, "degree": 1.000000},
   {"attributes": [3, -1], "dependency": 2, "degree": 1.000000}]

Any thoughts on using/improving these structures?

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-06-22T00:12:28Z

>
>
> Any thoughts on using/improving these structures?
>

Hearing no objections, here is the latest patchset.

0001 - Changes input/output functions of pg_ndistinct to the format
described earlier.
0002 - Changes input/output functions of pg_dependencies to the format
described earlier.
0003 - Makes some previously internal/static attribute stats functions
visible to extended_stats.c, because the exprs attribute is basically an
array of partially filled-out pg_statistic rows.
0004 - Adds pg_restore_attribute_stats(), pg_clear_attribute_stats(), in
the pattern of their relation/attribute brethren.
0005 - adds the dumping and restoring of extended statistics back to v10.
No command line flag changes needed.

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-08-12T17:28:14Z

On Sat, Jun 21, 2025 at 8:12 PM Corey Huinker <corey.huinker@gmail.com>
wrote:

>
>> Any thoughts on using/improving these structures?
>>
>
> Hearing no objections, here is the latest patchset.
>
> 0001 - Changes input/output functions of pg_ndistinct to the format
> described earlier.
> 0002 - Changes input/output functions of pg_dependencies to the format
> described earlier.
> 0003 - Makes some previously internal/static attribute stats functions
> visible to extended_stats.c, because the exprs attribute is basically an
> array of partially filled-out pg_statistic rows.
> 0004 - Adds pg_restore_attribute_stats(), pg_clear_attribute_stats(), in
> the pattern of their relation/attribute brethren.
> 0005 - adds the dumping and restoring of extended statistics back to v10.
> No command line flag changes needed.
>
>

Rebased. Enjoy.

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-09-12T04:55:56Z
```
>
> Rebased. Enjoy.
>

Rebased.



>
```

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-10-19T00:27:58Z

On Fri, Sep 12, 2025 at 12:55 AM Corey Huinker <corey.huinker@gmail.com>
wrote:

> Rebased. Enjoy.
>>
>
> Rebased.
>

And rebased again to conform to  688dc6299 and 4bd919129.

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-10-21T05:48:37Z

On Sat, Oct 18, 2025 at 08:27:58PM -0400, Corey Huinker wrote:
> And rebased again to conform to  688dc6299 and 4bd919129.

The format redesign for extended stats is pretty nice, as done in
0001.  I think that this patch should be split in two, actually, as it
tackles two issues:
- One patch for the format change.
- Second patch for the introduction of the input function, useful on
its own to allow more regression tests paths for the format generated.

Similar split comment for 0002 regarding pg_dependencies.  The format
change is one thing.  The input function is cool to have for input
validation, still not absolutely mandatory for the core part of the
format change.

The functions exposed in 0003 should be renamed to match more with the
style of the rest, aka it is a bit hard to figure out what they do at
first sight.  Presumably, these should be prefixed with some
"statext_", except text_to_stavalues() which could still be named the
same.

Do you have some numbers regarding the increase in size this generates
for the catalogs?

0004 has been designed following the same model as the relation and
attribute stats.  That sounds OK here.

+enum extended_stats_argnum
[...]
+enum extended_stats_exprs_element

It would be nice to document why such things are around.  That would
be less guessing for somebody reading the code.

Reusing this small sequence from your pg_dump patch, executed on a v14
backend:
create schema dump_test;
CREATE TABLE dump_test.has_ext_stats
  AS SELECT g.g AS x, g.g / 2 AS y FROM generate_series(1,100) AS g(g);
CREATE STATISTICS dump_test.es1 ON x, (y % 2) FROM dump_test.has_ext_stats;
ANALYZE dump_test.has_ext_stats;

Then pg_dump fails:
pg_dump: error: query failed: ERROR:  column e.inherited does not exist
LINE 2: ...hemaname = $1 AND e.statistics_name = $2 ORDER BY e.inherite...


+        * TODO: Until v18 is released the master branch has a
+        * server_version_num of 180000. We will update this to 190000
as soon
+        * as the master branch updates.

This part has not been updated.

+       Assert(item.nattributes > 0);   /* TODO: elog? */
[...]
+       Assert(dependency->nattributes > 1);    /* TODO: elog? */
Yes and yes.  It seems like it should be possible to craft some input
that triggers these..

+void
+free_pg_dependencies(MVDependencies *dependencies);

Double declaration of this routine in dependencies.c.

Perhaps some of the regression tests could use some jsonb_pretty() in
the outputs generated.  Some of the results generated are very hard to
parse, something that would become harder in the buildfarm.  This
comment starts with 0001 for stxdndistinct.

I have mixed feelings about 0005, FWIW.  I am wondering if we should
not lift the needle a bit here and only support the dump of extended
statistics when dealing with a backend of at least v19.  This would
mean that we would only get the full benefit of this feature once
people upgrade to v20 or dump from a pg_dump with --statistics from at
least v19, but with the long-term picture in mind this would also make
the dump/restore picture of the patch dead simple (spoiler: I like
simple). 

Tomas, what is your take about the format changes and my argument
about the backward requirements of pg_dump (about not dumping these
stats if connecting to a server older than v18, included)?
--
Michael

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-10-21T05:57:00Z

On Tue, Oct 21, 2025 at 02:48:37PM +0900, Michael Paquier wrote:
> Tomas, what is your take about the format changes and my argument
> about the backward requirements of pg_dump (about not dumping these
> stats if connecting to a server older than v18, included)?

By the way, the patch is still needed in the previous CF of September:
https://commitfest.postgresql.org/patch/5517/

You may want to move it.
--
Michael

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-10-22T11:55:31Z

>
> The functions exposed in 0003 should be renamed to match more with the
> style of the rest, aka it is a bit hard to figure out what they do at
> first sight.  Presumably, these should be prefixed with some
> "statext_", except text_to_stavalues() which could still be named the
> same.
>

That prefix would probably be statatt_ or statattr_.


>
> Do you have some numbers regarding the increase in size this generates
> for the catalogs?
>

Sorry, I don't understand. There shouldn't be any increase inside the
catalogs as the internal storage of the datatypes hasn't changed, so I can
only conclude that you're referring to something else.


>
> 0004 has been designed following the same model as the relation and
> attribute stats.  That sounds OK here.
>
> +enum extended_stats_argnum
> [...]
> +enum extended_stats_exprs_element
>
> It would be nice to document why such things are around.  That would
> be less guessing for somebody reading the code.
>

The equivalent structures in attribute_stats.c will need documenting too.




>
> Reusing this small sequence from your pg_dump patch, executed on a v14
> backend:
> create schema dump_test;
> CREATE TABLE dump_test.has_ext_stats
>   AS SELECT g.g AS x, g.g / 2 AS y FROM generate_series(1,100) AS g(g);
> CREATE STATISTICS dump_test.es1 ON x, (y % 2) FROM dump_test.has_ext_stats;
> ANALYZE dump_test.has_ext_stats;
>
> Then pg_dump fails:
> pg_dump: error: query failed: ERROR:  column e.inherited does not exist
> LINE 2: ...hemaname = $1 AND e.statistics_name = $2 ORDER BY e.inherite...
>

Noted.


>
>
> +        * TODO: Until v18 is released the master branch has a
> +        * server_version_num of 180000. We will update this to 190000
> as soon
> +        * as the master branch updates.
>
> This part has not been updated.
>
> +       Assert(item.nattributes > 0);   /* TODO: elog? */
> [...]
> +       Assert(dependency->nattributes > 1);    /* TODO: elog? */
> Yes and yes.  It seems like it should be possible to craft some input
> that triggers these..
>

+1


>
> +void
> +free_pg_dependencies(MVDependencies *dependencies);
>
> Double declaration of this routine in dependencies.c.
>
> Perhaps some of the regression tests could use some jsonb_pretty() in
> the outputs generated.  Some of the results generated are very hard to
> parse, something that would become harder in the buildfarm.  This
> comment starts with 0001 for stxdndistinct.
>

Can do.


>
> I have mixed feelings about 0005, FWIW.  I am wondering if we should
> not lift the needle a bit here and only support the dump of extended
> statistics when dealing with a backend of at least v19.  This would
> mean that we would only get the full benefit of this feature once
> people upgrade to v20 or dump from a pg_dump with --statistics from at
> least v19, but with the long-term picture in mind this would also make
> the dump/restore picture of the patch dead simple (spoiler: I like
> simple).
>

I also like simple.

Right now we have a situation where the vast majority of databases can
carry forward all of their stats via pg_upgrade, except for those databases
that have extended stats. The trouble is, most customers don't know if
their database uses extended statistics or not, and those that do are in
for some bad query plans if they haven't run vacuumdb --missing-stats-only.
Explaining that to customers is complicated, especially when most of them
do not know what extended stats are, let alone whether they have them. It
would be a lot simpler to just say "all stats are carried over on upgrade",
and vacuumdb becomes unnecessary, making upgrades one step simpler as well.

Given that, I think that the admittedly ugly transformation is worth it,
and sequestering it inside pg_dump is the smallest footprint it can have.
Earlier in this thread I posted some functions that did the translation
from the existing formats to the proposed new formats. We could include
those as new system functions, and that would make the dump code very
simple. Having said that, I don't know that there would be use for those
functions except inside pg_dump, hence the decision to do the transforms
right in the dump query.

If the format translation is a barrier to fetching existing extended stats,
then I'd be more inclined to keep the existing pg_ndistinct and
pg_dependencies data formats as they are now.

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-10-22T23:46:27Z

On Wed, Oct 22, 2025 at 02:55:31PM +0300, Corey Huinker wrote:
>> Do you have some numbers regarding the increase in size this generates
>> for the catalogs?
> 
> Sorry, I don't understand. There shouldn't be any increase inside the
> catalogs as the internal storage of the datatypes hasn't changed, so I can
> only conclude that you're referring to something else.

The new format meant more characters, perhaps I've just missed
something while quickly testing the patch..  Anyway, that's OK at this
stage.

> The equivalent structures in attribute_stats.c will need documenting too.

Right.  This sounds like a separate patch to me, impacting HEAD.

> Right now we have a situation where the vast majority of databases can
> carry forward all of their stats via pg_upgrade, except for those databases
> that have extended stats. The trouble is, most customers don't know if
> their database uses extended statistics or not, and those that do are in
> for some bad query plans if they haven't run vacuumdb --missing-stats-only.
> Explaining that to customers is complicated, especially when most of them
> do not know what extended stats are, let alone whether they have them. It
> would be a lot simpler to just say "all stats are carried over on upgrade",
> and vacuumdb becomes unnecessary, making upgrades one step simpler as well.

Okay.

> Given that, I think that the admittedly ugly transformation is worth it,
> and sequestering it inside pg_dump is the smallest footprint it can have.
> Earlier in this thread I posted some functions that did the translation
> from the existing formats to the proposed new formats. We could include
> those as new system functions, and that would make the dump code very
> simple. Having said that, I don't know that there would be use for those
> functions except inside pg_dump, hence the decision to do the transforms
> right in the dump query.

I'd prefer the new format.  One killer pushing in favor of the new
format that you are making upthread in favor of is that it makes much
easier the viewing, editing and injecting of these stats.  It's the
part of the patch where we would need Tomas' input on the matter
before deciding anything, I guess, as primary author of the original
facilities.  My view of the problem is just one opinion.

> If the format translation is a barrier to fetching existing extended stats,
> then I'd be more inclined to keep the existing pg_ndistinct and
> pg_dependencies data formats as they are now.

Not necessarily, it can be possible to also take that in multiple
steps rather than a single one:
- First do the basics in v19 with the new format.
- Raise the bar to older versions.
--
Michael

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-10-23T00:35:04Z

On Thu, Oct 23, 2025 at 08:46:27AM +0900, Michael Paquier wrote:
> I'd prefer the new format.  One killer pushing in favor of the new
> format that you are making upthread in favor of is that it makes much
> easier the viewing, editing and injecting of these stats.

This is missing an "argument", as in "One killer argument pushing in
favor.." :D
--
Michael

Re: Extended Statistics set/restore/clear functions.

Tomas Vondra <tomas@vondra.me> — 2025-10-31T20:22:55Z


On 10/23/25 01:46, Michael Paquier wrote:
> On Wed, Oct 22, 2025 at 02:55:31PM +0300, Corey Huinker wrote:
>>> Do you have some numbers regarding the increase in size this generates
>>> for the catalogs?
>>
>> Sorry, I don't understand. There shouldn't be any increase inside the
>> catalogs as the internal storage of the datatypes hasn't changed, so I can
>> only conclude that you're referring to something else.
> 
> The new format meant more characters, perhaps I've just missed
> something while quickly testing the patch..  Anyway, that's OK at this
> stage.
> 
>> The equivalent structures in attribute_stats.c will need documenting too.
> 
> Right.  This sounds like a separate patch to me, impacting HEAD.
> 
>> Right now we have a situation where the vast majority of databases can
>> carry forward all of their stats via pg_upgrade, except for those databases
>> that have extended stats. The trouble is, most customers don't know if
>> their database uses extended statistics or not, and those that do are in
>> for some bad query plans if they haven't run vacuumdb --missing-stats-only.
>> Explaining that to customers is complicated, especially when most of them
>> do not know what extended stats are, let alone whether they have them. It
>> would be a lot simpler to just say "all stats are carried over on upgrade",
>> and vacuumdb becomes unnecessary, making upgrades one step simpler as well.
> 
> Okay.
> 
>> Given that, I think that the admittedly ugly transformation is worth it,
>> and sequestering it inside pg_dump is the smallest footprint it can have.
>> Earlier in this thread I posted some functions that did the translation
>> from the existing formats to the proposed new formats. We could include
>> those as new system functions, and that would make the dump code very
>> simple. Having said that, I don't know that there would be use for those
>> functions except inside pg_dump, hence the decision to do the transforms
>> right in the dump query.
> 
> I'd prefer the new format.  One killer pushing in favor of the new
> format that you are making upthread in favor of is that it makes much
> easier the viewing, editing and injecting of these stats.  It's the
> part of the patch where we would need Tomas' input on the matter
> before deciding anything, I guess, as primary author of the original
> facilities.  My view of the problem is just one opinion.
> 

Sorry for not paying much attention to this thread ...

My opinion is that we should both use the new format and keep the
pg_dump code to allow upgrading from older pre-19 versions.

There really is nothing special about the current format - I should have
used JSON (or any other established format) from the beginning. But I
only saw that as human-readable version of ephemeral data, it didn't
occur to me we'll use this to export/import stats cross versions. So if
we need to adjust that to make new use cases more convenient, let's bite
the bullet now.

If doing both is too complex / ugly, I think the pg_upgrade capability
is more valuable. I'd rather keep the old, less convenient format to
have pg_upgrade support for all versions.

Otherwise users may not benefit from this pg_upgrade feature for a
couple more years. Plenty of users delay upgrading until the EOL gets
close, and so might be unable to dump/restore extended stats for the
next ~5 years.

regards

-- 
Tomas Vondra

Re: Extended Statistics set/restore/clear functions.[

Michael Paquier <michael@paquier.xyz> — 2025-11-04T07:13:55Z

On Fri, Oct 31, 2025 at 09:22:55PM +0100, Tomas Vondra wrote:
> Sorry for not paying much attention to this thread ...

It was PGEU last week, it's not surprising knowing that you are in..
Europe :)

> My opinion is that we should both use the new format and keep the
> pg_dump code to allow upgrading from older pre-19 versions.

Okay, thanks for the input.

> There really is nothing special about the current format - I should have
> used JSON (or any other established format) from the beginning. But I
> only saw that as human-readable version of ephemeral data, it didn't
> occur to me we'll use this to export/import stats cross versions. So if
> we need to adjust that to make new use cases more convenient, let's bite
> the bullet now.
> 
> If doing both is too complex / ugly, I think the pg_upgrade capability
> is more valuable. I'd rather keep the old, less convenient format to
> have pg_upgrade support for all versions.

Hmm.  I have been eyeing at the dump code once again, and it's not
that bad on a second lookup, so I'd be OK to incorporate the whole in
a review.

> Otherwise users may not benefit from this pg_upgrade feature for a
> couple more years. Plenty of users delay upgrading until the EOL gets
> close, and so might be unable to dump/restore extended stats for the
> next ~5 years.

Okay.

I still think that we should split things a bit more in the patch set.
Corey has sent me a proposal in this direction, where most of the
entries can be reviewed and potentially applied separately:
1. pg_ndistinct output function change.
2. pg_ndistinct input function addition.
3. pg_dependencies output function change
4. pg_dependencies input function
5. Expose attribute statistics function and rename them attstat_* or
statatt_*
6. pg_restore_extended_stats
7. pg_dump with no ability to fetch old-format
pg_ndistinct/pg_dependences.
8. pg_dump working back as far as possible

I am not completely sold about the changes in the input/output
functions, but I'd be happy to consider more options with a rebased
patch set (with the dump issue on older clusters issue, while on it).
--
Michael

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-05T06:38:56Z

>
> Otherwise users may not benefit from this pg_upgrade feature for a
> couple more years. Plenty of users delay upgrading until the EOL gets
> close, and so might be unable to dump/restore extended stats for the
> next ~5 years


Paquier's response got sidetracked because of an errant subject line
change, so I will try to recap:

* pg_dump code changes no longer seem as bad on second look
* proceed with breakout per off-list discussion

in that off-list discussion I proposed (though I was mostly echoing what I
thought Paquier wanted):

1. pg_ndistinct output function change.
2. pg_ndistinct input function addition.
3. pg_dependencies output function change
4. pg_dependencies input function
5. Expose attribute statistics function and rename them attstat_* or
statatt_*   (edit: and fix lack of comments on the enums and arrays)
6. pg_restore_extended_stats
7. pg_dump with no ability to fetch old-format pg_ndistinct/pg_dependences.
(edit: and fix inherited bug)
8. pg_dump working back as far as possible

Given that the pg_dump code no longer seems as bad, and Tomas is very much
in support of it, I've opted not to split out steps 7/8.

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-11-06T08:26:58Z

On Wed, Nov 05, 2025 at 01:38:56AM -0500, Corey Huinker wrote:
> Paquier's response got sidetracked because of an errant subject line
> change, so I will try to recap:

That was a typo that found its way into the email subject.  Sorry
about that, that broke gmail's tracking at least.

> in that off-list discussion I proposed (though I was mostly echoing what I
> thought Paquier wanted):
> 
> 1. pg_ndistinct output function change.
> 2. pg_ndistinct input function addition.
> 3. pg_dependencies output function change
> 4. pg_dependencies input function
> 5. Expose attribute statistics function and rename them attstat_* or
> statatt_*   (edit: and fix lack of comments on the enums and arrays)
> 6. pg_restore_extended_stats
> 7. pg_dump with no ability to fetch old-format pg_ndistinct/pg_dependences.
> (edit: and fix inherited bug)
> 8. pg_dump working back as far as possible

Thanks.  I have begun reviewing it (more a bit later, still need to
study more the structure of the code).  For now, I have extracted some
of the comment changes in 0005 and applied these independently.

> Given that the pg_dump code no longer seems as bad, and Tomas is very much
> in support of it, I've opted not to split out steps 7/8.

That sounds like a way forward to me, then, in terms of using a new
format and make pg_dump intelligent enough to deal with it based on
what's in the past versions.
--
Michael

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-06T18:35:34Z

>
> That was a typo that found its way into the email subject.  Sorry
> about that, that broke gmail's tracking at least.
>

So the mailing list archive will still pick it up? That's nice.


> Thanks.  I have begun reviewing it (more a bit later, still need to
> study more the structure of the code).  For now, I have extracted some
> of the comment changes in 0005 and applied these independently.
>

Rebased to reflect that commit.

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-11-07T07:51:15Z

On Thu, Nov 06, 2025 at 01:35:34PM -0500, Corey Huinker wrote:
> So the mailing list archive will still pick it up? That's nice.

It did.  My email client does not care much either.

> Rebased to reflect that commit.

I have spent a bit more time on this set.

Patch 0001 for ndistinct was missing a documentation update, we have
one query in perform.sgml that looks at stxdndistinct.  Patch 0003 is
looking OK here as well.

For dependencies, the format switches from a single json object
with key/vals like that:
"3 => 4": 1.000000
To a JSON array made of elements like that:
{"degree": 1.000000, "attributes": [3],"dependency": 4},

For ndistincts, we move from a JSON blob with key/vals like that:
"3, 4": 11
To a JSON array made of the following elements:
{"ndistinct": 11, "attributes": [3,4]}

Using a keyword within each element would force a stronger validation
when these get imported back, which is a good thing.  I like that.

Before going in-depth into the input functions to cross-check the
amount of validation we should do, have folks any comments about the
proposed format?  That's the key point this patch set depends on, and
I'd rather not spend more time the whole thing if somebody would like
a different format.  This is the format that Tomas has mentioned at
the top of the thread.  Note: as noted upthread, pg_dump would be in
charge of transferring the data of the old format to the new format at
the end.

While looking at 0002 and 0004 (which have a couple of issues
actually), I have been wondering about moving into a new file the four
data-type functions (in, out, send and receive) and the new input
functions that rely on a new JSON lexer and parser logic into for both
ndistinct and dependencies.  The new set of headers added at the top
of mvdistinct.c and dependencies.c for the new code points that a
separation may be better in the long-term, because the new code relies
on parts of the backend that the existing code does not care about,
and these files become larger than the relation and attribute stats
files.  I would be tempted to name these new files pg_dependencies.c
and pg_ndistinct.c, mapping with their catalog types.  With this
separation, it looks like the "core" parts in charge of the
calculations with ndistinct and dependencies can be kept on its own.
What do you think?

A second comment is for 0005.  The routines of attributes.c are
applied to the new clear and restore functions.  Shouldn't these be in
stats_utils.c at the end?  That's where the "common" functions used by
the stats manipulation logic are.
--
Michael

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-07T22:28:50Z

>
>
> Patch 0001 for ndistinct was missing a documentation update, we have
> one query in perform.sgml that looks at stxdndistinct.  Patch 0003 is
> looking OK here as well.
>

Well spotted.


> For dependencies, the format switches from a single json object
> with key/vals like that:
> "3 => 4": 1.000000
> To a JSON array made of elements like that:
> {"degree": 1.000000, "attributes": [3],"dependency": 4},
>
> For ndistincts, we move from a JSON blob with key/vals like that:
> "3, 4": 11
> To a JSON array made of the following elements:
> {"ndistinct": 11, "attributes": [3,4]}
>
> Using a keyword within each element would force a stronger validation
> when these get imported back, which is a good thing.  I like that.
>
> Before going in-depth into the input functions to cross-check the
> amount of validation we should do, have folks any comments about the
> proposed format?  That's the key point this patch set depends on, and
> I'd rather not spend more time the whole thing if somebody would like
> a different format.  This is the format that Tomas has mentioned at
> the top of the thread.  Note: as noted upthread, pg_dump would be in
> charge of transferring the data of the old format to the new format at
> the end.
>

I'm open to other formats, but aside from renaming the json keys (maybe
"attnums" or "keys" instead of "attributes"?), I'm not sure what really
could be done and still be JSON. I suppose we could go with a tuple format
like this:

'{({3,4},11),...}' for pg_ndistinct and
'{({3},4,1.00000),...}'  for pg_dependencies.

Those would certainly be more compact, but makes for a hard read by humans,
and while the JSON code is big, it's also proven in other parts of the
codebase, hence less risky.



>
> While looking at 0002 and 0004 (which have a couple of issues
> actually), I have been wondering about moving into a new file the four
> data-type functions (in, out, send and receive) and the new input
> functions that rely on a new JSON lexer and parser logic into for both
> ndistinct and dependencies.  The new set of headers added at the top
> of mvdistinct.c and dependencies.c for the new code points that a
> separation may be better in the long-term, because the new code relies
> on parts of the backend that the existing code does not care about,
> and these files become larger than the relation and attribute stats
> files.  I would be tempted to name these new files pg_dependencies.c
> and pg_ndistinct.c, mapping with their catalog types.  With this
> separation, it looks like the "core" parts in charge of the
> calculations with ndistinct and dependencies can be kept on its own.
> What do you think?
>

A part of me thinks that everything that remains after removing
in/out/send/recv is just taking a table sample data structure and crunching
numbers to come up with the deserialized data structure...that's in/out
with a different starting/ending points.

There's no denying that JSON parsing is a very different code style than
statistical number crunching, and mixing the two is incongruous, so it's
worth a shot, and I'll try that for v9.


>
> A second comment is for 0005.  The routines of attributes.c are
> applied to the new clear and restore functions.  Shouldn't these be in
> stats_utils.c at the end?  That's where the "common" functions used by
> the stats manipulation logic are.
>

I assume you're referring to attribute_stats.c. I think that would cause
stats_utils.c to have to pull in a lot of things from attribute_stats.c,
and that would create the exact sort of include-pollution that you're
trying to avoid in the mvdistinct.c/dependencies.c situation mentioned
above.

The one lone exception to this is text_to_stavalues(), which is a fancy
wrapper around array_in() and could perhaps be turned to even more generic
usage outside of stats in general.

The functions in question are needed because the exprs value is itself an
array of partly-filled-out pg_attribute tuples, so it's common to those two
needs, but specific to stats about attributes. Maybe we need an
attr_stats_utils.h?

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-11-07T22:56:48Z

On Fri, Nov 07, 2025 at 05:28:50PM -0500, Corey Huinker wrote:
> I'm open to other formats, but aside from renaming the json keys (maybe
> "attnums" or "keys" instead of "attributes"?), I'm not sure what really
> could be done and still be JSON. I suppose we could go with a tuple format
> like this:
> 
> '{({3,4},11),...}' for pg_ndistinct and
> '{({3},4,1.00000),...}'  for pg_dependencies.
> 
> Those would certainly be more compact, but makes for a hard read by humans,
> and while the JSON code is big, it's also proven in other parts of the
> codebase, hence less risky.

I've liked the human-readability factor of the format in the current
patches with names in the keys, and values assigned to each property.

Another thing that may be worth doing is pushing the names of the keys
and some its the JSON meta-data shaping the object into a new header
than can be loaded by both the backend and the frontend.  It would be
nice to not hardcode this knowledge in a bunch of places if we finish
by renaming these attributes.

> A part of me thinks that everything that remains after removing
> in/out/send/recv is just taking a table sample data structure and crunching
> numbers to come up with the deserialized data structure...that's in/out
> with a different starting/ending points.
> 
> There's no denying that JSON parsing is a very different code style than
> statistical number crunching, and mixing the two is incongruous, so it's
> worth a shot, and I'll try that for v9.

Yeah, right.  Thanks.  The parsing pieces seem like pieces worth their
own file.

> The functions in question are needed because the exprs value is itself an
> array of partly-filled-out pg_attribute tuples, so it's common to those two
> needs, but specific to stats about attributes. Maybe we need an
> attr_stats_utils.h?

Hmm, maybe.  I'd be OK to revisit these structures once we're happy
with the in/out structures.  That would be a good start point before
working on the SQL functions and the dump/restore bits in more
details.
--
Michael

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-10T05:33:40Z

>
> Another thing that may be worth doing is pushing the names of the keys
> and some its the JSON meta-data shaping the object into a new header
> than can be loaded by both the backend and the frontend.  It would be
> nice to not hardcode this knowledge in a bunch of places if we finish
> by renaming these attributes.
>

It may not be quite what you wanted, but the attribute names are now static
constants in the new adt c files. It's possible/probable that you wanted
them in some header file, but so far I haven't had to create any new header
files, but that can be done if desired.

Yeah, right.  Thanks.  The parsing pieces seem like pieces worth their
> own file.
>

That's done in the 0008-0009 patches. If I was starting from scratch, I
would have moved the pre-existing in/out/send/recv functions to their own
files in their own patches before changing the output format, but tacked on
at the end like they are it's easier to see what the changes were, and the
patches will probably get squashed together anyway.


> > The functions in question are needed because the exprs value is itself an
> > array of partly-filled-out pg_attribute tuples, so it's common to those
> two
> > needs, but specific to stats about attributes. Maybe we need an
> > attr_stats_utils.h?
>
> Hmm, maybe.  I'd be OK to revisit these structures once we're happy
> with the in/out structures.  That would be a good start point before
> working on the SQL functions and the dump/restore bits in more
> details.
>

In addition to the changes detailed above, I fixed a few typos and
incorporated the v8 change.

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-11-11T08:06:49Z

On Mon, Nov 10, 2025 at 12:33:40AM -0500, Corey Huinker wrote:
> It may not be quite what you wanted, but the attribute names are now static
> constants in the new adt c files. It's possible/probable that you wanted
> them in some header file, but so far I haven't had to create any new header
> files, but that can be done if desired.

No, that's not the best thing we can do with the dump/restore pieces
in mind.  Let's put that in a separate header.

> That's done in the 0008-0009 patches. If I was starting from scratch, I
> would have moved the pre-existing in/out/send/recv functions to their own
> files in their own patches before changing the output format, but tacked on
> at the end like they are it's easier to see what the changes were, and the
> patches will probably get squashed together anyway.

Thanks for the new patch.  And FWIW I disagree with this approach:
cleanup and refactoring pieces make more sense if done first, as these
lead to less code churn in the final result.  So...  I've begun to put
my hands on the patch set.  The whole has been restructured a bit, as
per the attached.  Patch 0001 to 0004 feel OK here, these include two
code moves and the two output functions:
- Two new files for adt/, that I'm planning to apply soon as a
separate cleanup.
- New output functions, with keys added to a new header named
statistics_format.h, for frontend and backend consumption.

Next comes the input functions.  First, I am unhappy with the amount
of testing that has been put into ndistinct, first and only input
facility I've looked at in details for the moment.  I have quickly
spotted a couple a few issues while testing buggy input, like this one
that crashes on pointer dereference, not good obviously:
SELECT '[]'::pg_ndistinct;

There was a second one with the error message generated when using an
incorrect key value.

Second, the inputs are too permissive and could be more strictly
checked IMHO.  For example, patterns like that are incorrect, still
authorized with only the patches up to 0005 in:
- Duplicated list of attributes:
SELECT '[{"attributes" : [2,3], "ndistinct" : 4},
         {"attributes" : [2,3], "ndistinct" : 4}]'::pg_ndistinct;
- Partial (K,N) sets, for example say we take stats on attrs (1,2,3),
a partial input like this one is basically OK:
SELECT '[{"attributes" : [1,3], "ndistinct" : 4},
         {"attributes" : [1,2,3], "ndistinct" : 4}]'::pg_ndistinct;

These are checked in the patches that introduce the functions like
with pg_ndistinct_validate_items(), based on the list of stxkeys we
have.  However, I think that this is not enough by itself.  Shouldn't
we check that the list of items in the array is what we expect based
on the longest "attributes" array at least, even after a JSON that was
parsed?  That would be cheap to check in the output function itself,
at least as a first layer of checks before trying something with the
import function and cross-checking the list of attributes for the
extended statistics object.  This means checking that for N attributes
we have all the elements we'd expect in each element of the array,
without gaps or duplications, with an extra step done once the JSON
parsing is finished.  Except for this sanity issue this part of the
patch set should be mostly OK, plus more cleanup and more typo/grammar
fixes. 

I suspect a similar family of issues with pg_dependencies, and it
would be nice to move the tests with the input function into a new
regression file, like the other one.

I've rebased the full set using the new structure.  0001~0004 are
clean.  0005~ need more work and analysis, but that's a start.
--
Michael

Re: Extended Statistics set/restore/clear functions.

jian he <jian.universality@gmail.com> — 2025-11-11T15:14:34Z

On Tue, Nov 11, 2025 at 4:07 PM Michael Paquier <michael@paquier.xyz> wrote:
> I've rebased the full set using the new structure.  0001~0004 are
> clean.  0005~ need more work and analysis, but that's a start.
hi.

diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 106583fb2965..e0e9f0468cb8 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1579,9 +1579,39 @@ ANALYZE zipcodes;
 SELECT stxkeys AS k, stxdndistinct AS nd
   FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
   WHERE stxname = 'stts2';
--[ RECORD 1 ]------------------------------------------------------&zwsp;--
+-[ RECORD 1 ]-------------------
 k  | 1 2 5
-nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
+nd | [                          +
+   |     {                      +
+   |         "ndistinct": 33178,+
+   |         "attributes": [    +
+   |             1,             +
+   |             2              +
+   |         ]                  +
+   |     },                     +

If you not specify as
 SELECT stxkeys AS k, stxdndistinct AS nd
   FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
   WHERE stxname = 'stts2' \gx
then it won't look like ``-[ RECORD 1 ]``.

I tried the dummy data, the above query, the nd column data look like

[{"attributes": [1, 2], "ndistinct": 2} .....
which does not look like the above.

So I guess the query in doc/src/sgml/perform.sgml would be:

SELECT stxkeys AS k, jsonb_pretty(d.stxdndistinct::text::jsonb) AS nd
FROM pg_statistic_ext join pg_statistic_ext_data d on (oid = d.stxoid)
WHERE stxname = 'stts2' \gx

per related thread, the word "join" needs uppercase?


--
jian
https://www.enterprisedb.com/

Re: Extended Statistics set/restore/clear functions.

jian he <jian.universality@gmail.com> — 2025-11-12T02:51:51Z

On Tue, Nov 11, 2025 at 11:14 PM jian he <jian.universality@gmail.com> wrote:
>
> On Tue, Nov 11, 2025 at 4:07 PM Michael Paquier <michael@paquier.xyz> wrote:
> > I've rebased the full set using the new structure.  0001~0004 are
> > clean.  0005~ need more work and analysis, but that's a start.

hi.

+Datum
+pg_ndistinct_out(PG_FUNCTION_ARGS)
+
+
+ appendStringInfo(&str, "], \"" PG_NDISTINCT_KEY_NDISTINCT "\": %d}",
+ (int) item.ndistinct);

I’m a bit confused about the part above,
item.ndistinct is double type, we just cast it to int type?


after apply 0004, the below in doc/src/sgml/perform.sgml also need to change?

<programlisting>
CREATE STATISTICS stts (dependencies) ON city, zip FROM zipcodes;

ANALYZE zipcodes;

SELECT stxname, stxkeys, stxddependencies
  FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
  WHERE stxname = 'stts';
 stxname | stxkeys |             stxddependencies
---------+---------+------------------------------------------
 stts    | 1 5     | {"1 => 5": 1.000000, "5 => 1": 0.423130}
(1 row)
</programlisting>


Do you think it's worth the trouble to have two separate
appendStringInfoChar for ``{}``?

for example in loop ``for (i = 0; i < ndist->nitems; i++)``. we can change to:
        appendStringInfoChar(&str, '{');
        appendStringInfo(&str, "\"" PG_NDISTINCT_KEY_ATTRIBUTES "\": [%d",
                         item.attributes[0]);

        for (int j = 1; j < item.nattributes; j++)
            appendStringInfo(&str, ", %d", item.attributes[j]);

        appendStringInfo(&str, "], \"" PG_NDISTINCT_KEY_NDISTINCT "\": %d",
                         (int) item.ndistinct);
        appendStringInfoChar(&str, '}');


--
jian
https://www.enterprisedb.com/

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-12T06:47:33Z

>
> Thanks for the new patch.  And FWIW I disagree with this approach:
> cleanup and refactoring pieces make more sense if done first, as these
> lead to less code churn in the final result.  So...  I've begun to put
> my hands on the patch set.  The whole has been restructured a bit, as
> per the attached.  Patch 0001 to 0004 feel OK here, these include two
> code moves and the two output functions:
> - Two new files for adt/, that I'm planning to apply soon as a
> separate cleanup.
> - New output functions, with keys added to a new header named
> statistics_format.h, for frontend and backend consumption.
>

Agreed, 0001-0004 all look good.


> Next comes the input functions.  First, I am unhappy with the amount
> of testing that has been put into ndistinct, first and only input
> facility I've looked at in details for the moment.  I have quickly
> spotted a couple a few issues while testing buggy input, like this one
> that crashes on pointer dereference, not good obviously:
> SELECT '[]'::pg_ndistinct;
>

- I put some work into more specific error messages for invalid values for
both pg_ndistinct and pg_dependencies.
- The check for empty attribute lists and item lists now occur in the
array-end event handler.
- Also tried to standardize conventions between the two data types (switch
statements, similar utility functions, etc).


>
> These are checked in the patches that introduce the functions like
> with pg_ndistinct_validate_items(), based on the list of stxkeys we
> have.  However, I think that this is not enough by itself.  Shouldn't
> we check that the list of items in the array is what we expect based
> on the longest "attributes" array at least, even after a JSON that was
> parsed?  That would be cheap to check in the output function itself,
> at least as a first layer of checks before trying something with the
> import function and cross-checking the list of attributes for the
> extended statistics object.


I added tests for both duplicate attribute sequences as well as making the
first-longest attribute sequence the template by which all later and
shorter sequences are checked. I had been reluctant to add checks like
this, because so many similar validations were removed from the earlier
statistics code like histograms and the like.


>
> I suspect a similar family of issues with pg_dependencies, and it
> would be nice to move the tests with the input function into a new
> regression file, like the other one.
>

Did so.

0001-0004,0007,0009 unchanged.

Significant modification of the stats_import.sql regression tests in 0008
to conform to stricter datatype rules enacted in 0005, 0006.

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-11-12T07:53:50Z

On Wed, Nov 12, 2025 at 01:47:33AM -0500, Corey Huinker wrote:
>> Thanks for the new patch.  And FWIW I disagree with this approach:
>> cleanup and refactoring pieces make more sense if done first, as these
>> lead to less code churn in the final result.  So...  I've begun to put
>> my hands on the patch set.  The whole has been restructured a bit, as
>> per the attached.  Patch 0001 to 0004 feel OK here, these include two
>> code moves and the two output functions:
>> - Two new files for adt/, that I'm planning to apply soon as a
>> separate cleanup.
> 
> Agreed, 0001-0004 all look good.

Applied 0001 and 0002 for now.
--
Michael

Re: Extended Statistics set/restore/clear functions.

zengman <zengman@halodbtech.com> — 2025-11-12T08:45:16Z

Hey, I saw that some comments in the recent commit have inconsistent styles—maybe we can tweak them to make them uniform!

```c
// pg_dependencies    - output routine for type pg_dependencies.
pg_dependencies_out   - output routine for type pg_dependencies.

// pg_ndistinct
//    output routine for type pg_ndistinct
pg_ndistinct_out
    output routine for type pg_ndistinct
```

Oh, I didn’t operate it properly, which resulted in the creation of a duplicate theme.
https://www.postgresql.org/message-id/tencent_6EC50DD07D5D567C15E65C46%40qq.com
--
Man Zeng

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-11-12T11:25:43Z

On Wed, Nov 12, 2025 at 08:45:16AM +0000, Man Zeng wrote:
> pg_dependencies_out   - output routine for type pg_dependencies.
> pg_ndistinct_out
> 
> Oh, I didn’t operate it properly, which resulted in the creation of a duplicate theme.
> https://www.postgresql.org/message-id/tencent_6EC50DD07D5D567C15E65C46%40qq.com

No problem.  I've just fixed both comments, thanks.  These were older
than the recent commit, but let's fix things when these are in sight.
--
Michael

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-12T22:21:36Z

>
> +
> + appendStringInfo(&str, "], \"" PG_NDISTINCT_KEY_NDISTINCT "\": %d}",
> + (int) item.ndistinct);
>
> I’m a bit confused about the part above,
> item.ndistinct is double type, we just cast it to int type?
>

It's a historical quirk. That's what the original output function did in
mvdistinct.c, so we maintain compatibility with that. Altering the internal
storage type would affect the bytea serialization, which would break binary
compatibility.


> after apply 0004, the below in doc/src/sgml/perform.sgml also need to
> change?
>

Yes it does, good catch.


> Do you think it's worth the trouble to have two separate
> appendStringInfoChar for ``{}``?
>
> for example in loop ``for (i = 0; i < ndist->nitems; i++)``. we can change
> to:
>

I agree that that feels more symmetrical. However, it seems the prevailing
wisdom is that we're already paying for a string interpolation in the very
next appendStringInfo(), we might as well save ourselves a function call.
Hence, I left that one as-is.

The sgml change has been worked into a rebased and reduced patch set
(thanks for the commits Michael!)

Re: Extended Statistics set/restore/clear functions.

jian he <jian.universality@gmail.com> — 2025-11-13T07:01:42Z

hi.

v12-0001 and v12-0002 overall look good to me.

        if (dependency->nattributes <= 1)
            elog(ERROR, "invalid zero-length nattributes array in
MVDependencies");
This is an unlikely-to-happen error message,  but still, “nattributes”
seems confusing?

in  doc/src/sgml/func/func-json.sgml, we have \gset
<programlisting>
SELECT '{
.......
  }
}' AS json \gset
</programlisting>

similarly, in doc/src/sgml/perform.sgml, I think the query should be:

SELECT stxkeys AS k, jsonb_pretty(stxdndistinct::text::jsonb) AS nd
  FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
  WHERE stxname = 'stts2'    \gx



--
jian
https://www.enterprisedb.com/

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-13T07:32:17Z

On Thu, Nov 13, 2025 at 2:02 AM jian he <jian.universality@gmail.com> wrote:

> hi.
>
> v12-0001 and v12-0002 overall look good to me.
>
>         if (dependency->nattributes <= 1)
>             elog(ERROR, "invalid zero-length nattributes array in
> MVDependencies");
> This is an unlikely-to-happen error message,  but still, “nattributes”
> seems confusing?
>

Agreed the error message should be changed if it's kept at all. That error
may never occur now that we test for empty arrays in the array close event
handler. So maybe this becomes a plain assert().

> similarly, in doc/src/sgml/perform.sgml, I think the query should be:
>
> SELECT stxkeys AS k, jsonb_pretty(stxdndistinct::text::jsonb) AS nd
>   FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
>   WHERE stxname = 'stts2'    \gx
>

The example almost certainly predates \gx, so that's a good suggestion.

Re: Extended Statistics set/restore/clear functions.

jian he <jian.universality@gmail.com> — 2025-11-13T11:51:00Z

hi.
now looking at v12-0003-Add-working-input-function-for-pg_ndistinct.patch
again.

+ * example input:
+ *     [{"attributes": [6, -1], "ndistinct": 14},
+ *      {"attributes": [6, -2], "ndistinct": 9143},
+ *      {"attributes": [-1,-2], "ndistinct": 13454},
+ *      {"attributes": [6, -1, -2], "ndistinct": 14549}]
  */
 Datum
 pg_ndistinct_in(PG_FUNCTION_ARGS)

extenssted statistics surely won't work on system columns,
how should we deal with case like:
```
{"attributes": [6, -1], "ndistinct": 14}
{"attributes": [6, -7], "ndistinct": 14},
```
issue a warning or error out saying that your attribute number is invalid?
Should we discourage using system columns as examples in comments here?

I have added more test code in src/test/regress/sql/pg_ndistinct.sql,
to improve the code coverage.

as mentioned before in
https://postgr.es/m/CACJufxEZYqocFdgn-x-bJMRBSk_zkS=ziGGkaSumteiPDksnsg@mail.gmail.com
I think it's a good thing to change
``(errcode....``
to
``errcode``.
So I did the change.


+static JsonParseErrorType
+ndistinct_array_element_start(void *state, bool isnull)
+{
+ NDistinctParseState *parse = state;
+
+ switch(parse->state)
+ {
+ case NDIST_EXPECT_ATTNUM:
+ if (!isnull)
+ return JSON_SUCCESS;
+
+ ereturn(parse->escontext, (Datum) 0,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
+ errdetail("Attnum list elements cannot be null.")));

this (and many other places) looks wrong, because
ereturn would really return ``(Datum) 0``, and this function returns
JsonParseErrorType.
so we have to errsave here.

+typedef struct
+{
+ const char *str;
+ NDistinctSemanticState state;
+
+ List   *distinct_items; /* Accumulated complete MVNDistinctItems */
+ Node   *escontext;
+
+ bool found_attributes; /* Item has an attributes key */
+ bool found_ndistinct; /* Item has ndistinct key */
+ List   *attnum_list; /* Accumulated attributes attnums */
+ int64 ndistinct;
+} NDistinctParseState;

+ case NDIST_EXPECT_NDISTINCT:
+ /*
+ * While the structure dictates that ndistinct in a double precision
+ * floating point, in practice it has always been an integer, and it
+ * is output as such. Therefore, we follow usage precendent over the
+ * actual storage structure, and read it in as an integer.
+ */
+ parse->ndistinct = pg_strtoint64_safe(token, parse->escontext);
+
+ if (SOFT_ERROR_OCCURRED(parse->escontext))
+ return JSON_SEM_ACTION_FAILED;

NDistinctParseState.ndistinct should be integer,
otherwise pg_ndistinct_out will not be consistent with  pg_ndistinct_in?

SELECT '[{"attributes" : [1, 2], "ndistinct" :
2147483648}]'::pg_ndistinct; --error
                    pg_ndistinct
----------------------------------------------------
 [{"attributes": [1, 2], "ndistinct": -2147483648}]
(1 row)

The result seems not what we expected.

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-14T05:49:23Z

>
> extenssted statistics surely won't work on system columns,
> how should we deal with case like:
> ```
> {"attributes": [6, -1], "ndistinct": 14}
> {"attributes": [6, -7], "ndistinct": 14},
> ```
> issue a warning or error out saying that your attribute number is invalid?
> Should we discourage using system columns as examples in comments here?
>

Negative numbers represent the Nth expression defined in the extended
statistics object. So if you have extended statistics on a, b, length(a),
length(b) then you can legally have -1 and -2 in the attributes, but
nothing lower than that.
See functions pg_ndistinct_validate_items() and
pg_depdendencies_validate_deps() as these check the attributes in the value
against the definition of the extended stats object.

Though this does bring up a small naming problem: Elements of a
pg_dependencies are Dependency, abbreviated to dep, but are also called
Items like the elements in a pg_ndistinct. We should pick a standard name
for such things (probably item) and use it everywhere.


>
> I have added more test code in src/test/regress/sql/pg_ndistinct.sql,
> to improve the code coverage.
>

I'm trying to implement those test cases, but I may have missed some.


>
>
> this (and many other places) looks wrong, because
> ereturn would really return ``(Datum) 0``, and this function returns
> JsonParseErrorType.
> so we have to errsave here.
>

Good point. Implemented.



>
> NDistinctParseState.ndistinct should be integer,
> otherwise pg_ndistinct_out will not be consistent with  pg_ndistinct_in?
>

+1

Implemented many, but not all of these suggestions.

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-11-14T06:25:27Z

On Fri, Nov 14, 2025 at 12:49:23AM -0500, Corey Huinker wrote:
> Negative numbers represent the Nth expression defined in the extended
> statistics object. So if you have extended statistics on a, b, length(a),
> length(b) then you can legally have -1 and -2 in the attributes, but
> nothing lower than that.
>
> See functions pg_ndistinct_validate_items() and
> pg_depdendencies_validate_deps() as these check the attributes in the value
> against the definition of the extended stats object.

Exactly.  Extended stats on system columns don't work because they
don't really make sense as we want to track correlations between the
attributes defined, and these reflect internal states:
create table poo (a int, b int);
create statistics poos (ndistinct ) ON cmax, a from poo;
ERROR:  0A000: statistics creation on system columns is not supported

Note that the expressions are also stored in pg_stats_ext_exprs.

> I'm trying to implement those test cases, but I may have missed some.

I've found a lot of them with coverage-html during a previous lookup.
I'd like to think that we should aim for something close to 100%
coverage for the two input functions.

> Implemented many, but not all of these suggestions.

Thanks for the new versions, I'll also look at all these across the
next couple of days.  Probably not at 0005~ for now.
--
Michael

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-11-17T06:56:06Z

On Fri, Nov 14, 2025 at 03:25:27PM +0900, Michael Paquier wrote:
> Thanks for the new versions, I'll also look at all these across the
> next couple of days.  Probably not at 0005~ for now.

0001 and 0002 from series v13 have been applied to change the output
functions.

And I have looked at 0003 in details for now.  Attached is a revised
version for it, with many adjustments.  Some notes:
- Many portions of the coverage were missing.  I have measured the
coverage at 91% with the updated version attached.  This includes
coverage for some error reporting, something that we rely a lot on for
this code.
- The error reports are made simpler, with the token values getting
hidden.  While testing with some fancy values, I have actually noticed
that the error handlings for the parsing of the int16 and int32 values
were incorrect, the error reports used what the safe functions
generated, not the reports from the data type.
- Passing down arbitrary bytes sequences was leading to these bytes
reported in the error outputs because we cared about the token values.
I have added a few tests based on that for the code paths involved.

There is an extra thing that bugs me as incorrect for the pg_ndistinct
input, something I have not tackled myself yet.  Your patch checks
that subsets of attributes are included in the longest set found, but
it does not match the guarantees we have in mvndistinct.c: we have to
check that *all* the combinations generated by generator_init() are
satisfied based on the longest of attributes detected.  For example,
this is thought as correct in the input function:
SELECT '[{"attributes" : [-1,2], "ndistinct" : 1},
         {"attributes" : [-1,2,3], "ndistinct" : 3}]'::pg_ndistinct;

However it is obviously not correct as we are missing an element for
the attributes [-1, 3].  The simplest solution would be to export the
routines that generate the groups now in mvndistinct.c.  Also we
should make sure that the number of elements in the arrays match with
the number of groups we expect, not only the elements.  I don't think
that we need to care much about the values, but we ought to provide
stronger guarantees for the attributes listed in these elements.

Except for this argument, the input of pg_ndistinct feels OK in terms
of the guarantees that we'd want to enforce on an import.  The same
argument applies in terms of attribute number guarantees for
pg_dependencies, based on DependencyGenerator_init() & friends in
dependencies.c.  Could you look at that?

For pg_dependencies, we also require some checks on the value for
"dependency", of course, making sure that this matches with what's
expected with the "largest" sets of attributes.  In this case, we need
to track the union of "dependency" and "attributes", with "attributes"
having at least one element.

The tests of pg_dependencies need also to be extended more (begun that
a bit, far from being complete and I'm lacking of time this week due
to a conference).  One thing that I would add are nested JSON objects
in the paths where we expect values, for example.  Please note that I
have done a brush of 0004, while on it, cleaning up typos,
inconsistencies and making the error codes consistent with the
ndistinct case where possible.  This is not ready, but that's at least
it's a start to rely on.

In terms of committable bits, it would be better to apply the input
functions once both parts are ready to go.  For now I am attached a
v14 with the work I've put into them.  0005~ are not reviewed yet, as
mentioned previously.  The changes in pg_dependencies are actually
straight-forward to figure out (well, mostly) once the pg_ndistinct
changes are OK in shape.
--
Michael

Re: Extended Statistics set/restore/clear functions.

jian he <jian.universality@gmail.com> — 2025-11-17T12:53:35Z

On Mon, Nov 17, 2025 at 2:56 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Nov 14, 2025 at 03:25:27PM +0900, Michael Paquier wrote:
> > Thanks for the new versions, I'll also look at all these across the
> > next couple of days.  Probably not at 0005~ for now.
>
> 0001 and 0002 from series v13 have been applied to change the output
> functions.
>
> And I have looked at 0003 in details for now.  Attached is a revised
> version for it, with many adjustments.  Some notes:
> - Many portions of the coverage were missing.  I have measured the
> coverage at 91% with the updated version attached.  This includes
> coverage for some error reporting, something that we rely a lot on for
> this code.
> - The error reports are made simpler, with the token values getting
> hidden.  While testing with some fancy values, I have actually noticed
> that the error handlings for the parsing of the int16 and int32 values
> were incorrect, the error reports used what the safe functions
> generated, not the reports from the data type.
> - Passing down arbitrary bytes sequences was leading to these bytes
> reported in the error outputs because we cared about the token values.
> I have added a few tests based on that for the code paths involved.
>
> There is an extra thing that bugs me as incorrect for the pg_ndistinct
> input, something I have not tackled myself yet.  Your patch checks
> that subsets of attributes are included in the longest set found, but
> it does not match the guarantees we have in mvndistinct.c: we have to
> check that *all* the combinations generated by generator_init() are
> satisfied based on the longest of attributes detected.  For example,
> this is thought as correct in the input function:
> SELECT '[{"attributes" : [-1,2], "ndistinct" : 1},
>          {"attributes" : [-1,2,3], "ndistinct" : 3}]'::pg_ndistinct;
>
> However it is obviously not correct as we are missing an element for
> the attributes [-1, 3].  The simplest solution would be to export the
> routines that generate the groups now in mvndistinct.c.  Also we
> should make sure that the number of elements in the arrays match with
> the number of groups we expect, not only the elements.  I don't think
> that we need to care much about the values, but we ought to provide
> stronger guarantees for the attributes listed in these elements.
>
> Except for this argument, the input of pg_ndistinct feels OK in terms
> of the guarantees that we'd want to enforce on an import.  The same
> argument applies in terms of attribute number guarantees for
> pg_dependencies, based on DependencyGenerator_init() & friends in
> dependencies.c.  Could you look at that?
>

hi.

NDistinctSemanticState, last element
NDIST_EXPECT_COMPLETE should follow with a comma, like:
``NDIST_EXPECT_COMPLETE, ``


Lots of tests use pg_input_error_info, which is good.
since currently only pg_input_error_info, pg_input_is_valid
NDistinctParseState.escontext is not NULL.


+ /*
+ * If escontext already set, just use that. Anything else is a generic
+ * JSON parse error.
+ */
+ if (!SOFT_ERROR_OCCURRED(parse_state.escontext))
+ errsave(parse_state.escontext,
+ errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("malformed pg_ndistinct: \"%s\"", str),
+ errdetail("Must be valid JSON."));
seems no coverage for this.

segfault:
SELECT '[{"attributes" : [1,2,3,4,5,67,6,7,8], "ndistinct" : 4}]'::pg_ndistinct;

because src/backend/statistics/mvdistinct.c line: 310, Assert
Assert((item->nattributes >= 2) && (item->nattributes <= STATS_MAX_DIMENSIONS));


+SELECT '[{"attributes" : [2], "ndistinct" : 4}]'::pg_ndistinct;
+ERROR:  malformed pg_ndistinct: "[{"attributes" : [2], "ndistinct" : 4}]"
+LINE 1: SELECT '[{"attributes" : [2], "ndistinct" : 4}]'::pg_ndistin...
+               ^
+DETAIL:  The "ndistinct" key must contain an array of at least two attributes.

here, it should be
+DETAIL:  The "attributes" key must contain an array of at least two attributes.
?

--
jian
https://www.enterprisedb.com/

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-17T17:18:55Z

On Mon, Nov 17, 2025 at 1:56 AM Michael Paquier <michael@paquier.xyz> wrote:

> On Fri, Nov 14, 2025 at 03:25:27PM +0900, Michael Paquier wrote:
> > Thanks for the new versions, I'll also look at all these across the
> > next couple of days.  Probably not at 0005~ for now.
>
> 0001 and 0002 from series v13 have been applied to change the output
> functions.
>

Thanks!

Though, I was thinking some more about the output format. Using
jsonb_pretty() makes it readable in one way, and very clumsy in other ways.
Instead, I'm going to try doing the following:

replace (ndist_string_value, '},', E'}\n,')

This will result in the output value being formatted in exactly the way
described in the commit messages.

Of course, we could make the the actual default format by changing
appendStringInfoString(&str, ", ") instead.

One might argue that the output shouldn't get too flowery, but we're
already adding spaces between items and array elements, and we've already
made extensive changes favoring readability over compactness.

Review of changes to input function patches:

I'm curious about the re-parameterization of error messages involving
PG_NDISTINCT_KEY_ATTRIBUTES, PG_NDISTINCT_KEY_NDISTINCT, and similar keys
in dependencies. I like the parameterized version better, and was confused
as to why it was removed. Did you change your mind, or was it done for ease
of translation?

Wording changes all look good. Use of pg_input_error_info() makes sense.

>
> There is an extra thing that bugs me as incorrect for the pg_ndistinct
> input, something I have not tackled myself yet.  Your patch checks
> that subsets of attributes are included in the longest set found, but
> it does not match the guarantees we have in mvndistinct.c: we have to
> check that *all* the combinations generated by generator_init() are
> satisfied based on the longest of attributes detected.  For example,
> this is thought as correct in the input function:
> SELECT '[{"attributes" : [-1,2], "ndistinct" : 1},
>          {"attributes" : [-1,2,3], "ndistinct" : 3}]'::pg_ndistinct;
>
> However it is obviously not correct as we are missing an element for
> the attributes [-1, 3].  The simplest solution would be to export the
> routines that generate the groups now in mvndistinct.c.  Also we
> should make sure that the number of elements in the arrays match with
> the number of groups we expect, not only the elements.  I don't think
> that we need to care much about the values, but we ought to provide
> stronger guarantees for the attributes listed in these elements.
>

I had a feeling that was going to be requested. My question would be if
that we want to stick to modeling the other combinations after the first
longest combination, last longest, or if we want to defer those checks
altogether until we have to validate against an actual stats object?

>
> Except for this argument, the input of pg_ndistinct feels OK in terms
> of the guarantees that we'd want to enforce on an import.  The same
> argument applies in terms of attribute number guarantees for
> pg_dependencies, based on DependencyGenerator_init() & friends in
> dependencies.c.  Could you look at that?
>

Yes. I had already looked at it to verify that _all_ combinations were
always generated (they are), because I had some vague memory of the
generator dropping combinations that were statistically insignificant. In
retrospect, I have no idea where I got that idea.

>
> For pg_dependencies, we also require some checks on the value for
> "dependency", of course, making sure that this matches with what's
> expected with the "largest" sets of attributes.  In this case, we need
> to track the union of "dependency" and "attributes", with "attributes"
> having at least one element.
>

This is fairly simple to do. The dependency attnum is just appended to the
list of attnums, and the combinations are generated the same as ndistinct,
though obviously there are no single elements.

There's probably some common code between the lists to be shared, differing
only in how they report missing combinations.

> The tests of pg_dependencies need also to be extended more (begun that
> a bit, far from being complete and I'm lacking of time this week due
> to a conference).  One thing that I would add are nested JSON objects
> in the paths where we expect values, for example.  Please note that I
> have done a brush of 0004, while on it, cleaning up typos,
> inconsistencies and making the error codes consistent with the
> ndistinct case where possible.  This is not ready, but that's at least
> it's a start to rely on.
>

>
> In terms of committable bits, it would be better to apply the input
> functions once both parts are ready to go.  For now I am attached a
> v14 with the work I've put into them.  0005~ are not reviewed yet, as
> mentioned previously.  The changes in pg_dependencies are actually
> straight-forward to figure out (well, mostly) once the pg_ndistinct
> changes are OK in shape.
> --
> Michael

+1

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-11-17T23:19:58Z

On Mon, Nov 17, 2025 at 12:18:55PM -0500, Corey Huinker wrote:
> On Mon, Nov 17, 2025 at 1:56 AM Michael Paquier <michael@paquier.xyz> wrote:
> Though, I was thinking some more about the output format. Using
> jsonb_pretty() makes it readable in one way, and very clumsy in other ways.
> Instead, I'm going to try doing the following:
> 
> replace (ndist_string_value, '},', E'}\n,')
> 
> This will result in the output value being formatted in exactly the way
> described in the commit messages.
> 
> Of course, we could make the the actual default format by changing
> appendStringInfoString(&str, ", ") instead.

This feels like a different pretty still compressed output for json.
I don't think we should change the output functions to do that, but if
you want to add a function that filters these contents a bit in the
tests for the input functions, sure, why not.

> One might argue that the output shouldn't get too flowery, but we're
> already adding spaces between items and array elements, and we've already
> made extensive changes favoring readability over compactness.

I'd still keep it without newlines, FWIW.  So what we have in the
output functions is OK for me.  The key names could be updated to
something else for this release, I'm open for suggestions and we have
time for this release.  It would be nice to not do rename tweaks
several times.

> I'm curious about the re-parameterization of error messages involving
> PG_NDISTINCT_KEY_ATTRIBUTES, PG_NDISTINCT_KEY_NDISTINCT, and similar keys
> in dependencies. I like the parameterized version better, and was confused
> as to why it was removed. Did you change your mind, or was it done for ease
> of translation?

Yes, this one is to reduce the translation work, and because the
messages are quite the same across the board and deal with the same
requirements:
- Single integer expected after a key (attnum or actual value).
- Array of attribute expected after a key.
- For the degree key, float value.

> I had a feeling that was going to be requested. My question would be if
> that we want to stick to modeling the other combinations after the first
> longest combination, last longest, or if we want to defer those checks
> altogether until we have to validate against an actual stats object?

I would tend to think that performing one round of validation once the
whole set of objects has been parsed is going to be cheaper than
periodic checks.

One other thing would be to force a sort of the elements in the array
to match with the order these are generated when creating the stats.
We cannot do that in the input functions because we have no idea about
the order of the attributes in the statistics object yet.  Applying a
sort sounds also important to me to make sure that we order the stats
based on what the group generation functions (aka
generate_combinations(), etc.) think on the matter, which would
enforce a stronger binary compatibility after we are sure that we have
a full set of attributes listed in an array with the input function of
course.  I have briefly looked at the planner code where extended
stats are used, like selfuncs.c, and the ordering does not completely
matter, it seems, but it's cheap enough to enforce a stricter ordering
based on the K groups of N elements generated in the import function.

>> Except for this argument, the input of pg_ndistinct feels OK in terms
>> of the guarantees that we'd want to enforce on an import.  The same
>> argument applies in terms of attribute number guarantees for
>> pg_dependencies, based on DependencyGenerator_init() & friends in
>> dependencies.c.  Could you look at that?
> 
> Yes. I had already looked at it to verify that _all_ combinations were
> always generated (they are), because I had some vague memory of the
> generator dropping combinations that were statistically insignificant. In
> retrospect, I have no idea where I got that idea.

Hmm.  I would need to double-check the code to be sure, but I don't
think that we drop combinations, because the code prevents duplicates
to begin with, even for expressions:
create table aa (a int, b int);
create statistics stats (ndistinct) ON a, a, b, b from aa;
ERROR:  42701: duplicate column name in statistics definition
create statistics stats (ndistinct) ON (a + a), ((a + a)) from aa;
ERROR:  42701: duplicate expression in statistics definition

These don't make sense anyway because they have a predictible and
perfectly matching correlation relationship.

> This is fairly simple to do. The dependency attnum is just appended to the
> list of attnums, and the combinations are generated the same as ndistinct,
> though obviously there are no single elements.

Yeah.  That should be not be bad, I hope.

> There's probably some common code between the lists to be shared, differing
> only in how they report missing combinations.

I would like to agree on that, but it did not look that obvious to me
yesterday.  If you think that something could be refactored, I'd
suggest a refactoring patch that applies on top of the rest of the
patch set, with new generic facilities in stat_util.c, or even a
new separate file, if that leads to a cleaner result (okay, a
definition of "clean" is up to one's taste).
--
Michael

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-11-17T23:35:00Z

On Mon, Nov 17, 2025 at 08:53:35PM +0800, jian he wrote:
> segfault:
> SELECT '[{"attributes" : [1,2,3,4,5,67,6,7,8], "ndistinct" : 4}]'::pg_ndistinct;
> 
> because src/backend/statistics/mvdistinct.c line: 310, Assert
> Assert((item->nattributes >= 2) && (item->nattributes <= STATS_MAX_DIMENSIONS));

Ahah, good one here.  Perhaps we had better switch this assertion to
become an elog() with the import functions in mind, as well.  It may
be worth double-checking the other deserialization paths, as well.  We
need to be super careful about this stuff for data imports.
--
Michael

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-18T02:32:37Z

>
>
> This feels like a different pretty still compressed output for json.
> I don't think we should change the output functions to do that, but if
> you want to add a function that filters these contents a bit in the
> tests for the input functions, sure, why not.
>

+1, I'll probably just use the replaces rather than define a function and
give someone the false impression that the function exists elsewhere.


>
> Yes, this one is to reduce the translation work, and because the
> messages are quite the same across the board and deal with the same
> requirements:
> - Single integer expected after a key (attnum or actual value).
> - Array of attribute expected after a key.
> - For the degree key, float value.
>
> > I had a feeling that was going to be requested. My question would be if
> > that we want to stick to modeling the other combinations after the first
> > longest combination, last longest, or if we want to defer those checks
> > altogether until we have to validate against an actual stats object?
>
> I would tend to think that performing one round of validation once the
> whole set of objects has been parsed is going to be cheaper than
> periodic checks.
>
> One other thing would be to force a sort of the elements in the array
> to match with the order these are generated when creating the stats.
> We cannot do that in the input functions because we have no idea about
> the order of the attributes in the statistics object yet.  Applying a
> sort sounds also important to me to make sure that we order the stats
> based on what the group generation functions (aka
> generate_combinations(), etc.) think on the matter, which would
> enforce a stronger binary compatibility after we are sure that we have
> a full set of attributes listed in an array with the input function of
> course.  I have briefly looked at the planner code where extended
> stats are used, like selfuncs.c, and the ordering does not completely
> matter, it seems, but it's cheap enough to enforce a stricter ordering
> based on the K groups of N elements generated in the import function.
>
> >> Except for this argument, the input of pg_ndistinct feels OK in terms
> >> of the guarantees that we'd want to enforce on an import.  The same
> >> argument applies in terms of attribute number guarantees for
> >> pg_dependencies, based on DependencyGenerator_init() & friends in
> >> dependencies.c.  Could you look at that?
> >
> > Yes. I had already looked at it to verify that _all_ combinations were
> > always generated (they are), because I had some vague memory of the
> > generator dropping combinations that were statistically insignificant. In
> > retrospect, I have no idea where I got that idea.
>
> Hmm.  I would need to double-check the code to be sure, but I don't
> think that we drop combinations, because the code prevents duplicates
> to begin with, even for expressions:
> create table aa (a int, b int);
> create statistics stats (ndistinct) ON a, a, b, b from aa;
> ERROR:  42701: duplicate column name in statistics definition
> create statistics stats (ndistinct) ON (a + a), ((a + a)) from aa;
> ERROR:  42701: duplicate expression in statistics definition
>

So I looked at the generator functions, hoping they'd have enough in common
that they could be made generic. And they're just different enough that I
think it's not worth it to try.

But, if we don't care about the order of the combinations, I also don't
think we need to expose the functions at all. We know exactly how many
combinations there should be for any N attributes as each attribute must be
unique. So if we have the right number of unique combinations, and they're
all subsets of the first-longest, then we must have a complete set.
Thoughts on that?

Getting _too_ tight with the ordering and contents makes me concerned for
the day when the format might change. We don't want to _fail_ an upgrade
because some of the combinations were in the wrong order.



> These don't make sense anyway because they have a predictible and
> perfectly matching correlation relationship.
>

They do, for now, but are we willing to lock ourselves into that forever?


>
> > This is fairly simple to do. The dependency attnum is just appended to
> the
> > list of attnums, and the combinations are generated the same as
> ndistinct,
> > though obviously there are no single elements.
>
> Yeah.  That should be not be bad, I hope.
>
> > There's probably some common code between the lists to be shared,
> differing
> > only in how they report missing combinations.
>
> I would like to agree on that, but it did not look that obvious to me
> yesterday.  If you think that something could be refactored, I'd
> suggest a refactoring patch that applies on top of the rest of the
> patch set, with new generic facilities in stat_util.c, or even a
> new separate file, if that leads to a cleaner result (okay, a
> definition of "clean" is up to one's taste).
>

Looking over those functions, they both could have use the same generator,
but the dependencies-side decided that dependency order doesn't matter,
which puts doubt in my head that the order is perfectly the same for both,
so we'd better follow each individually IF we want to enforce order.

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-18T02:34:14Z

On Mon, Nov 17, 2025 at 6:35 PM Michael Paquier <michael@paquier.xyz> wrote:

> On Mon, Nov 17, 2025 at 08:53:35PM +0800, jian he wrote:
> > segfault:
> > SELECT '[{"attributes" : [1,2,3,4,5,67,6,7,8], "ndistinct" :
> 4}]'::pg_ndistinct;
> >
> > because src/backend/statistics/mvdistinct.c line: 310, Assert
> > Assert((item->nattributes >= 2) && (item->nattributes <=
> STATS_MAX_DIMENSIONS));
>
> Ahah, good one here.  Perhaps we had better switch this assertion to
> become an elog() with the import functions in mind, as well.  It may
> be worth double-checking the other deserialization paths, as well.  We
> need to be super careful about this stuff for data imports.
>

Yeah, I liked that one too, and I can act on that while we're debating how
tightly to turn the screws on combinatorics.

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-11-18T03:34:29Z

On Mon, Nov 17, 2025 at 09:32:37PM -0500, Corey Huinker wrote:
> So I looked at the generator functions, hoping they'd have enough in common
> that they could be made generic. And they're just different enough that I
> think it's not worth it to try.
> 
> But, if we don't care about the order of the combinations, I also don't
> think we need to expose the functions at all. We know exactly how many
> combinations there should be for any N attributes as each attribute must be
> unique. So if we have the right number of unique combinations, and they're
> all subsets of the first-longest, then we must have a complete set.
> Thoughts on that?
> 
> Getting _too_ tight with the ordering and contents makes me concerned for
> the day when the format might change. We don't want to _fail_ an upgrade
> because some of the combinations were in the wrong order.

That's fair.  The planner costing code pulling the stats numbers based
on the attributes was smart enough to not care much about the ordering
as far as I recall, but I'd rather make sure of that first.  This
needs some careful lookup.

>> These don't make sense anyway because they have a predictible and
>> perfectly matching correlation relationship.
>>
> 
> They do, for now, but are we willing to lock ourselves into that forever?

Perhaps not.  I cannot say for sure what's the future is going to be
made of.

> Looking over those functions, they both could have use the same generator,
> but the dependencies-side decided that dependency order doesn't matter,
> which puts doubt in my head that the order is perfectly the same for both,
> so we'd better follow each individually IF we want to enforce order.

I'd try to look at the bits related to pg_dependencies and
pg_ndistinct as two separate concepts, at the end.  They're sort of
alike, but have too many differences already.
--
Michael

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-18T04:39:50Z

>
> > But, if we don't care about the order of the combinations, I also don't
> > think we need to expose the functions at all. We know exactly how many
> > combinations there should be for any N attributes as each attribute must
> be
> > unique. So if we have the right number of unique combinations, and
> they're
> > all subsets of the first-longest, then we must have a complete set.
> > Thoughts on that?
> >
> > Getting _too_ tight with the ordering and contents makes me concerned for
> > the day when the format might change. We don't want to _fail_ an upgrade
> > because some of the combinations were in the wrong order.
>
> That's fair.  The planner costing code pulling the stats numbers based
> on the attributes was smart enough to not care much about the ordering
> as far as I recall, but I'd rather make sure of that first.  This
> needs some careful lookup.
>

I've done some experiments, creating extended stats objects up to the 8
attribute limit.

The big takeaway is that I wasn't imagining that he number of dependencies
combinations is NOT deterministic:

/*
 * if the dependency seems entirely invalid, don't store it
 */
if (degree == 0.0)
     continue;

So, in theory, an empty (i.e. '[]') pg_dependencies is valid.

The number of pg_ndistinct is deterministic, now, but I'm even less sure
that'll be true in the future.

We can definitely rely on the attnums being all the positive numbers in
ascending order first, followed by the negative numbers in descending
order, but that's about it. Which raises the question of how we describe
the error when attnums are out of order.

We know that the deserialize functions take the data's word for it as to
how many items to unpack, so I don't see the impact of not caring how many
might be missing. That even sort of feeds into Tom's idea that stats import
was in some sense a fuzzing tool.


> I'd try to look at the bits related to pg_dependencies and
> pg_ndistinct as two separate concepts, at the end.  They're sort of
> alike, but have too many differences already.
>

Based on the above, I think we can't really add anything beyond the attnum
order, and we have to relax some existing restrictions on pg_dependencies...

Re: Extended Statistics set/restore/clear functions.

jian he <jian.universality@gmail.com> — 2025-11-18T05:07:23Z

On Mon, Nov 17, 2025 at 2:56 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Nov 14, 2025 at 03:25:27PM +0900, Michael Paquier wrote:
> > Thanks for the new versions, I'll also look at all these across the
> > next couple of days.  Probably not at 0005~ for now.
>
> 0001 and 0002 from series v13 have been applied to change the output
> functions.
>

> And I have looked at 0003 in details for now.  Attached is a revised
> version for it, with many adjustments.  Some notes:
> - Many portions of the coverage were missing.  I have measured the
> coverage at 91% with the updated version attached.  This includes
> coverage for some error reporting, something that we rely a lot on for
> this code.
> - The error reports are made simpler, with the token values getting
> hidden.  While testing with some fancy values, I have actually noticed
> that the error handlings for the parsing of the int16 and int32 values
> were incorrect, the error reports used what the safe functions
> generated, not the reports from the data type.
> - Passing down arbitrary bytes sequences was leading to these bytes
> reported in the error outputs because we cared about the token values.
> I have added a few tests based on that for the code paths involved.
>
hi.

in src/backend/statistics/mvdistinct.c, we have:
Assert(AttributeNumberIsValid(item->attributes[j]));

should we disallow 0 in key attributes?
SELECT '[{"attributes" : [0,1], "ndistinct" : 4}]'::pg_ndistinct;
I didn't find a way to trigger this Assert yet.


+ errsave(parse->escontext,
+ errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
+ errdetail("Invalid \"%s\" value.", PG_NDISTINCT_KEY_ATTRIBUTES));

+ errsave(parse->escontext,
+ errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
+ errdetail("Invalid \"%s\" value.",
+  PG_NDISTINCT_KEY_NDISTINCT));

the errdetail is way too generic?
similar to ``select 'a'::int;``
we can
DETAIL:  Invalid input syntax for type integer: "a"
HINT: "ndistinct" value expected to be a type of integer.

what do you think?


we already have "fname" in ndistinct_object_field_start,
we can also print out the "fname", like:
    errsave(parse->escontext,
            errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
            errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
            errdetail("Unexpected key \"%s\"", fname),
            errhint("Only allowed keys are \"%s\" and \"%s\".",
                      PG_NDISTINCT_KEY_ATTRIBUTES,
                      PG_NDISTINCT_KEY_NDISTINCT));


SELECT '[{"attributes" : [2,3], "ndistinct" : 4, "ndistinct" :
14}]'::pg_ndistinct;
               pg_ndistinct
-------------------------------------------
 [{"attributes": [2, 3], "ndistinct": 14}]

SELECT '[{"attributes" : [2,3], "ndistinct" : 4, "attributes" :
[]}]'::pg_ndistinct;
               pg_ndistinct
------------------------------------------
 [{"attributes": [2, 3], "ndistinct": 4}]

Is the above output what we expected?


+ /*
+ * We need at least two attribute numbers for a ndistinct item, anything
+ * less is malformed.
+ */
+ natts = parse->attnum_list->length;
here, we can use list_length.

+ if (parse->attnum_list != NIL)
+ if (parse->distinct_items != NIL)
here, we can also use list_length.


--
jian
https://www.enterprisedb.com/

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-11-18T05:39:04Z

On Tue, Nov 18, 2025 at 01:07:23PM +0800, jian he wrote:
> + errsave(parse->escontext,
> + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
> + errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
> + errdetail("Invalid \"%s\" value.",
> +  PG_NDISTINCT_KEY_NDISTINCT));
> 
> the errdetail is way too generic?
> similar to ``select 'a'::int;``
> we can
> DETAIL:  Invalid input syntax for type integer: "a"
> HINT: "ndistinct" value expected to be a type of integer.
> 
> what do you think?

That is intentional, as it is intentional to not show the value of the
token in such cases.  This will be mostly used for the import
functions, so I am not sure that it is worth spending time on tuning
all these error cases that one is most likely not going to see as the
main scenarios are going to be through pg_dump/restore, most of the
time in the scope of an upgrade.

> SELECT '[{"attributes" : [2,3], "ndistinct" : 4, "ndistinct" :
> 14}]'::pg_ndistinct;
>                pg_ndistinct
> -------------------------------------------
>  [{"attributes": [2, 3], "ndistinct": 14}]
> 
> SELECT '[{"attributes" : [2,3], "ndistinct" : 4, "attributes" :
> []}]'::pg_ndistinct;
>                pg_ndistinct
> ------------------------------------------
>  [{"attributes": [2, 3], "ndistinct": 4}]
> 
> Is the above output what we expected?

Interesting one.  The extra "attributes" should not be required once
we have one set, indeed.
--
Michael

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-18T08:52:31Z

>
> > SELECT '[{"attributes" : [2,3], "ndistinct" : 4, "ndistinct" :
> > 14}]'::pg_ndistinct;
> >                pg_ndistinct
> > -------------------------------------------
> >  [{"attributes": [2, 3], "ndistinct": 14}]
> >
> > SELECT '[{"attributes" : [2,3], "ndistinct" : 4, "attributes" :
> > []}]'::pg_ndistinct;
> >                pg_ndistinct
> > ------------------------------------------
> >  [{"attributes": [2, 3], "ndistinct": 4}]
> >
> > Is the above output what we expected?
>
> Interesting one.  The extra "attributes" should not be required once
> we have one set, indeed.
> --
> Michael
>

v15:

- catches duplicate object keys cited above
- enforces attnum ordering (ascending positive numbers followed by
descending negative numbers, no zeros allowed), which means we get
duplicate attnum detection for free
- attnum validation is now done as soon as the attnum is parsed
- tests refactored to put attnums in proper order
- unfortunately, this means that one of the error cases from
stats_import.sql (attnum = 0) is now an error rather than something that
can be soft-excluded.
- didn't enforce combinatorical completeness for dependencies because not
all combinations are guaranteed to be there.
- didn't enforce combinatorical completeness for ndistinct because I'm not
convinced we should.

Re: Extended Statistics set/restore/clear functions.

jian he <jian.universality@gmail.com> — 2025-11-19T07:49:56Z

On Tue, Nov 18, 2025 at 4:52 PM Corey Huinker <corey.huinker@gmail.com> wrote:
> v15:
>
> - catches duplicate object keys cited above
> - enforces attnum ordering (ascending positive numbers followed by descending negative numbers, no zeros allowed), which means we get duplicate attnum detection for free
> - attnum validation is now done as soon as the attnum is parsed
> - tests refactored to put attnums in proper order
> - unfortunately, this means that one of the error cases from stats_import.sql (attnum = 0) is now an error rather than something that can be soft-excluded.
> - didn't enforce combinatorical completeness for dependencies because not all combinations are guaranteed to be there.
> - didn't enforce combinatorical completeness for ndistinct because I'm not convinced we should.
>

hi.

some of the switch->default, default don't have ``break``.

+ for (int i = 0; i < nitems; i++)
+ {
+ MVNDistinctItem *item = parse_state.distinct_items->elements[i].ptr_value;

exposing the ptr_value seems not a good idea, we can foreach_ptr
the attached patch using foreach_ptr.

in function pg_ndistinct_in some errsave can change to ereturn.
(I didn't do this part, though).

+ /*
+ * The attnum cannot be zero a negative number beyond the number of the
+ * possible expressions.
+ */
+ if (attnum == 0 || attnum < (0-STATS_MAX_DIMENSIONS))
+ {
+ errsave(parse->escontext,
+ errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
+ errdetail("Invalid \"%s\" element: %d.",
+ PG_NDISTINCT_KEY_ATTRIBUTES, attnum));
+ return JSON_SEM_ACTION_FAILED;
+ }
This part had no coverage tests, so I added a few.


as mentioned before
+ errsave(parse->escontext,
+ errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
+ errdetail("The \"%s\" key must contain an array of at least %d "
+  "and no more than %d attributes.",
+  PG_NDISTINCT_KEY_NDISTINCT, 2, STATS_MAX_DIMENSIONS));
here PG_NDISTINCT_KEY_NDISTINCT, should be PG_NDISTINCT_KEY_ATTRIBUTES.


Please check the attached minor miscellaneous changes.

--
jian
https://www.enterprisedb.com/

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-20T23:08:02Z

>
>
> some of the switch->default, default don't have ``break``.
>

The compiler doesn't require them, but I see that we do use them in a lot
of places, so I'll incorporate this.


>
> + for (int i = 0; i < nitems; i++)
> + {
> + MVNDistinctItem *item =
> parse_state.distinct_items->elements[i].ptr_value;
>
> exposing the ptr_value seems not a good idea, we can foreach_ptr
> the attached patch using foreach_ptr.


I didn't like this because it makes getting the index value harder, and the
index value is needed in lots of places. Instead I used list_nth() and
list_nth_int().



> in function pg_ndistinct_in some errsave can change to ereturn.
> (I didn't do this part, though).
>

-0.25

There's something that I like about the consistency of errsave() followed
by a return that makes it clear that "the function ends here" that I don't
get from ereturn().

What I would really like, is a way to generate unique (but translated)
errdetail values, and move the errsave/ereturn to after the switch
statement.


>
> + /*
> + * The attnum cannot be zero a negative number beyond the number of the
> + * possible expressions.
> + */
> + if (attnum == 0 || attnum < (0-STATS_MAX_DIMENSIONS))
> + {
> + errsave(parse->escontext,
> + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
> + errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
> + errdetail("Invalid \"%s\" element: %d.",
> + PG_NDISTINCT_KEY_ATTRIBUTES, attnum));
> + return JSON_SEM_ACTION_FAILED;
> + }
> This part had no coverage tests, so I added a few.
>

+1



>
>
> as mentioned before
> + errsave(parse->escontext,
> + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
> + errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
> + errdetail("The \"%s\" key must contain an array of at least %d "
> +  "and no more than %d attributes.",
> +  PG_NDISTINCT_KEY_NDISTINCT, 2, STATS_MAX_DIMENSIONS));
> here PG_NDISTINCT_KEY_NDISTINCT, should be PG_NDISTINCT_KEY_ATTRIBUTES.
>

+1

Did similar things to pg_dependencies.

Re: Extended Statistics set/restore/clear functions.

Chao Li <li.evan.chao@gmail.com> — 2025-11-21T03:36:23Z

After 16 rounds of revisions, v16 looks solid and robust. Only got a few small comments:

> On Nov 21, 2025, at 07:08, Corey Huinker <corey.huinker@gmail.com> wrote:
> 
> <v16-0001-Add-working-input-function-for-pg_ndistinct.patch><v16-0002-Add-working-input-function-for-pg_dependencies.patch><v16-0003-Expose-attribute-statistics-functions-for-use-in.patch><v16-0004-Add-extended-statistics-support-functions.patch><v16-0005-Include-Extended-Statistics-in-pg_dump.patch>

1 - 0001
```
+	/*
+	 * We need at least two attribute numbers for a ndistinct item, anything
+	 * less is malformed.
+	 */
+	natts = list_length(parse->attnum_list);
+	if ((natts < 2) || (natts > STATS_MAX_DIMENSIONS))
+	{
+		errsave(parse->escontext,
+				errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+				errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
+				errdetail("The \"%s\" key must contain an array of at least %d "
+						  "and no more than %d attributes.",
+						  PG_NDISTINCT_KEY_ATTRIBUTES, 2, STATS_MAX_DIMENSIONS));
+		return JSON_SEM_ACTION_FAILED;
+	}
```

Do we also need to reset item state vars similar to the success case?

2 - 0001
```
*
+ * Arrays can never be empty.
+ */
+static JsonParseErrorType
+ndistinct_array_end(void *state)
+{
```

This function comment is too simple, and doesn’t match to the function name. Reading the function body, the logic is clear, so I would suggest either just remove the comment or enhance it.

3  - 0001
```
static JsonParseErrorType
+ndistinct_array_element_start(void *state, bool isnull)
+{
+	NDistinctParseState *parse = state;
```

Nit: NDistinctParseState * can be const.

4 - 0001
```
+static bool
+valid_subsequent_attnum(const AttrNumber prev, const AttrNumber cur)
```

AttrNumber is int16, thus “const” here is unnecessary.

5 - 0001
```
+			return JSON_SUCCESS;
+			break;
```

“Break” after “return” is unnecessary. I tried to delete the “break” and didn’t see a compile warning, it should be safe to delete the “break”.

6 - 0001
```
+/*
+ * Compare the attribute arrays of two MVNDistinctItem values,
+ * looking for duplicate sets.

+ */
+static bool
+has_duplicate_attributes(const MVNDistinctItem *a,
+						 const MVNDistinctItem *b)
```

The function comment says “looking for duplicate sets”, seems to imply the function would return the set, but the function returns a bool, which is a little confusing.

7 - 0002
```
+	for (int i = 0; i < natts; i++)
+	{
+		dep->attributes[i] = (AttrNumber) list_nth_int(parse->attnum_list, i);
+		if (dep->attributes[i] == parse->dependency)
+		{
+			errsave(parse->escontext,
+					errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+					errmsg("malformed pg_dependencies: \"%s\"", parse->str),
+					errdetail("Item \"%s\" value %d found in the \"%s\" list.",
+							  PG_DEPENDENCIES_KEY_DEPENDENCY, parse->dependency,
+			   				  PG_DEPENDENCIES_KEY_ATTRIBUTES));
+			return JSON_SEM_ACTION_FAILED;
+		}
```

Similar to comment 1, do we need to reset state vars upon failure?

8 - 0002
```
+				errdetail("The \"%s\" key must contain an array of at least %d "
+						  " and no than %d elements.",
```

I believe “more” is missed between “no than”.

9 - 0002
```
+static JsonParseErrorType
+dependencies_array_start(void *state)
+{
+	DependenciesParseState *parse = state;
```

Similar to commit 3, DependenciesParseState * can be const.

10 - 0002
```
static bool
valid_subsequent_attnum(const AttrNumber prev, const AttrNumber cur)
```

This function is a dup to the one in 0001, can we put it to a common place?

11 - 0003
```
+extern void statatt_get_type(Oid reloid, AttrNumber attnum,
+							 Oid *atttypid, int32 *atttypmod,
+							 char *atttyptype, Oid *atttypcoll,
+							 Oid *eq_opr, Oid *lt_opr);
+extern void statatt_init_empty_tuple(Oid reloid, int16 attnum, bool inherited,
+									 Datum *values, bool *nulls, bool *replaces);
+
+extern void statatt_set_slot(Datum *values, bool *nulls, bool *replaces,
+							 int16 stakind, Oid staop, Oid stacoll,
+							 Datum stanumbers, bool stanumbers_isnull,
+							 Datum stavalues, bool stavalues_isnull);
+
+extern Datum text_to_stavalues(const char *staname, FmgrInfo *array_in, Datum d,
+							   Oid typid, int32 typmod, bool *ok);
+extern bool statatt_get_elem_type(Oid atttypid, char atttyptype,
+								  Oid *elemtypid, Oid *elem_eq_opr);
```

Several functions are made external visible, they are all renamed with adding a prefix “statatt_”, why text_to_stavalues is an exception?

11 - 0004
```
+bool
+pg_dependencies_validate_deps(MVDependencies *dependencies, int2vector *stxkeys, int numexprs, int elevel)
+{
+	int		attnum_expr_lowbound = 0 - numexprs;
+
+	for (int i = 0; i < dependencies->ndeps; i++)
+	{
+		MVDependency *dep = dependencies->deps[i];
```

This MVDependency * can be const.

12 - 0004
```
static void
upsert_pg_statistic_ext_data(Datum *values, bool *nulls, bool *replaces)
{
```

This function pass values, nulls and replaces to heap_modify_tuple() and heap_form_tuple(), the both functions take all const pointers as parameters. So, here values, nulls and replaces can all be const.

13 - 0004
```
+		if (!exprs_nulls[offset + AVG_WIDTH_ELEM])
+		{
+			ok = text_to_int4(exprs_elems[offset + AVG_WIDTH_ELEM],
+							  &values[Anum_pg_statistic_stawidth - 1]);
+
+			if (!ok)
+			{
+				char	   *s = TextDatumGetCString(exprs_elems[offset + NULL_FRAC_ELEM]);
+
+				ereport(WARNING,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("Expression %s element \"%s\" does not match expected input type.",
+								extexprarginfo[AVG_WIDTH_ELEM].argname, s)));
+				pfree(s);
+				return (Datum) 0;
+			}
+		}
```

Here: char    *s = TextDatumGetCString(exprs_elems[offset + NULL_FRAC_ELEM]);
Should NULL_FRAC_ELEM be AVG_WIDTH_ELEM?

14 - 0004
```
+		if (!exprs_nulls[offset + N_DISTINCT_ELEM])
+		{
+			ok = text_to_float4(exprs_elems[offset + N_DISTINCT_ELEM],
+								&values[Anum_pg_statistic_stadistinct - 1]);
+
+			if (!ok)
+			{
+				char	   *s = TextDatumGetCString(exprs_elems[offset + NULL_FRAC_ELEM]);
+
+				ereport(WARNING,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("Expression %s element \"%s\" does not match expected input type.",
+								extexprarginfo[N_DISTINCT_ELEM].argname, s)));
+				pfree(s);
+				return (Datum) 0;
+			}
+		}
```

Similar to comment 13, should NULL_FRAC_ELEM be N_DISTINCT_ELEM?

15 - 0004
```
+extern Datum import_mcvlist(HeapTuple tup, int elevel, int numattrs,
+							Oid *atttypids, int32 *atttypmods, Oid *atttypcolls,
+							int nitems, Datum *mcv_elems, bool *mcv_nulls,
+							bool *mcv_elem_nulls, float8 *freqs, float8 *base_freqs);
+
+extern Datum import_mcvlist(HeapTuple tup, int elevel, int numattrs,
+							Oid *atttypids, int32 *atttypmods, Oid *atttypcolls,
+							int nitems, Datum *mcv_elems, bool *mcv_nulls,
+							bool *mcv_elem_nulls, float8 *freqs, float8 *base_freqs);
```

import_mcvlist is declared twice, looks like a copy-paste mistake.

16 - 0005
```
+	PREPQUERY_DUMPEXTSTATSSTATS,
```

“statsstats” looks weird, I’d suggest PREPQUERY_DUMPEXTSTATSDATA.

Best regards,
—
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Extended Statistics set/restore/clear functions.

Yuefei Shi <shiyuefei1004@gmail.com> — 2025-11-21T05:33:56Z

On Fri, Nov 21, 2025 at 10:54 AM Corey Huinker <corey.huinker@gmail.com>
wrote:

>
>> some of the switch->default, default don't have ``break``.
>>
>
> The compiler doesn't require them, but I see that we do use them in a lot
> of places, so I'll incorporate this.
>
>
>>
>> + for (int i = 0; i < nitems; i++)
>> + {
>> + MVNDistinctItem *item =
>> parse_state.distinct_items->elements[i].ptr_value;
>>
>> exposing the ptr_value seems not a good idea, we can foreach_ptr
>> the attached patch using foreach_ptr.
>
>
> I didn't like this because it makes getting the index value harder, and
> the index value is needed in lots of places. Instead I used list_nth() and
> list_nth_int().
>
>
>
>> in function pg_ndistinct_in some errsave can change to ereturn.
>> (I didn't do this part, though).
>>
>
> -0.25
>
> There's something that I like about the consistency of errsave() followed
> by a return that makes it clear that "the function ends here" that I don't
> get from ereturn().
>
> What I would really like, is a way to generate unique (but translated)
> errdetail values, and move the errsave/ereturn to after the switch
> statement.
>
>
>>
>> + /*
>> + * The attnum cannot be zero a negative number beyond the number of the
>> + * possible expressions.
>> + */
>> + if (attnum == 0 || attnum < (0-STATS_MAX_DIMENSIONS))
>> + {
>> + errsave(parse->escontext,
>> + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
>> + errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
>> + errdetail("Invalid \"%s\" element: %d.",
>> + PG_NDISTINCT_KEY_ATTRIBUTES, attnum));
>> + return JSON_SEM_ACTION_FAILED;
>> + }
>> This part had no coverage tests, so I added a few.
>>
>
> +1
>
>
>
>>
>>
>> as mentioned before
>> + errsave(parse->escontext,
>> + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
>> + errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
>> + errdetail("The \"%s\" key must contain an array of at least %d "
>> +  "and no more than %d attributes.",
>> +  PG_NDISTINCT_KEY_NDISTINCT, 2, STATS_MAX_DIMENSIONS));
>> here PG_NDISTINCT_KEY_NDISTINCT, should be PG_NDISTINCT_KEY_ATTRIBUTES.
>>
>
> +1
>
> Did similar things to pg_dependencies.
>
>
A few small comments.

1. Minor typo fixes:
In pg_dependencies_in: "pg_dependencies parssing claims..." could be
corrected to "pg_dependencies parsing claims..."
In ndistinct_array_end: "must be an non-empty array" might be better as
"must be a non-empty array"
In the comment for dependencies_object_field_start: "depeendency" appears
to be a typo for "dependency"

2.Code maintainability suggestion:
I noticed the string "malformed pg_dependencies: "%s"" is used repeatedly
throughout the code. Would you consider defining this as a macro? This
could reduce duplication and make future updates easier.

3.Memory management observation:
Regarding item_attnum_list, while PostgreSQL's memory context mechanism
handles cleanup, explicitly freeing the allocated memory after use might
improve code clarity.

These are all minor points - the implementation looks solid overall. Thank
you for your work on this feature!

Re: Extended Statistics set/restore/clear functions.

jian he <jian.universality@gmail.com> — 2025-11-21T07:16:26Z

hi.

makeJsonLexContextCstringLen in v16-0001 and v16-0002
should use GetDatabaseEncoding()?

wondering, should we check the value of "degree" is between 0 to 1?


+ if ((natts < 1) || (natts > (STATS_MAX_DIMENSIONS - 1)))
+ {
+ errsave(parse->escontext,
+ errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("malformed pg_dependencies: \"%s\"", parse->str),
+ errdetail("The \"%s\" key must contain an array of at least %d "
+  " and no than %d elements.",
+  PG_DEPENDENCIES_KEY_ATTRIBUTES, 1, STATS_MAX_DIMENSIONS - 1));

DETAIL:  The "attributes" key must contain an array of at least 1  and
no than 7 elements.
if you check the error message carefully, there is one more extra
white space after "at least 1".


typedef struct MVDependency
{
    double        degree;            /* degree of validity (0-1) */
    AttrNumber    nattributes;    /* number of attributes */
    AttrNumber    attributes[FLEXIBLE_ARRAY_MEMBER];    /* attribute numbers */
} MVDependency;

SELECT '[{"attributes" : [2], "dependency" : 4, "degree":
"NaN"}]'::pg_dependencies;
SELECT '[{"attributes" : [2], "dependency" : 4, "degree":
"-inf"}]'::pg_dependencies;
SELECT '[{"attributes" : [2], "dependency" : 4, "degree":
"inf"}]'::pg_dependencies;

either error out or not, pg_dependencies.sql needs tests like the
above to test these special values.
json does not support special value like, "inf", "NaN", so we can
SELECT '[{"attributes" : [2], "dependency" : 4, "degree":
"-inf"}]'::pg_dependencies;
but can not
SELECT '[{"attributes" : [2], "dependency" : 4, "degree":
"-inf"}]'::pg_dependencies::text::pg_dependencies;

I found that numeric values behave the same with or without double quotes.
SELECT '[{"ndistinct" : "1", "attributes" : ["2","3"]}]'::pg_ndistinct;
SELECT '[{"ndistinct" : 1, "attributes" : [2,3]}]'::pg_ndistinct;
SELECT '[{"attributes" : [1, 2], "dependency" : 4, "degree":
"1"}]'::pg_dependencies;
SELECT '[{"attributes" : [1, "2"], "dependency" : "4", "degree":
"1"}]'::pg_dependencies;
i guess the reason is in makeJsonLexContextCstringLen, we set
need_escapes to true.


I added more regress tests to improve coverage for
src/test/regress/sql/pg_dependencies.sql


SELECT '[{"attributes" : [-11,1], "dependency" : -1116, "degree":
"1.2"}]'::pg_dependencies;
DETAIL:  Invalid "attributes" element: -11.
we already check the key "attributes", we can also check "dependency",
IF the value of ""dependency"
does not meet
``(if (parse->dependency == 0 || parse->dependency < (0-STATS_MAX_DIMENSIONS))``
then error out.
see the attached patch.


--
jian
https://www.enterprisedb.com/

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-21T22:41:15Z

>
>
> A few small comments.
>
> 1. Minor typo fixes:
>

+1 Thanks.


> 2.Code maintainability suggestion:
> I noticed the string "malformed pg_dependencies: "%s"" is used repeatedly
> throughout the code. Would you consider defining this as a macro? This
> could reduce duplication and make future updates easier.
>

It seems to be the way of things. As I stated earlier, what I'd really like
is the ability to form different errdetails and have them feed into one
errsave(), and still have the strings go through translation.



>
> 3.Memory management observation:
> Regarding item_attnum_list, while PostgreSQL's memory context mechanism
> handles cleanup, explicitly freeing the allocated memory after use might
> improve code clarity.
>
> These are all minor points - the implementation looks solid overall. Thank
> you for your work on this feature!
>

+1

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-22T08:26:19Z

>
>
> wondering, should we check the value of "degree" is between 0 to 1?
>

I'm open to it, but there was significant pushback to having tight validity
checks on other stats types.

The more things that we make errors, the more chances we have to breaking a
dump/restore. Granted, such values should never happen, but odd things do
happen.


>
> IF the value of ""dependency"
> does not meet
> ``(if (parse->dependency == 0 || parse->dependency <
> (0-STATS_MAX_DIMENSIONS))``
> then error out.
> see the attached patch.
>

Good catch.

Incorporated these patches as well as Yuefei's feedback especially around
memory management.

I added a comment debating the feasibility of testing for subsets of
attribute sets in pg_dependencies. Basically, I think we can't have the
test at all, but I haven't removed it just yet pending consensus.

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-11-23T06:57:31Z

On Sat, Nov 22, 2025 at 03:26:19AM -0500, Corey Huinker wrote:
> I'm open to it, but there was significant pushback to having tight validity
> checks on other stats types.

Yeah, I am not sure that there is much value in having strict checks
for the values.  Some control over the arguments listed in a single
object is much more valuable as these are easy to check in the input
functions. 

> The more things that we make errors, the more chances we have to breaking a
> dump/restore. Granted, such values should never happen, but odd things do
> happen.

Indeed.  Like the existing import functions, these should not lead to
hard failures, but WARNING-level reports.
--
Michael

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-11-25T07:28:51Z

On Sat, Nov 22, 2025 at 03:26:19AM -0500, Corey Huinker wrote:
> I added a comment debating the feasibility of testing for subsets of
> attribute sets in pg_dependencies. Basically, I think we can't have the
> test at all, but I haven't removed it just yet pending consensus.

+     * Verify that all attnum sets are a proper subset of the first longest
+     * attnum set.
+     *
+     * TODO:
+     *
+     * I'm fairly certain that because statisticsally insignificant dependency
+     * combinations are not stored, there is a chance that the longest dependency
+     * does not exist, and therefore this test cannot be done. I have left the
+     * test in place for the time being until the issue can be definitively
+     * settled.

As you have already quoted upthread, statext_dependencies_build()
settles the issue on this one, I think.  It is entirely possible that
any group returned by DependencyGenerator generates a degree value
that would prevent a given group to be stored, and this could as well
be the largest possible group there could be in the set.  So we cannot
do any of that for dependencies, unfortunately.  We can always rely on
the list of attributes when assigning the json blob to the stats
object, at least, cross-checking that each attribute list matches with
the numbers of the stats object.  At least we can check for
duplicates, which is better than nothing at all.

Regarding the suggested check where we'd want to enforce all the
groups of attributes to be listed depending on the longest set we have
found, at the end estimate_multivariate_ndistinct() checks the items
listed one-by-one, giving up if we cannot find something in the list
of items.  I think that I am going to be content with the patch as it
is, without this piece.  Let's add an extra SQL test to treat that as
valid input, though.  So I am feeling OK with the input for ndistinct
at this stage.  I have noticed a couple of issues in passing,
adjusting them.  We are reaching more than 90% of coverage with the
tests, and I am not sure that we can actually reach the rest except if
one of the previous steps failed.

So That's one.  Now into the second patch for the input of the
dependencies.

+SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "NaN"}]'::pg_dependencies;
+SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "-inf"}]'::pg_dependencies;
+SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "inf"}]'::pg_dependencies;
+SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "-inf"}]'::pg_dependencies::text::pg_dependencies;

Okay, I have to admit that these ones are fun.  I doubt that anybody
would actually do that, and these do not produce valid json objects,
which is what the last case shows.  Hmm, it makes sense to keep these,
and I'm still siding that we should not care too much about applying
checks on the values and complicate the input function more than that,
so fine by me.

There were a couple of things in the tests, missing quite a few soft
errors.  Many typos, grammar mistakes in the whole.  Also, please do
not split the error strings into multiple lines to make these
greppable.  There is also no need for a break after a return.  In some
cases, a return was used where a break made more sense as the default
path returned a failure.. 

The TODO in build_mvdependencies() could be an elog(), but I have left
it untouched for the errdetail().

We're reaching 91% of coverage here, not bad.  The rest does not seem
reachable, as far as I can see.

With that said, a v18 for the first two patches with the input
functions.  Comments and/or opinions?
--
Michael

Re: Extended Statistics set/restore/clear functions.

Chao Li <li.evan.chao@gmail.com> — 2025-11-25T08:59:28Z


> On Nov 25, 2025, at 15:28, Michael Paquier <michael@paquier.xyz> wrote:
> 
> On Sat, Nov 22, 2025 at 03:26:19AM -0500, Corey Huinker wrote:
>> I added a comment debating the feasibility of testing for subsets of
>> attribute sets in pg_dependencies. Basically, I think we can't have the
>> test at all, but I haven't removed it just yet pending consensus.
> 
> +     * Verify that all attnum sets are a proper subset of the first longest
> +     * attnum set.
> +     *
> +     * TODO:
> +     *
> +     * I'm fairly certain that because statisticsally insignificant dependency
> +     * combinations are not stored, there is a chance that the longest dependency
> +     * does not exist, and therefore this test cannot be done. I have left the
> +     * test in place for the time being until the issue can be definitively
> +     * settled.
> 
> As you have already quoted upthread, statext_dependencies_build()
> settles the issue on this one, I think.  It is entirely possible that
> any group returned by DependencyGenerator generates a degree value
> that would prevent a given group to be stored, and this could as well
> be the largest possible group there could be in the set.  So we cannot
> do any of that for dependencies, unfortunately.  We can always rely on
> the list of attributes when assigning the json blob to the stats
> object, at least, cross-checking that each attribute list matches with
> the numbers of the stats object.  At least we can check for
> duplicates, which is better than nothing at all.
> 
> Regarding the suggested check where we'd want to enforce all the
> groups of attributes to be listed depending on the longest set we have
> found, at the end estimate_multivariate_ndistinct() checks the items
> listed one-by-one, giving up if we cannot find something in the list
> of items.  I think that I am going to be content with the patch as it
> is, without this piece.  Let's add an extra SQL test to treat that as
> valid input, though.  So I am feeling OK with the input for ndistinct
> at this stage.  I have noticed a couple of issues in passing,
> adjusting them.  We are reaching more than 90% of coverage with the
> tests, and I am not sure that we can actually reach the rest except if
> one of the previous steps failed.
> 
> So That's one.  Now into the second patch for the input of the
> dependencies.
> 
> +SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "NaN"}]'::pg_dependencies;
> +SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "-inf"}]'::pg_dependencies;
> +SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "inf"}]'::pg_dependencies;
> +SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "-inf"}]'::pg_dependencies::text::pg_dependencies;
> 
> Okay, I have to admit that these ones are fun.  I doubt that anybody
> would actually do that, and these do not produce valid json objects,
> which is what the last case shows.  Hmm, it makes sense to keep these,
> and I'm still siding that we should not care too much about applying
> checks on the values and complicate the input function more than that,
> so fine by me.
> 
> There were a couple of things in the tests, missing quite a few soft
> errors.  Many typos, grammar mistakes in the whole.  Also, please do
> not split the error strings into multiple lines to make these
> greppable.  There is also no need for a break after a return.  In some
> cases, a return was used where a break made more sense as the default
> path returned a failure.. 
> 
> The TODO in build_mvdependencies() could be an elog(), but I have left
> it untouched for the errdetail().
> 
> We're reaching 91% of coverage here, not bad.  The rest does not seem
> reachable, as far as I can see.
> 
> With that said, a v18 for the first two patches with the input
> functions.  Comments and/or opinions?
> --
> Michael
> <v18-0001-Add-working-input-function-for-pg_ndistinct.patch><v18-0002-Add-working-input-function-for-pg_dependencies.patch>

I don’t see any of my comments are addressed in v18.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-11-25T10:06:09Z

On Tue, Nov 25, 2025 at 04:59:28PM +0800, Chao Li wrote:
> I don’t see any of my comments are addressed in v18.

Sorry about that, I thought that v17 was addressing these.  I have
reviewed your latest email posted here:
https://www.postgresql.org/message-id/2C8A20C7-4E08-41F4-B30E-AD9F8B007AE5@gmail.com

And well, this presents a bunch of issues I have already fixed (break
after return, one error message incorrect, some descriptions).  The
comments on top of the functions are actually true as they stand (aka
ndistinct_array_end() fails if we don't have an array).  There is no
need to reset state vars upon a JSON_SEM_ACTION_FAILED AFAIK.
Finally, the consts are non-appealing for me for functions that are
internally used in these files.  And forgot one:
valid_subsequent_attnum()'s duplication is actually adapted IMO, the
argument lists could have different assumptions depending on the type
of extended stats we are dealing with.  I did not check the comments
for 0003~0005.  Perhaps you are right that these need adjustments.

Thanks,
--
Michael

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-11-26T02:06:10Z

On Tue, Nov 25, 2025 at 04:28:51PM +0900, Michael Paquier wrote:
> With that said, a v18 for the first two patches with the input
> functions.  Comments and/or opinions?

Okay.  I have gone other these two and after an extra round of
polishing, mainly around comments (some suggested by Chao, actually),
style, a fix for the test with valid UTF8 sequences failing in
non-UTF8 databases, plus a few more tests that were missing, I have
applied the two patches for the input functions.

Perhaps these will need some adjustments, but I also see little point
in moving on with these based on the coverage we have and how the code
is shaped.  Could you rebase the rest of the patch set for the three
remaining pieces, Corey?  You will also need to address some of the
comments from Chao based on a review of v16, that were not addressed
as of v17. 
--
Michael

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-26T05:14:26Z

>
>
> I don’t see any of my comments are addressed in v18.
>
>
My apologies. My v17 focused entirely on the input functions, as those were
receiving the vast majority of the attention. Now that those are out of the
way (Thanks Michael!) I can address those issues.

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-11-26T05:24:30Z

On Tue, Nov 25, 2025 at 11:14:26PM -0600, Corey Huinker wrote:
> My apologies. My v17 focused entirely on the input functions, as those were
> receiving the vast majority of the attention. Now that those are out of the
> way (Thanks Michael!) I can address those issues.

No problem, thanks.  Please note that I am not sure if I will be able
to look at anything else posted on this thread until next week, so
please feel free to continue the discussion for the time being.
--
Michael

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-26T05:32:49Z

> Finally, the consts are non-appealing for me for functions that are
> internally used in these files.


I'm curious why you find them non-appealing for non-exposed functions, if
you have time to elaborate.


>   And forgot one:
> valid_subsequent_attnum()'s duplication is actually adapted IMO, the
> argument lists could have different assumptions depending on the type
> of extended stats we are dealing with.


Yeah, I saw that, reviewing the diff-of-diffs just now.

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-26T05:34:56Z

>
>
> No problem, thanks.  Please note that I am not sure if I will be able
> to look at anything else posted on this thread until next week, so
> please feel free to continue the discussion for the time being.
>
>
It's holiday season here too. I'm going to get the rebase out and see if
any lessons learned can be applied to the remaining patches, but my own
availability may be limited.

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-26T05:37:09Z

>
>
> non-UTF8 databases, plus a few more tests that were missing, I have
> applied the two patches for the input functions.
>

I've reviewed what you committed and it looks good to me aside from the one
nitpick I already made.


> Perhaps these will need some adjustments, but I also see little point
> in moving on with these based on the coverage we have and how the code
> is shaped.  Could you rebase the rest of the patch set for the three
> remaining pieces, Corey?  You will also need to address some of the
> comments from Chao based on a review of v16, that were not addressed
> as of v17.
>

+1

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-11-26T05:44:23Z

On Tue, Nov 25, 2025 at 11:37:09PM -0600, Corey Huinker wrote:
> I've reviewed what you committed and it looks good to me aside from the one
> nitpick I already made.

I was not really excited about these as it concerns only code local to
each file where the input functions are located, no routines that are
actually published for consumption by other parts of the backend.
--
Michael

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-26T05:50:14Z

On Tue, Nov 25, 2025 at 11:44 PM Michael Paquier <michael@paquier.xyz>
wrote:

> On Tue, Nov 25, 2025 at 11:37:09PM -0600, Corey Huinker wrote:
> > I've reviewed what you committed and it looks good to me aside from the
> one
> > nitpick I already made.
>
> I was not really excited about these as it concerns only code local to
> each file where the input functions are located, no routines that are
> actually published for consumption by other parts of the backend.
> --
> Michael
>

So if a function were previously static to one page, but now had utility
outside of that file, you would then consider changing the function
signature to add the consts? I ask because this may become relevant with
the statatt_* functions being moved in the remaining patches.

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-11-26T06:16:54Z

On Tue, Nov 25, 2025 at 11:50:14PM -0600, Corey Huinker wrote:
> So if a function were previously static to one page, but now had utility
> outside of that file, you would then consider changing the function
> signature to add the consts? I ask because this may become relevant with
> the statatt_* functions being moved in the remaining patches.

That's the kind of analysis that requires a case-by-case lookup.  The
input functions were not really that worth bothering to me based on
their logic being very isolated.  The refactoring pieces with
attribute stats that you are suggesting in 0003 may be indeed more
relevant if done this way.
--
Michael

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-11-26T09:21:46Z

On Tue, Nov 25, 2025 at 11:14 PM Corey Huinker <corey.huinker@gmail.com>
wrote:

>
>> I don’t see any of my comments are addressed in v18.
>>
>>
> My apologies. My v17 focused entirely on the input functions, as those
> were receiving the vast majority of the attention. Now that those are out
> of the way (Thanks Michael!) I can address those issues.
>

Paraphrasing the "quotes" here for brevity...

> Several functions are made external visible, they are all renamed with
adding a prefix “statatt_”, why text_to_stavalues is an exception?

Michael had specifically said that one didn't need to be renamed. I suppose
statatt_import_stavalues() might be a good name for it. It *is* specific to
attribute stats, though that definition also applies to the attribute stats
nested in the stxdexprs of extended stats. I have no strong opinion on the
matter.

> This MVDependency * can be const.

+1

> static void
> upsert_pg_statistic_ext_data(Datum *values, bool *nulls, bool *replaces)
> {
> ```

> This function pass values, nulls and replaces to heap_modify_tuple() and
heap_form_tuple(),
> the both functions take all const pointers as parameters.
> So, here values, nulls and replaces can all be const.
I find your argument here persuasive enough to override Michael's
previously stated non-excitement.

> ...the two NULL_FRAC questions

Yes, fixed.

> import_mcvlist is declared twice, looks like a copy-paste mistake.

That or a rebase/apply gone wrong. Fixed.

> PREPQUERY_DUMPEXTSTATSSTATS weird, suggest PREPQUERY_DUMPEXTSTATSDATA

Awkward names are almost inevitable when talking about the statistics
associated with an object of type "statistics".

There was a debate about whether statistics were data or not, and I'd
rather not restart that, so I went with PREPQUERY_DUMPEXTSTATSOBJSTATS for
now.

 Incorporated these fixes, and some other lessons learned.

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-12-03T05:00:31Z

On Wed, Nov 26, 2025 at 03:21:46AM -0600, Corey Huinker wrote:
> There was a debate about whether statistics were data or not, and I'd
> rather not restart that, so I went with PREPQUERY_DUMPEXTSTATSOBJSTATS for
> now.
> 
>  Incorporated these fixes, and some other lessons learned.

I've managed to miss the update you have sent here with the three
remaining patches.  I am going to looking at the remaining pieces
now as this patch set was still on my stack of patches to look at..
--
Michael

Re:Re: Extended Statistics set/restore/clear functions.

Yu Wang <wangyu_runtime@163.com> — 2025-12-04T00:52:16Z

Hi Corey
I was reviewing the recent patch v19-0003-Include-Extended-Statistics-in-pg_dump.patch and noticed a couple of small typo issues in the explanatory comments — nothing that affects the functionality.

Here are the two minor fixes I’d suggest:
1. “ndistintinct” should be “ndistinct”.
2. “depdendencies” should be “dependencies”.

Best regards,
Yu

在 2025-12-04 08:46:29，"Corey Huinker" <corey.huinker@gmail.com> 写道：

On Tue, Nov 25, 2025 at 11:14 PM Corey Huinker <corey.huinker@gmail.com> wrote:

I don’t see any of my comments are addressed in v18.

My apologies. My v17 focused entirely on the input functions, as those were receiving the vast majority of the attention. Now that those are out of the way (Thanks Michael!) I can address those issues.

Paraphrasing the "quotes" here for brevity...

> Several functions are made external visible, they are all renamed with adding a prefix “statatt_”, why text_to_stavalues is an exception?

Michael had specifically said that one didn't need to be renamed. I suppose statatt_import_stavalues() might be a good name for it. It *is* specific to attribute stats, though that definition also applies to the attribute stats nested in the stxdexprs of extended stats. I have no strong opinion on the matter.

> This MVDependency * can be const.

+1

> static void
> upsert_pg_statistic_ext_data(Datum *values, bool *nulls, bool *replaces)
> {
> ```

> This function pass values, nulls and replaces to heap_modify_tuple() and heap_form_tuple(),
> the both functions take all const pointers as parameters.
> So, here values, nulls and replaces can all be const.
I find your argument here persuasive enough to override Michael's previously stated non-excitement.

> ...the two NULL_FRAC questions

Yes, fixed.

> import_mcvlist is declared twice, looks like a copy-paste mistake.

That or a rebase/apply gone wrong. Fixed.

> PREPQUERY_DUMPEXTSTATSSTATS weird, suggest PREPQUERY_DUMPEXTSTATSDATA

Awkward names are almost inevitable when talking about the statistics associated with an object of type "statistics".

There was a debate about whether statistics were data or not, and I'd rather not restart that, so I went with PREPQUERY_DUMPEXTSTATSOBJSTATS for now.

 Incorporated these fixes, and some other lessons learned.

Re: Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-12-04T07:28:00Z

On Thu, Dec 04, 2025 at 08:52:16AM +0800, WangYu wrote:
> I was reviewing the recent patch
> v19-0003-Include-Extended-Statistics-in-pg_dump.patch and noticed a
> couple of small typo issues in the explanatory comments — nothing
> that affects the functionality.

Please be careful that the mailing lists of Postgres use
bottom-posting for thread discussions.  Please see here for details:
https://en.wikipedia.org/wiki/Posting_style#Bottom-posting

> Here are the two minor fixes I’d suggest:
> 1. “ndistintinct” should be “ndistinct”.
> 2. “depdendencies” should be “dependencies”.

Indeed.  These exact typos existed in some of the previous iterations
of the patch posted on this thread.

I have begun looking at 0001 and 0002 more seriously, and can now
share a couple of comments about them. 

My biggest issue with 0001 is the severe lack of documentation for all
the routines that are now in attribute_stats.c which are published so
as they can be used for the import functions of extended stats:

1) statatt_get_type(), that retrieves a set of information from a
relation and one of its attributes.  This is lightly documented in
extended_statistics_update() from 0002, with an argument claiming that
we do the same as attribute statistics.  If one then looks at
attribute_stats.c, we have a not-really-helpful "derive information
from attribute".  The trick is that we have an implied dependency
here: statatt_get_type() needs to be called *before*
statatt_get_elem_type() for attribute *and* extended stats,
statatt_get_elem_type() being fed data from statatt_get_type().

2) statatt_get_elem_type(), interesting piece of the puzzle that acts
as a wrapper for the type cache.  An important part here, it seems to
me, would be to at least tell that this relies on data retrieved from
statatt_get_type(), initially.  Once I've noticed this dependency the
logic felt a bit cleaner to understand, but we have no docs explaining
any of that..

3) statatt_init_empty_tuple(), that initializes a set of data to be
used for updates and/or insertion.  My first thought here is about 
"inherited", which turns around an assumption in
attribute_statistics_update().  How should callers assign its value?
For attribute_statistics_update() it comes as an input argument.
Also why are these values initialized as they are and why should these
make sense when applied to pg_statistic.  Extended stats set the
"inherited" argument to false in import_expressions().  That seems
equally important to document, both in a comment in extended_stats.c
and at the top of the function.  Of course it means that this refers
to the fact that the stats include values from stats tables, which do
not apply to extended stats.  But one would have to guess this fact
after knowing that this relates to the catalog pg_statistic..

4) statatt_set_slot is slightly simpler.  Still here, it is really easy
to miss that this routine should be fed data retrieved by
statatt_get_type().  staop can also be an eq or an lt operator.
stakind is one of the STATISTIC_KIND_* values, etc, etc.

5) text_to_stavalues() has a bit more explanation, with some data fed
from statatt_get_type() for the atttypmod and atttypid.  We have also
some fix of input coming from statatt_get_elem_type(), for elemtypid.

As far as I can see, there is nothing in these functions that require
them to be located in attribute_stats.c anymore.  Let's move that to
stats_utils.c, common place for all the shared facilities used by
these stats modules.  Or it could also be a new, separate patch, for
the manipulation and extraction of the tuple data related to
the pg_statistic tuples.

I am not OK with these being in attribute_stats.c, and as far as I can
see they have no contents that force these routines to 

Perhaps you know what these imply because you have written this code
originally in ce207d2a7901, but exposing them also means that it is
very important to document at least some concepts and assumptions
around them so as it is possible to guide somebody that may want to
use this code:
- What are the input arguments from?
- And what are the results?
- In which circumstances should these be used?

Then about 0002.

There are a bunch of assumptions embedded inside import_expressions()
that are interdependent with the calls of these attribute routines
which are a bit hard to evaluate by themselves, at least it's not
clear why some of these operations are done and why they are done the
way they are presented in the patch.  The code processes an array of
expressions and processes them in a sequential fashion, repeating
checks based on extended_stats_exprs_element.  Each step is going to
need more explanation to connect the dots.  For example, why the first
checks on exprs_nulls are important, with a second step of checks for
each STATISTIC_KIND_*.  Too much is let to the reader to guess, making
the code complicated to maintain as written, at least IMHO.

Is there any need to locate these new functions and code in
extended_stats.c, actually?  A separation into a new file seems like
it would be a cleaner result, leaving extended_stats.c to deal with
the import of the new data, including the fetch and deletion of any
existing data in pg_stat_ext that would be updated or inserted.
Perhaps name that extended_stats_funcs.c, as these are about the
direct SQL functions, import and deletion.

For the tests, it looks like it would be better to have everything in
a new file, like a stats_ext_import.sql for the clear and restore
bits.

A lot of the tests are copy-pastes of surrounding queries.  Perhaps it
would be better to use a SQL function wrapper or an IN clause with
multiple relations?  The patch has a lot of bloat with these repeated
queries..  Perhaps we should use a split for the elements in the
extended stats array instead of jsonb_pretty().  The results get quite
long here.

As of 0002, there are actually two independent pieces dealt with: the
deletion of extended stats with pg_clear_extended_stats(), and the
insert/update of extended stats with pg_restore_extended_stats().  I'd
suggest to split that into two parts, with the clear being first in
rank.  The deletion is simpler, and getting that in first simplifies
the review of the import part with less input to deal with.
--
Michael

Re: Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-12-04T23:12:16Z

>
>
> Indeed.  These exact typos existed in some of the previous iterations
> of the patch posted on this thread.
>

+1


> 1) statatt_get_type(), that retrieves a set of information from a
> relation and one of its attributes.  This is lightly documented in
> extended_statistics_update() from 0002, with an argument claiming that
> we do the same as attribute statistics.  If one then looks at
> attribute_stats.c, we have a not-really-helpful "derive information
> from attribute".  The trick is that we have an implied dependency
> here: statatt_get_type() needs to be called *before*
> statatt_get_elem_type() for attribute *and* extended stats,
> statatt_get_elem_type() being fed data from statatt_get_type().
>

I've added comments to that effect.


>
> 2) statatt_get_elem_type(), interesting piece of the puzzle that acts
> as a wrapper for the type cache.  An important part here, it seems to
> me, would be to at least tell that this relies on data retrieved from
> statatt_get_type(), initially.  Once I've noticed this dependency the
> logic felt a bit cleaner to understand, but we have no docs explaining
> any of that..
>

I've added comments that will hopefully connect those dots.


>
> 3) statatt_init_empty_tuple(), that initializes a set of data to be
> used for updates and/or insertion.  My first thought here is about
> ...
> not apply to extended stats.  But one would have to guess this fact
> after knowing that this relates to the catalog pg_statistic..
>

The comments didn't seem quite so illuminating, but I added them just the
same.


>
> 4) statatt_set_slot is slightly simpler.  Still here, it is really easy
> to miss that this routine should be fed data retrieved by
> statatt_get_type().  staop can also be an eq or an lt operator.
> stakind is one of the STATISTIC_KIND_* values, etc, etc.
>

Some here too.


> As far as I can see, there is nothing in these functions that require
> them to be located in attribute_stats.c anymore.  Let's move that to
> stats_utils.c, common place for all the shared facilities used by
> these stats modules.  Or it could also be a new, separate patch, for
> the manipulation and extraction of the tuple data related to
> the pg_statistic tuples.
>

They've been moved to stats_utils.c, and impact was fairly minimal. I was
worried about an explosion of #includes, but that ended up not being the
case.


> Perhaps you know what these imply because you have written this code
> originally in ce207d2a7901, but exposing them also means that it is
> very important to document at least some concepts and assumptions
> around them so as it is possible to guide somebody that may want to
> use this code:
> - What are the input arguments from?
> - And what are the results?
> - In which circumstances should these be used?
>

An attempt was made.


> Then about 0002.
>
> There are a bunch of assumptions embedded inside import_expressions()
> that are interdependent with the calls of these attribute routines
> ...
> each STATISTIC_KIND_*.  Too much is let to the reader to guess, making
> the code complicated to maintain as written, at least IMHO.
>

I tried to expand on these things, but I haven't yet moved them to another
file for reasons I'll explain later.


> Is there any need to locate these new functions and code in
> extended_stats.c, actually?  A separation into a new file seems like
> it would be a cleaner result, leaving extended_stats.c to deal with
> the import of the new data, including the fetch and deletion of any
> existing data in pg_stat_ext that would be updated or inserted.
> Perhaps name that extended_stats_funcs.c, as these are about the
> direct SQL functions, import and deletion.
>

We can move them to another file, but I'm not sure if that will actually
make things clearer. What is extended_stats.c if not functions about
extended stats? Perhaps it makes more sense to try harder to find the
commonality in the building of the pg_statistic tuples, move those
functions to stats_utils, and get things decluttered that way?

For the tests, it looks like it would be better to have everything in
> a new file, like a stats_ext_import.sql for the clear and restore
> bits.
>

Those tests very much build upon the tables and data already created for
regular relation/attribute imports, so we save quite a bit of boilerplate
in keeping them there.


>
> A lot of the tests are copy-pastes of surrounding queries.  Perhaps it
> would be better to use a SQL function wrapper or an IN clause with
> multiple relations?


I'd say "no", because we're doing a lot of things that *could* generate
ERRORs, and we're verifying that they generate WARNINGs instead. If we did
batch up those commands, and any one of them generated an error, we'd be at
loss for which one was the culprit, and we'd also lose any insight into
what else got busted in the other rows of the batch.


>   The patch has a lot of bloat with these repeated
> queries..  Perhaps we should use a split for the elements in the
> extended stats array instead of jsonb_pretty().  The results get quite
> long here.
>

I replaced '}, ' with '},\n' and removed the jsonb_pretty entirely, which
is something I had proposed doing earlier, as it's actually more pretty
than jsonb_pretty(). It's still in the same file, though.

I like the net result of this, and it might make sense to search other
regression tests for places where we can employ this mini-pretty pattern.


> As of 0002, there are actually two independent pieces dealt with: the
> deletion of extended stats with pg_clear_extended_stats(), and the
> insert/update of extended stats with pg_restore_extended_stats().  I'd
> suggest to split that into two parts, with the clear being first in
> rank.  The deletion is simpler, and getting that in first simplifies
> the review of the import part with less input to deal with.
>

The pg_clear function is quite trivial, as it's basically just marshalling
and locking before a call to delete_pg_statistic_ext_data(). So while I can
split them out, I'd also have to split the pg_proc.dat changes into two
separate ones, which makes it hard to see how similar they are. I have a
similar feeling about the changes to func-admin.sgml.

All of that, plus a rebase.

Re: Extended Statistics set/restore/clear functions.

Tom Lane <tgl@sss.pgh.pa.us> — 2025-12-05T00:37:32Z

Michael Paquier <michael@paquier.xyz> writes:
> Okay.  I have gone other these two and after an extra round of
> polishing, mainly around comments (some suggested by Chao, actually),
> style, a fix for the test with valid UTF8 sequences failing in
> non-UTF8 databases, plus a few more tests that were missing, I have
> applied the two patches for the input functions.

I notice that since e1405aa5e went in, a number of older buildfarm
animals are issuing warnings about it:

In file included from pg_dependencies.c:21:0:
pg_dependencies.c: In function "dependencies_scalar":
../../../../src/include/nodes/miscnodes.h:54:15: warning: the comparison will always evaluate as "true" for the address of "escontext" will never be NULL [-Waddress]
  ((escontext) != NULL && IsA(escontext, ErrorSaveContext) && \\
               ^
pg_dependencies.c:497:8: note: in expansion of macro "SOFT_ERROR_OCCURRED"
    if (SOFT_ERROR_OCCURRED(&escontext))
        ^
../../../../src/include/nodes/miscnodes.h:54:15: warning: the comparison will always evaluate as "true" for the address of "escontext" will never be NULL [-Waddress]
  ((escontext) != NULL && IsA(escontext, ErrorSaveContext) && \\
               ^
pg_dependencies.c:542:8: note: in expansion of macro "SOFT_ERROR_OCCURRED"
    if (SOFT_ERROR_OCCURRED(&escontext))
        ^
../../../../src/include/nodes/miscnodes.h:54:15: warning: the comparison will always evaluate as "true" for the address of "escontext" will never be NULL [-Waddress]
  ((escontext) != NULL && IsA(escontext, ErrorSaveContext) && \\
               ^
pg_dependencies.c:572:8: note: in expansion of macro "SOFT_ERROR_OCCURRED"
    if (SOFT_ERROR_OCCURRED(&escontext))
        ^
In file included from pg_ndistinct.c:21:0:
pg_ndistinct.c: In function "ndistinct_scalar":
../../../../src/include/nodes/miscnodes.h:54:15: warning: the comparison will always evaluate as "true" for the address of "escontext" will never be NULL [-Waddress]
  ((escontext) != NULL && IsA(escontext, ErrorSaveContext) && \\
               ^
pg_ndistinct.c:438:8: note: in expansion of macro "SOFT_ERROR_OCCURRED"
    if (SOFT_ERROR_OCCURRED(&escontext))
        ^
../../../../src/include/nodes/miscnodes.h:54:15: warning: the comparison will always evaluate as "true" for the address of "escontext" will never be NULL [-Waddress]
  ((escontext) != NULL && IsA(escontext, ErrorSaveContext) && \\
               ^
pg_ndistinct.c:488:9: note: in expansion of macro "SOFT_ERROR_OCCURRED"
    if (!SOFT_ERROR_OCCURRED(&escontext))
         ^

The above is from rhinoceros, but about half a dozen other animals
are reporting the same.  They are all running very old gcc versions,
mostly from RHEL7 or CentOS 7.

This of course arises because the source code is passing the address
of a local ErrorSaveContext variable to that macro.  We don't have
any other instances of that coding pattern AFAICS, so I wonder if
that was really the most adapted way to do it.  Looking at the
code, it seems to be turning around and stuffing a different error
message into a passed-in ErrorSaveContext, which feels a little
weird.

It may be that there's not a better way, in which case I'll probably
just teach my buildfarm-warning-scraper script to ignore these
warnings, as it's already doing for some other warnings from these
same animals (-Wmissing-braces mostly).  But I thought I'd raise a
question about it.

Also, I don't think these errdetail messages meet our style
guidelines:

                        errdetail("Invalid \"%s\" value.", PG_DEPENDENCIES_KEY_ATTRIBUTES));

They're supposed to be complete sentences.

			regards, tom lane

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-12-05T01:19:27Z

On Thu, Dec 04, 2025 at 07:37:32PM -0500, Tom Lane wrote:
> This of course arises because the source code is passing the address
> of a local ErrorSaveContext variable to that macro.  We don't have
> any other instances of that coding pattern AFAICS, so I wonder if
> that was really the most adapted way to do it.  Looking at the
> code, it seems to be turning around and stuffing a different error
> message into a passed-in ErrorSaveContext, which feels a little
> weird.

Thanks for the report.  Right.  There are three instances of that in
pg_dependencies.c, two in pg_ndistinct.c.  I would not mind doing the
attached to calm down these warnings, matching with the other areas of
the code, that simply removes the macro and checks the state value
directly.

> Also, I don't think these errdetail messages meet our style
> guidelines:
> 
>    errdetail("Invalid \"%s\" value.", PG_DEPENDENCIES_KEY_ATTRIBUTES));
> 
> They're supposed to be complete sentences.

Yeah, I should have been more careful here.  I have spent more time on
all of these, and adjusted all the errdetails for pg_ndistinct.c and
pg_dependencies.c to use full sentences.

I am just grouping everything with the attached.  They'd be better
if handled separately, obviously.  Comments?
--
Michael

Re: Extended Statistics set/restore/clear functions.

Tom Lane <tgl@sss.pgh.pa.us> — 2025-12-05T02:29:50Z

Michael Paquier <michael@paquier.xyz> writes:
> On Thu, Dec 04, 2025 at 07:37:32PM -0500, Tom Lane wrote:
>> This of course arises because the source code is passing the address
>> of a local ErrorSaveContext variable to that macro.  We don't have
>> any other instances of that coding pattern AFAICS, so I wonder if
>> that was really the most adapted way to do it.

> Thanks for the report.  Right.  There are three instances of that in
> pg_dependencies.c, two in pg_ndistinct.c.  I would not mind doing the
> attached to calm down these warnings, matching with the other areas of
> the code, that simply removes the macro and checks the state value
> directly.

Ah, right, now that you mention it, 56b1e88c8 did this same fix
before.  This way is fine with me.

I didn't review the error message changes closely.  I do wonder if
some of them are can't-happen cases that could just use an elog()
instead of a translatable message.

			regards, tom lane

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-12-05T03:39:05Z

On Thu, Dec 04, 2025 at 09:29:50PM -0500, Tom Lane wrote:
> Ah, right, now that you mention it, 56b1e88c8 did this same fix
> before.  This way is fine with me.

Oh, I haven't noticed this one.  Pushed the bits to take care of the
compiler warnings, for now.

> I didn't review the error message changes closely.  I do wonder if
> some of them are can't-happen cases that could just use an elog()
> instead of a translatable message.

Some of them are, yes.  However, they are also worded with the same
format as some of the legit cases.  So they don't add any extra
workload on the translation side as far as I recall, and I've been
fond of the errdetail part to get a consistent style across the board.
I'll double-check the whole a bit later, attached is the rest of them.

Corey, any comments about these?
--
Michael

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-12-05T05:30:38Z

>
>
> Some of them are, yes.  However, they are also worded with the same
> format as some of the legit cases.  So they don't add any extra
> workload on the translation side as far as I recall, and I've been
> fond of the errdetail part to get a consistent style across the board.
> I'll double-check the whole a bit later, attached is the rest of them.
>
> Corey, any comments about these?
>

The wordings are fine, and I'm sorry I didn't word them as complete
sentences from the get-go.

Attached is a follow-on to Michael's most recent uncommitted patch,
changing the errors that I see as "impossible" to elogs. However, I agree
that they don't add significant workload to the translations, and most
input functions need to avoid any hard error returns lest they be called in
a soft-error context.

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-12-08T01:25:45Z

On Fri, Dec 05, 2025 at 12:30:38AM -0500, Corey Huinker wrote:
> Attached is a follow-on to Michael's most recent uncommitted patch,
> changing the errors that I see as "impossible" to elogs. However, I agree
> that they don't add significant workload to the translations, and most
> input functions need to avoid any hard error returns lest they be called in
> a soft-error context.

Reporting the state of the parser in these new elogs is OK for me, but
we had more cases that what your patch has been updating.  I have
included all these cases, making the set of error messages a bit more
consistent across the board, and applied the result.
--
Michael

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-12-11T21:52:26Z

On Sun, Dec 7, 2025 at 8:26 PM Michael Paquier <michael@paquier.xyz> wrote:

> On Fri, Dec 05, 2025 at 12:30:38AM -0500, Corey Huinker wrote:
> > Attached is a follow-on to Michael's most recent uncommitted patch,
> > changing the errors that I see as "impossible" to elogs. However, I agree
> > that they don't add significant workload to the translations, and most
> > input functions need to avoid any hard error returns lest they be called
> in
> > a soft-error context.
>
> Reporting the state of the parser in these new elogs is OK for me, but
> we had more cases that what your patch has been updating.  I have
> included all these cases, making the set of error messages a bit more
> consistent across the board, and applied the result.
> --
> Michael
>

Rebased. Used new palloc0_array() in a few places, and fixed an expected
output message that wasn't updated to reflect the changes.

Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com> — 2025-12-11T21:55:52Z

>
> Reporting the state of the parser in these new elogs is OK for me, but
>> we had more cases that what your patch has been updating.  I have
>> included all these cases, making the set of error messages a bit more
>> consistent across the board, and applied the result.
>> --
>> Michael
>>
>
> Rebased. Used new palloc0_array() in a few places, and fixed an expected
> output message that wasn't updated to reflect the changes.
>

Disregard v21 (uncommitted changes). Everything I said about 21 applies to
22.

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-12-25T06:23:58Z

On Thu, Dec 11, 2025 at 04:55:52PM -0500, Corey Huinker wrote:
> Disregard v21 (uncommitted changes). Everything I said about 21 applies to
> 22.

A couple of weeks later, I have locked a couple of hours to look at
v22-0001.  And after a couple of rounds of self-fixing and adjustments
in the documentation added, the result looks OK so I have applied it.
Attached are the remaining pieces, labelled v23.
--
Michael

Re: Extended Statistics set/restore/clear functions.

Tender Wang <tndrwang@gmail.com> — 2025-12-25T11:00:02Z

Tender Wang <tndrwang@gmail.com> 于2025年12月25日周四 18:58写道：

> Hi Michael,
>
> I found a typo in recent commit :
> commit 213a1b89527049cb27bbcd6871fdb0c0916b43e1 (origin/master,
> origin/HEAD, master)
> Author: Michael Paquier <michael@paquier.xyz>
> Date:   Thu Dec 25 15:13:39 2025 +0900
>  Move attribute statistics functions to stat_utils.c
> ....
>  * The stacoll value should be either the atttypcoll derived from
>  * statatt_get_type(), or a harcoded value required by that particular
>  * stakind.
> ....
>
> "harcoded" should be "hardcoded".
>
> Please check the attached patch.
>
> --
> Thanks,
> Tender Wang
>

Sorry, I forgot to cc the pgsql-hackers mail in my last email.
-- 
Thanks,
Tender Wang

Re: Extended Statistics set/restore/clear functions.

Michael Paquier <michael@paquier.xyz> — 2025-12-25T22:47:54Z

On Thu, Dec 25, 2025 at 07:00:02PM +0800, Tender Wang wrote:
> Sorry, I forgot to cc the pgsql-hackers mail in my last email.

Missed this one from v22-0001.  Thanks.
--
Michael