Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Add test doing some cloning of extended statistics data

  2. Add test for pg_restore_extended_stats() with multiranges

  3. Add support for "mcv" in pg_restore_extended_stats()

  4. Include extended statistics data in pg_dump

  5. Add support for "dependencies" in pg_restore_extended_stats()

  6. Add test for MAINTAIN permission with pg_restore_extended_stats()

  7. Add pg_restore_extended_stats()

  8. Add routine to free MCVList

  9. Improve pg_clear_extended_stats() with incorrect relation/stats combination

  10. Add pg_clear_extended_stats()

  11. Introduce routines to validate and free MVNDistinct and MVDependencies

  12. Fix typo in stat_utils.c

  13. Move attribute statistics functions to stat_utils.c

  14. Improve error messages of input functions for pg_dependencies and pg_ndistinct

  15. Improve test output of extended statistics for ndistinct and dependencies

  16. Fix some compiler warnings

  17. Add input function for data type pg_dependencies

  18. Add input function for data type pg_ndistinct

  19. Rework output format of pg_dependencies

  20. Rework output format of pg_ndistinct

  21. Fix comments of output routines for pg_ndistinct and pg_dependencies

  22. Move code specific to pg_dependencies to new file

  23. Move code specific to pg_ndistinct to new file

  24. Document some structures in attribute_stats.c

  25. Fix FATAL message for invalid recovery timeline at beginning of recovery

  1. Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-01-22T18:17:27Z

    This is a separate thread for work started in [1] but focused purely on
    getting the following functions working:
    
    * pg_set_extended_stats
    * pg_clear_extended_stats
    * pg_restore_extended_stats
    
    These functions are analogous to their relation/attribute counterparts, use
    the same calling conventions, and build upon the same basic infrastructure.
    
    I think it is important that we get these implemented because they close
    the gap that was left in terms of the ability to modify existing statistics
    and to round out the work being done to carry over statistics via
    dump/restore and pg_upgrade i [1].
    
    The purpose of each patch is as follows (adapted from previous thread):
    
    0001 - This makes the input function for pg_ndistinct functional.
    
    0002 - This makes the input function for pg_dependencies functional.
    
    0003 - Makes several static functions in attribute_stats.c public for use
    by extended stats. One of those is get_stat_attr_type(), which in the last
    patchset was modified to take an attribute name rather than attnum, thus
    saving a syscache lookup. However, extended stats identifies attributes by
    attnum not name, so that optimization had to be set aside, at least
    temporarily.
    
    0004 - These implement the functions pg_set_extended_stats(),
    pg_clear_extended_stats(), and pg_restore_extended_stats() and behave like
    their relation/attribute equivalents. If we can get these committed and
    used by pg_dump, then we don't have to debate how to handle post-upgrade
    steps for users who happen to have extended stats vs the approximately
    99.75% of users who do not have extended stats.
    
    This patchset does not presently include any work to integrate these
    functions into pg_dump, but may do so once that work is settled, or it may
    become its own thread.
    
    [1]
    https://www.postgresql.org/message-id/CADkLM=c6NHdXU+d+m1yASZ4NT_qye1LCnaU2Vgf8YJ80jJT-Qg@mail.gmail.com
    
  2. Re: Extended Statistics set/restore/clear functions.

    Tomas Vondra <tomas@vondra.me> — 2025-01-22T22:50:57Z

    Hi,
    
    Thanks for continuing to work on this.
    
    On 1/22/25 19:17, Corey Huinker wrote:
    > This is a separate thread for work started in [1] but focused purely on
    > getting the following functions working:
    > 
    > * pg_set_extended_stats
    > * pg_clear_extended_stats
    > * pg_restore_extended_stats
    > 
    > These functions are analogous to their relation/attribute counterparts,
    > use the same calling conventions, and build upon the same basic
    > infrastructure.
    > 
    > I think it is important that we get these implemented because they close
    > the gap that was left in terms of the ability to modify existing
    > statistics and to round out the work being done to carry over statistics
    > via dump/restore and pg_upgrade i [1].
    > 
    > The purpose of each patch is as follows (adapted from previous thread):
    > 
    > 0001 - This makes the input function for pg_ndistinct functional.
    > 
    > 0002 - This makes the input function for pg_dependencies functional.
    > 
    
    I only quickly skimmed the patches, but a couple comments:
    
    
    1) I think it makes perfect sense to use the JSON parsing for the input
    functions, but maybe it'd be better to adjust the format a bit to make
    that even easier?
    
    Right now the JSON "keys" have structure, which means we need some ad
    hoc parsing. Maybe we should make it "proper JSON" by moving that into
    separate key/value, e.g. for ndistinct we might replace this:
    
      {"1, 2": 2323, "1, 3" : 3232, ...}
    
    with this:
    
      [ {"keys": [1, 2], "ndistinct" : 2323},
        {"keys": [1, 3], "ndistinct" : 3232},
        ... ]
    
    so a regular JSON array of objects, with keys an "array". And similarly
    for dependencies.
    
    Yes, it's more verbose, but maybe better for "mechanical" processing?
    
    
    2) Do we need some sort of validation? Perhaps this was discussed in the
    other thread and I missed that, but isn't it a problem that happily
    accept e.g. this?
    
      {"6666, 6666" : 1, "1, -222": 14, ...}
    
    That has duplicate keys with bogus attribute numbers, stats on (bogus)
    system attributes, etc. I suspect this may easily cause problems during
    planning (if it happens to visit those statistics).
    
    Maybe that's acceptable - ultimately the user could import something
    broken in a much subtler way, of course. But the pg_set_attribute_stats
    seems somewhat more protected against this, because it gets the attr as
    a separate argument.
    
    
    I recall I wished to have the attnum in the output function, but that
    was not quite possible because we don't know the relid (and thus the
    descriptor) in that function.
    
    Is it a good idea to rely on the input/output format directly? How will
    that deal with cross-version differences? Doesn't it mean the in/out
    format is effectively fixed, or at least has to be backwards compatible
    (i.e. new version has to handle any value from older versions)?
    
    Or what if I want to import the stats for a table with slightly
    different structure (e.g. because dump/restore skips dropped columns).
    Won't that be a problem with the format containing raw attnums? Or is
    this a use case we don't expect to work?
    
    For the per-attribute stats it's probably fine, because that's mostly
    just a collection of regular data types (scalar values or arrays of
    values, ...) and we're not modifying them except for maybe adding new
    fields. But extended stats seem more complex, so maybe it's different?
    
    I remember a long discussion about the format at the very beginning of
    this patch series, and the conclusion clearly was to have a function
    that import stats for one attribute at a time. And that seems to be
    working fine, but extended stats values have more internal structure, so
    perhaps they need to do something more complicated.
    
    > 0003 - Makes several static functions in attribute_stats.c public for use
    > by extended stats. One of those is get_stat_attr_type(), which in the last
    > patchset was modified to take an attribute name rather than attnum, thus
    > saving a syscache lookup. However, extended stats identifies attributes by
    > attnum not name, so that optimization had to be set aside, at least
    > temporarily.
    > 
    > 0004 - These implement the functions pg_set_extended_stats(),
    > pg_clear_extended_stats(), and pg_restore_extended_stats() and behave like
    > their relation/attribute equivalents. If we can get these committed and
    > used by pg_dump, then we don't have to debate how to handle post-upgrade
    > steps for users who happen to have extended stats vs the approximately
    > 99.75% of users who do not have extended stats.
    > 
    
    I see there's a couple MCV-specific functions in the extended_stats.c.
    Shouldn't those go into mvc.c instead?
    
    FWIW there's a bunch of whitespace issues during git apply.
    
    > This patchset does not presently include any work to integrate these
    > functions into pg_dump, but may do so once that work is settled, or it
    > may become its own thread.
    > 
    
    OK. Thanks for the patch!
    
    
    regards
    
    -- 
    Tomas Vondra
    
    
    
    
    
  3. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-01-23T14:51:08Z

    On Wed, Jan 22, 2025 at 5:50 PM Tomas Vondra <tomas@vondra.me> wrote:
    
    > Hi,
    >
    > Thanks for continuing to work on this.
    >
    > On 1/22/25 19:17, Corey Huinker wrote:
    > > This is a separate thread for work started in [1] but focused purely on
    > > getting the following functions working:
    > >
    > > * pg_set_extended_stats
    > > * pg_clear_extended_stats
    > > * pg_restore_extended_stats
    > >
    > > These functions are analogous to their relation/attribute counterparts,
    > > use the same calling conventions, and build upon the same basic
    > > infrastructure.
    > >
    > > I think it is important that we get these implemented because they close
    > > the gap that was left in terms of the ability to modify existing
    > > statistics and to round out the work being done to carry over statistics
    > > via dump/restore and pg_upgrade i [1].
    > >
    > > The purpose of each patch is as follows (adapted from previous thread):
    > >
    > > 0001 - This makes the input function for pg_ndistinct functional.
    > >
    > > 0002 - This makes the input function for pg_dependencies functional.
    > >
    >
    > I only quickly skimmed the patches, but a couple comments:
    >
    >
    > 1) I think it makes perfect sense to use the JSON parsing for the input
    > functions, but maybe it'd be better to adjust the format a bit to make
    > that even easier?
    >
    > Right now the JSON "keys" have structure, which means we need some ad
    > hoc parsing. Maybe we should make it "proper JSON" by moving that into
    > separate key/value, e.g. for ndistinct we might replace this:
    >
    >   {"1, 2": 2323, "1, 3" : 3232, ...}
    >
    > with this:
    >
    >   [ {"keys": [1, 2], "ndistinct" : 2323},
    >     {"keys": [1, 3], "ndistinct" : 3232},
    >     ... ]
    >
    > so a regular JSON array of objects, with keys an "array". And similarly
    > for dependencies.
    >
    
    That is almost exactly what I did back when the stats import functions took
    a nested JSON argument.
    
    The biggest problem with changing that format is that the old format would
    still show up in the system being exported, so we would have to process
    that format as well as the new one.
    
    
    > Yes, it's more verbose, but maybe better for "mechanical" processing?
    >
    
    It absolutely would be better for processing, but we'd still have to read
    the old format from older systems.  I suppose the pg_dump code could do
    some SQL gymnastics to convert the old json-but-sad format into the
    processing-friendly format of the future, and I could easily adapt what
    I've already written over a year ago to that task. I suppose it's just a
    matter of having the community behind changing the output format to enable
    a better input format.
    
    
    
    > 2) Do we need some sort of validation? Perhaps this was discussed in the
    > other thread and I missed that, but isn't it a problem that happily
    > accept e.g. this?
    >
    >   {"6666, 6666" : 1, "1, -222": 14, ...}
    >
    > That has duplicate keys with bogus attribute numbers, stats on (bogus)
    > system attributes, etc. I suspect this may easily cause problems during
    > planning (if it happens to visit those statistics).
    >
    
    We used to have _lots_ of validation for data quality issues, much of which
    was removed at the request of reviewers. However, much of that discussion
    was about the health of the statistics, but these are standalone data
    types, maybe they're held to a higher standard. If so, what sort of checks
    do you think would be needed and/or wanted?
    
    So far, I can imagine the following:
    
    * no negative attnums in key list
    * no duplicate attnums in key list
    
    anything else?
    
    >
    > Maybe that's acceptable - ultimately the user could import something
    > broken in a much subtler way, of course. But the pg_set_attribute_stats
    > seems somewhat more protected against this, because it gets the attr as
    > a separate argument.
    >
    
    The datatype itself is in isolation, but once we've created a valid
    pg_ndistinct or pg_dependencies, there's nothing stopping us from
    validating the values in the datatype against the statistics object and the
    relation it belongs to, but that might get the same resistance that I got
    to say, ensuring that frequency lists were monotonically decreasing.
    
    
    > I recall I wished to have the attnum in the output function, but that
    > was not quite possible because we don't know the relid (and thus the
    > descriptor) in that function.
    >
    > Is it a good idea to rely on the input/output format directly? How will
    > that deal with cross-version differences? Doesn't it mean the in/out
    > format is effectively fixed, or at least has to be backwards compatible
    > (i.e. new version has to handle any value from older versions)?
    >
    
    Presently there are no cross-version differences, though earlier I address
    the pros and cons of changing it. No matter what, the burden of having a
    valid format is on the user in the case of pg_set_extended_stats() and
    pg_restore_extended_stats() has a server version number associated, so the
    option of handling a format change could be baked in, but then we're doing
    version tests and input typecasts like we do with ANYARRAY types. Not
    impossible, but definitely more work.
    
    
    > Or what if I want to import the stats for a table with slightly
    > different structure (e.g. because dump/restore skips dropped columns).
    > Won't that be a problem with the format containing raw attnums? Or is
    > this a use case we don't expect to work?
    >
    
    The family of pg_set_*_stats functions expect the input to be meaningful
    and correct for that relation on that server version. Any attnum
    translation would have to be done by the user to adapt to the new or
    changed relation.
    
    The family of pg_restore_*_stats functions are designed to be forward
    compatible, and to work across versions but for basically the same relation
    or relation of the same shape. Basically, they're for pg_restore and
    pg_upgrade, so no changes in attnums would be expected.
    
    
    >
    > For the per-attribute stats it's probably fine, because that's mostly
    > just a collection of regular data types (scalar values or arrays of
    > values, ...) and we're not modifying them except for maybe adding new
    > fields. But extended stats seem more complex, so maybe it's different?
    >
    
    I had that working by attname matching way back in the early days, but it
    would involve creating an intermediate format for translating the attnums
    to attnames on export, and then re-translating them on the way back in.
    
    I suppose someone could write the following utility functions
    
        pg_xlat_ndistinct_to_attnames(relation reloid, ndist pg_ndistinct) ->
    json
        pg_xlat_ndistinct_from_attnames(relation reloid, ndist json) ->
    pg_ndistinct
    
    and that would bridge the gap for the special case where you want to adapt
    pg_ndistinct from one table structure to a slightly different one.
    
    
    
    >
    > I remember a long discussion about the format at the very beginning of
    > this patch series, and the conclusion clearly was to have a function
    > that import stats for one attribute at a time. And that seems to be
    > working fine, but extended stats values have more internal structure, so
    > perhaps they need to do something more complicated.
    >
    
    I believe this *is* doing something more complicated, especially with the
    MCVList, though I was very pleased to see how much of the existing
    infrastructure I was able to reuse.
    
    
    
    
    >
    > > 0003 - Makes several static functions in attribute_stats.c public for use
    > > by extended stats. One of those is get_stat_attr_type(), which in the
    > last
    > > patchset was modified to take an attribute name rather than attnum, thus
    > > saving a syscache lookup. However, extended stats identifies attributes
    > by
    > > attnum not name, so that optimization had to be set aside, at least
    > > temporarily.
    > >
    > > 0004 - These implement the functions pg_set_extended_stats(),
    > > pg_clear_extended_stats(), and pg_restore_extended_stats() and behave
    > like
    > > their relation/attribute equivalents. If we can get these committed and
    > > used by pg_dump, then we don't have to debate how to handle post-upgrade
    > > steps for users who happen to have extended stats vs the approximately
    > > 99.75% of users who do not have extended stats.
    > >
    >
    > I see there's a couple MCV-specific functions in the extended_stats.c.
    > Shouldn't those go into mvc.c instead?
    >
    
    I wanted to put it there, but there was a reason I didn't and I've now
    forgotten what it was. I'll make an effort to relocate it to mcv.c.
    
    For that matter, it might make sense to break out the expressions code into
    its own file, because every other stat attribute has its own. Thoughts on
    that?
    
    
    >
    > FWIW there's a bunch of whitespace issues during git apply.
    >
    
    +1
    
    
    >
    > OK. Thanks for the patch!
    >
    
    Thanks for the feedback, please keep it coming. I think it's really
    important that extended stats, though used rarely, be included in our
    dump/restore/upgrade changes so as to make for a more consistent user
    experience.
    
  4. Re: Extended Statistics set/restore/clear functions.

    Tomas Vondra <tomas@vondra.me> — 2025-01-23T20:15:53Z

    
    On 1/23/25 15:51, Corey Huinker wrote:
    > On Wed, Jan 22, 2025 at 5:50 PM Tomas Vondra <tomas@vondra.me
    > <mailto:tomas@vondra.me>> wrote:
    > 
    >     Hi,
    > 
    >     Thanks for continuing to work on this.
    > 
    >     On 1/22/25 19:17, Corey Huinker wrote:
    >     > This is a separate thread for work started in [1] but focused
    >     purely on
    >     > getting the following functions working:
    >     >
    >     > * pg_set_extended_stats
    >     > * pg_clear_extended_stats
    >     > * pg_restore_extended_stats
    >     >
    >     > These functions are analogous to their relation/attribute
    >     counterparts,
    >     > use the same calling conventions, and build upon the same basic
    >     > infrastructure.
    >     >
    >     > I think it is important that we get these implemented because they
    >     close
    >     > the gap that was left in terms of the ability to modify existing
    >     > statistics and to round out the work being done to carry over
    >     statistics
    >     > via dump/restore and pg_upgrade i [1].
    >     >
    >     > The purpose of each patch is as follows (adapted from previous
    >     thread):
    >     >
    >     > 0001 - This makes the input function for pg_ndistinct functional.
    >     >
    >     > 0002 - This makes the input function for pg_dependencies functional.
    >     >
    > 
    >     I only quickly skimmed the patches, but a couple comments:
    > 
    > 
    >     1) I think it makes perfect sense to use the JSON parsing for the input
    >     functions, but maybe it'd be better to adjust the format a bit to make
    >     that even easier?
    > 
    >     Right now the JSON "keys" have structure, which means we need some ad
    >     hoc parsing. Maybe we should make it "proper JSON" by moving that into
    >     separate key/value, e.g. for ndistinct we might replace this:
    > 
    >       {"1, 2": 2323, "1, 3" : 3232, ...}
    > 
    >     with this:
    > 
    >       [ {"keys": [1, 2], "ndistinct" : 2323},
    >         {"keys": [1, 3], "ndistinct" : 3232},
    >         ... ]
    > 
    >     so a regular JSON array of objects, with keys an "array". And similarly
    >     for dependencies.
    > 
    > 
    > That is almost exactly what I did back when the stats import functions
    > took a nested JSON argument.
    > 
    > The biggest problem with changing that format is that the old format
    > would still show up in the system being exported, so we would have to
    > process that format as well as the new one.
    >  
    > 
    >     Yes, it's more verbose, but maybe better for "mechanical" processing?
    > 
    > 
    > It absolutely would be better for processing, but we'd still have to
    > read the old format from older systems.  I suppose the pg_dump code
    > could do some SQL gymnastics to convert the old json-but-sad format into
    > the processing-friendly format of the future, and I could easily adapt
    > what I've already written over a year ago to that task. I suppose it's
    > just a matter of having the community behind changing the output format
    > to enable a better input format.
    > 
    
    D'oh! I always forget about the backwards compatibility issue, i.e. that
    we still need to ingest values from already released versions. Yeah,
    that makes the format change less beneficial.
    
    >  
    > 
    >     2) Do we need some sort of validation? Perhaps this was discussed in the
    >     other thread and I missed that, but isn't it a problem that happily
    >     accept e.g. this?
    > 
    >       {"6666, 6666" : 1, "1, -222": 14, ...}
    > 
    >     That has duplicate keys with bogus attribute numbers, stats on (bogus)
    >     system attributes, etc. I suspect this may easily cause problems during
    >     planning (if it happens to visit those statistics).
    > 
    > 
    > We used to have _lots_ of validation for data quality issues, much of
    > which was removed at the request of reviewers. However, much of that
    > discussion was about the health of the statistics, but these are
    > standalone data types, maybe they're held to a higher standard. If so,
    > what sort of checks do you think would be needed and/or wanted?
    > 
    > So far, I can imagine the following:
    > 
    > * no negative attnums in key list
    > * no duplicate attnums in key list
    > 
    > anything else?
    > 
    
    Yeah, I recall there were a lot of checks initially and we dropped them
    over time. I'm not asking to reinstate all of those thorough checks.
    
    At this point I was really thinking only about validating the attnums,
    i.e. to make sure it's a valid attribute in the table / statistics. That
    is something the pg_set_attribute_stats() enforce too, thanks to having
    a separate argument for the attribute name.
    
    That's where I'd stop. I don't want to do checks on the statistics
    content, like verifying the frequencies in the MCV sum up to 1.0 or
    stuff like that. I think we're not doing that for pg_set_attribute_stats
    either (and I'd bet one could cause a lot of "fun" this way).
    
    > 
    >     Maybe that's acceptable - ultimately the user could import something
    >     broken in a much subtler way, of course. But the pg_set_attribute_stats
    >     seems somewhat more protected against this, because it gets the attr as
    >     a separate argument.
    > 
    > 
    > The datatype itself is in isolation, but once we've created a valid
    > pg_ndistinct or pg_dependencies, there's nothing stopping us from
    > validating the values in the datatype against the statistics object and
    > the relation it belongs to, but that might get the same resistance that
    > I got to say, ensuring that frequency lists were monotonically decreasing.
    >  
    
    Understood. IMHO it's fine to say we're not validating the statistics
    are "consistent" but I think we should check it matches the definition.
    
    > 
    >     I recall I wished to have the attnum in the output function, but that
    >     was not quite possible because we don't know the relid (and thus the
    >     descriptor) in that function.
    > 
    >     Is it a good idea to rely on the input/output format directly? How will
    >     that deal with cross-version differences? Doesn't it mean the in/out
    >     format is effectively fixed, or at least has to be backwards compatible
    >     (i.e. new version has to handle any value from older versions)?
    > 
    > 
    > Presently there are no cross-version differences, though earlier I
    > address the pros and cons of changing it. No matter what, the burden of
    > having a valid format is on the user in the case of
    > pg_set_extended_stats() and pg_restore_extended_stats() has a server
    > version number associated, so the option of handling a format change
    > could be baked in, but then we're doing version tests and input
    > typecasts like we do with ANYARRAY types. Not impossible, but definitely
    > more work.
    > 
    
    OK, makes sense.
     
    > 
    >     Or what if I want to import the stats for a table with slightly
    >     different structure (e.g. because dump/restore skips dropped columns).
    >     Won't that be a problem with the format containing raw attnums? Or is
    >     this a use case we don't expect to work?
    > 
    > 
    > The family of pg_set_*_stats functions expect the input to be meaningful
    > and correct for that relation on that server version. Any attnum
    > translation would have to be done by the user to adapt to the new or
    > changed relation.
    > 
    > The family of pg_restore_*_stats functions are designed to be forward
    > compatible, and to work across versions but for basically the same
    > relation or relation of the same shape. Basically, they're for
    > pg_restore and pg_upgrade, so no changes in attnums would be expected.
    >  
    > 
    
    OK
    
    > 
    >     For the per-attribute stats it's probably fine, because that's mostly
    >     just a collection of regular data types (scalar values or arrays of
    >     values, ...) and we're not modifying them except for maybe adding new
    >     fields. But extended stats seem more complex, so maybe it's different?
    > 
    > 
    > I had that working by attname matching way back in the early days, but
    > it would involve creating an intermediate format for translating the
    > attnums to attnames on export, and then re-translating them on the way
    > back in.
    > 
    > I suppose someone could write the following utility functions
    > 
    >     pg_xlat_ndistinct_to_attnames(relation reloid, ndist pg_ndistinct) -
    >> json
    >     pg_xlat_ndistinct_from_attnames(relation reloid, ndist json) ->
    > pg_ndistinct
    > 
    > and that would bridge the gap for the special case where you want to
    > adapt pg_ndistinct from one table structure to a slightly different one.
    > 
    > 
    
    OK
     
    > 
    > 
    >     I remember a long discussion about the format at the very beginning of
    >     this patch series, and the conclusion clearly was to have a function
    >     that import stats for one attribute at a time. And that seems to be
    >     working fine, but extended stats values have more internal structure, so
    >     perhaps they need to do something more complicated.
    > 
    > 
    > I believe this *is* doing something more complicated, especially with
    > the MCVList, though I was very pleased to see how much of the existing
    > infrastructure I was able to reuse.
    > 
    > 
    
    OK
    
    >  
    > 
    > 
    >     > 0003 - Makes several static functions in attribute_stats.c public
    >     for use
    >     > by extended stats. One of those is get_stat_attr_type(), which in
    >     the last
    >     > patchset was modified to take an attribute name rather than
    >     attnum, thus
    >     > saving a syscache lookup. However, extended stats identifies
    >     attributes by
    >     > attnum not name, so that optimization had to be set aside, at least
    >     > temporarily.
    >     >
    >     > 0004 - These implement the functions pg_set_extended_stats(),
    >     > pg_clear_extended_stats(), and pg_restore_extended_stats() and
    >     behave like
    >     > their relation/attribute equivalents. If we can get these
    >     committed and
    >     > used by pg_dump, then we don't have to debate how to handle post-
    >     upgrade
    >     > steps for users who happen to have extended stats vs the approximately
    >     > 99.75% of users who do not have extended stats.
    >     >
    > 
    >     I see there's a couple MCV-specific functions in the extended_stats.c.
    >     Shouldn't those go into mvc.c instead?
    > 
    > 
    > I wanted to put it there, but there was a reason I didn't and I've now
    > forgotten what it was. I'll make an effort to relocate it to mcv.c.
    > 
    > For that matter, it might make sense to break out the expressions code
    > into its own file, because every other stat attribute has its own.
    > Thoughts on that?
    >  
    
    +1 to that, if it reduced unnecessary code duplication
    
    > 
    > 
    >     FWIW there's a bunch of whitespace issues during git apply.
    > 
    > 
    > +1
    >  
    > 
    > 
    >     OK. Thanks for the patch!
    > 
    > 
    > Thanks for the feedback, please keep it coming. I think it's really
    > important that extended stats, though used rarely, be included in our
    > dump/restore/upgrade changes so as to make for a more consistent user
    > experience.
    >  
    
    I agree, and I appreciate you working on it.
    
    
    
    -- 
    Tomas Vondra
    
    
    
    
    
  5. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-01-23T20:52:55Z

    >
    >
    >
    > > * no negative attnums in key list
    >
    
    Disregard this suggestion - negative attnums mean the Nth expression in the
    extended stats object, though it boggles the mind how we could have 222
    expressions...
    
    
    
    > > * no duplicate attnums in key list
    >
    
    This one is still live, am considering.
    
    At this point I was really thinking only about validating the attnums,
    > i.e. to make sure it's a valid attribute in the table / statistics. That
    > is something the pg_set_attribute_stats() enforce too, thanks to having
    > a separate argument for the attribute name.
    >
    > That's where I'd stop. I don't want to do checks on the statistics
    > content, like verifying the frequencies in the MCV sum up to 1.0 or
    > stuff like that. I think we're not doing that for pg_set_attribute_stats
    >
    
    Agreed.
    
    
    > either (and I'd bet one could cause a lot of "fun" this way).
    >
    
    If by "fun" you mean "create a fuzzing tool", then yes.
    
    As an aside, the "big win" in all these functions is the ability to dump a
    database --no-data, but have all the schema and statistics, thus allowing
    for checking query plans on existing databases with sensitive data while
    not actually exposing the data (except mcv, obvs), nor spending the I/O to
    load that data.
    
    
    > Understood. IMHO it's fine to say we're not validating the statistics
    > are "consistent" but I think we should check it matches the definition.
    >
    
    +1
    
    
    
    > > I suppose someone could write the following utility functions
    > >
    > >     pg_xlat_ndistinct_to_attnames(relation reloid, ndist pg_ndistinct) -
    > >> json
    > >     pg_xlat_ndistinct_from_attnames(relation reloid, ndist json) ->
    > > pg_ndistinct
    > >
    > > and that would bridge the gap for the special case where you want to
    > > adapt pg_ndistinct from one table structure to a slightly different one.
    > >
    > >
    >
    > OK
    >
    
    As they'll be pure-SQL functions, I'll likely post the definitions here,
    but not put them into a patch unless it draws interest.
     > For that matter, it might make sense to break out the expressions code
    
    > > into its own file, because every other stat attribute has its own.
    > > Thoughts on that?
    > >
    >
    > +1 to that, if it reduced unnecessary code duplication
    >
    
    I'm uncertain that it actually would deduplicate any code, but I'll
    certainly try.
    
  6. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-01-24T00:06:48Z

    >
    > I see there's a couple MCV-specific functions in the extended_stats.c.
    >> Shouldn't those go into mvc.c instead?
    >>
    >
    > I wanted to put it there, but there was a reason I didn't and I've now
    > forgotten what it was. I'll make an effort to relocate it to mcv.c.
    >
    
    Looking at it now, I see that check_mcvlist_array() expects access to
    extarginfo, so I either I add a bunch of Datum-aware and extarginfo-aware
    code to mcv.c, or I do those checks outside of import_mcvlist() and leave
    check_mcvlist_array() where it is, which is my current inclination, though
    obviously it's not a perfectly clean break.
    
  7. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-01-24T05:20:59Z

    >
    >
    > I agree, and I appreciate you working on it.
    >
    
    I've tried to incorporate all your feedback:
    
    * duplicates in pg_ndistinct and pg_dependencies now cause the input
    function to fail, with tests for each
    
    * all the non-array, non-parametery stuff moved out of import_mcvlist so
    that import_mcvlist can move to mcv.c in a sane way
    
    * whitespace issues
    
    * validating attnums inside pg_ndistinct. I held off on doing
    pg_dependencies pending feedback on how I handled pg_ndistinct. tests for
    invalid positive (attnum isn't involved in the extended stats object), 0
    attnum (just wrong) and invalid negative (more than the number of
    expressions in the object)
    
    Still to do:
    
    * attnum checking for dependencies, should be simple, once we've locked
    down how to do it for pg_ndistinct
    
    * move import_expressions code to expressions.c along with some other
    expression-specific stuff.
    
    * more tests, probably
    
    I'd like to merge these down to 3 patches again, but I'm keeping them
    separate for this patchset to isolate the attnum-checking code for this
    go-round.
    
  8. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-01-27T19:52:15Z

    >
    > I'd like to merge these down to 3 patches again, but I'm keeping them
    > separate for this patchset to isolate the attnum-checking code for this
    > go-round.
    >
    
    These are mock-ups of the to/from JSON functions, but building from/to text
    rather than the not-yet-committed pg_ndistinct and pg_dependencies data
    types. Currently they're done with JSON rather than JSONB because I assume
    that the ordering within the datatype matters. We can probably preserve
    order by adding an "order" field populated by WITH ORDINALITY.
    
    To get all Jurrassic about this: I've spent some time thinking about
    whether we CAN make these functions, it's time to consider whether we
    SHOULD. And that leads me to a couple of points:
    
    p1. We could switch to the new formats without any change to the internal
    representation, but pg_dump would always need to know about the old formats.
    
    p2. The JSON format is both more understandable and easier to manipulate.
    
    p3. If we thought the number of people using extended stats was small, the
    number of people tweaking extended stats is going to be smaller.
    
    So that gives us a few paths forward:
    
    o1. Switch to the new input/output format, and the queries inside these
    functions get incorporated into some future pg_dump queries.
    
    o2. Keep the old formats, create these functions inside the system and
    whoever wants to use them can use them.
    
    o3. Keep old formats, and make these functions work as the CASTs to and
    from JSON/JSONB.
    
    o4. Keep old formats, could create these functions in an extension.
    
    o5. Keep old formats, leave these function definitions here for a future
    intrepid hacker.
    
  9. Re: Extended Statistics set/restore/clear functions.

    jian he <jian.universality@gmail.com> — 2025-01-28T16:24:59Z

    hi.
    I reviewed 0001 only.
    
    in src/backend/statistics/mvdistinct.c
    
    no need #include "nodes/pg_list.h" since
    src/include/statistics/statistics.h sub level include "nodes/pg_list.h"
    
    no need #include "utils/palloc.h"
    sicne #include "postgres.h"
    already included it.
    
    
     select '[{"6, -32768,,": -11}]'::pg_ndistinct;
    ERROR:  malformed pg_ndistinct: "[{"6, -32768,,": -11}]"
    LINE 1: select '[{"6, -32768,,": -11}]'::pg_ndistinct;
                   ^
    DETAIL:  All ndistinct count values are scalar doubles.
    imho, this errdetail message is not good.
    
    
    select '{}'::pg_ndistinct ;
    segfault
    
    
    select '{"1,":"1"}'::pg_ndistinct ;
    ERROR:  malformed pg_ndistinct: "{"1,":"1"}"
    LINE 1: select '{"1,":"1"}'::pg_ndistinct ;
                   ^
    DETAIL:  All ndistinct attnum lists must be a comma separated list of attnums.
    
    imho, this errdetail message is not good. would be better saying that
    "length of list of attnums must be larger than 1".
    
    
    
    select  pt1.typnamespace, pt1.typarray, pt1.typcategory, pt1.typname
    from    pg_type pt1
    where pt1.typname ~*'distinct';
    
     typnamespace | typarray | typcategory |   typname
    --------------+----------+-------------+--------------
               11 |        0 | Z           | pg_ndistinct
    
    typcategory (Z) marked as Internal-use types. and there is no
    pg_ndistinct array type,
    not sure this is fine.
    
    
    all the errcode one pair of the parenthesis is unnecessary.
    for example
                    (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                     errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
                     errdetail("Must begin with \"{\"")));
    can change to
                    errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                     errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
                     errdetail("Must begin with \"{\""));
    
    see https://www.postgresql.org/docs/current/error-message-reporting.html
    
    
    
    
  10. Re: Extended Statistics set/restore/clear functions.

    jian he <jian.universality@gmail.com> — 2025-01-29T07:50:09Z

    hi.
    
    select '{"1, 0B100101":"NaN"}'::pg_ndistinct;
          pg_ndistinct
    ------------------------
     {"1, 37": -2147483648}
    (1 row)
    
    this is not what we expected?
    For the VALUE part of pg_ndistinct, float8 has 3 special values: inf, -inf, NaN.
    
    For the key part of pg_ndistinct, see example.
    select '{"1, 16\t":"1"}'::pg_ndistinct;
    here \t is not tab character, ascii 9. it's two characters: backslash
    and character "t".
    so here it should error out?
    (apply this to \n, \r, \b)
    
    
    pg_ndistinct_in(PG_FUNCTION_ARGS)
    ending part should be:
    
        freeJsonLexContext(lex);
        if (result == JSON_SUCCESS)
        {
            ......
        }
        else
        {
           ereturn(parse_state.escontext, (Datum) 0,
                        errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                        errmsg("malformed pg_ndistinct: \"%s\"", str),
                        errdetail("Must be valid JSON."));
           PG_RETURN_NULL();
        }
    result should be either JSON_SUCCESS or anything else.
    
    
    
    all these functions:
    ndistinct_object_start, ndistinct_array_start,
    ndistinct_object_field_start, ndistinct_array_element_start
    have
    ndistinctParseState *parse = state;
    
    do we need to change it to
    ndistinctParseState *parse = (ndistinctParseState *)state;
    ?
    
    
    
    ndistinctParseState need to add to src/tools/pgindent/typedefs.list
    probably change it to "typedef struct ndistinctParseState".
    also struct ndistinctParseState need placed in the top of
    src/backend/statistics/mvdistinct.c
    not in line 340?
    
    
    
    
  11. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-01-29T19:56:33Z

    On Tue, Jan 28, 2025 at 11:25 AM jian he <jian.universality@gmail.com>
    wrote:
    
    > hi.
    > I reviewed 0001 only.
    >
    > in src/backend/statistics/mvdistinct.c
    >
    > no need #include "nodes/pg_list.h" since
    > src/include/statistics/statistics.h sub level include "nodes/pg_list.h"
    >
    > no need #include "utils/palloc.h"
    > sicne #include "postgres.h"
    > already included it.
    >
    
    
    Noted.
    
    
    >  select '[{"6, -32768,,": -11}]'::pg_ndistinct;
    > ERROR:  malformed pg_ndistinct: "[{"6, -32768,,": -11}]"
    > LINE 1: select '[{"6, -32768,,": -11}]'::pg_ndistinct;
    >                ^
    > DETAIL:  All ndistinct count values are scalar doubles.
    > imho, this errdetail message is not good.
    >
    
    What error message do you think is appropriate in that situation?
    
    
    > select '{}'::pg_ndistinct ;
    > segfault
    >
    
    Mmm, gotta look into that!
    
    
    
    >
    >
    > select '{"1,":"1"}'::pg_ndistinct ;
    > ERROR:  malformed pg_ndistinct: "{"1,":"1"}"
    > LINE 1: select '{"1,":"1"}'::pg_ndistinct ;
    >                ^
    > DETAIL:  All ndistinct attnum lists must be a comma separated list of
    > attnums.
    >
    > imho, this errdetail message is not good. would be better saying that
    > "length of list of attnums must be larger than 1".
    >
    
    That sounds better.
    
    
    
    > typcategory (Z) marked as Internal-use types. and there is no
    > pg_ndistinct array type,
    > not sure this is fine.
    >
    
    I think it's probably ok for now. The datatype currently has no utility
    other than extended statistics, and I'm doubtful that it ever will.
    
  12. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-01-29T20:04:40Z

    On Wed, Jan 29, 2025 at 2:50 AM jian he <jian.universality@gmail.com> wrote:
    
    > hi.
    >
    > select '{"1, 0B100101":"NaN"}'::pg_ndistinct;
    >       pg_ndistinct
    > ------------------------
    >  {"1, 37": -2147483648}
    > (1 row)
    >
    
    I think my initial reaction is to just refuse those special values, but
    I'll look into the parsing code to see what can be done.
    
    
    
    this is not what we expected?
    > For the VALUE part of pg_ndistinct, float8 has 3 special values: inf,
    > -inf, NaN.
    >
    > For the key part of pg_ndistinct, see example.
    > select '{"1, 16\t":"1"}'::pg_ndistinct;
    > here \t is not tab character, ascii 9. it's two characters: backslash
    > and character "t".
    > so here it should error out?
    > (apply this to \n, \r, \b)
    >
    
    I don't have a good answer as to what should happen here. Special cases
    like this make Tomas' suggestion to change the in/out format more
    attractive.
    
    
    
    
    >
    >
    > pg_ndistinct_in(PG_FUNCTION_ARGS)
    > ending part should be:
    >
    >     freeJsonLexContext(lex);
    >     if (result == JSON_SUCCESS)
    >     {
    >         ......
    >     }
    >     else
    >     {
    >        ereturn(parse_state.escontext, (Datum) 0,
    >                     errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
    >                     errmsg("malformed pg_ndistinct: \"%s\"", str),
    >                     errdetail("Must be valid JSON."));
    >        PG_RETURN_NULL();
    >     }
    > result should be either JSON_SUCCESS or anything else.
    >
    >
    >
    > all these functions:
    > ndistinct_object_start, ndistinct_array_start,
    > ndistinct_object_field_start, ndistinct_array_element_start
    > have
    > ndistinctParseState *parse = state;
    >
    > do we need to change it to
    > ndistinctParseState *parse = (ndistinctParseState *)state;
    > ?
    >
    
    The compiler isn't complaining so far, but I see no harm in it.
    
  13. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-02-20T04:43:16Z

    >
    > select '{"1, 0B100101":"NaN"}'::pg_ndistinct;
    >>       pg_ndistinct
    >> ------------------------
    >>  {"1, 37": -2147483648}
    >> (1 row)
    >>
    >
    > I think my initial reaction is to just refuse those special values, but
    > I'll look into the parsing code to see what can be done.
    >
    
    I noticed that the output function for pg_ndistinct casts that value to an
    integer before formatting it %d, so it's being treated as an integer even
    if it is not stored as one. After some consultation with Tomas, it made the
    most sense to just replicate this on the input side as well, and that is
    addressed in the patches below.
    
    I've updated and rebased the patches.
    
    The existing pg_ndistinct and pg_dependences formats were kept as-is. The
    formats are clumsy, more processing-friendly formats would be easier, but
    the need for such processing is minimal bordering on theoretical, so there
    is little impact in keeping the historical format.
    
    There are now checks to ensure that the pg_ndistinct or pg_dependencies
    value assigned to an extended statistics object actually makes sense for
    that object. What this amounts to is checking that for every attnum cited,
    the positive attnums are also ones found the in the stxkeys of the
    pg_statistic_ext tuple, and the negative attnums correspond do not exceed
    the number of expressions in the attnum. In other words, if the stats
    object has no expressions in it, then no negative numbers will be accepted,
    if it has 2 expressions than any value -3 or lower will be rejected, etc.
    
    All patches rebased to 71f17823ba010296da9946bd906bb8bcad6325bc.
    
  14. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-03-04T20:30:11Z

    >
    >
    >> I think my initial reaction is to just refuse those special values, but
    >> I'll look into the parsing code to see what can be done.
    >>
    >
    > I noticed that the output function for pg_ndistinct casts that value to an
    > integer before formatting it %d, so it's being treated as an integer even
    > if it is not stored as one. After some consultation with Tomas, it made the
    > most sense to just replicate this on the input side as well, and that is
    > addressed in the patches below.
    >
    > I've updated and rebased the patches.
    >
    > The existing pg_ndistinct and pg_dependences formats were kept as-is. The
    > formats are clumsy, more processing-friendly formats would be easier, but
    > the need for such processing is minimal bordering on theoretical, so there
    > is little impact in keeping the historical format.
    >
    > There are now checks to ensure that the pg_ndistinct or pg_dependencies
    > value assigned to an extended statistics object actually makes sense for
    > that object. What this amounts to is checking that for every attnum cited,
    > the positive attnums are also ones found the in the stxkeys of the
    > pg_statistic_ext tuple, and the negative attnums correspond do not exceed
    > the number of expressions in the attnum. In other words, if the stats
    > object has no expressions in it, then no negative numbers will be accepted,
    > if it has 2 expressions than any value -3 or lower will be rejected, etc.
    >
    > All patches rebased to 71f17823ba010296da9946bd906bb8bcad6325bc.
    >
    
    A rebasing, and a few changes
    * regnamespace and name parameters changed to statistics_schemaname as text
    and statistics_name as text, so that there's one less thing that can
    potentially fail in an upgrade
    * schema lookup and stat name lookup failures now issue a warning and
    return false, rather than ERROR
    * elevel replaced with hardcoded WARNING most everywhere, as has been done
    with relation/attribute stats
    
  15. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-03-31T05:10:23Z

    Just rebasing.
    
  16. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-05-29T22:32:17Z

    On Mon, Mar 31, 2025 at 1:10 AM Corey Huinker <corey.huinker@gmail.com>
    wrote:
    
    > Just rebasing.
    >
    
    At pgconf.dev this year, the subject of changing the formats of
    pg_ndistinct and pg_depdentencies came up again.
    
    To recap: presently these datatypes have no working input function, but
    would need one for statistics import to work on extended statistics. The
    existing input formats are technically JSON, but the keys themselves are a
    comma-separated list of attnums, so they require additional parsing. That
    parsing is already done in the patches in this thread, but overall the
    format is terrible for any sort of manipulation, like the manipulation that
    people might want to do to translate the values to a table with a different
    column order (say, after a restore of a table that had dropped columns), or
    to do query planner experiments.
    
    Because the old formats don't have a corresponding input function, there is
    no risk of the ouptut not matching required inputs, but there will be once
    we add new input functions, so this is our last chance to change the format
    to something we like better.
    
    The old format can be trivially translated via functions posted earlier in
    this thread back in January (pg_xlat_ndistinct_to_attnames,
    pg_xlat_dependencies_to_attnames) as well as the reverse (s/_to_/_from_/),
    so dumping values from older versions will not be difficult.
    
    I believe that we should take this opportunity to make the change. While we
    don't have a pressing need to manipulate these structures now, we might in
    the future and failing to do so now makes a later change much harder.
    
    With that in mind, I'd like people to have a look at the proposed format
    change if pg_ndistinct (the changes to pg_dependencies are similar), to see
    if they want to make any improvements or comments. As you can see, the new
    format is much less compact (about 3x as large), which could get bad if the
    number of elements grew by a lot, but the number of elements is tied to the
    number of factors in the extended support (N choose N, then N choose N-1,
    etc, excluding choose 1), so this can't get too out of hand.
    
    Existing format (newlines/formatting added by me to make head-to-head
    comparison easier):
    
    '{"2, 3": 4,
      "2, -1": 4,
      "2, -2": 4,
      "3, -1": 4,
      "3, -2": 4,
      "-1, -2": 3,
      "2, 3, -1": 4,
      "2, 3, -2": 4,
      "2, -1, -2": 4,
      "3, -1, -2": 4}'::pg_ndistinct
    
    Proposed new format (again, all formatting here is just for ease of humans
    reading):
    
    ' [ {"attributes" : [2,3], "ndistinct" : 4},
        {"attributes" : [2,-1], "ndistinct" : 4},
        {"attributes" : [2,-2], "ndistinct" : 4},
        {"attributes" : [3,-1], "ndistinct" : 4},
        {"attributes" : [3,-2], "ndistinct" : 4},
        {"attributes" : [-1,-2], "ndistinct" : 3},
        {"attributes" : [2,3,-1], "ndistinct" : 4},
        {"attributes" : [2,3,-2], "ndistinct" : 4},
        {"attributes" : [2,-1,-2], "ndistinct" : 4},
        {"attributes" : [3,-1,-2], "ndistinct" : 4}]'::pg_ndistinct
    
    
    The pg_dependencies structure is only slightly more complex:
    
    An abbreviated example:
    
    {"2 => 1": 1.000000, "2 => -1": 1.000000, ..., "2, -2 => -1": 1.000000, "3,
    -1 => 2": 1.000000},
    
    Becomes:
    
    [ {"attributes": [2], "dependency": 1, "degree": 1.000000},
      {"attributes": [2], "dependency": -1, "degree": 1.000000},
      {"attributes": [2, -2], "dependency":  -1, "degree": 1.000000},
       ...,
       {"attributes": [2, -2], "dependency": -1, "degree": 1.000000},
       {"attributes": [3, -1], "dependency": 2, "degree": 1.000000}]
    
    Any thoughts on using/improving these structures?
    
  17. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-06-22T00:12:28Z

    >
    >
    > Any thoughts on using/improving these structures?
    >
    
    Hearing no objections, here is the latest patchset.
    
    0001 - Changes input/output functions of pg_ndistinct to the format
    described earlier.
    0002 - Changes input/output functions of pg_dependencies to the format
    described earlier.
    0003 - Makes some previously internal/static attribute stats functions
    visible to extended_stats.c, because the exprs attribute is basically an
    array of partially filled-out pg_statistic rows.
    0004 - Adds pg_restore_attribute_stats(), pg_clear_attribute_stats(), in
    the pattern of their relation/attribute brethren.
    0005 - adds the dumping and restoring of extended statistics back to v10.
    No command line flag changes needed.
    
  18. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-08-12T17:28:14Z

    On Sat, Jun 21, 2025 at 8:12 PM Corey Huinker <corey.huinker@gmail.com>
    wrote:
    
    >
    >> Any thoughts on using/improving these structures?
    >>
    >
    > Hearing no objections, here is the latest patchset.
    >
    > 0001 - Changes input/output functions of pg_ndistinct to the format
    > described earlier.
    > 0002 - Changes input/output functions of pg_dependencies to the format
    > described earlier.
    > 0003 - Makes some previously internal/static attribute stats functions
    > visible to extended_stats.c, because the exprs attribute is basically an
    > array of partially filled-out pg_statistic rows.
    > 0004 - Adds pg_restore_attribute_stats(), pg_clear_attribute_stats(), in
    > the pattern of their relation/attribute brethren.
    > 0005 - adds the dumping and restoring of extended statistics back to v10.
    > No command line flag changes needed.
    >
    >
    
    Rebased. Enjoy.
    
  19. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-09-12T04:55:56Z

    >
    > Rebased. Enjoy.
    >
    
    Rebased.
    
    
    
    >
    
  20. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-10-19T00:27:58Z

    On Fri, Sep 12, 2025 at 12:55 AM Corey Huinker <corey.huinker@gmail.com>
    wrote:
    
    > Rebased. Enjoy.
    >>
    >
    > Rebased.
    >
    
    And rebased again to conform to  688dc6299 and 4bd919129.
    
  21. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-10-21T05:48:37Z

    On Sat, Oct 18, 2025 at 08:27:58PM -0400, Corey Huinker wrote:
    > And rebased again to conform to  688dc6299 and 4bd919129.
    
    The format redesign for extended stats is pretty nice, as done in
    0001.  I think that this patch should be split in two, actually, as it
    tackles two issues:
    - One patch for the format change.
    - Second patch for the introduction of the input function, useful on
    its own to allow more regression tests paths for the format generated.
    
    Similar split comment for 0002 regarding pg_dependencies.  The format
    change is one thing.  The input function is cool to have for input
    validation, still not absolutely mandatory for the core part of the
    format change.
    
    The functions exposed in 0003 should be renamed to match more with the
    style of the rest, aka it is a bit hard to figure out what they do at
    first sight.  Presumably, these should be prefixed with some
    "statext_", except text_to_stavalues() which could still be named the
    same.
    
    Do you have some numbers regarding the increase in size this generates
    for the catalogs?
    
    0004 has been designed following the same model as the relation and
    attribute stats.  That sounds OK here.
    
    +enum extended_stats_argnum
    [...]
    +enum extended_stats_exprs_element
    
    It would be nice to document why such things are around.  That would
    be less guessing for somebody reading the code.
    
    Reusing this small sequence from your pg_dump patch, executed on a v14
    backend:
    create schema dump_test;
    CREATE TABLE dump_test.has_ext_stats
      AS SELECT g.g AS x, g.g / 2 AS y FROM generate_series(1,100) AS g(g);
    CREATE STATISTICS dump_test.es1 ON x, (y % 2) FROM dump_test.has_ext_stats;
    ANALYZE dump_test.has_ext_stats;
    
    Then pg_dump fails:
    pg_dump: error: query failed: ERROR:  column e.inherited does not exist
    LINE 2: ...hemaname = $1 AND e.statistics_name = $2 ORDER BY e.inherite...
    
    
    +        * TODO: Until v18 is released the master branch has a
    +        * server_version_num of 180000. We will update this to 190000
    as soon
    +        * as the master branch updates.
    
    This part has not been updated.
    
    +       Assert(item.nattributes > 0);   /* TODO: elog? */
    [...]
    +       Assert(dependency->nattributes > 1);    /* TODO: elog? */
    Yes and yes.  It seems like it should be possible to craft some input
    that triggers these..
    
    +void
    +free_pg_dependencies(MVDependencies *dependencies);
    
    Double declaration of this routine in dependencies.c.
    
    Perhaps some of the regression tests could use some jsonb_pretty() in
    the outputs generated.  Some of the results generated are very hard to
    parse, something that would become harder in the buildfarm.  This
    comment starts with 0001 for stxdndistinct.
    
    I have mixed feelings about 0005, FWIW.  I am wondering if we should
    not lift the needle a bit here and only support the dump of extended
    statistics when dealing with a backend of at least v19.  This would
    mean that we would only get the full benefit of this feature once
    people upgrade to v20 or dump from a pg_dump with --statistics from at
    least v19, but with the long-term picture in mind this would also make
    the dump/restore picture of the patch dead simple (spoiler: I like
    simple). 
    
    Tomas, what is your take about the format changes and my argument
    about the backward requirements of pg_dump (about not dumping these
    stats if connecting to a server older than v18, included)?
    --
    Michael
    
  22. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-10-21T05:57:00Z

    On Tue, Oct 21, 2025 at 02:48:37PM +0900, Michael Paquier wrote:
    > Tomas, what is your take about the format changes and my argument
    > about the backward requirements of pg_dump (about not dumping these
    > stats if connecting to a server older than v18, included)?
    
    By the way, the patch is still needed in the previous CF of September:
    https://commitfest.postgresql.org/patch/5517/
    
    You may want to move it.
    --
    Michael
    
  23. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-10-22T11:55:31Z

    >
    > The functions exposed in 0003 should be renamed to match more with the
    > style of the rest, aka it is a bit hard to figure out what they do at
    > first sight.  Presumably, these should be prefixed with some
    > "statext_", except text_to_stavalues() which could still be named the
    > same.
    >
    
    That prefix would probably be statatt_ or statattr_.
    
    
    >
    > Do you have some numbers regarding the increase in size this generates
    > for the catalogs?
    >
    
    Sorry, I don't understand. There shouldn't be any increase inside the
    catalogs as the internal storage of the datatypes hasn't changed, so I can
    only conclude that you're referring to something else.
    
    
    >
    > 0004 has been designed following the same model as the relation and
    > attribute stats.  That sounds OK here.
    >
    > +enum extended_stats_argnum
    > [...]
    > +enum extended_stats_exprs_element
    >
    > It would be nice to document why such things are around.  That would
    > be less guessing for somebody reading the code.
    >
    
    The equivalent structures in attribute_stats.c will need documenting too.
    
    
    
    
    >
    > Reusing this small sequence from your pg_dump patch, executed on a v14
    > backend:
    > create schema dump_test;
    > CREATE TABLE dump_test.has_ext_stats
    >   AS SELECT g.g AS x, g.g / 2 AS y FROM generate_series(1,100) AS g(g);
    > CREATE STATISTICS dump_test.es1 ON x, (y % 2) FROM dump_test.has_ext_stats;
    > ANALYZE dump_test.has_ext_stats;
    >
    > Then pg_dump fails:
    > pg_dump: error: query failed: ERROR:  column e.inherited does not exist
    > LINE 2: ...hemaname = $1 AND e.statistics_name = $2 ORDER BY e.inherite...
    >
    
    Noted.
    
    
    >
    >
    > +        * TODO: Until v18 is released the master branch has a
    > +        * server_version_num of 180000. We will update this to 190000
    > as soon
    > +        * as the master branch updates.
    >
    > This part has not been updated.
    >
    > +       Assert(item.nattributes > 0);   /* TODO: elog? */
    > [...]
    > +       Assert(dependency->nattributes > 1);    /* TODO: elog? */
    > Yes and yes.  It seems like it should be possible to craft some input
    > that triggers these..
    >
    
    +1
    
    
    >
    > +void
    > +free_pg_dependencies(MVDependencies *dependencies);
    >
    > Double declaration of this routine in dependencies.c.
    >
    > Perhaps some of the regression tests could use some jsonb_pretty() in
    > the outputs generated.  Some of the results generated are very hard to
    > parse, something that would become harder in the buildfarm.  This
    > comment starts with 0001 for stxdndistinct.
    >
    
    Can do.
    
    
    >
    > I have mixed feelings about 0005, FWIW.  I am wondering if we should
    > not lift the needle a bit here and only support the dump of extended
    > statistics when dealing with a backend of at least v19.  This would
    > mean that we would only get the full benefit of this feature once
    > people upgrade to v20 or dump from a pg_dump with --statistics from at
    > least v19, but with the long-term picture in mind this would also make
    > the dump/restore picture of the patch dead simple (spoiler: I like
    > simple).
    >
    
    I also like simple.
    
    Right now we have a situation where the vast majority of databases can
    carry forward all of their stats via pg_upgrade, except for those databases
    that have extended stats. The trouble is, most customers don't know if
    their database uses extended statistics or not, and those that do are in
    for some bad query plans if they haven't run vacuumdb --missing-stats-only.
    Explaining that to customers is complicated, especially when most of them
    do not know what extended stats are, let alone whether they have them. It
    would be a lot simpler to just say "all stats are carried over on upgrade",
    and vacuumdb becomes unnecessary, making upgrades one step simpler as well.
    
    Given that, I think that the admittedly ugly transformation is worth it,
    and sequestering it inside pg_dump is the smallest footprint it can have.
    Earlier in this thread I posted some functions that did the translation
    from the existing formats to the proposed new formats. We could include
    those as new system functions, and that would make the dump code very
    simple. Having said that, I don't know that there would be use for those
    functions except inside pg_dump, hence the decision to do the transforms
    right in the dump query.
    
    If the format translation is a barrier to fetching existing extended stats,
    then I'd be more inclined to keep the existing pg_ndistinct and
    pg_dependencies data formats as they are now.
    
  24. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-10-22T23:46:27Z

    On Wed, Oct 22, 2025 at 02:55:31PM +0300, Corey Huinker wrote:
    >> Do you have some numbers regarding the increase in size this generates
    >> for the catalogs?
    > 
    > Sorry, I don't understand. There shouldn't be any increase inside the
    > catalogs as the internal storage of the datatypes hasn't changed, so I can
    > only conclude that you're referring to something else.
    
    The new format meant more characters, perhaps I've just missed
    something while quickly testing the patch..  Anyway, that's OK at this
    stage.
    
    > The equivalent structures in attribute_stats.c will need documenting too.
    
    Right.  This sounds like a separate patch to me, impacting HEAD.
    
    > Right now we have a situation where the vast majority of databases can
    > carry forward all of their stats via pg_upgrade, except for those databases
    > that have extended stats. The trouble is, most customers don't know if
    > their database uses extended statistics or not, and those that do are in
    > for some bad query plans if they haven't run vacuumdb --missing-stats-only.
    > Explaining that to customers is complicated, especially when most of them
    > do not know what extended stats are, let alone whether they have them. It
    > would be a lot simpler to just say "all stats are carried over on upgrade",
    > and vacuumdb becomes unnecessary, making upgrades one step simpler as well.
    
    Okay.
    
    > Given that, I think that the admittedly ugly transformation is worth it,
    > and sequestering it inside pg_dump is the smallest footprint it can have.
    > Earlier in this thread I posted some functions that did the translation
    > from the existing formats to the proposed new formats. We could include
    > those as new system functions, and that would make the dump code very
    > simple. Having said that, I don't know that there would be use for those
    > functions except inside pg_dump, hence the decision to do the transforms
    > right in the dump query.
    
    I'd prefer the new format.  One killer pushing in favor of the new
    format that you are making upthread in favor of is that it makes much
    easier the viewing, editing and injecting of these stats.  It's the
    part of the patch where we would need Tomas' input on the matter
    before deciding anything, I guess, as primary author of the original
    facilities.  My view of the problem is just one opinion.
    
    > If the format translation is a barrier to fetching existing extended stats,
    > then I'd be more inclined to keep the existing pg_ndistinct and
    > pg_dependencies data formats as they are now.
    
    Not necessarily, it can be possible to also take that in multiple
    steps rather than a single one:
    - First do the basics in v19 with the new format.
    - Raise the bar to older versions.
    --
    Michael
    
  25. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-10-23T00:35:04Z

    On Thu, Oct 23, 2025 at 08:46:27AM +0900, Michael Paquier wrote:
    > I'd prefer the new format.  One killer pushing in favor of the new
    > format that you are making upthread in favor of is that it makes much
    > easier the viewing, editing and injecting of these stats.
    
    This is missing an "argument", as in "One killer argument pushing in
    favor.." :D
    --
    Michael
    
  26. Re: Extended Statistics set/restore/clear functions.

    Tomas Vondra <tomas@vondra.me> — 2025-10-31T20:22:55Z

    
    On 10/23/25 01:46, Michael Paquier wrote:
    > On Wed, Oct 22, 2025 at 02:55:31PM +0300, Corey Huinker wrote:
    >>> Do you have some numbers regarding the increase in size this generates
    >>> for the catalogs?
    >>
    >> Sorry, I don't understand. There shouldn't be any increase inside the
    >> catalogs as the internal storage of the datatypes hasn't changed, so I can
    >> only conclude that you're referring to something else.
    > 
    > The new format meant more characters, perhaps I've just missed
    > something while quickly testing the patch..  Anyway, that's OK at this
    > stage.
    > 
    >> The equivalent structures in attribute_stats.c will need documenting too.
    > 
    > Right.  This sounds like a separate patch to me, impacting HEAD.
    > 
    >> Right now we have a situation where the vast majority of databases can
    >> carry forward all of their stats via pg_upgrade, except for those databases
    >> that have extended stats. The trouble is, most customers don't know if
    >> their database uses extended statistics or not, and those that do are in
    >> for some bad query plans if they haven't run vacuumdb --missing-stats-only.
    >> Explaining that to customers is complicated, especially when most of them
    >> do not know what extended stats are, let alone whether they have them. It
    >> would be a lot simpler to just say "all stats are carried over on upgrade",
    >> and vacuumdb becomes unnecessary, making upgrades one step simpler as well.
    > 
    > Okay.
    > 
    >> Given that, I think that the admittedly ugly transformation is worth it,
    >> and sequestering it inside pg_dump is the smallest footprint it can have.
    >> Earlier in this thread I posted some functions that did the translation
    >> from the existing formats to the proposed new formats. We could include
    >> those as new system functions, and that would make the dump code very
    >> simple. Having said that, I don't know that there would be use for those
    >> functions except inside pg_dump, hence the decision to do the transforms
    >> right in the dump query.
    > 
    > I'd prefer the new format.  One killer pushing in favor of the new
    > format that you are making upthread in favor of is that it makes much
    > easier the viewing, editing and injecting of these stats.  It's the
    > part of the patch where we would need Tomas' input on the matter
    > before deciding anything, I guess, as primary author of the original
    > facilities.  My view of the problem is just one opinion.
    > 
    
    Sorry for not paying much attention to this thread ...
    
    My opinion is that we should both use the new format and keep the
    pg_dump code to allow upgrading from older pre-19 versions.
    
    There really is nothing special about the current format - I should have
    used JSON (or any other established format) from the beginning. But I
    only saw that as human-readable version of ephemeral data, it didn't
    occur to me we'll use this to export/import stats cross versions. So if
    we need to adjust that to make new use cases more convenient, let's bite
    the bullet now.
    
    If doing both is too complex / ugly, I think the pg_upgrade capability
    is more valuable. I'd rather keep the old, less convenient format to
    have pg_upgrade support for all versions.
    
    Otherwise users may not benefit from this pg_upgrade feature for a
    couple more years. Plenty of users delay upgrading until the EOL gets
    close, and so might be unable to dump/restore extended stats for the
    next ~5 years.
    
    regards
    
    -- 
    Tomas Vondra
    
    
    
    
    
  27. Re: Extended Statistics set/restore/clear functions.[

    Michael Paquier <michael@paquier.xyz> — 2025-11-04T07:13:55Z

    On Fri, Oct 31, 2025 at 09:22:55PM +0100, Tomas Vondra wrote:
    > Sorry for not paying much attention to this thread ...
    
    It was PGEU last week, it's not surprising knowing that you are in..
    Europe :)
    
    > My opinion is that we should both use the new format and keep the
    > pg_dump code to allow upgrading from older pre-19 versions.
    
    Okay, thanks for the input.
    
    > There really is nothing special about the current format - I should have
    > used JSON (or any other established format) from the beginning. But I
    > only saw that as human-readable version of ephemeral data, it didn't
    > occur to me we'll use this to export/import stats cross versions. So if
    > we need to adjust that to make new use cases more convenient, let's bite
    > the bullet now.
    > 
    > If doing both is too complex / ugly, I think the pg_upgrade capability
    > is more valuable. I'd rather keep the old, less convenient format to
    > have pg_upgrade support for all versions.
    
    Hmm.  I have been eyeing at the dump code once again, and it's not
    that bad on a second lookup, so I'd be OK to incorporate the whole in
    a review.
    
    > Otherwise users may not benefit from this pg_upgrade feature for a
    > couple more years. Plenty of users delay upgrading until the EOL gets
    > close, and so might be unable to dump/restore extended stats for the
    > next ~5 years.
    
    Okay.
    
    I still think that we should split things a bit more in the patch set.
    Corey has sent me a proposal in this direction, where most of the
    entries can be reviewed and potentially applied separately:
    1. pg_ndistinct output function change.
    2. pg_ndistinct input function addition.
    3. pg_dependencies output function change
    4. pg_dependencies input function
    5. Expose attribute statistics function and rename them attstat_* or
    statatt_*
    6. pg_restore_extended_stats
    7. pg_dump with no ability to fetch old-format
    pg_ndistinct/pg_dependences.
    8. pg_dump working back as far as possible
    
    I am not completely sold about the changes in the input/output
    functions, but I'd be happy to consider more options with a rebased
    patch set (with the dump issue on older clusters issue, while on it).
    --
    Michael
    
  28. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-05T06:38:56Z

    >
    > Otherwise users may not benefit from this pg_upgrade feature for a
    > couple more years. Plenty of users delay upgrading until the EOL gets
    > close, and so might be unable to dump/restore extended stats for the
    > next ~5 years
    
    
    Paquier's response got sidetracked because of an errant subject line
    change, so I will try to recap:
    
    * pg_dump code changes no longer seem as bad on second look
    * proceed with breakout per off-list discussion
    
    in that off-list discussion I proposed (though I was mostly echoing what I
    thought Paquier wanted):
    
    1. pg_ndistinct output function change.
    2. pg_ndistinct input function addition.
    3. pg_dependencies output function change
    4. pg_dependencies input function
    5. Expose attribute statistics function and rename them attstat_* or
    statatt_*   (edit: and fix lack of comments on the enums and arrays)
    6. pg_restore_extended_stats
    7. pg_dump with no ability to fetch old-format pg_ndistinct/pg_dependences.
    (edit: and fix inherited bug)
    8. pg_dump working back as far as possible
    
    Given that the pg_dump code no longer seems as bad, and Tomas is very much
    in support of it, I've opted not to split out steps 7/8.
    
  29. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-11-06T08:26:58Z

    On Wed, Nov 05, 2025 at 01:38:56AM -0500, Corey Huinker wrote:
    > Paquier's response got sidetracked because of an errant subject line
    > change, so I will try to recap:
    
    That was a typo that found its way into the email subject.  Sorry
    about that, that broke gmail's tracking at least.
    
    > in that off-list discussion I proposed (though I was mostly echoing what I
    > thought Paquier wanted):
    > 
    > 1. pg_ndistinct output function change.
    > 2. pg_ndistinct input function addition.
    > 3. pg_dependencies output function change
    > 4. pg_dependencies input function
    > 5. Expose attribute statistics function and rename them attstat_* or
    > statatt_*   (edit: and fix lack of comments on the enums and arrays)
    > 6. pg_restore_extended_stats
    > 7. pg_dump with no ability to fetch old-format pg_ndistinct/pg_dependences.
    > (edit: and fix inherited bug)
    > 8. pg_dump working back as far as possible
    
    Thanks.  I have begun reviewing it (more a bit later, still need to
    study more the structure of the code).  For now, I have extracted some
    of the comment changes in 0005 and applied these independently.
    
    > Given that the pg_dump code no longer seems as bad, and Tomas is very much
    > in support of it, I've opted not to split out steps 7/8.
    
    That sounds like a way forward to me, then, in terms of using a new
    format and make pg_dump intelligent enough to deal with it based on
    what's in the past versions.
    --
    Michael
    
  30. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-06T18:35:34Z

    >
    > That was a typo that found its way into the email subject.  Sorry
    > about that, that broke gmail's tracking at least.
    >
    
    So the mailing list archive will still pick it up? That's nice.
    
    
    > Thanks.  I have begun reviewing it (more a bit later, still need to
    > study more the structure of the code).  For now, I have extracted some
    > of the comment changes in 0005 and applied these independently.
    >
    
    Rebased to reflect that commit.
    
  31. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-11-07T07:51:15Z

    On Thu, Nov 06, 2025 at 01:35:34PM -0500, Corey Huinker wrote:
    > So the mailing list archive will still pick it up? That's nice.
    
    It did.  My email client does not care much either.
    
    > Rebased to reflect that commit.
    
    I have spent a bit more time on this set.
    
    Patch 0001 for ndistinct was missing a documentation update, we have
    one query in perform.sgml that looks at stxdndistinct.  Patch 0003 is
    looking OK here as well.
    
    For dependencies, the format switches from a single json object
    with key/vals like that:
    "3 => 4": 1.000000
    To a JSON array made of elements like that:
    {"degree": 1.000000, "attributes": [3],"dependency": 4},
    
    For ndistincts, we move from a JSON blob with key/vals like that:
    "3, 4": 11
    To a JSON array made of the following elements:
    {"ndistinct": 11, "attributes": [3,4]}
    
    Using a keyword within each element would force a stronger validation
    when these get imported back, which is a good thing.  I like that.
    
    Before going in-depth into the input functions to cross-check the
    amount of validation we should do, have folks any comments about the
    proposed format?  That's the key point this patch set depends on, and
    I'd rather not spend more time the whole thing if somebody would like
    a different format.  This is the format that Tomas has mentioned at
    the top of the thread.  Note: as noted upthread, pg_dump would be in
    charge of transferring the data of the old format to the new format at
    the end.
    
    While looking at 0002 and 0004 (which have a couple of issues
    actually), I have been wondering about moving into a new file the four
    data-type functions (in, out, send and receive) and the new input
    functions that rely on a new JSON lexer and parser logic into for both
    ndistinct and dependencies.  The new set of headers added at the top
    of mvdistinct.c and dependencies.c for the new code points that a
    separation may be better in the long-term, because the new code relies
    on parts of the backend that the existing code does not care about,
    and these files become larger than the relation and attribute stats
    files.  I would be tempted to name these new files pg_dependencies.c
    and pg_ndistinct.c, mapping with their catalog types.  With this
    separation, it looks like the "core" parts in charge of the
    calculations with ndistinct and dependencies can be kept on its own.
    What do you think?
    
    A second comment is for 0005.  The routines of attributes.c are
    applied to the new clear and restore functions.  Shouldn't these be in
    stats_utils.c at the end?  That's where the "common" functions used by
    the stats manipulation logic are.
    --
    Michael
    
  32. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-07T22:28:50Z

    >
    >
    > Patch 0001 for ndistinct was missing a documentation update, we have
    > one query in perform.sgml that looks at stxdndistinct.  Patch 0003 is
    > looking OK here as well.
    >
    
    Well spotted.
    
    
    > For dependencies, the format switches from a single json object
    > with key/vals like that:
    > "3 => 4": 1.000000
    > To a JSON array made of elements like that:
    > {"degree": 1.000000, "attributes": [3],"dependency": 4},
    >
    > For ndistincts, we move from a JSON blob with key/vals like that:
    > "3, 4": 11
    > To a JSON array made of the following elements:
    > {"ndistinct": 11, "attributes": [3,4]}
    >
    > Using a keyword within each element would force a stronger validation
    > when these get imported back, which is a good thing.  I like that.
    >
    > Before going in-depth into the input functions to cross-check the
    > amount of validation we should do, have folks any comments about the
    > proposed format?  That's the key point this patch set depends on, and
    > I'd rather not spend more time the whole thing if somebody would like
    > a different format.  This is the format that Tomas has mentioned at
    > the top of the thread.  Note: as noted upthread, pg_dump would be in
    > charge of transferring the data of the old format to the new format at
    > the end.
    >
    
    I'm open to other formats, but aside from renaming the json keys (maybe
    "attnums" or "keys" instead of "attributes"?), I'm not sure what really
    could be done and still be JSON. I suppose we could go with a tuple format
    like this:
    
    '{({3,4},11),...}' for pg_ndistinct and
    '{({3},4,1.00000),...}'  for pg_dependencies.
    
    Those would certainly be more compact, but makes for a hard read by humans,
    and while the JSON code is big, it's also proven in other parts of the
    codebase, hence less risky.
    
    
    
    >
    > While looking at 0002 and 0004 (which have a couple of issues
    > actually), I have been wondering about moving into a new file the four
    > data-type functions (in, out, send and receive) and the new input
    > functions that rely on a new JSON lexer and parser logic into for both
    > ndistinct and dependencies.  The new set of headers added at the top
    > of mvdistinct.c and dependencies.c for the new code points that a
    > separation may be better in the long-term, because the new code relies
    > on parts of the backend that the existing code does not care about,
    > and these files become larger than the relation and attribute stats
    > files.  I would be tempted to name these new files pg_dependencies.c
    > and pg_ndistinct.c, mapping with their catalog types.  With this
    > separation, it looks like the "core" parts in charge of the
    > calculations with ndistinct and dependencies can be kept on its own.
    > What do you think?
    >
    
    A part of me thinks that everything that remains after removing
    in/out/send/recv is just taking a table sample data structure and crunching
    numbers to come up with the deserialized data structure...that's in/out
    with a different starting/ending points.
    
    There's no denying that JSON parsing is a very different code style than
    statistical number crunching, and mixing the two is incongruous, so it's
    worth a shot, and I'll try that for v9.
    
    
    >
    > A second comment is for 0005.  The routines of attributes.c are
    > applied to the new clear and restore functions.  Shouldn't these be in
    > stats_utils.c at the end?  That's where the "common" functions used by
    > the stats manipulation logic are.
    >
    
    I assume you're referring to attribute_stats.c. I think that would cause
    stats_utils.c to have to pull in a lot of things from attribute_stats.c,
    and that would create the exact sort of include-pollution that you're
    trying to avoid in the mvdistinct.c/dependencies.c situation mentioned
    above.
    
    The one lone exception to this is text_to_stavalues(), which is a fancy
    wrapper around array_in() and could perhaps be turned to even more generic
    usage outside of stats in general.
    
    The functions in question are needed because the exprs value is itself an
    array of partly-filled-out pg_attribute tuples, so it's common to those two
    needs, but specific to stats about attributes. Maybe we need an
    attr_stats_utils.h?
    
  33. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-11-07T22:56:48Z

    On Fri, Nov 07, 2025 at 05:28:50PM -0500, Corey Huinker wrote:
    > I'm open to other formats, but aside from renaming the json keys (maybe
    > "attnums" or "keys" instead of "attributes"?), I'm not sure what really
    > could be done and still be JSON. I suppose we could go with a tuple format
    > like this:
    > 
    > '{({3,4},11),...}' for pg_ndistinct and
    > '{({3},4,1.00000),...}'  for pg_dependencies.
    > 
    > Those would certainly be more compact, but makes for a hard read by humans,
    > and while the JSON code is big, it's also proven in other parts of the
    > codebase, hence less risky.
    
    I've liked the human-readability factor of the format in the current
    patches with names in the keys, and values assigned to each property.
    
    Another thing that may be worth doing is pushing the names of the keys
    and some its the JSON meta-data shaping the object into a new header
    than can be loaded by both the backend and the frontend.  It would be
    nice to not hardcode this knowledge in a bunch of places if we finish
    by renaming these attributes.
    
    > A part of me thinks that everything that remains after removing
    > in/out/send/recv is just taking a table sample data structure and crunching
    > numbers to come up with the deserialized data structure...that's in/out
    > with a different starting/ending points.
    > 
    > There's no denying that JSON parsing is a very different code style than
    > statistical number crunching, and mixing the two is incongruous, so it's
    > worth a shot, and I'll try that for v9.
    
    Yeah, right.  Thanks.  The parsing pieces seem like pieces worth their
    own file.
    
    > The functions in question are needed because the exprs value is itself an
    > array of partly-filled-out pg_attribute tuples, so it's common to those two
    > needs, but specific to stats about attributes. Maybe we need an
    > attr_stats_utils.h?
    
    Hmm, maybe.  I'd be OK to revisit these structures once we're happy
    with the in/out structures.  That would be a good start point before
    working on the SQL functions and the dump/restore bits in more
    details.
    --
    Michael
    
  34. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-10T05:33:40Z

    >
    > Another thing that may be worth doing is pushing the names of the keys
    > and some its the JSON meta-data shaping the object into a new header
    > than can be loaded by both the backend and the frontend.  It would be
    > nice to not hardcode this knowledge in a bunch of places if we finish
    > by renaming these attributes.
    >
    
    It may not be quite what you wanted, but the attribute names are now static
    constants in the new adt c files. It's possible/probable that you wanted
    them in some header file, but so far I haven't had to create any new header
    files, but that can be done if desired.
    
    Yeah, right.  Thanks.  The parsing pieces seem like pieces worth their
    > own file.
    >
    
    That's done in the 0008-0009 patches. If I was starting from scratch, I
    would have moved the pre-existing in/out/send/recv functions to their own
    files in their own patches before changing the output format, but tacked on
    at the end like they are it's easier to see what the changes were, and the
    patches will probably get squashed together anyway.
    
    
    > > The functions in question are needed because the exprs value is itself an
    > > array of partly-filled-out pg_attribute tuples, so it's common to those
    > two
    > > needs, but specific to stats about attributes. Maybe we need an
    > > attr_stats_utils.h?
    >
    > Hmm, maybe.  I'd be OK to revisit these structures once we're happy
    > with the in/out structures.  That would be a good start point before
    > working on the SQL functions and the dump/restore bits in more
    > details.
    >
    
    In addition to the changes detailed above, I fixed a few typos and
    incorporated the v8 change.
    
  35. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-11-11T08:06:49Z

    On Mon, Nov 10, 2025 at 12:33:40AM -0500, Corey Huinker wrote:
    > It may not be quite what you wanted, but the attribute names are now static
    > constants in the new adt c files. It's possible/probable that you wanted
    > them in some header file, but so far I haven't had to create any new header
    > files, but that can be done if desired.
    
    No, that's not the best thing we can do with the dump/restore pieces
    in mind.  Let's put that in a separate header.
    
    > That's done in the 0008-0009 patches. If I was starting from scratch, I
    > would have moved the pre-existing in/out/send/recv functions to their own
    > files in their own patches before changing the output format, but tacked on
    > at the end like they are it's easier to see what the changes were, and the
    > patches will probably get squashed together anyway.
    
    Thanks for the new patch.  And FWIW I disagree with this approach:
    cleanup and refactoring pieces make more sense if done first, as these
    lead to less code churn in the final result.  So...  I've begun to put
    my hands on the patch set.  The whole has been restructured a bit, as
    per the attached.  Patch 0001 to 0004 feel OK here, these include two
    code moves and the two output functions:
    - Two new files for adt/, that I'm planning to apply soon as a
    separate cleanup.
    - New output functions, with keys added to a new header named
    statistics_format.h, for frontend and backend consumption.
    
    Next comes the input functions.  First, I am unhappy with the amount
    of testing that has been put into ndistinct, first and only input
    facility I've looked at in details for the moment.  I have quickly
    spotted a couple a few issues while testing buggy input, like this one
    that crashes on pointer dereference, not good obviously:
    SELECT '[]'::pg_ndistinct;
    
    There was a second one with the error message generated when using an
    incorrect key value.
    
    Second, the inputs are too permissive and could be more strictly
    checked IMHO.  For example, patterns like that are incorrect, still
    authorized with only the patches up to 0005 in:
    - Duplicated list of attributes:
    SELECT '[{"attributes" : [2,3], "ndistinct" : 4},
             {"attributes" : [2,3], "ndistinct" : 4}]'::pg_ndistinct;
    - Partial (K,N) sets, for example say we take stats on attrs (1,2,3),
    a partial input like this one is basically OK:
    SELECT '[{"attributes" : [1,3], "ndistinct" : 4},
             {"attributes" : [1,2,3], "ndistinct" : 4}]'::pg_ndistinct;
    
    These are checked in the patches that introduce the functions like
    with pg_ndistinct_validate_items(), based on the list of stxkeys we
    have.  However, I think that this is not enough by itself.  Shouldn't
    we check that the list of items in the array is what we expect based
    on the longest "attributes" array at least, even after a JSON that was
    parsed?  That would be cheap to check in the output function itself,
    at least as a first layer of checks before trying something with the
    import function and cross-checking the list of attributes for the
    extended statistics object.  This means checking that for N attributes
    we have all the elements we'd expect in each element of the array,
    without gaps or duplications, with an extra step done once the JSON
    parsing is finished.  Except for this sanity issue this part of the
    patch set should be mostly OK, plus more cleanup and more typo/grammar
    fixes. 
    
    I suspect a similar family of issues with pg_dependencies, and it
    would be nice to move the tests with the input function into a new
    regression file, like the other one.
    
    I've rebased the full set using the new structure.  0001~0004 are
    clean.  0005~ need more work and analysis, but that's a start.
    --
    Michael
    
  36. Re: Extended Statistics set/restore/clear functions.

    jian he <jian.universality@gmail.com> — 2025-11-11T15:14:34Z

    On Tue, Nov 11, 2025 at 4:07 PM Michael Paquier <michael@paquier.xyz> wrote:
    > I've rebased the full set using the new structure.  0001~0004 are
    > clean.  0005~ need more work and analysis, but that's a start.
    hi.
    
    diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
    index 106583fb2965..e0e9f0468cb8 100644
    --- a/doc/src/sgml/perform.sgml
    +++ b/doc/src/sgml/perform.sgml
    @@ -1579,9 +1579,39 @@ ANALYZE zipcodes;
     SELECT stxkeys AS k, stxdndistinct AS nd
       FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
       WHERE stxname = 'stts2';
    --[ RECORD 1 ]------------------------------------------------------&zwsp;--
    +-[ RECORD 1 ]-------------------
     k  | 1 2 5
    -nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
    +nd | [                          +
    +   |     {                      +
    +   |         "ndistinct": 33178,+
    +   |         "attributes": [    +
    +   |             1,             +
    +   |             2              +
    +   |         ]                  +
    +   |     },                     +
    
    If you not specify as
     SELECT stxkeys AS k, stxdndistinct AS nd
       FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
       WHERE stxname = 'stts2' \gx
    then it won't look like ``-[ RECORD 1 ]``.
    
    I tried the dummy data, the above query, the nd column data look like
    
    [{"attributes": [1, 2], "ndistinct": 2} .....
    which does not look like the above.
    
    So I guess the query in doc/src/sgml/perform.sgml would be:
    
    SELECT stxkeys AS k, jsonb_pretty(d.stxdndistinct::text::jsonb) AS nd
    FROM pg_statistic_ext join pg_statistic_ext_data d on (oid = d.stxoid)
    WHERE stxname = 'stts2' \gx
    
    per related thread, the word "join" needs uppercase?
    
    
    --
    jian
    https://www.enterprisedb.com/
    
    
    
    
  37. Re: Extended Statistics set/restore/clear functions.

    jian he <jian.universality@gmail.com> — 2025-11-12T02:51:51Z

    On Tue, Nov 11, 2025 at 11:14 PM jian he <jian.universality@gmail.com> wrote:
    >
    > On Tue, Nov 11, 2025 at 4:07 PM Michael Paquier <michael@paquier.xyz> wrote:
    > > I've rebased the full set using the new structure.  0001~0004 are
    > > clean.  0005~ need more work and analysis, but that's a start.
    
    hi.
    
    +Datum
    +pg_ndistinct_out(PG_FUNCTION_ARGS)
    +
    +
    + appendStringInfo(&str, "], \"" PG_NDISTINCT_KEY_NDISTINCT "\": %d}",
    + (int) item.ndistinct);
    
    I’m a bit confused about the part above,
    item.ndistinct is double type, we just cast it to int type?
    
    
    after apply 0004, the below in doc/src/sgml/perform.sgml also need to change?
    
    <programlisting>
    CREATE STATISTICS stts (dependencies) ON city, zip FROM zipcodes;
    
    ANALYZE zipcodes;
    
    SELECT stxname, stxkeys, stxddependencies
      FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
      WHERE stxname = 'stts';
     stxname | stxkeys |             stxddependencies
    ---------+---------+------------------------------------------
     stts    | 1 5     | {"1 => 5": 1.000000, "5 => 1": 0.423130}
    (1 row)
    </programlisting>
    
    
    Do you think it's worth the trouble to have two separate
    appendStringInfoChar for ``{}``?
    
    for example in loop ``for (i = 0; i < ndist->nitems; i++)``. we can change to:
            appendStringInfoChar(&str, '{');
            appendStringInfo(&str, "\"" PG_NDISTINCT_KEY_ATTRIBUTES "\": [%d",
                             item.attributes[0]);
    
            for (int j = 1; j < item.nattributes; j++)
                appendStringInfo(&str, ", %d", item.attributes[j]);
    
            appendStringInfo(&str, "], \"" PG_NDISTINCT_KEY_NDISTINCT "\": %d",
                             (int) item.ndistinct);
            appendStringInfoChar(&str, '}');
    
    
    --
    jian
    https://www.enterprisedb.com/
    
    
    
    
  38. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-12T06:47:33Z

    >
    > Thanks for the new patch.  And FWIW I disagree with this approach:
    > cleanup and refactoring pieces make more sense if done first, as these
    > lead to less code churn in the final result.  So...  I've begun to put
    > my hands on the patch set.  The whole has been restructured a bit, as
    > per the attached.  Patch 0001 to 0004 feel OK here, these include two
    > code moves and the two output functions:
    > - Two new files for adt/, that I'm planning to apply soon as a
    > separate cleanup.
    > - New output functions, with keys added to a new header named
    > statistics_format.h, for frontend and backend consumption.
    >
    
    Agreed, 0001-0004 all look good.
    
    
    > Next comes the input functions.  First, I am unhappy with the amount
    > of testing that has been put into ndistinct, first and only input
    > facility I've looked at in details for the moment.  I have quickly
    > spotted a couple a few issues while testing buggy input, like this one
    > that crashes on pointer dereference, not good obviously:
    > SELECT '[]'::pg_ndistinct;
    >
    
    - I put some work into more specific error messages for invalid values for
    both pg_ndistinct and pg_dependencies.
    - The check for empty attribute lists and item lists now occur in the
    array-end event handler.
    - Also tried to standardize conventions between the two data types (switch
    statements, similar utility functions, etc).
    
    
    >
    > These are checked in the patches that introduce the functions like
    > with pg_ndistinct_validate_items(), based on the list of stxkeys we
    > have.  However, I think that this is not enough by itself.  Shouldn't
    > we check that the list of items in the array is what we expect based
    > on the longest "attributes" array at least, even after a JSON that was
    > parsed?  That would be cheap to check in the output function itself,
    > at least as a first layer of checks before trying something with the
    > import function and cross-checking the list of attributes for the
    > extended statistics object.
    
    
    I added tests for both duplicate attribute sequences as well as making the
    first-longest attribute sequence the template by which all later and
    shorter sequences are checked. I had been reluctant to add checks like
    this, because so many similar validations were removed from the earlier
    statistics code like histograms and the like.
    
    
    >
    > I suspect a similar family of issues with pg_dependencies, and it
    > would be nice to move the tests with the input function into a new
    > regression file, like the other one.
    >
    
    Did so.
    
    0001-0004,0007,0009 unchanged.
    
    Significant modification of the stats_import.sql regression tests in 0008
    to conform to stricter datatype rules enacted in 0005, 0006.
    
  39. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-11-12T07:53:50Z

    On Wed, Nov 12, 2025 at 01:47:33AM -0500, Corey Huinker wrote:
    >> Thanks for the new patch.  And FWIW I disagree with this approach:
    >> cleanup and refactoring pieces make more sense if done first, as these
    >> lead to less code churn in the final result.  So...  I've begun to put
    >> my hands on the patch set.  The whole has been restructured a bit, as
    >> per the attached.  Patch 0001 to 0004 feel OK here, these include two
    >> code moves and the two output functions:
    >> - Two new files for adt/, that I'm planning to apply soon as a
    >> separate cleanup.
    > 
    > Agreed, 0001-0004 all look good.
    
    Applied 0001 and 0002 for now.
    --
    Michael
    
  40. Re: Extended Statistics set/restore/clear functions.

    zengman <zengman@halodbtech.com> — 2025-11-12T08:45:16Z

    Hey, I saw that some comments in the recent commit have inconsistent styles—maybe we can tweak them to make them uniform!
    
    ```c
    // pg_dependencies    - output routine for type pg_dependencies.
    pg_dependencies_out   - output routine for type pg_dependencies.
    
    // pg_ndistinct
    //    output routine for type pg_ndistinct
    pg_ndistinct_out
        output routine for type pg_ndistinct
    ```
    
    Oh, I didn’t operate it properly, which resulted in the creation of a duplicate theme.
    https://www.postgresql.org/message-id/tencent_6EC50DD07D5D567C15E65C46%40qq.com
    --
    Man Zeng
  41. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-11-12T11:25:43Z

    On Wed, Nov 12, 2025 at 08:45:16AM +0000, Man Zeng wrote:
    > pg_dependencies_out   - output routine for type pg_dependencies.
    > pg_ndistinct_out
    > 
    > Oh, I didn’t operate it properly, which resulted in the creation of a duplicate theme.
    > https://www.postgresql.org/message-id/tencent_6EC50DD07D5D567C15E65C46%40qq.com
    
    No problem.  I've just fixed both comments, thanks.  These were older
    than the recent commit, but let's fix things when these are in sight.
    --
    Michael
    
  42. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-12T22:21:36Z

    >
    > +
    > + appendStringInfo(&str, "], \"" PG_NDISTINCT_KEY_NDISTINCT "\": %d}",
    > + (int) item.ndistinct);
    >
    > I’m a bit confused about the part above,
    > item.ndistinct is double type, we just cast it to int type?
    >
    
    It's a historical quirk. That's what the original output function did in
    mvdistinct.c, so we maintain compatibility with that. Altering the internal
    storage type would affect the bytea serialization, which would break binary
    compatibility.
    
    
    > after apply 0004, the below in doc/src/sgml/perform.sgml also need to
    > change?
    >
    
    Yes it does, good catch.
    
    
    > Do you think it's worth the trouble to have two separate
    > appendStringInfoChar for ``{}``?
    >
    > for example in loop ``for (i = 0; i < ndist->nitems; i++)``. we can change
    > to:
    >
    
    I agree that that feels more symmetrical. However, it seems the prevailing
    wisdom is that we're already paying for a string interpolation in the very
    next appendStringInfo(), we might as well save ourselves a function call.
    Hence, I left that one as-is.
    
    The sgml change has been worked into a rebased and reduced patch set
    (thanks for the commits Michael!)
    
  43. Re: Extended Statistics set/restore/clear functions.

    jian he <jian.universality@gmail.com> — 2025-11-13T07:01:42Z

    hi.
    
    v12-0001 and v12-0002 overall look good to me.
    
            if (dependency->nattributes <= 1)
                elog(ERROR, "invalid zero-length nattributes array in
    MVDependencies");
    This is an unlikely-to-happen error message,  but still, “nattributes”
    seems confusing?
    
    in  doc/src/sgml/func/func-json.sgml, we have \gset
    <programlisting>
    SELECT '{
    .......
      }
    }' AS json \gset
    </programlisting>
    
    similarly, in doc/src/sgml/perform.sgml, I think the query should be:
    
    SELECT stxkeys AS k, jsonb_pretty(stxdndistinct::text::jsonb) AS nd
      FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
      WHERE stxname = 'stts2'    \gx
    
    
    
    --
    jian
    https://www.enterprisedb.com/
    
    
    
    
  44. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-13T07:32:17Z

    On Thu, Nov 13, 2025 at 2:02 AM jian he <jian.universality@gmail.com> wrote:
    
    > hi.
    >
    > v12-0001 and v12-0002 overall look good to me.
    >
    >         if (dependency->nattributes <= 1)
    >             elog(ERROR, "invalid zero-length nattributes array in
    > MVDependencies");
    > This is an unlikely-to-happen error message,  but still, “nattributes”
    > seems confusing?
    >
    
    Agreed the error message should be changed if it's kept at all. That error
    may never occur now that we test for empty arrays in the array close event
    handler. So maybe this becomes a plain assert().
    
    
    > similarly, in doc/src/sgml/perform.sgml, I think the query should be:
    >
    > SELECT stxkeys AS k, jsonb_pretty(stxdndistinct::text::jsonb) AS nd
    >   FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
    >   WHERE stxname = 'stts2'    \gx
    >
    
    The example almost certainly predates \gx, so that's a good suggestion.
    
  45. Re: Extended Statistics set/restore/clear functions.

    jian he <jian.universality@gmail.com> — 2025-11-13T11:51:00Z

    hi.
    now looking at v12-0003-Add-working-input-function-for-pg_ndistinct.patch
    again.
    
    + * example input:
    + *     [{"attributes": [6, -1], "ndistinct": 14},
    + *      {"attributes": [6, -2], "ndistinct": 9143},
    + *      {"attributes": [-1,-2], "ndistinct": 13454},
    + *      {"attributes": [6, -1, -2], "ndistinct": 14549}]
      */
     Datum
     pg_ndistinct_in(PG_FUNCTION_ARGS)
    
    extenssted statistics surely won't work on system columns,
    how should we deal with case like:
    ```
    {"attributes": [6, -1], "ndistinct": 14}
    {"attributes": [6, -7], "ndistinct": 14},
    ```
    issue a warning or error out saying that your attribute number is invalid?
    Should we discourage using system columns as examples in comments here?
    
    I have added more test code in src/test/regress/sql/pg_ndistinct.sql,
    to improve the code coverage.
    
    as mentioned before in
    https://postgr.es/m/CACJufxEZYqocFdgn-x-bJMRBSk_zkS=ziGGkaSumteiPDksnsg@mail.gmail.com
    I think it's a good thing to change
    ``(errcode....``
    to
    ``errcode``.
    So I did the change.
    
    
    +static JsonParseErrorType
    +ndistinct_array_element_start(void *state, bool isnull)
    +{
    + NDistinctParseState *parse = state;
    +
    + switch(parse->state)
    + {
    + case NDIST_EXPECT_ATTNUM:
    + if (!isnull)
    + return JSON_SUCCESS;
    +
    + ereturn(parse->escontext, (Datum) 0,
    + (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
    + errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
    + errdetail("Attnum list elements cannot be null.")));
    
    this (and many other places) looks wrong, because
    ereturn would really return ``(Datum) 0``, and this function returns
    JsonParseErrorType.
    so we have to errsave here.
    
    +typedef struct
    +{
    + const char *str;
    + NDistinctSemanticState state;
    +
    + List   *distinct_items; /* Accumulated complete MVNDistinctItems */
    + Node   *escontext;
    +
    + bool found_attributes; /* Item has an attributes key */
    + bool found_ndistinct; /* Item has ndistinct key */
    + List   *attnum_list; /* Accumulated attributes attnums */
    + int64 ndistinct;
    +} NDistinctParseState;
    
    + case NDIST_EXPECT_NDISTINCT:
    + /*
    + * While the structure dictates that ndistinct in a double precision
    + * floating point, in practice it has always been an integer, and it
    + * is output as such. Therefore, we follow usage precendent over the
    + * actual storage structure, and read it in as an integer.
    + */
    + parse->ndistinct = pg_strtoint64_safe(token, parse->escontext);
    +
    + if (SOFT_ERROR_OCCURRED(parse->escontext))
    + return JSON_SEM_ACTION_FAILED;
    
    NDistinctParseState.ndistinct should be integer,
    otherwise pg_ndistinct_out will not be consistent with  pg_ndistinct_in?
    
    SELECT '[{"attributes" : [1, 2], "ndistinct" :
    2147483648}]'::pg_ndistinct; --error
                        pg_ndistinct
    ----------------------------------------------------
     [{"attributes": [1, 2], "ndistinct": -2147483648}]
    (1 row)
    
    The result seems not what we expected.
    
  46. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-14T05:49:23Z

    >
    > extenssted statistics surely won't work on system columns,
    > how should we deal with case like:
    > ```
    > {"attributes": [6, -1], "ndistinct": 14}
    > {"attributes": [6, -7], "ndistinct": 14},
    > ```
    > issue a warning or error out saying that your attribute number is invalid?
    > Should we discourage using system columns as examples in comments here?
    >
    
    Negative numbers represent the Nth expression defined in the extended
    statistics object. So if you have extended statistics on a, b, length(a),
    length(b) then you can legally have -1 and -2 in the attributes, but
    nothing lower than that.
    See functions pg_ndistinct_validate_items() and
    pg_depdendencies_validate_deps() as these check the attributes in the value
    against the definition of the extended stats object.
    
    Though this does bring up a small naming problem: Elements of a
    pg_dependencies are Dependency, abbreviated to dep, but are also called
    Items like the elements in a pg_ndistinct. We should pick a standard name
    for such things (probably item) and use it everywhere.
    
    
    >
    > I have added more test code in src/test/regress/sql/pg_ndistinct.sql,
    > to improve the code coverage.
    >
    
    I'm trying to implement those test cases, but I may have missed some.
    
    
    >
    >
    > this (and many other places) looks wrong, because
    > ereturn would really return ``(Datum) 0``, and this function returns
    > JsonParseErrorType.
    > so we have to errsave here.
    >
    
    Good point. Implemented.
    
    
    
    >
    > NDistinctParseState.ndistinct should be integer,
    > otherwise pg_ndistinct_out will not be consistent with  pg_ndistinct_in?
    >
    
    +1
    
    Implemented many, but not all of these suggestions.
    
  47. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-11-14T06:25:27Z

    On Fri, Nov 14, 2025 at 12:49:23AM -0500, Corey Huinker wrote:
    > Negative numbers represent the Nth expression defined in the extended
    > statistics object. So if you have extended statistics on a, b, length(a),
    > length(b) then you can legally have -1 and -2 in the attributes, but
    > nothing lower than that.
    >
    > See functions pg_ndistinct_validate_items() and
    > pg_depdendencies_validate_deps() as these check the attributes in the value
    > against the definition of the extended stats object.
    
    Exactly.  Extended stats on system columns don't work because they
    don't really make sense as we want to track correlations between the
    attributes defined, and these reflect internal states:
    create table poo (a int, b int);
    create statistics poos (ndistinct ) ON cmax, a from poo;
    ERROR:  0A000: statistics creation on system columns is not supported
    
    Note that the expressions are also stored in pg_stats_ext_exprs.
    
    > I'm trying to implement those test cases, but I may have missed some.
    
    I've found a lot of them with coverage-html during a previous lookup.
    I'd like to think that we should aim for something close to 100%
    coverage for the two input functions.
    
    > Implemented many, but not all of these suggestions.
    
    Thanks for the new versions, I'll also look at all these across the
    next couple of days.  Probably not at 0005~ for now.
    --
    Michael
    
  48. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-11-17T06:56:06Z

    On Fri, Nov 14, 2025 at 03:25:27PM +0900, Michael Paquier wrote:
    > Thanks for the new versions, I'll also look at all these across the
    > next couple of days.  Probably not at 0005~ for now.
    
    0001 and 0002 from series v13 have been applied to change the output
    functions.
    
    And I have looked at 0003 in details for now.  Attached is a revised
    version for it, with many adjustments.  Some notes:
    - Many portions of the coverage were missing.  I have measured the
    coverage at 91% with the updated version attached.  This includes
    coverage for some error reporting, something that we rely a lot on for
    this code.
    - The error reports are made simpler, with the token values getting
    hidden.  While testing with some fancy values, I have actually noticed
    that the error handlings for the parsing of the int16 and int32 values
    were incorrect, the error reports used what the safe functions
    generated, not the reports from the data type.
    - Passing down arbitrary bytes sequences was leading to these bytes
    reported in the error outputs because we cared about the token values.
    I have added a few tests based on that for the code paths involved.
    
    There is an extra thing that bugs me as incorrect for the pg_ndistinct
    input, something I have not tackled myself yet.  Your patch checks
    that subsets of attributes are included in the longest set found, but
    it does not match the guarantees we have in mvndistinct.c: we have to
    check that *all* the combinations generated by generator_init() are
    satisfied based on the longest of attributes detected.  For example,
    this is thought as correct in the input function:
    SELECT '[{"attributes" : [-1,2], "ndistinct" : 1},
             {"attributes" : [-1,2,3], "ndistinct" : 3}]'::pg_ndistinct;
    
    However it is obviously not correct as we are missing an element for
    the attributes [-1, 3].  The simplest solution would be to export the
    routines that generate the groups now in mvndistinct.c.  Also we
    should make sure that the number of elements in the arrays match with
    the number of groups we expect, not only the elements.  I don't think
    that we need to care much about the values, but we ought to provide
    stronger guarantees for the attributes listed in these elements.
    
    Except for this argument, the input of pg_ndistinct feels OK in terms
    of the guarantees that we'd want to enforce on an import.  The same
    argument applies in terms of attribute number guarantees for
    pg_dependencies, based on DependencyGenerator_init() & friends in
    dependencies.c.  Could you look at that?
    
    For pg_dependencies, we also require some checks on the value for
    "dependency", of course, making sure that this matches with what's
    expected with the "largest" sets of attributes.  In this case, we need
    to track the union of "dependency" and "attributes", with "attributes"
    having at least one element.
    
    The tests of pg_dependencies need also to be extended more (begun that
    a bit, far from being complete and I'm lacking of time this week due
    to a conference).  One thing that I would add are nested JSON objects
    in the paths where we expect values, for example.  Please note that I
    have done a brush of 0004, while on it, cleaning up typos,
    inconsistencies and making the error codes consistent with the
    ndistinct case where possible.  This is not ready, but that's at least
    it's a start to rely on.
    
    In terms of committable bits, it would be better to apply the input
    functions once both parts are ready to go.  For now I am attached a
    v14 with the work I've put into them.  0005~ are not reviewed yet, as
    mentioned previously.  The changes in pg_dependencies are actually
    straight-forward to figure out (well, mostly) once the pg_ndistinct
    changes are OK in shape.
    --
    Michael
    
  49. Re: Extended Statistics set/restore/clear functions.

    jian he <jian.universality@gmail.com> — 2025-11-17T12:53:35Z

    On Mon, Nov 17, 2025 at 2:56 PM Michael Paquier <michael@paquier.xyz> wrote:
    >
    > On Fri, Nov 14, 2025 at 03:25:27PM +0900, Michael Paquier wrote:
    > > Thanks for the new versions, I'll also look at all these across the
    > > next couple of days.  Probably not at 0005~ for now.
    >
    > 0001 and 0002 from series v13 have been applied to change the output
    > functions.
    >
    > And I have looked at 0003 in details for now.  Attached is a revised
    > version for it, with many adjustments.  Some notes:
    > - Many portions of the coverage were missing.  I have measured the
    > coverage at 91% with the updated version attached.  This includes
    > coverage for some error reporting, something that we rely a lot on for
    > this code.
    > - The error reports are made simpler, with the token values getting
    > hidden.  While testing with some fancy values, I have actually noticed
    > that the error handlings for the parsing of the int16 and int32 values
    > were incorrect, the error reports used what the safe functions
    > generated, not the reports from the data type.
    > - Passing down arbitrary bytes sequences was leading to these bytes
    > reported in the error outputs because we cared about the token values.
    > I have added a few tests based on that for the code paths involved.
    >
    > There is an extra thing that bugs me as incorrect for the pg_ndistinct
    > input, something I have not tackled myself yet.  Your patch checks
    > that subsets of attributes are included in the longest set found, but
    > it does not match the guarantees we have in mvndistinct.c: we have to
    > check that *all* the combinations generated by generator_init() are
    > satisfied based on the longest of attributes detected.  For example,
    > this is thought as correct in the input function:
    > SELECT '[{"attributes" : [-1,2], "ndistinct" : 1},
    >          {"attributes" : [-1,2,3], "ndistinct" : 3}]'::pg_ndistinct;
    >
    > However it is obviously not correct as we are missing an element for
    > the attributes [-1, 3].  The simplest solution would be to export the
    > routines that generate the groups now in mvndistinct.c.  Also we
    > should make sure that the number of elements in the arrays match with
    > the number of groups we expect, not only the elements.  I don't think
    > that we need to care much about the values, but we ought to provide
    > stronger guarantees for the attributes listed in these elements.
    >
    > Except for this argument, the input of pg_ndistinct feels OK in terms
    > of the guarantees that we'd want to enforce on an import.  The same
    > argument applies in terms of attribute number guarantees for
    > pg_dependencies, based on DependencyGenerator_init() & friends in
    > dependencies.c.  Could you look at that?
    >
    
    hi.
    
    NDistinctSemanticState, last element
    NDIST_EXPECT_COMPLETE should follow with a comma, like:
    ``NDIST_EXPECT_COMPLETE, ``
    
    
    Lots of tests use pg_input_error_info, which is good.
    since currently only pg_input_error_info, pg_input_is_valid
    NDistinctParseState.escontext is not NULL.
    
    
    + /*
    + * If escontext already set, just use that. Anything else is a generic
    + * JSON parse error.
    + */
    + if (!SOFT_ERROR_OCCURRED(parse_state.escontext))
    + errsave(parse_state.escontext,
    + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
    + errmsg("malformed pg_ndistinct: \"%s\"", str),
    + errdetail("Must be valid JSON."));
    seems no coverage for this.
    
    segfault:
    SELECT '[{"attributes" : [1,2,3,4,5,67,6,7,8], "ndistinct" : 4}]'::pg_ndistinct;
    
    because src/backend/statistics/mvdistinct.c line: 310, Assert
    Assert((item->nattributes >= 2) && (item->nattributes <= STATS_MAX_DIMENSIONS));
    
    
    +SELECT '[{"attributes" : [2], "ndistinct" : 4}]'::pg_ndistinct;
    +ERROR:  malformed pg_ndistinct: "[{"attributes" : [2], "ndistinct" : 4}]"
    +LINE 1: SELECT '[{"attributes" : [2], "ndistinct" : 4}]'::pg_ndistin...
    +               ^
    +DETAIL:  The "ndistinct" key must contain an array of at least two attributes.
    
    here, it should be
    +DETAIL:  The "attributes" key must contain an array of at least two attributes.
    ?
    
    --
    jian
    https://www.enterprisedb.com/
    
    
    
    
  50. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-17T17:18:55Z

    On Mon, Nov 17, 2025 at 1:56 AM Michael Paquier <michael@paquier.xyz> wrote:
    
    > On Fri, Nov 14, 2025 at 03:25:27PM +0900, Michael Paquier wrote:
    > > Thanks for the new versions, I'll also look at all these across the
    > > next couple of days.  Probably not at 0005~ for now.
    >
    > 0001 and 0002 from series v13 have been applied to change the output
    > functions.
    >
    
    Thanks!
    
    Though, I was thinking some more about the output format. Using
    jsonb_pretty() makes it readable in one way, and very clumsy in other ways.
    Instead, I'm going to try doing the following:
    
    replace (ndist_string_value, '},', E'}\n,')
    
    This will result in the output value being formatted in exactly the way
    described in the commit messages.
    
    Of course, we could make the the actual default format by changing
    appendStringInfoString(&str, ", ") instead.
    
    One might argue that the output shouldn't get too flowery, but we're
    already adding spaces between items and array elements, and we've already
    made extensive changes favoring readability over compactness.
    
    
    Review of changes to input function patches:
    
    I'm curious about the re-parameterization of error messages involving
    PG_NDISTINCT_KEY_ATTRIBUTES, PG_NDISTINCT_KEY_NDISTINCT, and similar keys
    in dependencies. I like the parameterized version better, and was confused
    as to why it was removed. Did you change your mind, or was it done for ease
    of translation?
    
    Wording changes all look good. Use of pg_input_error_info() makes sense.
    
    
    
    >
    > There is an extra thing that bugs me as incorrect for the pg_ndistinct
    > input, something I have not tackled myself yet.  Your patch checks
    > that subsets of attributes are included in the longest set found, but
    > it does not match the guarantees we have in mvndistinct.c: we have to
    > check that *all* the combinations generated by generator_init() are
    > satisfied based on the longest of attributes detected.  For example,
    > this is thought as correct in the input function:
    > SELECT '[{"attributes" : [-1,2], "ndistinct" : 1},
    >          {"attributes" : [-1,2,3], "ndistinct" : 3}]'::pg_ndistinct;
    >
    > However it is obviously not correct as we are missing an element for
    > the attributes [-1, 3].  The simplest solution would be to export the
    > routines that generate the groups now in mvndistinct.c.  Also we
    > should make sure that the number of elements in the arrays match with
    > the number of groups we expect, not only the elements.  I don't think
    > that we need to care much about the values, but we ought to provide
    > stronger guarantees for the attributes listed in these elements.
    >
    
    I had a feeling that was going to be requested. My question would be if
    that we want to stick to modeling the other combinations after the first
    longest combination, last longest, or if we want to defer those checks
    altogether until we have to validate against an actual stats object?
    
    
    >
    > Except for this argument, the input of pg_ndistinct feels OK in terms
    > of the guarantees that we'd want to enforce on an import.  The same
    > argument applies in terms of attribute number guarantees for
    > pg_dependencies, based on DependencyGenerator_init() & friends in
    > dependencies.c.  Could you look at that?
    >
    
    Yes. I had already looked at it to verify that _all_ combinations were
    always generated (they are), because I had some vague memory of the
    generator dropping combinations that were statistically insignificant. In
    retrospect, I have no idea where I got that idea.
    
    
    
    >
    > For pg_dependencies, we also require some checks on the value for
    > "dependency", of course, making sure that this matches with what's
    > expected with the "largest" sets of attributes.  In this case, we need
    > to track the union of "dependency" and "attributes", with "attributes"
    > having at least one element.
    >
    
    This is fairly simple to do. The dependency attnum is just appended to the
    list of attnums, and the combinations are generated the same as ndistinct,
    though obviously there are no single elements.
    
    There's probably some common code between the lists to be shared, differing
    only in how they report missing combinations.
    
    
    
    > The tests of pg_dependencies need also to be extended more (begun that
    > a bit, far from being complete and I'm lacking of time this week due
    > to a conference).  One thing that I would add are nested JSON objects
    > in the paths where we expect values, for example.  Please note that I
    > have done a brush of 0004, while on it, cleaning up typos,
    > inconsistencies and making the error codes consistent with the
    > ndistinct case where possible.  This is not ready, but that's at least
    > it's a start to rely on.
    >
    
    
    
    >
    > In terms of committable bits, it would be better to apply the input
    > functions once both parts are ready to go.  For now I am attached a
    > v14 with the work I've put into them.  0005~ are not reviewed yet, as
    > mentioned previously.  The changes in pg_dependencies are actually
    > straight-forward to figure out (well, mostly) once the pg_ndistinct
    > changes are OK in shape.
    > --
    > Michael
    
    
    +1
    
  51. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-11-17T23:19:58Z

    On Mon, Nov 17, 2025 at 12:18:55PM -0500, Corey Huinker wrote:
    > On Mon, Nov 17, 2025 at 1:56 AM Michael Paquier <michael@paquier.xyz> wrote:
    > Though, I was thinking some more about the output format. Using
    > jsonb_pretty() makes it readable in one way, and very clumsy in other ways.
    > Instead, I'm going to try doing the following:
    > 
    > replace (ndist_string_value, '},', E'}\n,')
    > 
    > This will result in the output value being formatted in exactly the way
    > described in the commit messages.
    > 
    > Of course, we could make the the actual default format by changing
    > appendStringInfoString(&str, ", ") instead.
    
    This feels like a different pretty still compressed output for json.
    I don't think we should change the output functions to do that, but if
    you want to add a function that filters these contents a bit in the
    tests for the input functions, sure, why not.
    
    > One might argue that the output shouldn't get too flowery, but we're
    > already adding spaces between items and array elements, and we've already
    > made extensive changes favoring readability over compactness.
    
    I'd still keep it without newlines, FWIW.  So what we have in the
    output functions is OK for me.  The key names could be updated to
    something else for this release, I'm open for suggestions and we have
    time for this release.  It would be nice to not do rename tweaks
    several times.
    
    > I'm curious about the re-parameterization of error messages involving
    > PG_NDISTINCT_KEY_ATTRIBUTES, PG_NDISTINCT_KEY_NDISTINCT, and similar keys
    > in dependencies. I like the parameterized version better, and was confused
    > as to why it was removed. Did you change your mind, or was it done for ease
    > of translation?
    
    Yes, this one is to reduce the translation work, and because the
    messages are quite the same across the board and deal with the same
    requirements:
    - Single integer expected after a key (attnum or actual value).
    - Array of attribute expected after a key.
    - For the degree key, float value.
    
    > I had a feeling that was going to be requested. My question would be if
    > that we want to stick to modeling the other combinations after the first
    > longest combination, last longest, or if we want to defer those checks
    > altogether until we have to validate against an actual stats object?
    
    I would tend to think that performing one round of validation once the
    whole set of objects has been parsed is going to be cheaper than
    periodic checks.
    
    One other thing would be to force a sort of the elements in the array
    to match with the order these are generated when creating the stats.
    We cannot do that in the input functions because we have no idea about
    the order of the attributes in the statistics object yet.  Applying a
    sort sounds also important to me to make sure that we order the stats
    based on what the group generation functions (aka
    generate_combinations(), etc.) think on the matter, which would
    enforce a stronger binary compatibility after we are sure that we have
    a full set of attributes listed in an array with the input function of
    course.  I have briefly looked at the planner code where extended
    stats are used, like selfuncs.c, and the ordering does not completely
    matter, it seems, but it's cheap enough to enforce a stricter ordering
    based on the K groups of N elements generated in the import function.
    
    >> Except for this argument, the input of pg_ndistinct feels OK in terms
    >> of the guarantees that we'd want to enforce on an import.  The same
    >> argument applies in terms of attribute number guarantees for
    >> pg_dependencies, based on DependencyGenerator_init() & friends in
    >> dependencies.c.  Could you look at that?
    > 
    > Yes. I had already looked at it to verify that _all_ combinations were
    > always generated (they are), because I had some vague memory of the
    > generator dropping combinations that were statistically insignificant. In
    > retrospect, I have no idea where I got that idea.
    
    Hmm.  I would need to double-check the code to be sure, but I don't
    think that we drop combinations, because the code prevents duplicates
    to begin with, even for expressions:
    create table aa (a int, b int);
    create statistics stats (ndistinct) ON a, a, b, b from aa;
    ERROR:  42701: duplicate column name in statistics definition
    create statistics stats (ndistinct) ON (a + a), ((a + a)) from aa;
    ERROR:  42701: duplicate expression in statistics definition
    
    These don't make sense anyway because they have a predictible and
    perfectly matching correlation relationship.
    
    > This is fairly simple to do. The dependency attnum is just appended to the
    > list of attnums, and the combinations are generated the same as ndistinct,
    > though obviously there are no single elements.
    
    Yeah.  That should be not be bad, I hope.
    
    > There's probably some common code between the lists to be shared, differing
    > only in how they report missing combinations.
    
    I would like to agree on that, but it did not look that obvious to me
    yesterday.  If you think that something could be refactored, I'd
    suggest a refactoring patch that applies on top of the rest of the
    patch set, with new generic facilities in stat_util.c, or even a
    new separate file, if that leads to a cleaner result (okay, a
    definition of "clean" is up to one's taste).
    --
    Michael
    
  52. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-11-17T23:35:00Z

    On Mon, Nov 17, 2025 at 08:53:35PM +0800, jian he wrote:
    > segfault:
    > SELECT '[{"attributes" : [1,2,3,4,5,67,6,7,8], "ndistinct" : 4}]'::pg_ndistinct;
    > 
    > because src/backend/statistics/mvdistinct.c line: 310, Assert
    > Assert((item->nattributes >= 2) && (item->nattributes <= STATS_MAX_DIMENSIONS));
    
    Ahah, good one here.  Perhaps we had better switch this assertion to
    become an elog() with the import functions in mind, as well.  It may
    be worth double-checking the other deserialization paths, as well.  We
    need to be super careful about this stuff for data imports.
    --
    Michael
    
  53. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-18T02:32:37Z

    >
    >
    > This feels like a different pretty still compressed output for json.
    > I don't think we should change the output functions to do that, but if
    > you want to add a function that filters these contents a bit in the
    > tests for the input functions, sure, why not.
    >
    
    +1, I'll probably just use the replaces rather than define a function and
    give someone the false impression that the function exists elsewhere.
    
    
    >
    > Yes, this one is to reduce the translation work, and because the
    > messages are quite the same across the board and deal with the same
    > requirements:
    > - Single integer expected after a key (attnum or actual value).
    > - Array of attribute expected after a key.
    > - For the degree key, float value.
    >
    > > I had a feeling that was going to be requested. My question would be if
    > > that we want to stick to modeling the other combinations after the first
    > > longest combination, last longest, or if we want to defer those checks
    > > altogether until we have to validate against an actual stats object?
    >
    > I would tend to think that performing one round of validation once the
    > whole set of objects has been parsed is going to be cheaper than
    > periodic checks.
    >
    > One other thing would be to force a sort of the elements in the array
    > to match with the order these are generated when creating the stats.
    > We cannot do that in the input functions because we have no idea about
    > the order of the attributes in the statistics object yet.  Applying a
    > sort sounds also important to me to make sure that we order the stats
    > based on what the group generation functions (aka
    > generate_combinations(), etc.) think on the matter, which would
    > enforce a stronger binary compatibility after we are sure that we have
    > a full set of attributes listed in an array with the input function of
    > course.  I have briefly looked at the planner code where extended
    > stats are used, like selfuncs.c, and the ordering does not completely
    > matter, it seems, but it's cheap enough to enforce a stricter ordering
    > based on the K groups of N elements generated in the import function.
    >
    > >> Except for this argument, the input of pg_ndistinct feels OK in terms
    > >> of the guarantees that we'd want to enforce on an import.  The same
    > >> argument applies in terms of attribute number guarantees for
    > >> pg_dependencies, based on DependencyGenerator_init() & friends in
    > >> dependencies.c.  Could you look at that?
    > >
    > > Yes. I had already looked at it to verify that _all_ combinations were
    > > always generated (they are), because I had some vague memory of the
    > > generator dropping combinations that were statistically insignificant. In
    > > retrospect, I have no idea where I got that idea.
    >
    > Hmm.  I would need to double-check the code to be sure, but I don't
    > think that we drop combinations, because the code prevents duplicates
    > to begin with, even for expressions:
    > create table aa (a int, b int);
    > create statistics stats (ndistinct) ON a, a, b, b from aa;
    > ERROR:  42701: duplicate column name in statistics definition
    > create statistics stats (ndistinct) ON (a + a), ((a + a)) from aa;
    > ERROR:  42701: duplicate expression in statistics definition
    >
    
    So I looked at the generator functions, hoping they'd have enough in common
    that they could be made generic. And they're just different enough that I
    think it's not worth it to try.
    
    But, if we don't care about the order of the combinations, I also don't
    think we need to expose the functions at all. We know exactly how many
    combinations there should be for any N attributes as each attribute must be
    unique. So if we have the right number of unique combinations, and they're
    all subsets of the first-longest, then we must have a complete set.
    Thoughts on that?
    
    Getting _too_ tight with the ordering and contents makes me concerned for
    the day when the format might change. We don't want to _fail_ an upgrade
    because some of the combinations were in the wrong order.
    
    
    
    > These don't make sense anyway because they have a predictible and
    > perfectly matching correlation relationship.
    >
    
    They do, for now, but are we willing to lock ourselves into that forever?
    
    
    >
    > > This is fairly simple to do. The dependency attnum is just appended to
    > the
    > > list of attnums, and the combinations are generated the same as
    > ndistinct,
    > > though obviously there are no single elements.
    >
    > Yeah.  That should be not be bad, I hope.
    >
    > > There's probably some common code between the lists to be shared,
    > differing
    > > only in how they report missing combinations.
    >
    > I would like to agree on that, but it did not look that obvious to me
    > yesterday.  If you think that something could be refactored, I'd
    > suggest a refactoring patch that applies on top of the rest of the
    > patch set, with new generic facilities in stat_util.c, or even a
    > new separate file, if that leads to a cleaner result (okay, a
    > definition of "clean" is up to one's taste).
    >
    
    Looking over those functions, they both could have use the same generator,
    but the dependencies-side decided that dependency order doesn't matter,
    which puts doubt in my head that the order is perfectly the same for both,
    so we'd better follow each individually IF we want to enforce order.
    
  54. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-18T02:34:14Z

    On Mon, Nov 17, 2025 at 6:35 PM Michael Paquier <michael@paquier.xyz> wrote:
    
    > On Mon, Nov 17, 2025 at 08:53:35PM +0800, jian he wrote:
    > > segfault:
    > > SELECT '[{"attributes" : [1,2,3,4,5,67,6,7,8], "ndistinct" :
    > 4}]'::pg_ndistinct;
    > >
    > > because src/backend/statistics/mvdistinct.c line: 310, Assert
    > > Assert((item->nattributes >= 2) && (item->nattributes <=
    > STATS_MAX_DIMENSIONS));
    >
    > Ahah, good one here.  Perhaps we had better switch this assertion to
    > become an elog() with the import functions in mind, as well.  It may
    > be worth double-checking the other deserialization paths, as well.  We
    > need to be super careful about this stuff for data imports.
    >
    
    Yeah, I liked that one too, and I can act on that while we're debating how
    tightly to turn the screws on combinatorics.
    
  55. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-11-18T03:34:29Z

    On Mon, Nov 17, 2025 at 09:32:37PM -0500, Corey Huinker wrote:
    > So I looked at the generator functions, hoping they'd have enough in common
    > that they could be made generic. And they're just different enough that I
    > think it's not worth it to try.
    > 
    > But, if we don't care about the order of the combinations, I also don't
    > think we need to expose the functions at all. We know exactly how many
    > combinations there should be for any N attributes as each attribute must be
    > unique. So if we have the right number of unique combinations, and they're
    > all subsets of the first-longest, then we must have a complete set.
    > Thoughts on that?
    > 
    > Getting _too_ tight with the ordering and contents makes me concerned for
    > the day when the format might change. We don't want to _fail_ an upgrade
    > because some of the combinations were in the wrong order.
    
    That's fair.  The planner costing code pulling the stats numbers based
    on the attributes was smart enough to not care much about the ordering
    as far as I recall, but I'd rather make sure of that first.  This
    needs some careful lookup.
    
    >> These don't make sense anyway because they have a predictible and
    >> perfectly matching correlation relationship.
    >>
    > 
    > They do, for now, but are we willing to lock ourselves into that forever?
    
    Perhaps not.  I cannot say for sure what's the future is going to be
    made of.
    
    > Looking over those functions, they both could have use the same generator,
    > but the dependencies-side decided that dependency order doesn't matter,
    > which puts doubt in my head that the order is perfectly the same for both,
    > so we'd better follow each individually IF we want to enforce order.
    
    I'd try to look at the bits related to pg_dependencies and
    pg_ndistinct as two separate concepts, at the end.  They're sort of
    alike, but have too many differences already.
    --
    Michael
    
  56. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-18T04:39:50Z

    >
    > > But, if we don't care about the order of the combinations, I also don't
    > > think we need to expose the functions at all. We know exactly how many
    > > combinations there should be for any N attributes as each attribute must
    > be
    > > unique. So if we have the right number of unique combinations, and
    > they're
    > > all subsets of the first-longest, then we must have a complete set.
    > > Thoughts on that?
    > >
    > > Getting _too_ tight with the ordering and contents makes me concerned for
    > > the day when the format might change. We don't want to _fail_ an upgrade
    > > because some of the combinations were in the wrong order.
    >
    > That's fair.  The planner costing code pulling the stats numbers based
    > on the attributes was smart enough to not care much about the ordering
    > as far as I recall, but I'd rather make sure of that first.  This
    > needs some careful lookup.
    >
    
    I've done some experiments, creating extended stats objects up to the 8
    attribute limit.
    
    The big takeaway is that I wasn't imagining that he number of dependencies
    combinations is NOT deterministic:
    
    /*
     * if the dependency seems entirely invalid, don't store it
     */
    if (degree == 0.0)
         continue;
    
    So, in theory, an empty (i.e. '[]') pg_dependencies is valid.
    
    The number of pg_ndistinct is deterministic, now, but I'm even less sure
    that'll be true in the future.
    
    We can definitely rely on the attnums being all the positive numbers in
    ascending order first, followed by the negative numbers in descending
    order, but that's about it. Which raises the question of how we describe
    the error when attnums are out of order.
    
    We know that the deserialize functions take the data's word for it as to
    how many items to unpack, so I don't see the impact of not caring how many
    might be missing. That even sort of feeds into Tom's idea that stats import
    was in some sense a fuzzing tool.
    
    
    > I'd try to look at the bits related to pg_dependencies and
    > pg_ndistinct as two separate concepts, at the end.  They're sort of
    > alike, but have too many differences already.
    >
    
    Based on the above, I think we can't really add anything beyond the attnum
    order, and we have to relax some existing restrictions on pg_dependencies...
    
  57. Re: Extended Statistics set/restore/clear functions.

    jian he <jian.universality@gmail.com> — 2025-11-18T05:07:23Z

    On Mon, Nov 17, 2025 at 2:56 PM Michael Paquier <michael@paquier.xyz> wrote:
    >
    > On Fri, Nov 14, 2025 at 03:25:27PM +0900, Michael Paquier wrote:
    > > Thanks for the new versions, I'll also look at all these across the
    > > next couple of days.  Probably not at 0005~ for now.
    >
    > 0001 and 0002 from series v13 have been applied to change the output
    > functions.
    >
    
    > And I have looked at 0003 in details for now.  Attached is a revised
    > version for it, with many adjustments.  Some notes:
    > - Many portions of the coverage were missing.  I have measured the
    > coverage at 91% with the updated version attached.  This includes
    > coverage for some error reporting, something that we rely a lot on for
    > this code.
    > - The error reports are made simpler, with the token values getting
    > hidden.  While testing with some fancy values, I have actually noticed
    > that the error handlings for the parsing of the int16 and int32 values
    > were incorrect, the error reports used what the safe functions
    > generated, not the reports from the data type.
    > - Passing down arbitrary bytes sequences was leading to these bytes
    > reported in the error outputs because we cared about the token values.
    > I have added a few tests based on that for the code paths involved.
    >
    hi.
    
    in src/backend/statistics/mvdistinct.c, we have:
    Assert(AttributeNumberIsValid(item->attributes[j]));
    
    should we disallow 0 in key attributes?
    SELECT '[{"attributes" : [0,1], "ndistinct" : 4}]'::pg_ndistinct;
    I didn't find a way to trigger this Assert yet.
    
    
    + errsave(parse->escontext,
    + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
    + errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
    + errdetail("Invalid \"%s\" value.", PG_NDISTINCT_KEY_ATTRIBUTES));
    
    + errsave(parse->escontext,
    + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
    + errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
    + errdetail("Invalid \"%s\" value.",
    +  PG_NDISTINCT_KEY_NDISTINCT));
    
    the errdetail is way too generic?
    similar to ``select 'a'::int;``
    we can
    DETAIL:  Invalid input syntax for type integer: "a"
    HINT: "ndistinct" value expected to be a type of integer.
    
    what do you think?
    
    
    we already have "fname" in ndistinct_object_field_start,
    we can also print out the "fname", like:
        errsave(parse->escontext,
                errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
                errdetail("Unexpected key \"%s\"", fname),
                errhint("Only allowed keys are \"%s\" and \"%s\".",
                          PG_NDISTINCT_KEY_ATTRIBUTES,
                          PG_NDISTINCT_KEY_NDISTINCT));
    
    
    SELECT '[{"attributes" : [2,3], "ndistinct" : 4, "ndistinct" :
    14}]'::pg_ndistinct;
                   pg_ndistinct
    -------------------------------------------
     [{"attributes": [2, 3], "ndistinct": 14}]
    
    SELECT '[{"attributes" : [2,3], "ndistinct" : 4, "attributes" :
    []}]'::pg_ndistinct;
                   pg_ndistinct
    ------------------------------------------
     [{"attributes": [2, 3], "ndistinct": 4}]
    
    Is the above output what we expected?
    
    
    + /*
    + * We need at least two attribute numbers for a ndistinct item, anything
    + * less is malformed.
    + */
    + natts = parse->attnum_list->length;
    here, we can use list_length.
    
    + if (parse->attnum_list != NIL)
    + if (parse->distinct_items != NIL)
    here, we can also use list_length.
    
    
    --
    jian
    https://www.enterprisedb.com/
    
    
    
    
  58. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-11-18T05:39:04Z

    On Tue, Nov 18, 2025 at 01:07:23PM +0800, jian he wrote:
    > + errsave(parse->escontext,
    > + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
    > + errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
    > + errdetail("Invalid \"%s\" value.",
    > +  PG_NDISTINCT_KEY_NDISTINCT));
    > 
    > the errdetail is way too generic?
    > similar to ``select 'a'::int;``
    > we can
    > DETAIL:  Invalid input syntax for type integer: "a"
    > HINT: "ndistinct" value expected to be a type of integer.
    > 
    > what do you think?
    
    That is intentional, as it is intentional to not show the value of the
    token in such cases.  This will be mostly used for the import
    functions, so I am not sure that it is worth spending time on tuning
    all these error cases that one is most likely not going to see as the
    main scenarios are going to be through pg_dump/restore, most of the
    time in the scope of an upgrade.
    
    > SELECT '[{"attributes" : [2,3], "ndistinct" : 4, "ndistinct" :
    > 14}]'::pg_ndistinct;
    >                pg_ndistinct
    > -------------------------------------------
    >  [{"attributes": [2, 3], "ndistinct": 14}]
    > 
    > SELECT '[{"attributes" : [2,3], "ndistinct" : 4, "attributes" :
    > []}]'::pg_ndistinct;
    >                pg_ndistinct
    > ------------------------------------------
    >  [{"attributes": [2, 3], "ndistinct": 4}]
    > 
    > Is the above output what we expected?
    
    Interesting one.  The extra "attributes" should not be required once
    we have one set, indeed.
    --
    Michael
    
  59. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-18T08:52:31Z

    >
    > > SELECT '[{"attributes" : [2,3], "ndistinct" : 4, "ndistinct" :
    > > 14}]'::pg_ndistinct;
    > >                pg_ndistinct
    > > -------------------------------------------
    > >  [{"attributes": [2, 3], "ndistinct": 14}]
    > >
    > > SELECT '[{"attributes" : [2,3], "ndistinct" : 4, "attributes" :
    > > []}]'::pg_ndistinct;
    > >                pg_ndistinct
    > > ------------------------------------------
    > >  [{"attributes": [2, 3], "ndistinct": 4}]
    > >
    > > Is the above output what we expected?
    >
    > Interesting one.  The extra "attributes" should not be required once
    > we have one set, indeed.
    > --
    > Michael
    >
    
    v15:
    
    - catches duplicate object keys cited above
    - enforces attnum ordering (ascending positive numbers followed by
    descending negative numbers, no zeros allowed), which means we get
    duplicate attnum detection for free
    - attnum validation is now done as soon as the attnum is parsed
    - tests refactored to put attnums in proper order
    - unfortunately, this means that one of the error cases from
    stats_import.sql (attnum = 0) is now an error rather than something that
    can be soft-excluded.
    - didn't enforce combinatorical completeness for dependencies because not
    all combinations are guaranteed to be there.
    - didn't enforce combinatorical completeness for ndistinct because I'm not
    convinced we should.
    
  60. Re: Extended Statistics set/restore/clear functions.

    jian he <jian.universality@gmail.com> — 2025-11-19T07:49:56Z

    On Tue, Nov 18, 2025 at 4:52 PM Corey Huinker <corey.huinker@gmail.com> wrote:
    > v15:
    >
    > - catches duplicate object keys cited above
    > - enforces attnum ordering (ascending positive numbers followed by descending negative numbers, no zeros allowed), which means we get duplicate attnum detection for free
    > - attnum validation is now done as soon as the attnum is parsed
    > - tests refactored to put attnums in proper order
    > - unfortunately, this means that one of the error cases from stats_import.sql (attnum = 0) is now an error rather than something that can be soft-excluded.
    > - didn't enforce combinatorical completeness for dependencies because not all combinations are guaranteed to be there.
    > - didn't enforce combinatorical completeness for ndistinct because I'm not convinced we should.
    >
    
    hi.
    
    some of the switch->default, default don't have ``break``.
    
    + for (int i = 0; i < nitems; i++)
    + {
    + MVNDistinctItem *item = parse_state.distinct_items->elements[i].ptr_value;
    
    exposing the ptr_value seems not a good idea, we can foreach_ptr
    the attached patch using foreach_ptr.
    
    in function pg_ndistinct_in some errsave can change to ereturn.
    (I didn't do this part, though).
    
    + /*
    + * The attnum cannot be zero a negative number beyond the number of the
    + * possible expressions.
    + */
    + if (attnum == 0 || attnum < (0-STATS_MAX_DIMENSIONS))
    + {
    + errsave(parse->escontext,
    + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
    + errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
    + errdetail("Invalid \"%s\" element: %d.",
    + PG_NDISTINCT_KEY_ATTRIBUTES, attnum));
    + return JSON_SEM_ACTION_FAILED;
    + }
    This part had no coverage tests, so I added a few.
    
    
    as mentioned before
    + errsave(parse->escontext,
    + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
    + errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
    + errdetail("The \"%s\" key must contain an array of at least %d "
    +  "and no more than %d attributes.",
    +  PG_NDISTINCT_KEY_NDISTINCT, 2, STATS_MAX_DIMENSIONS));
    here PG_NDISTINCT_KEY_NDISTINCT, should be PG_NDISTINCT_KEY_ATTRIBUTES.
    
    
    Please check the attached minor miscellaneous changes.
    
    --
    jian
    https://www.enterprisedb.com/
    
  61. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-20T23:08:02Z

    >
    >
    > some of the switch->default, default don't have ``break``.
    >
    
    The compiler doesn't require them, but I see that we do use them in a lot
    of places, so I'll incorporate this.
    
    
    >
    > + for (int i = 0; i < nitems; i++)
    > + {
    > + MVNDistinctItem *item =
    > parse_state.distinct_items->elements[i].ptr_value;
    >
    > exposing the ptr_value seems not a good idea, we can foreach_ptr
    > the attached patch using foreach_ptr.
    
    
    I didn't like this because it makes getting the index value harder, and the
    index value is needed in lots of places. Instead I used list_nth() and
    list_nth_int().
    
    
    
    > in function pg_ndistinct_in some errsave can change to ereturn.
    > (I didn't do this part, though).
    >
    
    -0.25
    
    There's something that I like about the consistency of errsave() followed
    by a return that makes it clear that "the function ends here" that I don't
    get from ereturn().
    
    What I would really like, is a way to generate unique (but translated)
    errdetail values, and move the errsave/ereturn to after the switch
    statement.
    
    
    >
    > + /*
    > + * The attnum cannot be zero a negative number beyond the number of the
    > + * possible expressions.
    > + */
    > + if (attnum == 0 || attnum < (0-STATS_MAX_DIMENSIONS))
    > + {
    > + errsave(parse->escontext,
    > + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
    > + errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
    > + errdetail("Invalid \"%s\" element: %d.",
    > + PG_NDISTINCT_KEY_ATTRIBUTES, attnum));
    > + return JSON_SEM_ACTION_FAILED;
    > + }
    > This part had no coverage tests, so I added a few.
    >
    
    +1
    
    
    
    >
    >
    > as mentioned before
    > + errsave(parse->escontext,
    > + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
    > + errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
    > + errdetail("The \"%s\" key must contain an array of at least %d "
    > +  "and no more than %d attributes.",
    > +  PG_NDISTINCT_KEY_NDISTINCT, 2, STATS_MAX_DIMENSIONS));
    > here PG_NDISTINCT_KEY_NDISTINCT, should be PG_NDISTINCT_KEY_ATTRIBUTES.
    >
    
    +1
    
    Did similar things to pg_dependencies.
    
  62. Re: Extended Statistics set/restore/clear functions.

    Chao Li <li.evan.chao@gmail.com> — 2025-11-21T03:36:23Z

    After 16 rounds of revisions, v16 looks solid and robust. Only got a few small comments:
    
    > On Nov 21, 2025, at 07:08, Corey Huinker <corey.huinker@gmail.com> wrote:
    > 
    > <v16-0001-Add-working-input-function-for-pg_ndistinct.patch><v16-0002-Add-working-input-function-for-pg_dependencies.patch><v16-0003-Expose-attribute-statistics-functions-for-use-in.patch><v16-0004-Add-extended-statistics-support-functions.patch><v16-0005-Include-Extended-Statistics-in-pg_dump.patch>
    
    1 - 0001
    ```
    +	/*
    +	 * We need at least two attribute numbers for a ndistinct item, anything
    +	 * less is malformed.
    +	 */
    +	natts = list_length(parse->attnum_list);
    +	if ((natts < 2) || (natts > STATS_MAX_DIMENSIONS))
    +	{
    +		errsave(parse->escontext,
    +				errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
    +				errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
    +				errdetail("The \"%s\" key must contain an array of at least %d "
    +						  "and no more than %d attributes.",
    +						  PG_NDISTINCT_KEY_ATTRIBUTES, 2, STATS_MAX_DIMENSIONS));
    +		return JSON_SEM_ACTION_FAILED;
    +	}
    ```
    
    Do we also need to reset item state vars similar to the success case?
    
    2 - 0001
    ```
    *
    + * Arrays can never be empty.
    + */
    +static JsonParseErrorType
    +ndistinct_array_end(void *state)
    +{
    ```
    
    This function comment is too simple, and doesn’t match to the function name. Reading the function body, the logic is clear, so I would suggest either just remove the comment or enhance it.
    
    3  - 0001
    ```
    static JsonParseErrorType
    +ndistinct_array_element_start(void *state, bool isnull)
    +{
    +	NDistinctParseState *parse = state;
    ```
    
    Nit: NDistinctParseState * can be const.
    
    4 - 0001
    ```
    +static bool
    +valid_subsequent_attnum(const AttrNumber prev, const AttrNumber cur)
    ```
    
    AttrNumber is int16, thus “const” here is unnecessary.
    
    5 - 0001
    ```
    +			return JSON_SUCCESS;
    +			break;
    ```
    
    “Break” after “return” is unnecessary. I tried to delete the “break” and didn’t see a compile warning, it should be safe to delete the “break”.
    
    6 - 0001
    ```
    +/*
    + * Compare the attribute arrays of two MVNDistinctItem values,
    + * looking for duplicate sets.
    
    + */
    +static bool
    +has_duplicate_attributes(const MVNDistinctItem *a,
    +						 const MVNDistinctItem *b)
    ```
    
    The function comment says “looking for duplicate sets”, seems to imply the function would return the set, but the function returns a bool, which is a little confusing.
    
    7 - 0002
    ```
    +	for (int i = 0; i < natts; i++)
    +	{
    +		dep->attributes[i] = (AttrNumber) list_nth_int(parse->attnum_list, i);
    +		if (dep->attributes[i] == parse->dependency)
    +		{
    +			errsave(parse->escontext,
    +					errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
    +					errmsg("malformed pg_dependencies: \"%s\"", parse->str),
    +					errdetail("Item \"%s\" value %d found in the \"%s\" list.",
    +							  PG_DEPENDENCIES_KEY_DEPENDENCY, parse->dependency,
    +			   				  PG_DEPENDENCIES_KEY_ATTRIBUTES));
    +			return JSON_SEM_ACTION_FAILED;
    +		}
    ```
    
    Similar to comment 1, do we need to reset state vars upon failure?
    
    8 - 0002
    ```
    +				errdetail("The \"%s\" key must contain an array of at least %d "
    +						  " and no than %d elements.",
    ```
    
    I believe “more” is missed between “no than”.
    
    9 - 0002
    ```
    +static JsonParseErrorType
    +dependencies_array_start(void *state)
    +{
    +	DependenciesParseState *parse = state;
    ```
    
    Similar to commit 3, DependenciesParseState * can be const.
    
    10 - 0002
    ```
    static bool
    valid_subsequent_attnum(const AttrNumber prev, const AttrNumber cur)
    ```
    
    This function is a dup to the one in 0001, can we put it to a common place?
    
    11 - 0003
    ```
    +extern void statatt_get_type(Oid reloid, AttrNumber attnum,
    +							 Oid *atttypid, int32 *atttypmod,
    +							 char *atttyptype, Oid *atttypcoll,
    +							 Oid *eq_opr, Oid *lt_opr);
    +extern void statatt_init_empty_tuple(Oid reloid, int16 attnum, bool inherited,
    +									 Datum *values, bool *nulls, bool *replaces);
    +
    +extern void statatt_set_slot(Datum *values, bool *nulls, bool *replaces,
    +							 int16 stakind, Oid staop, Oid stacoll,
    +							 Datum stanumbers, bool stanumbers_isnull,
    +							 Datum stavalues, bool stavalues_isnull);
    +
    +extern Datum text_to_stavalues(const char *staname, FmgrInfo *array_in, Datum d,
    +							   Oid typid, int32 typmod, bool *ok);
    +extern bool statatt_get_elem_type(Oid atttypid, char atttyptype,
    +								  Oid *elemtypid, Oid *elem_eq_opr);
    ```
    
    Several functions are made external visible, they are all renamed with adding a prefix “statatt_”, why text_to_stavalues is an exception?
    
    11 - 0004
    ```
    +bool
    +pg_dependencies_validate_deps(MVDependencies *dependencies, int2vector *stxkeys, int numexprs, int elevel)
    +{
    +	int		attnum_expr_lowbound = 0 - numexprs;
    +
    +	for (int i = 0; i < dependencies->ndeps; i++)
    +	{
    +		MVDependency *dep = dependencies->deps[i];
    ```
    
    This MVDependency * can be const.
    
    12 - 0004
    ```
    static void
    upsert_pg_statistic_ext_data(Datum *values, bool *nulls, bool *replaces)
    {
    ```
    
    This function pass values, nulls and replaces to heap_modify_tuple() and heap_form_tuple(), the both functions take all const pointers as parameters. So, here values, nulls and replaces can all be const.
    
    13 - 0004
    ```
    +		if (!exprs_nulls[offset + AVG_WIDTH_ELEM])
    +		{
    +			ok = text_to_int4(exprs_elems[offset + AVG_WIDTH_ELEM],
    +							  &values[Anum_pg_statistic_stawidth - 1]);
    +
    +			if (!ok)
    +			{
    +				char	   *s = TextDatumGetCString(exprs_elems[offset + NULL_FRAC_ELEM]);
    +
    +				ereport(WARNING,
    +						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
    +						 errmsg("Expression %s element \"%s\" does not match expected input type.",
    +								extexprarginfo[AVG_WIDTH_ELEM].argname, s)));
    +				pfree(s);
    +				return (Datum) 0;
    +			}
    +		}
    ```
    
    Here: char    *s = TextDatumGetCString(exprs_elems[offset + NULL_FRAC_ELEM]);
    Should NULL_FRAC_ELEM be AVG_WIDTH_ELEM?
    
    14 - 0004
    ```
    +		if (!exprs_nulls[offset + N_DISTINCT_ELEM])
    +		{
    +			ok = text_to_float4(exprs_elems[offset + N_DISTINCT_ELEM],
    +								&values[Anum_pg_statistic_stadistinct - 1]);
    +
    +			if (!ok)
    +			{
    +				char	   *s = TextDatumGetCString(exprs_elems[offset + NULL_FRAC_ELEM]);
    +
    +				ereport(WARNING,
    +						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
    +						 errmsg("Expression %s element \"%s\" does not match expected input type.",
    +								extexprarginfo[N_DISTINCT_ELEM].argname, s)));
    +				pfree(s);
    +				return (Datum) 0;
    +			}
    +		}
    ```
    
    Similar to comment 13, should NULL_FRAC_ELEM be N_DISTINCT_ELEM?
    
    15 - 0004
    ```
    +extern Datum import_mcvlist(HeapTuple tup, int elevel, int numattrs,
    +							Oid *atttypids, int32 *atttypmods, Oid *atttypcolls,
    +							int nitems, Datum *mcv_elems, bool *mcv_nulls,
    +							bool *mcv_elem_nulls, float8 *freqs, float8 *base_freqs);
    +
    +extern Datum import_mcvlist(HeapTuple tup, int elevel, int numattrs,
    +							Oid *atttypids, int32 *atttypmods, Oid *atttypcolls,
    +							int nitems, Datum *mcv_elems, bool *mcv_nulls,
    +							bool *mcv_elem_nulls, float8 *freqs, float8 *base_freqs);
    ```
    
    import_mcvlist is declared twice, looks like a copy-paste mistake.
    
    16 - 0005
    ```
    +	PREPQUERY_DUMPEXTSTATSSTATS,
    ```
    
    “statsstats” looks weird, I’d suggest PREPQUERY_DUMPEXTSTATSDATA.
    
    Best regards,
    —
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
    
    
    
  63. Re: Extended Statistics set/restore/clear functions.

    Yuefei Shi <shiyuefei1004@gmail.com> — 2025-11-21T05:33:56Z

    On Fri, Nov 21, 2025 at 10:54 AM Corey Huinker <corey.huinker@gmail.com>
    wrote:
    
    >
    >> some of the switch->default, default don't have ``break``.
    >>
    >
    > The compiler doesn't require them, but I see that we do use them in a lot
    > of places, so I'll incorporate this.
    >
    >
    >>
    >> + for (int i = 0; i < nitems; i++)
    >> + {
    >> + MVNDistinctItem *item =
    >> parse_state.distinct_items->elements[i].ptr_value;
    >>
    >> exposing the ptr_value seems not a good idea, we can foreach_ptr
    >> the attached patch using foreach_ptr.
    >
    >
    > I didn't like this because it makes getting the index value harder, and
    > the index value is needed in lots of places. Instead I used list_nth() and
    > list_nth_int().
    >
    >
    >
    >> in function pg_ndistinct_in some errsave can change to ereturn.
    >> (I didn't do this part, though).
    >>
    >
    > -0.25
    >
    > There's something that I like about the consistency of errsave() followed
    > by a return that makes it clear that "the function ends here" that I don't
    > get from ereturn().
    >
    > What I would really like, is a way to generate unique (but translated)
    > errdetail values, and move the errsave/ereturn to after the switch
    > statement.
    >
    >
    >>
    >> + /*
    >> + * The attnum cannot be zero a negative number beyond the number of the
    >> + * possible expressions.
    >> + */
    >> + if (attnum == 0 || attnum < (0-STATS_MAX_DIMENSIONS))
    >> + {
    >> + errsave(parse->escontext,
    >> + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
    >> + errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
    >> + errdetail("Invalid \"%s\" element: %d.",
    >> + PG_NDISTINCT_KEY_ATTRIBUTES, attnum));
    >> + return JSON_SEM_ACTION_FAILED;
    >> + }
    >> This part had no coverage tests, so I added a few.
    >>
    >
    > +1
    >
    >
    >
    >>
    >>
    >> as mentioned before
    >> + errsave(parse->escontext,
    >> + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
    >> + errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
    >> + errdetail("The \"%s\" key must contain an array of at least %d "
    >> +  "and no more than %d attributes.",
    >> +  PG_NDISTINCT_KEY_NDISTINCT, 2, STATS_MAX_DIMENSIONS));
    >> here PG_NDISTINCT_KEY_NDISTINCT, should be PG_NDISTINCT_KEY_ATTRIBUTES.
    >>
    >
    > +1
    >
    > Did similar things to pg_dependencies.
    >
    >
    A few small comments.
    
    1. Minor typo fixes:
    In pg_dependencies_in: "pg_dependencies parssing claims..." could be
    corrected to "pg_dependencies parsing claims..."
    In ndistinct_array_end: "must be an non-empty array" might be better as
    "must be a non-empty array"
    In the comment for dependencies_object_field_start: "depeendency" appears
    to be a typo for "dependency"
    
    2.Code maintainability suggestion:
    I noticed the string "malformed pg_dependencies: "%s"" is used repeatedly
    throughout the code. Would you consider defining this as a macro? This
    could reduce duplication and make future updates easier.
    
    3.Memory management observation:
    Regarding item_attnum_list, while PostgreSQL's memory context mechanism
    handles cleanup, explicitly freeing the allocated memory after use might
    improve code clarity.
    
    These are all minor points - the implementation looks solid overall. Thank
    you for your work on this feature!
    
  64. Re: Extended Statistics set/restore/clear functions.

    jian he <jian.universality@gmail.com> — 2025-11-21T07:16:26Z

    hi.
    
    makeJsonLexContextCstringLen in v16-0001 and v16-0002
    should use GetDatabaseEncoding()?
    
    wondering, should we check the value of "degree" is between 0 to 1?
    
    
    + if ((natts < 1) || (natts > (STATS_MAX_DIMENSIONS - 1)))
    + {
    + errsave(parse->escontext,
    + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
    + errmsg("malformed pg_dependencies: \"%s\"", parse->str),
    + errdetail("The \"%s\" key must contain an array of at least %d "
    +  " and no than %d elements.",
    +  PG_DEPENDENCIES_KEY_ATTRIBUTES, 1, STATS_MAX_DIMENSIONS - 1));
    
    DETAIL:  The "attributes" key must contain an array of at least 1  and
    no than 7 elements.
    if you check the error message carefully, there is one more extra
    white space after "at least 1".
    
    
    typedef struct MVDependency
    {
        double        degree;            /* degree of validity (0-1) */
        AttrNumber    nattributes;    /* number of attributes */
        AttrNumber    attributes[FLEXIBLE_ARRAY_MEMBER];    /* attribute numbers */
    } MVDependency;
    
    SELECT '[{"attributes" : [2], "dependency" : 4, "degree":
    "NaN"}]'::pg_dependencies;
    SELECT '[{"attributes" : [2], "dependency" : 4, "degree":
    "-inf"}]'::pg_dependencies;
    SELECT '[{"attributes" : [2], "dependency" : 4, "degree":
    "inf"}]'::pg_dependencies;
    
    either error out or not, pg_dependencies.sql needs tests like the
    above to test these special values.
    json does not support special value like, "inf", "NaN", so we can
    SELECT '[{"attributes" : [2], "dependency" : 4, "degree":
    "-inf"}]'::pg_dependencies;
    but can not
    SELECT '[{"attributes" : [2], "dependency" : 4, "degree":
    "-inf"}]'::pg_dependencies::text::pg_dependencies;
    
    I found that numeric values behave the same with or without double quotes.
    SELECT '[{"ndistinct" : "1", "attributes" : ["2","3"]}]'::pg_ndistinct;
    SELECT '[{"ndistinct" : 1, "attributes" : [2,3]}]'::pg_ndistinct;
    SELECT '[{"attributes" : [1, 2], "dependency" : 4, "degree":
    "1"}]'::pg_dependencies;
    SELECT '[{"attributes" : [1, "2"], "dependency" : "4", "degree":
    "1"}]'::pg_dependencies;
    i guess the reason is in makeJsonLexContextCstringLen, we set
    need_escapes to true.
    
    
    I added more regress tests to improve coverage for
    src/test/regress/sql/pg_dependencies.sql
    
    
    SELECT '[{"attributes" : [-11,1], "dependency" : -1116, "degree":
    "1.2"}]'::pg_dependencies;
    DETAIL:  Invalid "attributes" element: -11.
    we already check the key "attributes", we can also check "dependency",
    IF the value of ""dependency"
    does not meet
    ``(if (parse->dependency == 0 || parse->dependency < (0-STATS_MAX_DIMENSIONS))``
    then error out.
    see the attached patch.
    
    
    --
    jian
    https://www.enterprisedb.com/
    
  65. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-21T22:41:15Z

    >
    >
    > A few small comments.
    >
    > 1. Minor typo fixes:
    >
    
    +1 Thanks.
    
    
    > 2.Code maintainability suggestion:
    > I noticed the string "malformed pg_dependencies: "%s"" is used repeatedly
    > throughout the code. Would you consider defining this as a macro? This
    > could reduce duplication and make future updates easier.
    >
    
    It seems to be the way of things. As I stated earlier, what I'd really like
    is the ability to form different errdetails and have them feed into one
    errsave(), and still have the strings go through translation.
    
    
    
    >
    > 3.Memory management observation:
    > Regarding item_attnum_list, while PostgreSQL's memory context mechanism
    > handles cleanup, explicitly freeing the allocated memory after use might
    > improve code clarity.
    >
    > These are all minor points - the implementation looks solid overall. Thank
    > you for your work on this feature!
    >
    
    +1
    
  66. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-22T08:26:19Z

    >
    >
    > wondering, should we check the value of "degree" is between 0 to 1?
    >
    
    I'm open to it, but there was significant pushback to having tight validity
    checks on other stats types.
    
    The more things that we make errors, the more chances we have to breaking a
    dump/restore. Granted, such values should never happen, but odd things do
    happen.
    
    
    >
    > IF the value of ""dependency"
    > does not meet
    > ``(if (parse->dependency == 0 || parse->dependency <
    > (0-STATS_MAX_DIMENSIONS))``
    > then error out.
    > see the attached patch.
    >
    
    Good catch.
    
    Incorporated these patches as well as Yuefei's feedback especially around
    memory management.
    
    I added a comment debating the feasibility of testing for subsets of
    attribute sets in pg_dependencies. Basically, I think we can't have the
    test at all, but I haven't removed it just yet pending consensus.
    
  67. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-11-23T06:57:31Z

    On Sat, Nov 22, 2025 at 03:26:19AM -0500, Corey Huinker wrote:
    > I'm open to it, but there was significant pushback to having tight validity
    > checks on other stats types.
    
    Yeah, I am not sure that there is much value in having strict checks
    for the values.  Some control over the arguments listed in a single
    object is much more valuable as these are easy to check in the input
    functions. 
    
    > The more things that we make errors, the more chances we have to breaking a
    > dump/restore. Granted, such values should never happen, but odd things do
    > happen.
    
    Indeed.  Like the existing import functions, these should not lead to
    hard failures, but WARNING-level reports.
    --
    Michael
    
  68. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-11-25T07:28:51Z

    On Sat, Nov 22, 2025 at 03:26:19AM -0500, Corey Huinker wrote:
    > I added a comment debating the feasibility of testing for subsets of
    > attribute sets in pg_dependencies. Basically, I think we can't have the
    > test at all, but I haven't removed it just yet pending consensus.
    
    +     * Verify that all attnum sets are a proper subset of the first longest
    +     * attnum set.
    +     *
    +     * TODO:
    +     *
    +     * I'm fairly certain that because statisticsally insignificant dependency
    +     * combinations are not stored, there is a chance that the longest dependency
    +     * does not exist, and therefore this test cannot be done. I have left the
    +     * test in place for the time being until the issue can be definitively
    +     * settled.
    
    As you have already quoted upthread, statext_dependencies_build()
    settles the issue on this one, I think.  It is entirely possible that
    any group returned by DependencyGenerator generates a degree value
    that would prevent a given group to be stored, and this could as well
    be the largest possible group there could be in the set.  So we cannot
    do any of that for dependencies, unfortunately.  We can always rely on
    the list of attributes when assigning the json blob to the stats
    object, at least, cross-checking that each attribute list matches with
    the numbers of the stats object.  At least we can check for
    duplicates, which is better than nothing at all.
    
    Regarding the suggested check where we'd want to enforce all the
    groups of attributes to be listed depending on the longest set we have
    found, at the end estimate_multivariate_ndistinct() checks the items
    listed one-by-one, giving up if we cannot find something in the list
    of items.  I think that I am going to be content with the patch as it
    is, without this piece.  Let's add an extra SQL test to treat that as
    valid input, though.  So I am feeling OK with the input for ndistinct
    at this stage.  I have noticed a couple of issues in passing,
    adjusting them.  We are reaching more than 90% of coverage with the
    tests, and I am not sure that we can actually reach the rest except if
    one of the previous steps failed.
    
    So That's one.  Now into the second patch for the input of the
    dependencies.
    
    +SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "NaN"}]'::pg_dependencies;
    +SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "-inf"}]'::pg_dependencies;
    +SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "inf"}]'::pg_dependencies;
    +SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "-inf"}]'::pg_dependencies::text::pg_dependencies;
    
    Okay, I have to admit that these ones are fun.  I doubt that anybody
    would actually do that, and these do not produce valid json objects,
    which is what the last case shows.  Hmm, it makes sense to keep these,
    and I'm still siding that we should not care too much about applying
    checks on the values and complicate the input function more than that,
    so fine by me.
    
    There were a couple of things in the tests, missing quite a few soft
    errors.  Many typos, grammar mistakes in the whole.  Also, please do
    not split the error strings into multiple lines to make these
    greppable.  There is also no need for a break after a return.  In some
    cases, a return was used where a break made more sense as the default
    path returned a failure.. 
    
    The TODO in build_mvdependencies() could be an elog(), but I have left
    it untouched for the errdetail().
    
    We're reaching 91% of coverage here, not bad.  The rest does not seem
    reachable, as far as I can see.
    
    With that said, a v18 for the first two patches with the input
    functions.  Comments and/or opinions?
    --
    Michael
    
  69. Re: Extended Statistics set/restore/clear functions.

    Chao Li <li.evan.chao@gmail.com> — 2025-11-25T08:59:28Z

    
    > On Nov 25, 2025, at 15:28, Michael Paquier <michael@paquier.xyz> wrote:
    > 
    > On Sat, Nov 22, 2025 at 03:26:19AM -0500, Corey Huinker wrote:
    >> I added a comment debating the feasibility of testing for subsets of
    >> attribute sets in pg_dependencies. Basically, I think we can't have the
    >> test at all, but I haven't removed it just yet pending consensus.
    > 
    > +     * Verify that all attnum sets are a proper subset of the first longest
    > +     * attnum set.
    > +     *
    > +     * TODO:
    > +     *
    > +     * I'm fairly certain that because statisticsally insignificant dependency
    > +     * combinations are not stored, there is a chance that the longest dependency
    > +     * does not exist, and therefore this test cannot be done. I have left the
    > +     * test in place for the time being until the issue can be definitively
    > +     * settled.
    > 
    > As you have already quoted upthread, statext_dependencies_build()
    > settles the issue on this one, I think.  It is entirely possible that
    > any group returned by DependencyGenerator generates a degree value
    > that would prevent a given group to be stored, and this could as well
    > be the largest possible group there could be in the set.  So we cannot
    > do any of that for dependencies, unfortunately.  We can always rely on
    > the list of attributes when assigning the json blob to the stats
    > object, at least, cross-checking that each attribute list matches with
    > the numbers of the stats object.  At least we can check for
    > duplicates, which is better than nothing at all.
    > 
    > Regarding the suggested check where we'd want to enforce all the
    > groups of attributes to be listed depending on the longest set we have
    > found, at the end estimate_multivariate_ndistinct() checks the items
    > listed one-by-one, giving up if we cannot find something in the list
    > of items.  I think that I am going to be content with the patch as it
    > is, without this piece.  Let's add an extra SQL test to treat that as
    > valid input, though.  So I am feeling OK with the input for ndistinct
    > at this stage.  I have noticed a couple of issues in passing,
    > adjusting them.  We are reaching more than 90% of coverage with the
    > tests, and I am not sure that we can actually reach the rest except if
    > one of the previous steps failed.
    > 
    > So That's one.  Now into the second patch for the input of the
    > dependencies.
    > 
    > +SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "NaN"}]'::pg_dependencies;
    > +SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "-inf"}]'::pg_dependencies;
    > +SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "inf"}]'::pg_dependencies;
    > +SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "-inf"}]'::pg_dependencies::text::pg_dependencies;
    > 
    > Okay, I have to admit that these ones are fun.  I doubt that anybody
    > would actually do that, and these do not produce valid json objects,
    > which is what the last case shows.  Hmm, it makes sense to keep these,
    > and I'm still siding that we should not care too much about applying
    > checks on the values and complicate the input function more than that,
    > so fine by me.
    > 
    > There were a couple of things in the tests, missing quite a few soft
    > errors.  Many typos, grammar mistakes in the whole.  Also, please do
    > not split the error strings into multiple lines to make these
    > greppable.  There is also no need for a break after a return.  In some
    > cases, a return was used where a break made more sense as the default
    > path returned a failure.. 
    > 
    > The TODO in build_mvdependencies() could be an elog(), but I have left
    > it untouched for the errdetail().
    > 
    > We're reaching 91% of coverage here, not bad.  The rest does not seem
    > reachable, as far as I can see.
    > 
    > With that said, a v18 for the first two patches with the input
    > functions.  Comments and/or opinions?
    > --
    > Michael
    > <v18-0001-Add-working-input-function-for-pg_ndistinct.patch><v18-0002-Add-working-input-function-for-pg_dependencies.patch>
    
    I don’t see any of my comments are addressed in v18.
    
    Best regards,
    --
    Chao Li (Evan)
    HighGo Software Co., Ltd.
    https://www.highgo.com/
    
    
    
    
    
    
    
    
  70. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-11-25T10:06:09Z

    On Tue, Nov 25, 2025 at 04:59:28PM +0800, Chao Li wrote:
    > I don’t see any of my comments are addressed in v18.
    
    Sorry about that, I thought that v17 was addressing these.  I have
    reviewed your latest email posted here:
    https://www.postgresql.org/message-id/2C8A20C7-4E08-41F4-B30E-AD9F8B007AE5@gmail.com
    
    And well, this presents a bunch of issues I have already fixed (break
    after return, one error message incorrect, some descriptions).  The
    comments on top of the functions are actually true as they stand (aka
    ndistinct_array_end() fails if we don't have an array).  There is no
    need to reset state vars upon a JSON_SEM_ACTION_FAILED AFAIK.
    Finally, the consts are non-appealing for me for functions that are
    internally used in these files.  And forgot one:
    valid_subsequent_attnum()'s duplication is actually adapted IMO, the
    argument lists could have different assumptions depending on the type
    of extended stats we are dealing with.  I did not check the comments
    for 0003~0005.  Perhaps you are right that these need adjustments.
    
    Thanks,
    --
    Michael
    
  71. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-11-26T02:06:10Z

    On Tue, Nov 25, 2025 at 04:28:51PM +0900, Michael Paquier wrote:
    > With that said, a v18 for the first two patches with the input
    > functions.  Comments and/or opinions?
    
    Okay.  I have gone other these two and after an extra round of
    polishing, mainly around comments (some suggested by Chao, actually),
    style, a fix for the test with valid UTF8 sequences failing in
    non-UTF8 databases, plus a few more tests that were missing, I have
    applied the two patches for the input functions.
    
    Perhaps these will need some adjustments, but I also see little point
    in moving on with these based on the coverage we have and how the code
    is shaped.  Could you rebase the rest of the patch set for the three
    remaining pieces, Corey?  You will also need to address some of the
    comments from Chao based on a review of v16, that were not addressed
    as of v17. 
    --
    Michael
    
  72. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-26T05:14:26Z

    >
    >
    > I don’t see any of my comments are addressed in v18.
    >
    >
    My apologies. My v17 focused entirely on the input functions, as those were
    receiving the vast majority of the attention. Now that those are out of the
    way (Thanks Michael!) I can address those issues.
    
  73. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-11-26T05:24:30Z

    On Tue, Nov 25, 2025 at 11:14:26PM -0600, Corey Huinker wrote:
    > My apologies. My v17 focused entirely on the input functions, as those were
    > receiving the vast majority of the attention. Now that those are out of the
    > way (Thanks Michael!) I can address those issues.
    
    No problem, thanks.  Please note that I am not sure if I will be able
    to look at anything else posted on this thread until next week, so
    please feel free to continue the discussion for the time being.
    --
    Michael
    
  74. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-26T05:32:49Z

    > Finally, the consts are non-appealing for me for functions that are
    > internally used in these files.
    
    
    I'm curious why you find them non-appealing for non-exposed functions, if
    you have time to elaborate.
    
    
    >   And forgot one:
    > valid_subsequent_attnum()'s duplication is actually adapted IMO, the
    > argument lists could have different assumptions depending on the type
    > of extended stats we are dealing with.
    
    
    Yeah, I saw that, reviewing the diff-of-diffs just now.
    
  75. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-26T05:34:56Z

    >
    >
    > No problem, thanks.  Please note that I am not sure if I will be able
    > to look at anything else posted on this thread until next week, so
    > please feel free to continue the discussion for the time being.
    >
    >
    It's holiday season here too. I'm going to get the rebase out and see if
    any lessons learned can be applied to the remaining patches, but my own
    availability may be limited.
    
  76. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-26T05:37:09Z

    >
    >
    > non-UTF8 databases, plus a few more tests that were missing, I have
    > applied the two patches for the input functions.
    >
    
    I've reviewed what you committed and it looks good to me aside from the one
    nitpick I already made.
    
    
    > Perhaps these will need some adjustments, but I also see little point
    > in moving on with these based on the coverage we have and how the code
    > is shaped.  Could you rebase the rest of the patch set for the three
    > remaining pieces, Corey?  You will also need to address some of the
    > comments from Chao based on a review of v16, that were not addressed
    > as of v17.
    >
    
    +1
    
  77. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-11-26T05:44:23Z

    On Tue, Nov 25, 2025 at 11:37:09PM -0600, Corey Huinker wrote:
    > I've reviewed what you committed and it looks good to me aside from the one
    > nitpick I already made.
    
    I was not really excited about these as it concerns only code local to
    each file where the input functions are located, no routines that are
    actually published for consumption by other parts of the backend.
    --
    Michael
    
  78. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-26T05:50:14Z

    On Tue, Nov 25, 2025 at 11:44 PM Michael Paquier <michael@paquier.xyz>
    wrote:
    
    > On Tue, Nov 25, 2025 at 11:37:09PM -0600, Corey Huinker wrote:
    > > I've reviewed what you committed and it looks good to me aside from the
    > one
    > > nitpick I already made.
    >
    > I was not really excited about these as it concerns only code local to
    > each file where the input functions are located, no routines that are
    > actually published for consumption by other parts of the backend.
    > --
    > Michael
    >
    
    So if a function were previously static to one page, but now had utility
    outside of that file, you would then consider changing the function
    signature to add the consts? I ask because this may become relevant with
    the statatt_* functions being moved in the remaining patches.
    
  79. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-11-26T06:16:54Z

    On Tue, Nov 25, 2025 at 11:50:14PM -0600, Corey Huinker wrote:
    > So if a function were previously static to one page, but now had utility
    > outside of that file, you would then consider changing the function
    > signature to add the consts? I ask because this may become relevant with
    > the statatt_* functions being moved in the remaining patches.
    
    That's the kind of analysis that requires a case-by-case lookup.  The
    input functions were not really that worth bothering to me based on
    their logic being very isolated.  The refactoring pieces with
    attribute stats that you are suggesting in 0003 may be indeed more
    relevant if done this way.
    --
    Michael
    
  80. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-11-26T09:21:46Z

    On Tue, Nov 25, 2025 at 11:14 PM Corey Huinker <corey.huinker@gmail.com>
    wrote:
    
    >
    >> I don’t see any of my comments are addressed in v18.
    >>
    >>
    > My apologies. My v17 focused entirely on the input functions, as those
    > were receiving the vast majority of the attention. Now that those are out
    > of the way (Thanks Michael!) I can address those issues.
    >
    
    Paraphrasing the "quotes" here for brevity...
    
    > Several functions are made external visible, they are all renamed with
    adding a prefix “statatt_”, why text_to_stavalues is an exception?
    
    Michael had specifically said that one didn't need to be renamed. I suppose
    statatt_import_stavalues() might be a good name for it. It *is* specific to
    attribute stats, though that definition also applies to the attribute stats
    nested in the stxdexprs of extended stats. I have no strong opinion on the
    matter.
    
    > This MVDependency * can be const.
    
    +1
    
    > static void
    > upsert_pg_statistic_ext_data(Datum *values, bool *nulls, bool *replaces)
    > {
    > ```
    
    > This function pass values, nulls and replaces to heap_modify_tuple() and
    heap_form_tuple(),
    > the both functions take all const pointers as parameters.
    > So, here values, nulls and replaces can all be const.
    I find your argument here persuasive enough to override Michael's
    previously stated non-excitement.
    
    > ...the two NULL_FRAC questions
    
    Yes, fixed.
    
    > import_mcvlist is declared twice, looks like a copy-paste mistake.
    
    That or a rebase/apply gone wrong. Fixed.
    
    > PREPQUERY_DUMPEXTSTATSSTATS weird, suggest PREPQUERY_DUMPEXTSTATSDATA
    
    Awkward names are almost inevitable when talking about the statistics
    associated with an object of type "statistics".
    
    There was a debate about whether statistics were data or not, and I'd
    rather not restart that, so I went with PREPQUERY_DUMPEXTSTATSOBJSTATS for
    now.
    
     Incorporated these fixes, and some other lessons learned.
    
  81. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-12-03T05:00:31Z

    On Wed, Nov 26, 2025 at 03:21:46AM -0600, Corey Huinker wrote:
    > There was a debate about whether statistics were data or not, and I'd
    > rather not restart that, so I went with PREPQUERY_DUMPEXTSTATSOBJSTATS for
    > now.
    > 
    >  Incorporated these fixes, and some other lessons learned.
    
    I've managed to miss the update you have sent here with the three
    remaining patches.  I am going to looking at the remaining pieces
    now as this patch set was still on my stack of patches to look at..
    --
    Michael
    
  82. Re:Re: Extended Statistics set/restore/clear functions.

    Yu Wang <wangyu_runtime@163.com> — 2025-12-04T00:52:16Z

    Hi Corey
    I was reviewing the recent patch v19-0003-Include-Extended-Statistics-in-pg_dump.patch and noticed a couple of small typo issues in the explanatory comments — nothing that affects the functionality.
    
    
    Here are the two minor fixes I’d suggest:
    1. “ndistintinct” should be “ndistinct”.
    2. “depdendencies” should be “dependencies”.
    
    
    Best regards,
    Yu
    
    
    
    
    
    
    
    在 2025-12-04 08:46:29,"Corey Huinker" <corey.huinker@gmail.com> 写道:
    
    
    
    
    
    On Tue, Nov 25, 2025 at 11:14 PM Corey Huinker <corey.huinker@gmail.com> wrote:
    
    
    I don’t see any of my comments are addressed in v18.
    
    
    
    
    My apologies. My v17 focused entirely on the input functions, as those were receiving the vast majority of the attention. Now that those are out of the way (Thanks Michael!) I can address those issues.
    
    
    Paraphrasing the "quotes" here for brevity...
    
    
    > Several functions are made external visible, they are all renamed with adding a prefix “statatt_”, why text_to_stavalues is an exception?
    
    
    Michael had specifically said that one didn't need to be renamed. I suppose statatt_import_stavalues() might be a good name for it. It *is* specific to attribute stats, though that definition also applies to the attribute stats nested in the stxdexprs of extended stats. I have no strong opinion on the matter.
    
    
    > This MVDependency * can be const.
    
    +1
    
    > static void
    > upsert_pg_statistic_ext_data(Datum *values, bool *nulls, bool *replaces)
    > {
    > ```
    
    > This function pass values, nulls and replaces to heap_modify_tuple() and heap_form_tuple(),
    > the both functions take all const pointers as parameters.
    > So, here values, nulls and replaces can all be const.
    I find your argument here persuasive enough to override Michael's previously stated non-excitement.
    
    
    > ...the two NULL_FRAC questions
    
    
    Yes, fixed.
    
    > import_mcvlist is declared twice, looks like a copy-paste mistake.
    
    
    That or a rebase/apply gone wrong. Fixed.
    
    
    > PREPQUERY_DUMPEXTSTATSSTATS weird, suggest PREPQUERY_DUMPEXTSTATSDATA
    
    
    Awkward names are almost inevitable when talking about the statistics associated with an object of type "statistics".
    
    
    There was a debate about whether statistics were data or not, and I'd rather not restart that, so I went with PREPQUERY_DUMPEXTSTATSOBJSTATS for now.
    
    
     Incorporated these fixes, and some other lessons learned.
    
    
  83. Re: Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-12-04T07:28:00Z

    On Thu, Dec 04, 2025 at 08:52:16AM +0800, WangYu wrote:
    > I was reviewing the recent patch
    > v19-0003-Include-Extended-Statistics-in-pg_dump.patch and noticed a
    > couple of small typo issues in the explanatory comments — nothing
    > that affects the functionality.
    
    Please be careful that the mailing lists of Postgres use
    bottom-posting for thread discussions.  Please see here for details:
    https://en.wikipedia.org/wiki/Posting_style#Bottom-posting
    
    > Here are the two minor fixes I’d suggest:
    > 1. “ndistintinct” should be “ndistinct”.
    > 2. “depdendencies” should be “dependencies”.
    
    Indeed.  These exact typos existed in some of the previous iterations
    of the patch posted on this thread.
    
    I have begun looking at 0001 and 0002 more seriously, and can now
    share a couple of comments about them. 
    
    My biggest issue with 0001 is the severe lack of documentation for all
    the routines that are now in attribute_stats.c which are published so
    as they can be used for the import functions of extended stats:
    
    1) statatt_get_type(), that retrieves a set of information from a
    relation and one of its attributes.  This is lightly documented in
    extended_statistics_update() from 0002, with an argument claiming that
    we do the same as attribute statistics.  If one then looks at
    attribute_stats.c, we have a not-really-helpful "derive information
    from attribute".  The trick is that we have an implied dependency
    here: statatt_get_type() needs to be called *before*
    statatt_get_elem_type() for attribute *and* extended stats,
    statatt_get_elem_type() being fed data from statatt_get_type().
    
    2) statatt_get_elem_type(), interesting piece of the puzzle that acts
    as a wrapper for the type cache.  An important part here, it seems to
    me, would be to at least tell that this relies on data retrieved from
    statatt_get_type(), initially.  Once I've noticed this dependency the
    logic felt a bit cleaner to understand, but we have no docs explaining
    any of that..
    
    3) statatt_init_empty_tuple(), that initializes a set of data to be
    used for updates and/or insertion.  My first thought here is about 
    "inherited", which turns around an assumption in
    attribute_statistics_update().  How should callers assign its value?
    For attribute_statistics_update() it comes as an input argument.
    Also why are these values initialized as they are and why should these
    make sense when applied to pg_statistic.  Extended stats set the
    "inherited" argument to false in import_expressions().  That seems
    equally important to document, both in a comment in extended_stats.c
    and at the top of the function.  Of course it means that this refers
    to the fact that the stats include values from stats tables, which do
    not apply to extended stats.  But one would have to guess this fact
    after knowing that this relates to the catalog pg_statistic..
    
    4) statatt_set_slot is slightly simpler.  Still here, it is really easy
    to miss that this routine should be fed data retrieved by
    statatt_get_type().  staop can also be an eq or an lt operator.
    stakind is one of the STATISTIC_KIND_* values, etc, etc.
    
    5) text_to_stavalues() has a bit more explanation, with some data fed
    from statatt_get_type() for the atttypmod and atttypid.  We have also
    some fix of input coming from statatt_get_elem_type(), for elemtypid.
    
    As far as I can see, there is nothing in these functions that require
    them to be located in attribute_stats.c anymore.  Let's move that to
    stats_utils.c, common place for all the shared facilities used by
    these stats modules.  Or it could also be a new, separate patch, for
    the manipulation and extraction of the tuple data related to
    the pg_statistic tuples.
    
    I am not OK with these being in attribute_stats.c, and as far as I can
    see they have no contents that force these routines to 
    
    Perhaps you know what these imply because you have written this code
    originally in ce207d2a7901, but exposing them also means that it is
    very important to document at least some concepts and assumptions
    around them so as it is possible to guide somebody that may want to
    use this code:
    - What are the input arguments from?
    - And what are the results?
    - In which circumstances should these be used?
    
    Then about 0002.
    
    There are a bunch of assumptions embedded inside import_expressions()
    that are interdependent with the calls of these attribute routines
    which are a bit hard to evaluate by themselves, at least it's not
    clear why some of these operations are done and why they are done the
    way they are presented in the patch.  The code processes an array of
    expressions and processes them in a sequential fashion, repeating
    checks based on extended_stats_exprs_element.  Each step is going to
    need more explanation to connect the dots.  For example, why the first
    checks on exprs_nulls are important, with a second step of checks for
    each STATISTIC_KIND_*.  Too much is let to the reader to guess, making
    the code complicated to maintain as written, at least IMHO.
    
    Is there any need to locate these new functions and code in
    extended_stats.c, actually?  A separation into a new file seems like
    it would be a cleaner result, leaving extended_stats.c to deal with
    the import of the new data, including the fetch and deletion of any
    existing data in pg_stat_ext that would be updated or inserted.
    Perhaps name that extended_stats_funcs.c, as these are about the
    direct SQL functions, import and deletion.
    
    For the tests, it looks like it would be better to have everything in
    a new file, like a stats_ext_import.sql for the clear and restore
    bits.
    
    A lot of the tests are copy-pastes of surrounding queries.  Perhaps it
    would be better to use a SQL function wrapper or an IN clause with
    multiple relations?  The patch has a lot of bloat with these repeated
    queries..  Perhaps we should use a split for the elements in the
    extended stats array instead of jsonb_pretty().  The results get quite
    long here.
    
    As of 0002, there are actually two independent pieces dealt with: the
    deletion of extended stats with pg_clear_extended_stats(), and the
    insert/update of extended stats with pg_restore_extended_stats().  I'd
    suggest to split that into two parts, with the clear being first in
    rank.  The deletion is simpler, and getting that in first simplifies
    the review of the import part with less input to deal with.
    --
    Michael
    
  84. Re: Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-12-04T23:12:16Z

    >
    >
    > Indeed.  These exact typos existed in some of the previous iterations
    > of the patch posted on this thread.
    >
    
    +1
    
    
    > 1) statatt_get_type(), that retrieves a set of information from a
    > relation and one of its attributes.  This is lightly documented in
    > extended_statistics_update() from 0002, with an argument claiming that
    > we do the same as attribute statistics.  If one then looks at
    > attribute_stats.c, we have a not-really-helpful "derive information
    > from attribute".  The trick is that we have an implied dependency
    > here: statatt_get_type() needs to be called *before*
    > statatt_get_elem_type() for attribute *and* extended stats,
    > statatt_get_elem_type() being fed data from statatt_get_type().
    >
    
    I've added comments to that effect.
    
    
    >
    > 2) statatt_get_elem_type(), interesting piece of the puzzle that acts
    > as a wrapper for the type cache.  An important part here, it seems to
    > me, would be to at least tell that this relies on data retrieved from
    > statatt_get_type(), initially.  Once I've noticed this dependency the
    > logic felt a bit cleaner to understand, but we have no docs explaining
    > any of that..
    >
    
    I've added comments that will hopefully connect those dots.
    
    
    >
    > 3) statatt_init_empty_tuple(), that initializes a set of data to be
    > used for updates and/or insertion.  My first thought here is about
    > ...
    > not apply to extended stats.  But one would have to guess this fact
    > after knowing that this relates to the catalog pg_statistic..
    >
    
    The comments didn't seem quite so illuminating, but I added them just the
    same.
    
    
    >
    > 4) statatt_set_slot is slightly simpler.  Still here, it is really easy
    > to miss that this routine should be fed data retrieved by
    > statatt_get_type().  staop can also be an eq or an lt operator.
    > stakind is one of the STATISTIC_KIND_* values, etc, etc.
    >
    
    Some here too.
    
    
    > As far as I can see, there is nothing in these functions that require
    > them to be located in attribute_stats.c anymore.  Let's move that to
    > stats_utils.c, common place for all the shared facilities used by
    > these stats modules.  Or it could also be a new, separate patch, for
    > the manipulation and extraction of the tuple data related to
    > the pg_statistic tuples.
    >
    
    They've been moved to stats_utils.c, and impact was fairly minimal. I was
    worried about an explosion of #includes, but that ended up not being the
    case.
    
    
    > Perhaps you know what these imply because you have written this code
    > originally in ce207d2a7901, but exposing them also means that it is
    > very important to document at least some concepts and assumptions
    > around them so as it is possible to guide somebody that may want to
    > use this code:
    > - What are the input arguments from?
    > - And what are the results?
    > - In which circumstances should these be used?
    >
    
    An attempt was made.
    
    
    > Then about 0002.
    >
    > There are a bunch of assumptions embedded inside import_expressions()
    > that are interdependent with the calls of these attribute routines
    > ...
    > each STATISTIC_KIND_*.  Too much is let to the reader to guess, making
    > the code complicated to maintain as written, at least IMHO.
    >
    
    I tried to expand on these things, but I haven't yet moved them to another
    file for reasons I'll explain later.
    
    
    > Is there any need to locate these new functions and code in
    > extended_stats.c, actually?  A separation into a new file seems like
    > it would be a cleaner result, leaving extended_stats.c to deal with
    > the import of the new data, including the fetch and deletion of any
    > existing data in pg_stat_ext that would be updated or inserted.
    > Perhaps name that extended_stats_funcs.c, as these are about the
    > direct SQL functions, import and deletion.
    >
    
    We can move them to another file, but I'm not sure if that will actually
    make things clearer. What is extended_stats.c if not functions about
    extended stats? Perhaps it makes more sense to try harder to find the
    commonality in the building of the pg_statistic tuples, move those
    functions to stats_utils, and get things decluttered that way?
    
    For the tests, it looks like it would be better to have everything in
    > a new file, like a stats_ext_import.sql for the clear and restore
    > bits.
    >
    
    Those tests very much build upon the tables and data already created for
    regular relation/attribute imports, so we save quite a bit of boilerplate
    in keeping them there.
    
    
    >
    > A lot of the tests are copy-pastes of surrounding queries.  Perhaps it
    > would be better to use a SQL function wrapper or an IN clause with
    > multiple relations?
    
    
    I'd say "no", because we're doing a lot of things that *could* generate
    ERRORs, and we're verifying that they generate WARNINGs instead. If we did
    batch up those commands, and any one of them generated an error, we'd be at
    loss for which one was the culprit, and we'd also lose any insight into
    what else got busted in the other rows of the batch.
    
    
    >   The patch has a lot of bloat with these repeated
    > queries..  Perhaps we should use a split for the elements in the
    > extended stats array instead of jsonb_pretty().  The results get quite
    > long here.
    >
    
    I replaced '}, ' with '},\n' and removed the jsonb_pretty entirely, which
    is something I had proposed doing earlier, as it's actually more pretty
    than jsonb_pretty(). It's still in the same file, though.
    
    I like the net result of this, and it might make sense to search other
    regression tests for places where we can employ this mini-pretty pattern.
    
    
    > As of 0002, there are actually two independent pieces dealt with: the
    > deletion of extended stats with pg_clear_extended_stats(), and the
    > insert/update of extended stats with pg_restore_extended_stats().  I'd
    > suggest to split that into two parts, with the clear being first in
    > rank.  The deletion is simpler, and getting that in first simplifies
    > the review of the import part with less input to deal with.
    >
    
    The pg_clear function is quite trivial, as it's basically just marshalling
    and locking before a call to delete_pg_statistic_ext_data(). So while I can
    split them out, I'd also have to split the pg_proc.dat changes into two
    separate ones, which makes it hard to see how similar they are. I have a
    similar feeling about the changes to func-admin.sgml.
    
    All of that, plus a rebase.
    
  85. Re: Extended Statistics set/restore/clear functions.

    Tom Lane <tgl@sss.pgh.pa.us> — 2025-12-05T00:37:32Z

    Michael Paquier <michael@paquier.xyz> writes:
    > Okay.  I have gone other these two and after an extra round of
    > polishing, mainly around comments (some suggested by Chao, actually),
    > style, a fix for the test with valid UTF8 sequences failing in
    > non-UTF8 databases, plus a few more tests that were missing, I have
    > applied the two patches for the input functions.
    
    I notice that since e1405aa5e went in, a number of older buildfarm
    animals are issuing warnings about it:
    
    In file included from pg_dependencies.c:21:0:
    pg_dependencies.c: In function "dependencies_scalar":
    ../../../../src/include/nodes/miscnodes.h:54:15: warning: the comparison will always evaluate as "true" for the address of "escontext" will never be NULL [-Waddress]
      ((escontext) != NULL && IsA(escontext, ErrorSaveContext) && \\
                   ^
    pg_dependencies.c:497:8: note: in expansion of macro "SOFT_ERROR_OCCURRED"
        if (SOFT_ERROR_OCCURRED(&escontext))
            ^
    ../../../../src/include/nodes/miscnodes.h:54:15: warning: the comparison will always evaluate as "true" for the address of "escontext" will never be NULL [-Waddress]
      ((escontext) != NULL && IsA(escontext, ErrorSaveContext) && \\
                   ^
    pg_dependencies.c:542:8: note: in expansion of macro "SOFT_ERROR_OCCURRED"
        if (SOFT_ERROR_OCCURRED(&escontext))
            ^
    ../../../../src/include/nodes/miscnodes.h:54:15: warning: the comparison will always evaluate as "true" for the address of "escontext" will never be NULL [-Waddress]
      ((escontext) != NULL && IsA(escontext, ErrorSaveContext) && \\
                   ^
    pg_dependencies.c:572:8: note: in expansion of macro "SOFT_ERROR_OCCURRED"
        if (SOFT_ERROR_OCCURRED(&escontext))
            ^
    In file included from pg_ndistinct.c:21:0:
    pg_ndistinct.c: In function "ndistinct_scalar":
    ../../../../src/include/nodes/miscnodes.h:54:15: warning: the comparison will always evaluate as "true" for the address of "escontext" will never be NULL [-Waddress]
      ((escontext) != NULL && IsA(escontext, ErrorSaveContext) && \\
                   ^
    pg_ndistinct.c:438:8: note: in expansion of macro "SOFT_ERROR_OCCURRED"
        if (SOFT_ERROR_OCCURRED(&escontext))
            ^
    ../../../../src/include/nodes/miscnodes.h:54:15: warning: the comparison will always evaluate as "true" for the address of "escontext" will never be NULL [-Waddress]
      ((escontext) != NULL && IsA(escontext, ErrorSaveContext) && \\
                   ^
    pg_ndistinct.c:488:9: note: in expansion of macro "SOFT_ERROR_OCCURRED"
        if (!SOFT_ERROR_OCCURRED(&escontext))
             ^
    
    The above is from rhinoceros, but about half a dozen other animals
    are reporting the same.  They are all running very old gcc versions,
    mostly from RHEL7 or CentOS 7.
    
    This of course arises because the source code is passing the address
    of a local ErrorSaveContext variable to that macro.  We don't have
    any other instances of that coding pattern AFAICS, so I wonder if
    that was really the most adapted way to do it.  Looking at the
    code, it seems to be turning around and stuffing a different error
    message into a passed-in ErrorSaveContext, which feels a little
    weird.
    
    It may be that there's not a better way, in which case I'll probably
    just teach my buildfarm-warning-scraper script to ignore these
    warnings, as it's already doing for some other warnings from these
    same animals (-Wmissing-braces mostly).  But I thought I'd raise a
    question about it.
    
    Also, I don't think these errdetail messages meet our style
    guidelines:
    
                            errdetail("Invalid \"%s\" value.", PG_DEPENDENCIES_KEY_ATTRIBUTES));
    
    They're supposed to be complete sentences.
    
    			regards, tom lane
    
    
    
    
  86. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-12-05T01:19:27Z

    On Thu, Dec 04, 2025 at 07:37:32PM -0500, Tom Lane wrote:
    > This of course arises because the source code is passing the address
    > of a local ErrorSaveContext variable to that macro.  We don't have
    > any other instances of that coding pattern AFAICS, so I wonder if
    > that was really the most adapted way to do it.  Looking at the
    > code, it seems to be turning around and stuffing a different error
    > message into a passed-in ErrorSaveContext, which feels a little
    > weird.
    
    Thanks for the report.  Right.  There are three instances of that in
    pg_dependencies.c, two in pg_ndistinct.c.  I would not mind doing the
    attached to calm down these warnings, matching with the other areas of
    the code, that simply removes the macro and checks the state value
    directly.
    
    > Also, I don't think these errdetail messages meet our style
    > guidelines:
    > 
    >    errdetail("Invalid \"%s\" value.", PG_DEPENDENCIES_KEY_ATTRIBUTES));
    > 
    > They're supposed to be complete sentences.
    
    Yeah, I should have been more careful here.  I have spent more time on
    all of these, and adjusted all the errdetails for pg_ndistinct.c and
    pg_dependencies.c to use full sentences.
    
    I am just grouping everything with the attached.  They'd be better
    if handled separately, obviously.  Comments?
    --
    Michael
    
  87. Re: Extended Statistics set/restore/clear functions.

    Tom Lane <tgl@sss.pgh.pa.us> — 2025-12-05T02:29:50Z

    Michael Paquier <michael@paquier.xyz> writes:
    > On Thu, Dec 04, 2025 at 07:37:32PM -0500, Tom Lane wrote:
    >> This of course arises because the source code is passing the address
    >> of a local ErrorSaveContext variable to that macro.  We don't have
    >> any other instances of that coding pattern AFAICS, so I wonder if
    >> that was really the most adapted way to do it.
    
    > Thanks for the report.  Right.  There are three instances of that in
    > pg_dependencies.c, two in pg_ndistinct.c.  I would not mind doing the
    > attached to calm down these warnings, matching with the other areas of
    > the code, that simply removes the macro and checks the state value
    > directly.
    
    Ah, right, now that you mention it, 56b1e88c8 did this same fix
    before.  This way is fine with me.
    
    I didn't review the error message changes closely.  I do wonder if
    some of them are can't-happen cases that could just use an elog()
    instead of a translatable message.
    
    			regards, tom lane
    
    
    
    
  88. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-12-05T03:39:05Z

    On Thu, Dec 04, 2025 at 09:29:50PM -0500, Tom Lane wrote:
    > Ah, right, now that you mention it, 56b1e88c8 did this same fix
    > before.  This way is fine with me.
    
    Oh, I haven't noticed this one.  Pushed the bits to take care of the
    compiler warnings, for now.
    
    > I didn't review the error message changes closely.  I do wonder if
    > some of them are can't-happen cases that could just use an elog()
    > instead of a translatable message.
    
    Some of them are, yes.  However, they are also worded with the same
    format as some of the legit cases.  So they don't add any extra
    workload on the translation side as far as I recall, and I've been
    fond of the errdetail part to get a consistent style across the board.
    I'll double-check the whole a bit later, attached is the rest of them.
    
    Corey, any comments about these?
    --
    Michael
    
  89. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-12-05T05:30:38Z

    >
    >
    > Some of them are, yes.  However, they are also worded with the same
    > format as some of the legit cases.  So they don't add any extra
    > workload on the translation side as far as I recall, and I've been
    > fond of the errdetail part to get a consistent style across the board.
    > I'll double-check the whole a bit later, attached is the rest of them.
    >
    > Corey, any comments about these?
    >
    
    The wordings are fine, and I'm sorry I didn't word them as complete
    sentences from the get-go.
    
    Attached is a follow-on to Michael's most recent uncommitted patch,
    changing the errors that I see as "impossible" to elogs. However, I agree
    that they don't add significant workload to the translations, and most
    input functions need to avoid any hard error returns lest they be called in
    a soft-error context.
    
  90. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-12-08T01:25:45Z

    On Fri, Dec 05, 2025 at 12:30:38AM -0500, Corey Huinker wrote:
    > Attached is a follow-on to Michael's most recent uncommitted patch,
    > changing the errors that I see as "impossible" to elogs. However, I agree
    > that they don't add significant workload to the translations, and most
    > input functions need to avoid any hard error returns lest they be called in
    > a soft-error context.
    
    Reporting the state of the parser in these new elogs is OK for me, but
    we had more cases that what your patch has been updating.  I have
    included all these cases, making the set of error messages a bit more
    consistent across the board, and applied the result.
    --
    Michael
    
  91. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-12-11T21:52:26Z

    On Sun, Dec 7, 2025 at 8:26 PM Michael Paquier <michael@paquier.xyz> wrote:
    
    > On Fri, Dec 05, 2025 at 12:30:38AM -0500, Corey Huinker wrote:
    > > Attached is a follow-on to Michael's most recent uncommitted patch,
    > > changing the errors that I see as "impossible" to elogs. However, I agree
    > > that they don't add significant workload to the translations, and most
    > > input functions need to avoid any hard error returns lest they be called
    > in
    > > a soft-error context.
    >
    > Reporting the state of the parser in these new elogs is OK for me, but
    > we had more cases that what your patch has been updating.  I have
    > included all these cases, making the set of error messages a bit more
    > consistent across the board, and applied the result.
    > --
    > Michael
    >
    
    Rebased. Used new palloc0_array() in a few places, and fixed an expected
    output message that wasn't updated to reflect the changes.
    
  92. Re: Extended Statistics set/restore/clear functions.

    Corey Huinker <corey.huinker@gmail.com> — 2025-12-11T21:55:52Z

    >
    > Reporting the state of the parser in these new elogs is OK for me, but
    >> we had more cases that what your patch has been updating.  I have
    >> included all these cases, making the set of error messages a bit more
    >> consistent across the board, and applied the result.
    >> --
    >> Michael
    >>
    >
    > Rebased. Used new palloc0_array() in a few places, and fixed an expected
    > output message that wasn't updated to reflect the changes.
    >
    
    Disregard v21 (uncommitted changes). Everything I said about 21 applies to
    22.
    
  93. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-12-25T06:23:58Z

    On Thu, Dec 11, 2025 at 04:55:52PM -0500, Corey Huinker wrote:
    > Disregard v21 (uncommitted changes). Everything I said about 21 applies to
    > 22.
    
    A couple of weeks later, I have locked a couple of hours to look at
    v22-0001.  And after a couple of rounds of self-fixing and adjustments
    in the documentation added, the result looks OK so I have applied it.
    Attached are the remaining pieces, labelled v23.
    --
    Michael
    
  94. Re: Extended Statistics set/restore/clear functions.

    Tender Wang <tndrwang@gmail.com> — 2025-12-25T11:00:02Z

    Tender Wang <tndrwang@gmail.com> 于2025年12月25日周四 18:58写道:
    
    > Hi Michael,
    >
    > I found a typo in recent commit :
    > commit 213a1b89527049cb27bbcd6871fdb0c0916b43e1 (origin/master,
    > origin/HEAD, master)
    > Author: Michael Paquier <michael@paquier.xyz>
    > Date:   Thu Dec 25 15:13:39 2025 +0900
    >  Move attribute statistics functions to stat_utils.c
    > ....
    >  * The stacoll value should be either the atttypcoll derived from
    >  * statatt_get_type(), or a harcoded value required by that particular
    >  * stakind.
    > ....
    >
    > "harcoded" should be "hardcoded".
    >
    > Please check the attached patch.
    >
    > --
    > Thanks,
    > Tender Wang
    >
    
    Sorry, I forgot to cc the pgsql-hackers mail in my last email.
    -- 
    Thanks,
    Tender Wang
    
  95. Re: Extended Statistics set/restore/clear functions.

    Michael Paquier <michael@paquier.xyz> — 2025-12-25T22:47:54Z

    On Thu, Dec 25, 2025 at 07:00:02PM +0800, Tender Wang wrote:
    > Sorry, I forgot to cc the pgsql-hackers mail in my last email.
    
    Missed this one from v22-0001.  Thanks.
    --
    Michael