Thread
Commits
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
Add test doing some cloning of extended statistics data
- fc365e4fccc4 19 (unreleased) landed
-
Add test for pg_restore_extended_stats() with multiranges
- 0b7beec42ae2 19 (unreleased) landed
-
Add support for "mcv" in pg_restore_extended_stats()
- efbebb4e8587 19 (unreleased) landed
-
Include extended statistics data in pg_dump
- c32fb29e979d 19 (unreleased) landed
-
Add support for "dependencies" in pg_restore_extended_stats()
- 302879bd68d1 19 (unreleased) landed
-
Add test for MAINTAIN permission with pg_restore_extended_stats()
- d9abd9e1050d 19 (unreleased) landed
-
Add pg_restore_extended_stats()
- 0e80f3f88dea 19 (unreleased) landed
-
Add routine to free MCVList
- 7ebb64c55757 19 (unreleased) landed
-
Improve pg_clear_extended_stats() with incorrect relation/stats combination
- 395b73c045e0 19 (unreleased) landed
-
Add pg_clear_extended_stats()
- d756fa1019ff 19 (unreleased) landed
-
Introduce routines to validate and free MVNDistinct and MVDependencies
- 32e27bd32082 19 (unreleased) landed
-
Fix typo in stat_utils.c
- eee19a30d60d 19 (unreleased) landed
-
Move attribute statistics functions to stat_utils.c
- 213a1b895270 19 (unreleased) landed
-
Improve error messages of input functions for pg_dependencies and pg_ndistinct
- f68597ee777d 19 (unreleased) landed
-
Improve test output of extended statistics for ndistinct and dependencies
- 2f04110225ab 19 (unreleased) landed
-
Fix some compiler warnings
- 7bc88c3d6f3a 19 (unreleased) landed
-
Add input function for data type pg_dependencies
- e1405aa5e3ac 19 (unreleased) landed
-
Add input function for data type pg_ndistinct
- 44eba8f06e55 19 (unreleased) landed
-
Rework output format of pg_dependencies
- e76defbcf09e 19 (unreleased) landed
-
Rework output format of pg_ndistinct
- 1f927cce4498 19 (unreleased) landed
-
Fix comments of output routines for pg_ndistinct and pg_dependencies
- 040a39ed25bf 19 (unreleased) landed
-
Move code specific to pg_dependencies to new file
- 2ddc8d9e9baa 19 (unreleased) landed
-
Move code specific to pg_ndistinct to new file
- a5523123430f 19 (unreleased) landed
-
Document some structures in attribute_stats.c
- d6c132d83bff 19 (unreleased) landed
-
Fix FATAL message for invalid recovery timeline at beginning of recovery
- 71f17823ba01 18.0 cited
-
Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-01-22T18:17:27Z
This is a separate thread for work started in [1] but focused purely on getting the following functions working: * pg_set_extended_stats * pg_clear_extended_stats * pg_restore_extended_stats These functions are analogous to their relation/attribute counterparts, use the same calling conventions, and build upon the same basic infrastructure. I think it is important that we get these implemented because they close the gap that was left in terms of the ability to modify existing statistics and to round out the work being done to carry over statistics via dump/restore and pg_upgrade i [1]. The purpose of each patch is as follows (adapted from previous thread): 0001 - This makes the input function for pg_ndistinct functional. 0002 - This makes the input function for pg_dependencies functional. 0003 - Makes several static functions in attribute_stats.c public for use by extended stats. One of those is get_stat_attr_type(), which in the last patchset was modified to take an attribute name rather than attnum, thus saving a syscache lookup. However, extended stats identifies attributes by attnum not name, so that optimization had to be set aside, at least temporarily. 0004 - These implement the functions pg_set_extended_stats(), pg_clear_extended_stats(), and pg_restore_extended_stats() and behave like their relation/attribute equivalents. If we can get these committed and used by pg_dump, then we don't have to debate how to handle post-upgrade steps for users who happen to have extended stats vs the approximately 99.75% of users who do not have extended stats. This patchset does not presently include any work to integrate these functions into pg_dump, but may do so once that work is settled, or it may become its own thread. [1] https://www.postgresql.org/message-id/CADkLM=c6NHdXU+d+m1yASZ4NT_qye1LCnaU2Vgf8YJ80jJT-Qg@mail.gmail.com
-
Re: Extended Statistics set/restore/clear functions.
Tomas Vondra <tomas@vondra.me> — 2025-01-22T22:50:57Z
Hi, Thanks for continuing to work on this. On 1/22/25 19:17, Corey Huinker wrote: > This is a separate thread for work started in [1] but focused purely on > getting the following functions working: > > * pg_set_extended_stats > * pg_clear_extended_stats > * pg_restore_extended_stats > > These functions are analogous to their relation/attribute counterparts, > use the same calling conventions, and build upon the same basic > infrastructure. > > I think it is important that we get these implemented because they close > the gap that was left in terms of the ability to modify existing > statistics and to round out the work being done to carry over statistics > via dump/restore and pg_upgrade i [1]. > > The purpose of each patch is as follows (adapted from previous thread): > > 0001 - This makes the input function for pg_ndistinct functional. > > 0002 - This makes the input function for pg_dependencies functional. > I only quickly skimmed the patches, but a couple comments: 1) I think it makes perfect sense to use the JSON parsing for the input functions, but maybe it'd be better to adjust the format a bit to make that even easier? Right now the JSON "keys" have structure, which means we need some ad hoc parsing. Maybe we should make it "proper JSON" by moving that into separate key/value, e.g. for ndistinct we might replace this: {"1, 2": 2323, "1, 3" : 3232, ...} with this: [ {"keys": [1, 2], "ndistinct" : 2323}, {"keys": [1, 3], "ndistinct" : 3232}, ... ] so a regular JSON array of objects, with keys an "array". And similarly for dependencies. Yes, it's more verbose, but maybe better for "mechanical" processing? 2) Do we need some sort of validation? Perhaps this was discussed in the other thread and I missed that, but isn't it a problem that happily accept e.g. this? {"6666, 6666" : 1, "1, -222": 14, ...} That has duplicate keys with bogus attribute numbers, stats on (bogus) system attributes, etc. I suspect this may easily cause problems during planning (if it happens to visit those statistics). Maybe that's acceptable - ultimately the user could import something broken in a much subtler way, of course. But the pg_set_attribute_stats seems somewhat more protected against this, because it gets the attr as a separate argument. I recall I wished to have the attnum in the output function, but that was not quite possible because we don't know the relid (and thus the descriptor) in that function. Is it a good idea to rely on the input/output format directly? How will that deal with cross-version differences? Doesn't it mean the in/out format is effectively fixed, or at least has to be backwards compatible (i.e. new version has to handle any value from older versions)? Or what if I want to import the stats for a table with slightly different structure (e.g. because dump/restore skips dropped columns). Won't that be a problem with the format containing raw attnums? Or is this a use case we don't expect to work? For the per-attribute stats it's probably fine, because that's mostly just a collection of regular data types (scalar values or arrays of values, ...) and we're not modifying them except for maybe adding new fields. But extended stats seem more complex, so maybe it's different? I remember a long discussion about the format at the very beginning of this patch series, and the conclusion clearly was to have a function that import stats for one attribute at a time. And that seems to be working fine, but extended stats values have more internal structure, so perhaps they need to do something more complicated. > 0003 - Makes several static functions in attribute_stats.c public for use > by extended stats. One of those is get_stat_attr_type(), which in the last > patchset was modified to take an attribute name rather than attnum, thus > saving a syscache lookup. However, extended stats identifies attributes by > attnum not name, so that optimization had to be set aside, at least > temporarily. > > 0004 - These implement the functions pg_set_extended_stats(), > pg_clear_extended_stats(), and pg_restore_extended_stats() and behave like > their relation/attribute equivalents. If we can get these committed and > used by pg_dump, then we don't have to debate how to handle post-upgrade > steps for users who happen to have extended stats vs the approximately > 99.75% of users who do not have extended stats. > I see there's a couple MCV-specific functions in the extended_stats.c. Shouldn't those go into mvc.c instead? FWIW there's a bunch of whitespace issues during git apply. > This patchset does not presently include any work to integrate these > functions into pg_dump, but may do so once that work is settled, or it > may become its own thread. > OK. Thanks for the patch! regards -- Tomas Vondra -
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-01-23T14:51:08Z
On Wed, Jan 22, 2025 at 5:50 PM Tomas Vondra <tomas@vondra.me> wrote: > Hi, > > Thanks for continuing to work on this. > > On 1/22/25 19:17, Corey Huinker wrote: > > This is a separate thread for work started in [1] but focused purely on > > getting the following functions working: > > > > * pg_set_extended_stats > > * pg_clear_extended_stats > > * pg_restore_extended_stats > > > > These functions are analogous to their relation/attribute counterparts, > > use the same calling conventions, and build upon the same basic > > infrastructure. > > > > I think it is important that we get these implemented because they close > > the gap that was left in terms of the ability to modify existing > > statistics and to round out the work being done to carry over statistics > > via dump/restore and pg_upgrade i [1]. > > > > The purpose of each patch is as follows (adapted from previous thread): > > > > 0001 - This makes the input function for pg_ndistinct functional. > > > > 0002 - This makes the input function for pg_dependencies functional. > > > > I only quickly skimmed the patches, but a couple comments: > > > 1) I think it makes perfect sense to use the JSON parsing for the input > functions, but maybe it'd be better to adjust the format a bit to make > that even easier? > > Right now the JSON "keys" have structure, which means we need some ad > hoc parsing. Maybe we should make it "proper JSON" by moving that into > separate key/value, e.g. for ndistinct we might replace this: > > {"1, 2": 2323, "1, 3" : 3232, ...} > > with this: > > [ {"keys": [1, 2], "ndistinct" : 2323}, > {"keys": [1, 3], "ndistinct" : 3232}, > ... ] > > so a regular JSON array of objects, with keys an "array". And similarly > for dependencies. > That is almost exactly what I did back when the stats import functions took a nested JSON argument. The biggest problem with changing that format is that the old format would still show up in the system being exported, so we would have to process that format as well as the new one. > Yes, it's more verbose, but maybe better for "mechanical" processing? > It absolutely would be better for processing, but we'd still have to read the old format from older systems. I suppose the pg_dump code could do some SQL gymnastics to convert the old json-but-sad format into the processing-friendly format of the future, and I could easily adapt what I've already written over a year ago to that task. I suppose it's just a matter of having the community behind changing the output format to enable a better input format. > 2) Do we need some sort of validation? Perhaps this was discussed in the > other thread and I missed that, but isn't it a problem that happily > accept e.g. this? > > {"6666, 6666" : 1, "1, -222": 14, ...} > > That has duplicate keys with bogus attribute numbers, stats on (bogus) > system attributes, etc. I suspect this may easily cause problems during > planning (if it happens to visit those statistics). > We used to have _lots_ of validation for data quality issues, much of which was removed at the request of reviewers. However, much of that discussion was about the health of the statistics, but these are standalone data types, maybe they're held to a higher standard. If so, what sort of checks do you think would be needed and/or wanted? So far, I can imagine the following: * no negative attnums in key list * no duplicate attnums in key list anything else? > > Maybe that's acceptable - ultimately the user could import something > broken in a much subtler way, of course. But the pg_set_attribute_stats > seems somewhat more protected against this, because it gets the attr as > a separate argument. > The datatype itself is in isolation, but once we've created a valid pg_ndistinct or pg_dependencies, there's nothing stopping us from validating the values in the datatype against the statistics object and the relation it belongs to, but that might get the same resistance that I got to say, ensuring that frequency lists were monotonically decreasing. > I recall I wished to have the attnum in the output function, but that > was not quite possible because we don't know the relid (and thus the > descriptor) in that function. > > Is it a good idea to rely on the input/output format directly? How will > that deal with cross-version differences? Doesn't it mean the in/out > format is effectively fixed, or at least has to be backwards compatible > (i.e. new version has to handle any value from older versions)? > Presently there are no cross-version differences, though earlier I address the pros and cons of changing it. No matter what, the burden of having a valid format is on the user in the case of pg_set_extended_stats() and pg_restore_extended_stats() has a server version number associated, so the option of handling a format change could be baked in, but then we're doing version tests and input typecasts like we do with ANYARRAY types. Not impossible, but definitely more work. > Or what if I want to import the stats for a table with slightly > different structure (e.g. because dump/restore skips dropped columns). > Won't that be a problem with the format containing raw attnums? Or is > this a use case we don't expect to work? > The family of pg_set_*_stats functions expect the input to be meaningful and correct for that relation on that server version. Any attnum translation would have to be done by the user to adapt to the new or changed relation. The family of pg_restore_*_stats functions are designed to be forward compatible, and to work across versions but for basically the same relation or relation of the same shape. Basically, they're for pg_restore and pg_upgrade, so no changes in attnums would be expected. > > For the per-attribute stats it's probably fine, because that's mostly > just a collection of regular data types (scalar values or arrays of > values, ...) and we're not modifying them except for maybe adding new > fields. But extended stats seem more complex, so maybe it's different? > I had that working by attname matching way back in the early days, but it would involve creating an intermediate format for translating the attnums to attnames on export, and then re-translating them on the way back in. I suppose someone could write the following utility functions pg_xlat_ndistinct_to_attnames(relation reloid, ndist pg_ndistinct) -> json pg_xlat_ndistinct_from_attnames(relation reloid, ndist json) -> pg_ndistinct and that would bridge the gap for the special case where you want to adapt pg_ndistinct from one table structure to a slightly different one. > > I remember a long discussion about the format at the very beginning of > this patch series, and the conclusion clearly was to have a function > that import stats for one attribute at a time. And that seems to be > working fine, but extended stats values have more internal structure, so > perhaps they need to do something more complicated. > I believe this *is* doing something more complicated, especially with the MCVList, though I was very pleased to see how much of the existing infrastructure I was able to reuse. > > > 0003 - Makes several static functions in attribute_stats.c public for use > > by extended stats. One of those is get_stat_attr_type(), which in the > last > > patchset was modified to take an attribute name rather than attnum, thus > > saving a syscache lookup. However, extended stats identifies attributes > by > > attnum not name, so that optimization had to be set aside, at least > > temporarily. > > > > 0004 - These implement the functions pg_set_extended_stats(), > > pg_clear_extended_stats(), and pg_restore_extended_stats() and behave > like > > their relation/attribute equivalents. If we can get these committed and > > used by pg_dump, then we don't have to debate how to handle post-upgrade > > steps for users who happen to have extended stats vs the approximately > > 99.75% of users who do not have extended stats. > > > > I see there's a couple MCV-specific functions in the extended_stats.c. > Shouldn't those go into mvc.c instead? > I wanted to put it there, but there was a reason I didn't and I've now forgotten what it was. I'll make an effort to relocate it to mcv.c. For that matter, it might make sense to break out the expressions code into its own file, because every other stat attribute has its own. Thoughts on that? > > FWIW there's a bunch of whitespace issues during git apply. > +1 > > OK. Thanks for the patch! > Thanks for the feedback, please keep it coming. I think it's really important that extended stats, though used rarely, be included in our dump/restore/upgrade changes so as to make for a more consistent user experience. -
Re: Extended Statistics set/restore/clear functions.
Tomas Vondra <tomas@vondra.me> — 2025-01-23T20:15:53Z
On 1/23/25 15:51, Corey Huinker wrote: > On Wed, Jan 22, 2025 at 5:50 PM Tomas Vondra <tomas@vondra.me > <mailto:tomas@vondra.me>> wrote: > > Hi, > > Thanks for continuing to work on this. > > On 1/22/25 19:17, Corey Huinker wrote: > > This is a separate thread for work started in [1] but focused > purely on > > getting the following functions working: > > > > * pg_set_extended_stats > > * pg_clear_extended_stats > > * pg_restore_extended_stats > > > > These functions are analogous to their relation/attribute > counterparts, > > use the same calling conventions, and build upon the same basic > > infrastructure. > > > > I think it is important that we get these implemented because they > close > > the gap that was left in terms of the ability to modify existing > > statistics and to round out the work being done to carry over > statistics > > via dump/restore and pg_upgrade i [1]. > > > > The purpose of each patch is as follows (adapted from previous > thread): > > > > 0001 - This makes the input function for pg_ndistinct functional. > > > > 0002 - This makes the input function for pg_dependencies functional. > > > > I only quickly skimmed the patches, but a couple comments: > > > 1) I think it makes perfect sense to use the JSON parsing for the input > functions, but maybe it'd be better to adjust the format a bit to make > that even easier? > > Right now the JSON "keys" have structure, which means we need some ad > hoc parsing. Maybe we should make it "proper JSON" by moving that into > separate key/value, e.g. for ndistinct we might replace this: > > {"1, 2": 2323, "1, 3" : 3232, ...} > > with this: > > [ {"keys": [1, 2], "ndistinct" : 2323}, > {"keys": [1, 3], "ndistinct" : 3232}, > ... ] > > so a regular JSON array of objects, with keys an "array". And similarly > for dependencies. > > > That is almost exactly what I did back when the stats import functions > took a nested JSON argument. > > The biggest problem with changing that format is that the old format > would still show up in the system being exported, so we would have to > process that format as well as the new one. > > > Yes, it's more verbose, but maybe better for "mechanical" processing? > > > It absolutely would be better for processing, but we'd still have to > read the old format from older systems. I suppose the pg_dump code > could do some SQL gymnastics to convert the old json-but-sad format into > the processing-friendly format of the future, and I could easily adapt > what I've already written over a year ago to that task. I suppose it's > just a matter of having the community behind changing the output format > to enable a better input format. > D'oh! I always forget about the backwards compatibility issue, i.e. that we still need to ingest values from already released versions. Yeah, that makes the format change less beneficial. > > > 2) Do we need some sort of validation? Perhaps this was discussed in the > other thread and I missed that, but isn't it a problem that happily > accept e.g. this? > > {"6666, 6666" : 1, "1, -222": 14, ...} > > That has duplicate keys with bogus attribute numbers, stats on (bogus) > system attributes, etc. I suspect this may easily cause problems during > planning (if it happens to visit those statistics). > > > We used to have _lots_ of validation for data quality issues, much of > which was removed at the request of reviewers. However, much of that > discussion was about the health of the statistics, but these are > standalone data types, maybe they're held to a higher standard. If so, > what sort of checks do you think would be needed and/or wanted? > > So far, I can imagine the following: > > * no negative attnums in key list > * no duplicate attnums in key list > > anything else? > Yeah, I recall there were a lot of checks initially and we dropped them over time. I'm not asking to reinstate all of those thorough checks. At this point I was really thinking only about validating the attnums, i.e. to make sure it's a valid attribute in the table / statistics. That is something the pg_set_attribute_stats() enforce too, thanks to having a separate argument for the attribute name. That's where I'd stop. I don't want to do checks on the statistics content, like verifying the frequencies in the MCV sum up to 1.0 or stuff like that. I think we're not doing that for pg_set_attribute_stats either (and I'd bet one could cause a lot of "fun" this way). > > Maybe that's acceptable - ultimately the user could import something > broken in a much subtler way, of course. But the pg_set_attribute_stats > seems somewhat more protected against this, because it gets the attr as > a separate argument. > > > The datatype itself is in isolation, but once we've created a valid > pg_ndistinct or pg_dependencies, there's nothing stopping us from > validating the values in the datatype against the statistics object and > the relation it belongs to, but that might get the same resistance that > I got to say, ensuring that frequency lists were monotonically decreasing. > Understood. IMHO it's fine to say we're not validating the statistics are "consistent" but I think we should check it matches the definition. > > I recall I wished to have the attnum in the output function, but that > was not quite possible because we don't know the relid (and thus the > descriptor) in that function. > > Is it a good idea to rely on the input/output format directly? How will > that deal with cross-version differences? Doesn't it mean the in/out > format is effectively fixed, or at least has to be backwards compatible > (i.e. new version has to handle any value from older versions)? > > > Presently there are no cross-version differences, though earlier I > address the pros and cons of changing it. No matter what, the burden of > having a valid format is on the user in the case of > pg_set_extended_stats() and pg_restore_extended_stats() has a server > version number associated, so the option of handling a format change > could be baked in, but then we're doing version tests and input > typecasts like we do with ANYARRAY types. Not impossible, but definitely > more work. > OK, makes sense. > > Or what if I want to import the stats for a table with slightly > different structure (e.g. because dump/restore skips dropped columns). > Won't that be a problem with the format containing raw attnums? Or is > this a use case we don't expect to work? > > > The family of pg_set_*_stats functions expect the input to be meaningful > and correct for that relation on that server version. Any attnum > translation would have to be done by the user to adapt to the new or > changed relation. > > The family of pg_restore_*_stats functions are designed to be forward > compatible, and to work across versions but for basically the same > relation or relation of the same shape. Basically, they're for > pg_restore and pg_upgrade, so no changes in attnums would be expected. > > OK > > For the per-attribute stats it's probably fine, because that's mostly > just a collection of regular data types (scalar values or arrays of > values, ...) and we're not modifying them except for maybe adding new > fields. But extended stats seem more complex, so maybe it's different? > > > I had that working by attname matching way back in the early days, but > it would involve creating an intermediate format for translating the > attnums to attnames on export, and then re-translating them on the way > back in. > > I suppose someone could write the following utility functions > > pg_xlat_ndistinct_to_attnames(relation reloid, ndist pg_ndistinct) - >> json > pg_xlat_ndistinct_from_attnames(relation reloid, ndist json) -> > pg_ndistinct > > and that would bridge the gap for the special case where you want to > adapt pg_ndistinct from one table structure to a slightly different one. > > OK > > > I remember a long discussion about the format at the very beginning of > this patch series, and the conclusion clearly was to have a function > that import stats for one attribute at a time. And that seems to be > working fine, but extended stats values have more internal structure, so > perhaps they need to do something more complicated. > > > I believe this *is* doing something more complicated, especially with > the MCVList, though I was very pleased to see how much of the existing > infrastructure I was able to reuse. > > OK > > > > > 0003 - Makes several static functions in attribute_stats.c public > for use > > by extended stats. One of those is get_stat_attr_type(), which in > the last > > patchset was modified to take an attribute name rather than > attnum, thus > > saving a syscache lookup. However, extended stats identifies > attributes by > > attnum not name, so that optimization had to be set aside, at least > > temporarily. > > > > 0004 - These implement the functions pg_set_extended_stats(), > > pg_clear_extended_stats(), and pg_restore_extended_stats() and > behave like > > their relation/attribute equivalents. If we can get these > committed and > > used by pg_dump, then we don't have to debate how to handle post- > upgrade > > steps for users who happen to have extended stats vs the approximately > > 99.75% of users who do not have extended stats. > > > > I see there's a couple MCV-specific functions in the extended_stats.c. > Shouldn't those go into mvc.c instead? > > > I wanted to put it there, but there was a reason I didn't and I've now > forgotten what it was. I'll make an effort to relocate it to mcv.c. > > For that matter, it might make sense to break out the expressions code > into its own file, because every other stat attribute has its own. > Thoughts on that? > +1 to that, if it reduced unnecessary code duplication > > > FWIW there's a bunch of whitespace issues during git apply. > > > +1 > > > > OK. Thanks for the patch! > > > Thanks for the feedback, please keep it coming. I think it's really > important that extended stats, though used rarely, be included in our > dump/restore/upgrade changes so as to make for a more consistent user > experience. > I agree, and I appreciate you working on it. -- Tomas Vondra -
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-01-23T20:52:55Z
> > > > > * no negative attnums in key list > Disregard this suggestion - negative attnums mean the Nth expression in the extended stats object, though it boggles the mind how we could have 222 expressions... > > * no duplicate attnums in key list > This one is still live, am considering. At this point I was really thinking only about validating the attnums, > i.e. to make sure it's a valid attribute in the table / statistics. That > is something the pg_set_attribute_stats() enforce too, thanks to having > a separate argument for the attribute name. > > That's where I'd stop. I don't want to do checks on the statistics > content, like verifying the frequencies in the MCV sum up to 1.0 or > stuff like that. I think we're not doing that for pg_set_attribute_stats > Agreed. > either (and I'd bet one could cause a lot of "fun" this way). > If by "fun" you mean "create a fuzzing tool", then yes. As an aside, the "big win" in all these functions is the ability to dump a database --no-data, but have all the schema and statistics, thus allowing for checking query plans on existing databases with sensitive data while not actually exposing the data (except mcv, obvs), nor spending the I/O to load that data. > Understood. IMHO it's fine to say we're not validating the statistics > are "consistent" but I think we should check it matches the definition. > +1 > > I suppose someone could write the following utility functions > > > > pg_xlat_ndistinct_to_attnames(relation reloid, ndist pg_ndistinct) - > >> json > > pg_xlat_ndistinct_from_attnames(relation reloid, ndist json) -> > > pg_ndistinct > > > > and that would bridge the gap for the special case where you want to > > adapt pg_ndistinct from one table structure to a slightly different one. > > > > > > OK > As they'll be pure-SQL functions, I'll likely post the definitions here, but not put them into a patch unless it draws interest. > For that matter, it might make sense to break out the expressions code > > into its own file, because every other stat attribute has its own. > > Thoughts on that? > > > > +1 to that, if it reduced unnecessary code duplication > I'm uncertain that it actually would deduplicate any code, but I'll certainly try.
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-01-24T00:06:48Z
> > I see there's a couple MCV-specific functions in the extended_stats.c. >> Shouldn't those go into mvc.c instead? >> > > I wanted to put it there, but there was a reason I didn't and I've now > forgotten what it was. I'll make an effort to relocate it to mcv.c. > Looking at it now, I see that check_mcvlist_array() expects access to extarginfo, so I either I add a bunch of Datum-aware and extarginfo-aware code to mcv.c, or I do those checks outside of import_mcvlist() and leave check_mcvlist_array() where it is, which is my current inclination, though obviously it's not a perfectly clean break.
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-01-24T05:20:59Z
> > > I agree, and I appreciate you working on it. > I've tried to incorporate all your feedback: * duplicates in pg_ndistinct and pg_dependencies now cause the input function to fail, with tests for each * all the non-array, non-parametery stuff moved out of import_mcvlist so that import_mcvlist can move to mcv.c in a sane way * whitespace issues * validating attnums inside pg_ndistinct. I held off on doing pg_dependencies pending feedback on how I handled pg_ndistinct. tests for invalid positive (attnum isn't involved in the extended stats object), 0 attnum (just wrong) and invalid negative (more than the number of expressions in the object) Still to do: * attnum checking for dependencies, should be simple, once we've locked down how to do it for pg_ndistinct * move import_expressions code to expressions.c along with some other expression-specific stuff. * more tests, probably I'd like to merge these down to 3 patches again, but I'm keeping them separate for this patchset to isolate the attnum-checking code for this go-round.
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-01-27T19:52:15Z
> > I'd like to merge these down to 3 patches again, but I'm keeping them > separate for this patchset to isolate the attnum-checking code for this > go-round. > These are mock-ups of the to/from JSON functions, but building from/to text rather than the not-yet-committed pg_ndistinct and pg_dependencies data types. Currently they're done with JSON rather than JSONB because I assume that the ordering within the datatype matters. We can probably preserve order by adding an "order" field populated by WITH ORDINALITY. To get all Jurrassic about this: I've spent some time thinking about whether we CAN make these functions, it's time to consider whether we SHOULD. And that leads me to a couple of points: p1. We could switch to the new formats without any change to the internal representation, but pg_dump would always need to know about the old formats. p2. The JSON format is both more understandable and easier to manipulate. p3. If we thought the number of people using extended stats was small, the number of people tweaking extended stats is going to be smaller. So that gives us a few paths forward: o1. Switch to the new input/output format, and the queries inside these functions get incorporated into some future pg_dump queries. o2. Keep the old formats, create these functions inside the system and whoever wants to use them can use them. o3. Keep old formats, and make these functions work as the CASTs to and from JSON/JSONB. o4. Keep old formats, could create these functions in an extension. o5. Keep old formats, leave these function definitions here for a future intrepid hacker.
-
Re: Extended Statistics set/restore/clear functions.
jian he <jian.universality@gmail.com> — 2025-01-28T16:24:59Z
hi. I reviewed 0001 only. in src/backend/statistics/mvdistinct.c no need #include "nodes/pg_list.h" since src/include/statistics/statistics.h sub level include "nodes/pg_list.h" no need #include "utils/palloc.h" sicne #include "postgres.h" already included it. select '[{"6, -32768,,": -11}]'::pg_ndistinct; ERROR: malformed pg_ndistinct: "[{"6, -32768,,": -11}]" LINE 1: select '[{"6, -32768,,": -11}]'::pg_ndistinct; ^ DETAIL: All ndistinct count values are scalar doubles. imho, this errdetail message is not good. select '{}'::pg_ndistinct ; segfault select '{"1,":"1"}'::pg_ndistinct ; ERROR: malformed pg_ndistinct: "{"1,":"1"}" LINE 1: select '{"1,":"1"}'::pg_ndistinct ; ^ DETAIL: All ndistinct attnum lists must be a comma separated list of attnums. imho, this errdetail message is not good. would be better saying that "length of list of attnums must be larger than 1". select pt1.typnamespace, pt1.typarray, pt1.typcategory, pt1.typname from pg_type pt1 where pt1.typname ~*'distinct'; typnamespace | typarray | typcategory | typname --------------+----------+-------------+-------------- 11 | 0 | Z | pg_ndistinct typcategory (Z) marked as Internal-use types. and there is no pg_ndistinct array type, not sure this is fine. all the errcode one pair of the parenthesis is unnecessary. for example (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), errmsg("malformed pg_ndistinct: \"%s\"", parse->str), errdetail("Must begin with \"{\""))); can change to errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), errmsg("malformed pg_ndistinct: \"%s\"", parse->str), errdetail("Must begin with \"{\"")); see https://www.postgresql.org/docs/current/error-message-reporting.html -
Re: Extended Statistics set/restore/clear functions.
jian he <jian.universality@gmail.com> — 2025-01-29T07:50:09Z
hi. select '{"1, 0B100101":"NaN"}'::pg_ndistinct; pg_ndistinct ------------------------ {"1, 37": -2147483648} (1 row) this is not what we expected? For the VALUE part of pg_ndistinct, float8 has 3 special values: inf, -inf, NaN. For the key part of pg_ndistinct, see example. select '{"1, 16\t":"1"}'::pg_ndistinct; here \t is not tab character, ascii 9. it's two characters: backslash and character "t". so here it should error out? (apply this to \n, \r, \b) pg_ndistinct_in(PG_FUNCTION_ARGS) ending part should be: freeJsonLexContext(lex); if (result == JSON_SUCCESS) { ...... } else { ereturn(parse_state.escontext, (Datum) 0, errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), errmsg("malformed pg_ndistinct: \"%s\"", str), errdetail("Must be valid JSON.")); PG_RETURN_NULL(); } result should be either JSON_SUCCESS or anything else. all these functions: ndistinct_object_start, ndistinct_array_start, ndistinct_object_field_start, ndistinct_array_element_start have ndistinctParseState *parse = state; do we need to change it to ndistinctParseState *parse = (ndistinctParseState *)state; ? ndistinctParseState need to add to src/tools/pgindent/typedefs.list probably change it to "typedef struct ndistinctParseState". also struct ndistinctParseState need placed in the top of src/backend/statistics/mvdistinct.c not in line 340? -
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-01-29T19:56:33Z
On Tue, Jan 28, 2025 at 11:25 AM jian he <jian.universality@gmail.com> wrote: > hi. > I reviewed 0001 only. > > in src/backend/statistics/mvdistinct.c > > no need #include "nodes/pg_list.h" since > src/include/statistics/statistics.h sub level include "nodes/pg_list.h" > > no need #include "utils/palloc.h" > sicne #include "postgres.h" > already included it. > Noted. > select '[{"6, -32768,,": -11}]'::pg_ndistinct; > ERROR: malformed pg_ndistinct: "[{"6, -32768,,": -11}]" > LINE 1: select '[{"6, -32768,,": -11}]'::pg_ndistinct; > ^ > DETAIL: All ndistinct count values are scalar doubles. > imho, this errdetail message is not good. > What error message do you think is appropriate in that situation? > select '{}'::pg_ndistinct ; > segfault > Mmm, gotta look into that! > > > select '{"1,":"1"}'::pg_ndistinct ; > ERROR: malformed pg_ndistinct: "{"1,":"1"}" > LINE 1: select '{"1,":"1"}'::pg_ndistinct ; > ^ > DETAIL: All ndistinct attnum lists must be a comma separated list of > attnums. > > imho, this errdetail message is not good. would be better saying that > "length of list of attnums must be larger than 1". > That sounds better. > typcategory (Z) marked as Internal-use types. and there is no > pg_ndistinct array type, > not sure this is fine. > I think it's probably ok for now. The datatype currently has no utility other than extended statistics, and I'm doubtful that it ever will. -
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-01-29T20:04:40Z
On Wed, Jan 29, 2025 at 2:50 AM jian he <jian.universality@gmail.com> wrote: > hi. > > select '{"1, 0B100101":"NaN"}'::pg_ndistinct; > pg_ndistinct > ------------------------ > {"1, 37": -2147483648} > (1 row) > I think my initial reaction is to just refuse those special values, but I'll look into the parsing code to see what can be done. this is not what we expected? > For the VALUE part of pg_ndistinct, float8 has 3 special values: inf, > -inf, NaN. > > For the key part of pg_ndistinct, see example. > select '{"1, 16\t":"1"}'::pg_ndistinct; > here \t is not tab character, ascii 9. it's two characters: backslash > and character "t". > so here it should error out? > (apply this to \n, \r, \b) > I don't have a good answer as to what should happen here. Special cases like this make Tomas' suggestion to change the in/out format more attractive. > > > pg_ndistinct_in(PG_FUNCTION_ARGS) > ending part should be: > > freeJsonLexContext(lex); > if (result == JSON_SUCCESS) > { > ...... > } > else > { > ereturn(parse_state.escontext, (Datum) 0, > errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), > errmsg("malformed pg_ndistinct: \"%s\"", str), > errdetail("Must be valid JSON.")); > PG_RETURN_NULL(); > } > result should be either JSON_SUCCESS or anything else. > > > > all these functions: > ndistinct_object_start, ndistinct_array_start, > ndistinct_object_field_start, ndistinct_array_element_start > have > ndistinctParseState *parse = state; > > do we need to change it to > ndistinctParseState *parse = (ndistinctParseState *)state; > ? > The compiler isn't complaining so far, but I see no harm in it. -
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-02-20T04:43:16Z
> > select '{"1, 0B100101":"NaN"}'::pg_ndistinct; >> pg_ndistinct >> ------------------------ >> {"1, 37": -2147483648} >> (1 row) >> > > I think my initial reaction is to just refuse those special values, but > I'll look into the parsing code to see what can be done. > I noticed that the output function for pg_ndistinct casts that value to an integer before formatting it %d, so it's being treated as an integer even if it is not stored as one. After some consultation with Tomas, it made the most sense to just replicate this on the input side as well, and that is addressed in the patches below. I've updated and rebased the patches. The existing pg_ndistinct and pg_dependences formats were kept as-is. The formats are clumsy, more processing-friendly formats would be easier, but the need for such processing is minimal bordering on theoretical, so there is little impact in keeping the historical format. There are now checks to ensure that the pg_ndistinct or pg_dependencies value assigned to an extended statistics object actually makes sense for that object. What this amounts to is checking that for every attnum cited, the positive attnums are also ones found the in the stxkeys of the pg_statistic_ext tuple, and the negative attnums correspond do not exceed the number of expressions in the attnum. In other words, if the stats object has no expressions in it, then no negative numbers will be accepted, if it has 2 expressions than any value -3 or lower will be rejected, etc. All patches rebased to 71f17823ba010296da9946bd906bb8bcad6325bc. -
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-03-04T20:30:11Z
> > >> I think my initial reaction is to just refuse those special values, but >> I'll look into the parsing code to see what can be done. >> > > I noticed that the output function for pg_ndistinct casts that value to an > integer before formatting it %d, so it's being treated as an integer even > if it is not stored as one. After some consultation with Tomas, it made the > most sense to just replicate this on the input side as well, and that is > addressed in the patches below. > > I've updated and rebased the patches. > > The existing pg_ndistinct and pg_dependences formats were kept as-is. The > formats are clumsy, more processing-friendly formats would be easier, but > the need for such processing is minimal bordering on theoretical, so there > is little impact in keeping the historical format. > > There are now checks to ensure that the pg_ndistinct or pg_dependencies > value assigned to an extended statistics object actually makes sense for > that object. What this amounts to is checking that for every attnum cited, > the positive attnums are also ones found the in the stxkeys of the > pg_statistic_ext tuple, and the negative attnums correspond do not exceed > the number of expressions in the attnum. In other words, if the stats > object has no expressions in it, then no negative numbers will be accepted, > if it has 2 expressions than any value -3 or lower will be rejected, etc. > > All patches rebased to 71f17823ba010296da9946bd906bb8bcad6325bc. > A rebasing, and a few changes * regnamespace and name parameters changed to statistics_schemaname as text and statistics_name as text, so that there's one less thing that can potentially fail in an upgrade * schema lookup and stat name lookup failures now issue a warning and return false, rather than ERROR * elevel replaced with hardcoded WARNING most everywhere, as has been done with relation/attribute stats
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-03-31T05:10:23Z
Just rebasing.
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-05-29T22:32:17Z
On Mon, Mar 31, 2025 at 1:10 AM Corey Huinker <corey.huinker@gmail.com> wrote: > Just rebasing. > At pgconf.dev this year, the subject of changing the formats of pg_ndistinct and pg_depdentencies came up again. To recap: presently these datatypes have no working input function, but would need one for statistics import to work on extended statistics. The existing input formats are technically JSON, but the keys themselves are a comma-separated list of attnums, so they require additional parsing. That parsing is already done in the patches in this thread, but overall the format is terrible for any sort of manipulation, like the manipulation that people might want to do to translate the values to a table with a different column order (say, after a restore of a table that had dropped columns), or to do query planner experiments. Because the old formats don't have a corresponding input function, there is no risk of the ouptut not matching required inputs, but there will be once we add new input functions, so this is our last chance to change the format to something we like better. The old format can be trivially translated via functions posted earlier in this thread back in January (pg_xlat_ndistinct_to_attnames, pg_xlat_dependencies_to_attnames) as well as the reverse (s/_to_/_from_/), so dumping values from older versions will not be difficult. I believe that we should take this opportunity to make the change. While we don't have a pressing need to manipulate these structures now, we might in the future and failing to do so now makes a later change much harder. With that in mind, I'd like people to have a look at the proposed format change if pg_ndistinct (the changes to pg_dependencies are similar), to see if they want to make any improvements or comments. As you can see, the new format is much less compact (about 3x as large), which could get bad if the number of elements grew by a lot, but the number of elements is tied to the number of factors in the extended support (N choose N, then N choose N-1, etc, excluding choose 1), so this can't get too out of hand. Existing format (newlines/formatting added by me to make head-to-head comparison easier): '{"2, 3": 4, "2, -1": 4, "2, -2": 4, "3, -1": 4, "3, -2": 4, "-1, -2": 3, "2, 3, -1": 4, "2, 3, -2": 4, "2, -1, -2": 4, "3, -1, -2": 4}'::pg_ndistinct Proposed new format (again, all formatting here is just for ease of humans reading): ' [ {"attributes" : [2,3], "ndistinct" : 4}, {"attributes" : [2,-1], "ndistinct" : 4}, {"attributes" : [2,-2], "ndistinct" : 4}, {"attributes" : [3,-1], "ndistinct" : 4}, {"attributes" : [3,-2], "ndistinct" : 4}, {"attributes" : [-1,-2], "ndistinct" : 3}, {"attributes" : [2,3,-1], "ndistinct" : 4}, {"attributes" : [2,3,-2], "ndistinct" : 4}, {"attributes" : [2,-1,-2], "ndistinct" : 4}, {"attributes" : [3,-1,-2], "ndistinct" : 4}]'::pg_ndistinct The pg_dependencies structure is only slightly more complex: An abbreviated example: {"2 => 1": 1.000000, "2 => -1": 1.000000, ..., "2, -2 => -1": 1.000000, "3, -1 => 2": 1.000000}, Becomes: [ {"attributes": [2], "dependency": 1, "degree": 1.000000}, {"attributes": [2], "dependency": -1, "degree": 1.000000}, {"attributes": [2, -2], "dependency": -1, "degree": 1.000000}, ..., {"attributes": [2, -2], "dependency": -1, "degree": 1.000000}, {"attributes": [3, -1], "dependency": 2, "degree": 1.000000}] Any thoughts on using/improving these structures? -
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-06-22T00:12:28Z
> > > Any thoughts on using/improving these structures? > Hearing no objections, here is the latest patchset. 0001 - Changes input/output functions of pg_ndistinct to the format described earlier. 0002 - Changes input/output functions of pg_dependencies to the format described earlier. 0003 - Makes some previously internal/static attribute stats functions visible to extended_stats.c, because the exprs attribute is basically an array of partially filled-out pg_statistic rows. 0004 - Adds pg_restore_attribute_stats(), pg_clear_attribute_stats(), in the pattern of their relation/attribute brethren. 0005 - adds the dumping and restoring of extended statistics back to v10. No command line flag changes needed.
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-08-12T17:28:14Z
On Sat, Jun 21, 2025 at 8:12 PM Corey Huinker <corey.huinker@gmail.com> wrote: > >> Any thoughts on using/improving these structures? >> > > Hearing no objections, here is the latest patchset. > > 0001 - Changes input/output functions of pg_ndistinct to the format > described earlier. > 0002 - Changes input/output functions of pg_dependencies to the format > described earlier. > 0003 - Makes some previously internal/static attribute stats functions > visible to extended_stats.c, because the exprs attribute is basically an > array of partially filled-out pg_statistic rows. > 0004 - Adds pg_restore_attribute_stats(), pg_clear_attribute_stats(), in > the pattern of their relation/attribute brethren. > 0005 - adds the dumping and restoring of extended statistics back to v10. > No command line flag changes needed. > > Rebased. Enjoy.
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-09-12T04:55:56Z
> > Rebased. Enjoy. > Rebased. >
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-10-19T00:27:58Z
On Fri, Sep 12, 2025 at 12:55 AM Corey Huinker <corey.huinker@gmail.com> wrote: > Rebased. Enjoy. >> > > Rebased. > And rebased again to conform to 688dc6299 and 4bd919129.
-
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-10-21T05:48:37Z
On Sat, Oct 18, 2025 at 08:27:58PM -0400, Corey Huinker wrote: > And rebased again to conform to 688dc6299 and 4bd919129. The format redesign for extended stats is pretty nice, as done in 0001. I think that this patch should be split in two, actually, as it tackles two issues: - One patch for the format change. - Second patch for the introduction of the input function, useful on its own to allow more regression tests paths for the format generated. Similar split comment for 0002 regarding pg_dependencies. The format change is one thing. The input function is cool to have for input validation, still not absolutely mandatory for the core part of the format change. The functions exposed in 0003 should be renamed to match more with the style of the rest, aka it is a bit hard to figure out what they do at first sight. Presumably, these should be prefixed with some "statext_", except text_to_stavalues() which could still be named the same. Do you have some numbers regarding the increase in size this generates for the catalogs? 0004 has been designed following the same model as the relation and attribute stats. That sounds OK here. +enum extended_stats_argnum [...] +enum extended_stats_exprs_element It would be nice to document why such things are around. That would be less guessing for somebody reading the code. Reusing this small sequence from your pg_dump patch, executed on a v14 backend: create schema dump_test; CREATE TABLE dump_test.has_ext_stats AS SELECT g.g AS x, g.g / 2 AS y FROM generate_series(1,100) AS g(g); CREATE STATISTICS dump_test.es1 ON x, (y % 2) FROM dump_test.has_ext_stats; ANALYZE dump_test.has_ext_stats; Then pg_dump fails: pg_dump: error: query failed: ERROR: column e.inherited does not exist LINE 2: ...hemaname = $1 AND e.statistics_name = $2 ORDER BY e.inherite... + * TODO: Until v18 is released the master branch has a + * server_version_num of 180000. We will update this to 190000 as soon + * as the master branch updates. This part has not been updated. + Assert(item.nattributes > 0); /* TODO: elog? */ [...] + Assert(dependency->nattributes > 1); /* TODO: elog? */ Yes and yes. It seems like it should be possible to craft some input that triggers these.. +void +free_pg_dependencies(MVDependencies *dependencies); Double declaration of this routine in dependencies.c. Perhaps some of the regression tests could use some jsonb_pretty() in the outputs generated. Some of the results generated are very hard to parse, something that would become harder in the buildfarm. This comment starts with 0001 for stxdndistinct. I have mixed feelings about 0005, FWIW. I am wondering if we should not lift the needle a bit here and only support the dump of extended statistics when dealing with a backend of at least v19. This would mean that we would only get the full benefit of this feature once people upgrade to v20 or dump from a pg_dump with --statistics from at least v19, but with the long-term picture in mind this would also make the dump/restore picture of the patch dead simple (spoiler: I like simple). Tomas, what is your take about the format changes and my argument about the backward requirements of pg_dump (about not dumping these stats if connecting to a server older than v18, included)? -- Michael
-
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-10-21T05:57:00Z
On Tue, Oct 21, 2025 at 02:48:37PM +0900, Michael Paquier wrote: > Tomas, what is your take about the format changes and my argument > about the backward requirements of pg_dump (about not dumping these > stats if connecting to a server older than v18, included)? By the way, the patch is still needed in the previous CF of September: https://commitfest.postgresql.org/patch/5517/ You may want to move it. -- Michael
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-10-22T11:55:31Z
> > The functions exposed in 0003 should be renamed to match more with the > style of the rest, aka it is a bit hard to figure out what they do at > first sight. Presumably, these should be prefixed with some > "statext_", except text_to_stavalues() which could still be named the > same. > That prefix would probably be statatt_ or statattr_. > > Do you have some numbers regarding the increase in size this generates > for the catalogs? > Sorry, I don't understand. There shouldn't be any increase inside the catalogs as the internal storage of the datatypes hasn't changed, so I can only conclude that you're referring to something else. > > 0004 has been designed following the same model as the relation and > attribute stats. That sounds OK here. > > +enum extended_stats_argnum > [...] > +enum extended_stats_exprs_element > > It would be nice to document why such things are around. That would > be less guessing for somebody reading the code. > The equivalent structures in attribute_stats.c will need documenting too. > > Reusing this small sequence from your pg_dump patch, executed on a v14 > backend: > create schema dump_test; > CREATE TABLE dump_test.has_ext_stats > AS SELECT g.g AS x, g.g / 2 AS y FROM generate_series(1,100) AS g(g); > CREATE STATISTICS dump_test.es1 ON x, (y % 2) FROM dump_test.has_ext_stats; > ANALYZE dump_test.has_ext_stats; > > Then pg_dump fails: > pg_dump: error: query failed: ERROR: column e.inherited does not exist > LINE 2: ...hemaname = $1 AND e.statistics_name = $2 ORDER BY e.inherite... > Noted. > > > + * TODO: Until v18 is released the master branch has a > + * server_version_num of 180000. We will update this to 190000 > as soon > + * as the master branch updates. > > This part has not been updated. > > + Assert(item.nattributes > 0); /* TODO: elog? */ > [...] > + Assert(dependency->nattributes > 1); /* TODO: elog? */ > Yes and yes. It seems like it should be possible to craft some input > that triggers these.. > +1 > > +void > +free_pg_dependencies(MVDependencies *dependencies); > > Double declaration of this routine in dependencies.c. > > Perhaps some of the regression tests could use some jsonb_pretty() in > the outputs generated. Some of the results generated are very hard to > parse, something that would become harder in the buildfarm. This > comment starts with 0001 for stxdndistinct. > Can do. > > I have mixed feelings about 0005, FWIW. I am wondering if we should > not lift the needle a bit here and only support the dump of extended > statistics when dealing with a backend of at least v19. This would > mean that we would only get the full benefit of this feature once > people upgrade to v20 or dump from a pg_dump with --statistics from at > least v19, but with the long-term picture in mind this would also make > the dump/restore picture of the patch dead simple (spoiler: I like > simple). > I also like simple. Right now we have a situation where the vast majority of databases can carry forward all of their stats via pg_upgrade, except for those databases that have extended stats. The trouble is, most customers don't know if their database uses extended statistics or not, and those that do are in for some bad query plans if they haven't run vacuumdb --missing-stats-only. Explaining that to customers is complicated, especially when most of them do not know what extended stats are, let alone whether they have them. It would be a lot simpler to just say "all stats are carried over on upgrade", and vacuumdb becomes unnecessary, making upgrades one step simpler as well. Given that, I think that the admittedly ugly transformation is worth it, and sequestering it inside pg_dump is the smallest footprint it can have. Earlier in this thread I posted some functions that did the translation from the existing formats to the proposed new formats. We could include those as new system functions, and that would make the dump code very simple. Having said that, I don't know that there would be use for those functions except inside pg_dump, hence the decision to do the transforms right in the dump query. If the format translation is a barrier to fetching existing extended stats, then I'd be more inclined to keep the existing pg_ndistinct and pg_dependencies data formats as they are now.
-
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-10-22T23:46:27Z
On Wed, Oct 22, 2025 at 02:55:31PM +0300, Corey Huinker wrote: >> Do you have some numbers regarding the increase in size this generates >> for the catalogs? > > Sorry, I don't understand. There shouldn't be any increase inside the > catalogs as the internal storage of the datatypes hasn't changed, so I can > only conclude that you're referring to something else. The new format meant more characters, perhaps I've just missed something while quickly testing the patch.. Anyway, that's OK at this stage. > The equivalent structures in attribute_stats.c will need documenting too. Right. This sounds like a separate patch to me, impacting HEAD. > Right now we have a situation where the vast majority of databases can > carry forward all of their stats via pg_upgrade, except for those databases > that have extended stats. The trouble is, most customers don't know if > their database uses extended statistics or not, and those that do are in > for some bad query plans if they haven't run vacuumdb --missing-stats-only. > Explaining that to customers is complicated, especially when most of them > do not know what extended stats are, let alone whether they have them. It > would be a lot simpler to just say "all stats are carried over on upgrade", > and vacuumdb becomes unnecessary, making upgrades one step simpler as well. Okay. > Given that, I think that the admittedly ugly transformation is worth it, > and sequestering it inside pg_dump is the smallest footprint it can have. > Earlier in this thread I posted some functions that did the translation > from the existing formats to the proposed new formats. We could include > those as new system functions, and that would make the dump code very > simple. Having said that, I don't know that there would be use for those > functions except inside pg_dump, hence the decision to do the transforms > right in the dump query. I'd prefer the new format. One killer pushing in favor of the new format that you are making upthread in favor of is that it makes much easier the viewing, editing and injecting of these stats. It's the part of the patch where we would need Tomas' input on the matter before deciding anything, I guess, as primary author of the original facilities. My view of the problem is just one opinion. > If the format translation is a barrier to fetching existing extended stats, > then I'd be more inclined to keep the existing pg_ndistinct and > pg_dependencies data formats as they are now. Not necessarily, it can be possible to also take that in multiple steps rather than a single one: - First do the basics in v19 with the new format. - Raise the bar to older versions. -- Michael
-
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-10-23T00:35:04Z
On Thu, Oct 23, 2025 at 08:46:27AM +0900, Michael Paquier wrote: > I'd prefer the new format. One killer pushing in favor of the new > format that you are making upthread in favor of is that it makes much > easier the viewing, editing and injecting of these stats. This is missing an "argument", as in "One killer argument pushing in favor.." :D -- Michael
-
Re: Extended Statistics set/restore/clear functions.
Tomas Vondra <tomas@vondra.me> — 2025-10-31T20:22:55Z
On 10/23/25 01:46, Michael Paquier wrote: > On Wed, Oct 22, 2025 at 02:55:31PM +0300, Corey Huinker wrote: >>> Do you have some numbers regarding the increase in size this generates >>> for the catalogs? >> >> Sorry, I don't understand. There shouldn't be any increase inside the >> catalogs as the internal storage of the datatypes hasn't changed, so I can >> only conclude that you're referring to something else. > > The new format meant more characters, perhaps I've just missed > something while quickly testing the patch.. Anyway, that's OK at this > stage. > >> The equivalent structures in attribute_stats.c will need documenting too. > > Right. This sounds like a separate patch to me, impacting HEAD. > >> Right now we have a situation where the vast majority of databases can >> carry forward all of their stats via pg_upgrade, except for those databases >> that have extended stats. The trouble is, most customers don't know if >> their database uses extended statistics or not, and those that do are in >> for some bad query plans if they haven't run vacuumdb --missing-stats-only. >> Explaining that to customers is complicated, especially when most of them >> do not know what extended stats are, let alone whether they have them. It >> would be a lot simpler to just say "all stats are carried over on upgrade", >> and vacuumdb becomes unnecessary, making upgrades one step simpler as well. > > Okay. > >> Given that, I think that the admittedly ugly transformation is worth it, >> and sequestering it inside pg_dump is the smallest footprint it can have. >> Earlier in this thread I posted some functions that did the translation >> from the existing formats to the proposed new formats. We could include >> those as new system functions, and that would make the dump code very >> simple. Having said that, I don't know that there would be use for those >> functions except inside pg_dump, hence the decision to do the transforms >> right in the dump query. > > I'd prefer the new format. One killer pushing in favor of the new > format that you are making upthread in favor of is that it makes much > easier the viewing, editing and injecting of these stats. It's the > part of the patch where we would need Tomas' input on the matter > before deciding anything, I guess, as primary author of the original > facilities. My view of the problem is just one opinion. > Sorry for not paying much attention to this thread ... My opinion is that we should both use the new format and keep the pg_dump code to allow upgrading from older pre-19 versions. There really is nothing special about the current format - I should have used JSON (or any other established format) from the beginning. But I only saw that as human-readable version of ephemeral data, it didn't occur to me we'll use this to export/import stats cross versions. So if we need to adjust that to make new use cases more convenient, let's bite the bullet now. If doing both is too complex / ugly, I think the pg_upgrade capability is more valuable. I'd rather keep the old, less convenient format to have pg_upgrade support for all versions. Otherwise users may not benefit from this pg_upgrade feature for a couple more years. Plenty of users delay upgrading until the EOL gets close, and so might be unable to dump/restore extended stats for the next ~5 years. regards -- Tomas Vondra
-
Re: Extended Statistics set/restore/clear functions.[
Michael Paquier <michael@paquier.xyz> — 2025-11-04T07:13:55Z
On Fri, Oct 31, 2025 at 09:22:55PM +0100, Tomas Vondra wrote: > Sorry for not paying much attention to this thread ... It was PGEU last week, it's not surprising knowing that you are in.. Europe :) > My opinion is that we should both use the new format and keep the > pg_dump code to allow upgrading from older pre-19 versions. Okay, thanks for the input. > There really is nothing special about the current format - I should have > used JSON (or any other established format) from the beginning. But I > only saw that as human-readable version of ephemeral data, it didn't > occur to me we'll use this to export/import stats cross versions. So if > we need to adjust that to make new use cases more convenient, let's bite > the bullet now. > > If doing both is too complex / ugly, I think the pg_upgrade capability > is more valuable. I'd rather keep the old, less convenient format to > have pg_upgrade support for all versions. Hmm. I have been eyeing at the dump code once again, and it's not that bad on a second lookup, so I'd be OK to incorporate the whole in a review. > Otherwise users may not benefit from this pg_upgrade feature for a > couple more years. Plenty of users delay upgrading until the EOL gets > close, and so might be unable to dump/restore extended stats for the > next ~5 years. Okay. I still think that we should split things a bit more in the patch set. Corey has sent me a proposal in this direction, where most of the entries can be reviewed and potentially applied separately: 1. pg_ndistinct output function change. 2. pg_ndistinct input function addition. 3. pg_dependencies output function change 4. pg_dependencies input function 5. Expose attribute statistics function and rename them attstat_* or statatt_* 6. pg_restore_extended_stats 7. pg_dump with no ability to fetch old-format pg_ndistinct/pg_dependences. 8. pg_dump working back as far as possible I am not completely sold about the changes in the input/output functions, but I'd be happy to consider more options with a rebased patch set (with the dump issue on older clusters issue, while on it). -- Michael
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-05T06:38:56Z
> > Otherwise users may not benefit from this pg_upgrade feature for a > couple more years. Plenty of users delay upgrading until the EOL gets > close, and so might be unable to dump/restore extended stats for the > next ~5 years Paquier's response got sidetracked because of an errant subject line change, so I will try to recap: * pg_dump code changes no longer seem as bad on second look * proceed with breakout per off-list discussion in that off-list discussion I proposed (though I was mostly echoing what I thought Paquier wanted): 1. pg_ndistinct output function change. 2. pg_ndistinct input function addition. 3. pg_dependencies output function change 4. pg_dependencies input function 5. Expose attribute statistics function and rename them attstat_* or statatt_* (edit: and fix lack of comments on the enums and arrays) 6. pg_restore_extended_stats 7. pg_dump with no ability to fetch old-format pg_ndistinct/pg_dependences. (edit: and fix inherited bug) 8. pg_dump working back as far as possible Given that the pg_dump code no longer seems as bad, and Tomas is very much in support of it, I've opted not to split out steps 7/8.
-
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-11-06T08:26:58Z
On Wed, Nov 05, 2025 at 01:38:56AM -0500, Corey Huinker wrote: > Paquier's response got sidetracked because of an errant subject line > change, so I will try to recap: That was a typo that found its way into the email subject. Sorry about that, that broke gmail's tracking at least. > in that off-list discussion I proposed (though I was mostly echoing what I > thought Paquier wanted): > > 1. pg_ndistinct output function change. > 2. pg_ndistinct input function addition. > 3. pg_dependencies output function change > 4. pg_dependencies input function > 5. Expose attribute statistics function and rename them attstat_* or > statatt_* (edit: and fix lack of comments on the enums and arrays) > 6. pg_restore_extended_stats > 7. pg_dump with no ability to fetch old-format pg_ndistinct/pg_dependences. > (edit: and fix inherited bug) > 8. pg_dump working back as far as possible Thanks. I have begun reviewing it (more a bit later, still need to study more the structure of the code). For now, I have extracted some of the comment changes in 0005 and applied these independently. > Given that the pg_dump code no longer seems as bad, and Tomas is very much > in support of it, I've opted not to split out steps 7/8. That sounds like a way forward to me, then, in terms of using a new format and make pg_dump intelligent enough to deal with it based on what's in the past versions. -- Michael
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-06T18:35:34Z
> > That was a typo that found its way into the email subject. Sorry > about that, that broke gmail's tracking at least. > So the mailing list archive will still pick it up? That's nice. > Thanks. I have begun reviewing it (more a bit later, still need to > study more the structure of the code). For now, I have extracted some > of the comment changes in 0005 and applied these independently. > Rebased to reflect that commit.
-
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-11-07T07:51:15Z
On Thu, Nov 06, 2025 at 01:35:34PM -0500, Corey Huinker wrote: > So the mailing list archive will still pick it up? That's nice. It did. My email client does not care much either. > Rebased to reflect that commit. I have spent a bit more time on this set. Patch 0001 for ndistinct was missing a documentation update, we have one query in perform.sgml that looks at stxdndistinct. Patch 0003 is looking OK here as well. For dependencies, the format switches from a single json object with key/vals like that: "3 => 4": 1.000000 To a JSON array made of elements like that: {"degree": 1.000000, "attributes": [3],"dependency": 4}, For ndistincts, we move from a JSON blob with key/vals like that: "3, 4": 11 To a JSON array made of the following elements: {"ndistinct": 11, "attributes": [3,4]} Using a keyword within each element would force a stronger validation when these get imported back, which is a good thing. I like that. Before going in-depth into the input functions to cross-check the amount of validation we should do, have folks any comments about the proposed format? That's the key point this patch set depends on, and I'd rather not spend more time the whole thing if somebody would like a different format. This is the format that Tomas has mentioned at the top of the thread. Note: as noted upthread, pg_dump would be in charge of transferring the data of the old format to the new format at the end. While looking at 0002 and 0004 (which have a couple of issues actually), I have been wondering about moving into a new file the four data-type functions (in, out, send and receive) and the new input functions that rely on a new JSON lexer and parser logic into for both ndistinct and dependencies. The new set of headers added at the top of mvdistinct.c and dependencies.c for the new code points that a separation may be better in the long-term, because the new code relies on parts of the backend that the existing code does not care about, and these files become larger than the relation and attribute stats files. I would be tempted to name these new files pg_dependencies.c and pg_ndistinct.c, mapping with their catalog types. With this separation, it looks like the "core" parts in charge of the calculations with ndistinct and dependencies can be kept on its own. What do you think? A second comment is for 0005. The routines of attributes.c are applied to the new clear and restore functions. Shouldn't these be in stats_utils.c at the end? That's where the "common" functions used by the stats manipulation logic are. -- Michael -
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-07T22:28:50Z
> > > Patch 0001 for ndistinct was missing a documentation update, we have > one query in perform.sgml that looks at stxdndistinct. Patch 0003 is > looking OK here as well. > Well spotted. > For dependencies, the format switches from a single json object > with key/vals like that: > "3 => 4": 1.000000 > To a JSON array made of elements like that: > {"degree": 1.000000, "attributes": [3],"dependency": 4}, > > For ndistincts, we move from a JSON blob with key/vals like that: > "3, 4": 11 > To a JSON array made of the following elements: > {"ndistinct": 11, "attributes": [3,4]} > > Using a keyword within each element would force a stronger validation > when these get imported back, which is a good thing. I like that. > > Before going in-depth into the input functions to cross-check the > amount of validation we should do, have folks any comments about the > proposed format? That's the key point this patch set depends on, and > I'd rather not spend more time the whole thing if somebody would like > a different format. This is the format that Tomas has mentioned at > the top of the thread. Note: as noted upthread, pg_dump would be in > charge of transferring the data of the old format to the new format at > the end. > I'm open to other formats, but aside from renaming the json keys (maybe "attnums" or "keys" instead of "attributes"?), I'm not sure what really could be done and still be JSON. I suppose we could go with a tuple format like this: '{({3,4},11),...}' for pg_ndistinct and '{({3},4,1.00000),...}' for pg_dependencies. Those would certainly be more compact, but makes for a hard read by humans, and while the JSON code is big, it's also proven in other parts of the codebase, hence less risky. > > While looking at 0002 and 0004 (which have a couple of issues > actually), I have been wondering about moving into a new file the four > data-type functions (in, out, send and receive) and the new input > functions that rely on a new JSON lexer and parser logic into for both > ndistinct and dependencies. The new set of headers added at the top > of mvdistinct.c and dependencies.c for the new code points that a > separation may be better in the long-term, because the new code relies > on parts of the backend that the existing code does not care about, > and these files become larger than the relation and attribute stats > files. I would be tempted to name these new files pg_dependencies.c > and pg_ndistinct.c, mapping with their catalog types. With this > separation, it looks like the "core" parts in charge of the > calculations with ndistinct and dependencies can be kept on its own. > What do you think? > A part of me thinks that everything that remains after removing in/out/send/recv is just taking a table sample data structure and crunching numbers to come up with the deserialized data structure...that's in/out with a different starting/ending points. There's no denying that JSON parsing is a very different code style than statistical number crunching, and mixing the two is incongruous, so it's worth a shot, and I'll try that for v9. > > A second comment is for 0005. The routines of attributes.c are > applied to the new clear and restore functions. Shouldn't these be in > stats_utils.c at the end? That's where the "common" functions used by > the stats manipulation logic are. > I assume you're referring to attribute_stats.c. I think that would cause stats_utils.c to have to pull in a lot of things from attribute_stats.c, and that would create the exact sort of include-pollution that you're trying to avoid in the mvdistinct.c/dependencies.c situation mentioned above. The one lone exception to this is text_to_stavalues(), which is a fancy wrapper around array_in() and could perhaps be turned to even more generic usage outside of stats in general. The functions in question are needed because the exprs value is itself an array of partly-filled-out pg_attribute tuples, so it's common to those two needs, but specific to stats about attributes. Maybe we need an attr_stats_utils.h? -
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-11-07T22:56:48Z
On Fri, Nov 07, 2025 at 05:28:50PM -0500, Corey Huinker wrote: > I'm open to other formats, but aside from renaming the json keys (maybe > "attnums" or "keys" instead of "attributes"?), I'm not sure what really > could be done and still be JSON. I suppose we could go with a tuple format > like this: > > '{({3,4},11),...}' for pg_ndistinct and > '{({3},4,1.00000),...}' for pg_dependencies. > > Those would certainly be more compact, but makes for a hard read by humans, > and while the JSON code is big, it's also proven in other parts of the > codebase, hence less risky. I've liked the human-readability factor of the format in the current patches with names in the keys, and values assigned to each property. Another thing that may be worth doing is pushing the names of the keys and some its the JSON meta-data shaping the object into a new header than can be loaded by both the backend and the frontend. It would be nice to not hardcode this knowledge in a bunch of places if we finish by renaming these attributes. > A part of me thinks that everything that remains after removing > in/out/send/recv is just taking a table sample data structure and crunching > numbers to come up with the deserialized data structure...that's in/out > with a different starting/ending points. > > There's no denying that JSON parsing is a very different code style than > statistical number crunching, and mixing the two is incongruous, so it's > worth a shot, and I'll try that for v9. Yeah, right. Thanks. The parsing pieces seem like pieces worth their own file. > The functions in question are needed because the exprs value is itself an > array of partly-filled-out pg_attribute tuples, so it's common to those two > needs, but specific to stats about attributes. Maybe we need an > attr_stats_utils.h? Hmm, maybe. I'd be OK to revisit these structures once we're happy with the in/out structures. That would be a good start point before working on the SQL functions and the dump/restore bits in more details. -- Michael -
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-10T05:33:40Z
> > Another thing that may be worth doing is pushing the names of the keys > and some its the JSON meta-data shaping the object into a new header > than can be loaded by both the backend and the frontend. It would be > nice to not hardcode this knowledge in a bunch of places if we finish > by renaming these attributes. > It may not be quite what you wanted, but the attribute names are now static constants in the new adt c files. It's possible/probable that you wanted them in some header file, but so far I haven't had to create any new header files, but that can be done if desired. Yeah, right. Thanks. The parsing pieces seem like pieces worth their > own file. > That's done in the 0008-0009 patches. If I was starting from scratch, I would have moved the pre-existing in/out/send/recv functions to their own files in their own patches before changing the output format, but tacked on at the end like they are it's easier to see what the changes were, and the patches will probably get squashed together anyway. > > The functions in question are needed because the exprs value is itself an > > array of partly-filled-out pg_attribute tuples, so it's common to those > two > > needs, but specific to stats about attributes. Maybe we need an > > attr_stats_utils.h? > > Hmm, maybe. I'd be OK to revisit these structures once we're happy > with the in/out structures. That would be a good start point before > working on the SQL functions and the dump/restore bits in more > details. > In addition to the changes detailed above, I fixed a few typos and incorporated the v8 change.
-
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-11-11T08:06:49Z
On Mon, Nov 10, 2025 at 12:33:40AM -0500, Corey Huinker wrote: > It may not be quite what you wanted, but the attribute names are now static > constants in the new adt c files. It's possible/probable that you wanted > them in some header file, but so far I haven't had to create any new header > files, but that can be done if desired. No, that's not the best thing we can do with the dump/restore pieces in mind. Let's put that in a separate header. > That's done in the 0008-0009 patches. If I was starting from scratch, I > would have moved the pre-existing in/out/send/recv functions to their own > files in their own patches before changing the output format, but tacked on > at the end like they are it's easier to see what the changes were, and the > patches will probably get squashed together anyway. Thanks for the new patch. And FWIW I disagree with this approach: cleanup and refactoring pieces make more sense if done first, as these lead to less code churn in the final result. So... I've begun to put my hands on the patch set. The whole has been restructured a bit, as per the attached. Patch 0001 to 0004 feel OK here, these include two code moves and the two output functions: - Two new files for adt/, that I'm planning to apply soon as a separate cleanup. - New output functions, with keys added to a new header named statistics_format.h, for frontend and backend consumption. Next comes the input functions. First, I am unhappy with the amount of testing that has been put into ndistinct, first and only input facility I've looked at in details for the moment. I have quickly spotted a couple a few issues while testing buggy input, like this one that crashes on pointer dereference, not good obviously: SELECT '[]'::pg_ndistinct; There was a second one with the error message generated when using an incorrect key value. Second, the inputs are too permissive and could be more strictly checked IMHO. For example, patterns like that are incorrect, still authorized with only the patches up to 0005 in: - Duplicated list of attributes: SELECT '[{"attributes" : [2,3], "ndistinct" : 4}, {"attributes" : [2,3], "ndistinct" : 4}]'::pg_ndistinct; - Partial (K,N) sets, for example say we take stats on attrs (1,2,3), a partial input like this one is basically OK: SELECT '[{"attributes" : [1,3], "ndistinct" : 4}, {"attributes" : [1,2,3], "ndistinct" : 4}]'::pg_ndistinct; These are checked in the patches that introduce the functions like with pg_ndistinct_validate_items(), based on the list of stxkeys we have. However, I think that this is not enough by itself. Shouldn't we check that the list of items in the array is what we expect based on the longest "attributes" array at least, even after a JSON that was parsed? That would be cheap to check in the output function itself, at least as a first layer of checks before trying something with the import function and cross-checking the list of attributes for the extended statistics object. This means checking that for N attributes we have all the elements we'd expect in each element of the array, without gaps or duplications, with an extra step done once the JSON parsing is finished. Except for this sanity issue this part of the patch set should be mostly OK, plus more cleanup and more typo/grammar fixes. I suspect a similar family of issues with pg_dependencies, and it would be nice to move the tests with the input function into a new regression file, like the other one. I've rebased the full set using the new structure. 0001~0004 are clean. 0005~ need more work and analysis, but that's a start. -- Michael -
Re: Extended Statistics set/restore/clear functions.
jian he <jian.universality@gmail.com> — 2025-11-11T15:14:34Z
On Tue, Nov 11, 2025 at 4:07 PM Michael Paquier <michael@paquier.xyz> wrote: > I've rebased the full set using the new structure. 0001~0004 are > clean. 0005~ need more work and analysis, but that's a start. hi. diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml index 106583fb2965..e0e9f0468cb8 100644 --- a/doc/src/sgml/perform.sgml +++ b/doc/src/sgml/perform.sgml @@ -1579,9 +1579,39 @@ ANALYZE zipcodes; SELECT stxkeys AS k, stxdndistinct AS nd FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid) WHERE stxname = 'stts2'; --[ RECORD 1 ]------------------------------------------------------&zwsp;-- +-[ RECORD 1 ]------------------- k | 1 2 5 -nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178} +nd | [ + + | { + + | "ndistinct": 33178,+ + | "attributes": [ + + | 1, + + | 2 + + | ] + + | }, + If you not specify as SELECT stxkeys AS k, stxdndistinct AS nd FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid) WHERE stxname = 'stts2' \gx then it won't look like ``-[ RECORD 1 ]``. I tried the dummy data, the above query, the nd column data look like [{"attributes": [1, 2], "ndistinct": 2} ..... which does not look like the above. So I guess the query in doc/src/sgml/perform.sgml would be: SELECT stxkeys AS k, jsonb_pretty(d.stxdndistinct::text::jsonb) AS nd FROM pg_statistic_ext join pg_statistic_ext_data d on (oid = d.stxoid) WHERE stxname = 'stts2' \gx per related thread, the word "join" needs uppercase? -- jian https://www.enterprisedb.com/ -
Re: Extended Statistics set/restore/clear functions.
jian he <jian.universality@gmail.com> — 2025-11-12T02:51:51Z
On Tue, Nov 11, 2025 at 11:14 PM jian he <jian.universality@gmail.com> wrote: > > On Tue, Nov 11, 2025 at 4:07 PM Michael Paquier <michael@paquier.xyz> wrote: > > I've rebased the full set using the new structure. 0001~0004 are > > clean. 0005~ need more work and analysis, but that's a start. hi. +Datum +pg_ndistinct_out(PG_FUNCTION_ARGS) + + + appendStringInfo(&str, "], \"" PG_NDISTINCT_KEY_NDISTINCT "\": %d}", + (int) item.ndistinct); I’m a bit confused about the part above, item.ndistinct is double type, we just cast it to int type? after apply 0004, the below in doc/src/sgml/perform.sgml also need to change? <programlisting> CREATE STATISTICS stts (dependencies) ON city, zip FROM zipcodes; ANALYZE zipcodes; SELECT stxname, stxkeys, stxddependencies FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid) WHERE stxname = 'stts'; stxname | stxkeys | stxddependencies ---------+---------+------------------------------------------ stts | 1 5 | {"1 => 5": 1.000000, "5 => 1": 0.423130} (1 row) </programlisting> Do you think it's worth the trouble to have two separate appendStringInfoChar for ``{}``? for example in loop ``for (i = 0; i < ndist->nitems; i++)``. we can change to: appendStringInfoChar(&str, '{'); appendStringInfo(&str, "\"" PG_NDISTINCT_KEY_ATTRIBUTES "\": [%d", item.attributes[0]); for (int j = 1; j < item.nattributes; j++) appendStringInfo(&str, ", %d", item.attributes[j]); appendStringInfo(&str, "], \"" PG_NDISTINCT_KEY_NDISTINCT "\": %d", (int) item.ndistinct); appendStringInfoChar(&str, '}'); -- jian https://www.enterprisedb.com/ -
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-12T06:47:33Z
> > Thanks for the new patch. And FWIW I disagree with this approach: > cleanup and refactoring pieces make more sense if done first, as these > lead to less code churn in the final result. So... I've begun to put > my hands on the patch set. The whole has been restructured a bit, as > per the attached. Patch 0001 to 0004 feel OK here, these include two > code moves and the two output functions: > - Two new files for adt/, that I'm planning to apply soon as a > separate cleanup. > - New output functions, with keys added to a new header named > statistics_format.h, for frontend and backend consumption. > Agreed, 0001-0004 all look good. > Next comes the input functions. First, I am unhappy with the amount > of testing that has been put into ndistinct, first and only input > facility I've looked at in details for the moment. I have quickly > spotted a couple a few issues while testing buggy input, like this one > that crashes on pointer dereference, not good obviously: > SELECT '[]'::pg_ndistinct; > - I put some work into more specific error messages for invalid values for both pg_ndistinct and pg_dependencies. - The check for empty attribute lists and item lists now occur in the array-end event handler. - Also tried to standardize conventions between the two data types (switch statements, similar utility functions, etc). > > These are checked in the patches that introduce the functions like > with pg_ndistinct_validate_items(), based on the list of stxkeys we > have. However, I think that this is not enough by itself. Shouldn't > we check that the list of items in the array is what we expect based > on the longest "attributes" array at least, even after a JSON that was > parsed? That would be cheap to check in the output function itself, > at least as a first layer of checks before trying something with the > import function and cross-checking the list of attributes for the > extended statistics object. I added tests for both duplicate attribute sequences as well as making the first-longest attribute sequence the template by which all later and shorter sequences are checked. I had been reluctant to add checks like this, because so many similar validations were removed from the earlier statistics code like histograms and the like. > > I suspect a similar family of issues with pg_dependencies, and it > would be nice to move the tests with the input function into a new > regression file, like the other one. > Did so. 0001-0004,0007,0009 unchanged. Significant modification of the stats_import.sql regression tests in 0008 to conform to stricter datatype rules enacted in 0005, 0006.
-
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-11-12T07:53:50Z
On Wed, Nov 12, 2025 at 01:47:33AM -0500, Corey Huinker wrote: >> Thanks for the new patch. And FWIW I disagree with this approach: >> cleanup and refactoring pieces make more sense if done first, as these >> lead to less code churn in the final result. So... I've begun to put >> my hands on the patch set. The whole has been restructured a bit, as >> per the attached. Patch 0001 to 0004 feel OK here, these include two >> code moves and the two output functions: >> - Two new files for adt/, that I'm planning to apply soon as a >> separate cleanup. > > Agreed, 0001-0004 all look good. Applied 0001 and 0002 for now. -- Michael
-
Re: Extended Statistics set/restore/clear functions.
zengman <zengman@halodbtech.com> — 2025-11-12T08:45:16Z
Hey, I saw that some comments in the recent commit have inconsistent styles—maybe we can tweak them to make them uniform! ```c // pg_dependencies - output routine for type pg_dependencies. pg_dependencies_out - output routine for type pg_dependencies. // pg_ndistinct // output routine for type pg_ndistinct pg_ndistinct_out output routine for type pg_ndistinct ``` Oh, I didn’t operate it properly, which resulted in the creation of a duplicate theme. https://www.postgresql.org/message-id/tencent_6EC50DD07D5D567C15E65C46%40qq.com -- Man Zeng -
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-11-12T11:25:43Z
On Wed, Nov 12, 2025 at 08:45:16AM +0000, Man Zeng wrote: > pg_dependencies_out - output routine for type pg_dependencies. > pg_ndistinct_out > > Oh, I didn’t operate it properly, which resulted in the creation of a duplicate theme. > https://www.postgresql.org/message-id/tencent_6EC50DD07D5D567C15E65C46%40qq.com No problem. I've just fixed both comments, thanks. These were older than the recent commit, but let's fix things when these are in sight. -- Michael
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-12T22:21:36Z
> > + > + appendStringInfo(&str, "], \"" PG_NDISTINCT_KEY_NDISTINCT "\": %d}", > + (int) item.ndistinct); > > I’m a bit confused about the part above, > item.ndistinct is double type, we just cast it to int type? > It's a historical quirk. That's what the original output function did in mvdistinct.c, so we maintain compatibility with that. Altering the internal storage type would affect the bytea serialization, which would break binary compatibility. > after apply 0004, the below in doc/src/sgml/perform.sgml also need to > change? > Yes it does, good catch. > Do you think it's worth the trouble to have two separate > appendStringInfoChar for ``{}``? > > for example in loop ``for (i = 0; i < ndist->nitems; i++)``. we can change > to: > I agree that that feels more symmetrical. However, it seems the prevailing wisdom is that we're already paying for a string interpolation in the very next appendStringInfo(), we might as well save ourselves a function call. Hence, I left that one as-is. The sgml change has been worked into a rebased and reduced patch set (thanks for the commits Michael!) -
Re: Extended Statistics set/restore/clear functions.
jian he <jian.universality@gmail.com> — 2025-11-13T07:01:42Z
hi. v12-0001 and v12-0002 overall look good to me. if (dependency->nattributes <= 1) elog(ERROR, "invalid zero-length nattributes array in MVDependencies"); This is an unlikely-to-happen error message, but still, “nattributes” seems confusing? in doc/src/sgml/func/func-json.sgml, we have \gset <programlisting> SELECT '{ ....... } }' AS json \gset </programlisting> similarly, in doc/src/sgml/perform.sgml, I think the query should be: SELECT stxkeys AS k, jsonb_pretty(stxdndistinct::text::jsonb) AS nd FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid) WHERE stxname = 'stts2' \gx -- jian https://www.enterprisedb.com/ -
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-13T07:32:17Z
On Thu, Nov 13, 2025 at 2:02 AM jian he <jian.universality@gmail.com> wrote: > hi. > > v12-0001 and v12-0002 overall look good to me. > > if (dependency->nattributes <= 1) > elog(ERROR, "invalid zero-length nattributes array in > MVDependencies"); > This is an unlikely-to-happen error message, but still, “nattributes” > seems confusing? > Agreed the error message should be changed if it's kept at all. That error may never occur now that we test for empty arrays in the array close event handler. So maybe this becomes a plain assert(). > similarly, in doc/src/sgml/perform.sgml, I think the query should be: > > SELECT stxkeys AS k, jsonb_pretty(stxdndistinct::text::jsonb) AS nd > FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid) > WHERE stxname = 'stts2' \gx > The example almost certainly predates \gx, so that's a good suggestion.
-
Re: Extended Statistics set/restore/clear functions.
jian he <jian.universality@gmail.com> — 2025-11-13T11:51:00Z
hi. now looking at v12-0003-Add-working-input-function-for-pg_ndistinct.patch again. + * example input: + * [{"attributes": [6, -1], "ndistinct": 14}, + * {"attributes": [6, -2], "ndistinct": 9143}, + * {"attributes": [-1,-2], "ndistinct": 13454}, + * {"attributes": [6, -1, -2], "ndistinct": 14549}] */ Datum pg_ndistinct_in(PG_FUNCTION_ARGS) extenssted statistics surely won't work on system columns, how should we deal with case like: ``` {"attributes": [6, -1], "ndistinct": 14} {"attributes": [6, -7], "ndistinct": 14}, ``` issue a warning or error out saying that your attribute number is invalid? Should we discourage using system columns as examples in comments here? I have added more test code in src/test/regress/sql/pg_ndistinct.sql, to improve the code coverage. as mentioned before in https://postgr.es/m/CACJufxEZYqocFdgn-x-bJMRBSk_zkS=ziGGkaSumteiPDksnsg@mail.gmail.com I think it's a good thing to change ``(errcode....`` to ``errcode``. So I did the change. +static JsonParseErrorType +ndistinct_array_element_start(void *state, bool isnull) +{ + NDistinctParseState *parse = state; + + switch(parse->state) + { + case NDIST_EXPECT_ATTNUM: + if (!isnull) + return JSON_SUCCESS; + + ereturn(parse->escontext, (Datum) 0, + (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), + errmsg("malformed pg_ndistinct: \"%s\"", parse->str), + errdetail("Attnum list elements cannot be null."))); this (and many other places) looks wrong, because ereturn would really return ``(Datum) 0``, and this function returns JsonParseErrorType. so we have to errsave here. +typedef struct +{ + const char *str; + NDistinctSemanticState state; + + List *distinct_items; /* Accumulated complete MVNDistinctItems */ + Node *escontext; + + bool found_attributes; /* Item has an attributes key */ + bool found_ndistinct; /* Item has ndistinct key */ + List *attnum_list; /* Accumulated attributes attnums */ + int64 ndistinct; +} NDistinctParseState; + case NDIST_EXPECT_NDISTINCT: + /* + * While the structure dictates that ndistinct in a double precision + * floating point, in practice it has always been an integer, and it + * is output as such. Therefore, we follow usage precendent over the + * actual storage structure, and read it in as an integer. + */ + parse->ndistinct = pg_strtoint64_safe(token, parse->escontext); + + if (SOFT_ERROR_OCCURRED(parse->escontext)) + return JSON_SEM_ACTION_FAILED; NDistinctParseState.ndistinct should be integer, otherwise pg_ndistinct_out will not be consistent with pg_ndistinct_in? SELECT '[{"attributes" : [1, 2], "ndistinct" : 2147483648}]'::pg_ndistinct; --error pg_ndistinct ---------------------------------------------------- [{"attributes": [1, 2], "ndistinct": -2147483648}] (1 row) The result seems not what we expected. -
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-14T05:49:23Z
> > extenssted statistics surely won't work on system columns, > how should we deal with case like: > ``` > {"attributes": [6, -1], "ndistinct": 14} > {"attributes": [6, -7], "ndistinct": 14}, > ``` > issue a warning or error out saying that your attribute number is invalid? > Should we discourage using system columns as examples in comments here? > Negative numbers represent the Nth expression defined in the extended statistics object. So if you have extended statistics on a, b, length(a), length(b) then you can legally have -1 and -2 in the attributes, but nothing lower than that. See functions pg_ndistinct_validate_items() and pg_depdendencies_validate_deps() as these check the attributes in the value against the definition of the extended stats object. Though this does bring up a small naming problem: Elements of a pg_dependencies are Dependency, abbreviated to dep, but are also called Items like the elements in a pg_ndistinct. We should pick a standard name for such things (probably item) and use it everywhere. > > I have added more test code in src/test/regress/sql/pg_ndistinct.sql, > to improve the code coverage. > I'm trying to implement those test cases, but I may have missed some. > > > this (and many other places) looks wrong, because > ereturn would really return ``(Datum) 0``, and this function returns > JsonParseErrorType. > so we have to errsave here. > Good point. Implemented. > > NDistinctParseState.ndistinct should be integer, > otherwise pg_ndistinct_out will not be consistent with pg_ndistinct_in? > +1 Implemented many, but not all of these suggestions. -
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-11-14T06:25:27Z
On Fri, Nov 14, 2025 at 12:49:23AM -0500, Corey Huinker wrote: > Negative numbers represent the Nth expression defined in the extended > statistics object. So if you have extended statistics on a, b, length(a), > length(b) then you can legally have -1 and -2 in the attributes, but > nothing lower than that. > > See functions pg_ndistinct_validate_items() and > pg_depdendencies_validate_deps() as these check the attributes in the value > against the definition of the extended stats object. Exactly. Extended stats on system columns don't work because they don't really make sense as we want to track correlations between the attributes defined, and these reflect internal states: create table poo (a int, b int); create statistics poos (ndistinct ) ON cmax, a from poo; ERROR: 0A000: statistics creation on system columns is not supported Note that the expressions are also stored in pg_stats_ext_exprs. > I'm trying to implement those test cases, but I may have missed some. I've found a lot of them with coverage-html during a previous lookup. I'd like to think that we should aim for something close to 100% coverage for the two input functions. > Implemented many, but not all of these suggestions. Thanks for the new versions, I'll also look at all these across the next couple of days. Probably not at 0005~ for now. -- Michael
-
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-11-17T06:56:06Z
On Fri, Nov 14, 2025 at 03:25:27PM +0900, Michael Paquier wrote: > Thanks for the new versions, I'll also look at all these across the > next couple of days. Probably not at 0005~ for now. 0001 and 0002 from series v13 have been applied to change the output functions. And I have looked at 0003 in details for now. Attached is a revised version for it, with many adjustments. Some notes: - Many portions of the coverage were missing. I have measured the coverage at 91% with the updated version attached. This includes coverage for some error reporting, something that we rely a lot on for this code. - The error reports are made simpler, with the token values getting hidden. While testing with some fancy values, I have actually noticed that the error handlings for the parsing of the int16 and int32 values were incorrect, the error reports used what the safe functions generated, not the reports from the data type. - Passing down arbitrary bytes sequences was leading to these bytes reported in the error outputs because we cared about the token values. I have added a few tests based on that for the code paths involved. There is an extra thing that bugs me as incorrect for the pg_ndistinct input, something I have not tackled myself yet. Your patch checks that subsets of attributes are included in the longest set found, but it does not match the guarantees we have in mvndistinct.c: we have to check that *all* the combinations generated by generator_init() are satisfied based on the longest of attributes detected. For example, this is thought as correct in the input function: SELECT '[{"attributes" : [-1,2], "ndistinct" : 1}, {"attributes" : [-1,2,3], "ndistinct" : 3}]'::pg_ndistinct; However it is obviously not correct as we are missing an element for the attributes [-1, 3]. The simplest solution would be to export the routines that generate the groups now in mvndistinct.c. Also we should make sure that the number of elements in the arrays match with the number of groups we expect, not only the elements. I don't think that we need to care much about the values, but we ought to provide stronger guarantees for the attributes listed in these elements. Except for this argument, the input of pg_ndistinct feels OK in terms of the guarantees that we'd want to enforce on an import. The same argument applies in terms of attribute number guarantees for pg_dependencies, based on DependencyGenerator_init() & friends in dependencies.c. Could you look at that? For pg_dependencies, we also require some checks on the value for "dependency", of course, making sure that this matches with what's expected with the "largest" sets of attributes. In this case, we need to track the union of "dependency" and "attributes", with "attributes" having at least one element. The tests of pg_dependencies need also to be extended more (begun that a bit, far from being complete and I'm lacking of time this week due to a conference). One thing that I would add are nested JSON objects in the paths where we expect values, for example. Please note that I have done a brush of 0004, while on it, cleaning up typos, inconsistencies and making the error codes consistent with the ndistinct case where possible. This is not ready, but that's at least it's a start to rely on. In terms of committable bits, it would be better to apply the input functions once both parts are ready to go. For now I am attached a v14 with the work I've put into them. 0005~ are not reviewed yet, as mentioned previously. The changes in pg_dependencies are actually straight-forward to figure out (well, mostly) once the pg_ndistinct changes are OK in shape. -- Michael -
Re: Extended Statistics set/restore/clear functions.
jian he <jian.universality@gmail.com> — 2025-11-17T12:53:35Z
On Mon, Nov 17, 2025 at 2:56 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Nov 14, 2025 at 03:25:27PM +0900, Michael Paquier wrote: > > Thanks for the new versions, I'll also look at all these across the > > next couple of days. Probably not at 0005~ for now. > > 0001 and 0002 from series v13 have been applied to change the output > functions. > > And I have looked at 0003 in details for now. Attached is a revised > version for it, with many adjustments. Some notes: > - Many portions of the coverage were missing. I have measured the > coverage at 91% with the updated version attached. This includes > coverage for some error reporting, something that we rely a lot on for > this code. > - The error reports are made simpler, with the token values getting > hidden. While testing with some fancy values, I have actually noticed > that the error handlings for the parsing of the int16 and int32 values > were incorrect, the error reports used what the safe functions > generated, not the reports from the data type. > - Passing down arbitrary bytes sequences was leading to these bytes > reported in the error outputs because we cared about the token values. > I have added a few tests based on that for the code paths involved. > > There is an extra thing that bugs me as incorrect for the pg_ndistinct > input, something I have not tackled myself yet. Your patch checks > that subsets of attributes are included in the longest set found, but > it does not match the guarantees we have in mvndistinct.c: we have to > check that *all* the combinations generated by generator_init() are > satisfied based on the longest of attributes detected. For example, > this is thought as correct in the input function: > SELECT '[{"attributes" : [-1,2], "ndistinct" : 1}, > {"attributes" : [-1,2,3], "ndistinct" : 3}]'::pg_ndistinct; > > However it is obviously not correct as we are missing an element for > the attributes [-1, 3]. The simplest solution would be to export the > routines that generate the groups now in mvndistinct.c. Also we > should make sure that the number of elements in the arrays match with > the number of groups we expect, not only the elements. I don't think > that we need to care much about the values, but we ought to provide > stronger guarantees for the attributes listed in these elements. > > Except for this argument, the input of pg_ndistinct feels OK in terms > of the guarantees that we'd want to enforce on an import. The same > argument applies in terms of attribute number guarantees for > pg_dependencies, based on DependencyGenerator_init() & friends in > dependencies.c. Could you look at that? > hi. NDistinctSemanticState, last element NDIST_EXPECT_COMPLETE should follow with a comma, like: ``NDIST_EXPECT_COMPLETE, `` Lots of tests use pg_input_error_info, which is good. since currently only pg_input_error_info, pg_input_is_valid NDistinctParseState.escontext is not NULL. + /* + * If escontext already set, just use that. Anything else is a generic + * JSON parse error. + */ + if (!SOFT_ERROR_OCCURRED(parse_state.escontext)) + errsave(parse_state.escontext, + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), + errmsg("malformed pg_ndistinct: \"%s\"", str), + errdetail("Must be valid JSON.")); seems no coverage for this. segfault: SELECT '[{"attributes" : [1,2,3,4,5,67,6,7,8], "ndistinct" : 4}]'::pg_ndistinct; because src/backend/statistics/mvdistinct.c line: 310, Assert Assert((item->nattributes >= 2) && (item->nattributes <= STATS_MAX_DIMENSIONS)); +SELECT '[{"attributes" : [2], "ndistinct" : 4}]'::pg_ndistinct; +ERROR: malformed pg_ndistinct: "[{"attributes" : [2], "ndistinct" : 4}]" +LINE 1: SELECT '[{"attributes" : [2], "ndistinct" : 4}]'::pg_ndistin... + ^ +DETAIL: The "ndistinct" key must contain an array of at least two attributes. here, it should be +DETAIL: The "attributes" key must contain an array of at least two attributes. ? -- jian https://www.enterprisedb.com/ -
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-17T17:18:55Z
On Mon, Nov 17, 2025 at 1:56 AM Michael Paquier <michael@paquier.xyz> wrote: > On Fri, Nov 14, 2025 at 03:25:27PM +0900, Michael Paquier wrote: > > Thanks for the new versions, I'll also look at all these across the > > next couple of days. Probably not at 0005~ for now. > > 0001 and 0002 from series v13 have been applied to change the output > functions. > Thanks! Though, I was thinking some more about the output format. Using jsonb_pretty() makes it readable in one way, and very clumsy in other ways. Instead, I'm going to try doing the following: replace (ndist_string_value, '},', E'}\n,') This will result in the output value being formatted in exactly the way described in the commit messages. Of course, we could make the the actual default format by changing appendStringInfoString(&str, ", ") instead. One might argue that the output shouldn't get too flowery, but we're already adding spaces between items and array elements, and we've already made extensive changes favoring readability over compactness. Review of changes to input function patches: I'm curious about the re-parameterization of error messages involving PG_NDISTINCT_KEY_ATTRIBUTES, PG_NDISTINCT_KEY_NDISTINCT, and similar keys in dependencies. I like the parameterized version better, and was confused as to why it was removed. Did you change your mind, or was it done for ease of translation? Wording changes all look good. Use of pg_input_error_info() makes sense. > > There is an extra thing that bugs me as incorrect for the pg_ndistinct > input, something I have not tackled myself yet. Your patch checks > that subsets of attributes are included in the longest set found, but > it does not match the guarantees we have in mvndistinct.c: we have to > check that *all* the combinations generated by generator_init() are > satisfied based on the longest of attributes detected. For example, > this is thought as correct in the input function: > SELECT '[{"attributes" : [-1,2], "ndistinct" : 1}, > {"attributes" : [-1,2,3], "ndistinct" : 3}]'::pg_ndistinct; > > However it is obviously not correct as we are missing an element for > the attributes [-1, 3]. The simplest solution would be to export the > routines that generate the groups now in mvndistinct.c. Also we > should make sure that the number of elements in the arrays match with > the number of groups we expect, not only the elements. I don't think > that we need to care much about the values, but we ought to provide > stronger guarantees for the attributes listed in these elements. > I had a feeling that was going to be requested. My question would be if that we want to stick to modeling the other combinations after the first longest combination, last longest, or if we want to defer those checks altogether until we have to validate against an actual stats object? > > Except for this argument, the input of pg_ndistinct feels OK in terms > of the guarantees that we'd want to enforce on an import. The same > argument applies in terms of attribute number guarantees for > pg_dependencies, based on DependencyGenerator_init() & friends in > dependencies.c. Could you look at that? > Yes. I had already looked at it to verify that _all_ combinations were always generated (they are), because I had some vague memory of the generator dropping combinations that were statistically insignificant. In retrospect, I have no idea where I got that idea. > > For pg_dependencies, we also require some checks on the value for > "dependency", of course, making sure that this matches with what's > expected with the "largest" sets of attributes. In this case, we need > to track the union of "dependency" and "attributes", with "attributes" > having at least one element. > This is fairly simple to do. The dependency attnum is just appended to the list of attnums, and the combinations are generated the same as ndistinct, though obviously there are no single elements. There's probably some common code between the lists to be shared, differing only in how they report missing combinations. > The tests of pg_dependencies need also to be extended more (begun that > a bit, far from being complete and I'm lacking of time this week due > to a conference). One thing that I would add are nested JSON objects > in the paths where we expect values, for example. Please note that I > have done a brush of 0004, while on it, cleaning up typos, > inconsistencies and making the error codes consistent with the > ndistinct case where possible. This is not ready, but that's at least > it's a start to rely on. > > > In terms of committable bits, it would be better to apply the input > functions once both parts are ready to go. For now I am attached a > v14 with the work I've put into them. 0005~ are not reviewed yet, as > mentioned previously. The changes in pg_dependencies are actually > straight-forward to figure out (well, mostly) once the pg_ndistinct > changes are OK in shape. > -- > Michael +1 -
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-11-17T23:19:58Z
On Mon, Nov 17, 2025 at 12:18:55PM -0500, Corey Huinker wrote: > On Mon, Nov 17, 2025 at 1:56 AM Michael Paquier <michael@paquier.xyz> wrote: > Though, I was thinking some more about the output format. Using > jsonb_pretty() makes it readable in one way, and very clumsy in other ways. > Instead, I'm going to try doing the following: > > replace (ndist_string_value, '},', E'}\n,') > > This will result in the output value being formatted in exactly the way > described in the commit messages. > > Of course, we could make the the actual default format by changing > appendStringInfoString(&str, ", ") instead. This feels like a different pretty still compressed output for json. I don't think we should change the output functions to do that, but if you want to add a function that filters these contents a bit in the tests for the input functions, sure, why not. > One might argue that the output shouldn't get too flowery, but we're > already adding spaces between items and array elements, and we've already > made extensive changes favoring readability over compactness. I'd still keep it without newlines, FWIW. So what we have in the output functions is OK for me. The key names could be updated to something else for this release, I'm open for suggestions and we have time for this release. It would be nice to not do rename tweaks several times. > I'm curious about the re-parameterization of error messages involving > PG_NDISTINCT_KEY_ATTRIBUTES, PG_NDISTINCT_KEY_NDISTINCT, and similar keys > in dependencies. I like the parameterized version better, and was confused > as to why it was removed. Did you change your mind, or was it done for ease > of translation? Yes, this one is to reduce the translation work, and because the messages are quite the same across the board and deal with the same requirements: - Single integer expected after a key (attnum or actual value). - Array of attribute expected after a key. - For the degree key, float value. > I had a feeling that was going to be requested. My question would be if > that we want to stick to modeling the other combinations after the first > longest combination, last longest, or if we want to defer those checks > altogether until we have to validate against an actual stats object? I would tend to think that performing one round of validation once the whole set of objects has been parsed is going to be cheaper than periodic checks. One other thing would be to force a sort of the elements in the array to match with the order these are generated when creating the stats. We cannot do that in the input functions because we have no idea about the order of the attributes in the statistics object yet. Applying a sort sounds also important to me to make sure that we order the stats based on what the group generation functions (aka generate_combinations(), etc.) think on the matter, which would enforce a stronger binary compatibility after we are sure that we have a full set of attributes listed in an array with the input function of course. I have briefly looked at the planner code where extended stats are used, like selfuncs.c, and the ordering does not completely matter, it seems, but it's cheap enough to enforce a stricter ordering based on the K groups of N elements generated in the import function. >> Except for this argument, the input of pg_ndistinct feels OK in terms >> of the guarantees that we'd want to enforce on an import. The same >> argument applies in terms of attribute number guarantees for >> pg_dependencies, based on DependencyGenerator_init() & friends in >> dependencies.c. Could you look at that? > > Yes. I had already looked at it to verify that _all_ combinations were > always generated (they are), because I had some vague memory of the > generator dropping combinations that were statistically insignificant. In > retrospect, I have no idea where I got that idea. Hmm. I would need to double-check the code to be sure, but I don't think that we drop combinations, because the code prevents duplicates to begin with, even for expressions: create table aa (a int, b int); create statistics stats (ndistinct) ON a, a, b, b from aa; ERROR: 42701: duplicate column name in statistics definition create statistics stats (ndistinct) ON (a + a), ((a + a)) from aa; ERROR: 42701: duplicate expression in statistics definition These don't make sense anyway because they have a predictible and perfectly matching correlation relationship. > This is fairly simple to do. The dependency attnum is just appended to the > list of attnums, and the combinations are generated the same as ndistinct, > though obviously there are no single elements. Yeah. That should be not be bad, I hope. > There's probably some common code between the lists to be shared, differing > only in how they report missing combinations. I would like to agree on that, but it did not look that obvious to me yesterday. If you think that something could be refactored, I'd suggest a refactoring patch that applies on top of the rest of the patch set, with new generic facilities in stat_util.c, or even a new separate file, if that leads to a cleaner result (okay, a definition of "clean" is up to one's taste). -- Michael
-
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-11-17T23:35:00Z
On Mon, Nov 17, 2025 at 08:53:35PM +0800, jian he wrote: > segfault: > SELECT '[{"attributes" : [1,2,3,4,5,67,6,7,8], "ndistinct" : 4}]'::pg_ndistinct; > > because src/backend/statistics/mvdistinct.c line: 310, Assert > Assert((item->nattributes >= 2) && (item->nattributes <= STATS_MAX_DIMENSIONS)); Ahah, good one here. Perhaps we had better switch this assertion to become an elog() with the import functions in mind, as well. It may be worth double-checking the other deserialization paths, as well. We need to be super careful about this stuff for data imports. -- Michael -
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-18T02:32:37Z
> > > This feels like a different pretty still compressed output for json. > I don't think we should change the output functions to do that, but if > you want to add a function that filters these contents a bit in the > tests for the input functions, sure, why not. > +1, I'll probably just use the replaces rather than define a function and give someone the false impression that the function exists elsewhere. > > Yes, this one is to reduce the translation work, and because the > messages are quite the same across the board and deal with the same > requirements: > - Single integer expected after a key (attnum or actual value). > - Array of attribute expected after a key. > - For the degree key, float value. > > > I had a feeling that was going to be requested. My question would be if > > that we want to stick to modeling the other combinations after the first > > longest combination, last longest, or if we want to defer those checks > > altogether until we have to validate against an actual stats object? > > I would tend to think that performing one round of validation once the > whole set of objects has been parsed is going to be cheaper than > periodic checks. > > One other thing would be to force a sort of the elements in the array > to match with the order these are generated when creating the stats. > We cannot do that in the input functions because we have no idea about > the order of the attributes in the statistics object yet. Applying a > sort sounds also important to me to make sure that we order the stats > based on what the group generation functions (aka > generate_combinations(), etc.) think on the matter, which would > enforce a stronger binary compatibility after we are sure that we have > a full set of attributes listed in an array with the input function of > course. I have briefly looked at the planner code where extended > stats are used, like selfuncs.c, and the ordering does not completely > matter, it seems, but it's cheap enough to enforce a stricter ordering > based on the K groups of N elements generated in the import function. > > >> Except for this argument, the input of pg_ndistinct feels OK in terms > >> of the guarantees that we'd want to enforce on an import. The same > >> argument applies in terms of attribute number guarantees for > >> pg_dependencies, based on DependencyGenerator_init() & friends in > >> dependencies.c. Could you look at that? > > > > Yes. I had already looked at it to verify that _all_ combinations were > > always generated (they are), because I had some vague memory of the > > generator dropping combinations that were statistically insignificant. In > > retrospect, I have no idea where I got that idea. > > Hmm. I would need to double-check the code to be sure, but I don't > think that we drop combinations, because the code prevents duplicates > to begin with, even for expressions: > create table aa (a int, b int); > create statistics stats (ndistinct) ON a, a, b, b from aa; > ERROR: 42701: duplicate column name in statistics definition > create statistics stats (ndistinct) ON (a + a), ((a + a)) from aa; > ERROR: 42701: duplicate expression in statistics definition > So I looked at the generator functions, hoping they'd have enough in common that they could be made generic. And they're just different enough that I think it's not worth it to try. But, if we don't care about the order of the combinations, I also don't think we need to expose the functions at all. We know exactly how many combinations there should be for any N attributes as each attribute must be unique. So if we have the right number of unique combinations, and they're all subsets of the first-longest, then we must have a complete set. Thoughts on that? Getting _too_ tight with the ordering and contents makes me concerned for the day when the format might change. We don't want to _fail_ an upgrade because some of the combinations were in the wrong order. > These don't make sense anyway because they have a predictible and > perfectly matching correlation relationship. > They do, for now, but are we willing to lock ourselves into that forever? > > > This is fairly simple to do. The dependency attnum is just appended to > the > > list of attnums, and the combinations are generated the same as > ndistinct, > > though obviously there are no single elements. > > Yeah. That should be not be bad, I hope. > > > There's probably some common code between the lists to be shared, > differing > > only in how they report missing combinations. > > I would like to agree on that, but it did not look that obvious to me > yesterday. If you think that something could be refactored, I'd > suggest a refactoring patch that applies on top of the rest of the > patch set, with new generic facilities in stat_util.c, or even a > new separate file, if that leads to a cleaner result (okay, a > definition of "clean" is up to one's taste). > Looking over those functions, they both could have use the same generator, but the dependencies-side decided that dependency order doesn't matter, which puts doubt in my head that the order is perfectly the same for both, so we'd better follow each individually IF we want to enforce order.
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-18T02:34:14Z
On Mon, Nov 17, 2025 at 6:35 PM Michael Paquier <michael@paquier.xyz> wrote: > On Mon, Nov 17, 2025 at 08:53:35PM +0800, jian he wrote: > > segfault: > > SELECT '[{"attributes" : [1,2,3,4,5,67,6,7,8], "ndistinct" : > 4}]'::pg_ndistinct; > > > > because src/backend/statistics/mvdistinct.c line: 310, Assert > > Assert((item->nattributes >= 2) && (item->nattributes <= > STATS_MAX_DIMENSIONS)); > > Ahah, good one here. Perhaps we had better switch this assertion to > become an elog() with the import functions in mind, as well. It may > be worth double-checking the other deserialization paths, as well. We > need to be super careful about this stuff for data imports. > Yeah, I liked that one too, and I can act on that while we're debating how tightly to turn the screws on combinatorics. -
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-11-18T03:34:29Z
On Mon, Nov 17, 2025 at 09:32:37PM -0500, Corey Huinker wrote: > So I looked at the generator functions, hoping they'd have enough in common > that they could be made generic. And they're just different enough that I > think it's not worth it to try. > > But, if we don't care about the order of the combinations, I also don't > think we need to expose the functions at all. We know exactly how many > combinations there should be for any N attributes as each attribute must be > unique. So if we have the right number of unique combinations, and they're > all subsets of the first-longest, then we must have a complete set. > Thoughts on that? > > Getting _too_ tight with the ordering and contents makes me concerned for > the day when the format might change. We don't want to _fail_ an upgrade > because some of the combinations were in the wrong order. That's fair. The planner costing code pulling the stats numbers based on the attributes was smart enough to not care much about the ordering as far as I recall, but I'd rather make sure of that first. This needs some careful lookup. >> These don't make sense anyway because they have a predictible and >> perfectly matching correlation relationship. >> > > They do, for now, but are we willing to lock ourselves into that forever? Perhaps not. I cannot say for sure what's the future is going to be made of. > Looking over those functions, they both could have use the same generator, > but the dependencies-side decided that dependency order doesn't matter, > which puts doubt in my head that the order is perfectly the same for both, > so we'd better follow each individually IF we want to enforce order. I'd try to look at the bits related to pg_dependencies and pg_ndistinct as two separate concepts, at the end. They're sort of alike, but have too many differences already. -- Michael
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-18T04:39:50Z
> > > But, if we don't care about the order of the combinations, I also don't > > think we need to expose the functions at all. We know exactly how many > > combinations there should be for any N attributes as each attribute must > be > > unique. So if we have the right number of unique combinations, and > they're > > all subsets of the first-longest, then we must have a complete set. > > Thoughts on that? > > > > Getting _too_ tight with the ordering and contents makes me concerned for > > the day when the format might change. We don't want to _fail_ an upgrade > > because some of the combinations were in the wrong order. > > That's fair. The planner costing code pulling the stats numbers based > on the attributes was smart enough to not care much about the ordering > as far as I recall, but I'd rather make sure of that first. This > needs some careful lookup. > I've done some experiments, creating extended stats objects up to the 8 attribute limit. The big takeaway is that I wasn't imagining that he number of dependencies combinations is NOT deterministic: /* * if the dependency seems entirely invalid, don't store it */ if (degree == 0.0) continue; So, in theory, an empty (i.e. '[]') pg_dependencies is valid. The number of pg_ndistinct is deterministic, now, but I'm even less sure that'll be true in the future. We can definitely rely on the attnums being all the positive numbers in ascending order first, followed by the negative numbers in descending order, but that's about it. Which raises the question of how we describe the error when attnums are out of order. We know that the deserialize functions take the data's word for it as to how many items to unpack, so I don't see the impact of not caring how many might be missing. That even sort of feeds into Tom's idea that stats import was in some sense a fuzzing tool. > I'd try to look at the bits related to pg_dependencies and > pg_ndistinct as two separate concepts, at the end. They're sort of > alike, but have too many differences already. > Based on the above, I think we can't really add anything beyond the attnum order, and we have to relax some existing restrictions on pg_dependencies... -
Re: Extended Statistics set/restore/clear functions.
jian he <jian.universality@gmail.com> — 2025-11-18T05:07:23Z
On Mon, Nov 17, 2025 at 2:56 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Nov 14, 2025 at 03:25:27PM +0900, Michael Paquier wrote: > > Thanks for the new versions, I'll also look at all these across the > > next couple of days. Probably not at 0005~ for now. > > 0001 and 0002 from series v13 have been applied to change the output > functions. > > And I have looked at 0003 in details for now. Attached is a revised > version for it, with many adjustments. Some notes: > - Many portions of the coverage were missing. I have measured the > coverage at 91% with the updated version attached. This includes > coverage for some error reporting, something that we rely a lot on for > this code. > - The error reports are made simpler, with the token values getting > hidden. While testing with some fancy values, I have actually noticed > that the error handlings for the parsing of the int16 and int32 values > were incorrect, the error reports used what the safe functions > generated, not the reports from the data type. > - Passing down arbitrary bytes sequences was leading to these bytes > reported in the error outputs because we cared about the token values. > I have added a few tests based on that for the code paths involved. > hi. in src/backend/statistics/mvdistinct.c, we have: Assert(AttributeNumberIsValid(item->attributes[j])); should we disallow 0 in key attributes? SELECT '[{"attributes" : [0,1], "ndistinct" : 4}]'::pg_ndistinct; I didn't find a way to trigger this Assert yet. + errsave(parse->escontext, + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), + errmsg("malformed pg_ndistinct: \"%s\"", parse->str), + errdetail("Invalid \"%s\" value.", PG_NDISTINCT_KEY_ATTRIBUTES)); + errsave(parse->escontext, + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), + errmsg("malformed pg_ndistinct: \"%s\"", parse->str), + errdetail("Invalid \"%s\" value.", + PG_NDISTINCT_KEY_NDISTINCT)); the errdetail is way too generic? similar to ``select 'a'::int;`` we can DETAIL: Invalid input syntax for type integer: "a" HINT: "ndistinct" value expected to be a type of integer. what do you think? we already have "fname" in ndistinct_object_field_start, we can also print out the "fname", like: errsave(parse->escontext, errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), errmsg("malformed pg_ndistinct: \"%s\"", parse->str), errdetail("Unexpected key \"%s\"", fname), errhint("Only allowed keys are \"%s\" and \"%s\".", PG_NDISTINCT_KEY_ATTRIBUTES, PG_NDISTINCT_KEY_NDISTINCT)); SELECT '[{"attributes" : [2,3], "ndistinct" : 4, "ndistinct" : 14}]'::pg_ndistinct; pg_ndistinct ------------------------------------------- [{"attributes": [2, 3], "ndistinct": 14}] SELECT '[{"attributes" : [2,3], "ndistinct" : 4, "attributes" : []}]'::pg_ndistinct; pg_ndistinct ------------------------------------------ [{"attributes": [2, 3], "ndistinct": 4}] Is the above output what we expected? + /* + * We need at least two attribute numbers for a ndistinct item, anything + * less is malformed. + */ + natts = parse->attnum_list->length; here, we can use list_length. + if (parse->attnum_list != NIL) + if (parse->distinct_items != NIL) here, we can also use list_length. -- jian https://www.enterprisedb.com/ -
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-11-18T05:39:04Z
On Tue, Nov 18, 2025 at 01:07:23PM +0800, jian he wrote: > + errsave(parse->escontext, > + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), > + errmsg("malformed pg_ndistinct: \"%s\"", parse->str), > + errdetail("Invalid \"%s\" value.", > + PG_NDISTINCT_KEY_NDISTINCT)); > > the errdetail is way too generic? > similar to ``select 'a'::int;`` > we can > DETAIL: Invalid input syntax for type integer: "a" > HINT: "ndistinct" value expected to be a type of integer. > > what do you think? That is intentional, as it is intentional to not show the value of the token in such cases. This will be mostly used for the import functions, so I am not sure that it is worth spending time on tuning all these error cases that one is most likely not going to see as the main scenarios are going to be through pg_dump/restore, most of the time in the scope of an upgrade. > SELECT '[{"attributes" : [2,3], "ndistinct" : 4, "ndistinct" : > 14}]'::pg_ndistinct; > pg_ndistinct > ------------------------------------------- > [{"attributes": [2, 3], "ndistinct": 14}] > > SELECT '[{"attributes" : [2,3], "ndistinct" : 4, "attributes" : > []}]'::pg_ndistinct; > pg_ndistinct > ------------------------------------------ > [{"attributes": [2, 3], "ndistinct": 4}] > > Is the above output what we expected? Interesting one. The extra "attributes" should not be required once we have one set, indeed. -- Michael -
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-18T08:52:31Z
> > > SELECT '[{"attributes" : [2,3], "ndistinct" : 4, "ndistinct" : > > 14}]'::pg_ndistinct; > > pg_ndistinct > > ------------------------------------------- > > [{"attributes": [2, 3], "ndistinct": 14}] > > > > SELECT '[{"attributes" : [2,3], "ndistinct" : 4, "attributes" : > > []}]'::pg_ndistinct; > > pg_ndistinct > > ------------------------------------------ > > [{"attributes": [2, 3], "ndistinct": 4}] > > > > Is the above output what we expected? > > Interesting one. The extra "attributes" should not be required once > we have one set, indeed. > -- > Michael > v15: - catches duplicate object keys cited above - enforces attnum ordering (ascending positive numbers followed by descending negative numbers, no zeros allowed), which means we get duplicate attnum detection for free - attnum validation is now done as soon as the attnum is parsed - tests refactored to put attnums in proper order - unfortunately, this means that one of the error cases from stats_import.sql (attnum = 0) is now an error rather than something that can be soft-excluded. - didn't enforce combinatorical completeness for dependencies because not all combinations are guaranteed to be there. - didn't enforce combinatorical completeness for ndistinct because I'm not convinced we should. -
Re: Extended Statistics set/restore/clear functions.
jian he <jian.universality@gmail.com> — 2025-11-19T07:49:56Z
On Tue, Nov 18, 2025 at 4:52 PM Corey Huinker <corey.huinker@gmail.com> wrote: > v15: > > - catches duplicate object keys cited above > - enforces attnum ordering (ascending positive numbers followed by descending negative numbers, no zeros allowed), which means we get duplicate attnum detection for free > - attnum validation is now done as soon as the attnum is parsed > - tests refactored to put attnums in proper order > - unfortunately, this means that one of the error cases from stats_import.sql (attnum = 0) is now an error rather than something that can be soft-excluded. > - didn't enforce combinatorical completeness for dependencies because not all combinations are guaranteed to be there. > - didn't enforce combinatorical completeness for ndistinct because I'm not convinced we should. > hi. some of the switch->default, default don't have ``break``. + for (int i = 0; i < nitems; i++) + { + MVNDistinctItem *item = parse_state.distinct_items->elements[i].ptr_value; exposing the ptr_value seems not a good idea, we can foreach_ptr the attached patch using foreach_ptr. in function pg_ndistinct_in some errsave can change to ereturn. (I didn't do this part, though). + /* + * The attnum cannot be zero a negative number beyond the number of the + * possible expressions. + */ + if (attnum == 0 || attnum < (0-STATS_MAX_DIMENSIONS)) + { + errsave(parse->escontext, + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), + errmsg("malformed pg_ndistinct: \"%s\"", parse->str), + errdetail("Invalid \"%s\" element: %d.", + PG_NDISTINCT_KEY_ATTRIBUTES, attnum)); + return JSON_SEM_ACTION_FAILED; + } This part had no coverage tests, so I added a few. as mentioned before + errsave(parse->escontext, + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), + errmsg("malformed pg_ndistinct: \"%s\"", parse->str), + errdetail("The \"%s\" key must contain an array of at least %d " + "and no more than %d attributes.", + PG_NDISTINCT_KEY_NDISTINCT, 2, STATS_MAX_DIMENSIONS)); here PG_NDISTINCT_KEY_NDISTINCT, should be PG_NDISTINCT_KEY_ATTRIBUTES. Please check the attached minor miscellaneous changes. -- jian https://www.enterprisedb.com/ -
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-20T23:08:02Z
> > > some of the switch->default, default don't have ``break``. > The compiler doesn't require them, but I see that we do use them in a lot of places, so I'll incorporate this. > > + for (int i = 0; i < nitems; i++) > + { > + MVNDistinctItem *item = > parse_state.distinct_items->elements[i].ptr_value; > > exposing the ptr_value seems not a good idea, we can foreach_ptr > the attached patch using foreach_ptr. I didn't like this because it makes getting the index value harder, and the index value is needed in lots of places. Instead I used list_nth() and list_nth_int(). > in function pg_ndistinct_in some errsave can change to ereturn. > (I didn't do this part, though). > -0.25 There's something that I like about the consistency of errsave() followed by a return that makes it clear that "the function ends here" that I don't get from ereturn(). What I would really like, is a way to generate unique (but translated) errdetail values, and move the errsave/ereturn to after the switch statement. > > + /* > + * The attnum cannot be zero a negative number beyond the number of the > + * possible expressions. > + */ > + if (attnum == 0 || attnum < (0-STATS_MAX_DIMENSIONS)) > + { > + errsave(parse->escontext, > + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), > + errmsg("malformed pg_ndistinct: \"%s\"", parse->str), > + errdetail("Invalid \"%s\" element: %d.", > + PG_NDISTINCT_KEY_ATTRIBUTES, attnum)); > + return JSON_SEM_ACTION_FAILED; > + } > This part had no coverage tests, so I added a few. > +1 > > > as mentioned before > + errsave(parse->escontext, > + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), > + errmsg("malformed pg_ndistinct: \"%s\"", parse->str), > + errdetail("The \"%s\" key must contain an array of at least %d " > + "and no more than %d attributes.", > + PG_NDISTINCT_KEY_NDISTINCT, 2, STATS_MAX_DIMENSIONS)); > here PG_NDISTINCT_KEY_NDISTINCT, should be PG_NDISTINCT_KEY_ATTRIBUTES. > +1 Did similar things to pg_dependencies. -
Re: Extended Statistics set/restore/clear functions.
Chao Li <li.evan.chao@gmail.com> — 2025-11-21T03:36:23Z
After 16 rounds of revisions, v16 looks solid and robust. Only got a few small comments: > On Nov 21, 2025, at 07:08, Corey Huinker <corey.huinker@gmail.com> wrote: > > <v16-0001-Add-working-input-function-for-pg_ndistinct.patch><v16-0002-Add-working-input-function-for-pg_dependencies.patch><v16-0003-Expose-attribute-statistics-functions-for-use-in.patch><v16-0004-Add-extended-statistics-support-functions.patch><v16-0005-Include-Extended-Statistics-in-pg_dump.patch> 1 - 0001 ``` + /* + * We need at least two attribute numbers for a ndistinct item, anything + * less is malformed. + */ + natts = list_length(parse->attnum_list); + if ((natts < 2) || (natts > STATS_MAX_DIMENSIONS)) + { + errsave(parse->escontext, + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), + errmsg("malformed pg_ndistinct: \"%s\"", parse->str), + errdetail("The \"%s\" key must contain an array of at least %d " + "and no more than %d attributes.", + PG_NDISTINCT_KEY_ATTRIBUTES, 2, STATS_MAX_DIMENSIONS)); + return JSON_SEM_ACTION_FAILED; + } ``` Do we also need to reset item state vars similar to the success case? 2 - 0001 ``` * + * Arrays can never be empty. + */ +static JsonParseErrorType +ndistinct_array_end(void *state) +{ ``` This function comment is too simple, and doesn’t match to the function name. Reading the function body, the logic is clear, so I would suggest either just remove the comment or enhance it. 3 - 0001 ``` static JsonParseErrorType +ndistinct_array_element_start(void *state, bool isnull) +{ + NDistinctParseState *parse = state; ``` Nit: NDistinctParseState * can be const. 4 - 0001 ``` +static bool +valid_subsequent_attnum(const AttrNumber prev, const AttrNumber cur) ``` AttrNumber is int16, thus “const” here is unnecessary. 5 - 0001 ``` + return JSON_SUCCESS; + break; ``` “Break” after “return” is unnecessary. I tried to delete the “break” and didn’t see a compile warning, it should be safe to delete the “break”. 6 - 0001 ``` +/* + * Compare the attribute arrays of two MVNDistinctItem values, + * looking for duplicate sets. + */ +static bool +has_duplicate_attributes(const MVNDistinctItem *a, + const MVNDistinctItem *b) ``` The function comment says “looking for duplicate sets”, seems to imply the function would return the set, but the function returns a bool, which is a little confusing. 7 - 0002 ``` + for (int i = 0; i < natts; i++) + { + dep->attributes[i] = (AttrNumber) list_nth_int(parse->attnum_list, i); + if (dep->attributes[i] == parse->dependency) + { + errsave(parse->escontext, + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), + errmsg("malformed pg_dependencies: \"%s\"", parse->str), + errdetail("Item \"%s\" value %d found in the \"%s\" list.", + PG_DEPENDENCIES_KEY_DEPENDENCY, parse->dependency, + PG_DEPENDENCIES_KEY_ATTRIBUTES)); + return JSON_SEM_ACTION_FAILED; + } ``` Similar to comment 1, do we need to reset state vars upon failure? 8 - 0002 ``` + errdetail("The \"%s\" key must contain an array of at least %d " + " and no than %d elements.", ``` I believe “more” is missed between “no than”. 9 - 0002 ``` +static JsonParseErrorType +dependencies_array_start(void *state) +{ + DependenciesParseState *parse = state; ``` Similar to commit 3, DependenciesParseState * can be const. 10 - 0002 ``` static bool valid_subsequent_attnum(const AttrNumber prev, const AttrNumber cur) ``` This function is a dup to the one in 0001, can we put it to a common place? 11 - 0003 ``` +extern void statatt_get_type(Oid reloid, AttrNumber attnum, + Oid *atttypid, int32 *atttypmod, + char *atttyptype, Oid *atttypcoll, + Oid *eq_opr, Oid *lt_opr); +extern void statatt_init_empty_tuple(Oid reloid, int16 attnum, bool inherited, + Datum *values, bool *nulls, bool *replaces); + +extern void statatt_set_slot(Datum *values, bool *nulls, bool *replaces, + int16 stakind, Oid staop, Oid stacoll, + Datum stanumbers, bool stanumbers_isnull, + Datum stavalues, bool stavalues_isnull); + +extern Datum text_to_stavalues(const char *staname, FmgrInfo *array_in, Datum d, + Oid typid, int32 typmod, bool *ok); +extern bool statatt_get_elem_type(Oid atttypid, char atttyptype, + Oid *elemtypid, Oid *elem_eq_opr); ``` Several functions are made external visible, they are all renamed with adding a prefix “statatt_”, why text_to_stavalues is an exception? 11 - 0004 ``` +bool +pg_dependencies_validate_deps(MVDependencies *dependencies, int2vector *stxkeys, int numexprs, int elevel) +{ + int attnum_expr_lowbound = 0 - numexprs; + + for (int i = 0; i < dependencies->ndeps; i++) + { + MVDependency *dep = dependencies->deps[i]; ``` This MVDependency * can be const. 12 - 0004 ``` static void upsert_pg_statistic_ext_data(Datum *values, bool *nulls, bool *replaces) { ``` This function pass values, nulls and replaces to heap_modify_tuple() and heap_form_tuple(), the both functions take all const pointers as parameters. So, here values, nulls and replaces can all be const. 13 - 0004 ``` + if (!exprs_nulls[offset + AVG_WIDTH_ELEM]) + { + ok = text_to_int4(exprs_elems[offset + AVG_WIDTH_ELEM], + &values[Anum_pg_statistic_stawidth - 1]); + + if (!ok) + { + char *s = TextDatumGetCString(exprs_elems[offset + NULL_FRAC_ELEM]); + + ereport(WARNING, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("Expression %s element \"%s\" does not match expected input type.", + extexprarginfo[AVG_WIDTH_ELEM].argname, s))); + pfree(s); + return (Datum) 0; + } + } ``` Here: char *s = TextDatumGetCString(exprs_elems[offset + NULL_FRAC_ELEM]); Should NULL_FRAC_ELEM be AVG_WIDTH_ELEM? 14 - 0004 ``` + if (!exprs_nulls[offset + N_DISTINCT_ELEM]) + { + ok = text_to_float4(exprs_elems[offset + N_DISTINCT_ELEM], + &values[Anum_pg_statistic_stadistinct - 1]); + + if (!ok) + { + char *s = TextDatumGetCString(exprs_elems[offset + NULL_FRAC_ELEM]); + + ereport(WARNING, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("Expression %s element \"%s\" does not match expected input type.", + extexprarginfo[N_DISTINCT_ELEM].argname, s))); + pfree(s); + return (Datum) 0; + } + } ``` Similar to comment 13, should NULL_FRAC_ELEM be N_DISTINCT_ELEM? 15 - 0004 ``` +extern Datum import_mcvlist(HeapTuple tup, int elevel, int numattrs, + Oid *atttypids, int32 *atttypmods, Oid *atttypcolls, + int nitems, Datum *mcv_elems, bool *mcv_nulls, + bool *mcv_elem_nulls, float8 *freqs, float8 *base_freqs); + +extern Datum import_mcvlist(HeapTuple tup, int elevel, int numattrs, + Oid *atttypids, int32 *atttypmods, Oid *atttypcolls, + int nitems, Datum *mcv_elems, bool *mcv_nulls, + bool *mcv_elem_nulls, float8 *freqs, float8 *base_freqs); ``` import_mcvlist is declared twice, looks like a copy-paste mistake. 16 - 0005 ``` + PREPQUERY_DUMPEXTSTATSSTATS, ``` “statsstats” looks weird, I’d suggest PREPQUERY_DUMPEXTSTATSDATA. Best regards, — Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/ -
Re: Extended Statistics set/restore/clear functions.
Yuefei Shi <shiyuefei1004@gmail.com> — 2025-11-21T05:33:56Z
On Fri, Nov 21, 2025 at 10:54 AM Corey Huinker <corey.huinker@gmail.com> wrote: > >> some of the switch->default, default don't have ``break``. >> > > The compiler doesn't require them, but I see that we do use them in a lot > of places, so I'll incorporate this. > > >> >> + for (int i = 0; i < nitems; i++) >> + { >> + MVNDistinctItem *item = >> parse_state.distinct_items->elements[i].ptr_value; >> >> exposing the ptr_value seems not a good idea, we can foreach_ptr >> the attached patch using foreach_ptr. > > > I didn't like this because it makes getting the index value harder, and > the index value is needed in lots of places. Instead I used list_nth() and > list_nth_int(). > > > >> in function pg_ndistinct_in some errsave can change to ereturn. >> (I didn't do this part, though). >> > > -0.25 > > There's something that I like about the consistency of errsave() followed > by a return that makes it clear that "the function ends here" that I don't > get from ereturn(). > > What I would really like, is a way to generate unique (but translated) > errdetail values, and move the errsave/ereturn to after the switch > statement. > > >> >> + /* >> + * The attnum cannot be zero a negative number beyond the number of the >> + * possible expressions. >> + */ >> + if (attnum == 0 || attnum < (0-STATS_MAX_DIMENSIONS)) >> + { >> + errsave(parse->escontext, >> + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), >> + errmsg("malformed pg_ndistinct: \"%s\"", parse->str), >> + errdetail("Invalid \"%s\" element: %d.", >> + PG_NDISTINCT_KEY_ATTRIBUTES, attnum)); >> + return JSON_SEM_ACTION_FAILED; >> + } >> This part had no coverage tests, so I added a few. >> > > +1 > > > >> >> >> as mentioned before >> + errsave(parse->escontext, >> + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), >> + errmsg("malformed pg_ndistinct: \"%s\"", parse->str), >> + errdetail("The \"%s\" key must contain an array of at least %d " >> + "and no more than %d attributes.", >> + PG_NDISTINCT_KEY_NDISTINCT, 2, STATS_MAX_DIMENSIONS)); >> here PG_NDISTINCT_KEY_NDISTINCT, should be PG_NDISTINCT_KEY_ATTRIBUTES. >> > > +1 > > Did similar things to pg_dependencies. > > A few small comments. 1. Minor typo fixes: In pg_dependencies_in: "pg_dependencies parssing claims..." could be corrected to "pg_dependencies parsing claims..." In ndistinct_array_end: "must be an non-empty array" might be better as "must be a non-empty array" In the comment for dependencies_object_field_start: "depeendency" appears to be a typo for "dependency" 2.Code maintainability suggestion: I noticed the string "malformed pg_dependencies: "%s"" is used repeatedly throughout the code. Would you consider defining this as a macro? This could reduce duplication and make future updates easier. 3.Memory management observation: Regarding item_attnum_list, while PostgreSQL's memory context mechanism handles cleanup, explicitly freeing the allocated memory after use might improve code clarity. These are all minor points - the implementation looks solid overall. Thank you for your work on this feature! -
Re: Extended Statistics set/restore/clear functions.
jian he <jian.universality@gmail.com> — 2025-11-21T07:16:26Z
hi. makeJsonLexContextCstringLen in v16-0001 and v16-0002 should use GetDatabaseEncoding()? wondering, should we check the value of "degree" is between 0 to 1? + if ((natts < 1) || (natts > (STATS_MAX_DIMENSIONS - 1))) + { + errsave(parse->escontext, + errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), + errmsg("malformed pg_dependencies: \"%s\"", parse->str), + errdetail("The \"%s\" key must contain an array of at least %d " + " and no than %d elements.", + PG_DEPENDENCIES_KEY_ATTRIBUTES, 1, STATS_MAX_DIMENSIONS - 1)); DETAIL: The "attributes" key must contain an array of at least 1 and no than 7 elements. if you check the error message carefully, there is one more extra white space after "at least 1". typedef struct MVDependency { double degree; /* degree of validity (0-1) */ AttrNumber nattributes; /* number of attributes */ AttrNumber attributes[FLEXIBLE_ARRAY_MEMBER]; /* attribute numbers */ } MVDependency; SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "NaN"}]'::pg_dependencies; SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "-inf"}]'::pg_dependencies; SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "inf"}]'::pg_dependencies; either error out or not, pg_dependencies.sql needs tests like the above to test these special values. json does not support special value like, "inf", "NaN", so we can SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "-inf"}]'::pg_dependencies; but can not SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "-inf"}]'::pg_dependencies::text::pg_dependencies; I found that numeric values behave the same with or without double quotes. SELECT '[{"ndistinct" : "1", "attributes" : ["2","3"]}]'::pg_ndistinct; SELECT '[{"ndistinct" : 1, "attributes" : [2,3]}]'::pg_ndistinct; SELECT '[{"attributes" : [1, 2], "dependency" : 4, "degree": "1"}]'::pg_dependencies; SELECT '[{"attributes" : [1, "2"], "dependency" : "4", "degree": "1"}]'::pg_dependencies; i guess the reason is in makeJsonLexContextCstringLen, we set need_escapes to true. I added more regress tests to improve coverage for src/test/regress/sql/pg_dependencies.sql SELECT '[{"attributes" : [-11,1], "dependency" : -1116, "degree": "1.2"}]'::pg_dependencies; DETAIL: Invalid "attributes" element: -11. we already check the key "attributes", we can also check "dependency", IF the value of ""dependency" does not meet ``(if (parse->dependency == 0 || parse->dependency < (0-STATS_MAX_DIMENSIONS))`` then error out. see the attached patch. -- jian https://www.enterprisedb.com/ -
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-21T22:41:15Z
> > > A few small comments. > > 1. Minor typo fixes: > +1 Thanks. > 2.Code maintainability suggestion: > I noticed the string "malformed pg_dependencies: "%s"" is used repeatedly > throughout the code. Would you consider defining this as a macro? This > could reduce duplication and make future updates easier. > It seems to be the way of things. As I stated earlier, what I'd really like is the ability to form different errdetails and have them feed into one errsave(), and still have the strings go through translation. > > 3.Memory management observation: > Regarding item_attnum_list, while PostgreSQL's memory context mechanism > handles cleanup, explicitly freeing the allocated memory after use might > improve code clarity. > > These are all minor points - the implementation looks solid overall. Thank > you for your work on this feature! > +1
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-22T08:26:19Z
> > > wondering, should we check the value of "degree" is between 0 to 1? > I'm open to it, but there was significant pushback to having tight validity checks on other stats types. The more things that we make errors, the more chances we have to breaking a dump/restore. Granted, such values should never happen, but odd things do happen. > > IF the value of ""dependency" > does not meet > ``(if (parse->dependency == 0 || parse->dependency < > (0-STATS_MAX_DIMENSIONS))`` > then error out. > see the attached patch. > Good catch. Incorporated these patches as well as Yuefei's feedback especially around memory management. I added a comment debating the feasibility of testing for subsets of attribute sets in pg_dependencies. Basically, I think we can't have the test at all, but I haven't removed it just yet pending consensus.
-
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-11-23T06:57:31Z
On Sat, Nov 22, 2025 at 03:26:19AM -0500, Corey Huinker wrote: > I'm open to it, but there was significant pushback to having tight validity > checks on other stats types. Yeah, I am not sure that there is much value in having strict checks for the values. Some control over the arguments listed in a single object is much more valuable as these are easy to check in the input functions. > The more things that we make errors, the more chances we have to breaking a > dump/restore. Granted, such values should never happen, but odd things do > happen. Indeed. Like the existing import functions, these should not lead to hard failures, but WARNING-level reports. -- Michael
-
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-11-25T07:28:51Z
On Sat, Nov 22, 2025 at 03:26:19AM -0500, Corey Huinker wrote: > I added a comment debating the feasibility of testing for subsets of > attribute sets in pg_dependencies. Basically, I think we can't have the > test at all, but I haven't removed it just yet pending consensus. + * Verify that all attnum sets are a proper subset of the first longest + * attnum set. + * + * TODO: + * + * I'm fairly certain that because statisticsally insignificant dependency + * combinations are not stored, there is a chance that the longest dependency + * does not exist, and therefore this test cannot be done. I have left the + * test in place for the time being until the issue can be definitively + * settled. As you have already quoted upthread, statext_dependencies_build() settles the issue on this one, I think. It is entirely possible that any group returned by DependencyGenerator generates a degree value that would prevent a given group to be stored, and this could as well be the largest possible group there could be in the set. So we cannot do any of that for dependencies, unfortunately. We can always rely on the list of attributes when assigning the json blob to the stats object, at least, cross-checking that each attribute list matches with the numbers of the stats object. At least we can check for duplicates, which is better than nothing at all. Regarding the suggested check where we'd want to enforce all the groups of attributes to be listed depending on the longest set we have found, at the end estimate_multivariate_ndistinct() checks the items listed one-by-one, giving up if we cannot find something in the list of items. I think that I am going to be content with the patch as it is, without this piece. Let's add an extra SQL test to treat that as valid input, though. So I am feeling OK with the input for ndistinct at this stage. I have noticed a couple of issues in passing, adjusting them. We are reaching more than 90% of coverage with the tests, and I am not sure that we can actually reach the rest except if one of the previous steps failed. So That's one. Now into the second patch for the input of the dependencies. +SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "NaN"}]'::pg_dependencies; +SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "-inf"}]'::pg_dependencies; +SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "inf"}]'::pg_dependencies; +SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "-inf"}]'::pg_dependencies::text::pg_dependencies; Okay, I have to admit that these ones are fun. I doubt that anybody would actually do that, and these do not produce valid json objects, which is what the last case shows. Hmm, it makes sense to keep these, and I'm still siding that we should not care too much about applying checks on the values and complicate the input function more than that, so fine by me. There were a couple of things in the tests, missing quite a few soft errors. Many typos, grammar mistakes in the whole. Also, please do not split the error strings into multiple lines to make these greppable. There is also no need for a break after a return. In some cases, a return was used where a break made more sense as the default path returned a failure.. The TODO in build_mvdependencies() could be an elog(), but I have left it untouched for the errdetail(). We're reaching 91% of coverage here, not bad. The rest does not seem reachable, as far as I can see. With that said, a v18 for the first two patches with the input functions. Comments and/or opinions? -- Michael -
Re: Extended Statistics set/restore/clear functions.
Chao Li <li.evan.chao@gmail.com> — 2025-11-25T08:59:28Z
> On Nov 25, 2025, at 15:28, Michael Paquier <michael@paquier.xyz> wrote: > > On Sat, Nov 22, 2025 at 03:26:19AM -0500, Corey Huinker wrote: >> I added a comment debating the feasibility of testing for subsets of >> attribute sets in pg_dependencies. Basically, I think we can't have the >> test at all, but I haven't removed it just yet pending consensus. > > + * Verify that all attnum sets are a proper subset of the first longest > + * attnum set. > + * > + * TODO: > + * > + * I'm fairly certain that because statisticsally insignificant dependency > + * combinations are not stored, there is a chance that the longest dependency > + * does not exist, and therefore this test cannot be done. I have left the > + * test in place for the time being until the issue can be definitively > + * settled. > > As you have already quoted upthread, statext_dependencies_build() > settles the issue on this one, I think. It is entirely possible that > any group returned by DependencyGenerator generates a degree value > that would prevent a given group to be stored, and this could as well > be the largest possible group there could be in the set. So we cannot > do any of that for dependencies, unfortunately. We can always rely on > the list of attributes when assigning the json blob to the stats > object, at least, cross-checking that each attribute list matches with > the numbers of the stats object. At least we can check for > duplicates, which is better than nothing at all. > > Regarding the suggested check where we'd want to enforce all the > groups of attributes to be listed depending on the longest set we have > found, at the end estimate_multivariate_ndistinct() checks the items > listed one-by-one, giving up if we cannot find something in the list > of items. I think that I am going to be content with the patch as it > is, without this piece. Let's add an extra SQL test to treat that as > valid input, though. So I am feeling OK with the input for ndistinct > at this stage. I have noticed a couple of issues in passing, > adjusting them. We are reaching more than 90% of coverage with the > tests, and I am not sure that we can actually reach the rest except if > one of the previous steps failed. > > So That's one. Now into the second patch for the input of the > dependencies. > > +SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "NaN"}]'::pg_dependencies; > +SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "-inf"}]'::pg_dependencies; > +SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "inf"}]'::pg_dependencies; > +SELECT '[{"attributes" : [2], "dependency" : 4, "degree": "-inf"}]'::pg_dependencies::text::pg_dependencies; > > Okay, I have to admit that these ones are fun. I doubt that anybody > would actually do that, and these do not produce valid json objects, > which is what the last case shows. Hmm, it makes sense to keep these, > and I'm still siding that we should not care too much about applying > checks on the values and complicate the input function more than that, > so fine by me. > > There were a couple of things in the tests, missing quite a few soft > errors. Many typos, grammar mistakes in the whole. Also, please do > not split the error strings into multiple lines to make these > greppable. There is also no need for a break after a return. In some > cases, a return was used where a break made more sense as the default > path returned a failure.. > > The TODO in build_mvdependencies() could be an elog(), but I have left > it untouched for the errdetail(). > > We're reaching 91% of coverage here, not bad. The rest does not seem > reachable, as far as I can see. > > With that said, a v18 for the first two patches with the input > functions. Comments and/or opinions? > -- > Michael > <v18-0001-Add-working-input-function-for-pg_ndistinct.patch><v18-0002-Add-working-input-function-for-pg_dependencies.patch> I don’t see any of my comments are addressed in v18. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/ -
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-11-25T10:06:09Z
On Tue, Nov 25, 2025 at 04:59:28PM +0800, Chao Li wrote: > I don’t see any of my comments are addressed in v18. Sorry about that, I thought that v17 was addressing these. I have reviewed your latest email posted here: https://www.postgresql.org/message-id/2C8A20C7-4E08-41F4-B30E-AD9F8B007AE5@gmail.com And well, this presents a bunch of issues I have already fixed (break after return, one error message incorrect, some descriptions). The comments on top of the functions are actually true as they stand (aka ndistinct_array_end() fails if we don't have an array). There is no need to reset state vars upon a JSON_SEM_ACTION_FAILED AFAIK. Finally, the consts are non-appealing for me for functions that are internally used in these files. And forgot one: valid_subsequent_attnum()'s duplication is actually adapted IMO, the argument lists could have different assumptions depending on the type of extended stats we are dealing with. I did not check the comments for 0003~0005. Perhaps you are right that these need adjustments. Thanks, -- Michael
-
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-11-26T02:06:10Z
On Tue, Nov 25, 2025 at 04:28:51PM +0900, Michael Paquier wrote: > With that said, a v18 for the first two patches with the input > functions. Comments and/or opinions? Okay. I have gone other these two and after an extra round of polishing, mainly around comments (some suggested by Chao, actually), style, a fix for the test with valid UTF8 sequences failing in non-UTF8 databases, plus a few more tests that were missing, I have applied the two patches for the input functions. Perhaps these will need some adjustments, but I also see little point in moving on with these based on the coverage we have and how the code is shaped. Could you rebase the rest of the patch set for the three remaining pieces, Corey? You will also need to address some of the comments from Chao based on a review of v16, that were not addressed as of v17. -- Michael
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-26T05:14:26Z
> > > I don’t see any of my comments are addressed in v18. > > My apologies. My v17 focused entirely on the input functions, as those were receiving the vast majority of the attention. Now that those are out of the way (Thanks Michael!) I can address those issues.
-
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-11-26T05:24:30Z
On Tue, Nov 25, 2025 at 11:14:26PM -0600, Corey Huinker wrote: > My apologies. My v17 focused entirely on the input functions, as those were > receiving the vast majority of the attention. Now that those are out of the > way (Thanks Michael!) I can address those issues. No problem, thanks. Please note that I am not sure if I will be able to look at anything else posted on this thread until next week, so please feel free to continue the discussion for the time being. -- Michael
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-26T05:32:49Z
> Finally, the consts are non-appealing for me for functions that are > internally used in these files. I'm curious why you find them non-appealing for non-exposed functions, if you have time to elaborate. > And forgot one: > valid_subsequent_attnum()'s duplication is actually adapted IMO, the > argument lists could have different assumptions depending on the type > of extended stats we are dealing with. Yeah, I saw that, reviewing the diff-of-diffs just now.
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-26T05:34:56Z
> > > No problem, thanks. Please note that I am not sure if I will be able > to look at anything else posted on this thread until next week, so > please feel free to continue the discussion for the time being. > > It's holiday season here too. I'm going to get the rebase out and see if any lessons learned can be applied to the remaining patches, but my own availability may be limited.
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-26T05:37:09Z
> > > non-UTF8 databases, plus a few more tests that were missing, I have > applied the two patches for the input functions. > I've reviewed what you committed and it looks good to me aside from the one nitpick I already made. > Perhaps these will need some adjustments, but I also see little point > in moving on with these based on the coverage we have and how the code > is shaped. Could you rebase the rest of the patch set for the three > remaining pieces, Corey? You will also need to address some of the > comments from Chao based on a review of v16, that were not addressed > as of v17. > +1
-
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-11-26T05:44:23Z
On Tue, Nov 25, 2025 at 11:37:09PM -0600, Corey Huinker wrote: > I've reviewed what you committed and it looks good to me aside from the one > nitpick I already made. I was not really excited about these as it concerns only code local to each file where the input functions are located, no routines that are actually published for consumption by other parts of the backend. -- Michael
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-26T05:50:14Z
On Tue, Nov 25, 2025 at 11:44 PM Michael Paquier <michael@paquier.xyz> wrote: > On Tue, Nov 25, 2025 at 11:37:09PM -0600, Corey Huinker wrote: > > I've reviewed what you committed and it looks good to me aside from the > one > > nitpick I already made. > > I was not really excited about these as it concerns only code local to > each file where the input functions are located, no routines that are > actually published for consumption by other parts of the backend. > -- > Michael > So if a function were previously static to one page, but now had utility outside of that file, you would then consider changing the function signature to add the consts? I ask because this may become relevant with the statatt_* functions being moved in the remaining patches.
-
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-11-26T06:16:54Z
On Tue, Nov 25, 2025 at 11:50:14PM -0600, Corey Huinker wrote: > So if a function were previously static to one page, but now had utility > outside of that file, you would then consider changing the function > signature to add the consts? I ask because this may become relevant with > the statatt_* functions being moved in the remaining patches. That's the kind of analysis that requires a case-by-case lookup. The input functions were not really that worth bothering to me based on their logic being very isolated. The refactoring pieces with attribute stats that you are suggesting in 0003 may be indeed more relevant if done this way. -- Michael
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-11-26T09:21:46Z
On Tue, Nov 25, 2025 at 11:14 PM Corey Huinker <corey.huinker@gmail.com> wrote: > >> I don’t see any of my comments are addressed in v18. >> >> > My apologies. My v17 focused entirely on the input functions, as those > were receiving the vast majority of the attention. Now that those are out > of the way (Thanks Michael!) I can address those issues. > Paraphrasing the "quotes" here for brevity... > Several functions are made external visible, they are all renamed with adding a prefix “statatt_”, why text_to_stavalues is an exception? Michael had specifically said that one didn't need to be renamed. I suppose statatt_import_stavalues() might be a good name for it. It *is* specific to attribute stats, though that definition also applies to the attribute stats nested in the stxdexprs of extended stats. I have no strong opinion on the matter. > This MVDependency * can be const. +1 > static void > upsert_pg_statistic_ext_data(Datum *values, bool *nulls, bool *replaces) > { > ``` > This function pass values, nulls and replaces to heap_modify_tuple() and heap_form_tuple(), > the both functions take all const pointers as parameters. > So, here values, nulls and replaces can all be const. I find your argument here persuasive enough to override Michael's previously stated non-excitement. > ...the two NULL_FRAC questions Yes, fixed. > import_mcvlist is declared twice, looks like a copy-paste mistake. That or a rebase/apply gone wrong. Fixed. > PREPQUERY_DUMPEXTSTATSSTATS weird, suggest PREPQUERY_DUMPEXTSTATSDATA Awkward names are almost inevitable when talking about the statistics associated with an object of type "statistics". There was a debate about whether statistics were data or not, and I'd rather not restart that, so I went with PREPQUERY_DUMPEXTSTATSOBJSTATS for now. Incorporated these fixes, and some other lessons learned. -
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-12-03T05:00:31Z
On Wed, Nov 26, 2025 at 03:21:46AM -0600, Corey Huinker wrote: > There was a debate about whether statistics were data or not, and I'd > rather not restart that, so I went with PREPQUERY_DUMPEXTSTATSOBJSTATS for > now. > > Incorporated these fixes, and some other lessons learned. I've managed to miss the update you have sent here with the three remaining patches. I am going to looking at the remaining pieces now as this patch set was still on my stack of patches to look at.. -- Michael
-
Re:Re: Extended Statistics set/restore/clear functions.
Yu Wang <wangyu_runtime@163.com> — 2025-12-04T00:52:16Z
Hi Corey I was reviewing the recent patch v19-0003-Include-Extended-Statistics-in-pg_dump.patch and noticed a couple of small typo issues in the explanatory comments — nothing that affects the functionality. Here are the two minor fixes I’d suggest: 1. “ndistintinct” should be “ndistinct”. 2. “depdendencies” should be “dependencies”. Best regards, Yu 在 2025-12-04 08:46:29,"Corey Huinker" <corey.huinker@gmail.com> 写道: On Tue, Nov 25, 2025 at 11:14 PM Corey Huinker <corey.huinker@gmail.com> wrote: I don’t see any of my comments are addressed in v18. My apologies. My v17 focused entirely on the input functions, as those were receiving the vast majority of the attention. Now that those are out of the way (Thanks Michael!) I can address those issues. Paraphrasing the "quotes" here for brevity... > Several functions are made external visible, they are all renamed with adding a prefix “statatt_”, why text_to_stavalues is an exception? Michael had specifically said that one didn't need to be renamed. I suppose statatt_import_stavalues() might be a good name for it. It *is* specific to attribute stats, though that definition also applies to the attribute stats nested in the stxdexprs of extended stats. I have no strong opinion on the matter. > This MVDependency * can be const. +1 > static void > upsert_pg_statistic_ext_data(Datum *values, bool *nulls, bool *replaces) > { > ``` > This function pass values, nulls and replaces to heap_modify_tuple() and heap_form_tuple(), > the both functions take all const pointers as parameters. > So, here values, nulls and replaces can all be const. I find your argument here persuasive enough to override Michael's previously stated non-excitement. > ...the two NULL_FRAC questions Yes, fixed. > import_mcvlist is declared twice, looks like a copy-paste mistake. That or a rebase/apply gone wrong. Fixed. > PREPQUERY_DUMPEXTSTATSSTATS weird, suggest PREPQUERY_DUMPEXTSTATSDATA Awkward names are almost inevitable when talking about the statistics associated with an object of type "statistics". There was a debate about whether statistics were data or not, and I'd rather not restart that, so I went with PREPQUERY_DUMPEXTSTATSOBJSTATS for now. Incorporated these fixes, and some other lessons learned. -
Re: Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-12-04T07:28:00Z
On Thu, Dec 04, 2025 at 08:52:16AM +0800, WangYu wrote: > I was reviewing the recent patch > v19-0003-Include-Extended-Statistics-in-pg_dump.patch and noticed a > couple of small typo issues in the explanatory comments — nothing > that affects the functionality. Please be careful that the mailing lists of Postgres use bottom-posting for thread discussions. Please see here for details: https://en.wikipedia.org/wiki/Posting_style#Bottom-posting > Here are the two minor fixes I’d suggest: > 1. “ndistintinct” should be “ndistinct”. > 2. “depdendencies” should be “dependencies”. Indeed. These exact typos existed in some of the previous iterations of the patch posted on this thread. I have begun looking at 0001 and 0002 more seriously, and can now share a couple of comments about them. My biggest issue with 0001 is the severe lack of documentation for all the routines that are now in attribute_stats.c which are published so as they can be used for the import functions of extended stats: 1) statatt_get_type(), that retrieves a set of information from a relation and one of its attributes. This is lightly documented in extended_statistics_update() from 0002, with an argument claiming that we do the same as attribute statistics. If one then looks at attribute_stats.c, we have a not-really-helpful "derive information from attribute". The trick is that we have an implied dependency here: statatt_get_type() needs to be called *before* statatt_get_elem_type() for attribute *and* extended stats, statatt_get_elem_type() being fed data from statatt_get_type(). 2) statatt_get_elem_type(), interesting piece of the puzzle that acts as a wrapper for the type cache. An important part here, it seems to me, would be to at least tell that this relies on data retrieved from statatt_get_type(), initially. Once I've noticed this dependency the logic felt a bit cleaner to understand, but we have no docs explaining any of that.. 3) statatt_init_empty_tuple(), that initializes a set of data to be used for updates and/or insertion. My first thought here is about "inherited", which turns around an assumption in attribute_statistics_update(). How should callers assign its value? For attribute_statistics_update() it comes as an input argument. Also why are these values initialized as they are and why should these make sense when applied to pg_statistic. Extended stats set the "inherited" argument to false in import_expressions(). That seems equally important to document, both in a comment in extended_stats.c and at the top of the function. Of course it means that this refers to the fact that the stats include values from stats tables, which do not apply to extended stats. But one would have to guess this fact after knowing that this relates to the catalog pg_statistic.. 4) statatt_set_slot is slightly simpler. Still here, it is really easy to miss that this routine should be fed data retrieved by statatt_get_type(). staop can also be an eq or an lt operator. stakind is one of the STATISTIC_KIND_* values, etc, etc. 5) text_to_stavalues() has a bit more explanation, with some data fed from statatt_get_type() for the atttypmod and atttypid. We have also some fix of input coming from statatt_get_elem_type(), for elemtypid. As far as I can see, there is nothing in these functions that require them to be located in attribute_stats.c anymore. Let's move that to stats_utils.c, common place for all the shared facilities used by these stats modules. Or it could also be a new, separate patch, for the manipulation and extraction of the tuple data related to the pg_statistic tuples. I am not OK with these being in attribute_stats.c, and as far as I can see they have no contents that force these routines to Perhaps you know what these imply because you have written this code originally in ce207d2a7901, but exposing them also means that it is very important to document at least some concepts and assumptions around them so as it is possible to guide somebody that may want to use this code: - What are the input arguments from? - And what are the results? - In which circumstances should these be used? Then about 0002. There are a bunch of assumptions embedded inside import_expressions() that are interdependent with the calls of these attribute routines which are a bit hard to evaluate by themselves, at least it's not clear why some of these operations are done and why they are done the way they are presented in the patch. The code processes an array of expressions and processes them in a sequential fashion, repeating checks based on extended_stats_exprs_element. Each step is going to need more explanation to connect the dots. For example, why the first checks on exprs_nulls are important, with a second step of checks for each STATISTIC_KIND_*. Too much is let to the reader to guess, making the code complicated to maintain as written, at least IMHO. Is there any need to locate these new functions and code in extended_stats.c, actually? A separation into a new file seems like it would be a cleaner result, leaving extended_stats.c to deal with the import of the new data, including the fetch and deletion of any existing data in pg_stat_ext that would be updated or inserted. Perhaps name that extended_stats_funcs.c, as these are about the direct SQL functions, import and deletion. For the tests, it looks like it would be better to have everything in a new file, like a stats_ext_import.sql for the clear and restore bits. A lot of the tests are copy-pastes of surrounding queries. Perhaps it would be better to use a SQL function wrapper or an IN clause with multiple relations? The patch has a lot of bloat with these repeated queries.. Perhaps we should use a split for the elements in the extended stats array instead of jsonb_pretty(). The results get quite long here. As of 0002, there are actually two independent pieces dealt with: the deletion of extended stats with pg_clear_extended_stats(), and the insert/update of extended stats with pg_restore_extended_stats(). I'd suggest to split that into two parts, with the clear being first in rank. The deletion is simpler, and getting that in first simplifies the review of the import part with less input to deal with. -- Michael
-
Re: Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-12-04T23:12:16Z
> > > Indeed. These exact typos existed in some of the previous iterations > of the patch posted on this thread. > +1 > 1) statatt_get_type(), that retrieves a set of information from a > relation and one of its attributes. This is lightly documented in > extended_statistics_update() from 0002, with an argument claiming that > we do the same as attribute statistics. If one then looks at > attribute_stats.c, we have a not-really-helpful "derive information > from attribute". The trick is that we have an implied dependency > here: statatt_get_type() needs to be called *before* > statatt_get_elem_type() for attribute *and* extended stats, > statatt_get_elem_type() being fed data from statatt_get_type(). > I've added comments to that effect. > > 2) statatt_get_elem_type(), interesting piece of the puzzle that acts > as a wrapper for the type cache. An important part here, it seems to > me, would be to at least tell that this relies on data retrieved from > statatt_get_type(), initially. Once I've noticed this dependency the > logic felt a bit cleaner to understand, but we have no docs explaining > any of that.. > I've added comments that will hopefully connect those dots. > > 3) statatt_init_empty_tuple(), that initializes a set of data to be > used for updates and/or insertion. My first thought here is about > ... > not apply to extended stats. But one would have to guess this fact > after knowing that this relates to the catalog pg_statistic.. > The comments didn't seem quite so illuminating, but I added them just the same. > > 4) statatt_set_slot is slightly simpler. Still here, it is really easy > to miss that this routine should be fed data retrieved by > statatt_get_type(). staop can also be an eq or an lt operator. > stakind is one of the STATISTIC_KIND_* values, etc, etc. > Some here too. > As far as I can see, there is nothing in these functions that require > them to be located in attribute_stats.c anymore. Let's move that to > stats_utils.c, common place for all the shared facilities used by > these stats modules. Or it could also be a new, separate patch, for > the manipulation and extraction of the tuple data related to > the pg_statistic tuples. > They've been moved to stats_utils.c, and impact was fairly minimal. I was worried about an explosion of #includes, but that ended up not being the case. > Perhaps you know what these imply because you have written this code > originally in ce207d2a7901, but exposing them also means that it is > very important to document at least some concepts and assumptions > around them so as it is possible to guide somebody that may want to > use this code: > - What are the input arguments from? > - And what are the results? > - In which circumstances should these be used? > An attempt was made. > Then about 0002. > > There are a bunch of assumptions embedded inside import_expressions() > that are interdependent with the calls of these attribute routines > ... > each STATISTIC_KIND_*. Too much is let to the reader to guess, making > the code complicated to maintain as written, at least IMHO. > I tried to expand on these things, but I haven't yet moved them to another file for reasons I'll explain later. > Is there any need to locate these new functions and code in > extended_stats.c, actually? A separation into a new file seems like > it would be a cleaner result, leaving extended_stats.c to deal with > the import of the new data, including the fetch and deletion of any > existing data in pg_stat_ext that would be updated or inserted. > Perhaps name that extended_stats_funcs.c, as these are about the > direct SQL functions, import and deletion. > We can move them to another file, but I'm not sure if that will actually make things clearer. What is extended_stats.c if not functions about extended stats? Perhaps it makes more sense to try harder to find the commonality in the building of the pg_statistic tuples, move those functions to stats_utils, and get things decluttered that way? For the tests, it looks like it would be better to have everything in > a new file, like a stats_ext_import.sql for the clear and restore > bits. > Those tests very much build upon the tables and data already created for regular relation/attribute imports, so we save quite a bit of boilerplate in keeping them there. > > A lot of the tests are copy-pastes of surrounding queries. Perhaps it > would be better to use a SQL function wrapper or an IN clause with > multiple relations? I'd say "no", because we're doing a lot of things that *could* generate ERRORs, and we're verifying that they generate WARNINGs instead. If we did batch up those commands, and any one of them generated an error, we'd be at loss for which one was the culprit, and we'd also lose any insight into what else got busted in the other rows of the batch. > The patch has a lot of bloat with these repeated > queries.. Perhaps we should use a split for the elements in the > extended stats array instead of jsonb_pretty(). The results get quite > long here. > I replaced '}, ' with '},\n' and removed the jsonb_pretty entirely, which is something I had proposed doing earlier, as it's actually more pretty than jsonb_pretty(). It's still in the same file, though. I like the net result of this, and it might make sense to search other regression tests for places where we can employ this mini-pretty pattern. > As of 0002, there are actually two independent pieces dealt with: the > deletion of extended stats with pg_clear_extended_stats(), and the > insert/update of extended stats with pg_restore_extended_stats(). I'd > suggest to split that into two parts, with the clear being first in > rank. The deletion is simpler, and getting that in first simplifies > the review of the import part with less input to deal with. > The pg_clear function is quite trivial, as it's basically just marshalling and locking before a call to delete_pg_statistic_ext_data(). So while I can split them out, I'd also have to split the pg_proc.dat changes into two separate ones, which makes it hard to see how similar they are. I have a similar feeling about the changes to func-admin.sgml. All of that, plus a rebase.
-
Re: Extended Statistics set/restore/clear functions.
Tom Lane <tgl@sss.pgh.pa.us> — 2025-12-05T00:37:32Z
Michael Paquier <michael@paquier.xyz> writes: > Okay. I have gone other these two and after an extra round of > polishing, mainly around comments (some suggested by Chao, actually), > style, a fix for the test with valid UTF8 sequences failing in > non-UTF8 databases, plus a few more tests that were missing, I have > applied the two patches for the input functions. I notice that since e1405aa5e went in, a number of older buildfarm animals are issuing warnings about it: In file included from pg_dependencies.c:21:0: pg_dependencies.c: In function "dependencies_scalar": ../../../../src/include/nodes/miscnodes.h:54:15: warning: the comparison will always evaluate as "true" for the address of "escontext" will never be NULL [-Waddress] ((escontext) != NULL && IsA(escontext, ErrorSaveContext) && \\ ^ pg_dependencies.c:497:8: note: in expansion of macro "SOFT_ERROR_OCCURRED" if (SOFT_ERROR_OCCURRED(&escontext)) ^ ../../../../src/include/nodes/miscnodes.h:54:15: warning: the comparison will always evaluate as "true" for the address of "escontext" will never be NULL [-Waddress] ((escontext) != NULL && IsA(escontext, ErrorSaveContext) && \\ ^ pg_dependencies.c:542:8: note: in expansion of macro "SOFT_ERROR_OCCURRED" if (SOFT_ERROR_OCCURRED(&escontext)) ^ ../../../../src/include/nodes/miscnodes.h:54:15: warning: the comparison will always evaluate as "true" for the address of "escontext" will never be NULL [-Waddress] ((escontext) != NULL && IsA(escontext, ErrorSaveContext) && \\ ^ pg_dependencies.c:572:8: note: in expansion of macro "SOFT_ERROR_OCCURRED" if (SOFT_ERROR_OCCURRED(&escontext)) ^ In file included from pg_ndistinct.c:21:0: pg_ndistinct.c: In function "ndistinct_scalar": ../../../../src/include/nodes/miscnodes.h:54:15: warning: the comparison will always evaluate as "true" for the address of "escontext" will never be NULL [-Waddress] ((escontext) != NULL && IsA(escontext, ErrorSaveContext) && \\ ^ pg_ndistinct.c:438:8: note: in expansion of macro "SOFT_ERROR_OCCURRED" if (SOFT_ERROR_OCCURRED(&escontext)) ^ ../../../../src/include/nodes/miscnodes.h:54:15: warning: the comparison will always evaluate as "true" for the address of "escontext" will never be NULL [-Waddress] ((escontext) != NULL && IsA(escontext, ErrorSaveContext) && \\ ^ pg_ndistinct.c:488:9: note: in expansion of macro "SOFT_ERROR_OCCURRED" if (!SOFT_ERROR_OCCURRED(&escontext)) ^ The above is from rhinoceros, but about half a dozen other animals are reporting the same. They are all running very old gcc versions, mostly from RHEL7 or CentOS 7. This of course arises because the source code is passing the address of a local ErrorSaveContext variable to that macro. We don't have any other instances of that coding pattern AFAICS, so I wonder if that was really the most adapted way to do it. Looking at the code, it seems to be turning around and stuffing a different error message into a passed-in ErrorSaveContext, which feels a little weird. It may be that there's not a better way, in which case I'll probably just teach my buildfarm-warning-scraper script to ignore these warnings, as it's already doing for some other warnings from these same animals (-Wmissing-braces mostly). But I thought I'd raise a question about it. Also, I don't think these errdetail messages meet our style guidelines: errdetail("Invalid \"%s\" value.", PG_DEPENDENCIES_KEY_ATTRIBUTES)); They're supposed to be complete sentences. regards, tom lane -
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-12-05T01:19:27Z
On Thu, Dec 04, 2025 at 07:37:32PM -0500, Tom Lane wrote: > This of course arises because the source code is passing the address > of a local ErrorSaveContext variable to that macro. We don't have > any other instances of that coding pattern AFAICS, so I wonder if > that was really the most adapted way to do it. Looking at the > code, it seems to be turning around and stuffing a different error > message into a passed-in ErrorSaveContext, which feels a little > weird. Thanks for the report. Right. There are three instances of that in pg_dependencies.c, two in pg_ndistinct.c. I would not mind doing the attached to calm down these warnings, matching with the other areas of the code, that simply removes the macro and checks the state value directly. > Also, I don't think these errdetail messages meet our style > guidelines: > > errdetail("Invalid \"%s\" value.", PG_DEPENDENCIES_KEY_ATTRIBUTES)); > > They're supposed to be complete sentences. Yeah, I should have been more careful here. I have spent more time on all of these, and adjusted all the errdetails for pg_ndistinct.c and pg_dependencies.c to use full sentences. I am just grouping everything with the attached. They'd be better if handled separately, obviously. Comments? -- Michael -
Re: Extended Statistics set/restore/clear functions.
Tom Lane <tgl@sss.pgh.pa.us> — 2025-12-05T02:29:50Z
Michael Paquier <michael@paquier.xyz> writes: > On Thu, Dec 04, 2025 at 07:37:32PM -0500, Tom Lane wrote: >> This of course arises because the source code is passing the address >> of a local ErrorSaveContext variable to that macro. We don't have >> any other instances of that coding pattern AFAICS, so I wonder if >> that was really the most adapted way to do it. > Thanks for the report. Right. There are three instances of that in > pg_dependencies.c, two in pg_ndistinct.c. I would not mind doing the > attached to calm down these warnings, matching with the other areas of > the code, that simply removes the macro and checks the state value > directly. Ah, right, now that you mention it, 56b1e88c8 did this same fix before. This way is fine with me. I didn't review the error message changes closely. I do wonder if some of them are can't-happen cases that could just use an elog() instead of a translatable message. regards, tom lane
-
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-12-05T03:39:05Z
On Thu, Dec 04, 2025 at 09:29:50PM -0500, Tom Lane wrote: > Ah, right, now that you mention it, 56b1e88c8 did this same fix > before. This way is fine with me. Oh, I haven't noticed this one. Pushed the bits to take care of the compiler warnings, for now. > I didn't review the error message changes closely. I do wonder if > some of them are can't-happen cases that could just use an elog() > instead of a translatable message. Some of them are, yes. However, they are also worded with the same format as some of the legit cases. So they don't add any extra workload on the translation side as far as I recall, and I've been fond of the errdetail part to get a consistent style across the board. I'll double-check the whole a bit later, attached is the rest of them. Corey, any comments about these? -- Michael
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-12-05T05:30:38Z
> > > Some of them are, yes. However, they are also worded with the same > format as some of the legit cases. So they don't add any extra > workload on the translation side as far as I recall, and I've been > fond of the errdetail part to get a consistent style across the board. > I'll double-check the whole a bit later, attached is the rest of them. > > Corey, any comments about these? > The wordings are fine, and I'm sorry I didn't word them as complete sentences from the get-go. Attached is a follow-on to Michael's most recent uncommitted patch, changing the errors that I see as "impossible" to elogs. However, I agree that they don't add significant workload to the translations, and most input functions need to avoid any hard error returns lest they be called in a soft-error context.
-
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-12-08T01:25:45Z
On Fri, Dec 05, 2025 at 12:30:38AM -0500, Corey Huinker wrote: > Attached is a follow-on to Michael's most recent uncommitted patch, > changing the errors that I see as "impossible" to elogs. However, I agree > that they don't add significant workload to the translations, and most > input functions need to avoid any hard error returns lest they be called in > a soft-error context. Reporting the state of the parser in these new elogs is OK for me, but we had more cases that what your patch has been updating. I have included all these cases, making the set of error messages a bit more consistent across the board, and applied the result. -- Michael
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-12-11T21:52:26Z
On Sun, Dec 7, 2025 at 8:26 PM Michael Paquier <michael@paquier.xyz> wrote: > On Fri, Dec 05, 2025 at 12:30:38AM -0500, Corey Huinker wrote: > > Attached is a follow-on to Michael's most recent uncommitted patch, > > changing the errors that I see as "impossible" to elogs. However, I agree > > that they don't add significant workload to the translations, and most > > input functions need to avoid any hard error returns lest they be called > in > > a soft-error context. > > Reporting the state of the parser in these new elogs is OK for me, but > we had more cases that what your patch has been updating. I have > included all these cases, making the set of error messages a bit more > consistent across the board, and applied the result. > -- > Michael > Rebased. Used new palloc0_array() in a few places, and fixed an expected output message that wasn't updated to reflect the changes.
-
Re: Extended Statistics set/restore/clear functions.
Corey Huinker <corey.huinker@gmail.com> — 2025-12-11T21:55:52Z
> > Reporting the state of the parser in these new elogs is OK for me, but >> we had more cases that what your patch has been updating. I have >> included all these cases, making the set of error messages a bit more >> consistent across the board, and applied the result. >> -- >> Michael >> > > Rebased. Used new palloc0_array() in a few places, and fixed an expected > output message that wasn't updated to reflect the changes. > Disregard v21 (uncommitted changes). Everything I said about 21 applies to 22.
-
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-12-25T06:23:58Z
On Thu, Dec 11, 2025 at 04:55:52PM -0500, Corey Huinker wrote: > Disregard v21 (uncommitted changes). Everything I said about 21 applies to > 22. A couple of weeks later, I have locked a couple of hours to look at v22-0001. And after a couple of rounds of self-fixing and adjustments in the documentation added, the result looks OK so I have applied it. Attached are the remaining pieces, labelled v23. -- Michael
-
Re: Extended Statistics set/restore/clear functions.
Tender Wang <tndrwang@gmail.com> — 2025-12-25T11:00:02Z
Tender Wang <tndrwang@gmail.com> 于2025年12月25日周四 18:58写道: > Hi Michael, > > I found a typo in recent commit : > commit 213a1b89527049cb27bbcd6871fdb0c0916b43e1 (origin/master, > origin/HEAD, master) > Author: Michael Paquier <michael@paquier.xyz> > Date: Thu Dec 25 15:13:39 2025 +0900 > Move attribute statistics functions to stat_utils.c > .... > * The stacoll value should be either the atttypcoll derived from > * statatt_get_type(), or a harcoded value required by that particular > * stakind. > .... > > "harcoded" should be "hardcoded". > > Please check the attached patch. > > -- > Thanks, > Tender Wang > Sorry, I forgot to cc the pgsql-hackers mail in my last email. -- Thanks, Tender Wang
-
Re: Extended Statistics set/restore/clear functions.
Michael Paquier <michael@paquier.xyz> — 2025-12-25T22:47:54Z
On Thu, Dec 25, 2025 at 07:00:02PM +0800, Tender Wang wrote: > Sorry, I forgot to cc the pgsql-hackers mail in my last email. Missed this one from v22-0001. Thanks. -- Michael