Re: Extended Statistics set/restore/clear functions.

Corey Huinker <corey.huinker@gmail.com>

From: Corey Huinker <corey.huinker@gmail.com>

To: jian he <jian.universality@gmail.com>

Cc: Tomas Vondra <tomas@vondra.me>, pgsql-hackers@lists.postgresql.org, tgl@sss.pgh.pa.us

Date: 2025-05-29T22:32:17Z

Lists: pgsql-hackers

Commits

Add test doing some cloning of extended statistics data
- fc365e4fccc4 19 (unreleased) landed
Add test for pg_restore_extended_stats() with multiranges
- 0b7beec42ae2 19 (unreleased) landed
Add support for "mcv" in pg_restore_extended_stats()
- efbebb4e8587 19 (unreleased) landed
Include extended statistics data in pg_dump
- c32fb29e979d 19 (unreleased) landed
Add support for "dependencies" in pg_restore_extended_stats()
- 302879bd68d1 19 (unreleased) landed
Add test for MAINTAIN permission with pg_restore_extended_stats()
- d9abd9e1050d 19 (unreleased) landed
Add pg_restore_extended_stats()
- 0e80f3f88dea 19 (unreleased) landed
Add routine to free MCVList
- 7ebb64c55757 19 (unreleased) landed
Improve pg_clear_extended_stats() with incorrect relation/stats combination
- 395b73c045e0 19 (unreleased) landed
Add pg_clear_extended_stats()
- d756fa1019ff 19 (unreleased) landed
Introduce routines to validate and free MVNDistinct and MVDependencies
- 32e27bd32082 19 (unreleased) landed
Fix typo in stat_utils.c
- eee19a30d60d 19 (unreleased) landed
Move attribute statistics functions to stat_utils.c
- 213a1b895270 19 (unreleased) landed
Improve error messages of input functions for pg_dependencies and pg_ndistinct
- f68597ee777d 19 (unreleased) landed
Improve test output of extended statistics for ndistinct and dependencies
- 2f04110225ab 19 (unreleased) landed
Fix some compiler warnings
- 7bc88c3d6f3a 19 (unreleased) landed
Add input function for data type pg_dependencies
- e1405aa5e3ac 19 (unreleased) landed
Add input function for data type pg_ndistinct
- 44eba8f06e55 19 (unreleased) landed
Rework output format of pg_dependencies
- e76defbcf09e 19 (unreleased) landed
Rework output format of pg_ndistinct
- 1f927cce4498 19 (unreleased) landed
Fix comments of output routines for pg_ndistinct and pg_dependencies
- 040a39ed25bf 19 (unreleased) landed
Move code specific to pg_dependencies to new file
- 2ddc8d9e9baa 19 (unreleased) landed
Move code specific to pg_ndistinct to new file
- a5523123430f 19 (unreleased) landed
Document some structures in attribute_stats.c
- d6c132d83bff 19 (unreleased) landed
Fix FATAL message for invalid recovery timeline at beginning of recovery
- 71f17823ba01 18.0 cited

On Mon, Mar 31, 2025 at 1:10 AM Corey Huinker <corey.huinker@gmail.com>
wrote:

> Just rebasing.
>

At pgconf.dev this year, the subject of changing the formats of
pg_ndistinct and pg_dependencies came up again.

To recap: presently these datatypes have no working input function, but
would need one for statistics import to work on extended statistics. The
existing output formats are technically JSON, but the keys themselves are
comma-separated lists of attnums, so they require additional parsing. That
parsing is already done in the patches in this thread, but overall the
format is terrible for any sort of manipulation, such as translating the
values to a table with a different column order (say, after a restore of a
table that had dropped columns) or running query planner experiments.
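
To make the "additional parsing" concrete, here is a minimal sketch in Python (not part of the patches in this thread) of what a client has to do to renumber attnums in the existing pg_ndistinct text form. The function name remap_old_ndistinct and the attnum mapping are hypothetical and exist only for illustration.

```python
import json

def remap_old_ndistinct(text, attnum_map):
    """Renumber attnums in the existing pg_ndistinct text form.

    attnum_map is a hypothetical {old_attnum: new_attnum} mapping; attnums
    missing from it (for example the negative ones) are kept unchanged.
    """
    remapped = {}
    for key, ndistinct in json.loads(text).items():
        # Each JSON key is itself a comma-separated attnum list, so it
        # needs a second parsing step after the JSON parse.
        attnums = [int(part) for part in key.split(",")]
        new_key = ", ".join(str(attnum_map.get(a, a)) for a in attnums)
        remapped[new_key] = ndistinct
    return json.dumps(remapped)

old = '{"2, 3": 4, "2, -1": 4, "3, -1": 4}'
print(remap_old_ndistinct(old, {2: 1, 3: 2}))
```

The split/join round trip on every key is the part that a format with real JSON arrays would make unnecessary.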

Because the old formats don't have a corresponding input function, there is
no risk of the output not matching required inputs, but there will be once
we add new input functions, so this is our last chance to change the format
to something we like better.

The old format can be trivially translated via functions posted earlier in
this thread back in January (pg_xlat_ndistinct_to_attnames,
pg_xlat_dependencies_to_attnames) as well as the reverse (s/_to_/_from_/),
so dumping values from older versions will not be difficult.

I believe that we should take this opportunity to make the change. While we
don't have a pressing need to manipulate these structures now, we might in
the future and failing to do so now makes a later change much harder.

With that in mind, I'd like people to have a look at the proposed format
change for pg_ndistinct (the changes to pg_dependencies are similar), to see
if they have any improvements or comments. As you can see, the new
format is much less compact (about 3x as large), which could get bad if the
number of elements grew by a lot, but the number of elements is tied to the
number of columns and expressions in the extended statistics object (N
choose N, then N choose N-1, etc., excluding N choose 1), so this can't get
too out of hand.
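
As a rough check on that bound, the sketch below counts the attribute combinations of size 2 through N. It is my own illustration (the function name ndistinct_item_count is hypothetical), and it assumes the current limit of 8 columns/expressions per statistics object.

```python
from math import comb

def ndistinct_item_count(n):
    """Count attribute combinations of size 2..n drawn from n columns/expressions."""
    return sum(comb(n, k) for k in range(2, n + 1))

for n in range(2, 9):
    # The same sum in closed form is 2^n - n - 1.
    assert ndistinct_item_count(n) == 2**n - n - 1
    print(n, ndistinct_item_count(n))
```

Under that assumption the worst case is 247 items, so even at roughly 3x the size per item the text form stays small.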

Existing format (newlines/formatting added by me to make head-to-head
comparison easier):

'{"2, 3": 4,
  "2, -1": 4,
  "2, -2": 4,
  "3, -1": 4,
  "3, -2": 4,
  "-1, -2": 3,
  "2, 3, -1": 4,
  "2, 3, -2": 4,
  "2, -1, -2": 4,
  "3, -1, -2": 4}'::pg_ndistinct

Proposed new format (again, all formatting here is just for ease of humans
reading):

' [ {"attributes" : [2,3], "ndistinct" : 4},
    {"attributes" : [2,-1], "ndistinct" : 4},
    {"attributes" : [2,-2], "ndistinct" : 4},
    {"attributes" : [3,-1], "ndistinct" : 4},
    {"attributes" : [3,-2], "ndistinct" : 4},
    {"attributes" : [-1,-2], "ndistinct" : 3},
    {"attributes" : [2,3,-1], "ndistinct" : 4},
    {"attributes" : [2,3,-2], "ndistinct" : 4},
    {"attributes" : [2,-1,-2], "ndistinct" : 4},
    {"attributes" : [3,-1,-2], "ndistinct" : 4}]'::pg_ndistinct
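
For anyone who wants to experiment with the two formats side by side, here is a minimal Python sketch of the old-to-new translation. It is separate from the pg_xlat_* functions mentioned above, and the name ndistinct_old_to_new is hypothetical.

```python
import json

def ndistinct_old_to_new(text):
    """Convert the existing pg_ndistinct text form to the proposed array form."""
    items = []
    for key, ndistinct in json.loads(text).items():
        # Old keys such as "2, 3, -1" become a real JSON array of attnums.
        attributes = [int(part) for part in key.split(",")]
        items.append({"attributes": attributes, "ndistinct": ndistinct})
    return json.dumps(items)

old = '{"2, 3": 4, "-1, -2": 3, "2, 3, -1": 4}'
print(ndistinct_old_to_new(old))
```

Once the value is in the array form, renumbering attnums is ordinary JSON manipulation with no string splitting.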

The pg_dependencies structure is only slightly more complex. An abbreviated
example:

{"2 => 1": 1.000000, "2 => -1": 1.000000, ...,
 "2, -2 => -1": 1.000000, "3, -1 => 2": 1.000000}

Becomes:

[ {"attributes": [2], "dependency": 1, "degree": 1.000000},
  {"attributes": [2], "dependency": -1, "degree": 1.000000},
  ...,
  {"attributes": [2, -2], "dependency": -1, "degree": 1.000000},
  {"attributes": [3, -1], "dependency": 2, "degree": 1.000000}]
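
The same translation for pg_dependencies can be sketched as follows. Again this is only an illustration in Python with a hypothetical function name (dependencies_old_to_new), not code from the patches.

```python
import json

def dependencies_old_to_new(text):
    """Convert the existing pg_dependencies text form to the proposed array form."""
    items = []
    for key, degree in json.loads(text).items():
        # Old keys look like "2, -2 => -1": the determining attnums on the
        # left of "=>", the single dependent attnum on the right.
        lhs, rhs = key.split("=>")
        items.append({
            "attributes": [int(part) for part in lhs.split(",")],
            "dependency": int(rhs),
            "degree": degree,
        })
    return items

old = '{"2 => 1": 1.000000, "2, -2 => -1": 1.000000, "3, -1 => 2": 1.000000}'
for item in dependencies_old_to_new(old):
    print(json.dumps(item))
```

Note that Python's json module does not preserve the six-decimal rendering of the degree, so this sketch is only about the structure, not the exact output text.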

Any thoughts on using/improving these structures?