Re: Assorted improvements in pg_dump

Andres Freund <andres@anarazel.de>

From: Andres Freund <andres@anarazel.de>
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Hans Buschmann <buschmann@nidsa.net>, "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
Date: 2021-10-25T19:30:24Z
Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. pg_dump: avoid unsafe function calls in getPolicies().

  2. Postpone calls of unsafe server-side functions in pg_dump.

  3. Account for TOAST data while scheduling parallel dumps.

  4. Use PREPARE/EXECUTE for repetitive per-object queries in pg_dump.

  5. Avoid per-object queries in performance-critical paths in pg_dump.

  6. Rethink pg_dump's handling of object ACLs.

  7. Refactor pg_dump's tracking of object components to be dumped.

  8. pg_dump: fix mis-dumping of non-global default privileges.

Hi,

On 2021-10-24 17:10:55 -0400, Tom Lane wrote:
> 0004 is not committable as-is, because it assumes that the source
> server has single-array unnest(), which is not true before 8.4.
> We could fix that by using "oid = ANY(array-constant)" conditions
> instead, but I'm unsure about the performance properties of that
> for large OID arrays on those old server versions.

It doesn't seem bad at all. 8.3 assert:

CREATE TABLE foo(oid oid primary key);
INSERT INTO foo SELECT generate_series(1, 1000000);
postgres[1164129][1]=# explain ANALYZE SELECT count(*) FROM foo WHERE oid = ANY(ARRAY(SELECT generate_series(1100000, 1, -1)));
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                             QUERY PLAN                                                              │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate  (cost=81.54..81.55 rows=1 width=0) (actual time=2433.656..2433.656 rows=1 loops=1)                                       │
│   InitPlan                                                                                                                          │
│     ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.004..149.425 rows=1100000 loops=1)                                  │
│   ->  Bitmap Heap Scan on foo  (cost=42.70..81.50 rows=10 width=0) (actual time=2275.137..2369.478 rows=1000000 loops=1)            │
│         Recheck Cond: (oid = ANY (($0)::oid[]))                                                                                     │
│         ->  Bitmap Index Scan on foo_pkey  (cost=0.00..42.69 rows=10 width=0) (actual time=2274.077..2274.077 rows=1000000 loops=1) │
│               Index Cond: (oid = ANY (($0)::oid[]))                                                                                 │
│ Total runtime: 2436.201 ms                                                                                                          │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(8 rows)

Time: 2437.568 ms (00:02.438)


> Lastly, 0006 implements the other idea we'd discussed in the other
> thread: for queries that are issued repetitively but not within a
> single pg_dump function invocation, use PREPARE/EXECUTE to cut down
> the overhead.  This gets only diminishing returns after 0004, but
> it still brings "pg_dump -s regression" down from 0.39s to 0.33s,
> so maybe it's worth doing.

I think it's worth doing. There's things that the batch approach won't help
with and even if it doesn't help a lot with the regression test database, I'd
expect it to help plenty with other cases.

A test database I had around with lots of functions got drastically faster to
dump (7.4s to 2.5s), even though the number of queries didn't change
significantly. According to pg_stat_statements plan_time for the dumpFunc
query went from 2352ms to 0.4ms - interestingly execution time nearly halves
as well.


> I stopped after caching the plans for
> functions/aggregates/operators/types, though.  The remaining sorts
> of objects aren't likely to appear in typical databases enough times
> to be worth worrying over.  (This patch will be a net loss if there
> are more than zero but less than perhaps 10 instances of an object type,
> so there's definitely reasons beyond laziness for not doing more.)

Seems reasonable.


> @@ -7340,25 +7340,37 @@ getDomainConstraints(Archive *fout, TypeInfo *tyinfo)
>  				i_consrc;
>  	int			ntups;
>  
> -	query = createPQExpBuffer();
> +	static bool query_prepared = false;
>  
> -	if (fout->remoteVersion >= 90100)
> -		appendPQExpBuffer(query, "SELECT tableoid, oid, conname, "
> -						  "pg_catalog.pg_get_constraintdef(oid) AS consrc, "
> -						  "convalidated "
> -						  "FROM pg_catalog.pg_constraint "
> -						  "WHERE contypid = '%u'::pg_catalog.oid "
> -						  "ORDER BY conname",
> -						  tyinfo->dobj.catId.oid);
> +	if (!query_prepared)
> +	{

I wonder if it'd be better to store this in Archive or such. The approach with
static variables might run into problems with parallel pg_dump at some
point. These objects aren't dumped in parallel yet, but still...

Greetings,

Andres Freund