Re: [HACKERS] vacuum process size

Tom Lane <tgl@sss.pgh.pa.us> — 1999-08-24T16:20:22Z

I have been looking some more at the vacuum-process-size issue, and
I am having a hard time understanding why the VPageList data structure
is the critical one.  As far as I can see, there should be at most one
pointer in it for each disk page of the relation.  OK, you were
vacuuming a table with something like a quarter million pages, so
the end size of the VPageList would have been something like a megabyte,
and given the inefficient usage of repalloc() in the original code,
a lot more space than that would have been wasted as the list grew.
So doubling the array size at each step is a good change.

But there are a lot more tuples than pages in most relations.

I see two lists with per-tuple data in vacuum.c, "vtlinks" in
vc_scanheap and "vtmove" in vc_rpfheap, that are both being grown with
essentially the same technique of repalloc() after every N entries.
I'm not entirely clear on how many tuples get put into each of these
lists, but it sure seems like in ordinary circumstances they'd be much
bigger space hogs than any of the three VPageList lists.

I recommend going to a doubling approach for each of these lists as
well as for VPageList.

There is a fourth usage of repalloc with the same method, for "ioid"
in vc_getindices.  This only gets one entry per index on the current
relation, so it's unlikely to be worth changing on its own merit.
But it might be worth building a single subroutine that expands a
growable list of entries (taking sizeof() each entry as a parameter)
and applying it in all four places.

			regards, tom lane
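
[Editorial illustration, not part of the original message.] The single
subroutine suggested above could look roughly like the sketch below. It is
written against plain realloc() so that it stands alone; the names GrowList,
growlist_init, growlist_append and growlist_free are hypothetical, as is the
starting capacity of 16 entries. The real vacuum.c lists would go through
palloc()/repalloc() and their own structs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical generic growable array; NOT the actual vacuum.c structures. */
typedef struct GrowList
{
    void   *items;       /* contiguous storage for the entries */
    size_t  entry_size;  /* sizeof() one entry, supplied by the caller */
    size_t  used;        /* entries currently stored */
    size_t  capacity;    /* entries the current allocation can hold */
    size_t  reallocs;    /* how many times the storage was enlarged */
} GrowList;

static void
growlist_init(GrowList *list, size_t entry_size)
{
    assert(entry_size > 0);
    memset(list, 0, sizeof(*list));
    list->entry_size = entry_size;
}

/*
 * Append one entry, doubling the capacity whenever the array is full.
 * Returns 0 on success, -1 if the memory could not be obtained.
 */
static int
growlist_append(GrowList *list, const void *entry)
{
    if (list->used == list->capacity)
    {
        /* Assumed starting size of 16 entries; doubled on each later step. */
        size_t  new_cap = (list->capacity == 0) ? 16 : list->capacity * 2;
        void   *p;

        if (new_cap > SIZE_MAX / list->entry_size)
            return -1;          /* byte count would overflow size_t */
        p = realloc(list->items, new_cap * list->entry_size);
        if (p == NULL)
            return -1;          /* old storage is still valid */
        list->items = p;
        list->capacity = new_cap;
        list->reallocs++;
    }
    memcpy((char *) list->items + list->used * list->entry_size,
           entry, list->entry_size);
    list->used++;
    return 0;
}

static void
growlist_free(GrowList *list)
{
    free(list->items);
    memset(list, 0, sizeof(*list));
}
```

Under these assumptions, appending 250,000 pointer-sized entries enlarges the
storage only 15 times and ends with room for 262,144 entries, whereas growing
by a fixed N entries per step needs on the order of 250,000/N enlargements.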

Re: [HACKERS] vacuum process size

Brian E Gallew <geek+@cmu.edu> — 1999-08-24T17:01:12Z

Then <tgl@sss.pgh.pa.us> spoke up and said:
> So doubling the array size at each step is a good change.
> 
> But there are a lot more tuples than pages in most relations.
> 
> I see two lists with per-tuple data in vacuum.c, "vtlinks" in
> vc_scanheap and "vtmove" in vc_rpfheap, that are both being grown with
> essentially the same technique of repalloc() after every N entries.
> I'm not entirely clear on how many tuples get put into each of these
> lists, but it sure seems like in ordinary circumstances they'd be much
> bigger space hogs than any of the three VPageList lists.
> 
> I recommend going to a doubling approach for each of these lists as
> well as for VPageList.

Question: is there reliable information in pg_statistic (or other
system tables) which can be used to make a reasonable estimate for the
sizes of these structures before initial allocation?  Certainly the
file size can be gotten from a stat() call (with some portability
and sparse-file issues).
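
[Editorial illustration, not part of the original message.] A minimal sketch
of the stat()-based estimate mentioned above, assuming a POSIX system and a
caller-supplied block size (8192 bytes is assumed in the usage note below).
The helper names pages_for_size and estimate_pages are hypothetical. The
sketch ignores relations split across several files and, as noted above, a
sparse file would make st_size overstate the real page count.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Round a byte count up to a whole number of blocks. */
static long long
pages_for_size(long long nbytes, long long block_size)
{
    assert(nbytes >= 0 && block_size > 0);
    return (nbytes + block_size - 1) / block_size;
}

/*
 * Estimate how many pages the file at 'path' holds, based only on the
 * size reported by stat().  Returns -1 if the file cannot be examined.
 */
static long long
estimate_pages(const char *path, long long block_size)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return pages_for_size((long long) st.st_size, block_size);
}
```

With 8192-byte blocks, a 2,048,000,000-byte file comes out at 250,000 pages,
which could serve as the initial capacity of a per-page list. It says nothing
about the per-tuple lists, whose sizes depend on the table contents.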


-- 
=====================================================================
| JAVA must have been developed in the wilds of West Virginia.      |
| After all, why else would it support only single inheritance??    |
=====================================================================
| Finger geek@cmu.edu for my public key.                            |
=====================================================================

RE: [HACKERS] vacuum process size

Hiroshi Inoue <inoue@tpf.co.jp> — 1999-08-25T01:11:42Z

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Wednesday, August 25, 1999 1:20 AM
> To: t-ishii@sra.co.jp
> Cc: Mike Mascari; Hiroshi Inoue; pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] vacuum process size 
> 
> 
> I have been looking some more at the vacuum-process-size issue, and
> I am having a hard time understanding why the VPageList data structure
> is the critical one.  As far as I can see, there should be at most one
> pointer in it for each disk page of the relation.  OK, you were
> vacuuming a table with something like a quarter million pages, so
> the end size of the VPageList would have been something like a megabyte,
> and given the inefficient usage of repalloc() in the original code,
> a lot more space than that would have been wasted as the list grew.
> So doubling the array size at each step is a good change.
> 
> But there are a lot more tuples than pages in most relations.
> 
> I see two lists with per-tuple data in vacuum.c, "vtlinks" in
> vc_scanheap and "vtmove" in vc_rpfheap, that are both being grown with
> essentially the same technique of repalloc() after every N entries.
> I'm not entirely clear on how many tuples get put into each of these
> lists, but it sure seems like in ordinary circumstances they'd be much
> bigger space hogs than any of the three VPageList lists.
>

AFAIK, both vtlinks and vtmove are NULL if vacuum is executed
without concurrent transactions.
They won't get very big unless very long-running concurrent
transactions exist.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp