Thread

Problem with the numbers I reported yesterday

Michael Meskes <meskes@topsystem.de> — 1998-02-11T09:36:03Z

I ran my performance tests some more times and it seems the numbers are not
really comparable. When I run PostgreSQL without -F I get a sync after every
insert. With -F I get no sync at all as all inserts fit well into the
buffer. However, Oracle in comparison does sync. Simply hearing the disk
access it seems as if they sync every two or three seconds.

Does anyone know a way to really check both DBMSs?

Michael

-- 
Dr. Michael Meskes, Project-Manager    | topsystem Systemhaus GmbH
meskes@topsystem.de                    | Europark A2, Adenauerstr. 20
meskes@debian.org                      | 52146 Wuerselen
Go SF49ers! Go Rhein Fire!             | Tel: (+49) 2405/4670-44
Use Debian GNU/Linux!                  | Fax: (+49) 2405/4670-10

Re: [HACKERS] Problem with the numbers I reported yesterday

Bruce Momjian <maillist@candle.pha.pa.us> — 1998-02-11T15:31:20Z

> 
> I ran my performance tests some more times and it seems the numbers are not
> really comparable. When I run PostgreSQL without -F I get a sync after every
> insert. With -F I get no sync at all as all inserts fit well into the
> buffer. However, Oracle in comparison does sync. Simply hearing the disk
> access it seems as if they sync every two or three seconds.
> 
> Does anyone know a way to really check both DBMSs?

Many dbms's do buffered logging, that is they sync after the buffer gets
full or after a minute or so.  We have the logic to add buffered logging
to PostgreSQL and will be doing it later.  Right now, we only have
non-buffered logging, and no logging.

-- 
Bruce Momjian
maillist@candle.pha.pa.us

Re: [HACKERS] Problem with the numbers I reported yesterday

Bruce Momjian <maillist@candle.pha.pa.us> — 1998-02-12T14:47:43Z

> 
> Hi,
> 
> perhaps a stupid question:
> 
> What will happen, if you start PostgreSQL with -F and write a script 
> which is a loop that fsyncs every 2 seconds? Does this lead to a 
> database which is "almost" correct? Does this seem like a good 
> compromise?

Well, you really have to sync the data pages BEFORE sync'ing pg_log. 
Our buffered logging idea for post-6.3 will do exactly that.  I don't
think it is a workaround.  You could get pg_log to disk saying a
transaction is complete without getting all the data pages to disk if
the crash happened during the sync.

-- 
Bruce Momjian
maillist@candle.pha.pa.us

Re: [HACKERS] Problem with the numbers I reported yesterday

Boersenspielteam <boersenspiel@vocalweb.de> — 1998-02-12T15:41:45Z

Hi,

perhaps a stupid question:

What will happen, if you start PostgreSQL with -F and write a script 
which is a loop that fsyncs every 2 seconds? Does this lead to a 
database which is "almost" correct? Does this seem like a good 
compromise?

> > I ran my performance tests some more times and it seems the numbers are not
> > really comparable. When I run PostgreSQL without -F I get a sync after every
> > insert. With -F I get no sync at all as all inserts fit well into the
> > buffer. However, Oracle in comparison does sync. Simply hearing the disk
> > access it seems as if they sync every two or three seconds.
> > 
> > Does anyone know a way to really check both DBMSs?
> 
> Many dbms's do buffered logging, that is they sync after the buffer gets
> full or after a minute or so.  We have the logic to add buffered logging
> to PostgreSQL and will be doing it later.  Right now, we only have
> non-buffered logging, and no logging.
> 
> -- 
> Bruce Momjian
> maillist@candle.pha.pa.us
> 
> 

Ciao

Das Boersenspielteam.

---------------------------------------------------------------------------
                          http://www.boersenspiel.de
              	         Das Boersenspiel im Internet
             *Realitaetsnah*  *Kostenlos*  *Ueber 6000 Spieler*
---------------------------------------------------------------------------

Re: [HACKERS] Problem with the numbers I reported yesterday

ocie@paracel.com — 1998-02-12T18:22:47Z

Bruce Momjian wrote:
> 
> > 
> > Hi,
> > 
> > perhaps a stupid question:
> > 
> > What will happen, if you start PostgreSQL with -F and write a script 
> > which is a loop that fsyncs every 2 seconds? Does this lead to a 
> > database which is "almost" correct? Does this seem like a good 
> > compromise?
> 
> Well, you really have to sync the data pages BEFORE sync'ing pg_log. 

Why should this be necessary?  If the transaction is considered
committed once the log has be written, and the system crashes before
the data are written, then postgres can look at the data and logs when
it is next started up and apply all the transactions that were logged
but not committed to the data pages.

Am I missing something?  It seems to me if you sync the data pages
first, then what good is the log? (other than being able to retrace
your steps)

Ocie Mitchell

Re: [HACKERS] Problem with the numbers I reported yesterday

Bruce Momjian <maillist@candle.pha.pa.us> — 1998-02-12T19:16:29Z

> 
> Bruce Momjian wrote:
> > 
> > > 
> > > Hi,
> > > 
> > > perhaps a stupid question:
> > > 
> > > What will happen, if you start PostgreSQL with -F and write a script 
> > > which is a loop that fsyncs every 2 seconds? Does this lead to a 
> > > database which is "almost" correct? Does this seem like a good 
> > > compromise?
> > 
> > Well, you really have to sync the data pages BEFORE sync'ing pg_log. 
> 
> Why should this be necessary?  If the transaction is considered
> committed once the log has be written, and the system crashes before
> the data are written, then postgres can look at the data and logs when
> it is next started up and apply all the transactions that were logged
> but not committed to the data pages.

No, on restart, you can't identify the old/new data.  Remember, pg_log
is just the transaction id and a flag.  The superceeded/new rows are
mixed on the data pages, with transaction id's as markers.


> 
> Am I missing something?  It seems to me if you sync the data pages
> first, then what good is the log? (other than being able to retrace
> your steps)

Again, the log is just a list of transaction ids, and their statuses.

-- 
Bruce Momjian
maillist@candle.pha.pa.us

Re: [HACKERS] Problem with the numbers I reported yesterday

Kent S. Gordon <kgor@inetspace.com> — 1998-02-13T17:33:22Z

>>>>> "ocie" == ocie  <ocie@paracel.com> writes:

    > Bruce Momjian wrote:
    >>  > > Hi, > > perhaps a stupid question: > > What will happen,
    >> if you start PostgreSQL with -F and write a script > which is a
    >> loop that fsyncs every 2 seconds? Does this lead to a >
    >> database which is "almost" correct? Does this seem like a good
    >> > compromise?
    >> 
    >> Well, you really have to sync the data pages BEFORE sync'ing
    >> pg_log.

    > Why should this be necessary?  If the transaction is considered
    > committed once the log has be written, and the system crashes
    > before the data are written, then postgres can look at the data
    > and logs when it is next started up and apply all the
    > transactions that were logged but not committed to the data
    > pages.

    > Am I missing something?  It seems to me if you sync the data
    > pages first, then what good is the log? (other than being able
    > to retrace your steps)

I do not think that pg_log is used like a normal 'log' device in other
databases.  My quick look at the code looks like pg_log only has a
list of transactions and not the actual data blocks.  Notice that
TransRecover is commented out in backent/access/transam/transam.c.

Most database log has the before images and after images of any page
that has been modified in a transaction followed by commit/abort
record.  This allows for only this file to have to be synced.  The
rest of the database can float (generally checkpoints are done every
so often to reduce recover time).  The method of recovering from a
crash is to replay the log from the last checkpoint until the end of
the log by applying the before/after images (as needed based on
weather the transaction commited) to the actual database relations.

I would appreciate anyone correcting any mistakes in my understanding
of how postgres works.

    > Ocie Mitchell

Kent S. Gordon
Architect
iNetSpace Co.
voice: (972)851-3494 fax:(972)702-0384 e-mail:kgor@inetspace.com

Re: [HACKERS] Problem with the numbers I reported yesterday

Bruce Momjian <maillist@candle.pha.pa.us> — 1998-02-13T19:34:09Z

This is 100% correct.  See my other posting describing the issues.

> 
> >>>>> "ocie" == ocie  <ocie@paracel.com> writes:
> 
>     > Bruce Momjian wrote:
>     >>  > > Hi, > > perhaps a stupid question: > > What will happen,
>     >> if you start PostgreSQL with -F and write a script > which is a
>     >> loop that fsyncs every 2 seconds? Does this lead to a >
>     >> database which is "almost" correct? Does this seem like a good
>     >> > compromise?
>     >> 
>     >> Well, you really have to sync the data pages BEFORE sync'ing
>     >> pg_log.
> 
>     > Why should this be necessary?  If the transaction is considered
>     > committed once the log has be written, and the system crashes
>     > before the data are written, then postgres can look at the data
>     > and logs when it is next started up and apply all the
>     > transactions that were logged but not committed to the data
>     > pages.
> 
>     > Am I missing something?  It seems to me if you sync the data
>     > pages first, then what good is the log? (other than being able
>     > to retrace your steps)
> 
> I do not think that pg_log is used like a normal 'log' device in other
> databases.  My quick look at the code looks like pg_log only has a
> list of transactions and not the actual data blocks.  Notice that
> TransRecover is commented out in backent/access/transam/transam.c.
> 
> Most database log has the before images and after images of any page
> that has been modified in a transaction followed by commit/abort
> record.  This allows for only this file to have to be synced.  The
> rest of the database can float (generally checkpoints are done every
> so often to reduce recover time).  The method of recovering from a
> crash is to replay the log from the last checkpoint until the end of
> the log by applying the before/after images (as needed based on
> weather the transaction commited) to the actual database relations.
> 
> I would appreciate anyone correcting any mistakes in my understanding
> of how postgres works.
> 
>     > Ocie Mitchell
> 
> Kent S. Gordon
> Architect
> iNetSpace Co.
> voice: (972)851-3494 fax:(972)702-0384 e-mail:kgor@inetspace.com
> 


-- 
Bruce Momjian
maillist@candle.pha.pa.us

DB logging (was: Problem with the numbers I reported yesterday)

Jan Wieck <jwieck@debis.com> — 1998-02-14T13:58:05Z

Kent wrote:
>
> I do not think that pg_log is used like a normal 'log' device in other
> databases.  My quick look at the code looks like pg_log only has a
> list of transactions and not the actual data blocks.  Notice that
> TransRecover is commented out in backent/access/transam/transam.c.
>
> Most database log has the before images and after images of any page
> that has been modified in a transaction followed by commit/abort
> record.  This allows for only this file to have to be synced.  The
> rest of the database can float (generally checkpoints are done every
> so often to reduce recover time).  The method of recovering from a
> crash is to replay the log from the last checkpoint until the end of
> the log by applying the before/after images (as needed based on
> weather the transaction commited) to the actual database relations.
>
> I would appreciate anyone correcting any mistakes in my understanding
> of how postgres works.
>
>     > Ocie Mitchell
>
> Kent S. Gordon
> Architect
> iNetSpace Co.
> voice: (972)851-3494 fax:(972)702-0384 e-mail:kgor@inetspace.com
>
>

    Totally  right,  PostgreSQL doesn't have a log mechanism that
    collects all the information to recover a corrupted  database
    from a backup.

    I hacked around on that a little bit.

    When doing a complete after image logging, that is taking all
    the tuples that are stored on insert/update, the  tuple  id's
    of  deletes  plus the information about transaction id's that
    commit, the regression tests produce log data  that  is  more
    than   the   size  of  the  final  regression  database.  The
    performance  increase  when  only  syncing   the   log-   and
    controlfiles  (2  control  files on different devices and the
    logfile on a different device from the  database  files)  and
    running  the  backends  with  -F  is  about  15-20%  for  the
    regression test.

    I thought this is far too much logging data and so  I  didn't
    spent much time trying to implement a recovery. But as far as
    I got it I can tell that the updates to system  catalogs  and
    keeping the indices up to date will be really tricky.

    Another  possible log mechanism I'll try sometimes after v6.3
    release is to log the queries and  data  from  copy  commands
    along with informations about Oid and Tid allocations.


Until later, Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #

Re: DB logging (was: Problem with the numbers I reported yesterday)

Kent S. Gordon <kgor@inetspace.com> — 1998-02-16T14:18:31Z

>>>>> "jwieck" == Jan Wieck <jwieck@debis.com> writes:

    > Kent wrote:
description of current logging method deleted.

    >     Totally right, PostgreSQL doesn't have a log mechanism that
    > collects all the information to recover a corrupted database
    > from a backup.

    >     I hacked around on that a little bit.

    >     When doing a complete after image logging, that is taking
    > all the tuples that are stored on insert/update, the tuple id's
    > of deletes plus the information about transaction id's that
    > commit, the regression tests produce log data that is more than
    > the size of the final regression database.  The performance
    > increase when only syncing the log- and controlfiles (2 control
    > files on different devices and the logfile on a different device
    > from the database files) and running the backends with -F is
    > about 15-20% for the regression test.

Log files do get very big with image logging.  I would not expect a
huge win in performance unless the log device is a raw device.  On a
cooked device (file system) buffer cache effects are very large (all
disk data is being buffered both by postgresql and the OS buffer
cache.  The buffer cache is actual harmfully in this case, since data
is not reused, and the writes are synced.  The number of writes to the
log also flush out other buffer from the cache leading to even more
io.).  If a system does not have raw devices (linux, NT), it would be
very useful if a flag exists to tell the OS that the file will be read
sequential like in the madvise() call for mmap.  Is your code
available anywhere?

    >     I thought this is far too much logging data and so I didn't
    > spent much time trying to implement a recovery. But as far as I
    > got it I can tell that the updates to system catalogs and
    > keeping the indices up to date will be really tricky.

I have not looked at this area of the code.  Do the system catalogs
have a separate storage manager?  I do not see why the could not be
handled like any other data except for keeping the buffer in the cache.

Kent S. Gordon
Architect
iNetSpace Co.
voice: (972)851-3494 fax:(972)702-0384 e-mail:kgor@inetspace.com

Re: DB logging (was: Problem with the numbers I reported yesterday)

ocie@paracel.com — 1998-02-17T01:44:46Z

Kent S. Gordon wrote:

[SNIP]

> Log files do get very big with image logging.  I would not expect a
> huge win in performance unless the log device is a raw device.  On a
> cooked device (file system) buffer cache effects are very large (all
> disk data is being buffered both by postgresql and the OS buffer
> cache.  The buffer cache is actual harmfully in this case, since data
> is not reused, and the writes are synced.  The number of writes to the
> log also flush out other buffer from the cache leading to even more
> io.).  If a system does not have raw devices (linux, NT), it would be
                                                ^^^^^ 
What exactly do you mean by "raw devices" that it is not offered by
Linux?  If I take a partition of one of my hard drives and I don't
make a filesystem on it, I can perform reads and writes on the "raw
device" /dev/hd?? or /dev/sd??  I didn't think these writes were
buffered (if that's what you're referring to), but I could be wrong.

Ocie Mitchell.

Re: DB logging (was: Problem with the numbers I reported yesterday)

Bruce Momjian <maillist@candle.pha.pa.us> — 1998-02-17T03:15:58Z

> What exactly do you mean by "raw devices" that it is not offered by
> Linux?  If I take a partition of one of my hard drives and I don't
> make a filesystem on it, I can perform reads and writes on the "raw
> device" /dev/hd?? or /dev/sd??  I didn't think these writes were
> buffered (if that's what you're referring to), but I could be wrong.

Your /dev/hd* goes through the the buffer cache, the raw versions
/dev/rhd* does not.

-- 
Bruce Momjian
maillist@candle.pha.pa.us

Re: DB logging (was: Problem with the numbers I reported yesterday)

ocie@paracel.com — 1998-02-17T04:21:25Z

Bruce Momjian wrote:
> 
> > What exactly do you mean by "raw devices" that it is not offered by
> > Linux?  If I take a partition of one of my hard drives and I don't
> > make a filesystem on it, I can perform reads and writes on the "raw
> > device" /dev/hd?? or /dev/sd??  I didn't think these writes were
> > buffered (if that's what you're referring to), but I could be wrong.
> 
> Your /dev/hd* goes through the the buffer cache, the raw versions
> /dev/rhd* does not.

Actually on Linux, there is no raw/cooked drive interface as in
Solaris.  In Solaris, the /dev/dsk/ devices are buffered by the OS,
while their counterparts in /dev/rdsk are not.  Linux only has the one
interface to the partition, which is raw.  Code which uses these raw
devices (the filesystem code) must supply its own buffering.

Anyway, I don't want to prolong this tangential topic.  Linux should
provide raw devices, but does not (as Solaris does) provide buffered
or cooked access to disks.

Ocie Mitchell

Re: [HACKERS] Re: DB logging (was: Problem with the numbers I reported yesterday)

Jan Wieck <jwieck@debis.com> — 1998-02-19T12:09:02Z

Kent wrote:
>
> >>>>> "jwieck" == Jan Wieck <jwieck@debis.com> writes:
>     >     When doing a complete after image logging, that is taking
>     > all the tuples that are stored on insert/update, the tuple id's
>     > of deletes plus the information about transaction id's that
>     > commit, the regression tests produce log data that is more than
>     > the size of the final regression database.  The performance
>     > increase when only syncing the log- and controlfiles (2 control
>     > files on different devices and the logfile on a different device
>     > from the database files) and running the backends with -F is
>     > about 15-20% for the regression test.
>
> Log files do get very big with image logging.  I would not expect a
> huge win in performance unless the log device is a raw device.  On a
> cooked device (file system) buffer cache effects are very large (all
> disk data is being buffered both by postgresql and the OS buffer
> cache.  The buffer cache is actual harmfully in this case, since data
> is not reused, and the writes are synced.  The number of writes to the
> log also flush out other buffer from the cache leading to even more
> io.).  If a system does not have raw devices (linux, NT), it would be
> very useful if a flag exists to tell the OS that the file will be read
> sequential like in the madvise() call for mmap.  Is your code
> available anywhere?

    I don't have that code any more. It wasn't that much so I can
    redo it if at least you would like to help on that topic. But
    since  this  will  be  a  feature  we should wait for the 6.3
    release before touching anything.

>
>     >     I thought this is far too much logging data and so I didn't
>     > spent much time trying to implement a recovery. But as far as I
>     > got it I can tell that the updates to system catalogs and
>     > keeping the indices up to date will be really tricky.
>
> I have not looked at this area of the code.  Do the system catalogs
> have a separate storage manager?  I do not see why the could not be
> handled like any other data except for keeping the buffer in the cache.

    I just had some problems on the system catalogs (maybe due to
    the  system caching). I think that it can be handled somehow.

    There are other details in the logging we should  care  about
    when we implement it.

    The   logging  should  be  configurable  per  database.  Some
    databases have logging enabled while others are  unprotected.

    It  must  be  able  to do point in time recovery (restore the
    database from a backup and recover until an absolute time  or
    transaction ID).

    The   previous  two  produce  a  problem  for  shared  system
    relations.  If a backend  running  on  an  unlogged  database
    updates  pg_user  for  example, this time it must go into the
    log!

    We should give query logging instead of image logging a  try.


Until later, Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #