Thread

  1. Problem with the numbers I reported yesterday

    Michael Meskes <meskes@topsystem.de> — 1998-02-11T09:36:03Z

    I ran my performance tests some more times and it seems the numbers are not
    really comparable. When I run PostgreSQL without -F I get a sync after every
    insert. With -F I get no sync at all as all inserts fit well into the
    buffer. However, Oracle in comparison does sync. Simply hearing the disk
    access it seems as if they sync every two or three seconds.
    
    Does anyone know a way to really check both DBMSs?
    
    Michael
    
    -- 
    Dr. Michael Meskes, Project-Manager    | topsystem Systemhaus GmbH
    meskes@topsystem.de                    | Europark A2, Adenauerstr. 20
    meskes@debian.org                      | 52146 Wuerselen
    Go SF49ers! Go Rhein Fire!             | Tel: (+49) 2405/4670-44
    Use Debian GNU/Linux!                  | Fax: (+49) 2405/4670-10
    
    
  2. Re: [HACKERS] Problem with the numbers I reported yesterday

    Bruce Momjian <maillist@candle.pha.pa.us> — 1998-02-11T15:31:20Z

    > 
    > I ran my performance tests some more times and it seems the numbers are not
    > really comparable. When I run PostgreSQL without -F I get a sync after every
    > insert. With -F I get no sync at all as all inserts fit well into the
    > buffer. However, Oracle in comparison does sync. Simply hearing the disk
    > access it seems as if they sync every two or three seconds.
    > 
    > Does anyone know a way to really check both DBMSs?
    
    Many dbms's do buffered logging, that is they sync after the buffer gets
    full or after a minute or so.  We have the logic to add buffered logging
    to PostgreSQL and will be doing it later.  Right now, we only have
    non-buffered logging, and no logging.
    
    -- 
    Bruce Momjian
    maillist@candle.pha.pa.us
    
    
  3. Re: [HACKERS] Problem with the numbers I reported yesterday

    Bruce Momjian <maillist@candle.pha.pa.us> — 1998-02-12T14:47:43Z

    > 
    > Hi,
    > 
    > perhaps a stupid question:
    > 
    > What will happen, if you start PostgreSQL with -F and write a script 
    > which is a loop that fsyncs every 2 seconds? Does this lead to a 
    > database which is "almost" correct? Does this seem like a good 
    > compromise?
    
    Well, you really have to sync the data pages BEFORE sync'ing pg_log. 
    Our buffered logging idea for post-6.3 will do exactly that.  I don't
    think it is a workaround.  You could get pg_log to disk saying a
    transaction is complete without getting all the data pages to disk if
    the crash happened during the sync.
    
    -- 
    Bruce Momjian
    maillist@candle.pha.pa.us
    
    
  4. Re: [HACKERS] Problem with the numbers I reported yesterday

    Boersenspielteam <boersenspiel@vocalweb.de> — 1998-02-12T15:41:45Z

    Hi,
    
    perhaps a stupid question:
    
    What will happen, if you start PostgreSQL with -F and write a script 
    which is a loop that fsyncs every 2 seconds? Does this lead to a 
    database which is "almost" correct? Does this seem like a good 
    compromise?
    
    > > I ran my performance tests some more times and it seems the numbers are not
    > > really comparable. When I run PostgreSQL without -F I get a sync after every
    > > insert. With -F I get no sync at all as all inserts fit well into the
    > > buffer. However, Oracle in comparison does sync. Simply hearing the disk
    > > access it seems as if they sync every two or three seconds.
    > > 
    > > Does anyone know a way to really check both DBMSs?
    > 
    > Many dbms's do buffered logging, that is they sync after the buffer gets
    > full or after a minute or so.  We have the logic to add buffered logging
    > to PostgreSQL and will be doing it later.  Right now, we only have
    > non-buffered logging, and no logging.
    > 
    > -- 
    > Bruce Momjian
    > maillist@candle.pha.pa.us
    > 
    > 
    
    Ciao
    
    Das Boersenspielteam.
    
    ---------------------------------------------------------------------------
                              http://www.boersenspiel.de
                  	         Das Boersenspiel im Internet
                 *Realitaetsnah*  *Kostenlos*  *Ueber 6000 Spieler*
    ---------------------------------------------------------------------------
    
    
  5. Re: [HACKERS] Problem with the numbers I reported yesterday

    ocie@paracel.com — 1998-02-12T18:22:47Z

    Bruce Momjian wrote:
    > 
    > > 
    > > Hi,
    > > 
    > > perhaps a stupid question:
    > > 
    > > What will happen, if you start PostgreSQL with -F and write a script 
    > > which is a loop that fsyncs every 2 seconds? Does this lead to a 
    > > database which is "almost" correct? Does this seem like a good 
    > > compromise?
    > 
    > Well, you really have to sync the data pages BEFORE sync'ing pg_log. 
    
    Why should this be necessary?  If the transaction is considered
    committed once the log has be written, and the system crashes before
    the data are written, then postgres can look at the data and logs when
    it is next started up and apply all the transactions that were logged
    but not committed to the data pages.
    
    Am I missing something?  It seems to me if you sync the data pages
    first, then what good is the log? (other than being able to retrace
    your steps)
    
    Ocie Mitchell
    
    
  6. Re: [HACKERS] Problem with the numbers I reported yesterday

    Bruce Momjian <maillist@candle.pha.pa.us> — 1998-02-12T19:16:29Z

    > 
    > Bruce Momjian wrote:
    > > 
    > > > 
    > > > Hi,
    > > > 
    > > > perhaps a stupid question:
    > > > 
    > > > What will happen, if you start PostgreSQL with -F and write a script 
    > > > which is a loop that fsyncs every 2 seconds? Does this lead to a 
    > > > database which is "almost" correct? Does this seem like a good 
    > > > compromise?
    > > 
    > > Well, you really have to sync the data pages BEFORE sync'ing pg_log. 
    > 
    > Why should this be necessary?  If the transaction is considered
    > committed once the log has be written, and the system crashes before
    > the data are written, then postgres can look at the data and logs when
    > it is next started up and apply all the transactions that were logged
    > but not committed to the data pages.
    
    No, on restart, you can't identify the old/new data.  Remember, pg_log
    is just the transaction id and a flag.  The superceeded/new rows are
    mixed on the data pages, with transaction id's as markers.
    
    
    > 
    > Am I missing something?  It seems to me if you sync the data pages
    > first, then what good is the log? (other than being able to retrace
    > your steps)
    
    Again, the log is just a list of transaction ids, and their statuses.
    
    -- 
    Bruce Momjian
    maillist@candle.pha.pa.us
    
    
  7. Re: [HACKERS] Problem with the numbers I reported yesterday

    Kent S. Gordon <kgor@inetspace.com> — 1998-02-13T17:33:22Z

    >>>>> "ocie" == ocie  <ocie@paracel.com> writes:
    
        > Bruce Momjian wrote:
        >>  > > Hi, > > perhaps a stupid question: > > What will happen,
        >> if you start PostgreSQL with -F and write a script > which is a
        >> loop that fsyncs every 2 seconds? Does this lead to a >
        >> database which is "almost" correct? Does this seem like a good
        >> > compromise?
        >> 
        >> Well, you really have to sync the data pages BEFORE sync'ing
        >> pg_log.
    
        > Why should this be necessary?  If the transaction is considered
        > committed once the log has be written, and the system crashes
        > before the data are written, then postgres can look at the data
        > and logs when it is next started up and apply all the
        > transactions that were logged but not committed to the data
        > pages.
    
        > Am I missing something?  It seems to me if you sync the data
        > pages first, then what good is the log? (other than being able
        > to retrace your steps)
    
    I do not think that pg_log is used like a normal 'log' device in other
    databases.  My quick look at the code looks like pg_log only has a
    list of transactions and not the actual data blocks.  Notice that
    TransRecover is commented out in backent/access/transam/transam.c.
    
    Most database log has the before images and after images of any page
    that has been modified in a transaction followed by commit/abort
    record.  This allows for only this file to have to be synced.  The
    rest of the database can float (generally checkpoints are done every
    so often to reduce recover time).  The method of recovering from a
    crash is to replay the log from the last checkpoint until the end of
    the log by applying the before/after images (as needed based on
    weather the transaction commited) to the actual database relations.
    
    I would appreciate anyone correcting any mistakes in my understanding
    of how postgres works.
    
        > Ocie Mitchell
    
    Kent S. Gordon
    Architect
    iNetSpace Co.
    voice: (972)851-3494 fax:(972)702-0384 e-mail:kgor@inetspace.com
    
    
  8. Re: [HACKERS] Problem with the numbers I reported yesterday

    Bruce Momjian <maillist@candle.pha.pa.us> — 1998-02-13T19:34:09Z

    This is 100% correct.  See my other posting describing the issues.
    
    > 
    > >>>>> "ocie" == ocie  <ocie@paracel.com> writes:
    > 
    >     > Bruce Momjian wrote:
    >     >>  > > Hi, > > perhaps a stupid question: > > What will happen,
    >     >> if you start PostgreSQL with -F and write a script > which is a
    >     >> loop that fsyncs every 2 seconds? Does this lead to a >
    >     >> database which is "almost" correct? Does this seem like a good
    >     >> > compromise?
    >     >> 
    >     >> Well, you really have to sync the data pages BEFORE sync'ing
    >     >> pg_log.
    > 
    >     > Why should this be necessary?  If the transaction is considered
    >     > committed once the log has be written, and the system crashes
    >     > before the data are written, then postgres can look at the data
    >     > and logs when it is next started up and apply all the
    >     > transactions that were logged but not committed to the data
    >     > pages.
    > 
    >     > Am I missing something?  It seems to me if you sync the data
    >     > pages first, then what good is the log? (other than being able
    >     > to retrace your steps)
    > 
    > I do not think that pg_log is used like a normal 'log' device in other
    > databases.  My quick look at the code looks like pg_log only has a
    > list of transactions and not the actual data blocks.  Notice that
    > TransRecover is commented out in backent/access/transam/transam.c.
    > 
    > Most database log has the before images and after images of any page
    > that has been modified in a transaction followed by commit/abort
    > record.  This allows for only this file to have to be synced.  The
    > rest of the database can float (generally checkpoints are done every
    > so often to reduce recover time).  The method of recovering from a
    > crash is to replay the log from the last checkpoint until the end of
    > the log by applying the before/after images (as needed based on
    > weather the transaction commited) to the actual database relations.
    > 
    > I would appreciate anyone correcting any mistakes in my understanding
    > of how postgres works.
    > 
    >     > Ocie Mitchell
    > 
    > Kent S. Gordon
    > Architect
    > iNetSpace Co.
    > voice: (972)851-3494 fax:(972)702-0384 e-mail:kgor@inetspace.com
    > 
    
    
    -- 
    Bruce Momjian
    maillist@candle.pha.pa.us
    
    
  9. DB logging (was: Problem with the numbers I reported yesterday)

    Jan Wieck <jwieck@debis.com> — 1998-02-14T13:58:05Z

    Kent wrote:
    >
    > I do not think that pg_log is used like a normal 'log' device in other
    > databases.  My quick look at the code looks like pg_log only has a
    > list of transactions and not the actual data blocks.  Notice that
    > TransRecover is commented out in backent/access/transam/transam.c.
    >
    > Most database log has the before images and after images of any page
    > that has been modified in a transaction followed by commit/abort
    > record.  This allows for only this file to have to be synced.  The
    > rest of the database can float (generally checkpoints are done every
    > so often to reduce recover time).  The method of recovering from a
    > crash is to replay the log from the last checkpoint until the end of
    > the log by applying the before/after images (as needed based on
    > weather the transaction commited) to the actual database relations.
    >
    > I would appreciate anyone correcting any mistakes in my understanding
    > of how postgres works.
    >
    >     > Ocie Mitchell
    >
    > Kent S. Gordon
    > Architect
    > iNetSpace Co.
    > voice: (972)851-3494 fax:(972)702-0384 e-mail:kgor@inetspace.com
    >
    >
    
        Totally  right,  PostgreSQL doesn't have a log mechanism that
        collects all the information to recover a corrupted  database
        from a backup.
    
        I hacked around on that a little bit.
    
        When doing a complete after image logging, that is taking all
        the tuples that are stored on insert/update, the  tuple  id's
        of  deletes  plus the information about transaction id's that
        commit, the regression tests produce log data  that  is  more
        than   the   size  of  the  final  regression  database.  The
        performance  increase  when  only  syncing   the   log-   and
        controlfiles  (2  control  files on different devices and the
        logfile on a different device from the  database  files)  and
        running  the  backends  with  -F  is  about  15-20%  for  the
        regression test.
    
        I thought this is far too much logging data and so  I  didn't
        spent much time trying to implement a recovery. But as far as
        I got it I can tell that the updates to system  catalogs  and
        keeping the indices up to date will be really tricky.
    
        Another  possible log mechanism I'll try sometimes after v6.3
        release is to log the queries and  data  from  copy  commands
        along with informations about Oid and Tid allocations.
    
    
    Until later, Jan
    
    --
    
    #======================================================================#
    # It's easier to get forgiveness for being wrong than for being right. #
    # Let's break this rule - forgive me.                                  #
    #======================================== jwieck@debis.com (Jan Wieck) #
    
    
    
    
  10. Re: DB logging (was: Problem with the numbers I reported yesterday)

    Kent S. Gordon <kgor@inetspace.com> — 1998-02-16T14:18:31Z

    >>>>> "jwieck" == Jan Wieck <jwieck@debis.com> writes:
    
        > Kent wrote:
    description of current logging method deleted.
    
        >     Totally right, PostgreSQL doesn't have a log mechanism that
        > collects all the information to recover a corrupted database
        > from a backup.
    
        >     I hacked around on that a little bit.
    
        >     When doing a complete after image logging, that is taking
        > all the tuples that are stored on insert/update, the tuple id's
        > of deletes plus the information about transaction id's that
        > commit, the regression tests produce log data that is more than
        > the size of the final regression database.  The performance
        > increase when only syncing the log- and controlfiles (2 control
        > files on different devices and the logfile on a different device
        > from the database files) and running the backends with -F is
        > about 15-20% for the regression test.
    
    Log files do get very big with image logging.  I would not expect a
    huge win in performance unless the log device is a raw device.  On a
    cooked device (file system) buffer cache effects are very large (all
    disk data is being buffered both by postgresql and the OS buffer
    cache.  The buffer cache is actual harmfully in this case, since data
    is not reused, and the writes are synced.  The number of writes to the
    log also flush out other buffer from the cache leading to even more
    io.).  If a system does not have raw devices (linux, NT), it would be
    very useful if a flag exists to tell the OS that the file will be read
    sequential like in the madvise() call for mmap.  Is your code
    available anywhere?
    
        >     I thought this is far too much logging data and so I didn't
        > spent much time trying to implement a recovery. But as far as I
        > got it I can tell that the updates to system catalogs and
        > keeping the indices up to date will be really tricky.
    
    I have not looked at this area of the code.  Do the system catalogs
    have a separate storage manager?  I do not see why the could not be
    handled like any other data except for keeping the buffer in the cache.
    
    Kent S. Gordon
    Architect
    iNetSpace Co.
    voice: (972)851-3494 fax:(972)702-0384 e-mail:kgor@inetspace.com
    
    
    
    
  11. Re: DB logging (was: Problem with the numbers I reported yesterday)

    ocie@paracel.com — 1998-02-17T01:44:46Z

    Kent S. Gordon wrote:
    
    [SNIP]
    
    > Log files do get very big with image logging.  I would not expect a
    > huge win in performance unless the log device is a raw device.  On a
    > cooked device (file system) buffer cache effects are very large (all
    > disk data is being buffered both by postgresql and the OS buffer
    > cache.  The buffer cache is actual harmfully in this case, since data
    > is not reused, and the writes are synced.  The number of writes to the
    > log also flush out other buffer from the cache leading to even more
    > io.).  If a system does not have raw devices (linux, NT), it would be
                                                    ^^^^^ 
    What exactly do you mean by "raw devices" that it is not offered by
    Linux?  If I take a partition of one of my hard drives and I don't
    make a filesystem on it, I can perform reads and writes on the "raw
    device" /dev/hd?? or /dev/sd??  I didn't think these writes were
    buffered (if that's what you're referring to), but I could be wrong.
    
    Ocie Mitchell.
    
    
  12. Re: DB logging (was: Problem with the numbers I reported yesterday)

    Bruce Momjian <maillist@candle.pha.pa.us> — 1998-02-17T03:15:58Z

    > What exactly do you mean by "raw devices" that it is not offered by
    > Linux?  If I take a partition of one of my hard drives and I don't
    > make a filesystem on it, I can perform reads and writes on the "raw
    > device" /dev/hd?? or /dev/sd??  I didn't think these writes were
    > buffered (if that's what you're referring to), but I could be wrong.
    
    Your /dev/hd* goes through the the buffer cache, the raw versions
    /dev/rhd* does not.
    
    -- 
    Bruce Momjian
    maillist@candle.pha.pa.us
    
    
  13. Re: DB logging (was: Problem with the numbers I reported yesterday)

    ocie@paracel.com — 1998-02-17T04:21:25Z

    Bruce Momjian wrote:
    > 
    > > What exactly do you mean by "raw devices" that it is not offered by
    > > Linux?  If I take a partition of one of my hard drives and I don't
    > > make a filesystem on it, I can perform reads and writes on the "raw
    > > device" /dev/hd?? or /dev/sd??  I didn't think these writes were
    > > buffered (if that's what you're referring to), but I could be wrong.
    > 
    > Your /dev/hd* goes through the the buffer cache, the raw versions
    > /dev/rhd* does not.
    
    Actually on Linux, there is no raw/cooked drive interface as in
    Solaris.  In Solaris, the /dev/dsk/ devices are buffered by the OS,
    while their counterparts in /dev/rdsk are not.  Linux only has the one
    interface to the partition, which is raw.  Code which uses these raw
    devices (the filesystem code) must supply its own buffering.
    
    Anyway, I don't want to prolong this tangential topic.  Linux should
    provide raw devices, but does not (as Solaris does) provide buffered
    or cooked access to disks.
    
    Ocie Mitchell
    
    
    
  14. Re: [HACKERS] Re: DB logging (was: Problem with the numbers I reported yesterday)

    Jan Wieck <jwieck@debis.com> — 1998-02-19T12:09:02Z

    Kent wrote:
    >
    > >>>>> "jwieck" == Jan Wieck <jwieck@debis.com> writes:
    >     >     When doing a complete after image logging, that is taking
    >     > all the tuples that are stored on insert/update, the tuple id's
    >     > of deletes plus the information about transaction id's that
    >     > commit, the regression tests produce log data that is more than
    >     > the size of the final regression database.  The performance
    >     > increase when only syncing the log- and controlfiles (2 control
    >     > files on different devices and the logfile on a different device
    >     > from the database files) and running the backends with -F is
    >     > about 15-20% for the regression test.
    >
    > Log files do get very big with image logging.  I would not expect a
    > huge win in performance unless the log device is a raw device.  On a
    > cooked device (file system) buffer cache effects are very large (all
    > disk data is being buffered both by postgresql and the OS buffer
    > cache.  The buffer cache is actual harmfully in this case, since data
    > is not reused, and the writes are synced.  The number of writes to the
    > log also flush out other buffer from the cache leading to even more
    > io.).  If a system does not have raw devices (linux, NT), it would be
    > very useful if a flag exists to tell the OS that the file will be read
    > sequential like in the madvise() call for mmap.  Is your code
    > available anywhere?
    
        I don't have that code any more. It wasn't that much so I can
        redo it if at least you would like to help on that topic. But
        since  this  will  be  a  feature  we should wait for the 6.3
        release before touching anything.
    
    >
    >     >     I thought this is far too much logging data and so I didn't
    >     > spent much time trying to implement a recovery. But as far as I
    >     > got it I can tell that the updates to system catalogs and
    >     > keeping the indices up to date will be really tricky.
    >
    > I have not looked at this area of the code.  Do the system catalogs
    > have a separate storage manager?  I do not see why the could not be
    > handled like any other data except for keeping the buffer in the cache.
    
        I just had some problems on the system catalogs (maybe due to
        the  system caching). I think that it can be handled somehow.
    
        There are other details in the logging we should  care  about
        when we implement it.
    
        The   logging  should  be  configurable  per  database.  Some
        databases have logging enabled while others are  unprotected.
    
        It  must  be  able  to do point in time recovery (restore the
        database from a backup and recover until an absolute time  or
        transaction ID).
    
        The   previous  two  produce  a  problem  for  shared  system
        relations.  If a backend  running  on  an  unlogged  database
        updates  pg_user  for  example, this time it must go into the
        log!
    
        We should give query logging instead of image logging a  try.
    
    
    Until later, Jan
    
    --
    
    #======================================================================#
    # It's easier to get forgiveness for being wrong than for being right. #
    # Let's break this rule - forgive me.                                  #
    #======================================== jwieck@debis.com (Jan Wieck) #