Thread

  1. Re: Real-life range datasets

    Oleg Bartunov <oleg@sai.msu.su> — 2011-12-22T15:48:03Z

    Bene,
    
    we have pgfoundry project http://pgfoundry.org/projects/dbsamples/.
    Since your sample database is very important (for me also), I suggest to use
    this site.
    
    Oleg
    On Thu, 22 Dec 2011, Benedikt Grundmann wrote:
    
    > Hello,
    >
    > We have a table in a postgres 8.4 database that would make use of date
    > ranges and exclusion constraints if they were available.  Sadly I cannot
    > give you the data as it is based on data we are paying for and as part
    > of the relevant licenses we are obliqued to not give the data to third
    > parties.
    >
    > Basically the tables keep meta data about financial instruments on
    > a given day.  Thanks to corporate actions (companies merging, splitting
    > up, etc...) that meta data can be different from one day to the next.
    >
    > One way to model such a table is:
    >
    > identifier, date, payload columns...
    >
    >
    > unique index on (date, identifier)
    >
    > (and in fact some of the payload columns are unique per day indices
    > as well).
    >
    > But because there are a large number of rows per day and most don't
    > change this is a very wasteful representation.
    >
    > Instead we use this
    >
    > identifier, effective_from, effective_until, payload columns...
    >
    > And we have some clever plpgsql functions that merge a days snapshot
    > into that representation.  That happens only a few times per day and
    > is currently quite slow mostly because a lot of time is spend in
    > validation triggers to check that there are no overlapping entries
    > for effective_from,effective_until for jane_symbol and a few other
    > identifiers.
    >
    > The most common operations are:
    >
    >  Get all or most rows of a given day
    >
    > (select ... from instruments where :date between effective_from and effective_until)
    >
    > left join of the instruments (again in the normal case constrained to
    > one day but in same cases periods of a week or a few month)
    >
    > select ... from t left join instruments on
    >  t.jane_symbol = instruments.jane_symbol
    >  t.date between instruments.effective_from and t.effective_until
    >  where t.date = X
    >    and additional constraint on the number of rows from t
    >
    > With t a huge table clustered on date with roughly 500,000 to 2,000,000
    > entries per day.  The left join would work most of the time (my guess is
    > more than 90%).  But there are entries in t where the jane_symbol would
    > not be in instruments (sadly).
    >
    > Current size (immediately after a cluster):
    >
    >  table     toast        (all indices)   total
    > | 1268 MB | 900 MB     | 693 MB        | 2861 MB
    >
    > => select min(effective_from), max(effective_from) from instruments;
    >    min     |    max
    > ------------+------------
    > 2011-05-30 | 2011-12-21
    > (1 row)
    >
    > b=> select count(*) from instruments where current_date - 1 between effective_from and effective_until ;
    > count
    > --------
    > 358741
    > (1 row)
    >
    > I should be able to give you a table with the same characteristics as
    > the instruments table but bogus data by replacing all entries in the
    > table with random strings of the same length or something like that.
    > I can probably take a little bit of time during this or the next week
    > to generate such "fake" real world data ;-)   Is there an ftp site to
    > upload the gzipped pg_dump file to?
    >
    > Cheers,
    >
    > Bene
    >
    > On 20/12/11 16:48, Alexander Korotkov wrote:
    >> Hackers,
    >>
    >> For better GiST indexing of range types it's important to have real-life
    >> datasets for testing on. Real-life range datasets would help to proof (or
    >> reject) some concepts and get more realistic benchmarks. Also, it would be
    >> nice to know what queries you expect to run fast on that datasets. Ideally
    >> it should be real-life set of queries, but it also could be your
    >> presentation of what are typical queries  for such datasets.
    >> Thanks!
    >>
    >> -----
    >> With best regards,
    >> Alexander Korotkov.
    >
    >
    
     	Regards,
     		Oleg
    _____________________________________________________________
    Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
    Sternberg Astronomical Institute, Moscow University, Russia
    Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
    phone: +007(495)939-16-83, +007(495)939-23-83