Thread

  1. Re: [HACKERS] LONG

    Tom Lane <tgl@sss.pgh.pa.us> — 1999-12-12T18:02:13Z

    Bruce Momjian <pgman@candle.pha.pa.us> writes:
    > Also, my idea was to auto-enable longs for all varlena types, so short
    > values stay in the table, while longer chained ones that take up lots of
    > space and are expensive to expand are retrieved only when needed.
    
    I missed most of yesterday's discussion (was off fighting a different
    fire...).  This morning in the shower I had a brilliant idea, which
    I now see Bruce has beaten me to ;-)
    
    The idea of doing tuple splitting by pushing "long" fields out of line,
    rather than just cutting up the tuple at arbitrary points, is clearly
    a win for the reasons Bruce and Jan point out.  But I like Bruce's
    approach (automatically do it for any overly-long varlena attribute)
    much better than Jan's (invent a special LONG datatype).  A special
    datatype is bad for several reasons:
    * it forces users to kluge up their database schemas;
    * inevitably, users will pick the wrong columns to make LONG (it's
      a truism that programmers seldom guess right about what parts of
      their programs consume the most resources; users would need a
      "profiler" to make the right decisions);
    * it doesn't solve the problems for arrays, which desperately need it;
    * we'd need to add a whole bunch of operations on the special datatype;
    
    I could live with all of those limitations if a "clean" datatype-based
    solution were possible, ie, all the special code is in the datatype
    functions.  But we already know that that's not possible --- there would
    have to be special hacks for the LONG datatype in other places.  So I
    think we ought to handle the problem as part of the tuple access
    machinery, not as a special datatype.
    
    I think that the right place to implement this is in heapam, and that
    it should go more or less like this:
    
    1. While writing out a tuple, if the total tuple size is "too big"
    (threshold would be some fraction of BLCKSZ, yet to be chosen),
    then the tuple manager would go through the tuple to find the longest
    varlena attribute, and convert same into an out-of-line attribute.
    Repeat if necessary until tuple size fits within threshold.
    
    2. While reading a tuple, fastgetattr() automatically fetches the
    out-of-line value if it sees the requested attribute is out-of-line.
    (I'd be inclined to mark out-of-line attributes in the same way that
    NULL attributes are marked: one bit in the tuple header shows if any
    out-of-line attrs are present, and if so there is a bitmap to show
    which ones are out-of-line.  We could also use Bruce's idea of
    commandeering the high-order bit of the varlena length word, but
    I think that's a much uglier and more fragile solution.)
    
    I think that these two changes would handle 99% of the problem.
    VACUUM would still need work, but most normal access to tuples would
    just work automatically, because all access to varlena fields must go
    through fastgetattr().
    
    An as-yet-unsolved issue is how to avoid memory leaks of out-of-line
    values after they have been read in by fastgetattr().  However, I think
    that's going to be a nasty problem with Jan's approach as well.  The
    best answer might be to solve this in combination with addressing the
    problem of leakage of temporary results during expression evaluation,
    say by adding some kind of reference-count convention to all varlena
    values.
    
    BTW, I don't see any really good reason to keep the out-of-line values
    in a separate physical file (relation) as Jan originally proposed.
    Why not keep them in the same file, but mark them as being something
    different than a normal tuple?  Sequential scans would have to know to
    skip over them (big deal), and VACUUM would have to handle them
    properly, but I think VACUUM is going to have to have special code to
    support this feature no matter what.  If we do make them a new primitive
    kind-of-a-tuple on disk, we could sidestep the problem of marking all
    the out-of-line values associated with a tuple when the tuple is
    outdated by a transaction.  The out-of-line values wouldn't have
    transaction IDs in them at all; they'd just be labeled with the CTID
    and/or OID of the primary tuple they belong to.  VACUUM would consult
    that tuple to determine whether to keep or discard an out-of-line value.
    
    			regards, tom lane
    
    
  2. Re: [HACKERS] LONG

    Jan Wieck <wieck@debis.com> — 1999-12-12T20:45:56Z

    Tom Lane wrote:
    
    > Bruce Momjian <pgman@candle.pha.pa.us> writes:
    > > Also, my idea was to auto-enable longs for all varlena types, so short
    > > values stay in the table, while longer chained ones that take up lots of
    > > space and are expensive to expand are retrieved only when needed.
    >
    > I missed most of yesterday's discussion (was off fighting a different
    > fire...).  This morning in the shower I had a brilliant idea, which
    > I now see Bruce has beaten me to ;-)
    >
    > The idea of doing tuple splitting by pushing "long" fields out of line,
    > rather than just cutting up the tuple at arbitrary points, is clearly
    > a win for the reasons Bruce and Jan point out.  But I like Bruce's
    > approach (automatically do it for any overly-long varlena attribute)
    > much better than Jan's (invent a special LONG datatype).  A special
    > datatype is bad for several reasons:
    > * it forces users to kluge up their database schemas;
    > * inevitably, users will pick the wrong columns to make LONG (it's
    >   a truism that programmers seldom guess right about what parts of
    >   their programs consume the most resources; users would need a
    >   "profiler" to make the right decisions);
    > * it doesn't solve the problems for arrays, which desperately need it;
    > * we'd need to add a whole bunch of operations on the special datatype;
    
    O.K.,
    
        you two got me now.
    
    >
    > I think that the right place to implement this is in heapam, and that
    > it should go more or less like this:
    >
    > 1. While writing out a tuple, if the total tuple size is "too big"
    > (threshold would be some fraction of BLCKSZ, yet to be chosen),
    > then the tuple manager would go through the tuple to find the longest
    > varlena attribute, and convert same into an out-of-line attribute.
    > Repeat if necessary until tuple size fits within threshold.
    
        Yepp.  But  it  does  NOT mangle up the tuple handed to it in
        place.  The flat values in the tuple are sometimes used AFTER
        heap_insert()    and    heap_update(),    for   example   for
        index_insert. So that might break other places.
    
    > 2. While reading a tuple, fastgetattr() automatically fetches the
    > out-of-line value if it sees the requested attribute is out-of-line.
    > (I'd be inclined to mark out-of-line attributes in the same way that
    > NULL attributes are marked: one bit in the tuple header shows if any
    > out-of-line attrs are present, and if so there is a bitmap to show
    > which ones are out-of-line.  We could also use Bruce's idea of
    > commandeering the high-order bit of the varlena length word, but
    > I think that's a much uglier and more fragile solution.)
    >
    > I think that these two changes would handle 99% of the problem.
    > VACUUM would still need work, but most normal access to tuples would
    > just work automatically, because all access to varlena fields must go
    > through fastgetattr().
    
        And I like Bruce's idea with the high order  bit  of  vl_len.
        This  is IMHO the only chance, to tell on UPDATE if the value
        wasn't changed.
    
        To detect that an UPDATE did not touch the out of line value,
        you  need  the  complete  long  reference  information in the
        RESULT tuple. The executor must not expand  the  value  while
        building them up already.
    
        But  Tom  is  right,  there is a visibility problem I haven't
        seen before.  It is  that  when  fetching  the  out  of  line
        attribute  (for  example in the type output function) is done
        later  than  fetching  the  reference  information.   Then  a
        transaction  reading  dirty  or  committed  might  see  wrong
        content,  or  worse,  see  different  contents  at  different
        fetches.
    
        The  solution  I see is to give any out of line datum another
        Oid, that is  part  of  it's  header  and  stamped  into  the
        reference  data.  That way, the long attribute lookup can use
        SnapshotAny using this  Oid,  there  can  only  be  one  that
        exists,  so SnapshotAny is safe here and forces that only the
        visibility of the master tuple in the main  table  counts  at
        all.
    
        Since this Values Oid is known in the Values reference of the
        tuple, we only need two indices on the out of line data.  One
        on  this  Oid, on on the referencing row's oid|attrno|seq  to
        be fast in heap_delete() and heap_update().
    
    > An as-yet-unsolved issue is how to avoid memory leaks of out-of-line
    > values after they have been read in by fastgetattr().  However, I think
    > that's going to be a nasty problem with Jan's approach as well.  The
    > best answer might be to solve this in combination with addressing the
    > problem of leakage of temporary results during expression evaluation,
    > say by adding some kind of reference-count convention to all varlena
    > values.
    
        At the point we decide to move an attribute out of the tuple,
        we  make a lookup in an array consisting of type Oid's. Thus,
        we have plenty of time to add one datatype after another  and
        enable  them separately for long processing, but get the ones
        enabled ASAP (next release) out of the door.
    
        As Bruce suggested, we  implement  a  central  function  that
        fetches  back  the  long  value. This is used in all the type
        specific  funcitons  in  adt.   Now  that  we  have  an   Oid
        identifier  per  single value, it's easy to implement a cache
        there, that can manage a LRU table of the last fetched values
        and cache smaller ones for fast access.
    
        It's  the  response  of  the types adt functions, to free the
        returned (old VARLENA looking) memory. Since  we  enable  the
        types one-by-one, there's no need to hurry on this.
    
    > BTW, I don't see any really good reason to keep the out-of-line values
    > in a separate physical file (relation) as Jan originally proposed.
    > Why not keep them in the same file, but mark them as being something
    > different than a normal tuple?  Sequential scans would have to know to
    > skip over them (big deal), and VACUUM would have to handle them
    
        The  one  I  see  is that a sequential scan would not benefit
        from this, it still has to read the entire relation, even  if
        looking only on small, fixed size items in the tuple. Will be
        a big win for count(*). And with the  mentioned  value  cache
        for  relatively  small  (yet  to define what that is) values,
        there will be very little overhead in a sort, if  the  tuples
        in  it  are  sorted  by  an  attribute where some long values
        occationally appear.
    
    > properly, but I think VACUUM is going to have to have special code to
    > support this feature no matter what.  If we do make them a new primitive
    > kind-of-a-tuple on disk, we could sidestep the problem of marking all
    > the out-of-line values associated with a tuple when the tuple is
    > outdated by a transaction.  The out-of-line values wouldn't have
    > transaction IDs in them at all; they'd just be labeled with the CTID
    > and/or OID of the primary tuple they belong to.  VACUUM would consult
    > that tuple to determine whether to keep or discard an out-of-line value.
    
        AFAIK, VACUUM consults single attributes of a tuple  only  to
        produce  the  statistical  informations  for them on ANALYZE.
        Well, statistical information  for  columns  containing  LONG
        values aren't good for the WHERE clause (I think we all agree
        on that). So it doesn't matter if these  informations  aren't
        totally  accurate, or if VACUUM counts them but uses only the
        first couple of bytes for the min/max etc. info.
    
        Also, the new long data relations should IMHO have their  own
        relkind.   So  VACUUM can easily detect them. This I think is
        required, so VACUUM can place an exclusive lock on  the  main
        table  first before starting to vacuum the long values (which
        can be done as is since it is in fact  a  normal  relation  -
        just  not  visible  to  the  user).  This  should  avoid race
        conditions as explained above on the visibility problem.
    
        I'll start to play around with this  approach  for  a  while,
        using  lztext  as  test  candidate  (with  custom compression
        parameters that force uncompressed storage). When I have some
        reasonable  result  ready to look at, I'll send a patch here,
        so we can continue the discussion while looking at some  test
        implementation.
    
    
    Jan
    
    --
    
    #======================================================================#
    # It's easier to get forgiveness for being wrong than for being right. #
    # Let's break this rule - forgive me.                                  #
    #========================================= wieck@debis.com (Jan Wieck) #
    
    
    
    
  3. Re: [HACKERS] LONG

    Bruce Momjian <pgman@candle.pha.pa.us> — 1999-12-12T21:44:27Z

    > 2. While reading a tuple, fastgetattr() automatically fetches the
    > out-of-line value if it sees the requested attribute is out-of-line.
    > (I'd be inclined to mark out-of-line attributes in the same way that
    > NULL attributes are marked: one bit in the tuple header shows if any
    > out-of-line attrs are present, and if so there is a bitmap to show
    > which ones are out-of-line.  We could also use Bruce's idea of
    > commandeering the high-order bit of the varlena length word, but
    > I think that's a much uglier and more fragile solution.)
    
    Not sure if fastgetattr() is the place for this.  I thought the varlena
    access routines themselves would work.  It is nice and clean to do it in
    fastgetattr, but how do you know to pfree it?  I suppose if you kept the
    high bit set, you could try cleaning up, but where?
    
    My idea was to expand the out-of-line varlena, and unset the 'long' bit.
    
    	long-bit|length|reloid|tupleoid|attno|longlen
    
    Unexpanded would be:
    
    	1|20|10032|23123|5|20000
    
    unexpanded is:
    
    	0|20000|data
    
    
    > 
    > I think that these two changes would handle 99% of the problem.
    > VACUUM would still need work, but most normal access to tuples would
    > just work automatically, because all access to varlena fields must go
    > through fastgetattr().
    > 
    > An as-yet-unsolved issue is how to avoid memory leaks of out-of-line
    > values after they have been read in by fastgetattr().  However, I think
    > that's going to be a nasty problem with Jan's approach as well.  The
    > best answer might be to solve this in combination with addressing the
    > problem of leakage of temporary results during expression evaluation,
    > say by adding some kind of reference-count convention to all varlena
    > values.
    
    That's why I was going to do the expansion only in the varlena access
    routines.  Patch already posted.
    
    > 
    > BTW, I don't see any really good reason to keep the out-of-line values
    > in a separate physical file (relation) as Jan originally proposed.
    > Why not keep them in the same file, but mark them as being something
    > different than a normal tuple?  Sequential scans would have to know to
    > skip over them (big deal), and VACUUM would have to handle them
    > properly, but I think VACUUM is going to have to have special code to
    > support this feature no matter what.  If we do make them a new primitive
    > kind-of-a-tuple on disk, we could sidestep the problem of marking all
    > the out-of-line values associated with a tuple when the tuple is
    > outdated by a transaction.  The out-of-line values wouldn't have
    > transaction IDs in them at all; they'd just be labeled with the CTID
    > and/or OID of the primary tuple they belong to.  VACUUM would consult
    > that tuple to determine whether to keep or discard an out-of-line value.
    
    I disagree.  By moving to another table, we don't have non-standard
    tuples in the main table.  We can create normal tuples in the long*
    table, of identical format, and access them just like normal tuples. 
    Having special long tuples in the main table that don't follow the
    format of the other tuples it a certain mess.  The long* tables also
    move the long data out of the main table so it is not accessed in
    sequential scans.  Why keep them in the main table?
    
    -- 
      Bruce Momjian                        |  http://www.op.net/~candle
      maillist@candle.pha.pa.us            |  (610) 853-3000
      +  If your life is a hard drive,     |  830 Blythe Avenue
      +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
    
    
  4. Re: [HACKERS] LONG

    Jan Wieck <wieck@debis.com> — 1999-12-12T21:54:35Z

    Bruce Momjian wrote:
    
    > I disagree.  By moving to another table, we don't have non-standard
    > tuples in the main table.  We can create normal tuples in the long*
    > table, of identical format, and access them just like normal tuples.
    > Having special long tuples in the main table that don't follow the
    > format of the other tuples it a certain mess.  The long* tables also
    > move the long data out of the main table so it is not accessed in
    > sequential scans.  Why keep them in the main table?
    
        More  ugly  and  complicated (especially for VACUUM) seems to
        me, the we need an index on these  nonstandard  tuples,  that
        doesn't  see  the  standard  ones,  while the regular indices
        ignore the new long tuples. At least  if  we  want  to  delay
        reading of long values until they're explicitly requested.
    
    
    Jan
    
    --
    
    #======================================================================#
    # It's easier to get forgiveness for being wrong than for being right. #
    # Let's break this rule - forgive me.                                  #
    #========================================= wieck@debis.com (Jan Wieck) #
    
    
    
    
  5. Re: [HACKERS] LONG

    Bruce Momjian <pgman@candle.pha.pa.us> — 1999-12-12T22:04:54Z

    > Bruce Momjian wrote:
    > 
    > > I disagree.  By moving to another table, we don't have non-standard
    > > tuples in the main table.  We can create normal tuples in the long*
    > > table, of identical format, and access them just like normal tuples.
    > > Having special long tuples in the main table that don't follow the
    > > format of the other tuples it a certain mess.  The long* tables also
    > > move the long data out of the main table so it is not accessed in
    > > sequential scans.  Why keep them in the main table?
    > 
    >     More  ugly  and  complicated (especially for VACUUM) seems to
    >     me, the we need an index on these  nonstandard  tuples,  that
    >     doesn't  see  the  standard  ones,  while the regular indices
    >     ignore the new long tuples. At least  if  we  want  to  delay
    >     reading of long values until they're explicitly requested.
    > 
    
    Yes, good point.  No reason to create non-standard tuples if you can
    avoid it.  And a separate table has performance advantages, especially
    because the long tuples are by definition long and take up lots of
    blocks.
    
    
    -- 
      Bruce Momjian                        |  http://www.op.net/~candle
      maillist@candle.pha.pa.us            |  (610) 853-3000
      +  If your life is a hard drive,     |  830 Blythe Avenue
      +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
    
    
  6. libpq questions...when threads collide

    Don Baccus <dhogaza@pacifier.com> — 1999-12-12T22:18:47Z

    I'm working on bullet-proofing AOLserver's postgres driver.
    
    I've fixed a bunch of weaknesses, but am stumped by the
    following...
    
    AOLserver's a multithreaded server, and libpq's database
    connection routines aren't threadsafe.  It turns out the
    environment in which the driver lives doesn't allow me
    to ensure that only one thread executes a PQsetdb at a
    time, at least without resorting to the specific operating
    system's mutexes and cond primitives.  The server provides
    a nice portable interface for such things but they're
    not available to database drivers because in general the
    server's not interested in having database drivers do such
    things.
    
    That's not a problem for this group, but I'm curious.  People
    have been using this driver for years, and some use it 
    heavily (Lamar Owen, for one).  Despite the thread unsafeness
    of PQsetdb et al, I've never seen a failure in this environment
    and I've never heard of folks experiencing such a failure.
    
    So my question's simple - what exactly makes PQsetdb et al
    thread unsafe?  I'm asking in order to attempt to get a handle
    on just how vulnerable the routines are when two threads attempt
    to open a database connection simultaneously.
    
    The other question's simple, too - are the implications predictable,
    i.e. will (for instance) one of the attemps simply crash or
    fail when two or more threads attempt to make a connection?  Or
    am I looking at something more evil, like silent building of a
    connection messed up in some subtle way?  
    
    I suspect the answer to the last question is that the result of
    doing this is unpredictable, but thought I'd ask.
    
    AOLserver supports external drivers called by a proxy with a 
    separate process provided for each database connection, but
    there are unfortunate performance implications with this
    approach.  It's designed explicitly for dbs with no threadsafe
    C API.  This includes Sybase, and in my testing the internal
    Postgres driver can feed bytes to the server about three times
    as fast as the external driver written for Sybase, so you
    can see why I'm reluctant to rewrite the Postgres driver simply
    because building a connection's not threadsafe.  After all,
    unless a backend crashes they only happen when the server's
    first fired up.  And people aren't seeing problems.
    
    
    
    
    - Don Baccus, Portland OR <dhogaza@pacifier.com>
      Nature photos, on-line guides, Pacific Northwest
      Rare Bird Alert Service and other goodies at
      http://donb.photo.net.
    
    
  7. Re: [HACKERS] libpq questions...when threads collide

    Tom Lane <tgl@sss.pgh.pa.us> — 1999-12-12T22:41:50Z

    Don Baccus <dhogaza@pacifier.com> writes:
    > Despite the thread unsafeness
    > of PQsetdb et al, I've never seen a failure in this environment
    > and I've never heard of folks experiencing such a failure.
    
    The *only* thing that's actually thread-unsafe, AFAIR, is
    PQconnectdb's use of a global array for connection parameters.
    PQsetdb/setdbLogin are thread-safe; so just use them instead.
    
    At least that was true before the async-connection code got added.
    I haven't looked at that to see if it introduces any problems.
    
    			regards, tom lane
    
    
  8. Re: [HACKERS] libpq questions...when threads collide

    Don Baccus <dhogaza@pacifier.com> — 1999-12-12T22:53:13Z

    At 05:41 PM 12/12/99 -0500, Tom Lane wrote:
    >Don Baccus <dhogaza@pacifier.com> writes:
    >> Despite the thread unsafeness
    >> of PQsetdb et al, I've never seen a failure in this environment
    >> and I've never heard of folks experiencing such a failure.
    >
    >The *only* thing that's actually thread-unsafe, AFAIR, is
    >PQconnectdb's use of a global array for connection parameters.
    >PQsetdb/setdbLogin are thread-safe; so just use them instead.
    
    Cool!  I am using setdbLogin but the documentation sez they,
    too, aren't threadsafe...maybe this should be changed?  This
    is great news.
    
    >At least that was true before the async-connection code got added.
    >I haven't looked at that to see if it introduces any problems.
    
    For the moment, I'm happy to believe that it hasn't, it makes my
    immediate future much simpler if I do so...
    
    Also, the documentation describes two routines, PQoidStatus and
    PQoidValue, but the libpq source seem to only define PQoidStatus.
    
    (some user asked for a routine to feed back the oid of an insert,
    so I looked into it while simultaneously suggesting he study
    "sequence" and its associated "nextval" and "currval" functions
    and ponder on why it's really a bad idea to related tables by
    storing oids rather than generated keys)
    
    
    
    
    - Don Baccus, Portland OR <dhogaza@pacifier.com>
      Nature photos, on-line guides, Pacific Northwest
      Rare Bird Alert Service and other goodies at
      http://donb.photo.net.
    
    
  9. Re: [HACKERS] libpq questions...when threads collide

    Tom Lane <tgl@sss.pgh.pa.us> — 1999-12-12T22:58:12Z

    Don Baccus <dhogaza@pacifier.com> writes:
    > Cool!  I am using setdbLogin but the documentation sez they,
    > too, aren't threadsafe...maybe this should be changed?
    
    I guess so.  Submit a patch...
    
    > Also, the documentation describes two routines, PQoidStatus and
    > PQoidValue, but the libpq source seem to only define PQoidStatus.
    
    PQoidValue is new in current sources --- you must be looking at
    current-snapshot docs, rather than what was released with 6.5.
    
    			regards, tom lane
    
    
  10. Re: [HACKERS] libpq questions...when threads collide

    Don Baccus <dhogaza@pacifier.com> — 1999-12-12T23:01:19Z

    At 05:58 PM 12/12/99 -0500, Tom Lane wrote:
    
    >PQoidValue is new in current sources --- you must be looking at
    >current-snapshot docs, rather than what was released with 6.5.
    
    I'm using the docs at www.postgresql.org, which I assumed would
    be matched to the current release.
    
    
    
    - Don Baccus, Portland OR <dhogaza@pacifier.com>
      Nature photos, on-line guides, Pacific Northwest
      Rare Bird Alert Service and other goodies at
      http://donb.photo.net.
    
    
  11. RE: [HACKERS] LONG

    Hiroshi Inoue <inoue@tpf.co.jp> — 1999-12-13T01:27:10Z

    There are so many mails for me to follow about this issue. 
    For example,what's the conclusion about the following ?
    Please teach me.
    
    > -----Original Message-----
    > From: owner-pgsql-hackers@postgreSQL.org
    > [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Tom Lane
    > 
    > 
    > BTW, I don't see any really good reason to keep the out-of-line values
    > in a separate physical file (relation) as Jan originally proposed.
    > Why not keep them in the same file, but mark them as being something
    > different than a normal tuple?  Sequential scans would have to know to
    > skip over them (big deal), and VACUUM would have to handle them
    > properly, but I think VACUUM is going to have to have special code to
    > support this feature no matter what.  If we do make them a new primitive
    > kind-of-a-tuple on disk, we could sidestep the problem of marking all
    > the out-of-line values associated with a tuple when the tuple is
    > outdated by a transaction.  The out-of-line values wouldn't have
    > transaction IDs in them at all; they'd just be labeled with the CTID
    
    What is wong if out-of-line values have their own XIDs ?
    If an out-of-line is newer than corresponding row in "primary" table
    it's bad but could it occur ?
    Because (rowid) of "secondary" table references "primary" table(oid)
    on delete cascade,XID_MAXs of them would be synchronized.
     
    Why is CTID needed ?  Is it necessary to know "primary" tuples from
    out-of-lines values ? 
    
    > and/or OID of the primary tuple they belong to.  VACUUM would consult
    > that tuple to determine whether to keep or discard an out-of-line value.
    >
    
    What is wrong with separate VACUUM ?
    VACUUM never changes OIDs and XIDs(after MVCC).
    
    Regards.
    
    Hiroshi Inoue
    Inoue@tpf.co.jp
    
    
  12. Re: [HACKERS] LONG

    Bruce Momjian <pgman@candle.pha.pa.us> — 1999-12-13T02:38:15Z

    >     The  solution  I see is to give any out of line datum another
    >     Oid, that is  part  of  it's  header  and  stamped  into  the
    >     reference  data.  That way, the long attribute lookup can use
    >     SnapshotAny using this  Oid,  there  can  only  be  one  that
    >     exists,  so SnapshotAny is safe here and forces that only the
    >     visibility of the master tuple in the main  table  counts  at
    >     all.
    
    This is a great idea.  Get rid of my use of the attribute number.  Make
    the varlena long value be:
    
    	long-bit|length|longrelid|longoid|longlen
    
    No need for attno in there anymore.
    
    Having a separate oid for the long value is great.  You can then have
    multiple versions of the long attribute in the long table and can
    control when updating a tuple.
    
    I liked Hiroshi's idea of allowing long values in an index by just
    pointing to the long table.  Seems that would work too.  varlena access
    routines make that possible.
    
    
    
    -- 
      Bruce Momjian                        |  http://www.op.net/~candle
      maillist@candle.pha.pa.us            |  (610) 853-3000
      +  If your life is a hard drive,     |  830 Blythe Avenue
      +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
    
    
  13. Re: [HACKERS] LONG

    Jan Wieck <wieck@debis.com> — 1999-12-13T03:01:55Z

    >
    > >     The  solution  I see is to give any out of line datum another
    > >     Oid, that is  part  of  it's  header  and  stamped  into  the
    > >     reference  data.  That way, the long attribute lookup can use
    > >     SnapshotAny using this  Oid,  there  can  only  be  one  that
    > >     exists,  so SnapshotAny is safe here and forces that only the
    > >     visibility of the master tuple in the main  table  counts  at
    > >     all.
    >
    > This is a great idea.  Get rid of my use of the attribute number.  Make
    > the varlena long value be:
    >
    >    long-bit|length|longrelid|longoid|longlen
    >
    > No need for attno in there anymore.
    
        I  still  need  it  to  explicitly  remove  one long value on
        update, while the other one is untouched. Otherwise  I  would
        have  to  drop  all  long  values  for  the  row together and
        reinsert all new ones.
    
    > Having a separate oid for the long value is great.  You can then have
    > multiple versions of the long attribute in the long table and can
    > control when updating a tuple.
    >
    > I liked Hiroshi's idea of allowing long values in an index by just
    > pointing to the long table.  Seems that would work too.  varlena access
    > routines make that possible.
    
        Maybe possible, but not that good IMHO. Would  cause  another
        index  scan from inside index scan to get at the value. An we
        all agree that indexing huge values isn't that a  good  thing
        at all.
    
    
    Jan
    
    --
    
    #======================================================================#
    # It's easier to get forgiveness for being wrong than for being right. #
    # Let's break this rule - forgive me.                                  #
    #========================================= wieck@debis.com (Jan Wieck) #
    
    
    
    
  14. RE: [HACKERS] LONG

    Hiroshi Inoue <inoue@tpf.co.jp> — 1999-12-13T03:56:08Z

    > -----Original Message-----
    > From: owner-pgsql-hackers@postgreSQL.org
    > [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Jan Wieck
    > 
    > >
    > > Having a separate oid for the long value is great.  You can then have
    > > multiple versions of the long attribute in the long table and can
    > > control when updating a tuple.
    > >
    > > I liked Hiroshi's idea of allowing long values in an index by just
    > > pointing to the long table.  Seems that would work too.  varlena access
    > > routines make that possible.
    > 
    >     Maybe possible, but not that good IMHO. Would  cause  another
    >     index  scan from inside index scan to get at the value. An we
    >     all agree that indexing huge values isn't that a  good  thing
    >     at all.
    >
    
    What I need is an unqiue index (rowid,rowattno,chunk_seq) on
    "secondary" table.
    Is it different from your orginal idea ?
    I don't need any index on primary table.
    
    Regards.
    
    Hiroshi Inoue
    Inoue@tpf.co.jp
    
    
  15. Re: [HACKERS] LONG

    Bruce Momjian <pgman@candle.pha.pa.us> — 1999-12-13T04:12:19Z

    > >
    > > >     The  solution  I see is to give any out of line datum another
    > > >     Oid, that is  part  of  it's  header  and  stamped  into  the
    > > >     reference  data.  That way, the long attribute lookup can use
    > > >     SnapshotAny using this  Oid,  there  can  only  be  one  that
    > > >     exists,  so SnapshotAny is safe here and forces that only the
    > > >     visibility of the master tuple in the main  table  counts  at
    > > >     all.
    > >
    > > This is a great idea.  Get rid of my use of the attribute number.  Make
    > > the varlena long value be:
    > >
    > >    long-bit|length|longrelid|longoid|longlen
    > >
    > > No need for attno in there anymore.
    > 
    >     I  still  need  it  to  explicitly  remove  one long value on
    >     update, while the other one is untouched. Otherwise  I  would
    >     have  to  drop  all  long  values  for  the  row together and
    >     reinsert all new ones.
    
    I am suggesting the longoid is not the oid of the primary or long*
    table, but a unque id we assigned just to number all parts of the long*
    tuple.  I thought that's what your oid was for.
    
    > 
    > > Having a separate oid for the long value is great.  You can then have
    > > multiple versions of the long attribute in the long table and can
    > > control when updating a tuple.
    > >
    > > I liked Hiroshi's idea of allowing long values in an index by just
    > > pointing to the long table.  Seems that would work too.  varlena access
    > > routines make that possible.
    > 
    >     Maybe possible, but not that good IMHO. Would  cause  another
    >     index  scan from inside index scan to get at the value. An we
    >     all agree that indexing huge values isn't that a  good  thing
    >     at all.
    
    May as well.  I can't think of a better solution for indexing when you
    have long values.  I don't think we want long* versions of indexes.
    
    -- 
      Bruce Momjian                        |  http://www.op.net/~candle
      maillist@candle.pha.pa.us            |  (610) 853-3000
      +  If your life is a hard drive,     |  830 Blythe Avenue
      +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
    
    
  16. RE: [HACKERS] LONG

    Hiroshi Inoue <inoue@tpf.co.jp> — 1999-12-13T05:19:27Z

    > -----Original Message-----
    > From: owner-pgsql-hackers@postgreSQL.org
    > [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Bruce Momjian
    > 
    > > >
    > > > >     The  solution  I see is to give any out of line datum another
    > > > >     Oid, that is  part  of  it's  header  and  stamped  into  the
    > > > >     reference  data.  That way, the long attribute lookup can use
    > > > >     SnapshotAny using this  Oid,  there  can  only  be  one  that
    > > > >     exists,  so SnapshotAny is safe here and forces that only the
    > > > >     visibility of the master tuple in the main  table  counts  at
    > > > >     all.
    > > >
    > > > This is a great idea.  Get rid of my use of the attribute 
    > number.  Make
    > > > the varlena long value be:
    > > >
    > > >    long-bit|length|longrelid|longoid|longlen
    > > >
    > > > No need for attno in there anymore.
    > > 
    > >     I  still  need  it  to  explicitly  remove  one long value on
    > >     update, while the other one is untouched. Otherwise  I  would
    > >     have  to  drop  all  long  values  for  the  row together and
    > >     reinsert all new ones.
    > 
    > I am suggesting the longoid is not the oid of the primary or long*
    > table, but a unque id we assigned just to number all parts of the long*
    > tuple.  I thought that's what your oid was for.
    >
    
    Unfortunately I couldn't follow this issue correctly. 
    Is the format of long value relation different from Jan's original now ?
    
     - At    CREATE   TABLE,   a   long   value   relation   named
          "_LONG<tablename>" is created for those tables who need it.
          And of course dropped and truncated appropriate. The schema
          of this table is
    
              rowid       Oid,          -- oid of our main data row
              rowattno    int2,         -- the attribute number in main data
              chunk_seq   int4,         -- the part number of this data chunk
              chunk       text          -- the content of this data chunk
     
    I thought that there's an unique index (rowid,rowattno,chunk_seq).
    Seems we could even update partially(specified chunk_seq only)
    without problem.
    
    Regards.
    
    Hiroshi Inoue
    Inoue@tpf.co.jp
    
    
  17. Re: [HACKERS] LONG

    Bruce Momjian <pgman@candle.pha.pa.us> — 1999-12-13T05:59:21Z

    > > I am suggesting the longoid is not the oid of the primary or long*
    > > table, but a unque id we assigned just to number all parts of the long*
    > > tuple.  I thought that's what your oid was for.
    > >
    > 
    > Unfortunately I couldn't follow this issue correctly. 
    > Is the format of long value relation different from Jan's original now ?
    > 
    >  - At    CREATE   TABLE,   a   long   value   relation   named
    >       "_LONG<tablename>" is created for those tables who need it.
    >       And of course dropped and truncated appropriate. The schema
    >       of this table is
    > 
    >           rowid       Oid,          -- oid of our main data row
    
    I am suggesting a unique oid just to store this long value.  The new oid
    gets stored in the primary table, and on every row of the long* table.
    
    
    >           rowattno    int2,         -- the attribute number in main data
    
    Not needed anymore.
    
    >           chunk_seq   int4,         -- the part number of this data chunk
    >           chunk       text          -- the content of this data chunk
    
    Yes.
    
    >  
    > I thought that there's an unique index (rowid,rowattno,chunk_seq).
    
    Index on longoid only.  No need index on longoid and chunk_seq because
    you don't need the rows returned in order.
    
    
    > Seems we could even update partially(specified chunk_seq only)
    > without problem.
    
    That could be done, but seems too rare because the new data would have
    to be the same length. Doesn't seem worthit, though others may
    disagree. 
    
    -- 
      Bruce Momjian                        |  http://www.op.net/~candle
      maillist@candle.pha.pa.us            |  (610) 853-3000
      +  If your life is a hard drive,     |  830 Blythe Avenue
      +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
    
    
  18. Re: [HACKERS] LONG

    Jan Wieck <wieck@debis.com> — 1999-12-13T06:27:06Z

    Bruce Momjian wrote:
    
    > > > No need for attno in there anymore.
    > >
    > >     I  still  need  it  to  explicitly  remove  one long value on
    > >     update, while the other one is untouched. Otherwise  I  would
    > >     have  to  drop  all  long  values  for  the  row together and
    > >     reinsert all new ones.
    >
    > I am suggesting the longoid is not the oid of the primary or long*
    > table, but a unque id we assigned just to number all parts of the long*
    > tuple.  I thought that's what your oid was for.
    
        It's  not  even  an  Oid  of  any  existing  tuple,  just  an
        identifier to quickly find all the chunks of one  LONG  value
        by (non-unique) index.
    
        My idea is this now:
    
        The schema of the expansion relation is
    
            value_id        Oid
            chunk_seq       int32
            chunk_data      text
    
        with a non unique index on value_id.
    
        We  change  heap_formtuple(),  heap_copytuple()  etc.  not to
        allocate the entire thing in one palloc(). Instead the  tuple
        portion itself is allocated separately and the current memory
        context remembered too  in  the  HeapTuple  struct  (this  is
        required below).
    
        The long value reference in a tuple is defined as:
    
            vl_len          int32;     /* high bit set, 32-bit = 18 */
            vl_datasize     int32;     /* real vl_len of long value */
            vl_valueid      Oid;       /* value_id in expansion relation */
            vl_relid        Oid;       /* Oid of "expansion" table */
            vl_rowid        Oid;       /* Oid of the row in "primary" table */
            vl_attno        int16;     /* attribute number in "primary" table */
    
        The  tuple  given to heap_update() (the most complex one) can
        now contain usual VARLENA values of the format
    
            high-bit=0|31-bit-size|data
    
        or if the value is the result of a scan eventually
    
            high-bit=1|31-bit=18|datasize|valueid|relid|rowid|attno
    
        Now there are a couple of different cases.
    
        1.  The value found is a plain VARLENA  that  must  be  moved
            off.
    
            To  move  it  off a new Oid for value_id is obtained, the
            value itself stored in the  expansion  relation  and  the
            attribute in the tuple is replaced by the above structure
            with the values  1,  18,  original  VARSIZE(),  value_id,
            "expansion" relid, "primary" tuples Oid and attno.
    
        2.  The  value  found  is a long value reference that has our
            own "expansion" relid and the correct  rowid  and  attno.
            This  would  be  the result of an UPDATE without touching
            this long value.
    
            Nothing to be done.
    
        3.  The value found is a  long  value  reference  of  another
            attribute,  row or relation and this attribute is enabled
            for move off.
    
            The long value is fetched from the expansion relation  it
            is  living  in,  and the same as for 1. is done with that
            value. There's space for optimization  here,  because  we
            might have room to store the value plain. This can happen
            if the operation was an INSERT INTO t1  SELECT  FROM  t2,
            where  t1 has few small plus one varsize attribute, while
            t2 has many, many long varsizes.
    
        4.  The value found is a  long  value  reference  of  another
            attribute, row or relation and this attribute is disabled
            for move off (either per column or because  our  relation
            does not have an expansion relation at all).
    
            The  long value is fetched from the expansion relation it
            is living in, and the reference in our tuple is  replaced
            with this plain VARLENA.
    
        This  in place replacement of values in the main tuple is the
        reason, why we have to make another allocation for the  tuple
        data  and  remember the memory context where made. Due to the
        above process, the tuple data can expand, and we then need to
        change into that context and reallocate it.
    
        What  heap_update()  further  must  do  is to examine the OLD
        tuple (that  it  already  has  grabbed  by  CTID  for  header
        modification)  and  delete all long values by their value_id,
        that aren't any longer present in the new tuple.
    
        The VARLENA arguments to type specific functions now can also
        have both formats.  The macro
    
            #define VAR_GETPLAIN(arg) \
                (VARLENA_ISLONG(arg) ? expand_long(arg) : (arg))
    
        can   be   used   to  get  a  pointer  to  an  allways  plain
        representation, and the macro
    
            #define VAR_FREEPLAIN(arg,userptr) \
                  if (arg != userptr) pfree(userptr);
    
        is to be used to tidy up before returning.
    
        In this scenario, a function  like  smaller(text,text)  would
        look like
    
            text *
            smaller(text *t1, text *t2)
            {
                text *plain1 = VAR_GETPLAIN(t1);
                text *plain2 = VAR_GETPLAIN(t2);
                text *result;
    
                if ( /* whatever to compare plain1 and plain2 */ )
                    result = t1;
                else
                    result = t2;
    
                VAR_FREEPLAIN(t1,plain1);
                VAR_FREEPLAIN(t2,plain2);
    
                return result;
            }
    
        The  LRU cache used in expand_long() will the again and again
        expansion become cheap enough. The  benefit  would  be,  that
        huge  values resulting from table scans will be passed around
        in the system (in and out of sorting,  grouping  etc.)  until
        they are modified or really stored/output.
    
        And the LONG index stuff should be covered here already (free
        lunch)!   Index_insert()  MUST  allways   be   called   after
        heap_insert()/heap_update(),   because  it  needs  the  there
        assigned CTID. So at that time, the moved off attributes  are
        replaced  in  the tuple data by the references. These will be
        stored instead of the values that  originally  where  in  the
        tuple.   Should  also  work with hash indices, as long as the
        hashing functions use VAR_GETPLAIN as well.
    
        If we want to use auto compression too, no problem.  We  code
        this  into  another  bit  of  the  first  32-bit  vl_len. The
        question if to call expand_long() changes now to "is  one  of
        these  set".  This  way,  we  can  store both, compressed and
        uncompressed  into  both,  "primary"  tuple  or   "expansion"
        relation. expand_long() will take care for it.
    
    
    Jan
    
    --
    
    #======================================================================#
    # It's easier to get forgiveness for being wrong than for being right. #
    # Let's break this rule - forgive me.                                  #
    #========================================= wieck@debis.com (Jan Wieck) #
    
    
    
    
  19. Re: [HACKERS] LONG

    Jan Wieck <wieck@debis.com> — 1999-12-13T06:46:01Z

    >     We  change  heap_formtuple(),  heap_copytuple()  etc.  not to
    >     allocate the entire thing in one palloc(). Instead the  tuple
    >     portion itself is allocated separately and the current memory
    >     context remembered too  in  the  HeapTuple  struct  (this  is
    >     required below).
    
    Uhh,
    
        just  realized  that  the  usual  pfree(htup)  will  not work
        anymore. But shouldn't that already have been something  like
        heap_freetuple(htup)?
    
    
    Jan
    
    --
    
    #======================================================================#
    # It's easier to get forgiveness for being wrong than for being right. #
    # Let's break this rule - forgive me.                                  #
    #========================================= wieck@debis.com (Jan Wieck) #
    
    
    
    
  20. RE: [HACKERS] LONG

    Hiroshi Inoue <inoue@tpf.co.jp> — 1999-12-13T09:42:01Z

    > -----Original Message-----
    > From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
    >
    > > > I am suggesting the longoid is not the oid of the primary or long*
    > > > table, but a unque id we assigned just to number all parts of
    > the long*
    > > > tuple.  I thought that's what your oid was for.
    > > >
    > > Unfortunately I couldn't follow this issue correctly.
    > > Is the format of long value relation different from Jan's original now ?
    > >
    > >  - At    CREATE   TABLE,   a   long   value   relation   named
    > >       "_LONG<tablename>" is created for those tables who need it.
    > >       And of course dropped and truncated appropriate. The schema
    > >       of this table is
    > >
    > >           rowid       Oid,          -- oid of our main data row
    >
    > I am suggesting a unique oid just to store this long value.  The new oid
    > gets stored in the primary table, and on every row of the long* table.
    >
    
    Hmm,we could delete long values easily using rowid in case of
    heap_delete() .......
    
    >
    > > Seems we could even update partially(specified chunk_seq only)
    > > without problem.
    >
    > That could be done, but seems too rare because the new data would have
    > to be the same length. Doesn't seem worthit, though others may
    > disagree.
    >
    
    First,I wanted to emphasize that we don't have to update any long value
    tuples if we don't update long values. It's a special case of partial
    update.
    
    Second,large object has an feature like this. If we would replace large
    object by LONG,isn't it needed ?
    
    Regards.
    
    Hiroshi Inoue
    Inoue@tpf.co.jp
    
    
    
  21. Re: [HACKERS] LONG

    Christof Petig <christof.petig@wtal.de> — 1999-12-13T21:50:52Z

    As I offered some time to work on tuple chaining this thread clearly
    touches the same area.
    
    The idea of transparantly moving big attributes into a seperate table
    clearly has its benefits as long as normal operations need not to touch
    these long values. I (too) see this as a great deal. And the fact that
    it happens transparently (not visible to user) is the best about it.
    
    But AFAICS tuple chaining shouldn't be such a big deal, it should be
    about three days of work. (It'll definitely take longer for me, since I
    have to understand pgsql's internals first.): Split the tuple into
    multiple Items on disk storage, concatenate them on read in. Then make
    vacuum ignore continued items when not dealing with the whole tuple. No
    need to touch CID, XID etc. The most obvious disadvantage is possible
    fragmentation of tuples (unless handled in vacuum). Disk access
    atomicity for tuples is a non issue for Linux people since Linux uses 1k
    blocks :-(
    
    Storing attributes seperately is the best solution once you exceed
    4*BLKSZ, tuple chaining addresses 1.1-3*BLKSZ most efficiently. (correct
    me if I'm wrong)
    
    LONG as a seperate type is IMHO just another concept you have to master
    before you can use a RDBMS efficiently. The less different concepts a
    user needs to learn, the easier life is for him. Postgres already has a
    lot of data types to learn. 
    
    Wrapping lo in a user type sounds good to me.
    
    Yours 
         Christof
    
    
    
    
  22. Re: [HACKERS] LONG

    Bruce Momjian <pgman@candle.pha.pa.us> — 1999-12-14T01:56:35Z

    This outline is perfect!
    
    
    > > I am suggesting the longoid is not the oid of the primary or long*
    > > table, but a unque id we assigned just to number all parts of the long*
    > > tuple.  I thought that's what your oid was for.
    > 
    >     It's  not  even  an  Oid  of  any  existing  tuple,  just  an
    >     identifier to quickly find all the chunks of one  LONG  value
    >     by (non-unique) index.
    
    Yes, I understood this and I think it is a great idea.  It allows UPDATE
    to control whether it wants to replace the LONG value.
    
    
    > 
    >     My idea is this now:
    > 
    >     The schema of the expansion relation is
    > 
    >         value_id        Oid
    >         chunk_seq       int32
    >         chunk_data      text
    > 
    >     with a non unique index on value_id.
    
    Yes, exactly.
    
    > 
    >     We  change  heap_formtuple(),  heap_copytuple()  etc.  not to
    >     allocate the entire thing in one palloc(). Instead the  tuple
    >     portion itself is allocated separately and the current memory
    >     context remembered too  in  the  HeapTuple  struct  (this  is
    >     required below).
    
    I read the later part.  I understand.
    
    > 
    >     The long value reference in a tuple is defined as:
    > 
    >         vl_len          int32;     /* high bit set, 32-bit = 18 */
    >         vl_datasize     int32;     /* real vl_len of long value */
    >         vl_valueid      Oid;       /* value_id in expansion relation */
    >         vl_relid        Oid;       /* Oid of "expansion" table */
    >         vl_rowid        Oid;       /* Oid of the row in "primary" table */
    >         vl_attno        int16;     /* attribute number in "primary" table */
    
    I see you need vl_rowid and vl_attno so you don't accidentally reference
    a LONG value twice.  Good point.  I hadn't thought of that.
    
    > 
    >     The  tuple  given to heap_update() (the most complex one) can
    >     now contain usual VARLENA values of the format
    > 
    >         high-bit=0|31-bit-size|data
    > 
    >     or if the value is the result of a scan eventually
    > 
    >         high-bit=1|31-bit=18|datasize|valueid|relid|rowid|attno
    > 
    >     Now there are a couple of different cases.
    > 
    >     1.  The value found is a plain VARLENA  that  must  be  moved
    >         off.
    > 
    >         To  move  it  off a new Oid for value_id is obtained, the
    >         value itself stored in the  expansion  relation  and  the
    >         attribute in the tuple is replaced by the above structure
    >         with the values  1,  18,  original  VARSIZE(),  value_id,
    >         "expansion" relid, "primary" tuples Oid and attno.
    > 
    >     2.  The  value  found  is a long value reference that has our
    >         own "expansion" relid and the correct  rowid  and  attno.
    >         This  would  be  the result of an UPDATE without touching
    >         this long value.
    > 
    >         Nothing to be done.
    > 
    >     3.  The value found is a  long  value  reference  of  another
    >         attribute,  row or relation and this attribute is enabled
    >         for move off.
    > 
    >         The long value is fetched from the expansion relation  it
    >         is  living  in,  and the same as for 1. is done with that
    >         value. There's space for optimization  here,  because  we
    >         might have room to store the value plain. This can happen
    >         if the operation was an INSERT INTO t1  SELECT  FROM  t2,
    >         where  t1 has few small plus one varsize attribute, while
    >         t2 has many, many long varsizes.
    > 
    >     4.  The value found is a  long  value  reference  of  another
    >         attribute, row or relation and this attribute is disabled
    >         for move off (either per column or because  our  relation
    >         does not have an expansion relation at all).
    > 
    >         The  long value is fetched from the expansion relation it
    >         is living in, and the reference in our tuple is  replaced
    >         with this plain VARLENA.
    
    Yes.
    
    > 
    >     This  in place replacement of values in the main tuple is the
    >     reason, why we have to make another allocation for the  tuple
    >     data  and  remember the memory context where made. Due to the
    >     above process, the tuple data can expand, and we then need to
    >     change into that context and reallocate it.
    
    
    Yes, got it.
    
    > 
    >     What  heap_update()  further  must  do  is to examine the OLD
    >     tuple (that  it  already  has  grabbed  by  CTID  for  header
    >     modification)  and  delete all long values by their value_id,
    >     that aren't any longer present in the new tuple.
    
    Yes, makes vacuum run find on the LONG* relation.
    
    > 
    >     The VARLENA arguments to type specific functions now can also
    >     have both formats.  The macro
    > 
    >         #define VAR_GETPLAIN(arg) \
    >             (VARLENA_ISLONG(arg) ? expand_long(arg) : (arg))
    > 
    >     can   be   used   to  get  a  pointer  to  an  allways  plain
    >     representation, and the macro
    > 
    >         #define VAR_FREEPLAIN(arg,userptr) \
    >               if (arg != userptr) pfree(userptr);
    > 
    >     is to be used to tidy up before returning.
    
    Got it.
    
    > 
    >     In this scenario, a function  like  smaller(text,text)  would
    >     look like
    > 
    >         text *
    >         smaller(text *t1, text *t2)
    >         {
    >             text *plain1 = VAR_GETPLAIN(t1);
    >             text *plain2 = VAR_GETPLAIN(t2);
    >             text *result;
    > 
    >             if ( /* whatever to compare plain1 and plain2 */ )
    >                 result = t1;
    >             else
    >                 result = t2;
    > 
    >             VAR_FREEPLAIN(t1,plain1);
    >             VAR_FREEPLAIN(t2,plain2);
    > 
    >             return result;
    >         }
    
    Yes.
    
    > 
    >     The  LRU cache used in expand_long() will the again and again
    >     expansion become cheap enough. The  benefit  would  be,  that
    >     huge  values resulting from table scans will be passed around
    >     in the system (in and out of sorting,  grouping  etc.)  until
    >     they are modified or really stored/output.
    
    Yes.
    
    > 
    >     And the LONG index stuff should be covered here already (free
    >     lunch)!   Index_insert()  MUST  allways   be   called   after
    >     heap_insert()/heap_update(),   because  it  needs  the  there
    >     assigned CTID. So at that time, the moved off attributes  are
    >     replaced  in  the tuple data by the references. These will be
    >     stored instead of the values that  originally  where  in  the
    >     tuple.   Should  also  work with hash indices, as long as the
    >     hashing functions use VAR_GETPLAIN as well.
    
    I hoped this would be true.  Great.
    
    > 
    >     If we want to use auto compression too, no problem.  We  code
    >     this  into  another  bit  of  the  first  32-bit  vl_len. The
    >     question if to call expand_long() changes now to "is  one  of
    >     these  set".  This  way,  we  can  store both, compressed and
    >     uncompressed  into  both,  "primary"  tuple  or   "expansion"
    >     relation. expand_long() will take care for it.
    
    Perfect.  Sounds great.
    
    -- 
      Bruce Momjian                        |  http://www.op.net/~candle
      maillist@candle.pha.pa.us            |  (610) 853-3000
      +  If your life is a hard drive,     |  830 Blythe Avenue
      +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026