Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →
  1. Fix bug introduced by pgrminclude where the tablespace version name was

  1. Large C files

    Bruce Momjian <bruce@momjian.us> — 2011-09-03T23:18:47Z

    FYI, here are all the C files with over 6k lines:
    
    -  45133 ./interfaces/ecpg/preproc/preproc.c
    -  33651 ./backend/parser/gram.c
    -  17551 ./backend/parser/scan.c
       14209 ./bin/pg_dump/pg_dump.c
       10590 ./backend/access/transam/xlog.c
        9764 ./backend/commands/tablecmds.c
        8681 ./backend/utils/misc/guc.c
    -   7667 ./bin/psql/psqlscan.c
        7213 ./backend/utils/adt/ruleutils.c
        6814 ./backend/utils/adt/selfuncs.c
        6176 ./backend/utils/adt/numeric.c
        6030 ./pl/plpgsql/src/pl_exec.c
    
    I have dash-marked the files that are computer-generated.  It seems
    pg_dump.c and xlog.c should be split into smaller C files.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +
    
    
  2. Re: Large C files

    Alvaro Herrera <alvherre@commandprompt.com> — 2011-09-05T22:56:49Z

    Excerpts from Bruce Momjian's message of sáb sep 03 20:18:47 -0300 2011:
    > FYI, here are all the C files with over 6k lines:
    > 
    > -  45133 ./interfaces/ecpg/preproc/preproc.c
    > -  33651 ./backend/parser/gram.c
    > -  17551 ./backend/parser/scan.c
    >    14209 ./bin/pg_dump/pg_dump.c
    >    10590 ./backend/access/transam/xlog.c
    >     9764 ./backend/commands/tablecmds.c
    >     8681 ./backend/utils/misc/guc.c
    > -   7667 ./bin/psql/psqlscan.c
    >     7213 ./backend/utils/adt/ruleutils.c
    >     6814 ./backend/utils/adt/selfuncs.c
    >     6176 ./backend/utils/adt/numeric.c
    >     6030 ./pl/plpgsql/src/pl_exec.c
    > 
    > I have dash-marked the files that are computer-generated.  It seems
    > pg_dump.c and xlog.c should be split into smaller C files.
    
    I don't think there's any particular point to this general exercise (too
    large for what?), but Simon had patches (or at least ideas) to split
    xlog.c IIRC.
    
    -- 
    Álvaro Herrera <alvherre@commandprompt.com>
    The PostgreSQL Company - Command Prompt, Inc.
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
    
    
  3. Re: Large C files

    Robert Haas <robertmhaas@gmail.com> — 2011-09-06T02:54:53Z

    On Mon, Sep 5, 2011 at 6:56 PM, Alvaro Herrera
    <alvherre@commandprompt.com> wrote:
    > Excerpts from Bruce Momjian's message of sáb sep 03 20:18:47 -0300 2011:
    >> FYI, here are all the C files with over 6k lines:
    >>
    >> -  45133 ./interfaces/ecpg/preproc/preproc.c
    >> -  33651 ./backend/parser/gram.c
    >> -  17551 ./backend/parser/scan.c
    >>    14209 ./bin/pg_dump/pg_dump.c
    >>    10590 ./backend/access/transam/xlog.c
    >>     9764 ./backend/commands/tablecmds.c
    >>     8681 ./backend/utils/misc/guc.c
    >> -   7667 ./bin/psql/psqlscan.c
    >>     7213 ./backend/utils/adt/ruleutils.c
    >>     6814 ./backend/utils/adt/selfuncs.c
    >>     6176 ./backend/utils/adt/numeric.c
    >>     6030 ./pl/plpgsql/src/pl_exec.c
    >>
    >> I have dash-marked the files that are computer-generated.  It seems
    >> pg_dump.c and xlog.c should be split into smaller C files.
    >
    > I don't think there's any particular point to this general exercise (too
    > large for what?), but Simon had patches (or at least ideas) to split
    > xlog.c IIRC.
    
    Yeah.  xlog.c and pg_dump.c are really pretty horrible code, and could
    probably benefit from being split up.  Actually, just splitting them
    up probably isn't enough: I think they need extensive refactoring.
    For example, ISTM that StartupXLOG() should delegate substantial
    chunks of what it does to subroutines, so that the toplevel function
    is short enough to read and understand without getting lost.
    
    On the other hand, I can't help but think splitting up numeric.c would
    be a bad idea all around.  There's not really going to be any coherent
    way of dividing up the functionality in that file, and it would hinder
    the ability to make functions static and keep interfaces private.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  4. Re: Large C files

    Peter Eisentraut <peter_e@gmx.net> — 2011-09-06T07:29:34Z

    On lör, 2011-09-03 at 19:18 -0400, Bruce Momjian wrote:
    > FYI, here are all the C files with over 6k lines:
    > 
    > -  45133 ./interfaces/ecpg/preproc/preproc.c
    > -  33651 ./backend/parser/gram.c
    > -  17551 ./backend/parser/scan.c
    >    14209 ./bin/pg_dump/pg_dump.c
    >    10590 ./backend/access/transam/xlog.c
    >     9764 ./backend/commands/tablecmds.c
    >     8681 ./backend/utils/misc/guc.c
    > -   7667 ./bin/psql/psqlscan.c
    >     7213 ./backend/utils/adt/ruleutils.c
    >     6814 ./backend/utils/adt/selfuncs.c
    >     6176 ./backend/utils/adt/numeric.c
    >     6030 ./pl/plpgsql/src/pl_exec.c
    > 
    > I have dash-marked the files that are computer-generated.  It seems
    > pg_dump.c and xlog.c should be split into smaller C files.
    
    I was thinking about splitting up plpython.c, but it's not even on that
    list. ;-)
    
    
    
    
  5. Re: Large C files

    Bruce Momjian <bruce@momjian.us> — 2011-09-06T14:05:32Z

    Peter Eisentraut wrote:
    > On l?r, 2011-09-03 at 19:18 -0400, Bruce Momjian wrote:
    > > FYI, here are all the C files with over 6k lines:
    > > 
    > > -  45133 ./interfaces/ecpg/preproc/preproc.c
    > > -  33651 ./backend/parser/gram.c
    > > -  17551 ./backend/parser/scan.c
    > >    14209 ./bin/pg_dump/pg_dump.c
    > >    10590 ./backend/access/transam/xlog.c
    > >     9764 ./backend/commands/tablecmds.c
    > >     8681 ./backend/utils/misc/guc.c
    > > -   7667 ./bin/psql/psqlscan.c
    > >     7213 ./backend/utils/adt/ruleutils.c
    > >     6814 ./backend/utils/adt/selfuncs.c
    > >     6176 ./backend/utils/adt/numeric.c
    > >     6030 ./pl/plpgsql/src/pl_exec.c
    > > 
    > > I have dash-marked the files that are computer-generated.  It seems
    > > pg_dump.c and xlog.c should be split into smaller C files.
    > 
    > I was thinking about splitting up plpython.c, but it's not even on that
    > list. ;-)
    
    For me, the test is when I feel, "Yuck, I am in that massive file
    again".
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +
    
    
  6. Re: Large C files

    Simon Riggs <simon@2ndquadrant.com> — 2011-09-06T19:56:02Z

    On Tue, Sep 6, 2011 at 3:05 PM, Bruce Momjian <bruce@momjian.us> wrote:
    > Peter Eisentraut wrote:
    >> On l?r, 2011-09-03 at 19:18 -0400, Bruce Momjian wrote:
    >> > FYI, here are all the C files with over 6k lines:
    >> >
    >> > -  45133 ./interfaces/ecpg/preproc/preproc.c
    >> > -  33651 ./backend/parser/gram.c
    >> > -  17551 ./backend/parser/scan.c
    >> >    14209 ./bin/pg_dump/pg_dump.c
    >> >    10590 ./backend/access/transam/xlog.c
    >> >     9764 ./backend/commands/tablecmds.c
    >> >     8681 ./backend/utils/misc/guc.c
    >> > -   7667 ./bin/psql/psqlscan.c
    >> >     7213 ./backend/utils/adt/ruleutils.c
    >> >     6814 ./backend/utils/adt/selfuncs.c
    >> >     6176 ./backend/utils/adt/numeric.c
    >> >     6030 ./pl/plpgsql/src/pl_exec.c
    >> >
    >> > I have dash-marked the files that are computer-generated.  It seems
    >> > pg_dump.c and xlog.c should be split into smaller C files.
    >>
    >> I was thinking about splitting up plpython.c, but it's not even on that
    >> list. ;-)
    >
    > For me, the test is when I feel, "Yuck, I am in that massive file
    > again".
    
    There are many things to do yet in xlog.c, but splitting it into
    pieces is pretty low on the list.
    
    I did look at doing it years ago before we started the heavy lifting
    for SR/HS but it was too much work and dangerous too.
    
    The only way I would entertain thoughts of major refactoring would be
    if comprehensive tests were contributed, so we could verify everything
    still works afterwards.
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  7. Re: Large C files

    Robert Haas <robertmhaas@gmail.com> — 2011-09-06T20:07:55Z

    On Tue, Sep 6, 2011 at 3:56 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
    > The only way I would entertain thoughts of major refactoring would be
    > if comprehensive tests were contributed, so we could verify everything
    > still works afterwards.
    
    That sounds like a really good idea.  Mind you, I don't have a very
    clear idea of how to design such tests and probably no immediate
    availability to do the work either, but I like the concept very much.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  8. Re: Large C files

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-09-06T23:13:52Z

    On 6 September 2011 21:07, Robert Haas <robertmhaas@gmail.com> wrote:
    > On Tue, Sep 6, 2011 at 3:56 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
    >> The only way I would entertain thoughts of major refactoring would be
    >> if comprehensive tests were contributed, so we could verify everything
    >> still works afterwards.
    >
    > That sounds like a really good idea.  Mind you, I don't have a very
    > clear idea of how to design such tests and probably no immediate
    > availability to do the work either, but I like the concept very much.
    
    More or less off the top of my head, I don't think that it's much
    trouble to write an automated test for this. Of course, it would be as
    flawed as any test in that it can only prove the presence of errors by
    the criteria of the test, not the absence of all errors.
    
    Here's what could go wrong that we can test for when splitting a
    monster translation unit into more manageable modules that I can
    immediately think of, that is not already handled by compiler
    diagnostics or the build farm (I'm thinking of problems arising when
    some headers are excluded in new .c files because they appear to be
    superfluous, but turn out to not be on some platform):
    
    * Across TUs, we somehow fail to provide consistent linkage between
    the old object file and the sum of the new object files.
    
    * Within TUs, we unshadow a previously shadowed variable, so we link
    to a global variable rather than one local to the original/other new
    file. Unlikely to be a problem. Here's what I get when I compile
    xlog.c in the usual way with the addition of the -Wshadow flag:
    
    [peter@localhost transam]$ gcc -O0 -g -Wall -Wmissing-prototypes
    -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels
    -Wformat-security -fno-strict-aliasing -fwrapv -Wshadow -g
    -I../../../../src/include -D_GNU_SOURCE   -c -o xlog.o xlog.c -MMD -MP
    -MF .deps/xlog.Po
    xlog.c: In function ‘XLogCheckBuffer’:
    xlog.c:1279:12: warning: declaration of ‘lower’ shadows a global declaration
    ../../../../src/include/utils/builtins.h:793:14: warning: shadowed
    declaration is here
    xlog.c:1280:12: warning: declaration of ‘upper’ shadows a global declaration
    ../../../../src/include/utils/builtins.h:794:14: warning: shadowed
    declaration is here
    xlog.c: In function ‘XLogArchiveNotifySeg’:
    xlog.c:1354:29: warning: declaration of ‘log’ shadows a global declaration
    xlog.c: In function ‘XLogFileInit’:
    xlog.c:2329:21: warning: declaration of ‘log’ shadows a global declaration
    xlog.c: In function ‘XLogFileCopy’:
    xlog.c:2480:21: warning: declaration of ‘log’ shadows a global declaration
    xlog.c: In function ‘InstallXLogFileSegment’:
    xlog.c:2598:32: warning: declaration of ‘log’ shadows a global declaration
    xlog.c: In function ‘XLogFileOpen’:
    xlog.c:2676:21: warning: declaration of ‘log’ shadows a global declaration
    xlog.c: In function ‘XLogFileRead’:
    *** SNIP, CUT OUT A BIT MORE OF SAME***
    
    
    Having looked at the man page for nm, it should be possible to hack
    together a shellscript for src/tools/ that:
    
    1. Takes one filename as its first argument, and a set of 2 or more
    equivalent split file names (no file extension - there is a
    requirement that both $FILENAME.c and $FILENAME.o exist).
    
    2. Looks at the symbol table for the original C file's corresponding
    object file, say xlog.o, as output from nm, and sorts it.
    
    3. Intelligently diffs that against the concatenated output of the
    symbol table for each new object file. This output would be sorted
    just as the the single object file nm output was, but only after
    concatenation. Here, by intelligently I mean that we exclude undefined
    symbols. That way, it shouldn't matter if symbols go missing when, for
    example, Text section symbols from one file but show up in multiple
    other new files as undefined symbols, nor should it matter that we
    call functions like memcpy() from each new file that only appeared
    once in the old object file's symbol table.
    
    4. Do a similar kind of intelligent diff with the -Wshadow ouput
    above, stripping out line numbers and file names as the first step of
    a pipline, using sed, sorting the orignal file's compiler output, and
    stripping, concatenating then sorting the new set of outputs. I think
    that after that, the output from the original file should be the same
    as the combined output of the new files, unless we messed up.
    
    If you have to give a previously static variable global linkage, then
    prima facie "you're doing it wrong", so we don't have to worry about
    that case - maybe you can argue that it's okay in this one case, but
    that's considered controversial. We detect this case because the
    symbol type goes from 'd' to 'D'.
    
    Of course, we expect that some functions will lose their internal
    linkage as part of this process, and that is output by the shellscript
    - no attempt is made to hide that. This test doesn't reduce the
    failure to a simple pass or fail - it has to be interpreted by a
    human. It does take the drudgery out of verifying that this mechanical
    process has been performed correctly though.
    
    I agree that C files shouldn't be split because they're big, but
    because modularising them is a maintainability win. I also think that
    10,000 line C files are a real problem that we should think about
    addressing. These two views may seem to be in tension, but they're not
    - if you're not able to usefully modularise such a large file, perhaps
    it's time to reconsider your design (this is not an allusion to any
    problems that I might believe any particular C file has) .
    
    Anyone interested in this idea? I think that an actual implementation
    will probably reveal a few more problems that haven't occurred to me
    yet, but it's too late to produce one right now.
    
    On 6 September 2011 08:29, Peter Eisentraut <peter_e@gmx.net> wrote:
    > I was thinking about splitting up plpython.c, but it's not even on that
    > list. ;-)
    
    IIRC the obesity of that file is something that Jan Urbański intends
    to fix, or is at least concerned about. Perhaps you should compare
    notes.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
    
  9. Re: Large C files

    Jan Urbański <wulczer@wulczer.org> — 2011-09-06T23:22:29Z

    On 07/09/11 01:13, Peter Geoghegan wrote:
    > On 6 September 2011 08:29, Peter Eisentraut <peter_e@gmx.net> wrote:
    >> I was thinking about splitting up plpython.c, but it's not even on that
    >> list. ;-)
    > 
    > IIRC the obesity of that file is something that Jan Urbański intends
    > to fix, or is at least concerned about. Perhaps you should compare
    > notes.
    
    Yeah, plpython.c could easily be splitted into submodules for builtin
    functions, spi support, subtransactions support, traceback support etc.
    
    If there is consensus that this should be done, I'll see if I can find
    time to submit a giant-diff-no-behaviour-changes patch for the next CF.
    
    Cheers,
    Jan
    
    
  10. Re: Large C files

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-09-07T00:03:55Z

    On 7 September 2011 00:13, Peter Geoghegan <peter@2ndquadrant.com> wrote:
    > * Within TUs, we unshadow a previously shadowed variable, so we link
    > to a global variable rather than one local to the original/other new
    > file. Unlikely to be a problem. Here's what I get when I compile
    > xlog.c in the usual way with the addition of the -Wshadow flag:
    
    I hit send too soon. Of course, this isn't going to matter in the case
    I described because an extern declaration of int foo cannot appear in
    the same TU as a static declaration of int foo - it won't compile. I
    hastily gave that as an example of a general phenomenon that can occur
    when performing this splitting process. An actually valid example of
    same would be if someone refactored functions a bit as part of this
    process to make things more modular, and now referenced a global
    variable rather than a local one as part of that process. This is
    quite possible, because namespace pollution is a big problem with
    heavyweight C files - Just look at how much output that -Wshadow flag
    gives when used on xlog.c.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
    
  11. Re: Large C files

    Bruce Momjian <bruce@momjian.us> — 2011-09-07T00:18:05Z

    Peter Geoghegan wrote:
    > On 7 September 2011 00:13, Peter Geoghegan <peter@2ndquadrant.com> wrote:
    > > * Within TUs, we unshadow a previously shadowed variable, so we link
    > > to a global variable rather than one local to the original/other new
    > > file. Unlikely to be a problem. Here's what I get when I compile
    > > xlog.c in the usual way with the addition of the -Wshadow flag:
    > 
    > I hit send too soon. Of course, this isn't going to matter in the case
    > I described because an extern declaration of int foo cannot appear in
    > the same TU as a static declaration of int foo - it won't compile. I
    > hastily gave that as an example of a general phenomenon that can occur
    > when performing this splitting process. An actually valid example of
    > same would be if someone refactored functions a bit as part of this
    > process to make things more modular, and now referenced a global
    > variable rather than a local one as part of that process. This is
    > quite possible, because namespace pollution is a big problem with
    > heavyweight C files - Just look at how much output that -Wshadow flag
    > gives when used on xlog.c.
    
    I am confused how moving a function from one C file to another could
    cause breakage?
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +
    
    
  12. Re: Large C files

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-09-07T01:14:18Z

    On 7 September 2011 01:18, Bruce Momjian <bruce@momjian.us> wrote:
    > I am confused how moving a function from one C file to another could
    > cause breakage?
    
    I'm really concerned about silent breakage, however unlikely - that is
    also Simon and Robert's concern, and rightly so. If it's possible in
    principle to guard against a certain type of problem, we should do so.
    While this is a mechanical process, it isn't quite that mechanical a
    process - I would not expect this work to be strictly limited to
    simply spreading some functions around different files. Certainly, if
    there are any other factors at all that could disrupt things, a
    problem that does not cause a compiler warning or error is vastly more
    troublesome than one that does, and the most plausible such error is
    if a symbol is somehow resolved differently when the function is
    moved. I suppose that the simplest way that this could happen is
    probably by somehow having a variable of the same name and type appear
    twice, once as a static, the other time as a global.
    
    IMHO, when manipulating code at this sort of macro scale, it's good to
    be paranoid and exhaustive, particularly when it doesn't cost you
    anything anyway. Unlike when writing or fixing a bit of code at a
    time, which we're all used to, if the compiler doesn't tell you about
    it, your chances of catching the problem before a bug manifests itself
    are close to zero.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
    
  13. Re: Large C files

    Robert Haas <robertmhaas@gmail.com> — 2011-09-07T17:44:48Z

    On Tue, Sep 6, 2011 at 9:14 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
    > On 7 September 2011 01:18, Bruce Momjian <bruce@momjian.us> wrote:
    >> I am confused how moving a function from one C file to another could
    >> cause breakage?
    >
    > I'm really concerned about silent breakage, however unlikely - that is
    > also Simon and Robert's concern, and rightly so. If it's possible in
    > principle to guard against a certain type of problem, we should do so.
    > While this is a mechanical process, it isn't quite that mechanical a
    > process - I would not expect this work to be strictly limited to
    > simply spreading some functions around different files. Certainly, if
    > there are any other factors at all that could disrupt things, a
    > problem that does not cause a compiler warning or error is vastly more
    > troublesome than one that does, and the most plausible such error is
    > if a symbol is somehow resolved differently when the function is
    > moved. I suppose that the simplest way that this could happen is
    > probably by somehow having a variable of the same name and type appear
    > twice, once as a static, the other time as a global.
    >
    > IMHO, when manipulating code at this sort of macro scale, it's good to
    > be paranoid and exhaustive, particularly when it doesn't cost you
    > anything anyway. Unlike when writing or fixing a bit of code at a
    > time, which we're all used to, if the compiler doesn't tell you about
    > it, your chances of catching the problem before a bug manifests itself
    > are close to zero.
    
    I was less concerned about the breakage that might be caused by
    variables acquiring unintended referents - which should be unlikely
    anyway given reasonable variable naming conventions - and more
    concerned that the associated refactoring would break recovery.  We
    have no recovery regression tests; that's not a good thing.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  14. Re: Large C files

    Bruce Momjian <bruce@momjian.us> — 2011-09-07T17:48:37Z

    Robert Haas wrote:
    > > IMHO, when manipulating code at this sort of macro scale, it's good to
    > > be paranoid and exhaustive, particularly when it doesn't cost you
    > > anything anyway. Unlike when writing or fixing a bit of code at a
    > > time, which we're all used to, if the compiler doesn't tell you about
    > > it, your chances of catching the problem before a bug manifests itself
    > > are close to zero.
    > 
    > I was less concerned about the breakage that might be caused by
    > variables acquiring unintended referents - which should be unlikely
    > anyway given reasonable variable naming conventions - and more
    > concerned that the associated refactoring would break recovery.  We
    > have no recovery regression tests; that's not a good thing.
    
    So we are talking about more than moving files between functions?  Yes,
    it would be risky to restruction functions, but for someone who
    understand that code, it might be safe.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +
    
    
  15. Re: Large C files

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-07T18:12:34Z

    Bruce Momjian <bruce@momjian.us> writes:
    > Robert Haas wrote:
    >> I was less concerned about the breakage that might be caused by
    >> variables acquiring unintended referents - which should be unlikely
    >> anyway given reasonable variable naming conventions - and more
    >> concerned that the associated refactoring would break recovery.  We
    >> have no recovery regression tests; that's not a good thing.
    
    > So we are talking about more than moving files between functions?  Yes,
    > it would be risky to restruction functions, but for someone who
    > understand that code, it might be safe.
    
    The pgrminclude-induced bug you just fixed shows a concrete way in which
    moving code from one file to another might silently break it, ie, it
    still compiles despite lack of definition of some symbol it's intended
    to see.
    
    Having said that, I tend to agree that xlog.c is getting so large and
    messy that it needs to be broken up.  But I'm not in favor of breaking
    up files just because they're large, eg, ruleutils.c is not in need of
    such treatment.  The problem with xlog.c is that it seems to be dealing
    with many more considerations than it originally did.
    
    			regards, tom lane
    
    
  16. Re: Large C files

    Simon Riggs <simon@2ndquadrant.com> — 2011-09-07T19:44:00Z

    On Wed, Sep 7, 2011 at 7:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Bruce Momjian <bruce@momjian.us> writes:
    >> Robert Haas wrote:
    >>> I was less concerned about the breakage that might be caused by
    >>> variables acquiring unintended referents - which should be unlikely
    >>> anyway given reasonable variable naming conventions - and more
    >>> concerned that the associated refactoring would break recovery.  We
    >>> have no recovery regression tests; that's not a good thing.
    >
    >> So we are talking about more than moving files between functions?  Yes,
    >> it would be risky to restruction functions, but for someone who
    >> understand that code, it might be safe.
    >
    > The pgrminclude-induced bug you just fixed shows a concrete way in which
    > moving code from one file to another might silently break it, ie, it
    > still compiles despite lack of definition of some symbol it's intended
    > to see.
    >
    > Having said that, I tend to agree that xlog.c is getting so large and
    > messy that it needs to be broken up.  But I'm not in favor of breaking
    > up files just because they're large, eg, ruleutils.c is not in need of
    > such treatment.  The problem with xlog.c is that it seems to be dealing
    > with many more considerations than it originally did.
    
    I agree as well, though we've spawned many new files and directories
    in the last 7 years. Almost nothing has gone in there that didn't need
    to.
    
    As long as we accept its not a priority, I'll do some work to slide
    things away and we can do it over time.
    
    Please lets not waste effort on refactoring efforts in mid dev cycle.
    Having this done by someone without good experience is just going to
    waste all of our time and sneak bugs into something that does actually
    work rather well.
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  17. Re: Large C files

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-07T20:02:26Z

    Simon Riggs <simon@2ndQuadrant.com> writes:
    > Please lets not waste effort on refactoring efforts in mid dev cycle.
    
    Say what?  When else would you have us do it?
    
    			regards, tom lane
    
    
  18. Re: Large C files

    Simon Riggs <simon@2ndquadrant.com> — 2011-09-08T06:15:35Z

    On Wed, Sep 7, 2011 at 9:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Simon Riggs <simon@2ndQuadrant.com> writes:
    >> Please lets not waste effort on refactoring efforts in mid dev cycle.
    >
    > Say what?  When else would you have us do it?
    
    When else would you have us develop?
    
    Major changes happen at start of a dev cycle, after some discussion.
    Not in the middle and especially not for low priority items. It's not
    even a bug, just code beautification.
    
    We've all accepted the need for some change, but I would like to see
    it happen slowly and carefully because of the very high risk of
    introducing bugs into stable code that doesn't have a formal
    verification mechanism currently. Anybody that could be trusted to
    make those changes ought to have something better to do. If they don't
    then that is in itself a much more serious issue than this.
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  19. Re: Large C files

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-08T14:25:14Z

    Simon Riggs <simon@2ndQuadrant.com> writes:
    > On Wed, Sep 7, 2011 at 9:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> Simon Riggs <simon@2ndQuadrant.com> writes:
    >>> Please lets not waste effort on refactoring efforts in mid dev cycle.
    
    >> Say what? When else would you have us do it?
    
    > When else would you have us develop?
    
    In my eyes that sort of activity *is* development.  I find the
    distinction you are drawing entirely artificial, and more calculated to
    make sure refactoring never happens than to add any safety.  Any
    significant development change carries a risk of breakage.
    
    			regards, tom lane
    
    
  20. Re: Large C files

    Bruce Momjian <bruce@momjian.us> — 2011-09-08T14:29:14Z

    Tom Lane wrote:
    > Simon Riggs <simon@2ndQuadrant.com> writes:
    > > On Wed, Sep 7, 2011 at 9:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > >> Simon Riggs <simon@2ndQuadrant.com> writes:
    > >>> Please lets not waste effort on refactoring efforts in mid dev cycle.
    > 
    > >> Say what? When else would you have us do it?
    > 
    > > When else would you have us develop?
    > 
    > In my eyes that sort of activity *is* development.  I find the
    > distinction you are drawing entirely artificial, and more calculated to
    > make sure refactoring never happens than to add any safety.  Any
    > significant development change carries a risk of breakage.
    
    I ran pgrminclude a week ago and that is certainly a larger change than
    this.  Early in the development cycle people are merging in their saved
    patches, so now is a fine time to do refactoring.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +
    
    
  21. Re: Large C files

    Robert Haas <robertmhaas@gmail.com> — 2011-09-08T14:43:38Z

    On Thu, Sep 8, 2011 at 10:29 AM, Bruce Momjian <bruce@momjian.us> wrote:
    > Tom Lane wrote:
    >> Simon Riggs <simon@2ndQuadrant.com> writes:
    >> > On Wed, Sep 7, 2011 at 9:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> >> Simon Riggs <simon@2ndQuadrant.com> writes:
    >> >>> Please lets not waste effort on refactoring efforts in mid dev cycle.
    >>
    >> >> Say what?  When else would you have us do it?
    >>
    >> > When else would you have us develop?
    >>
    >> In my eyes that sort of activity *is* development.  I find the
    >> distinction you are drawing entirely artificial, and more calculated to
    >> make sure refactoring never happens than to add any safety.  Any
    >> significant development change carries a risk of breakage.
    >
    > I ran pgrminclude a week ago and that is certainly a larger change than
    > this.  Early in the development cycle people are merging in their saved
    > patches, so now is a fine time to do refactoring.
    
    +1.
    
    I'd feel more comfortable refactoring it if we had some automated
    testing of those code paths, but I don't see anything wrong with doing
    it now from a timing perspective.  We still have 4 months until the
    start of the final CommitFest.  I wouldn't be too enthusiastic about
    starting a project like this in January, but now seems fine.  A bigger
    problem is that I don't hear anyone volunteering to do the work.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  22. Re: Large C files

    Simon Riggs <simon@2ndquadrant.com> — 2011-09-08T17:25:49Z

    On Thu, Sep 8, 2011 at 3:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Simon Riggs <simon@2ndQuadrant.com> writes:
    >> On Wed, Sep 7, 2011 at 9:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >>> Simon Riggs <simon@2ndQuadrant.com> writes:
    >>>> Please lets not waste effort on refactoring efforts in mid dev cycle.
    >
    >>> Say what?  When else would you have us do it?
    >
    >> When else would you have us develop?
    >
    > In my eyes that sort of activity *is* development.  I find the
    > distinction you are drawing entirely artificial, and more calculated to
    > make sure refactoring never happens than to add any safety.  Any
    > significant development change carries a risk of breakage.
    
    You clearly have the bit between your teeth on this.
    
    That doesn't make it worthwhile or sensible though.
    
    I've offered to do it slowly and carefully over time, but that seems
    not enough for some reason. What is the real reason for this?
    
    I assume whoever does it will be spending significant time on testing
    and bug fixing afterwards. I foresee lots of "while I'm there, I
    thought I'd just mend X", so we'll spend lots of time fighting to keep
    functionality that's already there. Look at the discussion around
    archive_command for an example of that.
    
    -- 
     Simon Riggs                   http://www.2ndQuadrant.com/
     PostgreSQL Development, 24x7 Support, Training & Services
    
    
  23. Re: Large C files

    Bruce Momjian <bruce@momjian.us> — 2011-09-08T17:34:02Z

    Simon Riggs wrote:
    > On Thu, Sep 8, 2011 at 3:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > > Simon Riggs <simon@2ndQuadrant.com> writes:
    > >> On Wed, Sep 7, 2011 at 9:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > >>> Simon Riggs <simon@2ndQuadrant.com> writes:
    > >>>> Please lets not waste effort on refactoring efforts in mid dev cycle.
    > >
    > >>> Say what? ?When else would you have us do it?
    > >
    > >> When else would you have us develop?
    > >
    > > In my eyes that sort of activity *is* development. ?I find the
    > > distinction you are drawing entirely artificial, and more calculated to
    > > make sure refactoring never happens than to add any safety. ?Any
    > > significant development change carries a risk of breakage.
    > 
    > You clearly have the bit between your teeth on this.
    
    So I guess Tom, Robert Haas, and I all have bits then.
    
    > That doesn't make it worthwhile or sensible though.
    
    We are not saying do it now.  We are disputing your suggestion that now
    is not a good time --- there is a difference.
    
    > I've offered to do it slowly and carefully over time, but that seems
    > not enough for some reason. What is the real reason for this?
    
    Oh, if we told you, then, well, you would know, and the big secret would
    be out.  (Hint:  when three people are telling you the same thing, odds
    are there is not some secret motive.)
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +
    
    
  24. Re: Large C files

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-08T19:10:24Z

    Simon Riggs <simon@2ndQuadrant.com> writes:
    > You clearly have the bit between your teeth on this.
    
    Personally, I'm neither intending to break up xlog.c right now, nor
    asking you to do it.  I'm just objecting to your claim that there
    should be some project-policy restriction on when refactoring gets done.
    I do have other refactoring-ish plans in mind, eg
    http://archives.postgresql.org/message-id/9232.1315441364@sss.pgh.pa.us
    and am not going to be happy if you claim I can't do that now.
    
    			regards, tom lane
    
    
  25. Re: Large C files

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-09-08T20:45:56Z

    On 8 September 2011 15:43, Robert Haas <robertmhaas@gmail.com> wrote:
    > I wouldn't be too enthusiastic about
    > starting a project like this in January, but now seems fine.  A bigger
    > problem is that I don't hear anyone volunteering to do the work.
    
    You seem to have a fairly strong opinion on the xlog.c code. It would
    be useful to hear any preliminary thoughts that you might have on what
    we'd end up with when this refactoring work is finished. If I'm not
    mistaken, you think that it is a good candidate for being refactored
    not so much because of its size, but for other reasons -  could you
    please elaborate on those? In particular, I'd like to know what
    boundaries it is envisaged that the code should be refactored along to
    increase its conceptual integrity, or to better separate concerns. I
    assume that that's the idea, since each new .c file would have to have
    a discrete purpose.
    
    On 7 September 2011 19:12, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > The pgrminclude-induced bug you just fixed shows a concrete way in which
    > moving code from one file to another might silently break it, ie, it
    > still compiles despite lack of definition of some symbol it's intended
    > to see.
    
    Of course, the big unknown here is the C preprocessor. There is
    nothing in principle that stops a header file from slightly or utterly
    altering the semantics of any file it is included in. That said, it
    wouldn't surprise me if the nm-diff shell script I proposed (or a
    slight variant intended to catch pgrminclude type bugs) caught many of
    those problems in practice.
    
    Attached is simple POC nm-diff.sh, intended to detect
    pgrminclude-induced bugs (split c file variant may follow if there is
    interest). I tested this on object files compiled before and after the
    bug fix to catalog.h in this commit:
    
    http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=f81fb4f690355bc88fee69624103956fb4576fe5
    
    or rather, I tested things after that commit with and without the
    supposedly unneeded header; I removed catalog version differences,
    applying Ockham's razor, so there was exactly one difference. That
    didn't change anything; I just saw the same thing as before, which
    was:
    
    [peter@localhost commands]$ nm-diff.sh tablespace.o tablespace.old.o
    Symbol table differences:
    50c50
    < 00000000000005e7 r __func__.15690
    ---
    > 00000000000005ef r __func__.15690
    WARNING: Symbol tables don't match!
    
    
    I decided to add a bunch of obviously superfluous headers (a bunch of
    xlog stuff) to the same file to verify that they *didn't* change
    anything, figuring that it was very unlikely that adding these headers
    could *really* change something. However, they did, but I don't think
    that that proves that the script is fundamentally flawed (they didn't
    *really* change anything):
    
    [peter@localhost commands]$ nm-diff.sh tablespace.o tablespace.old.o
    Symbol table differences:
    41,50c41,50
    < 00000000000006f0 r __func__.15719 <-- symbol name differs, offset does not
    < 00000000000006d0 r __func__.15730
    < 00000000000006be r __func__.15750
    < 00000000000006a0 r __func__.15759
    < 0000000000000680 r __func__.15771
    < 0000000000000660 r __func__.15793
    < 0000000000000640 r __func__.15803
    < 0000000000000620 r __func__.15825
    < 0000000000000600 r __func__.15886
    < 00000000000005e7 r __func__.15903  <-- value/local offset the same as before
    ---
    > 00000000000006f0 r __func__.15506
    > 00000000000006d0 r __func__.15517
    > 00000000000006be r __func__.15537
    > 00000000000006a0 r __func__.15546
    > 0000000000000680 r __func__.15558
    > 0000000000000660 r __func__.15580
    > 0000000000000640 r __func__.15590
    > 0000000000000620 r __func__.15612
    > 0000000000000600 r __func__.15673
    > 00000000000005ef r __func__.15690  <-- Also, value/local offset the same as before (same inconsistency too)
    WARNING: Symbol tables don't match!
    
    My analysis is that the fact that the arbitrary symbol names assigned
    within this read-only data section differ likely has no significance
    (it looks like the compiler assigns them at an early optimisation
    stage like Postgres assigns sequence values, before some are
    eliminated at a later stage - I read ELF docs, but couldn't confirm
    this, but maybe should have read GCC docs), so the next revision of
    this script should simply ignore them using a regex if they look like
    this; only the internal offsets/values matter, and they have been
    observed to differ based on whether or not the header is included per
    Bruce's revision (importantly, the difference in offset/values of the
    last __func__ remains just the same regardless of whether the
    superfluous xlog headers are included).
    
    Does that seem reasonable? Is there interest in developing this idea further?
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
  26. Re: Large C files

    Josh Berkus <josh@agliodbs.com> — 2011-09-08T21:59:45Z

    Simon, Robert, Bruce, Tom,
    
    >>>> >>> Say what?  When else would you have us do it?
    >> >
    >>> >> When else would you have us develop?
    >> >
    >> > In my eyes that sort of activity *is* development.  I find the
    >> > distinction you are drawing entirely artificial, and more calculated to
    >> > make sure refactoring never happens than to add any safety.  Any
    >> > significant development change carries a risk of breakage.
    > You clearly have the bit between your teeth on this.
    > 
    > That doesn't make it worthwhile or sensible though.
    
    The above really seems like a non-argument over trivial wording of
    stuff.  Can we please give each other the benefit of the doubt and not
    read objectionable content into offhand comments?
    
    -- 
    Josh Berkus
    PostgreSQL Experts Inc.
    http://pgexperts.com
    
    
  27. Re: Large C files

    Robert Haas <robertmhaas@gmail.com> — 2011-09-09T00:20:22Z

    On Thu, Sep 8, 2011 at 4:45 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
    > On 8 September 2011 15:43, Robert Haas <robertmhaas@gmail.com> wrote:
    >> I wouldn't be too enthusiastic about
    >> starting a project like this in January, but now seems fine.  A bigger
    >> problem is that I don't hear anyone volunteering to do the work.
    >
    > You seem to have a fairly strong opinion on the xlog.c code. It would
    > be useful to hear any preliminary thoughts that you might have on what
    > we'd end up with when this refactoring work is finished. If I'm not
    > mistaken, you think that it is a good candidate for being refactored
    > not so much because of its size, but for other reasons -  could you
    > please elaborate on those? In particular, I'd like to know what
    > boundaries it is envisaged that the code should be refactored along to
    > increase its conceptual integrity, or to better separate concerns. I
    > assume that that's the idea, since each new .c file would have to have
    > a discrete purpose.
    
    I'm not sure how strong my opinions are, but one thing that I think
    could be improved is that StartupXLOG() is really, really long, and it
    does a lot of different stuff:
    
    - Read the control file and sanity check it.
    - Read the backup label if there is one or otherwise inspect the
    control file's checkpoint information.
    - Set up for REDO (including Hot Standby) if required.
    - Main REDO loop.
    - Check whether we reached consistency and/or whether we need a new TLI.
    - Prepare to write WAL.
    - Post-recovery cleanup.
    - Initialize for normal running.
    
    It seems to me that we could improve the readability of that function
    by separating out some of the larger chunks of functionality into
    their own static functions.  That wouldn't make xlog.c any shorter,
    but it would make that function easier to understand.
    
    Another gripe I have is that recoveryStopsHere() has non-obvious side
    effects.  Not sure what to do about that, exactly, but I found it
    extremely surprising when first picking my way through this code.
    
    One pretty easy thing to pull out of this file wold be all the
    user-callable functions.  pg_xlog_recovery_replay_{pause,resume},
    pg_is_xlog_recovery_paused, pg_last_xact_replay_timestamp,
    pg_is_in_recovery, pg_{start,stop}_backup(), pg_switch_xlog(),
    pg_create_restore_point(), pg_current_xlog_{insert_,}location,
    pg_last_xlog_{receive,replay}_location(), pg_xlogfile_name_offset(),
    pg_xlogfile_name().
    
    Another chunk that seems like it would be pretty simple to separate
    out is the checkpoint-related stuff: LogCheckpointStart(),
    LogCheckpointEnd(), CreateCheckPoint(), CheckPointGuts(),
    RecoveryRestartPoint(), CreateRestartPoint(), KeepLogSeg().  Not a lot
    of code, maybe, but it seems clearly distinguishable from what the
    rest of the file is about.
    
    I'm sure there are other good ways to do it, too...
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  28. Re: Large C files

    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> — 2011-09-09T05:57:03Z

    On 08.09.2011 23:45, Peter Geoghegan wrote:
    > On 8 September 2011 15:43, Robert Haas<robertmhaas@gmail.com>  wrote:
    >> I wouldn't be too enthusiastic about
    >> starting a project like this in January, but now seems fine.  A bigger
    >> problem is that I don't hear anyone volunteering to do the work.
    >
    > You seem to have a fairly strong opinion on the xlog.c code. It would
    > be useful to hear any preliminary thoughts that you might have on what
    > we'd end up with when this refactoring work is finished. If I'm not
    > mistaken, you think that it is a good candidate for being refactored
    > not so much because of its size, but for other reasons -  could you
    > please elaborate on those? In particular, I'd like to know what
    > boundaries it is envisaged that the code should be refactored along to
    > increase its conceptual integrity, or to better separate concerns. I
    > assume that that's the idea, since each new .c file would have to have
    > a discrete purpose.
    
    I'd like to see it split into routines involved in writing WAL, and 
    those involved in recovery. And maybe a third file for archiving-related 
    stuff.
    
    -- 
       Heikki Linnakangas
       EnterpriseDB   http://www.enterprisedb.com
    
    
  29. Re: Large C files

    Greg Stark <stark@mit.edu> — 2011-09-09T22:54:18Z

    On Fri, Sep 9, 2011 at 6:57 AM, Heikki Linnakangas
    <heikki.linnakangas@enterprisedb.com> wrote:
    >> In particular, I'd like to know what
    >> boundaries it is envisaged that the code should be refactored along to
    >> increase its conceptual integrity, or to better separate concerns. I
    >> assume that that's the idea, since each new .c file would have to have
    >> a discrete purpose.
    >
    > I'd like to see it split into routines involved in writing WAL, and those
    > involved in recovery. And maybe a third file for archiving-related stuff.
    
    Having a clean API for working with WAL independently of recovery
    would let us have a maintainable xlogdump tool that doesn't constantly
    get out of sync with the wal archive format. It would also make it
    easier to write things like prefretching logic that requires reading
    the upcoming xlog before its time to actually replay it.
    
    
    -- 
    greg
    
    
  30. Re: Large C files

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-09-10T03:28:16Z

    I've produced the refinement of my little shell script anticipated by
    my last e-mail (using sed to remove irrelevant variations in
    __func__.12345 type symbol names). I decided to test it for:
    
    1. Detecting behavioural changes when removing existing "pgrminclude
    ignore" files (Basically headers that won't break the build if
    removed, but will break Postgres). I stuck with .c files here for the
    sake of convenience.
    
    2. False positives by including a bunch of random headers.
    
    Fist file tested was pl_comp.c. The script detected the change in
    behaviour due to the omission of that header (incidentally, the way
    the header is used there strikes me as a bit ugly). The script did not
    produce false positives with the addition of these headers:
    
    #include "commands/portalcmds.h"
    #include "commands/conversioncmds.h"
    #include "commands/defrem.h"
    #include "commands/proclang.h"
    #include "commands/tablecmds.h"
    #include "commands/tablespace.h"
    #include "commands/typecmds.h"
    #include "commands/collationcmds.h"
    #include "commands/prepare.h"
    #include "commands/explain.h"
    #include "commands/comment.h"
    #include "commands/cluster.h"
    #include "commands/discard.h"
    #include "commands/trigger.h"
    #include "commands/user.h"
    #include "commands/view.h"
    #include "commands/sequence.h"
    #include "commands/dbcommands.h"
    #include "commands/async.h"
    #include "commands/lockcmds.h"
    #include "commands/vacuum.h"
    
    The second file tested was regerror.c . Similarly, the omission was
    detected, and the addition of the same headers did not produce a false
    positive.
    
    It's very difficult or impossible to anticipate how effective the tool
    will be in practice, but when you consider that it works and does not
    produce false positives for the first 3 real-world cases tested, it
    seems reasonable to assume that it's at least worth having around. Tom
    recently said of a previous pgrminclude campaign in July 2006 that "It
    took us two weeks to mostly recover, but we were still dealing with
    some fallout in December". I think that makes the case for adding this
    tool or some refinement as a complement to pgrminclude in src/tools
    fairly compelling.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
  31. Re: Large C files

    Robert Haas <robertmhaas@gmail.com> — 2011-09-23T14:46:00Z

    On Fri, Sep 9, 2011 at 11:28 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
    > It's very difficult or impossible to anticipate how effective the tool
    > will be in practice, but when you consider that it works and does not
    > produce false positives for the first 3 real-world cases tested, it
    > seems reasonable to assume that it's at least worth having around. Tom
    > recently said of a previous pgrminclude campaign in July 2006 that "It
    > took us two weeks to mostly recover, but we were still dealing with
    > some fallout in December". I think that makes the case for adding this
    > tool or some refinement as a complement to pgrminclude in src/tools
    > fairly compelling.
    
    I'm not opposed to adding something like this, but I think it needs to
    either be tied into the actual running of the script, or have a lot
    more documentation than it does now, or both.  I am possibly stupid,
    but I can't understand from reading the script (or, honestly, the
    thread) exactly what kind of pgrminclude-induced errors this is
    protecting against; but even if we clarify that, it seems like it
    would be a lot of work to run it manually on all the files that might
    be affected by a pgrminclude run, unless we can somehow automate that.
    
    I'm also curious to see how much more fallout we're going to see from
    that run.  We had a few glitches when it was first done, but it didn't
    seem like they were really all that bad.  It might be that we'd be
    better off running pgrminclude a lot *more* often (like once a cycle,
    or even after every CommitFest), because the scope of the changes
    would then be far smaller and we wouldn't be dealing with 5 years of
    accumulated cruft all at once; we'd also get a lot more experience
    with what works or does not work with the script, which might lead to
    improvements in that script on a less-than-geologic time scale.
    
    -- 
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
    
    
  32. Re: Large C files

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-09-23T20:26:31Z

    On 23 September 2011 15:46, Robert Haas <robertmhaas@gmail.com> wrote:
    > I'm not opposed to adding something like this, but I think it needs to
    > either be tied into the actual running of the script, or have a lot
    > more documentation than it does now, or both.  I am possibly stupid,
    > but I can't understand from reading the script (or, honestly, the
    > thread) exactly what kind of pgrminclude-induced errors this is
    > protecting against;
    
    The basic idea is simple. There is a high likelihood that if removing
    a header alters the behaviour of an object file silently, that it will
    also alter the symbol table for the same object file - in particular,
    the "value" of function symbols (their local offset in the object
    file), which relates to the number of instructions in each function.
    Yes, that is imperfect, but it's better than nothing, and intuitively
    I think that the only things that it won't catch in practice are
    completely inorganic bugs (i.e. things done for the express purpose of
    proving that it can be broken). Thinking about it now though, I have
    to wonder if I could have done a better job with "objdump -td".
    
    > but even if we clarify that, it seems like it
    > would be a lot of work to run it manually on all the files that might
    > be affected by a pgrminclude run, unless we can somehow automate that.
    
    I would have automated it if anyone had expressed any interest in the
    basic idea - it might be an over-reaction to the problem. I'm not
    sure. I'd have had it detect which object files might have been
    affected (through directly including the header, or indirectly
    including it by proxy). It could rename them such that, for example,
    xlog.o was renamed to xlog_old.o . Then, you make your changes,
    rebuild, and run the program again in a different mode. It notices the
    *_old.o files, and runs nm-diff on them.
    
    > I'm also curious to see how much more fallout we're going to see from
    > that run.  We had a few glitches when it was first done, but it didn't
    > seem like they were really all that bad.  It might be that we'd be
    > better off running pgrminclude a lot *more* often (like once a cycle,
    > or even after every CommitFest), because the scope of the changes
    > would then be far smaller and we wouldn't be dealing with 5 years of
    > accumulated cruft all at once; we'd also get a lot more experience
    > with what works or does not work with the script, which might lead to
    > improvements in that script on a less-than-geologic time scale.
    
    Fair point. I'm a little busy with other things right now, but I'll
    revisit it soon.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
    
  33. Re: Large C files

    Hannu Krosing <hannu@2ndquadrant.com> — 2011-09-24T11:00:35Z

    On Wed, 2011-09-07 at 01:22 +0200, Jan Urbański wrote:
    > On 07/09/11 01:13, Peter Geoghegan wrote:
    > > On 6 September 2011 08:29, Peter Eisentraut <peter_e@gmx.net> wrote:
    > >> I was thinking about splitting up plpython.c, but it's not even on that
    > >> list. ;-)
    > > 
    > > IIRC the obesity of that file is something that Jan Urbański intends
    > > to fix, or is at least concerned about. Perhaps you should compare
    > > notes.
    > 
    > Yeah, plpython.c could easily be splitted into submodules for builtin
    > functions, spi support, subtransactions support, traceback support etc.
    > 
    > If there is consensus that this should be done, I'll see if I can find
    > time to submit a giant-diff-no-behaviour-changes patch for the next CF.
    
    +1 from me.
    
    Would make working with it much nicer.
    
    > Cheers,
    > Jan
    > 
    
    -- 
    -------
    Hannu Krosing
    PostgreSQL Unlimited Scalability and Performance Consultant
    2ndQuadrant Nordic
    PG Admin Book: http://www.2ndQuadrant.com/books/
    
    
    
  34. Re: Large C files

    Bruce Momjian <bruce@momjian.us> — 2011-09-24T13:30:18Z

    Robert Haas wrote:
    > On Fri, Sep 9, 2011 at 11:28 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
    > > It's very difficult or impossible to anticipate how effective the tool
    > > will be in practice, but when you consider that it works and does not
    > > produce false positives for the first 3 real-world cases tested, it
    > > seems reasonable to assume that it's at least worth having around. Tom
    > > recently said of a previous pgrminclude campaign in July 2006 that "It
    > > took us two weeks to mostly recover, but we were still dealing with
    > > some fallout in December". I think that makes the case for adding this
    > > tool or some refinement as a complement to pgrminclude in src/tools
    > > fairly compelling.
    > 
    > I'm not opposed to adding something like this, but I think it needs to
    > either be tied into the actual running of the script, or have a lot
    > more documentation than it does now, or both.  I am possibly stupid,
    > but I can't understand from reading the script (or, honestly, the
    > thread) exactly what kind of pgrminclude-induced errors this is
    > protecting against; but even if we clarify that, it seems like it
    > would be a lot of work to run it manually on all the files that might
    > be affected by a pgrminclude run, unless we can somehow automate that.
    > 
    > I'm also curious to see how much more fallout we're going to see from
    > that run.  We had a few glitches when it was first done, but it didn't
    > seem like they were really all that bad.  It might be that we'd be
    > better off running pgrminclude a lot *more* often (like once a cycle,
    > or even after every CommitFest), because the scope of the changes
    > would then be far smaller and we wouldn't be dealing with 5 years of
    > accumulated cruft all at once; we'd also get a lot more experience
    > with what works or does not work with the script, which might lead to
    > improvements in that script on a less-than-geologic time scale.
    
    Interesting idea.  pgrminclude has three main problems:
    
    o  defines --- to prevent removing an include that is referenced in an
    #ifdef block that is not reached, it removes ifdef blocks, though that
    might cause the file not to compile and be skipped.
    
    o  ## expansion --- we found that CppAsString2() uses the CCP expansion
    ##, which throws no error if the symbol is does not exist (perhaps
    through the removal of a define).  This is the problem we had with
    tablespace directory names, and the script now checks for CppAsString2
    and friends and skips the file.
    
    o  imbalance of includes --- pgrminclude can remove includes that are
    not required, but this can cause other files to need many more includes.
    This is the imbalance include problem Tom found.
    
    The submitted patch to compare object files only catches the second
    problem we had.  
    
    I think we determined that the best risk/reward is to run it every five
    years.  The pgrminclude run removed 627 include references, which is
    0.006% of our 1,077,878 lines of C code.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +
    
    
  35. Re: Large C files

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-24T15:41:55Z

    Bruce Momjian <bruce@momjian.us> writes:
    > Robert Haas wrote:
    >> I'm also curious to see how much more fallout we're going to see from
    >> that run.  We had a few glitches when it was first done, but it didn't
    >> seem like they were really all that bad.  It might be that we'd be
    >> better off running pgrminclude a lot *more* often (like once a cycle,
    >> or even after every CommitFest), because the scope of the changes
    >> would then be far smaller and we wouldn't be dealing with 5 years of
    >> accumulated cruft all at once; we'd also get a lot more experience
    >> with what works or does not work with the script, which might lead to
    >> improvements in that script on a less-than-geologic time scale.
    
    > Interesting idea.  pgrminclude has three main problems: [ snipped ]
    
    Actually, I believe that the *main* problem with pgrminclude is that
    it fails to account for combinations of build options other than those
    that Bruce uses.  In the previous go-round, the reason we were still
    squashing bugs months later is that it took that long for people to
    notice and complain "hey, compiling with LOCK_DEBUG no longer works",
    or various other odd build options that the buildfarm doesn't exercise.
    I have 100% faith that we'll be squashing some bugs like that ... very
    possibly, the exact same ones as five years ago ... over the next few
    months.  Peter's proposed tool would catch issues like the CppAsString2
    failure, but it's still only going to exercise the build options you're
    testing with.
    
    If we could get pgrminclude up to a similar level of reliability as
    pgindent, I'd be for running it on every cycle.  But I'm not sure that
    the current approach to it is capable even in theory of getting to "it
    just works" reliability.  I'm also not impressed at all by the hack of
    avoiding problems by excluding entire files from the processing ---
    what's the point of having the tool then?
    
    > I think we determined that the best risk/reward is to run it every five
    > years.
    
    Frankly, with the tool in its current state I'd rather not run it at
    all, ever.  The value per man-hour expended is too low.  The mess it
    made out of the xlog-related includes this time around makes me question
    whether it's even a net benefit, regardless of whether it can be
    guaranteed not to break things.  Fundamentally, there's a large
    component of design judgment/taste in the question of which header files
    should include which others, but this tool does not have any taste.
    
    			regards, tom lane
    
    
  36. Re: Large C files

    Bruce Momjian <bruce@momjian.us> — 2011-09-24T16:13:34Z

    Tom Lane wrote:
    > Actually, I believe that the *main* problem with pgrminclude is that
    > it fails to account for combinations of build options other than those
    > that Bruce uses.  In the previous go-round, the reason we were still
    > squashing bugs months later is that it took that long for people to
    > notice and complain "hey, compiling with LOCK_DEBUG no longer works",
    > or various other odd build options that the buildfarm doesn't exercise.
    > I have 100% faith that we'll be squashing some bugs like that ... very
    > possibly, the exact same ones as five years ago ... over the next few
    > months.  Peter's proposed tool would catch issues like the CppAsString2
    
    The new code removes #ifdef markers so all code is compiled, or the file
    is skipped if it can't be compiled.  That should avoid this problem.
    
    > failure, but it's still only going to exercise the build options you're
    > testing with.
    > 
    > If we could get pgrminclude up to a similar level of reliability as
    > pgindent, I'd be for running it on every cycle.  But I'm not sure that
    > the current approach to it is capable even in theory of getting to "it
    > just works" reliability.  I'm also not impressed at all by the hack of
    > avoiding problems by excluding entire files from the processing ---
    > what's the point of having the tool then?
    
    We remove the includes we safely can, and skip the others --- the
    benefit is only for the files we can process.
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +
    
    
  37. Re: Large C files

    Tom Lane <tgl@sss.pgh.pa.us> — 2011-09-24T16:46:59Z

    Bruce Momjian <bruce@momjian.us> writes:
    > Tom Lane wrote:
    >> Actually, I believe that the *main* problem with pgrminclude is that
    >> it fails to account for combinations of build options other than those
    >> that Bruce uses.  In the previous go-round, the reason we were still
    >> squashing bugs months later is that it took that long for people to
    >> notice and complain "hey, compiling with LOCK_DEBUG no longer works",
    >> or various other odd build options that the buildfarm doesn't exercise.
    >> I have 100% faith that we'll be squashing some bugs like that ... very
    >> possibly, the exact same ones as five years ago ... over the next few
    >> months.  Peter's proposed tool would catch issues like the CppAsString2
    
    > The new code removes #ifdef markers so all code is compiled, or the file
    > is skipped if it can't be compiled.  That should avoid this problem.
    
    It avoids it at a very large cost, namely skipping all the files where
    it's not possible to compile each arm of every #if on the machine being
    used.  I do not think that's a solution, just a band-aid; for instance,
    won't it prevent include optimization in every file that contains even
    one #ifdef WIN32?  Or what about files in which there are #if blocks
    that each define the same function, constant table, etc?
    
    The right solution would involve testing each #if block under the
    conditions in which it was *meant* to be compiled.
    
    			regards, tom lane
    
    
  38. Re: Large C files

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-09-24T17:10:38Z

    On 24 September 2011 16:41, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    > Frankly, with the tool in its current state I'd rather not run it at
    > all, ever.  The value per man-hour expended is too low.  The mess it
    > made out of the xlog-related includes this time around makes me question
    > whether it's even a net benefit, regardless of whether it can be
    > guaranteed not to break things.  Fundamentally, there's a large
    > component of design judgment/taste in the question of which header files
    > should include which others, but this tool does not have any taste.
    
    I agree. If this worked well in a semi-automated fashion, there'd be
    some other open source tool already available for us to use. As far as
    I know, there isn't. As we work around pgrminclude's bugs, its
    benefits become increasingly small and hard to quantify.
    
    If we're not going to use it, it should be removed from the tree.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
    
  39. Re: Large C files

    Andrew Dunstan <andrew@dunslane.net> — 2011-09-24T17:17:58Z

    
    On 09/24/2011 01:10 PM, Peter Geoghegan wrote:
    > On 24 September 2011 16:41, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
    >> Frankly, with the tool in its current state I'd rather not run it at
    >> all, ever.  The value per man-hour expended is too low.  The mess it
    >> made out of the xlog-related includes this time around makes me question
    >> whether it's even a net benefit, regardless of whether it can be
    >> guaranteed not to break things.  Fundamentally, there's a large
    >> component of design judgment/taste in the question of which header files
    >> should include which others, but this tool does not have any taste.
    > I agree. If this worked well in a semi-automated fashion, there'd be
    > some other open source tool already available for us to use. As far as
    > I know, there isn't. As we work around pgrminclude's bugs, its
    > benefits become increasingly small and hard to quantify.
    >
    > If we're not going to use it, it should be removed from the tree.
    >
    
    Yeah, I've always been dubious about the actual benefit.  At best this 
    can be seen as identifying some candidates for pruning, but as an 
    automated tool I'm inclined to write it off as a failed experiment.
    
    cheers
    
    andrew
    
    
  40. Re: Large C files

    Bruce Momjian <bruce@momjian.us> — 2011-09-24T17:26:23Z

    Tom Lane wrote:
    > Bruce Momjian <bruce@momjian.us> writes:
    > > Tom Lane wrote:
    > >> Actually, I believe that the *main* problem with pgrminclude is that
    > >> it fails to account for combinations of build options other than those
    > >> that Bruce uses.  In the previous go-round, the reason we were still
    > >> squashing bugs months later is that it took that long for people to
    > >> notice and complain "hey, compiling with LOCK_DEBUG no longer works",
    > >> or various other odd build options that the buildfarm doesn't exercise.
    > >> I have 100% faith that we'll be squashing some bugs like that ... very
    > >> possibly, the exact same ones as five years ago ... over the next few
    > >> months.  Peter's proposed tool would catch issues like the CppAsString2
    > 
    > > The new code removes #ifdef markers so all code is compiled, or the file
    > > is skipped if it can't be compiled.  That should avoid this problem.
    > 
    > It avoids it at a very large cost, namely skipping all the files where
    > it's not possible to compile each arm of every #if on the machine being
    > used.  I do not think that's a solution, just a band-aid; for instance,
    > won't it prevent include optimization in every file that contains even
    > one #ifdef WIN32?  Or what about files in which there are #if blocks
    > that each define the same function, constant table, etc?
    > 
    > The right solution would involve testing each #if block under the
    > conditions in which it was *meant* to be compiled.
    
    Right.  It is under the "better than nothing" category, which is better
    than nothing (not running it).  ;-)
    
    -- 
      Bruce Momjian  <bruce@momjian.us>        http://momjian.us
      EnterpriseDB                             http://enterprisedb.com
    
      + It's impossible for everything to be true. +
    
    
  41. Re: Large C files

    Peter Geoghegan <peter@2ndquadrant.com> — 2011-10-14T18:36:32Z

    This evening, David Fetter described a problem to me that he was
    having building the Twitter FDW. It transpired that it was down to its
    dependence on an #include that was recently judged to be
    redundant.This seems like another reason to avoid using pgrminclude -
    even if we can be certain that the #includes are redundant within
    Postgres, we cannot be sure that they are redundant in third party
    code.
    
    -- 
    Peter Geoghegan       http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training and Services
    
    
  42. Re: Large C files

    Alvaro Herrera <alvherre@commandprompt.com> — 2011-10-14T18:47:49Z

    Excerpts from Peter Geoghegan's message of vie oct 14 15:36:32 -0300 2011:
    > This evening, David Fetter described a problem to me that he was
    > having building the Twitter FDW. It transpired that it was down to its
    > dependence on an #include that was recently judged to be
    > redundant.This seems like another reason to avoid using pgrminclude -
    > even if we can be certain that the #includes are redundant within
    > Postgres, we cannot be sure that they are redundant in third party
    > code.
    
    I'm not sure about this.  I mean, surely the problem was easily detected
    and fixed because the compile fails outright, yes?
    
    I do take time on ocassion to do some header reorganization and remove
    inclusions that are redundant or unnecessary.  I don't want that work
    thrown aside because we have to support some hypothetical third party
    module that would fail to compile.  (In fact, we purposedly removed some
    header from spi.h because it was unnecessary).
    
    -- 
    Álvaro Herrera <alvherre@commandprompt.com>
    The PostgreSQL Company - Command Prompt, Inc.
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
    
    
  43. Re: Large C files

    David Fetter <david@fetter.org> — 2011-10-17T06:00:19Z

    On Fri, Oct 14, 2011 at 07:36:32PM +0100, Peter Geoghegan wrote:
    > This evening, David Fetter described a problem to me that he was
    > having building the Twitter FDW. It transpired that it was down to its
    > dependence on an #include that was recently judged to be
    > redundant.This seems like another reason to avoid using pgrminclude -
    > even if we can be certain that the #includes are redundant within
    > Postgres, we cannot be sure that they are redundant in third party
    > code.
    
    Perhaps something that tested some third-party code
    automatically...say, doesn't the new buildfarm stuff allow running
    some arbitrary code?
    
    Cheers,
    David.
    -- 
    David Fetter <david@fetter.org> http://fetter.org/
    Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
    Skype: davidfetter      XMPP: david.fetter@gmail.com
    iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics
    
    Remember to vote!
    Consider donating to Postgres: http://www.postgresql.org/about/donate
    
    
  44. Re: Large C files

    Alvaro Herrera <alvherre@commandprompt.com> — 2011-10-19T19:39:55Z

    Excerpts from David Fetter's message of lun oct 17 03:00:19 -0300 2011:
    > On Fri, Oct 14, 2011 at 07:36:32PM +0100, Peter Geoghegan wrote:
    > > This evening, David Fetter described a problem to me that he was
    > > having building the Twitter FDW. It transpired that it was down to its
    > > dependence on an #include that was recently judged to be
    > > redundant.This seems like another reason to avoid using pgrminclude -
    > > even if we can be certain that the #includes are redundant within
    > > Postgres, we cannot be sure that they are redundant in third party
    > > code.
    > 
    > Perhaps something that tested some third-party code
    > automatically...say, doesn't the new buildfarm stuff allow running
    > some arbitrary code?
    
    I think you could run your own buildfarm member and add whatever "steps"
    you wanted (we have one building JDBC).  I'm not sure that we want that
    though -- it'd start polluting the greenness of our buildfarm with
    failures from code outside of our control.
    
    I mean, if the third party code fails to compile, surely it's the third
    party devs that care.
    
    I fail to see why this is such a big deal.
    
    -- 
    Álvaro Herrera <alvherre@commandprompt.com>
    The PostgreSQL Company - Command Prompt, Inc.
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support