ok, after thinking about it for a while, i've changed my cflags. now, they are:
CFLAGS="-march=pentium4 -O3 -pipe -fomit-frame-pointer -mfpmath=sse,387 -ffast-math -funroll-all-loops -fforce-addr -fmerge-all-constants -maccumulate-outgoing-args -falign-functions=16 -fPIC"
i believe this to be the best choice for our pentium 4s. (northwood, based on the intel netburst architecture)
here is why:
the -march=pentium4 is obvious.
-pipe uses pipes to feed data through the compiler, as opposed to temporary files. this is faster, and is good for any system with 256 megs of ram or more. (which we probably all have).
-fomit-frame-pointer will free up an extra register that would have been used for a frame pointer. you don't need it unless you do debugging.
-O3: now, here's a point of contention. we all know that at least -O2 is good, considering the fact that it does many optimizations without a large space/speed tradeoff. -O3 does everything that -O2 does, plus function inlining and register renaming. function inlining is a good thing on our processors. inlining integrates simple functions into their callers, and helps the pentium4's trace cache make better branch predictions. the return address stack has a 16 entry limit - so a depth of more than 16 nested calls will hurt performance. calls and returns are expensive... so by integrating functions into callers, they don't have to be called separately - that's one less call for each one. not only that, when a function is inlined, there is the possibility for it to be exposed to even more optimization. register renaming makes use of extra registers that are left over after register allocation. this is really only beneficial for processors with many registers... and the ix86 platform is a rather register-starved one, so it doesn't do much to have it. i suppose you could just use '-O2 -finline-functions' in place of -O3, but it's not like having register renaming on hurts anything. so i just use -O3.
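to make the inlining point concrete, here's a minimal sketch in c. the names `square` and `sum_of_squares` are made up for illustration, they aren't from any real project. with -O3 (or -O2 -finline-functions), gcc is free to paste the body of `square` into the loop, so there's no call or return per iteration. whether it actually does so is up to the compiler's heuristics.

```c
#include <assert.h>

/* hypothetical helper: small enough that the inliner will likely absorb it */
static int square(int x)
{
    return x * x;
}

/* sums 1*1 + 2*2 + ... + n*n; each iteration calls square() unless inlined */
int sum_of_squares(int n)
{
    int total = 0;
    int i;

    assert(n >= 0); /* a negative count makes no sense here */
    for (i = 1; i <= n; i++)
        total += square(i);
    return total;
}
```

if you want to see what the compiler did, build with `gcc -O3 -S` and look for a call to `square` in the assembly output.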
-mfpmath=sse,387: this specifies that both sse and 387 floating point math should be used at the same time, which could theoretically double the number of available floating point registers. it's still a little experimental, though, so if you don't feel comfortable with that, then just use -mfpmath=sse. you want intel processors (well... almost any processor, actually) to use sse over 387 floating point.
-ffast-math: this enables unsafe math optimizations. it speeds things up a lot, but could possibly cause inaccuracies in precise calculations. for general use and non-critical applications, it's fine. but if you need to do precise calculations with scientific software or something, then don't use it.
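a quick c sketch of why -ffast-math can change results: floating point addition isn't associative, and the flag lets gcc regroup operations as if it were. the function names are made up for illustration. built without -ffast-math on a machine with ieee doubles, the two functions give different answers for a = 1e20, b = -1e20, c = 1.0: the first returns 1.0 and the second returns 0.0, because -1e20 + 1.0 rounds right back to -1e20.

```c
/* evaluates (a + b) + c, in that order under strict ieee rules */
double sum_left_first(double a, double b, double c)
{
    double partial = a + b;
    return partial + c;
}

/* evaluates a + (b + c), in that order under strict ieee rules */
double sum_right_first(double a, double b, double c)
{
    double partial = b + c;
    return a + partial;
}
```

with -ffast-math the compiler may treat both groupings as interchangeable, so you can't count on getting either answer in particular. that's the kind of inaccuracy to keep in mind for scientific code.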
-funroll-all-loops: here is the main point of our debate; loop unrolling. yes, loop unrolling makes programs bigger. however, unrolling loops also eliminates some branches, which decreases branch overhead, and allows the loop to be aggressively scheduled to avoid latency. like function inlining, loop unrolling also exposes the code to more optimization. the pentium4 can predict the exit branch for loops with 16 or fewer iterations, so loops should be unrolled until they have 16 iterations max, which i assume the compiler knows since we specified our architecture. you could just do -funroll-loops, which will only unroll loops where a definite number of iterations can be determined. -funroll-all-loops unrolls the same definite ones as -funroll-loops, plus ones without an explicit number of iterations. it's really up to you; but from what i've seen, unrolling all loops is better. but we don't want to do any further loop unrolling; excessive unrolling could make the unrolled loop too large to fit in the trace cache, which is bad. while loop unrolling does make the program bigger, which does cause more cache misses, that slight degradation is outweighed by the performance gain of loop unrolling, especially if you align code to the cache line size.
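here's roughly what unrolling looks like if you did it by hand in c. this is only a sketch: the function names and the unroll factor of 4 are my own choices for illustration, not what gcc actually picks. the unrolled version does one compare-and-branch per four elements instead of one per element, and needs a small cleanup loop for the leftovers (that's part of where the extra code size comes from).

```c
#include <assert.h>
#include <stddef.h>

/* plain loop: one compare-and-branch per element */
long sum_rolled(const int *values, size_t count)
{
    long total = 0;
    size_t i;

    for (i = 0; i < count; i++)
        total += values[i];
    return total;
}

/* same sum, unrolled by hand: one compare-and-branch per four elements */
long sum_unrolled4(const int *values, size_t count)
{
    long total = 0;
    size_t i = 0;

    assert(values != NULL || count == 0);

    /* main body: handles elements in groups of four */
    for (; i + 4 <= count; i += 4) {
        total += values[i];
        total += values[i + 1];
        total += values[i + 2];
        total += values[i + 3];
    }
    /* cleanup loop: picks up the 0 to 3 leftover elements */
    for (; i < count; i++)
        total += values[i];
    return total;
}
```

both functions return the same sum for any input; the only difference is how many branches get executed along the way.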
-fforce-addr: this theoretically produces better code. -fforce-mem is enabled by O2, and copies memory operands into registers before working on them. -fforce-addr copies memory address constants into registers too before doing arithmetic on them. i don't know why this isn't enabled along with -fforce-mem, but i just enabled it since -fforce-mem is already enabled. it's up to you.
-fmerge-all-constants: by default, the compiler will merge identical string constants and floating point constants if it's supported. (-fmerge-constants) -fmerge-all-constants will merge even more identical constants, including constant arrays and constant integer or floating point variables. C and C++ require each variable to have its own distinct location, so code built with this flag may not conform to the standards. i haven't noticed any problems with it, though.
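a small c sketch of the kind of code that could notice the difference. the table names and both functions are hypothetical. the two arrays hold the same bytes, so they're candidates for merging; the c standard says they're still two objects with two addresses, which is exactly the guarantee -fmerge-all-constants gives up.

```c
#include <string.h>

/* two separate objects that happen to hold the same bytes */
static const int table_a[4] = { 1, 2, 3, 4 };
static const int table_b[4] = { 1, 2, 3, 4 };

/* returns 1 when both tables hold identical contents (always true here) */
int tables_have_equal_contents(void)
{
    return memcmp(table_a, table_b, sizeof table_a) == 0;
}

/* returns 1 when both tables live at the same address. a conforming
   build returns 0; a build with -fmerge-all-constants is allowed to
   put both tables in one place. */
int tables_share_storage(void)
{
    return (const void *)table_a == (const void *)table_b;
}
```

most programs never compare addresses like this, which is probably why i haven't run into trouble. note that the compiler may also fold the address comparison at compile time, so seeing 0 from `tables_share_storage` doesn't prove the tables weren't merged.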
-maccumulate-outgoing-args: while this does increase code size, it calculates the maximum amount of space needed for outgoing arguments once, at the beginning of the function, which improves scheduling and reduces stack usage. it slows down older processors, but it speeds up modern processors that can handle it. (like ours)
-falign-functions=16: here's the alignment. this is to help reduce cache misses. or rather, not exactly reduce, but optimize them. the pentium4 has a 64-byte cache line size. that's how much of the cache it can swap out at once. so if you aligned a function to a 64-byte boundary, the function starts at the beginning of the cache line, so it can maximize the usage of that cache line, without leaving empty space before the function. (it's similar to cluster sizes in filesystems, or chunk sizes in raids) the problem is that a 64-byte boundary is a large amount to round up to, especially if the function would start at... say... byte 40. that would be rounded up to 64, losing 24 bytes. 16 bytes is a nice fraction of that, 1/4 - it can be aligned a little better on a natural operand size boundary without sacrificing too much. plus, vector instruction loads and stores are supposed to be aligned at 16 bytes anyway.
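the rounding arithmetic can be sketched in a few lines of c. keep in mind that -falign-functions takes its argument in bytes. `align_up` and `padding_needed` are names i made up for illustration; this just shows the math, not how gcc or the assembler implements it.

```c
#include <assert.h>
#include <stddef.h>

/* rounds offset up to the next multiple of alignment (a power of two) */
size_t align_up(size_t offset, size_t alignment)
{
    /* power-of-two check: exactly one bit set */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

/* bytes of padding placed in front of a function that would start at offset */
size_t padding_needed(size_t offset, size_t alignment)
{
    return align_up(offset, alignment) - offset;
}
```

for a function that would otherwise start at byte 40, `padding_needed(40, 64)` is 24 while `padding_needed(40, 16)` is only 8, which is the tradeoff i'm talking about.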
-fPIC: this is completely personal preference. this generates position independent code, which is what shared libraries use for dynamic linking through the global offset table. -fPIC avoids the machine-specific limits on the size of the global offset table that -fpic has. -fPIC breaks certain compiles, though - some things can't be built as PIC.
a lot of this information was gathered by reading intel's pentium 4 "IA-32 Intel Architecture Optimization Reference Manual", on the intel site.
i did a lot of messing around and testing of flags. if you're interested in doing so also, you might want to check out this program: http://www.rocklinux.org/ccbench.html
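if you'd rather roll your own quick test, here's a minimal timing sketch in c. to be clear, this is not how ccbench works; `sample_workload` and `cpu_seconds_for` are hypothetical names, and the workload is just a stand-in for whatever code you actually care about. the idea is to build the same file with different cflags and compare the numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* hypothetical workload: something for the optimizer to chew on */
void sample_workload(void)
{
    volatile double accumulator = 0.0; /* volatile so the loop isn't deleted */
    int i;

    for (i = 1; i <= 1000000; i++)
        accumulator += 1.0 / i;
}

/* runs fn() reps times and returns the cpu seconds used, per clock() */
double cpu_seconds_for(void (*fn)(void), int reps)
{
    clock_t start, stop;
    int i;

    assert(fn != NULL && reps > 0);
    start = clock();
    for (i = 0; i < reps; i++)
        fn();
    stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

call it as `cpu_seconds_for(sample_workload, 100)` from your own main and print the result. clock() is coarse, so use enough reps that the total runs for a second or more, and run it a few times before trusting any difference.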
i'm not a computer architecture expert, so correct me if i'm wrong about something.