Ok, steve_416, riddle me this:
How can an Athlon processor with a lower clock speed and a cache no larger than a Pentium's outperform it in tests using the same software?
Happens all the time - Athlons go up to 2.16 GHz, P4 starts at 2.2. Athlons have 256-512K of L2 cache, P4s have 512K. And yet top Athlons are faster than low-end Pentiums by a wide margin.
I'll try to explain one last time: yes, clock speed sets the maximum number of instructions a CPU can process per second - ideally one per cycle. But in reality, with pipelined processors, a large percentage of cycles are wasted on no-op instructions. That means the total number of cycles it takes to complete execution of the program equals the number of program instructions plus the number of wasted cycles. And since wasted instructions vary by processor implementation, different processors will waste different numbers of cycles while executing the same program, and therefore take different numbers of cycles to complete it.
There are different ways a chip can reduce the number of no-ops executed, thereby reducing the number of wasted cycles.
For example, suppose a chip (like the P4) has about 20 steps to its pipeline - call it Proc A. When it executes a conditional (e.g. "if a > 0 goto LineX"), it has to put 20 no-ops into the pipeline to hold execution until the conditional resolves and it knows which line of code to execute next. That means it wastes 20 cycles for each conditional in the program. Now say Proc B only has a 5-step pipeline, so it only has to put 5 no-ops in the pipe per conditional. Proc B therefore wastes only 25% as many cycles waiting on conditionals as Proc A.
So say a 1 million instruction program contains 5% conditional instructions (a reasonable ballpark estimate). Proc A needs 1 million cycles to process the actual instructions, and wastes 20 * 5% * 1mil = 1 million cycles on no-ops for the conditionals, for 2 million cycles total. Proc B needs 1 million cycles for the actual instructions, plus 5 * 5% * 1mil = 250,000 cycles of no-ops. Net result: Proc B finishes in 1.25 million cycles, or 62.5% of Proc A's 2 million.
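If you want to play with the numbers, here's the arithmetic above as a quick Python sketch. The function name and the stall model (every conditional stalls the full pipeline depth) are just the idealization from this post, not how real hardware accounts for branches:

```python
# Cycle-count sketch for the hypothetical Proc A vs. Proc B comparison.
# Simplification from the post: every conditional stalls for the full
# pipeline depth.

def total_cycles(instructions, branch_fraction, pipeline_depth):
    """Each instruction takes one cycle, plus a full-pipeline stall per branch."""
    useful = instructions
    wasted = int(pipeline_depth * branch_fraction * instructions)
    return useful + wasted

N = 1_000_000      # 1 million instruction program
BRANCHES = 0.05    # 5% conditionals

proc_a = total_cycles(N, BRANCHES, pipeline_depth=20)
proc_b = total_cycles(N, BRANCHES, pipeline_depth=5)
print(proc_a, proc_b, proc_b / proc_a)  # 2000000 1250000 0.625
```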
And that's merely one way two chips can finish the same set of instructions in different numbers of cycles.
Which naturally raises the question: "why would anyone use a 20-step pipeline if it wastes that many cycles?" The answer is simple: the simpler each step of the pipeline is, the faster it can be executed. So the more finely you break instruction execution down into pipeline steps, the faster you can clock your processor. In the above example, that means Proc A will most likely run at a much higher clock speed than Proc B, which brings their execution times back toward parity. More important for the manufacturer of Proc A, though, is the gigahertz number it gets to market - most people assume a higher clock speed translates linearly into a faster processor, and so will be more likely to buy Proc A over Proc B.
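To make "back to similar execution time" concrete: wall-clock time is cycles divided by clock frequency. The frequencies below are made up purely to illustrate how a deeper pipeline's higher clock can exactly offset its wasted cycles:

```python
# Wall-clock time = cycles / clock frequency.
# The 2.0 GHz and 1.25 GHz figures are invented for illustration only.

def run_time(cycles, ghz):
    return cycles / (ghz * 1e9)  # seconds

time_a = run_time(2_000_000, 2.0)    # Proc A: 2M cycles at 2.0 GHz
time_b = run_time(1_250_000, 1.25)   # Proc B: 1.25M cycles at 1.25 GHz
print(time_a, time_b)  # both 0.001 s - same real-world speed, very different GHz
```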
There are other tricks you can use to minimize no-ops with a long pipeline. Take Proc C: a 20-step pipeline, with conditional execution and squashing. When it hits a conditional like "if a > 0 goto LineX", it immediately follows the conditional with the next instructions in the program. If the conditional resolves without the goto, it can simply continue processing, having used no no-ops at all. If the conditional resolves with the goto, Proc C 'squashes' - erases the results of - all instructions submitted after the conditional and jumps to LineX. So how does C perform? Well, assume it needs to squash half the time (equivalent to randomly guessing the outcome). Then for half the conditionals it wastes no cycles, and for the other half it wastes the usual 20, making the average waste per conditional 10 cycles, for a net cycle count of 10 * 5% * 1mil + 1mil = 1,500,000 cycles. [Note: it's actually slower than this because of the strong possibility of hitting a second conditional within 20 ops of the first, but accounting for that makes for some really complex math, so we'll leave it at this idealized simplification.]
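Same idealized math for Proc C, in code. The expected-waste model (probability of squashing times full pipeline depth, ignoring back-to-back conditionals) is the simplification from this post:

```python
# Proc C with squashing: a branch only costs the pipeline depth when the
# guess is wrong, so expected waste per branch = taken_prob * depth.
# Ignores a second conditional landing within 20 ops of the first, as noted.

def squash_cycles(instructions, branch_fraction, pipeline_depth, squash_prob):
    wasted_per_branch = squash_prob * pipeline_depth   # expected, not worst-case
    return int(instructions + wasted_per_branch * branch_fraction * instructions)

proc_c = squash_cycles(1_000_000, 0.05, 20, squash_prob=0.5)
print(proc_c)  # 1500000
```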
Even neater trick: Proc D keeps two register sets and can track execution of two programs simultaneously. Proc D has a 20-step pipeline, just like Proc A. However, instead of inserting 20 no-ops for each conditional, whenever Program 1 hits a conditional it executes instructions from Program 2 while it waits - and vice versa. So it takes Proc D 2 million cycles to finish execution of our sample program - just like Proc A - but in that time it can simultaneously complete another 1 million instruction program, thereby effectively executing 2 million unwasted instructions. [Again, an idealized oversimplification, since we assume all 1 million cycles that would be no-ops on Proc A are actually used for real operations from the second program. In reality, both programs will sometimes hit a conditional at the same time, and the CPU has to fall back on no-ops while the first conditional resolves. So Proc D will be slower than described, but taking that into account again leads to hairy math, so I'll stick with this description.]
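Here's Proc D's bookkeeping under the same idealization - stream 2's instructions fill stream 1's stall slots, and the two streams never stall at once. The function and its slot-counting model are my sketch of the post's description, not a real scheduler:

```python
# Idealized Proc D: stall slots from stream 1's conditionals are filled with
# instructions from stream 2. Assumes the streams never stall simultaneously.

def dual_stream_cycles(n1, n2, branch_fraction, pipeline_depth):
    stall_slots = int(pipeline_depth * branch_fraction * n1)  # would-be no-ops
    filled = min(stall_slots, n2)        # stream 2 runs "for free" in the slots
    leftover = n2 - filled               # stream-2 work beyond the stall slots
    total = n1 + stall_slots + leftover  # cycles to finish both programs
    wasted = stall_slots - filled        # slots nobody could fill
    return total, wasted

total, wasted = dual_stream_cycles(1_000_000, 1_000_000, 0.05, 20)
print(total, wasted)  # 2000000 0 - two programs done, zero wasted cycles
```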
Can anyone name a processor feature similar to what I've described as Proc D?