When a CPU receives instructions, it essentially "chops" the work on each one into little steps and sends them through a thing called a pipeline to be processed. The longer the pipeline of a processor, the less work each stage has to do, so each stage can be completed faster and the clock speed can go higher. A P4 has a very long pipeline. The reasoning is simple: Intel needed the P4 to have a long pipeline so that they could reach more GHz. The new Prescott chip has an even longer pipeline because Intel hopes to reach speeds of 4GHz.
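To put rough numbers on that, here's a quick Python sketch. It assumes (my simplification, not real Intel figures) that an instruction needs a fixed 5 ns of total work, that the work splits evenly across the stages, and that every stage adds a fixed 0.1 ns of overhead. The function name and all the values are made up for illustration.

```python
# Toy model: how pipeline depth limits the maximum clock speed.
# All numbers are illustrative assumptions, not measured CPU data.

def max_clock_ghz(total_logic_ns, stages, overhead_ns):
    """The clock can only tick as fast as one stage can finish its share of the work."""
    stage_time_ns = total_logic_ns / stages + overhead_ns
    return 1.0 / stage_time_ns  # 1 / nanoseconds = GHz

for stages in (12, 20, 30):  # hypothetical pipeline depths
    ghz = max_clock_ghz(total_logic_ns=5.0, stages=stages, overhead_ns=0.1)
    print(f"{stages} stages -> about {ghz:.2f} GHz")
```

With these made-up numbers the clock goes up as the stage count goes up, but with diminishing returns, because the per-stage overhead never shrinks.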
Anyway, the downside of a longer pipeline is that when the work in it has to be thrown out and redone, more clock cycles are needed to refill it. This happens because of branches: when a program hits an "if", the CPU doesn't yet know which instructions come next, and rather than sit idle until it finds out, it "guesses" and starts working on them. So while the Northwood P4's 20-stage pipeline allows for very fast clock speeds, it takes a performance toll due to incorrect branch predictions (wrong guesses about which instructions come next), since every wrong guess means flushing and refilling all those stages.
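Here's a rough sketch of why wrong guesses hurt a longer pipeline more. The big assumption is that one wrong guess wastes about one clock cycle per pipeline stage. The branch fraction (20% of instructions) and the miss rate (5% of guesses) are numbers I picked for illustration, not measurements, and the function name is made up.

```python
# Toy model: average clock cycles per instruction once wrong guesses are counted.
# Assumption: one misprediction costs roughly one cycle per pipeline stage.

def cycles_per_instruction(base_cpi, branch_fraction, mispredict_rate, pipeline_stages):
    penalty_cycles = pipeline_stages  # assumed cost of flushing and refilling
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

# Hypothetical workload: 20% branches, 5% of them guessed wrong.
shallow = cycles_per_instruction(1.0, 0.20, 0.05, pipeline_stages=12)
deep = cycles_per_instruction(1.0, 0.20, 0.05, pipeline_stages=20)

print(f"12-stage pipeline: {shallow:.2f} cycles per instruction")
print(f"20-stage pipeline: {deep:.2f} cycles per instruction")
```

Same program, same guessing accuracy, but the 20-stage pipeline burns more cycles per instruction, so it needs a higher clock just to break even.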
AMD's CPUs have much shorter pipelines. The Athlon 64 only has a 12-stage pipeline, and thus when it guesses incorrectly which instructions come next, its performance is not hurt as much. Since each stage does more work per clock cycle, the processor can often get more done per clock. This is why an AMD running at a lower clock speed can still offer the same performance as a higher-clocked P4. Here's a real-life example:
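AMD: A boy runs at a steady rate of 10mph to cover a distance of twenty yards.
Intel: A boy runs at a steady rate of 20mph to cover a distance of forty yards.
Both boys finish at the same time: the Intel boy is twice as fast, but he has twice as far to go.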
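If you want to check the arithmetic on the two runners, here's a small sketch. The second half applies the same idea to CPUs using the usual rule of thumb that performance is roughly clock speed times work done per clock. The clock speeds and instructions-per-clock values there are hypothetical, picked only so the numbers come out even.

```python
# The runner analogy: time = distance / speed.
YARDS_PER_MILE = 1760

def seconds_to_run(distance_yards, speed_mph):
    yards_per_second = speed_mph * YARDS_PER_MILE / 3600
    return distance_yards / yards_per_second

amd_boy = seconds_to_run(20, 10)
intel_boy = seconds_to_run(40, 20)
print(f"AMD boy: {amd_boy:.2f} s, Intel boy: {intel_boy:.2f} s")

# Same idea for CPUs: work per second = clock speed * instructions per clock.
# These clock and IPC values are hypothetical, not benchmark results.
amd_work = 2.0 * 1.5    # 2.0 GHz at 1.5 instructions per clock
intel_work = 3.0 * 1.0  # 3.0 GHz at 1.0 instructions per clock
print(f"AMD: {amd_work} vs Intel: {intel_work} billion instructions per second")
```

Both runners take about four seconds, and both hypothetical chips get the same amount of work done per second even though one has a much higher clock.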
A larger cache doesn't make branch predictions more accurate, but it does mean the CPU has to wait on the much slower system memory less often, which helps keep the pipeline fed. I hope everything I wrote is correct.