In the multi-core world, the overriding goal is to keep scaling as more cores are thrown at a problem. In practice, that means performance tweaking comes down to:
- Avoid locking of any kind; otherwise performance won't scale as more cores are thrown into the stewpot
- Minimize cache misses and hot cache reloads; keep data cache-friendly (good locality, little cache-line sharing between cores)
- Old-fashioned instruction tweaking (i.e. reducing instruction costs).
The above are listed in their approximate order of importance.
I highly recommend watching the videos listed on this posting, as they point out that #2 (cache behavior) is often more important than #3 (instruction tweaking) in performance tweaking.
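To make the cache point concrete, here is a minimal sketch in C. It is an illustration only and is not taken from the videos: both functions compute the same sum over a matrix stored row-major in one flat array, but the first walks memory sequentially while the second jumps a whole row between accesses. The function names and the flat layout are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a rows x cols matrix stored row-major in one flat array.
 * The inner loop touches consecutive addresses, so every cache line
 * that gets fetched is fully used before the loop moves on. */
static long sum_row_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long total = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r * cols + c];
    return total;
}

/* Same result and the same number of additions, but the inner loop
 * strides by a whole row (cols * sizeof(int) bytes), so for wide
 * matrices almost every access lands on a different cache line. */
static long sum_col_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long total = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += m[r * cols + c];
    return total;
}
```

On matrices much larger than the cache, the column-major walk is usually the slower of the two even though the instruction count is essentially identical. Measure on your own hardware rather than trusting the rule of thumb.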
Locking can often be avoided by using userspace RCU, or similar tricks.
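As one example of a "similar trick" (this is not RCU itself), a shared statistic can often be split into per-thread slots so that writers never contend on a lock or on a cache line. The sketch below uses C11 atomics. The names `counter_add`, `counter_read`, and `NSLOTS`, and the 64-byte cache-line size, are assumptions for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumption: common x86 cache-line size */
#define NSLOTS     8   /* hypothetical: one slot per worker thread */

/* Each slot is aligned to its own cache line, so two workers
 * updating different slots never bounce a line between cores. */
struct padded_counter {
    _Alignas(CACHE_LINE) atomic_ulong value;
};
static_assert(sizeof(struct padded_counter) == CACHE_LINE,
              "each slot should fill exactly one cache line");

static struct padded_counter slots[NSLOTS];

/* Add n to the caller's slot without taking a lock. The slot index
 * is reduced modulo NSLOTS so an out-of-range id stays in bounds. */
static void counter_add(size_t slot, unsigned long n)
{
    atomic_fetch_add_explicit(&slots[slot % NSLOTS].value, n,
                              memory_order_relaxed);
}

/* Sum all slots. While writers are running the result is only a
 * snapshot; it is exact once they have finished. */
static unsigned long counter_read(void)
{
    unsigned long total = 0;
    for (size_t i = 0; i < NSLOTS; i++)
        total += atomic_load_explicit(&slots[i].value,
                                      memory_order_relaxed);
    return total;
}
```

Each worker would pass its own index to `counter_add`, and any thread can call `counter_read`. The trade-off is that reads cost a pass over all slots, which suits counters that are written far more often than they are read.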
Other great resources:
Performance bit twiddling
Awesome parallel programming reference
Detailed Assembly/C/C++ x86 Optimizations
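For a flavor of the instruction-level tweaks that references like these collect, here is a small C sketch of a few classic bit tricks. The function names are made up for this example, and a modern compiler may already emit equivalent or better code (for instance a hardware popcount instruction), so treat these as illustrations, not guaranteed wins.

```c
#include <stdbool.h>
#include <stdint.h>

/* Count set bits. x &= x - 1 clears the lowest set bit, so the loop
 * runs once per set bit instead of once per bit position. */
static unsigned popcount_clear_lowest(uint32_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* A power of two has exactly one set bit, so clearing the lowest
 * set bit must leave zero. Zero itself is excluded explicitly. */
static bool is_power_of_two(uint32_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Replace i % size with a mask. Only valid when size is a power
 * of two; the caller is responsible for ensuring that. */
static uint32_t wrap_index(uint32_t i, uint32_t size)
{
    return i & (size - 1);
}
```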
One of the great tools is simply running perf top. A great deal of insight can be gained just by looking at the output of the command below:
sudo /usr/bin/perf top -p <pid>