Wow--I finally got around to watching these after sitting on the links for a while. They are just what the doctor ordered if you have questions about low-level performance optimization of source code on modern processors.
Questions on SSE/AVX, pipelining, cache fetch times, memory access, locality of memory access, etc. are addressed in these talks. What is fantastic (and repeatedly driven home) is that performance is not necessarily about reducing the overall number of CPU instructions executed, but about reducing the time spent waiting on memory and cache accesses (and how to do this).
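
To make the locality point concrete, here is a minimal sketch of my own (it is not taken from the talks, and the function names are made up for illustration). Both functions execute essentially the same number of additions over the same n x n matrix stored row-major in one contiguous buffer; the only difference is the order in which memory is touched.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: walks the matrix row by row, so consecutive
// iterations read adjacent addresses and reuse each cache line fully.
long long sum_row_order(const std::vector<int>& m, std::size_t n) {
    assert(m.size() == n * n);  // matrix is assumed square and row-major
    long long total = 0;
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            total += m[r * n + c];
    return total;
}

// Hypothetical helper: same arithmetic, but walks column by column, so
// each read is n * sizeof(int) bytes away from the previous one. For
// large n, successive reads tend to land on different cache lines.
long long sum_column_order(const std::vector<int>& m, std::size_t n) {
    assert(m.size() == n * n);
    long long total = 0;
    for (std::size_t c = 0; c < n; ++c)
        for (std::size_t r = 0; r < n; ++r)
            total += m[r * n + c];
    return total;
}
```

Both return the same sum. If you time them on a matrix much larger than your cache, the column-order version will typically be noticeably slower, but the size of the gap depends on the hardware and compiler, so measure on your own machine rather than taking my word for it.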
By the way--you can disregard the Microsoft provenance--most of the discussion/techniques apply equally to any modern x86 compiler.
Jo bob says watch'em!