When I tried SIMD optimization in a large-scale simulation (HPC), it was very difficult to implement. There is no easy way to restructure the simulation logic to use, e.g., matrix blocking or to go to a finer granularity, and BLAS or similar math libraries are already highly optimized.
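To show what I mean by matrix blocking, here is a minimal sketch of a cache-tiled multiply (the names `matmul_blocked`, `N`, and `BS` are placeholders I made up, and the block size would have to be tuned to the actual cache):

```c
#include <stddef.h>

#define N  1024   /* matrix dimension (placeholder) */
#define BS 64     /* block size meant to fit a tile in cache (placeholder) */

/* Blocked (tiled) matrix multiply: C += A * B.
 * Each BS x BS tile is reused while it is still hot in cache.
 * This is exactly the kind of loop restructuring that is hard to
 * retrofit into existing simulation code. */
void matmul_blocked(const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < N; ii += BS)
        for (size_t kk = 0; kk < N; kk += BS)
            for (size_t jj = 0; jj < N; jj += BS)
                for (size_t i = ii; i < ii + BS && i < N; i++)
                    for (size_t k = kk; k < kk + BS && k < N; k++) {
                        double a = A[i * N + k];
                        for (size_t j = jj; j < jj + BS && j < N; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

A tuned BLAS already does this (and more), which is why hand-rolling it rarely pays off.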
For example, the same idea applies here: exploit the cache and express the work as map-reduce. But while part of the algorithm design is to keep each core's working set small enough that a large matrix never has to sit in its cache, the algorithm also requires synchronization on every iteration, which makes map-reduce painful.
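A minimal sketch of that per-iteration synchronization, using a Jacobi-style sweep as a stand-in (the function name, grid size, and iteration count are all placeholders): every sweep must finish on all cores before the next one starts, so you cannot just "map" independent chunks and "reduce" once at the end.

```c
#define N      (1 << 20)  /* grid size (placeholder) */
#define ITERS  1000       /* number of sweeps (placeholder) */

/* Jacobi-style iteration: each sweep reads the previous state and
 * writes the next one.  The implicit barrier at the end of the
 * parallel loop is the per-iteration synchronization that a
 * map-reduce decomposition does not express naturally. */
void jacobi_sweeps(double *cur, double *next)
{
    for (int it = 0; it < ITERS; it++) {
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++)
            next[i] = 0.5 * (cur[i - 1] + cur[i + 1]);
        /* implicit barrier here: all threads must finish sweep `it`
         * before any thread may start sweep `it + 1` */
        double *tmp = cur; cur = next; next = tmp;  /* swap buffers */
    }
}
```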
I suppose there may be two issues: either HPC algorithm design should consider scalability from the start, the way big data apps do, so that the components can be easily distributed; or big data is just big data, with comparatively simple processing logic.