BLAS-level CPU Performance in 100 Lines of C

UPDATE 2022: I made this page in 2015. Some of the links below were out of date so I updated them to point to the
original 2015 code. This tutorial was made for educational purposes to teach how compilers and libraries obtain high
CPU performance. It is not trying to replace BLAS libraries or generate portable code.

Introduction
A common misconception is that BLAS implementations of matrix multiplication are orders of magnitude faster than naive implementations because they are very complex. This tutorial shows that, using Intel intrinsics (FMA3 and AVX2), BLAS-speed in dense matrix multiplication can be achieved using only 100 lines of C.
The code can be found here. It is based on the paper by Smith et al. [1]: Anatomy of High-Performance Many-Threaded Matrix Multiplication

Hardware
The CPU used in this tutorial provides 128 GFLOPS peak performance (2 physical cores x 2.00 GHz x 32 FLOps/cycle):

• 2 physical cores, 4 virtual (2.00 GHz)
• Haswell microarchitecture (32 FLOps/cycle per core: 8-float-wide AVX2 vectors x 2 FMA3 units x 2 FLOps per FMA, i.e. a multiply and an add)

Where FLOps = floating point operations and FLOPS = floating point operations per second.

Optimizations
This tutorial implements the GEMM procedure specified in [1], measuring throughput at each level of optimization. Each level below corresponds to a function in compare_blas.cpp.
1. Naive implementation
The naive implementation uses three nested for loops.
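As a sketch, the three-loop version looks like the following (the function name and row-major layout are assumptions; the original compare_blas.cpp may differ):

```c
#include <stddef.h>

/* Naive GEMM sketch: C += A * B, where A is M x K, B is K x N, and
   C is M x N, all stored row-major. One output element is accumulated
   at a time in the innermost loop. */
void gemm_naive(const float *A, const float *B, float *C,
                size_t M, size_t N, size_t K)
{
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++)
            for (size_t k = 0; k < K; k++)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
}
```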
2. + Vectorization to take advantage of SIMD and FMA hardware
Three nested loops are still used, but rather than compute one element of the output matrix C at a time, this version "tiles" C using a "micro-kernel" [1] which computes outputs in tiles of 6 rows by 16 columns (i.e. 6 rows of two 8-wide "vectors"). This occupies 12 of the 16 AVX2 (YMM) registers (each 8 floats wide); the remaining 4 registers hold elements of matrices A and B while the micro-kernel executes. Note that this provides nearly an 8x speedup over the naive implementation, matching the 8x increase in data parallelism.
See micro-kernel in matrix_kernel16_6.c
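A minimal sketch of such a 6x16 micro-kernel is shown below. The function name, the packed-panel layouts (6 floats of A and 16 floats of B per step k), and the use of a target attribute instead of compiler flags are all assumptions for illustration; matrix_kernel16_6.c may organize its packing differently.

```c
#include <immintrin.h>
#include <stddef.h>

/* 6x16 FMA micro-kernel sketch. A_panel is packed so step k supplies
   6 floats (one per output row); B_panel supplies 16 floats (two 8-wide
   AVX2 vectors). The 12 c[][] accumulators stay in YMM registers for
   the whole k loop; the target attribute enables AVX2/FMA codegen
   without requiring -mavx2 -mfma on the command line (GCC/Clang). */
__attribute__((target("avx2,fma")))
void microkernel_6x16(const float *A_panel, const float *B_panel,
                      float *C, size_t ldc, size_t K)
{
    __m256 c[6][2];
    for (int i = 0; i < 6; i++) {
        c[i][0] = _mm256_setzero_ps();
        c[i][1] = _mm256_setzero_ps();
    }

    for (size_t k = 0; k < K; k++) {
        __m256 b0 = _mm256_loadu_ps(B_panel + 16 * k);      /* cols 0..7  */
        __m256 b1 = _mm256_loadu_ps(B_panel + 16 * k + 8);  /* cols 8..15 */
        for (int i = 0; i < 6; i++) {
            /* broadcast one element of A across a vector, then FMA */
            __m256 a = _mm256_broadcast_ss(A_panel + 6 * k + i);
            c[i][0] = _mm256_fmadd_ps(a, b0, c[i][0]);
            c[i][1] = _mm256_fmadd_ps(a, b1, c[i][1]);
        }
    }

    /* accumulate the tile back into C */
    for (int i = 0; i < 6; i++) {
        _mm256_storeu_ps(&C[i * ldc],
            _mm256_add_ps(_mm256_loadu_ps(&C[i * ldc]), c[i][0]));
        _mm256_storeu_ps(&C[i * ldc + 8],
            _mm256_add_ps(_mm256_loadu_ps(&C[i * ldc + 8]), c[i][1]));
    }
}
```

Note the design point: all 12 accumulators must fit in registers simultaneously, which is exactly why the tile is 6x16 on a 16-register AVX2 machine.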
3. + Cache Blocking (L1, L2, L3) to maximize data re-use
Rather than use three nested loops, additional loops are introduced for cache blocking, as in [1].
See micro-kernel in matrix_kernel16_6_block.c
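The loop structure from [1] can be sketched as below. The block sizes are illustrative placeholders, and the innermost update is written as plain C for clarity; matrix_kernel16_6_block.c instead packs the blocked panels into contiguous buffers and calls the vectorized micro-kernel.

```c
#include <stddef.h>

/* Illustrative block sizes: a KC x NC panel of B is meant to sit in a
   larger cache while an MC x KC block of A is reused against it.
   These values are assumptions, not tuned for any particular CPU. */
enum { MC = 96, KC = 256, NC = 1024 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked GEMM sketch: C += A * B (row-major). The three outer loops
   partition the problem for cache re-use, as in [1]; the three inner
   loops stand in for the packed micro-kernel sweep. */
void gemm_blocked(const float *A, const float *B, float *C,
                  size_t M, size_t N, size_t K)
{
    for (size_t jc = 0; jc < N; jc += NC)          /* column panel of B */
        for (size_t pc = 0; pc < K; pc += KC)      /* K-block shared by A, B */
            for (size_t ic = 0; ic < M; ic += MC)  /* row block of A */
                for (size_t i = ic; i < min_sz(ic + MC, M); i++)
                    for (size_t k = pc; k < min_sz(pc + KC, K); k++)
                        for (size_t j = jc; j < min_sz(jc + NC, N); j++)
                            C[i * N + j] += A[i * K + k] * B[k * N + j];
}
```

The point of the extra loops is that each KC x NC slice of B is streamed through once while an MC x KC block of A stays resident, so most inner-loop operands come from cache rather than memory.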
4. + Threads to utilize all cores
Multithreading with OpenMP is used to parallelize the loops which tile/block the matrix C. With 2 physical cores, threading gives a speedup of 2x, which includes a ~10% gain from hyperthreading (4 threads rather than 2). While virtual cores do not always help compute-bound kernels like GEMM, here hyperthreading is likely being used to execute prefetch instructions, as described in [1].
Also uses micro-kernel in matrix_kernel16_6_block.c
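The threading itself amounts to a single pragma on a tile loop, roughly as sketched here (the function name and the choice of parallelizing the row loop are assumptions; the original parallelizes the blocking loops around the micro-kernel):

```c
#include <stddef.h>

/* OpenMP sketch: compile with -fopenmp; without it the pragma is
   ignored and the code simply runs serially. Each thread writes a
   disjoint set of rows of C, so no synchronization is needed. */
void gemm_omp(const float *A, const float *B, float *C,
              size_t M, size_t N, size_t K)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)M; i++)  /* signed index for older OpenMP */
        for (size_t k = 0; k < K; k++)
            for (size_t j = 0; j < N; j++)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
}
```

Partitioning C (rather than A or B) is what makes the parallelism embarrassingly simple: every output element has exactly one owner thread.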

Various Matrix Sizes
With all of these optimizations, the hardware peak performance (128 GFLOPS) is nearly reached, and so (not surprisingly) the speed of OpenBLAS is matched.

Conclusion
Even though BLAS libraries are very complex (implementing the full set of routines, supporting all parameters such as transposition, handling arbitrary matrix sizes, targeting multiple hardware architectures, etc.), the core performance is not difficult to obtain by following implementations such as [1]. We've seen that this can be done in 100 lines of C with Intel intrinsics and no assembly.

Citations
1. Tyler M. Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond, and Field G. Van Zee.
Anatomy of high-performance many-threaded matrix multiplication.
In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), 2014.