Introduction
A common misconception is that BLAS implementations of matrix multiplication
are orders of magnitude faster than naive implementations because they are very complex.
This tutorial shows that, using Intel intrinsics (
FMA3
and
AVX2), BLASspeed
in dense matrix multiplication can be achieved using only 100 lines of C.
The code can be found
here.
It is based on the paper by Smith et al.[1]:
Anatomy of HighPerformance ManyThreaded Matrix Multiplication
Hardware
The CPU used in this tutorial provides 128 GFLOPS peak performance (2 physical cores x 2.00 GHz x 32 FLOps/cycle):
 2 physical cores, 4 virtual (2.00 GHz)
 Haswell microarchitecture (32 FLOps/cycle: 8float AVX2 x 2 FMA3 units x 2 FLOps/cycle per FMA)
Where
FLOps = floating point operations and
FLOPS = floating point operations per second.
Optimizations
This tutorial implements the GEMM procedure specified in [1], measuring
throughput for various levels of optimization.
Each refers to a function in
compare_blas.cpp

Naive implementation
The naive implementation uses three nested for loops

+ Vectorization to take advantage of SIMD and FMA hardware
Three nested loops are still used, but
rather than calculate 1 element of the output matrix C at once, this "tiles" C
using a "microkernel" [1] which
calculates outputs in tiles of size 6 rows by 16 columns
(i.e. 6 rows of 2 8wide "vectors").
This uses 12/16 of the AVX2 registers (each 8 floats wide).
The remaining 4 registers are used to hold elements of matrix A and B while the microkernel executes.
Note that this provides nearly an 8x speedup over the naive implementation, because of the 8x increase in parallelism.
See microkernel in matrix_kernel16_6.c

+ Cache Blocking (L1, L2, L3) to maximize data reuse
Rather than use three nested loops, additional loops are introduced for cache blocking, as in [1].
See microkernel in matrix_kernel16_6_block.c

+ Threads to utilize all cores
Multithreading using OpenMP is used to parallelize the loops which
tile/block the matrix C. There are 2 physical cores, and using threads give a speedup of 2x.
This includes a ~10% speedup from hyperthreading (4 threads as opposed to 2).
While virtual cores do not always help computebound kernels like GEMM, in this case
hyperthreading is likely being used to execute
prefetch instructions
as described in [1].
Also uses microkernel in matrix_kernel16_6_block.c
Various Matrix Sizes
With all these optimizations the hardware peak performance (128 GFLOPS) is nearly reached and so (not
surprisingly) the speed of OpenBLAS is matched:
Conclusion
Even though BLAS libraries are very complex (to implement all necessary functions,
to support all parameters e.g. transpose, to handle any matrix size, to target
multiple hardware architectures, etc.), the core performance is not difficult to obtain
following implementations such as [1]. We've seen that this can be done in
100 lines of C / Intel intrinsics, and no assembly.
Citations

Tyler M. Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond,
and Field G. Van Zee.
Anatomy of highperformance manythreaded matrix multiplication.
In Proceedings of International Parallel and Distributed Processing Symposium, 2014
Website by Stefan Hadjis