Introduction
A common misconception is that BLAS implementations of matrix multiplication
are orders of magnitude faster than naive implementations because they are very complex.
This tutorial shows that, using Intel intrinsics (
FMA3
and
AVX2), BLAS-speed
in dense matrix multiplication can be achieved using only 100 lines of C.
The code can be found
here.
It is based on the paper by Smith et al.[1]:
Anatomy of High-Performance Many-Threaded Matrix Multiplication
Hardware
The CPU used in this tutorial provides 128 GFLOPS peak performance (2 physical cores x 2.00 GHz x 32 FLOps/cycle):
- 2 physical cores, 4 virtual (2.00 GHz)
- Haswell microarchitecture (32 FLOps/cycle: 8-float AVX2 x 2 FMA3 units x 2 FLOps/cycle per FMA)
Where
FLOps = floating point operations and
FLOPS = floating point operations per second.
Optimizations
This tutorial implements the GEMM procedure specified in [1], measuring
throughput for various levels of optimization.
Each refers to a function in
compare_blas.cpp
-
Naive implementation
The naive implementation uses three nested for loops
-
+ Vectorization to take advantage of SIMD and FMA hardware
Three nested loops are still used, but
rather than calculate 1 element of the output matrix C at once, this "tiles" C
using a "micro-kernel" [1] which
calculates outputs in tiles of size 6 rows by 16 columns
(i.e. 6 rows of 2 8-wide "vectors").
This uses 12/16 of the AVX2 registers (each 8 floats wide).
The remaining 4 registers are used to hold elements of matrix A and B while the micro-kernel executes.
Note that this provides nearly an 8x speedup over the naive implementation, because of the 8x increase in parallelism.
See micro-kernel in matrix_kernel16_6.c
-
+ Cache Blocking (L1, L2, L3) to maximize data re-use
Rather than use three nested loops, additional loops are introduced for cache blocking, as in [1].
See micro-kernel in matrix_kernel16_6_block.c
-
+ Threads to utilize all cores
Multithreading using OpenMP is used to parallelize the loops which
tile/block the matrix C. There are 2 physical cores, and using threads give a speedup of 2x.
This includes a ~10% speedup from hyperthreading (4 threads as opposed to 2).
While virtual cores do not always help compute-bound kernels like GEMM, in this case
hyperthreading is likely being used to execute
prefetch instructions
as described in [1].
Also uses micro-kernel in matrix_kernel16_6_block.c
Various Matrix Sizes
With all these optimizations the hardware peak performance (128 GFLOPS) is nearly reached and so (not
surprisingly) the speed of OpenBLAS is matched:
Conclusion
Even though BLAS libraries are very complex (to implement all necessary functions,
to support all parameters e.g. transpose, to handle any matrix size, to target
multiple hardware architectures, etc.), the core performance is not difficult to obtain
following implementations such as [1]. We've seen that this can be done in
100 lines of C / Intel intrinsics, and no assembly.
Citations
-
Tyler M. Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond,
and Field G. Van Zee.
Anatomy of high-performance many-threaded matrix multiplication.
In Proceedings of International Parallel and Distributed Processing Symposium, 2014
Website by Stefan Hadjis