Stefan Hadjis
Research Software

See here for publications.

The Spatial Multiverse

The Spatial Multiverse is an ongoing research project built around the Spatial Language and Compiler. It provides Spatial with front-end support for popular machine learning (ML) frameworks. The Spatial Multiverse contains (1) Spatial as a submodule, (2) tools to compile ML frameworks to Spatial in order to target hardware accelerators, and (3) examples of models and applications in these frameworks. This allows ML frameworks to be mapped to programmable hardware. In its initial version, the Spatial Multiverse supports mapping DNN models described in TensorFlow to FPGAs.

More information
Spatial: A Language and Compiler for Application Accelerators

I'm currently contributing to Spatial, a language which allows programmers to design efficient accelerators from a high level of abstraction. Algorithmic breakthroughs in data analytics and the availability of enormous amounts of data have created a need for accelerators in data centers that provide low latency and high throughput (e.g. the TPU or Brainwave). At the same time, robotics and the internet of things have created a need for energy-efficient embedded hardware. Spatial can target CGRAs and FPGAs, including the newly announced F1 FPGAs on the Amazon Elastic Compute Cloud.

More information
Omnivore: A Distributed Deep Learning Training Optimizer

Omnivore is a deep learning training optimizer which optimizes both (1) hardware performance and (2) the number of iterations to convergence. One discovery was that while practitioners often tune the learning rate during training, they fix momentum at 0.9; tuning both jointly greatly reduces time to convergence. Another was that it is often best to group machines such that gradients are computed synchronously within a group, but asynchronously across groups. Along with efficient hardware mapping and aggressive pruning of the hyper-parameter space, this leads to significant speedups over state-of-the-art tools.
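The momentum update whose hyper-parameters are tuned can be sketched as follows. This is a generic illustration of classical (heavy-ball) momentum SGD, not Omnivore's implementation; the function name and parameter names are my own.

```c
/* Classical momentum SGD update for a single parameter:
 *     v <- mu * v - lr * grad
 *     w <- w + v
 * Fixing mu = 0.9 and tuning only lr explores a slice of this
 * two-dimensional (lr, mu) space; tuning both jointly is the idea
 * described above.
 */
void sgd_momentum_step(double *w, double *v,
                       double grad, double lr, double mu)
{
    *v = mu * (*v) - lr * grad;  /* velocity accumulates past gradients */
    *w += *v;                    /* parameter moves along the velocity  */
}
```

For example, repeatedly applying this step to the toy objective f(w) = 0.5 * w^2 (so grad = w) drives w toward zero for suitable (lr, mu) pairs, and how quickly it does so depends on both values together.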

More information
Maximizing CPU Efficiency in Deep Learning

Researchers often use GPUs for Deep Learning training and report larger GPU-over-CPU speedups than is actually the case. Often this is because Deep Learning frameworks use sub-optimal CPU implementations, which achieve a smaller fraction of peak device throughput on the CPU than on the GPU. In this work we optimize DNN training for CPUs and, when targeting CPUs, show order-of-magnitude speedups compared to state-of-the-art frameworks. We also show that, with these optimizations, CPUs and GPUs can be used together to accelerate training. (paper, paper, slides).

More information
Making FPGA programming easier for software developers

I contributed to the LegUp high-level synthesis tool, which allows FPGAs to be programmed from a software specification. FPGA hardware can provide orders of magnitude better performance and energy-efficiency compared to software, but hardware is much more difficult to design than software. LegUp automatically compiles C software into FPGA hardware, making FPGA programming fast and accessible for everyone.

More information

Software Tutorials

BLAS-level CPU Performance in 100 Lines of C

General dense matrix multiplication, or GEMM, is at the heart of many applications including Deep Learning. A common misconception is that BLAS code must be extremely complicated, but in this tutorial I show that you can reach BLAS-level speed using only 100 lines of C (and no assembly programming).
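The flavor of optimization such a tutorial builds on can be sketched as follows; this is a minimal illustration, not the tutorial's actual code. It shows a naive triple loop next to a cache-blocked variant using the (i, k, j) loop order, which keeps accesses to B sequential in memory; the tile size of 32 is an arbitrary illustrative choice.

```c
#include <stddef.h>

/* Naive reference: C = A * B, all matrices n x n, row-major. */
void gemm_naive(size_t n, const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}

/* Cache-blocked variant: tiling keeps working sets in cache, and the
 * (i, k, j) inner-loop order streams through B and C row-wise. */
#define TILE 32
void gemm_blocked(size_t n, const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < n*n; i++)
        C[i] = 0.0f;
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t k = kk; k < kk + TILE && k < n; k++) {
                        float a = A[i*n + k];  /* reused across the j loop */
                        for (size_t j = jj; j < jj + TILE && j < n; j++)
                            C[i*n + j] += a * B[k*n + j];
                    }
}
```

Both functions compute the same result; the blocked version is the starting point for further steps like vectorization and register tiling that close the gap to BLAS.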

More information

Website by Stefan Hadjis