Stefan Hadjis

Research Software

For publications, see here

Spatial: A Language and Compiler for Application Accelerators

I'm currently contributing to Spatial, a language that lets programmers design efficient accelerators from a high level of abstraction. Algorithmic breakthroughs in data analytics and the availability of enormous amounts of data have created a need for accelerators in data centers that provide low latency and high throughput (e.g. the TPU or Brainwave). At the same time, robotics and the internet-of-things have created a need for energy-efficient embedded hardware. Spatial can target CGRAs and FPGAs, including the newly announced F1 FPGAs on the Amazon Elastic Compute Cloud.

More information

Omnivore: A Distributed Deep Learning Training Optimizer

A deep learning training optimizer that optimizes both (1) hardware performance and (2) the number of iterations to convergence. One discovery was that while people often tune the learning rate during training, they fix momentum to 0.9, yet tuning both greatly reduces time to convergence. Another was that it is often best to group machines so that gradients are computed synchronously within a group, while different groups compute gradients asynchronously. Along with efficient hardware mapping and aggressive pruning of the hyper-parameter space, this leads to significant speedups over state-of-the-art tools.
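To make the learning-rate and momentum point concrete, here is a toy sketch in C. It is not Omnivore's code: it runs plain (non-stochastic) heavy-ball gradient descent on a small, ill-conditioned quadratic that I made up for illustration, and does a brute-force grid search over both hyper-parameters. The names `iters_to_converge`, `grid_search`, and `Setting`, the quadratic, and the tolerance are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without pulling in math.h. */
static double abs_d(double x) { return x < 0 ? -x : x; }

/*
 * Run heavy-ball gradient descent on the toy objective
 *   f(x) = 0.5 * (1 * x0^2 + 100 * x1^2)
 * starting from (1, 1), and return the number of updates needed before
 * both coordinates are below 1e-6 in magnitude. Returns max_iters if the
 * run does not converge in time.
 */
int iters_to_converge(double lr, double momentum, int max_iters)
{
    assert(max_iters > 0);
    const double h[2] = {1.0, 100.0};   /* curvature per coordinate (assumed) */
    double x[2] = {1.0, 1.0};           /* parameters */
    double v[2] = {0.0, 0.0};           /* velocity */

    for (int t = 0; t < max_iters; t++) {
        if (abs_d(x[0]) < 1e-6 && abs_d(x[1]) < 1e-6)
            return t;
        for (int d = 0; d < 2; d++) {
            double grad = h[d] * x[d];
            v[d] = momentum * v[d] - lr * grad;
            x[d] += v[d];
        }
    }
    return max_iters;
}

typedef struct {
    double lr;
    double momentum;
    int iters;
} Setting;

/* Try every (lr, momentum) pair and keep the one that converges fastest. */
Setting grid_search(const double *lrs, size_t n_lr,
                    const double *moms, size_t n_mom, int max_iters)
{
    assert(n_lr > 0 && n_mom > 0);
    Setting best = {lrs[0], moms[0], max_iters + 1};
    for (size_t i = 0; i < n_lr; i++)
        for (size_t j = 0; j < n_mom; j++) {
            int it = iters_to_converge(lrs[i], moms[j], max_iters);
            if (it < best.iters) {
                best.lr = lrs[i];
                best.momentum = moms[j];
                best.iters = it;
            }
        }
    return best;
}
```

Calling `grid_search` with a handful of candidate learning rates and momenta returns the fastest pair found on this toy problem. A real training run would replace `iters_to_converge` with an actual (and far more expensive) training loop, which is why pruning the search space matters.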

More information

Maximizing CPU Efficiency in Deep Learning

Researchers often use GPUs for Deep Learning training and claim that GPUs outperform CPUs by a wider margin than they actually do. Often this is because Deep Learning frameworks use sub-optimal CPU implementations; when targeting CPUs, we show order-of-magnitude speedups compared to state-of-the-art frameworks (paper, paper, slides).

More information

Making FPGA programming easier for software developers

I contributed to the LegUp high-level synthesis tool, which allows FPGAs to be programmed from a software specification. FPGA hardware can provide orders of magnitude better performance and energy efficiency than software, but hardware is much more difficult to design than software. LegUp automatically compiles C software into FPGA hardware, making FPGA programming fast and accessible to everyone.

More information

Software Tutorials

BLAS-level CPU Performance in 100 Lines of C

General Dense Matrix Multiplication, or GEMM, is at the heart of many applications, including Deep Learning. A common misconception is that BLAS code is really complicated, but in this tutorial I show that you can get BLAS-level speed using only 100 lines of C (and no assembly programming).
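As a rough illustration of the kind of C involved, here is a minimal sketch of a cache-blocked GEMM next to the reference triple loop. This is not the tutorial's code and will not reach BLAS speed on its own; it shows only one common ingredient (tiling for cache reuse). The function names, the row-major layout, and the tile size `BLOCK` are assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: C (m x n) += A (m x k) * B (k x n), all row-major. */
void gemm_naive(size_t m, size_t n, size_t k,
                const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = C[i * n + j];
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}

/* Hypothetical tile size; a real implementation would tune this per cache. */
enum { BLOCK = 64 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/*
 * Blocked version: same result as gemm_naive, but the loops walk over
 * BLOCK x BLOCK tiles so that a tile of B stays in cache while it is reused.
 * The innermost loop runs along rows of B and C, so memory access is
 * contiguous and easy for the compiler to vectorize.
 */
void gemm_blocked(size_t m, size_t n, size_t k,
                  const float *A, const float *B, float *C)
{
    assert(A && B && C);
    for (size_t i0 = 0; i0 < m; i0 += BLOCK)
        for (size_t p0 = 0; p0 < k; p0 += BLOCK)
            for (size_t j0 = 0; j0 < n; j0 += BLOCK) {
                size_t i1 = min_sz(i0 + BLOCK, m);
                size_t p1 = min_sz(p0 + BLOCK, k);
                size_t j1 = min_sz(j0 + BLOCK, n);
                for (size_t i = i0; i < i1; i++)
                    for (size_t p = p0; p < p1; p++) {
                        float a = A[i * k + p];
                        for (size_t j = j0; j < j1; j++)
                            C[i * n + j] += a * B[p * n + j];
                    }
            }
}
```

Both functions accumulate into `C`, so zero it first to compute a plain product. Closing the remaining gap to a tuned BLAS typically also requires register blocking, data packing, and vectorization.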

More information

Website by Stefan Hadjis