State-of-the-Art FPGA Design Optimizations

In my previous post, I talked about the general techniques used to improve the computation and energy efficiency of neural networks on the available hardware. In this post we will cover a breed of hardware called FPGAs, which has been gaining attention in the years since the dawn of the neural net era.

The first question that comes to mind is: why FPGAs?

A typical CPU can perform 10–100 GFLOP/s, while a GPU offers up to around 10 TFLOP/s of peak performance. CPUs are therefore out of the question, as they are badly beaten by GPUs. But GPUs still have a drawback in terms of energy efficiency, and they remain too general-purpose to be optimized for a specific task.
We all know that ASICs are chips built for specific tasks, but the problem with these silicon brains is that once the chip is designed and burnt, it cannot be redesigned again. Here come FPGAs: reprogrammable ASICs, as I like to call them.
With this reprogrammability comes the power to design the chip with highly parallel strategies in mind (which neural nets badly need) and to bring the computation throughput and efficiency close to, or beyond, that of GPUs.

Current problems with FPGAs

The points mentioned above are the reason why there is research going on around this breed of silicon: to bring its performance on par with CPUs and GPUs, and ideally to surpass them. The sections below discuss hardware optimization techniques for FPGAs.
Current state-of-the-art neural network accelerator designs are estimated to deliver at least 10x better energy efficiency than current GPUs.

Convolution and FC layers

Overview of an FPGA-based Accelerator

[Figure: High-level FPGA-based accelerator design]

Design Methodology and Criteria

Speed

The theoretical throughput of a system, measured in inferences per second (IPS), is

IPS = OPS_act / W = (η × OPS_peak) / W = (η × f × P) / W

where OPS_act -> the number of operations performed per second at run-time by the accelerator.
W -> the total theoretical workload of the network, measured in operations per inference.
OPS_peak = f × P -> the maximum number of operations that can be processed per second.
η -> utilization ratio of the computation units, measured as the average fraction of computation units that are busy during each inference.
f -> working frequency of the computation units.
P -> number of computation units.
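
As a quick numerical illustration, here is a minimal C sketch of the throughput formula. All of the accelerator parameters below are hypothetical, not figures from the referenced survey:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical accelerator parameters (illustrative only). */
    double P   = 1024;     /* number of computation units (e.g. multipliers) */
    double f   = 200e6;    /* working frequency: 200 MHz                     */
    double eta = 0.75;     /* average utilization of the computation units   */
    double W   = 2e9;      /* workload per inference, in operations          */

    double ops_peak = f * P;           /* peak operations per second    */
    double ops_act  = eta * ops_peak;  /* operations actually sustained */
    double ips      = ops_act / W;     /* inferences per second         */

    printf("OPS_peak = %.2e OP/s, OPS_act = %.2e OP/s, IPS = %.1f\n",
           ops_peak, ops_act, ips);
    return 0;
}
```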

Latency

The latency of processing an inference is

L = C / IPS

where L -> latency of processing one inference.
C -> concurrency of the accelerator, measured by the number of inferences processed in parallel.
IPS -> throughput of the system, measured by the number of inferences processed each second.
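
A similarly hypothetical sketch of the latency relation, reusing the made-up throughput number from above:

```c
#include <stdio.h>

int main(void) {
    double C   = 4.0;    /* hypothetical: 4 inferences processed in parallel */
    double IPS = 76.8;   /* hypothetical throughput, inferences per second   */

    double L = C / IPS;  /* latency of processing one inference, in seconds  */
    printf("Latency L = %.1f ms\n", L * 1e3);
    return 0;
}
```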

Energy Efficiency

The energy efficiency of the system is

Eff = W / E_total

where Eff -> the energy efficiency of the system, measured by the number of operations that can be processed within a unit of energy.
W -> the workload for each inference, measured by the number of operations in the network (mainly additions and multiplications for a neural network).
E_total -> the total energy per inference, which comprises the SRAM access energy, the DRAM access energy, and the static energy.

We separate the memory access energy into a DRAM part and an SRAM part. The number of memory accesses, N_x_acc, can be reduced by quantization, sparsification, an efficient on-chip memory system, and a good scheduling method. These methods therefore help reduce the dynamic memory energy.
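
Below is a minimal sketch of this energy breakdown with made-up per-access energies and access counts, purely to show how the terms combine into Eff = W / E_total:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical energy costs and access counts (illustrative only). */
    double E_sram_acc = 20e-12;  /* energy per SRAM access (J)      */
    double E_dram_acc = 2e-9;    /* energy per DRAM access (J)      */
    double E_static   = 50e-3;   /* static energy per inference (J) */

    double W          = 2e9;     /* operations per inference        */
    double N_sram_acc = 5e8;     /* SRAM accesses per inference     */
    double N_dram_acc = 1e7;     /* DRAM accesses per inference     */

    /* E_total = SRAM access energy + DRAM access energy + static energy. */
    double E_total = N_sram_acc * E_sram_acc
                   + N_dram_acc * E_dram_acc
                   + E_static;

    double Eff = W / E_total;    /* operations per joule */
    printf("E_total = %.3f J, Eff = %.2e OP/J\n", E_total, Eff);
    return 0;
}
```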

In this article we will focus on hardware design optimizations rather than on the neural-network-side optimizations, which involve data quantization and weight reduction techniques such as sparsification, weight pruning, and weight clustering.

Hardware Design: Efficient Architecture

Computation Unit Designs

A smaller computation unit means that more computation units can be embedded on the chip, which means higher peak performance. A carefully designed computation unit array can also increase the working frequency of the system and thus improve the peak performance.

Loop Unrolling Strategies

How we loop over the conv and FC layers for the convolution and multiply-accumulate operations is also a considerable research question. An inefficient looping strategy can take up a lot of time and bring down the processing efficiency of the system.

The traditionally used loop structure is a deeply nested loop: for each of the N filters, for each filter channel, for each position in the output feature map, and for each kernel element, we produce the output feature map by multiplying values of the input feature map with the corresponding kernel values and accumulating the results. A per-channel bias is then added to each output map.
You can go through the looping strategy's pseudo code below for a better understanding.
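
Here is a minimal C sketch of that classic nested loop; the array layouts and dimension names are my own choices, not taken from the original pseudocode:

```c
/* Naive nested-loop convolution for one layer (illustrative sketch).
 * N: output channels (filters), C: input channels,
 * H, W: output feature map height/width, K: kernel size, S: stride.
 * in  has shape C x inH x inW (any padding handled by the caller),
 * wgt has shape N x C x K x K, bias has N entries, out has shape N x H x W. */
void conv_layer(int N, int C, int H, int W, int K, int S,
                const float *in, const float *wgt,
                const float *bias, float *out)
{
    int inH = (H - 1) * S + K;   /* input plane height */
    int inW = (W - 1) * S + K;   /* input plane width  */

    for (int n = 0; n < N; n++)                 /* each output channel */
        for (int y = 0; y < H; y++)             /* each output row     */
            for (int x = 0; x < W; x++) {       /* each output column  */
                float acc = bias[n];            /* per-channel bias    */
                for (int c = 0; c < C; c++)     /* each input channel  */
                    for (int ky = 0; ky < K; ky++)      /* kernel rows    */
                        for (int kx = 0; kx < K; kx++)  /* kernel columns */
                            acc += in[(c * inH + y * S + ky) * inW + (x * S + kx)]
                                 * wgt[((n * C + c) * K + ky) * K + kx];
                out[(n * H + y) * W + x] = acc;
            }
}
```

On an FPGA, deciding which of these loops to unroll in hardware, whether over output channels, input channels, kernel positions, or output pixels, is exactly the loop unrolling strategy discussed here.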

System Design

[Figure: Block graph of a typical FPGA-based neural network accelerator system]
[Figure: An example of the roofline model. The shaded part denotes the valid design space given the bandwidth and resource limitations.]

The roofline model puts the computation-to-communication (CTC) ratio on the x-axis and hardware performance on the y-axis. CTC is the number of operations that can be executed per unit of memory access. Each hardware design can be treated as a point in the figure, so y/x equals the bandwidth requirement of the design.
The actual bandwidth roof is below the theoretical roof because the achievable bandwidth of DDR depends on the data access pattern: sequential DDR access achieves much higher bandwidth than random access. The other roof is the computation roof, which is limited by the available resources on the FPGA.
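
To make the bound concrete, here is a small sketch of how a design point's attainable performance is capped by the two roofs. The roof values and CTC ratio below are hypothetical:

```c
#include <stdio.h>

static double min_d(double a, double b) { return a < b ? a : b; }

int main(void) {
    /* Hypothetical roofs for a given FPGA and board. */
    double comp_roof      = 200e9;  /* computation roof: 200 GOP/s       */
    double bandwidth_roof = 10e9;   /* achievable DDR bandwidth: 10 GB/s */

    /* A candidate design point: its CTC ratio in operations per byte. */
    double ctc = 15.0;

    /* Performance is capped by the lower of the two roofs. */
    double attainable = min_d(comp_roof, ctc * bandwidth_roof);

    printf("Attainable performance: %.1f GOP/s (%s-bound)\n",
           attainable / 1e9,
           attainable < comp_roof ? "bandwidth" : "computation");
    return 0;
}
```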

Loop tiling and unrolling

Loop unrolling strategies increase parallelism while reducing wasted computation for a given network. Once the loop unrolling strategy is decided, the scheduling of the remaining loops decides how the hardware can reuse data with the on-chip buffer. This involves a loop tiling and loop interchange strategy.
Loop tiling is a higher level of loop unrolling: all the input data of a loop tile is stored on-chip, and the loop unrolling hardware kernel works on that data. A larger loop tile size means that each piece of data is loaded from external memory to on-chip memory fewer times. The loop interchange strategy decides the processing order of the loop tiles.
The data arrangement in on-chip buffers is controlled through instructions to fit different feature map sizes. This means the hardware can always fully utilize the on-chip buffer and use the largest tiling size allowed by the on-chip buffer size. This work also proposes a "back and forth" loop execution order to avoid a total on-chip data refresh when an innermost loop finishes. A sketch of a tiled loop nest is given below.
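
As an illustration (not the exact scheme from the paper), here is a C sketch of tiling the output feature map so that each tile's data is loaded into on-chip buffers once and reused by the unrolled kernel. load_tile, compute_tile, and store_tile are hypothetical placeholders for the accelerator's DMA and compute engines:

```c
/* Tiled convolution sketch: the H x W output feature map is processed in
 * TH x TW tiles so that one tile's inputs, weights, and outputs fit in
 * on-chip (BRAM) buffers. */
#define TH 16   /* tile height */
#define TW 16   /* tile width  */

/* Stand-in: DMA the input window and weights for this tile into on-chip buffers. */
static void load_tile(int n, int h0, int h1, int w0, int w1)
{ (void)n; (void)h0; (void)h1; (void)w0; (void)w1; }
/* Stand-in: run the unrolled multiply-accumulate kernel on on-chip data only. */
static void compute_tile(int n, int h0, int h1, int w0, int w1)
{ (void)n; (void)h0; (void)h1; (void)w0; (void)w1; }
/* Stand-in: write the finished output tile back to external DRAM. */
static void store_tile(int n, int h0, int h1, int w0, int w1)
{ (void)n; (void)h0; (void)h1; (void)w0; (void)w1; }

void conv_layer_tiled(int N, int H, int W)
{
    for (int n = 0; n < N; n++)                   /* output channels */
        for (int th = 0; th < H; th += TH)        /* tile rows       */
            for (int tw = 0; tw < W; tw += TW) {  /* tile columns    */
                int h_end = (th + TH < H) ? th + TH : H;
                int w_end = (tw + TW < W) ? tw + TW : W;

                load_tile(n, th, h_end, tw, w_end);     /* DRAM -> on-chip    */
                compute_tile(n, th, h_end, tw, w_end);  /* reuse on-chip data */
                store_tile(n, th, h_end, tw, w_end);    /* on-chip -> DRAM    */
            }
}
```

The loop interchange strategy mentioned above corresponds to the order of the n, th, and tw loops in this sketch, which determines how often each tile's data has to be reloaded.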

This article is still just the tip of the iceberg: an overview of the state-of-the-art techniques used in this space to get better computation and energy efficiency out of FPGAs.
AI on the edge is definitely the next big thing needed to move away from the requirement of computing in the cloud, and that is where this field of research holds the most value.

Till next time, keep reading and keep growing :)

Reference: https://arxiv.org/pdf/1712.08934.pdf

Homo Bayesian