The accelerator consists of two core components: a systolic array and a SIMD array. Data is supplied to the two engines through the input buffer (IBUF), output buffer (OBUF), instruction memory (IMEM), weight buffer (WBUF), bias buffer (BBUF), and vector memory (VMEM). These interfaces contain programmable data access modules and controller FSMs that together issue the addresses and requests needed to load or store a tile of data from/to off-chip memory. The data access modules generate strided access patterns that read data from off-chip memory into the on-chip buffers and write results back out. The interfaces also include tag logic that manages double-buffered data transfer to hide the latencies of load/store operations and to facilitate prefetching. Among these interfaces, those for the OBUF and the SIMD array handle both load and store operations, while the others handle only loads. All of these interfaces are fully programmable through the instruction set architecture (ISA) of the GeneSys accelerator. The overall system view of the GeneSys DNN accelerator is shown below.
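To make the strided access pattern concrete, here is a minimal sketch of the kind of address sequence a data access module might generate when loading a tile from off-chip memory. The function and parameter names (`base`, `row_stride`, and so on) are illustrative assumptions, not part of the GeneSys ISA.

```python
def tile_addresses(base, rows, cols, row_stride, elem_size=4):
    """Yield the off-chip byte address of each element of a 2-D tile.

    Hypothetical model of a strided access pattern: `rows` x `cols`
    elements are read starting at `base`, with consecutive tile rows
    separated by `row_stride` bytes in the off-chip layout.
    """
    for r in range(rows):
        for c in range(cols):
            yield base + r * row_stride + c * elem_size

# Example: a 2x4 tile of 4-byte words cut out of a row-major matrix
# whose full rows are 64 bytes wide.
addrs = list(tile_addresses(base=0x1000, rows=2, cols=4, row_stride=64))
```

Each tile row is contiguous, while the jump between rows follows the matrix's row pitch rather than the tile width; this is what distinguishes a strided tile load from a flat sequential burst.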
Our systolic array configuration uses a unified activation buffer that feeds input data into the array, while each PE is equipped with its own weight buffer bank. The output generated by the systolic array can be directed in two ways: it can be stored directly to DRAM, or it can be read by the SIMD array, which allows additional operations, such as activation functions or pooling, to be performed on the output data. A diagram of our systolic array is shown below.
Instruction Set Architecture
GeneSys is a programmable accelerator whose instruction set is broad enough to accelerate full DNNs. Directly supported operations include GEMM, convolution, non-linear activation functions, and pooling. To download a spreadsheet describing the GeneSys ISA, click here.
The compiler for GeneSys starts with an ONNX description of an ML algorithm and generates an intermediate representation (IR) in the form of a simultaneous-recursive dataflow graph (sr-DFG). The sr-DFG is recursively defined, giving the compiler simultaneous access to all levels of operation granularity so it can flexibly target different architectures. Once generated from an ONNX file, the sr-DFG is transformed into a series of parameterized operation kernels. These kernels are then transformed and optimized by applying an architecture description to the compilation process, which constrains kernel parameters based on hardware attributes such as memory size, bandwidth, and operation capabilities. Once the kernel parameters are evaluated, each statement in the kernel is used to generate sequences of instructions defined by the architecture abstraction. The compilation flow is shown below.
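One concrete instance of "constraining kernel parameters based on hardware attributes" is choosing a tile size that fits in an on-chip buffer. The sketch below is a hypothetical illustration of that step, not the actual GeneSys compiler API; the buffer size and halving policy are assumptions.

```python
def constrain_tile(rows, cols, elem_bytes, buf_bytes):
    """Shrink a kernel's tile row count until the tile fits in an
    on-chip buffer of `buf_bytes` bytes (illustrative policy: halve
    the row dimension until the tile's footprint fits)."""
    while rows > 1 and rows * cols * elem_bytes > buf_bytes:
        rows //= 2
    return rows, cols

# Example: a 256x256 fp32 tile (256 KiB) constrained to a 64 KiB buffer
# is cut down to 64x256 (exactly 64 KiB).
tile = constrain_tile(256, 256, 4, 64 * 1024)
```

The real compiler evaluates many such constraints together (memory size, bandwidth, supported operations) before lowering each kernel statement to ISA instructions, but each constraint reduces to a check of this shape against an attribute from the architecture description.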