
EETE SEP 2014

PROGRAMMABLE LOGIC

Implementing FPGA design with the OpenCL standard

By Deshanand Singh

The initial era of programmable technologies contained two different extremes of programmability. One extreme was represented by single-core CPU and DSP units. These devices were programmable using software consisting of a list of instructions to be executed. Instructions were created in a manner that was conceptually sequential to the programmer, although an advanced processor could reorder instructions to extract instruction-level parallelism from these sequential programs at run time.

In contrast, the other extreme of programmable technology was represented by the FPGA. These devices are programmed by creating configurable hardware circuits, which execute completely in parallel. A designer using an FPGA is essentially creating a massively fine-grained parallel application. For many years, these extremes coexisted, with each type of programmability being applied to different application domains. However, recent trends in technology scaling have favored technologies that are both programmable and parallel.

Fig. 1: Recent trend of programmable and parallel technologies.

The second trend that the software-programmable devices relied on was the emergence of complex hardware that would extract instruction-level parallelism from sequential programs. A single-core architecture would input a stream of instructions and execute them on a device that might have many parallel functional units. A significant fraction of the processor hardware must be dedicated to extracting parallelism dynamically from the sequential code.

Additionally, hardware attempted to compensate for memory latencies. Generally, programmers create programs without consideration of the processor’s underlying memory hierarchy, as if there were only a large, flat, uniformly fast memory. In contrast, the processor must deal with the physical realities of high-latency and limited-bandwidth connections to external memory.
To keep functional units fed with data, the processor must also speculatively pre-fetch data from external memory into on-chip caches so that the data is much closer to where the computation is being performed. After many decades of performance improvements using these techniques, there have been greatly diminishing returns from these types of architectures.

Given the diminishing benefits of these two trends on conventional processor architectures, the spectrum of software-programmable devices is now evolving significantly, as shown in Fig. 1. The emphasis is shifting from automatically extracting instruction-level parallelism at run time to explicitly identifying thread-level parallelism at coding time. Highly parallel multicore devices are beginning to emerge, with a general trend of containing multiple simpler processors in which more of the transistors are dedicated to computation rather than to caching and extraction of parallelism. These devices range from multicore CPUs, which commonly have 2, 4, or 8 cores, to GPUs consisting of hundreds of simple cores optimized for data-parallel computation.

To achieve high performance on these multicore devices, programmers must explicitly code their applications in a parallel fashion. Each core must be assigned work in such a way that all cores can cooperate to execute a particular computation. This is also exactly what FPGA designers do to create their high-level system architectures.

Considering the need for creating parallel programs for the emerging multicore era, OpenCL (Open Computing Language) was created as a cross-platform parallel programming standard. The OpenCL standard inherently offers the ability to describe parallel algorithms to be implemented on FPGAs at a much higher level of abstraction than hardware description languages (HDLs) such as VHDL or Verilog.
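As a rough illustration of what "explicitly identifying thread-level parallelism at coding time" can mean, the sketch below writes a vector addition three ways in plain C: as a sequential loop, as the body of a single independent work-item, and as a slice assigned to one core. It is not taken from the article and is not OpenCL itself. All function names are hypothetical, and the launcher loop merely stands in for whatever runtime or hardware would execute the work-items concurrently.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential form: one thread of control visits every element in order. */
void vec_add_sequential(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* Data-parallel form: the work done by ONE work-item.
 * It reads and writes only index gid, so no work-item depends on another.
 * In OpenCL C this body would live in a __kernel function that obtains
 * gid from get_global_id(0); here it is an ordinary C function. */
void vec_add_work_item(const float *a, const float *b, float *out, size_t gid)
{
    out[gid] = a[gid] + b[gid];
}

/* Hypothetical stand-in for a runtime launching n work-items.
 * The loop runs them one after another, but because the work-items are
 * independent they could be executed in any order or all at once. */
void vec_add_launch(const float *a, const float *b, float *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t gid = 0; gid < n; ++gid)
        vec_add_work_item(a, b, out, gid);
}

/* Coarser-grained assignment: core number `core` (0-based) out of
 * `num_cores` handles one contiguous slice. Together the slices cover
 * every index in [0, n) exactly once. */
void vec_add_on_core(const float *a, const float *b, float *out, size_t n,
                     size_t core, size_t num_cores)
{
    assert(num_cores > 0 && core < num_cores);
    size_t begin = n * core / num_cores;
    size_t end   = n * (core + 1) / num_cores;
    for (size_t i = begin; i < end; ++i)
        out[i] = a[i] + b[i];
}
```

The point of the second and third forms is that the programmer, rather than the processor, states which pieces of work are independent. The same per-element description could in principle be mapped onto a few CPU cores, many GPU cores, or parallel hardware.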
Although many high-level synthesis tools exist for gaining this higher level of abstraction, they have all suffered from the same fundamental problem. These tools would attempt to take in a sequential C program and produce a parallel HDL implementation. The difficulty was not so much in the creation of an HDL implementation, but rather in the extraction of thread-level parallelism that would allow the FPGA implementation to achieve high performance. With FPGAs being on the furthest

Fig. 2: Example of OpenCL implementation on an FPGA.

Deshanand Singh is Supervising Principal Engineer for Software and IP Engineering at Altera Corporation – www.altera.com

28 Electronic Engineering Times Europe September 2014 www.electronics-eetimes.com

