
EETE OCTOBER 2012

DESIGN & PRODUCTS: DIGITAL SIGNAL PROCESSING

Efficient implementation of complex matrix inversion on the StarCore SC3900 DSP

By Avi Gal and Dmitry Lachover

Avi Gal is a DSP applications expert in the Wireless Infrastructure Design Department at Freescale Israel - www.freescale.com. Also working at Freescale Israel, Dmitry Lachover is DSP applications team leader and applications & communications expert in Wireless Infrastructure Design.

Electronic Engineering Times Europe, October 2012, www.electronics-eetimes.com

Complex matrix algebra is of great importance for a wide variety of applications. One of the most important application areas is wireless communications. Matrix calculations are used in 3GPP communications standards such as LTE, LTE-Advanced, WiMAX and many others. For example, the MIMO (multi-input multi-output) algorithm in the LTE receiver is based on a complex matrix inversion.

In this article we discuss the implementation of a 4x4 complex matrix inversion on the recently announced StarCore SC3900 flexible vector processor. We use the cofactor method and optimize the code to take advantage of the high parallelism and the instruction set of the SC3900 architecture, resulting in a highly efficient implementation. We discuss the implementation in detail, including code structure, optimizations and a comparison with the previous-generation StarCore SC3850 DSP.

The matrix inversion output is verified against a floating-point MATLAB model that was used as a reference. The SC3900 performs complex matrix inversion six times faster than the SC3850. 8x8 matrix inversion can be implemented efficiently using blockwise inversion, with a similar performance gain.

The SC3900 flexible vector processor discussed in this article is used in the B4860 and B4420 SoC devices, which also embed the MAPLE-B baseband acceleration processing engines for flexibility, integration and affordability in base station design. MAPLE-B combines a set of efficient high-speed processing elements for FEC (forward error correction), FFT/DFT and MIMO equalization, and it also allows compute-intensive functions such as 4x4 matrix inversion to be offloaded from the DSP core. This maximizes SoC performance and capability. MAPLE efficiently handles matrix inversion up to 4x4, and base station developers can choose to perform it either on MAPLE or on the DSP core. If 8x8 matrix operations are needed, they can be done efficiently on the SC3900 core.

The StarCore SC3900 architecture

Freescale's SC3900 is a high-performance flexible vector processor optimized for wireless infrastructure applications. It is used in the QorIQ Qonverge B4860 and B4420 multicore SoC products targeting wireless broadband equipment.

The SC3900 has four independent data multiplication units (DMUs), each of which contains eight fixed-point 16-bit multipliers. Together, the four DMUs can complete thirty-two 16-bit multiply-accumulates (MACs) per cycle, that is, up to 38.4 GMACs per second at 1.2 GHz.

Recently, the SC3900 StarCore DSP core earned the highest fixed-point BDTIsimMark2000 benchmark score ever recorded by the independent signal-processing technology analysis firm Berkeley Design Technology, Inc. (BDTI). At 1.2 GHz, the SC3900 core registered a BDTIsimMark2000 performance benchmark score of 37,460, a mark nearly 2x higher than competitive DSP offerings in the market. (The BDTIsimMark2000 provides a summary measure of digital signal processing performance. See www.BDTI.com for details.)

The eight-multiplier DMUs are new in the SC3900. The previous-generation SC3850 offered two multipliers per ALU. The hardware is complemented by a diverse set of multiply instructions, including complex 16x16 and 32x16 multiply instructions.

Complex 16x16 multiplication is performed using the mpycx.2x instruction, which computes the real and imaginary portions of the product. All inputs and outputs use 40-bit registers. The source operands are assumed to contain a packed complex number, where the high portion holds the real part (signed, fractional 16 bits) and the low portion holds the imaginary part (signed, fractional 16 bits).

The output width of the operation can be 16 bits, 20 bits or 40 bits, with or without saturation depending on the instruction. The mpycx.2x instruction stores its output in two 40-bit registers. Using the four DMUs, eight complex 16x16 multiplications can be performed in one cycle. Figure 1 illustrates the complex multiply instructions. C intrinsic functions are used to instruct the compiler to use a specific assembly language instruction.

Fig. 1: SC3900 complex multiplication. mpycx_2x calculates both the real and imaginary portions of the product.

The SC3900's rich instruction set can accumulate a complex number into two 20-bit values (see figure 2).
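The packed operand layout described above (real part in the high 16 bits, imaginary part in the low 16 bits) can be illustrated with a minimal portable C sketch. This is not the StarCore intrinsic: the names pack_cx, cx_re, cx_im, cx_mul and cx_wide are hypothetical, the two 64-bit fields merely stand in for the two 40-bit output registers, and the instruction's exact scaling, rounding and saturation behavior is not modeled. Products of two Q15 values are simply kept in Q30.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not the SC3900 intrinsics. */

/* Pack a complex Q15 value: real part in the high half, imaginary in the low half. */
static uint32_t pack_cx(int16_t re, int16_t im)
{
    return ((uint32_t)(uint16_t)re << 16) | (uint16_t)im;
}

/* Sign-extend the high 16 bits (real part). */
static int16_t cx_re(uint32_t p)
{
    int32_t v = (int32_t)(p >> 16);
    return (int16_t)(v >= 0x8000 ? v - 0x10000 : v);
}

/* Sign-extend the low 16 bits (imaginary part). */
static int16_t cx_im(uint32_t p)
{
    int32_t v = (int32_t)(p & 0xFFFFu);
    return (int16_t)(v >= 0x8000 ? v - 0x10000 : v);
}

/* Stand-in for the two wide output registers (assumption: Q30 results). */
typedef struct {
    int64_t re;
    int64_t im;
} cx_wide;

/* Complex multiply: four real 16x16 multiplies, one subtract, one add. */
static cx_wide cx_mul(uint32_t a, uint32_t b)
{
    int32_t ar = cx_re(a), ai = cx_im(a);
    int32_t br = cx_re(b), bi = cx_im(b);
    cx_wide r;
    r.re = (int64_t)ar * br - (int64_t)ai * bi;  /* real part */
    r.im = (int64_t)ar * bi + (int64_t)ai * br;  /* imaginary part */
    return r;
}
```

Each complex product needs four real multiplications, which is consistent with thirty-two multipliers delivering eight complex 16x16 products per cycle. As a usage example, cx_mul(pack_cx(16384, 16384), pack_cx(16384, -16384)) multiplies (0.5 + 0.5j) by (0.5 - 0.5j) and yields a real part of 536870912 (0.5 in Q30) and an imaginary part of 0.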
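For readers unfamiliar with the cofactor method named in the introduction, the following floating-point C sketch shows the underlying arithmetic: the inverse is the transposed cofactor matrix (the adjugate) divided by the determinant. It is only a reference-style illustration in the spirit of a floating-point model, not the optimized fixed-point SC3900 code; the names cplx, minor3 and inv4_cofactor are hypothetical.

```c
#include <complex.h>

typedef double complex cplx;

/* Determinant of the 3x3 minor of a, obtained by deleting row r and column c. */
static cplx minor3(cplx a[4][4], int r, int c)
{
    cplx m[3][3];
    int mi = 0;
    for (int i = 0; i < 4; i++) {
        if (i == r) continue;
        int mj = 0;
        for (int j = 0; j < 4; j++) {
            if (j == c) continue;
            m[mi][mj++] = a[i][j];
        }
        mi++;
    }
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

/* Invert a 4x4 complex matrix by cofactors.
 * Returns 1 on success, 0 if the determinant is exactly zero. */
static int inv4_cofactor(cplx a[4][4], cplx inv[4][4])
{
    cplx cof[4][4];
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            cof[i][j] = (((i + j) & 1) ? -1.0 : 1.0) * minor3(a, i, j);

    /* Laplace expansion of the determinant along row 0. */
    cplx det = 0.0;
    for (int j = 0; j < 4; j++)
        det += a[0][j] * cof[0][j];
    if (det == 0.0)
        return 0;

    /* The adjugate is the transpose of the cofactor matrix. */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            inv[i][j] = cof[j][i] / det;
    return 1;
}
```

The method needs sixteen 3x3 determinants and a single division by the determinant, so most of the work is independent multiply-accumulate arithmetic, which is the kind of workload a wide multiplier array is suited to.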

