
EETE SEP 2014

PROGRAMMABLE LOGIC

Pixel local storage on ARM Mali GPUs

By Jan-Harald Fridreksen

For many computer graphics algorithms, operations on a pixel can be performed independently of similar operations on other pixels. This is one reason why computer graphics algorithms are often “embarrassingly parallel”. Blending is a basic example: for each pixel, the previous color of the pixel is combined with an incoming color. But there are also more complex examples, such as deferred shading. Such algorithms need to store multiple values per pixel location, which are then combined in an application-specific way to produce the final pixel value.

On today’s graphics APIs, these algorithms are typically implemented with a multi-pass approach. Pixel values are first written to a set of off-screen render targets and, in a second pass, these render targets are read as textures and used to compute the final pixel value that is written to the framebuffer. Given how important power efficiency is for mobile GPUs, such an approach, which moves the intermediate values through off-chip memory, is far from ideal.

ARM Mali GPUs are based on a tile-based architecture. In short, we split the framebuffer into small regions, called tiles. During geometry processing, all triangles are assigned to the tiles that they cover. Subsequently, per-pixel processing is performed on one tile at a time. Per-pixel values are stored on-chip until the fragment shading for all pixels in the tile is complete. Only at that point is the final pixel value written back to the framebuffer.

ARM has recently published a set of OpenGL® ES extensions that give applications access to this on-chip storage. We will explain what these are and how they enable more bandwidth-efficient processing.

ARM_shader_framebuffer_fetch is the first of these extensions. It enables applications to read the current framebuffer color from the fragment shader. An obvious use-case for this is programmable blending.
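To illustrate what programmable blending means, here is a minimal CPU-side sketch in Python. It is not GPU code and does not use the extension itself. It only models a fragment shader that receives the color currently in the framebuffer and decides how to combine it with the incoming color. The function names and the choice of an "overlay" blend (which branches on the destination color) are assumptions made for this example.

```python
# CPU-side model of programmable blending; all names here are hypothetical.

def overlay_channel(dst, src):
    """Overlay blend for one channel: the formula depends on the destination value."""
    if dst < 0.5:
        return 2.0 * dst * src
    return 1.0 - 2.0 * (1.0 - dst) * (1.0 - src)


def fragment_shader(last_frag_color, incoming_color):
    """Stand-in for a fragment shader that can read the current framebuffer color."""
    return tuple(overlay_channel(d, s)
                 for d, s in zip(last_frag_color, incoming_color))


def draw(tile, fragments):
    """Apply fragments in order; each one sees the result of the previous ones."""
    for (x, y), color in fragments:
        tile[y][x] = fragment_shader(tile[y][x], color)


if __name__ == "__main__":
    # A 2x2 tile cleared to a single color, then one white fragment at (0, 0).
    tile = [[(0.25, 0.5, 0.75) for _ in range(2)] for _ in range(2)]
    draw(tile, [((0, 0), (1.0, 1.0, 1.0))])
    print(tile[0][0])
```

The point of the sketch is that the blend rule is ordinary shader code operating on the previous pixel value, so the application is not restricted to a fixed menu of blend equations.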
A companion extension, ARM_shader_framebuffer_fetch_depth_stencil, enables applications to read the current depth and stencil values from the framebuffer. This enables use-cases such as programmable depth and stencil testing, modulating shadows, soft particles, and creating variance shadow maps in a single render pass.

The above extensions enable applications to read what is in the framebuffer, but they are limited to storing one value per pixel, and the format of the stored values must match the format of the framebuffer. The EXT_shader_pixel_local_storage extension lifts these restrictions by enabling applications to store and retrieve arbitrary values at a given pixel location. This is a powerful principle that enables algorithms such as deferred shading to be implemented without incurring a large bandwidth cost.

A typical implementation of deferred shading using this extension would split the rendering into three phases. The first phase would write properties such as the diffuse color and normal of each pixel to the pixel local storage. The second phase would calculate the lighting for each pixel based on the stored properties, and accumulate the result back into the pixel local storage for each light source. Finally, a third phase would use the values in pixel local storage to calculate the final value of the pixel. At this point, the pixel local storage is no longer needed and is discarded.

The key point here is that the pixel local storage data is never written to memory. It is kept on-chip throughout and incurs no bandwidth cost. This is a significant improvement over existing solutions, which would require this data to be stored off-chip between the phases. In demo applications, we have observed framebuffer-related bandwidth drop from 40 MB to 8 MB per frame.

And it’s not just about efficiency. These extensions allow applications to express algorithms more directly than alternative approaches do.
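The three deferred-shading phases described above can be modeled on the CPU to show where data lives at each step. The Python sketch below is only an illustration under stated assumptions: the dictionary standing in for pixel local storage, the simple diffuse lighting model, the function names, and the byte counts in the traffic estimate are all hypothetical, and the estimate is not intended to reproduce the 40 MB and 8 MB figures quoted in the text.

```python
# CPU-side model of deferred shading with per-pixel local storage.
# All names, the lighting model, and the byte counts are hypothetical.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def shade_tile(fragments, lights):
    """Shade one tile.

    fragments: {(x, y): (diffuse_rgb, unit_normal)}
    lights:    [(unit_direction_towards_light, intensity), ...]
    Returns {(x, y): final_rgb}; only this result would leave the chip.
    """
    # Phase 1: write surface properties into the per-pixel storage.
    pixel_local = {}
    for pos, (diffuse, normal) in fragments.items():
        pixel_local[pos] = {"diffuse": diffuse, "normal": normal, "light": 0.0}

    # Phase 2: accumulate lighting per light source, still in per-pixel storage.
    for direction, intensity in lights:
        for px in pixel_local.values():
            px["light"] += intensity * max(0.0, dot(px["normal"], direction))

    # Phase 3: resolve to the final color; pixel_local is discarded on return.
    final = {}
    for pos, px in pixel_local.items():
        final[pos] = tuple(min(1.0, c * px["light"]) for c in px["diffuse"])
    return final


def framebuffer_traffic_bytes(width, height, stored_bytes_per_pixel,
                              final_bytes_per_pixel, keep_on_chip):
    """Rough per-frame external memory traffic for the intermediate data."""
    pixels = width * height
    final_write = pixels * final_bytes_per_pixel
    if keep_on_chip:
        return final_write
    # Multi-pass model: intermediates are written once, then read back once.
    return final_write + 2 * pixels * stored_bytes_per_pixel


if __name__ == "__main__":
    frags = {(0, 0): ((1.0, 0.5, 0.0), (0.0, 0.0, 1.0))}
    lights = [((0.0, 0.0, 1.0), 0.5), ((0.0, 0.0, -1.0), 1.0)]
    print(shade_tile(frags, lights))
    print(framebuffer_traffic_bytes(1280, 720, 16, 4, keep_on_chip=False))
    print(framebuffer_traffic_bytes(1280, 720, 16, 4, keep_on_chip=True))
```

With the assumed 1280x720 target, 16 bytes of intermediate data and 4 bytes of final color per pixel, this simple model gives roughly 33.2 MB per frame for the multi-pass case against roughly 3.7 MB when the intermediates stay on-chip. Real savings depend on the formats, resolution, and number of passes an application actually uses.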
The extensions achieve this directness by making it clear when framebuffer values are kept on-chip and when they are written back to memory, and by providing flexible access to this on-chip memory.

ARM is very excited about the possibilities opened up by these extensions, as they pave the way for more complex algorithms to be implemented efficiently on mobile GPUs. ARM supports all these extensions on the Mali-T6xx and Mali-T7xx series.

Jan-Harald Fridreksen is Principal Software Engineer at ARM Norway – www.ARM.com

Figure: Final result

Electronic Engineering Times Europe, September 2014, www.electronics-eetimes.com

