Many scientific, technical, and engineering applications in finance, medical imaging, modeling, simulation, and image processing can benefit greatly from the floating-point acceleration offered by modern general-purpose graphics processing units (GPGPUs). Today's graphics processors have evolved into sophisticated, massively parallel, highly programmable compute engines ideally suited to algorithms with a high degree of data parallelism. Combined with modern parallel programming languages and application programming interfaces such as the industry-standard OpenCL™, GPGPU offers a new paradigm for high-performance computing (HPC).

In this article we examine the use of GPGPU to accelerate the iterative, grid-based, finite-difference methods used extensively in HPC applications. Finite-difference time-domain (FDTD) solvers are used to model electromagnetic wave propagation, while velocity-stress finite-difference time-domain (VS-FDTD) solvers model acoustic wave propagation in seismic applications. Both algorithms are ported to the ATI Radeon™ HD 5870 and ATI Radeon™ HD 5970 GPU co-processors and benchmarked against a CPU implementation. Let's start with an overview of the ATI Radeon HD 5870 GPU.
Overview: The ATI Radeon HD 5870 GPU is a high-performance graphics processor delivering 2.72 teraFLOPS of peak single-precision floating-point performance at 14.47 gigaFLOPS/watt. The GPU supports the OpenCL 1.0 API and complies with the IEEE 754-2008 floating-point arithmetic standard. Configured with 1 GByte of GDDR5 memory, the GPU supports 153 GBytes/sec of bandwidth to local device memory. Communication with a host CPU is via an x16 PCI Express® 2.0 bus.
Architecture: For GPGPU applications, the key functional blocks are the 20 single-instruction, multiple-data (SIMD) engines, the thread dispatch controller, the memory controllers, the instruction, constant, and memory caches, and 64 KBytes of global memory (Figure 1). Each SIMD engine contains an array of 80 stream processing elements (PEs), a 32 KByte local data store, a fetch unit with control logic, and an 8 KByte L1 cache. Each SIMD engine can execute its own instruction threads; communication with other SIMDs is through the 64 KByte global memory. To maximize computational throughput and hide memory latency, the ultra-threaded dispatch processor launches multiple simultaneous thread contexts. Memory bandwidth is optimized by four memory controllers connected to four 128 KByte L2 caches that coalesce memory reads and writes to drive the external 256-bit-wide memory bus.