Floating Point Accelerators: Does IT Finally Get A Free Lunch?
Floating point acceleration key to huge top-end performance jump in the world's fastest supercomputers.
As much as supercomputers represent the cutting edge of computing, they also play a critical role in maturing new technology and in helping solutions trickle down into the general data center and even mainstream computing. The most recent Top500 list of the world's fastest supercomputers revealed yet another huge jump in top-end performance, which relies more and more on floating point accelerators, a field that now also includes Intel's first contender, the Xeon Phi. And Intel claims that programming for FP acceleration just got a lot easier.
To be considered for the November 2012 Top500 list, a supercomputer had to offer at least 76.4 TFlops of sustained performance. Five years ago, this number would have been good enough for a place among the Top 10, and eight years ago it would have been enough to claim the top spot. Meanwhile, the currently fastest supercomputer, ORNL's Titan, claims 17.6 PFlops, and 22 systems listed worldwide now deliver at least 1 PFlops.
Even more impressive than the absolute jump in computing horsepower is the increase in efficiency. Titan, for example, delivers 36 times the performance of the fastest supercomputer five years ago, LLNL's BlueGene/L (478.2 TFlops), but consumes only 3.5 times the power (8209 kW versus 2329 kW). Much of this efficiency increase is due, of course, to much more efficient CPUs, but supercomputers increasingly take advantage of floating point (FP) accelerators, such as general purpose GPUs (GPGPUs) like Nvidia's Tesla family or AMD's FirePro series, as well as Intel's just announced Xeon Phi co-processor.
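Those figures are worth working through. A short calculation, using only the performance and power numbers quoted above, shows how the two ratios combine into a roughly tenfold gain in performance per watt:

```python
# Efficiency comparison of Titan (2012) and BlueGene/L (2007),
# using the Top500 figures quoted above.
titan_tflops, titan_kw = 17590.0, 8209.0
bluegene_tflops, bluegene_kw = 478.2, 2329.0

perf_ratio = titan_tflops / bluegene_tflops   # ~36.8x the performance
power_ratio = titan_kw / bluegene_kw          # ~3.5x the power draw

# GFlops per watt: (TFlops * 1000) / (kW * 1000) simplifies to TFlops / kW
titan_eff = titan_tflops / titan_kw           # ~2.14 GFlops/W
bluegene_eff = bluegene_tflops / bluegene_kw  # ~0.21 GFlops/W

print(f"Performance ratio: {perf_ratio:.1f}x")
print(f"Power ratio:       {power_ratio:.1f}x")
print(f"Efficiency gain:   {titan_eff / bluegene_eff:.1f}x")
```

In other words, Titan extracts roughly ten times as much work out of every watt as BlueGene/L did five years earlier.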
Titan integrates a total of 18,688 Nvidia Tesla K20X units with a total of 50,233,344 CUDA cores. AMD deployed 420 FirePro S10000 cards with a total of 1,505,280 stream processors in the SANAM supercomputer at the King Abdulaziz City for Science and Technology, and an undisclosed number of Xeon Phi 5110P cards with 60 cores each is used in Beacon, a supercomputer installed at the National Institute for Computational Sciences/University of Tennessee. These three systems are currently rated as the most power efficient supercomputers in the world.
Accelerators have been used in numerous variations for several years, including more exotic solutions such as ClearSpeed's accelerator cards. GPGPUs, however, have become a rather compelling solution, and the specifications are certainly impressive. For example, Nvidia's flagship Tesla K20X promises 3.95 TFlops in single-precision (SP) and 1.31 TFlops in double precision (DP) for 235 watts and $3,200. AMD's FirePro S10000 even claims 5.91 TFlops (SP) and 1.48 TFlops (DP) for 375 watts and $3,600. Both are marvels of engineering: they pack 2,688 (Nvidia) and 3,584 (AMD) individual processors, 6 GB of GDDR5 memory at 5.2 and 5 GHz, respectively, and a stunning 7.1 billion (Nvidia) and 8.62 billion (AMD) transistors.
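The spec-sheet numbers invite a per-watt and per-dollar comparison. The sketch below runs that comparison on the single-precision figures quoted above; keep in mind these are peak vendor claims, not sustained throughput:

```python
# Peak single-precision throughput per watt and per dollar for the
# two accelerator cards, using the vendor figures quoted above.
cards = {
    "Tesla K20X":     {"sp_tflops": 3.95, "watts": 235, "price_usd": 3200},
    "FirePro S10000": {"sp_tflops": 5.91, "watts": 375, "price_usd": 3600},
}

for name, c in cards.items():
    gflops = c["sp_tflops"] * 1000.0
    print(f"{name}: {gflops / c['watts']:.1f} GFlops/W, "
          f"{gflops / c['price_usd']:.2f} GFlops/$")
```

On peak single-precision numbers the two cards land surprisingly close per watt, around 16-17 GFlops/W for the K20X versus about 15-16 GFlops/W for the S10000, while AMD's card delivers more raw throughput per dollar.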
Nvidia has been able to drive its GPGPUs and the CUDA programming model into the market with a campaign based on mass support as well as education in universities. The company focused early on promoting classes that teach CUDA and established deep relationships with university scientists to create a knowledge layer supporting its technology. The approach has resulted in more than 400 million CUDA-capable GPUs sold over the past five years and 8,000 organizations participating in the CUDA community today. Nvidia claims that the commercially deployed Tesla K20 family reached a performance of about 30 PFlops within the first month after launch. Of course, the majority of this performance is built into Titan, but Nvidia apparently has been able to sell a few thousand of these cards to other customers as well.
Despite its success, Tesla is not without issues, and its performance potential is largely theoretical; those 1.31 TFlops are unlikely ever to be fully exploited by any application. The main complaint of developers remains that very specific knowledge is required to tap the full potential of the GPU. At a very high level, Nvidia's CUDA is just a set of C++ extensions, and porting an existing application to run on GPUs may not seem so difficult. But developers speaking on the condition of anonymity told us that the claim that only C++ knowledge is required to exploit the horsepower of a GPGPU is an "oversimplification," and that only detailed knowledge of the GPU and its memory architecture can get performance anywhere near what hardware makers are promising. This knowledge remains rather rare in today's market.
Wolfgang Gruener is a contributor to Tom's IT Pro. He is currently principal analyst at Ndicio Research, a market analysis firm that focuses on cloud computing and disruptive technologies, and maintains the conceivablytech.com blog. An 18-year veteran in IT journalism and market research, he previously published TG Daily and was managing editor of Tom's Hardware news, which he grew from a link collection in the early 2000s into one of the most comprehensive and trusted technology news sources.
See here for all of Wolfgang's Tom's IT Pro articles.