Which is better for massively parallel computing, a GPU accelerator board from NVidia, or Intel’s new Xeon Phi? On the eve of NVidia’s GPU Technology Conference comes a paper which Intel will enjoy. Erik Sauley, Kamer Kayay, and Umit V. C atalyurek from the Ohio State University have issued a paper with performance comparisons between Xeon Phi, NVIDIA Tesla C2050 and NVIDIA Tesla K20. The K20 has 2,496 CUDA cores, versus a mere 61 processor cores on the Xeon Phi, yet on the particular calculations under test the researchers got generally better performance from Xeon Phi.
In the case of sparse-matrix vector multiplication (SpMV):
For GPU architectures, the K20 card is typically faster than the C2050 card. It performs better for 18 of the 22 instances. It obtains between 4.9 and 13.2GFlop/s and the highest performance on 9 of the instances. Xeon Phi reaches the highest performance on 12 of the instances and it is the only architecture which can obtain more than 15GFlop/s.
and in the case of sparse-matrix matrix multiplication (SpMM):
The K20 GPU is often more than twice faster than C2050, which is much better compared with their relative performances in SpMV. The Xeon Phi coprocessor gets the best performance in 14 instances where this number is 5 and 3 for the CPU and GPU configurations, respectively. Intel Xeon Phi is the only architecture which achieves more than 100GFlop/s.
Note that this is a limited test, and that the authors note that SpMV computation is known to be a difficult case for GPU computing:
the irregularity and sparsity of SpMV-like kernels create several problems for these architectures.
They also note that memory latency is the biggest factor slowing performance:
At last, for most instances, the SpMV kernel appears to be memory latency bound rather than memory bandwidth bound
It is difficult to compare like with like. The Xeon Phi implementation uses OpenMP, whereas the GPU implementation uses CuSparse. I would also be interested to know whether as much effort was made to optimise for the GPU as for the Xeon Phi.
Still, this is a real-world test that, if nothing else, demonstrates that in the right circumstances the smaller number of cores in a Xeon Phi do not prevent it comparing favourably against a GPU accelerator:
When compared with cutting-edge processors and accelerators, its SpMV, and especially SpMM, performance are superior thanks to its wide registers and vectorization capabilities. We believe that Xeon Phi will gain more interest in HPC community in the near future.
I first became aware of NVIDIA’s propaganda war against Intel at the 2012 GPU Technology conference in Beijing. CEO Jen-Hsun Huang stated that CPUs are remarkably inefficient for multicore processing:
The CPU is fast and is terrific at single-threaded performance, but because so much of the electronics inside the CPU is dedicated to out of order execution, branch prediction, speculative execution, all of the technology that has gone into sustaining instruction throughput and making the CPU faster at single-threaded applications, the electronics necessary to enable it to do that has grown tremendously. With four cores, in order to execute an operation, a floating point add or a floating point multiply, 50 times more energy is dedicated to the scheduling of that operation than the operation itself. If you look at the silicone of a CPU, the floating point unit is only a few percentage of the overall die, and it is consistent with the usage of the energy to sequence, to schedule the instructions running complicated programs.
That figure of 50 times surprised me, and I asked Intel’s James Reinders for a comment. He was quick to respond, noting that:
50X is ridiculous if it encourages you to believe that there is an alternative which is 50X better. The argument he makes, for a power-efficient approach for parallel processing, is worth about 2X (give or take a little). The best example of this, it turns out, is the Intel MIC [Many Integrated Core] architecture.
Reinders went on to say:
Knights Corner is superior to any GPGPU type solution for two reasons: (1) we don’t have the extra power-sucking silicon wasted on graphics functionality when all we want to do is compute in a power efficient manner, and (2) we can dedicate our design to being highly programmable because we aren’t a GPU (we’re an x86 core – a Pentium-like core for “in order” power efficiency). These two turn out to be substantial advantages that the Intel MIC architecture has over GPGPU solutions that will allow it to have the power efficiency we all want for highly parallel workloads, but able to run an enormous volume of code that will never run on GPGPUs (and every algorithm that can run on GPGPUs will certainly be able to run on a MIC co-processor).
So Intel is evangelising its MIC vs GPCPU solutions such as NVIDIA’s Tesla line. Yesterday NVIDIA’s Steve Scott spoke up to put the other case. If Intel’s point is that a Tesla is really a GPU pressed into service for general computing, then Scott’s first point is that the cores in MIC are really CPUs, albeit of an older, simpler design:
They don’t really have the equivalent of a throughput-optimized GPU core, but were able to go back to a 15+ year-old Pentium design to get a simpler processor core, and then marry it with a wide vector unit to get higher flops per watt than can be achieved by Xeon processors.
Scott then takes on Intel’s most compelling claim, compatibility with existing x86 code. It does not matter much, says Scott, since you will have to change your code anyway:
The reality is that there is no such thing as a “magic” compiler that will automatically parallelize your code. No future processor or system (from Intel, NVIDIA, or anyone else) is going to relieve today’s programmers from the hard work of preparing their applications for the future.
What is the real story here? It would, of course, be most interesting to compare the performance of MIC vs Tesla, or against the next generation of NVIDIA GPGPUs based on Kepler; and may the fastest and most power-efficient win. That will have to wait though; in the meantime we can see that Intel is not enjoying seeing the world’s supercomputers install NVIDIA GPGPUs – the Oak Ridge National Laboratory Jaguar/Titan (the most powerful supercomputer in the USA) being a high profile example:
In addition, 960 of Jaguar’s 18,688 compute nodes now contain an NVIDIA graphical processing unit (GPU). The GPUs were added to the system in anticipation of a much larger GPU installation later in the year.
Equally, NVIDIA may be rattled by the prospect of Intel offering strong competition for Tesla. It has not had a lot of competition in this space.
There is an ARM factor here too. When I spoke to Scott in Beijing, he hinted that NVIDIA would one day produce GPGPUs with ARM chips embedded for CPU duties, perhaps sharing the same memory.
I spoke to Dr Steve Scott, NVIDIA’s CTO for Tesla, at the end of the GPU Technology Conference which has just finished here in Beijing. In the closing session, Scott talked about the future of NVIDIA’s GPU computing chips. NVIDIA releases a new generation of graphics chips every two years:
2008 Tesla
2010 Fermi
2012 Kepler
2014 Maxwell
Yes, it is confusing that the Tesla brand, meaning cards for GPU computing, has persisted even though the Tesla family is now obsolete.
Dr Steve Scott showing off the power efficiency of GPU computing
Scott talked a little about a topic that interests me: the convergence or integration of the GPU and the CPU. The background here is that while the GPU is fast and efficient for parallel number-crunching, it is of course still necessary to have a CPU, and there is a price to pay for the communication between the two. The GPU and the CPU each have their own memory, so data must be copied back and forth, which is an expensive operation.
One solution is for GPU and CPU to share memory, so that a single pointer is valid on both. I asked CEO Jen-Hsun Huang about this and he did not give much hope for this:
We think that today it is far better to have a wonderful CPU with its own dedicated cache and dedicated memory, and a dedicated GPU with a very fast frame buffer, very fast local memory, that combination is a pretty good model, and then we’ll work towards making the programmer’s view and the programmer’s perspective easier and easier.
Scott on the other hand was more forthcoming about future plans. Kepler, which is expected in the first half of 2012, will bring some changes to the CUDA architecture which will “broaden the applicability of GPU programming, tighten the integration of the CPU and GPU, and enhance programmability,” to quote Scott’s slides. This integration will include some limited sharing of memory between GPU and CPU, he said.
What caught my interest though was when he remarked that at some future date NVIDIA will probably build CPU functionality into the GPU. The form that might take, he said, is that the GPU will have a couple of cores that do the CPU functions. This will likely be an implementation of the ARM CPU.
Note that this is not promised for Kepler nor even for Maxwell but was thrown out as a general statement of direction.
There are a couple of further implications. One is that NVIDIA plans to reduce its dependence on Intel. ARM is a better partner, Scott told me, because its designs can be licensed by anyone. It is not surprising then that Intel’s multi-core evangelist James Reinders was dismissive when I asked him about NVIDIA’s claim that the GPU is far more power-efficient than the CPU. Reinders says that the forthcoming MIC (Many Integrated Core) processors codenamed Knights Corner are a better solution, referring to the:
… substantial advantages that the Intel MIC architecture has over GPGPU solutions that will allow it to have the power efficiency we all want for highly parallel workloads, but able to run an enormous volume of code that will never run on GPGPUs (and every algorithm that can run on GPGPUs will certainly be able to run on a MIC co-processor).
In other words, Intel foresees a future without the need for NVIDIA, at least in terms of general-purpose GPU programming, just as NVIDIA foresees a future without the need for Intel.
Incidentally, Scott told me that he left Cray for NVIDIA because of his belief in the superior power efficiency of GPUs. He also described how the Titan supercomputer operated by the Oak Ridge National Laboratory in the USA will be upgraded from its current CPU-only design to incorporate thousands of NVIDIA GPUs, with the intention of achieving twice the speed of Japan’s K computer, currently the world’s fastest.
This whole debate also has implications for Microsoft and Windows. Huang says he is looking forward to Windows on ARM, which makes sense given NVIDIA’s future plans. That said, the I get impression from Microsoft is that Windows on ARM is not intended to be the same as Windows on x86 save for the change of processor. My impression is that Windows on ARM is Microsoft’s iOS, a locked-down operating system that will be safer for users and more profitable for Microsoft as app sales are channelled through its store. That is all very well, but suggests that we will still need x86 Windows if only to retain open access to the operating system.
Another interesting question is what will happen to Microsoft Office on ARM. It may be that x86 Windows will still be required for the full features of Office.
This means we cannot assume that Windows on ARM will be an instant hit; much is uncertain.
NVIDIA CEO Jen-Hsung Huang spoke to the press at the GPU Technology Conference and I took the opportunity to ask some questions.
I asked for his views on the cloud as a supercomputer and whether that would impact the need for local supercomputers of the kind GPU computing enables.
Although we expect more and more to happen in the cloud, in the meantime we’re going to keep buying devices with more and more solid state memory. The way to think about it is, storage is simply a surrogate for bandwidth. If we had infinite bandwidth none of us would need storage. As bandwidth improves the requirement for storage should reduce. But there’s another trend which is that the amount of data we collect is growing incredibly fast … It’s going to be quite a long time before our need for storage will reduce.
But what about local computing power, Gigaflops as opposed to storage?
Wherever there is storage, there’s GigaFlops. Local storage, local computing.
Next, I brought up a subject which has been puzzling me here at GTC. You can do GPU programming with NVIDIA’s CUDA C, which only works on NVIDIA GPUs, or with OpenCL which works with other vendor’s GPUs as well. Why is there more focus here on CUDA, when on the face of it developers would be better off with the cross-GPU approach? (Of course I know part of the answer, that NVIDIA does not mind locking developers to its own products).
The reason we focus all our evangelism and energy on CUDA is because CUDA requires us to, OpenCL does not. OpenCL has the benefit of IBM, AMD, Intel, and ourselves. Now CUDA is a little difference in that its programming approach is different. Instead of an API it’s a language extension. You program in C, it’s a different model.
The reason why CUDA is more adopted than OpenCL is because it is simply more advanced. We’ve invested in CUDA much longer. The quality of the compiler is much better. The robustness of the programming environment is better. The tools around it are better, and there are more people programming it. The ecosystem is richer.
People ask me how do we feel about the fact that it is proprietary. There’s two ways to think about it. There’s CUDA and there’s Tesla. Tesla’s not proprietary at all, Tesla supports OpenCL and CUDA. If you bought a server with Tesla in it, you’re not getting anything less, you’re getting CUDA more. That’s the reason Tesla has been adopted by all the OEMs. If you want a GPU cluster, would you want one that only does OpenCL? Or does OpenCL and CUDA? 80% of GPU computing today is CUDA, 20% is OpenCL. If you want to reach 100% of it, you’re better off using Tesla. Over time, if more people use OpenCL that’s fine with us. The most important thing is GPU computing, the next most important thing to us is NVIDIA’s GPUs, and the next is CUDA. It’s way down the list.
Next, a hot topic. Jen-Hsun Huang explained why he announced a roadmap for future graphics chip architectures – Kepler in 2011, Maxwell in 2013 – so that software developers engaged in GPU programming can plan their projects. I asked him why Fermi, the current chip architecture, had been so delayed, and whether there was good reason to have confidence in the newly announced dates.
He answered by explaining the Fermi delay in both technical and management terms.
The technical answer is that there’s a piece of functionality that is between the shared symmetric multiprocessors (SMs), 236 processors, that need to communicate with each other, and with memory of all different types. So there’s SMs up here, and underneath the memories. In between there is a very complicated inter-connecting system that is very fast. It’s nearly all wires, dense metal with very little logic … we call that the fabric.
When you have wires that are next to each other that closely they couple, they interfere … it’s a solid mesh of metal. We found a major breakdown between the models, the tools, and reality. We got the first Fermi back. That piece of fabric – imagine we are all processors. All of us seem to be working. But we can’t talk to each other. We found out it’s because the connection between us is completely broken. We re-engineered the whole thing and made it work.
Your question was deeper than that. Your question wasn’t just what broke with Fermi – it was the fabric – but the question is how would you not let it happen again? It won’t be fabric next time, it will be something else.
The reason why the fabric failed isn’t because it was hard, but because it sat between the responsibility of two groups. The fabric is complicated because there’s an architectural component, a logic design component, and there’s a physics component. My engineers who know physics and my engineers who know architecture are in two different organisations. We let it sit right in the middle. So the management lesson learned – there should always be a pilot in charge.
Huang spent some time discussing changes in the industry. He identifies mobile computing “superphones” and tablets as the focus of a major shift happening now. Someone asked “What does that mean for your Geforce business?”
I don’t think like that. The way I think is, “what is my personal computer business”. The personal computer business is Geforce plus Tegra. If you start a business, don’t think about the product you make. Think about the customer you’re making it for. I want to give them the best possible personal computing experience.
Tegra is NVIDIA’s complete system on a chip, including ARM processor and of course NVIDIA graphics, aimed at mobile devices. NVIDIA’s challenge is that its success with Geforce does not guarantee success with Tegra, for which it is early days.
The further implication is that the immediate future may not be easy, as traditional PC and laptop sales decline.
The mainstream business for the personal computer industry will be rocky for some time. The reason is not because of the economy but because of mobile computing. The PC … will be under disruption from tablets. The difference between a tablet and a PC is going to become very small. Over the next few years we’re going to see that more and more people use their mobile device as their primary computer.
[Holds up Blackberry] There’s no question right now that this is my primary computer.
The rise of mobile devices is a topic Huang has returned to on several occasions here. “ARM is the most important CPU architecture, instruction set architecture, of the future” he told the keynote audience.
Clearly NVIDIA’s business plans are not without risk; but you cannot fault Huang for enthusiasm or awareness of coming changes. It is clear to me that NVIDIA has the attention of the scientific and academic community for GPU computing, and workstation OEMs are scrambling to built Tesla GPU computing cards into their systems, but transitions in the market for its mass-market graphics cards will be tricky for the company.
Update: Huang’s comments about the reasons for Fermi’s delay raised considerable interest as apparently he had not spoken about this on record before. Journalist Nico Ernst captured the moment on video: