Adobe has just announced Creative Suite 6. CS 5.5 used the Mercury Playback Engine in Premiere Pro, which takes advantage of NVIDIA’s CUDA library in order to accelerate processing when an NVIDIA GPU is present. Just to be clear, this is not just graphics acceleration, but programming the GPU to take advantage of its many processor cores for general-purpose computing.
Premiere Pro CS6 also uses the Mercury Playback Engine, and while CUDA is still recommended, there is new support for OpenCL:
The Mercury Playback Engine brings performance gains to all the GPUs supported in Adobe Creative Suite 6 software, but the best performance comes with specific NVIDIA® CUDA™ enabled GPUs, including support for mobile GPUs and NVIDIA Maximus™ dual-GPU configurations. New support for the OpenCL-based AMD Radeon HD 6750M and 6770M cards available with certain Apple MacBook Pro computers running OS X Lion (v10.7x), with a minimum of 1GB VRAM, brings GPU-accelerated mobile workflows to Mac users.
Photoshop CS6 also uses the GPU to accelerate processing, using the new Mercury Graphics Engine. The Mercury Graphics Engine uses the OpenCL framework, which is not specific to any one GPU vendor, rather than CUDA:
The Mercury Graphics Engine (MGE) represents features that use video card, or GPU, acceleration. In Photoshop CS6, this new engine delivers near-instant results when editing with key tools such as Liquify, Warp, Lighting Effects and the Oil Paint filter. The new MGE delivers unprecedented responsiveness for a fluid feel as you work. MGE is new to Photoshop CS6, and uses both the OpenGL and OpenCL frameworks. It does not use the proprietary CUDA framework from nVidia.
It seems to me that this amounts to a shift by Adobe from CUDA to OpenCL, which is a good thing for users of non-NVIDIA GPUs.
This also suggests to me that NVIDIA will need to ensure excellent OpenCL support in its GPU cards, as well as continuing to evolve CUDA, since Creative Suite is a key product for designers using the workstations which form a substantial part of the market for high-end GPUs.
I’m at Intel’s software tools conference in Dubrovnik, which I have attended for the last three years, and as usual the big topic is concurrent programming and how to write code that takes advantage of the multiple cores in today’s computers.
Clearly this remains a critical subject, but in some ways the progress over these last three years has been disappointing when it comes to the PCs that most of us use. Many machines are only dual-core, which is sub-optimal for concurrent programming since there is an overhead to multi-threaded programming that eats into the benefit of having two cores. Quad core is now common too, and more useful, but what about having 50 or 80 or more cores? This enables massively parallel processing of the kind that you can easily do today with general-purpose GPU programming using OpenCL or NVIDIA’s CUDA, but not yet on the CPU unless you have a supercomputer. I realise that GPU cores are not the same as CPU cores; but nevertheless they enable some spectacularly fast parallel processing.
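To put rough numbers on the dual-core point, here is a minimal sketch based on Amdahl's law with an extra fixed overhead term. The 90% parallel fraction and the 10% threading overhead are illustrative assumptions of mine, not measurements, and the function name is hypothetical.

```python
def speedup(cores, parallel_fraction, overhead_fraction=0.0):
    """Estimated speedup over a single-threaded run.

    Amdahl's law plus a fixed threading overhead, expressed as a
    fraction of the original single-threaded run time. The overhead
    term is an assumption for illustration; real overheads vary.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial = 1.0 - parallel_fraction
    # A single-threaded run pays no threading overhead.
    overhead = overhead_fraction if cores > 1 else 0.0
    return 1.0 / (serial + parallel_fraction / cores + overhead)


if __name__ == "__main__":
    for cores in (1, 2, 4, 50):
        estimate = speedup(cores, 0.9, overhead_fraction=0.1)
        print(cores, "cores:", round(estimate, 2))
```

With these assumed figures two cores give a speedup of about 1.5 rather than 2, and even 50 cores stay below 5, because the serial part and the overhead dominate once the parallel part has been spread thinly.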
I am interested therefore in Intel’s MIC or Many Integrated Core architecture, which combines 50 or more CPU cores on a single chip. MIC is already in preview, with hardware codenamed Knights Corner and a development kit called Knights Ferry. But when will MIC hit the mainstream for servers and workstations, and how long is it until we can have 50 cores on a commodity desktop PC? I spoke to Intel’s chief evangelist James Reinders.
Reinders first gave me some background on MIC:
“We’ve made those bold steps to dual core, quad core and we’ve got even ten core now, but if you look inside those microprocessors they have a very simple structure. All the cores are hooked together and share their connection to memory, through a shared cache usually that’s on the chip. It’s a simple computer structure, and we know from experience when you build computers with more and more processors, that eventually you go to more sophisticated connections between the cores. You don’t build a 1000-processor super computer and hook them all together with a bus to one memory.
“It’s inevitable that on a chip we need to design a more sophisticated connection. That’s what MIC’s about, that’s what the Larrabee project has always been about, a belief that we should take a bunch of x86 cores and hook them together with something more sophisticated. In this case it’s a ring, a bi-directional, 512-bit wide high performance ring, with multiple connections to memory off the chip, which gives us more bandwidth.
“That’s how I look at MIC, it’s putting a cluster-type of design on a chip.”
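As a rough way to picture the ring Reinders describes, here is a minimal sketch that counts hops between cores on a bi-directional ring. It assumes one ring stop per core and exactly 50 cores, which is my simplification; the real chip also places memory connections on the ring, and the function names are hypothetical.

```python
def ring_hops(src, dst, cores):
    """Fewest hops between two stops on a bi-directional ring."""
    if not (0 <= src < cores and 0 <= dst < cores):
        raise ValueError("core index out of range")
    clockwise = (dst - src) % cores
    # Traffic can travel either way round, so take the shorter direction.
    return min(clockwise, cores - clockwise)


def worst_case_hops(cores):
    """Longest shortest path on the ring: the stop directly opposite."""
    return cores // 2


if __name__ == "__main__":
    cores = 50  # assumed core count, taken from "50 or more" in the text
    print("core 0 to core 49:", ring_hops(0, 49, cores))
    print("worst case:", worst_case_hops(cores))
```

Because traffic can go in either direction, no core in this model is more than 25 hops from any other, which is half what a one-way ring would need.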
But what about timing?
“The first place you’ll see this is in servers and in workstations, where there’s a lot of demand for a lot of computation. In that case we’ll see that availability sometime by the end of 2012. The Intel product should be out late in that year.
“When will we see it in other devices? I think that’s a ways off. It’s a very high core count part, more than 50, it’s going to consume a fair amount of power. The same part 18 months later will probably consume half the power. So inside a decade we could see this being common on desktops, I don’t know about mobile devices, it might even make it to tablets. A decade’s a long time, it gives a lot of time for people to come up with innovative uses for it in software.
“We’ll see single core disappear everywhere.”
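Reinders’s rule of thumb, that the same part uses half the power 18 months later, is easy to extrapolate. The sketch below does that, assuming the halving continues unchanged for a decade; the 200 W starting figure is purely hypothetical and does not come from Intel.

```python
def projected_power(initial_watts, months, halving_period_months=18.0):
    """Power draw after `months`, if it halves every halving period."""
    return initial_watts * 0.5 ** (months / halving_period_months)


if __name__ == "__main__":
    start = 200.0  # hypothetical starting power in watts
    for years in (0, 3, 5, 10):
        watts = projected_power(start, years * 12)
        print(years, "years:", round(watts, 1), "W")
```

On that assumption a part ends the decade at roughly 1% of its starting power, which is why a chip that only makes sense in a server today could plausibly fit a desktop or tablet power budget within ten years.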
Incidentally, it is hard to judge how much computing power is “enough”. Although having many CPU cores may seem overkill for everyday computing, things like speech recognition or on-the-fly image processing make devices smarter at the expense of intense processing under the covers. From supercomputers to smartphones, if more computing capability is available, history tells us that we will find ways to use it.
NVIDIA CEO Jen-Hsun Huang spoke to the press at the GPU Technology Conference and I took the opportunity to ask some questions.
I asked for his views on the cloud as a supercomputer and whether that would impact the need for local supercomputers of the kind GPU computing enables.
Although we expect more and more to happen in the cloud, in the meantime we’re going to keep buying devices with more and more solid state memory. The way to think about it is, storage is simply a surrogate for bandwidth. If we had infinite bandwidth none of us would need storage. As bandwidth improves the requirement for storage should reduce. But there’s another trend which is that the amount of data we collect is growing incredibly fast … It’s going to be quite a long time before our need for storage will reduce.
But what about local computing power, Gigaflops as opposed to storage?
Wherever there is storage, there’s GigaFlops. Local storage, local computing.
Next, I brought up a subject which has been puzzling me here at GTC. You can do GPU programming with NVIDIA’s CUDA C, which only works on NVIDIA GPUs, or with OpenCL, which works with other vendors’ GPUs as well. Why is there more focus here on CUDA, when on the face of it developers would be better off with the cross-GPU approach? (Of course I know part of the answer, that NVIDIA does not mind locking developers to its own products.)
The reason we focus all our evangelism and energy on CUDA is because CUDA requires us to, OpenCL does not. OpenCL has the benefit of IBM, AMD, Intel, and ourselves. Now CUDA is a little different in that its programming approach is different. Instead of an API it’s a language extension. You program in C, it’s a different model.
The reason why CUDA is more adopted than OpenCL is because it is simply more advanced. We’ve invested in CUDA much longer. The quality of the compiler is much better. The robustness of the programming environment is better. The tools around it are better, and there are more people programming it. The ecosystem is richer.
People ask me how do we feel about the fact that it is proprietary. There’s two ways to think about it. There’s CUDA and there’s Tesla. Tesla’s not proprietary at all, Tesla supports OpenCL and CUDA. If you bought a server with Tesla in it, you’re not getting anything less, you’re getting CUDA more. That’s the reason Tesla has been adopted by all the OEMs. If you want a GPU cluster, would you want one that only does OpenCL? Or does OpenCL and CUDA? 80% of GPU computing today is CUDA, 20% is OpenCL. If you want to reach 100% of it, you’re better off using Tesla. Over time, if more people use OpenCL that’s fine with us. The most important thing is GPU computing, the next most important thing to us is NVIDIA’s GPUs, and the next is CUDA. It’s way down the list.
Next, a hot topic. Jen-Hsun Huang explained why he announced a roadmap for future graphics chip architectures – Kepler in 2011, Maxwell in 2013 – so that software developers engaged in GPU programming can plan their projects. I asked him why Fermi, the current chip architecture, had been so delayed, and whether there was good reason to have confidence in the newly announced dates.
He answered by explaining the Fermi delay in both technical and management terms.
The technical answer is that there’s a piece of functionality that is between the shared symmetric multiprocessors (SMs), 236 processors, that need to communicate with each other, and with memory of all different types. So there’s SMs up here, and underneath the memories. In between there is a very complicated inter-connecting system that is very fast. It’s nearly all wires, dense metal with very little logic … we call that the fabric.
When you have wires that are next to each other that closely they couple, they interfere … it’s a solid mesh of metal. We found a major breakdown between the models, the tools, and reality. We got the first Fermi back. That piece of fabric – imagine we are all processors. All of us seem to be working. But we can’t talk to each other. We found out it’s because the connection between us is completely broken. We re-engineered the whole thing and made it work.
Your question was deeper than that. Your question wasn’t just what broke with Fermi – it was the fabric – but the question is how would you not let it happen again? It won’t be fabric next time, it will be something else.
The reason why the fabric failed isn’t because it was hard, but because it sat between the responsibility of two groups. The fabric is complicated because there’s an architectural component, a logic design component, and there’s a physics component. My engineers who know physics and my engineers who know architecture are in two different organisations. We let it sit right in the middle. So the management lesson learned – there should always be a pilot in charge.
Huang spent some time discussing changes in the industry. He identifies mobile computing “superphones” and tablets as the focus of a major shift happening now. Someone asked “What does that mean for your GeForce business?”
I don’t think like that. The way I think is, “what is my personal computer business”. The personal computer business is GeForce plus Tegra. If you start a business, don’t think about the product you make. Think about the customer you’re making it for. I want to give them the best possible personal computing experience.
Tegra is NVIDIA’s complete system on a chip, including ARM processor and of course NVIDIA graphics, aimed at mobile devices. NVIDIA’s challenge is that its success with GeForce does not guarantee success with Tegra, for which it is early days.
The further implication is that the immediate future may not be easy, as traditional PC and laptop sales decline.
The mainstream business for the personal computer industry will be rocky for some time. The reason is not because of the economy but because of mobile computing. The PC … will be under disruption from tablets. The difference between a tablet and a PC is going to become very small. Over the next few years we’re going to see that more and more people use their mobile device as their primary computer.
[Holds up Blackberry] There’s no question right now that this is my primary computer.
The rise of mobile devices is a topic Huang has returned to on several occasions here. “ARM is the most important CPU architecture, instruction set architecture, of the future” he told the keynote audience.
Clearly NVIDIA’s business plans are not without risk; but you cannot fault Huang for enthusiasm or awareness of coming changes. It is clear to me that NVIDIA has the attention of the scientific and academic community for GPU computing, and workstation OEMs are scrambling to build Tesla GPU computing cards into their systems, but transitions in the market for its mass-market graphics cards will be tricky for the company.
Update: Huang’s comments about the reasons for Fermi’s delay raised considerable interest as apparently he had not spoken about this on record before. Journalist Nico Ernst captured the moment on video:
I’m at NVIDIA’s GPU tech conference in San Jose. The central theme of the conference is that the capabilities of modern GPUs enable substantial performance gains for general computing, not just for graphics, though most of the examples we have seen involve some element of graphical processing. The reason you should care about this is that the gains are huge.
Take Matlab for example, a popular language and IDE for algorithm development, data analysis and mathematical computation. We were told in the keynote here yesterday that Matlab is offering a parallel computing toolkit based on NVIDIA’s CUDA, with speed-ups from 10 to 40 times. Dramatic performance improvements open up new possibilities in computing.
Why has GPU performance advanced so rapidly, whereas CPU performance has levelled off? The reason is that they use different computing models. CPUs are general-purpose. The focus is on fast serial computation, executing a single thread as rapidly as possible. Since many applications are largely single-threaded, this is what we need, but there are technical barriers to increasing clock speed. Of course multi-core and multi-processor systems are now standard, so we have dual-core or quad-core machines, with big performance gains for multi-threaded applications.
By contrast, GPUs are designed to be massively parallel. A Tesla C1060 has not 2 or 4 or 8 cores, but 240; the C2050 has 448. These are not the same as CPU cores, but nevertheless do execute in parallel. The clock speed is only 1.3GHz, whereas an Intel Core i7 Extreme is 3.3GHz, but the Intel CPU has a mere 6 cores. An Intel Xeon 7560 runs at 2.266GHz and has 8 cores. The lower clock speed in the GPU is one reason it is more power-efficient.
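A crude way to compare these figures is to multiply cores by clock speed. The sketch below does only that, using the numbers quoted above and assuming the 1.3GHz clock refers to the Tesla C1060. It is not a benchmark: as already noted, a GPU core is not equivalent to a CPU core, so the result says something about available parallelism and nothing about real application speed.

```python
# (name, cores, clock in GHz) as quoted in the text; pairing the 1.3GHz
# clock with the Tesla C1060 is an assumption.
CHIPS = [
    ("Tesla C1060", 240, 1.3),
    ("Core i7 Extreme", 6, 3.3),
    ("Xeon 7560", 8, 2.266),
]


def core_ghz(cores, clock_ghz):
    """Cores multiplied by clock speed, a crude measure of parallel cycles."""
    return cores * clock_ghz


if __name__ == "__main__":
    for name, cores, clock in CHIPS:
        print(name, round(core_ghz(cores, clock), 1), "core-GHz")
```

On this measure the GPU comes out at 312 core-GHz against roughly 20 for the six-core CPU, which gives a feel for why well-parallelised code can run so much faster despite the lower clock.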
NVIDIA’s CUDA initiative is about making this capability available to any application. NVIDIA made changes to its hardware to make it more amenable to standard C code, and delivered CUDA C with extensions to support it. In essence it is pretty simple. The extensions let you specify functions to execute on the GPU, allocate memory for pointers on the GPU, and copy memory between the GPU (called the device) and the main memory on the PC (called the host). You can also synchronize threads and use shared memory between threads.
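The shape of that workflow can be shown without a GPU. What follows is a toy model in plain Python, not CUDA: the `ToyDevice` class and all its method names are hypothetical stand-ins of mine. In real CUDA C the corresponding pieces are the `__global__` function qualifier, `cudaMalloc`, `cudaMemcpy` and the `<<<...>>>` launch syntax.

```python
class ToyDevice:
    """Stand-in for a GPU with its own memory, separate from the host's."""

    def __init__(self):
        self._memory = {}
        self._next_handle = 0

    def alloc(self, size):
        """Reserve device memory and return an opaque handle to it."""
        handle = self._next_handle
        self._next_handle += 1
        self._memory[handle] = [0.0] * size
        return handle

    def copy_to_device(self, handle, host_data):
        """Copy host data into a device buffer of the same length."""
        buffer = self._memory[handle]
        if len(host_data) != len(buffer):
            raise ValueError("size mismatch between host data and buffer")
        buffer[:] = host_data

    def copy_to_host(self, handle):
        """Return a host-side copy of a device buffer."""
        return list(self._memory[handle])

    def launch(self, kernel, n_threads, *handles):
        """Run the kernel once per thread index."""
        buffers = [self._memory[h] for h in handles]
        # A real GPU runs these threads in parallel; this loop only models
        # the "one function call per thread index" programming style.
        for thread_id in range(n_threads):
            kernel(thread_id, *buffers)


def add_kernel(thread_id, a, b, out):
    """Device function: each thread handles exactly one element."""
    out[thread_id] = a[thread_id] + b[thread_id]


def vector_add(host_a, host_b):
    """Allocate, copy in, launch, copy out: the usual host-side sequence."""
    n = len(host_a)
    device = ToyDevice()
    d_a, d_b, d_out = device.alloc(n), device.alloc(n), device.alloc(n)
    device.copy_to_device(d_a, host_a)
    device.copy_to_device(d_b, host_b)
    device.launch(add_kernel, n, d_a, d_b, d_out)
    return device.copy_to_host(d_out)


if __name__ == "__main__":
    print(vector_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

The point of the model is the division of labour: the host never touches device memory directly, and the kernel is written from the point of view of a single thread.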
The reward is great performance, but there are several disadvantages. One is the challenge of concurrent programming and the subtle bugs it can introduce.
Another is the hassle of copying memory between host and device. The device is in effect a computer within a computer. Shifting data between the two is relatively slow.
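A back-of-envelope model shows why the copying matters. The sketch below assumes transfer time is simply bytes divided by bus bandwidth, ignoring latency and any overlap of copying with computation; every number in it is invented for illustration, and the function names are hypothetical.

```python
def gpu_total_seconds(bytes_moved, bandwidth_bytes_per_s, kernel_seconds):
    """Time on the GPU path: copying data both ways plus the computation."""
    return bytes_moved / bandwidth_bytes_per_s + kernel_seconds


def worth_offloading(cpu_seconds, bytes_moved, bandwidth_bytes_per_s,
                     kernel_seconds):
    """True if the GPU path, including transfers, beats staying on the CPU."""
    total = gpu_total_seconds(bytes_moved, bandwidth_bytes_per_s,
                              kernel_seconds)
    return total < cpu_seconds


if __name__ == "__main__":
    moved = 2e9      # assumed: 1 GB copied in, 1 GB copied back
    bandwidth = 5e9  # assumed host-to-device bandwidth in bytes per second
    kernel = 0.05    # assumed GPU compute time in seconds
    for cpu in (1.0, 0.3):
        verdict = worth_offloading(cpu, moved, bandwidth, kernel)
        print("CPU time", cpu, "s, offload worthwhile:", verdict)
```

In this example the kernel is 20 times faster than the one-second CPU job, yet the transfers alone take 0.4 seconds, so a job that already runs in 0.3 seconds on the CPU is better left where it is.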
A third is that CUDA is proprietary to NVIDIA. If you want your code to work with ATI’s equivalent, called Stream, then you should use the OpenCL library, though I’ve noticed that most people here seem to use CUDA; I presume they are able to specify the hardware and would rather avoid the compromises of a cross-GPU library. In the worst case, if you need to support both CUDA and non-CUDA systems, you might need to support different code paths depending on what is detected at runtime.
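Runtime detection of that kind might look something like the following sketch. It only checks whether the `pycuda` or `pyopencl` Python bindings are installed, using the standard library's `importlib.util.find_spec`; an installed binding does not prove that a usable GPU is present, so treat this as the outline of a dispatch scheme and not a complete probe. The preference order and function names are my assumptions.

```python
import importlib.util

# Backends in order of preference, with the Python module that would be
# needed for each. The ordering is an assumption made for this sketch.
BACKEND_MODULES = (("cuda", "pycuda"), ("opencl", "pyopencl"))


def module_installed(module_name):
    """True if the named top-level module can be found on this system."""
    return importlib.util.find_spec(module_name) is not None


def pick_backend(is_available=module_installed):
    """Return the first backend whose module is available, else 'cpu'."""
    for backend, module_name in BACKEND_MODULES:
        if is_available(module_name):
            return backend
    return "cpu"


if __name__ == "__main__":
    print("selected code path:", pick_backend())
```

Passing the availability check in as a parameter keeps the selection logic testable on machines with no GPU libraries at all.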
It is all a bit messy, though there are tools and libraries to simplify the task. For example, this morning we heard about GMAC, which makes host and device appear to use a single address space, though I imagine there are performance implications.
NVIDIA says it is democratizing supercomputing, bringing high performance computing within reach for almost anyone. There is something in that; but at the same time as a developer I would rather not think about whether my code will execute on the CPU or the GPU. Viewed at the highest level, I find it disappointing that to get great performance I need to bolster the capabilities of the CPU with a specialist add-on. The triumph of the GPU is in a sense the failure of the CPU. Convergence in some form or other strikes me as inevitable.