Which is better for massively parallel computing, a GPU accelerator board from NVidia, or Intel’s new Xeon Phi? On the eve of NVidia’s GPU Technology Conference comes a paper which Intel will enjoy. Erik Sauley, Kamer Kayay, and Umit V. C atalyurek from the Ohio State University have issued a paper with performance comparisons between Xeon Phi, NVIDIA Tesla C2050 and NVIDIA Tesla K20. The K20 has 2,496 CUDA cores, versus a mere 61 processor cores on the Xeon Phi, yet on the particular calculations under test the researchers got generally better performance from Xeon Phi.
In the case of sparse-matrix vector multiplication (SpMV):
For GPU architectures, the K20 card is typically faster than the C2050 card. It performs better for 18 of the 22 instances. It obtains between 4.9 and 13.2GFlop/s and the highest performance on 9 of the instances. Xeon Phi reaches the highest performance on 12 of the instances and it is the only architecture which can obtain more than 15GFlop/s.
and in the case of sparse-matrix matrix multiplication (SpMM):
The K20 GPU is often more than twice faster than C2050, which is much better compared with their relative performances in SpMV. The Xeon Phi coprocessor gets
the best performance in 14 instances where this number is 5 and 3 for the CPU and GPU configurations, respectively. Intel Xeon Phi is the only architecture which achieves more than 100GFlop/s.
Note that this is a limited test, and that the authors note that SpMV computation is known to be a difficult case for GPU computing:
the irregularity and sparsity of SpMV-like kernels create several problems for these architectures.
They also note that memory latency is the biggest factor slowing performance:
At last, for most instances, the SpMV kernel appears to be memory latency bound rather than memory bandwidth bound
It is difficult to compare like with like. The Xeon Phi implementation uses OpenMP, whereas the GPU implementation uses CuSparse. I would also be interested to know whether as much effort was made to optimise for the GPU as for the Xeon Phi.
Still, this is a real-world test that, if nothing else, demonstrates that in the right circumstances the smaller number of cores in a Xeon Phi do not prevent it comparing favourably against a GPU accelerator:
When compared with cutting-edge processors and accelerators, its SpMV, and especially SpMM, performance are superior thanks to its wide registers
and vectorization capabilities. We believe that Xeon Phi will gain more interest in HPC community in the near future.
The paper mentioned in your blog is an unpublished paper and only compares CSR formats between Xeon Phi and K20.
You may be interested in the below paper, which has been published at ICS 2013 and compares Xeon Phi with the best matrix formats on K20X
http://dl.acm.org/citation.cfm?id=2465013