In a recent paper (see arXiv:1012.3911) we showed how to solve the time-dependent Schrödinger equation and the time-dependent Dirac equation with a Fourier split operator method on GPU hardware. The Fourier split operator method requires a fast Fourier transform (FFT) in each time step, and the FFT dominates the overall computing time. For this reason I evaluated various kinds of GPU hardware for calculating FFTs.
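For concreteness, here is a minimal sketch in CUDA of what one split-operator time step looks like, assuming a one-dimensional grid of N points and precomputed phase factors exp(-iVΔt/2) and exp(-iTΔt) stored in device arrays; the kernel and variable names are illustrative and not taken from the code behind the paper:

```cuda
// Sketch of one split-operator time step: half potential step,
// forward FFT, full kinetic step, inverse FFT, half potential step.
#include <cufft.h>
#include <cuda_runtime.h>

__global__ void apply_phase(cufftDoubleComplex *psi,
                            const cufftDoubleComplex *phase, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        cufftDoubleComplex a = psi[i], b = phase[i];
        psi[i].x = a.x * b.x - a.y * b.y;   // complex multiplication
        psi[i].y = a.x * b.y + a.y * b.x;
    }
}

void time_step(cufftHandle plan, cufftDoubleComplex *psi,
               const cufftDoubleComplex *half_step_V,   // exp(-i V dt/2)
               const cufftDoubleComplex *full_step_T,   // exp(-i T dt)
               int N) {
    int threads = 256, blocks = (N + threads - 1) / threads;
    apply_phase<<<blocks, threads>>>(psi, half_step_V, N);
    cufftExecZ2Z(plan, psi, psi, CUFFT_FORWARD);   // to momentum space
    apply_phase<<<blocks, threads>>>(psi, full_step_T, N);
    cufftExecZ2Z(plan, psi, psi, CUFFT_INVERSE);   // back to position space
    apply_phase<<<blocks, threads>>>(psi, half_step_V, N);
    // Note: cuFFT's inverse transform is unnormalized; a factor 1/N
    // still has to be applied (or folded into one of the phase factors).
}
```

Each time step thus incurs two cuFFT calls, which makes it plain why the FFT dominates the total run time.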
I started with an Nvidia GeForce GTX480 (Fermi architecture), which is a consumer graphics card, and got quite satisfactory performance gains compared to traditional CPU implementations of FFTs; see below and arXiv:1012.3911 for details. Consumer graphics cards such as the GTX480, however, have reduced double precision performance. Tesla compute modules have a double precision peak performance that is about four times higher than that of consumer graphics cards based on the Fermi architecture. I wondered: does this four-fold advantage in double precision peak performance actually show up in FFT performance?
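The FFT timings discussed here can be obtained with a simple measurement loop. The following is a minimal sketch of such a benchmark for double precision in-place transforms with cuFFT, using the customary 5 N log2 N flop count convention for a complex transform; problem size and repetition count are illustrative and not necessarily those of the original benchmark:

```cuda
// Times a repeated 1d double precision complex FFT with CUDA events
// and reports the throughput in GFlop/s.
#include <cufft.h>
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    const int N = 1 << 20, reps = 100;
    cufftDoubleComplex *data;
    cudaMalloc((void **)&data, sizeof(cufftDoubleComplex) * N);
    cudaMemset(data, 0, sizeof(cufftDoubleComplex) * N);

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_Z2Z, 1);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cufftExecZ2Z(plan, data, data, CUFFT_FORWARD);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    double flops = 5.0 * N * log2((double)N) * reps;
    printf("%.1f GFlop/s\n", flops / (ms * 1e-3) / 1e9);

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```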
Because I do not have a Tesla compute module, I ran a few benchmarks in the Amazon cloud on a Cluster GPU instance. Each Cluster GPU instance is equipped with two Nvidia Tesla M2050 GPUs. Setting up a node in the cloud was easy, but FFT performance was quite disappointing. Especially for small problems, the Tesla M2050 performed quite poorly compared to the GeForce GTX480, see below. I am not able to pin down the reason for the observed performance degradation definitively. However, I speculate that virtualization takes its toll here.
Finally, I was also able to run my benchmark without a virtualization layer on a Tesla S2050 system (which contains four Tesla M2050 GPUs). Many thanks to Megaware for providing computing time. Performance measurements on this system support my conjecture that the Amazon Cluster GPU instance was slowed down by the cloud's virtualization layer. If my benchmark runs on »bare metal«, Tesla M2050 GPUs give approximately the same FFT performance as GeForce GTX480 GPUs do. However, given that a Tesla M2050 GPU has an about four times higher double precision peak performance than a GeForce GTX480, one might expect even better FFT performance from the Tesla M2050. That we do not see this indicates that the FFT is inherently memory-bandwidth bound. In fact, to come close to the GeForce GTX480 we have to turn off error correction (ECC) for the Tesla M2050 GPU. (The GeForce GTX480's memory has no error correction.) Even without error correction the Tesla M2050 GPU is slightly slower than a GeForce GTX480 because of its slightly lower clock rate.
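A crude back-of-envelope model makes the bandwidth argument quantitative. Assume the library computes a double complex transform of size N in log_r N radix-r passes, and that each pass streams the 16N-byte array once in and once out of device memory; the exact pass structure of cuFFT is not public, so the numbers are only indicative. At memory bandwidth B the attainable flop rate is then

```latex
F \approx \frac{5 N \log_2 N \cdot B}{2 \cdot 16 N \cdot \log_r N}
  = \frac{5}{32}\, B \log_2 r \,.
```

With radix-8 passes and the M2050's nominal 148 GB/s this gives roughly 0.47 × 148 GB/s ≈ 70 GFlop/s, far below the 515 GFlop/s double precision peak of the M2050, so the transform cannot be compute bound. As an aside, ECC can typically be toggled with nvidia-smi (on recent drivers, something like nvidia-smi -i 0 -e 0 followed by a reboot).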