Morgan Kaufmann Publishers has published one of the very first books on Nvidia's popular Compute Unified Device Architecture (CUDA) parallel computing technology, Programming Massively Parallel Processors: A Hands-on Approach, written by two experts in the field, David B. Kirk and Wen-mei W. Hwu. Kirk was Chief Scientist at Nvidia and is currently an Nvidia Fellow. Hwu is the principal investigator for the first Nvidia CUDA Center of Excellence at the University of Illinois at Urbana-Champaign.
Chapters 1 and 2 introduce graphics processing units (GPUs) as massively parallel computers and present a short history of GPU computing. In the book's main part, chapters 3 to 6, the authors present a hands-on introduction to CUDA by developing a parallel matrix multiplication routine. Starting from a naive parallel implementation of the textbook matrix multiplication formula, the authors show how to improve its performance by taking the hardware characteristics of modern GPUs into account; a sketch of this progression is given below. The book describes CUDA thread organization, memory hierarchies, memory bandwidth, data prefetching, and other techniques that may be necessary to achieve maximal GPU performance. Chapter 7 is a general chapter on floating-point arithmetic, and chapters 8 and 9 present two real-world case studies of CUDA applications. The book finishes with a general chapter on parallel programming and a short introduction to OpenCL.
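To give a flavor of the book's running example, here is a minimal sketch of the two stages it walks through: a naive kernel in which each thread computes one element of C = A * B directly from global memory, and a tiled variant that stages sub-matrices in on-chip shared memory, reducing global-memory traffic by a factor of TILE_WIDTH. This is my own illustrative code, not the authors'; all names are assumptions, and the tiled kernel assumes the matrix dimension N is a multiple of TILE_WIDTH.

    // Naive kernel: one thread per element of the N x N result matrix C.
    __global__ void matrixMulNaive(const float *A, const float *B, float *C, int N)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)   // every operand is read from slow global memory
                sum += A[row * N + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }

    #define TILE_WIDTH 16

    // Tiled kernel: each block cooperatively stages TILE_WIDTH x TILE_WIDTH
    // sub-matrices of A and B in fast shared memory before using them.
    // Illustrative sketch; assumes N % TILE_WIDTH == 0.
    __global__ void matrixMulTiled(const float *A, const float *B, float *C, int N)
    {
        __shared__ float As[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];
        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
        float sum = 0.0f;
        for (int t = 0; t < N / TILE_WIDTH; ++t) {
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_WIDTH + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * N + col];
            __syncthreads();              // wait until the whole tile is loaded
            for (int k = 0; k < TILE_WIDTH; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();              // wait before the tile is overwritten
        }
        C[row * N + col] = sum;
    }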
The book is very clearly written and well organized. It should be accessible to everybody who is familiar with the C programming language; prior knowledge of parallel programming is not required but may be useful. In fact, I sometimes felt a little bored while reading the book because some very basic topics are explained at considerable length, for example in the chapter on floating-point arithmetic. Nevertheless, I think the book is a valuable introduction to parallel programming in the CUDA environment.
The book's main merit is that it gives a good understanding of the Tesla and Fermi (in chapter 12) GPU hardware architectures, which is essential for writing high-performance CUDA code. The book also gives an introduction to the CUDA runtime API, but here it only scratches the surface. Thus, in order to write complex CUDA applications, or applications that use the CUDA driver API, one also has to read Nvidia's original CUDA API documentation. However, the book's introduction to the runtime API is sufficient for writing useful CUDA applications; a typical host-side sequence is sketched below. The book by Kirk and Hwu and Nvidia's CUDA documentation complement each other. This distinguishes Kirk and Hwu's work from the vast majority of computing and programming books, which merely rephrase API documentation and man pages.
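For readers who have not seen the runtime API, the host-side pattern the book teaches looks roughly like the following sketch. The runtime calls (cudaMalloc, cudaMemcpy, cudaFree) are the standard ones; the wrapper function runMatrixMul and the kernel name matrixMulNaive (from the sketch above) are my own illustrative assumptions.

    #include <cuda_runtime.h>

    // Host-side sketch: allocate device memory, copy the inputs over,
    // launch the kernel, and copy the result back to the host.
    void runMatrixMul(const float *hA, const float *hB, float *hC, int N)
    {
        size_t bytes = (size_t)N * N * sizeof(float);
        float *dA, *dB, *dC;
        cudaMalloc((void **)&dA, bytes);
        cudaMalloc((void **)&dB, bytes);
        cudaMalloc((void **)&dC, bytes);
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

        dim3 block(16, 16);                       // 256 threads per block
        dim3 grid((N + 15) / 16, (N + 15) / 16);  // enough blocks to cover C
        matrixMulNaive<<<grid, block>>>(dA, dB, dC, N);

        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
        cudaFree(dA);
        cudaFree(dB);
        cudaFree(dC);
    }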
In summary, Programming Massively Parallel Processors is a must-have for every CUDA programmer.