Programming Massively Parallel Processors: A Hands-on Approach (Applications of GPU Computing Series)

Language: English

Pages: 280

ISBN: 0123814723

Format: PDF / Kindle (mobi) / ePub

Programming Massively Parallel Processors discusses basic concepts of parallel programming and GPU architecture. "Massively parallel" refers to the use of a large number of processors to perform a set of computations in a coordinated parallel way. The book details various techniques for constructing parallel programs. It also discusses the development process, performance, floating-point format, parallel patterns, and dynamic parallelism. The book serves as a teaching guide for courses in which parallel programming is the main topic. It builds on the basics of C programming for CUDA, a parallel programming environment supported on NVIDIA GPUs.
Composed of 12 chapters, the book begins with basic information about the GPU as a parallel computing platform. It also explains the main concepts of CUDA, data parallelism, and the importance of memory access efficiency in CUDA.
The target audience of the book is graduate and undergraduate students from all science and engineering disciplines who need to learn computational thinking and parallel programming.

  • Teaches computational thinking and problem-solving techniques that facilitate high-performance parallel computing.
  • Utilizes CUDA (Compute Unified Device Architecture), NVIDIA's software development tool created specifically for massively parallel environments.
  • Shows you how to achieve both high performance and high reliability using the CUDA programming model as well as OpenCL.

Genetic Programming Theory and Practice XI (Genetic and Evolutionary Computation)

Principles of Transaction Processing for the Systems Professional (1st Edition)

Start Concurrent: An Introduction to Problem Solving in Java with a Focus on Concurrency (2013 Edition)

parallelism model concepts, NDRange API calls, and memory types to their CUDA equivalents. On the other hand, OpenCL has a more complex device management model that reflects its support for multiplatform and multivendor portability. While the OpenCL standard is designed to support code portability across devices produced by different vendors, such portability does not come for free. OpenCL programs must be prepared to deal with much greater hardware diversity and thus will exhibit more
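
The mapping the excerpt refers to can be illustrated with a short CUDA kernel; the correspondences in the comments are the standard OpenCL-to-CUDA equivalences, stated here from general knowledge rather than quoted from the book's mapping table:

    // Assumed standard correspondences:
    //   OpenCL work-item         -> CUDA thread
    //   OpenCL work-group        -> CUDA thread block
    //   OpenCL NDRange           -> CUDA grid
    //   OpenCL __local memory    -> CUDA __shared__ memory
    //   OpenCL get_global_id(0)  -> blockIdx.x * blockDim.x + threadIdx.x
    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // OpenCL: get_global_id(0)
        if (i < n) c[i] = a[i] + b[i];                  // guard against the partial last block
    }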

rectangular collection of data locations. This is not a new copy of the data but rather a new way to access the existing memory locations. The template has two parameters: the type of the elements of the source data, and an integer that indicates the dimensionality of the array_view. Throughout C++ AMP, template parameters that indicate dimensionality are referred to as the rank of the type or object. In this example, we have a 1D array_view (or an array_view of rank 1) of C++ float values. The
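
As a minimal sketch of the idea, assuming the standard C++ AMP API from amp.h (the function and variable names here are illustrative, not the book's):

    #include <amp.h>
    using namespace concurrency;

    void doubleAll(float* data, int n) {
        // Rank-1 view over the existing n floats; no copy of the data is made.
        array_view<float, 1> av(n, data);
        parallel_for_each(av.extent, [=](index<1> idx) restrict(amp) {
            av[idx] *= 2.0f;             // runs on the accelerator
        });
        av.synchronize();                // make results visible in host memory
    }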

are forced to find four-fold to eight-fold parallelism to fully utilize these processors. Many of them resort to coarse-grained parallelism strategies in which different tasks of an application are performed in parallel. Such applications must often be rewritten to expose more parallel tasks for each successive doubling of core count. In contrast, the highly multithreaded GPUs encourage the use of massive, fine-grained data parallelism in CUDA. Efficient threading support in GPUs allows applications

created by the runtime system for each thread. They will reside in registers that are accessible by one thread. They are initialized with the threadIdx and blockIdx values and used many times during the lifetime of the thread. Once the thread ends, the values of these variables also cease to exist. Lines 5 and 6 determine the row index and column index of the d_P element that the thread is to produce. As shown in line 6, the horizontal (x) position, or the column index of the d_P element to be
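
The line numbers in the excerpt refer to a kernel listing not reproduced on this page. A rough reconstruction of that index computation, using the book's d_M/d_N/d_P/Width naming, is:

    __global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width) {
        // Row and column of the d_P element this thread produces
        // (the "lines 5 and 6" the excerpt refers to):
        int Row = blockIdx.y * blockDim.y + threadIdx.y;
        int Col = blockIdx.x * blockDim.x + threadIdx.x;  // horizontal (x) position
        if (Row < Width && Col < Width) {
            float Pvalue = 0;
            for (int k = 0; k < Width; ++k)
                Pvalue += d_M[Row * Width + k] * d_N[k * Width + Col];
            d_P[Row * Width + Col] = Pvalue;  // one output element per thread
        }
    }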

block slots and thread slots. For example, if each block has 128 threads, the 1,536 thread slots can be partitioned and assigned to 12 blocks. However, since there are only 8 block slots in each SM, only 8 blocks will be allowed. This means that only 1,024 of the thread slots will be utilized. Therefore, to fully utilize both the block slots and the thread slots, one needs at least 192 threads in each block: 8 blocks of 192 threads fill all 8 block slots and all 1,536 thread slots. As we mentioned in Chapter 4, the automatic variables declared in a CUDA kernel are placed
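
The occupancy arithmetic above can be checked with a few lines of host code; the slot counts (1,536 thread slots and 8 block slots per SM) are the figures assumed in the excerpt:

    #include <stdio.h>

    // Resident blocks per SM = whichever limit is reached first.
    int residentBlocks(int threadSlots, int blockSlots, int threadsPerBlock) {
        int byThreads = threadSlots / threadsPerBlock;
        return byThreads < blockSlots ? byThreads : blockSlots;
    }

    int main(void) {
        int sizes[] = {128, 192, 256};
        for (int i = 0; i < 3; ++i) {
            int b = residentBlocks(1536, 8, sizes[i]);
            printf("%3d threads/block -> %d blocks, %4d of 1536 thread slots used\n",
                   sizes[i], b, b * sizes[i]);
        }
        return 0;   // prints 1024, 1536, and 1536 used slots, respectively
    }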
