Notes for Heterogeneous Parallel Programming By W. Hwu

Jackcui NJU Loser

UIUC ECE 408 on Coursera

Latency cores (CPU cores) & throughput cores (GPU cores)

  • Different design philosophy
CPU
  • Powerful ALU
    • Reduced operation latency
  • Large caches
    • Convert long latency memory accesses to short latency cache accesses
  • Sophisticated control
    • Branch prediction for reduced branch latency
    • Data forwarding for reduced data latency
GPU
  • Small caches
    • To boost memory throughput
  • Simple control
    • No branch prediction
    • No data forwarding
  • Energy efficient ALUs
    • Many ALUs; each has long latency but is heavily pipelined for high throughput
  • Require massive number of threads to tolerate latencies

Therefore, we use the CPU for the sequential parts of an application and the GPU for the parallel parts.

Scalability & Portability

  • Scalability
    • The same application runs efficiently on new generations of cores
    • The same application runs efficiently on more of the same cores
  • Portability
    • The same application runs efficiently on different types of cores
    • The same application runs efficiently on systems with different organizations and interfaces
      • Portability across many different HW types
      • x86 vs. ARM, etc.
      • Latency oriented CPUs vs. throughput oriented GPUs
      • VLIW vs. SIMD vs. threading
      • Shared memory vs. distributed memory

CUDA Basics

  • CUDA: Extension of C
  • A thread is a “virtualized” or “abstracted” Von Neumann processor


  • Divide thread array into multiple blocks
    • Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
    • Threads in different blocks do not interact
    • blockIdx: 1D, 2D, or 3D (3D grids since CUDA 4.0); threadIdx: 1D, 2D, or 3D (the two combine into a global index, as sketched below)
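As a quick illustration (my own sketch, not from the lecture; the kernel name indexDemo is made up), the built-in block and thread indices combine into one global index per thread:

__global__ void indexDemo(int* out, int width)
{
    // Hypothetical demo; assumes the grid exactly covers a width x height array
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row
    out[row * width + col] = row * width + col;        // flatten 2D position to a 1D offset
}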
Memory management
  • cudaMalloc()
  • cudaFree()
  • cudaMemcpy()
    • similar to C's malloc/free/memcpy, but with some differences
    • for pageable host memory the copy may still be in flight when the call returns (the data is first staged), but cudaMemcpy() is synchronous in the sense that the host buffer is safe to reuse on return; the explicitly non-blocking variant is cudaMemcpyAsync() (see the sketch after the vecAdd listing below)
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);

    // Kernel invocation code - to be shown later

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
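As an aside (my own sketch, not from the slides; asyncCopyDemo is a made-up name): the explicitly non-blocking copy is cudaMemcpyAsync(), which takes a stream argument and pairs with page-locked host memory:

#include <cuda_runtime.h>

// Sketch: non-blocking copy into an existing device buffer d_A of `size` bytes
void asyncCopyDemo(float* d_A, int size)
{
    float* h_A;
    cudaMallocHost((void **) &h_A, size);   // page-locked (pinned) host memory
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, stream);
    // the host is free to do other work here while the copy is in flight
    cudaStreamSynchronize(stream);          // block until the copy completes
    cudaStreamDestroy(stream);
    cudaFreeHost(h_A);
}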
cudaError_t err = cudaMalloc((void **) &d_A, size);
if (err != cudaSuccess) {
    printf("%s in %s at line %d\n",
           cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
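A common convenience (my addition, not from the course; CUDA_CHECK is a made-up name) is to wrap this pattern in a macro so every API call can be checked in one line:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper: wraps the error check shown above around any CUDA call
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            printf("%s in %s at line %d\n",                        \
                   cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc((void **) &d_A, size));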
__global__
void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];   // boundary check
}
void vecAdd(float* h_A, float* h_B, float* h_C, int n)   // Host Code
{
    // d_A, d_B, d_C are the device buffers set up as in the earlier listing
    dim3 DimGrid((n - 1) / 256 + 1, 1, 1);   // ceil(n / 256) blocks
    dim3 DimBlock(256, 1, 1);                // 256 threads per block
    vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);
}
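Putting the pieces together (my own sketch, assuming the listings above are combined into one file; the launch fills the "Kernel invocation code" slot elided earlier):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];   // boundary check
}

int main(void)
{
    const int n = 1000;
    int size = n * sizeof(float);
    float h_A[1000], h_B[1000], h_C[1000];
    for (int i = 0; i < n; ++i) { h_A[i] = (float) i; h_B[i] = 2.0f * i; }

    float *d_A, *d_B, *d_C;
    cudaMalloc((void **) &d_A, size);
    cudaMalloc((void **) &d_B, size);
    cudaMalloc((void **) &d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    dim3 DimGrid((n - 1) / 256 + 1, 1, 1);   // ceil(n / 256) = 4 blocks
    dim3 DimBlock(256, 1, 1);                // 256 threads per block
    vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    printf("h_C[10] = %f\n", h_C[10]);       // expect 30.0
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}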
  • A kernel function must return void
  • __device__ and __host__ can be used together (see the sketch after this list)
  • __host__ is optional when used alone (plain functions are host functions by default)
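For example (my own sketch; square and squareKernel are made-up names), a small utility compiled for both sides:

// nvcc compiles two versions of this function, one for the CPU and one for the GPU
__host__ __device__ float square(float x)
{
    return x * x;
}

__global__ void squareKernel(float* data, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) data[i] = square(data[i]);   // device-side call
}
// on the host, square(3.0f) runs the CPU-compiled version of the same code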