
CUDA

Definition

  • Compute Unified Device Architecture
  • A general-purpose parallel programming platform and model

Scalability with CUDA

  • A GPU has an array of Streaming Multiprocessors
  • A program can be partitioned into blocks that execute independently
  • This allows performance to scale with the number of GPU multiprocessors

Architecture

Kernel

  • CUDA extends C by allowing the programmer to define C functions, called kernels
    • A kernel is executed in parallel by N different CUDA threads, rather than only once in serial
    • Defined with the specifier __global__
  • The number of blocks and threads is given using the execution configuration <<<NBlocks, NThreads>>>, used as func_name<<<NBlocks, NThreads>>>(params);
    • These take an integer or a dim3, so they can also be called the dimension of blocks and the dimension of threads
  • Each thread is given an ID, accessible through the threadIdx variable
  • Example:
    // Add array A and B of size N together, store it in C
    #define N 2
    // Kernel definition
    __global__ void VecAdd(float* A, float* B, float* C) {
        // Each thread performs one pairwise addition
        int i = threadIdx.x;
        // A, B and C must point to memory allocated on the GPU (device memory)
        C[i] = A[i] + B[i];
    }
    int main() {
        ...
        // Kernel invocation with N threads
        VecAdd<<<1, N>>>(A, B, C);
        ...
    }
    
  • Note to self: run nvcc with -arch=native, otherwise it won't work.

Device code and Host code (Important)

  • The compiler nvcc separates the source code into host and device code
    • Device code (kernels): definitions marked with __global__, called from host code (with func<<<x, y>>>(a, b)), and run on the device (GPU)
    • Host code: plain C code, processed by the system compiler such as gcc or clang
  • If device code definitions are used, the compiled program executes on both the CPU and the GPU (a host-side sketch follows)
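  • The "..." in the earlier example hides the host-side setup. A minimal sketch of what it might contain for VecAdd, assuming float arrays of size N (variable names, initial values, and the compile command are illustrative, not from the notes):
    #include <cuda_runtime.h>
    #define N 2
    // Kernel definition (as above)
    __global__ void VecAdd(float* A, float* B, float* C) {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }
    int main() {
        // Compile with something like: nvcc -arch=native vec_add.cu -o vec_add
        float hA[N] = {1.0f, 2.0f}, hB[N] = {3.0f, 4.0f}, hC[N];
        float *dA, *dB, *dC;
        // Allocate device (GPU) memory for the three arrays
        cudaMalloc(&dA, N * sizeof(float));
        cudaMalloc(&dB, N * sizeof(float));
        cudaMalloc(&dC, N * sizeof(float));
        // Copy the inputs from host to device
        cudaMemcpy(dA, hA, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, N * sizeof(float), cudaMemcpyHostToDevice);
        // Kernel invocation with 1 block of N threads
        VecAdd<<<1, N>>>(dA, dB, dC);
        // cudaMemcpy waits for the kernel to finish, then copies the result back
        cudaMemcpy(hC, dC, N * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }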

Thread hierarchy

  • threadIdx is a 3-component vector
  • Enables threads to be identified in 1- to 3-dimensional blocks, which can form a grid of thread blocks
  • The number of threads depends on the size of the data being processed
  • Features:
    • There is a maximum number of threads per block, because all threads of a block reside on the same processor core
    • A kernel can be executed by multiple equally shaped thread blocks, so the total number of threads is threads per block times the number of blocks
  • Example:
    #define N 10
    // Kernel definition
    __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) {
        int i = threadIdx.x;
        int j = threadIdx.y;
        C[i][j] = A[i][j] + B[i][j];
    }
    int main() {
        ...
        // Kernel invocation with one block of N * N * 1 threads
        int numBlocks = 1;
        dim3 threadsPerBlock(N, N);
        MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
        ...
    }
    
  • A complete example: see code in ./assets/code/

Block Hierarchy

  • Blocks can be organized into 1- to 3-dimensional grids; the grid size depends on the input data
  • Accessing:
    • Block index within the grid: blockIdx
    • Block dimensions (threads per block): blockDim
  • Example: multi block kernel:
    #define N 16
    // Kernel definition
    __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < N && j < N)
            C[i][j] = A[i][j] + B[i][j];
    }
    int main() {
        ...
        // Kernel invocation
        dim3 threadsPerBlock(16, 16);
        dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
        MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
        ...
    }
    

Memory Hierarchy

  • Thread: Registers, Local memory
  • Block of threads: Shared memory
  • All Blocks: Global memory
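  • A minimal kernel sketch (illustrative, not from the notes) that touches all three levels: per-thread registers, per-block shared memory, and global memory:
    #define BLOCK_SIZE 256
    // Sums each block's slice of the input; assumes blockDim.x == BLOCK_SIZE (a power of two)
    // and that the input has at least gridDim.x * blockDim.x elements
    __global__ void BlockSum(const float* in, float* blockSums) {
        __shared__ float tile[BLOCK_SIZE];   // shared memory: visible to all threads in this block
        int tid = threadIdx.x;               // register/local variable: private to this thread
        tile[tid] = in[blockIdx.x * blockDim.x + tid];   // read from global memory (visible to all blocks)
        __syncthreads();
        // Tree reduction within the block, done entirely in shared memory
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                tile[tid] += tile[tid + stride];
            __syncthreads();
        }
        if (tid == 0)
            blockSums[blockIdx.x] = tile[0]; // write the block's result back to global memory
    }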

GPU vs. CPU

  • A GPU has more stream processors, and each processor has many more ALUs
  • TODO: see code in assets/code, work through it and the array indexing

Coordinating with CPU

  • GPU kernel launches are asynchronous: control returns to the CPU immediately
  • The CPU needs to synchronize before using the results (see the sketch below)
    • cudaMemcpy(): blocks, then copies
    • cudaMemcpyAsync(): copies without blocking
    • cudaDeviceSynchronize(): blocks the CPU until all preceding CUDA calls have completed
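  • A minimal sketch (illustrative, not from the notes) of the launch-then-synchronize pattern:
    #include <cuda_runtime.h>
    #include <stdio.h>
    #define N 256
    __global__ void Scale(float* data, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) data[i] *= factor;
    }
    int main() {
        float h[N], *d;
        for (int i = 0; i < N; i++) h[i] = 1.0f;
        cudaMalloc(&d, N * sizeof(float));
        cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);
        // The launch returns immediately; the CPU could do unrelated work here
        Scale<<<1, N>>>(d, 2.0f);
        // Block the CPU until all preceding CUDA calls have completed
        cudaDeviceSynchronize();
        // cudaMemcpy() would also have waited for the kernel before copying
        cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);
        printf("h[0] = %f\n", h[0]);   // prints 2.000000
        cudaFree(d);
        return 0;
    }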