4.2 KiB
4.2 KiB
CUDA
Definition
- Compute Unified Device Architecture
- Is a general purpose parallel programming platform and model
Scalability with CUDA
- GPU is has an array of Streaming Multiprocessors
- A program can be partitioned into blocks that execute independently
- Allow performance to scale with the number of GPU multiprocessors
Architecture
Kernel
- CUDA extends C by allow to define C functions, which is called a kernel
- Kernel is executed in parallel, by N different CUDA threads, rather than only once in serial
- Defined with specifier
__global__
- Number of threads is given using
<<<NBlocks, NThreads>>>
, and used withfunc_name<<<NBlocks, NThreads>>>(params);
- They take an integer or dim3, so it cal also be called Dimension of Blocks, or Dimension of Threads
- Thread number is given an ID, accessible by
threadIdx
variable - Example:
// Add array A and B of size N together, store it in C #define N 2 // Kernel definition __global__ void VecAdd(float* A, float* B, float* C) { // Each thread performs one pair wise addition int i = threadIdx.x; // The array is a float, meaning we need to allocate memory on GPU C[i] = A[i] + B[i]; } int main() { ... // Kernel invocation with N threads VecAdd<<<1, N>>>(A, B, C); ... }
- Note to self: run
nvcc
with-arch=native
otherwise it won't work.
Device code and Host code
- Compiler
nvcc
separates source code into host and device code- Device code (kernels): definition marked with
__global__
, called from host code (withfunc<<<x, y>>>(a, b)
), and run on device (GPU) - Host code: normal C code, processed by system compiler like
gcc
orclang
- Device code (kernels): definition marked with
- If the device code definitions are used, the compiled code execute on both CPU and GPU
Thread hierarchy
-
threadIdx
is a 3 component vector -
Enables threads to be identified in 1-3 dimension blocks, which can form a grid of thread blocks
-
Number of threads depend on the size of data being processed
-
Features:
- Maximum number of threads per block: because they reside on the same block, on the same processor core
- A kernel can be executed by multiple equally shaped thread blocks, so the total number of threads is threads per block times blocks
-
Example:
#define N 10 // Kernel definition __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) { int i = threadIdx.x; int j = threadIdx.y; C[i][j] = A[i][j] + B[i][j]; } int main() { ... // Kernel invocation with one block of N * N * 1 threads int numBlocks = 1; dim3 threadsPerBlock(N, N); MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C); ... }
-
A complete example:
Block Hierarchy
- Blocks can be represented in 1-3 dimensional grids, size is dependent on the input data
- Accessing
- Block index:
blockIdx
- Dimension:
blockDim
- Block index:
- Example: multi block kernel:
#define N 16 // Kernel definition __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) { int i = blockIdx.x * blockDim.x + threadIdx.x; int j = blockIdx.y * blockDim.y + threadIdx.y; if (i < N && j < N) C[i][j] = A[i][j] + B[i][j]; } int main() { ... // Kernel invocation dim3 threadsPerBlock(16, 16); dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y); MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C); ... }
Memory Hierarchy
- Thread: Registers, Local memory
- Block of threads: Shared memory
- All Blocks: Global memory
GPU vs. CPU
- GPU has more stream processors, and each processor has much more ALUs
- TODO: see code in assets/code, and figure shit out, and index array
Coordinating with CPU
- GPU kernel launch are asynchronous, while control returns to CPU immediately
- CPU need to synchronize before using the results
cudaMemcpy()
: block and copycudaMemcpyAsync()
: copy but don't blockcudaDeviceSynchronize()
: Block the CPU until all CUDA calls before have completed