# CUDA

## Definition

- Compute Unified Device Architecture
- A general-purpose **parallel** computing platform and programming model

## Scalability with CUDA

- A GPU has an array of Streaming Multiprocessors (SMs)
- A program can be partitioned into blocks that execute independently
- This allows performance to scale with the number of GPU multiprocessors

## Architecture

### Kernel

- CUDA extends C by letting you define C functions called kernels
    - A kernel is executed in parallel by N different CUDA threads, rather than
      only once in serial
    - Defined with the specifier `__global__`
- The number of threads is given with `<<<NBlocks, NThreads>>>`, used as
  `func_name<<<NBlocks, NThreads>>>(params);`
    - `NBlocks` is the number of blocks and `NThreads` is the number of threads
      per block
    - Each takes an integer or a `dim3`, so they can also be called the
      dimension of blocks and the dimension of threads
- Each thread is given an ID, accessible through the built-in `threadIdx`
  variable
- Example:
    ```c
    // Add array A and B of size N together, store it in C
    #define N 2
    // Kernel definition
    __global__ void VecAdd(float* A, float* B, float* C) {
        // Each thread performs one pairwise addition
        int i = threadIdx.x;
        // A, B and C must point to device (GPU) memory, e.g. allocated with
        // cudaMalloc; host pointers cannot be dereferenced in a kernel
        C[i] = A[i] + B[i];
    }
    int main() {
        ...
        // Kernel invocation with N threads
        VecAdd<<<1, N>>>(A, B, C);
        ...
    }
    ```
- Note to self: run `nvcc` with `-arch=native` otherwise it won't work.
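
The `...` parts of the example above hide the memory handling. A minimal sketch of what a complete program could look like, assuming the usual `cudaMalloc` / `cudaMemcpy` / `cudaFree` pattern; the helper name `run_vec_add`, the test values and the error-handling style are illustrative and not taken from the lecture code:

```c
#include <stdio.h>
#include <cuda_runtime.h>

#define N 2

// Kernel: each thread adds one pair of elements.
__global__ void VecAdd(const float* A, const float* B, float* C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

// Hypothetical helper: copies hA and hB to the GPU, runs VecAdd and copies
// the sums back into hC. Returns 0 on success, 1 if any CUDA call fails.
int run_vec_add(const float* hA, const float* hB, float* hC) {
    float *dA = NULL, *dB = NULL, *dC = NULL;
    size_t bytes = N * sizeof(float);
    int status = 1;

    if (cudaMalloc(&dA, bytes) == cudaSuccess &&
        cudaMalloc(&dB, bytes) == cudaSuccess &&
        cudaMalloc(&dC, bytes) == cudaSuccess &&
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice) == cudaSuccess) {
        // One block of N threads, as in the example above
        VecAdd<<<1, N>>>(dA, dB, dC);
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
            status = 0;
        }
    }
    // cudaFree accepts NULL, so this is safe after a partial failure
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return status;
}

int main(void) {
    float A[N] = {1.0f, 2.0f}, B[N] = {10.0f, 20.0f}, C[N] = {0.0f, 0.0f};
    if (run_vec_add(A, B, C) != 0) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    for (int i = 0; i < N; i++)
        printf("%.1f + %.1f = %.1f\n", A[i], B[i], C[i]);
    return 0;
}
```

Saved as a `.cu` file, this should build with `nvcc` and needs a CUDA-capable GPU to run.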

### Device code and Host code (Important)

- The compiler `nvcc` separates source code into host code and device code
    - Device code (kernels): definitions marked with `__global__`, called from
      host code (with `func<<<x, y>>>(a, b)`), and run on the device (GPU)
    - Host code: normal C code, processed by the system compiler, such as `gcc`
      or `clang`
- If device code is defined and called, the compiled program executes on both
  the CPU and the GPU

### Thread hierarchy

- `threadIdx` is a 3-component vector
- It lets threads be identified within 1- to 3-dimensional blocks, which in turn
  form a **grid** of thread blocks
- The number of threads depends on the size of the data being processed
- Features:
    - There is a maximum number of threads per block, because all threads of a
      block reside on the same processor core and share its limited resources
    - A kernel can be executed by multiple equally shaped thread blocks, so the
      total number of threads is threads per block times the number of blocks
- Example:
    ```c
    #define N 10
    // Kernel definition
    __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) {
        int i = threadIdx.x;
        int j = threadIdx.y;
        C[i][j] = A[i][j] + B[i][j];
    }
    int main() {
        ...
        // Kernel invocation with one block of N * N * 1 threads
        int numBlocks = 1;
        dim3 threadsPerBlock(N, N);
        MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
        ...
    }
    ```
- A complete example: see code in `./assets/code/`
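
The per-block thread limit mentioned under Features can be read at run time. A small sketch, assuming device 0 is the GPU of interest and using the runtime call `cudaGetDeviceProperties`; the helper name `max_threads_per_block` is made up for this note. On current GPUs the limit is 1024, so the 10 x 10 block from the example above (100 threads) fits comfortably:

```c
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: returns the maximum number of threads per block for
// the given device, or -1 if the query fails (e.g. no GPU available).
int max_threads_per_block(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;
    return prop.maxThreadsPerBlock;
}

int main(void) {
    int n = 10;  // block of n x n threads, as in the MatAdd example
    int limit = max_threads_per_block(0);
    if (limit < 0) {
        fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }
    printf("max threads per block: %d\n", limit);
    printf("a %d x %d block uses %d threads: %s\n", n, n, n * n,
           n * n <= limit ? "fits" : "too many");
    return 0;
}
```

If a launch asks for more threads per block than this limit, the launch fails, which is one reason larger inputs are split across several blocks.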

### Block Hierarchy

- Blocks are arranged in a 1- to 3-dimensional **grid**, whose size depends on
  the input data
- Accessing
    - Block index: `blockIdx`
    - Block dimension (threads per block): `blockDim`
- Example: multi-block kernel:
    ```c
    #define N 32
    // Kernel definition
    __global__ void MatAdd(float A[N][N], float B[N][N],
                           float C[N][N]) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        // Guard: threads that fall outside the matrix do nothing
        if (i < N && j < N)
            C[i][j] = A[i][j] + B[i][j];
    }
    int main() {
        ...
        // Kernel invocation with a 2 x 2 grid of 16 x 16 thread blocks
        // (integer division: assumes N is a multiple of the block size)
        dim3 threadsPerBlock(16, 16);
        dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
        MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
        ...
    }
    ```
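
The integer division in the example above only covers the whole matrix when N is a multiple of the block size. A sketch of the common workaround, rounding the block count up and relying on the `if (i < N && j < N)` guard; it assumes flat row-major arrays instead of `float[N][N]`, and the names `MatAddFlat`, `blocks_for` and `check_mat_add` are invented for this note:

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 100     // deliberately not a multiple of BLOCK
#define BLOCK 16  // threads per block in each dimension

// N x N matrices stored row-major in flat arrays.
__global__ void MatAddFlat(const float* A, const float* B, float* C) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)  // threads past the matrix edge do nothing
        C[i * N + j] = A[i * N + j] + B[i * N + j];
}

// Hypothetical helper: blocks needed to cover n elements (ceiling division).
unsigned int blocks_for(unsigned int n, unsigned int block) {
    return (n + block - 1) / block;
}

// Hypothetical helper: adds two test matrices on the GPU and checks every
// element on the CPU. Returns 0 if all elements are correct, 1 otherwise.
int check_mat_add(void) {
    size_t bytes = (size_t)N * N * sizeof(float);
    float* hA = (float*)malloc(bytes);
    float* hB = (float*)malloc(bytes);
    float* hC = (float*)malloc(bytes);
    float *dA = NULL, *dB = NULL, *dC = NULL;
    int status = 1;

    if (hA && hB && hC) {
        for (int k = 0; k < N * N; k++) {
            hA[k] = (float)k;
            hB[k] = 2.0f * k;
            hC[k] = 0.0f;
        }
        if (cudaMalloc(&dA, bytes) == cudaSuccess &&
            cudaMalloc(&dB, bytes) == cudaSuccess &&
            cudaMalloc(&dC, bytes) == cudaSuccess &&
            cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
            cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice) == cudaSuccess) {
            dim3 threadsPerBlock(BLOCK, BLOCK);
            // ceil(100 / 16) = 7 blocks per dimension
            dim3 numBlocks(blocks_for(N, BLOCK), blocks_for(N, BLOCK));
            MatAddFlat<<<numBlocks, threadsPerBlock>>>(dA, dB, dC);
            if (cudaGetLastError() == cudaSuccess &&
                cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
                status = 0;
                for (int k = 0; k < N * N; k++)
                    if (hC[k] != 3.0f * k) status = 1;
            }
        }
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return status;
}

int main(void) {
    printf("blocks per dimension: %u\n", blocks_for(N, BLOCK));
    printf("result %s\n", check_mat_add() == 0 ? "correct" : "WRONG");
    return 0;
}
```

With plain `N / BLOCK` the grid here would be 6 x 6 and the last 4 rows and columns would never be computed; rounding up launches a few idle threads instead.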

### Memory Hierarchy

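- Thread: registers and local memory (private to each thread)
- Block of threads: shared memory (visible to all threads in the block)
- All blocks: global memory (visible to every thread in the grid)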
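
To see the three levels in one place, here is a sketch of a per-block sum. It is an illustration rather than anything from the lecture: the kernel name `BlockSum`, the helper `run_block_sum` and the block size of 256 are assumptions, and the reduction only works when the block size is a power of two.

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define THREADS 256  // threads per block; must be a power of two here

// Each block sums THREADS consecutive inputs and writes one result.
__global__ void BlockSum(const float* in, float* out) {
    __shared__ float tile[THREADS];             // shared memory: one copy per block
    int t = threadIdx.x;                        // per-thread variable, normally a register
    tile[t] = in[blockIdx.x * blockDim.x + t];  // global memory -> shared memory
    __syncthreads();                            // wait until the whole tile is loaded

    // Tree reduction: halve the number of active threads each step
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride)
            tile[t] += tile[t + stride];
        __syncthreads();                        // every thread reaches this barrier
    }
    if (t == 0)
        out[blockIdx.x] = tile[0];              // shared memory -> global memory
}

// Hypothetical helper: feeds blocks * THREADS ones to BlockSum and stores one
// sum per block in hOut. Returns 0 on success, 1 on any failure.
int run_block_sum(float* hOut, int blocks) {
    int n = blocks * THREADS;
    float* hIn = (float*)malloc(n * sizeof(float));
    float *dIn = NULL, *dOut = NULL;
    int status = 1;

    if (hIn) {
        for (int k = 0; k < n; k++) hIn[k] = 1.0f;
        if (cudaMalloc(&dIn, n * sizeof(float)) == cudaSuccess &&
            cudaMalloc(&dOut, blocks * sizeof(float)) == cudaSuccess &&
            cudaMemcpy(dIn, hIn, n * sizeof(float),
                       cudaMemcpyHostToDevice) == cudaSuccess) {
            BlockSum<<<blocks, THREADS>>>(dIn, dOut);
            if (cudaGetLastError() == cudaSuccess &&
                cudaMemcpy(hOut, dOut, blocks * sizeof(float),
                           cudaMemcpyDeviceToHost) == cudaSuccess)
                status = 0;
        }
    }
    cudaFree(dIn); cudaFree(dOut);
    free(hIn);
    return status;
}

int main(void) {
    float sums[4];
    if (run_block_sum(sums, 4) != 0) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    for (int b = 0; b < 4; b++)
        printf("block %d sum = %.1f\n", b, sums[b]);
    return 0;
}
```

Threads in different blocks never see each other's `tile`; the only way results leave a block is through global memory (`out`).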

### GPU vs. CPU

- A GPU has many more stream processors than a CPU has cores, and each
  processor has many more ALUs
- TODO: work through the code in `assets/code`, including the array indexing

## Coordinating with CPU

- GPU kernel launches are **asynchronous**: control returns to the CPU
  immediately
- The CPU needs to **synchronize** before using the results
    - `cudaMemcpy()`: blocks the CPU until the copy completes
    - `cudaMemcpyAsync()`: starts the copy without blocking the CPU
    - `cudaDeviceSynchronize()`: blocks the CPU until all previously issued
      CUDA calls have completed
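
A sketch of how these calls can fit together, assuming everything runs on the default stream (stream `0`) and a single block of at most 1024 threads; the kernel `Fill` and the helper `run_fill` are hypothetical names used only for this illustration:

```c
#include <stdio.h>
#include <cuda_runtime.h>

// Kernel: every thread writes `value` into its own slot.
__global__ void Fill(int* data, int value) {
    data[threadIdx.x] = value;
}

// Hypothetical helper: fills n ints on the GPU (n <= 1024, one block) and
// copies them into hOut. Returns 0 on success, 1 on any CUDA error.
int run_fill(int* hOut, int n, int value) {
    int* d = NULL;
    size_t bytes = n * sizeof(int);
    int status = 1;

    if (cudaMalloc(&d, bytes) == cudaSuccess) {
        // The launch returns at once; the kernel may still be running
        Fill<<<1, n>>>(d, value);

        // Block the CPU until the kernel has finished (also reports errors)
        if (cudaDeviceSynchronize() == cudaSuccess &&
            // Queue the copy; the call may return before the copy is done
            cudaMemcpyAsync(hOut, d, bytes, cudaMemcpyDeviceToHost, 0) == cudaSuccess &&
            // Wait for the copy before the CPU reads hOut
            cudaDeviceSynchronize() == cudaSuccess) {
            status = 0;
        }
    }
    cudaFree(d);
    return status;
}

int main(void) {
    int out[8] = {0};
    if (run_fill(out, 8, 7) != 0) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    for (int i = 0; i < 8; i++)
        printf("out[%d] = %d\n", i, out[i]);
    return 0;
}
```

Replacing the `cudaMemcpyAsync` / `cudaDeviceSynchronize` pair with a single blocking `cudaMemcpy` would give the same result here; the asynchronous form matters when the CPU has other work to do while the copy is in flight.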