# Introduction to GPU

## Definition

- Graphics Processing Unit, also known as Visual Processing Unit
- Used to **accelerate** the computation and processing of images and data,
  either for output to a display device or for modern High-Performance
  Computing
- Very efficient at `computer graphics` and floating-point processing
- History:
  - The programmable GPU was introduced by NVIDIA in 1999
  - Initially used by gamers, artists and game programmers
  - Later adopted by researchers
  - GPGPU was introduced by NVIDIA to allow general-purpose programming
    languages to be used on the GPU
  - CUDA was created by NVIDIA to enable parallel computing on the GPU (GPGPU)

## Components of GPU

### Structure

- A many-core processor
- A five-layer architecture (described below)

### Architecture

#### Host Interface

- Communicates between the host (CPU) and the GPU
  - Receives commands from the CPU and fetches the referenced data from memory
  - Produces vertices for further processing

#### Vertex Processing

- Receives vertices from the host interface and produces output in screen space
- No vertices are added or deleted: a 1:1 mapping between input and output

#### Triangle Setup

- Converts screen-space geometry from the vertex-processing layer into pixels
  in the output (raster format)
- Triangles located outside of the view are discarded
- A pixel is emitted as a fragment only if its center lies inside the triangle

#### Pixel Processing

- Receives fragments from the previous layer, with metadata (attributes)
  attached, which are used to calculate the color of each pixel
- Performs texture mapping and per-pixel math, so it is the most costly layer

#### Memory Interface

- Fragment colors are stored here
  - They are compressed to save space and bandwidth; this is the second most
    costly layer

## CPU and GPU

### Differences

- CPU: a small number of hard tasks
  - Individual and **distinctive** tasks
  - Few cores, designed primarily for sequential and **serial** processing,
    with fewer execution units and **transistors**
  - **Slower** memory interface
  - Fewer pipelines
  - More transistors devoted to control and caching
  - Has L1 (data and instruction) and L2 caches on each core, plus a **shared**
    L3 cache
- GPU: a large number of simple tasks
  - Work that can be broken into **many tiny** parts and processed in parallel
  - Massively **parallel**: thousands of simple cores working on many tasks at
    once, with more execution **units** and **transistors**
  - Much **faster** memory interface
  - More pipelines than the CPU
  - More ALUs (Arithmetic Logic Units)
  - A dedicated L1 cache for each streaming multiprocessor (grouped into
    processor clusters) and a shared L2 cache; more tolerant of latency, so
    less cache is needed overall

### Communication

- The CPU and GPU work in **parallel** with each other
- They run on separate threads and communicate through a **command buffer**
- Problems
  - CPU bottleneck: if the CPU is too slow, the command buffer runs **empty**
    and the GPU idles, waiting for input from the CPU
  - GPU bottleneck: if the GPU is too slow, the command buffer is **full**
    and the CPU idles, waiting to submit more work

## GPU Computing

### Definition

- Using a GPU together with a CPU to accelerate scientific and enterprise
  application processing
- The parts that can be broken down into smaller parallelizable pieces are
  processed on the GPU, while the serial parts stay on the CPU

### GPU programming

- Writing parallel programs that run on GPUs using supported platforms such as
  CUDA (C/C++/Fortran) or OpenACC
- Ways to program GPUs:
  - GPU-accelerated libraries: developers write ordinary code and call the
    library, which already contains the GPU implementation
  - GPU directives: annotate loops with directives (e.g. OpenACC in C or
    Fortran) and let the compiler parallelize them
  - Develop your own: write kernels directly in CUDA C/C++ (a minimal sketch
    follows this list)
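The "develop your own" path can be made concrete with a small kernel. The
sketch below is not part of the original notes: the kernel name, arguments and
block size are illustrative, but the pattern of one GPU thread per array
element is the standard CUDA C style.

```c
// Illustrative CUDA C kernel: element-wise vector addition, c = a + b.
// Each GPU thread computes exactly one output element.
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard: the grid may overshoot n
        c[i] = a[i] + b[i];
}
```

The CPU launches it with an execution configuration, e.g.
`vector_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);`, which requests
enough 256-thread blocks to cover all `n` elements. A full host-side program
using this pattern is sketched at the end of these notes.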
### Memory

#### Device memory: GDDR (Graphics Double Data Rate)

- Resides in the CUDA address space
- Used by CUDA kernels through pointer and array dereferencing
- On most GPUs it is dedicated memory physically attached to the GPU
- No virtual memory as on the CPU: when device memory is exhausted, allocation
  fails

#### Host memory: DDR (Double Data Rate)

- CPU memory, managed by library calls such as `malloc` and `free`
- From CUDA's point of view this memory is virtualized: when physical memory
  is exhausted, allocation does not fail, because pages can be swapped out
  - The OS component that manages virtual memory is the VMM (Virtual Memory
    Manager)
  - The GPU accesses host memory via DMA (Direct Memory Access), which lets
    the GPU and CPU work concurrently

### Processing

1. Copy the input from CPU memory to GPU memory
2. The GPU loads and executes the program, caching data in its on-chip caches,
   and saves the result to GPU memory
3. Copy the results from GPU memory back to CPU memory

(A CUDA host-code sketch of these three steps is given at the end of these
notes.)

### Advantages

- Speed
- Efficiency, in both energy and design
- Fewer cycles spent on communication
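To tie the memory model and the three processing steps above together, here is
a minimal, self-contained CUDA C host program. It is a sketch rather than
anything from the original notes: the SAXPY kernel, array size, block size and
the omission of error checking are all illustrative. It allocates device
(GDDR) memory with `cudaMalloc`, copies the input across with `cudaMemcpy`
(a DMA transfer), launches the kernel, and copies the result back.

```c
// saxpy_demo.cu -- illustrative sketch; compile with: nvcc saxpy_demo.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Kernel (runs on the GPU): y[i] = a * x[i] + y[i], one thread per element
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    const int n = 1 << 20;                  // 1M elements (illustrative size)
    const size_t bytes = n * sizeof(float);

    // Host (DDR) memory: ordinary malloc/free, virtualized by the OS
    float *h_x = (float *)malloc(bytes);
    float *h_y = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    // Device (GDDR) memory: cudaMalloc fails outright if the GPU runs out
    float *d_x = NULL, *d_y = NULL;
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_y, bytes);

    // Step 1: copy input from CPU memory to GPU memory
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    // Step 2: the GPU executes the kernel; the result stays in GPU memory
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);

    // Step 3: copy the result from GPU memory back to CPU memory
    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expected 4.0)\n", h_y[0]);

    cudaFree(d_x);  cudaFree(d_y);
    free(h_x);      free(h_y);
    return 0;
}
```

The device-to-host `cudaMemcpy` implicitly waits for the kernel to finish, so
this simple flow needs no explicit synchronization call.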