2-1, around 1.5 hours

# Introduction to GPU
## Definition
- Graphics Processing Unit, also known as Visual Processing Unit
- Used to **accelerate** the computation and processing of images and data, for
output to a display device or for use in modern High Performance Computing
- Very efficient at `computer graphics` and floating-point processing
- History:
- The programmable GPU was introduced by NVIDIA in 1999
- Initially used by gamers, artists and game programmers
- Then used by researchers
- GPGPU (General-Purpose computing on GPUs) was introduced by NVIDIA to allow
programming languages to be used on the GPU
- CUDA was invented by NVIDIA to enable parallel computing using the GPU and
GPGPU
## Components of GPU
### Structure
- Many-core processor
- A 5-layer architecture, described below
### Architecture
#### Host Interface
- Communicates between the host (CPU) and the GPU
- Receives commands from the CPU and obtains information from memory
- Produces vertices for processing
#### Vertex Processing
- Receives vertices from the host interface and produces output in screen space
- No vertices are added or deleted: a 1:1 mapping relationship
#### Triangle Setup
- Converts the screen-space geometry from the vertex processing layer into
pixels in the output (raster format)
- Triangles located outside of the view are discarded
- Triangles are rasterized into fragments, and a fragment is generated only if
its center lies inside the triangle (see the coverage-test sketch below)
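
A minimal sketch of this coverage rule in plain C++ (the names `edgeFunction`
and `fragmentCovered` are illustrative, not from the notes): a fragment is kept
only when its center falls inside the triangle. Real rasterizers additionally
handle fill rules for shared edges, clipping, and sub-pixel precision.

```cpp
struct Vec2 { float x, y; };

// Signed area term: its sign tells which side of edge (a, b) the point p is on.
static float edgeFunction(Vec2 a, Vec2 b, Vec2 p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// True if the fragment for pixel (px, py) is covered: its center
// (px + 0.5, py + 0.5) lies on the same side of all three edges.
bool fragmentCovered(Vec2 v0, Vec2 v1, Vec2 v2, int px, int py) {
    Vec2 c = { px + 0.5f, py + 0.5f };   // fragment (pixel) center
    float e0 = edgeFunction(v0, v1, c);
    float e1 = edgeFunction(v1, v2, c);
    float e2 = edgeFunction(v2, v0, c);
    // Same sign for all three edges => center is inside the triangle
    // (works for either winding order).
    return (e0 >= 0 && e1 >= 0 && e2 >= 0) || (e0 <= 0 && e1 <= 0 && e2 <= 0);
}
```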
#### Pixel Processing
- Fragments are received from the previous layer with metadata (attributes)
attached, which are used to calculate the color of the pixel
- Performs texture mapping and math, so it is the most costly stage
#### Memory Interface
- Fragment colors are stored here
- They are compressed to save space and bandwidth; this is the second most
costly stage
## CPU and GPU
### Differences
- CPU: a small number of hard tasks
- Each task is individual and **distinctive**
- Few cores, primarily for sequential and **serial** processing; fewer
execution units and **transistors**
- Memory interface is **slower**
- Fewer pipelines
- More control and caching transistors
- Has L1 (data, instruction) and L2 caches on each core, and a **shared** L3
cache
- GPU: a large number of simple tasks
- Work can be broken into **many** **tiny** parts and processed in parallel
- Massively **parallel**: thousands of cores that handle many tasks at once,
with more execution **units** and **transistors**
- Memory interface is much **faster**
- More pipelines than the CPU
- More ALUs (arithmetic logic units)
- A dedicated L1 cache for every Streaming Processor, grouped into Processor
Clusters, plus a shared L2 cache; more tolerant of latency (so less cache);
see the kernel sketch after this list
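
To make the contrast concrete, a minimal CUDA C++ sketch (illustrative names,
not from the notes): the CPU version walks the array serially on one core,
while the GPU version splits the same work into thousands of tiny per-element
tasks, one thread each.

```cpp
// One thread per element: each GPU thread performs one tiny, independent task.
__global__ void scaleGPU(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// One CPU core walks the whole array serially.
void scaleCPU(float *data, float factor, int n) {
    for (int i = 0; i < n; ++i)
        data[i] *= factor;
}

// Launch example: enough 256-thread blocks to cover all n elements.
// scaleGPU<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
```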
### Communication
- They interact in **parallel** with each other
- They run on separate threads and communicate through a **command buffer**
(a ring-buffer sketch follows this list)
- Problems
- CPU bottleneck: if the CPU is slow, the command buffer is **empty**; the GPU
waits for input from the CPU, resulting in an idle GPU
- GPU bottleneck: if the GPU is slow, the command buffer is **full**; the CPU
waits for the GPU to catch up, resulting in an idle CPU
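
A conceptual sketch of the command buffer as a fixed-size ring buffer (plain
C++, illustrative names; not a real driver API and not thread-safe): when it is
empty the consumer (GPU) stalls, and when it is full the producer (CPU) stalls.

```cpp
#include <cstddef>

struct Command { int opcode; /* ... payload ... */ };

struct CommandBuffer {
    static const size_t kCapacity = 256;
    Command slots[kCapacity];
    size_t head = 0, tail = 0;   // producer writes at head, consumer reads at tail
    size_t count = 0;

    bool push(const Command &c) {             // called by the CPU side
        if (count == kCapacity) return false; // full -> CPU waits (GPU bottleneck)
        slots[head] = c;
        head = (head + 1) % kCapacity;
        ++count;
        return true;
    }

    bool pop(Command &out) {                  // called by the GPU side
        if (count == 0) return false;         // empty -> GPU idles (CPU bottleneck)
        out = slots[tail];
        tail = (tail + 1) % kCapacity;
        --count;
        return true;
    }
};
```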
## GPU Computing
### Definition
- Using the GPU together with the CPU to accelerate scientific and enterprise
application processing
- Some parts of an application can be broken down into smaller parallelizable
pieces, which are processed on the GPU, while the serial parts run on the CPU
### GPU programming
- Writing parallel programs that run on GPUs using compliant platforms, such as
CUDA, C/Fortran, or OpenACC
- Ways to program GPUs:
- GPU-accelerated libraries: developers only need to write ordinary code and
call the library (see the Thrust sketch after this list)
- GPU directives: parallelize loops automatically using directives (OpenACC, in
C or Fortran)
- Develop your own: use CUDA along with a host language, e.g. CUDA C/C++
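
As an example of the library approach, a minimal sketch using Thrust (a
template library that ships with the CUDA toolkit): the developer writes
ordinary-looking C++ and the library executes the work on the GPU.

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main() {
    thrust::device_vector<float> x(1 << 20, 1.0f);  // data lives in GPU memory
    thrust::device_vector<float> y(1 << 20, 2.0f);
    thrust::device_vector<float> z(1 << 20);

    // Element-wise z = x + y, executed on the GPU by the library.
    thrust::transform(x.begin(), x.end(), y.begin(), z.begin(),
                      thrust::plus<float>());
    return 0;
}
```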
### Memory
#### Device memory: GDDR (Graphics Double Data Rate)
- Resides in the CUDA address space
- Used by CUDA kernels, with pointer and array dereferencing
- On most GPUs, this is dedicated memory attached to the GPU
- No virtual memory as on the CPU: when device memory is exhausted, allocation
will fail (see the sketch below)
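
A minimal sketch with the CUDA runtime API: device allocations come straight
from physical GDDR with no demand paging, so `cudaMalloc` returns an error once
memory is exhausted instead of falling back to virtual memory.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    float *d_buf = nullptr;
    size_t bytes = size_t(1) << 30;          // try to allocate 1 GiB on the device

    cudaError_t err = cudaMalloc((void **)&d_buf, bytes);
    if (err != cudaSuccess) {                // e.g. cudaErrorMemoryAllocation
        printf("device allocation failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(d_buf);
    return 0;
}
```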
#### Host memory: DDR (Double Data Rate)
- CPU memory, managed by library calls like `malloc` and `free`
- In CUDA host code this memory is virtualized: when physical memory is
exhausted, allocation will not fail
- The OS component that manages virtual memory is the VMM (Virtual Memory
Manager)
- The GPU accesses host memory with DMA (Direct Memory Access), which enables
the GPU to work alongside the CPU (see the pinned-memory sketch below)
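
A minimal sketch of DMA-friendly host memory with the CUDA runtime API:
`cudaMallocHost` allocates page-locked ("pinned") DDR, so the pages stay
resident and the GPU's DMA engine can copy them to device GDDR directly,
without the VMM moving them.

```cpp
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;
    float *h_pinned = nullptr;
    float *d_buf = nullptr;

    cudaMallocHost((void **)&h_pinned, n * sizeof(float));  // pinned host (DDR) memory
    cudaMalloc((void **)&d_buf, n * sizeof(float));         // device (GDDR) memory

    for (int i = 0; i < n; ++i) h_pinned[i] = 1.0f;

    // DMA transfer from host memory to device memory.
    cudaMemcpy(d_buf, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}
```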
### Processing
1. Copy input from CPU memory to GPU memory
2. The GPU loads the program and executes it, caching data in the GPU caches,
and saves the results to GPU memory
3. Copy the results from GPU memory back to CPU memory (the three steps are
sketched below)
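
A minimal end-to-end sketch of the three steps in CUDA C++ (the kernel name
`addOne` is illustrative): copy the input host-to-device, execute on the GPU,
then copy the results back.

```cpp
#include <cuda_runtime.h>

__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = float(i);

    float *d = nullptr;
    cudaMalloc((void **)&d, n * sizeof(float));

    // Step 1: copy input from CPU (host) memory to GPU (device) memory.
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Step 2: the GPU executes the program (kernel), saving results to GPU memory.
    addOne<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    // Step 3: copy the results from GPU memory back to CPU memory.
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d);
    delete[] h;
    return 0;
}
```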
### Advantages
- Fast
- Efficient, in both energy use and design
- Fewer cycles of communication