2-1, around 1.5 hour

2024-12-28 21:15:56 +08:00 · 2024-12-28 21:15:56 +08:00 · 557be68e19
parent 92235588b8
commit 557be68e19
1 changed files with 140 additions and 0 deletions
--- a/2-1-gpu.md
+++ b/2-1-gpu.md
@@ -0,0 +1,140 @@
 # Introduction to GPU
 ## Definition
 - Graphics Processing Unit, also known as Visual Processing Unit
 - Used to **accelerate** computation and processing of images and data, to
  output to a display device, or used in modern High Performance Computing
 - Very efficient at `computer graphics` and floating point processing
 - History:
    - The programmable GPU was introduced by NVIDIA in 1999
    - Initially used by gamers, artists and game programmers
    - Then used by researchers
    - GPGPU (general-purpose computing on GPUs) emerged so that GPUs could run
      general-purpose programs, not only graphics
    - CUDA was created by NVIDIA to enable general-purpose parallel computing
      on its GPUs
 ## Components of GPU
 ### Structure
 - Many-core processor
 - 5-layer architecture
 ### Architecture
 #### Host Interface
 - Handles communication between the host and the GPU
    - Receives commands from the CPU and fetches data from memory
    - Produces vertices for processing
 #### Vertex Processing
 - Receives vertices from the host interface and produces output in screen space
 - No vertices are added or deleted: 1:1 mapping relationship
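The 1:1 mapping can be illustrated with a minimal C++ sketch. It assumes the vertices already arrive in normalized device coordinates in the range [-1, 1] and only applies the final viewport mapping; the `Vec2` type and `to_screen_space` function are hypothetical names used for illustration, not part of any GPU API.

```cpp
#include <cassert>
#include <vector>

// Hypothetical 2D vertex type, for illustration only.
struct Vec2 { float x, y; };

// Map each vertex from normalized device coordinates ([-1, 1] on both axes)
// to pixel coordinates on a width x height screen, with y pointing down.
std::vector<Vec2> to_screen_space(const std::vector<Vec2>& ndc,
                                  int width, int height) {
    assert(width > 0 && height > 0);
    std::vector<Vec2> out;
    out.reserve(ndc.size());
    for (const Vec2& v : ndc) {
        out.push_back({(v.x + 1.0f) * 0.5f * width,
                       (1.0f - v.y) * 0.5f * height});
    }
    return out;  // exactly one output vertex per input vertex
}
```

Every input vertex produces exactly one output vertex, so the output vector always has the same size as the input.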
 #### Triangle Setup
 - Converts screen-space geometry from the vertex processing layer into pixels
  in the output (raster format)
 - Triangles located outside of the view are discarded
 - A triangle is rendered as fragments; a fragment is generated for a pixel
  only if the center of that pixel lies inside the triangle.
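The center-coverage rule can be sketched in C++ with edge functions, one common way to test whether a point lies inside a triangle. This is a simplified illustration: the `Point`, `edge`, `covers_pixel`, and `count_fragments` names are hypothetical, pixel centers are assumed to sit at half-integer coordinates, and pixels whose centers fall exactly on an edge are counted as covered (real hardware applies a tie-breaking rule that this sketch omits).

```cpp
#include <cassert>

// Hypothetical screen-space point, for illustration only.
struct Point { float x, y; };

// Edge function: twice the signed area of triangle (a, b, p).
// Its sign tells on which side of the edge a->b the point p lies.
float edge(const Point& a, const Point& b, const Point& p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// True if the center of pixel (px, py) lies inside triangle abc.
bool covers_pixel(const Point& a, const Point& b, const Point& c,
                  int px, int py) {
    assert(px >= 0 && py >= 0);
    const Point center{px + 0.5f, py + 0.5f};
    const float e0 = edge(a, b, center);
    const float e1 = edge(b, c, center);
    const float e2 = edge(c, a, center);
    // Accept either winding order: all edge values share the same sign.
    return (e0 >= 0 && e1 >= 0 && e2 >= 0) ||
           (e0 <= 0 && e1 <= 0 && e2 <= 0);
}

// Count the fragments a triangle produces on a width x height pixel grid.
int count_fragments(const Point& a, const Point& b, const Point& c,
                    int width, int height) {
    int count = 0;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (covers_pixel(a, b, c, x, y)) ++count;
    return count;
}
```

Pixels that the triangle only partially overlaps produce no fragment unless their center is covered.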
 #### Pixel Processing
 - Receives fragments from the previous layer with attributes (metadata)
  attached, which are used to calculate the color of each pixel
 - Performs texture mapping and math operations, making it the most costly layer
 #### Memory Interface
 - Fragment colors are stored here
    - Colors are compressed to save space and bandwidth; this is the second
      most costly layer
 ## CPU and GPU
 ### Differences
 - CPU: small number of hard tasks
    - Individual and **distinctive** tasks
    - Few cores, primarily for sequential, **serial** processing, with fewer
      execution units and **transistors**
    - Memory interface is **slower**
    - Fewer pipelines
    - More control and caching transistors
    - Has L1 (data, instruction) and L2 caches on each core, and **shared** L3
      cache
 - GPU: large number of simple tasks
    - Tasks can be broken into **many** **tiny** parts and worked on in parallel
    - Massively **parallel**: thousands of cores that handle many tasks at
      once, with more execution **units** and **transistors**
    - Memory interface is much **faster**
    - More pipelines than the CPU
    - More ALUs (arithmetic logic units)
    - Dedicated L1 cache for each streaming processor within a processor
      cluster, plus a shared L2 cache; more tolerant of latency, so less cache
      is needed
 ### Communication
 - They interact in **parallel** with each other.
 - They run on separate threads, and communicate through a **command buffer**
 - Problems
    - CPU bottleneck: if the CPU is slow, the command buffer is **empty** and
      the GPU waits for input from the CPU, leaving the GPU idle
    - GPU bottleneck: if the GPU is slow, the command buffer is **full** and
      the CPU waits for the GPU, leaving the CPU idle
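The two bottlenecks can be made concrete with a small, single-threaded C++ sketch of a fixed-capacity command buffer. The `CommandBuffer` class and its `try_submit` / `try_fetch` methods are hypothetical names for illustration; commands are plain integers, and a real driver would add synchronization between the CPU and GPU sides.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical fixed-capacity command buffer; commands are plain ints here.
class CommandBuffer {
public:
    explicit CommandBuffer(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // CPU side: returns false when the buffer is full,
    // which is the point where the CPU would have to wait (GPU bottleneck).
    bool try_submit(int command) {
        if (queue_.size() == capacity_) return false;
        queue_.push_back(command);
        return true;
    }

    // GPU side: returns false when the buffer is empty,
    // which is the point where the GPU would sit idle (CPU bottleneck).
    bool try_fetch(int& command) {
        if (queue_.empty()) return false;
        command = queue_.front();
        queue_.pop_front();
        return true;
    }

private:
    std::size_t capacity_;
    std::deque<int> queue_;
};
```

A failed `try_fetch` corresponds to the CPU bottleneck case, and a failed `try_submit` to the GPU bottleneck case.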
 ## GPU Computing
 ### Definition
 - Using GPU with CPU to accelerate scientific and enterprise application
  processing
 - Parts of an application that can be broken down into smaller parallelizable
  pieces are processed on the GPU, while serial parts stay on the CPU
 ### GPU programming
 - Writing parallel programs that run on GPUs using supported platforms such as
  CUDA or OpenACC, typically from C/C++ or Fortran
 - Ways to program GPUs:
    - GPU-accelerated libraries: developers write ordinary code and call the
      library, which runs the work on the GPU
    - GPU directives: loops are parallelized automatically through compiler
      directives (OpenACC in C or Fortran)
    - Develop your own: Use CUDA along with language: CUDA C/C++
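As an illustration of the directive approach, here is a minimal C++ sketch of a SAXPY loop annotated with an OpenACC directive. The `saxpy` function name and the choice of data clauses are assumptions made for this example. A compiler with OpenACC support may offload the loop to the GPU; a compiler without it typically ignores the unknown pragma and runs the same loop serially on the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y[i] = a * x[i] + y[i] for every element.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const float* xp = x.data();
    float* yp = y.data();

    // The directive asks the compiler to parallelize the loop:
    // copyin sends x to the device, copy sends y there and back.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

The loop body is unchanged from the serial version, which is the main appeal of directives: the parallelization hint lives in a single annotation.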
 ### Memory
 #### Device memory: GDDR (Graphics Double Data Rate)
 - Resides in CUDA address space
 - Used by CUDA kernels, with pointer and array de-referencing
 - Most GPUs have dedicated memory attached to the GPU
 - No demand-paged virtual memory as on the CPU: when device memory is
  exhausted, allocation will fail
 #### Host memory: DDR (Double Data Rate)
 - CPU memory, managed by library calls like `malloc`, `free`
 - In a CUDA program, host memory is virtualized: when physical memory is
  exhausted, allocation will not necessarily fail
    - The OS component that manages virtual memory is the VMM (Virtual Memory
      Manager)
    - The GPU accesses host memory through DMA (Direct Memory Access), which
      enables the GPU to work alongside the CPU
 ### Processing
 1. Copy input from CPU memory to GPU memory.
 2. The GPU loads the program and executes it, caching data in the GPU cache,
    and saves the result to GPU memory.
 3. Copy results from GPU memory to CPU memory.
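The three steps can be mimicked with a CPU-only C++ stand-in, where a second `std::vector` plays the role of GPU memory. This is only a sketch of the data flow, not real GPU code: `scale_kernel` and `run_on_device` are hypothetical names, and in CUDA steps 1 and 3 would typically be `cudaMemcpy` calls with `cudaMemcpyHostToDevice` and `cudaMemcpyDeviceToHost`, with a kernel launch in between.

```cpp
#include <cassert>
#include <vector>

// Stand-in for a GPU kernel: scales every element of "device" memory in place.
void scale_kernel(std::vector<float>& device_data, float factor) {
    for (float& value : device_data) {
        value *= factor;
    }
}

// Simulates the copy-in / compute / copy-out pattern on the CPU.
std::vector<float> run_on_device(const std::vector<float>& host_input,
                                 float factor) {
    assert(!host_input.empty());

    // 1. Copy input from CPU memory to (simulated) GPU memory.
    std::vector<float> device_buffer(host_input);

    // 2. Execute the program; results stay in (simulated) GPU memory.
    scale_kernel(device_buffer, factor);

    // 3. Copy results from (simulated) GPU memory back to CPU memory.
    std::vector<float> host_output(device_buffer);
    return host_output;
}
```

The host input is never modified; all computation happens on the separate buffer, which mirrors how host and device memory are distinct address spaces.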
 ### Advantages
 - Fast
 - Efficiency: energy and design
 - Fewer communication cycles