
Introduction to GPU

Definition

  • Graphics Processing Unit, also known as Visual Processing Unit
  • Used to accelerate the computation and processing of images and data, either for output to a display device or for use in modern High Performance Computing
  • Very efficient at computer graphics and floating-point processing
  • History:
    • The programmable GPU was invented by NVIDIA in 1999
    • Initially used by gamers, artists and game programmers
    • Later adopted by researchers
    • GPGPU (General-Purpose computing on GPUs) was introduced by NVIDIA to allow general-purpose programming languages to be used on the GPU
    • CUDA was created by NVIDIA to enable parallel computing using GPUs and GPGPU

Components of GPU

Structure

  • Many-core processor
  • 5-layer architecture

Architecture

Host Interface

  • Communicates between the host (CPU) and the GPU
    • Receives commands from the CPU and obtains information from memory
    • Produces vertices for processing

Vertex Processing

  • Receives vertices from the host interface and produces output in screen space
  • No vertices are added or deleted: a 1:1 mapping relationship

Triangle Setup

  • Converts the screen-space geometry from the vertex processing layer into pixels in the output (raster format)
  • Triangles located outside of the view are discarded
  • A pixel is rendered as a fragment only if the pixel's center lies inside the triangle (see the coverage-test sketch below)
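
A minimal sketch of that coverage test, assuming counter-clockwise vertex order; the helper name covers is ours, and real hardware rasterizers use more elaborate fixed-point variants of this edge-function idea:

```cuda
// Hypothetical coverage test: a pixel produces a fragment only when its
// center (px, py) lies inside the screen-space triangle (a, b, c).
__host__ __device__ bool covers(float px, float py,
                                float ax, float ay,
                                float bx, float by,
                                float cx, float cy)
{
    // Edge functions: each gives the signed area between one edge and the
    // point, so its sign says which side of that edge the point is on.
    float e0 = (bx - ax) * (py - ay) - (by - ay) * (px - ax);
    float e1 = (cx - bx) * (py - by) - (cy - by) * (px - bx);
    float e2 = (ax - cx) * (py - cy) - (ay - cy) * (px - cx);

    // Inside when all three edge functions agree in sign
    // (counter-clockwise vertex order assumed).
    return e0 >= 0.0f && e1 >= 0.0f && e2 >= 0.0f;
}
```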

Pixel Processing

  • Receives fragments from the previous layer with metadata (attributes) attached, which are used to calculate the color of each pixel
  • Performs texture mapping and math, so it is the most costly layer

Memory Interface

  • Fragment colors are stored here
    • They are compressed to save space and bandwidth; this is the second most costly layer

CPU and GPU

Differences

  • CPU: small number of hard tasks
    • Each task is individual and distinctive
    • Few cores, primarily for sequential and serial processing; fewer execution units and transistors
    • Memory interface is slower
    • Fewer pipelines
    • More transistors devoted to control and caching
    • Has L1 (data and instruction) and L2 caches on each core, plus a shared L3 cache
  • GPU: large number of simple tasks
    • Work can be broken into many tiny parts and processed in parallel
    • Massively parallel: thousands of cores handling many tasks at once, with more execution units and transistors
    • Memory interface is much faster
    • More pipelines than the CPU
    • More ALUs (Arithmetic Logic Units)
    • A dedicated L1 cache for each streaming multiprocessor (grouped into processor clusters) and a shared L2 cache; more tolerant of latency, so less cache is needed

Communication

  • They interact in parallel with each other
  • They run on separate threads and communicate through a command buffer (see the stream sketch below)
  • Problems
    • CPU bottleneck: if the CPU is slow, the command buffer runs empty and the GPU waits for input from the CPU, leaving the GPU idle
    • GPU bottleneck: if the GPU is slow, the command buffer fills up and the CPU waits for the GPU to catch up, leaving the CPU idle
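
As a rough sketch of this relationship, a CUDA stream behaves much like the command buffer: the CPU enqueues work asynchronously and only blocks when it explicitly synchronizes. The kernel name work is a placeholder, d_data is assumed to be an already-allocated device buffer of 256 × 256 floats, and error handling is omitted:

```cuda
#include <cuda_runtime.h>

__global__ void work(float *data)          // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}

void submit(float *d_data)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The CPU enqueues commands and returns immediately; the stream acts
    // like the command buffer sitting between the two processors.
    for (int i = 0; i < 100; ++i)
        work<<<256, 256, 0, stream>>>(d_data);

    // If the GPU drains the buffer faster than the CPU refills it, the GPU
    // idles (CPU bottleneck); if the buffer stays full, the CPU ends up
    // waiting instead (GPU bottleneck).
    cudaStreamSynchronize(stream);         // wait for the GPU to finish
    cudaStreamDestroy(stream);
}
```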

GPU Computing

Definition

  • Using the GPU together with the CPU to accelerate scientific and enterprise application processing
  • Applications are broken down into parallelizable smaller parts, which are processed on the GPU, while the serial parts run on the CPU (see the sketch below)
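
A minimal sketch of that split using a hypothetical SAXPY routine: the per-element loop has no dependencies between iterations, so the GPU version assigns one element to each thread, while the CPU version keeps the same work serial:

```cuda
// GPU version: one thread per element, all elements processed in parallel.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard: the grid may be larger than n
        y[i] = a * x[i] + y[i];
}

// CPU version: the same work done serially, one element per iteration.
void saxpy_cpu(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```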

GPU programming

  • Writing parallel programs that run on GPUs using compliant platforms, such as CUDA, C/Fortran or OpenACC
  • Ways to program GPUs:
    • GPU-accelerated libraries: developers only write ordinary code that calls the library (see the cuBLAS sketch below)
    • GPU directives: loops are parallelized automatically by annotating them with directives (OpenACC for C and Fortran)
    • Develop your own: use CUDA together with a host language, e.g. CUDA C/C++
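
For the library route, a sketch using cuBLAS: the same SAXPY operation as above, but without writing any kernel. Here d_x and d_y are assumed to be device pointers that were already allocated and filled, and error checking is omitted:

```cuda
#include <cublas_v2.h>

void saxpy_library(int n, float a, const float *d_x, float *d_y)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    // The library launches its own tuned kernel internally;
    // the caller never writes GPU code.
    cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);

    cublasDestroy(handle);
}
```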

Memory

Device memory: GDDR (Graphics Double Data Rate)

  • Resides in the CUDA address space
  • Used by CUDA kernels, through pointer and array dereferencing
  • On most GPUs, it is dedicated memory physically attached to the GPU
  • No virtual memory as on the CPU: when device memory is exhausted, allocation will fail (see the sketch below)
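
A minimal sketch of that failure mode: unlike host malloc, cudaMalloc has no virtual-memory fallback and simply returns an error code when device memory runs out (the helper name alloc_device is ours):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

float *alloc_device(size_t n)
{
    float *d_buf = NULL;
    cudaError_t err = cudaMalloc((void **)&d_buf, n * sizeof(float));
    if (err != cudaSuccess) {
        // No paging to fall back on: the allocation just fails.
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return NULL;
    }
    return d_buf;   // pointer into the CUDA (device) address space
}
```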

Host memory: DDR (Double Data Rate)

  • CPU memory, managed by library calls such as malloc and free
  • In CUDA, this memory is virtualized: when physical memory is exhausted, allocation will not fail
    • The part of the OS that manages virtual memory: the VMM (Virtual Memory Manager)
    • The GPU accesses host memory through DMA (Direct Memory Access), which enables the GPU to work alongside the CPU (see the pinned-memory sketch below)
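
A sketch contrasting the two kinds of host allocation: ordinary malloc memory is pageable and managed by the VMM, while cudaMallocHost page-locks (pins) the buffer so the GPU can DMA from it directly; the driver stages transfers from pageable buffers through a pinned one behind the scenes:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

void host_buffers(size_t n)
{
    // Pageable host memory: plain malloc, virtualized by the OS (VMM),
    // so the allocation does not fail just because RAM is exhausted.
    float *pageable = (float *)malloc(n * sizeof(float));

    // Page-locked (pinned) host memory: the GPU can DMA from it directly.
    float *pinned = NULL;
    cudaMallocHost((void **)&pinned, n * sizeof(float));

    /* ... fill the buffers and copy to the device ... */

    cudaFreeHost(pinned);
    free(pageable);
}
```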

Processing

  1. Copy the input from CPU memory to GPU memory
  2. The GPU loads and executes the program, caching data in GPU caches and saving the result to GPU memory
  3. Copy the results from GPU memory back to CPU memory (sketched below)
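
A minimal sketch of these three steps for a vector of n floats, reusing the saxpy kernel sketched earlier (error handling omitted):

```cuda
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y);  // as above

void run(int n, float a, const float *h_x, float *h_y)
{
    size_t bytes = n * sizeof(float);
    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_y, bytes);

    // 1. Copy the input from CPU (host) memory to GPU (device) memory.
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    // 2. Execute on the GPU; data is cached on-chip during the run and
    //    the result is written to GPU memory.
    saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);

    // 3. Copy the result from GPU memory back to CPU memory.
    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    cudaFree(d_y);
}
```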

Advantages

  • Fast
  • Efficient, in both energy use and design
  • Fewer communication cycles