Introduction to GPU
Definition
- Graphics Processing Unit, also known as Visual Processing Unit
- Used to accelerate the computation and processing of images and data for output to a display device, and also used in modern High Performance Computing
- Very efficient at computer graphics and floating-point processing
History
- The programmable GPU was introduced by NVIDIA in 1999
- Initially used by gamers, artists and game programmers
- Later adopted by researchers
- GPGPU (General-Purpose computing on GPUs) was introduced to allow general programming languages to be used on the GPU
- CUDA was invented by NVIDIA to enable parallel (GPGPU) computing on its GPUs
Components of GPU
Structure
- Many-core processor
- Five-layer architecture
Architecture
Host Interface
- Communicates between the host (CPU) and the GPU
- Receives commands from the CPU and fetches information from memory
- Produces vertices for processing
Vertex Processing
- Receives vertices from the host interface and produces output in screen space
- No vertices are added or deleted: a 1:1 mapping relationship
Triangle Setup
- Converts the screen-space geometry from the vertex-processing layer into pixels in the output (raster format)
- Triangles located outside of the view are discarded
- Triangles are rasterized into fragments; a fragment is generated only if its center lies inside the triangle
Pixel Processing
- Receives fragments from the previous layer with metadata (attributes) attached, which is used to calculate the color of each pixel
- Performs texture mapping and math, making it the most costly stage
Memory Interface
- Fragment colors are stored here
- Colors are compressed to save space and bandwidth; this is the second most costly stage
CPU and GPU
Differences
- CPU: small number of hard tasks
- Each task is individual and distinctive
- Few cores, primarily for sequential/serial processing; fewer execution units and transistors
- Memory interface is slower
- Fewer pipelines
- More transistors devoted to control and caching
- Has L1 (data + instruction) and L2 caches on each core, plus a shared L3 cache
- GPU: large number of simple tasks
- Tasks can be broken into many tiny parts and worked on in parallel
- Massively parallel: thousands of cores handling many tasks at once, with more execution units and transistors
- Memory interface is much faster
- More pipelines than the CPU
- More ALUs (Arithmetic Logic Units)
- Dedicated L1 cache for every Streaming Processor (grouped into Processor Clusters) and a shared L2 cache; more tolerant of latency, so less cache overall (the sketch below queries these properties)
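These structural differences are visible at runtime. A minimal sketch, assuming a CUDA-capable device and the CUDA runtime API, that queries the SM count and shared L2 cache size mentioned above:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    // Query the properties of GPU 0
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "Query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Device:                    %s\n", prop.name);
    printf("Streaming Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Shared L2 cache:           %d bytes\n", prop.l2CacheSize);
    printf("Memory bus width:          %d bits\n", prop.memoryBusWidth);
    return 0;
}
```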
Communication
- The CPU and GPU work in parallel with each other
- They run on separate threads and communicate through a command buffer (see the sketch after this list)
- Problems
- CPU bottleneck: if the CPU is slow, the command buffer runs empty and the GPU waits for input from the CPU, resulting in an idle GPU
- GPU bottleneck: if the GPU is slow, the command buffer fills up and the CPU waits for the GPU to drain it, resulting in an idle CPU
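The command-buffer relationship can be observed in CUDA, where kernel launches are asynchronous: the CPU enqueues commands and returns immediately, while the GPU drains the queue. A minimal sketch, in which the `scale` kernel is a hypothetical workload:

```cuda
#include <cuda_runtime.h>

// Hypothetical workload: each thread scales one element
__global__ void scale(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main(void) {
    float *d_data;
    cudaMalloc(&d_data, 1024 * sizeof(float));
    cudaMemset(d_data, 0, 1024 * sizeof(float));

    // Launches are asynchronous: the CPU only enqueues commands
    // into the buffer and returns immediately.
    for (int i = 0; i < 10; ++i)
        scale<<<4, 256>>>(d_data);   // 4 blocks x 256 threads = 1024 elements

    // The CPU blocks here until the GPU drains the queued commands;
    // with a slow GPU (full buffer) the CPU idles at this point.
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```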
GPU Computing
Definition
- Using a GPU together with a CPU to accelerate scientific and enterprise applications
- Applications can be broken down into smaller parallelizable parts that are processed on the GPU, while the serial parts run on the CPU
GPU programming
- Writing parallel programs that run on GPUs using supported platforms, such as CUDA, C/Fortran, or OpenACC
- Ways to program GPUs:
- GPU-Accelerated Libraries: developers only need to write code that calls the library (see the cuBLAS sketch after this list)
- GPU directives: loops are parallelized automatically via compiler directives (OpenACC for C/Fortran)
- Develop your own: use CUDA together with a language, e.g. CUDA C/C++
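As an illustration of the library approach, a minimal sketch using cuBLAS (NVIDIA's GPU-accelerated BLAS library): the developer only calls `cublasSaxpy` and the library executes the computation on the GPU. Compile with nvcc and link with `-lcublas`:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 4;
    float h_x[] = {1, 2, 3, 4}, h_y[] = {10, 20, 30, 40};
    float *d_x, *d_y, alpha = 2.0f;

    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // y = alpha * x + y, executed on the GPU by the library
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);
    cublasDestroy(handle);

    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %.1f\n", h_y[0]);   // expect 12.0
    cudaFree(d_x); cudaFree(d_y);
    return 0;
}
```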
Memory
Device memory: GDDR (Graphics Double Data Rate)
- Resides in CUDA address space
- Used by CUDA kernels, with pointer and array dereferencing
- On most GPUs, this is dedicated memory physically attached to the GPU
- No virtual memory as on the CPU: when memory is exhausted, allocation will fail (see the sketch below)
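A minimal sketch of how that failure mode surfaces in CUDA; the 1 GiB request size is an arbitrary illustration:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    float *d_buf;
    size_t bytes = (size_t)1 << 30;   // arbitrary 1 GiB request

    // Device memory is not virtualized: when the GPU runs out,
    // the allocation simply fails instead of swapping to disk.
    cudaError_t err = cudaMalloc(&d_buf, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(d_buf);
    return 0;
}
```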
Host memory: DDR (Double Data Rate)
- CPU memory, managed by library calls like malloc and free
- In CUDA, this is virtualized: when memory is exhausted, allocation will not fail
- The OS component that manages virtual memory is the VMM (Virtual Memory Manager)
- The GPU accesses host memory via DMA (Direct Memory Access), which enables the GPU to work alongside the CPU (see the pinned-memory sketch below)
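One place DMA surfaces in the CUDA API is pinned (page-locked) host memory: pages allocated with `cudaMallocHost` cannot be swapped out by the VMM, so the GPU's DMA engine can copy to and from them directly. A minimal sketch:

```cuda
#include <cuda_runtime.h>

int main(void) {
    float *h_buf;

    // Page-locked host memory: the VMM will not swap these pages out,
    // so the GPU's DMA engine can transfer to/from them directly.
    cudaMallocHost(&h_buf, 1024 * sizeof(float));

    // ... h_buf can now serve as a fast cudaMemcpy source/destination ...

    cudaFreeHost(h_buf);
    return 0;
}
```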
Processing
- Copy input from CPU memory to GPU memory
- The GPU loads and executes the program, caching data in the GPU cache and saving results to GPU memory
- Copy results from GPU memory back to CPU memory (the full round trip is sketched below)
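Putting the three steps together, a minimal end-to-end sketch; the `vecAdd` kernel and the problem size are illustrative choices:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// One thread per element; the guard covers a grid rounded up past n
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_a[n], h_b[n], h_c[n];
    for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    // Step 1: copy input from CPU memory to GPU memory
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Step 2: the GPU executes the program, storing results in GPU memory
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    // Step 3: copy the results from GPU memory back to CPU memory
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[10] = %.1f\n", h_c[10]);   // expect 30.0
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```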
Advantages
- Fast
- Efficient, in both energy use and design
- Fewer cycles of communication