# Introduction to GPU

## Definition

- Graphics Processing Unit (GPU), also known as Visual Processing Unit (VPU)
- Used to **accelerate** the computation and processing of images and data,
  either for output to a display device or for modern High Performance
  Computing (HPC)
- Very efficient at computer graphics and floating-point processing
- History:
  - NVIDIA introduced the GeForce 256 in 1999 and marketed it as the first GPU
  - Initially used by gamers, artists and game programmers
  - Later adopted by researchers
  - GPGPU (general-purpose computing on GPUs) lets programs written in
    general-purpose programming languages run on the GPU
  - CUDA was created by NVIDIA to enable parallel computing on its GPUs,
    including GPGPU workloads

## Components of GPU

### Structure

- Many-core processor
- Five-layer architecture

### Architecture

#### Host Interface

- Handles communication between the host (CPU) and the GPU
- Receives commands from the CPU and obtains information from memory
- Produces vertices for processing

#### Vertex Processing

- Receives vertices from the host interface and produces output in screen
  space
- No vertices are added or deleted: input and output vertices map 1:1

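The 1:1 mapping can be illustrated with a small Python sketch. It assumes the incoming vertices are in normalized device coordinates (x and y in [-1, 1]) and maps each one to pixel coordinates. The function name `to_screen_space`, the viewport size and the downward-pointing y axis are illustrative assumptions, not details from these notes.

```python
def to_screen_space(vertices, width, height):
    """Map each (x, y) vertex from [-1, 1] coordinates to pixel coordinates.

    Exactly one output vertex is produced per input vertex.
    """
    screen = []
    for x, y in vertices:
        sx = (x + 1.0) * 0.5 * width
        sy = (1.0 - y) * 0.5 * height  # assumed: y grows downward on screen
        screen.append((sx, sy))
    return screen


triangle = [(-1.0, -1.0), (1.0, -1.0), (0.0, 1.0)]
on_screen = to_screen_space(triangle, width=640, height=480)
print(on_screen)  # [(0.0, 480.0), (640.0, 480.0), (320.0, 0.0)]
```

Three vertices go in and three come out; only their coordinates change.
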
#### Triangle Setup

- Converts the screen-space geometry from the vertex processing layer into
  pixels in the output (raster format)
- Triangles located outside the view are discarded
- Each triangle is broken into fragments, and a fragment is generated only if
  its center lies inside the triangle

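The center-of-fragment rule can be sketched in Python as follows. This is a minimal illustration, assuming pixel centers sit at half-integer coordinates and using a signed-area (edge function) test; the helper names `edge`, `covers` and `rasterize` are hypothetical, and real hardware applies additional tie-breaking rules for centers that fall exactly on an edge.

```python
def edge(a, b, p):
    """Signed-area term: its sign tells which side of the edge a -> b holds p."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def covers(tri, p):
    """True if point p lies inside (or on the border of) triangle tri."""
    a, b, c = tri
    w0, w1, w2 = edge(a, b, p), edge(b, c, p), edge(c, a, p)
    # Accept either winding order: all terms share one sign when p is inside.
    return (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0)


def rasterize(tri, width, height):
    """Return (column, row) for every pixel whose center is covered."""
    fragments = []
    for row in range(height):
        for col in range(width):
            center = (col + 0.5, row + 0.5)
            if covers(tri, center):
                fragments.append((col, row))
    return fragments


tri = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
fragments = rasterize(tri, width=4, height=4)
print(fragments)  # [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (0, 2)]
```

Pixel (3, 0) overlaps the triangle, but its center (3.5, 0.5) lies outside, so no fragment is produced for it. A triangle entirely outside the grid yields an empty list, which corresponds to discarding it.
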
#### Pixel Processing

- Receives each fragment from the previous layer with metadata (attributes)
  attached, which is used to calculate the color of the pixel
- Involves texture mapping and math operations, so it is the most costly layer

#### Memory Interface

- Fragment colors are stored here
- Colors are compressed to save space and bandwidth; this is the second most
  costly layer

## CPU and GPU

### Differences

- CPU: small number of hard tasks
  - Individual and **distinctive** tasks
  - Few cores, built primarily for sequential, **serial** processing, with
    fewer execution units and **transistors**
  - Memory interface is **slower**
  - Fewer pipelines
  - More transistors devoted to control and caching
  - Has L1 (data, instruction) and L2 caches on each core, and a **shared** L3
    cache
- GPU: large number of simple tasks
  - Work can be broken into **many** **tiny** parts and processed in parallel
  - Massively **parallel**: thousands of cores that handle many tasks at the
    same time, with more execution **units** and **transistors**
  - Memory interface is much **faster**
  - More pipelines than a CPU
  - More ALUs (arithmetic logic units)
  - Dedicated L1 cache for every streaming multiprocessor (grouped into
    processor clusters) and a shared L2 cache; more tolerant of latency, so
    it needs less cache

### Communication

- The CPU and GPU work in **parallel** with each other.
- They run on separate threads and communicate through a **command buffer**.
- Problems
  - CPU bottleneck: if the CPU is slow, the command buffer runs **empty** and
    the GPU waits for input from the CPU, resulting in an idle GPU
  - GPU bottleneck: if the GPU is slow, the command buffer becomes **full** and
    the CPU waits for the GPU to catch up, resulting in an idle CPU

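The two bottlenecks can be reproduced with a toy, single-threaded simulation. This is only a sketch under simplifying assumptions: time advances in discrete ticks, the CPU attempts one submission every `cpu_period` ticks, the GPU attempts one fetch every `gpu_period` ticks, and the function simply counts failed attempts on each side. The function name `simulate` and all parameter names are made up for illustration.

```python
from collections import deque


def simulate(cpu_period, gpu_period, capacity, ticks):
    """Count failed attempts for a CPU producer and a GPU consumer.

    Returns (cpu_stalled, gpu_idle): how often the CPU found the buffer
    full and how often the GPU found it empty.
    """
    buffer = deque()
    cpu_stalled = gpu_idle = 0
    for t in range(ticks):
        if t % cpu_period == 0:
            if len(buffer) < capacity:
                buffer.append(t)      # CPU submits a command
            else:
                cpu_stalled += 1      # buffer full: CPU has to wait
        if t % gpu_period == 0:
            if buffer:
                buffer.popleft()      # GPU executes the oldest command
            else:
                gpu_idle += 1         # buffer empty: GPU has nothing to do
    return cpu_stalled, gpu_idle


print(simulate(cpu_period=4, gpu_period=1, capacity=4, ticks=16))  # (0, 12)
print(simulate(cpu_period=1, gpu_period=4, capacity=4, ticks=16))  # (8, 0)
```

With a slow CPU the GPU finds the buffer empty on 12 of 16 ticks; with a slow GPU the buffer fills up and the CPU is turned away 8 times.
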
## GPU Computing

### Definition

- Using a GPU together with a CPU to accelerate scientific and enterprise
  applications
- The parts of an application that can be broken down into smaller,
  parallelizable pieces are processed on the GPU, while the serial parts stay
  on the CPU

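The split between parallelizable and serial parts can be sketched structurally in Python. Here a thread pool is only a stand-in for the GPU (Python threads do not give real GPU-style parallelism), and the names `square_chunk` and `sum_of_squares` as well as the chunk size are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def square_chunk(chunk):
    """Parallelizable part: each chunk is independent of the others."""
    return [x * x for x in chunk]


def sum_of_squares(values, chunk_size=4, workers=4):
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    # Stand-in for the GPU: independent chunks are handed to a pool of workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        squared = list(pool.map(square_chunk, chunks))
    # Serial part, kept on the "CPU" side: combine the partial results in order.
    total = 0
    for part in squared:
        total += sum(part)
    return total


result = sum_of_squares(list(range(10)))
print(result)  # 285
```

Squaring each chunk needs no information from any other chunk, which is what makes it a candidate for the GPU; the final accumulation depends on all partial results and stays serial.
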
### GPU programming

- Writing parallel programs that run on GPUs using supported platforms, such
  as CUDA (C/C++ or Fortran) or OpenACC
- Ways to program GPUs:
  - GPU-accelerated libraries: developers write ordinary code and call the
    library
  - GPU directives: loops are parallelized automatically using compiler
    directives (OpenACC in C or Fortran)
  - Develop your own: use CUDA together with a language, e.g. CUDA C/C++

### Memory

#### Device memory: GDDR (Graphics Double Data Rate)

- Resides in the CUDA address space
- Used by CUDA kernels through ordinary pointer and array dereferencing
- On most GPUs this is dedicated memory attached to the GPU
- No virtual memory like on the CPU: when device memory is exhausted,
  allocation fails

#### Host memory: DDR (Double Data Rate)

- CPU memory, managed by library calls like `malloc` and `free`
- In CUDA programs, host memory is virtualized by the OS: when physical memory
  is exhausted, allocation does not necessarily fail
- The OS component that manages virtual memory is the VMM (Virtual Memory
  Manager)
- The GPU accesses host memory through DMA (Direct Memory Access), which lets
  the GPU work alongside the CPU

### Processing

1. Copy the input from CPU memory to GPU memory.
2. The GPU loads the program and executes it, caching data in the GPU cache,
   and saves the result to GPU memory.
3. Copy the results from GPU memory to CPU memory.

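The three steps can be walked through with a toy model of separate host and device memory. This is a hypothetical sketch, not real GPU code: the `ToyDevice` class and its methods are invented for illustration, and the "kernel" runs sequentially in Python. In real CUDA code the same steps would typically use runtime calls such as `cudaMalloc` and `cudaMemcpy` around a kernel launch.

```python
class ToyDevice:
    """Hypothetical stand-in for a GPU with a fixed amount of device memory."""

    def __init__(self, capacity):
        self.capacity = capacity   # number of elements that fit in device memory
        self.used = 0
        self.buffers = {}

    def alloc(self, name, size):
        # No virtual memory in this model: allocation fails once memory is full.
        if self.used + size > self.capacity:
            raise MemoryError(f"device out of memory allocating {name!r}")
        self.buffers[name] = [0] * size
        self.used += size

    def copy_to_device(self, name, host_data):
        self.buffers[name][:] = host_data

    def copy_to_host(self, name):
        return list(self.buffers[name])

    def launch(self, kernel, src, dst):
        # Apply the kernel to every element; a real GPU would do this in parallel.
        self.buffers[dst][:] = [kernel(x) for x in self.buffers[src]]


host_input = [1, 2, 3, 4]
device = ToyDevice(capacity=8)
device.alloc("in", len(host_input))
device.alloc("out", len(host_input))

device.copy_to_device("in", host_input)        # step 1: host -> device
device.launch(lambda x: 2 * x, "in", "out")    # step 2: compute on the device
host_output = device.copy_to_host("out")       # step 3: device -> host
print(host_output)  # [2, 4, 6, 8]
```

Because the toy device holds only 8 elements, a further `device.alloc("extra", 1)` raises `MemoryError`, mirroring the point that device allocations fail once device memory is exhausted.
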
### Advantages

- Fast
- Efficiency: energy and design
- Fewer cycles of communication