# Introduction to GPU

## Definition

- Graphics Processing Unit, also known as Visual Processing Unit
- Used to **accelerate** the computation and processing of images and data,
  either for output to a display device or for modern High Performance
  Computing
- Very efficient at computer graphics and floating-point processing
- History:
  - The programmable GPU was introduced by NVIDIA in 1999
  - Initially used by gamers, artists, and game programmers
  - Later adopted by researchers
  - GPGPU was introduced by NVIDIA to allow programming languages to be used
    on the GPU
  - CUDA was created by NVIDIA to enable parallel computing using GPUs and
    GPGPU

## Components of GPU

### Structure

- Many-core processor
- Five-layer architecture

### Architecture

#### Host Interface

- Communicates between the host and the GPU
- Receives commands from the CPU and obtains information from memory
- Produces vertices for processing

#### Vertex Processing

- Receives vertices from the host interface and produces output in screen
  space (see the sketch below)
- No vertices are added or deleted: a 1:1 mapping relationship
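
As an illustration of this stage (not from the original notes; the kernel name
and data layout are assumptions), the core of vertex processing is multiplying
each vertex by a 4x4 transformation matrix, one thread per vertex, which is
why the 1:1 in/out mapping holds:

```cuda
// Hypothetical sketch of a vertex stage: one thread per input vertex,
// transformed by a row-major 4x4 model-view-projection matrix.
__global__ void vertexStage(const float4 *in, float4 *out,
                            const float *mvp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;                       // guard threads past the end
    float4 v = in[i];
    float4 r;
    r.x = mvp[0]*v.x  + mvp[1]*v.y  + mvp[2]*v.z  + mvp[3]*v.w;
    r.y = mvp[4]*v.x  + mvp[5]*v.y  + mvp[6]*v.z  + mvp[7]*v.w;
    r.z = mvp[8]*v.x  + mvp[9]*v.y  + mvp[10]*v.z + mvp[11]*v.w;
    r.w = mvp[12]*v.x + mvp[13]*v.y + mvp[14]*v.z + mvp[15]*v.w;
    out[i] = r;                               // exactly one output per input
}
```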

#### Triangle Setup

- Converts screen-space geometry from the vertex processing layer into pixels
  in the output (raster format)
- Triangles located outside of the view are discarded
- Triangles are rendered as fragments, and a fragment is generated only if the
  pixel's center lies inside the triangle (see the sketch below)
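
A common way to express this coverage test is with edge functions; the
following is an illustrative sketch (the helper names are made up), not the
notes' own code:

```cuda
// A point (px, py) lies inside triangle ABC when it is on the same side of
// all three directed edges. edge() is the signed area of AB x AP.
__host__ __device__ float edge(float ax, float ay, float bx, float by,
                               float px, float py) {
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
}

__host__ __device__ bool pixelCovered(float px, float py,
                                      float ax, float ay, float bx, float by,
                                      float cx, float cy) {
    float e0 = edge(ax, ay, bx, by, px, py);
    float e1 = edge(bx, by, cx, cy, px, py);
    float e2 = edge(cx, cy, ax, ay, px, py);
    // All non-negative (or all non-positive, for the opposite winding)
    // means the pixel center is covered and a fragment is generated.
    return (e0 >= 0 && e1 >= 0 && e2 >= 0) ||
           (e0 <= 0 && e1 <= 0 && e2 <= 0);
}
```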

#### Pixel Processing

- Receives fragments from the previous layer with metadata (attributes)
  attached, which are used to calculate the color of the pixel
- Performs texture mapping and math, so it is the most costly layer

#### Memory Interface

- Fragment colors are stored here
- Colors are compressed to save space and bandwidth; this is the second most
  costly layer

## CPU and GPU

### Differences

- CPU: a small number of hard tasks
  - Individual and **distinctive** tasks
  - Few cores, primarily for sequential and **serial** processing; fewer
    execution units and **transistors**
  - Memory interface is **slower**
  - Fewer pipelines
  - More control and caching transistors
  - Has L1 (data, instruction) and L2 caches on each core, and a **shared** L3
    cache
- GPU: a large number of simple tasks
  - Work can be broken into **many** **tiny** parts and processed in parallel
  - Massively **parallel**: thousands of cores that handle many tasks at once,
    with more execution **units** and **transistors**
  - Memory interface is much **faster**
  - More pipelines than the CPU
  - More ALUs (Arithmetic Logic Units)
  - A dedicated L1 cache for every Streaming Multiprocessor (contained in
    processor clusters) and a shared L2 cache; more tolerant of latency (less
    cache)

### Communication

- They interact in **parallel** with each other
- They run on separate threads and communicate through a **command buffer**;
  the CPU writes commands into it and the GPU consumes them (see the sketch
  below)
- Problems
  - CPU bottleneck: if the CPU is slow, the command buffer is **empty** and
    the GPU waits for input from the CPU, resulting in an idle GPU
  - GPU bottleneck: if the GPU is slow, the command buffer is **full** and the
    CPU waits for the GPU, resulting in an idle CPU
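
In CUDA this producer/consumer relationship is visible through streams: the
CPU enqueues asynchronous commands and moves on, while the GPU drains them in
order. A minimal sketch, assuming a made-up kernel `work`:

```cuda
#include <cuda_runtime.h>

__global__ void work(float *d, int n) {        // hypothetical GPU task
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // The CPU "produces" commands: each launch returns as soon as the
    // command is queued; the GPU "consumes" them one by one.
    for (int i = 0; i < 16; ++i)
        work<<<(n + 255) / 256, 256, 0, s>>>(d, n);

    // If the CPU queues too slowly the GPU idles (CPU bottleneck); if the
    // GPU drains too slowly the buffer fills and the CPU stalls (GPU
    // bottleneck).
    cudaStreamSynchronize(s);                  // wait for the GPU to finish

    cudaStreamDestroy(s);
    cudaFree(d);
    return 0;
}
```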

## GPU Computing

### Definition

- Using a GPU together with a CPU to accelerate scientific and enterprise
  application processing
- Some parts of an application can be broken down into parallelizable smaller
  parts, which are processed on the GPU, while the serial parts run on the CPU

### GPU programming

- Writing parallel programs that run on GPUs using compliant platforms, like
  CUDA C/Fortran or OpenACC
- Ways to program GPUs:
  - GPU-Accelerated Libraries: developers only need to write code that calls
    the library
  - GPU directives: automatically parallelize loops using directives (OpenACC,
    for C and Fortran); see the sketch after this list
  - Develop your own: use CUDA along with a language, e.g. CUDA C/C++
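
As a rough illustration of the directive approach (a sketch assuming an
OpenACC compiler such as `nvc -acc`; the function itself is made up), a single
pragma asks the compiler to offload and parallelize a plain C loop on the GPU:

```c
// Illustrative OpenACC sketch: the directive parallelizes the loop on the
// GPU; data movement is handled by the compiler.
void saxpy(int n, float a, const float *x, float *y) {
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```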

### Memory

#### Device memory: GDDR (Graphics Double Data Rate)

- Resides in the CUDA address space
- Used by CUDA kernels, with pointer and array de-referencing
- On most GPUs, dedicated memory attached to the GPU
- No virtual memory allocation like on the CPU: when memory is exhausted,
  allocation will fail (see the sketch below)
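
Because device memory is not paged like host memory, `cudaMalloc` simply
reports failure when a request cannot be satisfied; a minimal illustrative
sketch of checking for this:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    float *d;
    // An oversized request fails outright instead of being swapped to disk,
    // so the returned status must always be checked.
    cudaError_t err = cudaMalloc(&d, (size_t)1 << 40);  // ~1 TiB request
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(d);
    return 0;
}
```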

#### Host memory: DDR (Double Data Rate)

- CPU memory, managed by library calls like `malloc` and `free`
- This memory is virtualized: when memory is exhausted, allocation will not
  fail (pages are swapped out instead)
- The OS component that manages virtual memory is the VMM (Virtual Memory
  Manager)
- The GPU accesses host memory with DMA (Direct Memory Access), which enables
  the GPU to work alongside the CPU (see the sketch below)
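
DMA works best from page-locked (pinned) host memory, which CUDA allocates
with `cudaMallocHost`; a small illustrative sketch (the sizes are arbitrary):

```cuda
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;
    float *h, *d;

    // Pinned (page-locked) host allocation: the VMM cannot swap these
    // pages, so the GPU's DMA engine can transfer them directly.
    cudaMallocHost(&h, n * sizeof(float));
    cudaMalloc(&d, n * sizeof(float));

    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    // This copy is performed by DMA rather than staged by the CPU.
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```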

### Processing

1. Copy input from CPU memory to GPU memory
2. The GPU loads the program and executes it, caching data in the GPU cache
   and saving the result to GPU memory
3. Copy results from GPU memory back to CPU memory
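
A minimal CUDA vector-addition sketch of these three steps (illustrative; the
kernel name and sizes are made up):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes),
          *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);

    // Step 1: copy input from CPU memory to GPU memory.
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Step 2: the GPU executes the program; results land in GPU memory.
    add<<<(n + 255) / 256, 256>>>(da, db, dc, n);

    // Step 3: copy results from GPU memory back to CPU memory.
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("hc[0] = %f\n", hc[0]);            // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```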

### Advantages

- Fast
- Efficiency: energy and design
- Fewer cycles of communication