
Introduction to GPU

Definition

  • Graphics Processing Unit, also known as Visual Processing Unit
  • Used to accelerate the computation and processing of images and data, either for output to a display device or for use in modern High Performance Computing
  • Very efficient at computer graphics and floating-point processing
  • History:
    • The programmable GPU was invented by NVIDIA in 1999
    • Initially used by gamers, artists and game programmers
    • Later adopted by researchers
    • GPGPU (General-Purpose computing on GPUs) was introduced by NVIDIA to allow general-purpose programming languages to be used on the GPU
    • CUDA was created by NVIDIA to enable parallel computing using GPUs and GPGPU

Components of GPU

Structure

  • Many-core processor
  • 5-layer architecture

Architecture

Host Interface

  • Communicates between the host (CPU) and the GPU
    • Receives commands from the CPU and obtains information from memory
    • Produces vertices for processing

Vertex Processing

  • Receives vertices from the host interface and produces output in screen space
  • No vertices are added or deleted: a 1:1 mapping relationship

Triangle Setup

  • Converts the screen-space geometry from the vertex processing layer into pixels in the output (raster format)
  • Triangles located outside of the view are discarded
  • A pixel is rendered as a fragment only if the pixel's center lies inside the triangle (see the coverage-test sketch below)
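
A minimal sketch of that coverage test, assuming counter-clockwise vertex order; the helper name covers is ours, and real hardware rasterizers use more elaborate fixed-point variants of this edge-function idea:

```cuda
// Hypothetical coverage test: a pixel produces a fragment only when its
// center (px, py) lies inside the screen-space triangle (a, b, c).
__host__ __device__ bool covers(float px, float py,
                                float ax, float ay,
                                float bx, float by,
                                float cx, float cy)
{
    // Edge functions: each gives the signed area between one edge and the
    // point, so its sign says which side of that edge the point is on.
    float e0 = (bx - ax) * (py - ay) - (by - ay) * (px - ax);
    float e1 = (cx - bx) * (py - by) - (cy - by) * (px - bx);
    float e2 = (ax - cx) * (py - cy) - (ay - cy) * (px - cx);

    // Inside when all three edge functions agree in sign
    // (counter-clockwise vertex order assumed).
    return e0 >= 0.0f && e1 >= 0.0f && e2 >= 0.0f;
}
```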

Pixel Processing

  • Receives fragments from the previous layer with metadata (attributes) attached, which are used to calculate the color of each pixel
  • Performs texture mapping and math, so it is the most costly layer

Memory Interface

  • Fragment colors are stored here
    • They are compressed to save space and bandwidth; this is the second most costly layer

CPU and GPU

Differences

  • CPU: small number of hard tasks
    • Each task is individual and distinctive
    • Few cores, primarily for sequential and serial processing; fewer execution units and transistors
    • Memory interface is slower
    • Fewer pipelines
    • More transistors devoted to control and caching
    • Has L1 (data and instruction) and L2 caches on each core, plus a shared L3 cache
  • GPU: large number of simple tasks
    • Work can be broken into many tiny parts and processed in parallel
    • Massively parallel: thousands of cores handling many tasks at once, with more execution units and transistors
    • Memory interface is much faster
    • More pipelines than the CPU
    • More ALUs (Arithmetic Logic Units)
    • A dedicated L1 cache for each streaming multiprocessor (grouped into processor clusters) and a shared L2 cache; more tolerant of latency, so less cache is needed

Communication

  • They interact in parallel with each other
  • They run on separate threads and communicate through a command buffer (see the stream sketch below)
  • Problems
    • CPU bottleneck: if the CPU is slow, the command buffer runs empty and the GPU waits for input from the CPU, leaving the GPU idle
    • GPU bottleneck: if the GPU is slow, the command buffer fills up and the CPU waits for the GPU to catch up, leaving the CPU idle
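
As a rough sketch of this relationship, a CUDA stream behaves much like the command buffer: the CPU enqueues work asynchronously and only blocks when it explicitly synchronizes. The kernel name work is a placeholder, d_data is assumed to be an already-allocated device buffer of 256 × 256 floats, and error handling is omitted:

```cuda
#include <cuda_runtime.h>

__global__ void work(float *data)          // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}

void submit(float *d_data)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The CPU enqueues commands and returns immediately; the stream acts
    // like the command buffer sitting between the two processors.
    for (int i = 0; i < 100; ++i)
        work<<<256, 256, 0, stream>>>(d_data);

    // If the GPU drains the buffer faster than the CPU refills it, the GPU
    // idles (CPU bottleneck); if the buffer stays full, the CPU ends up
    // waiting instead (GPU bottleneck).
    cudaStreamSynchronize(stream);         // wait for the GPU to finish
    cudaStreamDestroy(stream);
}
```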

GPU Computing

Definition

  • Using the GPU together with the CPU to accelerate scientific and enterprise application processing
  • Applications are broken down into parallelizable smaller parts, which are processed on the GPU, while the serial parts run on the CPU (see the sketch below)
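
A minimal sketch of that split using a hypothetical SAXPY routine: the per-element loop has no dependencies between iterations, so the GPU version assigns one element to each thread, while the CPU version keeps the same work serial:

```cuda
// GPU version: one thread per element, all elements processed in parallel.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard: the grid may be larger than n
        y[i] = a * x[i] + y[i];
}

// CPU version: the same work done serially, one element per iteration.
void saxpy_cpu(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```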

GPU programming

  • Writing parallel programs that run on GPUs using compliant platforms, such as CUDA, C/Fortran or OpenACC
  • Ways to program GPUs:
    • GPU-accelerated libraries: developers only write ordinary code that calls the library (see the cuBLAS sketch below)
    • GPU directives: loops are parallelized automatically by annotating them with directives (OpenACC for C and Fortran)
    • Develop your own: use CUDA together with a host language, e.g. CUDA C/C++
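
For the library route, a sketch using cuBLAS: the same SAXPY operation as above, but without writing any kernel. Here d_x and d_y are assumed to be device pointers that were already allocated and filled, and error checking is omitted:

```cuda
#include <cublas_v2.h>

void saxpy_library(int n, float a, const float *d_x, float *d_y)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    // The library launches its own tuned kernel internally;
    // the caller never writes GPU code.
    cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);

    cublasDestroy(handle);
}
```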

Memory

Device memory: GDDR (Graphics Double Data Rate)

  • Resides in the CUDA address space
  • Used by CUDA kernels, through pointer and array dereferencing
  • On most GPUs, it is dedicated memory physically attached to the GPU
  • No virtual memory as on the CPU: when device memory is exhausted, allocation will fail (see the sketch below)
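
A minimal sketch of that failure mode: unlike host malloc, cudaMalloc has no virtual-memory fallback and simply returns an error code when device memory runs out (the helper name alloc_device is ours):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

float *alloc_device(size_t n)
{
    float *d_buf = NULL;
    cudaError_t err = cudaMalloc((void **)&d_buf, n * sizeof(float));
    if (err != cudaSuccess) {
        // No paging to fall back on: the allocation just fails.
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return NULL;
    }
    return d_buf;   // pointer into the CUDA (device) address space
}
```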

Host memory: DDR (Double Data Rate)

  • CPU memory, managed by library calls such as malloc and free
  • In CUDA, this memory is virtualized: when physical memory is exhausted, allocation will not fail
    • The part of the OS that manages virtual memory: the VMM (Virtual Memory Manager)
    • The GPU accesses host memory through DMA (Direct Memory Access), which enables the GPU to work alongside the CPU (see the pinned-memory sketch below)
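
A sketch contrasting the two kinds of host allocation: ordinary malloc memory is pageable and managed by the VMM, while cudaMallocHost page-locks (pins) the buffer so the GPU can DMA from it directly; the driver stages transfers from pageable buffers through a pinned one behind the scenes:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

void host_buffers(size_t n)
{
    // Pageable host memory: plain malloc, virtualized by the OS (VMM),
    // so the allocation does not fail just because RAM is exhausted.
    float *pageable = (float *)malloc(n * sizeof(float));

    // Page-locked (pinned) host memory: the GPU can DMA from it directly.
    float *pinned = NULL;
    cudaMallocHost((void **)&pinned, n * sizeof(float));

    /* ... fill the buffers and copy to the device ... */

    cudaFreeHost(pinned);
    free(pageable);
}
```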

Processing

  1. Copy the input from CPU memory to GPU memory
  2. The GPU loads and executes the program, caching data in GPU caches and saving the result to GPU memory
  3. Copy the results from GPU memory back to CPU memory (sketched below)
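
A minimal sketch of these three steps for a vector of n floats, reusing the saxpy kernel sketched earlier (error handling omitted):

```cuda
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y);  // as above

void run(int n, float a, const float *h_x, float *h_y)
{
    size_t bytes = n * sizeof(float);
    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_y, bytes);

    // 1. Copy the input from CPU (host) memory to GPU (device) memory.
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    // 2. Execute on the GPU; data is cached on-chip during the run and
    //    the result is written to GPU memory.
    saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);

    // 3. Copy the result from GPU memory back to CPU memory.
    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    cudaFree(d_y);
}
```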

Advantages

  • Fast
  • Efficient, in both energy use and design
  • Fewer communication cycles