root/EBU6502_cloud_computing_notes

Fork 0

Ryan 40f6f6b436 3-4, 1hr

2024-12-30 17:26:57 +08:00

6 KiB

Raw Blame History

Principles of Map Reduce

Big Data

Definitions: collecting and processing (interpretation) a large amount of data, to our own benefits.
Key Features:
- Volume: scale of petabytes
- Variety: Structured, semi-structured, unstructured
- Velocity: Social media, sensor, high throughput
- Veracity: Unclean, Imprecise, Unclear

Distributed System

Simple definition: any system that works across many computers
Features:
- Concurrency
- No guarantee on sequence of events
- No guarantee on reliability
- Price Performance ratio is better
Challenges
- Assigning workers
- Failure handling
- Exchanging results
- Synchronization
Examples: link

Map Reduce

Parallelism

Intro

number of processors working together to calculate or solve a problem earlier slides
Calculation will be divided into tasks, and sent to different processors
- Can be different cores or machines
Challenges
- Hard to divide
- The subtasks may use results from each other
To solve this, we can rank task based on the difficulty to parallelize
- Easy: Image processing
- Hard: Path Search

Benefits

Solve larger or complex problems
Serial computation will take too long: too large or complex
Also cheaper, since not less powerful computer it requires
Serial computation, a retrospective:
- FDE cycle: one instruction is fetched, decoded and executed, based on Von Neumann
- As processor speed increases, more instructions can be executed at the same time
- We have reached the limit of single processor processing, where it's too hard to make a processor much faster
- We now have more cores

Real life examples

Simulation: where directly testing is difficult
Prediction: challenging in algorithm design and implementation
Data Analysis: Google search, biologist on DNA, Marketers on twitter

Platforms for parallel computing

Spreadsheets, but can't handle true big data
R language: powerful statistical features
Python: General purpose with R like addons
Databases
- RDBMS, Relational Database Management Systems
- NoSQL: Key value storage, like Amazon DynamoDB
- Document databases: MongoDB
- Graph databases: Neo4j, Giraph
Specialized language
- MapReduce / Hadoop
- Spark
Data preparation and visualization

Introduction to the Map Reduce Model

Example problem

Counting words
input: text
output: key-pair lists: word - count

Serial solution

Split into lines
Iterate over each line, and count each word

Parallelized solution using Divide and Conquer

Split sentences or lines into words
Count words on separate machines, in parallel
Partition the problem and combine the result
Challenges
- Assigning work units to workers
- Having more work units than workers
- Shared partial results may be needed by workers
- Aggregate partial results
- Knowing workers finished or stopped

Map Reduce

Definitions

A parallel programming model, and associated implementation, that scales well and has auto parallelization
Features:
- Used by big companies
- User specify the map function and reduce function
- Runtime automatically parallelize the computation
- Runtime handles failures, communication and perf issues
- Not suitable for every possible algorithms
Can also refer to the runtime (execution framework), or the specific code implementation

Pattern

Input: usually from file
Map task: aggregate key-value pairs
Reduce task: Process and reduce the key-value pairs
Output: can write to file system

Usage

Google: Index building for search, Article clustering, Statistical machine translation
Yahoo: Index building for search, spam detection for mail
Facebook: Data Mining, Advertisement, Spam

History

Inspired by functional programming like Lisp:
- map(): apply function to each individual value of a array, to modify the array
- reduce(): combine the values from an array to get a reduced number

Model

Input data is partitioned into processable chunks
Map: Called on every item in input, and emit a series of intermediate pairs (list(ki, vi))
- One map job per chunk, can be parallelized
Reduce: Called on every unique key (ki, list(vi)), and it's value list, to emit a value to the output (ki, vi' or simply vi')
- One reduce job for each distinct key emitted by mapper, can be parallelized
Procedure: nodes work on map job first, after map is completed, they synchronize to aggregate the intermediate values by output key, then run reduce

Example

TODO: review the lab code on word count

Benefits

High level abstraction
Framework is efficient, leading to good performance
Scalability is close to linear, since map and reduce is fully parallelized
Reduces the complexity of parallel programming

Running

Shuffle, moving data from mappers to reducers and sort, ordering outputs before being processed by reducer
- Shuffle:
  - Every intermediate key-value pairs generated by mapper are collected
  - Pairs are partitioned to a list of values, each partition sorted by key
  - Data is shuffled and sent to each reducer
- Sort:
  - Reducer copy data from mapper
  - Downloaded data are merged and sorted as input for reducer, key-list of values, sorted by keys
Runtime
- Partitions input
- Schedules executions
- Handles load balancing
- Shuffle, partition, sort intermediate data
- Handle failure
- Manage IPC (Inter Process Communication)
TODO: see page 66 for diagram example

Implementations

Google MapReduce: proprietary and private
Hadoop: Open Source, written in Java, first used by Yahoo, then merged into Apache project.
- Used by most major companies: Amazon, Facebook, Google, IBM, last.fm, Yahoo
Custom implementation

6 KiB Raw Blame History

Principles of Map Reduce

Big Data

Distributed System

Map Reduce

Parallelism

Intro

Benefits

Real life examples

Platforms for parallel computing

Introduction to the Map Reduce Model

Example problem

Serial solution

Parallelized solution using Divide and Conquer

Map Reduce

Definitions

Pattern

Usage

History

Model

Example

Benefits

Running

Implementations

6 KiB

Raw Blame History