EBU6502_cloud_computing_notes/3-1-map-reduce.md

199 lines
6.1 KiB
Markdown
Raw Normal View History

2024-12-29 21:17:17 +08:00
# Principles of Map Reduce
## Big Data
- Definitions: collecting and processing (interpretation) a large amount of
data, to our own benefits.
- Key Features:
- Volume: scale of petabytes
- Variety: Structured, semi-structured, unstructured
- Velocity: Social media, sensor, high throughput
- Veracity: Unclean, Imprecise, Unclear
## Distributed System
- Simple definition: any system that works across many computers
- Features:
- Concurrency
- No guarantee on sequence of events
- No guarantee on reliability
- Price Performance ratio is better
- Challenges
- Assigning workers
- Failure handling
- Exchanging results
- Synchronization
- Examples: [link](/1-1-intro.md#examples)
## Map Reduce
### Parallelism
#### Intro
- number of processors working together to calculate or solve a problem
[earlier slides](/1-4-scalability.md)
- Calculation will be divided into tasks, and sent to different processors
- Can be different cores or machines
- Challenges
- Hard to divide
- The subtasks may use results from each other
- To solve this, we can rank task based on the difficulty to parallelize
- Easy: Image processing
- Hard: Path Search
#### Benefits
- Solve larger or complex problems
- Serial computation will take too long: too large or complex
- Also cheaper, since not less powerful computer it requires
- Serial computation, a retrospective:
- FDE cycle: one instruction is fetched, decoded and executed, based on Von
Neumann
- As processor speed increases, more instructions can be executed at the
same time
- We have reached the limit of single processor processing, where it's too
hard to make a processor much faster
- We now have more cores
#### Real life examples
- Simulation: where directly testing is difficult
- Prediction: challenging in algorithm design and implementation
- Data Analysis: Google search, biologist on DNA, Marketers on twitter
#### Platforms for parallel computing
- Spreadsheets, but can't handle true big data
- R language: powerful statistical features
- Python: General purpose with R like addons
- Databases
- RDBMS, Relational Database Management Systems
- NoSQL: Key value storage, like Amazon DynamoDB
- Document databases: MongoDB
- Graph databases: Neo4j, Giraph
- Specialized language
- MapReduce / Hadoop
- Spark
- Data preparation and visualization
### Introduction to the Map Reduce Model
#### Example problem
- Counting words
- input: text
- output: key-pair lists: word - count
#### Serial solution
- Split into lines
- Iterate over each line, and count each word
#### Parallelized solution using Divide and Conquer
- Split sentences or lines into words
- Count words on separate machines, in parallel
- Partition the problem and combine the result
- Challenges
- Assigning work units to workers
- Having more work units than workers
- Shared partial results may be needed by workers
- Aggregate partial results
- Knowing workers finished or stopped
### Map Reduce
#### Definitions
- A parallel programming model, and associated implementation, that scales well
and has auto parallelization
- Features:
- Used by big companies
- User specify the map function and reduce function
- Runtime automatically parallelize the computation
- Runtime handles failures, communication and perf issues
- **Not** suitable for every possible algorithms
- Can also refer to the runtime (execution framework), or the specific code
implementation
#### Pattern
- Input: usually from file
- Map task: aggregate key-value pairs
- Reduce task: Process and reduce the key-value pairs
- Output: can write to file system
#### Usage
- Google: Index building for search, Article clustering, Statistical machine
translation
- Yahoo: Index building for search, spam detection for mail
- Facebook: Data Mining, Advertisement, Spam
#### History
- Inspired by functional programming like Lisp:
- `map()`: apply function to each individual value of a array, to modify the
array
- `reduce()`: combine the values from an array to get a reduced number
#### Model
- Input data is partitioned into processable chunks
- Map: Called on every item in input, and **emit** a series of **intermediate**
<key-value> pairs (list(ki, vi))
- One map job per chunk, can be parallelized
- Reduce: Called on every unique key (ki, list(vi)), and it's value list, to
emit a value to the output (ki, vi' or simply vi')
- One reduce job for each distinct key emitted by mapper, can be
parallelized
- Procedure: nodes work on **map** job first, after map is completed, they
**synchronize** to aggregate the intermediate values by output key, then run
**reduce**
#### Example
- TODO: review the lab code on word count
#### Benefits
- High level abstraction
- Framework is efficient, leading to good performance
- Scalability is close to linear, since map and reduce is fully parallelized
- Reduces the complexity of parallel programming
#### Running
2024-12-30 15:14:36 +08:00
- Shuffle, moving data from mappers to reducers and sort, ordering outputs
before being processed by reducer
- Shuffle:
- Every intermediate key-value pairs generated by mapper are
**collected**
- Pairs are **partitioned** to a list of values, each partition sorted
by key
- Combiner run on each partition, to combine the key-value pairs with
key- list of values
- Data is shuffled and sent to each reducer
- Sort:
- Reducer copy data from mapper
- Downloaded data are **merged** and **sorted** as input for reducer, key-list
of values, sorted by keys
2024-12-29 21:17:17 +08:00
- Runtime
- Partitions input
2024-12-30 15:14:36 +08:00
- Schedules executions
2024-12-29 21:17:17 +08:00
- Handles load balancing
- Shuffle, partition, sort intermediate data
- Handle failure
- Manage IPC (Inter Process Communication)
- TODO: see page 66 for diagram example
#### Implementations
- Google MapReduce: proprietary and private
- Hadoop: Open Source, written in Java, first used by Yahoo, then merged into
Apache project.
- Used by most major companies: Amazon, Facebook, Google, IBM, last.fm,
Yahoo
- Custom implementation