# Principles of Map Reduce

## Big Data

- Definition: collecting and processing (interpreting) a large amount of data for our own benefit.
- Key features:
  - Volume: petabyte scale
  - Variety: structured, semi-structured, unstructured
  - Velocity: social media, sensors, high-throughput streams
  - Veracity: unclean, imprecise, unclear

## Distributed System

- Simple definition: any system that works across many computers
- Features:
  - Concurrency
  - No guarantee on the sequence of events
  - No guarantee on reliability
  - Better price/performance ratio
- Challenges:
  - Assigning workers
  - Failure handling
  - Exchanging results
  - Synchronization
- Examples: [link](/1-1-intro.md#examples)

## Map Reduce

### Parallelism

#### Intro

- Parallelism: a number of processors working together to calculate or solve a problem
  (see the [earlier slides](/1-4-scalability.md))
- The calculation is divided into tasks and sent to different processors
  - These can be different cores or different machines
- Challenges:
  - The problem can be hard to divide
  - The subtasks may need results from each other
  - To reason about this, we can rank tasks by how hard they are to parallelize
    - Easy: image processing (see the sketch below)
    - Hard: path search

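As an illustration of the "easy" case, here is a minimal sketch (my own example, not from the slides) that brightens chunks of pixel values on a `multiprocessing` pool; the chunks are independent, so no worker needs another worker's result:

```python
# An "easy to parallelize" task: brighten a flat list of pixel values by
# processing independent chunks on a pool of worker processes.
from multiprocessing import Pool


def brighten(chunk):
    """Each worker brightens its own chunk; chunks do not depend on each other."""
    return [min(value + 40, 255) for value in chunk]


if __name__ == "__main__":
    pixels = list(range(256)) * 100          # stand-in for image data
    chunk_size = 1024
    chunks = [pixels[i:i + chunk_size] for i in range(0, len(pixels), chunk_size)]

    with Pool(processes=4) as pool:
        results = pool.map(brighten, chunks)  # one task per chunk, run in parallel

    brightened = [value for chunk in results for value in chunk]
    print(len(brightened), max(brightened))
```
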
#### Benefits

- Solve larger or more complex problems
  - Problems where serial computation would take too long, because they are too large or too complex
  - Also cheaper, since it does not require a single very powerful computer
- Serial computation, a retrospective:
  - FDE cycle: one instruction at a time is fetched, decoded and executed, following the Von Neumann model
  - As processor speed increases, more instructions can be executed in the same amount of time
  - We have reached the limit of single-processor speed, where it is too hard to make one processor much faster
  - So we now add more cores instead

#### Real life examples

- Simulation: where testing directly is difficult
- Prediction: challenging in algorithm design and implementation
- Data analysis: Google search, biologists working on DNA, marketers working on Twitter

#### Platforms for parallel computing

- Spreadsheets, but they can't handle true big data
- R language: powerful statistical features
- Python: general purpose, with R-like add-ons
- Databases
  - RDBMS: Relational Database Management Systems
  - NoSQL key-value stores, like Amazon DynamoDB
  - Document databases: MongoDB
  - Graph databases: Neo4j, Giraph
- Specialized languages and frameworks
  - MapReduce / Hadoop
  - Spark
- Data preparation and visualization

### Introduction to the Map Reduce Model

#### Example problem

- Counting words
  - Input: text
  - Output: a list of key-value pairs: word - count

#### Serial solution

- Split the text into lines
- Iterate over each line and count each word (see the sketch below)

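A minimal serial sketch (my own illustration, not the lab code), assuming the whole text fits in memory:

```python
# Serial word count: a single loop over all lines, one shared dictionary.
from collections import Counter


def count_words(text):
    counts = Counter()
    for line in text.splitlines():
        for word in line.lower().split():
            counts[word] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = "the quick brown fox\njumps over the lazy dog\nthe end"
    print(count_words(sample))
    # e.g. {'the': 3, 'quick': 1, 'brown': 1, ...}
```
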
#### Parallelized solution using Divide and Conquer

- Split sentences or lines into words
- Count words on separate machines, in parallel
- Partition the problem, then combine the partial results (see the sketch below)
- Challenges
  - Assigning work units to workers
  - Having more work units than workers
  - Workers may need partial results shared by other workers
  - Aggregating partial results
  - Knowing when workers have finished or stopped

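A minimal divide-and-conquer sketch (my own illustration), assuming the partial counts fit in memory on one machine:

```python
# Divide and conquer word count: partition the lines, count each partition
# in parallel, then combine (reduce) the partial results.
from collections import Counter
from functools import reduce
from multiprocessing import Pool


def count_chunk(lines):
    """Each worker counts the words in its own chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


if __name__ == "__main__":
    text = "the quick brown fox\njumps over the lazy dog\nthe end"
    lines = text.splitlines()

    # Partition the problem: one chunk of lines per worker.
    n_workers = 2
    chunks = [lines[i::n_workers] for i in range(n_workers)]

    with Pool(processes=n_workers) as pool:
        partial_counts = pool.map(count_chunk, chunks)

    # Combine the partial results into one final count.
    total = reduce(lambda a, b: a + b, partial_counts, Counter())
    print(dict(total))
```
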
### Map Reduce

#### Definitions

- A parallel programming model, and its associated implementation, that scales well and parallelizes automatically
- Features:
  - Used by big companies
  - The user specifies the map function and the reduce function
  - The runtime automatically parallelizes the computation
  - The runtime handles failures, communication and performance issues
  - **Not** suitable for every possible algorithm
- "MapReduce" can also refer to the runtime (execution framework), or to a specific code implementation

#### Pattern

- Input: usually read from files
- Map task: emit intermediate key-value pairs
- Reduce task: process and reduce the key-value pairs
- Output: can be written to the file system

#### Usage

- Google: index building for search, article clustering, statistical machine translation
- Yahoo: index building for search, spam detection for mail
- Facebook: data mining, advertisement, spam

#### History

- Inspired by functional programming languages like Lisp (see the sketch below):
  - `map()`: apply a function to each individual value of an array, producing a modified array
  - `reduce()`: combine the values of an array into a single reduced value

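For intuition, the same two ideas with Python's built-in `map()` and `functools.reduce()` (my own example, standing in for the Lisp originals):

```python
# map(): apply a function to every element; reduce(): fold the results into one value.
from functools import reduce

numbers = [1, 2, 3, 4]
squared = list(map(lambda x: x * x, numbers))    # [1, 4, 9, 16]
total = reduce(lambda acc, x: acc + x, squared)  # 30
print(squared, total)
```
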
#### Model

- Input data is partitioned into processable chunks
- Map: called on every item in the input, and **emits** a series of **intermediate** key-value pairs, i.e. list(k_i, v_i)
  - One map job per chunk; these can run in parallel
- Reduce: called on every unique key and its list of values (k_i, list(v_i)), and emits a value to the output ((k_i, v_i'), or simply v_i')
  - One reduce job per distinct key emitted by the mappers; these can run in parallel
- Procedure: nodes work on the **map** jobs first; after map is complete, they **synchronize** to aggregate the intermediate values by output key, then run **reduce** (see the sketch below)

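A minimal single-process sketch of this model (my own illustration, not a real framework): apply the map function to every chunk, group the intermediate pairs by key, then apply the reduce function to each key's value list.

```python
# Toy MapReduce engine: map -> group by key -> reduce, all in one process.
from collections import defaultdict


def run_mapreduce(chunks, map_fn, reduce_fn):
    # Map phase: one map call per chunk, each emitting (key, value) pairs.
    intermediate = defaultdict(list)
    for chunk in chunks:
        for key, value in map_fn(chunk):
            intermediate[key].append(value)   # aggregate values by key

    # Reduce phase: one reduce call per distinct key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


def map_words(line):
    """Emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.lower().split()]


def reduce_counts(word, counts):
    """Sum the 1s emitted for this word."""
    return sum(counts)


if __name__ == "__main__":
    lines = ["the quick brown fox", "jumps over the lazy dog", "the end"]
    print(run_mapreduce(lines, map_words, reduce_counts))
    # e.g. {'the': 3, 'quick': 1, ...}
```
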
#### Example

- TODO: review the lab code on word count

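Until the lab code is reviewed, here is a sketch in the style of Hadoop Streaming (my own illustration, not the lab code): the mapper emits one `word\t1` line per word, and the reducer sums the counts from its key-sorted input.

```python
# mapper_reducer.py -- word count in the Hadoop Streaming style (sketch).
# Local emulation of the shuffle/sort step:
#   cat input.txt | python3 mapper_reducer.py map | sort | python3 mapper_reducer.py reduce
import sys


def mapper():
    # Emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.lower().split():
            print(f"{word}\t1")


def reducer():
    # Input is sorted by key, so all counts for a word arrive together.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Locally, the shuffle/sort step can be emulated with the shell pipeline in the header comment; on a cluster, the streaming runtime would drive the two phases itself.
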
#### Benefits

- High-level abstraction
- The framework is efficient, leading to good performance
- Scalability is close to linear, since map and reduce are fully parallelized
- Reduces the complexity of parallel programming

#### Running

- Shuffle (moving data from the mappers to the reducers) and sort (ordering the outputs before they are processed by a reducer); a sketch follows after this list
  - Shuffle:
    - Every intermediate key-value pair generated by a mapper is **collected**
    - Pairs are **partitioned**, and each partition is sorted by key
    - The data is shuffled and sent to each reducer
  - Sort:
    - Each reducer copies data from the mappers
    - The downloaded data is **merged** and **sorted** as input for the reducer: per key, a list of values, sorted by key
- Runtime
  - Partitions the input
  - Schedules executions
  - Handles load balancing
  - Shuffles, partitions and sorts the intermediate data
  - Handles failures
  - Manages IPC (Inter-Process Communication)
- TODO: see page 66 for a diagram example

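A minimal sketch of shuffle and sort (my own illustration; a real framework does this across the network and on disk): intermediate pairs are partitioned among reducers by hashing the key, and each partition is then sorted and grouped by key.

```python
# Shuffle and sort, in miniature: hash-partition intermediate pairs across
# reducers, then sort each partition by key and group the values.
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 3

# Intermediate (key, value) pairs as emitted by the mappers.
intermediate = [("the", 1), ("fox", 1), ("the", 1), ("dog", 1), ("the", 1)]

# Shuffle: assign each pair to a reducer partition based on its key.
partitions = [[] for _ in range(NUM_REDUCERS)]
for key, value in intermediate:
    partitions[hash(key) % NUM_REDUCERS].append((key, value))

# Sort: each reducer sorts its partition by key and groups the values,
# so every reduce call sees (key, [values...]).
for reducer_id, partition in enumerate(partitions):
    partition.sort(key=itemgetter(0))
    for key, pairs in groupby(partition, key=itemgetter(0)):
        values = [value for _, value in pairs]
        print(f"reducer {reducer_id}: {key} -> {values}")
```
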
#### Implementations

- Google MapReduce: proprietary and private
- Hadoop: open source, written in Java, first used by Yahoo, then merged into the Apache project
  - Used by most major companies: Amazon, Facebook, Google, IBM, last.fm, Yahoo
- Custom implementations