map reduce intro, took 1.5 hr
This commit is contained in:
parent
83254e49ec
commit
dd6c4dad47
|
@ -50,6 +50,7 @@
|
||||||
|
|
||||||
- Internet
|
- Internet
|
||||||
- Intranets
|
- Intranets
|
||||||
|
- Domain name service
|
||||||
- Grid computing
|
- Grid computing
|
||||||
- Peer to peer (p2p) computing
|
- Peer to peer (p2p) computing
|
||||||
- Cloud computing
|
- Cloud computing
|
||||||
|
@ -116,7 +117,6 @@
|
||||||
- Interacts with heterogeneous systems
|
- Interacts with heterogeneous systems
|
||||||
- The above gave birth to cloud computing
|
- The above gave birth to cloud computing
|
||||||
|
|
||||||
|
|
||||||
### Definition
|
### Definition
|
||||||
|
|
||||||
- A computing infrastructure, that consists of shared pool of **virtualized**
|
- A computing infrastructure, that consists of shared pool of **virtualized**
|
||||||
|
@ -158,6 +158,7 @@
|
||||||
- Not **customizable**
|
- Not **customizable**
|
||||||
|
|
||||||
### Private cloud
|
### Private cloud
|
||||||
|
|
||||||
- Advantages:
|
- Advantages:
|
||||||
- Highly **private** and **secured**
|
- Highly **private** and **secured**
|
||||||
- More **control**
|
- More **control**
|
||||||
|
|
189
3-1-map-reduce.md
Normal file
189
3-1-map-reduce.md
Normal file
|
@ -0,0 +1,189 @@
|
||||||
|
# Principles of Map Reduce
|
||||||
|
|
||||||
|
## Big Data
|
||||||
|
|
||||||
|
- Definitions: collecting and processing (interpretation) a large amount of
|
||||||
|
data, to our own benefits.
|
||||||
|
- Key Features:
|
||||||
|
- Volume: scale of petabytes
|
||||||
|
- Variety: Structured, semi-structured, unstructured
|
||||||
|
- Velocity: Social media, sensor, high throughput
|
||||||
|
- Veracity: Unclean, Imprecise, Unclear
|
||||||
|
|
||||||
|
## Distributed System
|
||||||
|
|
||||||
|
- Simple definition: any system that works across many computers
|
||||||
|
- Features:
|
||||||
|
- Concurrency
|
||||||
|
- No guarantee on sequence of events
|
||||||
|
- No guarantee on reliability
|
||||||
|
- Price Performance ratio is better
|
||||||
|
- Challenges
|
||||||
|
- Assigning workers
|
||||||
|
- Failure handling
|
||||||
|
- Exchanging results
|
||||||
|
- Synchronization
|
||||||
|
- Examples: [link](/1-1-intro.md#examples)
|
||||||
|
|
||||||
|
## Map Reduce
|
||||||
|
|
||||||
|
### Parallelism
|
||||||
|
|
||||||
|
#### Intro
|
||||||
|
|
||||||
|
- number of processors working together to calculate or solve a problem
|
||||||
|
[earlier slides](/1-4-scalability.md)
|
||||||
|
- Calculation will be divided into tasks, and sent to different processors
|
||||||
|
- Can be different cores or machines
|
||||||
|
- Challenges
|
||||||
|
- Hard to divide
|
||||||
|
- The subtasks may use results from each other
|
||||||
|
- To solve this, we can rank task based on the difficulty to parallelize
|
||||||
|
- Easy: Image processing
|
||||||
|
- Hard: Path Search
|
||||||
|
|
||||||
|
#### Benefits
|
||||||
|
|
||||||
|
- Solve larger or complex problems
|
||||||
|
- Serial computation will take too long: too large or complex
|
||||||
|
- Also cheaper, since not less powerful computer it requires
|
||||||
|
- Serial computation, a retrospective:
|
||||||
|
- FDE cycle: one instruction is fetched, decoded and executed, based on Von
|
||||||
|
Neumann
|
||||||
|
- As processor speed increases, more instructions can be executed at the
|
||||||
|
same time
|
||||||
|
- We have reached the limit of single processor processing, where it's too
|
||||||
|
hard to make a processor much faster
|
||||||
|
- We now have more cores
|
||||||
|
|
||||||
|
#### Real life examples
|
||||||
|
|
||||||
|
- Simulation: where directly testing is difficult
|
||||||
|
- Prediction: challenging in algorithm design and implementation
|
||||||
|
- Data Analysis: Google search, biologist on DNA, Marketers on twitter
|
||||||
|
|
||||||
|
#### Platforms for parallel computing
|
||||||
|
|
||||||
|
- Spreadsheets, but can't handle true big data
|
||||||
|
- R language: powerful statistical features
|
||||||
|
- Python: General purpose with R like addons
|
||||||
|
- Databases
|
||||||
|
- RDBMS, Relational Database Management Systems
|
||||||
|
- NoSQL: Key value storage, like Amazon DynamoDB
|
||||||
|
- Document databases: MongoDB
|
||||||
|
- Graph databases: Neo4j, Giraph
|
||||||
|
- Specialized language
|
||||||
|
- MapReduce / Hadoop
|
||||||
|
- Spark
|
||||||
|
- Data preparation and visualization
|
||||||
|
|
||||||
|
### Introduction to the Map Reduce Model
|
||||||
|
|
||||||
|
#### Example problem
|
||||||
|
|
||||||
|
- Counting words
|
||||||
|
- input: text
|
||||||
|
- output: key-pair lists: word - count
|
||||||
|
|
||||||
|
#### Serial solution
|
||||||
|
|
||||||
|
- Split into lines
|
||||||
|
- Iterate over each line, and count each word
|
||||||
|
|
||||||
|
#### Parallelized solution using Divide and Conquer
|
||||||
|
|
||||||
|
- Split sentences or lines into words
|
||||||
|
- Count words on separate machines, in parallel
|
||||||
|
- Partition the problem and combine the result
|
||||||
|
- Challenges
|
||||||
|
- Assigning work units to workers
|
||||||
|
- Having more work units than workers
|
||||||
|
- Shared partial results may be needed by workers
|
||||||
|
- Aggregate partial results
|
||||||
|
- Knowing workers finished or stopped
|
||||||
|
|
||||||
|
### Map Reduce
|
||||||
|
|
||||||
|
#### Definitions
|
||||||
|
|
||||||
|
- A parallel programming model, and associated implementation, that scales well
|
||||||
|
and has auto parallelization
|
||||||
|
- Features:
|
||||||
|
- Used by big companies
|
||||||
|
- User specify the map function and reduce function
|
||||||
|
- Runtime automatically parallelize the computation
|
||||||
|
- Runtime handles failures, communication and perf issues
|
||||||
|
- **Not** suitable for every possible algorithms
|
||||||
|
- Can also refer to the runtime (execution framework), or the specific code
|
||||||
|
implementation
|
||||||
|
|
||||||
|
#### Pattern
|
||||||
|
|
||||||
|
- Input: usually from file
|
||||||
|
- Map task: aggregate key-value pairs
|
||||||
|
- Reduce task: Process and reduce the key-value pairs
|
||||||
|
- Output: can write to file system
|
||||||
|
|
||||||
|
#### Usage
|
||||||
|
|
||||||
|
- Google: Index building for search, Article clustering, Statistical machine
|
||||||
|
translation
|
||||||
|
- Yahoo: Index building for search, spam detection for mail
|
||||||
|
- Facebook: Data Mining, Advertisement, Spam
|
||||||
|
|
||||||
|
#### History
|
||||||
|
|
||||||
|
- Inspired by functional programming like Lisp:
|
||||||
|
- `map()`: apply function to each individual value of a array, to modify the
|
||||||
|
array
|
||||||
|
- `reduce()`: combine the values from an array to get a reduced number
|
||||||
|
|
||||||
|
#### Model
|
||||||
|
|
||||||
|
- Input data is partitioned into processable chunks
|
||||||
|
- Map: Called on every item in input, and **emit** a series of **intermediate**
|
||||||
|
<key-value> pairs (list(ki, vi))
|
||||||
|
- One map job per chunk, can be parallelized
|
||||||
|
- Reduce: Called on every unique key (ki, list(vi)), and it's value list, to
|
||||||
|
emit a value to the output (ki, vi' or simply vi')
|
||||||
|
- One reduce job for each distinct key emitted by mapper, can be
|
||||||
|
parallelized
|
||||||
|
- Procedure: nodes work on **map** job first, after map is completed, they
|
||||||
|
**synchronize** to aggregate the intermediate values by output key, then run
|
||||||
|
**reduce**
|
||||||
|
|
||||||
|
#### Example
|
||||||
|
|
||||||
|
- TODO: review the lab code on word count
|
||||||
|
|
||||||
|
#### Benefits
|
||||||
|
|
||||||
|
- High level abstraction
|
||||||
|
- Framework is efficient, leading to good performance
|
||||||
|
- Scalability is close to linear, since map and reduce is fully parallelized
|
||||||
|
- Reduces the complexity of parallel programming
|
||||||
|
|
||||||
|
#### Running
|
||||||
|
|
||||||
|
- Shuffle and sort
|
||||||
|
- Every intermediate key-value pairs generated by mapper are collected
|
||||||
|
- Same key items are grouped to a list of values
|
||||||
|
- Data is shuffled and sent to each reducer
|
||||||
|
- Data provided to each reducer is sorted by keys
|
||||||
|
- Runtime
|
||||||
|
- Partitions input
|
||||||
|
- schedules executions
|
||||||
|
- Handles load balancing
|
||||||
|
- Shuffle, partition, sort intermediate data
|
||||||
|
- Handle failure
|
||||||
|
- Manage IPC (Inter Process Communication)
|
||||||
|
- TODO: see page 66 for diagram example
|
||||||
|
|
||||||
|
#### Implementations
|
||||||
|
|
||||||
|
- Google MapReduce: proprietary and private
|
||||||
|
- Hadoop: Open Source, written in Java, first used by Yahoo, then merged into
|
||||||
|
Apache project.
|
||||||
|
- Used by most major companies: Amazon, Facebook, Google, IBM, last.fm,
|
||||||
|
Yahoo
|
||||||
|
- Custom implementation
|
Loading…
Reference in a new issue