map reduce intro, took 1.5 hr
This commit is contained in:
parent
83254e49ec
commit
dd6c4dad47
|
@ -50,6 +50,7 @@
|
|||
|
||||
- Internet
|
||||
- Intranets
|
||||
- Domain name service
|
||||
- Grid computing
|
||||
- Peer to peer (p2p) computing
|
||||
- Cloud computing
|
||||
|
@ -116,7 +117,6 @@
|
|||
- Interacts with heterogeneous systems
|
||||
- The above gave birth to cloud computing
|
||||
|
||||
|
||||
### Definition
|
||||
|
||||
- A computing infrastructure, that consists of shared pool of **virtualized**
|
||||
|
@ -145,7 +145,7 @@
|
|||
|
||||
## Cloud Deployment Models
|
||||
|
||||
### Public Cloud
|
||||
### Public Cloud
|
||||
|
||||
- Advantages:
|
||||
- Flexible
|
||||
|
@ -158,6 +158,7 @@
|
|||
- Not **customizable**
|
||||
|
||||
### Private cloud
|
||||
|
||||
- Advantages:
|
||||
- Highly **private** and **secured**
|
||||
- More **control**
|
||||
|
|
189
3-1-map-reduce.md
Normal file
189
3-1-map-reduce.md
Normal file
|
@ -0,0 +1,189 @@
|
|||
# Principles of Map Reduce
|
||||
|
||||
## Big Data
|
||||
|
||||
- Definitions: collecting and processing (interpretation) a large amount of
|
||||
data, to our own benefits.
|
||||
- Key Features:
|
||||
- Volume: scale of petabytes
|
||||
- Variety: Structured, semi-structured, unstructured
|
||||
- Velocity: Social media, sensor, high throughput
|
||||
- Veracity: Unclean, Imprecise, Unclear
|
||||
|
||||
## Distributed System
|
||||
|
||||
- Simple definition: any system that works across many computers
|
||||
- Features:
|
||||
- Concurrency
|
||||
- No guarantee on sequence of events
|
||||
- No guarantee on reliability
|
||||
- Price Performance ratio is better
|
||||
- Challenges
|
||||
- Assigning workers
|
||||
- Failure handling
|
||||
- Exchanging results
|
||||
- Synchronization
|
||||
- Examples: [link](/1-1-intro.md#examples)
|
||||
|
||||
## Map Reduce
|
||||
|
||||
### Parallelism
|
||||
|
||||
#### Intro
|
||||
|
||||
- number of processors working together to calculate or solve a problem
|
||||
[earlier slides](/1-4-scalability.md)
|
||||
- Calculation will be divided into tasks, and sent to different processors
|
||||
- Can be different cores or machines
|
||||
- Challenges
|
||||
- Hard to divide
|
||||
- The subtasks may use results from each other
|
||||
- To solve this, we can rank task based on the difficulty to parallelize
|
||||
- Easy: Image processing
|
||||
- Hard: Path Search
|
||||
|
||||
#### Benefits
|
||||
|
||||
- Solve larger or complex problems
|
||||
- Serial computation will take too long: too large or complex
|
||||
- Also cheaper, since not less powerful computer it requires
|
||||
- Serial computation, a retrospective:
|
||||
- FDE cycle: one instruction is fetched, decoded and executed, based on Von
|
||||
Neumann
|
||||
- As processor speed increases, more instructions can be executed at the
|
||||
same time
|
||||
- We have reached the limit of single processor processing, where it's too
|
||||
hard to make a processor much faster
|
||||
- We now have more cores
|
||||
|
||||
#### Real life examples
|
||||
|
||||
- Simulation: where directly testing is difficult
|
||||
- Prediction: challenging in algorithm design and implementation
|
||||
- Data Analysis: Google search, biologist on DNA, Marketers on twitter
|
||||
|
||||
#### Platforms for parallel computing
|
||||
|
||||
- Spreadsheets, but can't handle true big data
|
||||
- R language: powerful statistical features
|
||||
- Python: General purpose with R like addons
|
||||
- Databases
|
||||
- RDBMS, Relational Database Management Systems
|
||||
- NoSQL: Key value storage, like Amazon DynamoDB
|
||||
- Document databases: MongoDB
|
||||
- Graph databases: Neo4j, Giraph
|
||||
- Specialized language
|
||||
- MapReduce / Hadoop
|
||||
- Spark
|
||||
- Data preparation and visualization
|
||||
|
||||
### Introduction to the Map Reduce Model
|
||||
|
||||
#### Example problem
|
||||
|
||||
- Counting words
|
||||
- input: text
|
||||
- output: key-pair lists: word - count
|
||||
|
||||
#### Serial solution
|
||||
|
||||
- Split into lines
|
||||
- Iterate over each line, and count each word
|
||||
|
||||
#### Parallelized solution using Divide and Conquer
|
||||
|
||||
- Split sentences or lines into words
|
||||
- Count words on separate machines, in parallel
|
||||
- Partition the problem and combine the result
|
||||
- Challenges
|
||||
- Assigning work units to workers
|
||||
- Having more work units than workers
|
||||
- Shared partial results may be needed by workers
|
||||
- Aggregate partial results
|
||||
- Knowing workers finished or stopped
|
||||
|
||||
### Map Reduce
|
||||
|
||||
#### Definitions
|
||||
|
||||
- A parallel programming model, and associated implementation, that scales well
|
||||
and has auto parallelization
|
||||
- Features:
|
||||
- Used by big companies
|
||||
- User specify the map function and reduce function
|
||||
- Runtime automatically parallelize the computation
|
||||
- Runtime handles failures, communication and perf issues
|
||||
- **Not** suitable for every possible algorithms
|
||||
- Can also refer to the runtime (execution framework), or the specific code
|
||||
implementation
|
||||
|
||||
#### Pattern
|
||||
|
||||
- Input: usually from file
|
||||
- Map task: aggregate key-value pairs
|
||||
- Reduce task: Process and reduce the key-value pairs
|
||||
- Output: can write to file system
|
||||
|
||||
#### Usage
|
||||
|
||||
- Google: Index building for search, Article clustering, Statistical machine
|
||||
translation
|
||||
- Yahoo: Index building for search, spam detection for mail
|
||||
- Facebook: Data Mining, Advertisement, Spam
|
||||
|
||||
#### History
|
||||
|
||||
- Inspired by functional programming like Lisp:
|
||||
- `map()`: apply function to each individual value of a array, to modify the
|
||||
array
|
||||
- `reduce()`: combine the values from an array to get a reduced number
|
||||
|
||||
#### Model
|
||||
|
||||
- Input data is partitioned into processable chunks
|
||||
- Map: Called on every item in input, and **emit** a series of **intermediate**
|
||||
<key-value> pairs (list(ki, vi))
|
||||
- One map job per chunk, can be parallelized
|
||||
- Reduce: Called on every unique key (ki, list(vi)), and it's value list, to
|
||||
emit a value to the output (ki, vi' or simply vi')
|
||||
- One reduce job for each distinct key emitted by mapper, can be
|
||||
parallelized
|
||||
- Procedure: nodes work on **map** job first, after map is completed, they
|
||||
**synchronize** to aggregate the intermediate values by output key, then run
|
||||
**reduce**
|
||||
|
||||
#### Example
|
||||
|
||||
- TODO: review the lab code on word count
|
||||
|
||||
#### Benefits
|
||||
|
||||
- High level abstraction
|
||||
- Framework is efficient, leading to good performance
|
||||
- Scalability is close to linear, since map and reduce is fully parallelized
|
||||
- Reduces the complexity of parallel programming
|
||||
|
||||
#### Running
|
||||
|
||||
- Shuffle and sort
|
||||
- Every intermediate key-value pairs generated by mapper are collected
|
||||
- Same key items are grouped to a list of values
|
||||
- Data is shuffled and sent to each reducer
|
||||
- Data provided to each reducer is sorted by keys
|
||||
- Runtime
|
||||
- Partitions input
|
||||
- schedules executions
|
||||
- Handles load balancing
|
||||
- Shuffle, partition, sort intermediate data
|
||||
- Handle failure
|
||||
- Manage IPC (Inter Process Communication)
|
||||
- TODO: see page 66 for diagram example
|
||||
|
||||
#### Implementations
|
||||
|
||||
- Google MapReduce: proprietary and private
|
||||
- Hadoop: Open Source, written in Java, first used by Yahoo, then merged into
|
||||
Apache project.
|
||||
- Used by most major companies: Amazon, Facebook, Google, IBM, last.fm,
|
||||
Yahoo
|
||||
- Custom implementation
|
Loading…
Reference in a new issue